The Developer’s Data Modeling Cheat Guide

If you've developed applications using Apache Cassandra® before, you know the drill: query patterns first, data modeling second, and forget about ad hoc queries unless you want to rewrite half your schema. Cassandra's performance and scale were always awesome — but only if you played by its strict, read-optimized rules.

But with Cassandra 5.0, things have changed. A LOT.

The biggest shift? Storage-attached indexes (SAI). It's not just another secondary index — it’s a whole new way to interact with your data. SAI gives developers the flexibility to query multiple columns with filtering, range conditions, and even better performance, all without having to bend your entire data model around your queries.

You can stop doing mental gymnastics just to store and retrieve your data. It's Cassandra like you've never seen before. It brings a much more developer-friendly experience to the table, while still keeping the scalability and resilience we all love.

Let’s show you what this means and why it is such a game changer. Let's go!

Movies galore

The examples in this post are based on movies because it's a dataset that most developers are already familiar with, especially with streaming services taking over the world.

The dataset including images used in this article were sourced from The Movie Database (TMDB), a community-built movie and TV database. Movies in the TMDB are uniquely identified by a movie ID and the corresponding IMDb ID for cross-referencing with the Internet Movie Database (IMDb), a popular online database of movies and TV shows.

In the example from TMDB below is the Netflix movie Extraction released in 2020. You can see from the screenshot under the movie title that it has a release date of April 24, the genres listed as "Action" and "Thriller", plus a runtime of 1 hour 56 minutes.

It shows the overview which is typically a brief description of the movie plot, and the cast which lists Chris Hemsworth in the leading role as Tyler Rake.

Note that for the purposes of this post, I only loaded a small subset of the data, or roughly 40K records out of over a million movies in the TMDB. This is to limit the number of rows returned in the examples to make it easier to view.

Traditional modeling

Our notional movies application retrieves details of a movie based on the movie's ID like:

GET movie details FOR movie id X

In Cassandra, we would model the data by partitioning the table on the movie ID [PRIMARY KEY (movie_id)]. It would look something like:

CREATE TABLE movies_by_id (
    movie_id text,
    cast map<text, text>,
    genres set<text>,
    imdb_id text,
    overview text,
    release_date date,
    release_year int,
    runtime int,
    title text
    PRIMARY KEY (movie_id)
)

so the table can be queried with:

cql> SELECT * FROM movies_by_id WHERE movie_id = '545609';

But what if the application needed to retrieve the data in a different way? This is where it gets challenging for developers new to Cassandra. The idea that they can't just query a table with a filter on an ad hoc column is usually a surprise. Allow me to illustrate with these examples:

GET cast list beginning with lead actors FOR movie id X

GET list of movies FOR genre Y

Traditionally, we would have to design separate tables for these queries. In the case where the app needs a list of movie actors:

CREATE TABLE cast_by_movie (
    movie_id text,
    cast_order int,
    actor_name text,
    actor_id text,
    character text,
    release_year int STATIC,
    title text STATIC,
    PRIMARY KEY (movie_id, cast_order)
)

To get the cast list for the movie Extraction:

cql> SELECT actor_name, cast_order, character
    FROM cast_by_movie
    WHERE movie_id = '545609';

movie_id | cast_order | actor             | character
---------+------------+-------------------+------------
  545609 |          1 |   Chris Hemsworth | Tyler Rake
  545609 |          2 | Rudhraksh Jaiswal | Ovi Mahajan
  545609 |          3 |     Randeep Hooda | Saju
<snip>

To retrieve just the leading actor:

cql> SELECT actor_name, cast_order, character
    FROM cast_by_movie
    WHERE movie_id = '545609'
    AND cast_order = 1;

movie_id | cast_order | actor             | character
---------+------------+-------------------+------------
  545609 |          1 |   Chris Hemsworth | Tyler Rake

The schema for storing movies by genre would look something like:

CREATE TABLE movies_by_genre (
    genre text,
    movie_id text,
    imdb_id text,
    overview text,
    release_year int,
    title text
    PRIMARY KEY (genre, movie_id)
)

To retrieve the list of action movies:

cql> SELECT title, release_year, movie_id
    FROM movies_by_genre
    WHERE genre = 'Action';

The next challenge for developers is keeping the three denormalised tables synchronized. An update to one table needs to be duplicated to the other two tables. In Cassandra this is achieved through a CQL BATCH update. Here is the example code in Java:

insertMovie = SimpleStatement.newInstance("INSERT INTO movies_by_id (...) VALUES (...)");
insertCast = SimpleStatement.newInstance("INSERT INTO cast_by_movie (...) VALUES (...)");
insertGenre = SimpleStatement.newInstance("INSERT INTO movies_by_genre (...) VALUES (...)");

batch = BatchStatement.newInstance(
      insertMovie,
      insertCast,
      insertGenre);
session.execute(batch);

If you find the above daunting—whether you're new to Cassandra or not—let me tell you it doesn't have to be, because there's a simpler way.

Enter Cassandra 5.0

One of the most anticipated features is storage-attached indexing for a very good reason. SAI enables query patterns which are much more flexible and performant, requires less coding, and provides an accelerated path to extending application functionality.

DataStax has been working on SAI for several years; it’s been deployed in Astra DB for some time, and its reliability and performance has been getting high marks from end users.

SAI was donated to the Apache Software Foundation (ASF) back in 2020 as Cassandra Enhancement Proposal CEP-7. It was planned to be implemented in two stages with so-called version 1 features included in Cassandra 5.0. SAI version 2 features are slotted to be included in the next major release (planned) which means that some examples in this post will not work in version 5.0, particularly SAI queries using the OR (union) operator and range queries on some data types.

While we're on the subject, Astra DB has the most complete version of SAI, with the same feature-complete version deployed on both DataStax Enterprise (DSE) and Hyper-Converged Database (HCD) – all built on Apache Cassandra.

Query your way

We can query a table with a filter on any column if it is indexed by SAI. This reduces the complexity of having to design a data model for each application query and removes the need to keep multiple denormalised tables synchronised client-side so there is less code to write and maintain.

If for example we wanted to query by the movie title, we just need to index the title column:

cql> CREATE CUSTOM INDEX ON movies_by_id (title)
     USING 'StorageAttachedIndex';

To retrieve movies with the title "Extraction":

cql> SELECT title, release_year, movie_id, imdb_id
     FROM movies_by_id
     WHERE title = 'Extraction';


 title      | release_year | movie_id | imdb_id
------------+--------------+----------+-----------
 Extraction |         2015 |   326425 | tt4382872
 Extraction |         2013 |   218381 | tt2823574
 Extraction |         2020 |   545609 | tt8936646

From the example output above, there are three movies sharing the exact same title but released in different years. The movie from 2013 stars Danny Glover and Vinnie Jones, while the 2015 version stars Bruce Willis and Kellan Lutz:

Similarly, we can create an SAI index on the release_year column:

cql> CREATE CUSTOM INDEX ON movies_by_id (release_year)
     USING 'StorageAttachedIndex';

which would allow us to retrieve the list of movies released in 2013:

cql> SELECT title, release_year, movie_id, imdb_id
     FROM movies_by_id
     WHERE release_year = 2013
     LIMIT 3;


 title          | release_year | movie_id | imdb_id
----------------+--------------+----------+-----------
 Green Street 3 |         2013 |   182873 | tt2628316
 Something Good |         2013 |   236329 | tt2542502
 Fright Night 2 |         2013 |   214597 | tt2486630

Range queries

The unique implementation of indexing numeric types as a balanced binary search tree with postings at both internal and leaf nodes enables extremely efficient retrieval of results.

Range queries using SAI indexes use syntax familiar to developers. For example, to get a list of movies released after 2024:

cql> SELECT title, release_year, movie_id, imdb_id
     FROM movies_by_id
     WHERE release_year > 2024
     LIMIT 3;

 title          | release_year | movie_id | imdb_id
----------------+--------------+----------+------------
          Yunan |         2025 |  1189738 | tt24022322
         Inside |         2025 |  1225582 | tt29424168
 Le Marsupilami |         2026 |  1145899 | tt33029380

SAI also allows more complex range queries using intersection (AND) and union (OR) boolean operators in a single pass. For example, if we index the release date:

cql> CREATE CUSTOM INDEX ON movies_by_id (release_year)
     USING 'StorageAttachedIndex';

then we can get a list of movies released in the second half of April 2020:

cql> SELECT title, release_date, movie_id, imdb_id
     FROM movies_by_id
     WHERE release_date >= '2020-04-15'
     AND release_date <= '2020-04-30';

 title       | release_date | movie_id | imdb_id
-------------+--------------+----------+------------
  Extraction |   2020-04-23 |   545609 |  tt8936646
      Gutted |   2020-04-17 |  1286694 | tt10768652
 Bad Therapy |   2020-04-17 |   527382 |  tt8488518

Indexing on steroids

SAI is backed by Apache Lucene™️so we can level up the indexes using built-in analyzers from Lucene. For example, we can index the movie titles using the built-in STANDARD analyzer which splits text into "words" (tokens) and converts them into lowercase. Note that you will need to drop the index on a column if one already exists using the CQL DROP INDEX command before you can reindex the column.

cql> CREATE CUSTOM INDEX ON movies_by_id (title)
     USING 'StorageAttachedIndex'
     WITH OPTIONS = { 'index_analyzer': 'STANDARD'};

In a previous example, we retrieved a list of movies which were titled "Extraction" which is a filter for an exact match for the term "Extraction". With the STANDARD analyzer, we can get a list of movies whose title contains the term "extraction" (not just an exact match):

cql> SELECT title, release_year, movie_id
     FROM movies_by_id
     WHERE title:'extraction';

 title                            | release_year | movie_id
----------------------------------+--------------+----------
                       Extraction |         2015 |   326425
                     Extraction 2 |         2023 |   697843
                       Extraction |         2013 |   218381
                       Extraction |         2020 |   545609
 Predators: Moments of Extraction |         2010 |   704882

Note the use of the SAI operator (:) in the WHERE clause food querying "analyzed" indexes instead of the equality (=) operator.

Text analyzers

Lucene analyzers enable us to extract index terms from text similar to full-text search engines such as Apache Solr™️ and Elasticsearch. Analyzers enable term matching on strings and makes it easier to find relevant rows in a table.

The generic analyzers include standard (used in the example above), simple (uses letter tokenizer with lowercase filter), whitespace, stop, and lowercase.

There are specific analyzers for over 30 languages that include Arabic, Bulgarian, Dutch, Hindi, Irish, and Persian. There are 14 available tokenizers (these split text into words or tokens) which include standard, classic, nGram, and wikipedia.

Supported CharFilters are cjk, htmlstrip, mapping, persian, and patternreplace. There are over a hundred token filters with lowercase, classic, N-Grams, and stemmers among some of the most popular.

Stemmers normalise terms into base words by removing suffixes so words like "jumps" and "jumping" are transformed into "jump". We can recreate the index on the movie titles and add the Porter stemming filter to illustrate how it significantly extends SAI's capabilities:

cql> CREATE CUSTOM INDEX ON movies_by_id (title)
     USING 'StorageAttachedIndex'
     WITH OPTIONS = {
         'index_analyzer' : '{
             "tokenizer" : {"name":"standard"},
             "filters" : [
                 {"name":"lowercase"},
                 {"name":"porterstem"}
             ]
         }'
     };

A query for "extraction" (all lowercase) reduces the search term to its base form "extract":

cql> SELECT title, release_year
     FROM movies_by_id
     WHERE title:'extraction';

 title                               | release_year
-------------------------------------+--------------
                           Extracted |         2012
                          Extraction |         2015
                        Extraction 2 |         2023
                             Extract |         2009
 Extract: Mike Judge's Secret Recipe |         2009
                          Extraction |         2013
                          Extraction |         2020
    Predators: Moments of Extraction |         2010

and we now get matches for not just "extraction" but also for "extract" and "extracted" in the results:

Working with CQL collections

In traditional data modeling, previously we had to create a new table that is partitioned by genres so we can get a list of action movies, for example. But with SAI, we can simply index the genres column and as you'll see from the example syntax below there is nothing special about indexing a CQL set collection:

cql> CREATE CUSTOM INDEX ON movies_by_id (genres)
     USING 'StorageAttachedIndex';

Now that we have indexed it with SAI, we can query for action movies with a filter using the CONTAINS operator:

cql> SELECT title, release_year, genres
     FROM movies_by_id
     WHERE genres CONTAINS 'Action'
     LIMIT 5;

 title             | r_year | genres
-------------------+--------------+------------------------
       Narasimhudu |   2005 |                    {'Action'}
      Mad Dog Time |   1996 | {'Action', 'Comedy', 'Crime'}
       Pipe Nation |   2023 |                    {'Action'}
        Kaala Sona |   1975 |         {'Action', 'Western'}
 Soldier's Revenge |   1986 |    {'Action', 'Drama', 'War'}

How about searching for movies by the actor's name? Yes, we can!

The cast column is a CQL map where the key is the movie character's name and the value is the actor's name. If we want to filter by the actor's name, we need to index with VALUES() function:

cql> CREATE CUSTOM INDEX ON movies_by_id (VALUES(cast))
     USING 'StorageAttachedIndex';

and now we can get a list of movies that stars Chris Hemsworth with a filter using the CONTAINS operator:

cql> SELECT title, release_year
     FROM movies_by_id
     WHERE cast CONTAINS 'Chris Hemsworth'
     LIMIT 5;

 title                       | release_year
-----------------------------+--------------
                    Red Dawn |         2012
              Doctor Strange |         2016
 Snow White and the Huntsman |         2012
     Avengers: Age of Ultron |         2015
                    Blackhat |         2015

If we index the map keys (with KEYS()):

cql> CREATE CUSTOM INDEX ON movies_by_id (KEYS(cast))
     USING 'StorageAttachedIndex';

we can get a list of movies that feature the character "Thor" with a filter using the CONTAINS KEY operator:

cql> SELECT title, release_year
     FROM movies_by_id
     WHERE cast CONTAINS KEY 'Thor'
     LIMIT 5;

 title                       | release_year
-----------------------------+--------------
     Avengers: Age of Ultron |         2015
        George of the Jungle |         1997
         Damsels in Distress |         2012
        Thor: The Dark World |         2013
              Thor: Ragnarok |         2017

Multi-term search

SAI uses a query plan that provides the ability to filter on multiple indexed columns with the intersection (AND) and union (OR) boolean operators in a single query, for example:

cql> SELECT title, release_year, movie_id, imdb_id
     FROM movies_by_id
     WHERE title:'extraction'
     AND release_year = 2020;

 title      | release_year | movie_id | imdb_id
------------+--------------+----------+-----------
 Extraction |         2020 |   545609 | tt8936646

Similarly, we can query for movies which are in both the action and thriller genres:

cql> SELECT title, release_year, genres
     FROM movies_by_id
     WHERE genres CONTAINS 'Action'
     AND genres CONTAINS 'Thriller'
     LIMIT 3;

 title            | genres
------------------+----------------------------------
   The Lost Tribe | {'Action', 'Horror', 'Thriller'}
 Mail Order Bride | {'Action', 'Comedy', 'Thriller'}
         Spy Hunt | {'Action', 'Crime', 'Thriller'}

Finally, here's a more complex query with nested brackets which searches for movies with "extraction" and "avengers" in the title and stars Chris Hemsworth:

cql> SELECT title, release_year, movie_id
     FROM movies_by_id
     WHERE (title:'Extraction' OR title:'Avengers')
     AND cast CONTAINS 'Chris Hemsworth';

 title                   | release_year | movie_id
-------------------------+--------------+----------
 Avengers: Age of Ultron |         2015 |    99861
            Extraction 2 |         2023 |   697843
      Avengers: Doomsday |         2026 |  1003596
       Avengers: Endgame |         2019 |   299534
              Extraction |         2020 |   545609
            The Avengers |         2012 |    24428
  Avengers: Infinity War |         2018 |   299536

What's next?

In an upcoming post, I'll show you how SAI can level up your applications with Generative AI so you can "talk" to your data using natural language. You'll be able to retrieve data from Cassandra with a query like "superhero movies with strong female leads!"

In the meantime, check out the SAI Quickstart Guide plus SAI examples and try Astra DB for free—or book a demo with our data architects for a guided tour of SAI in action!