Migrate from Solr to SAI for Accelerated Development and Performance: Part 1

Storage-attached indexing has been one of the most anticipated features in Apache Cassandra® 5.0—and for good reason. Query patterns are much more flexible and performant, there’s far less coding, an easier path to adding application functionality—the list goes on.
DataStax has been working on SAI for several years; it’s been deployed in our Cassandra-as-a-service Astra DB for some time, and its reliability and performance has been getting high marks from end users.
DSE Search is a powerful search engine built on Apache Solr™ that provides full-text search capabilities. But it suffers from Apache Lucene™'s limitations, most notably the two billion document limit per shard index and write-amplification where a single table mutation incurs an additional write for each indexed column.
Working with hundreds of customers over the years, our experience indicates that most of the existing use cases do not require a full-fledged text search engine. Here, we'll show you how easy it is to replace your Solr queries with faster SAI lookups.
A familiar use case
We will be using a movie application to demonstrate the power of SAI. Movies are familiar to most people, and developers understand movies have a title, release date, plot description, cast and crew, and belong to certain genres. Let's look at a fairly recent movie that stars Chris Hemsworth:
Title | Extraction |
Release date | April 23, 2020 |
Runtime | 116 minutes |
Overview | A hardened gun-for-hire's latest mission becomes a soul-searching race to survive when he's sent into Bangladesh to rescue a drug lord's kidnapped son. |
Cast |
Chris Hemsworth as "Tyler Rake" Rudhraksh Jaiswal as "Ovi Mahajan" Randeep Hooda as "Saju" Priyanshu Painyuli as "Amir Asif" Golshifteh Farahani as "Nik Khan" Pankaj Tripathi as "Ovi Mahajan Sr." David Harbour as "Gaspar" |
Genres | Action, thriller |
Poster | ![]() |
The movie data including images used in this article was sourced from The Movie Database (TMDB), a community-built movie and TV database. Movies in the TMDB are uniquely identified by a movie ID and a corresponding IMDB ID. Our notional application retrieves details about a movie using the ID so we modeled the data accordingly:
CREATE TABLE movies_by_id ( movie_id text, cast map<text, text>, genres set<text>, imdb_id text, overview text, release_date date, release_year int, runtime int, title text PRIMARY KEY (movie_id) )
so the table can be queried with:
SELECT * FROM movies_by_id WHERE movie_id = '545609';
If the application needed to retrieve the list of actors for a movie beginning with the lead characters, the table schema would look something like this:
CREATE TABLE cast_by_movie ( movie_id text, cast_order int, actor_name text, actor_id text, character text, release_year int STATIC, title text STATIC, PRIMARY KEY (movie_id, cast_order) )
Or if the application needs to generate a list of movies by genre:
CREATE TABLE movies_by_genre ( genre text, movie_id text, imdb_id text, overview text, release_year int, title text PRIMARY KEY (genre, movie_id) )
Solr search
With DSE Search, we can query the data in the primary table movies_by_id
and filter on a non-primary key if it is indexed with Solr. For example, to retrieve the list of movies released in 2019 with:
SELECT title, release_date, overview FROM movies_by_id WHERE release_date = 2019;
we would need to index the release_date
column. The following Solr schema indexes columns of the movies_by_i
d so filters on those columns can be used in the WHERE
clause:
<?xml version="1.0" encoding="UTF-8" standalone="no"?> <schema name="autoSolrSchema" version="1.5"> <types> <fieldType class="org.apache.solr.schema.TrieIntField" name="TrieIntField"/> <fieldType class="org.apache.solr.schema.StrField" name="StrField"/> <fieldType class="com.datastax.bdp.search.solr.core.types.SimpleDateField" name="SimpleDateField"/> <fieldtype class="org.apache.solr.schema.TextField" name="TextField"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StandardFilterFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldtype> </types> <fields> <field docValues="true" indexed="true" multiValued="false" name="release_year" type="TrieIntField"/> <field indexed="true" multiValued="false" name="movie_id" type="StrField"/> <field docValues="true" indexed="true" multiValued="false" name="release_date" type="SimpleDateField"/> <field indexed="true" multiValued="false" name="title" type="TextField"/> <dynamicField indexed="true" multiValued="false" name="cast*" type="StrField"/> <field indexed="true" multiValued="false" name="imdb_id" type="StrField"/> <field indexed="true" multiValued="true" name="genres" type="StrField"/> <field docValues="true" indexed="true" multiValued="false" name="runtime" type="TrieIntField"/> <field indexed="true" multiValued="false" name="overview" type="TextField"/> </fields> <uniqueKey>movie_id</uniqueKey> </schema>
With the title
column indexed, the table can be queried with a filter on this column with a Solr search. For example:
cql> SELECT title, release_year, movie_id, imdb_id FROM movies_by_id WHERE solr_query = 'title:extraction'; title | release_year | movie_id | imdb_id --------------+--------------+----------+------------ Extraction | 2015 | 326425 | tt4382872 Extraction | 2013 | 218381 | tt2823574 Extraction | 2020 | 545609 | tt8936646
The same applies for other indexed columns like genres
:
cql> SELECT title, release_year, movie_id, imdb_id, genres FROM movies_by_id WHERE solr_query = 'genres:Action'; title | release_year | movie_id | imdb_id | genres -----------------------------+--------------+----------+------------+---------------------------------------------------------------- Getaway | 2013 | 146227 | tt2167202 | {'Action', 'Crime'} Midnight Limited | 1940 | 217530 | tt0032790 | {'Action', 'Adventure', 'Crime'} Mortal Kombat Legacy | 2011 | 1421494 | None | {'Action', 'Adventure', 'Crime', 'Fantasy', 'Science Fiction'} The Emerald Forest | 1985 | 11532 | tt0089087 | {'Action', 'Adventure', 'Drama'} White Fang | 1991 | 12227 | tt0103247 | {'Action', 'Adventure', 'Drama', 'Family'} LEGO Hero Factory: Breakout | 2012 | 1266370 | None | {'Action', 'Adventure', 'Animation', 'Science Fiction'} Defence of the Realm | 1986 | 29627 | tt0089009 | {'Action', 'Thriller'} Crash and Byrnes | 2000 | 206994 | tt0208074 | {'Action'} Exterritorial | null | 1233069 | tt30876483 | {'Action', 'Thriller'} Back in Action | 1994 | 30443 | tt0109195 | {'Action'}
A simpler way
We can easily replace the complicated and resource-intensive Solr infrastructure with SAI. For most workloads that perform an equality filter on an indexed column, we don't need the full-text search capability of Solr.
An application query that filters on a column's value like WHERE column = value
is the primary candidate for SAI. To filter on the title column of the movies_by_id
table, create a SAI index with:
CREATE CUSTOM INDEX ON movies_by_id (title) USING 'StorageAttachedIndex';
Monitor the status of SAI indexes by querying the system virtual tables:
cql> SELECT keyspace_name,table_name,index_name,is_queryable,is_building FROM system_views.indexes; keyspace_name | table_name | index_name | is_queryable | is_building ---------------+--------------+------------------------+--------------+------------- app | movies_by_id | movies_by_id_title_idx | True | False
From the example output above, the index has finished building and can be queried:
cql> SELECT title, release_year, movie_id, imdb_id FROM movies_by_id WHERE title = 'Extraction'; title | release_year | movie_id | imdb_id ------------+--------------+----------+----------- Extraction | 2015 | 326425 | tt4382872 Extraction | 2013 | 218381 | tt2823574 Extraction | 2020 | 545609 | tt8936646
We can do the same for the release_year
column:
CREATE CUSTOM INDEX ON movies_by_id (release_year) USING 'StorageAttachedIndex';
and query the table for movies released in 2013:
cql> SELECT title, release_year, movie_id, imdb_id FROM movies_by_id WHERE release_year = 2013 LIMIT 3; title | release_year | movie_id | imdb_id ---------------------------------+--------------+----------+----------- Green Street 3: Never Back Down | 2013 | 182873 | tt2628316 Something Good | 2013 | 236329 | tt2542502 Fright Night 2: New Blood | 2013 | 214597 | tt2486630
Range queries
The unique implementation of indexing numeric types as a balanced binary search tree with postings at both internal and leaf nodes enables extremely efficient retrieval of results.
Range queries using SAI indexes use syntax familiar to developers. For example, to get a list of movies released after 2024:
cql> > SELECT title, release_year, movie_id, imdb_id FROM movies_by_id WHERE release_year > 2024; title | release_year | movie_id | imdb_id ----------------+--------------+----------+------------ Yunan | 2025 | 1189738 | tt24022322 Inside | 2025 | 1225582 | tt29424168 Le Marsupilami | 2026 | 1145899 | tt33029380
Similarly, we can get a list of movies released in the second half of April 2020:
CREATE CUSTOM INDEX ON movies_by_id (release_date) USING 'StorageAttachedIndex';
cql> > SELECT title, release_date, movie_id, imdb_id FROM movies_by_id WHERE release_date >= '2020-04-15' AND release_date <= '2020-04-30'; title | release_date | movie_id | imdb_id -------------+--------------+----------+------------ Extraction | 2020-04-23 | 545609 | tt8936646 Gutted | 2020-04-17 | 1286694 | tt10768652 Bad Therapy | 2020-04-17 | 527382 | tt8488518
What's next
In the next installment of the series, we'll show you how to replace more complex Solr searches with SAI queries, including working with CQL collections.
In the meantime, check out the SAI Quickstart Guide plus SAI examples and try Astra DB for free—or book a demo with our data architects for a guided tour of SAI in action!