Migrate from Solr to SAI for Accelerated Development and Performance: Part 1

Storage-attached indexing has been one of the most anticipated features in Apache Cassandra® 5.0—and for good reason. Query patterns are much more flexible and performant, there’s far less coding, an easier path to adding application functionality—the list goes on.

DataStax has been working on SAI for several years; it’s been deployed in our Cassandra-as-a-service Astra DB for some time, and its reliability and performance has been getting high marks from end users.

DSE Search is a powerful search engine built on Apache Solr™ that provides full-text search capabilities. But it suffers from Apache Lucene™'s limitations, most notably the two billion document limit per shard index and write-amplification where a single table mutation incurs an additional write for each indexed column.

Working with hundreds of customers over the years, our experience indicates that most of the existing use cases do not require a full-fledged text search engine. Here, we'll show you how easy it is to replace your Solr queries with faster SAI lookups.

A familiar use case

We will be using a movie application to demonstrate the power of SAI. Movies are familiar to most people, and developers understand movies have a title, release date, plot description, cast and crew, and belong to certain genres. Let's look at a fairly recent movie that stars Chris Hemsworth:

Title	Extraction
Release date	April 23, 2020
Runtime	116 minutes
Overview	A hardened gun-for-hire's latest mission becomes a soul-searching race to survive when he's sent into Bangladesh to rescue a drug lord's kidnapped son.
Cast	Chris Hemsworth as "Tyler Rake" Rudhraksh Jaiswal as "Ovi Mahajan" Randeep Hooda as "Saju" Priyanshu Painyuli as "Amir Asif" Golshifteh Farahani as "Nik Khan" Pankaj Tripathi as "Ovi Mahajan Sr." David Harbour as "Gaspar"
Genres	Action, thriller
Poster

The movie data including images used in this article was sourced from The Movie Database (TMDB), a community-built movie and TV database. Movies in the TMDB are uniquely identified by a movie ID and a corresponding IMDB ID. Our notional application retrieves details about a movie using the ID so we modeled the data accordingly:

CREATE TABLE movies_by_id (
    movie_id text,
    cast map<text, text>,
    genres set<text>,
    imdb_id text,
    overview text,
    release_date date,
    release_year int,
    runtime int,
    title text
    PRIMARY KEY (movie_id)
)

so the table can be queried with:

SELECT * FROM movies_by_id WHERE movie_id = '545609';

If the application needed to retrieve the list of actors for a movie beginning with the lead characters, the table schema would look something like this:

CREATE TABLE cast_by_movie (
    movie_id text,
    cast_order int,
    actor_name text,
    actor_id text,
    character text,
    release_year int STATIC,
    title text STATIC,
    PRIMARY KEY (movie_id, cast_order)
)

Or if the application needs to generate a list of movies by genre:

CREATE TABLE movies_by_genre (
    genre text,
    movie_id text,
    imdb_id text,
    overview text,
    release_year int,
    title text
    PRIMARY KEY (genre, movie_id)
)

Solr search

With DSE Search, we can query the data in the primary table movies_by_id and filter on a non-primary key if it is indexed with Solr. For example, to retrieve the list of movies released in 2019 with:

SELECT title, release_date, overview FROM movies_by_id WHERE release_date = 2019;

we would need to index the release_date column. The following Solr schema indexes columns of the movies_by_id so filters on those columns can be used in the WHERE clause:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="autoSolrSchema" version="1.5">
  <types>
    <fieldType class="org.apache.solr.schema.TrieIntField" name="TrieIntField"/>
    <fieldType class="org.apache.solr.schema.StrField" name="StrField"/>
    <fieldType class="com.datastax.bdp.search.solr.core.types.SimpleDateField" name="SimpleDateField"/>
    <fieldtype class="org.apache.solr.schema.TextField" name="TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>
  </types>
  <fields>
    <field docValues="true" indexed="true" multiValued="false" name="release_year" type="TrieIntField"/>
    <field indexed="true" multiValued="false" name="movie_id" type="StrField"/>
    <field docValues="true" indexed="true" multiValued="false" name="release_date" type="SimpleDateField"/>
    <field indexed="true" multiValued="false" name="title" type="TextField"/>
    <dynamicField indexed="true" multiValued="false" name="cast*" type="StrField"/>
    <field indexed="true" multiValued="false" name="imdb_id" type="StrField"/>
    <field indexed="true" multiValued="true" name="genres" type="StrField"/>
    <field docValues="true" indexed="true" multiValued="false" name="runtime" type="TrieIntField"/>
    <field indexed="true" multiValued="false" name="overview" type="TextField"/>
  </fields>
  <uniqueKey>movie_id</uniqueKey>
</schema>

With the title column indexed, the table can be queried with a filter on this column with a Solr search. For example:

cql> SELECT title, release_year, movie_id, imdb_id FROM movies_by_id WHERE solr_query = 'title:extraction';


 title        | release_year | movie_id | imdb_id
--------------+--------------+----------+------------
   Extraction |         2015 |   326425 |  tt4382872
   Extraction |         2013 |   218381 |  tt2823574
   Extraction |         2020 |   545609 |  tt8936646

The same applies for other indexed columns like genres:

cql> SELECT title, release_year, movie_id, imdb_id, genres FROM movies_by_id WHERE solr_query = 'genres:Action';

 title                       | release_year | movie_id | imdb_id    | genres
-----------------------------+--------------+----------+------------+----------------------------------------------------------------
                     Getaway |         2013 |   146227 |  tt2167202 |                                            {'Action', 'Crime'}
            Midnight Limited |         1940 |   217530 |  tt0032790 |                               {'Action', 'Adventure', 'Crime'}
        Mortal Kombat Legacy |         2011 |  1421494 |       None | {'Action', 'Adventure', 'Crime', 'Fantasy', 'Science Fiction'}
          The Emerald Forest |         1985 |    11532 |  tt0089087 |                               {'Action', 'Adventure', 'Drama'}
                  White Fang |         1991 |    12227 |  tt0103247 |                     {'Action', 'Adventure', 'Drama', 'Family'}
 LEGO Hero Factory: Breakout |         2012 |  1266370 |       None |        {'Action', 'Adventure', 'Animation', 'Science Fiction'}
        Defence of the Realm |         1986 |    29627 |  tt0089009 |                                         {'Action', 'Thriller'}
            Crash and Byrnes |         2000 |   206994 |  tt0208074 |                                                     {'Action'}
               Exterritorial |         null |  1233069 | tt30876483 |                                         {'Action', 'Thriller'}
              Back in Action |         1994 |    30443 |  tt0109195 |                                                     {'Action'}

A simpler way

We can easily replace the complicated and resource-intensive Solr infrastructure with SAI. For most workloads that perform an equality filter on an indexed column, we don't need the full-text search capability of Solr.

An application query that filters on a column's value like WHERE column = value is the primary candidate for SAI. To filter on the title column of the movies_by_id table, create a SAI index with:

CREATE CUSTOM INDEX ON movies_by_id (title) USING 'StorageAttachedIndex';

Monitor the status of SAI indexes by querying the system virtual tables:

cql> SELECT keyspace_name,table_name,index_name,is_queryable,is_building FROM system_views.indexes;

 keyspace_name | table_name   | index_name             | is_queryable | is_building
---------------+--------------+------------------------+--------------+-------------
           app | movies_by_id | movies_by_id_title_idx |         True |       False

From the example output above, the index has finished building and can be queried:

cql> SELECT title, release_year, movie_id, imdb_id FROM movies_by_id WHERE title = 'Extraction';

title      | release_year | movie_id | imdb_id
------------+--------------+----------+-----------
 Extraction |         2015 |   326425 | tt4382872
 Extraction |         2013 |   218381 | tt2823574
 Extraction |         2020 |   545609 | tt8936646

We can do the same for the release_year column:

CREATE CUSTOM INDEX ON movies_by_id (release_year) USING 'StorageAttachedIndex';

and query the table for movies released in 2013:

cql> SELECT title, release_year, movie_id, imdb_id FROM movies_by_id WHERE release_year = 2013 LIMIT 3;

 title                           | release_year | movie_id | imdb_id
---------------------------------+--------------+----------+-----------
 Green Street 3: Never Back Down |         2013 |   182873 | tt2628316
                  Something Good |         2013 |   236329 | tt2542502
       Fright Night 2: New Blood |         2013 |   214597 | tt2486630

Range queries

The unique implementation of indexing numeric types as a balanced binary search tree with postings at both internal and leaf nodes enables extremely efficient retrieval of results.

Range queries using SAI indexes use syntax familiar to developers. For example, to get a list of movies released after 2024:

cql> > SELECT title, release_year, movie_id, imdb_id FROM movies_by_id WHERE release_year > 2024;

 title          | release_year | movie_id | imdb_id
----------------+--------------+----------+------------
          Yunan |         2025 |  1189738 | tt24022322
         Inside |         2025 |  1225582 | tt29424168
 Le Marsupilami |         2026 |  1145899 | tt33029380

Similarly, we can get a list of movies released in the second half of April 2020:

CREATE CUSTOM INDEX ON movies_by_id (release_date) USING 'StorageAttachedIndex';

cql> > SELECT title, release_date, movie_id, imdb_id FROM movies_by_id WHERE release_date >= '2020-04-15' AND release_date <= '2020-04-30';

 title       | release_date | movie_id | imdb_id
-------------+--------------+----------+------------
  Extraction |   2020-04-23 |   545609 |  tt8936646
      Gutted |   2020-04-17 |  1286694 | tt10768652
 Bad Therapy |   2020-04-17 |   527382 |  tt8488518

What's next

In the next installment of the series, we'll show you how to replace more complex Solr searches with SAI queries, including working with CQL collections.

In the meantime, check out the SAI Quickstart Guide plus SAI examples and try Astra DB for free—or book a demo with our data architects for a guided tour of SAI in action!