Migrate from Solr to SAI for Accelerated Development and Performance: Part 2

In part 1 of this series, we discussed a notional application that uses Apache Solr™️ to index data so users could search for movies to watch. Solr is a powerful tool but it comes at a cost that we believe is too steep for a majority of workloads.

Storage-attached indexing (SAI) has been one of the most anticipated features in Apache Cassandra® 5.0. Query patterns are much more flexible and performant, there’s far less coding required, provides an easier path to adding application functionality, and much more.

In this post, we’ll showcase advanced SAI capabilities and show how to convert Solr schema fields to corresponding SAI index options, particularly with index analyzers, tokenizers and filters. Let's go!

A quick recap

For a quick reminder, we’re using a notional application that stores information about movies and uses Solr to provide users with the ability to search for movies.

As a reminder, the movie data including images used in this post were sourced from The Movie Database (TMDB), a community-built movie and TV database. Movies are uniquely identified by a movie ID in the TMDB, together with the corresponding IMDB ID.

The data in the movies_by_id table is partitioned by the movie ID to enable us to retrieve movies by their title, the title column is indexed by Solr with the following schema definition:

<types>
    ...
    <fieldtype class="org.apache.solr.schema.TextField" name="TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>
  </types>
  <fields>
    ...
    <field indexed="true" multiValued="false" name="title" type="TextField"/>
    ...
  </fields>
  <uniqueKey>movie_id</uniqueKey>

Refactoring Solr schema

If we create a regular SAI index on the title column:

CREATE CUSTOM INDEX ON movies_by_id (title)
USING 'StorageAttachedIndex';

we can query the table with a simple equality filter, which returns exact matches for the term "Extraction":

SELECT title, release_year, movie_id, imdb_id
FROM movies_by_id
WHERE title = 'Extraction';

 title      | release_year | movie_id | imdb_id
------------+--------------+----------+-----------
 Extraction |         2015 |   326425 | tt4382872
 Extraction |         2013 |   218381 | tt2823574
 Extraction |         2020 |   545609 | tt8936646

The query returned three matches for movies that share the same title:

Image of three different movies sharing the same "Extraction" title.

As a matter of disclosure, there are more movies titled "Extraction" but we only retrieved roughly 40K out of the 1.02 million movies listed on TMDB so the data in the example tables contain a small subset of TMDB.

But a regular SAI index is not enough because the title field has StandardTokenizer and LowerCaseFilter in the Solr schema:

<fieldtype class="org.apache.solr.schema.TextField" name="TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
   </fieldtype>

The StandardFilter was deprecated in Lucene 7.x (LUCENE-8356) so we need to create the SAI index with the built-in STANDARD analyzer (splits the text into words and converts them into lowercase):

CREATE CUSTOM INDEX ON movies_by_id (title)
USING 'StorageAttachedIndex' 
WITH OPTIONS = { 'index_analyzer': 'STANDARD'};

Now we get hits for movies with more than just the search term "Extraction" in the title:

SELECT title, release_year, movie_id, imdb_id 
FROM movies_by_id 
WHERE title:'Extraction';

 title                            | release_year | movie_id | imdb_id
----------------------------------+--------------+----------+------------
                       Extraction |         2015 |   326425 |  tt4382872
                     Extraction 2 |         2023 |   697843 | tt12263384
                       Extraction |         2013 |   218381 |  tt2823574
                       Extraction |         2020 |   545609 |  tt8936646
 Predators: Moments of Extraction |         2010 |   704882 |  tt1777600

Note the use of the SAI query (:) operator for querying analyzed indexes instead of the equality (=) operator in the previous example above.

Text analyzers

Because SAI is built on Apache Lucene™️, we can use the built-in Lucene 9.8.0 text analyzers to extract index terms from text just like we do in Solr. We've already used the standard analyzer above and the other generic analyzers include simple (uses letter tokenizer with lowercase filter), whitespace, stop, and lowercase.

There are specific analyzers for over 30 languages that include Arabic, Bulgarian, Dutch, Hindi, Irish, and Persian. There are 14 available tokenizers (these split text into words or tokens) that include standard, classic, nGram, and wikipedia.

Supported CharFilters are cjk, htmlstrip, mapping, persian, and patternreplace. There are over a hundred token filters with lowercase, classic, N-Grams, and stemmers among some of the most popular ones.

If we wanted to, we could add a stemming filter like PorterStemFilter to normalise the terms into their base words by removing suffixes, so words like "jumps" and "jumping" are transformed into "jump":

CREATE CUSTOM INDEX ON movies_by_id (title) 
USING 'StorageAttachedIndex'
WITH OPTIONS = {
    'index_analyzer' : '{
        "tokenizer" : {"name":"standard"},
        "filters" : [
            {"name":"lowercase"},
            {"name":"porterstem"}
        ]
    }'
};

A query for "extraction" (all lowercase) reduces the search term to its base form "extract" so we now get matches for not just "extraction" but also for "extract" and "extracted" in the results:

SELECT title, release_year, movie_id, imdb_id 
FROM movies_by_id 
WHERE title:'extraction';

 title                               | release_year | movie_id | imdb_id
-------------------------------------+--------------+----------+------------
                           Extracted |         2012 |    97605 |  tt1757746
                          Extraction |         2015 |   326425 |  tt4382872
                        Extraction 2 |         2023 |   697843 | tt12263384
                             Extract |         2009 |    12569 |  tt1225822
 Extract: Mike Judge's Secret Recipe |         2009 |   683019 |  tt1585566
                          Extraction |         2013 |   218381 |  tt2823574
                          Extraction |         2020 |   545609 |  tt8936646
    Predators: Moments of Extraction |         2010 |   704882 |  tt1777600

Image of more hits from adding index analyzers.

Strictly speaking, we should replace StandardFilter with ClassicFilter so periods/dots in acronyms and possessives ('s) are stripped from the tokens. For example, "S.W.A.T." will be transformed to "SWAT" and "Jack's" to "Jack" with the addition of filter name:classic in the index options:

CREATE CUSTOM INDEX ON movies_by_id (title) 
USING 'StorageAttachedIndex'
WITH OPTIONS = {
    'index_analyzer' : '{
        "tokenizer" : {"name":"standard"},
        "filters" : [
            {"name":"classic"},
            {"name":"lowercase"},
            {"name":"porterstem"}
        ]
    }'
};

Note that much like in the Solr schema, the order of the analyzers in the pipeline chain matters.

For the list of supported Lucene analyzers, see Using analyzers with CQL for Astra DB.

Working with CQL collections

SAI supports the use of the CONTAINS and CONTAINS KEY operators when querying collections.

Let's create an index on genres so we can search for action movies:

CREATE CUSTOM INDEX ON movies_by_id (genres)
USING 'StorageAttachedIndex';

The genres column is a CQL set so we can filter column values using the CONTAINS operator:

SELECT title, release_year, genres 
FROM movies_by_id 
WHERE genres CONTAINS 'Action' 
LIMIT 10;

 title             | release_year | genres
-------------------+--------------+--------------------------------------------------------
       Narasimhudu |         2005 |                                             {'Action'}
      Mad Dog Time |         1996 |                          {'Action', 'Comedy', 'Crime'}
       Pipe Nation |         2023 |                                             {'Action'}
        Kaala Sona |         1975 |                                  {'Action', 'Western'}
 Soldier's Revenge |         1986 |                             {'Action', 'Drama', 'War'}
    The Lost Tribe |         2009 | {'Action', 'Adventure', 'Drama', 'Horror', 'Thriller'}
 WWE No Mercy 2007 |         2007 |                                    {'Action', 'Drama'}
  Mail Order Bride |         2003 |   {'Action', 'Comedy', 'Drama', 'Romance', 'Thriller'}
  Parasyte: Part 2 |         2015 |       {'Action', 'Drama', 'Horror', 'Science Fiction'}
          Spy Hunt |         1950 |    {'Action', 'Crime', 'Drama', 'Mystery', 'Thriller'}

How about searching for movies by the actor's name? Yes, we can!

The cast column is a CQL map where the key is the movie character's name and the value is the actor's name. If we want to filter by the actor's name, we need to index the map values (with VALUES()):

CREATE CUSTOM INDEX ON movies_by_ID (VALUES(cast)) 
USING 'StorageAttachedIndex';

And we query the map values with the CONTAINS operator:

SELECT title, release_year, movie_id, imdb_id 
FROM movies_by_id 
WHERE cast CONTAINS 'Chris Hemsworth' 
LIMIT 10;

title                                 | release_year | movie_id | imdb_id
---------------------------------------+--------------+----------+------------
                              Red Dawn |         2012 |    60747 |  tt1234719
                              Vacation |         2015 |   296099 |  tt1524930
                        Doctor Strange |         2016 |   284052 |  tt1211837
                            Extraction |         2020 |   545609 |  tt8936646
            Bad Times at the El Royale |         2018 |   446021 |  tt6628394
           Snow White and the Huntsman |         2012 |    58595 |  tt1735898
               Avengers: Age of Ultron |         2015 |    99861 |  tt2395427
                          Extraction 2 |         2023 |   697843 | tt12263384
                              Blackhat |         2015 |   201088 |  tt2717822
                      Transformers One |         2024 |   698687 |  tt8864596

An image some of the films Chris starred in over the years

Some of the films Chris starred in over the years

Let's index the map keys (with KEYS()) so we can search for movies by character:

CREATE CUSTOM INDEX ON movies_by_ID (KEYS(cast)) 
USING 'StorageAttachedIndex';

We query the map keys using the CONTAINS KEY operator:

SELECT title, release_year, movie_id, imdb_id 
FROM movies_by_id 
WHERE cast CONTAINS KEY 'Thor' 
LIMIT 10;

 title                                      | release_year | movie_id | imdb_id
--------------------------------------------+--------------+----------+------------
        Rage and Honor II: Hostile Takeover |         1993 |    66854 |  tt0107917
                    Avengers: Age of Ultron |         2015 |    99861 |  tt2395427
                       George of the Jungle |         1997 |    10603 |  tt0119190
                    The Monster and the Ape |         1945 |   251814 |  tt0037927
                        Damsels in Distress |         2012 |    82533 |  tt1667307
                The Incredible Hulk Returns |         1988 |    26881 |  tt0095368
                                  Good Boys |         2019 |   521777 |  tt6977338
                          Avengers: Endgame |         2019 |   299534 |  tt4154796
                       Thor: The Dark World |         2013 |    76338 |  tt1981115
                             Thor: Ragnarok |         2017 |   284053 |  tt3501632

Multi-term search

SAI uses a query plan that provides the ability to filter on multiple indexed columns with the intersection (AND) and union (OR) boolean operators in a single query.

This enables us to replace multi-term Solr searches like:

SELECT title, release_year, movie_id, imdb_id
FROM movies_by_id 
WHERE solr_query = 'title:extraction AND release_year:2020';

 title                                 | release_year | movie_id | imdb_id
---------------------------------------+--------------+----------+------------
                            Extraction |         2020 |   545609 |  tt8936646

with a single SAI query that uses multiple column indexes:

SELECT title, release_year, movie_id, imdb_id
FROM movies_by_id 
WHERE title:'extraction' AND release_year=2020;
title                                 | release_year | movie_id | imdb_id
---------------------------------------+--------------+----------+------------
                            Extraction |         2020 |   545609 |  tt8936646

Or replace a Solr search for movies that are in both action and thriller genres like:

SELECT title, release_year, genres 
FROM movies_by_id 
WHERE solr_query = 'genres:Action AND Thriller';

with this SAI query:

SELECT title, release_year, genres 
FROM movies_by_id
WHERE genres CONTAINS 'Action' AND genres CONTAINS 'Thriller' LIMIT 3;

 title            | release_year | genres
------------------+--------------+--------------------------------------------------------
   The Lost Tribe |         2009 | {'Action', 'Adventure', 'Drama', 'Horror', 'Thriller'}
 Mail Order Bride |         2003 |   {'Action', 'Comedy', 'Drama', 'Romance', 'Thriller'}
         Spy Hunt |         1950 |    {'Action', 'Crime', 'Drama', 'Mystery', 'Thriller'}

Here's a more complex query that searches for movies with "Thor" or "Avengers" in the title and stars Chris Hemsworth:

SELECT title, release_year, movie_id, imdb_id 
FROM movies_by_id 
WHERE (title:'Extraction' OR title:'Avengers') 
AND cast CONTAINS 'Chris Hemsworth';

 title                   | release_year | movie_id | imdb_id
-------------------------+--------------+----------+------------
 Avengers: Age of Ultron |         2015 |    99861 |  tt2395427
            Extraction 2 |         2023 |   697843 | tt12263384
      Avengers: Doomsday |         2026 |  1003596 | tt21357150
       Avengers: Endgame |         2019 |   299534 |  tt4154796
              Extraction |         2020 |   545609 |  tt8936646
            The Avengers |         2012 |    24428 |  tt0848228
  Avengers: Infinity War |         2018 |   299536 |  tt4154756

What's next

In part three of the series, we'll show you how vector search is more powerful than Solr's text search and how it can propel your apps into the stratosphere with generative AI.

In the meantime, check out the SAI Quickstart Guide plus SAI examples and try Astra DB for free—or book a demo with our data architects for a guided tour of SAI in action!