Migrate from Solr to SAI for Accelerated Development and Performance: Part 2
data:image/s3,"s3://crabby-images/32b44/32b44d1c8f84cbf61117f4347c890d29c6e5fa03" alt="Migrate from Solr to SAI for Accelerated Development and Performance: Part 2"
In part 1 of this series, we discussed a notional application that uses Apache Solr™️ to index data so users could search for movies to watch. Solr is a powerful tool but it comes at a cost that we believe is too steep for a majority of workloads.
Storage-attached indexing (SAI) has been one of the most anticipated features in Apache Cassandra® 5.0. Query patterns are much more flexible and performant, there’s far less coding required, provides an easier path to adding application functionality, and much more.
In this post, we’ll showcase advanced SAI capabilities and show how to convert Solr schema fields to corresponding SAI index options, particularly with index analyzers, tokenizers and filters. Let's go!
A quick recap
For a quick reminder, we’re using a notional application that stores information about movies and uses Solr to provide users with the ability to search for movies.
As a reminder, the movie data including images used in this post were sourced from The Movie Database (TMDB), a community-built movie and TV database. Movies are uniquely identified by a movie ID in the TMDB, together with the corresponding IMDB ID.
The data in the movies_by_id
table is partitioned by the movie ID to enable us to retrieve movies by their title, the title
column is indexed by Solr with the following schema definition:
<types> ... <fieldtype class="org.apache.solr.schema.TextField" name="TextField"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StandardFilterFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldtype> </types> <fields> ... <field indexed="true" multiValued="false" name="title" type="TextField"/> ... </fields> <uniqueKey>movie_id</uniqueKey>
Refactoring Solr schema
If we create a regular SAI index on the title
column:
CREATE CUSTOM INDEX ON movies_by_id (title) USING 'StorageAttachedIndex';
we can query the table with a simple equality filter, which returns exact matches for the term "Extraction":
SELECT title, release_year, movie_id, imdb_id FROM movies_by_id WHERE title = 'Extraction'; title | release_year | movie_id | imdb_id ------------+--------------+----------+----------- Extraction | 2015 | 326425 | tt4382872 Extraction | 2013 | 218381 | tt2823574 Extraction | 2020 | 545609 | tt8936646
The query returned three matches for movies that share the same title:
As a matter of disclosure, there are more movies titled "Extraction" but we only retrieved roughly 40K out of the 1.02 million movies listed on TMDB so the data in the example tables contain a small subset of TMDB.
But a regular SAI index is not enough because the title
field has StandardTokenizer
and LowerCaseFilter
in the Solr schema:
<fieldtype class="org.apache.solr.schema.TextField" name="TextField"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StandardFilterFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldtype>
The StandardFilter
was deprecated in Lucene 7.x (LUCENE-8356) so we need to create the SAI index with the built-in STANDARD analyzer (splits the text into words and converts them into lowercase):
CREATE CUSTOM INDEX ON movies_by_id (title) USING 'StorageAttachedIndex' WITH OPTIONS = { 'index_analyzer': 'STANDARD'};
Now we get hits for movies with more than just the search term "Extraction" in the title:
SELECT title, release_year, movie_id, imdb_id FROM movies_by_id WHERE title:'Extraction'; title | release_year | movie_id | imdb_id ----------------------------------+--------------+----------+------------ Extraction | 2015 | 326425 | tt4382872 Extraction 2 | 2023 | 697843 | tt12263384 Extraction | 2013 | 218381 | tt2823574 Extraction | 2020 | 545609 | tt8936646 Predators: Moments of Extraction | 2010 | 704882 | tt1777600
Note the use of the SAI query (:
) operator for querying analyzed indexes instead of the equality (=
) operator in the previous example above.
Text analyzers
Because SAI is built on Apache Lucene™️, we can use the built-in Lucene 9.8.0 text analyzers to extract index terms from text just like we do in Solr. We've already used the standard
analyzer above and the other generic analyzers include simple
(uses letter tokenizer with lowercase filter), whitespace
, stop
, and lowercase
.
There are specific analyzers for over 30 languages that include Arabic, Bulgarian, Dutch, Hindi, Irish, and Persian. There are 14 available tokenizers (these split text into words or tokens) that include standard
, classic
, nGram
, and wikipedia
.
Supported CharFilters are cjk
, htmlstrip
, mapping
, persian
, and patternreplace
. There are over a hundred token filters with lowercase
, classic
, N-Grams, and stemmers among some of the most popular ones.
If we wanted to, we could add a stemming filter like PorterStemFilter
to normalise the terms into their base words by removing suffixes, so words like "jumps" and "jumping" are transformed into "jump":
CREATE CUSTOM INDEX ON movies_by_id (title) USING 'StorageAttachedIndex' WITH OPTIONS = { 'index_analyzer' : '{ "tokenizer" : {"name":"standard"}, "filters" : [ {"name":"lowercase"}, {"name":"porterstem"} ] }' };
A query for "extraction" (all lowercase) reduces the search term to its base form "extract" so we now get matches for not just "extraction" but also for "extract" and "extracted" in the results:
SELECT title, release_year, movie_id, imdb_id FROM movies_by_id WHERE title:'extraction'; title | release_year | movie_id | imdb_id -------------------------------------+--------------+----------+------------ Extracted | 2012 | 97605 | tt1757746 Extraction | 2015 | 326425 | tt4382872 Extraction 2 | 2023 | 697843 | tt12263384 Extract | 2009 | 12569 | tt1225822 Extract: Mike Judge's Secret Recipe | 2009 | 683019 | tt1585566 Extraction | 2013 | 218381 | tt2823574 Extraction | 2020 | 545609 | tt8936646 Predators: Moments of Extraction | 2010 | 704882 | tt1777600
Strictly speaking, we should replace StandardFilter
with ClassicFilter
so periods/dots in acronyms and possessives ('s) are stripped from the tokens. For example, "S.W.A.T." will be transformed to "SWAT" and "Jack's" to "Jack" with the addition of filter name:classic
in the index options:
CREATE CUSTOM INDEX ON movies_by_id (title) USING 'StorageAttachedIndex' WITH OPTIONS = { 'index_analyzer' : '{ "tokenizer" : {"name":"standard"}, "filters" : [ {"name":"classic"}, {"name":"lowercase"}, {"name":"porterstem"} ] }' };
Note that much like in the Solr schema, the order of the analyzers in the pipeline chain matters.
For the list of supported Lucene analyzers, see Using analyzers with CQL for Astra DB.
Working with CQL collections
SAI supports the use of the CONTAINS
and CONTAINS KEY
operators when querying collections.
Let's create an index on genres so we can search for action movies:
CREATE CUSTOM INDEX ON movies_by_id (genres) USING 'StorageAttachedIndex';
The genres
column is a CQL set
so we can filter column values using the CONTAINS
operator:
SELECT title, release_year, genres FROM movies_by_id WHERE genres CONTAINS 'Action' LIMIT 10; title | release_year | genres -------------------+--------------+-------------------------------------------------------- Narasimhudu | 2005 | {'Action'} Mad Dog Time | 1996 | {'Action', 'Comedy', 'Crime'} Pipe Nation | 2023 | {'Action'} Kaala Sona | 1975 | {'Action', 'Western'} Soldier's Revenge | 1986 | {'Action', 'Drama', 'War'} The Lost Tribe | 2009 | {'Action', 'Adventure', 'Drama', 'Horror', 'Thriller'} WWE No Mercy 2007 | 2007 | {'Action', 'Drama'} Mail Order Bride | 2003 | {'Action', 'Comedy', 'Drama', 'Romance', 'Thriller'} Parasyte: Part 2 | 2015 | {'Action', 'Drama', 'Horror', 'Science Fiction'} Spy Hunt | 1950 | {'Action', 'Crime', 'Drama', 'Mystery', 'Thriller'}
How about searching for movies by the actor's name? Yes, we can!
The cast column is a CQL map
where the key is the movie character's name and the value is the actor's name. If we want to filter by the actor's name, we need to index the map values (with VALUES()
):
CREATE CUSTOM INDEX ON movies_by_ID (VALUES(cast)) USING 'StorageAttachedIndex';
And we query the map values with the CONTAINS
operator:
SELECT title, release_year, movie_id, imdb_id FROM movies_by_id WHERE cast CONTAINS 'Chris Hemsworth' LIMIT 10; title | release_year | movie_id | imdb_id ---------------------------------------+--------------+----------+------------ Red Dawn | 2012 | 60747 | tt1234719 Vacation | 2015 | 296099 | tt1524930 Doctor Strange | 2016 | 284052 | tt1211837 Extraction | 2020 | 545609 | tt8936646 Bad Times at the El Royale | 2018 | 446021 | tt6628394 Snow White and the Huntsman | 2012 | 58595 | tt1735898 Avengers: Age of Ultron | 2015 | 99861 | tt2395427 Extraction 2 | 2023 | 697843 | tt12263384 Blackhat | 2015 | 201088 | tt2717822 Transformers One | 2024 | 698687 | tt8864596
Some of the films Chris starred in over the years
Let's index the map keys (with KEYS()
) so we can search for movies by character:
CREATE CUSTOM INDEX ON movies_by_ID (KEYS(cast)) USING 'StorageAttachedIndex';
We query the map keys using the CONTAINS KEY
operator:
SELECT title, release_year, movie_id, imdb_id FROM movies_by_id WHERE cast CONTAINS KEY 'Thor' LIMIT 10; title | release_year | movie_id | imdb_id --------------------------------------------+--------------+----------+------------ Rage and Honor II: Hostile Takeover | 1993 | 66854 | tt0107917 Avengers: Age of Ultron | 2015 | 99861 | tt2395427 George of the Jungle | 1997 | 10603 | tt0119190 The Monster and the Ape | 1945 | 251814 | tt0037927 Damsels in Distress | 2012 | 82533 | tt1667307 The Incredible Hulk Returns | 1988 | 26881 | tt0095368 Good Boys | 2019 | 521777 | tt6977338 Avengers: Endgame | 2019 | 299534 | tt4154796 Thor: The Dark World | 2013 | 76338 | tt1981115 Thor: Ragnarok | 2017 | 284053 | tt3501632
Multi-term search
SAI uses a query plan that provides the ability to filter on multiple indexed columns with the intersection (AND
) and union (OR
) boolean operators in a single query.
This enables us to replace multi-term Solr searches like:
SELECT title, release_year, movie_id, imdb_id FROM movies_by_id WHERE solr_query = 'title:extraction AND release_year:2020'; title | release_year | movie_id | imdb_id ---------------------------------------+--------------+----------+------------ Extraction | 2020 | 545609 | tt8936646
with a single SAI query that uses multiple column indexes:
SELECT title, release_year, movie_id, imdb_id FROM movies_by_id WHERE title:'extraction' AND release_year=2020; title | release_year | movie_id | imdb_id ---------------------------------------+--------------+----------+------------ Extraction | 2020 | 545609 | tt8936646
Or replace a Solr search for movies that are in both action and thriller genres like:
SELECT title, release_year, genres FROM movies_by_id WHERE solr_query = 'genres:Action AND Thriller';
with this SAI query:
SELECT title, release_year, genres FROM movies_by_id WHERE genres CONTAINS 'Action' AND genres CONTAINS 'Thriller' LIMIT 3; title | release_year | genres ------------------+--------------+-------------------------------------------------------- The Lost Tribe | 2009 | {'Action', 'Adventure', 'Drama', 'Horror', 'Thriller'} Mail Order Bride | 2003 | {'Action', 'Comedy', 'Drama', 'Romance', 'Thriller'} Spy Hunt | 1950 | {'Action', 'Crime', 'Drama', 'Mystery', 'Thriller'}
Here's a more complex query that searches for movies with "Thor" or "Avengers" in the title and stars Chris Hemsworth:
SELECT title, release_year, movie_id, imdb_id FROM movies_by_id WHERE (title:'Extraction' OR title:'Avengers') AND cast CONTAINS 'Chris Hemsworth'; title | release_year | movie_id | imdb_id -------------------------+--------------+----------+------------ Avengers: Age of Ultron | 2015 | 99861 | tt2395427 Extraction 2 | 2023 | 697843 | tt12263384 Avengers: Doomsday | 2026 | 1003596 | tt21357150 Avengers: Endgame | 2019 | 299534 | tt4154796 Extraction | 2020 | 545609 | tt8936646 The Avengers | 2012 | 24428 | tt0848228 Avengers: Infinity War | 2018 | 299536 | tt4154756
What's next
In part three of the series, we'll show you how vector search is more powerful than Solr's text search and how it can propel your apps into the stratosphere with generative AI.
In the meantime, check out the SAI Quickstart Guide plus SAI examples and try Astra DB for free—or book a demo with our data architects for a guided tour of SAI in action!