
Simplify Your Stack: SAI Leaves Solr in the Dust

Updated: April 04, 2025

Storage-attached indexing (SAI) has been one of the most anticipated features in Apache Cassandra® 5.0—and for good reason. Query patterns are far more flexible and performant, there's much less code to write, and adding application functionality is easier—the list goes on.

DataStax has been working on SAI for several years; it’s been deployed in our Cassandra-as-a-service Astra DB for some time, and its reliability and performance have been getting high marks from end users.

DSE Search is a powerful search engine built on Apache Solr™ that provides full-text search capabilities. Working with hundreds of customers over the years, our experience indicates that most of the existing use cases do not require a full-fledged text search engine, which means enterprises don't need the added complexity that comes with running Solr alongside Cassandra.

It's time to simplify your stack, and here are the reasons why. Let's go!

SAI outperforms all other indexers

SAI is hands down better than other indexing methods available for CQL.

It provides more functionality than Cassandra's legacy secondary indexing (2i) at a fraction of the storage footprint, so it costs less to own and operate. Similarly, SAI uses less disk space than the equivalent Solr implementation.

SAI has significantly better throughput than 2i and Solr, and blows them out of the water in latency:

[Figure: latency and throughput comparison for SAI, 2i, and Solr]

SAI indexes data in memory, in real time

When a node receives a write request, two things take place:

  1. The data is stored in an in-memory structure called a memtable.

  2. The mutation is appended to the end of the commitlog (a write-ahead log, or WAL) for durability.

SAI indexes the data in the memtables synchronously as each write comes in. When the memtables are eventually flushed to disk, the corresponding indexes are written to disk along with the SSTables (see SAI write path).

All SAI index updates associated with the memtable are grouped together and written to disk in one IO operation, leading to significantly higher throughput and very low latency.

On the other hand, Solr performs a read-before-write on the data to be indexed, making it a lot slower than SAI. Each index update also incurs a disk IO, so if there are 10 indexed columns then there are 10 additional IOs for each table write, making Solr throughput significantly lower than SAI.

SAI indexes columns at the table level

SAI indexes can be created on any column in a table. The only exception is a single-column partition key, because such a column is already the table's primary access path, so indexing it adds nothing.
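
For example, here's a minimal sketch; movies_by_id is a hypothetical schema used purely for illustration, and the same table appears in the examples later in this post:

CREATE TABLE IF NOT EXISTS movies_by_id (
    movie_id uuid PRIMARY KEY,
    title text,
    genre text,
    year int
);

CREATE CUSTOM INDEX IF NOT EXISTS movies_genre_idx
ON movies_by_id (genre) USING 'StorageAttachedIndex';

-- The indexed column can now be queried directly:
SELECT title FROM movies_by_id WHERE genre = 'action';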

Since the index is defined on the whole table, an SAI index can be queried from any node in any data center. Unlike SAI, Solr is enabled on a per-DC basis to isolate Solr queries from regular OLTP workloads.

Solr requires significantly more CPU, memory, and IO than regular Cassandra workloads, so it needs to be deployed on separate nodes with more resources and dedicated disks. As such, Solr queries can only be executed against a DC where it is enabled.

SAI supports Lucene analyzers and tokenizers

SAI is built on Apache Lucene™, so it works with the same built-in analyzers as Solr to extract index terms from text, including standard, simple (a letter tokenizer with a lowercase filter), whitespace, stop, and lowercase.

There are over 30 supported language-specific analyzers, including Arabic, Bulgarian, Dutch, Hindi, Irish, and Persian. There are 14 supported tokenizers (which split text into words or tokens), including standard, classic, nGram, and wikipedia.

Supported CharFilters are cjk, htmlstrip, mapping, persian, and patternreplace. There are over a hundred token filters, with lowercase, classic, nGram, and the stemmers among the most popular.
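
For simple cases, one of the built-in analyzers listed above can be named directly in the index options. Here's a minimal sketch, assuming a hypothetical reviews table with a text column named comment:

CREATE CUSTOM INDEX IF NOT EXISTS reviews_comment_idx
ON reviews (comment) USING 'StorageAttachedIndex'
WITH OPTIONS = { 'index_analyzer': 'standard' };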

A relatively complex Solr schema definition for a text field, like this example:

<fieldtype class="org.apache.solr.schema.TextField" name="TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

can easily be replaced with an SAI index:

CREATE CUSTOM INDEX ON movies_by_id (title)
USING 'StorageAttachedIndex'
WITH OPTIONS = {
    'index_analyzer' : '{
        "tokenizer" : {"name":"standard"},
        "filters" : [
            {"name":"classic"},
            {"name":"lowercase"}
        ]
    }'
};

The StandardFilter was deprecated in Lucene 7.x (LUCENE-8356), so we replaced it with the ClassicFilter to achieve the same outcome: periods/dots in acronyms and possessives ('s) are stripped from the tokens.

For more examples, see Migrate from Solr to SAI for Accelerated Development and Performance: Part 1 and Part 2.

SAI shares common index data

On tables where multiple columns are indexed, SAI's design lets all of the column indexes share the per-SSTable index files (the common index data) for that table.

This feature enables users to create many indexes on a table without running into scalability issues. In contrast, each indexed column in Solr incurs a write penalty for each mutation (Cassandra write request payload) leading to heavy write IO on the disk where the Solr data directory is mounted.
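
To make that concrete, here's a sketch reusing the hypothetical movies_by_id schema from earlier; a second index on the same table shares the common per-SSTable components with the first:

CREATE CUSTOM INDEX IF NOT EXISTS movies_year_idx
ON movies_by_id (year) USING 'StorageAttachedIndex';

-- movies_genre_idx and movies_year_idx share the common per-SSTable
-- index files, and a single query can use both:
SELECT title FROM movies_by_id WHERE genre = 'action' AND year = 2019;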

SAI's design provides disk space savings of 75-90% or more for deployments with 5-10 indexed columns compared to other indexing implementations like legacy secondary indexes and SASI. The space savings are even greater compared to Solr indexes with complex nested analyzers and tokenizers.

[Figure: disk space savings comparison across indexing implementations]

SAI indexes are only built once

As mentioned above, SAI indexes the data in memory and when the memtables are flushed to disk, the corresponding indexes are also written to disk together with the SSTables.

Because SSTables are immutable (once written to disk, they never change and never get updated), the indexes attached to the SSTables are also immutable. This is a game-changing side effect of SAI: the indexes persist for the life of the SSTables they are attached to.

When you back up the data, the SAI indexes are archived together with the SSTables. When you restore the data, the indexes are immediately available to query, without having to be rebuilt.

SAI indexes are compatible with the Zero Copy Streaming feature in Cassandra. When adding or decommissioning nodes, or restoring snapshots, the indexes are streamed in full with the SSTables and do not need to be serialized or rebuilt when they arrive at the receiving node.

With Solr, you first have to restore the SSTables from the snapshots, then reindex from scratch. Depending on the amount of data to be indexed, this adds a significant delay to your recovery window.

Similarly, when nodes are added to or removed from the cluster, Solr needs to reindex the streamed data on the replica, so there’s also a delay in the availability of the indexes.

SAI is better than a text search engine

SAI can do more than traditional text search and term matching. With a technique called semantic search, it can help modernize your applications by using natural language processing (NLP) and machine learning to understand the underlying meaning and context of a user's query and deliver more accurate, relevant results.

SAI also enables vector search, which makes Cassandra a vector database that can take advantage of large language models (LLMs) to provide generative AI capabilities!

Imagine being able to query a database of movies with an unstructured query string like "superhero movies with strong female leads":

[Image: three movies with strong female leads]

We can do this by using AI models such as LLMs to extract the underlying meaning of unstructured data and generate vector embeddings. An embedding is an array of floating-point numbers: a numerical representation of an object in multidimensional space, where each dimension encodes a feature or attribute that captures part of the object's semantic essence.

By encoding the data in our database as vectors, we can perform mathematical operations on the embeddings to measure similarity between them. This lets SAI quickly locate the top N rows in the table that are most similar (most semantically relevant) to the user query.

The vector embedding for each row is stored in the same table (movies_by_id in this example) as another column of type vector<float, #>, for example:

ALTER TABLE movies_by_id ADD embeddings vector<float, 1536>;

We then index the column with SAI:

CREATE CUSTOM INDEX ON movies_by_id (embeddings)
USING 'StorageAttachedIndex';

so when we generate a vector embedding of the query string "action or drama movies starring Chris Hemsworth," SAI can perform computations to compare the query vector against the embeddings stored in the table to find the movies with the closest semantic similarity:

[Image: action or drama movies starring Chris Hemsworth]

More importantly, we can do this in just five lines of code. Yes, five lines.

Here's the Python code for generating the embedding of the query string with OpenAI's text-embedding-3-small model:

from openai import OpenAI

openai_client = OpenAI()

# user_query holds the search string, e.g. "action or drama movies starring Chris Hemsworth"
response = openai_client.embeddings.create(
    input=user_query,
    model="text-embedding-3-small"
)
query_vector = response.data[0].embedding

And here's the SAI query that executes the vector search:

# session is an active Cassandra Session object connected to the keyspace
vector_search = "SELECT * FROM movies_by_id ORDER BY embeddings ANN OF %s LIMIT 3"
movies = session.execute(vector_search, [query_vector])
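
If you also want to surface the similarity score itself, Cassandra 5.0 ships vector similarity functions (similarity_cosine, among others) that can be selected alongside the same ANN ordering. A sketch, using prepared-statement bind markers for the query vector:

SELECT title, similarity_cosine(embeddings, ?) AS score
FROM movies_by_id
ORDER BY embeddings ANN OF ?
LIMIT 3;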

Got your attention? Check out How to Replace Solr with SAI & Vector Search!

What are you waiting for?

Check out the SAI Quickstart and Vector Search Quickstart guides and try it free on Astra DB — or book a demo with our data architects for a guided tour of SAI in action!
