Company · June 12, 2021

The Best Apache Cassandra® Use Cases: Netflix, SoundCloud, and Instagram

Originally developed by Facebook, Apache Cassandra® was open-sourced in 2008 and has grown to become a leading distributed NoSQL database solution. Thousands of companies use it, including Apple, Instagram, Uber, Spotify, Twitter, Cisco, Rackspace, eBay, and Netflix.

Unlike traditional relational databases, Cassandra's architecture is based on a peer-to-peer model in which all nodes in the cluster are equal, with no master-slave relationship. This masterless design provides high availability and fault tolerance without a single point of failure: even if one node goes down, the remaining nodes keep serving requests, so data stays accessible. Cassandra is a common choice for applications in finance, telecommunications, and e-commerce because it handles structured, semi-structured, and unstructured data and scales to massive data volumes across many commodity servers with low latency.

Let’s examine the advantages that make Cassandra one of the most widely used NoSQL databases.

Benefits of using Apache Cassandra

Open source: 

Increases innovation, speed of implementation, flexibility, and extensibility. More cost-effective, while avoiding vendor lock-in.

Handles a high volume of data with ease: 

Built to handle a massive amount of data across many servers. Some large organizations are using it to manage petabytes of information.

Continuous availability: 

With no single point of failure, the cluster is designed to stay up when individual nodes fail. If a particular node fails, requests are automatically served by other nodes that hold replicas of the same data. The system continues to work as designed, with applications available and data accessible, so users typically never notice the outage. This is key for companies that can’t afford to have their database go offline or to lose data.

High performance and fast: 

Cassandra has a peer-to-peer, distributed architecture, where every node can perform all read and write operations. This adds resiliency while improving performance. Writes are especially fast, and write throughput holds up as the volume of data grows.

Straightforward scalability: 

Cassandra’s horizontal scaling is straightforward and cost-effective. Instead of scaling vertically with expensive hardware, Cassandra enables companies to expand simply by adding low-cost commodity servers or virtual machines, with no shutdowns required. And its linear scalability means performance is maintained as nodes are added. These scalability benefits make Cassandra popular with companies that work with large datasets, have many concurrent users, and expect continued growth.

Masterless replication: 

Apache Cassandra replicates data without a master node, which keeps replication across data centers simple. Unlike traditional database systems that rely on a master-slave architecture, Cassandra is a peer-to-peer system that treats all nodes in the cluster equally. Each node stores a subset of the data and can perform read and write operations independently. This masterless design eliminates the single point of failure, so when data is replicated across regions, an outage in any one region leaves your data accessible and your application operational. Placing data closer to end users also lowers latency. Wide, even global, distribution is possible.

Familiar interface: 

Most developers will be able to pick up Cassandra’s query language quickly. That’s because Cassandra Query Language (CQL) has a strong resemblance to SQL.

Tunable consistency for flexible data management

Cassandra’s tunable consistency model is a game-changer. It gives developers the flexibility to choose between eventual consistency and strong consistency based on their application’s requirements. This means you can adjust the consistency level to balance the trade-offs between consistency, availability, and performance. 
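As a rough illustration of that trade-off, the sketch below works through the replica arithmetic behind consistency levels. It assumes a single data center with replication factor `rf`. The helper names `quorum` and `reads_see_latest_write` are made up for this example and are not part of any Cassandra driver.

```python
def quorum(rf: int) -> int:
    """Number of replicas that form a majority for a replication factor."""
    return rf // 2 + 1


def reads_see_latest_write(rf: int, write_replicas: int, read_replicas: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > rf


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    # Quorum writes plus quorum reads always share at least one replica.
    print("QUORUM/QUORUM overlaps:", reads_see_latest_write(rf, q, q))
    # Single-replica reads and writes are faster but only eventually consistent.
    print("ONE/ONE overlaps:", reads_see_latest_write(rf, 1, 1))
```

Requiring fewer replicas per operation lowers latency and tolerates more failed nodes. Requiring enough replicas that reads and writes overlap gives strong consistency at the cost of both.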

High write performance for real-time data processing

Designed with high write performance in mind, Apache Cassandra can handle a large number of writes per second, making it ideal for applications that require real-time data processing and analytics. Whether you’re dealing with IoT sensor data, financial transactions, or social media analytics, Cassandra’s architecture ensures that you can process large amounts of data quickly and efficiently. This high write performance enables your application to respond to changing conditions in real-time, allowing you to make data-driven decisions and stay ahead of the competition.

By leveraging Cassandra’s robust features, you can build a database system that meets your current needs and also scales effortlessly as your data grows. Whether you’re managing geographically dispersed data centers or handling large and diverse datasets, Apache Cassandra provides the tools you need to ensure high availability, fault tolerance, and exceptional performance.

Those are just some of the reasons organizations are embracing Cassandra. It has plenty of other benefits, including the flexibility to handle structured, semi-structured, and unstructured data, along with automatic workload and data balancing. To top it off, it offers operational simplicity, low overhead, and the ability to support hybrid and multi-cloud environments.

Cassandra has a laundry list of benefits, but how do companies actually use it?

Common Apache Cassandra use cases

On the community team at DataStax we spend a lot of time talking to, and hearing from, companies that are using Apache Cassandra in production for:

  • E-commerce and inventory management

  • Personalization, recommendations, and customer experience

  • Internet of things and edge computing

  • Fraud detection and authentication

E-commerce and inventory management

E-commerce companies can’t afford to have their site go down, and that’s especially true during a peak period. Every minute they’re offline quickly eats away at their bottom line. Since rapid growth is always the goal, they also need the ability to cost-effectively scale their online inventory on the fly. For the same reason, these organizations need a database that can handle an enormous amount of data with ease. And to meet or exceed customer expectations, they need the flexibility to continuously modify their product mix.

Cassandra can deploy across multiple data centers, allowing for efficient data replication and fault tolerance, which is crucial for high-growth business environments.

Here’s why Cassandra is a good fit for e-commerce and inventory management:

  • Resilient with zero downtime: Distributed with multi-region replication, Cassandra ensures zero downtime. Even the loss of an entire region won’t bring it down.

  • Highly responsive: Cassandra’s peer-to-peer architecture also allows data to reside in regions around the world and closer to any particular customer—allowing the system to be highly responsive and fast.

  • Predictable scalability: Cassandra’s horizontal scalability is straightforward, predictable, and cost-effective.

  • Faster catalog refreshes.

  • Real-time analysis of catalog and inventory data.

Personalization, recommendations, and customer experience with NoSQL databases

Personalization and recommendation engines are everywhere now. Like personal assistants built into apps and websites, they help us decide what events to buy tickets to, surface articles we might find interesting, and much more. 

Eventbrite now uses Cassandra instead of MySQL to power their mobile experience, letting users know what events are happening around them that they will be interested in attending. Eventbrite chose Cassandra for its read/write capacity and ease of deployment. Outbrain, a company whose service you may use frequently without recognizing the name, uses Cassandra to power their content discovery platform, helping companies add revenue streams by serving up applicable third-party articles you may find interesting.

Most database systems experience slower write performance as the volume of data increases, often due to issues like locking mechanisms or the need to maintain ACID compliance. In contrast, Cassandra excels in this area with its distributed architecture and asynchronous writing capabilities.

Near real-time, relevant, personalized experiences are now expected. Here’s why Cassandra is the right choice to power tailored experiences:

  • Fast response times.

  • Extremely low latency, even as your customer base expands.

  • Handles all types of data, structured and unstructured, from a variety of sources.

  • Built to scale while staying cost-effective.

  • Ability to store, manage, query, and modify extremely large datasets, while delivering personalized experiences to millions of customers.

  • High read/write capacity.

  • Ease of deployment.

  • Flexible, enabling continuous customer experience innovation.

Consider the success story of Macquarie Bank:

With an architectural foundation built on Cassandra, the company moved from no retail banking presence to a top contender in the digital banking space in less than two years, by truly understanding customer behavior and prioritizing personalization. Learn how Macquarie Bank uses Cassandra to provide personalization for their customers.

Internet of things and edge computing with a distributed database

Modern tracking produces an avalanche of never-ending data:

  • weather
  • traffic
  • energy consumption
  • inventory levels
  • health indicators
  • video game stats
  • farming conditions
  • internet of things (IoT) sensors
  • wearables
  • vehicles
  • mobile devices
  • appliances
  • drones
  • other devices at the edge

This data needs to be securely collected—sometimes from millions of devices—aggregated, processed, and analyzed continuously.

Consider how the National Renewable Energy Laboratory uses Cassandra to store and analyze sensor data at the world's most environmentally friendly building. They find ways to save water and energy by running the world's smartest thermostat on top of Cassandra. The system continuously learns about energy usage patterns, and automatically adjusts settings, even when no one is there to program it.

Here are some of the reasons Cassandra is a good fit for IoT and edge computing needs:

  • Cassandra can ingest concurrent data from any node in the cluster, since all have read/write capacity.

  • Ability to handle a large volume of high-velocity, time-series data.

  • High availability.

  • Supports continuous, real-time analysis.
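High-velocity time-series data is commonly modeled so that one device's readings for a bounded time window share a partition and stay in time order inside it. The following is a minimal in-memory sketch of that idea, assuming day-sized buckets. The names `partition_key` and `ingest` and the sample device ID are hypothetical, and a real deployment would express the same layout in a table definition rather than in Python dictionaries.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Tuple


def partition_key(device_id: str, ts: datetime) -> Tuple[str, str]:
    """Group one device's readings by UTC day so partitions stay bounded."""
    return device_id, ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


def ingest(readings):
    """Place (device_id, timestamp, value) readings into per-day partitions."""
    partitions = defaultdict(list)
    for device_id, ts, value in readings:
        partitions[partition_key(device_id, ts)].append((ts, value))
    # Inside a partition, keep rows sorted by time, like a clustering column.
    for rows in partitions.values():
        rows.sort(key=lambda row: row[0])
    return dict(partitions)


if __name__ == "__main__":
    sample = [
        ("thermostat-1", datetime(2021, 6, 12, 23, 59, tzinfo=timezone.utc), 21.5),
        ("thermostat-1", datetime(2021, 6, 13, 0, 1, tzinfo=timezone.utc), 21.0),
    ]
    for key, rows in ingest(sample).items():
        print(key, len(rows))
```

Bucketing by day is only one choice. Devices that report more often may need hourly buckets to keep each partition small.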

Fraud detection and authentication

Security threats continue to rise, and many companies are always on the defensive, playing catch-up with their smart fraud detection capabilities. That’s because fraudsters are constantly on the attack, looking for new and creative ways to steal customer data and compromise other sensitive information.

To have any chance of preventing illegitimate users from gaining access, companies need data, and a lot of it. Continuous, real-time analysis of large and diverse datasets is required to find patterns and anomalies that can be indicators of fraud. Fraud detection is a high priority for all businesses, and its importance is elevated in areas like financial services, banking, payments, and insurance.

As an example, take a look at how ACI Worldwide has used Cassandra to drastically improve its fraud detection rate while lowering its false positive rate.

Identity authentication is the other side of the fraud detection coin. Instead of focusing on keeping fraudsters out, the goal of authentication is to confirm that only legitimate customers gain access. The trick is, you want to make the log-in process as painless and fast as possible, while still making absolutely sure they are who they say they are. As with fraud detection, to pull this off, you need to conduct real-time analysis of a wide variety and high volume of data. And since authentication is likely a central part of all your systems, outages must be avoided at all costs. If a customer experiences friction trying to access your site, whether due to a false positive or because the auth system is down, it likely won’t take too long for them to leave in frustration.

Here are some reasons Cassandra is a great database choice for fighting fraud and ensuring identity authentication:

  • Flexible schema: Handles numerous data types, and they can be quickly added to the mix.

  • Enables complex, real-time analysis, including the ability to incorporate and support machine learning and AI.

  • Handles large-scale, growing datasets. 

Other common Cassandra use cases

There are countless other applications that can benefit from Cassandra. Here are a few more:

  • financial services and payments

  • messaging

  • playlists

  • logistics and asset management

  • content management systems

  • transaction logging

  • tracking of all kinds, including packages and orders

  • digital and media management

Now that we’ve reviewed some of Cassandra’s most common use cases, let’s explore how it has helped some brands you might have heard of.

Notable Cassandra use cases in action

With so many prominent companies using Apache Cassandra, it’s highly likely we all interact with it in some way multiple times a day. For example, the next time you go for a jog and queue up your Spotify playlist, you’re using an application built on top of Cassandra. In fact, Cassandra has been instrumental in the expansion of many iconic brands, including Netflix, Uber, Instagram, Reddit, SoundCloud, and more.

Let’s dig deeper to see how Netflix, SoundCloud, and Instagram leverage some of Cassandra’s most powerful features.

Compliance audit logging

It’s important for companies to have a bullet-proof audit trail for their database. Using audit logging, organizations can track and record significant changes to the database, along with the time they occur and who triggered them. Reviewing these records is necessary to ensure regulatory compliance and security standards are met. Audit logging can also be very helpful for uncovering the root cause of bugs. Apache Cassandra has audit logging built in, allowing users to easily create a persistent record of important changes.

Cassandra in action: Netflix

It’s no surprise Netflix deploys Cassandra’s audit logging capability at scale. After all, their own cloud database architects and engineers contributed heavily to its development. When implementing audit logging, Netflix wanted to make sure it was performant, accurate, usable, and extensible. Their setup audits everything and logs user, host, source IP address, source port, timestamp, type, category, keyspace, scope and operation.
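To make the list of audited fields concrete, here is a small sketch of a record type carrying exactly those fields. The class name `AuditEntry`, the function `format_entry`, the pipe-delimited layout, and the sample values are all assumptions made for illustration. This is not Cassandra's or Netflix's actual log format.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    """One audited operation, with the fields named in the text above."""
    user: str
    host: str
    source_ip: str
    source_port: int
    timestamp: datetime
    type: str
    category: str
    keyspace: str
    scope: str
    operation: str


def format_entry(entry: AuditEntry) -> str:
    """Render the entry as one pipe-delimited line (illustrative layout only)."""
    fields = asdict(entry)
    fields["timestamp"] = entry.timestamp.isoformat()
    return "|".join(f"{name}:{value}" for name, value in fields.items())


if __name__ == "__main__":
    entry = AuditEntry(
        user="alice", host="10.0.0.5", source_ip="10.0.0.9", source_port=51234,
        timestamp=datetime(2021, 6, 12, tzinfo=timezone.utc),
        type="SELECT", category="QUERY", keyspace="shop", scope="orders",
        operation="SELECT * FROM shop.orders",
    )
    print(format_entry(entry))
```

Keeping every entry in one fixed structure is what makes the log usable later: compliance reviews and bug hunts can filter on user, keyspace, or category without parsing free text.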

Dashboards and data processing

Dashboards provide a handy, visual way to quickly get a read on a situation. They are often used to access the latest information on a particular topic, check status, or monitor a process or project. Companies use dashboards for many purposes, both internally for employees and externally for customers. Either way, users are often given the ability to personalize dashboards to best fit their needs.

Cassandra provides a solid foundation for dashboards for many reasons, including:

  • Easily handles frequent updates—there are typically many, ongoing updates for each user.

  • Built to take on extremely large datasets—hundreds of millions of events can reside in one table.

  • Efficient way to store time-series data.

Cassandra in action: SoundCloud

The dashboard SoundCloud provides its customers is one of its most popular features. In fact, the company credits much of its rapid data growth to it. SoundCloud customers can personalize their dashboard with the option to see where in the world their audio uploads are being listened to, and by which users, along with being a home for incoming sound clips from people they follow, and much more.

SoundCloud turned to Cassandra because of its ability to store and access vast amounts of data, its built-in persistence of that data, and for its scalability. Cassandra’s read/write capabilities were also a strong selling point for SoundCloud’s adoption. With Cassandra, they can provide each customer with a sequential read path—so posts can be browsed in the correct time order. 

Cassandra also lets SoundCloud accept write events in any order and slot each one into its correct place in the sequence. One write event could end up in millions of users’ dashboards, and SoundCloud uses Cassandra to ensure it's always displayed in the right place. The company also leans on Cassandra to explore relationships between customers and personalize their experiences.
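The pattern described here, where one write is copied into many per-user timelines that are each kept in time order, can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not SoundCloud's implementation. The `Dashboards` class, its method names, and the integer timestamps are all hypothetical.

```python
import bisect
from collections import defaultdict


class Dashboards:
    """Per-user timelines kept sorted by event time (oldest first)."""

    def __init__(self):
        self.followers = defaultdict(set)   # creator -> set of followers
        self.timelines = defaultdict(list)  # user -> sorted [(ts, creator, item)]

    def follow(self, follower: str, creator: str) -> None:
        self.followers[creator].add(follower)

    def publish(self, creator: str, ts: int, item: str) -> None:
        """Fan one write out to every follower, inserting it in time order."""
        for user in self.followers[creator]:
            bisect.insort(self.timelines[user], (ts, creator, item))

    def read(self, user: str, limit: int = 10):
        """Sequential read: the most recent items, newest first."""
        return [item for _, _, item in reversed(self.timelines[user][-limit:])]


if __name__ == "__main__":
    d = Dashboards()
    d.follow("ann", "dj1")
    d.follow("ann", "dj2")
    d.publish("dj1", 200, "track-b")
    d.publish("dj2", 100, "track-a")  # arrives late, still lands in order
    print(d.read("ann"))
```

Doing the ordering work at write time keeps each dashboard read a simple sequential scan of one user's timeline.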

Replication

As discussed earlier, creating and storing replicas of datasets at geographically dispersed data centers makes a lot of sense. It increases fault tolerance, reliability, and availability. If a data center in the cluster goes down, operations will continue without a blip. Having data closer to customers, no matter where they’re using your app around the world, also decreases read and write latency.

Cassandra has a peer-to-peer, distributed architecture, without the need of a primary node. Instead, every node can perform read and write operations, and all replicas, across the cluster, are equally important. That means data can quickly be replicated across all nodes and Cassandra doesn’t have a single point of failure. That translates into always-on availability with zero downtime. That’s why so many companies turn to Cassandra when data storage is mission critical, and they need a database that can comfortably handle petabyte-sized datasets and full global replication.
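A simplified way to picture replica placement is a ring of equal nodes: the replicas for a piece of data are the node that owns it plus the next nodes clockwise. The sketch below assumes a single ring and a fixed replication factor. The function names and node labels are illustrative, and placement that is aware of multiple data centers adds per-data-center replica counts that are not modeled here.

```python
def replicas_for(ring, start, rf):
    """Walk the ring clockwise from index `start`, taking `rf` distinct nodes."""
    count = min(rf, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]


def live_replicas(replicas, down):
    """Replicas that can still serve requests when some nodes are down."""
    return [node for node in replicas if node not in down]


if __name__ == "__main__":
    ring = ["n1", "n2", "n3", "n4", "n5"]
    replicas = replicas_for(ring, start=3, rf=3)
    print("replicas:", replicas)
    # Losing one replica still leaves two nodes able to serve the data.
    print("after n4 fails:", live_replicas(replicas, down={"n4"}))
```

Because no node in the ring is special, any surviving replica can take reads and writes for the data it holds.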

Cassandra in action: Instagram

Instagram has used Cassandra from the beginning, way back in 2010. When they started expanding, they created replicas in each new data center. As Instagram kept replicating in data centers around the world, they found their performance dropped. To combat the dip, they started storing data only in the region closest to where it was generated. Local data access has helped them provide a faster, more efficient service to the more than one billion active daily users they have today. Learn more about how Instagram uses Cassandra to replicate on a global scale.

 

Apache Cassandra can help your company

Leading companies around the world, ranging from social media to international banking, are using Cassandra for all kinds of use cases. That’s because it can help any company that requires the ability to manage a large volume of data, always-on availability, high fault tolerance, easy and cost-effective scalability, and seamless replication—all without compromising performance. It’s also a perfect fit for cloud-native applications, or hybrid cloud and multi-cloud environments.

Many of the largest internet apps and the Fortune 100 use DataStax Enterprise (DSE) as their implementation of Cassandra.

Apache Cassandra Resources

Get started with our Learning Series on Cassandra Fundamentals

Download the whitepaper: Apache Cassandra® Architecture

 

FAQs

What is Apache Cassandra most useful for?

Apache Cassandra is most useful for storing and processing large amounts of structured, semi-structured, and unstructured data that is distributed across many servers and regions worldwide.

Why does Netflix use Cassandra?

Netflix decided to use Apache Cassandra for its own annotation storage solution. Cassandra is a distributed NoSQL database that supports horizontal scaling across many servers.

Is Apache Cassandra still relevant?

Yes. Since it was open-sourced in 2008, Cassandra's adoption has kept growing, and it remains one of the most widely used NoSQL databases.

What is the main difference between NoSQL and relational databases?

A relational database organizes data into tables with a fixed schema, using structured query language (SQL) for queries. A NoSQL database typically relaxes the fixed relational schema and stores data in formats like key-value pairs, documents, or wide columns, allowing for greater flexibility and scalability.

How do NoSQL databases handle distributed environments?

NoSQL databases, such as Cassandra, are designed to operate as distributed databases across multiple nodes and data centers. This allows for high availability and fault tolerance, even in cases of node or regional failures.

Why are NoSQL databases preferred for certain applications over relational databases?

NoSQL databases are ideal for applications that require high scalability, the ability to handle large volumes of unstructured or semi-structured data, and the need for distributed data storage across multiple data centers. They are also suited for scenarios where fast, scalable writes and reads are critical.

How does Cassandra manage data distribution?

Cassandra hashes each row's partition key to decide which nodes in the cluster store it, which spreads data evenly across the nodes. This keeps the data balanced so that no single node becomes a bottleneck, enabling efficient write operations and fault tolerance.
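The sketch below shows one way hashing a partition key can spread rows across nodes. It is a simplified illustration: MD5 from Python's standard library stands in for Cassandra's actual partitioner hash, and the function names, node labels, and the choice of 64 tokens per node are assumptions made for this example.

```python
import bisect
import hashlib
from collections import Counter


def token(partition_key: str) -> int:
    """Hash a key to a 64-bit token (MD5 is a stand-in hash for this sketch)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_ring(nodes, tokens_per_node=64):
    """Give each node several tokens so ownership spreads evenly."""
    return sorted(
        (token(f"{node}#{i}"), node) for node in nodes for i in range(tokens_per_node)
    )


def owner(ring, partition_key: str) -> str:
    """The first ring entry at or after the key's token owns it, wrapping around."""
    tokens = [t for t, _ in ring]
    idx = bisect.bisect_left(tokens, token(partition_key)) % len(ring)
    return ring[idx][1]


if __name__ == "__main__":
    ring = build_ring(["n1", "n2", "n3"])
    counts = Counter(owner(ring, f"user-{i}") for i in range(3000))
    print(dict(counts))
```

Because placement depends only on the hash of the key, any node can work out where a row lives without consulting a central directory.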

Can NoSQL databases be used for data warehousing?

While relational databases have traditionally been used for data warehousing due to their structured nature, NoSQL databases can also be used, especially when dealing with large-scale, distributed data that needs to be stored across multiple data centers.

What are the advantages of using a distributed database like Cassandra?

Cassandra's distributed architecture allows for data to be replicated across multiple nodes and data centers, ensuring high availability, fault tolerance, and low latency access. It also supports horizontal scaling, making it easier to manage large datasets.
