Technology | February 26, 2025

Moving Data between Apache Cassandra® and Astra DB


Data movement between Apache Cassandra® and Astra DB is an essential step of any migration process. Astra DB has a unique architecture that enables it to thrive while serving high-throughput workloads, including the ability to scale compute and storage independently.

But this distinction means that Astra DB cannot join a cluster with (non-Astra) Cassandra database instances, so the usual paths for migrating from one Cassandra database to another can't be used. Migration tools like ZDM Proxy can help bridge the operational gap between Cassandra and Astra DB, but to move any existing data, we need to look to additional, offline tools and processes to do the heavy lifting.

Tools

There are two main tools that we'll look at today: the Cassandra Data Migrator (CDM) and the DataStax Bulk Loader (DSBulk). While the CDM is the most common tool used for a Cassandra-to-Astra migration, DSBulk has been around for several years and is a favorite of many Cassandra users. Data migrations to Astra DB have been successful with both tools. However, there are some differences that we will outline below, to help in selecting the most appropriate tool for your cluster.

CDM

The Cassandra Data Migrator is a great tool for moving data between Apache Cassandra® and DataStax Astra DB. It works in conjunction with the Apache Spark framework, which allows it to move large amounts of data between different Cassandra data stores without any intermediate storage.

Note: We recommend Java 11-21 and Apache Spark 3.5 for use with the CDM.

The CDM can be obtained in a few different ways. It can be found on DockerHub and run as a Docker container. Its JAR file can also be downloaded from the CDM GitHub repository. The GitHub repo can also be cloned locally and built using Java and Maven.
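As a sketch, the Docker route might look like the commands below. The image name follows the project's DockerHub listing, but verify it and the tag against the version you actually want before pulling:

```shell
# Pull the CDM image from DockerHub (image name per the project's
# DockerHub listing; confirm the tag for the version you need)
docker pull datastax/cassandra-data-migrator:latest

# Start a container with a shell; the CDM JAR and a sample properties
# file ship inside the image (exact paths may vary by version)
docker run -it datastax/cassandra-data-migrator:latest bash
```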

Configuration

The primary source of configuration for the CDM is the cdm.properties file. Inside it, we can specify all of the connection details for both our origin Cassandra cluster and our target Astra DB database. We will also specify our keyspace and table names, and migrate one table at a time.
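A minimal cdm.properties might look like the sketch below. The property names follow the reference properties file shipped with the CDM, but the hosts, credentials, and keyspace/table values here are placeholders; check everything against the sample file for the version you are running:

```properties
# Origin (existing Cassandra cluster) connection details
spark.cdm.connect.origin.host        10.0.0.1
spark.cdm.connect.origin.port        9042
spark.cdm.connect.origin.username    cassandra
spark.cdm.connect.origin.password    cassandra

# Target (Astra DB) connection details; the SCB is the secure connect bundle
spark.cdm.connect.target.scb         file:///path/to/secure-connect-bundle.zip
spark.cdm.connect.target.username    token
spark.cdm.connect.target.password    AstraCS:replace-with-your-token

# Keyspace and table to migrate, one table at a time
spark.cdm.schema.origin.keyspaceTable   killrvideo.youtube_videos
```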

You are also welcome to build your own Spark environment. If needed, a tarball for Spark 3.5.4 can be downloaded and expanded with these commands:

wget https://archive.apache.org/dist/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3-scala2.13.tgz
tar -xvzf spark-3.5.4-bin-hadoop3-scala2.13.tgz

Execution

Note: It is recommended to have the cdm.properties, cassandra-data-migrator.jar, and Astra DB secure connect bundle files all in the same directory. 

The CDM can be run as a Spark job, like this: 

spark-submit --properties-file cdm.properties --master "local[*]" --driver-memory 25G --executor-memory 25G --class com.datastax.cdm.job.Migrate cassandra-data-migrator.jar

The memory specified for the driver and executor should be adjusted based on the size of the table to be migrated and the available system resources.

DSBulk

The DataStax Bulk Loader has been around since 2018. It can work with any database that speaks the CQL protocol, and includes functionality for exporting from a Cassandra database, importing to a Cassandra database, and counting the number of rows in a Cassandra table. This versatility has established it as a frequent part of a Cassandra DBA's toolkit.

DSBulk can be downloaded as a ZIP or tarball from DataStax Downloads or from its GitHub repository.
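As a sketch, fetching and unpacking a tarball might look like the following. The download URL and the version in the directory name are assumptions based on the DataStax Downloads site; verify both against the release you actually grab:

```shell
# Fetch the DSBulk tarball (URL pattern per DataStax Downloads; verify first)
curl -OL https://downloads.datastax.com/dsbulk/dsbulk.tar.gz
tar -xzf dsbulk.tar.gz

# The dsbulk executable lives in the bin/ directory of the expanded
# archive (directory name reflects the version, e.g. dsbulk-1.11.0)
./dsbulk-1.11.0/bin/dsbulk --version
```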

Configuration

DSBulk is configured via command-line flags passed at execution time. Common flags for a DSBulk unload (from a Cassandra cluster) are:

flag	description
-h	A host or contact point for the origin cluster. A port is optional, but can also be passed along. Ex: -h "1.2.3.4:9042"
-u	Cassandra username with read access to the keyspace and table.
-p	Cassandra password for the above-mentioned username.
-url	Location for the (CSV) file to be written.
-k	Origin keyspace name.
-t	Origin table name.

Common flags for a DSBulk load operation (into an Astra database) are:

flag	description
-b	Location for the Astra database's secure connect bundle (SCB) file.
-u	Either the Astra client ID, or the literal word "token."
-p	Either the Astra client secret, or the Astra application token.
-url	Location for the (CSV) file to be loaded.
-k	Target keyspace name.
-t	Target table name.

Execution

An export operation using DSBulk may look like this:

dsbulk unload -url ./csv/youtube_videos -h 1.2.3.4:9042 -k killrvideo -t youtube_videos -u cassandra -p cassandra

An import operation using DSBulk may look like this:

dsbulk load -url ./csv/youtube_videos/output-000001.csv -k killrvideo -t youtube_videos -b ./secure_connect.zip -u token -p "AstraCS:nOtrEAlnOtrEAl:1a2b656blahblahblahba8750"
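Since DSBulk can also count rows, a quick sanity check after the load is to run a count against both sides and compare the totals. A sketch, reusing the same connection flags as the examples above:

```shell
# Count rows in the origin Cassandra table
dsbulk count -h 1.2.3.4:9042 -k killrvideo -t youtube_videos -u cassandra -p cassandra

# Count rows in the target Astra DB table; the totals should match
dsbulk count -k killrvideo -t youtube_videos -b ./secure_connect.zip -u token -p "AstraCS:nOtrEAlnOtrEAl:1a2b656blahblahblahba8750"
```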

Other considerations

It is important to remember that exporting data introduces load onto the origin Cassandra cluster, regardless of the tool used. Be sure to examine appropriate metrics to gauge both active and slow times for the cluster. Ideally, the data migration should be scheduled for a time window when cluster usage is at a minimum.

As a part of the data migration planning, it is also advisable to assess the cluster’s current resources. The origin cluster may benefit from adding a few additional nodes to increase the maximum operational throughput. Following these suggestions should help with minimizing the impact on continued production usage of the application.
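Another way to limit the impact on the origin cluster is DSBulk's own rate limiting. A sketch, where the setting name follows the DSBulk documentation and the value is a placeholder to tune for your cluster:

```shell
# Throttle the unload to roughly 10,000 operations per second so the
# origin cluster keeps headroom for production traffic (tune the value)
dsbulk unload -url ./csv/youtube_videos -h 1.2.3.4:9042 -k killrvideo -t youtube_videos \
  -u cassandra -p cassandra --executor.maxPerSecond 10000
```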

Need help getting started?

Moving data offline is a significant part of the Cassandra to Astra DB migration process. The Cassandra Data Migrator and the DataStax Bulk Loader are two capable tools for performing this task efficiently. While the CDM is the recommended option, DSBulk makes a great addition to your DBA team's tactical toolbox.

Are you considering an enterprise database consolidation effort? Looking at moving to a Cassandra DBaaS, but not sure where to start? Check out DataStax’s many resources on Modernizing your Cassandra workloads, including our playlist of migration videos.

