
Cassandra – 2 – Basics of Cassandra

The name Cassandra comes from Cassandra, the prophetess of ancient Greek mythology.

Apache Cassandra is a distributed NoSQL database system. It is NOT an RDBMS like Oracle.

High availability and linear scalability are two key benefits of Cassandra. It is designed for high-volume, low-latency cloud applications.

Cassandra itself is an open source database, but DataStax is a major organization that supports the use of Cassandra and sells commercial Cassandra products that make it easier for clients to run Cassandra in production systems.

Some key benefits of the Cassandra database are:

  1. Open Source – There is no license charge and you can customize it according to your needs.
  2. High Fault Tolerance – Data is replicated “automatically” to multiple nodes, so losing one node does not mean losing data.
  3. High Throughput – It can sustain a high rate of reads and writes.
  4. No single point of failure – This comes from the distributed architecture. All nodes are identical; there is no master-slave relationship.
  5. Familiar SQL-like command line – Many commands of Cassandra Query Language (CQL) are similar to SQL.
  6. Awesome Scalability – You can scale your database without any downtime/interruption. Horizontal scaling simply adds more nodes to the cluster.
  7. Flexible database – Example: rows in the same table do not all have to populate the same set of columns!
  8. Less costly hardware – Yes, you can build your database on low-cost commodity hardware and still get good performance.
  9. Easy Administration – There are fewer moving parts than in an Oracle database, which makes it easier to administer. Adding more nodes does not mean more complexity. It is easy to install too.

Cassandra is a partitioned data store. “Partitioned” here means that the database uses a key on each row (the partition key) to decide which node stores that row, so the rows are distributed across multiple nodes.
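To make the idea of partitioning concrete, here is a minimal sketch in Python of how a row key can be hashed to a position on a ring and mapped to a node. It is an illustration only, not Cassandra's implementation: the node names, the ring size, and the token values are made up, and MD5 is used as a stand-in for the partitioner's real hash function.

```python
import hashlib
from bisect import bisect_left

# Hypothetical four-node cluster; names and token values are made up.
RING_SIZE = 2**32
NODE_TOKENS = {
    "node-a": 0,
    "node-b": RING_SIZE // 4,
    "node-c": RING_SIZE // 2,
    "node-d": 3 * RING_SIZE // 4,
}

def token_for(partition_key):
    """Hash a partition key to a position on the ring (MD5 as a stand-in)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # always in [0, 2**32)

def owner_of(partition_key):
    """Return the node whose token is the first one at or after the key's token."""
    ring = sorted((token, node) for node, token in NODE_TOKENS.items())
    tokens = [token for token, _ in ring]
    # Wrap around to the first node when the key's token is past the last one.
    index = bisect_left(tokens, token_for(partition_key)) % len(ring)
    return ring[index][1]

for key in ("user:1", "user:2", "user:3", "user:4"):
    print(key, "->", owner_of(key))
```

The same key always lands on the same node, and different keys tend to spread evenly over the nodes, which is what lets the cluster share the load.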

Some key features that make Cassandra a high-performance database are:

  1. Compression – Data is compressed to reduce the volume of data stored on disk and also to lower disk I/O.
  2. Data Cache – Frequently read data is kept in memory so that reads can avoid going to disk. The partition key cache and the row cache are the two types of data cache used. If a node becomes unavailable, the client reads the data from another replica.
  3. Bloom Filters – A Bloom filter is a probabilistic data structure designed to tell you, rapidly and memory-efficiently, whether an element may be present in a set. It can give a false “yes” but never a false “no”.
  4. Compaction – Deleted data is not removed at once; it is only marked as deleted (with a tombstone). In the background, Cassandra keeps compacting the data files based on the rules defined, and the deleted data is purged as part of that process.
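The Bloom filter in point 3 is easier to understand with a small example. The sketch below is a toy version in Python, not the structure Cassandra ships: the bit-array size, the number of hashes, and the use of salted SHA-256 are all assumptions chosen for readability.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive several bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))

bloom = BloomFilter()
for key in ("user:1", "user:2", "user:3"):
    bloom.add(key)

print(bloom.might_contain("user:2"))   # an added key is always reported as present
print(bloom.might_contain("user:99"))  # usually False, but a false positive is possible
```

A "definitely absent" answer is what saves work: a data file whose filter says the key is not there can be skipped without reading it from disk.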

Some drawbacks/limitations of Cassandra are:

  1. Consistency is not guaranteed at all times. Data propagation between nodes involves some latency, so a read may briefly return older data.
  2. No joins and no foreign keys in Cassandra, and indexing is limited compared with an RDBMS. RDBMS DBAs are very familiar with these terms.
  3. Transaction concepts such as locking and rollback do not apply here, although Cassandra does offer lightweight transactions.
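The consistency point in item 1 comes down to simple arithmetic. Assuming each row is stored on a fixed number of replicas (the replication factor), a read is certain to see the latest write only when the replicas that answer the read must overlap the replicas that acknowledged the write. The helper names below are hypothetical and the sketch only illustrates the arithmetic; it does not talk to a cluster.

```python
def quorum(replication_factor):
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

def read_overlaps_write(replication_factor, write_acks, read_acks):
    """True when any read replica set must share a node with any write replica set."""
    return read_acks + write_acks > replication_factor

rf = 3
print(quorum(rf))                      # 2
print(read_overlaps_write(rf, 1, 1))   # False: fast, but a read can miss the latest write
print(read_overlaps_write(rf, 2, 2))   # True: quorum writes plus quorum reads overlap
```

Waiting for fewer replicas gives lower latency at the cost of possibly stale reads, which is the trade-off behind this limitation.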

Let us list some differences in data and data models between RDBMS and Cassandra:

| PARAMETER | RDBMS | CASSANDRA |
|---|---|---|
| Type of data | Structured | Unstructured |
| Type of schema | Fixed | Flexible |
| What is the significance of a column? | Columns represent a relation’s attributes | Columns are a unit of storage |
| What is the significance of a row? | A row is a single record | A row is a unit of replication |
| How are relations defined? | Uses foreign keys and joins | Uses collections to represent relationships |
| What is the significance of a table? | Array of arrays (records) | List of nested key-value pairs |
| What is the container for all data? | Database is the outermost data container | Keyspace is the outermost container for data |
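The Cassandra column of this comparison can be pictured as nested key-value pairs. The following Python sketch is only a mental model, not how Cassandra stores data on disk, and the table, key, and column names are hypothetical. It shows a keyspace as the outermost container, rows with different sets of columns, and a collection (a list of phone numbers) used where an RDBMS would use a second table and a join.

```python
# Hypothetical names throughout; a plain dictionary stands in for a keyspace.
keyspace = {
    "users_by_id": {  # table: row key -> {column name: value}
        "user:1": {"name": "Asha", "email": "asha@example.com"},
        "user:2": {
            "name": "Ravi",
            "city": "Pune",
            "phones": ["555-0100", "555-0101"],  # collection instead of a join
        },
    }
}

def columns_of(table, row_key):
    """Return the sorted column names that a given row actually populates."""
    return sorted(keyspace[table][row_key])

print(columns_of("users_by_id", "user:1"))  # ['email', 'name']
print(columns_of("users_by_id", "user:2"))  # ['city', 'name', 'phones']
```

The two rows live in the same table but populate different columns, which is what "flexible schema" means in the comparison.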

Fast performance, support for a large number of complex data types, and distributed data have made Cassandra a popular NoSQL database.

Cassandra works great for data that sees few updates/deletes and needs fast writes. Also, Cassandra itself is Java-based, so as an administrator a basic knowledge of Java will help you understand the errors that sometimes pop up, tune the JVM, and monitor it efficiently.

Brijesh Gogia