Cassandra is a distributed database that runs on multiple nodes at once. The nodes use a peer-to-peer protocol to exchange information with each other.
Some key points regarding Cassandra architecture/functioning:
- Rows are stored in tables, and every table has a mandatory primary key.
- Data is first written to a log file for durability and then, somewhat like an RDBMS, to an in-memory cache. When the cache is full, its contents are finally written to disk.
- Data is automatically partitioned and replicated across the nodes.
- At regular intervals, Cassandra compacts the data in the database.
- A client request can go to any node in the cluster, and that node then fetches the data the client needs.
Some of the key high-level structures/concepts of Cassandra are:
NODE: All data is stored on nodes. A node is essentially a server machine. You can run Cassandra on a single node (for a basic demo or testing), but real-life deployments use multiple nodes.
DATACENTER: Not to be confused with a physical data center. It is a SET OF NODES that you configure as a group for replication purposes. A real-life Cassandra deployment often has several datacenters (replication groups) that are interconnected and replicate data continuously.
CLUSTER: A superset of datacenters. It is a set of one or more datacenters and can span multiple locations.
COMMIT LOG: A crash-recovery mechanism for incoming data. The commit log is an append-only file on disk (not a memory location) to which Cassandra writes incoming data for durability.
MEMTABLES: After data is written to the commit log, it is written to memtables, which are data structures that live only in memory.
SSTABLE: Once a memtable reaches a threshold size, it is flushed to disk as an SSTable, an immutable data file.
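The commit log, memtable and SSTable steps can be sketched as a toy model. This is only an illustration of the ordering, not how Cassandra is actually implemented: the class name `TinyNode`, the file names, the JSON format and the row-count threshold `memtable_limit` are all made up for this example.

```python
import json
import os
import tempfile


class TinyNode:
    """Toy model of one node's write path (hypothetical, for illustration)."""

    def __init__(self, data_dir, memtable_limit=3):
        self.data_dir = data_dir
        self.memtable_limit = memtable_limit  # assumed flush threshold (row count)
        self.commit_log_path = os.path.join(data_dir, "commitlog.txt")
        self.memtable = {}   # lives only in memory
        self.sstables = []   # paths of data files already flushed to disk

    def write(self, key, value):
        # 1. Append the write to the on-disk commit log for durability.
        with open(self.commit_log_path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
        # 2. Apply the write to the in-memory memtable.
        self.memtable[key] = value
        # 3. Flush to an SSTable once the memtable reaches its threshold.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        path = os.path.join(self.data_dir, f"sstable-{len(self.sstables)}.json")
        with open(path, "w") as f:
            # Rows are written sorted by key.
            json.dump(dict(sorted(self.memtable.items())), f)
        self.sstables.append(path)
        self.memtable = {}
        # The flushed rows are safe on disk, so this toy simply empties the log.
        open(self.commit_log_path, "w").close()


def demo():
    with tempfile.TemporaryDirectory() as data_dir:
        node = TinyNode(data_dir, memtable_limit=3)
        for i in range(4):
            node.write(f"user{i}", {"n": i})
        # Number of SSTables on disk, and keys still only in the memtable.
        return len(node.sstables), sorted(node.memtable)
```

Calling `demo()` writes four rows: the third write triggers a flush, so one SSTable exists and only the fourth row is still in the memtable.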
CQL TABLE: We query CQL tables to retrieve data from a Cassandra database. A table is a collection of ordered columns fetched by row.
KEYSPACE: A logical entity that groups a set of tables, usually belonging to a single application. It is logically similar to the concept of a schema in an RDBMS.
PEERS: All nodes in a Cassandra cluster are known as peers. A client’s request can go to any of the nodes (peers) in the cluster.
COORDINATOR: The peer that receives the client request takes the role of coordinator for that request. It acts as a kind of proxy between the client and the nodes (peers) that hold the data.
GOSSIP: As the term suggests, this is the communication mechanism between the nodes. Nodes periodically talk to each other and exchange information about the location and state of the other nodes.
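As a rough sketch of the idea behind gossip, two nodes can exchange their views of the cluster and keep, for each node, whichever entry is newer. The version numbers, status strings and the function name `merge_views` are assumptions made for this example, and real gossip carries considerably more state.

```python
def merge_views(local, remote):
    """Merge two cluster views, keeping the higher-versioned entry per node.

    Each view maps node name -> (version, status). Hypothetical format.
    """
    merged = dict(local)
    for node, (version, status) in remote.items():
        if node not in merged or version > merged[node][0]:
            merged[node] = (version, status)
    return merged


# Node A has stale news about B and has never heard of C.
view_a = {"A": (5, "UP"), "B": (2, "UP")}
view_b = {"B": (4, "DOWN"), "C": (1, "UP")}

# After one exchange, both nodes hold the same merged view.
view_a = view_b = merge_views(view_a, view_b)
```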
COMPACTION: Cassandra periodically consolidates the data files (SSTables) behind its tables, and this is known as compaction. It finds the most up-to-date version of each row, writes it to a new SSTable, and marks the old/obsolete versions for deletion. Once all pending reads have completed, the obsolete data is deleted.
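A minimal sketch of the merge step in compaction, assuming each SSTable is modelled as a dictionary of `key: (timestamp, value)` and that a value of `None` stands for a row marked as deleted. The function name `compact` and this data layout are assumptions for the example. Real compaction works on files and waits for a grace period before dropping deletion markers.

```python
def compact(sstables):
    """Merge SSTables, keeping only the newest version of each row."""
    merged = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            # Keep the version with the highest write timestamp.
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    # Drop rows whose newest version is a deletion marker (simplification).
    return {k: v for k, v in sorted(merged.items()) if v[1] is not None}


older = {"a": (1, "v1"), "b": (1, "v1")}
newer = {"a": (2, "v2"), "b": (3, None), "c": (2, "v1")}

result = compact([older, newer])
```

Here `result` holds the newer version of `a`, drops `b` because its latest version is a deletion, and keeps `c`.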
PARTITIONER: Determines which node receives the first copy of a piece of data and how the other copies are distributed across the rest of the nodes. It does this by hashing the row's partition key into a token.
SNITCH: Describes the cluster topology, that is, which datacenter and rack each node belongs to, so that Cassandra can place the replicas of data on the right nodes.
NODE REPAIR: A regular maintenance task that corrects inconsistencies among the replicas.
At this stage many of these terms may not make complete sense, but in our next posts we will go into the details of these structures/concepts.
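To make the partitioner idea concrete, here is a simplified sketch of hashing a key onto a ring of nodes. It assumes a truncated MD5 hash as a stand-in for Cassandra's real hash function, evenly spaced node tokens, and a "walk clockwise" rule for picking the other copies. The names `token`, `Ring` and `replicas` are invented for this example.

```python
import bisect
import hashlib


def token(partition_key):
    # Stand-in hash (assumption): first 64 bits of MD5, not Cassandra's default.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    def __init__(self, node_tokens):
        # node_tokens maps node name -> token; a node owns the range ending at its token.
        self.entries = sorted((t, name) for name, t in node_tokens.items())
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key, replication_factor=2):
        # First copy: the first node whose token is >= the key's token (wrapping around).
        start = bisect.bisect_left(self.tokens, token(partition_key)) % len(self.entries)
        # Other copies: the next nodes clockwise on the ring.
        count = min(replication_factor, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]


step = 2**64 // 4
ring = Ring({f"node{i}": i * step for i in range(4)})
owners = ring.replicas("user42", replication_factor=2)
```

The same key always maps to the same nodes, which is what lets any coordinator work out where a row lives without asking a central directory.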