Apache Cassandra as a NoSQL Database
Organizations are in an era where the variety and volume of data they have access to are increasing every day. To make use of that data, a growing number of ways to store it have emerged. Many of these come in the form of NoSQL databases, often interpreted as “not only SQL”, meaning that the databases can sometimes act as relational databases but are not constrained by relationships between stored tables. As a data scientist, it is becoming more likely that you will have to interact with NoSQL databases, so this blog will introduce you to one, Apache Cassandra, and explore why more organizations are using it to store their data.
NoSQL databases come in several varieties, including key-value, document-based, column-based, and graph-based databases; each has its own pros and cons and is suited to different applications based on the needs of an organization. Cassandra is a column-based database that is optimized for fast lookups of existing data in very large datasets. It is worth noting that Cassandra is not intended to support complex ad-hoc queries.
Among column-based databases such as Bigtable, Accumulo, and Vertica, Cassandra is gaining popularity. Companies such as Instagram, Netflix, and Reddit all rely on Cassandra to store the mass of data they are collecting. (If you’re like me when I first started learning data science and don’t know how media files can be stored, look into the BLOB data type.) Simply said, Cassandra organizes each table around a key that determines where a row and its columns are stored, which allows for fast lookups in very large datasets; on the user side there often appears to be little difference between column-based and relational databases.
There are several reasons why Cassandra is gaining popularity, but two big ones are that it does not require full ACID compliance and that it is fault tolerant. ACID (atomicity, consistency, isolation, durability) is a set of properties that relational databases enforce to keep data consistent as database transactions occur; by not requiring full ACID compliance, Cassandra enables high availability of data and fast (cheap) writes. Fault tolerance is the capability of a database to remain in operation even if some of its servers fail. These properties make Cassandra ideal for large websites that require fast data lookups and real-time streaming of data into a database.
Rather than requiring full ACID compliance, Cassandra allows the data architect to set the level of consistency and the time allowed to achieve consistency throughout the database. With regard to atomicity, Cassandra will allow a write on one node to succeed even if the write fails on the other nodes. If Cassandra required atomicity, the write would succeed only if it succeeded on all nodes; otherwise no write would occur. Nonetheless, Cassandra records which nodes the write failed on to ensure that they are eventually updated.
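To make the idea of a tunable consistency level a bit more concrete, here is a minimal sketch of the arithmetic behind it. It assumes the common rule of thumb that a read is guaranteed to see the latest write when the number of replicas that acknowledge the read plus the number that acknowledge the write is larger than the number of replicas. The function names are my own and are not part of Cassandra.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """True when every set of read replicas must overlap every set of write replicas."""
    return read_acks + write_acks > replication_factor


rf = 3  # assumed replication factor for this illustration
print(quorum(rf))                        # 2
print(reads_see_latest_write(rf, 1, 1))  # False: fast, but a read may miss a recent write
print(reads_see_latest_write(rf, 2, 2))  # True: majority writes and majority reads overlap
```

Requiring fewer acknowledgements makes reads and writes cheaper and more available, while requiring more makes them more consistent; this is the trade-off the data architect is choosing.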
Cassandra does not sacrifice durability, though. If a server crashes or the software fails, any writes that were in progress are recorded and replayed once the server is back up, so that acknowledged writes are not lost. If you’re interested in learning more about Cassandra’s internals, DataStax publishes additional documentation.
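The record-then-replay idea can be sketched in a few lines. This is a toy model, not Cassandra’s actual implementation: the class and attribute names are hypothetical, a Python list stands in for a log on disk, and a dictionary stands in for data held in memory.

```python
from typing import Dict, List, Tuple


class TinyNode:
    """Toy node that logs every write before applying it in memory."""

    def __init__(self) -> None:
        self.commit_log: List[Tuple[str, str]] = []  # stands in for a log on disk
        self.memtable: Dict[str, str] = {}           # stands in for in-memory data

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))  # record the write first
        self.memtable[key] = value            # then apply it

    def crash(self) -> None:
        self.memtable = {}  # memory is lost, the log survives

    def recover(self) -> None:
        for key, value in self.commit_log:  # replay the log in order
            self.memtable[key] = value


node = TinyNode()
node.write("user:1", "Ada")
node.write("user:2", "Grace")
node.crash()
node.recover()
print(node.memtable)  # {'user:1': 'Ada', 'user:2': 'Grace'}
```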
Cassandra is also fault tolerant, since it allows data to be replicated and distributed across nodes; if one node fails, the remaining nodes can pick up the workload. This is critical for organizations that can’t afford to lose access to their database for even a few seconds. How many times the data is replicated is determined by the data architect and the business needs. Data is commonly replicated three to five times, meaning that three to five servers can all retrieve or add to the same data. Since replicating the data takes up storage space, the data architect must weigh the cost of data replication against the cost of losing access to the data or losing the data entirely.
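Here is a rough sketch of how replicas can be spread across nodes. It is loosely modeled on the idea of placing nodes on a ring and copying each row to the next few nodes on it; the hash function, node names, and function names are assumptions for illustration, and a real cluster is considerably more sophisticated.

```python
import hashlib
from bisect import bisect_right
from typing import List


def token(value: str) -> int:
    """Map a string to a position on the ring (md5 is used only as a stable hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas_for(key: str, nodes: List[str], replication_factor: int) -> List[str]:
    """Walk clockwise from the key's position and pick the next distinct nodes."""
    ring = sorted(nodes, key=token)
    tokens = [token(n) for n in ring]
    start = bisect_right(tokens, token(key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
owners = replicas_for("user:42", nodes, replication_factor=3)
failed = owners[0]                                  # pretend the first replica goes down
survivors = [n for n in owners if n != failed]      # these nodes can still serve the row
print(owners)
print(survivors)
```

With a replication factor of three, losing one node still leaves two nodes holding the row, which is the storage-versus-availability trade-off described above.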
Another feature that makes Cassandra appealing to organizations, although many other databases offer it too, is easy horizontal scaling. This refers to adding servers when more capacity is needed to add or retrieve data; with cloud computing widely available, servers can be added with ease. Being able to scale horizontally quickly and efficiently is desirable since the necessary capacity of a database often fluctuates, which can leave organizations paying for computing capacity that goes unused or, sometimes worse, unable to meet demand for their data. When new servers, often referred to as nodes, are added, Cassandra handles all of the heavy lifting needed to distribute the data across them.
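One reason adding a node can be cheap is that only part of the data has to move. The sketch below illustrates this with a simple hash ring in which each row belongs to the next node clockwise from its position. This is a simplified model under my own assumptions (one position per node, md5 as the hash, hypothetical node and row names), not a description of Cassandra’s exact algorithm.

```python
import hashlib
from bisect import bisect_right
from typing import Dict, List


def token(value: str) -> int:
    """Map a string to a position on the ring (md5 is used only as a stable hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def owner(key: str, nodes: List[str]) -> str:
    """Return the first node clockwise from the key's position."""
    ring = sorted(nodes, key=token)
    tokens = [token(n) for n in ring]
    return ring[bisect_right(tokens, token(key)) % len(ring)]


def placement(keys: List[str], nodes: List[str]) -> Dict[str, str]:
    """Map every key to the node that owns it."""
    return {k: owner(k, nodes) for k in keys}


keys = [f"row-{i}" for i in range(1000)]
before = placement(keys, ["node-a", "node-b", "node-c"])
after = placement(keys, ["node-a", "node-b", "node-c", "node-d"])
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "rows moved to the new node")
```

In this model every row that moves lands on the new node, and rows owned by the other nodes stay where they are, so the cluster does not have to reshuffle everything when it grows.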
With all of these new database types come new ways of querying data. Luckily, Cassandra implements CQL (the Cassandra Query Language), which is very similar to SQL. Much of the syntax is the same as in SQL, but there are several big differences worth noting.
The first is that CQL doesn’t support joins. One reason is that there is no requirement for relationships between stored tables. Another reason is that joins are computationally expensive, so querying data can be sped up dramatically by avoiding them. To replace the need for joins, Cassandra introduced the concept known as “column families.” In short, a column family is a table that contains all of the variables commonly queried together that would normally have to be joined in a relational database. Sure, the same data may be duplicated across many column families, but since writes in Cassandra are cheap (partially because it does not require full ACID compliance), the benefit of faster reads makes up for it.
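As a rough illustration of trading extra writes for join-free reads, the sketch below keeps two query-specific “tables” as plain Python dictionaries. The table names, columns, and helper function are hypothetical and only mimic the idea; they are not Cassandra code.

```python
from collections import defaultdict
from typing import Dict, List

# Two hypothetical tables holding the same facts, each shaped for one query.
orders_by_customer: Dict[str, List[dict]] = defaultdict(list)
orders_by_product: Dict[str, List[dict]] = defaultdict(list)


def record_order(customer: str, product: str, quantity: int) -> None:
    """Write the same order twice so that each query needs only one lookup."""
    row = {"customer": customer, "product": product, "quantity": quantity}
    orders_by_customer[customer].append(dict(row))
    orders_by_product[product].append(dict(row))


record_order("alice", "widget", 2)
record_order("bob", "widget", 1)
record_order("alice", "gadget", 5)

# Each read is a single lookup; no join is needed.
print(orders_by_customer["alice"])
print(orders_by_product["widget"])
```

Every order costs two writes instead of one, but both questions ("what did this customer buy?" and "who bought this product?") are answered without combining tables at read time.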
If you’re familiar with data architecture in a relational database, you might be thinking “What about normalization?” If you are not familiar with the concept, normalization is simply a protocol used in relational databases to reduce data redundancy and ensure data consistency. Redundant data is a problem in relational databases because if data is updated, it has to be updated everywhere it is stored. Writes are expensive (slow and computationally demanding) in relational databases, so it is best to avoid data redundancy. Since writes are cheap with Cassandra, you really can forget about normalization altogether.
The second difference is that not all columns in a table stored on Cassandra (that is, in a column family) can be used for subqueries, meaning the conditions that filter which rows are returned. When architecting a Cassandra database, one or more columns from the column family are chosen as “clustering columns.” These are the only columns that can be used for subqueries, and the data must be queried in the order in which the clustering columns were declared.
As an example, suppose a table contains columns for storing state, county, city, median income, and population. If the data architect chose state, county, and city as the clustering columns, then in order to create a subquery by city, you must first declare a state and county. A subquery based on median income is not possible at all. So a query that filters only on city, and a query that filters on median income, would both return errors.
The first query would return an error because state and county were not declared first, while the second query would return an error because median income is not a clustering column. A correct query would first declare the state, then the county, and finally the city.
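The rule can be sketched as a small check: a filter is accepted only if the columns it uses are the first one, two, or three clustering columns in their declared order. This is my own simplified model of the rule described here, with hypothetical table and column names shown as CQL in the comments; real Cassandra also distinguishes partition key columns and has options such as ALLOW FILTERING that this sketch ignores.

```python
from typing import Sequence

# Clustering columns in the order the data architect declared them (assumed).
CLUSTERING_COLUMNS = ("state", "county", "city")


def can_filter_on(filter_columns: Sequence[str],
                  clustering_columns: Sequence[str] = CLUSTERING_COLUMNS) -> bool:
    """True when the filtered columns are exactly the leading clustering columns."""
    wanted = set(filter_columns)
    return wanted == set(clustering_columns[:len(wanted)])


# SELECT * FROM places WHERE city = 'Springfield';
print(can_filter_on(["city"]))                     # False: state and county are missing

# SELECT * FROM places WHERE median_income > 50000;
print(can_filter_on(["median_income"]))            # False: not a clustering column

# SELECT * FROM places WHERE state = 'IL' AND county = 'Sangamon' AND city = 'Springfield';
print(can_filter_on(["state", "county", "city"]))  # True: follows the declared order
```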
So with Cassandra the order of the columns used for subqueries matters! This may appear to be a needless constraint, but it ultimately allows for faster queries from large datasets. If the clustering order is chosen with the data needs in mind, then this should likely not present any obstacles when querying the data. It is also worth noting that multiple tables with the same data can be created with different clustering columns in order to best serve different users.
Other differences include how data is deleted and several performance issues when filtering data over a long range of values. This certainly is not an exhaustive list of the differences, but I think it covers the main ones that you should initially be aware of when querying data from a Cassandra database. If you’re interested in learning more, check out the CQL documentation at DataStax.
I hope this introduction has made you familiar with why companies are choosing Cassandra and how to query a Cassandra database. Feel free to comment below with any questions or tips for working with Cassandra!