In big data analytics, Hadoop plays a vital role in solving typical business problems, and each business domain gets options suited to its own needs. In a Hadoop ecosystem, each component plays a distinct role in data processing, data validation, and data storage. MongoDB and Cassandra are NoSQL databases that can store huge amounts of data in many different formats.
Traditional relational databases are already widely used in production. However, some business problems are not solved well by relational databases because of their row-based storage structure. They are a poor fit for storing and retrieving unstructured and semi-structured data. Running queries against huge data sets held in Hadoop storage is also a challenging task. NoSQL storage technologies offer a good solution for faster querying on huge data sets.
MongoDB stores data in collections and documents. These are the equivalents of RDBMS tables and rows, and the equivalent of a column is a field. A collection is a set of documents and exists within a single database. To understand this concept better, we can compare RDBMS and MongoDB storage with the help of a simple example.
In an RDBMS, data is stored in tables made up of rows and columns. Similarly, a collection holds a group of documents, and each document can have any number of fields. A document is a set of key-value (KV) pairs.
Sample document storage structure in MongoDB
{
   title: 'MongoDB Sample',
   description: 'MongoDB is no sql database',
   tags: ['mongodb', 'database', 'NoSQL', 'Document storage'],
   comments: [
      {
         message: 'My first comment',
         dateCreated: new Date(2015,1,20,2,15)
      },
      {
         message: 'My second comments',
         dateCreated: new Date(2015,1,25,7,45),
         like: 10
      }
   ]
}
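To make the row versus document comparison concrete, here is a minimal Python sketch that uses plain dictionaries rather than a real database driver. The `comments` field name, the second document, and the `field_names` helper are assumptions made for illustration only.

```python
from datetime import datetime

# A relational row: a flat set of column values fixed by the table schema.
rdbms_row = {"id": 1, "title": "MongoDB Sample",
             "description": "MongoDB is no sql database"}

# A MongoDB-style document: key-value pairs whose values may be arrays or
# nested sub-documents. "comments" is an assumed field name.
# JavaScript's Date uses zero-based months, so month 1 there is February here.
document = {
    "title": "MongoDB Sample",
    "description": "MongoDB is no sql database",
    "tags": ["mongodb", "database", "NoSQL", "Document storage"],
    "comments": [
        {"message": "My first comment",
         "dateCreated": datetime(2015, 2, 20, 2, 15)},
        {"message": "My second comments",
         "dateCreated": datetime(2015, 2, 25, 7, 45), "like": 10},
    ],
}

# A collection is a group of documents; they need not share the same fields.
collection = [document, {"title": "Another post", "author": "guest"}]


def field_names(doc):
    """Return the top-level field names of a document (hypothetical helper)."""
    return sorted(doc.keys())


for doc in collection:
    print(field_names(doc))
```

The two documents in `collection` report different field names, which a single relational table could not hold without a schema change.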
Cassandra is an open source, distributed, column-oriented database system from Apache. It is a good fit when huge data sets need to be stored across commodity servers. It also belongs to the NoSQL family. It provides features such as high availability, scalability, and simplicity of design.
By adding nodes to the cluster, we can increase the throughput of applications that use Cassandra. It implements a Dynamo-style replication model with no single point of failure. It can handle unstructured data and has a flexible schema. Keyspaces and column families are the building blocks of data modeling in Cassandra.
A keyspace is a collection of column families, and a column family is an ordered collection of rows. Each row contains columns and their values. This is the typical storage hierarchy in Cassandra. A Cassandra cluster can have multiple keyspaces, and each keyspace can have multiple column families.
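The hierarchy described above can be pictured as nested mappings. The sketch below is only an in-memory illustration in Python, not Cassandra's API; the keyspace, column family, row keys, and the `get_column` helper are all hypothetical names.

```python
# keyspace -> column family -> row key -> {column name: value}
cluster = {
    "blog_keyspace": {                  # hypothetical keyspace
        "posts": {                      # hypothetical column family
            "post-001": {"title": "MongoDB Sample", "likes": 10},
            "post-002": {"title": "Cassandra Sample"},  # rows may differ in columns
        },
        "authors": {                    # a second column family in the same keyspace
            "author-001": {"name": "guest"},
        },
    },
}


def get_column(cluster, keyspace, column_family, row_key, column):
    """Walk the hierarchy and return one column value, or None if it is absent."""
    row = cluster[keyspace][column_family][row_key]
    return row.get(column)


print(get_column(cluster, "blog_keyspace", "posts", "post-001", "likes"))
```

Because each row carries its own columns, `post-002` can omit `likes` without any change to the column family.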
Architecture diagram: MongoDB vs. Cassandra
The figure shows the typical high-level architecture of each database, with its components, in terms of storage and replication. Both architectures provide replication and sharding.
Some similarities between MongoDB and Cassandra:
|Feature|MongoDB|Cassandra|
|---|---|---|
|Open source distribution|Yes|Yes|
|Cross-platform server operating system support|Yes|Yes|
|Supports programming languages such as C# and C++|Yes|Yes|
|Partitioning method|Sharding|Sharding|
|Consistency methods|Eventual consistency and immediate consistency|Eventual consistency and immediate consistency|
|APIs and access methods|Proprietary protocol (using JSON)|Proprietary protocol|
|Storage model|Document-oriented storage|Column-oriented storage|
Some differences between MongoDB and Cassandra:
|Feature|MongoDB|Cassandra|
|---|---|---|
|Access rights|The administrator can define users with either full access or read-only access.|The administrator defines access rights for individual users per object.|
|Server-side scripts|Supports server-side scripts.|Does not support server-side scripts.|
|Additional programming support|Supports a large number of programming languages, including MATLAB, R, and PowerShell.|Supports a smaller set of programming languages.|
|Usage|Working with occasionally changing or consistent data.|Read- and write-intensive applications.|
|Replication settings|Allows very granular, ad hoc control down to the query level through driver options that can be set in code at run time.|Configured at the node level with configuration files.|
|Suitable use cases|MongoDB is a good option when we want to run dynamic queries on the data and prefer to define indexes.|Cassandra is preferred when there are more writes than reads, especially for logging events. The financial and banking industries are common use cases.|
The data models of the two databases look similar, but there are differences. Both are NoSQL stores, but they differ in how data is laid out and loaded. If we look at a sample data set stored in MongoDB and in Cassandra, the exact method of storage becomes clear.
In MongoDB, data is replicated within replica sets. As shown in the figure above, each replica set consists of one primary and multiple secondaries. If the primary fails during a job, the remaining members of the replica set elect a new primary. The replica sets together are assembled into a cluster.
Cassandra has no concept of master, slave, or replica sets; all nodes are peers. It uses a hash-ring-based algorithm to decide which node owns which data. A hash function is applied to the key of each row (the partition key, which is the first part of the primary key), and the result determines the node responsible for that key. A configurable replication factor sets how many nodes keep a copy of each row.
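The failover behaviour can be illustrated with a toy model. This Python sketch is not MongoDB's election protocol, which involves votes and terms exchanged between members; it only assumes two simplified rules, that a majority of members must be healthy and that the healthy member with the highest priority wins. The `Member` class and `elect_primary` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    priority: int = 1
    healthy: bool = True
    role: str = "secondary"


def elect_primary(members):
    """Toy failover: keep a healthy primary, otherwise pick a new one."""
    healthy = [m for m in members if m.healthy]
    # Without a majority of healthy members, no primary is chosen.
    if 2 * len(healthy) <= len(members):
        for m in members:
            m.role = "secondary"
        return None
    current = next((m for m in healthy if m.role == "primary"), None)
    if current is not None:
        return current
    # Simplified rule (an assumption): highest priority among healthy members.
    winner = max(healthy, key=lambda m: m.priority)
    for m in members:
        m.role = "primary" if m is winner else "secondary"
    return winner


members = [Member("node-a", role="primary"),
           Member("node-b", priority=2),
           Member("node-c")]
members[0].healthy = False            # the current primary fails
new_primary = elect_primary(members)
print(new_primary.name)               # node-b
```

If a second member also fails, only one of three members remains healthy, so the sketch returns `None` and the set is left without a primary.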
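The hash ring idea can be sketched in a few lines of Python. This is a simplified illustration, not Cassandra's implementation: it uses MD5 as a stand-in hash, gives each node a single position on the ring, and places replicas on the next nodes clockwise. The `HashRing` class and the node names are hypothetical.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a ring position (MD5 is only an illustrative hash here)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, replication_factor=3):
        # Cannot keep more copies than there are nodes.
        self.replication_factor = min(replication_factor, len(nodes))
        # Each node sits on the ring at the hash of its name.
        self.ring = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, row_key: str):
        """Return the nodes that hold copies of the row with this key."""
        # The first node at or after the key's token owns the row; wrap around.
        start = bisect.bisect_left(self.tokens, token(row_key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(self.replication_factor)]


ring = HashRing(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
print(ring.replicas("user:42"))
```

Every node can run the same calculation, so any of them can locate the replicas for a key without consulting a master.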
MongoDB offers many options for querying data sets and fetching results efficiently. In some cases, however, certain queries can cause performance problems. In a typical production environment, we have to avoid certain types of queries. For example, it is better to avoid a query that results in every shard being queried.
Cassandra users and developers can model their data to minimize the total number of queries through careful planning and denormalization.
Both databases offer scaling options for better performance. But scaling a system requires knowledge of the business requirements and a good understanding of scalable system design.
Backup operations are more complex in MongoDB than in Cassandra. In MongoDB, backups are typically taken using disk-based snapshots.
Depending on the business requirements and the type of application, we choose the appropriate NoSQL database. Each provides features that support scalability and replication. Both databases work efficiently with the Hadoop ecosystem.
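The cost of a query that touches every shard can be seen in a small routing sketch. The Python below is a simplified model, not MongoDB's query router; it assumes range-based sharding on a hypothetical `customer_id` shard key, and the shard names and ranges are made up.

```python
# Hypothetical shards, each owning a range of shard key values.
SHARDS = {
    "shard-0": range(0, 1000),
    "shard-1": range(1000, 2000),
    "shard-2": range(2000, 3000),
}


def shards_for_query(query, shard_key="customer_id"):
    """Return the shards that must be contacted for an equality query."""
    if shard_key in query:
        # Targeted query: only the shard owning this key value is contacted.
        value = query[shard_key]
        return [name for name, key_range in SHARDS.items() if value in key_range]
    # No shard key in the filter: every shard has to be queried.
    return list(SHARDS)


print(shards_for_query({"customer_id": 1500}))   # ['shard-1']
print(shards_for_query({"status": "open"}))      # ['shard-0', 'shard-1', 'shard-2']
```

Including the shard key in the filter keeps the work on one shard, while a filter without it fans out to all of them.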
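One common way to apply denormalization is to store the same data once per query pattern, so that each read becomes a single lookup. The Python sketch below models that idea with in-memory dictionaries; it is not Cassandra code, and the table names and the `record_event` helper are hypothetical.

```python
from collections import defaultdict

# One "table" per query we plan to run, each keyed by what that query filters on.
events_by_user = defaultdict(list)
events_by_day = defaultdict(list)


def record_event(user, day, action):
    """Write the same event to every query-specific table."""
    event = {"user": user, "day": day, "action": action}
    events_by_user[user].append(event)
    events_by_day[day].append(event)


record_event("alice", "2015-02-20", "login")
record_event("bob", "2015-02-20", "login")
record_event("alice", "2015-02-21", "logout")

# Each read is a single lookup by key; no join or scan is needed.
print(len(events_by_user["alice"]))      # 2
print(len(events_by_day["2015-02-20"]))  # 2
```

The trade-off is extra work and storage on every write, which suits workloads where writes are cheap and reads must be fast.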