HBase Architecture & CAP Theorem

HBase is a column-oriented, distributed database that runs in the Hadoop environment. It can store massive amounts of data, from terabytes to petabytes, and scales out on top of the Hadoop ecosystem. Compared with traditional relational databases, it possesses some special features, the most important being that it is built for low-latency operations.

HBase is used extensively for random read and write operations. Hadoop, by contrast, performs batch processing, running jobs in parallel across the cluster. For example, if a client wants to perform even a simple job on Hadoop, the job needs to search the entire data set to get the desired result. Processing a large data set often produces another large data set, which is then processed sequentially as well. As a result, such operations take a long time to execute.

In these scenarios we need a solution that can reach any point in the data in a single unit of time. This is called random access. Databases such as Cassandra, MongoDB and CouchDB also store large data sets and allow the data to be accessed randomly.
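The difference between a batch-style scan and random access can be pictured with a toy Python sketch. The record layout and the function names are illustrative assumptions, not part of HBase or Hadoop.

```python
# Toy comparison of a batch-style full scan and a keyed (random access) lookup.
# The records and all names below are made up for illustration.

records = [(f"row{i:05d}", {"value": i * 2}) for i in range(10_000)]

def full_scan(rows, wanted_key):
    """Batch style: examine records one by one until the wanted key turns up."""
    examined = 0
    for key, data in rows:
        examined += 1
        if key == wanted_key:
            return data, examined
    return None, examined

# Random access: an index keyed by row key answers the same question
# without touching unrelated records.
index = dict(records)

def random_access(idx, wanted_key):
    return idx.get(wanted_key)

scan_result, rows_examined = full_scan(records, "row09999")
lookup_result = random_access(index, "row09999")
print(rows_examined)                 # how many records the scan had to touch
print(scan_result == lookup_result)  # both approaches find the same record
```

The scan touches every record before it reaches the last key, while the keyed lookup goes straight to it.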

CAP Theorem

For any distributed system, the CAP theorem describes the trade-off between Consistency, Availability and Partition tolerance. Consistency means all the nodes see the same data at the same time. Availability means that every request receives a response indicating whether it succeeded or failed, much like an acknowledgement in a network handshake.

Partition tolerance means the system continues to operate despite arbitrary message loss or the failure of part of the system. A partition-tolerant system keeps working even when the network is physically partitioned.

According to the CAP theorem, a distributed system can satisfy any two of these properties at the same time, but not all three. Traditional systems like RDBMS provide consistency and availability. NoSQL databases like MongoDB, HBase and Bigtable provide consistency and partition tolerance (HBase and Bigtable are column oriented, while MongoDB is document oriented). Let us have a look at some of the differences between RDBMS and HBase.
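To make the trade-off concrete, here is a minimal Python sketch of a two-replica store during a network partition. The class names and the "CP"/"AP" mode labels are assumptions made for this illustration; it does not model how HBase or any real database is implemented.

```python
class Replica:
    def __init__(self):
        self.data = {}

class TinyCluster:
    """Two replicas; mode is 'CP' or 'AP' (labels used only in this sketch)."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, key, value):
        if not self.partitioned:
            # Healthy network: both replicas receive the write.
            self.a.data[key] = value
            self.b.data[key] = value
            return True
        if self.mode == "CP":
            # Refuse the write: replicas stay identical, but the request fails.
            return False
        # AP: accept on the reachable replica only; the replicas now disagree.
        self.a.data[key] = value
        return True

    def consistent(self):
        return self.a.data == self.b.data

cp = TinyCluster("CP")
cp.write("k", 1)
cp.partitioned = True
cp_accepted = cp.write("k", 2)   # rejected, replicas remain in agreement

ap = TinyCluster("AP")
ap.write("k", 1)
ap.partitioned = True
ap_accepted = ap.write("k", 2)   # accepted, replicas diverge

print(cp_accepted, cp.consistent())
print(ap_accepted, ap.consistent())
```

In the CP case the cluster gives up availability to stay consistent; in the AP case it stays available and gives up consistency until the partition heals.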

HBase | RDBMS
Schema-less. | Fixed schema.
Column-oriented database. | Row-oriented data store.
Designed to store de-normalized data. | Designed to store normalized data.
Wide, sparsely populated tables. | Thin tables.
Supports automatic partitioning. | Has no built-in support for automatic partitioning.
Well suited for OLAP systems. | Well suited for OLTP systems.
Reads only the relevant data. | Retrieves one row at a time, so it may read unnecessary data when only some of the columns in a row are required.
Can store and process structured and semi-structured data. | Can store and process structured data.
Enables aggregation over many rows and columns. | Aggregation is an expensive operation.
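The wide, sparse table idea from the comparison above can be pictured as nested maps: row key, then column family, then column qualifier. The following Python sketch is only a mental model; the table contents and the helper names `put` and `get` are invented for illustration and are not the HBase client API.

```python
from collections import defaultdict

# row key -> column family -> column qualifier -> value
table = defaultdict(lambda: defaultdict(dict))

def put(row, family, qualifier, value):
    table[row][family][qualifier] = value

def get(row, family):
    """Return only the requested column family of one row (empty if absent)."""
    return dict(table.get(row, {}).get(family, {}))

# Rows need not share the same columns, so the table is sparse.
put("user1", "info", "name", "Ann")
put("user1", "info", "city", "Pune")
put("user1", "stats", "logins", 42)
put("user2", "info", "name", "Raj")   # no city, no stats family

print(get("user1", "info"))    # only the 'info' family is read
print(get("user2", "stats"))   # nothing stored, nothing read
```

Reading one column family leaves the other families of the row untouched, which is the "reads only the relevant data" point in the comparison.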

HBase vs HDFS

HBase runs on top of HDFS and Hadoop. The key differences between HDFS and HBase lie in data operations and processing. HDFS is suited to high-latency operations and batch processing, whereas HBase is suited to low-latency operations. In HDFS, data is primarily accessed through MapReduce (MR) jobs, whereas HBase provides access to single rows out of billions of records.

HDFS has no concept of random read and write operations, whereas in HBase data is accessed through shell commands, the Java client API, REST, Avro or Thrift.

Some typical industry applications use HBase alongside Hadoop. For applications such as stock exchange data and online banking data operations and processing, HBase is a well-suited solution.

HBase Architecture

The HBase architecture consists mainly of the HMaster, HRegionServers, HRegions and ZooKeeper. ZooKeeper is a centralized coordination service that maintains configuration information and provides distributed synchronization. A client that wants to communicate with region servers first has to approach ZooKeeper, which it uses to locate them.
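Locating the right region server for a row can be sketched as a lookup over sorted region start keys. The host names, the `META` list and the function name below are hypothetical; the sketch assumes each region covers a contiguous, sorted range of row keys.

```python
import bisect

# Hypothetical lookup data: (region start key, region server hosting it).
# "" is the lowest possible start key, so every row key falls in some region.
META = [
    ("", "rs1.example.com"),
    ("g", "rs2.example.com"),
    ("p", "rs3.example.com"),
]
START_KEYS = [start for start, _ in META]

def locate_region_server(row_key):
    """Pick the region whose start key is the greatest one <= row_key."""
    i = bisect.bisect_right(START_KEYS, row_key) - 1
    return META[i][1]

print(locate_region_server("apple"))   # falls in the region starting at ""
print(locate_region_server("grape"))   # falls in the region starting at "g"
print(locate_region_server("zebra"))   # falls in the region starting at "p"
```

Once the client knows which server hosts the region, it can talk to that server directly for reads and writes.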

HMaster

The HMaster plays a vital role in the performance of the cluster and in maintaining its nodes. It performs administrative functions and distributes services to the different region servers. The HMaster also assigns regions to region servers.

The HMaster controls load balancing and failover to spread the load over the nodes in the cluster. When a client wants to change a schema or perform any metadata operation, the HMaster takes responsibility for it.
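Region assignment and failover can be illustrated with a small Python sketch. It assumes a simple round-robin policy, which is a stand-in chosen for clarity and not a description of the balancer HBase actually uses; the region and server names are made up.

```python
def assign_regions(regions, servers):
    """Round-robin: region i goes to server i mod len(servers)."""
    assignment = {s: [] for s in servers}
    for i, region in enumerate(regions):
        assignment[servers[i % len(servers)]].append(region)
    return assignment

def fail_over(assignment, dead_server):
    """Hand the dead server's regions to the survivors, least loaded first."""
    orphans = assignment.pop(dead_server)
    survivors = sorted(assignment, key=lambda s: len(assignment[s]))
    for i, region in enumerate(orphans):
        assignment[survivors[i % len(survivors)]].append(region)
    return assignment

regions = ["r1", "r2", "r3", "r4", "r5", "r6"]
servers = ["rs1", "rs2", "rs3"]

assignment = assign_regions(regions, servers)
print(assignment)   # two regions per server

assignment = fail_over(assignment, "rs3")
print(assignment)   # rs3's regions are now spread over rs1 and rs2
```

After the failover every region is still served by some region server, and the load stays even across the survivors.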

HRegion Servers

Region servers perform the following functions in communication with the HMaster and ZooKeeper.

  • Hosting and managing regions.
  • Splitting regions automatically.
  • Handling read and write requests.
  • Communicating with clients directly.
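Automatic region splitting, from the list above, can be sketched as cutting a sorted key range in two once it grows too large. This toy version uses a row-count threshold and the middle row as the split point, both of which are simplifying assumptions for illustration; the function name is hypothetical.

```python
def split_region(rows, max_rows):
    """rows is a sorted list of row keys; split it in two once it is too big."""
    if len(rows) <= max_rows:
        return [rows]
    mid = len(rows) // 2
    # Each daughter region keeps a contiguous, sorted key range.
    return [rows[:mid], rows[mid:]]

small = split_region(["a", "b", "c"], max_rows=4)
large = split_region(["a", "b", "c", "d", "e", "f"], max_rows=4)

print(small)   # under the threshold, so the region is left alone
print(large)   # over the threshold, so two daughter regions result
```

Because both halves remain sorted and contiguous, either one can later be moved to a different region server.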

HRegions

An HRegion contains multiple stores, one for each column family. Each store consists mainly of two components, the MemStore and HFiles. The MemStore holds in-memory modifications to the store, and HFiles hold the data that has been written to disk.
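The interplay of MemStore and HFiles can be sketched in a few lines of Python. The `Store` class, the flush threshold and the in-memory "files" are assumptions made for this illustration; real HFiles live on HDFS and have a richer format.

```python
class Store:
    """Toy store for one column family: a MemStore plus flushed 'HFiles'."""

    def __init__(self, flush_threshold=3):
        self.memstore = {}   # recent writes, held in memory
        self.hfiles = []     # immutable sorted snapshots, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the MemStore out as a sorted, immutable file and start afresh.
        self.hfiles.append(tuple(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, key):
        # The newest data wins: check the MemStore, then files newest first.
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            for k, v in hfile:
                if k == key:
                    return v
        return None

store = Store(flush_threshold=3)
for k, v in [("a", 1), ("b", 2), ("c", 3)]:
    store.put(k, v)      # the third put triggers a flush
store.put("a", 10)       # a newer value for "a" sits in the MemStore

print(len(store.hfiles), store.memstore)
print(store.get("a"), store.get("b"), store.get("z"))
```

Reads consult the MemStore before the flushed files, so the most recent write for a key is the one returned.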

HBase Data Flow

The client communicates bi-directionally with both ZooKeeper and the HMaster. For read and write operations, it contacts the HRegion servers directly. The HMaster assigns regions to region servers and in turn checks their health status. The architecture as a whole contains multiple region servers. The HLog on each region server stores the log of all edits.
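The role of the HLog can be sketched as a write-ahead log: each edit is appended to the log before it is applied in memory, so in-memory data lost in a crash can be rebuilt by replaying the log. The class and method names below are hypothetical, and the log is a plain Python list standing in for a durable file.

```python
class RegionServerSketch:
    """Toy write path: log every edit first, then apply it in memory."""

    def __init__(self):
        self.hlog = []       # append-only log of edits (stand-in for a file)
        self.memstore = {}   # in-memory state, lost on a crash

    def put(self, row, value):
        self.hlog.append((row, value))   # record the edit first
        self.memstore[row] = value       # then apply it

    def crash(self):
        self.memstore = {}               # memory is gone, the log survives

    def recover(self):
        # Replay the log in order to rebuild the in-memory state.
        for row, value in self.hlog:
            self.memstore[row] = value

server = RegionServerSketch()
server.put("row1", "a")
server.put("row2", "b")
server.put("row1", "c")   # a later edit to the same row

server.crash()
after_crash = dict(server.memstore)
server.recover()

print(after_crash)        # empty: nothing left in memory
print(server.memstore)    # rebuilt from the log, latest edit per row
```

Replaying the edits in their original order guarantees that the latest value for each row is the one that survives.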

HBase Use Cases

When considering HBase for a use case, we have to take parameters such as the amount of data, the speed at which data flows in and scalability into consideration. If the client needs to access the details of a single row out of billions of records, HBase is a good fit. HBase also permits high compression rates when columns contain few distinct values.

Telecom industry use case: storing billions of mobile call records and providing customers with real-time access to call records and billing information. Traditional storage and database systems could not scale to these loads or provide a cost-effective solution.

In the solution to this use case, HBase is used to store billions of rows of call record details, with 20 TB of data added monthly. For handling data at this scale, HBase offers a well-suited solution for the telecom industry.
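One way to picture fast access to a single customer's call records is a composite row key. The sketch below assumes a key made of the phone number plus a reversed timestamp, so that one customer's records sit next to each other in sorted order with the newest first. The key layout, the `MAX_TS` bound and the function names are assumptions for illustration and are not taken from the use case itself.

```python
import bisect

MAX_TS = 10**13   # assumed upper bound for millisecond timestamps

def call_row_key(phone, ts_millis):
    """Phone number first, then a reversed, zero-padded timestamp."""
    return f"{phone}#{MAX_TS - ts_millis:013d}"

# Rows are kept sorted by key, as in a sorted key-value store.
keys = sorted([
    call_row_key("5550001", 1000),
    call_row_key("5550001", 3000),
    call_row_key("5550002", 2000),
    call_row_key("5550001", 2000),
])

def calls_for(phone, sorted_keys):
    """Return the contiguous slice of keys that belong to one phone number."""
    lo = bisect.bisect_left(sorted_keys, phone + "#")
    hi = bisect.bisect_left(sorted_keys, phone + "$")   # '$' sorts right after '#'
    return sorted_keys[lo:hi]

customer_calls = calls_for("5550001", keys)
print(len(customer_calls))   # records for this customer only
print(customer_calls[0] == call_row_key("5550001", 3000))   # newest call first
```

With such a layout, fetching one customer's records is a short range read rather than a scan over all call records.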

Conclusion

HBase is one of the NoSQL column-oriented distributed databases available from the Apache Software Foundation. HBase performs better than plain Hadoop or Hive when retrieving a small number of records. It is easy to look up a given input value because HBase supports indexing by row key, row-level transactions and updates. We can perform in-memory analytics using HBase. It offers automatic and configurable sharding of tables, is scalable, and provides RESTful APIs as well as support for MapReduce jobs.