HBase is a column-oriented distributed database in the Hadoop environment. It can store massive amounts of data, from terabytes to petabytes, and provides scalable, distributed big data storage on top of the Hadoop ecosystem. Compared with traditional relational databases, HBase possesses some special features, and it is built for low-latency operations.
HBase is used extensively for random read and write operations. Hadoop, by contrast, performs batch processing, running jobs in parallel across the cluster. For example, if a client wants to perform even a simple job on Hadoop, the job has to search the entire data set to get the desired result. Processing a large data set often produces another large data set, which must in turn be processed sequentially. As a result, such operations take longer to execute.
In the CAP theorem, partition tolerance means that the system continues to operate despite arbitrary message loss or the failure of part of the system. Systems with partition tolerance keep working despite physical network partitions.
According to the CAP theorem, a distributed system can satisfy at most two of three properties (consistency, availability and partition tolerance) at the same time, but not all three. Traditional systems such as RDBMSs provide consistency and availability. NoSQL databases such as MongoDB, HBase and Bigtable provide consistency and partition tolerance. Let us look at some of the differences between HBase and RDBMS.
|HBase||RDBMS|
|Schema-less database.||Fixed schema in the database.|
|Column-oriented database.||Row-oriented data store.|
|Designed to store denormalized data.||Designed to store normalized data.|
|Contains wide, sparsely populated tables.||Contains thin tables.|
|Supports automatic partitioning.||Typically has little or no built-in support for automatic partitioning.|
|Well suited for OLAP systems.||Well suited for OLTP systems.|
|Reads only the relevant data from the database.||Retrieves one row at a time, and so may read unnecessary data if only some of the data in a row is required.|
|Structured and semi-structured data can be stored and processed.||Structured data can be stored and processed.|
|Enables aggregation over many rows and columns.||Aggregation is an expensive operation.|
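To make the "wide, sparsely populated" and "reads only the relevant data" rows concrete, here is a minimal sketch of a column-family table as nested dictionaries. It is a toy model, not the HBase API: the `SparseTable` class, the family names and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict


class SparseTable:
    """Toy in-memory model of a wide, sparse column-family table (not the HBase API)."""

    def __init__(self, families):
        self.families = set(families)
        # row key -> {"family:qualifier": value}; cells that were never written take no space
        self.rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][f"{family}:{qualifier}"] = value

    def get(self, row_key, family=None):
        """Return the cells of one row, optionally restricted to one column family."""
        cells = self.rows.get(row_key, {})
        if family is None:
            return dict(cells)
        prefix = family + ":"
        return {c: v for c, v in cells.items() if c.startswith(prefix)}


table = SparseTable(families=["info", "billing"])
table.put("user1", "info", "name", "Asha")
table.put("user1", "billing", "plan", "prepaid")
table.put("user2", "info", "name", "Ravi")
table.put("user2", "info", "email", "ravi@example.com")  # a column only user2 has

print(table.get("user1", family="info"))  # only the "info" cells of user1
```

Each row holds only the cells that were actually written, and a read can be limited to one column family, which is roughly the behaviour the table above contrasts with a fixed-schema, row-oriented store.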
HBase vs HDFS
HBase runs on top of HDFS and Hadoop. The key differences between HDFS and HBase lie in data operations and processing. HDFS is suited to high-latency operations and batch processing, whereas HBase is suited to low-latency operations. In HDFS, data is primarily accessed through MapReduce (MR) jobs, whereas HBase provides access to single rows from billions of records.
HDFS is not designed for random read and write operations on individual records, whereas in HBase data is accessed through shell commands, the Java client API, REST, Avro or Thrift.
Some typical industrial IT applications use HBase along with Hadoop. For applications such as stock exchange data and online banking data operations and processing, HBase is a well-suited solution.
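The difference between batch-style access and single-row access can be sketched with plain Python lists. This is only an illustration of the access patterns, not of how HDFS or HBase are implemented; the record layout and the function names `full_scan` and `keyed_lookup` are assumptions.

```python
import bisect

# Records kept sorted by row key, as a stand-in for a key-ordered store.
records = [(f"row{i:06d}", i * i) for i in range(100_000)]
keys = [k for k, _ in records]


def full_scan(target):
    """Batch-style access: examine records one by one until the key is found."""
    for steps, (key, value) in enumerate(records, start=1):
        if key == target:
            return value, steps
    return None, len(records)


def keyed_lookup(target):
    """Random access: binary search over the sorted keys."""
    i = bisect.bisect_left(keys, target)
    if i < len(keys) and keys[i] == target:
        return records[i][1]
    return None


value, steps = full_scan("row099999")
print(steps)                                # records examined by the scan
print(keyed_lookup("row099999") == value)   # same answer via keyed lookup
```

The scan touches every record to reach the last key, while the keyed lookup needs only a handful of comparisons, which is the intuition behind using a key-ordered store for low-latency reads.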
The HBase architecture consists mainly of HMaster, HRegionServer, HRegions and ZooKeeper. ZooKeeper is a centralized service that maintains configuration information and provides distributed synchronization. A client that wants to communicate with region servers first has to contact ZooKeeper.
HMaster is the master server of HBase and coordinates the HBase cluster. It is responsible for the administrative operations of the cluster. At startup, each region is assigned to a region server. If a region server fails, HMaster assigns its regions to another region server. HMaster can also move a region to another region server as part of load balancing.
An HRegionServer performs the following functions in communication with HMaster and ZooKeeper:
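The reassignment step can be sketched as follows. This is a simplified illustration, assuming a "fewest regions first" policy; the function `reassign_regions`, the server names and the policy itself are hypothetical and do not describe HMaster's actual balancer.

```python
def reassign_regions(assignments, failed_server):
    """Move the regions of a failed server to the least-loaded surviving servers.

    assignments: dict mapping server name -> list of region names.
    Returns a new dict; the input is left unchanged.
    """
    survivors = {s: list(r) for s, r in assignments.items() if s != failed_server}
    if not survivors:
        raise RuntimeError("no region servers left")
    for region in assignments.get(failed_server, []):
        # Pick the server currently hosting the fewest regions (assumed policy).
        target = min(survivors, key=lambda s: len(survivors[s]))
        survivors[target].append(region)
    return survivors


cluster = {
    "rs1": ["region-a", "region-b"],
    "rs2": ["region-c"],
    "rs3": ["region-d", "region-e", "region-f"],
}
after = reassign_regions(cluster, failed_server="rs3")
print(after)
```

Choosing the least-loaded survivor for each orphaned region handles failure recovery and rough load balancing in one pass, at the cost of ignoring factors such as region size or data locality.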
- Hosting and managing regions.
- Splitting regions automatically.
- Handling read and write requests.
- Communicating with clients directly.
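Hosting regions and splitting them automatically can be pictured with a small sketch in which each region owns a contiguous range of row keys. The `RegionMap` class, the row-count threshold and the split-at-the-middle-key rule are assumptions for illustration, not HBase's real split policy.

```python
import bisect


class RegionMap:
    """Toy map of contiguous row-key ranges (regions); not HBase's real logic."""

    def __init__(self, max_rows_per_region=4):
        self.start_keys = [""]   # the first region starts at the empty key
        self.regions = [[]]      # sorted row keys held by each region
        self.max_rows = max_rows_per_region

    def locate(self, row_key):
        """Index of the region whose key range contains row_key."""
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key):
        i = self.locate(row_key)
        rows = self.regions[i]
        if row_key not in rows:
            bisect.insort(rows, row_key)
        if len(rows) > self.max_rows:
            self._split(i)

    def _split(self, i):
        """Split region i at its middle row key into two daughter regions."""
        rows = self.regions[i]
        mid = len(rows) // 2
        self.regions[i:i + 1] = [rows[:mid], rows[mid:]]
        self.start_keys.insert(i + 1, rows[mid])


region_map = RegionMap(max_rows_per_region=4)
for key in ["k03", "k07", "k01", "k09", "k05", "k02"]:
    region_map.put(key)

print(region_map.start_keys)
print(region_map.regions)
```

Because regions are ordered by start key, routing a request is a binary search over the start keys, and a split only inserts one new boundary without moving rows in other regions.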
For each column family, an HRegion maintains a store. The main components of HRegions include:
- MemStore: holds in-memory modifications to the store.
The client communicates bi-directionally with both ZooKeeper and HMaster. For read and write operations, it contacts HRegionServers directly. HMaster assigns regions to region servers and in turn checks their health status. The architecture as a whole contains multiple region servers. The HLog present in each region server is the write-ahead log in which edits are recorded.
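How the HLog and the MemStore cooperate on a write can be sketched roughly as below. This is a toy model under stated assumptions: the `ToyStore` class, the flush threshold and the "flushed snapshot" lists are hypothetical, and details such as log truncation after a flush are left out.

```python
class ToyStore:
    """Toy write path: append to a log, buffer in memory, flush when full."""

    def __init__(self, flush_threshold=3):
        self.wal = []            # stand-in for the HLog (write-ahead log)
        self.memstore = {}       # in-memory modifications, keyed by row key
        self.flushed_files = []  # immutable sorted snapshots, newest last
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        self.wal.append((row_key, value))   # 1. record the edit for recovery
        self.memstore[row_key] = value      # 2. apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the memstore out as a sorted, immutable snapshot."""
        self.flushed_files.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, row_key):
        """Check memory first, then flushed snapshots from newest to oldest."""
        if row_key in self.memstore:
            return self.memstore[row_key]
        for snapshot in reversed(self.flushed_files):
            for key, value in snapshot:
                if key == row_key:
                    return value
        return None


store = ToyStore(flush_threshold=3)
store.put("b", 1)
store.put("a", 2)
store.put("c", 3)    # third distinct row triggers a flush
store.put("a", 20)   # newer value lives in the memstore only
print(store.get("a"), store.get("b"))
```

Writing to the log before updating memory means an edit that has not yet been flushed can be replayed after a crash, while reads consult the newest data first.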
HBase Use Cases
For an HBase use case, we have to take into consideration parameters such as the amount of data, the speed at which data flows and scalability. HBase is a good fit if a client needs to access the details of a single row among billions of records. HBase also permits high compression rates when a column holds few distinct values.
Telecom industry use case: storing billions of mobile call records and providing real-time access to the call records and billing information for customers. Traditional storage and database systems could not scale to these loads or provide a cost-effective solution.
In the solution to this use case, HBase is used to store billions of rows of call record details, with 20 TB of data added monthly. For handling data at this volume, HBase offers a well-suited solution for the telecom industry.
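One way such call records could be keyed is sketched below. The source does not describe the actual schema, so everything here is an assumption: the `call_record_key` helper, the `#` separator, the 13-digit millisecond timestamps and the sample phone numbers are hypothetical.

```python
MAX_TS = 10**13 - 1  # assumed upper bound for 13-digit millisecond timestamps


def call_record_key(phone_number, ts_millis):
    """Build a row key that sorts one subscriber's calls newest first."""
    reversed_ts = MAX_TS - ts_millis
    return f"{phone_number}#{reversed_ts:013d}"


keys = sorted([
    call_record_key("9198000001", 1_700_000_000_000),
    call_record_key("9198000001", 1_700_000_600_000),
    call_record_key("9198000002", 1_700_000_300_000),
])

# All calls of one subscriber are adjacent in key order, newest first.
subscriber_calls = [k for k in keys if k.startswith("9198000001#")]
print(subscriber_calls)
```

Putting the subscriber number first keeps each customer's records contiguous in sorted key order, and reversing the timestamp places the most recent call at the start of that range, which suits "show my latest calls" lookups.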
HBase is an open-source, non-relational, column-oriented distributed database developed under the Apache Software Foundation and built on top of HDFS for faster read/write operations on large datasets. It provides fast retrieval of data through lookups on the indexed row key and supports atomic row-level operations. It also provides automatic and configurable sharding of tables, linear and modular scalability, and real-time queries.