Hadoop Hive Architecture, Data Modeling & Working Modes

Hive is a data warehouse framework built on top of Hadoop for querying and analyzing data stored in HDFS. It is open-source software that lets programmers analyze large data sets on Hadoop. Hive makes operations such as ad-hoc queries, analysis of huge data sets and data summarization easier to carry out.

Hive's design reflects its intended use as a system for managing and querying structured data. For structured data in general, raw MapReduce lacks features such as query optimization and ease of use, and the Hive framework provides them. Hive's SQL-inspired language is compiled into MapReduce programs, which hides the complexity of writing those programs by hand. It reuses familiar RDBMS concepts such as schemas, tables, rows and columns, which makes it easier to learn.

Hadoop programs work on flat files. To improve the performance of complex queries, Hive uses directory structures to partition data. Hive uses a metastore to store schema information, and the metastore usually resides in an RDBMS.
To interact with Hive, a programmer can use a web GUI or JDBC. Most interaction, however, is handled by the command line interface (CLI): Hive provides its own CLI for writing Hive Query Language (HQL). Since HQL and SQL syntax are very similar, a programmer who knows SQL should have little trouble writing Hive queries.
Hive supports several file formats, including TEXTFILE, SEQUENCEFILE, RCFILE (Record Columnar File) and ORC. Depending on the number of users, Hive uses either MySQL (multi-user) or Derby (single user) as the database that stores its metadata.
The major difference between HQL and SQL is that a Hive query executes on the Hadoop infrastructure rather than on a traditional database. Hadoop provides distributed storage, so a submitted Hive query runs over huge data sets. These data sets are so large that even high-end, expensive traditional databases would struggle to process them.

For efficient execution of complex queries, a Hive query is automatically converted into a series of MapReduce jobs. For faster retrieval of data during query execution, Hive uses the concepts of partitions and buckets. For data cleansing and filtering, Hive supports user-defined functions (UDFs), which programmers can write according to project requirements.
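To make the conversion into MapReduce jobs more concrete, here is a minimal Python sketch of how a grouped count, roughly what a query like SELECT state, COUNT(*) FROM orders GROUP BY state asks for, could be expressed as map, shuffle and reduce steps. The orders rows and the function names are made up for illustration; this is a toy model of the idea, not the plan Hive actually generates.

```python
from collections import defaultdict

# Hypothetical rows of an "orders" table: (order_id, state)
ORDERS = [
    (1, "Kerala"),
    (2, "Goa"),
    (3, "Kerala"),
    (4, "Punjab"),
    (5, "Goa"),
    (6, "Kerala"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _order_id, state in rows:
        yield state, 1


def shuffle_phase(pairs):
    """Collect all values emitted for the same key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values of each key to get the row count per group."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_state(rows):
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    for state, count in sorted(count_by_state(ORDERS).items()):
        print(state, count)
```

On a real cluster the map and reduce steps run in parallel on many nodes; the sketch only shows the data flow that the programmer no longer has to write by hand.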

Hive Architecture

Hadoop Hive Architecture

There are three major components in Hive, as shown in the architecture diagram: Hive clients, Hive services and the metastore. Hive clients can connect to the Hive server, which is part of Hive services, in several ways.

These are the Thrift client, the ODBC driver and the JDBC driver. The Thrift client makes it easy to execute Hive commands from a wide range of programming languages; Thrift client bindings for Hive are available for C++, Java, PHP, Python and Ruby. Similarly, the JDBC and ODBC drivers let compatible applications communicate with the Hive server.

Job Execution Inside Hive

Hive query processing

HiveServer is the service whose API allows clients (for example, over JDBC) to execute queries on the Hive data warehouse and get the results back. Within Hive services, the driver, compiler and execution engine interact with each other to process the query.

The client submits the query through a user interface such as the GUI. The driver is the first component to receive the query; it creates a session handle for it and exposes the APIs that interfaces such as JDBC and ODBC are built on. The compiler creates the plan for the job to be executed. To do so, the compiler contacts the metastore and obtains the metadata it needs.

The Execution Engine (EE) is the key component: it executes the query by communicating directly with the Job Tracker, Name Node and Data Nodes. As discussed earlier, running a Hive query generates a series of MapReduce (MR) jobs at the back end. In this scenario, the execution engine acts as a bridge between Hive and Hadoop to process the query. For DFS operations, the EE contacts the Name Node.

At the end, the EE fetches the desired results from the Data Nodes. The EE also communicates bi-directionally with the metastore. In Hive, SerDe is the framework that serializes and deserializes input and output data as it moves from HDFS to the local representation and back.
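The serialize and deserialize step mentioned above can be pictured with a small Python sketch. It assumes a plain text table whose fields are separated by the Ctrl-A character, which is the default field delimiter for Hive text tables; the column names and helper functions are hypothetical, and real SerDe implementations are Java classes with typed columns rather than the string-only rows used here.

```python
FIELD_DELIMITER = "\x01"  # Ctrl-A, the default field separator for Hive text tables

# Hypothetical column list for the example table
COLUMNS = ["order_id", "state", "amount"]


def deserialize(line, columns):
    """Turn one stored text line into a row (column name -> string value)."""
    values = line.rstrip("\n").split(FIELD_DELIMITER)
    return dict(zip(columns, values))


def serialize(row, columns):
    """Turn a row back into the delimited text line that would be written out."""
    return FIELD_DELIMITER.join(str(row[name]) for name in columns) + "\n"


if __name__ == "__main__":
    stored_line = "1\x01Kerala\x01250.0\n"
    row = deserialize(stored_line, COLUMNS)
    print(row)
    print(serialize(row, COLUMNS) == stored_line)
```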

The metastore holds all Hive metadata, and it has backup services to back up the metastore information. By default, the metastore service runs in the same JVM as the Hive services. It stores the structural information of tables, their columns and column types, as well as partition structure information.

Hive Vs Relational Databases

Hive offers some capabilities that are hard to achieve with relational databases. For huge amounts of data, in the petabyte range, being able to query the data and get results quickly is important. In this scenario, Hive distributes the query across the cluster so that results come back in a reasonable time.

Some key differences between Hive and relational databases are the following:

  • Relational databases are "schema on write": a table is created first, and data is checked against its schema as it is inserted. Inserts, updates and modifications can all be performed on a relational table.
  • Hive is "schema on read" only. Updates and modifications do not work in the same way, because a Hive query in a typical cluster runs across multiple Data Nodes, so it is not practical to update and modify data in place across those nodes. Hive follows a write once, read many model.
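The schema on read idea can be sketched in a few lines of Python. The raw lines are stored untouched, and the schema is applied only when a line is read; a value that does not fit its column type comes back as None, mirroring how Hive returns NULL for such fields in text tables. The schema, the sample lines and the function name are assumptions made for this example.

```python
# Hypothetical schema: (column name, converter applied at read time)
SCHEMA = [("order_id", int), ("state", str), ("amount", float)]

# Raw lines are accepted as they are; nothing is validated when they are written
RAW_LINES = [
    "1,Kerala,250.0",
    "2,Goa,not-a-number",  # bad value in a numeric column
    "3,Punjab",            # trailing column missing
]


def read_with_schema(line, schema, delimiter=","):
    """Apply the schema to one raw line at read time."""
    fields = line.split(delimiter)
    row = {}
    for index, (name, convert) in enumerate(schema):
        try:
            row[name] = convert(fields[index])
        except (IndexError, ValueError):
            row[name] = None  # unreadable or missing value becomes NULL
    return row


if __name__ == "__main__":
    for raw in RAW_LINES:
        print(read_with_schema(raw, SCHEMA))
```

A schema on write system would instead reject the second and third lines at insert time.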

Hive Working Modes

Hive works in two modes: interactive and non-interactive. In interactive mode, typing hive in the command shell takes you directly into the Hive shell. In non-interactive mode, Hive code is executed directly from the console without entering the shell, for example by passing a script file with hive -f or a query string with hive -e. In both modes, we can create two types of tables: internal tables and external tables.

Hive Data Modeling

In Hive data modeling, tables, partitions and buckets come into the picture.

Tables in Hive are created much as they are in traditional relational databases, and operations such as filtering and joins can be performed on them. Hive has two types of table structure, internal and external, and the choice depends on the design of the schema and on how the data is loaded into Hive.

An internal table is tightly coupled in nature. We first create the table and then load the data into it, so we can call this data on schema. Dropping the table removes both the data and the schema. By default, the table is stored under "/user/hive/warehouse".

An external table is loosely coupled in nature. The data is already available in HDFS, and the table is created on top of that HDFS data, so we can say that it creates a schema on data. Dropping the table removes only the schema; the data remains available in HDFS as before. External tables make it possible to define multiple schemas over the same data stored in HDFS, instead of deleting and reloading the data every time the schema changes.
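The difference in drop behaviour between the two table types can be imitated with a toy Python model. The ToyCatalog class below is a hypothetical stand-in for the metastore, and local temporary directories stand in for HDFS; it is a sketch of the semantics described above, not of how Hive is implemented.

```python
import shutil
import tempfile
from pathlib import Path


class ToyCatalog:
    """Hypothetical stand-in for the metastore: table name -> (data dir, external flag)."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, location, external):
        Path(location).mkdir(parents=True, exist_ok=True)
        self.tables[name] = (Path(location), external)

    def drop_table(self, name):
        location, external = self.tables.pop(name)  # the schema entry always goes
        if not external:
            shutil.rmtree(location)  # internal table: the data goes too


def demo():
    root = Path(tempfile.mkdtemp())
    catalog = ToyCatalog()
    internal_dir = root / "warehouse" / "orders_internal"
    external_dir = root / "landing" / "orders_raw"
    catalog.create_table("orders_internal", internal_dir, external=False)
    catalog.create_table("orders_external", external_dir, external=True)
    (internal_dir / "part-0.txt").write_text("1,Kerala\n")
    (external_dir / "part-0.txt").write_text("2,Goa\n")

    catalog.drop_table("orders_internal")
    catalog.drop_table("orders_external")

    # (does the internal data still exist?, does the external data still exist?)
    result = (internal_dir.exists(), (external_dir / "part-0.txt").exists())
    shutil.rmtree(root)  # clean up the temporary directories
    return result


if __name__ == "__main__":
    print(demo())
```

After both tables are dropped, neither has a schema entry left, but only the external table's files are still on disk.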


Partitions come into play when a table has one or more partition keys, which determine how the data is stored. For example: "A client has e-commerce data for its India operations, in which the operations of all 29 states are recorded together. If we take the state as the partition key and partition the India data on it, we get 29 partitions, one for each state. The data for each state can then be viewed separately in its own partition."
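As a rough illustration of the state example, the Python sketch below groups rows into one directory per partition value, using the key=value directory naming that Hive uses for partitions. The table name sales, the sample rows and the helper functions are invented for the example, and the warehouse path is the default location mentioned earlier.

```python
from collections import defaultdict
from posixpath import join

WAREHOUSE_DIR = "/user/hive/warehouse"  # default warehouse location

# Hypothetical rows of a "sales" table
ROWS = [
    {"order_id": 1, "state": "Kerala", "amount": 250.0},
    {"order_id": 2, "state": "Goa", "amount": 90.0},
    {"order_id": 3, "state": "Kerala", "amount": 40.0},
]


def partition_path(table, partition_column, value):
    """Directory that holds the data of one partition."""
    return join(WAREHOUSE_DIR, table, f"{partition_column}={value}")


def group_by_partition(rows, table, partition_column):
    """Place each row under the directory of its partition value."""
    layout = defaultdict(list)
    for row in rows:
        row = dict(row)
        value = row.pop(partition_column)  # the value is kept in the path, not in the file
        layout[partition_path(table, partition_column, value)].append(row)
    return dict(layout)


def prune(layout, table, partition_column, value):
    """Keep only the directory that a filter on the partition column needs to scan."""
    wanted = partition_path(table, partition_column, value)
    return {path: rows for path, rows in layout.items() if path == wanted}


if __name__ == "__main__":
    layout = group_by_partition(ROWS, "sales", "state")
    for path in sorted(layout):
        print(path, layout[path])
    print(prune(layout, "sales", "state", "Goa"))
```

A filter on the partition key only has to read one directory, which is where the speed-up for such queries comes from.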


Buckets are used for efficient querying. The data in a partition can be divided further into buckets. The division is based on a hash of a column that we select in the table.
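The hash-based division into buckets can be sketched as follows. The example assumes an integer column whose hash is simply the value itself, and takes it modulo the number of buckets; the hash functions Hive really applies depend on the column type and the Hive version, so treat the function names and the hash choice here as illustrative assumptions.

```python
def bucket_for(value, num_buckets):
    """Pick a bucket for an integer column value.

    Assumption: the hash of an integer is the integer itself; the mask
    keeps the result non-negative before taking the modulo.
    """
    return (value & 0x7FFFFFFF) % num_buckets


def assign_buckets(values, num_buckets):
    """Distribute column values over a fixed number of buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for value in values:
        buckets[bucket_for(value, num_buckets)].append(value)
    return buckets


if __name__ == "__main__":
    customer_ids = [101, 102, 103, 104, 105, 106]  # hypothetical column values
    for index, bucket in enumerate(assign_buckets(customer_ids, 4)):
        print(index, bucket)
```

Because the same value always lands in the same bucket, a lookup or join on the bucketed column can go straight to the matching bucket instead of scanning the whole partition.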


Hive provides a data warehousing platform that deals with large amounts of data (petabytes in volume). Querying large data sets and fetching the required results within seconds is important. For this reason, Cloudera's enterprise edition comes with Impala to improve query performance.
