The Hadoop distributed framework is designed to handle large data sets. It can scale out to several thousand nodes and process enormous amounts of data in a parallel, distributed fashion. Apache Hadoop consists of two core components: HDFS (the Hadoop Distributed File System) and MapReduce (MR). Hadoop follows a write-once, read-many access model.
HDFS is a scalable distributed storage file system, and MapReduce is designed for parallel processing of data. In the high-level architecture of Hadoop, each of these components forms its own layer: the MapReduce layer consists of the Job Tracker and Task Trackers, and the HDFS layer consists of the Name Node and Data Nodes.
High Level Architecture of Hadoop
HDFS is the Hadoop distributed file system. As its name implies, it provides a distributed environment for storage, and its file system is designed to run on commodity hardware. There are Hadoop distributions in the market, such as Cloudera and Hortonworks, which package HDFS-based distributed storage, each with its own particular advantages.
HDFS provides a high degree of fault tolerance and runs on cheap commodity hardware. By adding commodity nodes to the cluster, Apache Hadoop can process large data sets at lower cost than existing distributed systems. The data stored in each block is replicated to three nodes by default, so that even if one node goes down there is no loss of data: the remaining replicas serve as the backup and recovery mechanism.
So HDFS provides concepts like a configurable replication factor and a large block size, and it can scale out to several thousand nodes.
MapReduce is mainly used for parallel processing of large data sets. It is a programming model originally designed by Google to provide parallelism, data distribution and fault tolerance. MR processes data in the form of key-value pairs. A key-value (KV) pair is a mapping between two linked data items: a key and its value.
The key (K) acts as an identifier for the value. An example of a key-value (KV) pair is one where the key is a node ID and the value is its properties, including neighbor nodes, predecessor node, etc. The MR API provides features such as batch processing, parallel processing of huge amounts of data, and high availability.
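As a minimal sketch of the key-value model (plain Java, not the actual Hadoop MapReduce API), consider a word count: the map step emits a (word, 1) pair per token, and the reduce step groups pairs by key and sums the values.

```java
import java.util.*;
import java.util.stream.*;

// Illustrative simulation of the MapReduce key-value model, not Hadoop code.
public class WordCountSketch {
    // "map" phase: emit one (word, 1) pair per token in the input line
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // "shuffle + reduce" phase: group the pairs by key and sum their values
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = reduce(map("to be or not to be"));
        System.out.println(counts.get("to")); // 2
    }
}
```

In real Hadoop, the map and reduce functions run on different machines, and the framework handles the shuffle between them; this sketch only shows the data model.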
For effective scheduling of work, Hadoop provides specific features at the architecture level: fault tolerance, rack awareness, and the replication factor. Compared to native UNIX/Linux file systems, whose blocks are typically 8 to 16 KB, the block size in Hadoop is 64 MB by default, with a provision to change it to 128 MB. The replication factor is 3 by default, but depending on the business requirement we can increase or decrease it. Because HDFS blocks are much larger than disk blocks, they reduce the cost of seeks.
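The effect of block size and replication factor can be worked through with simple arithmetic; the figures below (a 1000 MB file, 64 MB blocks, replication factor 3) are illustrative values, not output from Hadoop itself.

```java
// Back-of-the-envelope arithmetic for HDFS block counts and storage cost.
public class BlockMath {
    // Number of HDFS blocks a file occupies (the last block may be partial)
    static long numBlocks(long fileSizeMb, long blockSizeMb) {
        return (fileSizeMb + blockSizeMb - 1) / blockSizeMb; // ceiling division
    }

    // Raw cluster storage consumed once replication is applied
    static long rawStorageMb(long fileSizeMb, int replicationFactor) {
        return fileSizeMb * replicationFactor;
    }

    public static void main(String[] args) {
        System.out.println(numBlocks(1000, 64));   // 16 blocks for a 1000 MB file
        System.out.println(rawStorageMb(1000, 3)); // 3000 MB of raw storage
    }
}
```

Doubling the block size to 128 MB halves the number of blocks the Name Node must track and the number of seeks a sequential read performs, which is why HDFS favors large blocks.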
Hadoop requires Java Runtime Environment (JRE) 1.6 or higher, because Hadoop is developed on top of Java APIs. Hadoop can run in anything from a single-node setup to a large multi-node cluster environment.
The master/slave architecture manages two main types of functionality in HDFS: file management and I/O. The master program is called the Name Node, and the slave programs are called Data Nodes. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.
The NameNode performs file system namespace operations such as opening and closing files. In a cluster of machines, a dedicated machine runs the Name Node, which acts as the arbitrator of the Data Nodes and the repository of HDFS metadata.
Hadoop Architecture - Read & Write Operations
Functionalities of Hadoop daemons
The figure above explains the Hadoop architecture in detail, along with its read and write operations.
The Name Node, Secondary Name Node and Job Tracker are known as master daemons; the Task Trackers and Data Nodes are known as slave daemons. Whenever a client submits a job, the actual processing takes place in the Task Trackers.
The NameNode holds all the metadata information, such as racks and blocks (r1, r2, b1, b2). Data Nodes report to the NameNode through periodic heartbeats (every 3 seconds by default), and the NameNode's image file is periodically checkpointed by the Secondary NameNode. The Name Node also has a special feature, rack awareness, which gives information about the nodes present in the different racks of the Hadoop cluster environment.
- Secondary Name Node (SNN) - periodically takes a snapshot of the NN's image. If the NN goes down, the latest snapshot held by the SNN can be used to stand in for the NN and carry out its functionalities.
- Job Tracker (JT) - used for job scheduling and maintenance. It assigns the different tasks to Task Trackers and monitors them.
- Task Tracker (TT) - the actual computational processing happens here. It stays in contact with the JT through a heartbeat mechanism between them.
A single-node Hadoop setup consists of a single master node and one or more worker nodes. In this case, the master runs the Job Tracker, Task Tracker, Name Node (NN) and Data Node (DN), while a slave or worker node acts as both a Data Node and a Task Tracker. For high-level application development, a single-node setup offers only limited resources (memory, capacity), so it is not really suitable.
In a multi-node Hadoop setup (cluster setup), HDFS has a Name Node server plus a Secondary Name Node that captures snapshots of the Name Node's metadata at cumulative intervals. If the Name Node shuts down, the Secondary Name Node acts in its place and executes the instructions to the Job Tracker.
Writing data into HDFS
As shown in figure 2, when a client wants to write files into HDFS, it sends a request to the Name Node for block locations. Block locations are the locations of blocks stored in the Data Nodes. The Name Node provides the block locations along with information about which Data Nodes are currently free. The client then contacts the Data Nodes directly to store the data at the block locations received from the Name Node. Consequently, the NN holds only the metadata for the actual data, which is stored in the Data Nodes.
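The write path described above can be sketched as an in-memory simulation. All class, method and node names here are hypothetical stand-ins, not the real HDFS client API; the point is the division of labor between metadata (NameNode) and data (DataNodes).

```java
import java.util.*;

// Hypothetical, in-memory sketch of the HDFS write path (not real Hadoop code).
public class WritePathSketch {
    // NameNode role: records which DataNodes hold each block (metadata only)
    static Map<String, List<String>> blockMap = new HashMap<>();
    // DataNode role: actually stores the block contents
    static Map<String, Map<String, String>> dataNodes = new HashMap<>();

    // Step 1: client asks the "NameNode" for target DataNodes for a new block
    static List<String> allocateBlock(String blockId, int replication) {
        List<String> targets = new ArrayList<>();
        for (int i = 0; i < replication; i++) targets.add("datanode-" + i);
        blockMap.put(blockId, targets); // NameNode keeps only this metadata
        return targets;
    }

    // Step 2: client writes the block directly to each target DataNode
    static void writeBlock(String blockId, String data, List<String> targets) {
        for (String dn : targets)
            dataNodes.computeIfAbsent(dn, k -> new HashMap<>()).put(blockId, data);
    }

    public static void main(String[] args) {
        List<String> targets = allocateBlock("blk_001", 3);
        writeBlock("blk_001", "file contents", targets);
        System.out.println(blockMap.get("blk_001").size()); // 3 replicas recorded
    }
}
```

Note that no file bytes ever pass through the "NameNode" maps; it only learns where the block lives, which mirrors the real design.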
Reading data from HDFS
The client sends a request to the NN, asking where the file is stored in HDFS. The NN provides the Data Node information for where the file's blocks were stored. The client then contacts those Data Nodes and retrieves the file. The NN always holds only the metadata information, such as blocks, racks and nodes.
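The read path can be sketched the same way: metadata lookups go to "NameNode" tables, while the block bytes come from a "DataNode" store. The file path, block IDs and contents below are made up for illustration.

```java
import java.util.*;

// Hypothetical sketch of the HDFS read path (not the real Hadoop API).
public class ReadPathSketch {
    // "NameNode" metadata: file -> ordered list of its blocks
    static Map<String, List<String>> fileToBlocks = Map.of(
            "/logs/app.log", List.of("blk_1", "blk_2"));
    static Map<String, String> blockLocations = Map.of(
            "blk_1", "datanode-0", "blk_2", "datanode-1");
    // "DataNode" storage: the actual block contents
    static Map<String, String> blockData = Map.of(
            "blk_1", "hello ", "blk_2", "world");

    // Client side: resolve blocks via the NameNode, then fetch from DataNodes
    static String readFile(String path) {
        StringBuilder contents = new StringBuilder();
        for (String blk : fileToBlocks.get(path)) {
            String dn = blockLocations.get(blk);   // metadata lookup (NameNode)
            contents.append(fetchFromDataNode(dn, blk)); // data read (DataNode)
        }
        return contents.toString();
    }

    static String fetchFromDataNode(String dataNode, String blockId) {
        return blockData.get(blockId); // stand-in for a network read
    }

    public static void main(String[] args) {
        System.out.println(readFile("/logs/app.log")); // hello world
    }
}
```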
Job Execution and its Work flow
Let’s say we submit a job, written in Java, along with 1000 MB of data for some computation. Once the client submits the job, it contacts the NN for the resources that are readily available for the job to execute. The NN provides Data Node information to the JT for further processing. Depending on the availability of resources, the JT splits the work and assigns tasks to Task Trackers.
Suppose the job input is 1000 MB (1 GB), and assume the JT splits the work into 10 tasks, allocating 100 MB to each. In practice, how much work each Task Tracker handles depends on the input split, which is governed by the block size (64 MB or 128 MB).
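The split arithmetic in this example can be sketched as follows; the method name is illustrative, not a Hadoop API.

```java
// Illustrative arithmetic for how a JobTracker-style scheduler would split work.
public class JobSplitSketch {
    // Number of map tasks created when each task handles one input split
    static int numTasks(int jobSizeMb, int splitSizeMb) {
        return (jobSizeMb + splitSizeMb - 1) / splitSizeMb; // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(numTasks(1000, 100)); // 10 tasks of 100 MB each
        System.out.println(numTasks(1000, 64));  // 16 tasks with a 64 MB split
    }
}
```

With the default 64 MB block size the same 1000 MB job would produce 16 splits rather than 10, which is why the split size (and thus the block size) directly controls the degree of parallelism.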
Coming to the limitations of open-source Apache Hadoop: Hadoop 1.0 is limited in scalability to about 5,000 nodes per cluster and a maximum of 40,000 concurrent tasks, which works out to roughly 8 concurrent tasks per node. Hadoop 2.2 overcomes the limitations of Hadoop 1. Hadoop enterprise editions provide additional features such as distributed storage space in the cloud and extensibility for upgrading nodes.
An enterprise edition provides options according to client requirements. Clients choose among these Hadoop editions based on factors such as the company's data usage and data storage needs. Enterprise editions like Cloudera, Hortonworks and BigInsights are all developed on top of Apache Hadoop.
As a whole, the Hadoop architecture provides both storage and processing of jobs as a distributed framework. Compared to existing methods of storing and processing large data sets, Hadoop offers additional advantages in terms of market strategies. Complete enterprise editions like Cloudera and Hortonworks provide a full environment for Hadoop along with cluster maintenance and support.