The Hadoop ecosystem comprises services such as HDFS and MapReduce for storing and processing large data sets. In addition to these services, the ecosystem provides several tools for different types of data modeling operations, including Hive for querying and fetching the data stored in HDFS.
In order to handle large data sets, Hadoop provides a distributed framework that can scale out to thousands of nodes. Hadoop adopts a parallel, distributed approach to process huge amounts of data. The two main components of Apache Hadoop are HDFS (Hadoop Distributed File System) and MapReduce (MR). The basic principle of Hadoop is write once, read many times.
Apache Hadoop Ecosystem
The figure above shows the complete Apache Hadoop ecosystem and its components. Each component is discussed below.
HDFS is designed to store large volumes of data in a distributed environment. Because of its design, HDFS can run on a cluster of commodity machines, in contrast to the high-end servers required by earlier distributed file system solutions. It is one of several distributed file systems available, and it provides a high degree of fault tolerance and availability.
HDFS provides features such as a configurable replication factor and a large block size, and it can scale out to several thousand nodes. The data stored in HDFS can be in any format: structured, semi-structured, or unstructured. A Hadoop cluster is scaled out by adding more nodes, which also increases the total storage available for huge data sets.
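As a rough illustration of how the block size and replication factor determine storage needs, the arithmetic can be sketched in Python (the 128 MB block size and replication factor of 3 used here are common HDFS defaults, assumed for the example):

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Estimate the HDFS block count and raw storage used by one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)   # the file is split into fixed-size blocks
    block_replicas = blocks * replication              # each block is stored `replication` times
    raw_mb = file_size_mb * replication                # blocks are not padded, so raw use ~ size x replication
    return blocks, block_replicas, raw_mb

# A 1 GB file with the defaults: 8 blocks, 24 block replicas, ~3 GB of raw storage.
print(hdfs_footprint(1024))
```

The large block size keeps the number of blocks (and the metadata the NameNode must track) small, while replication is what gives HDFS its fault tolerance.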
MapReduce is mainly used for parallel processing of large data sets. It is a programming model originally introduced by Google to provide parallelism, data distribution, and fault tolerance. MR processes data in the form of key-value pairs. A key-value (KV) pair is a mapping between two linked data items: a key and its value, where the key (K) acts as an identifier for the value.
When huge volumes of data need to be processed, MapReduce is used. Programmers write MapReduce applications based on the business use case. To write an effective MR application, a programmer should understand the MR workflow and how jobs are deployed across a Hadoop cluster.
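The key-value workflow described above can be sketched in plain Python for the classic word-count job (this is a simulation of the programming model, not Hadoop's actual Java API; the sample lines are made up):

```python
from collections import defaultdict

def map_phase(lines):
    """Mapper: emit a (word, 1) key-value pair for every word."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle/sort: group all values by key, as the framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data needs big storage", "data flows in"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # -> {'big': 2, 'data': 2, 'needs': 1, 'storage': 1, 'flows': 1, 'in': 1}
```

In a real Hadoop job, mappers and reducers run in parallel on many nodes, and the framework, not the programmer, handles the shuffle, distribution, and failure recovery.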
Sqoop is a data migration tool for moving data between traditional RDBMS servers and HDFS. It provides UNIX-style commands to import data from an RDBMS into HDFS and export it back again.
When data is generated by online web servers, Flume acts as a tool to collect the data from those servers and their logs and place it in HDFS. Developers set the relevant properties in the flume.conf configuration file. For this reason, Flume is sometimes called a log collector.
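A minimal flume.conf sketch of the properties involved might look like the following (the agent and component names a1, r1, c1, k1 are placeholders, and the log path and HDFS path are assumptions for illustration):

```
# Name the agent's source, channel and sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Tail a web server log as the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1

# Buffer events in memory between source and sink
a1.channels.c1.type = memory

# Write the collected events into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/weblogs
a1.sinks.k1.channel = c1
```

The source reads log events, the channel buffers them, and the sink delivers them to HDFS; all three are wired together purely through these configuration properties.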
Pig, developed on top of Hadoop, provides a data-flow environment for processing large data sets. Pig offers a high-level language and is an alternative abstraction on top of MapReduce (MR). Pig programs support parallel execution, and scripts are written in the Pig Latin language.
Pig takes Pig Latin scripts and turns them into a series of MR jobs; running applications on Hadoop from the client side in this way has its own advantages. Pig also addresses one of the important characteristics of Big Data, variety: it operates not only on relational data types but also on semi-structured data (XML, JSON), nested data, and unstructured data.
To query and analyze data stored in HDFS, Hive is used. It is open source software built on Hadoop as a data warehouse framework, and it helps programmers analyze huge volumes of data. One advantage of Hive is that its SQL-like language shields the programmer from the complexity of MapReduce programming. It reuses familiar concepts from the RDBMS world, such as tables, rows, columns, and schemas.
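To illustrate the SQL-like abstraction Hive offers, here is the same style of declarative query run with Python's built-in sqlite3 as a stand-in (the table and data are hypothetical; real Hive uses HiveQL over files in HDFS, not a local database):

```python
import sqlite3

# Hive-style thinking: define a schema, load rows, and write a declarative
# query instead of hand-coding a MapReduce program.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user TEXT, url TEXT, hits INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)",
                 [("amy", "/home", 3), ("bob", "/home", 2), ("amy", "/cart", 1)])

# An aggregation like this is what Hive would compile into MR jobs behind the scenes.
rows = conn.execute(
    "SELECT url, SUM(hits) FROM page_views GROUP BY url ORDER BY url"
).fetchall()
print(rows)  # -> [('/cart', 1), ('/home', 5)]
```

The point is the division of labor: the programmer states *what* result is wanted, and the engine (SQLite here, Hive-on-Hadoop in practice) decides *how* to compute it.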
HBase is a column-oriented distributed database in the Hadoop environment. It can store massive amounts of data, from terabytes to petabytes. HBase offers a constrained access model with specific features that differ from traditional relational models, and it is built for low-latency operations.
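HBase's column-oriented model can be pictured as nested maps: a row key maps to column families, each holding qualifier/value cells. The sketch below is a conceptual simulation in Python, not the real HBase client API, and the table, family, and row names are made up:

```python
from collections import defaultdict

class MiniColumnStore:
    """Toy model of HBase's layout: row key -> (column family, qualifier) -> value."""
    def __init__(self, families):
        self.families = set(families)      # column families are fixed when the table is created
        self.rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][(family, qualifier)] = value

    def get(self, row_key, family, qualifier):
        """Point lookup by row key: the low-latency access pattern HBase is built for."""
        return self.rows[row_key].get((family, qualifier))

table = MiniColumnStore(families=["info", "metrics"])
table.put("user#42", "info", "name", "Asha")
table.put("user#42", "metrics", "logins", 7)
print(table.get("user#42", "info", "name"))  # -> Asha
```

This also shows the "constrained access model" mentioned above: reads and writes are keyed by row, family, and qualifier, with no general-purpose joins or ad-hoc SQL as in a relational database.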
For statistical computation, the R language is used. It provides built-in packages with which algorithms can easily be applied to sample data sets and insights derived quickly. R is also an open source tool. R integrates with Hadoop, an integration known as RHadoop. When the work is primarily statistics-based, R is the natural choice, and R with Hadoop allows algorithms to be applied to larger data sets stored in HDFS.
ZooKeeper is also an open source Apache tool; it provides coordination between the different tools present in the Hadoop ecosystem.
Ambari is an open source framework that provides provisioning, monitoring, and management of Hadoop clusters. It is installed on the master machine and reports the health status of each node in the cluster.