Apache Pig is developed on top of Hadoop. It provides the data flowing environment for processing large sets of data. The Apache pig provides a high level language. It is an alternative abstraction on top of Map Reduce (MR). Pig program supports parallelization mechanism. For scripting of the Pig, it provides Pig Latin language.
Apache Pig takes Pig Latin scripts and turns it into a series of MR jobs. Pig scripting have its own advantages of running the applications on Hadoop from the client side. The nature of programming is easy, compared to low level languages such as Java. It provides simple parallel execution, the user can write and use its own custom defined functions to perform unique processing.
Apache Pig Latin provides several operators like LOAD, FILTER, SORT, JOIN, GROUP, FOREACH and STORE for performing relational data operations. The operators implement data transformation tasks with simple lines of codes. Compared to MR code, the Pig Latin codes are very less in lines and gives better flexibility in some IT industrial use cases.
Apache Pig Architecture
Pig Architecture consists of Pig Latin Interpreter and it will be executed on client Machine. It uses Pig Latin scripts and it converts the script into a series of MR jobs. Then It will execute MR jobs and saves the output result into HDFS. In between, it performs different operations such as Parse, Compile, Optimize and plan the Execution on data that comes into the system.
Job Execution Flow
When a Apache Pig programmer develops scripts, they are stored in the local file system in the form of user defined functions. When we submit Pig Script, it comes in contact with Pig Latin Compiler which splits the task and run a series of MR jobs, meanwhile Pig Compiler fetches data from HDFS (i.e. input file present). After running MR jobs, the output file is stored in HDFS.
Pig Execution Modes
Pig can run in two execution modes. The modes depends on where the Pig script is going to run and where the data is residing. The data can be stored in a single machine, i.e. local file system or it can be stored in a distributed environment like typical Hadoop Cluster environment. We can run Pig programs in three different modes.
First one is non interactive shell also known as script mode. In this we have to create a file, load the code in the file and execute the script. Second one is grunt shell, it is an interactive shell for running Apache Pig commands. Third one is embedded mode, in this we use JDBC to run SQL programs from Java.
Pig Local mode
In this mode, the pig runs on single JVM and accesses the local file system. This mode is best suitable for dealing with the smaller data sets. In this, parallel mapper execution is not possible because the earlier versions of the Hadoop versions are not thread safe.
By providing –x local, we can get in to Pig local mode of execution. In this mode, Pig always looks for the local file system path when data is loaded. $pig –x local implies that it’s in local mode.
Pig Map reduce mode
In this mode, we could have proper Hadoop cluster setup and Hadoop installations on it. By default, the pig runs on MR mode. Pig translates the submitted queries into Map reduce jobs and runs them on top of Hadoop cluster. We can say this mode as a Map reduce mode on a fully distributed cluster.
Pig Latin statements like LOAD, STORE are used to read data from the HDFS file system and to generate output. These Statements are used to process data.
During the processing and execution of MR jobs, intermediate data will be generated. Pig stores this data in a temporary location on HDFS storage. In order to store this intermediate data, temporary location has to be created inside HDFS.
By Using DUMP, we can get the final results displayed to the output screen. In a production environment, the output results will be stored using STORE operator.
Pig Use Case in Telecom Industry
Telecom Industry generates huge amount of data (Call details). To process this data, we use Apache Pig to De-identify the user call data information.
First step is to store the data into HDFS, applying Pig scripts on the loaded data and refining user call data and fetching important call information like Time rate, repetition rate, and some important log info. Once the de-identified Information comes out, the result will get stored into HDFS.
Like this huge amount of data comes into the system servers and it will be stored in HDFS and processed using scripts. During this process it will filter the data, iterates the data and produces results.
IT companies which use Pig to process their data are Yahoo, Twitter, LinkedIn and eBay. They use Apache Pig to run most of their MR jobs. The Pig is mainly used for web log processing, typical data mining situations and for image processing.
By providing data flowing and parallel mechanism which is going to run jobs across clusters, Apache Pig is very popular in terms of usage. When it comes to flexibility, high level scripting language gives programmers an easy interface to process and get results in an efficient way. Pig provides optimization techniques to flow data smoothly across the cluster.
Specific filtering, grouping and iterations in scripting reduces the complexity of code and runs in an effective manner. Last but not the least, as a whole Pig fulfills key functionalities of Big data like volume, velocity and variety by its unique high level data flowing language.