Python is widely used with Apache Hadoop to store, process, and analyze very large data sets. For streaming applications, Python can be used to write MapReduce programs that run on a Hadoop cluster. Hadoop has become a standard for distributed data processing, but it historically relied on Java. Today, numerous open source projects support Hadoop from Python. Python also integrates with other Hadoop ecosystem components such as HBase, Hive, Spark, Storm, Flume, and Accumulo.
Hadoop requires a Java Runtime Environment (JRE) 1.6 or higher, because Hadoop is built on top of Java APIs. Hadoop scales from a single-node setup to a multi-node cluster environment. MapReduce is mainly used for parallel processing of large data sets. It was originally a programming model designed by Google to provide parallelism, data distribution, and fault tolerance.
MapReduce (MR) processes data in the form of key-value pairs. MR code can be written in Java as well as in Python and C++. Working with Hadoop using Python instead of Java is entirely possible thanks to a collection of active open source projects that provide Python APIs to Hadoop components. Python also has some advantages over Java in this setting.
The reasons for using Hadoop with Python instead of Java are not all that different from the classic Java vs. Python arguments. One of the most important differences is that Python code does not have to be compiled; it is a scripting language. This makes interactive development of analytics possible, often simplifies maintaining and fixing applications in production, and yields more succinct, readable code. Also, by integrating Python with Hadoop, you get access to world-class data analysis libraries such as NumPy, SciPy, NLTK, and scikit-learn that are best-in-breed both inside and outside the Python ecosystem.
The following libraries and approaches guide Python developers through a series of exercises:
- Interacting with files in the Hadoop Distributed File System (HDFS) using the snakebite Python module, to store potentially petabytes of data
- Writing Map Reduce jobs with the mrjob Python module to analyze large amounts of data over potentially thousands of nodes
- Writing Map Reduce jobs with Apache Pig (a higher-level data flow language) in conjunction with Python user-defined functions
Java vs. Python Features
|Feature|Java|Python|
|---|---|---|
|Cross-platform support|Compiled bytecode runs on any platform with a JVM|Interpreted code runs wherever a Python interpreter is available|
|Execution|Generally faster at runtime|Generally slower at runtime, but faster to develop|
|Block syntax|Traditional braces to start and end blocks|Indentation of blocks|
|Typing|Static typing|Dynamic typing|
|Conciseness|More verbose|More succinct and compact|
|Typical role|Best suited as a low-level implementation language|Much better suited as a "glue" language|
Sample MapReduce Program for Hadoop in Python
In the following Python code, we use the Hadoop Streaming API to pass data between our map and reduce code via STDIN (standard input) and STDOUT (standard output). We simply use Python's sys.stdin to read input data and print our own output to sys.stdout. Hadoop Streaming takes care of the rest of the processing and program flow.
Steps to Write and Execute the Code
The figure above shows the logic of both mapper.py and reducer.py in one place.
Save the following code in the file /home/hduser/mapper.py.
It reads data from STDIN, splits each line into words, and outputs lines mapping words to their (intermediate) counts to STDOUT. The map script does not compute an (intermediate) sum of a word's occurrences, though. Instead, it immediately outputs <word> 1 tuples, even though a specific word might occur multiple times in the input. In our case, we let the subsequent reduce step do the final sum count.
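The original figure with the mapper code is not reproduced here; the following is a minimal sketch of the mapper described above, written for the Hadoop Streaming STDIN/STDOUT protocol (the `map_line` helper is introduced here purely for illustration):

```python
#!/usr/bin/env python
"""mapper.py: emit a <word, 1> pair for every word read from STDIN."""
import sys


def map_line(line):
    # Split a line into whitespace-separated words and yield a
    # tab-separated "<word>\t1" pair for each occurrence.
    # The count is always 1; the reducer performs the actual summation.
    for word in line.strip().split():
        yield '%s\t1' % word


if __name__ == '__main__':
    for line in sys.stdin:
        for pair in map_line(line):
            print(pair)
```

Note that a word appearing twice in the input simply produces two separate `<word> 1` lines, exactly as described above.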
Make the file executable: chmod +x /home/hduser/mapper.py
Save the following code in the file /home/hduser/reducer.py.
It reads the results of mapper.py from STDIN (so the output format of mapper.py and the expected input format of reducer.py must match), sums the occurrences of each word to a final count, and then outputs the results to STDOUT.
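Again, the original figure is not reproduced here; the following is a minimal sketch of the reducer described above. It relies on Hadoop Streaming having sorted the mapper output by key before the reducer sees it, so all pairs for a given word arrive consecutively (the `reduce_pairs` helper is introduced here purely for illustration):

```python
#!/usr/bin/env python
"""reducer.py: sum the <word, 1> pairs emitted by mapper.py.

Assumes the input lines are sorted by word, as Hadoop Streaming
guarantees between the map and reduce phases.
"""
import sys


def reduce_pairs(lines):
    # lines: iterable of "word\tcount" strings, sorted by word.
    # Yields one "word\ttotal" line per distinct word.
    current_word, current_count = None, 0
    for line in lines:
        word, _, count = line.strip().partition('\t')
        try:
            count = int(count)
        except ValueError:
            continue  # silently skip malformed lines
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield '%s\t%d' % (current_word, current_count)
            current_word, current_count = word, count
    if current_word is not None:
        # Emit the final word, if any input was seen at all
        yield '%s\t%d' % (current_word, current_count)


if __name__ == '__main__':
    for result in reduce_pairs(sys.stdin):
        print(result)
```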
Make the file executable: chmod +x /home/hduser/reducer.py
Run the Python code on text files that are already stored in HDFS by executing the commands shown in figure 3 below.
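A typical invocation looks like the following command fragment. The streaming jar location, HDFS directory names, and file paths are assumptions that must be adjusted to match your installation:

```shell
# Submit the streaming job: -file ships each script to the cluster,
# -mapper/-reducer name the commands to run in each phase.
# (HADOOP_HOME, the jar path, and the HDFS paths are assumed examples.)
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -file /home/hduser/mapper.py  -mapper  /home/hduser/mapper.py \
    -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \
    -input /user/hduser/input/* \
    -output /user/hduser/output

# Inspect the word counts written to HDFS by the reducer
hadoop fs -cat /user/hduser/output/part-00000
```

Before submitting to the cluster, the same pipeline can be simulated locally with `cat input.txt | ./mapper.py | sort | ./reducer.py`, since Hadoop Streaming's sort-between-phases is the only step the shell pipeline has to reproduce.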
As you can see in the output above, Hadoop also provides a basic web interface for statistics and information. When the Hadoop cluster is running, open http://localhost:50030/ in a browser and have a look around. A screenshot of the Hadoop JobTracker web interface shows the details of the MapReduce job.