Python Hadoop Features and Advantages

Python with Apache Hadoop is used to store, process, and analyze very large data sets. For streaming applications, we can write MapReduce programs in Python and run them on a Hadoop cluster. Hadoop has become a de facto standard for distributed data processing, but historically it relied on Java. Today there are numerous open source projects that support Hadoop development in Python, covering ecosystem components such as HBase, Hive, Spark, Storm, Flume, Accumulo, and a few others.

Hadoop requires a Java Runtime Environment (JRE) 1.6 or higher because it is built on top of Java APIs, and it scales from a single-node setup to a multi-node cluster environment. MapReduce is mainly used for parallel processing of large data sets. It is a programming model originally designed at Google to provide parallelism, data distribution, and fault tolerance.

MapReduce processes data in the form of key-value pairs. The code a developer writes for MapReduce can be written in Java as well as in Python and C++. Working with Hadoop using Python instead of Java is entirely possible thanks to a collection of active open source projects that provide Python APIs to Hadoop components. Python also has some advantages over Java for Hadoop development.

The reasons for using Hadoop with Python instead of Java are not all that different from the classic Java vs. Python arguments. One of the most important differences is that we don't have to compile our code; instead, we use a scripting language. This makes interactive development of analytics possible, often makes maintaining and fixing applications in production environments simpler, and makes for more succinct and easier-to-read code. By integrating Python with Hadoop, you also get access to world-class data analysis libraries such as numpy, scipy, nltk, and scikit-learn, which are best-in-breed both inside and outside the Python ecosystem.

The following libraries and approaches will guide Python developers through a series of exercises:

  • Interacting with files in the Hadoop Distributed File System (HDFS) with the snakebite Python module to store potentially petabytes of data
  • Writing MapReduce jobs with the mrjob Python module to analyze large amounts of data over potentially thousands of nodes (see the sketch after this list)
  • Writing MapReduce jobs with Apache Pig (a higher-level data flow language) in conjunction with Python user-defined functions
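
As a taste of the mrjob approach mentioned above, here is a minimal word-count job. This is a sketch of our own rather than code from those projects: the class name MRWordCount is hypothetical, and it assumes the mrjob package is installed (pip install mrjob):

    #!/usr/bin/env python
    # Hypothetical minimal mrjob word count; requires "pip install mrjob".
    from mrjob.job import MRJob

    class MRWordCount(MRJob):

        def mapper(self, _, line):
            # Emit (word, 1) for every word in the input line.
            for word in line.split():
                yield word, 1

        def reducer(self, word, counts):
            # Sum the intermediate counts for each word.
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()

Run it locally with "python wordcount.py input.txt", or against a cluster with the -r hadoop option.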

Java vs. Python Features

Feature | Java | Python
Cross-platform support | Compiled bytecode runs on any JVM ("write once, run anywhere") | Portable source, but needs a compatible interpreter on each platform
Development speed | Slower edit-compile-run cycle | Faster to write and iterate (no compile step)
Block delimiters | Traditional braces to start and end blocks | Indentation of blocks
Typing | Static typing | Dynamic typing
Conciseness | Less compact, more verbose | Simpler and more compact
Typical role | Best suited as a low-level implementation language | Much better suited as a "glue" language

Sample MapReduce Program for Hadoop in Python

In the following Python code, we use the Hadoop Streaming API to pass data between our map and reduce code via STDIN (standard input) and STDOUT (standard output). We simply use Python's sys.stdin to read input data and print our own output to sys.stdout. Hadoop Streaming takes care of the rest of the processing flow.

[Figure: mapper.py and reducer.py logic in Python]

Steps to Write and Execute the Code

The figure above shows both the mapper.py and reducer.py logic in one place; the steps below walk through each file.

Step 1: mapper.py

Save the following code in the file /home/hduser/mapper.py.

It will read data from STDIN, split it into words, and output a list of lines mapping words to their (intermediate) counts to STDOUT. The map script will not compute an (intermediate) sum of a word's occurrences, though. Instead, it will immediately output "<word> 1" tuples, even though a specific word might occur multiple times in the input. We let the subsequent reduce step do the final sum count.
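
A minimal mapper consistent with this description looks like the following sketch (assuming plain-text input on STDIN):

    #!/usr/bin/env python
    # mapper.py: emit a tab-separated "<word> 1" pair for every word on STDIN.
    import sys

    for line in sys.stdin:
        # Strip surrounding whitespace and split the line into words.
        for word in line.strip().split():
            # Output an intermediate count of 1; the reducer sums these later.
            print('%s\t%s' % (word, 1))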

Make sure the file has execute permission: chmod +x /home/hduser/mapper.py

Step 2: reducer.py

Save the following code in the file /home/hduser/reducer.py.

It will read the results of mapper.py from STDIN (so the output format of mapper.py and the expected input format of reducer.py must match), sum the occurrences of each word to a final count, and then output its results to STDOUT.
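
A minimal reducer matching that contract might look like this sketch; it relies on Hadoop Streaming sorting the mapper output by key before the reduce phase:

    #!/usr/bin/env python
    # reducer.py: sum the per-word counts emitted by mapper.py.
    import sys

    current_word = None
    current_count = 0

    for line in sys.stdin:
        # Parse the tab-separated "<word> <count>" pair from the mapper.
        word, _, count = line.strip().partition('\t')
        try:
            count = int(count)
        except ValueError:
            continue  # skip malformed lines

        if word == current_word:
            current_count += count
        else:
            # Input is sorted by word, so a new word means the previous
            # word's total is final and can be emitted.
            if current_word is not None:
                print('%s\t%s' % (current_word, current_count))
            current_word = word
            current_count = count

    # Emit the total for the last word.
    if current_word is not None:
        print('%s\t%s' % (current_word, current_count))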

Make sure this file is executable as well: chmod +x /home/hduser/reducer.py

Step 3: Testing the Python code locally
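
Before running anything on the cluster, it is worth testing mapper.py and reducer.py with shell pipes; the sample sentence here is only an illustration:

    echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py

    # "sort -k1,1" simulates Hadoop's shuffle/sort phase between map and reduce.
    echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py | sort -k1,1 | /home/hduser/reducer.py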

Step 4: Running the code on HDFS

Run the Python code on text files that are already stored in HDFS by executing the commands shown in the figure below.

[Figure: Hadoop Streaming command execution]
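
A typical invocation looks like the following sketch; the streaming jar path and the HDFS directories are assumptions that depend on your installation:

    # Copy local input files into HDFS (example paths).
    hdfs dfs -put /home/hduser/input.txt /user/hduser/input/

    # Launch the streaming job, wiring mapper.py and reducer.py together.
    hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -file /home/hduser/mapper.py -mapper /home/hduser/mapper.py \
        -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \
        -input /user/hduser/input/* \
        -output /user/hduser/output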

Step 5: Observing the output

[Figure: Python Hadoop job output]
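
To list and read the job's results from the command line (output path assumed from the previous step):

    hdfs dfs -ls /user/hduser/output
    hdfs dfs -cat /user/hduser/output/part-00000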

Beyond the job output itself, Hadoop also provides a basic web interface for statistics and information. When the Hadoop cluster is running, open http://localhost:50030/ in a browser and have a look around: the JobTracker web interface shows the details of the MapReduce job.
