Frequently asked MapReduce interview questions and answers for freshers and 2-5 year experienced Hadoop developers, covering the Map, Shuffle and Reduce phases, partitioners, combiners, speculative execution, etc.
MapReduce is used to handle huge data sets and process them in parallel within a Hadoop cluster. The model was originally put forward by Google to create a fault-tolerant, distributed environment capable of handling, in parallel, the millions of search queries arriving every second.
In the MapReduce architecture, data is processed in two phases: the first is the map stage and the second is the reduce stage.
4) Explain Map phase in MR job?
During the map phase, the input data is split into intermediate key-value pairs by mapper tasks running in parallel across the Hadoop cluster. The output of the map phase is used as input by the shuffle and sort phase. Both the key and value classes must implement the Writable interface, and in order for keys to be sorted, the key class must additionally implement the Comparable interface (in practice, key classes implement WritableComparable).
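The idea can be illustrated with a minimal Python sketch (not the Hadoop API; the function name is illustrative) of a word-count map function that turns one input record into intermediate key-value pairs:

```python
def map_word_count(offset, line):
    """Map function: takes one input record (byte offset, line of text)
    and emits intermediate (word, 1) key-value pairs."""
    for word in line.split():
        yield (word, 1)

# Each mapper task would call this on every record of its input split.
pairs = list(map_word_count(0, "hadoop map reduce map"))
# pairs == [("hadoop", 1), ("map", 1), ("reduce", 1), ("map", 1)]
```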
5) Explain Shuffle phase in MR job?
The output of the map phase is shuffled and sorted during the shuffle phase. First, the key-value pairs produced by the map phase are sorted by key. Then, all values sharing the same key are grouped into a list and stored against that key.
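The grouping described above can be simulated in a few lines of Python (a sketch of the concept, not what Hadoop's shuffle implementation actually does internally):

```python
from collections import defaultdict

def shuffle_and_sort(pairs):
    """Group map-output values by key, then return the groups in key order,
    mimicking what the shuffle phase hands to the reducers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())  # [(key, [values...]), ...] sorted by key

grouped = shuffle_and_sort([("map", 1), ("hadoop", 1), ("map", 1)])
# grouped == [("hadoop", [1]), ("map", [1, 1])]
```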
6) Explain Reduce phase in MR job?
The shuffled and sorted data from the shuffle phase is the input to the reducer. In this phase, all incoming data is combined and a smaller set of tuples (key-value pairs) is produced using user-defined reduce functions. The output of the reducer is written to HDFS. A reducer is not mandatory: map-only jobs (for example, simple searching or filtering) can skip the reduce phase entirely.
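Continuing the word-count sketch in Python (illustrative, not the Hadoop API), a reduce function collapses all the values grouped under one key into a single output tuple:

```python
def reduce_word_count(key, values):
    """Reduce function: combines all values for one key into a smaller
    set of tuples -- here, a single (word, total_count) pair."""
    yield (key, sum(values))

# The framework applies this once per key from the shuffle phase:
result = list(reduce_word_count("map", [1, 1, 1]))
# result == [("map", 3)]
```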
7) Explain Partition in MR in brief?
The output of the mapper goes to the reducers after sorting and shuffling. There can be one or more reducers, depending on the job configuration. It is the duty of the partitioner to make sure that all values associated with a particular intermediate key are passed on to the same reducer. This balances the reducers' workload and improves performance. We can also write a customized partitioner to partition by different criteria.
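As a sketch of a customized partition criterion (in Hadoop this would be a Java class extending Partitioner; the Python function here is purely illustrative), one could route keys by their first letter:

```python
def first_letter_partitioner(key, num_reducers):
    """Hypothetical custom partitioner: keys starting with a-m go to
    reducer 0, everything else to reducer 1 (assumes num_reducers == 2).
    Every occurrence of the same key always lands on the same reducer."""
    return 0 if key[:1].lower() <= "m" else 1
```

Because the result depends only on the key, all values for a given key are guaranteed to reach the same reducer, which is the partitioner's core contract.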
8) Explain combiner in brief?
A combiner is also termed a "mini-reducer". If the map phase produces a large volume of data, transferring all of it consumes a lot of network bandwidth. To solve this bandwidth issue, we run reducer-style code on each mapper's local output to summarize it before it is sent over the network. The combiner's input and output key-value types must match the map output types (which are also the reducer's input types).
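The bandwidth saving can be seen in a small Python sketch (conceptual, not the Hadoop API): pre-summing counts per key on one mapper's output shrinks the number of pairs that cross the network:

```python
from collections import defaultdict

def combine(map_output):
    """Mini-reduce on a single mapper's local output: pre-sum counts
    per key so fewer (key, value) pairs are shuffled across the network."""
    totals = defaultdict(int)
    for key, value in map_output:
        totals[key] += value
    return sorted(totals.items())

combined = combine([("map", 1), ("map", 1), ("hadoop", 1)])
# combined == [("hadoop", 1), ("map", 2)]  -- 3 pairs reduced to 2
```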
9) What will happen in intermediate process?
The mapper output undergoes shuffling and sorting in the intermediate process. This intermediate data is stored on the local file system of the node that produced it and is not replicated to other Hadoop nodes.
10) What is default partition used in MR?
Hash partitioning (the HashPartitioner class) is the default partitioner in Hadoop.
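Hadoop's HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. A Python rendering of that same formula:

```python
def hash_partition(key, num_reducers):
    """Mirror of Hadoop's default HashPartitioner: mask off the sign bit
    of the key's hash, then take it modulo the number of reducers."""
    return (hash(key) & 0x7FFFFFFF) % num_reducers
```

Since the reducer index depends only on the key's hash, identical keys are always assigned to the same reducer.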
11) What is functionality of Job tracker in MR?
The JobTracker splits a job into multiple tasks and assigns each task to a TaskTracker.
12) What happens if a datanode fails in Hadoop?
When a datanode fails, the following activities occur in Hadoop. First, the JobTracker and NameNode detect the failure. Then, all tasks scheduled on the failed node are rescheduled on other nodes. Finally, the NameNode re-replicates the failed node's user data to another node.
13) What is speculative Execution?
When various tasks are executing, speculative execution plays an important role in improving overall job completion time. If one task (for example, a mapper) is running significantly slower than its peers, the JobTracker launches a duplicate, speculative copy of that task on another node. Whichever copy finishes first is accepted, and the other is killed. (By default, Hadoop's scheduler executes jobs on a FIFO, First In First Out, basis.)
14) What happens in textinputformat?
In Hadoop, the TextInputFormat class is the InputFormat for plain text files, where each line in the file is considered one record. The key is the byte offset of the line within the file and the value is the content of the line. Hence the types are Key: LongWritable, Value: Text.
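The records TextInputFormat produces can be simulated in Python (the function name is illustrative; this is not the Hadoop API): each record's key is the byte offset where the line starts, and its value is the line's text:

```python
def text_input_records(data):
    """Split plain text into (byte_offset, line) records, mimicking
    how TextInputFormat keys each line by its starting byte offset."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield (offset, line.rstrip("\n"))
        offset += len(line.encode("utf-8"))

records = list(text_input_records("hello world\nhadoop\n"))
# records == [(0, "hello world"), (12, "hadoop")]
```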
15) Please explain the configuration to set the number of reducers for a particular job in Hadoop?
In the configuration file mapred-site.xml, we can set the mapred.reduce.tasks property (mapreduce.job.reduces in newer Hadoop versions) to control the number of reducers. It can also be set per job in code via Job.setNumReduceTasks().
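For example, a mapred-site.xml entry setting four reducers might look like this (the value 4 is just an example):

```xml
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>
```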
16) Can you give the name of the Hadoop daemon in which actual Hadoop MR Job processing will happen?
Actual MR job processing happens in the TaskTracker daemon, which spawns a separate child JVM to run each map or reduce task.
17) Difference between input split and Hadoop block size?
An input split is a logical division of the input data, used to schedule map tasks, whereas a Hadoop block is a physical unit of storage in HDFS. A single input split may span one or more blocks.
18) What are the basic parameters of mapper?
- LongWritable and Text (the input key and value)
- Text and IntWritable (the output key and value)
19) In Hadoop, which algorithm is used to write intermediate data to local disk?