File sorting and file-content sorting are common solutions to the pressing problem of organizing and maintaining data in any system. Files may be sorted by size or by creation date, which is significant for archiving and helps reorganize files in systems that hold vast amounts of frequently updated data. Content sorting is a real-world application that requires sorted output for further manipulation. Apache Hadoop provides a framework for large-scale parallel processing using a distributed file system and the map-reduce programming paradigm. Both sorting programs are implemented within this framework using the MapReduce module, one of the major pillars of Hadoop.
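As a minimal sketch of sorting files by size or date, the following uses only the standard Java library; the class and method names are illustrative, not taken from the project's source code, and "date" here means the last-modified timestamp, since the classic `java.io.File` API does not expose a creation time:

```java
import java.io.File;
import java.util.Arrays;
import java.util.Comparator;

public class FileSorter {
    // Orders the files of a directory by size, ascending.
    static File[] sortBySize(File dir) {
        File[] files = dir.listFiles();
        Arrays.sort(files, new Comparator<File>() {
            public int compare(File a, File b) {
                long d = a.length() - b.length();
                return d < 0 ? -1 : (d > 0 ? 1 : 0);
            }
        });
        return files;
    }

    // Orders the files of a directory by last-modified date, oldest first.
    static File[] sortByDate(File dir) {
        File[] files = dir.listFiles();
        Arrays.sort(files, new Comparator<File>() {
            public int compare(File a, File b) {
                long d = a.lastModified() - b.lastModified();
                return d < 0 ? -1 : (d > 0 ? 1 : 0);
            }
        });
        return files;
    }
}
```

The anonymous `Comparator` style keeps the sketch compatible with Java 1.6, the version the requirements below call for.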
Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the input data-set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them and re-executing any that fail. The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing failed tasks; the slaves execute the tasks as directed by the master.
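The phases just described (split into chunks, map in parallel, sort and group the map output, reduce) can be sketched as a single-machine simulation in plain Java. This is a hypothetical word-count illustration of the data flow only, not the Hadoop API itself; a `TreeMap` stands in for the framework's sorted shuffle:

```java
import java.util.*;

public class MiniMapReduce {
    // Simulates split -> map -> shuffle/sort -> reduce for a word count.
    static SortedMap<String, Integer> wordCount(String[] chunks) {
        // Map phase: each chunk is processed independently, emitting (word, 1).
        List<String> emitted = new ArrayList<String>();
        for (String chunk : chunks)
            for (String word : chunk.split("\\s+"))
                emitted.add(word);

        // Shuffle/sort phase: group map output by key, in sorted key order.
        SortedMap<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (String word : emitted) {
            if (!grouped.containsKey(word))
                grouped.put(word, new ArrayList<Integer>());
            grouped.get(word).add(1);
        }

        // Reduce phase: combine each key's list of values into one result.
        SortedMap<String, Integer> result = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] chunks = { "apache hadoop map reduce", "map reduce sorts map outputs" };
        for (Map.Entry<String, Integer> e : wordCount(chunks).entrySet())
            System.out.println(e.getKey() + "\t" + e.getValue());
    }
}
```

In the real framework each chunk's map work runs as a separate task on a cluster node, and the shuffle moves grouped data to the reduce tasks over the network.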
The sorting algorithm made use of here is merge sort. Merge sort is a stable sort, parallelizes well, and is more efficient at handling slow-to-access sequential media. Merge sort and Hadoop MapReduce go along the same lines, as both follow the divide-and-conquer approach.
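A minimal sketch of the classic top-down merge sort (class and method names are illustrative) makes the divide-and-conquer parallel explicit: the recursive halves mirror the independent map tasks, and the merge mirrors the framework's combining of sorted outputs:

```java
import java.util.Arrays;

public class MergeSortDemo {
    // Divide: split the array in half, sort each half, then merge.
    static int[] mergeSort(int[] a) {
        if (a.length <= 1) return a;
        int mid = a.length / 2;
        int[] left = mergeSort(Arrays.copyOfRange(a, 0, mid));
        int[] right = mergeSort(Arrays.copyOfRange(a, mid, a.length));
        return merge(left, right);
    }

    // Conquer: merge two sorted arrays into one sorted array.
    static int[] merge(int[] l, int[] r) {
        int[] out = new int[l.length + r.length];
        int i = 0, j = 0, k = 0;
        while (i < l.length && j < r.length)
            out[k++] = (l[i] <= r[j]) ? l[i++] : r[j++]; // <= keeps the sort stable
        while (i < l.length) out[k++] = l[i++];
        while (j < r.length) out[k++] = r[j++];
        return out;
    }
}
```

Stability follows from taking the left element on ties, which is why equal records keep their original relative order.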
A feasibility study is a test of a system proposal according to its workability, its impact on the organization, its ability to meet user needs and its effective use of resources. Three key considerations are involved in the feasibility analysis: economic, technical and behavioral. Before entering into the procedure of system design, we are obliged to study the feasibility of introducing the new system. Replacing the existing system with a new one is quite easy, but the drawbacks of the current system must be eliminated so that the user can enjoy the advantages of the coming system. The proposed system must be evaluated for the technical, operational and economic feasibility of developing it within the Hadoop framework. The objective of the feasibility study is not only to solve the problem but also to establish its scope. During the study, the problem is defined and crystallized, and the aspects of the problem to be included in the system are determined. Consequently, costs and benefits are estimated with greater accuracy at this stage.
The existing system for file sorting is a computerized system that uses some efficient sorting algorithm. The algorithm, though simple, becomes a tedious and time-consuming task when it deals with vast amounts of data.
Such amounts of data raise problems of computing power and storage. If ten terabytes of data have to be read from five nodes, each reading at an average rate of 100 megabytes per second, it will take about 330 minutes. If the same data is spread across 500 nodes, the read takes only a little over three minutes. If we were to start transferring the data over the network to other nodes instead, the time would increase by some orders of magnitude.
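The arithmetic behind those estimates can be checked with a few lines of Java; the 100 MB/s per-node read rate is the assumption made in the text, and decimal terabytes (1 TB = 1,000,000 MB) are assumed here:

```java
public class ReadTimeEstimate {
    // Minutes to scan totalMB megabytes when each of `nodes` machines
    // reads its own equal share of the data at rateMBps megabytes per second.
    static double minutes(double totalMB, int nodes, double rateMBps) {
        return totalMB / nodes / rateMBps / 60.0;
    }

    public static void main(String[] args) {
        double tenTB = 1.0e7; // 10 TB expressed in megabytes (decimal units)
        System.out.printf("5 nodes:   %.0f minutes%n", minutes(tenTB, 5, 100));
        System.out.printf("500 nodes: %.1f minutes%n", minutes(tenTB, 500, 100));
    }
}
```

With five nodes this works out to roughly 333 minutes, and with 500 nodes to roughly 3.3 minutes, which is where the figures above come from.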
Hadoop addresses these problems by distributing the work and the data over thousands of machines. The workload is spread evenly across the cluster, and several techniques, explained later, ensure fault tolerance and data integrity. Hadoop also ensures that only a small amount of data is transferred over the network at processing time, so it suits best the cases where data is written once and read often. The MapReduce programming model allows the programmer to work at an abstract level without having to tinker with cluster-specific details, concurrent use of resources, communication or data flow. The two functions, map and reduce, receive incoming data, transform it and send it to an output channel.
Hadoop limits the amount of communication that can be performed by the processes, as each individual record is processed by a task in isolation from the others. While this sounds like a major limitation at first, it makes the whole framework much more reliable. Hadoop will not run just any program and distribute it across a cluster; programs must be written to conform to a particular programming model, named "MapReduce".
The selection of hardware is an important task related to software development.
System: IBM-compatible PC
Processor: Intel Pentium IV
Memory: 1 GB RAM
Hard disk drive: 40 GB
Operating system: Fedora 10 or above
Other applications: Apache Hadoop, Java 1.6 or above
Apache Hadoop is a framework that implements the tasks using map-reduce algorithms; it is an efficient and much quicker approach to task implementation. Java 1.6 is the language used to write the algorithms and run them within the Hadoop framework. Eclipse is used to write the front-end GUI and the back-end algorithms.
Please go through the attachment for system design and source code.