
Basic concepts of MapReduce and Hadoop

If you haven't already read it, you may want to read this post first to understand the basics of Big Data. To get maximum value from Big Data, you need to combine it with traditional enterprise data through applications, reports, queries, and other approaches.
There are some obvious challenges when we manage Big Data:
  1. Data Growth:
    Over 80 percent of the data in the enterprise is unstructured, and it is growing at a very fast pace.
  2. Processing/CPU power:
    We are accustomed to using a single big, powerful server to crunch data and generate the required results. Unfortunately, that approach does not work for Big Data, which is too large for any single server to handle, however powerful it is. Distributed computing coupled with parallel processing is the way forward with Big Data.
  3. Physical storage:
    Not a big challenge given the cost of disk space nowadays, but capturing and managing all this information can still consume enormous resources.
  4. Data issues:
    This is one of the major challenges. Many data formats are proprietary, which makes data mobility and interoperability an issue.
  5. Costs:
    You need highly specialized, well-designed software (and highly skilled people) to handle Extract, Transform, and Load (ETL) processes for Big Data. It is an expensive process.
As with any big problem, the first thing that comes to mind when solving it is the 'divide and conquer' approach. MapReduce is built on this proven concept of divide and conquer.
MapReduce is a group of technologies and architectural design philosophies designed by Google. Faced with its own set of unique challenges, Google decided in 2004 to bring the power of parallel, distributed computing to help digest the enormous amounts of data produced during its daily operations. It named the approach MapReduce.
It is a software framework that breaks big problems into small, manageable tasks and then distributes them to multiple servers or nodes (possibly hundreds or thousands). These server nodes work together in parallel to arrive at a result. MapReduce can work with raw data, with data stored in relational databases, or with both. The data may be structured or unstructured.
MapReduce is broken into two parts or activities: 'Map' and 'Reduce'. Let us understand it with an example:
——————————————————————————————————–
Say you own an online medicine store. You want to know what consumers are searching for on your website. It is a simple requirement, but it will take a huge amount of data crunching. Over time, you've collected an enormous collection of search terms that your visitors have typed in on your website. This is raw data and measures several TBs.
You have multiple server nodes (say around 500) on which you will do parallel processing to get the result. These terabytes of data will be broken into many small files (say 1 GB each) and distributed across all the server nodes.
MAP PHASE:
=========
MapReduce works with key/value pairs. Each key/value pair is made up of two data components.
First, the key identifies what kind of information we’re looking at.
In our example ‘keys’ can include:
1) medicine name
2) medicine price
Next, the value portion of the key/value pair is an actual instance of data associated with a key.
In our example ‘values’ can include:
1) aspirin (which is a medicine name)
2) 12.25 (which is the price of a medicine)
So if we put the keys and values together, we end up with key/value pairs:
1) medicine name/aspirin
2) medicine price/12.25
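To make the pair structure concrete, here is a minimal plain-Java sketch (nothing Hadoop-specific yet) that simply models the two illustrative pairs above; the key names and values are the ones from this example.

import java.util.List;
import java.util.Map;

public class KeyValuePairExample {
    public static void main(String[] args) {
        // The key names the kind of information; the value is one instance of it.
        List<Map.Entry<String, String>> pairs = List.of(
                Map.entry("medicine name", "aspirin"),
                Map.entry("medicine price", "12.25"));

        // Prints: medicine name/aspirin and medicine price/12.25
        pairs.forEach(pair -> System.out.println(pair.getKey() + "/" + pair.getValue()));
    }
}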
On each of the 500 nodes, the Map step will produce a list consisting of each medicine name in the file along with how many times it appears. So in this Map phase the 'key' will be the MEDICINE NAME and the 'value' will be the NUMBER OF TIMES IT WAS SEARCHED.
For example, one node might come up with these intermediate results from its own set of data:
..
aspirin/102345
calpol/23444
..
Each of the 500 nodes will individually produce such a list in the Map phase, so the end result is 500 lists.
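As a hedged illustration, here is a minimal Map-side sketch using the standard Hadoop MapReduce Java API. It assumes each input line holds one search term typed by a visitor; note that in this API the mapper usually emits (medicine name, 1) for every occurrence, and the per-node totals shown in the intermediate list above would come from an optional combiner or from the Reduce step.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map phase sketch: for every search term in this node's slice of the data,
// emit the pair (medicine name, 1). Hadoop later groups these pairs by key.
public class SearchTermMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text term = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumption: each input line holds one search term, e.g. "aspirin".
        String searchTerm = line.toString().trim().toLowerCase();
        if (!searchTerm.isEmpty()) {
            term.set(searchTerm);
            context.write(term, ONE);   // intermediate pair: aspirin/1
        }
    }
}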
REDUCE PHASE:
=============
After the Map phase is over, all the intermediate values for a given output key are combined into a list. The Reduce step then consolidates all 500 result lists from the Map step, producing a final list of all medicines and the total number of times they were searched for. For example, the combined counts for these search terms might look like this:
..
aspirin/87602345
calpol/634544
..
In simple words, in the MAP phase you broke your work up and distributed it to 500 server nodes, and in the REDUCE phase you consolidated the results to arrive at the final required result.
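And here is a matching Reduce-side sketch under the same assumptions: Hadoop groups all the intermediate counts emitted for each medicine name and hands them to the reducer, which adds them up into one final total.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce phase sketch: all counts emitted for the same medicine name are
// grouped together, and the reducer sums them into the final total.
public class SearchTermReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text medicineName, Iterable<IntWritable> counts,
                          Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(medicineName, total);   // e.g. aspirin/87602345
    }
}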
——————————————————————————————————–
So what is Hadoop?
Hadoop is based on this MapReduce framework and is maintained by the Apache Software Foundation. Although built around MapReduce, it is in fact much more than the MapReduce architecture alone.
It is written in Java, so knowing a bit of Java really helps in understanding its internal details.
Many companies in a wide variety of domains such as finance, retail, and government (for example Facebook, LinkedIn, and Netflix) are using Hadoop's capabilities to gain measurable benefits by churning through their huge volumes of data.
Hadoop uses a file system called HDFS (Hadoop Distributed File System). It is a data storage technique that supports architectures like the MapReduce framework discussed above and is meant to hold enormous amounts of structured as well as unstructured data. HDFS is based on the Google File System, a powerful, distributed file system that Google optimized to meet its voracious information processing needs. A short sketch of reading a file through the HDFS Java API follows the property list below.
Some basic properties of HDFS are:
  1. Block storage:
    Files are stored as blocks, with a default size of 128 MB.
  2. Reliable:
    Reliability is provided through Replication. Each block is replicated across two or more DataNodes; the default value is three.
  3. Centralized management:
    A single master NameNode coordinates access and metadata.
  4. No data caching:
    With such large volumes of data, caching brings little benefit and is not worth the management overhead.
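The sketch below shows, under simple assumptions, how a Java client might read a file already stored in HDFS through the standard org.apache.hadoop.fs API; the file path is a hypothetical placeholder, and the cluster settings are assumed to come from the usual core-site.xml/hdfs-site.xml configuration files.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: reading a file that has been stored in HDFS.
public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // loads core-site.xml / hdfs-site.xml if present
        FileSystem fs = FileSystem.get(conf);       // the NameNode resolves where the blocks live

        Path searchLog = new Path("/data/search-terms/part-00000");  // hypothetical file
        FileStatus status = fs.getFileStatus(searchLog);
        System.out.println("Block size: " + status.getBlockSize()
                + ", replication: " + status.getReplication());

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(searchLog), StandardCharsets.UTF_8))) {
            System.out.println("First line: " + reader.readLine());
        }
    }
}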
Components of Hadoop:
A Hadoop cluster basically contains various nodes. Each node runs several specialized software components and holds data.
These nodes can be of the following types:
a) Master node
There can be more than one master node instance; having more than one helps eliminate the risk of a single point of failure.
Major elements present in the master node are:
i) JobTracker: This process interacts with client applications and distributes MapReduce tasks to particular nodes within a cluster.
ii) TaskTracker: This process receives tasks (including Map, Reduce, and Shuffle) from a JobTracker.
iii) NameNode: This process stores a directory tree of all files in the Hadoop Distributed File System (HDFS) and keeps track of where the file data is kept within the cluster. Client applications contact the NameNode when they need to locate a file, or add, copy, or delete a file, and it returns the relevant DataNode (described next) information.
b) DataNodes
The DataNode stores data in HDFS and replicates data across the cluster. It also interacts with client applications once the NameNode has supplied the DataNode's address.
c) Worker nodes
These are the actual workhorses of a Hadoop deployment, the place where the work gets done. There can be dozens or even hundreds of worker nodes, which provide enough processing power to analyze terabytes of data. Each worker node includes a DataNode as well as a TaskTracker.
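Tying the pieces together, here is a sketch of the client application that submits the search-term counting job built from the mapper and reducer sketched earlier; in classic (Hadoop 1) MapReduce the JobTracker then schedules the resulting Map and Reduce tasks across the TaskTrackers on the worker nodes. The input and output paths are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver sketch: the client code that configures and submits the job.
public class SearchTermCountJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "search term count");
        job.setJarByClass(SearchTermCountJob.class);

        job.setMapperClass(SearchTermMapper.class);
        job.setCombinerClass(SearchTermReducer.class);  // optional: pre-aggregates counts per node
        job.setReducerClass(SearchTermReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS input/output locations.
        FileInputFormat.addInputPath(job, new Path("/data/search-terms"));
        FileOutputFormat.setOutputPath(job, new Path("/data/search-term-counts"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Using the reducer as a combiner, as sketched here, is what produces the per-node intermediate totals described in the Map phase example above.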
Hadoop Architecture:
A Hadoop environment consists of three basic layers:
a) Application layer/end user access layer
This is the point of contact for applications to interact with Hadoop. These applications may be custom-written solutions written in Java, Pig (a specialized, higher-level MapReduce language), or Hive (a specialized, SQL-based MapReduce language), or third-party tools such as business intelligence applications.
b) MapReduce workload management layer
Commonly known as the JobTracker (see the JobTracker description in the 'Master node' section above).
Work performed in this layer includes:
– scheduling and launching jobs
– balancing the workload among different resources
– dealing with failures and other issues.
c) Distributed parallel file systems/data layer
This layer is responsible for the actual storage of information. It commonly uses a specialized distributed file system, which in most cases is HDFS (Hadoop Distributed File System).
Other file systems that can be used include IBM's GPFS, the MapR filesystem from MapR Technologies, Kosmix's CloudStore, and Amazon's Simple Storage Service (S3).
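As a small, hedged sketch of this pluggability: Hadoop's FileSystem API selects an implementation from the URI scheme, so the same client code can point at HDFS or, with the appropriate connector (such as hadoop-aws for s3a) and credentials available, at an S3 bucket. The host name and bucket name below are placeholders.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Sketch: the storage layer is pluggable; the FileSystem implementation is
// chosen from the URI scheme (hdfs://, s3a://, file://, ...).
public class PluggableFileSystemExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Default cluster filesystem, usually set in core-site.xml via fs.defaultFS.
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

        // Hypothetical S3 bucket accessed through the s3a connector.
        FileSystem s3 = FileSystem.get(URI.create("s3a://example-bucket/"), conf);

        System.out.println(hdfs.getUri() + " and " + s3.getUri());
    }
}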
Brijesh Gogia