The Next Generation of Big Data: MapReduce2 with YARN

I am sure you have already heard about the next generation of MapReduce from the Hadoop project. It is popularly called MapReduce 2, or MR2.

MR2 introduces many enhancements, the biggest being a new component called YARN (Yet Another Resource Negotiator).
In the current MapReduce implementation, a single JobTracker takes care of two critical functions:

  1. Manage resources across the cluster and schedule jobs using that information.
  2. Keep track of job execution. This includes rerunning failed tasks, job checkpointing, etc.

In YARN, the JobTracker goes away, and these two responsibilities are split between two new components.

Resource Manager

There is one Resource Manager per Hadoop cluster, and it is responsible for scheduling jobs. It holds state information about all the nodes in the cluster, so it can make smarter scheduling decisions.
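
To make the Resource Manager's cluster-wide view a bit more concrete, here is a minimal sketch that asks it for cluster metrics and per-node reports. It assumes the Hadoop 2.x YarnClient API and a yarn-site.xml on the classpath that points at your Resource Manager; it is an illustration, not the official way the Resource Manager itself works.

    import java.util.List;

    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.api.records.YarnClusterMetrics;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ClusterState {
        public static void main(String[] args) throws Exception {
            // Picks up yarn-site.xml (including the Resource Manager address)
            // from the classpath.
            YarnConfiguration conf = new YarnConfiguration();

            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(conf);
            yarnClient.start();

            // Cluster-wide view held by the single Resource Manager.
            YarnClusterMetrics metrics = yarnClient.getYarnClusterMetrics();
            System.out.println("Node Managers: " + metrics.getNumNodeManagers());

            // Per-node state the Resource Manager uses for scheduling decisions.
            List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
            for (NodeReport node : nodes) {
                System.out.println(node.getNodeId()
                        + " containers=" + node.getNumContainers()
                        + " used=" + node.getUsed()
                        + " capacity=" + node.getCapability());
            }

            yarnClient.stop();
        }
    }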

Application Master

There is one Application Master (AM) per job. The Resource Manager launches one AM for each job, and from then on it is the AM's responsibility to see the job through to completion. This takes a lot of work off the Resource Manager, so it can scale to many more nodes in the cluster and many more concurrent jobs.
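
To give a feel for what it means that the AM owns the job, below is a minimal sketch of an Application Master's conversation with the Resource Manager. It assumes the Hadoop 2.x AMRMClient API and would normally run inside a container that the Resource Manager launched for the job; the container sizes, request count, and empty host/tracking-URL values are placeholders, and actually launching work on the allocated containers (via NMClient) is omitted.

    import java.util.List;

    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class SimpleAppMaster {
        public static void main(String[] args) throws Exception {
            YarnConfiguration conf = new YarnConfiguration();

            // Client the AM uses to talk to the Resource Manager.
            AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
            rm.init(conf);
            rm.start();

            // Tell the Resource Manager this AM is alive
            // (placeholder host, RPC port and tracking URL).
            rm.registerApplicationMaster("", 0, "");

            // Ask for two 1 GB / 1 vcore containers anywhere in the cluster.
            Resource capability = Resource.newInstance(1024, 1);
            Priority priority = Priority.newInstance(0);
            for (int i = 0; i < 2; i++) {
                rm.addContainerRequest(
                        new ContainerRequest(capability, null, null, priority));
            }

            // Heartbeat until the Resource Manager hands back the containers;
            // launching work on them is left out of this sketch.
            int allocated = 0;
            while (allocated < 2) {
                List<Container> containers =
                        rm.allocate(0.1f).getAllocatedContainers();
                allocated += containers.size();
                Thread.sleep(1000);
            }

            // From here on, finishing the job is the AM's problem, not the RM's.
            rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
            rm.stop();
        }
    }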

A few points to know about this new architecture:

  • The Application Master aggressively writes checkpoint state to HDFS, which increases the load on HDFS. This state is used for job recovery: if an Application Master fails, the job can be restarted from its last checkpoint.
  • The TaskTracker is replaced by the NodeManager (more about this in a future blog). There is an option to write NodeManager logs to HDFS, so logs end up in a central place and debugging becomes easier. This puts further stress on the HDFS cluster.
  • YARN is no longer tied to MapReduce; it can host other distributed computing frameworks as well.
  • Storm, a real-time distributed computing platform, is one example: Yahoo has open-sourced a Storm-on-YARN integration that runs it on top of YARN.
  • This version also introduces web services for querying Hadoop cluster status, so you no longer need to scrape web pages to automate things (see the sketch after this list).
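
As an example of those web services, here is a small sketch that fetches cluster metrics as JSON from the Resource Manager's REST endpoint. It assumes the default Resource Manager web port (8088) and the /ws/v1/cluster/metrics resource; rm-host is a placeholder for your Resource Manager's hostname.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ClusterMetricsWs {
        public static void main(String[] args) throws Exception {
            // Replace rm-host with your Resource Manager's hostname;
            // 8088 is the default web/REST port.
            URL url = new URL("http://rm-host:8088/ws/v1/cluster/metrics");

            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Accept", "application/json");

            // The response is JSON describing apps, containers, memory and nodes,
            // so scripts can parse it instead of scraping the web UI.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }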

Please share the article if you liked it, and let me know your thoughts in the comments below.


