Facebook boosts Hadoop with scheduling muscle

Facebook's Corona will make better use of clusters than MapReduce does, the company claims

Facebook has beaten some of the limitations of the Apache Hadoop data processing platform, its engineers assert.

Facebook has released source code for scheduling workloads on the Apache Hadoop data processing platform. Engineers at the social networking company claim this program, called Corona, is superior to Hadoop's own scheduler in MapReduce.

In tests, the Corona scheduler was able to put more than 95 percent of a cluster to work on jobs, whereas MapReduce could utilize, at the most, 70 percent of a cluster, Facebook said.

By using the clusters more efficiently, Facebook is able to analyze more information with existing hardware. Corona offers a number of additional benefits as well, including faster loading of workloads and a more flexible way of upgrading the software.
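To put the utilization figures in perspective, the short calculation below estimates how much more work the same hardware could do. It is a rough sketch that assumes the 70 percent and 95 percent peaks are measured on the same cluster and that utilization translates directly into completed work; the function name is illustrative and does not come from Facebook.

```python
def relative_capacity_gain(old_utilization, new_utilization):
    """Return the fractional increase in usable capacity on fixed hardware."""
    return (new_utilization - old_utilization) / old_utilization


# Peak utilization figures reported by Facebook: 70% (MapReduce) vs. 95% (Corona).
gain = relative_capacity_gain(0.70, 0.95)
print(f"Additional usable capacity: {gain:.0%}")
```

Under those assumptions, the move from 70 percent to 95 percent corresponds to roughly a third more usable capacity without adding servers.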

Facebook announced the release of Corona in a posting by a number of Facebook engineers who contributed to the software, including Avery Ching, Ravi Murthy, Dmytro, Ramkumar Vadali and Paul Yang.

Facebook's operations and users generate more than half a petabyte of data each day, which is analyzed by more than 1,000 Facebook personnel, mostly by using the Apache Hive query engine.

Typically, analysis jobs running on Hadoop are scheduled through the MapReduce framework, which breaks jobs into multiple parts so they can be executed across many computers in parallel.
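The idea of breaking a job into parts can be illustrated with a toy word count. This is a minimal single-machine sketch of the map and reduce steps using Python threads, not Hadoop's API; the function names and sample input are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_part(lines):
    """Map step: count the words in one part of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def word_count(lines, parts=4):
    # Split the input into `parts` slices that can be processed independently.
    chunks = [lines[i::parts] for i in range(parts)]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = list(pool.map(map_part, chunks))
    # Merge the per-part counts into a single result.
    return reduce(merge_counts, partials, Counter())


result = word_count([
    "hadoop runs jobs",
    "corona schedules jobs",
    "jobs run in parallel",
])
```

On a real cluster the parts run on different machines, and it is the scheduler's job to hand those parts out fast enough to keep every machine busy.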

Facebook ran into issues using MapReduce, however. The scheduler could not keep all the computers supplied with work. "At peak load, cluster utilization would drop precipitously due to scheduling overhead," the blog stated.

Another issue with MapReduce is that the software typically delayed jobs before executing them, the Facebook team said. In addition, the framework offered no easy way of scheduling non-MapReduce jobs on the same cluster, and software upgrades required system downtime, which meant stopping any jobs that were running at the time.

Facebook engineers developed the Corona scheduler so it would not have these limitations. The software would scale more easily and make better use of clusters. It would offer lower latency for smaller jobs and could be upgraded without disrupting the system.

Facebook is now in the process of moving MapReduce workloads onto clusters equipped with Corona. Initially, the social networking company deployed the software on 500 nodes. Once Corona proved effective, it was tasked with all non-mission-critical workloads, including larger workloads involving 1,000 or more servers. Now, the company is deploying Corona for all Hadoop workloads.

In tests, Corona has shown itself to be more effective than MapReduce across a number of metrics, Facebook asserted. In performance tests, Corona took around 55 seconds to fill an empty workspace, whereas MapReduce took 66 seconds, a 17 percent improvement. Jobs are started more quickly now as well: within 25 seconds, down from 50 seconds with MapReduce.
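The percentages follow directly from the reported timings. The snippet below is a simple check of that arithmetic; the helper name is illustrative, and the inputs are the approximate figures quoted above.

```python
def percent_reduction(before, after):
    """Return how much smaller `after` is than `before`, as a percentage."""
    return 100 * (before - after) / before


# Time to fill an empty workspace: 66 seconds (MapReduce) vs. 55 seconds (Corona).
fill = percent_reduction(66, 55)

# Job start latency: 50 seconds (MapReduce) vs. 25 seconds (Corona).
start = percent_reduction(50, 25)

print(f"Fill time reduced by about {fill:.0f} percent")
print(f"Job start latency reduced by about {start:.0f} percent")
```

The fill-time figure works out to just under 17 percent, and the job start latency is cut in half.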

Corona is not the only alternative to MapReduce. Facebook also looked at Yarn, which is Apache's overhaul of MapReduce, planned for release as MapReduce 2.0. Facebook engineers were unsure Yarn could execute jobs as large as those of the social networking site, however.

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com
