ReadWriteCloud

LinkedIn Opens DataFu: A Library for Working with Hadoop and Pig

LinkedIn has been making heavy use of Apache Hadoop and Pig with its People You May Know and skills features (among others), and has pulled together a lot of User Defined Functions (UDFs) for Pig in the process.

On January 10th, LinkedIn's Matthew Hayes announced the release of DataFu on the LinkedIn engineering blog. DataFu is a collection of UDFs that LinkedIn has developed for data mining and statistics, and it is available on GitHub under the Apache 2.0 license.

The DataFu library has been tested against Pig 0.9. The library provides a number of functions for running PageRank, performing operations on Pig data bags, filtering input data and more.

Hayes' post walks through an example scenario that uses DataFu to compute quantiles from a fake data set, so interested developers can jump in and try the library out immediately. The project also includes a set of unit tests for each UDF.
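
For readers who have not met quantiles before, here is a rough sketch in plain Python of what such a computation does. It is not DataFu code and does not reproduce Hayes' Pig example: the function name `quantiles`, the simple "pick the element at a scaled position" rule, and the randomly generated input are all assumptions made for illustration, and DataFu's own UDF may use a different estimation method.

```python
import math
import random
from typing import List, Sequence


def quantiles(values: Sequence[float], points: Sequence[float]) -> List[float]:
    """Return one value per requested quantile point (each between 0 and 1).

    Hypothetical helper for illustration only: it sorts the input and picks
    the element at position floor(q * (n - 1)), with no interpolation.
    """
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    last = len(ordered) - 1
    result = []
    for q in points:
        if not 0.0 <= q <= 1.0:
            raise ValueError("quantile points must lie in [0, 1]")
        result.append(ordered[math.floor(q * last)])
    return result


if __name__ == "__main__":
    # Fake data set: 1,000 draws from a normal distribution (assumed values).
    rng = random.Random(42)
    fake = [rng.gauss(100, 15) for _ in range(1000)]
    # Minimum, quartiles, and maximum of the fake data.
    print(quantiles(fake, [0.0, 0.25, 0.5, 0.75, 1.0]))
```

The same idea carries over to Pig: the data has to be ordered before positions in it mean anything, which is why a quantile computation over a very large data set is a natural fit for a reusable UDF.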

It's impressive to see just how much work is coming out of the Hadoop community these days. Any projects that you're keeping an eye on?
