Project Overview and Motivation

MapReduce provides a software framework for specifying and executing programs in large-scale distributed environments using maps and folds. While a number of data processing applications have been demonstrated in MapReduce, these tend, with a few noted exceptions, to have structured or dense dependencies. This project extends the range of MapReduce applications by investigating its suitability for large, sparse, unstructured, real-world graph analysis problems. Specifically, it aims to demonstrate that the MapReduce framework, with suitable semantic enhancements, is capable of high performance and scalability for a variety of graph-structured applications on diverse platforms.
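As a minimal illustration of the map-and-fold style applied to a graph, the sketch below computes vertex degrees from an edge list. It is a single-process toy, not the project's toolkit or any particular MapReduce runtime: the function names (map_edge, shuffle, reduce_degree, vertex_degrees) and the sample edge list are hypothetical, and the in-memory shuffle only stands in for the grouping a distributed runtime would perform between phases.

```python
from collections import defaultdict
from functools import reduce


def map_edge(edge):
    """Map step: emit one (vertex, 1) pair per endpoint of an undirected edge."""
    u, v = edge
    return [(u, 1), (v, 1)]


def shuffle(pairs):
    """Group emitted values by key (stand-in for the runtime's shuffle phase)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_degree(values):
    """Reduce step: fold the per-vertex counts into a single degree."""
    return reduce(lambda acc, x: acc + x, values, 0)


def vertex_degrees(edges):
    """Run map, shuffle, and reduce over an edge list (illustrative only)."""
    emitted = [pair for edge in edges for pair in map_edge(edge)]
    return {vertex: reduce_degree(vals) for vertex, vals in shuffle(emitted).items()}


# Hypothetical sample graph, used only for illustration.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
degrees = vertex_degrees(edges)
```

Even in this small case the number of values reaching each reducer depends on the degree of the vertex, which hints at why sparse, irregular graphs may load a MapReduce system less evenly than dense, structured data.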

Project Goals

The project investigates the following questions:

  • How can highly unstructured graph-based formalisms be cast in the MapReduce framework?
  • How effectively can these specifications leverage the MapReduce infrastructure?
  • How can these environments be enhanced to provide the semantic expressiveness necessary for programmability and scalable performance?
  • How can we integrate these analysis tasks into comprehensive scientific resources usable by the wider applications community?

Answering these questions will yield three outcomes: an efficient and scalable MapReduce graph analysis toolkit; a comprehensive resource for comparative analysis of biological networks, integrated into the Biochemical Pathways Workbench; and enhancements to MapReduce semantics, along with efficient implementations on wide-area clusters, multicore, and multicore/SMP platforms.