CS590N Project/Homework 1


Project Overview

For this project, you will be implementing a text search engine using Hadoop, an Apache Software Foundation implementation of map/reduce. You will create map/reduce jobs to build a search index from text documents, and a job which uses that index to perform a search. Information on how to use Hadoop and our cluster follows the project definition. Some parts of the project description require at least a basic familiarity with the Hadoop architecture and with map/reduce in general.

The search engine you will be creating searches source documents for English-language sentences (subject to some significant simplifying assumptions) that contain all of a given list of words. These words may occur in any order within the sentence. In addition to specifying where these sentences can be found in the source documents, the engine will return the matching sentences to the user.

For the sake of simplicity, a sentence is defined to start at the termination of the previous sentence, and consist of all characters up to and including one or more instances of the sentence-ending characters [.!?]. Leading and trailing characters from the set [][(){}"'`-] are to be ignored for the purpose of this index. As an example, the paragraph below would be split into the following list:

At that moment a loud voice, the voice of a man whose heart was inaccessible to fear, was heard. To this voice responded others not less determined. "Is everything thrown out?" "No, here are still 2,000 dollars in gold." A heavy bag immediately plunged into the sea.

At that moment a loud voice, the voice of a man whose heart was inaccessible to fear, was heard.
To this voice responded others not less determined.
Is everything thrown out?
No, here are still 2,000 dollars in gold.
A heavy bag immediately plunged into the sea.

Note that these rules allow for surprising divisions, such as the sentence:

"That will be three," replied Pencroft; "and with Herbert and me five. But the balloon will hold six--"

... which would be stored as:

That will be three," replied Pencroft; "and with Herbert and me five.

As parsing natural language is an extremely difficult task, do not worry about such errors.
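
To make the sentence rules concrete, a splitter following them might be sketched in Java roughly as below. This is only an illustration: the class and method names are not required, it does not track the byte offsets you will need when reporting matches, and it also trims whitespace at the ends of each sentence, which the rules above do not spell out.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SentenceSplitter {
    // A sentence runs up to and including a run of the terminators [.!?].
    private static final Pattern SENTENCE = Pattern.compile("[^.!?]+[.!?]+");
    // Characters ignored at the beginning and end of each sentence.
    private static final String IGNORED = "[](){}\"'`-";

    private static boolean ignorable(char c) {
        // NOTE: trimming whitespace here is an assumption; the rules above
        // name only the characters in IGNORED.
        return Character.isWhitespace(c) || IGNORED.indexOf(c) >= 0;
    }

    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<String>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            String s = m.group();
            int start = 0;
            int end = s.length();
            while (start < end && ignorable(s.charAt(start))) {
                start++;
            }
            // The [.!?] terminators themselves are not in the ignored set,
            // so they are kept.
            while (end > start && ignorable(s.charAt(end - 1))) {
                end--;
            }
            sentences.add(s.substring(start, end));
        }
        return sentences;
    }
}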

Likewise, we will define a "word" as a series of characters consisting of only characters from the set [a-zA-Z'], and beginning and ending only with characters from the set [a-zA-Z], or a series of digits. (That is to say, [a-zA-Z]|[0-9]+|[a-zA-Z][a-zA-Z']*[a-zA-Z], as a regular expression.) Your index should cover all such sequences in the source text, even if they are not whitespace-separated. Search terms are to be case-insensitive.
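
As a sketch only, that definition could be expressed with java.util.regex as below. The class name is illustrative; note that the multi-character alternative is listed first, because Java tries alternatives in order and the single-letter alternative would otherwise break longer words apart.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordScanner {
    // Equivalent to [a-zA-Z]|[0-9]+|[a-zA-Z][a-zA-Z']*[a-zA-Z], but with the
    // multi-character alternative first so that ordered alternation does not
    // stop after a single letter.
    private static final Pattern WORD =
        Pattern.compile("[a-zA-Z][a-zA-Z']*[a-zA-Z]|[a-zA-Z]|[0-9]+");

    public static void printWords(String sentence) {
        Matcher m = WORD.matcher(sentence);
        while (m.find()) {
            // Search terms are case-insensitive, so normalize to lower case.
            System.out.println(m.group().toLowerCase());
        }
    }
}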

Project Details

Your project will be divided into three parts.

Part One

The first part of your project is a prose description of your implementation. It should specify the indices you build in Part Two, as well as how you use them in Part Three to perform your search. Additionally, it should explain the structure of your code at a high level, including information such as the map/reduce operations used for building and processing your indices. Please do not include detailed class or code documentation here, but do describe each map operation, reduce operation, record reader, etc. at a "black box" level. For example, for the sample WordCount class, you might say:

WordCount uses a WordMap map operation which takes a WritableComparable and a Text object representing a line of text, and outputs a <Text, IntWritable> tuple for each word in that line of text with the IntWritable having a value of 1. It then reduces this using the CounterReduce reduction, which accepts any WritableComparable object and a vector of IntWritable occurrence counts, and sums these counts. The output format is a directory of files containing whitespace-separated words and their counts, one word and count pair to a line.
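
By way of illustration, classes matching that description might look roughly like the following under the Hadoop 0.18 org.apache.hadoop.mapred API. This is only a sketch; the actual sample code in wc.jar may differ in its details.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Emit <word, 1> for each whitespace-separated word on this line.
        StringTokenizer tok = new StringTokenizer(line.toString());
        while (tok.hasMoreTokens()) {
            output.collect(new Text(tok.nextToken()), ONE);
        }
    }
}

class CounterReduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterator<IntWritable> counts,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Sum the occurrence counts for this word and emit the total.
        int sum = 0;
        while (counts.hasNext()) {
            sum += counts.next().get();
        }
        output.collect(word, new IntWritable(sum));
    }
}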

We suggest that you design and document your index creation and search procedures first, to ensure that you have carefully considered the index you will be implementing and understand how you will use it to perform the search.

This description must also document the top-level class names you chose for each of the two tasks implemented in Part Two and Part Three, as used by the hadoop jar command.

Part Two

For the second part of your project you will implement a Hadoop task which, given a set of source documents and an index directory, builds an index over those documents which can be used to perform the search specified in the overview. The format of this index is up to you, but it should be designed to be appropriate for the map/reduce architecture. In your description in Part One, you must explain the format of your index, as well as compare and contrast searching your index using map/reduce with a simple single-pass search through the source files. Explain why your index is appropriate for the map/reduce process.

Your task must accept two command line arguments. The first argument is the name of an HDFS directory which contains source documents to be indexed. The second argument is the name of a non-existent HDFS directory which your task will create and use to store the index it builds.
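
For reference, a minimal driver along these lines might wire the two arguments into a JobConf as sketched below, using the Hadoop 0.18 org.apache.hadoop.mapred API. The class name BuildIndex is illustrative, and the mapper and reducer classes (left as comments) are for you to supply; this is not a required structure.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class BuildIndex {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(BuildIndex.class);
        conf.setJobName("build-index");

        // args[0]: existing HDFS directory containing the source documents.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        // args[1]: non-existent HDFS directory that will hold the index.
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        // conf.setMapperClass(...);   // your index-building mapper
        // conf.setReducerClass(...);  // your index-building reducer

        JobClient.runJob(conf);
    }
}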

Part Three

Part Three of your project is to implement a Hadoop task which consults the document index created by Part Two and the original source documents, and performs the search. It must create as its output a directory of files containing matches in the source text, as lines of the form:

<source document name>:<byte offset> <sentence>
        

For example, assuming that the sample text in the project description is in the file mysterious_island.txt, and a search is performed for the terms "voice" and "not", the output file might look like:

mysterious_island.txt:12345 To this voice responded others not less determined.
        

There should be one line for each matching sentence in the source text. You need not worry about result ordering, or division among output files. You may use the Hadoop-provided TextOutputFormat to write your results if you like.

Your task must accept four or more command line arguments. The first argument is, as in Part Two, the name of an HDFS directory containing the source documents indexed in Part Two. The second argument is the output directory specified to the index task in Part Two. The third argument is a non-existent directory which your task should create and fill with the search results. The fourth and remaining arguments are the words to be searched for in the text, one word per argument. As an example, the command used to perform the sample search (more information below, in Working Environment, on executing Hadoop tasks) for "voice" and "not" might look like:

hadoop jar search.jar edu.purdue.cs.eblanton.SentenceSearch /user/cs590n/input /user/eblanton/index /user/eblanton/output voice not
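
One possible (but not required) way to make the search terms available to your map and reduce tasks is to store them in the JobConf from your driver; tasks can then read them back in their configure(JobConf) methods. The property name below is made up for illustration.

import org.apache.hadoop.mapred.JobConf;

public class SearchTerms {
    // Store args[3] and onward in the job configuration; a task can later
    // retrieve them with job.get("search.terms") in configure(JobConf).
    // "search.terms" is an arbitrary property name chosen for this example.
    public static void store(JobConf conf, String[] args) {
        StringBuilder terms = new StringBuilder();
        for (int i = 3; i < args.length; i++) {
            if (terms.length() > 0) {
                terms.append(' ');
            }
            terms.append(args[i].toLowerCase());  // search is case-insensitive
        }
        conf.set("search.terms", terms.toString());
    }
}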

Working Environment

This information is not intended to be a replacement for reading the Hadoop documentation available at apache.org! You will want to read some of the documentation available there as necessary. This section is intended only to acquaint you with the configuration of our Hadoop cluster, and to get you up and running just far enough to know where to find the next information you will need.

The Hadoop service is running on four machines in the MC cluster, mc13-mc16.cs.purdue.edu. It is on these machines that your projects will be evaluated. We are using Hadoop 0.18.0.

In order to use the Hadoop installation on these machines, you must first prepare your environment. Assuming you are using the bash shell, the easiest way to do this is to put the following line in your ~/.bashrc:

. ~cs590n/bin/env.sh

The leading period is intentional and necessary. This script will set up a number of environment variables necessary for proper Hadoop operation, as well as add the required Hadoop binaries to your path. Once you have done this (and then either logged out and back in, or executed the above command in your shell), you can run the hadoop binary. This binary allows you to manipulate the cluster HDFS, execute Hadoop tasks, and perform other operations on the cluster. Running the command hadoop with no arguments will display a list of subcommands. As this output notes, most subcommands will provide additional help when executed with no arguments.

If you normally use a shell other than bash, you can either run bash before using hadoop, or examine env.sh and translate it appropriately for your shell of choice. Note that it may be updated as the semester proceeds. To change your default shell to bash if you so desire, run the chsh command, and enter /usr/local/bin/bash as your new shell.

The Hadoop cluster uses a filesystem called HDFS. Your Hadoop tasks will use this filesystem in performing their work. Each student has a home directory on HDFS, in /user/$USER. Some resources, including sample input files, are available to you in /user/cs590n. Directories of sample input files are in /user/cs590n/input. The HDFS is not normally readable or writable from the Unix environment; in order to manipulate it outside of your tasks, you will use the hadoop dfs command. Many common Unix utilities are implemented as hadoop dfs -<command>; for example, you will find that -ls, -cp, -mv, etc. behave more or less as you would expect. There are some caveats, however. For example, the -rm command will not accept an -r argument, and the -rmr command must be used in its place. Some common commands such as rmdir do not exist. You should read the help for hadoop dfs as necessary. To copy files from the Unix filesystem to HDFS, use hadoop dfs -put <source> <destination>. Similarly, -get will retrieve files created by your tasks.
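
For example, a short session copying a file into HDFS and examining job output might look like the following (the file and directory names are only examples; relative paths are resolved against your HDFS home directory):

hadoop dfs -ls /user/cs590n/input
hadoop dfs -put mydocument.txt mydocument.txt
hadoop dfs -cat mydocument.txt
hadoop dfs -get output/part-00000 results.txt
hadoop dfs -rmr output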

Hadoop 0.18.0, which we are using, requires Java 1.5. Java 1.5 is available to you on the CS machines in /p/java-1.5.0_16/. You may wish to add /p/java-1.5.0_16/bin to your PATH.

We will be using mc12.cs.purdue.edu as a compilation host. DO NOT use mc13-mc16 for compilation, as this could interfere with your classmates' cluster operations. You may use any machine with Java 1.5 and the Hadoop 0.18.0 class files available, if you like. To compile a .java file which uses Hadoop on mc12, run:

javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar <source file>
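
A complete build sequence might look like the following, where the source file and jar names are only examples:

mkdir classes
javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d classes SentenceSearch.java
jar cf search.jar -C classes .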

Sample Code

A sample Hadoop task based on the Hadoop map/reduce tutorial is available for your perusal in mapred.tar.gz. After extracting the sample code, you should find the file wc.jar in the top-level directory of the source. This file contains the compiled Hadoop task, which you may execute without modification. It requires two command line arguments: a directory containing input files, and a non-existent directory to be created to hold its output. The input directory should be an HDFS path containing text files to be processed; any of the directories in /user/cs590n/input are appropriate. The output directory should be an HDFS path whose final element does not exist (it will be created). For example:

hadoop jar wc.jar edu.purdue.cs.eblanton.WordCount /user/cs590n/input/fp output

This will create a directory named output in your home directory on the HDFS and populate it with output files containing lines of the format "<word> <count>". You can see the first such file (for that directory, there will be only one) by running hadoop dfs -cat output/part-00000. (That example will be quite long; you will probably want to pipe it to a pager, as fp/ contains the entire text of the Federalist Papers.)

When you have finished reading through the Hadoop map/reduce tutorial, you may wish to examine this sample code to see how it differs from the tutorial; it is more general than the tutorial code, and will hopefully help you understand how some of the various classes fit together. There are javadoc comments inline, and prebuilt documentation in the docs/ directory of the source, which you may find helpful.

Submitting your project

You will submit your project using the turnin command. To prepare your project for submission, build the jar files for your Hadoop tasks and remove all other compiled .class files. For the turnin command, specify the class as cs590n and the project as project1, as follows:

turnin -ccs590n -pproject1 <directory>

After submitting your project, verify that the submission was successful with the turnin -v command.

References

For assistance, you will wish to refer to some or all of the following:

In addition, as the project progresses, we will be maintaining a list of Frequently Asked Questions. Please refer to this FAQ as you have difficulties; if other students have encountered the same difficulty before you, the solution may be documented there. This FAQ document should be your first line of support for questions relating to this project, and it will be updated regularly as the project progresses. Please also use cs590n@cs.purdue.edu to email questions; this is a discussion list which includes the whole class.

Updated: September 19, 2008

Copyright 2008, E. Blanton, C. Killian