CS590N Project/Homework 2


Project Overview

For the second project, you will be implementing the Google Cluster Architecture (GCA) using Mace: the Google Web Service (GWS), an index search with multiple, replicated shards, and document retrieval from multiple, replicated shards.

Your GWS will be searching and serving documents from NSDI'08. Search results will be returned in "pages", like a Google search, with 10 documents per page, starting from the most relevant and working toward the least relevant documents, including "snippets" from the documents.

You are expected to be passingly familiar with the Mace HOWTO document from the Mace source.

Project Details

This project will be divided into two parts.

Part One

Part One of your project will be a high-level prose description of your implementation. You should include a description of the interface between your GWS and the index and document servers, as well as any major design decisions you make. As with Project 1, please do not include detailed class or code documentation in this description.

This document should be placed in the top-level directory of your project submission as plain text, PostScript, or PDF, named project2.{txt,ps,pdf} as appropriate.

Part Two

Part Two is the completed GCA implementation.

Your code needs to support only one instance of the GWS, and you may assume that the GWS does not fail or become unavailable to the Index and Document servers. Index and Document servers, however, must register themselves dynamically with the GWS, and the GWS must be robust to their coming and going (including unexpected failures). You may not hard-code or pre-configure service addresses, other than the location of the GWS. (Configuration is discussed below in the Working Environment section.) This means that the GWS must maintain membership state for the various services it manages.
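One possible (not required) shape for that membership state is a per-role table keyed by shard ID, updated when a server registers and pruned when it fails or departs. The names below are purely illustrative and are not part of the provided code:

#include <map>
#include <string>

// Illustrative only: one entry per registered index or document server.
struct ServerRecord {
    std::string address;   // "host:port" reported at registration
    int shardId;           // which shard this server holds
};

// For each shard ID, the set of live servers holding that shard, keyed by
// address. The GWS would insert on a registration message and erase on a
// failure notification or timeout, so selection only ever sees servers
// believed to be reachable.
typedef std::map<int, std::map<std::string, ServerRecord> > ShardTable;

struct MembershipState {
    ShardTable indexServers;     // index servers, grouped by shard
    ShardTable documentServers;  // document servers, grouped by shard
};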

Each of the three services (GWS, index service, and document service) must run independently, in separate processes on the cluster. Additionally, they may or may not run on the same machine. The GWS process must not store locally any index or document data, and must provide all of its results from queries to the index and document servers.

While your service will be serving a rather small set of documents, it should behave as if it were serving an enormous index. With this in mind, your index searches must touch only one copy of any given index shard for each keyword searched, and document servers should fetch only documents which appear in the active page of results.
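For example, when a query arrives, the GWS might pick exactly one live replica for each index shard before fanning the keywords out. The helper below is a hypothetical sketch built on the illustrative ShardTable above; any selection policy (round-robin, random, least-loaded) is acceptable so long as only one copy of each shard is contacted per keyword.

#include <cstdlib>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// Hypothetical: choose one live server for each shard. Returns an empty
// vector if some shard currently has no registered server.
std::vector<ServerRecord> chooseOnePerShard(const ShardTable& servers,
                                            int totalShards) {
    std::vector<ServerRecord> chosen;
    for (int shard = 0; shard < totalShards; ++shard) {
        ShardTable::const_iterator it = servers.find(shard);
        if (it == servers.end() || it->second.empty()) {
            return std::vector<ServerRecord>();   // shard unavailable
        }
        // Simple policy: a random replica, to spread load across copies.
        std::map<std::string, ServerRecord>::const_iterator r = it->second.begin();
        std::advance(r, std::rand() % it->second.size());
        chosen.push_back(r->second);
    }
    return chosen;
}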

As the document set being served will be stable, search results should also be stable. Performing the same search multiple times should yield the same search results in the same order. This includes results on successive pages of the search — documents should not repeat or rearrange for these later pages.

Document relevance is defined as the fraction of keywords found in a given document. Search results should be returned in order of relevance (with a stable tie-breaker function, as required above). Documents which contain none of the search keywords are not relevant, and should be omitted from search results.
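As a concrete example, a document matching two keywords of a three-keyword query has relevance 2/3 and ranks above a document matching only one. A minimal ranking sketch (field and function names are illustrative, not from the provided code) might look like:

#include <algorithm>
#include <string>
#include <vector>

// Illustrative result record accumulated at the GWS.
struct Result {
    std::string file;    // document identifier (filename)
    int shard;           // document shard, used later to fetch the snippet
    double relevance;    // fraction of the query's keywords this document matched
};

// matched: number of query keywords found in the document;
// total:   number of keywords in the query.
double relevanceOf(int matched, int total) {
    return total == 0 ? 0.0 : static_cast<double>(matched) / total;
}

bool notRelevant(const Result& r) { return r.relevance == 0.0; }

// Decreasing relevance, with a deterministic tie-breaker (filename) so the
// same query always produces the same ordering and the same pagination.
bool moreRelevant(const Result& a, const Result& b) {
    if (a.relevance != b.relevance) return a.relevance > b.relevance;
    return a.file < b.file;
}

void rankResults(std::vector<Result>& results) {
    results.erase(std::remove_if(results.begin(), results.end(), notRelevant),
                  results.end());
    std::sort(results.begin(), results.end(), moreRelevant);
    // Page n of the search is then results[10*n .. 10*n+9].
}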

Rather than building a context-sensitive "snippet" as the Google search engine does, your search will return a snippet which is merely the first two kilobytes of the document. You do not need to worry about what this snippet looks like, whether it ends on a word or sentence boundary or not, etc. Simply return the first 2048 bytes.

Your implementation must support multiple simultaneous lookups, limited only by the available index and document servers. Searches should not exhibit head-of-line blocking behavior.
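One common way to allow interleaved searches is to tag every message the GWS sends with a query identifier and keep per-query state until all responses arrive, rather than processing one search to completion before starting the next. A hypothetical shape for that state (reusing the illustrative Result type above):

#include <stdint.h>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical per-query bookkeeping at the GWS. Each outstanding search has
// its own entry, so responses for different queries can interleave and a slow
// shard delays only its own query.
struct PendingQuery {
    std::vector<std::string> keywords;   // the original search terms
    int page;                            // which page of 10 results was requested
    std::set<int> shardsOutstanding;     // index shards that have not yet replied
    std::vector<Result> partial;         // results accumulated so far
};

// Keyed by a locally generated query ID carried in every message exchanged
// with the index and document servers for that search.
typedef std::map<uint64_t, PendingQuery> QueryTable;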

For the sake of simplicity, all of your nodes will run the same code, with node roles being determined by runtime configuration. (This is further discussed below in the Working Environment section.)

All communication should be between the GWS master node and non-master (index or document) nodes. Non-master nodes should not communicate among themselves. The GWS master node will be responsible for coordination of the index and document servers, and selection of these servers for keyword searches and document retrieval.
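The exact messages are part of your Part One design, but one possible shape for these exchanges, written as plain structs rather than Mace message syntax, is sketched below; all names are illustrative.

#include <stdint.h>
#include <string>
#include <vector>

// Illustrative request/response pairs for the star topology: the GWS sends
// requests to index and document servers and collates their replies.
struct IndexSearch {        // GWS -> index server
    uint64_t queryId;
    std::vector<std::string> keywords;
};
struct IndexResult {        // index server -> GWS
    uint64_t queryId;
    int shardId;
    std::vector<std::string> matches;    // identifiers of matching documents
};
struct DocFetch {           // GWS -> document server
    uint64_t queryId;
    std::vector<std::string> files;      // only the documents on the active page
};
struct DocSnippets {        // document server -> GWS
    uint64_t queryId;
    std::vector<std::string> snippets;   // first 2048 bytes of each requested document
};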

You will not be responsible for building a document index or retrieving documents from the filesystem. In lieu of this, we have provided libproject2data.a and project2data.h, in ~cs590n/mace/lib and ~cs590n/mace/include, respectively. Your index and document searches will be in-memory searches of maps provided by this library. Usage information for this library is below in the Working Environment section.

For Part Two, you should turn in all of the code necessary to build your GCA application. This will include, at minimum, the code for each of the three services (GWS master, index server, and document server) and the interface and handler interface for the index and document servers. Your project should compile cleanly when configured with CMake and make is invoked with no arguments in the top-level source directory, and leave an application binary named gca. If minimal additional configuration is required for compile (other than that performed by CMake), document and justify it in your documentation from Part One.

If you choose, you may submit your design document from Part One at any time within the first week of the project in order to receive feedback from the instructors on your design decisions.

Working Environment

Document and Index Shards

The documents your service will search, and the search index, are provided for you in libproject2data.a. This library implements two functions (defined in project2data.h) to provide this information to your index and document servers. They are:

void populate_index_shard(IndexMap &indexmap,
                          int total_shards, int shardid);

void populate_document_shard(DocumentMap &docmap,
                             int total_shards, int shardid);

Each of these functions takes a map to be populated, the total number of shards in the system (not shard servers, but shards — there may be multiple servers per shard), and the shard ID of the map being filled. Shard IDs start from zero and run to the number of shards minus one. The index shard map will be filled with a mapping from individual search keywords to document identifiers. The document shard map will be filled with a mapping from document identifiers to complete document text. The document identifier structure DocumentInfo is defined in DocumentInfo.h, and consists merely of a Mace auto-type containing a filename and the shard from which that file is served.

The include path and link statement are already handled by the provided CMake configuration. If you build on other machines, you will need to update the CMake configuration appropriately. You will need to #include <project2data.h> from your source file.
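For example, an index or document server might fill its map once at startup, using the shard parameters described below under Configuration. The IndexMap and DocumentMap typedefs come from project2data.h; everything else in this sketch is illustrative:

#include <project2data.h>

// Illustrative startup code for a non-master server. totalShards and shardId
// would come from the GCAShards and GCAShardID configuration variables
// described below.
void loadShard(bool isIndexServer, int totalShards, int shardId) {
    if (isIndexServer) {
        static IndexMap index;
        populate_index_shard(index, totalShards, shardId);
        // index now maps each keyword to the documents (in this shard)
        // that contain it.
    } else {
        static DocumentMap documents;
        populate_document_shard(documents, totalShards, shardId);
        // documents now maps each document identifier to its full text;
        // a snippet is simply the first 2048 bytes of that text.
    }
}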

Configuration

Mace applications have access to the Mace configuration file framework, which we will be using to configure the various services and roles in the Google Cluster Architecture. Because your code will be provided in one monolithic binary, run-time configuration through Mace configuration files will determine the behavior of your application. The address book example application contains code which uses the params:: interface, which you may find helpful.

Six configuration variables will be used to configure the behavior of your services. They are:

GCARole

The variable GCARole takes one of three values: master, index, or document. The value of this variable determines whether the Mace application takes the role of GWS master, index server, or document server. Only one machine will be configured as the master at any given time, but many machines may serve as document or index servers. Each application invocation can serve only one of the three duties, as directed by this configuration variable.

GCAMaster

GCAMaster contains the location of the master node's Mace service. It is used by every node except the master itself. It should be a string of the format "host:port", which can be used by the document and index servers to contact and register themselves with the GWS master node. See the variable ADDRESS_BOOK_SERVER in the address book client for an example of a similar configuration variable.

GCAShards

This integer variable represents the number of document and index shards. Both the index data (the keyword-to-document mappings) and the document data will be divided into this many shards. There should be at least this many index servers and at least this many document servers for the service to work properly.

GCAShardID

This is also an integer variable, ranging from 0 to GCAShards - 1. It represents the shard ID of this instance of document or index server, as passed to the functions populate_index_shard and populate_document_shard.

HTTP_PORT

This Mace configuration variable controls the port on which your master node serves XMLRPC clients. It is required for the master node, and ignored for the other services.

MACE_PORT

Each of your Mace services will bind to MACE_PORT on the local machine to receive Mace communications. Note that if you wish to run more than one service on the same physical machine (e.g., a document server instance as well as an index server instance), each of these services will have to have a unique MACE_PORT.

A sample configuration file for an Index server serving shard 2 of 4 (max shard ID of 3), with the GWS master running on mc13, follows:

GCARole=index
GCAMaster=mc13:11111
GCAShards=4
GCAShardID=2
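As a rough sketch, gcad might dispatch on these variables at startup roughly as follows. The params::get accessors and their signatures are assumptions here, not documented interfaces; take the exact calls from the address book example. loadShard is the illustrative helper from the Document and Index Shards section above.

#include <string>
#include "params.h"   // Mace parameter framework; exact header per the address book example

void configureFromParams() {
    // Variable names match the configuration described above; the accessor
    // style (params::get<T>) is an assumption, not a documented signature.
    std::string role   = params::get<std::string>("GCARole");    // master | index | document
    std::string master = params::get<std::string>("GCAMaster");  // "host:port" of the GWS
    int totalShards    = params::get<int>("GCAShards");
    int shardId        = params::get<int>("GCAShardID");

    if (role == "master") {
        // GWS: accept server registrations, serve XMLRPC clients on HTTP_PORT
    } else {
        // index or document server: load the local shard, then contact
        // 'master' to register with the GWS.
        loadShard(role == "index", totalShards, shardId);
    }
}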

Two configuration variables will be used to configure the GCA client application:

GCA_Master_IP

GCA_Master_IP is provided to clients of the GCA, and contains the IP address of the master node's XMLRPC service. It will be the same address as provided to GCAMaster for cluster services.

GCA_Master_PORT

This port is the XMLRPC service port of the GCA master node. It should be the same value as provided to that node in the HTTP_PORT parameter.

Each of the variables GCAMaster, HTTP_PORT, GCA_Master_PORT, and MACE_PORT should be selected from the 100 ports available to you in your three-digit-prefixed port namespace. The port number provided to GCAMaster and MACE_PORT on the master node should be the same; make sure that all hosts in the system agree on the same GCAMaster value. If your three-digit prefix is 111, 11100 through 11199 are available to you.

Starter Code

We have provided starter code via a subversion repository. From the mc machines, to obtain a copy of the code, issue the command:

svn co file:///homes/cs590n/svn/gca/trunk gca

This will create a directory gca with the starter code. It is expected that you will build your GCA within this directory, and submit it (source only) as the project. This directory contains a fully functional (dummy) GCA which, on any query, returns a single dummy response. Your first exercise should be to compile and run this code.

First, make sure cmake is in your path. The easiest way to do this is to use the course environment configuration script:

. /homes/cs590n/bin/env.sh

Next, create the build directory, and configure gca using CMake:

cd gca
mkdir build
cd build
cmake -D Mace_DIR:PATH=/homes/cs590n/mace/src/build ..

This should generate a number of files; when it completes, you can build the project using 'make'. The gca applications are built in the 'gcaapp' subdirectory: 'gcaclient' (the client query binary) and 'gcad' (the gca daemon).

To test the build, first run the gcad binary using the following command (substitute your port prefix for 111):

./gcad -GCAMaster localhost -GCARole master -GCAShardID 0 -GCAShards 1 -HTTP_PORT 11100 -MACE_PORT 11101

Next, open a new terminal and run the gcaclient binary:

./gcaclient -GCA_Master_IP localhost -GCA_Master_PORT 11100 -search "foo bar"

Assuming everything is working properly, you'll see this output:

[ DocResult(position=0, relevance=1, file=foo.txt, snippet=foo bar) ]

Note that in general, these parameters would be placed in a config file. They are shown here on the command line for ease of testing. If the search parameter provided to the client application contains more than one word, the words must be quoted together on the command line, but should not be quoted in a config file.
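For example, the parameters from the two commands above could be written as configuration files like the following (values copied from those commands; substitute your own port prefix). A gcad configuration:

GCARole=master
GCAMaster=localhost
GCAShardID=0
GCAShards=1
HTTP_PORT=11100
MACE_PORT=11101

And a gcaclient configuration (note that the multi-word search value is not quoted here):

GCA_Master_IP=localhost
GCA_Master_PORT=11100
search=foo bar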

Now you're ready to go. You should update the implementation in gca/GoogleWebServer/GoogleWebService.mac, add services to gca/DocServer and gca/IndexServer, and add interfaces to gca/gcainterfaces. You shouldn't need to modify the application at all, though you are welcome to look at it to see how it works. As you add new files and reference new services within services, you may have to run "make rebuild_cache" in the build directory to tell CMake to re-learn dependencies. If you have other compile errors or questions, feel free to check the FAQ or email the course listserv.

Submitting your project

You will submit your project using the turnin command. For the turnin command, specify the class as cs590n and the project as project2, as follows:

turnin -ccs590n -pproject2 <directory>

DO NOT submit binaries with your project; be sure to run make clean and clean up any leftover files from development and debugging before running turnin.

After submitting your project, verify that the submission was successful with the turnin -v command.

References

For assistance, you will wish to refer to some or all of the following:

In addition, we will be maintaining a list of Frequently Asked Questions. Please refer to this FAQ when you have difficulties; if another student has encountered the same problem before you, the solution may be documented there. This FAQ should be your first line of support for questions relating to this project, and it will be updated regularly as the project progresses. Please also use cs590n@cs.purdue.edu to email questions; this is a discussion list containing the whole class.

Updated: October 4, 2008

Copyright 2008, E. Blanton, C. Killian