Project 4 - Indexing & Search

This project is broken into two milestones: Milestone 1 covers Parts 1 & 2 (Building Blocks and Parsing), and Milestone 2 covers Parts 3 & 4 (Crawling and Searching).

This is a group project to be completed in teams of 2. You must form your team with a person in your lab section. You can set this up either in lab or on Piazza using the Search for Teammates post.

Learning Objectives

  1. Engage with & use external libraries (JSoup)
  2. Interpret Java documentation
  3. Implement your own data structure
  4. File I/O, Exception Handling, Concurrency
  5. Teamwork

Introduction

Malicious computer hackers have broken into Google's servers and destroyed the web crawler it uses to find, parse, and index web pages, as well as the search functions that return results for our search queries. Incredibly, all that remains is a Javadoc (written in good practice by a hard-working programmer) describing the web crawler and search classes. After exhausting all other options, Google has come to Purdue and asked you, the benevolent CS180 student, to recreate the web crawler and search engine by building a project that crawls, parses, and indexes pages and can then execute search queries against that index.

Academic Dishonesty

It is considered dishonest either to read someone else's solution or to provide a classmate with a copy of your work. Do not make the mistake of thinking that superficial changes in a program (such as altering comments, changing variable names, or interchanging statements) can be used to avoid detection. If you cannot do the work yourself, it is extremely unlikely that you can succeed in disguising someone else's work. We are adamant that cheating in any form is not tolerated. Even the most trivial assignment is better left undone than completed by cheating.

Getting Started

The project has 3 main components: parsing, crawling, and searching. It is suggested that you sit down with your partner to design your solution before you begin, as your decisions about parsing and crawling (specifically, what you decide to store and how you are going to store it) will have big implications for the search functionality.

Specifically, you and your partner should perform the following steps before writing any code:

  1. Read through this handout
  2. Read through the recovered Javadoc
  3. Create a plan with your partner for your solution (especially Parts 3 & 4)
  4. Get the starter materials here (mirror) and import them into IntelliJ (see the tutorial).

Your solution is required to have the classes, method signatures, and class fields (both types and variable names) described in the recovered Javadoc. However, you are allowed to add additional methods, classes, and class members as needed.

This is a large project split into 2 milestones. For the first milestone, you will build the Word, Page, and MyQueue classes (covered in Part 1) as well as Parser and ParseException (covered in Part 2). You are given the verbatim Vocareum tests for these so that you can test them locally. It is critical to get the first part done as it lays the foundation for tying everything together in Parts 3 & 4.

Part 1 - Building Blocks

Requirements Checklist

These are the items you should complete in Part 1:

  • Create Word, Page, MyQueue and Node classes as described by the Javadoc
  • Pass all provided JUnit test cases for Word, Page and MyQueue

Overview

This project is going to require a number of components to put together the entire crawling & search experience. In parts 1 & 2, you will build the required components so that you will be ready to begin tying them together in order to complete parts 3 & 4. In this section, we will create ways to represent Pages and Words in our project, as well as create a way for us to Queue items for processing.

Representing Words and Pages

Words and Pages are the fundamental components of this project. You not only need to be able to associate words with pages, but also to keep track of the number of pages, the pages in which a word appears, and the number of times a word appears on a given page. In addition, you need to be able to quickly access and update your word and page objects, as well as their associations, on the go.

One idea is to think of words and pages as objects. Let's start by looking at what a page is. A page to us is just a web page that has a URL and an ID number that we give it when we first see it.

Page.java
public class Page {

    String url;     // the page's URL
    int urlID;      // ID assigned to the page when it is first seen

}

Briefly look ahead to how we're going to store all of the information and relationships between Pages and Words by going to the Building an Index section in Part 3. As you can see, each word needs to hold a list of associated web pages. If we think of Words as objects, we can have our Word object contain an ArrayList that contains these IDs for us.

Word.java
import java.util.ArrayList;

public class Word {

    String word;                           // the word itself
    ArrayList<Integer> associatedPages;    // urlIDs of pages this word appears on (duplicates allowed)

}

Queuing Items

As each website is visited and successfully parsed, the list of websites to visit grows with each new (i.e. unvisited) web page. To accomplish this, you are going to implement a Queue. The MyQueue and Node classes that you will need to implement are described in the next section.

As described above, we need some type of data structure that will allow us to add items to the end and quickly grab the first item as necessary, with no traversing. For this, we are going to use a Queue:

Definition: A Queue is called a FIFO data structure - short for First In First Out. This means that new items are added to the end of the Queue, and the first item is the item that is processed (much like standing in a line).

For this we will have 2 classes - MyQueue and Node.

Your MyQueue class will house all of the operations you have seen with other classes like ArrayList, such as add() and remove(). Your Queue will be composed of Node objects linked together, and the methods defined below will be responsible for managing those links.

MyQueue.java
public class MyQueue {

    Node head;                          // reference to the first Node in the Queue

    public void add(Object o) {}        // append a new Node holding o at the end
    public Node remove() {}             // detach and return the first Node
    public boolean isEmpty() {}         // true when the Queue has no Nodes

}

The Node class is the basic building block of your Queue. Each Node represents one entry, and you will add, remove, and update next references as necessary.

Node.java
public class Node {

    private Object data;                // the item stored in this Node
    private Node next;                  // the Node that follows this one (null at the end)

    public Node(Object obj) {}
    public void setNext(Node next) {}
    public Node getNext() {}
    public Object getData() {}

}

The data type for each Node is listed as Object because you will choose what you want to store (e.g. you could store Strings, Pages, etc). You can think of a Queue as a number of linked Node objects, where each Node's next reference links to the next Node in the chain.

If the next reference is null, we have reached the end of the Queue. Below is an illustration of a Queue with 3 nodes, where the data component stores an Integer. Also note how the first element is labeled Head and the last item has its next reference set to null. We will keep the former as a Node reference in MyQueue (the head field), while the latter will help us identify the end of the Queue.
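To make these linkages concrete, here is a minimal sketch of how add() and remove() might manage the Node chain. It assumes add() appends at the tail by walking from head and that remove() detaches and returns the head Node; your exact signatures and behavior must match the recovered Javadoc.

/* Sketch only - not the required implementation */
public void add(Object o) {
    Node newNode = new Node(o);
    if (head == null) {                      // empty Queue: the new Node becomes the head
        head = newNode;
        return;
    }
    Node current = head;
    while (current.getNext() != null) {      // walk to the last Node
        current = current.getNext();
    }
    current.setNext(newNode);                // link the new Node at the end
}

public Node remove() {
    if (isEmpty()) {
        return null;                         // nothing to remove
    }
    Node first = head;
    head = head.getNext();                   // the second Node (or null) becomes the new head
    return first;
}

public boolean isEmpty() {
    return head == null;
}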

Part 2 - Parsing

Requirements Checklist

These are the items you should complete by the end of Part 2:

  • Connect to a webpage and return a Jsoup Document
  • Extract links from a Jsoup Document
  • Extract and return the body text of a Jsoup Document
  • Throw your custom exception for the outlined cases
  • Match the framework outlined in the Javadoc
  • Test thoroughly with the provided JUnit test cases as well as your own

Overview

This section gives an overview of what parsing is, what we are actually parsing, and a quick tutorial of how we are going to perform this parsing with JSoup.

Parsing simply means taking an input and performing some operation on, or deriving something from, that input. In the context of this project, we take a web page as input and extract its text and links, as shown below.

Note: in the Parser class, you throw your custom ParseException when the Document is null, the URL is null, or some other Exception is thrown. It is important for the crawling process to withstand these failures. If an attempted parse fails, you should not count that URL as a parsed URL and should not increment currentID. Instead, continue crawling with the next candidate URL in your Queue.

Using Jsoup

In Jsoup, a web page is represented as a Document. From a Document you can select an Elements object, which is a list of individual Element objects (e.g. one for each <p> tag). A basic example of how to perform the parsing illustrated above might look like this:

  /* Get the web page as a Document */
  Document d = Jsoup.connect("https://purdue.edu").get();
  
  /* Get all link (<a>) tags as an Elements object */
  Elements links = d.select("a[href]");
  
  /* Elements is an ArrayList of Element objects. You can enumerate over them like this; */
  for(Element link : links)
      System.out.println(link.text());
      
  /* Get the entire contents of the body tag */
  Element body = d.body();
  String content = body.text();

For our project, we will only be considering the body text. The project starter code contains a Sandbox class that has additional examples you can run and play with to help get you started.

ParseException

As you play around with the Jsoup calls above, you will soon discover that there are a number of Exceptions that can be thrown (IOException, UnsupportedMimeTypeException, etc.). Additionally, one problem that is not covered is the case where the Document object is null. Therefore, for each of the 3 methods in Parser, you will catch any exceptions thrown (e.g. when Jsoup.connect() is called) as well as check that the Document object is not null. If either of these happens, you will throw a ParseException with the corresponding message:

Method         Condition                  Message
---------------------------------------------------------------------------------------
getDocument()  Document is null           getDocument() failed: Document is null.
getDocument()  Parameter String is null   getDocument() failed. String url is null.
getDocument()  Parameter String is empty  getDocument() failed. String url is empty.
getDocument()  Jsoup.connect() fails      getDocument() failed. Connection failed.
getBody()      Parameter is null          getBody() failed. Document parameter is null.
getLinks()     Parameter is null          getLinks() failed. Document parameter is null.
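As an illustration, a hedged sketch of how getDocument() might enforce these conditions is shown below. It assumes ParseException has a constructor that takes a message String and that org.jsoup.Jsoup and org.jsoup.nodes.Document are imported; follow the exact signatures from the recovered Javadoc.

/* Sketch only - check the parameter, attempt the connection, then verify the Document */
public Document getDocument(String url) throws ParseException {
    if (url == null) {
        throw new ParseException("getDocument() failed. String url is null.");
    }
    if (url.isEmpty()) {
        throw new ParseException("getDocument() failed. String url is empty.");
    }
    Document doc = null;
    try {
        doc = Jsoup.connect(url).get();      // may throw IOException and friends
    } catch (Exception e) {
        throw new ParseException("getDocument() failed. Connection failed.");
    }
    if (doc == null) {
        throw new ParseException("getDocument() failed: Document is null.");
    }
    return doc;
}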

This concludes Milestone 1 (parts 1 & 2). Please make sure you pass all of the provided test cases for this milestone, then submit to Vocareum by 11:59 pm EST on 3/31

Part 3 - Crawling

Crawling is the process of visiting a website, extracting and storing the links found on each page, and extracting the page's text so we can keep track of the content seen on each page. We store the links we see on each page in order to visit them next, one after the other; we will create and use a Queue to do this. To represent this connected collection of pages and text, we will create objects, detailed below. It is important to read through the next 2 subsections and plan your solution for the Crawler class before writing any code.

Requirements Checklist

These are the items you should have completed at the end of Part 3:

  • Implement the Crawler class as described in the Javadoc
  • Be able to crawl web pages, queuing pages using your MyQueue class
  • Your crawler should not visit the same page twice and should not leave the specified domain.
  • Build your index in memory using your Word and Page components
  • Implement the FileUtils class as described in the Javadoc
  • Be able to write your List of Word and Page objects to files called words.txt and pages.txt respectively, via your FileUtils class after crawling completes.

Overview

A web crawler works by taking a web page (e.g. https://purdue.edu), extracting all available links & text from the web page (using your Parser class), adding those links to a waiting list, and then proceeding to visit all of the links in the waiting list. In our case, the Crawler constructor takes seed and domain Strings. The seed is where your crawler will start; the domain is the website to which your crawler is restricted (e.g. cs.purdue.edu). The relationship between pages is shown graphically below, along with a rough progression where each box with outgoing arrows represents links found on that page.

How does Crawling Work?

The crawler continuously pulls URLs off of the Queue and parses them (extracts links and text). As URLs are discovered that have not been seen before, they are added to the Queue for processing. This process continues until either the Queue is empty or a preset limit on the number of URLs to visit is reached. The generalized algorithm:

run:
    currentID <- 0
    while Queue is not empty
        url <- remove the next URL from the Queue

        success <- parse(url)

        if success, increment currentID

currentID should start at 0, and you should be careful about when you increment it. Note how the parse() method in Crawler returns a boolean value (true if successful, false otherwise). You should only increment currentID when a URL is successfully parsed. That way, all of our IDs are sequential (i.e. 0, 1, 2, 3, 4), with no gaps in between. This will make our lives easier later.
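In Java, that loop might look roughly like the sketch below. It assumes a MyQueue field named queue holding URL Strings, an int field currentID, and a parse() method returning true on success, and it ignores the URL limit; your Crawler must follow the recovered Javadoc.

/* Sketch only - field and method names are assumptions */
public void run() {
    currentID = 0;
    while (!queue.isEmpty()) {
        Node node = queue.remove();
        String url = (String) node.getData();   // cast the stored Object back to a String
        if (parse(url)) {                       // extract links and text
            currentID++;                        // count only successfully parsed pages
        }
        // on a failed parse, simply continue with the next URL in the Queue
    }
}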

Building an Index

As our crawler finds and parses new web pages, we need to create and update an index of what we have seen. For this, we recommend maintaining an inverted index, where each word is associated with a list of web pages it appears in. Since we're associating each Page with a URL ID, each Word can represent its presence on a Page by storing a List of URL IDs like this:

For our implementation, we need to create a List of Word objects. Within each Word object, there will be a List of Integers, where each entry in the List is a URL ID on which that Word occurs. This List can and should store duplicates, as you will want to record that a given word (e.g. Purdue) appeared on the Page with URL ID 2 however many times it did. For more information on this, see the Inverted Index section.
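For example, updating the index for one occurrence of a word might look like the sketch below. It assumes an ArrayList<Word> field named wordList and the Word fields shown in Part 1; adapt it to the structure your Javadoc requires.

/* Sketch only: record one occurrence of text on the page with ID urlID */
public void addWordOccurrence(String text, int urlID) {
    for (Word w : wordList) {
        if (w.word.equals(text)) {
            w.associatedPages.add(urlID);        // duplicates are intentional: one entry per occurrence
            return;
        }
    }
    Word w = new Word();                         // first time this word has been seen
    w.word = text;
    w.associatedPages = new ArrayList<Integer>();
    w.associatedPages.add(urlID);
    wordList.add(w);
}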

File Utilities

When we are finished with the crawling process, we will need to write the index we have built to a file so we can access it later. This will be done in the FileUtils class. For this, we will lean on the idea of Serialization: if a class (and everything it stores) implements the Serializable interface, then you can write those objects (e.g. an ArrayList of Word objects) to a file, and then restore that data structure from the file when needed.

All of your classes will need to implement the Serializable interface, and your FileUtils class will contain methods that will use the ObjectOutputStream and ObjectInputStream classes to read and write your word and page lists to files. A high-level overview can be found in week 10 slides and a tutorial can be found in the help section at the end.
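For example, after crawling completes, the write-out step might look something like this. The method name and parameters here are hypothetical; use whatever your FileUtils Javadoc actually specifies.

/* Hypothetical usage: persist both index lists to the required files */
FileUtils fileUtils = new FileUtils();
fileUtils.write(wordList, "words.txt");      // ArrayList<Word>
fileUtils.write(pageList, "pages.txt");      // ArrayList<Page>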

Part 4 - Searching

Requirements Checklist

These are the items you should have completed at the end of Part 4:

  • Implement the Result and Search classes as described in the Javadoc.
  • Be able to read in your crawled List objects from file using FileUtils functions.
  • Execute multi-word queries on your index.
  • Search class divides up scoring responsibility over multiple SearchThread threads.
  • Output a ranked list of Results.

Overview

Your Search class will be responsible for reading and rebuilding the index from file, accepting user queries, building a List of Results and displaying these results to the end user.

Concurrency

Your search function will use threads to evaluate the word list more quickly. For example, given a word list containing 1000 entries and a program that uses 5 threads, you will divide the work evenly between the 5 threads:

Your word list will contain many more entries than 1000, so you will have to create a way to divide the list evenly amongst the 5 threads. Each thread will receive the query string and be responsible for (1) looking for each term in its assigned indices, (2) computing the score for a found term, and (3) adding a new Result or updating an existing one in the shared Result List. This is an example of domain decomposition, where each thread runs the same code over a different part of the input. This code will be housed in your SearchThread class.

Since the Results List is shared among threads, you must design a way to protect it from double-counting by competing threads. There is a tutorial about this at the bottom in the Synchronizing Lists section.
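One possible way to split the work is sketched below. It assumes SearchThread extends Thread and has a constructor taking the query, the word list, a start index, an end index, and the shared Result List; your actual constructor is defined by the Javadoc, and the variable names here are assumptions.

/* Sketch only: divide wordList into 5 roughly even ranges, one per SearchThread */
int numThreads = 5;
int chunk = (wordList.size() + numThreads - 1) / numThreads;      // ceiling division
SearchThread[] threads = new SearchThread[numThreads];
for (int i = 0; i < numThreads; i++) {
    int start = Math.min(i * chunk, wordList.size());
    int end = Math.min(start + chunk, wordList.size());           // the last range may be shorter
    threads[i] = new SearchThread(query, wordList, start, end, results);
    threads[i].start();
}
try {
    for (SearchThread t : threads) {
        t.join();                                                  // wait for all scoring to finish
    }
} catch (InterruptedException e) {
    e.printStackTrace();
}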

Scoring

When the user enters a query, you will need to search your index for relevant results, rank the results, and then print them. To get around casing issues (e.g. Purdue vs. purdue), you should ignore the case of words in the query when looking them up in your word list. For simplicity, we recommend a scoring function that simply sums how frequently a term occurs in each document and uses that sum for ranking. Below is the mapping of the word Purdue to its associated urlIDs. A query that included the word Purdue would find this word, add each unique page to the results list, and increment that page's score for each occurrence of it in the Purdue list (shown on the right).
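As a concrete sketch of that scoring rule for a single query term: the java.util.HashMap and the variable names below are purely for illustration; your implementation will build Result objects as described in the Javadoc.

/* Sketch only: sum the occurrences of one query term per page */
HashMap<Integer, Integer> scores = new HashMap<Integer, Integer>();
for (Word w : wordList) {
    if (w.word.equalsIgnoreCase(queryTerm)) {            // ignore case, as described above
        for (int urlID : w.associatedPages) {
            Integer current = scores.get(urlID);
            scores.put(urlID, current == null ? 1 : current + 1);
        }
    }
}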

Result Class

You have been given a Result class. This will represent each result in your results list, and contains the URL, URL ID, and score for each entry. You must implement the exact method signatures described below (and described in the Javadoc) for this class.

Result.java
public class Result {

    public int getScore() {}       // score for this result
    public int getURLID() {}       // urlID of the matching page
    public String getURL() {}      // URL of the matching page

}

Remember earlier how we mentioned we wanted to make sure our currentID values were carefully incremented so that they were all sequential? This comes in handy now. If we sort the ArrayList of Pages by URL ID, then the indices of the ArrayList correspond to our IDs, meaning we can simply use pageTable.get(urlID) to get the Page object with that URL ID and return its String url.
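For example, assuming your Page class implements Comparable so that it sorts by urlID (see the Sorting an ArrayList tips below) and pageList is your ArrayList of Page objects, the lookup could be as simple as:

Collections.sort(pageList);                          // Pages ordered by urlID, so index == urlID
String url = pageList.get(result.getURLID()).url;    // retrieve the URL for a Result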


Sample Output

Your Search class will return the top 10 ranked results in the form of result #, URL, score. The below sample output is generated by the Driver class, which is what will be run for in-lab testing:

This concludes Milestone 2 (parts 3 & 4). Please be sure to submit to Vocareum by 11:59 pm EST on 4/10

Submission Instructions

Submit your project to Vocareum. Keep a copy of the project on your CS account for demonstration & grading in lab.

Grading Rubric

(70%) Vocareum Tests:

(30%) Demo Tests:

Tips, Tutorials, Help

Working with JSoup to Parse Webpages

JSoup is a powerful library for parsing HTML pages. In addition to strong documentation and a number of example use cases, the starter code includes a tutorial covering all of the functionality you'll need. There is an extensive demo with sample code provided in a class named Sandbox.java, which covers getting web pages, extracting links, and getting body text.

Preventing Duplicate Visits

Earlier we discussed using an ArrayList to store Page objects for pages we are waiting to visit. Could we do the same for the pages that we have already visited? We could simply add the Page objects that we have visited to an ArrayList for this purpose, but how would we determine whether the ArrayList contains a given URL? Before adding Page objects to the waiting list to be visited, we need to be sure that the Page is not already in the visited list.

If our ArrayList were only storing the String of each URL, we could simply use the contains() method of ArrayList. However, since we are using Page objects, we have to override the equals() method in Page so that contains() can check whether the list holds a Page whose String url member matches. More information on this is included below.

Reading & Writing Objects

The ObjectInputStream and ObjectOutputStream classes handle most of this for us. The demonstration below covers the basics of what you will need to accomplish this, without the necessary exception handling.

Consider a Student class with member fields for the student's name and age:

Student.java
import java.io.Serializable;

public class Student implements Serializable {

    String name;
    int age;
    // Rest of your implementation
}

Since our Student class and the String class both implement Serializable, all components involved are eligible for serialization, like below:

FileUtils.java
import java.io.*;

public class FileUtils {

    public void write(Student s, String filename) throws IOException {
        /* Create output streams for the File and the Object */
        FileOutputStream fos = new FileOutputStream(new File(filename));
        ObjectOutputStream oos = new ObjectOutputStream(fos);

        /* Write the object to the File */
        oos.writeObject(s);
        oos.close();
        fos.close();
    }

    public Student read(String filename) throws IOException, ClassNotFoundException {
        /* Create input streams for the File and the Object */
        FileInputStream fis = new FileInputStream(new File(filename));
        ObjectInputStream ois = new ObjectInputStream(fis);

        /* Read the object back in and cast it to a Student */
        Student s = (Student) ois.readObject();
        ois.close();
        fis.close();

        return s;
    }

}

Comparing Page Objects

In this project, you will have a List of Pages and a List of Results, and you will have to be able to determine whether a Page object (given a URL, for example) is already in the List of Pages. We've used ArrayLists of Strings, where we can check whether a String exists in a given ArrayList by doing something like the following:

ArrayList<String> list = new ArrayList<String>();
list.add("Hello");
 
System.out.println(list.contains("Hello")); // prints true

The ArrayList class has a method called contains() that looks over all of the items in the ArrayList and, based on the type of Object the List contains, calls that Object's equals() method. Above, we are using String objects (and String has an equals() method we are well familiar with). This is what happens behind the scenes when we call list.contains("Hello"):

boolean contains(Object candidate)
    for(Object str : list)
        if(str.equals(candidate))
            return true

    return false

Someone else has already written the code to take a generic Object, in our case a String, and call the equals() method for that object. Therefore, if we want to be able to do this for our Page and Result classes, we need to introduce our own equals() method in these classes that ArrayList will automatically call when you call contains(). Introducing these two equals() implementations in these classes allows you to avoid having to write any loops! Here is an example in the Page class:

@Override // This tells Java that we have our own version of equals() to use
public boolean equals(Object object) {
    if (object instanceof Page) {
        // compare this Page's url to the candidate Page's url
        return this.url.equals(((Page) object).url);
    }
    return false;
}

Sorting an ArrayList

In order to use the contains() method for our ArrayList of Page objects, we had to define what would make two Page objects equal to each other. We did this by overriding the equals() method in our Page class. In order to sort an ArrayList of our own object type, we have to do something similar.

We can do this by defining a class that implements the Comparable interface, which requires us to define a compareTo() method. If you looked at the source for the String and Integer classes (for example), you'd see that they implement this method too, which is why the following works:

ArrayList<Integer> list = new ArrayList<Integer>();
list.add(3);
list.add(2);
list.add(1);
Collections.sort(list); // List is now sorted

For this project, you will have to sort the Page list by URL ID and the Result List by score. Therefore, we require these two classes to implement the Comparable interface. An example is given below with a Student class that we want to be able to sort by grade:

public class Student implements Comparable<Student> {
    String name;
    int grade;
    // Constructor etc.

    // Compare 2 Student objects
    public int compareTo(Student candidate) {
        // If the grades are the same, return 0
        if (this.grade == candidate.grade)
            return 0;
        // Otherwise, return -1 when this Student's grade is higher (so it sorts
        // before the candidate), and 1 when it is lower.
        return this.grade > candidate.grade ? -1 : 1;
    }
}

Then, when we want to sort an ArrayList of Student objects by grade, all we have to do is:

Collections.sort(studentList);

After the above executes, our Student list will be sorted by descending grade (highest grade first)! This should get you started on the implementations you'll need. More information about this is available here.

Why is the Queue of type Object?

From the Javadoc, your MyQueue class is designed to contain general Object types. For example, the add method:

public void add(Object o){}

But why is this? Simply put, it gives you the flexibility to choose what you want to hold in your Queue (we have mentioned two possibilities: String or Page objects). As a result of inheritance in Java, every class you create (including your Page class) implicitly extends the Object class! You can test this like below:

public class Test {
    public static void main(String[] args) {
        Test t = new Test();
        System.out.println(t instanceof Test);   // True
        System.out.println(t instanceof Object); // True
    }
}

Note: because getData() returns an Object, this may require you to apply a cast.
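For instance, if your Queue is storing Page objects, pulling one back out might look like this hypothetical snippet (using the Node methods from Part 1):

Node node = queue.remove();
Page page = (Page) node.getData();   // cast the stored Object back to a Page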

How to Read a Javadoc

One of the most crucial skills in software development is being able to read documentation, which is one of the reasons this project is structured the way it is. Every library or class you will use has Java documentation that follows roughly the same format. To understand how it works, let's look at the String class's Javadoc.

In the String class Javadoc, you'll notice many of the same features that the Project 4 Javadoc has. There is an introduction section describing the class and sections for field summary, constructor summary, and method summary. For example, if I want to create a new String object that holds the string Hello, I can see the following constructor that matches my needs:

Now I know I can simply do:

String str = new String("Hello");

The method summary section is arguably the most crucial, as it defines most of the functionality for the class. Say you were in a closed-book live-coding exam and were only provided access to the Java documentation. As you're writing some code for a question, you discover that you can't remember exactly how the substring method of String works (e.g. which value is exclusive, the beginning index or the ending index?). Luckily, if we know how to navigate the documentation, we can look this up. In the method summary section, find the method for substring with the signature substring(int beginIndex, int endIndex) and click it. You will be taken to a section like this:

This tells us the method signature, what the method does, describes the parameters, return types, and details any exceptions that could be thrown by it. Knowing how to read these documents is very useful in practice, and critical to developing the correct project structure for project 4.

Synchronizing Lists

When you are executing a search query, you divide the work among 5 threads that are responsible for executing the same piece of code on a specified subset of the List (this is called domain decomposition). However, since the threads share a single Result List, we need a way to synchronize the List. To do this, we can rely on a handy static method from the Collections class called synchronizedList(..). Let's look at an example implementation below:

public class NumberList {
    public List<Integer> numbers;
 
    public NumberList() {
        numbers = Collections.synchronizedList(new ArrayList<Integer>());
    }
    // Rest of class
}

By wrapping our ArrayList initialization with this method from the Collections class, we protect individual List operations (like add()) from threading-related problems! This will be very useful for your Search class implementation. It is also a nice example of polymorphism making your life easier!

Inverted Index

An inverted index consists of a list of all the unique words that appear in any page, and for each word, a list of the pages in which it appears.

For example, let's say we have two web pages composed of the following content:

Page 1: The quick brown fox jumped over the lazy dog
Page 2: Quick brown foxes leap over lazy dogs in summer

Your Parser in Part 2 will get the content and split it on spaces in order to get all of the words on the page. You will then add each of these words to your Word list, as well as add the page's URL ID to that Word's list of associated pages. Below is this process done for the above content. The Word column represents the List of Words, and the other columns represent an association between a Page and a Word.

Word      Page 1  Page 2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
------------------------

Now, if we want to search for quick brown, we just need to find the pages in which each term appears:

Term      Page 1  Page 2
-------------------------
brown   |   X   |  X
quick   |   X   |
------------------------
Total   |   2   |  1

Both pages match, but the first page has more matches than the second. In part 4, you will score each page by summing the occurrences of each word on each page and then ranking the pages by score.

Importing JUnit into IntelliJ

If you're having trouble importing JUnit into IntelliJ for the provided test cases, take a look at the video below:

Environment Setup Tutorial

If you're having trouble getting your environment set up, try following this brief tutorial: