Frequently Asked Questions

If you have other questions, email the TAs at cs251-ta@cs.purdue.edu.

In the table holding the list of URLs and descriptions, the lab says the description should be the first 500 characters of the site. Is that the first 500 characters of the raw HTML or the first 500 characters of the content?

Store the first 500 characters of the content, not of the raw HTML. We'll use this description in a later lab when we present Google-like results from searching our crawled database.
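One way to keep the description within the limit is to collect content characters as they arrive and ignore everything past 500. The following is only a sketch: DescriptionBuffer is a hypothetical helper, not part of the lab code, and it assumes your crawler sees the content one character at a time.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the lab code): collects content
// characters and silently ignores everything past the limit.
class DescriptionBuffer {
public:
    explicit DescriptionBuffer(std::size_t limit = 500) : limit_(limit) {}

    // Call once per content character, not per raw HTML byte.
    void add(char c) {
        if (text_.size() < limit_) {
            text_.push_back(c);
        }
    }

    // The description collected so far (at most `limit` characters).
    const std::string& text() const { return text_; }

private:
    std::size_t limit_;
    std::string text_;
};
```

If a page has fewer than 500 characters of content, the description is simply the whole content.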

How should I modify the SimpleHTMLParser class?

Don't modify the SimpleHTMLParser class. Instead, make your WebCrawler class a subclass of SimpleHTMLParser. That way you inherit all of the parser's functionality without breaking programs such as gethttp, which also define classes that inherit from SimpleHTMLParser.

How do I subclass in C++?

Take a look at gethttp.cpp for an example of declaring a subclass of an existing class.
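For a quick picture of the syntax, here is a minimal sketch of public inheritance with an overridden virtual method. ParserStandIn, CrawlerSketch, and the onContentFound hook are made up for illustration. The real SimpleHTMLParser interface may differ, so check its header for the methods you actually need to override.

```cpp
#include <string>

// Stand-in for the real parser. The actual SimpleHTMLParser interface
// may differ; this class exists only to illustrate the syntax.
class ParserStandIn {
public:
    virtual ~ParserStandIn() {}

    // Toy "parse": reports every character to the content hook.
    void parse(const std::string& text) {
        for (char c : text) {
            onContentFound(c);
        }
    }

protected:
    // Hook that subclasses override; the default does nothing.
    virtual void onContentFound(char) {}
};

// Public inheritance: CrawlerSketch is-a ParserStandIn. It reuses
// parse() unchanged and overrides only the hook.
class CrawlerSketch : public ParserStandIn {
public:
    const std::string& content() const { return content_; }

protected:
    void onContentFound(char c) override {
        content_.push_back(c);
    }

private:
    std::string content_;
};
```

Because the base class is left untouched, any other program that derives from it keeps working.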

The handout says that the webcrawler program has the syntax: webcrawl [-u <maxurls>] url-list. If this is what is entered at the terminal, how does the url-list get entered?

Separate the URLs with spaces on the command line. For example, you could start the program with:

webcrawl -u 100 http://www.purdue.edu http://www.slashdot.org http://www.cnn.com

How do I get the URLs into my program?

See gethttp.cpp for an example of how to parse command line options and arguments.
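Independently of how gethttp.cpp does it, here is a minimal sketch of parsing the webcrawl syntax by hand. CrawlOptions, parseArgs, and the defaultMaxUrls parameter are hypothetical names chosen for this example. The default value for maxurls is a placeholder; use whatever the handout specifies.

```cpp
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical result type for this sketch.
struct CrawlOptions {
    int maxUrls;                    // value given with -u, or the default
    std::vector<std::string> urls;  // initial URL list
    bool ok;                        // false on a usage error
};

// Parses: webcrawl [-u <maxurls>] url-list
// The shell has already split the command line on spaces, so each URL
// arrives as its own argv entry.
CrawlOptions parseArgs(int argc, const char* const argv[], int defaultMaxUrls) {
    CrawlOptions opts;
    opts.maxUrls = defaultMaxUrls;
    opts.ok = false;

    int i = 1;  // argv[0] is the program name
    if (i < argc && std::strcmp(argv[i], "-u") == 0) {
        if (i + 1 >= argc) return opts;      // -u without a value
        opts.maxUrls = std::atoi(argv[i + 1]);
        if (opts.maxUrls <= 0) return opts;  // not a positive number
        i += 2;
    }

    // Everything that remains is the URL list.
    for (; i < argc; ++i) {
        opts.urls.push_back(argv[i]);
    }
    opts.ok = !opts.urls.empty();  // at least one URL is required
    return opts;
}
```

From main you would call something like parseArgs(argc, argv, 1000) and print a usage message when ok is false.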