CS 47300: Web Information Search And Management

MWF 13:30-14:20

KRAN G016 (subject to change).
The course will be amenable to remote attendance, even for those not registered for the distance learning section.

Chris Clifton

Email: clifton_nospam@cs_nojunk.purdue.edu


Course Topics

This course teaches important concepts and knowledge of information retrieval for managing unstructured data, such as text data on the Web or in emails. At the same time, students will be exposed to a large number of important applications. Students in the course will get hands-on experience from homework and a course project. The first part of the course focuses on general concepts and techniques such as stemming, indexing, the vector space model, and feedback procedures. The second part of the course shows how to apply these techniques to different applications such as Web search, text categorization, and information recommendation.


This course is anticipated to be oversubscribed, and as such registration is initially limited to CS students. If you have been unable to register, please follow the CS department process for waitlisting/registration. Please do not ask the instructor for an override; I have been told that if the course is shown as full, the registrar will not allow registration even with a Form 23 signed by the instructor. Please follow the process above or consult with your advisor.

Teaching Assistants

Office hour times, locations, and procedures are to be determined and may change through the semester as we learn more about social distancing requirements and facility availability. Please watch this space for changes, and check Piazza for potential temporary adjustments. TAs will also be available remotely during office hours at the posted links.

Instructor Office Hours

Thursday 8:30-10, WebEx 120 210 9752 password InfoRetrieval

In-person meetings in LWSN 2116E are by appointment, to avoid crowding. Email a few good times for you and I'll pick what works. (This also works for setting up a videoconference, if Thursday morning doesn't work.) Monday, Wednesday, and Friday mornings are currently the most open, but I have quite a few free slots throughout the week. University policy permitting, you can just drop by; I'm often in, and if I'm not tied up with something that has to be finished right away, I'll be happy to meet with you.

Mailing List

There will be a course email list used for high-priority announcements. This will use your @purdue.edu email address; make sure this is forwarded to someplace you look on a regular basis.

We will be using Gradescope to turn in and comment on assignments; Brightspace will be used for recording and distributing grades, as well as for any other non-public information about the course.

Course Methodology

The course will primarily be taught through lectures, supplemented with reading. The lecture delivery method is subject to change based on COVID-19 related restrictions, but there will always be some form of online access to lecture material. Initial plans are for live in-class lectures, recorded and made available through Boilercast. We are currently in a room that supports real-time access to Boilercast; given student demand, I will try to set up live interactive access to the lecture (e.g., WebEx or Zoom). The written assignments and projects are also a significant component of the learning experience.

We will be using Piazza or Brightspace to facilitate discussions; this will enable you to post questions as well as respond to questions posted by others. Be aware that the default is for posts to be identified and visible to everyone.

We may be using some form of real-time feedback in class. (Note that my standard practice is to drop approximately the lowest 10-15% of in-class response scores to allow for absences.)


The formal prerequisite is CS 25100: Data Structures and Algorithms (or ECE 36800). It will help if you have taken CS 37300: Data Mining and Machine Learning and/or a statistics course such as STAT 35000: Introduction to Statistics or STAT 51100: Statistical Methods. (If you have comparable courses, such as ECE 36800, please contact the instructor.)


Evaluation is a somewhat subjective process (see my grading standards); however, it will be based on your understanding of the material as evidenced in:

All assigned work (including exams) will be provided and submitted online. Exact formats will be evolving; you can get some idea from looking at how midterm 2 and the final were handled in CS 37300.

Late work will be penalized 15% per day (24-hour period or fraction thereof). The penalty is based on possible points, not your actual score (so after 5 days, if your submission garners less than 75% of the possible points, you get a 0). Each assignment has a hard deadline of five days after the published due date, after which the solution sets go out and no further submissions are accepted. You are allowed seven extension days, to be used at your discretion throughout the semester (illness, job interviews, etc.); no penalty is assessed for late work within this limit. If your assignments add up to more late days than this allowance over the semester, the extension days will be automatically applied to the highest value assignments (e.g., projects), so the 15% late penalties are applied to the lower value assignments first. Late penalties will only be applied at the end of the semester, so if you go beyond the allotted late days, you may see your score for some assignments drop. You must keep track of late days yourself. Fractional use is not allowed, and extension days may not be used to extend submission past the hard deadline.
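
As a rough illustration of the late-work arithmetic above, here is a minimal sketch in Python. It is not the actual grading script: the function names, the input format, and the number of extension days passed in are assumptions made for this example, and submissions past the hard deadline are simply scored as 0 here.

```python
import math

PENALTY_PERCENT_PER_DAY = 15   # percent of possible points lost per late day
HARD_DEADLINE_DAYS = 5         # nothing is accepted after this many days


def days_late(hours_late):
    """Whole late days: each 24-hour period or fraction thereof counts as one."""
    return max(0, math.ceil(hours_late / 24))


def apply_late_policy(assignments, extension_days):
    """Return {name: final score} after extension days and late penalties.

    `assignments` is a list of dicts with keys 'name', 'possible',
    'earned', and 'hours_late' (a hypothetical format for this sketch).
    Extension days go to the highest-value assignments first.
    """
    remaining = extension_days
    final = {}
    for a in sorted(assignments, key=lambda a: a["possible"], reverse=True):
        late = days_late(a["hours_late"])
        if late > HARD_DEADLINE_DAYS:
            # Past the hard deadline: treated as not submitted in this sketch.
            final[a["name"]] = 0.0
            continue
        covered = min(late, remaining)       # whole extension days used here
        remaining -= covered
        # Penalty is based on possible points, not on the score earned.
        penalty = a["possible"] * PENALTY_PERCENT_PER_DAY * (late - covered) / 100
        final[a["name"]] = max(0.0, a["earned"] - penalty)
    return final


# Example: three late submissions and an assumed allowance of 7 extension days.
scores = apply_late_policy(
    [
        {"name": "Project 1", "possible": 200, "earned": 180, "hours_late": 80},
        {"name": "Assignment 2", "possible": 100, "earned": 90, "hours_late": 60},
        {"name": "Assignment 3", "possible": 50, "earned": 45, "hours_late": 30},
    ],
    extension_days=7,
)
print(scores)
```

In this example the project (4 late days) and Assignment 2 (3 late days) use up all seven extension days, so only Assignment 3, the lowest-value item, takes the 15% per day penalty for its 2 late days.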

Policy on Intellectual Honesty

Please read the departmental academic integrity policy above. This will be followed unless I provide written documentation of exceptions. You should also be familiar with the Purdue University Code of Honor and Academic Integrity Guide for Students. You may also find Professor Spafford's course policy useful - while I do not apply it verbatim, it contains detail and some good examples that may help to clarify the policies above and those mentioned below.

In particular, I encourage interaction: you should feel free to discuss the course with other students. However, unless otherwise noted, work turned in should reflect your own efforts and knowledge.

For example, if you are discussing an assignment with another student, and you feel you know the material better than the other student, think of yourself as a teacher. Your goal is to make sure that after your discussion, the student is capable of doing similar work independently; their turned-in assignment should reflect this capability. If you need to work through details, try to work on a related, but different, problem.

If you feel you may have overstepped these bounds, or are not sure, please come talk to me and/or note on what you turn in that it represents collaborative effort (the same holds for information obtained from other sources that provided substantial portions of the solution). If I feel you have gone beyond acceptable limits, I will let you know, and if necessary we will find an alternative way of ensuring you know the material. Help you receive in such a borderline case, if cited and not part of a pattern of egregious behavior, is not, in my opinion, academic dishonesty, and will at most result in a requirement that you demonstrate your knowledge in some alternate manner.

Other Issues and Resources

If you have other issues please feel free to talk to me - if I can't help, I'll try to point you in the right direction. Be aware that due to Title IX and state law, there are some things for which I can't promise confidentiality (but see CARE below).

University Emergency Preparedness instructions

Nondiscrimination Statement: Purdue University is committed to maintaining a community which recognizes and values the inherent worth and dignity of every person; fosters tolerance, sensitivity, understanding, and mutual respect among its members; and encourages each individual to strive to reach his or her own potential. In pursuit of its goal of academic excellence, the University seeks to develop and nurture diversity. The University believes that diversity among its many members strengthens the institution, stimulates creativity, promotes the exchange of ideas, and enriches campus life. Purdue’s nondiscrimination policy can be found at http://www.purdue.edu/purdue/ea_eou_statement.html.

Purdue University strives to make learning experiences as accessible as possible. If you anticipate or experience physical or academic barriers based on disability, you are welcome to let me know so that we can discuss options. You are also encouraged to contact the Disability Resource Center at: drc@purdue.edu or by phone: 765-494-1247.

Student Mental Health and Wellbeing: Purdue University is committed to advancing the mental health and wellbeing of its students. If you or someone you know is feeling overwhelmed, depressed, and/or in need of support, services are available. For help, such individuals should contact Counseling and Psychological Services (CAPS) at (765)494-6995 and http://www.purdue.edu/caps/ during and after hours, on weekends and holidays, or through its counselors physically located in the Purdue University Student Health Center (PUSH) and the Psychology building (PSYC) during business hours.

Sexual Violence: Purdue University is devoted to fostering a secure, equitable, and inclusive community. If you or someone you know has been the victim of sexual violence and are interested in seeking help, there are services available. Reporting the incident to any Purdue faculty and certain other employees, including resident assistants, will lead to reference to the Title IX Coordinator, as these individuals are mandatory reporters. The Title IX office can investigate reports of sex-based discrimination, sexual harassment, or sexual violence. Title IX ensures that both parties in a reported event have equal opportunity to be heard and participate in a grievance process. To file an online report visit https://cm.maxient.com/reportingform.php?PurdueUniv&layout_id=15 or contact the Title IX coordinator at 765-494-7255.

The Center for Advocacy, Response, and Education (CARE) offers confidential support and advocacy that does not require the filing of a report to the Title IX office. The CARE staff helps each survivor assess their reporting options and access resources that meet personal needs. The CARE office can be found at 205 North Russell Street in Duhme Hall (Windsor), room 143, Monday - Friday, 8:00 AM to 5:00 PM. They can also be reached at their 24/7 hotline 765-495-CARE or at CARE@purdue.edu.

And you should always feel free to call, email, or drop by and talk to me (or, if you have an issue with me, to the department head).


The basic text for this course is:

The following book may also be of interest, as it gives a somewhat different treatment of the material. You don't need both books; this should be considered optional reading.

Course Outline (numbers correspond to week):

  1. Course Introduction, Text Preprocessing.
    1. Introductory lecture, discussion of areas and applications of Information Retrieval.
    2. Ad-Hoc Information Retrieval overview (Chapter 1)
    3. Text Preprocessing (Chapter 2)
      Assignment 1 released, due 9/4/20 23:59ET. Solutions.
    Reading: Croft et al. Chapter 1 (esp. through 1.2), 4 through 4.4; Manning et al. Chapter 2 through 2.2.
  2. Ad-Hoc IR Methods
    1. Basic Concepts, continued: Evaluation.
      Reading: Croft et al. Chapter 5.1, 5.3-5.3.3, 8-8.2, 8.4-8.4.3, 8.5, 8.7. Manning et al. Chapter 8 (esp. through 8.5).
    2. Boolean Retrieval Models
      Reading: Croft et al. Chapter 7-7.1.1; Manning et al. Chapter 1.1-1.4.
    3. Vector Space Models: TF-IDF
      Reading: Croft et al. Chapter 7.1.2; Manning et al. Chapter 6.2-6.4.
    Assignment 2 released, due 9/11/20 23:59ET. Solutions.
  3. More on Retrieval Models
    1. Latent Semantic Indexing
      Reading: Croft et al. Chapter 7.6.2, Manning et al. Chapter 18.
    2. Probabilistic IR: Binary Independence Model
      Reading: Croft et al. Chapter 7.2-7.2.1; Manning et al. Chapter 11-11.3.2
    3. More on Probabilistic Retrieval Models
      Reading: Croft et al. Chapter 7.2.2; Manning et al. Chapter 11.3.3, 11.4
  4. Web Search
    1. Web Search
      Reading: Manning et al. Chapter 19-19.4
    2. Web Crawling
      Reading: Croft et al. Chapter 3-3.2.2; Manning et al. Chapter 20-20.2.
      Assignment 3 released, due 9/23/20 23:59ET. Solutions.
    3. Web Crawling: Politeness and Freshness, Scaling
      Reading: Croft et al. Chapter 3.2.3, 3.2.7, 3.5-3.8
  5. Web Index, More on Retrieval Models
    1. Web Indexing, Size Estimation
    2. Graph structure based retrieval
      Reading: Croft et al. Chapter 4.5, 10.3.2; Manning et al. Chapter 21
    3. PageRank, Query Expansion and Relevance Feedback
      Reading: Croft et al. Chapter 6.2.3-6.2.4, 7.3.2; Manning et al. Chapter 9.1-9.2.2
    Project 1 (due 11:59pm October 4)
  6. Text Categorization
    1. Overview, K-Nearest Neighbor
      Reading: Croft et al. Chapter 9-9.1.2; Manning et al. Chapters 13-13.1, 15-15.2
    2. Naive Bayes
      Reading: Yiming Yang and Xin Liu, A re-examination of text categorization methods, SIGIR'99
    3. SVM
  7. Text Clustering
    Project 1 Part 2 (due 11:59pm October 11)
    1. Text Clustering Overview
      Reading: Croft et al. Chapter 9.2-9.2.1, Manning et al. Chapter 16-16.4
    2. K-Means Clustering
    3. Hierarchical Clustering
  8. Collaborative Filtering
    1. Memory-based Collaborative Filtering
      Reading: Croft et al. Chapter 10.4.2
    2. Midterm 1 (Exam and Solutions), a one hour exam in Gradescope available from October 14 9amEDT to October 15 9amEDT. Test will cover Weeks 1-5, although the emphasis will be on ad-hoc retrieval, and questions on web crawling will be primarily in the context of ad-hoc retrieval.
    3. Model-based Collaborative filtering.
      Assignment 4 released, due 10/23/20 in Gradescope. Solutions.
  9. Ethics Issues
    1. Data Privacy
      Reading: AOL Query Log Debacle, Right to be forgotten
    2. Filter Bubbles, Algorithmic Bias. Reading:
    3. Algorithmic Bias, continued
      Project 2 released, due 6 November.
    1. Fake news detection
      Reading: Follow links in slides
      December 4: Drop Date
    2. Guest Lecture, Prof. Goldwasser on Natural Language Processing in IR. Note: This lecture will be on WebEx only, do not expect anyone to be in KRAN G016. It will be live, as well as being recorded for later viewing.
    3. Deep Web & Federated Search
    1. Federated Search, continued
    2. November 4: Reading day (no class)
    3. November 6: Guest Lecture: Rajkumar Pujari on Question Answering. Normal delivery methods (in class/WebEx/recording). Reading: Croft et al. Chapter 11.5
    1. Sentiment Analysis
      Assignment 5 released, due 11/14/20 in Gradescope. Solutions.
    2. Finish with Sentiment Analysis, Search Engine Optimization
    3. The flip side of Search Engine Optimization: Detecting Web Spam
    1. November 16: Midterm 2 (Exam and Solutions), a one hour exam in Gradescope available from November 16 9amEST to November 17 9amEST. Test will cover through Week 9, although only memory-based collaborative filtering (not model-based). You may find you could use model-based collaborative filtering to answer a question; if so, that is okay, but it would probably not be the easiest way to answer any of the questions.
    2. Scaling: MapReduce Framework
      Assignment 6 released, due 1 December 11:59pmEST. Solutions.
    3. Scaling: Hadoop and the view from Yahoo!, Apache Spark.
  10. November 23: Bot Detection
  11. November 30: Trends in Industry: Elizabeth Churchill, Google: Digital Wellbeing. This is a 30 minute video. Please view this before normal class time; we will meet at the normal time on WebEx to discuss the video for the remainder of class time.
  12. December 2: Trends in Industry: Prem Natarajan, Amazon: AI in Our Life, Everywhere. Pay particular attention to the discussion of fairness issues. Do you think the ideas presented reasonably cover what you think is needed to address fairness issues?
  13. December 4: Review, standard class WebEx link.

Final Exam Monday, December 7, 9:00amEST (14:00UTC) - Tuesday, December 8, 9:00amEST.
A two hour comprehensive exam, same format and delivery mechanism as the midterms.
If you have three or more exams scheduled that day and would like to reschedule this exam, please let me know as soon as possible. Note that conflicting exams are pretty much the only reason for rescheduling. For those leaving campus before final exams, please check to make sure you have reasonable connectivity to Gradescope before the exam. If not, please email the instructor (or call at 765-494-6005) to explain the situation so we can make alternate arrangements.
