CS 47300: Web Information Search And Management

MWF 10:30-11:20

ME 1130

Chris Clifton

Email: clifton_nospam@cs_nojunk.purdue.edu

Course Topics

This course teaches the core concepts of information retrieval for managing unstructured data, such as text on the Web or in email. Along the way, students will be exposed to a large number of important applications and will get hands-on experience through homework and a course project. The first part of the course focuses on general concepts and techniques such as stemming, indexing, the vector space model, and feedback procedures. The second part shows how to apply these techniques to applications such as Web search, text categorization, and information recommendation.

Teaching Assistants

Office Hours

Monday 2-3, Thursday 2:30-3:30, LWSN 2142F. I am also available by appointment; email me some good times and I'll pick what works. Or you can just drop by: I'm often in, and if I'm not tied up with something that has to be finished right away, I'll be happy to talk with you.

Mailing List

There will be a course email list used for high-priority announcements. This will use your @purdue.edu email address; make sure this is forwarded to someplace you look on a regular basis.

We will be using Blackboard for turning in assignments, for recording and distributing grades, and for any other non-public information about the course.

Course Methodology

The course will be taught through lectures, supplemented with reading. The written assignments and projects are also a significant component of the learning experience.

For review (or if you miss a lecture), you can pick up lectures as an Echo360 vodcast/podcast (accessible through Blackboard or through Echo360; log in via your institution). Be warned that the audio isn't great, you only see what is on the screen (not what is written on the chalkboard), and you can't ask or answer questions, so it isn't really a viable alternative to attending lecture.

We will be using Piazza to facilitate discussions; this will enable you to post questions as well as respond to questions posted by others.


The formal prerequisite is CS 25100: Data Structures and Algorithms (or ECE 36800). It will help if you have taken CS 37300: Data Mining and Machine Learning and/or a statistics course such as STAT 35000: Introduction to Statistics or STAT 51100: Statistical Methods. (If you have comparable courses, such as ECE 36800, please contact the instructor.)


Evaluation is a somewhat subjective process (see my grading standards); however, it will be based primarily on your understanding of the material as evidenced in:

Exams will be open note, with two 8.5x11 or A4 pages allowed (e.g., one piece of paper, double-sided). If any additional notes are allowed, these will be announced per exam. To avoid a disparity between resources available to different students, and the possibility of using communication-equipped devices in unethical ways, electronic aids are not permitted.

Late work will be penalized 15% per day (24-hour period or fraction thereof). You are allowed five extension days, to be used at your discretion throughout the semester (illness, job interviews, etc.). You must explicitly note in the header of the assignment that you are using them (e.g., "using extension days 2 and 3 for this assignment"), or it will be considered late. Fractional use is not allowed, and extension days may not be used to extend submission past the last day of class.
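As a rough illustration of the late-work arithmetic, here is a minimal Python sketch. It reflects one possible reading of the policy, not the instructor's official grading procedure. The function names are hypothetical. It assumes that the 15% is taken from the score you would otherwise have earned, that the score is floored at zero, and that extension days simply push the deadline back by whole days. It does not model the rule that extensions cannot go past the last day of class.

```python
import math
from datetime import datetime, timedelta

PENALTY_PER_DAY = 0.15  # from the policy: 15% per day or fraction thereof


def late_days(due: datetime, submitted: datetime, extension_days: int = 0) -> int:
    """Count chargeable late days after applying whole extension days."""
    effective_due = due + timedelta(days=extension_days)
    if submitted <= effective_due:
        return 0
    # Any fraction of a 24-hour period counts as a full day.
    seconds_late = (submitted - effective_due).total_seconds()
    return math.ceil(seconds_late / 86400)


def penalized_score(raw_score: float, days_late: int) -> float:
    """Assumption: the penalty scales the raw score and is floored at zero."""
    return max(0.0, raw_score * (1 - PENALTY_PER_DAY * days_late))
```

Under these assumptions, a submission 31 minutes past the deadline counts as one late day, and using one extension day on that assignment brings it back to zero.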

Blackboard will be used to record/distribute grades (and, in some cases, for turning in assignments.)

Policy on Intellectual Honesty

Please read the departmental academic integrity policy above. This will be followed unless I provide written documentation of exceptions. You should also be familiar with the Purdue University Code of Honor and Academic Integrity Guide for Students. You may also find Professor Spafford's course policy useful - while I do not apply it verbatim, it contains detail and some good examples that may help to clarify the policies above and those mentioned below.

In particular, I encourage interaction: you should feel free to discuss the course with other students. However, unless otherwise noted, work turned in should reflect your own efforts and knowledge.

For example, if you are discussing an assignment with another student, and you feel you know the material better than the other student, think of yourself as a teacher. Your goal is to make sure that after your discussion, the student is capable of doing similar work independently; their turned-in assignment should reflect this capability. If you need to work through details, try to work on a related, but different, problem.

If you feel you may have overstepped these bounds, or are not sure, please come talk to me and/or note on what you turn in that it represents collaborative effort (the same holds for information obtained from other sources that provided substantial portions of the solution). If I feel you have gone beyond acceptable limits, I will let you know, and if necessary we will find an alternative way of ensuring you know the material. Help you receive in such a borderline case, if cited and not part of a pattern of egregious behavior, is not, in my opinion, academic dishonesty, and will at most result in a requirement that you demonstrate your knowledge in some alternate manner.

If you have other issues

University Emergency Preparedness instructions

Student Mental Health and Wellbeing: Purdue University is committed to advancing the mental health and wellbeing of its students. If you or someone you know is feeling overwhelmed, depressed, and/or in need of support, services are available. For help, such individuals should contact Counseling and Psychological Services (CAPS) at (765)494-6995 and http://www.purdue.edu/caps/ during and after hours, on weekends and holidays, or through its counselors physically located in the Purdue University Student Health Center (PUSH) and the Psychology building (PSYC) during business hours.

Sexual Violence: Purdue University is devoted to fostering a secure, equitable, and inclusive community. If you or someone you know has been the victim of sexual violence and is interested in seeking help, there are services available. Reporting the incident to any Purdue faculty and certain other employees, including resident assistants, will lead to referral to the Title IX Coordinator, as these individuals are mandatory reporters. The Title IX office can investigate reports of sex-based discrimination, sexual harassment, or sexual violence. Title IX ensures that both parties in a reported event have equal opportunity to be heard and participate in a grievance process. To file an online report visit https://cm.maxient.com/reportingform.php?PurdueUniv&layout_id=15 or contact the Title IX Coordinator at 765-494-7255.

The Center for Advocacy, Response, and Education (CARE) offers confidential support and advocacy that does not require the filing of a report to the Title IX office. The CARE staff helps each survivor assess their reporting options and access resources that meet personal needs. The CARE office can be found at 205 North Russell Street in Duhme Hall (Windsor), room 143, Monday - Friday, 8:00 AM to 5:00 PM. They can also be reached at their 24/7 hotline, 765-495-CARE, or at CARE@purdue.edu.

And you should always feel free to call, email, or drop by and talk to the instructor (or, if you have an issue with the instructor, the department head.)


The basic text for this course is:

The following book may also be of interest, as it gives a somewhat different treatment of the material. You don't need both books; this one should be considered optional reading.

Course Outline (numbers correspond roughly to weeks):

  1. Course Introduction, Text Preprocessing.
    1. Introductory lecture, discussion of areas and applications of Information Retrieval.
    2. Ad-Hoc Information Retrieval overview (Chapters 1 and 2)
    3. Text Preprocessing (Chapters 1 and 2)
    Reading: Croft et al. Chapter 1 (esp. through 1.2), 4 through 4.4; Manning et al. Chapter 2 through 2.2.
  2. Ad-Hoc IR Methods
    1. Basic Concepts, continued: Indexing, Evaluation
      Reading: Croft et al. Chapter 5.1, 5.3-5.3.3, 8-8.2, 8.4-8.4.3, 8.5, 8.7; Manning et al. Chapter 8 (esp. through 8.5).
    2. Boolean Retrieval Models
      Reading: Croft et al. Chapter 7-7.1.1.
      Assignment 1 (due in Blackboard at 11:59pm EDT on September 6, 2017.)
    3. Vector Space Retrieval Models
      Reading: Croft et al. Chapter 7.1.2; Manning et al. Chapter 6.2-6.4.
  3. Retrieval Models continued
    1. Labor Day, no class
    2. Latent Semantic Indexing
      Reading: Manning et al. Chapter 18.
    3. More on LSI, introduction to Probabilistic Retrieval Models
      Reading: Croft et al. Chapter 7.3-7.3.1 (skim 7.2); Manning et al. Chapter 11-11.3
  4. Retrieval Models continued
    Project 1 (due at 11:59pm EDT on September 24, 2017.)
    1. Binary Independence Model
    2. Query Expansion
      Reading: Manning et al. Chapter 9-9.1
    3. Relevance Feedback
      Reading: Manning et al. Chapter 9.2-9.2.2
  5. Text Categorization
    Reading: Croft et al. Chapter 9-9.1 (skim 7.2); Manning et al. Chapter 13-13.1; Yiming Yang and Xin Liu, A re-examination of text categorization methods, SIGIR'99
  6. Text Categorization
  7. Graph structure based retrieval
    1. October Break, no class
    2. October 11, in class: First Midterm.
  9. Web Search
  10. Web Search
    October 23: Guest Lecture by Prof. Bruno Ribeiro
    October 24: Drop Date. We'll have the first midterm and project graded and returned by October 20. Prof. Clifton will be traveling 10/21-24, so it is best to contact him by October 20 if you are thinking of dropping.
  11. Collaborative Filtering
    November 3: Guest Lecture by Prof. Dan Goldwasser, Natural Language Processing for Search
  12. Content-based Filtering
    Expect the second midterm during week 12 or 13
  13. Deep Web and Federated Search
    1. Thanksgiving Vacation, no class Wednesday or Friday
  15. TBD
  16. TBD/Review

Final Exam Tuesday, December 12, 3:30pm-5:30pm, WTHR 172.
If you have another exam scheduled at that time, or you have three or more exams scheduled that day and would like to reschedule this exam, please let me know as soon as possible. Note that conflicting exams are pretty much the only reason for rescheduling; "I bought a ticket to go home earlier" is not an accepted reason for an exam to be rescheduled.