The explosive growth of available digital information (e.g., Web pages, emails, news, scientific literature) demands intelligent information agents that can sift through all available information and find the most valuable and relevant information. Web search engines, such as Google, Yahoo!, and MSN, are examples of such tools. This course studies the basic principles and practical algorithms used for information retrieval and text mining. The contents include: statistical characteristics of text, several important retrieval models, text categorization, recommendation systems, clustering, information extraction, etc. The course emphasizes both the above applications and solid modeling techniques (e.g., probabilistic modeling) that can be extended to other applications. Students will:
Balamurugan Anandan
Office: LWSN 2149
Email: banandan@purdue.edu
Office hours will be split between the instructor and teaching assistant, with the exact split to be determined based on our schedules in any given week. If the door to LWSN 2142F isn't open, then go to LWSN 2149. Whoever is holding the office hours will also have a WebEx meeting open during that time, for students who are off campus. The schedule is as follows (all times ET):
If none of these times work for you, please contact one of us to set up an appointment.
There will be a course email list used for high-priority announcements.
This will use your @purdue.edu email address; make sure this is forwarded to someplace you look on a regular basis.
We will be using Blackboard for turning in assignments, for recording and distributing grades, and for any other non-public information about the course.
The course will be taught through lectures, supplemented with reading. The primary reading will be from the text, with supplementary material from current research literature where appropriate. The written assignments and projects are also a significant component of the learning experience.
This course is also being offered through Engineering Professional Education. The lectures will be available online; information on accessing the lectures will be made available on Blackboard.
We will be using Piazza to facilitate discussions; this will enable you to post questions as well as respond to questions posted by others. More information on accessing Piazza will be provided here soon.
The formal prerequisites are CS 34800 and STAT 51100. In practice, you should have an undergraduate level background in probability and statistics, and programming skill commensurate with an undergraduate CS program (or a solid CS minor). Background in Machine Learning and some aspects of database management is helpful, but if you have a good knowledge of probability and statistics and a general CS background, you will be able to pick those up along the way.
We may use (and extend) various toolkits including Lemur (which is written in C++, but has C++, Java, and C# APIs) and Lucene (Java; also has a Python API). You may also find several other toolkits helpful when it comes to the final project, including Mallet (Java) and FACTORIE (Scala).
If the thought of learning a new language/programming environment/set of libraries on the fly scares you, then you might not have the programming background needed for the course.
Evaluation will be a subjective process (see my grading standards); however, it will be based primarily on your understanding of the material as evidenced in:
Exams will be open note / open book. To avoid a disparity between resources available to different students, electronic aids are not permitted.
Late work will be penalized 10% per day (24-hour period). This penalty will apply except in case of a documented emergency (e.g., medical emergency), or by prior arrangement if doing the work in advance is impossible due to the fault of the instructor (e.g., you are going to a conference and ask to start the project early, but I don't have it ready yet).
Blackboard will be used to record/distribute grades (and, in some cases, for turning in assignments.)
The qualifying exam will consist of an hour-long supplement given at the end of the course. Passing the qualifier will require both suitable performance in the course and on the qualifying exam. All computer science students are encouraged to take the exam, even if you do not currently plan to pursue a Ph.D.
Please read the departmental academic integrity policy above. This will be followed unless I provide written documentation of exceptions. You may also find Professor Spafford's course policy useful - while I do not apply it verbatim, it contains detail and some good examples that may help to clarify the policies above and those mentioned below.
In particular, I encourage interaction: you should feel free to discuss the course with other students. However, unless otherwise noted, work turned in should reflect your own efforts and knowledge.
For example, if you are discussing an assignment with another student, and you feel you know the material better than the other student, think of yourself as a teacher. Your goal is to make sure that after your discussion, the student is capable of doing similar work independently; their turned-in assignment should reflect this capability. If you need to work through details, try to work on a related, but different, problem.
If you feel you may have overstepped these bounds, or are not sure, please come talk to me and/or note on what you turn in that it represents collaborative effort (the same holds for information obtained from other sources that provided substantial portions of the solution). If I feel you have gone beyond acceptable limits, I will let you know, and if necessary we will find an alternative way of ensuring you know the material. Help you receive in such a borderline case, if cited and not part of a pattern of egregious behavior, is not in my opinion academic dishonesty, and will at most result in a requirement that you demonstrate your knowledge in some alternate manner.
The following also are worthwhile resources:
Still working on this, as I get an idea of the background and expertise of the students. However, for an idea you may see the schedule from a previous semester.
Final Exam Tuesday, May 3, 10:30-12:30, LWSN B155. As with the Midterm, the exam is closed book, but you are allowed notes. For the final this is up to four pages of notes (four sides, or two double-sided sheets, 8.5x11 or A4 paper.)
Qualifying Exam 2:00-3:00pm on Tuesday, May 3 in LWSN 1106.