CS 37300: Data Mining and Machine Learning

TR 10:30-11:45

Math 175

Chris Clifton

Email: clifton_nospam@cs_nojunk.purdue.edu

Course Outline

Course Topics

This course will introduce students to the field of data mining and machine learning, which sits at the interface between statistics and computer science. Data mining and machine learning focuses on developing algorithms to automatically discover patterns and learn models of large datasets. This course introduces students to the process and main techniques in data mining and machine learning, including exploratory data analysis, predictive modeling, descriptive modeling, and evaluation.

Teaching Assistants

Note: during periods when classes are online, office hours will be held via Zoom, so that students or TAs who fear they may have been exposed will not risk exposing others. Click the above link during scheduled hours (all times ET), and you should find one of the instructors on. If not, look in Piazza for possible changes (e.g., due to a particular videoconferencing system being down.)

Instructor Office Hours

Wednesday 1:30-2:30, LWSN 2116E in the same videoconferencing room as listed above for TA office hours. I'll do my best to be available at the listed time, but things frequently come up and I'm unable to maintain that. You aren't limited to those times, though: I am available by appointment; email some good times and I'll pick what works. Or you can just drop by. I'm often in, and if I'm not tied up with something that has to be finished right away, I'll be happy to meet with you.

Mailing List

There will be a course email list used for high-priority announcements. This will use your @purdue.edu email address; make sure this is forwarded to someplace you look on a regular basis.

We will be using Gradescope to turn in and comment on assignments; Blackboard will be used for recording and distributing grades, as well as for any other non-public information about the course.

Course Methodology

The course will be taught through lectures, supplemented with reading. The written assignments and projects are also a significant component of the learning experience.

For review (and if you miss a lecture), you can pick them up as a Boilercast vodcast/podcast (accessible through Blackboard.) Be warned that the audio isn't great; you only see what is on the screen, not what is written on the chalkboard; and you can't ask (or answer) questions; so it isn't really a viable alternative to attending lecture.

We will be using Piazza to facilitate discussions; this will enable you to post questions as well as respond to questions posted by others. Be aware that the default is for posts to be identified and visible to everyone.

We will likely be using iClickers for real-time feedback in class. Please register your iClicker using Blackboard (this can be done after you've started using it.) During online portions of the course, similar quizzes will be held through Blackboard or some other delivery mechanism (see instructions in the corresponding lecture.) Quizzes will be open for 48 hours after the scheduled time of the lecture. Note that my standard practice is to drop the two lowest scores to allow for absences.

Prerequisites

The formal prerequisites are CS 18200: Foundations Of Computer Science and CS 25100: Data Structures and Algorithms. You also must have either taken or be taking STAT 35000: Introduction to Statistics, or STAT 51100: Statistical Methods. (If you have comparable courses, such as ECE 36800, please contact the instructor.)

Evaluation/Grading

Evaluation is a somewhat subjective process (see my grading standards); however, it will be based on your understanding of the material as evidenced in:

Exams will be open note, with two 8.5x11 or A4 pages allowed (e.g., one piece of paper, double-sided). If any additional notes are allowed, these will be announced per exam. To avoid a disparity between resources available to different students, and the possibility of using communication-equipped devices in unethical ways, electronic aids are not permitted.

Late work will be penalized 15% per day (24-hour period or fraction thereof). You are allowed five extension days, to be used at your discretion throughout the semester (illness, job interviews, etc.) Previously, you had to explicitly note in the header of the assignment that you were using extension days (e.g., "using extension days 2 and 3 for this assignment") or the work was considered late. That requirement is no longer used, as I've always had at least 10% of students every semester who are unable to follow those directions. Instead, extension days will be applied at the end of the semester, in the way that is deemed most advantageous to you, and late penalties will then be applied. Because this is only done at the end of the semester, you will see your scores on assignments drop during finals week if you've exceeded the allowed extension days. You must keep track of extension days yourself; the instructors will only calculate this at the end of the semester. Fractional use is not allowed, and extension days may not be used to extend submission past the last day of class.
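To help you keep track of your own extension days, here is a rough sketch of how the end-of-semester calculation could work. It is an illustration only, not the script the instructors use. It assumes the 15% per day is taken off the score you earned (capped at 100%), that all assignments carry equal weight, and it does not model the last-day-of-class limit. The function names and the sample numbers are hypothetical.

```python
from functools import lru_cache
from math import ceil

PENALTY_PER_DAY = 0.15  # from the policy: 15% per day or fraction thereof
EXTENSION_DAYS = 5      # from the policy: five whole extension days


def days_late(hours_late: float) -> int:
    """Any fraction of a 24-hour period counts as a full late day."""
    return max(0, ceil(hours_late / 24))


def penalized(score: float, late_days: int) -> float:
    """Assumption: the penalty is a fraction of the earned score, capped at 100%."""
    return score * max(0.0, 1.0 - PENALTY_PER_DAY * late_days)


def best_total(assignments, budget=EXTENSION_DAYS):
    """Pick the whole-day allocation of extension days that maximizes the total.

    assignments: list of (earned_score, late_days) pairs.
    Returns (best total score, extension days used per assignment).
    """
    n = len(assignments)

    @lru_cache(maxsize=None)
    def solve(i, remaining):
        # Best result for assignments i..n-1 with `remaining` days left.
        if i == n:
            return 0.0, ()
        score, late = assignments[i]
        best = None
        for used in range(min(late, remaining) + 1):
            rest_total, rest_alloc = solve(i + 1, remaining - used)
            total = penalized(score, late - used) + rest_total
            if best is None or total > best[0]:
                best = (total, (used,) + rest_alloc)
        return best

    total, alloc = solve(0, budget)
    return total, list(alloc)


if __name__ == "__main__":
    # Hypothetical semester: (earned score, days late) for three assignments.
    print(best_total([(80, 2), (100, 1), (60, 4)]))
```

The search tries every whole-day split rather than assigning days greedily, because the 100% cap means that removing a single late day from a very late assignment may not recover any points.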

Blackboard will be used to record/distribute grades (and, in some cases, for turning in assignments.)

Policy on Intellectual Honesty

Please read the departmental academic integrity policy above. This will be followed unless I provide written documentation of exceptions. You should also be familiar with the Purdue University Code of Honor and Academic Integrity Guide for Students. You may also find Professor Spafford's course policy useful - while I do not apply it verbatim, it contains detail and some good examples that may help to clarify the policies above and those mentioned below.

In particular, I encourage interaction: you should feel free to discuss the course with other students. However, unless otherwise noted, work turned in should reflect your own efforts and knowledge.

For example, if you are discussing an assignment with another student, and you feel you know the material better than the other student, think of yourself as a teacher. Your goal is to make sure that after your discussion, the student is capable of doing similar work independently; their turned-in assignment should reflect this capability. If you need to work through details, try to work on a related, but different, problem.

If you feel you may have overstepped these bounds, or are not sure, please come talk to me and/or note on what you turn in that it represents collaborative effort (the same holds for information obtained from other sources that provided substantial portions of the solution.) If I feel you have gone beyond acceptable limits, I will let you know, and if necessary we will find an alternative way of ensuring you know the material. Help you receive in such a borderline case, if cited and not part of a pattern of egregious behavior, is not in my opinion academic dishonesty, and will at most result in a requirement that you demonstrate your knowledge in some alternate manner.

Other Issues and Resources

If you have other issues please feel free to talk to me - if I can't help, I'll try to point you in the right direction. Be aware that due to Title IX and state law, there are some things for which I can't promise confidentiality (but see CARE below).

University Emergency Preparedness instructions

Nondiscrimination Statement: Purdue University is committed to maintaining a community which recognizes and values the inherent worth and dignity of every person; fosters tolerance, sensitivity, understanding, and mutual respect among its members; and encourages each individual to strive to reach his or her own potential. In pursuit of its goal of academic excellence, the University seeks to develop and nurture diversity. The University believes that diversity among its many members strengthens the institution, stimulates creativity, promotes the exchange of ideas, and enriches campus life. Purdue’s nondiscrimination policy can be found at http://www.purdue.edu/purdue/ea_eou_statement.html.

Purdue University strives to make learning experiences as accessible as possible. If you anticipate or experience physical or academic barriers based on disability, you are welcome to let me know so that we can discuss options. You are also encouraged to contact the Disability Resource Center at: drc@purdue.edu or by phone: 765-494-1247.

Student Mental Health and Wellbeing: Purdue University is committed to advancing the mental health and wellbeing of its students. If you or someone you know is feeling overwhelmed, depressed, and/or in need of support, services are available. For help, such individuals should contact Counseling and Psychological Services (CAPS) at (765)494-6995 and http://www.purdue.edu/caps/ during and after hours, on weekends and holidays, or through its counselors physically located in the Purdue University Student Health Center (PUSH) and the Psychology building (PSYC) during business hours.

Sexual Violence: Purdue University is devoted to fostering a secure, equitable, and inclusive community. If you or someone you know has been the victim of sexual violence and are interested in seeking help, there are services available. Reporting the incident to any Purdue faculty and certain other employees, including resident assistants, will lead to referral to the Title IX Coordinator, as these individuals are mandatory reporters. The Title IX office can investigate reports of sex-based discrimination, sexual harassment, or sexual violence. Title IX ensures that both parties in a reported event have equal opportunity to be heard and participate in a grievance process. To file an online report visit https://cm.maxient.com/reportingform.php?PurdueUniv&layout_id=15 or contact the Title IX coordinator at 765-494-7255.

The Center for Advocacy, Response, and Education (CARE) offers confidential support and advocacy that does not require the filing of a report to the Title IX office. The CARE staff helps each survivor assess their reporting options and access resources that meet personal needs. The CARE office can be found at 205 North Russell Street in Duhme Hall (Windsor), room 143 Monday - Friday 8:00 AM to 5:00 PM. They can also be reached at their 24/7 hotline 765-495-CARE or at CARE@purdue.edu.

And you should always feel free to call, email, or drop by and talk to me (or, if you have an issue with me, to the department head.)

Text

The texts below are recommended but not required. Reading materials will be distributed as necessary through Blackboard. Please check regularly.

Course outline (numbers correspond roughly to week):

  1. Course Overview
    Suggested reading:
    Assignment 1 released, due 1/17/20 17:30.
  2. Probability and Statistics Review
    1. Machine Learning Overview: K-Nearest Neighbor
    2. Background and basics of Statistics
    Assignment 2 released, due 1/29/20 23:59. (solutions)
    Special Lecture: Ron Wasserstein, executive director of the American Statistical Association, Moving to a World Beyond p < 0.05, January 24, 3pm in Fowler Hall (Stewart Center).
  3. Exploratory data analysis
    1. Hypothesis Testing and Decision Making
    2. Exploratory data analysis
    Suggested reading: Principles of Data Mining, Chapters 4.1-4.3, 2.
  4. Assignment 3 released, due 2/9/20 23:59. (Solutions.)
    Suggested reading: Principles of Data Mining, Chapters 3.1-3.6.
  5. Predictive Modeling
    Suggested reading: Principles of Data Mining, Chapters 5.1-5.3.1, 6.1-6.2.
    Assignment 4 released, due 2/19/20 23:59. (Solutions.)
  6. Predictive Modeling: Search/Optimization
    Special Lecture: Dr. Ruha Benjamin, Princeton University, Race After Technology: Abolitionist Tools for the New Jim Code, Monday, February 17, 5:30-7:00pm in Fowler Hall (Stewart Center).
    Suggested reading: Principles of Data Mining, Chapter 8
  7. Predictive Modeling
  8. SVM, Bias/Variance
    Suggested reading: Principles of Data Mining, Chapter 10.9-10.10.
  9. Descriptive Modeling
    Note: Above slide deck covers the full descriptive modeling section. Suggested reading: Principles of Data Mining, Chapter 9.1, 9.3-9.5.
    March 13: Drop Date.
    Assignment 5 (Due 2 April.)
    March 16-21: Spring Break
    Due to restrictions in place to slow the spread of COVID-19, the remainder of the semester will be online. Online lecture modules are available in Blackboard.
  10. Descriptive Modeling
    Suggested reading: Principles of Data Mining, Chapter 9.2, 9.6.
  11. Pattern Mining
    Suggested reading: Principles of Data Mining, Chapter 13-13.3.
    Assignment 6 (Due 8 April.) (solutions)
  12. Further Topics
  13. Ethics Issues in Data Mining
  14. Data Mining Process
    Suggested reading: Principles of Data Mining, Chapter 13-13.3, CRISP-DM 1.0.
    Assignment 7 (Due 29 April.) (solutions)
  15. Advanced Topics, Review

You may also want to see the canonical syllabus.

The Final Exam was scheduled for Monday, May 4, 1pm-3pm. Due to the COVID-19 shutdown, the exam will be a 2-hour timed exam delivered in Gradescope, using the process followed for Midterm 2. It will be available starting May 4, 9am EDT, and must be completed by May 5, 9am EDT. You may take it anytime during that time frame, but you will only be able to ask questions in Zoom from 9am-7pm and 10pm-midnight EDT. If you have another exam scheduled at that time, or you have three or more exams scheduled that day and would like to reschedule the 37300 exam, please let me know as soon as possible. Note that conflicting exams are normally the only reason for rescheduling, although in this COVID-19 world there are bound to be others.

