This course will be an introduction to data mining. Topics will range from statistics to machine learning to databases, with a focus on analysis of large data sets. Expect at least one project involving real data to which you will be the first to apply data mining techniques.
The course will be based on Introduction to Data Mining, developed under National Science Foundation funding at the Illinois Institute of Technology. See their web site to get a better idea of what the course will be like.
Please send questions to the course newsgroup purdue.class.cs490d. This should be used for most questions. If you have something you don't want made public, send it to . We will also be using WebCT for recording and distributing grades.
The course will be taught through lectures, with class participation expected and encouraged. There will be frequent reading assignments to supplement the lectures. The workload will include both written assignments and programming projects. Projects will be primarily individual, and self-contained. (This won't be a compilers/OS style project course.)
CS348 or CS448 (concurrent registration in 448 okay), or permission of instructor.
It appears that the Computer Technologies program is accepting this as a database management selective. CPT 372 should provide reasonable database programming background. Some Java programming experience/ability helps, but you can probably learn that on your own if you have a reasonable programming background. The other requirement is a solid mathematical background. Some statistical background will help (e.g., STAT 225 is sufficient), but a good mathematical background is likely to be enough even without that course. I can assume any CS students have had discrete math, which gives the appropriate level of mathematical maturity. If you haven't had something beyond calculus, you might want to discuss it with me. See information on non-CS students registering for CS courses.
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor. Morgan Kaufmann Publishers, August 2000. 550 pages. ISBN 1-55860-489-8.
You might also find the following useful (it is the companion book to WEKA, which will be used for course projects):
Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers, October 1999. 371 pages. ISBN 1-55860-552-5.
The exact mix of projects, written homeworks, presentations, etc. is yet to be determined. However, at this point I expect there will be several programming projects. During weeks when you are not working on a project, there will be analytical written homework problems, with a mix of mathematical work (e.g., complexity of an algorithm) and case studies (e.g., discuss different approaches to a real-world data analysis problem).
Evaluation will be a subjective process (see my grading standards); however, it will be based primarily on your understanding of the material as evidenced in:
Exams will be open note / open book. To avoid a disparity between resources available to different students, electronic aids are not permitted.
Projects and written work will be evaluated on a ten point scale:
Late work will be penalized 1 point per day (24 hour period). This penalty will apply except in case of documented emergency (e.g., medical emergency), or by prior arrangement if doing the work in advance is impossible due to the fault of the instructor (e.g., you are going to a conference and ask to start the project early, but I don't have it ready yet).
It is likely there will be a final project, consisting of data describing research and education opportunities and needs in the state in the transportation, distribution, and logistics sector. This will be open-ended: you will be given data, and asked to use the knowledge gained in the class to learn something interesting. The final result will be a set of presentation slides describing what you found and how you found it. Particularly good projects are likely to get some exposure (with your name attached, of course) to the Central Indiana Corporate Partnership, a group of top executives of the state's leading companies.
One last reason to take the course can be found at the kdnuggets web site.
Please read the departmental academic integrity policy above. This will be followed unless I provide written documentation of exceptions. In particular, I encourage interaction: you should feel free to discuss the course with other students. However, unless otherwise noted, work turned in should reflect your own efforts and knowledge.
For example, if you are discussing an assignment with another student, and you feel you know the material better than the other student, think of yourself as a teacher. Your goal is to make sure that after your discussion, the student is capable of doing similar assignments independently; their turned-in assignment should reflect this capability. If you need to work through details, try to work on a related, but different, problem.
If you feel you may have overstepped these bounds, or are not sure, please come talk to me or note on what you turn in that it represents collaborative effort (the same holds for information obtained from other sources that you feel may cause what you turn in to not reflect your true ability). If I feel you have gone beyond acceptable limits, I will let you know, and if necessary we will find an alternative way of ensuring you know the material. Help you receive in such a borderline case, if cited and not part of a pattern of egregious behavior, is not in my opinion academic dishonesty, and will at most result in a requirement that you demonstrate your ability in some alternate manner.
Multi-Relational Data Mining: An Introduction, SIGKDD Explorations 5(1), pp. 1-16, July 2003.
Final exam, Monday, May 3, 15:20-17:20, REC 103.