The goal of this project is to analyze data regarding the needs and capabilities of higher education with respect to Transportation, Distribution, and Logistics within the state of Indiana. The outcome will be knowledge that can be used to drive and support a proposal for a center in this area. Reha Uzsoy has prepared slides describing the problem (PDF).
One of the tables is the result of a survey of business leaders. The others are locally conducted surveys of the relevant work that institutions of higher learning in the state/region are doing, in terms of Workforce Development and Research. There are slides describing the schema (PDF).
In addition to the Schema slides above, Javed Siddique prepared instructions on connecting Weka to the database (PDF).
Javed has done some preliminary analysis of the data (e.g., univariate statistics). The results are available as a PDF of graphs and an Excel file.
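As a rough illustration of what univariate statistics on a single column might look like, here is a minimal Python sketch. It is not Javed's analysis and does not use the project database: the column names and values are hypothetical toy data, and the helper functions `summarize_numeric` and `summarize_categorical` are made up for this example.

```python
from collections import Counter
from statistics import mean, median, stdev


def summarize_numeric(values):
    """Basic univariate statistics for a numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
        # The sample standard deviation needs at least two values.
        "stdev": stdev(present) if len(present) > 1 else 0.0,
    }


def summarize_categorical(values):
    """Frequency counts for a categorical column, most common value first."""
    return Counter(v for v in values if v is not None).most_common()


# Hypothetical toy columns, NOT taken from the project database.
employees = [12, 250, None, 40, 40, 1800]
sector = ["trucking", "rail", "trucking", None, "warehousing", "trucking"]

print(summarize_numeric(employees))
print(summarize_categorical(sector))
```

Summaries like these are mainly useful for spotting missing values, skewed distributions, and dominant categories before choosing a data mining technique.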
You are encouraged to collaborate on this project - feel free to discuss your ideas with other students. However, the deliverables should be your own: if you work closely with other students, you should each produce something separate (such as a different type of data mining), even if the original ideas were developed jointly. Remember to give credit where credit is due.
There will be three deliverables on this project.
Read the CRISP-DM manual - this deliverable is the project plan described in Step 1.4. You should also plan on completing Step 2 (data understanding) by the time this is due (April 16).
If you use the turnin command, this should be submitted as proposal.
You will have 15 minutes to present your findings during the last week of class. If you intend to use electronic display, please provide me with the files so we don't have to log out / log in between presentations.
If you give your presentation early in the week, it is okay to present partial results - you will have the opportunity to get feedback you can plug in to your final report. The disadvantage is that you will have to be ready sooner. I'll be looking for volunteers.
The project report should contain the following:
I don't expect this to be very long - 4-10 pages for the main body of the report. Given equivalent information content, a short report is better - don't add words for the sake of being wordy.
Please turn in electronically. Appendices don't need to be neatly formatted (or even included in the main document).
Scoring will be based on (in order of importance):
These are the key issues: the above questions assess your knowledge of data mining. The following questions will be of interest, but will have a much lower impact on your final score:
process score above), but not necessarily.
Electronic submission is preferred. Please use the turnin command (on mentor.ics.purdue.edu, turnin -c cs490d -p {proposal, final} directoryname). If that doesn't work, you can tar/zip and email to . PDF is the safest format for capturing non-text content. Hard copy is acceptable; please hand it in at the beginning of class.