Final Project: Mining Logistics Research & Education Needs

Start date: 31 March. Proposal due 16 April at the beginning of class; presentations the week of 26 April; final report due 30 April at the beginning of class.

The goal of this project is to analyze data regarding the needs and capabilities of higher education with respect to Transportation, Distribution, and Logistics within the state of Indiana. The outcome will be knowledge that can be used to drive and support a proposal for a center in this area. Reha Uzsoy has prepared slides describing the problem (PDF).

One of the tables is the result of a survey of business leaders. The others come from locally conducted surveys of the relevant work that institutions of higher learning in the state/region are doing, in terms of Workforce Development and Research. There are slides describing the schema (PDF).

In addition to the Schema slides above, Javed Siddique prepared instructions on connecting Weka to the database (PDF).

Javed has done some preliminary analysis of the data (e.g., univariate statistics). The results are available as a PDF of graphs and an Excel file.

You are encouraged to collaborate on this project - feel free to discuss your ideas with other students. The deliverables should be your own. E.g., if you work closely with other students, you should each produce something separate (such as a different type of data mining), even if the original ideas were developed jointly. Remember to provide credit where credit is due.

Deliverables

There will be three deliverables on this project.

Project Plan

Read the CRISP-DM manual - this is the project plan described in Step 1.4. You should also plan on completing Step 2 (data understanding) by the time this is due (April 16.)

If you use the turnin command, this should be submitted as proposal.

Presentation

You will have 15 minutes to present your findings during the last week of class. If you intend to use an electronic display, please provide me with the files beforehand so we don't have to log out and log in between presentations.

If you give your presentation early in the week, it is okay to present partial results - you will have the opportunity to get feedback that you can incorporate into your final report. The disadvantage is that you will have to be ready sooner. I'll be looking for volunteers.

Project Report

The project report should contain the following:

Executive Summary
The executive summary should capture briefly the questions you addressed and your key results (in business terms). Also briefly mention any caveats to use of the data. Summarize with your main suggestions for how to act on these results. This should be roughly 10% of the length of the full report.
Main Report
The main report should cover the following points:
  1. What problems you specifically addressed, including details in technical as well as business terms
  2. The process you followed: Preprocessing steps, techniques used.
  3. Interesting results. Include the business meaning of the result, as well as the specific result (e.g., the actual association rule, or numbers associated with means of a cluster.)
  4. Conclusions: What actions should be taken (e.g., what areas would you suggest the center focus on.)
Appendix: Process
Give an appendix that contains information that would allow someone else to repeat your analyses (assuming a reasonable knowledge of the tools used, e.g., someone else in the class.)
Appendix: Detailed results
Actual printouts of the results, annotated so that if someone did rerun your analysis, they would know how to get from the raw results to conclusions.
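As an illustration of what an annotated "specific result" might look like for an association rule, here is a minimal Python sketch that computes support, confidence, and lift for one rule. The response data and attribute names below are invented for illustration and are not taken from the project database; your own tool's output may report these measures in a different layout.

```python
# Hypothetical toy data: each set lists the needs one survey respondent reported.
# These values are invented for illustration only.
responses = [
    {"needs_rfid_training", "needs_supply_chain_research"},
    {"needs_rfid_training", "needs_supply_chain_research"},
    {"needs_rfid_training"},
    {"needs_supply_chain_research"},
    {"needs_warehouse_training"},
]


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    # Count transactions containing the antecedent, the consequent, and both.
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_consequent = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

    support = n_both / n
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    # Lift compares the rule's confidence with the consequent's base rate.
    lift = confidence / (n_consequent / n) if n_consequent else 0.0
    return support, confidence, lift


support, confidence, lift = rule_metrics(
    responses,
    antecedent={"needs_rfid_training"},
    consequent={"needs_supply_chain_research"},
)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

In a report, the annotation would then tie these numbers to a business statement, for example how often respondents who report the first need also report the second, and whether that is more often than the base rate would suggest.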

I don't expect this to be very long - 4-10 pages for the main body of the report. Given equivalent information content, a short report is better - don't add words for the sake of being wordy.

Please turn in electronically. Appendices don't need to be neatly formatted (or even included in the main document).

Scoring

Scoring will be based on (in order of importance):

  1. The process you followed: Is it correct (given the techniques you used), and did you describe it well? This includes things such as data selection, preprocessing, parameter setting, etc.
  2. Techniques used: Given the business questions you chose to address, did you select appropriate techniques and justify why?
  3. Interpretation of results: Did you correctly understand and interpret the raw results you obtained?
  4. Quality of writeup: Did you present what you did well, in an understandable and usable manner?

These are the key issues: the questions above assess your knowledge of data mining. The following questions will be of interest, but will have a much lower impact on your final score:

Turning in the project

Electronic submission is preferred. Please use the turnin command on mentor.ics.purdue.edu: turnin -c cs490d -p proposal directoryname for the proposal, or turnin -c cs490d -p final directoryname for the final report. If that doesn't work, you can tar/zip your files and email them to clifton_nospam@cs_nojunk.purdue.edu. PDF is the safest format for capturing non-text content. Hard copy is also acceptable; please hand it in at the beginning of class.

