All Data Sets


Name: Cora Citation Matching
Description: Text of citations hand-clustered into groups referring to the same paper.
Download (internal): http://www.cs.purdue.edu/commugrate/data/mccallum/cora-refs.tar.gz
Download (external): http://www.cs.umass.edu/~mccallum/data/cora-refs.tar.gz

Name: SRAA: Simulated/Real/Aviation/Auto UseNet data
Description: 73,218 UseNet articles from four discussion groups, for simulated auto racing, simulated aviation, real autos, real aviation. I have often used this data for binary classification (separating real from simulated, and auto from aviation), making the point that the same data can be classified in different ways depending on the user's needs. This is especially interesting for semi-supervised learning. This data was gathered by Andrew McCallum while at Just Research.
Download (internal): Coming soon
Download (external): http://www.cs.umass.edu/~mccallum/data/sraa.tar.gz

Name: Cora Research Paper Classification
Description: Research papers classified into a topic hierarchy with 73 leaves. We call this a relational data set, because the citations provide relations among papers.
Download (internal): http://www.cs.purdue.edu/commugrate/data/mccallum/cora-classify.tar.gz
Download (external): http://www.cs.umass.edu/~mccallum/data/cora-classify.tar.gz

Name: Cora Information Extraction
Description: Research paper headers and citations, with labeled segments for authors, title, institutions, venue, date, page numbers and several other fields.
Download (internal): http://www.cs.purdue.edu/commugrate/data/mccallum/cora-ie.tar.gz
Download (external): http://www.cs.umass.edu/~mccallum/data/cora-ie.tar.gz

Name: Frequently Asked Questions
Description: Several UseNet FAQs segmented into questions and answers. Data gathered and labeled by Dayne Freitag and Andrew McCallum.
Date: 01/01/2000
Download (internal): http://www.cs.purdue.edu/commugrate/data/mccallum/faqdata/
Download (external): http://www.cs.umass.edu/~mccallum/data/faqdata

Name: CMU Seminar Announcements
Description: 48 emailed seminar announcements, with labeled segments for speaker, title, start-time, end-time. Labeled by Dayne Freitag.
Download (internal): http://www.cs.purdue.edu/commugrate/data/mccallum/sa-tagged.tar.gz
Download (external): http://www.cs.umass.edu/~mccallum/data/sa-tagged.tar.gz

Name: Industry Sector
Description: Corporate web pages classified into a topic hierarchy with about 70 leaves.
Download (internal): http://www.cs.purdue.edu/commugrate/data/mccallum/sector.tar.gz
Download (external): http://www.cs.umass.edu/~mccallum/data/sector.tar.gz

Name: 20 Newsgroups
Description: About 20,000 UseNet postings from 20 newsgroups. Gathered by Ken Lang at CMU in the mid-90s. This is the original set, without the various edits made by Jason Rennie and others.
Date: 09/09/1999
Download (internal): http://www.cs.purdue.edu/commugrate/data/20_newsgroups/
Download (external): http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html

Name: CiteSeer
Description: CiteSeer collection of research publications (BibTeX records and OAI records). OAI records are in two formats:
* oai_dc.tar.gz - Contains records in the Dublin Core metadata standard.
* oai_citeseer.tar.gz - The Dublin Core standard with additional metadata fields, including citation relationships (References and IsReferencedBy), author affiliations, and author addresses.
Download (internal): http://www.cs.purdue.edu/commugrate/data/citeseer
Download (external): http://citeseer.ist.psu.edu/oai.html

Name: DBLP
Description: A collection of bibliographic information on major computer science journals and proceedings. DBLP indexes more than one million articles and contains more than 10,000 links to home pages of computer scientists.
Download (internal): http://www.cs.purdue.edu/commugrate/data/dblp
Download (external): http://dblp.uni-trier.de/xml/
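
For the oai_dc.tar.gz archive in the CiteSeer entry above, a minimal sketch of reading Dublin Core fields from a single record might look like the following. The two namespace URIs are the standard ones for the oai_dc format; the sample record and the helper name `parse_dc_record` are made up for illustration, and the layout of files inside the archive is not described here, so extracting individual records from the tarball is left out.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for the oai_dc metadata format.
NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical sample record, invented for illustration only;
# it is not taken from the CiteSeer archive.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                       xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An Example Title</dc:title>
  <dc:creator>First Author</dc:creator>
  <dc:creator>Second Author</dc:creator>
  <dc:date>2000-01-01</dc:date>
</oai_dc:dc>"""


def parse_dc_record(xml_text):
    """Map each Dublin Core element name to the list of its values."""
    root = ET.fromstring(xml_text)
    dc_prefix = "{" + NS["dc"] + "}"
    fields = {}
    for elem in root.iter():
        # Keep only elements in the Dublin Core namespace.
        if elem.tag.startswith(dc_prefix):
            name = elem.tag[len(dc_prefix):]
            fields.setdefault(name, []).append((elem.text or "").strip())
    return fields


record = parse_dc_record(SAMPLE)
print(record["title"], record["creator"])
```

Values are kept as lists because Dublin Core elements such as creator may repeat within one record.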


This research is supported by NSF Grant Number IIS 0916614 and by the Purdue Cyber Center.