This is the dataset used in our paper “A Joint Probabilistic Classification Model for Resource Selection” in SIGIR'10.
This real-world dataset contains 80 sources (i.e., digital libraries) that are accessible from Purdue University Libraries. Because the copyright belongs to the sources' owners, we cannot make the sample documents public. However, we will make the feature file public on this site.
Each digital library has one id. This is the list of those ids. For more detailed information, including each library's name and description, please check this file.
The query file. Each line contains one query. Each query is assigned an id (qid) from 1 to 100.
The feature file. Each line contains the features for one pair of a source and a query. The format of this file is:
0 pid:PUR00021 qid:1 1:0.0023381 2:0.0041696 3:0.00068636 4:195.92 5:53.849 6:302.98
0 pid:PUR00021 qid:2 1:1.9027e-05 2:9.5963e-05 3:0.0016461 4:2.1707 5:0 6:4.2967
0 pid:PUR00021 qid:3 1:0.00019331 2:0.00048378 3:0.16086 4:15.459 5:10.675 6:15.459
The first column is the relevance judgment: 0 means irrelevant, 1 means partially relevant, and 2 means highly relevant. The second column is the source's id (pid). The third column is the query id (qid). The remaining columns are the features, each written as index:value. There are 6 features, in the following order:
ReDDE top 100
ReDDE top 1000
Please refer to the paper for more details of those features.
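For readers who want to load the feature file, the following is a minimal Python sketch of a parser for the line format described above. It is not part of the original release: the names FeatureRow and parse_feature_line are hypothetical, and the sketch assumes each record sits on a single whitespace-separated line with feature indices running consecutively from 1.

```python
from typing import List, NamedTuple


class FeatureRow(NamedTuple):
    """One (source, query) record. Hypothetical container, not from the dataset release."""
    relevance: int         # 0 = irrelevant, 1 = partially relevant, 2 = highly relevant
    source_id: str         # value of the pid field, e.g. "PUR00021"
    query_id: int          # value of the qid field
    features: List[float]  # feature values ordered by their index


def parse_feature_line(line: str) -> FeatureRow:
    """Parse a line such as '0 pid:PUR00021 qid:2 1:1.9027e-05 ... 6:4.2967'."""
    tokens = line.split()
    relevance = int(tokens[0])
    # Every remaining token has the form key:value.
    fields = dict(token.split(":", 1) for token in tokens[1:])
    source_id = fields.pop("pid")
    query_id = int(fields.pop("qid"))
    # Assumption: the keys left over are the consecutive indices 1..n.
    features = [float(fields[str(i)]) for i in range(1, len(fields) + 1)]
    return FeatureRow(relevance, source_id, query_id, features)


example = "0 pid:PUR00021 qid:2 1:1.9027e-05 2:9.5963e-05 3:0.0016461 4:2.1707 5:0 6:4.2967"
row = parse_feature_line(example)
print(row.source_id, row.query_id, row.features)
```

To read a whole file, the same function can be applied to every non-empty line, for example by iterating over an open file handle and skipping blank lines.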