ClueWeb English for Federated Search

Creation of the Dataset

We built two new testbeds from the English web pages of ClueWeb.
About half of the full ClueWeb collection is English (roughly 500 million documents). However, a large portion of those documents are spam and thus have little value. We used the Waterloo spam rankings [Cormack et al., 2011] to filter them out: documents with a Fusion score below 70 were removed from the dataset, except for Wikipedia pages and documents relevant to the training and testing queries. This left us with around 151 million documents.
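
As a rough sketch, this filtering step can be expressed in Python as follows. The rankings file layout (one "score document-id" pair per line), the function name, and the exemption sets are assumptions; only the threshold of 70 is given above.

    def load_keep_set(fusion_path, wiki_ids, qrels_ids, threshold=70):
        """Return IDs of documents that survive spam filtering:
        Fusion score >= threshold, or exempt because the document is
        a Wikipedia page or relevant to a training/testing query."""
        keep = set()
        with open(fusion_path) as f:
            for line in f:
                score, doc_id = line.split()  # assumed line layout
                if int(score) >= threshold or doc_id in wiki_ids or doc_id in qrels_ids:
                    keep.add(doc_id)
        return keep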

Three TREC Web tracks (2009 through 2011) have used ClueWeb so far, each providing 50 queries; we build the new testbeds around these 150 queries. Within the full ClueWeb dataset, Wikipedia is the main contributor of relevant documents for the Web track queries, and at about 6 million documents it is large enough to form a separate testbed.
We extracted all Wikipedia documents and applied the same K-means algorithm that was used to create TREC4-Kmeans. For training and testing we kept only the 106 queries (out of the 150 provided) that have at least one relevant Wikipedia document. In the end, we constructed 100 information sources for the ClueWeb-Wiki testbed, with statistics given in the table below.
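
The text above only states that the same K-means procedure as for TREC4-Kmeans was applied. As an illustrative stand-in, the sketch below partitions the Wikipedia documents into 100 sources with scikit-learn; the TF-IDF features, vocabulary cap, and fixed random seed are all assumptions.

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    def cluster_wiki_docs(doc_texts, k=100):
        # TF-IDF bag-of-words features: an assumed choice, not
        # necessarily what the original TREC4-Kmeans setup used.
        vectors = TfidfVectorizer(max_features=50000,
                                  stop_words="english").fit_transform(doc_texts)
        # One cluster per information source.
        return KMeans(n_clusters=k, random_state=0).fit_predict(vectors)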

For the rest of the English ClueWeb, we divided documents mainly by domain. After the spam filtering described above, we extracted every document's URL and identified its main domain: for instance, the URL www.plus.google.com falls under the google.com domain, whereas something.bbc.co.uk falls under bbc.co.uk. We then split the domains into two groups: those with at least one document relevant to any of the 150 queries (4,655 such domains) and those without any relevant documents.
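
Mapping a URL to its main domain has to respect multi-part suffixes such as .co.uk. One way to reproduce this step (an assumption, since the original implementation is not described) is the tldextract package, which consults the public suffix list:

    import tldextract

    def main_domain(url):
        # "www.plus.google.com" -> "google.com"
        # "something.bbc.co.uk" -> "bbc.co.uk"
        return tldextract.extract(url).registered_domain

    assert main_domain("http://www.plus.google.com") == "google.com"
    assert main_domain("http://something.bbc.co.uk") == "bbc.co.uk"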

In the end, a total of 2,780 information sources were created, including the 100 Wikipedia sources.

Testbed        Size (GB)   # of inf. sources   # of documents   Min # of docs   Max # of docs   Avg # of docs
ClueWeb-Wiki   252         100                 5,957,529        4,400           434,525         59,290.28
ClueWeb-E      4,500*      2,780               151,161,188      48              3,417,805       54,364.27

*: estimated

Downloads

The zip file contains one or more assignment files. Each line of an assignment file maps a document to its information source, using the following format:

document_id assigned_source

Documents not in the list were considered spam and have been excluded.
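
For example, an assignment file can be loaded into a document-to-source map with a few lines of Python (the file name below is illustrative):

    def load_assignments(path):
        assignments = {}
        with open(path) as f:
            for line in f:
                doc_id, source = line.split()
                assignments[doc_id] = source
        return assignments

    doc_to_source = load_assignments("assignments_clueweb_wiki.txt")  # illustrative name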

References

  1. G. V. Cormack, M. D. Smucker, and C. L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Information Retrieval, 14(5):441–465, 2011.