We used only the English web pages of ClueWeb to build two new testbeds. About half of the full ClueWeb is in English, roughly 500 million documents. However, a large portion of those documents are spam and thus have little value. We employed the Waterloo Spam Rankings [Cormack et al., 2011] to filter out spam documents: documents with Fusion scores below 70 were removed from the dataset, except for Wikipedia pages and documents relevant to the training and testing queries. This left us with around 151 million documents.
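For concreteness, a minimal sketch of this filtering step is shown below. The Fusion score file format (one "percentile docno" pair per line), the whitelist of judged documents, and the use of the "enwp" substring to recognize ClueWeb Wikipedia record ids are assumptions for illustration, not details from the original pipeline.

```python
# Sketch of the spam filter: keep Wikipedia pages, judged (relevant) documents,
# and any page whose Waterloo Fusion percentile is at least 70.
SPAM_THRESHOLD = 70

def load_fusion_scores(path):
    """Assumed format: one 'percentile docno' pair per line."""
    scores = {}
    with open(path) as f:
        for line in f:
            score, docno = line.split()
            scores[docno] = float(score)
    return scores

def keep_document(docno, scores, judged_docnos):
    if docno in judged_docnos:        # relevant to a training/testing query
        return True
    if "enwp" in docno:               # assumed marker for ClueWeb Wikipedia records
        return True
    return scores.get(docno, 0.0) >= SPAM_THRESHOLD
```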
Three Web tracks of TREC (2009 to 2011) have used ClueWeb so far, each providing 50 queries; we build the new testbeds around these 150 queries. Within the full ClueWeb dataset, Wikipedia is the main contributor of relevant documents for the Web track queries. Wikipedia comprises about 6 million documents, a reasonable size for a separate testbed. We extracted all Wikipedia documents and applied the same K-means algorithm that was used to create TREC4-Kmeans. For training and testing, we selected only the 106 queries (out of the 150 provided) that have at least one relevant Wikipedia document. In the end, we constructed 100 information sources for the ClueWeb-Wiki testbed, with statistics provided in the table below.
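As a rough illustration of this step, the sketch below clusters the Wikipedia documents into 100 groups with K-means over TF-IDF features using scikit-learn; the actual TREC4-Kmeans feature representation and parameters may differ, so treat this only as an assumed stand-in.

```python
# Generic K-means sketch: cluster Wikipedia pages into 100 information sources.
# TF-IDF features and MiniBatchKMeans are assumptions, not the exact
# TREC4-Kmeans procedure.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

def cluster_wiki(doc_ids, doc_texts, k=100, seed=0):
    vectorizer = TfidfVectorizer(max_features=50000, stop_words="english")
    features = vectorizer.fit_transform(doc_texts)
    kmeans = MiniBatchKMeans(n_clusters=k, random_state=seed, batch_size=10000)
    labels = kmeans.fit_predict(features)
    # Each cluster becomes one information source, e.g. "wiki_042".
    return {doc_id: "wiki_%03d" % label for doc_id, label in zip(doc_ids, labels)}
```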
For the rest of the English ClueWeb, we mainly divide documents by their domains. After filtering out spam as described above, we first extracted all document URLs and identified their main domains. For instance, the URL www.plus.google.com has the main domain google.com, whereas something.bbc.co.uk has the main domain bbc.co.uk. We then divided all domains into two groups: domains with at least one document relevant to any of the 150 queries (there are 4,655 such domains), and domains without any relevant documents.
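As an illustration of the main-domain step, the sketch below uses the tldextract package (our choice, not necessarily the authors'), which is aware of multi-part suffixes such as .co.uk:

```python
# Sketch of main-domain extraction with the public-suffix-aware tldextract
# package, e.g. www.plus.google.com -> google.com, something.bbc.co.uk -> bbc.co.uk.
from urllib.parse import urlparse
import tldextract

def main_domain(url):
    host = urlparse(url if "://" in url else "http://" + url).netloc
    parts = tldextract.extract(host)
    return parts.domain + "." + parts.suffix     # the registered ("main") domain
```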
Since the second group is quite large, we further divide it in two steps. In the first step, documents are grouped by the file partition that comes with the original ClueWeb: the top-level files are named en0000 to en0133, for a total of 134 sets. This partitioning was probably done during the crawling process. In the second step, each document in each of the 134 large collections is classified into one of eight buckets by the ending of its domain: .com, .gov, .org, .edu, .net, .us, .uk, and the rest. This is based on our observation that most relevant documents for the 150 queries fall into those categories. In total, this construction builds 1,072 information sources (134 partitions × 8 endings).
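A minimal sketch of this two-step split follows, assuming the partition identifier (en0000 to en0133) can be read from the ClueWeb document id and that a source is simply named by partition plus ending; the naming scheme is ours, for illustration.

```python
# Sketch of the two-step split for domains with no relevant documents:
# step 1 takes the ClueWeb file partition from the document id,
# step 2 buckets the document by the ending of its host name.
# 134 partitions x 8 buckets gives the 1,072 information sources.
from urllib.parse import urlparse

ENDINGS = (".com", ".gov", ".org", ".edu", ".net", ".us", ".uk")

def source_for_nonrelevant(docno, url):
    partition = docno.split("-")[1]          # e.g. "clueweb09-en0042-13-00123" -> "en0042"
    host = urlparse(url if "://" in url else "http://" + url).netloc.lower()
    for ending in ENDINGS:
        if host.endswith(ending):
            return partition + "_" + ending.lstrip(".")
    return partition + "_rest"               # the eighth ("rest") bucket
```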
For the first group of domains, those containing at least one relevant document, the construction procedure is as follows. First, any domain with more than 300 documents becomes an independent information source. We merge the remaining, smaller domains into bigger collections: all remaining domains that contain fewer than 10 relevant documents each are merged into eight collections according to their endings (smallcom, smallgov, smallorg, smalledu, smallnet, smallus, smalluk, and smallother). The other domains, those containing more than 10 relevant documents, are likewise merged into collections according to their endings; only two collections are created this way, bigcom and bignet.
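The per-domain assignment for this first group can be sketched as below; the 300-document and 10-relevant-document thresholds follow the text, while the function name and the generic big*/small* naming are ours for illustration.

```python
# Sketch of source assignment for domains with at least one relevant document.
# Thresholds follow the text; names are illustrative.
ENDINGS = {"com", "gov", "org", "edu", "net", "us", "uk"}

def source_for_relevant_domain(domain, n_docs, n_relevant):
    ending = domain.rsplit(".", 1)[-1]
    ending = ending if ending in ENDINGS else "other"
    if n_docs > 300:
        return domain                # large domain: its own information source
    if n_relevant < 10:
        return "small" + ending      # e.g. smallcom, smallorg, smallother
    return "big" + ending            # in practice only bigcom and bignet occur
```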
In the end, a total of 2,780 information sources were created, including the 100 Wikipedia sources.
| Testbed | Size (GB) | # of inf. sources | # of documents | min # of docs per source | max # of docs per source | avg # of docs per source |
|---|---|---|---|---|---|---|
| ClueWeb-Wiki | 252 | 100 | 5,957,529 | 4,400 | 434,525 | 59,290.28 |
| ClueWeb-E | 4,500 | 2,780 | 151,161,188 | 48 | 3,417,805 | 54,364.27 |
The zip file contains one or more assignment files. Each file has one line per document, in the following format:
document_id assigned_source
Documents not listed in any assignment file were considered spam and have been excluded.
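For completeness, a small sketch of reading the assignments back, assuming the files inside the zip are plain text with whitespace-separated fields:

```python
# Sketch: load every assignment file in the zip into a docno -> source map.
import zipfile

def load_assignments(zip_path):
    doc_to_source = {}
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            with zf.open(name) as f:
                for raw_line in f:
                    line = raw_line.decode("utf-8").strip()
                    if not line:
                        continue
                    docno, source = line.split()
                    doc_to_source[docno] = source
    return doc_to_source
```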