Traditional search engines like Google provide access to online information that can be copied into a centralized information source by crawling through hyperlinks. However, these traditional search engines typically ignore a large amount of valuable information that is “hidden” behind the search engines of many online text information sources. Federated text search provides one-stop access to the hidden information via a single interface that connects to multiple search engines of text information sources.
Most current solutions for federated text search work as a pipeline: they learn the content topics of the available information sources (the resource representation component), select and search the information sources relevant to a user query (the resource selection component), and return a merged ranked list of the retrieved documents (the results merging component). This search and retrieval solution is based solely on the content relevance of text information sources to the current user query. However, ignoring other important information about users and information sources, such as personal information needs or search response time, substantially limits the usability of existing federated text search systems. For example, an existing system may select a relevant text information source for a user query even though that source has a very slow search engine, or an ineffective search engine that does not return many relevant documents. Furthermore, the current pipeline architecture restricts the interactions among the components of federated text search systems. For example, many resource selection solutions do not use the results that the results merging component generated for similar past queries when they process the current query.
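The three-stage pipeline described above can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the design of any existing system: the function names (`represent`, `select`, `merge`), the term-count representation, and the merging of raw source scores are all hypothetical simplifications.

```python
from collections import Counter


def represent(sample_docs):
    """Resource representation: term counts over documents sampled from a source."""
    counts = Counter()
    for doc in sample_docs:
        counts.update(doc.lower().split())
    return counts


def select(query, representations, k=1):
    """Resource selection: rank sources by the share of their terms that match the query.

    This uses content relevance only, which is the limitation discussed above.
    """
    terms = query.lower().split()

    def score(rep):
        total = sum(rep.values()) or 1  # avoid division by zero for empty samples
        return sum(rep[t] for t in terms) / total

    ranked = sorted(representations, key=lambda name: score(representations[name]), reverse=True)
    return ranked[:k]


def merge(result_lists):
    """Results merging: pool (doc_id, score) pairs and sort by score, highest first.

    Assumes scores from different sources are comparable, which rarely holds in practice.
    """
    pooled = [item for results in result_lists for item in results]
    return sorted(pooled, key=lambda item: item[1], reverse=True)
```

Note that each stage only passes its output forward; nothing produced by `merge` flows back into `represent` or `select`, which is the restriction on component interaction noted above.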
Intellectual Merit: The proposed research seeks to overcome the above limitations by proposing a new Integrated and Utility-Centric Framework for federated text search. The proposed framework differs from existing federated text search systems in two major aspects: (1) It exploits a broader set of evidence than content relevance alone to maximize usability; (2) It eliminates the pipeline architecture by incorporating a new results analysis component into an integrated system that adapts to the search results of past queries. The proposed research thrusts include: (1) Multiple Type Resource Representation: model important information about text information sources, such as search response time and search engine effectiveness; (2) Utility-Centric Resource Selection: satisfy a user’s search criteria by considering multiple types of evidence, such as content relevance, search results from past queries, personal information needs, and search response time; (3) Effective and Efficient Results Merging: produce accurate merged ranked results at little cost of acquiring the content of the documents returned by the selected information sources; (4) System Adaptation by Results Analysis: analyze the search results from past queries for more accurate resource representation, resource selection, and results merging; (5) System Development and Evaluation: build and test algorithms within research environments. A new FedLemur system will be built and evaluated for the FedStats Web portal. The proposed research will be based on the PI’s extensive prior experience in research and in building prototype systems. Collectively, the proposed research will significantly advance the state of the art in federated text search.
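As a rough illustration of how utility-centric resource selection might combine several types of evidence, the sketch below scores each source by a weighted sum of content relevance, search engine effectiveness, and response speed. The linear form, the weights, the `max_wait` cutoff, and all names (`SourceEvidence`, `utility`, `select_sources`) are assumptions made for this example; the proposal does not specify a scoring function.

```python
from dataclasses import dataclass


@dataclass
class SourceEvidence:
    name: str
    relevance: float         # estimated content relevance to the query, in [0, 1]
    effectiveness: float     # estimated share of returned documents that are relevant, in [0, 1]
    response_seconds: float  # typical search response time of the source


def utility(src, max_wait=5.0, weights=(0.5, 0.3, 0.2)):
    """Hypothetical utility: weighted sum of relevance, effectiveness, and speed."""
    w_rel, w_eff, w_speed = weights
    # Map response time to [0, 1]; sources slower than max_wait get no speed credit.
    speed = max(0.0, 1.0 - src.response_seconds / max_wait)
    return w_rel * src.relevance + w_eff * src.effectiveness + w_speed * speed


def select_sources(sources, k=2, **kwargs):
    """Return the k sources with the highest utility, best first."""
    return sorted(sources, key=lambda s: utility(s, **kwargs), reverse=True)[:k]


if __name__ == "__main__":
    slow_but_relevant = SourceEvidence("A", relevance=0.9, effectiveness=0.2, response_seconds=10.0)
    fast_and_effective = SourceEvidence("B", relevance=0.7, effectiveness=0.8, response_seconds=1.0)
    best = select_sources([slow_but_relevant, fast_and_effective], k=1)
    print(best[0].name)
```

With these illustrative weights, the faster and more effective source is preferred even though the other source has higher content relevance, which is the kind of trade-off a relevance-only selector cannot express.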
Broader Impacts: The proposed research will serve as an important bridge for moving federated text search research into practical applications. It will also have broad impacts on other important applications such as unstructured peer-to-peer search. The developed algorithms will be incorporated into the open source Lemur toolkit. A publicly accessible virtual experimental research environment will be built to foster evaluation and advancement in federated text search. The education plan in this career proposal will yield many benefits. The PI will design customized teaching materials for information retrieval courses that attract students from multiple disciplines such as computer science, biology, chemistry, and statistics. The PI will build a Web portal of online information retrieval courses for learners with different needs and backgrounds. The PI will encourage the involvement of a broad set of students in the research project, with special attention to underrepresented students, via existing educational programs at Purdue University.