* Information Retrieval and Management

Federated Search (Distributed Information Retrieval): This is my dissertation work under the supervision of my advisor, Prof. Jamie Callan. A large amount of valuable information cannot be crawled and searched by conventional search engines such as Google or AltaVista because of intellectual property protection or frequent information updates. Federated search provides a solution for searching this type of information, which is hidden from conventional search engines. My research addresses the three main research problems within federated search: resource representation, resource selection, and results merging. I have proposed new algorithms for estimating resource sizes, estimating the probability distributions of relevant documents among resources, solving resource selection as a utility maximization problem, and merging the document rankings retrieved by different search engines. Furthermore, a unified utility maximization framework is proposed to combine this range of solutions and construct effective systems for different federated search applications. In particular, the unified model has been used to incorporate search engine retrieval effectiveness into resource selection to better model real-world applications. This research is a significant improvement over the state of the art.
Related Publications: SIGIR 2005, CIKM 2004a, JASIST 2006, LNCS 2004, TOIS 2003, SIGIR 2003, SIGIR workshop 2003, Dg.O 2003, CIKM 2002, SIGIR 2002a
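
To illustrate the results-merging step, here is a minimal sketch of one way to make scores from different search engines comparable. It assumes that, for each resource, a few documents have both a resource-specific score and a score on a common scale, so a linear mapping can be fitted by least squares. The function names and the toy data are hypothetical, and this is not presented as the algorithm from the publications listed above.

```python
def fit_line(pairs):
    """Least-squares fit of y = a * x + b to (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    if var_x == 0:
        # All x values identical: fall back to a constant mapping.
        return 0.0, mean_y
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = cov_xy / var_x
    return a, mean_y - a * mean_x


def merge_results(resource_results, training_pairs):
    """Map each resource's scores to the common scale and merge.

    resource_results: {resource: [(doc_id, resource_score), ...]}
    training_pairs:   {resource: [(resource_score, common_score), ...]}
    """
    merged = []
    for resource, results in resource_results.items():
        a, b = fit_line(training_pairs[resource])
        for doc_id, score in results:
            merged.append((doc_id, a * score + b))
    # Highest comparable score first.
    merged.sort(key=lambda item: item[1], reverse=True)
    return merged


# Hypothetical data: resource A scores range over 0..10, resource B over 0..1.
results = {"A": [("a1", 8.0), ("a2", 2.0)], "B": [("b1", 0.5)]}
training = {"A": [(0.0, 0.0), (10.0, 1.0)], "B": [(0.0, 0.0), (1.0, 1.0)]}
merged_ranking = merge_results(results, training)
```

With these toy inputs, the document from resource B lands between the two documents from resource A, even though its raw score is the smallest of the three.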

FedLemur (FedStats) Project: This project provides a single interface to the Web sites of multiple government agencies. It is a unified federated search solution for government agencies that publish statistical information (e.g., crop forecasts, unemployment statistics) in text form. The algorithms that I developed for federated search were refined and implemented in this real-world application.
Related Publications: JASIST 2006

Cross-Lingual Information Retrieval: We participated in two tasks at the Cross-Language Evaluation Forum 2005: Multilingual Information Retrieval across eight languages, and Results Merging for given bilingual ranked lists from eight languages. Our submissions ranked No. 1 in both the Multilingual Retrieval task and the Results Merging task. We proposed new algorithms that combine the evidence of multiple retrieval algorithms to improve multilingual retrieval accuracy. Empirical results demonstrated the advantage of the new methods.
Related Publications: CLEF 2005
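
As a rough illustration of combining evidence from multiple retrieval algorithms, the sketch below uses CombSUM, a standard score-fusion technique: each run's scores are min-max normalized and then summed per document. This is a generic baseline chosen for illustration, not the method used in the CLEF 2005 submissions; the function names and scores are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale one run's scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # A run with identical scores carries no ranking information.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(runs):
    """Fuse several runs ({doc_id: score}) by summing normalized scores."""
    fused = {}
    for run in runs:
        for doc, score in min_max_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + score
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical output of two retrieval algorithms for the same query.
run_a = {"d1": 10.0, "d2": 5.0, "d3": 0.0}
run_b = {"d2": 2.0, "d3": 1.0, "d4": 0.0}
fused_ranking = comb_sum([run_a, run_b])
```

A document that is ranked reasonably well by both runs (here `d2`) can overtake a document that only one run retrieved.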

Lemur Retrieval Toolkit: I participated in the development of the well-known Lemur toolkit for language modeling and information retrieval. Specifically, I designed and implemented some algorithms for the distributed information retrieval component.
Reference: (http://www.lemurproject.org/)

* Filtering System

• I developed new algorithms for collaborative filtering and content-based filtering. The explosive growth of information demands intelligent agents that can help users find valuable information. Collaborative filtering makes recommendations to a specific user based on the behavior of users with similar tastes, while content-based filtering analyzes the contents of items to make recommendations. The research has mainly focused on graphical models and Bayesian methods to improve the accuracy and efficiency of collaborative filtering systems. I have proposed the Flexible Mixture Model and the Decouple Model for effectively modeling users’ preferences and rating patterns, and an active learning model for efficiently learning users’ interests. Furthermore, a unified approach is proposed that combines collaborative filtering and content-based filtering, using both rating information and content information to achieve better recommendation accuracy.
Related Publications: JIR (in press), ICML 2005, CIKM 2004b, UAI 2004, SIGIR 2004a, SIGIR 2004c, ICML 2003a, CIKM 2003, UAI 2003
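
The basic idea of collaborative filtering described above (recommending from the behavior of users with similar tastes) can be sketched with a classic memory-based predictor. This is a textbook baseline using cosine similarity, shown only for illustration; it is not the Flexible Mixture Model or the Decouple Model, and the names and ratings are hypothetical.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two users' rating dicts ({item: rating})."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += abs(sim)
    # None signals that no similar user has rated the item.
    return num / den if den else None


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"m1": 5, "m2": 1},
    "bob": {"m1": 5, "m2": 1, "m3": 4},
    "carol": {"m1": 1, "m2": 5, "m3": 2},
}
prediction = predict_rating(ratings, "alice", "m3")
```

Because alice's ratings agree with bob's much more than with carol's, the predicted rating for `m3` is pulled toward bob's rating of 4.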

* Text/Data Mining for Life Science

Mapping Pharmaceutical Patents to Biomedical Ontology: Biomedical ontologies such as MeSH, SNOMED, and Gene Ontology have been created and used for analyzing and reasoning about diseases, symptoms, etc. In this work we argue that an ontology representing patents is very valuable, since it represents the intellectual property owned by businesses and is also a means of associating chemical and biological entities that impact human health. We created a tool that allows a user to map patents into the MeSH ontology, find related topics, and discover relationships that would otherwise not be possible to find by looking at patents in isolation. We designed statistical learning methods for the mapping. The performance of the mapping algorithm improves substantially when the structure information of the MeSH and patent ontologies is incorporated.
Related Publications: IBM TR 2005

Biomedical Named-Entity Detection: Biomedical named-entity recognition is a challenging task, as there is still a large accuracy gap between biomedical named-entity recognition and general newswire named-entity recognition. I proposed several Meta biomedical named-entity recognition algorithms, based on probabilistic graphical models, that combine the recognition results of various recognition systems. The proposed algorithms were tested on the GENIA biomedical corpus, where the Meta entity recognition approach improved the F score from 0.72 for the best individual system to 0.95.
Related Publications: BioKDD 2005

TREC Genomic Track: Advanced text information management helps biologists digest the overwhelming amount of biological literature. During my internship at IBM’s Almaden research lab, I designed and developed text classifiers for biomedical triage tasks using Bayesian logistic regression and support vector machine methods. The system ranked No. 2 in two 2005 TREC genomic triage subtasks. I have also collaborated with researchers from York University to develop a biomedical information retrieval system by proposing different query processing methods. The system ranked No. 1 in the 2005 TREC genomic ad hoc retrieval task.
Related Publications: TREC 2005a, TREC 2005b

Combine Biological Text Knowledge with Biological Experimental Data: One main challenge of current bioinformatics research is how to address the data-sparseness problem within biological experimental data. In many biological applications, there are a limited number of data instances with high-dimensional input data. This work proposes to use the knowledge in biological ontologies and the biological text literature to address the data-sparseness problem. In particular, a Laplacian graph kernel has been used to extract gene relationships from Gene Ontology and the PubMed literature. It is combined with biological experimental data within a Bayesian framework to provide a better modeling ability for different biological processes. Our current application focuses on the metabolic flux data model of glucose, lipid, and amino acid metabolism.
Related Publications: In submission

* Machine Learning Techniques

Many of the algorithms in my previous research grew out of my interest in machine learning techniques such as graphical models, kernel methods, active learning, boosting, and metric learning.
Related Publications: PAKDD 2005, JIR in press, ICML 2005, CIKM 2004b, UAI 2004, SIGIR 2004a, ICML 2003a, ICML 2003b, UAI 2003, BioKDD 2005, MMSJ 2006, ACM MM 2004

* Speech and Multimedia Processing

Collaborative Image Retrieval: Collaborative image retrieval improves image retrieval accuracy by using the historical log data of users’ feedback collected by CBIR systems. We proposed a novel metric learning approach, named “regularized metric learning”, which learns a distance metric by exploring the correlation between low-level image features and the log data of user relevance judgments. We formulated the proposed learning algorithm as a semi-definite programming problem. Experiments have shown that the new algorithm substantially improves retrieval accuracy over a baseline system that uses the Euclidean distance metric.
Related Publications: MMSJ 2006
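
To make the notion of a learned distance metric concrete, the sketch below computes the distance sqrt((x - y)^T A (x - y)) for a given positive semi-definite matrix A. With A set to the identity matrix this reduces to the Euclidean baseline. The sketch only evaluates a metric; it does not learn A, and the matrix values and function name are hypothetical rather than taken from the cited work.

```python
from math import sqrt


def metric_distance(x, y, A):
    """Distance sqrt((x - y)^T A (x - y)) for a PSD matrix A (list of rows)."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    # Matrix-vector product A * diff.
    a_diff = [sum(a_ij * d_j for a_ij, d_j in zip(row, diff)) for row in A]
    # Quadratic form diff^T (A * diff).
    return sqrt(sum(d_i * ad_i for d_i, ad_i in zip(diff, a_diff)))


identity = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical learned metric that weights the first feature more heavily.
learned = [[4.0, 0.0], [0.0, 1.0]]

euclidean = metric_distance([0.0, 0.0], [3.0, 4.0], identity)
reweighted = metric_distance([0.0, 0.0], [3.0, 4.0], learned)
```

Changing A changes which feature differences count as large, which is how a learned metric can bring images that users judged relevant to each other closer together than plain Euclidean distance does.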

Automatic Image Annotation: Most current image search engines, such as Google, search images only by the keywords that surround them. This work proposes a statistical method that analyzes the content of images and assigns the best text annotation.
Related Publications: ACM MM 2004

Speaker Identification and Verification: Speech is one of the most easily obtained human biometric characteristics. Detecting a speaker’s identity is one of the most important issues for security and human-computer interfaces. The research develops new algorithms for text-independent speaker identification and text-dependent speaker verification, which result in more effective and efficient speaker identification and verification systems.
Related Publications: ICSLP 2000, ISSPIS 1999, ISSPIS 1999, ICSLP 1998, Signal Processing 1999 (Chinese), NCMT 1999 (Chinese), Ncmmsc 1998 (Chinese), Ncmmsc 1998 (Chinese), NCCIIIA 1997 (Chinese)

Speech Recognition: Speech is one of the most natural ways for human beings to communicate with each other. I developed new algorithms for both small-scale keyword recognition and large-scale continuous speech recognition.

* Natural Language Processing

Web Page Readability: I proposed a new method that uses statistical models to estimate the reading difficulty of Web pages. Language models are used to represent the content typically associated with different readability levels. Reading level classifiers are created as linear combinations of a language model and surface linguistic features. Experiments show that this new method is more accurate than the widely used Flesch-Kincaid readability formula.
Related Publications: CIKM 2001
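
For reference, the Flesch-Kincaid grade level used as the comparison baseline above is a fixed linear formula over surface counts. The sketch below computes it from counts supplied by the caller; how words, sentences, and syllables are counted is left out, and the function name and example counts are hypothetical.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level from surface counts of a text."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical counts for a short passage.
grade = flesch_kincaid_grade(total_words=100, total_sentences=5, total_syllables=150)
```

The formula depends only on sentence length and word length, which is why a model of the vocabulary associated with each reading level can add information that the formula does not capture.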

Named Entity Detection: To design more intelligent search engines that go beyond keyword matching, semantic information such as named entities is important. This work designs an effective solution for named entity detection. Its performance is comparable to that of BBN’s commercial IdentiFinder software.