• Federated Search (Distributed Information Retrieval): This
is my dissertation work, supervised by Prof. Jamie Callan.
A large amount of valuable information cannot be crawled and searched by
conventional search engines such as Google or AltaVista, because of
intellectual property protection or frequent content updates. Federated
search provides a solution for searching this information, which remains
hidden from conventional search engines. My research addresses the three
main problems in federated search: resource representation, resource
selection, and results merging. New algorithms have been proposed for
estimating resource sizes, estimating the distribution of relevant documents
across resources, solving resource selection as a utility maximization problem,
and merging the document rankings returned by different search engines.
Furthermore, a unified utility maximization framework is proposed to combine
these solutions into effective systems for different federated search
applications. In particular, the unified model incorporates search engine
retrieval effectiveness into resource selection to better model real-world
applications. This research is a significant improvement over the state of
the art. A simplified resource-selection sketch follows the publication list below.
Related Publications: SIGIR 2005, CIKM 2004a,
JASIST 2006, LNCS 2004, TOIS 2003, SIGIR 2003, SIGIR workshop 2003, Dg.O
2003, CIKM 2002, SIGIR 2002a
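The following is a minimal Python sketch of the resource-selection step only, under simplified assumptions: it takes hypothetical, already-estimated counts of relevant documents per resource and greedily picks the resources that maximize the expected number of relevant documents within a fixed selection budget. It illustrates the utility-maximization idea, not the dissertation's actual algorithms.

    # Toy resource selection: maximize expected relevant documents under a
    # fixed budget of resources to search. All numbers are hypothetical.
    def select_resources(estimated_relevant, budget):
        """Pick the `budget` resources with the largest estimated payoff."""
        ranked = sorted(estimated_relevant.items(), key=lambda kv: kv[1], reverse=True)
        return [name for name, _ in ranked[:budget]]

    # Hypothetical per-resource estimates of relevant documents for one query.
    estimates = {"engine_A": 12.4, "engine_B": 3.1, "engine_C": 7.8, "engine_D": 0.6}
    print(select_resources(estimates, budget=2))   # ['engine_A', 'engine_C']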
• FedLemur (FedStats) Project: This project provides a single
interface to multiple Web sites of government agencies. It is a unified
federated search solution for different government agencies that publish
statistical information (e.g., crop forecasts, unemployment statistics)
in text form. The algorithms that I developed for federated search were
refined and implemented in this real-world application.
Related Publications: JASIST 2006
• Cross-Lingual Information Retrieval: We participated in two
tasks at the Cross-Language Evaluation Forum (CLEF) 2005: Multilingual
Information Retrieval across eight languages, and Results Merging of given
bilingual ranked lists from eight languages. Our submissions ranked first in
both the Multilingual Retrieval task and the Results Merging task. We proposed
new algorithms that combine evidence from multiple retrieval algorithms to
improve multilingual retrieval accuracy, and empirical results demonstrated
their advantage. A simplified merging sketch follows the publication reference below.
Related Publications: CLEF 2005
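As a rough illustration of score-based results merging (not the CLEF submission itself), the sketch below min-max normalizes the scores of each bilingual ranked list and then merges all lists by normalized score; the run names and scores are invented.

    # Toy results merging: normalize each run's scores to [0, 1], then merge.
    def normalize(run):
        scores = [score for _, score in run]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0
        return [(doc, (score - lo) / span) for doc, score in run]

    def merge(runs):
        pooled = [entry for run in runs for entry in normalize(run)]
        return sorted(pooled, key=lambda entry: entry[1], reverse=True)

    # Invented bilingual runs with incomparable raw score ranges.
    english_run = [("doc_en_1", 14.2), ("doc_en_2", 9.8)]
    french_run  = [("doc_fr_1", 0.92), ("doc_fr_2", 0.31)]
    print(merge([english_run, french_run]))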
• Lemur Retrieval Toolkit: I participated in the development
of the well-known Lemur toolkit for language modeling and information retrieval.
Specifically, I designed and implemented some algorithms for the distributed
information retrieval component.
Reference: (http://www.lemurproject.org/)
* Filtering System
• I developed new algorithms for collaborative filtering and content-based
filtering. The explosive growth of information demands intelligent agents
that can help users find valuable information. Collaborative filtering makes
recommendations to a user based on the behavior of users with similar tastes,
while content-based filtering analyzes the content of items to make
recommendations. My research has mainly focused on graphical models and
Bayesian methods to improve the accuracy and efficiency of collaborative
filtering systems. I have proposed the Flexible Mixture Model and the
Decoupled Model for effectively modeling users’ preferences and rating
patterns, and an active learning model for efficiently learning users’
interests. Furthermore, I proposed a unified approach that combines
collaborative filtering and content-based filtering, using both rating and
content information to achieve better recommendation accuracy. A toy
collaborative filtering sketch follows the publication list below.
Related Publications: JIR (in press), ICML
2005, CIKM 2004b, UAI 2004, SIGIR 2004a, SIGIR 2004c, ICML 2003a, CIKM
2003, UAI 2003
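To illustrate the “similar taste” idea in its simplest form, here is a toy user-based collaborative filtering sketch with invented ratings; it is only a baseline-style illustration, not the Flexible Mixture Model or the other models described above.

    # Toy user-based collaborative filtering: predict a rating from the
    # ratings of users with similar taste. All ratings are invented.
    import math

    ratings = {
        "alice": {"m1": 5, "m2": 3, "m3": 4},
        "bob":   {"m1": 4, "m2": 2, "m3": 5, "m4": 4},
        "carol": {"m2": 5, "m4": 1},
    }

    def cosine(u, v):
        shared = set(u) & set(v)
        if not shared:
            return 0.0
        dot = sum(u[i] * v[i] for i in shared)
        norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
        return dot / norm

    def predict(user, item):
        num = den = 0.0
        for other, their_ratings in ratings.items():
            if other != user and item in their_ratings:
                w = cosine(ratings[user], their_ratings)
                num += w * their_ratings[item]
                den += abs(w)
        return num / den if den else None

    print(predict("alice", "m4"))   # weighted by how similar each rater is to alice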
* Text/Data Mining for Life Science
• Mapping Pharmaceutical Patents to Biomedical Ontology: Biomedical
ontologies such as MeSH, SNOMED, and the Gene Ontology have been created and
used for analyzing and reasoning about diseases, symptoms, and related
concepts. In this work we argue that an ontology over patents is also
valuable, since patents represent the intellectual property owned by
businesses and provide a means of associating chemical and biological
entities that affect human health. We created a tool that allows a user to
map patents into the MeSH ontology, find related topics, and discover
relationships that would not be apparent if patents were examined in
isolation. We designed statistical learning methods for the mapping, and
showed that the mapping algorithm’s performance improves substantially when
the structural information of the MeSH and patent ontologies is incorporated.
Related Publications: IBM TR 2005
• Biomedical Named-Entity Detection: Biomedical named-entity
recognition is a challenging task; there is still a large accuracy gap
between biomedical named-entity recognition and general newswire named-entity
recognition. I proposed several meta recognition algorithms based on
probabilistic graphical models that combine the outputs of multiple
recognition systems. On the GENIA biomedical corpus, the proposed algorithms
improved the F score from 0.72 for the best individual system to 0.95 for the
meta entity recognition approach. A simplified combination sketch follows the
publication reference below.
Related Publications: BioKDD 2005
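As a much-simplified stand-in for the meta recognition idea, the sketch below combines the BIO tag sequences produced by several hypothetical systems with a per-token majority vote; the actual work uses probabilistic graphical models rather than voting.

    # Toy meta entity recognition: per-token majority vote over the BIO tags
    # produced by several systems. Tags and systems are invented.
    from collections import Counter

    system_outputs = [
        ["B-protein", "I-protein", "O", "B-dna"],   # system 1
        ["B-protein", "O",         "O", "B-dna"],   # system 2
        ["B-protein", "I-protein", "O", "O"],       # system 3
    ]

    def majority_vote(outputs):
        merged = []
        for token_tags in zip(*outputs):
            tag, _ = Counter(token_tags).most_common(1)[0]
            merged.append(tag)
        return merged

    print(majority_vote(system_outputs))   # ['B-protein', 'I-protein', 'O', 'B-dna']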
• TREC Genomics Track: Advanced text information management helps
biologists digest the overwhelming amount of biological literature. During my
internship at IBM’s Almaden research lab, I designed and developed text
classifiers for biomedical triage tasks using Bayesian logistic regression
and support vector machine methods. The system ranked second in two 2005 TREC
Genomics triage subtasks. I also collaborated with researchers from York
University to develop a biomedical information retrieval system by proposing
different query processing methods; that system ranked first in the 2005 TREC
Genomics ad hoc retrieval task. A simplified triage-classifier sketch follows
the publication list below.
Related Publications: TREC 2005a, TREC 2005b
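The sketch below shows a triage-style text classifier in the same spirit, written with scikit-learn as an assumed library (it is not the system built at IBM Almaden). L2-regularized logistic regression corresponds to a Gaussian prior on the weights, which is the usual reading of “Bayesian logistic regression”; LinearSVC would give the SVM variant.

    # Sketch of a document triage classifier: TF-IDF features plus a
    # regularized logistic regression. Documents and labels are invented.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    docs = ["mouse gene expression study ...", "clinical trial of a new drug ..."]
    labels = [1, 0]   # hypothetical: 1 = route to curators, 0 = skip

    triage = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(C=1.0)),   # swap in LinearSVC() for the SVM variant
    ])
    triage.fit(docs, labels)
    print(triage.predict(["knockout mouse phenotype report"]))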
• Combining Biological Text Knowledge with Biological Experimental
Data: One main challenge of current bioinformatics research is the
data-sparseness problem in biological experimental data: many biological
applications have only a limited number of data instances, each with
high-dimensional inputs. This work proposes to use knowledge from biological
ontologies and the biological text literature to address the data-sparseness
problem. In particular, a Laplacian graph kernel is used to extract gene
relationships from the Gene Ontology and PubMed literature, and the kernel is
combined with biological experimental data within a Bayesian framework to
better model different biological processes. Our current application focuses
on modeling metabolic flux data for glucose, lipid, and amino acid
metabolism. A minimal graph-kernel sketch follows the publication note below.
Related Publications: In submission
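A minimal sketch of a regularized-Laplacian graph kernel over a tiny, invented gene-relation graph is shown below; in the work described above, the graph would instead be derived from the Gene Ontology and PubMed, and the kernel would feed into the Bayesian model.

    # Regularized Laplacian kernel K = (I + sigma^2 * L)^(-1) over a small
    # invented gene graph; K[i, j] acts as a relatedness score for genes i, j.
    import numpy as np

    A = np.array([[0, 1, 1, 0],      # hypothetical adjacency among four genes
                  [1, 0, 0, 0],
                  [1, 0, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

    D = np.diag(A.sum(axis=1))       # degree matrix
    L = D - A                        # graph Laplacian
    sigma2 = 0.5                     # regularization strength (assumed value)
    K = np.linalg.inv(np.eye(len(A)) + sigma2 * L)

    print(np.round(K, 3))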
* Machine Learning Techniques:
Many of the algorithms in my research are built on machine learning
techniques such as graphical models, kernel methods, active learning,
boosting, and metric learning.
Related Publications: PAKDD 2005, JIR in press,
ICML 2005, CIKM 2004b, UAI 2004, SIGIR 2004a, ICML 2003a, ICML 2003b, UAI
2003, BioKDD 2005, MMSJ 2006, ACM MM 2004
* Speech and Multimedia Processing
• Collaborative Image Retrieval: Collaborative image retrieval
improves image retrieval accuracy by utilizing the log data of users’ feedback
collected by content-based image retrieval (CBIR) systems over time. We
proposed a novel metric learning approach, named “regularized metric
learning”, which learns a distance metric by exploiting the correlation
between low-level image features and the logged user relevance judgments. We
formulated the learning algorithm as a semi-definite programming problem, and
experiments showed that the new algorithm substantially improves retrieval
accuracy over a baseline system that uses the Euclidean distance metric. A
simplified formulation sketch follows the publication reference below.
Related Publications: MMSJ 2006
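Below is a small, hedged sketch of the semidefinite-programming formulation, written with cvxpy as an assumed solver interface (not the original formulation or data): a PSD Mahalanobis matrix is learned so that relevant image pairs from the log data become close and irrelevant pairs become far, with a Frobenius-norm regularizer.

    # Toy regularized metric learning as an SDP. Pairs are randomly generated
    # stand-ins for logged relevance judgments; cvxpy is an assumed dependency.
    import cvxpy as cp
    import numpy as np

    d = 3
    rng = np.random.default_rng(0)
    relevant   = [(rng.normal(size=d), rng.normal(size=d)) for _ in range(5)]
    irrelevant = [(rng.normal(size=d), rng.normal(size=d)) for _ in range(5)]

    M = cp.Variable((d, d), PSD=True)        # Mahalanobis matrix to learn

    def dist(x, y):
        delta = x - y                        # constant vector, so this is affine in M
        return delta @ M @ delta

    loss = (sum(dist(x, y) for x, y in relevant)
            - sum(dist(x, y) for x, y in irrelevant))
    objective = cp.Minimize(loss + 1.0 * cp.norm(M, "fro"))
    problem = cp.Problem(objective, [cp.trace(M) == d])   # keep the metric from collapsing
    problem.solve()
    print(np.round(M.value, 3))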
• Automatic Image Annotation: Most current image search engines,
such as Google, only search images by the keywords surrounding them. This
work proposes statistical methods to analyze the content of images and assign
the most appropriate text annotations.
Related Publications: ACM MM 2004
• Speaker Identification and Verification: Speech is one of the
most easily obtained human biometric characteristics, and detecting a
speaker’s identity is important for both security and human-computer
interfaces. This research develops new algorithms for text-independent
speaker identification and text-dependent speaker verification, resulting in
more effective and efficient speaker identification and verification systems.
Related Publications: ICSLP 2000, ISSPIS 1999,
ISSPIS 1999, ICSLP 1998, Signal Processing 1999 (Chinese), NCMT 1999 (Chinese),
NCMMSC 1998 (Chinese), NCMMSC 1998 (Chinese), NCCIIIA 1997 (Chinese)
• Speech Recognition: Speech is one of the most natural ways for
human beings to communicate with each other. I developed new algorithms
for both small-scale keyword recognition and large-scale continuous speech
recognition.
* Natural Language Processing:
• Web Page Readability: I proposed a new method that uses statistical
models to estimate the reading difficulty of Web pages. Language models
represent the content typically associated with different readability levels,
and reading-level classifiers are created as linear combinations of a
language model and surface linguistic features. Experiments show that this
method is more accurate than the widely used Flesch-Kincaid readability
formula. A minimal classifier sketch follows the publication reference below.
Related Publications: CIKM 2001
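A minimal sketch of the language-model idea, with invented training snippets, is shown below: one smoothed unigram model is built per grade band and a page is assigned to the band whose model gives it the highest log-likelihood. The actual method also mixes in surface linguistic features, which this sketch omits.

    # Toy readability classifier: smoothed unigram language model per grade
    # band; classify a page by which model likes it most. Text is invented.
    import math
    from collections import Counter

    training = {
        "grades_1_3":  "the cat sat on the mat the dog ran",
        "grades_9_12": "the committee evaluated the macroeconomic implications thoroughly",
    }

    def unigram_model(text, alpha=1.0):
        counts = Counter(text.split())
        total = sum(counts.values())
        vocab = len(counts) + 1
        return lambda w: (counts[w] + alpha) / (total + alpha * vocab)

    models = {level: unigram_model(text) for level, text in training.items()}

    def classify(page_text):
        scores = {level: sum(math.log(model(w)) for w in page_text.split())
                  for level, model in models.items()}
        return max(scores, key=scores.get)

    print(classify("the dog sat on the mat"))   # -> 'grades_1_3'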
• Named-Entity Detection: To design more intelligent search engines
that go beyond keyword matching, semantic information such as named entities
is important. This work designs an effective solution for named-entity
detection whose performance is comparable to BBN’s commercial IdentiFinder
software.