CS590N Statistical Relational Learning
Spring 2007: Example Top Ten

Exploiting Relational Structure to Understand Publication Patterns in High-Energy Physics
A. McGovern, L. Friedland, M. Hay, B. Gallagher, A. Fast, J. Neville, and D. Jensen

Using Relational Knowledge Discovery to Prevent Securities Fraud
J. Neville, O. Simsek, D. Jensen, J. Komoroske, K. Palmer and H. Goldberg

1. Where's the beef?
The papers focused more on what was learned than on how it was learned. (Chris)
I was surprised that neither of these papers went into much detail about relational probability trees (RPTs). (Josh)

2. How can we get our papers published?
In the case of publication analysis, identifying the attributes that affect publication and ranking them by importance would have provided more insight. This would be of great interest to authors! (Praveen)

3. Limitations of statistical analysis
Journal acceptance is subjective; predictive models built only on objective information may not be correct. (Umang)

4. Future work: Time
How easy or hard is it to determine how much temporal data to consider when there are no domain-enforced limitations, such as the unavailability of arXiv data before 1992 in this case? (Rajesh)
In this setting the relational data changes over time... I think this is the most interesting area for further research. (Duncan)
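
One way to explore the question of how much history to use is to vary the length of the training window and compare results on a fixed test year. The following is a minimal sketch, not the papers' method; the record layout, the function name window_split, and the toy data are all assumptions made for illustration.

```python
def window_split(records, test_year, history_years):
    """Split records into (train, test).

    train: records from the `history_years` years immediately before test_year
    test:  records from test_year itself
    """
    first_year = test_year - history_years
    train = [r for r in records if first_year <= r["year"] < test_year]
    test = [r for r in records if r["year"] == test_year]
    return train, test


# Hypothetical toy data: four records per year, 1992 through 2002.
records = [{"id": i, "year": 1992 + i % 11} for i in range(44)]

for history in (1, 3, 5):
    train, test = window_split(records, test_year=2002, history_years=history)
    # A real study would fit a model on `train` and score it on `test` here.
    print(history, len(train), len(test))
```

Plotting test performance against the window length would show whether older data helps, hurts, or stops mattering.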

5. Future work: Collaborative analysis
Should the task of predicting whether a given paper will be published also make use of the clustering information obtained? For example, there might be certain research topics that receive more recognition, so that a larger percentage of papers on them are published in journals. (Rajesh)

6. Future work: Concept drift
I think an important question is how to choose useful relations for the algorithm. The RPT model does not dominate the base model on the 2001 and 2002 tests. So can I say that the relations used in the RPT model may not fit, or may not be distinguishing, across the whole data? Or that the relations used in the RPT model may fit only the first two datasets because of the way they were explored? (Yang)
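
A simple check for this kind of drift is to score each model separately on every test year rather than on the pooled data. The sketch below assumes predictions are available as (year, true label, predicted label) triples; the function name accuracy_by_year and the toy rows are hypothetical and are not results from either paper.

```python
from collections import defaultdict


def accuracy_by_year(rows):
    """Return {year: accuracy} for rows of (year, true_label, predicted_label)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, truth, predicted in rows:
        totals[year] += 1
        hits[year] += int(truth == predicted)
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Made-up predictions, for illustration only.
toy_rows = [
    (2001, 1, 1),
    (2001, 0, 1),
    (2002, 1, 1),
    (2002, 0, 0),
]
print(accuracy_by_year(toy_rows))
```

Running the same function on two models' predictions and comparing the per-year numbers side by side makes it visible when one model's advantage is confined to the earlier years.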

7. How to define/construct the feature space
Too many attributes may slow down the computation. Too few attributes may leave the model too coarse-grained. Can the authors prove that their way of constructing the attributes is meaningful? (Yang)
Is it common in SRL to test pre-enumerated hypotheses, rather than let an algorithm more freely make conjectures? I ask this because I was surprised to see how much is assumed in the structure of the data. (Josh)
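
For context on how the feature space grows: relational learners commonly turn the variable-sized set of objects linked to an item into fixed-length features by aggregation (for example count, average, maximum, or the proportion above a threshold). The sketch below illustrates that general idea only; the function name aggregate, the choice of aggregates, and the example values are assumptions, not the papers' actual feature set.

```python
from statistics import mean


def aggregate(values, threshold):
    """Summarize a variable-length list of related attribute values.

    Returns a fixed set of features regardless of how many values are linked.
    """
    if not values:
        # No related objects: only the count is defined.
        return {"count": 0, "mean": None, "max": None, "prop_above": None}
    return {
        "count": len(values),
        "mean": mean(values),
        "max": max(values),
        "prop_above": sum(v > threshold for v in values) / len(values),
    }


# Hypothetical example: citation counts of papers linked to one author.
features = aggregate([2, 10, 30], threshold=5)
print(features)
```

Each combination of relation, attribute, aggregate, and threshold yields one candidate feature, which is why the number of attributes can grow quickly and why the choice of which combinations to enumerate matters.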

8. Questionable class label
The use of the surrogate measure is questionable because NASD wants to investigate brokers who are likely to be involved in fraudulent activity, which does not always result in a disclosure filing against that broker. (Duncan)

9. How can relationships provide more information than text?
It is interesting that text-only clustering led to results the authors found unsatisfactory. Intuitively, the text content of papers should provide more insight into what a paper is about than citations do, but that information might be a lot harder to extract from text because of the complex structure of written language. (Jia-Hong)

10. Implementation
Why not use an object-relational database? (Chris)

Initial discussion questions

(1) After reading these papers, can you identify other relational domains/tasks of interest?

(2) What are the primary limitations to applying relational models to these or other domains?