Are Abstracts Enough for Hypothesis Generation?

Abstract

The potential for automatic hypothesis generation (HG) systems to improve research productivity keeps pace with the growing set of publicly available scientific information. But as data becomes easier to acquire, we must understand the effect different textual data sources have on our resulting hypotheses. Are abstracts enough for HG, or are full-text papers needed? How many papers does an HG system need to make valuable predictions? How sensitive is a general-purpose HG system to hyperparameter values or input quality? What effect do corpus size and document length have on HG results? To answer these questions we train multiple versions of the knowledge network-based HG system Moliere on varying corpora in order to compare challenges and trade-offs in terms of result quality and computational requirements. Moliere generalizes the main principles of similar knowledge network-based HG systems and reinforces them with topic modeling components. The corpora include the abstract and full-text versions of PubMed Central, as well as iterative halves of MEDLINE, which allows us to compare the effect document length and count have on the results. We find that, quantitatively, corpora with a higher median document length produce marginally higher-quality results, yet take substantially longer to process. However, qualitatively, full-length papers introduce a significant number of intruder terms to the resulting topics, which decreases human interpretability. Additionally, we find that the effect of document length is greater than that of document count, even if both sets contain only paper abstracts.

Publication
2018 IEEE International Conference on Big Data

Our goal in this paper is to understand the effect data quality has on our hypothesis generation system, Moliere. We train multiple instances of our system and apply our validation techniques to numerically quantify performance differences.

Our datasets come from MEDLINE and PubMed Central. Preparing them required a fair amount of cleaning and random sampling, so those wishing to replicate our findings are encouraged to start with the pre-trained datasets. These datasets are preconstructed using data that predates 2010. Directories like 2_pow_neg_2 contain $\frac{1}{4}$ of the MEDLINE dataset, sampled randomly (2_pow_neg_0 contains the entire, $\frac{1}{1}$, dataset), while pmc-abstract-network and pmc-fulltext-network contain the abstract-only and full-text PubMed Central datasets, respectively.
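As a quick illustration of the naming convention, each 2_pow_neg_k directory corresponds to the fraction 2^-k of MEDLINE. The short Python sketch below shows this mapping; the exact range of k values shipped is an assumption here.

# Illustrative sketch only: each 2_pow_neg_k directory holds a random sample
# containing 2^-k of the MEDLINE documents (the range of k shown is assumed).
fractions = {"2_pow_neg_{}".format(k): 2.0 ** -k for k in range(4)}
print(fractions)
# {'2_pow_neg_0': 1.0, '2_pow_neg_1': 0.5, '2_pow_neg_2': 0.25, '2_pow_neg_3': 0.125}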

In order to run queries against these pre-trained datasets, supply the -d flag followed by the dataset directory. We re-trained our polynomial combination for each dataset; to use a dataset's polynomial parameters, supply the -y flag followed by a path to its hyper.param file. If this flag is not supplied, we provide a default set of parameters. For instance, if you download the contents of 2_pow_neg_2 to ~/Downloads/my_test, you would run the tobacco / lung cancer query with the following syntax.

./run_query -d ~/Downloads/my_test \
            -y ~/Downloads/my_test/hyper.param \
            tobacco lung_cancer

To repeat our experiments, the files noise-predicate-pairs and published-predicate-pairs contain our negative and positive validation samples, respectively. Each line is of the form ID1|ID2|Year, and negative pairs have been assigned the year 0. A few sample lines are shown below, followed by a short parsing sketch.

C1449699|C1176309|2014
C0018801|C0013363|2014
C1158564|C0329155|2012
C0600467|C0038454|2013
C0031667|C0018790|2011
...
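For instance, a minimal Python sketch for loading these files (file names and the ID1|ID2|Year layout as described above) could look like this:

# Sketch: read the validation pair files described above.
# Each line is ID1|ID2|Year; pairs in noise-predicate-pairs have Year == 0.
def load_pairs(path):
    pairs = []
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue
            id1, id2, year = line.strip().split("|")
            pairs.append((id1, id2, int(year)))
    return pairs

positives = load_pairs("published-predicate-pairs")  # published pairs with a year
negatives = load_pairs("noise-predicate-pairs")      # negative samples, year 0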

When running a query, simply supply ID1 and ID2. Results will appear in a temporary directory named /tmp/ID1---ID2 by default. More information on interpreting these results can be found on our validation page.
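As a rough sketch of scripting this (the dataset path is the placeholder from the example above; only the -d/-y flags and the /tmp/ID1---ID2 output convention come from this page), a query for one validation pair could be driven from Python as follows:

import os
import subprocess

# Placeholder location of a downloaded pre-trained dataset.
DATASET = os.path.expanduser("~/Downloads/my_test")

def run_pair(id1, id2):
    # Invoke run_query with the dataset and its per-dataset parameters,
    # then return the default output directory described above.
    subprocess.run(
        ["./run_query",
         "-d", DATASET,
         "-y", os.path.join(DATASET, "hyper.param"),
         id1, id2],
        check=True,
    )
    return "/tmp/{}---{}".format(id1, id2)

result_dir = run_pair("C1449699", "C1176309")  # first pair from the sample above
print("Results written to", result_dir)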