Large-Scale Validation of Hypothesis Generation Systems via Candidate Ranking

Abstract

The first step of many research projects is to define and rank a short list of candidates for study. Given the rapid pace of modern scientific progress, some turn to automated hypothesis generation (HG) systems to aid this process. These systems can identify implicit or overlooked connections within a large scientific corpus, and while their importance grows alongside the pace of science, they lack thorough validation. Without any standard numerical evaluation method, many validate general-purpose HG systems by rediscovering a handful of historical findings, and some wishing to be more thorough may run laboratory experiments based on automatic suggestions. These methods are expensive, time-consuming, and cannot scale. Thus, we present a numerical evaluation framework for validating HG systems that leverages thousands of validation hypotheses. This method evaluates an HG system by its ability to rank hypotheses by plausibility, a process reminiscent of human candidate selection. Because HG systems, specifically those that produce topic models, do not produce a ranking criterion, we additionally present novel metrics to quantify the plausibility of hypotheses given topic model system output. Finally, we demonstrate that our proposed validation method aligns with real-world research goals by deploying our method within Moliere, our recent topic-driven HG system, in order to automatically generate a set of candidate genes related to HIV-associated neurodegenerative disease (HAND). By performing laboratory experiments based on this candidate set, we discover a new connection between HAND and Dead Box RNA Helicase 3 (DDX3).

Publication
2018 IEEE International Conference on Big Data

We explore and implement a number of validation methods in this work. To run these yourself, first set up the Moliere project from our GitHub page. Fair warning: you will need some pretty powerful compute resources. Feel free to reach out or file a GitHub issue for help on that front.

Because the Moliere project represents a significant portion of the lead author's thesis work, this codebase is subject to change. The exact code referenced in this paper can be found in the 18.10 release. The network referenced in this paper can be found here.

Within that repo is a run_query.py file that handles the entire query process. For instance, assuming the system is set up properly, this simple command will evaluate the connection between tobacco and lung cancer. Add -v for progress messages, and expect this to take a few minutes. By default, this will attempt to find approximately 10k papers relevant to tobacco and lung cancer from our network, and will apply PLDA to find 20 topics from those papers.

./run_query.py tobacco lung_cancer

Output, by default, is written to a new directory created in /tmp for each query (in this case /tmp/tobacco---lung_cancer). Within this directory will be a result file (/tmp/tobacco---lung_cancer/tobacco---lung_cancer.20.eval) that looks something like this (spacing added for readability online):

TopicNetCCoef     0.483871
TopicWalkBtwn     98.0533
TopicCorr         0.84985
BestTopicPerWord  0.205387
  Topic_17 smoker smoke tobacco report assess	
  Topic_0 lung_cancer exposur worker studi cigarett_smoke	
  Topic_4 tobacco smokeless_tobacco form india smoke	
  Topic_10 tobacco treatment intervent quit tobacco_cessat	
  Topic_11 tobacco tobacco_control tobacco_industri countri global
BestCentrL2       0.0452344
  Topic_4 tobacco smokeless_tobacco form india smoke	
  Topic_5 lung_cancer copd screen diseas surviv	
  Topic_0 lung_cancer exposur worker studi cigarett_smoke	
  Topic_12 tobacco cigarett nicotin extract sampl	
  Topic_16 lung_cancer includ lung_canc well mirna
L2                6.47644
PolyMultiple      -0.756896

Note that the absolute value of each metric is far less important than the relative values between different hypotheses.

The output above lists the quality of the tobacco to lung cancer connection with respect to our various validation metrics. Per-topic metrics also include their top 5 topics (and the top 5 stemmed words of each topic). The other files in this output directory are intended for debugging, re-runs, and later visualization.
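If you want to work with these results programmatically, a small parser is usually enough. The following is a minimal sketch, not part of the Moliere codebase: the helper names `eval_path` and `parse_eval` are hypothetical, and the parser assumes the file follows the layout shown above (one `name value` line per metric, with `Topic_N word ...` lines listed under per-topic metrics).

```python
from pathlib import Path


def eval_path(source, target, num_topics=20, root="/tmp"):
    """Build the default result path described above for one query.

    Hypothetical helper: it only mirrors the naming pattern from the example.
    """
    name = f"{source}---{target}"
    return Path(root) / name / f"{name}.{num_topics}.eval"


def parse_eval(text):
    """Parse the text of a .eval file into (scores, topics).

    scores maps metric name -> float.
    topics maps metric name -> list of (topic_id, [stemmed words]),
    and only contains entries for per-topic metrics.
    """
    scores, topics = {}, {}
    current = None  # most recently seen metric name
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0].startswith("Topic_"):
            # Topic line: attach it to the metric listed just before it.
            if current is not None:
                topics.setdefault(current, []).append((parts[0], parts[1:]))
        else:
            # Metric line: "<name> <value>"
            current = parts[0]
            scores[current] = float(parts[1])
    return scores, topics
```

For the tobacco example, `parse_eval(eval_path("tobacco", "lung_cancer").read_text())` would give one dictionary of metric values and one of per-topic details, assuming the query has already been run with default settings.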

The validation set we use can be found in this other repo. The wholeSet directory contains the most useful files for repeating experiments. Each file is in the form ID1|ID2|Year:

C1449699|C1176309|2014
C0018801|C0013363|2014
C1158564|C0329155|2012
C0600467|C0038454|2013
C0031667|C0018790|2011
...

published.pairs.txt contains actual connections that first occur within Medline after 2010. noise.pairs.txt contains randomly sampled connections that never occur in Medline, but come from the same set of IDs. highlyCited.pairs.txt is a subset of the published pairs, but includes only those that come from papers with over 100 citations.
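Reading these files takes only a few lines. The sketch below assumes every non-empty line has exactly three `|`-separated fields as described above; the `Pair` type and `parse_pairs` function are hypothetical names used for illustration and do not come from the validation repo.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    source: str  # first ID in the line
    target: str  # second ID in the line
    year: int    # third field


def parse_pairs(lines):
    """Parse 'ID1|ID2|Year' lines into Pair records, skipping blank lines."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        source, target, year = line.split("|")
        pairs.append(Pair(source, target, int(year)))
    return pairs
```

Because an open file iterates over its lines, `parse_pairs(open(path))` works directly on any of the pair files.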

Because of performance limitations, we subset these files for this paper. These subsets can be found in the evaluatedSubset directory.

The actual numbers from our evaluation can be found in the evaluatedResults directory.
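Since the evaluation is framed as ranking published pairs above noise pairs, one common way to summarize a metric is the area under the ROC curve: the probability that a randomly chosen published pair scores higher than a randomly chosen noise pair. The sketch below illustrates that idea and is not necessarily the exact computation behind the numbers in evaluatedResults. The function name `roc_auc` is hypothetical, and it assumes higher scores mean more plausible, so negate the scores first for any metric that runs the other way.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve by direct pairwise comparison.

    scores: one metric value per hypothesis.
    labels: 1 for a published pair, 0 for a noise pair.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of hypotheses, which is fine for a quick check on a subset; a sort-based implementation would be the better choice for the full validation set.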