Hypothesis Generation Explained

Undiscovered Public Knowledge

In the last couple of years, researchers worldwide have begun to develop a powerful new tool. By using data mining techniques, these scientists hope to one day put themselves out of a job.

It all began in the 1980s with a man named Don Swanson. He was the first to notice something that he called undiscovered public knowledge. He saw that no human could possibly read all of the available information on a given topic, and he guessed that some truths had already been published without anyone actually knowing them.

Big ideas are often discovered when people from different backgrounds get together on the same team. These ideas crop up because cross-disciplinary collaboration brings in not only a new viewpoint, but also people who carry entirely different sets of knowledge. For example, the idea of DNA storage has recently been popularized by Harvard scientists as a way to keep massive amounts of digital data. This technology relies on the observation that DNA and SSDs are both solutions to the same core problem: information storage.

DNA was originally proposed as a means for data storage in 1964. Up until that point, there had been papers about DNA in medical journals, and there had been papers about hard drives in technical journals. Doctors don't tend to meet many people from IBM at dinner parties, so it was unlikely that many doctors knew how hard drives worked, or that many IBM employees could describe DNA's storage capabilities. But without realizing it, both communities were exploring the same problem. This is an example of Don Swanson's undiscovered public knowledge.

Before 1964, no one was saying DNA had anything to do with hard drives, but there existed an implicit connection: both addressed information storage. If someone had been capable of keeping all technical and medical literature in their head simultaneously, this connection might have seemed trivial. And now, looking back on it, the connection seems pretty clear to us. This is exactly what we attempt to do with hypothesis generation: we try to identify these implicit connections.
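To make the idea of an implicit connection concrete, here is a minimal Python sketch. It assumes two tiny, made-up literatures in which each paper is reduced to a set of concepts. The paper names, concept labels, and helper functions are all hypothetical and exist only for illustration. It is not how any real system works, just the simplest version of the idea.

```python
# Toy illustration only: every paper and concept below is invented.
medical_papers = {
    "paper_m1": {"dna", "heredity", "information storage"},
    "paper_m2": {"dna", "protein synthesis"},
}
technical_papers = {
    "paper_t1": {"hard drive", "magnetic media", "information storage"},
    "paper_t2": {"hard drive", "seek time"},
}


def concepts_linked_to(term, papers):
    """Return every concept that co-occurs with `term` in at least one paper."""
    linked = set()
    for concepts in papers.values():
        if term in concepts:
            linked |= concepts - {term}
    return linked


def implicit_links(term_a, papers_a, term_c, papers_c):
    """Concepts tied to term_a in one literature and to term_c in the other."""
    return concepts_linked_to(term_a, papers_a) & concepts_linked_to(term_c, papers_c)


shared = implicit_links("dna", medical_papers, "hard drive", technical_papers)
print(shared)  # the concept that bridges the two literatures
```

No single paper mentions both "dna" and "hard drive", yet the shared concept falls out as soon as both literatures are examined together.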

MOLIERE: Automatic Biomedical Hypothesis Generation System

Hypothesis generation is becoming a crucial time-saving technique that allows biomedical researchers to quickly discover implicit connections between important concepts. Typically, these systems operate on domain-specific fractions of public medical data. MOLIERE, in contrast, utilizes information from over 24.5 million documents. At the heart of our approach lies a multi-modal and multi-relational network of biomedical objects extracted from several heterogeneous datasets from the National Center for Biotechnology Information (NCBI). These objects include but are not limited to scientific papers, keywords, genes, proteins, diseases, and diagnoses. We model hypotheses using Latent Dirichlet Allocation applied to abstracts found near shortest paths discovered within this network, and demonstrate the effectiveness of MOLIERE by performing hypothesis generation on historical data. Our network, implementation, and resulting data are all publicly available for the broad scientific community.
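The path-finding step described above can be illustrated with a rough Python sketch. This is not the MOLIERE implementation. It assumes a tiny, invented, unweighted network and uses a plain breadth-first search, and every node name, abstract, and function here is hypothetical. The topic-modeling step (Latent Dirichlet Allocation) is left out; the sketch stops at collecting the abstracts that such a model would be run on.

```python
from collections import deque

# Hypothetical toy network mixing papers (p1..p4) with other biomedical objects.
network = {
    "gene_x": ["p1"],
    "p1": ["gene_x", "kw_inflammation", "p2"],
    "kw_inflammation": ["p1", "p3"],
    "p2": ["p1", "p4"],
    "p3": ["kw_inflammation", "disease_y"],
    "p4": ["p2"],
    "disease_y": ["p3"],
}

# Invented placeholder abstracts, keyed by paper node.
abstracts = {
    "p1": "gene x expression rises during inflammation",
    "p2": "follow-up study on gene x regulation",
    "p3": "inflammation markers in patients with disease y",
    "p4": "unrelated methods paper",
}


def shortest_path(graph, start, goal):
    """Breadth-first search on an unweighted graph; returns a node list or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def abstracts_near(graph, path, texts):
    """Collect abstracts for papers on the path or one hop away from it."""
    nearby = set(path)
    for node in path:
        nearby.update(graph.get(node, []))
    return {node: texts[node] for node in sorted(nearby) if node in texts}


path = shortest_path(network, "gene_x", "disease_y")
corpus = abstracts_near(network, path, abstracts)
print(path)
print(list(corpus))
```

In this toy network, p4 sits two hops from the path and is left out of the collected abstracts, which is the point of restricting the text to the neighborhood of the path before fitting any topic model.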

You can find our data here.

You may request a query here.

Learn to Program in Python

As an exercise in teaching, I have started putting together a lecture series aimed at people who have never seen a programming language before. At the time of writing, I have covered primitive data types, input, and basic control flow.