MOLIERE: Automatic Biomedical Hypothesis Generation System

Hypothesis generation is becoming a crucial time-saving technique which allows biomedical researchers to quickly discover implicit connections between important concepts. Typically, these systems operate on domain-specific fractions of public medical data. MOLIERE, in contrast, utilizes information from over 24.5 million documents. At the heart of our approach lies a multi-modal and multi-relational network of biomedical objects extracted from several heterogeneous datasets from the National Center for Biotechnology Information (NCBI). These objects include but are not limited to scientific papers, keywords, genes, proteins, diseases, and diagnoses. We model hypotheses using Latent Dirichlet Allocation applied on abstracts found near shortest paths discovered within this network, and demonstrate the effectiveness of MOLIERE by performing hypothesis generation on historical data. Our network, implementation, and resulting data are all publicly available for the broad scientific community.

You can find our data here.

You may request a query here.


Lean to Program in Python

As an exercise in teaching, I have started putting together a lecture series aimed at teaching programming to people who have never seen a programming language before. At the time of writing, I have covered primitive data types, input, and basic control flows.


Additional Catalan Transformations

The following depicts many Catalan problem transformations. Personally, I find it very hard to follow written descriptions of these transformations, so instead hope to describe these conversions with an image.

First we see a binary tree, then we see how a binary tree can be converted to a set of parentheses. That parenthesis string can then be converted to a ballot sequence, or a mountain range with equal up and down strokes, or a lattice path, or a stair step tiling. The long story short, all of these structures are equivalent, and they use Catalan numbers as an almost "implicit" data structure.


Catalan Numbers

So we have just talked about implicit data structures which a programmer can define to store trees and other constructs, but it turns our that some basic principles of mathematics can be used as implicit data structures themselves. A great example of this is the Catalan numbers, and in this section we will explore some of the many (many many many) different problems which can be converted to and from this special set of integers.

The Catalan Numbers follow a simple rule. The n'th Catalan number is equal to...

C_nfrac 1n12n choose nfrac 2nn1nprod limits _k2nfrac nkkqquad mbox for ngeq 0

[shamelessly copied from Wikipedia]



Possibly the most irregular data structure we will cover is the PQ Tree. Firstly, the PQ tree was inspired by a baby’s mobile (in explaining this data structure to friends, I learned that many readers may be unfamiliar with typical American baby raising practices, pictured below is a mobile). Secondly, this model gives us a very efficient way to represent subsets of permutations, which can greatly help us solve combinatorial problems.


Various Implicit Tree Structures

So when we looked at implicit structures we talked about heaps. With a heap, we had a simple array that held our binary tree. In many situations, this is the most efficient way to store a binary tree. But what can we do if we have trees of arbitrary shape? Of course, we can always make nodes linked with pointers and throw that all in the heap, but what if we want a more elegant solution?

For the following methods, consider the following tree:


How to Solve the Distribution Center Problem in Linear Time

In a previous section we outlined what the distribution center problem is, explaining that the end goal is to find a node in a tree which minimizes the time needed to send goods to all other nodes. In that previous section we explained how the concept of priority queues is important to solving our problem, but careful readers would likely not be satisfied; many typical priority queue operations are worse than linear. So in this section we will hopefully satisfy that curiosity. 

A reminder of the overall algorithm:

We will work by starting at the leaves and processing inward until we find our center. We assign all leaves a value of zero, and parent nodes will have a label dependent on the values and count of their children.


Implicit Data Structures

All too often, when dealing with data structures, we are concerned about complicated pointer structures and many levels of indirection. For example, when we discussed priority queues, we focused heavily on the linking structure of binomial and Fibonacci trees. Now pointers are important and often make for fast structures, but sometimes it is important to have simpler designs ready for times when pointers get in the way.

Storing data to disk, transmitting data between machines, or simply reducing the brain power needed to follow your code; these are all times when its worth switching to implicit data structures.


Tree Distribution Center

One of the most important algorithms of the modern era is graph distribution. At some level, the reason companies like UPS and Amazon are successful boil down to how well they can solve distribution problems. Sure, the problem and algorithm we are about to discuss is a much simpler formulation than companies work with, but the basic principles are the same.

The graph distribution problem is simple. Given a graph (in our case, a tree), which node should you place your distribution center? And how long would it take to distribute something to the entire network? (Note: our algorithm can extend easily to graphs if you first find a minimum spanning tree).

Well for our purposes, let’s assume each node in our tree can do one of two things each timestep. It can either receive a package from one of its neighbors, or it can send packages to a group of its neighbors. Imagine each node is a factory. Each day it can either receive a large shipment of goods, or it can send out a fleet of trucks.

The following animation shows this process with a randomly selected node as the distribution center. Notice how it takes seven timestamps to reach the whole tree.