So for a graduate class in software architecture, we decided to explore different development methodologies and examine how each of the most popular practices facilitates planning.

We delve into this in a paper which can be found here:

]]>

In the last couple of years, researchers worldwide have begun to develop a powerful new tool. By using data mining techniques, these scientists hope to one day put themselves out of a job.

It all began in the 1980s with a man named Don Swanson. He was the first to notice something he called undiscovered public knowledge. He saw that no human could possibly read all of the available information on a given topic, and he guessed that there were some truths that no one actually knows but that have already been published.

Big ideas are often discovered when people from different backgrounds get together on the same team. These ideas crop up because cross-disciplinary collaboration brings in not only a new viewpoint, but people with entirely different sets of mental information. For example, the idea of DNA storage has recently been popularized by Harvard scientists as a way to store massive amounts of digital data. This bioinformatic technology relies on the connection that DNA and SSDs are both solutions to the same core problem: information storage.

DNA was originally proposed as a means of data storage in 1964. Up until that point, there had been papers about DNA in medical journals, and there had been papers about hard drives in technical journals. Doctors don't tend to meet many people from IBM at dinner parties, so it was unlikely that many doctors knew how hard drives worked, or that many IBM employees could describe DNA's storage capabilities. But without realizing it, both communities explored the same problem. This is an example of Don Swanson's undiscovered public knowledge.

Before 1964, no one was saying DNA had anything to do with hard drives, but there existed an implicit connection: both addressed information storage. If someone were capable of keeping all technical and medical literature in their head simultaneously, this connection might seem trivial. And now, looking back on it, the connection seems pretty clear to us. What we attempt to do with hypothesis generation is exactly this: we try to identify these implicit connections.

]]>

Hypothesis generation is becoming a crucial time-saving technique which allows biomedical researchers to quickly discover implicit connections between important concepts. Typically, these systems operate on domain-specific fractions of public medical data. MOLIERE, in contrast, utilizes information from over 24.5 million documents. At the heart of our approach lies a multi-modal and multi-relational network of biomedical objects extracted from several heterogeneous datasets from the National Center for Biotechnology Information (NCBI). These objects include but are not limited to scientific papers, keywords, genes, proteins, diseases, and diagnoses. We model hypotheses using Latent Dirichlet Allocation applied to abstracts found near shortest paths discovered within this network, and demonstrate the effectiveness of MOLIERE by performing hypothesis generation on historical data. Our network, implementation, and resulting data are all publicly available for the broad scientific community.

You can find our data here.

You may request a query here.

]]>As an exercise in teaching, I have started putting together a lecture series aimed at teaching programming to people who have never seen a programming language before. At the time of writing, I have covered primitive data types, input, and basic control flow.

Recently, I received a Best Poster Award for the following work. I am currently fleshing it out into a publishable paper.

]]>

The following depicts many Catalan problem transformations. Personally, I find it very hard to follow written descriptions of these transformations, so I instead hope to describe these conversions with an image.

So we have just talked about implicit data structures which a programmer can define to store trees and other constructs, but it turns out that some basic principles of mathematics can be used as implicit data structures themselves. A great example of this is the Catalan numbers, and in this section we will explore some of the many (many many many) different problems which can be converted to and from this special set of integers.

The Catalan Numbers follow a simple rule. The *n*'th Catalan number is equal to (2n choose n) / (n + 1).

[shamelessly copied from Wikipedia]
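To make the sequence concrete, here is a small sketch (the function name is my own, not from any standard library beyond `math.comb`) that computes Catalan numbers directly from the closed form:

```python
from math import comb  # binomial coefficient, available since Python 3.8

def catalan(n):
    # Closed form: C_n = (2n choose n) / (n + 1); the division is always exact,
    # so integer division is safe here.
    return comb(2 * n, n) // (n + 1)

print([catalan(n) for n in range(8)])  # → [1, 1, 2, 5, 14, 42, 132, 429]
```

Each of the problems discussed below is counted by exactly this sequence, which is what makes the transformations between them possible.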

]]>Possibly the most irregular data structure we will cover is the PQ tree. Firstly, the PQ tree was inspired by a baby's mobile (in explaining this data structure to friends, I learned that many readers may be unfamiliar with typical American baby-raising practices, so a mobile is pictured below). Secondly, this model gives us a very efficient way to represent subsets of permutations, which can greatly help us solve combinatorial problems.

So when we looked at implicit structures we talked about heaps. With a heap, we had a simple array that held our binary tree. In many situations, this is the most efficient way to store a binary tree. But what can we do if we have trees of arbitrary shape? Of course, we can always link nodes together with pointers and throw it all on the heap, but what if we want a more elegant solution?
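As a refresher, the heap trick can be sketched in a few lines (the helper names here are illustrative, not from any library): a complete binary tree lives in a flat array, and every parent/child relationship is recovered by arithmetic on indices, with no pointers stored anywhere.

```python
# Node at index i has children at 2*i + 1 and 2*i + 2, and parent at (i - 1) // 2.
heap = [3, 7, 9, 12, 8, 10, 14]  # a complete binary tree in a flat array

def children(i):
    # Return only the child indices that actually exist in the array.
    return [j for j in (2 * i + 1, 2 * i + 2) if j < len(heap)]

def parent(i):
    return (i - 1) // 2 if i > 0 else None  # the root has no parent

print(children(0))  # → [1, 2]
print(parent(5))    # → 2
```

The catch, of course, is that this only works because the tree is complete; a tree of arbitrary shape would leave gaps in the array, which is exactly the problem the following methods address.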

For the following methods, consider the following tree:

In a previous section we outlined the distribution center problem, explaining that the end goal is to find a node in a tree which minimizes the time needed to send goods to all other nodes. In that previous section we explained how the concept of priority queues is important to solving our problem, but careful readers would likely not be satisfied; many typical priority queue operations are worse than linear. So in this section we will hopefully satisfy that curiosity.

**A reminder of the overall algorithm:**

We will work by starting at the leaves and processing inward until we find our center. We assign all leaves a value of zero, and parent nodes will have a label dependent on the values and count of their children.
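As a hedged sketch of that labeling rule (the function and example tree below are mine, under the assumption that a node can start serving only one child per timestep, which is one common variant of this problem): sort the children's labels in decreasing order, and the i'th child served contributes its own label plus the i timesteps it spent waiting its turn.

```python
def label(tree, node):
    """tree maps each node to its list of children; returns the node's label."""
    kids = tree.get(node, [])
    if not kids:
        return 0  # leaves are assigned zero
    # Serve the most expensive subtrees first; the i'th child starts i steps late.
    child_labels = sorted((label(tree, k) for k in kids), reverse=True)
    return max(c + i for i, c in enumerate(child_labels, start=1))

# Hypothetical example tree: A -> {B, C}, B -> {D, E}
tree = {"A": ["B", "C"], "B": ["D", "E"]}
print(label(tree, "A"))  # → 3
```

The greedy ordering is the key design choice: delaying a cheap subtree costs nothing extra, while delaying an expensive one pushes out the finish time for the whole tree.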

All too often, when dealing with data structures, we are concerned with complicated pointer structures and many levels of indirection. For example, when we discussed priority queues, we focused heavily on the linking structure of binomial and Fibonacci trees. Now, pointers are important and often make for fast structures, but sometimes it is important to have simpler designs ready for times when pointers get in the way.

Storing data to disk, transmitting data between machines, or simply reducing the brain power needed to follow your code: these are all times when it's worth switching to implicit data structures.

]]>One of the most important algorithms of the modern era is graph distribution. At some level, the reason companies like UPS and Amazon are successful boils down to how well they can solve distribution problems. Sure, the problem and algorithm we are about to discuss are a much simpler formulation than the ones those companies work with, but the basic principles are the same.

The graph distribution problem is simple. Given a graph (in our case, a tree), at which node should you place your distribution center? And how long would it take to distribute something to the entire network? (Note: our algorithm extends easily to graphs if you first find a minimum spanning tree.)

Well for our purposes, let’s assume each node in our tree can do one of two things each timestep. It can either receive a package from one of its neighbors, or it can send packages to a group of its neighbors. Imagine each node is a factory. Each day it can either receive a large shipment of goods, or it can send out a fleet of trucks.

The following animation shows this process with a randomly selected node as the distribution center. Notice how it takes seven timesteps to reach the whole tree.

The Model View Controller paradigm is one of the most fundamental programming patterns out there. It is at the heart of practically every website and every iPhone app (and the good Android ones too!). Understanding MVC is critical to writing maintainable software, and even though new acronyms have come out since (MVP, MVVM, HMVC), all of them try to capture an ideal defined by MVC.

In this section we will first discuss what programming patterns are, then we will delve into a brief description of each component in MVC. Throughout this description we will be talking about how MVC can be used in sample applications to make life easier for programmers, and the experience better for end users.
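As a minimal sketch of where the seams fall (the class names and the toy counter app are hypothetical, just to illustrate the separation), in Python:

```python
class CounterModel:
    """Model: owns the application state and the rules for changing it."""
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1

class CounterView:
    """View: only knows how to present state, never how to change it."""
    def render(self, count):
        return f"Count: {count}"

class CounterController:
    """Controller: turns user input into model updates, then refreshes the view."""
    def __init__(self, model, view):
        self.model, self.view = model, view

    def handle_click(self):
        self.model.increment()
        return self.view.render(self.model.count)

app = CounterController(CounterModel(), CounterView())
print(app.handle_click())  # → Count: 1
```

The payoff is in the seams: you can swap `CounterView` for a GUI or web template without touching the model, and test `CounterModel` without rendering anything at all.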

[Raw images from cs.princeton.edu/~wayne/teaching/fibonacci-heap.pdf]

In the last section we discussed binomial queues and how they used a linked list of binomial trees to keep an efficient priority queue. While the binomial queue worked hard to give us an amortized constant-time insertion operator, it sacrificed performance in other areas, such as its O(log n) find-min operator. In this section we will explore another data structure which keeps a priority queue as a linked list of trees, making its own trade-offs in performance. Namely, the Fibonacci heap maintains a more complicated set of trees than its binomial counterpart, but in exchange for the maintenance overhead, Fibonacci heaps provide many amortized constant-time operations. Because of this carefully constructed internal organization, the Fibonacci heap is one of the best-performing heap-like data structures existing today.

Often when designing data structures, trade-offs are made in order to make certain operations more computationally efficient at the cost of other, hopefully less used, operations. The binomial queue is no exception.

Typically, five different operations are important with priority queues. These are find-min, delete-min, insert, decrease-key, and merge. If we look at a min heap, a standard data structure wherein the first element is the smallest in the collection, we see mediocre performance on most operations.

**Heap Performance:**

- find-min - O(1)
- delete-min - O(log n)
- insert - O(log n)
- decrease-key - O(log n)
- merge - O(n)

The binomial queue improves upon insertion and merge at the expense of find-min. The rationale is that many applications need to store more data than is actually queried. In other words, there are likely going to be more insertions than find-min calls. Binomial queues achieve this trade-off by replacing the heap's internal binary tree with a circular list of binomial trees, each enforcing the min-heap property that root nodes are smaller than all descendant nodes. More on this shortly.

**Binomial Queue Performance:**

- find-min - O(log n)
- delete-min - O(log n)
- insert - O(1) // amortized
- decrease-key - O(log n)
- merge - O(log n)

A binomial queue is an irregular priority queue which takes advantage of binary trees to quickly order its elements. In short, it's a queue of trees where each tree enforces a heap structure. I'll be writing a more in-depth article on binomial queues shortly. Until then, I recommend THIS LINK, which has a wonderful visualization from the University of San Francisco.

The following image shows a typical binomial queue with fifteen elements. Notice how the size of each tree follows the pattern 2^n.

Internally, this structure is held together with a series of pointers. The following image depicts that link structure. Notice how siblings are linked together, each having a pointer to their parent, but the parent only has a pointer to its first child.

Wouldn't it be great if we didn't need to worry about all those links? Well with a little bit of work we can represent this whole structure implicitly!

]]>Our next topic on amortized analysis is a parallel to the Banker's Paradigm. This new approach, the physicist's paradigm, is a little more mathematically rigorous, but ultimately should produce the same answer (or same type of answer) as its money-grubbing counterpart. Although both techniques should produce the same result, it is worth understanding both, because oftentimes problems are more easily understood when phrased as one or the other.

So what is it?

Think back to your last physics class, remember gravity? If you took any introduction to physics you likely learned about gravitational potential energy, which is the amount of energy stored in an object's distance from the Earth. If you take a one-kilogram ball and raise it one meter off the ground, you have bestowed it with 9.8 joules of energy. Releasing the ball also releases that energy, which can be seen when the ball falls to the ground. Now, if you remember this, you might also remember that it doesn't matter what path the ball takes when you lift it up one meter. All that matters for the calculation of gravitational potential energy is the initial and final position.

How do we use it?

So what does this have to do with algorithms? Well, let's look at each step of an algorithm through the lens of energy functions. We are going to use this short function to determine how much work is done by our algorithm at each step: a_i = t_i + Phi_i - Phi_(i-1). Here, a_i is the amortized cost required to get us from step i-1 to step i, while t_i is the actual cost of doing step i. Phi is a potential energy value for the data structure, and we look at the change in potential between the previous step and the current one. Now, if you are reading this and have no clue how to calculate the potential energy of a data structure, well, you are in the right place.
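Here is a classic worked example (the binary counter, a standard illustration rather than something from this post). Define Phi as the number of 1 bits in the counter. A single increment can flip arbitrarily many bits, but each 1-to-0 flip also lowers the potential by one, so the amortized cost per increment works out to a constant:

```python
def increment(bits):
    """Increment a little-endian bit list in place; return the actual cost t_i."""
    cost = 0
    i = 0
    while i < len(bits) and bits[i] == 1:
        bits[i] = 0            # each 1 -> 0 flip costs one unit
        cost += 1
        i += 1
    if i == len(bits):
        bits.append(1)         # the counter grows by one bit
    else:
        bits[i] = 1
    return cost + 1            # plus the single 0 -> 1 flip

bits = [0]
phi_prev = sum(bits)           # Phi = number of 1 bits in the counter
for _ in range(1000):
    t = increment(bits)
    phi = sum(bits)
    assert t + phi - phi_prev == 2  # a_i = t_i + Phi_i - Phi_(i-1) is constant
    phi_prev = phi
```

An expensive step (flipping k trailing ones) is exactly paid for by the drop in potential it causes, which is the whole trick: the path to the current state doesn't matter, only the potential at the start and end.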

Find all 10-digit numbers such that, for all n = 1, 2, 3, ..., 9, 10, the number formed by taking the leftmost n digits is divisible by n.

In layman's terms, we want to find all 10-digit numbers where:

- The first digit must be divisible by 1.
- The leftmost two digits must be divisible by 2.
- ...
- The leftmost nine digits must be divisible by 9.
- The leftmost ten digits (the whole number) must be divisible by 10.

Well, a program which finds these numbers needs to explore the whole "problem space" (all possible 10-digit numbers), so the easiest way to do this is a loop. We can start at 0000000000 and count up to 9999999999. For each number, we check whether it meets all of these requirements, and if so, we store it.

```python
def slowCheckDigit(num):
    # Check the full number against 10, then drop the last digit and
    # check the 9-digit prefix against 9, and so on down to 1.
    for j in range(10, 0, -1):
        if num % j != 0:
            return False
        num //= 10
    return True

results = []
for i in range(0, 10 ** 10):  # warning: ten billion iterations
    if slowCheckDigit(i):
        results.append(i)
print(results)
```

Looking at this code, we see that we will be making 10^11 comparisons. This is because we will be looping through 10^10 numbers, and each call to slowCheckDigit requires 10 checks (one for each digit). This is ridiculously bad. I attempted to run it on my desktop and got bored before it finished (a great metric for algorithm complexity). We also check a lot of values we should know better than checking. For example, for the first two digits to be divisible by two, we can pretty easily determine that the second digit should be a 2, 4, 6, 8, or 0. Unfortunately, the naive solution not only checks 1100000000, but also checks the hundred million other values whose first two digits are both 1. This is really inefficient.

The inspiration behind the advanced solution is to build partial solutions to the problem, and only consider digits that fit with our partial solutions. For example, we would start with a first digit (say 1), and then try all possible second digits for our current "partial solution" of 1XXXXXXXXX (I am using X's to represent unknown values). We would then find a valid second digit (say 2), and repeat the process to find a valid third digit for the partial solution 12XXXXXXXX.
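A sketch of that backtracking idea in Python (the function name is mine, and I'm building the prefix as an integer rather than with X's):

```python
def find_polydivisible(prefix=0, length=0):
    """Extend a valid `length`-digit prefix one digit at a time."""
    if length == 10:
        return [prefix]          # a complete, valid 10-digit solution
    results = []
    first = 1 if length == 0 else 0   # a 10-digit number can't start with 0
    for digit in range(first, 10):
        candidate = prefix * 10 + digit
        # Keep this digit only if the new prefix is divisible by its length.
        if candidate % (length + 1) == 0:
            results.extend(find_polydivisible(candidate, length + 1))
    return results

answers = find_polydivisible()
print(len(answers))
```

Because every invalid prefix is abandoned immediately, this only ever touches prefixes that could still lead to a solution, a tiny fraction of the ten billion candidates the naive loop grinds through.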

In the last post, we learned the absolute basics of amortized analysis. To make a long story short (or more specifically, to split a long story up between a series of smaller incremental posts), amortized analysis is a way to look at an algorithm's typical cost rather than its strict worst-case run time.

So how does one determine this *typical* cost in an actually meaningful way?

It turns out there are two main methods, the first of which, the Banker's Paradigm, we will discuss here.
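As a quick preview of the flavor of argument (using the doubling dynamic array, a standard example that isn't specific to this post): charge every append three credits. One credit pays for the write itself, and the spare two sit in the bank to pay for copying elements during future resizes. A small simulation confirms the bank never runs dry:

```python
def total_append_cost(n):
    """Actual total work for n appends to an array that doubles when full."""
    cost, size, capacity = 0, 0, 1
    for _ in range(n):
        if size == capacity:
            cost += size      # resize: copy every existing element over
            capacity *= 2
        cost += 1             # the append itself
        size += 1
    return cost

for n in (1, 10, 1000, 10 ** 5):
    assert total_append_cost(n) <= 3 * n  # three credits per append suffice
print(total_append_cost(1000))  # → 2023
```

The occasional expensive resize is paid for entirely by credits banked during the cheap appends that preceded it, which is exactly the accounting trick the Banker's Paradigm formalizes.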

]]>This is the first of a series of blog posts I intend to make regarding data structures. Hopefully these will be informative for anyone who has stumbled onto this site, but more importantly, these posts will form the basis for a textbook I am writing for my graduate data structures class. So without further ado, let me present my first topic.

**Amortized Analysis**

Definition: Amortized – “to pay money that is owed for something (such as a mortgage) by making regular payments over a long period of time” ~ Merriam Webster

Which sounds worse? Paying $30 now, or paying $1 a day for the rest of the month? Although they are the same amount of money, many people would rather part with their cash in smaller increments spread over a long period of time. Similarly, when making a large purchase, for example a $2000 laptop, it is common to justify the expense by saying "but I will use that laptop for the next five years!" Such a statement is unconscious amortization.

]]>

Last week I installed the most recent Windows Insider Preview Build (instructions linked) in order to try out a new feature for Windows 10. Bash on Ubuntu on Windows is a native Ubuntu installation which lives alongside Windows 10, albeit sandboxed in its own file system. More details can be found here, but the long story short is that it works as advertised. After getting the Windows Insider update, bash can be found in the start menu. Clicking on the familiar orange circle opens a black terminal similar to the standard Ubuntu window. From there the user has access to commands like ls, find, top, and most importantly, apt-get. I took a bit to customize my setup and play with bash, and so far I have been really impressed, despite the growing fear that Microsoft plans to embrace, extend, and extinguish Ubuntu. Unfortunately, though, their choice of name is clunky and contains 100% more prepositions than necessary. (For some reason, this fits the Microsoft ethos in my mind...)