tag:sybrandt.com,2013:/posts Justin Sybrandt 2017-07-11T05:45:50Z tag:sybrandt.com,2013:Post/1170062 2017-07-03T18:47:51Z 2017-07-11T05:45:50Z MOLIERE KDD Promo video ]]> Justin Sybrandt tag:sybrandt.com,2013:Post/1151325 2017-05-03T15:02:58Z 2017-05-03T17:15:37Z To Agile, or not to Agile: A Comparison of Software Development Methodologies

So for a graduate class in software architecture, we decided to explore different development methodologies and look at how each of the most popular practices facilitate planning.

We delve into this in a paper which can be found here:

To Agile, or not to Agile: A Comparison of Software Development Methodologies

Justin Sybrandt
tag:sybrandt.com,2013:Post/1144474 2017-04-06T15:15:25Z 2017-04-06T15:15:25Z MOLIERE Poster ]]> Justin Sybrandt tag:sybrandt.com,2013:Post/1142923 2017-03-31T16:41:02Z 2017-03-31T16:43:20Z MOLIERE Slides
MOLIERE: Automatic Biomedical Hypothesis Generation System from Justin Sybrandt
Justin Sybrandt
tag:sybrandt.com,2013:Post/1133704 2017-02-25T22:14:03Z 2017-03-06T19:42:31Z Hypothesis Generation Explained

Undiscovered Public Knowledge

In the last couple of years, researchers worldwide have begun to develop a powerful new tool. By using data mining techniques, these scientists hope to one day put themselves out of a job.

It all began in the 80's with a man named Don Swanson. He was the first to notice something that he called undiscovered public knowledge. He saw that no human could possibly read all of the available information on a given topic, and he guessed that there were some truths that no one actually knows, but have already been published. 

Big ideas are often discovered when people from different backgrounds get together on the same team. These ideas crop up because cross-disciplinary collaboration brings in not only a new viewpoint, but people who entirely different sets of mental information. For example, the idea of DNA storage has been recently popularized by Harvard scientists as a way to keep massive amounts of digital data. This bioinformatic technology relies on the connection that DNA and SSD's are both solutions to the same core problem: information storage.

DNA was originally proposed as a means for data storage in 1964. Up until that point, there had been papers about DNA in medical journals, and there had been papers about hard drives in technical journals. Typically, doctors don't tend to meet many people from IBM at dinner parties, so it was not likely that many doctors knew how hard drives worked, or IBM employees who could describe DNA's storage capabilities. But without realizing it, both communities explored the same problem. This is an example of Don Swanson's undiscovered public knowledge.

Before 1964, no one was saying DNA had anything to do with hard drives, but there existed an implicit connection: both addressed information storage. If someone was capable of keeping all technical and medical literature in their head simultaneously, this connection might seem trivial. And now, looking back on it, the connection seems pretty clear to us. What we attempt to do with hypothesis generation is exactly this; we try to identify these implicit connections.

Justin Sybrandt
tag:sybrandt.com,2013:Post/1133421 2017-02-22T17:16:29Z 2017-02-23T01:39:59Z MOLIERE: Automatic Biomedical Hypothesis Generation System

Hypothesis generation is becoming a crucial time-saving technique which allows biomedical researchers to quickly discover implicit connections between important concepts. Typically, these systems operate on domain-specific fractions of public medical data. MOLIERE, in contrast, utilizes information from over 24.5 million documents. At the heart of our approach lies a multi-modal and multi-relational network of biomedical objects extracted from several heterogeneous datasets from the National Center for Biotechnology Information (NCBI). These objects include but are not limited to scientific papers, keywords, genes, proteins, diseases, and diagnoses. We model hypotheses using Latent Dirichlet Allocation applied on abstracts found near shortest paths discovered within this network, and demonstrate the effectiveness of MOLIERE by performing hypothesis generation on historical data. Our network, implementation, and resulting data are all publicly available for the broad scientific community.

You can find our data here.

You may request a query here.

Justin Sybrandt
tag:sybrandt.com,2013:Post/1128207 2017-02-02T14:49:20Z 2017-02-02T14:49:21Z Lean to Program in Python
As an exercise in teaching, I have started putting together a lecture series aimed at teaching programming to people who have never seen a programming language before. At the time of writing, I have covered primitive data types, input, and basic control flows.

Justin Sybrandt
tag:sybrandt.com,2013:Post/1117757 2016-12-23T17:37:04Z 2016-12-23T17:37:04Z Automatic Hypothesis Generation

Recently, I received a Best Poster Award for the following work. I am currently fleshing my work into a publishable paper.

Justin Sybrandt
tag:sybrandt.com,2013:Post/1112373 2016-12-02T01:19:52Z 2017-02-23T01:32:08Z Additional Catalan Transformations

The following depicts many Catalan problem transformations. Personally, I find it very hard to follow written descriptions of these transformations, so instead hope to describe these conversions with an image.

First we see a binary tree, then we see how a binary tree can be converted to a set of parentheses. That parenthesis string can then be converted to a ballot sequence, or a mountain range with equal up and down strokes, or a lattice path, or a stair step tiling. The long story short, all of these structures are equivalent, and they use Catalan numbers as an almost "implicit" data structure.

Justin Sybrandt
tag:sybrandt.com,2013:Post/1112228 2016-12-01T19:50:24Z 2017-02-23T01:33:20Z Catalan Numbers

So we have just talked about implicit data structures which a programmer can define to store trees and other constructs, but it turns our that some basic principles of mathematics can be used as implicit data structures themselves. A great example of this is the Catalan numbers, and in this section we will explore some of the many (many many many) different problems which can be converted to and from this special set of integers.

The Catalan Numbers follow a simple rule. The n'th Catalan number is equal to...

C_nfrac 1n12n choose nfrac 2nn1nprod limits _k2nfrac nkkqquad mbox for ngeq 0

[shamelessly copied from Wikipedia]

Justin Sybrandt
tag:sybrandt.com,2013:Post/1112017 2016-11-30T21:25:31Z 2017-02-23T01:32:21Z PQ-Trees

Possibly the most irregular data structure we will cover is the PQ Tree. Firstly, the PQ tree was inspired by a baby’s mobile (in explaining this data structure to friends, I learned that many readers may be unfamiliar with typical American baby raising practices, pictured below is a mobile). Secondly, this model gives us a very efficient way to represent subsets of permutations, which can greatly help us solve combinatorial problems.

Justin Sybrandt
tag:sybrandt.com,2013:Post/1112006 2016-11-30T20:01:03Z 2017-02-23T01:32:40Z Various Implicit Tree Structures

So when we looked at implicit structures we talked about heaps. With a heap, we had a simple array that held our binary tree. In many situations, this is the most efficient way to store a binary tree. But what can we do if we have trees of arbitrary shape? Of course, we can always make nodes linked with pointers and throw that all in the heap, but what if we want a more elegant solution?

For the following methods, consider the following tree:

Justin Sybrandt
tag:sybrandt.com,2013:Post/1111990 2016-11-30T18:51:42Z 2017-02-23T01:31:58Z How to Solve the Distribution Center Problem in Linear Time

In a previous section we outlined what the distribution center problem is, explaining that the end goal is to find a node in a tree which minimizes the time needed to send goods to all other nodes. In that previous section we explained how the concept of priority queues is important to solving our problem, but careful readers would likely not be satisfied; many typical priority queue operations are worse than linear. So in this section we will hopefully satisfy that curiosity. 

A reminder of the overall algorithm:

We will work by starting at the leaves and processing inward until we find our center. We assign all leaves a value of zero, and parent nodes will have a label dependent on the values and count of their children.

Justin Sybrandt
tag:sybrandt.com,2013:Post/1111926 2016-11-30T14:16:25Z 2017-02-23T01:33:41Z Implicit Data Structures

All too often, when dealing with data structures, we are concerned about complicated pointer structures and many levels of indirection. For example, when we discussed priority queues, we focused heavily on the linking structure of binomial and Fibonacci trees. Now pointers are important and often make for fast structures, but sometimes it is important to have simpler designs ready for times when pointers get in the way.

Storing data to disk, transmitting data between machines, or simply reducing the brain power needed to follow your code; these are all times when its worth switching to implicit data structures.

Justin Sybrandt
tag:sybrandt.com,2013:Post/1108354 2016-11-15T14:08:15Z 2017-02-23T01:34:16Z Tree Distribution Center

One of the most important algorithms of the modern era is graph distribution. At some level, the reason companies like UPS and Amazon are successful boil down to how well they can solve distribution problems. Sure, the problem and algorithm we are about to discuss is a much simpler formulation than companies work with, but the basic principles are the same.

The graph distribution problem is simple. Given a graph (in our case, a tree), which node should you place your distribution center? And how long would it take to distribute something to the entire network? (Note: our algorithm can extend easily to graphs if you first find a minimum spanning tree).

Well for our purposes, let’s assume each node in our tree can do one of two things each timestep. It can either receive a package from one of its neighbors, or it can send packages to a group of its neighbors. Imagine each node is a factory. Each day it can either receive a large shipment of goods, or it can send out a fleet of trucks.

The following animation shows this process with a randomly selected node as the distribution center. Notice how it takes seven timestamps to reach the whole tree.

Justin Sybrandt
tag:sybrandt.com,2013:Post/1107994 2016-11-14T14:36:03Z 2017-02-23T01:40:29Z Model View Controller

The Model View Controller Paradigm is one of the most fundamental  programming patterns out there. It is at the heart of practically every website and every iPhone app (and the good android ones too!). Understanding MVC is critical to writing maintainable software, and even though new acronyms have come out since (MVP, MVVM, HMVC) all of them try to capture an ideal defined by MVC.

In this section we will first discuss what programming patterns are, then we will delve into a brief description of each component in MVC. Throughout this description we will be talking about how MVC can be used in sample applications to make life easier for programmers, and the experience better for end users.

Justin Sybrandt
tag:sybrandt.com,2013:Post/1107850 2016-11-13T23:11:05Z 2017-02-23T01:40:59Z Fibonacci Heap

[Raw images from cs.princeton.edu/~wayne/teaching/fibonacci-heap.pdf]

In the last section we discussed Binomial Queues and how they used a linked list of binomial trees to keep an efficient priority queue. While the binomial queue worked hard to give us an amortized constant time insertion operator, it sacrificed performance in other areas such as its O(log n) find-min operator. In this section we will explore another data structure which attempts to keep a priority queue as a linked list of trees, making its own trade-offs in performance. Mainly, the Fibonacci heap maintains a more complicated set of trees than its binomial counterpart, but in exchange for the maintenance overhead, Fibonacci heaps provide many amortized constant time operations. Because of this carefully constructed internal organization, the Fibonacci heap is one of the best performing heap-like data structures existing today.

Justin Sybrandt
tag:sybrandt.com,2013:Post/1107645 2016-11-12T23:12:15Z 2016-11-13T23:12:18Z Binomial Queues

Often when designing data structures, trade offs are made in order to make certain operations more computationally efficient at the cost of other, hopefully less used, operations. The binomial queue is no exception.

Typically, five different operations are important with priority queues. These are find-min, delete-min, insert, decrease-key, and merge. If we look at a min heap, a standard data structure wherein the first element is the smallest in the collection, we see mediocre performance on most operations.

Heap Performance:

  • find-min - O(1)
  • delete-min O(log n)
  • insert O(log n)
  • decrease-key O(log n)
  • merge O(n)

The binomial queue improves upon insertion and merge at the expense of find-min. The rational is that many applications need to store more data than is actually queried. In order words, there are likely going to be more insertions than find-min calls. Binomial queues achieve this trade off by replacing the heap’s internal binary tree with a circular list of binomial trees, each enforcing the min-heap property that root nodes are smaller than all decedent nodes.  More on this shortly.

Binomial Queue Performance:

  • find-min - O(log n
  • delete-min O(log n)
  • insert O(1) //amortized
  • decrease-key O(log n)
  • merge O(log n)
Justin Sybrandt
tag:sybrandt.com,2013:Post/1102334 2016-10-26T13:13:52Z 2016-10-26T13:14:21Z Implicit Data Strucutres

A binomial queue is an irregular priority queue which takes advantage of binary trees to quickly order its elements. In short, its a queue of trees where each tree enforces a heap structure. I'll be writing a more in-depth article on Binomial Queues shortly. Until then, I recommend THIS LINK which has a wonderful visualization from the University of San Francisco. 

The following image shows a typical binomial queue with fifteen elements. Notice how to size of each tree follows the pattern 2 ^ n. 

Internally, this structure is held together with a series of pointers. The following image depicts that link structure. Notice how siblings are linked together, each having a pointer to their parent, but the parent only has a pointer to it's first child.

Wouldn't it be great if we didn't need to worry about all those links? Well with a little bit of work we can represent this whole structure implicitly! 

Justin Sybrandt
tag:sybrandt.com,2013:Post/1093946 2016-09-27T12:59:37Z 2016-09-27T13:00:24Z The Physicist's Paradigm

Our next topic on amortized analysis is a parallel to the Banker’s Paradigm. This new approach, the physicist’s paradigm, is a little more mathematically rigorous, but ultimately should produce the same answer (or same type of answer) as its money-grubbing counterpart. Although both techniques should produce the same result, it is worth understanding both because often times problems are more easily understood when phrased as one or the other.

So what is it?

Think back to your last physics class, remember gravity? If you took any introduction to physics you likely learned about gravitational potential energy, which is the amount of energy stored in an objects distance from the earth. If you take a one kilogram ball and raise it one meter off the ground, you have bestowed it with 9.8 joules of energy. Releasing the ball also releases that energy, which can be seen when the ball falls to the ground. Now, if you remember this, you might also remember that it doesn’t matter what path the ball takes when you lift it up one meter. All that matters for the calculation of gravitational potential energy is the initial and final position.

How do we use it?

So what does this have to do with algorithms? Well, lets look at each step of an algorithm through the lens of energy functions. We are going to use this short function to determine how much work is done by our algorithm at each step. Here, a_i is the amortized cost required to get us from step i-1 to step i, while t is the actual cost of doing step i. Phi is a potential energy value for the data structure, and we look at the change in potential between the previous step and the current one. Now, if you are reading this and have no clue how to calculate the potential energy of a data structure, well, you are in the right place.

Justin Sybrandt
tag:sybrandt.com,2013:Post/1090098 2016-09-14T16:52:51Z 2016-09-14T16:58:01Z Digits that Divide

Problem Description

Find all 10 digit numbers who for all n=1,2,3,...,9,10 the number formed by taking the leftmost n digits is divisible by n.

In layman's terms, we want to find all 10 digit numbers where:

-The first digit must be divisible by 1. -The leftmost two digits must be divisible by 2. -... -The leftmost nine digits must be divisible by 9. -The leftmost ten digits (the whole number) must be divisible by 10.

The Naive Solution

Well, a program which finds these numbers needs to explore the whole "problem space" (all possible 10 digit numbers) so the easiest way to do this is a loop. We can start at 0000000000 and count up to 9999999999. Each number we go to, we can check if it meets all of these requirements and if so, we can store it.


results = []
for i in range(0,10 ** 10):
  if slowCheckDigit(i):

def slowCheckDigit(num):
  goodNum = True
  for j in range(10,0,-1):
    if num % j == 0:
      num //=10
      goodNum = False
  return goodNum


Looking at this code, we see that we will be making 10^10^10 comparisons. This is because we will be looping through 10^10 numbers, and each call to slowCheckDigit requires 10 checks (one for each digit). This is ridiculously bad. I attempted to run it on my desktop and got bored before it finished (a great metric for algorithm complexity). We also check a lot of values we should know better than checking. For example, for the first two digits to be divisible by two, we can pretty easily determine that the second digit should be a 2,4,6,8, or 0. Unfortunately the naive solution not only checks 1100000000 but it checks the 99999999 values which start with 1's in the first two digits. This is really inefficient.

The Advanced Solution

The inspiration behind the advanced solution is to build partial solutions to the problem, and only consider digits that fit with our partial solutions. For example, we would start with a first digit (say 1), and then try all possible second digits for our current "partial solution" of 1XXXXXXXXX (I am using X's to represent unknown values). We would then find a valid second digit (say 2), and repeat the process to find a valid third digit for the partial solution 12XXXXXXXX.

Justin Sybrandt
tag:sybrandt.com,2013:Post/1088428 2016-09-09T19:33:36Z 2016-09-09T19:33:39Z Amortized Analysis - Bankers Paradigm

In the last post, we learned the absolute basics of Amortized Analysis. To make a long story short (or more specifically, to split a long story up between a series of smaller incremental posts) amortized analysis is a way to look at algorithm's typical cost rather than its strict worst-case run time.

So how does one determine this typical cost in an actually meaningful way?

Turns out there are two main methods, the first of which, the Banker's Paradigm, we will discuss here.

Justin Sybrandt
tag:sybrandt.com,2013:Post/1088077 2016-09-08T20:00:39Z 2016-09-08T20:00:39Z Intro to Amortized Analysis

This is the first of series of blog posts I intend to make regarding Data Structures. Hopefully these will be informative for anyone who has stumbled to this site, but more importantly, these posts will form the basis for a textbook I am writing for my graduate data structures class. So without further ado, let me present my first topic.


Amortized Analysis

Definition: Amortized – “to pay money that is owed for something (such as a mortgage) by making regular payments over a long period of time” ~ Merriam Webster

Which sounds worse? Paying $30 now or paying $1 a day for the rest of the month? Although they are the same amount of money, many people would rather part with their cash in smaller increments spread over a long period of time. Similarly, when making a large purchase, for example a $2000 laptop, it is common to justify the expense by saying “buy I will use that laptop for the next five years!” Such a statement is unconscious amortization.

Justin Sybrandt
tag:sybrandt.com,2013:Post/1054555 2016-05-23T04:24:52Z 2016-05-23T15:39:09Z Bash on Ubuntu On Windows (First Impressions)

Last week I installed the most recent Windows Insider Preview Build (Instructions Linked)  in order to try out the recent new feature for Windows 10. Bash on Ubuntu on Windows is a native Ubuntu installation which lives alongside Windows 10, albeit sandboxed in its own file system. More details can be found here but the long story short is that it works as advertised. After getting the Windows Insider update, bash can be found from the start menu. Clicking on the familiar orange circle opens a black terminal similar to the standard ubuntu window. From there the user has access to commands like ls, find, top, and most importantly, apt-get. I took a bit to customize my setup and play with bash, and so far I have been really impressed, despite the growing fear that microsoft plans to embrace, expand, and exterminate ubuntu. Unfortunately though, their choice of name is clunky and contains 100% more prepositions than necessary. (For some reason, this fits the Microsoft ethos in my mind...)

Justin Sybrandt