Recently, I received a Best Poster Award for the following work. I am currently fleshing my work into a publishable paper.
Recently, I received a Best Poster Award for the following work. I am currently fleshing my work into a publishable paper.
The following depicts many Catalan problem transformations. Personally, I find it very hard to follow written descriptions of these transformations, so instead hope to describe these conversions with an image.
So we have just talked about implicit data structures which a programmer can define to store trees and other constructs, but it turns our that some basic principles of mathematics can be used as implicit data structures themselves. A great example of this is the Catalan numbers, and in this section we will explore some of the many (many many many) different problems which can be converted to and from this special set of integers.
The Catalan Numbers follow a simple rule. The n'th Catalan number is equal to...
[shamelessly copied from Wikipedia]
So the first ten Catalan numbers are:
|n||n'th Catalan Number|
(cond ((eq n 1) 0) ((eq n 2) 1) ((+ (fibonacci (- n 1)) (fibonacci (- n 2))))))))
So how can we compare these two?
This one is very straightforward: Lets equate 'A' to '(' and 'B' to ')'. By doing this we can convert between the two.
ABAABABB becomes ()(()())
Note that a ballot sequence meets our ordering requirements if and only if we can formulate it as a valid string of parenthesis.
Possibly the most irregular data structure we will cover is
the PQ Tree. Firstly, the PQ tree was inspired by a baby’s mobile (in
explaining this data structure to friends, I learned that many readers may be
unfamiliar with typical American baby raising practices, pictured below is a
mobile). Secondly, this model gives us a very efficient way to represent subsets
of permutations, which can greatly help us solve combinatorial problems.
The PQ tree contains two types of nodes, P and Q (what a surprise). Conceptually, a P node is like a hook to hang ornaments on. The children of a P node may be permuted in any way. The Q node on the other hand, is like the horizontal bars pictured in the above baby toy. Its nodes are listed in order, but the bar itself may be rotated. The leaf nodes of a PQ tree are values, which can be reordered following the structure of the tree. The following are some examples of PQ trees, along with the set of permutations they represent.
So obviously, this data structure needs to help us solve a problem (the cool factor only gets it so far) and luckily it can have many interesting applications. Notice how we can use P and Q nodes to get sets of permutations that all have specific qualities. Look at the 3rd above example wherein we can generate 12 permutations on five items that all start and end with 1 and 5. Considering there are 120 permutations on five elements, PQ trees allow us to restrict our domain to a much more manageable size. This becomes more apparent as the number of elements increases to seven or eight (there are 5,040 permutations on seven elements and 40,320 permutations on eight).
So let’s consider a more practical problem, imagine we have a 2d array of 1’s and zeros. Our goal is to rearrange those rows such that on each row, all the 1’s are grouped together. This problem might sound arbitrary, but many scheduling tasks need to solve problems like this. Well, if there are N columns, then there are N factorial different permutations we would need to check. But we can do a lot better with PQ-Trees.
The above animation shows how a PQ-Tree can be constructed which meets the requirements defined by the example matrix, one row at a time. Notice how at each iteration, columns which need to be next to each other are grouped together in a P node. Then, when more restrictions are incorporated, more structure is needed, and a Q node is used to make sure nodes stay adjacent where needed. Ultimately, this data structure takes our search space of 120 column permutations and discovers the set of four solutions.
So when we looked at implicit structures we talked about
heaps. With a heap, we had a simple array that held our binary tree. In many situations,
this is the most efficient way to store a binary tree. But what can we do if we
have trees of arbitrary shape? Of course, we can always make nodes linked with
pointers and throw that all in the heap, but what if we want a more elegant
For the following methods, consider the following tree:
Let’s start with an interesting, if not necessarily helpful data structure created by Art Farley in the University of Oregon. Here a tree is going to be stored as an array of pairs (or an array of size N by 2).
For each node we will store two things, the number of adjacent nodes, and the sum of those node’s labels. In the following table, the blue column shows the node’s label, and the yellow columns show the pairs. Note that in a program we would not explicitly store the blue column.
We can recover the original tree by doing some smart manipulations to this data. Firstly, if a row contains a 1 for the number of adjacent neighbors, then we can conclude it is a leaf node. Note that if there is only one adjacent neighbor then the sum of neighbor index column really just contains the label of that node’s parent. For example, we can read in our table that node 8’s parent is node 5.
Next, we can subtract each leaf node’s label from its parent. For example, we can subtract node 9 and 10 from node 6. Doing so will decrement node 6’s adjacent node count by two, and set its neighbor label sum to 2. Unsurprisingly, the result shows that node 6’s parent is node 2.
Granted, if you believe this may be more complicated than its worth, you might be right, a simpler and more efficient implicit tree structure is next. Still, you never know when techniques like this will show up in your own projects, and it’s important to have a large conceptual toolbox.
The Prufer Array:
This one is really simple, and seems to have been discovered by many people independently, but Prufer was the first one published. In fact, you may have even already thought of this while reading the previous section. We store the tree as an array of size N, where each cell index corresponds to a unique node, and each cells value corresponds to that node’s parent. We will use the value of zero to denote the root of the tree. Although the root can be anywhere, it is common to reindex the tree such that the root is the first index. Then, we don’t even need to store the root itself!
The following table shows our tree’s Prufer array, assuming node 2 is the root. Once again, the blue values wouldn’t actually be stored in a computer.
Now for its simplicity and memory efficiency, it is worth noting that this method does have a tradeoff with Farley’s structure we mentioned previously. It is much harder to identify leaf nodes in a Pruffer array. Before we could see that any node with only one adjacent neighbor must be a leaf node, but here we need to read the entire array and record which nodes have children and which don’t. That being said, in most cases the Prufer array is the best choice, especially when memory is a top priority.
In a previous section we outlined what the distribution center problem is, explaining that the end goal is to find a node in a tree which minimizes the time needed to send goods to all other nodes. In that previous section we explained how the concept of priority queues is important to solving our problem, but careful readers would likely not be satisfied; many typical priority queue operations are worse than linear. So in this section we will hopefully satisfy that curiosity.
A reminder of the overall algorithm:
We will work by starting at the leaves and processing inward until we find our center. We assign all leaves a value of zero, and parent nodes will have a label dependent on the values and count of their children.
The Data Structure:
We will take the edges of the tree as input, noting that this is linear in the number of vertices. From this edge list we will build a flexible, linked, data structure to optimize the types of tree traversal we want to do later in the algorithm.
We start with an array of linked lists. Each node will have a unique index in the array. Then, for each edge from A to B, we will add B to A’s list, and A to B’s list. Lastly, we will link B’s list element to A’s list element, making it easier to jump to specific locations in each linked list.
That likely sounded confusing, so hopefully a diagram will help:
Corresponding Edge List:
How to use this structure:
Start by making an array of “buckets” which we are going to use to order the nodes by their values. A node will be added to the array whenever we can accurately calculate its value. Effectively this means that we will be adding leaf nodes to the buckets first, then their parents, and so on. We are going to go through each bucket only once, and we will start by assigning all the leaf nodes to bucket zero. We can identify leaf nodes by looking at the length of each node’s edge list; an edge list of length one implies a leaf node.
When a node is processed, we remove it’s edges from the edge list structure. For example, if we processed leaf node D, we would remove its edge to B as well as B’s edge to D.
After removing the processed edges, the process repeats, any list with a length of 1 is added to the array of buckets and the process continues.
The last thing to explain is exactly how we are going to keep track of the value of nodes while we are processing the tree. For this we will use yet another array, this one keeping track of the intermediate results for each node. When a leaf node is processed, it will update the intermediate result of its parent. This can all be done in constant time, and in a method similar to how we described when originally talking about the distribution center algorithm.
Overall Running Time:
So step one is to build the edge list, bucket array, and intermediate array. For each of these we only need at most an array length equal to the number of nodes. Building the edge list also adds links for each edge, but because we have a tree, we have N-1 edges. Because setting up the links for edges can be done in constant time, the overall initialization process is linear.
Step two is to put leaf nodes in the zero bucket. This is linear because we simply need to scan through the edge list once.
Lastly, we iterate through the buckets, and for each node in a bucket, we remove its corresponding edges and update the intermediate array. Updating the array is constant, removing the edge is constant (thanks to our edge list data structure) and adding any eligible parent nodes is also constant (because the only eligible nodes will be linked to by the leaf nodes we are removing; there is no need to search for them). This overall process is linear because once a node is processed, it will never appear in the bucket list a second time.
Because all steps of this algorithm are linear, we can clearly say that the distribution problem is solvable in linear time!
All too often, when dealing with data structures, we are concerned about complicated pointer structures and many levels of indirection. For example, when we discussed priority queues, we focused heavily on the linking structure of binomial and Fibonacci trees. Now pointers are important and often make for fast structures, but sometimes it is important to have simpler designs ready for times when pointers get in the way.
Storing data to disk, transmitting data between machines, or simply reducing the brain power needed to follow your code; these are all times when its worth switching to implicit data structures.
What are they?
By “Implicit” I
mean that the data has implied structure
rather than one which is clearly evident. C programmers know about
implicit data structures because they use them frequently. In C, if
you write “short x = 65; char c = x” you would have the number 65
in the variable x
and the letter ‘A’ in the variable c.
and c store
the binary value 010000012,
that binary value differently to produce different results. This is
only exasperated when considering floats. Consider the following C++
float f = 0xc542a000; unsigned int u = 0xc542a000; int y = 0xc542a000; cout << "F: " << f << endl << "U: " << u << endl << "Y: " << y << endl;
Although f, u, and y contain the same binary string, they represent very different values.
So clearly, we use implicit data structures all the time often without really realizing it. Lets see how we can apply this concept to represent more interesting structures.
So in a basic data structures class students are eventually taught heaps, which is typically the first clearly implicit data structure a young computer scientist knows. Heaps are binary trees, but they are typically stored in an array.
By storing the heap in an array, there is no memory overhead of additional links, but this is relatively unimportant when modern systems often have memory to spare. Instead, heaps store their data in an array because it makes memory accesses constant time. This is achieved because interpreting the heap’s array as a tree is very straightforward.
If you need a reminder, if a heap node is stored in array index X then its children will be at array index 2X+1 and 2X+2. Similarly that node’s parent will be at floor(X/2) – 1. These simple equations replace the need to store pointers for every parent-child relationship, they improve cache locality of the heap, and they greatly simplify the overall structure.
Going forward, we are going to see that there are a lot of different ways to store structures implicitly, some of which are incredibly efficient provided their easily interpretable.
One of the most important algorithms of the modern era is
graph distribution. At some level, the reason companies like UPS and Amazon are
successful boil down to how well they can solve distribution problems. Sure,
the problem and algorithm we are about to discuss is a much simpler formulation
than companies work with, but the basic principles are the same.
The graph distribution problem is simple. Given a graph (in our case, a tree), which node should you place your distribution center? And how long would it take to distribute something to the entire network? (Note: our algorithm can extend easily to graphs if you first find a minimum spanning tree).
Well for our purposes, let’s assume each node in our tree can do one of two things each timestep. It can either receive a package from one of its neighbors, or it can send packages to a group of its neighbors. Imagine each node is a factory. Each day it can either receive a large shipment of goods, or it can send out a fleet of trucks.
The following animation shows this process with a randomly selected node as the distribution center. Notice how it takes seven timestamps to reach the whole tree.
So how would we go about finding the best node to pick for our distribution center? From first glance, we probably want to pick one of those three nodes in the center, but we need to do so algorithmically. Maybe we could use network metrics to help us, and they would on average do quite well, but we can do better with the help of a pretty simple data structure.
It is no accident that we spent the last couple of sections reviewing priority queues. We are going to store our nodes in one of these in order to properly walk through the tree, starting with the leaves, and making our way to the distribution center. By effectively using our data structure, we can make this algorithm linear!
The following animation shows this process in detail.
And that’s it! We can walk through our tree once, and the last node marked is our distribution center!
The Model View Controller Paradigm is one of the most fundamental programming patterns out there. It is at the heart of practically every website and every iPhone app (and the good android ones too!). Understanding MVC is critical to writing maintainable software, and even though new acronyms have come out since (MVP, MVVM, HMVC) all of them try to capture an ideal defined by MVC.
In this section we will first discuss what programming patterns are, then we will delve into a brief description of each component in MVC. Throughout this description we will be talking about how MVC can be used in sample applications to make life easier for programmers, and the experience better for end users.
Programming is all about patterns and conventions. C programmers know that when they see something named IN_ALL_CAPS that the value is constant. They know that i is an index and that foobar may be a variable or struct. C programmers use naming conventions constantly, often not realizing how important they are, to make programming easier. In the same manner, web developers, iPhone app makers, and sensible application programmers rely on the MVC paradigm to make large projects easier to understand.
More important than writing efficient code, or writing code quickly, is writing code that is human readable and self documenting. It is tempting, especially when learning to program, to work quickly. But Facebook was not written in a day. At the time of writing, the Facebook iPhone app consists of over 18,000 class files. (Take a look) So how can a handful of people on Facebook’s programming team manage that many files? Well they better be using patterns (spoiler, they use MVC).
Patterns, in my opinion, are all about knowing where to look for something. Are you wondering how to add a feature to the login page? Well, that’s a lot easier if there is a file for login graphics, a separate one for login logic, and a whole other system for login data (otherwise we might lose some users due to minor text fixes).
So without further ado, let’s look at Model View Controller.
In order to understand each piece of the MVC pattern, we will first see the macro-level organization of the three components and how they interact to make up an application at a high level. Then we will give details about each one.
[Raw images from cs.princeton.edu/~wayne/teaching/fibonacci-heap.pdf]
In the last section we discussed Binomial Queues and how they used a linked list of binomial trees to keep an efficient priority queue. While the binomial queue worked hard to give us an amortized constant time insertion operator, it sacrificed performance in other areas such as its O(log n) find-min operator. In this section we will explore another data structure which attempts to keep a priority queue as a linked list of trees, making its own trade-offs in performance. Mainly, the Fibonacci heap maintains a more complicated set of trees than its binomial counterpart, but in exchange for the maintenance overhead, Fibonacci heaps provide many amortized constant time operations. Because of this carefully constructed internal organization, the Fibonacci heap is one of the best performing heap-like data structures existing today.
There are two main differences between the Fibonacci heap and the binomial queue. Firstly, while the binomial queue required that at all times, only one tree of each order may exist in the structure, the Fibonacci heap may contain multiple trees of the same size. Additionally, while the Binomial queue consisted of only binomial trees, the Fibonacci heap contains trees of many shapes and sizes. Note that in both, trees must follow the min-heap property that parent nodes must always have a lower value than any of their children.
The main component of a Fibonacci heap is a rotated linked list of tree roots. When a new element is inserted, it can be added to this rotated linked list as a tree of size 1. Eventually trees may be combined, but not until an operation runs which requires additional structure. This data structure gets its performance benefits by being really lazy.