Posts

I’m here in Seattle, WA attending the IEEE International Conference on Big Data. I’ll be presenting two recent works. The first presents a new method to validate hypothesis generation systems. The second uses that method to determine the quality of input papers needed to draw good conclusions. With two papers in the same conference, I will be giving a double-length talk! If you’re around, I’ll be at the end of the L12 session Wednesday morning.

CONTINUE READING

I have the chance to present my work at the Google Ph.D. Intern Research Conference (PIRC). This poster represents all of the work we have added to the Moliere project since our original paper last year.

CONTINUE READING

Today in a class, we were asked to write an iterative solver for numerical equations. Now, many students in the class did not have an optimization background, so for the benefit of everyone, I want to share a simple overview of this exercise and how to go about solving it.

The problem was stated as follows:

$$ M(a) = 2\times a + 14$$ $$ G(b) = b - 2 $$

And our goal was to find some solution $x$ such that $M(x) = G(x)$. Additionally, we were supposed to do so iteratively, so just solving the system of equations by hand was out of the question. This is because our next exercise would have a different $M$ and $G$, so our code should be able to support arbitrary functions.

For the sake of generalization, my solution here will assume only that $M$ and $G$ are continuous, and I will not assume we know their derivatives. Additionally, I will be writing my code in Python, simply because I find that it is easier for anybody to understand. Knowledge of Python, hopefully, won’t be necessary. But first, let’s go over some aspects of the problem…
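To make the setup concrete, here is a minimal sketch of one derivative-free way to attack it: bisection on the difference $M(x) - G(x)$. This is not necessarily the approach the full post takes. The function name `solve_by_bisection`, the starting bracket of $[-100, 100]$, and the tolerance are all assumptions picked for illustration, and the method only works when the difference changes sign somewhere inside the bracket.

```python
def M(a):
    return 2 * a + 14


def G(b):
    return b - 2


def solve_by_bisection(m, g, lo, hi, tol=1e-9, max_iter=200):
    """Find x in [lo, hi] with m(x) == g(x), assuming m - g is continuous
    and has opposite signs at lo and hi. (Hypothetical helper.)"""

    def f(x):
        return m(x) - g(x)

    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0:
        return lo
    if f_hi == 0:
        return hi
    if f_lo * f_hi > 0:
        raise ValueError("m - g must change sign between lo and hi")

    for _ in range(max_iter):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        # Stop on an exact hit or once the bracket is small enough.
        if f_mid == 0 or (hi - lo) / 2 < tol:
            return mid
        # Keep the half of the bracket where the sign change lives.
        if f_lo * f_mid < 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return (lo + hi) / 2


# The bracket [-100, 100] is an assumed starting guess.
x = solve_by_bisection(M, G, -100.0, 100.0)
print(x, M(x), G(x))
```

Because the solver only ever calls `m` and `g`, swapping in a different pair of continuous functions needs no other changes, as long as the bracket still contains a sign change. For the two functions above, solving by hand gives $x = -16$, which is a handy check on the iterative answer.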

CONTINUE READING

Over the last couple of days, I have retooled MOLIERE into a system that anyone can deploy to run their own queries. The code is over at the default repo and should be pretty straightforward; the code even downloads raw data itself! Just run build_network.py and point it at a big parallel file system. In a few hours you’ll have your very own knowledge network!

CONTINUE READING

We have publicly available code and experimental data. Our validation information has been incorporated into THIS REPO.

Our experimental data and results can be found in THIS OTHER REPO.

However, we are still working on uploading all of the supporting data.

CONTINUE READING

I have finally had time to package Moliere, our Automatic Hypothesis Generation System, into a single easy-to-use package!

Take a second to check it out at my repo.

CONTINUE READING

In a previous post I talked about how tools like word2vec are used to numerically understand the meanings behind words. In this post, I’m going to continue that discussion by describing ways we can find numerical representations for whole documents. So, I’ll be assuming you’re already familiar with the concept of word embeddings. Why do we need document embeddings? Many real-world applications need to understand the content of text which is longer than just a single word.

CONTINUE READING

I think it’s way too hard to manage small projects. There are so many project planning platforms out there, and they typically fall into one of two major pitfalls for small teams. Either they are free and simplistic, e.g. Trello, or they are expensive and complicated, e.g. Jira. Of course, there are millions of people who make these systems work for them every day, but in my experience it is hard for a small, well-intentioned group to actually use these.

CONTINUE READING

Recently, in text mining circles, a new method of representing words has taken off. This has been due, in large part, to recent papers from Mikolov et al. and tools like word2vec. Since then, many other projects have applied this concept to a wide variety of areas within data mining. So what is all the hype about? What are these embeddings and why do we need them?

CONTINUE READING

So recently, I needed to parallelize a lot of my old code. This initially seemed like a daunting task. Now, it’s not like I’ve never had to write parallel code before, and it’s not like my task was that hard. My issue primarily came from a staunch unwillingness to look anything up. After all, I could just throw my problem into Python, right? While that may be true, the version of myself today would like to tell the version of myself from last week that the C++ solution is not as bad as I thought.

CONTINUE READING