# Selected Publications

Are Abstracts Enough for Hypothesis Generation?

MOLIERE: Automatic Biomedical Hypothesis Generation System

# Recent Posts

### Moliere Poster -- Google PIRC

I have the chance to present my work at the Google Ph.D. Intern Research Conference (PIRC). This poster represents all of the work we have added to the Moliere project since our original paper last year.

### Basic Iterative Numeric Optimization

Today in a class, we were asked to write an iterative solver for numerical equations. Now, many students in the class did not have an optimization background, so for the benefit of everyone, I want to share a simple overview of this exercise and how to go about solving it.

The problem was stated as follows:

$$M(a) = 2\times a + 14$$ $$G(b) = b - 2$$

Our goal was to find some solution $x$ such that $M(x) = G(x)$. Additionally, we were supposed to do so iteratively, so just solving the system of equations algebraically was out of the question. This is because our next exercise would have a different $M$ and $G$, so our code should work for any pair of functions we hand it.

For the sake of generalization, my solution here will assume only that $M$ and $G$ are continuous, and I will not assume we know their derivatives. Additionally, I will be writing my code in Python, simply because I find that it is easier for anybody to understand. Knowledge of Python, hopefully, won’t be necessary. But first, let's go over some aspects of the problem…
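As a rough illustration of what such a solver could look like, here is a minimal sketch using bisection on the difference $M(x) - G(x)$. Bisection is my choice for this sketch because it needs only continuity and no derivatives; it is not necessarily the method the full post develops. The names `solve_equal`, `lo`, `hi`, `tol`, and `max_iter` are hypothetical, and the sketch assumes the two functions cross somewhere the widening search interval can reach.

```python
from typing import Callable


def solve_equal(
    m: Callable[[float], float],
    g: Callable[[float], float],
    lo: float = -1.0,
    hi: float = 1.0,
    tol: float = 1e-9,
    max_iter: int = 200,
) -> float:
    """Find x with m(x) == g(x) by bisection on diff(x) = m(x) - g(x).

    Assumes m and g are continuous and that diff changes sign somewhere
    the widening interval can reach. No derivatives are used.
    """

    def diff(x: float) -> float:
        return m(x) - g(x)

    # Step 1: widen [lo, hi] until diff has opposite signs at the ends.
    for _ in range(max_iter):
        if diff(lo) * diff(hi) <= 0:
            break
        width = hi - lo
        lo -= width
        hi += width
    else:
        raise ValueError("could not find an interval containing a solution")

    # An endpoint may already be an exact solution.
    if diff(lo) == 0:
        return lo
    if diff(hi) == 0:
        return hi

    # Step 2: halve the interval, always keeping the sign change inside it.
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        d_mid = diff(mid)
        if d_mid == 0 or hi - lo < tol:
            return mid
        if diff(lo) * d_mid < 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def M(a: float) -> float:
    return 2 * a + 14


def G(b: float) -> float:
    return b - 2


x = solve_equal(M, G)
print(x)
```

For the $M$ and $G$ above, solving by hand gives $x = -16$, so the printed value should agree with that to within the tolerance. Swapping in a different pair of continuous functions requires no change to `solve_equal`, as long as they actually cross.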

### Moliere Software Overhaul

Over the last couple of days, I have retooled MOLIERE into a system that anyone can deploy to run their own queries. The code is over at the default repo and should be pretty straightforward; it even downloads the raw data itself! Just run build_network.py and point it at a big parallel file system. In a few hours you’ll have your very own knowledge network!

# Recent Publications

### Are Abstracts Enough for Hypothesis Generation?

The potential for automatic hypothesis generation (HG) systems to improve research productivity keeps pace with the growing set of publicly available scientific information. But as data becomes easier to acquire, we must understand the effect different textual data sources have on our resulting hypotheses. Are abstracts enough for HG, or do these systems need full-text papers? How many papers does an HG system need to make valuable predictions? How sensitive is a general-purpose HG system to hyperparameter values or input quality? What effect do corpus size and document length have on HG results?
BigData’18, 2018.

### Large-Scale Validation of Hypothesis Generation Systems via Candidate Ranking

The first step of many research projects is to define and rank a short list of candidates for study. Given the rapid pace of modern scientific progress, some turn to automated hypothesis generation (HG) systems to aid this process. These systems can identify implicit or overlooked connections within a large scientific corpus, and while their importance grows alongside the pace of science, they lack thorough validation. Without any standard numerical evaluation method, many validate general-purpose HG systems by rediscovering a handful of historical findings, and some wishing to be more thorough may run laboratory experiments based on automatic suggestions. These methods are expensive, time-consuming, and cannot scale. Thus, we present a numerical evaluation framework for the purpose of validating HG systems that leverages thousands of validation hypotheses.
BigData’18, 2018.

### MOLIERE: Automatic Biomedical Hypothesis Generation System

Hypothesis generation is becoming a crucial time-saving technique which allows biomedical researchers to quickly discover implicit connections between important concepts. We discover these connections with our tool MOLIERE.
KDD’17, 2017.

### To Agile or Not to Agile: A Comparison of Software Development Methodologies

Since the Agile Manifesto, many organizations have explored agile development methods to replace traditional waterfall development. Interestingly, waterfall remains the most widely used practice, suggesting that there is something missing from the many “flavors” of agile methodologies. To investigate this, we examine seven of the most common practices and evaluate each against a series of criteria centered around product quality and adherence to agile practices. We find that no methodology entirely replaces waterfall and summarize the strengths and weaknesses of each. From this, we conclude that agile methods are, as a whole, unable to cope with the realities of technical debt and large-scale systems. Ultimately, no one methodology fits all projects.
arxiv.org, 2017.

### Rapid Replication of Multi-Petabyte File Systems

By utilizing General Parallel File System (GPFS) policy scans, distsync finds changed files without navigating between directories. This allows our tool to synchronize large, out-of-date file systems more efficiently.
[WIP] PSDW’15, 2015.

# Projects

#### MOLIERE: Automatic Biomedical Hypothesis Generation

We discover potential connections within existing scientific literature. Currently, we are preparing MOLIERE for large-scale public usage.

#### Bridge Health Classification With Automotive Sensing

We classify bridge health using Support Vector Regression and other machine learning techniques, in partnership with Clemson civil engineers.

#### Learn to Program Python

An introductory video series for people absolutely new to programming. Learn the basics of programming!

#### Rapid Replication of Multi-Petabyte File Systems

Distsync is a parallel storage system synchronization utility which leverages cluster computing capabilities to unify large out-of-sync distributed file systems.

# Contact

• justin@sybrandt.com
• McAdams Hall Office 224. McMillan Rd, Clemson, SC 29631
• Email for appointment