Shoulders to stand on

Here are a couple of anecdotes showing how scientific progress stalls when the implementation behind an experiment is unavailable.

Having talked with other students, I know that these stories are not isolated experiences.

Surely, you too have recently come across some fascinating scientific result that gave you an idea you wanted to implement right away, only to realize, to your dismay, that you would have to start from scratch. The algorithm was complex. Maybe the data was not available. With disappointment, you realized that it would take weeks or months of error-prone coding just to reach the baseline. Even then, you could not be confident that your implementation matched the original exactly, so there would be no direct comparison. So you wrote it down in your Someday/Maybe file and forgot about it. This happens to me constantly, and it seems tragic and unnecessary.

My own dog food

Later, I too did a replication study. In this case, my advisor and a fellow student wanted to compare their work to earlier published work on the same problem. However, neither the code nor the data had been released (the data was proprietary), and the evaluations were not published in a form comparable to more modern ones. Luckily for us, the method worked beautifully, and now everyone can see that more clearly.

In an ironic turn, I was a novice programmer at the time, and my replication code was disorganized. Some of it was lost. It was not under version control until quite late and had few tests of correctness. I now had grounds to empathize with the colleague I had earlier felt slighted by. However, I was fortunate to be forced into a break before graduating, during which I learned basic software engineering skills and had ample time to think about this issue.

I am now rewriting the entire code base so that the experiments can be replicated completely from scratch. Every step will be included in the code I release, including the "trivial" command line glue. Although every module has tests, no code is invulnerable. Bugs may well turn up, and if they do, they will be there for anyone who cares to analyze and repair. Most importantly, anyone who wants to know exactly what I did will not have to scour the paper for hints. Likewise, all the data will be at their disposal, in case my own analysis does not answer some unforeseen question.
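
To make that concrete, here is a minimal sketch of what such a top-level replication script could look like. Everything in it is a hypothetical stand-in: the step names, file paths, and toy "training" step are placeholders, not my actual pipeline. The point is only that every step, glue included, lives in runnable code:

    #!/usr/bin/env python3
    """Hypothetical end-to-end replication script: every step of the
    experiment, including the "trivial" glue, is ordinary runnable code."""

    from pathlib import Path

    DATA = Path("data/corpus.txt")       # placeholder paths, not real ones
    MODEL = Path("models/baseline.txt")
    SCORES = Path("results/scores.tsv")

    def fetch_corpus(dest: Path) -> None:
        # Stand-in for downloading or unpacking the released data set.
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text("example sentence one\nexample sentence two\n")

    def train_model(corpus: Path, out: Path) -> None:
        # Stand-in for the actual training step of the experiment.
        out.parent.mkdir(parents=True, exist_ok=True)
        vocab = sorted(set(corpus.read_text().split()))
        out.write_text("\n".join(vocab) + "\n")

    def evaluate(model: Path, out: Path) -> None:
        # Stand-in for the evaluation that produces the published numbers.
        out.parent.mkdir(parents=True, exist_ok=True)
        vocab_size = len(model.read_text().splitlines())
        out.write_text(f"vocab_size\t{vocab_size}\n")

    if __name__ == "__main__":
        fetch_corpus(DATA)
        train_model(DATA, MODEL)
        evaluate(MODEL, SCORES)
        print(f"Replication complete; results in {SCORES}")

With something like this at the top of the repository, "run the script" replaces "scour the paper for hints."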

This should be standard

In 2013, there is no excuse for publishing a paper in applied computer science without releasing the code; a paper by itself is an unsubstantiated claim. Unlike results in biology or other experimental sciences, applied computer science results should be trivial to replicate for anyone with the right equipment. Moreover, not releasing the code gives your lab an unsportsmanlike advantage: you get to claim the results, perhaps state-of-the-art results, and you get to stay on the cutting edge, because no one else has the time to catch up.

Many universities and labs now release projects under open licenses, but this is by no means the standard, and it is not a prerequisite for publication. We ought to change that.