How should we peer review software?


If you want to work as a scientist or researcher in any serious capacity, you need to publish papers in peer-reviewed journals.

You need to publish a lot of papers, and papers in fancier journals are better. You can also present your research at a conference, and then the work gets recorded in the conference proceedings, which is kind of like a journal but not really. Except sometimes there are conferences where your abstract is published in an actual journal, but it's only an abstract, so it's still less prestigious than a full paper published in a journal.

That is, unless you work specifically in machine learning, in which case, the best 'journals' are actually conferences. There are still regular conferences that are worse than journals, but the AAAI and NeurIPS conferences are better than most journals. It's also better if your name is first, or at least high up, on the author list, unless you work in cybersecurity, in which case names are just ordered alphabetically. Unless you want to be the supervising PI (principal investigator), in which case you want it to be last. If there are students and professors together on a paper, the students' names go from most significant contributions to least significant, and PIs after that from least significant to most significant.

This is what happens when you let smart people play status games.

The core of the above system is peer review. It's a fairly solid concept—basically, if you want to say that something is true and publish it so that everybody can cite it as real scientific literature, other scientists in your field should look at it and say it is reasonable. So the journal gives it to some scientists to review and asks for their thoughts. Based on those reviews, the editor of the journal gives one of four responses: reject, accept with major revisions, accept with minor revisions, or accept.

I've worked with professors who absolutely despise the peer review system and enjoy listing papers that were at first rejected despite becoming seminal works in the field. I've also worked with a professor (well, a professor turned CTO) who was somewhat offended when I suggested it might have some issues. He had over 200 papers published in journals, so I suspect my poking fun at the system he had mastered to achieve his considerable success annoyed him somewhat.

I am generally a fan of peer review in theory, if not in practice. It isn't easy to review work that very few people are qualified to perform. If you want to see who the fastest runner is, you can make them run and time them. The person doing the timing doesn't need to be a runner. Science is unique in that in order to vet the procedure, you need to actually be good at the specific type of science being conducted. Subfields in science are very small. As such, it does make sense to have scientists review other scientists' work. It's not a perfect system, but the flaws can mostly be attributed to human nature rather than an inherent issue in the procedure. There has been talk recently of listing reviewers' names on the papers they reviewed to encourage better reviews, but beyond minor tweaks like that, it is a reasonably okay system in my opinion.

I recently wrote that when research involves a lot of software, the researchers should have to submit that software to the journal for the paper to be accepted. This already happens in top-tier journals, and I suggested that it should happen in all of them. After further reflection, I am realizing that my suggestion is a lot more difficult to actually implement than I had thought.

In recent weeks, I've been plugging away at the unenviable task of translating 20-year-old MATLAB into pseudocode, which will in turn be translated into usable C++. I have the welcome help of some talented programmers to whom we are outsourcing some of the functionality, but it is my job to understand the whole behemoth of a codebase and tell them exactly what to do. The code quality isn't great, since it was largely written by graduate students who were never trained specifically in software engineering.

Unfortunately, this tangled mass of files is hardly unique to our lab. Most software found in research labs is of similar quality, if not worse. It is typically written by engineers who are experienced in non-software fields, which means they are smart enough to think deeply about how the software should be made but are inexperienced in software development, and it shows in the code they produce.
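To make "poorly written but smart" concrete, here is a hypothetical fragment I invented for illustration (it is not from our codebase or anyone else's), written in C++ in roughly the style a direct MATLAB translation tends to end up in: one monolithic routine, magic numbers, and names that meant something to one graduate student in 2004.

```cpp
// Hypothetical fragment, invented for illustration -- not taken from any real codebase.
// Typical research-lab style: one monolithic routine, magic numbers, cryptic names.
#include <cmath>
#include <iostream>
#include <vector>

std::vector<double> proc2(const std::vector<double>& x) {
    std::vector<double> y(x.size(), 0.0);
    const long n = static_cast<long>(x.size());
    for (long i = 7; i + 7 < n; ++i) {                 // why 7? nobody remembers
        double s = 0.0;
        for (long k = -7; k <= 7; ++k) s += x[i + k];  // 15-point moving sum
        y[i] = s / 15.0 * 0.9973;                      // fudge factor from a 2004 lab notebook
    }
    return y;
}

int main() {
    std::vector<double> sig(100);
    for (long i = 0; i < 100; ++i) sig[i] = std::sin(0.1 * i);
    std::cout << "proc2 ran, sample value: " << proc2(sig)[50] << "\n";
}
```

Nothing in it is broken; it compiles, runs, and produces plausible numbers. The problem is that a reviewer handed a few thousand lines of this has no practical way to tell whether the magic constants and fudge factors match what the paper claims.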

This means that in order to review the software that goes along with research papers, the reviewers would have to dredge through lines upon lines of poorly written software, like what I'm doing right now. I am (sort of) willing to go through this, but I'm being paid for it, and as soon as I'm finished with it, I get to write some cool C++ code that will be used to detect heart arrhythmia, so I'll get it done.

How on Earth are we going to persuade reviewers, who sometimes cannot even be bothered to fully read the paper in detail, to deeply understand the convoluted software?

It may seem like the solution to this is to just submit the completed software to the journals alongside the paper. Then the reviewers just have to run the software, and if it works, there is no need to look at its guts. The issue with this possible solution is that a lot of scientific code is simulation. It is designed to mimic the behavior of a natural phenomenon and apply what was described in the paper to it.

The code that goes along with my spectroscopy project from Purdue (which was supposed to be published in August, and it is now November) doesn't actually do anything that hasn't been done before—it just does it with less data. The 'output' is just a plot describing what happened within the guts of the code. If you look at the 'output,' all you can surmise is that the code produces plots suggesting that the algorithm being simulated works the way we say it does. It would be entirely possible to write code that fakes those plots. To be clear, I didn't fake the plots, and I fully intend to make the GitHub repository public once the paper is published. But unless you actually look at the software deeply, you cannot verify that it works on any level that matters.
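To illustrate why "just run it" proves so little, here is a hypothetical sketch (not the actual spectroscopy code, and the model is made up): two programs whose only observable output is a CSV file that becomes the paper's plot. One implements a toy model; the other just writes out the curve the model is expected to produce. From the outside they are indistinguishable.

```cpp
// Hypothetical sketch: two programs whose observable output (the file that
// becomes "the plot" in a paper) looks identical, even though only one of
// them actually runs the model it claims to.
#include <cmath>
#include <fstream>

// A toy "simulation": iterate a simple exponential-decay model and record it.
void run_simulation(const char* path) {
    std::ofstream out(path);
    double signal = 1.0;
    for (int t = 0; t < 100; ++t) {
        out << t << "," << signal << "\n";
        signal *= 0.95;   // the model the paper claims to implement
    }
}

// A "simulation" that skips the model entirely and writes the expected curve.
void fake_results(const char* path) {
    std::ofstream out(path);
    for (int t = 0; t < 100; ++t)
        out << t << "," << std::pow(0.95, t) << "\n";
}

int main() {
    run_simulation("real.csv");
    fake_results("faked.csv");   // plots of the two files are indistinguishable
}
```

Again, nobody faked anything; the point is that a reviewer who only runs the program and looks at the resulting figure cannot tell these two apart. The verification has to happen inside the code.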

The code that I'm working on now outputs a diagnosis. While the accuracy of the diagnosis will be verified before it is actually implemented in the medical field, it isn't realistic to require the reviewers to actually implement the diagnosis methods on a sick person and see if they get better. They are just reviewers, not FDA employees. The rigorous review is done, but it is done separately. The reviewing process for a medical procedure is significantly more rigorous than it is for a paper. You don't need to worry about a medical procedure being conducted after only being described in a paper; it will only be conducted in the real world if it undergoes further review. But that's not what I'm talking about in this particular post.

So again, in order to verify whether the code does what the authors say it does, the reviewer would have to inspect the innards of the software, which is a very lengthy and laborious process that they probably won't be willing to do. It is an especially difficult task because it is not likely that the researchers intentionally fibbed. Much of society functions because very few people want to spend many years studying to become medical researchers and then decide to publish false medical research. That concept is kind of terrifying. It is much more likely that they made an honest mistake, and honest mistakes are a lot harder to find than intentional lies. The reviewers would have to look for hidden bugs in huge codebases. Such an undertaking is difficult.
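For a sense of what an honest mistake looks like, here is a hypothetical example I made up (it is not from any real paper or codebase): a beat-interval calculation that compiles, runs, and prints a perfectly plausible heart rate, while quietly truncating the mean because of integer division.

```cpp
// Hypothetical example of an honest mistake: compiles, runs, and prints a
// plausible number, yet the arithmetic is subtly wrong.
#include <cstddef>
#include <iostream>
#include <vector>

// Mean interval between detected beats, in samples.
double mean_rr_interval(const std::vector<int>& beat_samples) {
    int total = 0;
    for (std::size_t i = 1; i < beat_samples.size(); ++i)
        total += beat_samples[i] - beat_samples[i - 1];
    // Bug: both operands are integers, so the division truncates
    // before the result is converted to double.
    return total / static_cast<int>(beat_samples.size() - 1);
}

int main() {
    std::vector<int> beats = {0, 180, 362, 541, 725};  // sample indices of detected beats
    double mean_rr = mean_rr_interval(beats);
    // At a 250 Hz sampling rate, convert to beats per minute.
    std::cout << "Heart rate: " << 60.0 * 250.0 / mean_rr << " bpm\n";
}
```

Nothing crashes, and the printed heart rate is off by only a fraction of a beat per minute, which is exactly why nobody notices. Finding this kind of thing means reading the arithmetic line by line, in a codebase orders of magnitude larger than this sketch.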

One alternative solution is to make sure scientists write good code in the first place, so that looking it over would not be such an onerous task for a reviewer checking that the paper accurately describes the software's behavior. While this idea seems nice, I really don't think it is feasible. The reality is that good software is hard to write, and it already requires a lot of training to be a scientist. A PhD in the sciences takes 4-5 years to complete, and you want to add the years it takes to become a good software engineer on top of that? It is already really hard to become a scientist, and there are a million things that scientists should know but don't. The human lifespan is simply too short to obtain all the education you would need to be a scientist who knows everything a scientist has to know.

Of course, you could increase science funding so that labs could hire software engineers, but we are going in rather the opposite direction these days.

I don't think the problem is so intractable that we should just abandon it. I'm open to suggestions. To quote the great Jello Biafra in the song Where Do Ya Draw the Line, "I'm not telling you; I'm asking you." I think there is a way to solve this problem, but it is not as simple as requiring reviewers to inspect the simulation code that goes along with the paper. They aren't going to do that unless you pay them or otherwise incentivize them. It just isn't realistic.