Seaborn bug responsible for finding of declining disruptiveness in science
I saw this headline and my first thought was that someone was claiming that a mind-impacting virus that evolved in the ocean was causing scientists to do research with less ambition. Which is of course ridiculous lol. But a bug in a visualization library impacting science is also ridiculous.
If I understand the abstract correctly (which I very well might not be), I don't think it is saying a bug caused problems across all of science, but that it resulted in an incorrect conclusion in one meta study of disruptiveness in science.
Oh man, I’ve been tricked so many times by the names of packages and frameworks… and here I did it to others. Sorry!
Seeing this headline together with the one about the link between Toxoplasma gondii and entrepreneurship made me wonder if I was still dreaming.
Given that the majority of scientists seem to be cat owners, and toxoplasmosis has been linked to mental illness, it's not entirely implausible that a (human) bug is slowing scientific advancement.
Got a source on that cat claim, doc?
His ass.
I prefer the term "Moorean fact", but you are correct.
Have you ever seen the “BrainDead” series? This idea is not unheard of.
Thanks for reminding me of this oldie https://m.youtube.com/watch?v=GVvL2ca65DA
Science communication must be at an all-time low. I initially thought the paper was about a sea-borne pathogen being responsible for a decline in disruptiveness in science, which is a crazy statement.
Then I thought that it was a paper claiming that a bug in the seaborn plotting library in python was responsible for the decline in disruptiveness in science, which is absurd!
Finally I understood that this is a paper debunking another meta paper that claimed that disruptiveness in science had declined. And this new arxiv paper is showing that a bug in the seaborn plotting library is responsible for the mistake in the analysis that led to that widely publicized conclusion about declining disruptiveness in science. oh boy so many levels...
Neither the paper title nor the abstract leads with “Seaborn.” The decision to start the submission with “Seaborn bug…” is purely an HN artifact and has nothing to do with science communication.
ETA: For those who don’t click through, the paper title is “Dataset Artefacts are the Hidden Drivers of the Declining Disruptiveness in Science.” The first few sentences of the abstract are:
“Park et al. [1] reported a decline in the disruptiveness of scientific and technological knowledge over time. Their main finding is based on the computation of CD indices, a measure of disruption in citation networks [2], across almost 45 million papers and 3.9 million patents. Due to a factual plotting mistake, database entries with zero references were omitted in the CD index distributions, hiding a large number of outliers with a maximum CD index of one, while keeping them in the analysis [1].”
> Science communication must be at an all-time low.
It's arxiv, not a press release. :)
The seaborn issue linked in the paper, “Treat binwidth as approximate to avoid dropping outermost datapoints” (https://github.com/mwaskom/seaborn/pull/3489), summarizes the problem as follows:
> floating point errors could cause the largest datapoint(s) to be silently dropped
However, the paper does not contain the string “float”, instead saying only:
> A bug in the seaborn 0.11.2 plotting software [3], used by Park et al. [1], silently drops the largest data points in the histograms.
So at the very least, the paper is silent on a key aspect of the bug.
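To make the failure mode concrete, here's a minimal sketch of the general pitfall (illustrative only, not seaborn's actual code; the edge values are invented): NumPy's histogramming silently ignores values that fall outside the supplied bin edges, so a final edge left a hair short of the true maximum makes the largest points vanish.

    import numpy as np

    # Toy data with a pile-up at the maximum value of 1.0
    data = np.array([0.2, 0.5, 1.0, 1.0, 1.0])

    # Pretend accumulated floating-point error shaved the last edge
    # just below the true maximum when the bins were constructed.
    edges = np.array([0.0, 0.25, 0.5, 0.75, 1.0 - 1e-12])

    # np.histogram drops anything above the last edge without warning.
    counts, _ = np.histogram(data, bins=edges)
    print(counts.sum(), "of", len(data), "points binned")  # -> 2 of 5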
Seaborn is a visualization library. No statistical tests should have been done with seaborn as an intermediate processing step. I guess they used some of the convenience functions as part of the data analysis. Seaborn is a final step tool, not a data analysis tool. That's an embarrassing lesson to learn post-publication.
Take a look at the linked chart in my other comment. Visualization is absolutely a driver during research; it isn’t just an embarrassing revelation. Charts killed the Challenger crew.
> Charts killed the Challenger crew.
https://www.tiktok.com/t/ZT8oG7ym6/
This is one of my favorite TikToks of all time, and you’ll see why. It goes into detail about how charts killed the Challenger crew. But the storytelling is second to none.
I composed this comment concurrently with yours, so I'm moving it here as a response...
This is going off-topic, but Tufte's attempt to cast the problem as fundamentally one of poor data presentation is rather self-servingly tendentious, IMHO, in a way that unfairly attributes a degree of culpability to the engineers who tried to stop the launch.
The excellent video you link to, taken as a whole, supports this view, I believe.
That’s an interesting point (and quite on topic; communication is key in STEM, and this ties in with that).
Hypothetically, what would be the most fair argument in that situation? It’s quite remarkable that a line engineer was convinced the rocket was going to explode, even to the extent of hopping in his car with his daughter and trying to stop the launch after his company gave the go-ahead. Data presentation seems like one of the few things that could have convinced upper management that there was a serious problem.
One thing I don’t understand (possibly unrelated to your point): if there were very few launches in cold temperature in general, how could he have convinced himself that there was going to be a disaster due to the weather? If I were in his shoes, I might’ve talked myself out of it by saying "well, I suppose it’s true we don’t have much data about cold temperature launches; how certain am I that the cold weather problems till now weren’t a fluke or a non-issue?"
My position is not that a different or more thorough presentation of the argument would have made no difference (though, personally, I doubt that it would have), it is that Tufte's argument (and the aphorism it launched, "charts killed the Challenger crew") greatly exaggerates the significance of this narrow issue and tacitly blames the people who were trying their best to save the day.
In Tufte's version, the meeting in question was the tipping point where it all went wrong, while the reality is that it was the last forlorn chance for NASA to escape, by the skin of its collective teeth, from an overdue disaster that had been years in the making. As the Rogers Commission revealed, NASA had, in an environment of over-promising and political horse-trading, developed a culture in which deviance was normalized, and it was not ready to handle evidence contrary to the semi-official dogma of shuttle flights being routine and established events.
I'm not in a position to say how Boisjoly felt so sure the launch would end disastrously, but I can make a few guesses. I think it is quite possible that he gradually became aware, and then concerned, that the O-rings did not fare well in cold weather, as the data trickled in one launch at a time. I can imagine that when it became clear, a few days before the launch, that the temperature would be below freezing, his concerns sharpened into near-certainty that things would go wrong. One does not need a theory of what, precisely, was happening to the O-rings to suppose that if below-normal temperatures led to problems, then nothing good could come from an exceptionally low one. Perhaps he was too close to the data; I can also imagine that this seemed so clear to him that he never imagined his managers - who were also engineers - not also seeing it, instead clinging to older estimates of risk. I further imagine that he was completely blindsided by the somewhat rhetorical and sarcastic response, which went something like "are we supposed to wait until July?"
IIRC, Boisjoly anticipated that the joints would fail catastrophically immediately after the boosters were ignited, and for a minute afterwards he experienced profound relief...
Despite this all coming out in the Rogers Commission report, NASA followed the same normalization-of-deviance path after it became apparent that foam strikes were damaging the tiles, which is one reason why I doubt that better charts would have stopped the launch.
(Just wanted to say thank you for the thoughtful followup. I was hoping you would, since it sounded like it’d be interesting. And when you phrase it like you did, it does sound absurd to offload the blame onto whoever was presenting the charts. Enjoy the rest of your Sunday.)
Thanks! You might find Wayne Hale's 10-year retrospective on the Columbia crash interesting (he was Space Shuttle Program Manager or Deputy for 5 years, and a Space Shuttle Flight Director for 40 missions):
https://waynehale.wordpress.com/category/after-ten-years/
Also, I recall reading somewhere that the chairman of the Columbia Accident Investigation Board, Admiral Harold Gehman, decided to conduct a test to see if the piece of foam seen hitting the leading edge could have broken it. Since, at the time, it had not been decided to end the shuttle program, this was not an easy decision: it meant sacrificing an essentially irreplaceable spare part. What finally convinced Gehman to go ahead was the fact that a great many NASA engineers firmly believed it could not possibly have been the cause.
He knew how O-rings behaved: they fail to make a seal at those low temperatures. You don't need to have multiple failed launches to reach the conclusion he reached.
You do, though. Because although there’s a chance of catastrophic failure, the typical case is that the launch goes fine, and then they notice some unexpected degradation of the O-ring after the shuttle comes back. That’s how they were measuring chance of O-ring difficulties; a shuttle had never exploded, but there had been observable signs that things could have gone badly.
In other words, in most mechanical systems, a certain amount of wear and tear is acceptable. It’s only at extremes (way too cold) that it becomes a disaster. Convincing yourself that you’re certain there will be a disaster this time is a level of scientific and engineering confidence that’s hard to fathom.
In that situation I would have done everything possible to alert management that there was a high chance of an issue, but would I have grabbed my daughter and driven to stop the launch because I was 100% certain it would blow? Probably not.
It can be a driver. But then you do a deep dive. You create frequency tables, you create crosstabs, you calculate summary statistics, you do inferential statistics. There is no excuse for not catching this pre-publication.
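For instance, a plain summary-statistics pass with no plotting library in the loop would surface a pile-up at the maximum. A rough sketch (the column name and values are made up for illustration):

    import pandas as pd

    # Hypothetical score column standing in for the real data.
    df = pd.DataFrame({"cd_index": [0.08, 0.0, 1.0, 1.0, 1.0, -0.15]})

    print(df["cd_index"].describe())             # min/max/quartiles at a glance
    print((df["cd_index"] == 1.0).sum())         # how many records sit at the theoretical maximum
    print(df["cd_index"].value_counts().head())  # most frequent exact values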
That is not the point I was making.
The bulk of the problem was caused by erroneous metadata.
The bug in Seaborn simply meant that the histograms that could have alerted them that something was wrong with their analysis, didn't.
I hope that all the publications that celebrated the original work, like the Economist https://www.economist.com/science-and-technology/2023/01/04/..., Nature's news service https://www.nature.com/articles/d41586-022-04577-5, the FT https://www.ft.com/content/c8bfd3da-bf9d-4f9b-ab98-e9677f109..., and others spend as much time on correcting the record as they did on promoting the idea that science is broken.
And I hope the original authors tell Nature to retract their paper. It's already highly influential unfortunately.
This image is the best illustration of the flaw https://arxiv.org/html/2402.14583v1/x1.png
I’m on mobile and can’t read the rest of the paper, but the impact could be massive.
The submission was flagged, and I am not sure I understand why, since the only (negatively) critical discussion I see is about the ambiguity of the title in the HN submission; flagging a submission appears to take it off the HN homepage, and a title ambiguity doesn’t seem like a strong reason to remove a submission this significant from HN? :)
There are (at the time of posting this comment) no comments raising any substantive issue with the arxiv submission itself (which of course has to go through the peer review process of publication, and hopefully the original authors will respond to / rebut this new article) - so I’m curious why it’s been flagged. It’s not dead, so I cannot vouch for it.
If folks in the HN community who have flagged it have done so because there are serious issues with what the paper is asserting, please comment / critique instead of just flagging it. If it’s because of the ambiguity in the title, I hope @dang and the moderators editorialize - there are some valuable comments in this thread that helped me understand what the issue is and what the bug is!
Damn hipsters should just use matplotlib like the rest of us.
Gonna preface by saying I like what matplotlib is trying to do, and that it has done a lot of good for a lot of people.
Seaborn is a wrapper around matplotlib. It's popular because it removes a lot of the boilerplate from matplotlib and is pandas-aware.
For example, you call the pairplot function with a dataframe and you just get a matrix of correlation plots and histograms. Versus matplotlib, where half the documentation/search results use the imperative, global-state API and the other half the OOP one, plus all the extra subplots shenanigans you have to decipher to get something that looks good.
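Something like this, with toy data (purely illustrative, not anyone's actual analysis code):

    import pandas as pd
    import seaborn as sns

    # One call on a tidy DataFrame yields the whole scatter matrix,
    # with histograms on the diagonal.
    df = pd.DataFrame({
        "x": [1, 2, 3, 4, 5],
        "y": [2, 1, 4, 3, 5],
        "z": [5, 4, 3, 2, 1],
    })
    sns.pairplot(df)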
It's convenience, really. The people who use seaborn don't want to dive into matplotlib because the interface is kinda a mess with multiple incompatible ways to do things. It also documents what arguments mean instead of hiding most of them in **kwargs soup. You get plots in 1 minute of seaborn that would otherwise take 10 minutes in matplotlib to write.
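A rough before/after of that convenience gap, with made-up toy data (a sketch, not a claim about any particular workflow):

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.DataFrame({"value": [1, 2, 2, 3, 3, 3], "group": list("aabbcc")})

    # seaborn: grouping, colors, and the legend are handled in one call
    sns.histplot(data=df, x="value", hue="group")

    # matplotlib: you wire up the grouping, labels, and legend yourself
    fig, ax = plt.subplots()
    for name, sub in df.groupby("group"):
        ax.hist(sub["value"], alpha=0.5, label=name)
    ax.set_xlabel("value")
    ax.legend()
    plt.show()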
Bizarre. How do people make such big, splashy findings that can mess with people’s sense of optimism about science and innovation, without doing the simplest types of checks on their data and methodology?
No, the question is: how did peer review not catch it? I have the impression that reviewers don't have the time or incentive to give papers more than a cursory review. Independent of this case, a great many papers get published where the only "proof" is a user study or survey with an extremely small number of participants. Many papers don't publish their datasets and don't contain enough detail to try to replicate their results.
There should be a real incentive/compensation for reviewing properly and real consequences if a paper gets retracted for reasons that should have been caught in review.
In this case it's fortunate that it did get found out in the end.
Your impression is correct. Peer review would never catch this. Peer review assumes the counterparty is operating in good faith, and as a result a thorough peer review basically amounts to the following:
* is the treatment of existing work semi-thorough (even experts don’t know everything) and fair?
* are the claims novel w.r.t the existing work? If not, provide a reference to someone who has already done it.
* can you understand the experiments?
* do the experiments and their results lead to the conclusions claimed as novel?
* does the writing inhibit understanding of the technical content?
No peer review I have ever seen or done would catch anything but the most egregious bug of this nature.
Can you describe how you would have expected peer reviewers to catch this?
I am not an expert in statistics, but I have read quite a few papers in my area (IT/Kubernetes/etc.) whose methodology is obviously faulty to anyone with experience. Reading the other comments, this software should never have been used in this manner, which someone who is hopefully well-versed in this area should have caught. Then again, the reviewers may have had very little experience in this area. (This happened to me: when the review came back, the reviewers themselves admitted in the form that this was the case.)
Confirmation bias? It’s easy to run with the assumption you had at the back of your mind when your experiment seems to confirm it.
I have definitely done that with benchmarks / profiles.
It’s probably even easier when the incentives encourage “the find”.
They checked their result multiple ways. They missed a bug. It's not like computer bugs never happen or anything.
The real question is, what will they do now?
Will they own up to it and retract their broken paper that's eroding people's confidence in funding science at the highest levels? This has been an incredibly widely read and influential paper already.
I have my doubts that the authors will accept that their paper is bogus.
Particularly because the lead author landed a faculty position at a good institution based exclusively on this junk paper.
umm. probably.
Trash comment. 1st. Splashy often comes from the media, not the scientists.
2nd. One of the ways we discover problems with data is by plotting. When the plot library has a bug that hides a problem, well shit.
3rd. They did check their own findings multiple ways. Mistakes happen. The biggest critics of scientific mistakes are often those that have never done science themselves. It's easy, and it's a cheap play.
Genuine question: you don’t think that outlier analysis (with moderate diligence) would have picked up the error in the original study, despite the plotting library issue? Just look at this link - this wasn’t detectable with any type of cross-validation? https://arxiv.org/html/2402.14583v1/x1.png
It’s enough to make you lose faith in science.
it is science that discovers errors in science :)
And here we are, discovering errors!
> Now these points of data make a beautiful line. And we're out of beta. We're releasing on time.
As opposed to? The original claim being asspulled then not reviewed at all because that would be science?
the GP I assume means science as a monolithic institution, as opposed to the pure idea of the scientific process in isolation
kind of like the difference between "trusting science" and "trusting THE science" if I had to hazard a guess
presumably he doesn't mean as opposed to "traditional ways of knowing"
Sounds like a plot point from three body problem.
I wonder how much bad science has occurred due to the acceptance of Python as the lingua franca.
The graphing library caused this?
I must be waking up still because on first reading I interpreted it as a sea-born bug, something infectious or parasitic.
Totally expected norovirus.
As long it's not a pandasvirus...
No, bad analysis caused it. Graphs are at best secondary tools to interpret findings. You don't use graphs to draw conclusions.
But we do use plots to help identify problems in datasets, like all the time. Statistics 101
That's exploratory analysis. You don't use exploratory analysis as-is to draw conclusions. It guides you through your research but you do support it with other tools (statistical, logical). Science 101.
Reading the comments here is hilarious.
Like others, I was expecting a wildly different article...
What were you expecting? I read that as "a bug in the Seaborn graphing library caused wrong conclusions" and don't understand what other interpretations there are.
I'd never heard of the Seaborn library. And since Seaborn is the first word of the title, I assumed it was capitalized for that reason only.
So I thought the article would be about some ocean-faring insect or microbe that somehow affected scientists' mental acuity.
This is what I was expecting. And the title of this HN article is not the same as the title of the linked Arxiv report.
Ailment transmitted at sea has somehow made science less impactful.
Ahhh, "sea-borne bug", that makes sense, thanks!
Alternatively, seaborn as in where this insect was born, but that's an even weirder statement.
I have never heard of the Seaborn graphing library; I was curious as to how a marine virus or bacterium could cause a "finding of declining disruptiveness". maybe a similar mechanism to Toxoplasma gondii?
Of course, it has nothing to do with rampant fraud, unreproducible results, incentive structures which reward the number of papers over the quality of papers, having researchers spend their prime scientific years writing grant proposals instead of actual research...
...nor does it have anything to do with tech companies hoarding cash by the trillions of dollars overseas instead of spending it on R&D, and even the R&D they do produce internally they have no incentive to publish or productize, because virtually no new business will be more profitable than the monopoly business they already have...
Seaborn??? Typo surely
Edit: Not mentioned in the abstract but it is in the main paper. Editorialised title.
This threw me also. I was expecting a really different article.
It's referring to the seaborn library (https://seaborn.pydata.org/), a Python library for data visualization (built on top of matplotlib).
The first word of an editorialised title, frowned upon here, and not mentioned in the abstract (which I did read)
It's a plotting library:
> A bug in the seaborn 0.11.2 plotting software [3], used by Park et al. [1], silently drops the largest data points in the histograms.