UK Covid-19 data underreported due to file size limit exceeded
theargus.co.uk
Note: for all the people reckoning this is an Excel issue as reported on Twitter:
a) The Twitter link is referencing the Daily Mail. https://www.dailymail.co.uk/news/article-8805697/Furious-bla...
b) The Mail does not source its claim.
c) All those Excel references in the news seem to postdate (at time of writing) speculation on Excel in IT-heavy forums like this.
While I can believe the Excel conjecture is correct, I wish people would stop referencing it like it's a proven fact, or provide an authoritative source for the Excel claim.
I hate to be that guy, but...
BBC say they have 'confirmed' it's Excel, though still a bit mysterious on how they've done that: https://www.bbc.co.uk/news/uk-54422505
In particular, they reckon the older XLS format was in use rather than XLSX, with its lower limit of 65,536 rows, although that seems more like a deflection from the fact that functionality limits weren't properly handled, tested or designed for.
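To illustrate what "handling the limit" could look like, here is a minimal sketch that refuses to write a batch that would not fit in a worksheet, rather than silently dropping the overflow. The 65,536 and 1,048,576 row limits are the documented XLS and XLSX worksheet limits; the function and exception names are hypothetical, and nothing here is known about the actual pipeline.

```python
XLS_MAX_ROWS = 65_536       # rows per worksheet in the legacy .xls format
XLSX_MAX_ROWS = 1_048_576   # rows per worksheet in the .xlsx format


class RowLimitExceeded(Exception):
    """Raised when a batch would not fit in the target worksheet."""


def check_fits(row_count: int, header_rows: int = 1, limit: int = XLS_MAX_ROWS) -> int:
    """Return the number of spare rows, or raise if rows would be lost.

    Hypothetical helper: call it before exporting, so an oversized batch
    fails loudly instead of being truncated.
    """
    needed = row_count + header_rows
    if needed > limit:
        raise RowLimitExceeded(
            f"{needed} rows needed but the sheet holds only {limit}; "
            f"{needed - limit} rows would be lost"
        )
    return limit - needed
```

Calling `check_fits(70_000)` raises `RowLimitExceeded`, while `check_fits(70_000, limit=XLSX_MAX_ROWS)` passes. The point is only that the failure is visible at export time.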
That article seems both more informative and more up-to-date, so we'll merge most of the comments from this thread into https://news.ycombinator.com/item?id=24689247. Thanks!
I hope you're right, but most reports do at least mention "exceeding maximum file size". That seems like a fairly basic error: what reasonable (national-scale) data storage formats would have an arbitrary file size limit?
We’ve actually had a similar issue in the US with some Electronic Initial Case Report (eICR) documents. (To be clear, we mostly rely on Electronic Lab Results (ELR) rather than eICR in the US to count COVID numbers, so this hasn’t resulted in an under-count.)
Rather than be a problem with the database having arbitrary file size limits, it actually was a problem with file size limits within network intermediaries. In most cases, the issue is less the limit, and more that the files themselves were too large because of how they were created. I don’t think this is the same issue as the UK (I don’t think they are using the eICR standard there yet), but it’s an example of how you could have a problem with file size limitations without storing data in Excel.
It's probably more the web server configured to only allow x MB maximum for file uploads. Or since the government has been shown to be incompetent, maybe it's the maximum size for email attachments.
Originally I presumed this was an error importing Excel files and an upload file size limit.
A database running from a FAT disk partition for example.
A normalised correctly structured SQL DB?
I'm not a backend developer, but isn't that likely to be in the terabytes? This seems to be issues around tens of thousands of rows.
A normalised SQL DB will have zero redundancy. It would also easily handle these data set sizes. The only reason Excel was chosen was because clearly some clerk just felt more familiar with the tool and didn't have the experience to know better.
The reality is though for storing this kind of relational data -- it's a solved problem -- SQL would have been the correct tool to employ.
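For a sense of scale, here is a rough sketch using Python's built-in `sqlite3` module. The schema, table names and sample values are made up for illustration and are not the real system's; the only point is that 100,000 rows, already past the XLS row limit, is a trivial amount for even an embedded SQL database.

```python
import sqlite3


def build_db(n_results: int) -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE lab (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE test_result (
            id            INTEGER PRIMARY KEY,
            lab_id        INTEGER NOT NULL REFERENCES lab(id),
            specimen_date TEXT NOT NULL,
            positive      INTEGER NOT NULL CHECK (positive IN (0, 1))
        );
        """
    )
    conn.execute("INSERT INTO lab (id, name) VALUES (1, 'Example Lab')")
    # Synthetic data: alternate negative/positive results for one lab.
    rows = ((1, "2020-10-01", i % 2) for i in range(n_results))
    conn.executemany(
        "INSERT INTO test_result (lab_id, specimen_date, positive) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


conn = build_db(100_000)
total, positives = conn.execute(
    "SELECT COUNT(*), SUM(positive) FROM test_result"
).fetchone()
print(total, positives)
```

The lab name lives in one place and each result refers to it by id, which is the "no redundancy" part of normalisation.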
Seems like other (slightly) more reputable sources are now reporting that it's Excel: https://www.standard.co.uk/news/uk/covid-testing-technical-i... A Telegraph article mentions a Press Association report.
Apparently each case was recorded in a column, and the number of columns had reached the maximum (16,384).
The document referenced in that tweet doesn't seem to mention columns?
thank you for posting this, I could not understand why everybody kept referencing Excel while TFA didn't.
I thought people had somehow done some math on #cases/sheets to arrive at an obvious-but-not-to-me fact :)
“The issue occurred because some files containing positive test results exceeded the maximum file size that takes these data files and loads them into central systems, officials said.”
Latest news reports suggest that this is Excel spreadsheets' size limit.
Where did you see this?
https://twitter.com/MaxCRoser/status/1313046638915706880
> The reason was apparently that the database is managed in Excel and the number of columns had reached the maximum.
> The problem was caused by an Excel spreadsheet reaching its maximum file size, which stopped new names being added in an automated process.
> The files have now been split into smaller multiple files to prevent the issue happening again.
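The "split into smaller files" workaround is easy enough to sketch. This is a guess at the general idea, not the actual fix: the function name and the one-header-row assumption are mine, and the default limit is the XLS worksheet limit of 65,536 rows.

```python
from typing import Iterator, List, Sequence


def split_rows(
    rows: Sequence[list], max_rows: int = 65_536, header_rows: int = 1
) -> Iterator[List[list]]:
    """Yield consecutive chunks of rows, each small enough for one sheet.

    Hypothetical helper: every chunk leaves room for `header_rows`
    header lines within the `max_rows` limit.
    """
    per_file = max_rows - header_rows
    if per_file <= 0:
        raise ValueError("max_rows must be larger than header_rows")
    for start in range(0, len(rows), per_file):
        yield list(rows[start:start + per_file])


# 150,000 synthetic rows need three files under the XLS limit.
chunks = list(split_rows([[i] for i in range(150_000)]))
print([len(chunk) for chunk in chunks])
```

It only moves the limit around, of course: each file still has to be loaded and reconciled downstream.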
Wow
Are they using msdos or something?
I don't think they are even using grey matter.