UK Biobank health data keeps ending up on GitHub

biobank.rocher.lc

100 points by Cynddl 16 hours ago · 27 comments · 1 min read

I'm a researcher studying privacy, and I started tracking the DMCA notices that UK Biobank sends to GitHub. I tracked 110 notices filed so far, targeting 197 code repositories by 170 developers across the world.

The exposure of Biobank data on GitHub is part of a long series of governance challenges for UK Biobank. (My colleague and I have an editorial in the BMJ about this: http://bmj.com/cgi/content/full/bmj.s660?ijkey=dEot4dJZGZGXe...) The most recent incident came today, with information on all half a million participants listed for sale on Alibaba.

Looking at the takedown notices, we often see specific files being targeted rather than entire repositories, possibly to justify the copyright infringement claim that a takedown notice requires (I'm not a copyright expert). It is clear, though, that they only use DMCA notices as a last resort, for GitHub users they cannot identify and who were likely not given access in the first place. A quarter of the files are genetic/genomic data. Tabular data accounts for another large share and could contain phenotype or health records.

adwf 6 hours ago

That's the least of it: https://www.bbc.co.uk/news/articles/cpvxgl3n138o

All 500,000 participants for sale on Alibaba...

And official response: https://www.ukbiobank.ac.uk/news/a-message-to-our-participan...

michaelt 9 hours ago

> It has given 20,000 researchers around the world access under strict agreements that prohibit sharing data further.

To me it seems rather naive to have done that.

After all, you can't un-leak medical data. So even if the "strict agreement" included huge punishments, there's no getting the toothpaste back in the tube.

If you want to ensure compliance before a leak happens you have to (ugh) audit their compliance. And that isn't something that scales to 20,000 researchers.

Too late to do anything about it now though :(

SilverElfin 3 hours ago

That’s insane. And what does "researcher" even mean: some random university student? What would they know about securing that data? I wonder if the people whose data is out there even know this is happening.
- 7bees 2 hours ago
  
  The people involved are volunteers. The rules for getting access are readily available, and clearly don't include "some random university student": https://www.ukbiobank.ac.uk/about-us/how-we-work/access-to-u...
  - siva7 an hour ago
    
    They clearly do include "some random student", as the data can be shared with others in the eligible research group, who are almost always university students with zero clue about IT security.

anitil 4 hours ago

I've opted in to Australia's version of the biobank knowing that it's inevitable that it will be leaked some day; I think the data is so valuable in perpetuity that it's worth it. I remember Ben Goldacre has been working on how to make data more accessible in a safer way to (in part) avoid this very thing, but I haven't heard much of it since [0].

[0] https://www.bennett.ox.ac.uk/blog/2025/02/opensafely-in-brie...

nxobject 4 hours ago

I like their idea of an audit log of analysis runs -- beyond transparency, I'm sure it'll help future researchers know how much iteration is needed to work with the messiness of medical records...
I'm also amused (in a good way) by the fact that SAS isn't supported as an analysis platform...
- anitil 2 hours ago
  
  It's certainly an interesting idea, I remember he was on a few podcasts talking about it. I might submit it here to see if it gets some conversation going

mil22 7 hours ago

The irony is, they don’t even provide the data to the participants themselves.

vain 3 hours ago

Huh? I got my report over email. I think you have to ask for it.

captn3m0 8 hours ago

Took me 5 minutes to find more: https://github.com/tanaylab/Mendelson_et_al_2023/blob/9c5a65... (Uses Date of Birth column).

And some information on how they were distributing it to researchers: https://github.com/broadinstitute/ml4h/blob/master/ingest/uk...

> The following steps require the ukbunpack and ukbconv utilities from the UK Biobank website. The file decrypt_all.sh will run through the following steps on one of the on-prem servers.

> Once the data is downloaded, it needs to be "ukbunpacked" which decrypts it, and then converts it to a file format of choice. Both ukbunpack and ukbconv are available from the UK Biobank's website. The decryption has to happen on a linux system if you download the linux tools, e.g. the Broad's on-prem servers. Note that you need plenty of space to decrypt/unpack, and the programs may fail silently if disk space runs out during the middle.

https://biobank.ctsu.ox.ac.uk/crystal/download.cgi
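
The quoted note that the tools "may fail silently if disk space runs out" suggests checking free space before unpacking. A minimal sketch of such a pre-flight check follows. The function name `has_room_to_unpack` and the 3x expansion factor are assumptions made for illustration, not figures from the UK Biobank or ml4h documentation, and the sketch deliberately does not invoke `ukbunpack` or `ukbconv`.

```python
import os
import shutil


def has_room_to_unpack(encrypted_path, work_dir, expansion_factor=3.0):
    """Return True if work_dir has room for the unpacked output.

    expansion_factor is a guess at how much larger the decrypted and
    converted files are than the encrypted download (assumption).
    """
    needed = os.path.getsize(encrypted_path) * expansion_factor
    free = shutil.disk_usage(work_dir).free
    return free >= needed
```

Calling this before each decrypt step and aborting on `False` turns a silent failure into an explicit one; the right factor would have to be measured on real downloads.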

CynddlOP 7 hours ago

Good catch! The data is everywhere, re-uploaded every week.
I am aware of ~30 repositories that UK Biobank has asked GitHub to delete and that can still be found elsewhere online. They know the site, they have managed to delete data from that site before, and yet the files are still there.

John7878781 9 hours ago

What are the pros/cons of just open-sourcing everything for future bio bank projects?

michaelt 7 hours ago

It's exceptionally difficult to avoid the data being de-anonymised.
If an 'anonymised' medical record says the person was born 6th September 1969, received treatment for a broken arm on 1 April 2004, and received a course of treatment in 2009 after catching the clap on holiday in Thailand - that's enough bits of information to uniquely identify me.
And medical researchers are usually very big on 'fully informed consent' so they can't gloss over that reality, hide it in fine print or obfuscate it with flowery language. They usually have to make sure the participants really understand what they're agreeing to.
It might still work out fine, of course - 95% of people's medical histories don't contain anything particularly embarrassing, so you might be able to get plenty of participants anyway.
- yosame 5 hours ago
  
  In my experience with health data, the dates are usually offset by a random but constant amount for each person (e.g. id 12345 will have all their dates shifted by +5 weeks) to avoid identification by dates.
  Unfortunately the sequence of treatments and locations is usually enough to identify someone, especially if it's a rarer condition.
  - cameldrv 3 hours ago
    
    Location data is very readily available, so you can easily correlate visits to a health facility with a treatment, and even with an offset, you can probably uniquely identify someone with 4 visits depending on the size of the medical facility.
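
The per-person date offset described in this sub-thread, and why it leaves the sequence of visits identifying, can be sketched as follows. This is an illustration under stated assumptions, not any biobank's actual scheme: the offset is derived from a keyed hash so it is stable per ID (the comment above only says "random but constant"), and `SECRET`, `person_offset`, `gap_signature`, and the three-person cohort are all hypothetical toy values.

```python
import hashlib
from collections import Counter
from datetime import date, timedelta

SECRET = "hypothetical-secret"  # assumption: a key held only by the data custodian


def person_offset(person_id, secret=SECRET, max_weeks=26):
    """Stable per-person shift of a whole number of weeks in [-max_weeks, +max_weeks]."""
    digest = hashlib.sha256(f"{secret}:{person_id}".encode()).digest()
    weeks = int.from_bytes(digest[:4], "big") % (2 * max_weeks + 1) - max_weeks
    return timedelta(weeks=weeks)


def shift_dates(person_id, dates):
    """Shift every date of one person by that person's constant offset."""
    offset = person_offset(person_id)
    return [d + offset for d in dates]


def gap_signature(dates):
    """Days between consecutive visits; unchanged by a constant shift."""
    return tuple((b - a).days for a, b in zip(dates, dates[1:]))


# Toy cohort (made-up visit dates, not real data).
cohort = {
    "12345": [date(2004, 4, 1), date(2004, 4, 15), date(2009, 8, 3)],
    "23456": [date(2004, 4, 1), date(2004, 4, 15), date(2004, 5, 20)],
    "34567": [date(2010, 1, 5), date(2010, 1, 19), date(2010, 2, 23)],
}

shifted = {pid: shift_dates(pid, ds) for pid, ds in cohort.items()}
signatures = {pid: gap_signature(ds) for pid, ds in shifted.items()}
counts = Counter(signatures.values())

# IDs whose visit-gap pattern is shared with nobody else in the cohort.
unique_ids = [pid for pid, sig in signatures.items() if counts[sig] == 1]
print(unique_ids)
```

In this toy cohort, "23456" and "34567" share the same gap pattern (14 then 35 days) and so hide each other, while "12345" has an unusual pattern and stays unique even though every absolute date has moved.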

culi 7 hours ago

The people who agreed to contribute their biodata did not consent to that.
If you want such a project you need to have a new project with a different agreement. I doubt you could get as many volunteers to freely give away such intimate data to anyone who wants it, though.

CynddlOP 8 hours ago

You mean giving anyone access to the data? Or open-sourcing the code? If the latter, I think that's generally good practice. Security through obscurity is never good for public infrastructure. In this case, UK Biobank has now switched to a remote access platform (not particularly secure, as the data was found for sale on Alibaba today), but contracted it out to DNAnexus and Amazon. Private companies have no incentive to open source data unless mandated to do so.
In the EU, there is a bigger interest in building scalable but also secure platforms for health data. Hopefully good innovation will come from there.

shellac 7 hours ago

'Anonymisation' schemes are a little like encryption, in that they just get monotonically weaker over time as people work out attacks. But the attacks tend to be much worse. I work in academic open data publishing, and the Netflix Prize de-anonymisation (https://arxiv.org/abs/cs/0610105) hangs over our heads.
But what this illustrates to me is that researchers are just really careless, despite everything we make them agree to in data transfer agreements. It seems absurd to have little cubicles like this https://safepodnetwork.ac.uk/ (think Mission Impossible 1) but I do despair.

tptacek 7 hours ago

One of the most important cons is that without controls, fewer people will allow their data to be included in the data sets.
- CynddlOP 7 hours ago
  
  That's a very important point. The people who opt out first are typically not a random fraction of the population, and this makes it much harder to make any analyses with the resulting datasets: it gets very hard to know if your analyses are representative of the population, or not.
  - tptacek 6 hours ago
    
    This is why it was such a big deal when that researcher at Cleveland State misappropriated UKBB data for a race-science study with Emil Kirkegaard. After he was fired, people on Twitter were all like "this is just suppression of science", but the reality is that what they did, contravening UKBB rules, constituted potentially an existential threat to the whole program.
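
The point above about opt-outs not being a random fraction of the population can be made concrete with a little arithmetic. This is a toy sketch with invented numbers, assuming people with a condition are more likely to withdraw than people without it; nothing here comes from UK Biobank data.

```python
# Hypothetical population of 100 people: (has_condition, stays_in_study).
# Assumption: 15 of the 20 people with the condition opt out,
# but only 5 of the 80 people without it do.
population = (
    [(1, False)] * 15
    + [(1, True)] * 5
    + [(0, False)] * 5
    + [(0, True)] * 75
)

# Prevalence in the full population.
true_rate = sum(c for c, _ in population) / len(population)

# Prevalence among those who remain after the opt-outs.
remaining = [c for c, stays in population if stays]
observed_rate = sum(remaining) / len(remaining)

print(f"true prevalence:     {true_rate:.4f}")
print(f"observed prevalence: {observed_rate:.4f}")
```

With these made-up numbers the true prevalence is 20 in 100, while the remaining 80 participants show only 5 cases, so the dataset understates the condition by more than a factor of three, and nothing in the remaining data reveals that.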

renewiltord 8 hours ago

Hard to do. The same people with the collection and tracking infrastructure required are infinitely sue-able so you need legal protection if anything goes wrong.
- ashley95 4 hours ago
  
  Really don't think this is any issue given the post we are commenting on...

nxobject 5 hours ago

From the perspective of someone who's worked with (biostatisticians who touch) Medicaid and Medicare billing data...

It looks like they've identified the institutions, at least... but aren't identifying them to the public for now. Are there going to be consequences? Are they going to be identified and sanctioned beyond "having their access suspended"?

In the US, HHS wouldn't hesitate to name, shame, and impose a sanction with corrective action plans. Not knowing much about how things work across the pond, I'm sure CMS PII gets used more often in research without these leaks left and right.
