The $1,000 Challenge!
To go straight to the actual challenge click here.
Crack the code! The authors announce the HEGP (heegeepee;
hiːdʒiːpiː) Challenge,
with a $1,000 (one thousand dollar) prize for
the individual or group who can crack HEGP encrypted data. HEGP, when
proven solid, will have a large impact on the way human genetics is
pursued today because it will allow for sharing genotype data while
preserving privacy of the individual. Not only is sharing data
required for reproducible science, there is also a large interest in
hosting data on laptops and servers that need not be HIPAA compliant.
On this page we chat about cracking HEGP, the Rubik's cube, DNA, strawberries and Enigma.
Relevance
With a publication in Genetics we present a novel homomorphic encryption method named HEGP that allows for sharing of genotypes and phenotypes in the context of genome-wide association (GWA) studies. Importantly, the data is shared without giving away private information or making individuals identifiable. For more information check out a privacy researcher's perspective and a recent follow-up paper, Using encrypted genotypes and phenotypes for collaborative genomic analyses to maintain data confidentiality, by Tianjing Zhao, Fangyi Wang, Richard Mott, Jack Dekkers, and Hao Cheng (Genetics. 2024 Mar 6;226(3):iyad210. doi: 10.1093/genetics/iyad210. PMID: 38085098), in which they combine Bayesian variable selection methods for genetic parameter estimation, genomic prediction, and GWAS.
With HEGP anyone can freely hand over genotype and phenotype data to anyone else and have them reproduce the results. Even better, the encoding can be stacked. So the second party can encrypt and combine their data independently and give it to a third party.
So any party can use our data and add it to their own analysis. And they can share their data back with us. As such, genotype data can be stored on (public) servers and GWA analysis can be reproduced.
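Why independently encrypted data can still be combined is sketched below in Python. This is a toy illustration under stated assumptions, not the procedure from the paper: the data values are made up, the "analysis" is a regression slope without intercept, and each key is a single Householder reflection (a very simple orthogonal transformation that stands in for the high-dimensional random key). The helper names `unit_vector`, `reflect` and `slope` are hypothetical.

```python
import random


def unit_vector(n, rng):
    """Draw a random direction of length 1; it defines one reflection key."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = sum(a * a for a in v) ** 0.5
    return [a / norm for a in v]


def reflect(u, column):
    """Apply the orthogonal map I - 2*u*u^T (Householder reflection) to a column."""
    d = sum(a * x for a, x in zip(u, column))
    return [x - 2.0 * d * a for x, a in zip(column, u)]


def slope(genotype, phenotype):
    """Least-squares slope of phenotype on genotype, without intercept."""
    gp = sum(g * p for g, p in zip(genotype, phenotype))
    gg = sum(g * g for g in genotype)
    return gp / gg


rng = random.Random(7)

# Party A and party B each hold their own (made-up) individuals.
g_a, p_a = [0, 1, 2, 1, 0], [0.2, 1.1, 2.4, 0.8, -0.1]
g_b, p_b = [2, 2, 0, 1], [2.1, 1.9, 0.3, 1.0]

# Each party encrypts with its own private key.
key_a = unit_vector(len(g_a), rng)
key_b = unit_vector(len(g_b), rng)
enc_g = reflect(key_a, g_a) + reflect(key_b, g_b)  # pooled encrypted rows
enc_p = reflect(key_a, p_a) + reflect(key_b, p_b)

# A further key applied on top of the pooled, already encrypted data.
key_c = unit_vector(len(enc_g), rng)
stacked_g = reflect(key_c, enc_g)
stacked_p = reflect(key_c, enc_p)

plain = slope(g_a + g_b, p_a + p_b)
pooled = slope(enc_g, enc_p)
stacked = slope(stacked_g, stacked_p)
print(plain, pooled, stacked)  # all three agree up to rounding
```

The slope survives because every orthogonal transformation preserves inner products, and applying one orthogonal key after another is again an orthogonal transformation.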
This is a breakthrough in FAIR data sharing and contrasts greatly with the current practice of hiding/protecting genotype data and only providing summary statistics. One example is an important UK Biobank depression study where genetic markers are presented with their statistical significance. We can only assume that this study can be reproduced by a group of researchers having access to the original data. The truth is that these outcomes cannot be reproduced by you or me! HEGP will make data sharing and reproducible analysis a reality.
Cite
Mott R, Fischer C, Prins P, Davies RW. Private Genomes and Public SNPs: Homomorphic Encryption of Genotypes and Phenotypes for Shared Quantitative Genetics. Genetics. 2020 Jun;215(2):359-372. doi: 10.1534/genetics.120.303153. Epub 2020 Apr 23. PMID: 32327562; PMCID: PMC7268998.
Simple explanation
Homomorphic encryption is a mathematical translation of data into an
encrypted form where the result of a computation is the same for the
unencrypted and encrypted forms. With HEGP a matrix of data is
transformed by a high-dimensional random linear orthogonal
transformation key as described in the paper (open access) and
visualised in the animation below. The resulting matrix
scrambles the data while preserving the 'shape of the data' for
analysis.
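The claim that a computation gives the same result on plain and encrypted data can be illustrated with a minimal Python sketch. It rests on assumptions that are not taken from the paper: the data set is made up, the key is built by a hypothetical helper `random_orthogonal` (Gram-Schmidt orthogonalisation of Gaussian vectors), and the "analysis" is a plain regression slope without intercept. The actual key generation and association analysis are described in the paper and may differ.

```python
import random


def random_orthogonal(n, rng):
    """Build an n x n orthogonal matrix by Gram-Schmidt on Gaussian vectors.

    Hypothetical helper for illustration; not the paper's key generator.
    """
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for q in rows:  # remove the components along the rows found so far
            d = sum(a * b for a, b in zip(v, q))
            v = [a - d * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-8:  # skip the (unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def encrypt(key, column):
    """Multiply the key matrix with one data column (one value per individual)."""
    return [sum(k * x for k, x in zip(row, column)) for row in key]


def slope(genotype, phenotype):
    """Least-squares slope of phenotype on genotype, without intercept."""
    gp = sum(g * p for g, p in zip(genotype, phenotype))
    gg = sum(g * g for g in genotype)
    return gp / gg


rng = random.Random(42)
genotype = [0, 1, 2, 1, 0, 2, 1, 1]  # made-up allele counts for one SNP
phenotype = [0.1, 1.2, 2.3, 0.9, -0.2, 1.8, 1.1, 0.7]  # made-up trait values

key = random_orthogonal(len(genotype), rng)  # the same key for both columns
enc_genotype = encrypt(key, genotype)
enc_phenotype = encrypt(key, phenotype)

plain_slope = slope(genotype, phenotype)
enc_slope = slope(enc_genotype, enc_phenotype)
print(plain_slope, enc_slope)  # the two agree up to floating-point rounding
```

The encrypted columns no longer look like allele counts, yet the slope is unchanged because an orthogonal key preserves lengths and inner products between columns. This is the 'shape of the data' in the sense used above.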
One way to think about this is that when a Rubik's cube gets rotated the
fields change colour, but the object still maintains its shape as a
three dimensional cube. With DNA the genotype/phenotype shape is
typically used to predict associations between genotypes and
phenotypes.
An example of a phenotype is a preference for
strawberry taste. An example of an associated genotype is a DNA
encoded olfactory receptor.
Here we display a typical example of genome-wide association (GWA) of phenotype against genotype:
Genes (on chromosomes) involved in some trait are marked. This is the backbone computation for finding genes involved in some trait and pursued in the UK Biobank involving half a million subjects. To find associations GWA is applied to find genes involved in, for example, cancer or COVID-19 mortality; i.e., the first step towards finding causality and potentially better treatments.
In the image above, data is shown before and after encryption. The unencrypted data contains three values, while the encrypted data shows a normal distribution.
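The contrast between three discrete values and a continuous spread can be reproduced on toy data. The Python sketch below assumes that the three values are allele counts coded 0, 1 and 2 (an assumption, the coding is not stated on this page) and uses a hypothetical Gram-Schmidt helper `random_orthogonal` as a stand-in for the key.

```python
import random
import statistics


def random_orthogonal(n, rng):
    """Hypothetical stand-in key: Gram-Schmidt on Gaussian vectors."""
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for q in rows:
            d = sum(a * b for a, b in zip(v, q))
            v = [a - d * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-8:
            rows.append([a / norm for a in v])
    return rows


rng = random.Random(2020)
genotype = [0, 1, 2] * 20  # 60 made-up individuals, assumed 0/1/2 coding
key = random_orthogonal(len(genotype), rng)
encrypted = [sum(k * g for k, g in zip(row, genotype)) for row in key]

plain_values = sorted(set(genotype))
encrypted_values = set(round(e, 6) for e in encrypted)

print("distinct plain values:", plain_values)
print("distinct encrypted values:", len(encrypted_values))
print("encrypted mean and spread:",
      statistics.mean(encrypted), statistics.pstdev(encrypted))
```

Only three values occur before encryption, while after encryption nearly every individual carries a different real number scattered around zero. A histogram of `encrypted` over many more individuals would be expected to look bell-shaped, in line with the image.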
Enigma and why a challenge?
The Enigma machine is an encryption device developed and used in the
20th century to protect commercial, diplomatic and military
communication. It was employed extensively by Nazi Germany during
World War II, in all branches of the German military (source:
Wikipedia).
Enigma
encrypted text by a transformation and was cracked by the Polish
Cipher Bureau in 1932; the crack was used by the Allied forces to
win the war. To ascertain that HEGP is bulletproof, unlike Enigma, we
invite the algorithmically inclined to crack the code and make HEGP
history (one way or the other).
Animation
In the animation above, genotypes and phenotypes are encoded in a matrix. Rows are individuals and columns are genotype markers along chromosomes as well as phenotypes. The unencrypted data gets scrambled by the key, while the analysis still renders the same result (the black lines are values too small to show).
Is this encryption safe?
We tried to find ways of cracking the code, as discussed in our paper. Brute-force guessing of a solution would take more compute seconds than there are molecules in the universe.
What is the challenge?
The challenge consists of decryption of two data sets.
For the first challenge we encrypted a data set consisting of 500 individuals by 12,359 SNPs that exists somewhere in the public domain. We consider this data cracked if you can identify 50 individuals correctly. Note that this data may be derived from human, mouse, rat, nematode or plant data.
The second challenge is harder. We encrypted a mammalian data set that is not in the public domain (yet). We consider the code cracked if you compute the plaintext genotypes of this matrix correctly with a mean error of under 10%.
Who wins a $1,000 check?
The best result with a publicly reproducible solution submitted by December 31st 2020 wins the HEGP Challenge and a cash prize of one thousand dollars. Points are given for improving the method. In case of a tie the prize is shared. If there is no winner we will hold the prize until someone claims it.
How to submit?
Submit your answer through a public git repository with the exact steps taken to get there. The analysis should be reproducible by anyone. You can announce your solution on the website issue tracker.
(image created by xkcd, license CC BY-NC 2.5)