Artificial data give the same results as real data without compromising privacy
news.mit.edu
I'm highly dubious of the ability of synthetic data to accurately model datasets without introducing unexpected bias, especially when it comes to causality.
If you dig through the original paper, the conclusion is in line with that:
“For 7 out of 15 comparisons, we found no significant difference between the accuracy of features developed on the control dataset vs. those developed on some version of the synthesized data; that is, the result of the test was False.”
So, on the tests they developed, the proposed method doesn't work 8 times out of 15…
Agreed, seems suspect. If they are really able to learn the population-level distribution then why even bother generating fake data. Just release that instead.
Well, just knowing a few distributions wouldn't be great for building machine learning models.
I'd like to read the paper before drawing such a conclusion. (the link to it seems to be broken)
"for 7 out of 15 comparisons, we found no significant difference" could mean all sorts of things. It could mean that 7 comparisons were perfect and 8 were complete garbage, as you suggest. Or it could mean that 7 comparisons were perfect and 8 had differences that were statistically significant, but the magnitudes of the differences were small enough that the results would still have been perfectly adequate for practical application.
In concrete terms: Let's say the synthetic data lets me build a binary classifier that helps with a business issue, and has an F1 score of about 0.8. But if I had access to the real data, I could have got an F1 of around 0.85. In that case, I'd happily take the data. As someone who's trying to solve business problems, it would be downright irresponsible of me to reject something that's better than what I currently have on the grounds that it's still less than some unattainable ideal.
You are ignoring the restrictions and regulations that exist around sharing data in lots of financial, government and medical industries. Sometimes, the cost of giving up 5 percent of accuracy is much less than the cost of the inspections, delays and blockages that would otherwise occur if they wanted to use real data.
I misspoke; I should have said, "I'd happily take the synthetic data."
But yeah, you're right; I was being oversimplistic in just thinking of it in terms of "can have/can't have" and not considering the, "can have, but at too high a cost" angle.
I couldn't read the paper (seemed to be missing), but has anyone else noticed that MIT seems to have big problems with open science?
I mean I have formed an association specifically with the MIT brand now, so this type of work coming out of there doesn't surprise me. I couldn't tell you exactly what has led to this association though.
Just below what you reproduced, they write:
When we examined the confidence intervals for the remaining 8 tests, we found that for half, the mean of accuracies for features written over synthesized data was higher than for those written on the control dataset.
In other words, for 4 out of the remaining 8 cases, the models on synthetic data performed better.
Yes, I did leave that out, as I think it's still an issue. A synthetic model performing better is a little dubious, since the modeled distribution has less information than the original one. Overall, the discrepancy seems more important to notice than the actual performance.
Haven't read the paper, but I will.
But I want to comment that it's worked for us. Sequence-to-sequence learning can reproduce every kind of iid and non-iid data we've ever looked at.
The real question is how safe/anonymous is it really?
I imagine it depends on how closely you model the conditional probabilities.
If it gets down to correctly modeling the probability of colon cancer diagnosis by age, sex and ZIP code, and also the correct distribution of ages by ZIP code, then that'll be a potential problem in counties that only have one male 87-year-old.
I'm talking specifically about modeling iid/non-iid sequences of data from events, experiments, etc. Haven't read the paper so, I'm not sure if I'm talking past the authors or OP.
I haven't read the original paper (yet), but something doesn't sit right with the work, if the way it is portrayed is indeed faithful to it and I'm not missing something important.
- It looks like the work of the data scientists will be limited to the extent of the modeling already done by recursive conditional parameter aggregation. (edit: So why not just ship that model and adapt it instead of using it to generate data?)
- Its "validation" appears to be doubly proxied - i.e. the normal performance measures we use are themselves a proxy, and now we're comparing those against these performance measures derived from models built out of the data generated by these models. I'm not inclined to trust a validation that is so removed.
Can anyone explain this well?
Just finished the paper, so let me take a stab:
Peeling back the mystery a bit, what is happening is:
1. Working upwards from each child table, model each column with a simple distribution (e.g. Gaussian) and compute a covariance matrix across the columns.
2. Given those child table distribution parameters, pass them back as row values to their respective parent tables.
What you end up with is a "flattened" version of each parent table that has the information (in an "information theoretic" sense) of all child relations. Sampling from distributions is straightforward. The stats methods are outlined in section 3 of the paper.
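To make that recursion a bit more concrete, here is a minimal, hypothetical sketch of the "pass child parameters up as parent columns" idea. The dict-based tables, key names, and Gaussian-only column fit are my simplifications for brevity, not the paper's exact method:

```python
import numpy as np

def fit_column_params(values):
    # Placeholder column model: the paper picks a best-fitting CDF per column;
    # a plain Gaussian is used here just to keep the sketch short.
    return {"mean": float(np.mean(values)), "std": float(np.std(values))}

def flatten_child_into_parent(parent_rows, child_rows, fk="parent_id"):
    """For each parent row, model its child rows (per-column distribution
    parameters + covariance) and append those parameters as extra columns
    on the parent. Dicts stand in for database rows."""
    flattened = []
    for parent in parent_rows:
        children = [r for r in child_rows if r[fk] == parent["id"]]
        derived = {}
        if children:
            cols = [k for k in children[0] if k not in ("id", fk)]
            matrix = np.array([[row[c] for c in cols] for row in children], dtype=float)
            for i, c in enumerate(cols):
                for k, v in fit_column_params(matrix[:, i]).items():
                    derived[f"{c}_{k}"] = v
            # Covariance of the child columns becomes parent columns too.
            cov = np.atleast_2d(np.cov(matrix, rowvar=False))
            for i in range(len(cols)):
                for j in range(i, len(cols)):
                    derived[f"cov_{cols[i]}_{cols[j]}"] = float(cov[i, j])
        flattened.append({**parent, **derived})
    return flattened

# Toy usage: one parent row with two numeric child columns.
parents = [{"id": 1, "region": 3}]
children = [{"id": 10, "parent_id": 1, "amount": 5.0, "age": 30},
            {"id": 11, "parent_id": 1, "amount": 7.0, "age": 41}]
print(flatten_child_into_parent(parents, children))
```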
Things of note:
- The paper makes heavy use of Copula transformations to normalize data whenever it passes around the distribution parameters.
- It deals with missing values by adding something like a dummy column.
- The key insight is that columns must be represented by parameterized distributions, but they don't have to be Gaussian. The Kolmogorov-Smirnov test is used to choose the "best fit" CDF to model each column.
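To make that last note concrete, here is a hedged sketch of KS-based "best fit" selection. The candidate list is my own assumption; the paper's actual candidate set may differ:

```python
import numpy as np
from scipy import stats

CANDIDATES = ["norm", "expon", "uniform", "lognorm", "beta"]  # assumed candidate set

def best_fit_cdf(values, candidates=CANDIDATES):
    """Fit each candidate distribution and keep the one with the smallest
    Kolmogorov-Smirnov statistic against the observed sample."""
    best_name, best_params, best_stat = None, None, np.inf
    for name in candidates:
        dist = getattr(stats, name)
        try:
            params = dist.fit(values)
        except Exception:
            continue  # some fits fail on unsuitable data (e.g. bounded families)
        stat, _ = stats.kstest(values, name, args=params)
        if stat < best_stat:
            best_name, best_params, best_stat = name, params, stat
    return best_name, best_params

# Example: heavy-tailed data should prefer lognorm over norm here.
sample = stats.lognorm.rvs(s=0.9, size=2000, random_state=0)
print(best_fit_cdf(sample))
```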
To your question about the role of the data scientists: they are using the resulting simulations to solve more complex tasks. The goal of the experiment was to see how well the sampled data would perform against Kaggle competitions. So I guess the idea was that if winners were indistinguishable, the simple/hierarchical distributions would be considered robust enough for complex tasks. In the end, I'm sure shipping the underlying model is preferable for consumers.
(Going through the paper .. a few questions/notes)
Table modeling: While column distributions are picked using the KS-test, the covariance matrix calculation first normalizes the column distributions. Assuming that is reasonable, there is a claim of "this model contains all the information about the original table in a compact way..", but it doesn't account for possible multi-dimensional relationships in the data. It only looks at a series of projections to 2D. Can a d-dimensional dataset (in practice) be effectively summarized by the set of projections onto the d(d-1)/2 two-dimensional subspaces? That's one kind of summary, but I'm unsure whether it is adequate for practical modeling work, especially if folks try to apply high-dimensional techniques (DL?) to this. (edit: I feel reasonably sure it isn't adequate. If a column ends up being bi-modal, for example, even that gets lost in translation in this approach? A tiny demo of that point follows these notes.)
Crowdsourced validations: The synthetic sets were generated for already available public datasets. It isn't clear from the paper how any bias resulting from prior familiarity with the public datasets would be accounted for in the study concluding equivalence.
Privacy claims: This is a bit unclear. The "apply random noise" technique seems to suggest something similar to differential privacy, but makes no mention of it. If not DP, what definition of "privacy" is being used here? (I'm ok that proving their algorithm to be privacy safe according to a chosen definition of privacy may be out of scope of the paper.)
(Edit2: I can't help the feeling I have that this paper is an elaborate April fool's joke released early ;)
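Here's the tiny demo of the bimodality point mentioned above. It uses only a Gaussian fit for brevity; a KS-selected CDF might do somewhat better, but the usual candidates are still unimodal families:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# A clearly bimodal column: two well-separated modes.
real = np.concatenate([rng.normal(-3, 0.5, 5000), rng.normal(3, 0.5, 5000)])

# Best single-distribution fit (Gaussian shown for brevity) and a sample from it.
mu, sigma = stats.norm.fit(real)
synthetic = rng.normal(mu, sigma, real.size)

# Mean and std match, but the synthetic column is unimodal: mass piles up near 0,
# where the real data has almost none.
print("real fraction near 0:     ", np.mean(np.abs(real) < 1))       # ~0.0
print("synthetic fraction near 0:", np.mean(np.abs(synthetic) < 1))  # ~0.25
```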
To the first point, the paper mentions that "the covariance is calculated after applying the Gaussian Copula to that table". The experiments seem to conclude that, for their datasets, the 2D projections seem to work alright. I think that the surprising conclusion is that this works so well for any dataset at all.
Just thinking out loud here:
The typical case where a low dimensional representation would fail you is if you had dependencies (e.g. bimodal relations) that weren't represented by a datatype or foreign key. Recall that the simulation of data still occurs within each table, so the higher the non-represented inter-table dimensionality is, the less the supplied distributions can capture it. It might be that, for the most part, the raw columns (not from child tables) have much more bearing on the merit of the table covariance. This seems natural, due to the semantic nature of RDBMS structures.
It's probably an important caveat that typical RDBMS structures are created to optimize the user's understanding of the data through semantic structure. Since the claim of the paper was only that they could provide a useful abstraction for simulation, I think it's OK to proceed with the assumption that Gaussians can never be fully sufficient in modeling highly dimensional data without help.
There are existing non-parametric models that attempt to do a similar thing for relational data [1] that I think are more promising. One drawback of current solutions like BayesDB is that you're still dealing with the original table structure, which this paper tries to get around. It would be nice to bridge the gap for something like PyMC3, where we find a cute way to flatten the data as this paper does.
[1] Probabilistic Search for Structured Data via Probabilistic Programming and Nonparametric Bayes. https://arxiv.org/pdf/1704.01087.pdf
I think they just invented the political representative in modelling
Correct me if I am wrong.
As you note, the Kolmogorov-Smirnov test is used to choose the "best fit" CDFs. The set of CDFs is then used to generate a random vector, which after a covariance adjustment becomes a synthetic datapoint.
The step that can ruin the synthetic data is exactly this "best fit" CDF selection, as the original distribution does not necessarily fit any of the well-known distributions well.
At the same time, the "best fit" CDFs are responsible for anonymizing the results. So if you overfit and stick to the original data too closely, you lose anonymity and capture the original data's bias. But if you approximate with a distribution, you introduce a distribution bias.
So the solution provides a tradeoff between anonymity and "best fit" corruption of the data.
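For anyone who wants to see the "random vector + covariance adjustment" step spelled out, here's a rough sketch of Gaussian-copula style sampling. The function name and the toy marginals are mine, not the paper's:

```python
import numpy as np
from scipy import stats

def sample_synthetic_rows(marginals, copula_cov, n_rows, seed=0):
    """Gaussian-copula style sampling: draw correlated normals, map to
    uniforms, then push each uniform through the column's fitted inverse CDF.

    marginals  : list of frozen scipy.stats distributions, one per column
    copula_cov : correlation matrix estimated in the normalized (copula) space
    """
    rng = np.random.default_rng(seed)
    d = len(marginals)
    z = rng.multivariate_normal(np.zeros(d), copula_cov, size=n_rows)  # correlated normals
    u = stats.norm.cdf(z)                                  # map each column to (0, 1)
    cols = [marginals[j].ppf(u[:, j]) for j in range(d)]   # inverse fitted CDFs
    return np.column_stack(cols)

# Toy usage with two made-up marginals and a mild positive correlation.
marginals = [stats.norm(loc=40, scale=10), stats.expon(scale=2.0)]
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
print(sample_synthetic_rows(marginals, cov, n_rows=5))
```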
On a parallel note, search for "thresholdout". It's another (genius, I think) way to "stretch" how far your data goes in training a model. I won't do a better job trying to explain it than those who already have, so I won't try—here's a nice link explaining it instead: http://andyljones.tumblr.com/post/127547085623/holdout-reuse
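For a flavour of the mechanism, though, here is a stripped-down sketch of the Thresholdout idea (it omits the budget accounting and noisy-threshold refresh of the full algorithm described in the link; the threshold and noise scale below are illustrative):

```python
import numpy as np

def thresholdout(train_vals, holdout_vals, threshold=0.04, sigma=0.01, rng=None):
    """One Thresholdout query (after Dwork et al., 'The reusable holdout').
    train_vals / holdout_vals are a query evaluated on each point of the
    training and holdout sets; returns a noisy estimate of the true mean."""
    rng = rng or np.random.default_rng()
    t_mean, h_mean = np.mean(train_vals), np.mean(holdout_vals)
    # Only consult the holdout when train and holdout disagree by more than
    # the (noisy) threshold; otherwise answer from the training set alone.
    if abs(t_mean - h_mean) > threshold + rng.laplace(0, sigma):
        return h_mean + rng.laplace(0, sigma)
    return t_mean
```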
I got really excited about thresholdout a couple weeks ago, but I've since cooled; setting the threshold seems like too much black magic.
I thought the Zillow blogpost [1] was a nice intro (and I'm a sucker for Seinfeld references), and it demonstrates the sensitivity-to-threshold value in a way the original academic authors never did.
[1]: https://www.zillow.com/data-science/double-dip-holdout-set/
They use real data to create artificial data. So, real data is still more useful.
The idea is to sidestep the need to access private information in order for researchers to do their work. So in this case, the artificial data is more useful, since the real data is inaccessible.
But the artificial data must come from somewhere? It can be modeled from real data in order to take into account outliers and to avoid cognitive biases in generation, but then there's still an initial reliance on the real data.
Yeah, the real data's properties are what are under study, so the artificial data needs to mimic it.
Hi, I'm one of the authors of this work. We're very proud that this has attracted so much attention on Hacker News. I'm happy to answer a few questions.
We had two requirements for the synthetic data. From the paper: “This synthetic data must meet two requirements:
1. it must somewhat resemble the original data statistically, to ensure realism and keep problems engaging for data scientists.
2. it must also formally and structurally resemble the original data, so that any software written on top of it can be reused.”
Our goal was as follows:
* Provide synthetic data to users - data scientists similar to the ones that engage on KAGGLE.
* Have them do feature engineering and provide us the software that created those features. Feature engineering is a process of ideation and requires human intuition. So being able to have many people work on it simultaneously was important to us. But it is impossible to give real data to everyone.
* They submit this software and we execute it on the real data, train a model and produce predictions for test data.
* In essence, their work is being evaluated on the real data - by the data holder - us.
The tests we performed:
* We gave 3 groups different versions of the synthetic data (in some cases with noise added to it).
* We gave a 4th group the real data.
* We did not tell the users that they were not working on real data.
* All groups wrote feature engineering software looking at the data they got.
* We took their software, executed it on the real data, and evaluated their accuracy in terms of the predictive goal.
* We did this for 5 datasets
* Our goal was to see whether the team that had access to the real data came up with better features. With 5 datasets and 3 comparisons per dataset, we had 15 tests.
Results:
* In 7 of those we found no significant difference.
* In 4 we found the features written by users looking at the synthetic dataset were, in fact, better performing than the features generated by users looking at the real dataset.
What can we conclude:
* Our goal was to enable crowdsourcing of feature engineering by giving the crowd synthetic data, gather the software they write on top of the synthetic data (not their conclusions) and assemble a machine learning model.
* We found that this is feasible.
* While the synthetic data captures as many correlations as possible, in general the requirement is only that it be good enough that a user working on it does not get confused: they can roughly understand the relationships in the data, intuit features, write software, and debug. They may conclude, based on the dataset they are looking at, that one feature is better for predictions than another when it is not, and that is ok. Since we are able to get many contributions simultaneously, the features one user misses can be generated by others.
* We think this methodology will work only for crowdsourcing feature engineering - a key bottleneck in the development of predictive models.
It would be great to have a link to the paper. Is it on arXiv or anywhere else we can download it from?
I was speculating wildly here (https://news.ycombinator.com/item?id=16621633). Is any of that remotely close?
How can you prove that meaningful "actual" data can't be reconstructed from synthesized data?
If I was responsible for protecting privacy of data, I don't know that I would be comfortable with this method. Anonymization of data is hard, and frequently turns out to be not as anonymous as originally thought. At a high level, this sounds like they are training a ML system on your data, and then using it to generate similar data. What sort of guarantees can be given that the ML system won't simulate your data with too high of fidelity? I've seen too many image generators that output images very close to the data they were trained on. You could compare the two datasets and look for similarities, but you'd have to have good metrics of what sort of similarity was bad and what sort was good, and I could see that being tricky, in both directions.
Although, I suppose that if the data was already anonymized to the best of your ability, and then this was run on top of that as an additional layer of protection, that might be okay.
I wonder how secure it is against identifying individuals. With overfitting, you can end up reproducing the training data as output. Hopefully they have a robust way to prevent that, or any kind of reverse engineering of the output to somehow work out the original data.
Could not get hold of the paper. Are they doing Gibbs sampling or a semiparametric variant of that?
https://en.wikipedia.org/wiki/Gibbs_sampling
Generating tuples (rows) by Gibbs sampling will allow generation of samples from the joint distribution. This in turn would preserve all correlations, conditional probabilities, etc. This can be done by starting at an original tuple at random and then repeatedly mutating the tuple by overwriting one of its fields (columns). To overwrite, one selects another random tuple that 'matches' the current one at all positions other than the column selected for overwriting. The match might need to be relaxed from an exact match to a 'close' match.
If the conditional distribution for some conditioning event has very low entropy or the conditional entropy is low, one would need to fuzz the original to preserve privacy, but this will come at the expense of distorting the correlations and conditionals.
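In case it's useful, a rough sketch of the tuple-mutation scheme I'm describing (exact-match donors by default; the `matches` hook is where the relaxation to a 'close' match would go):

```python
import random

def gibbs_like_sample(rows, n_steps=1000, matches=lambda a, b: a == b, rng=None):
    """Start from a real row, then repeatedly resample one column conditioned
    on the others by copying that column from another row that matches on all
    remaining columns."""
    rng = rng or random.Random(0)
    current = list(rng.choice(rows))
    n_cols = len(current)
    for _ in range(n_steps):
        col = rng.randrange(n_cols)
        # Candidate donors: rows agreeing with `current` everywhere except `col`.
        donors = [r for r in rows
                  if all(matches(r[j], current[j]) for j in range(n_cols) if j != col)]
        if donors:
            current[col] = rng.choice(donors)[col]
    return tuple(current)

# Toy usage on a tiny table of (age_band, zip3, diagnosis) tuples.
table = [(70, 921, "A"), (70, 921, "B"), (80, 921, "A"), (80, 940, "B")]
print(gibbs_like_sample(table, n_steps=50))
```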
I could download it from here: https://dai.lids.mit.edu/wp-content/uploads/2018/03/SDV.pdf
Are you facing any trouble while accessing this link?
Ah! Thanks, it works.
Seems like it would only be helpful for testing methods that can't capture any correlations the original method didn't.
Is this akin at all to random sampling with replacement ie bootstrapping?
No, because that would take full rows of the feature matrix (thereby corresponding to the full information of one individual). The idea here is to “generate” rows corresponding to plausible artificial individuals. That way you can give a third party artificial data to build an ML model without compromising (too much) the privacy of the real individual in the initial data.
It is easy to mistake it for bootstrapping, but it is not. It is a form of multi-dimensional random variable generation, where the generated dimensions preserve the same correlations/relationships as those in the original dataset.
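A toy contrast of the two, purely illustrative and not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.multivariate_normal([50, 100], [[25, 15], [15, 36]], size=1000)

# Bootstrapping: resample whole real rows, so every output row is someone's actual record.
boot = real[rng.integers(0, len(real), size=1000)]

# Generation: fit a joint model and sample brand-new rows that keep the
# correlations but do not correspond to any real individual.
synthetic = rng.multivariate_normal(real.mean(axis=0), np.cov(real, rowvar=False), size=1000)

print("real corr:     ", np.corrcoef(real, rowvar=False)[0, 1])
print("synthetic corr:", np.corrcoef(synthetic, rowvar=False)[0, 1])
```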
How is this related to and different from differential privacy?
Differential privacy is a formal guarantee about an algorithm. Roughly, given an algorithm A that takes an input database X, we say A is differentially private if, for any X' differing in at most one row from X, the output distributions of A(X) and A(X') are similar. So to say an algorithm is differentially private you need to prove a claim like this.
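Concretely, the claim you'd have to prove for (ε)-differential privacy is:

```latex
% epsilon-differential privacy: for all neighbouring databases X, X'
% (differing in at most one row) and every measurable set of outputs S,
\Pr[A(X) \in S] \;\le\; e^{\varepsilon}\,\Pr[A(X') \in S]
```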
It's hard to compare to this paper, because this paper's privacy claims appear to be heuristic, not formal. This isn't necessarily bad, since existing approaches for constructing synthetic data in a differentially private way are still not very practical. But heuristics necessarily lack provable privacy guarantees, so there's no proof that something very bad privacy-wise can't happen with sufficiently clever processing of the synthetic data.
To add to this answer: the methods outlined in the paper allow for perfect reconstruction of the underlying data in many cases, as the simulation of data is simply sampling from fitted distributions.
I am looking into their experiments. Seems most of them are pretty simple predictions/classifications. No wonder they get good results.
The claim is too bold and I would reject this paper. They should clarify that the data is good enough for linear regression, not claim that there is no difference between real and synthetic data.
The abstract claims there was no difference only 70% of the time. So 30% of the time there was a difference. Unsurprisingly it greatly limits the kind of data analysis that was allowed, which greatly reduces the applicability even if you believe it. I'm pretty dubious of this work anyway.
Heh. I wrote a paper about this a while ago https://www.liebertpub.com/doi/full/10.1089/bio.2014.0069
Does someone have a link to the preprint / arxiv? The link in the story is a 404 (I presume that the paper just hasn't been posted yet or something?)
I've found these documents:
- https://dspace.mit.edu/handle/1721.1/109616#files-area
- https://pdfs.semanticscholar.org/64ad/643e8084486ca7d3312ed4...
Sounds very similar to homomorphic encryption, except with no compromise in performance.
I wonder if this is the technique behind Numerai.
The link to the actual paper is now working