What Is Synthetic Data? The Good, the Bad, and the Ugly

66 points by sjm217 3 years ago · 12 comments

If you need to anonymize a dataset (structured, possibly linked tables), I recommend clickhouse-obfuscator - a tool designed specifically for this purpose: https://clickhouse.com/blog/five-methods-of-database-obfusca...

Quick Start:

    curl https://clickhouse.com/ | sh
    ./clickhouse obfuscator --help

Source code:

https://github.com/ClickHouse/ClickHouse/tree/master/program...

It does not use differential privacy.

crabbone 3 years ago

Anyone who believes they can anonymize data automatically will be very disappointed...
There are so many ways in which data can point to individuals, you'd need to process every datapoint with a lot of care and investigation.
For example, rare medical conditions can be a good identification tool if the adversary knows the relation between such a condition and a person. How would an automatic tool know if a medical condition is rare enough? How will it know if such information is already available elsewhere?
Information may be transferred as images, or as audio. What if the database simply stores these as blobs and only the application knows what format is used inside the blob?
Or, even if the format is known, with a format such as DICOM it's hard to tell whether a given piece of information is significant. You can often recognize the MRI machine from features of the images it takes, e.g. artifacts that show up in every image. DICOM files usually carry information such as the date the image was taken, besides the patient's name. But by connecting the date and the machine, one may be able to infer which patient was pictured, if they also know that the patient paid for a cab ride around that time. Or, even simpler: sometimes there is text in the DICOM image itself that identifies the patient in some way.
Or, in a situation like my office: there's one woman and 30 men working there. Surprisingly, gender becomes a very precise tool for identifying people.
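The office example can at least be detected mechanically once you already suspect which columns matter. A minimal sketch, assuming a k-anonymity style check: the function name `rare_combinations`, the column names, and the threshold `k` are all made up for illustration and do not come from any tool mentioned in this thread. It does nothing about the harder problems above (blobs, image content, knowledge the adversary holds elsewhere).

```python
from collections import Counter

def rare_combinations(records, quasi_identifiers, k):
    """Return attribute combinations shared by fewer than k records.

    `quasi_identifiers` is a hypothetical, hand-picked list of columns;
    choosing it correctly is the part no tool can do for you.
    """
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}

# Hypothetical office: 30 men and 1 woman, names already removed.
office = [{"gender": "M", "dept": "eng"} for _ in range(30)]
office.append({"gender": "F", "dept": "eng"})

risky = rare_combinations(office, ["gender", "dept"], k=5)
print(risky)  # {('F', 'eng'): 1}
```

Any combination that comes back here points at a handful of people (or one) even though no name is stored.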
- shalmanese 3 years ago
  
  I don't think antagonistic data obfuscation is the primary problem to be solved since, as you noted, it's extremely hard and not valid in most circumstances. Antagonism should be filtered out at the client selection stage: most clients have no incentive to pierce the veil, and it's relatively easy to vet a client and make sure they have nothing to gain from deanonymization.
  What obfuscation mainly does is remove the PII that neither side wants to handle before the data gets transferred over so the data is "safe" and the receiver of the data no longer has the burden of stewardship over PII.
  Contrary to a lot of handwringing on the internet, almost everyone that handles your data couldn't care less about you as a person. Their overwhelming interest in you is as a bag of attributes that they can statistically correlate with other bags of attributes. It's a relief for them if they can scrub all the PII from their databases while retaining all of the other bag of attribute qualities that they care about. Of course, the few entities that do care about deanonymization are the ones that make this entire process so difficult.
  - crabbone 3 years ago
    
    > couldn't care less about you as a person
    Precisely. They care about my credit card number and enough identifying details to impersonate me to the credit card company...
- edmundsauto 3 years ago
  
  I don’t think of “anonymization” as a single thing. The requirements depend on the use case and sensitivity of the data. 100% full irreversibility is indeed a difficult task, but even partial anonymization for less sensitive types of data has value.
  It’s kind of like the word “secure”. The threat model matters - what is being protected and from whom?
  - crabbone 3 years ago
    
    Many years ago I had a conversation with an older colleague of mine where I was overly optimistic about some inter-database tool. The other person was very skeptical of the tool (rightly, as it turned out shortly afterwards), but this was less important. The more important thing was what my colleague said about whoever creates a tool that is able to automatically connect different databases, in the sense that "John Smith" in one database will be unambiguously linked to "Smith J." in another, which would allow, for example, different government agencies to not burden us, the taxpayers, with the endless rigmarole of submitting the same information over and over...
    So, he claimed that whoever builds such a thing will be instantly the richest person in the world, eclipsing Bill Gates and Jeff Bezos combined.
    Well, having worked with many different databases, I can see how that's a mission impossible... So, what does this have to do with anonymization? -- Well, most databases in the world are either built by application developers or are later extended due to the demands of application developers, in such a way that, in all but the most trivial cases, the meaning of the data stored in the database is impossible to determine without the application which works with the database. Not to mention that the data in databases is, in the majority of cases, generated by humans, and even though both application developers and data administrators try to prevent invalid inputs, they too make mistakes.
    To continue the example of DICOM files: those are typically generated by a combo of a technician operating the machine, a radiologist who reads the image, a doctor who ordered the imaging and a medical secretary who collected the patient's data upon arrival. All of these people are very busy and have very little time to spend on patients. This often leads to a mismatch between the field type and the data stored in the field. E.g. the patient's address gets stored in the name field, the name is stored in the allergies field and so on. Some data are essential for the file to move around the system, but a lot of the properties won't prevent the file from reaching its target, even if they contain completely nonsensical data.
    ----
    My wife participated in some Kaggle challenges that had to do with chest CT. In order to do that, she went through some of the publicly available sets of images that belong to this general category. Each contained defective images, up to and including CTs of other body parts, X-rays and so on. (Needless to mention that stuff like the proper radiological modality was wiped from the set, so there was no contrast information attached to the images etc.) And that was only what she could find with some simple scripts which relied on heuristics.
    What I'm trying to say is that dealing automatically with large quantities of data acquired in real-world situations will almost certainly not live up to expectations. It will require a human in the loop until we have AI comparable to human intelligence.
riedel 3 years ago

Another nice tool for anonymization that can take demographics into account: https://amnesia.openaire.eu/

Syzygies 3 years ago

I've been on various NSF grant panels. One was math / applied math / statistics. Everyone shares their concerns reading proposals, that's how one builds cred. Synthetic data got mentioned.

Late in the decision process, I couldn't resist, I blurted out the joke that had been on my mind for days,

"I can't believe it's not data!"

("Not butter", if you're too young for the margarine commercial reference.)

This did not go over well, and probably cost math a grant.

wannabebarista 3 years ago

For context, here's another view on differentially private synthetic data: https://differentialprivacy.org/synth-data-1/.

worik 3 years ago

They left bootstrapping off their list.

For stationary data, and only for stationary data, it is very powerful.

Look up "Stationary Bootstrap"
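For anyone who hasn't seen it, here is a rough sketch of the idea as I understand it from Politis and Romano: resample the series in blocks whose lengths are geometrically distributed, wrapping around the end of the series, so that short-range dependence survives inside each block. The function name, parameter names, and the pure standard-library implementation are my own choices for illustration, not taken from any particular package.

```python
import random

def stationary_bootstrap(series, mean_block_len, rng=None):
    """Draw one stationary-bootstrap resample of `series`.

    At each step a new block starts at a random position with
    probability p = 1 / mean_block_len; otherwise the current block
    continues with the next observation (wrapping circularly).
    Block lengths are therefore geometric with mean `mean_block_len`.
    """
    if mean_block_len < 1:
        raise ValueError("mean_block_len must be >= 1")
    rng = rng or random.Random()
    n = len(series)
    if n == 0:
        return []
    p = 1.0 / mean_block_len
    idx = rng.randrange(n)
    resample = []
    for _ in range(n):
        resample.append(series[idx])
        if rng.random() < p:
            idx = rng.randrange(n)   # start a new block
        else:
            idx = (idx + 1) % n      # continue the current block
    return resample

# Usage on a toy series; the seed is only there for reproducibility.
data = list(range(100))
sample = stationary_bootstrap(data, mean_block_len=10, rng=random.Random(0))
```

The resample has the same length as the input and only contains observed values, which is also why it says nothing useful if the underlying series is not stationary.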

hinkley 3 years ago

Synthetic data seems like a potentially useful application of GPT and friends.

Pandabob 3 years ago

The new ChatGPT API is really good at this. I had it create fake documents for a demo, where hallucinations were not an issue. Really surprised at how well it worked.
