Show HN: A tool to seed your dev database with real data
A group of developers and I have created RepliByte, an open-source tool to seed a development database from a production database.
Features:
- Support data backup and restore for PostgreSQL, MySQL and MongoDB
- Replace sensitive data with fake data
- Works on large databases (> 10GB) (read Design)
- Database Subsetting: Scale down a production database to a more reasonable size
- Start a local database with the prod data in a single command
- On-the-fly data (de)compression (Zlib)
- On-the-fly data de/encryption (AES-256)
- Fully stateless (no server, no daemon) and lightweight binary
- Use custom transformers
My motivation: as a developer, creating a fake dataset for running tests is tedious. It also does not reflect real-world data and is painful to keep up to date. If you prefer to run your app tests against production data, RepliByte is for you as well.
Available for macOS, Linux, and Windows.
> https://github.com/qovery/replibyte

Please don't require static AWS credentials: https://github.com/Qovery/replibyte/blob/v0.4.4/replibyte/sr... Or at least either include "AWS_SESSION_TOKEN" in that setup (if it is present) so that "aws sts assume-role" works, or allow `AWS_PROFILE`, or just use the aws-sdk's normal credential discovery mechanism, which at least in the "main" SDKs is a fallback list of providers. But I couldn't follow the docs.rs soup well enough to know whether their Rust SDK is up to speed on this.

Second this. I would NEVER hardcode credentials into a config file. I run all AWS commands through an assumed role (STS) via aws-vault.

You can use env vars in the conf yaml file -> https://github.com/Qovery/replibyte/blob/main/examples/sourc...

That is better, but apologies that I overloaded the word "static" -- I meant the non-assume-role flavor, "aws iam create-access-key", static-style credentials. Static in that, from AWS's perspective, once issued they don't expire, unlike session credentials that have a finite lifespan.

I can now see how my wording was confusing; I'll try to be more precise in the future.

Should work out of the box if they're using the Rust AWS library, right?

That was the "I can't grok the docs.rs in order to tell you" part. In boto and the sane SDKs, there's a bunch of "Credential Providers" and they're in a list, meaning some are tried before others. I have no idea what implementations there are for this: https://docs.rs/aws-sdk-s3/latest/aws_sdk_s3/struct.Credenti... and its official page is even worse: https://docs.aws.amazon.com/sdk-for-rust/latest/dg/credentia... Going all the way down to the GH readme seems to back up the suspicion that, no, they really seem to have forgotten about "AWS_SESSION_TOKEN": https://github.com/awslabs/aws-sdk-rust#getting-started-with...
Not sure if this is any better than the docs (which are raw, at best); I'm a middling beginner Rust programmer, but I did do this test [1] for myself, and it matches most of the other SDKs.

[1] https://github.com/pcn/check-assume-role-s3/blob/master/src/...

That's excellent news, thanks so much for taking the time to put that together!

Why?

Credentials in a config file can be mistakenly checked into a repository. They're also easy to exfiltrate from files: say I write a script with the well-known configuration locations for thousands of applications and just dumbly pull them all from a compromised system. I now have little bits of access to the wider system, from which I can jump from system to system. The best way to store ephemeral secrets is in an environment variable or /dev/shm. This locks the secret behind the scope of the parent process (shell instance) and the user.

I don't get this logic; that's what .gitignore is for. I've been using .env files for years and never mistakenly checked one into a repo.

This happens literally all the time in large organizations. People make mistakes.

And it should be easily caught in code reviews and CI jobs?

Sure, and it will be caught by CI (if you set it up (properly)) and code review (if it happens (and nobody misses it (and no commits are pushed as a cleanup after (...)))). So basically, if everything works perfectly, you won't have that problem. We know processes don't work perfectly, so it's better to avoid the issue in the first place by never leaving literal tokens in the config. Also, even if your CI catches the problem, it's too late: your credentials are now in the repo, in the CI system, and likely in the log collector attached to the CI. You'll need to roll them. Same with reviews, which occur after the code is in git and pushed.

So change the secrets and/or find professional developers? Sorry, not sorry: it's an amateur mistake, not acceptable in a professional environment.
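The precedence argued for above (environment first, literal config value only as a fallback) can be sketched minimally. This is an illustration, not RepliByte's actual configuration handling; the key names are only examples:

```python
import os

# Sketch: resolve a credential by checking the process environment first and
# falling back to a config-file value only if the environment has nothing.
# With AWS_SESSION_TOKEN exported by the shell (e.g. via aws-vault), the
# literal token never needs to appear in a file checked into the repo.
def resolve_credential(name, config):
    return os.environ.get(name) or config.get(name)

config = {"AWS_ACCESS_KEY_ID": "placeholder-from-yaml"}  # illustrative
os.environ["AWS_SESSION_TOKEN"] = "token-from-sts"       # normally set by the shell

print(resolve_credential("AWS_SESSION_TOKEN", config))   # env value wins
print(resolve_credential("AWS_ACCESS_KEY_ID", config))   # falls back to config
```

The point is the lookup order: a secret scoped to the process environment disappears when the process exits, while one in a config file persists on disk and in version control.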
Why would an amateur possess the secrets in the first place?

I prefer a security model slightly more robust than "none of the professionals working on this project will ever make a mistake". You too? Ever try cleaning Git commit history? How about having GitHub clear a repository cache containing a secret?

You're not wrong, but this is a whole class of issue that can be avoided for trivial levels of effort. And it's not just about protecting secrets; it's better architecture. Rely on an ACCESS_ID/SECRET_KEY pair and it's easy to bind too tightly to that authentication mechanism; then providing credentials in production is a pain. Use the standard credential provider chain, and the transition to production settings is trivial.

Even if you want your credentials in SOME file, you don't want them in THAT file, because that configuration file contains enough valuable configuration that it SHOULD be in Git, in which case the credentials would be in the way.

idk what to tell you. I work on a security team; one of the tools the team built finds and identifies secrets already checked into VCS, or catches them at the pre-commit stage. It's certainly not a seldom-used tool.

I believe GP's comment is at the intersection of "the chain is only as strong as its weakest link" and "defenders have to be correct every time, attackers just once".

Defense in depth. It only takes one person on your team accidentally committing one file for one service before your .gitignore safeguard is no longer guarding anything.

Tell that to SolarWinds.

Still a bad idea though.

How do you automate setting that environment variable?

For a specific example of one way: https://github.com/99designs/aws-vault

Injection into the environment (container config, Docker, k8s, etc.) or the execution context. The actual mechanics are highly dependent on where your secrets are and where they're going to be used. On a dev workstation, it could just be some bash or Python using the AWS CLI, for example.

Static tokens are terrible.
They are far too easy to egress and are downright evil.

Some organizations use this feature to enforce security policy and ensure regular rotation of credentials: you can't get a static access key ID / secret access key pair, but you can get one with a session token, and so you run everything inside that context. If you can't specify the session token to an Amazon client, you just can't use Amazon STS.

I presume so that one can use instance roles or the automatic credential discovery that is common within AWS and among applications using AWS.

One big reason is that it's insecure. Another big reason is that it's much nicer to deploy on any AWS service and have the SDK use the metadata host, which automatically provides a temporary access token carrying the permissions of the role you set for it.

Unless you can exhaustively guarantee that your customer-data-containing production data will definitely be transformed into something completely unrecognisable and irreversible (and let's face it, you can never do so; systems change all the time), using this is irresponsible. The fact that the motivation for it is that it is "tedious" to do things the right way doesn't exactly inspire confidence, though it is definitely in the spirit of the times.

I feel I'm missing a reference for the configuration file on their GitHub page. So maybe this already exists, but it would be useful if there were some features addressing your concern, like an opt-out mode where you have to specify transformations for all columns unless indicated otherwise, or at least for text columns. There are also plenty of organizations (not mine) with review boards for database changes. Those folks could also have a process to make sure that new, sensitive columns get added to the configuration file.

From Thoughtworks Tech Radar
https://www.thoughtworks.com/radar 21. Production data in test environments
Hold
We continue to perceive production data in test environments as an area for concern. Firstly, many examples of this have resulted in reputational damage, for example, where an incorrect alert has been sent from a test system to an entire client population. Secondly, the level of security, specifically around protection of private data, tends to be less for test systems. There is little point in having elaborate controls around access to production data if that data is copied to a test database that can be accessed by every developer and QA. Although you can obfuscate the data, this tends to be applied only to specific fields, for example, credit card numbers. Finally, copying production data to test systems can break privacy laws, for example, where test systems are hosted or accessed from a different country or region. This last scenario is especially problematic with complex cloud deployments. Fake data is a safer approach, and tools exist to help in its creation. We do recognize there are reasons for specific elements of production data to be copied, for example, in the reproduction of bugs or for training of specific ML models. Here our advice is to proceed with caution.

Is this a negative or positive comment?

> Fake data is a safer approach, and tools exist to help in its creation.

Because the tool presented is exactly what this quote says.

No, the tool presented copies real production data down to a test instance, with support for anonymization, subsetting, etc. That's a very different approach than tools for creating fake data.

Using a customer's production data outside of production probably violates their expectations of your data security practices. I couldn't see myself using this unless there was a mode where only allowed fields are copied and non-id fields are first transformed in a lossy way.

That was the first thing I checked: whether it supports data masking. And it does, via transformers, as evoxmusic already pointed out.
However, there may be times when data masking must be nuanced. Suppose some random email/domain pair is bad and you would rather replace all "example.com" domain instances with "fake.com", and not "random1.com", "random2.com", etc. (for ML, or third-party analysis). Out of the box I don't see this provided; HOWEVER, I see that you can write a custom transformer and fulfill your needs: https://github.com/Qovery/replibyte/tree/main/examples/wasm Excellent :)

That's why RepliByte uses S3 to store transformed data: the real production data never leaves the production environment. https://github.com/Qovery/replibyte/blob/main/docs/DESIGN.md

The doc refers to S3 as an intermediary; the data is then going to be loaded into a different DB where all these issues exist. The intermediary adds one more place where data could be leaked.

And not using the chaos of the real world to harden your software before it hits production probably violates their expectations of correctness. It's a real and interesting tension! Software that helps people walk this line well is valuable.

Cannot agree more. There are tons of edge cases that only happen with the chaotic blob that we call production data. Tools like this, or any other that helps anonymize data, are so useful for debugging issues.

I'm (not) looking forward to the future data breach notifications / post-mortems that include something like "... our developers used a tool to copy the production database to a dev database on their laptop ..." Honestly, I'm kinda surprised by the lack of comments advocating against doing this.

The comments I've read on here seem strangely negative; I don't understand why. I think this tool looks great! I appreciate the time and effort you put into releasing a free and open source tool to help solve a real problem. Keep up the great work! <3

Thank you - all comments are welcome to improve this tool.
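The domain-replacement transformer described above can be sketched in a few lines. This is a Python illustration of the idea, not RepliByte's actual wasm transformer API: map the known domain to a fixed replacement, and every other domain to a deterministic pseudonym, so the same input always yields the same output (which matters for ML or third-party analysis):

```python
import hashlib

def mask_email(email):
    """Replace example.com with fake.com; give every other domain a
    deterministic fake pseudonym (same real domain -> same fake domain)."""
    local, _, domain = email.partition("@")
    if domain == "example.com":
        return local + "@fake.com"
    digest = hashlib.sha256(domain.encode()).hexdigest()[:8]
    return local + "@" + digest + ".example"

print(mask_email("alice@example.com"))  # alice@fake.com
print(mask_email("bob@acme.com"))       # bob@<stable 8-hex-char>.example
```

Determinism is the key property here: a purely random replacement would break joins and distributions across tables, while a keyed or hashed pseudonym preserves them.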
Of course production data is sensitive, and everything that can be done to prevent data leaks is super important.

I like this, but after a cursory glance at the source I have a few concerns:

- There's a transformer which appears to retain the first char of string fields. That's not safe if you're dealing with customer data.
- Remove telemetry. That it's claimed to be anonymized and togglable is meaningless where sensitive data is concerned.

Thanks.

It's not safe because I could potentially use that information to find a real customer in the DB. It becomes more problematic when working with data from Asian countries, where it's possible (even common) for family and/or first names to consist of two or even a single character.

With regards to telemetry, I'm aware that it can be disabled, but in my experience that would still result in a veto from the security teams I've worked with.

This does sound kind of useful. On the other hand, I performed a similar task just yesterday using the native pg_dump and pg_restore commands, and it only took a couple of hours to set up (and now I have a repeatable script), so this will need to be implemented really well to provide value.

Well, moving sensitive data or PII to a development environment for any developer's eyeballs to see doesn't sound good, and may introduce trouble if the data is leaked. A development env (where random npm dev scripts get run, or whatever) doesn't usually get protected as well as production. Anyway, if you are a single developer, or a handful of developers where everyone gets access to prod, you may not care. Still, data hygiene and risk mitigation shouldn't be overlooked.

Yeah, we have a shared QA environment which gets a fair amount of data populated in it, and this is what I was actually syncing in this case. If syncing prod data, then I'd definitely want to have very thorough filtering. But at that point I'm not sure I'd trust this tool!

I am in the same boat, but a couple of hours is still terrible.
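To make the retain-first-character concern concrete: such a masker leaks the entire value whenever the value is one character long, and keeps half the signal for two-character names. A toy illustration (not RepliByte's actual transformer; the randomization scheme is made up for the example):

```python
import random

def mask_keep_first(value):
    """Toy masker in the spirit of the one criticized above: keep the
    first character, randomize the rest (seeded for reproducibility)."""
    if not value:
        return value
    rng = random.Random(0)
    tail = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in value[1:])
    return value[0] + tail

# A single-character name survives masking entirely:
print(mask_keep_first("O"))   # O -- nothing was hidden
# A two-character name keeps half its identifying signal:
print(mask_keep_first("Li"))  # "L" plus one random letter
```

For short names, which are common in several Asian naming conventions, the "masked" output narrows the candidate set enough to re-identify people against the original database.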
The best approach is probably copying the data directory straight over, which should cut it down to seconds, but I have yet to automate that, plus there are production credentials/sensitive data problems that need to be tackled too...

I suspect it's likely to take a couple of hours to set up this tool too!

This is actually a much harder problem than it seems. GDPR is quite strict about what is considered PII (and rightly so). For example: you may think replacing sensitive data with fake data is enough to anonymise customer data. It's not:

> "Personal data is any information that relates to an identified or identifiable living individual. Different pieces of information, which collected together can lead to the identification of a particular person, also constitute personal data."

So it's not enough to, for example, replace all names, addresses, etc. when you can still see which products someone has interacted with, when their account was created (which in the production DB would relate back to their actual account!), or any other unexpected piece of information that links back to their identity. In practice, this means that any realistic production-derived data is either very likely to still be considered PII (and therefore much more demanding to handle safely and securely) or has to be mangled so much that it is no longer representative of production data.

I was thinking "oh! this is awesome!", and then noticed it didn't support MSSQL. Not to worry, I'll just contribute a connector. Let's take a look at their existing connector code... https://github.com/Qovery/replibyte/blob/main/replibyte/src/... Not a single comment to say what anything does. Sigh. It's the same for the other drivers too.

Can you open an issue to ask for better documentation on how to contribute an additional connector?

For example, let's say I have a MySQL database: how does RepliByte copy that database into S3? Does it use mysqldump, or does it copy the database index files?
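The GDPR point quoted above (different pieces of information, collected together, can identify a person) can be demonstrated with a toy join. All data below is hypothetical: even after names are replaced, a combination of the remaining columns can uniquely pick out a row, and anyone with side knowledge can link it back to the real person:

```python
# "Anonymized" rows: names replaced, but signup date and product retained.
anonymized = [
    {"name": "fake-001", "signup": "2021-03-02", "product": "pro-plan"},
    {"name": "fake-002", "signup": "2021-03-02", "product": "free-plan"},
    {"name": "fake-003", "signup": "2021-03-05", "product": "free-plan"},
]

# Attacker's side knowledge (e.g. a public post: "signed up for the pro
# plan on March 2nd"). Joining on these quasi-identifiers re-identifies
# the row even though the name field was faked.
matches = [r for r in anonymized
           if (r["signup"], r["product"]) == ("2021-03-02", "pro-plan")]
print(len(matches))  # 1 -- the row is unique, hence identifiable
```

This is why production-derived data that keeps realistic timestamps and behavioral columns usually still counts as personal data, regardless of what was done to the name fields.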
We have a script that automatically backs up our production database at intervals to S3, and then a program to download the latest backup and scrub the data. It takes a heck of a long time to download, and it impacts the server when it happens... It's been on my todo list to replace this with Percona's XtraBackup [1], but it doesn't look like that's what these guys are doing?

[1] https://www.percona.com/software/mysql-database/percona-xtra...

One of my colleagues has developed a sophisticated data generator addressing the needs of workloads/algorithms which work based on the characteristics of the data: https://github.com/jssprasanna/redgene ReDGene (Relational Data Generator) is a tool aimed at taking control over data generation. It can generate column vectors in a table with the required type, interval, length, cardinality, skewness, and constraints such as primary key, foreign key, foreign key (unique 1:1 mapping), composite primary key, and composite foreign key. It is DB-agnostic: it generates data in the form of flat files, which can be imported into any database that supports importing data from flat files.

You'd imagine Postgres or whatever would have a built-in function to populate a DB based on types, as a sort of fuzzing tool, tbh. I worked on a gov app years ago that required anonymized databases, and I remember thinking the same then: why isn't it available out of the box? Everyone must need this from time to time.

I built a dataset anonymizer at JPMorgan years ago. There's a surprising amount of nuance needed just to do a decent job generating schematically valid fake data, let alone data that's statistically faithful to the true data.

Does it need to be statistically faithful, or can it just fuzz the data type? Maybe that's better for testing anyway?

* Data that is tightly clustered on certain keys and widely dispersed on other keys can hit some "fun" interactions with sharding regimes, indexes, etc. that random data doesn't.
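A minimal Python sketch of the kind of generation ReDGene is described as doing (this is not ReDGene's implementation; the function and parameter names are illustrative): draw a foreign-key column from a parent table's primary keys with controlled cardinality and a Zipf-like skew, so that a few parents receive most of the child rows, the way real production data often clusters:

```python
import random
from collections import Counter

def generate_fk_column(parent_keys, rows, skew=1.5, seed=0):
    """Draw `rows` foreign-key values from `parent_keys` with a Zipf-like
    skew: keys earlier in the list are chosen far more often than later
    ones, while every drawn value still satisfies the FK constraint."""
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) ** skew for rank in range(len(parent_keys))]
    return rng.choices(parent_keys, weights=weights, k=rows)

parents = list(range(1, 101))              # parent table PKs 1..100
fk = generate_fk_column(parents, 10_000)   # 10k child rows
assert set(fk) <= set(parents)             # referential integrity holds
print(Counter(fk).most_common(3))          # a handful of hot keys dominate
```

Uniform random data would spread the 10,000 rows at roughly 100 per parent; the skewed column instead produces hot keys, which is exactly the property that exercises index and sharding behavior the way production data does.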
* Brute forcing a whole bunch of invalid values can be a lot less interesting than lighting up unconventional combinations of valid values.

* Sometimes you're wrong about the validation rules, i.e. you think you know the allowable enum values, but in fact production systems that really exist and have customers behind them are setting other values. Rejecting those would itself be a bug.

Exactly! A fun example is city, state, and country fields in a row. We tend to see banks or businesses with locations in NYC, NY, USA, and are quite unlikely to see a business HQ'd in New York, Hawaii, USA, if such a place even exists.

Since generating valid values is difficult, it sounds like an opportunity for an open-source fuzzy data generator with rules for different database types. Encode all the lessons shared in those "falsehoods programmers believe" articles about names, addresses, phone numbers, email addresses, credit card numbers, dates, time zones, etc. into one testing tool that everyone can use. I don't work with databases, so maybe this already exists. :)

This project needs a giant heading box in the README stating three things:

- Staging databases that hold data generated from production databases should be considered production data, with the same level of consideration for security and access as production.
- Staging databases that hold production data are a GDPR violation waiting to happen. Make sure your data controller / lawyers know exactly what you're doing with production data.
- Ask yourself why you need production data in staging in the first place. What are you gaining over a script that generates data? If you want data at scale, you can generate it randomly. If you want data that covers all edge cases, you can generate it non-randomly. If you want "real-looking" data, then maybe this tool is useful.

People copying data from production to staging and then failing to look after it properly is a nightmare. It shouldn't be encouraged except in very unusual circumstances.
In my experience of dev, your development and staging data should cover the weird edge cases that you need to handle, far more than the nice "happy path" data you get in production.

Do you consider transformed data in staging harmful? (Transformed data = data where everything sensitive has been hidden.)

I consider it potentially harmful. Anonymizing data is a hard problem, and what is considered sensitive is not settled. For example, an IP address is personally identifiable information under the GDPR; most people don't mask that in their logs, though. If you copy records from production that have network information in them (last known IP, for example), then your data controller should be very concerned.

Another major problem with tools like RepliByte is that people use them properly, and then a database schema changes, but people don't update their script to anonymize the new tables or columns. A few months later, someone notices sensitive data has made its way into staging, and into the backups, and into the database dumps devs made to debug things, because "it's only staging data, who cares!"

Protecting user data is something that you need to be extremely vigilant about. In my experience, the less access I have to production data, the happier I am. Copying it and using it in staging, even if you're careful about it, fills me with dread.

It makes sense to me, and here's why: those are great features to have, but they're not in the app yet. That is why there should be a warning in the README telling users to be careful when they use it now.

This looks very useful, and I have an immediate need for it. I'm still not sure how to use RepliByte to do the following: I want to take a snapshot of the DB from one of my environments and then seed a local DB with it. I see this is a basic use case of RepliByte, but I'm not sure exactly how to accomplish it. I have a Docker container running Postgres, and I just want to take the snapshot and seed it into that. How exactly do I do this?
Can you open an issue on the GitHub repo and set the label "question"? Thx

Am I missing the obvious? Why would one seed a dev database from production? If anything, shouldn't data on dev exist before production?

Once you have data in production, it's very possible for the fake data you generate in dev to not really match the sort of data you have in production (either in size, or because of assumptions made whilst generating it, or even because bad app updates/schema migrations in the past caused duff data). It can then be useful for future development or debugging that the data is real(-ish).

Interesting; however, couldn't it detect tables and columns automatically instead of having them specified in the configuration file? If I understand correctly, each table is to be specified by hand. Say I have nearly a hundred tables...

A feature to automatically detect sensitive data is planned to be added.

Seems like you only specify transformers.
1. What do you mean it is not safe?
2. Telemetry can be removed with the option --no-telemetry and you can inspect the code > https://github.com/Qovery/replibyte/blob/main/replibyte/src/telemetry.rs
> Works on large database (> 10GB) (read Design)

Can anyone explain to me how this works in RepliByte? The design document only talks about Postgres.
> Database Subsetting: Scale down a production database to a more reasonable size

What about this? Does the database need foreign keys to prevent related rows in tables from being lost, and are they just randomly deleting rows, as the config seems to indicate [2]?
RepliByte responds to a very common need that almost every company ends up building a tool for internally. The idea is to collaboratively work on a tool that can be used by anyone and improved over time to avoid leaking data.

1. Auto-detection of sensitive data is planned.
2. Detecting database schema changes is also planned, to prevent leaking sensitive data.