Show HN: Generate coherent, synthetic data at scale
Internally at our org, we have an ecosystem of over 200 microservices implementing various parts of the business logic. To test changes, we provide developers with on-demand sandboxed environments. One problem we had to solve for that was creating synthetic data across services that respected the business rules and stayed coherent.
Today, we are happy to introduce datagen, a tool we developed internally to solve this problem. It generates coherent, synthetic data and can model complex relationships. At its core is a new DSL (domain-specific language) in which the user specifies the shape of the entity they wish to generate, together with generator functions describing the logic for each field. The entity can be a table in a relational DBMS, a JSON document in a document store, a CSV file to upload to S3, and so on.
The user writes models in .dg files that are transpiled to Go code, which can then be run to generate coherent, synthetic data.
Here is a simple example:
// users.dg
model users {
    fields {
        name() string
        age() int
    }
    gens {
        func name() {
            return "Arthur Dent" // hardcoded value
        }
        func age() {
            return IntBetween(18, 65)
        }
    }
}
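To give a rough feel for the transpilation step, here is a minimal sketch of what the generated Go code for the users model above might resemble. This is an illustration only: the actual generated API, struct names, and the IntBetween helper shown here are assumptions, not datagen's real output.

```go
package main

import (
	"fmt"
	"math/rand"
)

// IntBetween returns a random int in [min, max].
// The real datagen runtime presumably ships a similar helper;
// this name and signature are assumed for the sketch.
func IntBetween(min, max int) int {
	return min + rand.Intn(max-min+1)
}

// User mirrors the fields block of the users model.
type User struct {
	Name string
	Age  int
}

// GenUser produces one row, applying the gens block's logic.
func GenUser() User {
	return User{
		Name: "Arthur Dent",      // hardcoded value, as in the model
		Age:  IntBetween(18, 65), // random age between 18 and 65
	}
}

func main() {
	u := GenUser()
	fmt.Println(u.Name, u.Age)
}
```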
Check out the website for more information: https://ds-horizon.github.io/datagen/
Demo video: https://www.youtube.com/watch?v=ly0DfzTup28
---
Something similar I found: https://www.tinybird.co/blog/mockingbird-announcement-mock-d...
---
Hi, thanks for sharing. These are quite different tools; as far as I understand, the one you shared has no means of cross-referencing other data. I could also only see basic knobs to control the generation: ints between min/max, a weighted distribution over a set of options, etc. datagen, on the other hand, lets you access the data of any model, any field, any row to create new data, much like a DAG. This is a very powerful abstraction. Of course, not having to write "code" in JSON is great too!
---
Is there a good way this could be used for model distillation?
---
Hmmm
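The cross-model referencing mentioned above (any model, any field, any row, much like a DAG) could be sketched roughly as follows. This is hypothetical syntax: the row-access calls below are invented for illustration, so check the datagen docs for the real API.

// orders.dg -- hypothetical sketch, not real datagen syntax
model orders {
    fields {
        user_name() string
        amount() int
    }
    gens {
        func user_name() {
            // reference a field from another model's generated rows
            // (accessor names here are assumptions)
            return users.Row(IntBetween(0, users.Count()-1)).name
        }
        func amount() {
            return IntBetween(1, 500)
        }
    }
}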