GitHub - compuficial/apery: Synthetic Data Generator for Agents

Apery

Deterministic synthetic data generation for agents.

Apery generates synthetic data from declarative plans. Same plan, same seed, same output — every time.

AI agents are first-class citizens. Logs go to stderr so stdout stays clean for piping. Apery emits structured slog logs, machine-parseable JSONL/CSV, and clean exit codes.

Install

One-liner (Linux / macOS, amd64 / arm64):

curl -fsSL https://raw.githubusercontent.com/compuficial/apery/main/install.sh | sh

The script auto-detects OS/arch, fetches the latest stable release, and installs to /usr/local/bin (or ~/.local/bin if /usr/local/bin isn't writable). To pin a specific version or install a pre-release:

curl -fsSL https://raw.githubusercontent.com/compuficial/apery/main/install.sh | VERSION=v0.0.1-rc1 sh

Manual download — grab a tarball/zip from releases (Linux, macOS, Windows; amd64 + arm64).

Go install (requires Go 1.24+):

go install github.com/compuficial/apery/cmd/apery@latest

From source (requires Go 1.24+):

git clone https://github.com/compuficial/apery.git
cd apery
make install   # builds and installs to ~/.local/bin

30-Second Demo

# plan.yaml
seed: 42
entities:
  - name: User
    count: 1000
    fields:
      - name: id
        gen: seq
      - name: email
        gen: regex
        config:
          pattern: "[a-z]{5,10}@(gmail|yahoo|outlook)\\.com"
      - name: department
        gen: pick
        config:
          values: [engineering, sales, marketing, support]
          weights: [40, 30, 20, 10]

$ apery generate -f plan.yaml | head -3
{"_entity":"User","id":1,"email":"kczvbmih@outlook.com","department":"engineering"}
{"_entity":"User","id":2,"email":"yzdevl@yahoo.com","department":"engineering"}
{"_entity":"User","id":3,"email":"eoikwpvxu@gmail.com","department":"sales"}

Why Apery

  • Deterministic: Plan + Seed = Identical Output. Always. Across parallel workers, platforms, runs.
  • Fast: Chunked parallel execution across all cores. See Performance for numbers.
  • Composable: 22 generators that nest and combine. Objects, lists, templates, conditional dispatch, cross-row arithmetic.
  • Relational: Foreign keys, 1:M parent-child, M:N junction tables. Zipf distributions for realistic skew.
  • Agent-first: YAML/JSON plans, stdout piping, structured slog output, exit codes. No GUI, no server.
  • Zero config: Single binary. No database, no runtime dependencies.

Generators

Run apery list generators to see all available generators, or apery describe generator <name> for full config docs.

Scalar

| Generator | Description | Example Config |
|---|---|---|
| seq | Sequential integers | start: 1, step: 1 |
| int | Uniform random integer | min: 0, max: 100 |
| float | Uniform random float | min: 0.0, max: 1.0 |
| bool | Weighted boolean | probability: 0.8 |
| pick | Random from list/file/URL | values: [a, b, c], weights: [5, 3, 2] |
| const | Fixed value | value: active |
| regex | String from pattern | pattern: "[A-Z]{2}-\\d{6}" |
| time | Timestamp in range | start: "2024-01-01", end: "2024-12-31" |
| uuid | UUID v4 | |
| ulid | ULID | |
| normal_int | Gaussian integer | mu: 50, sigma: 10 |
| normal_float | Gaussian float | mu: 0.0, sigma: 1.0 |
| zipf | Zipf distribution | s: 1.1, imax: 100 |

Composite

| Generator | Description |
|---|---|
| object | Nested object with sub-generators per field |
| list | Array of N items from one generator |
| sample | N unique items without replacement |
| one_of | Weighted random dispatch to sub-generators |
| template | String interpolation: "{first} {last}" |
| switch | Conditional dispatch based on another field |

Computed (row-aware)

| Generator | Description |
|---|---|
| expr | Arithmetic over {field} refs and numbers: "{total} / 12", "{amount} * {fx_rate}" |
| date_offset | Shift a base date by N units: base: "{start}", amount: "{i}", unit: months |

Relational

| Generator | Description |
|---|---|
| rel_ref | Foreign key from a previously generated entity (uniform or zipf, optional unique: true) |
| driven_by | 1:M parent-child: min to max children per parent row; expose parent columns, and use index_as to name the child's position, so children can compute cross-row values |

Relational Example

# ecommerce.yaml
seed: 99
entities:
  - name: User
    count: 100
    fields:
      - name: id
        gen: seq
      - name: name
        gen: pick
        config:
          values: [Alice, Bob, Carol, Dave]

  - name: Product
    count: 50
    fields:
      - name: id
        gen: seq
      - name: sku
        gen: regex
        config:
          pattern: "[A-Z]{2}-\\d{6}"

  - name: Order                    # 1:M — each User gets 1-5 Orders
    driven_by:
      entity: User
      field: id
      as: user_id
      min: 1
      max: 5
    fields:
      - name: order_id
        gen: seq
      - name: product_id
        gen: rel_ref
        config:
          entity: Product
          field: id
      - name: quantity
        gen: int
        config:
          min: 1
          max: 10

  - name: Review                   # M:1 with zipf skew
    count: 500
    fields:
      - name: user_id
        gen: rel_ref
        config:
          entity: User
          field: id
          distribution: zipf
          s: 1.5
      - name: product_id
        gen: rel_ref
        config:
          entity: Product
          field: id
      - name: rating
        gen: int
        config:
          min: 1
          max: 5

$ apery generate -f ecommerce.yaml --output-dir ./out --split-entities
$ ls out/
Order.jsonl  Product.jsonl  Review.jsonl  User.jsonl

$ head -1 out/Order.jsonl | jq .
{
  "user_id": 1,
  "order_id": 1,
  "product_id": 34,
  "quantity": 7
}

CLI Reference

apery generate -f plan.yaml              # JSONL to stdout
apery generate -f plan.yaml -o csv       # CSV to stdout
apery generate -f plan.yaml --output-dir ./out
apery generate -f plan.yaml --output-dir ./out --split-entities
apery generate -f plan.yaml --dry-run    # validate only
apery generate -f plan.yaml --seed 123   # override seed
apery generate -f plan.yaml --verbose    # entity progress to stderr
apery generate -f plan.yaml --debug      # full debug output to stderr

apery validate -f plan.yaml              # validate a plan file
apery list generators                    # list all generators
apery describe generator <name>          # show config schema + example
apery version                            # print version
apery help <command>                     # help for any command

Exit codes: 0 success, 1 validation error, 2 generation error, 3 I/O error.

Performance

$ cat bench.yaml
seed: 1
entities:
  - name: Row
    count: 1000000
    fields:
      - name: id
        gen: seq
      - name: value
        gen: int
        config: { min: 0, max: 1000000 }
      - name: label
        gen: pick
        config: { values: [a, b, c, d, e] }

$ time apery generate -f bench.yaml --workers 16 > /dev/null
real    0m1.6s

Numbers depend heavily on plan shape (regex, rel_ref, and composite generators cost more than scalar seq/int/pick) and on whether output goes to a file or a terminal. Run your own plan with --workers $(nproc) > /dev/null to get a representative number for your workload.

Determinism

$ apery generate -f plan.yaml --seed 42 | md5sum
fc8756b572010e94b46afc81ecbe6a02  -
$ apery generate -f plan.yaml --seed 42 | md5sum
fc8756b572010e94b46afc81ecbe6a02  -

Hierarchical seed derivation ensures identical output regardless of worker count or chunk size. See the spec for the full seed derivation model.

Go Library

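import (
    "context"
    "log"
    "github.com/compuficial/apery" // module path as used by go install above
)

p, err := apery.LoadPlanFile("plan.yaml")
if err != nil {
    log.Fatal(err) // plan could not be loaded
}
w, err := apery.NewJSONLWriter("output.jsonl")
if err != nil {
    log.Fatal(err) // output file could not be opened
}
apery.Run(context.Background(), p, w,
    apery.WithWorkers(16),
    apery.WithChunkSize(100000),
)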

Architecture

flowchart LR
    Plan([Plan<br/>YAML / JSON])
    Registry[[Registry<br/>22 generators]]
    Runtime[[Runtime<br/>chunked parallel executor]]
    Writer[[Writer<br/>JSONL / CSV / split]]
    Out([Records])

    Plan --> Registry --> Runtime --> Writer --> Out
    Seed([Seed]) -. derive .-> Runtime

| Stage | Package | Responsibility |
|---|---|---|
| Plan | internal/plan | Load + validate YAML/JSON into entities, fields, and generator configs. |
| Registry | internal/registry | Generator factory. Built-ins auto-register at init time; each is self-describing via GeneratorInfo. |
| Runtime | internal/runtime | Chunked parallel executor. Row-by-row generation with cross-entity column store for relational lookups. Structured slog logging. |
| Writer | internal/writer | Streaming output: single JSONL/CSV file, stdout, or per-entity split files. |

Determinism is the core invariant. Seeds cascade deterministically via FNV-1a derivation:

root seed ─▶ entity ─▶ field ─▶ row ─▶ sub-field

Same plan + same seed = byte-identical output, regardless of worker count, platform, or run. See docs/spec.md for the full execution model.

Documentation

  • Specification — Architecture, plan schema, generator reference, execution model
  • Usage Guide — Practical CLI walkthrough with example plans

License

MIT