Deterministic synthetic data generation for agents.
Apery generates synthetic data from declarative plans. Same plan, same seed, same output — every time.
AI agents are first-class citizens. Logs go to stderr so stdout stays clean for piping. Apery emits structured slog logs and machine-parseable JSONL/CSV, and returns clean exit codes.
Install
One-liner (Linux / macOS, amd64 / arm64):
curl -fsSL https://raw.githubusercontent.com/compuficial/apery/main/install.sh | sh
The script auto-detects OS/arch, fetches the latest stable release, and installs to /usr/local/bin (or ~/.local/bin if /usr/local/bin isn't writable). To pin a specific version or install a pre-release:
curl -fsSL https://raw.githubusercontent.com/compuficial/apery/main/install.sh | VERSION=v0.0.1-rc1 sh
Manual download: grab a tarball/zip from releases (Linux, macOS, Windows; amd64 + arm64).
Go install (requires Go 1.24+):
go install github.com/compuficial/apery/cmd/apery@latest
From source (requires Go 1.24+):
git clone https://github.com/compuficial/apery.git
cd apery
make install # builds and installs to ~/.local/bin
30-Second Demo
# plan.yaml
seed: 42
entities:
  - name: User
    count: 1000
    fields:
      - name: id
        gen: seq
      - name: email
        gen: regex
        config:
          pattern: "[a-z]{5,10}@(gmail|yahoo|outlook)\\.com"
      - name: department
        gen: pick
        config:
          values: [engineering, sales, marketing, support]
          weights: [40, 30, 20, 10]
$ apery generate -f plan.yaml | head -3
{"_entity":"User","id":1,"email":"kczvbmih@outlook.com","department":"engineering"}
{"_entity":"User","id":2,"email":"yzdevl@yahoo.com","department":"engineering"}
{"_entity":"User","id":3,"email":"eoikwpvxu@gmail.com","department":"sales"}
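Because each record is a single JSON object on its own line, a downstream consumer can decode the stream one line at a time. The following is a minimal Go sketch that assumes the record shape shown in the demo output; the `record` type and `parseRecord` helper are hypothetical names for illustration, not part of Apery.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// record mirrors the fields in the demo output (hypothetical type, not an Apery API).
type record struct {
	Entity     string `json:"_entity"`
	ID         int    `json:"id"`
	Email      string `json:"email"`
	Department string `json:"department"`
}

// parseRecord decodes one JSONL line into a record.
func parseRecord(line []byte) (record, error) {
	var rec record
	err := json.Unmarshal(line, &rec)
	return rec, err
}

func main() {
	// Tally records per department from JSONL arriving on stdin.
	counts := make(map[string]int)
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		if len(sc.Bytes()) == 0 {
			continue // skip blank lines
		}
		rec, err := parseRecord(sc.Bytes())
		if err != nil {
			fmt.Fprintln(os.Stderr, "decode:", err)
			os.Exit(1)
		}
		counts[rec.Department]++
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read:", err)
		os.Exit(1)
	}
	fmt.Println(counts)
}
```

Assuming the program is built as `tally`, it could be fed with `apery generate -f plan.yaml | ./tally`.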
Why Apery
| Deterministic | Plan + Seed = Identical Output. Always. Across parallel workers, platforms, runs. |
| Fast | Chunked parallel execution across all cores. See Performance for numbers. |
| Composable | 22 generators that nest and combine. Objects, lists, templates, conditional dispatch, cross-row arithmetic. |
| Relational | Foreign keys, 1:M parent-child, M:N junction tables. Zipf distributions for realistic skew. |
| Agent-first | YAML/JSON plans, stdout piping, structured slog output, exit codes. No GUI, no server. |
| Zero config | Single binary. No database, no runtime dependencies. |
Generators
Run apery list generators to see all available generators, or apery describe generator <name> for full config docs.
Scalar
| Generator | Description | Example Config |
|---|---|---|
| seq | Sequential integers | start: 1, step: 1 |
| int | Uniform random integer | min: 0, max: 100 |
| float | Uniform random float | min: 0.0, max: 1.0 |
| bool | Weighted boolean | probability: 0.8 |
| pick | Random from list/file/URL | values: [a, b, c], weights: [5, 3, 2] |
| const | Fixed value | value: active |
| regex | String from pattern | pattern: "[A-Z]{2}-\\d{6}" |
| time | Timestamp in range | start: "2024-01-01", end: "2024-12-31" |
| uuid | UUID v4 | (none) |
| ulid | ULID | (none) |
| normal_int | Gaussian integer | mu: 50, sigma: 10 |
| normal_float | Gaussian float | mu: 0.0, sigma: 1.0 |
| zipf | Zipf distribution | s: 1.1, imax: 100 |
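To make the `weights` option of `pick` concrete, here is a minimal Go sketch of weighted selection driven by a seeded random source. It illustrates the general technique only and is not Apery's implementation; `weightedPick` and `tally` are hypothetical helpers, and the sketch assumes all weights are positive integers.

```go
package main

import (
	"fmt"
	"math/rand"
)

// weightedPick returns values[i] with probability weights[i] / sum(weights).
// Assumes len(values) == len(weights) and every weight is positive.
func weightedPick(r *rand.Rand, values []string, weights []int) string {
	total := 0
	for _, w := range weights {
		total += w
	}
	n := r.Intn(total) // uniform in [0, total)
	for i, w := range weights {
		if n < w {
			return values[i]
		}
		n -= w
	}
	return values[len(values)-1] // not reached when weights are positive
}

// tally draws n values with a fixed seed and counts how often each appears.
func tally(seed int64, n int) map[string]int {
	values := []string{"engineering", "sales", "marketing", "support"}
	weights := []int{40, 30, 20, 10}
	r := rand.New(rand.NewSource(seed))
	counts := make(map[string]int)
	for i := 0; i < n; i++ {
		counts[weightedPick(r, values, weights)]++
	}
	return counts
}

func main() {
	// The same seed always yields the same counts.
	fmt.Println(tally(42, 10000))
}
```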
Composite
| Generator | Description |
|---|---|
| object | Nested object with sub-generators per field |
| list | Array of N items from one generator |
| sample | N unique items without replacement |
| one_of | Weighted random dispatch to sub-generators |
| template | String interpolation: "{first} {last}" |
| switch | Conditional dispatch based on another field |
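The `{field}` placeholder style used by `template` can be illustrated with a few lines of Go. This is a sketch of the interpolation idea under the assumption that placeholders are word characters inside braces; `render` is a hypothetical helper, and how Apery treats unknown fields is not specified here (the sketch leaves them untouched).

```go
package main

import (
	"fmt"
	"regexp"
)

// placeholder matches {name} where name consists of word characters.
var placeholder = regexp.MustCompile(`\{(\w+)\}`)

// render replaces each {field} in tmpl with the matching value from row.
// Placeholders with no matching field are left as they are.
func render(tmpl string, row map[string]string) string {
	return placeholder.ReplaceAllStringFunc(tmpl, func(m string) string {
		key := m[1 : len(m)-1] // strip the surrounding braces
		if v, ok := row[key]; ok {
			return v
		}
		return m
	})
}

func main() {
	row := map[string]string{"first": "Ada", "last": "Lovelace"}
	fmt.Println(render("{first} {last}", row))
}
```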
Computed (row-aware)
| Generator | Description |
|---|---|
| expr | Arithmetic over {field} refs and numbers: "{total} / 12", "{amount} * {fx_rate}" |
| date_offset | Shift a base date by N units: base: "{start}", amount: "{i}", unit: months |
Relational
| Generator | Description |
|---|---|
| rel_ref | Foreign key from a previously generated entity (uniform or zipf, optional unique: true) |
| driven_by | 1:M parent-child: min to max children per parent row; expose parent columns, and use index_as to name the child's position, so children can compute cross-row values |
Relational Example
seed: 99
entities:
  - name: User
    count: 100
    fields:
      - name: id
        gen: seq
      - name: name
        gen: pick
        config:
          values: [Alice, Bob, Carol, Dave]
  - name: Product
    count: 50
    fields:
      - name: id
        gen: seq
      - name: sku
        gen: regex
        config:
          pattern: "[A-Z]{2}-\\d{6}"
  - name: Order
    # 1:M: each User gets 1-5 Orders
    driven_by:
      entity: User
      field: id
      as: user_id
      min: 1
      max: 5
    fields:
      - name: order_id
        gen: seq
      - name: product_id
        gen: rel_ref
        config:
          entity: Product
          field: id
      - name: quantity
        gen: int
        config:
          min: 1
          max: 10
  - name: Review
    # M:1 with zipf skew
    count: 500
    fields:
      - name: user_id
        gen: rel_ref
        config:
          entity: User
          field: id
          distribution: zipf
          s: 1.5
      - name: product_id
        gen: rel_ref
        config:
          entity: Product
          field: id
      - name: rating
        gen: int
        config:
          min: 1
          max: 5
$ apery generate -f ecommerce.yaml --output-dir ./out --split-entities
$ ls out/
Order.jsonl  Product.jsonl  Review.jsonl  User.jsonl
$ head -1 out/Order.jsonl | jq .
{
  "user_id": 1,
  "order_id": 1,
  "product_id": 34,
  "quantity": 7
}
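The 1:M expansion in the Order entity can be pictured as a loop over parent rows that emits between min and max children for each. The Go sketch below shows that shape only. It uses a single sequential random source for brevity, which is a simplification and not how Apery derives seeds; the `order` type and `expand` function are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// order is a child row: it carries the parent key plus its own sequence number.
type order struct {
	UserID  int
	OrderID int
	Index   int // 0-based position of this child under its parent
}

// expand emits between minKids and maxKids children per parent ID, in parent order.
// Assumes 0 <= minKids <= maxKids.
func expand(seed int64, parentIDs []int, minKids, maxKids int) []order {
	r := rand.New(rand.NewSource(seed))
	var out []order
	next := 1 // order_id behaves like a seq generator across all children
	for _, pid := range parentIDs {
		n := minKids + r.Intn(maxKids-minKids+1)
		for i := 0; i < n; i++ {
			out = append(out, order{UserID: pid, OrderID: next, Index: i})
			next++
		}
	}
	return out
}

func main() {
	for _, o := range expand(99, []int{1, 2, 3}, 1, 5) {
		fmt.Printf("user_id=%d order_id=%d index=%d\n", o.UserID, o.OrderID, o.Index)
	}
}
```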
CLI Reference
apery generate -f plan.yaml # JSONL to stdout
apery generate -f plan.yaml -o csv # CSV to stdout
apery generate -f plan.yaml --output-dir ./out
apery generate -f plan.yaml --output-dir ./out --split-entities
apery generate -f plan.yaml --dry-run # validate only
apery generate -f plan.yaml --seed 123 # override seed
apery generate -f plan.yaml --verbose # entity progress to stderr
apery generate -f plan.yaml --debug # full debug output to stderr
apery validate -f plan.yaml # validate a plan file
apery list generators # list all generators
apery describe generator <name> # show config schema + example
apery version # print version
apery help <command> # help for any command
Exit codes: 0 success, 1 validation error, 2 generation error, 3 I/O error.
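An agent or wrapper script can branch on these codes without parsing log text. The Go sketch below shells out to `apery validate -f plan.yaml` and maps the result to the labels listed above; `describeExit` and `exitCode` are hypothetical helpers, and the sketch assumes `apery` is on the PATH (if it is not, the process cannot start and the sketch reports that instead).

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// describeExit maps the documented exit codes to short labels.
func describeExit(code int) string {
	switch code {
	case 0:
		return "success"
	case 1:
		return "validation error"
	case 2:
		return "generation error"
	case 3:
		return "I/O error"
	default:
		return "could not run or unknown exit code"
	}
}

// exitCode extracts the process exit code from the error returned by cmd.Run.
func exitCode(err error) int {
	if err == nil {
		return 0
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode()
	}
	return -1 // the process could not be started at all
}

func main() {
	cmd := exec.Command("apery", "validate", "-f", "plan.yaml")
	cmd.Stderr = os.Stderr // pass logs through; stdout is not needed here
	err := cmd.Run()
	fmt.Println(describeExit(exitCode(err)))
}
```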
Performance
$ cat bench.yaml
seed: 1
entities:
  - name: Row
    count: 1000000
    fields:
      - name: id
        gen: seq
      - name: value
        gen: int
        config: { min: 0, max: 1000000 }
      - name: label
        gen: pick
        config: { values: [a, b, c, d, e] }
$ time apery generate -f bench.yaml --workers 16 > /dev/null
real 0m1.6s
Numbers depend heavily on plan shape (regex, rel_ref, composite generators are more expensive than scalar seq/int/pick) and whether output is piped to a file or a terminal. Run your own plan with --workers $(nproc) > /dev/null to get a representative number for your workload.
Determinism
$ apery generate -f plan.yaml --seed 42 | md5sum
fc8756b572010e94b46afc81ecbe6a02 -
$ apery generate -f plan.yaml --seed 42 | md5sum
fc8756b572010e94b46afc81ecbe6a02 -
Hierarchical seed derivation ensures identical output regardless of worker count or chunk size. See the spec for the full seed derivation model.
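The reason worker count and chunk size need not affect output can be shown with a small Go sketch: if each row's value depends only on the root seed and the row index, it does not matter which worker computes it or in what order. The mixing step in `rowValue` is a stand-in chosen for brevity, not Apery's derivation (see the spec for that), and all names here are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// rowValue derives a per-row random source from the root seed and the row
// index alone (stand-in mixing step), then draws one value from it.
func rowValue(rootSeed uint64, row int) int {
	s := rootSeed ^ (uint64(row)+1)*0x9E3779B97F4A7C15
	r := rand.New(rand.NewSource(int64(s)))
	return r.Intn(1000000)
}

// generate fills rows values using the given number of workers and chunk size.
func generate(rootSeed uint64, rows, workers, chunkSize int) []int {
	out := make([]int, rows)
	chunks := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for start := range chunks {
				end := start + chunkSize
				if end > rows {
					end = rows
				}
				for i := start; i < end; i++ {
					out[i] = rowValue(rootSeed, i) // chunks never overlap, so no data race
				}
			}
		}()
	}
	for start := 0; start < rows; start += chunkSize {
		chunks <- start
	}
	close(chunks)
	wg.Wait()
	return out
}

// equal reports whether two slices hold the same values in the same order.
func equal(a, b []int) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	one := generate(42, 1000, 1, 1000) // one worker, one chunk
	many := generate(42, 1000, 8, 37)  // eight workers, uneven chunks
	fmt.Println("identical:", equal(one, many))
}
```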
Go Library
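import (
    "context"
    "github.com/compuficial/apery"
)
ctx := context.Background()
// Errors are ignored here for brevity; check them in real code.
p, _ := apery.LoadPlanFile("plan.yaml")
w, _ := apery.NewJSONLWriter("output.jsonl")
apery.Run(ctx, p, w,
    apery.WithWorkers(16),
    apery.WithChunkSize(100000),
)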
Architecture
flowchart LR
Plan([Plan<br/>YAML / JSON])
Registry[[Registry<br/>22 generators]]
Runtime[[Runtime<br/>chunked parallel executor]]
Writer[[Writer<br/>JSONL / CSV / split]]
Out([Records])
Plan --> Registry --> Runtime --> Writer --> Out
Seed([Seed]) -. derive .-> Runtime
| Stage | Package | Responsibility |
|---|---|---|
| Plan | internal/plan | Load + validate YAML/JSON into entities, fields, and generator configs. |
| Registry | internal/registry | Generator factory. Built-ins auto-register at init time; each is self-describing via GeneratorInfo. |
| Runtime | internal/runtime | Chunked parallel executor. Row-by-row generation with cross-entity column store for relational lookups. Structured slog logging. |
| Writer | internal/writer | Streaming output: single JSONL/CSV file, stdout, or per-entity split files. |
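The init-time self-registration pattern described for the Registry stage can be sketched in Go as follows. Only the name `GeneratorInfo` comes from the table; its fields, the `Factory` signature, and the `Register`, `Build`, and `Describe` functions are assumptions made for illustration and may differ from the real internal/registry package.

```go
package main

import (
	"fmt"
	"sort"
)

// GeneratorInfo is named in the table above; these fields are hypothetical.
type GeneratorInfo struct {
	Name        string
	Description string
}

// Factory builds a value-producing function from a field's config map.
type Factory func(cfg map[string]any) (func() any, error)

type entry struct {
	info    GeneratorInfo
	factory Factory
}

var registry = map[string]entry{}

// Register adds a generator under its name; built-ins call it from init.
func Register(info GeneratorInfo, f Factory) {
	registry[info.Name] = entry{info: info, factory: f}
}

func init() {
	Register(GeneratorInfo{Name: "const", Description: "Fixed value"},
		func(cfg map[string]any) (func() any, error) {
			v, ok := cfg["value"]
			if !ok {
				return nil, fmt.Errorf("const: missing 'value'")
			}
			return func() any { return v }, nil
		})
}

// Build looks up a generator by name and instantiates it with cfg.
func Build(name string, cfg map[string]any) (func() any, error) {
	e, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unknown generator %q", name)
	}
	return e.factory(cfg)
}

// Describe lists registered generators in name order.
func Describe() []string {
	var lines []string
	for _, e := range registry {
		lines = append(lines, e.info.Name+": "+e.info.Description)
	}
	sort.Strings(lines)
	return lines
}

func main() {
	gen, err := Build("const", map[string]any{"value": "active"})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(gen())
	fmt.Println(Describe())
}
```

Keeping the description next to the factory is what lets a command such as `apery describe generator <name>` be answered from the registry alone.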
Determinism is the core invariant. Seeds cascade deterministically via FNV-1a derivation:
root seed ─▶ entity ─▶ field ─▶ row ─▶ sub-field
Same plan + same seed = byte-identical output, regardless of worker count, platform, or run. See docs/spec.md for the full execution model.
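The cascade above can be sketched with Go's standard hash/fnv package. This is an illustration of the idea under assumed details: the byte layout, the labels fed into the hash, and the `derive` and `rowSeed` helpers are hypothetical, and docs/spec.md remains the reference for the actual scheme.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// derive mixes a parent seed with a child label using 64-bit FNV-1a.
func derive(parent uint64, label string) uint64 {
	h := fnv.New64a()
	var buf [8]byte
	for i := range buf {
		buf[i] = byte(parent >> (8 * i)) // little-endian encoding of the parent seed
	}
	h.Write(buf[:])
	h.Write([]byte(label))
	return h.Sum64()
}

// rowSeed walks root seed -> entity -> field -> row.
func rowSeed(root uint64, entity, field string, row int) uint64 {
	e := derive(root, entity)
	f := derive(e, field)
	return derive(f, fmt.Sprintf("row:%d", row))
}

func main() {
	// The seed for a given cell depends only on its path from the root,
	// never on which worker reaches it or when.
	fmt.Println(rowSeed(42, "User", "email", 7))
	fmt.Println(rowSeed(42, "User", "email", 7) == rowSeed(42, "User", "email", 7))
}
```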
Documentation
- Specification — Architecture, plan schema, generator reference, execution model
- Usage Guide — Practical CLI walkthrough with example plans
License
MIT
