GitHub - sroerick/pakkun: git for data pipelines. Can be used as a platform for ETL, app deployment, or compilers.

pakkun is a project-local CLI for running named transformation pipelines while keeping intermediate artifacts out of the working tree.

It stores declared outputs under .pipe/objects, records runs and provenance in .pipe/db.sqlite, and lets you inspect or materialize artifacts later with human-readable refs.

The canonical user-edited execution graph lives in pipe.yaml at the project root. An optional sibling manifest, pakkun.yaml, now carries repo-level component and compatibility metadata for read-only inspection.

human written note

pakkun is vibeslop, leaving me very suspicious of it, but I'm using it in production in a couple different ways.

The first use case I have for this is LaTeX greeting cards. It lets me keep a clean folder and hide all the intermediate .tex files inside a hidden folder. It effectively acts as a TeX Makefile with better ergonomics.

I'm also using this for CI/CD. I have build and deploy scripts for it running on my BSD server. It seems to be working pretty nicely. I don't know if this is a huge ergonomics improvement, but I find it much more usable than, say, GitHub Actions. I don't know if this would work as well for cloud deployments (like GitHub Actions) and I'm not super interested in finding out.

Finally, I used this to clean up the data pipeline for a Bible app that I built. Basically, I was taking some xlsx files and piping them into a database. Over the course of production, I had done a lot of manual SQL edits in order to get the app working. I was able to use an LLM to formalize that pipeline using the pakkun system. I tested the newly created database against the existing one by comparing pg_dump output, and it worked great.

Structurally, this is git for ETL pipelines. You configure your pipeline stages in YAML, and each stage can be basically any executable. You can track runs and stuff. It's really not much more than a Makefile right now, but I like it.

I have a few more use cases I'd like to run this through and I think it may become slightly useful.

Current status

This repository implements the v1 command set from spec.md:

pakkun init
pakkun version
pakkun project show
pakkun project show --json
pakkun compatibility status
pakkun compatibility export --json
pakkun compatibility sync-todo --project-id <id> [--parent-id <id>] [--apply]
pakkun explain [pipeline]
pakkun explain [pipeline] --json
pakkun failures [pipeline-or-run]
pakkun failures [pipeline-or-run] --json
pakkun run [pipeline] [--base <run-ref>] [--from <step>]
pakkun retry <run-ref>
pakkun retry --last-failed [--from <step>]
pakkun stages [pipeline-or-run]
pakkun status
pakkun status --json
pakkun show <ref>
pakkun show <ref> --json
pakkun mount <ref> <dir>
pakkun publish <ref> <path>
pakkun log [pipeline-or-run]
pakkun log [pipeline-or-run] --json
pakkun provenance <ref>
pakkun ui

The short command list undersells the current implementation somewhat. The repository also includes:

multiple runnable example projects under examples/
internal tests for CLI flows, config loading, refs, and example spec loading
content-addressed artifact storage with publish and provenance inspection
an experimental localhost web UI for browsing runs, artifacts, and provenance
pipeline inheritance with extends for shared build graphs
artifact reuse via inputs[].ref and built-in comparison steps via kind: assert

The main limits are about scope, not whether the basic workflow exists:

pakkun runs locally on the machine or CI runner that invokes it
metadata is stored in project-local SQLite under .pipe/db.sqlite
pakkun does not yet provide remote execution, hosted runners, or cross-job workflow orchestration

pakkun.yaml support is intentionally conservative in this first pass:

pakkun project show renders declared repo metadata without requiring pakkun init
pakkun project show --json exposes the same repo metadata as a stable machine-readable payload
pakkun status --json exposes the current local operational state, including pipeline spec presence, latest runs, failed steps, and aliases
pakkun show <ref> --json exposes run, step, or artifact refs in machine-readable form, including stored artifact paths and object refs where applicable
pakkun compatibility status now resolves local Git branch and tag sources, checks declared validation pipelines in pipe.yaml, and inspects recent run evidence from .pipe/db.sqlite when available
pakkun explain [pipeline] renders the effective pipeline shape, including inherited steps, inputs, outputs, and publish targets, without needing a prior run
pakkun explain [pipeline] --json exposes the same effective pipeline shape in a machine-readable form for agent consumers
pakkun log [pipeline-or-run] --json exposes either recent runs or per-step run details, including declared artifacts, in machine-readable form
pakkun failures [pipeline-or-run] summarizes failed steps with stderr previews and the exact pakkun retry run:<id> --from <step> command needed for resume-oriented repair loops
pakkun failures [pipeline-or-run] --json exposes the same failure summary as a stable machine-readable payload for agent tooling
failed pakkun run and pakkun retry executions now print the same compact failure summary and retry hint on stderr instead of only dumping raw step stderr
pakkun retry --last-failed now jumps straight to the newest failed step in the local run store, removing the extra lookup step for common repair loops
pakkun compatibility export --json emits the same compatibility report as a stable machine-readable payload
pakkun compatibility sync-todo previews or applies compatibility obligation updates into wyo.todo project items using the same report model
branch and tag tracking currently resolve local Git repos only
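
The failure and retry items in the list above can be combined into a small repair loop. The following is a minimal sketch, not something shipped with pakkun: it assumes pakkun run and pakkun retry exit non-zero when a step fails, and run_with_repair and its attempt limit are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Sketch of a resume-oriented repair loop built from the documented commands.
# Assumption: `pakkun run` and `pakkun retry` exit non-zero when a step fails.
# run_with_repair and max_attempts are hypothetical, not part of pakkun.

run_with_repair() {
    pipeline=$1
    max_attempts=${2:-3}

    # First attempt: a normal run of the named pipeline.
    if pakkun run "$pipeline"; then
        return 0
    fi

    attempt=1
    while [ "$attempt" -lt "$max_attempts" ]; do
        # Show the failure summary so a human or agent can repair the step.
        pakkun failures "$pipeline" >&2
        # Resume at the newest failed step, reusing upstream artifacts.
        if pakkun retry --last-failed; then
            return 0
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Usage, from an initialized project root:
#   run_with_repair text-demo 3
```

A blind retry like this only helps with transient failures. In a real repair loop, the step script or its inputs would be fixed between attempts.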

Experimental web UI

pakkun ui starts a localhost-only web interface for the current initialized project. The UI is intentionally scoped to the same local data model as the CLI:

overview of pipelines, aliases, latest runs, and failed steps
pipeline detail with declared steps and publish targets
run detail with per-step status, captured stdout/stderr, and manifest JSON
artifact detail with provenance, safe text previews, download, and publish

From an initialized project root:

pakkun ui

By default it binds to 127.0.0.1 on a random free port and prints the URL.

Build

Requirements:

Go 1.26+

Build the CLI:

go build -o ./pakkun ./cmd/pakkun

To embed explicit release metadata in a production build, pass linker flags:

go build -ldflags "-X pakkun/internal/buildinfo.Version=$(git describe --always --dirty) -X pakkun/internal/buildinfo.Commit=$(git rev-parse HEAD) -X pakkun/internal/buildinfo.BuildTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)" ./cmd/pakkun

The recorded backend decision and migration notes are in docs/sqlite-backend-decision.md: the implementation now uses embedded SQLite while keeping .pipe/db.sqlite compatible with existing project state. The current agent-oriented direction is recorded in docs/llm-spec.md. The current machine-readable CLI payloads are documented in docs/json-contracts.md. The current evaluation artifact naming guidance is documented in docs/artifact-conventions.md.

OpenBSD build and install notes for a server build box live in docs/openbsd-build.md. For a staged production rollout on wyo.town, see docs/wyo-rollout.md. The rollout steps are automated by scripts/wyo-rollout.sh. For a generic bare-repo hook pattern, see docs/post-receive-deploy.md.

CI usage

pakkun is suitable for CI today if you treat it as the project-local build graph and artifact/provenance layer.

A good fit looks like this:

your CI platform chooses the runner OS and machine
each job invokes pakkun run <pipeline>
final outputs are materialized with declared publish paths or explicit pakkun publish calls
pakkun log, pakkun show, and pakkun provenance are used for failure analysis

That means pakkun can already be used for:

release packaging jobs on Linux and Windows
reproducible plugin or docs build pipelines
local dogfooding of the exact commands that CI will later run

What pakkun is not trying to be, at least in v1:

a CI hosting platform
a scheduler across multiple machines
a replacement for workflow-level matrix or fan-out/fan-in features
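
As a rough illustration of the fit described above, a CI job can be reduced to one wrapper around the documented commands. This is a sketch under assumptions: ci_job is a hypothetical function name, it assumes .pipe/ is not checked in and therefore needs pakkun init on a fresh runner, and it assumes pakkun run exits non-zero on failure.

```shell
#!/bin/sh
# Sketch of a single CI job body. ci_job is a hypothetical wrapper.
# Assumptions: .pipe/ is not committed, and `pakkun run` exits non-zero
# when the pipeline fails.

ci_job() {
    pipeline=$1       # pipeline name from pipe.yaml
    artifact_ref=$2   # artifact ref such as <pipeline>:<step>/<output>
    dest=$3           # visible path for the final output

    # Initialize project-local storage only if the runner has none yet.
    [ -d .pipe ] || pakkun init || return 1

    if ! pakkun run "$pipeline"; then
        # Leave the run history in the job log for failure analysis.
        pakkun log "$pipeline" >&2
        return 1
    fi

    # Materialize the final artifact where the CI platform can upload it.
    pakkun publish "$artifact_ref" "$dest"
}

# Usage inside a CI step:
#   ci_job text-demo text-demo:upper/result ./build/result.txt
```

The CI platform still owns runner selection and artifact upload. The wrapper only covers the part that can also be run locally.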

Quick start

The simplest runnable example lives in examples/text.

From the repository root:

go build -o ./pakkun ./cmd/pakkun

Then move into an example directory and use that binary against the local pipe.yaml in that directory:

cd examples/text
../../pakkun init
../../pakkun run text-demo
../../pakkun stages text-demo
../../pakkun show text-demo:upper/result
../../pakkun publish text-demo:upper/result ./build/result.txt
../../pakkun provenance text-demo:upper/result

What each command is doing:

pakkun init creates the local .pipe/ storage and metadata directory in the example folder.
pakkun run text-demo executes the named pipeline in dependency order.
pakkun run text-demo --base run:<run-id> --from upper creates a fresh run, reuses cached outputs from prior successful steps in the base run, and restarts execution at upper.
pakkun stages text-demo shows the declared steps and outputs from pipe.yaml.
pakkun show text-demo:upper/result resolves the latest successful result artifact and shows where it is stored under .pipe/objects.
pakkun publish ... ./build/result.txt materializes that stored artifact back into a normal visible file.
pakkun provenance ... shows which run and step produced the artifact and which prior artifacts fed into it.

If you do not want to build a binary first, you can run the CLI directly with Go:

cd examples/text
go run ../../cmd/pakkun init
go run ../../cmd/pakkun run text-demo

After run, the working tree stays mostly clean:

declared artifacts are stored under .pipe/objects/sha256/...
per-run logs and manifests live under .pipe/runs/<run-id>/
the latest run is addressable as alias:current
outputs that declare publish: <relative-path> in pipe.yaml are materialized into the project tree after a successful run

Using the examples

Each example directory is its own pakkun project.

The normal flow is:

cd examples/<name>
../../pakkun init
../../pakkun status
../../pakkun stages
../../pakkun run <pipeline-name>
../../pakkun retry run:<failed-run-id>
../../pakkun run <pipeline-name> --base run:<run-id> --from <step-name>
../../pakkun log
../../pakkun show <pipeline-name>:<step>/<output>

Things that matter:

Run pakkun init inside the example directory, not at the repo root, unless you want the repo root itself to become a pakkun project.
pakkun always looks for pipe.yaml in the current project root.
pakkun stages without an argument only works when the spec contains a single pipeline.
Refs like text-demo:upper/result point at the latest successful run of that pipeline.
alias:current points at the most recent run, regardless of pipeline.

Useful inspection commands after a run:

../../pakkun status
../../pakkun log
../../pakkun show alias:current
../../pakkun show text-demo:upper
../../pakkun show text-demo:upper/result
../../pakkun provenance text-demo:upper/result
../../pakkun mount text-demo:upper ./mounted
../../pakkun publish text-demo:upper/result ./build/result.txt

What you should expect:

mount creates a directory containing one file per declared output on that stage.
publish writes one specific stored artifact back to a stable user-visible path.
publish requires an artifact ref such as text-demo:upper/result, not a run ref or bare stage ref.
show does not print file contents; it shows metadata and the resolved stored path.
retry creates a new run from a failed run ref, reuses the successful upstream step artifacts from that run, and restarts execution at the first failed step unless --from is given.
run --base ... --from ... does the same reuse against any base run, which is useful when you want to rerun a later step after changing only downstream logic.

Ref model

Supported refs:

<pipeline>:<step>
<pipeline>:<step>/<output>
run:<run-id>
run:<run-id>:<step>
run:<run-id>:<step>/<output>
alias:<name>

Examples:

text-demo:upper
text-demo:upper/result
run:20260311_120000_000000000:upper/result
alias:current
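
The ref shapes above can be told apart with plain shell pattern matching, which is handy in wrapper scripts that need an artifact ref before calling publish. This is an illustrative sketch only: ref_kind and its labels are hypothetical and do not reflect how pakkun parses refs internally.

```shell
#!/bin/sh
# Classify a ref string by shape. ref_kind and the labels it prints are
# hypothetical helpers for scripts, not part of the pakkun CLI.

ref_kind() {
    case "$1" in
        alias:?*)      echo alias ;;
        run:?*:?*/?*)  echo artifact ;;   # run:<run-id>:<step>/<output>
        run:?*:?*)     echo step ;;       # run:<run-id>:<step>
        run:?*)        echo run ;;        # run:<run-id>
        ?*:?*/?*)      echo artifact ;;   # <pipeline>:<step>/<output>
        ?*:?*)         echo step ;;       # <pipeline>:<step>
        *)             echo unknown; return 1 ;;
    esac
}

# Usage: refuse to publish anything that is not an artifact ref.
#   [ "$(ref_kind "$ref")" = artifact ] && pakkun publish "$ref" ./build/out
```

The order of the patterns matters: the more specific run: shapes are tested before the generic pipeline shapes, since both contain a colon.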

Step environment

The runner injects:

PIPE_PROJECT_ROOT
PIPE_RUN_ID
PIPE_STEP_NAME
PIPE_STEP_OUT
PIPE_INPUT_<name-or-output-name> for prior-step inputs declared with from or ref

If an output is declared as name: typed-ast, the input env var becomes PIPE_INPUT_typed_ast.

If an input declaration includes name: baseline, that input becomes PIPE_INPUT_baseline.
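
To make the naming rule concrete, here is a small sketch of a step body that reads one input and writes one output. It rests on assumptions: pipe_input_var only rewrites hyphens, because that is the only case the example above shows, PIPE_INPUT_text is assumed to hold the path of a file input named text, and upper_step is a hypothetical step, not the one in examples/text.

```shell
#!/bin/sh
# pipe_input_var and upper_step are hypothetical helpers for illustration.

# Derive the env var name for an input or output name.
# Assumption: hyphens become underscores; other characters are untouched.
pipe_input_var() {
    printf 'PIPE_INPUT_%s\n' "$(printf '%s' "$1" | sed 's/-/_/g')"
}

# A step body: read the input file, write the result under PIPE_STEP_OUT.
# Assumption: PIPE_INPUT_text is the path of a file input named "text".
upper_step() {
    mkdir -p "$PIPE_STEP_OUT"
    tr '[:lower:]' '[:upper:]' < "$PIPE_INPUT_text" > "$PIPE_STEP_OUT/result.txt"
}

# Usage:
#   pipe_input_var typed-ast    # prints PIPE_INPUT_typed_ast
```

Writing only under PIPE_STEP_OUT keeps everything the step produces inside its declared output directory.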

Cache boundary

The cache boundary in pakkun is the set of declared step outputs.

That means:

retry and resume reuse stored artifacts from prior successful steps
from: and ref: only consume declared outputs
old step work/ directories are not copied into new runs
undeclared temp files, scratch dirs, and side effects are not resumable cache

If a later step depends on something expensive to prepare, declare it as an output and pass it forward explicitly.

Good pattern:

steps:
  - name: prepare_cache
    kind: shell
    run: ./scripts/build-cache.sh "$PIPE_STEP_OUT/cache"
    outputs:
      - name: cache
        path: cache
        type: dir

  - name: package
    kind: shell
    run: ./scripts/package.sh "$PIPE_INPUT_cache" "$PIPE_STEP_OUT/result.tar.gz"
    inputs:
      - from: prepare_cache/cache
    outputs:
      - name: result
        path: result.tar.gz
        type: file

Bad pattern:

steps:
  - name: prepare_cache
    kind: shell
    run: ./scripts/build-cache.sh .tmp/cache
    outputs:
      - name: marker
        path: done.txt
        type: file

In the second shape, the real prepared state lives outside the declared outputs, so a later resumed run cannot safely reuse it.

Reuse and verification

pipe.yaml can define a shared build pipeline and a derived verification pipeline in the same project.

version: 1

pipelines:
  - name: build
    steps:
      - name: render
        kind: shell
        run: cat input.txt > "$PIPE_STEP_OUT/out.txt"
        inputs:
          - source: input.txt
        outputs:
          - name: text
            path: out.txt
            type: file

  - name: verify
    extends: build
    steps:
      - name: compare
        kind: assert
        inputs:
          - from: render/text
          - ref: build:render/text
        assert:
          trim_space: true
        outputs:
          - name: report
            path: report.txt
            type: file
            publish: out/report.txt

That shape is useful for parity checks such as comparing a newly generated dump to a cached reference artifact without pushing anything to production.

Publish behavior

pakkun supports two ways to materialize managed artifacts back into the visible project tree:

Manual publish with pakkun publish <artifact-ref> <path>
Automatic publish during pakkun run when an output declares publish: <relative-path> in pipe.yaml

Example output declaration:

outputs:
  - name: result
    path: result.txt
    type: file
    publish: build/result.txt

The publish target must stay within the project root. During run, any published paths are printed in the command summary.
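
A wrapper script can apply a check in the same spirit before it generates or edits publish targets. The sketch below is an assumption about what "within the project root" means lexically. It is not pakkun's own validation, is_safe_publish_path is a hypothetical name, and it does not follow symlinks.

```shell
#!/bin/sh
# Lexical check that a publish target is a relative path with no ".."
# component. is_safe_publish_path is hypothetical; pakkun's real check may
# differ, and this version does not resolve symlinks.

is_safe_publish_path() {
    case "$1" in
        ""|/*)                return 1 ;;  # empty or absolute path
        ..|../*|*/..|*/../*)  return 1 ;;  # any ".." path component
        *)                    return 0 ;;
    esac
}

# Usage:
#   is_safe_publish_path build/result.txt && echo "ok to publish"
```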

Example pipelines

examples/text: small, fully runnable text-processing pipeline.
examples/compiler: compiler-shaped pipeline with parse/typecheck/codegen stages using standard shell tools.
examples/dataset: production-shaped data bundle pipeline with multiple extraction steps, published outputs, a reusable reference artifact, and a parity check built with extends, ref, and assert.
examples/quotes: script-driven quote-card PDF pipeline using python3, patch, and pdflatex.
examples/latex: LaTeX-style multi-stage template matching the original spec. This one expects TeX tools such as pdflatex and bibtex.

Suggested order:

Start with examples/text.
Move to examples/compiler once the ref model makes sense.
Try examples/dataset to see a more production-shaped pipeline with published outputs and parity verification.
Try examples/quotes if you want a more realistic script-driven pipeline with a single final PDF artifact.
Use examples/latex only if you have the TeX toolchain installed.

Storage layout

pakkun init creates:

.pipe/
├── aliases/
├── config.yaml
├── db.sqlite
├── mounts/
├── objects/
└── runs/

pakkun stores content-addressed files and directories under .pipe/objects/sha256, plus run manifests and step stdout/stderr under .pipe/runs/<run-id>/.
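
Content addressing means an object is identified by a digest of its bytes, so identical outputs from different runs resolve to the same stored object. The sketch below only demonstrates that property for single files. file_digest is a hypothetical helper, and how pakkun lays out paths under sha256/ or hashes directories is not shown here because this README does not specify it.

```shell
#!/bin/sh
# Print the SHA-256 digest of a file using whichever tool is installed.
# file_digest is a hypothetical helper, not part of pakkun.

file_digest() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | cut -d ' ' -f 1      # GNU coreutils
    elif command -v shasum >/dev/null 2>&1; then
        shasum -a 256 "$1" | cut -d ' ' -f 1  # Perl-based tool, common on macOS
    else
        sha256 -q "$1"                        # BSD-style tool
    fi
}

# Usage: two files with the same bytes have the same digest.
#   [ "$(file_digest a.txt)" = "$(file_digest b.txt)" ] && echo "same object"
```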

Notes

mount defaults to symlinks and falls back to copies when needed.
publish defaults to copy mode; config loading also accepts legacy expose_mode and projection_mode keys for compatibility.
The CLI resolves <pipeline>:<step> refs against the latest successful run of that pipeline.
Output paths are constrained to the per-step output directory; a step cannot declare outputs outside PIPE_STEP_OUT.
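
The symlink-then-copy fallback mentioned for mount is a common pattern and can be sketched in one line of shell. This is an illustration of the idea, not pakkun's implementation, and link_or_copy is a hypothetical name.

```shell
#!/bin/sh
# Try a symlink first; fall back to a recursive copy if linking fails,
# for example on filesystems without symlink support.
# link_or_copy is hypothetical. Pass an absolute source path so the link
# stays valid regardless of where the destination lives.

link_or_copy() {
    src=$1
    dst=$2
    ln -s "$src" "$dst" 2>/dev/null || cp -R "$src" "$dst"
}

# Usage:
#   link_or_copy "$PWD/stored/result.txt" ./mounted/result.txt
```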