GitHub - xorq-labs/xorq: A compute manifest and composable tools for ML, built on Ibis, DataFusion, and Arrow Flight.


The Problem

You write a feature pipeline. It works on your laptop with DuckDB. Deploying it to Snowflake means a rewrite. Intermediate results should be cached, so you add infrastructure and a result-naming scheme. Then a requirement to track pipeline changes arrives, so you add a metadata store. Congrats, you're going to production! Time to add a serving layer ...

Six months later: five tools that don't talk to each other, and a pipeline only one person understands.

Pain | Symptom
Glue code everywhere | Each engine is a silo. Moving between them means rewriting, not composing.
Runtime feedback | Imperative Python code where you can only tell if something will fail while running the job.
Unnecessary recomputations | No shared understanding of what changed. Everything runs from scratch.
Opaque lineage | Feature logic, metadata, lineage: all in different systems. Debugging means archaeology.
"Works on my machine" | Environments drift. Reproducing results means reverse engineering someone's setup and interrogating your own.
Stateful orchestrators | Retry logic, task states, failure recovery. Another system to manage, another thing that breaks.

Feature stores, model registries, orchestrators: vertical silos that don't serve agentic processes, which need context and skills, not categories.

Xorq


Manifest = Context. Every ML computation becomes a structured, input-addressed YAML manifest.

Exprs = Tools. A catalog to discover. A build system to deterministically execute anywhere, with user-directed caching.

Templates = Skills. Ready-made skills to get started, e.g. scikit-learn pipelines, feature stores, semantic layers.

$ pip install xorq[examples]
$ xorq init -t penguins

The Expression

Write declarative Ibis expressions that can be run like a tool. Xorq extends Ibis with caching, multi-engine execution, and UDFs.

import ibis
import xorq.api as xo
from xorq.common.utils.ibis_utils import from_ibis
from xorq.caching import ParquetCache

# Plain Ibis: load the example dataset and build a deferred aggregation.
penguins = ibis.examples.penguins.fetch()

penguins_agg = (
    penguins
    .filter(ibis._.species.notnull())
    .group_by("species")
    .agg(avg_bill_length=ibis._.bill_length_mm.mean())
)

# Lift the Ibis expression into xorq and mark the node for Parquet caching.
expr = (
    from_ibis(penguins_agg)
    .cache(ParquetCache.from_kwargs())
)

Declare .cache() on any node. Xorq handles the rest. No cache keys to generate or manage, no invalidation logic to write.
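
Execution stays ordinary Ibis; a minimal sketch, assuming the expr built above and the standard execute() method on xorq expressions:

# First run computes the aggregation and writes the Parquet cache;
# later runs with the same inputs read straight from it.
df = expr.execute()
df = expr.execute()  # served from cache, no recomputation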

Compose across engines

One expression, many engines. Part of your pipeline runs on DuckDB, part on Xorq's embedded DataFusion engine, and UDFs run via Arrow Flight. Xorq handles the data transit between them with low overhead. Bye bye, glue code.

expr = from_ibis(penguins).into_backend(xo.sqlite.connect())
expr.ls.backends
(<xorq.backends.sqlite.Backend at 0x7926a815caa0>,
 <xorq.backends.duckdb.Backend at 0x7926b409faa0>)

Expressions are tools, Arrow is the pipe

Unix gave us small programs that compose via stdout. Xorq gives you expressions that compose via Arrow.

In [6]: expr.to_pyarrow_batches()
Out[6]: <pyarrow.lib.RecordBatchReader at 0x15dc3f570>
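
The reader streams record batches, so a consumer can start processing before the full result exists; a small sketch:

reader = expr.to_pyarrow_batches()
for batch in reader:                  # each item is a pyarrow.RecordBatch
    print(batch.num_rows, batch.schema.names)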

The Manifest

Build an expression, get a manifest.

$ xorq build expr.py
builds/28ecab08754e
$ tree builds/28ecab08754e
builds/28ecab08754e
├── database_tables
│   └── f2ac274df56894cb1505bfe8cb03940e.parquet
├── expr.yaml
├── metadata.json
└── profiles.yaml

No external metadata store. No separate lineage tool. The build directory is the versioned, cached, portable artifact.

# Input-addressed, composable, portable
# Abridged expr.yaml
nodes:
  '@read_31f0a5be3771':
    op: Read
    name: penguins
    source: builds/28ecab08754e/.../f2ac274df56894cb1505bfe8cb03940e.parquet

  '@filter_23e7692b7128':
    op: Filter
    parent: '@read_31f0a5be3771'
    predicates:
      - NotNull(species)

  '@remotetable_9a92039564d4':
    op: RemoteTable
    remote_expr:
      op: Aggregate
      parent: '@filter_23e7692b7128'
      by: [species]
      metrics:
        avg_bill_length: Mean(bill_length_mm)

  '@cachednode_e7b5fd7cd0a9':
    op: CachedNode
    parent: '@remotetable_9a92039564d4'
    cache:
      type: ParquetCache
      path: parquet

Reproducible builds

The manifest is roundtrippable and machine-writeable. Git-diff your pipelines. Code-review your features. Track Python dependencies. Rebuild from YAML alone.

$ xorq uv-build expr.py
builds/28ecab08754e/

$ ls builds/28ecab08754e/*.tar.gz
builds/28ecab08754e/sdist.tar.gz  builds/28ecab08754e/my-pipeline-0.1.0.tar.gz

The build captures everything: the expression graph, dependencies, memory tables. Share the build, sdist included, and get identical results. No "works on my machine."
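
The manifest is also plain YAML, so it is readable outside of xorq; a sketch assuming PyYAML and the abridged expr.yaml structure shown earlier:

import yaml

with open("builds/28ecab08754e/expr.yaml") as f:
    manifest = yaml.safe_load(f)

# Nodes are keyed by their input-derived hashes, e.g. '@read_31f0a5be3771'.
print(list(manifest["nodes"]))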

Only recompute what changed

The manifest is input-addressed: same inputs = same hash. Change an input, get a new hash.

expr.ls.get_cache_paths()
(PosixPath('/home/user/.cache/xorq/parquet/letsql_cache-7c3df7ccce5ed4b64c02fbf8af462e70.parquet'),)

The hash is the cache key. No invalidation logic to debug. If the expression is the same, the hash is the same, and the cache is valid. Change an input, get a new hash, trigger recomputation.

Traditional caching asks "has this expired?" Input-addressed caching asks "is this the same computation?" The second question has a deterministic answer.
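
The principle in miniature (illustrative only, not xorq's actual hashing scheme):

import hashlib
import json

def cache_key(inputs: dict) -> str:
    # Canonicalize the inputs, then hash: identical inputs always produce
    # the identical key, so "is the cache valid?" becomes
    # "does this key already exist?".
    canonical = json.dumps(inputs, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:32]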


The Tools

The manifest provides context. The tools provide skills: catalog, introspect, serve, execute.

Catalog

# Add to catalog
$ xorq catalog add builds/28ecab08754e/ --alias penguins-agg
Added build 28ecab08754e as entry a498016e-5bea-4036-aec0-a6393d1b7c0f revision r1

# List entries
$ xorq catalog ls
Aliases:
penguins-agg    a498016e-5bea-4036-aec0-a6393d1b7c0f    r1
Entries:
a498016e-5bea-4036-aec0-a6393d1b7c0f    r1      28ecab08754e

Run

$ xorq run builds/28ecab08754e -o out.parquet
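
The output is ordinary Parquet, so any Arrow-aware reader can pick it up; for example:

import pyarrow.parquet as pq

table = pq.read_table("out.parquet")   # result of the cached aggregation
print(table.to_pandas())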

Serve

Serve expressions anywhere via Arrow Flight:

$ xorq serve-unbound builds/28ecab08754e/ \
  --to_unbind_hash 31f0a5be37713fe2c1a2d8ad8fdea69f \
  --host localhost --port 9002

Then, from a Python client:

import xorq.api as xo

backend = xo.flight.connect(host="localhost", port=9002)
f = backend.get_exchange("default")

data = {
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Dream"],
    "bill_length_mm": [39.1, 47.5, 49.0],
    "bill_depth_mm": [18.7, 14.2, 18.5],
    "flipper_length_mm": [181, 217, 195],
    "body_mass_g": [3750, 5500, 4200],
    "sex": ["male", "female", "male"],
    "year": [2007, 2008, 2009],
}

xo.memtable(data).pipe(f).execute()
     species  avg_bill_length
0     Adelie             39.1
1  Chinstrap             49.0
2     Gentoo             47.5

Debug with confidence

No more archaeology. Lineage is encoded in the manifest—not scattered across tools—and queryable from the CLI.

$ xorq lineage penguins-agg

Lineage for column 'avg_bill_length':
Field:avg_bill_length #1
└── Cache xorq_cached_node_name_placeholder #2
    └── RemoteTable:236af67d399a4caaf17e0bf5e1ac4c0f #3
        └── Aggregate #4
            ├── Filter #5
            │   ├── Read #6
            │   └── NotNull #7
            │       └── Field:species #8
            │           └── ↻ see #6
            ├── Field:species #9
            │   └── ↻ see #5
            └── Mean #10
                └── Field:bill_length_mm #11
                    └── ↻ see #5

Workflows, without state

No task states. Just retry on failure.

Xorq executes expressions as Arrow RecordBatch streams. There's no DAG of tasks to checkpoint, just data flowing through operators. If something fails, rerun from the manifest. Cached nodes resolve instantly; the rest recomputes.
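
Retry therefore collapses to "run the manifest again"; a hypothetical wrapper around the CLI (names and backoff are illustrative):

import subprocess
import time

def run_with_retry(build_dir: str, attempts: int = 3) -> None:
    for attempt in range(attempts):
        # Re-running is safe: cached nodes resolve from the Parquet cache,
        # only uncached work is recomputed.
        result = subprocess.run(["xorq", "run", build_dir, "-o", "out.parquet"])
        if result.returncode == 0:
            return
        time.sleep(2 ** attempt)
    raise RuntimeError(f"{build_dir} failed after {attempts} attempts")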

Scikit-learn Integration

Xorq translates scikit-learn Pipeline objects to deferred expressions:

from xorq.expr.ml.pipeline_lib import Pipeline

sklearn_pipeline = ...
xorq_pipeline = Pipeline.from_instance(sklearn_pipeline)
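
A concrete sketch of the conversion, assuming an ordinary scikit-learn Pipeline (the estimators chosen here are illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline as SklearnPipeline
from sklearn.preprocessing import StandardScaler

from xorq.expr.ml.pipeline_lib import Pipeline

sklearn_pipeline = SklearnPipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Wrap the scikit-learn object as a deferred xorq pipeline.
xorq_pipeline = Pipeline.from_instance(sklearn_pipeline)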

Templates

Ready-to-start code as skills:

$ xorq init -t <template>
Template | Description
penguins | Minimal example: caching, aggregation, multi-engine
sklearn | Classification pipeline with train/predict separation

Skills for humans

Templates are easy-to-start components: expressions ready to be composed with your own sources.

Coming Soon

  • feast — Feature store integration
  • boring-semantic-layer — Metrics and dimensions catalog
  • dbt — dbt model composition
  • Feature Selection

The Horizontal Stack

Write in Python. Catalog as YAML. Compose anywhere via Ibis. Portable compute engine built on DataFusion. Universal UDFs via Arrow Flight.


Lineage, caching, and versioning travel with the manifest, cataloged rather than locked in a vendor's database.

Integrations: Ibis • scikit-learn • Feast (WIP) • dbt (upcoming)


Learn More


Pre-1.0. Expect breaking changes with migration guides.