Macrodata Labs — every strong model starts with great data


Every strong model starts with great data

Macrodata Labs helps robotics teams turn raw physical-world data into better training datasets. Refiner, our open-source data processing framework, lets you build pipelines locally in Python, then scale the same pipeline on managed cloud compute.

// 01

Introducing Refiner: focus on data, not infrastructure

COMPOSABLE BY DESIGN

Define pipelines from simple primitives. Refiner handles scale, orchestration, and everything in between.

NATIVELY MULTIMODAL

Process robot episodes with trajectories, camera streams, audio, and language in one pipeline. Refiner handles streaming IO, sharding, and native data formats.

BUILT FOR MODELS

Deploy open-source models or bring your own API. Async execution, smart batching, parallelism, and retries are handled either way, locally or at cloud scale.
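To make "async execution, smart batching, parallelism, and retries" concrete, here is a minimal standard-library sketch of that pattern around a model call. It is not Refiner's implementation or API: `make_batches`, `call_with_retries`, `run_batches`, and `flaky_model` are hypothetical names for this example, and the model call is a stub that fails once per batch.

```python
import asyncio


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


async def call_with_retries(fn, batch, max_retries=3, base_delay=0.01):
    """Call an async function, retrying with exponential backoff on failure."""
    for attempt in range(max_retries + 1):
        try:
            return await fn(batch)
        except Exception:
            if attempt == max_retries:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)


async def run_batches(fn, items, batch_size=4, max_concurrency=2):
    """Run fn over batches concurrently, capped by a semaphore; keep input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(batch):
        async with semaphore:
            return await call_with_retries(fn, batch)

    batches = make_batches(items, batch_size)
    results = await asyncio.gather(*(worker(b) for b in batches))
    return [out for batch_out in results for out in batch_out]


# Hypothetical model stub: the first call for each batch fails, the retry succeeds.
_seen = set()


async def flaky_model(batch):
    key = tuple(batch)
    if key not in _seen:
        _seen.add(key)
        raise RuntimeError("transient failure")
    return [f"label:{item}" for item in batch]


labels = asyncio.run(run_batches(flaky_model, list(range(10))))
print(labels)
```

The semaphore caps how many batches are in flight at once, and `asyncio.gather` returns results in submission order, so the labels line up with the inputs even when some batches are retried.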

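import refiner as mdr

# Read ALOHA episodes from the Hugging Face Hub, convert them to robot rows,
# and write the result as a LeRobot dataset on S3.
(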
    mdr.read_hdf5(
        "hf://datasets/nvidia/ALOHA-Cosmos-Policy/**/*.hdf5",
        groups="/",
        datasets={"action": "action", "observation.state": "observations/qpos"},
    )
    .to_robot_rows(fps=25, robot_type="aloha")
    .write_lerobot("s3://robots/aloha-lerobot")
)

// 02

Get more from robotics data

[Demo: annotating tasks with gemini-3.5-flash, batch 1 of 6]

INGEST ANY FORMAT

Read and convert Parquet, HDF5, MCAP, Zarr, RLDS, and LeRobot without custom scripts or slow local downloads.

SUBTASK ANNOTATIONS & HAND TRACKING

Use optimized pipelines for timestamped subtask annotation and ego-vision hand tracking across robot episodes.

REWARD MODELS

Estimate task-completion progress with reward models such as Robometer, then use those scores to weight the frames that matter most.
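One possible way to turn progress scores into frame weights is sketched below. The page does not say which weighting scheme is used, so this is an assumption: frames where estimated progress increases get more weight, and a small floor keeps every frame in play. `progress_weights` and the scores are hypothetical, and nothing here reflects Robometer's actual output format.

```python
def progress_weights(progress, floor=0.1):
    """Turn per-frame progress estimates in [0, 1] into sampling weights.

    Frames where progress increases get more weight; `floor` keeps every
    frame at a small nonzero weight. Weights are normalized to sum to 1.
    """
    if not progress:
        return []
    # Progress gained at each frame relative to the previous one (first frame: 0).
    gains = [0.0] + [max(0.0, b - a) for a, b in zip(progress, progress[1:])]
    raw = [floor + g for g in gains]
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical scores for a 6-frame episode: idle, then progress, then a stall.
scores = [0.0, 0.0, 0.2, 0.6, 0.6, 1.0]
weights = progress_weights(scores)
print([round(w, 3) for w in weights])
```

With these scores, the idle and stalled frames fall to the floor weight while the frames that advance the task dominate the sampling distribution.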


// 03

Scale instantly with launch_cloud()

ONE LINE TO CLOUD

.launch_local() becomes .launch_cloud(). Scale the same pipeline without rewriting code, changing data formats, or rebuilding your local workflow.

INSTANT CPU & GPU ACCESS

Run many shards across managed CPU and GPU workers without reservations or machine provisioning. Macrodata Labs handles orchestration, scheduling, and worker lifecycle.

PAY FOR WHAT YOU USE

Resources attach when work starts and release when it finishes. You pay for the compute your jobs actually consume, without idle cluster overhead.


Switching pipeline.launch_local() to pipeline.launch_cloud() runs the same pipeline on 5 × H100 GPUs: 8m 00s locally becomes 48s, a 10× speedup (a 900% increase in speed). On the same data, throughput rises tenfold, from 5 MB/s to about 50 MB/s, for about $0.27 per run, billed per second.
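The runtime and cost figures above can be sanity-checked with a few lines of arithmetic. The per-GPU-hour rate computed here is only what those figures imply, assuming the whole $0.27 is per-second time on the five GPUs; it is not a published price.

```python
local_seconds = 8 * 60      # 8m 00s local run
cloud_seconds = 48          # cloud run
gpus = 5                    # H100 workers
cost_per_run = 0.27         # USD, approximate figure quoted above

speedup = local_seconds / cloud_seconds
gpu_seconds = gpus * cloud_seconds
# Implied rate, assuming the whole cost is per-second GPU time (an assumption).
implied_rate_per_gpu_hour = cost_per_run / gpu_seconds * 3600

print(f"speedup: {speedup:.0f}x")
print(f"GPU-seconds billed: {gpu_seconds}")
print(f"implied rate: ${implied_rate_per_gpu_hour:.2f} per GPU-hour")
```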

// 04

Supervise in real time

TRACE EVERY DATASET

Inspect the DAG, transforms, launch settings, dependencies, and captured code behind each dataset build, so every output is traceable back to the run that produced it.

PINPOINT FAILURES

See the stage, shard, worker, traceback, logs, and retry state for each failure instead of piecing together what happened after the fact. Inspect these details on the web platform or let your agent pull them through the CLI.

SURFACE BOTTLENECKS

See whether a run is limited by decoding, model calls, writing, CPU, memory, network, or GPU before scaling it further.
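As a rough illustration of the bottleneck idea, the sketch below totals time per stage and reports which stage dominates. The function name, stage names, and timings are hypothetical and do not reflect the platform's actual metrics or schema.

```python
from collections import defaultdict


def find_bottleneck(samples):
    """Aggregate (stage, seconds) samples and return the dominant stage.

    Returns (stage, share), where share is that stage's fraction of total time.
    """
    totals = defaultdict(float)
    for stage, seconds in samples:
        totals[stage] += seconds
    grand_total = sum(totals.values())
    if grand_total == 0:
        return None, 0.0
    stage = max(totals, key=totals.get)
    return stage, totals[stage] / grand_total


# Hypothetical per-shard timings in seconds, not real measurements.
samples = [
    ("decode", 12.0), ("model_call", 30.0), ("write", 4.0),
    ("decode", 10.0), ("model_call", 34.0), ("write", 6.0),
]
stage, share = find_bottleneck(samples)
print(f"{stage} accounts for {share:.0%} of measured time")
```

If one stage holds most of the time, adding workers for the other stages is unlikely to help much, which is the reason to check before scaling a run further.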

// GET STARTED

Get more from your robotics data.

Use Refiner to start extracting more signal from your data today, or reach out to discuss your robotics data challenges directly.