Every strong model starts with great data
Macrodata Labs helps robotics teams turn raw physical-world data into better training datasets. Refiner, our open-source data processing framework, lets you build pipelines locally in Python, then scale the same pipeline on managed cloud compute.
// 01
Introducing Refiner: focus on data, not infrastructure
COMPOSABLE BY DESIGN
Define pipelines from simple primitives. Refiner handles scale, orchestration, and everything in between.
NATIVELY MULTIMODAL
Process robot episodes with trajectories, camera streams, audio, and language in one pipeline. Refiner handles streaming IO, sharding, and native data formats.
BUILT FOR MODELS
Deploy open-source models or bring your own API. Async execution, smart batching, parallelism, and retries are handled either way, locally or at cloud scale.
import refiner as mdr

(
    mdr.read_hdf5(
        "hf://datasets/nvidia/ALOHA-Cosmos-Policy/**/*.hdf5",
        groups="/",
        datasets={"action": "action", "observation.state": "observations/qpos"},
    )
    .to_robot_rows(fps=25, robot_type="aloha")
    .write_lerobot("s3://robots/aloha-lerobot")
)
// 02
Get more from robotics data
INGEST ANY FORMAT
Read and convert Parquet, HDF5, MCAP, Zarr, RLDS, and LeRobot without custom scripts or slow local downloads.
SUBTASK ANNOTATIONS & HAND-TRACKING
Use optimized pipelines for timestamped subtask annotation and ego-vision hand tracking across robot episodes.
REWARD MODELS
Estimate task-completion progress with reward models such as Robometer, then use those scores to weight the frames that matter most.
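One way to read "weight the frames that matter most" is to treat the reward model's output as a per-frame progress estimate in [0, 1] and give more weight to frames where progress advances. The sketch below illustrates that idea only; it is not Refiner's or Robometer's implementation, and `progress_weights` and its `floor` parameter are hypothetical names.

```python
def progress_weights(progress, floor=0.1):
    """Turn per-frame progress estimates into normalized sampling weights.

    Assumption: `progress` holds one float in [0, 1] per frame.
    Frames where progress increases get weight proportional to the gain;
    every frame keeps a small `floor` so none is dropped entirely.
    """
    if not progress:
        return []
    # Gain relative to the previous frame; the first frame has no predecessor.
    gains = [0.0] + [max(0.0, b - a) for a, b in zip(progress, progress[1:])]
    raw = [floor + g for g in gains]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical episode: progress stalls, jumps, stalls, then jumps again.
weights = progress_weights([0.0, 0.0, 0.5, 0.5, 1.0])
```

Here the two frames where progress jumps receive most of the weight, while the stalled frames keep only the floor. Other schemes (for example, weighting by absolute progress) are equally plausible; which one helps depends on the training setup.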
// 03
Scale instantly with launch_cloud()
ONE LINE TO CLOUD
.launch_local() becomes .launch_cloud(). Scale the same pipeline without rewriting code, changing data formats, or rebuilding your local workflow.
INSTANT CPU & GPU ACCESS
Run many shards across managed CPU and GPU workers without reservations or machine provisioning. Macrodata Labs handles orchestration, scheduling, and worker lifecycle.
PAY FOR WHAT YOU USE
Resources attach when work starts and release when it finishes. You pay for the compute your jobs actually consume, without idle cluster overhead.
Switching pipeline.launch_local() to pipeline.launch_cloud() runs the same pipeline on 5 × H100 GPUs. A run that takes 8m 00s locally finishes in 48s, a 10× speedup, with throughput rising by the same factor from 5 MB/s to 50 MB/s, for about $0.27 per run, billed per second.
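The run time, GPU count, and cost figures above can be checked with a few lines of arithmetic. The implied hourly rate assumes GPU time is the only billed resource, which the page does not state, so treat that number as a rough back-of-the-envelope figure.

```python
local_seconds = 8 * 60   # 8m 00s locally, from the figures above
cloud_seconds = 48       # cloud run time, from the figures above
gpus = 5                 # H100 GPUs used in the cloud run
cost_per_run = 0.27      # USD per run, from the figures above

speedup = local_seconds / cloud_seconds
gpu_seconds = gpus * cloud_seconds
# Assumption: the whole cost is GPU time, billed per second.
implied_usd_per_gpu_hour = cost_per_run / gpu_seconds * 3600

print(f"speedup: {speedup:.0f}x")                                      # speedup: 10x
print(f"GPU-seconds billed: {gpu_seconds}")                            # GPU-seconds billed: 240
print(f"implied rate: ${implied_usd_per_gpu_hour:.2f} per GPU-hour")  # implied rate: $4.05 per GPU-hour
```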
// 04
Supervise in real time
TRACE EVERY DATASET
Inspect the DAG, transforms, launch settings, dependencies, and captured code behind each dataset build, so every output is traceable back to the run that produced it.
PINPOINT FAILURES
See the stage, shard, worker, traceback, logs, and retry state for each failure instead of piecing together what happened after the fact. Inspect failures on the web platform or let your agent pull the same details through the CLI.
SURFACE BOTTLENECKS
See whether a run is limited by decoding, model calls, writing, CPU, memory, network, or GPU before scaling it further.
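The reasoning behind checking bottlenecks before scaling can be sketched with per-stage timings: if most of the wall time goes to one stage, adding capacity for a different resource will not shorten the run much. The stage names and numbers below are made up for illustration and do not come from Refiner's telemetry.

```python
# Hypothetical seconds of wall time spent per stage in one run.
stage_seconds = {"decoding": 310.0, "model_calls": 95.0, "writing": 40.0}

total = sum(stage_seconds.values())
shares = {stage: secs / total for stage, secs in stage_seconds.items()}
bottleneck = max(shares, key=shares.get)

# Print stages from largest to smallest share of the run.
for stage, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stage:<12} {share:6.1%}")
print(f"limited by: {bottleneck}")
```

In this made-up run, decoding dominates, so adding GPU workers for model calls would leave most of the run time untouched.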
// GET STARTED
Get more from your robotics data.
Use Refiner to start extracting more signal from your data today, or reach out to discuss your robotics data challenges directly.