shea256/autofoundry: a CLI tool that automates provisioning GPUs across cloud providers and running AI experiments on them


Run any ML experiment script across GPUs on multiple cloud providers with a single command.

Autofoundry CLI demo

Autofoundry is a CLI companion to Karpathy's autoresearch. Point it at a shell script, pick your GPU configuration, and it handles the rest: provisioning instances, distributing experiment runs, streaming results live, and producing a final metrics report.
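The entry point is an ordinary shell script. A minimal sketch of what such a script might look like (the `AUTOFOUNDRY_RUN_ID` variable and the metrics file name are illustrative assumptions, not the documented contract; see the docs on writing experiment scripts for the real one):

```shell
#!/usr/bin/env bash
# Hypothetical experiment script: Autofoundry runs this on each provisioned GPU.
set -euo pipefail

# Assumed: a per-run identifier is provided via the environment; "local" is a fallback.
RUN_ID="${AUTOFOUNDRY_RUN_ID:-local}"
echo "run ${RUN_ID}: starting"

# The actual training command would go here, e.g.:
# python train.py --run-id "$RUN_ID"

# Write a metrics line that a final report could aggregate (illustrative format).
echo "val_loss=0.123" > "metrics_${RUN_ID}.txt"
echo "run ${RUN_ID}: done"
```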

Supported Providers

  • RunPod
  • Vast.ai
  • Lambda Labs
  • PRIME Intellect

Quickstart

git clone https://github.com/autofoundry/autofoundry.git
cd autofoundry
uv tool install autofoundry

Then run:

autofoundry run

On first run, Autofoundry walks you through configuring provider API keys, SSH key path, minimum download bandwidth (default 5000 Mbps — filters out slow Vast.ai hosts), and HuggingFace token. Config is saved to ~/.config/autofoundry/config.toml.
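The resulting file might look something like this (the key names are an illustrative sketch, not the actual schema; check the generated ~/.config/autofoundry/config.toml on your machine):

```toml
# ~/.config/autofoundry/config.toml (illustrative; actual keys may differ)
ssh_key_path = "~/.ssh/id_ed25519"
min_download_mbps = 5000   # filters out slow Vast.ai hosts

[providers]
runpod_api_key = "..."
vastai_api_key = "..."

[huggingface]
token = "..."
```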

Examples

Run experiments

# Interactive mode — walks you through everything
autofoundry run

# Run a specific script
autofoundry run scripts/run_autoresearch.sh

# Specific GPU with multiple experiment runs
autofoundry run train.sh --gpu H100 --num 4

# Auto-select cheapest datacenter GPU with 80GB+ VRAM
autofoundry run train.sh --segment datacenter --min-vram 80 --auto

# Target a specific provider
autofoundry run train.sh --segment datacenter --min-vram 80 --provider runpod --auto
autofoundry run train.sh --segment datacenter --provider lambdalabs --auto
autofoundry run train.sh --segment workstation --provider vastai --auto
autofoundry run train.sh --segment datacenter --provider primeintellect --auto

# Attach a network volume (RunPod, Lambda Labs)
autofoundry run train.sh --volume my-data --provider runpod

# Resume a previous session
autofoundry run --resume <session-id>

Browse GPU inventory

# Browse all available GPUs across providers
autofoundry inventory

# Filter by segment, VRAM, or GPU name
autofoundry inventory --segment datacenter --min-vram 80
autofoundry inventory --gpu A100

Configure

# Interactive setup for API keys, SSH key, and defaults
autofoundry config

Manage volumes

# List volumes across providers
autofoundry volumes list

# Create a new volume
autofoundry volumes create --name my-data --provider runpod

Monitor and manage sessions

# Show all sessions
autofoundry status

# Show a specific session
autofoundry status <session-id>

# View metrics from most recent run
autofoundry results

# Terminate instances for a session
autofoundry teardown <session-id>

See the full documentation for writing experiment scripts, network volumes, resuming sessions, custom images, CLI reference, and architecture details.

Requirements

  • Python 3.11+
  • SSH key pair (ed25519 or RSA)
  • At least one provider API key (RunPod, Vast.ai, PRIME Intellect, or Lambda Labs)
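For the SSH key pair, a standard ed25519 key generated with OpenSSH works. A sketch (the path and key name are examples; any location works as long as Autofoundry's config points at it):

```shell
# Create a dedicated ed25519 key pair for Autofoundry (example path).
KEYDIR="${KEYDIR:-$HOME/.ssh}"
mkdir -p "$KEYDIR"
ssh-keygen -q -t ed25519 -f "$KEYDIR/autofoundry_ed25519" -N "" -C "autofoundry"
```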

License

MIT