An AI-native job scheduler written in Rust. Drop-in compatible with Slurm's CLI, REST API, and C FFI while providing WireGuard mesh networking, GPU-first scheduling, and modern state management.
Quick Start
One-Line Install
```bash
curl -fsSL https://raw.githubusercontent.com/ROCm/spur/main/install.sh | bash
```
Downloads the latest release binaries and installs them to `~/.spur/bin`. Add to your PATH with `export PATH="$HOME/.spur/bin:$PATH"`.
See docs/quickstart.md for the full walkthrough — single-node setup in 5 minutes, multi-node with WireGuard mesh, GPU job examples, and troubleshooting.
Docker
```bash
# Latest release
docker build -t spur .

# Nightly
docker build --build-arg VERSION=nightly -t spur:nightly .

# Run
docker run --rm spur sinfo
docker run -d --name spurctld -p 6817:6817 spur spurctld --listen=[::]:6817
```
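To bring the controller and a node agent up together in containers, a shared Docker network lets the agent reach the controller by name. A minimal sketch, assuming the `spur` image built above and only the flags shown elsewhere in this README (`--listen`, `--controller`); the network name and container wiring are illustrative:

```bash
# Put both containers on one user-defined network so "spurctld" resolves by name
docker network create spurnet
docker run -d --name spurctld --network spurnet -p 6817:6817 spur spurctld --listen=[::]:6817
docker run -d --name spurd --network spurnet spur spurd --controller http://spurctld:6817
```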
For Kubernetes deployment, see deploy/k8s/.
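If the manifests under deploy/k8s/ are plain Kubernetes resources, deploying them is expected to reduce to a single apply; a sketch, not a verified command:

```bash
# Assumes the manifests apply as-is; see deploy/k8s/ for specifics
kubectl apply -f deploy/k8s/
```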
Build from Source
```bash
# Prerequisites
sudo apt install protobuf-compiler
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Build
cargo build --release

# Run tests
cargo test
```
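A release build leaves the binaries in cargo's default target/release/ directory, so a quick smoke test looks like this (the `--help` flag is assumed, not documented here):

```bash
# Default cargo output path; --help assumed
./target/release/spur version
./target/release/spurctld --help
```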
Build Portable Binaries
Build glibc 2.28+ compatible binaries using Docker (works on Ubuntu 20.04+, RHEL 8+, Debian 10+):
```bash
./deploy/build-portable.sh   # outputs to dist/bin/
```
Binaries
| Binary | Description |
|---|---|
| `spur` | CLI (multi-call binary) |
| `spurctld` | Controller daemon |
| `spurd` | Node agent daemon |
| `spurdbd` | Accounting daemon (PostgreSQL) |
| `spurrestd` | REST API daemon |
| `libspur_compat.so` | C FFI shim |
Start a Cluster
```bash
# 1. Create config
mkdir -p /etc/spur
cat > /etc/spur/spur.conf <<'EOF'
cluster_name = "my-cluster"

[[partitions]]
name = "default"
default = true
nodes = "node[001-008]"
max_time = "24:00:00"

[[nodes]]
names = "node[001-008]"
cpus = 64
memory_mb = 256000
EOF

# 2. Start controller
spurctld -D --state-dir /var/spool/spur

# 3. Start node agent (on each compute node)
spurd -D --controller http://controller:6817

# 4. Optionally start REST API and accounting
spurrestd --controller http://controller:6817
spurdbd --database-url postgresql://spur:spur@localhost/spur --migrate
```
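Once the controller and at least one agent are running, the native CLI commands from the next section can confirm the cluster is healthy, for example:

```bash
spur nodes   # node[001-008] should appear as their agents register
spur queue   # empty on a fresh cluster
```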
CLI Usage
Native Commands
```bash
spur submit job.sh        # Submit a batch job
spur queue                # View job queue
spur cancel 12345         # Cancel a job
spur nodes                # View cluster nodes
spur history              # View job history (accounting)
spur show job 12345       # Detailed job info
spur show node node001    # Detailed node info
spur show partition gpu   # Detailed partition info
spur version              # Show version
```
Slurm-Compatible Commands
Works via symlinks or subcommands — existing scripts and muscle memory work unchanged:
```bash
# Via symlinks (create once)
ln -s $(which spur) /usr/local/bin/sbatch
ln -s $(which spur) /usr/local/bin/squeue
ln -s $(which spur) /usr/local/bin/scancel
ln -s $(which spur) /usr/local/bin/sinfo
ln -s $(which spur) /usr/local/bin/sacct
ln -s $(which spur) /usr/local/bin/scontrol

# Then use as normal
sbatch --job-name=train -N4 --gres=gpu:mi300x:8 train.sh
squeue -u $USER
scancel 12345
sinfo -N
sacct -u $USER --starttime=2024-01-01

# Or as subcommands
spur sbatch job.sh
spur squeue -u alice
```
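The scontrol symlink is created above but not demonstrated; assuming it accepts the same show forms that map onto spur show, usage would look like:

```bash
# Assumed to map onto "spur show job" / "spur show node"
scontrol show job 12345
scontrol show node node001
```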
#SBATCH Directives
Batch scripts work exactly as they do in Slurm:
```bash
#!/bin/bash
#SBATCH --job-name=training
#SBATCH -N 4
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:mi300x:8
#SBATCH --time=4:00:00
#SBATCH --partition=gpu

torchrun --nproc_per_node=8 train.py
```
#PBS directives are also parsed for PBS/Torque migration.
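For illustration, the batch script above rewritten with PBS/Torque-style directives might look like the sketch below; the exact set of #PBS options Spur recognizes is not enumerated here, so treat the option list as an assumption:

```bash
#!/bin/bash
# PBS option set shown here is illustrative
#PBS -N training
#PBS -l nodes=4:ppn=8
#PBS -l walltime=4:00:00
#PBS -q gpu

torchrun --nproc_per_node=8 train.py
```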
REST API
Spur serves two API endpoints:
- `/api/v1/` — Native Spur API
- `/slurm/v0.0.42/` — Slurm-compatible API (drop-in for slurmrestd clients)
```bash
# Submit a job
curl -X POST http://localhost:6820/api/v1/job/submit \
  -H "Content-Type: application/json" \
  -d '{"job": {"name": "test", "partition": "gpu", "nodes": 2, "script": "#!/bin/bash\necho hello"}}'

# List jobs
curl http://localhost:6820/api/v1/jobs

# Cluster health
curl http://localhost:6820/api/v1/ping
```
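The same server also answers on the Slurm-compatible prefix; a sketch assuming the route layout mirrors slurmrestd conventions (the individual routes are not listed in this README):

```bash
# Routes assumed from slurmrestd conventions
curl http://localhost:6820/slurm/v0.0.42/jobs
curl http://localhost:6820/slurm/v0.0.42/ping
```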
C FFI (libspur_compat.so)
Drop-in replacement for libslurm.so. Exports Slurm-compatible C symbols:
```c
#include <slurm/slurm.h>  // Use Slurm headers

job_desc_msg_t desc;
slurm_init_job_desc_msg(&desc);
desc.name = "my_job";
desc.script = "#!/bin/bash\necho hello";

uint32_t job_id;
slurm_submit_batch_job(&desc, &job_id);
printf("Submitted job %u\n", job_id);
```
Exported functions: slurm_init_job_desc_msg, slurm_submit_batch_job, slurm_load_jobs, slurm_free_job_info_msg, slurm_load_node, slurm_load_partitions, slurm_kill_job
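To build the example against the shim instead of Slurm's own library, linking should look roughly like this; the source file name and install paths are placeholders, so adjust -L and -rpath to wherever libspur_compat.so lives:

```bash
# submit_job.c is the example above (hypothetical file name); paths are placeholders
gcc submit_job.c -o submit_job -L/usr/local/lib -lspur_compat -Wl,-rpath,/usr/local/lib
./submit_job
```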
Architecture
```
┌──────────────────────────────┐
│         API Surface          │
│  CLI (spur / sbatch / srun)  │
│  REST (spurrestd)            │
│  C FFI (libspur_compat.so)   │
└──────────────┬───────────────┘
               │ gRPC
┌──────────────▼───────────────┐
│       spurctld (Rust)        │
│  ┌─────────┐  ┌────────────┐ │
│  │Backfill │  │ Raft Log + │ │
│  │Scheduler│  │ Snapshots  │ │
│  │         │  │ (openraft) │ │
│  └─────────┘  └────────────┘ │
└──────┬───────────────┬───────┘
       │               │
┌──────▼─────────┐ ┌───▼────────────┐
│ spurd (Rust)   │ │ spurdbd (Rust) │
│ Node agent     │ │ Accounting     │
│ - cgroups v2   │ │ - PostgreSQL   │
│ - GPU enum     │ │ - Fair-share   │
│ - fork/exec    │ │ - Job history  │
└────────────────┘ └────────────────┘
```
Configuration
TOML format (/etc/spur/spur.conf):
```toml
cluster_name = "production"

[controller]
listen_addr = "[::]:6817"
state_dir = "/var/spool/spur"
hosts = ["ctrl1", "ctrl2"]

[scheduler]
plugin = "backfill"
interval_secs = 1
fairshare_halflife_days = 14

[accounting]
database_url = "postgresql://spur:spur@db1/spur"

[[partitions]]
name = "gpu"
default = true
nodes = "gpu[001-064]"
max_time = "72:00:00"

[[partitions]]
name = "cpu"
nodes = "cpu[001-256]"
max_time = "168:00:00"

[[nodes]]
names = "gpu[001-064]"
cpus = 128
memory_mb = 512000
gres = ["gpu:mi300x:8"]

[[nodes]]
names = "cpu[001-256]"
cpus = 256
memory_mb = 1024000
```
Environment Variables
| Variable | Default | Description |
|---|---|---|
| `SPUR_CONTROLLER_ADDR` | `http://localhost:6817` | Controller gRPC address |
| `SPUR_ACCOUNTING_ADDR` | `http://localhost:6819` | Accounting gRPC address |
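Both variables can be set per invocation to point the CLI at a remote cluster, for example:

```bash
# One-off overrides of the documented defaults
SPUR_CONTROLLER_ADDR=http://ctrl1:6817 spur queue
SPUR_ACCOUNTING_ADDR=http://ctrl1:6819 spur history
```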
Auto-Update
Spur queries the GitHub releases API
to detect newer versions. Daemons (spurctld, spurd) check on startup and
log an info message when an update is available — they never auto-restart. The
spur CLI provides explicit commands to check and install.
CLI
```bash
# Show current version
spur version

# Check the stable channel for a newer release (no install)
spur version --check

# Download + verify (sha256) + install latest stable; replaces all five
# binaries (spur, spurctld, spurd, spurdbd, spurrestd) atomically with
# rollback on failure. Existing daemons keep running until restarted.
sudo spur self-update

# Track the nightly channel instead
sudo spur self-update --nightly
```
Daemon startup check
Both spurctld and spurd emit an info log line on startup if a newer
release is available. Disable via the [update] section in spur.conf:
```toml
[update]
check_on_startup = true       # default: true — startup check
auto_update = false           # default: false — never download silently
channel = "stable"            # "stable" or "nightly"
cache_dir = "/var/cache/spur" # 1h TTL JSON cache (avoids API spam)
```
`check_on_startup = false` skips the GitHub query entirely. `auto_update`
exists for orchestrators that want unattended installs but is opt-in — even
with `auto_update = true`, the daemons do not restart themselves.
Project Structure
```
spur/
├── proto/slurm.proto     # gRPC service definitions
├── docs/quickstart.md    # Getting started guide
├── crates/
│   ├── spur-proto/       # Generated gRPC code
│   ├── spur-core/        # Core types: Job, Node, ResourceSet, config, hostlist, partition, qos, and more
│   ├── spur-net/         # WireGuard mesh networking, address detection
│   ├── spur-sched/       # Backfill scheduler, priority, timeline
│   ├── spurctld/         # Controller daemon
│   ├── spurd/            # Node agent daemon
│   ├── spurdbd/          # Accounting daemon
│   ├── spurrestd/        # REST API daemon
│   ├── spur-cli/         # CLI binary (multi-call: spur, sbatch, squeue, ...)
│   ├── spur-ffi/         # C FFI shim (libspur_compat.so)
│   ├── spur-spank/       # SPANK plugin host
│   ├── spur-k8s/         # K8s integration
│   └── spur-tests/       # Test suite (mirrors Slurm numbering)
```
Testing
```bash
cargo test                 # Run all tests
cargo test -p spur-tests   # Run integration test suite only
cargo test -p spur-core    # Run core library tests only
```
Test groups mirror Slurm's testsuite numbering for a 1:1 mapping:
| Group | Coverage |
|---|---|
| t05 | Job queue filtering and display |
| t07 | Scheduler (backfill, timelines, partition filtering) |
| t17 | Job submission (directives, defaults, arrays, deps) |
| t24 | Priority and fair-share |
| t28 | Job arrays |
| t50 | Core types (state machine, resources, GRES) |
| t51 | Hostlist expansion/compression |
| t52 | Configuration parsing |
| t53 | (removed — covered by spurctld Raft tests) |
| t55 | Output format conformance |
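To run a single group, cargo's substring name filter should work, assuming test names carry the group prefix (for example t07):

```bash
# Name filter assumed to match the tXX prefix in test names
cargo test -p spur-tests t07
```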
License
Apache-2.0