Orchestrating 5000 Workers Without Distributed Locks: Rediscovering TDMA
I needed to orchestrate 500-5000 batch workers (ML training, ETL) using Go and SQLite. Every tutorial said: use etcd, Consul, or ZooKeeper.
But why do these processes need to talk to each other at all?
THE INSIGHT
What if orchestrators never run simultaneously?
Runner-0 executes at T=0s, 10s, 20s...
Runner-1 executes at T=2s, 12s, 22s...
Runner-2 executes at T=4s, 14s, 24s...
Runner-3 executes at T=6s, 16s, 26s...
Runner-4 executes at T=8s, 18s, 28s...
Time-Division Multiple Access (TDMA). Same pattern GSM uses for radio.
GO IMPLEMENTATION
type Runner struct {
    ID, TotalRunners int
    CycleTime        time.Duration
}

func (r Runner) Start() {
    slot := r.CycleTime / time.Duration(r.TotalRunners) // e.g. 10s / 5 runners = 2s per slot
    offset := time.Duration(r.ID) * slot                // this runner's position within the cycle

    for {
        time.Sleep(time.Until(computeNextSlot(offset)))
        r.reconcile() // check workers, start if needed
    }
}

Each runner gets 2s in a 10s cycle. No overlap = zero coordination.
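The excerpt above leaves computeNextSlot and reconcile unshown. Here is a minimal, self-contained sketch of how the gaps might be filled in; the absolute-time slot grid, the cycleTime constant, the stubbed reconcile, and the goroutine-based main are all assumptions made for illustration (the post runs each runner as its own process), and the struct is repeated only so the file compiles on its own.

package main

import (
    "fmt"
    "time"
)

const cycleTime = 10 * time.Second // assumed global cycle, matching the post's 10s example

type Runner struct {
    ID, TotalRunners int
    CycleTime        time.Duration
}

func (r Runner) Start() {
    slot := r.CycleTime / time.Duration(r.TotalRunners) // 10s / 5 = 2s
    offset := time.Duration(r.ID) * slot

    for {
        time.Sleep(time.Until(computeNextSlot(offset)))
        r.reconcile()
    }
}

// computeNextSlot anchors slots to an absolute time grid, so every process
// derives the same schedule from nothing but its local clock.
func computeNextSlot(offset time.Duration) time.Time {
    now := time.Now()
    cycleStart := now.Truncate(cycleTime) // start of the current cycle
    next := cycleStart.Add(offset)
    if !next.After(now) {
        next = next.Add(cycleTime) // this cycle's slot already passed; take the next one
    }
    return next
}

// reconcile is a stub standing in for "check workers, start if needed".
func (r Runner) reconcile() {
    fmt.Printf("runner %d: reconciling at %s\n", r.ID, time.Now().Format("15:04:05"))
}

func main() {
    // The real system runs each runner as a separate OS process; goroutines
    // keep this toy in a single file.
    for id := 0; id < 5; id++ {
        r := Runner{ID: id, TotalRunners: 5, CycleTime: cycleTime}
        go r.Start()
    }
    select {}
}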
SQLITE CONFIG
PRAGMA journal_mode=WAL;

dbWrite.SetMaxOpenConns(1)  // one writer
dbRead.SetMaxOpenConns(10)  // concurrent reads

With TDMA slots, only one orchestrator writes at a time, so busy_timeout never triggers.
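For context, here is one way the two pools might be opened from Go. This is a sketch assuming the mattn/go-sqlite3 driver and a state.db file, both assumptions on my part; the post only specifies the pragma and the pool sizes.

package main

import (
    "database/sql"
    "log"

    _ "github.com/mattn/go-sqlite3"
)

// openPools opens one single-connection pool for writes and a wider pool
// for reads against the same SQLite file.
func openPools(path string) (dbWrite, dbRead *sql.DB, err error) {
    dbWrite, err = sql.Open("sqlite3", path)
    if err != nil {
        return nil, nil, err
    }
    // WAL lets readers proceed while the single writer commits.
    if _, err = dbWrite.Exec("PRAGMA journal_mode=WAL;"); err != nil {
        return nil, nil, err
    }
    dbWrite.SetMaxOpenConns(1) // one writer - the orchestrator that owns the current slot

    dbRead, err = sql.Open("sqlite3", path)
    if err != nil {
        return nil, nil, err
    }
    dbRead.SetMaxOpenConns(10) // concurrent reads are safe under WAL
    return dbWrite, dbRead, nil
}

func main() {
    w, r, err := openPools("state.db")
    if err != nil {
        log.Fatal(err)
    }
    defer w.Close()
    defer r.Close()
}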
THE MATH
Capacity = SlotDuration / TimePerWorker = 2000ms / 10ms = 200 workers per runner
5 runners  = 1000 workers
25 runners = 5000 workers (25s cycle, 12.5s avg latency)
For batch jobs running hours, 10s detection latency is irrelevant.
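As a sanity check, the same arithmetic for the 5-runner case written out in Go (purely illustrative, using the post's 2s slots and 10ms per worker check):

package main

import (
    "fmt"
    "time"
)

func main() {
    slot := 2 * time.Second            // one runner's slot
    perWorker := 10 * time.Millisecond // time to check one worker's state
    perRunner := int(slot / perWorker) // 200 workers per runner

    runners := 5
    cycle := time.Duration(runners) * slot // 10s cycle
    avgLatency := cycle / 2                // a change is noticed after ~5s on average

    fmt.Printf("%d workers, %s cycle, %s average detection latency\n",
        perRunner*runners, cycle, avgLatency)
    // Output: 1000 workers, 10s cycle, 5s average detection latency
}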
BENCHMARKS (real data from docs/papers)
System    | Writes/s | Latency | Nodes | Use Case
etcd      | 10,000   | 25ms    | 3-5   | Config
ZooKeeper | 8,000    | 50ms    | 5     | Election
Temporal  | 2,000    | 100ms   | 15-20 | Workflows
Airflow   | 300      | 2s      | 2-3   | Batch
TDMA-SPI  | 40       | 5s avg  | 1-5   | Batch
WHAT YOU GAIN:
- Zero consensus protocols (no Raft/Paxos)
- Single-node deployment possible
- Deterministic behavior
- Radical simplicity

WHAT YOU SACRIFICE:
- Real-time response (<1s)
- High frequency (>1000 ops/sec)
- Arbitrary scale (limit ~5000 workers)
UNIVERSAL PATTERN
Wireless Sensor Networks: DD-TDMA (IEEE 2007) - same pattern
Kubernetes Controllers: Reconcile every 5-10s (implicit TDMA)
Build Systems: Time-slice job claims vs SELECT FOR UPDATE
WHY ISN'T THIS COMMON?
1. Cultural bias: industry teaches "add a consensus layer" as the default
2. TDMA sounds old: it's from 1980s telecoms (but old ≠ bad)
3. SQLite is underestimated: it actually handles 50K-100K writes/sec on NVMe
4. Most examples optimize for microservices (1000s of ops/sec), not batch
WHEN NOT TO USE:
- Microservices (<100ms latency needed)
- Real-time systems (trading, gaming)
- >10,000 operations/sec required

GOOD FOR:
- Batch processing
- ML training orchestration
- ETL pipelines (hourly/daily)
- Video/image processing
- Anything where task duration >> detection latency
THE REAL LESSON
Modern distributed systems thinking:
1. Assume coordination is needed
2. Pick a consensus protocol
3. Deal with the complexity

Alternative:
1. Can processes avoid each other? (temporal isolation)
2. Can data be partitioned? (spatial isolation)
3. Is eventual consistency OK?
If yes to all three: you might not need coordination at all.
CONCLUSION
I built a simple orchestrator for batch workers and rediscovered a 40-year-old telecom pattern that eliminates distributed coordination entirely.
The pattern: TDMA + spatial partitioning + SQLite.
The application to workflow orchestration seems novel.
If Kubernetes feels like overkill, maybe time-slicing is enough.
Sometimes the best distributed system is one that doesn't need to be distributed.
---
Full writeup: [blog link]
Code: [github link]
Discussion: Anyone else using time-based scheduling for coordination-free systems? What about high clock skew networks?

Comment: I do a lot of logical-clock based synchronization using asyncmachine.dev (also in Go); you may want to check it out, as "human time" can be error prone and not "tight". It does involve forming a network of state machines, but connections can be partial and nested. Your results are very hard to read due to formatting, but the idea is interesting.

Reply: Thanks for the pointer to asyncmachine! Let me clarify the HOROS architecture, since there is some confusion. HOROS uses time slots for orchestrator clones on a SINGLE machine by default. Not distributed - 5 Go processes share the same kernel clock:

Runner-0: T=0s, 10s, 20s...  (slot 0)
Runner-1: T=2s, 12s, 22s...  (slot 1)
Runner-2: T=4s, 14s, 24s...  (slot 2)

Zero network, zero clock drift. Just local time.Sleep(). Your approach (logical clocks) solves event ordering in distributed systems.
HOROS solves periodic polling - workers can be idle for hours with no events to increment a logical clock. A wall-clock timer fires regardless. Different primitives:

- Logical clocks: "Event A before Event B?" (causality)
- TDMA timers: "Is it your turn?" (time-slicing)

For cross-machine workflows, we use SQLite state bridges:

Machine-Paris                Machine-Virginia
┌─────────────┐              ┌──────────────┐
│ Worker-StepA│              │ Worker-StepC │
│  completes  │              │    waits     │
│      ↓      │              │      ↑       │
│  output.db  │              │   input.db   │
└──────┬──────┘              └──────▲───────┘
       │                            │
       └──→ bridge.db ←─────────────┘
         (Litestream replication)

bridge.db = shared SQLite with state transitions. A StepBridger daemon polls bridge.db and moves data between steps. State machines communicate through data writes, not RPC.
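For illustration, a rough sketch of what such a bridge poller could look like. The "transitions" and "pending" tables, their columns, and the 5-second poll interval are assumptions made here, not HOROS's actual schema.

// Package bridge sketches a StepBridger-style poller over a shared SQLite file.
package bridge

import (
    "database/sql"
    "log"
    "time"
)

// Poll copies unapplied state transitions from the shared bridge.db into the
// local step's input database, then marks them applied. Other nodes see the
// effect purely through data writes, not RPC.
func Poll(bridgeDB, inputDB *sql.DB) {
    for {
        for _, id := range drain(bridgeDB, inputDB) {
            if _, err := bridgeDB.Exec(`UPDATE transitions SET applied = 1 WHERE id = ?`, id); err != nil {
                log.Println("mark applied:", err)
            }
        }
        time.Sleep(5 * time.Second) // plain wall-clock polling, same spirit as the TDMA slots
    }
}

// drain copies each pending transition into the local input database and
// returns the IDs that were copied successfully.
func drain(bridgeDB, inputDB *sql.DB) []int64 {
    rows, err := bridgeDB.Query(`SELECT id, step, payload FROM transitions WHERE applied = 0`)
    if err != nil {
        log.Println("bridge poll:", err)
        return nil
    }
    defer rows.Close()

    var copied []int64
    for rows.Next() {
        var id int64
        var step, payload string
        if err := rows.Scan(&id, &step, &payload); err != nil {
            log.Println("scan:", err)
            continue
        }
        if _, err := inputDB.Exec(`INSERT INTO pending(step, payload) VALUES(?, ?)`, step, payload); err != nil {
            log.Println("copy:", err)
            continue
        }
        copied = append(copied, id)
    }
    return copied
}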
Each node stays single-machine internally (local TDMA). Re: formatting - which results were unclear? Happy to improve.

Comment: The workers sit idle for n-1 out of n time slices. As n gets larger, the amount of work being done approaches zero.

Reply: TDMA schedules the orchestrators (lightweight checks), not the workers (heavy jobs).

Orchestrators: active 1/n of the time (~10ms to check state)
Workers: run continuously for hours once started

T=0s:  Orchestrator-0 checks → starts job (runs 2 hours)
T=2s:  Orchestrator-1 checks → job still running
T=10s: Orchestrator-0 checks again → job still running

Think: traffic lights (TDMA) vs cars (drive continuously). Work throughput is unchanged. TDMA only coordinates who checks when.