Orchestrating 5000 Workers Without Distributed Locks: Rediscovering TDMA
I needed to orchestrate 500-5000 batch workers (ML training, ETL) using Go and SQLite. Every tutorial said: use etcd, Consul, or ZooKeeper.
But why do these processes need to talk to each other at all?
THE INSIGHT
What if orchestrators never run simultaneously?
Runner-0 executes at T=0s, 10s, 20s...
Runner-1 executes at T=2s, 12s, 22s...
Runner-2 executes at T=4s, 14s, 24s...
Runner-3 executes at T=6s, 16s, 26s...
Runner-4 executes at T=8s, 18s, 28s...
Time-Division Multiple Access (TDMA). Same pattern GSM uses for radio.
GO IMPLEMENTATION
type Runner struct {
    ID, TotalRunners int
    CycleTime        time.Duration
}

func (r Runner) Start() {
    slot := r.CycleTime / time.Duration(r.TotalRunners) // e.g. 10s / 5 runners = 2s per slot
    offset := time.Duration(r.ID) * slot                // this runner's position within the cycle

    for {
        time.Sleep(time.Until(computeNextSlot(offset)))
        r.reconcile() // check workers, start if needed
    }
}

Each runner gets 2s in a 10s cycle. No overlap = zero coordination.
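The excerpt above leaves computeNextSlot and reconcile unshown. Here is a minimal, self-contained sketch of how the gaps might be filled in; the absolute-time slot grid, the cycleTime constant, the stubbed reconcile, and the goroutine-based main are all assumptions made for illustration (the post runs each runner as its own process), and the struct is repeated only so the file compiles on its own.

package main

import (
    "fmt"
    "time"
)

const cycleTime = 10 * time.Second // assumed global cycle, matching the post's 10s example

type Runner struct {
    ID, TotalRunners int
    CycleTime        time.Duration
}

func (r Runner) Start() {
    slot := r.CycleTime / time.Duration(r.TotalRunners) // 10s / 5 = 2s
    offset := time.Duration(r.ID) * slot

    for {
        time.Sleep(time.Until(computeNextSlot(offset)))
        r.reconcile()
    }
}

// computeNextSlot anchors slots to an absolute time grid, so every process
// derives the same schedule from nothing but its local clock.
func computeNextSlot(offset time.Duration) time.Time {
    now := time.Now()
    cycleStart := now.Truncate(cycleTime) // start of the current cycle
    next := cycleStart.Add(offset)
    if !next.After(now) {
        next = next.Add(cycleTime) // this cycle's slot already passed; take the next one
    }
    return next
}

// reconcile is a stub standing in for "check workers, start if needed".
func (r Runner) reconcile() {
    fmt.Printf("runner %d: reconciling at %s\n", r.ID, time.Now().Format("15:04:05"))
}

func main() {
    // The real system runs each runner as a separate OS process; goroutines
    // keep this toy in a single file.
    for id := 0; id < 5; id++ {
        r := Runner{ID: id, TotalRunners: 5, CycleTime: cycleTime}
        go r.Start()
    }
    select {}
}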
SQLITE CONFIG
PRAGMA journal_mode=WAL;

dbWrite.SetMaxOpenConns(1)  // one writer
dbRead.SetMaxOpenConns(10)  // concurrent reads

With TDMA slots, only one orchestrator writes at a time, so busy_timeout never triggers.
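For context, here is one way the two pools might be opened from Go. This is a sketch assuming the mattn/go-sqlite3 driver and a state.db file, both assumptions on my part; the post only specifies the pragma and the pool sizes.

package main

import (
    "database/sql"
    "log"

    _ "github.com/mattn/go-sqlite3"
)

// openPools opens one single-connection pool for writes and a wider pool
// for reads against the same SQLite file.
func openPools(path string) (dbWrite, dbRead *sql.DB, err error) {
    dbWrite, err = sql.Open("sqlite3", path)
    if err != nil {
        return nil, nil, err
    }
    // WAL lets readers proceed while the single writer commits.
    if _, err = dbWrite.Exec("PRAGMA journal_mode=WAL;"); err != nil {
        return nil, nil, err
    }
    dbWrite.SetMaxOpenConns(1) // one writer - the orchestrator that owns the current slot

    dbRead, err = sql.Open("sqlite3", path)
    if err != nil {
        return nil, nil, err
    }
    dbRead.SetMaxOpenConns(10) // concurrent reads are safe under WAL
    return dbWrite, dbRead, nil
}

func main() {
    w, r, err := openPools("state.db")
    if err != nil {
        log.Fatal(err)
    }
    defer w.Close()
    defer r.Close()
}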
THE MATH
Capacity = SlotDuration / TimePerWorker = 2000ms / 10ms = 200 workers per runner
5 runners  = 1000 workers
25 runners = 5000 workers (25s cycle, 12.5s avg latency)
For batch jobs running hours, 10s detection latency is irrelevant.
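As a sanity check, the same arithmetic for the 5-runner case written out in Go (purely illustrative, using the post's 2s slots and 10ms per worker check):

package main

import (
    "fmt"
    "time"
)

func main() {
    slot := 2 * time.Second            // one runner's slot
    perWorker := 10 * time.Millisecond // time to check one worker's state
    perRunner := int(slot / perWorker) // 200 workers per runner

    runners := 5
    cycle := time.Duration(runners) * slot // 10s cycle
    avgLatency := cycle / 2                // a change is noticed after ~5s on average

    fmt.Printf("%d workers, %s cycle, %s average detection latency\n",
        perRunner*runners, cycle, avgLatency)
    // Output: 1000 workers, 10s cycle, 5s average detection latency
}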
BENCHMARKS (real data from docs/papers)
System    | Writes/s | Latency | Nodes | Use Case
etcd      | 10,000   | 25ms    | 3-5   | Config
ZooKeeper | 8,000    | 50ms    | 5     | Election
Temporal  | 2,000    | 100ms   | 15-20 | Workflows
Airflow   | 300      | 2s      | 2-3   | Batch
TDMA-SPI  | 40       | 5s avg  | 1-5   | Batch
WHAT YOU GAIN:
- Zero consensus protocols (no Raft/Paxos)
- Single-node deployment possible
- Deterministic behavior
- Radical simplicity

WHAT YOU SACRIFICE:
- Real-time response (<1s)
- High frequency (>1000 ops/sec)
- Arbitrary scale (limit ~5000 workers)
UNIVERSAL PATTERN
Wireless Sensor Networks: DD-TDMA (IEEE 2007) - same pattern
Kubernetes Controllers: Reconcile every 5-10s (implicit TDMA)
Build Systems: Time-slice job claims vs SELECT FOR UPDATE
WHY ISN'T THIS COMMON?
1. Cultural bias: industry teaches "add a consensus layer" as the default
2. TDMA sounds old: it's from 1980s telecoms (but old ≠ bad)
3. SQLite is underestimated: it actually handles 50K-100K writes/sec on NVMe
4. Most examples optimize for microservices (1000s of ops/sec), not batch
WHEN NOT TO USE:
- Microservices (<100ms latency needed)
- Real-time systems (trading, gaming)
- >10,000 operations/sec required

GOOD FOR:
- Batch processing
- ML training orchestration
- ETL pipelines (hourly/daily)
- Video/image processing
- Anything where task duration >> detection latency
THE REAL LESSON
Modern distributed systems thinking:
1. Assume coordination is needed
2. Pick a consensus protocol
3. Deal with the complexity

Alternative:
1. Can processes avoid each other? (temporal isolation)
2. Can data be partitioned? (spatial isolation)
3. Is eventual consistency OK?
If yes to all three: you might not need coordination at all.
CONCLUSION
I built a simple orchestrator for batch workers and rediscovered a 40-year-old telecom pattern that eliminates distributed coordination entirely.
The pattern: TDMA + spatial partitioning + SQLite.
The application to workflow orchestration seems novel.
If Kubernetes feels like overkill, maybe time-slicing is enough.
Sometimes the best distributed system is one that doesn't need to be distributed.
---
Full writeup: [blog link]
Code: [github link]
Discussion: Anyone else using time-based scheduling for coordination-free systems? What about high clock skew networks?

Comment: I do a lot of logical-clock based synchronization using asyncmachine.dev (also in Go); you may want to check it out, as "human time" can be error prone and not "tight". It does involve forming a network of state machines, but connections can be partial and nested. Your results are very hard to read due to formatting, but the idea is interesting.

Reply: Thanks for the pointer to asyncmachine! Let me clarify the HOROS architecture, since there is some confusion. HOROS uses time slots for orchestrator clones on a SINGLE machine by default. Not distributed - 5 Go processes share the same kernel clock:

Runner-0: T=0s, 10s, 20s...  (slot 0)
Runner-1: T=2s, 12s, 22s...  (slot 1)
Runner-2: T=4s, 14s, 24s...  (slot 2)

Zero network, zero clock drift. Just local time.Sleep(). Your approach (logical clocks) solves event ordering in distributed systems.
HOROS solves periodic polling - workers can be idle for hours with no events to increment a logical clock. A wall-clock timer fires regardless. Different primitives:

- Logical clocks: "Event A before Event B?" (causality)
- TDMA timers: "Is it your turn?" (time-slicing)

For cross-machine workflows, we use SQLite state bridges:

Machine-Paris                Machine-Virginia
┌─────────────┐              ┌──────────────┐
│ Worker-StepA│              │ Worker-StepC │
│  completes  │              │    waits     │
│      ↓      │              │      ↑       │
│  output.db  │              │   input.db   │
└──────┬──────┘              └──────▲───────┘
       │                            │
       └──→ bridge.db ←─────────────┘
         (Litestream replication)

bridge.db = shared SQLite with state transitions. A StepBridger daemon polls bridge.db and moves data between steps. State machines communicate through data writes, not RPC.
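For illustration, a rough sketch of what such a bridge poller could look like. The "transitions" and "pending" tables, their columns, and the 5-second poll interval are assumptions made here, not HOROS's actual schema.

// Package bridge sketches a StepBridger-style poller over a shared SQLite file.
package bridge

import (
    "database/sql"
    "log"
    "time"
)

// Poll copies unapplied state transitions from the shared bridge.db into the
// local step's input database, then marks them applied. Other nodes see the
// effect purely through data writes, not RPC.
func Poll(bridgeDB, inputDB *sql.DB) {
    for {
        for _, id := range drain(bridgeDB, inputDB) {
            if _, err := bridgeDB.Exec(`UPDATE transitions SET applied = 1 WHERE id = ?`, id); err != nil {
                log.Println("mark applied:", err)
            }
        }
        time.Sleep(5 * time.Second) // plain wall-clock polling, same spirit as the TDMA slots
    }
}

// drain copies each pending transition into the local input database and
// returns the IDs that were copied successfully.
func drain(bridgeDB, inputDB *sql.DB) []int64 {
    rows, err := bridgeDB.Query(`SELECT id, step, payload FROM transitions WHERE applied = 0`)
    if err != nil {
        log.Println("bridge poll:", err)
        return nil
    }
    defer rows.Close()

    var copied []int64
    for rows.Next() {
        var id int64
        var step, payload string
        if err := rows.Scan(&id, &step, &payload); err != nil {
            log.Println("scan:", err)
            continue
        }
        if _, err := inputDB.Exec(`INSERT INTO pending(step, payload) VALUES(?, ?)`, step, payload); err != nil {
            log.Println("copy:", err)
            continue
        }
        copied = append(copied, id)
    }
    return copied
}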
Each node stays single-machine internally (local TDMA). Re: formatting - which results were unclear? Happy to improve.

Comment: The workers sit idle for n-1 out of n time slices. As n gets larger, the amount of work being done approaches zero.

Reply: TDMA schedules the orchestrators (lightweight checks), not the workers (heavy jobs).

Orchestrators: active 1/n of the time (~10ms to check state)
Workers: run continuously for hours once started

T=0s:  Orchestrator-0 checks → starts job (runs 2 hours)
T=2s:  Orchestrator-1 checks → job still running
T=10s: Orchestrator-0 checks again → job still running

Think: traffic lights (TDMA) vs cars (drive continuously). Work throughput is unchanged. TDMA only coordinates who checks when.