A cache-aware, hierarchical timing wheel implementation based on Varghese & Lauck (1987).
⚡ The Problem: The "C10M" Timer Challenge
Standard approaches to timer management often rely on priority queues (like `std::collections::BinaryHeap` or typical min-heap implementations). While efficient for general use, these structures suffer at massive scale (1M+ concurrent timers):
- Algorithmic Complexity: Insertion and extraction are O(log N), and cancelling an arbitrary timer first requires an O(N) search. As connections scale, total CPU cost grows non-linearly.
- Pointer Chasing: Tree-based heap implementations scatter nodes across RAM, so traversing the structure causes heavy L1/L2 cache misses.
The Solution
This project implements a Hierarchical Timing Wheel (Hashed Wheel Timer) focused on Data-Oriented Design.
- Complexity: O(1) for Insert, Cancel, and Tick.
- Memory Layout: Uses a custom Slab Allocator (Arena) to keep all timer data in contiguous memory, maximizing CPU cache pre-fetching.
- Zero-Allocation: After the initial warmup, the system performs zero heap allocations at runtime.
Architecture
1. The Engine: Intrusive Slab Allocator
Instead of using `Vec<Box<Node>>`, I have implemented a custom Slab with an Intrusive Free List.
- Contiguous Memory: All timer entries live in a single `Vec<Entry>`.
- Intrusive Linked List: Empty slots form a linked list inside the unused memory of the vector (sketched below).
- Index-Based: I have used `usize` indices instead of pointers, reducing memory overhead on 64-bit systems and avoiding borrow-checker complexity.
    // Visual representation of the memory layout
    [ Occupied(Task A) | Free(Next=4) | Occupied(Task B) | Occupied(Task C) | Free(Next=End) ]
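Below is a minimal sketch of this layout. The names `Slot` and `Slab` are illustrative, not the crate's actual types; the point is that vacant slots reuse their own storage for the free-list link, so both insert and remove are O(1) and allocation-free at steady state.

```rust
// Illustrative sketch only: `Slot` and `Slab` are assumed names, not the crate's API.
enum Slot<T> {
    Occupied(T),
    Free { next_free: Option<usize> }, // free list stored inside the vacant slot
}

struct Slab<T> {
    entries: Vec<Slot<T>>,
    free_head: Option<usize>, // index of the first vacant slot, if any
}

impl<T> Slab<T> {
    fn new() -> Self {
        Slab { entries: Vec::new(), free_head: None }
    }

    /// O(1) insert: pop the free-list head, or grow the vector.
    fn insert(&mut self, value: T) -> usize {
        match self.free_head {
            Some(idx) => {
                if let Slot::Free { next_free } = &self.entries[idx] {
                    self.free_head = *next_free;
                }
                self.entries[idx] = Slot::Occupied(value);
                idx
            }
            None => {
                self.entries.push(Slot::Occupied(value));
                self.entries.len() - 1
            }
        }
    }

    /// O(1) remove: push the slot back onto the intrusive free list.
    fn remove(&mut self, idx: usize) -> Option<T> {
        let old = std::mem::replace(
            &mut self.entries[idx],
            Slot::Free { next_free: self.free_head },
        );
        match old {
            Slot::Occupied(value) => {
                self.free_head = Some(idx);
                Some(value)
            }
            other => {
                // Slot was already free: put it back unchanged.
                self.entries[idx] = other;
                None
            }
        }
    }
}
```

Because slots are recycled in place, steady-state operation never touches the global allocator, which is what the zero-allocation claim above refers to.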
2. The Algorithm: Varghese & Lauck Hierarchy
I am currently implementing the "Scheme 6" Hierarchical Wheel:
- Granularity: 4 levels of wheels (Seconds, Minutes, Hours...).
- Bitwise Optimization: Slot counts are powers of 2 (64 slots per wheel), so expensive Modulo (%) instructions become fast Bitwise AND (&) instructions (see the slot-math sketch after this list).
- Cascading: As their deadlines approach, timers automatically "fall down" from coarse, low-resolution wheels into finer-resolution wheels.
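A minimal sketch of the slot math, assuming 4 wheels of 64 slots each; `schedule_position` is a hypothetical helper, and the crate's internals may differ:

```rust
// Illustrative slot math; `schedule_position` is a hypothetical helper.
const SLOTS_PER_WHEEL: u64 = 64;            // power of two: 2^6
const SLOT_MASK: u64 = SLOTS_PER_WHEEL - 1; // 0b111111
const BITS_PER_LEVEL: u32 = 6;              // log2(64)
const LEVELS: u32 = 4;

/// Map a timer `delta` ticks in the future onto a (level, slot) pair.
/// `x & SLOT_MASK` replaces `x % 64` because 64 is a power of two.
fn schedule_position(now: u64, delta: u64) -> (usize, usize) {
    let deadline = now + delta;
    for level in 0..LEVELS {
        // Ticks covered by this wheel: 64, 4096, 262144, ...
        let span = SLOTS_PER_WHEEL << (BITS_PER_LEVEL * level);
        if delta < span {
            let slot = (deadline >> (BITS_PER_LEVEL * level)) & SLOT_MASK;
            return (level as usize, slot as usize);
        }
    }
    // Deadlines beyond the top wheel's horizon clamp into the top wheel.
    let top_shift = BITS_PER_LEVEL * (LEVELS - 1);
    ((LEVELS - 1) as usize, ((deadline >> top_shift) & SLOT_MASK) as usize)
}

fn main() {
    // A timer 100 ticks out overflows wheel 0 (64 ticks) and lands in wheel 1.
    assert_eq!(schedule_position(0, 100), (1, 1)); // 100 >> 6 == 1
    println!("ok");
}
```

On cascading: when a coarse wheel's slot expires, its timers are recomputed through the same slot math and re-linked into a finer wheel, which keeps every individual operation O(1).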
🔬 Benchmark Results
Methodology Update: Previous benchmarks used sorted inputs. These updated results use pre-calculated random deadlines to simulate real-world network traffic and force the Binary Heap to re-balance (sift-up) continuously; the input generation is sketched after the table.
| Operation | Scale (N) | Heap (Standard) | Wheel (v0.2) | Improvement |
|---|---|---|---|---|
| Insert | 1,000,000 | 15.3 ms | 57.0 ms | ≈3.7x slower (see analysis below) |
| Cancel | 10,000 | 51.6 ms | 0.029 ms | 1,700x FASTER 🚀 |
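The input generation presumably looks like this sketch (an assumed shape using the `rand` crate, not the actual bench harness); deadlines are pre-computed so RNG cost is excluded from the measured loops:

```rust
// Assumed benchmark-input sketch; `random_deadlines` is a hypothetical helper.
use rand::Rng; // rand = "0.8"

/// Pre-compute `n` uniformly random deadlines so the measured insert loop
/// does no RNG work of its own.
fn random_deadlines(n: usize, horizon_ticks: u64) -> Vec<u64> {
    let mut rng = rand::thread_rng();
    (0..n).map(|_| rng.gen_range(0..horizon_ticks)).collect()
}
```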
Performance Analysis
- Insertion Trade-Off:
  - The Binary Heap benefits from hardware prefetching because it is backed by a contiguous `Vec`. Even with random inputs, the "hot path" (indices 0, 1, 2...) stays in L1 cache.
  - The Timing Wheel pays a constant cost for calculating bit-shifts and writing to random indices in the Slab. Profiling via `samply` confirms that the bottleneck is L1 Data Cache Misses during the linked-list write operation (`mov dword [base + index + offset]`).
  - Verdict: Accept slower insertion (~3.7x) to gain massive cancellation speed.
- Cancellation (The "C10M" Win):
  - This is the critical metric for network drivers (TCP/QUIC).
  - The Heap requires an O(N) linear scan to find a task to cancel.
  - The Wheel uses the Slab index (Handle) to unlink the node in O(1) constant time, regardless of how many timers exist (see the sketch below).
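A hedged sketch of that unlink; the field names and layout here are assumptions for illustration, not the crate's confirmed internals. The handle indexes straight into the slab, so cancellation touches at most three entries:

```rust
// Illustrative sketch: field names and layout are assumptions, not the
// crate's confirmed internals.
#[derive(Clone, Copy)]
struct Entry {
    prev: Option<usize>, // intrusive doubly-linked list within one wheel slot
    next: Option<usize>,
    bucket: usize,       // flattened (level, slot) index this entry is linked into
}

struct Wheel {
    slab: Vec<Entry>,            // all timer entries, contiguous in memory
    buckets: Vec<Option<usize>>, // head handle of each slot's list
}

impl Wheel {
    /// O(1) cancel: the handle indexes straight into the slab, so the entry
    /// is spliced out of its bucket's list without scanning any other timer.
    fn cancel(&mut self, handle: usize) {
        let Entry { prev, next, bucket } = self.slab[handle];
        match prev {
            Some(p) => self.slab[p].next = next,
            None => self.buckets[bucket] = next, // entry was the bucket head
        }
        if let Some(n) = next {
            self.slab[n].prev = prev;
        }
        // A real implementation would now push `handle` onto the intrusive
        // free list (see the Slab sketch above).
    }
}
```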
Citation
If you use sharded-timing-wheel in your research or project, please cite it as:
Rathore, Ankur. (2026). sharded-timing-wheel: A cache-aware, hierarchical timing wheel implementation (v0.2.0). Zenodo. https://doi.org/10.5281/zenodo.18886174
References
- Varghese, G., & Lauck, A. (1987). Hashed and hierarchical timing wheels: Data structures for the efficient implementation of a timer facility. In Proceedings of the 11th ACM Symposium on Operating Systems Principles (SOSP '87).
- Acton, M. (2014). Data-Oriented Design and C++. CppCon 2014.
- Gregg, B. (2013). Systems Performance: Enterprise and the Cloud. Prentice Hall.
📄 License
MIT License