A cache-aware, hierarchical timing wheel implementation based on Varghese & Lauck (1987).
⚡ The Problem: The "C10M" Timer Challenge
Standard approaches to timer management often rely on priority queues (like `std::collections::BinaryHeap` or typical min-heap implementations). While efficient for general use, these structures suffer at massive scale (1M+ concurrent timers):
- Algorithmic Complexity: Insertion and extraction are O(log N), and cancelling an arbitrary timer first requires an O(N) search. As connections scale, total CPU cost grows non-linearly.
- Pointer Chasing: Tree-based heap implementations scatter nodes across RAM, so traversing the structure causes heavy L1/L2 cache misses.
The Solution
This project implements a Hierarchical Timing Wheel (Hashed Wheel Timer) focused on Data-Oriented Design.
- Complexity: O(1) for Insert, Cancel, and Tick.
- Memory Layout: Uses a custom Slab Allocator (Arena) to keep all timer data in contiguous memory, maximizing CPU cache pre-fetching.
- Zero-Allocation: After the initial warmup, the system performs zero heap allocations at runtime.
Architecture
1. The Engine: Intrusive Slab Allocator
Instead of using `Vec<Box<Node>>`, I have implemented a custom Slab with an Intrusive Free List.
- Contiguous Memory: All timer entries live in a single `Vec<Entry>`.
- Intrusive Linked List: Empty slots form a linked list inside the unused memory of the vector (sketched below).
- Index-Based: I have used `usize` indices instead of pointers, reducing memory overhead on 64-bit systems and avoiding borrow-checker complexity.
    // Visual representation of the memory layout
    [ Occupied(Task A) | Free(Next=4) | Occupied(Task B) | Occupied(Task C) | Free(Next=End) ]
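Below is a minimal sketch of this layout. The names `Slot` and `Slab` are illustrative, not the crate's actual types; the point is that vacant slots reuse their own storage for the free-list link, so both insert and remove are O(1) and allocation-free at steady state.

```rust
// Illustrative sketch only: `Slot` and `Slab` are assumed names, not the crate's API.
enum Slot<T> {
    Occupied(T),
    Free { next_free: Option<usize> }, // free list stored inside the vacant slot
}

struct Slab<T> {
    entries: Vec<Slot<T>>,
    free_head: Option<usize>, // index of the first vacant slot, if any
}

impl<T> Slab<T> {
    fn new() -> Self {
        Slab { entries: Vec::new(), free_head: None }
    }

    /// O(1) insert: pop the free-list head, or grow the vector.
    fn insert(&mut self, value: T) -> usize {
        match self.free_head {
            Some(idx) => {
                if let Slot::Free { next_free } = &self.entries[idx] {
                    self.free_head = *next_free;
                }
                self.entries[idx] = Slot::Occupied(value);
                idx
            }
            None => {
                self.entries.push(Slot::Occupied(value));
                self.entries.len() - 1
            }
        }
    }

    /// O(1) remove: push the slot back onto the intrusive free list.
    fn remove(&mut self, idx: usize) -> Option<T> {
        let old = std::mem::replace(
            &mut self.entries[idx],
            Slot::Free { next_free: self.free_head },
        );
        match old {
            Slot::Occupied(value) => {
                self.free_head = Some(idx);
                Some(value)
            }
            other => {
                // Slot was already free: put it back unchanged.
                self.entries[idx] = other;
                None
            }
        }
    }
}
```

Because slots are recycled in place, steady-state operation never touches the global allocator, which is what the zero-allocation claim above refers to.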
2. The Algorithm: Varghese & Lauck Hierarchy
I am currently implementing the "Scheme 6" Hierarchical Wheel:
- Granularity: 4 levels of wheels (Seconds, Minutes, Hours...).
- Bitwise Optimization: Slot counts are powers of 2 (64 slots per wheel), so expensive Modulo (%) instructions become fast Bitwise AND (&) instructions (see the slot-math sketch after this list).
- Cascading: As their deadlines approach, timers automatically "fall down" from coarse, low-resolution wheels into finer-resolution wheels.
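A minimal sketch of the slot math, assuming 4 wheels of 64 slots each; `schedule_position` is a hypothetical helper, and the crate's internals may differ:

```rust
// Illustrative slot math; `schedule_position` is a hypothetical helper.
const SLOTS_PER_WHEEL: u64 = 64;            // power of two: 2^6
const SLOT_MASK: u64 = SLOTS_PER_WHEEL - 1; // 0b111111
const BITS_PER_LEVEL: u32 = 6;              // log2(64)
const LEVELS: u32 = 4;

/// Map a timer `delta` ticks in the future onto a (level, slot) pair.
/// `x & SLOT_MASK` replaces `x % 64` because 64 is a power of two.
fn schedule_position(now: u64, delta: u64) -> (usize, usize) {
    let deadline = now + delta;
    for level in 0..LEVELS {
        // Ticks covered by this wheel: 64, 4096, 262144, ...
        let span = SLOTS_PER_WHEEL << (BITS_PER_LEVEL * level);
        if delta < span {
            let slot = (deadline >> (BITS_PER_LEVEL * level)) & SLOT_MASK;
            return (level as usize, slot as usize);
        }
    }
    // Deadlines beyond the top wheel's horizon clamp into the top wheel.
    let top_shift = BITS_PER_LEVEL * (LEVELS - 1);
    ((LEVELS - 1) as usize, ((deadline >> top_shift) & SLOT_MASK) as usize)
}

fn main() {
    // A timer 100 ticks out overflows wheel 0 (64 ticks) and lands in wheel 1.
    assert_eq!(schedule_position(0, 100), (1, 1)); // 100 >> 6 == 1
    println!("ok");
}
```

On cascading: when a coarse wheel's slot expires, its timers are recomputed through the same slot math and re-linked into a finer wheel, which keeps every individual operation O(1).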
🔬 Benchmark Results
Methodology Update: Previous benchmarks used sorted inputs. These updated results use pre-calculated random deadlines to simulate real-world network traffic and force the Binary Heap to re-balance (sift-up) continuously; the input generation is sketched after the table.
| Operation | Scale (N) | Heap (Standard) | Wheel (v0.2) | Improvement |
|---|---|---|---|---|
| Insert | 1,000,000 | 15.3 ms | 57.0 ms | ≈3.7x slower (see analysis below) |
| Cancel | 10,000 | 51.6 ms | 0.029 ms | 1,700x FASTER 🚀 |
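The input generation presumably looks like this sketch (an assumed shape using the `rand` crate, not the actual bench harness); deadlines are pre-computed so RNG cost is excluded from the measured loops:

```rust
// Assumed benchmark-input sketch; `random_deadlines` is a hypothetical helper.
use rand::Rng; // rand = "0.8"

/// Pre-compute `n` uniformly random deadlines so the measured insert loop
/// does no RNG work of its own.
fn random_deadlines(n: usize, horizon_ticks: u64) -> Vec<u64> {
    let mut rng = rand::thread_rng();
    (0..n).map(|_| rng.gen_range(0..horizon_ticks)).collect()
}
```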
Performance Analysis
- Insertion Trade-Off:
  - The Binary Heap benefits from hardware prefetching because it is backed by a contiguous `Vec`. Even with random inputs, the "hot path" (indices 0, 1, 2...) stays in L1 cache.
  - The Timing Wheel pays a constant cost for calculating bit-shifts and writing to random indices in the Slab. Profiling via `samply` confirms that the bottleneck is L1 Data Cache Misses during the linked-list write operation (`mov dword [base + index + offset]`).
  - Verdict: Accept slower insertion (~3.7x) to gain massive cancellation speed.
- Cancellation (The "C10M" Win):
  - This is the critical metric for network drivers (TCP/QUIC).
  - The Heap requires an O(N) linear scan to find a task to cancel.
  - The Wheel uses the Slab index (Handle) to unlink the node in O(1) constant time, regardless of how many timers exist (see the sketch below).
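A hedged sketch of that unlink; the field names and layout here are assumptions for illustration, not the crate's confirmed internals. The handle indexes straight into the slab, so cancellation touches at most three entries:

```rust
// Illustrative sketch: field names and layout are assumptions, not the
// crate's confirmed internals.
#[derive(Clone, Copy)]
struct Entry {
    prev: Option<usize>, // intrusive doubly-linked list within one wheel slot
    next: Option<usize>,
    bucket: usize,       // flattened (level, slot) index this entry is linked into
}

struct Wheel {
    slab: Vec<Entry>,            // all timer entries, contiguous in memory
    buckets: Vec<Option<usize>>, // head handle of each slot's list
}

impl Wheel {
    /// O(1) cancel: the handle indexes straight into the slab, so the entry
    /// is spliced out of its bucket's list without scanning any other timer.
    fn cancel(&mut self, handle: usize) {
        let Entry { prev, next, bucket } = self.slab[handle];
        match prev {
            Some(p) => self.slab[p].next = next,
            None => self.buckets[bucket] = next, // entry was the bucket head
        }
        if let Some(n) = next {
            self.slab[n].prev = prev;
        }
        // A real implementation would now push `handle` onto the intrusive
        // free list (see the Slab sketch above).
    }
}
```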
Citation
If you use sharded-timing-wheel in your research or project, please cite it as:
Rathore, Ankur. (2026). sharded-timing-wheel: A cache-aware, hierarchical timing wheel implementation (v0.2.0). Zenodo. https://doi.org/10.5281/zenodo.18886174
References
- Varghese, G., & Lauck, A. (1987). Hashed and hierarchical timing wheels: Data structures for the efficient implementation of a timer facility. In Proceedings of the 11th ACM Symposium on Operating Systems Principles (SOSP '87).
- Acton, M. (2014). Data-Oriented Design and C++. CppCon 2014.
- Gregg, B. (2013). Systems Performance: Enterprise and the Cloud. Prentice Hall.
📄 License
MIT License