The Hidden Bottleneck: Blocking in Async Rust


Using eBPF to find blocking code in Tokio applications without instrumenting your code

January 2026

Your Tokio service is slow. Latency spikes under load, then recovers. No errors in the logs. CPU isn't maxed out. The usual culprit: blocking code on the async runtime.

[Demo: hud showing blocking detection in real time — zero-instrumentation blocking detection for Tokio]
TL;DR: Blocking code on Tokio workers causes latency spikes without errors. Use hud to find it without code changes—this article covers how to diagnose and fix the common culprits.

The Problem

Tokio relies on cooperative scheduling: tasks yield at .await points, trusting that code between awaits completes quickly. When it doesn't—CPU-heavy work, synchronous I/O, blocking locks—a single task monopolizes a worker thread, stalling everything else queued behind it. No panics, no errors. Just degraded throughput that's hard to trace.
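
To see the failure mode in isolation, here's a minimal sketch (not taken from any real service): a ticker task that should fire every 10ms stalls while an unrelated task blocks the same worker with a synchronous sleep.

use std::time::{Duration, Instant};

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let start = Instant::now();

    // A well-behaved task: ticks every 10ms, yielding at each .await.
    let ticker = tokio::spawn(async move {
        for i in 0..5 {
            tokio::time::sleep(Duration::from_millis(10)).await;
            println!("tick {i} at {:?}", start.elapsed());
        }
    });

    // A misbehaving task: blocks the only worker for 200ms without yielding.
    tokio::spawn(async {
        std::thread::sleep(Duration::from_millis(200)); // stand-in for bcrypt, sync I/O, etc.
    });

    // At least one tick that should arrive after ~10ms shows up ~200ms late.
    ticker.await.unwrap();
}

The single-threaded flavor makes the stall obvious; on the default multi-threaded runtime the same thing happens per worker, and with enough concurrent blocking calls every worker can end up stalled.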

How Blocking Sneaks In

Consider this async endpoint that hashes passwords:

async fn hash_password(body: Bytes) -> Json<HashResponse> {
    let password = String::from_utf8_lossy(&body).to_string();
    let hash = bcrypt::hash(&password, 10).expect("hash failed");
    Json(HashResponse { hash })
}

It looks innocuous—after all, it's an async fn. But bcrypt::hash is intentionally slow (50–100ms at cost factor 10, longer at higher factors) and never yields. While it runs, that worker is blocked. Other requests queue behind it.

The usual suspects:

Pattern | Example | Why it blocks
--- | --- | ---
std::fs in async | File::open(), read_to_string() | Sync syscalls
CPU-heavy crypto | bcrypt, argon2, scrypt | Intentionally slow
Compression | flate2, zstd | CPU-bound
std::net DNS | ToSocketAddrs | Sync resolution
Large parsing | Big JSON/XML payloads | CPU-bound
Sync mutex during slow work | std::sync::Mutex | Thread blocks waiting

In practice:

// Blocking file I/O
let config = std::fs::read_to_string("config.toml")?;

// Blocking DNS resolution
let addrs = "example.com:443".to_socket_addrs()?;

// Sync mutex held during slow operation - blocks the worker thread
let guard = sync_mutex.lock().unwrap();
expensive_computation();  // other tasks on this worker stall
drop(guard);

Note: holding a std::sync::MutexGuard across an .await usually won't compile—the guard isn't Send. But holding it during long synchronous work still blocks the worker thread. Use tokio::sync::Mutex when you need to hold a lock across .await points. For short critical sections without awaits, std::sync::Mutex is often faster—just keep the work minimal.
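
For comparison, here's a rough sketch of the async-mutex pattern when a lock genuinely has to live across an .await (the flush function is a hypothetical placeholder for some async I/O):

use std::sync::Arc;
use tokio::sync::Mutex;

// Shared state guarded by an async-aware mutex.
async fn append_and_flush(state: Arc<Mutex<Vec<u8>>>, bytes: Vec<u8>) {
    let mut buf = state.lock().await;   // waiting for the lock yields instead of blocking
    buf.extend_from_slice(&bytes);
    flush(buf.as_slice()).await;        // holding the guard across .await is fine here
}

// Hypothetical async write.
async fn flush(_buf: &[u8]) {}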

All of these blocking patterns work correctly in tests and in isolation. The problem only manifests under concurrent load.

What It Looks Like in Production

The symptoms are maddeningly vague:

  • p99 latency climbs under load, recovers when load drops
  • Throughput plateaus below expected capacity
  • No errors, no panics, nothing in the logs
  • CPU utilization looks normal (the threads are blocked, not spinning)
  • Adding more Tokio worker threads helps—until it doesn't

So you blame the database. Or the network. Or add caching in the wrong place. Without visibility into the runtime, you're guessing.

Finding It

Quick sanity check: During development, wrap suspicious operations in tokio::time::timeout. If a supposedly-fast async call times out, you likely have blocking code in that path:

use tokio::time::{timeout, Duration};

let result = timeout(Duration::from_millis(100), async {
    some_async_operation().await
}).await;

if result.is_err() {
    tracing::warn!("operation exceeded 100ms — possible blocking");
}

This won't tell you where the blocking is, but it confirms the problem exists before reaching for heavier tools.

Existing options: tokio-console provides runtime visibility and accurate task poll times, but requires instrumenting your code. Tokio's unstable blocking detection warns when polls exceed a threshold, but it's opt-in and needs a redeploy. Standard perf + flamegraphs work but require manual interpretation. All of these require either code changes or significant expertise.

hud takes a different approach: attach to a running process with no code changes required. It's an eBPF profiler that hooks into the Linux scheduler. When a Tokio worker thread experiences OS-level scheduling latency (time spent waiting in the kernel run queue, not Tokio's task queue), hud captures a stack trace.

High scheduling latency often indicates blocking: when one task monopolizes a worker, other tasks queue up waiting. Because hud runs in-kernel and only samples on scheduler events, overhead is low—though you should benchmark in your environment, as eBPF overhead varies with event frequency and stack depth.

Requirements: Linux 5.8 or later, root privileges, and a binary built with debug symbols and frame pointers:

# Cargo.toml
[profile.release]
debug = true

# .cargo/config.toml (force-frame-pointers is a rustc flag, not a Cargo profile key)
[build]
rustflags = ["-C", "force-frame-pointers=yes"]

Note: debug = true increases binary size (~10-20%) and force-frame-pointers adds minor runtime overhead (~1-2%). For production investigation, you can build a debug-enabled binary and swap it in temporarily, or accept the overhead if you need always-on observability.

Install:

# Pre-built binary
curl -LO https://github.com/cong-or/hud/releases/latest/download/hud-linux-x86_64.tar.gz
tar xzf hud-linux-x86_64.tar.gz

# Or from source
git clone https://github.com/cong-or/hud && cd hud
cargo xtask build-ebpf --release && cargo build --release

Point it at a running process:

sudo hud my-service

Demo

Here's hud in action with a server that has intentionally blocking endpoints (from the hud repo):

# Terminal 1: Run the demo server (debug build for clear stack traces)
cargo build --example demo-server
./target/debug/examples/demo-server

# Terminal 2: Attach hud
sudo ./target/release/hud demo-server

# Terminal 3: Generate load
curl -X POST http://localhost:3000/hash -d 'password123'
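
A single curl is just a smoke test. To actually surface the stalls, hit the endpoint concurrently; a rough bash sketch:

# Terminal 3 (alternative): 200 concurrent requests
for i in $(seq 1 200); do
  curl -s -X POST http://localhost:3000/hash -d 'password123' > /dev/null &
done
wait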

hud displays latency hotspots in a live TUI, grouped by stack trace. Select one to see the full call stack, with frames from your own code highlighted. Consistent traces pointing to the same function are your likely culprits.

Important: hud captures the stack of the thread experiencing latency—the victim, not necessarily the blocker. If Task A blocks and causes Task B to wait, you'll see Task B's stack. Look for patterns: many different stacks showing latency suggests something else is monopolizing workers; if one stack dominates, that code path is likely the blocker itself.

I built hud because I found existing tools hard to use intuitively. Flamegraphs and perf output require expertise to interpret. hud's TUI surfaces what matters—hotspots ranked by frequency, with drill-down into call stacks—so you can get answers without becoming a profiling expert first.

The Fix

Once you've identified the culprit, the fix is straightforward: offload the work to Tokio's blocking threadpool.

Before:

async fn hash_password(body: Bytes) -> Json<HashResponse> {
    let password = String::from_utf8_lossy(&body).to_string();
    let hash = bcrypt::hash(&password, 10).expect("hash failed");
    Json(HashResponse { hash })
}

After:

async fn hash_password(body: Bytes) -> Json<HashResponse> {
    let password = String::from_utf8_lossy(&body).to_string();

    let hash = tokio::task::spawn_blocking(move || {
        bcrypt::hash(&password, 10).expect("hash failed")
    })
    .await
    .unwrap();

    Json(HashResponse { hash })
}

Quick reference:

Blocking pattern | Solution
--- | ---
File I/O (std::fs) | tokio::fs (uses spawn_blocking internally)
DNS (ToSocketAddrs) | tokio::net::lookup_host
Blocking I/O | Wrap in spawn_blocking
CPU-heavy compute | spawn_blocking, or spawn_blocking + rayon
Already on runtime thread | block_in_place (avoids thread handoff; panics on current_thread)
CPU loops that can't be offloaded | tokio::task::yield_now().await or consume_budget().await
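
A few of those rows in code, as a rough sketch (assumes tokio with the fs and net features enabled; the function names are illustrative):

use tokio::task;

// File I/O: tokio::fs does the spawn_blocking handoff for you.
async fn load_config() -> std::io::Result<String> {
    tokio::fs::read_to_string("config.toml").await
}

// DNS: async resolution instead of std::net::ToSocketAddrs.
async fn resolve_upstream() -> std::io::Result<Option<std::net::SocketAddr>> {
    Ok(tokio::net::lookup_host("example.com:443").await?.next())
}

// A CPU loop that can't be offloaded: yield periodically so other tasks get a turn.
async fn checksum(data: &[u8]) -> u64 {
    let mut sum = 0u64;
    for (i, chunk) in data.chunks(64 * 1024).enumerate() {
        sum = chunk.iter().fold(sum, |acc, b| acc.wrapping_add(*b as u64));
        if i % 16 == 0 {
            task::yield_now().await;
        }
    }
    sum
}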

Note: spawn_blocking isn't free—it moves work to a separate thread pool, adding context-switch overhead. For very short operations (<1ms), the overhead may exceed the benefit. For parallelizable CPU work like image processing or batch compression, use rayon's thread pool—but call it from within spawn_blocking, not directly from async code, or you'll still block a worker.
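
And a sketch of the spawn_blocking + rayon combination mentioned above (process_image is a hypothetical CPU-bound function; rayon is assumed as a dependency):

// Bridge async -> rayon through spawn_blocking so no Tokio worker is tied up.
async fn process_batch(images: Vec<Vec<u8>>) -> Vec<Vec<u8>> {
    tokio::task::spawn_blocking(move || {
        use rayon::prelude::*;
        images
            .into_par_iter()          // fan the work out across rayon's thread pool
            .map(process_image)
            .collect()
    })
    .await
    .expect("blocking task panicked")
}

// Hypothetical CPU-bound work.
fn process_image(img: Vec<u8>) -> Vec<u8> {
    img
}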

Verifying the Fix

Export traces before and after to confirm the improvement:

# Before fix (run under load for 60 seconds)
sudo hud my-service --headless --duration 60 --export before.json

# Deploy fix, then run again
sudo hud my-service --headless --duration 60 --export after.json

Compare the traces. The problematic stack should show reduced frequency or disappear entirely. Some baseline scheduling noise is normal—focus on whether your specific hotspot improved. You can integrate this into CI against a load test to catch regressions before they reach production.

Tuning Sensitivity

The default threshold is 5ms, a reasonable starting point for most web services. Adjust based on your latency requirements:

Threshold | Use case
--- | ---
--threshold 1 | Latency-critical APIs, games
--threshold 5 | General web services (default)
--threshold 10 | Background workers
--threshold 20 | Batch jobs, ETL pipelines

At 50k requests/second, a 1ms block affects 50 requests. Choose a threshold that matches your SLO.
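
For example, for a latency-critical API:

sudo hud my-service --threshold 1    # flag anything that waited more than 1ms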

For interactive debugging, use --window to set a rolling time window. Without it, metrics accumulate forever and never decay:

sudo hud my-service --window 30    # 30-second rolling window

Tip: Real blocking produces consistent traces pointing to user code and known blocking operations. Noise from OS scheduling appears as random preemption in stdlib or runtime code—you can usually ignore it.

Limitations

hud measures scheduling latency—how long worker threads wait in the kernel run queue. This correlates with blocking but doesn't measure it directly. Keep these caveats in mind:

  • System noise: High CPU pressure, NUMA effects, or hypervisor scheduling can cause latency spikes unrelated to your code. Look for consistent, repeatable traces.
  • Frequent short blocks: Many 1ms blocks won't trigger a 5ms threshold but can still degrade throughput. Lower the threshold if you suspect many small blocks accumulating.
  • Lock contention: Threads waiting on mutexes may be in TASK_INTERRUPTIBLE state and won't appear as scheduling latency.
  • Stripped binaries: Most production binaries lack debug symbols. You'll need a debug-enabled build for meaningful stack traces.
  • Tokio only: hud identifies workers by thread name (tokio-runtime-w). Other runtimes (async-std, smol, glommio) won't be detected. Tested with Tokio 1.x—thread naming is an implementation detail that could change in future versions.

Alternatives

tokio-console is the official Tokio diagnostic tool. It measures actual task poll durations—more accurate than scheduling latency. Use it if you can. But it requires adding the console-subscriber crate and rebuilding your application.
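
Setup is a couple of lines plus a rebuild with the unstable cfg (and tokio's tracing feature enabled). Roughly:

// main.rs: register the layer the tokio-console TUI connects to
#[tokio::main]
async fn main() {
    console_subscriber::init();
    // ... run your application
}

# Build with the unstable cfg so Tokio emits task instrumentation
RUSTFLAGS="--cfg tokio_unstable" cargo build --release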

Tokio's unstable blocking detection is another option: compile with RUSTFLAGS="--cfg tokio_unstable" and Tokio will warn when task polls exceed a threshold. This catches the blocker directly, not victims. It's more accurate than hud for identifying the offending code—but requires a rebuild with the unstable flag, and only catches blocks that exceed the threshold during that specific run. Short-but-frequent blocks that add up may slip through.

perf + flamegraphs work for general CPU profiling but require manual interpretation to identify async-specific issues.

hud fills a different niche: profiling without code changes or rebuilds. Useful for staging environments, load testing, quick triage of a running process, or confirming that blocking is even the problem before investing in instrumentation.

Tool | Best for | Trade-off
--- | --- | ---
hud | Quick triage of running processes | Measures symptoms, not direct cause
tokio-console | Precise task poll times | Requires code instrumentation
perf + flamegraphs | CPU profiling, broad analysis | Manual interpretation needed
Custom metrics | Production monitoring | Must know where to instrument

Takeaway

Blocking in async Rust doesn't raise errors or panic. It just makes everything slower—and the cause harder to find.

Watch for:

  • std::fs in async context
  • CPU-heavy crypto (bcrypt, argon2)
  • Sync compression (flate2, zstd)
  • std::net DNS resolution
  • std::sync::Mutex held during expensive work

When you're unsure where the problem lies, start with a profiler that can attach to running processes without code changes. Narrow down the suspects, then dig deeper with instrumentation if needed.


Try hud on your project

curl -LO https://github.com/cong-or/hud/releases/latest/download/hud-linux-x86_64.tar.gz
tar xzf hud-linux-x86_64.tar.gz
sudo ./hud your-tokio-service

github.com/cong-or/hud — MIT or Apache-2.0 licensed. Issues and PRs welcome.
