A lightweight yet powerful eBPF-driven diagnostic tool for Kubernetes applications. Podtrace delivers full-stack observability from kernel events to application-layer behavior, all activated on demand, with no prior configuration or instrumentation. With a single command, it uncovers insights across the entire lifecycle of a pod, including network flows, TCP/UDP performance, file system activity, memory behavior, latency patterns, system calls, and high-level application events such as HTTP, DNS, and database queries.
Overview
Podtrace attaches eBPF programs scoped to the target container, allowing it to observe real behavior at runtime. It automatically correlates low-level kernel activity with high-level application operations, surfacing clear, human-readable diagnostic events that reveal what the pod is experiencing internally.
Instead of assembling data from multiple systems or modifying application code, Podtrace provides deep operational visibility in one place, enabling you to understand:
- Why a service is slow
- Where latency originates
- How network and I/O resources are being used
- Which operations block or fail
- How requests flow through the application
- What happens inside the pod during incidents
By combining system-level details, application-layer insights, and real-time event correlation, Podtrace acts as a single on-demand observability lens. This makes it uniquely effective for debugging, performance analysis, and production incident response in Kubernetes environments, especially when time, context, or access is limited.
Documentation
Podtrace documentation is available in the doc/ directory.
Features
Network Tracing
- TCP Connection Monitoring: Tracks TCP IPv4/IPv6 connection latency and errors
- TCP RTT Analysis: Detects RTT spikes and retry patterns
- TCP State Tracking: Monitors TCP connection state transitions (SYN_SENT, ESTABLISHED, FIN_WAIT, etc.)
- TCP Retransmission Tracking: Detects TCP retransmissions for network quality diagnostics
- Network Device Errors: Monitors network interface errors and packet drops
- UDP Network Tracing: Tracks UDP send/receive operations with latency and bandwidth metrics
- I/O Bandwidth Tracking: Monitors bytes transferred for TCP/UDP send/receive operations
File System Monitoring
- File Operations: Tracks read, write, and fsync operations with latency analysis
- File Path Tracking: Captures full file paths
- I/O Bandwidth: Monitors bytes transferred for file read/write operations
- Throughput Analysis: Calculates average throughput and peak transfer rates
Memory & System Events
- Page Fault Tracking: Monitors page faults with error code analysis
- OOM Kill Detection: Tracks out-of-memory kills with memory usage details
Application Layer
- HTTP Tracing: HTTP request/response tracking via uprobes
- DNS Tracking: Monitors DNS lookups with latency and error tracking
- Database Query Tracing: Tracks PostgreSQL and MySQL query execution with pattern extraction and latency analysis
- TLS/SSL Handshake Tracking: Tracks TLS handshake latency, errors, and failures
- Connection Pool Monitoring: Tracks connection pool usage, monitors pool exhaustion, and tracks connection reuse patterns
System Monitoring
- CPU/Scheduling Tracking: Monitors thread blocking and CPU scheduling events
- CPU Usage per Process: Shows CPU consumption by process
- Process Activity Analysis: Shows which processes are generating events
- Stack Traces for Slow Operations: Captures user-space stack traces for slow I/O, DNS, CPU blocks, memory faults, and other operations exceeding thresholds
- Lock Contention Tracking: Monitors futex and pthread mutex waits with timing and hot lock identification
- Syscall Tracing: Tracks process lifecycle via execve, fork/clone, open/openat, and close syscalls with file descriptor leak detection
- Network Reliability: Monitors TCP retransmissions and network device errors for network quality diagnostics
- Database Query Tracing: Tracks PostgreSQL and MySQL query execution patterns and latency
- Resource Limit Monitoring: Monitors resource usage against configured limits
- Error Correlation with Root Cause Analysis: Correlates errors with operations and Kubernetes context
Distributed Tracing
- Trace Context Extraction: Automatically extracts trace context from HTTP/HTTP2 headers and gRPC metadata
- Event Correlation: Groups events by trace ID to build complete request flows across services
- Request Flow Graphs: Builds directed graphs showing service interactions with latency and error metrics
- Multiple Exporters: Supports OpenTelemetry (OTLP), Jaeger, and Splunk HEC
- Sampling Support: Configurable sampling rates to control export volume
Diagnostics
- Diagnose Mode: Collects events for a specified duration and generates a comprehensive summary report
Alerting
- Real-time Alerts: Sends immediate notifications when fatal, critical, or warning-level issues are detected
- Multiple Channels: Supports webhooks, Slack, and Splunk HEC for alert delivery
- Smart Deduplication: Prevents alert storms with configurable deduplication windows
- Rate Limiting: Configurable rate limits to prevent overwhelming notification systems
Prerequisites
- Linux kernel 5.8+ with BTF support
- Go 1.24+
- Kubernetes cluster access
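A quick way to verify these prerequisites on the target machine (standard commands, shown as a sketch; adjust paths for your distribution):

```bash
# Kernel version (needs 5.8+)
uname -r

# BTF support: this file exists on kernels built with CONFIG_DEBUG_INFO_BTF
ls /sys/kernel/btf/vmlinux

# Go toolchain (needs 1.24+)
go version

# Kubernetes cluster access
kubectl cluster-info
```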
Building
```bash
# Install dependencies
make deps

# Build eBPF program and Go binary
make build

# Build and set capabilities
make build-setup
```
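After a successful build you can sanity-check the binary (a sketch assuming the Makefile places it under bin/ as in the usage examples below; the --help flag is an assumption):

```bash
# Confirm the binary exists and prints usage
ls -lh bin/podtrace
./bin/podtrace --help
```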
Usage
Basic Usage
```bash
# Trace a pod in real-time
./bin/podtrace -n production my-pod

# Run in diagnostic mode
./bin/podtrace -n production my-pod --diagnose 20s
```
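If you do not know the exact pod name yet, standard kubectl commands can locate it first (a sketch; the namespace and the app=my-app label are placeholders):

```bash
# List pods in the namespace and pick the one to trace
kubectl get pods -n production

# Or select by label and pass the name to podtrace
POD=$(kubectl get pods -n production -l app=my-app -o jsonpath='{.items[0].metadata.name}')
./bin/podtrace -n production "$POD"
```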
Diagnose Report
The diagnose mode generates a comprehensive report including:
- Summary Statistics: Total events, events per second, collection period
- DNS Statistics: DNS lookup latency, errors, top targets
- TCP Statistics: RTT analysis, spikes detection, send/receive operations, bandwidth metrics (total bytes, average bytes, peak bytes, throughput)
- UDP Statistics: Send/receive operations, latency analysis, bandwidth metrics, error tracking
- Connection Statistics: IPv4/IPv6 connection latency, failures, error breakdown, top targets
- TCP Connection State Tracking: State transition analysis, state distribution, connection lifecycle monitoring
- File System Statistics: Read, write, and fsync operation latency, slow operations, bandwidth metrics (total bytes, average bytes, throughput)
- HTTP Statistics: Request/response counts, latency analysis, bandwidth metrics, top requested URLs
- Memory Statistics: Page fault counts and error codes, OOM kill tracking with memory usage details
- CPU Statistics: Thread blocking times and scheduling events
- CPU Usage by Process: CPU percentage per process
- Process Activity: Top active processes by event count
- Activity Timeline: Event distribution over time
- Activity Bursts: Detection of burst periods
- Connection Patterns: Analysis of connection behavior
- Network I/O Patterns: Send/receive ratios and throughput analysis
- Process and Syscall Activity: Process execution, fork/clone, file operations, and file descriptor leak detection
- Stack Traces for Slow Operations: User-space stack traces for operations exceeding thresholds with symbol resolution
- Lock Contention Analysis: Futex and pthread mutex wait times and hot lock identification
- Network Reliability: TCP retransmission tracking and network device error monitoring
- Database Query Performance: Query pattern analysis and execution latency (PostgreSQL, MySQL)
- Connection Pool Statistics: Connection pool usage, acquire/release rates, reuse patterns, and exhaustion events
- Potential Issues: Automatic detection of high error rates and performance problems
- Resource Limit Monitoring: Resource usage compared against configured limits
- Error Correlation with Root Cause Analysis: Correlates errors with operations and Kubernetes context
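A typical invocation captures a longer window and saves the report for later review (a sketch; it assumes the report is written to standard output, so shell redirection works):

```bash
# Collect 60 seconds of events and keep the report alongside the incident ticket
./bin/podtrace -n production my-pod --diagnose 60s > my-pod-diagnose-$(date +%Y%m%dT%H%M%S).txt
```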
Running without sudo
After building, set capabilities to run without sudo:
```bash
sudo ./scripts/setup-capabilities.sh
```
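If you prefer to grant the capabilities manually, the script is roughly equivalent to a setcap call along these lines (an illustration only; the exact capability set is defined by scripts/setup-capabilities.sh):

```bash
# Illustrative only: loading eBPF programs typically needs CAP_BPF and CAP_PERFMON,
# plus CAP_NET_ADMIN for some network hooks; consult the script for the authoritative set.
sudo setcap cap_bpf,cap_perfmon,cap_net_admin+eip ./bin/podtrace
getcap ./bin/podtrace
```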
Distributed Tracing
Podtrace supports distributed tracing to correlate events across services in your Kubernetes cluster. Traces are automatically extracted from HTTP headers and exported to popular observability backends.
Quick Start
```bash
# Enable tracing with OpenTelemetry
./bin/podtrace -n production my-pod \
  --tracing \
  --tracing-otlp-endpoint http://otel-collector:4318

# Enable tracing with Jaeger
./bin/podtrace -n production my-pod \
  --tracing \
  --tracing-jaeger-endpoint http://jaeger:14268/api/traces

# Enable tracing with Splunk
./bin/podtrace -n production my-pod \
  --tracing \
  --tracing-splunk-endpoint https://splunk:8088/services/collector \
  --tracing-splunk-token YOUR_TOKEN
```
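Before enabling --tracing it can help to confirm the exporter endpoint is reachable from the machine running Podtrace (a sketch using the collector address from the example above; any HTTP status code means the endpoint answers):

```bash
# Prints an HTTP status code if the OTLP HTTP endpoint is reachable
curl -s -o /dev/null -w '%{http_code}\n' http://otel-collector:4318/
```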
Features
- Automatic Trace Extraction: Extracts W3C Trace Context, B3, and Splunk headers
- Service Correlation: Groups events by trace ID across multiple services
- Request Flow Graphs: Visualizes service interactions
- Multiple Exporters: OTLP, Jaeger, and Splunk HEC support
- Sampling: Configurable sampling rates (0.0-1.0)
Alerting
Podtrace can send real-time alerts when critical issues are detected, including resource limit violations, high error rates, exporter failures, and fatal errors. Alerts are sent via webhooks, Slack, or Splunk HEC.
Quick Start
```bash
# Enable alerting with webhook
export PODTRACE_ALERTING_ENABLED=true
export PODTRACE_ALERT_WEBHOOK_URL=https://alerts.example.com/webhook
./bin/podtrace -n production my-pod

# Enable alerting with Slack
export PODTRACE_ALERTING_ENABLED=true
export PODTRACE_ALERT_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
export PODTRACE_ALERT_SLACK_CHANNEL="#alerts"
./bin/podtrace -n production my-pod

# Enable alerting with Splunk
export PODTRACE_ALERTING_ENABLED=true
export PODTRACE_ALERT_SPLUNK_ENABLED=true
./bin/podtrace -n production my-pod
```
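A simple connectivity check of the webhook receiver before turning alerting on (a sketch; the JSON body is a placeholder probe, not the schema Podtrace sends):

```bash
# Expect a 2xx status from the receiver; the body here is only a connectivity probe
curl -s -o /dev/null -w '%{http_code}\n' -X POST \
  -H 'Content-Type: application/json' \
  -d '{"message": "podtrace webhook connectivity check"}' \
  https://alerts.example.com/webhook
```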
Features
- Automatic Detection: Monitors resource limits, error rates, and system failures
- Multiple Severity Levels: Fatal, Critical, Warning, and Error
- Alert Deduplication: Prevents duplicate alerts within a time window
- Rate Limiting: Configurable limits to prevent alert storms
- Retry Logic: Automatic retries with exponential backoff
See the Alerting Guide for detailed configuration and usage.
Prometheus & Grafana Integration
Podtrace exposes runtime metrics for Kubernetes pods using a built-in Prometheus endpoint. These metrics cover networking (TCP/UDP), DNS, CPU scheduling, file system operations, memory events, and HTTP tracing, all labeled per process and event type.
Running:
```bash
./bin/podtrace -n production my-pod --metrics
```
launches an HTTP server accessible at:
http://localhost:3000/metrics
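A quick way to confirm the endpoint is serving data is to fetch it and filter for Podtrace metric names (standard curl and grep; the metric names are listed in the table below):

```bash
curl -s http://localhost:3000/metrics | grep '^podtrace_' | head -20
```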
Prometheus Scrape Configuration
In your Prometheus scrape job, set <PODTRACE_HOST> to the address of the pod or host running Podtrace.
```yaml
scrape_configs:
  - job_name: 'Podtrace'
    static_configs:
      - targets: ['<PODTRACE_HOST>:3000']
```
Available Metrics
All metrics are exported per process and per event type:
| Metric | Description |
|---|---|
| podtrace_rtt_seconds | Histogram of TCP RTTs |
| podtrace_rtt_latest_seconds | Most recent TCP RTT |
| podtrace_latency_seconds | Histogram of TCP send/receive latency |
| podtrace_latency_latest_seconds | Most recent TCP latency |
| podtrace_dns_latency_seconds_gauge | Latest DNS query latency |
| podtrace_dns_latency_seconds_histogram | Distribution of DNS query latencies |
| podtrace_fs_latency_seconds_gauge | Latest file system operation latency |
| podtrace_fs_latency_seconds_histogram | Distribution of file system operation latencies |
| podtrace_network_bytes_total | Total bytes transferred over the network (TCP/UDP) |
| podtrace_filesystem_bytes_total | Total bytes transferred via filesystem ops |
| podtrace_cpu_block_seconds_gauge | Latest CPU block time |
| podtrace_cpu_block_seconds_histogram | Distribution of CPU block times |
| podtrace_resource_limit_bytes | Resource limit in bytes (CPU/Memory/I/O) |
| podtrace_resource_usage_bytes | Current resource usage in bytes |
| podtrace_resource_utilization_percent | Resource utilization percentage |
| podtrace_resource_alert_level | Resource alert level (0-3: none/warning/critical/emergency) |
| podtrace_pool_acquires_total | Total connection pool acquires |
| podtrace_pool_releases_total | Total connection pool releases |
| podtrace_pool_exhausted_total | Total pool exhaustion events |
| podtrace_pool_wait_time_seconds | Histogram of pool wait times |
| podtrace_pool_connections | Current number of connections in the pool |
| podtrace_pool_utilization | Pool utilization percentage |
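The histogram metrics work with the usual Prometheus quantile functions. For example, a 95th-percentile TCP send/receive latency over the last five minutes could be queried as in the sketch below (it assumes the standard _bucket series Prometheus generates for histograms, with <PROMETHEUS_HOST> as a placeholder for your Prometheus server):

```bash
# 95th-percentile TCP send/receive latency over the last 5 minutes
curl -s 'http://<PROMETHEUS_HOST>:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(podtrace_latency_seconds_bucket[5m])) by (le))'
```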
Grafana Dashboard
A ready-to-use Grafana dashboard JSON is included in the repository at podtrace/internal/metricsexporter/dashboard/Podtrace-Dashboard.json
Steps to use:
- Open Grafana and go to Dashboards → New → Import.
- Paste the JSON or upload the .json file.
- Select your Prometheus datasource.
- Import. The dashboard will display per-process and per-event-type metrics for RTT, latency, DNS, FS, and CPU block time.
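If you prefer to automate the import, Grafana's HTTP API accepts the same JSON (a sketch assuming an API token with dashboard write permissions and a <GRAFANA_HOST> placeholder; the id field is cleared so Grafana creates a new dashboard):

```bash
# Wrap the dashboard JSON in the payload Grafana's /api/dashboards/db endpoint expects
jq '{dashboard: (. + {id: null}), overwrite: true}' \
  podtrace/internal/metricsexporter/dashboard/Podtrace-Dashboard.json |
curl -s -X POST "http://<GRAFANA_HOST>/api/dashboards/db" \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d @-
```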