
SLOK - Service Level Objectives for Kubernetes


SLOK is a Kubernetes operator that manages Service Level Objectives (SLOs) with automatic error budget tracking. Define your reliability targets as Kubernetes resources, and SLOK will continuously monitor them using Prometheus.

Quick Start

Get your first SLO running:

# 1. Install the CRDs and operator
kubectl apply -k config/default

# 2. Create your first SLO
cat <<EOF | kubectl apply -f -
apiVersion: observability.slok.io/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: my-api-availability
spec:
  displayName: "My API Availability"
  objectives:
    - name: availability
      target: 99.9
      window: 7d
      sli:
        query:
          success: sum(rate(http_requests_total{status=~"2.."}[5m]))
          total: sum(rate(http_requests_total[5m]))
EOF

# 3. Check the status
kubectl get slo my-api-availability -o yaml

Prerequisites

  • Kubernetes 1.20+
  • Prometheus 2.x+ -- must be accessible from the operator
  • Prometheus Operator (optional) -- required for ServiceMonitor and PrometheusRule resources
  • cert-manager 1.0+ -- required if using webhooks

Prometheus Setup

SLOK needs to query Prometheus for your SLI metrics. The operator connects to Prometheus via the PROMETHEUS_URL environment variable.

If you're using kube-prometheus-stack, Prometheus is typically available at:

http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
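
The URL can also be changed after deployment by updating the operator's environment. A minimal sketch, assuming the typical kubebuilder deployment name slok-controller-manager in the slok-system namespace (adjust both to match your install):

# Hypothetical names -- check `kubectl get deploy -n slok-system` first
kubectl set env deployment/slok-controller-manager -n slok-system \
  PROMETHEUS_URL=http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090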

Installation

Option 1: Kustomize (Quick)

# Install CRDs and deploy operator
kubectl apply -k config/default

Option 2: Helm (Recommended for Production)

# Add the chart repository (if published) or install from the local chart directory
helm install slok charts/slok \
  --namespace slok-system \
  --create-namespace \
  --set prometheus.url=http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090

Helm Configuration

  • prometheus.url -- Prometheus server URL (default: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090)
  • webhook.enabled -- enable admission webhooks (default: true)
  • metrics.enabled -- enable the metrics endpoint (default: true)
  • prometheusRule.enabled -- deploy PrometheusRule resources for SLO alerts (default: true)
  • replicaCount -- number of operator replicas (default: 1)
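
The same parameters can live in a values file instead of repeated --set flags. A minimal sketch mirroring the defaults above, assuming the chart follows the usual dotted-key-to-nested-YAML convention:

# values.yaml
prometheus:
  url: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
webhook:
  enabled: true
metrics:
  enabled: true
prometheusRule:
  enabled: true
replicaCount: 1

helm install slok charts/slok --namespace slok-system --create-namespace -f values.yaml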

Disable webhooks (useful for development):

helm install slok charts/slok \
  --set webhook.enabled=false \
  --set certManager.enabled=false

Verify Installation

# Check operator is running
kubectl get pods -n slok-system

# Check CRD is installed
kubectl get crd servicelevelobjectives.observability.slok.io
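
To block until the operator is ready before creating SLOs (the deployment name below is the typical kubebuilder default and may differ in your install):

kubectl wait --for=condition=Available -n slok-system \
  deployment/slok-controller-manager --timeout=120s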

Examples

Availability SLO

Track the percentage of successful HTTP requests:

apiVersion: observability.slok.io/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: payment-api-availability
spec:
  displayName: "Payment API Availability"
  objectives:
    - name: availability
      target: 99.9        # Target: 99.9% successful requests
      window: 30d         # Over a 30-day rolling window
      sli:
        query:
          success: sum(rate(http_requests_total{service="payment-api", status=~"2.."}[5m]))
          total: sum(rate(http_requests_total{service="payment-api"}[5m]))

Latency SLO

Track the percentage of requests under a latency threshold:

apiVersion: observability.slok.io/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: checkout-latency
spec:
  displayName: "Checkout Latency"
  objectives:
    - name: p99-latency
      target: 95.0        # 95% of requests should be under threshold
      window: 7d
      sli:
        query:
          success: sum(rate(http_request_duration_seconds_bucket{service="checkout", le="0.5"}[5m]))
          total: sum(rate(http_request_duration_seconds_count{service="checkout"}[5m]))

Multiple Objectives

Define multiple objectives in a single SLO:

apiVersion: observability.slok.io/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: api-gateway-slo
spec:
  displayName: "API Gateway SLO"
  objectives:
    - name: availability
      target: 99.95
      window: 30d
      sli:
        query:
          success: sum(rate(http_requests_total{job="api-gateway", status!~"5.."}[5m]))
          total: sum(rate(http_requests_total{job="api-gateway"}[5m]))

    - name: latency-p99
      target: 99.0
      window: 30d
      sli:
        query:
          success: sum(rate(http_request_duration_seconds_bucket{job="api-gateway", le="0.3"}[5m]))
          total: sum(rate(http_request_duration_seconds_count{job="api-gateway"}[5m]))

Check SLO Status

kubectl get slo payment-api-availability -o yaml

Output:

status:
  objectives:
    - name: availability
      target: 99.9
      actual: 99.87
      status: met           # met | at-risk | violated | unknown
      errorBudget:
        total: "43.2m"
        consumed: "10.5m"
        remaining: "32.7m"
        percentRemaining: 75.69
      burnRate:
        longBurnRate: 0.5
        shortBurnRate: 0.48
        burnRateThreshold: 14.4
        status: "true"
      lastQueried: "2026-01-28T10:30:00Z"
  lastUpdateTime: "2026-01-28T10:30:00Z"
  conditions:
    - type: Available
      status: "True"
      reason: Reconciled

Alerting

When alerting.enabled is set to true on an objective, SLOK automatically generates PrometheusRule resources in the same namespace as the SLO. You can configure two kinds of alerts: error budget alerts and burn rate alerts.

Budget Alerts

Budget alerts fire when the remaining error budget drops below a given percentage. If no custom budgetAlerts are provided, SLOK creates two default rules:

  • SLOObjectiveAtRisk (warning) -- remaining budget is between 0% and 10%.
  • SLOObjectiveViolated (critical) -- remaining budget is at or below 0%.

To override the defaults, specify your own thresholds:

objectives:
  - name: availability
    target: 99.9
    window: 30d
    sli:
      query:
        success: sum(rate(http_requests_total{service="payment-api", status=~"2.."}[5m]))
        total: sum(rate(http_requests_total{service="payment-api"}[5m]))
    alerting:
      enabled: true
      budgetAlerts:
        - name: SLOBudgetWarning
          percent: 20        # fires when remaining budget < 20%
          severity: warning
        - name: SLOBudgetCritical
          percent: 5         # fires when remaining budget < 5%
          severity: critical

Burn Rate Alerts

Burn rate alerts use multi-window, multi-burn-rate detection as described in the Google SRE Workbook. The idea is to alert when the error budget is being consumed faster than expected, rather than waiting for it to run out.

Each burn rate alert defines:

  • consumePercent -- percentage of the total error budget that, if consumed within consumeWindow, should trigger an alert.
  • consumeWindow -- the time frame over which consumePercent is evaluated (e.g., 1h). Together with consumePercent and the SLO window, this determines the burn rate threshold.
  • longWindow -- the long observation window for the avg_over_time subquery (e.g., 1h).
  • shortWindow -- the short observation window for the avg_over_time subquery (e.g., 5m). Used to confirm the long-window signal is not stale.

Example configuration with two severity tiers:

objectives:
  - name: availability
    target: 99.9
    window: 30d
    sli:
      query:
        success: sum(rate(http_requests_total{service="payment-api", status=~"2.."}[5m]))
        total: sum(rate(http_requests_total{service="payment-api"}[5m]))
    alerting:
      enabled: true
      burnRateAlerts:
        - name: HighBurnRate
          consumePercent: 2       # 2% of budget consumed in 1h
          consumeWindow: 1h
          longWindow: 1h
          shortWindow: 5m
          severity: critical
        - name: MediumBurnRate
          consumePercent: 5       # 5% of budget consumed in 6h
          consumeWindow: 6h
          longWindow: 6h
          shortWindow: 30m
          severity: warning

The burn rate threshold is calculated as:

threshold = (consumePercent / 100) * (sloWindow / consumeWindow)

For example, with a 30-day window and consumePercent: 2, consumeWindow: 1h:

threshold = 0.02 * 720h / 1h = 14.4

If both the long-window and short-window burn rates exceed 14.4, the alert fires.
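
The same arithmetic in Go, as an illustrative sketch only (not the operator's code):

package main

import (
	"fmt"
	"time"
)

// burnRateThreshold computes (consumePercent / 100) * (sloWindow / consumeWindow).
func burnRateThreshold(consumePercent float64, sloWindow, consumeWindow time.Duration) float64 {
	return consumePercent / 100 * (sloWindow.Hours() / consumeWindow.Hours())
}

func main() {
	// 2% of a 30-day budget consumed within 1h => threshold of 14.4
	fmt.Println(burnRateThreshold(2, 30*24*time.Hour, time.Hour)) // prints 14.4
}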

Combining Budget and Burn Rate Alerts

You can use both alert types together on the same objective:

alerting:
  enabled: true
  budgetAlerts:
    - name: BudgetLow
      percent: 10
      severity: warning
  burnRateAlerts:
    - name: HighBurnRate
      consumePercent: 2
      consumeWindow: 1h
      longWindow: 1h
      shortWindow: 5m
      severity: critical

Budget alerts tell you how much budget is left. Burn rate alerts tell you how fast it is being consumed. Using both gives you coverage for slow, sustained degradation (caught by budget alerts) and sudden spikes (caught by burn rate alerts).

Limitations

Current Version

  • Manual PromQL required -- there are no query templates or builders; write PromQL directly in the spec.
  • Instant queries only -- the operator uses Prometheus instant queries, not range queries. Make sure your query uses rate() or a similar function.
  • No multi-cluster support -- one operator per cluster. Deploy SLOK in each cluster you want to monitor.
  • Fixed reconciliation interval -- SLOs are re-evaluated every minute; the interval cannot be configured per SLO.
  • Prometheus Operator required for alerts -- PrometheusRule generation requires the Prometheus Operator CRDs. Install the Prometheus Operator or leave alerting.enabled off.

Query Requirements

The SLI is defined as a ratio of two PromQL queries: success (numerator) and total (denominator). The operator computes (success / total) * 100 to get the actual percentage.

Each query must:

  • Return an instant vector with a single element (use sum() to aggregate away labels)
  • Use appropriate time functions (rate(), increase()) for counters

Good queries:

sli:
  query:
    success: sum(rate(http_requests_total{status=~"2.."}[5m]))
    total: sum(rate(http_requests_total[5m]))

Bad query (returns a multi-element vector):

sli:
  query:
    success: rate(http_requests_total{status=~"2.."}[5m])
    total: rate(http_requests_total[5m])
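
To sanity-check a query before putting it in a spec, run it against the Prometheus HTTP API and confirm the result has exactly one element (run this from somewhere that can reach the Prometheus service; the URL is the kube-prometheus-stack default from earlier):

# Expect "1" -- a single-element instant vector
curl -sG http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"2.."}[5m]))' \
  | jq '.data.result | length'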

Error Budget Calculation

Error budget is calculated as:

allowed_errors = ((100 - target) / 100) * window_in_minutes
consumed       = allowed_errors * ((target - actual) / (100 - target))
remaining      = allowed_errors - consumed

Example for a 99.9% target over 30 days:

  • Allowed downtime: 0.1% of 30 days = 43.2 minutes
  • If actual is 99.87%, you've consumed roughly 30% of your budget
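
A worked version of that arithmetic in Go, illustrative only (not the operator's implementation):

package main

import "fmt"

// errorBudget returns the allowed, consumed, and remaining error budget
// in minutes for a target/actual percentage over a window of the given length.
func errorBudget(target, actual, windowMinutes float64) (allowed, consumed, remaining float64) {
	allowed = (100 - target) / 100 * windowMinutes
	consumed = allowed * (target - actual) / (100 - target)
	remaining = allowed - consumed
	return
}

func main() {
	// 99.9% target, 99.87% actual, 30-day window (43200 minutes)
	allowed, consumed, remaining := errorBudget(99.9, 99.87, 30*24*60)
	fmt.Printf("allowed=%.1fm consumed=%.2fm remaining=%.2fm (%.1f%% of budget left)\n",
		allowed, consumed, remaining, remaining/allowed*100)
	// allowed=43.2m consumed=12.96m remaining=30.24m (70.0% of budget left)
}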

Development

# Run locally (against current kubeconfig)
make run

# Run tests
make test

# Build Docker image
make docker-build IMG=your-registry/slok:latest

# Deploy to cluster
make deploy IMG=your-registry/slok:latest

License

Apache License 2.0 - see LICENSE for details.