From Demo to Production: Self-Hosting LLMs with Ollama and Docker

SitePoint Team

How to Deploy Ollama in Production with Docker

  1. Containerize Ollama instances with pinned image tags, volume-mounted model storage, and a custom entrypoint that pre-pulls models before accepting traffic.
  2. Add a Redis response cache using a cache-aside pattern keyed on hashed prompt, model, and temperature to eliminate redundant inference.
  3. Place an Nginx reverse proxy in front of multiple Ollama instances using least_conn routing and proxy_buffering off for streaming support.
  4. Build a lightweight API gateway that checks the cache before forwarding requests and exposes Prometheus-format metrics.
  5. Wire Prometheus and Grafana into the stack to monitor request latency, cache hit ratio, throughput, and error rates.
  6. Harden the deployment with API key authentication at the gateway, TLS termination at Nginx, network segmentation, and health-check-driven restarts.
  7. Launch the full stack with docker compose up -d and verify the request path end-to-end with cache HIT/MISS headers.

Everyone has run ollama run llama3 on their laptop. The command works, the model responds, and for about ten minutes you feel like you've built something. Then someone asks you to make it available for your team, keep it running reliably, and tell them when it breaks. That's when the gap between an Ollama production deployment and a local demo becomes painfully clear.

Ollama gives you a clean HTTP API for generation, chat, and embeddings, but it ships without caching, load balancing, authentication, or metrics. For a self-hosted LLM Docker setup, those missing pieces are what separate a toy from infrastructure your team can actually depend on.

This article walks through building a complete local LLM infrastructure stack using Docker Compose. By the end, you'll have a working Llama 3 production setup: multiple Ollama instances behind an Nginx load balancer, a Redis response cache to kill redundant inference, Prometheus and Grafana for observability, health checks on every service, and a lightweight API gateway tying it all together.

Prerequisites: Docker and Docker Compose v2 installed, a machine with at least 16GB RAM (32GB+ recommended), and basic familiarity with Ollama and containerization. GPU access is optional but covered.

Architecture Overview: What Production Self-Hosting Actually Requires

Why Ollama Alone Isn't Enough

Ollama's HTTP server (ollama serve) handles requests to endpoints like /api/generate, /api/chat, and /api/embed. It works fine for single-user experimentation, but several limitations show up fast under real workloads:

  • No built-in caching. Identical prompts trigger full inference every time. If your RAG pipeline sends the same context window repeatedly, you're burning GPU cycles for nothing.
  • No native Prometheus metrics endpoint. The /api/ps endpoint tells you which models are loaded, and /api/tags lists available models, but neither produces the /metrics format Prometheus expects. You're flying blind on latency percentiles, throughput, and error rates.
  • No load balancing or horizontal scaling. A single Ollama process serves requests sequentially per loaded model. There's no built-in way to distribute work across multiple instances.
  • No authentication. The API has no auth scheme. Anyone who can reach the port can run inference.

Target Architecture

Here's what we're building:

┌──────────┐     ┌───────────────┐     ┌───────────────┐     ┌───────────────────┐
│  Client  │────▶│  API Gateway  │────▶│   Nginx LB    │────▶│ Ollama Instance 1 │
└──────────┘     │ (cache logic) │     │ (least_conn)  │────▶│ Ollama Instance 2 │
                 └───────┬───────┘     └───────────────┘     └───────────────────┘
                         │
                 ┌───────▼───────┐     ┌───────────────────┐
                 │     Redis     │     │    Prometheus     │
                 │   (response   │     │    + Grafana      │
                 │     cache)    │     │   (monitoring)    │
                 └───────────────┘     └───────────────────┘

Clients send requests to the API gateway, which checks Redis for a cached response. On a cache miss, the gateway forwards the request to Nginx, which load-balances across N Ollama instances using least-connections routing. Prometheus scrapes metrics from the gateway and Nginx, and Grafana renders dashboards. Every service runs in Docker with health checks and restart policies.

The Foundation: Dockerizing Ollama for Production

Base Ollama Docker Configuration

The official ollama/ollama image on Docker Hub is the starting point. Pin to a specific tag rather than :latest because you don't want a model-breaking update rolling into production on a random Tuesday morning when someone runs docker compose pull.

Models live at /root/.ollama inside the container. Without a volume mount, every container restart triggers a multi-gigabyte re-download. The OLLAMA_HOST environment variable controls the bind address; set it to 0.0.0.0 so the container accepts connections from other services on the Docker network.

Here's a single Ollama service definition:

# docker-compose.yml (single Ollama instance)
services:
  ollama1:
    image: ollama/ollama:0.3.14
    container_name: ollama1
    volumes:
      - ollama_models:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    expose:
      - "11434"  # reachable on the Docker network only; never published to the host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 16G
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    restart: unless-stopped

volumes:
  ollama_models:

A few things worth calling out. The health check hits /api/tags because Ollama exposes no dedicated /healthz endpoint. It's a lightweight GET request that confirms the server is accepting connections. The start_period of 60 seconds gives the container time to load without triggering false restarts. For CPU-only deployments, rip out the entire deploy.resources.reservations.devices block.

For multi-GPU machines, you can pin specific GPUs to specific instances using the NVIDIA_VISIBLE_DEVICES environment variable (e.g., NVIDIA_VISIBLE_DEVICES=0 for the first GPU, NVIDIA_VISIBLE_DEVICES=1 for the second).

Pre-Loading Models on Container Start

The first request to a fresh Ollama instance triggers a model download that can take minutes. In production, that timeout kills the request and confuses your monitoring. Fix it with a custom entrypoint that pulls the model before the server starts accepting traffic.

#!/bin/bash
# scripts/entrypoint.sh

set -e

# Start Ollama server in the background
ollama serve &
SERVER_PID=$!

# Wait for server to be ready
echo "Waiting for Ollama server to start..."
until curl -sf http://localhost:11434/api/tags > /dev/null 2>&1; do
  sleep 2
done

# Pull the specified model (defaults to llama3.1:8b)
MODEL_NAME="${MODEL_NAME:-llama3.1:8b}"
echo "Pulling model: $MODEL_NAME"
ollama pull "$MODEL_NAME"

echo "Model $MODEL_NAME ready. Server accepting traffic."

# Bring server back to foreground
wait $SERVER_PID

Reference this in your Compose file by overriding the entrypoint:

    entrypoint: ["/bin/bash", "/scripts/entrypoint.sh"]
    volumes:
      - ./scripts/entrypoint.sh:/scripts/entrypoint.sh:ro
      - ollama_models:/root/.ollama

On model selection: I've found that Llama 3.1 8B at Q4_0 quantization hits a practical sweet spot for most internal tools. It runs comfortably in around 5–6GB of VRAM and produces coherent output for summarization, classification, and simple Q&A. The 70B variant delivers noticeably better reasoning but demands 40GB+ VRAM and responds significantly slower. If you're building a document processing pipeline, start with 8B, measure quality, and upgrade only if the output isn't meeting your bar.

Response Caching with Redis: Eliminating Redundant Inference

Why Cache LLM Responses

In production workloads, identical or near-identical prompts show up more often than you'd think. Template-based queries ("Summarize this contract clause: {clause}"), RAG pipelines with overlapping context windows, repeated user questions: they all generate prompts that have already been answered. Running full inference on a prompt you've already processed is pure waste.

When I deployed an internal FAQ bot backed by Ollama for a developer tools team, we found roughly 40% of weekday queries were duplicates or near-duplicates of previous questions. With caching enabled, average response time for those requests dropped from around 4 seconds to under 50 milliseconds, and GPU utilization fell enough that we delayed a hardware upgrade.

Cache Layer Implementation

The cache-aside pattern works well here: the API gateway checks Redis before calling Ollama, and writes responses back on cache misses. The cache key is a SHA-256 hash of the normalized prompt, model name, and temperature, so different generation parameters produce different cache entries.

Redis service definition:

  redis:
    image: redis:7-alpine
    container_name: redis_cache
    command: redis-server --appendonly yes --maxmemory 2gb --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3
    restart: unless-stopped

The appendonly yes flag enables AOF persistence so cached responses survive container restarts. The allkeys-lru eviction policy automatically boots the least recently used entries when memory fills up, which is exactly the behavior you want for a response cache.

Here's the API gateway cache logic in Python (using FastAPI):

# gateway/main.py
import hashlib
import json
import os

import httpx
import redis.asyncio as redis
from fastapi import FastAPI, Request, Response

app = FastAPI()

REDIS_HOST = os.environ.get("REDIS_HOST", "redis")
OLLAMA_UPSTREAM = os.environ.get("OLLAMA_UPSTREAM", "http://nginx_lb:8080")
CACHE_TTL = int(os.environ.get("CACHE_TTL", 3600))  # 1 hour default

cache = redis.Redis(host=REDIS_HOST, port=6379, decode_responses=True)

# Reuse a single httpx client for connection pooling (see performance notes below)
http_client = httpx.AsyncClient(base_url=OLLAMA_UPSTREAM, timeout=120.0)


def make_cache_key(body: dict) -> str:
    """Hash prompt + model + temperature for cache key."""
    normalized = json.dumps({
        "model": body.get("model", ""),
        "prompt": body.get("prompt", ""),
        "messages": body.get("messages", []),
        "temperature": body.get("options", {}).get("temperature", 0.8),
    }, sort_keys=True)
    return f"ollama:{hashlib.sha256(normalized.encode()).hexdigest()}"


@app.get("/health")
async def health():
    return {"status": "ok"}


@app.post("/api/generate")
@app.post("/api/chat")
async def proxy_with_cache(request: Request):
    body = await request.json()

    # Force non-streaming for cacheable requests
    body["stream"] = False
    cache_key = make_cache_key(body)

    # Check cache
    cached = await cache.get(cache_key)
    if cached:
        return Response(
            content=cached,
            media_type="application/json",
            headers={"X-Cache": "HIT"}
        )

    # Cache miss — forward to Ollama via Nginx
    resp = await http_client.post(
        request.url.path,
        json=body,
    )

    # Write to cache
    if resp.status_code == 200:
        await cache.setex(cache_key, CACHE_TTL, resp.text)

    return Response(
        content=resp.text,
        status_code=resp.status_code,
        media_type="application/json",
        headers={"X-Cache": "MISS"}
    )


@app.on_event("shutdown")
async def shutdown():
    await http_client.aclose()
    await cache.aclose()

Cache Invalidation Considerations

A few edge cases to handle:

  • Model version changes. Because the model name is part of the cache key, pulling a new version of llama3.1:8b won't automatically invalidate old cached responses. Include the model digest in the key if you update models frequently, or flush the cache on redeployment.
  • Temperature > 0 responses. Non-deterministic outputs make caching less useful. For creative generation tasks, you might skip caching entirely. For structured extraction (temperature 0), caching is highly effective.
  • Streaming responses. Ollama uses NDJSON for streaming (stream: true). The gateway above forces stream: false for cacheable requests. If your client needs streaming, you can cache the full non-streaming response and replay it as a single chunk on cache hits. This breaks down when clients depend on token-by-token streaming latency; in that case, bypass the cache and forward the streaming connection directly to Ollama.
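
To make the first point concrete, here's a hedged sketch of a digest-versioned cache key. It assumes you maintain a model-name-to-digest map (Ollama's /api/tags response lists a digest per model) and refresh it on deploy; the helper name is illustrative.

```python
# Sketch: fold the model digest into the cache key so that pulling a new
# version of a model automatically starts a fresh key space, leaving stale
# entries to expire via TTL. The digests mapping is assumed to be refreshed
# from Ollama's /api/tags listing on deploy.
import hashlib
import json


def make_versioned_cache_key(body: dict, digests: dict[str, str]) -> str:
    """Like make_cache_key, but also rolls over when the model digest changes."""
    model = body.get("model", "")
    normalized = json.dumps({
        "model": model,
        "digest": digests.get(model, ""),  # new digest => different keys
        "prompt": body.get("prompt", ""),
        "messages": body.get("messages", []),
        "temperature": body.get("options", {}).get("temperature", 0.8),
    }, sort_keys=True)
    return f"ollama:{hashlib.sha256(normalized.encode()).hexdigest()}"
```

Swapping the digest changes every key for that model, which is effectively a per-model cache flush without touching Redis directly.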

Load Balancing with Nginx: Scaling Horizontally

Running Multiple Ollama Instances

Ollama serializes requests per loaded model, so a single instance becomes a bottleneck under concurrent load. The fix: run multiple instances, each handling requests independently. In Docker Compose, define separate named services rather than using deploy.replicas, because named services give you explicit control over GPU assignment and per-instance configuration:

  ollama1:
    image: ollama/ollama:0.3.14
    # ... (full config as above)
    environment:
      - OLLAMA_HOST=0.0.0.0
      - NVIDIA_VISIBLE_DEVICES=0

  ollama2:
    image: ollama/ollama:0.3.14
    # ... (identical config)
    environment:
      - OLLAMA_HOST=0.0.0.0
      - NVIDIA_VISIBLE_DEVICES=1

Both instances share the same ollama_models named volume for model storage. This avoids downloading the model twice, though model loading into memory happens independently per instance.

Nginx as a Reverse Proxy and Load Balancer

LLM request durations vary wildly: 1 second for a short completion, 2 minutes for a long generation. Round-robin load balancing would stack requests on a busy instance while another sits idle. The least_conn directive routes each new request to the instance with the fewest active connections, which is the right call for highly variable request durations.

One critical configuration detail: Ollama streams responses as NDJSON over HTTP chunked transfer encoding (not WebSockets). Nginx's default proxy buffering will accumulate the entire response before forwarding it, which defeats the purpose of streaming and can trigger timeouts. Set proxy_buffering off for the Ollama upstream.

# nginx/nginx.conf
worker_processes auto;

events {
    worker_connections 1024;
}

http {
    # Rate limiting zone
    limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

    upstream ollama_backend {
        least_conn;
        server ollama1:11434;
        server ollama2:11434;
    }

    server {
        listen 8080;

        # Stub status for Prometheus scraping
        location /nginx_status {
            stub_status;
            allow 127.0.0.1;      # container-local health checks
            allow 172.16.0.0/12;  # Docker internal network
            deny all;
        }

        location / {
            limit_req zone=api burst=20 nodelay;

            proxy_pass http://ollama_backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;

            # Critical for LLM workloads
            proxy_read_timeout 120s;
            proxy_connect_timeout 10s;
            proxy_send_timeout 120s;

            # Required for streaming NDJSON responses
            proxy_buffering off;
            proxy_cache off;
            chunked_transfer_encoding on;
        }
    }
}

That proxy_read_timeout 120s is non-negotiable. Nginx's default of 60 seconds will kill legitimate long-running inference requests.

I've set the rate limit to 10 requests per second per IP with a burst of 20 to prevent a single client from monopolizing all Ollama instances.

One caveat worth knowing: NGINX open source doesn't support active health checks for upstream servers (that's an NGINX Plus feature). It does perform passive failure detection: when an upstream returns errors, Nginx temporarily stops routing to it. For active health checking, you rely on Docker's healthcheck and restart mechanism, which is exactly what the healthcheck directives in this stack provide.

Monitoring with Prometheus and Grafana: Observability for LLM Workloads

Exposing Metrics

Ollama has no /metrics endpoint. The /api/ps endpoint shows loaded models, and /api/tags lists available models, but neither produces Prometheus-format metrics. So you need a custom metrics surface, which the API gateway provides.

The gateway should expose a /metrics endpoint with:

  • llm_requests_total (counter, labeled by model, endpoint, status)
  • llm_request_duration_seconds (histogram, labeled by model)
  • llm_cache_hits_total and llm_cache_misses_total (counters)
  • llm_tokens_generated_total (counter, if parsed from Ollama response metadata)

The prometheus_client Python library makes this straightforward. Add counters and histograms to the gateway code, instrument the cache-check and Ollama-call paths, and expose them on a separate port or path.
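
As a sketch (assuming the standard prometheus_client API; the registry wiring and helper names are illustrative, not part of the gateway code above), the metrics could be defined like this:

```python
# Hedged sketch of gateway metrics with prometheus_client. A dedicated
# registry keeps these isolated; serve metrics_text() from a /metrics route.
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

registry = CollectorRegistry()

LLM_REQUESTS = Counter(
    "llm_requests_total", "LLM requests",
    ["model", "endpoint", "status"], registry=registry,
)
LLM_LATENCY = Histogram(
    "llm_request_duration_seconds", "LLM request duration",
    ["model"], registry=registry,
)
CACHE_HITS = Counter("llm_cache_hits_total", "Cache hits", registry=registry)
CACHE_MISSES = Counter("llm_cache_misses_total", "Cache misses", registry=registry)


def record_request(model: str, endpoint: str, status: int, duration_s: float) -> None:
    """Call after each proxied request in the gateway."""
    LLM_REQUESTS.labels(model=model, endpoint=endpoint, status=str(status)).inc()
    LLM_LATENCY.labels(model=model).observe(duration_s)


def metrics_text() -> bytes:
    """Prometheus exposition format; return this from the /metrics handler."""
    return generate_latest(registry)
```

In the FastAPI gateway, increment CACHE_HITS or CACHE_MISSES on each cache lookup and call record_request() after each upstream call.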

Prometheus and Grafana Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'api-gateway'
    static_configs:
      - targets: ['gateway:9090']

  # NOTE: Nginx stub_status output is NOT in Prometheus exposition format.
  # You need an nginx-prometheus-exporter sidecar to convert it.
  # See: https://github.com/nginxinc/nginx-prometheus-exporter
  # Without the exporter, remove or replace this job.
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']

Note that the Nginx stub_status output isn't in Prometheus exposition format. You'll need an nginx-prometheus-exporter sidecar to convert the stub_status output into scrapeable Prometheus metrics, or parse it in your gateway. Without that exporter, the Prometheus nginx scrape job above will fail to parse the response.
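
One way to wire that in, sketched as a Compose fragment (the image tag and flag follow the nginx-prometheus-exporter project's conventions — verify both against its documentation before relying on them):

```yaml
  # ── Nginx Prometheus Exporter (sidecar) ──────────────────
  nginx-exporter:
    image: nginx/nginx-prometheus-exporter:1.3.0
    container_name: nginx_exporter
    command:
      - --nginx.scrape-uri=http://nginx_lb:8080/nginx_status
    depends_on:
      - nginx_lb
    networks:
      - llm_internal
    restart: unless-stopped
```

The exporter listens on port 9113 by default, which matches the nginx-exporter:9113 target in the Prometheus config above.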

Grafana with a provisioned datasource avoids manual setup on every deployment:

  prometheus:
    image: prom/prometheus:v2.53.0
    container_name: prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 5s
      retries: 3
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.1.0
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin}
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
    restart: unless-stopped

Key dashboard panels to build: requests per minute (by model), p50/p95/p99 latency, cache hit ratio over time, and model load status. If you're running GPUs, the NVIDIA DCGM Exporter can feed GPU utilization and VRAM usage into Prometheus, though setting that up is its own adventure and beyond the scope of this stack.
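
As a starting point, the first three panels map to PromQL queries like these (metric names assume the gateway exposes the counters and histogram listed earlier):

```promql
# Requests per minute, by model
sum by (model) (rate(llm_requests_total[1m])) * 60

# p95 latency, by model
histogram_quantile(0.95,
  sum by (model, le) (rate(llm_request_duration_seconds_bucket[5m])))

# Cache hit ratio over the last 5 minutes
sum(rate(llm_cache_hits_total[5m]))
  / (sum(rate(llm_cache_hits_total[5m])) + sum(rate(llm_cache_misses_total[5m])))
```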

The Complete Docker Compose Stack: Putting It All Together

This is the centerpiece. Copy this, adapt the .env values, and you have a production-grade self-hosted LLM deployment.

Full docker-compose.yml

# docker-compose.yml — Production Ollama Stack
# Usage: docker compose up -d

services:
  # ── Ollama Instances ──────────────────────────────────────
  ollama1:
    image: ollama/ollama:${OLLAMA_VERSION:-0.3.14}
    container_name: ollama1
    entrypoint: ["/bin/bash", "/scripts/entrypoint.sh"]
    volumes:
      - ollama_models:/root/.ollama
      - ./scripts/entrypoint.sh:/scripts/entrypoint.sh:ro
    environment:
      - OLLAMA_HOST=0.0.0.0
      - MODEL_NAME=${MODEL_NAME:-llama3.1:8b}
      - NVIDIA_VISIBLE_DEVICES=${GPU_DEVICE_1:-0}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: ${OLLAMA_MEMORY_LIMIT:-16G}
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s
    networks:
      - llm_internal
    restart: unless-stopped

  ollama2:
    image: ollama/ollama:${OLLAMA_VERSION:-0.3.14}
    container_name: ollama2
    entrypoint: ["/bin/bash", "/scripts/entrypoint.sh"]
    volumes:
      - ollama_models:/root/.ollama
      - ./scripts/entrypoint.sh:/scripts/entrypoint.sh:ro
    environment:
      - OLLAMA_HOST=0.0.0.0
      - MODEL_NAME=${MODEL_NAME:-llama3.1:8b}
      - NVIDIA_VISIBLE_DEVICES=${GPU_DEVICE_2:-1}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: ${OLLAMA_MEMORY_LIMIT:-16G}
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s
    networks:
      - llm_internal
    restart: unless-stopped

  # ── Nginx Load Balancer ──────────────────────────────────
  # NOTE: Only the gateway publishes a host port, so clients hit the gateway
  # first; it checks the Redis cache and forwards misses here. Nginx then
  # load-balances across the Ollama instances.
  nginx_lb:
    image: nginx:1.27-alpine
    container_name: nginx_lb
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - ollama1
      - ollama2
    healthcheck:
      test: ["CMD-SHELL", "curl -sf http://localhost:8080/nginx_status || exit 1"]
      interval: 15s
      timeout: 5s
      retries: 3
    networks:
      - llm_internal
    restart: unless-stopped

  # ── Redis Cache ──────────────────────────────────────────
  redis:
    image: redis:7-alpine
    container_name: redis_cache
    command: >
      redis-server
      --appendonly yes
      --maxmemory ${REDIS_MAXMEMORY:-2gb}
      --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 3
    networks:
      - llm_internal
    restart: unless-stopped

  # ── API Gateway (caching + metrics) ──────────────────────
  gateway:
    build: ./gateway
    container_name: api_gateway
    environment:
      - REDIS_HOST=redis
      - CACHE_TTL=${CACHE_TTL:-3600}
      - OLLAMA_UPSTREAM=http://nginx_lb:8080
    ports:
      - "${EXTERNAL_PORT:-8080}:8000"
    depends_on:
      redis:
        condition: service_healthy
      ollama1:
        condition: service_healthy
      ollama2:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 15s
      timeout: 5s
      retries: 3
    networks:
      - llm_internal
    restart: unless-stopped

  # ── Prometheus ───────────────────────────────────────────
  prometheus:
    image: prom/prometheus:v2.53.0
    container_name: prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 5s
      retries: 3
    networks:
      - llm_internal
    restart: unless-stopped

  # ── Grafana ──────────────────────────────────────────────
  grafana:
    image: grafana/grafana:11.1.0
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-changeme}
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "${GRAFANA_PORT:-3000}:3000"
    depends_on:
      - prometheus
    networks:
      - llm_internal
    restart: unless-stopped

# ── Volumes ────────────────────────────────────────────────
volumes:
  ollama_models:
  redis_data:
  prometheus_data:
  grafana_data:

# ── Networks ───────────────────────────────────────────────
networks:
  llm_internal:
    driver: bridge

Directory Structure and Supporting Files

ollama-production/
├── docker-compose.yml
├── .env
├── scripts/
│   └── entrypoint.sh
├── nginx/
│   └── nginx.conf
├── prometheus/
│   └── prometheus.yml
├── gateway/
│   ├── Dockerfile
│   ├── main.py
│   └── requirements.txt
└── README.md

The .env file keeps all tunables in one place:

# .env — Production configuration
MODEL_NAME=llama3.1:8b
OLLAMA_VERSION=0.3.14
OLLAMA_MEMORY_LIMIT=16G

# GPU assignment (set to "all" for single-GPU machines)
GPU_DEVICE_1=0
GPU_DEVICE_2=1

# Cache
CACHE_TTL=3600
REDIS_MAXMEMORY=2gb

# Networking
EXTERNAL_PORT=8080
GRAFANA_PORT=3000
GRAFANA_PASSWORD=changeme

Startup, Verification, and First Request

# Start the full stack
docker compose up -d

# Watch Ollama model pull progress
docker compose logs -f ollama1 ollama2

# Wait for healthy status on all services
docker compose ps

Once all services report healthy, verify the full request path:

# First request — cache MISS (expect 3-8 seconds)
# Note: requests go to the gateway (exposed on EXTERNAL_PORT),
# which checks Redis then forwards cache misses to Nginx -> Ollama.
curl -s http://localhost:8080/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","prompt":"What is Docker?","stream":false}' \
  -w "\nHTTP Status: %{http_code}\n" \
  -D - 2>&1 | grep -E "(X-Cache|HTTP Status)"

# Output:
# X-Cache: MISS
# HTTP Status: 200

# Second identical request — cache HIT (expect <100ms)
curl -s http://localhost:8080/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","prompt":"What is Docker?","stream":false}' \
  -w "\nTotal time: %{time_total}s\n" \
  -D - 2>&1 | grep -E "(X-Cache|Total time)"

# Output:
# X-Cache: HIT
# Total time: 0.042s

# Verify Prometheus is scraping
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep "health"

# Access Grafana at http://localhost:3000 (admin / changeme)

The difference between the first and second request is the entire point. A cache miss runs full inference through Nginx to Ollama and back. A cache hit returns the stored response from Redis in milliseconds.

For workloads with any prompt repetition, this single layer dramatically cuts your compute requirements.

Hardening for Production: Security, Resilience, and Performance Tuning

Security Essentials

Ollama's API has zero authentication. Every security boundary has to come from the layers around it:

  • API key authentication at the gateway layer. Add a simple middleware that checks for an Authorization: Bearer <token> header before forwarding requests. Ten lines of code, and it stops unauthorized access cold.
  • TLS termination at Nginx. For internal deployments, a self-signed certificate works. For anything internet-facing, use Let's Encrypt with certbot.
  • Network segmentation. In the Compose file, only Nginx and Grafana expose ports to the host. Ollama, Redis, the gateway, and Prometheus live entirely on the llm_internal bridge network and are unreachable from outside.
  • Bind address restriction. Set OLLAMA_HOST=0.0.0.0 only within the Docker network. Never expose Ollama's port 11434 directly to the host with a port mapping.

Resilience Patterns

Health check driven restarts do the heavy lifting here. If Ollama runs out of memory or hangs, the Docker health check fails after three retries, and restart: unless-stopped brings the container back. The start_period gives it time to reload the model.

For graceful degradation, modify the gateway to return a cached response (even if slightly stale) when all Ollama instances are unreachable. Fall back to a 503 with a Retry-After header only when the cache has no relevant entry.
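
A hedged sketch of that fallback (the ":stale" key suffix, the 24-hour stale TTL, and the duck-typed cache interface are all illustrative choices layered on the Redis cache from earlier):

```python
# Store each response twice: once at the normal TTL, once at a longer
# "stale" TTL that is only read when every Ollama instance is unreachable.
STALE_TTL = 24 * 3600  # illustrative


async def set_with_stale(cache, key: str, value: str, ttl: int) -> None:
    await cache.setex(key, ttl, value)
    await cache.setex(key + ":stale", STALE_TTL, value)


async def get_with_fallback(cache, key: str, fetch):
    """fetch() performs the upstream call and raises if all instances are down."""
    cached = await cache.get(key)
    if cached is not None:
        return cached, "HIT"
    try:
        return await fetch(), "MISS"
    except Exception:
        stale = await cache.get(key + ":stale")
        if stale is not None:
            return stale, "STALE"  # degrade gracefully instead of failing
        raise                      # nothing cached: surface the 503 path
```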

Then there's model warm-up. Ollama unloads models from memory after a period of inactivity (default 5 minutes, controlled by the OLLAMA_KEEP_ALIVE environment variable). To prevent cold-start latency for the first request after idle time, either increase OLLAMA_KEEP_ALIVE (e.g., set it to 24h) or run a lightweight keep-alive cron that sends a minimal prompt every few minutes. A simple curl in a loop works.
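
The keep-alive loop can be sketched in Python as well (a hedged sketch: the gateway URL and interval are illustrative, and options.num_predict is Ollama's knob for capping generation length at a single token):

```python
# Minimal keep-alive pinger: a tiny generate request every few minutes keeps
# the model loaded so real users never pay cold-start latency.
import json
import time
import urllib.request

GATEWAY_URL = "http://localhost:8080/api/generate"  # adjust to your deployment


def build_ping(model: str = "llama3.1:8b") -> bytes:
    """Smallest useful request body: one token, no streaming."""
    return json.dumps({
        "model": model,
        "prompt": "ping",
        "stream": False,
        "options": {"num_predict": 1},
    }).encode()


def ping_once(url: str = GATEWAY_URL) -> int:
    req = urllib.request.Request(
        url, data=build_ping(), headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.status


def keep_alive_loop(interval_s: int = 240) -> None:
    """Run forever; 240s stays under the 5-minute OLLAMA_KEEP_ALIVE default."""
    while True:
        try:
            ping_once()
        except OSError:
            pass  # instance restarting; retry on the next cycle
        time.sleep(interval_s)
```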

Performance Tuning Quick Wins

proxy_buffering off in Nginx is already set for streaming, but double-check it's applied to all location blocks serving Ollama endpoints.

Redis allkeys-lru eviction is already configured. Monitor eviction rates in Grafana; if evictions are frequent, bump REDIS_MAXMEMORY.

Connection pooling in the gateway matters. The gateway code above already instantiates a single httpx.AsyncClient at module level and reuses it across requests; keep that pattern rather than creating a new client (and fresh TCP connections) per request.

Model selection matters more than tuning. Switching from Q8_0 quantization to Q4_0, or choosing a smaller model like Phi-3 Mini, can double your throughput. Measure your quality requirements first, then pick the smallest model that meets them.

When to Use This (and When Not To)

This stack fits well for internal tools, RAG pipelines, dev and staging environments, and any workload where sending data to external APIs (OpenAI, Anthropic) is off the table due to data sensitivity or compliance. It also gives you predictable costs: once you own the hardware, inference is essentially free.

It's not the right choice for consumer-facing products that need sub-200ms latency at high concurrency. Ollama is optimized for ease of use, not maximum throughput. If you're hitting throughput ceilings, look at vLLM or Text Generation Inference (TGI) from Hugging Face, which implement PagedAttention and continuous batching for significantly higher tokens-per-second on the same hardware. Those tools trade Ollama's simplicity for raw performance and require more operational investment.

Teams without dedicated DevOps capacity should also think carefully. This stack has seven services to maintain, update, and debug. If nobody on your team is comfortable reading Docker Compose logs at 2 AM, a managed API might be the more honest choice.

Your Self-Hosted LLM Is Now Production-Ready

You've built a fully containerized, cached, load-balanced, and monitored LLM deployment that goes well beyond what ollama serve gives you out of the box. The architecture handles the unsexy but critical problems: redundant inference, request serialization, observability gaps, and failure recovery.

The stack is model-agnostic. Swap llama3.1:8b for mistral:7b, phi3:mini, or qwen2:7b by changing one line in your .env file. The caching, load balancing, and monitoring layers don't care what model sits behind them.

For next steps, consider adding semantic caching (using embedding similarity instead of exact prompt matching for cache lookups), request queuing with Celery or BullMQ to handle burst traffic gracefully, and multi-model routing in the gateway to direct different request types to different models. Each of those extends naturally from the architecture you've already built.
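
As a taste of the semantic-caching direction (a hedged sketch: the 0.92 threshold and the linear scan are illustrative — a real deployment would embed prompts, e.g. via Ollama's /api/embed, and use a vector index instead):

```python
# Semantic cache lookup: reuse a cached answer when a new prompt's embedding
# is close enough to a previously answered prompt's embedding.
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def semantic_lookup(embedding, store, threshold=0.92):
    """store: list of (embedding, cached_response); best match or None."""
    best, best_sim = None, threshold
    for emb, response in store:
        sim = cosine(embedding, emb)
        if sim >= best_sim:
            best, best_sim = response, sim
    return best
```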

Clone the stack, adapt the .env to your hardware, and run docker compose up -d. You'll have a production LLM endpoint running in under ten minutes.

SitePoint Team

Sharing our passion for building incredible internet things.