Root cause: ollama pull stuck at 99% — the 2-year-old bug

If you've used ollama for any length of time, you've probably hit this:

pulling 9b6d12fa8910...   99%  ▕████████████████████▏ 6.9 GB

…and then it just sits there. Your bandwidth is fine. The server is fine. The TCP connection looks alive. But ollama is wedged. After 30 minutes you Ctrl-C, run ollama pull again, and it finishes the last 1% in 3 seconds.

That's issue #1736: 124 comments, 82 reactions, open since 2023. The world's most popular local-LLM runner has a download bug that 5,000 people have hit and nobody has fixed.

I spent a weekend on it. The root cause is a single guard condition, and the fix is a 5-line change. Here it is.

The maintainer was right — and that's why it stayed unfixed

Two years of "is this a Cloudflare problem?", "is this a Range header bug?", "is this NAT timeout?" — until a maintainer (mxyng) wrote this comment:

Certain parts stall completely and zero data is received from the backend. The connection itself is still healthy so it doesn't trigger a retry.

That's accurate. R2 (Cloudflare's object storage) occasionally lets a stream go silent: TCP stays connected, no FIN, no RST, just no more bytes. From the perspective of Go's net/http everything is fine, so the Read() call blocks forever.

This is a server-side problem. Ollama can't fix R2. The maintainer concluded the only real solution was to fix the storage backend.

That conclusion is what kept the bug open for 2 years.

It's wrong, but not for the reason you'd guess.

The actual root cause is in the client

Ollama already has a watchdog. Look at server/download.go:

g.Go(func() error {
    ticker := time.NewTicker(time.Second)
    for {
        select {
        case <-ticker.C:
            if part.Completed.Load() >= part.Size {
                return nil
            }

            part.lastUpdatedMu.Lock()
            lastUpdated := part.lastUpdated
            part.lastUpdatedMu.Unlock()

            if !lastUpdated.IsZero() && time.Since(lastUpdated) > 30*time.Second {
                // stall detected: fire errPartStalled
                part.lastUpdated = time.Time{}  // reset to zero
                return errPartStalled
            }
        case <-ctx.Done():
            return ctx.Err()
        }
    }
})

A goroutine wakes up every second. If the part hasn't made progress in 30 seconds, it fires errPartStalled, which triggers a retry on a fresh connection. This is exactly the right defensive code for the bug mxyng described.

So why doesn't it work?

Look at the guard condition: !lastUpdated.IsZero() && time.Since(lastUpdated) > 30*time.Second.

lastUpdated is set to a real time by the Write() method — i.e., only after the first byte arrives. Before any byte arrives, lastUpdated is the zero time. The guard skips the stall check when lastUpdated.IsZero().

This is fine on a fresh connection — there's a brief delay before bytes start flowing, and you don't want to fire errPartStalled during normal connection setup.

But watch what happens on the second attempt:

1. First attempt: bytes flow, then stall. lastUpdated = T (some real time).
2. Watchdog at T+30s: time.Since(lastUpdated) > 30s → fires errPartStalled.
3. Reset: part.lastUpdated = time.Time{} (zero time again).
4. Retry opens new connection.
5. New connection also stalls before producing any byte (because Cloudflare's flow is sticky to your IP and the bad POP keeps returning dead streams).
6. Watchdog: !lastUpdated.IsZero() → false → stall check skipped forever.
7. Reader goroutine sits on Read() until ctx cancellation, which never comes because nothing else is going wrong.

That's the bug. The watchdog protects against the first stall but disarms itself for every subsequent retry until a byte arrives. And when the underlying connection is dead from the start, no byte ever arrives.

This is what mxyng was seeing in user reports: a server-side stall, yes, but one that only became permanent because of a client-side dead end. The watchdog was the protection, and it disabled itself.

The fix

Five lines:

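g.Go(func() error {
    ticker := time.NewTicker(time.Second)
    // Initialize at watchdog entry so the stall timer fires even when
    // no bytes ever arrive from the server.
    part.lastUpdatedMu.Lock()
    part.lastUpdated = time.Now()
    part.lastUpdatedMu.Unlock()

    for {
        select {
        case <-ticker.C:
            if part.Completed.Load() >= part.Size {
                return nil
            }

            part.lastUpdatedMu.Lock()
            lastUpdated := part.lastUpdated
            part.lastUpdatedMu.Unlock()

            // Same loop as before, minus the !IsZero() guard and the
            // reset-to-zero: the timer is always armed.
            if time.Since(lastUpdated) > 30*time.Second {
                return errPartStalled
            }
        case <-ctx.Done():
            return ctx.Err()
        }
    }
})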

Initialize lastUpdated to time.Now() at watchdog entry. Drop the IsZero() guard. Drop the reset-to-zero after stall detection.

The watchdog now always has a reference point. Every retry gets 30 seconds to produce a byte; if it can't, the stream is dead and we open a fresh one (which may route to a different Cloudflare POP).

Why it took 2 years

Three things stacked:

1. The maintainer's diagnosis was correct in spirit but pointed away from the fix. "Cloudflare/R2 server-side stalls" is true, but the client watchdog was meant to handle exactly that. The fix is in the client, not the backend.

2. The buggy line read defensively. !lastUpdated.IsZero() looks like "be careful, don't fire too early during connection setup." It reads like good code. The interaction with the retry path is non-obvious.

3. The bug requires a chain of failures to reproduce. First connection stalls → retry → second connection also stalls before first byte → only then does the watchdog disable itself. Most users see one stall, Ctrl-C, retry succeeds. The "stuck at 99% forever" case requires the second attempt to also stall from byte zero, which only happens reliably on certain Cloudflare POPs (Australia, parts of Asia — exactly the geographic clustering in the bug report).

The TDD test that locks it in

func TestDownloadChunkStallWatchdogFiresWithoutProgress(t *testing.T) {
    origStall := stallDuration
    stallDuration = 200 * time.Millisecond
    t.Cleanup(func() { stallDuration = origStall })

    srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Length", "1024")
        w.WriteHeader(http.StatusPartialContent)
        w.(http.Flusher).Flush()
        <-r.Context().Done() // never send body bytes
    }))
    // ... assert errPartStalled within 2s
}

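This test reproduces the exact failure: the server accepts the connection, sends headers, then blocks without ever sending body bytes. It shortens the stall threshold through stallDuration, so it assumes the watchdog reads that package-level variable rather than a hard-coded 30*time.Second. Without the fix, the test runs for 5 seconds and returns context deadline exceeded. With the fix, it returns errPartStalled in ~1 second, on the watchdog's first one-second tick.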

The test catches the bug. Now it can't regress.

Lessons

A correct diagnosis isn't a fix. "Server is broken" was true. The fixable part was still in the client.
Defensive guards have a blast radius. if !lastUpdated.IsZero() reads like safety. It's actually a fail-open switch in this code path.
Test the empty case. The original test suite verified the watchdog fires during a partial-data stall. Nobody tested "zero bytes flow ever." That's where it broke.
Stale-bot kills good bug reports. This bug went unfixed partly because every 14 days the stale-bot considered closing the issue. Real bugs deserve patience.

The PR is ollama/ollama#15716 if you want to read the actual diff. ~10 lines of production code, 63 lines of test.

If you maintain or contribute to a distributed system that does retries, audit your watchdogs for "init-on-first-event" guards. The pattern is everywhere and the same bug repeats.

— @alvinttang