Understanding How the GIL Affects Checkpoint Performance in PyTorch Training


I have been spending time learning about model training infrastructure lately, and something that stood out to me was GIL contention when saving training checkpoints in PyTorch. Having spent years in the Ruby world dealing with the GVL (Global VM Lock, Ruby’s equivalent), I was naturally drawn to it. The symptoms are familiar: you spin up background threads expecting parallelism, and instead everything gets slower.

So, as one does, I went down the rabbit hole to learn what the GIL actually is, why it makes saving checkpoints in the background counterproductive when done with threads, what the fix looks like, and how I measured all of it on an H100 (rented for a few hours) with a real Llama model to build a better mental model. Here is my attempt at piecing it all together.

What the GIL is, and what it is not

The Global Interpreter Lock is a mutex inside CPython (the reference Python implementation) that protects access to Python objects. It exists because CPython’s memory management uses reference counting, and reference counting is not thread-safe. Every time you assign a variable, pass an argument, or return a value, CPython increments or decrements a reference count. Without the GIL, two threads doing this simultaneously would corrupt the count and either leak memory or free objects that are still in use.

What I find interesting is that the GIL in Python is a deliberate tradeoff. Single-threaded Python code runs faster with the GIL than it would with fine-grained per-object locking, because you avoid the overhead of acquiring and releasing locks on every reference count operation. The cost is that only one thread can execute Python bytecode at a time.

Here is roughly what the CPython eval loop looks like:

┌─────────────────────────────────────────────────┐
│          CPython Eval Loop (Python 3.2+)        │
│                                                 │
│   while (bytecode_remaining) {                  │
│       if (eval_breaker) {                       │
│           if (gil_drop_request) {               │
│               release_gil();                    │
│               // other threads can run here     │
│               acquire_gil();                    │
│           }                                     │
│       }                                         │
│       execute_next_bytecode();                  │
│   }                                             │
│                                                 │
│   // A waiting thread sets gil_drop_request     │
│   // after sys.getswitchinterval() seconds      │
│   // (default: 5ms). The holding thread checks  │
│   // eval_breaker every few bytecode ops.       │
└─────────────────────────────────────────────────┘

Every 5 milliseconds (configurable via sys.setswitchinterval()), a waiting thread requests the GIL by setting a flag. The holding thread checks this flag periodically and yields. This is cooperative multitasking at the interpreter level.
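
For what it is worth, the interval is just an interpreter setting that you can inspect and tune from Python:

import sys

# Default is 0.005 seconds (5 ms): how long a thread may keep the GIL
# before a waiting thread's gil_drop_request gets honored.
print(sys.getswitchinterval())   # 0.005

# A larger interval means fewer forced switches for CPU-bound threads,
# at the cost of higher latency for whoever is waiting on the GIL.
sys.setswitchinterval(0.01)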

The important nuance, and this is the part that tripped me up at first, is that C extensions can release the GIL explicitly when they do work that does not touch Python objects. NumPy releases it during matrix operations. The time.sleep() function releases it. And CUDA (Compute Unified Device Architecture, NVIDIA’s parallel computing platform) operations release it too, because the actual GPU (Graphics Processing Unit) computation happens on the device, not in Python. But the Python-level code between operations – autograd bookkeeping, argument marshalling, and calling into the C++ backend – does hold the GIL.
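
A small toy, unrelated to checkpointing, makes the distinction visible: two threads running pure-Python bytecode serialize on the GIL, while two threads doing NumPy matmuls can overlap because NumPy drops the GIL for the duration of the operation.

import threading
import time
import numpy as np

def pure_python_work(n=10_000_000):
    # Executes bytecode the whole time, so it holds the GIL throughout.
    total = 0
    for i in range(n):
        total += i

def numpy_work(a):
    # NumPy releases the GIL inside the matmul, so threads can overlap here.
    a @ a

def run_in_two_threads(fn, *args):
    threads = [threading.Thread(target=fn, args=args) for _ in range(2)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

a = np.random.rand(1500, 1500)
print("pure Python, 2 threads:", run_in_two_threads(pure_python_work))
print("NumPy matmul, 2 threads:", run_in_two_threads(numpy_work, a))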

This distinction ends up being really important for checkpointing.

A typical training step looks like this at the Python level:

┌──────────────────────────────────────────────────────────────────┐
│                     One Training Step                            │
│                                                                  │
│  Python thread (holds GIL)           GPU (runs kernels)          │
│  ──────────────────────────          ──────────────────          │
│  optimizer.zero_grad()         ───►  launch zero kernels         │
│  logits = model(batch)         ───►  launch forward kernels      │
│  loss.backward()               ───►  launch backward kernels     │
│  optimizer.step()              ───►  launch optimizer kernels    │
│                                                                  │
│  The GIL is held for the Python glue code between operations     │
│  (autograd bookkeeping, moving between Python and C++ layers).   │
│  The C++ backend releases the GIL during actual CUDA dispatch.   │
│  But moving between ops requires frequent GIL reacquisition.     │
└──────────────────────────────────────────────────────────────────┘

The GPU runs independently once kernels are queued. But the queue needs to be fed. Between each operation, the training thread needs the GIL to run the Python-level autograd bookkeeping, set up the next call, and transition back into the C++ backend. If another thread is holding the GIL during these transitions, the training thread waits. The GPU runs through its queued work, and if the queue empties before the training thread gets the GIL back, the GPU sits idle.
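
A quick experiment shows the queueing behavior (it needs a CUDA GPU): the matmul call returns almost immediately because it only enqueues a kernel, and the real cost only shows up at the synchronize.

import time
import torch

x = torch.randn(8192, 8192, device="cuda")
_ = x @ x                        # warm-up: CUDA context and cuBLAS init
torch.cuda.synchronize()

t0 = time.perf_counter()
y = x @ x                        # returns as soon as the kernel is queued
launch = time.perf_counter() - t0

torch.cuda.synchronize()         # block until the GPU actually finishes
total = time.perf_counter() - t0

print(f"launch: {launch * 1e3:.2f} ms, total with GPU work: {total * 1e3:.2f} ms")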

Once I understood this, the problem with thread-based async checkpointing started to make sense.

What happens during a checkpoint

When training a model with FSDP (Fully Sharded Data Parallel), each GPU holds a shard of the model weights and optimizer states. A checkpoint needs to get this data from GPU memory to persistent storage. The way I think about it, the process has two phases:

┌────────────────────────────────────────────────────────────────┐
│                                                                │
│  GPU HBM (memory)         CPU DRAM (memory)         NVMe (disk)│
│  ┌──────────┐            ┌──────────┐            ┌──────────┐  │
│  │  Model   │  staging   │  Staged  │   write    │  Saved   │  │
│  │  weights │ ─────────► │   copy   │ ─────────► │  check-  │  │
│  │  +optim  │  (PCIe)    │          │  (disk)    │  point   │  │
│  └──────────┘            └──────────┘            └──────────┘  │
│                                                                │
│  Phase 1: Staging           Phase 2: Persistence               │
│  Training pipeline blocked  Can run in background              │
│  (0.1s - 10s)               (5s - 30s)                         │
│                                                                │
└────────────────────────────────────────────────────────────────┘

Phase 1 (staging) copies tensors from GPU to CPU memory over the PCIe (Peripheral Component Interconnect Express) bus. The GPU cannot train during this time because the checkpoint code issues synchronization barriers that block the training pipeline until the copy completes. With pinned (page-locked) memory, the GPU can copy directly into the CPU buffer using Direct Memory Access (DMA), bypassing the CPU entirely for the data transfer. This makes staging take a few hundred milliseconds. Without pinned memory, the copy goes through an intermediate kernel buffer, and staging takes seconds.
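
The pinned-versus-pageable difference is easy to reproduce in isolation with plain tensor copies (a sketch, not the actual checkpoint code):

import torch

gpu_tensor = torch.randn(4096, 4096, device="cuda")

# Page-locked (pinned) CPU buffer: the GPU can DMA straight into it,
# and the copy can be issued asynchronously.
pinned = torch.empty(gpu_tensor.shape, dtype=gpu_tensor.dtype,
                     device="cpu", pin_memory=True)
pinned.copy_(gpu_tensor, non_blocking=True)

# Pageable CPU memory: the CUDA runtime bounces the data through an
# internal staging buffer, and the copy blocks the host.
pageable = torch.empty(gpu_tensor.shape, dtype=gpu_tensor.dtype, device="cpu")
pageable.copy_(gpu_tensor)

torch.cuda.synchronize()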

Phase 2 (persistence) serializes the CPU tensors and writes them to disk. This phase does not need the GPU at all. It is pure CPU and I/O work.

The idea behind async checkpointing is simple: do Phase 1 quickly, then run Phase 2 in the background while the GPU resumes training. Synchronous checkpointing blocks the GPU for both phases. Async should block only for Phase 1.

Thread-based async and the GIL problem

PyTorch’s async_save() was introduced in version 2.4 with the default BlockingAsyncStager added in PR #124939. By default, it runs Phase 2 in a background thread. The training thread resumes after staging, and the background thread handles serialization and disk writes.
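
In code, the thread-based default looks roughly like this (a sketch that assumes a state_dict is already assembled and the usual distributed setup is in place; the path is made up):

import torch.distributed.checkpoint as dcp

# Default async_save behavior (PyTorch 2.4+): stage to CPU, then
# serialize and write in a background *thread* that shares the GIL
# with the training thread.
future = dcp.async_save(
    state_dict,
    checkpoint_id="/mnt/checkpoints/step_100",
)

# ... training continues here, but now competes for the GIL ...

future.result()   # ensure the previous write finished before the next save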

Here is where it gets interesting though. The background thread still needs the GIL to serialize Python objects, construct metadata, and orchestrate the write. And while it holds the GIL, the training thread cannot dispatch new CUDA kernels.

┌──────────────────────────────────────────────────────────────────┐
│               Thread-based Async Checkpointing                   │
│                                                                  │
│  Time ──────────────────────────────────────────────────►        │
│                                                                  │
│  Training   [train][train][ stg ][ train? ][ train? ][ train ]   │
│  thread     ▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░▓▓░░▓▓░░▓▓░░▓▓░░░▓▓▓▓▓▓▓▓       │
│                                                                  │
│  Checkpoint                      [ serialize + write to disk ]   │
│  thread                          ░░▓▓░░▓▓░░▓▓░░▓▓░░░             │
│                                                                  │
│  GPU         ██████████████      ██  ██  ██  ██   ████████████   │
│              (full speed)        (gaps: queue     (full speed)   │
│                                   draining)                      │
│                                                                  │
│  ▓ = holding GIL    ░ = waiting for GIL    █ = GPU active        │
│                                                                  │
│  The training thread and checkpoint thread alternate holding     │
│  the GIL. The GPU kernel queue drains during the gaps.           │
└──────────────────────────────────────────────────────────────────┘

The two threads take turns with the GIL. The training thread dispatches a few kernels, yields the GIL (either at the switch interval or during a C extension call), the checkpoint thread does some serialization work, yields back, and so on. The GPU processes whatever is in its queue, but the queue is being fed at a reduced rate because the training thread is sharing GIL time with the checkpoint thread. In my measurements, this worked out to about a 50% increase in step time (108ms baseline to ~162ms), meaning roughly a third of the training thread’s GIL time was being surrendered to the checkpoint thread.

The result is that training throughput drops during the entire background write, which can last 10 to 30 seconds for a large checkpoint. In my tests, thread-based async was still faster than fully synchronous saves (591ms average vs 1,878ms), but much less beneficial than I expected. The per-step numbers tell the real story. Training steps that should take 108ms were consistently running at 162ms for as long as the background thread was active – a 50% step-time penalty, which translates to about a 33% throughput drop. In earlier testing on an A100 with PyTorch 2.6, the overhead was even worse and thread-based async was actually slower than sync overall.

Process-based async: separate GIL, separate interpreter

When I found the fix that PyTorch shipped in PR #147039 (March 2025), it felt obvious in hindsight. Instead of running Phase 2 in a background thread, run it in a separate process. A separate process has its own Python interpreter with its own GIL, so the training thread’s GIL is never contested.

┌──────────────────────────────────────────────────────────────────┐
│               Process-based Async Checkpointing                  │
│                                                                  │
│  Time ──────────────────────────────────────────────────►        │
│                                                                  │
│  Training   [train][train][ stg ][ train ][ train ][ train ]     │
│  process    ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓        │
│  (GIL A)                                                         │
│                                                                  │
│  Checkpoint                      [ serialize + write to disk ]   │
│  process                         ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓       │
│  (GIL B)                                                         │
│                                                                  │
│  GPU         ████████████████████████████████████████████████    │
│              (full speed throughout, no queue draining)          │
│                                                                  │
│  Two separate GILs. Neither process waits for the other.         │
│  The checkpoint process receives staged data via shared memory.  │
└──────────────────────────────────────────────────────────────────┘

The training process keeps its GIL permanently and can dispatch CUDA kernels at full rate. The checkpoint process has its own GIL and does all the serialization and disk I/O work independently. They communicate through shared memory, so the staged CPU tensors do not need to be copied between processes.
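
The zero-copy handoff can be sketched with plain torch.multiprocessing primitives. This is the general mechanism, not PyTorch’s actual checkpoint code: a CPU tensor whose storage lives in shared memory is visible to a spawned child without copying the data.

import torch
import torch.multiprocessing as mp

def child(t):
    # Same physical pages as the parent: writes here are visible there.
    t[0, 0] = 42.0

if __name__ == "__main__":
    staged = torch.zeros(1024, 1024)   # stand-in for a staged CPU tensor
    staged.share_memory_()             # move its storage into shared memory

    ctx = mp.get_context("spawn")
    p = ctx.Process(target=child, args=(staged,))
    p.start()
    p.join()
    print(staged[0, 0])                # tensor(42.) -- no copy was made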

One detail I appreciated in the implementation is that PyTorch spawns the checkpoint process as a daemon on the first checkpoint and then reuses it for all subsequent saves. So the startup overhead (process creation, initializing a Gloo (Facebook’s collective communications library) process group for coordination) is a one-time cost of roughly 2 to 5 seconds. Pretty neat.

Why spawn, not fork

Python’s multiprocessing module offers different ways to create a child process, the two most common being fork and spawn. Naturally I wondered why PyTorch chose spawn.

Fun fact: I all but had a rule in large Ruby apps to never call fork in the middle of a running job, ever.

fork() creates a child that shares the parent’s memory via copy-on-write. This is what Redis uses for its BGSAVE operation, and it works beautifully there. But it turns out CUDA makes fork unsafe. When you import PyTorch and use a GPU, the runtime registers cleanup handlers (atexit hooks, C++ destructors) that call into the CUDA driver. After fork, the child inherits these handlers. When the child exits, the handlers fire against an invalid CUDA context and segfault. Beyond CUDA, fork in a multi-threaded process (and training processes always have multiple threads for NCCL (NVIDIA Collective Communications Library) communication and event polling) can deadlock because the child inherits the parent’s lock states but only one thread.

The spawn start method creates a completely new Python process with a fresh interpreter (using fork()+exec() under the hood on Unix). No inherited CUDA state, no inherited locks, no inherited threads. It is safe but requires explicit data transfer between parent and child. PyTorch uses multiprocessing.get_context('spawn') to create the checkpoint daemon process and passes the staged tensors through shared memory.
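
The overall pattern, in miniature (the general shape only, not PyTorch’s implementation; paths and payloads are placeholders):

import multiprocessing as mp

def checkpoint_writer(jobs):
    # Long-lived worker: fresh interpreter, its own GIL, no inherited CUDA state.
    while True:
        job = jobs.get()
        if job is None:                # sentinel: shut down
            break
        path, payload = job
        with open(path, "wb") as f:    # stand-in for real serialization + write
            f.write(payload)

if __name__ == "__main__":
    ctx = mp.get_context("spawn")      # never fork after CUDA is initialized
    jobs = ctx.Queue()
    daemon = ctx.Process(target=checkpoint_writer, args=(jobs,), daemon=True)
    daemon.start()                     # one-time startup cost, reused afterwards

    jobs.put(("/tmp/step_100.bin", b"...checkpoint bytes..."))
    jobs.put(None)
    daemon.join()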

The benchmark

I was curious to see the differences, so I ran five configurations (a no-checkpoint baseline plus four checkpoint modes) on an NVIDIA H100 80GB with PyTorch 2.7 nightly. The model is Llama 1.3B (a LlamaForCausalLM from HuggingFace Transformers with 24 layers, hidden size 2048, and 16 attention heads), trained with FSDP in bfloat16 with an AdamW optimizer. The checkpoint is 7.52 GB. Each mode runs 25 training steps with a checkpoint every 5 steps.
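
The per-step timing is collected with a loop roughly of this shape (a simplified sketch; model, optimizer, batches, and save_fn stand in for the real harness, and the batches are HuggingFace-style dicts whose forward pass returns a .loss):

import time
import torch

def run_steps(model, optimizer, batches, save_fn=None, checkpoint_every=5):
    step_times = []
    for step, batch in enumerate(batches, start=1):
        torch.cuda.synchronize()
        t0 = time.perf_counter()

        optimizer.zero_grad()
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()

        if save_fn is not None and step % checkpoint_every == 0:
            save_fn(step)              # sync save, thread async, or process async

        torch.cuda.synchronize()       # include queued GPU work in the measurement
        step_times.append(time.perf_counter() - t0)
    return step_times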

Here is what each mode does:

  • Sync: dcp.save() blocks the GPU for both staging and writing to disk. Training stops completely until the checkpoint is saved.
  • Thread async: async_save() with the default AsyncCheckpointerType.THREAD. Stages to CPU, then writes to disk in a background thread that shares the GIL with the training thread.
  • Process async: async_save() with AsyncCheckpointerType.PROCESS. Stages to CPU, then hands off to a separate daemon process with its own GIL.
  • Process async + pinned memory: Same as process async, but with cache_staged_state_dict=True on the FileSystemWriter, which pre-allocates a page-locked CPU buffer for faster GPU to CPU copies.

Mode                            Avg step time   Tokens/sec   Overhead   Avg staging time
Baseline (no checkpoint)        108 ms          18,976       0%         n/a
Sync dcp.save()                 1,878 ms        1,090        +1,640%    n/a
Thread async                    591 ms          3,466        +448%      2,390 ms
Process async                   307 ms          6,668        +185%      869 ms
Process async + pinned memory   280 ms          7,328        +159%      530 ms

A few things stood out to me. The sync mode blocks for about 10 seconds at each checkpoint step, which is rough. Thread-based async reduces the blocking to the staging time (1 to 5 seconds) but then suppresses training throughput for the duration of the background write, which I did not expect going in. Process-based async has similar staging times but recovers to full throughput much faster. And adding pinned memory (cache_staged_state_dict=True on the FileSystemWriter) cuts the staging time roughly in half because the GPU can copy directly into page-locked memory.

The per-step breakdown tells the story more clearly. Here is what happens around a checkpoint step with thread-based async:

step   9:    162 ms  (train=162)                  ← normal-ish (but already elevated)
step  10:  2,299 ms  (train=162  staging=2,137)   ← CHECKPOINT
step  11:    291 ms  (train=291)                  ← GIL contention: 170% above baseline
step  12:    162 ms  (train=162)                  ← still 50% above baseline
step  13:    163 ms  (train=163)                  ← still 50% above baseline
step  14:    164 ms  (train=164)                  ← still elevated

The training step time inflates from 108 ms to about 162 ms for every step while the background thread is active. That is a sustained 50% step-time increase (~33% throughput drop) from GIL contention. And the step immediately after the checkpoint (step 11) is even worse at 291 ms because the background thread is doing its heaviest serialization work.

Compare this with process-based async plus pinned memory, around the second checkpoint (after the daemon is warm):

step   9:    159 ms  (train=159)                  ← slight CPU contention
step  10:    564 ms  (train=158  staging=406)     ← CHECKPOINT (staging only)
step  11:    229 ms  (train=229)                  ← CPU contention, no GIL
step  12:    158 ms  (train=158)                  ← back near baseline
step  13:    159 ms  (train=159)                  ← normal

Recovery to baseline takes one to two steps instead of persisting for the entire background write duration.

The three parameters

After all this digging, the actual code change to go from synchronous checkpointing to the best async configuration turned out to be surprisingly small. Three parameters:

import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict_saver import (
    async_save,
    AsyncCheckpointerType,
)
from torch.distributed.checkpoint.filesystem import FileSystemWriter

# Before: synchronous, blocks for 10+ seconds
dcp.save(state_dict, checkpoint_id="/mnt/checkpoints/step_100")

# After: process-based async with pinned memory staging
writer = FileSystemWriter(
    "/mnt/checkpoints",
    cache_staged_state_dict=True,    # pinned memory for fast staging
)
future = async_save(
    state_dict,
    storage_writer=writer,
    checkpoint_id="/mnt/checkpoints/step_100",
    async_checkpointer_type=AsyncCheckpointerType.PROCESS,
)
# Training continues immediately after staging (~400ms)
# Call future.result() before the next checkpoint

Here is what happens when you call async_save with these parameters:

┌─────────────────────────────────────────────────────────────────────┐
│          Process-based Async Save with Pinned Memory                │
│                                                                     │
│  Training Process (GIL A)          Daemon Process (GIL B)           │
│  ─────────────────────────         ─────────────────────            │
│                                                                     │
│  1. async_save() called                                             │
│     │                                                               │
│  2. Stage tensors to pinned        (daemon already running          │
│     CPU memory via DMA              from previous checkpoint)       │
│     GPU ──► pinned CPU buffer                                       │
│     (~400ms, GPU blocked)                                           │
│     │                                                               │
│  3. Send staged tensors ──────────► 4. Receive via shared memory    │
│     to daemon via shared memory        (no copy, same physical      │
│     (only metadata through pipe)        pages in CPU DRAM)          │
│     │                                                               │
│  5. Return future,               5. Serialize with pickle           │
│     resume training                 │                               │
│     │                            6. Write to NVMe disk              │
│  6. Train at full speed             │                               │
│     (own GIL, no contention)     7. Send completion back            │
│     │                               through pipe                    │
│  7. Before next checkpoint,                                         │
│     call future.result()                                            │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

AsyncCheckpointerType.PROCESS runs the background write in a daemon process with its own GIL. cache_staged_state_dict=True allocates a persistent pinned memory buffer for staging, so the GPU can copy directly into it without going through an intermediate buffer. On subsequent checkpoints, the buffer is reused (no reallocation overhead).

For additional savings at multi-GPU scale, add plan caching to skip the expensive collective metadata operations on repeated checkpoints:

from torch.distributed.checkpoint.default_planner import DefaultSavePlanner

planner = DefaultSavePlanner(enable_plan_caching=True)
future = async_save(
    state_dict,
    storage_writer=writer,
    planner=planner,
    checkpoint_id="/mnt/checkpoints/step_100",
    async_checkpointer_type=AsyncCheckpointerType.PROCESS,
)

I did not see much difference from plan caching (PR #147343) on a single GPU, which makes sense because the save plan involves collectives across all ranks and there is only one rank in my setup. At 1,856 GPUs, though, Meta benchmarked this for Llama 3 70B training, and the combination of process-based execution and cached plans reduced background processing from about 436 seconds to 67 seconds. So the multi-GPU optimizations clearly matter at scale.

Parting notes

Even with process-based async and pinned memory, the overhead is not zero. My benchmark showed 159% overhead on average (280ms per step vs 108ms baseline). The checkpoint steps themselves take about 560ms, of which roughly 400ms is staging where the GPU is genuinely blocked. The non-checkpoint steps during background writes run at about 160 to 230ms instead of the normal 108ms. I suspect the remaining overhead comes from CPU contention, where the background process doing serialization and disk I/O competes for cores that the training process needs to dispatch CUDA kernels, or perhaps from overhead in Python and PyTorch themselves. I am not entirely sure about the mechanism yet.

For training jobs that checkpoint every 20 minutes, this overhead is negligible (a few hundred milliseconds every 20 minutes). For jobs that checkpoint every 1 to 5 minutes for fast recovery, the overhead adds up, and further optimizations probably become worth pursuing.

Overall, I think using a separate process is a nice trick to get around the GIL issue.

The PyTorch team at Meta published a good overview of their zero-overhead checkpointing prototype, which overlaps the staging copy with the forward and backward pass and brings total overhead under 0.5 seconds, which is impressive. I found their 6x faster async checkpointing blog post an interesting read as well.