A deep dive into KEP-1287, from Linux cgroups to production reality
It’s 2:47 AM. Peak traffic. Your stateful PostgreSQL Pod hits its memory limit. Latency jumps from 50ms to 2 seconds. GC pauses stretch. PagerDuty alerts pile up.
Before Kubernetes v1.35, your options were limited:
Option 1: Restart the Pod
- Drop all active connections
- Lose query cache
- Trigger connection pool exhaustion
- Wait 30+ seconds for the database to warm up
Option 2: Scale horizontally
- Add replicas even though memory pressure doesn’t distribute evenly
- Risk replication lag
- Complicate your topology for a vertical problem
Option 3: Overprovision from day one
- Request 8GB when you need 2GB
- Waste 75% of memory allocation
- Make node bin-packing inefficient
- Inflate cloud costs
None of these were ideal.
Vertical scaling was destructive by nature. This was not a flaw in Kubernetes; it reflected a deeper challenge. Changing resource limits on live processes is difficult. Kubernetes chose immutability over operational complexity, shaping how we tuned JVMs, planned capacity, and accepted waste.
With in-place Pod resizing in v1.35, that changes. Operationally, you can adjust resources on the fly without restarts.
Why Vertical Scaling Was Architecturally Difficult
The Kubernetes Design Choice
Kubernetes was built on a principle: Pods are immutable.
- Resource requests and limits live in the PodSpec
- Changing the spec means creating a new Pod
- The Pod UID changes, signaling a completely new workload
- Kubelet would delete and recreate containers on resource changes
This approach was intentional. It simplified reasoning about state:
- Scheduler decisions are made once at creation time
- Controllers treat Pods as cattle, not pets
- Failure recovery is built on Pod replacement
However, it created an operational gap. Real workloads sometimes need vertical adjustments without downtime.
The Linux Kernel Reality
Even if Kubernetes had tried to support live resizing earlier, Linux made it difficult.
cgroups v1 limitations:
- Fragmented controller hierarchy (cpu, memory, io separate)
- Inconsistent behavior across controllers
- Race conditions when updating limits
- Unpredictable OOM killer behavior on memory decreases
Container runtime challenges:
- runc and containerd set cgroup limits at container start
- No CRI API existed to update running containers
- Behavior varied across Docker, containerd, and CRI-O
- Failure modes were undefined
The fundamental question: Can you safely update /sys/fs/cgroup/memory/memory.limit_in_bytes on a running process without triggering an OOM kill?
The answer was: “Sometimes, but we don’t know when.”
So Kubernetes chose safety. For nearly a decade, vertical scaling meant Pod replacement.
The Breakthrough: KEP-1287 Explained
Kubernetes Enhancement Proposal 1287 was proposed in 2018. It took seven years to reach General Availability in v1.35 (December 2025).
What Changed
The feature: CPU and memory requests and limits can now change on running Pods.
The guarantee: Pod UID remains unchanged. From Kubernetes’ perspective, it’s the same workload.
The implementation: Containers may or may not restart, depending on configuration.
What This Actually Means
# Before: This required deleting and recreating the Pod
apiVersion: v1
kind: Pod
metadata:
  name: database
spec:
  containers:
  - name: postgres
    resources:
      requests:
        memory: "2Gi"
        cpu: "1"
      limits:
        memory: "4Gi"
        cpu: "2"

# After: You can patch this directly
kubectl patch pod database --type='json' -p='[
  {
    "op": "replace",
    "path": "/spec/containers/0/resources/limits/memory",
    "value": "6Gi"
  }
]'
# Pod UID stays the same
# Connections stay alive
# State remains intact
This is a fundamental shift in how Kubernetes manages running workloads.
Architecture: Control Plane Flow
Diagram 1: Resize Request Flow
Key Architectural Points
The Scheduler is NOT involved
- This is not a rescheduling operation
- Kubelet validates node capacity locally
- If insufficient capacity exists, the resize is marked Infeasible
State is tracked separately
status:
  containerStatuses:
  - name: postgres
    allocatedResources:   # Currently applied to container
      cpu: "1"
      memory: "2Gi"
    resources:            # What was requested in spec
      limits:
        cpu: "2"
        memory: "4Gi"
    resize: "InProgress"  # Current resize state

Resize state machine:
- Proposed → Kubelet hasn’t processed yet
- InProgress → Runtime is applying changes
- Deferred → Node lacks capacity, will retry
- Infeasible → Cannot resize (e.g., decrease below current usage)
This separation between allocatedResources and spec.resources is critical. It allows kubelet to crash mid-resize and recover cleanly.
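The state machine above can be sketched as a small decision function. This is an illustrative Python sketch, not kubelet source; the names (`decide_resize`, `NodeCapacity`) and the exact branching are assumptions based on the states described in this article.

```python
# Illustrative sketch of the kubelet's resize decision states described above.
# Names and thresholds are hypothetical, not taken from kubelet source.

from dataclasses import dataclass

@dataclass
class NodeCapacity:
    allocatable_mem: int   # total bytes the node can hand out
    allocated_mem: int     # bytes already promised to Pods

def decide_resize(current_mem: int, desired_mem: int,
                  in_use_mem: int, node: NodeCapacity) -> str:
    """Return the resize state the Pod status would report."""
    if desired_mem == current_mem:
        return "NoChange"
    if desired_mem < in_use_mem:
        # Shrinking below live usage would invite an OOM kill
        return "Infeasible"
    if desired_mem > node.allocatable_mem:
        # Can never fit on this node
        return "Infeasible"
    delta = desired_mem - current_mem
    free = node.allocatable_mem - node.allocated_mem
    if delta > free:
        # Not enough room right now; kubelet retries later
        return "Deferred"
    return "InProgress"

node = NodeCapacity(allocatable_mem=32 << 30, allocated_mem=28 << 30)
print(decide_resize(2 << 30, 6 << 30, 1 << 30, node))  # InProgress: +4 GiB fits
print(decide_resize(2 << 30, 8 << 30, 1 << 30, node))  # Deferred: +6 GiB > 4 GiB free
```

Because the decision depends only on locally known numbers, the kubelet can recompute it from `allocatedResources` and the spec after a crash.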
Node-Level Mechanics: What Actually Happens
The cgroup v2 Foundation
In-place resizing fundamentally requires cgroup v2. Here’s why:
cgroup v1 problems:
# cgroup v1 had separate hierarchies
/sys/fs/cgroup/cpu/kubepods/pod123/container456/
/sys/fs/cgroup/memory/kubepods/pod123/container456/
/sys/fs/cgroup/cpuset/kubepods/pod123/container456/

# Coordinating updates across these was error-prone
cgroup v2 solution:
# Unified hierarchy
/sys/fs/cgroup/kubepods.slice/pod123.slice/container456.scope/
├── cpu.max
├── memory.max
├── memory.current
└── io.max

# All controllers in one place, consistent semantics
Diagram 2: CPU vs Memory Resize Differences
What Kubelet and containerd Execute
Step 1: Kubelet detects the change
// Simplified from kubelet source
func (m *kubeGenericRuntimeManager) updateContainerResources(
    pod *v1.Pod,
    container *v1.Container,
    containerID string,
) error {
    // Compare desired vs current
    desiredResources := container.Resources
    currentResources := m.getContainerResources(containerID)
    if desiredResources.Equals(currentResources) {
        return nil // No change needed
    }

    // Call CRI API
    return m.runtimeService.UpdateContainerResources(
        containerID,
        &runtimeapi.LinuxContainerResources{
            CpuPeriod:          desiredResources.CpuPeriod,
            CpuQuota:           desiredResources.CpuQuota,
            MemoryLimitInBytes: desiredResources.MemoryLimit,
        },
    )
}
Step 2: containerd updates cgroups
// Simplified from containerd source
func (c *controller) Update(resources *specs.LinuxResources) error {
    // Update CPU
    if resources.CPU != nil {
        if err := c.writeCgroup("cpu.max",
            fmt.Sprintf("%d %d", resources.CPU.Quota, resources.CPU.Period)); err != nil {
            return err
        }
    }

    // Update memory
    if resources.Memory != nil && resources.Memory.Limit != nil {
        if err := c.writeCgroup("memory.max",
            strconv.FormatInt(*resources.Memory.Limit, 10)); err != nil {
            return err
        }
    }
    return nil
}
Step 3: The actual kernel interaction
# What actually happens on the node
# CPU update
echo "200000 100000" > /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cri-containerd-<id>.scope/cpu.max

# Memory update
echo "2147483648" > /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cri-containerd-<id>.scope/memory.max
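The two writes above can be replayed against a scratch directory instead of a live `/sys/fs/cgroup`, which makes the mechanics easy to inspect. A minimal sketch; the paths and the `write_cgroup` helper are illustrative, not containerd code.

```python
# Sketch of the two cgroup v2 writes shown above, replayed against a
# throwaway directory instead of /sys/fs/cgroup (paths are illustrative).

import tempfile
from pathlib import Path

def write_cgroup(scope: Path, name: str, value: str) -> None:
    # Mirrors `echo "<value>" > <scope>/<name>`
    (scope / name).write_text(value + "\n")

scope = Path(tempfile.mkdtemp()) / "cri-containerd-abc.scope"
scope.mkdir(parents=True)

# CPU: quota=200000us per period=100000us -> up to 2 full CPUs
write_cgroup(scope, "cpu.max", "200000 100000")

# Memory: hard limit of 2 GiB
write_cgroup(scope, "memory.max", str(2 * 1024**3))

print((scope / "cpu.max").read_text().strip())     # 200000 100000
print((scope / "memory.max").read_text().strip())  # 2147483648
```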
Why This Is Safe for CPU, Risky for Memory
CPU Resizing (Safe)
- CPU limits are implemented through the Completely Fair Scheduler (CFS) in Linux.
- cpu.max in cgroup v2 sets the quota for CPU time slices.
- If a process exceeds its quota, the kernel throttles it, slowing execution.
- Increasing quota simply allows the process more CPU time.
- No process is killed, so live resizing is safe.
This is why you can safely adjust CPU limits on running Pods.
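The CFS bandwidth arithmetic behind `cpu.max` is simple enough to show directly. A small worked example; the helper names are illustrative, not kernel APIs.

```python
# cpu.max = "<quota_us> <period_us>": the quota/period ratio caps how many
# CPUs a cgroup can consume. Helper names here are illustrative only.

def max_cpus(quota_us: int, period_us: int) -> float:
    """How many full CPUs a cgroup may use under CFS bandwidth control."""
    return quota_us / period_us

def is_throttled(used_us: int, quota_us: int) -> bool:
    """Within one period, the kernel throttles once the quota is exhausted."""
    return used_us >= quota_us

# cpu.max = "200000 100000" -> 2 CPUs
print(max_cpus(200_000, 100_000))      # 2.0

# Raising the quota live just grants more runtime; nothing is killed
print(max_cpus(400_000, 100_000))      # 4.0
print(is_throttled(250_000, 200_000))  # True: work beyond quota waits
```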
Memory Resizing (Risky)
- Memory limits are enforced via memory.max in cgroup v2.
- If a process tries to allocate more than the limit, Linux must reclaim memory.
- If reclaim fails, the OOM killer terminates the process.
- Unlike CPU, decreasing the limit doesn’t force the process to free memory.
- Applications like JVMs, databases, or caches keep memory resident; they don’t shrink automatically.
This is why decreasing memory limits can be dangerous unless the application is designed to release memory on demand, or you restart the container with new heap settings.
This is why the resizePolicy field exists:
spec:
  containers:
  - name: app
    resources:
      requests:
        memory: "2Gi"
        cpu: "1"
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired      # Safe to change live
    - resourceName: memory
      restartPolicy: RestartContainer # Restart if memory changes

Production Details Operators Must Know
QoS Class Behavior
Critical: QoS class does NOT change during resize.
# Pod starts as Burstable (requests < limits)
resources:
  requests:
    memory: "1Gi"
  limits:
    memory: "2Gi"

# After resize, still Burstable even if requests = limits
# Eviction priority stays medium; OOM score stays medium
resources:
  requests:
    memory: "2Gi"
  limits:
    memory: "2Gi"
Why this matters in production
If node memory pressure happens:
- Guaranteed Pods survive longest
- Burstable Pods are killed next
- BestEffort Pods die first
Your resized Pod still behaves like Burstable even though it looks Guaranteed in YAML.
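The QoS rules are easy to codify, which makes the trap visible. A simplified sketch of the classification rules; real kubelet logic also handles multiple containers and per-resource cases.

```python
# Simplified sketch of Kubernetes QoS classification. The class is computed
# once at Pod creation; resize does NOT rerun this logic.

def qos_class(requests: dict, limits: dict) -> str:
    if not requests and not limits:
        return "BestEffort"
    if requests and requests == limits:
        return "Guaranteed"
    return "Burstable"

# At creation: requests < limits -> Burstable
print(qos_class({"memory": "1Gi"}, {"memory": "2Gi"}))  # Burstable

# After the resize, the YAML *looks* Guaranteed...
print(qos_class({"memory": "2Gi"}, {"memory": "2Gi"}))  # Guaranteed
# ...but Kubernetes keeps the class computed at creation, so the Pod is
# still treated as Burstable for eviction and OOM scoring.
```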
Interaction with Vertical Pod Autoscaler (VPA)
Before in-place resizing
VPA behavior:
- Recommend new resources
- Delete Pod
- Recreate Pod with new values
Side effects:
- Pod UID changes
- Connections drop
- Cache lost
- Stateful workloads suffer
And because the Pod was new, its QoS class was recalculated each time.
VPA can now apply recommendations without recreating Pods:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: database-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: database
  updatePolicy:
    updateMode: "InPlace"  # New mode in VPA

Before: VPA would set updateMode: Auto, which deleted Pods.
Now: VPA patches resources directly, preserving Pod identity.
The dangerous part
VPA is statistical.
It does not know:
- JVM heap size
- Database cache behavior
- Native memory overhead
So VPA might:
- Shrink memory below actual in-use memory
- Trigger OOM kill immediately
This is why in-place VPA without guardrails is risky for memory.
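One possible guardrail is to clamp the recommendation so it never drops below observed usage plus a safety margin. An illustrative helper, not part of VPA itself; the 25% margin is an arbitrary example value.

```python
# Hypothetical guardrail for in-place VPA memory recommendations: never
# recommend less than observed usage plus a safety margin.

def clamp_recommendation(recommended: int, observed_usage: int,
                         margin: float = 0.25) -> int:
    """Floor the recommendation at usage * (1 + margin)."""
    floor = int(observed_usage * (1 + margin))
    return max(recommended, floor)

GiB = 1024**3

# VPA's percentile math suggests 1.5 GiB, but the JVM holds 2 GiB resident:
# the floor (2.5 GiB) wins, avoiding an immediate OOM kill
print(clamp_recommendation(int(1.5 * GiB), 2 * GiB))  # 2684354560

# A recommendation above the floor passes through unchanged
print(clamp_recommendation(3 * GiB, 2 * GiB))         # 3221225472
```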
The Infeasibility Problem
# Current node capacity
Node: 32 GB total, 28 GB allocated, 4 GB free

# Pod resize request
Current: 2 GB
Desired: 8 GB
Increase: +6 GB
# Result: Infeasible (only 4 GB free)
What Kubernetes does:
- It does not reschedule the Pod
- It does not evict other Pods
- It does not partially apply the resize
status:
  resize: Infeasible
  conditions:
  - type: PodResizeFailed
    status: "True"
    reason: InsufficientCapacity
    message: "Node lacks 6GB for memory increase"

The Pod continues running with old limits.
Important points:
- No restart
- No silent failure
Silent partial resizes would be worse.
What you must do:
- Monitor for the Infeasible state
- Either add node capacity or shrink other workloads
- Retry the resize when capacity becomes available
If you ignore this state, you are running with a false sense of safety.
Observability Gaps
Metrics lag behind reality:
# Prometheus metrics update on scrape intervals (15-60s)
container_memory_usage_bytes{pod="database"}

# But the cgroup limit changed immediately
# Your alerts may fire incorrectly during this window
During that window:
- Usage looks high relative to old limit
- Alerts fire
- Nothing is actually wrong
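The false-positive window comes down to which limit the utilization is computed against. A quick illustration using this article's numbers (the helper is illustrative, not a Prometheus API):

```python
# Utilization against the stale (scraped) limit crosses the alert threshold
# even though the cgroup already enforces the new, larger limit.

def utilization(usage: int, limit: int) -> float:
    return usage / limit

GiB = 1024**3
usage = int(1.8 * GiB)

stale_limit = 2 * GiB   # what the last Prometheus scrape still reports
fresh_limit = 4 * GiB   # what the cgroup actually enforces now

print(round(utilization(usage, stale_limit), 2))  # 0.9  -> alert fires
print(round(utilization(usage, fresh_limit), 2))  # 0.45 -> nothing is wrong
```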
Alert: MemoryUtilizationHigh
Current: 1.8 GB used / 2 GB limit = 90% (triggers alert)
Reality: Limit just increased to 4 GB, usage is 45%

Mitigation:
# Add resize state to alerts
- alert: MemoryUtilizationHigh
  expr: |
    container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
    and
    kube_pod_status_resize == 0  # Not currently resizing

Complete State Machine
Operational Guidelines from Production Experience
1. Never Shrink JVM Memory Without Heap Tuning
The problem:
# Container has 4 GB limit, JVM configured with -Xmx3G
resources:
  limits:
    memory: "4Gi"

# Operator decreases to 2 GB
# JVM heap is still 3 GB
# OOM kill is now inevitable
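The mismatch above can be caught before the resize is applied. A hypothetical pre-flight check, sketched in Python: it parses the `-Xmx` flag and refuses a limit that leaves no room for heap plus native overhead (the 25% overhead figure is an assumption for illustration; parsing covers only the simple `-Xmx<n>[gmk]` forms).

```python
# Hypothetical pre-flight check: does the JVM heap (plus native overhead
# for metaspace, thread stacks, etc.) still fit under a proposed limit?

import re

UNITS = {"g": 1024**3, "m": 1024**2, "k": 1024}

def parse_xmx(jvm_args: str) -> int:
    m = re.search(r"-Xmx(\d+)([gGmMkK])", jvm_args)
    if not m:
        raise ValueError("no -Xmx flag found")
    return int(m.group(1)) * UNITS[m.group(2).lower()]

def limit_is_safe(limit_bytes: int, jvm_args: str,
                  native_overhead: float = 0.25) -> bool:
    """Heap plus ~25% native overhead must fit under the container limit."""
    heap = parse_xmx(jvm_args)
    return heap * (1 + native_overhead) <= limit_bytes

GiB = 1024**3
print(limit_is_safe(4 * GiB, "-Xms1G -Xmx3G"))  # True: ~3.75 GiB fits in 4
print(limit_is_safe(2 * GiB, "-Xms1G -Xmx3G"))  # False: OOM kill waiting
```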
The fix:
# If resizing down, restart the container with new heap settings
kubectl set resources deployment app --limits=memory=2Gi
# AND update JVM args: -Xmx1536M

Or use RestartContainer policy:
resizePolicy:
- resourceName: memory
  restartPolicy: RestartContainer

2. Prefer CPU Resizing Over Memory in Live Systems
CPU is safe to adjust dynamically:
# Low latency? Add more CPU
kubectl patch pod api --type=json -p='[{
  "op": "replace",
  "path": "/spec/containers/0/resources/limits/cpu",
  "value": "4"
}]'

# No restart, immediate effect
Memory requires careful analysis:
- Is the app memory-elastic (cache vs essential data)?
- Does it respect system memory signals?
- Can it release memory on demand?
3. Gate Memory Decreases with Manual Approval
Implement a policy controller:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: memory-decrease-gate
webhooks:
- name: validate.resize.example.com
  rules:
  - operations: ["UPDATE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
  clientConfig:
    service:
      name: resize-validator
      namespace: kube-system

Logic:
if new_memory < current_memory:
    if pod.annotations.get("approved-by-sre") != "true":
        return reject("Memory decrease requires SRE approval")

4. Test Under Load in Staging
Load test scenario:
# 1. Start workload at baseline
kubectl apply -f baseline-load.yaml

# 2. Generate 70% load
hey -c 100 -z 5m http://service
# 3. Resize during load
kubectl patch pod service --patch ...
# 4. Monitor for:
# - Latency spikes
# - Error rate changes
# - Memory usage patterns
# - CPU throttling
# 5. Repeat for decreases
5. Alert on Lingering Resize States
# Prometheus alert
- alert: PodResizeStuck
  expr: |
    kube_pod_status_resize{state=~"InProgress|Proposed"} > 0
  for: 5m
  annotations:
    summary: "Pod {{ $labels.pod }} stuck in resize state"
    description: "Resize has not completed in 5 minutes"

- alert: PodResizeInfeasible
  expr: |
    kube_pod_status_resize{state="Infeasible"} > 0
  annotations:
    summary: "Pod {{ $labels.pod }} cannot resize"
    description: "Node lacks capacity for resize request"
Why It Took Seven Years
KEP-1287 was proposed in 2018. It reached GA in 2025. Why?
Technical Dependencies
Year 1–2 (2018–2019): CRI API Design
- Container Runtime Interface needed new RPCs
- UpdateContainerResources had to be standardized
- runc, containerd, CRI-O all needed alignment
Year 3–4 (2020–2021): Runtime Implementation
- containerd 1.x couldn’t support it reliably
- CRI-O needed refactoring
- cgroup v2 adoption was still minimal
Year 5–6 (2022–2023): Failure Mode Discovery
- What happens if kubelet crashes mid-resize?
- What if the container dies during update?
- What if node runs out of capacity?
- Each edge case required state machine changes
Year 7 (2024): Production Validation
- Alpha (v1.27): Limited testing in opt-in clusters
- Beta (v1.29): Wider deployment, bug discovery
- Operators needed to trust it before GA
The Cultural Challenge
Kubernetes built a reputation on immutability. Operators designed around it:
- CI/CD pipelines that always replace Pods
- Monitoring that watches for Pod UID changes
- Disaster recovery that depends on Pod recreation
Changing this required trust, not just code. The feature had to be:
- Reliable enough for databases
- Safe enough for production
- Clear enough for operators to understand failure modes
That takes time.
The Lesson
Reliable infrastructure features mature slowly. This is good.
Closing Thoughts: This Is an Operational Change, Not a Convenience
In-place Pod resizing is not a feature to celebrate and forget. It’s a fundamental shift in how Kubernetes handles running workloads.
It requires operators to:
- Understand memory vs CPU semantics at the kernel level
- Monitor resize state transitions
- Test under load before production use
- Treat memory decreases as high-risk operations
It enables teams to:
- Optimize costs without sacrificing reliability
- Scale vertically without losing state
- Treat infrastructure as more transparent and controllable
Further Reading
Written by Dinidu Sachintha, Site Reliability Engineer Intern @ WSO2 Sri Lanka December 2025