How Kubernetes Learned to Resize Pods Without Restarting Them


A deep dive into KEP-1287, from Linux cgroups to production reality

Dinidu Sachintha


kubernetes.io/blog/2025/12/17/kubernetes-v1-35-release/

It’s 2:47 AM. Peak traffic. Your stateful PostgreSQL Pod hits its memory limit. Latency jumps from 50ms to 2 seconds. GC pauses stretch. PagerDuty alerts pile up.

Before Kubernetes v1.35, your options were limited:

Option 1: Restart the Pod

  • Drop all active connections
  • Lose query cache
  • Trigger connection pool exhaustion
  • Wait 30+ seconds for the database to warm up

Option 2: Scale horizontally

  • Add replicas even though memory pressure doesn’t distribute evenly
  • Risk replication lag
  • Complicate your topology for a vertical problem

Option 3: Overprovision from day one

  • Request 8GB when you need 2GB
  • Waste 75% of memory allocation
  • Make node bin-packing inefficient
  • Inflate cloud costs

None of these were ideal.

Vertical scaling was destructive by nature. This was not a flaw in Kubernetes; it reflected a deeper challenge. Changing resource limits on live processes is difficult. Kubernetes chose immutability over operational complexity, shaping how we tuned JVMs, planned capacity, and accepted waste.

With in-place Pod resizing in v1.35, that changes. Operationally, you can adjust resources on the fly without restarts.

Why Vertical Scaling Was Architecturally Difficult

The Kubernetes Design Choice

Kubernetes was built on a principle: Pods are immutable.

  • Resource requests and limits live in the PodSpec
  • Changing the spec means creating a new Pod
  • The Pod UID changes, signaling a completely new workload
  • Kubelet would delete and recreate containers on resource changes

This approach was intentional. It simplified reasoning about state:

  • Scheduler decisions are made once at creation time
  • Controllers treat Pods as cattle, not pets
  • Failure recovery is built on Pod replacement

However, it created an operational gap. Real workloads sometimes need vertical adjustments without downtime.

The Linux Kernel Reality

Even if Kubernetes had tried to support live resizing earlier, Linux made it difficult.

cgroups v1 limitations:

  • Fragmented controller hierarchy (cpu, memory, io separate)
  • Inconsistent behavior across controllers
  • Race conditions when updating limits
  • Unpredictable OOM killer behavior on memory decreases

Container runtime challenges:

  • runc and containerd set cgroup limits at container start
  • No CRI API existed to update running containers
  • Behavior varied across Docker, containerd, and CRI-O
  • Failure modes were undefined

The fundamental question: Can you safely update /sys/fs/cgroup/memory/memory.limit_in_bytes on a running process without triggering an OOM kill?

The answer was: “Sometimes, but we don’t know when.”

So Kubernetes chose safety. For nearly a decade, vertical scaling meant Pod replacement.
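That uncertainty can be made concrete with a toy risk check. This is an illustration, not kubelet code: the `memory_decrease_risk` helper and its thresholds are invented for this example.

```python
def memory_decrease_risk(current_usage_bytes: int, new_limit_bytes: int,
                         headroom: float = 0.1) -> str:
    """Classify the risk of lowering memory.max on a live cgroup.

    A new limit below current usage forces the kernel to reclaim pages;
    if reclaim fails, the OOM killer fires. A limit only slightly above
    usage leaves no room for allocation spikes, so we demand headroom.
    """
    if new_limit_bytes < current_usage_bytes:
        return "oom-likely"
    if new_limit_bytes < current_usage_bytes * (1 + headroom):
        return "risky"
    return "probably-safe"
```

Shrinking a cgroup that currently uses 3 GiB down to a 2 GiB limit classifies as "oom-likely"; the kernel simply has no guarantee it can reclaim that much.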

The Breakthrough: KEP-1287 Explained

Kubernetes Enhancement Proposal 1287 was proposed in 2018. It took seven years to reach General Availability in v1.35 (December 2025).

What Changed

The feature: CPU and memory requests and limits can now change on running Pods.

The guarantee: Pod UID remains unchanged. From Kubernetes’ perspective, it’s the same workload.

The implementation: Containers may or may not restart, depending on configuration.

What This Actually Means

# Before: this required deleting and recreating the Pod
apiVersion: v1
kind: Pod
metadata:
  name: database
spec:
  containers:
  - name: postgres
    resources:
      requests:
        memory: "2Gi"
        cpu: "1"
      limits:
        memory: "4Gi"
        cpu: "2"

# After: you can patch it directly through the resize subresource
kubectl patch pod database --subresource=resize --type='json' -p='[
  {
    "op": "replace",
    "path": "/spec/containers/0/resources/limits/memory",
    "value": "6Gi"
  }
]'

# Pod UID stays the same
# Connections stay alive
# State remains intact

This is a fundamental shift in how Kubernetes manages running workloads.

Architecture: Control Plane Flow

Diagram 1: Resize Request Flow


Diagram 1: Resize Request Flow : Shows the path from kubectl → API Server → Kubelet → Container Runtime, emphasizing that Scheduler is NOT involved

Key Architectural Points

The Scheduler is NOT involved

  • This is not a rescheduling operation
  • Kubelet validates node capacity locally
  • If insufficient capacity exists, resize is marked Infeasible

State is tracked separately

status:
  containerStatuses:
  - name: postgres
    allocatedResources:   # Currently applied to the container
      cpu: "1"
      memory: "2Gi"
    resources:            # What was requested in the spec
      limits:
        cpu: "2"
        memory: "4Gi"
  resize: "InProgress"    # Current resize state

Resize state machine:

  1. Proposed → Kubelet hasn’t processed it yet
  2. InProgress → Runtime is applying changes
  3. Deferred → Node lacks capacity right now; will retry
  4. Infeasible → Can never be satisfied (e.g., request exceeds node allocatable)

This separation between allocatedResources and spec.resources is critical. It allows kubelet to crash mid-resize and recover cleanly.
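A minimal sketch of why that separation helps: with both values persisted, the remaining work is a pure function of the two. The `Resources` type and `pending_changes` helper here are invented for illustration, not kubelet types.

```python
from dataclasses import dataclass

@dataclass
class Resources:
    cpu_millicores: int
    memory_bytes: int

def pending_changes(spec: Resources, allocated: Resources) -> list:
    """List the resources the kubelet still has to apply.

    Because allocatedResources is persisted independently of the spec,
    a kubelet restarted mid-resize can recompute this list on startup
    and resume exactly where it left off, instead of guessing.
    """
    changes = []
    if spec.cpu_millicores != allocated.cpu_millicores:
        changes.append("cpu")
    if spec.memory_bytes != allocated.memory_bytes:
        changes.append("memory")
    return changes
```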

Node-Level Mechanics: What Actually Happens

The cgroup v2 Foundation

In-place resizing fundamentally requires cgroup v2. Here’s why:

cgroup v1 problems:

# cgroup v1 had separate hierarchies
/sys/fs/cgroup/cpu/kubepods/pod123/container456/
/sys/fs/cgroup/memory/kubepods/pod123/container456/
/sys/fs/cgroup/cpuset/kubepods/pod123/container456/

# Coordinating updates across these was error-prone

cgroup v2 solution:

# Unified hierarchy
/sys/fs/cgroup/kubepods.slice/pod123.slice/container456.scope/
├── cpu.max
├── memory.max
├── memory.current
└── io.max

# All controllers in one place, consistent semantics
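The unified files also have simple, consistent formats. For instance, cpu.max holds `<quota> <period>` in microseconds, with a literal `max` meaning unlimited. A small parser (a hypothetical helper, not containerd code) shows the convention:

```python
from typing import Optional

def parse_cpu_max(text: str) -> Optional[float]:
    """Parse a cgroup v2 cpu.max line into a CPU limit in cores.

    The file contains "<quota> <period>" in microseconds, or
    "max <period>" when no limit is set.
    """
    quota, period = text.split()
    if quota == "max":
        return None  # unlimited
    return int(quota) / int(period)
```

Here `parse_cpu_max("200000 100000")` yields 2.0, i.e. two full cores.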

Diagram 2: CPU vs Memory Resize Differences


Diagram 2: CPU vs Memory Resize Differences : Critical visual showing why CPU is safe and memory is risky at the kernel level

What Kubelet and containerd Execute

Step 1: Kubelet detects the change

// Simplified from kubelet source
func (m *kubeGenericRuntimeManager) updateContainerResources(
    pod *v1.Pod,
    container *v1.Container,
    containerID string,
) error {
    // Compare desired vs current
    desiredResources := container.Resources
    currentResources := m.getContainerResources(containerID)

    if desiredResources.Equals(currentResources) {
        return nil // No change needed
    }

    // Call CRI API
    return m.runtimeService.UpdateContainerResources(
        containerID,
        &runtimeapi.LinuxContainerResources{
            CpuPeriod:          desiredResources.CpuPeriod,
            CpuQuota:           desiredResources.CpuQuota,
            MemoryLimitInBytes: desiredResources.MemoryLimit,
        },
    )
}

Step 2: containerd updates cgroups

// Simplified from containerd source
func (c *controller) Update(resources *specs.LinuxResources) error {
    // Update CPU
    if resources.CPU != nil {
        if err := c.writeCgroup("cpu.max",
            fmt.Sprintf("%d %d", resources.CPU.Quota, resources.CPU.Period)); err != nil {
            return err
        }
    }

    // Update memory
    if resources.Memory != nil && resources.Memory.Limit != nil {
        if err := c.writeCgroup("memory.max",
            strconv.FormatInt(*resources.Memory.Limit, 10)); err != nil {
            return err
        }
    }

    return nil
}

Step 3: The actual kernel interaction

# What actually happens on the node
# CPU update
echo "200000 100000" > /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cri-containerd-<id>.scope/cpu.max

# Memory update
echo "2147483648" > /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cri-containerd-<id>.scope/memory.max
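The values being echoed follow directly from the Pod's resource quantities. A sketch of the conversion (helper names invented; the real kubelet drives this through the CRI, not shell):

```python
def cpu_limit_to_cpu_max(cores: float, period_us: int = 100_000) -> str:
    """Render a CPU limit in cores as a cgroup v2 cpu.max value.

    With the default 100ms period, a 2-core limit becomes
    "200000 100000": 200ms of CPU time per 100ms window.
    """
    return f"{int(cores * period_us)} {period_us}"

def memory_gib_to_memory_max(gib: float) -> str:
    """Render a memory limit in GiB as a memory.max byte count."""
    return str(int(gib * 2**30))
```

`cpu_limit_to_cpu_max(2)` and `memory_gib_to_memory_max(2)` reproduce the two values written above.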

Why This Is Safe for CPU, Risky for Memory

CPU Resizing (Safe)

  • CPU limits are implemented through the Completely Fair Scheduler (CFS) in Linux.
  • cpu.max in cgroup v2 sets the quota for CPU time slices.
  • If a process exceeds its quota, the kernel throttles it, slowing execution.
  • Increasing quota simply allows the process more CPU time.
  • No process is killed, so live resizing is safe.

This is why you can safely adjust CPU limits on running Pods.

Memory Resizing (Risky)

  • Memory limits are enforced via memory.max in cgroup v2.
  • If a process tries to allocate more than the limit, Linux must reclaim memory.
  • If reclaim fails, the OOM killer terminates the process.
  • Unlike CPU, decreasing the limit doesn’t force the process to free memory.
  • Applications like JVMs, databases, or caches keep memory resident; they don’t shrink automatically.

This is why decreasing memory limits can be dangerous unless the application is designed to release memory on demand, or you restart the container with new heap settings.

Kubernetes exposes the resizePolicy field for exactly this reason:

spec:
  containers:
  - name: app
    resources:
      requests:
        memory: "2Gi"
        cpu: "1"
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired      # Safe to change live
    - resourceName: memory
      restartPolicy: RestartContainer # Restart if memory changes
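The per-resource decision can be summarized in a few lines. A sketch assuming the two restartPolicy values shown above; `apply_strategy` and its return strings are invented for illustration, not kubelet code:

```python
def apply_strategy(resource: str, old_limit: int, new_limit: int,
                   restart_policy: str = "NotRequired") -> str:
    """Pick how a limit change is applied, mirroring resizePolicy.

    RestartContainer restarts the container (Pod UID unchanged);
    NotRequired applies the new limit to the live cgroup.
    """
    if old_limit == new_limit:
        return "no-op"
    if restart_policy == "RestartContainer":
        return "restart-container"
    if resource == "memory" and new_limit < old_limit:
        return "apply-live-with-caution"  # reclaim may fail, risking OOM
    return "apply-live"
```

Under this sketch, CPU changes always apply live, while a memory decrease without RestartContainer is flagged as the risky path described above.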

Production Details Operators Must Know

QoS Class Behavior

Critical: QoS class does NOT change during resize.

# Pod starts as Burstable (requests < limits)
resources:
  requests:
    memory: "1Gi"
  limits:
    memory: "2Gi"

# After resize, still Burstable even if requests == limits
# Eviction priority stays medium; OOM score adjustment stays medium

resources:
  requests:
    memory: "2Gi"
  limits:
    memory: "2Gi"

Why this matters in production

If node memory pressure happens:

  • Guaranteed Pods survive longest
  • Burstable Pods are killed next
  • BestEffort Pods die first

Your resized Pod still behaves like Burstable even though it looks Guaranteed in YAML.
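The classification rules themselves are simple; the point is that Kubernetes evaluates them only at admission. A toy single-container version (`qos_class` is an invented helper, not API machinery):

```python
def qos_class(requests: dict, limits: dict) -> str:
    """Compute the QoS class of a single-container pod at creation.

    Guaranteed: cpu and memory each have requests == limits.
    BestEffort: no requests or limits at all.
    Burstable: everything in between.
    After an in-place resize, the class assigned at creation is kept;
    it is NOT recomputed from the new values.
    """
    if not requests and not limits:
        return "BestEffort"
    if all(r in requests and r in limits and requests[r] == limits[r]
           for r in ("cpu", "memory")):
        return "Guaranteed"
    return "Burstable"
```

The resized Pod above would score Guaranteed under these rules, yet it keeps the Burstable class it was admitted with.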

Interaction with Vertical Pod Autoscaler (VPA)

Before in-place resizing

VPA behavior:

  • Recommend new resources
  • Delete Pod
  • Recreate Pod with new values

Side effects:

  • Pod UID changes
  • Connections drop
  • Cache lost
  • Stateful workloads suffer

QoS, however, was recalculated, because the Pod was new.

VPA can now apply recommendations without recreating Pods:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: database-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: database
  updatePolicy:
    updateMode: "InPlace" # New mode in VPA

Before: VPA would set updateMode: Auto, which deleted Pods.

Now: VPA patches resources directly, preserving Pod identity.

The dangerous part

VPA is statistical.
It does not know:

  • JVM heap size
  • Database cache behavior
  • Native memory overhead

So VPA might:

  • Shrink memory below actual in-use memory
  • Trigger OOM kill immediately

This is why in-place VPA without guardrails is risky for memory.

The Infeasibility Problem

# Current node capacity
Node: 32 GB allocatable, 28 GB allocated, 4 GB free

# Pod resize request
Current: 2 GB
Desired: 8 GB
Increase: +6 GB

# Result: Deferred (only 4 GB free right now; retried later)
# A request the node could never satisfy, such as 40 GB,
# would instead be marked Infeasible

What Kubernetes does:

  • It does not reschedule the Pod
  • It does not evict other Pods
  • It does not partially apply the resize

status:
  resize: Deferred
  conditions:
  - type: PodResizePending
    status: "True"
    reason: Deferred
    message: "Node lacks 6GB for the memory increase"

The Pod continues running with old limits.

Important points:

  • No restart
  • No silent failure

Silent partial resizes would be worse.

What you must do:

  • Monitor for Deferred and Infeasible states
  • Either add node capacity or shrink other workloads
  • Retry the resize when capacity becomes available

If you ignore this state, you are running with a false sense of safety.
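The capacity check itself is local arithmetic. Here is a toy classifier following the resize states listed earlier (Deferred for a temporary shortfall, Infeasible for a request the node can never hold); the function and its byte math are invented for illustration:

```python
GIB = 2**30

def classify_memory_resize(node_allocatable: int, node_free: int,
                           current: int, desired: int) -> str:
    """Toy version of the kubelet's local feasibility check."""
    if desired > node_allocatable:
        return "Infeasible"  # could never fit, even on an empty node
    if desired - current > node_free:
        return "Deferred"    # retried when capacity frees up
    return "InProgress"      # apply the new limits now
```

Under this sketch, the 32 GB node with 4 GB free defers a +6 GB increase, while a 40 GB request is infeasible outright.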

Observability Gaps

Metrics lag behind reality:

# Prometheus metrics update on scrape intervals (15-60s)
container_memory_usage_bytes{pod="database"}

# But cgroup limit changed immediately
# Your alerts may fire incorrectly during this window

During that window:

  • Usage looks high relative to the old limit
  • Alerts fire
  • Nothing is actually wrong

Alert: MemoryUtilizationHigh
Current: 1.8 GB used / 2 GB limit = 90% (triggers alert)
Reality: limit just increased to 4 GB, so usage is actually 45%

Mitigation:

# Add resize state to alerts
- alert: MemoryUtilizationHigh
  expr: |
    (container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9)
    and on (pod)
    (kube_pod_status_resize == 0)  # not currently resizing

Complete State Machine


Diagram 3: Complete State Machine : Shows Proposed → InProgress → Deferred/Infeasible → Complete flow
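The transitions in Diagram 3 can be captured as a small table. This is an illustrative encoding of the flow described above, not kubelet code; the real implementation tracks these states through Pod status conditions.

```python
# Allowed transitions, following the Proposed → InProgress →
# Deferred/Infeasible → Complete flow shown in Diagram 3.
TRANSITIONS = {
    "Proposed":   {"InProgress", "Deferred", "Infeasible"},
    "Deferred":   {"InProgress", "Infeasible"},
    "InProgress": {"Complete"},
    "Infeasible": set(),  # terminal until the spec changes again
    "Complete":   set(),
}

def can_transition(src: str, dst: str) -> bool:
    """Check whether a resize state change is allowed."""
    return dst in TRANSITIONS.get(src, set())
```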

Operational Guidelines from Production Experience

1. Never Shrink JVM Memory Without Heap Tuning

The problem:

# Container has a 4 GB limit, JVM configured with -Xmx3G
resources:
  limits:
    memory: "4Gi"

# Operator decreases the limit to 2 GB
# JVM heap is still 3 GB
# An OOM kill is now inevitable

The fix:

# If resizing down, restart the container with new heap settings
kubectl set resources deployment app --limits=memory=2Gi
# AND update JVM args: -Xmx1536M

Or use RestartContainer policy:

resizePolicy:
- resourceName: memory
  restartPolicy: RestartContainer
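The heap arithmetic is worth automating. A sketch using a common rule of thumb of reserving roughly a quarter of the limit for non-heap memory; the fraction and the helper name are assumptions for this example, not a JVM default:

```python
def max_heap_mib(limit_bytes: int, heap_fraction: float = 0.75) -> int:
    """Suggest a JVM -Xmx value in MiB for a container memory limit.

    The remainder covers metaspace, thread stacks, code cache, and
    other native allocations that live outside the Java heap.
    """
    return int(limit_bytes * heap_fraction) // 2**20
```

For a 2Gi limit this yields 1536 MiB, matching the -Xmx1536M above.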

2. Prefer CPU Resizing Over Memory in Live Systems

CPU is safe to adjust dynamically:

# Low latency? Add more CPU
kubectl patch pod api --subresource=resize --type=json -p='[{
  "op": "replace",
  "path": "/spec/containers/0/resources/limits/cpu",
  "value": "4"
}]'

# No restart, immediate effect

Memory requires careful analysis:

  • Is the app memory-elastic (cache vs essential data)?
  • Does it respect system memory signals?
  • Can it release memory on demand?

3. Gate Memory Decreases with Manual Approval

Implement a policy controller:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: memory-decrease-gate
webhooks:
- name: validate.resize.example.com
  admissionReviewVersions: ["v1"]
  sideEffects: None
  rules:
  - operations: ["UPDATE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
  clientConfig:
    service:
      name: resize-validator
      namespace: kube-system

Logic:

if new_memory < current_memory:
    if pod.annotations.get("approved-by-sre") != "true":
        return reject("Memory decrease requires SRE approval")

4. Test Under Load in Staging

Load test scenario:

# 1. Start workload at baseline
kubectl apply -f baseline-load.yaml

# 2. Generate 70% load
hey -c 100 -z 5m http://service

# 3. Resize during load
kubectl patch pod service --patch ...

# 4. Monitor for:
# - Latency spikes
# - Error rate changes
# - Memory usage patterns
# - CPU throttling

# 5. Repeat for decreases

5. Alert on Lingering Resize States

# Prometheus alerts
- alert: PodResizeStuck
  expr: |
    kube_pod_status_resize{state=~"InProgress|Proposed"} > 0
  for: 5m
  annotations:
    summary: "Pod {{ $labels.pod }} stuck in resize state"
    description: "Resize has not completed in 5 minutes"

- alert: PodResizeInfeasible
  expr: |
    kube_pod_status_resize{state="Infeasible"} > 0
  annotations:
    summary: "Pod {{ $labels.pod }} cannot resize"
    description: "Node lacks capacity for resize request"

Why It Took Seven Years

KEP-1287 was proposed in 2018. It reached GA in 2025. Why?

Technical Dependencies

Year 1–2 (2018–2019): CRI API Design

  • Container Runtime Interface needed new RPCs
  • UpdateContainerResources had to be standardized
  • runc, containerd, CRI-O all needed alignment

Year 3–4 (2020–2021): Runtime Implementation

  • containerd 1.x couldn’t support it reliably
  • CRI-O needed refactoring
  • cgroup v2 adoption was still minimal

Year 5–6 (2022–2023): Failure Mode Discovery

  • What happens if kubelet crashes mid-resize?
  • What if the container dies during update?
  • What if node runs out of capacity?
  • Each edge case required state machine changes

Year 7 (2023–2025): Production Validation

  • Alpha (v1.27): Limited testing in opt-in clusters
  • Beta (v1.33): Wider deployment, bug discovery
  • Operators needed to trust it before GA

The Cultural Challenge

Kubernetes built a reputation on immutability. Operators designed around it:

  • CI/CD pipelines that always replace Pods
  • Monitoring that watches for Pod UID changes
  • Disaster recovery that depends on Pod recreation

Changing this required trust, not just code. The feature had to be:

  • Reliable enough for databases
  • Safe enough for production
  • Clear enough for operators to understand failure modes

That takes time.

The Lesson

Reliable infrastructure features mature slowly. This is good.

Closing Thoughts: This Is an Operational Change, Not a Convenience

In-place Pod resizing is not a feature to celebrate and forget. It’s a fundamental shift in how Kubernetes handles running workloads.

It requires operators to:

  • Understand memory vs CPU semantics at the kernel level
  • Monitor resize state transitions
  • Test under load before production use
  • Treat memory decreases as high-risk operations

It enables teams to:

  • Optimize costs without sacrificing reliability
  • Scale vertically without losing state
  • Treat infrastructure as more transparent and controllable

Further Reading

Written by Dinidu Sachintha, Site Reliability Engineer Intern @ WSO2 Sri Lanka December 2025