A deep dive into KEP-1287, from Linux cgroups to production reality
It’s 2:47 AM. Peak traffic. Your stateful PostgreSQL Pod hits its memory limit. Latency jumps from 50ms to 2 seconds. GC pauses stretch. PagerDuty alerts pile up.
Before Kubernetes v1.35, your options were limited:
Option 1: Restart the Pod
- Drop all active connections
- Lose query cache
- Trigger connection pool exhaustion
- Wait 30+ seconds for the database to warm up
Option 2: Scale horizontally
- Add replicas even though memory pressure doesn’t distribute evenly
- Risk replication lag
- Complicate your topology for a vertical problem
Option 3: Overprovision from day one
- Request 8GB when you need 2GB
- Waste 75% of memory allocation
- Make node bin-packing inefficient
- Inflate cloud costs
None of these were ideal.
Vertical scaling was destructive by nature. This was not a flaw in Kubernetes; it reflected a deeper challenge. Changing resource limits on live processes is difficult. Kubernetes chose immutability over operational complexity, shaping how we tuned JVMs, planned capacity, and accepted waste.
With in-place Pod resizing in v1.35, that changes. Operationally, you can adjust resources on the fly without restarts.
Why Vertical Scaling Was Architecturally Difficult
The Kubernetes Design Choice
Kubernetes was built on a principle: Pods are immutable.
- Resource requests and limits live in the PodSpec
- Changing the spec means creating a new Pod
- The Pod UID changes, signaling a completely new workload
- Kubelet would delete and recreate containers on resource changes
This approach was intentional. It simplified reasoning about state:
- Scheduler decisions are made once at creation time
- Controllers treat Pods as cattle, not pets
- Failure recovery is built on Pod replacement
However, it created an operational gap. Real workloads sometimes need vertical adjustments without downtime.
The Linux Kernel Reality
Even if Kubernetes had tried to support live resizing earlier, Linux made it difficult.
cgroups v1 limitations:
- Fragmented controller hierarchy (cpu, memory, io separate)
- Inconsistent behavior across controllers
- Race conditions when updating limits
- Unpredictable OOM killer behavior on memory decreases
Container runtime challenges:
- runc and containerd set cgroup limits at container start
- No CRI API existed to update running containers
- Behavior varied across Docker, containerd, and CRI-O
- Failure modes were undefined
The fundamental question: Can you safely update /sys/fs/cgroup/memory/memory.limit_in_bytes on a running process without triggering an OOM kill?
The answer was: “Sometimes, but we don’t know when.”
So Kubernetes chose safety. For nearly a decade, vertical scaling meant Pod replacement.
The Breakthrough: KEP-1287 Explained
Kubernetes Enhancement Proposal 1287 was proposed in 2018. It took seven years to reach General Availability in v1.35 (December 2025).
What Changed
The feature: CPU and memory requests and limits can now change on running Pods.
The guarantee: Pod UID remains unchanged. From Kubernetes’ perspective, it’s the same workload.
The implementation: Containers may or may not restart, depending on configuration.
What This Actually Means
# Before: This required deleting and recreating the Pod
apiVersion: v1
kind: Pod
metadata:
  name: database
spec:
  containers:
  - name: postgres
    resources:
      requests:
        memory: "2Gi"
        cpu: "1"
      limits:
        memory: "4Gi"
        cpu: "2"

# After: You can patch this directly
kubectl patch pod database --type='json' -p='[
  {
    "op": "replace",
    "path": "/spec/containers/0/resources/limits/memory",
    "value": "6Gi"
  }
]'
# Pod UID stays the same
# Connections stay alive
# State remains intact
This is a fundamental shift in how Kubernetes manages running workloads.
Architecture: Control Plane Flow
Diagram 1: Resize Request Flow
Key Architectural Points
The Scheduler is NOT involved
- This is not a rescheduling operation
- Kubelet validates node capacity locally
- If insufficient capacity exists, the resize is marked Infeasible
State is tracked separately
status:
  containerStatuses:
  - name: postgres
    allocatedResources:   # Currently applied to container
      cpu: "1"
      memory: "2Gi"
    resources:            # What was requested in spec
      limits:
        cpu: "2"
        memory: "4Gi"
    resize: "InProgress"  # Current resize state

Resize state machine:
- Proposed → Kubelet hasn’t processed yet
- InProgress → Runtime is applying changes
- Deferred → Node lacks capacity, will retry
- Infeasible → Cannot resize (e.g., decrease below current usage)
This separation between allocatedResources and spec.resources is critical. It allows kubelet to crash mid-resize and recover cleanly.
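The state machine above can be sketched as a small decision function. This is an illustrative Python sketch, not kubelet source; the names (`decide_resize`, `NodeCapacity`) and the exact branching are assumptions based on the states described in this article.

```python
# Illustrative sketch of the kubelet's resize decision states described above.
# Names and thresholds are hypothetical, not taken from kubelet source.

from dataclasses import dataclass

@dataclass
class NodeCapacity:
    allocatable_mem: int   # total bytes the node can hand out
    allocated_mem: int     # bytes already promised to Pods

def decide_resize(current_mem: int, desired_mem: int,
                  in_use_mem: int, node: NodeCapacity) -> str:
    """Return the resize state the Pod status would report."""
    if desired_mem == current_mem:
        return "NoChange"
    if desired_mem < in_use_mem:
        # Shrinking below live usage would invite an OOM kill
        return "Infeasible"
    if desired_mem > node.allocatable_mem:
        # Can never fit on this node
        return "Infeasible"
    delta = desired_mem - current_mem
    free = node.allocatable_mem - node.allocated_mem
    if delta > free:
        # Not enough room right now; kubelet retries later
        return "Deferred"
    return "InProgress"

node = NodeCapacity(allocatable_mem=32 << 30, allocated_mem=28 << 30)
print(decide_resize(2 << 30, 6 << 30, 1 << 30, node))  # InProgress: +4 GiB fits
print(decide_resize(2 << 30, 8 << 30, 1 << 30, node))  # Deferred: +6 GiB > 4 GiB free
```

Because the decision depends only on locally known numbers, the kubelet can recompute it from `allocatedResources` and the spec after a crash.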
Node-Level Mechanics: What Actually Happens
The cgroup v2 Foundation
In-place resizing fundamentally requires cgroup v2. Here’s why:
cgroup v1 problems:
# cgroup v1 had separate hierarchies
/sys/fs/cgroup/cpu/kubepods/pod123/container456/
/sys/fs/cgroup/memory/kubepods/pod123/container456/
/sys/fs/cgroup/cpuset/kubepods/pod123/container456/

# Coordinating updates across these was error-prone
cgroup v2 solution:
# Unified hierarchy
/sys/fs/cgroup/kubepods.slice/pod123.slice/container456.scope/
├── cpu.max
├── memory.max
├── memory.current
└── io.max

# All controllers in one place, consistent semantics
Diagram 2: CPU vs Memory Resize Differences
What Kubelet and containerd Execute
Step 1: Kubelet detects the change
// Simplified from kubelet source
func (m *kubeGenericRuntimeManager) updateContainerResources(
    pod *v1.Pod,
    container *v1.Container,
    containerID string,
) error {
    // Compare desired vs current
    desiredResources := container.Resources
    currentResources := m.getContainerResources(containerID)
    if desiredResources.Equals(currentResources) {
        return nil // No change needed
    }

    // Call CRI API
    return m.runtimeService.UpdateContainerResources(
        containerID,
        &runtimeapi.LinuxContainerResources{
            CpuPeriod:          desiredResources.CpuPeriod,
            CpuQuota:           desiredResources.CpuQuota,
            MemoryLimitInBytes: desiredResources.MemoryLimit,
        },
    )
}
Step 2: containerd updates cgroups
// Simplified from containerd source
func (c *controller) Update(resources *specs.LinuxResources) error {
    // Update CPU
    if resources.CPU != nil {
        if err := c.writeCgroup("cpu.max",
            fmt.Sprintf("%d %d", resources.CPU.Quota, resources.CPU.Period)); err != nil {
            return err
        }
    }

    // Update memory
    if resources.Memory != nil && resources.Memory.Limit != nil {
        if err := c.writeCgroup("memory.max",
            strconv.FormatInt(*resources.Memory.Limit, 10)); err != nil {
            return err
        }
    }
    return nil
}
Step 3: The actual kernel interaction
# What actually happens on the node
# CPU update
echo "200000 100000" > /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cri-containerd-<id>.scope/cpu.max

# Memory update
echo "2147483648" > /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cri-containerd-<id>.scope/memory.max
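The two writes above can be replayed against a scratch directory instead of a live `/sys/fs/cgroup`, which makes the mechanics easy to inspect. A minimal sketch; the paths and the `write_cgroup` helper are illustrative, not containerd code.

```python
# Sketch of the two cgroup v2 writes shown above, replayed against a
# throwaway directory instead of /sys/fs/cgroup (paths are illustrative).

import tempfile
from pathlib import Path

def write_cgroup(scope: Path, name: str, value: str) -> None:
    # Mirrors `echo "<value>" > <scope>/<name>`
    (scope / name).write_text(value + "\n")

scope = Path(tempfile.mkdtemp()) / "cri-containerd-abc.scope"
scope.mkdir(parents=True)

# CPU: quota=200000us per period=100000us -> up to 2 full CPUs
write_cgroup(scope, "cpu.max", "200000 100000")

# Memory: hard limit of 2 GiB
write_cgroup(scope, "memory.max", str(2 * 1024**3))

print((scope / "cpu.max").read_text().strip())     # 200000 100000
print((scope / "memory.max").read_text().strip())  # 2147483648
```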
Why This Is Safe for CPU, Risky for Memory
CPU Resizing (Safe)
- CPU limits are implemented through the Completely Fair Scheduler (CFS) in Linux.
- cpu.max in cgroup v2 sets the quota for CPU time slices.
- If a process exceeds its quota, the kernel throttles it, slowing execution.
- Increasing quota simply allows the process more CPU time.
- No process is killed, so live resizing is safe.
This is why you can safely adjust CPU limits on running Pods.
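The CFS bandwidth arithmetic behind `cpu.max` is simple enough to show directly. A small worked example; the helper names are illustrative, not kernel APIs.

```python
# cpu.max = "<quota_us> <period_us>": the quota/period ratio caps how many
# CPUs a cgroup can consume. Helper names here are illustrative only.

def max_cpus(quota_us: int, period_us: int) -> float:
    """How many full CPUs a cgroup may use under CFS bandwidth control."""
    return quota_us / period_us

def is_throttled(used_us: int, quota_us: int) -> bool:
    """Within one period, the kernel throttles once the quota is exhausted."""
    return used_us >= quota_us

# cpu.max = "200000 100000" -> 2 CPUs
print(max_cpus(200_000, 100_000))      # 2.0

# Raising the quota live just grants more runtime; nothing is killed
print(max_cpus(400_000, 100_000))      # 4.0
print(is_throttled(250_000, 200_000))  # True: work beyond quota waits
```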
Memory Resizing (Risky)
- Memory limits are enforced via memory.max in cgroup v2.
- If a process tries to allocate more than the limit, Linux must reclaim memory.
- If reclaim fails, the OOM killer terminates the process.
- Unlike CPU, decreasing the limit doesn’t force the process to free memory.
- Applications like JVMs, databases, or caches keep memory resident; they don’t shrink automatically.
This is why decreasing memory limits can be dangerous unless the application is designed to release memory on demand, or you restart the container with new heap settings.
This is why the resizePolicy field exists:
spec:
  containers:
  - name: app
    resources:
      requests:
        memory: "2Gi"
        cpu: "1"
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired      # Safe to change live
    - resourceName: memory
      restartPolicy: RestartContainer # Restart if memory changes

Production Details Operators Must Know
QoS Class Behavior
Critical: QoS class does NOT change during resize.
# Pod starts as Burstable (requests < limits)
resources:
  requests:
    memory: "1Gi"
  limits:
    memory: "2Gi"

# After resize, still Burstable even if requests = limits
# Eviction priority stays medium; OOM score stays medium
resources:
  requests:
    memory: "2Gi"
  limits:
    memory: "2Gi"
Why this matters in production
If node memory pressure happens:
- Guaranteed Pods survive longest
- Burstable Pods are killed next
- BestEffort Pods die first
Your resized Pod still behaves like Burstable even though it looks Guaranteed in YAML.
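The QoS rules are easy to codify, which makes the trap visible. A simplified sketch of the classification rules; real kubelet logic also handles multiple containers and per-resource cases.

```python
# Simplified sketch of Kubernetes QoS classification. The class is computed
# once at Pod creation; resize does NOT rerun this logic.

def qos_class(requests: dict, limits: dict) -> str:
    if not requests and not limits:
        return "BestEffort"
    if requests and requests == limits:
        return "Guaranteed"
    return "Burstable"

# At creation: requests < limits -> Burstable
print(qos_class({"memory": "1Gi"}, {"memory": "2Gi"}))  # Burstable

# After the resize, the YAML *looks* Guaranteed...
print(qos_class({"memory": "2Gi"}, {"memory": "2Gi"}))  # Guaranteed
# ...but Kubernetes keeps the class computed at creation, so the Pod is
# still treated as Burstable for eviction and OOM scoring.
```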
Interaction with Vertical Pod Autoscaler (VPA)
Before in-place resizing
VPA behavior:
- Recommend new resources
- Delete Pod
- Recreate Pod with new values
Side effects:
- Pod UID changes
- Connections drop
- Cache lost
- Stateful workloads suffer
And because the Pod was new, its QoS class was recalculated each time.
VPA can now apply recommendations without recreating Pods:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: database-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: database
  updatePolicy:
    updateMode: "InPlace"  # New mode in VPA

Before: VPA would set updateMode: Auto, which deleted Pods.
Now: VPA patches resources directly, preserving Pod identity.
The dangerous part
VPA is statistical.
It does not know:
- JVM heap size
- Database cache behavior
- Native memory overhead
So VPA might:
- Shrink memory below actual in-use memory
- Trigger OOM kill immediately
This is why in-place VPA without guardrails is risky for memory.
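One possible guardrail is to clamp the recommendation so it never drops below observed usage plus a safety margin. An illustrative helper, not part of VPA itself; the 25% margin is an arbitrary example value.

```python
# Hypothetical guardrail for in-place VPA memory recommendations: never
# recommend less than observed usage plus a safety margin.

def clamp_recommendation(recommended: int, observed_usage: int,
                         margin: float = 0.25) -> int:
    """Floor the recommendation at usage * (1 + margin)."""
    floor = int(observed_usage * (1 + margin))
    return max(recommended, floor)

GiB = 1024**3

# VPA's percentile math suggests 1.5 GiB, but the JVM holds 2 GiB resident:
# the floor (2.5 GiB) wins, avoiding an immediate OOM kill
print(clamp_recommendation(int(1.5 * GiB), 2 * GiB))  # 2684354560

# A recommendation above the floor passes through unchanged
print(clamp_recommendation(3 * GiB, 2 * GiB))         # 3221225472
```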
The Infeasibility Problem
# Current node capacity
Node: 32 GB total, 28 GB allocated, 4 GB free

# Pod resize request
Current: 2 GB
Desired: 8 GB
Increase: +6 GB
# Result: Infeasible (only 4 GB free)
What Kubernetes does:
- It does not reschedule the Pod
- It does not evict other Pods
- It does not partially apply the resize
status:
  resize: Infeasible
  conditions:
  - type: PodResizeFailed
    status: "True"
    reason: InsufficientCapacity
    message: "Node lacks 6GB for memory increase"

The Pod continues running with old limits.
Important points:
- No restart
- No silent failure
Silent partial resizes would be worse.
What you must do:
- Monitor for the Infeasible state
- Either add node capacity or shrink other workloads
- Retry the resize when capacity becomes available
If you ignore this state, you are running with a false sense of safety.
Observability Gaps
Metrics lag behind reality:
# Prometheus metrics update on scrape intervals (15-60s)
container_memory_usage_bytes{pod="database"}

# But the cgroup limit changed immediately
# Your alerts may fire incorrectly during this window
During that window:
- Usage looks high relative to old limit
- Alerts fire
- Nothing is actually wrong
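The false-positive window comes down to which limit the utilization is computed against. A quick illustration using this article's numbers (the helper is illustrative, not a Prometheus API):

```python
# Utilization against the stale (scraped) limit crosses the alert threshold
# even though the cgroup already enforces the new, larger limit.

def utilization(usage: int, limit: int) -> float:
    return usage / limit

GiB = 1024**3
usage = int(1.8 * GiB)

stale_limit = 2 * GiB   # what the last Prometheus scrape still reports
fresh_limit = 4 * GiB   # what the cgroup actually enforces now

print(round(utilization(usage, stale_limit), 2))  # 0.9  -> alert fires
print(round(utilization(usage, fresh_limit), 2))  # 0.45 -> nothing is wrong
```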
Alert: MemoryUtilizationHigh
Current: 1.8 GB used / 2 GB limit = 90% (triggers alert)
Reality: Limit just increased to 4 GB, usage is 45%

Mitigation:
# Add resize state to alerts
- alert: MemoryUtilizationHigh
  expr: |
    container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
    and
    kube_pod_status_resize == 0  # Not currently resizing

Complete State Machine
Operational Guidelines from Production Experience
1. Never Shrink JVM Memory Without Heap Tuning
The problem:
# Container has 4 GB limit, JVM configured with -Xmx3G
resources:
  limits:
    memory: "4Gi"

# Operator decreases to 2 GB
# JVM heap is still 3 GB
# OOM kill is now inevitable
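The mismatch above can be caught before the resize is applied. A hypothetical pre-flight check, sketched in Python: it parses the `-Xmx` flag and refuses a limit that leaves no room for heap plus native overhead (the 25% overhead figure is an assumption for illustration; parsing covers only the simple `-Xmx<n>[gmk]` forms).

```python
# Hypothetical pre-flight check: does the JVM heap (plus native overhead
# for metaspace, thread stacks, etc.) still fit under a proposed limit?

import re

UNITS = {"g": 1024**3, "m": 1024**2, "k": 1024}

def parse_xmx(jvm_args: str) -> int:
    m = re.search(r"-Xmx(\d+)([gGmMkK])", jvm_args)
    if not m:
        raise ValueError("no -Xmx flag found")
    return int(m.group(1)) * UNITS[m.group(2).lower()]

def limit_is_safe(limit_bytes: int, jvm_args: str,
                  native_overhead: float = 0.25) -> bool:
    """Heap plus ~25% native overhead must fit under the container limit."""
    heap = parse_xmx(jvm_args)
    return heap * (1 + native_overhead) <= limit_bytes

GiB = 1024**3
print(limit_is_safe(4 * GiB, "-Xms1G -Xmx3G"))  # True: ~3.75 GiB fits in 4
print(limit_is_safe(2 * GiB, "-Xms1G -Xmx3G"))  # False: OOM kill waiting
```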
The fix:
# If resizing down, restart the container with new heap settings
kubectl set resources deployment app --limits=memory=2Gi
# AND update JVM args: -Xmx1536M

Or use RestartContainer policy:
resizePolicy:
- resourceName: memory
  restartPolicy: RestartContainer

2. Prefer CPU Resizing Over Memory in Live Systems
CPU is safe to adjust dynamically:
# Low latency? Add more CPU
kubectl patch pod api --type=json -p='[{
  "op": "replace",
  "path": "/spec/containers/0/resources/limits/cpu",
  "value": "4"
}]'

# No restart, immediate effect
Memory requires careful analysis:
- Is the app memory-elastic (cache vs essential data)?
- Does it respect system memory signals?
- Can it release memory on demand?
3. Gate Memory Decreases with Manual Approval
Implement a policy controller:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: memory-decrease-gate
webhooks:
- name: validate.resize.example.com
  rules:
  - operations: ["UPDATE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
  clientConfig:
    service:
      name: resize-validator
      namespace: kube-system

Logic:
if new_memory < current_memory:
    if pod.annotations.get("approved-by-sre") != "true":
        return reject("Memory decrease requires SRE approval")

4. Test Under Load in Staging
Load test scenario:
# 1. Start workload at baseline
kubectl apply -f baseline-load.yaml

# 2. Generate 70% load
hey -c 100 -z 5m http://service
# 3. Resize during load
kubectl patch pod service --patch ...
# 4. Monitor for:
# - Latency spikes
# - Error rate changes
# - Memory usage patterns
# - CPU throttling
# 5. Repeat for decreases
5. Alert on Lingering Resize States
# Prometheus alert
- alert: PodResizeStuck
  expr: |
    kube_pod_status_resize{state=~"InProgress|Proposed"} > 0
  for: 5m
  annotations:
    summary: "Pod {{ $labels.pod }} stuck in resize state"
    description: "Resize has not completed in 5 minutes"

- alert: PodResizeInfeasible
  expr: |
    kube_pod_status_resize{state="Infeasible"} > 0
  annotations:
    summary: "Pod {{ $labels.pod }} cannot resize"
    description: "Node lacks capacity for resize request"
Why It Took Seven Years
KEP-1287 was proposed in 2018. It reached GA in 2025. Why?
Technical Dependencies
Year 1–2 (2018–2019): CRI API Design
- Container Runtime Interface needed new RPCs
- UpdateContainerResources had to be standardized
- runc, containerd, CRI-O all needed alignment
Year 3–4 (2020–2021): Runtime Implementation
- containerd 1.x couldn’t support it reliably
- CRI-O needed refactoring
- cgroup v2 adoption was still minimal
Year 5–6 (2022–2023): Failure Mode Discovery
- What happens if kubelet crashes mid-resize?
- What if the container dies during update?
- What if node runs out of capacity?
- Each edge case required state machine changes
Year 7 (2024): Production Validation
- Alpha (v1.27): Limited testing in opt-in clusters
- Beta (v1.29): Wider deployment, bug discovery
- Operators needed to trust it before GA
The Cultural Challenge
Kubernetes built a reputation on immutability. Operators designed around it:
- CI/CD pipelines that always replace Pods
- Monitoring that watches for Pod UID changes
- Disaster recovery that depends on Pod recreation
Changing this required trust, not just code. The feature had to be:
- Reliable enough for databases
- Safe enough for production
- Clear enough for operators to understand failure modes
That takes time.
The Lesson
Reliable infrastructure features mature slowly. This is good.
Closing Thoughts: This Is an Operational Change, Not a Convenience
In-place Pod resizing is not a feature to celebrate and forget. It’s a fundamental shift in how Kubernetes handles running workloads.
It requires operators to:
- Understand memory vs CPU semantics at the kernel level
- Monitor resize state transitions
- Test under load before production use
- Treat memory decreases as high-risk operations
It enables teams to:
- Optimize costs without sacrificing reliability
- Scale vertically without losing state
- Treat infrastructure as more transparent and controllable
Further Reading
Written by Dinidu Sachintha, Site Reliability Engineer Intern @ WSO2 Sri Lanka December 2025