Operational Sympathy - NFHN Reader

At the recently concluded WSO2 Tech Conference, I gave a talk titled “Code for Cloud”, built around a simple but uncomfortable truth: the cloud punishes complacency. Many of the assumptions that held true in traditional systems quietly collapse under cloud scale. Latency is no longer predictable, failures are not exceptional events, and complexity compounds faster than intuition. In the talk, I challenged a mindset that still persists across software teams: that Non-Functional Requirements (NFRs) can be deferred, optimized later, or delegated entirely to infrastructure and the teams managing it. In reality, cost, resiliency, security, scalability, and operability are not “nice to have” attributes in cloud-native systems; they are hard design constraints that shape the architecture, the design decisions, and every feature that is developed.

For a significant part of my own career, working in software architecture, design, and development, I was not immune to this way of thinking. My primary focus was on delivering functional correctness, i.e., making sure features worked as intended, while assuming that running production-grade systems was someone else’s responsibility. Today, standing on the other side of that fence as part of the WSO2 SRE team, where my teams and I are accountable for keeping large-scale production systems alive, stable, and cost-effective, I see the consequences of those earlier assumptions clearly. Ignoring NFRs does not merely accumulate technical debt; it manifests as outages, runaway cloud bills, fragile systems, and operational firefighting.

This article builds on that talk to argue for a mindset shift toward operational sympathy: designing and writing software with an explicit awareness of how it will behave, fail, and be operated in the real world. I haven’t found many references to the term operational sympathy, but after my talk I came to think that it is a concept that deserves to be popularized in the same way as mechanical sympathy.

Mechanical sympathy

This term, as applied to software engineering, is attributed to Martin Thompson, who popularized it in the software world. He drew inspiration from Jackie Stewart, the Formula One driver who originally used the phrase to describe understanding and respecting how a machine behaves in order to get the best performance without breaking it. You can’t win races by constantly pushing your race car beyond its design parameters.

What is operational sympathy?

Operational Sympathy is a mindset and practice where software architects, designers, and developers intentionally design how their systems behave in production under load, during failures, and under security threats by planning failure modes, graceful degradation, built-in observability, and clear operational run books that enable early detection and fast recovery.

Key elements of operational sympathy

1. Production-Aware Design

Systems are designed with a clear understanding of how they will run in real production environments, including deployment models, scaling behavior, and day-to-day operational workflows. This is the most important aspect.

2. Load and Scale Consciousness

Software is built with explicit consideration for how it behaves under normal load, peak traffic, and unexpected spikes, avoiding assumptions that performance and scaling will be “handled by the cloud.”
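One way to make behavior under unexpected spikes explicit is to cap in-flight work and reject the excess, rather than letting requests queue without bound. The following is a minimal sketch of that idea; the `LoadShedder` name and the limit value are hypothetical and chosen only for illustration.

```python
import threading


class LoadShedder:
    """Rejects work beyond a fixed concurrency limit instead of queueing it."""

    def __init__(self, max_in_flight):
        # Each in-flight request holds one slot until it finishes.
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_run(self, handler, *args):
        # Non-blocking acquire: if no slot is free, shed the request at once
        # so callers get a fast, explicit rejection they can retry later.
        if not self._slots.acquire(blocking=False):
            return ("rejected", None)
        try:
            return ("ok", handler(*args))
        finally:
            self._slots.release()


# Hypothetical limit; in practice it would come from load testing.
shedder = LoadShedder(max_in_flight=100)
status, value = shedder.try_run(lambda x: x * 2, 21)
```

The point of the sketch is that the limit is a deliberate design decision owned by the developer, not something left for the platform to discover during an incident.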

3. Failure-Aware Architecture

Failure is treated as a normal operating condition. Teams proactively identify failure modes, design for graceful degradation and resilience, and ensure that partial failures do not cascade into full outages.
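A common pattern for keeping a partial failure from cascading is a circuit breaker combined with a fallback. The sketch below is one simplified, single-threaded illustration, not a production implementation; the class name, thresholds, and the fallback are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency for a while and serves a fallback."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock          # injectable so the behavior can be tested
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: degrade gracefully without touching the dependency.
                return fallback()
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller might wrap a remote lookup as `breaker.call(fetch_live_prices, lambda: cached_prices)`, where both names are hypothetical: users see slightly stale data instead of an error page while the dependency recovers.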

4. Built-In Observability

Observability is designed into the system from the start through meaningful metrics, logs, and traces that allow failures and degradation to be detected early, before users feel the impact.
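As a rough illustration of building observability in from the start, the sketch below wraps an operation so that every call records an outcome counter and a structured log line with its duration. It uses only the Python standard library; the `observed` decorator, the logger name, and the in-memory `counters` store are hypothetical stand-ins for whatever metrics and logging stack a team actually uses.

```python
import json
import logging
import time
from collections import Counter
from functools import wraps

logger = logging.getLogger("app.metrics")   # hypothetical logger name
counters = Counter()                        # stand-in for a real metrics client


def observed(operation):
    """Records outcome counts and a structured log line for each call."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            outcome = "success"
            try:
                return func(*args, **kwargs)
            except Exception:
                outcome = "error"
                raise                        # never swallow the failure
            finally:
                duration_ms = (time.perf_counter() - start) * 1000
                counters[(operation, outcome)] += 1
                logger.info(json.dumps({
                    "operation": operation,
                    "outcome": outcome,
                    "duration_ms": round(duration_ms, 3),
                }))
        return wrapper
    return decorator


@observed("checkout")
def checkout(cart_total):
    if cart_total < 0:
        raise ValueError("negative total")
    return cart_total
```

Because the error rate and latency of `checkout` are emitted on every call, an alert on either can fire before users start reporting problems.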

5. Operability and Recovery Simplicity

Systems are designed to be easy to operate during incidents, with clear levers for mitigation, safe rollbacks, feature toggles, and predictable recovery paths.
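A feature toggle is one of the simplest mitigation levers. The sketch below assumes flags are read from environment variables with a `FEATURE_` prefix; that convention, the flag name `new_ranker`, and the `recommend` function are all hypothetical and stand in for whatever flag system is in use.

```python
import os


def flag_enabled(name, default=False, env=os.environ):
    """Returns True if the named feature flag is switched on."""
    raw = env.get(f"FEATURE_{name.upper()}")
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "on", "yes"}


def recommend(user_id, env=os.environ):
    # The new code path sits behind a flag, so an operator can turn it off
    # during an incident without a redeploy or a rollback.
    if flag_enabled("new_ranker", env=env):
        return f"new-ranker results for {user_id}"
    return f"baseline results for {user_id}"
```

With this shape, mitigation during an incident is a configuration change (`FEATURE_NEW_RANKER=off`) rather than an emergency code change.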

6. Security as a Runtime Concern

Security is considered not just at design time, but as a continuous operational reality, covering threat detection, blast-radius reduction, and incident response. Do not rely primarily on perimeter security. Follow a zero-trust approach as much as possible.

7. Cost Awareness by Design

Architectural and implementation choices account for how resource usage translates into real cloud costs, preventing runaway bills caused by inefficient scaling or poor defaults. Also weigh the cost of provider-native cloud services against cloud-agnostic solutions. Carry out a cost approximation exercise before finalizing the architecture.
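A cost approximation exercise can start as simple arithmetic. The sketch below compares a baseline against a scaled-out peak; every number in it (instance counts, hourly rate, egress volume and price) is an invented placeholder, not a real provider price, and the 730-hour month is an approximation.

```python
def monthly_cost(instances, hourly_rate_usd, egress_gb, egress_rate_usd_per_gb,
                 hours_per_month=730):
    """Rough monthly estimate: always-on compute plus data egress."""
    compute = instances * hourly_rate_usd * hours_per_month
    egress = egress_gb * egress_rate_usd_per_gb
    return round(compute + egress, 2)


# Placeholder figures for illustration only.
baseline = monthly_cost(instances=4, hourly_rate_usd=0.10,
                        egress_gb=500, egress_rate_usd_per_gb=0.09)
peak = monthly_cost(instances=12, hourly_rate_usd=0.10,
                    egress_gb=500, egress_rate_usd_per_gb=0.09)
print(f"baseline ≈ ${baseline}/month, sustained peak ≈ ${peak}/month")
```

Even a crude model like this makes it visible how an autoscaling ceiling or a chatty cross-region call translates into the monthly bill before the architecture is finalized.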

8. Runbook-Driven Thinking

Teams document known failure scenarios, diagnostic steps, and recovery actions as runbooks, ensuring that operational teams can respond quickly and confidently under pressure. Imagine yourself as the person handling an incident at 2 AM that involves the feature you are developing.

9. Shared Ownership of Production Outcomes

Developers, architects, and operators share responsibility for system behavior in production, closing the gap between those who build systems and those who keep them running. Get out of the “it’s not my problem” mindset.

Operational sympathy checklist

The following checklist provides a framework for architects, designers, and developers to evaluate their designs and implementations for operational sympathy. You can make a copy of this spreadsheet and use it: https://docs.google.com/spreadsheets/d/1jryXy-aNQDoDgjMC8T2D5grdgP5bxQr-DwKJkB2hNfE/edit?gid=0#gid=0

The operational sympathy score gives an indication of how operationally sympathetic your design is.

Conclusion

Non-functional requirements should never be an afterthought or left solely to the operations teams. They need to be considered from the first whiteboard sketch. Operational sympathy is the mindset that closes the gap between building software and running it, forcing architects and developers to confront how their decisions behave under load, failure, security threats, and real operational stress. When non-functional requirements are treated as design constraints rather than afterthoughts, systems become easier to operate, cheaper to run, faster to recover, and kinder to the people responsible for keeping them running.

The key takeaway is simple but uncompromising: if you don’t consciously design for production, production will expose every shortcut you took.

PS: My talk at the WSO2 Tech Conference which resulted in this article.