GitHub - Scoutflo/Scoutflo-SRE-Playbooks: 🚀 SRE incident response playbooks for AWS & Kubernetes. Step-by-step troubleshooting guides to help on-call engineers resolve infrastructure issues faster.

SRE Playbooks Repository

Comprehensive incident response playbooks for AWS, Kubernetes, and Sentry environments - Helping SREs diagnose and resolve infrastructure issues faster with systematic, step-by-step troubleshooting guides.

Overview

This repository contains 414 comprehensive incident response playbooks designed to help Site Reliability Engineers (SREs) systematically diagnose and resolve common infrastructure and application issues in AWS, Kubernetes, and Sentry environments.

Why This Repository?

Systematic Approach: Each playbook follows a consistent structure with clear diagnostic steps
Time-Saving: Quickly identify root causes with correlation analysis frameworks
Community-Driven: Continuously improved by the open-source community
Production-Ready: Based on real-world incident response scenarios
Comprehensive Coverage: 232 Kubernetes playbooks + 157 AWS playbooks + 25 Sentry playbooks
Proactive Monitoring: 56 K8s + 65 AWS proactive playbooks for capacity planning and compliance

Diagnosis Improvements

All playbooks use an events-first approach for root cause analysis:

Diagnosis sections prioritize checking recent events and changes before diving into configuration details
Conditional logic patterns help narrow down causes based on observed symptoms
Time-based correlation analysis connects events to failures systematically
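
The time-based correlation idea can be sketched in a few lines of shell. This is an illustration only, not tooling shipped with the repository: the function name `events_in_window`, the tab-separated `epoch<TAB>description` input format, and the sample events are assumptions made for this example.

```shell
# Hypothetical helper: print events that occurred within WINDOW seconds
# before a failure. Input on stdin: "<epoch-seconds><TAB><description>".
events_in_window() {
  failure_epoch="$1"   # when the failure was observed (epoch seconds)
  window_seconds="$2"  # how far back to look
  awk -F '\t' -v fail="$failure_epoch" -v win="$window_seconds" '
    $1 >= fail - win && $1 <= fail {
      delta = fail - $1
      print delta "s before failure: " $2
    }
  '
}

# Made-up sample data: a ConfigMap change two hours before the failure,
# a rollout five minutes before it, and an event after it.
sample_events() {
  printf '1700000000\tconfigmap updated\n'
  printf '1700006900\tdeployment rolled out\n'
  printf '1700007300\tnode drained\n'
}

# Look back 30 minutes from a failure observed at epoch 1700007200.
sample_events | events_in_window 1700007200 1800
```

With the 30-minute window only the rollout is reported. Rerunning with a wider window (for example `7200`) also surfaces the earlier ConfigMap change, which mirrors the "extend the time window" advice used in the Diagnosis sections.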

Use Cases

During Incidents: Quick reference for troubleshooting common issues
On-Call Rotation: Essential runbook collection for on-call engineers
Knowledge Sharing: Standardize troubleshooting procedures across teams
Training: Learn systematic incident response methodologies
Documentation: Build your own runbook library

Repository Structure

scoutflo-SRE-Playbooks/
├── AWS Playbooks/                    # 157 AWS playbooks
│   ├── 01-Compute/                   # 27 playbooks (EC2, Lambda, ECS, EKS)
│   ├── 02-Database/                  # 8 playbooks (RDS, DynamoDB)
│   ├── 03-Storage/                   # 7 playbooks (S3)
│   ├── 04-Networking/                # 17 playbooks (VPC, ELB, Route53)
│   ├── 05-Security/                  # 16 playbooks (IAM, KMS, GuardDuty)
│   ├── 06-Monitoring/                # 8 playbooks (CloudTrail, CloudWatch)
│   ├── 07-CI-CD/                     # 9 playbooks (CodePipeline, CodeBuild)
│   ├── 08-Proactive/                 # 65 proactive monitoring playbooks
│   └── README.md
├── K8s Playbooks/                    # 232 Kubernetes playbooks
│   ├── 01-Control-Plane/             # 24 playbooks
│   ├── 02-Nodes/                     # 24 playbooks
│   ├── 03-Pods/                      # 41 playbooks
│   ├── 04-Workloads/                 # 25 playbooks
│   ├── 05-Networking/                # 27 playbooks
│   ├── 06-Storage/                   # 9 playbooks
│   ├── 07-RBAC/                      # 6 playbooks
│   ├── 08-Configuration/             # 6 playbooks
│   ├── 09-Resource-Management/       # 8 playbooks
│   ├── 10-Monitoring-Autoscaling/    # 3 playbooks
│   ├── 11-Installation-Setup/        # 1 playbook
│   ├── 12-Namespaces/                # 2 playbooks
│   ├── 13-Proactive/                 # 56 proactive monitoring playbooks
│   └── README.md
├── Sentry Playbooks/                 # 25 Sentry playbooks
│   ├── 01-Error-Tracking/            # 19 playbooks
│   ├── 02-Performance/               # 6 playbooks
│   ├── 03-Release-Health/            # Placeholder
│   └── README.md
├── CONTRIBUTING.md
└── README.md
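
To check the per-folder counts in the tree above against a local clone, something like the following could be used. It is a sketch that assumes every playbook is a `.md` file other than `README.md`; the function name `count_playbooks` is made up for this example.

```shell
# Hypothetical helper: count playbook files (*.md, excluding README.md)
# in each category folder directly under a top-level playbook directory.
count_playbooks() {
  root="$1"
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue
    n=$(find "$dir" -type f -name '*.md' ! -name 'README.md' | wc -l | tr -d ' ')
    # Print "<folder-name><TAB><count>".
    printf '%s\t%s\n' "$(basename "$dir")" "$n"
  done
}

# Example, run from the repository root:
# count_playbooks "K8s Playbooks"
```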

AWS Playbooks (`AWS Playbooks/`)

157 playbooks covering 7 service categories + proactive monitoring:

Compute Services (27 playbooks): EC2, Lambda, ECS, EKS
Database (8 playbooks): RDS, DynamoDB
Storage (7 playbooks): S3
Networking (17 playbooks): VPC, ELB, Route 53, NAT Gateway
Security (16 playbooks): IAM, KMS, GuardDuty, CloudTrail
Monitoring (8 playbooks): CloudTrail, CloudWatch
CI/CD (9 playbooks): CodePipeline, CodeBuild
Proactive (65 playbooks): Capacity planning, compliance, cost optimization

Key Topics:

Connection timeouts and network issues
Access denied and permission problems
Resource unavailability and capacity issues
Security breaches and threat detection
Service integration failures
Proactive capacity and compliance monitoring

See AWS Playbooks/README.md for complete documentation and playbook list.

Kubernetes Playbooks (`K8s Playbooks/`)

232 playbooks organized into 13 categorized folders covering Kubernetes cluster and workload issues:

Folder Structure:

01-Control-Plane/ (24 playbooks) - API Server, Scheduler, Controller Manager, etcd
02-Nodes/ (24 playbooks) - Node readiness, kubelet issues, resource constraints
03-Pods/ (41 playbooks) - Scheduling, lifecycle, health checks, resource limits
04-Workloads/ (25 playbooks) - Deployments, StatefulSets, DaemonSets, Jobs, HPA
05-Networking/ (27 playbooks) - Services, Ingress, DNS, Network Policies, kube-proxy
06-Storage/ (9 playbooks) - PersistentVolumes, PersistentVolumeClaims, StorageClasses
07-RBAC/ (6 playbooks) - ServiceAccounts, Roles, RoleBindings, authorization
08-Configuration/ (6 playbooks) - ConfigMaps and Secrets access issues
09-Resource-Management/ (8 playbooks) - Resource Quotas, overcommit, compute resources
10-Monitoring-Autoscaling/ (3 playbooks) - Metrics Server, Cluster Autoscaler
11-Installation-Setup/ (1 playbook) - Helm and installation issues
12-Namespaces/ (2 playbooks) - Namespace management issues
13-Proactive/ (56 playbooks) - Proactive monitoring, capacity planning, compliance

Key Topics:

Pod lifecycle issues (CrashLoopBackOff, Pending, Terminating)
Control plane component failures
Network connectivity and DNS resolution
Storage and volume mounting problems
RBAC and permission errors
Resource quota and capacity constraints
Proactive capacity and compliance monitoring

See K8s Playbooks/README.md for complete documentation and playbook list.

Sentry Playbooks (`Sentry Playbooks/`)

25 playbooks covering error tracking and performance monitoring:

Folder Structure:

01-Error-Tracking/ (19 playbooks) - Error capture, grouping, alerting, and debugging
02-Performance/ (6 playbooks) - Transaction monitoring, performance issues, tracing
03-Release-Health/ - Release tracking and health monitoring (placeholder)

Key Topics:

Error capture and reporting issues
Issue grouping and deduplication
Alert configuration and routing
Performance transaction monitoring
SDK integration troubleshooting
Release health tracking

See Sentry Playbooks/README.md for complete documentation and playbook list.

Getting Started

Prerequisites

Basic knowledge of AWS services, Kubernetes, or Sentry
Access to the AWS Console, a Kubernetes cluster, or the Sentry dashboard (for using playbooks)
Git (for cloning the repository)

Installation

Option 1: Clone the Repository

# Clone the repository
git clone https://github.com/Scoutflo/scoutflo-SRE-Playbooks.git

# Navigate to the repository
cd scoutflo-SRE-Playbooks

# View available playbooks
ls AWS\ Playbooks/
ls K8s\ Playbooks/
ls Sentry\ Playbooks/

Option 2: Use as Git Submodule

Include playbooks in your own projects:

git submodule add https://github.com/Scoutflo/scoutflo-SRE-Playbooks.git playbooks

Option 3: Download Specific Playbooks

Browse and download individual playbooks directly from the GitHub web interface.

Quick Start

Identify Your Issue: Determine if it's an AWS, Kubernetes, or Sentry issue
Navigate to Playbooks:
- AWS issues -> AWS Playbooks/[category-folder]/
- K8s issues -> K8s Playbooks/[category-folder]/
- Sentry issues -> Sentry Playbooks/[category-folder]/
Find the Playbook: Match your symptoms to a playbook title
Follow the Steps: Execute diagnostic steps in order
Use Diagnosis Section: Apply correlation analysis for root cause identification
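
When working from a local clone, matching a symptom to a playbook title can be done with a filename search. The helper below is a sketch; `find_playbook` is a name invented for this example, and it relies on the `-iname` option available in GNU and BSD `find`.

```shell
# Hypothetical helper: list playbook files whose name contains a keyword
# (case-insensitive), skipping README files.
find_playbook() {
  root="$1"
  keyword="$2"
  find "$root" -type f -iname "*${keyword}*.md" ! -name 'README.md' | sort
}

# Example, run from the repository root:
# find_playbook "AWS Playbooks" ssh
```

To search by error message or alert name inside the playbooks rather than by title, a recursive `grep -ril "<text>" "K8s Playbooks"` is a reasonable alternative.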

Learn More

Watch Tutorials: Check our YouTube channel for video walkthroughs and best practices
AI SRE Demo: Watch the Scoutflo AI SRE Demo to see AI-powered incident response
Scoutflo Documentation: Visit Scoutflo Documentation for platform guides
Join the Community: Connect with other SREs in our Slack workspace

Example Usage

Scenario: EC2 instance SSH connection timeout

Navigate to AWS Playbooks/
Open Connection-Timeout-SSH-Issues-EC2.md
Follow the Playbook steps, replacing <instance-id> with your actual instance ID
Use the Diagnosis section to correlate events with failures
Apply the identified fix

Usage

How Playbooks Work

Important: These playbooks are designed for AI agents using natural language processing (NLP). They use natural language instructions that AI agents interpret and execute using available tools (like AWS MCP tools, Kubernetes MCP tools, or kubectl).

Example Playbook Step:

Natural Language: "Retrieve logs from pod <pod-name> in namespace <namespace> and analyze error messages"
AI Agent Action: Interprets this instruction and uses appropriate tools to fetch and analyze pod logs

For Manual Use:

While playbooks are optimized for AI agents, you can also use them manually
The README files in each category folder include equivalent kubectl/AWS CLI commands for manual verification
Replace placeholders with actual resource identifiers when following steps manually
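
As an illustration of manual use, the example step above might translate to a `kubectl logs` call along these lines. This is a sketch rather than the command listed in the category READMEs: the function name, the 200-line tail, the error pattern, and the sample pod and namespace are all assumptions.

```shell
# Hypothetical manual equivalent of the step
# "Retrieve logs from pod <pod-name> in namespace <namespace> and analyze
# error messages". With DRY_RUN=1 (the default here) the kubectl command is
# only printed, so the sketch can be read without cluster access.
DRY_RUN="${DRY_RUN:-1}"

pod_error_logs() {
  pod="$1"
  namespace="$2"
  if [ "$DRY_RUN" = "1" ]; then
    printf 'kubectl logs %s -n %s --tail=200\n' "$pod" "$namespace"
  else
    # Fetch the last 200 log lines and keep lines that look like errors.
    kubectl logs "$pod" -n "$namespace" --tail=200 | grep -iE 'error|exception|fatal'
  fi
}

# Made-up pod and namespace names, for illustration only.
pod_error_logs my-app-7d9c5b6f4-abcde production
```

Set `DRY_RUN=0` to run the command against the cluster in your current kubectl context.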

Playbook Structure

All playbooks follow a consistent structure:

Title - Clear, descriptive issue identification
Meaning - What the issue means, triggers, symptoms, root causes
Impact - Business and technical implications
Playbook - 8-10 numbered diagnostic steps in natural language (ordered from common to specific)
Diagnosis - Correlation analysis framework with time windows using events-first approach and conditional logic patterns

Best Practices

For AI Agents: Playbooks are optimized for AI interpretation - use natural language instructions
For Manual Use: See category README files for equivalent kubectl/AWS CLI commands
Replace Placeholders: All playbooks use placeholders (e.g., <instance-id>, <pod-name>) that must be replaced with actual values
Follow Order: Execute steps sequentially unless you have strong evidence pointing to a specific step
Correlate Timestamps: Use the Diagnosis section to correlate events with failures
Extend Windows: If initial correlations don't reveal causes, extend time windows as suggested

Placeholder Reference

AWS Playbooks:

<instance-id>, <bucket-name>, <region>, <function-name>, <role-name>, <user-name>, <security-group-id>, <vpc-id>, <rds-instance-id>, <load-balancer-name>

Kubernetes Playbooks:

<pod-name>, <namespace>, <deployment-name>, <node-name>, <service-name>, <ingress-name>, <pvc-name>, <configmap-name>, <secret-name>

Sentry Playbooks:

<project-slug>, <organization-slug>, <issue-id>, <transaction-name>, <release-version>, <environment>
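
Placeholder replacement can be scripted when following steps by hand. The helpers below are a minimal sketch with made-up names (`fill_placeholders`, `unfilled_placeholders`) and sample values; they are not part of the repository.

```shell
# Hypothetical helper: replace <placeholder> tokens in a playbook step.
# Usage: fill_placeholders "step text" name=value [name=value ...]
fill_placeholders() {
  text="$1"
  shift
  for pair in "$@"; do
    name="${pair%%=*}"   # text before the first "="
    value="${pair#*=}"   # text after the first "="
    # "|" is the sed delimiter, so values must not contain "|", "&" or "\".
    text=$(printf '%s\n' "$text" | sed "s|<${name}>|${value}|g")
  done
  printf '%s\n' "$text"
}

# Hypothetical helper: print placeholders that are still unfilled.
unfilled_placeholders() {
  printf '%s\n' "$1" | grep -o '<[a-z][a-z-]*>'
}

fill_placeholders \
  'Retrieve logs from pod <pod-name> in namespace <namespace>' \
  pod-name=web-0 namespace=default
```

Running `unfilled_placeholders` on a step before executing it is a cheap way to catch a token that was missed.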

Terminology & Glossary

Understanding the terms used in these playbooks will help you use them more effectively. For detailed glossaries, see the README files in the AWS Playbooks/ and K8s Playbooks/ folders.

Quick Reference

SRE (Site Reliability Engineering)

A discipline combining software engineering and operations to build reliable systems.

Playbook / Runbook

A step-by-step guide for diagnosing and resolving specific issues.

Incident

An event that disrupts or degrades a service, requiring immediate attention.

On-Call

Engineers available to respond to incidents outside normal business hours.

MTTR (Mean Time To Recovery)

Average time to restore a service after an incident. Playbooks help reduce MTTR.

Correlation Analysis

Finding relationships between events (like configuration changes) and symptoms (like service failures) by comparing timestamps.

Root Cause

The underlying reason why an issue occurred, as opposed to just the symptoms.

Placeholder

A value in playbooks (like <instance-id>) that you replace with your actual resource identifier.

Diagnosis Section

Part of each playbook that helps you correlate events with failures using time-based analysis.

Common Abbreviations

K8s: Kubernetes (K + 8 letters + s)
SRE: Site Reliability Engineering
MTTR: Mean Time To Recovery
API: Application Programming Interface
DNS: Domain Name System
RBAC: Role-Based Access Control
PVC: PersistentVolumeClaim
HPA: Horizontal Pod Autoscaler

For detailed explanations of AWS and Kubernetes terms, see the respective README files above.

Quick Reference

Need a quick cheat sheet? Check out our Quick Reference Card for:

One-page overview
Common commands
Quick lookup tables
Essential links

Not sure which playbook to use? Use our Troubleshooting Decision Tree to:

Quickly identify the right playbook
Navigate by issue type
Look up by error message or alert name

Examples & Use Cases

See real-world scenarios in EXAMPLES.md:

Step-by-step examples
Common workflows
Success stories
Best practices

FAQ

Have questions? Check our FAQ for answers to:

General questions
Usage questions
Technical questions
Contributing questions

Video Tutorials

Learn how to use these playbooks effectively:

YouTube Channel: @scoutflo6727 - Subscribe for tutorials and walkthroughs
AI SRE Demo: Watch Demo Video - See Scoutflo AI SRE in action
Tutorials: Step-by-step video guides on using playbooks
Best Practices: Learn SRE incident response best practices

Coming Soon: Video tutorials for:

How to use playbooks effectively
Common troubleshooting scenarios
Contributing to playbooks
Advanced correlation analysis

Roadmap

Check out our ROADMAP.md to see:

Planned features and new playbook categories
Short-term and long-term goals
How to suggest new features
Release history

Contributing

We welcome contributions from the community! Your contributions help make these playbooks better for everyone. Visit our Contributors page to see who has helped build this project.

First-time contributor? Start with our Getting Started Guide for a quick onboarding experience, then look for issues labeled good first issue.

How to Contribute

1. Reporting Issues

Found a bug, unclear instruction, or have a suggestion?

Check Existing Issues: Search GitHub Issues first
Create a New Issue:
- Use a clear, descriptive title
- Describe the problem or suggestion
- Include relevant service/component, error messages, or examples
- Tag with appropriate labels (aws-playbook, k8s-playbook, sentry-playbook, bug, enhancement, etc.)

2. Improving Existing Playbooks

To fix or enhance existing playbooks:

Fork the Repository: Create your own fork

Create a Branch:

git checkout -b fix/playbook-name-improvement

Make Your Changes:
- Follow the established playbook structure
- Maintain consistency with existing formatting
- Update placeholders and examples as needed
Test Your Changes: Verify the playbook is accurate and helpful

Commit and Push:

git add .
git commit -m "Fix: Improve [playbook-name] with [description]"
git push origin fix/playbook-name-improvement

Create a Pull Request:
- Provide a clear description of changes
- Reference any related issues
- Request review from maintainers

3. Adding New Playbooks

To add a new playbook for an uncovered issue:

Check for Duplicates: Ensure a similar playbook doesn't already exist
Follow the Structure: Use existing playbooks as templates
Choose the Right Location:
- AWS playbooks -> Appropriate category folder in AWS Playbooks/
- K8s playbooks -> Appropriate category folder in K8s Playbooks/
- Sentry playbooks -> Appropriate category folder in Sentry Playbooks/
Follow Naming Conventions:
- AWS: <IssueOrSymptom>-<Component>.md
- K8s: <AlertName>-<Resource>.md
- Sentry: <IssueType>-<Component>.md
Include All Sections: Title, Meaning, Impact, Playbook (8-10 steps), Diagnosis (5 correlations)
Update README: Add the new playbook to the appropriate README's playbook list
Create Pull Request: Follow standard contribution process
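
Before opening a pull request, a quick local check for the required sections can catch omissions. The sketch below assumes each section name sits on its own Markdown heading line (for example `## Meaning`); that heading format and the function name `missing_sections` are assumptions, so adjust the pattern to match the existing playbooks. The Title section is not checked because its text differs per playbook.

```shell
# Hypothetical pre-PR check: print the required sections missing from a
# playbook file, one per line. Prints nothing when all are present.
missing_sections() {
  file="$1"
  for section in Meaning Impact Playbook Diagnosis; do
    # Match a heading line such as "## Meaning" (assumed format).
    grep -Eq "^#+[[:space:]]+${section}[[:space:]]*$" "$file" \
      || printf '%s\n' "$section"
  done
}

# Example:
# missing_sections "K8s Playbooks/03-Pods/<AlertName>-<Resource>.md"
```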

Contribution Guidelines

Follow the Structure: Maintain consistency with existing playbooks
Use Placeholders: Replace specific values with placeholders
Be Specific: Provide actionable, step-by-step instructions
Include Correlation: Add time-based correlation analysis in the Diagnosis section
Test Thoroughly: Ensure playbooks are accurate and helpful
Document Changes: Clearly describe what you changed and why

Review Process

All contributions require review from maintainers
Feedback will be provided within 2-3 business days
Address any requested changes promptly
Once approved, your contribution will be merged

See CONTRIBUTING.md for detailed contribution guidelines.

Connect with Us

We'd love to hear from you! Here are the best ways to connect:

Community Channels

Slack Community: Join our Slack workspace for real-time discussions
GitHub Discussions: Start a discussion for questions and ideas
GitHub Issues: Report bugs or request features
LinkedIn: Follow Scoutflo on LinkedIn for updates and insights
Twitter/X: Follow @scout_flo for latest news and announcements

Feedback & Feature Requests

Have an idea for improvement or a new playbook topic?

GitHub Issues: Create a feature request
Slack: Share your ideas in our #playbooks channel

Bug Reports

Found a bug or error in a playbook?

GitHub Issues: Create a bug report
Slack: Report in our #playbooks channel for quick response

Scoutflo Resources

Official Documentation: Scoutflo Documentation - Complete guide to Scoutflo platform
Website: scoutflo.com - Learn more about Scoutflo
AI SRE Tool: ai.scoutflo.com - AI-powered SRE assistant
Infra Management Tool: deploy.scoutflo.com - Kubernetes deployment platform
YouTube Channel: @scoutflo6727 - Tutorials and demos
AI SRE Demo: Watch Demo Video - See Scoutflo AI SRE in action
Blog: scoutflo.com/blog and blog.scoutflo.com - Latest articles and insights
Pricing: scoutflo.com/pricing - Pricing information

Additional Resources

Roadmap: Check out our project roadmap to see what's coming
Documentation: Visit our wiki for detailed guides
Legal: Privacy Policy | Terms of Service

Support

Need help? Check out our Support Guide or:

Questions: GitHub Discussions
Bugs: Report an issue
Features: Request a feature
Security: See SECURITY.md

Related Resources

AWS Resources

Official Documentation:

AWS Documentation - Complete AWS service documentation
AWS Well-Architected Framework - Best practices for building on AWS
AWS Troubleshooting Guides - Official troubleshooting guides
AWS Service Health Dashboard - Check AWS service status

Learning & Best Practices:

AWS Architecture Center - Reference architectures
AWS Security Best Practices - Security guidelines
AWS re:Post - AWS community Q&A
AWS Training - Free and paid training courses

Tools & Utilities:

AWS CLI Documentation - Command-line interface
AWS CloudShell - Browser-based shell
AWS Systems Manager - Operations management
AWS CloudWatch - Monitoring and observability

Kubernetes Resources

Official Documentation:

Kubernetes Documentation - Complete Kubernetes documentation
kubectl Cheat Sheet - Quick command reference
Kubernetes Troubleshooting - Official troubleshooting guide
Kubernetes API Reference - API documentation

Learning & Best Practices:

Kubernetes Best Practices - Cluster administration
Kubernetes Security Best Practices - Security guidelines
CNCF Cloud Native Trail Map - Learning path
Kubernetes.io Blog - Latest updates and tutorials

Tools & Utilities:

k9s - Terminal UI for Kubernetes
Lens - Kubernetes IDE
Helm - Package manager for Kubernetes
kubectx & kubens - Context and namespace switching

Community Resources:

Kubernetes Slack - Community chat
Stack Overflow - Kubernetes - Q&A
r/kubernetes - Reddit community
Kubernetes Forum - Discussion forum

SRE Resources

Books & Guides:

Google SRE Book - Site Reliability Engineering book
Site Reliability Engineering - SRE practices
The Site Reliability Workbook - Practical SRE guide
Building Secure & Reliable Systems - Security and reliability

Learning Resources:

SRE Foundation Course - CNCF training
SRE Weekly - Weekly newsletter
SREcon - SRE conferences
Incident Response Guide - PagerDuty's incident response guide

Tools & Platforms:

Prometheus - Monitoring and alerting
Grafana - Visualization and dashboards
Jaeger - Distributed tracing
ELK Stack - Logging and analysis

Incident Response & Runbooks

Runbook Resources:

PagerDuty Incident Response - Incident response best practices
Atlassian Incident Management - Incident management guide
GitLab Runbooks - Example runbooks
Google's SRE Runbook Template - Runbook structure

Incident Management:

Incident.io - Incident management platform
FireHydrant - Incident response platform
Statuspage - Status page management

Community & Forums

General DevOps:

DevOps Reddit - DevOps community
DevOps Stack Exchange - Q&A platform
HashiCorp Learn - Infrastructure tutorials

Cloud Native:

CNCF Resources - Cloud Native Computing Foundation
Cloud Native Landscape - CNCF project landscape
CNCF Webinars - Educational webinars

Statistics

Total Playbooks: 414
- AWS: 157 playbooks (92 reactive + 65 proactive)
- Kubernetes: 232 playbooks (176 reactive + 56 proactive)
- Sentry: 25 playbooks
Coverage: Major AWS services, Kubernetes components, and Sentry monitoring
Format: Markdown with structured sections
Language: English
Community: Open source, community-driven

License

This project is licensed under the MIT License - see the LICENSE file for details.

Maintainers

For maintainer information, see MAINTAINERS.md.

Acknowledgments

Contributors: Thank you to all contributors who help improve these playbooks
Community: The SRE community for sharing knowledge and best practices
Organizations: Companies and teams using these playbooks in production

Made with love by the SRE community for the SRE community

If you find these playbooks helpful, please consider giving us a star on GitHub!
