SRE Playbooks Repository
Comprehensive incident response playbooks for AWS, Kubernetes, and Sentry environments - Helping SREs diagnose and resolve infrastructure issues faster with systematic, step-by-step troubleshooting guides.
Table of Contents
- Overview
- Repository Structure
- Contents
- Getting Started
- Usage
- Terminology & Glossary
- Quick Reference
- Troubleshooting Guide
- Examples & Use Cases
- FAQ
- Video Tutorials
- Roadmap
- Contributing
- Connect with Us
- Support
- Related Resources
- License
Overview
This repository contains 414 comprehensive incident response playbooks designed to help Site Reliability Engineers (SREs) systematically diagnose and resolve common infrastructure and application issues in AWS, Kubernetes, and Sentry environments.
Why This Repository?
- Systematic Approach: Each playbook follows a consistent structure with clear diagnostic steps
- Time-Saving: Quickly identify root causes with correlation analysis frameworks
- Community-Driven: Continuously improved by the open-source community
- Production-Ready: Based on real-world incident response scenarios
- Comprehensive Coverage: 232 Kubernetes playbooks + 157 AWS playbooks + 25 Sentry playbooks
- Proactive Monitoring: 56 K8s + 65 AWS proactive playbooks for capacity planning and compliance
Diagnosis Improvements
All playbooks use an events-first approach for root cause analysis:
- Diagnosis sections prioritize checking recent events and changes before diving into configuration details
- Conditional logic patterns help narrow down causes based on observed symptoms
- Time-based correlation analysis connects events to failures systematically
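The time-based correlation described above can be illustrated with a small sketch. It is not code from this repository: the `Event` record, the `correlate` function, the sample events, and the 30-minute default window are all assumptions chosen for illustration. It lists the events that fall inside a look-back window before a failure, newest first, which is the order an events-first diagnosis would inspect them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """A change or cluster event with a timestamp (hypothetical record type)."""
    timestamp: datetime
    description: str


def correlate(events, failure_time, window=timedelta(minutes=30)):
    """Return events in [failure_time - window, failure_time], newest first."""
    start = failure_time - window
    candidates = [e for e in events if start <= e.timestamp <= failure_time]
    return sorted(candidates, key=lambda e: e.timestamp, reverse=True)


# Made-up sample data for illustration only.
failure = datetime(2024, 1, 1, 12, 0)
events = [
    Event(datetime(2024, 1, 1, 9, 0), "node pool resized"),
    Event(datetime(2024, 1, 1, 11, 40), "ConfigMap updated"),
    Event(datetime(2024, 1, 1, 11, 55), "Deployment rolled out"),
    Event(datetime(2024, 1, 1, 12, 5), "alert fired"),
]

for event in correlate(events, failure):
    print(event.timestamp.isoformat(), event.description)
```

Events after the failure time are excluded on purpose, since they cannot have caused it.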
Use Cases
- During Incidents: Quick reference for troubleshooting common issues
- On-Call Rotation: Essential runbook collection for on-call engineers
- Knowledge Sharing: Standardize troubleshooting procedures across teams
- Training: Learn systematic incident response methodologies
- Documentation: Build your own runbook library
Repository Structure
scoutflo-SRE-Playbooks/
├── AWS Playbooks/ # 157 AWS playbooks
│ ├── 01-Compute/ # 27 playbooks (EC2, Lambda, ECS, EKS)
│ ├── 02-Database/ # 8 playbooks (RDS, DynamoDB)
│ ├── 03-Storage/ # 7 playbooks (S3)
│ ├── 04-Networking/ # 17 playbooks (VPC, ELB, Route53)
│ ├── 05-Security/ # 16 playbooks (IAM, KMS, GuardDuty)
│ ├── 06-Monitoring/ # 8 playbooks (CloudTrail, CloudWatch)
│ ├── 07-CI-CD/ # 9 playbooks (CodePipeline, CodeBuild)
│ ├── 08-Proactive/ # 65 proactive monitoring playbooks
│ └── README.md
├── K8s Playbooks/ # 232 Kubernetes playbooks
│ ├── 01-Control-Plane/ # 24 playbooks
│ ├── 02-Nodes/ # 24 playbooks
│ ├── 03-Pods/ # 41 playbooks
│ ├── 04-Workloads/ # 25 playbooks
│ ├── 05-Networking/ # 27 playbooks
│ ├── 06-Storage/ # 9 playbooks
│ ├── 07-RBAC/ # 6 playbooks
│ ├── 08-Configuration/ # 6 playbooks
│ ├── 09-Resource-Management/ # 8 playbooks
│ ├── 10-Monitoring-Autoscaling/ # 3 playbooks
│ ├── 11-Installation-Setup/ # 1 playbook
│ ├── 12-Namespaces/ # 2 playbooks
│ ├── 13-Proactive/ # 56 proactive monitoring playbooks
│ └── README.md
├── Sentry Playbooks/ # 25 Sentry playbooks
│ ├── 01-Error-Tracking/ # 19 playbooks
│ ├── 02-Performance/ # 6 playbooks
│ ├── 03-Release-Health/ # Placeholder
│ └── README.md
├── CONTRIBUTING.md
└── README.md
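The per-folder counts in the tree above can be checked against a local clone. The following is a minimal sketch, assuming that every playbook is a single `.md` file inside a category folder and that `README.md` files are not playbooks; `count_playbooks` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def count_playbooks(collection_dir):
    """Count Markdown files per category folder, excluding README.md.

    Assumption: each playbook is one .md file stored in, or below,
    a category folder such as "03-Pods/".
    """
    counts = {}
    for category in sorted(Path(collection_dir).iterdir()):
        if not category.is_dir():
            continue
        counts[category.name] = sum(
            1 for f in category.rglob("*.md") if f.name.lower() != "readme.md"
        )
    return counts


if __name__ == "__main__":
    root = Path("K8s Playbooks")  # run from the repository root
    if root.is_dir():
        per_category = count_playbooks(root)
        for name, n in per_category.items():
            print(f"{name}: {n}")
        print("total:", sum(per_category.values()))
```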
Contents
AWS Playbooks (AWS Playbooks/)
157 playbooks covering 7 service categories + proactive monitoring:
- Compute Services (27 playbooks): EC2, Lambda, ECS, EKS
- Database (8 playbooks): RDS, DynamoDB
- Storage (7 playbooks): S3
- Networking (17 playbooks): VPC, ELB, Route 53, NAT Gateway
- Security (16 playbooks): IAM, KMS, GuardDuty, CloudTrail
- Monitoring (8 playbooks): CloudTrail, CloudWatch
- CI/CD (9 playbooks): CodePipeline, CodeBuild
- Proactive (65 playbooks): Capacity planning, compliance, cost optimization
Key Topics:
- Connection timeouts and network issues
- Access denied and permission problems
- Resource unavailability and capacity issues
- Security breaches and threat detection
- Service integration failures
- Proactive capacity and compliance monitoring
See AWS Playbooks/README.md for complete documentation and playbook list.
Kubernetes Playbooks (K8s Playbooks/)
232 playbooks organized into 13 categorized folders covering Kubernetes cluster and workload issues:
Folder Structure:
- 01-Control-Plane/ (24 playbooks) - API Server, Scheduler, Controller Manager, etcd
- 02-Nodes/ (24 playbooks) - Node readiness, kubelet issues, resource constraints
- 03-Pods/ (41 playbooks) - Scheduling, lifecycle, health checks, resource limits
- 04-Workloads/ (25 playbooks) - Deployments, StatefulSets, DaemonSets, Jobs, HPA
- 05-Networking/ (27 playbooks) - Services, Ingress, DNS, Network Policies, kube-proxy
- 06-Storage/ (9 playbooks) - PersistentVolumes, PersistentVolumeClaims, StorageClasses
- 07-RBAC/ (6 playbooks) - ServiceAccounts, Roles, RoleBindings, authorization
- 08-Configuration/ (6 playbooks) - ConfigMaps and Secrets access issues
- 09-Resource-Management/ (8 playbooks) - Resource Quotas, overcommit, compute resources
- 10-Monitoring-Autoscaling/ (3 playbooks) - Metrics Server, Cluster Autoscaler
- 11-Installation-Setup/ (1 playbook) - Helm and installation issues
- 12-Namespaces/ (2 playbooks) - Namespace management issues
- 13-Proactive/ (56 playbooks) - Proactive monitoring, capacity planning, compliance
Key Topics:
- Pod lifecycle issues (CrashLoopBackOff, Pending, Terminating)
- Control plane component failures
- Network connectivity and DNS resolution
- Storage and volume mounting problems
- RBAC and permission errors
- Resource quota and capacity constraints
- Proactive capacity and compliance monitoring
See K8s Playbooks/README.md for complete documentation and playbook list.
Sentry Playbooks (Sentry Playbooks/)
25 playbooks covering error tracking and performance monitoring:
Folder Structure:
- 01-Error-Tracking/ (19 playbooks) - Error capture, grouping, alerting, and debugging
- 02-Performance/ (6 playbooks) - Transaction monitoring, performance issues, tracing
- 03-Release-Health/ - Release tracking and health monitoring (placeholder)
Key Topics:
- Error capture and reporting issues
- Issue grouping and deduplication
- Alert configuration and routing
- Performance transaction monitoring
- SDK integration troubleshooting
- Release health tracking
See Sentry Playbooks/README.md for complete documentation and playbook list.
Getting Started
Prerequisites
- Basic knowledge of AWS services, Kubernetes, or Sentry
- Access to AWS Console, Kubernetes cluster, or Sentry dashboard (for using playbooks)
- Git (for cloning the repository)
Installation
Option 1: Clone the Repository
# Clone the repository
git clone https://github.com/Scoutflo/scoutflo-SRE-Playbooks.git
# Navigate to the repository
cd scoutflo-SRE-Playbooks
# View available playbooks
ls AWS\ Playbooks/
ls K8s\ Playbooks/
ls Sentry\ Playbooks/
Option 2: Use as Git Submodule
Include playbooks in your own projects:
git submodule add https://github.com/Scoutflo/scoutflo-SRE-Playbooks.git playbooks
Option 3: Download Specific Playbooks
Browse and download individual playbooks directly from the GitHub web interface.
Quick Start
- Identify Your Issue: Determine if it's an AWS, Kubernetes, or Sentry issue
- Navigate to Playbooks:
- AWS issues -> AWS Playbooks/
- K8s issues -> K8s Playbooks/[category-folder]/
- Sentry issues -> Sentry Playbooks/[category-folder]/
- Find the Playbook: Match your symptoms to a playbook title
- Follow the Steps: Execute diagnostic steps in order
- Use Diagnosis Section: Apply correlation analysis for root cause identification
Learn More
- Watch Tutorials: Check our YouTube channel for video walkthroughs and best practices
- AI SRE Demo: Watch the Scoutflo AI SRE Demo to see AI-powered incident response
- Scoutflo Documentation: Visit Scoutflo Documentation for platform guides
- Join the Community: Connect with other SREs in our Slack workspace
Example Usage
Scenario: EC2 instance SSH connection timeout
- Navigate to AWS Playbooks/
- Open Connection-Timeout-SSH-Issues-EC2.md
- Follow the Playbook steps, replacing <instance-id> with your actual instance ID
- Use the Diagnosis section to correlate events with failures
- Apply the identified fix
Usage
How Playbooks Work
Important: These playbooks are designed for AI agents using natural language processing (NLP). They use natural language instructions that AI agents interpret and execute using available tools (like AWS MCP tools, Kubernetes MCP tools, or kubectl).
Example Playbook Step:
- Natural Language: "Retrieve logs from pod <pod-name> in namespace <namespace> and analyze error messages"
- AI Agent Action: Interprets this instruction and uses appropriate tools to fetch and analyze pod logs
For Manual Use:
- While playbooks are optimized for AI agents, you can also use them manually
- The README files in each category folder include equivalent kubectl/AWS CLI commands for manual verification
- Replace placeholders with actual resource identifiers when following steps manually
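When following steps manually, placeholder replacement can be scripted so that no placeholder is left unfilled. The sketch below is illustrative only: `fill_placeholders` is a hypothetical helper, the placeholder pattern (lowercase names in angle brackets) is inferred from the examples in this README, and the pod and namespace values are made up.

```python
import re

# Assumed placeholder syntax: <lowercase-name-with-hyphens>
PLACEHOLDER = re.compile(r"<([a-z][a-z0-9-]*)>")


def fill_placeholders(step, values):
    """Replace <name> placeholders in a step; raise if any value is missing."""
    missing = sorted({name for name in PLACEHOLDER.findall(step) if name not in values})
    if missing:
        raise KeyError(f"no value for placeholder(s): {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], step)


step = "Retrieve logs from pod <pod-name> in namespace <namespace> and analyze error messages"
print(fill_placeholders(step, {"pod-name": "web-7d4b9", "namespace": "prod"}))
```

Raising on a missing value, rather than leaving the placeholder in place, avoids running a command against a literal `<pod-name>`.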
Playbook Structure
All playbooks follow a consistent structure:
- Title - Clear, descriptive issue identification
- Meaning - What the issue means, triggers, symptoms, root causes
- Impact - Business and technical implications
- Playbook - 8-10 numbered diagnostic steps in natural language (ordered from common to specific)
- Diagnosis - Correlation analysis framework with time windows using events-first approach and conditional logic patterns
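A structure check like the one below could be used to confirm that a playbook contains the sections listed above in the stated order. It is a sketch under assumptions: it presumes the title is a level-1 Markdown heading and each section is a lower-level heading whose text is exactly the section name, which this README does not confirm. `check_structure` and the sample text are hypothetical.

```python
import re

REQUIRED_SECTIONS = ["Meaning", "Impact", "Playbook", "Diagnosis"]


def check_structure(markdown_text):
    """Return a list of problems found; an empty list means the layout matches.

    Assumption: the title is a level-1 heading and each section is a
    heading of level 2 or deeper whose text is exactly the section name.
    """
    problems = []
    headings = re.findall(
        r"^(#{1,6})[ \t]+(.+?)[ \t]*$", markdown_text, flags=re.MULTILINE
    )
    if not headings or headings[0][0] != "#":
        problems.append("missing title heading")
    names = [text for marks, text in headings if len(marks) >= 2]
    positions = []
    for section in REQUIRED_SECTIONS:
        if section in names:
            positions.append(names.index(section))
        else:
            problems.append(f"missing section: {section}")
    if positions != sorted(positions):
        problems.append("sections out of order")
    return problems


# Made-up playbook text that lacks the Playbook section.
sample = """# Example-Title
## Meaning
...
## Impact
...
## Diagnosis
...
"""
print(check_structure(sample))
```

The check does not skip fenced code, so a shell comment starting with `#` inside a code block would be read as a heading; a stricter version would need a real Markdown parser.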
Best Practices
- For AI Agents: Playbooks are optimized for AI interpretation - use natural language instructions
- For Manual Use: See category README files for equivalent kubectl/AWS CLI commands
- Replace Placeholders: All playbooks use placeholders (e.g., <instance-id>, <pod-name>) that must be replaced with actual values
- Follow Order: Execute steps sequentially unless you have strong evidence pointing to a specific step
- Correlate Timestamps: Use the Diagnosis section to correlate events with failures
- Extend Windows: If initial correlations don't reveal causes, extend time windows as suggested
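The "Extend Windows" practice can be sketched as a loop that widens the look-back window until at least one candidate event appears. The window sizes, the function name, and the sample timestamps below are assumptions for illustration; each playbook's Diagnosis section suggests its own windows.

```python
from datetime import datetime, timedelta

# Hypothetical window sizes, smallest first.
WINDOWS = [timedelta(minutes=30), timedelta(hours=2), timedelta(hours=24)]


def first_window_with_events(event_times, failure_time, windows=WINDOWS):
    """Widen the look-back window until at least one event falls inside it.

    Returns (window, matching_times), or (None, []) if no window matches.
    """
    for window in windows:
        start = failure_time - window
        hits = [t for t in event_times if start <= t <= failure_time]
        if hits:
            return window, sorted(hits, reverse=True)
    return None, []


# Made-up sample data for illustration only.
failure = datetime(2024, 1, 1, 12, 0)
changes = [datetime(2024, 1, 1, 10, 30), datetime(2023, 12, 30, 8, 0)]
window, hits = first_window_with_events(changes, failure)
print(window, hits)
```

Starting with the narrowest window keeps the candidate list short; a wider window is consulted only when the narrow one finds nothing.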
Placeholder Reference
AWS Playbooks:
<instance-id>, <bucket-name>, <region>, <function-name>, <role-name>, <user-name>, <security-group-id>, <vpc-id>, <rds-instance-id>, <load-balancer-name>
Kubernetes Playbooks:
<pod-name>, <namespace>, <deployment-name>, <node-name>, <service-name>, <ingress-name>, <pvc-name>, <configmap-name>, <secret-name>
Sentry Playbooks:
<project-slug>, <organization-slug>, <issue-id>, <transaction-name>, <release-version>, <environment>
Terminology & Glossary
Understanding the terms used in these playbooks will help you use them more effectively. For detailed glossaries of AWS and Kubernetes terms, see AWS Playbooks/README.md and K8s Playbooks/README.md.
Quick Reference
SRE (Site Reliability Engineering)
- A discipline combining software engineering and operations to build reliable systems.
Playbook / Runbook
- A step-by-step guide for diagnosing and resolving specific issues.
Incident
- An event that disrupts or degrades a service, requiring immediate attention.
On-Call
- Engineers available to respond to incidents outside normal business hours.
MTTR (Mean Time To Recovery)
- Average time to restore a service after an incident. Playbooks help reduce MTTR.
Correlation Analysis
- Finding relationships between events (like configuration changes) and symptoms (like service failures) by comparing timestamps.
Root Cause
- The underlying reason why an issue occurred, as opposed to just the symptoms.
Placeholder
- A value in playbooks (like <instance-id>) that you replace with your actual resource identifier.
Diagnosis Section
- Part of each playbook that helps you correlate events with failures using time-based analysis.
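As a worked example of the MTTR definition above, the mean is the total recovery time divided by the number of incidents. The function name and the incident timestamps below are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average (resolved - started) over a list of (started, resolved) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((resolved - started for started, resolved in incidents), timedelta())
    return total / len(incidents)


incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),  # 30 minutes
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 30)),  # 90 minutes
]
print(mean_time_to_recovery(incidents))  # (30 + 90) / 2 = 60 minutes
```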
Common Abbreviations
- K8s: Kubernetes (K + 8 letters + s)
- SRE: Site Reliability Engineering
- MTTR: Mean Time To Recovery
- API: Application Programming Interface
- DNS: Domain Name System
- RBAC: Role-Based Access Control
- PVC: PersistentVolumeClaim
- HPA: Horizontal Pod Autoscaler
For detailed explanations of AWS and Kubernetes terms, see the respective README files above.
Quick Reference
Need a quick cheat sheet? Check out our Quick Reference Card for:
- One-page overview
- Common commands
- Quick lookup tables
- Essential links
Troubleshooting Guide
Not sure which playbook to use? Use our Troubleshooting Decision Tree to:
- Quickly identify the right playbook
- Navigate by issue type
- Look up by error message or alert name
Examples & Use Cases
See real-world scenarios in EXAMPLES.md:
- Step-by-step examples
- Common workflows
- Success stories
- Best practices
FAQ
Have questions? Check our FAQ for answers to:
- General questions
- Usage questions
- Technical questions
- Contributing questions
Video Tutorials
Learn how to use these playbooks effectively:
- YouTube Channel: @scoutflo6727 - Subscribe for tutorials and walkthroughs
- AI SRE Demo: Watch Demo Video - See Scoutflo AI SRE in action
- Tutorials: Step-by-step video guides on using playbooks
- Best Practices: Learn SRE incident response best practices
Coming Soon: Video tutorials for:
- How to use playbooks effectively
- Common troubleshooting scenarios
- Contributing to playbooks
- Advanced correlation analysis
Roadmap
Check out our ROADMAP.md to see:
- Planned features and new playbook categories
- Short-term and long-term goals
- How to suggest new features
- Release history
Contributing
We welcome contributions from the community! Your contributions help make these playbooks better for everyone. See our Contributors page to see who has helped build this project.
First-time contributor? Start with our Getting Started Guide for a quick onboarding experience, then look for issues labeled good first issue.
How to Contribute
1. Reporting Issues
Found a bug, unclear instruction, or have a suggestion?
- Check Existing Issues: Search GitHub Issues first
- Create a New Issue:
- Use a clear, descriptive title
- Describe the problem or suggestion
- Include the relevant service/component, error messages, or examples
- Tag with appropriate labels (aws-playbook, k8s-playbook, sentry-playbook, bug, enhancement, etc.)
2. Improving Existing Playbooks
To fix or enhance existing playbooks:
- Fork the Repository: Create your own fork
- Create a Branch:
git checkout -b fix/playbook-name-improvement
- Make Your Changes:
- Follow the established playbook structure
- Maintain consistency with existing formatting
- Update placeholders and examples as needed
- Test Your Changes: Verify the playbook is accurate and helpful
- Commit and Push:
git add .
git commit -m "Fix: Improve [playbook-name] with [description]"
git push origin fix/playbook-name-improvement
- Create a Pull Request:
- Provide a clear description of changes
- Reference any related issues
- Request review from maintainers
3. Adding New Playbooks
To add a new playbook for an uncovered issue:
- Check for Duplicates: Ensure a similar playbook doesn't already exist
- Follow the Structure: Use existing playbooks as templates
- Choose the Right Location:
- AWS playbooks -> AWS Playbooks/
- K8s playbooks -> Appropriate category folder in K8s Playbooks/
- Sentry playbooks -> Appropriate category folder in Sentry Playbooks/
- Follow Naming Conventions:
- AWS: <IssueOrSymptom>-<Component>.md
- K8s: <AlertName>-<Resource>.md
- Sentry: <IssueType>-<Component>.md
- Include All Sections: Title, Meaning, Impact, Playbook (8-10 steps), Diagnosis (5 correlations)
- Update README: Add the new playbook to the appropriate README's playbook list
- Create Pull Request: Follow standard contribution process
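The AWS naming convention from the steps above can be sketched as a small helper. This is an assumption-laden illustration, not a tool from the repository: `playbook_filename` is hypothetical, and the rule that words are joined with hyphens and other punctuation is dropped is inferred from the example file name Connection-Timeout-SSH-Issues-EC2.md used earlier in this README.

```python
import re


def playbook_filename(issue, component):
    """Build "<IssueOrSymptom>-<Component>.md" from free-text parts.

    Assumption: words are joined with hyphens and other punctuation
    is dropped; original capitalization is kept.
    """
    def hyphenate(text):
        words = re.findall(r"[A-Za-z0-9]+", text)
        if not words:
            raise ValueError(f"no usable words in {text!r}")
        return "-".join(words)

    return f"{hyphenate(issue)}-{hyphenate(component)}.md"


print(playbook_filename("Connection Timeout SSH Issues", "EC2"))
```

Check existing file names in the target folder before relying on this rule, since the convention is only inferred from one example.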
Contribution Guidelines
- Follow the Structure: Maintain consistency with existing playbooks
- Use Placeholders: Replace specific values with placeholders
- Be Specific: Provide actionable, step-by-step instructions
- Include Correlation: Add time-based correlation analysis in the Diagnosis section
- Test Thoroughly: Ensure playbooks are accurate and helpful
- Document Changes: Clearly describe what you changed and why
Review Process
- All contributions require review from maintainers
- Feedback will be provided within 2-3 business days
- Address any requested changes promptly
- Once approved, your contribution will be merged
See CONTRIBUTING.md for detailed contribution guidelines.
Connect with Us
We'd love to hear from you! Here are the best ways to connect:
Community Channels
- Slack Community: Join our Slack workspace for real-time discussions
- GitHub Discussions: Start a discussion for questions and ideas
- GitHub Issues: Report bugs or request features
- LinkedIn: Follow Scoutflo on LinkedIn for updates and insights
- Twitter/X: Follow @scout_flo for latest news and announcements
Feedback & Feature Requests
Have an idea for improvement or a new playbook topic?
- GitHub Issues: Create a feature request
- Slack: Share your ideas in our #playbooks channel
Bug Reports
Found a bug or error in a playbook?
- GitHub Issues: Create a bug report
- Slack: Report in our #playbooks channel for a quick response
Scoutflo Resources
- Official Documentation: Scoutflo Documentation - Complete guide to Scoutflo platform
- Website: scoutflo.com - Learn more about Scoutflo
- AI SRE Tool: ai.scoutflo.com - AI-powered SRE assistant
- Infra Management Tool: deploy.scoutflo.com - Kubernetes deployment platform
- YouTube Channel: @scoutflo6727 - Tutorials and demos
- AI SRE Demo: Watch Demo Video - See Scoutflo AI SRE in action
- Blog: scoutflo.com/blog and blog.scoutflo.com - Latest articles and insights
- Pricing: scoutflo.com/pricing - Pricing information
Additional Resources
- Roadmap: Check out our project roadmap to see what's coming
- Documentation: Visit our wiki for detailed guides
- Legal: Privacy Policy | Terms of Service
Support
Need help? Check out our Support Guide or:
- Questions: GitHub Discussions
- Bugs: Report an issue
- Features: Request a feature
- Security: See SECURITY.md
Related Resources
AWS Resources
Official Documentation:
- AWS Documentation - Complete AWS service documentation
- AWS Well-Architected Framework - Best practices for building on AWS
- AWS Troubleshooting Guides - Official troubleshooting guides
- AWS Service Health Dashboard - Check AWS service status
Learning & Best Practices:
- AWS Architecture Center - Reference architectures
- AWS Security Best Practices - Security guidelines
- AWS re:Post - AWS community Q&A
- AWS Training - Free and paid training courses
Tools & Utilities:
- AWS CLI Documentation - Command-line interface
- AWS CloudShell - Browser-based shell
- AWS Systems Manager - Operations management
- AWS CloudWatch - Monitoring and observability
Kubernetes Resources
Official Documentation:
- Kubernetes Documentation - Complete Kubernetes documentation
- kubectl Cheat Sheet - Quick command reference
- Kubernetes Troubleshooting - Official troubleshooting guide
- Kubernetes API Reference - API documentation
Learning & Best Practices:
- Kubernetes Best Practices - Cluster administration
- Kubernetes Security Best Practices - Security guidelines
- CNCF Cloud Native Trail Map - Learning path
- Kubernetes.io Blog - Latest updates and tutorials
Tools & Utilities:
- k9s - Terminal UI for Kubernetes
- Lens - Kubernetes IDE
- Helm - Package manager for Kubernetes
- kubectx & kubens - Context and namespace switching
Community Resources:
- Kubernetes Slack - Community chat
- Stack Overflow - Kubernetes - Q&A
- r/kubernetes - Reddit community
- Kubernetes Forum - Discussion forum
SRE Resources
Books & Guides:
- Google SRE Book - Site Reliability Engineering book
- Site Reliability Engineering - SRE practices
- The Site Reliability Workbook - Practical SRE guide
- Building Secure & Reliable Systems - Security and reliability
Learning Resources:
- SRE Foundation Course - CNCF training
- SRE Weekly - Weekly newsletter
- SREcon - SRE conferences
- Incident Response Guide - PagerDuty's incident response guide
Tools & Platforms:
- Prometheus - Monitoring and alerting
- Grafana - Visualization and dashboards
- Jaeger - Distributed tracing
- ELK Stack - Logging and analysis
Incident Response & Runbooks
Runbook Resources:
- PagerDuty Incident Response - Incident response best practices
- Atlassian Incident Management - Incident management guide
- GitLab Runbooks - Example runbooks
- Google's SRE Runbook Template - Runbook structure
Incident Management:
- Incident.io - Incident management platform
- FireHydrant - Incident response platform
- Statuspage - Status page management
Community & Forums
General DevOps:
- DevOps Reddit - DevOps community
- DevOps Stack Exchange - Q&A platform
- HashiCorp Learn - Infrastructure tutorials
Cloud Native:
- CNCF Resources - Cloud Native Computing Foundation
- Cloud Native Landscape - CNCF project landscape
- CNCF Webinars - Educational webinars
Statistics
- Total Playbooks: 414
- AWS: 157 playbooks (92 reactive + 65 proactive)
- Kubernetes: 232 playbooks (176 reactive + 56 proactive)
- Sentry: 25 playbooks
- Coverage: Major AWS services, Kubernetes components, and Sentry monitoring
- Format: Markdown with structured sections
- Language: English
- Community: Open source, community-driven
License
This project is licensed under the MIT License - see the LICENSE file for details.
Maintainers
For maintainer information, see MAINTAINERS.md.
Acknowledgments
- Contributors: Thank you to all contributors who help improve these playbooks
- Community: The SRE community for sharing knowledge and best practices
- Organizations: Companies and teams using these playbooks in production
Made with love by the SRE community for the SRE community
If you find these playbooks helpful, please consider giving us a star on GitHub!