StatefulSet Backup Operator
⚠️ Work in Progress - Version 0.0.7. This operator is under active development. APIs may change, and some features are still being implemented.
A Kubernetes operator for automated backup and restore of StatefulSets using native VolumeSnapshot APIs. Features scheduled snapshots, retention policies, pre/post hooks, and point-in-time recovery with a simple declarative interface.
🎯 Features
- ✅ Automated Snapshots - Schedule backups using cron expressions or trigger them on-demand
- ✅ Coordinated Backups - Create consistent snapshots across all replicas of a StatefulSet
- ✅ Pre/Post Backup Hooks - Execute commands inside pods before and after snapshots (e.g., database flush operations)
- ✅ Retention Management - Automatically clean up old snapshots based on configurable retention policies (per-replica)
- ✅ Point-in-Time Recovery - Restore StatefulSets to any previous snapshot with a single command
- ✅ Native Kubernetes Integration - Uses standard VolumeSnapshot APIs (CSI) for broad storage provider compatibility
- ✅ Namespace Isolation - Proper namespace scoping for multi-tenant environments
- ✅ Comprehensive Test Suite - 26 unit tests with 42.5% code coverage, fully CI-compatible without external dependencies
🚀 Why Not Velero?
Velero is excellent for full-cluster disaster recovery, but if you just need:
- Fast, automated backups for your StatefulSets
- Point-in-time recovery without external storage
- Minimal operational overhead
- Cost-effective snapshot-based backups
...then this operator is a better fit. Think of it as "the right tool for the right job" - lightweight, focused, and cloud-native.
Comparison with Velero
| Feature | StatefulSet Backup Operator | Velero |
|---|---|---|
| Setup time | ~2 minutes | 15-30 minutes |
| Dependencies | None (CSI driver only) | Object storage (S3, GCS) + CLI |
| Backup speed | Seconds (snapshots) | Minutes (full copy) |
| Storage cost | Incremental snapshots | Full backups on S3 |
| StatefulSet hooks | ✅ Built-in | |
| Cross-cluster DR | ❌ (roadmap) | ✅ |
| Per-replica restore | ✅ | |
| GitOps friendly | ✅ 100% CRD-based | |
📋 Prerequisites
- Kubernetes 1.20+
- CSI driver with snapshot support (CSI Snapshot v1 API)
- VolumeSnapshotClass configured in your cluster
- Kubectl access to the cluster
Required CRDs
The operator requires the following VolumeSnapshot CRDs to be installed:
- `volumesnapshotclasses.snapshot.storage.k8s.io`
- `volumesnapshots.snapshot.storage.k8s.io`
- `volumesnapshotcontents.snapshot.storage.k8s.io`
Verify the VolumeSnapshot API is available:

```bash
kubectl api-resources | grep volumesnapshot
```

Install the VolumeSnapshot CRDs (if not present):

```bash
# Install the VolumeSnapshot CRDs (the CSI snapshot controller must also be deployed in your cluster)
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml
```

Tested Environments
- ✅ Minikube (with CSI hostpath driver)
- ✅ Kind (with CSI snapshot support)
- 🔄 GKE, EKS, AKS (testing in progress)
CSI Driver Compatibility
The operator works with any CSI driver that supports VolumeSnapshot v1 API:
- AWS EBS CSI Driver
- GCE Persistent Disk CSI Driver
- Azure Disk CSI Driver
- Ceph CSI
- Longhorn
- OpenEBS
- Portworx
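If your cluster does not yet have a VolumeSnapshotClass, a minimal manifest looks like the following sketch. The `hostpath.csi.k8s.io` driver matches the Minikube CSI hostpath setup listed above; substitute your own CSI driver name:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-hostpath-snapclass
  annotations:
    # Optional: mark this class as the cluster default so backups
    # can omit volumeSnapshotClassName
    snapshot.storage.kubernetes.io/is-default-class: "true"
driver: hostpath.csi.k8s.io   # replace with your CSI driver
deletionPolicy: Delete
```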
🛠️ Installation
Option 1: Build from Source
```bash
# Clone the repository
git clone https://github.com/federicolepera/statefulset-backup-operator.git
cd statefulset-backup-operator

# Build the Docker image
make docker-build IMG=<your-registry>/statefulset-backup-operator:v0.0.7

# Push to your registry
make docker-push IMG=<your-registry>/statefulset-backup-operator:v0.0.7

# Install CRDs
make install

# Deploy the operator
make deploy IMG=<your-registry>/statefulset-backup-operator:v0.0.7
```
Option 2: Install CRDs and Deploy Manually
```bash
# Install CRDs
kubectl apply -f config/crd/bases/

# Deploy operator (update image in config/manager/manager.yaml first)
kubectl apply -f config/rbac/
kubectl apply -f config/manager/
```
Option 3: Helm Chart (Recommended)
```bash
# Install from local chart
helm install statefulset-backup-operator ./charts/statefulset-backup-operator \
  -n statefulset-backup-system --create-namespace

# With custom image
helm install statefulset-backup-operator ./charts/statefulset-backup-operator \
  --set image.repository=myregistry/statefulset-backup-operator \
  --set image.tag=v0.0.7 \
  -n statefulset-backup-system --create-namespace

# Upgrade existing installation
helm upgrade statefulset-backup-operator ./charts/statefulset-backup-operator \
  -n statefulset-backup-system
```
Helm Values:
| Parameter | Description | Default |
|---|---|---|
| `image.repository` | Image repository | `ghcr.io/federicolepera/statefulset-backup-operator` |
| `image.tag` | Image tag | `appVersion` from Chart.yaml |
| `image.pullPolicy` | Image pull policy | `IfNotPresent` |
| `replicaCount` | Number of replicas | `1` |
| `resources.limits.cpu` | CPU limit | `500m` |
| `resources.limits.memory` | Memory limit | `128Mi` |
| `resources.requests.cpu` | CPU request | `10m` |
| `resources.requests.memory` | Memory request | `64Mi` |
| `serviceAccount.create` | Create ServiceAccount | `true` |
| `leaderElection.enabled` | Enable leader election | `true` |
| `metrics.enabled` | Enable metrics endpoint | `true` |
| `metrics.port` | Metrics port | `8080` |
| `health.port` | Health probe port | `8081` |
The Helm chart automatically installs:
- CRDs for StatefulSetBackup and StatefulSetRestore
- ClusterRole with all required permissions
- ClusterRoleBinding
- ServiceAccount
- Deployment with health probes and security context
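To confirm the chart installed everything, you can check for the operator's CRDs and Deployment. The grep pattern below is an assumption based on the `backup.sts-backup.io` API group used in the usage examples:

```bash
# Verify the operator CRDs exist (assumes the backup.sts-backup.io group)
kubectl get crd | grep sts-backup.io

# Verify the operator Deployment is up
kubectl get deployment -n statefulset-backup-system
```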
📖 Usage
Basic Backup
Create a simple backup that runs once:
```yaml
apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetBackup
metadata:
  name: my-database-backup
  namespace: default
spec:
  statefulSetRef:
    name: postgresql
    namespace: default
  retentionPolicy:
    keepLast: 3  # Keep last 3 snapshots per PVC
  volumeSnapshotClassName: csi-hostpath-snapclass  # Optional: omit to use cluster default
```
Note on VolumeSnapshotClass: The volumeSnapshotClassName field is optional. If not specified, the operator will automatically use the default VolumeSnapshotClass in your cluster (the one with annotation snapshot.storage.kubernetes.io/is-default-class: "true"). This makes the operator work out-of-the-box on most Kubernetes clusters without additional configuration.
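To see which VolumeSnapshotClass is currently the default, or to mark one as default, standard kubectl commands suffice (replace `<class-name>` with your class):

```bash
# List VolumeSnapshotClasses; the default carries the is-default-class annotation
kubectl get volumesnapshotclass
kubectl describe volumesnapshotclass <class-name> | grep is-default-class

# Mark a class as the cluster default
kubectl annotate volumesnapshotclass <class-name> \
  snapshot.storage.kubernetes.io/is-default-class="true"
```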
You can also omit the volumeSnapshotClassName field entirely:
```yaml
apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetBackup
metadata:
  name: my-database-backup-auto
  namespace: default
spec:
  statefulSetRef:
    name: postgresql
    namespace: default
  retentionPolicy:
    keepLast: 3  # Keep last 3 snapshots per PVC
  # volumeSnapshotClassName omitted - will use cluster default
```
Scheduled Backup with Hooks
Create automated backups with pre/post hooks:
```yaml
apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetBackup
metadata:
  name: postgres-scheduled-backup
  namespace: production
spec:
  statefulSetRef:
    name: postgresql
    namespace: production
  schedule: "0 2 * * *"  # Every day at 2 AM (standard cron format)
  retentionPolicy:
    keepDays: 30  # Keep backups for 30 days
  volumeSnapshotClassName: csi-hostpath-snapclass  # Optional
  preBackupHook:
    containerName: postgres  # Optional: specify container (defaults to first container)
    timeoutSeconds: 120      # Optional: timeout in seconds (default: 60)
    command:
      - "psql"
      - "-U"
      - "postgres"
      - "-c"
      - "CHECKPOINT"
  postBackupHook:
    command:
      - "echo"
      - "Backup completed"
```
Hook Behavior:
- Hooks execute sequentially on each pod (pod-0, then pod-1, etc.)
- If a hook fails on any pod, the entire backup fails
- Hooks execute in the first container unless `containerName` is specified
- Default timeout is 60 seconds; configure it with the `timeoutSeconds` field
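For intuition, the pre-backup hook in the example above behaves roughly like running the command by hand against each replica in order; the pod and container names here assume the PostgreSQL StatefulSet shown:

```bash
# Approximate manual equivalent of the pre-backup hook for replica 0
kubectl exec -n production postgresql-0 -c postgres -- \
  psql -U postgres -c CHECKPOINT
```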
Restore from Backup
Restore a StatefulSet to a previous snapshot:
```yaml
apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetRestore
metadata:
  name: restore-postgres
  namespace: production
spec:
  statefulSetRef:
    name: postgresql
    namespace: production
  backupName: postgres-scheduled-backup
  scaleDown: true  # Recommended: scales StatefulSet to 0 before restore
  pvcDeletionTimeoutSeconds: 120  # Optional: timeout for PVC deletion (default: 60)
```
Restore Phases: The restore process follows a multi-phase workflow:
- New: Initial state, saves original replica count
- ScalingDown: Reduces StatefulSet replicas to 0 (if `scaleDown: true`)
- Restoring: Deletes existing PVCs and recreates them from snapshots
- ScalingUp: Restores StatefulSet to original replica count
- Completed/Failed: Final state
Selective Snapshot Restore
Restore specific snapshots (useful for partial recovery):
```yaml
apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetRestore
metadata:
  name: restore-selective
  namespace: production
spec:
  statefulSetRef:
    name: postgresql
    namespace: production
  snapshotNames:
    - postgres-backup-data-20260112-120000-0  # Restore only replica-0
    - postgres-backup-data-20260112-120000-1  # Restore only replica-1
  scaleDown: true
```
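To discover which snapshot names to list under `snapshotNames`, you can filter VolumeSnapshots by the labels the operator applies (documented in the Architecture section):

```bash
# Snapshots created by a specific backup resource
kubectl get volumesnapshot -n production \
  -l backup.sts-backup.io/policy=postgres-scheduled-backup
```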
📊 Monitoring
Check Backup Status
```bash
# List all backups
kubectl get statefulsetbackup

# Detailed status
kubectl describe statefulsetbackup my-database-backup

# Check created snapshots
kubectl get volumesnapshot
```
Status Fields:
- Phase: Current state - `Ready`, `InProgress`, or `Failed`
- LastBackupTime: Timestamp of the most recent successful backup
- Conditions: Standard Kubernetes conditions for state tracking
- Message: Human-readable status information
Example output:
```yaml
status:
  phase: Ready
  lastBackupTime: "2026-01-12T14:30:00Z"
  conditions:
    - type: Ready
      status: "True"
      lastTransitionTime: "2026-01-12T14:30:05Z"
```
Check Restore Status
```bash
# List all restores
kubectl get statefulsetrestore

# Watch restore progress
kubectl get statefulsetrestore restore-postgres -w

# Detailed restore status
kubectl describe statefulsetrestore restore-postgres
```
Restore Status Fields:
- Phase: Current phase - `New`, `ScalingDown`, `Restoring`, `ScalingUp`, `Completed`, or `Failed`
- OriginalReplicas: Original replica count saved before scaling down
- RestoredSnapshots: Map of PVC names to restored snapshot names
- StartTime/CompletionTime: Timing information for the restore operation
Example output:
```yaml
status:
  phase: Completed
  originalReplicas: 3
  restoredSnapshots:
    data-postgresql-0: postgres-backup-data-20260112-120000-0
    data-postgresql-1: postgres-backup-data-20260112-120000-1
    data-postgresql-2: postgres-backup-data-20260112-120000-2
  startTime: "2026-01-12T15:00:00Z"
  completionTime: "2026-01-12T15:05:30Z"
```
View Operator Logs
```bash
# Get operator pod
kubectl get pods -n statefulset-backup-operator-system

# View logs
kubectl logs -n statefulset-backup-operator-system <operator-pod-name> -f
```
🏗️ Architecture
Snapshot Naming and Labels
VolumeSnapshot Naming Format:
`{backup-name}-{volume-claim-template-name}-{timestamp}-{replica-index}`

Example: `postgres-backup-data-20260112-143000-0`
Labels Applied to Snapshots:
- `backup.sts-backup.io/statefulset`: Name of the source StatefulSet
- `backup.sts-backup.io/policy`: Name of the backup resource
PVC Naming for StatefulSets:
`{volume-claim-template-name}-{statefulset-name}-{replica-index}`

Example: `data-postgresql-0`
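These labels make snapshots queryable with ordinary label selectors, for example:

```bash
# All snapshots taken from the postgresql StatefulSet
kubectl get volumesnapshot -l backup.sts-backup.io/statefulset=postgresql
```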
Backup Scheduling Behavior
Manual Backups (no schedule):
- Executed once when the backup resource is created
- No subsequent backups occur automatically
- Status updates to `Ready` after completion
Scheduled Backups:
- Uses standard cron format: `minute hour day-of-month month day-of-week`
- First backup executes immediately (within 10 seconds) if no `LastBackupTime` exists
- Next reconciliation is scheduled up to 1 hour in advance
- If the schedule is missed (controller downtime), the backup executes immediately upon restart
- Invalid cron expressions cause the backup to fail, with a 1-minute retry interval
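A few illustrative `schedule` values in the standard five-field format (the first three mirror the examples later in this README):

```yaml
schedule: "0 2 * * *"      # daily at 02:00
# schedule: "0 */6 * * *"  # every 6 hours
# schedule: "*/30 * * * *" # every 30 minutes
# schedule: "0 3 * * 0"    # weekly, Sundays at 03:00
```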
Backup Flow
- Reconcile Loop checks if it's time for a backup (based on schedule or manual trigger)
- Pre-Backup Hook executes in all StatefulSet pods (if configured)
- Snapshot Creation creates VolumeSnapshots for each PVC
- Post-Backup Hook executes in all StatefulSet pods (if configured)
- Retention Policy removes old snapshots (keeping N most recent per PVC)
- Status Update updates the StatefulSetBackup status with results
Restore Flow
- New Restore validates the restore request and saves original replica count
- Scale Down scales StatefulSet to 0 replicas (if enabled)
- Find Snapshots locates snapshots to restore based on backupName
- Delete PVCs removes existing PVCs for each replica
- Recreate PVCs creates new PVCs from VolumeSnapshots
- Scale Up restores StatefulSet to original replica count
- Completion waits for all pods to be ready and marks restore as complete
Retention Policy
The operator supports two types of retention policies. Note: Currently, you can specify either keepLast OR keepDays, but not both together.
Count-Based Retention (keepLast)
Keeps the most recent N snapshots per PVC:
- With 3 replicas and `keepLast: 2`, each PVC maintains its own 2 most recent snapshots
- Total snapshots: 6 (2 per PVC)
- Old snapshots are deleted based on creation timestamp (oldest first)
```yaml
retentionPolicy:
  keepLast: 5  # Keep last 5 snapshots per PVC
```
Use cases:
- You want a fixed number of recent backups regardless of age
- Example: Keep last 7 daily backups (1 week of history)
- Example: Keep last 24 hourly backups (24 hours of history)
Time-Based Retention (keepDays)
Deletes all snapshots older than N days:
- Applies to all snapshots regardless of PVC
- Calculates age from snapshot creation timestamp
- Example: `keepDays: 14` deletes snapshots older than 2 weeks
```yaml
retentionPolicy:
  keepDays: 14  # Keep snapshots from the last 14 days
```
Use cases:
- You want to enforce a maximum retention period
- Compliance requirements (e.g., "keep backups for 30 days")
- Cost management (limit storage by time)
- Example: Daily backups with 2 week retention
- Example: Hourly backups with 1 day retention
Future Enhancement: Support for combining both policies (e.g., `keepLast: 3` with `keepDays: 7`) is planned for a future release.
🚧 Work in Progress
The following features are currently under development or planned:
Current Limitations
- ⚠️ Cross-Namespace Operations - The backup resource and the target StatefulSet must be in the same namespace
  - Cross-namespace snapshots are not supported
  - Enhancement planned: support for cross-namespace operations
- ⚠️ Restore to a Different StatefulSet - Not supported
  - Restore only works with the original source StatefulSet
  - Workaround: manually copy snapshots and recreate the PVCs (see the sketch below)
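A minimal sketch of that workaround, assuming an existing snapshot `postgres-backup-data-20260112-120000-0` and a target StatefulSet whose volume claim template is named `data`; all names, the storage class, and the size are illustrative:

```yaml
# Hypothetical PVC recreated from a VolumeSnapshot for replica 0
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgresql-clone-0  # must follow {template}-{statefulset}-{index}
  namespace: production
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: csi-hostpath-sc  # assumption: a snapshot-capable class
  resources:
    requests:
      storage: 10Gi  # must be at least the source PVC's size
  dataSource:
    name: postgres-backup-data-20260112-120000-0
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
```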
Roadmap
- Comprehensive unit test suite (v0.0.2)
- CI/CD integration with GitHub Actions (v0.0.2)
- Time-based retention policy with `keepDays` (v0.0.3)
- Configurable container selection for hooks (v0.0.3)
- Hook timeout configuration (v0.0.4)
- Configurable PVC deletion timeout for restore (v0.0.5)
- Snapshot readiness verification before retention (v0.0.6)
- Helm chart for easy installation (v0.0.7)
- Combined retention policies (both `keepLast` and `keepDays` together)
- Webhook validation for CRDs
- Backup verification and integrity checks
- Metrics and Prometheus integration
- Multi-cluster restore (cross-cluster DR)
- Support for encryption at rest
- CLI tool for backup/restore operations
- Dashboard/UI for visualization
🧪 Development
Prerequisites
- Go 1.21+
- Docker
- Kubernetes cluster (Minikube recommended for development)
- Kubebuilder 3.0+
Local Development Setup
```bash
# Install dependencies
go mod download

# Generate CRDs and code
make manifests generate

# Install CRDs into cluster
make install

# Run operator locally (outside cluster)
make run

# Or debug with VSCode
# Use the provided .vscode/launch.json configuration
```
Testing
The operator includes a comprehensive test suite with 26 unit tests covering both backup and restore controllers.
```bash
# Setup test environment (first time only)
make setup-envtest

# Run all unit tests with coverage
make test

# Run specific controller tests
go test ./internal/controller/... -v

# Run a specific test
go test ./internal/controller/... -v -run "TestStatefulSetBackupController"
```
Test Coverage
- 26 total tests implemented
- 24 tests pass successfully ✅
- 2 tests skipped (require VolumeSnapshot CRDs)
- 42.5% code coverage of the codebase
- GitHub Actions CI runs all tests automatically
For detailed test documentation, see TEST_DOCUMENTATION.md.
What's Tested
StatefulSetBackup Controller (15 tests):
- Manual and scheduled backup workflows
- Cron schedule validation and requeue logic
- Backup status management
- Resource lifecycle (creation, deletion)
- Error handling for missing StatefulSets
StatefulSetRestore Controller (11 tests):
- Restore phase workflow (ScalingDown → Restoring → ScalingUp)
- Scale down/up operations
- Snapshot search and restoration
- Parameter validation
- Completed/failed state handling
All tests are CI-compatible and run without requiring VolumeSnapshot CRDs to be installed.
Integration Testing
```bash
# Run with a test StatefulSet
kubectl apply -f config/samples/apps_v1_statefulset.yaml
kubectl apply -f config/samples/backup_v1alpha1_statefulsetbackup.yaml

# Watch operator logs
# (if running locally, check terminal output)
```
Building
```bash
# Build binary
make build

# Build and push Docker image
make docker-build docker-push IMG=<your-registry>/statefulset-backup-operator:tag
```
📚 Examples
Example 1: PostgreSQL Backup
```yaml
apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetBackup
metadata:
  name: postgres-backup
spec:
  statefulSetRef:
    name: postgres
    namespace: databases
  schedule: "0 */6 * * *"  # Every 6 hours
  retentionPolicy:
    keepDays: 7  # Keep snapshots for 7 days
  preBackupHook:
    containerName: postgres
    command: ["psql", "-U", "postgres", "-c", "CHECKPOINT"]
```
Example 2: MongoDB Backup with Replica Sync
```yaml
apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetBackup
metadata:
  name: mongodb-backup
spec:
  statefulSetRef:
    name: mongodb
    namespace: databases
  schedule: "0 3 * * *"  # Daily at 3 AM
  retentionPolicy:
    keepDays: 14  # Keep 2 weeks of backups
  preBackupHook:
    containerName: mongodb
    command:
      - "mongosh"
      - "--eval"
      - "db.fsyncLock()"
  postBackupHook:
    command:
      - "mongosh"
      - "--eval"
      - "db.fsyncUnlock()"
```
Example 3: Redis Cluster Backup
```yaml
apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetBackup
metadata:
  name: redis-backup
spec:
  statefulSetRef:
    name: redis
    namespace: cache
  schedule: "*/30 * * * *"  # Every 30 minutes
  retentionPolicy:
    keepLast: 12  # Keep last 12 snapshots per PVC (6 hours of backups)
  preBackupHook:
    containerName: redis
    command: ["redis-cli", "BGSAVE"]
```
🤝 Contributing
Contributions are welcome! This is an early-stage project, and we'd love your help.
How to Contribute
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Areas Where We Need Help
- Testing on various Kubernetes distributions (GKE, EKS, AKS)
- Documentation improvements
- Additional storage provider testing
- Performance optimization
- Feature implementations from roadmap
📄 License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
🙏 Acknowledgments
- Built with Kubebuilder
- Uses external-snapshotter for VolumeSnapshot support
- Inspired by the Kubernetes community's need for lightweight backup solutions
💬 Support
- 🐛 Bug Reports: GitHub Issues
- 💡 Feature Requests: GitHub Discussions
- 📧 Contact: Federico Lepera
⭐ Star History
If you find this project useful, please consider giving it a star! It helps the project gain visibility and encourages continued development.
Note: This operator is in active development (v0.0.7). APIs and features may change. Not recommended for production use until v1.0.0 release.
📝 Changelog
Version 0.0.7 (2026-01-15)
New Features:
- ✅ Helm chart for easy installation - deploy the operator with a single `helm install` command
- ✅ Fully customizable via Helm values (image, resources, replicas, etc.)
- ✅ Automatic CRD installation via Helm chart
Helm Chart Includes:
- ServiceAccount with ClusterRole for cluster-wide permissions
- ClusterRole with permissions for: StatefulSets, VolumeSnapshots, VolumeSnapshotClasses, PVCs, Pods/exec (for hooks), Events, and CRDs
- Deployment with security context, health probes, and resource limits
- Leader election support for HA deployments
- Metrics endpoint configuration
Installation:
```bash
helm install statefulset-backup-operator ./charts/statefulset-backup-operator \
  -n statefulset-backup-system --create-namespace
```
Version 0.0.6 (2026-01-15)
New Features:
- ✅ Snapshot readiness verification - the operator now waits for snapshots to be `ReadyToUse=true` before proceeding (60 second timeout)
Improvements:
- Retention policy is now only applied after snapshots are confirmed ready
- Prevents potential data loss from deleting old snapshots before new ones are ready
- Better reliability for slow storage backends
Version 0.0.5 (2026-01-14)
New Features:
- ✅ Configurable PVC deletion timeout for restore - use the `pvcDeletionTimeoutSeconds` field to set a custom timeout (default: 60 seconds)
Improvements:
- Restore operations now support longer PVC deletion times for slow storage backends
- Clear error messages when PVC deletion times out
Version 0.0.4 (2026-01-14)
New Features:
- ✅ Hook timeout configuration - use the `timeoutSeconds` field to set a custom timeout for pre/post backup hooks (default: 60 seconds)
Improvements:
- Hooks now fail gracefully with clear error message when timeout is exceeded
- Better error handling for long-running hook commands
Version 0.0.3 (2026-01-14)
New Features:
- ✅ Time-based retention policy with `keepDays` - delete snapshots older than N days
- ✅ Configurable container selection for hooks - use the `containerName` field to specify which container to run hooks in
Improvements:
- Enhanced retention policy documentation with use cases and examples
- Updated all examples to showcase new retention options
Version 0.0.2 (2026-01-12)
New Features:
- ✅ Comprehensive unit test suite with 26 tests covering both controllers
- ✅ GitHub Actions CI integration for automated testing
- ✅ Test documentation with detailed coverage information
- ✅ CI-compatible tests that run without VolumeSnapshot CRDs
- ✅ Automatic default VolumeSnapshotClass detection - the operator now automatically uses the cluster's default VolumeSnapshotClass if not explicitly specified
Test Coverage:
- StatefulSetBackup Controller: 15 tests covering manual/scheduled backups, cron validation, status management, and resource lifecycle
- StatefulSetRestore Controller: 11 tests covering restore workflow phases, scale operations, snapshot search, and error handling
- Overall code coverage: 42.5%
- All tests pass in CI without external dependencies
Documentation:
- Added TEST_DOCUMENTATION.md with comprehensive test guide
- Updated README with testing instructions and coverage details
Version 0.0.1 (2026-01-01)
Initial Release:
- Basic backup and restore functionality
- Cron-based scheduling
- Pre/post backup hooks
- Per-replica retention policies
- StatefulSet integration