
StatefulSet Backup Operator


โš ๏ธ Work in Progress - Version 0.0.7 This operator is under active development. APIs may change, and some features are still being implemented.

A Kubernetes operator for automated backup and restore of StatefulSets using native VolumeSnapshot APIs. Features scheduled snapshots, retention policies, pre/post hooks, and point-in-time recovery with a simple declarative interface.

🎯 Features

  • ✅ Automated Snapshots - Schedule backups using cron expressions or trigger them on-demand
  • ✅ Coordinated Backups - Create consistent snapshots across all replicas of a StatefulSet
  • ✅ Pre/Post Backup Hooks - Execute commands inside pods before and after snapshots (e.g., database flush operations)
  • ✅ Retention Management - Automatically clean up old snapshots based on configurable retention policies (per-replica)
  • ✅ Point-in-Time Recovery - Restore StatefulSets to any previous snapshot with a single command
  • ✅ Native Kubernetes Integration - Uses standard VolumeSnapshot APIs (CSI) for broad storage provider compatibility
  • ✅ Namespace Isolation - Proper namespace scoping for multi-tenant environments
  • ✅ Comprehensive Test Suite - 26 unit tests with 42.5% code coverage, fully CI-compatible without external dependencies

🚀 Why Not Velero?

Velero is excellent for full-cluster disaster recovery, but if you just need:

  • Fast, automated backups for your StatefulSets
  • Point-in-time recovery without external storage
  • Minimal operational overhead
  • Cost-effective snapshot-based backups

...then this operator is a better fit. Think of it as "the right tool for the right job" - lightweight, focused, and cloud-native.

Comparison with Velero

| Feature | StatefulSet Backup Operator | Velero |
|---------|-----------------------------|--------|
| Setup time | ~2 minutes | 15-30 minutes |
| Dependencies | None (CSI driver only) | Object storage (S3, GCS) + CLI |
| Backup speed | Seconds (snapshots) | Minutes (full copy) |
| Storage cost | Incremental snapshots | Full backups on S3 |
| StatefulSet hooks | ✅ Built-in | ⚠️ Via init containers |
| Cross-cluster DR | ❌ (roadmap) | ✅ |
| Per-replica restore | ✅ | ⚠️ Limited |
| GitOps friendly | ✅ 100% CRD-based | ⚠️ Mix of CLI/CRD |

📋 Prerequisites

  • Kubernetes 1.20+
  • CSI driver with snapshot support (CSI Snapshot v1 API)
  • VolumeSnapshotClass configured in your cluster
  • Kubectl access to the cluster
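A quick way to confirm the last two prerequisites (class names vary by cluster):

# List VolumeSnapshotClasses; at least one should exist
kubectl get volumesnapshotclass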

Required CRDs

The operator requires the following VolumeSnapshot CRDs to be installed:

  • volumesnapshotclasses.snapshot.storage.k8s.io
  • volumesnapshots.snapshot.storage.k8s.io
  • volumesnapshotcontents.snapshot.storage.k8s.io

Verify VolumeSnapshot API is available:

kubectl api-resources | grep volumesnapshot

Install VolumeSnapshot CRDs (if not present):

# Install CSI snapshot controller and CRDs
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml

Tested Environments

  • ✅ Minikube (with CSI hostpath driver)
  • ✅ Kind (with CSI snapshot support)
  • 🔄 GKE, EKS, AKS (testing in progress)

CSI Driver Compatibility

The operator works with any CSI driver that supports VolumeSnapshot v1 API:

  • AWS EBS CSI Driver
  • GCE Persistent Disk CSI Driver
  • Azure Disk CSI Driver
  • Ceph CSI
  • Longhorn
  • OpenEBS
  • Portworx

๐Ÿ› ๏ธ Installation

Option 1: Build from Source

# Clone the repository
git clone https://github.com/federicolepera/statefulset-backup-operator.git
cd statefulset-backup-operator

# Build the Docker image
make docker-build IMG=<your-registry>/statefulset-backup-operator:v0.0.7

# Push to your registry
make docker-push IMG=<your-registry>/statefulset-backup-operator:v0.0.7

# Install CRDs
make install

# Deploy the operator
make deploy IMG=<your-registry>/statefulset-backup-operator:v0.0.7

Option 2: Install CRDs and Deploy Manually

# Install CRDs
kubectl apply -f config/crd/bases/

# Deploy operator (update image in config/manager/manager.yaml first)
kubectl apply -f config/rbac/
kubectl apply -f config/manager/

Option 3: Helm Chart (Recommended)

# Install from local chart
helm install statefulset-backup-operator ./charts/statefulset-backup-operator \
  -n statefulset-backup-system --create-namespace

# With custom image
helm install statefulset-backup-operator ./charts/statefulset-backup-operator \
  --set image.repository=myregistry/statefulset-backup-operator \
  --set image.tag=v0.0.7 \
  -n statefulset-backup-system --create-namespace

# Upgrade existing installation
helm upgrade statefulset-backup-operator ./charts/statefulset-backup-operator \
  -n statefulset-backup-system

Helm Values:

| Parameter | Description | Default |
|-----------|-------------|---------|
| image.repository | Image repository | ghcr.io/federicolepera/statefulset-backup-operator |
| image.tag | Image tag | appVersion from Chart.yaml |
| image.pullPolicy | Image pull policy | IfNotPresent |
| replicaCount | Number of replicas | 1 |
| resources.limits.cpu | CPU limit | 500m |
| resources.limits.memory | Memory limit | 128Mi |
| resources.requests.cpu | CPU request | 10m |
| resources.requests.memory | Memory request | 64Mi |
| serviceAccount.create | Create ServiceAccount | true |
| leaderElection.enabled | Enable leader election | true |
| metrics.enabled | Enable metrics endpoint | true |
| metrics.port | Metrics port | 8080 |
| health.port | Health probe port | 8081 |

The Helm chart automatically installs:

  • CRDs for StatefulSetBackup and StatefulSetRestore
  • ClusterRole with all required permissions
  • ClusterRoleBinding
  • ServiceAccount
  • Deployment with health probes and security context
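After installation, a quick sanity check (the namespace matches the commands above; the CRD names below assume the standard plural forms of the backup.sts-backup.io kinds):

# Confirm the operator pod is running
kubectl get pods -n statefulset-backup-system

# Confirm the CRDs are installed
kubectl get crd statefulsetbackups.backup.sts-backup.io statefulsetrestores.backup.sts-backup.io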

📖 Usage

Basic Backup

Create a simple backup that runs once:

apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetBackup
metadata:
  name: my-database-backup
  namespace: default
spec:
  statefulSetRef:
    name: postgresql
    namespace: default
  retentionPolicy:
    keepLast: 3  # Keep last 3 snapshots per PVC
  volumeSnapshotClassName: csi-hostpath-snapclass  # Optional: omit to use cluster default

Note on VolumeSnapshotClass: The volumeSnapshotClassName field is optional. If not specified, the operator will automatically use the default VolumeSnapshotClass in your cluster (the one with annotation snapshot.storage.kubernetes.io/is-default-class: "true"). This makes the operator work out-of-the-box on most Kubernetes clusters without additional configuration.
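To check which class is the default, or to mark one as the default yourself (csi-hostpath-snapclass here is just the example name from above):

# Inspect snapshot classes and their annotations
kubectl get volumesnapshotclass
kubectl describe volumesnapshotclass csi-hostpath-snapclass

# Mark a class as the cluster default (add --overwrite if the annotation already exists)
kubectl annotate volumesnapshotclass csi-hostpath-snapclass \
  snapshot.storage.kubernetes.io/is-default-class="true"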

You can also omit the volumeSnapshotClassName field entirely:

apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetBackup
metadata:
  name: my-database-backup-auto
  namespace: default
spec:
  statefulSetRef:
    name: postgresql
    namespace: default
  retentionPolicy:
    keepLast: 3  # Keep last 3 snapshots per PVC
  # volumeSnapshotClassName omitted - will use cluster default

Scheduled Backup with Hooks

Create automated backups with pre/post hooks:

apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetBackup
metadata:
  name: postgres-scheduled-backup
  namespace: production
spec:
  statefulSetRef:
    name: postgresql
    namespace: production
  schedule: "0 2 * * *"  # Every day at 2 AM (standard cron format)
  retentionPolicy:
    keepDays: 30   # Keep backups for 30 days
  volumeSnapshotClassName: csi-hostpath-snapclass  # Optional
  preBackupHook:
    containerName: postgres  # Optional: specify container (defaults to first container)
    timeoutSeconds: 120      # Optional: timeout in seconds (default: 60)
    command:
      - "psql"
      - "-U"
      - "postgres"
      - "-c"
      - "CHECKPOINT"
  postBackupHook:
    command:
      - "echo"
      - "Backup completed"

Hook Behavior:

  • Hooks execute sequentially on each pod (pod-0, then pod-1, etc.)
  • If a hook fails on any pod, the entire backup fails
  • Hooks execute in the first container unless containerName is specified
  • Default timeout is 60 seconds; configure with timeoutSeconds field
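Conceptually, the pre-backup hook above amounts to the operator running the equivalent of kubectl exec against each replica in order; a rough sketch (pod names follow the usual StatefulSet convention):

# Roughly what the operator executes, one pod at a time
kubectl exec -n production postgresql-0 -c postgres -- psql -U postgres -c CHECKPOINT
kubectl exec -n production postgresql-1 -c postgres -- psql -U postgres -c CHECKPOINT
# ...and so on; a non-zero exit code from any pod fails the backup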

Restore from Backup

Restore a StatefulSet to a previous snapshot:

apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetRestore
metadata:
  name: restore-postgres
  namespace: production
spec:
  statefulSetRef:
    name: postgresql
    namespace: production
  backupName: postgres-scheduled-backup
  scaleDown: true  # Recommended: scales StatefulSet to 0 before restore
  pvcDeletionTimeoutSeconds: 120  # Optional: timeout for PVC deletion (default: 60)

Restore Phases: The restore process follows a multi-phase workflow:

  1. New: Initial state, saves original replica count
  2. ScalingDown: Reduces StatefulSet replicas to 0 (if scaleDown: true)
  3. Restoring: Deletes existing PVCs and recreates them from snapshots
  4. ScalingUp: Restores StatefulSet to original replica count
  5. Completed/Failed: Final state
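The current phase is visible directly in the resource status:

# Print the current restore phase
kubectl get statefulsetrestore restore-postgres -n production -o jsonpath='{.status.phase}'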

Selective Snapshot Restore

Restore specific snapshots (useful for partial recovery):

apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetRestore
metadata:
  name: restore-selective
  namespace: production
spec:
  statefulSetRef:
    name: postgresql
    namespace: production
  snapshotNames:
    - postgres-backup-data-20260112-120000-0  # Restore only replica-0
    - postgres-backup-data-20260112-120000-1  # Restore only replica-1
  scaleDown: true
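The exact snapshot names to list here can be discovered through the labels the operator applies to every snapshot (documented under Architecture below):

# List snapshots created by a given backup resource
kubectl get volumesnapshot -n production -l backup.sts-backup.io/policy=postgres-scheduled-backup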

๐Ÿ” Monitoring

Check Backup Status

# List all backups
kubectl get statefulsetbackup

# Detailed status
kubectl describe statefulsetbackup my-database-backup

# Check created snapshots
kubectl get volumesnapshot

Status Fields:

  • Phase: Current state - Ready, InProgress, or Failed
  • LastBackupTime: Timestamp of the most recent successful backup
  • Conditions: Standard Kubernetes conditions for state tracking
  • Message: Human-readable status information

Example output:

status:
  phase: Ready
  lastBackupTime: "2026-01-12T14:30:00Z"
  conditions:
    - type: Ready
      status: "True"
      lastTransitionTime: "2026-01-12T14:30:05Z"

Check Restore Status

# List all restores
kubectl get statefulsetrestore

# Watch restore progress
kubectl get statefulsetrestore restore-postgres -w

# Detailed restore status
kubectl describe statefulsetrestore restore-postgres

Restore Status Fields:

  • Phase: Current phase - New, ScalingDown, Restoring, ScalingUp, Completed, or Failed
  • OriginalReplicas: Original replica count saved before scaling down
  • RestoredSnapshots: Map of PVC names to restored snapshot names
  • StartTime/CompletionTime: Timing information for the restore operation

Example output:

status:
  phase: Completed
  originalReplicas: 3
  restoredSnapshots:
    data-postgresql-0: postgres-backup-data-20260112-120000-0
    data-postgresql-1: postgres-backup-data-20260112-120000-1
    data-postgresql-2: postgres-backup-data-20260112-120000-2
  startTime: "2026-01-12T15:00:00Z"
  completionTime: "2026-01-12T15:05:30Z"

View Operator Logs

# Get operator pod
kubectl get pods -n statefulset-backup-operator-system

# View logs
kubectl logs -n statefulset-backup-operator-system <operator-pod-name> -f
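If you prefer not to look up the pod name, logs can also be followed through the Deployment; the controller-manager name below is the typical Kubebuilder default and may differ in your installation:

# Follow logs via the Deployment instead of a specific pod
kubectl logs -n statefulset-backup-operator-system deploy/statefulset-backup-operator-controller-manager -f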

๐Ÿ—๏ธ Architecture

Snapshot Naming and Labels

VolumeSnapshot Naming Format:

{backup-name}-{volume-claim-template-name}-{timestamp}-{replica-index}

Example: postgres-backup-data-20260112-143000-0

Labels Applied to Snapshots:

  • backup.sts-backup.io/statefulset: Name of the source StatefulSet
  • backup.sts-backup.io/policy: Name of the backup resource

PVC Naming for StatefulSets:

{volume-claim-template-name}-{statefulset-name}-{replica-index}

Example: data-postgresql-0
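These labels make it easy to query all snapshots for a given workload:

# All snapshots taken from the postgresql StatefulSet, oldest first
kubectl get volumesnapshot -l backup.sts-backup.io/statefulset=postgresql \
  --sort-by=.metadata.creationTimestamp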

Backup Scheduling Behavior

Manual Backups (no schedule):

  • Executed once when the backup resource is created
  • No subsequent backups occur automatically
  • Status updates to Ready after completion

Scheduled Backups:

  • Uses standard cron format: minute hour day-of-month month day-of-week
  • First backup executes immediately (within 10 seconds) if no LastBackupTime exists
  • Next reconciliation scheduled up to 1 hour in advance
  • If schedule is missed (controller downtime), executes immediately upon restart
  • Invalid cron expressions cause backup to fail with 1-minute retry interval
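For quick reference, the five cron fields in order, with the expressions used elsewhere in this README:

# minute  hour  day-of-month  month  day-of-week
"0 2 * * *"     # daily at 02:00
"0 */6 * * *"   # every 6 hours
"*/30 * * * *"  # every 30 minutes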

Backup Flow

  1. Reconcile Loop checks if it's time for a backup (based on schedule or manual trigger)
  2. Pre-Backup Hook executes in all StatefulSet pods (if configured)
  3. Snapshot Creation creates VolumeSnapshots for each PVC
  4. Post-Backup Hook executes in all StatefulSet pods (if configured)
  5. Retention Policy removes old snapshots (keeping N most recent per PVC)
  6. Status Update updates the StatefulSetBackup status with results

Restore Flow

  1. New Restore validates the restore request and saves original replica count
  2. Scale Down scales StatefulSet to 0 replicas (if enabled)
  3. Find Snapshots locates snapshots to restore based on backupName
  4. Delete PVCs removes existing PVCs for each replica
  5. Recreate PVCs creates new PVCs from VolumeSnapshots
  6. Scale Up restores StatefulSet to original replica count
  7. Completion waits for all pods to be ready and marks restore as complete

Retention Policy

The operator supports two types of retention policies. Note: Currently, you can specify either keepLast OR keepDays, but not both together.

Count-Based Retention (keepLast)

Keeps the most recent N snapshots per PVC:

  • With 3 replicas and keepLast: 2, each PVC maintains its own 2 most recent snapshots
  • Total snapshots: 6 (2 per PVC)
  • Old snapshots are deleted based on creation timestamp (oldest first)

retentionPolicy:
  keepLast: 5  # Keep last 5 snapshots per PVC

Use cases:

  • You want a fixed number of recent backups regardless of age
  • Example: Keep last 7 daily backups (1 week of history)
  • Example: Keep last 24 hourly backups (24 hours of history)

Time-Based Retention (keepDays)

Deletes all snapshots older than N days:

  • Applies to all snapshots regardless of PVC
  • Calculates age from snapshot creation timestamp
  • Example: keepDays: 14 deletes snapshots older than 2 weeks

retentionPolicy:
  keepDays: 14  # Keep snapshots from the last 14 days

Use cases:

  • You want to enforce a maximum retention period
  • Compliance requirements (e.g., "keep backups for 30 days")
  • Cost management (limit storage by time)
  • Example: Daily backups with 2 week retention
  • Example: Hourly backups with 1 day retention
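After a few scheduled runs, the effect of either policy can be sanity-checked by listing the snapshots the operator manages for a given backup resource:

# Replace <backup-name> with your StatefulSetBackup name; with keepDays: 14,
# nothing listed should be older than two weeks
kubectl get volumesnapshot -l backup.sts-backup.io/policy=<backup-name> \
  --sort-by=.metadata.creationTimestamp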

Future Enhancement: Support for combining both policies (e.g., keepLast: 3 with keepDays: 7) is planned for a future release.

🚧 Work in Progress

The following features are currently under development or planned:

Current Limitations

  • โš ๏ธ Cross-Namespace Operations - Backup and target StatefulSet must be in same namespace

    • Cross-namespace snapshots not supported
    • Enhancement planned: Support cross-namespace operations
  • โš ๏ธ Restore to Different StatefulSet - Not supported

    • Restore only works with the original source StatefulSet
    • Workaround: Manually copy snapshots and recreate PVCs

Roadmap

  • Comprehensive unit test suite (v0.0.2)
  • CI/CD integration with GitHub Actions (v0.0.2)
  • Time-based retention policy with keepDays (v0.0.3)
  • Configurable container selection for hooks (v0.0.3)
  • Hook timeout configuration (v0.0.4)
  • Configurable PVC deletion timeout for restore (v0.0.5)
  • Snapshot readiness verification before retention (v0.0.6)
  • Helm chart for easy installation (v0.0.7)
  • Combined retention policies (both keepLast and keepDays together)
  • Webhook validation for CRDs
  • Backup verification and integrity checks
  • Metrics and Prometheus integration
  • Multi-cluster restore (cross-cluster DR)
  • Support for encryption at rest
  • CLI tool for backup/restore operations
  • Dashboard/UI for visualization

🧪 Development

Prerequisites

  • Go 1.21+
  • Docker
  • Kubernetes cluster (Minikube recommended for development)
  • Kubebuilder 3.0+

Local Development Setup

# Install dependencies
go mod download

# Generate CRDs and code
make manifests generate

# Install CRDs into cluster
make install

# Run operator locally (outside cluster)
make run

# Or debug with VSCode
# Use the provided .vscode/launch.json configuration

Testing

The operator includes a comprehensive test suite with 26 unit tests covering both backup and restore controllers.

# Setup test environment (first time only)
make setup-envtest

# Run all unit tests with coverage
make test

# Run specific controller tests
go test ./internal/controller/... -v

# Run a specific test
go test ./internal/controller/... -v -run "TestStatefulSetBackupController"

Test Coverage

  • 26 total tests implemented
  • 24 tests pass successfully ✅
  • 2 tests skipped (require VolumeSnapshot CRDs)
  • 42.5% code coverage of the codebase
  • GitHub Actions CI runs all tests automatically

For detailed test documentation, see TEST_DOCUMENTATION.md.

What's Tested

StatefulSetBackup Controller (15 tests):

  • Manual and scheduled backup workflows
  • Cron schedule validation and requeue logic
  • Backup status management
  • Resource lifecycle (creation, deletion)
  • Error handling for missing StatefulSets

StatefulSetRestore Controller (11 tests):

  • Restore phase workflow (ScalingDown → Restoring → ScalingUp)
  • Scale down/up operations
  • Snapshot search and restoration
  • Parameter validation
  • Completed/failed state handling

All tests are CI-compatible and run without requiring VolumeSnapshot CRDs to be installed.

Integration Testing

# Run with a test StatefulSet
kubectl apply -f config/samples/apps_v1_statefulset.yaml
kubectl apply -f config/samples/backup_v1alpha1_statefulsetbackup.yaml

# Watch operator logs
# (if running locally, check terminal output)

Building

# Build binary
make build

# Build and push Docker image
make docker-build docker-push IMG=<your-registry>/statefulset-backup-operator:tag

๐Ÿ“ Examples

Example 1: PostgreSQL Backup

apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetBackup
metadata:
  name: postgres-backup
spec:
  statefulSetRef:
    name: postgres
    namespace: databases
  schedule: "0 */6 * * *"  # Every 6 hours
  retentionPolicy:
    keepDays: 7   # Keep snapshots for 7 days
  preBackupHook:
    containerName: postgres
    command: ["psql", "-U", "postgres", "-c", "CHECKPOINT"]

Example 2: MongoDB Backup with Replica Sync

apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetBackup
metadata:
  name: mongodb-backup
spec:
  statefulSetRef:
    name: mongodb
    namespace: databases
  schedule: "0 3 * * *"  # Daily at 3 AM
  retentionPolicy:
    keepDays: 14   # Keep 2 weeks of backups
  preBackupHook:
    containerName: mongodb
    command:
      - "mongosh"
      - "--eval"
      - "db.fsyncLock()"
  postBackupHook:
    command:
      - "mongosh"
      - "--eval"
      - "db.fsyncUnlock()"

Example 3: Redis Cluster Backup

apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetBackup
metadata:
  name: redis-backup
spec:
  statefulSetRef:
    name: redis
    namespace: cache
  schedule: "*/30 * * * *"  # Every 30 minutes
  retentionPolicy:
    keepLast: 12  # Keep last 12 snapshots per PVC (6 hours of backups)
  preBackupHook:
    containerName: redis
    command: ["redis-cli", "BGSAVE"]

๐Ÿค Contributing

Contributions are welcome! This is an early-stage project, and we'd love your help.

How to Contribute

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Areas Where We Need Help

  • Testing on various Kubernetes distributions (GKE, EKS, AKS)
  • Documentation improvements
  • Additional storage provider testing
  • Performance optimization
  • Feature implementations from roadmap

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

๐Ÿ“ž Support

โญ Star History

If you find this project useful, please consider giving it a star! It helps the project gain visibility and encourages continued development.


Note: This operator is in active development (v0.0.7). APIs and features may change. Not recommended for production use until v1.0.0 release.

📊 Changelog

Version 0.0.7 (2026-01-15)

New Features:

  • ✅ Helm chart for easy installation - deploy the operator with a single helm install command
  • ✅ Fully customizable via Helm values (image, resources, replicas, etc.)
  • ✅ Automatic CRD installation via Helm chart

Helm Chart Includes:

  • ServiceAccount with ClusterRole for cluster-wide permissions
  • ClusterRole with permissions for: StatefulSets, VolumeSnapshots, VolumeSnapshotClasses, PVCs, Pods/exec (for hooks), Events, and CRDs
  • Deployment with security context, health probes, and resource limits
  • Leader election support for HA deployments
  • Metrics endpoint configuration

Installation:

helm install statefulset-backup-operator ./charts/statefulset-backup-operator \
  -n statefulset-backup-system --create-namespace

Version 0.0.6 (2026-01-15)

New Features:

  • ✅ Snapshot readiness verification - operator now waits for snapshots to be ReadyToUse=true before proceeding (60 second timeout)

Improvements:

  • Retention policy is now only applied after snapshots are confirmed ready
  • Prevents potential data loss from deleting old snapshots before new ones are ready
  • Better reliability for slow storage backends

Version 0.0.5 (2026-01-14)

New Features:

  • ✅ Configurable PVC deletion timeout for restore - use pvcDeletionTimeoutSeconds field to set custom timeout (default: 60 seconds)

Improvements:

  • Restore operations now support longer PVC deletion times for slow storage backends
  • Clear error messages when PVC deletion times out

Version 0.0.4 (2026-01-14)

New Features:

  • ✅ Hook timeout configuration - use timeoutSeconds field to set custom timeout for pre/post backup hooks (default: 60 seconds)

Improvements:

  • Hooks now fail gracefully with clear error message when timeout is exceeded
  • Better error handling for long-running hook commands

Version 0.0.3 (2026-01-14)

New Features:

  • ✅ Time-based retention policy with keepDays - delete snapshots older than N days
  • ✅ Configurable container selection for hooks - use containerName field to specify which container to run hooks in

Improvements:

  • Enhanced retention policy documentation with use cases and examples
  • Updated all examples to showcase new retention options

Version 0.0.2 (2026-01-12)

New Features:

  • ✅ Comprehensive unit test suite with 26 tests covering both controllers
  • ✅ GitHub Actions CI integration for automated testing
  • ✅ Test documentation with detailed coverage information
  • ✅ CI-compatible tests that run without VolumeSnapshot CRDs
  • ✅ Automatic default VolumeSnapshotClass detection - operator now automatically uses the cluster's default VolumeSnapshotClass if not explicitly specified

Test Coverage:

  • StatefulSetBackup Controller: 15 tests covering manual/scheduled backups, cron validation, status management, and resource lifecycle
  • StatefulSetRestore Controller: 11 tests covering restore workflow phases, scale operations, snapshot search, and error handling
  • Overall code coverage: 42.5%
  • All tests pass in CI without external dependencies

Documentation:

  • Added TEST_DOCUMENTATION.md with comprehensive test guide
  • Updated README with testing instructions and coverage details

Version 0.0.1 (2026-01-01)

Initial Release:

  • Basic backup and restore functionality
  • Cron-based scheduling
  • Pre/post backup hooks
  • Per-replica retention policies
  • StatefulSet integration