Self-host OpenClaw AI agents on Kubernetes with production-grade security, observability, and lifecycle management.
OpenClaw is an AI agent platform that acts on your behalf across Telegram, Discord, WhatsApp, and Signal. It manages your inbox, calendar, smart home, and more through 50+ integrations. While OpenClaw.rocks offers fully managed hosting, this operator lets you run OpenClaw on your own infrastructure with the same operational rigor.
Why an Operator?
Deploying AI agents to Kubernetes involves more than a Deployment and a Service. You need network isolation, secret management, persistent storage, health monitoring, optional browser automation, and config rollouts, all wired correctly. This operator encodes those concerns into a single OpenClawInstance custom resource so you can go from zero to production in minutes:
apiVersion: openclaw.rocks/v1alpha1 kind: OpenClawInstance metadata: name: my-agent spec: envFrom: - secretRef: name: openclaw-api-keys storage: persistence: enabled: true size: 10Gi
The operator reconciles this into a fully managed stack of 9+ Kubernetes resources: secured, monitored, and self-healing.
Agents That Adapt Themselves
Agents can autonomously install skills, patch their config, add environment variables, and seed workspace files - all through the Kubernetes API, validated by the operator on every request.
# 1. Enable self-configure on the instance spec: selfConfigure: enabled: true allowedActions: [skills, config, envVars, workspaceFiles]
# 2. The agent creates this to install a skill at runtime apiVersion: openclaw.rocks/v1alpha1 kind: OpenClawSelfConfig metadata: name: add-fetch-skill spec: instanceRef: my-agent addSkills: - "@anthropic/mcp-server-fetch"
Every request is validated against the instance's allowlist policy. Protected config keys cannot be overwritten, and denied requests are logged with a reason. See Self-configure for details.
Note: Without
selfConfigureenabled, config or skill changes made by the agent inside the container won't trigger a pod restart. You'll need to restart the pod manually (e.g.kubectl delete pod <pod-name>) for changes to take effect.
Features
| Feature | Details | |
|---|---|---|
| Declarative | Single CRD | One resource defines the entire stack: StatefulSet, Service, RBAC, NetworkPolicy, PVC, PDB, Ingress, and more |
| Adaptive | Agent self-configure | Agents autonomously install skills, patch config, and adapt their environment via the K8s API - every change validated against an allowlist policy |
| Secure | Hardened by default | Non-root (UID 1000), read-only root filesystem, all capabilities dropped, seccomp RuntimeDefault, default-deny NetworkPolicy, validating webhook |
| Observable | Built-in metrics | Prometheus metrics, ServiceMonitor integration, structured JSON logging, Kubernetes events |
| Flexible | Provider-agnostic config | Use any AI provider (Anthropic, OpenAI, or others) via environment variables and inline or external config |
| Config Modes | Merge or overwrite | overwrite replaces config on restart; merge deep-merges with PVC config, preserving runtime changes. Config is restored on every container restart via init container. |
| Skills | Declarative install | Install ClawHub skills, npm packages, or GitHub-hosted skill packs via spec.skills - supports npm: and pack: prefixes |
| Plugins | Declarative install | Install OpenClaw plugins via spec.plugins - npm packages installed in a secure init container |
| Runtime Deps | pnpm & Python/uv | Built-in init containers install pnpm (via corepack) or Python 3.12 + uv for MCP servers and skills |
| Auto-Update | OCI registry polling | Opt-in version tracking: checks the registry for new semver releases, backs up first, rolls out, and auto-rolls back if the new version fails health checks |
| Scalable | Auto-scaling | HPA integration with CPU and memory metrics, min/max replica bounds, automatic StatefulSet replica management |
| Resilient | Self-healing lifecycle | PodDisruptionBudgets, health probes, automatic config rollouts via content hashing, 5-minute drift detection |
| Backup/Restore | S3-backed snapshots | Automatic backup to S3-compatible storage on deletion, pre-update, and on a cron schedule; restore into a new instance from any snapshot |
| Workspace Seeding | Initial files & dirs | Pre-populate the workspace with files and directories before the agent starts; reference an external ConfigMap for GitOps workflows |
| Gateway Auth | Auto-generated tokens | Automatic gateway token Secret per instance, bypassing mDNS pairing (unusable in k8s) |
| Tailscale | Tailnet access | Expose via Tailscale Serve or Funnel with SSO auth - no Ingress needed |
| Extensible | Sidecars & init containers | Chromium for browser automation, Ollama for local LLMs, Tailscale for tailnet access, plus custom init containers and sidecars |
| Cloud Native | SA annotations & CA bundles | AWS IRSA / GCP Workload Identity via ServiceAccount annotations; CA bundle injection for corporate proxies |
Architecture
+-----------------------------------------------------------------+
| OpenClawInstance CR OpenClawSelfConfig CR |
| (your declarative config) (agent self-modification requests) |
+---------------+-------------------------------------------------+
| watch
v
+-----------------------------------------------------------------+
| OpenClaw Operator |
| +-----------+ +-------------+ +----------------------------+ |
| | Reconciler| | Webhooks | | Prometheus Metrics | |
| | | | (validate | | (reconcile count, | |
| | creates -> | & default)| | duration, phases) | |
| +-----------+ +-------------+ +----------------------------+ |
+---------------+-------------------------------------------------+
| manages
v
+-----------------------------------------------------------------+
| Managed Resources (per instance) |
| |
| ServiceAccount -> Role -> RoleBinding NetworkPolicy |
| ConfigMap PVC PDB ServiceMonitor |
| GatewayToken Secret |
| |
| StatefulSet |
| +-----------------------------------------------------------+ |
| | Init: config -> pnpm* -> python* -> skills* -> custom | |
| | (* = opt-in) | |
| +------------------------------------------------------------+ |
| | OpenClaw Container Gateway Proxy (nginx) | |
| | Chromium (opt) / Ollama (opt) | |
| | Tailscale (opt) + custom sidecars | |
| +------------------------------------------------------------+ |
| |
| Service (default: 18789, 18793 or custom) -> Ingress (opt) |
+-----------------------------------------------------------------+
Quick Start
Prerequisites
- Kubernetes 1.28+
- Helm 3
1. Install the operator
helm install openclaw-operator \ oci://ghcr.io/openclaw-rocks/charts/openclaw-operator \ --namespace openclaw-operator-system \ --create-namespace
Alternative: install with Kustomize
# Install CRDs make install # Deploy the operator make deploy IMG=ghcr.io/openclaw-rocks/openclaw-operator:latest
2. Create a secret with your API keys
apiVersion: v1 kind: Secret metadata: name: openclaw-api-keys type: Opaque stringData: ANTHROPIC_API_KEY: "sk-ant-..."
3. Deploy an OpenClaw instance
apiVersion: openclaw.rocks/v1alpha1 kind: OpenClawInstance metadata: name: my-agent spec: envFrom: - secretRef: name: openclaw-api-keys storage: persistence: enabled: true size: 10Gi
kubectl apply -f secret.yaml -f openclawinstance.yaml
4. Verify
kubectl get openclawinstances # NAME PHASE AGE # my-agent Running 2m kubectl get pods # NAME READY STATUS AGE # my-agent-0 1/1 Running 2m
Configuration
Inline config (openclaw.json)
spec: config: raw: agents: defaults: model: primary: "anthropic/claude-sonnet-4-20250514" sandbox: true session: scope: "per-sender"
External ConfigMap reference
spec: config: configMapRef: name: my-openclaw-config key: openclaw.json
Config changes are detected via SHA-256 hashing and automatically trigger a rolling update. No manual restart needed.
Gateway proxy
By default, each pod includes an nginx reverse proxy sidecar that forwards traffic to the OpenClaw gateway on loopback. Set spec.gateway.enabled: false to disable it:
- Health probes and Service ports target the gateway directly on port 18789
gateway.bindis set to0.0.0.0instead of loopback- The
gateway-proxycontainer and its tmp volume are omitted from the pod - To replace the built-in proxy with your own (e.g., Envoy, a signing proxy), disable it and add your proxy via
spec.sidecars - Warning: Do not set
gateway.bind: loopbackin your config JSON when the proxy is disabled - the gateway will only listen on127.0.0.1with nothing forwarding external traffic, making the pod unreachable. The operator emits aGatewayBindConflictwarning event if this misconfiguration is detected. - TLS: When the proxy is disabled, the gateway serves plaintext
ws://on0.0.0.0. Ensure your replacement proxy or Ingress handles TLS termination to avoid exposing unencrypted WebSocket traffic (CWE-319).
Gateway authentication
The operator automatically generates a gateway token Secret for each instance and injects it into both the config JSON (gateway.auth.mode: token) and the OPENCLAW_GATEWAY_TOKEN env var. This bypasses Bonjour/mDNS pairing, which is unusable in Kubernetes.
- The token is generated once and never overwritten - rotate it by editing the Secret directly
- If you set
gateway.auth.tokenin your config orOPENCLAW_GATEWAY_TOKENinspec.env, your value takes precedence - To bring your own token Secret, set
spec.gateway.existingSecret- the operator will use it instead of auto-generating one (the Secret must have a key namedtoken) - The operator automatically sets
gateway.controlUi.dangerouslyDisableDeviceAuth: true- device pairing is incompatible with Kubernetes (users cannot approve pairing from inside a container, connections are always proxied, and mDNS is unavailable) - Do not set
gateway.mode: localin your config - this mode is for desktop installs and enforces device identity checks that cannot work behind a reverse proxy in Kubernetes - When connecting to the Control UI through an Ingress, pass the gateway token in the URL fragment:
https://openclaw.example.com/#token=<your-token> - Since v2026.2.24, OpenClaw restricts
gateway.allowedOriginsto same-origin by default - if accessing via a non-default hostname (e.g. Ingress), setgateway.allowedOrigins: ["*"]in your config
Control UI allowed origins
The operator auto-injects gateway.controlUi.allowedOrigins so the Control UI works through reverse proxies without CORS errors. Origins are derived from:
- Localhost (always):
http://localhost:18789,http://127.0.0.1:18789for port-forwarding - Ingress hosts: scheme determined from TLS config (
https://if TLS,http://otherwise) - Explicit extras:
spec.gateway.controlUiOriginsfor custom proxy URLs
If you set gateway.controlUi.allowedOrigins directly in your config JSON, the operator will not override it.
Chromium sidecar
Enable headless browser automation for web scraping, screenshots, and browser-based integrations:
spec: chromium: enabled: true image: repository: chromedp/headless-shell # default tag: "stable" resources: requests: cpu: "250m" memory: "512Mi" limits: cpu: "1000m" memory: "2Gi" # Pass extra flags to the Chromium process (appended to built-in anti-bot defaults) extraArgs: - "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" # Inject extra environment variables into the sidecar extraEnv: - name: DISPLAY value: ":99"
When enabled, the operator automatically:
- Injects a
CHROMIUM_URLenvironment variable into the main container - Configures browser profiles in the OpenClaw config - both
"default"and"chrome"profiles are set to point at the sidecar's CDP endpoint, so browser tool calls work regardless of which profile name the LLM passes - Sets up shared memory, security contexts, and health probes for the sidecar
- Applies anti-bot-detection flags by default (
--disable-blink-features=AutomationControlled,--disable-features=AutomationControlled,--no-first-run)
Persistent browser profiles
By default, all browser state (cookies, localStorage, session tokens) is lost on pod restart. Enable persistence to retain browser profiles across restarts:
spec: chromium: enabled: true persistence: enabled: true # default: false storageClass: "" # optional - uses cluster default if empty size: "1Gi" # default: 1Gi existingClaim: "" # optional - use a pre-existing PVC
When persistence is enabled, the operator creates a dedicated PVC and passes --user-data-dir=/chromium-data to Chrome so that cookies, localStorage, IndexedDB, cached credentials, and session tokens survive pod restarts. This is useful for authenticated browser automation, MFA-protected services, and long-running browser workflows.
Security note: Persistent browser profiles contain sensitive session tokens. The PVC has the same security posture as other instance volumes. Ensure your StorageClass supports encryption at rest for sensitive workloads.
Ollama sidecar
Run local LLMs alongside your agent for private, low-latency inference without external API calls:
spec: ollama: enabled: true models: - llama3.2 - nomic-embed-text gpu: 1 storage: sizeLimit: 30Gi resources: requests: cpu: "1" memory: "4Gi" limits: cpu: "4" memory: "16Gi"
When enabled, the operator:
- Injects an
OLLAMA_HOSTenvironment variable into the main container - Pre-pulls specified models via an init container before the agent starts
- Configures GPU resource limits when
gpuis set (nvidia.com/gpu) - Mounts a model cache volume (emptyDir by default, or an existing PVC via
storage.existingClaim)
See Custom AI Providers for configuring OpenClaw to use Ollama models via environment variables.
Web terminal sidecar
Provide browser-based shell access to running instances for debugging and inspection without requiring kubectl exec:
spec: webTerminal: enabled: true readOnly: false credential: secretRef: name: my-terminal-creds resources: requests: cpu: "50m" memory: "64Mi" limits: cpu: "200m" memory: "128Mi"
When enabled, the operator:
- Injects a ttyd sidecar container on port 7681
- Mounts the instance data volume at
/home/openclaw/.openclawso you can inspect config, logs, and data files - Adds the web terminal port to the Service and NetworkPolicy for external access
- Supports basic auth via a Secret with
usernameandpasswordkeys - Supports read-only mode (
readOnly: true) for production environments where shell input should be disabled
Tailscale integration
Expose your instance via Tailscale Serve (tailnet-only) or Funnel (public internet) - no Ingress or LoadBalancer needed:
spec: tailscale: enabled: true mode: serve # "serve" (tailnet only) or "funnel" (public internet) authKeySecretRef: name: tailscale-auth authSSO: true # allow passwordless login for tailnet members hostname: my-agent # defaults to instance name image: repository: ghcr.io/tailscale/tailscale # default tag: latest resources: requests: cpu: 50m memory: 64Mi limits: cpu: 200m memory: 256Mi
When enabled, the operator runs a Tailscale sidecar (tailscaled) that handles serve/funnel declaratively via TS_SERVE_CONFIG. An init container copies the tailscale CLI binary to a shared volume so the main container can call tailscale whois for SSO authentication. The sidecar runs in userspace mode (TS_USERSPACE=true) - no NET_ADMIN capability needed.
State persistence: Tailscale node identity and TLS certificates are automatically persisted to a Kubernetes Secret (<instance>-ts-state) via TS_KUBE_SECRET. This prevents hostname incrementing (device-1, device-2, ...) and Let's Encrypt certificate re-issuance across pod restarts. The operator pre-creates the state Secret, grants the pod's ServiceAccount get/update/patch access to it, and mounts the SA token automatically.
Use ephemeral+reusable auth keys from the Tailscale admin console. When authSSO is enabled, tailnet members can authenticate without a gateway token.
Config merge mode
By default, the operator overwrites the config file on every pod restart. Set mergeMode: merge to deep-merge operator config with existing PVC config, preserving runtime changes made by the agent:
spec: config: mergeMode: merge raw: agents: defaults: model: primary: "anthropic/claude-sonnet-4-20250514"
Caveat: In merge mode, removing a key from the CR does not remove it from the PVC config - the old value persists because deep-merge only adds or updates keys. If you need to remove stale config keys (e.g., after removing gateway.mode: local), temporarily switch to mergeMode: overwrite, apply, wait for the pod to restart, then switch back to merge.
Skill installation
Install skills declaratively. The operator runs an init container that fetches each skill before the agent starts. Entries use ClawHub by default, or prefix with npm: to install from npmjs.com. ClawHub installs are idempotent - if a skill is already installed (e.g., when using persistent storage), it is skipped rather than failing:
spec: skills: - "@anthropic/mcp-server-fetch" # ClawHub (default) - "npm:@openclaw/matrix" # npm package from npmjs.com
npm lifecycle scripts are disabled globally on the init container (NPM_CONFIG_IGNORE_SCRIPTS=true) to mitigate supply chain attacks.
Skill packs
Skill packs bundle multiple files (SKILL.md, scripts, config) into a single installable unit hosted on GitHub. Use the pack: prefix with owner/repo/path format:
spec: skills: - "pack:openclaw-rocks/skills/image-gen" # latest from default branch - "pack:openclaw-rocks/skills/image-gen@v1.0.0" # pinned to tag - "pack:myorg/private-skills/custom-tool@main" # private repo (requires GITHUB_TOKEN)
Each pack directory must contain a skillpack.json manifest:
{
"files": {
"skills/image-gen/SKILL.md": "SKILL.md",
"skills/image-gen/scripts/generate.py": "scripts/generate.py"
},
"directories": ["skills/image-gen/scripts"],
"config": {
"image-gen": {"enabled": true}
}
}The operator resolves packs via the GitHub Contents API (cached for 5 minutes), seeds files into the workspace via the init container, and injects config entries into config.raw.skills.entries (user overrides take precedence). Set GITHUB_TOKEN on the operator deployment for private repo access.
Plugin installation
Install plugins declaratively. The operator runs a dedicated init container that installs each plugin via npm install before the agent starts:
spec: plugins: - "@martian-engineering/lossless-claw" - "some-other-plugin"
npm lifecycle scripts are disabled globally on the init container (NPM_CONFIG_IGNORE_SCRIPTS=true) to mitigate supply chain attacks. Plugins are installed into the PVC-backed ~/.openclaw/node_modules directory and persist across pod restarts.
Workspace seeding
Pre-populate the agent workspace with files and directories before the agent starts. Files can be provided inline or referenced from an external ConfigMap -- ideal for GitOps workflows where workspace content is managed alongside your manifests.
Inline files:
spec: workspace: initialDirectories: - tools/scripts initialFiles: README.md: | # My Workspace This workspace is managed by OpenClaw.
External ConfigMap reference:
spec: workspace: configMapRef: name: my-workspace-files # all keys become workspace files initialFiles: # inline files (override configMapRef) EXTRA.md: "additional content"
All keys in the referenced ConfigMap are written as files into the workspace directory. When both configMapRef and initialFiles are specified, inline files take precedence over ConfigMap entries with the same filename.
Merge priority (highest wins): operator-injected files > inline initialFiles > external configMapRef > skill packs.
The operator sets a WorkspaceReady status condition to False when the referenced ConfigMap is missing or contains invalid filenames, and True once workspace files are seeded successfully. The controller watches external ConfigMaps for changes and re-reconciles automatically.
How it works: Workspace files are seeded once via an init container. The init container copies files from a read-only ConfigMap volume to the PVC. The main container only sees the PVC (writable), so agents can modify their workspace files and changes persist across pod restarts. ConfigMaps are never mounted directly on the main container.
GitOps example with Kustomize:
# kustomization.yaml apiVersion: kustomize.config.k8s.io/v1beta1 kind: Kustomization namespace: my-namespace # must match the instance namespace generatorOptions: disableNameSuffixHash: true # required - operator looks up by exact name configMapGenerator: - name: my-workspace-files files: - workspace/SOUL.md - workspace/AGENT.md
Important: Two kustomize settings are required when using
configMapGeneratorwithconfigMapRef:
disableNameSuffixHash: true-- The operator looks up ConfigMaps by exact name. Kustomize's default hash suffix (e.g.-57k7g4dthc) would cause aConfigMapNotFounderror.namespace-- Generated ConfigMaps must be in the same namespace as the instance. Without this, kustomize creates them in thedefaultnamespace.
Additional workspaces (multi-agent):
When running multiple agents with isolated workspaces, use additionalWorkspaces to seed files for each agent. Each entry seeds to ~/.openclaw/workspace-<name>/ -- set matching paths in spec.config.raw.agents.list[].workspace.
spec: workspace: configMapRef: name: main-agent-workspace additionalWorkspaces: - name: scheduler configMapRef: name: scheduler-workspace initialFiles: SOUL.md: "I am the scheduler agent" initialDirectories: - tools config: raw: agents: list: - id: main name: "Main Agent" - id: scheduler name: "Scheduler Agent" bindings: - agentId: scheduler match: channel: discord peer: kind: channel id: "123456789" # bind to a specific channel
Each additional workspace supports the same configMapRef, initialFiles, and initialDirectories as the default workspace. Operator-injected ENVIRONMENT.md is included; BOOTSTRAP.md is not (only the default agent runs onboarding). Max 10 additional workspaces.
Seed-once behavior: Workspace files (both default and additional) are only written on first boot when they don't already exist on the PVC. If an agent modifies its own SOUL.md or AGENT.md at runtime, those changes persist across pod restarts and are never overwritten by the ConfigMap content. To re-seed a file, delete it from the PVC first.
Full GitOps example with multiple agents:
# kustomization.yaml apiVersion: kustomize.config.k8s.io/v1beta1 kind: Kustomization namespace: my-namespace generatorOptions: disableNameSuffixHash: true resources: - instance.yaml configMapGenerator: - name: main-agent-workspace files: - agents/main/SOUL.md - agents/main/AGENT.md - name: scheduler-workspace files: - agents/scheduler/SOUL.md - agents/scheduler/TOOLS.md
Self-configure
Allow agents to modify their own configuration by creating OpenClawSelfConfig resources via the K8s API. The operator validates each request against the instance's allowedActions policy before applying changes:
spec: selfConfigure: enabled: true allowedActions: - skills # add/remove skills - config # patch openclaw.json - workspaceFiles # add/remove workspace files - envVars # add/remove environment variables
When enabled, the operator:
- Grants the instance's ServiceAccount RBAC permissions to read its own CRD and create
OpenClawSelfConfigresources - Enables SA token automounting so the agent can authenticate with the K8s API
- Injects a
SELFCONFIG.mdskill file andselfconfig.shhelper script into the workspace - Opens port 6443 egress in the NetworkPolicy for K8s API access
The agent creates a request like:
apiVersion: openclaw.rocks/v1alpha1 kind: OpenClawSelfConfig metadata: name: add-fetch-skill spec: instanceRef: my-agent addSkills: - "@anthropic/mcp-server-fetch"
The operator validates the request, applies it to the parent OpenClawInstance, and sets the request's status to Applied, Denied, or Failed. Terminal requests are auto-deleted after 1 hour.
See the API reference for the full OpenClawSelfConfig CRD spec and spec.selfConfigure fields.
Persistent storage
By default the operator creates a 10Gi PVC and retains it when the CR is deleted (orphan behavior). Override size, storage class, or retention:
spec: storage: persistence: size: 20Gi storageClass: fast-ssd orphan: true # default -- PVC is RETAINED when the CR is deleted # orphan: false -- PVC is deleted with the CR (garbage collected)
To reuse an existing PVC (e.g., after restoring from a backup):
spec: storage: persistence: existingClaim: my-agent-data
Retention is stateful data protection. Because agent workspaces contain irreplaceable data such as memory, notebooks, and conversation history, the default is
orphan: true. To re-attach a retained PVC to a new instance, setexistingClaimto its name.
Runtime dependencies
Enable built-in init containers that install pnpm or Python/uv to the data PVC for MCP servers and skills:
spec: runtimeDeps: pnpm: true # Installs pnpm via corepack python: true # Installs Python 3.12 + uv
Custom init containers and sidecars
Add custom init containers (run after operator-managed ones) and sidecar containers:
spec: initContainers: - name: fetch-models image: curlimages/curl:8.5.0 command: ["sh", "-c", "curl -o /data/model.bin https://..."] volumeMounts: - name: data mountPath: /data sidecars: - name: cloud-sql-proxy image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.14.3 args: ["--structured-logs", "my-project:us-central1:my-db"] ports: - containerPort: 5432 sidecarVolumes: - name: proxy-creds secret: secretName: cloud-sql-proxy-sa
Reserved init container names (init-config, init-pnpm, init-python, init-skills, init-ollama) are rejected by the webhook. If your sidecar replaces the built-in gateway proxy, set spec.gateway.enabled: false to avoid running both.
Extra volumes and mounts
Mount additional ConfigMaps, Secrets, or CSI volumes into the main container:
spec: extraVolumes: - name: shared-data persistentVolumeClaim: claimName: shared-pvc extraVolumeMounts: - name: shared-data mountPath: /shared
Ingress Basic Auth
Add HTTP Basic Authentication to the Ingress. The operator auto-generates a random password and stores it in a managed Secret:
spec: networking: ingress: enabled: true className: nginx hosts: - host: my-agent.example.com security: basicAuth: enabled: true username: admin # default: "openclaw" realm: "My Agent" # default: "OpenClaw"
The generated Secret is named <name>-basic-auth and contains three keys: auth (htpasswd format for ingress controllers), username, and password (plaintext, for retrieving the auto-generated credentials). It is tracked in status.managedResources.basicAuthSecret. To use your own credentials, provide a pre-formatted htpasswd Secret:
spec: networking: ingress: security: basicAuth: enabled: true existingSecret: my-htpasswd-secret # must contain key "auth"
For Traefik ingress, a Middleware CRD resource is created automatically (requires Traefik CRDs installed).
Custom service ports
By default the operator creates a Service with the gateway (18789) and canvas (18793) ports. To expose custom ports instead (e.g., for a non-default application), set spec.networking.service.ports:
spec: networking: service: type: ClusterIP ports: - name: http port: 3978 targetPort: 3978
When ports is set, it fully replaces the default ports -- including the Chromium port if the sidecar is enabled. To keep the defaults alongside custom ports, include them explicitly. If targetPort is omitted it defaults to port. See the API reference for all fields.
CA bundle injection
Inject a custom CA certificate bundle for environments with TLS-intercepting proxies or private CAs:
spec: security: caBundle: configMapName: corporate-ca-bundle # or secretName key: ca-bundle.crt # default key name
The bundle is mounted into all containers and the SSL_CERT_FILE / NODE_EXTRA_CA_CERTS environment variables are set automatically.
ServiceAccount annotations
Add annotations to the managed ServiceAccount for cloud provider integrations:
spec: security: rbac: serviceAccountAnnotations: # AWS IRSA eks.amazonaws.com/role-arn: "arn:aws:iam::123456789:role/openclaw" # GCP Workload Identity # iam.gke.io/gcp-service-account: "openclaw@project.iam.gserviceaccount.com"
Auto-update
Opt into automatic version tracking so the operator detects new releases and rolls them out without manual intervention:
spec: autoUpdate: enabled: true checkInterval: "24h" # how often to poll the registry (1h-168h) backupBeforeUpdate: true # back up the PVC before applying an update rollbackOnFailure: true # auto-rollback if the new version fails health checks healthCheckTimeout: "10m" # how long to wait for the pod to become ready (2m-30m)
When enabled, the operator resolves latest to the highest stable semver tag on creation, then polls for newer versions on each checkInterval. Before updating, it optionally runs an S3 backup, then patches the image tag and monitors the rollout. If the pod fails to become ready within healthCheckTimeout, it reverts the image tag and (optionally) restores the PVC from the pre-update snapshot.
Safety mechanisms include failed-version tracking (skips versions that failed health checks), a circuit breaker (pauses after 3 consecutive rollbacks), and full data restore when backupBeforeUpdate is enabled. Auto-update is a no-op for digest-pinned images (spec.image.digest).
See status.autoUpdate for update progress: kubectl get openclawinstance my-agent -o jsonpath='{.status.autoUpdate}'
Backup and restore
The operator uses rclone to back up and restore PVC data to/from S3-compatible storage. All backup operations require a Secret named s3-backup-credentials in the operator namespace:
apiVersion: v1 kind: Secret metadata: name: s3-backup-credentials namespace: openclaw-operator-system stringData: S3_ENDPOINT: "https://s3.us-east-1.amazonaws.com" S3_BUCKET: "my-openclaw-backups" S3_ACCESS_KEY_ID: "<key-id>" # optional - omit for workload identity S3_SECRET_ACCESS_KEY: "<secret-key>" # optional - omit for workload identity # S3_PROVIDER: "Other" # optional - set to "AWS", "GCS", etc. for native credential chains # S3_REGION: "us-east-1" # optional - needed for MinIO or providers with custom regions
Compatible with AWS S3, Backblaze B2, Cloudflare R2, MinIO, Wasabi, and any S3-compatible API.
Cloud workload identity: Omit S3_ACCESS_KEY_ID and S3_SECRET_ACCESS_KEY and set S3_PROVIDER (e.g., AWS, GCS) to use the provider's native credential chain. Set spec.backup.serviceAccountName to a workload identity-enabled ServiceAccount (IRSA, GKE Workload Identity, AKS Workload Identity) so backup Jobs inherit the cloud IAM role. See the Workload Identity section in the API reference for a full example.
When backups run automatically:
- On delete - the operator backs up the PVC before removing any resources. Subject to
spec.backup.timeout(default: 30m) - if the backup does not complete in time, it is skipped automatically. Addopenclaw.rocks/skip-backup: "true"to skip immediately. - Before auto-update - when
spec.autoUpdate.backupBeforeUpdate: true(the default). - On a schedule - when
spec.backup.scheduleis set (cron expression).
If the Secret does not exist, backups are silently skipped and operations proceed normally.
Periodic scheduled backups:
spec: backup: schedule: "0 2 * * *" # Daily at 2 AM UTC retentionDays: 7 # Keep 7 days of daily snapshots (default) historyLimit: 3 # Successful job runs to retain (default: 3) failedHistoryLimit: 1 # Failed job runs to retain (default: 1) timeout: "30m" # Max time for pre-delete backup (default: 30m, min: 5m, max: 24h) serviceAccountName: "" # Optional: IRSA/Pod Identity SA for backup Jobs
The operator creates a Kubernetes CronJob that runs rclone to sync PVC data to S3. The CronJob uses pod affinity to co-locate on the same node as the StatefulSet pod (required for RWO PVCs). Backups use an incremental sync strategy: data is synced to a fixed latest path (only changed files uploaded), a daily snapshot is taken, and snapshots older than retentionDays are automatically pruned.
Restoring from backup:
spec: # Path recorded in status.lastBackupPath of the source instance restoreFrom: "backups/my-tenant/my-agent/2026-01-15T10:30:00Z"
The operator runs a restore job to populate the PVC before starting the StatefulSet, then clears restoreFrom automatically. Backup paths follow the format backups/<tenantId>/<instanceName>/<timestamp>.
Clone / migrate an instance: restoreFrom works on both existing and brand-new instances. To clone an instance across namespaces, create a new OpenClawInstance with spec.restoreFrom pointing to the source's backup path - the operator creates the PVC, runs the restore Job, then starts the StatefulSet. The new instance gets a fresh gateway token; the source is unaffected. The restore Job uses spec.backup.serviceAccountName when set, so workload identity (IRSA/Pod Identity) works for cross-namespace clones. For ArgoCD users, add spec.restoreFrom to ignoreDifferences since the operator auto-clears it after restore.
For full details see the Backup and Restore section in the API reference.
What the operator manages automatically
These behaviors are always applied - no configuration needed:
| Behavior | Details |
|---|---|
gateway.bind |
When the gateway proxy sidecar is enabled (default), binds to loopback and an nginx reverse proxy handles external access. When disabled (spec.gateway.enabled: false), binds to 0.0.0.0 so the gateway is reachable directly. |
| Gateway auth token | Auto-generated Secret per instance; injected into config and env |
| Control UI origins | gateway.controlUi.allowedOrigins auto-injected from localhost + ingress hosts + spec.gateway.controlUiOrigins |
OPENCLAW_GATEWAY_HANDSHAKE_TIMEOUT_MS |
10000 (10s) to work around upstream timeout regression in v2026.3.12 (#46892) |
OPENCLAW_DISABLE_BONJOUR=1 |
Always set (mDNS does not work in Kubernetes) |
| Browser profiles | When Chromium is enabled, "default" and "chrome" profiles are auto-configured with the sidecar's CDP endpoint |
| Tailscale serve config | When Tailscale is enabled, a tailscale-serve.json key is added to the ConfigMap for the sidecar's TS_SERVE_CONFIG |
| Tailscale state persistence | When Tailscale is enabled, node identity and TLS certs are persisted to a <instance>-ts-state Secret via TS_KUBE_SECRET |
| Config hash rollouts | Config changes trigger rolling updates via SHA-256 hash annotation |
| Config restoration | The init container restores config on every pod restart (overwrite or merge mode) |
For the full list of configuration options, see the API reference and the full sample YAML.
Security
The operator follows a secure-by-default philosophy. Every instance ships with hardened settings out of the box, with no extra configuration needed.
Defaults
- Non-root execution: containers run as UID 1000; root (UID 0) is blocked by the validating webhook (exception: Ollama sidecar requires root per the official image)
- Read-only root filesystem: enabled by default for the main container and the Chromium sidecar; the PVC at
~/.openclaw/provides writable home, and a/tmpemptyDir handles temp files - All capabilities dropped: no ambient Linux capabilities
- Seccomp RuntimeDefault: syscall filtering enabled
- Default-deny NetworkPolicy: only DNS (53) and HTTPS (443) egress allowed; ingress limited to same namespace
- Minimal RBAC: each instance gets its own ServiceAccount with read-only access to its own ConfigMap; operator can create/update Secrets only for operator-managed gateway tokens
- No automatic token mounting:
automountServiceAccountToken: falseon both ServiceAccounts and pod specs (enabled only whenselfConfigureis active) - Secret validation: the operator checks that all referenced Secrets exist and sets a
SecretsReadycondition - Security context propagation: when
podSecurityContext.runAsNonRootis set tofalse, the operator propagates this to init containers and applicable sidecars (tailscale, web terminal) so there is no contradiction between pod-level and container-level settings. Self-consistent sidecars (gateway-proxy, chromium, ollama) retain their own security contexts. ThecontainerSecurityContext.runAsNonRootandcontainerSecurityContext.runAsUserfields allow granular control over the main container independently of the pod level.
Validating webhook
| Check | Severity | Behavior |
|---|---|---|
runAsUser: 0 |
Error | Blocked: root execution not allowed |
| Reserved init container name | Error | init-config, init-pnpm, init-python, init-skills, init-ollama are reserved |
| Invalid skill name | Error | Only alphanumeric, -, _, /, ., @ allowed (max 128 chars). npm: prefix for npm packages, pack: prefix for skill packs; bare npm: or pack: is rejected |
| Invalid CA bundle config | Error | Exactly one of configMapName or secretName must be set |
| JSON5 with inline raw config | Error | JSON5 requires configMapRef (inline must be valid JSON) |
| JSON5 with merge mode | Error | JSON5 is not compatible with mergeMode: merge |
Invalid checkInterval |
Error | Must be a valid Go duration between 1h and 168h |
Invalid healthCheckTimeout |
Error | Must be a valid Go duration between 2m and 30m |
Warning-level checks (deployment proceeds with a warning)
| Check | Behavior |
|---|---|
| NetworkPolicy disabled | Deployment proceeds with a warning |
| Ingress without TLS | Deployment proceeds with a warning |
| Chromium without digest pinning | Deployment proceeds with a warning |
| Ollama without digest pinning | Deployment proceeds with a warning |
| Web terminal without digest pinning | Deployment proceeds with a warning |
| Ollama runs as root | Required by official image; informational |
| Auto-update with digest pin | Digest overrides auto-update; updates won't apply |
readOnlyRootFilesystem disabled |
Proceeds with a security recommendation |
| No AI provider keys detected | Scans env/envFrom for known provider env vars |
| Unknown config keys | Warns on unrecognized top-level keys in spec.config.raw |
Observability
Prometheus metrics
| Metric | Type | Description |
|---|---|---|
openclaw_reconcile_total |
Counter | Reconciliations by result (success/error) |
openclaw_reconcile_duration_seconds |
Histogram | Reconciliation latency |
openclaw_instance_phase |
Gauge | Current phase per instance |
openclaw_instance_info |
Gauge | Instance metadata for PromQL joins (always 1) |
openclaw_instance_ready |
Gauge | Whether instance pod is ready (1/0) |
openclaw_managed_instances |
Gauge | Total number of managed instances |
openclaw_resource_creation_failures_total |
Counter | Resource creation failures |
openclaw_autoupdate_checks_total |
Counter | Auto-update version checks by result |
openclaw_autoupdate_applied_total |
Counter | Successful auto-updates applied |
openclaw_autoupdate_rollbacks_total |
Counter | Auto-update rollbacks triggered |
When metrics.enabled: true (the default), the operator automatically configures a full metrics pipeline: it injects diagnostics.otel config into OpenClaw to push OTLP metrics to a lightweight OTel Collector sidecar (otel/opentelemetry-collector), which exposes a Prometheus scrape endpoint on the configured port (default 9090). No manual OpenClaw configuration is needed. If you already set diagnostics.otel in your instance config, the operator preserves your settings.
ServiceMonitor
spec: observability: metrics: enabled: true serviceMonitor: enabled: true interval: 15s labels: release: prometheus
OTLP metrics export (operator)
The operator can push its own metrics (reconciliation counters, workqueue stats, client latencies, etc.) to any OTLP-compatible backend via gRPC. This bridges all Prometheus metrics to OpenTelemetry, running alongside the existing Prometheus scrape endpoint.
# values.yaml otlp: enabled: true endpoint: "otel-collector.observability.svc:4317" insecure: true # set to false for TLS
The endpoint can also be configured via the OTEL_EXPORTER_OTLP_ENDPOINT environment variable. Metrics are pushed every 30 seconds. If the OTLP endpoint is unreachable, the operator logs a warning and continues operating normally.
PrometheusRule (alerts)
Auto-provisions a PrometheusRule with 7 alerts including runbook URLs:
spec: observability: metrics: prometheusRule: enabled: true labels: release: kube-prometheus-stack # must match Prometheus ruleSelector runbookBaseURL: https://openclaw.rocks/docs/runbooks # default
Alerts: OpenClawReconcileErrors, OpenClawInstanceDegraded, OpenClawSlowReconciliation, OpenClawPodCrashLooping, OpenClawPodOOMKilled, OpenClawPVCNearlyFull, OpenClawAutoUpdateRollback
Grafana dashboards
Auto-provisions two Grafana dashboard ConfigMaps (discovered via the grafana_dashboard: "1" label):
spec: observability: metrics: grafanaDashboard: enabled: true folder: OpenClaw # Grafana folder (default) labels: grafana_dashboard_instance: my-grafana # optional extra labels
Dashboards:
- OpenClaw Operator - fleet overview with reconciliation metrics, instance table, workqueue, and auto-update panels
- OpenClaw Instance - per-instance detail with CPU, memory, storage, network, and pod health panels
Auto-Scaling (HPA)
Enable horizontal pod auto-scaling to automatically adjust the number of replicas based on CPU and memory utilization:
spec: availability: autoScaling: enabled: true minReplicas: 1 maxReplicas: 10 targetCPUUtilization: 80 targetMemoryUtilization: 70 # optional
When enabled, the operator creates a HorizontalPodAutoscaler targeting the StatefulSet and sets the StatefulSet's replica count to nil so the HPA manages scaling. The HPA is deleted when auto-scaling is disabled.
When auto-scaling is combined with persistent storage:
- Each replica gets its own PVC via StatefulSet
VolumeClaimTemplates(nameddata-<instance>-<ordinal>) - PVCs inherit
size,storageClass, andaccessModesfromspec.storage.persistence - Retention policy is
Retainfor both scale-down and deletion -- data is preserved - If auto-scaling is later disabled, per-replica PVCs become orphaned and must be cleaned up manually
Topology Spread Constraints
Spread pods across topology domains (zones, nodes) for improved availability:
spec: availability: topologySpreadConstraints: - maxSkew: 1 topologyKey: topology.kubernetes.io/zone whenUnsatisfiable: DoNotSchedule labelSelector: matchLabels: app.kubernetes.io/instance: my-instance
Runtime Class
Schedule pods on alternative container runtimes (Kata Containers, gVisor, etc.) for VM-level isolation or security hardening:
spec: availability: runtimeClassName: kata-fc
A matching RuntimeClass resource must exist in the cluster. If unset, the default container runtime is used.
Pod Annotations
Merge extra annotations into the StatefulSet pod template. Operator-managed keys (openclaw.rocks/config-hash, openclaw.rocks/secret-hash) always take precedence and cannot be overridden.
Useful for cloud-provider hints, such as preventing GKE Autopilot from evicting long-running agent pods:
spec: podAnnotations: cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
Phases: Pending -> Restoring -> Provisioning -> Running | Updating | BackingUp | Degraded | Failed | Terminating
Deployment Guides
Platform-specific deployment guides are available for:
Development
# Clone and set up git clone https://github.com/OpenClaw-rocks/openclaw-operator.git cd openclaw-operator go mod download # Generate code and manifests make generate manifests # Run tests make test # Run linter make lint # Run locally against a Kind cluster kind create cluster make install make run
See CONTRIBUTING.md for the full development guide.
Roadmap
- v1.0.0: API graduation to
v1, conformance test suite, semver constraints for auto-update, HPA integration, cert-manager integration, multi-cluster support
See the full roadmap for details.
Don't Want to Self-Host?
OpenClaw.rocks offers fully managed hosting starting at EUR 15/mo. No Kubernetes cluster required. Setup, updates, and 24/7 uptime handled for you.
Contributing
Contributions are welcome. Please open an issue to discuss significant changes before submitting a PR. See CONTRIBUTING.md for guidelines.
Disclaimer: AI-Assisted Development
This repository is developed and maintained collaboratively by a human and Claude Code. This includes writing code, reviewing and commenting on issues, triaging bugs, and merging pull requests. The human reads everything and acts as the final guard, but Claude does the heavy lifting - from diagnosis to implementation to CI.
In the future, this repo may be fully autonomously operated, whether we humans like that or not.
License
Apache License 2.0, the same license used by Kubernetes, Prometheus, and most CNCF projects. See LICENSE for details.