GitHub - afreidah/s3-orchestrator: Multi-backend S3-compatible proxy with quota management, replication, and rebalancing

An S3-compatible orchestrator that combines multiple storage backends into a single unified endpoint. Add as many S3-compatible backends as you want (OCI Object Storage, Backblaze B2, AWS S3, MinIO, or any other) and the orchestrator presents them to clients as one or more virtual buckets. Per-backend quota enforcement lets you cap each backend at exactly the byte limit you choose, so you can stack multiple free-tier or cost-limited allocations from different providers into a single, larger storage target for backups, media, and similar data without worrying about surprise bills.

Multiple virtual buckets let different applications share the same orchestrator with isolated file namespaces and independent credentials. Each bucket's objects are stored with an internal key prefix ({bucket}/{key}), so bucket isolation requires zero changes to the storage layer or database schema.

Built-in cross-backend replication also makes this an easy way to keep your data in multiple clouds without touching your application. Point your app at the proxy, set a replication factor, and every object automatically lands in two or more providers — instant multi-cloud redundancy with zero client-side changes.

Objects are routed to backends based on the configured routing_strategy: pack (default) fills backends in config order, while spread places each write on the backend with the lowest utilization ratio. Metadata and quota tracking live in the orchestrator's database (embedded SQLite by default, PostgreSQL for multi-instance deployments); the backends only see standard S3 API calls. The orchestrator works with any standard S3 client; the supported operations are listed under S3 API Coverage.

Getting Started

Prerequisites: Go 1.26+, Docker, Make.

git clone https://github.com/afreidah/s3-orchestrator.git
cd s3-orchestrator
make run

This starts three MinIO backends via Docker Compose (the orchestrator uses embedded SQLite by default, so no external database is needed), then launches the orchestrator on localhost:9000. Test it:

aws --endpoint-url http://localhost:9000 s3 cp /etc/hostname s3://photos/test.txt
aws --endpoint-url http://localhost:9000 s3 ls s3://photos/

Default credentials: access key photoskey, secret photossecret. Web dashboard at localhost:9000/ui/ (login: admin / admin).

See the Quickstart for full details, credentials for all buckets, and troubleshooting.

Other ways to install:

Docker: docker pull ghcr.io/afreidah/s3-orchestrator:<version>
Debian: download .deb from GitHub Releases
Binary: download from GitHub Releases

Database: SQLite is embedded by default — no external database needed for single-instance use. For multi-instance deployments, configure database.driver: postgres with PostgreSQL 14+. Run s3-orchestrator init to generate a config file interactively.

Verify artifact signatures:

Container images and release checksums are signed with cosign (keyless / Sigstore):

# Verify a container image
cosign verify ghcr.io/afreidah/s3-orchestrator:<version> \
  --certificate-identity-regexp='github\.com/afreidah/s3-orchestrator' \
  --certificate-oidc-issuer='https://token.actions.githubusercontent.com'

# Verify release checksums
cosign verify-blob checksums.txt \
  --signature checksums.txt.sig \
  --certificate checksums.txt.pem \
  --certificate-identity-regexp='github\.com/afreidah/s3-orchestrator' \
  --certificate-oidc-issuer='https://token.actions.githubusercontent.com'

Operational CLI: s3-orchestrator admin --help for rebalance, drain, encryption management, and backend sync.

Architecture

              S3 clients (aws cli, rclone, etc.)
                          |
                          v
                    +-----------+
                    | S3 Orch.  |  <-- SigV4 auth, rate limiting, quota routing
                    +-----------+
                     |         |
            +--------+         +------------------+------------------+
            v                  v                  v                  v
       PostgreSQL        OCI Object         Backblaze B2          AWS S3
       (metadata)       Storage (20 GB)       (10 GB)             (5 GB)
                              \                  |                  /
                               '------------ 35 GB total ---------'

PostgreSQL stores object locations (object_locations), per-backend quota counters and orphan bytes tracking (backend_quotas), and multipart upload state (multipart_uploads, multipart_parts). Schema is applied automatically on startup via goose versioned migrations embedded in the binary. All queries are generated by sqlc from annotated SQL files and executed via pgx/v5 connection pools.
Storage layer is split into three Go packages: internal/store/core holds engine-agnostic types, role interfaces, and orchestration helpers (the multi-step transactional operations like RecordObject, PromotePending, MoveObjectLocation); internal/store/postgres and internal/store/sqlite are thin per-engine adapters that implement core.TxAdapter so the same orchestration code drives both engines. SQLite is the default for single-instance use; PostgreSQL is required for multi-instance deployments.
Backends are standard S3-compatible services accessed via AWS SDK v2, each with a dedicated tuned HTTP transport (connection pooling, idle timeout for DNS freshness). Streaming operations use a shared buffer pool to reduce GC pressure. Any provider that speaks the S3 API works -- OCI Object Storage, Backblaze B2, AWS S3, MinIO, Wasabi, etc.
Write routing selects a backend for each new object based on the routing_strategy. In pack mode (default), objects go to the first backend in config order that has available quota — good for filling free-tier allocations sequentially. In spread mode, objects go to the backend with the lowest utilization ratio ((bytes_used + orphan_bytes) / bytes_limit) — good for distributing load evenly across backends. Quota is updated atomically in a transaction alongside the object location record. Set quota_bytes: 0 (or omit it) to disable quota enforcement on a backend — useful when you don't need cost control and just want unified access or replication. Backends with a max_object_size limit automatically skip objects that exceed the limit during write routing, rebalancing, and replication — preventing repeated 413 errors from providers with per-object size restrictions.
Usage limits optionally cap monthly API requests, egress bytes, and ingress bytes per backend. When a backend exceeds a limit, writes overflow to other backends and reads fail over to replicas. Delete and abort operations always bypass limits. Limits are enforced using cached database totals (refreshed at the configured flush interval) plus unflushed counters (in Redis when configured, otherwise local in-memory atomics). Adaptive flushing automatically shortens the interval when any backend approaches a limit.

S3 API Coverage

Operation	Method	Path	Notes
ListBuckets	`GET`	`/`	Returns buckets the credential has access to
HeadBucket	`HEAD`	`/{bucket}`	Confirms bucket exists (200 if authorized)
GetBucketLocation	`GET`	`/{bucket}?location`	Returns empty `LocationConstraint`
PutObject	`PUT`	`/{bucket}/{key}`	Preserves `x-amz-meta-*` user metadata
GetObject	`GET`	`/{bucket}/{key}`	Supports `Range` header; returns `x-amz-meta-*`
HeadObject	`HEAD`	`/{bucket}/{key}`	Returns `x-amz-meta-*` user metadata
DeleteObject	`DELETE`	`/{bucket}/{key}`	Idempotent (404 from store treated as success)
DeleteObjects	`POST`	`/{bucket}?delete`	Batch delete up to 1000 keys per request
CopyObject	`PUT`	`/{bucket}/{key}`	Uses `X-Amz-Copy-Source` header (same-bucket only)
ListObjectsV1	`GET`	`/{bucket}`	Original list API, uses `marker` pagination
ListObjectsV2	`GET`	`/{bucket}?list-type=2`	Supports `delimiter` for virtual directories
ListMultipartUploads	`GET`	`/{bucket}?uploads`	Lists in-progress multipart uploads
CreateMultipartUpload	`POST`	`/{bucket}/{key}?uploads`
UploadPart	`PUT`	`/{bucket}/{key}?partNumber=N&uploadId=X`
CompleteMultipartUpload	`POST`	`/{bucket}/{key}?uploadId=X`
AbortMultipartUpload	`DELETE`	`/{bucket}/{key}?uploadId=X`
ListParts	`GET`	`/{bucket}/{key}?uploadId=X`

Batch delete (DeleteObjects) accepts an XML request body listing up to 1000 keys and returns per-key success/error results. Metadata removal is sequential (each key is its own DB transaction), while backend S3 deletes run concurrently with bounded parallelism for throughput. Failed backend deletes are enqueued to the cleanup queue for automatic retry. The response always returns HTTP 200, even when individual keys fail -- errors are reported per-key in the XML body. Quiet mode (<Quiet>true</Quiet>) suppresses the <Deleted> elements and only returns errors.

Each request must target a virtual bucket name that matches the credentials used to sign the request. Requests to a bucket the credentials aren't authorized for return 403 AccessDenied.

Every response includes an X-Amz-Request-Id header with a unique request ID for tracing. Clients can supply their own ID via X-Request-Id; otherwise the orchestrator generates one. The same ID appears in audit logs and OpenTelemetry spans.

Authentication & Multi-Bucket

Each virtual bucket has one or more credential sets. On every request, the orchestrator:

Extracts the access key from the SigV4 Authorization header, presigned URL query parameters, or token from X-Proxy-Token.
Looks up which bucket the credential belongs to.
Verifies the signature (SigV4 header or presigned query parameters) or token.
Validates the URL path bucket matches the authorized bucket.

Three auth methods are supported, checked in order:

AWS SigV4 (recommended) - Standard AWS Signature Version 4 via the Authorization header. Compatible with aws cli, SDKs, and any S3 client. Signature verification is constant-time: unknown access keys still compute a full HMAC to prevent timing side-channel enumeration.
Presigned URLs - SigV4 query-parameter authentication (X-Amz-Algorithm, X-Amz-Credential, etc.) for time-limited, shareable URLs. Works with any AWS SDK presign client. Maximum expiry: 7 days. Uses the same bucket credentials as normal requests — no additional configuration required.
Legacy token - Simple X-Proxy-Token header for backward compatibility.
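
The "unknown access keys still compute a full HMAC" idea can be sketched as follows. This is a simplified illustration rather than real SigV4: it signs a single message with one HMAC instead of deriving the SigV4 signing key, and dummySecret, sign, and verify are hypothetical names.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// dummySecret is used for unknown access keys so the HMAC cost is still paid.
// Hypothetical value for illustration.
const dummySecret = "dummy-secret-for-unknown-keys"

// sign computes HMAC-SHA256 over msg (a stand-in for the SigV4 chain).
func sign(secret string, msg []byte) []byte {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(msg)
	return mac.Sum(nil)
}

// verify does the same amount of work whether or not the access key exists,
// and compares signatures with a constant-time comparison.
func verify(secrets map[string]string, accessKey string, msg, provided []byte) bool {
	secret, known := secrets[accessKey]
	if !known {
		secret = dummySecret
	}
	match := hmac.Equal(sign(secret, msg), provided)
	return known && match
}

func main() {
	secrets := map[string]string{"photoskey": "photossecret"}
	msg := []byte("string-to-sign")
	good := sign("photossecret", msg)

	fmt.Println(verify(secrets, "photoskey", msg, good))                    // valid key and signature
	fmt.Println(verify(secrets, "nosuchkey", msg, good))                    // unknown key
	fmt.Println(verify(secrets, "nosuchkey", msg, sign(dummySecret, msg))) // unknown key never passes
}
```

The final `known && match` matters: a signature that happens to match the dummy secret must still be rejected.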

Multiple services can share a bucket by each having their own credentials that all map to the same bucket name. Access key IDs must be globally unique across all buckets.

Authentication is always required — every bucket must have at least one credential set.

For client usage examples (AWS CLI, rclone, boto3, Go SDK), see the User Guide. For deployment and operations, see the Admin Guide.

Degraded Mode (Database Circuit Breaker)

A three-state circuit breaker wraps all database access:

closed (healthy) → open (DB down) → half-open (probing) → closed

When the database becomes unreachable (consecutive failures exceed failure_threshold), the orchestrator enters degraded mode:

Reads broadcast to all backends in order (or in parallel if parallel_broadcast is enabled). A location cache (TTL configurable via cache_ttl) stores successful lookups to avoid repeated broadcasts for the same key.
Writes (PUT, DELETE, COPY, multipart) return 503 ServiceUnavailable.
Health endpoint returns degraded instead of ok.

After open_timeout elapses, the circuit enters half-open state and sends a single probe request. If the database responds, the circuit closes and normal operation resumes automatically.

Backend Circuit Breaker

Optional per-backend circuit breakers detect when individual S3 backends become unreachable (expired credentials, provider outage, network failure) and stop sending traffic to them until recovery is detected.

closed (healthy) → open (backend down) → half-open (probing) → closed

When a backend accumulates failure_threshold consecutive failures, the circuit opens:

Writes skip the unhealthy backend and route to other backends with available quota.
Reads fail over to replicas on healthy backends (requires replication.factor >= 2).
Replication creates replacement copies on healthy backends after a sustained outage (see Health-Aware Replication).
All calls to the backend return ErrBackendUnavailable immediately — no timeout waiting.

After open_timeout elapses (plus randomized jitter of up to open_timeout/4), the next organic request to the backend is allowed through as a probe. If it succeeds, the circuit closes. If it fails, the circuit reopens for another timeout period. The jitter is recomputed on each open transition to prevent multiple backends from probing simultaneously after a shared failure event.

A background watchdog service checks all circuit breakers every minute for stale half-open probes. If a probe has been in flight longer than 2 minutes (e.g. the backend accepted the TCP connection but never responded), the watchdog resets the circuit to open so a new probe can be dispatched. This prevents circuits from getting permanently stuck half-open on low-traffic backends where no new request arrives to trigger the passive stale-probe detection.

Unlike the database circuit breaker, backend circuit breakers treat all errors as failures (no error filtering). This is a per-backend wrapper — each backend has its own independent circuit breaker state.

backend_circuit_breaker:
  enabled: true
  failure_threshold: 5   # consecutive failures before opening (default: 5)
  open_timeout: "5m"     # delay before probing recovery (default: 5m)

Disabled by default. Requires a restart to enable/disable (non-reloadable).

Write Routing

The routing_strategy setting controls how the orchestrator selects a backend for new objects (PutObject, CopyObject, CreateMultipartUpload):

pack (default) — fills the first backend in config order until its quota is full, then overflows to the next. Good for maximizing usable capacity on free-tier providers where you want to fill one allocation before touching the next.
spread — places each object on the backend with the lowest utilization ratio ((bytes_used + orphan_bytes) / bytes_limit). Good for distributing storage evenly across backends to balance load and wear.

Both strategies respect quota limits — a backend with no remaining space is skipped regardless of strategy. When usage limits are configured, backends that have exceeded their monthly limits are also excluded from selection.
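
A minimal sketch of the two strategies, under stated assumptions: the Backend struct and selectBackend function are hypothetical, usage limits and max_object_size are ignored, and a backend with quota disabled is treated as 0% utilized in spread mode.

```go
package main

import "fmt"

// Backend is a hypothetical view of one backend's quota counters.
type Backend struct {
	Name        string
	BytesUsed   int64
	OrphanBytes int64
	BytesLimit  int64 // 0 means quota enforcement is disabled
}

// fits reports whether an object of the given size fits within the quota.
func (b Backend) fits(size int64) bool {
	if b.BytesLimit == 0 {
		return true
	}
	return b.BytesUsed+b.OrphanBytes+size <= b.BytesLimit
}

// utilization returns (bytes_used + orphan_bytes) / bytes_limit.
func (b Backend) utilization() float64 {
	if b.BytesLimit == 0 {
		return 0 // assumption: unlimited backends count as empty
	}
	return float64(b.BytesUsed+b.OrphanBytes) / float64(b.BytesLimit)
}

// selectBackend returns the chosen backend name, or "" if nothing fits.
// Any strategy other than "pack" is treated as "spread" in this sketch.
func selectBackend(strategy string, backends []Backend, size int64) string {
	best := -1
	for i, b := range backends {
		if !b.fits(size) {
			continue // full backends are skipped by both strategies
		}
		if strategy == "pack" {
			return b.Name // first backend in config order with room
		}
		if best == -1 || b.utilization() < backends[best].utilization() {
			best = i
		}
	}
	if best == -1 {
		return ""
	}
	return backends[best].Name
}

func main() {
	backends := []Backend{
		{Name: "oci", BytesUsed: 15, BytesLimit: 20},
		{Name: "b2", BytesUsed: 2, BytesLimit: 10},
		{Name: "aws", BytesUsed: 5, BytesLimit: 5},
	}
	fmt.Println(selectBackend("pack", backends, 1))   // oci
	fmt.Println(selectBackend("spread", backends, 1)) // b2
}
```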

Rebalancing

The rebalancer periodically moves objects between backends to optimize storage distribution. Disabled by default to avoid unexpected egress charges.

Two strategies:

pack - Fills backends in configuration order, consolidating free space on the last backend. Good for maximizing usable capacity on free-tier providers.
spread - Equalizes utilization ratios across all backends. Good for distributing load evenly.

The threshold parameter (0–1) sets the minimum utilization spread required to trigger a rebalance run. Objects are moved in configurable batch sizes with bounded concurrency (concurrency setting, default 5) for throughput.

Replication

When replication.factor is greater than 1, a background worker creates additional copies of objects on different backends to reach the target factor. Read operations automatically fail over to replicas if the primary copy is unavailable.

The worker runs once at startup to catch up on pending replicas, then continues at the configured interval.

Health-Aware Replication

When backend circuit breakers are enabled, the replication worker is aware of backend health. If a backend's circuit breaker has been open longer than unhealthy_threshold (default: 10 minutes), the replicator treats copies on that backend as unavailable and creates replacement copies on healthy backends to maintain the target replication factor.

This prevents a sustained backend outage from silently reducing effective redundancy. For example, with factor: 2 and an object on backends A and B, if B goes down and stays down past the threshold, the replicator creates a third copy on backend C — restoring two accessible copies.

The threshold prevents replacement copies from being created during brief transient failures. The replicator also prefers healthy backends as copy sources and never selects a circuit-broken backend as a replication target.

When the backend recovers, the replacement copies created during the outage leave the object over-replicated; the over-replication cleaner removes the excess automatically (see Over-Replication Cleanup).

Over-Replication Cleanup

When a backend recovers after the replicator has already created replacement copies on other backends, objects end up with more copies than the replication factor. A background worker detects and removes the excess.

The cleaner scores each copy by its backend's health and storage utilization, then removes the lowest-scoring copies until the object reaches the target factor:

Draining backend: score 0 (always removed first)
Circuit-broken backend: score 1 (removed next)
Healthy backend: score 2 + (1 − utilization ratio), range [2..3]

Among healthy backends, the most utilized backend gets the lowest score — freeing space where it is scarcest. Each object's copies are locked with FOR UPDATE to prevent races with concurrent replicator or rebalancer activity.
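
The scoring rules above can be sketched as follows. The Copy struct and the score and excess functions are hypothetical names, and the row locking is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// Copy is a hypothetical description of one stored copy of an object.
type Copy struct {
	Backend     string
	Draining    bool
	CircuitOpen bool
	Utilization float64 // backend utilization ratio, 0..1
}

// score follows the table above: lower scores are removed first.
func score(c Copy) float64 {
	switch {
	case c.Draining:
		return 0
	case c.CircuitOpen:
		return 1
	default:
		return 2 + (1 - c.Utilization) // range [2..3]
	}
}

// excess returns the copies to remove so that `factor` copies remain.
func excess(copies []Copy, factor int) []Copy {
	if len(copies) <= factor {
		return nil
	}
	sorted := append([]Copy(nil), copies...) // do not reorder the caller's slice
	sort.SliceStable(sorted, func(i, j int) bool { return score(sorted[i]) < score(sorted[j]) })
	return sorted[:len(sorted)-factor]
}

func main() {
	copies := []Copy{
		{Backend: "A", Utilization: 0.9},
		{Backend: "B", CircuitOpen: true},
		{Backend: "C", Utilization: 0.2},
	}
	for _, c := range excess(copies, 2) {
		fmt.Println("remove copy on", c.Backend) // B
	}
}
```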

The worker runs at the replication.worker_interval and shares the same batch_size and concurrency settings. It only runs when replication.factor > 1. Like the replicator, it uses a PostgreSQL advisory lock for multi-instance coordination.

Cleanup can also be triggered on demand via the admin API (POST /admin/api/over-replication), the CLI (s3-orchestrator admin over-replication --execute), or the web dashboard's Clean Excess button.

Cleanup Queue

When a backend S3 operation succeeds but the subsequent metadata update or cleanup deletion fails, an orphaned object is left on the backend — invisible to the system, consuming storage but not tracked by quotas. Rather than silently logging these failures, the orchestrator enqueues them in a persistent cleanup_queue table in PostgreSQL for automatic retry.

Orphan bytes tracking: each enqueued item records the object's size_bytes. When an item is enqueued, the corresponding backend's orphan_bytes counter in backend_quotas is incremented. All capacity checks (write routing, replication target selection) subtract orphan_bytes from available space, so the write path never overcommits storage on a backend with pending cleanups. When a cleanup succeeds, orphan_bytes is decremented. This prevents a sustained backend outage from silently allowing quota overcommitment: even if a backend is down for days and cleanup retries are exhausted, the space consumed by orphaned objects remains reserved.

A background worker runs every minute, fetching pending items and attempting to delete them from their respective backends. Failed attempts are rescheduled with exponential backoff (1m × 2^attempts, capped at 24h). After 10 failed attempts, the row is graduated to the cleanup_dlq (dead-letter) table by core.MoveCleanupToDLQ in a single transaction. orphan_bytes is intentionally NOT decremented during the move because the backend object is still on disk — the bytes really are still occupying the backend's quota, and decrementing here would lie about reclaimed capacity. Operators monitor s3o_cleanup_dlq_depth and s3o_cleanup_dlq_enqueued_total{backend} to spot unrecoverable orphans, then resolve each entry deliberately (delete the object out-of-band, then write off the row + adjust orphan_bytes by its size).

Enqueue points cover all failure sites across the codebase:

PutObject / CopyObject / CompleteMultipartUpload — orphaned object when RecordObject fails and the immediate cleanup delete also fails
PutObject / CopyObject (overwrite) — displaced copies on other backends when a key is overwritten; old copies that can't be immediately deleted are enqueued
DeleteObject — metadata removed but backend delete fails (storage leak)
UploadPart — part uploaded but RecordPart fails and cleanup delete fails
CompleteMultipartUpload / AbortMultipartUpload — temporary __multipart/ part objects not deleted
Rebalancer — orphaned copy on destination when MoveObjectLocation fails, or stale source copy after a successful move
Replicator — orphaned replica when RecordReplica fails or source is deleted during replication

Enqueue is best-effort: if the database is down (circuit breaker open), the failure is logged and the orphan is not enqueued. This avoids cascading failures — if the DB recovers, the next operation that fails will be enqueued normally.

Operators inspect exhausted items in the dead-letter table:

SELECT id, original_id, backend_name, object_key, reason, attempts,
       size_bytes, first_enqueued_at, moved_at, last_error
FROM cleanup_dlq
ORDER BY moved_at;

-- After confirming the object is gone (manual S3 delete, reconciler sweep, etc.):
BEGIN;
UPDATE backend_quotas
   SET orphan_bytes = GREATEST(0, orphan_bytes - (SELECT size_bytes FROM cleanup_dlq WHERE id = 42))
 WHERE backend_name = (SELECT backend_name FROM cleanup_dlq WHERE id = 42);
DELETE FROM cleanup_dlq WHERE id = 42;
COMMIT;

-- To push a DLQ entry back through automatic retry (e.g. after fixing the backend).
-- Wrapped in a transaction so the row is never lost or duplicated between the two statements:
BEGIN;
INSERT INTO cleanup_queue (backend_name, object_key, reason, size_bytes, next_retry, attempts, last_error)
SELECT backend_name, object_key, reason, size_bytes, NOW(), 0, last_error
  FROM cleanup_dlq WHERE id = 42;
DELETE FROM cleanup_dlq WHERE id = 42;
COMMIT;

PUT-before-COMMIT Pending Intents

The write path inserts a pending_objects row before sending the upload to the backend, then deletes that row in the same transaction that records the new object_locations row. The pattern guarantees that a DB outage between the backend PUT and the metadata commit cannot silently destroy the prior copy of an overwritten key.

If the metadata commit succeeds, the intent is gone within the same transaction and nothing else needs to happen. If the commit fails, the intent survives and a background pending reaper picks it up on the next tick:

The reaper claims the intent by deleting it transactionally (so two concurrent reapers cannot resolve the same intent).
It HEADs the destination backend.
Backend has the object — the reaper promotes the intent into object_locations in the same transaction, taking the prior copy's place. Displaced copies on other backends are returned for cleanup.
Backend does not have the object — the reaper drops the intent; the original write effectively never happened.
Concurrent successful write — if object_locations already holds a newer row for the same key (created after the intent), the intent is provably stale and dropped without writing metadata.

Configurable via write_path.pending_pattern (default: enabled, 1-minute reaper tick, 5-minute min_age so in-flight PUTs are not interrupted). Setting enabled: false reverts to the legacy delete-on-record-failure path, which trades data-loss safety for one fewer round-trip per PUT.

Lifecycle (Object Expiration)

Config-driven lifecycle rules automatically delete objects matching a key prefix after a configurable number of days. Useful for expiring temporary uploads, staging artifacts, or any objects with a known retention period.

lifecycle:
  rules:
    - prefix: "tmp/"
      expiration_days: 7
    - prefix: "uploads/staging/"
      expiration_days: 1

A background worker runs hourly and evaluates each rule against created_at timestamps in the object_locations table (uses an existing index — no schema changes needed). Deletions go through the standard DeleteObject path, so all copies are removed, quotas are decremented, and failed backend deletes are enqueued to the cleanup queue.

Rules are hot-reloadable via SIGHUP. An empty rules list (or omitting the section entirely) disables lifecycle — no advisory lock is acquired and no DB queries are executed.

Orphan Reconciliation

Optional background service that periodically scans each backend's S3 bucket and reconciles it against the metadata database. For each backend, it walks both sides as ascending key streams — S3 paginated by ListObjects and the DB paginated by ListObjectsByBackendKeyAsc — and merges them in lockstep. Keys present only on the backend are imported; keys present only in the DB are removed. Memory is bounded by the page size on each side (1000 entries) regardless of object count, so backends holding millions of objects reconcile without OOM. Rows owned by sibling virtual buckets stored on the same backend are skipped so a per-bucket pass does not affect other buckets.
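
The lockstep merge of two ascending key streams can be sketched as below. Plain slices stand in for the paginated S3 and database listings, and the reconcile function name is hypothetical; the real walk would pull the next page on demand instead of holding every key.

```go
package main

import "fmt"

// reconcile walks two ascending key lists in lockstep and returns the keys
// found only on the backend (to import) and only in the DB (to remove).
func reconcile(backendKeys, dbKeys []string) (toImport, toRemove []string) {
	i, j := 0, 0
	for i < len(backendKeys) && j < len(dbKeys) {
		switch {
		case backendKeys[i] == dbKeys[j]: // present on both sides
			i++
			j++
		case backendKeys[i] < dbKeys[j]: // backend is ahead of the DB: untracked object
			toImport = append(toImport, backendKeys[i])
			i++
		default: // DB row with no backend object
			toRemove = append(toRemove, dbKeys[j])
			j++
		}
	}
	toImport = append(toImport, backendKeys[i:]...)
	toRemove = append(toRemove, dbKeys[j:]...)
	return toImport, toRemove
}

func main() {
	backend := []string{"a", "b", "d"}
	db := []string{"b", "c", "d", "e"}
	toImport, toRemove := reconcile(backend, db)
	fmt.Println(toImport, toRemove) // [a] [c e]
}
```

Because each side only ever needs its current position, memory stays proportional to the page size rather than to the number of objects.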

reconcile:
  enabled: true       # disabled by default
  interval: "24h"     # how often to run (default: 24h)

Disabled by default. Requires a restart to enable/disable (non-reloadable). Runs under advisory lock 1009 to prevent concurrent scans across instances.

On-demand reconciliation is available via the CLI or the admin API, which is useful after backend data loss or token expiry events:

# Reconcile all backends
s3-orchestrator admin reconcile

# Reconcile a single backend (URL quoted so the shell does not treat "?" as a glob)
curl -X POST -H "X-Admin-Token: $TOKEN" \
  "http://localhost:9000/admin/api/reconcile?backend=g3"

Encryption

Optional server-side envelope encryption with AES-256-GCM. When enabled, every object is encrypted before it leaves the orchestrator — backends only ever see ciphertext. Each object gets a random 256-bit Data Encryption Key (DEK) that is wrapped by a master key before storage. The master key can come from an inline config value, a file on disk, or HashiCorp Vault Transit.

Objects are encrypted in fixed-size chunks (default 64 KB), so range requests (Range header) work without downloading the entire object — the orchestrator calculates which ciphertext chunks to fetch and decrypts only those. Clients see standard S3 behavior; encryption is fully transparent.
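
The chunk calculation for a range request can be sketched as below. ChunkSpan and spanFor are hypothetical names, the chunk size is assumed to be 65536 bytes, and the mapping from chunk index to ciphertext offset (which depends on the stored layout) is left out.

```go
package main

import "fmt"

// ChunkSpan describes which plaintext chunks cover a byte range.
type ChunkSpan struct {
	First, Last int64 // inclusive chunk indices to fetch and decrypt
	Skip        int64 // plaintext bytes to drop from the start of chunk First
	Length      int64 // plaintext bytes to return to the client
}

// spanFor maps an inclusive byte range (as in "Range: bytes=start-end")
// onto fixed-size chunks.
func spanFor(start, end, chunkSize int64) ChunkSpan {
	return ChunkSpan{
		First:  start / chunkSize,
		Last:   end / chunkSize,
		Skip:   start % chunkSize,
		Length: end - start + 1,
	}
}

func main() {
	const chunkSize = 64 * 1024 // assumption: "64 KB" means 65536 bytes
	// Range: bytes=100000-200000 touches chunks 1 through 3 only.
	fmt.Printf("%+v\n", spanFor(100000, 200000, chunkSize))
}
```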

Key features:

Chunked AES-256-GCM — each chunk has an independent nonce derived from a base nonce XORed with the chunk index, enabling random-access decryption
Envelope encryption — per-object DEKs mean rotating the master key only requires re-wrapping DEKs, not re-encrypting data
Key rotation — add the new master key, move the old one to previous_keys, and call the rotate-encryption-key admin API to re-wrap DEKs still using the old key
Encrypt existing data — the encrypt-existing admin API encrypts all unencrypted objects in-place without downtime
Decrypt existing data — the decrypt-existing admin API reverses encryption, restoring plaintext objects on backends (useful for disabling encryption or migrating away)
Vault Transit support — delegate key management to HashiCorp Vault for HSM-backed key storage. The Vault token is automatically renewed in the background; for Nomad workload identity deployments, use token_file to point at the Nomad-managed token file instead of a static token string
Unknown key ID detection — when a wrapped DEK references a key ID that isn't the current primary or any configured previous key, a warning is logged before falling back to the primary key (signals potential metadata corruption or missing rotation key)
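
The per-chunk nonce derivation can be sketched with Go's standard AES-GCM. The exact byte layout is an assumption: this example uses a 12-byte nonce and XORs the big-endian chunk index into its last 8 bytes, and chunkNonce is a hypothetical helper.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

// chunkNonce XORs the chunk index into the last 8 bytes of the base nonce,
// so every chunk of an object gets a distinct nonce under the same DEK.
func chunkNonce(base [12]byte, index uint64) [12]byte {
	var idx [8]byte
	binary.BigEndian.PutUint64(idx[:], index)
	nonce := base
	for i := range idx {
		nonce[4+i] ^= idx[i]
	}
	return nonce
}

func main() {
	dek := make([]byte, 32) // random 256-bit data encryption key
	var base [12]byte
	if _, err := rand.Read(dek); err != nil {
		panic(err)
	}
	if _, err := rand.Read(base[:]); err != nil {
		panic(err)
	}
	block, err := aes.NewCipher(dek)
	if err != nil {
		panic(err)
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		panic(err)
	}

	// Encrypt each chunk independently.
	chunks := [][]byte{[]byte("chunk zero"), []byte("chunk one")}
	sealed := make([][]byte, len(chunks))
	for i, plain := range chunks {
		n := chunkNonce(base, uint64(i))
		sealed[i] = gcm.Seal(nil, n[:], plain, nil)
	}

	// Random access: decrypt chunk 1 without touching chunk 0.
	n := chunkNonce(base, 1)
	plain, err := gcm.Open(nil, n[:], sealed[1], nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(bytes.Equal(plain, chunks[1]))
}
```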

Compatibility with backend-side encryption: If your backend already has its own server-side encryption (e.g., AWS SSE-S3 or SSE-KMS), both layers work independently. The orchestrator encrypts before uploading and the backend encrypts the ciphertext again at rest. On read, the backend decrypts its layer and returns the orchestrator's ciphertext, which the orchestrator then decrypts. This is harmless but redundant — you can safely disable the backend's encryption to avoid unnecessary KMS costs.

See the Admin Guide for setup, key rotation, and encrypting existing data.

Object Data Cache

Optional in-memory LRU cache that stores full GET responses to reduce backend API calls and egress. When a cached object is requested, the response is served directly from memory without contacting the backend. Useful for read-heavy workloads where the same objects are fetched repeatedly.

Key behaviors:

Full GET responses only — range requests bypass the cache on miss but are served from cache on hit
Admission control — objects larger than max_object_size are never cached, preventing a single large object from evicting many smaller ones
Automatic invalidation — cache entries are evicted on PutObject, DeleteObject, CopyObject, DeleteObjects, and CompleteMultipartUpload
TTL-based expiry — entries expire after the configured TTL regardless of access, bounding staleness in multi-instance deployments where writes may happen on another instance
Per-instance — each orchestrator instance maintains its own cache; caches are not shared across instances
Post-decryption — when encryption is enabled, the cache stores decrypted plaintext (same security properties as any in-process data)

cache:
  enabled: true
  max_size: "256MB"          # total cache capacity
  max_object_size: "10MB"    # largest object eligible for caching
  ttl: "5m"                  # time-to-live per entry

Disabled by default. Requires a restart to enable/disable (non-reloadable).

Rate Limiting

Optional per-IP token bucket rate limiting. When enabled, requests exceeding the configured rate return 429 SlowDown with a Retry-After: 1 header. Stale IP entries are evicted by a background goroutine every cleanup_interval (default 1m); entries not seen within cleanup_max_age (default 5m) are removed. Under high source-IP cardinality (e.g., DDoS), the map can accumulate up to cleanup_max_age worth of unique IPs before eviction runs — tune these values if memory pressure is a concern.

When running behind a reverse proxy (e.g., Traefik, nginx), configure trusted_proxies with the proxy's CIDR ranges so the orchestrator extracts the real client IP from the X-Forwarded-For header using rightmost-untrusted extraction. Without trusted_proxies, X-Forwarded-For is ignored and the direct connection IP is always used.

Usage Limits

Per-backend monthly limits for API requests, egress bytes, and ingress bytes. Set any limit to 0 (or omit it) for unlimited. Limits reset naturally each month — the usage tracking table is keyed by YYYY-MM period.

Enforcement behavior:

Writes (PutObject, CopyObject, CreateMultipartUpload, UploadPart) — backends over their limits are excluded from selection; writes overflow to the next eligible backend. If all backends are over-limit, the orchestrator returns 507 InsufficientStorage.
Reads (GetObject, HeadObject) — over-limit backends are skipped; the orchestrator tries replicas. Returns 429 SlowDown only when all copies of the object are on over-limit backends.
Deletes (DeleteObject, DeleteObjects, AbortMultipartUpload) — always allowed regardless of limits.

Effective usage is computed as DB baseline + unflushed counters + proposed operation, so enforcement stays accurate between flush/refresh cycles without double-counting. The flush interval is configurable (default 30s) and can adaptively shorten when backends approach their limits. For multi-instance deployments, optional Redis shared counters eliminate the cross-instance blind spot between flushes.
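
The effective-usage check can be sketched as below. Usage and Allows are hypothetical names, and treating a request that lands exactly on the limit as allowed is an assumption.

```go
package main

import "fmt"

// Usage holds one counter (API requests, egress bytes, or ingress bytes)
// for one backend in the current month.
type Usage struct {
	Baseline  int64 // cached total from the database
	Unflushed int64 // counted since the last flush (in memory or Redis)
}

// Allows reports whether a proposed operation stays within the limit.
// A limit of 0 means unlimited.
func (u Usage) Allows(limit, proposed int64) bool {
	if limit == 0 {
		return true
	}
	return u.Baseline+u.Unflushed+proposed <= limit
}

func main() {
	requests := Usage{Baseline: 990, Unflushed: 9}
	fmt.Println(requests.Allows(1000, 1)) // 1000 of 1000: allowed
	fmt.Println(requests.Allows(1000, 2)) // would exceed: backend is skipped
	fmt.Println(requests.Allows(0, 1_000_000))
}
```

Counting the unflushed delta on top of the database baseline is what keeps enforcement accurate between flushes: without it, a backend could overshoot its limit by up to one flush interval of traffic.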

Configuration

YAML config file specified via -config flag (default: config.yaml). Supports ${ENV_VAR} expansion.

server:
  listen_addr: "0.0.0.0:9000"
  max_object_size: 5368709120  # 5 GB (default)
  # max_concurrent_requests: 0  # total concurrent operations — HTTP + background services (0 = unlimited, default: 1000)
  # max_concurrent_reads: 0     # separate read concurrency limit (0 = use global)
  # max_concurrent_writes: 0    # separate write concurrency limit (0 = use global; background services share this budget)
  # load_shed_threshold: 0      # active shedding at this capacity ratio (0 = disabled)
  # admission_wait: "0s"        # brief wait before rejection (0 = instant)
  # backend_timeout: "30s"       # per-operation timeout for backend S3 calls (default: 30s; uses tighter of this or parent context deadline)
  # read_header_timeout: "10s"   # max time to read request headers (default: 10s)
  # read_timeout: "5m"           # max time to read entire request including body (default: 5m)
  # write_timeout: "5m"          # max time to write response (default: 5m)
  # idle_timeout: "120s"         # max time to wait for next request on keep-alive (default: 120s)
  # shutdown_delay: "0s"         # delay before toggling readiness off and draining HTTP (default: 0; LB continues routing during delay)
  # tls:
  #   cert_file: "/path/to/cert.pem"  # hot-reloaded on SIGHUP; warns if cert expires within 24h
  #   key_file: "/path/to/key.pem"
  #   min_version: "1.2"           # "1.2" (default) or "1.3"
  #   client_ca_file: ""           # CA bundle for mTLS client verification

# Virtual buckets with per-bucket credentials
buckets:
  - name: "app1-files"
    # max_multipart_uploads: 100  # optional; limit active multipart uploads per bucket (0 = unlimited)
    credentials:
      - access_key_id: "APP1_ACCESS_KEY"
        secret_access_key: "APP1_SECRET_KEY"

  - name: "shared-files"
    credentials:
      # Multiple services can share a bucket with separate credentials
      - access_key_id: "WRITER_ACCESS_KEY"
        secret_access_key: "WRITER_SECRET_KEY"
      - access_key_id: "READER_ACCESS_KEY"
        secret_access_key: "READER_SECRET_KEY"

  # Legacy token auth (backward compatibility)
  # - name: "legacy-bucket"
  #   credentials:
  #     - token: "my-secret-token"

# SQLite (default) — zero-dependency, single-instance
database:
  driver: sqlite
  path: "s3-orchestrator.db"

# PostgreSQL — required for multi-instance deployments
# database:
#   driver: postgres
#   host: "localhost"
#   port: 5432
#   database: "s3proxy"
#   user: "s3proxy"
#   password: "secret"
#   ssl_mode: "require"
#   max_conns: 50             # default: 50; size to 2-3x max_concurrent_requests
#   min_conns: 10
#   max_conn_lifetime: "5m"

routing_strategy: "pack"       # "pack" (fill in order) or "spread" (least utilized) (default: pack)

backends:
  - name: "oci"
    endpoint: "https://namespace.compat.objectstorage.region.oraclecloud.com"
    region: "us-phoenix-1"
    bucket: "my-bucket"
    access_key_id: "backend-access-key"
    secret_access_key: "backend-secret-key"
    force_path_style: true
    unsigned_payload: true    # stream uploads without buffering (auto-enabled for HTTPS, set explicitly for HTTP)
    disable_checksum: false   # disable SDK default checksums for GCS and other providers that reject them
    strip_sdk_headers: false  # strip AWS SDK v2 headers before signing for GCS compatibility
    quota_bytes: 21474836480  # 20 GB (0 or omit for unlimited)
    max_object_size: 52428800 # 50 MB per-object size limit (0 = unlimited)
    api_request_limit: 0      # monthly API request limit (0 = unlimited)
    egress_byte_limit: 0      # monthly egress byte limit (0 = unlimited)
    ingress_byte_limit: 0     # monthly ingress byte limit (0 = unlimited)

telemetry:
  metrics:
    enabled: true
    path: "/metrics"
    # listen: "127.0.0.1:9091"  # optional; serve metrics on a separate address (recommended for production)
  tracing:
    enabled: true
    endpoint: "localhost:4317"
    insecure: true
    sample_rate: 1.0             # fraction of requests that generate OTel traces (use 0.01–0.1 in production)

circuit_breaker:
  failure_threshold: 3     # consecutive DB failures before opening (default: 3)
  open_timeout: "15s"      # delay before probing recovery (default: 15s)
  cache_ttl: "60s"         # key→backend cache TTL during degraded reads (default: 60s)
  parallel_broadcast: false # fan-out reads to all backends in parallel during degraded mode (default: false)

# backend_circuit_breaker:   # per-backend circuit breakers (disabled by default)
#   enabled: false
#   failure_threshold: 5     # consecutive failures before opening (default: 5)
#   open_timeout: "5m"       # delay before probing recovery (default: 5m)

rebalance:
  enabled: false
  strategy: "pack"         # "pack" or "spread" (default: pack)
  interval: "6h"           # run interval (default: 6h)
  batch_size: 100          # max objects per run (default: 100)
  threshold: 0.1           # min utilization spread to trigger (default: 0.1)
  concurrency: 5           # parallel moves per run (default: 5)

replication:
  factor: 1                # copies per object; 1 = no replication (default: 1)
  worker_interval: "5m"    # replication worker cycle (default: 5m)
  batch_size: 50           # objects per cycle (default: 50)
  concurrency: 5           # parallel replications per cycle (default: 5)
  unhealthy_threshold: "10m" # grace period before replacing copies on circuit-broken backends (default: 10m)

cleanup_queue:
  concurrency: 10          # parallel cleanup deletions per tick (default: 10)

rate_limit:
  enabled: false
  requests_per_sec: 100    # token refill rate (default: 100)
  burst: 200               # max burst size (default: 200)
  cleanup_interval: "1m"   # stale entry eviction interval (default: 1m)
  cleanup_max_age: "5m"    # evict entries not seen within this window (default: 5m)
  # trusted_proxies:       # CIDRs whose X-Forwarded-For is trusted
  #   - "10.0.0.0/8"       # Uses rightmost-untrusted extraction
  #   - "172.16.0.0/12"

encryption:
  enabled: false
  # chunk_size: 65536           # plaintext bytes per chunk (default: 64KB, range: 4KB–1MB, power of 2)
  # master_key: "${ENCRYPTION_KEY}"  # base64-encoded 256-bit key (exactly one key source required)
  # master_key_file: "/path/to/key"  # alternative: raw 32-byte key file
  # vault:                           # alternative: Vault Transit
  #   address: "http://vault:8200"
  #   token: "${VAULT_TOKEN}"        # static token (auto-renewed via RenewSelf)
  #   # token_file: "/secrets/vault-token"  # OR file-based (for Nomad workload identity; re-read periodically)
  #   key_name: "s3-orchestrator"
  #   mount_path: "transit"          # default: transit
  #   # ca_cert: "/path/to/ca.pem"  # Vault CA certificate for TLS verification
  #   # renew_interval: "5m"        # token renewal check interval (default: 5m)
  # previous_keys:                   # old master keys for rotation (unwrap only)
  #   - "base64-encoded-old-key"

integrity:
  enabled: false               # SHA-256 content hashing for data integrity verification
  # verify_on_read: false      # hash-check GET responses as they stream
  # verify_on_replicate: true  # verify hash when creating replicas (default: true when enabled)
  # scrubber_interval: "6h"    # background verification interval (0 = disabled)
  # scrubber_batch_size: 100   # objects per scrub cycle

# cache:                        # optional: in-memory LRU object data cache
#   enabled: false               # disabled by default
#   max_size: "256MB"            # total cache capacity (default: 256MB)
#   max_object_size: "10MB"      # largest cacheable object (default: 10MB)
#   ttl: "5m"                    # per-entry time-to-live (default: 5m)

ui:
  enabled: false             # enable the built-in web dashboard
  path: "/ui"                # URL prefix (default: /ui)
  admin_key: "${UI_ADMIN_KEY}"       # access key for dashboard login
  admin_secret: "${UI_ADMIN_SECRET}" # secret key (plaintext or bcrypt hash)
  session_secret: "${UI_SESSION_SECRET}" # required — HMAC key for session cookies (independent of admin_secret)
  # admin_token: ""          # separate token for admin API (defaults to admin_key)
  # force_secure_cookies: false # always set Secure flag on cookies (for behind TLS proxy)

usage_flush:
  interval: "30s"            # base flush interval (default: 30s)
  adaptive_enabled: false    # shorten interval when near usage limits (default: false)
  adaptive_threshold: 0.8    # usage ratio to trigger fast flush (default: 0.8)
  fast_interval: "5s"        # interval when near limits (default: 5s)

# reconcile:                   # optional: periodic orphan reconciliation
#   enabled: false             # scan backends for untracked objects (default: false)
#   interval: "24h"            # how often to run (default: 24h)

# redis:                       # optional: shared usage counters for multi-instance deployments
#   address: "redis:6379"      # host:port (required when section is present)
#   password: ""               # AUTH password (omit for no auth)
#   db: 0
#   tls: false
#   key_prefix: "s3orch"       # namespace for multi-tenant Redis (default: s3orch)
#   failure_threshold: 3       # consecutive failures before local fallback (default: 3)
#   open_timeout: "15s"        # delay before probing recovery (default: 15s)

lifecycle:
  rules:                       # empty or omitted = lifecycle disabled
    - prefix: "tmp/"           # key prefix to match
      expiration_days: 7       # delete objects older than this
    - prefix: "uploads/staging/"
      expiration_days: 1
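
The `trusted_proxies` comment in the `rate_limit` section mentions rightmost-untrusted extraction. A minimal sketch of that general technique follows; `proxySet` and `clientIP` are hypothetical names, `remote` is assumed to be a bare IP with the port already stripped, and falling back to the peer address on a malformed header is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// proxySet is the parsed form of rate_limit.trusted_proxies.
type proxySet []netip.Prefix

func newProxySet(cidrs ...string) (proxySet, error) {
	set := make(proxySet, 0, len(cidrs))
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return nil, err
		}
		set = append(set, p)
	}
	return set, nil
}

func (s proxySet) contains(a netip.Addr) bool {
	for _, p := range s {
		if p.Contains(a) {
			return true
		}
	}
	return false
}

// clientIP returns the address to rate-limit on. The header is only consulted
// when the direct peer is itself a trusted proxy; it is then scanned right to
// left and the first hop outside the trusted set wins.
func (s proxySet) clientIP(remote, forwardedFor string) string {
	peer, err := netip.ParseAddr(remote)
	if err != nil || !s.contains(peer) {
		return remote
	}
	hops := strings.Split(forwardedFor, ",")
	for i := len(hops) - 1; i >= 0; i-- {
		hop, err := netip.ParseAddr(strings.TrimSpace(hops[i]))
		if err != nil {
			break // malformed entry: stop trusting the header
		}
		if !s.contains(hop) {
			return hop.String()
		}
	}
	return remote // empty header, or every hop was a trusted proxy
}

func main() {
	proxies, err := newProxySet("10.0.0.0/8", "172.16.0.0/12")
	if err != nil {
		panic(err)
	}
	// The leftmost entry is client-supplied and may be forged; the scan stops before it.
	fmt.Println(proxies.clientIP("10.0.0.5", "1.2.3.4, 198.51.100.9, 10.1.2.3"))
}
```

Scanning from the right matters because each trusted proxy appends the address it saw, so only the entries added by your own infrastructure can be believed.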

Provider quick reference — endpoint format and required flags for common S3-compatible providers:

Provider	Endpoint	`force_path_style`	Notes
AWS S3	`https://s3.<region>.amazonaws.com`	`false` (default)
MinIO	`http://<host>:9000`	`true`
OCI Object Storage	`https://<ns>.compat.objectstorage.<region>.oraclecloud.com`	`true`
Backblaze B2	`https://s3.<region>.backblazeb2.com`	`false`
Cloudflare R2	`https://<account-id>.r2.cloudflarestorage.com`	`false`	`region: auto`
Wasabi	`https://s3.<region>.wasabisys.com`	`false`
Google Cloud Storage	`https://storage.googleapis.com`	`false`	Set `disable_checksum: true` and `strip_sdk_headers: true`

See the Maximizing Free Tiers guide for detailed setup on each provider including where to find credentials.
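
The `${ENV_VAR}` expansion mentioned at the top of this section could look roughly like the sketch below. The `expand` helper and its regular expression are assumptions for illustration; in particular, how the orchestrator treats unset variables or bare `$NAME` references is not stated here, so this sketch replaces unset variables with an empty string and leaves bare `$NAME` alone.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

// bracedVar matches ${NAME}. Bare $NAME is deliberately left alone so that
// literal dollar signs in secrets are not mangled.
var bracedVar = regexp.MustCompile(`\$\{([A-Za-z_][A-Za-z0-9_]*)\}`)

// expand replaces every ${NAME} in raw with lookup(NAME).
func expand(raw string, lookup func(string) string) string {
	return bracedVar.ReplaceAllStringFunc(raw, func(m string) string {
		return lookup(m[2 : len(m)-1]) // strip the leading "${" and trailing "}"
	})
}

func main() {
	os.Setenv("UI_ADMIN_KEY", "demo-key")
	fmt.Println(expand(`admin_key: "${UI_ADMIN_KEY}"`, os.Getenv))
}
```

The standard library's `os.ExpandEnv` is a shorter alternative, but it also expands the bare `$NAME` form, which can surprise you when a password contains a dollar sign.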

Configuration Hot-Reload

The orchestrator supports hot-reloading a subset of configuration by sending SIGHUP to the running process. This lets you update credentials, quotas, rate limits, and other operational settings without restarting the service or dropping client connections.

kill -HUP $(pidof s3-orchestrator)

Reloadable vs non-reloadable settings

Setting	Reloadable	Notes
`buckets` (credentials, limits)	Yes	Credentials and `max_multipart_uploads` take effect immediately
`rate_limit`	Yes	New visitors get updated rates; existing per-IP limiters expire naturally
`backends[].quota_bytes`	Yes	Synced to database on reload
`backends[].api_request_limit`	Yes
`backends[].egress_byte_limit`	Yes
`backends[].ingress_byte_limit`	Yes
`rebalance`	Yes	Strategy, interval, threshold, concurrency, enabled/disabled
`replication`	Yes	Factor, worker interval, batch size
`usage_flush`	Yes	Interval, adaptive enabled/threshold/fast interval
`lifecycle`	Yes	Rules (prefix, expiration_days)
`integrity`	Yes	Enabled, verify_on_read, scrubber interval/batch size
`server.listen_addr`	No	Requires restart
`server.max_concurrent_requests`	No	Requires restart
`server.max_concurrent_reads`	No	Requires restart
`server.max_concurrent_writes`	No	Requires restart
`server.load_shed_threshold`	No	Requires restart
`server.admission_wait`	No	Requires restart
`server` timeouts	No	`read_header_timeout`, `read_timeout`, `write_timeout`, `idle_timeout`, `shutdown_delay`
`server.tls`	No	Requires restart
`database`	No	Requires restart
`telemetry`	No	Requires restart
`circuit_breaker`	No	Requires restart
`backend_circuit_breaker`	No	Requires restart
`ui`	No	Requires restart
`encryption`	No	Requires restart
`cache`	No	Requires restart
`redis`	No	Requires restart
`routing_strategy`	No	Requires restart
`reconcile`	No	Requires restart
`backends` (structural: endpoint, credentials, count)	No	Requires restart

On a successful reload, the orchestrator logs each reloaded section:

{"level":"INFO","msg":"SIGHUP received, reloading configuration","path":"config.yaml"}
{"level":"INFO","msg":"Reloaded bucket credentials","buckets":2}
{"level":"INFO","msg":"Reloaded rate limits","requests_per_sec":100,"burst":200}
{"level":"INFO","msg":"Reloaded backend quota limits"}
{"level":"INFO","msg":"Reloaded backend usage limits"}
{"level":"INFO","msg":"Reloaded rebalance/replication/usage-flush config"}
{"level":"INFO","msg":"Configuration reload complete"}

If the new config file is invalid, the orchestrator keeps the current configuration and logs the error:

{"level":"ERROR","msg":"Config reload failed, keeping current config","error":"invalid config: ..."}

Non-reloadable field changes are logged as warnings but do not prevent the reload of other settings:

{"level":"WARN","msg":"Config field changed but requires restart to take effect","field":"server.listen_addr"}
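
The reload rules above (an invalid file keeps the running config, and non-reloadable changes are warned about but do not block the rest) can be sketched as follows. `Config`, `liveConfig`, and the validation rule are hypothetical stand-ins, reduced to one reloadable and one non-reloadable field.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Config is a two-field stand-in for the real configuration struct.
type Config struct {
	ListenAddr     string // non-reloadable
	RequestsPerSec int    // reloadable
}

func (c Config) validate() error {
	if c.RequestsPerSec <= 0 {
		return errors.New("invalid config: requests_per_sec must be positive")
	}
	return nil
}

// liveConfig holds the configuration that request handlers read.
type liveConfig struct{ p atomic.Pointer[Config] }

func newLiveConfig(c Config) *liveConfig {
	l := &liveConfig{}
	l.p.Store(&c)
	return l
}

func (l *liveConfig) get() Config { return *l.p.Load() }

// reload validates next and swaps it in. An invalid file leaves the running
// config untouched. Non-reloadable fields keep their running value and are
// returned so the caller can log a restart warning for each one.
func (l *liveConfig) reload(next Config) (needRestart []string, err error) {
	if err := next.validate(); err != nil {
		return nil, err
	}
	cur := l.get()
	if next.ListenAddr != cur.ListenAddr {
		needRestart = append(needRestart, "server.listen_addr")
		next.ListenAddr = cur.ListenAddr
	}
	l.p.Store(&next)
	return needRestart, nil
}

func main() {
	live := newLiveConfig(Config{ListenAddr: "0.0.0.0:9000", RequestsPerSec: 100})
	_, err := live.reload(Config{ListenAddr: "0.0.0.0:9000", RequestsPerSec: -1})
	fmt.Println(err, live.get().RequestsPerSec)
	warn, _ := live.reload(Config{ListenAddr: "0.0.0.0:9001", RequestsPerSec: 250})
	fmt.Println(warn, live.get())
}
```

In a running process, `reload` would be called from a goroutine that receives on a channel registered with `signal.Notify(ch, syscall.SIGHUP)`.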

Database

The orchestrator supports two metadata-store engines:

SQLite (default) — embedded, zero-dependency, single-instance. Schema is applied at startup from a single consolidated schema.sql.
PostgreSQL — required for multi-instance deployments. Connects via pgx/v5 pools and auto-applies versioned migrations on startup using goose; migration files are embedded in the binary and tracked via a goose_db_version table so only unapplied migrations run.

Engine-agnostic orchestration lives in internal/store/core/ (transactional business logic against a TxAdapter interface). Each engine package (internal/store/postgres/, internal/store/sqlite/) is a thin adapter that implements the same TxAdapter, so the same code drives both engines.

The schema currently provisions:

Table	Purpose
`backend_quotas`	Per-backend byte limits, usage counters, and orphan bytes tracking
`object_locations`	Maps object keys to backends with size tracking
`multipart_uploads`	In-progress multipart upload metadata
`multipart_parts`	Individual parts for active multipart uploads
`backend_usage`	Monthly per-backend API request and data transfer counters
`cleanup_queue`	Retry queue for failed backend object deletions
`cleanup_dlq`	Dead-letter for `cleanup_queue` rows that exhausted retries; surfaces unrecoverable orphans for operator action
`pending_objects`	In-flight PUT intents recorded before the backend write so a DB outage can't silently destroy the prior copy
`notification_outbox`	Durable webhook event delivery queue

Quota updates are transactional: object location inserts/deletes and quota counter changes happen atomically.

All Postgres SQL queries live in internal/store/postgres/sqlc/queries/ as annotated .sql files. Type-safe Go code is generated by sqlc into internal/store/postgres/sqlc/. After editing queries, regenerate that code with sqlc.

Telemetry

Prometheus Metrics

All metrics are prefixed with s3o_. Exposed at /metrics when enabled.

Metric	Type	Labels	Description
`s3o_build_info`	Gauge	version, go_version	Build metadata
`s3o_requests_total`	Counter	method, status_code	HTTP request count
`s3o_request_duration_seconds`	Histogram	method	Request latency
`s3o_request_size_bytes`	Histogram	method	Upload sizes
`s3o_response_size_bytes`	Histogram	method	Download sizes
`s3o_inflight_requests`	Gauge	method	Currently processing
`s3o_backend_requests_total`	Counter	operation, backend, status	Backend S3 API calls
`s3o_backend_duration_seconds`	Histogram	operation, backend	Backend latency
`s3o_manager_requests_total`	Counter	operation, backend, status	Manager-level operations
`s3o_manager_duration_seconds`	Histogram	operation, backend	Manager latency
`s3o_quota_bytes_used`	Gauge	backend	Current bytes used
`s3o_quota_bytes_limit`	Gauge	backend	Quota limit
`s3o_quota_orphan_bytes`	Gauge	backend	Bytes reserved by pending cleanup items
`s3o_quota_bytes_available`	Gauge	backend	Remaining space (limit − used − orphan)
`s3o_objects_count`	Gauge	backend	Stored object count
`s3o_active_multipart_uploads`	Gauge	backend	In-progress uploads
`s3o_rebalance_objects_moved_total`	Counter	strategy, status	Objects moved by rebalancer
`s3o_rebalance_bytes_moved_total`	Counter	strategy	Bytes moved by rebalancer
`s3o_rebalance_runs_total`	Counter	strategy, status	Rebalancer executions
`s3o_rebalance_duration_seconds`	Histogram	strategy	Rebalancer execution time
`s3o_rebalance_skipped_total`	Counter	reason	Rebalancer runs skipped
`s3o_rebalance_pending`	Gauge	—	Objects planned for rebalance
`s3o_replication_pending`	Gauge	—	Objects below replication factor
`s3o_replication_copies_created_total`	Counter	—	Replica copies created
`s3o_replication_errors_total`	Counter	—	Replication errors
`s3o_replication_duration_seconds`	Histogram	—	Replication cycle time
`s3o_replication_runs_total`	Counter	status	Replication worker executions
`s3o_replication_health_copies_total`	Counter	—	Copies created to replace copies on circuit-broken backends
`s3o_over_replication_pending`	Gauge	—	Objects exceeding the replication factor
`s3o_over_replication_removed_total`	Counter	—	Excess copies removed
`s3o_over_replication_errors_total`	Counter	—	Over-replication cleanup errors
`s3o_over_replication_runs_total`	Counter	status	Over-replication worker executions
`s3o_over_replication_duration_seconds`	Histogram	—	Over-replication cleanup cycle time
`s3o_circuit_breaker_state`	Gauge	name	0=closed, 1=open, 2=half-open (name: "database" or backend name)
`s3o_circuit_breaker_transitions_total`	Counter	name, from, to	State transitions per component
`s3o_degraded_reads_total`	Counter	operation	Broadcast reads in degraded mode
`s3o_degraded_cache_hits_total`	Counter	—	Cache hits during degraded reads
`s3o_degraded_write_rejections_total`	Counter	operation	Writes rejected in degraded mode
`s3o_usage_api_requests`	Gauge	backend	Current month API request count
`s3o_usage_egress_bytes`	Gauge	backend	Current month egress bytes
`s3o_usage_ingress_bytes`	Gauge	backend	Current month ingress bytes
`s3o_usage_limit_rejections_total`	Counter	operation, limit_type	Operations rejected by usage limits
`s3o_cleanup_queue_enqueued_total`	Counter	reason	Items added to the cleanup retry queue
`s3o_cleanup_queue_processed_total`	Counter	status	Items processed from the cleanup queue (success/retry/exhausted)
`s3o_cleanup_queue_depth`	Gauge	—	Current pending items in the cleanup queue
`s3o_cleanup_dlq_depth`	Gauge	—	Unrecoverable orphans waiting in the cleanup dead-letter table
`s3o_cleanup_dlq_enqueued_total`	Counter	backend	Cleanup rows graduated to the dead-letter after exhausting retries
`s3o_rate_limit_rejections_total`	Counter	—	Requests rejected by per-IP rate limiting
`s3o_admission_rejections_total`	Counter	—	Requests rejected by server-level admission control
`s3o_lifecycle_deleted_total`	Counter	—	Objects deleted by lifecycle expiration
`s3o_lifecycle_failed_total`	Counter	—	Objects that failed lifecycle deletion
`s3o_lifecycle_runs_total`	Counter	status	Lifecycle worker executions
`s3o_audit_events_total`	Counter	event	Audit log entries emitted
`s3o_drain_active`	Gauge	—	`1` while a backend drain is in progress
`s3o_drain_objects_moved_total`	Counter	—	Objects migrated during drain
`s3o_drain_bytes_moved_total`	Counter	—	Bytes migrated during drain
`s3o_encryption_operations_total`	Counter	op	Encrypt/decrypt operations (encrypt, decrypt, decrypt_range)
`s3o_encryption_errors_total`	Counter	op, error_type	Encryption/decryption failures
`s3o_encryption_unknown_key_id_total`	Counter	—	Decryption attempts with unknown keyID (primary key fallback)
`s3o_encrypt_existing_objects_total`	Counter	status	Objects processed by encrypt-existing (success/error)
`s3o_decrypt_existing_objects_total`	Counter	status	Objects processed by decrypt-existing (success/error)
`s3o_key_rotation_objects_total`	Counter	status	DEKs re-wrapped by key rotation (success/error)
`s3o_redis_operations_total`	Counter	operation, status	Redis command outcomes (incrby, get, getset, pipeline_add, pipeline_load)
`s3o_redis_fallback_active`	Gauge	—	`1` when Redis is unavailable and using local counters
`s3o_cache_hits_total`	Counter	—	Object data cache hits
`s3o_cache_misses_total`	Counter	—	Object data cache misses
`s3o_cache_evictions_total`	Counter	—	Object data cache evictions (LRU or TTL)
`s3o_cache_size_bytes`	Gauge	—	Current memory used by cached objects
`s3o_cache_entries`	Gauge	—	Current number of cached objects
`s3o_integrity_checks_total`	Counter	operation	Integrity hash verifications performed (read, scrub)
`s3o_integrity_errors_total`	Counter	operation	Hash mismatches detected (corrupted copies enqueued for cleanup)

Quota metrics are refreshed from the metadata database (SQLite or PostgreSQL) on each usage-flush cycle (every 30 seconds by default); no backend API calls are made.

A ready-to-import Grafana dashboard covering all metrics is included at grafana/s3-orchestrator.json.

OpenTelemetry Tracing

Spans are emitted for every HTTP request, manager operation, and backend S3 call. The service registers as s3-orchestrator (resource.service.name). Traces propagate via W3C traceparent headers and are exported over gRPC OTLP to Tempo or any OTLP-compatible collector.

Trace-to-log correlation — every JSON log line emitted within an active span automatically includes trace_id and span_id fields. Log aggregators (Loki, etc.) can use these fields to link logs to their corresponding traces in Tempo or any OpenTelemetry-compatible tracing backend. Only log calls that receive a context.Context with an active span include trace context; application-level logs without a span context are unaffected.

Audit Logging

Structured audit log entries are emitted as JSON via slog for every S3 API request and significant internal operation. Each entry includes an "audit": true marker for easy filtering in log pipelines.

Request ID tracing — every S3 API request gets a unique request ID, returned in the X-Amz-Request-Id response header. Clients can supply their own via the X-Request-Id request header. The same ID flows through context to all downstream operations, appearing in both the HTTP-level audit entry and the storage-level audit entry for full request correlation. The ID is also set as a s3o.request_id attribute on OpenTelemetry spans, linking audit logs to traces.
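
The request ID flow described above can be approximated with a small `net/http` middleware. This is a sketch under assumptions: `withRequestID`, `requestIDFrom`, and the 16-byte hex ID format are hypothetical, and only the two header names come from the text.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type requestIDKey struct{}

// newRequestID returns 16 random bytes as hex. The real ID format is not
// documented here; this is just one reasonable choice.
func newRequestID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// withRequestID reuses the client's X-Request-Id when present, otherwise
// generates one, echoes it in X-Amz-Request-Id, and stores it in the context.
func withRequestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Request-Id")
		if id == "" {
			id = newRequestID()
		}
		w.Header().Set("X-Amz-Request-Id", id)
		next.ServeHTTP(w, r.WithContext(context.WithValue(r.Context(), requestIDKey{}, id)))
	})
}

// requestIDFrom is what downstream audit and storage code would call.
func requestIDFrom(ctx context.Context) string {
	id, _ := ctx.Value(requestIDKey{}).(string)
	return id
}

// roundTrip sends one in-memory request and reports the ID seen in the
// response header and the ID seen by the wrapped handler.
func roundTrip(clientID string) (header, inHandler string) {
	h := withRequestID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		inHandler = requestIDFrom(r.Context())
	}))
	req := httptest.NewRequest(http.MethodPut, "/my-files/photo.jpg", nil)
	if clientID != "" {
		req.Header.Set("X-Request-Id", clientID)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Result().Header.Get("X-Amz-Request-Id"), inHandler
}

func main() {
	fmt.Println(roundTrip("client-supplied-id"))
	fmt.Println(roundTrip(""))
}
```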

Two-level audit entries — each S3 request produces two audit log lines: one at the HTTP layer (s3.PutObject, s3.GetObject, etc.) with method, path, bucket, status, duration, and remote address, and one at the storage layer (storage.PutObject, storage.GetObject, etc.) with the backend name, object key, and size. Both share the same request_id.

Internal operation auditing — background operations generate their own correlation IDs:

Operation	Events
Rebalancer	`rebalance.start`, `rebalance.move`, `rebalance.complete`
Replicator	`replication.start`, `replication.copy`, `replication.complete`
Over-replication cleaner	`over_replication.start`, `over_replication.remove`, `over_replication.complete`
Multipart cleanup	`storage.MultipartCleanup`
Overwrite (displaced)	`storage.overwrite_displaced`
Cleanup queue	`cleanup_queue.processed`, `cleanup_queue.exhausted_to_dlq`

Example audit log entry:

{"level":"INFO","msg":"audit","audit":true,"event":"s3.PutObject","request_id":"a1b2c3d4e5f6...","operation":"PutObject","method":"PUT","path":"/my-files/photo.jpg","bucket":"my-files","status":200,"duration":"45ms"}
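
An entry shaped like the example above can be produced with the standard `log/slog` JSON handler. The helper names below are hypothetical and the field set simply mirrors the sample line. Note that slog's JSON handler also adds a `time` field, which the sample omits.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
	"time"
)

// newAuditLogger returns a JSON logger that stamps every entry with audit=true.
func newAuditLogger(w io.Writer) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, nil)).With(slog.Bool("audit", true))
}

// auditHTTP writes the HTTP-layer entry for one request.
func auditHTTP(l *slog.Logger, requestID, op, method, bucket, key string, status int, took time.Duration) {
	l.Info("audit",
		slog.String("event", "s3."+op),
		slog.String("request_id", requestID),
		slog.String("operation", op),
		slog.String("method", method),
		slog.String("path", "/"+bucket+"/"+key),
		slog.String("bucket", bucket),
		slog.Int("status", status),
		slog.String("duration", took.String()), // string form, e.g. "45ms"
	)
}

// sampleEntry logs one entry into a buffer and decodes it back.
func sampleEntry() (raw string, fields map[string]any) {
	var buf bytes.Buffer
	auditHTTP(newAuditLogger(&buf), "a1b2c3d4", "PutObject", "PUT", "my-files", "photo.jpg", 200, 45*time.Millisecond)
	if err := json.Unmarshal(buf.Bytes(), &fields); err != nil {
		panic(err)
	}
	return buf.String(), fields
}

func main() {
	raw, _ := sampleEntry()
	fmt.Print(raw)
}
```

Attaching `audit: true` once via `With` means a log pipeline can filter on a single boolean instead of matching event names.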

Webhook Notifications

Optional outbound webhooks for object mutations and operational events. Events are written to a durable notification_outbox table inside the same transaction as the originating change, then a background drainer POSTs them as CloudEvents-formatted JSON to each configured endpoint. The outbox pattern means events are never lost on crash and never sent twice for the same change.

Two event categories are supported:

Data events — S3-style object mutations (s3:ObjectCreated:Put, s3:ObjectRemoved:Delete, etc.) carrying the bucket and key.
Operational events — backend health (backend.circuit.opened, backend.capacity.warning), integrity (integrity.corruption_detected), cleanup (cleanup.exhausted), replication and lifecycle completions.

Each endpoint declares which event-type patterns it cares about and an optional HMAC-SHA256 signing key:

notifications:
  endpoints:
    - url: "https://hooks.example.com/storage"
      events:
        - "s3:ObjectCreated:*"
        - "s3:ObjectRemoved:*"
      prefix: "uploads/"          # only deliver data events under this prefix
      secret: "${HOOK_SECRET}"
      timeout: 5s
      max_retries: 5

Failed deliveries retry with exponential backoff. After max_retries, the row is dropped and an audit warning is emitted. See web/content/guides/event-notifications.md for the full event catalog and signature-verification recipe.
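
For orientation, HMAC-SHA256 signing and constant-time verification of a webhook body look like this in Go. The hex encoding and the idea of carrying the signature in a request header are assumptions; the guide referenced above is the authority on the actual header name and format.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign returns the hex-encoded HMAC-SHA256 of body under secret.
// Hex encoding is an assumption; the real wire format may differ.
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// verify recomputes the MAC and compares it in constant time.
func verify(secret, body []byte, gotHex string) bool {
	got, err := hex.DecodeString(gotHex)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil))
}

func main() {
	secret := []byte("hook-secret")
	body := []byte(`{"type":"s3:ObjectCreated:Put"}`)
	sig := sign(secret, body)
	fmt.Println(sig, verify(secret, body, sig))
}
```

A receiver should verify against the raw request bytes before parsing the JSON, since re-serializing the payload can change its byte representation and invalidate the signature.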

Web UI

A built-in web dashboard provides operational visibility and management without external tooling. When enabled, it renders a server-side HTML page at the configured path (default /ui/). All routes require authentication via HMAC-signed session cookies — users log in with an admin key/secret pair configured in the YAML config.

The dashboard shows:

Storage Summary — total bytes used/capacity across all backends with a progress bar
Backends — quota used/limit per backend with progress bars, object counts, active multipart uploads
Monthly Usage — API requests, egress, and ingress per backend with limits
Objects — interactive collapsible tree browser; buckets and directories expand on click to reveal contents, with rollup file counts and sizes
Configuration — virtual buckets, write routing strategy, replication factor, rebalance strategy, rate limit status
Logs — recent structured log output from an in-memory ring buffer (last 5,000 entries), filterable by severity level with client-side text search and optional auto-refresh

The dashboard also provides management actions:

Upload — upload files to any virtual bucket via the browser
Download — download individual objects from the file tree
Delete — delete individual objects from the file tree
Rebalance — trigger an on-demand rebalance across backends
Clean Excess — remove over-replicated copies that exceed the replication factor
Sync — import pre-existing objects from a backend's S3 bucket into the proxy database, scoped to a selected virtual bucket

The object tree uses JavaScript for lazy-loaded AJAX expansion — directories load their children on click via the /ui/api/tree endpoint. All dashboard responses include security headers (X-Frame-Options, X-Content-Type-Options, Referrer-Policy, Content-Security-Policy). Enable it in the config:

ui:
  enabled: true
  path: "/ui"                # default
  admin_key: "${UI_ADMIN_KEY}"
  admin_secret: "${UI_ADMIN_SECRET}"
  session_secret: "${UI_SESSION_SECRET}"  # required: HMAC key for session cookies

JSON APIs are available at {path}/api/dashboard, {path}/api/tree, and {path}/api/logs for programmatic access. The logs endpoint accepts optional query parameters: level, since, component, and limit. Management endpoints ({path}/api/delete, {path}/api/delete-prefix, {path}/api/upload, {path}/api/rebalance, {path}/api/clean-excess, {path}/api/sync) accept POST requests and return JSON responses. The download endpoint ({path}/api/download?key=...) accepts GET requests.

Endpoints

Public

Path	Method	Purpose
`/{bucket}/{key}`	*	S3 API (PutObject, GetObject, etc.)
`/health`	GET	Liveness — always 200; body is `ok` or `degraded` (when DB circuit is open)
`/health/ready`	GET	Readiness — 200 once startup is complete; flips to 503 during shutdown drain
`/metrics`	GET	Prometheus metrics (when `telemetry.metrics.enabled`)
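
The liveness and readiness semantics in the table above can be sketched with two handlers. The `probes` type is hypothetical, and answering 503 before startup completes is an assumption (the table only states 200 after startup and 503 during drain).

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// probes holds the two flags the health endpoints report on.
type probes struct {
	dbCircuitOpen atomic.Bool // true while the database circuit breaker is open
	ready         atomic.Bool // true after startup, false again once draining
}

// health is the liveness endpoint: always 200, body "ok" or "degraded".
func (p *probes) health(w http.ResponseWriter, _ *http.Request) {
	body := "ok"
	if p.dbCircuitOpen.Load() {
		body = "degraded"
	}
	w.WriteHeader(http.StatusOK)
	io.WriteString(w, body)
}

// readiness answers 200 while ready and 503 otherwise.
func (p *probes) readiness(w http.ResponseWriter, _ *http.Request) {
	if !p.ready.Load() {
		http.Error(w, "not ready", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// check calls a handler in memory and returns the status code and body.
func check(h http.HandlerFunc) (int, string) {
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Code, rec.Body.String()
}

func main() {
	p := &probes{}
	p.ready.Store(true)
	p.dbCircuitOpen.Store(true)
	fmt.Println(check(p.health)) // liveness stays 200 while the DB circuit is open
	p.ready.Store(false)         // shutdown drain begins
	fmt.Println(check(p.readiness))
}
```

Keeping liveness at 200 while degraded avoids restart loops during a database outage, while readiness is what takes the instance out of load balancer rotation.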

Admin API (`X-Admin-Token` required)

Path	Method	Purpose
`/admin/api/status`	GET	Backend health, quota, circuit-breaker state
`/admin/api/object-locations?key=...`	GET	Per-backend ledger for one object key
`/admin/api/cleanup-queue`	GET	Cleanup queue depth and pending sample
`/admin/api/usage-flush`	POST	Force out-of-band flush of usage counters
`/admin/api/replicate`	POST	Trigger one replication cycle
`/admin/api/log-level`	GET / PUT	View or set the running instance's log level
`/admin/api/over-replication`	GET / POST	Show pending excess copies / trigger cleanup
`/admin/api/rotate-encryption-key`	POST	Re-wrap DEKs that still reference an old master key
`/admin/api/encrypt-existing`	POST	Encrypt all unencrypted objects in-place
`/admin/api/decrypt-existing`	POST	Decrypt all encrypted objects in-place
`/admin/api/scrub`	POST	Trigger one integrity-scrub pass
`/admin/api/backfill-checksums`	POST	Compute hashes for objects predating integrity
`/admin/api/reconcile`	POST	Trigger an out-of-band reconcile pass
`/admin/api/backends/{name}/drain`	POST / GET / DELETE	Start / inspect / cancel a backend drain
`/admin/api/backends/{name}`	DELETE	Remove backend metadata (use `?purge=true` + `?confirm=true` to also delete S3 objects)

Web UI (`X-Session-Cookie` after login; enabled only when `ui.enabled`)

Path	Method	Purpose
`/ui/`	GET	Dashboard HTML
`/ui/login`	GET / POST	Login page
`/ui/api/dashboard`	GET	Dashboard data as JSON
`/ui/api/tree`	GET	Lazy-loaded directory listing
`/ui/api/upload`	POST	Multipart-form file upload
`/ui/api/download`	GET	Object download (`?key=...`)
`/ui/api/delete`	POST	Delete one object
`/ui/api/delete-prefix`	POST	Delete every object under a prefix
`/ui/api/rebalance` (+ `/status`)	POST / GET	Trigger / poll rebalance
`/ui/api/clean-excess` (+ `/status`)	POST / GET	Trigger / poll over-replication cleanup
`/ui/api/replicate` (+ `/status`)	POST / GET	Trigger / poll replicate
`/ui/api/scrub` (+ `/status`)	POST / GET	Trigger / poll integrity scrub
`/ui/api/backfill-checksums` (+ `/status`)	POST / GET	Trigger / poll checksum backfill
`/ui/api/encrypt-existing` (+ `/status`)	POST / GET	Trigger / poll encrypt-existing
`/ui/api/sync`	POST	Import objects from a backend's S3 bucket
`/ui/api/logs`	GET	Buffered log entries (`level`, `since`, `component`, `limit`)

Background Tasks

All locked background tasks apply a random startup jitter of up to half the tick interval before the first tick, preventing thundering herd on the advisory lock when multiple instances start simultaneously.
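
A minimal sketch of the startup jitter just described, assuming a uniform random delay (the distribution is not specified above; `startupJitter` and `demoInterval` are hypothetical names):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const demoInterval = time.Hour // e.g. the stale multipart cleanup tick

// startupJitter picks a uniformly random delay in [0, interval/2].
func startupJitter(interval time.Duration) time.Duration {
	half := interval / 2
	if half <= 0 {
		return 0
	}
	return time.Duration(rand.Int63n(int64(half) + 1))
}

func main() {
	fmt.Println("first tick delayed by", startupJitter(demoInterval))
}
```

A task loop would sleep for this duration once, then start its regular ticker, so instances launched together reach the advisory lock at different times.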

Task	Interval	Advisory Lock	Description
Usage flush + metrics	configurable (default 30s)	When Redis configured	Flushes usage counters to PostgreSQL, then refreshes quota stats, usage baselines, object counts, and multipart counts. Updates Prometheus gauges. Adaptive mode shortens interval near limits. Advisory lock is acquired whenever Redis is configured (regardless of health) to prevent double-counting during recovery.
Stale multipart cleanup	1h	Yes	Aborts multipart uploads older than 24h and deletes their temporary part objects.
Cleanup queue	1m	Yes	Retries failed backend object deletions with exponential backoff (1m to 24h, max 10 attempts). On the tenth consecutive failure the row graduates to `cleanup_dlq` for operator action; `orphan_bytes` stays incremented because the bytes are still on disk.
Rebalancer	configurable (default 6h)	Yes	Moves objects between backends per strategy. Only runs when enabled.
Replicator	configurable (default 5m)	Yes	Creates copies of under-replicated objects. Only runs when factor > 1. Runs once at startup.
Over-replication cleaner	configurable (default 5m)	Yes	Removes excess copies of objects that exceed the replication factor. Only runs when factor > 1.
Lifecycle	1h	Yes	Deletes objects matching lifecycle rules whose `created_at` exceeds `expiration_days`. Only runs when rules are configured.
Reconciler	configurable (default 24h)	Yes	Scans each backend for untracked objects and imports them into the metadata database via `SyncBackend`. Only runs when `reconcile.enabled: true`.
Pending reaper	configurable (default 1m)	Yes	Resolves PUT-before-COMMIT intents that survived a failed metadata commit. HEADs the destination backend and either promotes the row into `object_locations` (object present) or drops the intent (object absent). Skips intents younger than `min_age` so in-flight PUTs are not interrupted.
Scrubber	configurable (default 6h)	Yes	Random-samples objects, fetches and re-hashes them, and enqueues a cleanup if the stored `content_hash` does not match. Only runs when `integrity.enabled: true` and `scrubber_interval > 0`.
Notification drainer	5s	No	Drains `notification_outbox` rows by POSTing CloudEvents JSON to configured webhook endpoints. Optional HMAC signing per endpoint.
CB watchdog	1m	No	Checks all circuit breakers for stale half-open probes. If a probe has been in flight longer than 2 minutes, resets the circuit to open so a new probe can be dispatched. Prevents circuits from getting stuck half-open when traffic stops.
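
The cleanup queue's retry schedule can be illustrated as below. Only the 1m floor, the 24h cap, and the 10-attempt limit come from the table; the doubling factor is an assumption, and with pure doubling the cap is never reached within 10 attempts, so the real multiplier is probably different.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay   = time.Minute
	maxDelay    = 24 * time.Hour
	maxAttempts = 10
)

// nextRetry returns the wait before the next attempt after `failures`
// consecutive failures. ok is false once retries are exhausted, which is the
// point where the row would move to cleanup_dlq.
func nextRetry(failures int) (delay time.Duration, ok bool) {
	if failures >= maxAttempts {
		return 0, false
	}
	d := baseDelay
	for i := 1; i < failures; i++ {
		d *= 2 // doubling is an assumed factor
		if d >= maxDelay {
			return maxDelay, true
		}
	}
	return d, true
}

func main() {
	for f := 1; f <= maxAttempts; f++ {
		d, ok := nextRetry(f)
		fmt.Println(f, d, ok)
	}
}
```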

Background services (rebalancer, replicator, over-replication cleaner, cleanup queue) share the admission semaphore with HTTP requests, so max_concurrent_requests is the total budget for both HTTP and background backend operations.
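
A minimal sketch of what a shared admission budget means in practice, assuming a counting semaphore built on a buffered channel. The `admission` type and its methods are hypothetical names, and whether real callers block or are shed when the budget is full is not specified above.

```go
package main

import "fmt"

// admission is a counting semaphore shared by HTTP handlers and background workers.
type admission struct {
	slots chan struct{}
}

func newAdmission(maxConcurrent int) *admission {
	return &admission{slots: make(chan struct{}, maxConcurrent)}
}

// tryAcquire takes a slot without blocking; false means the budget is exhausted.
func (a *admission) tryAcquire() bool {
	select {
	case a.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release returns a previously acquired slot to the pool.
func (a *admission) release() { <-a.slots }

func main() {
	adm := newAdmission(2) // stands in for max_concurrent_requests: 2

	httpOK := adm.tryAcquire()   // an HTTP PutObject
	workerOK := adm.tryAcquire() // a replicator copy
	third := adm.tryAcquire()    // any further caller is refused
	fmt.Println(httpOK, workerOK, third) // true true false

	adm.release()
	fmt.Println(adm.tryAcquire()) // true
}
```

Because both kinds of caller draw from the same pool, a busy rebalancer or replicator reduces the slots left for client requests, which is worth keeping in mind when sizing max_concurrent_requests.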

Multi-Instance Deployment

Multiple orchestrator instances can safely share the same PostgreSQL database. Background tasks (rebalancer, replicator, cleanup queue, multipart cleanup) use PostgreSQL advisory locks to prevent concurrent execution across instances — if one instance holds the lock for a task, other instances skip that tick silently.
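
The skip-if-locked behavior can be sketched with PostgreSQL's `pg_try_advisory_lock` and `pg_advisory_unlock` functions. Everything else here is an assumption for illustration: the `runExclusive` helper, the FNV-based `lockKey` scheme, the use of session-level rather than transaction-level locks, and the `$1` placeholder style, which depends on the driver.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"hash/fnv"
)

// lockKey maps a task name to a stable 64-bit advisory-lock key.
// Hashing the name is an illustrative choice, not the project's scheme.
func lockKey(task string) int64 {
	h := fnv.New64a()
	h.Write([]byte(task))
	return int64(h.Sum64())
}

// runExclusive runs fn only if this instance wins the advisory lock.
// A dedicated connection is used because session-level advisory locks
// belong to the connection that acquired them.
func runExclusive(ctx context.Context, db *sql.DB, task string, fn func(context.Context) error) (bool, error) {
	conn, err := db.Conn(ctx)
	if err != nil {
		return false, err
	}
	defer conn.Close()

	key := lockKey(task)
	var got bool
	if err := conn.QueryRowContext(ctx, "SELECT pg_try_advisory_lock($1)", key).Scan(&got); err != nil {
		return false, err
	}
	if !got {
		return false, nil // another instance holds the lock: skip this tick
	}
	// Unlock even if ctx was cancelled while fn ran; runs before conn.Close.
	defer conn.ExecContext(context.WithoutCancel(ctx), "SELECT pg_advisory_unlock($1)", key)

	return true, fn(ctx)
}

func main() {
	// No database is opened here; this only shows that task names map to distinct keys.
	fmt.Println(lockKey("rebalancer") != lockKey("replicator"))
}
```

The non-blocking `pg_try_advisory_lock` is what makes a losing instance skip the tick rather than queue behind the winner.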

Request-serving paths (PutObject, GetObject, etc.) are stateless and work correctly with any number of instances behind a load balancer. The per-instance location cache is TTL-bounded and self-correcting. Rate limiting remains per-instance.
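
As a rough illustration of a TTL-bounded location cache, the sketch below stores a key-to-backend mapping with an expiry and drops entries that are stale or explicitly invalidated. The type and method names are hypothetical, and treating "self-correcting" as "invalidate on a failed read, then fall back to the database" is an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	backend string
	expires time.Time
}

// locationCache remembers which backend holds a key for at most ttl.
type locationCache struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[string]entry
}

func newLocationCache(ttl time.Duration) *locationCache {
	return &locationCache{ttl: ttl, m: make(map[string]entry)}
}

// put records the backend for key, valid until now+ttl.
func (c *locationCache) put(key, backend string, now time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key] = entry{backend: backend, expires: now.Add(c.ttl)}
}

// get returns the cached backend, or false when the entry is missing or expired.
func (c *locationCache) get(key string, now time.Time) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.m[key]
	if !ok || !now.Before(e.expires) {
		delete(c.m, key)
		return "", false
	}
	return e.backend, true
}

// invalidate drops an entry, for example after a read from the cached backend fails.
func (c *locationCache) invalidate(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.m, key)
}

func main() {
	c := newLocationCache(time.Minute) // TTL value is illustrative
	now := time.Now()
	c.put("unified/photos/a.jpg", "oci", now)
	fmt.Println(c.get("unified/photos/a.jpg", now.Add(30*time.Second)))
	fmt.Println(c.get("unified/photos/a.jpg", now.Add(2*time.Minute)))
}
```

Since each instance keeps its own cache, an entry made stale by another instance is wrong for at most one TTL.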

Usage Counters

Without Redis, each instance tracks usage counters independently in memory and flushes to PostgreSQL at the configured interval (default 30s). Between flushes, instances cannot see each other's accumulated usage, which can allow quota overshoot under high throughput.

With Redis configured, all instances share the same usage counters via Redis INCRBY/GET operations. The baseline+delta formula stays the same (DB baseline + counter + proposed), but the counter lives in Redis instead of local memory, eliminating the cross-instance blind spot. When Redis is active, only one instance flushes counters to PostgreSQL (coordinated via advisory lock), because the flush reads each counter with GETSET, which is a destructive read.
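
The baseline+delta formula can be written out as a one-line check. The function name `wouldExceed`, the strict `>` comparison, and the numbers below are illustrative assumptions; the example shows why a counter that only sees one instance's writes can admit a request that a shared counter would reject.

```go
package main

import "fmt"

// wouldExceed applies the baseline+delta check: the last flushed DB value,
// plus the unflushed counter, plus the bytes of the incoming write.
func wouldExceed(dbBaseline, counter, proposed, limit int64) bool {
	return dbBaseline+counter+proposed > limit
}

func main() {
	const limit = 10 << 30        // hypothetical 10 GiB quota
	baseline := int64(9 << 30)    // usage at the last flush
	local := int64(400 << 20)     // this instance's unflushed writes only
	shared := int64(800 << 20)    // both instances' unflushed writes
	proposed := int64(300 << 20)  // incoming object size

	fmt.Println("local counter: ", wouldExceed(baseline, local, proposed, limit))  // false
	fmt.Println("shared counter:", wouldExceed(baseline, shared, proposed, limit)) // true
}
```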

A circuit breaker monitors Redis health. If Redis becomes unavailable, the counter backend falls back to local in-memory counters automatically, which matches the behavior of running without Redis. A background health probe PINGs Redis periodically and, on recovery, syncs local deltas back to Redis via an additive INCRBY pipeline before resuming shared operation. The entire local counter map is swapped atomically (single pointer swap) so no concurrent Add calls can lose deltas between the snapshot and the pipeline. Stale Redis keys from before the outage expire via TTL. Local counters are zeroed only after the pipeline commits, so a crash mid-recovery cannot lose deltas. The recovery is safe for concurrent execution by multiple instances since INCRBY is additive.

redis:
  address: "redis.example.com:6379"
  password: "${REDIS_PASSWORD}"
  key_prefix: "s3orch"       # namespace for multi-tenant Redis
  failure_threshold: 3        # consecutive failures before fallback
  open_timeout: "15s"         # delay before probing recovery

Redis is optional. Without it, adaptive flushing still shortens the flush interval when any backend approaches a usage limit, improving enforcement accuracy.
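
To make the failure_threshold and open_timeout settings in the Redis config concrete, here is a small three-state breaker sketch. It is not the project's breaker package: the type, field, and method names are hypothetical, the clock is passed in explicitly to keep the example deterministic, and the struct is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"time"
)

type state int

const (
	closed state = iota
	open
	halfOpen
)

// breaker trips after threshold consecutive failures and allows one probe
// once openTimeout has elapsed.
type breaker struct {
	threshold   int
	openTimeout time.Duration
	failures    int
	st          state
	openedAt    time.Time
}

// allow reports whether a Redis call may be attempted at time now.
func (b *breaker) allow(now time.Time) bool {
	switch b.st {
	case closed:
		return true
	case open:
		if now.Sub(b.openedAt) >= b.openTimeout {
			b.st = halfOpen // let a single probe through
			return true
		}
		return false
	default: // halfOpen: a probe is already in flight
		return false
	}
}

// record feeds the outcome of an attempted call back into the breaker.
func (b *breaker) record(ok bool, now time.Time) {
	if ok {
		b.failures, b.st = 0, closed
		return
	}
	b.failures++
	if b.st == halfOpen || b.failures >= b.threshold {
		b.st, b.openedAt = open, now
	}
}

func main() {
	b := &breaker{threshold: 3, openTimeout: 15 * time.Second}
	t0 := time.Now()
	for i := 0; i < 3; i++ {
		b.record(false, t0)
	}
	fmt.Println(b.allow(t0))                       // false: circuit is open
	fmt.Println(b.allow(t0.Add(15 * time.Second))) // true: probe allowed
	b.record(true, t0.Add(16*time.Second))
	fmt.Println(b.allow(t0.Add(16 * time.Second))) // true: circuit closed again
}
```

While the circuit is open, callers would use the local in-memory counters instead of Redis.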

CLI Subcommands

Running s3-orchestrator with no subcommand starts the daemon, which accepts the -config and -mode flags. The subcommands below are all dispatched by the same binary; pass -h after any of them for usage.

version

Prints the binary version, Go version, and platform:

s3-orchestrator version
# s3-orchestrator vX.Y.Z go1.26.X linux/amd64

init

Generates a configuration file interactively. Prompts for database driver (SQLite or PostgreSQL), one or more storage backends, and one or more virtual buckets, then writes a validated config.yaml:

s3-orchestrator init                          # writes ./config.yaml
s3-orchestrator init -config /etc/s3o.yaml    # custom path

The generated config is round-tripped through the loader before being written, so the resulting file is guaranteed to validate.

help

Prints the subcommand summary:

s3-orchestrator help
s3-orchestrator -h

validate

Validates a configuration file without starting the server. Exits 0 on success with a brief summary, or exits 1 with error details:

s3-orchestrator validate -config config.yaml
# config config.yaml: valid
#   backends: 2
#   buckets:  1
#   routing:  spread

sync

Imports pre-existing objects from a backend S3 bucket into the orchestrator's metadata database. Useful when bringing an existing bucket under orchestrator management. The --bucket flag specifies which virtual bucket the imported objects belong to — keys are stored with a {bucket}/ prefix for namespace isolation.

# Import all objects from a backend into the "unified" virtual bucket
s3-orchestrator sync --config config.yaml --backend oci --bucket unified

# Preview what would be imported
s3-orchestrator sync --config config.yaml --backend oci --bucket unified --dry-run

# Import only objects under a prefix
s3-orchestrator sync --config config.yaml --backend oci --bucket unified --prefix photos/

Flag	Default	Description
`--config`	`config.yaml`	Path to configuration file
`--backend`	(required)	Backend name to sync
`--bucket`	(required)	Virtual bucket name to prefix imported keys with
`--prefix`	`""`	Only sync objects with this key prefix
`--dry-run`	`false`	Preview what would be imported without writing

Objects already tracked in the database for that backend are skipped. The command logs per-page progress and a final summary with imported count, skipped count, and total bytes imported.
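
The {bucket}/ prefixing that sync applies to imported keys can be sketched as a pair of helpers. The function names are hypothetical (the project keeps its own helpers in internal/internalkey); the sketch relies only on the {bucket}/{key} layout described in this README and on bucket names not containing a slash.

```go
package main

import (
	"fmt"
	"strings"
)

// internalKey builds the stored key for an object in a virtual bucket.
func internalKey(bucket, key string) string {
	return bucket + "/" + key
}

// splitInternalKey reverses internalKey by cutting at the first slash;
// ok is false when no bucket prefix is present.
func splitInternalKey(internal string) (bucket, key string, ok bool) {
	return strings.Cut(internal, "/")
}

func main() {
	k := internalKey("unified", "photos/2024/a.jpg")
	fmt.Println(k) // unified/photos/2024/a.jpg

	bucket, key, ok := splitInternalKey(k)
	fmt.Println(bucket, key, ok) // unified photos/2024/a.jpg true
}
```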

admin

Operational CLI for a running instance. Reads config.yaml to discover the server address and admin token. See the Admin Guide for full details.

s3-orchestrator admin status                       # backend health and usage
s3-orchestrator admin object-locations -key "..."  # find all copies of an object
s3-orchestrator admin cleanup-queue                # cleanup queue depth
s3-orchestrator admin usage-flush                  # force flush usage counters
s3-orchestrator admin replicate                    # trigger replication cycle
s3-orchestrator admin over-replication             # show over-replicated object count
s3-orchestrator admin over-replication --execute   # clean excess copies
s3-orchestrator admin over-replication --execute --batch-size 200  # with custom batch
s3-orchestrator admin log-level                    # view current log level
s3-orchestrator admin log-level -set debug         # change log level at runtime
s3-orchestrator admin drain <backend>              # start draining a backend
s3-orchestrator admin drain-status <backend>       # check drain progress
s3-orchestrator admin drain-cancel <backend>       # cancel an active drain
s3-orchestrator admin remove-backend <backend>              # remove backend DB records (S3 objects preserved)
s3-orchestrator admin remove-backend <backend> --purge      # preview: shows what would be destroyed
s3-orchestrator admin remove-backend <backend> --purge --confirm  # delete S3 objects + DB records
s3-orchestrator admin reconcile                    # reconcile DB against all backends
s3-orchestrator admin reconcile -backend g3        # reconcile a single backend
s3-orchestrator admin scrub                        # trigger an integrity scrub cycle
s3-orchestrator admin backfill-checksums           # compute hashes for unhashed objects

Development

# Install build and packaging dependencies
make tools

# Regenerate sqlc query code (after editing .sql files)
make generate

# Run locally (starts MinIO backends via Docker Compose, then runs the server with embedded SQLite)
make run

# Lint
make lint

# Static analysis
make vet

# Scan Go dependencies for known vulnerabilities
make govulncheck

# Run unit tests
make test

# Run integration tests (requires Docker)
make integration-test

# Build local Docker image
make build

# Create a new database migration
make migration

# Build multi-arch and push to registry
make push VERSION=vX.Y.Z

# Build a .deb package for the host architecture
make deb VERSION=X.Y.Z

# Build .deb packages for both amd64 and arm64
make deb-all VERSION=X.Y.Z

# Build and run lintian validation
make deb-lint VERSION=X.Y.Z

# Publish .deb packages to an Aptly repository
make publish-deb

# Dry-run GoReleaser locally (builds everything without publishing)
make release-local

Deployment

The orchestrator can run as a Docker container, a native systemd service, or on container orchestration platforms. Production-ready manifests for Nomad and Kubernetes are in deploy/, with local demo scripts that stand up a complete environment in one command.

Container Orchestration (Nomad / Kubernetes)

Example manifests in deploy/ demonstrate a three-backend setup with replication factor 2, spread routing, and full observability. Local demo scripts build from source and deploy against docker-compose backing services:

# Kubernetes via k3d (requires: docker, k3d, kubectl)
make kubernetes-demo

# Nomad in dev mode (requires: docker, nomad)
make nomad-demo

See deploy/README.md for production deployment instructions and customization options (TLS, mTLS, Vault integration, Ingress).

Prerequisites

PostgreSQL database (schema auto-applied on startup)
At least one S3-compatible storage backend
Configuration file with credentials
Redis (optional — for shared usage counters in multi-instance deployments)
TLS termination — either via the built-in server.tls config or a reverse proxy (Traefik, nginx, Ingress). Plain HTTP exposes SigV4 signatures and object data on the wire, and the UNSIGNED-PAYLOAD streaming mode means body integrity depends entirely on transport security. See the Security Hardening guide for TLS and mTLS setup.

Docker

Build and push a multi-arch image with a version tag:

make push VERSION=vX.Y.Z

The VERSION is baked into the binary via -ldflags and displayed in the web UI header and /health endpoint. Defaults to the value in .version if omitted.
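
For readers unfamiliar with the technique, this is roughly how a value is baked in with -ldflags in Go: the linker's `-X` option overwrites a package-level string variable at build time. The variable name `version` and the `main` package path are assumptions; the project's actual symbol may differ.

```go
package main

import "fmt"

// version is overwritten at link time, for example:
//
//	go build -ldflags "-X main.version=vX.Y.Z" .
//
// Without the flag it keeps this default. The variable name is an assumption.
var version = "dev"

func main() {
	fmt.Println("s3-orchestrator", version)
}
```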

Debian Package

Build a .deb package for bare-metal or VM deployments:

make deb VERSION=X.Y.Z

Install and configure:

sudo dpkg -i s3-orchestrator_X.Y.Z_amd64.deb
sudo vim /etc/s3-orchestrator/config.yaml
sudo vim /etc/default/s3-orchestrator   # set DB_PASSWORD, backend keys, etc.
sudo systemctl start s3-orchestrator

The package installs:

Path	Purpose
`/usr/bin/s3-orchestrator`	Binary
`/etc/s3-orchestrator/config.yaml`	Configuration (conffile, preserved on upgrade)
`/etc/default/s3-orchestrator`	Environment variables for `${VAR}` expansion
`/usr/lib/systemd/system/s3-orchestrator.service`	Systemd unit
`/var/lib/s3-orchestrator/`	Data directory

The systemd unit runs as a dedicated s3-orchestrator user with filesystem hardening (ProtectSystem=strict, ProtectHome=yes, NoNewPrivileges=yes). Config reload via systemctl reload s3-orchestrator sends SIGHUP.
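
The SIGHUP-driven reload can be illustrated with Go's os/signal package. This is a sketch under stated assumptions, not the daemon's code: `watchReload` and `reloadOnSelfSignal` are hypothetical names, the reload callback is a stand-in for re-reading config.yaml, and `syscall.Kill` makes the demo Unix-only.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// watchReload registers for SIGHUP before returning, then calls reload
// on every signal until ctx is cancelled.
func watchReload(ctx context.Context, reload func()) {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGHUP)
	go func() {
		defer signal.Stop(ch)
		for {
			select {
			case <-ch:
				reload()
			case <-ctx.Done():
				return
			}
		}
	}()
}

// reloadOnSelfSignal sends SIGHUP to this process, standing in for
// `systemctl reload`, and reports whether the callback ran within two seconds.
func reloadOnSelfSignal() bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	reloaded := make(chan struct{}, 1)
	watchReload(ctx, func() { reloaded <- struct{}{} })

	if err := syscall.Kill(syscall.Getpid(), syscall.SIGHUP); err != nil {
		return false
	}
	select {
	case <-reloaded:
		return true
	case <-time.After(2 * time.Second):
		return false
	}
}

func main() {
	fmt.Println("reloaded:", reloadOnSelfSignal())
}
```

Registering the handler before the goroutine starts matters: an unhandled SIGHUP terminates the process by default.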

Releasing

Tag a version and push to trigger an automated GitHub Release via GoReleaser:

This regenerates CHANGELOG.md via git-cliff, tags the current .version value, and pushes the tag. The tag triggers GoReleaser to build Linux binaries (amd64 + arm64), Debian packages, and SHA256 checksums — all attached to the GitHub Release.

To regenerate the changelog without releasing:

make changelog

Commit categorization is configured in cliff.toml. Commit messages starting with Add, Fix, Harden, Refactor, Improve, docs:, test:, or chore(deps): are automatically grouped into the appropriate section.

Docker images are still built manually since the private registry isn't reachable from GitHub Actions:

make push VERSION=vX.Y.Z

To dry-run the release locally (builds everything without publishing):

make release-local

Project Structure

cmd/s3-orchestrator/         Binary entry: subcommand dispatch + thin shims
  main.go                    Entry point, subcommand dispatch
  admin.go / init_cmd.go / sync.go    Shim into internal/cli/{adminctl,initcmd,synccmd}
  validate.go / version.go   Validate-config and version subcommands

internal/
  cli/                       CLI-side dispatch and bootstrap
    serve/                   Daemon lifecycle: build the DI injector, start HTTP, SIGHUP reload, shutdown
    adminctl/                Admin operational CLI (HTTP client wrapping the admin API)
    initcmd/                 Interactive config-file generator
    synccmd/                 Pre-existing bucket import CLI

  di/                        Single wiring point for samber/do/v2
    di.go                    Every Provide<X> for stores, workers, handlers, backends
    services.go              Lifecycle-managed background services

  transport/                 HTTP interface layer (no business logic)
    s3api/                   S3-compatible XML/REST API
      server.go              HTTP router, bucket resolution, key prefixing, metrics
      buckets.go             HeadBucket, GetBucketLocation, ListBuckets, versioning stubs
      objects.go             PUT, GET, HEAD, DELETE, COPY, DeleteObjects handlers
      list.go                ListObjectsV1 / V2 handlers
      multipart.go           Multipart upload handlers
      helpers.go             Path parsing, header guards, S3 XML error responses
      ratelimit.go           Per-IP token bucket
      admission.go           Concurrency limit + load shedding
    admin/handler.go         Admin API: status, drain, replicate, scrub, encrypt-existing, etc.
    auth/auth.go             BucketRegistry, SigV4 verification, legacy token auth
    ui/                      Web dashboard
      handler.go             HTTP handler + session auth + JSON APIs
      admin_actions.go       Async-trigger endpoints (rebalance, scrub, encrypt-existing, ...)
      async.go               Shared async-job result store consumed by /status endpoints
      templates.go           Embedded template loader + formatting helpers
      templates/             Dashboard and login HTML
      static/                CSS, JS (directory tree, log viewer)
    httputil/
      clientip.go            X-Forwarded-For + X-Forwarded-Proto with trusted-proxy CIDRs
      loginthrottle.go       Per-IP brute-force protection
      certreloader.go        TLS certificate hot-reload + expiry warning

  observe/                   Observability layer
    audit/audit.go           Request-id context plumbing + structured audit logger
    telemetry/               Per-domain Prometheus metric files (metrics_*.go) + OTel helpers
    event/event.go           Notification event types + Emit hook

  config/                    YAML loader split by domain (server, database, backends, ...)

  breaker/
    breaker.go               Three-state CircuitBreaker state machine
    registry.go              Watchdog-swept registry of all breakers (DB + per-backend)

  backend/
    s3.go                    ObjectBackend interface + S3Backend (AWS SDK v2)
    circuitbreaker.go        Per-backend CircuitBreaker wrapper
    backendtest/             Failure-injectable wrapper used by tests

  store/                     Metadata store
    circuitbreaker.go        Database CircuitBreaker wrapper
    cb_*.go                  Per-role CB decorators (Object, Pending, Cleanup, Quota, ...)
    core/                    Engine-agnostic orchestration
      types.go               Domain types (ObjectLocation, PendingObject, CleanupQueueRow, ...)
      errors.go              Sentinel errors and structured S3Error
      interfaces.go          Narrow per-role store interfaces
      adapter.go             TxAdapter (the per-engine seam) + Reader
      runner.go              Runner interface + generic WithTxVal[T] helper
      objects.go             RecordObject, DeleteObject, MoveObjectLocation, ImportObject
      pending.go             PromotePending orchestration
      cleanup.go             SweepStaleCleanupQueueRows + MoveCleanupToDLQ
      replication.go         RecordReplica
      helpers.go             Engine-agnostic helpers (intentSuperseded, applyQuotaDeltas, ...)
    postgres/                Postgres engine adapter
      store.go               *Store satisfies core.Runner via WithTx
      adapter.go             pgTxAdapter satisfies core.TxAdapter against sqlc.Queries
      objects.go / quota.go / multipart.go / replication.go / cleanup_queue.go / pending.go
      admin.go / advisory_lock.go / integrity.go / notifications.go / usage.go
      migrations/            Versioned goose migrations (embedded)
      sqlc/                  Generated type-safe query code (do not edit)
    sqlite/                  SQLite engine adapter
      store.go / adapter.go / objects.go / quota.go / multipart.go / pending.go
      cleanup.go / replication.go / admin.go / directory.go / migrations.go
      schema.sql             Consolidated schema (translates Postgres migrations)

  counter/                   Per-backend usage counters
    counter.go               CounterBackend interface + field constants
    local.go                 In-memory atomic backend (default)
    redis.go                 Redis shared backend with CB fallback
    tracker.go               Usage limit enforcement, baseline management, flush

  proxy/                     Manager layer
    manager.go               BackendManager: composition root, routing, config accessors
    manager_writepath.go     PUT-before-COMMIT pending-row write path
    objects.go               ObjectManager type, constructor, shared helpers
    objects_read.go          Read failover, broadcast reads, GetObject, HeadObject, ListObjects
    objects_write.go         PutObject, CopyObject, DeleteObject, DeleteObjects
    multipart.go             Multipart lifecycle
    reconcile.go             Bounded-memory sorted-merge reconciliation engine
    core.go                  Shared infrastructure (timeout, admission, routing helpers)
    lifecycle.go             Lifecycle expiration rule processing
    integrity.go             Integrity-aware GET wrapper (read-time hash verification)
    encryption_helpers.go    On-write encrypt + on-read decrypt adapters
    cache.go                 LocationCache (key -> backend) with TTL + background eviction
    stores.go                Stores struct bundling the narrow per-role interfaces
    drain/                   Backend-drain coordinator
    dashboard/               DashboardData aggregation + lazy directory listing
    metrics/                 Manager-level Collector (per-op record + periodic gauge refresh)
    proxytest/               Test-only helper: AttachWorkers, StoresFromMock

  worker/                    Background services
    ops_runtime.go           Runtime-side ops interfaces (admission, timeout, usage, backend access)
    ops_store.go             Per-worker store-role interfaces
    rebalancer.go            Object rebalancing across backends
    replicator.go            Cross-backend object replication
    overreplication.go       Over-replication detection + excess-copy cleanup
    cleanup.go               Cleanup queue retry worker (graduates to DLQ on exhaustion)
    pending.go               PendingReaper (PUT-before-COMMIT intent resolver)
    scrubber.go              Integrity scrubber + content-hash backfill
    reconciler.go            Orphan reconciler driver (consumes proxy/reconcile engine)

  notify/notifier.go         Webhook notification drainer (notification_outbox pattern)
  encryption/                Envelope encryption (AES-256-GCM, key providers, Vault Transit)
  cache/                     Object-data LRU cache with TTL
  lifecycle/                 Generic supervisor for long-lived services
  internalkey/               Internal key prefix helpers shared by transport + store
  testutil/                  Shared test fakes (MockStore, builders)

  integration/               End-to-end tests against MinIO + Postgres testcontainers
                             (gated by `//go:build integration`)

grafana/
  s3-orchestrator.json       Grafana dashboard (all Prometheus metrics)
sqlc.yaml                    sqlc configuration
Dockerfile                   Multi-stage build
Makefile                     Build, test, lint, generate, push, deb targets
nfpm.yaml                    Debian package definition (nfpm)
packaging/
  s3-orchestrator.service    Systemd unit file
  config.yaml                Sample config installed to /etc/s3-orchestrator/
  s3-orchestrator.default    Default env file installed to /etc/default/
  preinstall.sh              Creates system user and directories
  postinstall.sh             Enables systemd service
  postremove.sh              Purge cleanup (removes user and data)
  changelog                  Debian changelog
  copyright                  Debian copyright file
  lintian-overrides          Lintian override rules
cliff.toml                   git-cliff changelog generation config
CHANGELOG.md                 Auto-generated changelog (make changelog)
config.example.yaml          Configuration reference
deploy/
  nomad/
    s3-orchestrator.nomad.hcl  Production Nomad job (Vault integration)
    local/
      s3-orchestrator.nomad.hcl  Local dev job (docker-compose backing services)
      demo.sh                  One-command Nomad dev demo
  helm/
    s3-orchestrator/
      Chart.yaml               Helm chart metadata
      values.yaml              Default production values
      templates/               Deployment, Service, ConfigMap, Secret, Ingress, etc.
  kubernetes/
    local/
      values.yaml              Local dev Helm values (docker-compose backing services)
      demo.sh                  One-command k3d demo

Additional Documentation

Guide	Description
Quickstart	Get running in under a minute
User Guide	S3 client configuration and usage
Admin Guide	Configuration, operations, monitoring, deployment
API Reference	UI and Admin API JSON endpoint documentation
Security Hardening	TLS, mTLS, config security, network segmentation
Performance Tuning	Connection pools, timeouts, routing, rebalancer tuning
Disaster Recovery	Failure scenarios and recovery procedures
Version Migration	Upgrade guide, config changes by version
Style Guide	Coding conventions for contributors
Contributing	How to build, test, and submit changes