atlas9: building a better software experience

atlas9 is about an itch I need to scratch.

There is so much stuff to figure out in software projects: API design, databases, builds, deploys, async tasks, auth, workflows, releases, docs, flags, config, tests, telemetry, monitoring, logs, infrastructure-as-code, and on and on.

Large amounts of energy (time, emotion, money) are spent on this stuff. And when teams get this stuff wrong, even more energy is spent either correcting it or living with it.

I want to build something that takes care of the common stuff, so that software teams are happier and more efficient. But, what exactly should I build? Who exactly am I building it for?

I have thoughts and questions and some opinions, and I’m starting this blog to write those down, in the hope that they will resonate with others and that discussion and community will come to life.

Part of the struggle is bringing technologies together.

There’s no shortage of incredible tools at our disposal – PostgreSQL, Kubernetes, S3, Dynamo, Docker, Django, Rails, GitHub, Terraform, OpenTelemetry, Datadog, Splunk, AWS, GCP, Azure, etc.

And yet, it feels harder than ever to pull it all together. Kubernetes and AWS are super powerful, and overwhelming. PostgreSQL feels easy and powerful, until the query planner changes its mind at 1am, or you run out of connections, or your vacuum isn’t running fast enough, or you need to upgrade, etc. Terraform works great until it doesn’t. For all the words written about observability, I still struggle to set it all up and get the data I need (and I worked for an observability company for years!). Even in a sophisticated, mature web application framework, teams struggle with common things like API error messages, validation, defaults, database transactions or replicas, etc.

And each piece makes it harder to have a smooth developer experience, or one that usefully matches real-world production behavior.

I’m very fortunate to have always worked with great people. My favorite thing about work is always the people – they’re smart, experienced, fun, passionate, interesting, and I learn from them non-stop.

And yet, collectively, we (myself included) get so many seemingly simple details wrong. Like keeping fields consistently snake_case or camelCase in an API. Or applying defaults and validation to API resources in a consistent way. Or using database transactions correctly. Or knowing where to set defaults in the many layers of config. I believe we all agree that by the year 2026, it should be really hard to get these seemingly simple details wrong.
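As a rough illustration of what "hard to get wrong" could look like for field naming, here is a minimal Go sketch. It is not atlas9 itself, and the `Account` type and `NonSnakeCaseFields` helper are hypothetical names made up for this example. It uses reflection to list the JSON field names on a struct that are not snake_case, which is the kind of check a framework or CI step could run automatically.

```go
package main

import (
	"fmt"
	"reflect"
	"regexp"
	"strings"
)

// snakeCase matches names such as "id" or "created_at".
var snakeCase = regexp.MustCompile(`^[a-z][a-z0-9]*(_[a-z0-9]+)*$`)

// Account is a hypothetical API resource, used only for illustration.
type Account struct {
	ID        string `json:"id"`
	CreatedAt string `json:"created_at"`
	OwnerName string `json:"ownerName"` // inconsistent on purpose
}

// NonSnakeCaseFields returns the JSON field names declared on v's struct
// type that are not snake_case. Fields without a JSON name are skipped.
func NonSnakeCaseFields(v any) []string {
	t := reflect.TypeOf(v)
	if t == nil {
		return nil
	}
	if t.Kind() == reflect.Pointer {
		t = t.Elem()
	}
	if t.Kind() != reflect.Struct {
		return nil
	}
	var bad []string
	for i := 0; i < t.NumField(); i++ {
		// The JSON name is the part of the tag before the first comma.
		name, _, _ := strings.Cut(t.Field(i).Tag.Get("json"), ",")
		if name == "" || name == "-" {
			continue
		}
		if !snakeCase.MatchString(name) {
			bad = append(bad, name)
		}
	}
	return bad
}

func main() {
	fmt.Println(NonSnakeCaseFields(Account{}))
}
```

Run as-is, this prints `[ownerName]`. A real tool would also need to handle nested and embedded types, and a team that prefers camelCase would simply swap the pattern.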

It points to the fact that building software systems, especially with more than a couple people, is still hard. I hope atlas9 can be a tool that takes care of more of these details.

At the bottom of this post, I’ve included a long list of things that I’ve seen teams struggle with, or that I have struggled with myself. It’s a long list! And I bet it could be much longer if I A) had a better memory, B) didn’t block out bad memories, C) thought harder about it, and D) asked others to contribute their struggles. I think a lot of these topics are very common in software organizations of almost any size. It’s mind-boggling how much we have to figure out every time, and how much opportunity for missteps that creates.

My experience is mostly in web applications and distributed systems, so atlas9 is focused on the issues I’ve encountered there.

atlas9 starts with some thoughtful writing (and hopefully discussion) about various patterns and tradeoffs. That’s why I’m starting this blog.

atlas9 will most likely include an application framework, most likely written in Go. There are a lot of existing frameworks out there, most of which I’m not yet very familiar with, so I have some research to do. My experience with frameworks is that they’re flexible, and that flexibility leaves a lot of room for decisions to be made. Those decisions can take a lot of effort, and each decision has some risk of being a misstep. Also, honestly, I’ve become fairly annoyed with some of the patterns encouraged by popular languages and frameworks, and over the many years of my career I’ve developed my preferred style, and I expect that will be reflected in atlas9 to some degree.

I suspect atlas9 will need to be fairly opinionated in some ways. I suspect that “taking care of the common stuff” will mean making decisions about which common patterns to use. The tradeoff might be that atlas9 doesn’t support every way of doing things, it doesn’t fit into every box, it doesn’t work for everyone.

I suspect that atlas9 doesn’t need to be just one design. I’ve been pondering the gap between small, medium, and large projects. A small project could be fine with a single server and a sqlite database. A medium project might want Kubernetes and multiple services. A different medium (or even large) project might be fine on a few servers with a single large postgres database. Can atlas9 be useful at all these scales? Which one am I designing for first?

People have probably been feeling this way about technology forever. “It should be better than this” is perhaps a great driver of innovation in all areas of technology and knowledge.

And there are people out there making it better, like:

  • Railway

  • Render

  • Fly.io

  • OpenShift

  • Supabase

  • many others I’m sure (send me a comment or message with others)

Some of these seem like really cool companies and products. I recognize that atlas9 is covering some of the same ground. That’s ok. There are probably still gaps to fill, and many people aren’t able or don’t want to use these hosted services. Figuring out what atlas9 can do that complements these projects is on the long list of questions I have in mind.

If you’re interested in this idea, if you want to follow along or be part of the discussion, please let me know! You can subscribe, you can comment, or you can send me a message directly.

This is the end of the post. Below is a big, messy list of common topics that I’ve seen software teams encounter.

Leave a comment or send me a message to add to this list. Let me know which common issues or struggles are important to you, and in the future I’ll publish an updated list.

  • Docs and discussion

    • How will you write docs?

    • How will you hold discussions?

    • RFCs? ADRs?

    • API docs: OpenAPI, examples, auth docs, changelogs

    • Internal vs external docs

    • Runbooks

    • Keeping docs fresh

    • Writing style, information architecture

  • Development Environment

    • What language will you use?

    • Will you use multiple languages?

    • Do you need to pick a build system for your language?

    • Do you need to manage runtime versions?

    • Do you need to pick a package manager?

    • Will you build Docker images?

    • Will you prefer a single IDE?

    • Will you use a linter? formatter?

      • Does it run on save? In pre-commit hooks?

    • How will you handle secrets in local dev envs?

    • How will you run your service or services locally?

    • Can you have concurrent, local instances of your project?

    • Test/seed data

  • Building and Shipping

    • CI/CD pipelines, PR builds, artifact publishing

    • Merge queues, required checks, require branches to be up-to-date or not?

    • Apple notarized binaries

    • Homebrew taps, casks, and release process

    • Signing, security, dependency analysis

    • Branching model: git flow or trunk-based?

    • Monorepo or polyrepo?

    • Deployment: blue-green, canary, rolling updates

    • Auto-rollback triggers, canary duration

    • Shadow deployments for safe testing

    • Build caching: Docker layers, CI artifacts, remote caches

    • Deterministic builds, dependency pinning

  • APIs

    • REST vs GraphQL vs gRPC

    • Pagination: cursor or offset?

    • Filtering, sorting, versioning

    • Patching

    • Soft delete

    • Audit logs, resource/action history

    • Deprecation policies

    • GraphQL: N+1 queries, complexity limits, schema stitching, subscriptions

    • API gateways: routing, transformation, auth

    • Rate limiting

    • SDKs, client generation, OpenAPI specs

    • Async APIs for long-running jobs

    • Timeout and retry design

  • Architecture

    • Microservices or monolith?

    • Service discovery, configuration

    • Contracts between services

    • Sync vs async communication

    • Durable workflows, sagas

    • Service mesh?

    • Schema definition and evolution

    • Deprecation timelines

  • Databases

    • Postgres:

      • Statement timeouts

      • Vacuum

      • Index bloat

      • Read replica usage

    • Document data? Key-value? Time series?

    • Migrations: rollback strategies, zero-downtime, testing

    • Backups: automated, point-in-time, cross-region, tested

    • Sharding: key selection, resharding, cross-shard queries

    • Query optimization: EXPLAIN plans, slow query logs

    • Transactions: isolation levels, deadlocks

    • Connection pooling, limits, timeouts

    • SSL/TLS, certificate rotation

  • Caching and Performance

    • Cache invalidation

    • CDN caching, application caching (Redis, Memcached)

    • Cache warming

    • Load testing, capacity planning

    • Horizontal vs vertical scaling

    • N+1 queries

    • Profiling in production, flame graphs

    • Image optimization: resizing, encoding, lazy loading

  • Messaging and Async

    • Queues: RabbitMQ, Kafka, SQS

    • Dead letter queues, retry policies

    • Ordering guarantees

    • Exactly-once vs at-least-once

    • Head-of-line blocking in partitioned data

    • Background jobs: cron, distributed scheduling, deduplication

    • Workflow orchestration: Airflow, Prefect, Temporal

    • Batch processing vs streaming

      • Windowing, stateful processing, watermarks

      • Data pipelines: ETL vs ELT, quality checks, monitoring, lineage

    • WebSockets: connection management, reconnection, ordering

    • Email: SPF, DKIM, DMARC, reputation, bounces

    • Webhooks: signing, retries, debugging, versioning

  • Auth

    • SSO, OAuth2 flows

    • Refresh tokens, magic links, MFA, social login

    • API auth: keys vs JWT vs OAuth

    • Bearer tokens, client credentials

    • Password hashing, reset flows, strength requirements

    • Credential stuffing prevention

    • Sessions: server-side vs client-side, timeouts, concurrency

    • RBAC, ABAC, ReBAC, entitlements, account access

    • API key generation, rotation, scoping, revocation

  • LLMs

    • Shared prompts, rules

    • Hosted agents

      • respond to event (e.g. PR push, PR comments)

      • periodically take automated actions

  • Security

    • CORS, CSP, security headers

    • Encryption at rest and in transit

    • KMS, E2E encryption

    • CVE monitoring, patching, advisories

    • Vulnerability disclosure policies

    • Dependency scanning, lock files

    • Supply chain: SBOM, artifact signing, provenance

    • Zero trust: identity, device, network segmentation, least privilege

    • Bot prevention: CAPTCHA, rate limiting, bot scoring

  • Infrastructure

    • GitOps, ClickOps

    • Terraform state, drift detection, policy as code

    • Containers: base images, size, scanning, private registries

    • Kubernetes: cluster sizing, namespaces, resource limits

    • Pod disruption budgets, HPA, ingress

    • VPNs, VPCs, subnets, load balancers

    • DNS, DDoS protection, SSL certs + renewal/rotation

    • Load balancing: algorithms, health checks, session affinity, geo-routing

    • Multi-region: architecture, data sovereignty, latency, failover

    • Disaster recovery: RPO/RTO, failover procedures, testing

    • Consensus, leader election, etc

    • Cron jobs

    • Coroutine vs thread vs process pools

    • Scaling, prioritization, backpressure

    • Spot instances, on-demand, provisioned/reserved

  • Environments and Config

    • Dev, staging, prod

    • Keeping environments in sync

    • Drift detection

    • Configuration as code: validation, versioning, rollback

    • Secrets management, automated rotation

    • Feature flags: percentage rollouts, user targeting, cleanup

  • Observability

    • Metrics, events, traces, logs

    • Expected vs unexpected errors

    • Log verbosity, runtime log level changes

    • Log aggregation, retention, structured logging, sampling

    • PII in logs

    • Tracing: sampling, correlation IDs, span naming

    • Dashboards: design, naming, aggregation, anomaly detection

    • Alerting: fatigue, escalation, grouping, runbook automation

    • On-call rotations synced with calendars and PagerDuty

    • Incident retros, Slack channels, blameless post-mortems

    • Status pages, uptime checks, synthetic monitoring

    • Health checks: liveness, readiness, startup probes

    • SLOs, SLIs, SLAs, error budgets

    • Frontend: client-side errors, web vitals, session replay, RUM

  • Resilience

    • Error codes, messages, validation, i18n

    • Circuit breakers, fallbacks

    • Partial outage handling, read-only mode

    • Chaos engineering: failure injection, game days

    • Chaos Monkey, Gremlin, blast radius control

    • Rate limiting

      • API and product rate limits

      • Per-user, per-key limits

      • Adaptive throttling

      • Rate limit headers and error responses

    • Idempotent API design

  • Testing

    • Unit, integration, e2e

    • Test data, mocking, parallelization

    • Flaky tests, coverage requirements

    • Visual regression: screenshots, diffs, baselines

    • Mutation testing

    • Property-based testing

    • Contract testing (Pact)

    • Smoke tests

    • API mocking: WireMock, MockServer, recording/replaying

  • Data Management

    • Retention: GDPR, deletion workflows, soft delete

    • Backups, disaster recovery

    • Import/export: formats, streaming, permissions, scheduling

    • Validation, rollback

    • Anonymization: PII removal, masking

    • GDPR right to be forgotten

    • Validation: client vs server, schema, business rules

  • Search and Storage

    • Search: Elasticsearch, Solr

    • Indexing strategies, relevance tuning

    • File storage: S3 vs alternatives

    • Virus scanning, size limits

    • Streaming vs buffering

  • Frontend

    • Code splitting, tree shaking, bundle analysis

    • Dynamic imports

    • Component libraries

    • Design systems, versioning, theming

    • Accessibility: WCAG, screen readers, keyboard nav, ARIA

    • Browser compatibility: progressive enhancement, polyfills

    • SEO: SSR, meta tags, sitemaps, robots.txt

    • Pagination: cursor vs offset, infinite scroll, prefetching

  • Internationalization

    • Translations, date/time/currency formatting

    • RTL languages

    • Timezones, DST, clock drift, timestamp precision

  • Multi-tenancy

    • Data isolation strategies

    • Tenant provisioning

    • Cross-tenant queries

  • Billing

    • Invoicing, usage tracking, payments

    • Merchant of record

    • Plans, subscriptions, pricing

    • Incident recovery, backfill, adjustments

  • Project Management

    • Issue tracking, public issues

    • Change approvals, rollbacks, verification

    • Feature workflow: branches, flags, QA environments

    • Dark launches

    • Tech debt tracking

  • Teams

    • Cross-team coordination: API contracts, breaking changes

    • Shared libraries

    • Platform vs product teams

    • Hiring: interviews, assessments, onboarding, mentorship

    • Technical debt: tracking, prioritization, communication

    • Internal tooling, admin dashboards

    • Platform engineering: self-service, golden paths

    • Code review

      • nit-pick vs blocking

      • code owners

      • rollout/rollback strategy

      • scope and max allowed PR size

      • nag or reminders about PR reviews

      • stale PRs

    • Code style, standards

  • Support and Users

    • Customer support: admin panels, safe impersonation

    • Ticket integration, access logs, debug tools

    • Feedback: in-app, NPS, user research

  • Analytics

    • Event tracking

    • Behavior analytics

    • A/B testing

    • Data warehouses

  • Compliance

    • Terms of service, privacy policy, cookies

    • Audit logs, data residency

    • SOC2 automation, evidence collection, audit prep

  • Costs

    • Cloud cost monitoring, budgets

    • Allocation by team

    • Vendor evaluation, contracts, risk, exit strategies

    • Fallbacks when third parties fail
