atlas9 is about an itch I need to scratch.
There is so much stuff to figure out in software projects: API design, databases, builds, deploys, async tasks, auth, workflows, releases, docs, flags, config, tests, telemetry, monitoring, logs, infrastructure-as-code, and on and on.
Large amounts of energy (time, emotion, money) are spent on this stuff. And when teams get it wrong, even more energy is spent either correcting it or living with it.
I want to build something that takes care of the common stuff, so that software teams are happier and more efficient. But, what exactly should I build? Who exactly am I building it for?
I have thoughts and questions and some opinions, and I’m starting this blog to write those down, in the hope that they will resonate with others and that discussion and community will come to life.
Part of the struggle is bringing technologies together.
There’s no shortage of incredible tools at our disposal – PostgreSQL, Kubernetes, S3, Dynamo, Docker, Django, Rails, GitHub, Terraform, OpenTelemetry, Datadog, Splunk, AWS, GCP, Azure, etc.
And yet, it feels harder than ever to pull it all together. Kubernetes and AWS are super powerful, and overwhelming. PostgreSQL feels easy and powerful, until the query planner changes its mind at 1am, or you run out of connections, or vacuum isn’t keeping up, or you need to upgrade, etc. Terraform works great until it doesn’t. For all the words written about observability, I still struggle to set it all up and get the data I need (and I worked for an observability company for years!). Even in a sophisticated, mature web application framework, teams struggle with common things like API error messages, validation, defaults, database transactions or replicas, etc.
And each piece makes it harder to have a smooth developer experience, or one that usefully matches real-world production behavior.
I’m very fortunate to have always worked with great people. My favorite thing about work is always the people – they’re smart, experienced, fun, passionate, interesting, and I learn from them non-stop.
And yet, collectively, we (myself included) get so many seemingly simple details wrong. Like keeping fields consistently snake_case or camelCase in an API. Or applying defaults and validation to API resources in a consistent way. Or using database transactions correctly. Or deciding where to set defaults in the many layers of config. I believe we all agree that by the year 2026, it should be really hard to get details like these wrong.
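To make that first example concrete, here’s a minimal Go sketch of the kind of detail I mean (the names and fields are hypothetical, just for illustration): JSON tags keep the API consistently snake_case, and defaults and validation live in one place so every handler treats the resource the same way.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// CreateUserRequest tags every field in snake_case so the API stays consistent.
// Forgetting a tag (leaving DisplayName as "DisplayName") is exactly the kind
// of small inconsistency that slips through review.
type CreateUserRequest struct {
	Email       string `json:"email"`
	DisplayName string `json:"display_name"`
	PageSize    int    `json:"page_size"`
}

// applyDefaults and validate live on the type so every handler applies the
// same defaults and the same rules, in the same order.
func (r *CreateUserRequest) applyDefaults() {
	if r.PageSize == 0 {
		r.PageSize = 50
	}
}

func (r *CreateUserRequest) validate() error {
	if r.Email == "" {
		return errors.New("email is required")
	}
	if r.PageSize < 1 || r.PageSize > 500 {
		return errors.New("page_size must be between 1 and 500")
	}
	return nil
}

func main() {
	var req CreateUserRequest
	_ = json.Unmarshal([]byte(`{"email":"a@example.com"}`), &req)
	req.applyDefaults()
	if err := req.validate(); err != nil {
		fmt.Println("invalid request:", err)
		return
	}
	fmt.Printf("%+v\n", req)
}
```

Nothing here is hard. The hard part is making sure every resource in every service gets this same treatment, every time.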
It points to the fact that building software systems, especially with more than a couple people, is still hard. I hope atlas9 can be a tool that takes care of more of these details.
At the bottom of this post, I’ve included a long list of things that I’ve seen teams struggle with, or that I have struggled with myself. It’s a long list! And I bet it could be much longer if I A) had a better memory, B) didn’t block out bad memories, C) thought harder about it, and D) asked others to contribute their struggles. I think a lot of these topics are very common in software organizations of almost any size. It’s mind-boggling how much we have to figure out every time, and how much opportunity for missteps that creates.
My experience is mostly in web applications and distributed systems, so atlas9 is focused on the issues I’ve encountered there.
atlas9 starts with some thoughtful writing (and hopefully discussion) about various patterns and tradeoffs. That’s why I’m starting this blog.
atlas9 will most likely include an application framework, probably written in Go. There are a lot of existing frameworks out there, most of which I’m not yet very familiar with, so I have some research to do. My experience with frameworks is that they’re flexible, and that flexibility leaves a lot of room for decisions to be made. Those decisions can take a lot of effort, and each decision carries some risk of being a misstep. Also, honestly, I’ve become fairly annoyed with some of the patterns encouraged by popular languages and frameworks. Over the many years of my career I’ve developed my own preferred style, and I expect that will be reflected in atlas9 to some degree.
I suspect atlas9 will need to be fairly opinionated in some ways. I suspect that “taking care of the common stuff” will mean making decisions about which common patterns to use. The tradeoff might be that atlas9 doesn’t support every way of doing things, it doesn’t fit into every box, it doesn’t work for everyone.
I suspect that atlas9 doesn’t need to be just one design. I’ve been pondering the gap between small, medium, and large projects. A small project could be fine with a single server and a SQLite database. A medium project might want Kubernetes and multiple services. A different medium (or even large) project might be fine on a few servers with a single large Postgres database. Can atlas9 be useful at all these scales? Which one am I designing for first?
People have probably been feeling this way about technology forever. “It should be better than this” is perhaps a great driver of innovation in all areas of technology and knowledge.
And there are people out there making it better, like:
Railway
Render
Fly.io
OpenShift
Supabase
many others I’m sure (send me a comment or message with others)
Some of these seem like really cool companies and products. I recognize that atlas9 is covering some of the same ground. That’s ok. There are probably still gaps to fill, and many people aren’t able or don’t want to use these hosted services. Figuring out what atlas9 can do that complements these projects is on the long list of questions I have in mind.
If you’re interested in this idea, if you want to follow along or be part of the discussion, please let me know! You can subscribe, you can comment, or you can send me a message directly.
This is the end of the post. Below is a big, messy list of common topics that I’ve seen software teams encounter.
Leave a comment or send me a message to add to this list. Let me know which common issues or struggles are important to you, and in the future I’ll publish an updated list.
Docs and discussion
How will you write docs?
How will you hold discussions?
RFCs? ADRs?
API docs: OpenAPI, examples, auth docs, changelogs
Internal vs external docs
Runbooks
Keeping docs fresh
Writing style, information architecture
Development Environment
What language will you use?
Will you use multiple languages?
Do you need to pick a build system for your language?
Do you need to manage runtime versions?
Do you need to pick a package manager?
Will you build Docker images?
Will you prefer a single IDE?
Will you use a linter? A formatter?
Does it run on save? In pre-commit hooks?
How will you handle secrets in local dev envs?
How will you run your service or services locally?
Can you have concurrent, local instances of your project?
Test/seed data
Building and Shipping
CI/CD pipelines, PR builds, artifact publishing
Merge queues, required checks, requiring branches to be up to date or not?
Apple notarized binaries
Homebrew taps, casks, and release process
Signing, security, dependency analysis
Branching model: git flow or trunk-based?
Monorepo or polyrepo?
Deployment: blue-green, canary, rolling updates
Auto-rollback triggers, canary duration
Shadow deployments for safe testing
Build caching: Docker layers, CI artifacts, remote caches
Deterministic builds, dependency pinning
APIs
REST vs GraphQL vs gRPC
Pagination: cursor or offset?
Filtering, sorting, versioning
Patching
Soft delete
Audit logs, resource/action history
Deprecation policies
GraphQL: N+1 queries, complexity limits, schema stitching, subscriptions
API gateways: routing, transformation, auth
Rate limiting
SDKs, client generation, OpenAPI specs
Async APIs for long-running jobs
Timeout and retry design
Architecture
Microservices or monolith?
Service discovery, configuration
Contracts between services
Sync vs async communication
Durable workflows, sagas
Service mesh?
Schema definition and evolution
Deprecation timelines
Databases
Postgres:
Statement timeouts
Vacuum
Index bloat
Read replica usage
Document data? Key-value? Time series?
Migrations: rollback strategies, zero-downtime, testing
Backups: automated, point-in-time, cross-region, tested
Sharding: key selection, resharding, cross-shard queries
Query optimization: EXPLAIN plans, slow query logs
Transactions: isolation levels, deadlocks
Connection pooling, limits, timeouts
SSL/TLS, certificate rotation
Caching and Performance
Cache invalidation
CDN caching, application caching (Redis, Memcached)
Cache warming
Load testing, capacity planning
Horizontal vs vertical scaling
N+1 queries
Profiling in production, flame graphs
Image optimization: resizing, encoding, lazy loading
Messaging and Async
Queues: RabbitMQ, Kafka, SQS
Dead letter queues, retry policies
Ordering guarantees
Exactly-once vs at-least-once
Head-of-line blocking in partitioned data
Background jobs: cron, distributed scheduling, deduplication
Workflow orchestration: Airflow, Prefect, Temporal
Batch processing vs streaming
Windowing, stateful processing, watermarks
Data pipelines: ETL vs ELT, quality checks, monitoring, lineage
WebSockets: connection management, reconnection, ordering
Email: SPF, DKIM, DMARC, reputation, bounces
Webhooks: signing, retries, debugging, versioning
Auth
SSO, OAuth2 flows
Refresh tokens, magic links, MFA, social login
API auth: keys vs JWT vs OAuth
Bearer tokens, client credentials
Password hashing, reset flows, strength requirements
Credential stuffing prevention
Sessions: server-side vs client-side, timeouts, concurrency
RBAC, ABAC, ReBAC, entitlements, account access
API key generation, rotation, scoping, revocation
LLMs
Shared prompts, rules
Hosted agents
Respond to events (e.g. PR pushes, PR comments)
Periodically take automated actions
Security
CORS, CSP, security headers
Encryption at rest and in transit
KMS, E2E encryption
CVE monitoring, patching, advisories
Vulnerability disclosure policies
Dependency scanning, lock files
Supply chain: SBOM, artifact signing, provenance
Zero trust: identity, device, network segmentation, least privilege
Bot prevention: CAPTCHA, rate limiting, bot scoring
Infrastructure
GitOps, ClickOps
Terraform state, drift detection, policy as code
Containers: base images, size, scanning, private registries
Kubernetes: cluster sizing, namespaces, resource limits
Pod disruption budgets, HPA, ingress
VPNs, VPCs, subnets, load balancers
DNS, DDoS protection, SSL certs + renewal/rotation
Load balancing: algorithms, health checks, session affinity, geo-routing
Multi-region: architecture, data sovereignty, latency, failover
Disaster recovery: RPO/RTO, failover procedures, testing
Consensus, leader election, etc.
Cron jobs
Coroutine vs thread vs process pools
Scaling, prioritization, backpressure
Spot instances, on-demand, provisioned/reserved
Environments and Config
Dev, staging, prod
Keeping environments in sync
Drift detection
Configuration as code: validation, versioning, rollback
Secrets management, automated rotation
Feature flags: percentage rollouts, user targeting, cleanup
Observability
Metrics, events, traces, logs
Expected vs unexpected errors
Log verbosity, runtime log level changes
Log aggregation, retention, structured logging, sampling
PII in logs
Tracing: sampling, correlation IDs, span naming
Dashboards: design, naming, aggregation, anomaly detection
Alerting: fatigue, escalation, grouping, runbook automation
On-call rotations synced with calendars and PagerDuty
Incident retros, Slack channels, blameless post-mortems
Status pages, uptime checks, synthetic monitoring
Health checks: liveness, readiness, startup probes
SLOs, SLIs, SLAs, error budgets
Frontend: client-side errors, web vitals, session replay, RUM
Resilience
Error codes, messages, validation, i18n
Circuit breakers, fallbacks
Partial outage handling, read-only mode
Chaos engineering: failure injection, game days
Chaos Monkey, Gremlin, blast radius control
Rate Limiting
API and product rate limits
Per-user, per-key limits
Adaptive throttling
Rate limit headers and error responses
Idempotent API design
Testing
Unit, integration, e2e
Test data, mocking, parallelization
Flaky tests, coverage requirements
Visual regression: screenshots, diffs, baselines
Mutation testing
Property-based testing
Contract testing (Pact)
Smoke tests
API mocking: WireMock, MockServer, recording/replaying
Data Management
Retention: GDPR, deletion workflows, soft delete
Backups, disaster recovery
Import/export: formats, streaming, permissions, scheduling
Validation, rollback
Anonymization: PII removal, masking
GDPR right to be forgotten
Validation: client vs server, schema, business rules
Search and Storage
Search: Elasticsearch, Solr
Indexing strategies, relevance tuning
File storage: S3 vs alternatives
Virus scanning, size limits
Streaming vs buffering
Frontend
Code splitting, tree shaking, bundle analysis
Dynamic imports
Component libraries
Design systems, versioning, theming
Accessibility: WCAG, screen readers, keyboard nav, ARIA
Browser compatibility: progressive enhancement, polyfills
SEO: SSR, meta tags, sitemaps, robots.txt
Pagination: cursor vs offset, infinite scroll, prefetching
Internationalization
Translations, date/time/currency formatting
RTL languages
Timezones, DST, clock drift, timestamp precision
Multi-tenancy
Data isolation strategies
Tenant provisioning
Cross-tenant queries
Billing
Invoicing, usage tracking, payments
Merchant of record
Plans, subscriptions, pricing
Incident recovery, backfill, adjustments
Project Management
Issue tracking, public issues
Change approvals, rollbacks, verification
Feature workflow: branches, flags, QA environments
Dark launches
Tech debt tracking
Teams
Cross-team coordination: API contracts, breaking changes
Shared libraries
Platform vs product teams
Hiring: interviews, assessments, onboarding, mentorship
Technical debt: tracking, prioritization, communication
Internal tooling, admin dashboards
Platform engineering: self-service, golden paths
Code review
Nit-pick vs blocking
Code owners
Rollout/rollback strategy
Scope and max allowed PR size
Nags or reminders about PR reviews
Stale PRs
Code style, standards
Support and Users
Customer support: admin panels, safe impersonation
Ticket integration, access logs, debug tools
Feedback: in-app, NPS, user research
Analytics
Event tracking
Behavior analytics
A/B testing
Data warehouses
Compliance
Terms of service, privacy policy, cookies
Audit logs, data residency
SOC2 automation, evidence collection, audit prep
Costs
Cloud cost monitoring, budgets
Allocation by team
Vendor evaluation, contracts, risk, exit strategies
Fallbacks when third parties fail