Why We Outgrew Cloudflare D1 (And Everything We Tried Before Building Our Own Solution)

Part 1 of the D0 Series

We run our entire product on Cloudflare. Not "mostly on Cloudflare" or "Cloudflare for the edge layer" - we mean the whole thing: workers, storage, AI, queues, real-time WebSockets, and everything in between. That decision made us, arguably, one of the most demanding power users of Cloudflare's infrastructure in existence. It also meant we hit every sharp edge of every product they ship, usually long before anyone else did - and because we were so deeply reliant on the entire stack, an outage in any single Cloudflare product was an outage for us. Workers AI goes down, our AI features go down. D1 has an incident, we can't read or write any data. It didn't matter what else was healthy.

We've been on D1 since the private alpha in August 2022. That matters for the timeline of what follows - several of these problems weren't edge cases we stumbled into after scale. We hit them early, we reported them, and in some cases we watched them sit unfixed for years.

This series is the honest account of what broke, what we tried, and what we eventually had to build ourselves. This first post is about D1 - Cloudflare's SQLite-based serverless database - what its real-world limits look like under serious multi-tenant load, and every fix we reached for before we concluded that no workaround was going to hold at the scale we were heading toward.

Our D1 Architecture, Before It Became a Problem

The way we structured our data model was intentional and, at the time, the only sane approach. Every user got their own D1 database. Every tenant got one. Every dataspace (our term for a tenant's workspace and its associated data) got one too.

That's a three-tier multi-tenant setup, and it was deliberate - complete data isolation at every level, no cross-tenant query risk, clean per-user storage boundaries. User and tenant databases stayed small - maybe a few kilobytes of permissions, IDs, and lightweight profile data, heavy on reads, almost nothing on writes. Dataspace databases were the opposite: constantly written to, constantly read, growing as long as a customer kept using the product.

By the time we were deep in production, with only a few customers and pilots, we had over 421 D1 databases and counting. That number matters a lot for almost every problem that follows.

Problem 1: REST vs. Binding, and the Routing Nightmare

D1 gives you two ways to talk to a database: over REST via the Cloudflare API, or via a Worker binding. The difference in practice is enormous, and it's not well-documented.

A binding connects you directly to the D1 instance. The Worker that uses it and the D1 it's bound to resolve to each other at the edge, with virtually no routing overhead. It's fast in a way that's almost unfair to compare against the alternative.

REST, at least early on, was a tour of Cloudflare's internal network topology. We're based in California, so our tests always ran out of LAX - the nearest PoP. A REST query from LAX would bounce north to PDX (Portland), Cloudflare's North American control plane core - where api.cloudflare.com itself runs - because every REST call had to go through there. PDX would then route back down to whichever PoP (LAX, SJC, DFW, SEA, or DEN) the D1 instance happened to be running in (our own tenant is placed in WNAM). The response made the same trip in reverse. Four hops for a single database query, and two of them were a ~600 mile (~960 km) round trip up and down the West Coast. That's the best case. Several of our upstream services were hosted on the East Coast, which meant crossing the continent twice per query.

And for our farthest customers - for example, Crypto.com's Hong Kong office, which Cloudflare routes through its APAC region - that API core bounce to SFO was a transpacific round trip on top of everything else.

By the time Crypto.com joined us as a customer, the PDX hop had been replaced by a nearer SFO hop for traffic coming from APAC.

Cloudflare has since shipped meaningful changes here:

Code Orange: remediation work (stemming from their Nov 2023 incident and finished May 2026) that decentralized and decoupled their API core and spread it across more PoPs.
Apr 10, 2025: they announced Read Replication, which spawns read copies worldwide and greatly helped the accessibility of our read-heavy root lookup table.
May 29, 2025: D1 REST requests are now handled at the closest PoP to the incoming request, so /query and /raw calls no longer have to proxy through the control plane core at all. The changelog puts the improvement at 50-500ms depending on request and database location, with the biggest gains going to databases outside the U.S. since PDX is still where control plane metadata lives.

It's a real improvement. But even with that fix, REST still carries all the overhead of an HTTP server: connection setup, TLS negotiation, parsing, headers. Bindings have none of that. They're effectively an in-process call.

The fix we reached for: VIP bindings. Since bindings are declared statically in a Worker's configuration and can't be dynamically assigned at runtime, we couldn't bind all 421+ databases. So we picked a subset - certain tenants and dataspaces that mattered most for performance - and hardcoded those into the Worker. Everyone else fell back to REST. We wrote a check in the request path: if a binding for a given database ID exists, use it; otherwise fall back to REST. This worked, but calling it a solution would be generous. It was favoritism baked into the code, and it wasn't going to scale.
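To make the binding-or-REST check concrete, here is a minimal sketch of what such a lookup could look like. The names `D1Like`, `resolveRoute`, and the `vip` map are hypothetical and are not taken from our codebase; the REST URL follows the public D1 query endpoint path.

```typescript
// Minimal structural stand-in for a D1 binding. In a real Worker this would be
// the D1Database type from @cloudflare/workers-types (assumption for the sketch).
interface D1Like {
  prepare(sql: string): unknown;
}

// A query goes either through a static binding or through the REST API.
type Route =
  | { kind: "binding"; db: D1Like }
  | { kind: "rest"; url: string };

// Public D1 REST endpoint for running a query against one database.
function restQueryUrl(accountId: string, databaseId: string): string {
  return `https://api.cloudflare.com/client/v4/accounts/${accountId}/d1/database/${databaseId}/query`;
}

// Hypothetical lookup: `vip` maps database IDs to the bindings that were
// hardcoded into the Worker. Anything not in the map falls back to REST.
function resolveRoute(
  accountId: string,
  databaseId: string,
  vip: Map<string, D1Like>,
): Route {
  const db = vip.get(databaseId);
  if (db !== undefined) {
    return { kind: "binding", db };
  }
  return { kind: "rest", url: restQueryUrl(accountId, databaseId) };
}
```

The discriminated union keeps the two code paths explicit at every call site, which is also what makes the favoritism visible: the map is fixed at deploy time, so adding a database to the fast path means shipping a new Worker.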

We also considered Workers for Platforms, which would let you dynamically upload a Worker script with its own bindings per tenant. Matt Silverlock suggested this during our architecture sessions (part of the engineering sessions in the Workers Launchpad). We thought about it but ultimately decided against implementing it: it would have required triplicating Worker code and managing a fleet of per-tenant Worker scripts, and the operational complexity would have been completely disproportionate to the problem.

Problem 2: The API Rate Limit That We Burned Through in 10-20 Seconds

REST queries against D1 share the global Cloudflare API rate limit. The default is 1,200 requests per 5 minutes per account. We exceeded that in 10 to 20 seconds of normal production load.

Over 9000!

Cloudflare worked with us to raise our account's rate limit to over 9,000 requests per 5 minutes, which bought us meaningful headroom. As of now, our typical load sits at about 5,000 to 6,000 requests per 5-minute window. We're not comfortable, we're just not bleeding.

The important footnote here: binding-based queries don't count toward this limit at all. Which made the VIP binding workaround doubly important - it wasn't just about latency, it was about not running out of API budget for the tenants who generated the most traffic.
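The numbers above are easier to compare as requests per second. This is plain arithmetic on the figures quoted in this section; the helper names are made up for the example.

```typescript
// The Cloudflare API rate limit is expressed per 5-minute window.
const WINDOW_SECONDS = 5 * 60;

// Sustained request rate that exactly uses up a per-window budget.
function sustainedRps(requestsPerWindow: number): number {
  return requestsPerWindow / WINDOW_SECONDS;
}

// Request rate implied by exhausting a budget in a given number of seconds.
function burnRateRps(budget: number, secondsToExhaust: number): number {
  return budget / secondsToExhaust;
}

const defaultRps = sustainedRps(1200); // default limit: 4 requests/second
const raisedRps = sustainedRps(9000); // raised limit: 30 requests/second
const burnSlow = burnRateRps(1200, 20); // budget gone in 20s: 60 requests/second
const burnFast = burnRateRps(1200, 10); // budget gone in 10s: 120 requests/second
```

In other words, the default budget allows about 4 REST requests per second sustained, while burning through it in 10 to 20 seconds implies 60 to 120 requests per second.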

Problem 3: Batch Queries Over REST Didn't Exist for Two Years

D1's binding supports batch SQL queries. You can bundle multiple statements into a single call and get all the results back together, eliminating round trips. As it turns out, the REST API supported this too - since the private alpha launch in 2022 - but it was never publicly documented. As far as we and everyone else in the ecosystem were concerned, it simply didn't exist over REST.

This sounds like a minor documentation gap. It wasn't. Several of our most data-heavy pages - the ones doing the most SQL work on load - relied on batch queries to avoid performing 5 or 6 sequential round trips on every page view. Since those users were on REST (they weren't in the VIP binding set), every one of those batched calls was getting serialized into individual HTTP requests. The difference was measurable in whole seconds. Not milliseconds. Seconds on page load. And the whole time, the capability was sitting right there in the API - we just had no way to know.

Cloudflare finally documented it on January 20th, with the official TypeScript SDK updated the same day (v6.0.0-beta.1, where DatabaseQueryParams became a union type supporting the batch array). We never got clean before/after metrics - by the time we found out we'd already moved on - but the subjective improvement on those pages was not subtle.
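For illustration, a batched REST call could be sketched like this. It assumes the batch form is a top-level `batch` array of `{ sql, params }` statements sent to the regular /query endpoint; that shape is our reading of the SDK change and should be checked against the current API reference. `buildBatchBody` and `queryBatch` are hypothetical helper names.

```typescript
type SqlParam = string | number | null;

interface SqlStatement {
  sql: string;
  params?: SqlParam[];
}

// Assumption: the REST batch form is a top-level `batch` array of statements.
function buildBatchBody(statements: SqlStatement[]): string {
  if (statements.length === 0) {
    throw new Error("batch needs at least one statement");
  }
  return JSON.stringify({ batch: statements });
}

// Sends every statement in one HTTP round trip instead of one request each.
async function queryBatch(
  accountId: string,
  databaseId: string,
  apiToken: string,
  statements: SqlStatement[],
): Promise<unknown> {
  const url = `https://api.cloudflare.com/client/v4/accounts/${accountId}/d1/database/${databaseId}/query`;
  const res = await fetch(url, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: buildBatchBody(statements),
  });
  if (!res.ok) {
    throw new Error(`D1 REST query failed with HTTP ${res.status}`);
  }
  return res.json();
}
```

A page that needs 5 or 6 statements then costs one request against the API rate limit rather than 5 or 6.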

Problem 4: The D1 Account Limit - A Moving Target We Kept Brushing Against

The D1 account limit didn't start at 50,000. During the private alpha, the ceiling was 10 databases per account. When D1 moved to public beta on September 28, 2023, that was raised to 25. We were already hitting the 10-database ceiling during alpha and when we migrated to the public beta version, we had to get it manually raised to 100 - something Cloudflare noted wasn't possible on the alpha infrastructure. Then on October 3, 2023, five days after the beta launch, they raised the global limit to 50,000.

That jump to 50,000 felt like breathing room, and compared to 100 it obviously was. But map it against our architecture - three databases minimum per customer, more for every additional workspace they create - and it stops feeling generous pretty quickly. The ratio of D1 databases to customers is not linear; it compounds with usage. We never hit 50,000, but it was a number that lived rent-free in the back of our minds. Every new signup nudged the counter up. Every additional workspace a customer created pushed it faster than the raw headcount suggested. At the growth rate we were seeing, it wasn't a distant theoretical ceiling - it was a visible one, and unlike most infrastructure limits you can engineer around, this one had no native escape valve short of calling Cloudflare and asking for an exception.
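A rough way to see how the counter moves is to count databases per customer under the one-database-per-user, per-tenant, per-dataspace model. The customer shape in the example is hypothetical and only there to show the arithmetic; 50,000 and 421 are the figures from this post.

```typescript
// One D1 database per user, per tenant, and per dataspace.
interface CustomerShape {
  users: number;
  tenants: number;
  dataspaces: number;
}

function databasesFor(c: CustomerShape): number {
  return c.users + c.tenants + c.dataspaces;
}

// How many more customers of a given shape fit under the account limit.
function customersUntilLimit(
  limit: number,
  currentDatabases: number,
  perCustomer: number,
): number {
  return Math.floor((limit - currentDatabases) / perCustomer);
}

// Hypothetical mid-sized customer: 1 tenant, 20 users, 5 dataspaces.
const perCustomer = databasesFor({ users: 20, tenants: 1, dataspaces: 5 });
const remaining = customersUntilLimit(50_000, 421, perCustomer);
```

With that made-up shape, each customer consumes 26 databases, and the 50,000 ceiling is reached after roughly 1,900 more of them. Larger teams or more workspaces shrink that number quickly.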

Problem 5: 10GB Per D1 and No Way to Grow

Each D1 database is capped at 10GB. For user and tenant databases, that limit is essentially invisible - we're talking kilobytes per record. For dataspace databases, it's a ceiling that, for active customers, you can see in the distance.

The tricky part isn't the 10GB limit in isolation. It's that D1 gives you no mechanism to split or expand a database once you've created it. You can't shard it later. You can't migrate data to a second database and stitch them together transparently. Once you're at 10GB, you're at 10GB. We hadn't hit this limit in production yet, but it was clearly going to become a problem for any customer who used the product seriously over time.
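Since there is no way to grow a database after the fact, the practical option is to watch how close each one is getting. The sketch below assumes the `size_after` field (database size in bytes) that D1 returns in query result metadata, and it treats 10GB as decimal gigabytes; both the threshold and the helper names are hypothetical.

```typescript
// Assumption: the 10GB cap is counted in decimal gigabytes.
const D1_MAX_BYTES = 10 * 1000 ** 3;

// Subset of D1 result metadata used here.
interface MetaLike {
  size_after?: number;
}

// Fraction of the cap in use, or undefined if the size was not reported.
function capacityUsed(meta: MetaLike): number | undefined {
  if (meta.size_after === undefined) return undefined;
  return meta.size_after / D1_MAX_BYTES;
}

// True once a database crosses the warning threshold (default: 80% of the cap).
function shouldWarn(meta: MetaLike, threshold = 0.8): boolean {
  const used = capacityUsed(meta);
  return used !== undefined && used >= threshold;
}
```

This only tells you the wall is coming. It does nothing to move the wall, which is the actual problem.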

Problem 6: Regionality Is Set at Creation and Locked Forever

D1 supports location hints at creation time - you can nudge it toward US West, US East, Europe West, etc. - but for most of our time on D1, it had no jurisdiction support, and once a database is created, it does not move. Ever.

This matters for GDPR and any equivalent data residency requirement. If a European customer creates an account and their database spins up in the wrong region, there's no native path to move it. Technically you can work around this - export the full database, create a new D1 in the correct region, import everything over, and update all your references - but that's a manual, error-prone migration with a downtime window, not something you'd want to run at scale across hundreds of databases or do reactively because a customer suddenly raised a compliance concern. Our default behavior was to create each database in the region nearest to the user at signup and hope it was good enough. For customers with specific requirements, "nearest at signup" isn't a guarantee, and it certainly isn't a legally defensible data residency commitment.

Cloudflare did ship jurisdiction support on November 5, 2025 (changelog), allowing eu and fedramp to be declared at database creation time. It's a meaningful step forward. The caveat - and it's the same one that's always applied - is that it can only be set at creation and still cannot be changed after the fact. The same export-recreate-import process described above applies here too, with the same scale problem: any database created before you knew a customer needed a specific jurisdiction is effectively locked wherever it landed.
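Because both settings exist only at creation time, the decision has to be made in whatever code provisions the database. A minimal sketch of building the create request body follows; the field names `primary_location_hint` and `jurisdiction` are how we understand the D1 create endpoint, and the rule that a jurisdiction replaces the location hint is an assumption to verify against the docs. `buildCreateBody` is a hypothetical helper.

```typescript
type LocationHint = "wnam" | "enam" | "weur" | "eeur" | "apac" | "oc";
type Jurisdiction = "eu" | "fedramp";

interface CreateDatabaseBody {
  name: string;
  primary_location_hint?: LocationHint;
  jurisdiction?: Jurisdiction;
}

interface PlacementOptions {
  locationHint?: LocationHint;
  jurisdiction?: Jurisdiction;
}

// Assumption: when a jurisdiction is given it takes precedence, so the
// location hint is left out rather than sent alongside it.
function buildCreateBody(name: string, opts: PlacementOptions): CreateDatabaseBody {
  if (opts.jurisdiction !== undefined) {
    return { name, jurisdiction: opts.jurisdiction };
  }
  if (opts.locationHint !== undefined) {
    return { name, primary_location_hint: opts.locationHint };
  }
  return { name };
}
```

The point of funnelling creation through one function is that a customer's residency requirement has to be known before this call, because nothing after it can change the outcome.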

Problem 7: Reliability Engineering We Didn't Expect to Need

When we started hitting errors in production that had no documentation and no status page entries, our first assumption was that we were doing something wrong. The errors looked like this:

D1_ERROR: Network connection lost
D1_ERROR: D1 storage operation exceeded timeout which caused object to be reset
D1_ERROR: Failed to parse body as JSON, got: <html><head><title>500 Internal
D1_ERROR: D1 DB's isolate exceeded its memory limit and was reset

After enough of them, and enough digging through Discord and community threads, we learned these were known - just not documented. The D1 team's guidance was to build retry logic and design for idempotency, with a handful of these errors every several hours being characterized as normal operating behavior. That guidance was eventually added to the docs after the community raised it.

Some of these errors you can architect around. Retries, idempotency keys, graceful degradation - reasonable asks for distributed systems work. But something like an isolate exceeding its memory limit and being reset is entirely inside Cloudflare's runtime. There's no query you can write differently, no schema change that helps. It just happens, and your job is to catch it. That pushed us to think less about individual error handling and more about what it would take to build proper infrastructure-level reliability on top of D1.
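The catch-and-retry part can be sketched as a small wrapper. This is an illustration, not our production code: the matched message fragments come from the errors listed above, while the attempt count, backoff values, and function names are hypothetical.

```typescript
// Message fragments from the transient D1 errors listed above.
const RETRYABLE_FRAGMENTS = [
  "Network connection lost",
  "storage operation exceeded timeout",
  "isolate exceeded its memory limit",
  "Failed to parse body as JSON",
];

function isRetryableD1Error(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return RETRYABLE_FRAGMENTS.some((fragment) => message.includes(fragment));
}

// Exponential backoff: 50ms, 100ms, 200ms, ... capped at 2 seconds.
function backoffMs(attempt: number, baseMs = 50, capMs = 2000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retries only transient errors. The operation must be idempotent, because a
// write may have been applied before the connection was lost.
async function withRetry<T>(op: () => Promise<T>, maxAttempts = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      const outOfAttempts = attempt + 1 >= maxAttempts;
      if (outOfAttempts || !isRetryableD1Error(err)) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffMs(attempt)));
    }
  }
}
```

Usage would look like `await withRetry(() => db.prepare(sql).bind(id).first())`. Anything that is not on the transient list, such as a constraint violation, is rethrown immediately.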

One thing that inadvertently helped us here was our three-tier architecture. Because every tenant, user, and dataspace had its own isolated D1 database, traffic was naturally spread across hundreds of databases rather than concentrated on one. An isolate reset or timeout in one database affected only that specific user's session - not the entire platform. It wasn't a designed reliability feature, it was a side effect of the isolation model, but it meaningfully softened what could have been platform-wide failures into localized blips.

Community Sentry data for the same errors showed roughly 18,600 of these over 7 months post-GA, with spikes of nearly 1,000 in a single week. We cite that not as evidence of D1 being broken, but as a picture of what the error budget actually looks like at scale and why you can't afford to ignore it.

That thinking is what pushed us to prototype multi-provider storage failover: GCP Cloud SQL, Azure SQL, or Hyperdrive-tunneled on-prem instances as transparent fallback targets, applying something similar to the 3-2-1 backup principle but for failover - multiple copies on 2+ distinct providers, at least one on completely different infrastructure.

We got a working version of this, but then we scrapped it. The core problem was cross-database-type maintenance. SQLite, PostgreSQL, and MySQL are structurally different, with different expression behavior and different edge cases around nulls, defaults, and implicit casts. Even when you commit to treating Postgres and MySQL as if they're SQLite and limiting yourself to the least-common-denominator feature set, the inconsistencies compound fast as your schema evolves. Every new feature we shipped was a new surface area where the Postgres or MySQL path might silently diverge. The maintenance burden for a team our size was unrealistic - so we pulled the plug.

The Scheduler: The Fix That Became a Product

Separate from the connection and storage limitations, we had a maintenance problem. D1 was originally announced with the promise that you'd be able to run JavaScript directly alongside your D1 instance - collocated logic, scheduled maintenance, cleanup jobs. Cloudflare has since silently dropped that.

So we had invite codes and sessions accumulating in databases past their expiration. We had orphaned IDs from users who left tenants. And we had more complex cases, like tables that couldn't use foreign keys (or that lived in different D1s) but needed to be kept in sync anyway. Things that needed to run on a schedule, against specific databases, with actual logic - not just DELETE FROM table WHERE expires_at < NOW().

We built a D1 Scheduler: a Durable Object that connected to D1 on a schedule and ran maintenance tasks, modeled as closely as possible on MySQL's built-in event scheduler. This is the direct precursor to our production scheduler, which later grew to handle user-facing scheduled tasks too - sending reports, triggering workflows. The concept was simple enough: Durable Objects have alarm APIs, so you can wake up at a predictable time, run your work against a database, and go back to sleep.
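The wake-work-sleep loop can be sketched as follows. This is a simplified illustration rather than our scheduler: `AlarmStorage` is a structural stand-in for the `getAlarm`/`setAlarm` methods on a Durable Object's storage, and `MaintenanceScheduler`, `nextRunAt`, and the task list are hypothetical names.

```typescript
// Stand-in for the alarm methods on Durable Object storage (ctx.storage).
interface AlarmStorage {
  getAlarm(): Promise<number | null>;
  setAlarm(scheduledTime: number): Promise<void>;
}

type MaintenanceTask = () => Promise<void>;

// Next interval boundary strictly after `now` (both in epoch milliseconds).
function nextRunAt(now: number, intervalMs: number): number {
  return Math.floor(now / intervalMs) * intervalMs + intervalMs;
}

class MaintenanceScheduler {
  storage: AlarmStorage;
  tasks: MaintenanceTask[];
  intervalMs: number;

  constructor(storage: AlarmStorage, tasks: MaintenanceTask[], intervalMs: number) {
    this.storage = storage;
    this.tasks = tasks;
    this.intervalMs = intervalMs;
  }

  // Arms the first alarm if none is pending.
  async ensureScheduled(now: number = Date.now()): Promise<void> {
    if ((await this.storage.getAlarm()) === null) {
      await this.storage.setAlarm(nextRunAt(now, this.intervalMs));
    }
  }

  // Called from the Durable Object's alarm() handler: run every task, then
  // re-arm. A failing task is logged and does not block the others.
  async alarm(now: number = Date.now()): Promise<void> {
    for (const task of this.tasks) {
      try {
        await task();
      } catch (err) {
        console.error("maintenance task failed", err);
      }
    }
    await this.storage.setAlarm(nextRunAt(now, this.intervalMs));
  }
}
```

Each task is just an async function, so it can run real logic against a specific database (expire sessions, reconcile two tables) instead of being limited to a single SQL statement.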

That scheduler is still running in production today. It was the first time we used a Durable Object to proxy a D1 operation - and it planted the idea that kept growing.

Where This Left Us

After all of this - the VIP binding hacks, the rate limit escalations, the abandoned multi-provider failover prototype, the scheduler workaround - we had something more valuable than just a list of complaints. We had a detailed map of exactly where D1's abstraction layer ended and the underlying infrastructure began.

In the end, D1 is built on Durable Objects' SQL storage and advertised as a managed DB offering. What wasn't documented - what you could only discover by building on it at the scale we did - was everything D1 smoothed over, what it deliberately abstracted away, and where those abstractions had seams.

But understanding where the abstraction had seams also meant understanding where it could be extended. Durable Objects themselves didn't have most of the limits we were running into - the instance count is effectively unlimited, you bind by namespace not by instance, and the storage ceiling per instance can be distributed across as many instances as you need. D1 was a managed layer on top of that primitive. A well-intentioned one, but a layer nonetheless - and layers have ceilings.
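To show what "bind by namespace, not by instance" means in practice, here is a generic sketch of addressing one Durable Object instance per tenant and dataspace through a single namespace binding. It uses the `idFromName` and `get` methods of a Durable Object namespace via a structural stand-in; the naming scheme and `stubFor` are hypothetical and say nothing about how D0 is actually built.

```typescript
// Structural stand-ins for a Durable Object namespace binding and its IDs.
interface IdLike {
  toString(): string;
}

interface NamespaceLike<Stub> {
  idFromName(name: string): IdLike;
  get(id: IdLike): Stub;
}

// One namespace binding can address any number of instances at runtime:
// the instance is derived from a name, not declared in configuration.
function stubFor<Stub>(
  ns: NamespaceLike<Stub>,
  tenantId: string,
  dataspaceId: string,
): Stub {
  const id = ns.idFromName(`${tenantId}:${dataspaceId}`);
  return ns.get(id);
}
```

Compare that with D1 bindings, where each database needs its own static entry in the Worker's configuration. Here the same deployed Worker reaches a new tenant's instance the moment the name exists.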

The question we kept coming back to was straightforward: if we already understand what D1 is doing under the hood, and we know exactly where it falls short for our use cases, what's stopping us from going one level lower and building our own generic storage layer to satisfy what we need?

That's what D0 is. We'll get into it in the next post.
