Why Has Figma Reinvented the Wheel with PostgreSQL?

medium.com

169 points by magden 2 years ago · 97 comments

Ozzie_osman 2 years ago

I'm at a company that is weighing a very similar decision (we are on RDS Postgres with a rapidly growing database that will require some horizontal partitioning). There really isn't an easy solution. We spoke to people who have done sharding in-house (Figma, Robinhood) as well as others who migrated to natively distributed systems like Cockroach (Doordash).

If you decide to move off of RDS but stay on Postgres, you can run your own Postgres but now lose all the benefits of a managed service. You could move off of AWS (e.g. to Azure), but moving to a different cloud is a huge lift. That, btw, would also be required if you want to try something like Spanner (move to GCP). Moving off of Postgres to another database is also risky. The migration will obviously take some effort, but you're also probably talking about lots of code changes, schema changes, etc., as well as unknown operational risks that you'll discover on the new system. This applies if you're talking about things like Cockroach or even moving to MySQL.

That said, rolling your own sharding is a MASSIVE undertaking. Limitless looks promising, since it takes care of a lot of the issues that Figma ended up spending time on (basically, you shouldn't need something like Figma's DBProxy, as shard routing, shard splitting, etc will be taken care of). It's still in preview though, and like the article mentioned, the costs may be high.

Overall, no easy decisions on this one, unfortunately.

  • evanelias 2 years ago

    > That said, rolling your own sharding is a MASSIVE undertaking.

    It's a large challenge, but it's absolutely doable. A ton of companies did this 10-15 years ago, basically every successful social network, user generated content site, many e-commerce sites, massively multiplayer games, etc. Today's pre-baked solutions didn't exist then, so we all just rolled our own, typically on MySQL back then.

    With DIY, the key thing is to sidestep any need for cross-shard joins. This is easier if you only use your relational DB for OLTP, and already have OLAP use-cases elsewhere.

    Storing "association" type relation tables on 2 shards helps tremendously too: for example, if user A follows user B, you want to record this on both user A and user B's shards. This way you can do "list all IDs of everyone user A follows" as well as "list all IDs of users following user B" without crossing shard boundaries. Once you have the IDs, you have to do a multi-shard query to get the actual user rows, but that's a simple scatter-gather by ID and easy to parallelize.

    Implementing shard splitting is hard, but again definitely doable. Or avoid it entirely by putting many smaller shards on each physical DB server -- then instead of splitting a big server, you can just move an entire shard to another server, which avoids any row-by-row operations.
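
    A sketch of the fixed-logical-shards idea: rebalancing only ever edits the shard-to-server map, so no row-by-row splitting (names and counts are made up):

        NUM_LOGICAL_SHARDS = 1024  # chosen generously up front, never changes

        # logical shard -> physical server; in practice this lives in a
        # config service or routing table, not in application memory.
        shard_map = {s: f"db-{s % 4}.internal" for s in range(NUM_LOGICAL_SHARDS)}

        def logical_shard(user_id: int) -> int:
            return user_id % NUM_LOGICAL_SHARDS

        def server_for(user_id: int) -> str:
            return shard_map[logical_shard(user_id)]

        # "Splitting" a hot server = replicate one logical shard to a new
        # box, then flip its entry here (plus a cutover dance not shown).
        def move_shard(shard_id: int, new_server: str) -> None:
            shard_map[shard_id] = new_server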

    Many other tricks like this. It's a lot of tribal knowledge scattered across database conference talks from a decade ago :)

    • Ozzie_osman 2 years ago

      It's definitely doable. I was at Google circa 2006, pre Spanner, with sharded MySQL. Ads ran on top of it. It was a pain.

      And yes, there are many tricks like having more logical shards than physical ones, co-locating tables by the same shard_id, etc. It's still difficult. You need tooling for everything from shard splitting (even if that is just moving a logical shard), to schema migrations, not to mention if you end up needing cross-shard transactions or cross-shard joins.
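
      To make the schema-migration tooling concrete, even the simplest version is a loop over every shard; a rough sketch assuming psycopg2 and one DSN per physical shard (hypothetical names and DDL):

          import psycopg2

          SHARD_DSNS = [
              "host=shard-0.internal dbname=app",
              "host=shard-1.internal dbname=app",
              # one DSN per physical shard
          ]

          def run_migration(ddl: str) -> None:
              # Apply the same DDL to every shard, stopping on the first failure;
              # real tooling also tracks which shards have which migration applied.
              for dsn in SHARD_DSNS:
                  conn = psycopg2.connect(dsn)
                  try:
                      with conn, conn.cursor() as cur:
                          cur.execute(ddl)
                  finally:
                      conn.close()

          run_migration("ALTER TABLE files ADD COLUMN team_id bigint")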

      Generally, you'd need a team of very strong infrastructure engineers. Most companies don't have the resources for that. There are definitely some engineers out there that could whip this all together.

  • franckpachot 2 years ago

    What I do not understand is that they say "we explored CockroachDB, TiDB, Spanner, and Vitess". Those are not compatible with PostgreSQL beyond the protocol, and migration would require massive rewrites and tests to get the same behavior. YugabyteDB uses PostgreSQL for the SQL processing, to provide the same features and behavior, and distributes data with a Spanner-like architecture. I'm not saying there's no risk and no effort, but they are limited. And testing is easy because you don't have to change the application code. I don't understand why they didn't spend a few days on a proof of concept with YugabyteDB and explored only the solutions where the application cannot work as-is.

    • eivanov89 2 years ago

      I think Denis addressed this in his post: "Overall, as an engineer, you will never regret taking part in the development of a sharding solution. It’s a complex engineering problem with many non-trivial tasks to solve". In other words, it might be not-invented-here syndrome (with all due respect to the Figma team). Or there might be more nuances we are unaware of.

      • pas 2 years ago

        they wanted to stay on RDS, maybe not "them", maybe it was the decision of some manager

        also, it's... strange that they had 18 months and yet "extremely tight timeline pressure". we simply don't know enough about the situation

    • jmull 2 years ago

      Maybe it’s just a matter of it being difficult to list all the things they didn’t use. The Figma article itself is a little more clear on their goals…

      It’s not really just postgres compatibility they are after, but compatibility with the Amazon RDS version of postgres. They also wanted to have something they could adopt incrementally and back out of when something unanticipated goes wrong.

      Also, I think yugabyte uses an older version of the postgres processing engine, which may or may not be a big deal, depending on what they are using.

  • krab 2 years ago

    We use Citus. Very similar performance properties to DIY sharding but much more polished. Currently at 7 TB, self hosted. Growing roughly at 100 % per year, write-heavy. Works fine for us.

    • __s 2 years ago

      Unfortunately this runs into their "have to move off AWS for a managed service" point, since the managed service for Citus is now on Azure post-acquisition, as Azure Cosmos DB for PostgreSQL.

      Pitching Citus ran into issues where people were hoping it would handle sharding transparently, which isn't the case. But for someone who's evaluating rolling their own sharding, being able to manage sharding keys explicitly is how Citus allows efficient joins based on your workload. So yes, if you're looking to roll an unmanaged sharded PostgreSQL cluster, consider starting with Citus.
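
      For anyone weighing that route, the "explicit" part is declaring a distribution column per table so that joins on it stay local to one worker. A rough sketch against a hypothetical Citus coordinator (create_distributed_table is the real Citus function; the table names are made up):

          import psycopg2

          conn = psycopg2.connect("host=citus-coordinator.internal dbname=app")
          with conn, conn.cursor() as cur:
              # Distribute both tables on org_id; sharing the key means they
              # are co-located, so joins on org_id never cross workers.
              cur.execute("SELECT create_distributed_table('orgs', 'org_id')")
              cur.execute("SELECT create_distributed_table('files', 'org_id')")

              # Routed to a single shard, no scatter-gather.
              cur.execute(
                  "SELECT f.name FROM files f JOIN orgs o USING (org_id) "
                  "WHERE o.org_id = %s",
                  (42,),
              )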

    • Ozzie_osman 2 years ago

      Curious why you needed to shard at 7TB? I can imagine for some workloads, especially if it's write-heavy, you might start hitting constraints around vacuuming and things like that? But 7TB should be manageable on a (somewhat large and beefy) single machine.

      • krab 2 years ago

        You're right we could. In fact, it was a single server until about 2 TB. We considered a larger server and in fact at that point we could have just added a few more disks. But we still decided to shard.

        First, the data size is growing and we didn't really know the growth rate in advance. Sharding gives you some flexibility in the infrastructure sizing. And yes, you don't want to wait until the last minute.

        Second, it helps us to spread the disk I/O. Possible on a single machine if you're a little bit careful with disk types and sizes. But again, the overall load still grows.

        Third, all the bulk operations take a long time on a single server. Each of the distributed servers takes about an hour to back up and 2-3 hours to restore. I'd feel uneasy if it was much longer.

      • mixmastamyk 2 years ago

        Don’t wait until the last possible second to make a big strategic move—do it early on your own schedule. Especially when growing at a high rate.

  • ksec 2 years ago

    >That said, rolling your own sharding is a MASSIVE undertaking.

    Yes. It may not fit your needs, but take a look at PlanetScale. (Based on MySQL and Vitess, but I have seen quite a few people moving from Postgres.)

Tehnix 2 years ago

I’m definitely of the opinion that what Figma[0] (and earlier, Notion[1]) did is what I’d call “actual engineering”.

Both of these companies are very specific about their circumstances and requirements:

- Time is ticking, and downtime is guaranteed if they don’t do anything

- They are not interested in giving up the massive feature set AWS provides via RDS, especially around data recovery (anyone involved with Business Continuity planning would understand the importance of this)

- They need to be able to do it safely, incrementally, and without closing the door on reverting their implementation/rollout

- The impact on Developers should be minimal

“Engineering” at its core is about navigating the solution space given your requirements, and they did it well!

Both Figma and Notion meticulously focused on the minimal feature set they needed, in the most important order, to prevent disastrous downtime (e.g. Figma didn’t need to support all of SQL for sharding, just their most used subset).

Both companies call out (rightfully so) that they have extensive experience operating RDS at this point, and that existing solutions either didn’t give them the control they needed (Notion) or required a significant rewrite of their data structures (Figma), which was not acceptable.

I think many people also completely underestimate how important operational experience with your solution is at this scale. Switch to Citus/Vitess? You’ll now find out the hard way all the “gotchas” of running those solutions at scale, and it would almost certainly have resulted in significant downtime while they acquired this knowledge.

They’d also have to spend a ton of time catching up to RDS features they were suddenly lacking, which I would wager would take much more time than the time it took implementing their solutions.

Great job to both teams!

[0] https://www.figma.com/blog/how-figmas-databases-team-lived-t...

[1] https://www.notion.so/blog/sharding-postgres-at-notion

  • pas 2 years ago

    the right way to look at it - IMHO - is to interpret "lots of RDS experience" as a complete lack of run-your-own PostgreSQL experience. and given that, it's not surprising that their cost-benefit math gave them the answer of "invest in a custom middleware, instead of moving to running our own PostgreSQL plus some sharding thing on top"

    of course it's not cheap, but probably they are deep into the AWS money pit anyway (so running Citus/whatever would be similar TCO)

    and it's okay, AWS is expensive for a lot of bootstrapped startups with huge infra requirements for each additional user, but Figma and Notion are virtually at the exact opposite end of that spectrum

    also it shows that there's no trivial solution in this space, sharding SQL DBs is very hard in general, and the extant solutions have sharp edges

cplat 2 years ago

Good points. Although, having worked on many high scale architectures before, I always err on the side of thinking that any technical solution of this magnitude would have far too many nuances for a blog post to capture. And I believe Figma’s post also talks mainly about the common denominator that’s easier to communicate to the external world.

For me to understand their choices, I’ll first have to understand their system, which can be a task in itself. Without that, I’d not become aware of the nuances, only general patterns.

mkesper 2 years ago

Aren't there any good managed Postgres solutions supporting Citus? The decision here seems to have been to invent a whole new sharding solution instead of building enough in-house DBA capacity to self-host Postgres (if you want to stay on Amazon, you can use any extension you want on EC2). Speaks to the state of engineering right now.

kingraoul 2 years ago

The biggest issue with the Figma article was they did not discuss partitioning the tables before they sharded them.

jpalomaki 2 years ago

3rd-party solutions can also add complexity you don't want. You need to keep up to date with their release schedules to have access to bug and security fixes, even though, feature-wise, you would be happy with the older version.

Also, these can add unnecessary complexity by having features you don't need. Or they might be missing features you need. Contributing upstream can be difficult, and there might be conflicts of interest, especially for projects that have a separate paid version.

willsmith72 2 years ago

Nice article, I also wondered why they omitted citus. Is there any plan from rds to offer it?

Obviously it could interfere with demand for Aurora Limitless

  • plq 2 years ago

    > Is there any plan from rds to offer it?

    I don't think so because Microsoft bought Citus.

    • evanelias 2 years ago

      That, combined with AGPL licensing means the other big clouds won't touch it, as their lawyers have (rightly or wrongly) deemed AGPL too risky.

robust-cactus 2 years ago

Now seems a good time to point out, the wheel has literally been reinvented over and over again. The wheels of yesterday were terrible. Each version gets better. It's fine, reinvent away folks :)

  • mu53 2 years ago

    Seriously, a naive database sharding algorithm could be implemented in a week or so by a competent dev.

    A company like Figma (billions in revenue) putting a small team on implementing a database sharding solution for an unaddressed use case (RDS, not just Postgres), AND open sourcing it to create value for the community, is a net good for the industry.

    • harisund1990 2 years ago

      Sharding is the easy part. Eventually you need to implement distributed transactions, consistent backups across shards, PITR, resharding, load balancing, and the list goes on... That takes exponentially more people and time, and above all adds risk.
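
      To make "sharding is the easy part" concrete: as soon as one operation spans two shards you are into two-phase commit, and someone has to own recovery of half-finished transactions. A bare-bones sketch using Postgres's PREPARE TRANSACTION / COMMIT PREPARED (real SQL; the wiring is hypothetical, and the missing recovery and backup tooling is where the real cost is):

          import uuid

          # conn_a / conn_b: psycopg2 connections to two different shards.
          # Requires max_prepared_transactions > 0 on both servers.
          def run_across_two_shards(conn_a, conn_b, work_a, work_b):
              gid = f"txn-{uuid.uuid4()}"
              for conn, work in ((conn_a, work_a), (conn_b, work_b)):
                  conn.autocommit = True   # we drive BEGIN/PREPARE ourselves
                  cur = conn.cursor()
                  cur.execute("BEGIN")
                  work(cur)                # the shard-local writes
                  cur.execute(f"PREPARE TRANSACTION '{gid}'")        # phase 1
              for conn in (conn_a, conn_b):
                  conn.cursor().execute(f"COMMIT PREPARED '{gid}'")  # phase 2
              # If this process dies between the two loops, something else must
              # find and resolve the dangling prepared transactions; that tooling,
              # plus backups/PITR that agree across shards, is the hard part.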

      It works for Figma (for now), but making it work as a solution for other companies with different hardware, data schemas, and access patterns will add even more complexity to the mix.

      It's an excellent solution, but I don't think it will be good enough in the long run.

    • hobs 2 years ago

      And that would be the solution you'd absolutely abhor. Database sharding has a bunch of gotchas and things to think about because you must consider query access patterns along with your sharding (unless you want the devs to get owned or have very weird behavior.)

      Building something super simple can be OK for the base use case, but if you are a multi-billion-dollar company you can probably afford a few DBAs to actually make your platform good.

      • robust-cactus 2 years ago

        Complexity is a great reason to implement something like this in-house. It's probably better to understand (and fully control) the sharding and transaction mechanism than to trust a third party with such a core piece of infra.

        As companies get larger they move further up the stack whether it's sharding techniques, databases, custom orchestration software, their own networking hardware, etc.

giva 2 years ago

I still can't understand why they decided to use a single database for all their customers. If each customer needs access to its own data, why not a dedicated database for every customer?

  • harisund1990 2 years ago

    It's easier to manage 1 database instead of 1000s

    • nine_k 2 years ago

      It's more expensive to screw up one all-important database than one of a thousand.

      The same logic applies to compute boxes; see "pets vs cattle" from 15-20 years ago.

      • jitl 2 years ago

        The difference between "pets" and "cattle" is that pets have state and need to be taken care of; you can't recreate them from scratch trivially. Cattle are stateless and can be created and destroyed easily.

        The whole point of a database is to contain the state - as a pet - so the rest of your application can be stateless - as cattle.

        To really get cattle database systems, you need a self-managing cluster architecture that puts things on autopilot, like Neon, where you've got >=2 copies of each row and can tolerate losing any single box without unavailability.

        • nine_k 2 years ago

          This is fair.

          But restoring a small DB from a fresh backup, if things go really wrong, is faster, and does not affect other customers.

          I completely agree wrt having a hot spare / cluster with transparent failover and management.

    • sitkack 2 years ago

      SQLite requires near zero management.

  • jitl 2 years ago

    Multi-tenant design is a huge win in terms of reducing developer toil and expense.

    Many customers will have a tiny amount of data. For those customers a dedicated database is a huge amount of overhead. There may not be any single customer for whom it makes sense to allocate dedicated "hardware".

    Sure you have to deal with a one-time pain to shard your thingy, but you don't need to pay for tens-of-thousands of individual database servers, write interesting tools to keep their schemas in sync, wrangle their backups, etc.

    • giva 2 years ago

      I don't mean one database server for each customer. I mean one database per customer. Hundreds or thousands of customers can be on the same database server. When you need more resources, you add another server. If a customer grows too much, you move it to another server.

      There is a bit of overhead, but not huge by any means.

      Being able to update the schema one customer at a time is a huge plus in my view, as it gives you a lot of flexibility in rollout. You deploy a new version of the application on a new application server and move the customers to the new servers, updating their schema one by one (automatically, obviously).

      Backups are routinely automated anyway.
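
      The routing layer for that model can indeed stay small; the real work is the move and migrate tooling around it. A toy sketch with a hypothetical tenant catalog (made-up names):

          import psycopg2

          # Central catalog: which server/database holds each customer.
          TENANT_DSNS = {
              "acme": "host=pg-3.internal dbname=tenant_acme",
              "globex": "host=pg-7.internal dbname=tenant_globex",
          }

          def connect_for(tenant: str):
              return psycopg2.connect(TENANT_DSNS[tenant])

          def migrate_tenant(tenant: str, ddl: str) -> None:
              # Schema changes roll out one customer at a time.
              with connect_for(tenant) as conn, conn.cursor() as cur:
                  cur.execute(ddl)

          # Moving a grown customer = restore its database on a bigger server,
          # then point its TENANT_DSNS entry there.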

      • davitocan 2 years ago

        I've done this before where we ran a schema per customer and it was fabulous. Once the customer was large enough we could justify allocating a separate DB for them. The application was written in such a way that it knew which data store to query based on the user.

  • gedy 2 years ago

    > why not a dedicated database for every customer

    Well, there are trade-offs with this too: needing aggregate data across shards for features, reporting, etc.; shared data between customers and users; API access; and so on.

    • giva 2 years ago

      Sure, there are trade-offs. If a significant part of the value of the app relies on data sharing or transactions between customers, this is clearly unfeasible. But if the app mostly deals with a significant amount of customers' private data that occasionally needs to be shared, I think using separate databases (not servers) is the safest option.

  • hfucvyv 2 years ago

    Probably the same reason they used postgres instead of MySQL.

iAkashPaul 2 years ago

Even Notion has a similar approach to sharding Postgres, but both of them could benefit from having shard IDs prefixed with YY/MM/DD (as needed); otherwise it's back to the shard navigator once they max out on org IDs against each shard's capacity.

  • jitl 2 years ago

    (I work at Notion)

    Our shard key - Workspace ID - is a UUIDv4 so there’s a pretty high number of orgs per shard without conflict.

    • iAkashPaul 2 years ago

      Hey Jake, I meant capacity per shard in this case, not exhausting the IDs. Any potential solutions for that or is that not an immediate challenge?

      • jitl 2 years ago

        Gotcha. With our shard strategy we add more capacity either by scaling up nodes (very easy), or by resharding - adding nodes to the cluster and putting fewer shards on each node.

        We recently did a reshard from 32 nodes / 15 shards per node -> 96 nodes / 5 shards per node. That puts us on footing to scale up for a while before we need to reshard again. This is a pretty smooth process, and each time we scale out we get much more scale-up capacity.

        Our shard logic is very simple static assignment based only on the Workspace ID. If we wanted to add workspace-created-time routing, we'd need to start plumbing that information around the system in ways that are slightly annoying. Probably the move would be to re-key the Workspace table to use a date-embedding UUID format.
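
        For readers curious what that static assignment can look like, here is an illustrative version (not Notion's actual code; the constants are just the numbers quoted above, 96 nodes x 5 shards per node = 480 logical shards):

            import uuid

            LOGICAL_SHARDS = 480   # fixed across reshards: 32 x 15 before, 96 x 5 after
            NODES = 96

            def logical_shard(workspace_id: str) -> int:
                # UUIDv4s are uniformly random, so a modulo spreads workspaces
                # evenly across the logical shards.
                return uuid.UUID(workspace_id).int % LOGICAL_SHARDS

            def node_for(workspace_id: str) -> int:
                # Contiguous ranges of logical shards per node; a reshard changes
                # only this mapping, never the logical shard of a workspace.
                return logical_shard(workspace_id) // (LOGICAL_SHARDS // NODES)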

        https://www.notion.so/blog/the-great-re-shard

seanhunter 2 years ago

The answer is obvious: they invented their own sharding solution because it's a really really cool problem to work on and they have more engineers than they really need to develop their actual product. A more resource-constrained team would have found a solution that sharded their backend using one of the existing solutions out there.

I have seen this several times before and it's always a symptom of having too many engineers working below the waterline. Rather than work on the actual customer-facing problem, let's port the backend to do event-sourcing/cqrs, move all our infrastructure to k8s, change language from x to y etc.

These are all what I would call "internal goals" (ie they may or may not be necessary or even essential to progress but are not directly customer-visible in their outcomes even if they may enable customer features to be built or indirectly improve the customer experience later) and need to be held to an extremely high level of scrutiny.

If you're amazon/google/meta and you need to do this because of extreme user scale I might believe you. If you're CERN or someone and you need to do this because of absolutely ridiculous data scale I might believe you. The idea that it's better for figma to write their own sharding solution than it is to port to one of the existing ones just doesn't pass even the most basic sniff test.

  • aerhardt 2 years ago

    I can buy your comment as an interesting and even credible hypothesis, but the absolutes which you deal in (“doesn’t pass even the most basic sniff test”) are damning. You are clearly lacking huge amounts of information and context and are passing your own assumptions as hard facts.

    Also, I’m assuming Amazon or Google will sometimes roll their own solutions on problems of a scale in the same ballpark as Figma’s.

    But anyhow, what’s the scale at which this becomes acceptable, exactly? Is there a magical number which serves as a universal threshold? Or is there - like in all engineering decisions - a very concrete economic case for which you and I both lack a lot of the requisite context and inputs?

    • dbuser99 2 years ago

      In this particular case of sharding a PostgreSQL solution, in my opinion, the parent is right. Any major cloud provider would give companies of their scale assistance; this is their bread and butter. The posts likely omit the requirement to stay on AWS, but we don't know whether they talked about that. Likewise, Cockroach or Yugabyte were also available options.

    • CoastalCoder 2 years ago

      I like the approach you took for questioning an unqualified claim.

      Seems like a useful argument design pattern.

  • jitl 2 years ago

    We went through something similar at Notion a few years ago and also chose to stick with RDS Postgres and build sharding logic in our application’s database client.

    In both our case and Figma’s, sharding Postgres ASAP was of critical importance because of the threat of transaction ID wraparound or other capacity issues that promise hard, days-long downtime. The kind of downtime that costs tens of millions of dollars in brand damage alone. Possibly even company-ending.

    In such a situation, failure is not an option, and you must pick the least risky solution. Moving to an unmanaged cluster system and figuring out your own point-in-time backup/restore, access control provisioning, etc etc has a lot more unknown unknowns than sticking with the managed database vendor you know. The potential failure scenarios of Citus have scary worst cases - we get backup and restore wrong but it seems to work fine in test, then we move to Citus, then something breaks and we can’t restore from backup after all. It’s equally bad to mis-estimate the amount of time needed to bring up the new system. Let’s say you estimate 6 months to get parity with RDS built in features needed to survive disaster and start moving data over, but instead it takes 10 months. Is there enough time left to finish before going hard down? The clock is ticking. Staying with RDS keeps a whole class of new risk out of the picture.

    At least here at Notion, NO ONE wanted to build something complicated for fun. We really wanted the company we’d spent years working for and on-call for to survive.

    Our story: https://www.notion.so/blog/sharding-postgres-at-notion

    • hobs 2 years ago

      Or you could just hire some set of people who know how to manage postgres? Seems easier than building an entirely new thing with its own set of bugs that are unknown unknown brand damage awaiting you.

      • jitl 2 years ago

        It's not just manage Postgres, it's manage a Citus cluster - (unmanaged Postgres + postgres experts + time for them to implement their stuff) just gets us to parity with RDS but doesn't solve our sharding problem. We asked our Postgres consultants & networks to see if we could find Citus experts we could bring on full-time but didn't have great success. Most of the experts we talked to suggested application level sharding, and it seems like it worked out okay.

        • hobs 2 years ago

          Absolutely, I am just saying that you are now taking on all the inconsistencies of a third-party management system and building your solution on top of that; you don't get the infra savings and benefits of managing your own, you gain some velocity for now, and as big-name clients you will probably be stable for a few years.

          I had a problem just recently where I worked at a place that's using blue/green AWS RDS deployments with MySQL replication, and binlogs can't be moved in that service.

          This is something that is bog simple in a non-managed service, and as a result we can either manage app replication, re-sync data on each b/g upgrade, or do physical replication (slow). My point isn't that RDS is bad, it's just that if you are already deciding to implement your own significant infrastructure on top of the database, it seems weird to me not to just have the knobs on the thing itself.

          Though you could say the same is true of the storage, and tbqh most of the cloud storage is dogshit these days but we just deal with it.

      • vbezhenar 2 years ago

        It seems that these days the art of configuring a database has long been lost. I also completely don't understand the issue. Just buy two huge behemoth servers, put your Postgres there in replicated mode, and move on. It'll sustain a huge load. Surely those companies can afford to hire one sysadmin.

        • jitl 2 years ago

          You can’t necessarily play Cookie Clicker with database hardware scaling and have a good time. Query performance and upkeep processes often begin to degrade well before a table reaches the maximum hardware-bound size. We were using an instance with 96 cores and 350 GB of RAM, which seems over-provisioned on paper, and were still hitting a variety of issues like stalled-out Postgres autovacuum.

  • thih9 2 years ago

    The article suggests a different reason. What would be your approach if you wanted to stay on RDS?

    > So, now, let me speculate. The real reason why Figma reinvented the wheel by creating their own custom solution for sharding might be as straightforward as this — Figma wanted to stay on RDS, and since Amazon had decided not to support the CitusData extension in the past, the Figma team had no choice but to develop their own sharding solution from scratch.

    • MrJohz 2 years ago

      This rings a lot more true for me as well: a lot of the overly complicated decisions I've made haven't been because I wanted to try something interesting out (although occasionally it's been a factor), but more because I've ended up backed into a corner by previous decisions, factors outside my control, and limited time. Even when the simpler solution is obvious (which isn't always the case), it often takes a more complicated journey to get there. And balancing short term vs long term complexity is a challenge in its own right.

    • seanhunter 2 years ago

      Wanting to stay on RDS is a reason that doesn't survive the sort of extra scrutiny that I said should be applied in situations where you're doing a lot of work towards an internal goal. It also says in the article that they thought it was too risky to migrate (but somehow building their own sharding solution is going to be less risky for some reason).

      I could of course be wrong but it really just feels to me like the reasons given in the article are attempts to justify a decision that was actually made because of "not invented here" syndrome.

      • thih9 2 years ago

        Looks like you can’t think of a good reason to stay on RDS in this case, is that correct?

        • seanhunter 2 years ago

          I can totally see why they want to stay on RDS, but think the other considerations should almost certainly outweigh that.

          My main point is this decision makes no sense on its face[1]. Obviously I'm lacking the real context, so there may be overwhelming circumstances which mean that it was the right decision anyway, but these weren't explained in TFA for me. In TFA the reasoning was superficial, and this is the sort of decision that really should be held to a very high standard because as I say these types of internal goals have the potential to burn a ton of valuable engineering time on things which don't affect the customer-facing offering.

          Now we have, in a sibling thread, someone from Notion saying they did the same thing, and for me exactly the same reasoning applies. It could be that all these different SaaS companies are so special that them each building their own individual Postgres sharding solutions, to work around the fact that they can't get a sharded, managed Postgres instance, makes sense. Or not.

          [1] That's what I mean by saying it doesn't pass the sniff test. It might actually be the right decision but your instincts should rebel against it because it feels very wrong. So there needs to be a serious examination before going down that path.

    • mbesto 2 years ago

      Fair. But it doesn't really explain why they wanted to stay on RDS. This is their reasoning:

      > over the past few years, we’ve developed a lot of expertise on how to reliably and performantly run RDS Postgres in-house. While migrating, we would have had to rebuild our domain expertise from scratch.

      So they had in-house expertise to run performantly on RDS, but that same experience couldn't be translated to running on EC2 + Citus? Instead they went with another concept they had no experience with: building their own sharding? That left me scratching my head.

  • fhd2 2 years ago

    I suppose Figma might just be beyond the "let's find the fastest/cheapest way to get this working" point. I believe it makes sense for a company in that stage to mess about a bit, find different (maaaybe even better) ways of doing things, keep the engineering work interesting to attract/retain talent, be OK with the inevitable waste involved in that game. If you're chasing the global maximum, you shouldn't get too obsessed with local maxima.

    That said, I've seen plenty of unprofitable startups with high burn rate play this game. That seems a bit suicidal to me.

    • thih9 2 years ago

      > I suppose Figma might just be beyond the "let's find the fastest/cheapest way to get this working" point.

      The article implies otherwise. E.g. it quotes Figma saying: “Given our very aggressive growth rate, we had only months of runway remaining.”

      • fhd2 2 years ago

        Right, I was thinking of Figma in 2024 for some reason (they seem to have conquered the market, I'll just assume they're profitable with that pricing they have), the article talks about Figma in 2022, from what I gather. Should have read properly.

  • djtango 2 years ago

    I don't have a read on this - do we know that Figma isn't doing difficult stuff that warrants proprietary solutions?

    • winrid 2 years ago

      Absolutely not. Citus would have solved this problem. Or move to MySQL and use PlanetScale etc.

      The second-best option is the ability to easily create prod environments and then give those to your biggest customers (bigname.figma.com), etc. No single Figma customer will go beyond an i3.metal for the DB, or the app.

      • djtango 2 years ago

        So I just read the article - they were on RDS so Citus wasn't an option.

        They also stated it was too risky to migrate data stores on the timeline they were working within

        Those all seem like measured engineering decisions AFAICT

        • winrid 2 years ago

          That doesn't sound right.

          Do data dump from prod for initial sync and then setup replication from RDS to new cluster. Once synced do switch. Then you're off RDS and can shard on Citus.
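
          Mechanically, that is roughly Postgres logical replication; a sketch assuming rds.logical_replication is enabled on the source, with made-up host and publication names (the SQL is the easy part, the cutover plan is not):

              import psycopg2

              src = psycopg2.connect("host=old-rds.example dbname=app")      # RDS
              dst = psycopg2.connect("host=new-cluster.example dbname=app")  # new cluster
              src.autocommit = dst.autocommit = True

              # Publish every table on the RDS source...
              src.cursor().execute("CREATE PUBLICATION migrate_pub FOR ALL TABLES")

              # ...and subscribe on the new cluster: this copies the initial
              # table contents, then streams changes until lag is near zero.
              # (The schema must already exist on the destination, e.g. via
              # pg_dump --schema-only.)
              dst.cursor().execute(
                  "CREATE SUBSCRIPTION migrate_sub "
                  "CONNECTION 'host=old-rds.example dbname=app user=repl' "
                  "PUBLICATION migrate_pub"
              )

              # Cutover: stop writes, wait for the last changes to apply, point
              # the app at the new cluster, then distribute tables with Citus.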

          • djtango 2 years ago

            a) they didn't want to move off RDS; b) this is a pretty big hand-wave over migrating your persistence store on a moving product with moving engineering teams

            The coordination alone usually takes months

            • winrid 2 years ago

              Yes, it takes months. They also spent months building this custom solution that they now need to maintain.

          • raverbashing 2 years ago

            Yes but they probably wouldn't want to migrate off RDS.

          • nathanappere 2 years ago

            This doesn't work with the constraint of "staying on AWS" though.

        • thibaut_barrere 2 years ago

          I do agree here. The choice to prefer X over Y for hosting (no matter what X and Y are) often makes sense. Changing hosting providers can take a bit of time, and it's hard to fully assess the real quality of support and security beforehand (again, not specifically writing about RDS or Citus; both are very good teams), so it is usually safer to have a long probing period to move safely, which takes time, something they visibly didn't have much of.

    • jiggawatts 2 years ago

      I just read through several thousand lines of code re-implementing the concept of a distributed queue from the ground up... for an application that has maybe a few hundred users. And doesn't need queues, at all.

      This issue is so pervasive that we've all just assumed that it must be necessary.

      • djtango 2 years ago

        I couldn't get the context of this response - is this application you read unrelated to the featured article?

        I just read the article and from what I can tell the Figma team made a somewhat reasonable sounding decision

        • jiggawatts 2 years ago

          Yes, unrelated. My point was that wheel reinvention is a curse of the software industry because it's just so easy to reinvent every wheel on a whim. DevOps is no different. How many large orgs have their own build tooling, or some special sauce around large repos?

  • thibaut_barrere 2 years ago

    I don't read the story this way personally (not saying that these scenarios do not occur, but I feel the narrative detailed in the original article makes sense even without "chasing cool problems").

harisund1990 2 years ago

The article should be titled "Why Figma HAD TO Reinvent the Wheel with PostgreSQL". When you have a legacy system and not enough time or will to move off of it, the only option is to get inventive and build with what you have.

There is always a price. In this case the database team did something quick, cheap, and easy. But the application teams now have to deal with handling all the nuances of the system. Maybe Figma has more people on these app teams with time on their hands to handle it.

marwis 2 years ago

If you go with a sharding-proxy design, why not use Apache ShardingSphere?

It follows the same approach but is far more sophisticated and mature.

willi59549879 2 years ago

I wonder how Neon would perform with a database of this size. Several terabytes per table is pretty big.

RunSet 2 years ago

I find the wheel is most typically reinvented in pursuit of venture capital.

adityapatadia 2 years ago

Am I the only one here thinking they should have just used MongoDB and be done with it?

I know it's an oversimplified approach, but the majority of the problem would be solved.
