Scaling Rails and Postgres to millions of users at Microsoft

stepchange.work

202 points by htormey a year ago · 95 comments

pajeets a year ago

Postgres can be scaled vertically like Stack Overflow did, with a cache at the edge for popular reads if you absolutely must (but you most likely don't).

No need for microservices or even synced read replicas (unless you are making a game). No load balancers. Just up the RAM and CPU up to TB levels for heavy real-world apps (99% of you won't ever run into this issue).

Seriously, it's so easy to create scalable backend services with PostgREST, RPC, triggers, V8, even queues, now all in Postgres. You don't even need the cloud. Even a mildly RAM'd VPS will do for most apps.

I got rid of Redis, Kubernetes, RabbitMQ, and a bunch of SaaS tools. I just do everything on Postgres and scale vertically.

One server. No serverless. No microservices or load handlers. It's sooo easy.
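
A minimal sketch of the queues-in-Postgres idea, using FOR UPDATE SKIP LOCKED (the table and column names here are hypothetical):

```sql
-- Hypothetical job queue living entirely in Postgres.
CREATE TABLE jobs (
    id      bigserial PRIMARY KEY,
    payload jsonb NOT NULL,
    status  text NOT NULL DEFAULT 'pending'
);

-- Each worker claims one pending job; SKIP LOCKED means
-- concurrent workers skip rows already locked by others
-- instead of blocking on them.
UPDATE jobs
SET status = 'running'
WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'pending'
    ORDER BY id
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
RETURNING id, payload;
```

If a worker processes the job inside the same transaction, a crash rolls the claim back and the row returns to 'pending' for another worker to pick up.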

  • mr_toad a year ago

    Stack Overflow absolutely had load balancers, and 9 web servers, and Redis caches. They also used 4 SQL servers, so not entirely vertical either. And they were only serving 500 requests a second on average (peak was probably higher).

    • pajeets a year ago

      Was it? I read it was a huge-RAM server.

      • zo1 a year ago

        The details of their architecture are documented in a series of blog posts:

        https://nickcraver.com/blog/2016/02/03/stack-overflow-a-tech...

        I get what you're saying: they didn't do dynamic and "wild" horizontal scaling, they focused more on having an optimal architecture with beefy, "vertically scaled" servers.

        Very much something we should focus on. These days, horizontal scaling, microservices, Kubernetes, and just generally "throwing compute" at the problem are the lazy answer to scaling issues.

      • mr_toad a year ago
        • DoctorOW a year ago

          That's a primary and backup server for Stack Overflow and a primary/backup for SE. But they each hold the full dataset for their sites, so it's not actual horizontal scaling. Also, that page is just a static marketing tool, not very representative of their current stack. See: https://meta.stackexchange.com/questions/374585/is-the-stack...

        • KronisLV a year ago

          Having most of the servers be loaded at about 5% CPU usage feels extremely wasteful, but at the same time I guess it's better to have the spare capacity for something that you really want to keep online, given the nature of the site.

          However, if they have a peak of 450 web requests per second and somewhere between 11,000 and 23,800 SQL queries per second, that'd mean between 25 and 53 SQL queries to serve a single request. There are probably a lot of background processes and whatnot (and also queries needed for web sockets) that cut the number down, and it's not that bad either way, but I do wonder why that is.
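
          One way to dig into that, assuming the pg_stat_statements extension is enabled, would be to look at which statements dominate by call count:

          ```sql
          -- Top statements by call count (column names as of PG 13+).
          SELECT query, calls, mean_exec_time
          FROM pg_stat_statements
          ORDER BY calls DESC
          LIMIT 10;
          ```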

          The apps with good performance that I've generally worked with attempted to minimize the amount of DB requests needed to serve a user's request (e.g. session cached in Redis/Valkey and using DB views to return an optimized data structure that can be returned with minimal transformations).

          Either way, that's quite a beefy setup!

  • danmaz74 a year ago

    Having at least 2 web servers and a read-only DB replica for redundancy/high availability is very easy and much safer. Yes, setting up a single server is faster, but when your DB server dies - and at some point it will happen - that redundancy will save you not just a lot of downtime, but also a lot of stress and additional work.

    • brightball a year ago

      Read replicas come with their own complexity, as you have to account for the replica's lag time in the UX. This leads to a lot of unexpected quirks if it's not planned for.

      • danmaz74 a year ago

        That's true, but you can use your replica only for non-realtime reporting, or even just as a hot standby.

        Edit: Careful with the non-realtime reporting though if you want to run very slow queries - those will pause replication and can be a PITA.
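
        For what it's worth, that lag is easy to watch; a couple of standard queries (nothing assumed beyond streaming replication):

        ```sql
        -- On the standby: time since the last replayed transaction.
        SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;

        -- On the primary (PG 10+): per-replica lag as seen by the server.
        SELECT client_addr, replay_lag FROM pg_stat_replication;
        ```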

      • cqqxo4zV46cp a year ago

        A hot standby / failover still meets this definition. That’s how I interpreted what was being described.

    • cultofmetatron a year ago

      My startup has a similar setup (Elixir + Postgres). We use Aurora so we get automated failover. It's more expensive, but it's just a cost of doing business.

      • aledalgrande a year ago

        Last time I looked at Aurora (just as it came out) it was hilariously expensive. Are the costs better now for a real use case?

        • cultofmetatron a year ago

          > it was hilariously expensive

          It still is. But you have to look at it in perspective: do you have customers that NEED high availability and will pull out pitchforks if you are down for even a few minutes? I do. The peace of mind is what you're paying for in that case.

          Plus, it's still cheaper than paying a devops guy a full-time salary to maintain these systems on your own.

  • justinclift a year ago

    That works for the performance aspect, but doesn't address any kind of High Availability (HA).

    There are definitely ways to make HA work, especially if you run your own hardware, but the point is that you'll need (at least) a 2nd server to take over the load of the primary one that died.

  • nazka a year ago

    Thank you for sharing this! I have been diving into it.

    How do you manage transactions with PostgREST? Is there a way to do it inside it? Or does it need to be in a good old endpoint/microservice? I can’t find anything in their documentation about complex business logic beyond CRUD operations.

  • whakim a year ago

    Yes, scaling vertically is much easier than scaling horizontally and dealing with replicas, caching, etc. But that certainly has limits and shouldn’t be taken as gospel, and is also way more expensive when you’re starting to deal with terabytes of RAM.

    I also find it very difficult to trust your advice when you’re telling folks to stick Postgres on a VPS - for almost any real organization using a managed database will pay for itself many times over, especially at the start.

    • pajeets a year ago

      Looking at Hetzner benchmarks, I would say a VPS is quite enough to handle Postgres for an Alexa Top 1000 site. When you approach the top 100, you will need more RAM than what is offered.

      But my point is you won't ever hit this type of traffic. You don't even need Kafka to handle streams of logs from a fleet of generators in the wild. Postgres just works.

      In general, the problem with modern backend architectural thinking is that it treats the database as some unreliable bottleneck, but that is an old-fashioned belief.

      The vast majority of HN users and startups are not going to be servicing more than 1 million transactions per second. Even a medium-sized VPS from DigitalOcean running Postgres can handle that load just fine.

      Postgres is very fast and efficient, and you don't need to build your architecture around problems you won't ever hit, prepaying a premium for that <0.1% peak that happens so infrequently (unless you are a bank and receive fines for it).

      • whakim a year ago

        I work at a startup that is less than 1 year old and we have indices that are in the hundreds of gigabytes. It is not as uncommon as you think. Scaling vertically is extremely expensive, especially if one doesn’t take your (misguided) suggestion to run Postgres on a VPS rather than using a managed solution like most do.

        • pajeets a year ago

          It shouldn't be expensive to handle that amount of index data on a dedicated server.
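
          For sizing exercises like this, Postgres can report its own index sizes via the system catalogs:

          ```sql
          -- Ten largest indexes in the current database.
          SELECT c.relname AS index_name,
                 pg_size_pretty(pg_relation_size(c.oid)) AS size
          FROM pg_class c
          WHERE c.relkind = 'i'
          ORDER BY pg_relation_size(c.oid) DESC
          LIMIT 10;
          ```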

  • seabrookmx a year ago

    > One server

    What happens if this server dies?

    • wongarsu a year ago

      Then your service is offline until you fix it. For many services, that's a completely acceptable thing to happen once in a blue moon.

      Most would probably get two servers with a simple failover strategy. But on the other hand, servers rarely die. At the scale of a datacenter it happens often, but if you have like six of them, buy server-grade stuff, and replace them every 3-5 years, chances are you won't experience any hardware issues.

    • pajeets a year ago

      If you can't risk this rarity, then get a failover server with equal specs.

      Maybe add another for good measure... If the business insurance needs extreme HA, then absolutely have multiple failovers.

      My point is you aren't doing extreme orchestration or routing.

      Throw in Cloudflare DDoS protection too.

  • JB_Dev a year ago

    Eventually you get data residency asks to keep data in the right region and for that you need to have horizontal partitioning of some kind.

  • jamil7 a year ago

    Our backend at work does use a read replica purely for websockets. I always wondered if it was overkill; I'm not a backend developer, though.

    • pajeets a year ago

      Not sure what you are building, but I hope that was for a real-time multiplayer game; otherwise it doesn't make sense to have bi-directional communication when you only need reads.

      Making read replicas also accept writes is needed for such cases, but once you have more than one place to write, you run into edge cases and complexities in debugging.

      • jamil7 a year ago

        I think the reason is that pushes are sent out regularly in batches by some cron system, and rather than reading from the main database it reads from the replica before pushing them out. I didn't really explain the context properly in my comment.

  • mattacular a year ago

    > Just up the RAM and CPU up to TB levels

    Not sure what "CPU at TB levels" means, but I hope your wallet scales better vertically.

    • cosmicradiance a year ago

      They are definitely not on the cloud.

      • pajeets a year ago

        Aurora on AWS definitely has extreme RAM

        It's not cheap at roughly $200/hr, but if you have this type of traffic then you are (hopefully) generating revenue at much greater amounts.

cdiamand a year ago

I ran into some scaling challenges with Postgres a few years ago and had to dive into the docs.

While I was mostly living out of the "High Availability, Load Balancing, and Replication" chapter, I couldn't help but poke around and found the docs to be excellent in general. Highly recommend checking them out.

https://www.postgresql.org/docs/16/index.html

  • danpalmer a year ago

    They are excellent! Another great example is the Django project, which I always point to for how to write and structure great technical documentation. Working with Django/Postgres is such a nice combo and the standards of documentation and community are a huge part of that.

    • irjustin a year ago

      Interestingly I have had almost the exact opposite experience being very frustrated with the Django docs.

      To be fair, it could be because I'm frustrated with Django's design decisions having come from Rails.

      From learning Django a few years ago, I still carry a deep loathing for polymorphism (generic relations [0]) and model validations (full clean [1]).

      You know what - it's design decisions...

      [0] https://docs.djangoproject.com/en/5.1/ref/contrib/contenttyp...

      [1] https://docs.djangoproject.com/en/5.1/ref/models/instances/#...

      • rtpg a year ago

        Generic relations are hard to get right; really, if you can avoid using them, you're going to avoid a lot of trickiness.

        When you need them... it's nice to have them "just there", implemented correctly (at least as correctly as they can be in an entirely generic way).

        Model validations are a whole thing... I think Django offering a built-in auto-generated admin leads to a whole slew of differing decisions that end up being really tricky to handle.

      • globular-toast a year ago

        Would love to hear more about what you don't like with model validations (full clean).

        • irjustin a year ago

          Sorry on the slow reply.

          But yea, I can complain at length.

          - Model validations aren't run automatically. Need to call full_clean manually.

          - EXCEPT when you're in a form! Forms have their own clean, which IS run automatically because is_valid() is run.

          - This also happens to run the model's full_clean.

          - DRF has its own version of create which is separate and also does not run full_clean.

          - Validation errors in DRF's Serializers are a separate class of errors from model validations and thus model Val Errors are not handled automatically.

          - Can't monkey patch models.Model.save to run full_clean automatically, because it breaks some models like User AND it would then run twice for Forms+Model[0].

          Because of some very old web-forum style design decisions, model validations aren't unified thus the fragmentation makes you need to know whether you're calling .save()/.create() manually, are in a form, or in DRF. And it's been requested to change this behavior but it breaks backwards compat[0].

          It's frustrating because in Rails this is a solved problem. Model validations ALWAYS run (and only once) because... I'm validating the model. Model validations == data validations, which means they should hold everywhere regardless of caller; for exceptions, I should be required to be explicit when skipping (as in Rails), whereas in Django I need to be explicit about running it - sometimes... depending on where I am.

          [0] https://stackoverflow.com/questions/4441539/why-doesnt-djang...

          • globular-toast a year ago

            Thanks for your reply. I'm currently in a stage of falling out of love with Django and trying to get my thoughts together on why that is.

            I think Django seems confused on the issue of clean/validation. On the one hand, it could say the "model" is just a database table and any validation should live in the business logic of your application. This would be a standard way of architecting a system where the persistence layer is in some peripheral part that isn't tied to the business logic. It's also how things like SQLAlchemy ORM are meant to be used. On the other hand, it could try to magically handle the translation of real business objects (with validation) to database tables.

            It tries to do both, with bad results IMO. It sucks to use it on the periphery like SQLAlchemy; it's just not designed for that at all. So everyone builds "fat" models that try to be simultaneously business objects plus database tables. This just doesn't work for many reasons. It very quickly falls apart due to the object-relational mismatch. I don't know how Rails works, but I can't imagine this ever working right. The only way is to do validation in the business layer of the application. Doing it in the views, like rest framework or form cleans, is even worse.

            • irjustin a year ago

              Yeah definitely understand the frustration. I've been there and while I don't think we've found _the_ solution, we've settled into a flow that we're generally happy with.

              For us, we separate validations into two kinds: Business and Data validations, which are generally defined as:

              - Business: An invoice in country X needs to ensure Y and Z taxes are applied at billing T+3 days, otherwise throw an error.

              - Data Validation: The company's currency must match the country it operates in.

              Business validations and logic always go inside services, whereas data validations are on the model. Data validations apply to 100% of all inserts. Once there's an IF statement segmenting a group, it becomes business validation.

              I could see an argument that the above is bad because sometimes it's a qualitative decision. Once in a while the lines get blurry, a data validation becomes _slightly_ too complex, and an argument ensues as to whether it's data vs business logic.

              Our team really adheres to services and not fat models, sorry DHH.

              To me, it's all so controversial that whatever you pick will work out just fine - just stick to it and don't get lazy about it.

              • globular-toast a year ago

                Services are definitely better and a solid part of a domain-driven design. The trouble is that with Django I think it's a bandaid on a fundamentally broken architecture. The models end up anaemic because they're trying to be two things at once. It's super common to see things like services directly mutating model attributes and setting up relationships manually by creating foreign keys etc. All of that should be hidden far away from services.

                The ultimate, I think, is Domain-Driven Design (or Clean Architecture). This gives you a true core domain model that isn't constrained by frameworks etc. It's as powerful as it can be in whatever language you use (which in the case of Python is very powerful indeed). Some people have tried to get it to work with Django, but it fights against you. It's probably more up-front work as you won't get things like the Django admin, but unless you really, truly are doing CRUD, the admin shouldn't be considered a good thing (it's like doing updates directly on the database, undermining any semblance of business rules).

  • jbverschoor a year ago

    Like many of the BSDs

    • aerzen a year ago

      Did Postgres use to be a BSD? Are they known for good documentation?

      • andrewf a year ago

        BSD was the Unix distribution; BSD and Postgres/Ingres development did overlap at UC Berkeley.

      • password4321 a year ago

        BSD? No, that's operating system(s)

        Good documentation? Yes

rubyfan a year ago

15 years ago I worked on a couple of really high-profile Rails sites. We had millions of users with Rails and a single MySQL instance (+ memcached and nginx). Back then Ruby was a bit slower than it is today, but I'm certain some of the things we did at that scale are things people still do today…

1. Try to make most things static-ish reads and cache generic stuff, e.g. most things became non-user-specific HTML that got cached as SSI via nginx or memcached

2. Move dynamic content to services that load after the static-ish main content, e.g. comments, likes, etc. would be loaded via JSON after the page load

3. Move write operations to microservices, i.e. creating new content and changes to the DB become mostly deferrable background operations

I guess the strategy was to serve as much content as possible without dipping into the Ruby layer, except for writes or infrequent reads that would update the cache.
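
Step 1 might look roughly like this in nginx; the cache zone, upstream address, and timings below are hypothetical, not from the actual deployment:

```nginx
# Sketch only: cache non-user-specific HTML in nginx, expanding SSI
# fragments, and fall through to the Ruby layer on a cache miss.
# These directives belong inside the http block.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=pages:10m inactive=10m;

upstream app_backend {
    server 127.0.0.1:3000;
}

server {
    listen 80;

    location / {
        ssi on;                       # expand <!--#include --> fragments
        proxy_cache pages;
        proxy_cache_valid 200 1m;     # briefly cache rendered HTML
        proxy_pass http://app_backend;
    }
}
```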

teleforce a year ago

Please check out this excellent book on scaling Rails and Postgres by a former Microsoft and Groupon engineer:

[1] High Performance PostgreSQL for Rails: Reliable, Scalable, Maintainable Database Applications by Andrew Atkinson:

https://pragprog.com/titles/aapsql/high-performance-postgres...

giovannibonetti a year ago

What a small world. Earlier today I got tagged in a PR [1] where Andrew became the maintainer of a Ruby gem related to database migrations. Good to know he is involved in multiple projects in this space.

[1] https://github.com/lfittl/activerecord-clean-db-structure/is...

  • andatki a year ago

    Hi there! That's funny! This interview and those gem updates were unrelated. However both are part of the sweet spot for me of education, advocacy, and technical solutions for PostgreSQL and Ruby on Rails apps.

    I hope you’re able to check out the podcast episode and enjoy it. Thanks for weighing in within the gem comments, and for commenting here on this connection. :)

benwilber0 a year ago

Postgres can scale to millions of users, but Rails definitely can't. Unless you're prepared to spend a ton of money.

  • petcat a year ago

    For real. Show me a company that has scaled RoR or Django to 1 million concurrent users without blowing $250,000/month on their AWS bill. I've worked at unicorn companies trying to do exactly that.

    Their baseline was 800 instances of the Rails app...lol.

    I'm not going to name-names (you've heard of them) ... but this is a company that had to invent an entirely new and novel deployment process in order to get new code onto the massive beast of Rails servers within a finite amount of time.

    • loktarogar a year ago

      I've scaled a single Rails server to 50k concurrent, so if Rails is the theoretical bottleneck there, and we base it off my meager scaling efforts, that's only 20 servers for 1 million concurrent, or around $1,000/mo at the price point I was paying (Heroku).

      Rails these days isn't the top of the speed meters but it's not that slow either.

      • petcat a year ago

        Sounds like you made a nice, tight little Rails app. 50,000 concurrent? Oh man, I wish.

    • hw a year ago

      “Rails can’t scale” is so 10 years ago. It’s often other things like DB queries or network I/O that tend to be bottlenecks, or you have a huge Rails monolith that has a large memory footprint, or an application that isn’t well architected or optimized.

    • ainiriand a year ago

      We use 5 EC2 instances to serve around 32 million requests per day on PHP, all under 100ms. It is not the language.

      • dilyevsky a year ago

        Sounds impressive until you realize that there are 86,400 seconds in a day, so even if the majority of those requests happen during business hours, that's still firmly under 200 qps per server. On modern hardware that's very small. Also, what instance size?

      • nov21b a year ago

        The language/runtime certainly has an impact. But indeed, in reality there is no way to compare these scaling claims. For all we know, people are talking about serving from an HTTP-level cache without even hitting the runtime.

        • ainiriand a year ago

          Each and every request reaches the DB and/or Redis. MyISAM is deprecated, but it is crazy fast if you mainly read.

      • jylasdfasd a year ago

        This is trivial with epoll(7) or io_uring(7). The "5 EC2 instances" you are describing could likely be attributed to language and/or framework bloat, but it's hard to know for certain without details.

      • charlie0 a year ago

        Framework or custom app?

        • ainiriand a year ago

          Raw PHP scripts, no ORM either. It has very good abstractions for some logic, and some other parts are just spaghetti functions. Changing anything is difficult and critical, so we are not able to refactor much.

    • sparker72678 a year ago

      Were they running t2.micro instances or something?

      We're running 270k+ RPM no sweat, and our spend for those containers is maybe 1/100th what you're quoting there.

      The idea that Rails can't handle high load is just such bloody nonsense.

      You can build an abomination with any framework, if you try.

    • dcchambers a year ago

      > Show me a company that has scaled RoR or Django to 1 million concurrent users without blowing $250,000/month on their AWS bill.

      Can you deploy something to Vercel that supports a million concurrent users for less than $250K/month? What about using AWS Lambdas? Go microservices running in K8s?

      I think your infra bills are going to skyrocket no matter your software stack if you're serving 1 million+ concurrent users.

    • danmaz74 a year ago

      "without blowing $250,000/month on their AWS bill". The point is that you don't need AWS for this! You can use Docker to configure much, much cheaper/faster physical servers from Hetzner or similar with super-simple automated failover, and you absolutely don't need an expensive dedicated ops team for this kind of simple deployment, as I read so often here on HN.

      You might get surprised at how far you can go with the KISS approach with modern hardware and open source tools.

      • dilyevsky a year ago

        You ain’t replacing 250k/mo worth of EC2 with a single Hetzner server, so your “super-simple failover” option goes out the window. Bare metal is not that much faster if you’re running Ruby on it; don’t fall for the marketing.

        • danmaz74 a year ago

          I never said that you should have only one server on Hetzner. For the web servers and background workers, though, scaling horizontally with Docker images on physical servers is still trivial.

          By the way, I was running my startup on 17 physical machines on Hetzner, so I'm not talking from marketing but from experience.

cies a year ago

My experience scaling up Rails (mostly in size of codebase NOT in size of traffic) really made me love typesafe languages.

IDE smartness (auto-complete, refactoring), compile-time errors instead of runtime errors, clear APIs...

Kotlin is a pretty nice "Type-safe Ruby" to me nowadays.

  • Alifatisk a year ago

    I had a similar experience: working in a large Ruby codebase made me realise how important type hints are. Sometimes I had to investigate what types were expected and required, because the editor was unable to tell me. I hope RBS / Sorbet solve this.

neonsunset a year ago

This desperately needs the Walmart treatment of JET.com’s teams post-acquisition :)

jojobas a year ago

What's Rails and Postgres? Do they mean ASP.NET and MS SQL Server?

  • andatki a year ago

    Rails and Postgres (and AWS) was the pre-acquisition stack, and development continued with that stack during this time period (2020-2021). https://en.wikipedia.org/wiki/Flip_(software)

    Microsoft was acquiring companies with web and mobile platforms from varied backgrounds at a high rate. I got the sense that the tech stack—at least when it was based on open source—was evaluated for ongoing maintenance and evolution on a case-by-case basis. There was a cloud migration to Azure and encouragement to adopt Surface laptops and VS Code, but the leadership advocated for continuing development in the existing stack, as feature development was ongoing and the team was small.

    Besides hosted commercial versions, I was happy to see Microsoft supporting community/open source PostgreSQL so much and they continue to do so.

    https://en.wikipedia.org/wiki/List_of_mergers_and_acquisitio...

    https://techcommunity.microsoft.com/t5/azure-database-for-po...

    • neonsunset a year ago

      PostgreSQL has been the most popular choice for greenfield .NET projects for a while too. There really isn't any vendor lock-in as most of the ecosystem is built with swappable components.

djaouen a year ago

I don't understand why you wouldn't just use Elixir/Phoenix if you need to scale?

  • foundart a year ago

    Perhaps because you need to scale quickly and already have a large Rails app that would take a long time to recreate in another language and framework.

  • SkyPuncher a year ago

    It’s hard to compete with Rails productivity

  • seabrookmx a year ago

    I don't understand why you wouldn't use <compiled language that's faster than the BEAM> if you need to scale?

    /s

    • djaouen a year ago

      I mean, you could, but you'd be missing out on the Rails-esque nature of Elixir/Phoenix.

datadeft a year ago

Scaling a framework that is not scalable by default, and that should have been a few services written in a performance-first language, at a billion+ USD company.

I am not sure why we are boiling the oceans for the sake of a language like Ruby and a framework like Rails. I love those to death, but Amazon's approach is much better (or it used to be): you can't make a service for 10,000+ users in anything other than C++ or Java (probably Rust as well nowadays).

For millions of users the CPU cost difference probably justifies the rewrite cost.

  • gls2ro a year ago

    You are connecting the dots backwards, but a project is usually trying to connect the dots forward.

    So if you have a lot of money, then you can start implementing your own web framework from scratch in C. It will be the perfect framework for your own product, and you can put 50 dev/sec/ops/* on the team to make sure both the framework and the product code get written.

    But some (probably most) products are started with 1-2 people trying to find product-market fit, or whatever the name is for solving a real problem for paying users as fast as they can. And then defer scaling to when money is coming in.

    This is relevant here because this is about a startup/product bought by Microsoft and not built in-house.

    For fast delivery of stable, secure code for web apps, Rails is a perfect fit. I am not saying it's the only option, but there are not many frameworks offering the stability and batteries included to deliver, with a small team, a web app that can scale to product-market fit.

  • danmaz74 a year ago

    "For millions of users the CPU cost difference probably justifies the rewrite cost." This is only true if you have expensive computations done in Ruby or Python or similar, which is very rarely the case.

    • consteval a year ago

      Not true; Ruby and Python are absurdly slow at even trivial tasks. Moving stuff around in memory, which is most of what a webapp does, is expensive. Lots of branches are going to be really expensive too.

      • danmaz74 a year ago

        I've got more than 15 years of Rails production experience, including a lot of performance optimisation, and in my experience the Ruby code is very rarely the bottleneck. And in those cases, you can almost always find some solution.

  • ainiriand a year ago

    You really do not know what you are talking about; it is not about the language, as has been repeated in this forum many, many times already. We serve an application in PHP to thousands of users per second in less than 100ms, consistently.

    • hamandcheese a year ago

      Sometimes it is the language. Or at least the ecosystem and libraries available.

      My go-to example is graphql-ruby, which really chokes serializing complex object graphs (or did; it's been a while since I've had to use it). It is pretty easy to consume 100s of ms purely on compute to serialize a complex GraphQL response.

      • Lio a year ago

        I have mixed feelings about this. It's like saying that Python is too slow for data science, ignoring that Python can outsource that work to Pandas or NumPy.

        For GraphQL on Rails you can avoid graphql-ruby and use Agoo [1] instead, so that the work is outsourced to C. So in practice it's not a problem.

        1. https://github.com/ohler55/agoo

        • datadeft a year ago

          > python can outsource that work to Pandas or NumPy.

          Exactly. So C/C++/Fortran is better in this regard than Python.

      • ainiriand a year ago

        I would make the case that that's not the language's fault. You need to assess how critical speed is in your requirements and adapt your solutions.

    • datadeft a year ago

      > You really do not know what you are talking about

      > it is not about the language

      Sure, how about these people?

      https://thenewstack.io/which-programming-languages-use-the-l...

  • neonsunset a year ago

    Yup. As if there is no wealth of organizational knowledge and a particular first-party language to address this exact problem.
