The Amazon Prime Day 2023 AWS Bill

lastweekinaws.com

147 points by bpugh 2 years ago · 65 comments

version_five 2 years ago

  Amazon Prime Day event resulted in an incremental 163 petabytes of EBS storage capacity allocated – generating a peak of 15.35 trillion requests and 764 petabytes of data transfer per day. 
The main thing that strikes me is how (seemingly) inefficient everything is. What could they possibly need this amount of data for just to sell stuff? Are they taking high-def video of every customer as they browse for something to buy? I get that it's a huge company and this is (I guess) their busiest time, but how can they need so much storage? Ditto for much of the other stuff.
  • luhn 2 years ago

    Yeah, those numbers struck me as well. At 375 million items sold, that's about 0.5GB storage and 2GB transfer per item.
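
    A rough sanity check of those per-item figures, as a quick Python sketch (numbers taken from the article and rounded; nothing here is Amazon data beyond that):

        # Back-of-envelope per-item arithmetic from the article's Prime Day figures.
        ebs_bytes      = 163e15   # ~163 PB of incremental EBS capacity
        transfer_bytes = 764e15   # ~764 PB of data transfer
        items_sold     = 375e6    # ~375 million items sold

        print(f"storage per item:  {ebs_bytes / items_sold / 1e9:.2f} GB")       # ~0.43 GB
        print(f"transfer per item: {transfer_bytes / items_sold / 1e9:.2f} GB")  # ~2.04 GB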

    • steveBK123 2 years ago

      10+ years ago I worked on a trading system that was generating something like 1TB/day of messaging.

      As we hit these levels we asked them - how many trades are we even doing on this system? The answer was something on the order of... 50. Granted it was a bond system and the notionals are huge, but there's just no reason to store 20GB per trade.

      These are the kinds of decisions that get made when one team is responsible for message generation and the other is responsible for the storage, lol.

      We then had to work backwards with them to unwind a lot of the INFO level chatty messaging between what you'd now call "microservices" and reduce the volume by 90+%.

    • fbdab103 2 years ago

      I suppose you need to know how many requests did not result in a purchase. Is it 1000 views:purchase? I have not checked in on a Prime Day sale for several years, but is there any timeliness component (Flash Sales?) where people would be incentivized to mash the reload button?

    • tomwheeler 2 years ago

      Yes, but that's per item sold.

      After looking at screen after screen of no-name garbage on Prime Day, I gave up. I suspect that there are tons of people like me. In other words, we only contributed to the numerator, not the denominator.

  • thenewarrakis 2 years ago

    I think the EBS numbers are "double counting". Most of the other services in the list are using EBS under the hood, so I wouldn't be surprised if this number includes the storage behind the Aurora instances, CloudTrail events, SQS events, etc. that are also counted elsewhere in the list.

    Also, it specifically says "incremental capacity allocated", not necessarily used. Keep in mind that every EC2 instance launched also means new EBS storage is allocated. The article also estimates that 50 million EC2 instances were used for Prime Day. If you assume that half of these were newly created to support the surge of Prime Day, 25 million instances using up 160 PB of storage is only 6 gigabytes per instance, which definitely seems in the realm of possibility.
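
    The per-instance version of that estimate, as a sketch (the 50%-new-instances split is the guess above, not anything from the article):

        # Incremental EBS spread over newly launched instances, per the guess above.
        incremental_ebs_bytes = 163e15      # ~163 PB allocated for Prime Day
        new_instances         = 50e6 / 2    # assumption: half of ~50M instances were new

        print(f"{incremental_ebs_bytes / new_instances / 1e9:.1f} GB per new instance")  # ~6.5 GB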

  • rqtwteye 2 years ago

    It seems to me that a lot of modern architectures store the same data in multiple places. The systems I see proposed in my company often need probably 10 times more space than the actual data we have because they copy and cache a lot of stuff.

    • figassis 2 years ago

      Microservices require denormalizing data across tables and dbs. There’s a cost to how many microservices you build.

  • CamperBob2 2 years ago

    Hot take: Amazon's search UX is so terrible that it not only wastes near-endless amounts of customer time and patience, but their own bandwidth as well.

    • greatpostman 2 years ago

      They’ve a/b tested it to death

      • fiddlerwoaroof 2 years ago

        I wonder if Amazon has overfitted and/or a/b tested itself into a bad local optimum. It’s pretty hard for me to believe that their current website really is as good as their data indicates.

        • thayne 2 years ago

          IME a/b tests are often run by people with little to no knowledge of statistics or experimental procedure. It is pretty easy to end up backing bad decisions with data when you don't completely understand the data.

      • CamperBob2 2 years ago

        "Hey, check this out! User engagement as a function of time spent on amazon.com is up 125% with the new build!"

        Once a metric becomes a target for optimization, it often loses its value as an indicator of a larger goal. People who obsess over A/B tests rarely understand that.

  • thayne 2 years ago

    A lot of that was certainly just for the root volumes of all those ec2 instances (how much exactly is hard to know without more details). Which of course would have duplicate copies of the various base images for the VMs.

    Although, that does bring up the question of why AWS doesn't have a way to share a single read-only volume across multiple ec2 instances in the same availability zone. In many workloads there isn't any need to write to disk.

  • kamikaz1k 2 years ago

    Sometimes it’s just a bad decision that happens to “scale”. Like the print video thing.[1]

    1. https://youtu.be/J7ITgYBn_3k

  • twoodfin 2 years ago

    The EBS storage could easily be highly redundant (for good reason) local cache copies of store data.

  • pipingdog 2 years ago

    Logging, metrics, distributed action trace.

rurp 2 years ago

> $102 million in infrastructure spend for an event that brought in over $12.7 billion in sales isn’t the worst return on investment that companies could make — by a landslide!

Well it's not amazing if your margins are tiny, as they are in many industries (such as retail). Plus this was almost certainly architected by some of the foremost AWS experts in the world. It's verrrry easy to spend vastly more than was strictly necessary in AWS.

I don't mean to be too negative though, it was a really interesting article. Pretty wild to think about spending $100m on infrastructure over two days and still making a bunch of profit.

  • madrox 2 years ago

    Important to remember that, before you could burst your infrastructure in the cloud, sites simply went offline in events like this. You actively lost revenue in those cases.

    • ndriscoll 2 years ago

      Or you could just design your architecture to not perform trillions of database requests for hundreds of millions of sales.

      The listing data is almost static and should almost fit in RAM (the hot set probably does. Apparently Amazon has ~350M listings. A 24TB RAM server could give ~68kB/listing, and probably only a small fraction is hot). Since you'll need multiple servers anyway, you could shard on products and definitely fit things in RAM. 375 million sales even if condensed into 1 hour would only be 104k/second. A single db server should be able to handle the cart/checkout. Assuming ~10M page views/second, a couple racks of servers should be able to handle it.

      The ad/tracking infrastructure surely can't account for the 1000x disparity in resource usage.
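
      Putting rough numbers on that sketch in Python (the 1% conversion and the one-hour burst are guesses, not Amazon data):

          # Back-of-envelope for serving the catalog from RAM and handling checkout.
          listings       = 350e6   # ~350M listings
          ram_bytes      = 24e12   # one 24 TB RAM server
          sales          = 375e6   # items sold over Prime Day
          burst_window_s = 3600    # pessimistic: all sales squeezed into one hour
          views_per_sale = 100     # ~1% conversion guess

          print(f"RAM per listing:   {ram_bytes / listings / 1e3:.1f} kB")                   # ~68.6 kB
          print(f"peak checkouts/s:  {sales / burst_window_s:,.0f}")                         # ~104k/s
          print(f"peak page views/s: {sales * views_per_sale / burst_window_s / 1e6:.0f}M")  # ~10M/s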

      • turtlebits 2 years ago

        I think you're forgetting that Amazon doesn't have a 100% conversion rate...

        • ndriscoll 2 years ago

          I'm not. That's why I threw out 10M page views per 100k purchases. Maybe 1% is an overestimate of conversions, but I imagine a 48x multiplier to average traffic is an overestimate of peak traffic, so it balances out. It would be interesting to know the actual peak number of user actions/second though.

      • madrox 2 years ago

        Are you saying the reason why sites went offline pre-cloud was because engineers were simply bad at design?

        • ndriscoll 2 years ago

          More importantly, "pre-cloud" means years ago and therefore older hardware, but also yes, software mostly isn't written to be high performance.

          Modern NVMe drives get 1000x the performance of hard disks 10 years ago. You can buy one that can fit the entire reddit text database for $150 now. 10 years ago you'd be looking at a high six-figure SAN appliance from IBM or EMC to get the kind of performance my desktop has now. You can have TBs of RAM and 100+ cores in a server now. You can get 400 Gb/s networking now, and some people even have 10 Gb/s home internet. You could basically run some of the biggest sites from 10 years ago out of your closet these days.

          Some software has also improved a lot in the last 10 years. Things like io_uring are great. Green threads are great. Postgres is super fast these days, and it keeps getting faster. My old quad core computer with a SATA drive can already do ~60-70k requests/second with a Scala web app and postgres. That's without even using graal or loom or trying to screw around with core affinity.

          If anything, the cloud scales poorly. People in practice end up using vastly underpowered VMs, and then horizontally scale them, which introduces a ton of overhead (computationally and management-wise). RDS gets you like 3000 IOPS baseline and increasing that to the level of a single NVMe drive will cost one employee’s firstborn child each month, so people end up with this mistaken belief that RDBMSs are slow or don’t scale.

          AWS will provide you with reference architectures to use lambdas for web requests and advertise their “scalability” [0], but the API only lets you serve 1 request/lambda at a time, and according to their docs, you can only have up to “tens of thousands” of concurrent lambdas[1]. That would also require 10s of thousands of connections to your db, which kills it, and doesn’t let you batch things unless you first put your work onto something like an SQS queue and have a separate db worker lambda pick up batches. More infra to manage (and more $$$) instead of writing a dozen lines of code to add an in-memory work queue, and you end up needing to write more code to deal with sending work/status across the system anyway.

          So my old i5 with an SSD ends up scaling better than their “well-architected” “scalable” serverless solution. AWS will happily give you plenty of this kind of advice that will lead to a slow, expensive, large system.
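
          Roughly the kind of in-memory work queue being described above, as a minimal sketch (Python for brevity; save_batch() is a placeholder for whatever bulk write you'd actually do against the database):

              import queue
              import threading
              import time

              # A single writer thread drains the queue in batches, so request handlers
              # never open their own DB connections. Batch size, timeout, and save_batch()
              # are illustrative placeholders, not anyone's production values.
              work_q = queue.Queue(maxsize=100_000)

              def save_batch(batch):
                  # Placeholder: one bulk INSERT / COPY against the database.
                  print(f"wrote {len(batch)} rows")

              def db_writer(batch_size=500, timeout=0.05):
                  while True:
                      batch = [work_q.get()]                      # block until there is work
                      try:
                          while len(batch) < batch_size:
                              batch.append(work_q.get(timeout=timeout))
                      except queue.Empty:
                          pass                                    # flush a partial batch on timeout
                      save_batch(batch)

              threading.Thread(target=db_writer, daemon=True).start()

              # Request handlers just enqueue; the writer batches them into the DB.
              for i in range(2_000):
                  work_q.put({"order_id": i})
              time.sleep(0.5)                                     # let the sketch flush before exiting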

          The one big upside of AWS is that if you do need to manage a lot of servers (like you are in the IoT space and need to handle millions of requests per second), they have good tools for doing that. Multi-region redundancy is also a click of a button if you need that. But they normalize overbuilding (and thus needing that management) way before it's necessary.

          [0] https://docs.aws.amazon.com/wellarchitected/latest/serverles...

          [1] https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-...

    • Uvix 2 years ago

      Depending on the margins that could be preferable.

  • boulos 2 years ago

    Yes, at a 1% margin on those sales, that's more like $127M in profit. It's important to remember that things like Prime Day are basically marketing that results in revenue outside the event.

    • dylan604 2 years ago

      >It's important to remember that things like Prime Day are basically marketing

      Be it Prime Day or Black Friday/Cyber Monday sales, I've seen the prices before the sale starts, and then once the sales start, it is the same price but with a slashed out higher MSRP type price. It's not any more of a sale during the sale than it was any of the other days.

  • Retric 2 years ago

    Yea, actual profit was likely $100-400 million or so. As such, spending $102 million on a single line item would be a serious question for most companies.

    Of course Amazon is paying itself that premium so they have little incentive to care.

mrbonner 2 years ago

It’s not a surprise for me to hear that Amazon is still a heavy user of RDBMS all these years even after the so-called Rolling Stone project to get rid of Oracle DB in 2015. If Amazon can use RDBMS at their scale, I’m just furious when folks jump up and down screaming at the top of their lungs “Why do we use Postgres and not (insert some random NoSQL engine here)?” My response so far is to calmly ask another question: “Why not?” And let them try to find a justification to suit our scale requirements.

  • endisneigh 2 years ago

    It’s fascinating that this is your conclusion from the article. Mine would be that if you can make it work and believe these estimates, then dynamodb is clearly more cost effective. And given that every project inevitably settles into its access patterns and thus is a perfect fit for something like dynamodb, why bother with an rdbms on the hot path? Just use dynamo and stream to a columnar database for analytics once your product is “finished”.

    • bognition 2 years ago

      It all depends on your workload, access patterns, and data model.

      You can absolutely spend an arm and a leg making a system work using a RDBMS that would be simpler and cheaper using a NoSQL store. The opposite is also true.

      When picking a database you should always consider the trade offs of the different technologies and weigh those against your goals and budgets.

      Sometimes it's okay to spend more for a system that is just simpler to manage and use. Sometimes it’s not.

  • orochimaaru 2 years ago

    Your application use cases should dictate the database choice - eg consistency needed, access patterns, data normalization, reliability, etc.

benjaminwootton 2 years ago

The real cost would come in the months after whilst trying to decipher the bill adequately to track down everything you used and get it turned off. (Half a joke.)

I imagine there would be a ton of Lambda and the like in there too.

jayzalowitz 2 years ago

Corey is probably right, but I'd add an extra 10-20% of overprovisioning/undercounting to the actual bill here, and considering they OWN the fleet, they probably went out of their way to have disaster recovery ready to go in a bunch more contexts.

ckdarby 2 years ago

Even if AWS treats Amazon like any other customer, the article is off by 30-60%.

RIs for their RDS instances. Savings Plans for their EC2s.

1 or 3 year commit, no upfront vs all upfront, etc.

A customer the size of Amazon using AWS would have a private pricing arrangement and an EDP.

  • simpsond 2 years ago

    You wouldn’t commit for 3 years for increased resources of a single day.

    • leetrout 2 years ago

      You could and then sell the extra on the spot market for the other ~1000 days.

    • ckdarby 2 years ago

      True, but not all of the usage is net new, and they'll have a base commit.

  • jayzalowitz 2 years ago

    Honestly, their EDP is probably effectively at cost, set in stone to make sure that if the government breaks them up or something like that, both businesses are good.

jeffbee 2 years ago

The amount of mail alone is bonkers. If we assume that half of this traffic went to the big operators, Google and Microsoft, each of them would have observed a noticeable traffic bump, 10s of 1000s of requests per second on average all day. It is fun to think about how these systems are interconnected and how they affect each other.

infinitedata 2 years ago

Funny how folks here and from the article are fixated on comparing the $102M vs the $12.7B. They somehow forget there are product, advertising, warehouse, transportation, shipping, labor, operations and other costs involved. You didn’t spend $102 to earn $12,700…

RcouF1uZ4gsC 2 years ago

Sometime back IIRC, some hackers were upset about something Amazon did and tried to DDOS them.

When they realized their entire attack was just a fraction of what Amazon handled during the Holiday shopping season, they realized the futility and called it off.

mokarma 2 years ago

Naive question: What are they using EBS for? It seems unnecessary given all the databases. Is that just local caching for EC2s?

  • cj 2 years ago

    EBS is just a networked hard drive, so they could really be using it for anything storage related.

    Is Amazon’s general architecture for their retail site publicly described anywhere?

  • OJFord 2 years ago

    Well, without precluding other use, not even specifically caching but just disk for EC2 instances.

  • nonameiguess 2 years ago

    At minimum, root volumes for the VMs. Theoretically, you could load immutable machine images from the network and run entirely off of in-memory filesystems if you persist nothing past instance shutdown (similar to how extremely cautious people might run Tails booted off USB on a laptop with no hard drive), but that won't actually save cost since memory is more expensive than disk anyway.

    • thayne 2 years ago

      I don't think you can even technically do that in AWS. I don't think there is any way to detach the root volume from a running instance, or to boot from an immutable network image. However, for many server workloads, operating entirely from memory would be reasonable. Often you just need the operating system kernel and your server software, and maybe a monitoring agent. And all of that will be loaded in memory anyway.

  • yowlingcat 2 years ago

    Well, not to answer your question with a question, but what would you imagine backs all of those database services? Or, said another way, I'm not sure Corey Quinn is mapping the cost dependency graph correctly by giving this breakdown as mutually exclusive (from the standpoint of AWS internally).

  • ripper1138 2 years ago

    It’s disk storage for EC2

gumby 2 years ago

> There’s the internal chargeback costs that AWS charges Amazon for services that would be subject to

Do they do this? I have asked some friends who are developers at AWS and both told me that they don't worry about, or even know, what their usage costs. But that's just anecdote; perhaps their boss knows.

  • donavanm 2 years ago

    I can't comment on individual teams or the business and accounting practices of Amazon.

    I would ABSOLUTELY say that, at a minimum, every director or principal engineer needs to be familiar with costs and _should_ understand their P&L. Senior engineers and line managers probably/should have a passing familiarity or consideration. Individual random SDEs may not, as it's not their primary business function or deliverable, and someone else is ultimately responsible.

    Disclosure: Principal at AWS, opinions are my own.

Seanambers 2 years ago

Isn't the real clue here that the prices in the article are cost + margin? AMZN gets a steal.

  • ovao 2 years ago

    And notably, AWS can, and likely does, allocate whatever unused or unpartitioned infra to themselves (or, more pedantically, to Amazon). A perpetual ‘savings plan’.

fragmede 2 years ago

Did any AWS customers experience unavailability during Prime Day, e.g. capacity issues launching instances, due to Prime Day taking precedence over other customers? If there were, they're under NDA so we'd never know.

  • other_herbert 2 years ago

    You’ve talked me into running some load tests around and before these times… around thanksgiving I’ll give it a shot too… I wonder though if it’s just a redirection of traffic… if regular business sites are less busy because people are shopping it would just slightly shift the load from one “side” to the other

    Hmmmm….

  • bashtoni 2 years ago

    My expectation would be that Prime Day just causes AWS to get a little further ahead than normal with provisioning new infrastructure.

    With AWS still growing they are constantly having to add hardware. Ahead of Prime Day, I presume they just bring forward new resources that their model otherwise says aren't needed for a few months.

    Unavailability for other customers would indicate either that AWS growth has plateaued, that they have hit the limit of how quickly they can provision new hardware, or that they just did their sums wrong.

  • vineyardmike 2 years ago

    Amazon surely allocates their resources in advance of Prime Day, so they can preemptively change prices to account for demand or deny requests.

    That said, why would capacity issues be behind NDA? Anyone can grab their API and attempt to allocate a VM (or 100k)

    • LazyMans 2 years ago

      You can query the spot pricing API and see what’s going on with that. I have a feeling Amazon purposely tries not to hang their customers out to dry by consuming large amounts of spot instances, or by letting on-demand usage tank spot availability.

    • fragmede 2 years ago

      Just a chilling effect from general paranoia over breaking NDA. What is and isn't actually covered by the NDA isn't something I had time to look up for my comment.

      You can't spin up 100k instances on a virgin account, but it's an interesting idea!

  • sokoloff 2 years ago

    I’m sure many customers have some form of mNDAs with AWS. I’d have to read ours to be sure, but I don’t think ours would preclude us talking about the problems we experienced on Prime Days, if there were any. (We saw none.)

  • spencerchubb 2 years ago

    I am not saying that this is true AT ALL, but it would be kind of ironic if AWS slowed down competing ecommerce stores to try and get an advantage.
