Modern cloud architecture on AWS: server fleets and databases

start.jcolemorrison.com

87 points by colemorrison 6 years ago · 53 comments

malisper 6 years ago

Although there are a ton of AWS services, there are only a few core ones that I recommend:

  EC2 - You need a server.
  RDS - You need a database.
  S3 - You need to store files.
  Lambda - You are building an API with short lived requests.
These services are all very high quality and are excellent at what they do. Once you get outside of these core services, the quality quickly drops. You're probably better off using the non-AWS versions of those services.

For a few quick examples, you should be using Datadog over CloudWatch, Snowflake over Redshift or Athena, and Terraform over CloudFormation.

  • scarface74 6 years ago

    Why would you ever use Terraform over CloudFormation? So many parts of AWS use CF: you can modify the getting-started templates from services like CodeStar, or export a SAM template from your Lambda configuration.

    Before someone comments on how TF is “cross platform”, all of the provisioners are vendor-specific.

    As for what other services to use: if you are hosting your own services on AWS instead of using AWS managed services, you’re kind of missing the point of AWS.

    But a few other services we use all of the time are CodeBuild, ElastiCache (hosted Redis), Elasticsearch, Route 53, load balancers, autoscaling groups, SSM (managing the few “pets” until we can kill them), ECS, ECR, Fargate, SNS, SQS, DynamoDB, SFTP, CloudTrail, Microsoft AD, the recently announced Device Farm/Selenium service (which we’re experimenting with), step functions, Athena, Secrets Manager, and a few more I’m probably forgetting.

    • viraptor 6 years ago

      > Why would you ever use Terraform over CloudFormation?

      1. You're using Terraform already for resources outside of AWS (cdn, monitoring, dns, anything else) and want to stay with a common tech.

      2. You're running into cases that CF doesn't support and have to generate your descriptions externally, or use SparkleFormation hacks.

      3. You want to manage a new AWS service. (CloudFormation support lags behind Terraform, new services don't get CF resources for months)

      • scarface74 6 years ago

        In cases two and three it’s just as easy to write a custom resource....
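
        For reference, the skeleton is small: a custom resource is just a Lambda that answers CloudFormation's lifecycle events. A minimal Python sketch (the create/delete helpers are hypothetical; cfnresponse is the helper AWS provides to inline ZipFile functions):

          import cfnresponse  # available to inline (ZipFile) Lambda code in CloudFormation

          def handler(event, context):
              # CloudFormation sends Create, Update, or Delete lifecycle events
              try:
                  if event["RequestType"] == "Create":
                      resource_id = create_unsupported_thing(event["ResourceProperties"])  # hypothetical
                  elif event["RequestType"] == "Delete":
                      delete_unsupported_thing(event["PhysicalResourceId"])  # hypothetical
                      resource_id = event["PhysicalResourceId"]
                  else:  # Update
                      resource_id = event["PhysicalResourceId"]
                  cfnresponse.send(event, context, cfnresponse.SUCCESS, {}, resource_id)
              except Exception:
                  # if you never respond, the stack hangs until it times out
                  cfnresponse.send(event, context, cfnresponse.FAILED, {})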

        • viraptor 6 years ago

          You mean just as easy to write/test/deploy a custom resource as it is to use a ready one? I disagree. I think there's a few days' worth of difference in that case.

  • fragmede 6 years ago

    Depending on the market segment you exist in, I'd recommend AWS Fargate and AWS Lightsail (a container runner and a Digital Ocean/Linode-style VPS competitor, respectively) over EC2. There's absolutely a segment for which EC2 is appropriate, but just like most data isn't "big", I suspect most EC2 customers would be better served by Lightsail. If you've got several hundred or several thousand EC2 instances with bespoke code/config for many different ASGs, then Lightsail isn't for you, but (my impression is) that's not most people.

  • viraptor 6 years ago

    > you should be using Datadog over CloudWatch

    DataDog is great, but the way it polls data means you can't rely on it being available promptly: https://docs.datadoghq.com/integrations/faq/cloud-metric-del...

    > If you receive 1-minute metrics with CloudWatch, then their availability delay is about 2 minutes—so total latency to view your metrics may be ~10-12 minutes.

    If an alert delayed by 10min matters to you, DD is not viable for alerting (it could still be used for dashboards).

    • bradknowles 6 years ago

      CloudWatch Logs has a lot of its own internal latencies. If you can send the logs straight from your ec2 hosts to the log processing system and bypass CloudWatch, then you only want/need CWL for the things you can’t get from those logs.

      CloudWatch Metrics is a totally separate beast, which happens to share a similar name. You can set up basic alerts in CWM, and you can trigger certain types of events from those alerts, but it is still very limited. If you want real monitoring and alerting, then CWM isn’t even the easy 80%.

      • viraptor 6 years ago

        Agreed, my message was about using DD instead of CW for internal logs. With your own logs, you've got so much more flexibility - but not everything can be done that way. For example ELB stats don't really exist elsewhere in realtime.

  • reese_john 6 years ago

    Just curious, why would you not recommend SQS?

    • malisper 6 years ago

      I've never used SQS but IMO it seems inferior to Kinesis or Kafka. The two big reasons are that you can't have multiple consumers read from a single queue, and that once data leaves the queue, it's gone forever. Both Kinesis and Kafka let you have multiple consumers and configure a retention period for your messages.

      • appwiz 6 years ago

        You can have millions of consumers read concurrently from a single SQS queue. Messages that are read remain in the queue up to the configured retention period or until a consumer calls DeleteMessage.

        Source: I’ve built very high volume services that continue to run production workloads and use SQS as the buffer between components.
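
        The consume/acknowledge cycle looks roughly like this with boto3 (the queue URL and process() are placeholders):

          import boto3

          sqs = boto3.client("sqs")
          queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

          while True:
              resp = sqs.receive_message(QueueUrl=queue_url,
                                         MaxNumberOfMessages=10,
                                         WaitTimeSeconds=20)  # long polling
              for msg in resp.get("Messages", []):
                  process(msg["Body"])  # placeholder handler
                  # until this call, the message is only invisible for the
                  # visibility timeout; it reappears if we crash before deleting it
                  sqs.delete_message(QueueUrl=queue_url,
                                     ReceiptHandle=msg["ReceiptHandle"])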

        • malisper 6 years ago

          > You can have millions of consumers read concurrently from a single SQS queue.

          We're using different definitions of "consumer". By consumer, I'm talking about a group of workers that processes the data for one purpose. For example, you may have one consumer read from the queue to generate various metrics and a second consumer read from the queue and write to a DB. With vanilla SQS, when you process a message, you need to perform all the tasks simultaneously. With Kinesis and Kafka you can have independent groups of workers (i.e. independent consumers), each performing one of these tasks. Each consumer is able to process the queue at its own rate.

          The way Amazon recommends doing this with SQS is to have SNS fan messages out to multiple SQS queues, which you can then consume independently[0]. That will multiply your costs by the number of queues you have.

          > Messages that are read remain in the queue up to the configured retention period or until a consumer calls DeleteMessage.

          I'm talking about retaining a message even if it was successfully processed, on the order of days or weeks. I've used this feature of Kafka before to implement a recovery log. Under normal operation, Kafka writes data to a DB. If the DB goes down, you can quickly recover the last N days of data by going through the data retained in Kafka.
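
          A rough boto3 sketch of that fan-out setup (names are hypothetical, and the queue policy that lets SNS deliver to each queue is omitted):

            import boto3

            sns = boto3.client("sns")
            sqs = boto3.client("sqs")

            topic_arn = sns.create_topic(Name="events")["TopicArn"]
            for name in ("metrics-worker", "db-writer"):  # one queue per consumer group
                q_url = sqs.create_queue(QueueName=name)["QueueUrl"]
                q_arn = sqs.get_queue_attributes(
                    QueueUrl=q_url, AttributeNames=["QueueArn"])["Attributes"]["QueueArn"]
                sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=q_arn)

            # one publish, one copy per queue -- hence costs multiply per consumer
            sns.publish(TopicArn=topic_arn, Message='{"event": "signup"}')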

          [0] https://forums.aws.amazon.com/message.jspa?messageID=865925

          • scarface74 6 years ago

            One producer/multiple consumers is what SNS + message attributes + subscription filters + SQS is for.

            If your database goes down, you have point in time recovery and read replicas that can be promoted as needed.

      • CoolGuySteve 6 years ago

        How so? I've run thousands of consumers on SQS for batch jobs and it seems to work.

        There are also dead letter queues and retries for messages that aren't properly serviced.

Jonnax 6 years ago

At what scale would you want to use RDS rather than using an EC2 instance with Postgres installed?

Assuming that the operator has the skills to manage Postgres.

It's not like RDS does something complex like Geodistribution, right?

Also what is the scaling like? Is it automatic? How quickly can you handle more connections? Because my understanding was that it was slow.

I did have a play with their RDS Postgres some months back, and I somehow managed to crash it, requiring a restore from snapshot. Also, their smallest instance was quite expensive for the performance.

  • makmanalp 6 years ago

    I'd ask the opposite question - at what scale would you want to have your own custom setup rather than RDS? Managing your own database infrastructure for workloads other than "a few queries a second" is hard work with a lot of pitfalls, and you better be at a size that there's some benefit (high levels of customization, use case specific tuning, economies of scale, etc). As a person who does exactly this for a living, I'd rather shell out for RDS or a similar offering than my own setup most of the time. Especially at first, before you discover what exactly you /don't/ like about it or what you'd want different.

    • tmpz22 6 years ago

      Is it hard work though? In a couple hours you should be able to set up automatic backups and practice going through the recovery process a couple times. That's all there is for most small-business setups, but if you are daring you can now do whatever you want with the config file, install extensions, and set up basic system monitoring (CPU/RAM usage, disk usage, etc.). GCP/Digital Ocean let you look at node resource usage automatically, and since Postgres is probably the only process, you don't even need to set that up!

      • malisper 6 years ago

        > In a couple hours you should be able to set up automatic backups and practice going through the recovery process a couple times.

        Unfortunately there's a lot more to it than that. You need to handle when the backup job fails or dies, have a process for deleting old backups, etc. Not just that, but if you have multiple Postgres instances, you need to do this work for each machine. I've seen first hand this kind of stuff become a huge distraction. It's often worth paying AWS a bit more in exchange for not worrying about this stuff.

        • tmpz22 6 years ago

          > Unfortunately there's a lot more to it than that.

          Is there though? Consider what I would argue to be the "average" case:

          * Your database never exceeds > 40% resource usage

          * You service fewer than 1m queries/day

          * You never burst more than 1k queries/minute

          * You have a script tied to a cronjob that backs up the database, with basic error handling that sends you a Slack DM if it fails (see the sketch below)

          * You have a script tied to a cronjob which deletes old backups, with basic error handling that sends you a Slack DM if it fails

          What percentage of companies need more than that?
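
          For concreteness, a minimal sketch of that backup script (the webhook URL, paths, and db name are placeholders), plus the crontab line that runs it:

            import datetime
            import json
            import subprocess
            import urllib.request

            SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # placeholder
            TARGET = f"/var/backups/pg/db-{datetime.date.today()}.dump"

            def slack_dm(text):
                req = urllib.request.Request(SLACK_WEBHOOK,
                                             json.dumps({"text": text}).encode(),
                                             {"Content-Type": "application/json"})
                urllib.request.urlopen(req)

            try:
                subprocess.run(["pg_dump", "-Fc", "-f", TARGET, "mydb"],
                               check=True, timeout=3600)
            except Exception as e:  # non-zero exit, timeout, pg_dump missing, ...
                slack_dm(f"pg backup failed: {e}")
                raise

            # crontab: 0 3 * * * /usr/bin/python3 /opt/backup.py
            # caveat: if cron never runs this at all, no DM is sent either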

          • fragmede 6 years ago

            > * Your database never exceeds > 40% resource usage
            > * You service fewer than 1m queries/day
            > * You never burst more than 1k queries/minute

            How do I know it doesn't exceed 40% usage? Better yet, who's holding the pager when it does? If/when it does, whose product launch is dead in the water while the db is reconfigured onto a larger instance? What product isn't being delivered because we're faffing about with the database instead of product code?

            > * You have a script tied to a cronjob that backs up the database, with basic error handling that sends you a Slack DM if it fails
            > * You have a script tied to a cronjob which deletes old backups, with basic error handling that sends you a Slack DM if it fails

            Who's responsible for restoring from backup every week/month/quarter, to assert the backups actually work with whatever changes have been made recently? Untested backups are Schrödinger's backups.

            Just how well tested is this script? Does it properly error out if it fails to run? What if a firewall rule accidentally gets set that blocks egress from the backup box to the Internet (for security); who/how/what gets notified instead? Whose deliverables are slipping because the backups randomly stopped working?

            > What percentage of companies need more then that?

            That's a fair question, but Amazon's done far more research on that topic than I have, and possibly more than you have. The real question is: of companies that don't need more than that, how many want to hire somebody to take on those responsibilities part-time? How many have the expertise to even hire somebody qualified to do that part-time? And since those people are managing the DB part-time, how many of them are giving it the attention it needs, and aren't distracted by other responsibilities to the company?

            None of those problems are insurmountable, but they're far from most business' core competency, and time I'm spending dealing with postgresql.conf (or my.cnf) is time I'm not dealing with other issues. Don't get me wrong, there's still a time and place for managing database instances, but IMO small business (small > tiny) aren't the appropriate place for that. I'd be interested in hearing if someone's run the numbers to justify it though! (Especially if it falls in favor of running it yourself.)

          • sciurus 6 years ago

            Any that can't afford more than a couple minutes of downtime when a server fails.

            • viraptor 6 years ago

              That's definitely not an "average" company. It's also a really small number of companies that genuinely can't afford that, as opposed to just earning less money than usual.

    • forkLding 6 years ago

      I used both too, and I think it really depends on whether you want to pay for it. For small to medium independent projects, I think an EC2 instance that sometimes cuts past that free tier usage is fine; otherwise RDS can be overkill and can seriously eat into costs. I would say a beginner doing independent projects should strongly consider EC2 instances instead of RDS.

  • reilly3000 6 years ago

    RDS can scale read replicas and fail over to a new master, but Aurora and its serverless option are much better for auto-scaling. Behind the scenes, storage is decoupled from compute, which makes scaling fundamentally easier.

    EC2 is your only choice if you want a database that AWS doesn’t support, such as Rethink or Cassandra (they just recently launched a managed Cassandra service though). EC2 is also your only choice if you need full control of the DB, such as using many Postgres extensions and foreign data wrappers. Even some triggers and UDFs are limited.

    A self-managed, auto-scaling, cross-AZ replicated DB setup is no small matter with EC2. Not to mention logging, metrics, and patching of the DB and underlying OS. It’s 100% doable, but one should only proceed with that course with an understanding of the human costs.

    Personally, I’ve been choosing FaunaDB these days when possible. It’s a no-ops managed service and has on-prem/VPC options. I just write graphQL clients and move on with my life, the rest just works.

    • CoolGuySteve 6 years ago

      I went with Aurora Postgres recently for a new application and to be honest it fucking sucks.

      When it scales up, queries start to fail and there are long delays. I had to add a try/retry loop around all my inserts to avoid losing data.

      I definitely wouldn't recommend it for bursty applications where it needs to rapidly scale up.

      The thing that bugs me is that I know a similar workload works fine on a relatively cheap c3.xlarge so I'm not actually saving much money with Aurora.

      • nostrebored 6 years ago

        ... were you not handling errors on database inserts through well established concepts like exponential backoff and retry already?

        that seems like a colossally bad idea

        If you don't need the striped storage model and guarantees that Aurora provides, it might not be the service for you
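
        For reference, the pattern in question is small (a sketch assuming a DB-API style connection such as psycopg2's; production code should catch only transient errors, not Exception):

          import random
          import time

          def insert_with_retry(conn, sql, params, attempts=5):
              for attempt in range(attempts):
                  try:
                      with conn.cursor() as cur:
                          cur.execute(sql, params)
                      conn.commit()
                      return
                  except Exception:  # narrow this to transient/connection errors
                      conn.rollback()
                      if attempt == attempts - 1:
                          raise
                      # exponential backoff with jitter: ~0.1s, 0.2s, 0.4s, ...
                      time.sleep((2 ** attempt) * 0.1 + random.random() * 0.1)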

        • CoolGuySteve 6 years ago

          Normally I’d agree with you but I was only doing a dozen or so inserts per second for maybe a minute or two after hours of idleness.

          There was no back off because it was never needed before, even the shittiest instance could handle it without issue but Aurora failed spectacularly.

  • meritt 6 years ago

    RDS (postgres, mysql, maria) is basically just a managed ec2 instance. The instance cost is about 2x the price of a comparable ec2 instance, which seems reasonable to me. Storage costs are pretty comparable to EBS. You can do push-button upgrades to increase the capacity, but it's slow. The main benefits I think you get over self-managed are automated / on-demand backups, pretty seamless software upgrades/patches, the ability to quickly spin up a duplicate instance from a snapshot for testing/distributing workloads/etc, and replication that works pretty seamlessly.
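
    That spin-up-from-snapshot step is a single API call (boto3 sketch; identifiers are placeholders):

      import boto3

      rds = boto3.client("rds")
      rds.restore_db_instance_from_db_snapshot(
          DBInstanceIdentifier="mydb-scratch",         # the new duplicate instance
          DBSnapshotIdentifier="rds:mydb-2019-12-01",  # an existing snapshot
          DBInstanceClass="db.t3.medium",              # can differ from the source
      )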

    Provisioned IOPS is one area that can get expensive very quickly, but people often don't realize that you get 3 PIOPS included with every 1GB of allocated storage (e.g. 1TB of storage gives a 3,000 IOPS baseline), so you really don't need to pay for provisioned IO if you have a decent amount of storage.

    If you want auto-scaling you need to look at Aurora or Redshift, which are quite different and significantly more expensive. I've not used those.

  • zedpm 6 years ago

    It isn't a matter of getting to a certain scale before you use RDS, if that's what you're asking. The value of RDS is streamlining and automating db administrative tasks. If you want to update from Postgres 11.5 to 11.6, for example, you just change that setting on your RDS instance and it happens, either immediately or at some scheduled maintenance window. If you want a hot standby in a separate availability zone, it's trivial to add one. Read replica? Trivial to add.
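
    The version bump amounts to one call (equivalent boto3 sketch; the instance name is a placeholder):

      import boto3

      rds = boto3.client("rds")
      rds.modify_db_instance(
          DBInstanceIdentifier="mydb",
          EngineVersion="11.6",
          ApplyImmediately=False,  # defer to the next maintenance window
      )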

    In general, I've found it makes sense to pay the premium for RDS and spend my and my team's time on more valuable work than db admin tasks.

  • malisper 6 years ago

    I've previously administered a cluster of Postgres instances with a total of about 1PB of data. My recommendation is that you should use RDS unless you have a reason not to.

    RDS takes care of tons of administrative tasks such as backups, replication, failover, and database upgrades. Yes, you can set up backups yourself, but the ongoing maintenance is going to be a pain. You need to deal with what happens when a backup fails, have a playbook for restoring from backup, clean up your old backups, etc. These are tasks that are extremely dangerous to get wrong, and they are completely taken care of for you by RDS.

  • mdeeks 6 years ago

    I'll take a shot at this. There is always an asterisk under every one of these. Every company and situation is different.

    If you're a tiny startup or hobby project with literally no money, it might make sense to manage it yourself because you have no choice.

    Once you have some money and a viable business, then your value is no longer your ability to spend your time running Postgres, ensuring backups and restores work, creating replicas, upgrading software, and setting up all of the monitoring tools. You provide MUCH more value spending your time and abilities building things that are core to the business that let you make money and grow.

    No doubt you can do it all and save some cash. But you have to do it regularly if you want confidence that everything you have built still works. With RDS, you pay them some extra for a near guarantee that it will all just work 100% of the time.

    Once you become a large company with tons of engineers and you start to bump into limits of RDS, then it might make sense to run it yourself again. It is a significant burden to do it correctly 100% of the time. Your entire business can fail if you don't do your job right.

    • nunez 6 years ago

      If I were a "tiny startup", I would absolutely leverage RDS in the first instance. The last thing I want to spend time doing is setting up Postgres and accumulating tech debt from a bad or suboptimal install. There are RDS instances that qualify for Free Tier as well if you're running _really_ lean.

      In fact, the only times I would consider migrating to a DB on EC2 are if the database isn't supported by RDS (DB2, for example, though I think there are solutions for this now), if the licensing for RDS makes it cost-prohibitive, or if the client needs to be able to really tweak the database engine or daemons.

      • mdeeks 6 years ago

        The key words were "literally no money". By tiny startup I meant one or two people without any funding, revenue, and saddled with life expenses like rent, food, college savings, etc.

        I definitely agree that a DB on EC2 is tech debt. But sometimes that is something you have to accept for immediate financial reasons while you figure out how to create a product with revenue to cover costs.

  • paxy 6 years ago

    RDS/Aurora does do autoscaling, backups, georeplication, encryption etc. It's more a matter of time & convenience rather than skill. Sure you could do all of it yourself on an EC2 instance, but at some point it becomes a big chunk of your job, and you would rather be spending your time on other things.

  • nunez 6 years ago

    You would move to RDS the minute you know you're committed to AWS and know that you don't want to worry about things like backups, upgrades or clustering. The disadvantage to doing so is that you lose a lot of administrative privileges to the database server itself and you don't get access to the filesystem or underlying OS. We had trouble migrating a sizable SQL Server installation onto RDS because it had a ton of triggers and stored procedures (SPs) that relied on files in the filesystem.

    • bradknowles 6 years ago

      Yeah, well. That’s SQL Server for you. There’s a lot of things that RDS can’t do for you in the SQL Server flavor.

      RDS MySQL != RDS Postgres && RDS (MySQL || Postgres) >> RDS SQL Server

  • pestaa 6 years ago

    There are advantages when using RDS other than scaling.

    The performance dashboard is especially nice.

    The reusable sets of configuration are convenient, and the UI makes it easy to compare original vs. changed values.

    The instance upgrade is not seamless, though; it is normally scheduled for the next maintenance window unless you decide to apply and reboot immediately.

    I agree the service is expensive, but setting up a db server for production takes a lot of time and expertise.

  • pdeva1 6 years ago

    One thing that doesn't get talked about with RDS is that the network cost for replicating data in Multi-AZ deployments is free. Depending on how much you write to RDS, this cost can dominate CPU/memory costs on non-RDS installations.

  • encoderer 6 years ago

    For Cronitor, we did it when we needed replication.

kamilafsar 6 years ago

I keep reading all these horror stories about Aurora (especially PostgreSQL). Is there anyone out there with an alternative story?

  • BrentOzar 6 years ago

    We run a 2TB database that keeps 30-60 days of data on hand, ingesting roughly 50GB/day. We've been using Aurora Postgres since it came out, and it's been pretty good. (Good enough that it's never crossed my mind to think about moving to something else.)

    Source: Microsoft SQL Server performance tuner who runs an app that centralizes SQL Server performance data from thousands of servers. You'd think I would be running MSSQL on the back end, but the licensing costs just didn't make sense compared to Aurora Postgres.

  • Roritharr 6 years ago

    Using Aurora MySQL for over a year now in prod, purrs like a kittycat.

    Just don't use the AWS Database Migration Service if you can help it; that thing has a couple of badly documented pitfalls. (E.g. tables can't have ENUM fields.)

    • etaioinshrdlu 6 years ago

      AWS Database Migration Service has a shockingly large gap between how it's marketed and how it actually performs.

      It had so many gotchas and broken features. I'd be amazed if anyone got it really working on large applications without weeks of time invested.

      It would be so cool though if it worked seamlessly. It addresses one of the hardest tasks in DB management, namely zero-downtime server migration.

  • popotamonga 6 years ago

    10 separate xlarge Postgres instances, and never a single issue in 3 yrs of heavy use. What stories are those?

  • Trisell 6 years ago

    Used it a year at my last job and now almost a year at my current job. Never had an issue. It just runs no matter how much I throw at it. I've only had to change the instance sizes to deal with data ingest.

  • vidar 6 years ago

    Links?

ramoz 6 years ago

"Modern" architectures can get quite complex and fast at scale & in complex cases. This is merely a simple introduction to simple components of modern cloud architecture.

  • pm90 6 years ago

    I do agree with you, and I was lured into reading it because of that. However, this seems like a nice introduction for beginners. Maybe it should be tagged as such.

root-z 6 years ago

As someone who has spent a fair amount of time working with AWS, I appreciate how approachable this tutorial is; the official docs are usually far more arcane.
