AWS Operational issue – Multiple services in us-east-1
health.aws.amazon.com

My guess is this is all due to CloudWatch Logs PutLogEvents failures.
By default, a Docker container configured with the awslogs log driver runs in "blocking" mode. As the container writes logs, Docker buffers them and pushes them to CloudWatch Logs frequently. If the log stream comes in faster than the buffer can absorb, writes to stdout/stderr block and the container freezes on the logging write call. If PutLogEvents is failing, buffers are probably filling up and freezing containers. I assume most of AWS uses its own logging system internally, which could cause these large, intermittent failures.
If you're okay dropping logs, add something like this to the container logging definition:
"max-buffer-size": "25m"
"mode": "non-blocking"I just want to thank you for providing this info. This was exactly the cause of some of our issues and this config setting restored functionality to a major part of our app.
Happy it helped. If you have a very high-throughput app (or something that logs gigantic payloads), the "logging pauses" may slow down your app in non-obvious ways. Diagnosing it the very first time took forever (I think I straced the process in the Docker container and saw it was hanging on `write(1)`).
https://aws.amazon.com/blogs/containers/preventing-log-loss-...
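For anyone wiring this up, a minimal sketch of how those two options fit into an ECS task definition's logConfiguration with the awslogs driver (the log group, region, and stream prefix below are just placeholder values):

    "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": "/ecs/my-app",
            "awslogs-region": "us-east-1",
            "awslogs-stream-prefix": "app",
            "mode": "non-blocking",
            "max-buffer-size": "25m"
        }
    }

With non-blocking mode, writes to stdout/stderr return immediately and Docker drops log lines once the in-memory buffer fills, so the trade-off is losing some logs during an outage like this one instead of having the container freeze.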
It seems to have cascaded from AWS Kinesis...
[03:59 PM PDT] We can confirm increased error rates and latencies for Kinesis APIs within the US-EAST-1 Region. We have identified the root cause and are actively working to resolve the issue. As a result of this issue, other services, such as CloudWatch, are also experiencing increased error rates and delayed CloudWatch log delivery. We will continue to keep you updated as we make progress in resolving the issue.
39 affected services listed:
AWS Application Migration Service
AWS Cloud9
AWS CloudShell
AWS CloudTrail
AWS CodeBuild
AWS DataSync
AWS Elemental
AWS Glue
AWS IAM Identity Center
AWS Identity and Access Management
AWS IoT Analytics
AWS IoT Device Defender
AWS IoT Device Management
AWS IoT Events
AWS IoT SiteWise
AWS IoT TwinMaker
AWS License Manager
AWS Organizations
AWS Step Functions
AWS Transfer Family
Amazon API Gateway
Amazon AppStream 2.0
Amazon CloudSearch
Amazon CloudWatch
Amazon Connect
Amazon EMR Serverless
Amazon Elastic Container Service
Amazon Kinesis Analytics
Amazon Kinesis Data Streams
Amazon Kinesis Firehose
Amazon Location Service
Amazon Managed Grafana
Amazon Managed Service for Prometheus
Amazon Managed Workflows for Apache Airflow
Amazon OpenSearch Service
Amazon Redshift
Amazon Simple Queue Service
Amazon Simple Storage Service
Amazon WorkSpaces
44 services are showing as affected now, and AWS IoT Analytics, AWS IoT TwinMaker, and Amazon Elastic MapReduce are showing as Resolved.
https://aws.amazon.com/kinesis/
> Amazon Kinesis Data Streams is a serverless streaming data service that simplifies the capture, processing, and storage of data streams at any scale.
I'd never heard of that one.
This is a bigger deal than 'degraded' implies. SQS has basically ground to a halt for reads, which is leading to massive slowdowns where I am, and the logging issues are causing task timeouts.
The us-east-1 curse strikes again! Elastic Container Service is down for us completely.
This is just starting to effect us, looks like SQS is the biggest loser right now.
affect* :)
Our accounting system, Xero, is down, with a reference to AWS on their status page. Related to this, I assume.
Though it is not listed in the 33 affected services, we are seeing an issue communicating with S3 via a Storage Gateway.
Managed CloudFormation StackSets aren’t showing up for me. I assume this is related to Organizations.