On August 21, 2025, PagerDuty alerted me that my Node server was taking way too long to respond.
What made this event stressful was that when I tailed the server logs, I saw an error I had never encountered before. My devops knowledge is shaky to begin with. Meanwhile, users were suffering and messages were piling up in HelpScout (this unfortunately occurred during peak hours).
Even though the issue turned out to be Cloudflare's fault, which I only discovered a few days afterwards, I learned a lot from this experience. It also emboldened me to take a hard look at my Node server code and, after much research, implement simple measures to make it more resilient in production. I describe these measures at the end of this post.
Troubleshooting
I host my Node.js server on AWS Elastic Beanstalk. I SSHed into it, and here's what I saw in the logs:
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: [21/08/2025 18:57:20.848] [LOG] AggregateError [ETIMEDOUT]:
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: at internalConnectMultiple (node:net:1128:18)
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: at internalConnectMultiple (node:net:1196:5)
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: at Timeout.internalConnectMultipleTimeout (node:net:1720:5)
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: at listOnTimeout (node:internal/timers:596:11)
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: at process.processTimers (node:internal/timers:529:7) {
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: code: 'ETIMEDOUT',
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: [errors]: [
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: Error: connect ETIMEDOUT 172.66.43.194:443
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: at createConnectionError (node:net:1656:14)
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: at Timeout.internalConnectMultipleTimeout (node:net:1715:38)
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: at listOnTimeout (node:internal/timers:596:11)
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: at process.processTimers (node:internal/timers:529:7) {
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: errno: -110,
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: code: 'ETIMEDOUT',
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: syscall: 'connect',
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: address: '172.66.43.194',
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: port: 443
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: },
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: Error: connect ENETUNREACH 2606:4700:3108::ac42:2bc2:443 - Local (:::0)
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: at internalConnectMultiple (node:net:1192:16)
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: at Timeout.internalConnectMultipleTimeout (node:net:1720:5)
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: at listOnTimeout (node:internal/timers:596:11)
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: at process.processTimers (node:internal/timers:529:7) {
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: errno: -101,
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: code: 'ENETUNREACH',
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: syscall: 'connect',
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: address: '2606:4700:3108::ac42:2bc2',
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: port: 443
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: }
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: ]
My first action was to restart the server and pray that it would work. Unfortunately, it didn't.
My Node server makes HTTP calls to a backend API, so my initial suspicion was that those API requests were timing out.
But according to New Relic, my API servers were in good health. I SSHed into one of them and tailed the Nginx logs. Nothing unusual there.
My API servers are fronted by an AWS Network Load Balancer. Could the LB be giving me trouble? Maybe requests were not reaching my API servers in the first place.
CloudWatch metrics for the LB looked normal. Nevertheless, the LB was to blame for an outage many months back that still haunts me, so even though I couldn't find supporting evidence, I clung to the belief that the LB was the issue this time too.
In retrospect, I wasted too many precious minutes going down this road before I thought to do a WHOIS lookup on the IP addresses in my Node logs.
I discovered that all these IPs were owned by Cloudflare.
Of course! Up until then, I had ignored Cloudflare completely. I proxy requests to my LB through Cloudflare, so it made sense that my Node server was connecting to Cloudflare, not directly to my LB.
At the time, I also didn't realize that an ETIMEDOUT meant my Node server couldn't establish a connection to Cloudflare in the first place. Instead, I assumed ETIMEDOUT meant the HTTP request was timing out due to some issue upstream. And since Cloudflare is seldom the culprit, the upstream issue would have to be the LB or something beyond it.
I blame my weak networking fundamentals for this oversight. Had I realized ETIMEDOUT referred to the connection, not the request, I would've tried turning off Cloudflare proxying. But I didn't. Instead, I kept troubleshooting.
I logged into Cloudflare to check whether my Node server's IP address was being blocked by its WAF. It wasn't. And even if it had been, Cloudflare would have returned a 403, not an ETIMEDOUT.
Next, I thought it might be a DNS issue. I changed the CNAME record pointing to the LB hostname to an A record pointing directly to a static IP address for one of my LB’s availability zones.
I was about 2-3 hours into the goose chase by then and was preparing myself for the very real possibility that I might not sleep that night. But to my relief, the site started coming back to normal. At the time, I believed that updating the DNS record had done the trick, without knowing why. Now I know it was a red herring that just happened to coincide with the resolution of the Cloudflare incident.
In hindsight, I should've just turned off Cloudflare proxying to my backend API.
Learnings
As Nietzsche would say, “What doesn’t kill you makes you stronger.”
This experience, while stressful, taught me a lot about networking fundamentals and forced me to take a hard look at my Node server code.
Why 504 Gateway Timeout?
The first mystery: why was my server hanging for 60 seconds and returning a 504 Gateway Timeout when the underlying failure was a connect timeout? The connect attempt gives up in under a second, so I should have gotten a response back in under a second too.
My Node server is fronted by an Nginx reverse proxy. Nginx was logging a 504 after 60.005 seconds (Nginx's default upstream timeout is 60 seconds):
172.68.245.213 - - [21/Aug/2025:18:57:32 +0000] "GET / HTTP/1.1" 504 562 "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Mobile Safari/537.36" "92.190.14.36" 60.005 60.005 .
To figure this out, I first had to reproduce the ETIMEDOUT on my staging server. I stumbled upon this GitHub issue, which made it easy.
app.get('/error/timeout', async () => {
  // Connecting to this host never succeeds, reproducing the ETIMEDOUT
  await fetch("http://ds-test.r-1.ch:8080/");
});
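If you'd rather not depend on that third-party host, connecting to an address that nothing answers at usually produces the same error, just more slowly. A sketch, with the caveats that 10.255.255.1 and the route name are my assumptions and that the OS-level connect timeout means this variant can take a minute or more to fail:
app.get('/error/timeout-local', async () => {
  // Assumed-unassigned private address: the TCP handshake never completes,
  // so the connection attempt eventually fails with ETIMEDOUT
  await fetch("http://10.255.255.1/");
});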
With the /error/timeout route, I was able to replicate the ETIMEDOUT in my Node logs and the 504 Gateway Timeout in Nginx. I soon discovered the reason: Express does NOT handle async errors out of the box (Express 5 changes this). Unlike sync errors, async errors must be explicitly handed to Express by catching them and calling next(err), which I wasn't doing. Instead of adding try / catch blocks everywhere, I decided to solve this with the express-async-errors package and a single line of code:
const express = require('express');
require('express-async-errors'); // patches Express so rejected promises reach the error handlers

const app = express();
It worked. Nginx no longer waited 60 seconds to respond to an upstream ETIMEDOUT; it returned immediately with a 500 Internal Server Error.
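Express's default error handler responds with a plain 500. If you want upstream network failures to surface as something more deliberate, you can register your own error-handling middleware after the routes. A minimal sketch, assuming you're fine mapping these errors to a 502 with a JSON body (both choices are mine, not part of my actual setup):
app.use((err, req, res, next) => {
  // express-async-errors ensures rejected promises from async handlers land here
  const code = err.code ?? err.cause?.code; // fetch wraps the cause; axios sets err.code directly
  if (['ETIMEDOUT', 'ENETUNREACH', 'ECONNREFUSED', 'ECONNRESET'].includes(code)) {
    // The backend (or the proxy in front of it) was unreachable
    return res.status(502).json({ error: 'Upstream service unavailable' });
  }
  next(err); // fall through to Express's default error handler
});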
Don’t Connect To IPv6
You may have noticed that after the ETIMEDOUT error, Node throws an ENETUNREACH error:
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: Error: connect ENETUNREACH 2606:4700:3108::ac42:2bc2:443 - Local (:::0)
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: at internalConnectMultiple (node:net:1192:16)
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: at Timeout.internalConnectMultipleTimeout (node:net:1720:5)
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: at listOnTimeout (node:internal/timers:596:11)
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: at process.processTimers (node:internal/timers:529:7) {
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: errno: -101,
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: code: 'ENETUNREACH',
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: syscall: 'connect',
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: address: '2606:4700:3108::ac42:2bc2',
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: port: 443
Aug 21 18:57:20 ip-172-30-1-243 web[3627024]: }
This got me curious. Node implements the Happy Eyeballs algorithm when connecting to hosts: if the connection to the IPv4 address times out, it falls back to the next address in the list, in this case the IPv6 address 2606:4700:3108::ac42:2bc2. But my Node server is hosted on an old Linux distribution with no IPv6 connectivity:
[ec2-user@ip-172-30-1-169 ~]$ ping6 ipv6.google.com
ping6: connect: Network is unreachable
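Node tries both families because the hostname resolves to both A and AAAA records. You can see the candidate list it works through with a quick lookup. A sketch, where api.example.com stands in for my actual backend hostname:
const { lookup } = require('node:dns/promises');

// Returns every address for the host, e.g.
// [ { address: '172.66.43.194', family: 4 }, { address: '2606:4700:3108::ac42:2bc2', family: 6 } ]
lookup('api.example.com', { all: true }).then((addresses) => console.log(addresses));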
How do I instruct Node’s HTTP agent to only connect to IPv4 addresses? Turns out it’s pretty easy.
const axios = require('axios');
const https = require('https');
const http = require('http');

const httpsAgent = new https.Agent({
  family: 4 // always connect to IPv4 addresses
});
const httpAgent = new http.Agent({
  family: 4
});

axios.defaults.httpsAgent = httpsAgent;
axios.defaults.httpAgent = httpAgent;
With this change, going to /error/timeout still triggers the ETIMEDOUT, but Node no longer attempts to connect to the IPv6 address.
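If you use a dedicated axios instance rather than the global defaults, the same agents can be passed at creation time. A sketch; the instance name, baseURL, and /health path are placeholders:
// Hypothetical dedicated client for the backend API
const apiClient = axios.create({
  baseURL: 'https://api.example.com',
  httpAgent,  // IPv4-only agents defined above
  httpsAgent,
});

// Requests made through apiClient now connect over IPv4 only
apiClient.get('/health').then((res) => console.log(res.status));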
Increase Node Connect Timeout
Node's family-autoselection algorithm gives each connection attempt 250 milliseconds by default, and sometimes that is not enough time to establish a connection, especially if Cloudflare's servers are under stress. I bumped it up to 500 ms. Since my Node server is hosted on Elastic Beanstalk, the best way to do this was by updating env.config:
option_settings:
  ...
  - option_name: NODE_OPTIONS
    value: "--network-family-autoselection-attempt-timeout=500"
Axios Retries
I use the axios library to make HTTP calls from Node. I discovered that axios does not automatically retry on low-level networking errors like ETIMEDOUT. To fix this, I installed the axios-retry package. Using it was easy:
import axios from 'axios';
import axiosRetry from 'axios-retry';

// Retry failed requests (including network errors like ETIMEDOUT) up to 3 times
axiosRetry(axios, { retries: 3 });
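By default axios-retry retries immediately. Adding a backoff between attempts is a small tweak. A sketch using the package's exponentialDelay helper, which I haven't tuned for production:
axiosRetry(axios, {
  retries: 3,
  // wait progressively longer between attempts instead of retrying immediately
  retryDelay: axiosRetry.exponentialDelay,
});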
Set Axios HTTP Request Timeout
So far we've been talking about connect timeouts; HTTP hasn't entered the picture. But while researching the connect timeout, I discovered that axios's default HTTP request timeout is 0. So by default, once a connection is established, axios will wait indefinitely for an HTTP response. For production, this is not ideal. I fixed it by setting the axios default to 5 seconds.
axios.defaults.timeout = 5000; // HTTP Request timeout of 5s
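The default can still be overridden per request for endpoints that are known to be slow. A sketch; the URL and the 15-second budget are just illustrations:
// A slow report endpoint gets a more generous budget than the 5s default
axios
  .get('https://api.example.com/reports/monthly', { timeout: 15000 })
  .then((res) => console.log(res.status))
  .catch((err) => console.error(err.code)); // typically ECONNABORTED when the timeout fires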