The occasional `ECONNRESET` (part 1/2)

2026-05-05

Two services running on the same machine. One of them opens a listening TCP socket bound to localhost, the other one connects to that. They exchange data. Every now and then, the service that initiated the connection gets an ECONNRESET while reading data from the socket -- but no other errors show up in the logs, no crashes, nothing. What's going on?

Go on to part 2.

A reproducer in the "lab"

Let's start with the "server", i.e. the service that opens the listening socket.

The following program does just that: it creates a new TCP socket, waits for connections, and forks a new process for each request. There's not much of a "request", though: the server simply dumps 600'000 `x` bytes to the client upon connection.

The number 600'000 is not arbitrary: it needs to be large enough to trigger the behavior that I want to show. 600 bytes, for example, probably won't work.

server.c

Now on to the client: It connects to port 8125 on localhost where our server is waiting, and then it calls recv() until EOF or error. We'll get to the --spam flag in a second.

client.c

Also:

Makefile

Let's run the two programs:

[terminal1]$ ./server

[terminal2]$ ./client
Read 600000 bytes, final return value was 0, errno was 0

Nothing spectacular.

But let's use the --spam flag:

$ ./client --spam
Read 600000 bytes, final return value was -1, errno was 104
Connection reset by peer

$ ./client --spam
Read 256000 bytes, final return value was -1, errno was 104
Connection reset by peer

$ ./client --spam
Read 351232 bytes, final return value was -1, errno was 104
Connection reset by peer

$ ./client --spam
Read 351232 bytes, final return value was -1, errno was 104
Connection reset by peer

$ ./client --spam
Read 351232 bytes, final return value was -1, errno was 104
Connection reset by peer

$ ./client --spam
Read 256000 bytes, final return value was -1, errno was 104
Connection reset by peer

$ ./client --spam
Read 600000 bytes, final return value was -1, errno was 104
Connection reset by peer

--spam causes the client to first send some data to the server before it tries to receive data. And apparently this causes the connection to break at some point: The client's recv() sees a return value of -1 and errno gets set to 104 = Connection reset by peer.

What tcpdump sees

First, what's "on the wire"?

Okay. So there actually is a TCP RST. It could also have been a programming error or a misinterpretation on my part.

What `strace ./server` sees

The RST originates from the server, so let's attach strace and see what we get:

19:59:03.420432 accept(3, NULL, NULL)   = 4
19:59:05.652715 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe43484fa10) = 239546
[pid 239546] 19:59:05.652831 ...
[pid 239546] 19:59:05.652959 mmap(NULL, 602112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid 239546] 19:59:05.652980 <... mmap resumed>) = 0x7fe43456d000
[pid 239546] 19:59:05.653235 sendto(4, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 600000, 0, NULL, 0) = 600000
[pid 239546] 19:59:05.653474 close(4)   = 0
[pid 239546] 19:59:05.653553 exit_group(0) = ?
[pid 239546] 19:59:05.653667 +++ exited with 0 +++

No crash. It forked and used sendto() to dump all the data to the client. Then it quit.

Also note that sendto() returned the full 600000, so from the perspective of this program, "all data got sent" (there's a footnote, obviously, as the manpage explains: "Successful completion of a call to sendto() does not guarantee delivery of the message. A return value of -1 indicates only locally-detected errors.").

In fact, the server's trace looks the same whether or not the client uses --spam.

What `strace ./client --spam` sees

19:59:05.652518 connect(3, {sa_family=AF_INET, sin_port=htons(8125), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
19:59:05.652649 sendto(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100, 0, NULL, 0) = 100
19:59:05.652805 recvfrom(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096, 0, NULL, NULL) = 4096
19:59:05.653382 recvfrom(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096, 0, NULL, NULL) = 4096
...
19:59:05.654440 recvfrom(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096, 0, NULL, NULL) = 4096
19:59:05.654473 recvfrom(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096, 0, NULL, NULL) = 1024
19:59:05.654506 recvfrom(3, 0x55d23c5b5010, 4096, 0, NULL, NULL) = -1 ECONNRESET (Connection reset by peer)
19:59:05.654575 write(1, "Read 128000 bytes, final return "..., 60) = 60
19:59:05.654694 write(4, "Connection reset by peer\n", 25) = 25
19:59:05.654725 close(4)                = 0
19:59:05.654750 close(3)                = 0
19:59:05.654783 exit_group(0)           = ?
19:59:05.654864 +++ exited with 0 +++

Nothing out of the ordinary here, either. We ran recvfrom() until one of them returned with -1 / ECONNRESET.

A first hypothesis

Let's find out when exactly we get the ECONNRESET. If you scroll back up, some of the invocations read the full 600'000 bytes and some read less, so there's probably some timing issue here.

Let's do what is most obvious in hindsight: Delay the close() in the server. Because if anything legitimately creates a RST, it'll probably be that.

In serve_client(), simply add a sleep(1) before the call to close():

sleep(1);
if (close(s) == -1)
{
    perror("close");
    return 1;
}

And there you go, we can clearly see a one second delay now. A tcpdump shows it best:

So, without having dug deeper, here is some speculation:

The server sees the incoming data on its socket, but it never reads it.
When we call close() in the server, the socket is "dirty" (because of pending data), so a RST gets fired to (hopefully) tell the client that not all data has been read.

This makes sense to me at the moment, although I haven't found any definitive sources yet to confirm this behavior.

(Initially, one of the hypotheses was that some buffer was filling up. That's why there is a spamlen at all: it is set to 100 now and originally was 100'000. But the size doesn't matter, a single byte is enough.)

The real-life scenario

gunicorn serving a flask app was running behind nginx as a reverse proxy. Sometimes, nginx got these ECONNRESET from gunicorn. There were also cloud hosters involved, plus firewalls, lots of routers, parallel requests and whatnot, and I got derailed by ioctls and non-blocking sockets along the way, so boiling this down took a while.

Essentially, what nginx does is pass the HTTP request on to gunicorn, but it does so with two syscalls:

09:11:31.254489 writev(29, [{iov_base="POST /reveal/d48z/iha4A9MOMuLW40"..., iov_len=392}], 1) = 392
09:11:31.255435 writev(29, [{iov_base="compat=lynx+needs+this", iov_len=22}], 1) = 22

392 bytes HTTP headers, 22 bytes HTTP body, 414 bytes in total.

gunicorn then reads this data from the socket:

09:11:31.593968 recvfrom(6, "POST /reveal/gEJh/bIoAUdWrSV47mI"..., 8192, 0, NULL, NULL) = 414

However, every now and then, it only got the data from the first writev() call:

09:11:31.251229 recvfrom(6, "POST /reveal/d48z/iha4A9MOMuLW40"..., 8192, 0, NULL, NULL) = 392

gunicorn (and/or the application running inside of it) was happy to just get the headers, it didn't care about the body. I assume that this part of the software stack is "lazy": When there's nothing accessing the body in the application, it won't bother even recv()-ing it.

Problem is, gunicorn ends these transactions like this:

09:11:31.583979 sendto(6, "HTTP/1.1 200 OK\r\nServer: gunicor"..., 212, 0, NULL, 0) = 212
09:11:31.584225 sendto(6, "\312\205]"..., 614400, 0, NULL, 0) = 614400
09:11:31.590869 close(6)                = 0

Send off the data and close the socket. If there's data still pending to be read, this will cause a RST, I think.

The workaround was to have the Python app do some dummy operation on the HTTP body, to make sure that it got fully read from the socket. So far, I haven't seen any more ECONNRESET. (Depending on your application, this can open up the possibility of a DoS, though: If someone POSTs 10 GB of data to your server and you only have 1 GB of memory, then you might be in trouble. client_max_body_size in nginx can probably prevent that.)

Next steps

Verify that the close() call really is the cause of the TCP RST and find a good source for that.
- It might be because of RFC 1122:
  
  A host MAY implement a "half-duplex" TCP close sequence, so that an application that has called CLOSE cannot continue to read data from the connection. If such a host issues a CLOSE call while received data is still pending in TCP, or if new data is received after CLOSE is called, its TCP SHOULD send a RST to show that data was lost.
Figure out who's to "blame" on the Python side: gunicorn, flask, or the actual flask app. Report upstream.
- It might be gunicorn, where it has already been reported.

To be continued ...