How We Improved Reliability of our WebSocket Connections
making.close.com
> The (sadly all too) common approach to rarely occurring bugs & edge cases: Pretend like the problem doesn't exist. Blame it on faulty networking, solar flares, etc.
How can you tell that someone hasn't worked with a piece of software in production yet? They've never blamed a bug on cosmic radiation :D
Remember to always blame the user… or the universe!
Relevant: https://www.youtube-nocookie.com/embed/lKXe3HUG2l4, The mess we are in (2014), Joe Armstrong
I think socket.io handles this client keep alive automatically.
https://socket.io/docs/v4/how-it-works/
See the "Disconnection detection" section.
Thanks for pointing this out!
https://socket.io/docs/v4/how-it-works/#disconnection-detect... says:
> At a given interval (the `pingInterval` value sent in the handshake) the server sends a PING packet and the client has a few seconds (the `pingTimeout` value) to send a PONG packet back. If the server does not receive a PONG packet back, it will consider that the connection is closed.
So far this describes what we'd already been doing prior to the fix mentioned in the blog post. However, this next sentence is where Socket.io's solution diverges from ours:
> Conversely, if the client does not receive a PING packet within `pingInterval + pingTimeout`, it will consider that the connection is closed.
Indeed looks like a solid way to solve the client-side recognition of a broken connection!
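To make the quoted rule concrete, here is a minimal sketch of the client-side check. It is not Socket.io's or engine.io's actual code: `HeartbeatWatchdog` and its method names are hypothetical, and the caller is assumed to pass in timestamps in milliseconds (e.g. from `Date.now()`).

```typescript
// Hypothetical helper illustrating the documented rule; not the engine.io implementation.
class HeartbeatWatchdog {
  private lastPingAt: number;
  private readonly pingIntervalMs: number;
  private readonly pingTimeoutMs: number;

  constructor(pingIntervalMs: number, pingTimeoutMs: number, now: number) {
    this.pingIntervalMs = pingIntervalMs;
    this.pingTimeoutMs = pingTimeoutMs;
    this.lastPingAt = now; // treat the handshake time as the first "ping"
  }

  // Call whenever a PING packet arrives from the server.
  onPing(now: number): void {
    this.lastPingAt = now;
  }

  // True once no PING has arrived within pingInterval + pingTimeout.
  isExpired(now: number): boolean {
    return now - this.lastPingAt > this.pingIntervalMs + this.pingTimeoutMs;
  }
}
```

A client would call `isExpired(Date.now())` from a timer and, once it returns `true`, close the socket and start reconnecting.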
--
That said, I'm a little confused because I cannot find any mention of `pingTimeout` in their JS client [0], and `pingInterval` is only mentioned in an implementation of a test server [1]. I wonder if I'm looking at the wrong thing.
[0]: https://github.com/socketio/socket.io-client/search?q=pingti...
[1]: https://github.com/socketio/socket.io-client/search?q=pingin...
Socket.io client depends on engine.io. That's probably why there isn't much in the socket.io-client: https://github.com/socketio/engine.io/search?q=pingInterval
Ah, yep, that explains it.
They do solve this problem as documented: https://github.com/socketio/engine.io/blob/64d57545116c7a7d9...
Yes. Same with the primus websocket library.
Indeed it does! https://github.com/primus/primus/blob/a7ba7249cb0205a01629da...
I do wish we didn't all have to reinvent this wheel though…
Good to know that WebSocket API is broken by design. Thanks W3C!
This is what confused me. The discussion made sense to leave it up to the browser to implement, but I can't understand why they didn't require it in the browser WebSocket implementation — they even suggested it and then forgot about it.
Yes! I was both surprised and confused when I saw this. Unless I'm missing something, it means that every application implementing WebSockets has to reinvent the wheel, creating its own client-side ping/pong handler using Data Frames, since browsers neither send Control Frames automatically nor expose an API for sending or acting on them.
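For illustration, a ping/pong handler built on Data Frames could look roughly like the sketch below. The JSON message shape (`{"type":"ping"}` / `{"type":"pong"}`), the `TextSender` interface, and the `handleHeartbeat` name are all assumptions made up for this example; a browser `WebSocket` would satisfy `TextSender` through its `send` method.

```typescript
// Minimal shape of anything that can send a text frame (assumed for this sketch).
interface TextSender {
  send(data: string): void;
}

// Answers application-level pings sent as ordinary text messages.
// Returns true when the message was a heartbeat and has been handled.
function handleHeartbeat(socket: TextSender, raw: string): boolean {
  let msg: unknown;
  try {
    msg = JSON.parse(raw);
  } catch {
    return false; // not JSON, so not one of our heartbeat messages
  }
  if (
    typeof msg === "object" &&
    msg !== null &&
    (msg as { type?: unknown }).type === "ping"
  ) {
    socket.send(JSON.stringify({ type: "pong" }));
    return true;
  }
  return false;
}
```

In a browser this would be called from the `message` event handler with `event.data`, before any regular message handling.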
A classic case of a TCP half-open connection. The client/browser side thinks that the WebSocket/TCP connection is still alive. This happens because the client is not actively sending any outbound data, which would eventually have helped to reset the connection. It would be nice if the browser side of the WebSocket connection could also start a PING/PONG mechanism.
100% agree!
Interesting read, thanks. I've delved into websockets and hit some interesting issues. I don't think I've had this scenario - that I know of - but this is good to know.
> You need to prove that what you think your code does is truly what happens.
Such a good insight -- seems obvious, but too often the source of gotchas, bad data, and bad user experience.
This is a practical approach when working with WebSockets. When the server gets an error or times out waiting for the client's pong, it closes the connection. Likewise, when the client sends a “health check” message (with whatever payload you choose) and receives no response, it closes the connection and reconnects.
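The client half of that could be sketched as a small state machine like the one below. `HealthCheck`, the `Action` values, and the interval/timeout parameters are hypothetical names chosen for this example; the caller is assumed to drive `tick` from a timer and to perform the actual send, close, and reconnect.

```typescript
type Action = "send-check" | "reconnect" | "wait";

// Hypothetical client-side health check: send a probe every intervalMs,
// and ask for a reconnect if no reply arrives within timeoutMs.
class HealthCheck {
  private awaitingSince: number | null = null;
  private lastCheckAt: number;
  private readonly intervalMs: number;
  private readonly timeoutMs: number;

  constructor(intervalMs: number, timeoutMs: number, now: number) {
    this.intervalMs = intervalMs;
    this.timeoutMs = timeoutMs;
    this.lastCheckAt = now;
  }

  // Call when the server answers a health check message.
  onResponse(): void {
    this.awaitingSince = null;
  }

  // Call periodically; the return value tells the caller what to do next.
  tick(now: number): Action {
    if (this.awaitingSince !== null) {
      if (now - this.awaitingSince >= this.timeoutMs) {
        this.awaitingSince = null;
        this.lastCheckAt = now;
        return "reconnect"; // no reply in time: close and reconnect
      }
      return "wait";
    }
    if (now - this.lastCheckAt >= this.intervalMs) {
      this.awaitingSince = now;
      this.lastCheckAt = now;
      return "send-check";
    }
    return "wait";
  }
}
```

Keeping the timing logic separate from the socket makes it easy to test without a real connection.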
This is why so many crypto exchanges send pings and pongs periodically as regular requests rather than as control frames.
It’s application layer keepalive.