TCP, postcards into a hurricane, made reliable.
IP delivers postcards into a hurricane. TCP, somehow, makes them feel like a phone call — ordered, reliable, and exactly the bytes you sent. The trick is bookkeeping.
Parts 01 – 08 · Interactive: Full lifecycle · Prereq: IP / sockets
What does TCP guarantee?
Three guarantees, layered on chaos.
TCP (Transmission Control Protocol) provides three guarantees on top of the unreliable IP layer: reliable, ordered, byte-stream delivery. Vint Cerf and Bob Kahn specified it in 1974; RFC 793 standardised it in 1981, refreshed as RFC 9293 in 2022. TCP underlies HTTP/1, HTTP/2, SSH, SMTP, FTP, and most other application protocols.
IP delivers packets like postcards thrown into a hurricane: they may arrive duplicated, out of order, or not at all. TCP layers a reliable, ordered, byte-stream abstraction directly on top. Three guarantees, all of them implemented in one of the most carefully evolved pieces of infrastructure on the planet.
01
Reliable
Every byte you send is acknowledged or retransmitted. If a packet drops at a router five hops in, TCP notices the gap and resends — without the application ever knowing.
02
Ordered
Bytes arrive in exactly the sequence they were sent. The receiver buffers out-of-order segments and waits — the OS will not deliver byte 1000 to the application until byte 1 has been delivered.
03
Byte-stream
There are no messages in TCP. The sender writes 100 bytes; the receiver may read 80 then 20, or 100 in one shot. Application protocols layered on TCP must add their own framing — which is exactly what HTTP, TLS, and Redis all do.
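Because the stream has no message boundaries, an application has to mark them itself. Here is a minimal sketch of one common approach, a 4-byte length prefix. The names `encode_frame` and `FrameDecoder` are illustrative only, and this is not the wire format of HTTP, TLS, or Redis.

```python
import struct


def encode_frame(payload: bytes) -> bytes:
    """Prefix the payload with its length as a 4-byte big-endian integer."""
    return struct.pack("!I", len(payload)) + payload


class FrameDecoder:
    """Accumulates arbitrary chunks from the stream and returns whole frames."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, chunk: bytes) -> list:
        """Add a chunk (of any size) and return every frame now complete."""
        self._buf.extend(chunk)
        frames = []
        while len(self._buf) >= 4:
            (length,) = struct.unpack("!I", self._buf[:4])
            if len(self._buf) < 4 + length:
                break  # the rest of this frame has not arrived yet
            frames.append(bytes(self._buf[4:4 + length]))
            del self._buf[:4 + length]
        return frames
```

The decoder returns the same frames however the bytes are split across reads, which is the property an application protocol on TCP needs.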
TCP three-way handshake
A connection isn't open until both sides know two things: their own initial sequence number, and the peer's. With two packets you can synchronize one direction. To synchronize both directions and prove the round trip works, you need three.
The Initial Sequence Numbers are not zero, and they shouldn't be. If a router delays a packet for two minutes and finally delivers it, a randomly chosen ISN makes it overwhelmingly likely that the stale packet's sequence number falls outside the new connection's window, so the new connection rejects it. Without unpredictable ISNs, a long-lost packet from a closed connection could inject stray bytes into your live database query. The number of bugs this avoids is measured in decades.
SYN · SYN-ACK · ACK
Three packets, one round trip.
Client SYN announces its ISN. Server SYN-ACK announces its own ISN and acknowledges the client's. Client ACK confirms. After this third segment, both sides are in ESTABLISHED; data may flow.
Options ride along
MSS · SACK · window scaling.
The handshake also negotiates options: maximum segment size (each side announces the largest segment it is willing to receive, which keeps segments within its local MTU), selective acknowledgement (so a single dropped packet doesn't stall the whole window), and window scaling (so the 16-bit window field can advertise up to 1 GiB).
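The sequence-number arithmetic of the handshake can be sketched in a few lines. This is a toy model, not a network stack: `Segment` and `three_way_handshake` are hypothetical names, and `secrets.randbits` stands in for the keyed-hash ISN generators that real kernels use.

```python
import secrets
from dataclasses import dataclass

SEQ_MOD = 2 ** 32  # sequence numbers are 32-bit and wrap around


@dataclass
class Segment:
    flags: str
    seq: int
    ack: int = 0  # only meaningful when the ACK flag is set


def random_isn() -> int:
    """Stand-in ISN generator (assumption: real stacks use a keyed hash)."""
    return secrets.randbits(32)


def three_way_handshake(client_isn: int, server_isn: int) -> list:
    """Return the three segments exchanged; a SYN consumes one sequence number."""
    syn = Segment("SYN", seq=client_isn)
    syn_ack = Segment("SYN-ACK", seq=server_isn, ack=(syn.seq + 1) % SEQ_MOD)
    ack = Segment("ACK", seq=(client_isn + 1) % SEQ_MOD,
                  ack=(syn_ack.seq + 1) % SEQ_MOD)
    return [syn, syn_ack, ack]
```

Each acknowledgement is the peer's ISN plus one, which is how both sides prove they heard the other's starting point.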
TCP connection lifecycle — open, talk, close
Open, talk, close.
A connection's entire lifecycle has three phases: the three-way handshake, the data exchange (here a single request and response), and the four-way teardown.
The client starts with a SYN — synchronize. It picks a random Initial Sequence Number (ISN) so that any stale packet from a long-dead connection cannot be mistaken for live data. The window field offers receive-buffer space; options negotiate MSS, selective ACK, and window scaling.
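To make the wire format concrete, here is a sketch that packs the fixed 20-byte TCP header of a SYN. It is simplified on purpose: the checksum is left at zero and no options are attached, whereas a real SYN carries MSS, SACK-permitted, and window-scale options. `build_syn_header` is a hypothetical helper name.

```python
import struct

SYN_FLAG = 0x02          # bit position of SYN in the flags field
HEADER_WORDS = 5         # header length in 32-bit words when there are no options


def build_syn_header(src_port: int, dst_port: int, isn: int,
                     window: int = 65535) -> bytes:
    """Pack a bare SYN header: ports, seq, ack, offset+flags, window, checksum, urgent."""
    offset_and_flags = (HEADER_WORDS << 12) | SYN_FLAG
    return struct.pack(
        "!HHIIHHHH",
        src_port,
        dst_port,
        isn,               # sequence number: the client's ISN
        0,                 # acknowledgement number: unused, ACK flag is clear
        offset_and_flags,
        window,            # receive-buffer space offered
        0,                 # checksum: not computed in this sketch
        0,                 # urgent pointer
    )
```

Unpacking the result with the same format string recovers the fields, which is a quick way to read a captured header by hand.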
TCP sequence numbers and cumulative ACKs
Sequence numbers, cumulative ACKs.
Every byte gets a number. The sequence number in a TCP header is the number of the first byte of the payload. The acknowledgement number is the next byte the receiver expects.
The ACK is cumulative — ack=1018 means "every byte up to and including 1017 has arrived." If the receiver got bytes 1–100 then bytes 201–300, it does not ack 300. It keeps acking 101 — repeatedly, urgently — screaming at the sender that there's a hole. When the sender sees three duplicate ACKs in a row, it doesn't wait for its retransmit timer; it triggers fast retransmit and resends the missing segment immediately.
This is why packet loss on a TCP stream shows up as a stall, not as corruption. TCP is, beneath the wire, an extremely loud and persistent clerk.
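The cumulative ACK rule and the three-duplicate trigger can both be sketched as small functions. Byte ranges are inclusive `(first, last)` pairs; the function names are illustrative, and real stacks track this state incrementally rather than rescanning a list.

```python
from typing import Optional


def cumulative_ack(next_expected: int, received: list) -> int:
    """Return the next byte the receiver expects, stopping at the first hole."""
    for first, last in sorted(received):
        if first > next_expected:
            break  # a hole: nothing past it can be acknowledged cumulatively
        next_expected = max(next_expected, last + 1)
    return next_expected


def fast_retransmit_point(acks: list, dup_threshold: int = 3) -> Optional[int]:
    """Return the ack number whose third duplicate triggers fast retransmit."""
    last_ack, duplicates = None, 0
    for ack in acks:
        if ack == last_ack:
            duplicates += 1
            if duplicates == dup_threshold:
                return ack
        else:
            last_ack, duplicates = ack, 0
    return None
```

With bytes 1–100 and 201–300 received, `cumulative_ack(1, [(1, 100), (201, 300)])` stays at 101 until the hole is filled.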
Selective ACK
Pure cumulative ACK has a downside: if bytes 200–299 are missing but 300–999 arrived fine, the sender doesn't know the rest of the window is intact and may retransmit too much. SACK (RFC 2018) lets the receiver report the non-contiguous ranges it already holds: "I'm still waiting for 200, but I have 300–999." The option is negotiated in the SYN and is on by default in every modern stack.
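A rough sketch of how a receiver could derive SACK blocks from what it holds. It uses inclusive `(first, last)` ranges for readability; on the wire, RFC 2018 encodes each block as a left edge and an exclusive right edge, and only a few blocks fit in the options space. `sack_blocks` is a hypothetical name.

```python
def sack_blocks(ack: int, received: list) -> list:
    """Merge the received ranges lying beyond the cumulative ACK into blocks.

    Assumes `ack` is already the correct cumulative ACK, so every range
    that ends before it has been acknowledged and is skipped.
    """
    blocks = []
    for first, last in sorted(received):
        if last < ack:
            continue  # already covered by the cumulative ACK
        if blocks and first <= blocks[-1][1] + 1:
            # Adjacent or overlapping: extend the previous block.
            blocks[-1] = (blocks[-1][0], max(blocks[-1][1], last))
        else:
            blocks.append((first, last))
    return blocks
```

For the example above, a receiver at ack=200 holding 300–999 reports the single block `(300, 999)`, and the sender retransmits only 200–299.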
TCP sliding window — flow control
Sliding window, flow control.
A fast sender talking to a slow receiver is a recipe for a buffer overflow — and TCP must never lose bytes that the receiver's OS already accepted. The defence is the receive window: a 16-bit field in every TCP header advertising "this much buffer space is free, right now."
A mobile phone whose CPU is briefly pinned can advertise a window of zero. The server stops sending mid-flight, no matter how big the file. As the phone drains its buffer, it advertises a new, larger window — possibly via a small "window update" segment — and transmission resumes. This is the difference between flow control and congestion control: flow control protects the receiver; congestion control protects the network. They are two completely independent feedback loops running at the same time.
The 16-bit window field maxes out at 65 535 bytes, which is small for fat-pipe links. The window scaling option, negotiated at SYN time, lets each side multiply the field by 2^n, where n is a shift count of at most 14, so windows of up to 1 GiB can be advertised.
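The scaling arithmetic is a single left shift. A small sketch follows, assuming the shift limit of 14 from RFC 7323; `effective_window` is an illustrative name.

```python
MAX_WINDOW_FIELD = 0xFFFF  # largest value the 16-bit header field can hold
MAX_SHIFT = 14             # largest window-scale shift count (RFC 7323)


def effective_window(window_field: int, shift: int) -> int:
    """Return the real window in bytes for a header field and a negotiated shift."""
    if not 0 <= window_field <= MAX_WINDOW_FIELD:
        raise ValueError("window field must fit in 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return window_field << shift
```

With no scaling the ceiling is 65 535 bytes; with the maximum shift it is 65 535 × 2^14 = 1 073 725 440 bytes, just under 1 GiB.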
TCP congestion control — slow start, AIMD, BBR
Slow start, AIMD, BBR.
Above flow control sits congestion control: the sender's private model of "how fast dare I transmit before a router somewhere drops a packet?" The model is the congestion window (cwnd). The sender may have at most min(cwnd, rwnd) bytes in flight at any moment.
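The interaction of the two limits reduces to one line of arithmetic. A minimal sketch, with all quantities in bytes and `sendable_bytes` as a hypothetical name:

```python
def sendable_bytes(cwnd: int, rwnd: int, in_flight: int) -> int:
    """Bytes the sender may still transmit: the tighter limit minus what is unacknowledged."""
    return max(0, min(cwnd, rwnd) - in_flight)
```

A zero receive window therefore stops the sender outright, however large cwnd is: `sendable_bytes(100_000, 0, 0)` is 0.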
- 01
Slow start
A new connection starts with a cwnd of about ten segments. Each ACK adds one segment to the window, so cwnd doubles every round trip (exponential growth) until either a packet drops or cwnd reaches the slow-start threshold (ssthresh). Despite the name, this is the fastest phase.
- 02
Congestion avoidance
Past ssthresh, growth becomes additive — cwnd increments by one segment per RTT. Every successful round trip earns a single segment of extra throughput. AIMD: Additive Increase.
- 03
Multiplicative decrease
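When loss is detected, cwnd is cut. Three duplicate ACKs typically halve it, after which the sender resumes additive growth; a retransmission timeout is treated as more serious and drops cwnd back to about one segment, restarting slow start. AIMD's two halves give it its name: Additive Increase, Multiplicative Decrease.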
- 04
BBR · model-based
Classic AIMD treats loss as the only signal. BBR (Bottleneck Bandwidth and Round-trip propagation time), developed at Google and used on YouTube, Cloudflare's edge, and increasingly on Linux servers, instead estimates the bottleneck capacity and minimum RTT, then paces transmission to match. Throughput stays high under loss; queues stay shallow. It is, slowly, eating the world.
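The first three phases can be sketched as a per-round-trip simulation. This is a simplified, Reno-style model under stated assumptions: cwnd is counted in whole segments, every loss is detected by duplicate ACKs (so cwnd is halved, never reset by a timeout), and BBR is not modelled. `simulate_cwnd` is an illustrative name.

```python
def simulate_cwnd(rounds: int, ssthresh: int, loss_rounds: set,
                  initial_cwnd: int = 10) -> list:
    """Return cwnd (in segments) at the start of each round trip."""
    cwnd = initial_cwnd
    history = []
    for rtt in range(rounds):
        history.append(cwnd)
        if rtt in loss_rounds:
            # Multiplicative decrease: halve, and remember the new threshold.
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow start: doubling per round trip, capped at ssthresh.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: one extra segment per round trip.
            cwnd += 1
    return history
```

Plotting the returned list gives the familiar sawtooth: a steep climb, a slow linear rise, then a drop at each loss.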
TCP head-of-line blocking
Head-of-line blocking.
TCP's strict in-order delivery is also its greatest architectural flaw. If you multiplex five HTTP/2 streams over a single TCP connection and the very first segment of stream one is dropped, TCP halts everything. The OS will physically withhold perfectly valid, already-arrived bytes for streams two, three, four, and five — because it is legally bound to deliver byte one to the application before byte one thousand.
This cannot be fixed inside TCP without breaking its contract. Which is exactly why HTTP/3 abandoned TCP altogether: QUIC reimplements TCP's reliability on top of UDP, but with independent per-stream sequence numbers. A loss on one stream stalls only that stream. Other streams keep flowing. See the HTTP guide for the rest of the story.
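The difference can be illustrated with a toy model in which segments are numbered globally and each belongs to one stream. It is a sketch of the ordering rules only, not of QUIC's actual packet format; both function names are hypothetical.

```python
def tcp_deliverable(arrived: set, total: int) -> list:
    """Segments a single ordered byte stream can hand to the application."""
    delivered = []
    for seq in range(total):
        if seq not in arrived:
            break  # everything after the first hole is withheld
        delivered.append(seq)
    return delivered


def per_stream_deliverable(arrived: set, streams: dict) -> dict:
    """Same arrivals, but ordering is enforced within each stream only."""
    delivered = {}
    for stream_id, segments in streams.items():
        delivered[stream_id] = []
        for seq in segments:
            if seq not in arrived:
                break  # only this stream waits for its missing segment
            delivered[stream_id].append(seq)
    return delivered
```

If segment 0 (the first of stream one) is lost and segments 1 to 9 arrive, the single ordered stream delivers nothing, while the per-stream model delivers everything except stream one.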
TCP four-way teardown and TIME_WAIT
Four-way teardown, and TIME_WAIT.
Closing a connection happens one direction at a time. Either side can declare "I have no more data to send" with a FIN, and the other end may keep sending until it sends its own FIN. That gives four control segments: FIN, ACK, FIN, ACK. Often the middle two ride together (the server's ACK piggybacks on its FIN), but conceptually they're four phases.
After sending the final ACK, the side that closed first sits in TIME_WAIT for two maximum segment lifetimes (2 × MSL, roughly 60–240 s depending on the stack). This is not lazy cleanup; it is essential. A duplicate FIN delayed in some router could otherwise arrive long after the connection has closed and confuse a brand-new connection that happens to reuse the same four-tuple. TIME_WAIT keeps the kernel guarding the four-tuple long enough for any stragglers to die.
Hosts under enormous connection churn, typically proxies and clients that open many outbound connections, can run out of ephemeral ports waiting for TIME_WAITs to clear. Enabling tcp_tw_reuse and reusing connections via keep-alive are the standard remedies.
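A back-of-the-envelope calculation shows where the ceiling sits. The defaults below are assumptions taken from a stock Linux configuration (ephemeral range 32768 to 60999, TIME_WAIT of 60 s); other systems differ, and the limit applies per destination address and port.

```python
def max_new_connections_per_second(port_low: int = 32768,
                                   port_high: int = 60999,
                                   time_wait_seconds: int = 60) -> float:
    """Sustained rate of short-lived outbound connections to one destination
    before every ephemeral port is parked in TIME_WAIT."""
    ports = port_high - port_low + 1
    return ports / time_wait_seconds
```

Under those assumptions the budget is 28 232 ports per 60 s, or roughly 470 new connections per second to a single destination, which is why connection reuse matters far more than any sysctl.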
TCP tuning in production — the knobs that move the needle
What operators actually change.
- tcp_congestion_control
- The algorithm. Linux defaults to CUBIC; production high-throughput setups (Google, Cloudflare) often switch to BBR. Set globally: sysctl net.ipv4.tcp_congestion_control=bbr.
- tcp_rmem / tcp_wmem
- Receive and send buffer sizes (min/default/max). Defaults are tuned for ~10 ms RTT; for transcontinental connections at 150 ms RTT, raising max to ~16 MB allows TCP to keep the pipe full.
- tcp_fastopen
- Enables TFO (RFC 7413), which saves an RTT on repeat connections by carrying data in the SYN. Set to 3 to enable both the client and server sides. Applications generally still have to opt in per socket.
- tcp_tw_reuse
- Allow reuse of TIME_WAIT sockets for new connections. Critical for high-RPS services hitting port exhaustion. Set to 1 cautiously (correctness depends on timestamp option being on).
- tcp_keepalive_time / _intvl / _probes
- Keepalive cadence, applied to sockets that enable SO_KEEPALIVE. Defaults of 7200/75/9 mean the OS notices a dead peer about 2.2 hours after it goes silent (7200 s idle plus 9 probes 75 s apart). Production: tune down to 60/10/3 if your clients are unreliable mobile/IoT.
- net.core.somaxconn
- Listen-backlog ceiling. Default 128 on older kernels (4096 since Linux 5.4); serving 100k+ accepts/sec needs 4096 or more.
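Two of the knobs above come down to simple arithmetic, sketched here with illustrative function names. Buffer sizing follows the bandwidth-delay product (the bytes that must be in flight to keep a link busy), and keepalive detection time is the idle period plus all the unanswered probes. Integer milliseconds are used to keep the results exact.

```python
def bdp_bytes(bandwidth_bits_per_s: int, rtt_ms: int) -> int:
    """Bandwidth-delay product: bytes in flight needed to fill the pipe."""
    return bandwidth_bits_per_s * rtt_ms // 8000


def max_throughput_bits_per_s(window_bytes: int, rtt_ms: int) -> int:
    """Best-case throughput when at most one window is sent per round trip."""
    return window_bytes * 8 * 1000 // rtt_ms


def keepalive_detection_seconds(idle: int, interval: int, probes: int) -> int:
    """Seconds of silence before the kernel declares the peer dead."""
    return idle + interval * probes
```

A 1 Gbit/s path at 150 ms needs about 18.75 MB in flight, a 16 MiB buffer at the same RTT tops out near 895 Mbit/s, and the 7200/75/9 defaults add up to 7875 s, against 90 s for 60/10/3.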
The default fork. Recent Linux kernels (5.10 and later) ship several features (TCP zero-copy sends via SO_ZEROCOPY, in-kernel TLS via kTLS, ongoing BBRv2 work) that can be ignored on a single laptop but cumulatively matter at fleet scale. Large operators such as Cloudflare, Netflix, and Meta run custom-tuned kernels.
A closing note
TCP is forty-five years old and still doing the bulk of the world's work. The 1981 spec drew the line in exactly the right place — let IP do delivery, let the endpoints do everything else — and the everything-else has accreted three decades of careful tuning: SACK, window scaling, BBR, fast open, multipath. Almost every byte you have ever streamed, downloaded, or paid for has been carried by this protocol. It deserves at least one careful read. Picked apart, it is just very, very disciplined bookkeeping.