In the previous article
we saw how sysmon steps in every 10ms to call netpoll(0) on behalf of busy Ps, making sure network I/O doesn’t stall when the scheduler is too busy to poll on its own. We glossed over what that network poller actually is. Today we’re fixing that.
Go’s networking story is one of its most quietly impressive tricks. You write code that looks blocking — conn.Read(buf), conn.Write(buf), listener.Accept() — and it reads exactly like the classic threaded servers from decades ago. But under the hood, no OS thread is actually blocked on those calls. Behind the scenes the runtime is driving epoll on Linux, kqueue on BSD/macOS, and I/O Completion Ports on Windows, parking and waking goroutines as readiness events arrive. The synchronous API you see is a comfortable illusion — and the machinery that maintains it is the network poller, or netpoller for short.
The code lives mostly in src/runtime/netpoll.go
(the platform-independent driver) with one file per platform (netpoll_epoll.go, netpoll_kqueue.go, netpoll_windows.go, and so on). A thin bridge in src/internal/poll connects it to the net and os packages. Here’s the stack at a glance:

Let’s walk through how the pieces fit together.
Before we dive in, it’s worth understanding why the netpoller has to exist at all.
The Problem: Blocking API, Non-Blocking Reality
If you’ve ever written a networked server in a lower-level language, you’ve run into the classic dilemma. You can write blocking code — call read, wait until data arrives, move on. Easy to read, easy to write, but it pins one OS thread to every connection. Try serving a million websockets like that and your operating system will fall over: threads are not cheap. Or you can write non-blocking code with epoll and friends, which scales gloriously but turns your nice linear code into a tangle of state machines and callbacks — every would-be-blocking operation has to return to an event loop and come back later.
Go refuses to pick. You get the clean blocking-looking API (for { conn.Read(buf); ... }, one goroutine per connection) and the non-blocking scalability underneath. The netpoller is the trick that makes both true at once. Goroutines are cheap — a couple of kilobytes of stack — so parking a hundred thousand of them waiting for data is fine. Threads are expensive, so we never block one on I/O. The runtime shuffles goroutines around on a small number of threads, and the kernel only gets involved when there’s real work to do.
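To make the pitch concrete, here's the kind of server this model invites you to write: one goroutine per connection, straight-line blocking code, no event loop in sight. A minimal sketch, with error handling kept short:
```go
package main

import (
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept() // looks blocking; no OS thread is pinned
		if err != nil {
			log.Print(err)
			continue
		}
		go handle(conn) // one cheap goroutine per connection
	}
}

func handle(conn net.Conn) {
	defer conn.Close()
	buf := make([]byte, 4096)
	for {
		n, err := conn.Read(buf) // parks the goroutine, not the thread
		if err != nil {
			return
		}
		conn.Write(buf[:n]) // echo it back
	}
}
```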
That’s the pitch. Now let’s see how it’s actually built.
The Journey of a Blocked Read
Let’s follow what happens when your code calls conn.Read(buf) on a TCP connection where the other side hasn’t sent anything yet.
Think of it like ordering food at a busy restaurant with a smart waiter. You sit down and order. The kitchen isn’t ready, but the waiter doesn’t just stand at your table staring at you — that would be a waste. Instead, the waiter notes down your order, goes off to serve other tables, and the kitchen agrees to ring a bell when your food is ready. When the bell rings, a waiter (not necessarily the same one) picks up your plate and brings it over. From your point of view, you just ordered and eventually got fed. You never saw the waiter juggling the other tables. That’s essentially what Go’s runtime does for Read.
Here’s the same story in Go terms.
Your goroutine calls Read, and the runtime asks the kernel “is there data on this socket?” The kernel says “not yet” — that’s the “kitchen isn’t ready” moment. A naive implementation would now make the whole OS thread wait around for data, but Go refuses to waste a thread like that. Instead, the runtime puts your goroutine to sleep, writes a little note that says “this goroutine is waiting for data on this socket,” and lets the OS thread go do something else — run another goroutine, serve another table.
Meanwhile, the runtime has also told the kernel “please let me know when this socket has data.” That’s our bell. The kernel keeps an eye on the socket in the background, and when data finally arrives, it rings: it tells the runtime “socket number 17 is ready now.” The runtime looks at its notes, finds the goroutine that was waiting on socket 17, and puts it back on the list of goroutines to run. Some OS thread eventually picks it up, your goroutine wakes up exactly where it left off, retries the read — this time successfully — and Read returns with the bytes. From your code’s point of view, Read simply took a while to return. You never noticed the sleep, the bell, or the reshuffling.
The beautiful part of this design is that the “state machine” — the remembering of where you were when you went to sleep — isn’t something you had to write. It’s literally your goroutine’s own stack, frozen and then thawed. That’s why Go gives you a simple for { n, err := conn.Read(buf); ... } loop where older languages would make you write callbacks or event handlers.
The rest of the article is about how the runtime actually pulls this off: how it talks to the kernel (epoll, kqueue, IOCP), how it writes those notes (the pollDesc), how it puts goroutines to sleep and wakes them up without losing bells (the parking protocol), and how it handles the awkward corners like deadlines, closed sockets, and stale notifications.
Before we look inside the runtime, it’s worth understanding what the kernel actually offers us — because the shape of those APIs shapes everything above them.
The Kitchen Bell: Epoll, Kqueue, IOCP
The general problem is this. Your program is tracking thousands of sockets, and at any given moment maybe three of them have something interesting going on. You’d like to ask the kernel “out of all these sockets I’m watching, which ones are ready right now?” — and if the answer is “none,” you’d like to go to sleep until one is, without having to poll. Every modern OS has some facility for exactly this, and while the APIs look different, they’re solving the same problem.
There are two main flavors, and the distinction matters.
The first is the readiness model. You register a set of file descriptors with the kernel and say “please let me know when any of these become readable or writable.” When the kernel notices a socket has data you can consume (or buffer space you can fill), it tells you so, and then you do the actual read or write yourself. The kernel is basically a doorman saying “you can go in now” — it doesn’t do the work for you. Linux’s epoll and BSD/macOS’s kqueue both work this way.
The second is the completion model. Instead of asking the kernel to tell you when you could do a read, you hand the kernel a buffer and say “please do this read for me and tell me when it’s done.” The kernel actually performs the I/O in the background and eventually notifies you “here are the bytes, the read is finished.” Windows’s I/O Completion Ports (IOCP) is the classic example.
From the user’s point of view the result is identical — your goroutine calls Read, eventually Read returns bytes — but the plumbing is meaningfully different. Readiness APIs let you decide when and how to do the I/O after being told the fd is ready. Completion APIs want you to commit to the operation upfront and hand them the buffer.
Whichever model the OS gives us, the netpoller has to hide that choice from the rest of the runtime.
How the Netpoller Uses Them
The netpoller wraps both models behind one uniform function: netpoll(delay). Whoever calls it is saying “check with the kernel for any I/O events, and return a list of goroutines that should now be runnable.” The delay just controls patience: negative means “wait forever,” zero means “don’t wait at all, just peek,” positive means “wait up to this many nanoseconds.”
On Linux (netpoll_epoll.go) that call becomes epoll_wait. On BSD and macOS (netpoll_kqueue.go) it’s kevent. On Windows (netpoll_windows.go) it’s GetQueuedCompletionStatusEx. Each of these returns a batch of events, and for each event, the runtime figures out which goroutine was waiting on it and wakes it up. There are also implementations for Solaris, AIX, and WASI. Different kitchens, same waiter.
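On the Linux side, that delay becomes the millisecond timeout epoll_wait understands. Roughly, simplified from the logic in netpoll_epoll.go (the real code also deals with the wakeup event and retries interrupted calls):
```go
// epollTimeout converts netpoll's delay (nanoseconds) into the millisecond
// timeout that epoll_wait expects. Simplified from netpoll_epoll.go; not the
// runtime's exact code.
func epollTimeout(delay int64) int32 {
	switch {
	case delay < 0:
		return -1 // block until something happens
	case delay == 0:
		return 0 // just peek and return
	case delay < 1e6:
		return 1 // round sub-millisecond waits up to 1ms
	case delay < 1e15:
		return int32(delay / 1e6)
	default:
		// An arbitrary cap of roughly 11.5 days; epoll_wait takes an int32.
		return 1e9
	}
}
```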
The readiness-versus-completion difference does leak a little into the design. On Linux and BSD, Go issues the actual read syscall itself (after the kernel says “ready”). On Windows, it hands buffers to the IOCP and the kernel does the read. But all of that is tucked away inside the platform-specific file; by the time you’re in the shared code, it’s all just “here’s a list of goroutines that should wake up.”
There’s one more wrinkle in the readiness world that’s worth knowing about.
One More Wrinkle: Edge vs Level
Readiness APIs come in two sub-flavors that change how you have to use them.
Level-triggered means the kernel tells you “this socket is ready” for as long as the condition is true. If you don’t do anything about it, the kernel will keep telling you on every poll. This is forgiving — you can half-drain a socket and the next poll will still flag it — but it generates a lot of redundant notifications if you’re not careful.
Edge-triggered means the kernel tells you exactly once when the socket transitions from not-ready to ready. After that, silence — until either it goes back to not-ready and then ready again, or you explicitly re-arm your interest. This is more efficient (no duplicate notifications) but puts the burden on you to fully drain a ready socket before you trust a future notification.
Go registers sockets with epoll and kqueue in edge-triggered mode, which matches its usage perfectly: the goroutine always calls Read in a loop until it gets “nothing to read,” then parks. By the time a goroutine is asleep on an edge-triggered fd, we’ve genuinely exhausted everything, and the next notification will be for new data. On a few platforms (Solaris event ports, AIX pollsets) only level-triggered modes are available, so the runtime has to re-arm the interest every time a goroutine parks. Different knob, same outcome.
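For a feel of what that registration looks like, here's a user-space approximation using golang.org/x/sys/unix. The runtime does the equivalent in netpoll_epoll.go with its own syscall wrappers, and instead of storing the fd in the event data it stores a tagged pointer to the pollDesc, which we'll meet shortly:
```go
import "golang.org/x/sys/unix"

// register adds fd to an epoll instance (epfd) in edge-triggered mode,
// watching both readability and writability, the way the runtime registers
// every network socket. A user-space sketch, not the runtime's code.
func register(epfd, fd int) error {
	ev := unix.EpollEvent{
		Events: unix.EPOLLIN | unix.EPOLLOUT | unix.EPOLLRDHUP | unix.EPOLLET,
		Fd:     int32(fd),
	}
	return unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, fd, &ev)
}
```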
With that picture in mind, let’s look at what the runtime itself is doing on top of these primitives.
The pollDesc: The Note at the Table
Remember our waiter’s little note? In Go it’s called a pollDesc
, and there’s one per file descriptor the runtime is tracking. It holds the fd itself, a couple of deadline timers, a generation counter (which we’ll need later when we talk about stale notifications), and — the fields that do all the real work — rg and wg: the “read goroutine” and “write goroutine” slots.
Think of rg and wg as two little boxes on the note. A socket can have one goroutine waiting to read and a different one waiting to write, and we need to tell them apart. Each box has just one slot, and that slot is always in one of four states:
| State | Meaning |
|---|---|
| pdNil | nothing is happening — nobody’s waiting, nothing to report |
| pdReady | the bell already rang; whoever shows up next gets to go |
| pdWait | a goroutine is about to park but isn’t actually asleep yet |
| pointer to g | a goroutine is parked here, waiting for the bell |
The first two states are easy. pdNil is the default “quiet” state. pdReady means the bell has rung and the readiness hasn’t been consumed yet: either no goroutine was waiting at that moment, or the one that was waiting has been woken but hasn’t yet come back to clear the slot.
The fourth state — the pointer to a goroutine — is like writing the table number down on the note: when the bell rings, the runtime knows exactly which goroutine to bring the plate to.
The interesting one is pdWait. It’s a tiny in-between state that says “a goroutine is in the process of going to sleep, but it isn’t fully parked yet — please wait a moment before waking it.” That might sound like a detail, but it’s doing essential work: it’s how the runtime prevents losing wakeups when the kernel and the goroutine are racing each other. We’ll see exactly how in the next section.
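In code, the pieces we've named so far look roughly like this; it's a trimmed-down sketch, and the real pollDesc in netpoll.go carries more bookkeeping than shown:
```go
// The three "empty slot" states. They're chosen so they can never be
// mistaken for a valid goroutine pointer.
const (
	pdNil   uintptr = 0 // nobody waiting, nothing to report
	pdReady uintptr = 1 // the bell already rang; next arrival consumes it
	pdWait  uintptr = 2 // a goroutine is on its way to sleep
)

type pollDesc struct {
	fd    uintptr        // the file descriptor being watched
	fdseq atomic.Uintptr // generation counter, bumped when the pollDesc is reused

	// The two slots. Each holds pdNil, pdReady, pdWait, or a pointer to a g.
	rg atomic.Uintptr // who's waiting to read
	wg atomic.Uintptr // who's waiting to write

	rd, wd int64 // read and write deadlines (nanoseconds)
	rt, wt timer // runtime timers that enforce those deadlines
	// ...plus a lock, close flags, and sequence numbers for the deadline timers.
}
```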
Parking: Falling Asleep Without Missing the Bell
So our goroutine got told “no data yet” and wants to wait. The function that handles this is netpollblock
, and it has to be careful about one specific thing: not missing the bell.
Here’s the race. The kernel and our goroutine are running on completely different timelines. Between the moment the kernel said “nothing to read” and the moment our goroutine actually falls asleep, the data might already have arrived. If we weren’t careful, the bell would ring against an empty table, and we’d fall asleep seconds later with nobody around to wake us up. Classic lost wakeup, and a classic source of bugs in concurrent systems.
Here’s how netpollblock avoids that, step by step.
First, it peeks at the slot. Maybe a wakeup is already sitting there — pdReady. That happens all the time: the kernel is fast, and the bell might have rung in the brief moment between “no data” and “I’m about to wait.” If that’s the case, great — we consume the pdReady, reset the slot, and return immediately without ever actually going to sleep. Free lunch.
If the slot is empty (pdNil), we flip it to pdWait atomically. This is the runtime’s promise to itself: “a goroutine is coming in to sleep — if you want to wake it, please hold on a moment, don’t just walk away.” From this point on, the wake-up side can see us coming.
Then we call gopark, which is the runtime’s “put this goroutine to sleep” primitive. But crucially, gopark doesn’t just put us to sleep blindly — it takes a small callback that runs after the scheduler has locked everything down, and only then does the callback do the final atomic swap: pdWait → pointer to our goroutine. Why this extra step? Because if a wakeup raced in while we were on our way to sleep, it will have flipped the slot from pdWait to pdReady — and our pdWait → g swap will fail, because the slot isn’t pdWait anymore. When the swap fails, gopark cancels the whole nap and we just keep going. The wakeup we were worried about losing? It’s sitting right there in the slot as pdReady, and we consume it a moment later, wide awake.
If the swap succeeds, we sleep. Later — maybe microseconds, maybe minutes — the bell rings, someone writes pdReady into the slot, and our goroutine gets put back on a run queue. When it resumes, the only thing left to do is clear the slot back to pdNil and return to the caller, which retries the read.
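Here's the dance as a simplified paraphrase of netpollblock from netpoll.go (the real function also checks for closed descriptors and expired deadlines before parking, and threads through tracing details):
```go
func netpollblock(pd *pollDesc, mode int32) (ready bool) {
	gpp := &pd.rg // pick the read slot...
	if mode == 'w' {
		gpp = &pd.wg // ...or the write slot
	}

	for {
		// Peek: maybe the bell already rang. Consume it and skip the nap.
		if gpp.CompareAndSwap(pdReady, pdNil) {
			return true
		}
		// Announce that we're on our way to sleep.
		if gpp.CompareAndSwap(pdNil, pdWait) {
			break
		}
	}

	// gopark freezes this goroutine; its commit callback then tries to swap
	// pdWait -> pointer-to-our-g. If a wakeup slipped in and the slot is no
	// longer pdWait, the swap fails and the park is cancelled.
	gopark(netpollblockcommit, unsafe.Pointer(gpp), waitReasonIOWait, traceBlockNet, 5)

	// We're running again: either the bell rang (pdReady) or the park was
	// abandoned. Clear the slot and tell the caller whether I/O is ready.
	old := gpp.Swap(pdNil)
	return old == pdReady
}
```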
Here’s the whole dance as a flowchart — three paths in, one way out:

The thing that makes all of this work is that three-state dance: pdNil → pdWait → pointer to g. Without the pdWait middle step, there’d be an ambiguous moment where a goroutine has decided to sleep but isn’t quite there yet — and a bell arriving in that window would have nowhere to land. With pdWait, the wake-up side can always tell the difference between “nobody’s coming,” “someone’s on their way,” and “someone’s already here,” and it does the right thing in each case.
Now let’s flip the camera around and watch the wakeup side.
Waking: Ringing the Bell
When the runtime finds out that a socket became ready (we’ll see how the kernel tells us in a minute), it calls netpollready
, which in turn calls netpollunblock
for the right direction (read or write). And netpollunblock has to handle whichever of the three possible states the slot happens to be in.
If there’s a goroutine pointer sitting in the slot, someone was parked waiting. We hand that goroutine back to the caller, who’ll drop it onto a scheduler run queue so it can resume. This is the common, happy case.
If the slot is pdWait, someone was about to park but hasn’t actually gone to sleep yet. We don’t have a goroutine to wake — there isn’t one yet. Instead, we leave pdReady in the slot. A moment later, when that goroutine finishes falling asleep and its commit callback fires, it’ll see pdReady and immediately wake itself back up. The race is defused.
If the slot is pdNil, nobody was waiting at all. The bell rang into an empty room. We stash pdReady in the slot anyway — the next goroutine to come along wanting to read will see it and skip the sleep entirely.
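And here's the wake side, again as a simplified paraphrase of netpollunblock (the real function also maintains a count of blocked waiters for the scheduler):
```go
// netpollunblock inspects one slot and returns the goroutine to wake, if any.
// ioready is true for a real readiness event, false for deadline/close wakeups.
func netpollunblock(pd *pollDesc, mode int32, ioready bool) *g {
	gpp := &pd.rg
	if mode == 'w' {
		gpp = &pd.wg
	}
	for {
		old := gpp.Load()
		if old == pdReady {
			return nil // readiness already recorded; nothing more to do
		}
		if old == pdNil && !ioready {
			return nil // error wakeup with nobody around; leave the slot alone
		}
		next := pdNil
		if ioready {
			next = pdReady // leave the bell in the slot
		}
		if gpp.CompareAndSwap(old, next) {
			if old == pdNil || old == pdWait {
				// Nobody is fully parked: either the room was empty, or the
				// sleeper's commit callback will now fail and wake itself.
				return nil
			}
			return (*g)(unsafe.Pointer(old)) // the parked goroutine to hand back
		}
	}
}
```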
Here’s the whole state machine at a glance — every transition a rg/wg slot can make, and who causes it. Solid arrows are the happy path; the dashed arrow is the race we designed pdWait to defuse.

Success isn’t the only way out of a parked read, though. Our socket might also hit a deadline, or get closed from somewhere else in the program. The same netpollunblock function handles those cases too, with one small twist: we still wake the parked goroutine, but we leave the slot as pdNil instead of pdReady. The reason is that the goroutine isn’t resuming because there’s data to read — it’s resuming because something went wrong, and it should check for errors on the way out (deadline exceeded, connection closed, etc.) rather than blindly retry the syscall.
Of those two non-happy-path cases, deadlines are the more interesting one, because the runtime has to manufacture the wakeup itself.
Deadlines: Giving Up Without Asking the Kernel
What happens if you call conn.SetReadDeadline(t) and then block on Read? You’d hope that after t passes, Read returns with a timeout error — and it does. What’s interesting is that no kernel timer is involved. The deadline machinery lives entirely inside the Go runtime, on top of the same timer infrastructure that powers time.After.
When you set a deadline, the runtime stores it in the pollDesc and arms a runtime timer to fire at that moment. When the timer fires, the callback takes the same code path as a readiness wakeup — it calls netpollunblock — except the slot gets left as pdNil instead of pdReady. The parked goroutine wakes up, checks what happened, and finds that no data arrived; the deadline expired instead. That gets turned into os.ErrDeadlineExceeded and bubbled up to your code.
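From your code's side, all of that machinery surfaces as an ordinary error you can test for (deadline errors also satisfy net.Error's Timeout method):
```go
conn.SetReadDeadline(time.Now().Add(2 * time.Second))

n, err := conn.Read(buf)
if err != nil {
	if errors.Is(err, os.ErrDeadlineExceeded) {
		// No data arrived within two seconds; a runtime timer woke us up.
		log.Println("read timed out")
	} else {
		log.Println("read failed:", err)
	}
	return
}
process(buf[:n])
```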
There are two small niceties worth mentioning. If you set the same deadline for both read and write on the same socket, the runtime uses one combined timer instead of two — tiny memory savings that matter once you have a busy server with lots of sockets. And because SetDeadline can be called repeatedly (each call replacing the previous), the runtime uses a little sequence counter to know which timer events belong to the current deadline versus a stale one. If a timer fires after its deadline has already been replaced, the runtime notices the mismatch and silently ignores it.
That sequence-counter trick — ignore events that no longer apply — shows up again, in a different shape, when we look at how the runtime protects itself from stale kernel events.
Stale Notifications: Making Sure the Bell Belongs to You
Here’s a subtle but nasty problem. Imagine a socket gets closed. Its pollDesc gets recycled and used for a brand-new socket a few milliseconds later. But somewhere deep in the kernel, an old readiness event for the original socket is still in flight, about to be delivered. If the runtime processes it naively, it’ll look up the pollDesc, find a goroutine waiting — and wake the wrong one, for the wrong socket.
Go solves this with a generation counter. Every pollDesc has a number that increments every time it’s recycled. When Go registers a socket with the kernel, it stores the current generation alongside the pointer to the pollDesc — packed cleverly into the same 64-bit field the kernel hands back on every event (the trick is that on 64-bit systems a pointer doesn’t actually need all 64 bits: the high bits above the usable address range and the low bits freed up by alignment are always zero, so there’s room to tuck a small counter in there for free). When an event comes back later, the runtime pulls those two pieces apart and compares the generation from the event against the current generation on the pollDesc. If they match, the event is fresh and we process it. If they don’t, the pollDesc has been recycled since, so the event is stale and we drop it on the floor.
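As a toy illustration of the packing idea (the runtime's own tagged-pointer helpers differ in their exact bit budget, so treat this as a sketch): with 48-bit user-space addresses and 8-byte-aligned allocations, 19 bits of a 64-bit word are free to carry a tag.
```go
const (
	addrBits = 48                // meaningful bits of a user-space address (x86-64)
	tagBits  = 64 - addrBits + 3 // 16 unused high bits + 3 alignment bits = 19
	tagMask  = 1<<tagBits - 1
)

// pack squeezes a pointer and a small generation tag into one uint64.
func pack(p unsafe.Pointer, tag uintptr) uint64 {
	return uint64(uintptr(p))>>3<<tagBits | uint64(tag)&tagMask
}

// unpack splits them apart again.
func unpack(tp uint64) (unsafe.Pointer, uintptr) {
	return unsafe.Pointer(uintptr(tp >> tagBits << 3)), uintptr(tp & tagMask)
}
```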
We’ve been talking about netpoll as if it just runs on its own, but of course something has to call it.
Who Actually Calls netpoll?
That something is the Go scheduler. Think about what a scheduler has to do when it’s looking for work. It checks its own local queue of runnable goroutines, then the global queue, then tries to steal from other processors. If none of that produces work, it starts to suspect everything is idle. Before giving up, it asks the netpoller: “hey, is any I/O ready right now?” — a non-blocking poll, essentially a quick peek. If something comes back, great, we’ve got work. If not, and this thread is really about to go to sleep, the scheduler makes the poll blocking: “fine, I’ll wait until either some I/O is ready or it’s time for the next timer to fire.”
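In outline, the checklist reads something like this. It's a loose sketch of findRunnable from proc.go, not real runtime code: the helper names are stand-ins, netpoll's real return type is a runtime-internal list rather than a slice, and the real function interleaves GC work, spinning, and timer checks:
```go
// Not runtime code: a stand-in outline of how an idle P hunts for work.
func findWork(p *processor) *goroutine {
	if g := p.runq.pop(); g != nil { // 1. our own local run queue
		return g
	}
	if g := globalRunq.pop(); g != nil { // 2. the global run queue
		return g
	}
	if g := stealFromOtherPs(p); g != nil { // 3. work-stealing
		return g
	}
	if ready := netpoll(0); len(ready) > 0 { // 4. non-blocking peek at I/O
		return ready[0] // the rest are injected into run queues
	}
	// 5. Truly nothing to do: block in the kernel until I/O arrives or the
	// next runtime timer is due, whichever comes first.
	if ready := netpoll(nanosUntilNextTimer()); len(ready) > 0 {
		return ready[0]
	}
	return nil // still nothing; this thread will go to sleep
}
```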
This is a really elegant detail: the thread that was looking for work becomes the thread that’s sleeping inside epoll_wait. Waiting for goroutines and waiting for I/O are the same waiting. You don’t need a dedicated “poller thread” — whichever thread has nothing better to do takes the job.
A couple of small coordination tricks keep this sane across many threads. There’s a shared flag that says “somebody is currently inside netpoll,” so other threads don’t also try to block on the kernel at the same time (one poller is enough for the whole runtime). There’s also a shared “the current poller is sleeping until timestamp T” value, so that other threads can tell whether the sleeper is about to oversleep something important.
And as we saw in the previous article, sysmon is still out there doing its thing — calling netpoll(0) every 10 milliseconds as a safety net. Even if every single P is stuck spinning on CPU-bound work and nobody’s remembering to check for I/O, sysmon will. Network events never get left behind.
That “thread asleep inside epoll_wait” raises its own problem, though: what if something changes and we need to wake it up before the kernel does?
Waking the Waiter: netpollBreak
Suppose a thread is deep inside netpoll(-1), sleeping indefinitely, waiting for something — anything — to happen. Now suppose, on a completely different thread, someone schedules a new timer that’s supposed to fire in 50ms. The sleeping poller doesn’t know about this timer; it might sleep for an hour before the kernel wakes it up with a network event. We need a way to reach into that snoring thread and tap it on the shoulder: “hey, you need to wake up and re-check the timer list.”
That’s netpollBreak
. The idea is simple: at startup, the runtime creates a little “wakeup channel” and registers it alongside the normal sockets. When something elsewhere in the runtime needs to wake the poller, it pokes the wakeup channel, which looks to the poller like just another event. epoll_wait returns, the poller loops back to check on the scheduler, sees the new timer, and acts accordingly.
How that poke is implemented varies from platform to platform — each OS exposes a slightly different way to send yourself a wakeup event — but the specifics don’t really matter here. What matters is that every platform Go supports has some mechanism for this, and the netpoller uses whichever one is available.
The scheduler fires a netpollBreak whenever it notices the sleeping poller is going to miss something — most often, a new timer with an earlier deadline than the one it’s currently waiting for. And if several threads try to break the poller at the same time, the runtime is smart enough to only send one actual wakeup.
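In current Go on Linux the poke is a write to an eventfd that was registered with epoll at startup (older releases used a pipe), and the de-duplication is a single atomic flag. Roughly, paraphrasing netpollBreak in netpoll_epoll.go:
```go
// Simplified paraphrase of netpollBreak on Linux; not the runtime's exact code.
func netpollBreak() {
	// If a wakeup is already in flight, don't send another one.
	if !netpollWakeSig.CompareAndSwap(0, 1) {
		return
	}
	// Writing any 8-byte value to the eventfd makes it readable, which pops
	// the sleeping thread out of epoll_wait with an event.
	var one uint64 = 1
	write(netpollEventFd, unsafe.Pointer(&one), int32(unsafe.Sizeof(one)))
}
```
When the poller wakes up and sees the eventfd event, it drains it and resets the flag so the next break can get through.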
Let’s replay that conn.Read one more time, now with all the pieces we’ve introduced.
Putting It All Together
Your goroutine calls Read. The runtime asks the kernel for data; the kernel says “nothing here yet.” The goroutine decides to wait, so it reaches for the pollDesc attached to this socket and runs the parking dance: peek at the read slot, find it empty, flip it to pdWait, call gopark, and let the commit callback finish the handoff by putting a pointer to the goroutine in the slot. The OS thread that was running this goroutine is now free to go run something else — and it does.
Meanwhile, across the runtime, some processor has run out of work. It’s gone through the whole checklist — local queue, global queue, work-stealing — and come up empty. It calls netpoll in blocking mode, which on Linux means the thread goes to sleep inside epoll_wait. The thread that was looking for something to do is now the thread waiting for the kernel.
Eventually, a packet arrives for our socket. The kernel marks the socket readable and wakes up whoever’s waiting on epoll. epoll_wait returns with an event. The runtime unpacks the tagged pointer the kernel handed back, double-checks that the generation number still matches (to make sure this event isn’t about a recycled pollDesc), and looks at the pollDesc. There’s a goroutine pointer in the read slot. The runtime pulls the goroutine out, drops pdReady in its place, and hands the goroutine to the scheduler to run.
Some thread picks up our goroutine. It resumes right where it left off — inside netpollblock, just past gopark. It clears the slot back to pdNil, returns up through the layers into Read, which retries the syscall. This time the kernel has bytes to hand over. Read returns them to your code.
Count the OS threads that were blocked on behalf of your goroutine during all this: zero. One thread was sleeping on the kernel, but that was the runtime’s shared poller, not a per-connection thread, and it was happy to serve wakeups for any of your hundred thousand connections. Each of those connections costs you a parked goroutine (a couple of kilobytes of stack) and one entry in the kernel’s epoll table. That’s it.
Let’s zoom back out one more time and put the whole picture in one place.
Summary
The network poller is the bridge between Go’s blocking-looking API and the kernel’s non-blocking reality — the smart waiter that takes your order, lets you nap, and fetches your food when the kitchen rings the bell. For each file descriptor there’s a pollDesc holding a “who’s waiting to read” slot and a “who’s waiting to write” slot. A careful three-step parking protocol — “empty, about to sleep, asleep” — makes sure no wakeup ever arrives while a goroutine is in the awkward middle of falling asleep. The actual kernel conversation happens through epoll, kqueue, IOCP, or event ports, depending on the OS. Deadlines are done entirely with runtime timers. Stale events are filtered out with a generation counter. And the poller itself can be woken up through a little self-wakeup channel when something changes and it needs to re-check. The scheduler calls it whenever it runs out of work; sysmon polls it every 10ms as a safety net.
The payoff is real: a server that handles a million concurrent connections on a handful of OS threads, while your code still looks like for { conn.Read(buf); process(buf) }. It’s one of those places where the abstraction is so good you can go years writing Go without ever realizing how much is happening underneath.
In the next article we’ll step away from the scheduler and into the data structures you use every day — slices, maps, and channels — and see how they actually work under the hood.