What Every Programmer Should Know About Memory (2007) [pdf]
This comes up over and over. It's great. But the 75% of useful content comes after 25% that dives way too deep into the electrical-engineering details.
"Every programmer" should know the orders of magnitude of cache hierarchy latencies, how RAM loads a whole cache line to service that single byte you requested, roughly how the automatic prefetcher thinks, that MESI and NUMA access are a growing issue, that the TLB cache is a thing, and generally how the memory controller is the interface between the CPU and pretty much everything else --like the NIC, HD and GPU.
"Every programmer" does not need to know about DRAM discharge timing, row selection and refresh cycles.
Understanding quad-pumped bus frequencies and CAS latencies is great when you are building systems. But it's not something you think about when designing data structures and algos.
Long ago when I worked on real-time digital signal processing I did have to worry about DRAM discharge timing because the board I was using had a non-maskable interrupt to do the RAM refresh, and this limited the rate at which I could access the A/D converter. I think it was an LSI-11 board if I remember right. Fun times.
But yes, unless you're doing bare-metal embedded systems you haven't needed to care in a long time ... at least until someone came up with Rowhammer.
The ones who have the skills to actually use this information will end up reading the whole thing anyway; there's an element of self-selection here.
You don't need to know the DRAM timing stuff, technically, but it doesn't hurt to learn something new.
The actual problem with this doc is that it was written a very long time ago, so although most things still hold very true, a few (like everything on the quality of prefetching techniques) should be taken with a large grain of salt.
It's also wrong about some basic details AFAICS. A "memory controller" typically has more than one memory channel, and they're not always ganged, so independent concurrent accesses can be supported even within a single controller. DRAM can also have multiple banks which can be accessed independently. So there is certainly not only one "bank" per north bridge, nor one bank per ODMC.
On modern systems the memory controller interfaces between the memory network and the DRAM. I wouldn't say the CPU/NIC or CPU/storage boundary touches the memory controller (except when it's writing to memory).
It does, outside of MMIO, which nowadays is just the control plane for device configuration and for the main processing state machines (e.g., starting and stopping the processing of command and completion packets in ring buffers in memory). So even those commands, which are the primary control plane for the data operations, are all in memory! And the data operations themselves are of course all memory too.
It is possible for the PCI host bridge's DMAs to load and store into caches, but in practice it can be difficult if not impossible to line everything up so the data is in cache when it is required, because of the data throughput and the latency variations of the pipelining (many parallel pipelines), even on local NAND devices, etc.
Maybe you get your command/completion rings from cache (which would be nice, since they have to come in order and the CPU has to operate on them), but it's very hard to get all your data served by the caches.
The old favorite Netflix serving talks show this.
My limited understanding is that the CPU has two most-common interfaces to the rest of the system:
1) Reading and writing to memory addresses in such a way as to be interpreted by the memory controller to forward those actions to/from the PCI bus and other systems. That forwarding being controlled by reading and writing to addresses in a way that the MC interprets as commands to configure itself.
2) Ports -- which still exist for legacy reasons, but are long out of fashion.
What's #3?
The CPU core/cache uses its physical addresses to route loads and stores, but they don't have to go to the memory controller on the chip. They could go to the SMP unit if the memory controller for the data is on another chip. Or to a PCI host bridge on the chip, either directly to its register space, or to the register space or memory that belongs to a device behind it. These addresses are called MMIO (memory-mapped IO) and are the way configuration is done. They are also slow, low bandwidth, and synchronous.
x86 has "ports". I don't know about modern implementations but I would guess they are done with MMIO out the back end of the core (e.g., the CPU turns inb/outb instructions into accesses to special memory ranges).
Either way MMIO is the way to configure devices.
DMA is how to move data between device and CPU or devices. But nowadays with high performance devices, you don't set up a request and then send a MMIO command to process that request, and then get an interrupt and do a MMIO to get completion status of the command. The commands and completions themselves are DMA'ed. So you have ring buffers (multiple for a multi queue or virtualized device) for commands and completions which get DMAed. You do the MMIOs and interrupts only to manage starting and stopping (fill, empty, etc) conditions of the queues.
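A minimal sketch of that command-ring pattern, with hypothetical names and layout (no real device works exactly like this, but NVMe queues and modern NIC rings follow the same shape): descriptors go into ordinary DMA-able memory, and the only MMIO on the fast path is the doorbell write that tells the device new entries exist.

```c
#include <stdint.h>

struct cmd_desc {                /* hypothetical command descriptor */
    uint64_t buf_addr;           /* DMA address of the data buffer */
    uint32_t buf_len;
    uint32_t opcode;
};

#define RING_ENTRIES 256

struct cmd_ring {
    struct cmd_desc desc[RING_ENTRIES];  /* lives in DMA-able memory */
    uint32_t head;                       /* producer index (driver side) */
    volatile uint32_t *doorbell;         /* MMIO register on the device */
};

static void submit_cmd(struct cmd_ring *ring, struct cmd_desc d)
{
    ring->desc[ring->head % RING_ENTRIES] = d;
    /* Make sure the descriptor is visible in memory before the doorbell
     * write; a real driver would use the kernel's write barrier here. */
    __atomic_thread_fence(__ATOMIC_RELEASE);
    ring->head++;
    *ring->doorbell = ring->head;        /* the only MMIO on the fast path */
}
```

Completions come back the same way: the device DMAs completion entries into a second ring and raises an interrupt only when the driver needs waking.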
DMAs can go direct to caches in some architectures, but as I said L3 caches are only so large, and data volumes so large that it can be hard to arrange. You would hope your queues are mostly cached at least, but in reality I don't know if that actually happens well.
Quite Right!
Any resources where I can read up on these?
Well thanks for this summary :-).
If you have a 114-page paper, you probably don't have a document of what every programmer should know about memory, especially since many programmers work in domains where some of the recommendations here aren't even possible to follow!
Here's a brief summary of what every programmer really needs to know about memory:
* There is a hierarchy of memory from fast-but-small to slow-but-large. The smallest and fastest is the L1 cache, typically on the order of 10-100 KiB per core, and it isn't growing substantially over time. The largest cache is the L3 cache, in the 10s of MiB and shared between many or all cores, while your RAM is in the 10s of GiB. An idea of the difference in access times can be found here: https://gist.github.com/jboner/2841832.
* All memory traffic happens on cacheline-sized blocks of memory, 64 bytes on x86. (It might be different on other processor architectures, but not by much). If you're reading 1 byte of data from memory, you're going to also read in 63 other bytes anyways, whether or not you're actually using that data.
* The slow speed of DRAM (compared to processor clock speed) means that it's frequently worth trading off a little CPU computation time for tighter memory utilization. For example, store a 16-bit integer instead of a pointer-sized integer if you know that the maximum value will be <65,536.
* "Pointer chasing" algorithms are generally less efficient than array based algorithms. In particular, an array-based list is going to be faster than linked lists most of the time, especially on small (<1,000 element) lists.
That about covers what I think every programmer needs to know about memory. There are some topics that many programmers should perhaps know--cross-thread contention for memory, false sharing, and then NUMA, in descending order, I think--but by the time you're in the weeds of "let's talk about nontemporal stores"... yeah, that's definitely not in the ballpark of what everyone should know, especially if you're not going to mention when nontemporal stores will hurt instead of help [1]. Also, transactional memory is something I'd put on my list of perennial topics of "this sounds like a good idea in theory, but doesn't work in practice."
[1] What Drepper omits is that nontemporal stores will evict the cache lines from cache if they're already in cache, so you really shouldn't use them unless you know there is going to be no reuse.
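For illustration, a hedged sketch of a streaming fill using the SSE2 intrinsics on x86 (the buffer size and fill pattern are arbitrary). The point is that the stores bypass the cache, so this only pays off when the data won't be reused soon:

```c
#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_set1_epi64x, _mm_sfence */
#include <stdint.h>
#include <stdlib.h>

/* Fill a 16-byte-aligned buffer with nontemporal (cache-bypassing) stores. */
static void stream_fill(void *dst, uint64_t pattern, size_t bytes)
{
    __m128i v = _mm_set1_epi64x((long long)pattern);
    __m128i *p = dst;                      /* must be 16-byte aligned */
    for (size_t i = 0; i < bytes / 16; i++)
        _mm_stream_si128(p + i, v);        /* movntdq: does not allocate in cache */
    _mm_sfence();                          /* make the streaming stores globally visible */
}

int main(void)
{
    size_t bytes = 64u << 20;              /* 64 MiB, far larger than any L3 */
    void *buf = aligned_alloc(16, bytes);
    if (!buf) return 1;
    stream_fill(buf, 0x0123456789abcdefULL, bytes);
    free(buf);
    return 0;
}
```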
You have to know something about coherency. You don't need to know the details but you need to know that multiple processors can read a cache line, but if there are any writers there will be trouble.
You have to know something about consistency. You don't need to know the details but you need to know enough to know you don't know enough to write your own synchronization primitives or lock free algorithms, at least.
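A small sketch of the "any writers means trouble" point, assuming pthreads and GCC/Clang attribute syntax: two threads incrementing counters that share a cache line force that line to ping-pong between cores (false sharing), while padding each counter to its own 64-byte line lets the same code scale. Timing the two phases (not shown) typically reveals a severalfold difference:

```c
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

/* Shared case: both counters forced into the same 64-byte line. */
static volatile long shared_counters[2] __attribute__((aligned(64)));

/* Padded case: one counter per cache line. */
struct padded { volatile long count; char pad[64 - sizeof(long)]; };
static struct padded padded_counters[2] __attribute__((aligned(64)));

static void *bump_shared(void *arg)
{
    long idx = (long)arg;
    for (long i = 0; i < ITERS; i++)
        shared_counters[idx]++;
    return NULL;
}

static void *bump_padded(void *arg)
{
    long idx = (long)arg;
    for (long i = 0; i < ITERS; i++)
        padded_counters[idx].count++;
    return NULL;
}

int main(void)   /* compile with -pthread */
{
    pthread_t t[2];

    for (long i = 0; i < 2; i++) pthread_create(&t[i], NULL, bump_shared, (void *)i);
    for (int i = 0; i < 2; i++)  pthread_join(t[i], NULL);

    for (long i = 0; i < 2; i++) pthread_create(&t[i], NULL, bump_padded, (void *)i);
    for (int i = 0; i < 2; i++)  pthread_join(t[i], NULL);

    printf("shared %ld, padded %ld\n", shared_counters[0], padded_counters[1].count);
    return 0;
}
```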
The fields of microarchitecture and compiler design (specifically relating to microarchitecture also, i.e. we didn't go with Itanium or RISC) could really use a new batch of textbooks.
Hennessy and Patterson is still good but there's only like 1 alternative, and the most up to date "complete" book on memory I'm aware of is from 2007 (Jacob et al.).
Also: I've read about memory quite a bit, I get it. What I don't get, and am thus asking for recommendations on, is: "What every programmer should know about storage".
You can get an idea of your computer's memory hierarchy in your browser here:
The main past threads appear to be:
What every programmer should know about memory [pdf] - https://news.ycombinator.com/item?id=27343593 - May 2021 (7 comments)
What every programmer should know about memory, Part 1 (2007) - https://news.ycombinator.com/item?id=25908018 - Jan 2021 (26 comments)
What Every Programmer Should Know About Memory (2007) [pdf] - https://news.ycombinator.com/item?id=19302299 - March 2019 (97 comments)
What every programmer should know about memory, Part 1 (2007) - https://news.ycombinator.com/item?id=15300547 - Sept 2017 (12 comments)
What Every Programmer Should Know About Memory (2007) [pdf] - https://news.ycombinator.com/item?id=14622861 - June 2017 (26 comments)
What every programmer should know about memory (2007) - https://news.ycombinator.com/item?id=10601626 - Nov 2015 (53 comments)
What Every Programmer Should Know About Memory (2007) [pdf] - https://news.ycombinator.com/item?id=9360778 - April 2015 (15 comments)
What every programmer should know about memory, Part 1 - https://news.ycombinator.com/item?id=3919429 - May 2012 (79 comments)
What every programmer should know about memory [2007] - https://news.ycombinator.com/item?id=3360188 - Dec 2011 (2 comments)
What every programmer should know about memory - https://news.ycombinator.com/item?id=1511990 - July 2010 (37 comments)
What every programmer should know about memory, Part 1 - https://news.ycombinator.com/item?id=1394346 - June 2010 (4 comments)
What Every Programmer Should Know About Memory - https://news.ycombinator.com/item?id=659367 - June 2009 (1 comment)
What every programmer should know about memory, Part 1 - https://news.ycombinator.com/item?id=58627 - Sept 2007 (7 comments)
This is from 2007! Still somewhat relevant though.
It's almost all exactly the same, just with different parameters.
`curl cht.sh/latency` illustrates potential bottlenecks quite well given how succinct it is.