Table of Contents

This series is designed for programmers who know how to write programs but don’t know how hardware runs them. It brings you up to speed on how modern CPUs work, how to estimate the expected speed of performance-critical code, and the basic optimization techniques every programmer should know.

The course is broken into parts, with the first part (the “prologue”) being strictly a demonstration with no associated homework. Later parts feature weekly homework.

Q&A videos are posted regularly. If you have a question you’d like answered, please post it in the comments of the most recent Q&A video. Homework listings are available on GitHub.

Prologue: The Five Multipliers (3 1/2 hours, no homework)

This part of the course gives simple demonstrations of how seemingly minor code changes can produce dramatically different software performance, even for very simple operations.

  1. Welcome to the Performance-Aware Programming Series! (22:05)

  2. Waste (32:56)

  3. Instructions Per Clock (25:05)

  4. Single Instruction, Multiple Data (35:31)

  5. Caching (22:55)

  6. Multithreading (32:11)

  7. Python Revisited (36:22)
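
The prologue’s “Waste” demonstration can be sketched as follows. This is a hypothetical illustration, not the course’s actual code: two ways of computing the same sum, where the only difference is how much interpreter overhead each element pays.

```python
import time

# Hypothetical sketch (not the course's code): two ways to sum the same
# numbers, illustrating how a seemingly minor code change removes
# per-element interpreter overhead ("waste") from a trivial operation.
values = list(range(1_000_000))

def loop_sum(xs):
    total = 0
    for x in xs:        # one interpreted iteration per element
        total += x
    return total

start = time.perf_counter()
slow = loop_sum(values)
loop_time = time.perf_counter() - start

start = time.perf_counter()
fast = sum(values)      # the same reduction, run inside the C runtime
builtin_time = time.perf_counter() - start

assert slow == fast     # identical result, very different cost
print(f"loop: {loop_time:.4f}s  builtin: {builtin_time:.4f}s")
```

Both versions produce the identical answer; timing them side by side on your own machine shows how large the gap can be even before any hardware-level effects are considered.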

Interlude (1 hour, no homework)

  1. The Haversine Distance Problem (30:28)

  2. “Clean” Code, Horrible Performance (22:40)
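
The haversine distance that serves as the series’ running example computes the great-circle distance between two latitude/longitude points. A minimal sketch of the standard formula, assuming a spherical Earth (the radius constant below is an assumption; conventions vary):

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6372.8  # assumed sphere radius; conventions vary

def haversine(lat0, lon0, lat1, lon1):
    """Great-circle distance in km between two points given in degrees."""
    lat0, lon0, lat1, lon1 = map(radians, (lat0, lon0, lat1, lon1))
    dlat = lat1 - lat0
    dlon = lon1 - lon0
    # Standard haversine formula: a is the square of half the chord length
    a = sin(dlat / 2) ** 2 + cos(lat0) * cos(lat1) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```

For example, one degree of longitude along the equator, `haversine(0, 0, 0, 1)`, comes out to roughly 111.2 km with this radius.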

Part 1: Reading ASM (7 hours, plus homework)

This part of the course is designed to ensure that everyone taking the course has a solid understanding of how a CPU works at the assembly-language level.

  1. Instruction Decoding on the 8086 (28:28)

  2. Decoding Multiple Instructions and Suffixes (43:51)

  3. Opcode Patterns in 8086 Arithmetic (20:01)

  4. 8086 Decoder Code Review (1:17:49)

  5. Using the Reference Decoder as a Shared Library (8:48)

  6. Simulating Non-memory MOVs (18:00)

  7. Simulating ADD, SUB, and CMP (25:56)

  8. Simulating Conditional Jumps (19:41)

  9. Simulating Memory (26:32)

  10. Simulating Real Programs (16:02)

  11. Other Common Instructions (19:43)

  12. The Stack (26:58)

  13. Estimating Cycles (23:56)

  14. From 8086 to x64 (26:21)

  15. 8086 Simulation Code Review (33:05)

Part 2: Basic Profiling (4 hours, plus homework)

In this part of the course, we learn how to measure time and how to instrument programs to determine automatically where time is being spent.

  1. Generating Haversine Input JSON (15:40)

  2. Writing a Simple Haversine Distance Processor (12:09)

  3. Initial Haversine Processor Code Review (29:22)

  4. Introduction to RDTSC (48:05)

  5. How does QueryPerformanceCounter measure time? (31:43)

  6. Instrumentation-Based Profiling (18:01)

  7. Profiling Nested Blocks (26:12)

  8. Profiling Recursive Blocks (30:44)

  9. A First Look at Profiling Overhead (18:37)

  10. Comparing the Overhead of RDTSC and QueryPerformanceCounter (13:00)
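
The instrumentation-based profiler built in this part reads the CPU’s timestamp counter via RDTSC. As a platform-neutral sketch of the same idea (an assumption on my part: the course code is native code reading RDTSC directly, so Python’s `perf_counter_ns` stands in for the counter here):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Sketch of instrumentation-based profiling: record elapsed time per
# named block and accumulate it in a table, the way an RDTSC-based
# profiler accumulates timestamp-counter deltas.
profile = defaultdict(int)   # block name -> accumulated nanoseconds

@contextmanager
def timed_block(name):
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        profile[name] += time.perf_counter_ns() - start

with timed_block("parse"):
    data = [float(x) for x in "1 2 3 4".split()]
with timed_block("sum"):
    total = sum(data)

for name, ns in profile.items():
    print(f"{name}: {ns} ns")
```

This naive version double-counts nested blocks and mishandles recursion; correcting for both is exactly what the “Profiling Nested Blocks” and “Profiling Recursive Blocks” videos address.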

Part 3: Moving Data (13 hours, plus homework, and 3 hours of bonus content)

Using our knowledge from Parts 1 and 2, in Part 3 we look at how data moves into the CPU, and how to estimate the upper limits that data movement imposes on our software’s performance.

  1. Measuring Data Throughput (21:54)

  2. Repetition Testing (27:57)

  3. Monitoring OS Performance Counters (20:25)

  4. Page Faults (38:52)

    1. Probing OS Page Fault Behavior* (33:05)

    2. Four-Level Paging* (31:23)

    3. Analyzing Page Fault Anomalies* (31:44)

    4. Powerful Page Mapping Techniques* (39:20)

    5. Faster Reads with Large Page Allocations (25:52)

    6. Memory-Mapped Files* (20:46)

  5. Inspecting Loop Assembly (32:31)

  6. Intuiting Latency and Throughput (22:57)

  7. Analyzing Dependency Chains (29:06)

  8. Linking Directly to ASM for Experimentation (48:07)

  9. CPU Front End Basics (31:09)

  10. Branch Prediction (42:03)

  11. Code Alignment (32:03)

  12. The RAT and the Register File (45:21)

  13. Execution Ports and the Scheduler (34:51)

  14. Increasing Read Bandwidth with SIMD Instructions (37:52)

  15. Cache Size and Bandwidth Testing (34:00)

  16. Non-Power-of-Two Cache Size Testing (35:15)

  17. Latency and Throughput, Again (37:55)

  18. Unaligned Load Penalties (27:25)

  19. Cache Sets and Indexing (30:34)

  20. Non-Temporal Stores (28:11)

  21. Prefetching (21:31)

  22. Prefetching Wrap-up (19:40)

    1. A Closer Look at the Prefetching Performance Graph* (33:38)

  23. 2x Faster File Reads (31:01)

  24. Overlapping File Reads with Computation (20:35)

  25. Testing Memory-Mapped Files* (18:03)

* Entries with an asterisk are “bonus” entries that can be skipped.
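
The repetition-testing idea (item 2 above) can be sketched as follows. This is a simplified illustration, not the course’s harness: re-run an operation until no new minimum has been seen for a while, and treat the fastest run as the best estimate of what the hardware can do, since noise (interrupts, scheduling, cold caches) only ever adds time.

```python
import time

def repetition_test(op, window_seconds=1.0):
    """Re-run op until `window_seconds` pass without a new minimum.

    Returns the fastest observed time. Sketch of repetition testing:
    the minimum over many runs approximates peak hardware speed.
    """
    best = float("inf")
    deadline = time.perf_counter() + window_seconds
    while time.perf_counter() < deadline:
        start = time.perf_counter()
        op()
        elapsed = time.perf_counter() - start
        if elapsed < best:
            best = elapsed
            # A new minimum resets the search window
            deadline = time.perf_counter() + window_seconds
    return best

buffer = bytearray(1_000_000)
best = repetition_test(lambda: bytes(buffer), window_seconds=0.1)
print(f"best: {best * 1e6:.1f} us  ~{len(buffer) / best / 1e9:.2f} GB/s")
```

Dividing bytes moved by the best time, as on the last line, gives the throughput estimate that the rest of Part 3 compares against the machine’s theoretical limits.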

Part 4: Polynomial Evaluation (optional - 6 1/2 hours, plus homework)

Part 4 prepares us for the analysis in Part 5 by replacing the sin, cos, sqrt, and asin math library functions with our own implementations. We look at what kinds of operations CPUs perform natively, how more complicated operations are broken down into native ones, and how math library functions are typically structured.

If you’re not interested in how math library functions work under the hood, and how they should be structured to minimize computation, you can skip this section and start with the supplied source code in Part 5.

  1. Reference Haversine Code (19:49)

  2. Identifying Non-inlined Math Functions (19:27)

  3. Determining Input Ranges (14:37)

  4. Introduction to SSE Intrinsics (48:51)

  5. Function Approximation (56:28)

  6. Range Reduction (16:32)

  7. Approximation Using Higher-Power Polynomials (25:04)

  8. Horner's Rule (28:22)

  9. Fused Multiply-Add (31:26)

  10. Coefficient Arrays for Polynomial Evaluation (24:14)

  11. Approximating Arcsine (24:35)

  12. Extending Arcsine to the Full Input Range (12:10)
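
Horner’s rule (item 8 above) evaluates a degree-n polynomial with one multiply and one add per coefficient, the exact shape a fused multiply-add (FMA) instruction executes in hardware. A minimal sketch:

```python
def horner(coeffs, x):
    """Evaluate a polynomial, coefficients given from highest power to lowest.

    horner([a, b, c], x) computes a*x**2 + b*x + c as (a*x + b)*x + c:
    one multiply-add per coefficient after the first, which maps
    directly onto a chain of fused multiply-add operations.
    """
    result = coeffs[0]
    for c in coeffs[1:]:
        result = result * x + c   # each step is one FMA-shaped operation
    return result

# Example: 2x^3 - 6x^2 + 2x - 1 evaluated at x = 3
print(horner([2, -6, 2, -1], 3))  # → 5
```

Compared with evaluating each power of x separately, this halves the operation count and forms a single dependency chain, which is why it is the standard structure for the polynomial approximations built in this part.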

Part 5: Computation

Part 5 extends the techniques learned in Part 3 from memory operations to computation more broadly. We learn to read full CPU diagrams, how to estimate the flow of computational microoperations through the CPU, how to utilize vector operations to increase throughput, and how to reason about the peak performance of computational processes.

  1. Our Very Own Haversine (9:02)

  2. Removing Waste (27:51)

  3. Simplified Haversine Candidate (34:11)

  4. Selectively Preventing Optimizations (32:26)

  5. Reading CPU Diagrams (30:40)

  6. Better Dead Code Elimination - Or Is It? (12:06)

  7. Our Nemesis Returns (12:43)

Part 5 is still in progress. This listing will be updated as new videos are posted.

2024 Halloween Spooktacular Challenge

  1. Announcing the 2024 Halloween Spooktacular Challenge!

  2. Day 1: The Challenge

  3. Day 2: Reboot Your Machine

  4. Day 3: Trace in Real-Time Mode

  5. Day 4: Use TraceQueryInformation

  6. Day 5: Call TraceSetInformation Twice

  7. Day 6: PMCs Only Work for A Subset of Event Types

  8. Day 7: Look for PMCs in the ExtendedData

  9. Day 8: Mārtiņš Možeiko's Miniperf

  10. Day 9: What's Left?

  11. Day 10: Use TraceEvent

  12. Day 11: Define Your Own Event UserData

  13. Day 12: Find Another PMC Event Type

  14. Day 13: Use SysCallExit to Mark Start Points

  15. Day 14: Use SysCallEnter to Mark Stop Points

  16. Day 15: Use GetEventProcessorIndex to Help Track Profile Regions

  17. Real-time PMCs on Windows with ETW

1994 Microsoft Internship Interview Questions

  1. The Four Programming Questions from My 1994 Microsoft Internship Interview (19:02)

  2. Question #1: Rectangle Copy (24:50)

  3. Question #2: String Copy (14:50)

  4. Question #3: Flood Fill Detection (23:58)

  5. Question #4: Outline a Circle (1:09:01)