Context Switching Performance — Threads vs Processes


Gowtham Tamilarasan

One of the core features of operating systems is multitasking: the ability to run multiple tasks on a CPU at what appears to be the same time. These days computers have multiple CPU cores that can run tasks in parallel, but even then, the number of tasks running on a computer is far greater than the number of CPUs it has. The operating system handles this by allotting each task a slice of CPU time, giving users the illusion that all the tasks are executing concurrently.

When a running process is replaced by another process at the end of its time slice, the operating system stores the state of the running process in memory so that it can later resume from where it left off. Saving the state of the running process and restoring the state of the next process to run is called context switching.

Context switching also occurs when handling an interrupt or when a process switches between user mode and kernel mode during execution of system calls like reading a file.

By default, in the Linux operating system, each process has a single thread called the main thread. A process can have more than one thread, and switching between these threads also involves context switching.

Context switching doesn’t come for free; there is performance overhead involved. The overhead varies based on several factors, such as the size of the state to store and restore, and whether the switch is between threads of the same process or between different processes.

Context switching between threads of the same process has less overhead than context switching between two processes. In this blog post, let’s see the performance overhead difference between processes and threads in action, and understand why it exists.

Let’s first see what actually happens during context switching.

  1. Saving the state: The running process’s state from the CPU registers, such as the program counter, stack pointer, and virtual memory page table pointer, is saved in the process control block (PCB). Every process has its own PCB, stored in memory managed by the kernel.
  2. Flushing the cache: Translation Lookaside Buffer (TLB) entries for the process are invalidated. The TLB is a cache in the memory management unit that speeds up translation of virtual page addresses to physical frame addresses. Different processes can use the same virtual addresses that map to different physical addresses, so this cache invalidation is essential.
  3. Restoring the state: The state of the new process is read from its PCB and restored into the CPU registers. (Linux exposes counters for these switches; see the sketch right after this list.)
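As a quick aside, Linux reports per-task context switch counters in /proc/<pid>/status. Below is a minimal Go sketch (not part of the benchmark that follows) that prints how many times the current process has been switched out, either voluntarily or because the scheduler preempted it:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

func main() {
    // /proc/self/status is Linux-specific; among other fields it carries:
    //   voluntary_ctxt_switches    - the task gave up the CPU (e.g. it blocked)
    //   nonvoluntary_ctxt_switches - the scheduler preempted it after its time slice
    f, err := os.Open("/proc/self/status")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        line := scanner.Text()
        if strings.HasPrefix(line, "voluntary_ctxt_switches") ||
            strings.HasPrefix(line, "nonvoluntary_ctxt_switches") {
            fmt.Println(line)
        }
    }
}

Inside a busy, CPU-bound program like the one we are about to write, it is mostly the nonvoluntary count that climbs, since the scheduler keeps preempting the task.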

Time to see this in action.

We need a task that does a lot of CPU work to observe the context switching overhead. Let’s write a Go program that counts prime numbers using deliberately naive trial division. We will then run it concurrently with multiple threads and with multiple processes to see the difference.

package main

import (
    "fmt"
)

const (
    LIMIT = 30000
)

func exhaust_cpu(limit int) {
    // Start at 1 to account for 2, the only even prime.
    primes := 1
    for i := 3; i <= limit; i += 1 {
        // Deliberately naive: try every divisor up to limit (not sqrt(i))
        // so each iteration burns plenty of CPU time.
        divisible := 1
        for j := 2; j <= limit; j++ {
            if i%j == 0 {
                divisible += 1
            }
        }
        // Only i itself divided it evenly, so i is prime.
        if divisible == 2 {
            primes += 1
        }
    }
    fmt.Println("Primes: ", primes)
}

func main() {
    exhaust_cpu(LIMIT)
}

Let’s use goroutines to run this computation in 5 threads. With 5 CPU-bound goroutines and GOMAXPROCS set to 5, the Go runtime schedules them onto 5 OS threads.

$ cat threads.go
package main

import (
    "fmt"
    "runtime"
    "sync"
)

const (
    LIMIT   = 30000
    THREADS = 5
)

func exhaust_cpu(wg *sync.WaitGroup, limit int) {
    defer wg.Done()
    primes := 1
    for i := 3; i <= limit; i += 1 {
        divisible := 1
        for j := 2; j <= limit; j++ {
            if i%j == 0 {
                divisible += 1
            }
        }
        if divisible == 2 {
            primes += 1
        }
    }
    fmt.Println("Primes: ", primes)
}

func main() {
    // Limit the number of OS threads running Go code simultaneously.
    runtime.GOMAXPROCS(THREADS)
    wg := sync.WaitGroup{}
    for i := 0; i < THREADS; i += 1 {
        // Add to the WaitGroup before launching the goroutine so Wait
        // cannot observe the counter before the goroutine is accounted for.
        wg.Add(1)
        go exhaust_cpu(&wg, LIMIT)
    }
    wg.Wait()
}

To run it as 5 processes, let’s set the thread count of the individual process to 1 and use a shell script to start 5 such processes in the background and wait for them to complete.

# set the threads count to 1 for individual processes
$ diff threads.go procs.go
< THREADS = 5
---
> THREADS = 1

$ cat run_procs.sh
PROCS_COUNT=${1:-5}
go build procs.go
for i in $(seq 1 ${PROCS_COUNT})
do
    # Run process in background
    ./procs &
done
# Wait for background processes to complete
wait
echo "Waited for all procs"

I am executing these programs on a multi-core system, so to make context switching happen frequently, I will pin everything to a single CPU core (core 0) using the taskset utility.

# Run threads on core 0
go build threads.go
time taskset -c 0 ./threads

# Run processes on core 0 (the launch line inside run_procs.sh now uses taskset)
grep taskset run_procs.sh
time taskset -c 0 ./procs &

You can see from the `htop` output below that the threads are running only on core 0.

[htop screenshot: the program’s threads all running on CPU core 0]

In Linux, threads of the same process get their own IDs (each thread is a separate task in the kernel), so do not mistake them for different processes.

# See the threads under the process in proc file system
$ ls /proc/10838/task/
10838 10839 10840 10841 10842 10843

Let’s take a look at the time required for these programs to run.

$ time taskset -c 0 ./threads
Primes: 3245
Primes: 3245
Primes: 3245
Primes: 3245
Primes: 3245

real 0m43.525s
user 0m43.488s
sys 0m0.028s ****************

$ time bash run_procs.sh
Primes: 3245
Primes: 3245
Primes: 3245
Primes: 3245
Primes: 3245

real 0m43.643s
user 0m8.728s
sys 0m0.008s

real 0m43.654s
user 0m8.712s
sys 0m0.016s

real 0m43.660s
user 0m8.724s
sys 0m0.004s

real 0m43.660s
user 0m8.728s
sys 0m0.000s

real 0m43.660s
user 0m8.720s
sys 0m0.012s

Waited for all procs

real 0m43.823s
user 0m43.776s
sys 0m0.072s ****************

There is not much difference between the time taken to run the program with 5 threads and with 5 processes (about a 1% difference). But didn’t I tell you that the context switching overhead is lower for threads?

I intentionally left out one crucial piece of information earlier. Threads within a process share the same virtual memory space, so when the kernel switches between those threads, the page table pointer change and the TLB invalidation steps are skipped, which reduces the overhead.
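To make "shared virtual memory space" concrete, here is a tiny illustrative sketch (not part of the benchmark): every goroutine in a Go process sees the same variable at the same virtual address, because they all live inside one address space.

package main

import (
    "fmt"
    "sync"
)

// counter lives in the process's single virtual address space.
var counter int

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 3; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            // Every goroutine prints the same address: they share the
            // process's page tables, so no TLB flush is needed when the
            // kernel switches between the threads running them.
            fmt.Printf("goroutine %d sees counter at %p\n", id, &counter)
        }(i)
    }
    wg.Wait()
}

Two separate processes, by contrast, can use the same virtual address while referring to different physical memory, which is exactly why the TLB has to be flushed when switching between them.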

Since these processes do not use much memory, there are not many TLB entries to begin with, so the overhead of TLB invalidation does not show up here.

So let’s make this program consume some memory.

$ cat threads.go
package main

import (
    "fmt"
    "runtime"
    "sync"
)

const (
    LIMIT   = 30000
    THREADS = 5
)

func exhaust_cpu(wg *sync.WaitGroup, limit int, mlist []int) {
    defer wg.Done()
    primes := 1
    for i := 3; i <= limit; i += 1 {
        // Access the memory so its virtual addresses get cached in the TLB
        for k := range mlist {
            mlist[k] = 100
        }
        divisible := 1
        for j := 2; j <= limit; j++ {
            if i%j == 0 {
                divisible += 1
            }
        }
        if divisible == 2 {
            primes += 1
        }
    }
    fmt.Println("Primes: ", primes)
}

func main() {
    runtime.GOMAXPROCS(THREADS)
    // Declare a large chunk of memory shared by all the goroutines
    mlist := make([]int, 1000000)
    wg := sync.WaitGroup{}
    for i := 0; i < THREADS; i += 1 {
        wg.Add(1)
        go exhaust_cpu(&wg, LIMIT, mlist)
    }
    wg.Wait()
}

$ diff threads.go procs.go
13c13
< THREADS = 5
---
> THREADS = 1

$ time taskset -c 0 ./threads
Primes: 3245
Primes: 3245
Primes: 3245
Primes: 3245
Primes: 3245

real 2m19.336s
user 2m19.256s
sys 0m0.072s *******************

$ time bash run_procs.sh
Primes: 3245
Primes: 3245

real 2m26.499s
user 0m29.156s
sys 0m0.120s

real 2m26.521s
user 0m29.144s
sys 0m0.148s


Waited for all procs

real 2m26.692s
user 2m25.920s
sys 0m0.784s *******************

Now you can see a difference in runtime. Take note of the `sys` time in the output, which includes the context switching work: 0m0.072s for threads versus 0m0.784s for processes. However, there is a catch: all 5 threads share the one 1M-integer list, while each process has its own separate 1M-integer list, so the processes collectively touch 5x as much memory. To make the comparison fair, we need to give each thread its own list, the same amount of memory as each process gets.

$ diff threads.go.old threads.go
38c38,41
< mlist := make([]int, 1000000)
---
> // 1M integer list for each thread
> mlist := make([][]int, THREADS)
> for i := 0; i < THREADS; i += 1 {
>     mlist[i] = make([]int, 1000000)
> }
42c45
< go exhaust_cpu(&wg, LIMIT, mlist)
---
> go exhaust_cpu(&wg, LIMIT, mlist[i])

$ time taskset -c 0 ./threads
Primes: 3245
Primes: 3245
Primes: 3245
Primes: 3245
Primes: 3245

real 2m26.029s
user 2m25.932s
sys 0m0.084s *******************

$ time bash run_procs.sh
Primes: 3245
Primes: 3245

real 2m26.366s
user 0m29.136s
sys 0m0.180s

real 2m26.442s
user 0m29.188s
sys 0m0.104s

Waited for all procs

real 2m26.601s
user 2m25.880s
sys 0m0.740s *******************

Now each thread and each process gets the same amount of memory, and the processes still take longer than the threads. The sys time for the threads program is 0.084s across all 5 threads, while each process spends 0.100+ seconds in sys, adding up to 0.740s in total. This is the context switching overhead difference between processes and threads showing up.

In conclusion, understanding the performance overhead of context switching between processes and threads is crucial for optimizing applications that do a lot of multitasking. As demonstrated, context switching between threads of the same process is generally cheaper than switching between threads of different processes because they share the same memory. However, this doesn’t mean you should always choose threads for multitasking. Sharing memory also brings other challenges, such as race conditions when accessing shared objects (see the sketch below). Some multitasking applications work better with processes than threads, depending on their architecture and complexity. By carefully considering these factors, we can make an informed decision on whether to use threads or processes.
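As a small illustration of that trade-off, here is a hedged sketch (not from the benchmark above) of the coordination shared memory forces on you: two goroutines incrementing the same counter need a sync.Mutex, whereas two processes with their own copies of the counter would not race at all.

package main

import (
    "fmt"
    "sync"
)

func main() {
    var (
        mu      sync.Mutex
        counter int
        wg      sync.WaitGroup
    )
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := 0; j < 100000; j++ {
                // Without the mutex, the two goroutines race on counter
                // and the final value is unpredictable (run with -race to see).
                mu.Lock()
                counter++
                mu.Unlock()
            }
        }()
    }
    wg.Wait()
    fmt.Println("counter:", counter) // always 200000 with the mutex
}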

In another blog post, I will write about the effect of context switching between user and kernel modes.

I hope you enjoyed reading this article and learned something new today. If you did, please consider leaving a clap or comment; it will encourage me to write more such content and continue sharing my learnings with you all.