Context switching – How Programming Languages Model Asynchronous Program Flow

Creating new threads takes time

Creating a new OS thread involves some bookkeeping and initialization overhead, so while switching between two existing threads in the same process is pretty fast, creating new ones and discarding ones you don’t use anymore involves work that takes time. All this extra work will limit throughput if a system needs to create and discard a lot of them. This can be a problem if you have a huge number of small tasks that need to be handled concurrently, which is often the case when dealing with a lot of I/O.
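
If you want a rough feel for this overhead on your own machine, a small experiment like the following sketch can help. It only uses Rust’s standard library; the thread count of 1,000 is arbitrary, and the numbers will vary a lot between systems.

use std::thread;
use std::time::Instant;

fn main() {
    let start = Instant::now();
    // Spawn 1,000 OS threads that each do (almost) nothing.
    let handles: Vec<_> = (0..1_000).map(|_| thread::spawn(|| {})).collect();
    // Join them all so we measure creation and teardown, not just the spawn calls.
    for handle in handles {
        handle.join().unwrap();
    }
    println!("spawned and joined 1,000 OS threads in {:?}", start.elapsed());
}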

Each thread has its own stack

We’ll cover stacks in detail later in this book, but for now, it’s enough to know that they occupy a fixed amount of memory. Each OS thread comes with its own stack, and even though many systems allow the size to be configured, the stack is still fixed and can’t grow or shrink at runtime. A fixed-size stack is, after all, what makes stack overflows possible, and that becomes a problem if you configure it to be too small for the tasks you’re running.

If we have many small tasks that only require a little stack space but we reserve much more than we need, we will occupy large amounts of memory and possibly run out of it.
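
As a small illustration, Rust’s standard library lets you pick the stack size per thread through thread::Builder. The 32 KiB figure below is an arbitrary example; the default size depends on the platform.

use std::thread;

fn main() {
    // Request a 32 KiB stack instead of the platform default (often several MiB).
    let handle = thread::Builder::new()
        .stack_size(32 * 1024)
        .spawn(|| {
            // Deep recursion or large local buffers would overflow this small stack.
            println!("running on a thread with a small, fixed-size stack");
        })
        .expect("failed to spawn thread");
    handle.join().unwrap();
}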

Context switching

As you now know, threads and schedulers are tightly connected. Context switching happens when the CPU stops executing one thread and proceeds with another one. Even though this process is highly optimized, it still involves storing and restoring the register state, which takes time. Every time that you yield to the OS scheduler, it can choose to schedule a thread from a different process on that CPU.

You see, threads created by these systems belong to a process. When you start a program, it starts a process, and the process creates at least one initial thread where it executes the program you’ve written. Each process can spawn multiple threads that share the same address space.

That means that threads within the same process can access shared memory and can access the same resources, such as files and file handles. One consequence of this is that when the OS switches contexts by stopping one thread and resuming another within the same process, it doesn’t have to save and restore all the state associated with that process, just the state that’s specific to that thread.

On the other hand, when the OS switches from a thread associated with one process to a thread associated with another, the new process will use a different address space, and the OS needs to take measures to make sure that process “A” doesn’t access data or resources that belong to process “B”. If it didn’t, the system wouldn’t be secure.

The consequence is that caches might need to be flushed and more state might need to be saved and restored. In a highly concurrent system under load, these context switches can take extra time and thereby limit the throughput in a somewhat unpredictable manner if they happen frequently enough.
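
To get a rough intuition for this cost, you can force context switches on purpose. In the sketch below, two threads ping-pong a message over channels, so every round trip involves at least two switches between them. It’s a crude, standard-library-only measurement, and the numbers depend heavily on the OS and the load on the system.

use std::sync::mpsc;
use std::thread;
use std::time::Instant;

const ROUNDS: u32 = 100_000;

fn main() {
    let (tx_a, rx_a) = mpsc::channel();
    let (tx_b, rx_b) = mpsc::channel();

    // The second thread just echoes every message back.
    let handle = thread::spawn(move || {
        for _ in 0..ROUNDS {
            let n: u32 = rx_a.recv().unwrap();
            tx_b.send(n).unwrap();
        }
    });

    let start = Instant::now();
    for i in 0..ROUNDS {
        tx_a.send(i).unwrap();
        rx_b.recv().unwrap();
    }
    handle.join().unwrap();

    // Each round trip forces at least two switches between the two threads.
    println!("average round trip: {:?}", start.elapsed() / ROUNDS);
}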

Threads provided by the operating system – How Programming Languages Model Asynchronous Program Flow

Important!

Definitions will vary depending on what book or article you read. For example, if you read about how a specific operating system works, you might see that processes or threads are abstractions that represent “tasks”, which will seem to contradict the definitions we use here. As I mentioned earlier, the choice of reference frame is important, and it’s why we take so much care to define the terms we use thoroughly as we encounter them throughout the book.

The definition of a thread can also vary by operating system, even though most popular systems share a similar definition today. Most notably, Solaris (pre-Solaris 9, which was released in 2002) used to have a two-level thread system that differentiated between application threads, lightweight processes, and kernel threads. This was an implementation of what we call M:N threading, which we’ll get to know more about later in this book. Just beware that if you read older material, the definition of a thread in such a system might differ significantly from the one that’s commonly used today.

Now that we’ve gone through the most important definitions for this chapter, it’s time to talk more about the most popular ways of handling concurrency when programming.

Threads provided by the operating system

Note!

We call this 1:1 threading. Each task is assigned one OS thread.

Since this book will not focus specifically on OS threads as a way to handle concurrency going forward, we’ll treat them more thoroughly here.

Let’s start with the obvious. To use threads provided by the operating system, you need, well, an operating system. Before we discuss the use of threads as a means to handle concurrency, we need to be clear about what kind of operating systems we’re talking about since they come in different flavors.

Embedded systems are more widespread now than ever before. This kind of hardware might not have the resources for an operating system, and if it does, you might use a radically different kind of operating system tailored to your needs, as these systems tend to be less general-purpose and more specialized in nature.

Their support for threads, and the characteristics of how they schedule them, might be different from what you’re used to in operating systems such as Windows or Linux.

Since covering all the different designs would be a book on its own, we’ll limit the scope to threads as they’re used on Windows and Linux-based systems running on popular desktop and server CPUs.

OS threads are simple to implement and simple to use. We simply let the OS take care of everything for us. We do this by spawning a new OS thread for each task we want to accomplish and writing code as we normally would.

The runtime we use to handle concurrency for us is the operating system itself. In addition to these advantages, you get parallelism for free. However, there are also some drawbacks and complexities resulting from directly managing parallelism and shared resources.
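
In Rust, that can look as simple as the following sketch: one OS thread per task, with each task written as ordinary sequential code. The task names are placeholders for illustration.

use std::thread;

fn main() {
    // Hypothetical task names, just for illustration.
    let tasks = vec!["task-1", "task-2", "task-3"];

    // 1:1 threading: one OS thread per task; the OS schedules them for us.
    let handles: Vec<_> = tasks
        .into_iter()
        .map(|name| {
            thread::spawn(move || {
                // Each task is written as ordinary, sequential code.
                println!("{name} is doing its work");
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}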

Threads – How Programming Languages Model Asynchronous Program Flow

We keep referring to threads all throughout this book, so before we get too far in, let’s stop and give “thread” a good definition since it’s one of those fundamental terms that causes a lot of confusion.

In the most general sense, a thread refers to a thread of execution, meaning a set of instructions that need to be executed sequentially. If we tie this back to the first chapter of this book, where we provided several definitions under the Concurrency vs. Parallelism subsection, a thread of execution is similar to what we defined as a task with multiple steps that need resources to progress.

The generality of this definition can cause some confusion. To one person, a thread obviously refers to an OS thread; to another, it can simply refer to any abstraction that represents a thread of execution on a system.

Threads are often divided into two broad categories:

  • OS threads: These threads are created by the OS and managed by the OS scheduler. On Linux, this is known as a kernel thread.
  • User-level threads: These threads are created and managed by us as programmers without the OS knowing about them.

Now, this is where things get a bit tricky: OS threads on most modern operating systems have a lot of similarities. Some of these similarities are dictated by the design of modern CPUs. One example of this is that most CPUs assume that there is a stack they can perform operations on, and they have a register for the stack pointer and instructions for stack manipulation.

User-level threads can, in their broadest sense, refer to any implementation of a system (runtime) that creates and schedules tasks, and you can’t make the same assumptions as you do with OS threads. They can closely resemble OS threads by using separate stacks for each task, as we’ll see in Chapter 5 when we go through our fiber/green threads example, or they can be radically different in nature, as we’ll see when we go through how Rust models concurrent operations later on in Part 3 of this book.

No matter the definition, a set of tasks needs something that manages them and decides who gets what resources to progress. The most obvious resource on a computer system that all tasks need to progress is CPU time. We call the “something” that decides who gets CPU time to progress a scheduler.

Most likely, when someone refers to a “thread” without adding extra context, they refer to an OS thread/kernel thread, so that’s what we’ll do going forward.

I’ll also keep referring to a thread of execution as simply a task. I find the topic of asynchronous programming easier to reason about when we limit the use of terms that have different assumptions associated with them depending on the context as much as possible.

With that out of the way, let’s go through some defining characteristics of OS threads while we also highlight their limitations.

Definitions – How Programming Languages Model Asynchronous Program Flow

In the previous chapter, we covered asynchronous program flow, concurrency, and parallelism in general terms. In this chapter, we’ll narrow our scope. Specifically, we’ll look into different ways to model and deal with concurrency in programming languages and libraries.

It’s important to keep in mind that threads, futures, fibers, goroutines, promises, etc. are abstractions that give us a way to model an asynchronous program flow. They have different strengths and weaknesses, but they share a goal of giving programmers an easy-to-use (and importantly, hard to misuse), efficient, and expressive way of creating a program that handles tasks in a non-sequential, and often unpredictable, order.

The lack of precise definitions is prevalent here as well; many terms have a name that stems from a concrete implementation at some point in time but has later taken on a more general meaning that encompasses different implementations and varieties of the same thing.

We’ll first go through a way of grouping different abstractions together based on their similarities before we go on to discuss the pros and cons of each of them. We’ll also go through important definitions that we’ll use throughout the book and discuss OS threads in quite some detail.

The topics we discuss here are quite abstract and complicated so don’t feel bad if you don’t understand everything immediately. As we progress through the book and you get used to the different terms and techniques by working through some examples, more and more pieces will fall into place.

Specifically, the following topics will be covered:

• Definitions
• Threads provided by the operating system
• Green threads/stackful coroutines/fibers
• Callback-based approaches
• Promises, futures, and async/await

Definitions

We can broadly categorize abstractions over concurrent operations into two groups:

  1. Cooperative: These are tasks that yield voluntarily either by explicitly yielding or by calling a function that suspends the task when it can’t progress further before another operation has finished (such as making a network call). Most often, these tasks yield to a scheduler of some sort. Examples of this are tasks generated by async/await in Rust and JavaScript.
  2. Non-cooperative: Tasks that don’t necessarily yield voluntarily. In such a system, the scheduler must be able to pre-empt a running task, meaning that the scheduler can stop the task and take control over the CPU even though the task would have been able to do work and progress. Examples of this are OS threads and goroutines (after Go version 1.14).

Figure 2.1 – Non-cooperative vs. cooperative multitasking

Note

In a system where the scheduler can pre-empt running tasks, tasks can also yield voluntarily as they do in a cooperative system, and a system that relies solely on pre-emption is rare.

We can further divide these abstractions into two broad categories based on the characteristics of their implementation:

  1. Stackful: Each task has its own call stack. This is often implemented as a stack that’s similar to the stack used by the operating system for its threads. Stackful tasks can suspend execution at any point in the program as the whole stack is preserved.
  2. Stackless: There is not a separate stack for each task; they all run sharing the same call stack. A task can’t be suspended in the middle of a stack frame, limiting the runtime’s ability to pre-empt the task. However, they need to store/restore less information when switching between tasks so they can be more efficient.

There are more nuances to these two categories that you’ll get a deep understanding of when we implement an example of both a stackful coroutine (fiber) and a stackless coroutine (Rust futures generated by async/await) later in the book. For now, we keep the details to a minimum to just provide an overview.
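
To make the stackless, cooperative variant a bit more concrete, here is a minimal sketch using Rust’s async/await together with the block_on executor from the futures crate (an assumption for this example; any executor would do). The compiler turns each async function into a state machine that can only be suspended at .await points and that does nothing until an executor polls it.

use futures::executor::block_on; // minimal executor; assumes the `futures` crate

async fn fetch_data() -> u32 {
    // In a real program, this is where we'd await a network call.
    // Every `.await` is a point where the task can hand control back
    // to the scheduler; between such points it can't be pre-empted.
    42
}

async fn run() {
    let value = fetch_data().await;
    println!("got {value}");
}

fn main() {
    // The future returned by `run()` does nothing until it's polled.
    block_on(run());
}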

Firmware – Concurrency and Asynchronous Programming: a Detailed Overview

Interrupts

As you know by now, there are two kinds of interrupts:

  • Hardware interrupts
  • Software interrupts

They are very different in nature.

Hardware interrupts

Hardware interrupts are created by sending an electrical signal through an IRQ. These hardware lines signal the CPU directly.

Software interrupts

These are interrupts issued from software instead of hardware. As in the case of a hardware interrupt, the CPU jumps to the IDT and runs the handler for the specified interrupt.

Firmware

Firmware doesn’t get much attention from most of us; however, it’s a crucial part of the world we live in. It runs on all kinds of hardware and has all kinds of strange and peculiar ways to make the computers we program on work.

Now, firmware needs a microcontroller to run on. Even the CPU has firmware that makes it work. That means there are many more small ‘CPUs’ on our system than the cores we program against.

Why is this important? Well, you remember that concurrency is all about efficiency, right? Since we have many CPUs/microcontrollers already doing work for us on our system, one of our concerns is to not replicate or duplicate that work when we write code.

If a network card has firmware that continually checks whether new data has arrived, it’s pretty wasteful if we duplicate that by letting our CPU continually check whether new data arrives as well. It’s much better if we either check once in a while, or even better, get notified when data has arrived.

Summary

This chapter covered a lot of ground, so good job on doing all that legwork. We learned a little bit about how CPUs and operating systems have evolved from a historical perspective and the difference between non-preemptive and preemptive multitasking. We discussed the difference between concurrency and parallelism, talked about the role of the operating system, and learned that system calls are the primary way for us to interact with the host operating system. You’ve also seen how the CPU and the operating system cooperate through an infrastructure designed as part of the CPU.

Lastly, we went through a diagram of what happens when you issue a network call. You know there are at least three different ways for us to deal with the fact that the I/O call takes some time to execute, and we have to decide which way we want to handle that waiting time.

This covers most of the general background information we need so that we have the same definitions and overview before we go on. We’ll go into more detail as we progress through the book, and the first topic that we’ll cover in the next chapter is how programming languages model asynchronous program flow by looking into threads, coroutines and futures.

A simplified overview – Concurrency and Asynchronous Programming: a Detailed Overview-2

Step 4 – Hardware interrupt

A modern CPU has a fixed set of interrupt request lines (IRQs) that it uses to handle events originating from external devices.

A hardware interrupt is an electrical signal that can occur at any time. The CPU immediately interrupts its normal workflow to handle the interrupt by saving the state of its registers and looking up the interrupt handler. The interrupt handlers are defined in the interrupt descriptor table (IDT).

Step 5 – Interrupt handler

The IDT is a table where the OS (or a driver) registers handlers for different interrupts that may occur. Each entry points to a handler function for a specific interrupt. The handler function for a network card would typically be registered and handled by a driver for that card.

Note

The IDT is not stored on the CPU as it might seem in Figure 1.3. It’s located in a fixed and known location in the main memory. The CPU only holds a pointer to the table in one of its registers.

Step 6 – Writing the data

This is a step that might vary a lot depending on the CPU and the firmware on the network card. If the network card and the CPU support direct memory access (DMA), which should be the standard on all modern systems today, the network card will write data directly to a set of buffers that the OS already has set up in the main memory.

In such a system, the firmware on the network card might issue an interrupt when the data is written to memory. DMA is very efficient, since the CPU is only notified when the data is already in memory. On older systems, the CPU needed to devote resources to handle the data transfer from the network card.

The direct memory access controller (DMAC) is added to the diagram since, in such a system, it would control access to memory. It’s not part of the CPU, as indicated in the previous diagram. We’re deep enough in the rabbit hole now, and exactly where the different parts of a system are located is not really important to us right now, so let’s move on.

Step 7 – The driver

The driver would normally handle the communication between the OS and the network card. At some point, the buffers are filled and the network card issues an interrupt. The CPU then jumps to the handler of that interrupt. The interrupt handler for this exact type of interrupt is registered by the driver, so it’s actually the driver that handles this event and, in turn, informs the kernel that the data is ready to be read.

Step 8 – Reading the data

Depending on whether we chose method 1, 2, or 3, the OS will do as follows:

  • Wake our thread
  • Return Ready on the next poll
  • Wake the thread and return a Read event for the handler we registered

A simplified overview – Concurrency and Asynchronous Programming: a Detailed Overview-1

Let’s look at some of the steps where we imagine that we read from a network card:

Remember that we’re simplifying a lot here. This is a rather complex operation but we’ll focus on the parts that are of most interest to us and skip a few steps along the way.

Step 1 – Our code

We register a socket. This happens by issuing a syscall to the OS. Depending on the OS, we either get a file descriptor (macOS/Linux) or a socket (Windows).

The next step is that we register our interest in Read events on that socket.
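
As a small illustration of step 1 on Linux or macOS, the sketch below opens a TCP connection with Rust’s standard library and prints the file descriptor the OS handed back. The address is a placeholder, and registering interest in Read events is what the next step covers.

use std::net::TcpStream;
use std::os::fd::AsRawFd; // Unix-only: sockets are represented as file descriptors

fn main() -> std::io::Result<()> {
    // Creating and connecting the socket happens through syscalls under the hood.
    let stream = TcpStream::connect("example.com:80")?;
    println!("the OS handed us file descriptor {}", stream.as_raw_fd());
    Ok(())
}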

Step 2 – Registering events with the OS

This is handled in one of three ways:

  1. We tell the operating system that we’re interested in Read events but we want to wait for it to happen by yielding control over our thread to the OS. The OS then suspends our thread by storing the register state and switches to some other thread
    From our perspective, this will be blocking our thread until we have data to read.
  2. We tell the operating system that we’re interested in Read events but we just want a handle to a task that we can poll to check whether the event is ready or not.
    The OS will not suspend our thread, so this will not block our code.
  3. We tell the operating system that we are probably going to be interested in many events, but we want to subscribe to one event queue. When we poll this queue, it will block our thread until one or more events occur.

This will block our thread while we wait for events to occur.

Chapters 3 and 4 will go into detail about the third method, as it’s the method most commonly used by modern async frameworks to handle concurrency.
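
To give a feel for the third method, here is a minimal sketch using the mio crate (version 0.8, with its "net" and "os-poll" features enabled), which exposes epoll, kqueue, and IOCP behind a single API. The crate, the address, and the buffer size are assumptions made for this example only, and a server listening on 127.0.0.1:8080 is presumed.

use mio::net::TcpStream;
use mio::{Events, Interest, Poll, Token};
use std::io::{self, Read};

fn main() -> io::Result<()> {
    // The event queue: epoll on Linux, kqueue on macOS/BSD, IOCP-backed on Windows.
    let mut poll = Poll::new()?;
    let mut events = Events::with_capacity(128);

    // Placeholder address: assumes something is listening on port 8080 locally.
    let addr = "127.0.0.1:8080".parse().unwrap();
    let mut stream = TcpStream::connect(addr)?;

    // Tell the OS we're interested in Read events on this socket.
    poll.registry()
        .register(&mut stream, Token(0), Interest::READABLE)?;

    // Blocks this thread until one or more of the registered events occur.
    poll.poll(&mut events, None)?;

    for event in events.iter() {
        if event.token() == Token(0) && event.is_readable() {
            let mut buf = vec![0u8; 4096];
            match stream.read(&mut buf) {
                Ok(n) => println!("read {n} bytes"),
                // The socket is non-blocking, so the read itself may still not be ready.
                Err(e) if e.kind() == io::ErrorKind::WouldBlock => {}
                Err(e) => return Err(e),
            }
        }
    }
    Ok(())
}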

Step 3 – The network card

We’re skipping some steps here, but I don’t think they’re vital to our understanding.

On the network card, there is a small microcontroller running specialized firmware. We can imagine that this microcontroller is polling in a busy loop, checking whether any data is incoming.

The exact way the network card handles its internals is a little different from what I suggest here, and will most likely vary from vendor to vendor. The important part is that there is a very simple but specialized CPU running on the network card doing work to check whether there are incoming events.

Once the firmware registers incoming data, it issues a hardware interrupt.

But can’t we just change the page table in the CPU? – Concurrency and Asynchronous Programming: a Detailed Overview

Now, this is where the privilege level comes in. Most modern operating systems operate with two ring levels: ring 0, the kernel space, and ring 3, the user space.

Figure 1.2 – Privilege rings

Most CPUs support more rings than most modern operating systems actually use. This is for historical reasons, which is also why ring 0 and ring 3 are used (and not 1 and 2).

Every entry in the page table carries additional information. Among that information is the ring it belongs to. This information is set up when your OS boots up.

Code executed in ring 0 has almost unrestricted access to external devices and memory, and is free to change registers that provide security at the hardware level.

The code you write in ring 3 will typically have extremely restricted access to I/O and certain CPU registers (and instructions). Trying to issue an instruction or set a register from ring 3 to change the page table will be prevented by the CPU. The CPU will then treat this as an exception and jump to the handler for that exception provided by the OS.

This is also the reason why you have no other choice than to cooperate with the OS and handle I/O tasks through syscalls. The system wouldn’t be very secure if this wasn’t the case.

So, to sum it up: yes, the CPU and the OS cooperate a great deal. Most modern desktop CPUs are built with an OS in mind, so they provide the hooks and infrastructure that the OS latches onto upon bootup. When the OS spawns a process, it also sets its privilege level, making sure that normal processes stay within the borders it defines to maintain stability and security.

Interrupts, firmware, and I/O

We’re nearing the end of the general CS subjects in this book, and we’ll start to dig our way out of the rabbit hole soon.

This part tries to tie things together and look at how the whole computer works as a system to handle I/O and concurrency.

Let’s get to it!

Down the rabbit hole – Concurrency and Asynchronous Programming: a Detailed Overview

It turns out that there is a great deal of cooperation between the OS and the CPU, but maybe not in the way you would naively think.

Many modern CPUs provide some basic infrastructure that operating systems use. This infrastructure gives us the security and stability we expect. Actually, most advanced CPUs provide a lot more options than operating systems such as Linux, BSD, and Windows actually use.

There are two in particular that I want to address here:

  • How the CPU prevents us from accessing memory we’re not supposed to access
  • How the CPU handles asynchronous events such as I/O

We’ll cover the first one here and the second in the next section.

How does the CPU prevent us from accessing memory we’re not supposed to access?

As I mentioned, modern CPU architectures define some basic concepts by design. Some examples of this are as follows:

• Virtual memory
• Page table
• Page fault
• Exceptions
• Privilege level

Exactly how this works will differ depending on the specific CPU, so we’ll treat them in general terms here.

Most modern CPUs have a memory management unit (MMU). This part of the CPU is often even etched onto the same die. The MMU’s job is to translate the virtual address we use in our programs to a physical address.

When the OS starts a process (such as our program), it sets up a page table for our process and makes sure a special register on the CPU points to this page table.

Now, when we try to dereference t_ptr in the preceding code, the address is at some point sent for translation to the MMU, which looks it up in the page table to translate it to a physical address in the memory where it can fetch the data.

In the first case, it will point to a memory address on our stack that holds the value 100.

When we pass in 99999999999999 and ask it to fetch what’s stored at that address (which is what dereferencing does), it looks for the translation in the page table but can’t find it.

The CPU then treats this as a page fault.

At boot, the OS provided the CPU with an interrupt descriptor table. This table has a predefined format where the OS provides handlers for the conditions the CPU can encounter.

Since the OS provided a pointer to a function that handles page faults, the CPU jumps to that function when we try to dereference 99999999999999 and thereby hands over control to the operating system.

The OS then prints a nice message for us, letting us know that we encountered what it calls a segmentation fault. This message will therefore vary depending on the OS you run the code on.

Communicating with the operating system – Concurrency and Asynchronous Programming: a Detailed Overview

Communication with an operating system happens through what we call a system call (syscall). We need to know how to make system calls and understand why it’s so important for us when we want to cooperate and communicate with the operating system. We also need to understand how the basic abstractions we use every day use system calls behind the scenes. We’ll have a detailed walkthrough in Chapter 3, so we’ll keep this brief for now.

A system call uses a public API that the operating system provides so that programs we write in ‘userland’ can communicate with the OS.

Most of the time, these calls are abstracted away for us as programmers by the language or the runtime we use.

Now, a syscall is an example of something that is unique to the kernel you’re communicating with, but the UNIX family of kernels has many similarities. UNIX systems expose this through libc.
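
As a tiny taste of what that looks like beneath the abstractions, the following sketch writes to standard output by calling the write syscall through the libc crate. This is a Unix-only example and the libc dependency is an assumption made here; Chapter 3 walks through syscalls properly.

// Unix-only sketch; assumes the `libc` crate as a dependency.
fn main() {
    let msg = b"hello from a syscall\n";
    unsafe {
        // File descriptor 1 is standard output; libc::write is a thin
        // wrapper around the operating system's `write` system call.
        libc::write(1, msg.as_ptr() as *const libc::c_void, msg.len());
    }
}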

Windows, on the other hand, uses its own API, often referred to as WinAPI, and it can operate radically differently from how the UNIX-based systems operate.

Most often, though, there is a way to achieve the same things. In terms of functionality, you might not notice a big difference but as we’ll see later, and especially when we dig into how epoll, kqueue, and IOCP work, they can differ a lot in how this functionality is implemented.

However, a syscall is not the only way we interact with our operating system, as we’ll see in the following section.

The CPU and the operating system

Does the CPU cooperate with the operating system?

If you had asked me this question when I first thought I understood how programs work, I would most likely have answered no. We run programs on the CPU and we can do whatever we want if we know how to do it. Now, first of all, I wouldn’t have thought this through, but unless you learn how CPUs and operating systems work together, it’s not easy to know for sure.

What started to make me think I was very wrong was a segment of code that looked like what you’re about to see. If you think inline assembly in Rust looks foreign and confusing, don’t worry just yet. We’ll go through a proper introduction to inline assembly a little later in this book. I’ll make sure to go through each of the following lines until you get more comfortable with the syntax:

Repository reference: ch01/ac-assembly-dereference/src/main.rs
use std::arch::asm;

fn main() {
    let t = 100;
    // Take a raw pointer to `t` so we can dereference it manually below.
    let t_ptr: *const usize = &t;
    let x = dereference(t_ptr);
    println!("{}", x);
}

fn dereference(ptr: *const usize) -> usize {
    let mut res: usize;
    unsafe {
        // x86_64: read the 8 bytes at the address held in {1} into {0}.
        asm!("mov {0}, [{1}]", out(reg) res, in(reg) ptr)
    };
    res
}

What you’ve just looked at is a dereference function written in assembly.

The mov {0}, [{1}] line needs some explanation. {0} and {1} are templates that tell the compiler that we’re referring to the registers that out(reg) and in(reg) represent. The number is just an index, so if we had more inputs or outputs they would be numbered {2}, {3}, and so on. Since we only specify reg and not a specific register, we let the compiler choose what registers it wants to use.

The mov instruction instructs the CPU to take the first 8 bytes (if we’re on a 64-bit machine) it gets when reading the memory location that {1} points to and place that in the register represented by {0}. The [] brackets will instruct the CPU to treat the data in that register as a memory address, and instead of simply copying the memory address itself to {0}, it will fetch what’s at that memory location and move it over.

Anyway, we’re just writing instructions to the CPU here. No standard library, no syscall; just raw instructions. There is no way the OS is involved in that dereference function, right?

If you run this program, you get what you’d expect:
100

Now, if you keep the dereference function but replace the main function with a function that creates a pointer to the 99999999999999 address, which we know is invalid, we get this function:
fn main() {
    let t_ptr = 99999999999999 as *const usize;
    let x = dereference(t_ptr);
    println!("{}", x);
}

Now, if we run that we get the following results.

This is the result on Linux:
Segmentation fault (core dumped)

This is the result on Windows:
error: process didn’t exit successfully: `target\debug\ac-assembly-dereference.exe` (exit code: 0xc0000005, STATUS_ACCESS_VIOLATION)

We get a segmentation fault. Not surprising, really, but as you also might notice, the error we get is different on different platforms. Surely, the OS is involved somehow. Let’s take a look at what’s really happening here.