This chapter doesn’t require you to set up anything new, but since we’re writing some low-level code for three different platforms, you need access to these platforms if you want to run all the examples.
The best way to follow along is to open the accompanying repository on your computer and navigate to the ch03 folder.
This chapter is a little special since we build some basic understanding from the ground up, which means some of it is quite low-level and requires a specific operating system and CPU family to run. Don’t worry; I’ve chosen the most widely used and popular CPU architectures, so this shouldn’t be a problem, but it is something you need to be aware of.
On Windows and Linux, the machine must have a CPU that uses the x86-64 instruction set. Intel and AMD desktop CPUs use this architecture, but if you run Linux (or WSL) on a machine with an ARM processor, you might encounter issues with some of the examples that use inline assembly. On macOS, the example in the book targets the newer M-family of chips, but the repository also contains examples targeting the older Intel-based Macs.
Unfortunately, some examples targeting specific platforms require that specific operating system to run. However, this will be the only chapter where you need access to three different platforms to run all the examples. Going forward, we’ll create examples that will run on all platforms either natively or using Windows Subsystem for Linux (WSL), but to understand the basics of cross-platform abstractions, we need to actually create examples that target these different platforms.
Running the Linux examples
If you don’t have a Linux machine set up, you can run the Linux example on the Rust Playground, or if you’re on a Windows system, my suggestion is to set up WSL and run the code there. You can find the instructions on how to do that at https://learn.microsoft.com/en-us/windows/wsl/install. Remember, you have to install Rust in the WSL environment as well, so follow the instructions in the Preface section of this book on how to install Rust on Linux.
If you use VS Code as your editor, there is a very simple way of switching your environment to WSL. Press Ctrl+Shift+P and type Reopen Folder in WSL. This way, you can easily open the example folder in WSL and run the code examples using Linux there.
In this chapter, we’ll take a look at how an OS-backed event queue works and how three different operating systems handle this task in different ways. The reason for going through this is that most async runtimes I know of use OS-backed event queues such as this as a fundamental part of achieving high-performance I/O. You’ll most likely hear references to these frequently when reading about how async code really works.
Event queues based on the technologies we discuss in this chapter are used in many popular libraries, such as:
• mio (https://github.com/tokio-rs/mio), a key part of popular runtimes like Tokio
• polling (https://github.com/smol-rs/polling), the event queue used in Smol and async-std
• libuv (https://libuv.org/), the library used to create the event queue used in Node.js (a JavaScript runtime) and the Julia programming language
• C# for its asynchronous network calls
• Boost.Asio, a library for asynchronous network I/O for C++
All our interactions with the host operating system are done through system calls (syscalls). To make a system call using Rust, we need to know how to use Rust’s foreign function interface (FFI).
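To give you an early taste of what that looks like, here is a minimal, hedged sketch of calling the write function provided by a Unix-like libc through Rust’s FFI; it assumes a Linux or macOS system and is not the exact example used later in the chapter:

use std::os::raw::c_void;

extern "C" {
    // Declaration of libc's `write`, which ultimately issues the write syscall.
    fn write(fd: i32, buf: *const c_void, count: usize) -> isize;
}

fn main() {
    let message = "Hello from the other side of the FFI boundary!\n";
    // File descriptor 1 is standard output on Unix-like systems.
    let res = unsafe { write(1, message.as_ptr() as *const c_void, message.len()) };
    assert_eq!(res as usize, message.len());
}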
In addition to knowing how to use FFI and make syscalls, we need to cover cross-platform abstractions. When creating an event queue, whether you create it yourself or use a library, you’ll notice that the abstractions might seem a bit unintuitive if you only have a high-level overview of how, for example, IOCP works on Windows. The reason for this is that these abstractions need to provide one API that covers the fact that different operating systems handle the same task differently. This process often involves identifying a common denominator between the platforms and building a new abstraction on top of that.
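As a small, hedged illustration of what such an abstraction can look like in Rust (the function names here are made up for this sketch, not taken from any of the libraries above), we can expose one API and select a platform-specific backend at compile time:

pub fn wait_for_events() {
    // One public API; the backend is chosen at compile time.
    #[cfg(target_os = "linux")]
    backend_epoll();
    #[cfg(target_os = "macos")]
    backend_kqueue();
    #[cfg(target_os = "windows")]
    backend_iocp();
}

#[cfg(target_os = "linux")]
fn backend_epoll() {
    // A real implementation would call epoll_create/epoll_wait here.
    println!("waiting for events using epoll");
}

#[cfg(target_os = "macos")]
fn backend_kqueue() {
    // A real implementation would call kqueue/kevent here.
    println!("waiting for events using kqueue");
}

#[cfg(target_os = "windows")]
fn backend_iocp() {
    // A real implementation would call CreateIoCompletionPort/GetQueuedCompletionStatusEx here.
    println!("waiting for events using IOCP");
}

fn main() {
    wait_for_events();
}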
Instead of using a rather complex and lengthy example to explain FFI, syscalls, and cross-platform abstractions, we’ll ease into the topic using a simple example. When we encounter these concepts later on, we’ll already know them well enough to be prepared for the more interesting examples in the following chapters.
In this chapter, we’ll go through the following main topics:
• Why use an OS-backed event queue?
• Readiness-based event queues
• Completion-based event queues
• epoll
• kqueue
• IOCP
• Syscalls, FFI, and cross-platform abstractions
Note
There are popular, although lesser-used, alternatives you should know about even though we don’t cover them here:
• wepoll: This uses specific APIs on Windows and wraps IOCP so it closely resembles how epoll works on Linux, in contrast to regular IOCP. This makes it easier to create an abstraction layer with the same API on top of the two different technologies. It’s used by both libuv and mio.
• io_uring: This is a relatively new API on Linux with many similarities to IOCP on Windows.
I’m pretty confident that after you’ve gone through the next two chapters, you will have an easy time reading up on these if you want to learn more about them.
Pre-emption points can be thought of as inserted code that calls into the scheduler and asks whether it wishes to pre-empt the task. These points can be inserted by the compiler or by the library you use, for example before every new function call.
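As a rough, purely illustrative sketch (the Scheduler type and its methods are made up for this example and don’t correspond to any real runtime), an inserted pre-emption point could look something like this:

use std::cell::Cell;

// A made-up scheduler that pre-empts a task when its budget runs out.
struct Scheduler {
    budget: Cell<u32>,
}

impl Scheduler {
    // The call the compiler or library would insert at every pre-emption point.
    fn maybe_preempt(&self) {
        let left = self.budget.get();
        if left == 0 {
            // A real runtime would suspend the task and switch to another here.
            println!("budget exhausted: the task would be pre-empted at this point");
            self.budget.set(3);
        } else {
            self.budget.set(left - 1);
        }
    }
}

fn main() {
    let scheduler = Scheduler { budget: Cell::new(2) };
    for i in 0..5 {
        // Imagine this call being inserted before every new function call.
        scheduler.maybe_preempt();
        println!("doing work item {i}");
    }
}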
Furthermore, you need compiler support to make the most out of it. Languages that have metaprogramming abilities (such as macros) can emulate much of the same, but this will still not be as seamless as it will when the compiler is aware of these special async tasks.
Debugging is another area where care must be taken when implementing futures/promises. Since the code is re-written as state machines (or generators), you won’t have the same stack traces as you do with normal functions. Usually, you can assume that the caller of a function is what precedes it both in the stack and in the program flow. For futures and promises, it might be the runtime that calls the function that progresses the state machine, so there might not be a good backtrace you can use to see what happened before calling the function that failed. There are ways to work around this, but most of them will incur some overhead.
Advantages
• You can write code and model programs the same way you normally would
• No context switching
• It can be implemented in a very memory-efficient way
• It’s easy to implement for various platforms
Drawbacks
• Pre-emption can be hard, or impossible, to fully implement, as the tasks can’t be stopped in the middle of a stack frame
• It needs compiler support to leverage its full advantages
• Debugging can be difficult, both due to the non-sequential program flow and the limitations on the information you get from the backtraces
Summary
You’re still here? That’s excellent! Good job on getting through all that background information. I know going through text that describes abstractions and code can be pretty daunting, but I hope you see why it’s so valuable for us to go through these higher-level topics now at the start of the book. We’ll get to the examples soon. I promise!
In this chapter, we went through a lot of information on how we can model and handle asynchronous operations in programming languages by using both OS-provided threads and abstractions provided by a programming language or a library. While it’s not an extensive list, we covered some of the most popular and widely used technologies while discussing their advantages and drawbacks.
We spent quite some time going in-depth on threads, coroutines, fibers, green threads, and callbacks, so you should have a pretty good idea of what they are and how they’re different from each other.
The next chapter will go into detail about how we do system calls and create cross-platform abstractions and what OS-backed event queues such as epoll, kqueue, and IOCP really are and why they’re fundamental to most async runtimes you’ll encounter out in the wild.
Coroutines come in two flavors: asymmetric and symmetric. Asymmetric coroutines yield to a scheduler, and they’re the ones we’ll focus on. Symmetric coroutines yield to a specific destination; for example, a different coroutine.
While coroutines are a pretty broad concept in general, the introduction of coroutines as objects in programming languages is what really makes this way of handling concurrency rival the ease of use that OS threads and fibers/green threads are known for.
You see, when you write async in Rust or JavaScript, the compiler rewrites what looks like a normal function call into a future (in the case of Rust) or a promise (in the case of JavaScript). Await, on the other hand, yields control to the runtime scheduler, and the task is suspended until the future/promise you’re awaiting has finished.
This way, we can write programs that handle concurrent operations in almost the same way we write our normal sequential programs.
Our JavaScript program can now be written as follows:

async function run() {
    await timer(200);
    await timer(100);
    await timer(50);
    console.log("I'm the last one");
}
You can consider the run function as a pausable task consisting of several sub-tasks. On each “await” point, it yields control to the scheduler (in this case, it’s the well-known JavaScript event loop).
Once one of the sub-tasks changes state to either fulfilled or rejected, the task is scheduled to continue to the next step.
When using Rust, you can see the same transformation happening with the function signature when you write something such as this: async fn run() -> () { … }
The function wraps the return object, and instead of returning the type (), it returns a Future with an output type of (): fn run() -> impl Future<Output = ()>
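A small sketch shows the idea (the desugared version below is only a rough equivalent of the signature transformation, not the actual state machine the compiler generates):

use std::future::Future;

// What we write:
async fn run() {}

// Roughly what the signature becomes after the transformation:
fn run_desugared() -> impl Future<Output = ()> {
    async {}
}

fn main() {
    // Neither future does anything until a runtime polls it, so constructing
    // them here has no visible effect.
    let _fut1 = run();
    let _fut2 = run_desugared();
}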
Syntactically, Rust’s futures 0.1 was a lot like the promise example we just showed, and the Rust futures we use today have a lot in common with how async/await works in JavaScript.
This way of rewriting what looks like normal functions and code into something else has a lot of benefits, but it’s not without its drawbacks.
As with any stackless coroutine implementation, full pre-emption can be hard, or impossible, to implement. These functions have to yield at specific points, and there is no way to suspend execution in the middle of a stack frame in contrast to fibers/green threads. Some level of pre-emption is possible by having the runtime or compiler insert pre-emption points at every function call, for example, but it’s not the same as being able to pre-empt a task at any point during its execution.
Since we’ll not spend more time talking about OS threads in this book, we’ll go through a short example so you can see how they’re used:

ch02/aa-os-threads

use std::thread::{self, sleep};

fn main() {
    println!("So, we start the program here!");
    let t1 = thread::spawn(move || {
        sleep(std::time::Duration::from_millis(200));
        println!("The long running tasks finish last!");
    });
    let t2 = thread::spawn(move || {
        sleep(std::time::Duration::from_millis(100));
        println!("We can chain callbacks…");
        let t3 = thread::spawn(move || {
            sleep(std::time::Duration::from_millis(50));
            println!("…like this!");
        });
        t3.join().unwrap();
    });
    println!("The tasks run concurrently!");
    t1.join().unwrap();
    t2.join().unwrap();
}
In this example, we simply spawn several OS threads and put them to sleep. Sleeping is essentially the same as yielding to the OS scheduler with a request to be re-scheduled to run after a certain time has passed. To make sure our main thread doesn’t finish and exit (which would exit the process) before our child threads have had time to run, we join them at the end of our main function. If we run the example, we’ll see how the operations occur in a different order based on how long we yielded each thread to the scheduler:

So, we start the program here!
The tasks run concurrently!
We can chain callbacks…
…like this!
The long running tasks finish last!
So, while using OS threads is great for a number of tasks, we also outlined a number of good reasons to look at alternatives by discussing their limitations and downsides. The first alternatives we’ll look at are what we call fibers and green threads.
Fibers and green threads
Note!
This is an example of M:N threading. Many tasks can run concurrently on one OS thread. Fibers and green threads are often referred to as stackful coroutines.
The name “green threads” originally stems from an early implementation of an M:N threading model used in Java and has since been associated with different implementations of M:N threading. You will encounter different variations of this term, such as “green processes” (used in Erlang), which are different from the ones we discuss here. You’ll also see some that define green threads more broadly than we do here.
The way we define green threads in this book makes them synonymous with fibers, so both terms refer to the same thing going forward.
The implementation of fibers and green threads implies that there is a runtime with a scheduler that’s responsible for scheduling what task (M) gets time to run on the OS thread (N). There are many more tasks than there are OS threads, and such a system can run perfectly fine using only one OS thread. The latter case is often referred to as M:1 threading.
Goroutines are an example of a specific implementation of stackful coroutines, but they come with slight nuances. The term “coroutine” usually implies that they’re cooperative in nature, but Goroutines can be pre-empted by the scheduler (at least since version 1.14), thereby landing them in somewhat of a grey area using the categories we present here.
Green threads and fibers use the same mechanisms as an OS, setting up a stack for each task, saving the CPU’s state, and jumping from one task (thread) to another by doing a context switch.
We yield control to the scheduler (which is a central part of the runtime in such a system), which then continues running a different task.
The state of execution is stored in each stack, so in such a solution, there would be no need for async, await, Future, or Pin. In many ways, green threads mimic how an operating system facilitates concurrency, and implementing them is a great learning experience.
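As a conceptual sketch of the state such a runtime keeps per task (the type and field names are illustrative only; we’ll implement a real version of this later in the book), it boils down to a private stack plus the saved CPU registers:

// Each task owns its own stack and a snapshot of the CPU state needed to resume it.
#[derive(Default)]
struct ThreadContext {
    rsp: u64, // saved stack pointer (plus the other callee-saved registers in a real implementation)
}

struct GreenThread {
    stack: Vec<u8>,     // the task's private stack
    ctx: ThreadContext, // CPU state saved at the last context switch
}

fn main() {
    let task = GreenThread {
        stack: vec![0; 4096],
        ctx: ThreadContext::default(),
    };
    println!(
        "task has a {}-byte stack; saved rsp = {:#x}",
        task.stack.len(),
        task.ctx.rsp
    );
}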
A runtime using fibers/green threads for concurrent tasks can have a high degree of flexibility. Tasks can, for example, be pre-empted and context switched at any time and at any point in their execution, so a long-running task that hogs the CPU could in theory be pre-empted by the runtime, acting as a safeguard from having tasks that end up blocking the whole system due to an edge-case or a programmer error.
This gives the runtime scheduler almost the same capabilities as the OS scheduler, which is one of the biggest advantages of systems using fibers/green threads.
The typical flow goes as follows:
• You run some non-blocking code
• You make a blocking call to some external resource
• The CPU jumps to the main thread, which schedules a different thread to run and jumps to that stack
• You run some non-blocking code on the new thread until a new blocking call or the task is finished
• The CPU jumps back to the main thread, schedules a new thread that is ready to make progress, and jumps to that thread
Figure 2.2 – Program flow using fibers/green threads
Creating a new OS thread involves some bookkeeping and initialization overhead, so while switching between two existing threads in the same process is pretty fast, creating new ones and discarding ones you don’t use anymore involves work that takes time. All the extra work will limit throughput if a system needs to create and discard a lot of them. This can be a problem if you have huge amounts of small tasks that need to be handled concurrently, which often is the case when dealing with a lot of I/O.
Each thread has its own stack
We’ll cover stacks in detail later in this book, but for now, it’s enough to know that they occupy a fixed size of memory. Each OS thread comes with its own stack, and even though many systems allow this size to be configured, they’re still fixed in size and can’t grow or shrink. They are, after all, the cause of stack overflows, which will be a problem if you configure them to be too small for the tasks you’re running.
If we have many small tasks that only require a little stack space but we reserve much more than we need, we will occupy large amounts of memory and possibly run out of it.
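For reference, the standard library lets you pick the stack size when spawning an OS thread; here is a minimal sketch, where the 32 KiB value is just an illustration of a deliberately small stack:

use std::thread;

fn main() {
    // Spawn a thread with an explicitly configured (and fixed) stack size.
    let handle = thread::Builder::new()
        .stack_size(32 * 1024)
        .spawn(|| println!("running on a thread with a small, fixed-size stack"))
        .unwrap();
    handle.join().unwrap();
}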
Context switching
As you now know, threads and schedulers are tightly connected. Context switching happens when the CPU stops executing one thread and proceeds with another one. Even though this process is highly optimized, it still involves storing and restoring the register state, which takes time. Every time that you yield to the OS scheduler, it can choose to schedule a thread from a different process on that CPU.
You see, threads created by these systems belong to a process. When you start a program, it starts a process, and the process creates at least one initial thread where it executes the program you’ve written. Each process can spawn multiple threads that share the same address space.
That means that threads within the same process can access shared memory and can access the same resources, such as files and file handles. One consequence of this is that when the OS switches contexts by stopping one thread and resuming another within the same process, it doesn’t have to save and restore all the state associated with that process, just the state that’s specific to that thread.
On the other hand, when the OS switches from a thread associated with one process to a thread associated with another, the new process will use a different address space, and the OS needs to take measures to make sure that process “A” doesn’t access data or resources that belong to process “B”. If it didn’t, the system wouldn’t be secure.
The consequence is that caches might need to be flushed and more state might need to be saved and restored. In a highly concurrent system under load, these context switches can take extra time and thereby limit the throughput in a somewhat unpredictable manner if they happen frequently enough.
Definitions will vary depending on what book or article you read. For example, if you read about how a specific operating system works, you might see that processes or threads are abstractions that represent “tasks”, which will seem to contradict the definitions we use here. As I mentioned earlier, the choice of reference frame is important, and it’s why we take so much care to define the terms we use thoroughly as we encounter them throughout the book.
The definition of a thread can also vary by operating system, even though most popular systems share a similar definition today. Most notably, Solaris (pre-Solaris 9, which was released in 2002) used to have a two-level thread system that differentiated between application threads, lightweight processes, and kernel threads. This was an implementation of what we call M:N threading, which we’ll get to know more about later in this book. Just beware that if you read older material, the definition of a thread in such a system might differ significantly from the one that’s commonly used today.
Now that we’ve gone through the most important definitions for this chapter, it’s time to talk more about the most popular ways of handling concurrency when programming.
Threads provided by the operating system
Note!
We call this 1:1 threading. Each task is assigned one OS thread.
Since this book will not focus specifically on OS threads as a way to handle concurrency going forward, we treat them more thoroughly here.
Let’s start with the obvious. To use threads provided by the operating system, you need, well, an operating system. Before we discuss the use of threads as a means to handle concurrency, we need to be clear about what kind of operating systems we’re talking about since they come in different flavors.
Embedded systems are more widespread now than ever before. This kind of hardware might not have the resources for an operating system, and if it does, you might use a radically different kind of operating system tailored to your needs, as these systems tend to be less general purpose and more specialized in nature.
Their support for threads, and the characteristics of how they schedule them, might be different from what you’re used to in operating systems such as Windows or Linux.
Since covering all the different designs would be a book on its own, we’ll limit the scope to talking about threads as they’re used in Windows and Linux-based systems running on popular desktop and server CPUs.
OS threads are simple to implement and simple to use. We simply let the OS take care of everything for us. We do this by spawning a new OS thread for each task we want to accomplish and writing code as we normally would.
The runtime we use to handle concurrency for us is the operating system itself. In addition to these advantages, you get parallelism for free. However, there are also some drawbacks and complexities resulting from directly managing parallelism and shared resources.
In the previous chapter, we covered asynchronous program flow, concurrency, and parallelism in general terms. In this chapter, we’ll narrow our scope. Specifically, we’ll look into different ways to model and deal with concurrency in programming languages and libraries.
It’s important to keep in mind that threads, futures, fibers, goroutines, promises, etc. are abstractions that give us a way to model an asynchronous program flow. They have different strengths and weaknesses, but they share a goal of giving programmers an easy-to-use (and importantly, hard to misuse), efficient, and expressive way of creating a program that handles tasks in a non-sequential, and often unpredictable, order.
The lack of precise definitions is prevalent here as well; many terms have a name that stems from a concrete implementation at some point in time but has later taken on a more general meaning that encompasses different implementations and varieties of the same thing.
We’ll first go through a way of grouping different abstractions together based on their similarities before we go on to discuss the pros and cons of each of them. We’ll also go through important definitions that we’ll use throughout the book and discuss OS threads in quite some detail.
The topics we discuss here are quite abstract and complicated, so don’t feel bad if you don’t understand everything immediately. As we progress through the book and you get used to the different terms and techniques by working through some examples, more and more pieces will fall into place.
Specifically, the following topics will be covered:
• Definitions
• Threads provided by the operating system
• Green threads/stackful coroutines/fibers
• Callback-based approaches
• Promises, futures, and async/await
Definitions
We can broadly categorize abstractions over concurrent operations into two groups:
Cooperative: These are tasks that yield voluntarily either by explicitly yielding or by calling a function that suspends the task when it can’t progress further before another operation has finished (such as making a network call). Most often, these tasks yield to a scheduler of some sort. Examples of this are tasks generated by async/await in Rust and JavaScript.
Non-cooperative: Tasks that don’t necessarily yield voluntarily. In such a system, the scheduler must be able to pre-empt a running task, meaning that the scheduler can stop the task and take control over the CPU even though the task would have been able to do work and progress. Examples of this are OS threads and Goroutines (after Go version 1.14).
Figure 2.1 – Non-cooperative vs. cooperative multitasking
Note
In a system where the scheduler can pre-empt running tasks, tasks can also yield voluntarily as they do in a cooperative system, and it’s rare to find a system that relies solely on pre-emption.
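A tiny example of such a voluntary yield in a pre-emptive system is std::thread::yield_now, which merely hints to the OS scheduler that the thread is willing to give up the rest of its time slice:

use std::thread;

fn main() {
    // Voluntarily yield to the OS scheduler, even though it could also
    // pre-empt us at any time.
    thread::yield_now();
    println!("resumed after voluntarily yielding to the OS scheduler");
}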
We can further divide these abstractions into two broad categories based on the characteristics of their implementation:
Stackful: Each task has its own call stack. This is often implemented as a stack that’s similar to the stack used by the operating system for its threads. Stackful tasks can suspend execution at any point in the program as the whole stack is preserved.
Stackless: There is not a separate stack for each task; they all run sharing the same call stack. A task can’t be suspended in the middle of a stack frame, limiting the runtime’s ability to pre-empt the task. However, they need to store/restore less information when switching between tasks so they can be more efficient.
There are more nuances to these two categories that you’ll get a deep understanding of when we implement an example of both a stackful coroutine (fiber) and a stackless coroutine (Rust futures generated by async/await) later in the book. For now, we keep the details to a minimum to just provide an overview.
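To make the stackless idea a little more concrete, here is a hand-written sketch of a task as a state machine; everything that must survive a suspension point lives in the enum rather than on a call stack (the names are made up for this illustration, and this is not the Future trait we’ll meet later):

// A task written by hand as a state machine. Each call to `step` advances it
// until it reaches the Done state.
enum TaskState {
    Start,
    WaitingForTimer { remaining_polls: u32 },
    Done,
}

impl TaskState {
    fn step(self) -> TaskState {
        match self {
            TaskState::Start => TaskState::WaitingForTimer { remaining_polls: 3 },
            TaskState::WaitingForTimer { remaining_polls: 0 } => TaskState::Done,
            TaskState::WaitingForTimer { remaining_polls } => {
                // Still waiting: the only state we carry across suspensions is this counter.
                TaskState::WaitingForTimer { remaining_polls: remaining_polls - 1 }
            }
            TaskState::Done => TaskState::Done,
        }
    }
}

fn main() {
    let mut task = TaskState::Start;
    while !matches!(task, TaskState::Done) {
        task = task.step();
    }
    println!("task finished");
}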
A modern CPU has a set of interrupt request lines (IRQs) that let it handle events that occur from external devices. A CPU has a fixed set of interrupt lines.
A hardware interrupt is an electrical signal that can occur at any time. The CPU immediately interrupts its normal workflow to handle the interrupt by saving the state of its registers and looking up the interrupt handler. The interrupt handlers are defined in the interrupt descriptor table (IDT).
Step 5 – Interrupt handler
The IDT is a table where the OS (or a driver) registers handlers for different interrupts that may occur. Each entry points to a handler function for a specific interrupt. The handler function for a network card would typically be registered and handled by a driver for that card.
Note
The IDT is not stored on the CPU as it might seem in Figure 1.3. It’s located in a fixed and known location in the main memory. The CPU only holds a pointer to the table in one of its registers.
Step 6 – Writing the data
This is a step that might vary a lot depending on the CPU and the firmware on the network card. If the network card and the CPU support direct memory access (DMA), which should be the standard on all modern systems today, the network card will write data directly to a set of buffers that the OS already has set up in the main memory.
In such a system, the firmware on the network card might issue an interrupt when the data is written to memory. DMA is very efficient, since the CPU is only notified when the data is already in memory. On older systems, the CPU needed to devote resources to handle the data transfer from the network card.
The direct memory access controller (DMAC) is added to the diagram since, in such a system, it would control the access to memory. It’s not part of the CPU as indicated in the previous diagram. We’re deep enough in the rabbit hole now, and exactly where the different parts of a system are is not really important to us right now, so let’s move on.
Step 7 – The driver
The driver would normally handle the communication between the OS and the network card. At some point, the buffers are filled and the network card issues an interrupt. The CPU then jumps to the handler of that interrupt. The interrupt handler for this exact type of interrupt is registered by the driver, so it’s actually the driver that handles this event and, in turn, informs the kernel that the data is ready to be read.
Step 8 – Reading the data
Depending on whether we chose method 1, 2, or 3, the OS will do as follows:
• Wake our thread
• Return Ready on the next poll
• Wake the thread and return a Read event for the handler we registered
It turns out that there is a great deal of cooperation between the OS and the CPU, but maybe not in the way you would naively think.
Many modern CPUs provide some basic infrastructure that operating systems use. This infrastructure gives us the security and stability we expect. Actually, most advanced CPUs provide a lot more options than operating systems such as Linux, BSD, and Windows actually use.
There are two in particular that I want to address here:
• How the CPU prevents us from accessing memory we’re not supposed to access
• How the CPU handles asynchronous events such as I/O
We’ll cover the first one here and the second in the next section.
How does the CPU prevent us from accessing memory we’re not supposed to access?
As I mentioned, modern CPU architectures define some basic concepts by design, such as virtual addresses and their translation to physical addresses, page tables, and page faults, all of which we’ll touch on in the next few paragraphs.
Exactly how this works will differ depending on the specific CPU, so we’ll treat them in general terms here.
Most modern CPUs have a memory management unit (MMU). This part of the CPU is often even etched onto the same die. The MMU’s job is to translate the virtual address we use in our programs to a physical address.
When the OS starts a process (such as our program), it sets up a page table for our process and makes sure a special register on the CPU points to this page table.
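As a minimal sketch of the kind of program being described here (the invalid address from the following paragraphs is shown as a commented-out alternative):

fn main() {
    let t = 100;
    let t_ptr: *const usize = &t;
    // Swap in the line below to ask the MMU to translate an address with no
    // entry in our page table; the OS will then report a segmentation fault.
    // let t_ptr = 99999999999999 as *const usize;
    let x = unsafe { *t_ptr };
    println!("{x}");
}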
Now, when we try to dereference t_ptr in the preceding code, the address is at some point sent for translation to the MMU, which looks it up in the page table to translate it to a physical address in the memory where it can fetch the data.
In the first case, it will point to a memory address on our stack that holds the value 100.
When we pass in 99999999999999 and ask it to fetch what’s stored at that address (which is what dereferencing does), it looks for the translation in the page table but can’t find it.
The CPU then treats this as a page fault.
At boot, the OS provided the CPU with an interrupt descriptor table. This table has a predefined format where the OS provides handlers for the predefined conditions the CPU can encounter.
Since the OS provided a pointer to a function that handles page faults, the CPU jumps to that function when we try to dereference 99999999999999 and thereby hands over control to the operating system.
The OS then prints a nice message for us, letting us know that we encountered what it calls a segmentation fault. This message will therefore vary depending on the OS you run the code on.