Why use an OS-backed event queue? – Understanding OS-Backed Event Queues, System Calls, and Cross-Platform Abstractions

You already know by now that we need to cooperate closely with the OS to make I/O operations as efficient as possible. Operating systems such as Linux, macOS, and Windows provide several ways of performing I/O, both blocking and non-blocking.

I/O operations need to go through the operating system since they are dependent on resources that our operating system abstracts over. This can be the disk drive, the network card, or other peripherals. Especially in the case of network calls, we’re not only dependent on our own hardware, but we also depend on resources that might reside far away from our own, causing a significant delay.

In the previous chapter, we covered different ways to handle asynchronous operations when programming, and while they’re all different, they all have one thing in common: they need control over when and if they should yield to the OS scheduler when making a syscall.

In practice, this means that syscalls that would normally yield to the OS scheduler (blocking calls) need to be avoided, and we need to use non-blocking calls instead. We also need an efficient way to know the status of each call so we know when the task that made the otherwise blocking call is ready to progress. This is the main reason for using an OS-backed event queue in an asynchronous runtime.

We’ll look at three different ways of handling an I/O operation as an example.

Blocking I/O

When we ask the operating system to perform a blocking operation, it will suspend the OS thread that makes the call. It will then store the CPU state it had at the point where we made the call and go on to do other things. When data arrives for us through the network, it will wake up our thread again, restore the CPU state, and let us resume as if nothing had happened.

Blocking operations are the least flexible to use for us as programmers since we yield control to the OS at every call. The big advantage is that our thread gets woken up once the event we’re waiting for is ready so we can continue. If we take the whole system running on the OS into account, it’s a pretty efficient solution since the OS will give threads that have work to do time on the CPU to progress. However, if we narrow the scope to look at our process in isolation, we find that every time we make a blocking call, we put a thread to sleep, even if we still have work that our process could do. This leaves us with the choice of spawning new threads to do work on or just accepting that we have to wait for the blocking call to return. We’ll go a little more into detail about this later.
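To make this concrete, here's a minimal sketch of a blocking read using Rust's standard library (the address is just a placeholder for illustration):

use std::io::Read;
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    // The address is a placeholder for this sketch.
    let mut stream = TcpStream::connect("localhost:8080")?;
    let mut buf = [0u8; 1024];
    // This call suspends the whole OS thread until data arrives
    // (or the peer closes the connection).
    let n = stream.read(&mut buf)?;
    println!("read {n} bytes");
    Ok(())
}

While we wait inside read, the OS is free to run other threads, but our thread can't do anything else in the meantime.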

Non-blocking I/O

Unlike with a blocking I/O operation, the OS will not suspend the thread that made the I/O request; instead, it gives the thread a handle that the thread can use to ask the operating system whether the event is ready or not.

We call the process of querying for status polling.

Non-blocking I/O operations give us as programmers more freedom, but, as usual, that comes with a responsibility. If we poll too often, such as in a loop, we will occupy a lot of CPU time just to ask for an updated status, which is very wasteful. If we poll too infrequently, there will be a significant delay between an event being ready and us doing something about it, thus limiting our throughput.
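Here's the same read as a minimal non-blocking sketch, again with a placeholder address. The key is the ErrorKind::WouldBlock error, which is the OS's way of telling us the data isn't ready yet:

use std::io::{ErrorKind, Read};
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    // The address is a placeholder for this sketch.
    let mut stream = TcpStream::connect("localhost:8080")?;
    // Ask the OS to return immediately instead of suspending our thread.
    stream.set_nonblocking(true)?;

    let mut buf = [0u8; 1024];
    loop {
        match stream.read(&mut buf) {
            // The data was ready, and we read n bytes of it.
            Ok(n) => {
                println!("read {n} bytes");
                break;
            }
            // Not ready yet: this is where we're free to do other work
            // before polling again. Spinning like this wastes CPU time.
            Err(e) if e.kind() == ErrorKind::WouldBlock => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(())
}

A real runtime wouldn't spin in a busy loop like this; it would do other work between polls or, better yet, lean on an event queue, which is exactly where this chapter is headed.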

Technical requirements – Understanding OS-Backed Event Queues, System Calls, and Cross-Platform Abstractions

This chapter doesn’t require you to set up anything new, but since we’re writing some low-level code for three different platforms, you need access to these platforms if you want to run all the examples.

The best way to follow along is to open the accompanying repository on your computer and navigate to the ch03 folder.

This chapter is a little special since we build some basic understanding from the ground up, which means some of it is quite low-level and requires a specific operating system and CPU family to run. Don't worry; I've chosen the most widely used and popular CPU architectures, so this shouldn't be a problem, but it is something you need to be aware of.

On Windows and Linux, the machine must use a CPU with the x86-64 instruction set. Intel and AMD desktop CPUs use this architecture, but if you run Linux (or WSL) on a machine with an ARM processor, you might encounter issues with some of the examples that use inline assembly. On macOS, the example in the book targets the newer M-family of chips, but the repository also contains examples targeting the older Intel-based Macs.

Unfortunately, some examples targeting specific platforms require that specific operating system to run. However, this will be the only chapter where you need access to three different platforms to run all the examples. Going forward, we’ll create examples that will run on all platforms either natively or using Windows Subsystem for Linux (WSL), but to understand the basics of cross-platform abstractions, we need to actually create examples that target these different platforms.

Running the Linux examples

If you don’t have a Linux machine set up, you can run the Linux example on the Rust Playground, or if you’re on a Windows system, my suggestion is to set up WSL and run the code there. You can find the instructions on how to do that at https://learn.microsoft.com/en-us/windows/wsl/install. Remember, you have to install Rust in the WSL environment as well, so follow the instructions in the Preface section of this book on how to install Rust on Linux.

If you use VS Code as your editor, there is a very simple way of switching your environment to WSL. Press Ctrl+Shift+P and type Reopen folder in WSL. This way, you can easily open the example folder in WSL and run the code examples using Linux there.

NOTE – Understanding OS-Backed Event Queues, System Calls, and Cross-Platform Abstractions

In this chapter, we’ll take a look at how an OS-backed event queue works and how three different operating systems handle this task in different ways. The reason for going through this is that most async runtimes I know of use OS-backed event queues such as this as a fundamental part of achieving high-performance I/O. You’ll most likely hear references to these frequently when reading about how async code really works.

Event queues based on the technology we discuss in this chapter are used in many popular libraries, such as:

• mio (https://github.com/tokio-rs/mio), a key part of popular runtimes like Tokio
• polling (https://github.com/smol-rs/polling), the event queue used in Smol and async-std
• libuv (https://libuv.org/), the library used to create the event queue used in Node.js (a JavaScript runtime) and the Julia programming language
• C# for its asynchronous network calls
• Boost.Asio, a library for asynchronous network I/O for C++

All our interactions with the host operating system are done through system calls (syscalls). To make a system call using Rust, we need to know how to use Rust’s foreign function interface (FFI).
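As a small taste of what's to come, here's a minimal sketch of making the write syscall on Linux through FFI. We go through libc's thin wrapper instead of issuing the raw syscall instruction ourselves:

#[link(name = "c")]
extern "C" {
    // The signature mirrors POSIX: write(fd, buf, count).
    fn write(fd: i32, buf: *const u8, count: usize) -> isize;
}

fn main() {
    let message = "Hello from a syscall!\n";
    // File descriptor 1 is standard output.
    let res = unsafe { write(1, message.as_ptr(), message.len()) };
    assert!(res >= 0, "the write syscall failed");
}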

In addition to knowing how to use FFI and make syscalls, we need to cover cross-platform abstractions. When creating an event queue, whether you create it yourself or use a library, you’ll notice that the abstractions might seem a bit unintuitive if you only have a high-level overview of how, for example, IOCP works on Windows. The reason for this is that these abstractions need to provide one API that covers the fact that different operating systems handle the same task differently. This process often involves identifying a common denominator between the platforms and building a new abstraction on top of that.
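In Rust, the lowest layer of such an abstraction often boils down to compile-time dispatch. The following is a structural sketch only (real implementations wrap the actual epoll/kqueue/IOCP calls behind one shared API):

// One public function, with a platform-specific backend
// chosen at compile time.
#[cfg(target_os = "linux")]
fn queue_backend() -> &'static str {
    "epoll"
}

#[cfg(target_os = "macos")]
fn queue_backend() -> &'static str {
    "kqueue"
}

#[cfg(target_os = "windows")]
fn queue_backend() -> &'static str {
    "IOCP"
}

fn main() {
    // Callers never need to know which backend they got.
    println!("This platform's event queue: {}", queue_backend());
}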

Instead of using a rather complex and lengthy example to explain FFI, syscalls, and cross-platform abstractions, we’ll ease into the topic using a simple example. When we encounter these concepts later on, we’ll already know these subjects well enough, so we’re well prepared for the more interesting examples in the following chapters.

In this chapter, we’ll go through the following main topics:
• Why use an OS-backed event queue?
• Readiness-based event queues
• Completion-based event queues
• epoll
• kqueue
• IOCP
• Syscalls, FFI, and cross-platform abstractions

Note

There are popular, although lesser-used, alternatives you should know about even though we don’t cover them here:

• wepoll: This uses specific APIs on Windows and wraps IOCP so that it closely resembles how epoll works on Linux, in contrast to regular IOCP. This makes it easier to create an abstraction layer with the same API on top of the two different technologies. It's used by both libuv and mio.
• io_uring: This is a relatively new API on Linux with many similarities to IOCP on Windows.

I’m pretty confident that after you’ve gone through the next two chapters, you will have an easy time reading up on these if you want to learn more about them.

Example – How Programming Languages Model Asynchronous Program Flow

Since we’ll not spend more time talking about OS threads in this book, we’ll go through a short example so you can see how they’re used:
ch02/aa-os-threads
use std::thread::{self, sleep};

fn main() {
    println!("So, we start the program here!");
    let t1 = thread::spawn(move || {
        sleep(std::time::Duration::from_millis(200));
        println!("The long-running tasks finish last!");
    });
    let t2 = thread::spawn(move || {
        sleep(std::time::Duration::from_millis(100));
        println!("We can chain callbacks…");
        let t3 = thread::spawn(move || {
            sleep(std::time::Duration::from_millis(50));
            println!("…like this!");
        });
        t3.join().unwrap();
    });
    println!("The tasks run concurrently!");
    t1.join().unwrap();
    t2.join().unwrap();
}

In this example, we simply spawn several OS threads and put them to sleep. Sleeping is essentially the same as yielding to the OS scheduler with a request to be re-scheduled to run after a certain time has passed. To make sure our main thread doesn't finish and exit (which would exit the process) before our child threads have had time to run, we join them at the end of our main function.
If we run the example, we’ll see how the operations occur in a different order based on how long we yielded each thread to the scheduler:
So, we start the program here!
The tasks run concurrently!
We can chain callbacks…
…like this!
The long-running tasks finish last!

So, while using OS threads is great for a number of tasks, we also outlined a number of good reasons to look at alternatives by discussing their limitations and downsides. The first alternatives we’ll look at are what we call fibers and green threads.

Fibers and green threads

Note!

This is an example of M:N threading. Many tasks can run concurrently on one OS thread. Fibers and green threads are often referred to as stackful coroutines.

The name “green threads” originally stems from an early implementation of an M:N threading model used in Java and has since been associated with different implementations of M:N threading. You will encounter different variations of this term, such as “green processes” (used in Erlang), which are different from the ones we discuss here. You’ll also see some that define green threads more broadly than we do here.

The way we define green threads in this book makes them synonymous with fibers, so both terms refer to the same thing going forward.

The implementation of fibers and green threads implies that there is a runtime with a scheduler that’s responsible for scheduling what task (M) gets time to run on the OS thread (N). There are many more tasks than there are OS threads, and such a system can run perfectly fine using only one OS thread. The latter case is often referred to as M:1 threading.

Goroutines are an example of a specific implementation of stackful coroutines, but they come with slight nuances. The term "coroutine" usually implies that they're cooperative in nature, but Goroutines can be pre-empted by the scheduler (at least since version 1.14), thereby landing them in somewhat of a grey area using the categories we present here.

Green threads and fibers use the same mechanisms as an OS, setting up a stack for each task, saving the CPU's state, and jumping from one task (thread) to another by doing a context switch.

We yield control to the scheduler (which is a central part of the runtime in such a system), which then continues running a different task.

The state of execution is stored in each stack, so in such a solution, there would be no need for async, await, Future, or Pin. In many ways, green threads mimic how an operating system facilitates concurrency, and implementing them is a great learning experience.

A runtime using fibers/green threads for concurrent tasks can have a high degree of flexibility. Tasks can, for example, be pre-empted and context switched at any time and at any point in their execution, so a long-running task that hogs the CPU could in theory be pre-empted by the runtime, acting as a safeguard from having tasks that end up blocking the whole system due to an edge-case or a programmer error.

This gives the runtime scheduler almost the same capabilities as the OS scheduler, which is one of the biggest advantages of systems using fibers/green threads.

The typical flow goes as follows:

• You run some non-blocking code
• You make a blocking call to some external resource
• The CPU jumps to the main thread, which schedules a different thread to run and jumps to that stack
• You run some non-blocking code on the new thread until a new blocking call or the task is finished
• The CPU jumps back to the main thread, schedules a new thread that is ready to make progress, and jumps to that thread

Figure 2.2 – Program flow using fibers/green threads

Context switching – How Programming Languages Model Asynchronous Program Flow

Creating new threads takes time

Creating a new OS thread involves some bookkeeping and initialization overhead, so while switching between two existing threads in the same process is pretty fast, creating new ones and discarding ones you don't use anymore involves work that takes time. All the extra work will limit throughput if a system needs to create and discard a lot of them. This can be a problem if you have huge numbers of small tasks that need to be handled concurrently, which is often the case when dealing with a lot of I/O.

Each thread has its own stack

We’ll cover stacks in detail later in this book, but for now, it’s enough to know that they occupy a fixed size of memory. Each OS thread comes with its own stack, and even though many systems allow this size to be configured, they’re still fixed in size and can’t grow or shrink. They are, after all, the cause of stack overflows, which will be a problem if you configure them to be too small for the tasks you’re running.

If we have many small tasks that only require a little stack space but we reserve much more than we need, we will occupy large amounts of memory and possibly run out of it.
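For illustration, Rust's standard library lets us choose this size per thread when we spawn it (the 32 KiB value here is arbitrary, picked just for the sketch):

use std::thread;

fn main() {
    // Each OS thread gets a fixed-size stack; thread::Builder lets us
    // set that size up front, before the thread starts running.
    let handle = thread::Builder::new()
        .stack_size(32 * 1024)
        .spawn(|| {
            println!("running on a thread with a 32 KiB stack");
        })
        .expect("failed to spawn thread");
    handle.join().unwrap();
}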

Context switching

As you now know, threads and schedulers are tightly connected. Context switching happens when the CPU stops executing one thread and proceeds with another one. Even though this process is highly optimized, it still involves storing and restoring the register state, which takes time. Every time that you yield to the OS scheduler, it can choose to schedule a thread from a different process on that CPU.

You see, threads created by these systems belong to a process. When you start a program, it starts a process, and the process creates at least one initial thread where it executes the program you’ve written. Each process can spawn multiple threads that share the same address space.

That means that threads within the same process can access shared memory and can access the same resources, such as files and file handles. One consequence of this is that when the OS switches contexts by stopping one thread and resuming another within the same process, it doesn’t have to save and restore all the state associated with that process, just the state that’s specific to that thread.
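A quick sketch of what that sharing looks like in practice: two threads in the same process incrementing a single counter through shared memory (synchronized with a mutex):

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Both threads live in the same address space, so they can
    // reference the very same counter.
    let counter = Arc::new(Mutex::new(0));
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                *counter.lock().unwrap() += 1;
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    println!("count: {}", *counter.lock().unwrap());
}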

On the other hand, when the OS switches from a thread associated with one process to a thread associated with another, the new process will use a different address space, and the OS needs to take measures to make sure that process “A” doesn’t access data or resources that belong to process “B”. If it didn’t, the system wouldn’t be secure.

The consequence is that caches might need to be flushed and more state might need to be saved and restored. In a highly concurrent system under load, these context switches can take extra time and thereby limit the throughput in a somewhat unpredictable manner if they happen frequently enough.

Threads – How Programming Languages Model Asynchronous Program Flow

We keep referring to threads all throughout this book, so before we get too far in, let’s stop and give “thread” a good definition since it’s one of those fundamental terms that causes a lot of confusion.

In the most general sense, a thread refers to a thread of execution, meaning a set of instructions that need to be executed sequentially. If we tie this back to the first chapter of this book, where we provided several definitions under the Concurrency vs. Parallelism subsection, a thread of execution is similar to what we defined as a task with multiple steps that need resources to progress.

The generality of this definition can be a cause of some confusion. A thread to one person can obviously refer to an OS thread, and to another person, it can simply refer to any abstraction that represents a thread of execution on a system.

Threads are often divided into two broad categories:

  • OS threads: These threads are created by the OS and managed by the OS scheduler. On Linux, this is known as a kernel thread.
  • User-level threads: These threads are created and managed by us as programmers without the OS knowing about them.

Now, this is where things get a bit tricky: OS threads on most modern operating systems have a lot of similarities. Some of these similarities are dictated by the design of modern CPUs. One example of this is that most CPUs assume that there is a stack it can perform operations on and that it has a register for the stack pointer and instructions for stack manipulation.

User-level threads can, in their broadest sense, refer to any implementation of a system (runtime) that creates and schedules tasks, and you can’t make the same assumptions as you do with OS threads. They can closely resemble OS threads by using separate stacks for each task, as we’ll see in Chapter 5 when we go through our fiber/green threads example, or they can be radically different in nature, as we’ll see when we go through how Rust models concurrent operations later on in Part 3 of this book.

No matter the definition, a set of tasks needs something that manages them and decides who gets what resources to progress. The most obvious resource on a computer system that all tasks need to progress is CPU time. We call the “something” that decides who gets CPU time to progress a scheduler.

Most likely, when someone refers to a “thread” without adding extra context, they refer to an OS thread/kernel thread, so that’s what we’ll do going forward.

I’ll also keep referring to a thread of execution as simply a task. I find the topic of asynchronous programming easier to reason about when we limit the use of terms that have different assumptions associated with them depending on the context as much as possible.

With that out of the way, let’s go through some defining characteristics of OS threads while we also highlight their limitations.

Firmware – Concurrency and Asynchronous Programming: a Detailed Overview

Interrupts

As you know by now, there are two kinds of interrupts:

  • Hardware interrupts
  • Software interrupts

They are very different in nature.

Hardware interrupts

Hardware interrupts are created by sending an electrical signal through an interrupt request (IRQ) line. These hardware lines signal the CPU directly.

Software interrupts

These are interrupts issued from software instead of hardware. As in the case of a hardware interrupt, the CPU jumps to the interrupt descriptor table (IDT) and runs the handler for the specified interrupt.

Firmware

Firmware doesn’t get much attention from most of us; however, it’s a crucial part of the world we live in. It runs on all kinds of hardware and has all kinds of strange and peculiar ways to make the computers we program on work.

Now, the firmware needs a microcontroller to be able to work. Even the CPU has firmware that makes it work. That means there are many more small ‘CPUs’ on our system than the cores we program against.

Why is this important? Well, you remember that concurrency is all about efficiency, right? Since we have many CPUs/microcontrollers already doing work for us on our system, one of our concerns is to not replicate or duplicate that work when we write code.

If a network card has firmware that continually checks whether new data has arrived, it’s pretty wasteful if we duplicate that by letting our CPU continually check whether new data arrives as well. It’s much better if we either check once in a while, or even better, get notified when data has arrived.

Summary

This chapter covered a lot of ground, so good job on doing all that legwork. We learned a little bit about how CPUs and operating systems have evolved from a historical perspective and the difference between non-preemptive and preemptive multitasking. We discussed the difference between concurrency and parallelism, talked about the role of the operating system, and learned that system calls are the primary way for us to interact with the host operating system. You’ve also seen how the CPU and the operating system cooperate through an infrastructure designed as part of the CPU.

Lastly, we went through a diagram on what happens when you issue a network call. You know there are at least three different ways for us to deal with the fact that the I/O call takes some time to execute, and we have to decide which way we want to handle that waiting time.

This covers most of the general background information we need so that we have the same definitions and overview before we go on. We’ll go into more detail as we progress through the book, and the first topic that we’ll cover in the next chapter is how programming languages model asynchronous program flow by looking into threads, coroutines and futures.

A simplified overview – Concurrency and Asynchronous Programming: a Detailed Overview

Let’s look at some of the steps where we imagine that we read from a network card:

Remember that we’re simplifying a lot here. This is a rather complex operation, but we’ll focus on the parts that are of most interest to us and skip a few steps along the way.

Step 1 – Our code

We register a socket. This happens by issuing a syscall to the OS. Depending on the OS, we either get a file descriptor (macOS/Linux) or a socket (Windows).

The next step is that we register our interest in Read events on that socket.

Step 2 – Registering events with the OS

This is handled in one of three ways:

  1. We tell the operating system that we’re interested in Read events but we want to wait for it to happen by yielding control over our thread to the OS. The OS then suspends our thread by storing the register state and switches to some other thread
    From our perspective, this will be blocking our thread until we have data to read.
  2. We tell the operating system that we’re interested in Read events but we just want a handle to a task that we can poll to check whether the event is ready or not.
    The OS will not suspend our thread, so this will not block our code.
  3. We tell the operating system that we are probably going to be interested in many events, but we want to subscribe to one event queue. When we poll this queue, it will block our thread until one or more events occur.

This will block our thread while we wait for events to occur.

Chapters 3 and 4 will go into detail about the third method, as it’s the method most modern async frameworks use to handle concurrency.
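To give a taste of what the third method looks like in practice, here's a minimal sketch using the mio crate mentioned earlier (assuming its os-poll and net features are enabled; the address and token are placeholders for illustration):

use mio::net::TcpStream;
use mio::{Events, Interest, Poll, Token};

fn main() -> std::io::Result<()> {
    // The event queue itself (epoll, kqueue, or IOCP under the hood).
    let mut poll = Poll::new()?;
    let mut events = Events::with_capacity(128);

    // The address is a placeholder for this sketch.
    let addr = "127.0.0.1:8080".parse().unwrap();
    let mut stream = TcpStream::connect(addr)?;

    // Register our interest in Read events on this socket.
    poll.registry()
        .register(&mut stream, Token(0), Interest::READABLE)?;

    // Block the thread until one or more events are ready.
    poll.poll(&mut events, None)?;
    for event in events.iter() {
        println!("event ready on token {:?}", event.token());
    }
    Ok(())
}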

Step 3 – The network card

We’re skipping some steps here, but I don’t think they’re vital to our understanding.

On the network card, there is a small microcontroller running specialized firmware. We can imagine that this microcontroller is polling in a busy loop, checking whether any data is incoming.

The exact way the network card handles its internals is a little different from what I suggest here, and will most likely vary from vendor to vendor. The important part is that there is a very simple but specialized CPU running on the network card doing work to check whether there are incoming events.

Once the firmware registers incoming data, it issues a hardware interrupt.

But can’t we just change the page table in the CPU? – Concurrency and Asynchronous Programming: a Detailed Overview

Now, this is where the privilege level comes in. Most modern operating systems operate with two ring levels: ring 0, the kernel space, and ring 3, the user space.

Figure 1.2 – Privilege rings

Most CPUs have a concept of more rings than most modern operating systems use. This has historical reasons, which is also why ring 0 and ring 3 are used (and not 1 and 2).

Every entry in the page table has additional information about it. Among that information is which ring it belongs to. This information is set up when your OS boots up.

Code executed in ring 0 has almost unrestricted access to external devices and memory, and is free to change registers that provide security at the hardware level.

The code you write in ring 3 will typically have extremely restricted access to I/O and certain CPU registers (and instructions). Trying to issue an instruction or set a register from ring 3 to change the page table will be prevented by the CPU. The CPU will then treat this as an exception and jump to the handler for that exception provided by the OS.

This is also the reason why you have no other choice than to cooperate with the OS and handle I/O tasks through syscalls. The system wouldn’t be very secure if this wasn’t the case.

So, to sum it up: yes, the CPU and the OS cooperate a great deal. Most modern desktop CPUs are built with an OS in mind, so they provide the hooks and infrastructure that the OS latches onto upon bootup. When the OS spawns a process, it also sets its privilege level, making sure that normal processes stay within the borders it defines to maintain stability and security.

Interrupts, firmware, and I/O

We’re nearing the end of the general CS subjects in this book, and we’ll start to dig our way out of the rabbit hole soon.

This part tries to tie things together and look at how the whole computer works as a system to handle I/O and concurrency.

Let’s get to it!

Concurrency from the operating system’s perspective – Concurrency and Asynchronous Programming: a Detailed Overview

The role of the operating system

The operating system (OS) stands in the center of everything we do as programmers (well, unless you’re writing an operating system or working in the embedded realm), so there is no way for us to discuss any kind of fundamentals in programming without talking about operating systems in a bit of detail.

Concurrency from the operating system’s perspective

This ties into what I talked about earlier when I said that concurrency needs to be talked about within a reference frame, and I explained that the OS might stop and start your process at any time.

What we call synchronous code is, in most cases, code that appears synchronous to us as programmers. Neither the OS nor the CPU lives in a fully synchronous world.

Operating systems use preemptive multitasking, and as long as the operating system you’re running preemptively schedules processes, you won’t have a guarantee that your code runs instruction by instruction without interruption.

The operating system will make sure that all important processes get some time from the CPU to make progress.

Note

This is not as simple when we’re talking about modern machines with 4, 6, 8, or 12 physical cores, since you might actually execute code on one of the CPUs uninterrupted if the system is under very little load. The important part here is that you can’t know for sure and there is no guarantee that your code will be left to run uninterrupted.

Teaming up with the operating system

When you make a web request, you’re not asking the CPU or the network card to do something for you – you’re asking the operating system to talk to the network card for you.

There is no way for you as a programmer to make your system optimally efficient without playing to the strengths of the operating system. You basically don’t have access to the hardware directly. You must remember that the operating system is an abstraction over the hardware.

However, this also means that to understand everything from the ground up, you’ll also need to know how your operating system handles these tasks. To be able to work with the operating system, you’ll need to know how you can communicate with it, and that’s exactly what we’re going to go through next.