Why use an OS-backed event queue? – Understanding OS-Backed Event Queues, System Calls, and Cross-Platform Abstractions

You already know by now that we need to cooperate closely with the OS to make I/O operations as efficient as possible. Operating systems such as Linux, macOS, and Windows provide several ways of performing I/O, both blocking and non-blocking.

I/O operations need to go through the operating system since they are dependent on resources that our operating system abstracts over. This can be the disk drive, the network card, or other peripherals. Especially in the case of network calls, we’re not only dependent on our own hardware, but we also depend on resources that might reside far away from our own, causing a significant delay.

In the previous chapter, we covered different ways to handle asynchronous operations when programming, and while they’re all different, they all have one thing in common: they need control over when and if they should yield to the OS scheduler when making a syscall.

In practice, this means that syscalls that normally would yield to the OS scheduler (blocking calls) needs to be avoided and we need to use non-blocking calls instead. We also need an efficient way to know the status of each call so we know when the task that made the otherwise blocking call is ready to progress. This is the main reason for using an OS-backed event queue in an asynchronous runtime.

We’ll look at three different ways of handling an I/O operation as an example.

Blocking I/O

When we ask the operating system to perform a blocking operation, it will suspend the OS thread that makes the call. It will then store the CPU state it had at the point where we made the call and go on to do other things. When data arrives for us through the network, it will wake up our thread again, restore the CPU state, and let us resume as if nothing has happened.

Blocking operations are the least flexible to use for us as programmers since we yield control to the OS at every call. The big advantage is that our thread gets woken up once the event we’re waiting for is ready so we can continue. If we take the whole system running on the OS into account, it’s a pretty efficient solution since the OS will give threads that have work to do time on the CPU to progress. However, if we narrow the scope to look at our process in isolation, we find that every time we make a blocking call, we put a thread to sleep, even if we still have work that our process could do. This leaves us with the choice of spawning new threads to do work on or just accepting that we have to wait for the blocking call to return. We’ll go a little more into detail about this later.

Non-blocking I/O

Unlike a blocking I/O operation, the OS will not suspend the thread that made an I/O request, but instead give it a handle that the thread can use to ask the operating system if the event is ready or not.

We call the process of querying for status polling.

Non-blocking I/O operations give us as programmers more freedom, but, as usual, that comes with a responsibility. If we poll too often, such as in a loop, we will occupy a lot of CPU time just to ask for an updated status, which is very wasteful. If we poll too infrequently, there will be a significant delay between an event being ready and us doing something about it, thus limiting our throughput.

Technical requirements – Understanding OS-Backed Event Queues, System Calls, and Cross-Platform Abstractions

This chapter doesn’t require you to set up anything new, but since we’re writing some low-level code for three different platforms, you need access to these platforms if you want to run all the examples.

The best way to follow along is to open the accompanying repository on your computer and navigate to the ch03 folder.

This chapter is a little special since we build some basic understanding from the ground up, which means some of it is quite low-level and requires a specific operating system and CPU family to run. Don’t worry; I’ve chosen the most used and popular CPU, so this shouldn’t be a problem, but it is something you need to be aware of.

The machine must use a CPU using the x86-64 instruction set on Windows and Linux. Intel and AMD desktop CPUs use this architecture, but if you run Linux (or WSL) on a machine using an ARM processor you might encounter issues with some of the examples using inline assembly. On macOS, the example in the book targets the newer M-family of chips, but the repository also contains examples targeting the older Intel-based Macs.

Unfortunately, some examples targeting specific platforms require that specific operating system to run. However, this will be the only chapter where you need access to three different platforms to run all the examples. Going forward, we’ll create examples that will run on all platforms either natively or using Windows Subsystem for Linux (WSL), but to understand the basics of cross-platform abstractions, we need to actually create examples that target these different platforms.

Running the Linux examples

If you don’t have a Linux machine set up, you can run the Linux example on the Rust Playground, or if you’re on a Windows system, my suggestion is to set up WSL and run the code there. You can find the instructions on how to do that at https://learn.microsoft.com/en-us/windows/wsl/install. Remember, you have to install Rust in the WSL environment as well, so follow the instructions in the Preface section of this book on how to install Rust on Linux.

If you use VS Code as your editor, there is a very simple way of switching your environment to WSL. Press Ctrl+Shift+P and write Reopen folder in WSL. This way, you can easily open the example folder in WSL and run the code examples using Linux there.

NOTE – Understanding OS-Backed Event Queues, System Calls, and Cross-Platform Abstractions

In this chapter, we’ll take a look at how an OS-backed event queue works and how three different operating systems handle this task in different ways. The reason for going through this is that most async runtimes I know of use OS-backed event queues such as this as a fundamental part of achieving high-performance I/O. You’ll most likely hear references to these frequently when reading about how async code really works.

Event queues based on the technology we discuss in this chapter is used in many popular libraries like:

• mio (https://github.com/tokio-rs/mio), a key part of popular runtimes like Tokio
• polling (https://github.com/smol-rs/polling), the event queue used in Smol and async-std
• libuv (https://libuv.org/), the library used to create the event queue used in Node.js (a JavaScript runtime) and the Julia programming language
• C# for its asynchronous network calls
• Boost.Asio, a library for asynchronous network I/O for C++

All our interactions with the host operating system are done through system calls (syscalls). To make a system call using Rust, we need to know how to use Rust’s foreign function interface (FFI).

In addition to knowing how to use FFI and make syscalls, we need to cover cross-platform abstractions. When creating an event queue, whether you create it yourself or use a library, you’ll notice that the abstractions might seem a bit unintuitive if you only have a high-level overview of how, for example, IOCP works on Windows. The reason for this is that these abstractions need to provide one API that covers the fact that different operating systems handle the same task differently. This process often involves identifying a common denominator between the platforms and building a new abstraction on top of that.

Instead of using a rather complex and lengthy example to explain FFI, syscalls, and cross-platform abstractions, we’ll ease into the topic using a simple example. When we encounter these concepts later on, we’ll already know these subjects well enough, so we’re well prepared for the more interesting examples in the following chapters.

In this chapter, we’ll go through the following main topics:
• Why use an OS-backed event queue?
• Readiness-based event queues
• Completion-based event queues
• epoll
• kqueue
• IOCP
• Syscalls, FFI, and cross-platform abstractions

Note

There are popular, although lesser-used, alternatives you should know about even though we don’t cover them here:

wepoll: This uses specific APIs on Windows and wraps IOCP so it closely resembles how epoll works on Linux in contrast to regular IOCP. This makes it easier to create an abstraction layer with the same API on top of the two different technologies. It’s used by both libuv and mio . io_uring: This is a relatively new API on Linux with many similarities to IOCP on Windows.

I’m pretty confident that after you’ve gone through the next two chapters, you will have an easy time reading up on these if you want to learn more about them.

PRE-EMPTION POINTS – How Programming Languages Model Asynchronous Program Flow

Pre-emption points can be thought of as inserting code that calls into the scheduler and asks it if it wishes to pre-empt the task. These points can be inserted by the compiler or the library you use before every new function call for example.

Furthermore, you need compiler support to make the most out of it. Languages that have metaprogramming abilities (such as macros) can emulate much of the same, but this will still not be as seamless as it will when the compiler is aware of these special async tasks.

Debugging is another area where care must be taken when implementing futures/promises. Since the code is re-written as state machines (or generators), you won’t have the same stack traces as you do with normal functions. Usually, you can assume that the caller of a function is what precedes it both in the stack and in the program flow. For futures and promises, it might be the runtime that calls the function that progresses the state machine, so there might not be a good backtrace you can use to see what happened before calling the function that failed. There are ways to work around this, but most of them will incur some overhead.

Advantages
• You can write code and model programs the same way you normally would
• No context switching
• It can be implemented in a very memory-efficient way
• It’s easy to implement for various platforms
Drawbacks
• Pre-emption can be hard, or impossible, to fully implement, as the tasks can’t be stopped in the middle of a stack frame
• It needs compiler support to leverage its full advantages
• Debugging can be difficult both due to the non-sequential program flow and the limitations on the information you get from the backtraces.

Summary

You’re still here? That’s excellent! Good job on getting through all that background information. I know going through text that describes abstractions and code can be pretty daunting, but I hope you see why it’s so valuable for us to go through these higher-level topics now at the start of the book. We’ll get to the examples soon. I promise!

In this chapter, we went through a lot of information on how we can model and handle asynchronous operations in programming languages by using both OS-provided threads and abstractions provided by a programming language or a library. While it’s not an extensive list, we covered some of the most popular and widely used technologies while discussing their advantages and drawbacks.

We spent quite some time going in-depth on threads, coroutines, fibers, green threads, and callbacks, so you should have a pretty good idea of what they are and how they’re different from each other.

The next chapter will go into detail about how we do system calls and create cross-platform abstractions and what OS-backed event queues such as Epoll, Kqueue, and IOCP really are and why they’re fundamental to most async runtimes you’ll encounter out in the wild.

Each stack has a fixed space – How Programming Languages Model Asynchronous Program Flow

As fibers and green threads are similar to OS threads, they do have some of the same drawbacks as well. Each task is set up with a stack of a fixed size, so you still have to reserve more space than you actually use. However, these stacks can be growable, meaning that once the stack is full, the runtime can grow the stack. While this sounds easy, it’s a rather complicated problem to solve.

We can’t simply grow a stack as we grow a tree. What actually needs to happen is one of two things:

  1. You allocate a new piece of continuous memory and handle the fact that your stack is spread over two disjointed memory segments
  2. You allocate a new larger stack (for example, twice the size of the previous stack), move all your data over to the new stack, and continue from there

The first solution sounds pretty simple, as you can leave the original stack as it is, and you can basically context switch over to the new stack when needed and continue from there. However, modern CPUs can work extremely fast if they can work on a contiguous piece of memory due to caching and their ability to predict what data your next instructions are going to work on. Spreading the stack over two disjointed pieces of memory will hinder performance. This is especially noticeable when you have a loop that happens to be just at the stack boundary, so you end up making up to two context switches for each iteration of the loop.

The second solution solves the problems with the first solution by having the stack as a contiguous piece of memory, but it comes with some problems as well.

First, you need to allocate a new stack and move all the data over to the new stack. But what happens with all pointers and references that point to something located on the stack when everything moves to a new location? You guessed it: every pointer and reference to anything located on the stack needs to be updated so they point to the new location. This is complex and time-consuming, but if your runtime already includes a garbage collector, you already have the overhead of keeping track of all your pointers and references anyway, so it might be less of a problem than it would for a non-garbage collected program. However, it does require a great deal of integration between the garbage collector and the runtime to do this every time the stack grows, so implementing this kind of runtime can get very complicated.

Secondly, you have to consider what happens if you have a lot of long-running tasks that only require a lot of stack space for a brief period of time (for example, if it involves a lot of recursion at the start of the task) but are mostly I/O bound the rest of the time. You end up growing your stack many times over only for one specific part of that task, and you have to make a decision whether you will accept that the task occupies more space than it needs or at some point move it back to a smaller stack. The impact this will have on your program will of course vary greatly based on the type of work you do, but it’s still something to be aware of.

Context switching – How Programming Languages Model Asynchronous Program Flow

Creating new threads takes time

Creating a new OS thread involves some bookkeeping and initialization overhead, so while switching between two existing threads in the same process is pretty fast, creating new ones and discarding ones you don’t use anymore involves work that takes time. All the extra work will limit throughput if a system needs to create and discard a lot of them. This can be a problem if you have huge amounts of small tasks that need to be handled concurrently, which often is the case when dealing with a lot of I/O.

Each thread has its own stack

We’ll cover stacks in detail later in this book, but for now, it’s enough to know that they occupy a fixed size of memory. Each OS thread comes with its own stack, and even though many systems allow this size to be configured, they’re still fixed in size and can’t grow or shrink. They are, after all, the cause of stack overflows, which will be a problem if you configure them to be too small for the tasks you’re running.

If we have many small tasks that only require a little stack space but we reserve much more than we need, we will occupy large amounts of memory and possibly run out of it.

Context switching

As you now know, threads and schedulers are tightly connected. Context switching happens when the CPU stops executing one thread and proceeds with another one. Even though this process is highly optimized, it still involves storing and restoring the register state, which takes time. Every time that you yield to the OS scheduler, it can choose to schedule a thread from a different process on that CPU.

You see, threads created by these systems belong to a process. When you start a program, it starts a process, and the process creates at least one initial thread where it executes the program you’ve written. Each process can spawn multiple threads that share the same address space.

That means that threads within the same process can access shared memory and can access the same resources, such as files and file handles. One consequence of this is that when the OS switches contexts by stopping one thread and resuming another within the same process, it doesn’t have to save and restore all the state associated with that process, just the state that’s specific to that thread.

On the other hand, when the OS switches from a thread associated with one process to a thread associated with another, the new process will use a different address space, and the OS needs to take measures to make sure that process “A” doesn’t access data or resources that belong to process “B”. If it didn’t, the system wouldn’t be secure.

The consequence is that caches might need to be flushed and more state might need to be saved and restored. In a highly concurrent system under load, these context switches can take extra time and thereby limit the throughput in a somewhat unpredictable manner if they happen frequently enough.

Threads – How Programming Languages Model Asynchronous Program Flow

We keep referring to threads all throughout this book, so before we get too far in, let’s stop and give “thread” a good definition since it’s one of those fundamental terms that causes a lot of confusion.

In the most general sense, a thread refers to a thread of execution, meaning a set of instructions that need to be executed sequentially. If we tie this back to the first chapter of this book, where we provided several definitions under the Concurrency vs. Parallelism subsection, a thread of execution is similar to what we defined as a task with multiple steps that need resources to progress.

The generality of this definition can be a cause of some confusion. A thread to one person can obviously refer to an OS thread, and to another person, it can simply refer to any abstraction that represents a thread of execution on a system.

Threads are often divided into two broad categories:

  • • OS threads: These threads are created by the OS and managed by the OS scheduler. On Linux, this is known as a kernel thread.
  • • User-level threads: These threads are created and managed by us as programmers without the OS knowing about them.

Now, this is where things get a bit tricky: OS threads on most modern operating systems have a lot of similarities. Some of these similarities are dictated by the design of modern CPUs. One example of this is that most CPUs assume that there is a stack it can perform operations on and that it has a register for the stack pointer and instructions for stack manipulation.

User-level threads can, in their broadest sense, refer to any implementation of a system (runtime) that creates and schedules tasks, and you can’t make the same assumptions as you do with OS threads. They can closely resemble OS threads by using separate stacks for each task, as we’ll see in Chapter 5 when we go through our fiber/green threads example, or they can be radically different in nature, as we’ll see when we go through how Rust models concurrent operations later on in Part 3 of this book.

No matter the definition, a set of tasks needs something that manages them and decides who gets what resources to progress. The most obvious resource on a computer system that all tasks need to progress is CPU time. We call the “something” that decides who gets CPU time to progress a scheduler.

Most likely, when someone refers to a “thread” without adding extra context, they refer to an OS thread/kernel thread, so that’s what we’ll do going forward.

I’ll also keep referring to a thread of execution as simply a task. I find the topic of asynchronous programming easier to reason about when we limit the use of terms that have different assumptions associated with them depending on the context as much as possible.

With that out of the way, let’s go through some defining characteristics of OS threads while we also highlight their limitations.

Definitions – How Programming Languages Model Asynchronous Program Flow

In the previous chapter, we covered asynchronous program flow, concurrency, and parallelism in general terms. In this chapter, we’ll narrow our scope. Specifically, we’ll look into different ways to model and deal with concurrency in programming languages and libraries.

It’s important to keep in mind that threads, futures, fibers, goroutines, promises, etc. are abstractions that give us a way to model an asynchronous program flow. They have different strengths and weaknesses, but they share a goal of giving programmers an easy-to-use (and importantly, hard to misuse), efficient, and expressive way of creating a program that handles tasks in a non-sequential, and often unpredictable, order.

The lack of precise definitions is prevalent here as well; many terms have a name that stems from a concrete implementation at some point in time but has later taken on a more general meaning that encompasses different implementations and varieties of the same thing.

We’ll first go through a way of grouping different abstractions together based on their similarities before we go on to discuss the pros and cons of each of them. We’ll also go through important definitions that we’ll use throughout the book and discuss OS threads in quite some detail.

The topics we discuss here are quite abstract and complicated so don’t feel bad if you don’t understand everything immediately. As we progress through the book and you get used to the different terms and techniques by working through some examples, more and more pieces will fall into place.

Specifically, the following topics will be covered:

• Definitions
• Threads provided by the operating system
• Green threads/stackfull coroutines/fibers
• Callback based approaches
• Promises, futures, and async/await

Definitions

We can broadly categorize abstractions over concurrent operations into two groups:

  1. Cooperative: These are tasks that yield voluntarily either by explicitly yielding or by calling a function that suspends the task when it can’t progress further before another operation has finished (such as making a network call). Most often, these tasks yield to a scheduler of some sort. Examples of this are tasks generated by async/await in Rust and JavaScript.
  2. Non-cooperative: Tasks that don’t necessarily yield voluntarily. In such a system, the scheduler must be able to pre-empt a running task, meaning that the scheduler can stop the task and take control over the CPU even though the task would have been able to do work and progress. Examples of this are OS threads and Goroutines (after GO version 1.14).

Figure 2.1 – Non-cooperative vs. cooperative multitasking

Note

In a system where the scheduler can pre-empt running tasks, tasks can also yield voluntarily as they do in a cooperative system, and it’s rare with a system that only relies on pre-emption.

We can further divide these abstractions into two broad categories based on the characteristics of their implementation:

  1. Stackful: Each task has its own call stack. This is often implemented as a stack that’s similar to the stack used by the operating system for its threads. Stackful tasks can suspend execution at any point in the program as the whole stack is preserved.
  2. Stackless: There is not a separate stack for each task; they all run sharing the same call stack. A task can’t be suspended in the middle of a stack frame, limiting the runtime’s ability to pre-empt the task. However, they need to store/restore less information when switching between tasks so they can be more efficient.

There are more nuances to these two categories that you’ll get a deep understanding of when we implement an example of both a stackful coroutine (fiber) and a stackless coroutine (Rust futures generated by async/await) later in the book. For now, we keep the details to a minimum to just provide an overview.

A simplified overview – Concurrency and Asynchronous Programming: a Detailed Overview-1

Let’s look at some of the steps where we imagine that we read from a network card:

Remember that we’re simplifying a lot here. This is a rather complex operation but we’ll focus on the parts that are of most interest to us and skip a few steps along the way.

Step 1 – Our code

We register a socket. This happens by issuing a syscall to the OS. Depending on the OS, we either get a file descriptor (macOS/Linux) or a socket (Windows).

The next step is that we register our interest in Read events on that socket.

Step 2 – Registering events with the OS

This is handled in one of three ways:

  1. We tell the operating system that we’re interested in Read events but we want to wait for it to happen by yielding control over our thread to the OS. The OS then suspends our thread by storing the register state and switches to some other thread
    From our perspective, this will be blocking our thread until we have data to read.
  2. We tell the operating system that we’re interested in Read events but we just want a handle to a task that we can poll to check whether the event is ready or not.
    The OS will not suspend our thread, so this will not block our code.
  3. We tell the operating system that we are probably going to be interested in many events, but we want to subscribe to one event queue. When we poll this queue, it will block our thread until one or more events occur.

This will block our thread while we wait for events to occur.

Chapters 3 and 4 will go into detail about the third method, as it’s the most used method for modern async frameworks to handle concurrency.

Step 3 – The network card

We’re skipping some steps here, but I don’t think they’re vital to our understanding.

On the network card, there is a small microcontroller running specialized firmware. We can imagine that this microcontroller is polling in a busy loop, checking whether any data is incoming.

The exact way the network card handles its internals is a little different from what I suggest here, and will most likely vary from vendor to vendor. The important part is that there is a very simple but specialized CPU running on the network card doing work to check whether there are incoming events.

Once the firmware registers incoming data, it issues a hardware interrupt.

Down the rabbit hole – Concurrency and Asynchronous Programming: a Detailed Overview

It turns out that there is a great deal of cooperation between the OS and the CPU, but maybe not in the way you would naively think.

Many modern CPUs provide some basic infrastructure that operating systems use. This infrastructure gives us the security and stability we expect. Actually, most advanced CPUs provide a lot more options than operating systems such as Linux, BSD, and Windows actually use.

There are two in particular that I want to address here:

  • How the CPU prevents us from accessing memory we’re not supposed to access
  • How the CPU handles asynchronous events such as I/O

We’ll cover the first one here and the second in the next section.

How does the CPU prevent us from accessing memory we’re not supposed to access?

As I mentioned, modern CPU architectures define some basic concepts by design. Some examples of this are as follows:

• Virtual memory
• Page table
• Page fault
• Exceptions
• Privilege level

Exactly how this works will differ depending on the specific CPU, so we’ll treat them in general terms here.

Most modern CPUs have a memory management unit (MMU). This part of the CPU is often etched on the same dye, even. The MMU’s job is to translate the virtual address we use in our programs to a physical address.

When the OS starts a process (such as our program), it sets up a page table for our process and makes sure a special register on the CPU points to this page table.

Now, when we try to dereference t_ptr in the preceding code, the address is at some point sent for translation to the MMU, which looks it up in the page table to translate it to a physical address in the memory where it can fetch the data.

In the first case, it will point to a memory address on our stack that holds the value 100.

When we pass in 99999999999999 and ask it to fetch what’s stored at that address (which is what dereferencing does), it looks for the translation in the page table but can’t find it.

The CPU then treats this as a page fault.

At boot, the OS provided the CPU with an interrupt descriptor table. This table has a predefined format where the OS provides handlers for the predefined conditions the CPU can encounter.

Since the OS provided a pointer to a function that handles page fault, the CPU jumps to that function when we try to dereference 99999999999999 and thereby hands over control to the operating system.

The OS then prints a nice message for us, letting us know that we encountered what it calls a segmentation fault. This message will therefore vary depending on the OS you run the code on.