Why use an OS-backed event queue? – Understanding OS-Backed Event Queues, System Calls, and Cross-Platform Abstractions

You already know by now that we need to cooperate closely with the OS to make I/O operations as efficient as possible. Operating systems such as Linux, macOS, and Windows provide several ways of performing I/O, both blocking and non-blocking.

I/O operations need to go through the operating system since they are dependent on resources that our operating system abstracts over. This can be the disk drive, the network card, or other peripherals. Especially in the case of network calls, we’re not only dependent on our own hardware, but we also depend on resources that might reside far away from our own, causing a significant delay.

In the previous chapter, we covered different ways to handle asynchronous operations when programming, and while they’re all different, they all have one thing in common: they need control over when and if they should yield to the OS scheduler when making a syscall.

In practice, this means that syscalls that normally would yield to the OS scheduler (blocking calls) needs to be avoided and we need to use non-blocking calls instead. We also need an efficient way to know the status of each call so we know when the task that made the otherwise blocking call is ready to progress. This is the main reason for using an OS-backed event queue in an asynchronous runtime.

We’ll look at three different ways of handling an I/O operation as an example.

Blocking I/O

When we ask the operating system to perform a blocking operation, it will suspend the OS thread that makes the call. It will then store the CPU state it had at the point where we made the call and go on to do other things. When data arrives for us through the network, it will wake up our thread again, restore the CPU state, and let us resume as if nothing has happened.

Blocking operations are the least flexible to use for us as programmers since we yield control to the OS at every call. The big advantage is that our thread gets woken up once the event we’re waiting for is ready so we can continue. If we take the whole system running on the OS into account, it’s a pretty efficient solution since the OS will give threads that have work to do time on the CPU to progress. However, if we narrow the scope to look at our process in isolation, we find that every time we make a blocking call, we put a thread to sleep, even if we still have work that our process could do. This leaves us with the choice of spawning new threads to do work on or just accepting that we have to wait for the blocking call to return. We’ll go a little more into detail about this later.

Non-blocking I/O

Unlike a blocking I/O operation, the OS will not suspend the thread that made an I/O request, but instead give it a handle that the thread can use to ask the operating system if the event is ready or not.

We call the process of querying for status polling.

Non-blocking I/O operations give us as programmers more freedom, but, as usual, that comes with a responsibility. If we poll too often, such as in a loop, we will occupy a lot of CPU time just to ask for an updated status, which is very wasteful. If we poll too infrequently, there will be a significant delay between an event being ready and us doing something about it, thus limiting our throughput.

Technical requirements – Understanding OS-Backed Event Queues, System Calls, and Cross-Platform Abstractions

This chapter doesn’t require you to set up anything new, but since we’re writing some low-level code for three different platforms, you need access to these platforms if you want to run all the examples.

The best way to follow along is to open the accompanying repository on your computer and navigate to the ch03 folder.

This chapter is a little special since we build some basic understanding from the ground up, which means some of it is quite low-level and requires a specific operating system and CPU family to run. Don’t worry; I’ve chosen the most used and popular CPU, so this shouldn’t be a problem, but it is something you need to be aware of.

The machine must use a CPU using the x86-64 instruction set on Windows and Linux. Intel and AMD desktop CPUs use this architecture, but if you run Linux (or WSL) on a machine using an ARM processor you might encounter issues with some of the examples using inline assembly. On macOS, the example in the book targets the newer M-family of chips, but the repository also contains examples targeting the older Intel-based Macs.

Unfortunately, some examples targeting specific platforms require that specific operating system to run. However, this will be the only chapter where you need access to three different platforms to run all the examples. Going forward, we’ll create examples that will run on all platforms either natively or using Windows Subsystem for Linux (WSL), but to understand the basics of cross-platform abstractions, we need to actually create examples that target these different platforms.

Running the Linux examples

If you don’t have a Linux machine set up, you can run the Linux example on the Rust Playground, or if you’re on a Windows system, my suggestion is to set up WSL and run the code there. You can find the instructions on how to do that at https://learn.microsoft.com/en-us/windows/wsl/install. Remember, you have to install Rust in the WSL environment as well, so follow the instructions in the Preface section of this book on how to install Rust on Linux.

If you use VS Code as your editor, there is a very simple way of switching your environment to WSL. Press Ctrl+Shift+P and write Reopen folder in WSL. This way, you can easily open the example folder in WSL and run the code examples using Linux there.

NOTE – Understanding OS-Backed Event Queues, System Calls, and Cross-Platform Abstractions

In this chapter, we’ll take a look at how an OS-backed event queue works and how three different operating systems handle this task in different ways. The reason for going through this is that most async runtimes I know of use OS-backed event queues such as this as a fundamental part of achieving high-performance I/O. You’ll most likely hear references to these frequently when reading about how async code really works.

Event queues based on the technology we discuss in this chapter is used in many popular libraries like:

• mio (https://github.com/tokio-rs/mio), a key part of popular runtimes like Tokio
• polling (https://github.com/smol-rs/polling), the event queue used in Smol and async-std
• libuv (https://libuv.org/), the library used to create the event queue used in Node.js (a JavaScript runtime) and the Julia programming language
• C# for its asynchronous network calls
• Boost.Asio, a library for asynchronous network I/O for C++

All our interactions with the host operating system are done through system calls (syscalls). To make a system call using Rust, we need to know how to use Rust’s foreign function interface (FFI).

In addition to knowing how to use FFI and make syscalls, we need to cover cross-platform abstractions. When creating an event queue, whether you create it yourself or use a library, you’ll notice that the abstractions might seem a bit unintuitive if you only have a high-level overview of how, for example, IOCP works on Windows. The reason for this is that these abstractions need to provide one API that covers the fact that different operating systems handle the same task differently. This process often involves identifying a common denominator between the platforms and building a new abstraction on top of that.

Instead of using a rather complex and lengthy example to explain FFI, syscalls, and cross-platform abstractions, we’ll ease into the topic using a simple example. When we encounter these concepts later on, we’ll already know these subjects well enough, so we’re well prepared for the more interesting examples in the following chapters.

In this chapter, we’ll go through the following main topics:
• Why use an OS-backed event queue?
• Readiness-based event queues
• Completion-based event queues
• epoll
• kqueue
• IOCP
• Syscalls, FFI, and cross-platform abstractions

Note

There are popular, although lesser-used, alternatives you should know about even though we don’t cover them here:

wepoll: This uses specific APIs on Windows and wraps IOCP so it closely resembles how epoll works on Linux in contrast to regular IOCP. This makes it easier to create an abstraction layer with the same API on top of the two different technologies. It’s used by both libuv and mio . io_uring: This is a relatively new API on Linux with many similarities to IOCP on Windows.

I’m pretty confident that after you’ve gone through the next two chapters, you will have an easy time reading up on these if you want to learn more about them.

PRE-EMPTION POINTS – How Programming Languages Model Asynchronous Program Flow

Pre-emption points can be thought of as inserting code that calls into the scheduler and asks it if it wishes to pre-empt the task. These points can be inserted by the compiler or the library you use before every new function call for example.

Furthermore, you need compiler support to make the most out of it. Languages that have metaprogramming abilities (such as macros) can emulate much of the same, but this will still not be as seamless as it will when the compiler is aware of these special async tasks.

Debugging is another area where care must be taken when implementing futures/promises. Since the code is re-written as state machines (or generators), you won’t have the same stack traces as you do with normal functions. Usually, you can assume that the caller of a function is what precedes it both in the stack and in the program flow. For futures and promises, it might be the runtime that calls the function that progresses the state machine, so there might not be a good backtrace you can use to see what happened before calling the function that failed. There are ways to work around this, but most of them will incur some overhead.

Advantages
• You can write code and model programs the same way you normally would
• No context switching
• It can be implemented in a very memory-efficient way
• It’s easy to implement for various platforms
Drawbacks
• Pre-emption can be hard, or impossible, to fully implement, as the tasks can’t be stopped in the middle of a stack frame
• It needs compiler support to leverage its full advantages
• Debugging can be difficult both due to the non-sequential program flow and the limitations on the information you get from the backtraces.

Summary

You’re still here? That’s excellent! Good job on getting through all that background information. I know going through text that describes abstractions and code can be pretty daunting, but I hope you see why it’s so valuable for us to go through these higher-level topics now at the start of the book. We’ll get to the examples soon. I promise!

In this chapter, we went through a lot of information on how we can model and handle asynchronous operations in programming languages by using both OS-provided threads and abstractions provided by a programming language or a library. While it’s not an extensive list, we covered some of the most popular and widely used technologies while discussing their advantages and drawbacks.

We spent quite some time going in-depth on threads, coroutines, fibers, green threads, and callbacks, so you should have a pretty good idea of what they are and how they’re different from each other.

The next chapter will go into detail about how we do system calls and create cross-platform abstractions and what OS-backed event queues such as Epoll, Kqueue, and IOCP really are and why they’re fundamental to most async runtimes you’ll encounter out in the wild.

Coroutines and async/await – How Programming Languages Model Asynchronous Program Flow

Coroutines come in two flavors: asymmetric and symmetric. Asymmetric coroutines yields to a scheduler, and they’re the ones we’ll focus on. Symmetric coroutines yield a specific destination; for example, a different coroutine.

While coroutines are a pretty broad concept in general, the introduction of coroutines as objects in programming languages is what really makes this way of handling concurrency rival the ease of use that OS threads and fibers/green threads are known for.

You see when you write async in Rust or JavaScript, the compiler re-writes what looks like a normal function call into a future (in the case of Rust) or a promise (in the case of JavaScript). Await, on the other hand, yields control to the runtime scheduler, and the task is suspended until the future/promise you’re awaiting has finished.

This way, we can write programs that handle concurrent operations in almost the same way we write our normal sequential programs.

Our JavaScript program can now be written as follows:
async function run() {
    await timer(200);
    await timer(100);
    await timer(50);
    console.log(“I’m the last one”);
}

You can consider the run function as a pausable task consisting of several sub-tasks. On each “await” point, it yields control to the scheduler (in this case, it’s the well-known JavaScript event loop).

Once one of the sub-tasks changes state to either fulfilled or rejected, the task is scheduled to continue to the next step.

When using Rust, you can see the same transformation happening with the function signature when you write something such as this:
async fn run() -> () { … }

The function wraps the return object, and instead of returning the type (), it returns a Future with an output type of ():
Fn run() -> impl Future<Output = ()>

Syntactically, Rust’s futures 0.1 was a lot like the promise example we just showed, and the Rust futures we use today have a lot in common with how async/await works in JavaScript..

This way of rewriting what look like normal functions and code into something else has a lot of benefits, but it’s not without its drawbacks.

As with any stackless coroutine implementation, full pre-emption can be hard, or impossible, to implement. These functions have to yield at specific points, and there is no way to suspend execution in the middle of a stack frame in contrast to fibers/green threads. Some level of pre-emption is possible by having the runtime or compiler insert pre-emption points at every function call, for example, but it’s not the same as being able to pre-empt a task at any point during its execution.

Callback based approaches – How Programming Languages Model Asynchronous Program Flow

Note!

This is another example of M:N threading. Many tasks can run concurrently on one OS thread. Each task consists of a chain of callbacks.

You probably already know what we’re going to talk about in the next paragraphs from JavaScript, which I assume most know.

The whole idea behind a callback-based approach is to save a pointer to a set of instructions we want to run later together with whatever state is needed. In Rust, this would be a closure.

Implementing callbacks is relatively easy in most languages. They don’t require any context switching or pre-allocated memory for each task.

However, representing concurrent operations using callbacks requires you to write the program in a radically different way from the start. Re-writing a program that uses a normal sequential program flow to one using callbacks represents a substantial rewrite, and the same goes the other way.

Callback-based concurrency can be hard to reason about and can become very complicated to understand. It’s no coincidence that the term “callback hell” is something most JavaScript developers are familiar with.

Since each sub-task must save all the state it needs for later, the memory usage will grow linearly with the number of callbacks in a task.

Advantages
• Easy to implement in most languages
• No context switching
• Relatively low memory overhead (in most cases)
Drawbacks
• Memory usage grows linearly with the number of callbacks.
• Programs and code can be hard to reason about.
• It’s a very different way of writing programs and it will affect almost all aspects of the program since all yielding operations require one callback.
• Ownership can be hard to reason about. The consequence is that writing callback-based programs without a garbage collector can become very difficult.
• Sharing state between tasks is difficult due to the complexity of ownership rules.
• Debugging callbacks can be difficult.

Coroutines: promises and futures

Note!

This is another example of M:N threading. Many tasks can run concurrently on one OS thread. Each task is represented as a state machine.

Promises in JavaScript and futures in Rust are two different implementations that are based on the same idea.

There are differences between different implementations, but we’ll not focus on those here. It’s worth explaining promises a bit since they’re widely known due to their use in JavaScript. Promises also have a lot in common with Rust’s futures.

First of all, many languages have a concept of promises, but I’ll use the one from JavaScript in the following examples.

Promises are one way to deal with the complexity that comes with a callback-based approach.

Instead of:
setTimer(200, () => {
  setTimer(100, () => {
    setTimer(50, () => {
      console.log(“I’m the last one”);
    });
  });
});

We can do:
function timer(ms) {
    return new Promise((resolve) => setTimeout(resolve, ms));
}
timer(200)
.then(() => timer(100))
.then(() => timer(50))
.then(() => console.log(“I’m the last one”));

The latter approach is also referred to as the continuation-passing style. Each subtask calls a new one once it’s finished.

The difference between callbacks and promises is even more substantial under the hood. You see, promises return a state machine that can be in one of three states: pending, fulfilled, or rejected.

When we call timer(200) in the previous example, we get back a promise in the pending state.

Now, the continuation-passing style does fix some of the issues related to callbacks, but it still retains a lot of them when it comes to complexity and the different ways of writing programs. However, they enable us to leverage the compiler to solve a lot of these problems, which we’ll discuss in the next paragraph.

Context switching – How Programming Languages Model Asynchronous Program Flow

Even though these fibers/green threads are lightweight compared to OS threads, you still have to save and restore registers at every context switch. This likely won’t be a problem most of the time, but when compared to alternatives that don’t require context switching, it can be less efficient.

Context switching can also be pretty complex to get right, especially if you intend to support many different platforms.

Scheduling

When a fiber/green thread yields to the runtime scheduler, the scheduler can simply resume execution on a new task that’s ready to run. This means that you avoid the problem of being put in the same run queue as every other task in the system every time you yield to the scheduler. From the OS perspective, your threads are busy doing work all the time, so it will try to avoid pre-empting them if it can.

One unexpected downside of this is that most OS schedulers make sure all threads get some time to run by giving each OS thread a time slice where it can run before the OS pre-empts the thread and schedules a new thread on that CPU. A program using many OS threads might be allotted more time slices than a program with fewer OS threads. A program using M:N threading will most likely only use a few OS threads (one thread per CPU core seems to be the starting point on most systems). So, depending on whatever else is running on the system, your program might be allotted fewer time slices in total than it would be using many OS threads. However, with the number of cores available on most modern CPUs and the typical workload on concurrent systems, the impact from this should be minimal.

FFI

Since you create your own stacks that are supposed to grow/shrink under certain conditions and might have a scheduler that assumes it can pre-empt running tasks at any point, you will have to take extra measures when you use FFI. Most FFI functions will assume a normal OS-provided C-stack, so it will most likely be problematic to call an FFI function from a fiber/green thread. You need to notify the runtime scheduler, context switch to a different OS thread, and have some way of notifying the scheduler that you’re done and the fiber/green thread can continue. This naturally creates overhead and added complexity both for the runtime implementor and the user making the FFI call.

Advantages
• It is simple to use for the user. The code will look like it does when using OS threads.
• Context switching is reasonably fast.
• Abundant memory usage is less of a problem when compared to OS threads.
• You are in full control over how tasks are scheduled and if you want you can prioritize them as you see fit.
• It’s easy to incorporate pre-emption, which can be a powerful feature.
Drawbacks
• Stacks need a way to grow when they run out of space creating additional work and complexity
• You still need to save the CPU state on every context switch
• It’s complicated to implement correctly if you intend to support many platforms and/or CPU architectures
• FFI can have a lot of overhead and add unexpected complexity

Each stack has a fixed space – How Programming Languages Model Asynchronous Program Flow

As fibers and green threads are similar to OS threads, they do have some of the same drawbacks as well. Each task is set up with a stack of a fixed size, so you still have to reserve more space than you actually use. However, these stacks can be growable, meaning that once the stack is full, the runtime can grow the stack. While this sounds easy, it’s a rather complicated problem to solve.

We can’t simply grow a stack as we grow a tree. What actually needs to happen is one of two things:

  1. You allocate a new piece of continuous memory and handle the fact that your stack is spread over two disjointed memory segments
  2. You allocate a new larger stack (for example, twice the size of the previous stack), move all your data over to the new stack, and continue from there

The first solution sounds pretty simple, as you can leave the original stack as it is, and you can basically context switch over to the new stack when needed and continue from there. However, modern CPUs can work extremely fast if they can work on a contiguous piece of memory due to caching and their ability to predict what data your next instructions are going to work on. Spreading the stack over two disjointed pieces of memory will hinder performance. This is especially noticeable when you have a loop that happens to be just at the stack boundary, so you end up making up to two context switches for each iteration of the loop.

The second solution solves the problems with the first solution by having the stack as a contiguous piece of memory, but it comes with some problems as well.

First, you need to allocate a new stack and move all the data over to the new stack. But what happens with all pointers and references that point to something located on the stack when everything moves to a new location? You guessed it: every pointer and reference to anything located on the stack needs to be updated so they point to the new location. This is complex and time-consuming, but if your runtime already includes a garbage collector, you already have the overhead of keeping track of all your pointers and references anyway, so it might be less of a problem than it would for a non-garbage collected program. However, it does require a great deal of integration between the garbage collector and the runtime to do this every time the stack grows, so implementing this kind of runtime can get very complicated.

Secondly, you have to consider what happens if you have a lot of long-running tasks that only require a lot of stack space for a brief period of time (for example, if it involves a lot of recursion at the start of the task) but are mostly I/O bound the rest of the time. You end up growing your stack many times over only for one specific part of that task, and you have to make a decision whether you will accept that the task occupies more space than it needs or at some point move it back to a smaller stack. The impact this will have on your program will of course vary greatly based on the type of work you do, but it’s still something to be aware of.

Example – How Programming Languages Model Asynchronous Program Flow

Since we’ll not spend more time talking about OS threads in this book, we’ll go through a short example so you can see how they’re used:
ch02/aa-os-threads
use std::thread::{self, sleep};
fn main() {
println!(“So, we start the program here!”);
let t1 = thread::spawn(move || {
sleep(std::time::Duration::from_millis(200));
println!(“The long running tasks finish last!”);
});
let t2 = thread::spawn(move || {
sleep(std::time::Duration::from_millis(100));
println!(“We can chain callbacks…”);
let t3 = thread::spawn(move || {
sleep(std::time::Duration::from_millis(50));
println!(“…like this!”);
});
t3.join().unwrap();
});
println!(“The tasks run concurrently!”);
t1.join().unwrap();
t2.join().unwrap();
}

In this example, we simply spawn several OS threads and put them to sleep. Sleeping is essentially the same as yielding to the OS scheduler with a request to be re-scheduled to run after a certain time has passed. To make sure our main thread doesn’t finish and exit (which will exit the process) before our children thread has had time to run we join them at the end of our main function.
If we run the example, we’ll see how the operations occur in a different order based on how long we yielded each thread to the scheduler:
So, we start the program here!
The tasks run concurrently!
We can chain callbacks…
…like this!
The long-running tasks finish last!

So, while using OS threads is great for a number of tasks, we also outlined a number of good reasons to look at alternatives by discussing their limitations and downsides. The first alternatives we’ll look at are what we call fibers and green threads.

Fibers and green threads

Note!

This is an example of M:N threading. Many tasks can run concurrently on one OS thread. Fibers and green threads are often referred to as stackful coroutines.

The name “green threads” originally stems from an early implementation of an M:N threading model used in Java and has since been associated with different implementations of M:N threading. You will encounter different variations of this term, such as “green processes” (used in Erlang), which are different from the ones we discuss here. You’ll also see some that define green threads more broadly than we do here.

The way we define green threads in this book makes them synonymous with fibers, so both terms refer to the same thing going forward.

The implementation of fibers and green threads implies that there is a runtime with a scheduler that’s responsible for scheduling what task (M) gets time to run on the OS thread (N). There are many more tasks than there are OS threads, and such a system can run perfectly fine using only one OS thread. The latter case is often referred to as M:1 threading.

Goroutines is an example of a specific implementation of stackfull coroutines, but it comes with slight nuances. The term “coroutine” usually implies that they’re cooperative in nature, but Goroutines can be pre-empted by the scheduler (at least since version 1.14), thereby landing them in somewhat of a grey area using the categories we present here.

Green threads and fibers use the same mechanisms as an OS, setting up a stack for each task, saving the CPU’s state, and jumping from one task(thread) to another by doing a context switch.

We yield control to the scheduler (which is a central part of the runtime in such a system), which then continues running a different task.

The state of execution is stored in each stack, so in such a solution, there would be no need for async, await, Future, or Pin. In many ways, green threads mimic how an operating system facilitates concurrency, and implementing them is a great learning experience.

A runtime using fibers/green threads for concurrent tasks can have a high degree of flexibility. Tasks can, for example, be pre-empted and context switched at any time and at any point in their execution, so a long-running task that hogs the CPU could in theory be pre-empted by the runtime, acting as a safeguard from having tasks that end up blocking the whole system due to an edge-case or a programmer error.

This gives the runtime scheduler almost the same capabilities as the OS scheduler, which is one of the biggest advantages of systems using fibers/green threads.

The typical flow goes as follows:

• You run some non-blocking code
• You make a blocking call to some external resource
• The CPU jumps to the main thread, which schedules a different thread to run and jumps to that stack
• You run some non-blocking code on the new thread until a new blocking call or the task is finished
• The CPU jumps back to the main thread, schedules a new thread that is ready to make progress, and jumps to that thread

Figure 2.2 – Program flow using fibers/green threads

Scheduling – How Programming Languages Model Asynchronous Program Flow

The OS can schedule tasks differently than you might expect, and every time you yield to the OS, you’re put in the same queue as all other threads and processes on the system.

Moreover, since there is no guarantee that the thread will resume execution on the same CPU core as it left off or that two tasks won’t run in parallel and try to access the same data, you need to synchronize data access to prevent data races and other pitfalls associated with multicore programming.

Rust as a language will help you prevent many of these pitfalls, but synchronizing data access will require extra work and add to the complexity of such programs. We often say that using OS threads to handle concurrency gives us parallelism for free, but it isn’t free in terms of added complexity and the need for proper data access synchronization.

The advantage of decoupling asynchronous operations from OS threads

Decoupling asynchronous operations from the concept of threads has a lot of benefits.

First of all, using OS threads as a means to handle concurrency requires us to use what essentially is an OS abstraction to represent our tasks.

Having a separate layer of abstraction to represent concurrent tasks gives us the freedom to choose how we want to handle concurrent operations. If we create an abstraction over concurrent operations such as a future in Rust, a promise in JavaScript, or a goroutine in GO, it is up to the runtime implementor to decide how these concurrent tasks are handled.

A runtime could simply map each concurrent operation to an OS thread, they could use fibers/green threads or state machines to represent the tasks. The programmer that writes the asynchronous code will not necessarily have to change anything in their code if the underlying implementation changes. In theory, the same asynchronous code could be used to handle concurrent operations on a microcontroller without an OS if there’s just a runtime for it.

To sum it up, using threads provided by the operating system to handle concurrency has the following advantages:
• Simple to understand
• Easy to use
• Switching between tasks is reasonably fast
• You get parallelism for free
However, they also have a few drawbacks:
• OS-level threads come with a rather large stack. If you have many tasks waiting simultaneously (as you would in a web server under heavy load), you’ll run out of memory pretty fast.
• Context switching can be costly and you might get an unpredictable performance since you let the OS do all the scheduling.
• The OS has many things it needs to handle. It might not switch back to your thread as fast as you’d wish.
• It is tightly coupled to an OS abstraction. This might not be an option on some systems.