Why use an OS-backed event queue? – Understanding OS-Backed Event Queues, System Calls, and Cross-Platform Abstractions

You already know by now that we need to cooperate closely with the OS to make I/O operations as efficient as possible. Operating systems such as Linux, macOS, and Windows provide several ways of performing I/O, both blocking and non-blocking.

I/O operations need to go through the operating system since they are dependent on resources that our operating system abstracts over. This can be the disk drive, the network card, or other peripherals. Especially in the case of network calls, we’re not only dependent on our own hardware, but we also depend on resources that might reside far away from our own, causing a significant delay.

In the previous chapter, we covered different ways to handle asynchronous operations when programming, and while they’re all different, they all have one thing in common: they need control over when and if they should yield to the OS scheduler when making a syscall.

In practice, this means that syscalls that would normally yield to the OS scheduler (blocking calls) need to be avoided, and we need to use non-blocking calls instead. We also need an efficient way to know the status of each call so we know when the task that made the otherwise blocking call is ready to progress. This is the main reason for using an OS-backed event queue in an asynchronous runtime.

We’ll look at three different ways of handling an I/O operation as an example.

Blocking I/O

When we ask the operating system to perform a blocking operation, it will suspend the OS thread that makes the call. It will then store the CPU state it had at the point where we made the call and go on to do other things. When data arrives for us through the network, it will wake up our thread again, restore the CPU state, and let us resume as if nothing had happened.

Blocking operations are the least flexible to use for us as programmers since we yield control to the OS at every call. The big advantage is that our thread gets woken up once the event we’re waiting for is ready so we can continue. If we take the whole system running on the OS into account, it’s a pretty efficient solution since the OS will give threads that have work to do time on the CPU to progress. However, if we narrow the scope to look at our process in isolation, we find that every time we make a blocking call, we put a thread to sleep, even if we still have work that our process could do. This leaves us with the choice of spawning new threads to do work on or just accepting that we have to wait for the blocking call to return. We’ll go a little more into detail about this later.
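
To make this concrete, here’s a minimal sketch of a blocking read using Rust’s standard library. The address is just a placeholder for illustration; the point is that the read call parks our thread until the OS has data for us:
use std::io::Read;
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    // The address is a placeholder for this sketch.
    let mut stream = TcpStream::connect("127.0.0.1:8080")?;
    let mut buf = [0u8; 1024];
    // `read` is a blocking call: the OS suspends this thread until
    // data arrives or the peer closes the connection.
    let n = stream.read(&mut buf)?;
    println!("read {n} bytes");
    Ok(())
}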

Non-blocking I/O

Unlike with a blocking I/O operation, the OS will not suspend the thread that made the I/O request; instead, it gives the thread a handle that it can use to ask the operating system whether the event is ready or not.

We call the process of querying for status polling.

Non-blocking I/O operations give us as programmers more freedom, but, as usual, that comes with a responsibility. If we poll too often, such as in a loop, we will occupy a lot of CPU time just to ask for an updated status, which is very wasteful. If we poll too infrequently, there will be a significant delay between an event being ready and us doing something about it, thus limiting our throughput.
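
As a rough sketch of what polling can look like, the following example puts a standard library socket into non-blocking mode and polls it in a loop. The address and the 10-millisecond pause are arbitrary choices for illustration, and they show exactly the trade-off described above:
use std::io::{ErrorKind, Read};
use std::net::TcpStream;
use std::thread;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    let mut stream = TcpStream::connect("127.0.0.1:8080")?;
    // Reads now return immediately instead of suspending the thread.
    stream.set_nonblocking(true)?;

    let mut buf = [0u8; 1024];
    loop {
        match stream.read(&mut buf) {
            Ok(0) => break, // the connection was closed
            Ok(n) => println!("read {n} bytes"),
            // WouldBlock means "not ready yet"; this is the poll result.
            Err(e) if e.kind() == ErrorKind::WouldBlock => {
                // Sleeping is a crude compromise between polling too often
                // (wasting CPU) and too seldom (adding latency).
                thread::sleep(Duration::from_millis(10));
            }
            Err(e) => return Err(e),
        }
    }
    Ok(())
}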

Technical requirements – Understanding OS-Backed Event Queues, System Calls, and Cross-Platform Abstractions

This chapter doesn’t require you to set up anything new, but since we’re writing some low-level code for three different platforms, you need access to these platforms if you want to run all the examples.

The best way to follow along is to open the accompanying repository on your computer and navigate to the ch03 folder.

This chapter is a little special since we build some basic understanding from the ground up, which means some of it is quite low-level and requires a specific operating system and CPU family to run. Don’t worry; I’ve chosen the most popular and widely used CPU architecture, so this shouldn’t be a problem, but it is something you need to be aware of.

On Windows and Linux, the machine must have a CPU that uses the x86-64 instruction set. Intel and AMD desktop CPUs use this architecture, but if you run Linux (or WSL) on a machine with an ARM processor, you might encounter issues with some of the examples that use inline assembly. On macOS, the example in the book targets the newer M-family of chips, but the repository also contains examples targeting the older Intel-based Macs.

Unfortunately, some examples targeting specific platforms require that specific operating system to run. However, this will be the only chapter where you need access to three different platforms to run all the examples. Going forward, we’ll create examples that will run on all platforms either natively or using Windows Subsystem for Linux (WSL), but to understand the basics of cross-platform abstractions, we need to actually create examples that target these different platforms.

Running the Linux examples

If you don’t have a Linux machine set up, you can run the Linux example on the Rust Playground, or if you’re on a Windows system, my suggestion is to set up WSL and run the code there. You can find the instructions on how to do that at https://learn.microsoft.com/en-us/windows/wsl/install. Remember, you have to install Rust in the WSL environment as well, so follow the instructions in the Preface section of this book on how to install Rust on Linux.

If you use VS Code as your editor, there is a very simple way of switching your environment to WSL. Press Ctrl+Shift+P and type Reopen Folder in WSL. This way, you can easily open the example folder in WSL and run the code examples using Linux there.

Context switching – How Programming Languages Model Asynchronous Program Flow

Even though these fibers/green threads are lightweight compared to OS threads, you still have to save and restore registers at every context switch. This likely won’t be a problem most of the time, but when compared to alternatives that don’t require context switching, it can be less efficient.

Context switching can also be pretty complex to get right, especially if you intend to support many different platforms.

Scheduling

When a fiber/green thread yields to the runtime scheduler, the scheduler can simply resume execution on a new task that’s ready to run. This means that you avoid the problem of being put in the same run queue as every other task in the system every time you yield to the scheduler. From the OS perspective, your threads are busy doing work all the time, so it will try to avoid pre-empting them if it can.

One unexpected downside of this is that most OS schedulers make sure all threads get some time to run by giving each OS thread a time slice where it can run before the OS pre-empts the thread and schedules a new thread on that CPU. A program using many OS threads might be allotted more time slices than a program with fewer OS threads. A program using M:N threading will most likely only use a few OS threads (one thread per CPU core seems to be the starting point on most systems). So, depending on whatever else is running on the system, your program might be allotted fewer time slices in total than it would be using many OS threads. However, with the number of cores available on most modern CPUs and the typical workload on concurrent systems, the impact from this should be minimal.

FFI

Since you create your own stacks that are supposed to grow/shrink under certain conditions and might have a scheduler that assumes it can pre-empt running tasks at any point, you will have to take extra measures when you use FFI. Most FFI functions will assume a normal OS-provided C-stack, so it will most likely be problematic to call an FFI function from a fiber/green thread. You need to notify the runtime scheduler, context switch to a different OS thread, and have some way of notifying the scheduler that you’re done and the fiber/green thread can continue. This naturally creates overhead and added complexity both for the runtime implementor and the user making the FFI call.
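
One common workaround, sketched below, is to run the blocking FFI call on a dedicated OS thread with a normal stack and hand the result back over a channel. The blocking_ffi_call function is just a stand-in for a real foreign function, and a real runtime would park the calling fiber instead of blocking on the channel the way this simplified example does:
use std::sync::mpsc;
use std::thread;

// Stand-in for a real FFI call that expects a normal, OS-provided stack.
fn blocking_ffi_call() -> i32 {
    42
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // Run the call on a dedicated OS thread with an ordinary stack
    // instead of on the fiber's small, growable stack.
    thread::spawn(move || {
        let result = blocking_ffi_call();
        tx.send(result).unwrap();
    });
    // A real runtime would suspend the fiber here and resume it when the
    // result arrives; in this sketch we simply block on the channel.
    let result = rx.recv().unwrap();
    println!("FFI call returned {result}");
}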

Advantages
• It is simple to use for the user. The code will look like it does when using OS threads.
• Context switching is reasonably fast.
• High memory usage is less of a problem when compared to OS threads.
• You are in full control over how tasks are scheduled, and if you want, you can prioritize them as you see fit.
• It’s easy to incorporate pre-emption, which can be a powerful feature.
Drawbacks
• Stacks need a way to grow when they run out of space, which creates additional work and complexity.
• You still need to save the CPU state on every context switch.
• It’s complicated to implement correctly if you intend to support many platforms and/or CPU architectures.
• FFI can have a lot of overhead and add unexpected complexity.

Each stack has a fixed space – How Programming Languages Model Asynchronous Program Flow

As fibers and green threads are similar to OS threads, they do have some of the same drawbacks as well. Each task is set up with a stack of a fixed size, so you still have to reserve more space than you actually use. However, these stacks can be growable, meaning that once the stack is full, the runtime can grow the stack. While this sounds easy, it’s a rather complicated problem to solve.

We can’t simply grow a stack as we grow a tree. What actually needs to happen is one of two things:

  1. You allocate a new piece of contiguous memory and handle the fact that your stack is spread over two disjointed memory segments
  2. You allocate a new larger stack (for example, twice the size of the previous stack), move all your data over to the new stack, and continue from there

The first solution sounds pretty simple, as you can leave the original stack as it is, and you can basically context switch over to the new stack when needed and continue from there. However, modern CPUs can work extremely fast if they can work on a contiguous piece of memory due to caching and their ability to predict what data your next instructions are going to work on. Spreading the stack over two disjointed pieces of memory will hinder performance. This is especially noticeable when you have a loop that happens to be just at the stack boundary, so you end up making up to two context switches for each iteration of the loop.

The second solution solves the problems with the first solution by having the stack as a contiguous piece of memory, but it comes with some problems as well.

First, you need to allocate a new stack and move all the data over to the new stack. But what happens with all pointers and references that point to something located on the stack when everything moves to a new location? You guessed it: every pointer and reference to anything located on the stack needs to be updated so they point to the new location. This is complex and time-consuming, but if your runtime already includes a garbage collector, you already have the overhead of keeping track of all your pointers and references anyway, so it might be less of a problem than it would be for a non-garbage-collected program. However, it does require a great deal of integration between the garbage collector and the runtime to do this every time the stack grows, so implementing this kind of runtime can get very complicated.

Second, you have to consider what happens if you have a lot of long-running tasks that only require a lot of stack space for a brief period of time (for example, due to heavy recursion at the start of the task) but are mostly I/O bound the rest of the time. You end up growing the stack many times over for only one specific part of the task, and you have to decide whether to accept that the task occupies more space than it needs or to move it back to a smaller stack at some point. The impact this will have on your program will, of course, vary greatly based on the type of work you do, but it’s still something to be aware of.

Context switching – How Programming Languages Model Asynchronous Program Flow

Creating new threads takes time

Creating a new OS thread involves some bookkeeping and initialization overhead, so while switching between two existing threads in the same process is pretty fast, creating new ones and discarding ones you don’t use anymore involves work that takes time. All this extra work will limit throughput if a system needs to create and discard a lot of threads. This can be a problem if you have huge numbers of small tasks that need to be handled concurrently, which is often the case when dealing with a lot of I/O.

Each thread has its own stack

We’ll cover stacks in detail later in this book, but for now, it’s enough to know that they occupy a fixed amount of memory. Each OS thread comes with its own stack, and even though many systems allow the size to be configured, it’s still fixed and can’t grow or shrink at runtime. Fixed-size stacks are, after all, the cause of stack overflows, which will be a problem if you configure them to be too small for the tasks you’re running.

If we have many small tasks that only require a little stack space but we reserve much more than we need, we will occupy large amounts of memory and possibly run out of it.
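
For reference, the Rust standard library lets you choose a stack size when spawning a thread. The 32 KiB figure below is only for illustration; defaults and sensible values vary by platform:
use std::thread;

fn main() {
    // Ask for a 32 KiB stack instead of the platform default (often
    // several megabytes). Make it too small and you risk a stack overflow.
    let handle = thread::Builder::new()
        .stack_size(32 * 1024)
        .spawn(|| {
            println!("running on a small, fixed-size stack");
        })
        .expect("failed to spawn thread");
    handle.join().unwrap();
}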

Context switching

As you now know, threads and schedulers are tightly connected. Context switching happens when the CPU stops executing one thread and proceeds with another one. Even though this process is highly optimized, it still involves storing and restoring the register state, which takes time. Every time that you yield to the OS scheduler, it can choose to schedule a thread from a different process on that CPU.

You see, threads created by these systems belong to a process. When you start a program, it starts a process, and the process creates at least one initial thread where it executes the program you’ve written. Each process can spawn multiple threads that share the same address space.

That means that threads within the same process can access shared memory and can access the same resources, such as files and file handles. One consequence of this is that when the OS switches contexts by stopping one thread and resuming another within the same process, it doesn’t have to save and restore all the state associated with that process, just the state that’s specific to that thread.
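
A small illustration of this: the vector below is allocated once in the process’s address space, and both spawned threads can read it directly (the Arc only manages how long it lives):
use std::sync::Arc;
use std::thread;

fn main() {
    // Data allocated once in the process's address space...
    let shared = Arc::new(vec![1, 2, 3]);

    let handles: Vec<_> = (0..2)
        .map(|id| {
            let shared = Arc::clone(&shared);
            // ...is directly visible to every thread in that process.
            thread::spawn(move || println!("thread {id} sees {:?}", shared))
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}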

On the other hand, when the OS switches from a thread associated with one process to a thread associated with another, the new process will use a different address space, and the OS needs to take measures to make sure that process “A” doesn’t access data or resources that belong to process “B”. If it didn’t, the system wouldn’t be secure.

The consequence is that caches might need to be flushed and more state might need to be saved and restored. In a highly concurrent system under load, these context switches can take extra time and thereby limit the throughput in a somewhat unpredictable manner if they happen frequently enough.

Threads provided by the operating system – How Programming Languages Model Asynchronous Program Flow

Important!

Definitions will vary depending on what book or article you read. For example, if you read about how a specific operating system works, you might see that processes or threads are abstractions that represent “tasks”, which will seem to contradict the definitions we use here. As I mentioned earlier, the choice of reference frame is important, and it’s why we take so much care to define the terms we use thoroughly as we encounter them throughout the book.

The definition of a thread can also vary by operating system, even though most popular systems share a similar definition today. Most notably, Solaris (pre-Solaris 9, which was released in 2002) used to have a two-level thread system that differentiated between application threads, lightweight processes, and kernel threads. This was an implementation of what we call M:N threading, which we’ll get to know more about later in this book. Just be aware that if you read older material, the definition of a thread in such a system might differ significantly from the one that’s commonly used today.

Now that we’ve gone through the most important definitions for this chapter, it’s time to talk more about the most popular ways of handling concurrency when programming.

Threads provided by the operating system

Note!

We call this 1:1 threading. Each task is assigned one OS thread.

Since this book will not focus specifically on OS threads as a way to handle concurrency going forward, we’ll cover them more thoroughly here.

Let’s start with the obvious. To use threads provided by the operating system, you need, well, an operating system. Before we discuss the use of threads as a means to handle concurrency, we need to be clear about what kind of operating systems we’re talking about since they come in different flavors.

Embedded systems are more widespread now than ever before. This kind of hardware might not have the resources for an operating system, and if it does, you might use a radically different kind of operating system tailored to your needs, as these systems tend to be less general purpose and more specialized in nature.

Their support for threads, and the characteristics of how they schedule them, might be different from what you’re used to in operating systems such as Windows or Linux.

Since covering all the different designs is a book on its own, we’ll limit the scope to talking about threads as they’re used in Windows and Linux-based systems running on popular desktop and server CPUs.

OS threads are simple to implement and simple to use. We simply let the OS take care of everything for us. We do this by spawning a new OS thread for each task we want to accomplish and write code as we normally would.

The runtime we use to handle concurrency for us is the operating system itself. In addition to these advantages, you get parallelism for free. However, there are also some drawbacks and complexities resulting from directly managing parallelism and shared resources.
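
As a minimal sketch of this model, the following program spawns one OS thread per task and lets the operating system schedule them. The number of tasks is arbitrary:
use std::thread;

fn main() {
    // One OS thread per task: the OS schedules them for us and gives us
    // parallelism for free on a multi-core machine.
    let handles: Vec<_> = (0..4)
        .map(|task_id| {
            thread::spawn(move || {
                println!("task {task_id} running on its own OS thread");
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}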

Firmware – Concurrency and Asynchronous Programming: a Detailed Overview

Interrupts

As you know by now, there are two kinds of interrupts:

  • Hardware interrupts
  • Software interrupts

They are very different in nature.

Hardware interrupts

Hardware interrupts are created by sending an electrical signal through an interrupt request line (IRQ). These hardware lines signal the CPU directly.

Software interrupts

These are interrupts issued from software instead of hardware. As in the case of a hardware interrupt, the CPU looks up the specified interrupt in the interrupt descriptor table (IDT) and runs its handler.

Firmware

Firmware doesn’t get much attention from most of us; however, it’s a crucial part of the world we live in. It runs on all kinds of hardware and has all kinds of strange and peculiar ways to make the computers we program on work.

Now, firmware needs a microcontroller of some kind to run on, and even the CPU has firmware that makes it work. That means there are many more small ‘CPUs’ on our system than the cores we program against.

Why is this important? Well, you remember that concurrency is all about efficiency, right? Since we have many CPUs/microcontrollers already doing work for us on our system, one of our concerns is to not replicate or duplicate that work when we write code.

If a network card has firmware that continually checks whether new data has arrived, it’s pretty wasteful if we duplicate that by letting our CPU continually check whether new data arrives as well. It’s much better if we either check once in a while, or even better, get notified when data has arrived.

Summary

This chapter covered a lot of ground, so good job on doing all that legwork. We learned a little bit about how CPUs and operating systems have evolved from a historical perspective and the difference between non-preemptive and preemptive multitasking. We discussed the difference between concurrency and parallelism, talked about the role of the operating system, and learned that system calls are the primary way for us to interact with the host operating system. You’ve also seen how the CPU and the operating system cooperate through an infrastructure designed as part of the CPU.

Lastly, we went through a diagram on what happens when you issue a network call. You know there are at least three different ways for us to deal with the fact that the I/O call takes some time to execute, and we have to decide which way we want to handle that waiting time.

This covers most of the general background information we need so that we have the same definitions and overview before we go on. We’ll go into more detail as we progress through the book, and the first topic that we’ll cover in the next chapter is how programming languages model asynchronous program flow by looking into threads, coroutines and futures.

A simplified overview – Concurrency and Asynchronous Programming: a Detailed Overview

Let’s look at some of the steps where we imagine that we read from a network card:

Remember that we’re simplifying a lot here. This is a rather complex operation, but we’ll focus on the parts that are of most interest to us and skip a few steps along the way.

Step 1 – Our code

We register a socket. This happens by issuing a syscall to the OS. Depending on the OS, we either get a file descriptor (macOS/Linux) or a socket (Windows).

The next step is that we register our interest in Read events on that socket.
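
As a small illustration of this step, the sketch below connects a socket using the standard library, which issues the relevant syscalls for us, and prints the file descriptor the OS handed back (on Windows, you would use AsRawSocket instead). The address is just a placeholder:
use std::net::TcpStream;
#[cfg(unix)]
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    // Behind this call, the standard library issues the syscalls needed
    // to create and connect the socket.
    let stream = TcpStream::connect("127.0.0.1:8080")?;

    // On macOS/Linux, the OS hands back a file descriptor; on Windows,
    // it hands back a SOCKET handle.
    #[cfg(unix)]
    println!("the OS gave us file descriptor {}", stream.as_raw_fd());

    Ok(())
}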

Step 2 – Registering events with the OS

This is handled in one of three ways:

  1. We tell the operating system that we’re interested in Read events but we want to wait for it to happen by yielding control over our thread to the OS. The OS then suspends our thread by storing the register state and switches to some other thread
    From our perspective, this will be blocking our thread until we have data to read.
  2. We tell the operating system that we’re interested in Read events but we just want a handle to a task that we can poll to check whether the event is ready or not.
    The OS will not suspend our thread, so this will not block our code.
  3. We tell the operating system that we are probably going to be interested in many events, but we want to subscribe to one event queue. When we poll this queue, it will block our thread until one or more events occur.

From our perspective, this will block our thread while we wait for events to occur.

Chapters 3 and 4 will go into detail about the third method, as it’s the one most commonly used by modern async frameworks to handle concurrency.
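
Just to give a taste of what the third method looks like in practice, here is a sketch using the cross-platform mio crate (assuming mio = "0.8" with the os-poll and net features enabled). The address and token value are arbitrary, and Chapters 3 and 4 cover what actually happens behind these calls:
use mio::net::TcpStream;
use mio::{Events, Interest, Poll, Token};

fn main() -> std::io::Result<()> {
    // The OS-backed event queue (epoll, kqueue, or IOCP under the hood).
    let mut poll = Poll::new()?;
    let mut events = Events::with_capacity(16);

    let addr = "127.0.0.1:8080".parse().unwrap();
    let mut stream = TcpStream::connect(addr)?;

    // Subscribe to Read events for this socket on the event queue.
    poll.registry()
        .register(&mut stream, Token(0), Interest::READABLE)?;

    // This call blocks our thread until one or more events occur.
    poll.poll(&mut events, None)?;

    for event in events.iter() {
        println!("event ready on token {:?}", event.token());
    }
    Ok(())
}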

Step 3 – The network card

We’re skipping some steps here, but I don’t think they’re vital to our understanding.

On the network card, there is a small microcontroller running specialized firmware. We can imagine that this microcontroller is polling in a busy loop, checking whether any data is incoming.

The exact way the network card handles its internals is a little different from what I suggest here, and will most likely vary from vendor to vendor. The important part is that there is a very simple but specialized CPU running on the network card doing work to check whether there are incoming events.

Once the firmware registers incoming data, it issues a hardware interrupt.

But can’t we just change the page table in the CPU? – Concurrency and Asynchronous Programming: a Detailed Overview

Now, this is where the privilege level comes in. Most modern operating systems operate with two ring levels: ring 0, the kernel space, and ring 3, the user space.

Figure 1.2 – Privilege rings

Most CPUs have a concept of more rings than most modern operating systems use. There are historical reasons for this, which is also why ring 0 and ring 3 are used (and not 1 and 2).

Every entry in the page table carries additional information, including which ring it belongs to. This information is set up when your OS boots up.

Code executed in ring 0 has almost unrestricted access to external devices and memory, and is free to change registers that provide security at the hardware level.

The code you write in ring 3 will typically have extremely restricted access to I/O and certain CPU registers (and instructions). Trying to issue an instruction or set a register from ring 3 to change the page table will be prevented by the CPU. The CPU will then treat this as an exception and jump to the handler for that exception, which is provided by the OS.

This is also the reason why you have no other choice than to cooperate with the OS and handle I/O tasks through syscalls. The system wouldn’t be very secure if this wasn’t the case.

So, to sum it up: yes, the CPU and the OS cooperate a great deal. Most modern desktop CPUs are built with an OS in mind, so they provide the hooks and infrastructure that the OS latches onto upon bootup. When the OS spawns a process, it also sets its privilege level, making sure that normal processes stay within the borders it defines to maintain stability and security.

Interrupts, firmware, and I/O

We’re nearing the end of the general CS subjects in this book, and we’ll start to dig our way out of the rabbit hole soon.

This part tries to tie things together and look at how the whole computer works as a system to handle I/O and concurrency.

Let’s get to it!

Communicating with the operating system – Concurrency and Asynchronous Programming: a Detailed Overview

Communication with an operating system happens through what we call a system call (syscall). We need to know how to make system calls and understand why it’s so important for us when we want to cooperate and communicate with the operating system. We also need to understand how the basic abstractions we use every day use system calls behind the scenes. We’ll have a detailed walkthrough in Chapter 3, so we’ll keep this brief for now.

A system call uses a public API that the operating system provides so that programs we write in ‘userland’ can communicate with the OS.

Most of the time, these calls are abstracted away for us as programmers by the language or the runtime we use.

Now, a syscall is an example of something that is unique to the kernel you’re communicating with, but the UNIX family of kernels has many similarities. UNIX systems expose this interface through libc.
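
To make this concrete, here’s a minimal sketch of calling the write syscall through the libc crate on a UNIX-family system (assuming libc = "0.2" as a dependency). File descriptor 1 is standard output:
fn main() {
    let msg = "Hello from a raw syscall!\n";
    unsafe {
        // Calls the OS's `write` syscall through libc:
        // write(fd, buffer pointer, buffer length).
        libc::write(1, msg.as_ptr() as *const libc::c_void, msg.len());
    }
}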

Windows, on the other hand, uses its own API, often referred to as WinAPI, and it can operate radically differently from how the UNIX-based systems operate.

Most often, though, there is a way to achieve the same things. In terms of functionality, you might not notice a big difference, but as we’ll see later, especially when we dig into how epoll, kqueue, and IOCP work, they can differ a lot in how this functionality is implemented.

However, a syscall is not the only way we interact with our operating system, as we’ll see in the following section.

The CPU and the operating system

Does the CPU cooperate with the operating system?

If you had asked me this question when I first thought I understood how programs work, I would most likely have answered no. We run programs on the CPU and we can do whatever we want if we know how to do it. Now, first of all, I wouldn’t have thought this through, but unless you learn how CPUs and operating systems work together, it’s not easy to know for sure.

What started to make me think I was very wrong was a segment of code that looked like what you’re about to see. If you think inline assembly in Rust looks foreign and confusing, don’t worry just yet. We’ll go through a proper introduction to inline assembly a little later in this book. I’ll make sure to go through each of the following lines until you get more comfortable with the syntax:

Repository reference: ch01/ac-assembly-dereference/src/main.rs
use std::arch::asm;

fn main() {
    let t = 100;
    let t_ptr: *const usize = &t;
    let x = dereference(t_ptr);
    println!("{}", x);
}

fn dereference(ptr: *const usize) -> usize {
    let mut res: usize;
    unsafe {
        asm!("mov {0}, [{1}]", out(reg) res, in(reg) ptr)
    };
    res
}

What you’ve just looked at is a dereference function written in assembly.

The mov {0}, [{1}] line needs some explanation. {0} and {1} are templates that tell the compiler that we’re referring to the registers that out(reg) and in(reg) represent. The number is just an index, so if we had more inputs or outputs they would be numbered {2}, {3}, and so on. Since we only specify reg and not a specific register, we let the compiler choose what registers it wants to use.

The mov instruction instructs the CPU to take the first 8 bytes (if we’re on a 64-bit machine) it gets when reading the memory location that {1} points to and place that in the register represented by {0}. The [] brackets will instruct the CPU to treat the data in that register as a memory address, and instead of simply copying the memory address itself to {0}, it will fetch what’s at that memory location and move it over.

Anyway, we’re just writing instructions to the CPU here. No standard library, no syscall; just raw instructions. There is no way the OS is involved in that dereference function, right?

If you run this program, you get what you’d expect:
100

Now, if you keep the dereference function but replace the main function with a function that creates a pointer to the 99999999999999 address, which we know is invalid, we get this function:
fn main() {
    let t_ptr = 99999999999999 as *const usize;
    let x = dereference(t_ptr);
    println!("{}", x);
}

Now, if we run that we get the following results.

This is the result on Linux:
Segmentation fault (core dumped)

This is the result on Windows:
error: process didn’t exit successfully: `target\debug\ac-assembly-dereference.exe` (exit code: 0xc0000005, STATUS_ACCESS_VIOLATION)

We get a segmentation fault. Not surprising, really, but as you also might notice, the error we get is different on different platforms. Surely, the OS is involved somehow. Let’s take a look at what’s really happening here.