Why use an OS-backed event queue? – Understanding OS-Backed Event Queues, System Calls, and Cross-Platform Abstractions

You already know by now that we need to cooperate closely with the OS to make I/O operations as efficient as possible. Operating systems such as Linux, macOS, and Windows provide several ways of performing I/O, both blocking and non-blocking.

I/O operations need to go through the operating system since they are dependent on resources that our operating system abstracts over. This can be the disk drive, the network card, or other peripherals. Especially in the case of network calls, we’re not only dependent on our own hardware, but we also depend on resources that might reside far away from our own, causing a significant delay.

In the previous chapter, we covered different ways to handle asynchronous operations when programming, and while they’re all different, they all have one thing in common: they need control over when and if they should yield to the OS scheduler when making a syscall.

In practice, this means that syscalls that normally would yield to the OS scheduler (blocking calls) need to be avoided, and we need to use non-blocking calls instead. We also need an efficient way to know the status of each call so we know when the task that made the otherwise blocking call is ready to progress. This is the main reason for using an OS-backed event queue in an asynchronous runtime.

We’ll look at three different ways of handling an I/O operation as an example.

Blocking I/O

When we ask the operating system to perform a blocking operation, it will suspend the OS thread that makes the call. It will then store the CPU state it had at the point where we made the call and go on to do other things. When data arrives for us through the network, it will wake up our thread again, restore the CPU state, and let us resume as if nothing had happened.

Blocking operations are the least flexible to use for us as programmers since we yield control to the OS at every call. The big advantage is that our thread gets woken up once the event we’re waiting for is ready so we can continue. If we take the whole system running on the OS into account, it’s a pretty efficient solution since the OS will give threads that have work to do time on the CPU to progress. However, if we narrow the scope to look at our process in isolation, we find that every time we make a blocking call, we put a thread to sleep, even if we still have work that our process could do. This leaves us with the choice of spawning new threads to do work on or just accepting that we have to wait for the blocking call to return. We’ll go a little more into detail about this later.
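To make this concrete, here is a minimal sketch of blocking I/O in Rust. The address and the hand-written HTTP request are just placeholders for illustration; the point is that each of the calls below parks the calling OS thread until the operation completes:

use std::io::{Read, Write};
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    // connect() blocks until the connection is established (or fails).
    let mut stream = TcpStream::connect("example.com:80")?;
    // write_all() may block until the OS has accepted all the bytes.
    stream.write_all(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")?;
    // read() suspends this OS thread; the OS wakes it up again
    // when data has arrived on the socket.
    let mut buf = [0u8; 4096];
    let n = stream.read(&mut buf)?;
    println!("read {n} bytes");
    Ok(())
}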

Non-blocking I/O

Unlike with a blocking I/O operation, the OS will not suspend the thread that made the I/O request. Instead, it gives the thread a handle that the thread can use to ask the operating system whether the event is ready or not.

We call the process of querying for status polling.

Non-blocking I/O operations give us as programmers more freedom, but, as usual, that comes with a responsibility. If we poll too often, such as in a loop, we will occupy a lot of CPU time just to ask for an updated status, which is very wasteful. If we poll too infrequently, there will be a significant delay between an event being ready and us doing something about it, thus limiting our throughput.
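A minimal sketch of the same operation using non-blocking I/O in Rust might look like this (again, the address and request are placeholders; the interesting part is the WouldBlock handling and the polling loop):

use std::io::{ErrorKind, Read, Write};
use std::net::TcpStream;
use std::thread;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    let mut stream = TcpStream::connect("example.com:80")?;
    stream.write_all(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")?;
    // From this point on, reads return immediately instead of blocking.
    stream.set_nonblocking(true)?;

    let mut buf = [0u8; 4096];
    loop {
        match stream.read(&mut buf) {
            // The data was ready: handle it and stop polling.
            Ok(n) => {
                println!("read {n} bytes");
                break;
            }
            // Not ready yet: this is the signal to poll again later.
            Err(e) if e.kind() == ErrorKind::WouldBlock => {
                // Polling in a tight loop wastes CPU, so we back off a bit.
                // How long to wait is exactly the trade-off described above.
                thread::sleep(Duration::from_millis(10));
            }
            Err(e) => return Err(e),
        }
    }
    Ok(())
}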

Callback based approaches – How Programming Languages Model Asynchronous Program Flow

Note!

This is another example of M:N threading. Many tasks can run concurrently on one OS thread. Each task consists of a chain of callbacks.

You probably already know what we’re going to talk about in the next paragraphs from JavaScript, which I assume most readers are familiar with.

The whole idea behind a callback-based approach is to save a pointer to a set of instructions we want to run later together with whatever state is needed. In Rust, this would be a closure.
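As a rough illustration of that idea in Rust, a callback can be a boxed closure that owns whatever state it captured, stored somewhere until we decide to run it. The Runtime type and its methods below are made up for this sketch; they’re not from any particular library:

// A hypothetical, minimal illustration: a "callback" is just a boxed
// closure that owns whatever state it needs to run later.
struct Runtime {
    callbacks: Vec<Box<dyn FnOnce()>>,
}

impl Runtime {
    fn new() -> Self {
        Runtime { callbacks: Vec::new() }
    }

    // Save the instructions (the closure) together with its captured state.
    fn register(&mut self, callback: impl FnOnce() + 'static) {
        self.callbacks.push(Box::new(callback));
    }

    // At some later point (for example, when an event is ready), run them.
    fn run(self) {
        for callback in self.callbacks {
            callback();
        }
    }
}

fn main() {
    let mut rt = Runtime::new();
    let state = String::from("some captured state");
    // Nothing runs here; we only store the closure and the state it moved in.
    rt.register(move || println!("running later with: {state}"));
    rt.run();
}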

Implementing callbacks is relatively easy in most languages. They don’t require any context switching or pre-allocated memory for each task.

However, representing concurrent operations using callbacks requires you to write the program in a radically different way from the start. Re-writing a program that uses a normal sequential program flow to one using callbacks represents a substantial rewrite, and the same goes the other way.

Callback-based concurrency can be hard to reason about and can become very complicated to understand. It’s no coincidence that the term “callback hell” is something most JavaScript developers are familiar with.

Since each sub-task must save all the state it needs for later, the memory usage will grow linearly with the number of callbacks in a task.

Advantages
• Easy to implement in most languages
• No context switching
• Relatively low memory overhead (in most cases)
Drawbacks
• Memory usage grows linearly with the number of callbacks.
• Programs and code can be hard to reason about.
• It’s a very different way of writing programs and it will affect almost all aspects of the program since all yielding operations require one callback.
• Ownership can be hard to reason about. The consequence is that writing callback-based programs without a garbage collector can become very difficult.
• Sharing state between tasks is difficult due to the complexity of ownership rules.
• Debugging callbacks can be difficult.

Coroutines: promises and futures

Note!

This is another example of M:N threading. Many tasks can run concurrently on one OS thread. Each task is represented as a state machine.

Promises in JavaScript and futures in Rust are two different implementations that are based on the same idea.

There are differences between different implementations, but we’ll not focus on those here. It’s worth explaining promises a bit since they’re widely known due to their use in JavaScript. Promises also have a lot in common with Rust’s futures.

First of all, many languages have a concept of promises, but I’ll use the one from JavaScript in the following examples.

Promises are one way to deal with the complexity that comes with a callback-based approach.

Instead of:
setTimer(200, () => {
  setTimer(100, () => {
    setTimer(50, () => {
      console.log("I'm the last one");
    });
  });
});

We can do:
function timer(ms) {
    return new Promise((resolve) => setTimeout(resolve, ms));
}
timer(200)
.then(() => timer(100))
.then(() => timer(50))
.then(() => console.log("I'm the last one"));

The latter approach is also referred to as the continuation-passing style. Each subtask calls a new one once it’s finished.

The difference between callbacks and promises is even more substantial under the hood. You see, promises return a state machine that can be in one of three states: pending, fulfilled, or rejected.

When we call timer(200) in the previous example, we get back a promise in the pending state.
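To get a feel for what such a state machine might look like, here is an illustrative enum in Rust terms. This is purely a sketch of the three states; it’s not how JavaScript promises or Rust futures are actually implemented:

// An illustrative three-state machine, loosely modeled on a promise.
// T is the value on fulfillment, E the reason for rejection.
enum Promise<T, E> {
    Pending,
    Fulfilled(T),
    Rejected(E),
}

fn main() {
    // Something like timer(200) would hand back a value in the Pending
    // state; the runtime later moves it to Fulfilled or Rejected.
    let p: Promise<u32, String> = Promise::Pending;
    match p {
        Promise::Pending => println!("not ready yet"),
        Promise::Fulfilled(value) => println!("got {value}"),
        Promise::Rejected(reason) => println!("failed: {reason}"),
    }
}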

Now, the continuation-passing style does fix some of the issues related to callbacks, but it still retains a lot of them when it comes to complexity and the need to write programs in a different way. However, promises and futures enable us to leverage the compiler to solve a lot of these problems, which we’ll discuss in the next paragraph.

Context switching – How Programming Languages Model Asynchronous Program Flow

Even though these fibers/green threads are lightweight compared to OS threads, you still have to save and restore registers at every context switch. This likely won’t be a problem most of the time, but when compared to alternatives that don’t require context switching, it can be less efficient.

Context switching can also be pretty complex to get right, especially if you intend to support many different platforms.

Scheduling

When a fiber/green thread yields to the runtime scheduler, the scheduler can simply resume execution on a new task that’s ready to run. This means that you avoid the problem of being put in the same run queue as every other task in the system every time you yield to the scheduler. From the OS perspective, your threads are busy doing work all the time, so it will try to avoid pre-empting them if it can.

One unexpected downside of this is that most OS schedulers make sure all threads get some time to run by giving each OS thread a time slice where it can run before the OS pre-empts the thread and schedules a new thread on that CPU. A program using many OS threads might be allotted more time slices than a program with fewer OS threads. A program using M:N threading will most likely only use a few OS threads (one thread per CPU core seems to be the starting point on most systems). So, depending on whatever else is running on the system, your program might be allotted fewer time slices in total than it would be using many OS threads. However, with the number of cores available on most modern CPUs and the typical workload on concurrent systems, the impact from this should be minimal.

FFI

Since you create your own stacks that are supposed to grow/shrink under certain conditions and might have a scheduler that assumes it can pre-empt running tasks at any point, you will have to take extra measures when you use FFI. Most FFI functions will assume a normal OS-provided C-stack, so it will most likely be problematic to call an FFI function from a fiber/green thread. You need to notify the runtime scheduler, context switch to a different OS thread, and have some way of notifying the scheduler that you’re done and the fiber/green thread can continue. This naturally creates overhead and added complexity both for the runtime implementor and the user making the FFI call.

Advantages
• It is simple to use for the user. The code will look like it does when using OS threads.
• Context switching is reasonably fast.
• Abundant memory usage is less of a problem when compared to OS threads.
• You are in full control over how tasks are scheduled, and, if you want, you can prioritize them as you see fit.
• It’s easy to incorporate pre-emption, which can be a powerful feature.
Drawbacks
• Stacks need a way to grow when they run out of space, creating additional work and complexity
• You still need to save the CPU state on every context switch
• It’s complicated to implement correctly if you intend to support many platforms and/or CPU architectures
• FFI can have a lot of overhead and add unexpected complexity

Example – How Programming Languages Model Asynchronous Program Flow

Since we’ll not spend more time talking about OS threads in this book, we’ll go through a short example so you can see how they’re used:
ch02/aa-os-threads
use std::thread::{self, sleep};

fn main() {
    println!("So, we start the program here!");
    let t1 = thread::spawn(move || {
        sleep(std::time::Duration::from_millis(200));
        println!("The long-running tasks finish last!");
    });
    let t2 = thread::spawn(move || {
        sleep(std::time::Duration::from_millis(100));
        println!("We can chain callbacks…");
        let t3 = thread::spawn(move || {
            sleep(std::time::Duration::from_millis(50));
            println!("…like this!");
        });
        t3.join().unwrap();
    });
    println!("The tasks run concurrently!");
    t1.join().unwrap();
    t2.join().unwrap();
}

In this example, we simply spawn several OS threads and put them to sleep. Sleeping is essentially the same as yielding to the OS scheduler with a request to be re-scheduled to run after a certain time has passed. To make sure our main thread doesn’t finish and exit (which would exit the process) before our child threads have had time to run, we join them at the end of our main function.
If we run the example, we’ll see how the operations occur in a different order based on how long we yielded each thread to the scheduler:
So, we start the program here!
The tasks run concurrently!
We can chain callbacks…
…like this!
The long-running tasks finish last!

So, while using OS threads is great for a number of tasks, we also outlined a number of good reasons to look at alternatives by discussing their limitations and downsides. The first alternatives we’ll look at are what we call fibers and green threads.

Fibers and green threads

Note!

This is an example of M:N threading. Many tasks can run concurrently on one OS thread. Fibers and green threads are often referred to as stackful coroutines.

The name “green threads” originally stems from an early implementation of an M:N threading model used in Java and has since been associated with different implementations of M:N threading. You will encounter different variations of this term, such as “green processes” (used in Erlang), which are different from the ones we discuss here. You’ll also see some that define green threads more broadly than we do here.

The way we define green threads in this book makes them synonymous with fibers, so both terms refer to the same thing going forward.

The implementation of fibers and green threads implies that there is a runtime with a scheduler that’s responsible for scheduling what task (M) gets time to run on the OS thread (N). There are many more tasks than there are OS threads, and such a system can run perfectly fine using only one OS thread. The latter case is often referred to as M:1 threading.

Goroutines are an example of a specific implementation of stackful coroutines, but they come with slight nuances. The term “coroutine” usually implies that they’re cooperative in nature, but Goroutines can be pre-empted by the scheduler (at least since version 1.14), thereby landing them in somewhat of a grey area using the categories we present here.

Green threads and fibers use the same mechanisms as an OS, setting up a stack for each task, saving the CPU’s state, and jumping from one task (thread) to another by doing a context switch.

We yield control to the scheduler (which is a central part of the runtime in such a system), which then continues running a different task.

The state of execution is stored in each stack, so in such a solution, there would be no need for async, await, Future, or Pin. In many ways, green threads mimic how an operating system facilitates concurrency, and implementing them is a great learning experience.

A runtime using fibers/green threads for concurrent tasks can have a high degree of flexibility. Tasks can, for example, be pre-empted and context switched at any time and at any point in their execution, so a long-running task that hogs the CPU could in theory be pre-empted by the runtime, acting as a safeguard against tasks that end up blocking the whole system due to an edge case or a programmer error.

This gives the runtime scheduler almost the same capabilities as the OS scheduler, which is one of the biggest advantages of systems using fibers/green threads.

The typical flow goes as follows:

• You run some non-blocking code
• You make a blocking call to some external resource
• The CPU jumps to the main thread, which schedules a different thread to run and jumps to that stack
• You run some non-blocking code on the new thread until a new blocking call or the task is finished
• The CPU jumps back to the main thread, schedules a new thread that is ready to make progress, and jumps to that thread

Figure 2.2 – Program flow using fibers/green threads

Scheduling – How Programming Languages Model Asynchronous Program Flow

The OS can schedule tasks differently than you might expect, and every time you yield to the OS, you’re put in the same queue as all other threads and processes on the system.

Moreover, since there is no guarantee that the thread will resume execution on the same CPU core as it left off or that two tasks won’t run in parallel and try to access the same data, you need to synchronize data access to prevent data races and other pitfalls associated with multicore programming.

Rust as a language will help you prevent many of these pitfalls, but synchronizing data access will require extra work and add to the complexity of such programs. We often say that using OS threads to handle concurrency gives us parallelism for free, but it isn’t free in terms of added complexity and the need for proper data access synchronization.
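As a small sketch of what that extra work looks like in Rust, sharing a counter between several OS threads forces you to be explicit about synchronization:

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Shared state must be explicitly wrapped for thread-safe access.
    let counter = Arc::new(Mutex::new(0));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                // The lock serializes access and prevents data races.
                let mut guard = counter.lock().unwrap();
                *guard += 1;
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    println!("final count: {}", counter.lock().unwrap());
}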

The advantage of decoupling asynchronous operations from OS threads

Decoupling asynchronous operations from the concept of threads has a lot of benefits.

First of all, using OS threads as a means to handle concurrency requires us to use what essentially is an OS abstraction to represent our tasks.

Having a separate layer of abstraction to represent concurrent tasks gives us the freedom to choose how we want to handle concurrent operations. If we create an abstraction over concurrent operations such as a future in Rust, a promise in JavaScript, or a goroutine in Go, it is up to the runtime implementor to decide how these concurrent tasks are handled.

A runtime could simply map each concurrent operation to an OS thread, or it could use fibers/green threads or state machines to represent the tasks. The programmer who writes the asynchronous code will not necessarily have to change anything in their code if the underlying implementation changes. In theory, the same asynchronous code could be used to handle concurrent operations on a microcontroller without an OS if there’s just a runtime for it.

To sum it up, using threads provided by the operating system to handle concurrency has the following advantages:
• Simple to understand
• Easy to use
• Switching between tasks is reasonably fast
• You get parallelism for free
However, they also have a few drawbacks:
• OS-level threads come with a rather large stack. If you have many tasks waiting simultaneously (as you would in a web server under heavy load), you’ll run out of memory pretty fast.
• Context switching can be costly and you might get an unpredictable performance since you let the OS do all the scheduling.
• The OS has many things it needs to handle. It might not switch back to your thread as fast as you’d wish.
• It is tightly coupled to an OS abstraction. This might not be an option on some systems.

Threads provided by the operating system – How Programming Languages Model Asynchronous Program Flow

Important!

Definitions will vary depending on what book or article you read. For example, if you read about how a specific operating system works, you might see that processes or threads are abstractions that represent “tasks”, which will seem to contradict the definitions we use here. As I mentioned earlier, the choice of reference frame is important, and it’s why we take so much care to define the terms we use thoroughly as we encounter them throughout the book.

The definition of a thread can also vary by operating system, even though most popular systems share a similar definition today. Most notably, Solaris (pre-Solaris 9, which was released in 2002) used to have a two-level thread system that differentiated between application threads, lightweight processes, and kernel threads. This was an implementation of what we call M:N threading, which we’ll get to know more about later in this book. Just beware that if you read older material, the definition of a thread in such a system might differ significantly from the one that’s commonly used today.

Now that we’ve gone through the most important definitions for this chapter, it’s time to talk more about the most popular ways of handling concurrency when programming.

Threads provided by the operating system

Note!

We call this 1:1 threading. Each task is assigned one OS thread.

Since this book will not focus specifically on OS threads as a way to handle concurrency going forward, we treat them more thoroughly here.

Let’s start with the obvious. To use threads provided by the operating system, you need, well, an operating system. Before we discuss the use of threads as a means to handle concurrency, we need to be clear about what kind of operating systems we’re talking about since they come in different flavors.

Embedded systems are more widespread now than ever before. This kind of hardware might not have the resources for an operating system, and if it does, you might use a radically different kind of operating system tailored to your needs, as these systems tend to be less general purpose and more specialized in nature.

Their support for threads, and the characteristics of how they schedule them, might be different from what you’re used to in operating systems such as Windows or Linux.

Since covering all the different designs is a book on its own, we’ll limit the scope to talk about threads as they’re used in Windows and Linux-based systems running on popular desktop and server CPUs.

OS threads are simple to implement and simple to use. We simply let the OS take care of everything for us. We do this by spawning a new OS thread for each task we want to accomplish and writing code as we normally would.

The runtime we use to handle concurrency for us is the operating system itself. In addition to these advantages, you get parallelism for free. However, there are also some drawbacks and complexities resulting from directly managing parallelism and shared resources.

Firmware – Concurrency and Asynchronous Programming: a Detailed Overview

Interrupts

As you know by now, there are two kinds of interrupts:

  • Hardware interrupts
  • Software interrupts

They are very different in nature.

Hardware interrupts

Hardware interrupts are created by sending an electrical signal through an IRQ. These hardware lines signal the CPU directly.

Software interrupts

These are interrupts issued from software instead of hardware. As in the case of a hardware interrupt, the CPU jumps to the IDT and runs the handler for the specified interrupt.

Firmware

Firmware doesn’t get much attention from most of us; however, it’s a crucial part of the world we live in. It runs on all kinds of hardware and has all kinds of strange and peculiar ways to make the computers we program on work.

Now, the firmware needs a microcontroller to be able to work. Even the CPU has firmware that makes it work. That means there are many more small ‘CPUs’ on our system than the cores we program against.

Why is this important? Well, you remember that concurrency is all about efficiency, right? Since we have many CPUs/microcontrollers already doing work for us on our system, one of our concerns is to not replicate or duplicate that work when we write code.

If a network card has firmware that continually checks whether new data has arrived, it’s pretty wasteful if we duplicate that by letting our CPU continually check whether new data arrives as well. It’s much better if we either check once in a while, or even better, get notified when data has arrived.

Summary

This chapter covered a lot of ground, so good job on doing all that legwork. We learned a little bit about how CPUs and operating systems have evolved from a historical perspective and the difference between non-preemptive and preemptive multitasking. We discussed the difference between concurrency and parallelism, talked about the role of the operating system, and learned that system calls are the primary way for us to interact with the host operating system. You’ve also seen how the CPU and the operating system cooperate through an infrastructure designed as part of the CPU.

Lastly, we went through a diagram on what happens when you issue a network call. You know there are at least three different ways for us to deal with the fact that the I/O call takes some time to execute, and we have to decide which way we want to handle that waiting time.

This covers most of the general background information we need so that we have the same definitions and overview before we go on. We’ll go into more detail as we progress through the book, and the first topic that we’ll cover in the next chapter is how programming languages model asynchronous program flow by looking into threads, coroutines and futures.

A simplified overview – Concurrency and Asynchronous Programming: a Detailed Overview

Step 4 – Hardware interrupt

A modern CPU has a set of interrupt request lines (IRQs) that it uses to handle events that occur from external devices. A CPU has a fixed set of interrupt lines.

A hardware interrupt is an electrical signal that can occur at any time. The CPU immediately interrupts its normal workflow to handle the interrupt by saving the state of its registers and looking up the interrupt handler. The interrupt handlers are defined in the interrupt descriptor table (IDT).

Step 5 – Interrupt handler

The IDT is a table where the OS (or a driver) registers handlers for different interrupts that may occur. Each entry points to a handler function for a specific interrupt. The handler function for a network card would typically be registered and handled by a driver for that card.

Note

The IDT is not stored on the CPU as it might seem in Figure 1.3. It’s located in a fixed and known location in the main memory. The CPU only holds a pointer to the table in one of its registers.

Step 6 – Writing the data

This is a step that might vary a lot depending on the CPU and the firmware on the network card. If the network card and the CPU support direct memory access (DMA), which should be the standard on all modern systems today, the network card will write data directly to a set of buffers that the OS already has set up in the main memory.

In such a system, the firmware on the network card might issue an interrupt when the data is written to memory. DMA is very efficient, since the CPU is only notified when the data is already in memory. On older systems, the CPU needed to devote resources to handle the data transfer from the network card.

The direct memory access controller (DMAC) is added to the diagram since in such a system, it would control the access to memory. It’s not part of the CPU as indicated in the previous diagram. We’re deep enough in the rabbit hole now, and exactly where the different parts of a system are is not really important to us right now, so let’s move on.

Step 7 – The driver

The driver would normally handle the communication between the OS and the network card. At some point, the buffers are filled and the network card issues an interrupt. The CPU then jumps to the handler of that interrupt. The interrupt handler for this exact type of interrupt is registered by the driver, so it’s actually the driver that handles this event and, in turn, informs the kernel that the data is ready to be read.

Step 8 – Reading the data

Depending on whether we chose method 1, 2, or 3, the OS will do as follows:

  • Wake our thread
  • Return Ready on the next poll
  • Wake the thread and return a Read event for the handler we registered

Down the rabbit hole – Concurrency and Asynchronous Programming: a Detailed Overview

It turns out that there is a great deal of cooperation between the OS and the CPU, but maybe not in the way you would naively think.

Many modern CPUs provide some basic infrastructure that operating systems use. This infrastructure gives us the security and stability we expect. Actually, most advanced CPUs provide a lot more options than operating systems such as Linux, BSD, and Windows actually use.

There are two in particular that I want to address here:

  • How the CPU prevents us from accessing memory we’re not supposed to access
  • How the CPU handles asynchronous events such as I/O

We’ll cover the first one here and the second in the next section.

How does the CPU prevent us from accessing memory we’re not supposed to access?

As I mentioned, modern CPU architectures define some basic concepts by design. Some examples of this are as follows:

• Virtual memory
• Page table
• Page fault
• Exceptions
• Privilege level

Exactly how this works will differ depending on the specific CPU, so we’ll treat them in general terms here.

Most modern CPUs have a memory management unit (MMU). This part of the CPU is often etched on the same die as the rest of the CPU. The MMU’s job is to translate the virtual address we use in our programs to a physical address.

When the OS starts a process (such as our program), it sets up a page table for our process and makes sure a special register on the CPU points to this page table.
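The exact listing isn’t reproduced in this excerpt, but the code being referred to is along these lines. This is only a sketch that mirrors the names used in the text; the dereference helper is assumed:

fn dereference(ptr: *const usize) -> usize {
    // Dereferencing a raw pointer requires unsafe; the hardware and the OS,
    // not the compiler, catch invalid addresses at runtime.
    unsafe { *ptr }
}

fn main() {
    let t = 100usize;
    // A valid pointer to a value on our stack.
    let t_ptr: *const usize = &t;
    println!("{}", dereference(t_ptr)); // prints 100

    // A made-up address with no entry in our page table
    // (assumes a 64-bit system).
    let bad_ptr = 99999999999999_usize as *const usize;
    // This triggers a page fault; the OS reports it as a segmentation fault.
    println!("{}", dereference(bad_ptr));
}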

Now, when we try to dereference t_ptr in the preceding code, the address is at some point sent for translation to the MMU, which looks it up in the page table to translate it to a physical address in the memory where it can fetch the data.

In the first case, it will point to a memory address on our stack that holds the value 100.

When we pass in 99999999999999 and ask it to fetch what’s stored at that address (which is what dereferencing does), it looks for the translation in the page table but can’t find it.

The CPU then treats this as a page fault.

At boot, the OS provided the CPU with an interrupt descriptor table. This table has a predefined format where the OS provides handlers for the predefined conditions the CPU can encounter.

Since the OS provided a pointer to a function that handles page fault, the CPU jumps to that function when we try to dereference 99999999999999 and thereby hands over control to the operating system.

The OS then prints a nice message for us, letting us know that we encountered what it calls a segmentation fault. This message will therefore vary depending on the OS you run the code on.

Concurrency from the operating system’s perspective – Concurrency and Asynchronous Programming: a Detailed Overview

The role of the operating system

The operating system (OS) stands in the center of everything we do as programmers (well, unless you’re writing an operating system or working in the embedded realm), so there is no way for us to discuss any kind of fundamentals in programming without talking about operating systems in a bit of detail.

Concurrency from the operating system’s perspective

This ties into what I talked about earlier when I said that concurrency needs to be talked about within a reference frame, and I explained that the OS might stop and start your process at any time.

What we call synchronous code is, in most cases, code that appears synchronous to us as programmers. Neither the OS nor the CPU lives in a fully synchronous world.

Operating systems use preemptive multitasking, and as long as the operating system you’re running is preemptively scheduling processes, you won’t have a guarantee that your code runs instruction by instruction without interruption.

The operating system will make sure that all important processes get some time from the CPU to make progress.

Note

This is not as simple when we’re talking about modern machines with 4, 6, 8, or 12 physical cores, since you might actually execute code on one of the CPUs uninterrupted if the system is under very little load. The important part here is that you can’t know for sure and there is no guarantee that your code will be left to run uninterrupted.

Teaming up with the operating system

When you make a web request, you’re not asking the CPU or the network card to do something for you – you’re asking the operating system to talk to the network card for you.

There is no way for you as a programmer to make your system optimally efficient without playing to the strengths of the operating system. You basically don’t have access to the hardware directly. You must remember that the operating system is an abstraction over the hardware.

However, this also means that to understand everything from the ground up, you’ll also need to know how your operating system handles these tasks. To be able to work with the operating system, you’ll need to know how you can communicate with it, and that’s exactly what we’re going to go through next.