PRE-EMPTION POINTS – How Programming Languages Model Asynchronous Program Flow

Pre-emption points can be thought of as code inserted into the program that calls into the scheduler and asks whether it wishes to pre-empt the task. These points can be inserted by the compiler or the library you use, for example, before every new function call.
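To make this concrete, here is a minimal sketch of what such inserted checks could look like. Everything in it is made up for illustration (there is no standard PREEMPT_REQUESTED flag or preemption_point function); a real compiler or library would generate and optimize these checks for you:

use std::sync::atomic::{AtomicBool, Ordering};

// A hypothetical flag the scheduler sets when it wants the task to yield.
static PREEMPT_REQUESTED: AtomicBool = AtomicBool::new(false);

// The check a compiler or library could insert before every function call.
fn preemption_point() {
    if PREEMPT_REQUESTED.swap(false, Ordering::Relaxed) {
        // A real runtime would suspend the task and hand control back to
        // the scheduler here; in this sketch we just print a message.
        println!("(yielding to the scheduler here)");
    }
}

fn do_some_work() { /* ... */ }
fn do_more_work() { /* ... */ }

fn task() {
    preemption_point(); // inserted by the compiler/library
    do_some_work();
    preemption_point(); // inserted by the compiler/library
    do_more_work();
}

fn main() {
    PREEMPT_REQUESTED.store(true, Ordering::Relaxed);
    task();
}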

Furthermore, you need compiler support to make the most of this approach. Languages with metaprogramming abilities (such as macros) can emulate much of the same behavior, but it will still not be as seamless as when the compiler is aware of these special async tasks.

Debugging is another area where care must be taken when implementing futures/promises. Since the code is re-written as state machines (or generators), you won’t have the same stack traces as you do with normal functions. Usually, you can assume that the caller of a function is what precedes it both in the stack and in the program flow. For futures and promises, it might be the runtime that calls the function that progresses the state machine, so there might not be a good backtrace you can use to see what happened before calling the function that failed. There are ways to work around this, but most of them will incur some overhead.

Advantages
• You can write code and model programs the same way you normally would
• No context switching
• It can be implemented in a very memory-efficient way
• It’s easy to implement for various platforms
Drawbacks
• Pre-emption can be hard, or impossible, to fully implement, as the tasks can’t be stopped in the middle of a stack frame
• It needs compiler support to leverage its full advantages
• Debugging can be difficult both due to the non-sequential program flow and the limitations on the information you get from the backtraces

Summary

You’re still here? That’s excellent! Good job on getting through all that background information. I know going through text that describes abstractions and code can be pretty daunting, but I hope you see why it’s so valuable for us to go through these higher-level topics now at the start of the book. We’ll get to the examples soon. I promise!

In this chapter, we went through a lot of information on how we can model and handle asynchronous operations in programming languages by using both OS-provided threads and abstractions provided by a programming language or a library. While it’s not an extensive list, we covered some of the most popular and widely used technologies while discussing their advantages and drawbacks.

We spent quite some time going in-depth on threads, coroutines, fibers, green threads, and callbacks, so you should have a pretty good idea of what they are and how they’re different from each other.

The next chapter will go into detail about how we do system calls and create cross-platform abstractions and what OS-backed event queues such as Epoll, Kqueue, and IOCP really are and why they’re fundamental to most async runtimes you’ll encounter out in the wild.

Coroutines and async/await – How Programming Languages Model Asynchronous Program Flow

Coroutines come in two flavors: asymmetric and symmetric. Asymmetric coroutines yield to a scheduler, and they’re the ones we’ll focus on. Symmetric coroutines yield to a specific destination; for example, a different coroutine.

While coroutines are a pretty broad concept in general, the introduction of coroutines as objects in programming languages is what really makes this way of handling concurrency rival the ease of use that OS threads and fibers/green threads are known for.

You see, when you write async in Rust or JavaScript, the compiler rewrites what looks like a normal function into a future (in the case of Rust) or a promise (in the case of JavaScript). Await, on the other hand, yields control to the runtime scheduler, and the task is suspended until the future/promise you’re awaiting has finished.

This way, we can write programs that handle concurrent operations in almost the same way we write our normal sequential programs.

Our JavaScript program can now be written as follows:
async function run() {
    await timer(200);
    await timer(100);
    await timer(50);
    console.log("I'm the last one");
}

You can consider the run function as a pausable task consisting of several sub-tasks. On each “await” point, it yields control to the scheduler (in this case, it’s the well-known JavaScript event loop).

Once one of the sub-tasks changes state to either fulfilled or rejected, the task is scheduled to continue to the next step.

When using Rust, you can see the same transformation happening with the function signature when you write something such as this:
async fn run() -> () { … }

The function wraps the return object, and instead of returning the type (), it returns a Future with an output type of ():
fn run() -> impl Future<Output = ()>
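
If you want to convince yourself that the two signatures line up, here’s a small sketch that compiles on its own. The name run_desugared is made up; the compiler’s actual output is an anonymous state machine rather than a simple async block:

use std::future::Future;

// What we write:
async fn run() {}

// Roughly the signature the compiler produces for us (a sketch; the
// returned future is really a generated state machine, not an async block):
fn run_desugared() -> impl Future<Output = ()> {
    async {}
}

fn main() {
    // Neither future does anything until a runtime polls it.
    let _f1 = run();
    let _f2 = run_desugared();
}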

Syntactically, Rust’s futures 0.1 looked a lot like the promise example we just showed, and the Rust futures we use today have a lot in common with how async/await works in JavaScript.

This way of rewriting what looks like normal functions and code into something else has a lot of benefits, but it’s not without its drawbacks.

As with any stackless coroutine implementation, full pre-emption can be hard, or impossible, to implement. These functions have to yield at specific points, and there is no way to suspend execution in the middle of a stack frame in contrast to fibers/green threads. Some level of pre-emption is possible by having the runtime or compiler insert pre-emption points at every function call, for example, but it’s not the same as being able to pre-empt a task at any point during its execution.
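
To give you a feel for what this rewriting amounts to, here’s a hand-written sketch of the kind of state machine a compiler could generate for a task with two wait points. All names are made up for illustration, and resume stands in for Future::poll, which additionally involves Pin and a Waker:

// A conceptual, hand-written stand-in for the compiler-generated state machine.
enum RunState {
    Start,
    WaitingOnFirstTimer,
    WaitingOnSecondTimer,
    Done,
}

struct Run {
    state: RunState,
}

impl Run {
    // Each call drives the task one step further, like `Future::poll` does.
    fn resume(&mut self) {
        loop {
            match self.state {
                RunState::Start => {
                    // ...start the first timer...
                    self.state = RunState::WaitingOnFirstTimer;
                    return; // yield until the timer fires
                }
                RunState::WaitingOnFirstTimer => {
                    // ...first timer done; start the second one...
                    self.state = RunState::WaitingOnSecondTimer;
                    return; // yield again
                }
                RunState::WaitingOnSecondTimer => {
                    println!("I'm the last one");
                    self.state = RunState::Done;
                }
                RunState::Done => return,
            }
        }
    }
}

fn main() {
    let mut task = Run { state: RunState::Start };
    task.resume(); // runs until the first wait point
    task.resume(); // resumes after the first "timer" and hits the second
    task.resume(); // finishes
}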

Callback-based approaches – How Programming Languages Model Asynchronous Program Flow

Note!

This is another example of M:N threading. Many tasks can run concurrently on one OS thread. Each task consists of a chain of callbacks.

You probably already know what we’re going to talk about in the next paragraphs from JavaScript, which I assume most readers are familiar with.

The whole idea behind a callback-based approach is to save a pointer to a set of instructions we want to run later together with whatever state is needed. In Rust, this would be a closure.

Implementing callbacks is relatively easy in most languages. They don’t require any context switching or pre-allocated memory for each task.
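
Here’s a minimal sketch of that core idea in Rust, under the assumption that “running later” simply means storing boxed closures and invoking them at some point after they’re created:

fn main() {
    let msg = String::from("I'm the last one");

    // The closure captures the state it needs (`msg`); boxing it gives us
    // an owned, heap-allocated callback we can store and invoke later.
    let mut callbacks: Vec<Box<dyn FnOnce()>> = Vec::new();
    callbacks.push(Box::new(move || println!("{}", msg)));

    // ...later, for example when an event fires, the runtime runs it:
    for callback in callbacks {
        callback();
    }
}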

However, representing concurrent operations using callbacks requires you to write the program in a radically different way from the start. Re-writing a program that uses a normal sequential program flow to one using callbacks represents a substantial rewrite, and the same goes the other way.

Callback-based concurrency can be hard to reason about and can become very complicated to understand. It’s no coincidence that the term “callback hell” is something most JavaScript developers are familiar with.

Since each sub-task must save all the state it needs for later, the memory usage will grow linearly with the number of callbacks in a task.

Advantages
• Easy to implement in most languages
• No context switching
• Relatively low memory overhead (in most cases)
Drawbacks
• Memory usage grows linearly with the number of callbacks
• Programs and code can be hard to reason about
• It’s a very different way of writing programs, and it will affect almost all aspects of the program since every yielding operation requires a callback
• Ownership can be hard to reason about; as a consequence, writing callback-based programs without a garbage collector can become very difficult
• Sharing state between tasks is difficult due to the complexity of ownership rules
• Debugging callbacks can be difficult

Coroutines: promises and futures

Note!

This is another example of M:N threading. Many tasks can run concurrently on one OS thread. Each task is represented as a state machine.

Promises in JavaScript and futures in Rust are two different implementations that are based on the same idea.

There are differences between implementations, but we’ll not focus on those here. It’s worth explaining promises a bit since they’re widely known due to their use in JavaScript. Promises also have a lot in common with Rust’s futures.

First of all, many languages have a concept of promises, but I’ll use the one from JavaScript in the following examples.

Promises are one way to deal with the complexity that comes with a callback-based approach.

Instead of:
setTimeout(() => {
  setTimeout(() => {
    setTimeout(() => {
      console.log("I'm the last one");
    }, 50);
  }, 100);
}, 200);

We can do:
function timer(ms) {
    return new Promise((resolve) => setTimeout(resolve, ms));
}
timer(200)
.then(() => timer(100))
.then(() => timer(50))
.then(() => console.log("I'm the last one"));

The latter approach is also referred to as the continuation-passing style. Each subtask calls a new one once it’s finished.

The difference between callbacks and promises is even more substantial under the hood. You see, promises return a state machine that can be in one of three states: pending, fulfilled, or rejected.

When we call timer(200) in the previous example, we get back a promise in the pending state.
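
Expressed in Rust syntax, the state machine could be sketched like this. JavaScript promises aren’t actually implemented this way, but the idea maps directly (and Rust’s own Poll type is a two-state cousin with Pending and Ready):

// A conceptual sketch of a promise's three states, not a real API.
enum PromiseState<T, E> {
    Pending,
    Fulfilled(T),
    Rejected(E),
}

fn main() {
    // Calling timer(200) would hand us back the Pending variant.
    let state: PromiseState<(), String> = PromiseState::Pending;
    match state {
        PromiseState::Pending => println!("not done yet"),
        PromiseState::Fulfilled(_) => println!("done"),
        PromiseState::Rejected(e) => println!("failed: {}", e),
    }
}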

Now, the continuation-passing style does fix some of the issues related to callbacks, but it still retains a lot of them when it comes to complexity and the different ways of writing programs. However, promises enable us to leverage the compiler to solve a lot of these problems, which we’ll discuss in the next paragraph.

Context switching – How Programming Languages Model Asynchronous Program Flow

Even though these fibers/green threads are lightweight compared to OS threads, you still have to save and restore registers at every context switch. This likely won’t be a problem most of the time, but when compared to alternatives that don’t require context switching, it can be less efficient.

Context switching can also be pretty complex to get right, especially if you intend to support many different platforms.

Scheduling

When a fiber/green thread yields to the runtime scheduler, the scheduler can simply resume execution on a new task that’s ready to run. This means that you avoid the problem of being put in the same run queue as every other task in the system every time you yield to the scheduler. From the OS perspective, your threads are busy doing work all the time, so it will try to avoid pre-empting them if it can.

One unexpected downside of this is that most OS schedulers make sure all threads get some time to run by giving each OS thread a time slice where it can run before the OS pre-empts the thread and schedules a new thread on that CPU. A program using many OS threads might be allotted more time slices than a program with fewer OS threads. A program using M:N threading will most likely only use a few OS threads (one thread per CPU core seems to be the starting point on most systems). So, depending on whatever else is running on the system, your program might be allotted fewer time slices in total than it would be using many OS threads. However, with the number of cores available on most modern CPUs and the typical workload on concurrent systems, the impact from this should be minimal.

FFI

Since you create your own stacks that are supposed to grow/shrink under certain conditions and might have a scheduler that assumes it can pre-empt running tasks at any point, you will have to take extra measures when you use FFI. Most FFI functions will assume a normal OS-provided C-stack, so it will most likely be problematic to call an FFI function from a fiber/green thread. You need to notify the runtime scheduler, context switch to a different OS thread, and have some way of notifying the scheduler that you’re done and the fiber/green thread can continue. This naturally creates overhead and added complexity both for the runtime implementor and the user making the FFI call.
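
A common workaround is to run the blocking FFI call on a dedicated OS thread and report the result back, for example through a channel. Here’s a simplified sketch using only the standard library, where blocking_ffi_call is a stand-in for a real extern "C" function:

use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// A stand-in for a real `extern "C"` function that blocks for a while.
fn blocking_ffi_call() -> i32 {
    thread::sleep(Duration::from_millis(100));
    42
}

fn main() {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The call runs on a normal OS-provided stack, which is what
        // most C libraries assume.
        tx.send(blocking_ffi_call()).unwrap();
    });
    // A real runtime would keep scheduling other fibers/green threads
    // here; in this sketch, we simply block until the result arrives.
    let result = rx.recv().unwrap();
    println!("FFI call returned {}", result);
}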

Advantages
• It is simple to use for the user. The code will look like it does when using OS threads.
• Context switching is reasonably fast.
• Excessive memory usage is less of a problem than with OS threads.
• You are in full control over how tasks are scheduled and, if you want, you can prioritize them as you see fit.
• It’s easy to incorporate pre-emption, which can be a powerful feature.
Drawbacks
• Stacks need a way to grow when they run out of space, creating additional work and complexity
• You still need to save the CPU state on every context switch
• It’s complicated to implement correctly if you intend to support many platforms and/or CPU architectures
• FFI can have a lot of overhead and add unexpected complexity

Example – How Programming Languages Model Asynchronous Program Flow

Since we’ll not spend more time talking about OS threads in this book, we’ll go through a short example so you can see how they’re used:
ch02/aa-os-threads
use std::thread::{self, sleep};

fn main() {
    println!("So, we start the program here!");
    let t1 = thread::spawn(move || {
        sleep(std::time::Duration::from_millis(200));
        println!("The long-running tasks finish last!");
    });
    let t2 = thread::spawn(move || {
        sleep(std::time::Duration::from_millis(100));
        println!("We can chain callbacks…");
        let t3 = thread::spawn(move || {
            sleep(std::time::Duration::from_millis(50));
            println!("…like this!");
        });
        t3.join().unwrap();
    });
    println!("The tasks run concurrently!");
    t1.join().unwrap();
    t2.join().unwrap();
}

In this example, we simply spawn several OS threads and put them to sleep. Sleeping is essentially the same as yielding to the OS scheduler with a request to be re-scheduled to run after a certain time has passed. To make sure our main thread doesn’t finish and exit (which will exit the process) before our child threads have had time to run, we join them at the end of our main function.
If we run the example, we’ll see how the operations occur in a different order based on how long we yielded each thread to the scheduler:
So, we start the program here!
The tasks run concurrently!
We can chain callbacks…
…like this!
The long-running tasks finish last!

So, while using OS threads is great for a number of tasks, we also outlined a number of good reasons to look at alternatives by discussing their limitations and downsides. The first alternatives we’ll look at are what we call fibers and green threads.

Fibers and green threads

Note!

This is an example of M:N threading. Many tasks can run concurrently on one OS thread. Fibers and green threads are often referred to as stackful coroutines.

The name “green threads” originally stems from an early implementation of an M:N threading model used in Java and has since been associated with different implementations of M:N threading. You will encounter different variations of this term, such as “green processes” (used in Erlang), which are different from the ones we discuss here. You’ll also see some that define green threads more broadly than we do here.

The way we define green threads in this book makes them synonymous with fibers, so both terms refer to the same thing going forward.

The implementation of fibers and green threads implies that there is a runtime with a scheduler that’s responsible for scheduling what task (M) gets time to run on the OS thread (N). There are many more tasks than there are OS threads, and such a system can run perfectly fine using only one OS thread. The latter case is often referred to as M:1 threading.

Goroutines are an example of a specific implementation of stackful coroutines, but they come with slight nuances. The term “coroutine” usually implies that they’re cooperative in nature, but Goroutines can be pre-empted by the scheduler (at least since version 1.14), thereby landing them in somewhat of a grey area using the categories we present here.

Green threads and fibers use the same mechanisms as an OS, setting up a stack for each task, saving the CPU’s state, and jumping from one task (thread) to another by doing a context switch.

We yield control to the scheduler (which is a central part of the runtime in such a system), which then continues running a different task.

The state of execution is stored in each stack, so in such a solution, there would be no need for async, await, Future, or Pin. In many ways, green threads mimic how an operating system facilitates concurrency, and implementing them is a great learning experience.

A runtime using fibers/green threads for concurrent tasks can have a high degree of flexibility. Tasks can, for example, be pre-empted and context switched at any time and at any point in their execution, so a long-running task that hogs the CPU could in theory be pre-empted by the runtime, acting as a safeguard from having tasks that end up blocking the whole system due to an edge-case or a programmer error.

This gives the runtime scheduler almost the same capabilities as the OS scheduler, which is one of the biggest advantages of systems using fibers/green threads.

The typical flow goes as follows:

• You run some non-blocking code
• You make a blocking call to some external resource
• The CPU jumps to the main thread, which schedules a different thread to run and jumps to that stack
• You run some non-blocking code on the new thread until a new blocking call or the task is finished
• The CPU jumps back to the main thread, schedules a new thread that is ready to make progress, and jumps to that thread

Figure 2.2 – Program flow using fibers/green threads

Scheduling – How Programming Languages Model Asynchronous Program Flow

The OS can schedule tasks differently than you might expect, and every time you yield to the OS, you’re put in the same queue as all other threads and processes on the system.

Moreover, since there is no guarantee that the thread will resume execution on the same CPU core as it left off or that two tasks won’t run in parallel and try to access the same data, you need to synchronize data access to prevent data races and other pitfalls associated with multicore programming.

Rust as a language will help you prevent many of these pitfalls, but synchronizing data access will require extra work and add to the complexity of such programs. We often say that using OS threads to handle concurrency gives us parallelism for free, but it isn’t free in terms of added complexity and the need for proper data access synchronization.
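
As a small illustration of what that extra work looks like, here’s a sketch of sharing a counter between OS threads in Rust, where Arc provides shared ownership and Mutex synchronizes access:

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Arc gives the threads shared ownership of the counter, and the
    // Mutex makes sure only one thread mutates it at a time.
    let counter = Arc::new(Mutex::new(0));
    let mut handles = Vec::new();

    for _ in 0..4 {
        let counter = Arc::clone(&counter);
        handles.push(thread::spawn(move || {
            *counter.lock().unwrap() += 1;
        }));
    }

    for handle in handles {
        handle.join().unwrap();
    }

    println!("{}", *counter.lock().unwrap()); // prints 4
}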

The advantage of decoupling asynchronous operations from OS threads

Decoupling asynchronous operations from the concept of threads has a lot of benefits.

First of all, using OS threads as a means to handle concurrency requires us to use what is essentially an OS abstraction to represent our tasks.

Having a separate layer of abstraction to represent concurrent tasks gives us the freedom to choose how we want to handle concurrent operations. If we create an abstraction over concurrent operations such as a future in Rust, a promise in JavaScript, or a goroutine in Go, it is up to the runtime implementor to decide how these concurrent tasks are handled.

A runtime could simply map each concurrent operation to an OS thread, or it could use fibers/green threads or state machines to represent the tasks. The programmer who writes the asynchronous code won’t necessarily have to change anything in their code if the underlying implementation changes. In theory, the same asynchronous code could be used to handle concurrent operations on a microcontroller without an OS, as long as there’s a runtime for it.

To sum it up, using threads provided by the operating system to handle concurrency has the following advantages:
• Simple to understand
• Easy to use
• Switching between tasks is reasonably fast
• You get parallelism for free
However, they also have a few drawbacks:
• OS-level threads come with a rather large stack. If you have many tasks waiting simultaneously (as you would in a web server under heavy load), you’ll run out of memory pretty fast.
• Context switching can be costly, and you might get unpredictable performance since you let the OS do all the scheduling.
• The OS has many things it needs to handle. It might not switch back to your thread as fast as you’d wish.
• It is tightly coupled to an OS abstraction. This might not be an option on some systems.

Threads – How Programming Languages Model Asynchronous Program Flow

We keep referring to threads all throughout this book, so before we get too far in, let’s stop and give “thread” a good definition since it’s one of those fundamental terms that causes a lot of confusion.

In the most general sense, a thread refers to a thread of execution, meaning a set of instructions that need to be executed sequentially. If we tie this back to the first chapter of this book, where we provided several definitions under the Concurrency vs. Parallelism subsection, a thread of execution is similar to what we defined as a task with multiple steps that need resources to progress.

The generality of this definition can be a cause of some confusion. A thread to one person can obviously refer to an OS thread, and to another person, it can simply refer to any abstraction that represents a thread of execution on a system.

Threads are often divided into two broad categories:

  • OS threads: These threads are created by the OS and managed by the OS scheduler. On Linux, this is known as a kernel thread.
  • User-level threads: These threads are created and managed by us as programmers without the OS knowing about them.

Now, this is where things get a bit tricky: OS threads on most modern operating systems have a lot of similarities. Some of these similarities are dictated by the design of modern CPUs. One example of this is that most CPUs assume that there is a stack it can perform operations on and that it has a register for the stack pointer and instructions for stack manipulation.

User-level threads can, in their broadest sense, refer to any implementation of a system (runtime) that creates and schedules tasks, and you can’t make the same assumptions as you do with OS threads. They can closely resemble OS threads by using separate stacks for each task, as we’ll see in Chapter 5 when we go through our fiber/green threads example, or they can be radically different in nature, as we’ll see when we go through how Rust models concurrent operations later on in Part 3 of this book.

No matter the definition, a set of tasks needs something that manages them and decides who gets what resources to progress. The most obvious resource on a computer system that all tasks need to progress is CPU time. We call the “something” that decides who gets CPU time to progress a scheduler.

Most likely, when someone refers to a “thread” without adding extra context, they refer to an OS thread/kernel thread, so that’s what we’ll do going forward.

I’ll also keep referring to a thread of execution as simply a task. I find the topic of asynchronous programming easier to reason about when we limit the use of terms that have different assumptions associated with them depending on the context as much as possible.

With that out of the way, let’s go through some defining characteristics of OS threads while we also highlight their limitations.

Definitions – How Programming Languages Model Asynchronous Program Flow

In the previous chapter, we covered asynchronous program flow, concurrency, and parallelism in general terms. In this chapter, we’ll narrow our scope. Specifically, we’ll look into different ways to model and deal with concurrency in programming languages and libraries.

It’s important to keep in mind that threads, futures, fibers, goroutines, promises, etc. are abstractions that give us a way to model an asynchronous program flow. They have different strengths and weaknesses, but they share a goal of giving programmers an easy-to-use (and importantly, hard to misuse), efficient, and expressive way of creating a program that handles tasks in a non-sequential, and often unpredictable, order.

The lack of precise definitions is prevalent here as well; many terms have a name that stems from a concrete implementation at some point in time but has later taken on a more general meaning that encompasses different implementations and varieties of the same thing.

We’ll first go through a way of grouping different abstractions together based on their similarities before we go on to discuss the pros and cons of each of them. We’ll also go through important definitions that we’ll use throughout the book and discuss OS threads in quite some detail.

The topics we discuss here are quite abstract and complicated, so don’t feel bad if you don’t understand everything immediately. As we progress through the book and you get used to the different terms and techniques by working through some examples, more and more pieces will fall into place.

Specifically, the following topics will be covered:

• Definitions
• Threads provided by the operating system
• Green threads/stackful coroutines/fibers
• Callback-based approaches
• Promises, futures, and async/await

Definitions

We can broadly categorize abstractions over concurrent operations into two groups:

  1. Cooperative: These are tasks that yield voluntarily either by explicitly yielding or by calling a function that suspends the task when it can’t progress further before another operation has finished (such as making a network call). Most often, these tasks yield to a scheduler of some sort. Examples of this are tasks generated by async/await in Rust and JavaScript (see the sketch after this list).
  2. Non-cooperative: Tasks that don’t necessarily yield voluntarily. In such a system, the scheduler must be able to pre-empt a running task, meaning that the scheduler can stop the task and take control over the CPU even though the task would have been able to do work and progress. Examples of this are OS threads and Goroutines (after Go version 1.14).
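
As a small taste of the cooperative category, here’s a sketch of an explicit yield, assuming the tokio crate as the runtime (any async runtime with an explicit yield works similarly):

// Assumes tokio = { version = "1", features = ["full"] } in Cargo.toml.
use tokio::task;

#[tokio::main]
async fn main() {
    let handle = task::spawn(async {
        // ...do a slice of work...
        // Voluntarily hand control back to the scheduler so other tasks
        // on this thread get a chance to run.
        task::yield_now().await;
        // ...continue when the scheduler resumes us...
    });
    handle.await.unwrap();
}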

Figure 2.1 – Non-cooperative vs. cooperative multitasking

Note

In a system where the scheduler can pre-empt running tasks, tasks can also yield voluntarily as they do in a cooperative system; a system that relies only on pre-emption is rare.

We can further divide these abstractions into two broad categories based on the characteristics of their implementation:

  1. Stackful: Each task has its own call stack. This is often implemented as a stack that’s similar to the stack used by the operating system for its threads. Stackful tasks can suspend execution at any point in the program as the whole stack is preserved.
  2. Stackless: There is no separate stack for each task; they all run sharing the same call stack. A task can’t be suspended in the middle of a stack frame, limiting the runtime’s ability to pre-empt the task. However, they need to store/restore less information when switching between tasks, so they can be more efficient.

There are more nuances to these two categories that you’ll get a deep understanding of when we implement an example of both a stackful coroutine (fiber) and a stackless coroutine (Rust futures generated by async/await) later in the book. For now, we keep the details to a minimum to just provide an overview.

A simplified overview – Concurrency and Asynchronous Programming: a Detailed Overview

Let’s look at some of the steps where we imagine that we read from a network card:

Remember that we’re simplifying a lot here. This is a rather complex operation, but we’ll focus on the parts that are of most interest to us and skip a few steps along the way.

Step 1 – Our code

We register a socket. This happens by issuing a syscall to the OS. Depending on the OS, we either get a file descriptor (macOS/Linux) or a socket (Windows).

The next step is that we register our interest in Read events on that socket.
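
As a rough sketch of this step using only Rust’s standard library (the address is just a placeholder), connecting issues the syscalls for us and hands back the OS’s representation of the socket:

use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    // Connecting issues syscalls under the hood. The returned stream
    // wraps a file descriptor on macOS/Linux and a socket on Windows.
    let stream = TcpStream::connect("127.0.0.1:8080")?;

    // Make reads return immediately instead of suspending our thread.
    // This is what lets us express an interest in Read events that we
    // check on our own schedule (see the next step).
    stream.set_nonblocking(true)?;
    Ok(())
}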

Step 2 – Registering events with the OS

This is handled in one of three ways:

  1. We tell the operating system that we’re interested in Read events but we want to wait for it to happen by yielding control over our thread to the OS. The OS then suspends our thread by storing the register state and switches to some other thread
    From our perspective, this will be blocking our thread until we have data to read.
  2. We tell the operating system that we’re interested in Read events but we just want a handle to a task that we can poll to check whether the event is ready or not.
    The OS will not suspend our thread, so this will not block our code (see the sketch after this list).
  3. We tell the operating system that we are probably going to be interested in many events, but we want to subscribe to one event queue. When we poll this queue, it will block our thread until one or more events occur.

    This will block our thread while we wait for events to occur.
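
Here’s a minimal sketch of the second approach using only the standard library: with the socket in non-blocking mode, a read either returns data or fails with WouldBlock, so we can poll for readiness whenever it suits us:

use std::io::{self, Read};
use std::net::TcpStream;

fn main() -> io::Result<()> {
    let mut stream = TcpStream::connect("127.0.0.1:8080")?;
    stream.set_nonblocking(true)?;

    let mut buf = [0u8; 1024];
    loop {
        match stream.read(&mut buf) {
            Ok(0) => break, // the other side closed the connection
            Ok(n) => println!("read {} bytes", n),
            Err(e) if e.kind() == io::ErrorKind::WouldBlock => {
                // No data yet. A real program would do other useful work
                // here instead of spinning in a tight loop.
            }
            Err(e) => return Err(e),
        }
    }
    Ok(())
}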

Chapters 3 and 4 will go into detail about the third method, as it’s the most used method for modern async frameworks to handle concurrency.

Step 3 – The network card

We’re skipping some steps here, but I don’t think they’re vital to our understanding.

On the network card, there is a small microcontroller running specialized firmware. We can imagine that this microcontroller is polling in a busy loop, checking whether any data is incoming.

The exact way the network card handles its internals is a little different from what I suggest here, and will most likely vary from vendor to vendor. The important part is that there is a very simple but specialized CPU running on the network card doing work to check whether there are incoming events.

Once the firmware registers incoming data, it issues a hardware interrupt.

But can’t we just change the page table in the CPU? – Concurrency and Asynchronous Programming: a Detailed Overview

Now, this is where the privilege level comes in. Most modern operating systems operate with two ring levels: ring 0, the kernel space, and ring 3, the user space.

Figure 1.2 – Privilege rings

Most CPUs have a concept of more rings than what most modern operating systems use. There are historical reasons for this, which is also why ring 0 and ring 3 are used (and not 1 and 2).

Every entry in the page table has additional information about it. Amongst that information is the information about which ring it belongs to. This information is set up when your OS boots up.

Code executed in ring 0 has almost unrestricted access to external devices and memory, and is free to change registers that provide security at the hardware level.

The code you write in ring 3 will typically have extremely restricted access to I/O and certain CPU registers (and instructions). Trying to issue an instruction or set a register from ring 3 to change the page table will be prevented by the CPU. The CPU will then treat this as an exception and jump to the exception handler provided by the OS.

This is also the reason why you have no other choice than to cooperate with the OS and handle I/O tasks through syscalls. The system wouldn’t be very secure if this wasn’t the case.

So, to sum it up: yes, the CPU and the OS cooperate a great deal. Most modern desktop CPUs are built with an OS in mind, so they provide the hooks and infrastructure that the OS latches onto upon bootup. When the OS spawns a process, it also sets its privilege level, making sure that normal processes stay within the borders it defines to maintain stability and security.

Interrupts, firmware, and I/O

We’re nearing the end of the general CS subjects in this book, and we’ll start to dig our way out of the rabbit hole soon.

This part tries to tie things together and look at how the whole computer works as a system to handle I/O and concurrency.

Let’s get to it!