Coroutines and async/await – How Programming Languages Model Asynchronous Program Flow

Coroutines come in two flavors: asymmetric and symmetric. Asymmetric coroutines yield to a scheduler, and they’re the ones we’ll focus on. Symmetric coroutines yield to a specific destination; for example, a different coroutine.

While coroutines are a pretty broad concept in general, the introduction of coroutines as objects in programming languages is what really makes this way of handling concurrency rival the ease of use that OS threads and fibers/green threads are known for.

You see, when you write async in Rust or JavaScript, the compiler rewrites what looks like a normal function into one that returns a future (in the case of Rust) or a promise (in the case of JavaScript). Await, on the other hand, yields control to the runtime scheduler, and the task is suspended until the future/promise you’re awaiting has finished.

This way, we can write programs that handle concurrent operations in almost the same way we write our normal sequential programs.

Our JavaScript program can now be written as follows:
async function run() {
    await timer(200);
    await timer(100);
    await timer(50);
    console.log("I'm the last one");
}

You can consider the run function as a pausable task consisting of several sub-tasks. On each “await” point, it yields control to the scheduler (in this case, it’s the well-known JavaScript event loop).

Once one of the sub-tasks changes state to either fulfilled or rejected, the task is scheduled to continue to the next step.

When using Rust, you can see the same transformation happening with the function signature when you write something such as this:
async fn run() -> () { … }

The function wraps the return object, and instead of returning the type (), it returns a Future with an output type of ():
fn run() -> impl Future<Output = ()>
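
To make the transformation concrete, here is a minimal, hand-written sketch of what the compiler conceptually produces. The async block stands in for the generated state machine; this is an illustration, not the compiler’s actual output:

use std::future::Future;

// What `async fn run()` conceptually desugars to: a plain function
// returning an anonymous type that implements Future<Output = ()>.
fn run() -> impl Future<Output = ()> {
    async {
        // The body of the original async fn goes here.
    }
}

fn main() {
    // Creating the future does nothing by itself; it must be polled
    // by a runtime before any work happens.
    let _future = run();
}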

Syntactically, Rust’s futures 0.1 looked a lot like the promise example we just showed, and the Rust futures we use today have a lot in common with how async/await works in JavaScript.

This way of rewriting what look like normal functions and code into something else has a lot of benefits, but it’s not without its drawbacks.

As with any stackless coroutine implementation, full pre-emption can be hard, or impossible, to implement. These functions have to yield at specific points, and there is no way to suspend execution in the middle of a stack frame in contrast to fibers/green threads. Some level of pre-emption is possible by having the runtime or compiler insert pre-emption points at every function call, for example, but it’s not the same as being able to pre-empt a task at any point during its execution.

Callback based approaches – How Programming Languages Model Asynchronous Program Flow

Note!

This is another example of M:N threading. Many tasks can run concurrently on one OS thread. Each task consists of a chain of callbacks.

If you know JavaScript, and I assume most readers do, you’ll probably recognize what we’re going to talk about in the next paragraphs.

The whole idea behind a callback-based approach is to save a pointer to a set of instructions we want to run later together with whatever state is needed. In Rust, this would be a closure.
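
As a minimal sketch of that idea in Rust (the names here are purely illustrative), we can box a closure together with its captured state and invoke it later:

fn main() {
    let state = String::from("captured state");
    // The closure captures `state` by move; boxing it lets us store
    // the "pointer to instructions plus state" for later execution.
    let callback: Box<dyn FnOnce()> = Box::new(move || {
        println!("Running later with: {}", state);
    });
    // ...at some later point, whoever holds the callback invokes it:
    callback();
}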

Implementing callbacks is relatively easy in most languages. They don’t require any context switching or pre-allocated memory for each task.

However, representing concurrent operations using callbacks requires you to write the program in a radically different way from the start. Re-writing a program that uses a normal sequential program flow to one using callbacks represents a substantial rewrite, and the same goes the other way.

Callback-based concurrency can be hard to reason about and can become very complicated to understand. It’s no coincidence that the term “callback hell” is something most JavaScript developers are familiar with.

Since each sub-task must save all the state it needs for later, the memory usage will grow linearly with the number of callbacks in a task.

Advantages
• Easy to implement in most languages
• No context switching
• Relatively low memory overhead (in most cases)
Drawbacks
• Memory usage grows linearly with the number of callbacks.
• Programs and code can be hard to reason about.
• It’s a very different way of writing programs, and it will affect almost all aspects of the program, since every operation that yields requires a callback.
• Ownership can be hard to reason about. The consequence is that writing callback-based programs without a garbage collector can become very difficult.
• Sharing state between tasks is difficult due to the complexity of ownership rules.
• Debugging callbacks can be difficult.

Coroutines: promises and futures

Note!

This is another example of M:N threading. Many tasks can run concurrently on one OS thread. Each task is represented as a state machine.

Promises in JavaScript and futures in Rust are two different implementations that are based on the same idea.

There are differences between implementations, but we’ll not focus on those here. It’s worth explaining promises a bit since they’re widely known due to their use in JavaScript. Promises also have a lot in common with Rust’s futures.

First of all, many languages have a concept of promises, but I’ll use the one from JavaScript in the following examples.

Promises are one way to deal with the complexity that comes with a callback-based approach.

Instead of:
setTimer(200, () => {
  setTimer(100, () => {
    setTimer(50, () => {
      console.log("I'm the last one");
    });
  });
});

We can do:
function timer(ms) {
    return new Promise((resolve) => setTimeout(resolve, ms));
}
timer(200)
.then(() => timer(100))
.then(() => timer(50))
.then(() => console.log("I'm the last one"));

The latter approach is also referred to as the continuation-passing style. Each subtask calls a new one once it’s finished.

The difference between callbacks and promises is even more substantial under the hood. You see, promises return a state machine that can be in one of three states: pending, fulfilled, or rejected.

When we call timer(200) in the previous example, we get back a promise in the pending state.
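
If we were to model those three states in Rust, a simple enum captures the idea. This is only an illustration of the concept, not how Rust’s actual Future trait is defined:

// Illustrative only: the three states a JavaScript promise can be in.
#[allow(dead_code)]
enum PromiseState<T, E> {
    Pending,       // the operation hasn't finished yet
    Fulfilled(T),  // finished successfully with a value of type T
    Rejected(E),   // finished with an error of type E
}

fn main() {
    // Calling timer(200) hands us back the equivalent of this:
    let state: PromiseState<(), String> = PromiseState::Pending;
    if let PromiseState::Pending = state {
        println!("still waiting...");
    }
}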

Now, the continuation-passing style does fix some of the issues related to callbacks, but it still retains a lot of them when it comes to complexity and the different ways of writing programs. However, promises and futures enable us to leverage the compiler to solve a lot of these problems, which we’ll discuss in the next paragraph.

Context switching – How Programming Languages Model Asynchronous Program Flow

Even though these fibers/green threads are lightweight compared to OS threads, you still have to save and restore registers at every context switch. This likely won’t be a problem most of the time, but when compared to alternatives that don’t require context switching, it can be less efficient.

Context switching can also be pretty complex to get right, especially if you intend to support many different platforms.

Scheduling

When a fiber/green thread yields to the runtime scheduler, the scheduler can simply resume execution on a new task that’s ready to run. This means that you avoid the problem of being put in the same run queue as every other task in the system every time you yield to the scheduler. From the OS perspective, your threads are busy doing work all the time, so it will try to avoid pre-empting them if it can.

One unexpected downside of this is that most OS schedulers make sure all threads get some time to run by giving each OS thread a time slice where it can run before the OS pre-empts the thread and schedules a new thread on that CPU. A program using many OS threads might be allotted more time slices than a program with fewer OS threads. A program using M:N threading will most likely only use a few OS threads (one thread per CPU core seems to be the starting point on most systems). So, depending on whatever else is running on the system, your program might be allotted fewer time slices in total than it would be using many OS threads. However, with the number of cores available on most modern CPUs and the typical workload on concurrent systems, the impact from this should be minimal.

FFI

Since you create your own stacks that are supposed to grow/shrink under certain conditions and might have a scheduler that assumes it can pre-empt running tasks at any point, you will have to take extra measures when you use FFI. Most FFI functions will assume a normal OS-provided C-stack, so it will most likely be problematic to call an FFI function from a fiber/green thread. You need to notify the runtime scheduler, context switch to a different OS thread, and have some way of notifying the scheduler that you’re done and the fiber/green thread can continue. This naturally creates overhead and added complexity both for the runtime implementor and the user making the FFI call.
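
A common pattern for working around this is sketched below: hand the blocking call off to a dedicated OS thread with a normal C stack and let the scheduler run other tasks in the meantime. The blocking_ffi_call function here is a hypothetical stand-in for a real foreign function:

use std::thread;

// Hypothetical stand-in for a foreign function that blocks and
// assumes a normal, OS-provided C stack.
fn blocking_ffi_call() -> i32 {
    42
}

fn main() {
    // Run the call on a dedicated OS thread instead of a green-thread stack.
    let handle = thread::spawn(blocking_ffi_call);
    // ...the scheduler would be free to run other tasks here...
    let result = handle.join().unwrap();
    println!("FFI call returned: {}", result);
}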

Advantages
• It is simple for the user. The code will look like it does when using OS threads.
• Context switching is reasonably fast.
• Memory usage is less of a problem when compared to OS threads.
• You are in full control over how tasks are scheduled and, if you want, you can prioritize them as you see fit.
• It’s easy to incorporate pre-emption, which can be a powerful feature.
Drawbacks
• Stacks need a way to grow when they run out of space, creating additional work and complexity.
• You still need to save the CPU state on every context switch.
• It’s complicated to implement correctly if you intend to support many platforms and/or CPU architectures.
• FFI can have a lot of overhead and add unexpected complexity.

Each stack has a fixed space – How Programming Languages Model Asynchronous Program Flow

As fibers and green threads are similar to OS threads, they do have some of the same drawbacks as well. Each task is set up with a stack of a fixed size, so you still have to reserve more space than you actually use. However, these stacks can be growable, meaning that once the stack is full, the runtime can grow the stack. While this sounds easy, it’s a rather complicated problem to solve.

We can’t simply grow a stack as we grow a tree. What actually needs to happen is one of two things:

  1. You allocate a new piece of contiguous memory and handle the fact that your stack is spread over two disjointed memory segments
  2. You allocate a new, larger stack (for example, twice the size of the previous stack), move all your data over to the new stack, and continue from there

The first solution sounds pretty simple, as you can leave the original stack as it is, and you can basically context switch over to the new stack when needed and continue from there. However, modern CPUs can work extremely fast if they can work on a contiguous piece of memory due to caching and their ability to predict what data your next instructions are going to work on. Spreading the stack over two disjointed pieces of memory will hinder performance. This is especially noticeable when you have a loop that happens to be just at the stack boundary, so you end up making up to two context switches for each iteration of the loop.

The second solution solves the problems with the first solution by having the stack as a contiguous piece of memory, but it comes with some problems as well.

First, you need to allocate a new stack and move all the data over to the new stack. But what happens with all pointers and references that point to something located on the stack when everything moves to a new location? You guessed it: every pointer and reference to anything located on the stack needs to be updated so they point to the new location. This is complex and time-consuming, but if your runtime already includes a garbage collector, you already have the overhead of keeping track of all your pointers and references anyway, so it might be less of a problem than it would for a non-garbage collected program. However, it does require a great deal of integration between the garbage collector and the runtime to do this every time the stack grows, so implementing this kind of runtime can get very complicated.
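
As a rough illustration of this second strategy, here is a toy version that uses a plain byte buffer as a stand-in for a stack. A real runtime would also have to update every pointer into the old stack, which is the hard part:

// Toy illustration: allocate a larger "stack" and copy the old contents over.
// Real stacks grow downward and contain pointers that must be fixed up.
fn grow_stack(old: Vec<u8>) -> Vec<u8> {
    let mut new = Vec::with_capacity(old.capacity() * 2);
    new.extend_from_slice(&old);
    new
}

fn main() {
    let stack = vec![0u8; 1024];
    let grown = grow_stack(stack);
    println!("new capacity: {}", grown.capacity());
}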

Second, you have to consider what happens if you have many long-running tasks that require a lot of stack space only for a brief period of time (for example, if they involve a lot of recursion at the start) but are mostly I/O bound the rest of the time. You end up growing the stack many times over for only one specific part of the task, and you have to decide whether to accept that the task occupies more space than it needs or to move it back to a smaller stack at some point. The impact this has on your program will of course vary greatly based on the type of work you do, but it’s still something to be aware of.

Scheduling – How Programming Languages Model Asynchronous Program Flow

The OS can schedule tasks differently than you might expect, and every time you yield to the OS, you’re put in the same queue as all other threads and processes on the system.

Moreover, since there is no guarantee that the thread will resume execution on the same CPU core as it left off or that two tasks won’t run in parallel and try to access the same data, you need to synchronize data access to prevent data races and other pitfalls associated with multicore programming.

Rust as a language will help you prevent many of these pitfalls, but synchronizing data access will require extra work and add to the complexity of such programs. We often say that using OS threads to handle concurrency gives us parallelism for free, but it isn’t free in terms of added complexity and the need for proper data access synchronization.
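
For a feel of what that extra work looks like, here is a minimal example of the synchronization Rust requires when OS threads share data:

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Shared state must be wrapped for thread-safe access.
    let counter = Arc::new(Mutex::new(0));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                // The lock prevents data races between threads running in parallel.
                *counter.lock().unwrap() += 1;
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    println!("count: {}", *counter.lock().unwrap());
}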

The advantage of decoupling asynchronous operations from OS threads

Decoupling asynchronous operations from the concept of threads has a lot of benefits.

First of all, using OS threads as a means to handle concurrency requires us to use what essentially is an OS abstraction to represent our tasks.

Having a separate layer of abstraction to represent concurrent tasks gives us the freedom to choose how we want to handle concurrent operations. If we create an abstraction over concurrent operations such as a future in Rust, a promise in JavaScript, or a goroutine in Go, it is up to the runtime implementor to decide how these concurrent tasks are handled.

A runtime could simply map each concurrent operation to an OS thread, or it could use fibers/green threads or state machines to represent the tasks. The programmer who writes the asynchronous code will not necessarily have to change anything in their code if the underlying implementation changes. In theory, the same asynchronous code could be used to handle concurrent operations on a microcontroller without an OS, as long as there’s a runtime for it.

To sum it up, using threads provided by the operating system to handle concurrency has the following advantages:
• Simple to understand
• Easy to use
• Switching between tasks is reasonably fast
• You get parallelism for free
However, they also have a few drawbacks:
• OS-level threads come with a rather large stack. If you have many tasks waiting simultaneously (as you would in a web server under heavy load), you’ll run out of memory pretty fast.
• Context switching can be costly and you might get an unpredictable performance since you let the OS do all the scheduling.
• The OS has many things it needs to handle. It might not switch back to your thread as fast as you’d wish.
• It is tightly coupled to an OS abstraction. This might not be an option on some systems.
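
To give a feel for the stack-size drawback mentioned above, Rust’s standard library lets you tweak the per-thread stack reservation; the default is platform dependent and often several megabytes:

use std::thread;

fn main() {
    // Request a much smaller stack than the platform default.
    let handle = thread::Builder::new()
        .stack_size(32 * 1024) // 32 KiB
        .spawn(|| println!("hello from a small-stack thread"))
        .unwrap();
    handle.join().unwrap();
}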

Threads – How Programming Languages Model Asynchronous Program Flow

We keep referring to threads all throughout this book, so before we get too far in, let’s stop and give “thread” a good definition since it’s one of those fundamental terms that causes a lot of confusion.

In the most general sense, a thread refers to a thread of execution, meaning a set of instructions that need to be executed sequentially. If we tie this back to the first chapter of this book, where we provided several definitions under the Concurrency vs. Parallelism subsection, a thread of execution is similar to what we defined as a task with multiple steps that need resources to progress.

The generality of this definition can be a cause of some confusion. A thread to one person can obviously refer to an OS thread, and to another person, it can simply refer to any abstraction that represents a thread of execution on a system.

Threads are often divided into two broad categories:

  • OS threads: These threads are created by the OS and managed by the OS scheduler. On Linux, this is known as a kernel thread.
  • User-level threads: These threads are created and managed by us as programmers without the OS knowing about them.

Now, this is where things get a bit tricky: OS threads on most modern operating systems have a lot of similarities. Some of these similarities are dictated by the design of modern CPUs. One example of this is that most CPUs assume that there is a stack it can perform operations on and that it has a register for the stack pointer and instructions for stack manipulation.

User-level threads can, in their broadest sense, refer to any implementation of a system (runtime) that creates and schedules tasks, and you can’t make the same assumptions as you do with OS threads. They can closely resemble OS threads by using separate stacks for each task, as we’ll see in Chapter 5 when we go through our fiber/green threads example, or they can be radically different in nature, as we’ll see when we go through how Rust models concurrent operations later on in Part 3 of this book.

No matter the definition, a set of tasks needs something that manages them and decides who gets what resources to progress. The most obvious resource on a computer system that all tasks need to progress is CPU time. We call the “something” that decides who gets CPU time to progress a scheduler.

Most likely, when someone refers to a “thread” without adding extra context, they refer to an OS thread/kernel thread, so that’s what we’ll do going forward.

I’ll also keep referring to a thread of execution as simply a task. I find the topic of asynchronous programming easier to reason about when we limit the use of terms that have different assumptions associated with them depending on the context as much as possible.

With that out of the way, let’s go through some defining characteristics of OS threads while we also highlight their limitations.

Firmware – Concurrency and Asynchronous Programming: a Detailed Overview

Interrupts

As you know by now, there are two kinds of interrupts:

  • Hardware interrupts
  • Software interrupts

They are very different in nature.

Hardware interrupts

Hardware interrupts are created by sending an electrical signal through an interrupt request line (IRQ). These hardware lines signal the CPU directly.

Software interrupts

These are interrupts issued from software instead of hardware. As in the case of a hardware interrupt, the CPU jumps to the IDT and runs the handler for the specified interrupt.

Firmware

Firmware doesn’t get much attention from most of us; however, it’s a crucial part of the world we live in. It runs on all kinds of hardware and has all kinds of strange and peculiar ways to make the computers we program on work.

Now, firmware needs something to run on, typically a small microcontroller. Even the CPU has firmware that makes it work. That means there are many more small ‘CPUs’ on our system than the cores we program against.

Why is this important? Well, you remember that concurrency is all about efficiency, right? Since we have many CPUs/microcontrollers already doing work for us on our system, one of our concerns is to not replicate or duplicate that work when we write code.

If a network card has firmware that continually checks whether new data has arrived, it’s pretty wasteful if we duplicate that by letting our CPU continually check whether new data arrives as well. It’s much better if we either check once in a while, or even better, get notified when data has arrived.

Summary

This chapter covered a lot of ground, so good job on doing all that legwork. We learned a little bit about how CPUs and operating systems have evolved from a historical perspective and the difference between non-preemptive and preemptive multitasking. We discussed the difference between concurrency and parallelism, talked about the role of the operating system, and learned that system calls are the primary way for us to interact with the host operating system. You’ve also seen how the CPU and the operating system cooperate through an infrastructure designed as part of the CPU.

Lastly, we went through a diagram on what happens when you issue a network call. You know there are at least three different ways for us to deal with the fact that the I/O call takes some time to execute, and we have to decide which way we want to handle that waiting time.

This covers most of the general background information we need so that we have the same definitions and overview before we go on. We’ll go into more detail as we progress through the book, and the first topic that we’ll cover in the next chapter is how programming languages model asynchronous program flow by looking into threads, coroutines, and futures.

A simplified overview – Concurrency and Asynchronous Programming: a Detailed Overview

Step 4 – Hardware interrupt

A modern CPU has a set of interrupt request lines (IRQs) that let it handle events originating from external devices. A CPU has a fixed set of interrupt lines.

A hardware interrupt is an electrical signal that can occur at any time. The CPU immediately interrupts its normal workflow to handle the interrupt by saving the state of its registers and looking up the interrupt handler. The interrupt handlers are defined in the interrupt descriptor table (IDT).

Step 5 – Interrupt handler

The IDT is a table where the OS (or a driver) registers handlers for different interrupts that may occur. Each entry points to a handler function for a specific interrupt. The handler function for a network card would typically be registered and handled by a driver for that card.

Note

The IDT is not stored on the CPU as it might seem in Figure 1.3. It’s located in a fixed and known location in the main memory. The CPU only holds a pointer to the table in one of its registers.

Step 6 – Writing the data

This is a step that might vary a lot depending on the CPU and the firmware on the network card. If the network card and the CPU support direct memory access (DMA), which should be the standard on all modern systems today, the network card will write data directly to a set of buffers that the OS already has set up in the main memory.

In such a system, the firmware on the network card might issue an interrupt when the data is written to memory. DMA is very efficient, since the CPU is only notified when the data is already in memory. On older systems, the CPU needed to devote resources to handle the data transfer from the network card.

The direct memory access controller (DMAC) is added to the diagram since, in such a system, it would control access to memory. It’s not actually part of the CPU, as the previous diagram might suggest. We’re deep enough in the rabbit hole now, and exactly where the different parts of a system are located is not really important to us right now, so let’s move on.

Step 7 – The driver

The driver would normally handle the communication between the OS and the network card. At some point, the buffers are filled and the network card issues an interrupt. The CPU then jumps to the handler of that interrupt. The interrupt handler for this exact type of interrupt is registered by the driver, so it’s actually the driver that handles this event and, in turn, informs the kernel that the data is ready to be read.

Step 8 – Reading the data

Depending on whether we chose method 1, 2, or 3, the OS will do as follows:

  • Wake our thread
  • Return Ready on the next poll
  • Wake the thread and return a Read event for the handler we registered

Down the rabbit hole – Concurrency and Asynchronous Programming: a Detailed Overview

It turns out that there is a great deal of cooperation between the OS and the CPU, but maybe not in the way you would naively think.

Many modern CPUs provide some basic infrastructure that operating systems use. This infrastructure gives us the security and stability we expect. Actually, most advanced CPUs provide a lot more options than operating systems such as Linux, BSD, and Windows actually use.

There are two in particular that I want to address here:

  • How the CPU prevents us from accessing memory we’re not supposed to access
  • How the CPU handles asynchronous events such as I/O

We’ll cover the first one here and the second in the next section.

How does the CPU prevent us from accessing memory we’re not supposed to access?

As I mentioned, modern CPU architectures define some basic concepts by design. Some examples of this are as follows:

• Virtual memory
• Page table
• Page fault
• Exceptions
• Privilege level

Exactly how this works will differ depending on the specific CPU, so we’ll treat them in general terms here.

Most modern CPUs have a memory management unit (MMU). This part of the CPU is often even etched onto the same die. The MMU’s job is to translate the virtual addresses we use in our programs to physical addresses.

When the OS starts a process (such as our program), it sets up a page table for our process and makes sure a special register on the CPU points to this page table.

Now, when we try to dereference t_ptr in the preceding code, the address is at some point sent for translation to the MMU, which looks it up in the page table to translate it to a physical address in the memory where it can fetch the data.

In the first case, it will point to a memory address on our stack that holds the value 100.

When we pass in 99999999999999 and ask it to fetch what’s stored at that address (which is what dereferencing does), it looks for the translation in the page table but can’t find it.

The CPU then treats this as a page fault.
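
A toy model of that lookup might look as follows. Real page tables are multi-level structures walked by hardware, so this only illustrates the “translate or fault” idea:

// Toy model: translate a virtual address to a physical one, or "fault".
fn translate(virt: usize, page_table: &[Option<usize>]) -> Option<usize> {
    const PAGE_SIZE: usize = 4096;
    let page = virt / PAGE_SIZE;
    let offset = virt % PAGE_SIZE;
    // A missing entry corresponds to a page fault.
    page_table.get(page).copied().flatten().map(|frame| frame * PAGE_SIZE + offset)
}

fn main() {
    let page_table = vec![Some(7), None]; // page 0 maps to physical frame 7
    println!("{:?}", translate(100, &page_table));  // Some(28772)
    println!("{:?}", translate(5000, &page_table)); // None: page fault
}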

At boot, the OS provided the CPU with an interrupt descriptor table. This table has a predefined format where the OS provides handlers for the predefined conditions the CPU can encounter.

Since the OS provided a pointer to a function that handles page fault, the CPU jumps to that function when we try to dereference 99999999999999 and thereby hands over control to the operating system.

The OS then prints a nice message for us, letting us know that we encountered what it calls a segmentation fault. This message will therefore vary depending on the OS you run the code on.

Communicating with the operating system – Concurrency and Asynchronous Programming: a Detailed Overview

Communication with an operating system happens through what we call a system call (syscall). We need to know how to make system calls and understand why it’s so important for us when we want to cooperate and communicate with the operating system. We also need to understand how the basic abstractions we use every day use system calls behind the scenes. We’ll have a detailed walkthrough in Chapter 3, so we’ll keep this brief for now.

A system call uses a public API that the operating system provides so that programs we write in ‘userland’ can communicate with the OS.

Most of the time, these calls are abstracted away for us as programmers by the language or the runtime we use.

Now, the syscalls themselves are unique to the kernel you’re communicating with, but the UNIX family of kernels has many similarities. UNIX systems expose these calls through libc.
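
As a small taste of what that looks like in practice (this assumes the libc crate is added as a dependency; it is not part of the standard library), here is the write syscall wrapper printing directly to standard output:

// Assumes `libc = "0.2"` in Cargo.toml.
fn main() {
    let message = b"Hello from a syscall!\n";
    unsafe {
        // write(fd, buf, count): file descriptor 1 is standard output.
        libc::write(1, message.as_ptr() as *const libc::c_void, message.len());
    }
}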

Windows, on the other hand, uses its own API, often referred to as WinAPI, and it can operate radically differently from how the UNIX-based systems operate.

Most often, though, there is a way to achieve the same things. In terms of functionality, you might not notice a big difference but as we’ll see later, and especially when we dig into how epoll, kqueue, and IOCP work, they can differ a lot in how this functionality is implemented.

However, a syscall is not the only way we interact with our operating system, as we’ll see in the following section.

The CPU and the operating system

Does the CPU cooperate with the operating system?

If you had asked me this question when I first thought I understood how programs work, I would most likely have answered no. We run programs on the CPU and we can do whatever we want if we know how to do it. Now, first of all, I wouldn’t have thought this through, but unless you learn how CPUs and operating systems work together, it’s not easy to know for sure.

What started to make me think I was very wrong was a segment of code that looked like what you’re about to see. If you think inline assembly in Rust looks foreign and confusing, don’t worry just yet. We’ll go through a proper introduction to inline assembly a little later in this book. I’ll make sure to go through each of the following lines until you get more comfortable with the syntax:

Repository reference: ch01/ac-assembly-dereference/src/main.rs
use std::arch::asm;

fn main() {
    let t = 100;
    let t_ptr: *const usize = &t;
    let x = dereference(t_ptr);
    println!("{}", x);
}
fn dereference(ptr: *const usize) -> usize {
    let mut res: usize;
    unsafe {
        asm!("mov {0}, [{1}]", out(reg) res, in(reg) ptr)
    };
    res
}

What you’ve just looked at is a dereference function written with inline assembly.

The mov {0}, [{1}] line needs some explanation. {0} and {1} are templates that tell the compiler that we’re referring to the registers that out(reg) and in(reg) represent. The number is just an index, so if we had more inputs or outputs they would be numbered {2}, {3}, and so on. Since we only specify reg and not a specific register, we let the compiler choose what registers it wants to use.

The mov instruction instructs the CPU to take the first 8 bytes (if we’re on a 64-bit machine) it gets when reading the memory location that {1} points to and place that in the register represented by {0}. The [] brackets will instruct the CPU to treat the data in that register as a memory address, and instead of simply copying the memory address itself to {0}, it will fetch what’s at that memory location and move it over.
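
For comparison, the same operation in plain Rust without inline assembly compiles down to an equivalent mov instruction; the assembly version just makes explicit what the compiler normally emits for us:

// Equivalent to the asm! version: read a usize from the given address.
fn dereference_plain(ptr: *const usize) -> usize {
    unsafe { *ptr }
}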

Anyway, we’re just writing instructions to the CPU here. No standard library, no syscall; just raw instructions. There is no way the OS is involved in that dereference function, right?

If you run this program, you get what you’d expect:
100

Now, if you keep the dereference function but replace the main function with a function that creates a pointer to the 99999999999999 address, which we know is invalid, we get this function:
fn main() {
    let t_ptr = 99999999999999 as *const usize;
    let x = dereference(t_ptr);
    println!("{}", x);
}

Now, if we run that, we get the following results.

This is the result on Linux:
Segmentation fault (core dumped)

This is the result on Windows:
error: process didn’t exit successfully: `target\debug\ac-assembly-dereference.exe` (exit code: 0xc0000005, STATUS_ACCESS_VIOLATION)

We get a segmentation fault. Not surprising, really, but as you also might notice, the error we get is different on different platforms. Surely, the OS is involved somehow. Let’s take a look at what’s really happening here.