DHO is Mostly Confused

Rants from Devon H. O'Dell

Go Go Gadget Co-Ro

I like coroutines. The other day on my team’s Slack channel, someone introduced Adam Dunkels’ protothreads. Either I hadn’t seen these before or I had forgotten them; either way, they looked cool. A short conversation ensued, and at some point a colleague offered:

i can’t ever decide if i like [coroutines] or not. so ignoring that, i think there’s a decent heuristic: something like tatham’s coroutines will provoke a conversation every time you use it. i figure if something produces so much discussion, it’s certainly non-obvious to understand

This colleague is an excellent software engineer and computer scientist. She’s definitely familiar enough with threads, and so this comment struck me as odd. Programs implemented using threads are, in my view, much more difficult to understand than those implemented atop coroutines. If you can understand how threads work, you can understand coroutines. In this post, I’d like to describe what threads and coroutines actually are, the similarities and differences between them, and why I think coroutines provoke so much discussion.

We’ll start out with some definitions. What are threads; what are coroutines? I find the easiest way to introduce concepts is to define them based on the properties of the things the words describe. So we’ll start at a low level and work our way up.

A computer is some hardware that provides state, and can execute at least one program. A program is simply a specification for how state is changed over time. A program executed by a computer is called a process. The context in which the program is executed is called a “thread,” and every process contains at least one thread. Processes containing multiple threads may see those threads execute in parallel. The state provided by the computer may or may not be shared between processes, but within a process, this state is shared. Threads are also typically given some “protected” state: state intended for use by the thread, but not expressly prohibited from sharing between threads.

As a program executes, state changes, and time progresses. These tuples of (execution unit, computer state) may be arranged over the vector of time to describe what is called an “execution history.” Each thread is a container of an independent execution history vector. There may be dependencies in state between threads, but the execution histories exist independently of each other. This independence allows threads to be run either serially or in parallel. Multiplexing threads within a process is indistinguishable from a hardware platform that provides no meaningful abstraction for a process, and multiplexes threads directly.

So we have this hierarchy of “computer” and “process” and “thread” – but all of these things are just containers of independent execution histories. The computer has one or more processors that execute instructions to modify state over time. It executes processes that are composed of one or more execution histories, which also describes what a thread is. If these are all the same thing, why in the world do we have different words for them?

Each word describes the same thing, but at a different level of abstraction. This abstraction is great, because it provides us a means to approximate a new Turing machine, which means we get a little extra state to do something useful each time we abstract. So although we usually say that a computer doesn’t contain another computer, we have VMs! Processes don’t contain other processes, but they do contain threads – which are effectively little processes. And threads don’t contain threads. But a thread comprising Turing-complete execution units running on an abstraction of a Turing machine can obviously implement another Turing machine. Therefore, a thread may contain another thread, just not at the same abstraction level!

Our computer is our base-level abstraction for a Turing machine (we don’t actually have one of these because nobody has infinite tape). Our process is another abstraction that allows us to pretend our computer runs multiple Turing machines at once: this is useful because we don’t want the computation of one of them to influence the computation of another. And threads exist for effectively the same reason – they’re useful for multiplexing the same kind of computation on discrete bits of state.

Coroutines are what we get when we define an abstraction on top of threads. In a very real sense, coroutines are threads, in the same way that threads are processes and processes are computers – all abstractions for a Turing machine. Coroutines recognize that the fundamental unit of execution is this tuple of (execution unit, computer state), and that multiple vectors of these can execute independently with an almost arbitrary interleaving. Furthermore, since they exist at an abstraction above threads, that abstraction should be providing something that makes them easier to understand.

To understand why we might like this abstraction, let’s take a deeper look at some of the problems with writing purely threaded software.

Threads provide an abstraction above the machine or process level, representing multiple independent execution histories. This abstraction became popular in recent years as multicore and manycore systems became ubiquitous. Threads existed even before such systems were terribly common, so what is it about the abstraction that makes them useful?

Many tasks we solve with software are repetitive. We rarely try to solve tasks that involve taking a single specific input and producing a single specific output. Such a transformation only needs to occur once, and we only write software for something this specific when the input and / or output sets are large. In network servers, for example, requests from a client arrive that direct the software to do some work. These requests will all be very similar, and so it makes sense for us to use the same codepath to process each request. Additionally, network requests tend to take time to satisfy, and request arrival often follows a Poisson distribution which (practically) means that requests will usually occur in clusters.

Let’s say we’re serving requests for image thumbnails. This means we’re serving a lot of small content across a potentially large library. For simplicity, let’s assume each request can only result in a single thumbnail, and every thumbnail is equally likely to be requested. A random disk seek on a fast spinning disk takes about 10 milliseconds. This seek time dominates the time we spend on the request, so we can approximate the request time as 10ms + ε. The simplest solution, to serve each request from a single thread of execution as it arrives, has two consequences:

  1. The oldest request accumulates the latency of every request in front of it.
  2. We are maximally able to serve 100 requests per second.

Because request arrival rate tends to follow a Poisson distribution, it’s likely that a large number of requests experience the latency penalty described by the first point. And if we receive over 100 requests within the same 10 milliseconds, the folks with the latest requests will wait over a second for their thumbnail. In fact, even if we add more disks in a RAID configuration, we can never beat this problem if we insist on serving requests serially.

The abstraction of the thread allows us to take advantage of parallelism in hardware. Let’s now assume we have a RAID1+0 configuration (mirrored and striped), so our objects are stored twice and distributed across many disks. Adding threads allows us to amortize the cost of disk seeks over multiple disks, reducing the probable latency for any individual request. Because the time for processing the request (compared to the disk seek) is ε, our mental model now says that the probability of any two requests requiring the same disk is inversely proportional to the number of disks. Let’s say that we have 10 disks. Now with 100 clustered requests, it’s entirely possible we can serve all of them within 100ms instead of 1s.
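
To make that arithmetic concrete, here is the back-of-envelope version (treating ε as negligible and assuming the clustered requests spread evenly across the disks):

serial:    T ≈ N × t_seek = 100 × 10ms = 1000ms ≈ 1s
10 disks:  T ≈ (N / D) × t_seek = (100 / 10) × 10ms = 100ms

In reality the spread won’t be perfectly even, so 100ms is a best case, but the point stands: spreading seeks across D disks cuts the latency of a cluster of requests by roughly a factor of D.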

This is fantastic for read-only and read-mostly workloads. However, when requests require us to mutate state, things go downhill quickly. As an abstraction on top of some state (either a machine or a process), threads have access to everything in the parent’s state. Traditionally, this has meant that threads are able to share stateful things like memory that tend to be protected between instances of their parent abstraction. This is problematic when the shared state is mutable. Because multiple threads of execution may occur in parallel, some actions may occur at the same time. It’s well known that conflicts occur when multiple threads attempt to access some shared mutable state, and one of those accesses is a write. Consider the following code:

#include <threads.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

static volatile uint64_t counter;
#define NTHRD 2

int
thread(void *arg)
{
        (void) arg;
        for (int i = 0; i < 1000000; i++) {
                counter++; /* unsynchronized read-modify-write */
        }
        return 0;
}

int
main(void)
{
        thrd_t tds[NTHRD];
        for (int i = 0; i < NTHRD; i++) {
                if (thrd_create(&tds[i], thread, NULL) != thrd_success) {
                        return -1;
                }
        }
        for (int i = 0; i < NTHRD; i++) {
                if (thrd_join(tds[i], NULL) != thrd_success) {
                        return -1;
                }
        }
        printf("counter: %" PRIu64 "\n", counter);
}

In this example, the resulting value of counter is unclear. Strictly speaking, in C11, this program invokes undefined behavior; it’s fair to say that the program as a whole has no meaning. In practice, the value output at the end of the program can be any number from 1,000,000 to 2,000,000. The issue here is that the code is designed to implement a protocol. In the case above, that protocol is the “increment a counter” protocol, which is actually broken down into three steps:

  1. Read the value of the counter from memory.
  2. Modify the value of the counter by adding one to the value read.
  3. Write the value of the counter back to memory.

Even though this read-modify-write operation is expressed in C as a single increment operator (and possibly at the hardware level as a single instruction), it actually decomposes into these three operations. And each of them could happen interleaved with similar operations executed by other threads. If our first and second threads consistently execute each operation in lock-step with each other (112233), each pair of loop iterations only increments the counter by one despite running two increments.

The reason that this can happen has to do partially with how computer hardware handles memory accesses, and partially with how threads are scheduled (by the machine or an OS scheduler). Because reads from and writes to memory are expensive (when compared to the cost of performing some operations in registers), processors tend to cache these reads and writes. On a system with multiple processors, with each thread simultaneously executing on a different processor, the threads will be operating on cached copies of the memory (despite our use of volatile). If a processor did not provide an increment instruction, the assembly for counter++ might look like:

load    %reg, (counter) ; load counter from memory into register
add     %reg, 1         ; add 1 to value in register
store   (counter), %reg ; store register back into memory
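
To see how an update gets lost, write out the lock-step (112233) interleaving from above against this three-instruction sequence, assuming counter starts at 0:

thread 1                                thread 2
load    %reg, (counter)                                         ; 1: reads 0
                                        load    %reg, (counter) ; 1: also reads 0
add     %reg, 1                                                 ; 2: computes 1
                                        add     %reg, 1         ; 2: also computes 1
store   (counter), %reg                                         ; 3: counter = 1
                                        store   (counter), %reg ; 3: counter = 1, again

Both threads executed a complete increment, but counter only advanced by one.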

One property of threads is that they’re often pre-emptable. This means that whatever is running the threads (whether the machine or an OS scheduler) may stop a thread from executing in between instructions. In the case where our increment is implemented as three separate instructions, the thread could be paused between any of them. This pre-emption further exacerbates the problem of the data race.

Typically, we solve this problem by using synchronization constructs. In this case, the hardware might provide us an instruction to do a synchronized increment. In other cases, we might need to use a mutex (or other serializing construct) to ensure the correctness of our implementation of the “increment a counter” protocol. This has its own issues (many of which are presented well in Samy Al Bahra’s article on non-blocking synchronization) that I might cover in a separate blog post.
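
For the counter example above, the most direct fix is to ask for that synchronized increment via C11’s <stdatomic.h>. A minimal sketch; main stays exactly as before, and only the counter’s type and the increment change:

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t counter;

int
thread(void *arg)
{
        (void) arg;
        for (int i = 0; i < 1000000; i++) {
                /* An atomic read-modify-write: no other thread can
                 * interleave between the load, the add, and the store. */
                atomic_fetch_add(&counter, 1);
        }
        return 0;
}

With this change, the program reliably prints 2,000,000. A mutex around the increment would also be correct; for a single counter it is simply more machinery than we need.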

For now, let’s go back to our thumbnail serving example. Implemented in somewhat pseudocode, we might have the following.

int
serve_thumbnails(void *arg)
{
        while (true) {
                struct request_state *state = get_actionable_request();

                if (read_request(state) ||
                    get_thumbnail(state) ||
                    send_response(state)) {
                        handle_error(state);
                }
                cleanup(state);
        }
}

Ignoring caching, pre-emption is the primary effect that allows a threaded implementation of our thumbnail service to operate efficiently. However, if we take a deeper look, read_request and send_response are also potentially latency-intensive. Our 10ms + ε estimate wasn’t exactly correct because although get_thumbnail is approximately bounded at 10ms, it could take us dozens or hundreds of milliseconds to read the request and send the response. During that time, the thread is consuming resources (every abstraction requires some amount of state to represent), while not providing any computational benefit. Schedulers recognize this and run something else during the time the thread is blocked waiting on input, or waiting for the output side to become ready for more data.

Unfortunately, a thread consumes resources (at least memory) even when it is not being executed. Even with threads being lighter-weight than processes, their mere existence isn’t necessarily cheap when you have a lot of them. Additionally, the cost to “context switch” between threads of execution is usually somewhat expensive. Context switches are pretty awful because during that time, the machine isn’t doing anything particularly useful. It is just changing the environment around a bit to switch to a different Turing machine abstraction. When we’re doing I/O intensive operations, these operations might occur while we still have time to run on the scheduler (that is, we would not have been pre-empted if we had not done some I/O task).

To address this problem, we can make our I/O operations asynchronous. In the example above, we’ve already created an abstraction for our request state. We can use that to create a finite state machine to know what to execute at any point.

int
serve_thumbnails(void *arg)
{
        while (true) {
                struct request_state *state = get_actionable_request();

                switch (state->fsm_state) {
                case STATE_READ_REQUEST:
                        switch (read_request(state)) {
                        case BLOCKED:
                                queue_for_input(state);
                                continue;
                        case ERROR:
                                state->fsm_state = STATE_ERROR;
                                break;
                        case SUCCESS:
                                state->fsm_state = STATE_GET_THUMBNAIL;
                                break;
                        }
                        break;

                case STATE_GET_THUMBNAIL:
                        switch (get_thumbnail(state)) {
                        case BLOCKED:
                                queue_for_input(state);
                                continue;
                        case ERROR:
                                state->fsm_state = STATE_ERROR;
                                break;
                        case SUCCESS:
                                state->fsm_state = STATE_SEND_RESPONSE;
                                break;
                        }
                        break;

                case STATE_SEND_RESPONSE:
                        switch (send_response(state)) {
                        case BLOCKED:
                                queue_for_output(state);
                                continue;
                        case ERROR:
                                state->fsm_state = STATE_ERROR;
                                break;
                        case SUCCESS:
                                state->fsm_state = STATE_DONE;
                                break;
                        }
                        break;

                case STATE_ERROR:
                        handle_error(state);
                        break;

                case STATE_DONE:
                        cleanup(state);
                        break;
                }
        }

        return 0;
}

This works, but it’s not particularly easy to read. We no longer have a linear flow of code. There’s tons of branching, and we have to be very careful to advance our state correctly to avoid cases where we might cause a loop in our state graph, fail to handle a state, or something else bad. There are a couple of opportunities to avoid repetition in this code, but that won’t drastically improve its readability. The property we have lost here is called “composability.”

Composability is a property of a system or API that describes how easy it is to implement, consume, and understand that system or interface. Examples of composable interfaces include functions for working with common data structures like queues, hash tables, and graphs of various shapes. These structures tend to have add, remove, and find interfaces. We can “compose” these interfaces with relative ease to get correct structures that efficiently manage and represent our data sets or program flow.

In our example above, we see that C isn’t the best language we could possibly use to compose a finite state automaton. When a language, system, or interface becomes too difficult to put together, too difficult to read, or too difficult to understand, we say that it is not composable. We generally solve this by introducing an additional layer of abstraction. For example, the wonderful library and tools provided by libfsm give us several pleasant means to compose finite state automata through languages, tools, and libraries.

Let’s look at our original example again:

int
serve_thumbnails(void *arg)
{
        while (true) {
                struct request_state *state = get_actionable_request();

                if (read_request(state) ||
                    get_thumbnail(state) ||
                    send_response(state)) {
                        handle_error(state);
                }
                cleanup(state);
        }
}

In an ideal world, this would be expressive enough to encode our state transitions. If read_request would block, it would be nice if it could simply stop its execution and retry exactly where it left off, instead of us having to design an entire machine to do that. The finite state machine we created does exactly this, but it isn’t a very nice way to abstract the problem.

This is precisely where coroutines help. The point of the abstraction of coroutines is to offload much of this complexity into the scheduler. Unlike threads, coroutines are not pre-empted; they are instead “cooperatively” scheduled. That is to say, a coroutine (unlike a thread) runs until it explicitly yields its runtime to another coroutine. Like any abstraction, it requires some additional state, but this state is relatively small. Of course, we still have to “context switch” between coroutines, but this is a much simpler and faster operation than having the OS context switch between threads.

A coroutine doesn’t typically have “protected” state the way a thread does; when such state is useful, it is usually much smaller than a thread’s. On many machine implementations, coroutines only require a subset of machine registers to be saved and restored. Coroutines exist in a very small number of different states: running, runnable, or blocked on a resource. Because of this, multiple coroutines may execute concurrently on a single thread.

Concurrency and parallelism are often confused to mean the same thing. Parallelism describes a system in which multiple independent execution histories may occur at literally the same time. This is the case with threads, where (for example) parallel independent execution histories may be executed simultaneously on multiple processors. Concurrency is weaker: it simply states that multiple independent execution histories may be interleaved such that they appear to be occurring at the same time. This is roughly the same feature operating system schedulers provide to give the illusion of multitasking many processes on the same system.

Using coroutines, our original example still works just fine. We might have multiple threads of execution (so that we can utilize our hardware in parallel) with each thread multiplexing several coroutines. In this world, the complexity moves around, but it is tightly contained in a few spots. This lends itself nicely to composability. As an example, our read_request function might look like this:

int
read_request(struct request_state *state)
{
        ssize_t r;

        do {
                r = read(state->fd, state->buf, state->bufsiz);
                if (r == 0) {
                        /* Peer closed the connection before we received a
                         * complete request. */
                        return 1;
                }
                if (r == -1) {
                        if (errno == EAGAIN) {
                                /* No data yet: yield back to the scheduler
                                 * until the descriptor is readable again. */
                                yield_on_read(state);
                        } else {
                                return 1;
                        }
                }
                /* (Bookkeeping for partial reads into state elided.) */
        } while (request_not_complete(state));

        return 0;
}

The special function yield_on_read is provided by the coroutine implementation. It snapshots the current machine state of the coroutine, updates the coroutine’s state to become runnable when more data exists on the file descriptor, and returns to the scheduler with a lightweight “context switch.” Later on, when data arrives (or possibly when a timeout occurs) on that file descriptor, the scheduler returns directly into the coroutine where it left off (again with this lightweight “context switch”). The coroutine continues by testing request_not_complete (which will be true because the request wasn’t completed the previous time), and will re-execute the read system call, this time guaranteed to get at least some data. In the time between this, some other coroutines may have executed. Similar things could be implemented for get_thumbnail and send_response. Indeed, we could abstract out all blocking I/O operations into a single API such that other coroutines don’t need to implement their own state handling in a loop like we’ve done here.

How do coroutines snapshot the machine state? This is usually done using functions like makecontext and swapcontext, or getcontext and setcontext. As previously mentioned, some machines do not require all execution context to be saved. In such cases, some assembly code to read or restore the minimal required machine state is written.
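
As a sketch of the mechanism, here is a single coroutine ping-ponging with its scheduler using those <ucontext.h> functions. The names (coro_yield, coro_body) are made up for illustration; a real implementation would also track runnable and blocked states, integrate with an event loop, and on machines where it matters, swap only the minimal register set in hand-written assembly:

#include <stdio.h>
#include <ucontext.h>

static ucontext_t sched_ctx;            /* the "scheduler" (here, just main) */
static ucontext_t coro_ctx;             /* the coroutine's saved machine state */
static char coro_stack[64 * 1024];      /* every coroutine needs its own stack */

/* Snapshot the coroutine's state and resume the scheduler. */
static void
coro_yield(void)
{
        swapcontext(&coro_ctx, &sched_ctx);
}

static void
coro_body(void)
{
        for (int i = 0; i < 3; i++) {
                printf("coroutine: step %d\n", i);
                coro_yield();           /* give up the processor voluntarily */
        }
}

int
main(void)
{
        /* Build the coroutine's context: its own stack, falling back to the
         * scheduler context if coro_body ever returns. */
        getcontext(&coro_ctx);
        coro_ctx.uc_stack.ss_sp = coro_stack;
        coro_ctx.uc_stack.ss_size = sizeof(coro_stack);
        coro_ctx.uc_link = &sched_ctx;
        makecontext(&coro_ctx, coro_body, 0);

        for (int i = 0; i < 3; i++) {
                printf("scheduler: resuming coroutine\n");
                swapcontext(&sched_ctx, &coro_ctx);     /* run until it yields */
        }
        return 0;
}

Each swapcontext here is the lightweight “context switch” described above: it saves and restores registers and a stack pointer entirely in user space, without involving the kernel’s scheduler.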

Why aren’t coroutines more widely used? Although the resulting code often ends up being more composable, and although highly concurrent workloads like network servers tend to benefit from such an expression, there’s a fair amount of overhead involved in implementing the underlying API and scheduling. Threads are already relatively lightweight (at least compared to processes) and already have a scheduler to run them. Much of the time, the overhead of implementing APIs and scheduling is not seen as justification for the resulting composability and scalability win.

In my opinion, this is the reason that coroutines provoke a discussion. It looks like a difficult technical discussion about how complicated writing a scheduler is, but really it boils down to a discussion about where and how to hide complexity. It seems clear to me that the bulk of code written using coroutines is less complicated, and the composability benefits are huge.

But in some cases, coroutines have significant drawbacks. When part of a cooperatively scheduled task has high computational overhead, do we yield before performing the computation? All other coroutines are blocked while one is executing, so we need to be aware of this tradeoff. If we want to take advantage of parallelism with coroutines, we also face more challenges. If several coroutines are working with some shared state and those coroutines can yield in the middle of something that should logically be a critical section, we still need locking. Problematically, if we also have added multiple threads on which to execute coroutines, this means that we need to introduce some kind of locking protocol. Several more corner cases like these exist, and none are particularly trivial to solve. All add complexity to the scheduler and have runtime overhead.

It’s frequently safer and easier to go with the devil you know. And so we have long conversations (and blog posts) about coroutines instead of actually using them.

That all said, the popularity and reasonably good performance of the Go programming language in the systems and network programming space shows that it is possible to get coroutines right.

If you’re interested in implementations or adopting coroutines into your work, there are several great resources. My two favorites (apart from homegrown implementations) are libmill, which is a library that provides Go-style concurrency for C, and libtask, which provides coroutines as well as appropriate non-blocking file and network I/O interfaces as a portable version of Plan 9’s concurrent programming interface.
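
To give a flavor of the libmill style, here is roughly the shape of its introductory examples (treat the exact API as something to verify against the current documentation):

#include <libmill.h>
#include <stdio.h>

/* A libmill coroutine: launched with go(), scheduled cooperatively. */
coroutine void worker(int count, const char *text)
{
        for (int i = 0; i < count; i++) {
                printf("%s\n", text);
                msleep(now() + 10);     /* yields; other coroutines run */
        }
}

int
main(void)
{
        go(worker(4, "a"));
        go(worker(2, "b"));
        msleep(now() + 100);            /* give the workers time to finish */
        return 0;
}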

I’ve avoided covering numerous related topics. In particular, message passing in a threaded environment combines the simplicity of existing pre-empted thread scheduling with the composability of data structure operations. In fact, message passing is a fantastic way to implement notification queues for systems built on either threads or coroutines (or both). Like any abstraction, the devil is in the details. Message passing in threaded environments moves the complexity away from the scheduler and the threading code, but into the algorithms and data structures used to pass messages. In environments with extremely high thread counts, this can be a non-starter when using an unmanaged language like C. Anyway – it’s an idea for another blog post!
