DHO is Mostly Confused

Rants from Devon H. O'Dell

Ripping Rings

Mar 16, 2016 10 min read
Programming Algorithms

TL;DR, I’ve released a “library” called librip: a “minimal-overhead API for instruction-level tracing in highly concurrent software”. The rest of this post is about the “why” of the software, followed by a slightly a little bit of the “what” and only a smidge of the “how”.

Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?

I’ve previously included this Brian Kernighan quote from The Elements of Programming Style before in my writing (I likely will again) – it’s as true today as it was over forty years ago when it was published. However, it is not all despair. The quote ends with a question, and questions beg for answers. And fundamentally, that’s what I do in my capacity as a software engineer. The answer to this particular question is to always be learning – you have to level-up your clever.

But I’m also not always debugging my own code. I suppose this means that I have to be twice as clever as the original author was when writing the code to debug the code. In my experience, this usually results in code deletion and simplification – which is the underlying point of the context surrounding the quoted passage above.

In my work at Fastly, there’s rarely a shortage of things to debug. Whether the Internet-at-large has managed to trigger some pathological edge case or I’m working on new features, debugging is always in the picture. And I can’t always be twice as clever. In such cases, I’m usually aided by the trusty debugging toolset: gdb, perf, strace, lsof, etc. But here’s the catch: sometimes these tools are entirely unusable.

Tens of Thousands of Threads

It’s no secret that we at Fastly use a modified Varnish Cache to facilitate our content delivery. What’s less well known is how it works. When Varnish was originally implemented, the idea was that threads were cheap, and that operating systems were going to do a better job scheduling them than you were. This is (to a large degree) true today, but it comes at a cost.

On our caches, our Varnish processes run with tens of thousands of threads in the normal case. And it turns out that the kernel does a reasonably good job with this workload. When most of your time is spent writing data over a network, the kernel does a pretty reasonable job of keeping everyone running within acceptable latency bounds. But as with Biggie’s money, with more threads, we do get some more problems.

Composability

Lock-based synchronization is nearly the canonical solution to sharing memory in concurrent software today. But there are significant drawbacks, especially in heavily threaded environments.

In a 2008 ACM Queue article, Real-world Concurrency, Bryan Cantrill points out that it is possible to write composable software built on top of mutexes. (I do find it funny that one of the suggested ways of doing this is to not use mutexes.)

Anyway, composability is a desirable property because such systems are easier to build, consume, and verify (and there is plenty of literature about this for nearly any language). While it may be possible to write composable software that consumes locks, this isn’t actually the problem. The problem is that locks themselves do not compose. Why not?

Locks are not self-contained. Although lock and unlock functions work on discrete structures, they are relatively tightly coupled both to logic described in code as well as to the data they are meant to protect. This means that programmers are required to understand implementation details of APIs surrounding code which may or may not need protection, and may or may not result in locks being held on return.

Secondly, lock operations can’t be recombined in any order to produce valid execution histories. Lock operations must precede unlocks. Locks cannot recurse. (If they can, are they unlocked once or many times?) Locks cannot be acquired and released in arbitrary orders. Nothing about their API or implementation provides guidance on what order and where locks must be acquired.

So while it is possible to hide locks behind a composable API, the code in the API itself isn’t guaranteed to be composable. Indeed, at the lowest level, it’s clear that implementing correct, composable APIs which consume locks (sorry Bryan, I had to) is at best non-trivial.

Contention

Let’s assume we have some API that isn’t at all OpenSSL. (Just kidding, it is.) But let’s assume it isn’t, because I need a strawman API that has a correct implementation that uses locks and is composable.

Even when implemented correctly, this API may cause the software (and in some cases the whole system) to enter a state known as livelock.

In livelock, your software is spending so much time negotiating lock transfer that it doesn’t actually get any work done. Your system continues to run, but appears to do nothing. I have recently encountered a livelock where the average lock hold time was something like 10 minutes because the processes owning the lock couldn’t get any time on the scheduler. (And once they did, other processes suddenly needed those locks as well.)

Concurrency

Given these problems, we might care to use a different approach to sharing memory. We may want to explore lock-free and wait-free algorithms as an alternative to lock-based approaches. So we dust off Concurrency Kit and go hacking.

Problematically, nearly all of our problems are multiple producer, multiple consumer (MPMC) problems. And in unmanaged languages like C, many (all?) MP and MC problems require some form of safe memory reclamation (SMR). Many SMR schemes have an approach that goes something like this:

Stash pointers in a thread local area.
Mark unused stashed pointers as such.
Wait for the system to enter a quiescent state.
Free any unused pointers.

The actual approach for stashing, marking, and detecting quiescent states varies, but this is the gist. When you have tens of thousands of threads, detecting a quiescent state is very difficult. Even in schemes that make it easier, the O(mn) complexity of scanning pointers in thread local areas makes many of these nearly unusable in environments with such high thread counts.

“But dho!”, you say. “Just use reference counts!” Ignoring entirely the cache unfriendliness of reference counting schemes (and therefore poor throughput with high numbers of objects), when we require non-global, per-object references (such that the reference is embedded into the memory of the protected item), we’re back to locking.

So we put Concurrency Kit back on the shelf and hope it doesn’t collect too much dust before the next time we can use it.

Debugging

Have you ever tried to attach gdb to a process with more than 40,000 threads and 500GB resident memory usage? I’ll wait. Actually, I won’t, because it’s going to take you somewhere between 45 minutes and 3 hours. (Don’t bother with lldb, it’s the same problem.) I’ve theorized that this is due to a combination of /proc/<pid>/maps suckage and something between linear and polynomial time complexity of attaching to threads in the debugger itself. (It walks a linked list of threads multiple times for each thread it attaches to.)

Backtrace.io crash dumps help for some cases, but the overhead of tracing this many threads is fundamentally non-trivial, so we can’t (yet) use their tools for tracing live processes. That said, when our software does crash, theirs is the only tool that works for us.

So now we have lock-based software with tens of thousands of threads that we can’t debug. Twice as hard as writing the code, indeed. In the absence of crash dumps, determining cause of bugs can be excruciatingly difficult. And in such highly threaded software, the likelihood that a bug is related to composability, contention, or concurrency is actually rather high.

Outside of contention, many concurrency bugs are the result of a bad execution history. Deadlocks occur when locks are acquired and / or released out of order. Race conditions occur when execution histories happen out of order with respect to a protocol defining a required ordering. If we could somehow observe execution histories, that would be pretty close to ideal for debugging numerous issues.

But how do we get this without a debugger? We sort of write our own.

Ripping Rings

I’ve released a new “library”, librip: a “minimal-overhead API for instruction-level tracing in highly concurrent software”. This code was (when I wrote it) originally baked into our Varnish implementation, where it was known as “rip_ring”. Rip because %rip is the name of the 64-bit program counter (PC) on x86-64 machines, ring because snapshots of this value are stored into a shared memory ring buffer exposed through Linux tmpfs. (Although it is currently extremely platform-specific, ports to other architectures and systems are welcomed.)

The software is comprised of two components. One is an API for actually sampling the PC into this shared memory ring buffer. The other is a script that can process the file and resolve addresses to symbols in the binary.

And how does it work?

Pretty goddamn well, to be perfectly honest. The overhead is pretty low: there are a few bitwise operations performed on some thread-local variables, and then stuff is written to memory.

Since the shared memory is backed by a tmpfs, if it is mapped with MAP_SHARED, it survives crashes. This means we get the recent execution history of every thread active in the system (plus exited threads) on a crash. It’s not quite a backtrace, but it’s close.

PC samples are littered strategically through the code. For example, our locking function interface is modified:

#define Lck_Lock(l)                                            \
        do {                                                   \
                rip_snap();                                    \
                Lck__Lock((l), __func__, __FILE__, __LINE__);  \
                rip_snap();                                    \
        } while (0)

#define Lck_Unlock(l)                                          \
        do {                                                   \
                Lck__Unlock((l), __func__, __FILE__, __LINE__);\
                rip_snap();                                    \
        } while (0)

At any particular time, this gives us all threads contending on a mutex, tells us which thread is the mutex owner, and gives us all threads that have recently released the mutex.

What does this look like? Depends what’s wrong. A livelock is exposed in the default condensed output of the read_rip utility (written by a colleague of mine):

locking_function : 19126 threads, 2 locations: ssl.c:220 (10); ssl.c:210 (19116)

Also remember, this is hanging out in shared memory. So you can use this tool to immediately rule out livelock as a runtime performance problem without pausing your software to attach a debugger or using tools that add system-wide trace-time overhead.

Lock order reversals and deadlocks are trickier. The nice thing is that the debugging information puts both rip_snap() calls on the same source line, but they (of course) get different addresses. So we can get per-thread output like:

[35283]: {82}source_a.c:124<0x442bae>, {83}source_a.c:129, {84}source_b.c:2670,
    {85}source_b.c:217, {86}source_b.c:662, {87}source_a.c:124<0x442af7>,
    {88}source_a.c:124<0x442bae>, {89}source_a.c:129, {90}source_b.c:767,
    {91}source_c.c:132, {92}source_c.c:134<0x477f12>, {93}source_c.c:144<0x477fa3>,
    {94}source_b.c:110, {95}source_b.c:3873, {96}source_b.c:3405,
    {97}source_d.c:339<0x45f617>, {98}source_d.c:339<0x45f6a3>, {99}source_d.c:341,
    {100}source_c.c:238<0x479f52>, {101}source_c.c:238<0x479fe0>,
    {102}source_c.c:240, {103}source_b.c:3405, {104}source_b.c:2773,
    {105}source_a.c:124<0x442af7>, {106}source_a.c:124<0x442bae>, {107}source_a.c:129,
    {108}source_a.c:124<0x442af7>, {109}source_a.c:124<0x442bae>, {110}source_a.c:129,
    {111}source_e.c:573<0x43be8f>, {112}source_e.c:573<0x43bf1b>,
    {113}source_a.c:124<0x442af7>

And this works quite well. Given our lock and unlock interfaces are the only ones with multiple rip_snap() calls mapping to the same source line, we can create neat little scripts, like this one that a colleague of mine whipped up in a couple of minutes. It scrapes output to help find deadlocks by providing a list of threads that have acquired locks, but not released them.

#!/usr/bin/perl

use strict;
use warnings;

my $sf = shift; # source file with lock

while (my $l = <>) {
  next unless $l =~ /\Q$sf\E/;
  next unless $l =~ /\{\d+\}\Q$sf\E:(\d+)<0x[0-9a-f]+>, \{\d+\}\Q$sf\E:\1<0x[0-9a-f]+>/;
  next if $l =~
      /\{\d+\}\Q$sf\E:(\d+)<0x[0-9a-f]+>, \{\d+\}\Q$sf\E:\1<0x[0-9a-f]+>, .*\{\d+\}\Q$sf\E/;
  print $l;
}

Super cool.

The implementation details aren’t that interesting. We do some pointer packing such that event order can be reconstructed linearly. Ring buffers wrap around, but they are also FIFO, so with counters, we can reconstruct the correct event ordering. I expect this framework to be useful in any highly-concurrent scenario. Whether using native threads or coroutines, livelocks and deadlocks are real problems as long as you employ lock-based synchronization plans to solve the problem of sharing memory.

Other than that, the utility of this thing is all in how you use it. I’d appreciate any feedback, and especially code to make this useful on other architectures (especially ones with weaker memory models than x86-TSO).