DHO is Mostly Confused

Rants from Devon H. O'Dell

Debugging: Psychology, Theory, and Application

Jun 23, 2016
Programming Debugging Software Engineering Education Psychology

I spoke at VarnishCon 2016 on debugging. A small portion of the content was related to Varnish, but about half of the talk focused on research into debugging. Though I have made my slides available, they’re not very useful without speaker notes (and I ended up ad-libbing a fair bit more than intended). This post is an attempt to provide reference material for the ideas in the talk, as well as to expand on some points I didn’t have time to cover.

Theory

Debugging, if we’re honest about it, is simply another word for problem solving. We’ve all been doing this since seeing word problems in our basic school math education. “If Levar has 3 apples and Melissa has 2 apples, how many apples do both Levar and Melissa have?” Debugging isn’t that dissimilar: the only thing missing here is that formulating the question we have to solve isn’t actually part of our education. Given the word problem above, debugging would be like showing kids a picture of Levar and Melissa and their apples. They’d be asked to create a word problem that could be solved to determine the total number of apples, and then to solve that problem. It’s easier said than done, but there are things we can do to make debugging easier for ourselves and for others.

Impetus

Why does any of this matter? Certainly we spend time debugging our software, but eventually this just becomes part of the process.

Personally, information in this area is currently extra interesting. I have a new role as “Tech Lead” at Fastly, where my primary objective is to help other members of my team succeed. Without a good idea of what success looks like or how to get people there, I can’t succeed in this. It would follow here that either they succeed and my role is obviated, or they simply don’t succeed. I’d prefer success all around.

But there are reasons that everyone should care regardless of their role in an organization.

A 2013 study (PDF) conducted by the University of Cambridge Judge Business School estimated that the total global cost of debugging reaches $312 billion per year. The study also found that 49.9% of programming time was spent on debugging activities (“fixing bugs” and “making code work”). Other studies find that the proportion of debugging to coding time is as high as 80%.

This implies that if we can reduce the time it takes to “fix bugs” and “make code work”, we can save our employers money. If that doesn’t sound rewarding for some reason, we can also spend more of our time working on new code, which is quite often a more rewarding exercise than staring at that damn bug for 8 hours a day.

The Root of the Problem

As soon as we started programming, we found to our surprise that it wasn’t as easy to get programs right as we had thought. We had to discover debugging. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs. —Maurice Wilkes, 1949

I think we’re all sympathetic to Wilkes’ discovery. Indeed, we all individually “find out” that it’s not as easy to write software as it first appears to be. “Just tell the computer to copy this thing there,” we naïvely speculate. “How hard can that possibly be?”

25 years later, we have an indicator for how hard it is. In The Elements of Programming Style, Brian Kernighan famously states:

Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?

First, we should understand that in context, this quote is illustrating that we really shouldn’t be trying to outsmart compilers in the name of optimization. Expressing our problem as simply as possible is beneficial to others reading the code. And as it turns out, it is usually beneficial to compilers.

That said, there’s still a strong, unsupported conjecture here: that debugging is twice as hard as programming. Is this really true? Should it be true?

I don’t have a good feeling for the order of magnitude of difference, but I feel that debugging is necessarily more difficult than programming. The rationale here is that a bug is generally indicative of a piece of software behaving in an unexpected way. This lack of expectation translates directly to some unhandled case in the software. When software is sufficiently well designed, we can deduce that such unhandled cases were unhandled because they were not understood (as opposed to being understood but forgotten).

Certainly, some bugs are indicative of forgetfulness. But these cases of misunderstandings and ignorance are fundamentally more insidious when dealing with software bugs. If we work on well-designed software, and we encounter a bug in it, we are operating without the tool of knowledge, which is a thing we did have when we were writing the software.

Whether this makes debugging exactly twice as hard as writing software remains ambiguous, and would be worth a quantitative study. But we can probably safely say that it is harder. It follows that if we write code at the limits of our knowledge and ability, then to debug that code we must extend those limits.

There are other interesting things we can read into this quote. One is that when we’re debugging software, it’s usually not software we wrote ourselves. The software industry is overwhelmingly collaborative, and very few popular projects or professional endeavors receive contributions from only a single source. This means that to debug most code, we actually have to extend our abilities and knowledge past those of our colleagues and peers.

Intuitively, we know that this is a solvable problem. Whether you’ve been in a computer science course for a semester, gone through a code bootcamp for a few months, or practiced professionally for more than twenty years, you’re better at writing code than you were when you started. You’re more clever. And you’re better at debugging than when you started.

So how will we ever debug this clever code? How will we extend the limits of our knowledge and ability? Simple: we have to learn more.

Flowing Along

Clearly there’s another way to interpret this quote. Instead of seeing it as a conjecture preventing us from debugging complicated systems, we can look at it as a platform we can use to raise our ability.

Linus Åkesson calls this platform “Kernighan’s lever”. Linus notes:

By putting in a small amount of motivation towards the short-term goal of implementing some functionality, you suddenly end up with a much larger amount of motivation towards a long term investment in your own personal growth as a programmer.

Linus then builds on top of Mihaly Csikszentmihalyi’s work on the psychological concept called Flow. To summarize the whole book: “flow” is the name of the state you get into when you are fully immersed in a task, such that you find it difficult to be distracted, lose track of time, and gain great enjoyment from the process. Furthermore, this state only occurs when you’re taking on a task that matches the limits of your abilities. The book goes on to talk about practicing achieving Flow state, mostly by repeating the thesis. Csikszentmihalyi’s TED talk on Flow is more digestible.

Anyway, Linus observes that when this concept is applied to Kernighan’s quote, there are two consequences. If we want to optimize our experience for debugging, we have to implement code below our ability — but this is tedious and boring. If we want to optimize our experience for writing code, we have to debug above our ability, but then debugging is frustrating. Making a decision between tedium and frustration seems counter-productive as well.

So in this model of learning more by implementing outside of our ability, we are necessarily frustrated at least part of the time. If we make the tradeoff to never do this, we don’t learn more. I wanted to see if there was any academic research on debugging, how it’s taught, and how we learn to get better at it.

Reading Papers

It turns out that there’s been a fair amount of research into many areas of debugging not directly related to tools and automation. Debugging: a review of the literature from an educational perspective (PDF) from Renée McCauley, Sue Fitzgerald, Gary Lewandowski, Laurie Murphy, Beth Simon, Lynda Thomas, and Carol Zander provides an excellent overview of the research on the pedagogical aspects of debugging, reviewing materials from as early as 1967 through to 2008. I’d like to stress to researchers that this sort of literature review is extremely useful to practitioners and to other researchers who aren’t quite sure where to start.

In the literature, much of the research from 1980 to 2008 focuses on classification of programmers based on skill level. Nearly all of this classification research relies on the Dreyfus model of skill acquisition (PDF) or a simplified form of it. In particular, this research tends to compare programmers of “novice” and “expert” skill levels and figure out what abilities the expert has that the novice lacks.

The pedagogical research here then looks at whether it is possible to teach these skills to novices. Though several papers claim success (notably Chmiel & Loui in 2004), I find the success criteria dubious. For example, Chmiel & Loui write:

From these observations, it appears that the improvement in debugging time shown by the treatment group over the control group is directly related to their completion of the debugging exercises rather than differences in aptitude or in program design skills. Students who completed the debugging exercises spent significantly less time debugging their programs than those who did not complete the exercises.

However, some of the previous observations include:

Although the treatment group spent a smaller percentage of time on debugging than the control group, this statistic does not show whether the treatment group also spent less actual time on debugging the programming assignments.

And

Students in each group were not consistently spending the same amount of time on each assignment.

And

We analyzed the exam scores of students in each group to determine whether there was a difference in aptitude between the groups. …[T]here was no noticeable difference in the aptitude of the two groups.

This sort of thing seemed common in the papers I read (I had to choose one of them to pick on). Fundamentally, it remains unclear whether this actually reduced the amount of time it took for students to debug issues, and it is clear that none of these studies were able to increase student test scores by any statistically significant amount. This would suggest that specific strategies were taught and tested with problems that looked similar.

Applying this research to the idea of “Kernighan’s lever” almost invalidates the idea of the lever. If folks were able to increase the edge of their knowledge and debug in flow, we’d expect a more complete understanding, which should translate to better test scores. To that end, students who were not given specific debugging instruction should have been operating more outside their skill level and should have therefore increased test scores with some significance, and that’s just not what we see:

On the first midterm, the treatment group averaged 70.7% while the control group averaged 72.0%. On the final exam, the treatment group averaged 78.6% while the control group averaged 74.0%. T-tests showed neither of these differences to be statistically significant.

Furthermore, the Dreyfus model seems particularly useless for classifying programmer capability if you want to perform research to figure out how to move people from one end to the other. In Chmiel & Loui, they note that “[n]ovices make observations … with no attention to the overall situation.” This is a behavior that is almost entirely contradictory to the approach of the expert, who is “able to act on intuition” and is “not limited by the rules.” The focus then should be on changing behaviors instead of whether debugging time was reduced.

All of this together makes it seem like education has a limit, and there is some sort of self-limiting factor on the part of students.

McCauley et al. seem to be aware of this, and in their section on future research directions, they note:

The relationships between debugging success and individual factors such as goal orientation, motivation, and other self theories (Dweck, 1999) have not been researched. … Given that debugging presents challenges for many novices, exploring the influence of self theories on student approaches to debugging appears worthwhile.

Later in 2008, they got their wish.

Dweck and Self-Theory

In 1999, Carol Dweck published a book called Self-theories: Their role in motivation, personality and development. This work covers over two decades of research conducted by Dweck and others, and presents an interesting thesis: an individual’s own perception of intelligence and how it is obtained directly affects the way they solve problems and interact with others. This perception is called a self-theory.

An individual’s self-theory falls somewhere within a range of mindsets. At one extreme, we have the fixed mindset; at the other end, we find the growth mindset.

The fixed mindset person has a view that intelligence is a fixed resource. Internally, this means that when a fixed-mindset person approaches a difficult problem, they give up quickly under the assumption that the solution lies outside of their ability. This view is projected onto others: someone else who can solve the problem has an inherently larger capacity for intelligence. When required to solve difficult problems, the fixed mindset person may find a way to evade the issue. They may even become visibly agitated.

On the other end of the spectrum, we have growth mindset people, who believe that intelligence is malleable. They recognize problems as opportunities for learning and growth, and solving problems as the only way to capitalize on that opportunity.

The practices of both programming and debugging can be compartmentalized into a single field, a field of problem solving. Some problems are inherently challenging. When viewed from this perspective, an individual with a growth mindset is necessarily more capable and productive than an individual with a fixed mindset. The reasoning here is that the person with a fixed mindset is more concerned with keeping up appearances of a particular level of intelligence: they want to appear smart. The easiest way to do this is to only solve problems at or below their skill level. The growth-minded person is also concerned with keeping up appearances, but the appearance they want recognition for is hard work. And it turns out the only way to really give the appearance of working hard is to actually work hard.

I admit that this sounds a little wishy-washy, but the research and experimentation holds up. I’m still digesting it, but I definitely recommend a talk Allison Kaptur gave as a keynote speech at Kiwi PyCon 2015. (She also provides a blog post serving as a rough transcript of the talk.)

The talk is titled Effective Learning Strategies for Programmers, and I highly recommend reading / watching for more information on Dweck’s research and how it can be practically applied.

Self-Theory and Debugging

In 2008, Laurie Murphy and Lynda Thomas released a paper Dangers of a Fixed Mindset: Implications of Self-theories Research for Computer Science Education (PDF). This work is (as far as I can find) the first to apply Dweck’s research to the practice of problem solving in computer science. It considers numerous factors of how self-theory affects the field of computer science, including:

Learning to write programs

[M]ost CS students … face many challenges and a barrage of negative feedback … While students with a growth mindset view errors and obstacles as opportunities for learning, those with a fixed mindset are likely to interpret excessive negative feedback as a challenge to their intelligence and to avoid similar situations in the future.

The extremely unfortunate situation of gender disparity in CS

Research has also shown that, although high-IQ girls tend to out perform all other groups in elementary school, they are also more likely to have a fixed mindset. As a consequence, they are less inclined to seek out challenges.

Why collaborative programming practices don’t always work

Pair-programming has been shown to boost confidence, improve retention, increase program quality and heighten students’ enjoyment. Students with a growth mindset “feel good about their abilities when they help their peers learn.” However, because those with a fixed mindset feel smartest when they out perform others, they are less likely to value collaborative learning.

Defensive classroom climates

Those who believe intelligence is fixed seek validation and judge and label others. Such practices are exhibited in CS by students who ask “pseudo-questions” so they can demonstrate their programming knowledge, and by instructors who accord special status to students with previous experience.

Psychological success factors like self-efficacy and self-handicapping

A study by Bergin and Reilly looked at both self-efficacy and motivation and observed that intrinsic motivation had a strong correlation with programming performance, as did self-efficacy. They did not, however, establish the nature of this link. Self-theories research, which has linked theories of intelligence to motivation and self-esteem, may shed light on results from CS education.

The paper then goes on to provide suggestions for how teachers may better utilize this information to help student retention, handle classrooms where students have diverse backgrounds, and more. It sets recommendations for future research about what teachers can do to influence growth mindsets in students.

In 2010, Manipulating Mindset to Positively Influence Introductory Programming Performance (PDF) was published by Quintin Cutts, Emily Cutts, Stephen Draper, Patrick O’Donnell, and Peter Saffrey. Though this study admits some faults, it does show that with some particular “interventions”, students can improve both mindset and grades. These interventions included first teaching all students about self-theory research and its implications in learning; this was called “mindset training intervention”.

Some students received a crib-sheet that explained various ways in which they could solve various types of problems generally. When tutors were assisting students, they were required to assist from the crib-sheet only. The idea was to remind students at every struggle that general solutions to problems exist and to take them through the process of figuring out how to apply a general solution to a specific problem, every time there was a problem. This was called the “crib-sheet intervention”.

Finally, when feedback was received on worksheets / problem sets, some students received “rubric intervention”. For these students, feedback sheets all included at the top:

Remember, learning to program can take a surprising amount of time & effort – students may get there at different rates, but almost all students who put in the time & effort get there eventually. Making good use of the feedback on this sheet is an essential part of this process.

This intervention was designed to remind students to enter the growth mindset when they received feedback, which is ostensibly when they would be in a position to learn the most.

The study was done in a 2x2x2 matrix form. The crib-sheet intervention was ineffective alone, and only students who received all three interventions were shown to improve test scores. The ineffectiveness of the crib-sheet intervention applied alone backs up other research studying the “saying is believing” theory and finding that it isn’t necessarily true.

So… what?

This research fundamentally shows that it is possible to both improve mindset in students and improve their scores. By being aware of self-theory and its implications, and by approaching problems methodically and with an open mind, we can improve our abilities in problem solving. And debugging is just a fancy word for problem solving.

Finally we have a path forward both for educating students and continuing to educate ourselves. By approaching problems with a growth mindset, we are much more likely to learn and grow our skills. If we approach our problems with a fixed mindset, we stunt our growth.

Practice

Once we reject the fixed-mindset view that programming is an innate talent (and Jacob Kaplan-Moss discusses several reasons we might want to do that which have nothing to do with personal growth), we are ready to continue our journey to becoming an expert. How do we do this? What general steps do we take to debug software?

Mental Models

When we think about or discuss software, we are usually simplifying its behavior in order to better comprehend it. In a talk at ACM’s Applicative conference in 2016, Andi Kleen discussed mental models as they relate to performance tuning. One of the foundational points of this talk was that it is actually impossible to fully understand any sufficiently complex software. Why should this be?

Software and hardware become more complex by handling additional states. The number of states and state transitions handled scales with the number of logical components of a piece of software, including subsystems, classes, threads, consumed libraries, and even the number of source lines. As all these variables increase, so does the amount of state carried by the system as it runs, and we necessarily observe the combinatorial explosion problem. Trying to reason exactly about such systems is akin to the problem of trying to understand infinity (PDF).
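To put rough numbers on that explosion, here is a minimal sketch in C. It assumes, purely for illustration, that every component is independent and has the same number of states. `combined_states` is a hypothetical helper, not something from the talk or from any library.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of combined states for `components`
 * independent components that each have `states_each` states.
 * Saturates at UINT64_MAX instead of overflowing.
 */
uint64_t combined_states(unsigned components, uint64_t states_each)
{
    assert(states_each > 0);

    uint64_t total = 1;
    for (unsigned i = 0; i < components; i++) {
        /* Would the next multiplication overflow 64 bits? */
        if (total > UINT64_MAX / states_each)
            return UINT64_MAX;
        total *= states_each;
    }
    return total;
}
```

Under those assumptions, 10 components with 4 states each already yield 4^10 = 1,048,576 combined states, and 64 two-state flags exceed what a 64-bit counter can hold. Real components are rarely fully independent, so this is an upper bound on the idea rather than a measurement.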

So we must necessarily create mental models of how software should work to simplify the process of understanding it. In software engineering, we even do this as an up-front exercise when designing new software: from product requirements documents to “architectural” design to technical design documents, the process begins and completes by following predefined models of what the software should do.

A bug is effectively defined as a disconnect between the expectation of a piece of software to behave in some particular way versus how the software actually behaved. It follows that one critical component of why bugs occur is an incorrect mental model. And the research (PDF) supports this.

Importantly, this means that mental models do not help us when debugging software. The first step of approaching a software bug is to be critical of our own understanding of the buggy code and the code it interacts with. In other words, we must accept our knowledge of the system is flawed and discard it as faulty. (This is, as you might imagine, more difficult for a fixed-mindset individual.)

The Scientific Method

Software engineering is synonymous with applied computer science. When attempting to diagnose and troubleshoot bugs in a system, it is proper to follow the scientific method. As a refresher:

1. Form a hypothesis.
2. Rigorously test the hypothesis and gather data.
3. If the data do not support the hypothesis, goto 1.

Experience leads me to believe that many engineers (myself sometimes included) follow this process backwards. We are so caught up in trying to fix the underlying problem that we forget about the process of first figuring out what the underlying problem actually is.

To that end, the mistake we often make is treating our investigation as if the hypothesis were “I think there is a bug in the software.” As the presence of a bug is usually self-evident, such a hypothesis is already proven. As engineers, we need to come up with a hypothesis of why the bug happens, not that the bug exists.

A better initial hypothesis might be something to the effect of, “I think the bug is on source line 42.” This is somewhat better than “I think there is a bug in the software,” as it provides some specificity to the situation. Sometimes we can test whether line 42 is buggy simply by commenting it out! An even better hypothesis describes the situation in detail, which means we usually need to gather data to form a hypothesis, not just to test it.

“An off-by-one condition on line 42 causes us to fail to log the bytes received from the last read call” is a fantastic hypothesis. We can test that the issue is on line 42. We can test that the bug is solved by looking at our logs.
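As a sketch of why that hypothesis is so testable, here is one hypothetical shape such a bug could take in C. The function names and the idea of summing per-read byte counts are assumptions made for illustration; they are not taken from any real code base.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical buggy version: totals the bytes that get logged for
 * n reads, but the loop condition stops one iteration early, so the
 * final read is never counted.
 */
size_t logged_bytes_buggy(const size_t *read_sizes, size_t n)
{
    assert(read_sizes != NULL || n == 0);

    size_t total = 0;
    for (size_t i = 0; i + 1 < n; i++)   /* off-by-one: skips index n-1 */
        total += read_sizes[i];
    return total;
}

/* Corrected version: every read, including the last, is counted. */
size_t logged_bytes_fixed(const size_t *read_sizes, size_t n)
{
    assert(read_sizes != NULL || n == 0);

    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += read_sizes[i];
    return total;
}
```

The hypothesis makes a precise prediction: the logged total should fall short by exactly the size of the last read. If the logs are short by some other amount, the hypothesis is wrong and we go back to step 1.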

Investigation

Sometimes we have a good intuition of what the issue is with our code. This could be because some compiler or tool gave us a hint as to the issue; it could be because the bug misbehaved in a very particular way, giving us a clue to its nature. This intuition gives us an initial hypothesis (and sometimes coming up with the hypothesis is the hardest part). In this case, we quickly identify a possibly buggy section of code and begin to analyze it for a bug.

Intent versus Syntax

When forming a hypothesis, we should begin by reading code for semantics — what it means — versus syntax — what it says.

Let’s take a simple, buggy VCL snippet as an example:

    sub vcl_fetch {
        if (beresp.http.authenticated) {
            set beresp.http.cacheable = "false";
            return (pass);
        }

        return (deliver);
    }

First, notice there is nothing obviously wrong with this code. We may have an intuition that something is wrong here specifically because we found some object cached that we expected this code to avoid caching. So we focus our efforts here.

If we read this code for what it says, we are literally pronouncing the syntax of the code out loud, substituting some syntactic elements with words while stripping others. “If the backend response header ‘authenticated’ is set, set the backend response header ‘cacheable’ to ‘false’ and return ‘pass’. Otherwise, return deliver.” This tells us nothing new about the nature of the bug.

If we instead read the code for its intent, we get “responses for resources that required authentication must not be cached.” In addition to being shorter to say (and therefore easier to communicate about with others, should you need external input), it tells us something else: what the code is supposed to do. When we do not yet have a hypothesis, this can be crucial for figuring out which areas of the code may be responsible for a bug.

Once intent is understood, we can begin to gather data. We might look at the responses for authenticated resources and realize that such resources are actually identified by a header called authed instead of one called authenticated. It is only at this point that we have data about the state of the system that we should actually begin to do any syntactic analysis of the code.

Discard Comments

When a bug occurs in heavily commented code, ignore the comments. While they may be useful for forming a mental model of what some code is supposed to do, reading comments while debugging can subtly reinforce the correctness of incorrect code. Consider:

    /* Print all the elements */
    for (int i = 0; i <= n; i++) {
        printf("%s\n", elems[i]);
    }

You may already see that this is intended to illustrate an off-by-one error. The point here is really that the comment can tend to make us paper over this code. We may think, “Yep, this prints all the elements.” It does. It also “prints” more than that. Just because a comment is correct in describing the code it annotates does not mean that the code is correct in solving the problem.
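For comparison, a corrected version of the loop might look like the following sketch. The wrapper function `print_elems`, its `FILE *` parameter, and its return value are assumptions added here so the example is self-contained; the only change that matters is `i < n` in place of `i <= n`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Print the n strings in elems, one per line.
 * Returns the number of elements printed.
 */
size_t print_elems(FILE *out, const char *const *elems, size_t n)
{
    assert(out != NULL);
    assert(elems != NULL || n == 0);

    size_t printed = 0;
    /* Valid indices are 0 .. n-1, so the bound is `i < n`. */
    for (size_t i = 0; i < n; i++) {
        fprintf(out, "%s\n", elems[i]);
        printed++;
    }
    return printed;
}
```

Note that the original comment, “Print all the elements”, describes this version and the buggy one equally well, which is exactly why the comment is no help when hunting the bug.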

Understand Which Bugs Occur

The McCauley et al. literature review finds many different (and some conflicting) papers on bug classification. I’m not of the opinion that any particular set of classifications is better or worse than any other (assuming both are relatively complete). That said, agreeing on a standard terminology within a team is important so that we can efficiently and effectively communicate with others about problems.

I’ll cover a few classifications I find useful.

Syntax / Semantic Bugs

Some of the most obvious bugs have to do with not following the syntax or semantics defined by the programming language we’re using. Because our tools can already assist us a great deal here, we tend to think of these problems as trivial. This is a mistake.

Although syntax errors are nearly always caught at compile time, languages (like PHP) which do not necessarily distinguish between compile time and run time may confuse this a bit. (This is one reason a robust testing environment is important, regardless of your preferred testing strategy.) Weakly typed languages may be unable to perform type conversions at run time. In some cases, static and dynamic analysis tools for our platform can help.

A good awareness of our environment always includes an understanding of the capabilities of our language and tools. When these issues hit at run time, they can be very hard to spot because we heavily rely on our tools to figure these sorts of issues out for us. Be prepared to learn more about your environment when you encounter new issues like this.

Logic Bugs

This is a broad set of bugs with many sub-classifications. Off-by-one errors, overflows and underflows, and control flow errors (like early returns, incorrect conditionals, using logical or when you meant logical and — or vice versa) are all examples.
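As one small illustration of the last of these, here is a sketch in C of a range check written with logical or where logical and was meant. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Buggy: with lo <= hi, every x satisfies at least one of the two
 * comparisons, so this returns true for all inputs.
 */
bool in_range_buggy(int x, int lo, int hi)
{
    assert(lo <= hi);
    return x >= lo || x <= hi;
}

/* Intended: x must satisfy both comparisons. */
bool in_range(int x, int lo, int hi)
{
    assert(lo <= hi);
    return x >= lo && x <= hi;
}
```

The buggy version has a recognizable signature at run time: the check never rejects anything. That kind of observation is often enough to point at the operator without reading the surrounding code in detail.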

It may be unfortunate to group all of these different bugs into a single category, but I prefer to do this because they are all related to a bug in our expression of logic in a program form. Furthermore, they all tend to behave differently enough from each other that they’re unlikely to be mistaken for each other.

Race Conditions

Though these are arguably logic bugs, I call out race conditions separately for a few reasons:

They’re frequently not reproducible with synthetic workloads.
A race condition is not necessarily indicative of a bug.
The absence of race conditions is not necessarily indicative of correct software.
Race conditions can appear as a result of logic bugs elsewhere.
Race conditions can mask other bugs.

Race conditions can be more symptomatic than original in nature, and they can mask other, more fundamental bugs. For both reasons, I find it useful to think about them as a separate category.

Performance Bugs

Premature optimization may be bad, but not performing to SLA or capacity is worse. Performance problems tend to be particularly insidious because they’re typically very hard to work around. This is likely because such issues tend to be deep-seated in software design and highly dependent on the problem space. In addition to this, performance gains in existing systems are frequently found by changing assumptions of the complexity of the problem being solved.

To avoid being bitten by unsolvable performance problems, try to make sure your interfaces are composable. Design systems that can be independently tested and validated such that you can capacity plan. Avoid sharing state through memory. When this is not possible, use mutexes only when they are unlikely to have high contention. In highly contended workloads, attempt to make use of wait- and lock-free solutions where possible.
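As a minimal sketch of the lock-free end of that spectrum, a shared statistics counter can be built on C11 atomics instead of a mutex. The `counter` type and its functions are hypothetical names, and whether the underlying operation is truly lock-free depends on the platform (C11 provides `atomic_is_lock_free` to check).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical shared counter, safe to increment from many threads. */
typedef struct {
    atomic_ulong value;
} counter;

void counter_init(counter *c)
{
    assert(c != NULL);
    atomic_init(&c->value, 0UL);
}

/* Atomically add one and return the new value. No mutex is taken. */
unsigned long counter_incr(counter *c)
{
    assert(c != NULL);
    /* fetch_add returns the previous value, so add one for the new value. */
    return atomic_fetch_add_explicit(&c->value, 1UL, memory_order_relaxed) + 1UL;
}

/* Read the current value. */
unsigned long counter_read(counter *c)
{
    assert(c != NULL);
    return atomic_load_explicit(&c->value, memory_order_relaxed);
}
```

Relaxed ordering is used on the assumption that the counter is a standalone statistic and does not publish any other data to readers. If the counter guarded other state, stronger ordering would be needed.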

…and many more

When we accept that “debugging” is a domain-specific term for problem solving, it follows that a “bug” is a domain-specific term for a problem. Unexpected behaviors in software may also be caused by (or at least attributed to) a number of areas outside of the code itself:

Protocol bugs: the software you are implementing is specified incorrectly such that an implementation to the specification cannot possibly fulfill the stated goals of the specification.
Process bugs: a failure to incorporate accepted best practices into an individual or organizational project. This includes lack of version control, not performing code reviews, absence of (or aversion to) testing, etc.
Environmental bugs: problems with the tools we use when designing, building, and running our software may end up affecting how our software runs. From compiler bugs to kernel issues, library bugs to hardware bugs, any number of issues in our environment may affect the run-time behavior of our software. These issues tend to be insidious; we usually assume our operating platform to be infallible. It’s not, and this is why actual full-stack understanding is necessary.

Behavioral Patterns

These classifications turn out to be more than just a taxonomy. Though they seem to be based solely on how a bug was introduced into the system, nearly all of them have run-time signatures as well. In other words, we can observe patterns of behavior from a single sort of bug.

For example, an off-by-one bug is always going to do something one fewer or one more time than expected (whether that “something” is “scan a string” or “count references”). Once we understand the reason for this bug, we can generalize: maybe there are special cases where we could have off-by-N. And indeed, in C this usually happens when iterating over an array of structs.
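Here is one plausible way an off-by-N can arise with arrays of structs. It is an assumption for illustration, not necessarily the case the talk had in mind: a size in bytes gets used where a count of elements was intended. The record layout and every function name below are hypothetical, and Python's `struct` module stands in for a C array of structs.

```python
import struct

# Hypothetical record layout: a 4-byte id followed by an 8-byte counter.
RECORD = struct.Struct("<IQ")   # 12 bytes per record, no padding


def pack_records(records):
    return b"".join(RECORD.pack(rid, count) for rid, count in records)


def count_buggy(buf):
    # Bug: the byte length is used where the element count was intended,
    # the same mistake as using sizeof(array) as a loop bound in C.
    return len(buf)


def count_fixed(buf):
    # Divide by the record size, like sizeof(array) / sizeof(array[0]).
    return len(buf) // RECORD.size


def read_records(buf, count):
    out = []
    for i in range(count):
        start = i * RECORD.size
        chunk = buf[start:start + RECORD.size]
        if len(chunk) < RECORD.size:
            # Python slices stop at the end of the buffer; C would keep reading.
            raise IndexError(f"record {i} is past the end of the buffer")
        out.append(RECORD.unpack(chunk))
    return out


buf = pack_records([(1, 10), (2, 20), (3, 30)])
overrun = count_buggy(buf) - count_fixed(buf)   # how far off the bound is
```

The run-time signature is the useful part: the loop does not run one time too many, it runs too many times by a multiple tied to the struct size. An overrun that scales with `RECORD.size` hints at this class of bug.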

But first we have to be able to observe these patterns.

Tooling

Tools and introspection are crucial to the debugging process. As both the processes of forming and testing a hypothesis require analyzing data, we need tools to collect, analyze, and understand the data. Relevant tools differ for nearly every project, but include things like:

Software to read statistics counters from the buggy application (like varnishstat).
Graphing / plotting statistics over time (using something like Ganglia or Graphite).
Debuggers like gdb, lldb, or any other language / platform-specific debugger.
Profiling utilities like perf or hwpmc.
Tracing facilities like dtrace or lttng.

I can’t stress enough the importance of recording metrics over time. At Fastly we use Ganglia, DataDog, and other tools to gather data over time and track changes. As far as recognizing patterns goes, visualizations really can’t be beat.
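Statistics counters are usually monotonic totals, so the first step before graphing them is often to turn successive samples into rates. The sketch below is a minimal, hypothetical version of that step. The `rates` function and the sample data are made up, and it assumes a counter only ever decreases because the process restarted.

```python
def rates(samples):
    """Turn (timestamp_seconds, counter_value) samples into per-second rates.

    Assumes the counter only increases, except when the process restarts
    and the counter starts again from zero.
    """
    out = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue            # skip duplicate or out-of-order samples
        delta = v1 - v0
        if delta < 0:
            delta = v1          # counter reset: count from zero
        out.append((t1, delta / elapsed))
    return out


# Made-up samples taken every 10 seconds; the last one follows a restart.
samples = [(0, 100), (10, 600), (20, 1600), (30, 50)]
per_second = rates(samples)
```

Plotting `per_second` rather than the raw totals makes a doubling in request rate, or a restart, stand out visually.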

The useful tools vary by project, so it’s not super useful to discuss them in any real detail. What is important is to understand that you will need to gather data for both forming and testing your hypothesis, and understanding the tools available in your domain gets you halfway there.

Importantly, not every tool fits the task. Be prepared to extend your software to provide additional information, as well as to write your own tools to gather and analyze this information.

Application

What does this look like in the real world? In practice, we encounter cases where we’re looking at problems extending far past our experience. Bugs, as Kernighan might say, we are not yet clever enough to debug. In some cases, we may be extremely time-constrained in how long we have to solve an issue. In yet other cases, the tools we have may not even be useful.

Our workload at Fastly is unique. Our Varnish runs between 40,000 and 60,000 threads; in some of our busier POPs it consumes over 300GB of RAM. This comes with all sorts of problems:

The Maps Debacle

Many debugging tools on Linux rely on the /proc/[pid]/maps file to gather information about mapped memory regions in the executable. In releases of the Linux kernel from 3.2 to 4.5, some code to annotate which maps belonged to thread stacks actually resulted in O(n^2) complexity, iterating over maps. A good analysis of this problem is available on the backtrace.io blog. Reading this file on our systems took minutes.

Fundamentally, this meant rolling releases of libraries like libunwind and perf to use /proc/<pid>/task/<pid>/maps (which does not suffer from this issue). It also resulted in sending patches upstream to jemalloc.
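For readers who have not looked inside these files, the sketch below parses lines in the maps format and builds the two paths mentioned above. The helper names and the sample lines are made up for illustration. This is not how libunwind, perf, or jemalloc read the file, only a small model of what a tool has to do with it.

```python
def maps_path(pid, tid=None):
    """Path of the per-process maps file, or the per-task one if tid is given."""
    if tid is None:
        return f"/proc/{pid}/maps"
    return f"/proc/{pid}/task/{tid}/maps"


def parse_maps_line(line):
    """Parse one line of maps output into a dict."""
    # Fields: address range, permissions, offset, device, inode, optional path.
    fields = line.split(None, 5)
    start, end = (int(x, 16) for x in fields[0].split("-"))
    return {
        "start": start,
        "end": end,
        "perms": fields[1],
        "offset": int(fields[2], 16),
        "dev": fields[3],
        "inode": int(fields[4]),
        "path": fields[5].strip() if len(fields) > 5 else "",
    }


# Made-up sample lines in the maps format.
SAMPLE = [
    "00400000-0040b000 r-xp 00000000 08:01 131090 /usr/bin/example",
    "7f3c2a000000-7f3c2a021000 rw-p 00000000 00:00 0 ",
    "7ffd1c5e0000-7ffd1c601000 rw-p 00000000 00:00 0 [stack]",
]

parsed = [parse_maps_line(line) for line in SAMPLE]
total_mapped = sum(m["end"] - m["start"] for m in parsed)
```

On a Linux machine, `open(maps_path(pid, pid))` reads the per-task file for the main thread. Timing that read against `open(maps_path(pid))` on a heavily threaded process would be one way to check whether a given kernel is affected.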

The GDB Fiasco

Both GDB and LLDB have a similar polynomial-time issue when attaching to processes with large numbers of threads. This is largely due to how both utilities have decided to structure contexts about threads they are attached to. The last time I tried to attach GDB to one of our Varnish instances, it took over 4 hours and still hadn’t finished attaching. I tried to cancel this operation and panicked the kernel.

To this end, we’re using other tools for runtime introspection that I will talk about later. And you might ask why not just grab a core of the process and analyze it offline…

The Core Dump Disappointment

Our cache nodes have 768GB RAM. Varnish uses several hundred gigabytes of this memory, and most of it is actually not at all related to object storage, but instead to thread stacks, various memory pools, and metadata. Our root filesystems generally have about 150GB free space. This means we do not have enough disk space to store core dumps if Varnish crashes. We could, of course, offline one of our SSDs, but it is not economically sensible for us to waste 500GB-2TB of space to leave room for core dumps.

Besides, it actually takes a ridiculously long time for the kernel to dump a couple hundred gigs of memory into a file. And even then, it would take over an hour to copy the dump off of the machine at 10Gbit speed, compressed.

Here, we are using some software from backtrace.io to solve the problem. I highly recommend using their utilities if you’re working on systems software, regardless of your software architecture and memory usage.

Carrying On…

Modifying existing utilities to work around system bottlenecks is a little frustrating, but being fundamentally unable to do any runtime debugging is maddening. To that end, we’ve come up with creative ways to gather information from our system at low operational overhead. Our strategy is frequently to make these features things we can leave “always on”, as turning them on only during an outage may lose valuable originating context. That is why the solutions must be low-overhead.

I have blogged about librip this year, detailing the software and the problems it solves. To summarize, librip is a “minimal-overhead API for instruction-level tracing in highly concurrent software” and it enables us to get execution histories of threads at runtime. This granularity isn’t quite to the level of providing a backtrace, but it gets close. And it has helped us solve deadlocks, livelocks, stalls, and performance issues.

Tracing is another point where custom tooling can be extremely useful. Several great technologies already exist on this front, including DTrace and LTTng. Tools like DTrace are fantastic as they operate without any overhead when disabled. However, the dynamic nature of DTrace makes its runtime overhead considerable when generating high-frequency traces, which is where instrumentation-oriented tools like LTTng become useful. (DTrace can do this too, but then you’re responsible for writing the instrumentation hooks as well as dealing with the overhead of instrumenting.)

In any case, tracing information can be a firehose: it spits out a ton of potentially useful information, but figuring out which bits of the information are actually useful can be difficult. Usually we’re looking for a couple different things:

Long tail trace information. 99th percentile trace events are common in systems handling tens of thousands of events per second. Figuring out the 99.999th percentile is more difficult, as described by Dan Luu.
Correlating events. When we notice some long tail traces, what other traces are responsible? For example, if we spent a long time waiting to acquire a mutex, we may be interested to correlate this time with the original holder of the mutex.
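Both questions above can be sketched in a few lines once trace events are in hand. The code below uses the nearest-rank definition of a percentile, which is one of several common definitions. It assumes a made-up event shape: dicts with `thread`, `mutex`, `start`, and `end` keys. None of it reflects the format of any real tracing tool.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def holders_during(wait, holds):
    """Return hold events on the same mutex that overlap the waiting interval."""
    return [
        h for h in holds
        if h["mutex"] == wait["mutex"]
        and h["start"] < wait["end"]
        and h["end"] > wait["start"]
    ]


# Made-up latencies (1..100) and mutex events, in arbitrary time units.
latencies = list(range(1, 101))
p99 = percentile(latencies, 99)

wait = {"thread": 7, "mutex": "pool", "start": 100, "end": 180}
holds = [
    {"thread": 3, "mutex": "pool", "start": 90, "end": 175},
    {"thread": 4, "mutex": "pool", "start": 10, "end": 20},
    {"thread": 5, "mutex": "other", "start": 95, "end": 170},
]
suspects = holders_during(wait, holds)
```

With this definition, the 99.999th percentile is simply the maximum until there are at least 100,000 samples. That is one reason the far tail is harder to pin down than the 99th percentile.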

I’m working on some tracing tools that scribble information into local trace buffers and log entire trace buffers at a time. The idea is that it is easier to determine whether any individual trace chain is interesting than it is to determine whether any individual trace is interesting in terms of data retention. It’s my hope that this will be successful, and I will surely blog about it if so.
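To make the idea concrete, here is a speculative sketch of chain-level retention. It is not the tool described above. The `TraceChain` class, its methods, and the "took longer than a threshold" test for what counts as interesting are all assumptions for illustration.

```python
from collections import deque


class TraceChain:
    """Bounded buffer of trace events for one chain of work."""

    def __init__(self, capacity, sink):
        self._events = deque(maxlen=capacity)   # oldest events fall off
        self._sink = sink                       # where retained chains go

    def trace(self, timestamp, name):
        self._events.append((timestamp, name))

    def finish(self, threshold):
        """Keep the whole chain if it took longer than threshold, else drop it."""
        events = list(self._events)
        self._events.clear()
        if events and events[-1][0] - events[0][0] > threshold:
            self._sink.append(events)
            return True
        return False


log = []
chain = TraceChain(capacity=64, sink=log)

# A fast request: traced, then discarded as a unit.
chain.trace(0, "accept")
chain.trace(1, "lookup")
chain.trace(2, "deliver")
kept_fast = chain.finish(threshold=10)

# A slow request: the entire chain is retained, including its early events.
chain.trace(100, "accept")
chain.trace(101, "lock_wait")
chain.trace(190, "deliver")
kept_slow = chain.finish(threshold=10)
```

The retention decision is made once per chain, when its total duration is known, so the early events of a slow request are still available when it turns out to be slow. In a threaded program each thread would own its chain (for example via `threading.local`), so tracing itself needs no locking.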

Wrapping it Up

This has been a ridiculously long post. To summarize:

Recent research provides us evidence that the way we think about problems directly affects our ability to learn and how we interact with others about these problems.
There are ways that we can improve our mindset (and influence the mindsets of those around us) such that we optimize our experience for learning, problem solving, and success.
Debugging is a special case of problem solving, and benefits hugely from shifts in mindset.
Through classifying bugs, recognizing patterns, and using tools, we are able to come up with a general strategy for solving problems in our code, based on the scientific method.
Understanding this methodology prepares us to ask better questions about what data we need to solve problems. This allows us to extend existing tools and author new ones to help us in our journey.

I’m extremely interested in this subject in general. Look for more posts related to debugging and pedagogical topics in the (hopefully near) future!