DHO is Mostly Confused
Rants from Devon H. O'Dell
Efficient String Searching with Binary Search Trees: Part 2
In my previous post, I discussed tries and what seemed to me like a novel way to efficiently search for strings in a balanced binary search tree.
Using a binary search tree to search for strings is usually inefficient compared to other data structures, largely because of key comparison overhead. First, nodes must store their keys (this isn’t always necessary for e.g. hash tables with known inputs). Second, to traverse the tree, keys must be compared at each level. In the worst case, each comparison inspects all k bytes of the key (where k is the key length), and we perform one comparison at each of the log n levels, giving us O(k log n) overhead for key lookups in such structures. Hash tables and tries give us key lookups with effectively O(k) overhead. We can in fact achieve the same thing in a binary tree if we store some state about how many bytes we’ve already compared, and we can do it efficiently if we only compare objects with the same key length.
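To make the stateful idea concrete, here is a minimal sketch of one way to keep such state, using the classic longest-common-prefix (LCP) skipping trick: track how many leading bytes the search key shares with the lower and upper bounds of the current subtree, and resume each comparison at the smaller of the two. This is not the comparator from my previous post, and the node layout here is hypothetical:

#include <stddef.h>

struct node {
        struct node *left, *right;
        const char *key;
        size_t keylen;
};

/* Compare a and b lexicographically, starting at *skip already-matched
 * bytes; on return, *skip holds the length of the shared prefix. */
static int
cmp_from(const char *a, size_t alen, const char *b, size_t blen, size_t *skip)
{
        size_t n = alen < blen ? alen : blen;
        size_t i = *skip;

        while (i < n && a[i] == b[i])
                i++;
        *skip = i;
        if (i < n)
                return (unsigned char)a[i] - (unsigned char)b[i];
        return (alen > blen) - (alen < blen);
}

static struct node *
lookup(struct node *root, const char *key, size_t keylen)
{
        /* Bytes the search key shares with the subtree's lower/upper
         * bound. Any node in the subtree shares at least
         * min(llcp, rlcp) leading bytes with the search key, so each
         * comparison can resume at that offset instead of at zero. */
        size_t llcp = 0, rlcp = 0;

        while (root != NULL) {
                size_t skip = llcp < rlcp ? llcp : rlcp;
                int r = cmp_from(key, keylen, root->key, root->keylen, &skip);

                if (r == 0)
                        return root;
                if (r < 0) {
                        rlcp = skip;    /* node becomes the upper bound */
                        root = root->left;
                } else {
                        llcp = skip;    /* node becomes the lower bound */
                        root = root->right;
                }
        }
        return NULL;
}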
Back to Performance
I finally found my code that did a performance comparison against the critbit tree I used. After playing with the code a bit, I got it working again and re-surprised myself with the results. The memory overhead of my version isn’t as high as I remembered it, because I removed the key padding requirement in the test (but hadn’t reflected that change in the bitrotted branch I rescued the code from).
I ran my tests with a single thread on an Intel Xeon E5-2690, but achieved similar results on a Core i7-4558U. I’d be interested in seeing how this works on other processors as well.
Memory
The memory overhead of the binary search tree compared to the critbit tree in my tests is at most about 10%. This shouldn’t have been terribly surprising: unless our keyspace is relatively small, it’s unlikely that the size of our composed data structure will dominate the overall size of the data it represents. In my test, I was using relatively long keys (averaging 84 bytes per key), generated by stringing random dictionary words together.
For a million such keys, the critbit representation consumed 116,548,115 bytes. The tree representation consumed 124,548,115 bytes. The size of the original data set was 85,548,115 bytes.
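(That works out to an extra 8,000,000 bytes over the critbit tree: exactly 8 bytes per key, or about a 6.9% increase.)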
Lookup Time
Lookup time varies depending on the prefix length and on how long the unique suffix is. With a sufficiently long prefix and a sufficiently short suffix, the critbit implementation beats the pants off the stateful search tree. However, I don’t feel this reflects real-world data. There are a few cases in which we would find long shared prefixes followed by short unique suffixes (a directory hierarchy springs to mind), but you need relatively few directories for this to hold.
Many strings of any significant length, even strings sharing long prefixes, have relatively long distinct suffixes. With a shared prefix of any length and an average distinct suffix of 8 bytes, the critbit tree wins every time. As soon as we get to distinct suffixes of 16 bytes, the critbit tree loses, and it never seems to win again, regardless of how long the shared prefix is. (Indeed, even with 512-byte shared prefixes, the critbit implementation I benchmarked still never wins.)
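For reference, test inputs of the shape described above — a fixed shared prefix followed by a distinct random suffix — can be generated with something like the following helper. This is a hypothetical sketch, not code from the benchmark:

#include <stdlib.h>
#include <string.h>

/* Build a key of the form <prefix><random suffix of suffixlen bytes>.
 * Caller frees the result. */
static char *
make_key(const char *prefix, size_t suffixlen)
{
        size_t plen = strlen(prefix);
        char *key = malloc(plen + suffixlen + 1);

        if (key == NULL)
                return NULL;
        memcpy(key, prefix, plen);
        for (size_t i = 0; i < suffixlen; i++)
                key[plen + i] = 'a' + rand() % 26;
        key[plen + suffixlen] = '\0';
        return key;
}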
That’s Odd.
I found this a little curious. Was it cache misses? Was it branch mispredictions? Was it implementation? What’s a good baseline for red-black trees? The answers are a little humbling.
I initially thought the optimizer might be doing something tricky (or unfair), but the tree’s performance benefits remained even with optimization disabled.
I thought perhaps a radix tree would work better than a critbit trie, but it also underperformed (and of course used significantly more memory).
I looked at some perf profiles, and it was clear that the critbit tree had more branch mispredictions and more cache misses. It underperforms because it does too much work: it takes more effort to find critical bits and traverse than it does to compare strings and traverse. That got me thinking about my comparison function and whether there was any benefit to its complexity.
We’re OK!
People have been optimizing memcmp(3) for years, and it’s no surprise that it is fast, especially on long inputs. It turns out that a red-black tree with a comparison function as simple as:
#define RB_CMP(T)                                                       \
static inline int                                                       \
rb_##T##_mb_cmp(struct T *search, struct T *cmp)                        \
{                                                                       \
        /* Order first by key length, then by content, so memcmp()     \
         * only ever compares keys of equal length. */                 \
        if (search->keylen == cmp->keylen) {                            \
                return memcmp(search->key, cmp->key, search->keylen);   \
        } else {                                                        \
                return (search->keylen > cmp->keylen) ? 1 : -1;         \
        }                                                               \
}
does perform worse than my version. However, it still beats the critbit and radix trees as soon as there’s any notable variance in the inputs. Even a
#define RB_CMP(T)                                                       \
static inline int                                                       \
rb_##T##_mb_cmp(struct T *search, struct T *cmp)                        \
{                                                                       \
        /* No length check at all; this assumes keys are               \
         * distinguishable within search->keylen bytes. */             \
        return memcmp(search->key, cmp->key, search->keylen);           \
}
outperforms the critbit and radix trees I tested on many inputs, including worst-case inputs.
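For context, here is roughly how a comparator generated by the RB_CMP macro above might be wired into a BSD <sys/tree.h> red-black tree. This is a hypothetical sketch rather than the benchmark code; the node layout and names are mine:

#include <stddef.h>
#include <string.h>
#include <sys/tree.h>

struct strnode {
        RB_ENTRY(strnode) entry;
        const char *key;
        size_t keylen;
};

RB_CMP(strnode)         /* expands to rb_strnode_mb_cmp() */

RB_HEAD(strtree, strnode);
RB_GENERATE(strtree, strnode, entry, rb_strnode_mb_cmp)

/* Look up a key by filling in a stack-allocated needle node. */
static struct strnode *
strtree_find(struct strtree *head, const char *key, size_t keylen)
{
        struct strnode needle;

        needle.key = key;
        needle.keylen = keylen;
        return RB_FIND(strtree, head, &needle);
}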
Conclusions?
Benchmark. Make baselines. Tries are neat, but if you have memory, don’t bother.
In my experience, clever code usually doesn’t perform well: it does too much. In this case, I’m pleasantly surprised that something marginally clever actually does win. The difficult part about being clever in code is really knowing when not to be clever. And that provides a good segue to a quote I’ve always enjoyed, Kernighan’s Lever:
Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?
This comparator was a pain in the butt to debug.
This also gives me good insight as to why std::map in C++ is implemented as a red-black tree. (I used to make fun of that whenever it was brought up. But I think it is a surprising result, because every time I have made fun of it, nobody has corrected me.)
If you’d like to reproduce my results, I have code available in a GitHub repository.