DHO is Mostly Confused

Rants from Devon H. O'Dell

Efficient String Searching with Binary Search Trees

For some time, I’ve been a fan of the trie as an efficient string searching data structure. A trie is a tree-like data structure in which every node represents some piece of a string. Many variants exist; perhaps the most popular is the radix tree. Tries have a couple of neat properties:

  • They have a low RAM impact compared to some other tree structures.
  • They require only O(k) operations to search for, insert, or remove a key, where k is the key length.
  • They compress very well with many shared prefixes (this makes them particularly good for doing prefix matching on e.g. IPv6 addresses).
  • A suffix tree is essentially a prefix tree built over reversed strings, so tries are great for suffix matching as well.

Why would we prefer a structure that requires O(k) time over a mostly O(1) hash table? Though hash tables are great, they waste a lot of memory: every unused bucket is unfortunate. With chaining, performance tends toward O(n) as the table fills past its capacity. With linear probing, we can get decent performance using Robin Hood hashing, but eventually we need to resize the table. Resizing large, open-addressed hash tables is expensive, as all items must be re-inserted.

Furthermore, it is often ignored that calculating the hash of a string is itself an O(k) operation. So hash lookups for strings also carry an O(k) component, before whatever work we have to do for collision resolution.
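As a concrete illustration, consider FNV-1a (chosen here arbitrarily as a common string hash): it performs one xor and one multiply per input byte, so the hash alone is O(k) work before we ever look at a bucket.

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a: one xor and one multiply per input byte -- O(k) work
 * before we ever touch a hash table bucket. */
static uint64_t
fnv1a(const char *key, size_t len)
{
        uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */

        for (size_t i = 0; i < len; i++) {
                h ^= (uint8_t)key[i];
                h *= 1099511628211ULL;          /* FNV prime */
        }
        return h;
}
```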

I digress.

Various trie implementations exist. When dealing with a normalized English alphanumeric character set, the following might be a valid trie node representation:

    struct trie_node {
        bool                final;     /* Does this node terminate a key? */
        struct trie_node    *n[36];    /* a-z0-9 */
    };

If we needed full 7-bit ASCII support:

    struct trie_node {
        bool                final;
        struct trie_node    *n[128];
    };
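To make the O(k) bound concrete, here is a minimal sketch of insert and lookup over the alphanumeric layout above. The helper `trie_index` and the error handling are my own conventions for illustration, not from any particular implementation.

```c
#include <stdbool.h>
#include <stdlib.h>

/* The alphanumeric node layout, with pointer children. */
struct trie_node {
        bool              final;        /* terminates a stored key? */
        struct trie_node *n[36];        /* a-z0-9 */
};

/* Map a normalized character to its child slot, or -1 if out of range. */
static int
trie_index(char c)
{
        if (c >= 'a' && c <= 'z')
                return c - 'a';
        if (c >= '0' && c <= '9')
                return 26 + (c - '0');
        return -1;
}

/* Insert: one node per character, allocated on demand. */
static struct trie_node *
trie_insert(struct trie_node *root, const char *key)
{
        struct trie_node *node = root;

        for (; *key != '\0'; key++) {
                int i = trie_index(*key);

                if (i < 0)
                        return NULL;
                if (node->n[i] == NULL)
                        node->n[i] = calloc(1, sizeof (*node->n[i]));
                node = node->n[i];
        }
        node->final = true;
        return node;
}

/* Membership test: O(k) -- exactly one node visit per character. */
static bool
trie_contains(const struct trie_node *root, const char *key)
{
        const struct trie_node *node = root;

        for (; *key != '\0'; key++) {
                int i = trie_index(*key);

                if (i < 0 || node->n[i] == NULL)
                        return false;
                node = node->n[i];
        }
        return node->final;
}
```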

Of course, this isn’t super memory-efficient, and it balloons quickly. We could instead use a simple bitset, at which point we’d need only 16 bytes per node (32 if we want to deal in UTF-8 or other character encodings that use the top bit). However, then we don’t have a node representing each level. Because of this, terminal nodes need to carry a key with them, but also need a list of child nodes.

    struct trie_node {
        void             *key;
        uint64_t         bits[2];
        struct trie_node **children;
    };
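A sketch of how a child might be located under this layout, assuming the children array is kept dense in bit order so that a child's index is the number of set bits below its character's position (a popcount rank query). `node_child_index` is a hypothetical helper, and `__builtin_popcountll` is GCC/Clang-specific.

```c
#include <stdbool.h>
#include <stdint.h>

/* Locate the child slot for 7-bit character c. The children array
 * holds only the set bits' children, in bit order, so the index of
 * c's child is the count of set bits strictly below position c. */
static bool
node_child_index(const uint64_t bits[2], uint8_t c, unsigned *idx)
{
        unsigned word = c >> 6;        /* which of the two quadwords */
        unsigned bit = c & 63;         /* bit within that quadword */

        if ((bits[word] & (1ULL << bit)) == 0)
                return false;          /* no child for this character */

        /* Rank query: count set bits strictly below position c. */
        uint64_t below = bits[word] & ((1ULL << bit) - 1);
        *idx = __builtin_popcountll(below);
        if (word == 1)
                *idx += __builtin_popcountll(bits[0]);
        return true;
}
```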

Crit-bit trees optimize further for memory, requiring only one integer and one pointer on top of each string.

    struct trie_node {
        struct trie_node *child[2];
        uint32_t         byte;
        uint8_t          otherbits;
    };

The space savings come at an increased cost: instead of walking the tree and testing whether we are at a terminal node once we reach the end of our string, we have to do an additional O(k) comparison to determine whether the key we were looking for actually had membership in the set. In general, this cost is not hugely significant, but it bothered me enough to wonder whether there was a way to organize the data to do only a single comparison over the key length.
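For reference, the walk preceding that final comparison is cheap at each node. In Bernstein's crit-bit encoding, `otherbits` holds the critical byte with every bit set except the critical one, so adding 1 to `(otherbits | c)` carries into bit 8 exactly when `c` has the critical bit set. A sketch of the direction test over the fields above:

```c
#include <stddef.h>
#include <stdint.h>

/* Per-node step of a crit-bit walk: returns 1 if the search key's
 * critical bit is set (go right), 0 otherwise. Bytes past the end
 * of the key behave as zero. */
static int
critbit_direction(uint32_t byte, uint8_t otherbits,
    const uint8_t *key, size_t keylen)
{
        uint8_t c = (byte < keylen) ? key[byte] : 0;

        return (1 + (otherbits | c)) >> 8;
}
```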

In particular, although we only do O(k) work (as opposed to O(k log n) for a binary search tree), we still have to deal with branch mispredictions and cache misses as we do with any sort of tree structure. What if we could reduce the overhead for a tree search? Would the memory savings be worth it?

Stateful Operations and Sorting Criteria

The ability to maintain state across disconnected but related operations can be a huge boon in solving problems. HTTP cookies, for instance, solved state problems in a stateless protocol (as long as the client cooperated). Can maintaining state help us here, too?

It can.

Since we have to store keys in the tree anyway, we can waste a little space and store lengths as well. Any keys without a matching length don’t match by definition. So if our tree has two sorting criteria (key length, then key value), we cut down the number of key comparisons we have to do significantly.
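Without any state, the two-level sort alone might look like this (the struct and function names here are mine, for illustration):

```c
#include <stdint.h>
#include <string.h>

/* Keys carry their length; sort by length first, then by content.
 * Keys of different lengths are ordered without touching a single
 * byte of the keys themselves. */
struct lenkey {
        const void *key;
        uint64_t    keylen;
};

static int
lenkey_cmp(const struct lenkey *a, const struct lenkey *b)
{
        if (a->keylen != b->keylen)
                return a->keylen > b->keylen ? 1 : -1;
        return memcmp(a->key, b->key, a->keylen);
}
```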

What about state?

If we are able to remember everything we’ve searched in a red-black tree, we can cut down further on the comparison process. If we know what we’ve already matched, we can order our keys by their value (and we can of course terminate early if we can tell that our key won’t be deeper in the tree).

Stateful-Search Binary Search Trees

Although we can implement the state outside of the tree, it helps tremendously from a composability perspective to have our state handled within the API for our data structure. Since this mutates standard definitions for how such a tree is represented in memory and is specific to a particular type of node, I’m giving it its own name: this post describes a “stateful-search binary search tree”. The solution seems obvious in retrospect, so I suppose I’m not original in “discovering” this particular use case. But I have not seen it documented anywhere else, and since I’ve also failed to document it for a year after having written it, it’s probably a good time to do so.

Most of the implementation is the comparison function. Although this would benefit, in terms of composability, from a native implementation, I have not done that yet.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/tree.h>

uint64_t rb_prefix_len;

#define RB_CMP(T)                                                               \
static inline int                                                               \
rb_##T##_mb_cmp(struct T *search, struct T *cmp)                                \
{                                                                               \
                                                                                \
        /* First, we sort the tree by key length. */                            \
        if (search->keylen == cmp->keylen) {                                    \
                uint64_t *skey = search->key;                                   \
                uint64_t *ckey = cmp->key;                                      \
                                                                                \
                /* If the keys are equal length, skip our already-compared */   \
                /* prefix. */                                                   \
                skey += rb_prefix_len;                                          \
                ckey += rb_prefix_len;                                          \
                                                                                \
                /* Compare quadwords while our total search doesn't */  \
                /* exceed the key length and the prefix of these two */ \
                /* nodes continues to match. Only increment in the */   \
                /* body because we want to do additional comparisons */ \
                /* when on the final or first differing quadword. */    \
                while ((rb_prefix_len + 1) * 8 < search->keylen &&              \
                    *skey++ == *ckey++) rb_prefix_len++;                        \
                                                                                \
                /* If we've compared the whole key, determine whether there */  \
                /* was a match and return order otherwise. */                   \
                if ((rb_prefix_len + 1) * 8 == search->keylen) {                \
                        if (*skey == *ckey) {                                   \
                                return 0;                                       \
                        } else {                                                \
                                return *skey > *ckey ? 1 : -1;                  \
                        }                                                       \
                }                                                               \
                                                                                \
                /* These keys differ before the final quadword; the */  \
                /* loop post-incremented past the mismatching word, */  \
                /* so step back one to determine the ordering. */       \
                return skey[-1] > ckey[-1] ? 1 : -1;                    \
        } else {                                                                \
                return (search->keylen > cmp->keylen) ? 1 : -1;                 \
        }                                                                       \
}

struct item {
        void            *key;
        uint64_t        keylen;
        RB_ENTRY(item)  entry;
};

RB_HEAD(tree_t, item)   head;

RB_CMP(item)
RB_GENERATE_STATIC(tree_t, item, entry, rb_item_mb_cmp);

This implementation makes a further assumption: all keys are allocated, and all allocations are at least 8-byte aligned. Furthermore, all keys are padded to an 8-byte boundary. From a consumer perspective, this is a little gross, but the performance gain is noticeable.

Simple usage is as follows:

struct item *
insert_item(char *key, uint64_t keylen)
{
        struct item *i;
        uint64_t pl;
        char *nk;

        i = malloc(sizeof (*i));

        /* Pad the key copy out to an 8-byte boundary; calloc zeroes
         * the padding bytes so they compare consistently. */
        pl = keylen;
        if (pl & 7) {
                pl += 8 - (pl & 7);
        }
        nk = calloc(1, pl);
        memcpy(nk, key, keylen);
        i->key = nk;
        /* Store the padded length; the comparator walks whole quadwords. */
        i->keylen = pl;

        rb_prefix_len = 0;
        return RB_INSERT(tree_t, &head, i);
}

The composability problem here is obvious. First, we have to pad allocations. That could of course be handled elsewhere, but I added it here for clarity. Second, any RB_INSERT or RB_FIND operation requires resetting the prefix length to 0. Finally, this increases the size of the necessary critical section if the tree is mutated online with concurrent access. I imagine it would be easy to forget to put the rb_prefix_len = 0; inside the critical section (though this could also be solved by giving the variable __thread storage).
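The __thread approach might be sketched like this (`rb_stateful_begin` is a hypothetical wrapper, not part of the code above):

```c
#include <stdint.h>

/* With __thread storage (GCC/Clang; _Thread_local in C11), each
 * thread owns its own prefix counter, so the reset no longer needs
 * to happen inside the tree's critical section. */
static __thread uint64_t rb_prefix_len;

/* Hypothetical entry-point wrapper: resetting here means callers of
 * the search API cannot forget to do it themselves. */
static inline void
rb_stateful_begin(void)
{
        rb_prefix_len = 0;
}
```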

Additionally, we can’t just search by arbitrary input. All input has to be padded, so any user-supplied search strings need to be extended before searching. This was fine for me, because my use case included both fixed-length strings that were already 8-byte aligned and variable-width strings read into a capped-length buffer from the network.

If these constraints do not hold for you, it should be relatively simple to modify the comparison function to do bytewise validation.

Performance

How does it perform? Well, that’s a good question. In some microbenchmarks, it outperformed the crit-bit tree I used by 33% (even on small keys) while using only about 15% more memory. Of course, I have no idea where those results got to, so I’ll benchmark it again soon and put up another post analyzing the performance.

Until then!

Edit (2015-07-01): The comparison function above was copypasta’ed from an old branch where I had left the code bitrotting. A later version of the code fixes these issues and requires no key padding. I fixed the bugs in the above version, but fixing the key padding issue requires more rewriting than I care to do. That version is included below:

#define RB_CMP(T)                                                               \
static inline int                                                               \
rb_##T##_mb_cmp(struct T *search, struct T *cmp)                                \
{                                                                               \
                                                                                \
        /* First, we sort the tree by key length. */                            \
        if (search->keylen == cmp->keylen) {                                    \
                uint64_t *skey = (uint64_t *)search->key;                       \
                uint64_t *ckey = (uint64_t *)cmp->key;                          \
                                                                                \
                /* If the keys are equal length, skip our already-compared */   \
                /* prefix. */                                                   \
                skey += rb_prefix_len;                                          \
                ckey += rb_prefix_len;                                          \
                                                                                \
                /* Compare quadwords while our total search doesn't */  \
                /* exceed the key length and the prefix of these two */ \
                /* nodes continues to match. Only increment in the */   \
                /* body because we want to do additional comparisons */ \
                /* when on the final or first differing quadword. */    \
                while ((rb_prefix_len + 1) * 8 < search->keylen &&              \
                    *skey++ == *ckey++) rb_prefix_len++;                        \
                                                                                \
                /* If we've compared the whole key, determine whether there */  \
                /* was a match and return order otherwise. */                   \
                if ((rb_prefix_len + 1) * 8 >= search->keylen) {                \
                        /* We have unpadded tail bytes left to match. */        \
                        if (search->keylen & 7) {                               \
                                return memcmp(skey, ckey, search->keylen & 7);  \
                        }                                                       \
                        if (*skey == *ckey) {                                   \
                                return 0;                                       \
                        } else {                                                \
                                return *skey > *ckey ? 1 : -1;                  \
                        }                                                       \
                }                                                               \
                                                                                \
                /* These keys differ before the final quadword; the */  \
                /* loop post-incremented past the mismatching word, */  \
                /* so step back one to determine the ordering. */       \
                return skey[-1] > ckey[-1] ? 1 : -1;                    \
        } else {                                                                \
                return (search->keylen > cmp->keylen) ? 1 : -1;                 \
        }                                                                       \
}

When we are unable to advance by 8 more bytes, there may still be bytes to compare, in the case that keys are not padded. This version detects that and simply does a memcmp on the remaining length (which can be determined by keylen & 7).

You may also be interested in the second part of this post, where I discuss the performance of this comparison function versus other algorithms / data structures.
