Thursday, September 3, 2015

Fast lookups with memory_order_consume

In this post we're going to explain how to make your lock-free/wait-free data structures do faster lookups using std::memory_order_consume (at least on PowerPC).

Motivation

Many known concurrent data structures with lock-free and wait-free progress
use some kind of node traversal when doing lookups (Harris Linked List, Michael & Scott, Fast Concurrent Lock-Free Binary Trees, etc.).
From the point of view of the C11/C++11 memory model, this traversal is done with an acquire-load each time the pointer to the next node is read, using memory_order_acquire or memory_order_seq_cst. Doing this ensures that the contents of the node will be properly initialized, and otherwise consistent, including the key associated with the node.
http://en.cppreference.com/w/cpp/atomic/memory_order
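
For reference, a classic traversal with an acquire-load on every next pointer looks more or less like this (our simplified sketch, not taken from any of those papers, using the Node type defined in the Algorithm section below):

template<typename K>
Node<K>* lookup_acquire(Node<K>* head, K* lookup_key) {
  Node<K>* node = head->next.load();  // seq_cst by default
  while (node != nullptr) {
    // Every step pays for an acquire-load, which on PowerPC
    // implies a barrier per node visited
    K* node_key = node->key.load(std::memory_order_acquire);
    if (*node_key == *lookup_key) return node;
    node = node->next.load(std::memory_order_acquire);
  }
  return nullptr;
}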

The problem with using memory_order_acquire or memory_order_seq_cst is that they require a memory barrier on CPUs with weak ordering, like PowerPC.

[Image omitted; source: The C11 and C++11 Concurrency Model]

One of the available orderings in the C11/C++11 memory model is memory_order_consume, which, when used to read the pointer to the next node, guarantees that the contents of the next node and of variables that carry a data dependency from that load will be up to date. This includes the pointer to the key associated with the node, but not necessarily the key's contents.
A good example (pointed out to us by Martin Buchholz) is when the key contains a pointer to a global (static) variable, for which the data dependency cannot be guaranteed; the global may not appear properly initialized because no happens-before relation is established.
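
To make that concrete, here is an illustrative (made-up) key type showing the problem:

struct Config { int verbosity; };
Config global_config;  // a global reached through the key

struct UserKey {
    int     id;                    // plain data: covered by the dependency chain
    Config* cfg = &global_config;  // pointer to a global
};

// After    K* k = node->key.load(std::memory_order_consume);
// the load of k->id carries a data dependency and is fine, but for
//     int v = k->cfg->verbosity;
// the compiler may know that cfg can only hold &global_config and access the
// global directly, breaking the dependency chain: no happens-before for v.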

The motivation behind the usage of memory_order_consume is that on most architectures (DEC Alpha being the exception, due to its reordering of dependent loads) it does not require a memory barrier and, therefore, provides better throughput when traversing the nodes.
A desirable algorithm would allow any kind of user-provided key, regardless of its contents, to be used in a node-based data structure, with the nodes being traversed using memory_order_consume as much as possible, avoiding loads with memory_order_acquire or at least reducing their usage to a minimum.


Algorithm

Our technique is very simple: it requires adding a new member field to each node, keyhash, which stores a hash of the key. Notice that this field can be a regular variable, i.e. there is no need to use std::atomic<>.
Example code in C++1x looks like this:

#include <atomic>      // std::atomic
#include <cstddef>     // std::size_t
#include <functional>  // std::hash

template<typename K>
struct Node {
  std::atomic<K*>    key;      // pointer to the user-provided key
  std::size_t        keyhash;  // hash of the key: a regular, non-atomic variable
  std::atomic<Node*> next;     // pointer to the next node
};


template<typename K>
Node<K>* lookup(Node<K>* head, K* lookup_key) {
  static std::hash<K> hashFunc;
  std::size_t lookup_hash = hashFunc(*lookup_key);
  // First load is seq_cst (the default) to keep stores out of the loop
  Node<K>* node = head->next.load();
  while (node != nullptr) {
    if (node->keyhash == lookup_hash) {
      // Hashes match: only now do we pay for an acquire-load of the key
      K* node_key = node->key.load(std::memory_order_acquire);
      if (*node_key == *lookup_key) {
        std::atomic_thread_fence(std::memory_order_acquire);
        return node;
      }
    }
    // Traverse with a consume-load: no barrier needed on PowerPC
    node = node->next.load(std::memory_order_consume);
  }
  std::atomic_thread_fence(std::memory_order_acquire);
  return nullptr;
}


The idea is that we do a consume-load on each node->next and check whether the node's keyhash member matches the hash of the key we're searching for. Only when there is a match do we look at the key, using memory_order_acquire; otherwise we continue to the next node without ever touching the key. This memory_order_acquire provides consistency of the key and makes sure its fields are properly initialized, even for members that are, or point to, global variables.

A few things to notice in the code above:
  • The first load, on head->next, is done with std::memory_order_seq_cst so as to prevent stores from moving inside the loop. Depending on the particularities of the data structure you may or may not need this. The rule of thumb is: if you need it when using memory_order_acquire, then you certainly need it with memory_order_consume;
  • Before returning, whether the key was found or not, we do an atomic_thread_fence(std::memory_order_acquire) to prevent the loads from "escaping". Again, if you needed it for memory_order_acquire then you also need it for memory_order_consume. Doing the loads of node->next with memory_order_seq_cst would remove the need for the atomic_thread_fence() at the end, but it would be much slower on PowerPC;
  • If you're using RCU, then it may provide acquire/release semantics (depending on the implementation) and you won't need the atomic_thread_fence() calls;
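
For the hash check above to be safe, the inserting thread must fill in the key and keyhash before publishing the node with a release store. The post doesn't show the insertion side, so here is our minimal sketch of a prepend-style insert (insert_head() is a hypothetical helper, not part of the original code):

template<typename K>
void insert_head(Node<K>* head, K* key) {
  static std::hash<K> hashFunc;
  Node<K>* node = new Node<K>();
  node->key.store(key, std::memory_order_relaxed);  // made visible by the release below
  node->keyhash = hashFunc(*key);                   // plain store, same reasoning
  Node<K>* first = head->next.load();
  do {
    node->next.store(first, std::memory_order_relaxed);
    // The release CAS publishes key, keyhash, and next to any thread that
    // later reads head->next with consume, acquire, or seq_cst
  } while (!head->next.compare_exchange_weak(first, node,
             std::memory_order_release, std::memory_order_relaxed));
}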


Ordered data structures

The technique above works well for unordered-node-based data structures, but how about ordered-node-based data structures, like Harris Linked List or some kind of Lock-Free Binary Tree?
We can also use the same approach, but with an extra trick: instead of ordering on the key, we order on the key's hash, and only when there is a match do we start comparing on the key itself.
What this means in practice is that we can create an internal "comparator" method that calls the user's comparator; in C++ it would work more or less like this:
template<typename K>
int internal_comparator(Node<K>* node, K* lookup_key, std::size_t lookup_hash) {
    // Order first on the hash, which is a plain read with no barrier...
    if (lookup_hash > node->keyhash) return 1;
    if (lookup_hash < node->keyhash) return -1;
    // ...and only on a hash match do we pay for the acquire-load of the key
    K* node_key = node->key.load(std::memory_order_acquire);
    return comparator(lookup_key, node_key);  // user-provided comparator
}

This way, the logic of the lock-free/wait-free data structure requires little modification: it's just a matter of calling internal_comparator() instead of comparator(), and using memory_order_consume for the node traversal, as sketched below.
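
To illustrate (our sketch, not from the original post), a lookup on a list kept sorted by ascending keyhash could use internal_comparator() like this:

template<typename K>
Node<K>* lookup_ordered(Node<K>* head, K* lookup_key) {
  std::size_t lookup_hash = std::hash<K>{}(*lookup_key);
  Node<K>* node = head->next.load();  // seq_cst, as in lookup() above
  while (node != nullptr) {
    int cmp = internal_comparator(node, lookup_key, lookup_hash);
    if (cmp == 0) {
      std::atomic_thread_fence(std::memory_order_acquire);
      return node;
    }
    // List is sorted by ascending hash: once our hash is smaller, we've
    // passed the position where the key would be
    if (cmp < 0) break;
    node = node->next.load(std::memory_order_consume);
  }
  std::atomic_thread_fence(std::memory_order_acquire);
  return nullptr;
}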

Neat, huh?   :)


Microbenchmarks

So what do we gain by all of this?
Speeeeeed, that's what!
We ran the microbenchmarks on PowerPC: a Power8E with 8 cores, Ubuntu Linux 14.04 64-bit, GCC 4.9.2.

The plots below show the mean of 5 runs of the number of operations per second, as a function of the number of threads, on a microbenchmark that iterates for 20 seconds. On each iteration, we randomly pick a key and traverse a linked list of 100, 1000, or 10000 keys until we find the one we're looking for.

To make it more realistic, before starting the lookup we call rcu_read_lock(), and we call rcu_read_unlock() when the lookup finishes. This causes a fixed overhead that hurts short lookups (on small lists). Your mileage may vary with other memory reclamation techniques, like reference counting or hazard pointers, but as Andreia pointed out, it doesn't seem possible to get performance gains from this technique when using reference counting or HPs. For example, hazard pointers do a (sequentially consistent) store to the hazard pointer and then a load of the node's next pointer, a pair that could (but must not) be re-ordered if the load is memory_order_consume or memory_order_acquire, but will not be re-ordered if it is memory_order_seq_cst.
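
In code, the hazard-pointer constraint looks something like this (our simplified sketch; real implementations also re-validate the pointer after publishing it):

template<typename K>
Node<K>* protected_next(std::atomic<Node<K>*>& hp, Node<K>* node) {
  hp.store(node, std::memory_order_seq_cst);  // announce that node is protected
  // This load must not be re-ordered before the store above; a consume or
  // acquire load could be, so a seq_cst load (or a full fence) is required,
  // cancelling the gain from the consume-based traversal
  return node->next.load(std::memory_order_seq_cst);
}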
The source code for the microbenchmark is not important (it's not really production-ready), but it's on GitHub if you want to take a look.

The benchmark executes three different scenarios:
  • Scenario 1: pointers to the next node are read with memory_order_seq_cst and, therefore, the key can be read as a regular variable;
  • Scenario 2: pointers to the next node are read with memory_order_acquire, and the key is also read as a regular variable;
  • Scenario 3: the code shown in lookup() above, where the pointer to the next node is read with memory_order_consume and, when the hashes match, the key associated with the node is read with memory_order_acquire.
The plots show the ratios of the third scenario over the second (consume/acquire), and of the third scenario over the first (consume/seq_cst).

[Plots omitted: operations per second, consume/acquire and consume/seq_cst ratios, for lists of 100, 1000, and 10000 keys]

Analysis of Results

From the first plot we can see that the advantage of using memory_order_consume over memory_order_acquire is modest in most cases, but can go up to an 80% increase.
The second plot shows that the gain of memory_order_consume over memory_order_seq_cst (the default ordering for atomics in C11 and C++1x) is significant and can provide up to a 7x throughput gain, which is definitely worth the small effort of implementing this "hash trick" with memory_order_consume.

How about x86?
Well, on x86 there is no significant difference between the three scenarios because, as you can see in the C++11 atomics mappings, on x86 all loads translate to a plain MOV instruction without any memory barrier, regardless of whether the load is memory_order_seq_cst, memory_order_acquire, or memory_order_consume. So, don't bother using this technique if you're only going to run on x86... you won't gain anything.

Now, you may be wondering: huh, wasn't memory_order_consume "broken" on GCC 4.9?!
Yeah, about that: we're actually using memory_order_relaxed as a stand-in for memory_order_consume in our benchmarks, because current compilers (GCC included) implement memory_order_consume by promoting it to memory_order_acquire, which would defeat the purpose of the measurement. Please do NOT do this in your production code; we did it only because there was no practical alternative. For our purposes, consume and relaxed are (theoretically) the same on PowerPC, so it's ok (kind of), and it gives a good proxy of what the performance will be with the "actual" memory_order_consume, when it gets properly implemented in some future GCC version.
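
If you want to make the substitution explicit (and easy to undo later), one way is a named constant used at every traversal load; mo_consume is our own naming, not from the post:

#include <atomic>

// Stand-in for memory_order_consume while compilers promote it to acquire.
// WARNING: only valid for benchmarking on architectures, like PowerPC, that
// honor data dependencies; do NOT ship this in production code.
constexpr std::memory_order mo_consume = std::memory_order_relaxed;

// usage at the traversal site:
//   node = node->next.load(mo_consume);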
For more on this topic, make sure to watch Paul McKenney's CppCon 2015 talk "C++ Atomics: The Sad Story of memory_order_consume: A Happy Ending at Last?" and Michael Wong's presentation "C++11/14/17 Atomics the Deep dive: the gory details, before the story consumes you!"
http://cppcon.org/2015program/


Conclusion

Of course, it's not all roses, there's a thorn: we're adding a new 64-bit (or 32-bit) member to the node to store the key's hash.
That may seem like a lot, given that on a simple linked list the node holds only two other 64-bit variables (the pointer to the key and the pointer to the next node), so we're increasing the memory usage of the node by 50%... but keep in mind that for every node there is an associated key, and the key can be quite large, to the point that one extra 64-bit variable per key is negligible.

There are other ways to use memory_order_consume in lock-free data structures, and if we have some time we'll go over a few of them in a future post, but this technique is the most generic, easy to understand, and easy to implement... and in my book, simplicity counts a lot!

Who would have thought that std::memory_order_consume is actually useful ?!?  ;)

