Wednesday, September 7, 2016

What Happens When 4096 Cores All Do synchronize_rcu_expedited()?

Here is a presentation by Paul McKenney about the scalability of synchronize_rcu(), or synchronize_rcu_expedited() to be more precise.
https://www.youtube.com/watch?v=1nfpjHTWaUc


Basically, what he's talking about is how to solve the difficulty of having multiple threads (and cores) all calling synchronize_rcu() at the same time.
Scaling rcu_read_lock() and rcu_read_unlock() is more or less easy, and the Linux kernel RCU can do it almost for free (it's all about the scheduling). In userspace (URCU) we can do it with just one sequentially consistent load and one store for rcu_read_lock(), and one release store for rcu_read_unlock(). As long as each reader has its own entry in an array or list and there is no false sharing, this scales pretty much linearly.
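As a rough illustration, here is a minimal sketch of such a reader side using C11 atomics. The names (reader_state_t, MAX_READERS, global_grace_period) are invented for this example; this is not the actual URCU code:

#include <stdatomic.h>

#define MAX_READERS 128   /* hypothetical upper bound on reader threads */

/* One padded entry per reader thread, to avoid false sharing. */
typedef struct {
    _Atomic long grace_period;   /* 0 means "not inside a read-side critical section" */
    char pad[64 - sizeof(_Atomic long)];
} reader_state_t;

static reader_state_t readers[MAX_READERS];
static _Atomic long global_grace_period = 1;

static inline void my_rcu_read_lock(int tid) {
    /* One sequentially consistent load of the current grace period... */
    long gp = atomic_load(&global_grace_period);
    /* ...and one (seq-cst) store announcing "I am reading under gp". */
    atomic_store(&readers[tid].grace_period, gp);
}

static inline void my_rcu_read_unlock(int tid) {
    /* A single release store retires the announcement. */
    atomic_store_explicit(&readers[tid].grace_period, 0, memory_order_release);
}

Because each thread only ever writes to its own entry, readers never interfere with each other, which is where the linear scaling comes from.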

Scaling synchronize_rcu() is a harder task because synchronize_rcu() is inherently blocking, i.e. it must wait for all ongoing readers to complete before returning to the caller.
From what I understood and from some brief discussions with Paul (thanks Paul for helping me understand this stuff!), most (or all) URCU implementations have a global lock inside synchronize_rcu(). Now, if you know anything about locks, you know that global locks are usually a bottleneck when it comes to scalability. For this particular case, however, the lock is there to protect access to a queue, so it's not much of a problem until you hit double-digit core counts.

The typical URCU implementation of synchronize_rcu() seems to work like this:
- Add the thread to a queue of waiting threads (that called synchronize_rcu());
- Acquire a global lock;
- Wait for any readers that still need to be waited for;
- Drain the queue appropriately;
- Release the global lock;

There is a lot more going on; I'm just oversimplifying for the sake of clarity (and sanity).
If you want more details then go and read the code, for example for URCU QSBR:
https://github.com/urcu/userspace-rcu/blob/master/src/urcu-qsbr.c#L251
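To make those steps concrete, here is a toy sketch of a synchronize_rcu() with the same overall shape, reusing the hypothetical readers[] array and global_grace_period counter from the snippet above. Instead of an explicit queue of waiting threads, it shares grace periods through a completed_grace_period counter; real implementations are considerably more careful about memory ordering, and about the window where a reader has loaded the counter but not yet published its announcement:

#include <pthread.h>
#include <stdatomic.h>

/* Assumes readers[], MAX_READERS and global_grace_period from the
 * reader-side sketch above. */

static pthread_mutex_t gp_lock = PTHREAD_MUTEX_INITIALIZER;
static _Atomic long completed_grace_period = 0;

static void my_synchronize_rcu(void) {
    /* Remember which grace period we need to see completed. */
    long needed = atomic_load(&global_grace_period);

    pthread_mutex_lock(&gp_lock);
    /* While we were blocked on the lock, another waiter may already
     * have waited out our grace period: this is the sharing. */
    if (atomic_load(&completed_grace_period) >= needed) {
        pthread_mutex_unlock(&gp_lock);
        return;
    }
    /* Start a new grace period. */
    long gp = atomic_fetch_add(&global_grace_period, 1);
    /* Wait for every reader still inside a critical section that
     * started under an old grace period. */
    for (int i = 0; i < MAX_READERS; i++) {
        long r;
        while ((r = atomic_load(&readers[i].grace_period)) != 0 && r <= gp)
            ;   /* spin; a real implementation would yield or block */
    }
    atomic_store(&completed_grace_period, gp);
    pthread_mutex_unlock(&gp_lock);
}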

The idea is that this queuing technique allows multiple threads to "share the grace period".
In other words, instead of each call to synchronize_rcu() having to wait for its own grace period, they all wait for the same (one or two) grace periods, which is what provides the scalability.
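For instance, with the toy my_synchronize_rcu() sketched above, if eight writer threads call it at roughly the same time, the first one through the lock waits out the grace period and publishes completed_grace_period, and the other seven see their check succeed and return without waiting for a grace period of their own:

#include <pthread.h>
#include <stdio.h>

#define NUM_WRITERS 8

static void *writer(void *arg) {
    (void)arg;
    /* All writers arrive at roughly the same time; with grace-period
     * sharing, most of them piggyback on a single grace period. */
    my_synchronize_rcu();
    return NULL;
}

int main(void) {
    pthread_t t[NUM_WRITERS];
    for (int i = 0; i < NUM_WRITERS; i++)
        pthread_create(&t[i], NULL, writer, NULL);
    for (int i = 0; i < NUM_WRITERS; i++)
        pthread_join(t[i], NULL);
    printf("all writers done\n");
    return 0;
}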
Typically, the longest operation is the grace period itself, which can be very long compared to the time it takes to acquire and release a lock. But this is not always the case, for example when the readers complete in a very short time, or when there are so many cores that they create contention on the global lock.
When that happens, alternative approaches are needed, and that's what Paul talks about in the video: a kind of hierarchical lock solution that distributes the contention (for the Linux kernel RCU).
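I haven't studied the kernel code for this, but the hierarchical idea can be illustrated with a toy two-level funnel (the names and fan-out here are made up): threads first contend on one of several leaf locks and then climb hand-over-hand to the root, so at most a handful of threads ever hit the root lock at the same time:

#include <pthread.h>

#define NUM_LEAVES 4   /* hypothetical fan-out; the kernel uses a tree */

static pthread_mutex_t leaf_locks[NUM_LEAVES] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};
static pthread_mutex_t root_lock = PTHREAD_MUTEX_INITIALIZER;

static void funnel_lock(int tid) {
    pthread_mutex_t *leaf = &leaf_locks[tid % NUM_LEAVES];
    pthread_mutex_lock(leaf);
    pthread_mutex_lock(&root_lock);
    /* Hand-over-hand: once we hold the next level we release the
     * previous one, so a sibling can start climbing behind us. */
    pthread_mutex_unlock(leaf);
}

static void funnel_unlock(void) {
    pthread_mutex_unlock(&root_lock);
}

If I understood the talk correctly, the kernel version also checks at each level whether the grace period it needs has already completed, so most climbers bail out early without ever reaching the root.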


We have one algorithm for URCU that is very simple and doesn't need to hold a global lock, or even to use a queue of threads waiting for the grace period. It will likely also suffer from contention under a high core count, though not as seriously as the global-lock approach. But as far as we know, this is the only URCU implementation that does not use a lock in synchronize_rcu(), is capable of sharing a grace period, and still maintains wait-free progress and high scalability (throughput) for the readers.

If you're interested, stay tuned for the next post!
