Thursday, September 22, 2016

URCU ReadersVersion in C++

In the previous post we showed a new Userspace RCU (URCU) algorithm that can be implemented with just atomics and the Java/C11/C++1x memory model.
Here is the implementation in C++:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/papers/gracesharingurcu/URCUGraceVersion.hpp

Not only is this code simple enough to just copy/paste into your codebase, it also has pretty good throughput and scales well.
As far as we can tell, this is the best basic implementation of a URCU (using just atomics) across all scenarios. Other implementations may surpass it, but they would have to resort to the kind of non-portable trickery URCUs usually employ: POSIX signals, special CPU instructions, or kernel APIs.
Anyway, if you're interested in portable, low-overhead memory reclamation for C/C++ that is wait-free (population oblivious) for readers and blocking for updaters, yet capable of aggregating grace periods, then this should do the trick for you!
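
To give an idea of how it works, here is a condensed sketch of the core idea. The names are illustrative and the memory orderings are left at their seq-cst defaults; the URCUGraceVersion.hpp file linked above is the authoritative implementation:

    #include <atomic>
    #include <cstdint>

    // Sketch of a ReadersVersion-style URCU: one global version counter plus
    // one announced version per reader thread. Illustrative only.
    class URCUReadersVersionSketch {
        static const int MAX_THREADS = 128;
        static const uint64_t NOT_READING = UINT64_MAX; // reader is outside any critical section
        std::atomic<uint64_t> updaterVersion {0};
        std::atomic<uint64_t> readersVersion[MAX_THREADS];

    public:
        URCUReadersVersionSketch() {
            for (int i = 0; i < MAX_THREADS; i++)
                readersVersion[i].store(NOT_READING, std::memory_order_relaxed);
        }

        // Wait-free for readers: announce the version this reader observed.
        void rcu_read_lock(const int tid) {
            readersVersion[tid].store(updaterVersion.load());
        }

        void rcu_read_unlock(const int tid) {
            readersVersion[tid].store(NOT_READING, std::memory_order_release);
        }

        // Blocking for updaters: advance the version once, then wait until
        // every reader is either idle or has announced the new version.
        // Updaters whose compare_exchange fails piggyback on the grace period
        // opened by the winner, which is how grace periods get aggregated.
        void synchronize_rcu() {
            const uint64_t waitForVersion = updaterVersion.load() + 1;
            uint64_t expected = waitForVersion - 1;
            updaterVersion.compare_exchange_strong(expected, waitForVersion);
            for (int i = 0; i < MAX_THREADS; i++) {
                while (readersVersion[i].load() < waitForVersion) { } // spin
            }
        }
    };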

We ran a simple benchmark against two of the URCU implementations in the userspace-rcu lib (http://liburcu.org/). We will make the benchmark available in a couple of weeks, but in the meantime here are some plots.
We compared against the Bullet-Proof URCU and the default URCU; more details in this readme: https://github.com/urcu/userspace-rcu

The first plot shows just readers, i.e. calls to rcu_read_lock()/rcu_read_unlock(), where the read-side critical section is a single atomic load on a global variable. This makes the read-side critical section as short as possible, so any contention is caused by the overhead of rcu_read_lock()/rcu_read_unlock() themselves. As you can see from the plots, all three scale well, but ReadersVersion scales best.
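
For reference, the reader threads in this scenario run something like the following loop (a hypothetical harness, not the actual benchmark code; it uses the sketch class shown above):

    std::atomic<uint64_t> gValue {0}; // the global variable read in the critical section

    // Hypothetical reader loop: the shortest possible read-side critical section.
    void readerLoop(URCUReadersVersionSketch& urcu, const int tid,
                    std::atomic<bool>& stopFlag, uint64_t& numOps) {
        while (!stopFlag.load()) {
            urcu.rcu_read_lock(tid);
            uint64_t v = gValue.load(); // the entire read-side critical section
            urcu.rcu_read_unlock(tid);
            (void)v;
            numOps++;
        }
    }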

[Plot 1: readers-only scalability, rcu_read_lock()/rcu_read_unlock() throughput vs. number of threads]

The second plot shows just calls to synchronize_rcu(). None of the methods scales (more on this in another post), but at least they don't drop off as the number of threads increases, and ReadersVersion is about 10x faster than the other two URCUs.
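
The updater side of this benchmark is just a tight loop of grace periods, along these lines (again a hypothetical harness):

    // Hypothetical updater loop: counts how many grace periods this thread completes.
    void updaterLoop(URCUReadersVersionSketch& urcu,
                     std::atomic<bool>& stopFlag, uint64_t& numOps) {
        while (!stopFlag.load()) {
            urcu.synchronize_rcu();
            numOps++;
        }
    }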

[Plot 2: synchronize_rcu()-only throughput vs. number of threads]

The third plot shows again calls to synchronize_rcu(), but this time there are two other threads continuously doing rcu_read_lock()/rcu_read_unlock(). The read-side critical section is very long: it consists of reading an array with 100000 entries, which means the time is dominated by waiting for the readers rather than by the synchronization mechanism inside synchronize_rcu().
Here too, ReadersVersion scales well because it aggregates grace periods; the default urcu also shares grace periods, though with lower scalability. Bullet-Proof has no mechanism for sharing grace periods and therefore shows no scalability in this kind of scenario.
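
The long read-side critical section in this scenario is essentially an array scan, something like this (hypothetical code, same harness assumptions as above):

    static const int kArraySize = 100000; // matches the entry count mentioned above
    std::atomic<uint64_t> gArray[kArraySize];

    // Hypothetical long reader: the critical section scans the whole array,
    // so grace periods are dominated by waiting for these readers.
    void longReaderLoop(URCUReadersVersionSketch& urcu, const int tid,
                        std::atomic<bool>& stopFlag) {
        while (!stopFlag.load()) {
            urcu.rcu_read_lock(tid);
            uint64_t sum = 0;
            for (int i = 0; i < kArraySize; i++)
                sum += gArray[i].load(std::memory_order_relaxed);
            urcu.rcu_read_unlock(tid);
            (void)sum;
        }
    }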

[Plot 3: synchronize_rcu() throughput with two concurrent long-running readers]

2 comments:

  1. This paper served as a great explanation of RCU in general (reading working implementations was VERY helpful). I was curious about a few things:

    1) In the GitHub repo the article points to, you have multiple implementations. There is GraceVersion, GraceVersionSyncScale, etc. Which one do the benchmarks correspond to?

    2) In the C++ implementation there is a padding construct within the array that tries to align things on a 128-byte (?) boundary. It was not clear to me (as a concurrency novice) what that does. Secondly, from what I can tell, most compilers won't respect the over-alignment requested; at least on my Mac, I can only get 16-byte alignment respected properly. What does that do to the implementation shown?

    Replies
    1. Hi, nice to hear that :)

      1) If you look at figure 1 in the paper
      https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/gracesharingurcu-2017.pdf

      the lines labeled "GraceVersion" correspond to this class:
      https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/papers/gracesharingurcu/URCUGraceVersion.hpp

      and the lines labeled "Two-Phase" correspond to this class:
      https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/papers/gracesharingurcu/URCUTwoPhase.hpp

      The Two-Phase algorithm can use different ReadIndicators, so we made three different ReadIndicator implementations, each with different properties (a minimal interface sketch follows the links):
      https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/papers/gracesharingurcu/RIAtomicCounter.hpp
      https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/papers/gracesharingurcu/RIAtomicCounterArray.hpp
      https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/papers/gracesharingurcu/RIEntryPerThread.hpp
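
      A ReadIndicator here is just an object where readers mark their presence and updaters can check whether any reader is active. A minimal interface sketch, with hypothetical names (the three files above are the real implementations):

          // Illustrative ReadIndicator interface: readers call arrive()/depart(),
          // updaters poll isEmpty() to detect when no reader is present.
          struct ReadIndicator {
              virtual void arrive(const int tid) = 0;
              virtual void depart(const int tid) = 0;
              virtual bool isEmpty() = 0;
              virtual ~ReadIndicator() {}
          };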


      2) The reason for the 128-byte padding is that on x86 the prefetcher can fetch the requested cache line plus the next one. Each cache line is 64 bytes, so to completely avoid false sharing, each reader's state should sit on its own pair of cache lines, not consecutive with any other reader's.
      It's a trade-off: we use more memory to get better read-side scalability, at the cost of the writer/updater having to fetch a different cache line for each active reader. URCUs are meant for read-mostly scenarios, so this should be fine for that kind of workload.
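
      As an illustration, the per-reader state looks roughly like this (a minimal sketch of the idea with hypothetical names, not a verbatim copy of the class):

          // Each reader's slot is padded to 128 bytes, i.e. two 64-byte cache
          // lines, so the adjacent-line prefetcher never pulls one reader's
          // state into another reader's core.
          struct alignas(128) PaddedAtomic {
              std::atomic<uint64_t> version;
              char pad[128 - sizeof(std::atomic<uint64_t>)];
          };
          PaddedAtomic readersVersion[MAX_THREADS];

      And if the compiler only honors a smaller alignment (say 16 bytes), correctness is unaffected: the explicit pad still keeps the slots 128 bytes apart, you just lose the guarantee that each slot starts exactly at a cache-line boundary, which can give back a little of the read-side scalability.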
