Tuesday, September 23, 2014

The two sides of Wait-Free Population Oblivious

The strongest of all progress conditions is wait-free populations oblivious (WFPO), and yet, there are two very different things that are called WFPO. One of them scales, and the other one doesn't.
You see, the WFPO definition talks about the number of operations not being bounded by the number of threads, but it doesn't take into account that a single hardware instruction may take an amount of time that is proportional to the number of threads/cores working on the same variable (or to be more specific, the same cache line).
One example of this is atomic_fetch_add() in C11/C++1x, which is implemented on x86 as a single XADD operation. From a progress condition point of view, this means that a call to atomic_fetch_add() does a single operation (the XADD instruction) and is therefore WFPO, but the time it takes to complete the XADD instruction is proportional to the number of threads/cores working on the same variable, as shown in this wonderful post by Martin Thompson (in his duration plots, lower is better):

I've repeated some of his work in C++14, and on an Opteron with 32 cores. The results are on the plot below, in number of operations per time unit as a function of the number of threads (higher is better):
There are three different lines for three different scenarios:
- All threads do XADD on the same variable;
- Each thread does XADD on a different variable in an array, but they are adjacent, and therefore, on the same cache line(s);
- Each thread does XADD on a different variable in an array, with padding so that each touches a different cache line to avoid false sharing;

C++ source code for this benchmark is available on github at this link:

As you can see, having multiple threads write to the same variable (or cache line) creates a scalability bottleneck, even though all three cases have the same progress condition WFPO... in fact, all three plots are doing exactly the same instruction, it's just that different things are being done by the cache-coherence system.
When all threads are writing to the same variable using the XADD, an heavy synchronization must be done between the cores to figure out which one is going to have which number, so that XADD returns an unique (sequential) number to each of the cores/threads. It doesn't really matter how sophisticated CPUs will become in the future (or cache-coherence protocols), there will always be a bound on the speed of communication between the cores, namely, the speed of light. It is not possible to transmit information faster than the speed of light, and to add to the pain, the more cores are sharing the same memory, the more messages have to be passed to achieve consensus, and the more physically apart these cores will be from each other due to physical placement and heat dissipation.
This means that the time it takes for a CPU with 64 cores to complete a fully synchronized instruction like a XADD is larger than 64/4 times it takes to complete the same kind of operation on a CPU with just 4 cores. The reason being that there are more messages to pass, and over longer distances, as shown in the schematic below:

CPU with 4 cores. Core 4 is reachable from core 1 with 2 links:

CPU with 64 cores. Core 64 is reachable from core 1 in 14 links:

It is important to understand that this limitation is implicitly a physical one, that is bounded by the laws of physics. Even if future CPU architects figure out a cheap and efficient way to stack up cores and place them in some kind of three-dimensional structure in a CPU, the distance between them times the number of cores will still grow larger than linearly, and no matter how smart they are, they won't be able to figure out how to increase the speed of the communication channels between the cores to something above the speed of light... it's simple Physics!
Even if there was a way to make the duration of the instruction to grow only linearly with the number of cores, it would still become a scalability bottleneck.

The takeaway from this is that, there are scalable WFPO algorithms and non-scalable WFPO algorithms. In order to complete, the Scalable WFPO algorithms/methods do not need to modify a shared variable, or do so with a low probability (like for example, doing XADD on one of a large number of variables, selected randomly).
The Non-Scalable WFPO algorithms/methods require the usage of heavy synchronization instructions, where multiple threads modify the same variable (i.e. lots of contention on one or a few variables).

At first thought, it could seem as if only synchronized modifying operations influence scalability, and read-only operations would always scale well, but we already saw before that even if you're only reading a variable and a single other thread is modifying it, it can kill scalability, as shown in this post http://concurrencyfreaks.com/2013/10/immutable-data-structures-are-not-as.html
Having a single thread modifying a variable (with a release barrier) that is read by multiple threads (with an acquire barrier) can imply a scalability bottleneck due to the cache misses on the cache line of that particular variable.
The reasons behind this effect, at their root, are due to the same physics we mentioned above. When a core modifies a variable, the modification of that cache line must be transmitted to all other cores. It's not as bad to have a single thread modifying a variable and all the other ones only reading, as having multiple thread modifying the same variable, but even so, all threads must obey the laws of physics.

WFPO is therefore, not enough to distinguish between the truly scalable methods and the ones that are not scalable.
To conclude, let's make up some definitions (emphasis on the make up):
- Scalable Wait-Free Population Oblivious: A method whose number of operations is bounded and does not depend on the number of threads (or cores) and as none or very little contention on all the shared variables that the method accesses;
- Non-Scalable Wait-Free Population Oblivious: A method whose number of operations is bounded and does not depend on the number of threads (or cores) and has some or high contention on one or several of the shared variables that the method accesses;

Notice that these are just empirical rules of thumb, and hopefully one day, someone will be able to formally prove (using Information Theory or Queueing theory) that there are in fact two different kinds of WFPO, and what are their precise properties.

No comments:

Post a Comment