Wednesday, August 30, 2017

Why is memory reclamation so important?

As it so happens, object lifetime tracking (or Memory Reclamation, as I prefer to call it) is currently the most difficult open problem in shared memory concurrency.
If you want to design or implement a lock-free or wait-free data structure, the hardest part isn't designing/implementing the data structure itself. The hardest task is to add memory reclamation to this data structure, without reducing the progress or throughput of the data structure you designed/implemented.

Huhhhh wait, are you saying that implementing a lock-free data structure is easy when compared to memory reclamation? Wwwhhaaaat?!?
No, I'm only saying that memory reclamation is even harder  :)
Let me try to explain why. There are four reasons behind it:


Reason number 1, progress

So let's say you designed/implemented/found-on-github a non-blocking data structure to serve your needs, but this data structure comes without memory reclamation. Unless you're using a language with a Garbage Collector, you now come face to face with the problem of adding memory reclamation to it.
Think of it this way: what is the purpose of going through all the trouble of doing lock-free/wait-free stuff if you then end up having to slap a blocking mechanism on top of it?
Obviously, it would be kind of fruitless to put a blocking memory reclamation on top of a non-blocking data structure because the end result would be blocking.
This implies that whatever memory reclamation you're going to use on your favorite data structure needs to have at least the same progress guarantees as the data structure itself. If not, then all the work that went into designing/implementing a lock-free data structure just goes down the drain, and you might as well go back to using a global lock, which has low overhead and provides safe memory reclamation (just grab the lock before calling delete/free on nodes of the data structure).
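To make that fallback concrete, here is a minimal sketch of the global-lock approach (the class and names are illustrative, not from any real library): every delete happens while holding the lock, so no other thread can be observing the node being freed.

#include <mutex>

// Illustrative sketch: a coarse global lock makes reclamation trivially
// safe, at the cost of giving up all non-blocking progress.
template<typename T>
class CoarseLockedStack {
    struct Node { T value; Node* next; };
    Node* head = nullptr;
    std::mutex lock;   // one global lock protects every operation
public:
    void push(T v) {
        std::lock_guard<std::mutex> g(lock);
        head = new Node{v, head};
    }
    bool pop(T& out) {
        std::lock_guard<std::mutex> g(lock);
        if (head == nullptr) return false;
        Node* n = head;
        head = n->next;
        out = n->value;
        delete n;   // safe: no other thread can hold a pointer to n
        return true;
    }
};

Safe reclamation for free, but of course, every operation now blocks.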

Btw, even if you have a GC, no known GC is lock-free (much less wait-free), so you don't really have non-blocking guarantees of progress. Just take a look at the previous post with Erez Petrank's presentation.



Reason number 2, throughput

When we use a non-blocking data structure, whether lock-free or wait-free, one of the goals is to achieve more performance. By performance I mean either more throughput, or lower tail latency, or both.
Typically, as developers, what we want from these data structures is higher throughput as the number of threads working on the data structure increases, and maybe we also care that the tail latency stays low. If we didn't care about one or both of these things, then we could just protect whatever single-threaded data structure we needed with one global lock and be done with it, but as everyone knows, this doesn't scale.

Progress is not enough: if the lock-free data structure you're using is fast, then you need a memory reclamation technique that is at least as fast, so that it doesn't become the bottleneck.
What is the purpose of having a super-duper fast lock-free data structure if you then slap on top of it a lock-free-but-horribly-slow memory reclamation technique which drags down performance? Well ok, you may still have the progress guarantee, which will give you low tail latency for wait-free data structures with wait-free memory reclamation, but really... if it's slow, will you use it?



Reason number 3, simplicity

When we apply a memory reclamation technique on top of a pre-existing non-blocking data structure, we want it to be as simple as possible. The simpler it is, the less likely we are to make mistakes. Unfortunately, most memory reclamation techniques are hard to deploy on pre-existing data structures.
I previously wrote an entire post explaining just how hard it can be for Hazard Pointers and how much easier it is for URCUs:
http://concurrencyfreaks.blogspot.com/2016/08/hazard-pointers-vs-rcu.html

The gist of it is that it requires expert knowledge about the data structure where the memory reclamation is being applied, so much so that most data structure designers/researchers don't bother to add memory reclamation to their data structures. They claim it's "an implementation detail", but I know better: it's because it's really hard. Don't take it the wrong way, it's not that they aren't smart enough, it's just that this kind of work is not valued in academia, therefore there is no incentive for them to waste precious space in their articles explaining how to make a data structure that is actually deployable in a production setting. Not to mention spending brain power thinking about it (it's true, it always gives me a headache).
And if you think things like Hazard Pointers are tricky to use, well then, techniques like Drop The Anchor can be even more complex.


Reason number 4, resilience

Another reason to go with lock-free/wait-free data structures is that they are resilient to failures. On a shared memory system with multiple processes accessing the same data structure, even if one of the processes dies, the others will be able to make progress in their work. This is the true gem of lock-free data structures: progress in the presence of failure.
Blocking data structures (typically) do not have this property (there are exceptions though).
If we add a blocking memory reclamation (like URCU) to a lock-free/wait-free data structure, we lose this resilience, because one dead process will prevent further memory reclamation and eventually bring down the whole system.
There goes the resilience advantage out the window.



The three families of memory reclamation

Manual memory reclamation comes in one of three flavors: atomic reference counting, quiescent-based, and pointer-based.
All memory reclamation techniques belong to (at least) one of these families. Many memory reclamation algorithms exist, but we'll focus only on the ones that can be implemented using the C++ memory model and atomics.

Atomic Reference Counting is just what the name says: it's a "reference counting" technique but with atomics. The details are a bit more tricky than your usual reference counting (aka smart pointers), but it's still graspable for most. They can be implemented in a wait-free way (on x86), but they have two main drawbacks: they're slow and they aren't universal.
They're slow because whenever we have to traverse a list of nodes we need to atomically increment a counter in one node and decrement a counter in another... even though we're just reading.
They're not universal for several reasons, most of them described in section 9.1 of Paul McKenney's excellent book
https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-e1.pdf
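
To see where the slowness comes from, here is a rough sketch of the per-hop work during a traversal (the names are illustrative, and it's a simplification: a real implementation must also handle the race where a node is retired between the load of next and the increment, which is precisely one of the tricky details):

#include <atomic>

// Illustrative node for an atomically reference-counted linked list.
struct Node {
    int value;
    std::atomic<Node*> next;
    std::atomic<long> refcount;   // one reference is held by the list itself
};

// Drop a reference; the last one to let go frees the node.
void release(Node* n) {
    if (n->refcount.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete n;
}

// Stepping from one node to the next costs two atomic read-modify-write
// operations, even though this is a pure read.
Node* acquire_next(Node* curr) {
    Node* nxt = curr->next.load(std::memory_order_acquire);
    if (nxt != nullptr)
        nxt->refcount.fetch_add(1, std::memory_order_acq_rel);
    release(curr);
    return nxt;
}

Two atomic RMWs per hop, by every reader, on every traversal: that's the price.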

Quiescent-based is where Epoch-Based-Reclamation (EBR) and all Userspace RCUs belong. The fastest memory reclamation is here (for readers). These methods tend to have very little synchronization, are easy to deploy in nearly all algorithms (i.e. data structures) and they're easy to understand.
When everything goes well, these are the fastest way of doing memory reclamation.
The downside is that they're blocking for memory reclamation. This isn't a simple limitation of the current algorithms that someone will one day be able to crack. Nope, it's something intrinsic to this approach. Reclamation with EBR or URCU is, and always will be, inherently blocking.
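
To give a taste of the reader/reclaimer split, here is a sketch of the usual pattern against liburcu's classic API (an assumption on my part: that you use the default flavor and link with -lurcu; other flavors look roughly the same):

#include <urcu.h>      // liburcu, classic flavor
#include <stdlib.h>

struct Node { int value; struct Node* next; };
struct Node* head;     // shared; writers publish with rcu_assign_pointer()

void reader(void) {    // each reader thread calls rcu_register_thread() once
    rcu_read_lock();   // very cheap: no stores to shared memory
    for (struct Node* n = rcu_dereference(head); n != NULL;
         n = rcu_dereference(n->next)) {
        // n can be read safely here
    }
    rcu_read_unlock();
}

void unlink_and_free(struct Node* prev, struct Node* n) {
    rcu_assign_pointer(prev->next, n->next);  // unlink first
    synchronize_rcu(); // BLOCKS until all pre-existing readers are done...
    free(n);           // ...which is why reclamation is inherently blocking
}

The readers fly, but if one reader thread stalls (or dies) inside its read-side critical section, synchronize_rcu() never returns and no memory is ever reclaimed again.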

Pointer-based is where Hazard Pointers, Pass The Buck, Drop The Anchor and Hazard Eras belong. Pointer-based techniques, well, they're not too hard to implement, but they can be extremely tricky to deploy. Their main problem is that they're slow (though faster than atomic reference counting).
They can be completely wait-free, for readers and reclaimers. In case you didn't notice, this makes them the only family that can be used generically while providing lock-free/wait-free progress.
Hazard Eras are the fastest of these, approaching (and in some scenarios surpassing) the best URCUs, but they do it at a cost in memory usage. It's the typical trade-off memory vs CPU.
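
To give a flavor of both the mechanism and why deployment is tricky, here is a simplified sketch of the classic Hazard Pointers "publish and re-check" step (names are illustrative; in a real implementation each thread has its own slot in a globally visible array that reclaimers scan, rather than the single global slot shown here):

#include <atomic>

struct Node { int value; std::atomic<Node*> next; };

std::atomic<Node*> head;   // shared entry point of the data structure
std::atomic<Node*> hp0;    // this thread's published hazard pointer
                           // (simplified to one global slot)

Node* protect_head() {
    Node* n;
    do {
        n = head.load(std::memory_order_acquire);
        hp0.store(n, std::memory_order_seq_cst);   // publish our intent
        // Re-check: if head changed in the meantime, the node we just
        // published may already have been retired, so try again.
    } while (n != head.load(std::memory_order_acquire));
    return n;   // safe to dereference until hp0 is cleared
}

// Reclaimer side (not shown): a retired node may only be delete'd after
// scanning every thread's hazard pointer slot and finding no match.

The re-check is the tricky part: validating that the published pointer is still reachable depends on the invariants of the specific data structure, which is exactly why deploying Hazard Pointers requires expert knowledge of the structure being protected. (This simple retry loop is lock-free; the wait-free variants take more work.)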

You may notice that most (ok, almost all) academic papers on non-blocking data structures don't mention memory reclamation anywhere, or just briefly mention it in one sentence. This is because they would need many pages to detail all the intricacies of adapting such an algorithm to have safe memory reclamation. Their typical approach is to say "That's not my problem, I'm just designing the algorithm. Memory Reclamation is an implementation detail."
... an "implementation detail", yeah, right.

Going back to our four reasons why memory reclamation is a hard problem (progress, throughput, simplicity, resilience), if we look at the three families and compare some of the techniques in each family, we can see their advantages/disadvantages.


In the end, there is no one-size-fits-all solution, and it's highly unlikely that there ever will be. As always in engineering (software or not), some solutions are better in some scenarios and others in different scenarios. Though I have to say that Hazard Eras works well for scenarios with traversal of linked lists.


Wrapping up

The point I'm trying to make is the following: the fact that the memory reclamation technique needs to have the same progress (or better) as the data structure where you're using it means that for fully lock-free or wait-free data structures the only true solution is to use pointer-based memory reclamation. Moreover, if we want the data structure to be fast, the memory reclamation needs to be just as fast (on top of having the same progress).

Let me be super clear on this. As of this writing, and likely for all eternity, the biggest bottleneck when deploying a lock-free or wait-free data structure is the memory reclamation technique.
If you think about it, a memory reclamation technique that is wait-free must itself be composed of one or several data structures, which themselves need wait-free progress, so we're stuck in a catch-22: we want to make wait-free data structures faster, but we need (other) wait-free data structures to protect those wait-free data structures.

To end on a positive note: yes, it's sad that we likely can't make memory reclamation any faster than URCU or Hazard Eras, but we can certainly make it easier to use, and that's one of the things Andreia and I are working on. More on this in a future post  ;)

Wednesday, August 16, 2017

Lock-Free Memory Reclamation by Erez Petrank

I've thought about doing some talks about Lock-Free Memory Reclamation, but Erez Petrank beat me to it. Even though I know a lot about the subject, I couldn't have done a better job myself (really, I couldn't), and lucky for you, his talks are online ;)
There are actually two talks. One is about Manual Lock-Free Memory Reclamation:
https://www.youtube.com/watch?v=aedEe0Zx_g0
https://www.youtube.com/watch?v=BCXrG1M65HU


The other is about Automatic Memory Reclamation, sometimes called Garbage Collection:
https://www.youtube.com/watch?v=u9pbPpZXu18
https://www.youtube.com/watch?v=2nc7IqfshgQ

On this second one I learned a _lot_: Andreia and I are working on manual reclamation stuff, so we know all the details he mentioned in the first one, but on GC stuff I'm a bit clueless, except for the basics. Both talks are really insightful and funny.
If you're looking for accurate and deep knowledge of memory reclamation, there is nothing better out there than these talks... except maybe a long chat with Erez himself  ;)


Here's my favorite quote by Erez Petrank:
(...) There are several papers which will tell you that there are Lock-Free Garbage Collectors. Do Not trust them! Some of them don't understand what Lock-Freedom is, some of them don't understand what Garbage Collection is, and some of them understand both but do not really look at the details.
GC progress is the big thing that no one looks at. They say: "Ok sure, I don't worry about the garbage collection, I only worry about the program and that is lock-free.", which is not ok, because if the garbage collection does not complete its operation, the program can not complete its operation because it can not allocate. (...)



Thursday, August 10, 2017

What do blockchains and wait-free universal constructs have in common?

SPAA 2017 and PODC 2017 are over but what I heard there will stay with me for a long time  :)
I learned a lot, got to present a brief announcement on Hazard Eras, met in person many interesting researchers in the field, had some awesome discussions about lock-free memory reclamation, and much, much more.
Sure, most of the talks are very theoretical, not the kind of thing I would recommend to a software engineer, but just getting to pick the brains of all these men and women is worth the admission price.

For those of you who were not there, Maurice Herlihy gave an amazing keynote about Blockchains and how they are related to Universal Constructs and wait-free progress.
I say "amazing" because Andreia and I have been working hard for the past year on a practical universal construct, so it's cool to see that there are some similarities of our research with Blockchains and Proof of Work. I even got to chat with Maurice a couple of times to get some of his insights on wait-free universal constructs versus transactional memory.
He was nice enough to put his slides online, so click the link for a small taste of what it was like to be there https://t.co/xy3K4YX9fC