“Don't let the past steal your present.”

— Cherralea Morgen

This article is inspired by many other wonderful blogs, thanks to the authors for sharing their knowledge.🥰

And of course the official cppreference documentation.

Some of the content is co-authored with GitHub Copilot.🤖

Prologue

Recently, I’ve been working on a project with high concurrency requirements, where threading is heavily used. Besides the traditional mutex and lock, I found some lock-free solutions utilizing atomic operations. Although I’ve known about them for a long time, I never had a chance to actually use them in a project. So this is the moment to take the first step into the atomic world of C++.🗺️


Introduction

Mutex VS Atomic

Well, before we dive into the atomic world, let’s have a quick comparison between mutex and atomic.

First of all, both of them solve the concurrency problem: ensuring data consistency when multiple threads access shared data. However, they achieve this goal through different mechanisms.

A mutex is used along with a lock, which relies on the operating system to provide the synchronization mechanism. In this way, the OS ensures the lock🔒 and unlock🔓 operations are atomic. Therefore, it is a heavyweight solution.

On the other hand, atomic operations rely on the memory model of the hardware. They ensure the data is accessed the way you want by carefully ordering memory instructions. Therefore, they are lock-free, and more performant than a mutex.
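To make the contrast concrete, here is a minimal sketch of the same counter protected by a mutex versus implemented with std::atomic. (Names like increment_with_mutex are mine, just for illustration.)

#include <atomic>
#include <mutex>

// Mutex version: the OS may block and wake threads around the lock.
std::mutex m;
int counter1 = 0;

void increment_with_mutex()
{
    std::lock_guard<std::mutex> guard(m);
    ++counter1;
}

// Atomic version: typically a single lock-free read-modify-write
// instruction on the hardware, with no OS involvement.
std::atomic<int> counter2{0};

void increment_with_atomic()
{
    counter2.fetch_add(1);
}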

Instruction Order

In case you are not familiar with this concept, I’ll give a brief introduction here.

We write programs in high-level languages like C++, but eventually they are compiled into machine code: a series of instructions executed by the CPU. Even the simplest statement like a++ is translated into multiple instructions.

To optimize performance, the CPU is able to execute multiple instructions at the same time, and reordering instructions can better utilize this feature. For example, in the following code, the CPU can execute a = 1 and b = 2 at the same time, or even execute d = 3 ahead of c = a + b. But it won’t execute c = a + b before the first two statements, due to the data dependency.

a = 1;
b = 2;
c = a + b;
d = 3;

So, how does the CPU know in what order instructions can be executed? First, it follows the original order in the code, as generated by the compiler; the compiler may already have adjusted the order of instructions with its optimization strategies. Second, the CPU uses its memory model to adjust the order of instructions again. That is to say, both the compiler and the CPU can reorder the instructions.

For local data, like the simple example above, it is easy to tell the dependencies between instructions. But things change for shared data. Given the example below, if we want the assertion in bar to always hold, we need to ensure a = 100 happens before assert(a == 100). But how can we guarantee this order?

int a = 0;

void foo() // thread 1
{
    a = 100;
}

void bar() // thread 2
{
    assert(a == 100); // always true?
}

Apparently, the compiler wouldn’t know the dependencies between these two threads. So, it relies on the runtime, but how?🤔

Thread Sanitizer

Not familiar with sanitizer? Check out Sanitize Your C/C++ Program With Sanitizers.😉

For the example above, we can write this simple test.

#include <cassert>
#include <thread>

int data = 0;

void foo()
{
    data = 100;
}

void bar()
{
    assert(data == 100);
}

int main()
{
    std::thread t1(foo);
    std::thread t2(bar);
    t1.join();
    t2.join();

    return 0;
}

I know, it is hard to reproduce concurrency problems. However, with the help of the thread sanitizer, we can catch this tricky boy.🕵️ According to the documentation, you may use -O1 to balance the performance degradation.

g++ -g -fsanitize=thread -pthread -o main main.cpp

Now, you can always see the warning from the thread sanitizer. It tells you which operations are causing the data race, and which threads they belong to.

WARNING: ThreadSanitizer: data race ...

If you are unlucky enough to see the linker report “missing libtsan_preinit.o”, check (TSAN) /usr/bin/ld can’t find libtsan_preinit.o for solutions.


Let’s Get Atomic!

The original example comes from Atomics and Concurrency in C++.

In this chapter, we will solve the problem above using atomic operations. It is considered harder to use atomic operations than a mutex. So, good luck!🫡

When you read the examples in this section, assume the worst case of instruction re-ordering.💀

Non-Atomic

To ensure assert executes after data = 100, we can add a flag indicating whether data has been updated. So we make a little modification to our example. Since the other parts remain the same, only the changed parts are given here. As you can see, we now wait until updated is set to true.

bool updated = false; // the new flag

void foo()
{
    data = 100;
    updated = true;
}

void bar()
{
    while (!updated)
    {
        std::this_thread::yield();
    }
    assert(data == 100);
}

You may ask, “why not while (data != 100)?”😅 Well, in real projects, directly checking data might be expensive or unavailable.

It seems fine. However, when you run it with TSan, the problem still exists. Why?😟 Because the execution order of data = 100 and updated = true is not guaranteed. To either the compiler or the CPU, these two statements are independent, so they can be reordered. Therefore, updated might be true before data is set, causing the assertion to fail.

Using std::atomic

Let’s see how atomic operations solve this problem. Luckily, we don’t need to write obscure assembly code: since C++11, we have std::atomic available. Now, let’s rewrite the example again. Still, only the necessary parts are given.

#include <atomic>

int data = 0;
// Wow, we have a new toy!
// (Brace initialization: copy-initializing a std::atomic is ill-formed before C++17.)
std::atomic<bool> updated{false};

void foo()
{
    data = 100;
    updated.store(true);
}

void bar()
{
    while (!updated.load())
    {
        std::this_thread::yield();
    }
    assert(data == 100);
}

Now, we run it again with TSan. Tada! The warning disappears!🎉

The most significant change is that the direct use of = is replaced by the load and store functions provided by std::atomic. So, what do these two functions do?🤔

🗝️Essentially, each of them represents an atomic operation. (duh, of course) As we already know, one statement may be translated into multiple instructions. The primary objective of std::atomic is to ensure that the instructions from one statement won’t be separated or reordered, and that no other instructions will be inserted between them. In one word: atomic.

🗝️Beyond making the instructions of one statement atomic, it can also ensure the memory access order with respect to other statements. That is to say, it prevents the compiler and the CPU from moving statements that originally come after the atomic statement in the source code ahead of it, or the other way around.

The instructions mentioned here only refer to memory-related ones, like load and store. Arithmetic instructions depend on data already fetched from memory, so reordering them won’t affect correctness; only the reordering of memory-related instructions causes trouble. And when I say statement, I also mean the instructions translated from it.

So, are we getting it?😲 I think the first 🗝 is easy to understand: just ensure the atomicity of one statement. However, even if we ensure atomicity, the statement can still be reordered as a whole, which is, indeed, still atomic. The second 🗝 is a bit more tricky, so let’s break it down in the next chapter.


Memory Order

Memory order is the key to fully understanding what atomic operations are all about. I think a good analogy is that an atomic operation is like placing fences around the instructions.

The Fence

Memory order is like an agreement among you, the compiler, and the CPU. It explicitly defines the order of memory instructions so that they are not allowed to be reordered arbitrarily. To achieve this, it places “fences” around the atomic instructions.

Assume that we have the following memory access statements A, B, and C. Each of them is translated into two memory access instructions. All of them are independent from each other. Note that these three statements can exist in three different threads, which makes this case more complicated.

  statements:  A   B   C
instructions: a a b b c c

Without atomic operations, these instructions can be executed in any order. Since the reordering happens at the instruction level, and the statements may be in different threads, we may get the following result.

a c b c a b

What a mess!😨 Now, if we want B to be atomic, we can add fences like this.

  statements:  A  <[  B  ]> C
instructions: a a <[ b b ]> c c

First, [ ] means other instructions cannot be reordered into it. That is to say, there will only be [ b b ] and never [ b a b c ]. Then, < prevents instructions before it from being moved after it, while > prevents instructions after it from being moved before it. Therefore, with fences like this, there is only one possible execution order.

a a b b c c

There is one assumption here: the relative order of the instructions translated from one statement won’t change. That is to say, if a1 a2 comes from A, there won’t be a2 a1.

Is It Too Strict?

In our example above, the default load and store use this strategy, adding fences on both sides of the statement. This is the strictest memory order, called memory_order_seq_cst (sequential consistency). It ensures sequentially consistent memory access; that is to say, the memory access order is the same as the order in the source code.
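For illustration: omitting the memory order argument is the same as passing std::memory_order_seq_cst explicitly, so with the updated flag from our example, these pairs are equivalent.

// The default memory order is std::memory_order_seq_cst,
// so these two stores are equivalent:
updated.store(true);
updated.store(true, std::memory_order_seq_cst);

// And so are these two loads:
bool b1 = updated.load();
bool b2 = updated.load(std::memory_order_seq_cst);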

It is good to keep instructions in order, but it is not free. Too many restrictions will certainly impact performance. So can we relax the order a little bit? Thankfully, the standard library provides several memory orders for us to choose from.

Back to our foo()-bar() demo. In fact, we only require that updated is written after data in foo(), and that updated is read before data in bar(). Don’t be fooled by the while loop: the loop has nothing to do with data except the logical dependency in our minds, so they can also be reordered. For this, we can relax the memory order a little bit.

void foo()
{
    data = 100;
    updated.store(true, std::memory_order_release);
}

void bar()
{
    while (!updated.load(std::memory_order_acquire))
    {
        std::this_thread::yield();
    }
    assert(data == 100);
}

Now, we use std::memory_order_release in foo() and std::memory_order_acquire in bar(). As we expected, it also works. But, strange names, right?😵‍💫

First of all, all memory orders ensure atomicity. That is to say, they all add [ ] around the statement. But they add < and > with different strategies. As we talked about earlier, the default memory_order_seq_cst adds both < and >, so it is the strictest and most secure, but also the slowest. The other two, however, only add one each, making them a little bit faster.

  • std::memory_order_release only adds <, so it only ensures the instructions before it won’t be moved after it.
  • std::memory_order_acquire only adds >, so it only ensures the instructions after it won’t be moved before it.

Release and acquire may be confusing names; it may help to think of a lock: you acquire it before touching shared data and release it after you are done.

Note that such atomic operations also constrain other memory statements that don’t use std::atomic.
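By the way, the standard library also exposes the fence idea directly as std::atomic_thread_fence. As a sketch, not necessarily how you should write it, the release/acquire pair above could equivalently be expressed with explicit fences plus relaxed atomic accesses (std::memory_order_relaxed is introduced just below).

void foo()
{
    data = 100;
    // Release fence: the write above cannot move below the store that follows.
    std::atomic_thread_fence(std::memory_order_release);
    updated.store(true, std::memory_order_relaxed);
}

void bar()
{
    while (!updated.load(std::memory_order_relaxed))
    {
        std::this_thread::yield();
    }
    // Acquire fence: the read below cannot move above the load that precedes it.
    std::atomic_thread_fence(std::memory_order_acquire);
    assert(data == 100);
}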

Can We Be More Relaxed?

As a meticulous reader, you may have noticed that one combination of < and > is missing: using neither of them. Yes, there is such a memory order, called std::memory_order_relaxed. It only ensures the atomicity of one statement. If you just want to keep one statement together, it is the best choice.

So, can we use std::memory_order_relaxed in our example?🤔

void foo()
{
    data = 100;
    updated.store(true, std::memory_order_relaxed);
}

void bar()
{
    while (!updated.load(std::memory_order_relaxed))
    {
        std::this_thread::yield();
    }
    assert(data == 100);
}

If you already have a good understanding of atomic operations, you can tell that TSan will give you the warning again. That is because there is only [ ] and no < or >, so the instructions can still be reordered.

So here comes another question. If just being atomic with std::memory_order_relaxed is not enough, why do we still have this option? The answer is simple: not all accesses to a std::atomic variable require strict ordering, and those are the cases where std::memory_order_relaxed is preferred.
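A classic case is a simple event counter: many threads bump it, and only the final total matters after all threads have joined. A minimal sketch, with names of my own choosing:

#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<int> hits{0};

void worker()
{
    for (int i = 0; i < 1000; ++i)
    {
        // Only the increment itself must be atomic; we don't care
        // how it is ordered against other memory accesses.
        hits.fetch_add(1, std::memory_order_relaxed);
    }
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
    {
        threads.emplace_back(worker);
    }
    for (auto& t : threads)
    {
        t.join();
    }
    // join() already synchronizes, so the total is reliable here.
    assert(hits.load(std::memory_order_relaxed) == 4000);
    return 0;
}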

Something More

In the sections above, we learnt how atomic operations work. Now, I’d like to talk about some missing pieces.

First, in our analogy, < and [ are not bound together, and neither are > and ]. Therefore, we can have the following execution order.

before reorder: a a < [ b b ] c c
 after reorder: a a < c c [ b b ]

Then, it leads to another point. Similar to the critical section of a lock, we can also create such sections by combining release and acquire (see the spinlock sketch after the diagram below). For example, say we have A B C D, where B uses release and C uses acquire. In this case, reordering can only happen between B and C, or among instructions outside this fenced area. And because of atomicity, the instructions of b and c won’t mix with each other.

before reorder: a a < [ b b ] [ c c ] > d d
 after reorder: a a < [ c c ] [ b b ] > d d
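Incidentally, this is exactly how a real lock fences its critical section, just with the pair the other way around: the lock operation uses acquire and the unlock uses release, so everything in between stays inside. A minimal spinlock sketch built on std::atomic_flag:

#include <atomic>

class SpinLock
{
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;

public:
    void lock()
    {
        // Acquire: nothing after lock() may be reordered before it.
        while (flag_.test_and_set(std::memory_order_acquire))
        {
            // spin until unlock() clears the flag
        }
    }

    void unlock()
    {
        // Release: nothing before unlock() may be reordered after it.
        flag_.clear(std::memory_order_release);
    }
};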

Epilogue

Hurrah! Now you’re welcomed into the atomic world of C++.🎊 With atomic operations, you can implement lock-free solutions to concurrency problems for better performance; for example, a lock-free queue for the producer-consumer problem.
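As a parting sketch, here is what such a queue might look like: a single-producer single-consumer ring buffer. (The class and member names are my own, and this is only a minimal illustration, not production-ready code.)

#include <atomic>
#include <cstddef>

template <typename T, std::size_t N>
class SpscQueue // single producer, single consumer
{
    T buffer_[N];
    std::atomic<std::size_t> head_{0}; // next slot to read, advanced by the consumer
    std::atomic<std::size_t> tail_{0}; // next slot to write, advanced by the producer

public:
    bool push(const T& value) // producer thread only
    {
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        std::size_t next = (tail + 1) % N;
        if (next == head_.load(std::memory_order_acquire))
        {
            return false; // full
        }
        buffer_[tail] = value;
        tail_.store(next, std::memory_order_release); // publish the element
        return true;
    }

    bool pop(T& value) // consumer thread only
    {
        std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire))
        {
            return false; // empty
        }
        value = buffer_[head];
        head_.store((head + 1) % N, std::memory_order_release); // free the slot
        return true;
    }
};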

But now, take a break and reward yourself with a cup of coffee.☕️ᓚᘏᗢ