Concurrent programming has its own special issues on top of the normal difficulties of writing software. One of the problems we may find when writing concurrent software is the so-called Race Condition.
Simply explained, a Race Condition happens when the correct execution of a program depends on the sequence or timing of the different execution flows (processes or threads) that compose that program. Correct execution in this context means that the program does what it is supposed to do and provides the correct result every time it is executed.
Race conditions usually happen when different processes or threads access a shared resource at the same time, which may cause that resource to end up in an inconsistent state.
Race conditions are tricky because they may be difficult to reproduce. Usually, during development, the application works fine 99% of the time. Every now and then it fails but, as we are still developing, we may attribute those glitches to the code we are working on.
However, when we move the program to the real environment (production, operations or whatever word you use in your domain) those glitches will persist and, under certain circumstances (usually a heavy load on the machine that has an impact on the thread scheduling), they will become more frequent.
These bugs are usually known as Heisenbugs, because they seem to disappear during the debugging phase. Usually, during debugging, execution goes slower, the environment is in a controlled and clean state, and memory gets initialised with safe values due to the multiple executions of the program (that last one is less common, but it happens in some circumstances).
Why Do Race Conditions Happen?
As we have already mentioned, Race Conditions usually happen when different processes/threads try to access a shared resource simultaneously. In general, and we will come back to this in detail later, any time you need to access a shared resource (usually memory) in a concurrent application, such access has to be protected using a Critical Section, a Mutex or some other mechanism. We will go through all of them along this series, so do not worry.
Race conditions may happen in different situations (for instance, changing permissions on a file or raising privileges of a process) but, in the realm of concurrent programming, the classical issue happens when accessing memory.
I will include the typical example here for completeness, but you can find it all over the Net and in any book on this topic.
Let's imagine that our program has two threads that need to access a value in memory at the same time. Imagine it is a bank account, and each thread is a different money transfer that will have to update the bank balance (our shared value). The account has 100 EUR, and Thread1 is trying to transfer 50 EUR while Thread2 is trying to transfer 25 EUR.
The normal execution of the program is expected to be one of these two sequences (the R/W column indicates whether the thread is reading from or writing to the shared resource):
If Thread1 runs first:

| Thread1 | Thread2 | R/W | Account |
|---|---|---|---|
| | | | 100 |
| 100 | | R | 100 |
| 100+50 | | | 100 |
| 150 | | W | 150 |
| | 150 | R | 150 |
| | 150+25 | | 150 |
| | 175 | W | 175 |

If Thread2 runs first:

| Thread1 | Thread2 | R/W | Account |
|---|---|---|---|
| | | | 100 |
| | 100 | R | 100 |
| | 100+25 | | 100 |
| | 125 | W | 125 |
| 125 | | R | 125 |
| 125+50 | | | 125 |
| 175 | | W | 175 |
In both cases we get the expected value, 175 EUR. However, without taking precautions, what could have happened is:
| Thread1 | Thread2 | R/W | Account |
|---|---|---|---|
| | | | 100 |
| 100 | | R | 100 |
| | 100 | R | 100 |
| 100+50 | 100+25 | | 100 |
| 150 | | W | 150 |
| | 125 | W | 125 |
As you can see, when the read and write operations get interleaved, bad things happen. We may lose 25 EUR or, in the worst case, 50 EUR... In any case, beyond the financial disaster, what is worse is that our program is not working properly.
A test program
So, to illustrate this and introduce the solution, let's write a simple test program to see a race condition first-hand. We will get the first lane right at the pole position :).
This is our test program:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

#define MAX_ITER 10000

int shared_var = 0;            /* the shared resource */

typedef struct tpar {
  int id;                      /* thread identifier */
  int cnt;                     /* number of increments done by this thread */
} TPAR;

void * task (void *_p)
{
  int v;
  TPAR *p = (TPAR*)_p;

  while (1) {
    printf ("%02d: %d\n", p->id, shared_var);
    v = shared_var;            /* read the shared value... */
    shared_var = v + 1;        /* ...and write it back incremented */
    p->cnt++;
  }
}

int main ()
{
  int i;
  void *r;
  TPAR tp1, tp2;
  pthread_t tid1, tid2;

  tp1.id = 1;
  tp1.cnt = 0;
  if (pthread_create (&tid1, NULL, task, &tp1) != 0) exit (EXIT_FAILURE);

  tp2.id = 2;
  tp2.cnt = 0;
  if (pthread_create (&tid2, NULL, task, &tp2) != 0) exit (EXIT_FAILURE);

  /* Let the threads run for a while */
  i = 0;
  while (1) {
    i++;
    usleep (100);
    if (i >= MAX_ITER) break;
  }

  pthread_cancel (tid1);
  pthread_cancel (tid2);
  pthread_join (tid1, &r);
  pthread_join (tid2, &r);

  printf ("Thread %d run %d times\n", tp1.id, tp1.cnt);
  printf ("Thread %d run %d times\n", tp2.id, tp2.cnt);
  printf ("Shared Value : %d (total threads : %d) \n", shared_var, tp1.cnt + tp2.cnt);

  if (shared_var != tp1.cnt + tp2.cnt)
    printf ("\n*** FAIL ***\n");
  else
    printf ("\n*** SUCCESS ***\n");

  return 0;
}
This program just starts two threads that increase a shared variable. Each thread counts the number of times it gets executed so, at the end, the sum of the numbers of executions of both threads shall match the value of the shared variable. We let the program run for MAX_ITER (10000) iterations of the main loop; otherwise it is harder to catch the race condition.
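To build the test program we just need to link against the pthread library (assuming we saved the source as race.c; the file name is our choice here):
gcc -o race race.c -pthread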
If we compile and run the program many times, you will see how sometimes the program succeeds and some other times it just fails. The failure rate will increase with the load on the system. You can increase the load on the system just by opening a new terminal and launching the command:
yes > /dev/null
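With the system loaded, a quick way to watch the failures is to run the binary repeatedly and keep only the result line (again, race is just the name we gave the binary above):
for i in $(seq 1 20); do ./race | tail -1; done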
We already know what the problem is: it is exactly the same thing that happens in our bank account example. So, in order to fix it, we need to make sure that, once the shared value is read, nobody else writes to it until we are done with whatever we have to do.
Mutex
The traditional way to solve this issue is using a Mutex. Mutex is the contraction of Mutual Exclusion, and it is an element that allows us to do precisely that: make sure that, whenever one of our threads accesses the shared resource, all other threads are excluded and have to wait to access the resource.
So, a thread needs to acquire the Mutex before using the shared resource. If the resource is available, the thread gets it and locks it, so any other thread that tries to acquire it will have to wait (will sleep). From that point on, the thread can do whatever it wants to the resource, as it knows for sure that the resource cannot be modified by any other thread. Whenever it is done, the Mutex is released and any other thread can acquire it again. In case some thread was waiting for the resource, it will be woken up so it can continue its execution.
In general, the part of the program between the acquisition of the Mutex and its release is known as a Critical Section... for obvious reasons. Some systems/languages provide constructs with exactly this name. For instance, Java uses the concept of synchronized methods or blocks, which are just critical sections defined using the language's block syntax, instead of creating a Mutex and writing code to acquire and release it.
So, how would our previous program look using a Mutex?
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

#define MAX_ITER 10000

int shared_var = 0;
pthread_mutex_t shared_var_mutex = PTHREAD_MUTEX_INITIALIZER;

typedef struct tpar {
  int id;
  int cnt;
} TPAR;

void * task (void *_p)
{
  int v;
  TPAR *p = (TPAR*)_p;

  while (1) {
    printf ("%02d: %d\n", p->id, shared_var);

    /* Critical section: only one thread at a time can run this code */
    pthread_mutex_lock (&shared_var_mutex);
    v = shared_var;
    shared_var = v + 1;
    pthread_mutex_unlock (&shared_var_mutex);

    p->cnt++;
  }
}

int main ()
{
  int i;
  void *r;
  TPAR tp1, tp2;
  pthread_t tid1, tid2;

  tp1.id = 1;
  tp1.cnt = 0;
  if (pthread_create (&tid1, NULL, task, &tp1) != 0) exit (EXIT_FAILURE);

  tp2.id = 2;
  tp2.cnt = 0;
  if (pthread_create (&tid2, NULL, task, &tp2) != 0) exit (EXIT_FAILURE);

  i = 0;
  while (1) {
    i++;
    usleep (100);
    if (i >= MAX_ITER) break;
  }

  pthread_cancel (tid1);
  pthread_cancel (tid2);
  pthread_join (tid1, &r);
  pthread_join (tid2, &r);

  printf ("Thread %d run %d times\n", tp1.id, tp1.cnt);
  printf ("Thread %d run %d times\n", tp2.id, tp2.cnt);
  printf ("Shared Value : %d (total threads : %d) \n", shared_var, tp1.cnt + tp2.cnt);

  if (shared_var != tp1.cnt + tp2.cnt)
    printf ("\n*** FAIL ***\n");
  else
    printf ("\n*** SUCCESS ***\n");

  return 0;
}
As you can see, the program is exactly the same, but the modification of the variable is performed in a critical section, between the acquisition and the release of the mutex. In this program this is achieved using the functions pthread_mutex_lock and pthread_mutex_unlock. Other libraries or languages may use different names, but the semantics don't change.
Now you can compile and run this program as many times as you wish, under a very heavy load, and it will always succeed.
Mutexes are shared resources themselves
You may already be thinking this, and it is actually correct. A Mutex is an entity (let's be generic) that is accessed by different threads in order to get mutually exclusive access to a critical section... But how is mutual exclusion ensured when accessing the mutex itself?
Well, this is actually a problem. An old one that was solved long ago. The solution is implemented in the processor itself. Each processor (well, actually most of them) provides special machine instructions to implement the so-called Test-and-Set operations. Right: in order to implement a Mutex (actually, in order to be able to implement semaphores), we need a way to read a memory location, test it and set it, without being interrupted.
For very basic systems this can be achieved just by disabling interrupts, so the code will be executed without any interruption. But other processors provide specific instructions for this purpose: LL/SC for MIPS, the TAS instruction on the old Motorola 68K, or the LOCK prefix and the CMPXCHG instruction on Intel platforms.
Traditionally, each system provided different functions to allow programmers to atomically update variables. Since its 2011 revision, the C standard supports a new _Atomic keyword and associated functions to deal with it.
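Just to see how a lock can be built on top of an atomic test-and-set, here is a minimal spinlock sketch using C11's atomic_flag (the one atomic type guaranteed to be lock-free everywhere); the spin_lock/spin_unlock names are ours, not a standard API:
#include <stdatomic.h>

/* Initialise with: spinlock_t l = { ATOMIC_FLAG_INIT }; */
typedef struct { atomic_flag locked; } spinlock_t;

void spin_lock (spinlock_t *l)
{
  /* Atomically set the flag and get its previous value: if it was
     already set, another thread holds the lock, so keep trying */
  while (atomic_flag_test_and_set (&l->locked))
    ;  /* busy-wait */
}

void spin_unlock (spinlock_t *l)
{
  atomic_flag_clear (&l->locked);
}
A real mutex additionally puts waiting threads to sleep instead of burning CPU while spinning; that is precisely the part we will cover in the next instalment.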
Let's take as an example a very simple program using both options. That is, the classic GNU built-in functions for atomic memory access and the new C11 atomic operations library.
#include <stdio.h>
#include <stdatomic.h>

int main () {
  int my_var = 0;
  int expected = 1;
  int _Atomic my_at_var = 0;

  /* GNU built-in: if my_var == 0, set it to 1 */
  __sync_bool_compare_and_swap (&my_var, 0, 1);
  /* C11: if my_at_var == expected, set it to 0 */
  atomic_compare_exchange_strong (&my_at_var, &expected, 0);

  return 0;
}
To compile this program, use the following command line:
gcc -std=c11 -o atomic atomic.c
Now, if we take a look at the generated code:
$ objdump -d atomic | grep -A30 "<main>:"
000000000000066a <main>:
 66a:   55                      push   %rbp
 66b:   48 89 e5                mov    %rsp,%rbp
 66e:   48 83 ec 20             sub    $0x20,%rsp
 672:   64 48 8b 04 25 28 00    mov    %fs:0x28,%rax
 679:   00 00
 67b:   48 89 45 f8             mov    %rax,-0x8(%rbp)
 67f:   31 c0                   xor    %eax,%eax
 681:   c7 45 e0 00 00 00 00    movl   $0x0,-0x20(%rbp)
 688:   c7 45 e4 01 00 00 00    movl   $0x1,-0x1c(%rbp)
 68f:   c7 45 e8 00 00 00 00    movl   $0x0,-0x18(%rbp)
 696:   b8 00 00 00 00          mov    $0x0,%eax
 69b:   ba 01 00 00 00          mov    $0x1,%edx
 6a0:   f0 0f b1 55 e0          lock cmpxchg %edx,-0x20(%rbp)
 6a5:   48 8d 45 e8             lea    -0x18(%rbp),%rax
 6a9:   48 89 45 f0             mov    %rax,-0x10(%rbp)
 6ad:   c7 45 ec 00 00 00 00    movl   $0x0,-0x14(%rbp)
 6b4:   8b 45 ec                mov    -0x14(%rbp),%eax
 6b7:   89 c6                   mov    %eax,%esi
 6b9:   48 8b 4d f0             mov    -0x10(%rbp),%rcx
 6bd:   48 8d 55 e4             lea    -0x1c(%rbp),%rdx
 6c1:   8b 02                   mov    (%rdx),%eax
 6c3:   f0 0f b1 31             lock cmpxchg %esi,(%rcx)
 6c7:   0f 94 c1                sete   %cl
 6ca:   84 c9                   test   %cl,%cl
 6cc:   75 02                   jne    6d0 <main+0x66>
 6ce:   89 02                   mov    %eax,(%rdx)
 6d0:   b8 00 00 00 00          mov    $0x0,%eax
 6d5:   48 8b 7d f8             mov    -0x8(%rbp),%rdi
 6d9:   64 48 33 3c 25 28 00    xor    %fs:0x28,%rdi
There is no point in going into the details of this code, as we already know what it does but, please, notice the two lock cmpxchg instructions: one for the built-in function and another one for the _Atomic variable.
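By the way, a compare-and-swap like this is already enough to fix our earlier counter race without a mutex at all. Here is a minimal sketch (the increment function is our own illustration, not part of the original test program):
#include <stdatomic.h>

_Atomic int shared_var = 0;

void increment (void)
{
  int v = atomic_load (&shared_var);
  /* If another thread changed shared_var between our read and our
     write, the exchange fails, v is refreshed with the current
     value, and we simply retry */
  while (!atomic_compare_exchange_weak (&shared_var, &v, v + 1))
    ;
}
For a plain counter, atomic_fetch_add (&shared_var, 1) does the same job in a single call; the retry loop is the general pattern for arbitrary read-modify-write updates.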
Test and Set in other architectures
Now that we have this simple program to explore the atomic modification of variables, we can quickly take a look at what happens on other platforms.
Let's start with ARM:
$ arm-linux-gnueabi-gcc -std=c11 -o atomic.arm atomic.c
$ arm-linux-gnueabi-objdump -d atomic.arm | grep -A32 "<main>:"
00010468 <main>:
   10468:   e92d4830    push    {r4, r5, fp, lr}
   1046c:   e28db00c    add     fp, sp, #12
   10470:   e24dd018    sub     sp, sp, #24
   10474:   e59f30ac    ldr     r3, [pc, #172]  ; 10528 <main+0xc0>
   10478:   e5933000    ldr     r3, [r3]
   1047c:   e50b3010    str     r3, [fp, #-16]
   10480:   e3a03000    mov     r3, #0
   10484:   e50b3024    str     r3, [fp, #-36]  ; 0xffffffdc
   10488:   e3a03001    mov     r3, #1
   1048c:   e50b3020    str     r3, [fp, #-32]  ; 0xffffffe0
   10490:   e3a03000    mov     r3, #0
   10494:   e50b301c    str     r3, [fp, #-28]  ; 0xffffffe4
   10498:   e24b3024    sub     r3, fp, #36     ; 0x24
   1049c:   e3a02001    mov     r2, #1
   104a0:   e3a01000    mov     r1, #0
   104a4:   e1a00003    mov     r0, r3
   104a8:   eb000319    bl      11114 <__sync_val_compare_and_swap_4>
   104ac:   e1a03000    mov     r3, r0
   104b0:   e3530000    cmp     r3, #0
   104b4:   e24b301c    sub     r3, fp, #28
   104b8:   e50b3014    str     r3, [fp, #-20]  ; 0xffffffec
   104bc:   e3a03000    mov     r3, #0
   104c0:   e50b3018    str     r3, [fp, #-24]  ; 0xffffffe8
   104c4:   e51b3018    ldr     r3, [fp, #-24]  ; 0xffffffe8
   104c8:   e1a02003    mov     r2, r3
   104cc:   e51b3014    ldr     r3, [fp, #-20]  ; 0xffffffec
   104d0:   e24b4020    sub     r4, fp, #32
   104d4:   e5945000    ldr     r5, [r4]
   104d8:   e1a01005    mov     r1, r5
   104dc:   e1a00003    mov     r0, r3
   104e0:   eb00030b    bl      11114 <__sync_val_compare_and_swap_4>
   104e4:   e1a03000    mov     r3, r0
In this case we can identify two calls to a function named __sync_val_compare_and_swap which, pretty much, seems to do what we want. Looking into this function we find:
$ arm-linux-gnueabi-objdump -d atomic.arm | grep -A20 "<__sync_val_compare_and_swap_4>:"
00011114 <__sync_val_compare_and_swap_4>:
   11114:   e92d41f0    push    {r4, r5, r6, r7, r8, lr}
   11118:   e1a05000    mov     r5, r0
   1111c:   e1a04001    mov     r4, r1
   11120:   e1a07002    mov     r7, r2
   11124:   e59f6034    ldr     r6, [pc, #52]   ; 11160 <__sync_val_compare_and_swap_4+0x4c>
   11128:   e5953000    ldr     r3, [r5]
   1112c:   e1530004    cmp     r3, r4
   11130:   1a000007    bne     11154 <__sync_val_compare_and_swap_4+0x40>
   11134:   e1a02005    mov     r2, r5
   11138:   e1a01007    mov     r1, r7
   1113c:   e1a00004    mov     r0, r4
   11140:   e12fff36    blx     r6
   11144:   e3500000    cmp     r0, #0
   11148:   1afffff6    bne     11128 <__sync_val_compare_and_swap_4+0x14>
   1114c:   e1a00004    mov     r0, r4
   11150:   e8bd81f0    pop     {r4, r5, r6, r7, r8, pc}
   11154:   e1a04003    mov     r4, r3
   11158:   e1a00004    mov     r0, r4
   1115c:   e8bd81f0    pop     {r4, r5, r6, r7, r8, pc}
   11160:   ffff0fc0    .word   0xffff0fc0
So, there is a lot of code here that looks pretty normal, but there are two instructions in there that do all the magic:
   11124:   e59f6034    ldr     r6, [pc, #52]   ; 11160 <__sync_val_compare_and_swap_4+0x4c>
   (...)
   11140:   e12fff36    blx     r6
   (...)
   11160:   ffff0fc0    .word   0xffff0fc0
This is basically a call to a function located at 0xffff0fc0... and what will we find there? To find out, let's take a look at the kernel documentation, specifically at Kernel-Provided User Helpers. In this document we find the following (I reproduce it here for the reader's convenience):
kuser_cmpxchg
-------------

Location:       0xffff0fc0

Reference prototype:

  int __kuser_cmpxchg(int32_t oldval, int32_t newval, volatile int32_t *ptr);

Input:

        r0 = oldval
        r1 = newval
        r2 = ptr
        lr = return address

Output:

        r0 = success code (zero or non-zero)
        C flag = set if r0 == 0, clear if r0 != 0

Clobbered registers:

        r3, ip, flags

Definition:

        Atomically store newval in *ptr only if *ptr is equal to oldval.
        Return zero if *ptr was changed or non-zero if no exchange happened.
        The C flag is also set if *ptr was changed to allow for assembly
        optimization in the calling code.

(...)
So this is a software implementation of the CMPXCHG instruction we found on Intel platforms... and how is the LOCK prefix implemented? Right, it is not. That is why this is a kernel helper function: the kernel provides the routine and guarantees that it completes atomically, without interruption.
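Just to make the mechanism tangible, this is how user code can invoke the helper directly on 32-bit ARM Linux; the address and prototype come straight from the kernel document quoted above, while set_if_zero is just our own example name:
/* 32-bit ARM Linux only: call the kernel-provided cmpxchg helper */
typedef int (kuser_cmpxchg_t)(int oldval, int newval, volatile int *ptr);
#define kuser_cmpxchg (*(kuser_cmpxchg_t *)0xffff0fc0)

/* Atomically set *ptr to 1 if it is currently 0.
   Returns zero on success, as per the helper's contract */
int set_if_zero (volatile int *ptr)
{
  return kuser_cmpxchg (0, 1, ptr);
}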
We can now take a quick look at the code for MIPS:
$ mips-linux-gnu-gcc -std=c11 -o atomic.mips atomic.c
$ mips-linux-gnu-objdump -d atomic.mips | grep -A40 "<main>:"
004007b0 <main>:
  4007b0:   27bdffc8    addiu   sp,sp,-56
  4007b4:   afbf0034    sw      ra,52(sp)
  4007b8:   afbe0030    sw      s8,48(sp)
  4007bc:   03a0f025    move    s8,sp
  4007c0:   3c1c0042    lui     gp,0x42
  4007c4:   279c9010    addiu   gp,gp,-28656
  4007c8:   afbc0010    sw      gp,16(sp)
  4007cc:   8f82804c    lw      v0,-32692(gp)
  4007d0:   8c420000    lw      v0,0(v0)
  4007d4:   afc2002c    sw      v0,44(s8)
  4007d8:   afc00018    sw      zero,24(s8)
  4007dc:   24020001    li      v0,1
  4007e0:   afc2001c    sw      v0,28(s8)
  4007e4:   afc00020    sw      zero,32(s8)
  4007e8:   0000000f    sync
  4007ec:   c3c30018    ll      v1,24(s8)
  4007f0:   14600006    bnez    v1,40080c <main+0x5c>
  4007f4:   24020000    li      v0,0
  4007f8:   24010001    li      at,1
  4007fc:   e3c10018    sc      at,24(s8)
  400800:   1020fffa    beqz    at,4007ec <main+0x3c>
  400804:   24020001    li      v0,1
  400808:   0000000f    sync
  40080c:   27c20020    addiu   v0,s8,32
  400810:   afc20028    sw      v0,40(s8)
  400814:   afc00024    sw      zero,36(s8)
  400818:   8fc20024    lw      v0,36(s8)
  40081c:   00403825    move    a3,v0
  400820:   8fc30028    lw      v1,40(s8)
  400824:   27c2001c    addiu   v0,s8,28
  400828:   8c460000    lw      a2,0(v0)
  40082c:   0000000f    sync
  400830:   c0640000    ll      a0,0(v1)
  400834:   14860006    bne     a0,a2,400850 <main+0xa0>
  400838:   24050000    li      a1,0
  40083c:   00e00825    move    at,a3
  400840:   e0610000    sc      at,0(v1)
  400844:   1020fffa    beqz    at,400830 <main+0x80>
  400848:   24050001    li      a1,1
  40084c:   0000000f    sync
Here we can see the use of LL/SC pairs, instructions specifically provided to implement this test-and-set behaviour. The MIPS IV instruction set manual uses the term RMW (Read-Modify-Write) instead of TAS, but the idea is pretty much the same.
The PowerPC architecture provides a similar model using the instruction pair LWARX/STWCX. (the trailing dot is part of the second mnemonic). This is the code gcc generates for this platform:
$ powerpc-linux-gnu-gcc -std=c11 -o atomic.ppc atomic.c
$ powerpc-linux-gnu-objdump -d atomic.ppc | grep -A40 "<main>:"
100004cc <main>:
100004cc:   94 21 ff c0     stwu    r1,-64(r1)
100004d0:   7c 08 02 a6     mflr    r0
100004d4:   90 01 00 44     stw     r0,68(r1)
100004d8:   93 e1 00 3c     stw     r31,60(r1)
100004dc:   7c 3f 0b 78     mr      r31,r1
100004e0:   81 22 8f f8     lwz     r9,-28680(r2)
100004e4:   91 3f 00 2c     stw     r9,44(r31)
100004e8:   39 20 00 00     li      r9,0
100004ec:   39 20 00 00     li      r9,0
100004f0:   91 3f 00 18     stw     r9,24(r31)
100004f4:   39 20 00 01     li      r9,1
100004f8:   91 3f 00 1c     stw     r9,28(r31)
100004fc:   39 20 00 00     li      r9,0
10000500:   91 3f 00 20     stw     r9,32(r31)
10000504:   39 40 00 01     li      r10,1
10000508:   39 3f 00 18     addi    r9,r31,24
1000050c:   7c 00 04 ac     hwsync
10000510:   7d 00 48 28     lwarx   r8,0,r9
10000514:   2f 88 00 00     cmpwi   cr7,r8,0
10000518:   40 9e 00 10     bne     cr7,10000528 <main+0x5c>
1000051c:   7d 40 49 2d     stwcx.  r10,0,r9
10000520:   4f 80 00 00     mcrf    cr7,cr0
10000524:   40 9e ff ec     bne     cr7,10000510 <main+0x44>
10000528:   4c 00 01 2c     isync
1000052c:   39 3f 00 20     addi    r9,r31,32
10000530:   91 3f 00 28     stw     r9,40(r31)
10000534:   39 20 00 00     li      r9,0
10000538:   91 3f 00 24     stw     r9,36(r31)
1000053c:   81 3f 00 24     lwz     r9,36(r31)
10000540:   7d 26 4b 78     mr      r6,r9
10000544:   81 1f 00 28     lwz     r8,40(r31)
10000548:   39 3f 00 1c     addi    r9,r31,28
1000054c:   80 e9 00 00     lwz     r7,0(r9)
10000550:   7c 00 04 ac     hwsync
10000554:   7d 40 40 28     lwarx   r10,0,r8
10000558:   7f 8a 38 00     cmpw    cr7,r10,r7
1000055c:   40 9e 00 10     bne     cr7,1000056c <main+0xa0>
10000560:   7c c0 41 2d     stwcx.  r6,0,r8
10000564:   4f 80 00 00     mcrf    cr7,cr0
10000568:   40 9e ff ec     bne     cr7,10000554 <main+0x88>
For more details about these instructions, take a look at Save your code from meltdown using PowerPC atomic instruction.
Another still popular processor is SPARC:
$ sparc64-linux-gnu-gcc -std=c11 -o atomic.sparc atomic.c
$ sparc64-linux-gnu-objdump -d atomic.sparc | grep -A40 "<main>:"
00000000001006d0 <main>:
  1006d0:   9d e3 bf 30     save    %sp, -208, %sp
  1006d4:   c2 59 e0 28     ldx     [ %g7 + 0x28 ], %g1
  1006d8:   c2 77 a7 f7     stx     %g1, [ %fp + 0x7f7 ]
  1006dc:   82 10 20 00     clr     %g1
  1006e0:   c0 27 a7 df     clr     [ %fp + 0x7df ]
  1006e4:   82 10 20 01     mov     1, %g1
  1006e8:   c2 27 a7 e3     st      %g1, [ %fp + 0x7e3 ]
  1006ec:   c0 27 a7 e7     clr     [ %fp + 0x7e7 ]
  1006f0:   82 07 a7 df     add     %fp, 0x7df, %g1
  1006f4:   86 10 20 00     clr     %g3
  1006f8:   84 10 20 01     mov     1, %g2
  1006fc:   c5 e0 50 03     cas     [ %g1 ], %g3, %g2
  100700:   82 07 a7 e7     add     %fp, 0x7e7, %g1
  100704:   c2 77 a7 ef     stx     %g1, [ %fp + 0x7ef ]
  100708:   c0 27 a7 eb     clr     [ %fp + 0x7eb ]
  10070c:   c2 07 a7 eb     ld      [ %fp + 0x7eb ], %g1
  100710:   84 10 00 01     mov     %g1, %g2
  100714:   c8 5f a7 ef     ldx     [ %fp + 0x7ef ], %g4
  100718:   82 07 a7 e3     add     %fp, 0x7e3, %g1
  10071c:   c6 00 40 00     ld      [ %g1 ], %g3
  100720:   c5 e1 10 03     cas     [ %g4 ], %g3, %g2
  100724:   86 18 80 03     xor     %g2, %g3, %g3
  100728:   80 a0 00 03     cmp     %g0, %g3
  10072c:   86 60 3f ff     subc    %g0, -1, %g3
  100730:   80 a0 e0 00     cmp     %g3, 0
  100734:   12 48 00 03     bne     %icc, 100740 <main+0x70>
  100738:   01 00 00 00     nop
  10073c:   c4 20 40 00     st      %g2, [ %g1 ]
  100740:   82 10 20 00     clr     %g1
  100744:   83 38 60 00     sra     %g1, 0, %g1
  100748:   b0 10 00 01     mov     %g1, %i0
  10074c:   c2 5f a7 f7     ldx     [ %fp + 0x7f7 ], %g1
  100750:   c4 59 e0 28     ldx     [ %g7 + 0x28 ], %g2
  100754:   82 18 40 02     xor     %g1, %g2, %g1
  100758:   84 10 20 00     clr     %g2
  10075c:   02 c8 40 04     brz     %g1, 10076c <main+0x9c>
  100760:   01 00 00 00     nop
  100764:   40 04 06 87     call    202180 <__stack_chk_fail@plt>
  100768:   01 00 00 00     nop
  10076c:   81 cf e0 08     return  %i7 + 8
SPARC uses an approach similar to the one provided by the old MC68K or the Intel processors: the CAS instruction does the trick. Can you spot the two CAS instructions in the listing above?
Finally, let's take a look at the more recent, open-source RISC-V processor.
$ riscv64-linux-gnu-gcc -std=c11 -o atomic.riscv atomic.c
$ riscv64-linux-gnu-objdump -d atomic.riscv | grep -A40 "<main>:"
00000000000104dc <main>:
   104dc:   7179        addi    sp,sp,-48
   104de:   f406        sd      ra,40(sp)
   104e0:   f022        sd      s0,32(sp)
   104e2:   1800        addi    s0,sp,48
   104e4:   67c9        lui     a5,0x12
   104e6:   e187b783    ld      a5,-488(a5) # 11e18 <__stack_chk_guard@@GLIBC_2.27>
   104ea:   fef43423    sd      a5,-24(s0)
   104ee:   fc042823    sw      zero,-48(s0)
   104f2:   4785        li      a5,1
   104f4:   fcf42a23    sw      a5,-44(s0)
   104f8:   fc042c23    sw      zero,-40(s0)
   104fc:   4785        li      a5,1
   104fe:   fd040693    addi    a3,s0,-48
   10502:   0f50000f    fence   iorw,ow
   10506:   1406a72f    lr.w.aq a4,(a3)
   1050a:   00071563    bnez    a4,10514 <main+0x38>
   1050e:   1cf6a62f    sc.w.aq a2,a5,(a3)
   10512:   fa75        bnez    a2,10506 <main+0x2a>
   10514:   fd840793    addi    a5,s0,-40
   10518:   fef43023    sd      a5,-32(s0)
   1051c:   fc042e23    sw      zero,-36(s0)
   10520:   fdc42783    lw      a5,-36(s0)
   10524:   0007859b    sext.w  a1,a5
   10528:   fe043603    ld      a2,-32(s0)
   1052c:   fd440793    addi    a5,s0,-44
   10530:   4394        lw      a3,0(a5)
   10532:   0f50000f    fence   iorw,ow
   10536:   1406272f    lr.w.aq a4,(a2)
   1053a:   00d71563    bne     a4,a3,10544 <main+0x68>
   1053e:   1cb6252f    sc.w.aq a0,a1,(a2)
   10542:   f975        bnez    a0,10536 <main+0x5a>
   10544:   40d706bb    subw    a3,a4,a3
   10548:   2681        sext.w  a3,a3
   1054a:   0016b693    seqz    a3,a3
   1054e:   2681        sext.w  a3,a3
   10550:   e291        bnez    a3,10554 <main+0x78>
The RISC-V processor, as we have already seen for the other RISC processors (the assembly is very similar to MIPS and PowerPC), uses a pair of instructions, LR/SC, in combination with the FENCE instruction. The aq flag after the LR/SC instructions stands for acquire, and it indicates the kind of access intended; the other flag, rl, stands for release. The FENCE instruction is used to ensure the ordering of memory and I/O operations. We will come back to this later; first we need to learn a few other things.
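To make the retry loop easier to see than in the compiler output above, here is a hand-written sketch (not compiler output) of an atomic increment using the LR/SC pair; we assume a0 holds the address of the variable and t0/t1 are free scratch registers:
retry:
    lr.w   t0, (a0)       # load-reserved: read the value and take a reservation
    addi   t0, t0, 1      # modify the value locally
    sc.w   t1, t0, (a0)   # store-conditional: fails if the reservation was lost
    bnez   t1, retry      # non-zero result means another write sneaked in: retry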
Conclusion
In this paper we have introduced one of the most common problems in concurrent software: Race Conditions. We have illustrated it with a real example and also shown how to solve the problem using the standard POSIX Mutex. Then we explored how these Mutexes are implemented and dove deep into the processor capabilities that make them work, but this is just part of the trick needed to get a working Mutex. In the next instalment we will explore the other fundamental element of a proper mutex implementation.
Continue Reading... Concurrency. Mutex and Futex
Other posts in the series: Concurrency. Introduction