PROGRAMMING
Concurrency. Race Conditions
2021-02-04
By David "DeMO" Martínez Oliveira

Concurrent programming has its own special issues on top of the normal difficulties of writing software. One of the problems we may find when writing concurrent software is the so-called Race Condition.

Simply explained, a Race Condition happens when the correct execution of a program depends on the sequence or timing of the different execution flows that compose that program (processes or threads). Correct execution in this context means that the program does what it is supposed to do and provides the correct result every time it is executed.

Race conditions usually happen when the different processes or threads access a shared resource at the same time, which may cause that resource to end up in an inconsistent state.

Race conditions are tricky because they may be difficult to reproduce. Usually, during development, the application works fine 99% of the time. Every now and then it fails but, as we are still developing, we may dismiss those failures as glitches in the code we are working on.

However, when we move the program to the real environment (production, operations or whatever term you use in your domain) those glitches will persist and, under certain circumstances (usually a heavy load on the machine, which has an impact on thread scheduling), will become more frequent.

These bugs are usually known as Heisenbugs, because they seem to disappear during the debugging phase. During debugging, execution goes slower, the environment is in a controlled and clean state, and memory gets initialised with safe values due to the multiple executions of the program (that last one is less common, but it happens in some circumstances)...

Why Do Race Conditions Happen?

As we have already mentioned, Race Conditions usually happen when different processes/threads try to access a shared resource simultaneously. In general, and we will come back to this in detail later, any time you need to access a shared resource (usually memory) in a concurrent application, that access has to be protected using a Critical Region, a Mutex or some other mechanism. We will go through all of them throughout this series, so do not worry.

Race conditions may happen in different situations (for instance, changing permissions on a file or raising privileges on a process) but, in the realm of concurrent programming, the classical issue happens when accessing memory.

I will include the typical example here for completeness, but you can find it all over the Net and in any book on this topic.

Let's imagine that our program has two threads that need to access a value in memory at the same time. Imagine it is a bank account, and each thread is a different money transfer that will have to update the account balance (our shared value). The account has 100 EUR, and Thread1 is trying to transfer 50 EUR while Thread2 is trying to transfer 25 EUR.
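
In code, each transfer boils down to a non-atomic read-modify-write on the balance. This is just a sketch to fix ideas (the variable and function names are our own, not part of any real banking code):

int balance = 100;  /* shared by the two transfer threads */

void transfer (int amount)
{
  int v = balance;       /* read the current balance           */
  balance = v + amount;  /* write it back; another thread can
                            sneak in between these two steps   */
}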

The normal execution of the program is expected to be one of these (the R/W column indicates whether the thread is reading from or writing to the shared resource):

Thread1 runs first:

  Thread1   Thread2   R/W   Account
                            100
  100                 R     100
  100+50
  150                 W     150
            150       R     150
            150+25
            175       W     175

Thread2 runs first:

  Thread1   Thread2   R/W   Account
                            100
            100       R     100
            100+25
            125       W     125
  125                 R     125
  125+50
  175                 W     175

In both cases we get the expected value, 175 EUR. However, without taking precautions, what could have happened is:

  Thread1   Thread2   R/W   Account
                            100
  100                 R     100
            100       R     100
  100+50    100+25
  150                 W     150
            125       W     125

As you can see, when the read and write operations get interleaved, bad things happen. We may lose 25 EUR or, in the worst case, 50 EUR... In any case, financial disaster aside, what is worse is that our program is not working properly.

A test program

So, to illustrate this and introduce the solution, let's write a simple test program to see a race condition first-hand. We will take the first lane right at the pole position :).

This is our test program:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

#define MAX_ITER 10000

int shared_var = 0;
typedef struct tpar {
  int id;
  int cnt;
} TPAR;

void * task (void *_p)
{
  int v;
  TPAR  *p = (TPAR*)_p;
  while (1) {
    printf ("%02d: %d\n", p->id, shared_var);
    
    v = shared_var;      /* read the shared value...                 */
    shared_var = v + 1;  /* ...and write it back: these two steps
                            can get interleaved with the other thread */
    
    p->cnt++;
  }
}

int main ()
{
  int        i;
  void       *r;
  TPAR       tp1, tp2;
  pthread_t  tid1, tid2;

  tp1.id = 1;
  tp1.cnt = 0;
  /* pthread_create returns 0 on success or a (positive) error number */
  if (pthread_create (&tid1, NULL, task, &tp1) != 0) exit (EXIT_FAILURE);

  tp2.id = 2;
  tp2.cnt = 0;
  if (pthread_create (&tid2, NULL, task, &tp2) != 0) exit (EXIT_FAILURE);

  i = 0;
  while (1) {
    i++;
    usleep (100);
    if (i >= MAX_ITER)  break;
  }
  pthread_cancel (tid1);
  pthread_cancel (tid2);
  pthread_join (tid1, &r);
  pthread_join (tid2, &r);
  
  printf ("Thread %d  run %d times\n", tp1.id, tp1.cnt);
  printf ("Thread %d  run %d times\n", tp2.id, tp2.cnt);
  printf ("Shared Value : %d  (total threads : %d) \n", shared_var, tp1.cnt+tp2.cnt);
  if (shared_var != tp1.cnt+tp2.cnt)
    printf ("\n*** FAIL ***\n");
  else
    printf ("\n*** SUCCESS ***\n");
  return 0;
}

This program just starts two threads that increment a shared variable. Each thread counts the number of times it runs so, at the end, the sum of the two counters shall match the value of the shared variable. We run the main loop for 10000 iterations; with fewer iterations it is harder to catch the race condition.
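
To build and run it (assuming the source is saved as race.c; the file name is our choice):

gcc -o race race.c -pthread
./race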

If we compile and run the program many times, you will see how sometimes the program succeeds and other times it just fails. The failure rate will increase with the load on the system. You can increase the load just by opening a new terminal and launching the command:

yes > /dev/null

So, we already know what the problem is. It is exactly the same thing that happens in our account example. In order to fix it, we need to make sure that, once the shared value is read, nobody else writes to it until we are done with whatever we have to do.

Mutex

The traditional way to solve this issue is using a Mutex. Mutex is a contraction of Mutual Exclusion, and it is an element that allows us to do precisely that: make sure that, whenever one of our threads accesses the shared resource, all other threads are excluded and have to wait for their turn to access the resource.

So, a thread needs to acquire the Mutex before using the shared resource. If the mutex is free, the thread gets it and locks it, so any other thread that tries to acquire it will have to wait (it will sleep). From that point on, the thread can do whatever it wants to the resource, as it knows for sure that the resource cannot be modified by any other thread. Whenever it is done, the Mutex is released and any other thread can acquire it again. In case some thread was waiting for the resource, it will be woken up so it can continue its execution.

In general, the part of the program between the acquisition of the Mutex and its release is known as a Critical Section... for obvious reasons. Some systems/languages provide constructs with exactly this name. For instance, Java uses the concept of synchronized methods or blocks, which are just critical regions defined using the language block syntax, instead of creating a Mutex and writing code to acquire and release it.

So, how would our previous program look using a Mutex?

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

#define MAX_ITER 10000

int             shared_var = 0;
pthread_mutex_t shared_var_mutex = PTHREAD_MUTEX_INITIALIZER;

typedef struct tpar {
  int id;
  int cnt;
} TPAR;

void * task (void *_p)
{
  int v;
  TPAR  *p = (TPAR*)_p;
  while (1) {

    printf ("%02d: %d\n", p->id, shared_var);
    pthread_mutex_lock (&shared_var_mutex);
    
    v = shared_var;      /* the same read-modify-write, */
    shared_var = v + 1;  /* now protected by the mutex  */
    
    pthread_mutex_unlock (&shared_var_mutex);
    p->cnt++;
  }
}

int main ()
{
  int        i;
  void       *r;
  TPAR       tp1, tp2;
  pthread_t  tid1, tid2;

  tp1.id = 1;
  tp1.cnt = 0;
  if (pthread_create (&tid1, NULL, task, &tp1) != 0) exit (EXIT_FAILURE);

  tp2.id = 2;
  tp2.cnt = 0;
  if (pthread_create (&tid2, NULL, task, &tp2) != 0) exit (EXIT_FAILURE);

  i = 0;
  while (1) {
    i++;
    usleep (100);
    if (i >= MAX_ITER)  break;  
  }

  pthread_cancel (tid1);
  pthread_cancel (tid2);
  pthread_join (tid1, &r);
  pthread_join (tid2, &r);
  
  printf ("Thread %d  run %d times\n", tp1.id, tp1.cnt);
  printf ("Thread %d  run %d times\n", tp2.id, tp2.cnt);
  printf ("Shared Value : %d  (total threads : %d) \n", shared_var, tp1.cnt+tp2.cnt);
  if (shared_var != tp1.cnt+tp2.cnt)
    printf ("\n*** FAIL ***\n");
  else
    printf ("\n*** SUCCESS ***\n");
  return 0;
}

As you can see, the program is exactly the same, but the modification of the variable is performed in a critical section, between the acquisition and release of the mutex. In this program this is achieved using the functions pthread_mutex_lock and pthread_mutex_unlock. Other libraries or languages may use different names, but the semantics do not change.

Now you can compile and run this program as many times as you wish, under a very heavy load, and it will always succeed.

Mutexes are shared resources themselves

You may be thinking this, and that is actually correct. A Mutex is an entity (let's be generic) that is accessed by different threads in order to get mutually exclusive access to a critical region... But how is mutual exclusion ensured when accessing the mutex itself?

Well, this is actually a problem. An old one that was solved long ago. The solution is implemented in the processor itself. Each processor (well, actually most of them) provides some special machine instructions to implement the so-called Test-and-Set operations. Right: in order to implement a Mutex (actually, in order to implement semaphores), we need a way to read a memory location, test it and set it, without being interrupted.

For very basic systems this can be achieved just by disabling interrupts, so the code will be executed without any interruption. But other processors provide specific instructions for this purpose (LL/SC on MIPS, the TAS instruction on the old Motorola 68K, or the LOCK prefix combined with instructions like CMPXCHG on Intel platforms).
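
To make this concrete, here is a minimal sketch of a spinlock built on top of a test-and-set primitive, using the C11 atomic_flag. The spin_lock/spin_unlock names are our own, and this toy only illustrates the atomic core: real mutexes also put waiters to sleep instead of busy-waiting.

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

static void spin_lock (void)
{
  /* atomically set the flag and get its previous value:
     loop until we were the ones that set it */
  while (atomic_flag_test_and_set (&lock))
    ;  /* busy wait */
}

static void spin_unlock (void)
{
  atomic_flag_clear (&lock);
}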

Traditionally, each system provided different functions to allow programmers to atomically update variables. Since its 2011 revision, the C standard supports a new _Atomic keyword and an associated library (stdatomic.h) to deal with it.
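
As a side note, the counter race in our first test program could also be fixed without a mutex, with a single atomic increment. A sketch, assuming C11 atomics:

#include <stdatomic.h>

_Atomic int shared_var = 0;

void increment (void)
{
  /* one indivisible read-modify-write, instead of the
     separate read and write that raced in our first program */
  atomic_fetch_add (&shared_var, 1);
}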

Let's take as an example a very simple program using both options. That is, the classical GNU built-in functions for atomic memory access and the new C11 atomic operations library.

#include <stdio.h>
#include <stdatomic.h>

int main () {
  int my_var = 0;
  int expected = 1;
  int _Atomic my_at_var = 0;

  /* GNU built-in: if my_var == 0, atomically set it to 1 */
  __sync_bool_compare_and_swap (&my_var, 0, 1);
  /* C11: if my_at_var == expected, atomically set it to 0;
     otherwise, copy the current value into expected */
  atomic_compare_exchange_strong (&my_at_var, &expected, 0);

  return 0;
}

To compile this program use the following command-line:

gcc -std=c11 -o atomic atomic.c

Now, if we take a look at the generated code:

$ objdump -d atomic | grep -A30 "<main>:"
000000000000066a <main>:
 66a:   55                      push   %rbp
 66b:   48 89 e5                mov    %rsp,%rbp
 66e:   48 83 ec 20             sub    $0x20,%rsp
 672:   64 48 8b 04 25 28 00    mov    %fs:0x28,%rax
 679:   00 00
 67b:   48 89 45 f8             mov    %rax,-0x8(%rbp)
 67f:   31 c0                   xor    %eax,%eax
 681:   c7 45 e0 00 00 00 00    movl   $0x0,-0x20(%rbp)
 688:   c7 45 e4 01 00 00 00    movl   $0x1,-0x1c(%rbp)
 68f:   c7 45 e8 00 00 00 00    movl   $0x0,-0x18(%rbp)
 696:   b8 00 00 00 00          mov    $0x0,%eax
 69b:   ba 01 00 00 00          mov    $0x1,%edx
 6a0:   f0 0f b1 55 e0          lock cmpxchg %edx,-0x20(%rbp)
 6a5:   48 8d 45 e8             lea    -0x18(%rbp),%rax
 6a9:   48 89 45 f0             mov    %rax,-0x10(%rbp)
 6ad:   c7 45 ec 00 00 00 00    movl   $0x0,-0x14(%rbp)
 6b4:   8b 45 ec                mov    -0x14(%rbp),%eax
 6b7:   89 c6                   mov    %eax,%esi
 6b9:   48 8b 4d f0             mov    -0x10(%rbp),%rcx
 6bd:   48 8d 55 e4             lea    -0x1c(%rbp),%rdx
 6c1:   8b 02                   mov    (%rdx),%eax
 6c3:   f0 0f b1 31             lock cmpxchg %esi,(%rcx)
 6c7:   0f 94 c1                sete   %cl
 6ca:   84 c9                   test   %cl,%cl
 6cc:   75 02                   jne    6d0 <main+0x66>
 6ce:   89 02                   mov    %eax,(%rdx)
 6d0:   b8 00 00 00 00          mov    $0x0,%eax
 6d5:   48 8b 7d f8             mov    -0x8(%rbp),%rdi
 6d9:   64 48 33 3c 25 28 00    xor    %fs:0x28,%rdi

There is no point in going into the details of this code, as we already know what it does, but please notice the two lock cmpxchg instructions: one for the built-in function and another for the _Atomic variable.

Test and Set in other architectures

Now that we have this simple program to explore the atomic modification of variables, we can quickly take a look at what happens on other platforms.

Let's start with ARM:

$ arm-linux-gnueabi-gcc -std=c11 -o atomic.arm atomic.c
$ arm-linux-gnueabi-objdump -d atomic.arm | grep -A32 "<main>:"
00010468 <main>:
   10468:       e92d4830        push    {r4, r5, fp, lr}
   1046c:       e28db00c        add     fp, sp, #12
   10470:       e24dd018        sub     sp, sp, #24
   10474:       e59f30ac        ldr     r3, [pc, #172]  ; 10528 <main+0xc0>
   10478:       e5933000        ldr     r3, [r3]
   1047c:       e50b3010        str     r3, [fp, #-16]
   10480:       e3a03000        mov     r3, #0
   10484:       e50b3024        str     r3, [fp, #-36]  ; 0xffffffdc
   10488:       e3a03001        mov     r3, #1
   1048c:       e50b3020        str     r3, [fp, #-32]  ; 0xffffffe0
   10490:       e3a03000        mov     r3, #0
   10494:       e50b301c        str     r3, [fp, #-28]  ; 0xffffffe4
   10498:       e24b3024        sub     r3, fp, #36     ; 0x24
   1049c:       e3a02001        mov     r2, #1
   104a0:       e3a01000        mov     r1, #0
   104a4:       e1a00003        mov     r0, r3
   104a8:       eb000319        bl      11114 <__sync_val_compare_and_swap_4>
   104ac:       e1a03000        mov     r3, r0
   104b0:       e3530000        cmp     r3, #0
   104b4:       e24b301c        sub     r3, fp, #28
   104b8:       e50b3014        str     r3, [fp, #-20]  ; 0xffffffec
   104bc:       e3a03000        mov     r3, #0
   104c0:       e50b3018        str     r3, [fp, #-24]  ; 0xffffffe8
   104c4:       e51b3018        ldr     r3, [fp, #-24]  ; 0xffffffe8
   104c8:       e1a02003        mov     r2, r3
   104cc:       e51b3014        ldr     r3, [fp, #-20]  ; 0xffffffec
   104d0:       e24b4020        sub     r4, fp, #32
   104d4:       e5945000        ldr     r5, [r4]
   104d8:       e1a01005        mov     r1, r5
   104dc:       e1a00003        mov     r0, r3
   104e0:       eb00030b        bl      11114 <__sync_val_compare_and_swap_4>
   104e4:       e1a03000        mov     r3, r0

In this case we can identify two calls to a function named __sync_val_compare_and_swap which, pretty much, seems to do what we want. Looking into this function we find:

$ arm-linux-gnueabi-objdump -d atomic.arm | grep -A20 "<__sync_val_compare_and_swap_4>:"
00011114 <__sync_val_compare_and_swap_4>:
   11114:       e92d41f0        push    {r4, r5, r6, r7, r8, lr}
   11118:       e1a05000        mov     r5, r0
   1111c:       e1a04001        mov     r4, r1
   11120:       e1a07002        mov     r7, r2
   11124:       e59f6034        ldr     r6, [pc, #52]   ; 11160 <__sync_val_compare_and_swap_4+0x4c>
   11128:       e5953000        ldr     r3, [r5]
   1112c:       e1530004        cmp     r3, r4
   11130:       1a000007        bne     11154 <__sync_val_compare_and_swap_4+0x40>
   11134:       e1a02005        mov     r2, r5
   11138:       e1a01007        mov     r1, r7
   1113c:       e1a00004        mov     r0, r4
   11140:       e12fff36        blx     r6
   11144:       e3500000        cmp     r0, #0
   11148:       1afffff6        bne     11128 <__sync_val_compare_and_swap_4+0x14>
   1114c:       e1a00004        mov     r0, r4
   11150:       e8bd81f0        pop     {r4, r5, r6, r7, r8, pc}
   11154:       e1a04003        mov     r4, r3
   11158:       e1a00004        mov     r0, r4
   1115c:       e8bd81f0        pop     {r4, r5, r6, r7, r8, pc}
   11160:       ffff0fc0        .word   0xffff0fc0

So, there is a lot of code here that looks pretty normal, but there are two instructions in there that do all the magic:

   11124:       e59f6034        ldr     r6, [pc, #52]   ; 11160 <__sync_val_compare_and_swap_4+0x4c>
   (...)
   11140:       e12fff36        blx     r6
   (...)
   11160:       ffff0fc0        .word   0xffff0fc0

This is basically a call to a function located at 0xffff0fc0. And what will we find there? To find out, let's take a look at the kernel documentation. Specifically, at Kernel-Provided User Helpers. In this document we find the following (reproduced here for the reader's convenience):

Note: The address alone should give you an idea that it is kernel-related... but just asking Google about that address will get you to the right page directly.
kuser_cmpxchg
-------------

Location:   0xffff0fc0

Reference prototype:

  int __kuser_cmpxchg(int32_t oldval, int32_t newval, volatile int32_t *ptr);

Input:

  r0 = oldval
  r1 = newval
  r2 = ptr
  lr = return address

Output:

  r0 = success code (zero or non-zero)
  C flag = set if r0 == 0, clear if r0 != 0

Clobbered registers:

  r3, ip, flags

Definition:

  Atomically store newval in *ptr only if *ptr is equal to oldval.
  Return zero if *ptr was changed or non-zero if no exchange happened.
  The C flag is also set if *ptr was changed to allow for assembly
  optimization in the calling code.
  (...)
  

So this is a software implementation of the CMPXCHG instruction we found on Intel platforms... And how is the LOCK modifier implemented? Right, it is not. That is why this is a kernel helper function: the kernel is the one that guarantees the operation completes atomically.
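
From C, using this helper is just a matter of calling that fixed address. A sketch, valid only on ARM Linux with the kernel helpers enabled, following the prototype from the kernel documentation (the set_if_zero wrapper is our own):

#include <stdint.h>

/* the helper lives at a fixed address in every ARM Linux process */
typedef int (*kuser_cmpxchg_t)(int32_t oldval, int32_t newval,
                               volatile int32_t *ptr);
#define kuser_cmpxchg ((kuser_cmpxchg_t) 0xffff0fc0)

/* returns 0 if *ptr was 0 and has been atomically set to 1 */
int set_if_zero (volatile int32_t *ptr)
{
  return kuser_cmpxchg (0, 1, ptr);
}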

We can now take a quick look at the code for MIPS:

$ mips-linux-gnu-gcc -std=c11 -o atomic.mips atomic.c
$ mips-linux-gnu-objdump -d atomic.mips | grep -A40 "<main>:"
004007b0 <main>:
  4007b0:       27bdffc8        addiu   sp,sp,-56
  4007b4:       afbf0034        sw      ra,52(sp)
  4007b8:       afbe0030        sw      s8,48(sp)
  4007bc:       03a0f025        move    s8,sp
  4007c0:       3c1c0042        lui     gp,0x42
  4007c4:       279c9010        addiu   gp,gp,-28656
  4007c8:       afbc0010        sw      gp,16(sp)
  4007cc:       8f82804c        lw      v0,-32692(gp)
  4007d0:       8c420000        lw      v0,0(v0)
  4007d4:       afc2002c        sw      v0,44(s8)
  4007d8:       afc00018        sw      zero,24(s8)
  4007dc:       24020001        li      v0,1
  4007e0:       afc2001c        sw      v0,28(s8)
  4007e4:       afc00020        sw      zero,32(s8)
  4007e8:       0000000f        sync
  4007ec:       c3c30018        ll      v1,24(s8)
  4007f0:       14600006        bnez    v1,40080c <main+0x5c>
  4007f4:       24020000        li      v0,0
  4007f8:       24010001        li      at,1
  4007fc:       e3c10018        sc      at,24(s8)
  400800:       1020fffa        beqz    at,4007ec <main+0x3c>
  400804:       24020001        li      v0,1
  400808:       0000000f        sync
  40080c:       27c20020        addiu   v0,s8,32
  400810:       afc20028        sw      v0,40(s8)
  400814:       afc00024        sw      zero,36(s8)
  400818:       8fc20024        lw      v0,36(s8)
  40081c:       00403825        move    a3,v0
  400820:       8fc30028        lw      v1,40(s8)
  400824:       27c2001c        addiu   v0,s8,28
  400828:       8c460000        lw      a2,0(v0)
  40082c:       0000000f        sync
  400830:       c0640000        ll      a0,0(v1)
  400834:       14860006        bne     a0,a2,400850 <main+0xa0>
  400838:       24050000        li      a1,0
  40083c:       00e00825        move    at,a3
  400840:       e0610000        sc      at,0(v1)
  400844:       1020fffa        beqz    at,400830 <main+0x80>
  400848:       24050001        li      a1,1
  40084c:       0000000f        sync

Here we can see the use of LL/SC pairs, which are instructions specifically provided to implement this test-and-set behaviour. The MIPS IV instruction set manual uses the term RMW (Read-Modify-Write) instead of TAS, but the idea is pretty much the same.

The PowerPC architecture provides a similar model using the instruction pair LWARX/STWCX. (the trailing dot is part of the second mnemonic). This is the code gcc generates for this platform:

$ powerpc-linux-gnu-gcc -std=c11 -o atomic.ppc atomic.c
$ powerpc-linux-gnu-objdump -d atomic.ppc | grep -A40 "<main>:"
100004cc <main>:
100004cc:       94 21 ff c0     stwu    r1,-64(r1)
100004d0:       7c 08 02 a6     mflr    r0
100004d4:       90 01 00 44     stw     r0,68(r1)
100004d8:       93 e1 00 3c     stw     r31,60(r1)
100004dc:       7c 3f 0b 78     mr      r31,r1
100004e0:       81 22 8f f8     lwz     r9,-28680(r2)
100004e4:       91 3f 00 2c     stw     r9,44(r31)
100004e8:       39 20 00 00     li      r9,0
100004ec:       39 20 00 00     li      r9,0
100004f0:       91 3f 00 18     stw     r9,24(r31)
100004f4:       39 20 00 01     li      r9,1
100004f8:       91 3f 00 1c     stw     r9,28(r31)
100004fc:       39 20 00 00     li      r9,0
10000500:       91 3f 00 20     stw     r9,32(r31)
10000504:       39 40 00 01     li      r10,1
10000508:       39 3f 00 18     addi    r9,r31,24
1000050c:       7c 00 04 ac     hwsync
10000510:       7d 00 48 28     lwarx   r8,0,r9
10000514:       2f 88 00 00     cmpwi   cr7,r8,0
10000518:       40 9e 00 10     bne     cr7,10000528 <main+0x5c>
1000051c:       7d 40 49 2d     stwcx.  r10,0,r9
10000520:       4f 80 00 00     mcrf    cr7,cr0
10000524:       40 9e ff ec     bne     cr7,10000510 <main+0x44>
10000528:       4c 00 01 2c     isync
1000052c:       39 3f 00 20     addi    r9,r31,32
10000530:       91 3f 00 28     stw     r9,40(r31)
10000534:       39 20 00 00     li      r9,0
10000538:       91 3f 00 24     stw     r9,36(r31)
1000053c:       81 3f 00 24     lwz     r9,36(r31)
10000540:       7d 26 4b 78     mr      r6,r9
10000544:       81 1f 00 28     lwz     r8,40(r31)
10000548:       39 3f 00 1c     addi    r9,r31,28
1000054c:       80 e9 00 00     lwz     r7,0(r9)
10000550:       7c 00 04 ac     hwsync
10000554:       7d 40 40 28     lwarx   r10,0,r8
10000558:       7f 8a 38 00     cmpw    cr7,r10,r7
1000055c:       40 9e 00 10     bne     cr7,1000056c <main+0xa0>
10000560:       7c c0 41 2d     stwcx.  r6,0,r8
10000564:       4f 80 00 00     mcrf    cr7,cr0
10000568:       40 9e ff ec     bne     cr7,10000554 <main+0x88>

For more details about these instructions, take a look at Save your code from meltdown using PowerPC atomic instruction.

Another processor that is still popular is SPARC.

$ sparc64-linux-gnu-gcc -std=c11 -o atomic.sparc atomic.c
$ sparc64-linux-gnu-objdump -d atomic.sparc | grep -A40 "<main>:"
00000000001006d0 <main>:
  1006d0:       9d e3 bf 30     save  %sp, -208, %sp
  1006d4:       c2 59 e0 28     ldx  [ %g7 + 0x28 ], %g1
  1006d8:       c2 77 a7 f7     stx  %g1, [ %fp + 0x7f7 ]
  1006dc:       82 10 20 00     clr  %g1
  1006e0:       c0 27 a7 df     clr  [ %fp + 0x7df ]
  1006e4:       82 10 20 01     mov  1, %g1
  1006e8:       c2 27 a7 e3     st  %g1, [ %fp + 0x7e3 ]
  1006ec:       c0 27 a7 e7     clr  [ %fp + 0x7e7 ]
  1006f0:       82 07 a7 df     add  %fp, 0x7df, %g1
  1006f4:       86 10 20 00     clr  %g3
  1006f8:       84 10 20 01     mov  1, %g2
  1006fc:       c5 e0 50 03     cas  [ %g1 ], %g3, %g2
  100700:       82 07 a7 e7     add  %fp, 0x7e7, %g1
  100704:       c2 77 a7 ef     stx  %g1, [ %fp + 0x7ef ]
  100708:       c0 27 a7 eb     clr  [ %fp + 0x7eb ]
  10070c:       c2 07 a7 eb     ld  [ %fp + 0x7eb ], %g1
  100710:       84 10 00 01     mov  %g1, %g2
  100714:       c8 5f a7 ef     ldx  [ %fp + 0x7ef ], %g4
  100718:       82 07 a7 e3     add  %fp, 0x7e3, %g1
  10071c:       c6 00 40 00     ld  [ %g1 ], %g3
  100720:       c5 e1 10 03     cas  [ %g4 ], %g3, %g2
  100724:       86 18 80 03     xor  %g2, %g3, %g3
  100728:       80 a0 00 03     cmp  %g0, %g3
  10072c:       86 60 3f ff     subc  %g0, -1, %g3
  100730:       80 a0 e0 00     cmp  %g3, 0
  100734:       12 48 00 03     bne  %icc, 100740 <main+0x70>
  100738:       01 00 00 00     nop
  10073c:       c4 20 40 00     st  %g2, [ %g1 ]
  100740:       82 10 20 00     clr  %g1
  100744:       83 38 60 00     sra  %g1, 0, %g1
  100748:       b0 10 00 01     mov  %g1, %i0
  10074c:       c2 5f a7 f7     ldx  [ %fp + 0x7f7 ], %g1
  100750:       c4 59 e0 28     ldx  [ %g7 + 0x28 ], %g2
  100754:       82 18 40 02     xor  %g1, %g2, %g1
  100758:       84 10 20 00     clr  %g2
  10075c:       02 c8 40 04     brz  %g1, 10076c <main+0x9c>
  100760:       01 00 00 00     nop
  100764:       40 04 06 87     call  202180 <__stack_chk_fail@plt>
  100768:       01 00 00 00     nop
  10076c:       81 cf e0 08     return  %i7 + 8

SPARC uses an approach similar to the one provided by the old MC68K or the Intel processors. The CAS instruction does the trick. Can you spot the two CAS instructions in the listing above?

Finally, let's take a look at the more recent open-source RISC-V architecture.

$ riscv64-linux-gnu-gcc -std=c11 -o atomic.riscv atomic.c
$ riscv64-linux-gnu-objdump -d atomic.riscv | grep -A40 "<main>:"
00000000000104dc <main>:
   104dc:       7179                    addi    sp,sp,-48
   104de:       f406                    sd      ra,40(sp)
   104e0:       f022                    sd      s0,32(sp)
   104e2:       1800                    addi    s0,sp,48
   104e4:       67c9                    lui     a5,0x12
   104e6:       e187b783                ld      a5,-488(a5) # 11e18 <__stack_chk_guard@@GLIBC_2.27>
   104ea:       fef43423                sd      a5,-24(s0)
   104ee:       fc042823                sw      zero,-48(s0)
   104f2:       4785                    li      a5,1
   104f4:       fcf42a23                sw      a5,-44(s0)
   104f8:       fc042c23                sw      zero,-40(s0)
   104fc:       4785                    li      a5,1
   104fe:       fd040693                addi    a3,s0,-48
   10502:       0f50000f                fence   iorw,ow
   10506:       1406a72f                lr.w.aq a4,(a3)
   1050a:       00071563                bnez    a4,10514 <main+0x38>
   1050e:       1cf6a62f                sc.w.aq a2,a5,(a3)
   10512:       fa75                    bnez    a2,10506 <main+0x2a>
   10514:       fd840793                addi    a5,s0,-40
   10518:       fef43023                sd      a5,-32(s0)
   1051c:       fc042e23                sw      zero,-36(s0)
   10520:       fdc42783                lw      a5,-36(s0)
   10524:       0007859b                sext.w  a1,a5
   10528:       fe043603                ld      a2,-32(s0)
   1052c:       fd440793                addi    a5,s0,-44
   10530:       4394                    lw      a3,0(a5)
   10532:       0f50000f                fence   iorw,ow
   10536:       1406272f                lr.w.aq a4,(a2)
   1053a:       00d71563                bne     a4,a3,10544 <main+0x68>
   1053e:       1cb6252f                sc.w.aq a0,a1,(a2)
   10542:       f975                    bnez    a0,10536 <main+0x5a>
   10544:       40d706bb                subw    a3,a4,a3
   10548:       2681                    sext.w  a3,a3
   1054a:       0016b693                seqz    a3,a3
   1054e:       2681                    sext.w  a3,a3
   10550:       e291                    bnez    a3,10554 <main+0x78>

RISC-V, like the other RISC processors we have just seen (the assembly is very similar to MIPS and PowerPC), uses a pair of instructions, LR/SC, in combination with the FENCE instruction. The aq flag after the LR/SC instructions stands for acquire, and it indicates the kind of access intended; the other flag, rl, stands for release. The FENCE instruction is used to ensure ordering of memory and I/O operations. We will come back to this later; first we need to learn a few other things.

Conclusion

In this post we have introduced one of the most common problems with concurrent software: Race Conditions. We have illustrated it with a real example and also shown how to solve the problem using the standard POSIX Mutex. Then we explored how these Mutexes are implemented and dived into the processor capabilities that make them work, but this is just part of the trick to get a working Mutex. In the next instalment we will explore the other fundamental element of a proper mutex implementation.

Continue Reading... Concurrency. Mutex and Futex

Other posts in the series:

Concurrency. Introduction

Header Image Credits: Victoire Joncheray on Unsplash

 