Everything you never wanted to know about stack canaries
SECURITY
Everything you never wanted to know about stack canaries
2021-03-30
By
David "DeMO" Martínez Oliveira

Yes, you have read it right. Unless you are a compiler or a exploit developer you do not really want to know anything about stack canaries. You just need to know they exists and will prevent stack overflows. But if you want to know how they work in detail, then keep reading.

Stack canaries are a technique to detect stack overflows and therefore prevent the exploitation of vulnerable applications. They have been around for a while and in this paper I'm going to put together everything I have learnt about them so far.

Nowadays, a vulnerable program doesn't mean automatically a security breach, i.e. that it can be exploited. The use of ASLR (Address Space Layout Randomization), proper execution permission on process pages and stack canaries makes very difficult to exploit a classical buffer overflows, even tho it is still possible in some cases.

In this paper we'll explore the use of stack canaries, but before jumping in the details, and for completeness, let's describe the issue this technique solves so it will be way easier to understand what is coming next in this paper. If you already know how functions are called in assembler, what is a stack frame and how the stack layout looks like when calling a function, you can skip the next sections and go straight to section Stack Canaries. Otherwise do not panic, I will explain anything you need to know in the next sections.... and it is easier than you may expect.

Function Calls

Whenever we call a function in a high level language like C, the compiler needs to generate opcodes to change the program flow, but it has to do that in a way such as it can get back to the exact point it was just before calling the function.

This process implies the modification of the so-called Instruction Pointer or Program Counter (the name depends on the documentation you are reading). This is just a processor register that points to the next instruction to be executed. So, calling a function requires to temporally store the current value of the IP, before modifying it with the address of the function (I will use IP from now on as all examples will be x86 and that is how it is named for that architecture).

So, for a simple C program like this:

void func () {
// Do something
return;
}
int main () { func(); return 0}

The compiler will generate something like this:

      main:   ADDR+0  There is always some code there
+-----------  ADDR+1  F()  => TMP=IP=ADDR+2; IP=func
|             ADDR+2  return;  <----------------------+
|             (...)                                   |
+ ->  func:   ADDR+X  Do something                    |
              (...)                                   |
              ADD+X+N: return; => IP=TMP  ------------+

Each processor provides instructions to change the IP register (all branch instructions does) and store the current IP value somewhere. This somewhere is either the stack (like the x86 processors) or a Return Address Register or Link Register (like ARM and MIPS). In the same way, processors also provides a way to easily restore the saved IP value, from the stack or from the mentioned return address register.

Note that the platforms implementing the Return Address Register solution also need to use the stack whenever the function needs to call other function, because the Return Address Register will be overwritten, losing the original return address. Only leaf functions (functions don't calling any other function), can be implemented just using the register.

The Stack

To get the whole picture, we still need to depict the structure of the stack when a function is called.

Calling a single function is pretty straightforward, and everything looks very simple. We just store our arguments somewhere, the return address, and use some random memory for our local variables.

Now think about a complex function that calls many other functions. You can chose different areas of memory to store that information for each function. That will also work fine, but now, think about a recursive function, or even worst, a re-entrant function. Implementing this storing data in memory differently for each function may be possible but tricky.

Fortunately, a lot of smart people years ago, figured out how to solve this in a elegant and neat way... Just use a stack. Each block of memory required for each function (return address, parameters, local variables) just get stored in the stack. Each function call pushes its information into the stack, and pops it up when the function is done. We can call the same function many times. Each time a new memory block will be added to the stack and everything will work fine, as each function invocation will have its own data block. Each one of those blocks is generically known as a Stack Frame.

So, how does a typical Stack Frame looks like?

SP+XX: Parameters (depends on ABI)
SP+00: Return Address
SP-04: Base Pointer Save
SP-08: Local Variables Ends
SP- N: Depends on number and types of Local Variables

Note: SP is the Stack Pointer a register that points to either the last used entry or the first empty entry in the stack.

As we can see, depending on the ABI the first thing in the Stack Frame may be the list of parameters. For instance, 32bits x86 Linux ABI passes all parameters in the stack, while the 64bits ABI uses registers, until they get exhausted, then the stack will be used.... but that is very uncommon, you need to pass a lot of parameters to the function and that is not a good programming practice.

Then we find the return address as we have already discussed and after that, the local variables. Note that local variables are stored in the stack. This is very convenient because we can allocate memory just decreasing the Stack Pointer and free it straight away, just increasing it back.

NOTE:When you allocate memory in the heap, that data will be somewhere else, but the pointer to that data is likely to be stored in the stack. Languages like C++ that heavily rely on objects allocated in the heap (created with new), usually need to unwind the stack before the function is left. In other words, it is not enough to just increase the Stack Pointer in those cases

Smashing The Stack

So, know we know everything to understand how a stack overflow works. We will now use some real code to match all the theoretical concepts introduced so far.

#include <string.h>

void func () {
  unsigned char buf[32];
  memset (buf, 0, 64);
  return;
}

int main () {
  func ();
  return 0;
}

If we compile and run it we get this:

$ gcc -fno-stack-protector -o bufferoverflow bufferoverflow.c
$ ./bufferoverflow
Segmentation fault

Yes, we have smashed the stack. In this example it is very obvious isn't it? We are writing 64 zeros in a 32 bytes buffer. If you remember our stack diagram, the local variables (that is, buf) are allocated in the stack in the lower addresses. When we start writing to buf in the loop, we start filling the stack from the lower address up... towards the return address, eventually overwriting it. When the function returns, it will return to address 0x00 instead of the main function (because we have written zeros in the buffer).

Let's take a look to the function:

$ objdump -Mintel -d bufferoverflow | grep -A20 "<func>:"
000000000000064a <func>:
 64a:   55                      push   rbp
 64b:   48 89 e5                mov    rbp,rsp
 64e:   48 83 ec 20             sub    rsp,0x20
 652:   48 8d 45 e0             lea    rax,[rbp-0x20]
 656:   ba 40 00 00 00          mov    edx,0x40
 65b:   be 00 00 00 00          mov    esi,0x0
 660:   48 89 c7                mov    rdi,rax
 663:   e8 b8 fe ff ff          call   520 <memset@plt>
 668:   90                      nop
 669:   c9                      leave
 66a:   c3                      ret

000000000000066b <main>:
 66b:   55                      push   rbp
 66c:   48 89 e5                mov    rbp,rsp
 66f:   b8 00 00 00 00          mov    eax,0x0
 674:   e8 d1 ff ff ff          call   64a <func>
 679:   b8 00 00 00 00          mov    eax,0x0
 67e:   5d                      pop    rbp
 67f:   c3                      ret

In the main function we can easily identify the call to func using the call instruction. This will push the return address,0x679 in this case, in the stack. Actually this is a PIE (Position Independent Executable) binary and the actual address will be some random address plus this offset.

RSP --> | RET (0x679)
        | EMPTY
        |--------------------

Note: For intel processors the Stack Pointer (RSP) points to the last used position

Then we continue execution on func and the first thing we do is to push in the stack the base index register. We haven't talked about this, but, in short, this register is used to index values in the stack (namely parameters and local variables).

        | RET (0x679)
RSP --> | RBP             <=== RBP
        | EMPTY
        |--------------------

After that, the buffer is allocated. 32 bytes (0x20 in hexadecimal) just subtracting that amount to RSP.

       Higher Addresses
RSP+0x30  | RET (0x679)
          | RBP       <== RBP
          | 8 bytes
          | 8 bytes
          | 8 bytes
RSP --->  | 8 bytes   <-- buf (RBP-x20)
          | EMPTY
          |--------------------
      Lower Addresses
      

So it is clear now, how the memset function will write zeros pass the return address. This is a stack overflow.

In this case the program is not vulnerable as there is no way for an attacker to control what is written in the return address. Actually, it will always be zeros. However in many other cases, this buffer operations take as source some user input, some data read from the network, or from a file... external data sources that an attacker can manipulate to write an appropriate value in the return address and change the program execution flow.

Stack Canaries

Finally we have reach the point were we can introduce the canaries. The name comes from the old days in the mines. Miners use to carry canaries (the birds) to detect leaks of gases (firedamp). The canaries dead with small concentrations of the gas, concentration not mortal for humans, and that will warn miners to get the hell out of there.

Stack Canaries do something similar. They do not fix the buffer overflow, but they allow us to detect it, and finish execution in an ordered way.... In other words.... to get the hell out of there.

The concept they are based on is very simple. As we had seen in the previous section, the stack has to be smashed from lower address to top, being the return address the highest, and that address is only used by the very last instruction in the function. The ret instruction.

A canary, is just a random number that we introduce just after the return address so it will always be smashed before the return address. Before leaving the function that value is checked again and if it was changed then the stack was smashed.

On the previous example, we omitted the generation of the canary on purpose so we get simpler code to better understand how the stack frame is build. For that we used the -fno-frame-protector. Now we can recompile the test program without this flag and see what we get.

$ make bufferoverflow
cc     bufferoverflow.c   -o bufferoverflow
$ ./bufferoverflow
*** stack smashing detected ***: <unknown> terminated
Aborted

Now, instead of the segmentation fault (caused for the return to address 0x00000) we get a message indicating than the stack has been smashed.

The anatomy of a canary

Let's take a look to the code for this last compilation:

$ objdump -Mintel -d bufferoverflow | grep -A20 "<func>:"
00000000000006aa <func>:
 6aa:   55                      push   rbp
 6ab:   48 89 e5                mov    rbp,rsp
 6ae:   48 83 ec 30             sub    rsp,0x30
 6b2:   64 48 8b 04 25 28 00    mov    rax,QWORD PTR fs:0x28
 6b9:   00 00
 6bb:   48 89 45 f8             mov    QWORD PTR [rbp-0x8],rax
 6bf:   31 c0                   xor    eax,eax
 6c1:   48 8d 45 d0             lea    rax,[rbp-0x30]
 6c5:   ba 40 00 00 00          mov    edx,0x40
 6ca:   be 00 00 00 00          mov    esi,0x0
 6cf:   48 89 c7                mov    rdi,rax
 6d2:   e8 a9 fe ff ff          call   580 <memset@plt>
 6d7:   90                      nop
 6d8:   48 8b 45 f8             mov    rax,QWORD PTR [rbp-0x8]
 6dc:   64 48 33 04 25 28 00    xor    rax,QWORD PTR fs:0x28
 6e3:   00 00
 6e5:   74 05                   je     6ec <func+0x42>
 6e7:   e8 84 fe ff ff          call   570 <__stack_chk_fail@plt>
 6ec:   c9                      leave
 6ed:   c3                      ret

The middle part of the function is still the same, but we can now see some code at the beginning and at the end. Specifically this:

     6b2:   mov    rax,QWORD PTR fs:0x28
     6bb:   mov    QWORD PTR [rbp-0x8],rax

This code just reads some value from address FS:0x28 (FS is a segment register, you do not see those very often nowadays but some of us are familiar to them from the DOS age). That is the canary, a random value initialised when the program is loaded. Then what it does is to store it at the top of the stack frame:

RSP+0x30  | RET (0x679)
RBP -->   | RBP
          | 8 bytes   <=== RBP - 0x08
          | 8 bytes
          | 8 bytes
          | 8 bytes         
          | 8 bytes
RSP --->  | 8 bytes   <-- buf (RBP-0x30)
          | EMPTY
          |--------------------

In this diagram it seems to be an extra position in the stack. That's right and it is because for 64bits x86 processors the stack shall be aligned to 16 bytes (0x10) blocks. We just need 8 bytes for the canary, but the compiler generates 8 more to keep the stack alignment. Well, I hope that is the right explanation, at this level, compilers do many arcane tricks that sometimes looks like something but they are something else.

Then, just before leaving the function, it checks that the canary stored in the stack is still the same than the one in the special address... it checks it is still alive:

 6d8:   mov    rax,QWORD PTR [rbp-0x8]
 6dc:   xor    rax,QWORD PTR fs:0x28
 6e5:   je     6ec <func+0x42>
 6e7:   call   570 <__stack_chk_fail@plt>

In case the value doesn't match (the stack has been smashed), the __stack_chk_fail function is invoked. That is the one showing the message:

*** stack smashing detected ***: <unknown> terminated
Aborted

Canaries are only generated when needed

Let's do a small change to our test program:

#include <string.h>

void func () {
  unsigned char buf[32];
  memset (buf, 0, 64);
  return;
}

int func1 () {
  int a = 1;
  int b = 2;
  return a + b;
}

int main () {
  func (); 
  return 0;
}

The generated code for func1 is:

00000000000006ee <func1>:
 6ee:   55                      push   rbp
 6ef:   48 89 e5                mov    rbp,rsp
 6f2:   c7 45 f8 01 00 00 00    mov    DWORD PTR [rbp-0x8],0x1
 6f9:   c7 45 fc 02 00 00 00    mov    DWORD PTR [rbp-0x4],0x2
 700:   8b 55 f8                mov    edx,DWORD PTR [rbp-0x8]
 703:   8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]
 706:   01 d0                   add    eax,edx
 708:   5d                      pop    rbp
 709:   c3                      ret

As you can see there is no canary used in this one, because it is not possible to smash the stack in this case. In general, gcc won't generate canary code for functions that doesn't declare arrays/buffers as local variables.

How are canaries generated?

In the previous section we have seen how the canary value is retrieved from the address fs:0x28. But how does that value get there? and where does it come from?.

If we look for this address in our test program, we will just find it on func. That means that the code to initialise the canary is either in the system or in the standard C library... in some of all that code that gets executed before we get to the main function. So let's compile a static version of our test program in order to get all the code in, and see what we find:

0000000000400e40 <__libc_start_main>:
(...)
  401046:       48 8b 05 53 7a 2b 00    mov    rax,QWORD PTR [rip+0x2b7a53]        # 6b8aa0 <_dl_random>
  40104d:       48 8b 00                mov    rax,QWORD PTR [rax]
  401050:       30 c0                   xor    al,al
  401052:       64 48 89 04 25 28 00    mov    QWORD PTR fs:0x28,rax
  401059:       00 00

Looks like it is initialised by the C library. The value is taken from a global symbol named _dl_random. A quick search will show that this is a random value provided by the kernel ELF loader. We can see in the code above how the less significant byte is deleted before storing the value in the address fs:0x28...

So the value is generated by the kernel at the time of loading the binary. Then it is also exposed to the application as the auxiliary vector value AT_RANDOM, that points to the address containing the bytes that will be used by the canary.

All together now

To illustrate everything we have gone trough so far, I have built a simple test program with a function that dumps the stack and allows us to easily modify it to change the control flow or fire the canary detection. It also shows the auxiliary vectors (yes, we could have used LD_SHOW_AUXV=1 but this is just other way).

#include <stdio.h>
#include <sys/auxv.h>

// Let's store this in globals so we keep the stack cleaner
void *ret_addr = NULL;
long *new_ret  = NULL;

int func ()
{
  char buf[16];
  long i;
  long *p      = (long*) buf;
  long *ret    = NULL;
  long *canary = NULL;

  // initialise buffer for easy stack identification
  p[0] = 0xbada5500cafebabe;
  p[1] = 0xbada550ff1cebabe;

  // Dump the stack
  for (i = 8; i >= -2; i--) {
    printf ("STACK + %p : %16lx", (p+i), p[i]);
    if (p[i] == (long)ret_addr) {
      ret = p + i;                // When we find ret address store the ptr
      canary = p + i - 2;         // And also the canary one
      printf (" <=== RET ADDR\n");
    }
    else if ( canary && (p + i) == (long*)canary) printf (" <=== CANARY\n");
    else printf ("\n");

  }
  printf( "-----[Let's force jumping into l2]----------------\n");
  *ret = (long)new_ret;  // Return to l2 instead of l1
  //*canary = 0;         // This fires the stack protection. 
  for (i = 8; i >= -2; i--) {
    printf ("STACK + %p : %16lx", (p+i), p[i]);
    if ( (p + i) == (long*)ret) printf (" <=== RET ADDR\n");
    else if ( canary && (p + i) == (long*)canary) printf (" <=== CANARY\n");
    else printf ("\n");
  }
  return 0;
}

int main () {
  int i;
  ret_addr = &&l1;
  new_ret = &&l2;
  unsigned char *rnd = (unsigned char*) getauxval (AT_RANDOM);
  printf ("Ret address1   : %p\n", ret_addr);
  printf ("Ret address2   : %p\n", new_ret); 
  printf ("AT_RANDOM (%p) : ", rnd); 
  for (i = 0; i < 16; i++) printf ("%02x ", rnd[i]);
  printf ("\n");
  func ();
 l1:
  printf ("Don't see me!\n");
  
 l2:
  printf("Normal shutdown!\n");


  return 0;
}

The output of this program is shown below (actual values will be different in your machine):

$ ./test
Ret address1   : 0x5606f7e0cab3
Ret address2   : 0x5606f7e0cabf
AT_RANDOM (0x7ffc43e13139) : 2a 6b 50 bd fe 25 7d 97 9e b1 a5 47 8e c9 20 51
STACK + 0x7ffc43e12d00 :     5606f7e0cae0
STACK + 0x7ffc43e12cf8 :     7ffc43e13139
STACK + 0x7ffc43e12cf0 :       1043e12de0
STACK + 0x7ffc43e12ce8 :     5606f7e0cab3 <=== RET ADDR # l1
STACK + 0x7ffc43e12ce0 :     7ffc43e12d00 # RBP
STACK + 0x7ffc43e12cd8 : 977d25febd506b00 <=== CANARY
STACK + 0x7ffc43e12cd0 :     5606f7e0c6a0
STACK + 0x7ffc43e12cc8 : bada550ff1cebabe
STACK + 0x7ffc43e12cc0 : bada5500cafebabe # buf
STACK + 0x7ffc43e12cb8 :     7ffc43e12cc0
STACK + 0x7ffc43e12cb0 :     7ffc43e12cd8
-----[Let's force jumping into l2]----------------
STACK + 0x7ffc43e12d00 :     5606f7e0cae0
STACK + 0x7ffc43e12cf8 :     7ffc43e13139
STACK + 0x7ffc43e12cf0 :       1043e12de0
STACK + 0x7ffc43e12ce8 :     5606f7e0cabf <=== RET ADDR # l2
STACK + 0x7ffc43e12ce0 :     7ffc43e12d00 # RBP
STACK + 0x7ffc43e12cd8 : 977d25febd506b00 <=== CANARY
STACK + 0x7ffc43e12cd0 :     5606f7e0c6a0
STACK + 0x7ffc43e12cc8 : bada550ff1cebabe
STACK + 0x7ffc43e12cc0 : bada5500cafebabe # buf
STACK + 0x7ffc43e12cb8 :     7ffc43e12cc0
STACK + 0x7ffc43e12cb0 :     7ffc43e12cd8
Normal shutdown!

Note: the lines preceded by # are annotations I have added to the text, they are not produced by the program

The program, just calls a function that drops the stack showing the return address and the canary, that matches the AT_RANDOM auxiliary vector (shown at the beginning) except for the last byte (remember that xor al, al ?). The function also changes the return address to jump to l2 so the first print is not shown....

Note that the canary determination by the program is a bit lousy in the sense that it doesn't have to be at two words of the return address. You can recompile the program with optimisations on and the code will be different and the whole stack layout may be different. Anyway, it is a good simple test program you can use to build up your own tests and solve further doubts you may still have.

Before I leave

This looks like a very good technique to prevent buffer overflow exploits. We may think that we are safe enabling the stack protector flag in our compiler and ask it to produce canaries for all potentially exploitable functions. This is not fully true, there is always a way.

Exploiting applications protected with canaries usually requires to figure out the value of the canary. This way when the stack is overwritten the proper canary will also be written and therefore gain control on the execution. So a canary is as good as the applications preventing the exfiltration of its value.

For instance, suppose the application has a format string vulnerability somewhere in the code. Using a format string vulnerability an attacker can dump data from the stack and have a chance to figure out the canary value. Once the value of the canary is know, the buffer overflow can be exploited despite of the existence of the canary. The attacker will just write the canary in the buffer in the right position so the check at the end of the function will pass and effectively modify the flow of the application.

So the only way to have a secure program is to make sure there is no buffer overflows on them.

Conclusions

We have gone through a basic description of how functions are called from high level languages and what needs to be done at assembly level to support them (recursivity, re-entrancy,...). We briefly introduced the stack frame and how a program bug can be used to modify the normal program control flow and potentially exploit it to gain access to a given system.

Then we introduced the canary mechanism to detect stack overflows and looked into its low level implementation.

Header Image Credits: João furlani

 
Tu publicidad aquí :)