PROGRAMMING
Using a parameter to iterate a loop
2017-10-24 | by David "DeMO" Martínez Oliveira
Some years ago I participate in a challenge. It went fine and all participants got some feedback afterwards so we could figure out what we had done wrong. As part of that feedback, the question of re-using parameters within function rose... so I decided to look into the details.
The challenge consisted on writing the code for a function that, overall, had to run a loop. I cannot really remember all the details. In my solution I created a local variable to iterate through the loop. As mentioned above, it was considered a better solution to re-use the parameter received by the function as a counter for the loop (at least that is what I remember :). So, let's find out if that is correct!.
You may be confused by my poor description. In that case, just check the code below:
#include <stdio.h> int f1 (int v) { int i, r = 0; for (i = 0; i < 100; i++) r += i; return r; } int f2 (int i) { int r = 0; for (i = 0; i < 100; i++) r += i; return r; } int main () { printf ("Result : %d\n", f1 (10)); printf ("Result : %d\n", f2 (10)); }Function
f1
was my solution while, f2
was the expected solution. For me, that sounded a bit weird but, at that time, I moved on to other things and forgot about this. For some reason this have just come back to my mind and I decide to look into the details, to see if f2
was actually a better solution.
At first glance, at the source code level, f2
is shorter (we are saving a variable declaration) and also, it looks kind of somehow smarter.... We could even argument that we are using less memory... but...
Let's Look at the code
We will start our analysis, looking at generated object code for both functions. I will analyse the x86 64bits case. I also checked the 32bits version and there is no big difference so I will just talk about the 64 bits version because it is just mainstream nowadays. This is what you get after compiling the test program:$ gcc -o k k.c $ objdump -M intel -d k | grep -A30 "<f1>:" 00000000004004f4 <f1>: 4004f4: 55 push rbp 4004f5: 48 89 e5 mov rbp,rsp 4004f8: 89 7d ec mov DWORD PTR [rbp-0x14],edi 4004fb: c7 45 fc 00 00 00 00 mov DWORD PTR [rbp-0x4],0x0 400502: c7 45 f8 00 00 00 00 mov DWORD PTR [rbp-0x8],0x0 400509: eb 0a jmp 400515 <f1+0x21> 40050b: 8b 45 f8 mov eax,DWORD PTR [rbp-0x8] 40050e: 01 45 fc add DWORD PTR [rbp-0x4],eax 400511: 83 45 f8 01 add DWORD PTR [rbp-0x8],0x1 400515: 83 7d f8 63 cmp DWORD PTR [rbp-0x8],0x63 400519: 7e f0 jle 40050b <f1+0x17> 40051b: 8b 45 fc mov eax,DWORD PTR [rbp-0x4] 40051e: 5d pop rbp 40051f: c3 ret 0000000000400520 <f2>: 400520: 55 push rbp 400521: 48 89 e5 mov rbp,rsp 400524: 89 7d ec mov DWORD PTR [rbp-0x14],edi 400527: c7 45 fc 00 00 00 00 mov DWORD PTR [rbp-0x4],0x0 40052e: c7 45 f8 00 00 00 00 mov DWORD PTR [rbp-0x8],0x0 400535: eb 0a jmp 400541 <f2+0x21> 400537: 8b 45 f8 mov eax,DWORD PTR [rbp-0x8] 40053a: 01 45 fc add DWORD PTR [rbp-0x4],eax 40053d: 83 45 f8 01 add DWORD PTR [rbp-0x8],0x1 400541: 83 7d f8 63 cmp DWORD PTR [rbp-0x8],0x63 400545: 7e f0 jle 400537 <f2+0x17> 400547: 8b 45 fc mov eax,DWORD PTR [rbp-0x4] 40054a: 5d pop rbp 40054b: c3 retWoW, look at that!... the compiler has generated exactly the same code for both cases. Or in other words, there is no practical difference re-using the parameter to run the loop. But there is actually a disadvantage on doing that. Such a disadvantage is hard to see with these simple functions, so, let's convert them into something that looks a bit more real.
A more real life example
Let's make our function iterate through some array and do some operation with its values. Following well-known best practice on SW development, let's give our variables meaningful names. Something like this:#define MAX_PERSONS 100 int person_age[MAX_PERSONS]; int f1 (int flag) { int i, sum_age = 0; if (flag < 10) return -1; for (i = 0; i < MAX_PERSONS; i++) sum_age += person_age[i]; return r; } int f2 (int flag) { int sum_age = 0; if (flag < 10) return -1; for (flag = 0; flag < MAX_PERSONS; flag++) sum_age += person_age[flag]; return r; }Can you see the problem now?... the code reusing the parameter looks pretty weird. The function code does not make much sense and it is even misleading. However,
f1
is easily readable and understandable... specially for a former FORTRAN programmer :).
Early Optimisation is the root...
You have probably heard this before: Early optimisation is the root of all evils. Always do optimisations as late as possible. In this simple case, it is arguable that using the parameter as a counter in a loop is an optimisation. It may be seen as some source code level optimisation. However, let's see what happens if we would like to optimise the functions. Imagine, that the loop takes quite some time and we want to make it faster. We could modify our function into something like this:int f4 (int v) { register int i, r =0; for (i = 0; i < 100; i++) r += i; return r; }We just instruct the compiler to use registers to store our variables. Let's see the generated code for this function:
00000000004005bb <f4>: 4005bb: 55 push rbp 4005bc: 48 89 e5 mov rbp,rsp 4005bf: 41 54 push r12 4005c1: 53 push rbx 4005c2: 89 7d ec mov DWORD PTR [rbp-0x14],edi 4005c5: 41 bc 00 00 00 00 mov r12d,0x0 4005cb: bb 00 00 00 00 mov ebx,0x0 4005d0: eb 06 jmp 4005d8 <f4+0x1d> -- LOOP ---------------------------------------------------------- 4005d2: 41 01 dc add r12d,ebx 4005d5: 83 c3 01 add ebx,0x1 4005d8: 83 fb 63 cmp ebx,0x63 4005db: 7e f5 jle 4005d2 <f4+0x17> --------------------------------------------------------------------- 4005dd: 44 89 e0 mov eax,r12d 4005e0: 5b pop rbx 4005e1: 41 5c pop r12 4005e3: 5d pop rbp 4005e4: c3 retThe initialisation part of the function becomes a bit longer but, let's look at the loop body. We are operating on registers instead of performing indirect memory accesses to get variable values. This code is a lot faster!. To be fair, let's also modify the
r
variable with the register
keyword for f2
:
int f2_3 (int i) { register int r = 0; for (i =0; i < 100; i++) r += i; return r; }This C code will produce the following assembly:
$ objdump -M intel -d k | grep -A30 "<f2_2>:" 000000000040054c <f2_2>: 40054c: 55 push rbp 40054d: 48 89 e5 mov rbp,rsp 400550: 53 push rbx 400551: 89 7d e4 mov DWORD PTR [rbp-0x1c],edi 400554: bb 00 00 00 00 mov ebx,0x0 400559: c7 45 f4 00 00 00 00 mov DWORD PTR [rbp-0xc],0x0 400560: eb 07 jmp 400569 <f2_2+0x1d> --- LOOP ------------------------------------------------------------- 400562: 03 5d f4 add ebx,DWORD PTR [rbp-0xc] 400565: 83 45 f4 01 add DWORD PTR [rbp-0xc],0x1 400569: 83 7d f4 63 cmp DWORD PTR [rbp-0xc],0x63 40056d: 7e f3 jle 400562 <f2_2+0x16> ---------------------------------------------------------------------- 40056f: 89 d8 mov eax,ebx 400571: 5b pop rbx 400572: 5d pop rbp 400573: c3 retThis is indeed, better than the original. We are storing values on
EBX
saving a memory access, however, the loop body for this version performs 3 memory access. For such a small array, it is very likely that data will be cached and therefore the difference on performance will be very small.
Optimisations? Nowadays?
I believe this has been an interesting discussion so far. However, there are many people claiming that, nowadays, programmers shouldn't care about optimisation. They should leave that work to the compiler who, in most cases, would do a better job that the humans. So, once again, let's check this, recompiling our program with option-O2
:
$ gcc -O2 -o k k.c $ objdump -M intel -d k | grep -A35 "<f1>:" 0000000000400590 <f1>: 400590: b8 56 13 00 00 mov eax,0x1356 400595: c3 ret 400596: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0] 40059d: 00 00 00 00000000004005a0 <f2>: 4005a0: b8 56 13 00 00 mov eax,0x1356 4005a5: c3 ret 4005a6: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0] 4005ad: 00 00 00 00000000004005b0 <f2_2>: 4005b0: b8 56 13 00 00 mov eax,0x1356 4005b5: c3 ret 4005b6: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0] 4005bd: 00 00 00 00000000004005c0 <f3>: 4005c0: b8 56 13 00 00 mov eax,0x1356 4005c5: c3 ret 4005c6: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0] 4005cd: 00 00 00 00000000004005d0 <f4>: 4005d0: b8 56 13 00 00 mov eax,0x1356 4005d5: c3 retLoL... As you can see, the compiler is smart enough to figure out that we are actually calculating a constant and it is just returning that value back. Interesting enough, the compiler generates exactly the same code for all the functions. I would say that this is not very representative because of the simplicity of the code so, let's try with a slight modification of our test functions.
Evaluating the Compiler Optimisations
Let's re-write our test functions like this:int buffer[100]; int f1 (int v) { int i, r = 0; for (i = 0; i < 100; i++) r += (i + buffer[i]); return r; } int f2 (int i) { int r = 0; for (i =0; i < 100; i++) r += (i + buffer[i]); return r; } int f2_2 (int i) { register int r = 0; for (i =0; i < 100; i++) r += (i + buffer[i]); return r; } int f4 (int v) { register int i, r =0; for (i = 0; i < 100; i++) r += (i + buffer[i]); return r; }These set of functions produces the following assembly:
$ gcc -O2 -o k1 k1.c $ objdump -M intel -d k1 | grep -A50 "<f1>:" 00000000004005f0 <f1>: 4005f0: 31 d2 xor edx,edx 4005f2: 31 c0 xor eax,eax 4005f4: 0f 1f 40 00 nop DWORD PTR [rax+0x0] 4005f8: 8b 0c 95 40 10 60 00 mov ecx,DWORD PTR [rdx*4+0x601040] 4005ff: 01 d1 add ecx,edx 400601: 48 83 c2 01 add rdx,0x1 400605: 01 c8 add eax,ecx 400607: 48 83 fa 64 cmp rdx,0x64 40060b: 75 eb jne 4005f8 <f1+0x8> 40060d: f3 c3 repz ret 40060f: 90 nop 0000000000400610 <f2>: 400610: 31 d2 xor edx,edx 400612: 31 c0 xor eax,eax 400614: 0f 1f 40 00 nop DWORD PTR [rax+0x0] 400618: 8b 0c 95 40 10 60 00 mov ecx,DWORD PTR [rdx*4+0x601040] 40061f: 01 d1 add ecx,edx 400621: 48 83 c2 01 add rdx,0x1 400625: 01 c8 add eax,ecx 400627: 48 83 fa 64 cmp rdx,0x64 40062b: 75 eb jne 400618 <f2+0x8> 40062d: f3 c3 repz ret 40062f: 90 nop 0000000000400630 <f2_2>: 400630: 31 d2 xor edx,edx 400632: 31 c0 xor eax,eax 400634: 0f 1f 40 00 nop DWORD PTR [rax+0x0] 400638: 8b 0c 95 40 10 60 00 mov ecx,DWORD PTR [rdx*4+0x601040] 40063f: 01 d1 add ecx,edx 400641: 48 83 c2 01 add rdx,0x1 400645: 01 c8 add eax,ecx 400647: 48 83 fa 64 cmp rdx,0x64 40064b: 75 eb jne 400638 <f2_2+0x8> 40064d: f3 c3 repz ret 40064f: 90 nop 0000000000400650 <f4>: 400650: 31 d2 xor edx,edx 400652: 31 c0 xor eax,eax 400654: 0f 1f 40 00 nop DWORD PTR [rax+0x0] 400658: 8b 0c 95 40 10 60 00 mov ecx,DWORD PTR [rdx*4+0x601040] 40065f: 01 d1 add ecx,edx 400661: 48 83 c2 01 add rdx,0x1 400665: 01 c8 add eax,ecx 400667: 48 83 fa 64 cmp rdx,0x64 40066b: 75 eb jne 400658 <f4+0x8> 40066d: f3 c3 repz ret 40066f: 90 nopAs we can see, at least for this simple code, the compiler is actually producing the same code for all the different implementations, despite of the use of the
register
modifier!!!
The compiler factor
OK, this may just happen withgcc
so, it would be good to check if we will get the same results with other compilers. Let's recompile our simple test program using CLANG/LLVM and see if we get different results.
$ clang -O2 -o k1-llvm k1.c $ objdump -M intel -d k1-llvm | grep -A60 "<f1>:" 0000000000400500 <f1>: 400500: 31 c9 xor ecx,ecx 400502: 31 c0 xor eax,eax 400504: 66 66 66 2e 0f 1f 84 data32 data32 nop WORD PTR cs:[rax+rax*1+0x0] 40050b: 00 00 00 00 00 400510: 8b 14 85 30 10 60 00 mov edx,DWORD PTR [rax*4+0x601030] 400517: 01 ca add edx,ecx 400519: 8d 0c 10 lea ecx,[rax+rdx*1] 40051c: 48 ff c0 inc rax 40051f: 83 f8 64 cmp eax,0x64 400522: 75 ec jne 400510 <f1+0x10> 400524: 8d 44 10 ff lea eax,[rax+rdx*1-0x1] 400528: c3 ret 400529: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0] 0000000000400530 <f2>: 400530: 31 c9 xor ecx,ecx 400532: 31 c0 xor eax,eax 400534: 66 66 66 2e 0f 1f 84 data32 data32 nop WORD PTR cs:[rax+rax*1+0x0] 40053b: 00 00 00 00 00 400540: 8b 14 85 30 10 60 00 mov edx,DWORD PTR [rax*4+0x601030] 400547: 01 ca add edx,ecx 400549: 8d 0c 10 lea ecx,[rax+rdx*1] 40054c: 48 ff c0 inc rax 40054f: 83 f8 64 cmp eax,0x64 400552: 75 ec jne 400540 <f2+0x10> 400554: 8d 44 10 ff lea eax,[rax+rdx*1-0x1] 400558: c3 ret 400559: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0] 0000000000400560 <f2_2>: 400560: 31 c9 xor ecx,ecx 400562: 31 c0 xor eax,eax 400564: 66 66 66 2e 0f 1f 84 data32 data32 nop WORD PTR cs:[rax+rax*1+0x0] 40056b: 00 00 00 00 00 400570: 8b 14 85 30 10 60 00 mov edx,DWORD PTR [rax*4+0x601030] 400577: 01 ca add edx,ecx 400579: 8d 0c 10 lea ecx,[rax+rdx*1] 40057c: 48 ff c0 inc rax 40057f: 83 f8 64 cmp eax,0x64 400582: 75 ec jne 400570 <f2_2+0x10> 400584: 8d 44 10 ff lea eax,[rax+rdx*1-0x1] 400588: c3 ret 400589: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0] 0000000000400590 <f4>: 400590: 31 c9 xor ecx,ecx 400592: 31 c0 xor eax,eax 400594: 66 66 66 2e 0f 1f 84 data32 data32 nop WORD PTR cs:[rax+rax*1+0x0] 40059b: 00 00 00 00 00 4005a0: 8b 14 85 30 10 60 00 mov edx,DWORD PTR [rax*4+0x601030] 4005a7: 01 ca add edx,ecx 4005a9: 8d 0c 10 lea ecx,[rax+rdx*1] 4005ac: 48 ff c0 inc rax 4005af: 83 f8 64 cmp eax,0x64 4005b2: 75 ec jne 4005a0 <f4+0x10> 4005b4: 8d 44 10 ff lea eax,[rax+rdx*1-0x1] 4005b8: c3 ret 4005b9: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0]As we can see, code is a bit different but, once again, the compiler produces exactly the same code for all the different versions of the function, which is consistent with our previous results with gcc.
Other Platforms?
So far, we have seen that the effect of re-using a parameter in a function has little impact on the code... but, this may just happen on Intel processors. I cannot test every single platform out there and, in all honesty, I do not care enough about this topic to go that far. But what I can do with almost zero effort is to check what happens with an ARM processor.$ arm-linux-gnueabi-gcc -marm -O2 -o k1-arm-1 k1.c $ arm-linux-gnueabi-objdump -d k1-arm-1 | grep -A50 "<f1>:" 00008460 <f1>: 8460: e59f2020 ldr r2, [pc, #32] ; 8488 <f1+0x28> 8464: e3a00000 mov r0, #0 8468: e1a03000 mov r3, r0 846c: e5b21004 ldr r1, [r2, #4]! 8470: e0831001 add r1, r3, r1 8474: e2833001 add r3, r3, #1 8478: e3530064 cmp r3, #100 ; 0x64 847c: e0800001 add r0, r0, r1 8480: 1afffff9 bne 846c <f1+0xc> 8484: e12fff1e bx lr 8488: 00011028 .word 0x00011028 0000848c <f2>: 848c: e59f2020 ldr r2, [pc, #32] ; 84b4 <f2+0x28> 8490: e3a00000 mov r0, #0 8494: e1a03000 mov r3, r0 8498: e5b21004 ldr r1, [r2, #4]! 849c: e0831001 add r1, r3, r1 84a0: e2833001 add r3, r3, #1 84a4: e3530064 cmp r3, #100 ; 0x64 84a8: e0800001 add r0, r0, r1 84ac: 1afffff9 bne 8498 <f2+0xc> 84b0: e12fff1e bx lr 84b4: 00011028 .word 0x00011028 000084b8 <f2_2>: 84b8: e59f2020 ldr r2, [pc, #32] ; 84e0 <f2_2+0x28> 84bc: e3a00000 mov r0, #0 84c0: e1a03000 mov r3, r0 84c4: e5b21004 ldr r1, [r2, #4]! 84c8: e0831001 add r1, r3, r1 84cc: e2833001 add r3, r3, #1 84d0: e3530064 cmp r3, #100 ; 0x64 84d4: e0800001 add r0, r0, r1 84d8: 1afffff9 bne 84c4 <f2_2+0xc> 84dc: e12fff1e bx lr 84e0: 00011028 .word 0x00011028 000084e4 <f4>: 84e4: e59f2020 ldr r2, [pc, #32] ; 850c <f4+0x28> 84e8: e3a00000 mov r0, #0 84ec: e1a03000 mov r3, r0 84f0: e5b21004 ldr r1, [r2, #4]! 84f4: e0831001 add r1, r3, r1 84f8: e2833001 add r3, r3, #1 84fc: e3530064 cmp r3, #100 ; 0x64 8500: e0800001 add r0, r0, r1 8504: 1afffff9 bne 84f0 <f4+0xc> 8508: e12fff1e bx lr 850c: 00011028 .word 0x00011028I have compiled for ARM32 bits and enabled optimisations (similar results can be obtained compiling for
thumb
with flag -mthumb
). Again, the compiler generates exactly the same code.
Of course, it may be that there is some platform out there were re-using a parameter may make a difference but, after all these tests I would trend to believe that such a situation is pretty unlikely.
Squeezing Instructions
So, apparently using a parameter as a counter within a function has no effect and what we have shown that does not lead to a better solution just like that. However, it makes some sense to believe that re-using the parameter should produce some benefit. Actually that is the case, but it is not enough to just re-use the parameter. What we are going to do is to apply theregister
modifier to the function parameter and see what happens. Our new test function will now look like this:
int f2_3 (register int i) { register int r = 0; for (i =0; i < 100; i++) r += i; return r; }This function produces the following assembly:
$ objdump -M intel -d k | grep -A30 "<f2_3>:" 0000000000400574 <f2_3>: 400574: 55 push rbp 400575: 48 89 e5 mov rbp,rsp 400578: 53 push rbx 400579: bb 00 00 00 00 mov ebx,0x0 40057e: b8 00 00 00 00 mov eax,0x0 400583: eb 05 jmp 40058a <f2_3+0x16> --- LOOP ------------------------------------------------------------- 400585: 01 c3 add ebx,eax 400587: 83 c0 01 add eax,0x1 40058a: 83 f8 63 cmp eax,0x63 40058d: 7e f6 jle 400585 <f2_3+0x11> ---------------------------------------------------------------------- 40058f: 89 d8 mov eax,ebx 400591: 5b pop rbx 400592: 5d pop rbp 400593: c3 retCool. Now we have got rid of all the memory access and we are only using registers on our loop body. This is now a pretty fast implementation, similar to the one we got earlier for the function using the local variable. Let's now apply the
register
modifier also to the function declaring the local variable. We will obtain the following assembly:
00000000004005e5 <f5>: 4005e5: 55 push rbp 4005e6: 48 89 e5 mov rbp,rsp 4005e9: 41 54 push r12 4005eb: 53 push rbx 4005ec: 41 bc 00 00 00 00 mov r12d,0x0 4005f2: bb 00 00 00 00 mov ebx,0x0 4005f7: eb 06 jmp 4005ff <f5+0x1a> --- LOOP ------------------------------------------------------------- 4005f9: 41 01 dc add r12d,ebx 4005fc: 83 c3 01 add ebx,0x1 4005ff: 83 fb 63 cmp ebx,0x63 400602: 7e f5 jle 4005f9 <f5+0x14> ---------------------------------------------------------------------- 400604: 44 89 e0 mov eax,r12d 400607: 5b pop rbx 400608: 41 5c pop r12 40060a: 5d pop rbp 40060b: c3 retThis solution uses two extra instruction (
push r12d
and pop r12d
). Taking into account that we are running 100 iterations composed of 3 instructions each, the 2 additional instructions does not really add much to the final execution time/
Therefore, so far we can save 2 CPU instruction by reusing a function parameter within a function as far as we do not forget to add the register
keyword to parameter... otherwise, just reusing the parameter actually brings no benefit at all.
Conclusions
We have analysed the re-use of a parameter inside a function to save... something...We have used a very simple example and we have seen how just re-using the parameter brings no benefit at all. We have also checked that, at least for a simple example like this, the compiler will actually produce the same code for all source code variations despite of the tricks we could try to apply at the source code level. We have checked this with two different compilers and two different platforms. Finally, we had verified that, in order to save 2 CPU instructions we have to actually apply theregister
modifier to the parameter itself.
I believe that it is good to know the reasons behind a case like the one we have analysed. Trusting the compiler blindly, and expecting it to just solve the problem works most of the times, but it is always better to know what goes on under the hood... Well, that is just my opinion :)
LoL!