Using a parameter to iterate a loop

PROGRAMMING
Using a parameter to iterate a loop
2017-10-24 | by David "DeMO" Martínez Oliveira

Some years ago I participate in a challenge. It went fine and all participants got some feedback afterwards so we could figure out what we had done wrong. As part of that feedback, the question of re-using parameters within function rose... so I decided to look into the details.
The challenge consisted on writing the code for a function that, overall, had to run a loop. I cannot really remember all the details. In my solution I created a local variable to iterate through the loop. As mentioned above, it was considered a better solution to re-use the parameter received by the function as a counter for the loop (at least that is what I remember :). So, let's find out if that is correct!.

You may be confused by my poor description. In that case, just check the code below:

#include <stdio.h>

int f1 (int v) {
int i, r = 0;

for (i = 0; i < 100; i++) r += i;

return r;
}

int f2 (int i) {
int r = 0;

for (i = 0; i < 100; i++) r += i;
return r;
}

int main () {
printf ("Result : %d\n", f1 (10));
printf ("Result : %d\n", f2 (10));
}

Function f1 was my solution while, f2 was the expected solution. For me, that sounded a bit weird but, at that time, I moved on to other things and forgot about this. For some reason this have just come back to my mind and I decide to look into the details, to see if f2 was actually a better solution.

At first glance, at the source code level, f2 is shorter (we are saving a variable declaration) and also, it looks kind of somehow smarter.... We could even argument that we are using less memory... but...

# Let's Look at the code

We will start our analysis, looking at generated object code for both functions. I will analyse the x86 64bits case. I also checked the 32bits version and there is no big difference so I will just talk about the 64 bits version because it is just mainstream nowadays. This is what you get after compiling the test program:

\$ gcc -o k k.c
\$ objdump -M intel -d k | grep -A30 "<f1>:"
00000000004004f4 <f1>:
4004f4:	55                   	push   rbp
4004f5:	48 89 e5             	mov    rbp,rsp
4004f8:	89 7d ec             	mov    DWORD PTR [rbp-0x14],edi
4004fb:	c7 45 fc 00 00 00 00 	mov    DWORD PTR [rbp-0x4],0x0
400502:	c7 45 f8 00 00 00 00 	mov    DWORD PTR [rbp-0x8],0x0
400509:	eb 0a                	jmp    400515 <f1+0x21>
40050b:	8b 45 f8             	mov    eax,DWORD PTR [rbp-0x8]
40050e:	01 45 fc             	add    DWORD PTR [rbp-0x4],eax
400511:	83 45 f8 01          	add    DWORD PTR [rbp-0x8],0x1
400515:	83 7d f8 63          	cmp    DWORD PTR [rbp-0x8],0x63
400519:	7e f0                	jle    40050b <f1+0x17>
40051b:	8b 45 fc             	mov    eax,DWORD PTR [rbp-0x4]
40051e:	5d                   	pop    rbp
40051f:	c3                   	ret

0000000000400520 <f2>:
400520:	55                   	push   rbp
400521:	48 89 e5             	mov    rbp,rsp
400524:	89 7d ec             	mov    DWORD PTR [rbp-0x14],edi
400527:	c7 45 fc 00 00 00 00 	mov    DWORD PTR [rbp-0x4],0x0
40052e:	c7 45 f8 00 00 00 00 	mov    DWORD PTR [rbp-0x8],0x0
400535:	eb 0a                	jmp    400541 <f2+0x21>
400537:	8b 45 f8             	mov    eax,DWORD PTR [rbp-0x8]
40053a:	01 45 fc             	add    DWORD PTR [rbp-0x4],eax
40053d:	83 45 f8 01          	add    DWORD PTR [rbp-0x8],0x1
400541:	83 7d f8 63          	cmp    DWORD PTR [rbp-0x8],0x63
400545:	7e f0                	jle    400537 <f2+0x17>
400547:	8b 45 fc             	mov    eax,DWORD PTR [rbp-0x4]
40054a:	5d                   	pop    rbp
40054b:	c3                   	ret

WoW, look at that!... the compiler has generated exactly the same code for both cases. Or in other words, there is no practical difference re-using the parameter to run the loop. But there is actually a disadvantage on doing that. Such a disadvantage is hard to see with these simple functions, so, let's convert them into something that looks a bit more real.

# A more real life example

Let's make our function iterate through some array and do some operation with its values. Following well-known best practice on SW development, let's give our variables meaningful names. Something like this:

#define MAX_PERSONS 100
int person_age[MAX_PERSONS];

int f1 (int flag) {
int i, sum_age = 0;

if (flag < 10) return -1;
for (i = 0; i < MAX_PERSONS; i++) sum_age += person_age[i];
return r;
}

int f2 (int flag) {
int sum_age = 0;

if (flag < 10) return -1;
for (flag = 0; flag < MAX_PERSONS; flag++) sum_age += person_age[flag];
return r;
}

Can you see the problem now?... the code reusing the parameter looks pretty weird. The function code does not make much sense and it is even misleading. However, f1 is easily readable and understandable... specially for a former FORTRAN programmer :).

# Early Optimisation is the root...

You have probably heard this before: Early optimisation is the root of all evils. Always do optimisations as late as possible. In this simple case, it is arguable that using the parameter as a counter in a loop is an optimisation. It may be seen as some source code level optimisation.

However, let's see what happens if we would like to optimise the functions. Imagine, that the loop takes quite some time and we want to make it faster. We could modify our function into something like this:

int f4 (int v) {
register int i, r =0;

for (i = 0; i < 100; i++) r += i;
return r;
}

We just instruct the compiler to use registers to store our variables. Let's see the generated code for this function:

00000000004005bb <f4>:
4005bb:	55                   	push   rbp
4005bc:	48 89 e5             	mov    rbp,rsp
4005bf:	41 54                	push   r12
4005c1:	53                   	push   rbx
4005c2:	89 7d ec             	mov    DWORD PTR [rbp-0x14],edi
4005c5:	41 bc 00 00 00 00    	mov    r12d,0x0
4005cb:	bb 00 00 00 00       	mov    ebx,0x0
4005d0:	eb 06                	jmp    4005d8 <f4+0x1d>
-- LOOP ----------------------------------------------------------
4005d2:	41 01 dc             	add    r12d,ebx
4005d5:	83 c3 01             	add    ebx,0x1
4005d8:	83 fb 63             	cmp    ebx,0x63
4005db:	7e f5                	jle    4005d2 <f4+0x17>
---------------------------------------------------------------------
4005dd:	44 89 e0             	mov    eax,r12d
4005e0:	5b                   	pop    rbx
4005e1:	41 5c                	pop    r12
4005e3:	5d                   	pop    rbp
4005e4:	c3                   	ret

The initialisation part of the function becomes a bit longer but, let's look at the loop body. We are operating on registers instead of performing indirect memory accesses to get variable values. This code is a lot faster!. To be fair, let's also modify the r variable with the register keyword for f2:

int f2_3 (int i) {
register int r = 0;

for (i =0; i < 100; i++) r += i;
return r;
}

This C code will produce the following assembly:

\$  objdump -M intel -d k | grep -A30 "<f2_2>:"
000000000040054c <f2_2>:
40054c:	55                   	push   rbp
40054d:	48 89 e5             	mov    rbp,rsp
400550:	53                   	push   rbx
400551:	89 7d e4             	mov    DWORD PTR [rbp-0x1c],edi
400554:	bb 00 00 00 00       	mov    ebx,0x0
400559:	c7 45 f4 00 00 00 00 	mov    DWORD PTR [rbp-0xc],0x0
400560:	eb 07                	jmp    400569 <f2_2+0x1d>
--- LOOP -------------------------------------------------------------
400562:	03 5d f4             	add    ebx,DWORD PTR [rbp-0xc]
400565:	83 45 f4 01          	add    DWORD PTR [rbp-0xc],0x1
400569:	83 7d f4 63          	cmp    DWORD PTR [rbp-0xc],0x63
40056d:	7e f3                	jle    400562 <f2_2+0x16>
----------------------------------------------------------------------
40056f:	89 d8                	mov    eax,ebx
400571:	5b                   	pop    rbx
400572:	5d                   	pop    rbp
400573:	c3                   	ret

This is indeed, better than the original. We are storing values on EBX saving a memory access, however, the loop body for this version performs 3 memory access. For such a small array, it is very likely that data will be cached and therefore the difference on performance will be very small.

I believe this has been an interesting discussion so far. However, there are many people claiming that, nowadays, programmers shouldn't care about optimisation. They should leave that work to the compiler who, in most cases, would do a better job that the humans.

So, once again, let's check this, recompiling our program with option -O2:

\$ gcc -O2 -o k k.c
\$ objdump -M intel -d k | grep -A35 "<f1>:"
0000000000400590 <f1>:
400590:	b8 56 13 00 00       	mov    eax,0x1356
400595:	c3                   	ret
400596:	66 2e 0f 1f 84 00 00 	nop    WORD PTR cs:[rax+rax*1+0x0]
40059d:	00 00 00

00000000004005a0 <f2>:
4005a0:	b8 56 13 00 00       	mov    eax,0x1356
4005a5:	c3                   	ret
4005a6:	66 2e 0f 1f 84 00 00 	nop    WORD PTR cs:[rax+rax*1+0x0]

00000000004005b0 <f2_2>:
4005b0:	b8 56 13 00 00       	mov    eax,0x1356
4005b5:	c3                   	ret
4005b6:	66 2e 0f 1f 84 00 00 	nop    WORD PTR cs:[rax+rax*1+0x0]
4005bd:	00 00 00

00000000004005c0 <f3>:
4005c0:	b8 56 13 00 00       	mov    eax,0x1356
4005c5:	c3                   	ret
4005c6:	66 2e 0f 1f 84 00 00 	nop    WORD PTR cs:[rax+rax*1+0x0]
4005cd:	00 00 00

00000000004005d0 <f4>:
4005d0:	b8 56 13 00 00       	mov    eax,0x1356
4005d5:	c3                   	ret

LoL... As you can see, the compiler is smart enough to figure out that we are actually calculating a constant and it is just returning that value back. Interesting enough, the compiler generates exactly the same code for all the functions. I would say that this is not very representative because of the simplicity of the code so, let's try with a slight modification of our test functions.

# Evaluating the Compiler Optimisations

Let's re-write our test functions like this:

int buffer[100];

int f1 (int v) {
int i, r = 0;

for (i = 0; i < 100; i++) r += (i + buffer[i]);

return r;
}

int f2 (int i) {
int r = 0;

for (i =0; i < 100; i++) r += (i + buffer[i]);
return r;
}

int f2_2 (int i) {
register int r = 0;

for (i =0; i < 100; i++) r += (i + buffer[i]);
return r;
}

int f4 (int v) {
register int i, r =0;

for (i = 0; i < 100; i++) r += (i + buffer[i]);
return r;
}

These set of functions produces the following assembly:

\$  gcc -O2 -o k1 k1.c
\$  objdump -M intel -d k1 | grep -A50 "<f1>:"
00000000004005f0 <f1>:
4005f0:	31 d2                	xor    edx,edx
4005f2:	31 c0                	xor    eax,eax
4005f4:	0f 1f 40 00          	nop    DWORD PTR [rax+0x0]
4005f8:	8b 0c 95 40 10 60 00 	mov    ecx,DWORD PTR [rdx*4+0x601040]
400601:	48 83 c2 01          	add    rdx,0x1
400607:	48 83 fa 64          	cmp    rdx,0x64
40060b:	75 eb                	jne    4005f8 <f1+0x8>
40060d:	f3 c3                	repz ret
40060f:	90                   	nop

0000000000400610 <f2>:
400610:	31 d2                	xor    edx,edx
400612:	31 c0                	xor    eax,eax
400614:	0f 1f 40 00          	nop    DWORD PTR [rax+0x0]
400618:	8b 0c 95 40 10 60 00 	mov    ecx,DWORD PTR [rdx*4+0x601040]
400621:	48 83 c2 01          	add    rdx,0x1
400627:	48 83 fa 64          	cmp    rdx,0x64
40062b:	75 eb                	jne    400618 <f2+0x8>
40062d:	f3 c3                	repz ret
40062f:	90                   	nop

0000000000400630 <f2_2>:
400630:	31 d2                	xor    edx,edx
400632:	31 c0                	xor    eax,eax
400634:	0f 1f 40 00          	nop    DWORD PTR [rax+0x0]
400638:	8b 0c 95 40 10 60 00 	mov    ecx,DWORD PTR [rdx*4+0x601040]
400641:	48 83 c2 01          	add    rdx,0x1
400647:	48 83 fa 64          	cmp    rdx,0x64
40064b:	75 eb                	jne    400638 <f2_2+0x8>
40064d:	f3 c3                	repz ret
40064f:	90                   	nop

0000000000400650 <f4>:
400650:	31 d2                	xor    edx,edx
400652:	31 c0                	xor    eax,eax
400654:	0f 1f 40 00          	nop    DWORD PTR [rax+0x0]
400658:	8b 0c 95 40 10 60 00 	mov    ecx,DWORD PTR [rdx*4+0x601040]
400661:	48 83 c2 01          	add    rdx,0x1
400667:	48 83 fa 64          	cmp    rdx,0x64
40066b:	75 eb                	jne    400658 <f4+0x8>
40066d:	f3 c3                	repz ret
40066f:	90                   	nop

As we can see, at least for this simple code, the compiler is actually producing the same code for all the different implementations, despite of the use of the register modifier!!!

# The compiler factor

OK, this may just happen with gcc so, it would be good to check if we will get the same results with other compilers. Let's recompile our simple test program using CLANG/LLVM and see if we get different results.

\$  clang -O2 -o k1-llvm k1.c
\$  objdump -M intel -d k1-llvm | grep -A60 "<f1>:"
0000000000400500 <f1>:
400500:	31 c9                	xor    ecx,ecx
400502:	31 c0                	xor    eax,eax
400504:	66 66 66 2e 0f 1f 84 	data32 data32 nop WORD PTR cs:[rax+rax*1+0x0]
40050b:	00 00 00 00 00
400510:	8b 14 85 30 10 60 00 	mov    edx,DWORD PTR [rax*4+0x601030]
400519:	8d 0c 10             	lea    ecx,[rax+rdx*1]
40051c:	48 ff c0             	inc    rax
40051f:	83 f8 64             	cmp    eax,0x64
400522:	75 ec                	jne    400510 <f1+0x10>
400524:	8d 44 10 ff          	lea    eax,[rax+rdx*1-0x1]
400528:	c3                   	ret
400529:	0f 1f 80 00 00 00 00 	nop    DWORD PTR [rax+0x0]

0000000000400530 <f2>:
400530:	31 c9                	xor    ecx,ecx
400532:	31 c0                	xor    eax,eax
400534:	66 66 66 2e 0f 1f 84 	data32 data32 nop WORD PTR cs:[rax+rax*1+0x0]
40053b:	00 00 00 00 00
400540:	8b 14 85 30 10 60 00 	mov    edx,DWORD PTR [rax*4+0x601030]
400549:	8d 0c 10             	lea    ecx,[rax+rdx*1]
40054c:	48 ff c0             	inc    rax
40054f:	83 f8 64             	cmp    eax,0x64
400552:	75 ec                	jne    400540 <f2+0x10>
400554:	8d 44 10 ff          	lea    eax,[rax+rdx*1-0x1]
400558:	c3                   	ret
400559:	0f 1f 80 00 00 00 00 	nop    DWORD PTR [rax+0x0]

0000000000400560 <f2_2>:
400560:	31 c9                	xor    ecx,ecx
400562:	31 c0                	xor    eax,eax
400564:	66 66 66 2e 0f 1f 84 	data32 data32 nop WORD PTR cs:[rax+rax*1+0x0]
40056b:	00 00 00 00 00
400570:	8b 14 85 30 10 60 00 	mov    edx,DWORD PTR [rax*4+0x601030]
400579:	8d 0c 10             	lea    ecx,[rax+rdx*1]
40057c:	48 ff c0             	inc    rax
40057f:	83 f8 64             	cmp    eax,0x64
400582:	75 ec                	jne    400570 <f2_2+0x10>
400584:	8d 44 10 ff          	lea    eax,[rax+rdx*1-0x1]
400588:	c3                   	ret
400589:	0f 1f 80 00 00 00 00 	nop    DWORD PTR [rax+0x0]

0000000000400590 <f4>:
400590:	31 c9                	xor    ecx,ecx
400592:	31 c0                	xor    eax,eax
400594:	66 66 66 2e 0f 1f 84 	data32 data32 nop WORD PTR cs:[rax+rax*1+0x0]
40059b:	00 00 00 00 00
4005a0:	8b 14 85 30 10 60 00 	mov    edx,DWORD PTR [rax*4+0x601030]
4005a9:	8d 0c 10             	lea    ecx,[rax+rdx*1]
4005ac:	48 ff c0             	inc    rax
4005af:	83 f8 64             	cmp    eax,0x64
4005b2:	75 ec                	jne    4005a0 <f4+0x10>
4005b4:	8d 44 10 ff          	lea    eax,[rax+rdx*1-0x1]
4005b8:	c3                   	ret
4005b9:	0f 1f 80 00 00 00 00 	nop    DWORD PTR [rax+0x0]

As we can see, code is a bit different but, once again, the compiler produces exactly the same code for all the different versions of the function, which is consistent with our previous results with gcc.

# Other Platforms?

So far, we have seen that the effect of re-using a parameter in a function has little impact on the code... but, this may just happen on Intel processors. I cannot test every single platform out there and, in all honesty, I do not care enough about this topic to go that far. But what I can do with almost zero effort is to check what happens with an ARM processor.

\$  arm-linux-gnueabi-gcc -marm -O2 -o k1-arm-1 k1.c
\$  arm-linux-gnueabi-objdump -d k1-arm-1 | grep -A50 "<f1>:"
00008460 <f1>:
8460:	e59f2020 	ldr	r2, [pc, #32]	; 8488 <f1+0x28>
8464:	e3a00000 	mov	r0, #0
8468:	e1a03000 	mov	r3, r0
846c:	e5b21004 	ldr	r1, [r2, #4]!
8470:	e0831001 	add	r1, r3, r1
8474:	e2833001 	add	r3, r3, #1
8478:	e3530064 	cmp	r3, #100	; 0x64
847c:	e0800001 	add	r0, r0, r1
8480:	1afffff9 	bne	846c <f1+0xc>
8484:	e12fff1e 	bx	lr
8488:	00011028 	.word	0x00011028

0000848c <f2>:
848c:	e59f2020 	ldr	r2, [pc, #32]	; 84b4 <f2+0x28>
8490:	e3a00000 	mov	r0, #0
8494:	e1a03000 	mov	r3, r0
8498:	e5b21004 	ldr	r1, [r2, #4]!
849c:	e0831001 	add	r1, r3, r1
84a0:	e2833001 	add	r3, r3, #1
84a4:	e3530064 	cmp	r3, #100	; 0x64
84a8:	e0800001 	add	r0, r0, r1
84ac:	1afffff9 	bne	8498 <f2+0xc>
84b0:	e12fff1e 	bx	lr
84b4:	00011028 	.word	0x00011028

000084b8 <f2_2>:
84b8:	e59f2020 	ldr	r2, [pc, #32]	; 84e0 <f2_2+0x28>
84bc:	e3a00000 	mov	r0, #0
84c0:	e1a03000 	mov	r3, r0
84c4:	e5b21004 	ldr	r1, [r2, #4]!
84c8:	e0831001 	add	r1, r3, r1
84cc:	e2833001 	add	r3, r3, #1
84d0:	e3530064 	cmp	r3, #100	; 0x64
84d4:	e0800001 	add	r0, r0, r1
84d8:	1afffff9 	bne	84c4 <f2_2+0xc>
84dc:	e12fff1e 	bx	lr
84e0:	00011028 	.word	0x00011028

000084e4 <f4>:
84e4:	e59f2020 	ldr	r2, [pc, #32]	; 850c <f4+0x28>
84e8:	e3a00000 	mov	r0, #0
84ec:	e1a03000 	mov	r3, r0
84f0:	e5b21004 	ldr	r1, [r2, #4]!
84f4:	e0831001 	add	r1, r3, r1
84f8:	e2833001 	add	r3, r3, #1
84fc:	e3530064 	cmp	r3, #100	; 0x64
8500:	e0800001 	add	r0, r0, r1
8504:	1afffff9 	bne	84f0 <f4+0xc>
8508:	e12fff1e 	bx	lr
850c:	00011028 	.word	0x00011028

I have compiled for ARM32 bits and enabled optimisations (similar results can be obtained compiling for thumb with flag -mthumb). Again, the compiler generates exactly the same code.

Of course, it may be that there is some platform out there were re-using a parameter may make a difference but, after all these tests I would trend to believe that such a situation is pretty unlikely.

# Squeezing Instructions

So, apparently using a parameter as a counter within a function has no effect and what we have shown that does not lead to a better solution just like that. However, it makes some sense to believe that re-using the parameter should produce some benefit. Actually that is the case, but it is not enough to just re-use the parameter.

What we are going to do is to apply the register modifier to the function parameter and see what happens. Our new test function will now look like this:

int f2_3 (register int i) {
register int r = 0;

for (i =0; i < 100; i++) r += i;
return r;
}

This function produces the following assembly:

\$  objdump -M intel -d k | grep -A30 "<f2_3>:"
0000000000400574 <f2_3>:
400574:	55                   	push   rbp
400575:	48 89 e5             	mov    rbp,rsp
400578:	53                   	push   rbx
400579:	bb 00 00 00 00       	mov    ebx,0x0
40057e:	b8 00 00 00 00       	mov    eax,0x0
400583:	eb 05                	jmp    40058a <f2_3+0x16>
--- LOOP -------------------------------------------------------------
400587:	83 c0 01             	add    eax,0x1
40058a:	83 f8 63             	cmp    eax,0x63
40058d:	7e f6                	jle    400585 <f2_3+0x11>
----------------------------------------------------------------------
40058f:	89 d8                	mov    eax,ebx
400591:	5b                   	pop    rbx
400592:	5d                   	pop    rbp
400593:	c3                   	ret

Cool. Now we have got rid of all the memory access and we are only using registers on our loop body. This is now a pretty fast implementation, similar to the one we got earlier for the function using the local variable. Let's now apply the register modifier also to the function declaring the local variable. We will obtain the following assembly:

00000000004005e5 <f5>:
4005e5:	55                   	push   rbp
4005e6:	48 89 e5             	mov    rbp,rsp
4005e9:	41 54                	push   r12
4005eb:	53                   	push   rbx
4005ec:	41 bc 00 00 00 00    	mov    r12d,0x0
4005f2:	bb 00 00 00 00       	mov    ebx,0x0
4005f7:	eb 06                	jmp    4005ff <f5+0x1a>
--- LOOP -------------------------------------------------------------
4005f9:	41 01 dc             	add    r12d,ebx
4005fc:	83 c3 01             	add    ebx,0x1
4005ff:	83 fb 63             	cmp    ebx,0x63
400602:	7e f5                	jle    4005f9 <f5+0x14>
----------------------------------------------------------------------
400604:	44 89 e0             	mov    eax,r12d
400607:	5b                   	pop    rbx
400608:	41 5c                	pop    r12
40060a:	5d                   	pop    rbp
40060b:	c3                   	ret

This solution uses two extra instruction (push r12d and pop r12d). Taking into account that we are running 100 iterations composed of 3 instructions each, the 2 additional instructions does not really add much to the final execution time/

Therefore, so far we can save 2 CPU instruction by reusing a function parameter within a function as far as we do not forget to add the register keyword to parameter... otherwise, just reusing the parameter actually brings no benefit at all.

# Conclusions

We have analysed the re-use of a parameter inside a function to save... something...We have used a very simple example and we have seen how just re-using the parameter brings no benefit at all.

We have also checked that, at least for a simple example like this, the compiler will actually produce the same code for all source code variations despite of the tricks we could try to apply at the source code level. We have checked this with two different compilers and two different platforms. Finally, we had verified that, in order to save 2 CPU instructions we have to actually apply the register modifier to the parameter itself.

I believe that it is good to know the reasons behind a case like the one we have analysed. Trusting the compiler blindly, and expecting it to just solve the problem works most of the times, but it is always better to know what goes on under the hood... Well, that is just my opinion :)

LoL!