

source code optimization in c

published: 31-12-2009 / updated: 31-12-2009
posted in: algorithms, c, development, programming, tips
by Daniel Molina Wegener

"…premature optimization is the root of all evil".
— Donald Knuth

I agree that we should do our source-code-level optimizations once the construction stage is finished or almost complete. I was searching for articles and papers on optimizing C source code to apply to my programs and libraries, and I have collected some of those optimizations here. But you must not confuse algorithm optimization, source code optimization and compiler optimization: the first refers to algorithm design and the second just to the algorithm implementation, and they share only a few common approaches to formal reductions.

Usually source code optimization just applies well known formal reductions. We will not treat those formalities in this article; instead we will look at some examples that I have collected, allowing a common reasoning about how to optimize source code. Many ideas for source code optimization come from η-conversions using λ equivalences, allowing, for example, reductions from O(n) to O(log n). There are also optimizations without an apparent η-conversion, since most of the work is done at a low level, such as those taken by the compiler and translated into fewer assembler instructions, for example function inlining and parameter reduction. Most η-conversions imply η-reductions and are called strength reductions. But there is also the opposite side, where instead of doing reductions we populate our programs with more lines of code, including repeated code, as in the loop unrolling optimization.


inline code

C supports both macros and inline functions. Both features achieve the same effect: when the code is compiled, the inlined instructions are placed wherever we call the macro or the inline function. A typical example is a max(a, b) function or a MAX(a, b) macro.

/* the slowest expression, compiling and running */
int
max(int a, int b)
{
	if (a > b)
		return a;
	else
		return b;
}

/* normal expression with inlining */
inline int
max(int a, int b)
{
	return ((a > b) ? a : b);
}

/* the faster expression using macros */
#define MAX(a, b)			(((a) > (b)) ? (a) : (b))

The inline function and the macro definition are similar. The macro replaces every occurrence of a MAX(var1, var2) expression with the ((var1 > var2) ? var1 : var2) expression. The inline function does much the same, replacing every call to max(var1, var2) with the expression ((var1 > var2) ? var1 : var2). The difference is that macro usage is oriented to replacing entire blocks of code, while inline functions are oriented to replacing calls, skipping the assembler call instruction and its derivatives. We can see examples of the call opcode in typical system calls on operating systems, such as the FreeBSD system calls.
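One practical difference is worth keeping in mind: since the macro is a textual substitution, an argument with side effects can be evaluated twice. A minimal sketch of the pitfall, using illustrative variables i and j:

/* with the macro, i++ is pasted into the expansion and, when
   the condition is true, evaluated twice: i advances by two */
int i = 7, j = 5, m, n;
m = MAX(i++, j);	/* expands to (((i++) > (j)) ? (i++) : (j)) */

/* with the inline function, i++ is evaluated exactly once */
n = max(i++, j);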


parameter reduction

Reducing parameters implies fewer assembler instructions. The C calling convention typically pushes parameters onto the stack, which means that a push opcode is executed for each parameter. For example, you can group function arguments into a struct when they are used together by a group of functions. Think about it: each instruction in the push family takes from 2 up to 18 clocks on the x86 architecture, depending on vendor and model. For eight arguments that means eight pushes, against a single push for one pointer argument. This approach to argument reduction can be applied to functions which are not inlined.

/* slower version */
int
my_function(int arg1, int arg2, int arg3, int arg4)
{
	/* do something... */
}

/* faster version */
struct myargs
{
	int arg1, arg2, arg3, arg4;
};

int
my_function(struct myargs *args)
{
	/* do something... */
}
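A caller-side sketch, assuming the my_function() and myargs names above: we fill the structure once and pass a single pointer, so repeated calls push one argument instead of four.

/* fill the structure once, then pass a single pointer */
struct myargs args;
args.arg1 = 1;
args.arg2 = 2;
args.arg3 = 3;
args.arg4 = 4;
my_function(&args);
my_function(&args);	/* later calls reuse the same grouped arguments */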


practical reductions

We do a reduction when we remove unnecessary steps from our functions. We can do most reductions just by thinking a little about the code, and there are also some well known reductions which can be used as optimizations.

removing else

/* with an else branch (a conditional expression): smaller code, but slower */
inline int
test(int a)
{
	return a > 0 ? 1 : 0;
}

/* without else: larger code, but faster */
inline int
test(int a)
{
	if (a > 0)
		return 1;
	/* implied else */
	return 0;
}

With this optimization we remove jmp and jxx family instructions, most of which take around 7 clocks on the x86 architecture, plus the instructions required to set up the proper context for the next instruction in the program sequence. This is similar to the spartan programming style, where most code is minimized through reductions and similar programming tasks.
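When the 0/1 result is all we need, we can go one step further and drop the branch entirely, since a relational expression in C already yields 0 or 1. A minimal sketch under that assumption:

/* branchless version: the comparison itself produces 0 or 1 */
inline int
test(int a)
{
	return a > 0;
}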

bitwise operations

Bitwise operations are cheaper than other operations. For example, and curiously, shift and addition operations take fewer clocks than multiplication and division operations. The same happens with logical evaluations; in another post I wrote about logical minimizations. To complement what I said there, the mul instruction takes from 10 to 40 clocks and the div instruction from 15 to 40 clocks on the x86 architecture, against add, sub, shr, shl and similar instructions that take from 2 to 7 clocks. Remember that the number of clocks used by an instruction is vendor and model dependent. For example, if we do:

/* a call to the math.h function pow() */
n = y * pow(2.0, x);

/* we can replace it by */
n = y << x;

In other words, the y << x operation is equivalent to y * 2^x. You can find other math equivalences too. From this basic approach we can reach other kinds of optimizations through the proper maths. You can take a look at the bit wizardry page from Jörg Arndt if you want to look for more bitwise operations and reductions.
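Two related equivalences hold for unsigned integers; a modern compiler will often apply them automatically for constant divisors, but they are still useful to keep in mind. A sketch with an illustrative unsigned variable u:

/* division by a power of two becomes a right shift */
unsigned int u = 1000u;
unsigned int q1 = u / 8;	/* slower: div */
unsigned int q2 = u >> 3;	/* faster: shr, same result for unsigned values */

/* modulo by a power of two becomes a bitwise and */
unsigned int r1 = u % 8;	/* slower: div */
unsigned int r2 = u & 7;	/* faster: and, same result for unsigned values */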

reversing counter loops

Usually we write one-step loops over a fixed range of counters. The following example shows a reduction which can be used every time we can loop in reverse.

/* for loop coded usually */
for (c = 0; c < MAX_C_VALUE; c++) {
	/* do something... */
}

/* for reversed loop */
for (c = MAX_C_VALUE; c--; /* nothing to do here */ ) {
	/* do something... */
}

table lookups

A table lookup is a technique where we create an array or a similar structure to look up data, usually previously calculated data, or simply concrete data that we want to use.

/* we can have a switch statement */
ourtype_t *varn = NULL;
switch (var1) {
case 0:
	varn = value1; break;
case 1:
	varn = value2; break;
case 2:
	varn = value3; break;
default:
	varn = value1;
}

/* or have an if/else statement */
if (var1 == 0)
	varn = value1;
else if (var1 == 1)
	varn = value2;
else if (var1 == 2)
	varn = value3;
else
	varn = value1;

/* so we can simply replace those values using arrays,
   assuming 0 <= var1 <= 2 */
ourtype_t mapsvalues[] = { value1, value2, value3 };
varn = mapsvalues[var1];

We can also set up reductions by creating lookup tables and state machines, mapping certain variable data to certain functions. State machines are a powerful abstraction which allows us to code different states inside data structures, as sketched below.
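A hedged sketch of that idea, assuming three hypothetical handler functions and a state variable: a small table of function pointers maps each state number directly to the code that handles it.

/* hypothetical handlers for three states */
void handle_idle(void);
void handle_running(void);
void handle_done(void);

/* the lookup table maps a state number directly to its handler */
void (*handlers[])(void) = { handle_idle, handle_running, handle_done };

/* dispatching becomes a single indexed call instead of a switch,
   assuming 0 <= state <= 2 */
handlers[state]();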

We can also have predefined values in our lookup tables. In that case we are doing a kind of lazy evaluation: every time a constant value is requested rather than recalculated each time we work with it, we avoid repeated computation.
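A minimal sketch of that precomputation, assuming we repeatedly need small factorials: the table is written once and every later request becomes a plain array access.

/* precomputed factorials for 0..9 */
static const unsigned long factorials[10] = {
	1UL, 1UL, 2UL, 6UL, 24UL, 120UL, 720UL,
	5040UL, 40320UL, 362880UL
};

/* instead of recomputing n! on every request, we index the table,
   assuming 0 <= n <= 9 */
unsigned long f = factorials[n];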

register keyword

/* the register keyword may help the compiler create faster code;
   it is only a hint */
register int counter;
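A small usage sketch with a hypothetical sum_array() helper; note that register is only a hint, and modern compilers usually do their own register allocation anyway.

/* hypothetical helper: ask the compiler to keep the counter
   in a register while summing the array */
int
sum_array(const int *arr, int len)
{
	register int i;
	int total = 0;

	for (i = 0; i < len; i++)
		total += arr[i];
	return total;
}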


reduce data access computation

If we have deeply constructed data structures (struct in C), every time we access the deeper nodes we are using pointer arithmetic, which implies basic math operations to reach certain parts of our structures. Here we can do two things: alias creation and usage, and padding adjustment. Alias usage means that we use an internal pointer to access a structure member directly, so we omit the pointer calculation each time we access it. Padding adjustment is simply about creating the proper order of data structure members.

/* structure without the proper padding */
struct my_struct {
	char *a;
	void *b;
	int c;
	double n;
	char *x;
};

/* the same structure with the proper padding */
struct my_struct {
	double n;
	int c;
	char *a;
	char *x;
	void *b;
};

Note that the data access computation is reduced at the assembler level, not in the C code; here we have an implied and invisible η-reduction. For deep and nested data structures we have the same issue. As a rule, the structure should have an ideal size of s = 2^n, with s as the structure size and n chosen so that the padding adjusts the size to a power of two.
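A quick way to check the result is to print the size the compiler actually produced, padding included; a minimal sketch, assuming the reordered my_struct above:

/* requires <stdio.h>; %zu is the C99 format for size_t */
printf("sizeof(struct my_struct) = %zu\n", sizeof(struct my_struct));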

/* some structs with nested members */
struct a {
	int m1;
};

struct b {
	int m2;
	struct a *m3;
};

struct c {
	int m4;
	struct b *m5;
};

/* the more expensive way to access its members */
struct c *x = some_function_returning_c();
x->m5->m3->m1 = some_other_function();
if (x->m5->m3->m1 != 0) {
	another_function(x->m5->m3->m1);
}

/* the cheaper version to access its members,
   using an alias */
struct c *x = some_function_returning_c();
struct a *p;
p = x->m5->m3;
p->m1 = some_other_function();
if (p->m1 != 0) {
	another_function(p->m1);
}


loop unrolling

Each step of a loop repeats some instructions. For example, if we have a fixed size array whose elements we must process with a certain function, we can use unrolled loops.

/* normal iteration over fixed size array */
for (i = 0; i < 100; i++) {
	call_some_function(my_array[i]);
}

/* loop unrolling applied together with a reverse loop;
   relies on the size (100) being a multiple of 5 */
for (i = 100; i--;) {
	call_some_function(my_array[i]);
	call_some_function(my_array[--i]);
	call_some_function(my_array[--i]);
	call_some_function(my_array[--i]);
	call_some_function(my_array[--i]);
}

Note that here we have a fixed size array; in other cases it is hard to know the array size. We can also rely on compiler optimization to unroll loops, if our compiler has the proper option. For example, GCC supports automatic loop unrolling with the -funroll-loops flag.
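When the length is not known to be a multiple of the unroll factor, a common pattern is to unroll the bulk of the iterations and finish the remainder with a plain loop. A sketch under that assumption, with a hypothetical length n:

/* unroll by four for the bulk of the array, then finish the tail */
i = 0;
for (; i + 4 <= n; i += 4) {
	call_some_function(my_array[i]);
	call_some_function(my_array[i + 1]);
	call_some_function(my_array[i + 2]);
	call_some_function(my_array[i + 3]);
}
for (; i < n; i++)
	call_some_function(my_array[i]);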


loop jamming

Reusing blocks of code inside loops matters.

/* here we have two loops for something that can
   be done in one loop */
for (i = 0; i < 100; i++) {
	some_function_a(my_array[i]);
}
for (i = 0; i < 100; i += 10) {
	some_function_b(my_array[i]);
}

/* here we do the same work in a single loop */
for (i = 0; i < 100; i++) {
	some_function_a(my_array[i]);
	if ((i % 10) == 0) {
		some_function_b(my_array[i]);
	}
}

In the example above we have a single loop doing the same tasks with fewer operations, so we have reduced the steps from O(2n) to O(n). Note that O(2n) is just symbolic and not strict.


bit padding matters

This depends on the architecture. Processing data types shorter or larger than the register size is not cheaper than using register-sized types; it is often more expensive. For example, if we use char or short we are not using a complete register, which is harder to handle than register-length variables such as int and long.
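A minimal sketch of the idea with a simple accumulation loop: keeping the counter and the accumulator at the register width avoids the extra masking and sign-extension work that sub-register types can force on the compiler.

/* sub-register types: the compiler may emit extra masking work */
unsigned char c;
short s = 0;
for (c = 0; c < 100; c++)
	s += c;

/* register-width types: the values fit the registers directly */
int i;
long total = 0;
for (i = 0; i < 100; i++)
	total += i;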



comments on “source code optimization in c”

  1. Keep writing these articles, Daniel. Just one detail: inline is supported, but it is not part of the ANSI C (1989) standard.

