One optimization trick for C++

In one of his trainings, Andrei touched on a very interesting subject: optimizing C++ (or code in general) with respect to how the code is laid out, or basically understanding what the assembly will look like. In this post I would like to dig a little deeper into the subject.

Meet d10

The task is simple:

Count how many base-10 digits a given uint64_t value consists of.

For the purpose of this post (to avoid repetition) let’s assume that every presented implementation of the above task has the following declaration:

uint32_t d10(uint64_t value);

In order to neatly group all the constants in our code, the following enumeration is used:

enum Base10 : uint64_t {
	D1Min	= 0UL,
	D1Max	= 9UL,
	D2Min	= 10UL,
	D2Max	= 99UL,
	D3Min	= 100UL,
	D3Max	= 999UL,
	D4Min	= 1000UL,
	D4Max	= 9999UL,
	D5Min	= 10000UL,
	D5Max	= 99999UL,
	D6Min	= 100000UL,
	D6Max	= 999999UL,
	D7Min	= 1000000UL,
	D7Max	= 9999999UL,
	D8Min	= 10000000UL,
	D8Max	= 99999999UL,
	D9Min	= 100000000UL,
	D9Max	= 999999999UL,
	D10Min	= 1000000000UL,
	D10Max	= 9999999999UL,
	D11Min	= 10000000000UL,
	D11Max	= 99999999999UL,
	D12Min	= 100000000000UL,
	D12Max	= 999999999999UL,
	D13Min	= 1000000000000UL,
	D13Max	= 9999999999999UL,
	D14Min	= 10000000000000UL,
	D14Max	= 99999999999999UL,
	D15Min	= 100000000000000UL,
	D15Max	= 999999999999999UL,
	D16Min	= 1000000000000000UL,
	D16Max	= 9999999999999999UL,
	D17Min	= 10000000000000000UL,
	D17Max	= 99999999999999999UL,
	D18Min	= 100000000000000000UL,
	D18Max	= 999999999999999999UL,
	D19Min	= 1000000000000000000UL,
	D19Max	= 9999999999999999999UL,
	D20Min	= 10000000000000000000UL,
	D20Max	= 18446744073709551615UL
};

The important thing to remember is that the enumeration values become part of the instruction stream and do not have to be referenced via pointers, temporaries, or the stack.

Performance tests should be interpreted as follows: smaller is better. In other words, a score of 200% means that the given algorithm was twice as slow as the baseline.

Specs of the test machine: Intel Pentium Dual Core T4200 2GHz, 4GB DDR2, Ubuntu 12.04, GCC 4.6.3, -O3, 64-bit.
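
A minimal sketch of how such a measurement can be driven (simplified; not the exact harness used here): call the implementation under test many times and keep the accumulated result alive, so the compiler cannot optimize the loop away.

	#include <chrono>
	#include <cstdint>
	#include <iostream>

	uint32_t d10(uint64_t value); // the implementation under test

	int main() {
		using namespace std::chrono;

		uint64_t sink = 0; // consumed below, so the loop is not optimized away
		const auto start = steady_clock::now();
		for (uint64_t v = 1; v <= 100000000UL; ++v)
			sink += d10(v);
		const auto stop = steady_clock::now();

		std::cout << duration_cast<milliseconds>(stop - start).count()
		          << " ms (checksum: " << sink << ")\n";
	}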

Let’s be compact

The compact version will become the baseline for our tests and it basically represents the standard way that most of us would implement d10.

compact

	uint32_t result = 0U;

	do {
		++result;      // one more digit…
		value /= 10UL; // …and strip it off the value
	} while (value);

	return result;

Remarks

Dividing is slow! It’s roughly 5x slower than multiplying. Because of that, compilers apply an optimization trick which basically turns

MOV eax, $dividend
MOV edi, $divisor # We can't divide by an immediate value
DIV edi

into

MOV eax, $dividend
MOV edi, $magic_number # We can't multiply by an immediate value
MUL edi
SHR edx, $magic_number_2 # The high half of the product is put in edx

In other words, we turned the division into a multiplication by a magic number followed by a shift, which together yield the same result as the division. Very nifty.
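
To see the trick in action, here is a minimal C++ sketch of the same strength reduction (a reconstruction, not the compiler’s output; unsigned __int128 is a gcc extension). The well-known magic number for dividing a 64-bit value by 10 is ceil(2^67 / 10):

	#include <cstdint>

	// value / 10 computed as a 128-bit multiplication followed by a shift --
	// exactly the mul + shr pair described above.
	uint64_t div10(uint64_t value) {
		const unsigned __int128 magic = 0xCCCCCCCCCCCCCCCDULL; // ceil(2^67 / 10)
		return static_cast<uint64_t>(
			(static_cast<unsigned __int128>(value) * magic) >> 67);
	}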

Apart from that, the compact code isn’t very interesting:

  1. Increment the ecx register (where result is stored)
  2. Do the multiplication-shift (aka division)
  3. Execute a test instruction to check whether the division produced 0
  4. If it did, just end the routine; if not – jump back to the beginning

Base 10 is not enough

Let’s pretend that we don’t want to return only the count of base-10 digits and we wish to make it more general.

general

	const double float_value = value;

	// Find the smallest exponent for which 10^exponent exceeds the value.
	double float_result;
	for (float_result = 1.0;
		 float_value >= std::pow(10.0, float_result);
		 ++float_result);

	return static_cast<uint32_t>(float_result);

Suddenly we have a solution which works for any base (base-2, base-16, etc.) once the constant 10.0 is parameterized.
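
For illustration, a generalized variant could take the base as a parameter (a hypothetical sketch; the name digits_in_base is made up):

	#include <cmath>
	#include <cstdint>

	// Count the digits of value in an arbitrary base (base >= 2).
	uint32_t digits_in_base(uint64_t value, double base) {
		const double float_value = value;

		double float_result;
		for (float_result = 1.0;
			 float_value >= std::pow(base, float_result);
			 ++float_result);

		return static_cast<uint32_t>(float_result);
	}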

Remarks

Let’s look at the performance results:

[performance chart: general]

Horrible! There are, however, several reasons for such a low score:

  • We are making a call to another function which might not be very fast
  • In order to call the function we have to move the arguments into the xmm0 and xmm1 registers, which are already in use by us, and then retrieve the result by loading the stack values back into the xmmX registers
  • Every conversion from/to int requires the cvtsi2sd/cvttsd2si instructions, which take their time (roughly 15 cycles on older CPUs, 4 on faster ones)
  • We cannot ‘inline’ the constant value 1.0, as it is a floating-point value, which means we have to load it from a memory location:
    addsd	xmmX, QWORD PTR .LC0[rip]
    .LC0:
    	.long	0
    	.long	1072693248
    
  • As we need the base register rbx (for storing the constant value 10.0), we have to push and pop it from the stack on every function call.

The thing to remember here is that operations on floats are not as slow as we might believe. They are a few cycles slower than their integer counterparts, so it’s not a tragedy.

Fast-math to the rescue!

We all know about the infamous -ffast-math flag for gcc, which basically enables various optimizations allowing for faster floating-point operations. The catch? They might break your code.
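
A classic illustration of how it can break things: -ffast-math implies -ffinite-math-only, under which the compiler may assume NaNs never occur and fold a self-comparison NaN test away.

	// Under -ffast-math the compiler may assume x != x is always false,
	// so this check can silently stop working.
	bool is_nan(double x) {
		return x != x;
	}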

Because of that there are two camps: one believes that -ffast-math is just a death wish and should never be used, while the other thinks that if your code doesn’t do anything funny (denormals, etc.) it can safely be used.

In our case usage of -ffast-math resulted in correct output (tests passed), so I gave the famous flag a chance.

Remarks

Results are as follows:

[performance chart: general-fast]

Generally not that fast (pun intended). Some gain is observed, but nothing spectacular.

Table

Being general didn’t help us, so let’s take the road of precomputed values. For our first attempt we will choose a simple solution: a table through which we will walk. In LOOKUP, which is a std::array, we store the following values: Base10::D1Min, Base10::D2Min, …, Base10::D20Min. Not pretty, but simple; a sketch of its definition follows.
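
Its definition could look roughly like this (sketched from the description above):

	#include <array>
	#include <cstdint>

	// The lower bound of every digit count, in ascending order.
	static const std::array<uint64_t, 20> LOOKUP = {{
		Base10::D1Min,  Base10::D2Min,  Base10::D3Min,  Base10::D4Min,
		Base10::D5Min,  Base10::D6Min,  Base10::D7Min,  Base10::D8Min,
		Base10::D9Min,  Base10::D10Min, Base10::D11Min, Base10::D12Min,
		Base10::D13Min, Base10::D14Min, Base10::D15Min, Base10::D16Min,
		Base10::D17Min, Base10::D18Min, Base10::D19Min, Base10::D20Min
	}};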

table

	uint32_t result;

	// Check the bound before indexing so we never read past the table.
	for (result = 0U;
		 result < LOOKUP.size() && value >= LOOKUP[result];
		 ++result);

	return result;

Remarks

Results:
[performance chart: table]

Finally some improvements! Since the scores of our first versions were huge, they will be dropped from future comparisons. Enhance!

[performance chart: table_2]

Ok, so even though the CPU:

  • has to fetch every piece of data via a pointer,
  • cannot inline the data in any way (as it doesn’t know the table values),
  • has to make two comparisons every time,

it’s still tens of percent faster! In other words, by avoiding computation (mul and shr) in favour of well-organized memory that can be prefetched, we can drastically reduce the execution time – and that even though the table, at 20 × 8 = 160 bytes, doesn’t fit into a single 64-byte cache line.

Partial unroll

Let’s take another approach (presented by Andrei): instead of using tables, we shall “unroll” the computations performed in the compact version, so that we don’t have to iterate as many times.

partial_unroll

	uint32_t result = 1U;

	while (true) {
		if (value < Base10::D2Min) return result;
		if (value < Base10::D3Min) return result + 1U;
		if (value < Base10::D4Min) return result + 2U;
		if (value < Base10::D5Min) return result + 3U;

		// Strip four digits at once and start over.
		value /= 10000UL;
		result += 4U;
	}

Remarks

We basically traded fewer loop iterations and computations (mul + shr) for more conditional logic. More branches, you say? Surely the branch predictor will have problems with this, and there’s no prefetching, and… blah, blah. Test first, you fool!

[performance chart: punroll]

Not only is the unroll faster than the compact (standard) version, it also beats the table version, which seems relatively simpler (fewer branches).

BTW, the generated assembly isn’t very innovative: compare (cmp), return if below or equal (jbe), or iterate the loop. The only tinkering the compiler has done is to change “jump if below n” into “jump if below or equal n − 1”.

Unroll like there’s no tomorrow!

Partial seems good? Then let’s make it fully unrolled!

unroll

	if (value <= Base10::D1Max) return 1;
	if (value <= Base10::D2Max) return 2;
	if (value <= Base10::D3Max) return 3;
	if (value <= Base10::D4Max) return 4;
	if (value <= Base10::D5Max) return 5;
	if (value <= Base10::D6Max) return 6;
	if (value <= Base10::D7Max) return 7;
	if (value <= Base10::D8Max) return 8;
	if (value <= Base10::D9Max) return 9;
	if (value <= Base10::D10Max) return 10;
	if (value <= Base10::D11Max) return 11;
	if (value <= Base10::D12Max) return 12;
	if (value <= Base10::D13Max) return 13;
	if (value <= Base10::D14Max) return 14;
	if (value <= Base10::D15Max) return 15;
	if (value <= Base10::D16Max) return 16;
	if (value <= Base10::D17Max) return 17;
	if (value <= Base10::D18Max) return 18;
	if (value <= Base10::D19Max) return 19;
	if (value <= Base10::D20Max) return 20;

Remarks

We know it looks hideous and is unimplementable for bigger sets of inputs, but what about the efficiency? Glad you asked.

[performance chart: unroll]

Even faster, and the compiler still didn’t do any crazy stuff.

Hiding conditional logic

While unrolling provides us with a nice boost, there are still a lot of branches/jumps. What if we could “avoid” them?

unroll_v2

	// Each comparison below converts to 0 or 1; their sum is the digit count.
	return
		1
		+ (value >= Base10::D2Min)
		+ (value >= Base10::D3Min)
		+ (value >= Base10::D4Min)
		+ (value >= Base10::D5Min)
		+ (value >= Base10::D6Min)
		+ (value >= Base10::D7Min)
		+ (value >= Base10::D8Min)
		+ (value >= Base10::D9Min)
		+ (value >= Base10::D10Min)
		+ (value >= Base10::D11Min)
		+ (value >= Base10::D12Min)
		+ (value >= Base10::D13Min)
		+ (value >= Base10::D14Min)
		+ (value >= Base10::D15Min)
		+ (value >= Base10::D16Min)
		+ (value >= Base10::D17Min)
		+ (value >= Base10::D18Min)
		+ (value >= Base10::D19Min)
		+ (value >= Base10::D20Min);

Remarks

The obvious penalty of the above solution is that we always have to walk through all the digit boundaries, even when we could return early. The positive? There is no branching, so the CPU can nicely rush through the code, keeping the pipeline full.

[performance chart: unroll_v2]

While the penalties are clearly visible (a big lag for small values), the potential gains cannot be seen, and it doesn’t seem that the lack of branches helps the CPU that much.

It’s all about searching

Looking at the fully unrolled loop we can easily see that what we’ve been doing all this time is plain stupid linear search. Being decently equipped engineers, we know that we can turn this O(n) into O(log n) via binary search.

binary_stl

	// upper_bound returns an iterator to the first element greater than value.
	auto upRange = std::upper_bound(LOOKUP.cbegin(), LOOKUP.cend(), value);

	if (upRange == LOOKUP.cend()) {
		return 20;
	} else {
		return std::distance(LOOKUP.cbegin(), upRange);
	}

Remarks

Being lazy and smart cookies at the same time, we used the standard algorithms.

binary_stl

Not the improvement we wanted.

Not for the faint of heart

Andrei didn’t present the fully unrolled binary search, so let’s show it.

binary

	if (value < Base10::D8Min) {
		if (value < Base10::D4Min) {
			if (value < Base10::D2Min) {
				return 1;
			} else {
				if (value < Base10::D3Min) {
					return 2;
				} else {
					return 3;
				}
			}
		} else {
			if (value < Base10::D6Min) {
				if (value < Base10::D5Min) {
					return 4;
				} else {
					return 5;
				}
			} else {
				if (value < Base10::D7Min) {
					return 6;
				} else {
					return 7;
				}
			}
		}
	} else {
		if (value < Base10::D16Min) {
			if (value < Base10::D12Min) {
				if (value < Base10::D10Min) {
					if (value < Base10::D9Min) {
						return 8;
					} else {
						return 9;
					}
				} else {
					if (value < Base10::D11Min) {
						return 10;
					} else {
						return 11;
					}
				}
			} else {
				if (value < Base10::D14Min) {
					if (value < Base10::D13Min) {
						return 12;
					} else {
						return 13;
					}
				} else {
					if (value < Base10::D15Min) {
						return 14;
					} else {
						return 15;
					}
				}
			}
		} else {
			if (value < Base10::D18Min) {
				if (value < Base10::D17Min) {
					return 16;
				} else {
					return 17;
				}
			} else {
				if (value < Base10::D19Min) {
					return 18;
				} else {
					if (value < Base10::D20Min) {
						return 19;
					} else {
						return 20;
					}
				}
			}
		}
	}

Remarks

Yes, it’s uber-horrible to write, but we didn’t do it to win the cleanest-code award.

[performance chart: binary]

And we have a winner!

That’s the end of our journey

I hope this post becomes a minor, but still valuable, addition to what Andrei has presented. To summarize the tips I gathered while creating this material:

  • don’t assume – measure
  • the compiler knows more tricks than you do, so concentrate on algorithms and data layout, not micro-optimizations
  • general programming knowledge is applicable everywhere (e.g., binary search)
  • moving data from memory (cache) into the instruction stream is good, but don’t go overboard – you can’t squash all the code in, every time

As the percentage charts don’t show the differences between the fastest algorithms clearly enough, here is one more chart.

[performance chart: total]
