In one of his trainings Andrei touched on a very interesting subject: optimizing C++ (or code in general) with respect to how the code is laid out – basically, understanding what the generated assembly will look like. In this post I would like to dig a little deeper into the subject.
Meet d10
The task is simple:
Count how many base-10 digits a given uint64_t value consists of.
For the purpose of this post (to avoid repetition) let’s assume that every presented implementation of the above task has the following declaration:
uint32_t d10(uint64_t value)
In order to neatly group all the constants in our code, the following enumeration is used:
enum Base10 : uint64_t {
    D1Min  = 0UL,                     D1Max  = 9UL,
    D2Min  = 10UL,                    D2Max  = 99UL,
    D3Min  = 100UL,                   D3Max  = 999UL,
    D4Min  = 1000UL,                  D4Max  = 9999UL,
    D5Min  = 10000UL,                 D5Max  = 99999UL,
    D6Min  = 100000UL,                D6Max  = 999999UL,
    D7Min  = 1000000UL,               D7Max  = 9999999UL,
    D8Min  = 10000000UL,              D8Max  = 99999999UL,
    D9Min  = 100000000UL,             D9Max  = 999999999UL,
    D10Min = 1000000000UL,            D10Max = 9999999999UL,
    D11Min = 10000000000UL,           D11Max = 99999999999UL,
    D12Min = 100000000000UL,          D12Max = 999999999999UL,
    D13Min = 1000000000000UL,         D13Max = 9999999999999UL,
    D14Min = 10000000000000UL,        D14Max = 99999999999999UL,
    D15Min = 100000000000000UL,       D15Max = 999999999999999UL,
    D16Min = 1000000000000000UL,      D16Max = 9999999999999999UL,
    D17Min = 10000000000000000UL,     D17Max = 99999999999999999UL,
    D18Min = 100000000000000000UL,    D18Max = 999999999999999999UL,
    D19Min = 1000000000000000000UL,   D19Max = 9999999999999999999UL,
    D20Min = 10000000000000000000UL,  D20Max = 18446744073709551615UL
};
The important thing to remember is that the enumeration values become part of the instruction stream and do not have to be referenced via pointers/temporary values/the stack.
Performance tests should be interpreted as follows: smaller is better. In other words, a score of 200% means that the given algorithm was twice as slow as the baseline.
The specs of the test machine are: Intel Pentium Dual-Core T4200 @ 2GHz, 4GB DDR2, Ubuntu 12.04, gcc 4.6.3 with -O3, 64-bit.
Let’s be compact
The compact version will become the baseline for our tests; it basically represents the standard way most of us would implement d10.
compact
uint32_t result = 0U;
do {
    ++result;
    value /= 10UL;
} while (value);
return result;
Remarks
Dividing is slow! It’s roughly 5x slower than multiplying. Because of that, compilers apply an optimization trick which basically turns
MOV eax, $dividend
MOV edi, $divisor   # We can't divide by an immediate value
DIV edi
into
MOV eax, $dividend
MOV edi, $magic_number    # We can't multiply by an immediate value
MUL edi
SHR edx, $magic_number_2  # The result of the multiplication is put into edx
In other words, we turned the division into a multiplication by a magic number followed by a shift, which together produce the same result as the division. Very nifty.
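Out of curiosity, we can reproduce the trick by hand. A sketch of mine (not from Andrei’s material): for an unsigned 64-bit division by 10, gcc multiplies by the magic constant 0xCCCCCCCCCCCCCCCD and keeps the high bits of the 128-bit product:

#include <cstdint>

// value / 10 done the multiply-shift way.
// 0xCCCCCCCCCCCCCCCD == ceil(2^67 / 10), so shifting the 128-bit product
// right by 67 yields the exact quotient for any uint64_t.
// (unsigned __int128 is a gcc extension, fitting for this post.)
uint64_t div10(uint64_t value)
{
    const unsigned __int128 product =
        static_cast<unsigned __int128>(value) * 0xCCCCCCCCCCCCCCCDULL;
    return static_cast<uint64_t>(product >> 67);
}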
Apart from that, the compact code isn’t very interesting:
- Increment the ecx register (where result is stored)
- Do the multiplication-shift (aka division)
- Execute the test instruction to check whether the division produced 0
- If it did, just end the function; if not – jump back to the beginning
Base 10 is not enough
Let’s pretend that we do not want to return only the count of base-10 digits, and that we wish to make the solution more general.
general
const double float_value = value;
double float_result;
for (float_result = 1.0; float_value >= std::pow(10.0, float_result); ++float_result);
return static_cast<uint32_t>(float_result);
Suddenly we have a solution which works for every base (base-2, base-16, etc.).
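For instance, here is a sketch of mine of what that generalization might look like (the post only shows the base-10 variant; the name digits_in_base is made up):

#include <cmath>
#include <cstdint>

uint32_t digits_in_base(uint64_t value, double base)
{
    // Walk the powers of the base until they exceed the value.
    const double float_value = static_cast<double>(value);
    double float_result = 1.0;
    while (float_value >= std::pow(base, float_result)) {
        ++float_result;
    }
    return static_cast<uint32_t>(float_result);
}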
Remarks
Let’s look at the performance results:
Horrible! There are, however, several reasons for such a low score:
- We are making a call to another function (std::pow), which might not be very fast
- In order to call the function we have to move the arguments into the xmm0 and xmm1 registers, which are already in use by us, and then retrieve the result by putting the stack values back into the xmmX registers
- Every conversion from/to int requires the cvtsi2sd/cvttsd2si instructions, which take their time (around 15 cycles on older CPUs, 4 on faster ones)
- We cannot use ‘inlining’ of the constant value 1.0, as it is a floating-point value, which means that we have to add it from a memory location (a quick sanity check of those bits follows this list):
addsd xmmX, QWORD PTR .LC0[rip]

.LC0:
    .long 0
    .long 1072693248
- As we need the rbx base register (for storing the constant value 10.0), we have to push it to and pop it from the stack on every function call.
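As a side note, 1072693248 is 0x3FF00000 – the high word of 1.0 in the IEEE-754 double format. A tiny check of mine (assuming a little-endian machine, like the test box):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    // The two .long values from the assembly above: low word, high word.
    const uint32_t words[2] = { 0U, 1072693248U };
    double d;
    std::memcpy(&d, words, sizeof d); // reinterpret the bits as a double
    std::printf("%f\n", d);           // prints 1.000000
}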
The thing to remember here is that operations on floats are not as slow as we might believe. They are a few cycles slower than their integer counterparts, so it’s not a tragedy.
Fast-math to the rescue!
We all know about the infamous -ffast-math flag for gcc, which enables various optimizations that allow for faster floating-point operations. The catch? They might break your code.
Because of that there are two camps: one believes that -ffast-math is just a death wish and should never be used, while the other thinks that if your code doesn’t do anything funny (denormals, etc.) it can safely be used.
In our case -ffast-math produced correct results (the tests passed), so I gave the famous flag a chance.
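For reference, enabling it is just one extra switch on the compile line (the file and binary names below are made up):

g++ -O3 -ffast-math -o d10_bench d10_bench.cpp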
Remarks
Results are as follows:
Generally not that fast (pun intended). Some gain is observed, but nothing spectacular.
Table
Being general didn’t help us, so let’s take the road of precomputed values. For our first attempt we will choose a simple solution: a table through which we will walk. In LOOKUP, which is a std::array, we store the following values: Base10::D1Min, Base10::D2Min, …, Base10::D20Min. Not pretty, but simple.
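The post doesn’t spell the definition out, so here is a sketch of what LOOKUP presumably looks like:

#include <array>
#include <cstdint>

// The smallest value with N digits sits at index N-1.
const std::array<uint64_t, 20> LOOKUP = {{
    Base10::D1Min,  Base10::D2Min,  Base10::D3Min,  Base10::D4Min,
    Base10::D5Min,  Base10::D6Min,  Base10::D7Min,  Base10::D8Min,
    Base10::D9Min,  Base10::D10Min, Base10::D11Min, Base10::D12Min,
    Base10::D13Min, Base10::D14Min, Base10::D15Min, Base10::D16Min,
    Base10::D17Min, Base10::D18Min, Base10::D19Min, Base10::D20Min
}};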
table
uint32_t result;
// Bounds check first, so we never read past the end of LOOKUP
for (result = 0U; result < LOOKUP.size() && value >= LOOKUP[result]; ++result);
return result;
Remarks
Finally some improvement! However, due to the huge scores of our first versions, they will be dropped from future comparisons. Enhance!
OK, so even though the CPU:
- has to fetch every piece of data via a pointer,
- cannot inline the data in any way (as it doesn’t know the table values),
- has to make two comparisons every time,
it’s still tens of percent faster! In other words, by avoiding computation (mul and shr) and providing well-organized memory that can be prefetched (though still not kept completely in cache – the whole table takes 1.25 kb, while a cache set is only 0.5 kb), we can drastically reduce the execution time.
Partial unroll
Let’s take another approach (presented by Andrei): instead of using tables, we shall “unroll” the computations performed in the compact version, so that we don’t have to iterate so many times.
partial_unroll
uint32_t result = 1U;
while (true) {
    if (value < Base10::D2Min) return result;
    if (value < Base10::D3Min) return result + 1U;
    if (value < Base10::D4Min) return result + 2U;
    if (value < Base10::D5Min) return result + 3U;
    value /= 10000UL;
    result += 4U;
}
Remarks
We basically traded fewer iterations and computations (mul + shr) for more conditional logic. More branches, you say? Surely the branch predictor will have problems with this, there’s no prefetching, and… blah, blah. Test first, you fool!
Not only is the unroll faster than the compact (standard) version, but it also beats the table version, which seems relatively simpler (fewer branches).
BTW, the generated assembly isn’t very innovative: compare (cmp), return if below or equal (jbe), or iterate the loop. The only tinkering the compiler has done is to change each “jump if below n” into “jump if below or equal n – 1”.
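In C++ terms the change is just the equivalence between a strict and a non-strict comparison. A toy illustration of mine – both functions compile to the same cmp/jbe (or setbe) pair:

#include <cstdint>

bool less_than_ten(uint64_t value) { return value < 10UL; }  // "jump if below 10"
bool at_most_nine(uint64_t value)  { return value <= 9UL; }  // "jump if below or equal 9"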
Unroll like there’s no tomorrow!
Partial seems good? Then let’s make it fully unrolled!
unroll
if (value <= Base10::D1Max)  return 1;
if (value <= Base10::D2Max)  return 2;
if (value <= Base10::D3Max)  return 3;
if (value <= Base10::D4Max)  return 4;
if (value <= Base10::D5Max)  return 5;
if (value <= Base10::D6Max)  return 6;
if (value <= Base10::D7Max)  return 7;
if (value <= Base10::D8Max)  return 8;
if (value <= Base10::D9Max)  return 9;
if (value <= Base10::D10Max) return 10;
if (value <= Base10::D11Max) return 11;
if (value <= Base10::D12Max) return 12;
if (value <= Base10::D13Max) return 13;
if (value <= Base10::D14Max) return 14;
if (value <= Base10::D15Max) return 15;
if (value <= Base10::D16Max) return 16;
if (value <= Base10::D17Max) return 17;
if (value <= Base10::D18Max) return 18;
if (value <= Base10::D19Max) return 19;
if (value <= Base10::D20Max) return 20;
Remarks
We know it looks hideous and doesn’t scale to bigger sets of arguments, but what about the efficiency? Glad you asked.
Even faster, and the compiler still didn’t do any crazy stuff.
Hiding conditional logic
While unrolling provides us with a nice boost, there are still a lot of branches/jumps. What if we could “avoid” them?
unroll_v2
return 1 + (value >= Base10::D2Min)  + (value >= Base10::D3Min)
         + (value >= Base10::D4Min)  + (value >= Base10::D5Min)
         + (value >= Base10::D6Min)  + (value >= Base10::D7Min)
         + (value >= Base10::D8Min)  + (value >= Base10::D9Min)
         + (value >= Base10::D10Min) + (value >= Base10::D11Min)
         + (value >= Base10::D12Min) + (value >= Base10::D13Min)
         + (value >= Base10::D14Min) + (value >= Base10::D15Min)
         + (value >= Base10::D16Min) + (value >= Base10::D17Min)
         + (value >= Base10::D18Min) + (value >= Base10::D19Min)
         + (value >= Base10::D20Min);
Remarks
The obvious penalty of the above solution is that we always have to go through all the digits, even where we could have returned early. The positive? There is no branching, so the CPU can nicely rush through all the code, keeping the pipeline full.
While the penalties are clearly visible (a big lag for values with few digits), the potential gains cannot be seen – it doesn’t seem that the lack of branches helps the CPU that much.
It’s all about searching
Looking at the fully unrolled loops, we can easily see that what we’ve been doing all this time is a plain, stupid linear search. Being decently equipped engineers, we know that we can turn this O(n) into O(log n) via binary search.
binary_stl
auto upRange = std::upper_bound(LOOKUP.cbegin(), LOOKUP.cend(), value);
if (upRange == LOOKUP.cend()) {
    return 20;
} else {
    return std::distance(LOOKUP.cbegin(), upRange);
}
Remarks
Being lazy and smart cookies at the same time, we used the standard algorithms.
Not the improvement we wanted.
Not for the faint of heart
Andrei didn’t present the fully unrolled binary search, so let’s show it.
binary
if (value < Base10::D8Min) {
    if (value < Base10::D4Min) {
        if (value < Base10::D2Min) {
            return 1;
        } else {
            if (value < Base10::D3Min) { return 2; } else { return 3; }
        }
    } else {
        if (value < Base10::D6Min) {
            if (value < Base10::D5Min) { return 4; } else { return 5; }
        } else {
            if (value < Base10::D7Min) { return 6; } else { return 7; }
        }
    }
} else {
    if (value < Base10::D16Min) {
        if (value < Base10::D12Min) {
            if (value < Base10::D10Min) {
                if (value < Base10::D9Min) { return 8; } else { return 9; }
            } else {
                if (value < Base10::D11Min) { return 10; } else { return 11; }
            }
        } else {
            if (value < Base10::D14Min) {
                if (value < Base10::D13Min) { return 12; } else { return 13; }
            } else {
                if (value < Base10::D15Min) { return 14; } else { return 15; }
            }
        }
    } else {
        if (value < Base10::D18Min) {
            if (value < Base10::D17Min) { return 16; } else { return 17; }
        } else {
            if (value < Base10::D19Min) {
                return 18;
            } else {
                if (value < Base10::D20Min) { return 19; } else { return 20; }
            }
        }
    }
}
Remarks
Yes, it’s uber-horrible to write, but we didn’t do it to get the cleanest code award.
And we have a winner!
That’s the end of our journey
I hope this post becomes a minor, but still valuable, addition to what Andrei has presented. To summarize all the tips I gathered while creating this material:
- don’t assume – measure (a small sanity-check sketch follows this list)
- the compiler knows more tricks than you do, so concentrate on the algorithms and data layout, not micro-optimizations
- general programming knowledge is applicable everywhere (e.g., binary search)
- moving data from storage (cache) into the instruction stream is good, but don’t go overboard – you can’t squash everything into code every time
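Regarding the first tip, here is a small sanity-check sketch of mine (the post’s actual test suite isn’t shown) that exercises any d10 variant against the boundary values from the Base10 enum:

#include <cassert>
#include <cstdint>

int main()
{
    // Check both edges of a few digit ranges, plus the extremes.
    assert(d10(Base10::D1Min)  == 1U);
    assert(d10(Base10::D1Max)  == 1U);
    assert(d10(Base10::D2Min)  == 2U);
    assert(d10(Base10::D2Max)  == 2U);
    assert(d10(Base10::D10Min) == 10U);
    assert(d10(Base10::D10Max) == 10U);
    assert(d10(Base10::D20Min) == 20U);
    assert(d10(Base10::D20Max) == 20U);
}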
As the percentage charts don’t show the differences between the fastest algorithms clearly enough, here is one more chart.