Hacker’s Delight
and the Arduino (GCC Compiler)
I got a new (used, but new to me) book about computer coding: Hacker’s Delight by Henry S. Warren Jr. This is one of those books that I will read most of more than once, because the material covered is very technical and delves deeply into the intricacies of programming efficiently.
One algorithm that was particularly interesting, for no reason other than the algorithm is clever. This is a method for counting the 1-bits in a binary word. There is the naïve implementation, of course, but iterating through the bits and counting the ones:
for (long j = 0; j < 1000; j++)
{
x = 0x371FC7C9;
c = 0;
for (int i = 0; i < 32; i++)
{
c += x & 0x1;
x >>= 1;
}
}
I had to run this code 1 thousand times to get a good count. The Arduino (ATMega 328P chip) is FAST! This code ran in 24.58 microseconds per iteration.
There is a much more clever method in the book, which divided the value up into 2-bit blocks and counted the ones using addition, then added the 2-bit values into 4-bits, then 8-bits and finally adding the two 16-bit blocks to get the final count. So I added the code from the book to compare the times:
for (long j = 0; j < 1000; j++)
{
x = 0x371FC7C9;
x = x - ((x >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
x = (x + (x >> 4)) & 0x0F0F0F0F;
x = x + (x >> 8);
x = x + (x >> 16);
c = x & 0x0000003F;
}
This was an impressive optimization, as it ran in zero milliseconds! I increased the loop count to 1 million so I could get a measurable time, and it was still zero milliseconds. WOW! So, I added yet another 1 million iteration loop around the “j” loop. Still zero…
So, 1 trillion iterations in zero milliseconds? That processor is not a super-computer, so something was wrong. I realized the compiler must have been seeing my code as unchanged in each loop and was simply optimizing the loops away.
I added an array with two elements (with identical bit counts) to try to fool the optimizer.
long xx[2] = { 0x371FC7C9, 0x71FC7C93 };
for (long j = 0; j < 1000000; j++)
{
x = xx[j & 1]; // 0x371FC7C9;
x = x - ((x >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
x = (x + (x >> 4)) & 0x0F0F0F0F;
x = x + (x >> 8);
x = x + (x >> 16);
c = x & 0x0000003F;
}
This worked, as I got 8.68 microseconds per execution. This is about 3 times faster than the naïve implementation. Nice.
Now, I may just have a use for that beautiful code someday.
Enhancements
I wanted to display the execution time on a display, so I added the code to a sketch with the test loops in the main loop. Foiled again! The optimized code was, again, optimized away.
I experimented with several changes to fix that. What worked was declaring the array test variable volatile so that the x variable was loaded each iteration:
volatile unsigned long xx[2] = { 0x371FC7C9, 0x71FC7C93 };
Comments are closed.