### Do Low-level Optimizations Matter?

by Nathan Myers, ncm at cantrip dot org, 2020-01-09

Collectively, we have been thinking about sorting for longer than we have had computers. There is still an active literature 1,2,3. We are taught that how the counts of comparisons and swaps vary with problem size is what matters: as problems get bigger, order dominates, and anything else is buried in the noise. We learn that the best in-place sorts run in time around kN lg2(N), and a better sort algorithm has a smaller k, or is better-behaved for chosen input.

We ought to be able to sort pretty fast, by now, using a Standard Library sort. Sorting is real work: if you need it to go faster, you probably need to spend more on hardware. You can’t cheat on that.

Or can you? The classic sorting research was conducted on machines from many generations ago. While today’s are made to seem similar from the outside, inside they are very, very different. Do we still understand what affects sorting speed today? Maybe the rules have changed.

### Laboratory

Trying out ideas with Standard Library sort implementations can be tricky; as they have been tuned, they have become complicated. Results can be confusing. We need a laboratory: simple sorts, and simple data4. So, let us begin with a file of a hundred-million totally random integers:

``  \$ dd if=/dev/urandom count=100 bs=4000000 of=1e8ints``

This is objectively the worst case, containing the greatest possible entropy; but also the best case, hiding no surprises. We can map this file into our process, and the OS will copy it from the file buffer cache, the work showing up as `sys` time. All the rest of the run time is spent on nothing but sorting.

Next, we need a baseline, using `std::sort`, for reference:

``````  #include <sys/mman.h>
#include <fcntl.h>
#include <algorithm>

static const int size = 100'000'000;

int main(int, char**) {
int fd = ::open("1e8ints", O_RDONLY);
int flags = MAP_PRIVATE|MAP_POPULATE|MAP_NORESERVE;
auto* a = (int*) ::mmap(
nullptr, size * sizeof(int), perms, flags, fd, 0);

std::sort(a, a + size);

return a[0] == a[size - 1];
}``````

The `return` statement keeps the compiler from optimizing away the whole program. Trying it5:

``````  \$ g++ -O3 -march=native sort_base.cc && time ./a.out

real  0m7.365
user  0m7.257
sys   0m0.108s``````

We see about `73ns` per element. lg2(1e8) is about 27, making k a bit less than `3ns`. Each time `std::sort` looks at an element, it spends, on average, `3ns`, presumably on shuffling it from memory to cache to register to ALU, and maybe back to cache and to memory. This happens 27 times before the element ends up where it belongs. (Nobody said, in school, that k has a unit, but for our purposes, it’s nanoseconds.)

#### Reality Check

Just for a reality check, let’s plug in a radix sort. It is an unfair comparison, on several axes, but it suggests a hard upper bound on the kind of improvement we can reasonably hope for. This function distributes elements to one of 256 buckets, and then copies them back, in stable bucket order, and again for the next byte.

``````  #include <vector>
#include <cstring>

void sort_radix256(unsigned* begin, unsigned* end) {
std::vector<int> buckets[256];
for (auto& v : buckets) v.reserve((end - begin)/128);
for (int byte = 0; byte != sizeof(unsigned); ++byte) {
for (unsigned* p = begin; p != end; ++p) {
buckets[*p & 0xff].push_back((*p >> 8) | (*p << 24));
}
unsigned* p = begin;
for (auto& v : buckets) {
std::memcpy(p, v.data(), v.size() * sizeof(unsigned));
p += v.size();
v.clear();
} } }

- auto* a = (int*) ::mmap(
+ auto* a = (unsigned*) ::mmap(

- std::sort(a, a + size);

real  0m1.193s
user  0m0.985s
sys   0m0.208s``````

We see the OS spending extra time initializing memory (two thirds of it to zeroes), because radix sorting is not done in place. The rest of the time is spent copying the input to buckets and back, four passes in all, `2.5ns` per element per pass. When you can use it, and for big enough N, radix sort always wins, because it’s O(N): in this case, 4N. Notably, the sequence of instructions is almost the same, no matter what the values of the elements are. All that varies is which bucket each element goes into.

This radix sort would be easy to make even faster. Instead, let us make one that is slower, but that shares important qualities with in-place sorts. Instead of 256 buckets, this will use just two. The first pass picks a bucket according to the low bit, the next pass uses the next bit, and on around.

``````  void sort_radix2(unsigned* begin, unsigned* end) {
std::vector<unsigned> a_buckets[2];
for (auto& v : a_buckets) v.reserve((end - begin)/3);
int i = 0;
for (; i < (end - begin) / 2; ++i) a_buckets[0].push_back(begin[i]);
for (; i < (end - begin); ++i) a_buckets[1].push_back(begin[i]);

std::vector<unsigned> b_buckets[2];
for (auto& v : b_buckets) v.reserve((end - begin)/3);

for (int bit = 0; bit != 8 * sizeof(unsigned); ++bit) {
for (int j = 0; j != 2; ++j) {
for (unsigned v : a_buckets[j]) {
b_buckets[v & 0x1].push_back((v >> 1) | (v << 31));
}
}
std::swap(a_buckets[0], b_buckets[0]), b_buckets[0].clear();
std::swap(a_buckets[1], b_buckets[1]), b_buckets[1].clear();
}

for (auto bucket : a_buckets)
for (unsigned v : bucket) *begin++ = v;
}

real  0m4.759s
user  0m4.543s
sys   0m0.216s
``````

It’s slower, because it examines each element 32 times. It’s not N lg2(N), but 32N. This `4.5s` is remarkably close to our `std::sort` baseline, `7.3s`. Maybe we can think of the difference between `4.5s` and our baseline as what we pay for the generality of `std::sort`? 32N is not far from 27N. Should we take `4.5s` `*` 27/32 = `3.8s` as an ultimate stretch goal?

#### Starting Point: `quicksort`

Let’s see how well we can do with a regular sorting algorithm. Start by pasting in a bog-standard quicksort6:

``````  int* partition(int* begin, int* end) {
int pivot = end[-1];
int* left = begin;
for (int* right = begin; right < end - 1; ++right) {
if (*right <= pivot) {
int tmp = *left; *left = *right, *right = tmp;
++left;
}
}
int tmp = *left; *left = end[-1], end[-1] = tmp;
return left;
}

void quicksort(int* begin, int* end) {
while (end - begin > 1) {
int* mid = partition(begin, end);
quicksort(begin, mid);
begin = mid + 1;
} }

- auto* a = (unsigned*) ::mmap(
+ auto* a = (int*) ::mmap(

+ quicksort(a, a + size);

real  0m8.309s
user  0m8.193s
sys   0m0.116s``````

This is really not bad. The `std::sort` in the standard library, at `7.3s`, is a lot more code than we see here, but it has to perform well on tricky cases we won’t be testing.

#### What Is New?

What is different in modern CPUs from the machines the algorithms we use were originally tuned for? Well, lots. Today they have up to a half-billion transistors per core, not just the thousands originally used to execute a notably similar instruction set. (Imagine, in each generation, being asked to find a way to use another million, ten million, hundred million! more transistors, to get better benchmark times.)

We have many more caches, registers, instruction decoders, and functional units–barrel shifters, multipliers, ALUs7. There’s even a little just-in-time compiler, in hardware, to translate complex-instruction sequences into simpler ones, with its own peephole optimizer, and a cache of its output. A small-enough loop can execute entirely from that cache.

One thing wholly new is the branch predictor8. Technically, it’s another cache, one that accumulates a history of which way was chosen on each of the last M conditional branch points encountered during execution. (The number M is secret.) Branch prediction matters because of something else that’s wholly new: speculative execution9.

When regular execution needs to stop and wait on a result, speculative execution can run on ahead, stuffing pipelines with work from upcoming loop iterations. Because the values needed to decide which way a conditional branch will go haven’t been computed yet, it has to guess, based on the pattern of what has happened before at that instruction address. When the branch predictor guesses wrong, a lot of work may need to be discarded, but whenever it guesses right, that work has already been done when regular execution gets there.

These days, often, most of the work gets done speculatively, so it is vitally important to guess right. Branch predictors have become astonishingly good at discovering patterns in our code, and guessing right. Nowadays, they use a neural net to identify such patterns; the branch prediction cache holds sets of coefficients for the neural net.

Branching on random data, though, is the worst case for any branch predictor, no matter how smart, because it can never find more regularity than the data has. Here, it guesses wrong half the time, and work is done and then discarded. The pipelines never fill, and functional units sit idle.

If we want to better adapt our algorithm to the way a modern CPU works, we need to protect speculative execution against randomness in our data. Our only available course is to eliminate conditional branches that depend on that data.

So, let’s see what our conditional branches are doing. In this quicksort, there are only three conditional branches.

The first check, at the top of `quicksort` itself:

``    while (end - begin > 1) {``

detects when recursion has bottomed out. It evaluates to `true` half the time, on a complicated schedule, but it is evaluated only about N times, and hardly depends on the input.

The second:

``    for (int* right = begin; right < end - 1; ++right) {``

controls the loop in `partition` that walks through a subrange of the input. It runs about N lg2(N) times, but usually it’s taken. It, too, doesn’t depend much on input values. It gets hard to predict only when `begin` is very close to `end`, a small multiple of N times.

The third conditional branch:

``      if (*right <= pivot) {``

happens N lg2(N) times, too, but it depends directly on the values of elements in the input, every time. To prevent missed branch predictions, we must find a way to avoid that branch. We need identically the same sequence of instructions executed in the true and the false case–and just have nothing change, in the false case.

The portable way to do this is to turn the result of the comparison into a regular value, and then do ordinary arithmetic with it. Sometimes, we can use the boolean value itself, as is. Perhaps the simplest possible example is:

``  if (a < b) ++i;``

which can always be transformed, via conversion to `int`, to:

``  i += (a < b);``

A more general method is to convert the boolean value into a mask value, of all zeroes or all ones. Then we can use bitwise operations to produce different results for the all-zeroes and the all-ones cases. For a slightly more complicated example, here we add up two seconds/nanoseconds pairs:

``````  secs = secs1 + secs2, nsecs = nsecs1 + nsecs2;
if (nsecs >= 1'000'000'000) {
++secs, nsecs -= 1'000'000'000;
}``````

This transforms to:

``````  secs = secs1 + secs2, nsecs = nsecs1 + nsecs2;
int carry = (nsecs >= 1'000'000'000);
secs += carry, nsecs -= ((-carry) & 1'000'000'000);``````

With a carry, `(-carry)` is 111…111, and it subtracts a billion; with no carry, `(-carry)` is zero, and it subtracts zero.

#### A New Primitive, `swap_if`

How can we use this method in our sort? First, let us make a `swap_if`:

``````  inline bool swap_if(bool c, int& a, int& b) {
int ta = a, mask = -c;  // false -> 0, true -> 111..111
return c;
}``````

In our `partition` function, then, we can transform

``````    if (*right <= pivot) {
int tmp = *left; *left = *right, *right = tmp;
++left;
}``````

into just

``    left += swap_if(*right <= pivot, *left, *right);``

This expands to more code, and all of it runs on every iteration, not just half of them, as before. But now, there is no branch to predict, or to mis-predict. It is just straight-line code.

Trying it:

``````  \$ g++ -O3 -march=native sort_swap_if.cc && time ./a.out

real  0m5.792s
user  0m5.679s
sys   0m0.112s``````

This is excellent! `5.7s` is 1.44x better than before, and 1.29x as fast as `std::sort`.

Another way to make a value control what happens is indexing. (We have already seen an example of this, in the second radix sort presented above.) The boolean value is used as an array index:

``````    int v[2] = { a, b };
b = v[1-c], a = v[c];

real  0m4.576s
user  0m4.459s
sys   0m0.116s``````

Even better! This is 1.84x as fast as the naïve quicksort, even though the values have to take a detour through L1 cache so they can have addresses. This version is easily generalized to types that are not just a machine word:

``````  template <typename T>
bool swap_if(bool c, T& a, T& b) {
T v[2] = { std::move(a), std::move(b) };
b = std::move(v[1-c]), a = std::move(v[c]);
return c;
}``````

When `T` is `int`, compilers generate identical code for the template version,

For completeness, there is also:

``````    uint64_t both = (uint64_t(uint32_t(a)) << 32) | uint32_t(b);
int shift = c * 32;
both = (both << shift) | (both >> (64 - shift));
a = int(uint32_t(both & 0xffffffff)), b = int(both >> 32);``````

The line with the shifts turns into a single “rotate” instruction. But this is not faster than the indexed version: it runs in `4.8s`.

#### `cmov` Considered Disturbing

How does the first, “and-and-or”, version do on Clang:

``````  \$ clang++ -O3 -march=native sort_swap_if.cc && time ./a.out

real  0m3.551s
user  0m3.430s
sys   0m0.120s``````

HOLY CRAP! `3.4s`? What just happened?

Weren’t we just guessing that `3.8s` was as fast as we should ever hope to get? This is 2.4x as fast as quicksort, and more than 2x as fast as `std::sort`! This raises so many questions.

First, why the big difference between G++ and Clang? A quick detour through Godbolt10 reveals what is going on. G++ actually generated the four “`and`”, and two “`or`” instructions seen in `swap_if`. But Clang’s optimizer recognized what we were trying to do with the masks, and replaced it all with a pair of simple `cmov`, conditional-move, instructions. (On top of that, it unrolled the loop in `partition`.)

What is the `cmov` instruction? It has been in ARM forever. Back in 2000, AMD included `cmov` in its 64-bit x86 ISA extensions. Then, Intel had to adopt them when Itanium flopped.

`cmov` just copies a value from one place to another–register to register, register to memory, memory to register–but only if a condition-code flag has been set, such as by a recent comparison instruction. Since `cmov` replaces a conditional branch, the result of the comparison doesn’t need to be predicted. The execution unit doesn’t know which value it will end up with, but it knows where it will be, and can schedule copying that out to cache, and eventually memory, and run on ahead, without discarding anything.

Why doesn’t G++ generate `cmov` instructions? Older releases did, in fact, often enough to generate bug reports11. It turned out that, any time a subsequent loop iteration depends on the result, and the branch would have been predicted correctly, `cmov` may be slower than a branch, sometimes much slower. `cmov` can stall speculation all by itself.

The latest designs from Intel and AMD are said to avoid the `cmov` pipeline stall, often, but Gcc has not caught up with them yet.

For more on `cmov` and pessimization in G++, follow links in 12.

In this case, though, nothing later depends on the result–it just gets stored, and the loop moves on–so this is a poster-child case for using `cmov`.

Clang, it turns out, can turn a simpler definition of `swap_if` into `cmov` instructions:

``````    int ta = a, tb = b;
a = c ? tb : ta;
b = c ? ta : tb;``````

But for this, G++ just generates a badly-predicted branch. I have not discovered a way to persuade G++ to produce `cmov` instructions (not even in old releases, and not even with the new `__builtin_expect_with_probability` intrinsic, probability 0.5). Even “profile-guided optimization” doesn’t help. (In the Linux kernel, wherever a `cmov` is needed, a macro expands directly to assembly code.) Looking into G++, it appears that it refuses to emit a `cmov` if anything else follows it in the “basic block”, even if what follows ought to be another `cmov` 13.

#### Faster Than (Pessimized) Radix Sort?

Another big question: how could the Clanged version be even faster than our target time? This might come back to `cmov`, again. Radix sort always performs the same number of copy operations. So, also, do our first three versions of `swap_if`, as compiled by G++. But with `cmov`, only half of the swap operations end up needing to copy the pair of words out to L1 cache14,15, and thereby, perhaps, delay reading the next pair from it. There may be other reasons: radix sort sweeps majestically, sequentially through the whole sequence, relying on the prefetcher to keep ahead of it, while quicksort spends much of its time noodling around with values in L1 cache.

In the original quicksort, k was almost `3ns`, but now it is just `1.3ns`. More than half of the time we thought was being spent on honest computation was wasted, sitting stalled on branch mis-predictions! Who knows how much is still wasted? (Actually, the CPU knows: it keeps a count of how many of various cache misses happened, and you can read those out, given some care, with `perf`16.)

#### Next

What can we do with what we’ve discovered? Sure, we can make our own programs faster, particularly if we are using Clang. But it would be better if we could make all programs faster.

We could propose a new library function, `std::swap_if`. Then, implementers could ensure it uses `cmov` for machine-word types (including pointers and floating-point values, and even small objects like `std::string_view`), and use it in their sorting and partitioning code. We could use it in our own programs, too. But for it to do much good, we would need to get it into a Standard, and then persuade many people to change a lot of code.

The experience with Clang’s optimizer hints at an alternative: Why can’t compilers recognize the sequence we did, and perform their own transformation?

This would be way better; just recompile, and all code gets faster, some a lot faster, with no need to rewrite any of it. The `std::sort` implementions in libstdc++ and libc++ are quite a bit more complicated than our toy quicksort, but maybe a compiler could spot places to transform that look unpromising to us.

Getting this optimization into compilers depends on those few who can add a new optimization to G++’s or Clang’s code generator taking time away from other optimization work. Doesn’t a 2x improvement over current `std::sort` automatically deserve that kind of attention? But it might not be so easy: too often, `cmov` is slower. The optimizer doesn’t just need to recognize when it could substitute `cmov` for a branch, it needs to decide whether it should.

While the optimizer knows whether anything depends on the result, it doesn’t know whether the input would have patterns that the branch predictor could tease out. Such an optimization would need a great deal of testing to ensure it doesn’t pessimize (too much) code.

Still, Clang, at least, seems to be all-in on plugging in `cmov` where it seems to make sense. It just needs to learn to recognize the conditionally-swapping and the conditionally-incrementing cases. It should be able to compose those into the transformation used here, and thence to `cmov`.

#### Conclusion

What can we conclude from this discovery? Before we conclude anything, we should remind ourselves of its limitations. The tests run were on completely random data. Truly random data seldom occurs in real life. If there is a pattern in the data, the branch predictor might be able to find and exploit it. Shouldn’t it get that chance, sometimes?

Furthermore, the idea was tested on only the simplest possible elements, where both comparing and swapping were as cheap as they ever can be. While sorting word-sized objects is still an important use case, real sorting problems often have very different time budgets. We never changed the number or kind of comparisons performed, but the first big improvements doubled the number of copies; then the final improvement halved them again. A copy is a quarter or a third of a swap, but a dramatic change in their number had comparatively little effect on run time.

For many object types, the self-moves seen in the simplest, conditional-expression `swap_if` version are not permitted. For a production library, we might need to specialize on whether self-moves are allowed.

Our deep takeaway might be that counting comparisons and swaps ignores what are, today, the actually expensive operations: cache misses, of all kinds. Despite the millions of transistors devoted to the task, caches miss, sometimes systematically. Any time the number of misses, of all kinds, is not directly proportional to the number of operations counted, an analysis may produce the wrong answer, and lead us to bad choices.

But a lesson for the working programmer might be that, sometimes, you know the problem better than the compiler or cache systems, and measuring can lead you to simple code changes17 that avoid costly misses, with sometimes dramatic results.

#### Recommendations

Do these results suggest improvements to our Standard Library?

In C++, a factor of two in performance matters. We need two versions of each of the Standard sort, partition, binary search, merge, and heap algorithms that constitute nearly half of `<algorithm>`18: one set that shields the branch predictor, and a second set that exposes the branch predictor to any patterns it can tease out. But this does not mean we need that many new named library functions! Instead, we can bind the choice to an optional property of the comparison predicate. The default should be to shield the branch predictor, at least for small types, because the consequence of failing to protect it can be so severe.

A branchless `std::swap_if` would be good to have in the Standard Library19, even after optimizers learn to generate it themselves, because optimizers are sometimes too cautious. We should consider carefully, also, whether `std::count_if` merits attention.

#### More to Come

Is that everything? Just the one weird trick?

No! This was just one example. Modern CPU chips are packed to the rafters with dodgy gimcracks. They have pre-fetchers, micro-op caches with micro-op fusion, register renaming, a shadow return-address stack, pipelines everywhere, atomic interlocks, translation lookaside buffers, hugepages, vector units. Caches collude with one another over private buses.

It all makes many programs go faster than they have any business going, but not necessarily your program. A better algorithm is no longer enough to get top performance; your program needs to join in the dance of the million-transistor accelerators. Anybody who insists C is close to the machine is, at best, deluded.

Can the dance of the gimcracks drive k south of one nanosecond? Stay tuned20.

Finally: If you control a budget, and depend on the performance of software, it might be a good use of that budget to help improve compilers and standard libraries in the ways suggested.

The author thanks Andrei Alexandrescu for a most thorough and helpful review of an early, and very different, draft of this article.

1. Programs were run an an i7-7820X SkylakeX.

2. I call this bog-standard, but the quicksorts most easily found online are always oddly pessimized, both more complicated and slower. This one uses the “Lomuto partition”, which is simpler than Hoare’s.)

3. Little optimization pun there.