
CPU Branch Prediction: Your Code's Hidden Bottleneck

There's a famous Stack Overflow question that has been viewed over 3 million times: why is processing a sorted array faster than processing an unsorted one? The code is identical, a loop with an if statement; the only difference is whether the input array is sorted. The sorted version runs roughly 6x faster. The explanation — CPU branch prediction — reveals one of the most consequential performance factors in modern computing.

Every if statement, every loop condition, every switch case — these are branches in your code. Your CPU encounters billions of them per second and must predict the outcome of each one before it knows the actual answer. When it predicts correctly, execution continues at full speed. When it predicts wrong, the CPU throws away 10-20 cycles of work and starts over. Understanding this mechanism changes how you think about performance-critical code.

Why CPUs Need to Predict

Modern CPUs are pipelined: they don't wait for one instruction to finish before starting the next. A typical CPU pipeline is 12-20 stages deep, so while one instruction is executing, the next 15 or so are already being fetched, decoded, and prepared. Pipelining is what lets a CPU sustain a throughput of roughly one instruction per cycle per pipeline (superscalar cores run several pipelines in parallel), even though each individual instruction takes many cycles to complete.

But when the CPU encounters a branch (if x > 0), it has a problem. The next instruction depends on whether the branch is taken or not taken. The CPU won't know the answer for several cycles — the comparison result needs to travel through the pipeline. It has two choices: stall the pipeline and wait (wasting 15+ cycles doing nothing), or guess the branch direction and keep working.

CPUs always guess. The branch predictor makes a prediction, and the CPU fetches and executes instructions along the predicted path. If the prediction was correct, nothing is wasted. If the prediction was wrong, the CPU must flush the incorrectly executed instructions, discard their results, and restart from the correct path. This 'pipeline flush' costs roughly 15-20 cycles on modern x86 CPUs — equivalent to 15-20 instructions of lost work.

How Branch Predictors Work

Modern branch predictors are remarkably sophisticated. They're essentially pattern-matching hardware that learns from the history of each branch instruction.

The Simple Case: Biased Branches

Many branches are heavily biased — they almost always go the same way: an error check that succeeds 99.9% of the time, or a loop condition that's true for 999 iterations and false once. The branch predictor quickly learns these patterns and achieves near-perfect accuracy: once it settles on the majority direction, a branch that goes the same way 99% of the time mispredicts only about 1% of the time.

Pattern Recognition

Modern predictors go beyond simple bias tracking. They recognize repeating patterns. If a branch follows the pattern taken-taken-not-taken-taken-taken-not-taken... (period 3), the predictor will learn this pattern and predict it correctly. The TAGE (TAgged GEometric history length) predictor family — widely believed to underlie the predictors in modern Intel and AMD CPUs — maintains multiple history tables at different history lengths, catching patterns with periods from 2 up to several hundred.

Correlated Branches

The most advanced predictors also learn correlations between branches. If branch A being taken always means branch B (later in the code) is not taken, the predictor can learn this relationship. This is called 'two-level adaptive prediction' and it handles code patterns like:

// These branches are correlated
if (x > 0) {      // Branch A
    y = x * 2;
}
// ... some code ...
if (x <= 0) {     // Branch B — always opposite of Branch A
    y = -x;
}
// The predictor learns: when A is taken, B is not taken, and vice versa
// Both branches are predicted with near-perfect accuracy

The Sorted Array Mystery Explained

Back to the famous Stack Overflow question. The code looks like this:

int sum = 0;
for (int i = 0; i < arraySize; i++) {
    if (data[i] >= 128)    // This is the critical branch
        sum += data[i];
}

// With SORTED data [0, 1, 2, ..., 127, 128, 129, ..., 255]:
//   Branch pattern: NNNNNN...NNNTTTTTT...TTT
//   First half: always NOT taken. Second half: always taken.
//   Prediction accuracy: ~99.9% (only mispredicts at the transition)
//
// With UNSORTED data [random values 0-255]:
//   Branch pattern: TNNTTNNTTTNNTNNT... (random)
//   Prediction accuracy: ~50% (no pattern to learn)
//   Every misprediction costs ~15 cycles

With sorted data, the branch is perfectly predictable: it's not-taken for the first half of the array, then taken for the second half. The predictor quickly learns each phase and only mispredicts at the single transition point. With random data, the branch is essentially a coin flip — the predictor can't do better than ~50% accuracy, and half the iterations waste 15 cycles on pipeline flushes.

At 100,000 iterations with a ~50% misprediction rate, that's 50,000 mispredictions × 15 cycles = 750,000 wasted cycles. On a 3 GHz CPU, that's a quarter of a millisecond — a significant fraction of the total runtime for this simple loop.

How Many Branches Can Your CPU Handle?

The branch predictor has limited storage. It tracks history for a finite number of branch locations using a structure called the Branch Target Buffer (BTB) and associated history tables. Modern CPUs can track tens of thousands of branch locations, but large codebases can exceed this.

When two different branch instructions alias to the same predictor entry (because the BTB is indexed by a hash of the instruction address), they interfere with each other's predictions. This is rare in normal code but becomes relevant in very large functions or when many small functions each contain branches.

Daniel Lemire has published excellent research measuring branch predictor capacity across different CPUs. The results show that modern Intel CPUs (Alder Lake and later) can effectively predict branches with histories up to about 200 branches deep — meaning the predictor considers the outcomes of the last ~200 branches when predicting the current one. AMD CPUs have similar capacity. ARM chips vary more widely; the M-series Apple chips have particularly strong branch predictors.

Writing Branch-Prediction-Friendly Code

In most code, branch prediction doesn't matter. The predictor is good enough, and the branches aren't in hot loops. But when you're optimizing performance-critical paths — inner loops, data processing pipelines, parsers, codecs — branch behavior can dominate runtime.

Replace Branches With Arithmetic

The fastest branch is one that doesn't exist. Many conditional patterns can be replaced with branchless arithmetic.

// Branching version:
int max_branch(int a, int b) {
    if (a > b) return a;
    return b;
}

// Branchless version (compiler often does this automatically):
int max_branchless(int a, int b) {
    return a ^ ((a ^ b) & -(a < b));
}

// Conditional move — the compiler's favorite trick:
int max_cmov(int a, int b) {
    // Most compilers will emit a CMOV instruction for this
    return (a > b) ? a : b;
}

// Clamping with branches:
int clamp_branch(int x, int lo, int hi) {
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

// Clamping branchless:
int clamp_branchless(int x, int lo, int hi) {
    // min(max(x, lo), hi)
    int t = x > lo ? x : lo;  // CMOV
    return t < hi ? t : hi;   // CMOV
}

Modern compilers are generally good at converting simple conditionals to CMOV (conditional move) instructions, which avoid branches entirely. But they can't always do it — especially when the branches have side effects or when the 'then' path is expensive. Explicit branchless patterns help in hot loops where the compiler isn't generating optimal code.

Sort Before Processing

If you're going to filter data with a branch, sorting the data first makes the branch perfectly predictable. This sounds counterintuitive — sorting is O(n log n) overhead — but for large datasets where the processing loop runs many times or where the branch misprediction cost is high, sorting pays for itself.

Use Lookup Tables

Complex conditional logic (switch statements with many cases) can be replaced with table lookups. Instead of a chain of comparisons, index into an array. The array access might cause a cache miss, but that's typically cheaper than a cascade of mispredicted branches.

// Branch-heavy version:
char to_hex(int nibble) {
    if (nibble < 10)
        return '0' + nibble;
    else
        return 'a' + nibble - 10;
}

// Lookup table version (no branches):
static const char hex_table[] = "0123456789abcdef";

char to_hex_lut(int nibble) {
    return hex_table[nibble];
}

Provide Hints to the Compiler

GCC and Clang support __builtin_expect to tell the compiler which branch direction is likely. This doesn't directly control the CPU's branch predictor (which learns at runtime), but it affects how the compiler lays out code — putting the likely path in a straight line without jumps, which is faster due to instruction cache effects.

// Tell the compiler that errors are unlikely
if (__builtin_expect(result == NULL, 0)) {
    handle_error();
}

// C++20 provides cleaner syntax:
if (result == nullptr) [[unlikely]] {
    handle_error();
}

// The Linux kernel defines macros for this:
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

if (unlikely(ptr == NULL)) {
    panic("null pointer");
}

Measuring Branch Behavior

You can measure branch prediction performance directly using hardware performance counters. On Linux, perf stat gives you misprediction rates for any program.

# Measure branch prediction statistics
$ perf stat -e branches,branch-misses ./my_program

 Performance counter stats for './my_program':

     1,482,937,256      branches          #  892.7 M/sec
         3,271,842      branch-misses     #  0.22% of all branches

# 0.22% misprediction rate — excellent.
# Anything under 1% is good.
# Over 5% in a hot loop means branch prediction is hurting you.
# Over 10% means you should consider branchless alternatives.

For micro-benchmarks, tools like Google Benchmark can track branch-misses per iteration, giving you precise feedback on whether a code change improved branch behavior. This is more actionable than wall-clock time, which includes noise from other system activity.

When to Care (and When Not To)

Branch prediction optimization matters in a narrow but important category of code: tight inner loops that process large amounts of data. Parsers, compressors, image processing, numerical computation, database query evaluation — these are where branch behavior dominates.

For everything else — business logic, API handlers, CRUD operations — branch prediction is irrelevant. The branches in your web server's request router are not your bottleneck. Don't rewrite readable if/else chains into unreadable branchless arithmetic in code that runs once per HTTP request. Optimize where the profiler tells you to, not where the theory tells you there might be a problem.

The real value of understanding branch prediction isn't in manual optimization — it's in understanding why your code performs the way it does. When a profiler shows that a simple loop with an if statement is slower than expected, branch misprediction is often the explanation. When sorting your input data magically speeds up processing, branch prediction is why. Having this mental model means you can diagnose performance issues that would otherwise be mysterious.