Python has been 'too slow' for as long as Python has existed. The standard response from the Python community — 'use C extensions for the hot loops' — has always been a concession that the language's default execution model is fundamentally limited. CPython interprets bytecode one instruction at a time, with each instruction dispatched through a switch statement. It's simple, portable, and easy to debug. It's also about 100x slower than compiled C for compute-heavy work.
Python 3.15 changes this. After years of experimental work, the copy-and-patch JIT compiler is shipping as a default-enabled feature. It won't make Python as fast as C — nothing will, short of static compilation — but the early benchmarks show 15-30% speedups on real-world code, with specific patterns seeing much larger improvements. For a language where 'just rewrite the hot path in C' has been the performance story for 30 years, a JIT that meaningfully accelerates pure Python code is a genuine milestone.
Why CPython Has Never Had a JIT
This isn't because nobody tried. PyPy has had a JIT for over a decade and regularly runs Python code 5-10x faster than CPython. But PyPy is a separate implementation with its own runtime, and it's never achieved CPython's market share because the C extension ecosystem — NumPy, pandas, scikit-learn, everything that makes Python the language of data science — is tied to CPython's C API.
Building a JIT into CPython itself has been discussed and attempted multiple times, and the challenges are well-documented. CPython's architecture makes JIT compilation hard: the bytecode carries no static type information (a JIT needs types to generate efficient code, and in Python they're only known at runtime), the C API lets extension code manipulate Python objects directly in ways that break JIT assumptions, and reference counting imposes per-object bookkeeping overhead that a JIT can't easily eliminate.
Previous attempts — Unladen Swallow (Google, 2009), Pyston (Dropbox, 2014) — tried to bolt LLVM-based JIT compilation onto CPython. Both found that LLVM's compilation overhead was too high for Python's typical workloads. LLVM is designed for ahead-of-time compilation of large codebases; using it to JIT-compile short Python functions adds milliseconds of compilation time for microseconds of execution time. The compilation was more expensive than the speedup.
Copy-and-Patch: A Different Kind of JIT
The copy-and-patch technique, introduced in a 2021 research paper, takes a fundamentally different approach to JIT compilation. Instead of translating bytecode into an intermediate representation and running optimization passes (the LLVM approach), copy-and-patch works with pre-compiled code templates.
Here's the idea: for each bytecode instruction (LOAD_FAST, BINARY_OP, CALL, and so on), a C implementation is pre-compiled into machine code when CPython itself is built, with placeholder 'holes' for instruction-specific data — register assignments, constant values, memory offsets. At runtime, JIT compilation is just: copy the pre-compiled template, fill in the holes with the specific values for this function. No optimization passes, no register allocation algorithms, no instruction selection. Just memcpy and patch.
Traditional JIT pipeline:
Python bytecode
→ Parse into IR
→ Type inference
→ Optimization passes (CSE, DCE, loop unrolling, inlining...)
→ Register allocation
→ Instruction selection
→ Machine code
Total: 1-100ms per function
Copy-and-patch pipeline:
Python bytecode
→ For each instruction, copy pre-compiled template
→ Fill in holes (constants, offsets)
→ Done — machine code ready
Total: 1-100μs per function (1000x faster compilation)
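The copy-and-fill step above can be sketched in pure Python. This is a toy model, not CPython internals: the byte strings stand in for pre-compiled machine-code stencils, and all names here (HOLE, TEMPLATES, jit_compile) are invented for illustration.

```python
# Toy model of copy-and-patch compilation. Templates stand in for
# pre-compiled machine-code stencils; the "hole" is a fixed-width
# placeholder baked into each template at build time.

HOLE = b"\xDE\xAD\xBE\xEF"  # 4-byte placeholder the patcher looks for

# Pretend these byte strings are machine code emitted ahead of time,
# one stencil per bytecode instruction (byte values are made up).
TEMPLATES = {
    "LOAD_CONST": b"\x48\xb8" + HOLE,      # e.g. "mov rax, <imm>"
    "STORE_SLOT": b"\x48\x89\x87" + HOLE,  # e.g. "mov [rdi+<off>], rax"
}

def jit_compile(instructions):
    """'Compile' a sequence of (opcode, operand) pairs: for each one,
    copy the stencil and patch its hole with the operand. No IR, no
    optimization passes -- just concatenation and byte patching."""
    code = bytearray()
    for opcode, operand in instructions:
        stencil = TEMPLATES[opcode]
        code += stencil.replace(HOLE, operand.to_bytes(4, "little"))
    return bytes(code)

machine_code = jit_compile([("LOAD_CONST", 42), ("STORE_SLOT", 16)])
```

The entire "compiler" is a loop around copy and replace, which is why compilation lands in the microsecond range.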
The trade-off is code quality. LLVM produces highly optimized machine code. Copy-and-patch produces machine code that's essentially a compiled version of the interpreter loop — each bytecode instruction is still a separate template, with minimal cross-instruction optimization. The generated code is better than interpretation (no dispatch overhead, no switch statement, better branch prediction) but worse than what a full optimizing compiler would produce.
For Python, this trade-off is excellent. Python functions are typically short, called many times, and individually take microseconds. A JIT that compiles in microseconds and speeds up execution by 20-30% is more valuable than one that compiles in milliseconds and speeds up execution by 200% — because the compilation overhead for the first approach is amortized almost immediately.
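The amortization argument is easy to make concrete with back-of-the-envelope numbers. The figures below are illustrative, taken from the ranges quoted above, and break_even_calls is a made-up helper, not a real API:

```python
# How many calls before JIT compilation pays for itself?
# Per-call saving is baseline * (1 - 1/speedup).

def break_even_calls(compile_cost_us, baseline_us, speedup):
    saving_per_call = baseline_us * (1 - 1 / speedup)
    return compile_cost_us / saving_per_call

# Copy-and-patch: ~50us to compile, 1.25x faster on a 10us function.
fast_jit = break_even_calls(50, 10, 1.25)    # pays off after 25 calls
# LLVM-style: ~5ms to compile, 3x faster on the same function.
heavy_jit = break_even_calls(5000, 10, 3.0)  # pays off after 750 calls
```

Under these (invented) numbers, the cheap compiler breaks even an order of magnitude sooner, which is exactly the property that matters for short, frequently-called Python functions.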
What Gets Faster
The JIT doesn't magically speed up all Python code equally. Understanding what benefits most requires understanding what the interpreter spends its time on.
Bytecode dispatch overhead. In the interpreter, each bytecode instruction requires: fetch the next opcode, decode it, jump to the handler through a switch statement. This dispatch overhead can be 30-50% of total execution time for tight loops. The JIT eliminates it entirely — the instructions are compiled into sequential machine code with direct jumps.
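You can see the raw material for this overhead with the standard dis module: every line of the disassembly below is one trip through the interpreter's fetch-decode-dispatch cycle, which the JIT replaces with straight-line machine code.

```python
import dis
import io

def tight_loop(n):
    total = 0
    for i in range(n):
        total += i
    return total

# Capture the disassembly; each listed instruction costs a dispatch
# in the interpreter, independent of the work the instruction does.
buf = io.StringIO()
dis.dis(tight_loop, file=buf)
listing = buf.getvalue()
print(listing)
```

For a loop body this small, the dispatch machinery can rival the useful work, which is why tight numeric loops are where the JIT helps most.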
Type-specialized operations. Python 3.11 introduced the specializing adaptive interpreter, which replaces generic operations with type-specific ones after observing the actual types used. A generic BINARY_OP becomes BINARY_OP_ADD_INT once it has only ever seen two integers. The JIT compiles these specialized instructions into efficient machine code — an integer addition becomes a single add instruction instead of a function call.
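The specialization is observable with dis, which since 3.11 accepts an adaptive flag showing the rewritten opcodes actually in place (on older interpreters the sketch falls back to plain disassembly):

```python
import dis
import io
import sys

def add(a, b):
    return a + b

# Warm the function up so the adaptive interpreter can observe that
# the operands are always integers.
for _ in range(1000):
    add(1, 2)

buf = io.StringIO()
if sys.version_info >= (3, 11):
    # adaptive=True shows specialized opcodes, e.g. BINARY_OP_ADD_INT.
    dis.dis(add, file=buf, adaptive=True)
else:
    dis.dis(add, file=buf)
listing = buf.getvalue()
print(listing)
```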
Branch prediction. The interpreter's central dispatch loop — a switch with hundreds of cases — is a nightmare for the CPU's branch predictor. The JIT replaces this with direct control flow that the CPU predicts accurately. On modern CPUs where branch mispredictions cost 15-20 cycles, this alone accounts for a significant portion of the speedup.
What doesn't get faster: C extension calls (NumPy, pandas), I/O operations (network, disk), operations dominated by memory allocation (creating millions of small objects). If your Python program spends 95% of its time in C extensions and 5% in pure Python, the JIT speeds up the 5% — measurable but not transformative.
The Specialization Pipeline
The JIT doesn't work alone. It's the final stage in a performance pipeline that started with Python 3.11's specializing interpreter and continued through 3.12-3.14's incremental improvements.
- Tier 0: Interpreter. All code starts here. Standard bytecode interpretation with adaptive specialization. After a function is called several times, hot instructions are replaced with type-specialized versions.
- Tier 1: JIT-compiled bytecode. The copy-and-patch JIT compiles specialized bytecode into machine code. This eliminates dispatch overhead and enables basic optimizations like constant folding and dead code elimination within the compiled templates.
- Tier 2 (future): Trace-based optimization. Recording execution traces through hot code paths and compiling entire traces — across function boundaries — into optimized machine code. This is planned but not yet shipping.
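The tier-promotion logic above can be caricatured in a few lines of Python. This is a toy model: the threshold, the Tiered class, and the idea of handing in a pre-built "compiled" callable are all invented for the sketch; CPython does the counting and compilation internally.

```python
# Toy tiered execution: run the "interpreted" function until a
# call-count threshold, then promote to the "compiled" version.

THRESHOLD = 8  # made-up hotness threshold

class Tiered:
    def __init__(self, func, compiled):
        self.func = func          # tier 0: interpreted
        self.compiled = compiled  # tier 1: JIT-compiled equivalent
        self.calls = 0
        self.tier = 0

    def __call__(self, *args):
        self.calls += 1
        if self.tier == 0 and self.calls > THRESHOLD:
            self.tier = 1  # hot enough: promote
        target = self.compiled if self.tier == 1 else self.func
        return target(*args)

def square(x):
    return x * x

# The "compiled" version must be semantically identical -- only faster.
f = Tiered(square, compiled=lambda x: x * x)
results = [f(i) for i in range(12)]
```

Cold functions never pay for compilation, and promotion is invisible to callers because both tiers compute the same results.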
This tiered approach is similar to what Ruby's YJIT and other modern language runtimes use. Start with fast interpretation, move to fast compilation when code is hot, save expensive optimization for the hottest paths. It's the same fundamental insight: most code isn't worth optimizing, so spend your compilation budget on the code that runs the most.
Memory and Startup Impact
JIT compilers consume memory for compiled code. The copy-and-patch JIT's memory overhead is modest — compiled code is larger than bytecode but smaller than what LLVM-based JITs produce (because there's no optimization bloat). The current implementation uses roughly 1.5-3x the memory of the bytecode it replaces, and only compiles functions that are actually called frequently enough to benefit.
Startup time is a concern for short-lived Python scripts. The JIT adds overhead for loading the template library and setting up the compilation infrastructure. For scripts that run for less than a second, the JIT overhead might exceed the speedup. CPython addresses this by only JIT-compiling functions after they've been called a threshold number of times — short-lived scripts stay in the interpreter and pay no JIT tax.
This is configurable. The -X jit flag controls JIT behavior, and environment variables tune the compilation threshold. For serverless functions and CLI tools where startup matters, you can raise the threshold or disable the JIT entirely. For long-running servers and data processing scripts where steady-state performance matters, the defaults work well.
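In experimental JIT builds of CPython the switches look roughly like the following; the exact spellings (the -X jit option and the PYTHON_JIT environment variable) may differ between versions, so treat this as a sketch rather than a reference:

```shell
# Disable the JIT for a short-lived CLI tool (startup-sensitive).
python -X jit=0 mytool.py

# Same via the environment, e.g. in a serverless runtime config.
PYTHON_JIT=0 python handler.py

# Leave the defaults for long-running servers; the JIT only compiles
# functions that cross the call-count threshold anyway.
python server.py
```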
What This Means for the Python Ecosystem
The JIT doesn't change Python's position in the performance hierarchy — C, Rust, Go, and Java are still dramatically faster for compute-heavy work. What it changes is the threshold at which Python developers need to reach for those alternatives.
A 20-30% speedup on pure Python code means that some workloads that previously required C extensions or a rewrite now run fast enough in pure Python. Data processing scripts that took 10 minutes take 7. Web servers that handled 1000 requests per second handle 1300. These aren't revolutionary numbers, but they're the difference between 'Python is fast enough' and 'we need to rewrite this in Go.'
More importantly, the JIT infrastructure creates a foundation for future optimization. The copy-and-patch approach can be extended with better templates, more specialization, and eventually trace-based compilation. Each version of Python can ship better templates without changing the JIT's fundamental architecture. The 20-30% speedup in 3.15 is a floor, not a ceiling.
After three decades as one of the world's most popular languages, and one of its slowest, CPython is finally investing seriously in performance. The JIT won't satisfy the 'Python is too slow' crowd — nothing will, because for some workloads Python genuinely is too slow, JIT or not. But for the vast majority of Python code, where execution speed was 'fine but not great,' the JIT moves it closer to 'genuinely good.' That's a bigger deal than it sounds.