Making CPython faster

Over the last month or so, there has been a good bit of news surrounding the idea of increasing the performance of the CPython interpreter. At the 2021 Python Language Summit in mid-May, Guido van Rossum announced that he and a small team are being funded by Microsoft to work with the community on getting performance improvements upstream into the interpreter; crucially, that work must not break the C API, so that the ecosystem of Python extensions (e.g. NumPy) continues to work. Another talk at the summit looked at Cinder, which is a performance-oriented CPython fork that is used in production at Instagram. Cinder was recently released as open-source software, as was another project to speed up CPython that originated at Dropbox: Pyston.

There have been discussions on and development of performance enhancements for CPython going back quite a ways; it is a perennial topic at the yearly language summit, for example. More recently, Mark Shannon proposed a plan that could, he thought, lead to a 5x speedup for the language by increasing its performance by 50% in each of four phases. It was an ambitious proposal, and one that required significant monetary resources, but it seemed to go nowhere after it was raised in October 2020. It now seems clear that there were some discussions and planning going on behind the scenes with regard to Shannon's proposal.

Faster CPython

After an abbreviated retirement, Van Rossum went to work at Microsoft toward the end of 2020 and he got to choose a project to work on there. He decided that project would be to make CPython faster; to that end, he has formed a small team that includes Shannon, Eric Snow, and, possibly, others eventually. The immediate goal is even more ambitious than what Shannon laid out for the first phase in his proposal: double the speed of CPython 3.11 (due October 2022) over that of 3.10 (now feature-frozen and due in October).

The plan, as described in the report on the talk by Joanna Jablonski (and outlined in Van Rossum's talk slides), is to work with the community in the open on GitHub. The faster-cpython repositories are being used to house the code, ideas, tools, issue tracking and feature discussions, and so on. The work is to be done in collaboration with the other core developers in the normal, incremental way that changes to CPython are made. There will be "no surprise 6,000 line PRs [pull requests]" and the team will be responsible for support and maintenance of any changes.

The main constraint is to preserve ABI and API compatibility so that the extensions continue to work. Keeping even extreme cases functional (the example given is pushing a million items onto the stack) and ensuring that the code remains maintainable are both important as well.

While the team can consider lots of different changes to the language implementation, the base object type and the semantics of reference counting for garbage collection will need to stay the same. But things like the bytecode, stack-frame layout, and the internals of non-public objects can all be altered for better performance. Beyond that, the compiler that turns source code into bytecode and the interpreter, which runs the bytecode on the Python virtual machine (VM), are fair game as well.
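
To see what the compiler and interpreter are working with, the standard library's dis module will disassemble a function into its bytecode; the exact opcodes and layout vary between CPython versions, so the sample output shown in the comments (from 3.10) is only illustrative:

    import dis

    def add(a, b):
        return a + b

    # Disassemble the function to see the bytecode the compiler produced;
    # the interpreter runs these instructions on the Python VM.
    dis.dis(add)
    # On CPython 3.10 this prints something like:
    #   2           0 LOAD_FAST                0 (a)
    #               2 LOAD_FAST                1 (b)
    #               4 BINARY_ADD
    #               6 RETURN_VALUE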

In a "meta" issue in the GitHub tracker, Van Rossum outlined the three main pieces of the plan for 3.11. They all revolve around the idea of speeding up the bytecode interpreter through speculative specialization, which adapts the VM to run faster on some code because the object being operated on is of a known and expected type (or has some other attribute that can be determined with a simple test). Shannon further described what is being proposed in PEP 569 ("Specializing Adaptive Interpreter"), which he announced on the python-dev mailing list in mid-May. The "Motivation" section of the PEP explains the overarching idea:

Typical optimizations for virtual machines are expensive, so a long "warm up" time is required to gain confidence that the cost of optimization is justified. In order to get speed-ups rapidly, without [noticeable] warmup times, the VM should speculate that specialization is justified even after a few executions of a function. To do that effectively, the interpreter must be able to optimize and deoptimize continually and very cheaply.

By using adaptive and speculative specialization at the granularity of individual virtual machine instructions, we get a faster interpreter that also generates profiling information for more sophisticated optimizations in the future.

In order to do these optimizations, Python code objects will be modified in a process called "quickening" once they have been executed a few times. The code object will get a new, internal array to store bytecode that can be modified on-the-fly for a variety of optimization possibilities. In the GitHub issue tracking the quickening feature, Shannon lists several of these possibilities, including switching to "super instructions" that do much more (but more specialized) work than existing bytecode instructions. The instructions in this bytecode array can also be changed at run time in order to adapt to different patterns of use.
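
As a rough, pure-Python sketch of the quickening idea (not the actual CPython implementation; the class name, call counter, and threshold below are invented for illustration), the warm-up step and the mutable bytecode copy might look something like this:

    # Simplified model of "quickening": after a code object has executed a
    # few times, its bytecode is copied into a private, mutable array that
    # the interpreter can rewrite with adaptive or "super" instructions.

    WARMUP_THRESHOLD = 8  # hypothetical number of calls before quickening

    class QuickenedCode:
        def __init__(self, bytecode):
            self.original = tuple(bytecode)   # immutable, as in a code object
            self.quickened = None             # mutable copy, created lazily
            self.calls = 0

        def on_call(self):
            self.calls += 1
            if self.quickened is None and self.calls >= WARMUP_THRESHOLD:
                # Copy the bytecode into a plain mutable array that can be
                # rewritten in place as execution reveals common patterns.
                self.quickened = list(self.original)

        def replace(self, index, new_opcode):
            """Swap in a specialized or super instruction at run time."""
            if self.quickened is not None:
                self.quickened[index] = new_opcode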

During the quickening process, adaptive versions of instructions that can benefit from specialization are placed in the array instead of the regular instructions; the array is not a Python object, but simply a C array containing the code in the usual bytecode format (8-bit opcode followed by 8-bit operand). The adaptive versions determine whether to use the specialization or not:

CPython bytecode contains many bytecodes that represent high-level operations, and would benefit from specialization. Examples include CALL_FUNCTION, LOAD_ATTR, LOAD_GLOBAL and BINARY_ADD.

Introducing a "family" of specialized instructions for each of these instructions allows effective specialization, since each new instruction is specialized to a single task. Each family will include an "adaptive" instruction that maintains a counter and periodically attempts to specialize itself. Each family will also include one or more specialized instructions that perform the equivalent of the generic operation much faster provided their inputs are as expected. Each specialized instruction will maintain a saturating counter which will be incremented whenever the inputs are as expected. Should the inputs not be as expected, the counter will be decremented and the generic operation will be performed. If the counter reaches the minimum value, the instruction is deoptimized by simply replacing its opcode with the adaptive version.
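
The specialize/deoptimize cycle described in the PEP can be pictured with a simple saturating counter. The following is only a schematic model with invented names and thresholds; the real mechanism operates on C bytecode arrays inside the interpreter loop, not on Python objects:

    # Schematic of an adaptive instruction family: a generic operation, a
    # specialized fast path guarded by a cheap check, and a saturating
    # counter that triggers deoptimization when the speculation keeps failing.

    COUNTER_MAX = 16   # hypothetical saturation limits
    COUNTER_MIN = 0

    class SpecializedAdd:
        """Plays the role of a specialized BINARY_ADD for int operands."""

        def __init__(self):
            self.counter = COUNTER_MAX // 2

        def execute(self, left, right):
            if type(left) is int and type(right) is int:
                # Inputs match the speculation: take the fast path and
                # count a success.
                self.counter = min(self.counter + 1, COUNTER_MAX)
                return left + right
            # Speculation failed: fall back to the generic operation and
            # decrement the counter; hitting the minimum means "deoptimize"
            # (in CPython, the opcode would be replaced with the adaptive one).
            self.counter = max(self.counter - 1, COUNTER_MIN)
            if self.counter == COUNTER_MIN:
                print("deoptimize: revert to the adaptive instruction")
            return left + right  # generic addition handles any types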

The PEP goes on to describe two of these families (CALL_FUNCTION and LOAD_GLOBAL) and the kinds of specializations that could be created for them. For example, there could be specialized versions to call builtin functions with one argument or to load a global object from the builtin namespace. It is believed that 25-30% of Python instructions could benefit from specialization. The PEP only gives a few examples of the kinds of changes that could be made; the exact set of optimizations, and which instructions will be targeted, are still to be determined.
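
As a loose illustration of what a LOAD_GLOBAL specialization buys, the toy cache below remembers whether a name was found in the module's globals or in the builtins and reuses that answer while a cheap guard holds; the guard here is a hypothetical stand-in for the dictionary version checks a real implementation would rely on:

    import builtins

    # Toy model of a LOAD_GLOBAL fast path: remember where the name was
    # found and skip the double lookup on later executions as long as a
    # cheap guard says nothing has changed.

    class LoadGlobalCache:
        def __init__(self, name):
            self.name = name
            self.where = None      # "globals", "builtins", or None
            self.guard = None      # crude stand-in for a dict version tag

        def load(self, module_globals):
            guard = len(module_globals)  # crude guard; CPython uses version tags
            if self.where == "globals" and self.guard == guard:
                return module_globals[self.name]        # specialized path
            if self.where == "builtins" and self.guard == guard:
                return getattr(builtins, self.name)     # specialized path
            # Generic path: do the full lookup, then record what we learned.
            self.guard = guard
            if self.name in module_globals:
                self.where = "globals"
                return module_globals[self.name]
            self.where = "builtins"
            return getattr(builtins, self.name)

    cache = LoadGlobalCache("len")
    print(cache.load(globals()))   # first call takes the generic path
    print(cache.load(globals()))   # later calls hit the cached builtins path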

Other CPythons

In Dino Viehland's summit talk about Cinder, he described a feature called "shadow bytecode" that is similar to what is being proposed in PEP 659. Cinder, though, is not being run as an open-source project; the code is used in production at Instagram and has been made available so that parts of it can potentially be adapted for upstream CPython. Some parts of Cinder have already been added to CPython, including two enhancements (bpo-41756 and bpo-42085) that simplified coroutines to eliminate the use of the StopIteration exception: "On simple benchmarks, this was 1.6 times faster, but it was also a 5% win in production."
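
For context on why StopIteration matters there: when a generator-based coroutine finishes, its return value is ordinarily delivered to the caller wrapped in a StopIteration exception. The snippet below only demonstrates that standard, exception-based handoff, which is roughly the overhead those changes avoid internally; it is not the patch itself:

    def compute():
        # A generator's return value is delivered to the caller by raising
        # StopIteration with the value attached.
        return 42
        yield  # unreachable, but makes this a generator function

    gen = compute()
    try:
        next(gen)
    except StopIteration as exc:
        print(exc.value)  # prints 42: the return value rode along on the exception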

Pyston takes a somewhat different approach than what is being proposed, but there are overlaps (e.g. quickening). As described in its GitHub repository, one of the techniques Pyston uses is a just-in-time (JIT) compiler with "very low overhead", built on the dynamic assembler (DynASM) from the LuaJIT project. Pyston provides around 30% better performance on web applications.

Both Cinder and Pyston are based on Python 3.8, so any features that are destined for upstream will likely need updating. The PEP 659 effort, by contrast, is meant to be carried out within the community directly, which is not something either of the other two projects was able to do; both started as internal closed-source "skunkworks" projects that have only recently seen the light of day. How much of that work will be useful in upstream CPython remains to be seen.

It will be interesting to watch the work of Van Rossum's team as it tries to reach a highly ambitious goal. Neither of the other two efforts achieved performance boosts anywhere close to the 100% increase targeted by the specializing-adaptive-interpreter team, though Shannon said that he had working code to fulfill the 50% goal for his first phase back in October. Building on top of that code makes 2x (a 100% increase) seem plausible at least and, if that target can be hit, the overall 5x speedup Shannon envisioned might be reached as well. Any increase is welcome, of course, but those kinds of numbers would be truly eye-opening; stay tuned ...

