PITTSBURGH — “In Python, you pay at runtime,” goes an old Python aphorism.
The Python programming language has garnered a rep for being rather pokey: a good starter language, but one without the speed of its more sophisticated brethren.
And yet, many of the talks at this year’s PyCon US 2024, held last month in Pittsburgh, showed how researchers are pushing the frontiers of the language.
Compile Python Code for Faster Maths

Saksham Sharma at PyCon US 2024.
As a director of Quantitative Research Technology at Tower Research Capital, Saksham Sharma builds trading systems in C++. “I like fast code,” he said at the beginning of his talk.
He’d like to bring some of that zest to Python.
Python is an interpreted language (though CPython, the reference implementation of Python, is itself written in C). The interpreter converts the source code into an efficient bytecode, which it then executes directly, building an internal state of the program from all the objects and variables as they are read in (rather than compiling the program into machine code ahead of time, as a compiler does).
“So we are going through a bunch of indirections here, so things could get slow,” Sharma said.
For Python, even a simple instruction to add two numbers together can result in over 500 instructions to the CPU itself, including not only the addition, but all the supporting instructions, such as writing the answer to a new object.
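You can watch some of that indirection yourself with the standard library’s dis module, which disassembles a function into interpreter bytecode; each bytecode op then expands into many machine instructions inside the interpreter loop. (The exact output varies by Python version.)

```python
import dis

def add(a, b):
    return a + b

dis.dis(add)
# On CPython 3.12 this prints something like:
#   RESUME       0
#   LOAD_FAST    0 (a)
#   LOAD_FAST    1 (b)
#   BINARY_OP    0 (+)
#   RETURN_VALUE
# Each of these ops is dispatched and executed by the interpreter
# loop, which is where the extra CPU instructions come from.
```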
Cython, an optimizing static compiler for Python, allows you to annotate Python code with C types, compile it ahead of time to C, and then use the results in your Python program.
“You can build external libraries and utilities that build into your interpreter, and they can interact with the internal state of your interpreter,” Sharma said. “If you have a function that you wrote in Cython, your interpreter can be configured to call that function.”
Thus the Python code to add two variables, sketched here as a representative stand-in for the example from the talk…
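```python
def add(a, b):
    # Plain Python: the interpreter must check the types of a and b
    # on every call before it can perform the addition.
    return a + b
```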
can be rendered for Cython thusly, again as an illustrative sketch using Cython’s C type declarations:
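```cython
cpdef int add(int a, int b):
    # With C types declared, the compiled code adds the integers
    # directly, with no per-call type checks by the interpreter.
    return a + b
```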
More typing for the developer, but less work at runtime.
Sharma found that, on his own machine, an addition operation like this could take 70 nanoseconds in Python, but roughly 14 nanoseconds with Cython.
“Cython definitely made it faster because the interpreter is no longer in the picture,” Sharma said. For instance, each time the interpreter has to add two variables, it has to check the type of each one. But if you already know what the type is, why not eliminate that check altogether? This is what programmers do when they declare a variable type in the code.
“Typed code can be much, much faster,” Sharma said.

Cython can speed up inner loops (Saksham Sharma)
A Python that Zooms with Static Typing

Anaconda’s Antonio Cuni.
As Sharma pointed out in the previous talk, there is a lot to be gained through static typing in Python. With static typing, you define what type of data a variable holds. Is it a string? An integer? An array? Otherwise, it takes time for the interpreter to figure all this out.
In his talk, Anaconda Principal Software Engineer Antonio Cuni introduced SPy, a new subset of Python that requires static typing. The aim is to offer the speed of C or C++ while retaining the user-friendly feel of Python itself.
Cuni explained that Python has to do a lot of work before executing the instructions themselves. Like Sharma, he pointed out that with a “low-level language, you usually do less stuff at runtime.”
Before it can execute the logic, the Python interpreter must locate all the supporting pieces, such as modules and libraries, needed to execute that logic. This middle stage of work can take a lot of time.
The good news is that a lot of this work can be done ahead of time, in a compilation stage.
With SPy, all global constants — such as classes, modules and global data — are frozen as “immutable,” and can be optimized (thanks to their typing) with a just-in-time (JIT) compiler.
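SPy’s syntax is still taking shape, so any example is speculative, but the flavor can be sketched with ordinary Python type annotations standing in for SPy’s required static types. This is a hypothetical illustration, not actual SPy code:

```python
# Hypothetical sketch: in SPy's model, module-level definitions like
# these would be frozen as immutable at compile time, so a JIT can
# specialize the code for the known types instead of re-checking them.
TAX_RATE: float = 0.21  # a global constant, frozen once the module loads

def net_price(gross: float) -> float:
    # With the types and the constant fixed ahead of time, no runtime
    # lookup or type dispatch is needed for this arithmetic.
    return gross * (1.0 - TAX_RATE)
```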
Currently, Cuni is working on implementing SPy, either as an extension to CPython, or with its own JIT compiler. He is also looking into a version that can run within WebAssembly.

Compile vs. Interpret (Antonio Cuni)
C Extensions, But Statically Linked
Loren Arthur, a Meta engineering manager, also demonstrated in his talk that rewriting processing-heavy functions in C can boost performance considerably, but you have to be careful how they are loaded into the program.

Loren Arthur
In his demonstration, a C module imported into Python cut the time needed to chew through the data in a sample file from 4 seconds (how long regular Python code took) down to nearly half a second.
It sounds minuscule, of course. But for an operation the size of Meta, it adds up. Converting Python functionality into nimbler C code across 90,000 Python applications saved Meta engineers 5,000 hours a week, thanks to improved build speeds alone.
This was great. Instagram built thousands of C extensions to move things along.
But then! The social media giant ran into another problem. The import time for C extensions ballooned as more of them were included in a build. This was odd, because most of these modules are rather small, perhaps containing a single method that returns a string.
Using Callgrind (part of the Valgrind suite of dynamic analysis tools), Arthur found that a single function, dlopen, which opens the shared object, was taking 92% of the time.
“Loading shared objects is expensive, especially when you get to very large numbers,” he said.
Meta found the answer in embedded C extensions, which use static linking rather than dynamic linking. Instead of loading a shared object at runtime, the C code is copied directly into the executable file.
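You can see the distinction from Python itself: extension modules that are statically linked into the interpreter show up in sys.builtin_module_names and are imported without any dlopen() call. A quick sketch (the exact module lists vary by platform and build):

```python
import sys

# Modules compiled directly into the interpreter binary; importing
# them never opens a shared object:
print(sys.builtin_module_names)

# A dynamically linked extension lives in its own shared object,
# loaded via dlopen() at import time. Whether a given module is
# built in or shared depends on how CPython was built:
import math
print(getattr(math, "__file__", "built in; no shared object"))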

Objects that Live Forever

Vinícius Gubiani Ferreira explains a new way to do multicore programming, at PyCon US 2024.
The Global Interpreter Lock (GIL), which prevents multiple threads from executing Python code at the same time, didn’t start out as the villain of this story, in the view of Vinícius Gubiani Ferreira, a software engineering team lead at Azion Technologies.
Rather, the GIL was the hero who stuck around too long and became a villain.
Ferreira’s talk discussed PEP 683, which sought to improve memory consumption for large-scale applications. The resulting feature was included in Python 3.12, released in October.
The GIL was designed to prevent race conditions, but it also hobbles Python from doing true multicore parallel computing. There is work underway to make the GIL optional, but it may be a few years before that is stabilized in the language runtime itself.
Basically, everything in Python is an object, Ferreira explained. Variables, dictionaries, functions, methods and even instances: all objects. In its most basic form, an object consists of a type, a value and a reference count, which tallies the number of other objects that point to this one.
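Python’s sys module makes the tally visible (note that sys.getrefcount reports one extra reference, for its own argument):

```python
import sys

x = ["a", "list"]
print(sys.getrefcount(x))  # 2: the name x, plus getrefcount's own argument

y = x                      # a second name pointing at the same object
print(sys.getrefcount(x))  # 3
```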
At the memory level, all Python objects are mutable, even those marked as immutable (such as strings), because the reference count changes a lot. Like, really a lot. This is problematic. Every update means the CPU cache gets invalidated. It complicates forking a program. It invites data races: concurrent changes may overwrite each other, and if the count mistakenly works out to zero, then boom! The garbage collector erases the object.
The more you scale an app, the more aggravated these problems become.
The answer is easy enough: Create an immutable state where the reference count never changes, namely by setting the refcount to a special, very high number that will never be reached in practice (you could have a program increment a counter up to it, Ferreira noted, but it would take days). The runtime manages these super-special objects separately, and is in charge of shutting them down.
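On CPython 3.12 and later, you can see this for yourself: immortal objects such as None and the small cached integers report a fixed, very large reference count (the exact value is an implementation detail, not something to rely on):

```python
import sys

# Immortal objects (PEP 683) carry a fixed, very large refcount that
# the runtime never updates; the exact number is an implementation
# detail of the build.
print(sys.getrefcount(None))  # a huge constant on 3.12+, e.g. 4294967295
print(sys.getrefcount(5))     # small cached ints are immortal too
```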
Better yet: These immortal beings also bypass the GIL, so they can be used anywhere, and by multiple threads simultaneously.
There is a slight performance penalty, of up to 8%, in CPython with this approach, which is not surprising given that the runtime has to keep a separate table. But especially in multiprocessor environments (such as Instagram’s), the performance improvements pay off.
“You have to measure it to see if you are doing the right thing,” Ferreira said.
Sharing the Immutable

Yury Selivanov on building a fast Python service using sub-interpreters.
Another way around the GIL is through sub-interpreters, a hot topic at this year’s event. A sub-interpreter architecture allows multiple interpreters to share the same memory space, each with its own GIL.
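There is no public Python-level API for this yet in Python 3.12; sub-interpreters are reachable from Python code through CPython’s private _xxsubinterpreters module, which PEP 734 proposes to formalize. A minimal sketch, with the caveat that this internal API may change:

```python
# Uses a private CPython module; the API is unstable and may change.
import _xxsubinterpreters as interpreters

interp_id = interpreters.create()   # a fresh interpreter (with its own GIL on 3.12+)
interpreters.run_string(interp_id, "print('hello from a sub-interpreter')")
interpreters.destroy(interp_id)
```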
One such orchestrator is a Python framework called memhive, which implements a worker pool of sub-interpreters, as well as an RPC mechanism so they can share data. It was presented by its creator, Yury Selivanov, a Python core developer and CEO/co-founder of EdgeDB, in his PyCon talk.
Selivanov kicked off his talk by demonstrating a program on his laptop that used 10 CPU cores to execute 10 asynchronous event loops simultaneously, all sharing the same memory space: a mapping of a million keys.
What is stopping you from doing this on your own machine? That old villain, the GIL.
Memhive sets up a primary sub-interpreter that can then spawn as many other sub-interpreters as needed.
Immutable objects are a challenge, and there are plenty of them in Python, such as strings or tuples. If you want to change them, you have to create a new object and copy each element over — a pretty expensive operation, computationally speaking, and doubly so when you factor in updating the cache.
Memhive uses a technique called structural sharing, built on a hash array mapped trie (the hamt.c buried in the CPython source), in which changes are captured in new nodes while the unchanged parts of the old immutable data structure are referenced, rather than copied, saving considerable work.

Structural sharing uses references instead of copies, saving time.
“If you want to add a key, you don’t have to copy the entire tree, you can just create the missing new branches, and reference the others,” Selivanov said. “So if you have a collection with billions of keys, and you want to add a new one, you will just create a couple of those underlying nodes, and the rest can be reused. You don’t have to do anything about them.”
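The same HAMT design behind hamt.c is available on PyPI as the immutables package, also from Selivanov, which makes the behavior easy to try. A quick sketch:

```python
# pip install immutables
import immutables

m1 = immutables.Map(a=1, b=2)
m2 = m1.set("c", 3)  # a new map; unchanged branches are shared, not copied

print(dict(m1))  # {'a': 1, 'b': 2}          the original is untouched
print(dict(m2))  # {'a': 1, 'b': 2, 'c': 3}
```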
Structural sharing opens the door for parallel processing: since the data is immutable, multiple sub-interpreters can work in parallel on the same data set.
“Because we’re using immutable things, we can actually access the underlying memory safely, without acquiring locks or anything,” he said. This can lead to speed improvements of 6x to 150,000x, depending on the amount of copying being done.

Even as the number of changes increases dramatically, the time it takes to make them remains under control.
Summary
So, true, Python is not the fastest language, and many of these developments, should they come to pass, will be years in the making. But there is a lot a coder can do now, with an awareness of the trade-off between speed and flexibility in Python itself.
“Python is a beautiful language for gluing together different pieces of your business logic. And other languages are well suited for extremely low level, often fast optimizations,” Sharma said. “And we need to figure out the right balance of these things.”