
During EuroPython, Sam Gross presented an inspiring project aimed at enabling Python to run multi-threaded programs without the Global Interpreter Lock (GIL). He began with the basics: multi-threading allows a program to perform multiple tasks simultaneously, which is particularly useful on modern multi-core computers. The key efficiency problem is the GIL, which limits performance in multi-threaded applications because it ensures that only one thread can execute Python code at a time. He also recalled that Python was first released in 1991 without threads, and that threads and the GIL were added about a year later, which shows how much the hardware landscape has evolved since then and why Python needs to adapt to multi-core architectures.

From a technical perspective, the GIL has its advantages: it is simple to implement and efficient for single-threaded code. However, the growing number of cores in today's computers makes the GIL a hindrance to multi-threaded performance. Sam pointed out that removing the GIL is not straightforward, as it introduces many concurrency safety issues. In the project he explores techniques such as biased reference counting, which keeps reference counting thread-safe without the GIL while preserving single-threaded efficiency.
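
To make that concrete, here is a minimal C sketch of biased reference counting as described in the talk, combined with the immortal-object fast path the talk also covers. All field names and helpers are hypothetical illustrations; this is not CPython's actual object layout or Sam's implementation.

#include <stdatomic.h>
#include <stdint.h>

#define IMMORTAL_FLAG 0x1u            /* hypothetical flag bit */

typedef struct {
    uintptr_t        owner_tid;       /* thread that created the object    */
    uint32_t         flags;           /* e.g. the immortal bit             */
    uint32_t         local_refcnt;    /* touched only by the owning thread */
    _Atomic uint32_t shared_refcnt;   /* touched by every other thread     */
} obj_header;

/* A cheap per-thread identifier: the address of a thread-local variable. */
static _Thread_local char tls_anchor;
static uintptr_t current_tid(void) { return (uintptr_t)&tls_anchor; }

static void obj_incref(obj_header *ob) {
    if (ob->flags & IMMORTAL_FLAG)
        return;                       /* immortal objects: refcounting is a no-op */

    if (ob->owner_tid == current_tid()) {
        ob->local_refcnt++;           /* fast path: plain, non-atomic increment */
    } else {
        /* slow path: other threads use an atomic increment on the shared field */
        atomic_fetch_add_explicit(&ob->shared_refcnt, 1, memory_order_relaxed);
    }
}

/* The true reference count is the sum of the two fields. Decrementing and
 * merging the counts when the owner gives up the object is more involved
 * and is omitted here. */
static uint32_t obj_refcnt(obj_header *ob) {
    return ob->local_refcnt +
           atomic_load_explicit(&ob->shared_refcnt, memory_order_relaxed);
}

The point of the owner check is that the overwhelmingly common case, an object touched only by the thread that created it, uses an ordinary increment rather than an atomic instruction.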

His work also covers the thread safety of Python's built-in data structures. Sam discussed how collections such as dictionaries and lists need to be adapted for multi-threaded access. The project uses fine-grained per-instance locks to minimize contention and introduces an internal API, PyBeginCriticalSection and PyEndCriticalSection, which allows these locks to be released implicitly in the places where the GIL would have been released, avoiding lock-ordering deadlocks.
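
As a rough illustration of how such critical sections can avoid lock-ordering deadlocks, here is a hedged C sketch using pthread mutexes: each thread tracks the per-collection locks it holds, and whenever it is about to block (anywhere it would previously have released the GIL) it drops all of them, re-acquiring only the topmost one when it resumes. The function and type names are invented for this example and are not the actual nogil API.

#include <pthread.h>
#include <stdbool.h>

#define MAX_SECTIONS 64

typedef struct {
    pthread_mutex_t *locks[MAX_SECTIONS];
    bool             active[MAX_SECTIONS];
    int              depth;
} section_stack;

static _Thread_local section_stack tls_sections;

/* Called wherever the thread is about to block, i.e. exactly where it would
 * previously have released the GIL: drop every lock it still holds. */
void suspend_critical_sections(void) {
    for (int i = 0; i < tls_sections.depth; i++) {
        if (tls_sections.active[i]) {
            pthread_mutex_unlock(tls_sections.locks[i]);
            tls_sections.active[i] = false;
        }
    }
}

/* Called when the thread resumes: only the topmost section is re-locked;
 * outer sections stay released until their own end call. */
void resume_critical_sections(void) {
    int top = tls_sections.depth - 1;
    if (top >= 0 && !tls_sections.active[top]) {
        pthread_mutex_lock(tls_sections.locks[top]);
        tls_sections.active[top] = true;
    }
}

void begin_critical_section(pthread_mutex_t *m) {
    if (pthread_mutex_trylock(m) != 0) {
        /* About to block on another thread's lock: release what we hold
         * first, so no lock-ordering deadlock is possible. */
        suspend_critical_sections();
        pthread_mutex_lock(m);
    }
    tls_sections.locks[tls_sections.depth]  = m;
    tls_sections.active[tls_sections.depth] = true;
    tls_sections.depth++;
}

void end_critical_section(void) {
    int top = --tls_sections.depth;
    if (tls_sections.active[top])
        pthread_mutex_unlock(tls_sections.locks[top]);
    /* If the enclosing section was suspended while we blocked, re-lock it now. */
    if (top > 0 && !tls_sections.active[top - 1]) {
        pthread_mutex_lock(tls_sections.locks[top - 1]);
        tls_sections.active[top - 1] = true;
    }
}

Because a thread releases every lock it holds the moment it blocks on another one, two threads can never sit on locks A and B waiting for each other, which is the deadlock scenario the talk describes.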

Sam emphasized that the project is already influencing practical uses of Python: libraries such as scikit-learn have been adapted to work with a GIL-less build. He presented benchmark results showing that his approach delivers significantly better performance in multi-threaded applications. At the time of writing, the talk has drawn notable attention in the community, with around 20,588 views and 452 likes.

In summary, Sam Gross's project represents a pivotal advance in Python's development, particularly for the performance of multi-threaded applications. His research is not only interesting but important for the future direction of Python as it adapts to contemporary hardware. As work on eliminating the GIL continues, Sam plans to submit a PEP proposing to remove the GIL from a future version of the interpreter, initially behind a build flag, which should benefit users and developers by allowing more efficient multi-threaded Python programs.

Timeline summary

  • 00:00 Introduction by Sam Gross regarding a project on multi-threaded Python.
  • 00:08 Sam explains the challenges of Python's global interpreter lock (GIL).
  • 00:18 Background on multi-threading and its advantages.
  • 00:34 Discussion on how multi-threading can utilize multi-core processors.
  • 01:32 The global interpreter lock (GIL) restricts concurrent execution of threads.
  • 01:41 Historical context on threads' introduction in Python.
  • 02:10 Comparison between Python's GIL and kernel locks in operating systems.
  • 02:51 Advantages and challenges of the GIL, including deadlocking hazards.
  • 03:38 Current challenges in utilizing multi-core processors effectively.
  • 04:13 Overview of existing methods to parallelize Python programs despite GIL limitations.
  • 04:50 Limitations of C API extensions with respect to GIL.
  • 06:00 Challenges in safely removing the GIL while maintaining memory management in Python.
  • 07:01 Review of previous attempts to remove GIL, including notable projects.
  • 08:36 Discussion of reference counting as a significant challenge.
  • 09:33 Introduction of biased reference counting to reduce overhead.
  • 11:15 Presentation of new memory operations and structural changes.
  • 12:31 Testing and performance evaluation of Nogil Python against GIL.
  • 21:20 Memory allocator replacement with mimalloc for thread safety.
  • 31:30 Compatibility concerns with removing GIL in existing libraries.
  • 37:11 Collaborative efforts to improve AI benchmarks in Python without GIL.
  • 40:40 Availability of the Nogil Python version for public testing.
  • 41:02 Call for community feedback and questions from the audience.
  • 41:49 Open floor for questions following Sam's presentation.

Transcription

Thanks, Larry, for the introduction. Hi everyone, I'm Sam Gross and I'm here to talk about my project that I've been working on for the past few years to let Python run multi-threaded programs without the global interpreter lock. I'd like to start with a bit of background on multi-threading and Python's global interpreter lock. Multi-threading using multiple threads allows an individual program to perform multiple tasks at the same time. Threads are useful for two related reasons. First, threads can exploit the parallelism available in multi-processor and multi-core computers. Threads can run on separate cores, so a program that can divide its work into multiple tasks executed by separate threads can run faster on a multi-core computer than if it used a single thread. The second reason is that threads can be useful for structuring concurrency, even on a single processor system. Threads might be used to handle multiple requests in a network server, to overlap IO with computation, or to perform long-running tasks in a graphical application while keeping the user interface responsive. Threads aren't the only way to structure concurrency. For example, modern Python includes the async IO package to structure concurrent asynchronous code. But this project is mostly focused on the first reason. We want Python to be able to take advantage of the parallelism available in multi-core computers by making multi-threaded code run more efficiently. One of the big obstacles to efficiency is that the global interpreter lock, or GIL, ensures that only one thread can run Python code at a time. When Python was first publicly released in 1991, it didn't support threads or have a global interpreter lock. Support for threads was added about a year later, in 1992, together with the GIL. For context, the 1990s saw a number of operating systems add better support for threading and support for computers with multiple processors. I think there's an interesting analogy between the introduction of threads in Python and multi-processor support in some of these operating systems. At least in the cases of Linux and FreeBSD, support for multiple processors was accompanied by the introduction of global locks, the big kernel lock in Linux and a lock called Giant in FreeBSD. In the original implementation, the big kernel lock ensured that only a single thread could enter the kernel at a time, somewhat similar to the way that Python's global interpreter lock ensures that only a single thread can run Python code at a time. Over time, the uses of the big kernel lock were gradually replaced by fine-grained locking and other techniques to improve scalability. The big kernel lock was completely removed from Linux in 2011. From a technical perspective, the global interpreter lock made sense for a few reasons. First, it's simple to implement. It doesn't require a lot of code. Second, the global interpreter lock is really efficient for single-threaded code. The lock is only rarely touched, so it doesn't substantially affect performance if you only use one thread. Third, the global interpreter lock avoids a lot of the potential deadlocking hazards of using multiple fine-grained locks. In particular, the locking discipline for the GIL is pretty straightforward. You need to acquire the GIL once before calling into Python code and you should release the GIL before calling an operation that might block the thread. Finally, the global interpreter lock still allows for some parallelism. 
C extensions can release the lock around long-running pieces of code that don't access Python objects. We're now about 30 years after the introduction of threads to Python. As Larry mentioned, the computer hardware landscape has changed. Multicore processors are in desktops, laptops, servers, smartphones, and today they're in glasses as well. In server environments, it's common to see computers with dozens of CPU cores or more. Especially as the core counts increase, it's difficult to take advantage of multicore parallelism from Python due to the GIL. Even with the GIL, there are a number of ways to parallelize Python programs today, but they all have substantial limitations. First, in C API extensions, you can parallelize operations internally in a way that's invisible to Python. For example, scikit-learn uses OpenMP to parallelize its k-means operation. When you call scikit-learn's k-means function from Python, scikit-learn distributes the work to a number of threads, waits for those threads to finish, aggregates the results, and returns back to Python. This works well when individual operations are large enough to be efficiently parallelized, but not when there are a lot of small steps. A lot of scientific computing libraries like scikit-learn, PyTorch, and some versions of NumPy and SciPy use this technique. Second, C API extensions can release the global interpreter lock around long-running operations so that multiple operations can overlap. The previously mentioned scientific computing libraries do this, but it usually doesn't scale well beyond a few cores because contention on the GIL quickly becomes an issue. You can also use multiple processes, including through Python's multiprocessing library. Processes are like a heavyweight version of threads. The main challenge with processes is that communication between processes is usually more expensive and more difficult than communicating between threads. I'm going to expand on this point. With processes, objects need to be serialized to be sent from one process to another. You can't share Python objects directly like you can between threads. This is a performance issue because serialization is expensive and requires making extra copies. It can also lead to usability issues because not all objects support serialization. Additionally, many C and C++ libraries you might want to use from Python don't work well across processes, even if the libraries support multiple threads. Some libraries have opaque data structures that might be difficult or impossible to share between processes. Removing the global interpreter lock is a challenge because, besides the actual removal of the lock, you have to make Python safe to run with multiple threads. You have to address things like reference counting. Python primarily manages memory by keeping track of the number of references to each object, and the current implementation isn't thread safe without the GIL. You also have to figure out how to handle concurrent accesses to Python's collections, things like dictionaries, lists, and sets. Once you remove the GIL, unless you've made other changes, concurrent modifications to things like dictionaries would crash the interpreter. There are a lot of other internal components in the Python interpreter that need to be made thread safe. Many of these are easy to fix and only require small changes, but others require more substantial changes.
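
To illustrate the second workaround mentioned above, a C API extension releasing the GIL around long-running work that doesn't touch Python objects, here is a small, self-contained extension module. Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS are the real CPython C API macros; the module name and the computation are invented for the example.

#define PY_SSIZE_T_CLEAN
#include <Python.h>

static double heavy_computation(long n) {
    double acc = 0.0;
    for (long i = 1; i <= n; i++)
        acc += 1.0 / (double)i;      /* stand-in for real numeric work */
    return acc;
}

static PyObject *
demo_compute(PyObject *self, PyObject *args) {
    long n;
    if (!PyArg_ParseTuple(args, "l", &n))
        return NULL;

    double result;
    Py_BEGIN_ALLOW_THREADS           /* other Python threads may run here */
    result = heavy_computation(n);
    Py_END_ALLOW_THREADS

    return PyFloat_FromDouble(result);
}

static PyMethodDef DemoMethods[] = {
    {"compute", demo_compute, METH_VARARGS, "Sum 1/i with the GIL released."},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef demomodule = {
    PyModuleDef_HEAD_INIT, "demo", NULL, -1, DemoMethods
};

PyMODINIT_FUNC PyInit_demo(void) {
    return PyModule_Create(&demomodule);
}

As the talk notes, this lets long operations in different threads overlap, but every thread still has to reacquire the GIL to touch Python objects, so it stops scaling once GIL contention dominates.
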
Some of the specific internals I want to talk about today are Python's built-in memory allocator and the cyclic garbage collector. Finally, all these changes need to be made while keeping good single-threaded performance and while allowing for efficient multi-threaded scaling. This project isn't the first attempt to remove Python's global interpreter lock. I'm aware of three previous attempts in CPython. The first was in 1996, when Greg Stein released his free threading patch. The patch used atomic reference counting on Windows and a global reference count lock on Linux. List and dictionary accesses were protected by mutexes. Parts of the patch were actually adopted upstream. In particular, it introduced a PyThreadState structure and correct per-thread exception handling, both of which live on in Python today. The second attempt I know of was python-safethread by Adam Olsen in 2007. Adam pursued an interesting design. Most objects were owned by a single thread, which could use normal reference counting operations. If another thread needed to modify the object's reference count, it would have to pause and wait for the owning thread to give up ownership. The implementation also had restrictions on sharing objects between threads. You could share immutable objects like numbers and strings and some special objects like a new shared dictionary type. Trying to share other mutable objects, like a regular dictionary or list, would raise an exception. The most recent attempt to remove the GIL from CPython is Larry Hastings' Gilectomy, which started around 2016. The Gilectomy used atomic reference counting for objects, and later versions experimented with buffered reference counting to improve performance. The implementation used locks to protect accesses to lists and dictionaries. The Gilectomy was the effort that had the most influence on this project. I'd seen Larry's PyCon talks on the Gilectomy where he described parts of the implementation and the performance challenges. I think the biggest takeaway from these previous efforts is the recurring challenges due to reference counting, both in how it affects single-thread performance and efficient scaling. Reference counting is by no means the only issue you have to address when removing the GIL, but it's essentially the first problem you have to solve because the performance impact is so large. A number of people have come to the conclusion that reference counting is the biggest obstacle to removing the GIL and that we should ditch reference counting and replace it with a tracing garbage collector. So why not ditch reference counting? The primary reason is that getting rid of reference counting would make it much harder to get a reasonable degree of compatibility with existing C API extensions. I think the effort to get a reasonable degree of compatibility would probably be bigger than just directly dealing with the issues that reference counting poses. The most straightforward way to make reference counting thread safe is to use atomic operations. The problem with this strategy is that atomic instructions are slower to execute than their non-atomic counterparts. For example, if you modify Python to use atomic reference counting and run the pyperformance benchmark suite, the benchmarks take on average about 60% longer to run. This chart shows the slowdown of using atomic reference counting on each benchmark in the pyperformance suite.
There's big variance in individual benchmark performance, so don't treat the 60% number as particularly precise. A few benchmarks aren't affected by the reference counting change, while others take more than twice as long to run. Biased reference counting is an algorithm created to address the overhead of atomic reference counting. It was created by Jiho Choi, Thomas Shull, and Josep Torrellas at the University of Illinois at Urbana-Champaign and published in 2018. It's based on the observation that most objects are only accessed by a single thread, even in a multithreaded environment. Their version was implemented in Swift, but the same technique can be applied to Python. Biased reference counting works by splitting the reference count field into two parts, as well as storing the ID of the thread that created the object. The first part of the reference count field stores the local reference count. The thread that created the object uses plain, non-atomic operations to modify that local reference count field. The second field stores the shared reference count. Other threads modify the shared field using atomic instructions. The true reference count is the sum of these two fields. Because most objects are only accessed by one thread, the technique avoids most atomic operations while still maintaining thread safety. Here's a basic implementation of incrementing the reference count using biased reference counting. The algorithm checks if the object's thread ID matches the current thread ID. If they match, the algorithm increments the local reference count field using a normal non-atomic increment. Otherwise, the algorithm increments the shared field using an atomic operation. This is faster than just using atomic reference counting because, at least on current hardware, the extra comparison to check the thread ID is cheaper than the cost of an atomic increment. This technique has some similarity to the strategy used by Adam Olsen in the python-safethread project. Both associate a Python object with the thread that creates it, and that thread uses fast non-atomic operations. However, biased reference counting avoids the blocking operation present in python-safethread, where a different thread accessing the object has to wait for the original thread to give up ownership. Here's an updated version of the previous chart. Atomic reference counting overhead is still shown in blue. Those are the same results as before. Biased reference counting is in green. The overhead of biased reference counting is much less than the overhead of atomic reference counting, although it still has some extra overhead compared to the regular non-thread-safe reference counting that's currently implemented in Python. Using Python 3.9 as a base, biased reference counting is about 10% slower on average across the pyperformance benchmarks, compared to the 60% slowdown we saw with atomic reference counting. While biased reference counting does a pretty good job of reducing the execution overhead of atomic reference counting, it doesn't address the scalability issues when multiple threads frequently modify the reference count of the same object. To expand on this, I'm going to make some generalizations about how memory operations scale on modern computer systems. This table is divided into columns between operations that are read-only and operations that modify memory. The rows are divided into private accesses, that's memory that's accessed only by a single core, and shared accesses, memory that's accessed by multiple cores.
The checkmarks indicate if the operations scale well as the number of cores increases. The first thing to notice is that read-only accesses scale well with lots of cores. Memory reads from multiple cores, both to the same location and to different locations, scale well. The same data can exist in the cache of multiple CPU cores as long as the data isn't modified. The second thing I'd like to point out is that write operations scale well if the data is private. In other words, multiple CPU cores performing write operations scale well as long as they're to different memory addresses, or more accurately, to different cache lines. This includes atomic operations. Things like atomic increment and decrement are slower than their non-atomic counterparts, but both atomic and non-atomic operations scale well if the data is private. Finally, concurrent modifications to shared data don't scale well. So, for example, lots of threads concurrently modifying the reference count of the same object won't scale well on multi-core machines. This is relevant for Python because there are some objects that are necessarily shared between threads and they're frequently accessed. For example, the constants None, True, False, some small integers, and some other objects are shared by all threads in a Python program. Like other Python objects, they're also reference counted, even though they can never be collected during the lifetime of the program. Absent other changes, reference counting on these objects would be a bottleneck for multi-threaded programs once you remove the GIL. The fix is to skip reference counting on these objects. The objects are marked as immortal using a bit in the reference count field. The reference counting code is changed to be a no-op if the immortal bit is set. And the objects are kept alive for the duration of the program. This works well because things like None, True, and False already stay alive for the lifetime of a Python program. Immortal objects aren't a new idea. They're used in Swift as well as in some alternative Python implementations like Cinder and Pyston. To go back to the scaling table, immortal objects work because we've transformed writes to shared data into read-only accesses. We've gone from modifying the reference count fields of objects like None, True, and False to only needing to read that field to check if the object is immortal. So why doesn't Python already use immortal objects? For single-threaded code, immortal objects aren't a clear win. The cost of the extra branch to check if the object is immortal is typically higher than the work saved by skipping the reference count operation for single-threaded code. At the Language Summit back in April, however, Eddie Elizondo and Eric Snow presented an implementation of immortal objects with smaller costs than previous efforts. So we might see immortal objects in CPython independent of any GIL changes. Early on in this project, I decided to test out the implementation on Larry Hastings' Fibonacci benchmark from the Gilectomy project. The benchmark is pretty simple. There's a function fib that calculates the nth Fibonacci number, and it's called concurrently from multiple threads. The Fibonacci calculation is deliberately slow. It's mostly there to do some work. I was running the program with an early version of Nogil Python with eight threads, and I was pretty happy to see the top command show all eight cores being fully utilized.
If you graph the performance results, they look something like this, and I was ready to call it a day. And indeed, this would have been great if the chart showed throughput versus the number of threads, but the Y axis here is actually time taken. Even though the program was using eight CPU cores, those cores were being used so inefficiently that there wasn't a speedup. In fact, adding more threads made things worse. Here's the same data, but inverted to show throughput instead of time taken. If things were going well, then eight threads would be eight times faster than a single thread. But here, eight threads are actually only running at 25% of the speed of a single thread. So what's going on? It's the same issue I mentioned previously. Reference count contention here is so bad that the program didn't scale at all, and in this case, reference count contention wasn't solved by immortal objects. Let's look at the Python bytecode. I've highlighted the two LOAD_GLOBAL bytecode operations. These correspond to the two recursive calls to fib. They load the fib function from the globals dictionary and push it onto the interpreter stack. As part of that operation, they increment the reference count of the fib function object. Later, when the function call is finished, the object is popped from the stack and the reference count is decremented. Since all the threads are calling the same function, they're also all concurrently modifying that same reference count field on the fib function object. You could try making all functions immortal, but that has a number of downsides. Functions don't necessarily live forever. In addition, they hold a reference to the globals dictionary, so making functions immortal would prevent the globals dictionary from being freed, which would in turn hold on to global variables unless those variables were explicitly deleted. Some programs and libraries depend on running finalizers for global variables at exit, and making function objects immortal would prevent that from happening. That's the reference chain from function objects to the globals dictionary to global variables. This project uses a variant of deferred reference counting to address the reference count contention on function objects. Deferred reference counting skips the reference counting operations for objects as they're pushed to and popped from the interpreter stack. Other operations, like assignment to a dictionary or list, use the normal reference counting behavior. Because we've skipped some reference counting operations, we can't free these objects when the reference count reaches zero. Instead, they're only freed during cyclic garbage collection. The garbage collector is modified to inspect the interpreter stack. At the start of a garbage collection cycle, it looks through all the objects on the interpreter stack and adds back any missing references. That's the deferred part of deferred reference counting. It then does a normal garbage collection cycle and finally removes the references that it added in the first step. So deferred reference counting avoids some reference counting operations, but it also delays deallocation. Normally that could cause problems, but we only enable deferred reference counting for some types of objects: module-level functions, methods, and some type objects. It turns out these objects would typically only be deallocated during a cyclic garbage collection cycle anyway. I mentioned before that function objects have a reference to the globals dictionary.
Well, the globals dictionary also has a reference to the functions defined at the module level. Functions in Python naturally form reference cycles, and there are similar reference cycles for methods and type objects. So deferred reference counting on these objects doesn't typically delay deallocation compared to the normal behavior in CPython. Here's the Fibonacci benchmark again, but with deferred reference counting enabled. The results are much better now, and we get a nearly eight times speedup with eight threads. The Nogil project uses a combination of these reference counting techniques to achieve good single-threaded performance and avoid a lot of the obstacles to multi-core scaling. Most objects in Nogil Python use biased reference counting. They're deallocated immediately when their reference count goes to zero, with a few exceptions. A few types of objects, such as function objects, methods, and some type objects, use deferred reference counting within the interpreter loop. Outside the interpreter loop, they're treated like other objects and use biased reference counting. They're only deallocated during garbage collection cycles. Constants such as True, False, and None are immortal. Reference counting on these objects is skipped everywhere, both in the interpreter and in library code. They're never deallocated, at least not until the interpreter shuts down. Now that I've talked about reference counting techniques, I'd like to move on to collection thread safety. This is how things like dictionaries, lists, and sets handle concurrent accesses. Dictionaries are particularly important because they're widely used in Python, including for storing object attributes and global variables. Like many previous attempts at removing the GIL, the Nogil project adds per-instance locks to collections like dictionary and list. I'm going to talk a bit about the locks used. The implementation is based on the locks used in WebKit, the browser engine for Safari. The design and trade-offs are described in detail in a blog post called Locking in WebKit. If you're interested in how locks work, I highly recommend it. It's one of my favorite technical blog posts, and if you search for locking in WebKit, it will be the first hit. I'm not going to describe the implementation too much, but I'll describe some of its more important attributes. First, the locks are small. They can be as small as 2 bits, but due to alignment requirements in C, in Nogil Python they typically require the same space as a pointer, so 8 bytes on a 64-bit system. That's still a lot smaller than a pthread mutex or a C++11 std::mutex, which take 40 or more bytes. The locks also only need to be initialized to zero, just like an integer. They don't require any memory allocation. They have fast, uncontended access. That's really important, because by far the most common case is that only one thread accesses a given collection at a time. Finally, they avoid thread starvation on heavily contended locks by eventually handing off ownership of the lock to threads in the same order that they attempt to acquire it. All right. That's it about the locks themselves. One of the challenges with introducing locks to collection types is that you have to be careful to avoid introducing lock ordering deadlocks. For example, storing a key in some dictionary might involve comparing that key to an existing key already in the dictionary, comparing it for equality in particular.
For string keys, that comparison is really simple, but user-defined types can override the equality operator, and if that user code accesses either this dictionary or some other dictionary while implementing equality, then we risk deadlock by holding multiple locks while another thread might attempt to acquire those same locks in a different order. For dictionaries, a reasonable solution is to unlock the dictionary before calling the equality operator and relock it afterwards. Essentially, we make sure that a thread can hold the lock for only a single collection at any time, and that's what was implemented in this project for a while. It turns out there are a lot of places that might call back into Python and incidentally acquire locks. For example, the Py_DECREF macro can call arbitrary code through an object's finalizer. Identifying all these places is doable, but it's tedious and error-prone. It would be really convenient if we didn't have to be so careful about unlocking these locks before calling back into Python or calling things like Py_DECREF. Notably, the global interpreter lock doesn't have this issue. You just need to acquire the global interpreter lock once, and it's only released in a few places, usually implicitly. For the most part, you don't have to worry about lock ordering deadlocks with the global interpreter lock. Fortunately, we can make these locks behave more like the GIL. The basic idea is to allow threads to implicitly release locked collections in places where the threads would have normally released the GIL. We don't do this for all types of locks, but instead introduce a new internal API, PyBeginCriticalSection and PyEndCriticalSection. PyBeginCriticalSection locks the passed-in lock and then pushes it onto a stack of active critical sections for the current thread. Subsequent calls to PyBeginCriticalSection push onto that same stack. When threads would have normally released the GIL, the threads instead release all the locks for the active critical sections and then mark those sections as inactive. This includes times when the thread is blocked trying to acquire a lock held by a different thread. In places where the thread would have normally reacquired the GIL, it reactivates the topmost critical section and locks the corresponding lock. However, it doesn't lock any previous critical sections. PyEndCriticalSection releases the lock for the topmost section and pops it from the stack. If there was a previous critical section and it was inactive, it's now reactivated and its lock is relocked. The final PyEndCriticalSection unlocks and removes the last critical section from the stack. The scheme avoids the risk of any lock ordering deadlocks. To understand why, consider that a lock ordering deadlock requires a thread to hold one lock, let's call it A, and be blocked while trying to acquire a second lock, say B. There also has to be another thread trying to acquire those same locks but in a different order. With this API, the moment a thread is blocked trying to acquire that second lock, B, it will release its previously held locks, allowing other threads to proceed and avoiding the deadlock. So since this avoids lock ordering deadlocks, why isn't something like this more widely used? Well, with this sort of scheme, there are a lot of places where the held locks might be implicitly released, and that's not something you really want in a general locking API.
But it's fine here for replacing the global interpreter lock, because the places where these locks might be implicitly released are the exact same places where the GIL could have been implicitly released. There's one additional bit to the critical section API. Sometimes it's useful to lock two collections at once. For example, when extending a list with the items of a dictionary, it's useful to lock both the list and the dictionary. The list because it's being modified, and the dictionary so that the extend operation sees a consistent view of the items in the dictionary. So there's one additional API that locks two mutexes at once. Locking collections works well for most accesses, but there are a few operations where acquiring a lock would inhibit scaling. The most important is accessing a single item in a dictionary, because things like globals and object attributes are implemented using dictionaries, and a single dictionary may be frequently accessed by many threads, just like we saw in the Fibonacci example. The single-element accesses for list and dictionary are implemented without acquiring the collection's lock. This is made more complicated by reference counting. The accessing thread needs to increment the reference count of the returned object, but another thread might have simultaneously modified the dictionary and freed that object. To handle this, the fast lockless path only handles the case where the reference count isn't zero and where the value is still in the collection. If that fails, the operation falls back to a slower path that retries the operation after acquiring the collection's lock. Normally this scheme wouldn't be safe, because when an object is freed, its memory, including the reference count field, is invalid. The scheme requires cooperation with the memory allocator to ensure that the reference count field remains valid for the duration of this access, even if the object is freed. That leads us to the memory allocator. Python has a memory allocator called pymalloc. It's optimized for small allocations and it's used for allocating Python objects. Unfortunately, it's not thread safe. This project replaces pymalloc with mimalloc, a memory allocator written by Daan Leijen at Microsoft Research. Mimalloc's performance is slightly better than pymalloc's and it's thread safe, so it's likely that mimalloc would end up in Python independent of any GIL changes, and there's an outstanding PR from Christian Heimes that brings mimalloc into CPython. Mimalloc, with some modifications, is also important for the lockless dictionary and list accesses we just talked about. Finally, mimalloc makes it easier to make Python's cyclic garbage collector thread safe. I'm going to describe how the garbage collector works in Python today and then describe the modifications made when removing the GIL and how using mimalloc helps. The first challenge with cyclic garbage collection is that the mechanism used to keep track of objects isn't thread safe. The current garbage collector relies on a global linked list to find all Python objects that support cyclic garbage collection. The linked list uses two fields in the object's header, GC next and GC previous. Objects are inserted into the linked list shortly after they're allocated and removed from the linked list before they're deallocated. This poses a thread safety issue because threads allocate and free objects concurrently.
The list also can't be made thread-local, because objects might be allocated in one thread and deallocated on a different thread. This project gets rid of the doubly linked list, although the GC next and GC previous fields are still in use for now. Instead of the linked list, the garbage collector uses mimalloc to find all the objects that support garbage collection. These objects are allocated from a separate mimalloc heap from other data. The collector scans each block in the heap to find objects. Unallocated blocks and objects that aren't tracked by the collector have the least significant bit of the first word set to zero. Tracked objects have the least significant bit set to one. The cyclic garbage collector can't run safely when other threads are accessing or modifying Python objects. The global interpreter lock previously ensured this, but now the collector needs to pause all other Python threads before running. Python already had a mechanism for requesting threads to handle asynchronous signals and exceptions, which we've extended to handle requests to pause for the garbage collector. Once all other Python threads are paused, the garbage collector runs. The other threads are resumed after the collector finishes identifying unreachable objects. I want to talk a bit about compatibility, because a frequent concern with removing the GIL is that it will break existing Python code and extensions. For pure Python code, that is, Python code that doesn't make use of the C API, I don't expect the code to require any changes. Now, that probably won't be 100% true, but I think it's pretty close to it. Importantly, things that looked atomic with the GIL, like list.append, still look like they're performed atomically due to per-collection locking. For C API extensions, many don't require any changes, and for those that do, the changes are pretty small. I'll talk about a few examples later. The most common issue is global or shared data in C that's implicitly protected by the GIL. A less common issue is due to the use of borrowed references. For example, PyDict_GetItem and PyList_GetItem return objects without increasing their reference counts. This can cause problems if the dictionary or list is concurrently modified. Most uses of borrowed references aren't an issue. For example, PyDict_GetItem is frequently used to parse the keyword argument dictionary, and that dictionary is effectively private to the called function. It's not shared with other threads. I'd like to give some examples from real Python packages that use the C API. Scikit-learn is a Python library for machine learning that builds on top of NumPy and SciPy. Under the hood, it's a mix of Python, Cython, and some C++. Thanks in large part to Olivier Grisel, scikit-learn now works without the GIL, and there's even a continuous build that runs with Nogil Python. I want to go over the bugs we fixed while making scikit-learn work without the GIL. Getting all the scikit-learn tests to pass required fixing ten issues. Of these, five were in the Nogil project but unrelated to removing the GIL. The other five issues were related to removing the GIL, essentially thread safety issues in these projects if you ran them without the global interpreter lock. One of these issues was in the scikit-learn codebase directly. Three of them were in dependencies, two in NumPy and one in Cython, and the last one was in VizTracer, which isn't really a dependency but a useful profiling tool.
Here's the bug in scikit-learn caused by removing the GIL. This is an excerpt of some Cython code used to compute k-means. When run without the GIL and with OpenMP parallelism enabled, the operation would give an incorrect result, triggering test failures. The problem is that Cython's with gil construct won't prevent multiple threads from running at the same time if you've already disabled the GIL. The fix was pretty simple. We just replaced with gil with a different lock. Since the code is already using OpenMP, we used an OpenMP-based lock. The actual PR that landed in scikit-learn is a bit larger than this because the same pattern is repeated a few times in the file, but this is the essence of the fix. I'll only briefly go over the other issues that cropped up when removing the GIL. For NumPy, there were four total issues, but two of those were fixed before the scikit-learn effort. Of the four issues, three were due to caches that were no longer thread-safe without the GIL. Some of those caches could just be disabled when running with Nogil Python, but one of the caches had to be protected by a lock. One issue was due to the use of borrowed references in NumPy that was no longer safe without the GIL. That was fixed by switching to an API that returns a new reference instead. Cython had an issue with locks allocated from a pool that wasn't thread-safe without the GIL, but it turned out those locks weren't needed if atomic operations were available. Finally, VizTracer, the profiling tool, had some internal state in C that needed to be protected by a lock. The critical section API came in handy there. The fixes for scikit-learn and VizTracer have landed upstream in those projects, but the changes to NumPy and Cython are still in forks maintained by me. Overall, the changes were relatively small. The scikit-learn tests depended on a total of 25 other Python packages. Other than the changes I mentioned here, those dependencies didn't require any other changes. The other project I'd like to talk about is work done by my colleagues at Meta AI on Hanabi. Hanabi is a card game that's been proposed as a benchmark problem in AI. Hanabi is an interesting AI challenge for a few reasons. First, it's a cooperative game. Two to five players work together to place down cards in numerical order. It's also an imperfect information game. You can see other players' cards, but you can't see your own cards. And to play well, you need to think about your partner's behaviors, how they'll respond to your actions, and the limited information that you're allowed to share. My colleagues have developed agents, bots, that play Hanabi, both with other bots and with people. The bots use artificial neural networks to decide what actions to take, and they're trained using reinforcement learning techniques. The code is partly in Python, using PyTorch and NumPy, and partly in C++, using pybind11. Here's a diagram of the system used to train the bot. The system simulates thousands of games at once, that's the section marked environments, using multiple threads. Each thread is responsible for simulating more than one game. The inference model and the training model are the bot's neural networks. The inference model determines what action the bot will take in the simulated games. The results of those simulated games are fed into the training model to improve the bot's behavior. Every so often, the training model is copied back to the inference model.
The Hanabi system was built with vanilla Python and with the global interpreter lock in mind. The worker threads and inference loops have to avoid running Python code. Only the training loop and the training model run the Python interpreter. This means that many pieces of code had to be written in C++, or, like the inference model, written in a subset of Python called TorchScript that's serialized to a representation that can be executed from C++ without invoking the Python interpreter. Having so many components in C++ made it harder to explore new research ideas. My colleague implemented a version of the system designed for running without the GIL. Most of the components were moved to Python, including the worker threads, the inference model, and the logic for each player. The environment, that's the component that implements the rules of the game, stayed in C++ because there isn't a compelling reason to move it to Python. The research ideas don't involve changing the rules of Hanabi. As I mentioned before, the project uses NumPy, PyTorch, and pybind11. Those projects were already patched to work without the GIL, although we found and fixed another issue with the pybind11 patch in the course of this work. I'd like to show the performance of this system and the steps we took to optimize it. First, here's a chart showing the performance of the new, mostly Python system with the global interpreter lock. The X axis shows the number of worker threads, and the Y axis shows the number of games simulated per second. The GIL is pretty clearly a bottleneck here. Using more than one worker thread doesn't improve performance. Here's the performance once the GIL is removed. Now adding more worker threads improves performance, at least up to a point. This only shows up to eight threads, but the computer it's running on has 40 hardware cores. This is the same system, without the GIL, but with even more worker threads, up to 80 instead of only up to eight. We hit a bottleneck pretty quickly. Increasing the number of worker threads beyond ten didn't improve performance. Running with more worker threads actually hurt performance. To investigate this, I used the Linux perf tools and GDB. Perf showed that the majority of CPU time was spent trying to acquire a lock, but didn't give much information beyond that. GDB narrowed the issue further. At any given point, most of the threads were stuck trying to acquire the same lock. This is a lock I added to our patched version of pybind11 to make it thread safe. Pybind11 maintains an internal hash table called registered_instances that maps C++ pointers to their corresponding Python wrappers. This is done so that if you return the same C++ object multiple times, you always get back the same Python wrapper object. The lock I added to protect the hash table was the source of the bottleneck. There are a lot of techniques for implementing concurrent hash tables, but one of the simplest is to split the table into multiple pieces, each a separate hash table with its own lock. The key, in this case the C++ pointer, is used both to determine which sub hash table to use and to index into it. This worked surprisingly well and was easy to implement. The turquoise bars show the performance after splitting up the registered_instances table. This new system can simulate nearly 1,200 games per second with 60 workers, which is about a 20x speedup compared to running this program with the global interpreter lock enabled.
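
For illustration, here is a hedged C sketch of that lock-splitting idea: the pointer key selects one of several independently locked sub-tables, so threads working on different shards rarely contend. The names and sizes are made up for this example; it is not the actual pybind11 patch.

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_SHARDS        16              /* power of two, so we can mask */
#define BUCKETS_PER_SHARD 1024

typedef struct entry {
    void         *key;                    /* e.g. a C++ object pointer */
    void         *value;                  /* e.g. its Python wrapper   */
    struct entry *next;
} entry;

typedef struct {
    pthread_mutex_t lock;                 /* one lock per shard */
    entry *buckets[BUCKETS_PER_SHARD];
} shard;

static shard table[NUM_SHARDS];

void table_init(void) {
    for (int i = 0; i < NUM_SHARDS; i++)
        pthread_mutex_init(&table[i].lock, NULL);
}

static size_t hash_ptr(const void *p) {
    uintptr_t x = (uintptr_t)p;
    x ^= x >> 17;                         /* cheap pointer mixing */
    return (size_t)(x * (uintptr_t)0x9E3779B97F4A7C15ull);
}

/* The key picks both the shard (and therefore the lock) and the bucket. */
void table_insert(void *key, void *value) {
    size_t h = hash_ptr(key);
    shard *s = &table[h & (NUM_SHARDS - 1)];
    entry *e = malloc(sizeof *e);
    e->key = key;
    e->value = value;

    pthread_mutex_lock(&s->lock);         /* contention only within one shard */
    size_t b = (h / NUM_SHARDS) % BUCKETS_PER_SHARD;
    e->next = s->buckets[b];
    s->buckets[b] = e;
    pthread_mutex_unlock(&s->lock);
}

void *table_lookup(void *key) {
    size_t h = hash_ptr(key);
    shard *s = &table[h & (NUM_SHARDS - 1)];
    void *result = NULL;

    pthread_mutex_lock(&s->lock);
    for (entry *e = s->buckets[(h / NUM_SHARDS) % BUCKETS_PER_SHARD]; e; e = e->next) {
        if (e->key == key) { result = e->value; break; }
    }
    pthread_mutex_unlock(&s->lock);
    return result;
}

Splitting the table this way trades a little memory for much lower lock contention; the shard count just needs to be large enough that concurrent threads rarely hash to the same shard.
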
I suspect there are still opportunities to further improve scaling, but we think this system has performance comparable to the original C++-heavy system, and having more of the components in Python made it much easier to explore new research ideas. You can try out the Nogil version of Python yourself. Both URLs here go to the same GitHub page. The easiest way to install it is through a Python version management tool for Linux and Mac. There's also a Docker image, or you can build Nogil Python from source if you prefer. You can pip install most packages, including many scientific computing libraries. Please let me know if something doesn't work by filing an issue on the project's GitHub page. In the next month or so, I'll be working on a Python enhancement proposal, or PEP, proposing to remove the global interpreter lock from a future version of Python using the ideas from this project. To help address compatibility issues, I'll propose placing the GIL changes behind a compiler flag so that the default build will still have the GIL enabled. I expect this to take some time, but in the meantime, I'd appreciate your feedback on the project, either on GitHub or elsewhere, and if you're interested in efficient multi-threaded Python programs, please try out Nogil Python. Thank you. All right. We have time for a couple of questions. I've been asked to ask a couple of questions first, so, Sam, you didn't know this was going to happen, did you? In my experience working on the Gilectomy, I had a miserable time debugging, because debugging across multiple threads makes it so hard to go back and reason about who changed what and how it happened. I relied very heavily on what are called reversible debuggers. What has been your experience with debugging CPython, figuring out what happened? Yeah, so, that was actually one of the things I took from your project. I didn't use the same reversible debugger, I used rr, which is a project from Mozilla, which is another reversible debugger, and it made debugging both thread safety issues and reference counting bugs much easier, because you could go to the source of the failure and then just go back in time to figure out what operation was responsible for corrupting the data. Okay. If people have questions, by the way, you can line up at that mic and I might call on you. So you talked about how you're, I can't remember, there's five or six different, like, X reference counting and Y reference counting. The one where you're, the biased reference counting that has the per-thread local count and then the global count. So one of my first experiments with the Gilectomy was to just switch to atomic reference counting everywhere, and performance was tragic, and in particular, the really tragic thing was that the more threads you added, the slower it got. And I'm not good at computer architecture, but my understanding is that essentially the way that atomic incref and decref work is that there is an internal bus that's shared by the cores on your CPU that they use to talk to each other, to tell each other things, and in particular when you do an atomic incref or decref, it's like there's one CPU telling the other CPUs, hey, forget about this cache line, because I've blown it away, because I did a write to that. And the more increfs and decrefs that you do atomically, the more contention there is for that bus, and so that's a place where you're going to have scaling issues where, like, atomic incref and decref are going to fail you.
So how much overhead did you find, like, you talked about a certain amount of overhead, like did you see problems with this contention where adding threads made the atomic increfs and decrefs get worse because you were doing enough of them? So my understanding of the way caches work now is that there isn't a global bus lock. I think that was the early implementation on multiple, when there were multiple... Not a bus lock, but a bus where they have to communicate with each other, and they have to, well, I guess a lock, yeah, okay. So caches are confusing. I'm pretty sure if I say something, it's going to be at least a little bit wrong, maybe a lot wrong. So atomic increments and decrements don't do a global lock anymore. They effectively lock the cache line. What that actually means and how it's implemented in hardware is a little bit complicated, and I don't think I can explain it, but what it means is that atomic increments and decrements to different cache lines still scale well. So that's only kind of relevant to the reference counting here. It means that if you have a bunch of threads and they do some atomic reference counting, it still works fine as long as they're not to the same object. But biased reference counting doesn't really address the scalability issue. It only addresses sort of the single-threaded overhead. And the way scalability is addressed is by avoiding reference counting contention on shared objects. So mostly skipping the reference counting, either by deferred reference counting for functions, or skipping the reference counting entirely for immortal objects. There are some objects that are going to be shared across threads, and they're going to have multiple threads modify the reference count, and that's often okay. It's sort of a matter of scale. Reference counting is just incredibly frequent, so doing it everywhere doesn't scale at all, but doing a little bit is okay. Having some objects that are shared is still fine. And that's what we saw in the Hanabi system, is that there are a bunch of Python objects that are shared between the threads, and it didn't affect performance too much. It was only the few things that were accessed so frequently, like the dictionary lock or the function objects, that really affected scaling. Okay. That's heartwarming. We do have a question from the audience. Thank you for the amazing talk. I just wanted to understand, do you think in the future we would be able to get some semantics behind the whole work of parallelizing Python and removing the GIL? Because of course, removing the GIL is going to introduce some edge cases in a lot of stuff. Do you think we'll be able to have formal guarantees on what we have in the Python code, more or less, or a model or a proxy? Especially because you've shown a lot of research ideas, like biased reference counting, so it looks like it's possible? So I think one thing that would be nice is to have a formal model of what operations are performed atomically. But I don't think that's practical, because I think it would be hard to come up with a formal model of how things behave currently in Python, and keeping existing behavior I think is more important than formalizing the model. One of the challenges with both the way the current implementation works and the way Nogil works is that there are a bunch of operations that are seemingly atomic if they're performed on simple types, but might not be atomic if they're performed on types that override equality or some other function.
And it's really hard to reason about that in general. So I think what will be necessary is sufficient documentation, but I think a formal model will be pretty difficult. And also, if I can ask another question, have you considered lock-free data structures for some structures in Python? So I think lock-free data structures might make sense in some places. I rejected lock-free data structures for things like the built-in dictionary type, because in general the lock-free algorithms are slower to execute, at least for single-threaded cases, than algorithms that use locking. And the most common case here is a single thread, or threads accessing individual objects without any contention. Thank you very much. Yeah, next question, please. Hi, thank you for the talk. You mentioned the critical section technique and that it works because of the current design with the GIL. Once it's removed, is there anything we need to do, like, for further maintenance to make sure that nothing breaks? Yeah, that's a great question. So one thing I didn't mention in this talk is that the APIs that currently exist to acquire or release the GIL, like Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS, are still used and are still important in this project. The good thing is that means existing code continues to work, but it can be a little confusing, and it was one of the things that came up in the Hanabi project, is that my colleague wrote code not using this, and it ended up causing issues. Essentially, you still have to use those macros in places before the thread would block. And it's important for two reasons. One is, like, the critical section API, and the second is to allow the threads to pause at the correct place for garbage collection. If you skip over that and block while, I guess, still attached to the Python interpreter, you can risk introducing deadlock, in the same way that if you block a thread while holding the GIL, you can risk causing deadlock. So they're still necessary, and they're used in exactly the same places. All right. Thank you. Okay. One more from the audience. Thank you. I love this work, so thank you so much. So I have a question more about the community and the process. So you're in the process of submitting this as an upstream patch to the core CPython interpreter, and I know that you have been in conversations with people from the core Python development community. In the same way that in previous attempts they said, no, a 60% slowdown is too much, I'm guessing you have some idea of what you need to achieve before saying, okay, this is ready. Do you have some information about, yeah, these are the goals that have been reached and the ones that still need more work before the community accepts this? Yeah, that's a great question. In terms of sort of what's acceptable performance, I don't know. I think it's something we'll sort of have to figure out. But to answer the question: no, I don't know. Okay. And other things beyond performance, like impact on other libraries or stuff like that? So I mean, having low impact on other libraries is clearly important. But again, it's one of those things where it's much harder; performance itself is kind of hard to quantify, but impact is even harder. So I think it's going to be hard to know what the, like, requirements there are. They'll know when they know it. So yeah. All right. I'm going to take my session chair privilege, I'm going to ask one more question before I let you go.
Sam, you get to borrow somebody's magic wand for ten minutes. If you could, would you give up on reference counting in Python and actually go to a tracing GC? Because as we saw from your slides, you've had a very complicated and difficult trail to get to where you are with performance. So reference counting is now very complicated and it's still causing performance issues. My understanding is that tracing garbage collection in a multithreaded system is just fundamentally more multithreaded friendly. So, if you could change Python's history, would you go to tracing garbage collection? So the short answer is no, I wouldn't. And, well, this is a tough question. So I think the, like, typical answer would be yes. It's much easier to get better scaling, and there are a lot of issues you can avoid. Wouldn't it have been great if we had gone to tracing garbage collection? But one thing I really like about reference counting in Python is that it means that destructors are called as soon as the reference count goes to zero, essentially as soon as the object is no longer referenced, as long as there's no cycle, which is the common case. And I found that this makes it much easier to integrate some C++ APIs, especially ones that handle external resources, either lots of, like, memory on the CPU or GPU memory. Working on this, I worked on PyTorch and before that Torch 7, which was written in Lua. And Lua has a tracing garbage collector, it doesn't use reference counting. And we ran into issues pretty frequently due to the garbage collector. So switching to Python and reference counting was actually really nice as we were going from Torch to PyTorch. So I like reference counting, I feel like that might be a little controversial, but yeah, I'd keep it. It's good news. And bad news for PyPy. All right. Thank you, ladies and gentlemen. It's coffee time now, so go get your caffeine on.