Python concurrency - how do threads work?

Even after so many years working with Python (16+ at the time of this writing), I went through a job interview where I ended up giving a partially correct answer about how concurrency works in CPython. I say "partially correct" because, depending on the CPython version used to test my theory, the result might or might not match my answer. That triggered me to do a deeper dive into how CPython deals with threads.

You can check out the project with the examples I'm using in this article here.

Question and answer

The question was something like this: "if you have a global count integer and many threads incrementing it, say, 4 threads incrementing it 1 million times each, what will be the resulting count?".

Knowing how the Global Interpreter Lock (GIL) works, my quick answer was: "4 million". The interviewer said, "no, you're wrong, it will be less than 4 million". I said "I really do believe I'm right, because the GIL won't be released while each thread is incrementing the counter, and that's a known limitation in Python multithreading". Then I proceeded to write a script to prove my point, ran it, and there it was: 4 million. The interviewer, however, ran the exact same script on his side and got a result that was way less than 4 million - something around 2 million, and not a round number. What the heck?

Pythonic surprises

I asked the interviewer which Python version he was using, and he said "3.9". I told him I was using 3.10, so clearly there was something to that - between the two versions there was probably a change in how threads are handled. Lo and behold, there was indeed: I found out that in 3.10 the in-place increment effectively became an atomic operation - the interpreter no longer switches threads in the middle of it - unlike in 3.9 and below.

But my assumption of how thread scheduling works in CPython was incorrect too: for many years I had been using multithreading in Python expecting that, as long as there's no stdlib-related I/O going on, the interpreter would keep working on a single thread at a time, letting that thread hold the GIL, and only release it for another thread to acquire once the thread finished. This assumption was only partially correct: after some time working on the same thread (5 milliseconds by default), the interpreter forces a context switch to another thread that is waiting to run (unless there's low-level processing going on - which we'll take a look at later). So the effect is that, while the code is still not being run truly concurrently, each thread gets the chance to do a bit of work every once in a while.
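By the way, you can check the current value yourself - sys.getswitchinterval returns it in seconds:

import sys

print(sys.getswitchinterval())  # 0.005 by default, i.e. 5 milliseconds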

Whenever I work with multithreading in Python, it's because there's I/O involved, and I try to use message passing to share data instead of shared memory; when I do use shared memory, I protect it with thread locks, because I know I might get race conditions otherwise. My previous assumption could have been dangerous if I had misused it - fortunately, I've never been bitten by that problem.
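As a side note, here's a minimal sketch of what I mean by message passing - it's not part of the example project, just an illustration using queue.Queue: each thread works on a local counter and hands the result over a thread-safe queue.

import queue
import threading

THREADS = 4
AMOUNT = 1_000_000
results: queue.Queue = queue.Queue()


def increase() -> None:
    local_count = 0
    for _ in range(AMOUNT):
        local_count += 1  # only local state is mutated
    results.put(local_count)  # hand the result over a thread-safe queue


threads = [threading.Thread(target=increase) for _ in range(THREADS)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(sum(results.get() for _ in range(THREADS)))  # always 4000000

No locks needed, because no state is ever shared between the threads.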

Let's see how that works, next.

Examples

Counter with different switch intervals

Consider the following piece of code (let's say the script name is check_counts.py):

from concurrent.futures import ThreadPoolExecutor, wait


THREADS = 4
AMOUNT = 1_000_000
count = 0


def increase() -> None:
    global count

    for _ in range(AMOUNT):
        count += 1  # read-modify-write on a shared global


def main() -> None:
    global count

    count = 0
    with ThreadPoolExecutor(max_workers=THREADS) as executor:
        wait([executor.submit(increase) for _ in range(THREADS)])
    print('Result after increases:', count)


if __name__ == '__main__':
    print('Starting to check counts')
    main()

Now, let's run it with Python 3.9 first:

$ python3.9 check_counts.py
Starting to check counts
Result after increases: 1449174

Then with Python 3.10:

$ python3.10 check_counts.py
Starting to check counts
Result after increases: 4000000

In the second example you might be led to believe that the threads are run until completion, but that's not what happens: in Python 3.10 each increment runs as if it were guarded by an implicit lock, so there are no race conditions on the counter - but thread switches still happen between increments.
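Regardless of the Python version, the way to guarantee a correct result is to make that lock explicit. A minimal sketch (not part of the example project) of how increase could be changed:

import threading

count_lock = threading.Lock()


def increase() -> None:
    global count

    for _ in range(AMOUNT):
        with count_lock:  # makes the read-modify-write atomic on any version
            count += 1

With this change, Python 3.9 also prints 4000000 - at the cost of acquiring and releasing the lock a million times per thread.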

Now let's change the script a bit:

import sys
from concurrent.futures import ThreadPoolExecutor, wait


THREADS = 4
AMOUNT = 1_000_000
count = 0


def increase() -> None:
    global count

    for _ in range(AMOUNT):
        count += 1


def main() -> None:
    global count

    sys.setswitchinterval(60.0)  # raise the switch interval from the default 5 ms to 60 s

    count = 0
    with ThreadPoolExecutor(max_workers=THREADS) as executor:
        wait([executor.submit(increase) for _ in range(THREADS)])
    print('Result after increases with long switch interval:', count)


if __name__ == '__main__':
    print('Starting to check counts')
    main()

Notice the following call: sys.setswitchinterval(60.0). This call is what changes the behavior here: it raises the threshold at which the interpreter forces a context switch between threads - in this case, from the default 5 ms to 60 s. How does the code run now in Python 3.9? Here it is:

$ python3.9 check_counts.py
Starting to check counts
Result after increases with long switch interval: 4000000

This happened because each thread took way less than 60 s to finish, so the interpreter never had to force a context switch. In other words, the threads effectively ran in series, each until completion - instead of in interleaved chunks of work, as would have happened with a lower interval.

Low-level computations

The behavior you get might be different, however, depending on what you're using in your Python code. One of these situations is using C modules that don't release the GIL. Since the interpreter is unable to control what happens at the C level, if your code calls a C-level function without releasing the GIL first, the interpreter has to wait until the function returns before it can move on. And it doesn't matter how long the C function takes to run - the interpreter is simply unable to interfere.

Consider for example this simple C module (let's call it libdemo.c):

#include <stdio.h>
#include <unistd.h>

void force_sleep(int seconds) {
    printf("Will sleep for %d seconds in C.\n", seconds);
    sleep(seconds);
    printf("Finished sleeping in C.\n");
}
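To call it from Python, the C file has to be compiled into a shared library first - on Linux, for example, with gcc:

$ gcc -shared -fPIC -o libdemo.so libdemo.c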

And then this Python module that makes use of the above (let's call it demo.py):

import ctypes
from pathlib import Path


LIB_PATH = Path(__file__).parent / 'libdemo.so'
libdemo = ctypes.PyDLL(str(LIB_PATH))

Notice that I used ctypes.PyDLL, which keeps the GIL held by the current thread while the C function runs (as opposed to ctypes.CDLL, which releases it).
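For comparison, here's a one-line sketch of the loader that does release the GIL (not what the example project uses):

libdemo_concurrent = ctypes.CDLL(str(LIB_PATH))  # releases the GIL during each call

With this loader, the four C sleeps in the script below would run concurrently again, just like time.sleep.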

Now, let's have a script that exercises some sleeps - first using Python's own time.sleep, then using the force_sleep function above:

import time
from concurrent.futures import ThreadPoolExecutor, wait
from contextlib import contextmanager

from libs.demo import libdemo


SLEEP_SECS = 1
THREADS = 4


@contextmanager
def timing(what: str):
    start = time.time()
    yield
    elapsed = time.time() - start

    print(elapsed, 'seconds for', what)


def main() -> None:
    with timing(f'sleeping in Python in {THREADS} threads'):
        with ThreadPoolExecutor(max_workers=THREADS) as executor:
            futures = [
                executor.submit(time.sleep, SLEEP_SECS)
                for _ in range(THREADS)
            ]
            wait(futures)

    with timing(f'sleeping in C in {THREADS} threads'):
        with ThreadPoolExecutor(max_workers=THREADS) as executor:
            futures = [
                executor.submit(libdemo.force_sleep, SLEEP_SECS)
                for _ in range(THREADS)
            ]
            wait(futures)


if __name__ == '__main__':
    print('Starting to check sleeps')
    main()

And this is what we get when we run it:

$ python check_sleeps.py
Starting to check sleeps
1.002687931060791 seconds for sleeping in Python in 4 threads
Will sleep for 1 seconds in C.
Finished sleeping in C.
Will sleep for 1 seconds in C.
Finished sleeping in C.
Will sleep for 1 seconds in C.
Finished sleeping in C.
Will sleep for 1 seconds in C.
Finished sleeping in C.
4.002422094345093 seconds for sleeping in C in 4 threads

Notice that using the time.sleep function let the script run the sleeps concurrently, while libdemo.force_sleep forced them to run in series, each one taking a whole second plus a bit of extra processing time - even though we didn't touch the interpreter's switch interval at all. This shows the care we have to take with our code when running low-level stuff in a multithreaded situation.

Conclusion

While multithreading in Python does have its quirks (and is justifiably criticized), it's still very interesting to use and understand. There are very good reasons for the GIL to exist (which are out of the scope of this article - search for "GIL garbage collection", or read this article here), and the PyPy team is working on an STM (Software Transactional Memory) approach to get rid of the GIL in PyPy, so we're still a long way from letting the GIL go.

There are ways, however, to work around the GIL if you need to do intensive computation in Python.
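The classic one is multiprocessing: each worker process runs its own interpreter with its own GIL, so CPU-bound work can run truly in parallel. Here's a minimal sketch using ProcessPoolExecutor (a hypothetical check_processes.py, not part of the example project):

from concurrent.futures import ProcessPoolExecutor

WORKERS = 4
AMOUNT = 1_000_000


def increase() -> int:
    count = 0  # each process increments its own local counter
    for _ in range(AMOUNT):
        count += 1
    return count


def main() -> None:
    with ProcessPoolExecutor(max_workers=WORKERS) as executor:
        futures = [executor.submit(increase) for _ in range(WORKERS)]
    print('Total:', sum(future.result() for future in futures))  # 4000000


if __name__ == '__main__':
    # The guard is required: worker processes import this module on spawn.
    main()

Nothing is shared here, so there's no GIL contention: each process returns its partial count and the parent simply sums them.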