Python can be used to write and test code quickly because it is an interpreted, dynamically typed language. However, these same properties also make it slow when performing simple tasks repeatedly, for example in loops.
When developing code, there can often be trade-offs between different implementations. However, at the beginning of the development of an algorithm, it is usually counterproductive to worry about the efficiency of the code.
«We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.»[1]
.. [1] Donald Knuth, founder of Literate Programming, in *Computer
   Programming as an Art* (1974)
.. seealso::

   * `Speed up your data science and scientific computing code
     <https://pythonspeed.com/datascience/>`_
In the following, I will provide examples based on the k-means algorithm, which is used to partition a set of objects into a predefined number of clusters. This can be achieved with MacQueen’s algorithm in the following three steps:
#. Choose the first :samp:`k` elements as cluster centres.
#. Assign each new element to the cluster with the least increase in
   variance.
#. Update the cluster centres.
Steps 2 and 3 are repeated until the assignments no longer change.
A possible implementation with pure Python could look like this:
.. literalinclude:: py_kmeans.py
   :caption: py_kmeans.py
   :name: py_kmeans.py
   :lines: 6-
We can create sample data with:
.. code-block:: python

   from sklearn.datasets import make_blobs

   points, labels_true = make_blobs(
       n_samples=1000, centers=3, random_state=0, cluster_std=0.60
   )

And finally, we can perform the calculation with:

.. code-block:: python

   kmeans(points, 10)

Once you have worked with your code, it can be useful to examine its efficiency more closely. :doc:`cProfile <tracing>`, :doc:`ipython-profiler`, :doc:`scalene`, :doc:`tprof` or :doc:`memray` can be used for this. So far, I usually carry out the following steps:
- I profile the entire programme with :doc:`cProfile <tracing>` or py-spy to find slow functions.
- If necessary, I use ``line_profiler`` to identify the slow sections within a function.
- If the slow function is computationally intensive, I try one of the following optimisations; if, however, the application is data-intensive (dictionaries, strings, I/O), I take a closer look at the architecture.
- Then I optimise the slow function.
- Finally, I create a new profile and compare the results of my optimised version with the original.
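The first step can be sketched with :mod:`cProfile` and :mod:`pstats` from the standard library; ``slow_sum`` here is only an illustrative placeholder function:

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately slow summation in a pure Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Profile a single call and collect the statistics in memory
profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Show the five most expensive functions, sorted by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

From the command line, the equivalent is ``python -m cProfile -s cumulative myscript.py``.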
.. versionadded:: Python 3.15
   :pep:`799` will provide a special profiling module that organises the
   profiling tools integrated in Python under a uniform namespace. This
   module contains:

   :mod:`profiling.tracing`
       deterministic function call tracing, which has been moved from
       :doc:`cProfile <tracing>`.
   :mod:`profiling.sampling`
       the new statistical sampling profiler :doc:`tachyon`.
.. seealso::

   * `airspeed velocity (asv) <https://asv.readthedocs.io/en/stable/>`_
     is a tool for benchmarking Python packages during their lifetime.
     Runtime, memory consumption and even user-defined values can be
     recorded and displayed in an interactive web frontend.
   * `Python Speed Center <https://speed.python.org/>`_
   * `Tracing the Python GIL
     <https://www.maartenbreddels.com/perf/jupyter/python/tracing/gil/2021/01/14/Tracing-the-Python-GIL.html>`_
.. toctree::
   :hidden:
   :titlesonly:
   :maxdepth: 0

   tracing
   ipython-profiler.ipynb
   scalene.ipynb
   tprof
   memray
   tachyon
You should not try to reinvent the wheel: if there are existing implementations, you should use them. There are even two implementations for the k-means algorithm:

-  .. code-block:: python

      from sklearn.cluster import KMeans

      KMeans(10).fit_predict(points)

-  .. code-block:: python

      from dask_ml.cluster import KMeans

      KMeans(10).fit(points).predict(points)
The main argument against these existing solutions is that they can add considerable overhead to your project if you are not already using scikit-learn or Dask-ML elsewhere. In the following, I will therefore show you further ways to optimise your own code.
Then you can use :doc:`perflint` to search your code for the most common performance anti-patterns in Python.
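A classic example of such an anti-pattern is index-based iteration over a list where direct iteration is both clearer and faster; a minimal sketch for comparison (the function names are only illustrative):

```python
import timeit

data = list(range(10_000))


def indexed_loop():
    # Anti-pattern: indexing into the list on every iteration
    total = 0
    for i in range(len(data)):
        total += data[i]
    return total


def direct_loop():
    # Idiomatic: iterate over the elements directly
    total = 0
    for value in data:
        total += value
    return total


# Both compute the same sum; the direct loop avoids the index lookups
print(timeit.timeit(indexed_loop, number=200))
print(timeit.timeit(direct_loop, number=200))
```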
.. toctree::
   :hidden:
   :titlesonly:
   :maxdepth: 0

   perflint
.. seealso::

   * `Effective Python <https://effectivepython.com>`_
:doc:`../workspace/numpy/index` moves repetitive operations into a statically typed compiled layer, combining the fast development time of Python with the fast execution time of C.
====================  =============  =======
Version               Spectral-norm  vs 3.14
====================  =============  =======
CPython 3.14 – Basis  14,046ms
NumPy                 27ms           520x
====================  =============  =======
You may be able to use :doc:`../workspace/numpy/ufunc`, :doc:`vectorisation <../workspace/numpy/vectorisation>`, :doc:`../workspace/numpy/indexing-slicing` in various combinations to move repetitive operations into compiled code and thus avoid slow loops, for example:
.. literalinclude:: np_kmeans.py
   :caption: np_kmeans.py
   :name: np_kmeans.py
   :lines: 5-12
The advantage of NumPy is that the Python overhead is incurred per array rather than per array element. However, because NumPy uses a specific language for array operations, it also requires a different mindset when writing code. Finally, the batch operations can lead to excessive memory consumption.
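As a minimal sketch of this trade-off (the variable names are illustrative): the distance of every point to every centre can be computed in one batch operation via broadcasting, at the cost of an intermediate array whose size grows with the number of points times the number of centres:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 2))
centers = points[:3]  # step 1: the first k elements as cluster centres

# Broadcasting (1000, 1, 2) against (3, 2) yields a (1000, 3, 2)
# intermediate array - fast, but its memory use grows with
# n_points * n_centers
diff = points[:, np.newaxis, :] - centers
distances = np.linalg.norm(diff, axis=2)

# Assign every point to its nearest centre without a Python loop
labels = distances.argmin(axis=1)
print(labels.shape)  # (1000,)
```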
.. seealso::

   * `Speeding up NumPy with parallelism
     <https://pythonspeed.com/articles/numpy-parallelism/>`_ by Itamar
     Turner-Trauring
:doc:`../workspace/pandas/index`
    for SQL-like :doc:`../workspace/pandas/group-operations` and
    :doc:`../workspace/pandas/aggregation`. This way you can also bypass
    the loop in the ``compute_centers`` method:

    .. literalinclude:: pd_kmeans.py
       :caption: pd_kmeans.py
       :name: pd_kmeans.py
       :lines: 5-8, 16-19

``scipy.spatial``
    for spatial queries like distances, nearest neighbours, k-means
    :abbr:`etc. (et cetera)`. Our ``find_labels`` method can then be
    written more specifically:

    .. literalinclude:: sp_kmeans.py
       :caption: sp_kmeans.py
       :name: sp_kmeans.py
       :lines: 5-13

``scipy.sparse``
    sparse matrices for two-dimensional structured data.

Sparse
    for N-dimensional structured data.

``scipy.sparse.csgraph``
    for graph-like problems, for example searching for shortest paths.

Xarray
    for grouping across multiple dimensions.
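As a small sketch of the graph case mentioned above, assuming SciPy is installed, shortest paths on a sparse adjacency matrix could be computed like this:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

# Directed, weighted graph with four nodes; zero entries mean "no edge"
adjacency = csr_matrix(
    np.array(
        [
            [0, 1, 4, 0],
            [0, 0, 2, 6],
            [0, 0, 0, 3],
            [0, 0, 0, 0],
        ]
    )
)

# All-pairs shortest path lengths via Dijkstra's algorithm
dist_matrix = shortest_path(adjacency, method="D")
print(dist_matrix[0, 3])  # 0 -> 1 -> 2 -> 3 costs 1 + 2 + 3 = 6.0
```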
.. toctree::
   :hidden:
   :titlesonly:
   :maxdepth: 0

   parallelise-pandas
At PyCon US in May 2021, Guido van Rossum presented Faster CPython, a project that aims to double the speed of Python 3.11. The cooperation with the other Python core developers is set out in :pep:`PEP 659 – Specializing Adaptive Interpreter <659>`. There is also an open issue tracker and various tools for collecting bytecode statistics. CPU-intensive Python code in particular is likely to benefit from the changes; code already written in C, I/O-heavy processes and multithreaded code, on the other hand, are unlikely to benefit.

And indeed, the CPython versions have become significantly more efficient since then:
========================  ======
Version
========================  ======
CPython 3.10.4            1.422x
CPython 3.12.0            1.093x
CPython 3.13.0            1.024x
CPython 3.15.0a0 – Basis
========================  ======
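You can run such comparisons for your own workload with :mod:`timeit` from the standard library; a minimal sketch in which ``workload`` is only a placeholder:

```python
import timeit


def workload():
    # Placeholder: a pure Python loop of the kind that newer CPython
    # versions tend to execute faster
    return sum(i * i for i in range(10_000))


# repeat() returns one total time per run; the minimum is the most
# stable number for comparing interpreter versions
runs = timeit.repeat(workload, number=100, repeat=5)
print(f"best of 5 runs: {min(runs):.4f}s")
```

Running the same script under each interpreter version yields numbers you can relate as in the table above.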
.. seealso::

   * `Faster CPython
     <https://web.archive.org/web/20221007175548/https://faster-cpython.readthedocs.io/>`__
   * `Faster CPython Benchmark Infrastructure
     <https://github.com/faster-cpython/benchmarking-public?tab=readme-ov-file>`_
Free-threaded Python was also included in another comparison:
====================  =======  =======  =============  =======
Version               N-body   vs 3.14  Spectral-norm  vs 3.14
====================  =======  =======  =============  =======
CPython 3.10          1,663ms  0.75x    16,826ms       0.83x
CPython 3.11          1,200ms  1.04x    13,430ms       1.05x
CPython 3.13          1,134ms  1.10x    13,637ms       1.03x
CPython 3.14 – Basis  1,242ms           14,046ms
CPython 3.14t         1,513ms  0.82x    14,551ms       0.97x
====================  =======  =======  =============  =======

– Source: The Optimization Ladder
If you don’t want to wait for the final release of Python 3.11, expected on 24 October 2022, you can also have a look at the following Python interpreters:
.. versionadded:: Python 3.15
   The `JIT <https://en.wikipedia.org/wiki/Just-in-time_compilation>`_
   compiler in Python 3.15 shows an average performance increase of 3–4%
   compared to the standard CPython interpreter, with individual
   measurements varying between -20% and over 100% on x86-64 Linux and
   AArch64 macOS systems.
.. seealso::

   * `The JIT Compiler
     <https://github.com/python/cpython/blob/main/Tools/jit/README.md>`_
   * :ref:`whatsnew315-jit`
========================  ======
Version
========================  ======
CPython 3.15.0a0 (JIT)    1.001x
CPython 3.15.0a0 – Basis
========================  ======
For intensive numerical operations, Python can be very slow, even if you have avoided all anti-patterns and used vectorisations with NumPy. In this case, translating code into Cython can be helpful.
====================  =======  =======  =============  =======
Version               N-body   vs 3.14  Spectral-norm  vs 3.14
====================  =======  =======  =============  =======
CPython 3.14 – Basis  1,242ms           14,046ms
Cython                10ms     124x     142ms          99x
====================  =======  =======  =============  =======
Unfortunately, the code often has to be restructured, which increases its complexity. Explicit type annotations and distributing the compiled code also become more cumbersome.
Our example could then look like this:
.. literalinclude:: cy_kmeans.pyx
   :caption: cy_kmeans.pyx
   :name: cy_kmeans.pyx
   :lines: 5-32
.. seealso::

   * `Cython Tutorials
     <https://cython.readthedocs.io/en/latest/src/tutorial/>`_
Numba is a JIT compiler that translates mainly scientific Python and NumPy code into fast machine code, for example:
.. literalinclude:: nb_kmeans.py
   :caption: nb_kmeans.py
   :name: nb_kmeans.py
   :lines: 5-29
However, Numba requires LLVM and some Python constructs are not supported.
====================  =======  =======  =============  =======
Version               N-body   vs 3.14  Spectral-norm  vs 3.14
====================  =======  =======  =============  =======
CPython 3.14 – Basis  1,242ms           14,046ms
Numba                 22ms     56x      104ms          135x
====================  =======  =======  =============  =======
:doc:`jupyter-tutorial:hub/ipyparallel/index`, :doc:`dask` and Ray can distribute tasks in a cluster. In doing so, they have different focuses:

- ipyparallel simply integrates into a :doc:`jupyter-tutorial:hub/index`.
- :doc:`dask` imitates pandas, NumPy, iterators, Toolz and PySpark when it distributes their tasks.
- Ray provides a simple, universal API for building distributed applications:

  - RLlib scales reinforcement learning in particular.
  - A backend for joblib supports distributed scikit-learn programs.
  - XGBoost-Ray is a backend for distributed XGBoost.
  - LightGBM-Ray is a backend for distributed LightGBM.
  - Collective Communication Lib offers a set of native collective primitives for Gloo and the NVIDIA Collective Communication Library (NCCL).
Our example could look like this with Dask:
.. literalinclude:: ds_kmeans.py
   :caption: ds_kmeans.py
   :name: ds_kmeans.py
   :lines: 5-
.. toctree::
   :hidden:
   :titlesonly:
   :maxdepth: 0

   dask.ipynb
After a brief :doc:`overview <multiprocessing-threading-async>`, three examples of :doc:`threading <threading-example>`, :doc:`multiprocessing <multiprocessing>` and :doc:`async <asyncio-example>` illustrate the rules and best practices.
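As a first taste, here is a minimal sketch with :mod:`concurrent.futures`, which provides the same high-level API for thread and process pools; the ``task`` function is only a placeholder:

```python
from concurrent.futures import ThreadPoolExecutor


def task(n):
    # Placeholder task; threads pay off mainly for I/O-bound work
    return n * n


# map() distributes the calls over a pool of four worker threads
# and returns the results in the original argument order
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(task, range(8)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Swapping ``ThreadPoolExecutor`` for ``ProcessPoolExecutor`` moves the same code to separate processes, which suits CPU-bound work.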
.. toctree::
   :hidden:
   :titlesonly:
   :maxdepth: 0

   multiprocessing-threading-async
   threading-example.ipynb
   multiprocessing.ipynb
   threading-forking-combined.ipynb
   asyncio-example.ipynb