
Commit c88e672

📝 Update performance measurements section
* Add cProfile/profiling.tracing
* Add tprof
* Add performance metrics
1 parent: eefba94

3 files changed: 245 additions & 37 deletions


docs/performance/index.rst

Lines changed: 133 additions & 37 deletions
@@ -24,10 +24,10 @@ it is usually counterproductive to worry about the efficiency of the code.

k-Means example
---------------

In the following, I will provide examples of the `k-means
<https://en.wikipedia.org/wiki/K-means_clustering>`_ algorithm, which is used to
form a predefined number of clusters from a set of objects. This can be achieved
using MacQueen’s algorithm in the following three steps:

#. Choose the first :samp:`k` elements as cluster centres
#. Assign each new element to the cluster with the least increase in variance.
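The steps above can be sketched in pure Python. This is only an illustrative sketch under my own assumptions (the function name ``macqueen`` and the variance-increase formula are mine, not taken from :file:`py_kmeans.py`):

```python
import math

def macqueen(points, k):
    # Hypothetical sketch of MacQueen's k-means, not the py_kmeans.py module.
    # Step 1: the first k elements become the initial cluster centres.
    centres = [list(p) for p in points[:k]]
    counts = [1] * k
    for p in points[k:]:
        # Step 2: assign each new element to the cluster whose variance
        # increases least; adding p to cluster j with n_j members increases
        # the variance by n_j / (n_j + 1) * dist(p, centre_j) ** 2.
        j = min(
            range(k),
            key=lambda c: counts[c] / (counts[c] + 1)
            * math.dist(p, centres[c]) ** 2,
        )
        # Step 3 (assumed): update the chosen centre as a running mean.
        counts[j] += 1
        centres[j] = [m + (x - m) / counts[j] for m, x in zip(centres[j], p)]
    return centres

print(macqueen([(0, 0), (10, 10), (0, 1), (10, 11)], 2))
# → [[0.0, 0.5], [10.0, 10.5]]
```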
@@ -40,6 +40,7 @@ A possible implementation with pure Python could look like this:

.. literalinclude:: py_kmeans.py
   :caption: py_kmeans.py
   :name: py_kmeans.py
   :lines: 6-

We can create sample data with:

@@ -62,18 +63,30 @@ Performance measurements

Performance measurements
------------------------

Once you have worked with your code, it can be useful to examine its efficiency
more closely. :doc:`cProfile <tracing>`, :doc:`ipython-profiler`, :doc:`scalene`
or :doc:`tprof` can be used for this. I usually carry out the following steps:

#. I profile the entire programme with :doc:`cProfile <tracing>` or `py-spy
   <https://github.com/benfred/py-spy>`_ to find slow functions.
#. If necessary, I use the `line_profiler
   <https://github.com/pyutils/line_profiler>`_ to identify the slow sections
   within a function.
#. If the slow function is computationally intensive, I try one of the
   optimisations below; if the application is data-intensive (dictionaries,
   strings, I/O), I take a closer look at the architecture instead.
#. Then I optimise the slow function.
#. Finally, I create a new profile and filter out the result of my optimised
   version so that I can compare the results.
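The first step can be sketched with the :mod:`cProfile` and :mod:`pstats` APIs directly; the function ``hot`` is a made-up example:

```python
import cProfile
import io
import pstats

def hot() -> int:
    # Deliberately expensive function so it shows up in the profile
    return sum(i * i for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()
hot()
profiler.disable()

# Sort by cumulative time and show the five most expensive functions
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```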

.. versionadded:: Python 3.15
   :pep:`799` will provide a special profiling module that organises the
   profiling tools integrated in Python under a uniform namespace. This module
   contains:

   :mod:`profiling.tracing`
       deterministic function call tracing, which has been moved from
       :doc:`cProfile <tracing>`.
   :mod:`profiling.sampling`
       the new statistical sampling profiler :doc:`tachyon`.

@@ -91,12 +104,14 @@

   :titlesonly:
   :maxdepth: 0

   tracing
   ipython-profiler.ipynb
   scalene.ipynb
   tprof
   tachyon

1. Search for existing implementations
--------------------------------------

You should not try to reinvent the wheel: If there are existing implementations,
you should use them. There are even two implementations for the k-means
@@ -128,8 +143,8 @@ create a considerable overhead in your project if you are not already using

<https://ml.dask.org>`_ elsewhere. In the following, I will therefore show you
further possibilities to optimise your own code.

2. Find anti-patterns
---------------------

Then you can use :doc:`perflint` to search your code for the most common
performance anti-patterns in Python.
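A typical anti-pattern that such linters flag is a loop-invariant lookup repeated on every iteration. The two functions below are made-up examples, not taken from perflint’s documentation:

```python
import math

def norms_slow(vectors):
    out = []
    for x, y in vectors:
        # math.sqrt is looked up on the module on every iteration
        out.append(math.sqrt(x * x + y * y))
    return out

def norms_fast(vectors):
    # Hoisting the attribute lookup out of the loop avoids the repeated
    # dictionary access, one of the patterns perflint-style tools report
    sqrt = math.sqrt
    return [sqrt(x * x + y * y) for x, y in vectors]

print(norms_fast([(3, 4), (5, 12)]))  # → [5.0, 13.0]
```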
@@ -144,31 +159,39 @@ performance anti-patterns in Python.

.. seealso::
   * `Effective Python <https://effectivepython.com>`_

3. Vectorisations with NumPy
----------------------------

:doc:`../workspace/numpy/index` moves repetitive operations into a statically
typed compiled layer, combining the fast development time of Python with the
fast execution time of C.

+---------------+---------------+----------+
| Version       | Spectral-norm | vs 3.14x |
+===============+===============+==========+
| CPython 3.14  | 14,046ms      |          |
| – Basis       |               |          |
+---------------+---------------+----------+
| NumPy         | 27ms          | 520x     |
+---------------+---------------+----------+

You may be able to use :doc:`../workspace/numpy/ufunc`, :doc:`vectorisation
<../workspace/numpy/vectorisation>` and :doc:`../workspace/numpy/indexing-slicing`
in various combinations to move repetitive operations into compiled code and
thus avoid slow loops, for example:

.. literalinclude:: np_kmeans.py
   :caption: np_kmeans.py
   :name: np_kmeans.py
   :lines: 5-12
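The same idea can be sketched with broadcasting; the array shapes here are made up for illustration and this is not the :file:`np_kmeans.py` implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((100, 2))   # 100 two-dimensional points
centres = points[:3]            # first three points as cluster centres

# Broadcasting (100, 1, 2) against (1, 3, 2) yields all 100 x 3 distances
# in compiled code, without a Python-level loop
dist = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
labels = dist.argmin(axis=1)    # index of the nearest centre per point
```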

The advantages of NumPy are that the Python overhead only occurs per array and
not per array element. However, because NumPy uses a specific language for array
operations, it also requires a different mindset when writing code. Finally, the
batch operations can also lead to excessive memory consumption.

4. Special data structures
--------------------------

:doc:`../workspace/pandas/index`
    for SQL-like :doc:`../workspace/pandas/group-operations` and

@@ -179,7 +202,7 @@ Special data structures

    .. literalinclude:: pd_kmeans.py
       :caption: pd_kmeans.py
       :name: pd_kmeans.py
       :lines: 5-8, 16-19
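The SQL-like group operations can be sketched as follows; the data is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cluster": [0, 0, 1, 1],
        "x": [1.0, 2.0, 10.0, 12.0],
        "y": [0.0, 1.0, 5.0, 7.0],
    }
)

# Group-wise means correspond to the cluster centres, similar to
# SELECT cluster, AVG(x), AVG(y) ... GROUP BY cluster in SQL
centres = df.groupby("cluster")[["x", "y"]].mean()
```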

`scipy.spatial <https://docs.scipy.org/doc/scipy/reference/spatial.html>`_
    for spatial queries like distances, nearest neighbours, k-Means :abbr:`etc

@@ -190,7 +213,7 @@ Special data structures

    .. literalinclude:: sp_kmeans.py
       :caption: sp_kmeans.py
       :name: sp_kmeans.py
       :lines: 5-13
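For nearest-neighbour queries, a :class:`scipy.spatial.KDTree` sketch could look like this (random data for illustration, not the :file:`sp_kmeans.py` example):

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
points = rng.random((1000, 2))

tree = KDTree(points)
# The three nearest neighbours of the first five points
dist, idx = tree.query(points[:5], k=3)
```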

`scipy.sparse <https://docs.scipy.org/doc/scipy/reference/sparse.html>`_
    `sparse matrices <https://en.wikipedia.org/wiki/Sparse_matrix>`_

@@ -209,8 +232,8 @@ Special data structures

    parallelise-pandas

5. Select compiler
------------------

Faster CPython
~~~~~~~~~~~~~~
@@ -226,8 +249,47 @@ particular is likely to benefit from the changes; code already written in C,

I/O-heavy processes and multithreaded code, on the other hand, are unlikely to
benefit.

And indeed, the CPython versions have become significantly more efficient since
then:

+------------------+---------+
| Version          | vs 3.15 |
+==================+=========+
| CPython 3.10.4   | 1.422x  |
+------------------+---------+
| CPython 3.12.0   | 1.093x  |
+------------------+---------+
| CPython 3.13.0   | 1.024x  |
+------------------+---------+
| CPython 3.15.0a0 |         |
| – Basis          |         |
+------------------+---------+

.. seealso::
   * `Faster CPython
     <https://web.archive.org/web/20221007175548/https://faster-cpython.readthedocs.io/>`__
   * `Faster CPython Benchmark Infrastructure
     <https://github.com/faster-cpython/benchmarking-public?tab=readme-ov-file>`_

Free-threaded Python was also included in another comparison:

+---------------+---------+---------+---------------+----------+
| Version       | N-body  | vs 3.14 | Spectral-norm | vs 3.14x |
+===============+=========+=========+===============+==========+
| CPython 3.10  | 1,663ms | 0.75x   | 16,826ms      | 0.83x    |
+---------------+---------+---------+---------------+----------+
| CPython 3.11  | 1,200ms | 1.04x   | 13,430ms      | 1.05x    |
+---------------+---------+---------+---------------+----------+
| CPython 3.13  | 1,134ms | 1.10x   | 13,637ms      | 1.03x    |
+---------------+---------+---------+---------------+----------+
| CPython 3.14  | 1,242ms |         | 14,046ms      |          |
| – Basis       |         |         |               |          |
+---------------+---------+---------+---------------+----------+
| CPython 3.14t | 1,513ms | 0.82x   | 14,551ms      | 0.97x    |
+---------------+---------+---------+---------------+----------+

– Source: `The Optimization Ladder
<https://cemrehancavdar.com/2026/03/10/optimization-ladder/>`_
If you don’t want to wait for the next CPython release in your project, you can
also have a look at the
@@ -247,12 +309,32 @@ Python JIT compiler

     <https://github.com/python/cpython/blob/main/Tools/jit/README.md>`_
   * :ref:`whatsnew315-jit`

+------------------+---------+
| Version          | vs base |
+==================+=========+
| CPython 3.15.0a0 | 1.001x  |
| (JIT)            |         |
+------------------+---------+
| CPython 3.15.0a0 |         |
| – Basis          |         |
+------------------+---------+

Cython
~~~~~~

For intensive numerical operations, Python can be very slow, even if you have
avoided all anti-patterns and used vectorisations with NumPy. In this case,
translating code into `Cython <https://cython.org>`_ can be helpful.

+---------------+---------+---------+---------------+----------+
| Version       | N-body  | vs 3.14 | Spectral-norm | vs 3.14x |
+===============+=========+=========+===============+==========+
| CPython 3.14  | 1,242ms |         | 14,046ms      |          |
| – Basis       |         |         |               |          |
+---------------+---------+---------+---------------+----------+
| Cython        | 10ms    | 124x    | 142ms         | 99x      |
+---------------+---------+---------+---------------+----------+

Unfortunately, the code often has to be restructured and thus increases in
complexity. Explicit type annotations and the provision of code also become more
cumbersome.

@@ -262,7 +344,7 @@ Our example could then look like this:

.. literalinclude:: cy_kmeans.pyx
   :caption: cy_kmeans.pyx
   :name: cy_kmeans.pyx
   :lines: 5-32

.. seealso::
   * `Cython Tutorials
@@ -277,13 +359,27 @@ scientific Python and NumPy code into fast machine code, for example:

.. literalinclude:: nb_kmeans.py
   :caption: nb_kmeans.py
   :name: nb_kmeans.py
   :lines: 5-29

However, Numba requires `LLVM <https://en.wikipedia.org/wiki/LLVM>`_ and some
Python constructs are not supported.

+---------------+---------+---------+---------------+----------+
| Version       | N-body  | vs 3.14 | Spectral-norm | vs 3.14x |
+===============+=========+=========+===============+==========+
| CPython 3.14  | 1,242ms |         | 14,046ms      |          |
| – Basis       |         |         |               |          |
+---------------+---------+---------+---------------+----------+
| Numba         | 22ms    | 56x     | 104ms         | 135x     |
+---------------+---------+---------+---------------+----------+

.. seealso::
   * `Speeding up NumPy with parallelism
     <https://pythonspeed.com/articles/numpy-parallelism/>`_ by Itamar
     Turner-Trauring

6. Task planner
---------------

:doc:`jupyter-tutorial:hub/ipyparallel/index`, :doc:`dask` and `Ray
<https://docs.ray.io/en/latest/>`_ can distribute tasks in a cluster. In doing

@@ -319,7 +415,7 @@ Our example could look like this with Dask:

.. literalinclude:: ds_kmeans.py
   :caption: ds_kmeans.py
   :name: ds_kmeans.py
   :lines: 5-

.. toctree::
   :hidden:

@@ -328,8 +424,8 @@ Our example could look like this with Dask:

   dask.ipynb

7. Multithreading, Multiprocessing and Async
--------------------------------------------

After a brief :doc:`overview <multiprocessing-threading-async>`, three examples
of :doc:`threading <threading-example>`, :doc:`multiprocessing

docs/performance/tprof.rst

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@

.. SPDX-FileCopyrightText: 2026 Veit Schiele
..
.. SPDX-License-Identifier: BSD-3-Clause

``tprof``
=========

From Python 3.12 onwards, `tprof <https://github.com/adamchainz/tprof>`_
measures the time spent in specific functions while executing a module. Unlike
other profilers, it only tracks the specified functions with
:mod:`sys.monitoring`, eliminating the need for filtering.

``tprof`` can be used as a command-line programme and via a Python interface:

:samp:`uv run tprof -t {MODULE}:{FUNCTION} (-m {MODULE} | {PATH/TO/SCRIPT})`
    Suppose you have determined that creating :class:`pathlib.Path` objects in
    the :mod:`main` module is slowing down your code. Here’s how you can
    measure this with ``tprof``:

    .. code-block:: console

       $ uv run tprof -t pathlib:Path.open -m main
       🎯 tprof results:
       function             calls  total  mean ± σ  min … max
       pathlib:Path.open()  1      93μs   93μs      93μs … 93μs

    With the ``-x`` option, you can also compare two functions with each other:

    .. code-block:: console

       $ uv run tprof -x -t old -m main -t new -m main
       🎯 tprof results:
       function    calls  total  mean ± σ  min … max    delta
       main:old()  1      41μs   41μs      41μs … 41μs  -
       main:new()  1      20μs   20μs      20μs … 20μs  -50.67%

``tprof(*targets, label: str | None = None, compare: bool = False)``
    Use this as a :doc:`context manager <python-basics:control-flow/with>` in
    your code to profile a specific block. The report is generated each time
    the block is run through.

    ``*targets``
        are callables to profile, or references to them that are resolved with
        :func:`pkgutil.resolve_name`.
    ``label``
        is an optional string that is added to the report as a header.
    ``compare``
        set to ``True`` activates comparison mode.

    Example:

    .. code-block:: python

       from pathlib import Path

       from tprof import tprof

       with tprof(Path.open):
           p = Path("docs", "save-data", "myfile.txt")
           f = p.open()

    .. code-block:: console

       $ uv run python main.py
       🎯 tprof results:
       function             calls  total  mean ± σ  min … max
       pathlib:Path.open()  1      82μs   82μs      82μs … 82μs
