Commit b771337
document reduction features
1 parent e024e80 commit b771337

5 files changed

Lines changed: 288 additions & 1 deletion


doc/source/Compression.rst

Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
.. Copyright 2025-2026

.. _Compression:

Compression
###########
Lossy compression is a widely used data reduction technique in scientific computing. By accepting a controlled loss
of precision, compressors such as SZ and ZFP can achieve significant reductions in data volume while preserving the
features that matter for downstream analysis. DTLMod models the performance impact of compression on in situ
workflows without actually compressing any data: it simulates the computational cost of compression and
decompression and adjusts the volume of data transported through the DTL according to a compression ratio.

How compression works in DTLMod
-------------------------------

Unlike decimation, compression does not change the **shape** of a variable. A :math:`1000 \times 1000` array remains
a :math:`1000 \times 1000` array after compression. What changes is the **byte size** of the variable: the number
of bytes transported through the DTL is divided by the compression ratio. This reflects the fact that real-world
lossy compressors produce a bitstream that is smaller than the original data but still represents all the elements
of the array.

Compression is a **publisher-side only** operation. Applying compression on the subscriber side is not meaningful
because compression aims to reduce the volume of data that needs to be transported---which requires intervention
before the data leaves the publisher.
Compressor profiles
-------------------

DTLMod provides three ways to determine the compression ratio for a variable:

**Fixed ratio.** The simplest option: you directly specify the desired compression ratio. This is useful when you
already know, from experiments or from the literature, the compression ratio achieved by a particular compressor
on data similar to yours. The ratio must be at least 1.0 (a ratio of 1 means no size reduction).

**SZ profile.** This profile is inspired by the `SZ lossy compressor <https://szcompressor.org/>`_, a
prediction-based algorithm. SZ achieves high compression ratios on smooth scientific data because it can accurately
predict neighboring values and only store the (small) prediction errors. The compression ratio is derived from two
user-specified parameters:

- **accuracy** (or error bound): the maximum acceptable pointwise error. Tighter accuracy requirements reduce the
  compression ratio because more bits are needed to represent the prediction residuals.

- **data smoothness**: a value between 0 and 1 that characterizes how regular the data is. Smooth data
  (e.g., temperature fields) yields higher compression ratios because predictions are more accurate. Noisy or
  turbulent data yields lower ratios.

The model computes the ratio as:

.. math::

   r = \max\!\Big(1,\;\alpha \cdot \left(-\log_{10} \varepsilon\right)^{\beta} \cdot (0.5 + \sigma)\Big)

where :math:`\varepsilon` is the accuracy, :math:`\sigma` is the data smoothness, and :math:`\alpha = 3.0`,
:math:`\beta = 0.8` are empirical parameters fitted from published benchmarks on scientific datasets.
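The following Python sketch, which is ours rather than part of DTLMod, illustrates how this model responds to the
two parameters:

.. code-block:: python

   import math

   # Sketch of the SZ-inspired ratio model above; the constants are the
   # empirical parameters quoted in the text.
   ALPHA, BETA = 3.0, 0.8

   def sz_ratio(accuracy: float, smoothness: float) -> float:
       """Compression ratio for an error bound and a smoothness in [0, 1]."""
       return max(1.0, ALPHA * (-math.log10(accuracy)) ** BETA * (0.5 + smoothness))

   # A smooth field with a loose error bound compresses far better than
   # a noisy field with a tight one.
   print(sz_ratio(1e-4, 0.9))  # ~12.7
   print(sz_ratio(1e-8, 0.1))  # ~9.5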
**ZFP profile.** This profile is inspired by the `ZFP compressor <https://computing.llnl.gov/projects/zfp>`_,
a transform-based algorithm. ZFP organizes data into small blocks, applies a near-orthogonal transform, and
encodes the resulting coefficients with a fixed number of bits per value. The compression ratio depends primarily
on the requested accuracy:

.. math::

   \text{rate} = \max(1,\;-\log_2 \varepsilon + 1) \quad;\quad r = \frac{64}{\text{rate}}

where the rate represents the number of bits per double-precision value after compression. Higher accuracy
requirements increase the rate and therefore decrease the compression ratio.
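A corresponding sketch of the ZFP-inspired model (again ours, not DTLMod code):

.. code-block:: python

   import math

   def zfp_ratio(accuracy: float) -> float:
       """Compression ratio for a requested pointwise error bound."""
       rate = max(1.0, -math.log2(accuracy) + 1.0)  # bits per double after compression
       return 64.0 / rate

   print(zfp_ratio(1e-3))  # rate ~11.0 bits -> ratio ~5.8
   print(zfp_ratio(1e-6))  # rate ~20.9 bits -> ratio ~3.1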
Compression and decompression costs
-----------------------------------

Two independent cost parameters control the simulated computational overhead of compression:

- **compression cost per element**: the number of floating-point operations incurred per array element when
  compressing the data on the publisher side.

- **decompression cost per element**: the number of floating-point operations incurred per array element when
  decompressing the data on the subscriber side, after it has been received.

Both parameters default to 1.0. The total compression and decompression costs for a variable are computed as:

.. math::

   C_{\text{compress}} = c_{\text{comp}} \times \frac{N_{\text{local}}}{\text{element\_size}}

.. math::

   C_{\text{decompress}} = c_{\text{decomp}} \times \frac{N_{\text{local}}}{\text{element\_size}}

where :math:`N_{\text{local}}` is the local size of the variable in bytes and :math:`\text{element\_size}` is the
size of one array element in bytes, so that the quotient is the number of local elements. The compression cost is
incurred by the publisher right before putting the variable into the DTL, and the decompression cost is incurred by
the subscriber right after receiving it.
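A minimal sketch of this cost accounting (the function name and example values are illustrative, not DTLMod API):

.. code-block:: python

   def compression_costs(local_bytes: int, element_size: int,
                         c_comp: float = 1.0, c_decomp: float = 1.0):
       """Return (publisher, subscriber) costs in simulated flops."""
       n_elements = local_bytes / element_size
       return c_comp * n_elements, c_decomp * n_elements

   # A local 1000 x 1000 block of doubles: 8 MB, one flop per element by default.
   print(compression_costs(1000 * 1000 * 8, 8))  # (1000000.0, 1000000.0)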
Per-transaction variability
---------------------------

In practice, the compression ratio achieved on a given variable varies from one time step to the next as the data
evolves. DTLMod can model this variability through an optional **ratio variability** parameter that introduces a
bounded, deterministic perturbation around the nominal compression ratio at each transaction. This enables the
simulation of realistic scenarios in which the effectiveness of compression fluctuates over the course of a run.
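Purely as an illustration of the idea (the exact perturbation scheme used by DTLMod is not specified here), a
bounded, deterministic fluctuation could look like this:

.. code-block:: python

   import math

   # Hypothetical perturbation scheme, not DTLMod's actual implementation.
   def perturbed_ratio(nominal: float, variability: float, transaction: int) -> float:
       """Deterministic ratio within nominal * (1 +/- variability)."""
       return max(1.0, nominal * (1.0 + variability * math.sin(transaction)))

   for t in range(4):
       print(round(perturbed_ratio(10.0, 0.2, t), 2))  # 10.0, 11.68, 11.82, 10.28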
Re-parameterization
-------------------

As with decimation, compression parameters can be updated between transactions. You can change the compression
ratio, switch compressor profiles, adjust accuracy or smoothness, or modify the cost parameters for a variable that
is already being compressed. Only the parameters that are explicitly provided in the update are modified; the
others retain their previous values. This supports the simulation of adaptive compression strategies that adjust
their settings in response to changes in the data.

doc/source/Decimation.rst

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
.. Copyright 2025-2026

.. _Decimation:

Decimation
##########
Decimation is a spatial subsampling technique that reduces the size of a multidimensional array by keeping only every
*n*-th element along each dimension. It is the method of choice when a workflow component does not need the full
resolution of the data produced upstream---a common situation in visualization or coarse-grained analysis pipelines.

How decimation works
--------------------

A decimation operation is controlled by a **stride vector** that specifies, for each dimension of a
:ref:`Concept_Variable`, how many elements to skip between two retained samples. For a variable of shape
:math:`(D_1, D_2, \ldots, D_k)` and a stride :math:`(s_1, s_2, \ldots, s_k)`, the shape of the reduced variable
becomes :math:`(\lceil D_1/s_1 \rceil, \lceil D_2/s_2 \rceil, \ldots, \lceil D_k/s_k \rceil)`.

For instance, applying a stride of :math:`(1, 2, 4)` to a :math:`640 \times 640 \times 640` variable produces a
:math:`640 \times 320 \times 160` reduced variable---an 8x reduction in data volume.
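A minimal sketch of this shape computation (ours, not part of the DTLMod API):

.. code-block:: python

   import math

   def decimated_shape(shape, stride):
       """Ceiling division of each dimension by its stride."""
       assert len(shape) == len(stride) and all(s >= 1 for s in stride)
       return tuple(math.ceil(d / s) for d, s in zip(shape, stride))

   print(decimated_shape((640, 640, 640), (1, 2, 4)))  # (640, 320, 160)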
The stride vector must have the same number of dimensions as the variable it applies to, and all stride values must
be strictly positive. A stride of 1 along a given dimension means no subsampling in that dimension.

Decimation is applied **per variable**: within the same :ref:`Concept_Stream`, different variables can be decimated
with different strides, or not be decimated at all.
Publisher-side and subscriber-side decimation
---------------------------------------------

Unlike compression, decimation can be applied on **both sides** of the data flow:

- When applied by a **publisher**, decimation reduces the volume of data that leaves the publisher. The simulated
  cost of the decimation kernel is incurred before the data is transported. Only the decimated version of the variable
  is put into the DTL, which directly reduces I/O or network costs.

- When applied by a **subscriber**, decimation reduces the volume of data that the subscriber has to process after
  receiving it. The subscriber first retrieves the full variable and then applies decimation locally. This can be
  useful when the subscriber only needs a coarse view of the data, but the full-resolution version must still be
  transported because other subscribers or a checkpoint mechanism may need it.
Interpolation
-------------

In some workflows, the subsampled data must be smoothed or reconstructed to better approximate the original field.
DTLMod models this by allowing an optional **interpolation method** to be specified alongside the stride. The
supported interpolation methods are:

- **linear**: suitable for piecewise-linear fields (variables with at least 1 dimension).
- **quadratic**: suitable for smoother fields (variables with at least 2 dimensions).
- **cubic**: suitable for highly smooth fields (variables with at least 3 dimensions).

The choice of interpolation method does not affect the size of the reduced variable: it only affects the
**computational cost** of the decimation operation. Higher-order interpolation is more expensive: the cost
multiplier is 2x for linear, 4x for quadratic, and 8x for cubic interpolation relative to simple subsampling
without interpolation. This allows you to study the tradeoff between the quality of the reconstructed data and the
computational overhead introduced by the interpolation step.
Computational cost model
------------------------

The simulated cost of a decimation operation, in floating-point operations, is determined by:

.. math::

   C = m \times c \times N

where :math:`N` is the number of elements in the **local** (non-decimated) portion of the variable,
:math:`c` is a configurable **cost per element** (defaulting to 1.0), and :math:`m` is the interpolation
multiplier (1 for no interpolation, 2 for linear, 4 for quadratic, 8 for cubic).

The cost per element can be adjusted to match the observed or estimated computational cost of a specific decimation
implementation in the real-world application being simulated.
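A sketch of this cost model, using the multipliers from the previous section (illustrative, not DTLMod API):

.. code-block:: python

   # Interpolation multipliers match the values stated above.
   INTERP_MULTIPLIER = {None: 1, "linear": 2, "quadratic": 4, "cubic": 8}

   def decimation_cost(n_elements, cost_per_element=1.0, interpolation=None):
       """Simulated cost in flops of decimating n_elements local elements."""
       return INTERP_MULTIPLIER[interpolation] * cost_per_element * n_elements

   # Cubic interpolation makes the same operation 8x more expensive.
   print(decimation_cost(640 ** 3))                         # 262144000.0
   print(decimation_cost(640 ** 3, interpolation="cubic"))  # 2097152000.0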
Re-parameterization
-------------------

A decimation operation can be re-parameterized between transactions. For instance, you can change the stride, the
interpolation method, or the cost per element of a variable that has already been decimated. This enables the
simulation of adaptive workflows in which the level of subsampling changes over time in response to features detected
in the data.

When re-parameterizing, only the parameters that are explicitly provided are updated; the others retain their
previous values.

doc/source/Reduction.rst

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
.. Copyright 2025-2026

.. _Reduction:

Data Reduction Operations
=========================
Scientific simulations and in situ processing workflows produce ever-increasing volumes of data. Even with fast
networks and storage systems, the sheer amount of data transported through the DTL can become a bottleneck.
**Data reduction** techniques alleviate this pressure by decreasing the volume of data that must be moved between
publishers and subscribers, at the cost of some additional computation and, depending on the method, a controlled
loss of information.

DTLMod allows you to study the impact of data reduction on the performance of in situ workflows by attaching a
**reduction method** to a :ref:`Concept_Stream` and then applying it, with specific parameters, to individual
:ref:`Concept_Variable` objects. When a publisher puts a reduced variable into the DTL, the simulation accounts for
the computational cost of the reduction operation and transports a smaller volume of data. On the subscriber side,
the simulation may account for a corresponding decompression or reconstruction cost when retrieving the variable.

DTLMod currently exposes two families of reduction methods:

**Decimation** selectively retains a subset of elements from a multidimensional array by applying a per-dimension
stride. The result is a smaller array whose shape reflects the subsampling factor in each dimension. This approach
is common in visualization pipelines where only every *n*-th data point is needed. Since decimation preserves the
original values of the retained elements, it is by nature a lossless operation on the selected subset. Optionally,
an interpolation step can be used to reconstruct missing values. More details are given in the
:ref:`Decimation` section.

**Compression** reduces the byte size of a variable without altering its shape. The compressed variable retains the
same number of elements but each element occupies fewer bytes, according to a **compression ratio** that can be
specified directly or derived from a compressor model. DTLMod provides built-in models inspired by the SZ and ZFP
lossy compressors to derive realistic compression ratios from data characteristics. More details are given in the
:ref:`Compression` section.
Where and when reduction is applied
-----------------------------------

Reduction methods can be applied on either side of the data flow:

- **Publisher-side reduction** is the most common scenario. The publisher compresses or decimates data before putting
  it into the DTL, reducing the volume of data that has to be transported and stored. Both decimation and compression
  support this mode.

- **Subscriber-side reduction** is only available for decimation. A subscriber can choose to retrieve a subsampled
  version of a variable it receives from the DTL, reducing the volume of data it has to process. Compression on the
  subscriber side is not supported because its purpose is precisely to reduce what has to be transported, which
  requires intervention before the data leaves the publisher.

When a publisher applies a reduction, the information is propagated to subscribers: any variable obtained through
``inquire_variable`` on the subscriber side carries the reduction state set by the publisher. This allows DTLMod to
prevent conflicting reduction operations. In particular, a subscriber cannot apply a second reduction to a variable
that was already reduced by its publisher.
Simulated costs
---------------

A reduction operation in DTLMod introduces two potential costs:

1. A **reduction cost** (in simulated floating-point operations) is incurred by the actor that applies the reduction,
   right before it puts or gets the variable. This cost models the computational overhead of running a compressor or
   a decimation kernel.

2. A **decompression cost** (for compression only) is incurred on the subscriber side after it receives compressed
   data. This cost models the time needed to decompress the data before it can be used by the analysis component.

These costs are fully configurable through the parameters of each reduction method, enabling you to explore tradeoffs
between data movement savings and computational overhead for different reduction strategies.

doc/source/app_API.rst

Lines changed: 23 additions & 0 deletions
@@ -130,6 +130,29 @@ selecting the :ref:`Concept_Transport` **method** of the Stream to either ``Tran
:ref:`Inside_staging_engine` section of the documentation.


.. |Concept_Reduction| replace:: **Reduction**
.. _Concept_Reduction:

Data Reduction
^^^^^^^^^^^^^^

In situ workflows that produce large volumes of data can benefit from **data reduction** to decrease the amount of
data transported through the DTL. DTLMod exposes reduction as an optional operation that can be applied to individual
|Concept_Variable|_ objects within a |Concept_Stream|_.

A reduction method is first created on a |Concept_Stream|_ by specifying its type (``"decimation"`` or
``"compression"``). It is then applied to a |Concept_Variable|_ with a set of parameters that control the reduction
behavior---for instance, a stride vector for decimation or a compression ratio and compressor profile for compression.

When a publisher puts a reduced variable into the DTL, the simulation accounts for the computational overhead of the
reduction and transports only the reduced volume. On the subscriber side, a decompression cost may be incurred when
the data is retrieved. The reduction state of a variable is automatically propagated to subscribers: when a subscriber
inquires a variable that has been reduced by its publisher, this information is preserved and prevents conflicting
double reductions.

A complete description of the reduction mechanisms, their parameters, and their internal cost models is given in the
:ref:`Reduction` section.

.. |Concept_Variable| replace:: **Variable**
.. _Concept_Variable:

doc/source/index.rst

Lines changed: 4 additions & 1 deletion
@@ -47,7 +47,10 @@ effects of resource allocation strategies.
   Engines <Engines.rst>
   Inside the File engine <Inside_File_engine.rst>
   Inside the Staging engine <Inside_Staging_engine.rst>
   Data Reduction Operations <Reduction.rst>
   Decimation <Decimation.rst>
   Compression <Compression.rst>

.. Cheat Sheet on the sublevels
..
.. # with overline, for parts
