Commit b771337
document reduction features
1 parent e024e80 commit b771337

5 files changed

Lines changed: 288 additions & 1 deletion


doc/source/Compression.rst

Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
.. Copyright 2025-2026

.. _Compression:

Compression
###########
Lossy compression is a widely used data reduction technique in scientific computing. By accepting a controlled loss
of precision, compressors such as SZ and ZFP can achieve significant reductions in data volume while preserving the
features that matter for downstream analysis. DTLMod models the performance impact of compression on in situ
workflows without actually compressing any data: it simulates the computational cost of compression and
decompression and adjusts the volume of data transported through the DTL according to a compression ratio.

How compression works in DTLMod
-------------------------------

Unlike decimation, compression does not change the **shape** of a variable. A :math:`1000 \times 1000` array remains
a :math:`1000 \times 1000` array after compression. What changes is the **byte size** of the variable: the number
of bytes transported through the DTL is divided by the compression ratio. This reflects the fact that real-world
lossy compressors produce a bitstream that is smaller than the original data but still represents all the elements
of the array.

Compression is a **publisher-side only** operation. Applying compression on the subscriber side is not meaningful
because compression aims to reduce the volume of data that needs to be transported---which requires intervention
before the data leaves the publisher.
Compressor profiles
-------------------

DTLMod provides three ways to determine the compression ratio for a variable:

**Fixed ratio.** The simplest option: you directly specify the desired compression ratio. This is useful when you
already know, from experiments or from the literature, the compression ratio achieved by a particular compressor
on data similar to yours. The ratio must be at least 1.0 (a ratio of 1 means no size reduction).

**SZ profile.** This profile is inspired by the `SZ lossy compressor <https://szcompressor.org/>`_, a
prediction-based algorithm. SZ achieves high compression ratios on smooth scientific data because it can accurately
predict neighboring values and only store the (small) prediction errors. The compression ratio is derived from two
user-specified parameters:

- **accuracy** (or error bound): the maximum acceptable pointwise error. Tighter accuracy requirements reduce the
  compression ratio because more bits are needed to represent the prediction residuals.

- **data smoothness**: a value between 0 and 1 that characterizes how regular the data is. Smooth data
  (e.g., temperature fields) yields higher compression ratios because predictions are more accurate. Noisy or
  turbulent data yields lower ratios.

The model computes the ratio as:

.. math::

   r = \max\!\Big(1,\;\alpha \cdot \left(-\log_{10} \varepsilon\right)^{\beta} \cdot (0.5 + \sigma)\Big)

where :math:`\varepsilon` is the accuracy, :math:`\sigma` is the data smoothness, and :math:`\alpha = 3.0`,
:math:`\beta = 0.8` are empirical parameters fitted from published benchmarks on scientific datasets.
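The following Python sketch, which is ours rather than part of DTLMod, illustrates how this model responds to the
two parameters:

.. code-block:: python

   import math

   # Sketch of the SZ-inspired ratio model above; the constants are the
   # empirical parameters quoted in the text.
   ALPHA, BETA = 3.0, 0.8

   def sz_ratio(accuracy: float, smoothness: float) -> float:
       """Compression ratio for an error bound and a smoothness in [0, 1]."""
       return max(1.0, ALPHA * (-math.log10(accuracy)) ** BETA * (0.5 + smoothness))

   # A smooth field with a loose error bound compresses far better than
   # a noisy field with a tight one.
   print(sz_ratio(1e-4, 0.9))  # ~12.7
   print(sz_ratio(1e-8, 0.1))  # ~9.5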
**ZFP profile.** This profile is inspired by the `ZFP compressor <https://computing.llnl.gov/projects/zfp>`_,
a transform-based algorithm. ZFP organizes data into small blocks, applies a near-orthogonal transform, and
encodes the resulting coefficients with a fixed number of bits per value. The compression ratio depends primarily
on the requested accuracy:

.. math::

   \text{rate} = \max(1,\;-\log_2 \varepsilon + 1) \quad;\quad r = \frac{64}{\text{rate}}

where the rate represents the number of bits per double-precision value after compression. Higher accuracy
requirements increase the rate and therefore decrease the compression ratio.
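A corresponding sketch of the ZFP-inspired model (again ours, not DTLMod code):

.. code-block:: python

   import math

   def zfp_ratio(accuracy: float) -> float:
       """Compression ratio for a requested pointwise error bound."""
       rate = max(1.0, -math.log2(accuracy) + 1.0)  # bits per double after compression
       return 64.0 / rate

   print(zfp_ratio(1e-3))  # rate ~11.0 bits -> ratio ~5.8
   print(zfp_ratio(1e-6))  # rate ~20.9 bits -> ratio ~3.1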
Compression and decompression costs
-----------------------------------

Two independent cost parameters control the simulated computational overhead of compression:

- **compression cost per element**: the number of floating-point operations incurred per array element when
  compressing the data on the publisher side.

- **decompression cost per element**: the number of floating-point operations incurred per array element when
  decompressing the data on the subscriber side, after it has been received.

Both parameters default to 1.0. The total compression and decompression costs for a variable are computed as:

.. math::

   C_{\text{compress}} = c_{\text{comp}} \times \frac{N_{\text{local}}}{\text{element\_size}}

.. math::

   C_{\text{decompress}} = c_{\text{decomp}} \times \frac{N_{\text{local}}}{\text{element\_size}}

where :math:`N_{\text{local}}` is the local size of the variable in bytes and :math:`\text{element\_size}` is the
size of one array element in bytes, so that the quotient is the number of local elements. The compression cost is
incurred by the publisher right before putting the variable into the DTL, and the decompression cost is incurred by
the subscriber right after receiving it.
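A minimal sketch of this cost accounting (the function name and example values are illustrative, not DTLMod API):

.. code-block:: python

   def compression_costs(local_bytes: int, element_size: int,
                         c_comp: float = 1.0, c_decomp: float = 1.0):
       """Return (publisher, subscriber) costs in simulated flops."""
       n_elements = local_bytes / element_size
       return c_comp * n_elements, c_decomp * n_elements

   # A local 1000 x 1000 block of doubles: 8 MB, one flop per element by default.
   print(compression_costs(1000 * 1000 * 8, 8))  # (1000000.0, 1000000.0)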
Per-transaction variability
---------------------------

In practice, the compression ratio achieved on a given variable varies from one time step to the next as the data
evolves. DTLMod can model this variability through an optional **ratio variability** parameter that introduces a
bounded, deterministic perturbation around the nominal compression ratio at each transaction. This enables the
simulation of realistic scenarios in which the effectiveness of compression fluctuates over the course of a run.
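Purely as an illustration of the idea (the exact perturbation scheme used by DTLMod is not specified here), a
bounded, deterministic fluctuation could look like this:

.. code-block:: python

   import math

   # Hypothetical perturbation scheme, not DTLMod's actual implementation.
   def perturbed_ratio(nominal: float, variability: float, transaction: int) -> float:
       """Deterministic ratio within nominal * (1 +/- variability)."""
       return max(1.0, nominal * (1.0 + variability * math.sin(transaction)))

   for t in range(4):
       print(round(perturbed_ratio(10.0, 0.2, t), 2))  # 10.0, 11.68, 11.82, 10.28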
Re-parameterization
-------------------

As with decimation, compression parameters can be updated between transactions. You can change the compression
ratio, switch compressor profiles, adjust accuracy or smoothness, or modify the cost parameters for a variable that
is already being compressed. Only the parameters that are explicitly provided in the update are modified; the
others retain their previous values. This supports the simulation of adaptive compression strategies that adjust
their settings in response to changes in the data.

doc/source/Decimation.rst

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
.. Copyright 2025-2026

.. _Decimation:

Decimation
##########
Decimation is a spatial subsampling technique that reduces the size of a multidimensional array by keeping only every
*n*-th element along each dimension. It is the method of choice when a workflow component does not need the full
resolution of the data produced upstream---a common situation in visualization or coarse-grained analysis pipelines.

How decimation works
--------------------

A decimation operation is controlled by a **stride vector** that specifies, for each dimension of a
:ref:`Concept_Variable`, how many elements to skip between two retained samples. For a variable of shape
:math:`(D_1, D_2, \ldots, D_k)` and a stride :math:`(s_1, s_2, \ldots, s_k)`, the shape of the reduced variable
becomes :math:`(\lceil D_1/s_1 \rceil, \lceil D_2/s_2 \rceil, \ldots, \lceil D_k/s_k \rceil)`.

For instance, applying a stride of :math:`(1, 2, 4)` to a :math:`640 \times 640 \times 640` variable produces a
:math:`640 \times 320 \times 160` reduced variable---an 8x reduction in data volume.
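A minimal sketch of this shape computation (ours, not part of the DTLMod API):

.. code-block:: python

   import math

   def decimated_shape(shape, stride):
       """Ceiling division of each dimension by its stride."""
       assert len(shape) == len(stride) and all(s >= 1 for s in stride)
       return tuple(math.ceil(d / s) for d, s in zip(shape, stride))

   print(decimated_shape((640, 640, 640), (1, 2, 4)))  # (640, 320, 160)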
The stride vector must have the same number of dimensions as the variable it applies to, and all stride values must
be strictly positive. A stride of 1 along a given dimension means no subsampling in that dimension.

Decimation is applied **per variable**: within the same :ref:`Concept_Stream`, different variables can be decimated
with different strides, or not be decimated at all.
Publisher-side and subscriber-side decimation
---------------------------------------------

Unlike compression, decimation can be applied on **both sides** of the data flow:

- When applied by a **publisher**, decimation reduces the volume of data that leaves the publisher. The simulated
  cost of the decimation kernel is incurred before the data is transported. Only the decimated version of the variable
  is put into the DTL, which directly reduces I/O or network costs.

- When applied by a **subscriber**, decimation reduces the volume of data that the subscriber has to process after
  receiving it. The subscriber first retrieves the full variable and then applies decimation locally. This can be
  useful when the subscriber only needs a coarse view of the data, but the full-resolution version must still be
  transported because other subscribers or a checkpoint mechanism may need it.
Interpolation
-------------

In some workflows, the subsampled data must be smoothed or reconstructed to better approximate the original field.
DTLMod models this by allowing an optional **interpolation method** to be specified alongside the stride. The
supported interpolation methods are:

- **linear**: suitable for piecewise-linear fields (variables with at least 1 dimension).
- **quadratic**: suitable for smoother fields (variables with at least 2 dimensions).
- **cubic**: suitable for highly smooth fields (variables with at least 3 dimensions).

The choice of interpolation method does not affect the size of the reduced variable: it only affects the
**computational cost** of the decimation operation. Higher-order interpolation is more expensive: the cost
multiplier is 2x for linear, 4x for quadratic, and 8x for cubic interpolation relative to simple subsampling
without interpolation. This allows you to study the tradeoff between the quality of the reconstructed data and the
computational overhead introduced by the interpolation step.
Computational cost model
------------------------

The simulated cost of a decimation operation, in floating-point operations, is determined by:

.. math::

   C = m \times c \times N

where :math:`N` is the number of elements in the **local** (non-decimated) portion of the variable,
:math:`c` is a configurable **cost per element** (defaulting to 1.0), and :math:`m` is the interpolation
multiplier (1 for no interpolation, 2 for linear, 4 for quadratic, 8 for cubic).

The cost per element can be adjusted to match the observed or estimated computational cost of a specific decimation
implementation in the real-world application being simulated.
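A sketch of this cost model, using the multipliers from the previous section (illustrative, not DTLMod API):

.. code-block:: python

   # Interpolation multipliers match the values stated above.
   INTERP_MULTIPLIER = {None: 1, "linear": 2, "quadratic": 4, "cubic": 8}

   def decimation_cost(n_elements, cost_per_element=1.0, interpolation=None):
       """Simulated cost in flops of decimating n_elements local elements."""
       return INTERP_MULTIPLIER[interpolation] * cost_per_element * n_elements

   # Cubic interpolation makes the same operation 8x more expensive.
   print(decimation_cost(640 ** 3))                         # 262144000.0
   print(decimation_cost(640 ** 3, interpolation="cubic"))  # 2097152000.0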
Re-parameterization
-------------------

A decimation operation can be re-parameterized between transactions. For instance, you can change the stride, the
interpolation method, or the cost per element of a variable that has already been decimated. This enables the
simulation of adaptive workflows in which the level of subsampling changes over time in response to features detected
in the data.

When re-parameterizing, only the parameters that are explicitly provided are updated; the others retain their
previous values.

doc/source/Reduction.rst

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
.. Copyright 2025-2026

.. _Reduction:

Data Reduction Operations
=========================
Scientific simulations and in situ processing workflows produce ever-increasing volumes of data. Even with fast
networks and storage systems, the sheer amount of data transported through the DTL can become a bottleneck.
**Data reduction** techniques alleviate this pressure by decreasing the volume of data that must be moved between
publishers and subscribers, at the cost of some additional computation and, depending on the method, a controlled
loss of information.

DTLMod allows you to study the impact of data reduction on the performance of in situ workflows by attaching a
**reduction method** to a :ref:`Concept_Stream` and then applying it, with specific parameters, to individual
:ref:`Concept_Variable` objects. When a publisher puts a reduced variable into the DTL, the simulation accounts for
the computational cost of the reduction operation and transports a smaller volume of data. On the subscriber side,
the simulation may account for a corresponding decompression or reconstruction cost when retrieving the variable.

DTLMod currently exposes two families of reduction methods:

**Decimation** selectively retains a subset of elements from a multidimensional array by applying a per-dimension
stride. The result is a smaller array whose shape reflects the subsampling factor in each dimension. This approach
is common in visualization pipelines where only every *n*-th data point is needed. Since decimation preserves the
original values of the retained elements, it is by nature a lossless operation on the selected subset. Optionally,
an interpolation step can be used to reconstruct missing values. More details are given in the
:ref:`Decimation` section.

**Compression** reduces the byte size of a variable without altering its shape. The compressed variable retains the
same number of elements but each element occupies fewer bytes, according to a **compression ratio** that can be
specified directly or derived from a compressor model. DTLMod provides built-in models inspired by the SZ and ZFP
lossy compressors to derive realistic compression ratios from data characteristics. More details are given in the
:ref:`Compression` section.
Where and when reduction is applied
-----------------------------------

Reduction methods can be applied on either side of the data flow:

- **Publisher-side reduction** is the most common scenario. The publisher compresses or decimates data before putting
  it into the DTL, reducing the volume of data that has to be transported and stored. Both decimation and compression
  support this mode.

- **Subscriber-side reduction** is only available for decimation. A subscriber can choose to retrieve a subsampled
  version of a variable it receives from the DTL, reducing the volume of data it has to process. Compression on the
  subscriber side is not supported because its purpose is precisely to reduce what has to be transported, which
  requires intervention before the data leaves the publisher.

When a publisher applies a reduction, the information is propagated to subscribers: any variable obtained through
``inquire_variable`` on the subscriber side carries the reduction state set by the publisher. This allows DTLMod to
prevent conflicting reduction operations. In particular, a subscriber cannot apply a second reduction to a variable
that was already reduced by its publisher.
Simulated costs
---------------

A reduction operation in DTLMod introduces two potential costs:

1. A **reduction cost** (in simulated floating-point operations) is incurred by the actor that applies the reduction,
   right before it puts or gets the variable. This cost models the computational overhead of running a compressor or
   a decimation kernel.

2. A **decompression cost** (for compression only) is incurred on the subscriber side after it receives compressed
   data. This cost models the time needed to decompress the data before it can be used by the analysis component.

These costs are fully configurable through the parameters of each reduction method, enabling you to explore tradeoffs
between data movement savings and computational overhead for different reduction strategies.

doc/source/app_API.rst

Lines changed: 23 additions & 0 deletions
@@ -130,6 +130,29 @@ selecting the :ref:`Concept_Transport` **method** of the Stream to either ``Tran
:ref:`Inside_staging_engine` section of the documentation.


.. |Concept_Reduction| replace:: **Reduction**
.. _Concept_Reduction:

Data Reduction
^^^^^^^^^^^^^^

In situ workflows that produce large volumes of data can benefit from **data reduction** to decrease the amount of
data transported through the DTL. DTLMod exposes reduction as an optional operation that can be applied to individual
|Concept_Variable|_ objects within a |Concept_Stream|_.

A reduction method is first created on a |Concept_Stream|_ by specifying its type (``"decimation"`` or
``"compression"``). It is then applied to a |Concept_Variable|_ with a set of parameters that control the reduction
behavior---for instance, a stride vector for decimation or a compression ratio and compressor profile for compression.

When a publisher puts a reduced variable into the DTL, the simulation accounts for the computational overhead of the
reduction and transports only the reduced volume. On the subscriber side, a decompression cost may be incurred when
the data is retrieved. The reduction state of a variable is automatically propagated to subscribers: when a subscriber
inquires a variable that has been reduced by its publisher, this information is preserved and prevents conflicting
double reductions.

A complete description of the reduction mechanisms, their parameters, and their internal cost models is given in the
:ref:`Reduction` section.

.. |Concept_Variable| replace:: **Variable**
.. _Concept_Variable:

doc/source/index.rst

Lines changed: 4 additions & 1 deletion
@@ -47,7 +47,10 @@ effects of resource allocation strategies.
   Engines <Engines.rst>
   Inside the File engine <Inside_File_engine.rst>
   Inside the Staging engine <Inside_Staging_engine.rst>
   Data Reduction Operations <Reduction.rst>
   Decimation <Decimation.rst>
   Compression <Compression.rst>

.. Cheat Sheet on the sublevels
..
.. # with overline, for parts
