Skip to content

Implement true lossless pipeline with operation replay (#196)#250

Open
christian-oreilly wants to merge 10 commits intomainfrom
196-true-lossless
Open

Implement true lossless pipeline with operation replay (#196)#250
christian-oreilly wants to merge 10 commits intomainfrom
196-true-lossless

Conversation

@christian-oreilly
Copy link
Copy Markdown
Member

@christian-oreilly christian-oreilly commented Jan 28, 2026

Fix #196: Implement True Lossless Pipeline with Operation Replay

Summary

This PR implements a truly lossless pipeline by:

  1. Saving original unmodified data instead of preprocessed data
  2. Tracking all operations (preprocessing + artifact detection) in execution order
  3. Replaying operations via RejectionPolicy to ensure reproducibility

Closes #196

Problem Statement

Currently, the pipeline applies preprocessing transformations (filtering, re-referencing) to self.raw before saving via mne_bids.write_raw_bids(). This makes the saved EDF files numerically different from the original BIDS data, violating the "lossless" philosophy.

Critical Issue: The pipeline interleaves preprocessing and artifact detection operations, and these operations have dependencies. For example:

  • Filtering enables good ICA decomposition
  • Channel flagging determines which channels to exclude from re-referencing
  • Re-referencing affects subsequent ICA results

The order matters for reproducibility!

Solution Overview

Key Architectural Changes

  1. Store Original Data

    self.raw_original = raw.copy()  # Before ANY modifications
    self.raw = raw.copy()           # Working copy for processing
  2. Track All Operations in Order

    self.operations_log = []  # Captures preprocessing AND artifact detection
  3. Save Original Data

    mne_bids.write_raw_bids(
        self.raw_original,  # NOT self.raw!
        derivatives_path,
        ...
    )
  4. Replay via RejectionPolicy

    def apply(self, pipeline):
        raw = pipeline.raw_original.copy()
        # Replay each operation in order
        for operation in pipeline.operations_log:
            raw = apply_operation(raw, operation)
        return raw

Changes Made

1. pipeline.py

Modified Methods

  • __init__(): Added self.raw_original and self.operations_log
  • run_with_raw(): Store original data before processing
  • save(): Save original data + operation log (not preprocessed data)
  • load(): Load original data + operation log

New Methods

  • _log_operation(): Track each operation with full metadata

Modified Pipeline Execution

All preprocessing operations now call _log_operation():

filter_args = self.config["filtering"]["filter_args"]
self.raw.filter(**filter_args)
self._log_operation(
    operation_type=OperationType.PREPROCESSING,
    operation_name="filter",
    parameters=filter_args
)

2. rejection.py

Modified RejectionPolicy Class

New Attributes:

self.apply_preprocessing = True
self.preprocessing_operations_to_skip = []
self.operation_param_overrides = {}

Modified Methods:

  • apply(): Now replays operations in sequence
  • _load_from_yaml(): Load preprocessing policy configuration

New Methods:

  • _get_final_params(): Get parameters with overrides
  • _get_channels_to_exclude_at_operation(): Handle operation dependencies

3. Configuration Files

New: operations_log.json

Saved during pipeline.save():

{
  "description": "Complete sequential log of all pipeline operations",
  "operations": [
    {
      "operation_id": 0,
      "operation_type": "preprocessing",
      "operation_name": "filter",
      "parameters": {"l_freq": 1.0, "h_freq": 100.0},
      "timestamp": "2026-01-28T10:30:00"
    },
    {
      "operation_id": 1,
      "operation_type": "artifact_flag",
      "operation_name": "flag_noisy_channels",
      "flags": {"noisy_channels": ["E31", "E67"]}
    },
    {
      "operation_id": 2,
      "operation_type": "preprocessing",
      "operation_name": "set_eeg_reference",
      "parameters": {"ref_channels": "average", "exclude": ["E31", "E67"]},
      "metadata": {"depends_on_operation": 1}
    }
  ]
}

Updated: rejection_policy.yaml

Extended with preprocessing section:

# Preprocessing Policy (NEW)
preprocessing:
  apply_preprocessing: true
  operations_to_skip: []  # e.g., ['notch_filter']
  param_overrides: {}     # e.g., {'filter': {'l_freq': 0.5}}

# Artifact Rejection Policy (EXISTING)
channel_rejection:
  flags_to_reject: ['noisy', 'uncorrelated', 'bridged']
  cleaning_mode: null

ica_rejection:
  flags_to_reject: ['muscle', 'ecg', 'eog', 'channel_noise', 'line_noise']
  rejection_threshold: 0.3
  remove_flagged_ics: true

4. File Structure

New derivatives directory structure:

derivatives/pylossless/
├── sub-XX/
│   └── ses-YY/
│       └── eeg/
│           ├── sub-XX_ses-YY_task-ZZ_eeg.edf    # ORIGINAL unmodified data
│           └── ...
├── operations_log.json                          # NEW: All operations
├── rejection_policy.yaml                         # UPDATED: Includes preprocessing
├── qc_preprocessed/                              # NEW: For QC dashboard
│   └── sub-XX/
│       └── ses-YY/
│           └── eeg/
│               ├── sub-XX_ses-YY_task-ZZ_eeg.edf  # Preprocessed version
│               └── ...
└── dataset_description.json

Examples

Basic Usage (No Code Changes for Users!)

# 1. Run pipeline
pipeline = ll.LosslessPipeline('config.yaml')
pipeline.run_with_raw(raw)

# 2. Save (now saves original data!)
pipeline.save('derivatives/pylossless')

# 3. Load and apply policy (same as before, but now truly lossless!)
pipeline = ll.LosslessPipeline.load('derivatives/pylossless')
rejection_policy = ll.read_rejection_policy('rejection_policy.yaml')
cleaned = rejection_policy.apply(pipeline)

Advanced: Custom Preprocessing

# Skip certain preprocessing operations
rejection_policy = ll.RejectionPolicy()
rejection_policy.preprocessing_operations_to_skip = ['notch_filter']
cleaned = rejection_policy.apply(pipeline)

# Override preprocessing parameters
rejection_policy.operation_param_overrides = {
    'filter': {'l_freq': 0.5, 'h_freq': 40.0}
}
cleaned = rejection_policy.apply(pipeline)

# Use original data without any preprocessing
rejection_policy.apply_preprocessing = False
original_with_artifacts_removed = rejection_policy.apply(pipeline)

Benefits

True Losslessness: Original data files are byte-identical to source BIDS
Reproducibility: Operations replayed in exact execution order
Handles Dependencies: Re-referencing correctly excludes flagged channels
Flexibility: Users can customize preprocessing via rejection policy
Backwards Compatible: Existing workflows continue to work
Transparent: Complete audit trail of all operations
Architectural Consistency: Preprocessing integrated with RejectionPolicy pattern

Migration Guide

For End Users

No changes required! Existing code continues to work:

# This code works exactly as before
pipeline = ll.LosslessPipeline.load('derivatives/pylossless')
rejection_policy = ll.read_rejection_policy('rejection_policy.yaml')
cleaned = rejection_policy.apply(pipeline)

The difference is that now the derivatives contain truly lossless data.

For QC Dashboard

Update to load from qc_preprocessed/ directory:

# OLD
raw = mne_bids.read_raw_bids(derivatives_path)

# NEW
qc_path = derivatives_path / "qc_preprocessed"
if qc_path.exists():
    raw = mne_bids.read_raw_bids(qc_path)
else:
    # Fallback for old format
    raw = mne_bids.read_raw_bids(derivatives_path)

For Developers

When adding new preprocessing operations, use _log_operation():

def _apply_new_preprocessing(self):
    """Apply new preprocessing operation."""
    params = self.config["new_preprocessing"]
    self.raw.apply_operation(**params)
    
    # IMPORTANT: Log the operation
    self._log_operation(
        operation_type=OperationType.PREPROCESSING,
        operation_name="new_preprocessing",
        parameters=params
    )

Breaking Changes

None. This PR is fully backwards compatible.

Related Issues

- Save original unmodified data instead of preprocessed data
- Track all operations in execution order with operations_log.json
- Replay operations via RejectionPolicy for reproducibility
- Handle operation dependencies (e.g., re-referencing excludes flagged channels)
- Save preprocessed version to qc_preprocessed/ for dashboard
- Maintain backwards compatibility with legacy format
@christian-oreilly christian-oreilly linked an issue Jan 28, 2026 that may be closed by this pull request
@christian-oreilly
Copy link
Copy Markdown
Member Author

@scott-huberty This PR propose a significant refactoring of PyLossless so I'd like to merge it earlier than later to avoid code divergence. Do you want to review it before I merge?

@lina-usc lina-usc deleted a comment from codecov Bot Feb 3, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 3, 2026

Codecov Report

❌ Patch coverage is 83.41463% with 68 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.01%. Comparing base (3def180) to head (939c3d0).

Files with missing lines Patch % Lines
pylossless/pipeline_aux.py 73.63% 29 Missing ⚠️
pylossless/config/rejection.py 85.16% 27 Missing ⚠️
pylossless/pipeline.py 89.83% 12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #250      +/-   ##
==========================================
+ Coverage   84.50%   85.01%   +0.50%     
==========================================
  Files          26       27       +1     
  Lines        1717     1995     +278     
==========================================
+ Hits         1451     1696     +245     
- Misses        266      299      +33     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@scott-huberty
Copy link
Copy Markdown
Member

Hey @christian-oreilly sorry for the belated response. I didn't have time to review this super closely but tests are passing and looks like you covered your bases! Just 2 things:

  1. I checked out this branch and built the documentation locally and compared the examples/plot_10_run_pipeline.py to the website: https://pylossless.readthedocs.io/en/latest/auto_examples/plot_10_run_pipeline.html . It looks like on this branch the annotations no longer show up on the figure when you do raw.plot. Do you have an idea of what changed?

  2. In this PR or a follow up PR can we adjust one of our tutorials (or add a new one) that demonstrates how users can configure this new approach? e.g. the things you do in the "advanced usage" sections of your PR description.

Comment on lines +162 to +167
# Check if we have operations log (new lossless format)
if hasattr(pipeline, 'operations_log') and len(pipeline.operations_log) > 0:
logger.info("LOSSLESS: Applying rejection policy by replaying"
" operations...")
raw = self._apply_with_replay(pipeline)
else:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You said that this PR does not introduce breaking changes, but this clause will always be True, so users are opted-in to this new approach. So this doesn't change the API for them but the pipeline behavior does change (albeit in a good way, their original data are not filtered).

Should we cut a new release upon merge?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. It is a significant change in itself, and I think we also have quite a few PR since our last release.

@christian-oreilly
Copy link
Copy Markdown
Member Author

Hey @christian-oreilly sorry for the belated response. I didn't have time to review this super closely but tests are passing and looks like you covered your bases! Just 2 things:

  1. I checked out this branch and built the documentation locally and compared the examples/plot_10_run_pipeline.py to the website: https://pylossless.readthedocs.io/en/latest/auto_examples/plot_10_run_pipeline.html . It looks like on this branch the annotations no longer show up on the figure when you do raw.plot. Do you have an idea of what changed?

No idea... I'll investigate this.

  1. In this PR or a follow up PR can we adjust one of our tutorials (or add a new one) that demonstrates how users can configure this new approach? e.g. the things you do in the "advanced usage" sections of your PR description.

Created as #251 to have this on the radar.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

True Lossless

2 participants