Implement true lossless pipeline with operation replay (#196) #250

christian-oreilly wants to merge 10 commits into main from
Conversation
- Save original unmodified data instead of preprocessed data
- Track all operations in execution order with operations_log.json
- Replay operations via RejectionPolicy for reproducibility
- Handle operation dependencies (e.g., re-referencing excludes flagged channels)
- Save preprocessed version to qc_preprocessed/ for dashboard
- Maintain backwards compatibility with legacy format
@scott-huberty This PR proposes a significant refactoring of PyLossless, so I'd like to merge it sooner rather than later to avoid code divergence. Do you want to review it before I merge?
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #250      +/-   ##
==========================================
+ Coverage   84.50%   85.01%   +0.50%
==========================================
  Files          26       27       +1
  Lines        1717     1995     +278
==========================================
+ Hits         1451     1696     +245
- Misses        266      299      +33
```

☔ View full report in Codecov by Sentry.
Hey @christian-oreilly, sorry for the belated response. I didn't have time to review this super closely, but tests are passing and it looks like you covered your bases! Just 2 things:
```python
# Check if we have operations log (new lossless format)
if hasattr(pipeline, 'operations_log') and len(pipeline.operations_log) > 0:
    logger.info("LOSSLESS: Applying rejection policy by replaying"
                " operations...")
    raw = self._apply_with_replay(pipeline)
else:
```
You said that this PR does not introduce breaking changes, but this clause will always be True, so users are opted in to this new approach. So this doesn't change the API for them, but the pipeline behavior does change (albeit in a good way: their original data are not filtered).
Should we cut a new release upon merge?
Agreed. It is a significant change in itself, and I think we also have quite a few PRs since our last release.
No idea... I'll investigate this.
Created as #251 to have this on the radar.
Fix #196: Implement True Lossless Pipeline with Operation Replay
Summary
This PR implements a truly lossless pipeline by:

- saving the original, unmodified data instead of preprocessed data
- tracking all operations in execution order in operations_log.json
- replaying operations via RejectionPolicy for reproducibility
- handling operation dependencies (e.g., re-referencing excludes flagged channels)
Closes #196
Problem Statement
Currently, the pipeline applies preprocessing transformations (filtering, re-referencing) to `self.raw` before saving via `mne_bids.write_raw_bids()`. This makes the saved EDF files numerically different from the original BIDS data, violating the "lossless" philosophy.

Critical Issue: The pipeline interleaves preprocessing and artifact detection operations, and these operations have dependencies. For example, average re-referencing must exclude channels that an earlier step flagged as noisy.
The order matters for reproducibility!
Solution Overview
Key Architectural Changes
1. Store Original Data
2. Track All Operations in Order
3. Save Original Data
4. Replay via RejectionPolicy (sketched below)
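The gist of this design, as a minimal self-contained sketch: `LogAndReplayDemo` and its methods are hypothetical stand-ins that only mirror the `operations_log.json` fields shown later, not the pipeline's actual implementation.

```python
import json
from datetime import datetime


class LogAndReplayDemo:
    """Toy stand-in showing the log-then-replay idea."""

    def __init__(self):
        self.operations_log = []

    def log_operation(self, op_type, name, parameters):
        # One entry per operation, appended in execution order.
        self.operations_log.append({
            "operation_id": len(self.operations_log),
            "operation_type": op_type,
            "operation_name": name,
            "parameters": parameters,
            "timestamp": datetime.now().isoformat(),
        })

    def save_log(self, path):
        # The log is saved alongside the untouched original data.
        with open(path, "w") as f:
            json.dump({"operations": self.operations_log}, f, indent=2)

    def replay(self, raw):
        # Re-apply preprocessing steps, in order, to a copy of the original.
        raw = raw.copy()
        for op in self.operations_log:
            if op["operation_type"] == "preprocessing":
                getattr(raw, op["operation_name"])(**op["parameters"])
        return raw
```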
Changes Made
1. `pipeline.py`

Modified Methods
- `__init__()`: Added `self.raw_original` and `self.operations_log`
- `run_with_raw()`: Store original data before processing
- `save()`: Save original data + operation log (not preprocessed data)
- `load()`: Load original data + operation log

New Methods
- `_log_operation()`: Track each operation with full metadata

Modified Pipeline Execution
All preprocessing operations now call `_log_operation()`:
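The call shape below is a hedged sketch; the keyword names are assumptions mirroring the `operations_log.json` fields shown later.

```python
# Hypothetical example: a preprocessing step that records itself in the
# operations log before the pipeline moves on.
def filter_data(self, l_freq=1.0, h_freq=100.0):
    self.raw.filter(l_freq=l_freq, h_freq=h_freq)
    self._log_operation(
        operation_type="preprocessing",
        operation_name="filter",
        parameters={"l_freq": l_freq, "h_freq": h_freq},
    )
```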
2. `rejection.py`
Modified `RejectionPolicy` Class

New Attributes:
Modified Methods:
- `apply()`: Now replays operations in sequence
- `_load_from_yaml()`: Load preprocessing policy configuration

New Methods:
- `_get_final_params()`: Get parameters with overrides
- `_get_channels_to_exclude_at_operation()`: Handle operation dependencies

3. Configuration Files
New: `operations_log.json`

Saved during `pipeline.save()`:
```json
{
  "description": "Complete sequential log of all pipeline operations",
  "operations": [
    {
      "operation_id": 0,
      "operation_type": "preprocessing",
      "operation_name": "filter",
      "parameters": {"l_freq": 1.0, "h_freq": 100.0},
      "timestamp": "2026-01-28T10:30:00"
    },
    {
      "operation_id": 1,
      "operation_type": "artifact_flag",
      "operation_name": "flag_noisy_channels",
      "flags": {"noisy_channels": ["E31", "E67"]}
    },
    {
      "operation_id": 2,
      "operation_type": "preprocessing",
      "operation_name": "set_eeg_reference",
      "parameters": {"ref_channels": "average", "exclude": ["E31", "E67"]},
      "metadata": {"depends_on_operation": 1}
    }
  ]
}
```

Updated: `rejection_policy.yaml`
Extended with a preprocessing section:
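A plausible sketch of that section, assuming its keys mirror the parameters recorded in `operations_log.json` above:

```yaml
# Hypothetical shape of the new preprocessing section; key names are
# assumptions mirroring the operations_log.json parameters.
preprocessing:
  filter:
    l_freq: 1.0
    h_freq: 100.0
  set_eeg_reference:
    ref_channels: average
```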
4. File Structure
New derivatives directory structure:
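A plausible layout, assuming standard BIDS derivative naming; subject names, file names, and the placement of individual files are illustrative:

```
derivatives/
└── pylossless/
    ├── sub-01/
    │   └── eeg/
    │       └── sub-01_task-rest_eeg.edf   # original, byte-identical data
    ├── operations_log.json                # sequential log of all operations
    └── qc_preprocessed/                   # preprocessed copy for the dashboard
        └── sub-01_task-rest_eeg.fif
```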
Examples
Basic Usage (No Code Changes for Users!)
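A minimal sketch of unchanged user code; the constructor argument and file paths are illustrative assumptions (only `run_with_raw()`, `save()`, and `load()` are touched by this PR):

```python
# Hedged sketch: existing user code keeps working unchanged.
import mne
import pylossless as ll

raw = mne.io.read_raw_edf("sub-01_task-rest_eeg.edf", preload=True)

pipeline = ll.LosslessPipeline("my_config.yaml")  # illustrative config path
pipeline.run_with_raw(raw)   # an untouched copy of the data is kept
pipeline.save("derivatives/pylossless/")  # original data + operations_log.json
```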
Advanced: Custom Preprocessing
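A hedged sketch of customizing replayed preprocessing; the `preprocessing` attribute override is a hypothetical illustration of `_get_final_params()` ("get parameters with overrides") and the extended `rejection_policy.yaml`:

```python
# Hedged sketch: override a replayed preprocessing parameter through the
# rejection policy before applying it.
policy = ll.RejectionPolicy("rejection_policy.yaml")
policy.preprocessing["filter"]["h_freq"] = 80.0  # hypothetical override
cleaned_raw = policy.apply(pipeline)  # replays operations on the original data
```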
Benefits
✅ True Losslessness: Original data files are byte-identical to source BIDS
✅ Reproducibility: Operations replayed in exact execution order
✅ Handles Dependencies: Re-referencing correctly excludes flagged channels
✅ Flexibility: Users can customize preprocessing via rejection policy
✅ Backwards Compatible: Existing workflows continue to work
✅ Transparent: Complete audit trail of all operations
✅ Architectural Consistency: Preprocessing integrated with RejectionPolicy pattern
Migration Guide
For End Users
No changes required! Existing code continues to work, as in the Basic Usage example above.
The difference is that now the derivatives contain truly lossless data.
For QC Dashboard
Update to load from the `qc_preprocessed/` directory:
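A hedged sketch; the file format and naming inside `qc_preprocessed/` are assumptions:

```python
# Hedged sketch: point QC tooling at the preprocessed copy.
from pathlib import Path

import mne

qc_dir = Path("derivatives/pylossless/qc_preprocessed")
raw_qc = mne.io.read_raw(qc_dir / "sub-01_task-rest_eeg.fif", preload=True)
```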
For Developers
When adding new preprocessing operations, use `_log_operation()`:
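A hedged sketch for contributors; the `_log_operation()` keyword names are assumptions mirroring the `operations_log.json` fields above:

```python
# Hypothetical new preprocessing step that records itself in the log.
def notch_filter_step(self, freqs=(50.0, 100.0)):
    self.raw.notch_filter(freqs=list(freqs))
    self._log_operation(
        operation_type="preprocessing",
        operation_name="notch_filter",
        parameters={"freqs": list(freqs)},
    )
```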
Breaking Changes
None. This PR is fully backwards compatible.
Related Issues