Freed heavy fit objects and large dataProcess intermediates early#207
Conversation
* Stripped survreg() $y / $linear.predictors and lm() $model after fitting in .fitSurvival and .fitLinearModel; downstream predict / summary / vcov paths are unaffected. * Rewrote .updateUnequalVariances to mutate a single data.frame in place instead of rebuilding it ~5x per iteration via data.frame(...). * Added rm(raw), rm(peptides_dict), and gc(verbose = FALSE) inside dataProcess() so the ~250-400 MB held by function arguments is freed before MSstatsSummarizationOutput runs. * Added rm(survival_fit) and rm(fit) inside the per-protein linear / TMP summarizers so local fit objects drop as soon as survival / result is materialised. * Tests: inst/tinytest/test_memory_optimization.R Issue 1 (tests 1a, 1b) — 14 assertions, all green. See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md Co-Authored-By: Claude <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughMemory optimizations reduce peak memory usage in the ChangesMemory optimization in data processing and fitting
🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
PR Code Suggestions ✨No code suggestions found for the PR. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@R/utils_summarization.R`:
- Around line 208-223: The loess iterations reuse stale fitted values because
input[["fitted"]] is only set from the initial fit; after each weighted refit
(wls.fit) update the stored fitted values so subsequent loess calls use
consistent residual/fitted pairs — e.g. inside the loop after computing wls.fit
(the variable wls.fit) assign input[["fitted"]] <- fitted(wls.fit) or
input[["fitted"]] <- wls.fit$fitted.values before recalculating
input[["abs.resids"]] and proceeding to the next iteration.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7d6992f3-0ec8-4947-80c1-53b71f9e9caa
📒 Files selected for processing (4)
R/dataProcess.RR/utils_imputation.RR/utils_summarization.Rinst/tinytest/test_memory_optimization_fits.R
| for (i in seq_len(num_iter)) { | ||
| if (i == 1) { | ||
| abs.resids = data.frame(abs.resids = abs(fit$residuals)) | ||
| fitted = data.frame(fitted = fit$fitted.values) | ||
| input = data.frame(input, | ||
| "abs.resids" = abs.resids, | ||
| "fitted" = fitted) | ||
| input[["abs.resids"]] = abs(fit$residuals) | ||
| input[["fitted"]] = fit$fitted.values | ||
| } | ||
| fit.loess = loess(abs.resids ~ fitted, data = input) | ||
| loess.fitted = data.frame(loess.fitted = fitted(fit.loess)) | ||
| input = data.frame(input, "loess.fitted" = loess.fitted) | ||
| ## loess fitted valuaes are predicted sd | ||
| input$weight = 1 / (input$loess.fitted ^ 2) | ||
| input = input[, !(colnames(input) %in% "abs.resids")] | ||
| input[["loess.fitted"]] = fitted(fit.loess) | ||
| ## loess fitted values are predicted sd | ||
| input[["weight"]] = 1 / (input[["loess.fitted"]] ^ 2) | ||
| input[["abs.resids"]] = NULL | ||
| ## re-fit using weight | ||
| wls.fit = lm(formula(fit), data = input, weights = weight) | ||
| abs.resids = data.frame(abs.resids = abs(wls.fit$residuals)) | ||
| input = data.frame(input, "abs.resids" = abs.resids) | ||
| input = input[, -which(colnames(input) %in% c("loess.fitted", "weight"))] | ||
| input[["abs.resids"]] = abs(wls.fit$residuals) | ||
| input[["loess.fitted"]] = NULL | ||
| input[["weight"]] = NULL | ||
| } |
There was a problem hiding this comment.
Refresh fitted after each weighted refit.
At Line 220 you refresh abs.resids from wls.fit, but fitted is never refreshed after Line 211. For num_iter > 1, later loess iterations use stale fitted values and inconsistent residual/fitted pairs.
Suggested fix
for (i in seq_len(num_iter)) {
if (i == 1) {
input[["abs.resids"]] = abs(fit$residuals)
input[["fitted"]] = fit$fitted.values
}
fit.loess = loess(abs.resids ~ fitted, data = input)
input[["loess.fitted"]] = fitted(fit.loess)
## loess fitted values are predicted sd
input[["weight"]] = 1 / (input[["loess.fitted"]] ^ 2)
input[["abs.resids"]] = NULL
## re-fit using weight
wls.fit = lm(formula(fit), data = input, weights = weight)
input[["abs.resids"]] = abs(wls.fit$residuals)
+ input[["fitted"]] = wls.fit$fitted.values
input[["loess.fitted"]] = NULL
input[["weight"]] = NULL
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@R/utils_summarization.R` around lines 208 - 223, The loess iterations reuse
stale fitted values because input[["fitted"]] is only set from the initial fit;
after each weighted refit (wls.fit) update the stored fitted values so
subsequent loess calls use consistent residual/fitted pairs — e.g. inside the
loop after computing wls.fit (the variable wls.fit) assign input[["fitted"]] <-
fitted(wls.fit) or input[["fitted"]] <- wls.fit$fitted.values before
recalculating input[["abs.resids"]] and proceeding to the next iteration.
There was a problem hiding this comment.
Will this piece of code be changed in a future PR? It uses data.frame and [[ a lot
There was a problem hiding this comment.
@mstaniak, it looks like the current state of the file already addresses that — the version of .updateUnequalVariances in the branch now uses data.table with := for all in-place operations rather than data.frame and [[:
.updateUnequalVariances = function(input, fit, num_iter) {
weight = abs.resids = loess.fitted = NULL
if (!data.table::is.data.table(input)) {
input = data.table::as.data.table(input)
}
for (i in seq_len(num_iter)) {
if (i == 1) {
input[, abs.resids := abs(fit$residuals)]
input[, fitted := fit$fitted.values]
}
fit.loess = loess(abs.resids ~ fitted, data = input)
input[, loess.fitted := fitted(fit.loess)]
input[, weight := 1 / (loess.fitted ^ 2)]
input[, abs.resids := NULL]
wls.fit = lm(formula(fit), data = input, weights = weight)
input[, abs.resids := abs(wls.fit$residuals)]
input[, loess.fitted := NULL]
input[, weight := NULL]
}
wls.fit
}The diff in the review was anchored to an earlier commit snapshot — the branch has since been updated to the data.table/:= style. My earlier fitted refresh suggestion still applies to this current version, though.
|
Thanks, fantastic update. |
mstaniak
left a comment
There was a problem hiding this comment.
Thanks again for his fantastic update, just one question
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmark/profile_dataprocess_peak.R`:
- Around line 58-60: The required-column validation in
profile_dataprocess_peak.R currently defines required_cols as c("ProteinName",
"PeptideSequence", "PrecursorCharge", "FragmentIon", "ProductCharge",
"IsotopeLabelType", "Run", "BioReplicate", "Condition", "Intensity") which omits
PeptideModifiedSequence expected by the package validity path and risks later
failures; update the required_cols vector (and the analogous check around lines
85-89) to include "PeptideModifiedSequence", and ensure any check/handling for
"AnomalyScores" remains separate (i.e., validate presence of
PeptideModifiedSequence as required while treating AnomalyScores as an
optional/special-case column).
- Around line 143-148: The benchmark is using non-default parameters which makes
results non-comparable to dataProcess(); update the calls to
MSstats:::MSstatsSelectFeatures and MSstats:::MSstatsPrepareForSummarization to
use the same default knobs as dataProcess() rather than the explicit overrides
(featureSubset = "all", n_top_feature = 3, remove_uninformative_feature_outlier
= TRUE), i.e. remove the custom arguments and call the functions with their
defaults (or explicitly pass the same parameter values used by dataProcess()) so
the benchmark reflects the standard dataProcess() baseline.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b06a9a32-3f43-4c52-841e-6e968edfde2d
📒 Files selected for processing (1)
benchmark/profile_dataprocess_peak.R
| required_cols <- c("ProteinName", "PeptideSequence", "PrecursorCharge", | ||
| "FragmentIon", "ProductCharge", "IsotopeLabelType", | ||
| "Run", "BioReplicate", "Condition", "Intensity") |
There was a problem hiding this comment.
Tighten required-column validation to match the pipeline contract.
The script currently accepts a 10-column schema, but the package validity path expects PeptideModifiedSequence too (with AnomalyScores handled separately). This can admit unsupported inputs and fail later in less obvious places.
Suggested patch
-required_cols <- c("ProteinName", "PeptideSequence", "PrecursorCharge",
+required_cols <- c("ProteinName", "PeptideSequence", "PeptideModifiedSequence", "PrecursorCharge",
"FragmentIon", "ProductCharge", "IsotopeLabelType",
"Run", "BioReplicate", "Condition", "Intensity")Also applies to: 85-89
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@benchmark/profile_dataprocess_peak.R` around lines 58 - 60, The
required-column validation in profile_dataprocess_peak.R currently defines
required_cols as c("ProteinName", "PeptideSequence", "PrecursorCharge",
"FragmentIon", "ProductCharge", "IsotopeLabelType", "Run", "BioReplicate",
"Condition", "Intensity") which omits PeptideModifiedSequence expected by the
package validity path and risks later failures; update the required_cols vector
(and the analogous check around lines 85-89) to include
"PeptideModifiedSequence", and ensure any check/handling for "AnomalyScores"
remains separate (i.e., validate presence of PeptideModifiedSequence as required
while treating AnomalyScores as an optional/special-case column).
|
@Rudhik1904 If you could report your benchmarking results with a large dataset in this PR (i.e. amount of memory freed up, etc.), that would be great for documentation purposes |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
R/utils_summarization.R (1)
219-221:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winRefresh
fittedafter each weighted refit.At Line 219,
abs.residsis updated fromwls.fit, butfittedis not. Fornum_iter > 1, subsequent loess steps use stale fitted values.Suggested fix
wls.fit = lm(formula(fit), data = input, weights = weight) input[, abs.resids := abs(wls.fit$residuals)] + input[, fitted := wls.fit$fitted.values] input[, loess.fitted := NULL] input[, weight := NULL]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@R/utils_summarization.R` around lines 219 - 221, The code updates input[, abs.resids := abs(wls.fit$residuals)] but does not refresh the fitted values, so subsequent iterations (when num_iter > 1) use stale predictions; after computing wls.fit add a line to update the stored fitted values (e.g., set input[, fitted := wls.fit$fitted] or equivalent) so each loess reweighting uses the most recent fitted vector before clearing loess.fitted and weight.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmark/profile_dataprocess_peak.R`:
- Line 8: Replace the garbled header comment on line containing "mark and
elapsed time..." with a clear, readable description; e.g., rewrite the comment
used by the benchmark (the commented header string) to: "Mark elapsed time at
every stage boundary; useful for tracking the running max of `used` after
gc(reset = TRUE) to locate peak memory usage." Ensure you update the same
comment block that contains the malformed text so the benchmark header is
readable and informative.
---
Duplicate comments:
In `@R/utils_summarization.R`:
- Around line 219-221: The code updates input[, abs.resids :=
abs(wls.fit$residuals)] but does not refresh the fitted values, so subsequent
iterations (when num_iter > 1) use stale predictions; after computing wls.fit
add a line to update the stored fitted values (e.g., set input[, fitted :=
wls.fit$fitted] or equivalent) so each loess reweighting uses the most recent
fitted vector before clearing loess.fitted and weight.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: dad8d3f4-a2a6-430c-be4c-8fd22547e10c
📒 Files selected for processing (2)
R/utils_summarization.Rbenchmark/profile_dataprocess_peak.R
|
|
||
| weight = abs.resids = loess.fitted = NULL | ||
| if (data.table::is.data.table(input)) { | ||
| input = data.table::copy(input) |
There was a problem hiding this comment.
Why are we copying the data table here? This step seems unnecessary.
There was a problem hiding this comment.
Was this addressed before merging?
| # renamed proteins to | ||
| # scale up the input | ||
|
|
||
| default_input <- "data/reduce_output_spectronaut_input_qc_test.csv" |
There was a problem hiding this comment.
We should use absolute paths here?
User description
PR Type
Bug fix, Enhancement, Tests
Description
Free large pipeline inputs earlier
Strip heavy fields from fit objects
Avoid repeated copies in variance updates
Add fit memory regression tests
Diagram Walkthrough
File Walkthrough
dataProcess.R
Free pipeline intermediates and fit objects earlyR/dataProcess.R
rawafter initial preparationpeptides_dictafter normalizationgc(verbose = FALSE)before output assemblysurvival_fitandfitsoonerutils_imputation.R
Strip unused `survreg` fields after fittingR/utils_imputation.R
fit$yaftersurvreg()fittingfit$linear.predictorsbefore returningutils_summarization.R
Reduce `lm` footprint and variance-copy churnR/utils_summarization.R
linear_model$modelbefore returninginputtodata.frameonce upfronttest_memory_optimization_fits.R
Add regression tests for memory-stripped fitsinst/tinytest/test_memory_optimization_fits.R
survregfieldslm$modelsummary()andvcov()still workMotivation & Context
This PR addresses memory usage issues in the MSstats package by aggressively freeing large intermediate objects that are no longer needed after their primary use. The
dataProcess()function was identified as holding approximately 250–400 MB of function arguments unnecessarily before proceeding to summarization. Additionally, fit objects from iterative per-protein summarization were accumulating in memory across loop iterations, causing unnecessary memory overhead.Solution Summary
The PR implements memory optimizations across three main areas:
raw,peptides_dict) are explicitly removed and garbage collection is triggered before summarization begins$yand$linear.predictorsfromsurvreg()objects and$modelfromlm()objects—while preserving fields required for downstream operations likepredict(),summary(), andvcov()Detailed Changes
R/dataProcess.R:
rawobject immediately after usepeptides_dictobject immediately after usegc(verbose = FALSE)to trigger garbage collection before summarizationMSstatsSummarizeSingleLinear()andMSstatsSummarizeSingleTMP()per-protein summarizers: explicitly removessurvival_fitandfitobjects immediately after their results have been materialized, reducing retained memory across iterationsR/utils_imputation.R:
.fitSurvival()now properly closes nestedif/elsebranches when handling unlabeled inputs withn_features != 1Landcountdf = FALSEfit$yandfit$linear.predictorsfields from the fitted survival object before returning, while preserving fields required forpredict()and other downstream operationsR/utils_summarization.R:
.fitLinearModel(): reformatted call to.updateUnequalVariances()using expanded named-argument layout.updateUnequalVariances(): refactored to convertinputto adata.tableonce and use in-place:=assignments for intermediate column computation (abs.resids,fitted,loess.fitted,weight), eliminating repeateddata.frame()reconstruction per iteration; intermediate columns are cleared by setting toNULLrather than being dropped and rebuilt.fitLinearModel()now stripsfit$modelfield from the fitted linear model object before returning, preserving$coefficients,$qr,$residuals, and other fields required forsummary(),vcov(), andpredict()Unit Tests
Two test files were added to verify memory optimizations:
inst/tinytest/test_memory_optimization_fits.R (82 lines):
MSstats:::.fitSurvival()returns objects with$yand$linear.predictorsset toNULLwhile preserving$coefficientsand$scalepredict()works correctly on stripped survival fit objectsobject.size()of stripped survival fits is smaller than unstripped equivalentsMSstats:::.fitLinearModel()returns objects with$modelset toNULLwhile preserving$coefficients,$qr, and$residualssummary()andvcov()work correctly on stripped linear fit objectsobject.size()of stripped linear fits is smaller than unstripped equivalentsinst/tinytest/test_memory_optimization.R:
Coding Guidelines
No coding guideline violations were identified. The changes maintain backward compatibility by preserving all fields required for downstream operations (
predict(),summary(),vcov()), only removing fields that hold large data objects ($y,$linear.predictors,$model) that are not needed post-fit. The memory optimization is transparent to users.