Merge MZMinetoMSstatsFormat converter into development#132
Conversation
|
Warning Review limit reached
More reviews will be available in 12 minutes and 38 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. 📝 WalkthroughWalkthroughAdds MZMine metabolomics support: new S4 class ( ChangesMZMine Metabolomics Converter
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Pre-existing roxygen2 issue surfaced during this work. When re-documenting the package with roxygen2 7.3.3, the build fails with "In topic 'MSstatsClean': @inherits failed to find topic '.cleanRawProteinProspector'." This is unrelated to MZMine. Somewhere in the package an |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
inst/tinytest/test_converters_MZMinetoMSstatsFormat.R (1)
98-100: ⚡ Quick winStrengthen fallback coverage assertion to detect missing mz_rt IDs.
On Lines 98-100,
all(ProteinName %in% expected_mz_rt)only checks “no unexpected values”; it does not guarantee every expected fallback appears. Use set equality on unique values to avoid false positives.Proposed test assertion hardening
expected_mz_rt = c("123.056_1.23", "245.129_3.45", "367.201_5.67", "489.334_7.89", "555.447_9.1", "123.056_1.45") -expect_true(all(as.character(output_nolib_dt$ProteinName) %in% expected_mz_rt)) +expect_equal( + sort(unique(as.character(output_nolib_dt$ProteinName))), + sort(expected_mz_rt) +)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@inst/tinytest/test_converters_MZMinetoMSstatsFormat.R` around lines 98 - 100, The test currently only asserts that there are no unexpected ProteinName values using expected_mz_rt and output_nolib_dt$ProteinName, which can miss missing expected mz_rt IDs; change the assertion to compare the sets (or sorted uniques) so it verifies every expected fallback is present and no extras — e.g., replace the all(... %in% ...) check with a set equality check between unique(as.character(output_nolib_dt$ProteinName)) and expected_mz_rt (or sorted variants) to ensure exact match.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@R/clean_MZMine.R`:
- Around line 55-56: The code sorts annotations using setorder(ann, id, -score)
but assumes score is numeric; coerce ann$score to numeric before ordering to
avoid mis-ranking when score is character/factor. Update the logic around the
ann data.table (before setorder) to convert score (e.g., ann[, score :=
as.numeric(score)]) and handle potential NAs (optionally warn if coercion
produces NA) so setorder and the subsequent unique(ann, by = "id") produce
correct ann_top results.
In `@R/converters_MZMinetoMSstatsFormat.R`:
- Around line 8-10: Update the parameter docs for annotation to state that the
`Run` column must match the standardized sample/run names produced by the
converter (i.e., after the function's column-name cleaning which strips the
trailing " Peak area" and normalizes spaces/dots), because the code (look for
the converter function that accepts `annotation` and the internal name-cleaning
logic that normalizes sample column names) uses those cleaned names when
joining; tell users to either provide `Run` values in that standardized form or
to run the same name-cleaning routine on their annotation before passing it in.
- Around line 54-55: The wrapper call to MSstatsImport in
converters_MZMinetoMSstatsFormat.R ignores the variadic arguments (...)
advertised for passing to data.table::fread; update the call where input is
assigned (the MSstatsImport invocation) to forward ... into MSstatsImport so
caller-supplied fread options are preserved, i.e., include ... in the
MSstatsImport argument list when calling MSstatsImport(list(input = input),
"MSstats", "MZMine") so the fread passthrough works as documented.
---
Nitpick comments:
In `@inst/tinytest/test_converters_MZMinetoMSstatsFormat.R`:
- Around line 98-100: The test currently only asserts that there are no
unexpected ProteinName values using expected_mz_rt and
output_nolib_dt$ProteinName, which can miss missing expected mz_rt IDs; change
the assertion to compare the sets (or sorted uniques) so it verifies every
expected fallback is present and no extras — e.g., replace the all(... %in% ...)
check with a set equality check between
unique(as.character(output_nolib_dt$ProteinName)) and expected_mz_rt (or sorted
variants) to ensure exact match.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 60c2c010-0294-4a60-b8ea-1991a39f0d27
⛔ Files ignored due to path filters (3)
inst/tinytest/raw_data/MZMine/annotation.csvis excluded by!**/*.csvinst/tinytest/raw_data/MZMine/mzmine_annotations.csvis excluded by!**/*.csvinst/tinytest/raw_data/MZMine/mzmine_input.csvis excluded by!**/*.csv
📒 Files selected for processing (13)
.Rbuildignore.gitignoreDESCRIPTIONNAMESPACER/MSstatsConvert_core_functions.RR/clean_MZMine.RR/converters_MZMinetoMSstatsFormat.Rinst/tinytest/test_converters_MZMinetoMSstatsFormat.Rman/MSstatsClean.Rdman/MSstatsInputFiles.Rdman/MZMinetoMSstatsFormat.Rdman/dot-cleanRawMZMine.Rdvignettes/msstats_data_format.Rmd
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@R/clean_MZMine.R`:
- Around line 55-57: The score column is being coerced with as.numeric directly
(ann[, score := suppressWarnings(as.numeric(score))]) which will return factor
internal codes if score is a factor; change the coercion to go through character
first (i.e., suppressWarnings(as.numeric(as.character(score)))) so factor levels
are converted correctly, keep the anyNA(ann$score) check and the existing stop
message unchanged to validate numeric coercion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 95a37b9b-01a2-4648-b3c5-3aa94dcfecbf
📒 Files selected for processing (4)
R/clean_MZMine.RR/converters_MZMinetoMSstatsFormat.Rinst/tinytest/test_converters_MZMinetoMSstatsFormat.Rman/MZMinetoMSstatsFormat.Rd
✅ Files skipped from review due to trivial changes (1)
- man/MZMinetoMSstatsFormat.Rd
End-to-end test:MZMinetoMSstatsFormat against the biologist's Same input file PART 1: ROW KEY COMPARISONTheir (feature, run) keys: 59087 -> Every (feature, run) pair matches. No rows dropped, no extra rows. PART 2: INTENSITY COMPARISONstatus N Zero value mismatches. -> All peak-area values agree. The 5,596 both-NA rows are where the PART 3: COMPOUND NAME COMPARISONTheir annotated features (any source): 1796 Their library-annotated features: 69 In both: 69 Name mismatches on the 69 shared library-annotated features: 0 Note on the 1,679 features named by SIRIUS in their pipeline: |
Brings metabolomics into the MSstats family by adding an MZMine converter that mirrors the structure of DIANNtoMSstatsFormat. Phase 1 of a two-phase task; Phase 2 (MSstatsShiny BIO=Metabolomics) will be a separate PR.
- Let MSstatsPreprocess fill IsotopeLabelType - Hardcode IsotopeLabelType in converter - Remove redundant inherited @params - Simplify score coercion - Improve non-numeric score error - Rename `ann` → feature_to_compound - Rename melt variable to Run - Refactor compound-name assignment with explicit data.table join
40a8d8a to
53fed81
Compare
- Error on NULL/missing mzmine_annotations - Drop quant-only unmatched features (no mz_rt fallback) - Log retained feature IDs after join - Update tests, vignette, and roxygen for filtering + MSI Level 2 scope - Remove unused .cleanRawMZMine metadata requirements removeProtein_with1Feature default unchanged (FALSE).
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
R/clean_MZMine.R (1)
54-59:⚠️ Potential issue | 🟠 Major | ⚡ Quick winCoerce
scorethrough character before numeric ranking.At Line 54,
as.numeric(score)can mis-rank factor-typed values and select the wrong top annotation at Line 58/59.Proposed fix
- feature_to_compound[, score := suppressWarnings(as.numeric(score))] + feature_to_compound[, score := suppressWarnings(as.numeric(as.character(score)))]#!/bin/bash # Verify the current coercion pattern in the cleaner rg -n -C2 'score := suppressWarnings\(as.numeric\(score\)\)' R/clean_MZMine.R # Expected: one hit showing direct as.numeric(score) coercion.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@R/clean_MZMine.R` around lines 54 - 59, The score column coercion can mis-rank factor values; update the transformation of feature_to_compound$score to first convert to character and then to numeric (i.e., use as.numeric(as.character(score)) or equivalent) before calling setorder on id and -score and before unique-ing by id so that ranking uses correct numeric values; modify the line where score is assigned (currently using suppressWarnings(as.numeric(score))) to perform the two-step coercion and preserve the suppressWarnings behavior if desired.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@R/clean_MZMine.R`:
- Around line 68-73: The retained features log currently uses
feature_to_compound$id which includes annotation IDs not present in the quant
table; change retained_ids to the IDs actually present in the joined quant rows
(i.e., extract unique IDs from the quant/annotation join result used in this
function) and use that list when building retained_msg and calling
getOption("MSstatsLog") / getOption("MSstatsMsg"); ensure you reference the
joined result variable (the object the rest of clean_MZMine.R uses for quant
rows) rather than feature_to_compound$id so the count and comma-separated list
reflect truly retained features.
---
Duplicate comments:
In `@R/clean_MZMine.R`:
- Around line 54-59: The score column coercion can mis-rank factor values;
update the transformation of feature_to_compound$score to first convert to
character and then to numeric (i.e., use as.numeric(as.character(score)) or
equivalent) before calling setorder on id and -score and before unique-ing by id
so that ranking uses correct numeric values; modify the line where score is
assigned (currently using suppressWarnings(as.numeric(score))) to perform the
two-step coercion and preserve the suppressWarnings behavior if desired.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5407ceaa-703b-40e4-ac3a-3ce842c3bf78
⛔ Files ignored due to path filters (3)
inst/tinytest/raw_data/MZMine/annotation.csvis excluded by!**/*.csvinst/tinytest/raw_data/MZMine/mzmine_annotations.csvis excluded by!**/*.csvinst/tinytest/raw_data/MZMine/mzmine_input.csvis excluded by!**/*.csv
📒 Files selected for processing (13)
.Rbuildignore.gitignoreDESCRIPTIONNAMESPACER/MSstatsConvert_core_functions.RR/clean_MZMine.RR/converters_MZMinetoMSstatsFormat.Rinst/tinytest/test_converters_MZMinetoMSstatsFormat.Rman/MSstatsClean.Rdman/MSstatsInputFiles.Rdman/MZMinetoMSstatsFormat.Rdman/dot-cleanRawMZMine.Rdvignettes/msstats_data_format.Rmd
✅ Files skipped from review due to trivial changes (3)
- .gitignore
- NAMESPACE
- man/MSstatsInputFiles.Rd
🚧 Files skipped from review as they are similar to previous changes (2)
- man/MZMinetoMSstatsFormat.Rd
- DESCRIPTION
Summary
Adds
MZMinetoMSstatsFormat, a new converter for MZMine metabolomics output, mirroring the structure and conventions ofDIANNtoMSstatsFormat. This is Phase 1 of a cross-package task to add metabolomics support to the MSstats family; Phase 2 (theBIO = "Metabolomics"value in MSstatsShiny) will be a separate PR in Vitek-Lab/MSstatsShiny that depends on this one.What's added
R/clean_MZMine.R: internal.cleanRawMZMine()that handles the wide-to-long pivot, column normalization, and the optional compound-name join.R/converters_MZMinetoMSstatsFormat.R— public converter. Signature mirrorsDIANNtoMSstatsFormatwith metabolomics-appropriate defaults (remove_shared_peptides = FALSE, no decoy/oxidation filters,IsotopeLabelType = "Light").R/MSstatsConvert_core_functions.R—MSstatsMZMineFilesclass registration +MSstatsCleanmethod dispatch. Dispatch is name-based in this package, so no other registration site needs editing.inst/tinytest/raw_data/MZMine/: a wide-format quant CSV (6 features × 4 samples), spectral-library annotations CSV, and run-annotation CSV.inst/tinytest/test_converters_MZMinetoMSstatsFormat.R: 38 assertions covering happy path, IsotopeLabelType, NA charge/fragment columns, highest-score-wins annotation policy, mz_rt fallback for unannotated features, andremoveProtein_with1Featurebehavior.vignettes/msstats_data_format.Rmd.Column mapping
ProteinNamemzmine_annotations(highest-scoring hit per feature), or mz_rt fallbackPeptideSequencerow IDfrom quant filePrecursorChargeNAFragmentIonNAProductChargeNAIsotopeLabelType"Light"(matches DIANN's non-SILAC convention)RunPeak areastrippedIntensityDesign choices worth noting
Optional
mzmine_annotationsparameter. Most features in real MZMine datasets have no spectral-library annotation (in the example dataset I tested with, 69 of 2,569 features were annotated — about 3%). To make output biologically usable,MZMinetoMSstatsFormataccepts an optionalmzmine_annotationsdata frame that joins compound names by highest score per feature. WhenNULL, every feature falls back to a mz_rt string (paste0(round(mz, 4), "_", round(rt, 2))). This is the simple spectral-library join only - not the full multi-source SIRIUS/MS2Query/CANOPUS consensus workflow some labs use.Verification
tinytest::test_all("."): 732/732 pass, including the 38 new MZMine assertions.R CMD checkon MSstatsConvert: clean. One new NOTE (MZMinetoMSstatsFormat: no visible binding for global variable 'Intensity'); same pattern as DIANN/Skyline/Spectronaut.R CMD checkon../MSstatsshows no MZMine-related findings (pre-existing WARNINGs/NOTEs only). I wasn't able to run cross-checks onMSstatsTMT,MSstatsPTM, orMSstatsLiPbecause they're not cloned in my immediate workspace - flagging in case you want to verify those before merge.Heads-up for review
Column-name standardization affects user-facing
Runvalues..standardizeColnamesstrips spaces and dots from column names, so the per-sample column"sampleA.mzML Peak area"becomessampleAmzMLPeakareainternally and the resultingRunvalue (after thePeakareasuffix strip) issampleAmzML. This means a user's annotation file must useRun = sampleAmzML, notsampleA.mzMLorsampleA. This is consistent with how every MSstatsConvert converterbehaves — but MZMine's column names are uglier than the others, so the transformation is more visible. Worth deciding whether docs should call this out more loudly, the converter should normalize the user's
Runvalues internally, or it's fine as-is.Closes Phase 1 of the metabolomics integration. Phase 2 (MSstatsShiny
BIO = "Metabolomics") will follow in a separate PR.Motivation and context — short summary
Adds first-class support for untargeted metabolomics MZMine exports to MSstatsConvert. The PR provides a converter and cleaning pipeline that pivots MZMine wide-format " Peak area" tables into MSstats long format, applies metabolomics-appropriate defaults (IsotopeLabelType = "Light", NA for charge/fragment fields), and requires a spectral-library annotations table to derive ProteinName (highest-score wins). Unlike an earlier design note, the implementation requires mzmine_annotations and drops features without a matching annotation (no mz_rt fallback).
Detailed changes
Unit tests added/modified
Coding guidelines / notable issues
@inheritParamsreferences a function that does not produce a discoverable topic). roxygen2 v7.3.3 fails while v8 tolerates it. To keep the PR scoped, document() was not re-run and the author edited DESCRIPTION; they recommend a separate housekeeping PR to fix the referenced ProteinProspector documentation so roxygen2 v7-compatible docs can be generated.