bigWig coverage tracks#114
Conversation
Code review: bigWig coverage tracks (# 114)Reviewed the full diff. This is a clean, well-documented addition that slots the coverage accumulation into the existing single-pass BAM loop and matches Below are the things worth addressing before merge, ordered by impact. 🔴 Memory: dense per-base arrays held for the whole genome
By contrast Options worth considering:
At minimum, please document the expected memory footprint. 🟠 Hot-loop allocation per readIn if settings.split {
let mut blocks_buf = Vec::new();
genomecov_blocks(record, &mut blocks_buf);
...
}A fresh 🟠 Hot-loop linear search by chromosome name
self.chroms.iter_mut().find(|(name, _)| name == chrom)This runs per read per track. The caller already has the 🟡 Minor
✅ Things I checked and are fine
❓ Worth confirmingThe if settings.deduplicate_strand
&& record.is_paired()
&& !record.is_mate_unmapped()
&& record.flags() & BAM_FREAD2 != 0bedtools' Test coverageGood: two integration tests cross-check combined and stranded-forward against the real bedtools + UCSC chain, plus unit tests for Overall: solid, faithful implementation with strong validation. The memory profile is the one item I'd want resolved (or at least explicitly documented) before this runs on real genomes; the two hot-loop items are straightforward wins given RustQC's performance focus. |
Review follow-upPushed three commits addressing the review ( 🔴 Memory — fixedThe dense whole-genome difference array is gone. The accumulator now does a streaming sweep: since each chromosome's reads arrive coordinate-sorted, it keeps only a sparse map of pending coverage deltas plus a cursor, emits finished bedGraph intervals as the cursor advances, and drops the deltas below it. Peak memory is bounded by the widest single-read span (introns included) — a few KB per track — instead of chromosome length × depth × thread count. The ~25 GB (combined) / ~75 GB (stranded) footprint on a human genome is eliminated and no longer scales with genome size or core count. Output is unchanged, validated bit-for-bit against native bedtools genomecov v2.30 + UCSC bedClip/bedGraphToBigWig/bigWigToBedGraph: 738/738 intervals identical for both the combined ( 🟠 Per-read allocation — fixedSplit-read blocks now use a single reusable scratch buffer instead of a fresh 🟠 Per-read linear chromosome scan — fixedEliminated entirely. The streaming design accumulates one active chromosome at a time, so there is no per-read name lookup at all. 🟡 Redundant BAM re-open — fixedChromosome sizes are carried through 🟡 Dead non-split branch — fixedRemoved; all nf-core/rnaseq tracks are 🟡
|
Test coverage updateClosed out the remaining test gaps from the review follow-up:
Commits |
Implement bedtools genomecov-equivalent coverage accumulation in the
existing BAM streaming pass and write bigWig files via bigtools.
Produces nf-core/rnaseq-compatible outputs:
- {sample}.bigWig (combined, -split -bg)
- {sample}.forward/reverse.bigWig for stranded libraries (-split -du -strand)
Validated against bedtools genomecov + UCSC bedClip/bedGraphToBigWig in
integration tests.
Co-authored-by: Phil Ewels <phil.ewels@seqera.io>
Add a dedicated bigWig docs page and update README, AGENTS.md, CONTRIBUTING.md, getting-started guides, credits, CLI/config references, benchmark details, and library doc comments to describe nf-core/rnaseq- compatible genome coverage track output. Co-authored-by: Phil Ewels <phil.ewels@seqera.io>
Co-authored-by: Phil Ewels <phil.ewels@seqera.io>
Address review feedback on the bigWig coverage tracks: - Reuse a single scratch buffer for split-read alignment blocks instead of allocating a Vec per record per track in the hot loop. - Index each track's chromosomes by name (HashMap) for O(1) lookup per read, replacing the previous linear name scan. - Carry BAM header-ordered chromosome sizes through GenomeCovResult so the output step no longer re-opens the BAM purely to read the header. - Drop the unreachable non-split code path (all nf-core/rnaseq tracks use -split) and document clip_bedgraph as defensive. - Tidy a clippy manual-find warning in the bigWig integration helper. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… size The difference-array accumulator allocated one counter per genome base and kept every chromosome resident, costing ~25 GB (combined) to ~75 GB (stranded) on a human genome. Replace it with a streaming sweep that exploits the coordinate-sorted BAM: keep only a sparse map of pending coverage deltas plus a cursor, emit finished bedGraph intervals as the cursor advances, and drop deltas below it. Peak memory is now bounded by the widest single-read span (introns included), independent of chromosome length, sequencing depth, and thread count. Output is unchanged: validated bit-for-bit against bedtools genomecov v2.30 + UCSC bedClip/bedGraphToBigWig/bigWigToBedGraph (738/738 intervals identical for both combined `-split` and stranded `-split -du -strand +`), plus a randomised brute-force equivalence test covering overlaps, nesting, clamping, and net-zero interval merging. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The streaming accumulator already clamps interval ends to each chromosome's length and never emits a chromosome absent from chrom_sizes, so clip_bedgraph was a pure pass-through. Remove it and write the bedGraph intervals directly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Builds a synthetic BAM exercising the full matrix of cases that affect genomecov coverage and strand assignment, and compares the combined, forward, and reverse tracks against bedtools genomecov + UCSC tools at both 1 and 4 threads (sequential and parallel paths): - read-1 vs read-2 strand flipping under -du - read-2 with an *unmapped* mate (not flipped — bedtools only flips when the mate is mapped; this pins down the previously-unverified guard) - read-1 mate-unmapped and single-end reads (no flip) - spliced (N) and deletion (D) CIGAR splitting - a read overhanging the chromosome end (clamping) - secondary, duplicate, QC-fail, supplementary (counted); unmapped (ignored) - a second chromosome (parallel per-chromosome merge) Verified end-to-end against bedtools v2.30 + UCSC bedClip/bedGraphToBigWig/ bigWigToBedGraph; skips cleanly when those tools are absent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Tool-independent test using a forward-only BAM: an unstranded run produces only the combined track, and a --stranded forward run populates the forward track while skipping the empty reverse track (no stub file written). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
9a7485b to
44a6a89
Compare
|
Rebased this branch onto the corrected This branch had Force-pushed, so please re-pull if you have a local copy. The noodles work itself lives in #118. |
|
Corresponding PRs for testing / benchmarks:
No manual effort done here, so still a way to go with validation work. |
Summary
Implements genome-wide bigWig coverage tracks in the existing
rustqc rnaBAM streaming pass, matching nf-core/rnaseq parameters (bedtools genomecov -split, optional-du/-strand, UCSCbedClipsemantics).Closes #112.
Changes
src/rna/bigwig/module: coverage accumulation, bedGraph clipping, bigWig writing viabigtoolscount_reads()read loop andmain.rsoutput underbigwig/rna.bigwig.enabled(defaulttrue)Output files
{sample}.bigWig— combined track (all libraries){sample}.forward.bigWig/{sample}.reverse.bigWig— per-strand tracks (stranded libraries only)Validation
genomecovv2.31.1 exactly (738/738 intervals in integration tests).bigWigfiles are not bit-identical to UCSCbedGraphToBigWig(different encoder), but decode to the same coverageFollow-ups
bedGraphToBigWigsubprocess for bit-identical binaries if required