Skip to content

bigWig coverage tracks#114

Draft
ewels wants to merge 8 commits into
mainfrom
cursor/bigwig-coverage-tracks-09a4
Draft

bigWig coverage tracks#114
ewels wants to merge 8 commits into
mainfrom
cursor/bigwig-coverage-tracks-09a4

Conversation

@ewels

@ewels ewels commented Jun 16, 2026

Copy link
Copy Markdown
Member

Summary

Implements genome-wide bigWig coverage tracks in the existing rustqc rna BAM streaming pass, matching nf-core/rnaseq parameters (bedtools genomecov -split, optional -du/ -strand, UCSC bedClip semantics).

Closes #112.

Changes

  • New src/rna/bigwig/ module: coverage accumulation, bedGraph clipping, bigWig writing via bigtools
  • Integrated into count_reads() read loop and main.rs output under bigwig/
  • Config toggle: rna.bigwig.enabled (default true)
  • Integration tests compare decoded bedGraph intervals against bedtools + UCSC reference
  • Documentation updates across the docs site, README, AGENTS.md, and related markdown

Output files

  • {sample}.bigWig — combined track (all libraries)
  • {sample}.forward.bigWig / {sample}.reverse.bigWig — per-strand tracks (stranded libraries only)

Validation

  • Decoded coverage intervals match bedtools genomecov v2.31.1 exactly (738/738 intervals in integration tests)
  • Binary .bigWig files are not bit-identical to UCSC bedGraphToBigWig (different encoder), but decode to the same coverage
  • Empty per-strand tracks are skipped (no stub file written)

Follow-ups

  • RustQC-benchmarks nf-test for bigWig cross-check
  • Optional: UCSC bedGraphToBigWig subprocess for bit-identical binaries if required

@ewels

ewels commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

Code review: bigWig coverage tracks (# 114)

Reviewed the full diff. This is a clean, well-documented addition that slots the coverage accumulation into the existing single-pass BAM loop and matches bedtools genomecov semantics carefully. Docs, changelog, citations, and config are all updated consistently. Integration tests cross-check decoded bedGraph against the real bedtools + UCSC toolchain, which is the right bar.

Below are the things worth addressing before merge, ordered by impact.

🔴 Memory: dense per-base arrays held for the whole genome

ChromCoverage allocates a dense Vec<DiffPoint> of length = chromosome size, and DiffPoint is 8 bytes (2 × u32). So each track costs ~8 bytes per genome base.

  • The parallel path allocates per-batch accumulators and then merges them all into one accumulator (ChromResult::merge), so the merged result holds every chromosome's dense array simultaneously.
  • For a human genome (~3.1 Gb) that is ~25 GB for the combined track alone, and ~75 GB for stranded libraries (combined + forward + reverse).

By contrast bedtools genomecov on a coordinate-sorted BAM only keeps the current chromosome's array and flushes when the chromosome changes, so its peak is roughly one chromosome (~1-2 GB). This is a 10-25x memory regression against the tool RustQC is matching, and the headline use case is ~186M reads on a real genome. It risks OOM on anything smaller than the benchmark instance, and the PR description doesn't mention the memory cost.

Options worth considering:

  • Since input is coordinate-sorted, stream per-chromosome: emit bedGraph as each chromosome completes and free its array (closest to bedtools' profile).
  • Or use a sparse breakpoint map (e.g. BTreeMap<pos, delta>) per chromosome. RNA-seq coverage is sparse, so this would be far smaller than dense arrays.

At minimum, please document the expected memory footprint.

🟠 Hot-loop allocation per read

In process_track (src/rna/bigwig/accumulator.rs):

if settings.split {
    let mut blocks_buf = Vec::new();
    genomecov_blocks(record, &mut blocks_buf);
    ...
}

A fresh Vec is allocated on every call, and process_track runs once per track per read (up to 3 tracks for stranded). That is hundreds of millions of heap allocations on a large BAM. genomecov_blocks even starts with blocks.clear(), which signals it was designed to take a reusable scratch buffer, but the caller defeats that by allocating fresh each time. Hoist a single scratch Vec (per worker / per process_read call) and pass it down.

🟠 Hot-loop linear search by chromosome name

TrackAccum::chrom_mut does a linear scan with string comparison:

self.chroms.iter_mut().find(|(name, _)| name == chrom)

This runs per read per track. The caller already has the tid in hand (record.tid()), so indexing the track by tid (a Vec indexed by tid, or a tid→slot map built once) would remove the per-read string compare entirely and align with how the other accumulators key off tid.

🟡 Minor

  • Redundant BAM re-open (main.rs): chrom sizes are obtained by opening a brand-new Reader::from_path(bam_path) purely to read the header, after count_reads already had the header available. Consider threading the sizes out of the count pass instead of re-opening.
  • tokio runtime per track (output.rs): write_bedgraph_bigwig builds a fresh multi-thread runtime for each track (up to 3 per sample). Build it once in write_bigwig_tracks and reuse.
  • Dead non-split branch: every caller sets split: true, so the else branch in process_track (and TrackSettings.split itself) is currently unreachable. Either drop it or add a test that exercises it.
  • clip_bedgraph is largely a no-op: chrom_to_bedgraph already bounds end to cov.size, so clipping to chrom sizes can never trim anything. Fine as defensive code, but worth a comment saying so, since the docs present it as load-bearing bedClip parity.
  • GenomeCovAccum / GenomeCovResult duplication: identical field layout with a From impl. The type split is reasonable for an accumulate-then-finalize boundary, but a one-line note on why they're separate would help.

✅ Things I checked and are fine

  • record.flags() & BAM_FREAD2 != 0 parses correctly: in Rust & binds tighter than != (unlike C), so this is (flags & BAM_FREAD2) != 0.
  • chrom_to_bedgraph difference-array walk matches bedtools ReportChromCoverageBedGraph (traced the unit test by hand: [2,4] inclusive ×2 → (2,5,2), correct half-open output).
  • u32 depth counters won't overflow at realistic coverage.
  • Empty per-strand track skipping is handled and documented.

❓ Worth confirming

The -du strand flip adds !record.is_mate_unmapped() as a guard:

if settings.deduplicate_strand
    && record.is_paired()
    && !record.is_mate_unmapped()
    && record.flags() & BAM_FREAD2 != 0

bedtools' -du flips strand for the second mate based on the second-mate flag; the mate-unmapped guard is an extra condition. The integration tests pass against bedtools on test.bam, so this matches on the test data, but it's worth confirming the test BAM actually contains a second mate whose mate is unmapped, otherwise that edge isn't covered.

Test coverage

Good: two integration tests cross-check combined and stranded-forward against the real bedtools + UCSC chain, plus unit tests for chrom_to_bedgraph and clip_bedgraph. Gaps to consider: a stranded-reverse case (to exercise the swapped strand labels), and an explicit empty-per-strand-track skip assertion.

Overall: solid, faithful implementation with strong validation. The memory profile is the one item I'd want resolved (or at least explicitly documented) before this runs on real genomes; the two hot-loop items are straightforward wins given RustQC's performance focus.

@ewels

ewels commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

Review follow-up

Pushed three commits addressing the review (b0e5a07, 86bbd2d, and the perf fixes before them). Status against each finding:

🔴 Memory — fixed

The dense whole-genome difference array is gone. The accumulator now does a streaming sweep: since each chromosome's reads arrive coordinate-sorted, it keeps only a sparse map of pending coverage deltas plus a cursor, emits finished bedGraph intervals as the cursor advances, and drops the deltas below it. Peak memory is bounded by the widest single-read span (introns included) — a few KB per track — instead of chromosome length × depth × thread count. The ~25 GB (combined) / ~75 GB (stranded) footprint on a human genome is eliminated and no longer scales with genome size or core count.

Output is unchanged, validated bit-for-bit against native bedtools genomecov v2.30 + UCSC bedClip/bedGraphToBigWig/bigWigToBedGraph: 738/738 intervals identical for both the combined (-split) and stranded (-split -du -strand +) tracks. Added a randomised brute-force equivalence test (200 trials) plus targeted cases for overlap, nesting, end-clamping, and net-zero interval merging.

🟠 Per-read allocation — fixed

Split-read blocks now use a single reusable scratch buffer instead of a fresh Vec per record per track.

🟠 Per-read linear chromosome scan — fixed

Eliminated entirely. The streaming design accumulates one active chromosome at a time, so there is no per-read name lookup at all.

🟡 Redundant BAM re-open — fixed

Chromosome sizes are carried through GenomeCovResult (BAM header order) from the count pass; the output step no longer re-opens the BAM.

🟡 Dead non-split branch — fixed

Removed; all nf-core/rnaseq tracks are -split.

🟡 clip_bedgraph redundancy — documented

Kept as a defensive bedClip-parity step with a comment noting the streaming accumulator already clamps interval ends.

-du mate-unmapped guard — validated on test data

The stranded -du -strand + track matches bedtools exactly (738/738) on the test BAM, so the strand-flip logic is correct for that data. The specific mate-unmapped edge isn't independently asserted; worth a dedicated case if a BAM that exercises it is available.

Not changed: tokio runtime per track

Left as-is. bigtools::BigWigWrite::write takes the Runtime by value (consumes it), so reusing one runtime across the up-to-3 tracks isn't possible without a deeper API change. The cost is a few ms at the output stage, not in any hot loop.

Remaining / optional

  • Integration tests for the stranded-reverse track and an explicit empty-per-strand-track skip assertion (the reverse strand filter and skip logic are exercised indirectly but not asserted directly).
  • The upstream cross-check tests still require bedtools + UCSC tools to be present locally/CI; they skip cleanly otherwise.

All 241 tests pass; clippy and fmt are clean.

@ewels

ewels commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

Test coverage update

Closed out the remaining test gaps from the review follow-up:

  • -du mate-unmapped guard — added a synthetic-BAM cross-check (test_bigwig_strand_edge_cases_match_bedtools) that pins this down against bedtools as the ground truth. A read-2 with an unmapped mate stays on the forward track, exactly as bedtools does. The same fixture covers read-1/read-2 flipping, single-end, spliced (N) and deletion (D) CIGARs, end-clamping, secondary/duplicate/QC-fail/supplementary inclusion, unmapped exclusion, and a second chromosome — compared across combined/forward/reverse tracks at 1 and 4 threads (sequential + parallel paths). Verified end-to-end against bedtools v2.30 + UCSC tools; skips cleanly when those aren't present.
  • Stranded-reverse track — now covered by the reverse comparison in that test.
  • Empty per-strand track skipping — added a tool-independent test (test_bigwig_per_strand_track_presence): unstranded → combined track only; --stranded forward on a forward-only BAM → forward written, empty reverse skipped (no stub file).

Commits a479729, 3ce2694. Full suite green, clippy/fmt clean.

@ewels ewels changed the title feat: add nf-core-compatible bigWig coverage tracks bigWig coverage tracks Jun 17, 2026
cursoragent and others added 8 commits June 17, 2026 10:49
Implement bedtools genomecov-equivalent coverage accumulation in the
existing BAM streaming pass and write bigWig files via bigtools.

Produces nf-core/rnaseq-compatible outputs:
- {sample}.bigWig (combined, -split -bg)
- {sample}.forward/reverse.bigWig for stranded libraries (-split -du -strand)

Validated against bedtools genomecov + UCSC bedClip/bedGraphToBigWig in
integration tests.

Co-authored-by: Phil Ewels <phil.ewels@seqera.io>
Add a dedicated bigWig docs page and update README, AGENTS.md,
CONTRIBUTING.md, getting-started guides, credits, CLI/config references,
benchmark details, and library doc comments to describe nf-core/rnaseq-
compatible genome coverage track output.

Co-authored-by: Phil Ewels <phil.ewels@seqera.io>
Co-authored-by: Phil Ewels <phil.ewels@seqera.io>
Address review feedback on the bigWig coverage tracks:

- Reuse a single scratch buffer for split-read alignment blocks instead
  of allocating a Vec per record per track in the hot loop.
- Index each track's chromosomes by name (HashMap) for O(1) lookup per
  read, replacing the previous linear name scan.
- Carry BAM header-ordered chromosome sizes through GenomeCovResult so
  the output step no longer re-opens the BAM purely to read the header.
- Drop the unreachable non-split code path (all nf-core/rnaseq tracks
  use -split) and document clip_bedgraph as defensive.
- Tidy a clippy manual-find warning in the bigWig integration helper.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… size

The difference-array accumulator allocated one counter per genome base and
kept every chromosome resident, costing ~25 GB (combined) to ~75 GB
(stranded) on a human genome. Replace it with a streaming sweep that
exploits the coordinate-sorted BAM: keep only a sparse map of pending
coverage deltas plus a cursor, emit finished bedGraph intervals as the
cursor advances, and drop deltas below it. Peak memory is now bounded by
the widest single-read span (introns included), independent of chromosome
length, sequencing depth, and thread count.

Output is unchanged: validated bit-for-bit against bedtools genomecov
v2.30 + UCSC bedClip/bedGraphToBigWig/bigWigToBedGraph (738/738 intervals
identical for both combined `-split` and stranded `-split -du -strand +`),
plus a randomised brute-force equivalence test covering overlaps, nesting,
clamping, and net-zero interval merging.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The streaming accumulator already clamps interval ends to each chromosome's
length and never emits a chromosome absent from chrom_sizes, so clip_bedgraph
was a pure pass-through. Remove it and write the bedGraph intervals directly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Builds a synthetic BAM exercising the full matrix of cases that affect
genomecov coverage and strand assignment, and compares the combined,
forward, and reverse tracks against bedtools genomecov + UCSC tools at
both 1 and 4 threads (sequential and parallel paths):

- read-1 vs read-2 strand flipping under -du
- read-2 with an *unmapped* mate (not flipped — bedtools only flips when
  the mate is mapped; this pins down the previously-unverified guard)
- read-1 mate-unmapped and single-end reads (no flip)
- spliced (N) and deletion (D) CIGAR splitting
- a read overhanging the chromosome end (clamping)
- secondary, duplicate, QC-fail, supplementary (counted); unmapped (ignored)
- a second chromosome (parallel per-chromosome merge)

Verified end-to-end against bedtools v2.30 + UCSC bedClip/bedGraphToBigWig/
bigWigToBedGraph; skips cleanly when those tools are absent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Tool-independent test using a forward-only BAM: an unstranded run produces
only the combined track, and a --stranded forward run populates the forward
track while skipping the empty reverse track (no stub file written).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ewels ewels force-pushed the cursor/bigwig-coverage-tracks-09a4 branch from 9a7485b to 44a6a89 Compare June 17, 2026 08:53
@ewels

ewels commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

Rebased this branch onto the corrected main.

This branch had main merged into it while the accidental noodles migration (#115) was still on main, so it was carrying that commit. Merging as-is would have silently re-introduced the noodles migration into main. I've rebased the branch onto the current `main` (rust-htslib), dropping the stale #115 and the duplicated #116 commits and keeping only the bigWig coverage work (8 commits). The two perf commits conflicted in src/main.rs and src/rna/bigwig/accumulator.rs (they were authored over the noodles tree); resolved to the htslib-compatible versions. cargo check and cargo check --tests both pass locally.

Force-pushed, so please re-pull if you have a local copy. The noodles work itself lives in #118.

@ewels

ewels commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

Corresponding PRs for testing / benchmarks:

No manual effort done here, so still a way to go with validation work.

@ewels ewels marked this pull request as draft June 17, 2026 13:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature request: Generate coverage tracks

2 participants