A DSL2 Nextflow repository for composing reusable single-cell workflows from small modules and running them on SLURM + Apptainer HPC systems.
Full documentation | Parameters: `nextflow_schema.json` | Docs source: `docs/`
    .
    ├── main.nf              # Thin launcher for saved workflows
    ├── workflows/           # Higher-level reusable workflows
    ├── modules/local/       # Single-step DSL2 modules
    ├── configs/             # Base + profile-specific config
    ├── data/                # Default repo-local input location
    ├── template/            # Copyable per-run launcher scaffold
    ├── outputs/             # Default published results (generated)
    ├── work/                # Nextflow work dir (generated)
    ├── logs/                # Reports and SLURM logs (generated)
    ├── slurm_nextflow.sh    # Repo-root HPC launcher
    └── slurm_sync_repo.sh   # Fast clone / update job for HPC checkouts
| Workflow | Purpose | Compute |
|---|---|---|
| `integration` | INGEST -> EXPORT_COUNTS -> GENE_HARMONIZE -> SCMODAL_INTEGRATE | GPU |
| `ingest_export` | Download Seurat objects and export 10x-like counts only | CPU |
| `ingest_tabulate` | Download metadata only and build subjectIdTable.csv | CPU |
| `nmf_vae` | Ingest, export counts, merge, train NMF-VAE | GPU |
| `gex_mil` | Ingest, export counts, merge, train scVI + attention-MIL | GPU |
| `tcr_mil` | Ingest, quantify TCRs via tcrClustR, train BertTCR MIL | GPU |
| `tcr_epitope` | Ingest, quantify TCRs, embed clones with ESM-2, predict epitope binding | GPU |

Select one with --workflow.
The repo now uses local, predictable defaults:
- Input samplesheet: `./data/samplesheet.csv`
- Published outputs: `./outputs`
- Work directory: `./work`
- Reports/logs: `./logs`

These can still be overridden on the CLI.
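
As a rough illustration of overriding the defaults listed above, the sketch below assembles a launch command with alternative locations and prints it for review. `-work-dir` is a core Nextflow option; `--input` and `--outdir` are assumed parameter names used only for illustration, so check `nextflow_schema.json` for the names this repo actually defines.

```shell
# Defaults mirror the list above; set any of these variables to override them.
SAMPLESHEET="${SAMPLESHEET:-./data/samplesheet.csv}"
OUTDIR="${OUTDIR:-./outputs}"
WORKDIR="${WORKDIR:-./work}"

# -work-dir is a core Nextflow option (single dash).
CMD="nextflow run main.nf -work-dir $WORKDIR --workflow ingest_tabulate"

# --input and --outdir are ASSUMED parameter names; confirm them in nextflow_schema.json.
CMD="$CMD --input $SAMPLESHEET --outdir $OUTDIR"

# Placeholder LabKey values, as used in the other examples in this README.
CMD="$CMD --labkey_base_url https://labkey.example.org --labkey_folder /My/Folder"

# Print the command for review instead of running it.
echo "$CMD"
```

For example, `WORKDIR=/scratch/$USER/work sh override_example.sh` (a hypothetical file name for the snippet) would print the same command with the scratch work directory substituted. Paths containing spaces would need proper quoting before the printed command is run.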
CPU-friendly workflows can be run directly on macOS or Linux without SLURM:
    nextflow run main.nf \
      --workflow ingest_tabulate \
      --labkey_base_url https://labkey.example.org \
      --labkey_folder /My/Folder

    nextflow run main.nf \
      --workflow ingest_export \
      --labkey_base_url https://labkey.example.org \
      --labkey_folder /My/Folder

The `integration` workflow is intentionally blocked on local CPU execution outside GitHub Actions smoke tests because SCMODAL_INTEGRATE needs a GPU-backed SLURM environment.
If needed, you can still force behavior explicitly with -profile local or -profile slurm.
For a light structural check without running the heavy science stack:
    nextflow run main.nf -profile test -stub-run \
      --workflow ingest_export \
      --labkey_base_url https://labkey.example.org \
      --labkey_folder /My/Folder

Recommended: sync the checkout outside the pipeline, not as a Nextflow module. A pipeline should not mutate its own checkout while it is running.

    sbatch slurm_sync_repo.sh

Or target a specific scratch location:

    sbatch --export=ALL,SYNC_TARGET_DIR=/gscratch/mygroup/GoodWorkflows slurm_sync_repo.sh

For SLURM-based runs, the simplest path is usually to copy the template and fill out `run.sh`. The repo-root wrapper is still available when you want repo-relative outputs or a separate pre-pull job.

| Entry point | Best for | Pre-pull behavior |
|---|---|---|
| `runs/<name>/run.sh` | Recommended routine SLURM runs with a dedicated run directory and checked-in samplesheet template | Runs inline inside the same SLURM allocation before `nextflow run` |
| `bash slurm_nextflow.sh ...` | Repo-root launches, automation, or cases where you want image pre-pull isolated first | Submits a standalone pre-pull job before the orchestrator |

    cp -r template runs/my_run_name
    cd runs/my_run_name
    # edit samplesheet.csv and the FILL IN section in run.sh
    sbatch run.sh

`run.sh` does perform a container image pre-pull on SLURM, but it does so inline in the same allocation before Nextflow starts. For CPU-only local workflows, `bash run.sh` skips SLURM pre-pull entirely.

    bash slurm_nextflow.sh \
      --workflow integration \
      --labkey_base_url https://labkey.example.org \
      --labkey_folder /My/Folder

Preferred launch mode: run `bash slurm_nextflow.sh ...` from a login node. In that mode the wrapper submits a standalone Apptainer SIF pre-pull job first, then submits the orchestrator with `--dependency=afterok:<PREPULL_JOB_ID>`. Pre-pull is therefore a hard prerequisite for orchestration.
If you instead use sbatch slurm_nextflow.sh ..., the same pre-pull runs inline inside the orchestrator allocation before Nextflow launches. template/run.sh also uses inline pre-pull.
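
The login-node launch mode described above follows a common SLURM pattern: submit one job, capture its ID, and make the second job depend on it. The sketch below illustrates that pattern only; it is not the contents of `slurm_nextflow.sh`, and `prepull_images.sh` and `orchestrator.sh` are hypothetical script names. It defaults to a dry run that prints the `sbatch` calls and uses a placeholder job ID.

```shell
# Sketch of the two-job pattern: pre-pull first, orchestrator second.
# prepull_images.sh and orchestrator.sh are HYPOTHETICAL script names.

submit() {
  # Dry run by default: show the sbatch call on stderr and return a placeholder job ID.
  # Set DRY_RUN=0 on a SLURM login node to submit for real.
  if [ "${DRY_RUN:-1}" = "0" ]; then
    sbatch --parsable "$@"
  else
    echo "would run: sbatch --parsable $*" >&2
    echo "12345"
  fi
}

# --parsable makes sbatch print only the job ID (optionally followed by ";cluster").
PREPULL_JOB_ID="$(submit prepull_images.sh)"
PREPULL_JOB_ID="${PREPULL_JOB_ID%%;*}"

# afterok: the dependent job starts only if the pre-pull job finishes with exit code 0.
DEPENDENCY="--dependency=afterok:${PREPULL_JOB_ID}"
ORCH_JOB_ID="$(submit "$DEPENDENCY" orchestrator.sh)"

echo "pre-pull job: ${PREPULL_JOB_ID}, orchestrator job: ${ORCH_JOB_ID}"
```

Because of `afterok`, a failed pre-pull leaves the orchestrator job pending with an unsatisfied dependency instead of starting tasks against a missing image cache.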
GoodWorkflows pre-pulls every required Docker image as an Apptainer SIF file into `${PIPELINE_ROOT}/apptainer-sif/` (or `$NXF_SINGULARITY_CACHEDIR` if set). The standalone pre-pull job populates this shared cache before any tasks start; each task finds the SIF there directly with no conversion overhead.
Each Docker image is converted to a SIF file once and stored in `${PIPELINE_ROOT}/apptainer-sif/` (shared NFS). Override the location with:
    export NXF_SINGULARITY_CACHEDIR=/home/exacloud/gscratch/<lab>/singularity-sifs
    bash slurm_nextflow.sh --workflow ingest_tabulate ...

`APPTAINER_CACHEDIR` (the OCI blob/layer cache, typically set in `~/.bashrc`) is passed through to compute nodes automatically and speeds up repeated pulls of the same image layers.
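
The cache-location rule above (use `NXF_SINGULARITY_CACHEDIR` when set, otherwise the repo-local directory) and the pull-once behaviour can be sketched as follows. This is an illustration under assumptions, not the repo's pre-pull script: the SIF file naming scheme and the helper names `sif_path_for` and `prepull` are hypothetical, and `PIPELINE_ROOT` falls back to the current directory for the sake of the example.

```shell
# PIPELINE_ROOT falls back to the current directory here (an assumption for this sketch).
PIPELINE_ROOT="${PIPELINE_ROOT:-$PWD}"

# NXF_SINGULARITY_CACHEDIR wins when set and non-empty; otherwise use the repo-local default.
SIF_CACHE="${NXF_SINGULARITY_CACHEDIR:-${PIPELINE_ROOT}/apptainer-sif}"

sif_path_for() {
  # HYPOTHETICAL naming scheme: replace "/" and ":" in the image name with "-" and add ".sif".
  printf '%s/%s.sif\n' "$SIF_CACHE" "$(printf '%s' "$1" | sed 's|[/:]|-|g')"
}

prepull() {
  # Pull and convert the image only when its SIF is not already in the cache.
  target="$(sif_path_for "$1")"
  if [ -f "$target" ]; then
    echo "cached: $target"
  else
    mkdir -p "$SIF_CACHE"
    apptainer pull "$target" "docker://$1"
  fi
}

# Show where a given image would be cached, without pulling anything.
sif_path_for "ubuntu:22.04"
```

Calling `prepull ubuntu:22.04` on a node with Apptainer installed would populate the cache on the first call and report the cached file on later calls.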
Optional: fast-forward the checkout immediately before launch:
    SYNC_REPO_BEFORE_RUN=true bash slurm_nextflow.sh \
      --workflow integration \
      --labkey_base_url https://labkey.example.org \
      --labkey_folder /My/Folder

A lightweight git sync job is the best fit here:
- simple and transparent
- works well on HPC scratch filesystems
- keeps the pipeline code versioned in Git
- safer than a self-updating Nextflow process inside the running workflow

So the implemented pattern is:
- `git clone` once on HPC
- use `slurm_sync_repo.sh` or `scripts/sync_repo.sh` to fast-forward to the latest `main`
- launch `run.sh` for routine named runs, or `slurm_nextflow.sh` for repo-root launches

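
The clone-once, then fast-forward pattern listed above can be sketched with plain git commands. This is a minimal illustration, not the contents of `slurm_sync_repo.sh` or `scripts/sync_repo.sh`; the function name `sync_checkout` and its arguments are hypothetical.

```shell
sync_checkout() {
  # $1: checkout directory, $2: remote URL (used only for the first clone), $3: branch
  dir="$1"
  url="$2"
  branch="${3:-main}"

  if [ ! -d "$dir/.git" ]; then
    # First run: create the checkout.
    git clone --quiet --branch "$branch" "$url" "$dir"
  else
    # Later runs: fetch, then fast-forward only.
    # --ff-only refuses to merge if local history has diverged, so nothing is rewritten.
    git -C "$dir" fetch --quiet origin "$branch" &&
      git -C "$dir" checkout --quiet "$branch" &&
      git -C "$dir" merge --quiet --ff-only "origin/$branch"
  fi
}
```

A call such as `sync_checkout /gscratch/mygroup/GoodWorkflows <repo-url> main` would clone on the first invocation and fast-forward afterwards. A non-zero exit status signals that the checkout has local commits or conflicts that need manual attention.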
GitHub Actions validates the repository in two layers:
- Workflow smoke tests: run `main.nf` with `-profile test -stub-run` for `integration`, `ingest_export`, and `ingest_tabulate`. The `integration` workflow smoke test additionally passes `--scmodal_use_cpu true` to bypass the local-executor GPU guard; `SCMODAL_INTEGRATE` runs its stub block, which validates DSL2 wiring without requiring a GPU.
- Module smoke tests: run each module wrapper under `tests/modules/` so every module is exercised independently.

The test profile disables containers and uses the local executor so CI can validate DSL2 wiring quickly without requiring HPC infrastructure.
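
To reproduce the workflow smoke tests by hand, a loop along these lines may help. It is a sketch that only prints the commands; the actual GitHub Actions workflow file may differ, the function name `smoke_commands` is hypothetical, and the LabKey values are the placeholders used elsewhere in this README.

```shell
smoke_commands() {
  # Print one stub-run command per workflow covered by the smoke tests.
  for wf in integration ingest_export ingest_tabulate; do
    extra=""
    # Only the integration smoke test needs the CPU override for the GPU guard.
    if [ "$wf" = "integration" ]; then
      extra=" --scmodal_use_cpu true"
    fi
    echo "nextflow run main.nf -profile test -stub-run --workflow $wf --labkey_base_url https://labkey.example.org --labkey_folder /My/Folder$extra"
  done
}

# Review the commands first; pipe the output to `sh` to run them.
smoke_commands
```

Printing rather than executing keeps the sketch safe to run on a machine without Nextflow installed.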
For container-dependent validation, the repo also includes scripts/ci/cache_container_images.sh, which can pre-pull and cache the module images into .ci/docker-cache/ during GitHub Actions runs.
- Docs validation and deploy: on pull requests and pushes that touch workflows, docs, schema, or docs tooling, GitHub Actions regenerates the `nf-docs` API reference, regenerates synthetic example plots, and runs `mkdocs build --strict`. Pushes to `main` also deploy the site to GitHub Pages.

    bash scripts/docs/generate_api_docs.sh
    uvx --with matplotlib python scripts/docs/generate_example_plots.py
    mkdocs build --strict

The published vignette and example plots are driven by the seeded synthetic fixture bundle in `tests/fixtures/synthetic_trial_data/`, so docs and CI do not depend on sensitive or machine-local files.

MIT. See LICENSE.