cellgeni/nf-scenicplus

nf-scenicplus

A Nextflow DSL2 pipeline for inferring enhancer-driven gene regulatory networks (eGRNs) with SCENIC+ from paired single-cell RNA-seq (scRNA-seq) and ATAC-seq (scATAC-seq) data.

Overview

SCENIC+ identifies enhancer-driven gene regulatory networks (eGRNs) by linking transcription factor (TF) binding motifs in accessible chromatin regions to target gene expression. This pipeline orchestrates the full SCENIC+ workflow from raw inputs to a final annotated MuData object.

Input: scRNA-seq + scATAC-seq (peak matrix or fragment files)
Output: scplusmdata.h5mu — annotated MuData with eGRNs, AUCell scores, and cistromes

Pipeline DAG

CREATE_CISTOPIC_OBJECT
  └── TOPIC_MODELING (parallel, one job per topic count)
        └── TOPIC_MODELING_EVALUATE (model selection, binarization, DARs)
              ├── PREPARE_GEX_ACC ──── DOWNLOAD_GENOME_ANNOTATION
              │         └── GET_SEARCH_SPACE
              │                   ├── TF_TO_GENE ──────────────────────┐
              │                   └── REGION_TO_GENE ──────────────────┤
              │                                                         │
              ├── MOTIF_ENRICHMENT_CISTARGET ──┐                        │
              └── MOTIF_ENRICHMENT_DEM ────────┴── PREPARE_MENR        │
                                                         │              │
                                                         └──────────────┤
                                                                        ↓
                                                         EGRN_DIRECT / EGRN_EXTENDED
                                                                        │
                                                         AUCCELL_DIRECT / AUCCELL_EXTENDED
                                                                        │
                                                               CREATE_SCPLUS_MUDATA
                                                               → scplusmdata.h5mu
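
The diagram can also be read as a dependency map. The sketch below encodes it with Python's standard `graphlib` module and prints one valid execution order. The edge list is a reading of the diagram, not something extracted from main.nf. Treating DOWNLOAD_GENOME_ANNOTATION as an input of GET_SEARCH_SPACE is an assumption, and the direct/extended variants are collapsed into single `EGRN` and `AUCCELL` nodes for brevity.

```python
from graphlib import TopologicalSorter

# Each key depends on the modules in its set (a reading of the diagram above,
# not extracted from main.nf).
DEPENDS_ON = {
    "CREATE_CISTOPIC_OBJECT": set(),
    "TOPIC_MODELING": {"CREATE_CISTOPIC_OBJECT"},
    "TOPIC_MODELING_EVALUATE": {"TOPIC_MODELING"},
    "PREPARE_GEX_ACC": {"TOPIC_MODELING_EVALUATE"},
    "DOWNLOAD_GENOME_ANNOTATION": set(),  # assumed to feed GET_SEARCH_SPACE
    "GET_SEARCH_SPACE": {"PREPARE_GEX_ACC", "DOWNLOAD_GENOME_ANNOTATION"},
    "TF_TO_GENE": {"GET_SEARCH_SPACE"},
    "REGION_TO_GENE": {"GET_SEARCH_SPACE"},
    "MOTIF_ENRICHMENT_CISTARGET": {"TOPIC_MODELING_EVALUATE"},
    "MOTIF_ENRICHMENT_DEM": {"TOPIC_MODELING_EVALUATE"},
    "PREPARE_MENR": {"MOTIF_ENRICHMENT_CISTARGET", "MOTIF_ENRICHMENT_DEM"},
    "EGRN": {"TF_TO_GENE", "REGION_TO_GENE", "PREPARE_MENR"},
    "AUCCELL": {"EGRN"},
    "CREATE_SCPLUS_MUDATA": {"AUCCELL"},
}


def execution_order(depends_on):
    """Return one valid linear order in which the modules could run."""
    return list(TopologicalSorter(depends_on).static_order())


order = execution_order(DEPENDS_ON)
print(" -> ".join(order))
```

Nextflow schedules independent branches concurrently, so the linear order printed here is only one of several valid schedules.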

Requirements

  • Nextflow ≥ 23.04
  • Singularity (container runtime)
  • LSF cluster (for lsf profile) or local execution

The pipeline runs inside a Singularity container with all Python dependencies pre-installed. See Dockerfile for the full image definition.

Quick Start

Peak matrix mode (most common)

nextflow run main.nf -profile lsf \
    --rna_h5ad     /path/to/rna.h5ad \
    --peak_h5ad    /path/to/atac_peaks.h5ad \
    --cistarget_db /path/to/hg38.regions_vs_motifs.rankings.feather \
    --dem_db       /path/to/hg38.regions_vs_motifs.scores.feather \
    --celltype_col cell_type \
    --outdir       results/

Fragment file mode

nextflow run main.nf -profile lsf \
    --rna_h5ad      /path/to/rna.h5ad \
    --fragments_tsv "data/*/fragments.tsv.gz" \
    --regions_bed   /path/to/consensus_peaks.bed \
    --tss_bed       /path/to/tss_annotation.bed \
    --cistarget_db  /path/to/hg38.regions_vs_motifs.rankings.feather \
    --dem_db        /path/to/hg38.regions_vs_motifs.scores.feather \
    --celltype_col  cell_type \
    --outdir        results/

Test run (reduced parameters, local)

nextflow run main.nf -profile test

Parameters

All parameters are defined in nextflow.config.

Required inputs

| Parameter | Description |
| --- | --- |
| `--rna_h5ad` | scRNA-seq AnnData (.h5ad) |
| `--peak_h5ad` | ATAC peak matrix AnnData (.h5ad); mutually exclusive with `--fragments_tsv` |
| `--fragments_tsv` | Path glob to fragment files (.tsv.gz); requires `--regions_bed` and `--tss_bed` |
| `--cistarget_db` | cisTarget rankings database (.feather) |
| `--dem_db` | DEM scores database (.feather) |
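
The rules in this table (two mutually exclusive ATAC inputs, with extra files needed in fragment mode) can be checked before launching a run. The helper below is a hypothetical pre-flight check written for illustration. `input_mode` is not part of the pipeline, and how the pipeline itself validates parameters is not shown here.

```python
def input_mode(params):
    """Classify a parameter dict as 'peak_matrix' or 'fragments'.

    Raises ValueError when the combination breaks the rules in the table above.
    """
    always_required = ["rna_h5ad", "cistarget_db", "dem_db"]
    missing = [key for key in always_required if not params.get(key)]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")

    has_peaks = bool(params.get("peak_h5ad"))
    has_fragments = bool(params.get("fragments_tsv"))
    if has_peaks == has_fragments:
        raise ValueError("provide exactly one of --peak_h5ad or --fragments_tsv")

    if has_fragments:
        needed = [key for key in ("regions_bed", "tss_bed") if not params.get(key)]
        if needed:
            raise ValueError(f"--fragments_tsv also requires: {needed}")
        return "fragments"
    return "peak_matrix"


example = {
    "rna_h5ad": "/path/to/rna.h5ad",
    "peak_h5ad": "/path/to/atac_peaks.h5ad",
    "cistarget_db": "/path/to/hg38.regions_vs_motifs.rankings.feather",
    "dem_db": "/path/to/hg38.regions_vs_motifs.scores.feather",
}
print(input_mode(example))
```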

Sample metadata

| Parameter | Default | Description |
| --- | --- | --- |
| `--sample_id_col` | `"sample_id"` | Column name for sample IDs |
| `--celltype_col` | `"cell_type"` | Column name for cell type labels |
| `--run_id` | `"scenicplus"` | Run identifier, used in output file names |

Species and assembly

| Parameter | Default | Description |
| --- | --- | --- |
| `--species_biomart` | `"hsapiens"` | Biomart species dataset (e.g. `"mmusculus"`) |
| `--species_motif` | `"homo_sapiens"` | Species for motif annotation (e.g. `"mus_musculus"`) |
| `--motif_annotation` | `null` | Path to motif-to-TF .tbl file; auto-downloaded if omitted |
| `--biomart_host` | `"oct2024.archive.ensembl.org"` | Ensembl archive mirror |
| `--ucsc_genome` | `"hg38"` | UCSC assembly name (chromsizes fallback) |

Data preparation

| Parameter | Default | Description |
| --- | --- | --- |
| `--is_multiome` | `true` | Whether data is paired multiome |
| `--use_raw_for_gex` | `false` | Use .raw layer for gene expression |
| `--bc_transform_func` | `"lambda x: f'{x}___cisTopic'"` | Lambda to align ATAC → RNA barcodes |
| `--nr_metacells` | `null` | Number of metacells (`null` = disable) |
| `--nr_cells_per_metacells` | `10` | Cells per metacell |

Topic modeling (LDA)

| Parameter | Default | Description |
| --- | --- | --- |
| `--n_topics` | `"2 4 10 16 32 48"` | Space-separated list of topic counts to evaluate |
| `--n_topics_select` | `32` | Topic count to use downstream |
| `--topic_modeling_n_iter` | `500` | LDA iterations |
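
Because `--n_topics` is a space-separated string, each token becomes one TOPIC_MODELING job. The sketch below parses such a string and checks that the selected count is among the evaluated ones. That check is an assumption about sensible usage, not a documented rule of the pipeline, and `parse_topic_counts` is a hypothetical helper.

```python
def parse_topic_counts(n_topics, n_topics_select):
    """Split the --n_topics string into integers and sanity-check the selection."""
    counts = [int(token) for token in n_topics.split()]
    if n_topics_select not in counts:
        # Assumed constraint: a model can only be selected if it was trained.
        raise ValueError(
            f"--n_topics_select {n_topics_select} is not one of {counts}"
        )
    return counts


counts = parse_topic_counts("2 4 10 16 32 48", 32)
print(f"{len(counts)} topic-modeling jobs: {counts}")
```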

Search space (region-to-gene)

| Parameter | Default | Description |
| --- | --- | --- |
| `--upstream` | `"1000 150000"` | Min/max bp upstream of TSS |
| `--downstream` | `"1000 150000"` | Min/max bp downstream of TSS |
| `--extend_tss` | `"10 10"` | TSS extension for promoter definition |
| `--use_gene_boundaries` | `true` | Restrict search space by neighboring gene bodies |
| `--remove_promoters` | `false` | Exclude promoter regions from search space |
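
To make the upstream and downstream maxima concrete, the sketch below computes a strand-aware window around a single TSS. It is a deliberate simplification. It uses only the maximum distances and ignores the minimum distances, gene bodies, neighboring-gene boundaries, and promoter handling that the real search-space step takes into account. The function name and the coordinates are hypothetical.

```python
def search_window(tss, strand, upstream_max, downstream_max, chrom_size):
    """Return (start, end) of the window around a TSS, clipped to the chromosome."""
    if strand == "+":
        start, end = tss - upstream_max, tss + downstream_max
    elif strand == "-":
        # On the minus strand, "upstream" lies at higher coordinates.
        start, end = tss - downstream_max, tss + upstream_max
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(0, start), min(chrom_size, end)


# Plus-strand gene near the chromosome start: the window is clipped at 0.
print(search_window(100_000, "+", 150_000, 150_000, 1_000_000))  # (0, 250000)
# Minus-strand gene with asymmetric limits.
print(search_window(500_000, "-", 150_000, 100_000, 1_000_000))  # (400000, 650000)
```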

Motif enrichment

| Parameter | Default | Description |
| --- | --- | --- |
| `--annotation_version` | `"v10nr_clust"` | Motif annotation version |
| `--annotations_to_use` | `"Direct_annot Orthology_annot"` | Space-separated annotation types |
| `--direct_annotation` | `"Direct_annot"` | Label for direct TF annotations |
| `--extended_annotation` | `"Orthology_annot"` | Label for orthology-based annotations |
| `--motif_similarity_fdr` | `0.001` | FDR threshold for motif similarity |
| `--auc_threshold` | `0.005` | cisTarget AUC threshold |
| `--nes_threshold` | `3.0` | cisTarget NES threshold |
| `--rank_threshold` | `0.05` | cisTarget rank threshold |
| `--dem_adjpval_thr` | `0.05` | DEM adjusted p-value threshold |
| `--dem_log2fc_thr` | `1.0` | DEM log2 fold-change threshold |
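
As a rough illustration of what the NES and DEM thresholds do, the sketch below filters a few made-up motif records with the default values. The record layout, the motif names, and the inclusive comparisons (`>=`, `<=`) are assumptions made for illustration. The actual filtering happens inside the motif enrichment tools.

```python
# Default thresholds from the table above.
NES_THRESHOLD = 3.0
DEM_ADJPVAL_THR = 0.05
DEM_LOG2FC_THR = 1.0


def passes_cistarget(hit):
    """cisTarget-style filter: keep motifs whose NES reaches the threshold."""
    return hit["nes"] >= NES_THRESHOLD


def passes_dem(hit):
    """DEM-style filter: keep motifs that are significant and enriched enough."""
    return hit["adj_pval"] <= DEM_ADJPVAL_THR and hit["log2fc"] >= DEM_LOG2FC_THR


# Made-up records, for illustration only.
cistarget_hits = [
    {"motif": "motif_A", "nes": 4.2},
    {"motif": "motif_B", "nes": 2.1},
]
dem_hits = [
    {"motif": "motif_C", "adj_pval": 0.01, "log2fc": 1.5},
    {"motif": "motif_D", "adj_pval": 0.20, "log2fc": 2.0},
]

kept_cistarget = [h["motif"] for h in cistarget_hits if passes_cistarget(h)]
kept_dem = [h["motif"] for h in dem_hits if passes_dem(h)]
print(kept_cistarget, kept_dem)
```

Raising `--nes_threshold` or `--dem_log2fc_thr` keeps fewer motifs, and lowering them keeps more.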

GRN inference

| Parameter | Default | Description |
| --- | --- | --- |
| `--tf_to_gene_method` | `"GBM"` | TF→gene importance method (`"GBM"` or `"RF"`) |
| `--region_to_gene_importance_method` | `"GBM"` | Region→gene importance method |
| `--region_to_gene_correlation_method` | `"SR"` | Region→gene correlation method (`"SR"` = Spearman) |
| `--mask_expr_dropout` | `false` | Mask zero-expression values as dropouts |
| `--gsea_n_perm` | `1000` | GSEA permutations for eGRN scoring |
| `--quantile_thresholds` | `"0.85 0.90 0.95"` | Thresholds for pruning adjacencies |
| `--top_n_regionTogenes_per_gene` | `"5 10 15"` | Top regions per gene for eGRN |
| `--rho_threshold` | `0.05` | Correlation threshold for TF-region-gene links |
| `--min_target_genes` | `10` | Minimum target genes per TF to retain regulon |
| `--seed` | `666` | Random seed |
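
The effect of `--min_target_genes` can be pictured as a simple size filter on regulons. The sketch below applies that idea to a toy mapping from TF to target genes. The dictionary layout and all names are hypothetical, and SCENIC+ applies this criterion internally on its own data structures.

```python
def filter_regulons(regulons, min_target_genes=10):
    """Keep TFs whose regulon has at least `min_target_genes` distinct targets."""
    return {
        tf: genes
        for tf, genes in regulons.items()
        if len(set(genes)) >= min_target_genes
    }


# Toy regulons with placeholder names.
toy_regulons = {
    "TF_A": [f"gene_{i}" for i in range(12)],
    "TF_B": ["gene_1", "gene_2", "gene_3"],
}
kept = filter_regulons(toy_regulons)
print(sorted(kept))  # ['TF_A']
```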

Resources

| Parameter | Default | Description |
| --- | --- | --- |
| `--outdir` | `"results"` | Output directory |
| `--max_memory` | `"250.GB"` | Maximum memory per process |
| `--max_cpus` | `32` | Maximum CPUs per process |
| `--max_time` | `"120.h"` | Maximum wall time per process |
| `--container_image` | (local .sif path) | Singularity image path or Docker URI |

Execution Profiles

| Profile | Description |
| --- | --- |
| `local` | Local execution, no scheduler |
| `lsf` | LSF cluster (Sanger HPC); maps resource labels to queues |
| `test` | Reduced parameters for quick validation (`n_topics = "2 4"`, 50 LDA iterations) |

Output Structure

results/
├── create_cistopic_object/
│   └── cistopic_obj.pkl
├── topic_modeling/
│   └── models/
├── topic_modeling_evaluate/
│   ├── cistopic_obj.pkl          # updated with selected model
│   └── region_sets/
│       ├── topics_otsu/          # one BED file per topic (Otsu threshold)
│       ├── topics_top3k/         # one BED file per topic (top 3000 regions)
│       └── DARs/                 # differentially accessible regions per cell type
├── prepare_gex_acc/
│   └── ACC_GEX.h5mu
├── motif_enrichment_cistarget/
│   └── cistarget_results.hdf5
├── motif_enrichment_dem/
│   └── dem_results.hdf5
├── egrn/
│   ├── direct_egrn.tsv
│   └── extended_egrn.tsv
├── auccell/
│   ├── auccell_direct.h5ad
│   └── auccell_extended.h5ad
└── create_scplus_mudata/
    └── scplusmdata.h5mu          # ← final output
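
After a run, a quick way to confirm that every stage produced its file is to compare the output directory against the tree above. The sketch below does that with `pathlib`. The file list is copied from the tree, and the helper name `missing_outputs` is hypothetical.

```python
from pathlib import Path

# Relative paths copied from the output tree above.
EXPECTED_OUTPUTS = [
    "create_cistopic_object/cistopic_obj.pkl",
    "topic_modeling_evaluate/cistopic_obj.pkl",
    "prepare_gex_acc/ACC_GEX.h5mu",
    "motif_enrichment_cistarget/cistarget_results.hdf5",
    "motif_enrichment_dem/dem_results.hdf5",
    "egrn/direct_egrn.tsv",
    "egrn/extended_egrn.tsv",
    "auccell/auccell_direct.h5ad",
    "auccell/auccell_extended.h5ad",
    "create_scplus_mudata/scplusmdata.h5mu",
]


def missing_outputs(outdir):
    """Return the expected files that do not exist under `outdir`."""
    root = Path(outdir)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]


for rel in missing_outputs("results"):
    print("missing:", rel)
```

An empty result means all listed files are present. It says nothing about whether their contents are valid.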

Module Reference

| Module | Resource label | Description |
| --- | --- | --- |
| CREATE_CISTOPIC_OBJECT | process_medium | Builds CistopicObject from h5ad or fragment files |
| TOPIC_MODELING | process_long | LDA topic modeling (one parallel job per topic count) |
| TOPIC_MODELING_EVALUATE | process_medium | Model selection, topic binarization, DAR computation |
| PREPARE_GEX_ACC | process_medium | Assembles paired GEX+ACC MuData object |
| DOWNLOAD_GENOME_ANNOTATION | process_low | Fetches gene annotation and chromosome sizes via Biomart |
| GET_SEARCH_SPACE | process_medium | Computes region→gene search windows |
| MOTIF_ENRICHMENT_CISTARGET | process_high | cisTarget motif ranking enrichment |
| MOTIF_ENRICHMENT_DEM | process_high | Differentially Enriched Motifs (DEM) scoring |
| PREPARE_MENR | process_low | Merges enrichment results; extracts TF name list |
| TF_TO_GENE | process_grn | TF→gene importance scoring (GBM/RF via arboreto) |
| REGION_TO_GENE | process_grn | Region→gene importance + correlation scoring |
| EGRN_DIRECT / EGRN_EXTENDED | process_grn | eGRN assembly (direct and orthology annotations) |
| AUCCELL_DIRECT / AUCCELL_EXTENDED | process_medium | AUCell regulon activity scoring |
| CREATE_SCPLUS_MUDATA | process_medium | Assembles final annotated MuData output |

Resource labels (defined in conf/base.config):

| Label | CPUs | Memory | Queue |
| --- | --- | --- | --- |
| process_low | 2 | 16 GB | normal |
| process_medium | 4 | 64 GB | normal |
| process_high | 8 | 200 GB | normal |
| process_long | 4 | 200 GB | long |
| process_grn | 24 | 150 GB | long |
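
On a smaller machine, the label requests may exceed what is available. The sketch below shows how a request could be capped by `--max_cpus` and `--max_memory`. It assumes the pipeline clamps requests to those maxima, as many Nextflow pipelines do; this is not confirmed from conf/base.config. The `capped` helper and the dictionary layout are hypothetical.

```python
# CPU and memory requests copied from the resource label table above.
LABELS = {
    "process_low": {"cpus": 2, "memory_gb": 16},
    "process_medium": {"cpus": 4, "memory_gb": 64},
    "process_high": {"cpus": 8, "memory_gb": 200},
    "process_long": {"cpus": 4, "memory_gb": 200},
    "process_grn": {"cpus": 24, "memory_gb": 150},
}


def capped(label, max_cpus=32, max_memory_gb=250):
    """Return the label's request, limited to the given maxima (assumed behavior)."""
    request = LABELS[label]
    return {
        "cpus": min(request["cpus"], max_cpus),
        "memory_gb": min(request["memory_gb"], max_memory_gb),
    }


# With the defaults nothing is reduced; on a 16-core, 128 GB node it is.
print(capped("process_grn"))
print(capped("process_grn", max_cpus=16, max_memory_gb=128))
```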

Barcode Alignment

ATAC and RNA barcodes often differ in format, for example when one modality carries an extra sample suffix. Use --bc_transform_func to supply a Python lambda, passed as a quoted string, that transforms ATAC barcodes to match RNA barcodes:

# Strip a sample suffix added to ATAC barcodes
--bc_transform_func "lambda x: x.split('___')[0]"

# Add a sample prefix
--bc_transform_func "lambda x: 'SAMPLE1_' + x"
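
Before launching a long run, it can help to dry-run a candidate lambda on a handful of barcodes and count how many match afterwards. The sketch below does this with plain Python. The barcodes are made up, `overlap_after_transform` is a hypothetical helper, and using `eval` to turn the string into a callable is an assumption about how such a parameter would be consumed.

```python
def overlap_after_transform(func_src, source_barcodes, target_barcodes):
    """Apply a lambda given as a string and count barcodes that then match."""
    # Only evaluate strings you wrote yourself: eval executes arbitrary code.
    transform = eval(func_src)
    transformed = {transform(bc) for bc in source_barcodes}
    return len(transformed & set(target_barcodes))


# Made-up barcodes: one side carries a '___sampleA' suffix.
source = ["AAACGAAT-1___sampleA", "AAAGGCTA-1___sampleA"]
target = ["AAACGAAT-1", "AAAGGCTA-1", "TTTGACCA-1"]

n_matched = overlap_after_transform("lambda x: x.split('___')[0]", source, target)
print(f"{n_matched} of {len(source)} barcodes match after the transform")
```

A match count of zero is a strong hint that the lambda, or the direction in which it is applied, does not fit the data.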

cisTarget Databases

Pre-built cisTarget and DEM databases for human (hg38/hg19) and mouse (mm10) can be downloaded from the Aerts lab resources portal.

Citation

If you use this pipeline, please cite:

Bravo González-Blas, C. et al. SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks. Nature Methods 20, 1355–1367 (2023). https://doi.org/10.1038/s41592-023-01938-4

License

This pipeline is distributed under the MIT License. See LICENSE for details.
