NaviDiv: A Comprehensive Framework for Monitoring Chemical Diversity in Generative Molecular Design

NaviDiv analyzes chemical diversity in generative molecular design, with a focus on how diversity metrics evolve during reinforcement learning optimization. It provides complementary metrics that each capture a different aspect of molecular variation: representation distance-based, string-based, fragment-based, and scaffold-based.

Features

Multiple Diversity Metrics

Representation Distance-Based: Using molecular fingerprints (Morgan, RDKit) and similarity metrics (Tanimoto coefficient)
String-Based Analysis: N-gram analysis of SMILES representations for sequence-level diversity assessment
Fragment-Based Metrics: Systematic molecular decomposition using BRICS fragmentation and frequency analysis
Scaffold-Based Methods: Bemis-Murcko scaffold analysis for core molecular framework comparison
Ring System Analysis: Identification and analysis of ring systems and their sizes
Functional Group Analysis: Detection and diversity assessment of functional groups
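
To make two of these metric families concrete, here is a minimal pure-Python sketch. It is not NaviDiv's implementation: it assumes character-level n-grams over raw SMILES strings, represents a fingerprint as a plain set of on-bit identifiers, and takes internal diversity to be one minus the mean pairwise Tanimoto similarity. The helper names `smiles_ngrams`, `tanimoto`, and `internal_diversity` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def smiles_ngrams(smiles: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams in one SMILES string."""
    return Counter(smiles[i:i + n] for i in range(len(smiles) - n + 1))


def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient between two sets of on-bit identifiers."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def internal_diversity(fingerprints: list) -> float:
    """One minus the mean pairwise Tanimoto similarity (assumed definition)."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Toy "fingerprints": the set of character bigrams of each SMILES string.
smiles_list = ["CCO", "CCN", "CCC"]
fps = [set(smiles_ngrams(s)) for s in smiles_list]
print(smiles_ngrams("CCO"))
print(round(internal_diversity(fps), 3))
```

In practice the sets would come from a real fingerprint such as Morgan bits; the bigram sets here only keep the example free of external dependencies.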

Real-Time Monitoring & Visualization

Interactive Molecular Visualization: 2D structural representations with sorting and filtering options
Temporal Analysis: Monitor evolution of specific molecular fragments and cluster formation patterns
Chemical Space Projection: t-SNE and PCA visualization of molecular diversity evolution
Comparative Analysis: Similarity assessment against user-defined reference sets

Integration Capabilities

REINVENT4 Compatible: Seamless integration with reinforcement learning workflows
Real-Time Penalty Functions: Adaptive diversity constraints during generation
Computational Efficiency: Minimal overhead (~3 seconds per 100 molecules)
Statistical Analysis: Comprehensive diversity trend reports with significance testing

Installation

1. Clone the Repository

git clone https://github.com/mohammedazzouzi15/NaviDiv.git
cd NaviDiv

2. Create and Activate Conda Environment

conda create -n NaviDiv python=3.12
conda activate NaviDiv

3. Choose Installation Type

Standard Installation (Core Framework)

Install the core NaviDiv package with essential dependencies for diversity analysis:

pip install -e .

Full Installation (with REINVENT4 Integration)

For complete generative molecular design workflows with REINVENT4:

First, install PyTorch following the official documentation:

conda install pytorch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 pytorch-cuda=12.4 -c pytorch -c nvidia

Then install REINVENT4 and NaviDiv with full dependencies:

git clone https://github.com/mohammedazzouzi15/REINVENT4_div.git
cd REINVENT4_div
pip install --no-deps -e .
cd ../
pip install -e ".[reinvent]"

4. Optional Dependencies

For enhanced molecular manipulation capabilities:

conda install openeye::openeye-toolkits

Quick Start

Interactive Dashboard

Launch the Streamlit dashboard for comprehensive diversity analysis:

streamlit run app.py

Programmatic Usage

from navidiv.diversity.diversity import diversity_all
from rdkit import Chem

# Load SMILES strings
smiles_list = ["CCO", "CCN", "CCC"]  # Your SMILES data

# Calculate various diversity metrics
richness = diversity_all(smiles=smiles_list, mode="Richness")
internal_diversity = diversity_all(smiles=smiles_list, mode="IntDiv")
scaffold_diversity = diversity_all(smiles=smiles_list, mode="BM")

# Analyze functional groups and ring systems
functional_groups = diversity_all(smiles=smiles_list, mode="FG")
ring_systems = diversity_all(smiles=smiles_list, mode="RS")

Integration with REINVENT4

from navidiv.reinvent.run_staged_learning_2 import run_staged_learning
from navidiv.reinvent.InputGenerator import InputGenerator
from omegaconf import DictConfig

# Create configuration
cfg = DictConfig({...})  # Your REINVENT config

# Generate input files with diversity filters
input_generator = InputGenerator(cfg)
input_generator.generate_input()

# Run staged learning with diversity constraints
run_staged_learning(cfg)

Using the NaviDiv App

Full tutorial: Tutorials/Using_The_app/app_tutorial.md

The interactive Streamlit dashboard provides a no-code interface for exploring molecular diversity.

Launching

conda activate NaviDiv
streamlit run app.py
# Opens at http://localhost:8501

Input Data Format

Prepare a CSV file with:

smiles column (required): SMILES strings of molecules
step column (optional): generation step index, for temporal analysis
Score column (optional): optimization score, for tracking RL progress
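
A file in this format can be produced with the standard library alone. The sketch below is illustrative: the records and the helper name `write_navidiv_csv` are hypothetical, and only the three column names come from the list above.

```python
import csv
import io

# Hypothetical records: (SMILES, generation step, optimization score).
records = [
    ("CCO", 0, 0.41),
    ("CCN", 0, 0.38),
    ("c1ccccc1O", 1, 0.62),
]


def write_navidiv_csv(handle, rows):
    """Write rows under the column names listed above: smiles, step, Score."""
    writer = csv.writer(handle)
    writer.writerow(["smiles", "step", "Score"])
    writer.writerows(rows)


# Write to an in-memory buffer so the example has no side effects.
buffer = io.StringIO()
write_navidiv_csv(buffer, records)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, pass a handle opened with `open("molecules.csv", "w", newline="")` instead of the buffer; the file name is a placeholder.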

Recommended Workflow

1. Load dataset: enter the full path to your CSV and click Load File
2. Run t-SNE: projects molecular fingerprints to 2D for chemical space visualization
3. Run individual scorers, chosen from the sidebar:

Scorer	What it measures
Ngram	SMILES sequence pattern diversity
Scaffold	Bemis-Murcko core framework diversity
Cluster	Similarity-based molecular grouping
RingScorer	Ring system diversity
FGscorer	Functional group diversity
Fragments_default	BRICS fragment decomposition

4. Run All Scorers: comprehensive analysis across all metrics; outputs files to scorer_output/
5. Interpret results in two tabs:
   - Per Fragment: frequency bar plots of structural motifs
   - Per Step: diversity trends over optimization steps (requires step column)

Temporal Evolution Analysis

If your CSV contains a step column (e.g., from a REINVENT4 run), the Per Step tab tracks how diversity evolves across the optimization. Look for:

Exploration phase: high diversity early in RL
Exploitation phase: diversity narrows as the model focuses
Convergence phase: plateau around the optimum
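
The narrowing described above can be illustrated outside the app with a crude stand-in metric. The sketch below computes the fraction of distinct SMILES per step; this is not one of NaviDiv's scorers, and the helper name `unique_fraction_per_step` and the toy data are hypothetical.

```python
from collections import defaultdict


def unique_fraction_per_step(rows):
    """Map each step to (number of distinct SMILES) / (number of molecules)."""
    by_step = defaultdict(list)
    for smiles, step in rows:
        by_step[step].append(smiles)
    return {
        step: len(set(mols)) / len(mols)
        for step, mols in sorted(by_step.items())
    }


# Hypothetical (smiles, step) pairs, as they might be read from the input CSV.
rows = [
    ("CCO", 0), ("CCN", 0), ("CCC", 0), ("CCCl", 0),
    ("CCO", 1), ("CCO", 1), ("CCN", 1), ("CCC", 1),
    ("CCO", 2), ("CCO", 2), ("CCO", 2), ("CCN", 2),
]
trend = unique_fraction_per_step(rows)
for step, fraction in trend.items():
    print(step, fraction)
```

A steadily falling fraction corresponds to the exploitation phase, and a flat tail to convergence.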

Running REINVENT4 with Diversity Constraints

Full tutorial: Tutorials/Running_Reinvent/reinvent_implementation_tutorial.md

NaviDiv adds diversity constraint scoring to REINVENT4 reinforcement learning. Instead of optimizing only for a target property, you balance property optimization with structural diversity.

Environment Setup

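# Activate the environment that contains both REINVENT4 and NaviDiv
# (it is named NaviDiv if you followed the Installation steps above)
conda activate reinvent4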

# Verify installation
python -c "import navidiv; print('NaviDiv OK')"
python -c "import reinvent; print('REINVENT4 OK')"

# Set repo root (required for all commands below)
export NAVIDIV_ROOT="$(cd "$(git rev-parse --show-toplevel)" && pwd)"

Pre-configured Diversity Strategies

Working configurations are in Tutorials/Running_Reinvent/example/conf_folder/diversity_scorer/:

Configuration	Description	Best for
`All_constraints.yaml`	All diversity metrics, moderate constraints	Balanced exploration + optimization
`All_weak_constraints.yaml`	Light constraints, property-first	When diversity is secondary
`scaffold_only.yaml`	Scaffold diversity focus	Exploring different core frameworks
`fragement_only.yaml`	Fragment diversity focus	Exploring molecular building blocks
`ngram_only.yaml`	SMILES string pattern diversity	Varying SMILES sequence patterns
`similarity_only.yaml`	Cluster-based diversity	Preventing generation of near-duplicates

Running a Single Diversity Scorer

cd $NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/
conda activate reinvent4
export PYTHONPATH="${PYTHONPATH}:$NAVIDIV_ROOT/src/navidiv/reinvent"

python3 $NAVIDIV_ROOT/src/navidiv/reinvent/run_reinvent_2.py \
    --config-name test \
    --config-path $NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/conf_folder \
    name=scaffold_experiment \
    wd=./runs/scaffold_test \
    input_generator.file_path=./InputGenerator_custom.py \
    reinvent_common.prior_filename=./priors/formed.prior \
    reinvent_common.agent_filename=./priors/formed.prior \
    reinvent_common.max_steps=1000 \
    diversity_scorer=scaffold_only

For a quick test run (10 steps):

python3 $NAVIDIV_ROOT/src/navidiv/reinvent/run_reinvent_2.py \
    --config-name test \
    --config-path $NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/conf_folder \
    name=test_run \
    wd=./runs/test \
    input_generator.file_path=./InputGenerator_custom.py \
    reinvent_common.prior_filename=./priors/formed.prior \
    reinvent_common.agent_filename=./priors/formed.prior \
    reinvent_common.max_steps=10 \
    diversity_scorer=All_weak_constraints
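
When comparing several strategies, it can help to assemble the command programmatically. The sketch below only builds and prints argument lists that mirror the invocation above; it does not run REINVENT4. The helper `build_command`, the `name=` and `wd=` naming scheme, and the `/path/to/NaviDiv` placeholder are assumptions made for illustration.

```python
import shlex

# Configuration names taken from the strategy table (spelling as in the file names).
STRATEGIES = [
    "All_constraints",
    "All_weak_constraints",
    "scaffold_only",
    "fragement_only",
    "ngram_only",
    "similarity_only",
]


def build_command(root: str, strategy: str, max_steps: int = 10) -> list:
    """Assemble the run_reinvent_2.py call shown above as an argument list."""
    example = f"{root}/Tutorials/Running_Reinvent/example"
    return [
        "python3",
        f"{root}/src/navidiv/reinvent/run_reinvent_2.py",
        "--config-name", "test",
        "--config-path", f"{example}/conf_folder",
        f"name={strategy}_experiment",  # naming scheme is an assumption
        f"wd=./runs/{strategy}",        # naming scheme is an assumption
        "input_generator.file_path=./InputGenerator_custom.py",
        "reinvent_common.prior_filename=./priors/formed.prior",
        "reinvent_common.agent_filename=./priors/formed.prior",
        f"reinvent_common.max_steps={max_steps}",
        f"diversity_scorer={strategy}",
    ]


# Print the commands rather than executing them.
for name in STRATEGIES:
    print(shlex.join(build_command("/path/to/NaviDiv", name)))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the example directory, with the same environment and `PYTHONPATH` set up as in the shell commands above.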

Running All Diversity Strategies at Once

The provided script iterates over every configuration in diversity_scorer/, runs REINVENT4, and automatically applies t-SNE and diversity post-analysis:

cd $NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/
chmod +x run_reinvent.sh
./run_reinvent.sh

Output Structure

runs/test_case_2/
└── scaffold_only/
    ├── scaffold_only_1.csv          # Generated molecules + scores
    ├── scaffold_only_1_TSNE.csv     # With t-SNE coordinates
    ├── scorer_output/               # NaviDiv diversity analysis
    └── logs/                        # REINVENT4 logs

Load the resulting CSV in the NaviDiv app to visualize diversity evolution.
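
To find the files to load, a short script can walk the layout shown above. This is a sketch under the assumption that every strategy folder holds its CSVs at the top level and that t-SNE outputs end in `_TSNE.csv`, as in the tree; the helper name `find_run_outputs` is hypothetical.

```python
from pathlib import Path


def find_run_outputs(runs_dir):
    """Return {strategy: (raw_csvs, tsne_csvs)} for each strategy subfolder."""
    outputs = {}
    for strategy_dir in sorted(Path(runs_dir).iterdir()):
        if not strategy_dir.is_dir():
            continue
        csvs = sorted(strategy_dir.glob("*.csv"))
        tsne = [p for p in csvs if p.stem.endswith("_TSNE")]
        raw = [p for p in csvs if not p.stem.endswith("_TSNE")]
        outputs[strategy_dir.name] = (raw, tsne)
    return outputs


# Only list files if the example run directory actually exists.
runs = Path("runs/test_case_2")
if runs.is_dir():
    for strategy, (raw, tsne) in find_run_outputs(runs).items():
        print(strategy, [p.name for p in raw], [p.name for p in tsne])
```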

Writing a Custom Diversity Configuration

Create a YAML file in conf_folder/diversity_scorer/ to define your own constraints:

# conf_folder/diversity_scorer/custom.yaml
scorer_dicts:
  - prop_dict:
      scorer_name: Scaffold
      scaffold_type: basic_wire_frame
      min_count_fragments: 1
      selection_criteria:
        diff_median_score: 0.05       # Stricter: require larger score improvement
    score_every: 5                    # Evaluate more frequently
    groupby_every: 10
    selection_criteria:
      count_perc_ratio: 2             # Stricter diversity ratio
      Total Number of Molecules with Substructure: 25
    custom_alert_name: customalertsscaffold

  - prop_dict:
      scorer_name: Fragments
      min_count_fragments: 5
      transformation_mode: none
    score_every: 15
    groupby_every: 30
    selection_criteria:
      count_perc_ratio: 10            # More relaxed fragment constraint
      Total Number of Molecules with Substructure: 100
    custom_alert_name: customalerts

Key tuning parameters:

count_perc_ratio — lower = stricter diversity enforcement
Total Number of Molecules with Substructure — cap on how many molecules may share a motif
score_every — how often the diversity scorer runs (lower = more control, slower)
diff_median_score — minimum score improvement required to accept a molecule
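
The cost side of score_every is simple arithmetic. The sketch below assumes the scorer runs once every `score_every` steps, which is a reading of the description above rather than a confirmed detail of the implementation; `scorer_invocations` is a hypothetical helper.

```python
def scorer_invocations(max_steps: int, score_every: int) -> int:
    """Number of scorer runs, assuming one run every `score_every` steps."""
    if score_every < 1:
        raise ValueError("score_every must be a positive integer")
    return max_steps // score_every


# score_every values from the custom.yaml example, for a 1000-step run.
for name, score_every in [("Scaffold", 5), ("Fragments", 15)]:
    print(name, scorer_invocations(1000, score_every))
```

Under that assumption, tripling score_every cuts the number of diversity evaluations to roughly a third, at the price of reacting later to over-represented motifs.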

Configuring the Target Property

Edit or create files in conf_folder/stage_comp/ to change the optimization objective:

# stage_comp/LogP.yaml  — optimize for drug-like LogP
model_params_rdkit_physchem:
  params_list: ["ALOGP"]
  lower_bound: [1.0]
  higher_bound: [3.0]
  weight: [1.0]

# stage_comp/QED_MW.yaml  — multi-property optimization
model_params_rdkit_physchem:
  params_list: ["QED", "MW"]
  lower_bound: [0.5, 200.0]
  higher_bound: [1.0, 500.0]
  weight: [1.0, 0.5]
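
In both examples the four lists appear to be positionally aligned: the i-th bound and weight belong to the i-th entry of params_list. Assuming that reading is correct, a small check like the following can catch misaligned or inverted entries before a run. The helper `check_physchem_params` is hypothetical and is not part of NaviDiv or REINVENT4.

```python
def check_physchem_params(params: dict) -> None:
    """Raise ValueError if the per-property lists are misaligned or inverted."""
    names = params["params_list"]
    for key in ("lower_bound", "higher_bound", "weight"):
        if len(params[key]) != len(names):
            raise ValueError(
                f"{key} has {len(params[key])} entries, expected {len(names)}"
            )
    for name, low, high in zip(names, params["lower_bound"], params["higher_bound"]):
        if low > high:
            raise ValueError(f"{name}: lower_bound {low} exceeds higher_bound {high}")


# The QED_MW.yaml example above, written as a Python dict.
qed_mw = {
    "params_list": ["QED", "MW"],
    "lower_bound": [0.5, 200.0],
    "higher_bound": [1.0, 500.0],
    "weight": [1.0, 0.5],
}
check_physchem_params(qed_mw)
print("QED_MW parameters are consistent")
```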

For additional scoring plugins, refer to the REINVENT4 repository and the custom implementations in src/navidiv/reinvent/reinvent_plugins/.

Post-run Analysis

# t-SNE projection (if not run automatically)
python3 $NAVIDIV_ROOT/src/navidiv/get_tsne.py \
    --df_path runs/test_case_2/scaffold_only/scaffold_only_1.csv \
    --step 20

# Comprehensive diversity analysis
python3 $NAVIDIV_ROOT/src/navidiv/run_all_scorers.py \
    --df_path runs/test_case_2/scaffold_only/scaffold_only_1_TSNE.csv \
    --output_path runs/test_case_2/scaffold_only/scorer_output

Then open the output CSV in the NaviDiv app for interactive exploration.

Performance

Real-Time Analysis: <3 seconds per 100 molecules on standard CPU
Scalable: Complete analysis of 10,000 molecules in ~5 minutes
Memory Efficient: Optimized for large-scale molecular datasets
Integration Ready: Minimal computational overhead for existing workflows
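
The two timing figures agree if runtime scales linearly with the number of molecules. The estimate below makes that assumption explicit; `estimated_seconds` is a hypothetical helper, and real runtimes will depend on hardware and on which scorers are enabled.

```python
def estimated_seconds(n_molecules: int, seconds_per_100: float = 3.0) -> float:
    """Linear runtime estimate from the per-100-molecule figure quoted above."""
    return n_molecules / 100 * seconds_per_100


# 10,000 molecules at about 3 seconds per 100 molecules, reported in minutes.
print(estimated_seconds(10_000) / 60, "minutes")
```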

Citation

If you use NaviDiv in your research, please cite:

@article{azzouzi_navidiv:_2026,
	title = {{NaviDiv}: a web app for monitoring chemical diversity in generative molecular design},
	shorttitle = {{NaviDiv}},
	url = {https://pubs.rsc.org/en/content/articlelanding/2026/dd/d5dd00487j},
	doi = {10.1039/D5DD00487J},
	language = {en},
	urldate = {2026-04-16},
	journal = {Digital Discovery},
	author = {Azzouzi, Mohammed and Worakul, Thanapat and Corminboeuf, Clémence},
	year = {2026},
}

Development Setup

git clone https://github.com/mohammedazzouzi15/NaviDiv.git
cd NaviDiv
pip install -e ".[dev]"
pre-commit install

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

This work was supported by the Swiss National Science Foundation (SNSF).
