This tutorial will guide you through running REINVENT4 with different diversity scorers using the NaviDiv framework. We'll use the working example in the Tutorials/Running_Reinvent/example/ directory that demonstrates QED optimization with various diversity constraints.
The NaviDiv framework provides multiple diversity scoring mechanisms that can be integrated with REINVENT4 to control molecular generation. Instead of optimizing only for a target property (like QED), you can balance property optimization with structural diversity.
- Conda environment:
reinvent4(must be activated) - NaviDiv installed with REINVENT4 dependencies
- REINVENT4 integration enabled
All shell examples below use $NAVIDIV_ROOT to refer to the repository root.
Set it once per terminal session:
# From anywhere — resolve the repo root automatically
export NAVIDIV_ROOT="$(cd "$(git rev-parse --show-toplevel)" && pwd)"
# Or set it explicitly if not in a git repo:
# export NAVIDIV_ROOT="/absolute/path/to/NaviDiv"conda activate reinvent4
python -c "import navidiv; print('NaviDiv installed successfully')"
python -c "import reinvent; print('REINVENT4 available')"All working examples are in the Tutorials/Running_Reinvent/example/ directory of the repository.
Tutorials/Running_Reinvent/example/
├── conf_folder/
│ ├── diversity_scorer/ # Different diversity configurations
│ │ ├── All_constraints.yaml
│ │ ├── All_weak_constraints.yaml
│ │ ├── scaffold_only.yaml
│ │ ├── fragement_only.yaml
│ │ ├── ngram_only.yaml
│ │ └── similarity_only.yaml
│ ├── stage_comp/ # Target property configurations
│ ├── reinvent_common/ # REINVENT settings
│ └── test.yaml # Main configuration
├── InputGenerator_custom.py # Custom input generator
├── run_reinvent.sh # Execution script
└── outputs/ # Results will be stored here
The framework includes several pre-configured diversity scoring strategies:
# Applies all diversity metrics with moderate constraints
# Good for: Balanced exploration with property optimization# Weaker diversity constraints, prioritizes property optimization
# Good for: When you want some diversity but focus on target property# Emphasizes scaffold diversity while relaxing other constraints
# Good for: Exploring different core molecular frameworks# Focuses on fragment-level diversity
# Good for: Exploring different molecular building blocks# Emphasizes N-gram (sequence pattern) diversity
# Good for: Exploring different SMILES sequence patterns# Uses similarity clustering for diversity control
# Good for: Preventing generation of too-similar moleculesEach diversity scorer configuration contains:
scorer_dicts:
- prop_dict:
scorer_name: "Scaffold" # Type of diversity metric
scaffold_type: "basic_wire_frame" # Specific variant
min_count_fragments: 1 # Minimum fragment count
selection_criteria:
diff_median_score: 0.1 # Score difference threshold
score_every: 10 # Evaluation frequency
groupby_every: 20 # Grouping frequency
selection_criteria:
count_perc_ratio: 5 # Percentage ratio constraint
Total Number of Molecules with Substructure: 50 # Count limit
custom_alert_name: "customalertsscaffold" # Alert identifierThe working configuration is located at:
Tutorials/Running_Reinvent/example/conf_folder/test.yaml
# wd and input_generator.file_path are set to ??? (Hydra "must override").
# The run_reinvent.sh script passes them automatically; for manual runs use:
# python3 run_reinvent_2.py ... \
# wd=./runs/myrun \
# input_generator.file_path=$NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/InputGenerator_custom.py \
# reinvent_common.prior_filename=$NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/priors/formed.prior
wd: ???
name: "stage1"
generate_input_only: false
input_generator:
file_path: ???
class_name: "InputGeneratorCustom"
defaults:
- diversity_scorer: default # Will be overridden by command line
- reinvent_common: default
- stage_comp: QED # Target propertyYou can modify the target property by creating new stage_comp configurations: To implement a new target property, you can check the reinvent4 package on the use of other Scoring Plugins (https://github.com/MolecularAI/REINVENT4/tree/main) Currently we can use the implemented scorers in reinvent4 with few added implementation in /src/navidiv/reinvent/reinvent_plugins
# stage_comp/MW.yaml
model_params_rdkit_physchem:
params_list: ["MW"]
lower_bound: [200.0]
higher_bound: [500.0]
weight: [1.0]# stage_comp/LogP.yaml
model_params_rdkit_physchem:
params_list: ["ALOGP"]
lower_bound: [1.0]
higher_bound: [3.0]
weight: [1.0]# stage_comp/QED_MW.yaml
model_params_rdkit_physchem:
params_list: ["QED", "MW"]
lower_bound: [0.5, 200.0]
higher_bound: [1.0, 500.0]
weight: [1.0, 0.5]The working script is provided at:
Tutorials/Running_Reinvent/example/run_reinvent.sh
#!/bin/bash
# Check if conda is available
if ! command -v conda &> /dev/null; then
echo "conda could not be found. Please install Anaconda or Miniconda."
exit 1
fi
# Automatically detect paths relative to this script
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
NAVIDIV_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)"
ENV_NAME="reinvent4"
export PYTHONPATH="${PYTHONPATH}:${NAVIDIV_ROOT}/src/navidiv/reinvent/"
python_script_path="${NAVIDIV_ROOT}/src/navidiv/reinvent/"
config_path="${SCRIPT_DIR}/conf_folder"
config_name=test
wd="${SCRIPT_DIR}/runs/test_case"
i=2
wd_current="${wd}_${i}/"
for diversity_config in "$config_path"/diversity_scorer/*.yaml; do
diversity_scorer_name=$(basename "$diversity_config" .yaml)
run_name=$diversity_scorer_name
mkdir -p "$wd_current"
echo "Running with diversity scorer: $diversity_scorer_name, run $i"
conda run -n $ENV_NAME python3 "$python_script_path/run_reinvent_2.py" \
--config-name $config_name \
--config-path $config_path \
name=$run_name \
wd=$wd_current \
input_generator.file_path="${SCRIPT_DIR}/InputGenerator_custom.py" \
reinvent_common.prior_filename="${SCRIPT_DIR}/priors/formed.prior" \
reinvent_common.agent_filename="${SCRIPT_DIR}/priors/formed.prior" \
reinvent_common.max_steps=1000 \
diversity_scorer=$diversity_scorer_name
# Post-processing: t-SNE visualization
conda run -n $ENV_NAME python3 "${NAVIDIV_ROOT}/src/navidiv/get_tsne.py" \
--df_path "$wd_current/${run_name}/${run_name}_1.csv" \
--step 20
# Post-processing: Diversity analysis
conda run -n $ENV_NAME python3 "${NAVIDIV_ROOT}/src/navidiv/run_all_scorers.py" \
--df_path "$wd_current/${run_name}/${run_name}_1_TSNE.csv" \
--output_path "$wd_current/${run_name}/scorer_output"
done-
Navigate to the example directory:
cd $NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/
-
Make the script executable:
chmod +x run_reinvent.sh
-
Run the script:
./run_reinvent.sh
To run a specific diversity scorer, navigate to the example directory and run:
# Navigate to example directory
cd $NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/
# Activate environment
conda activate reinvent4
# Set PYTHONPATH
export PYTHONPATH="${PYTHONPATH}:$NAVIDIV_ROOT/src/navidiv/reinvent"
# Run specific diversity scorer (e.g., scaffold_only)
python3 $NAVIDIV_ROOT/src/navidiv/reinvent/run_reinvent_2.py \
--config-name test \
--config-path $NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/conf_folder \
name=scaffold_experiment \
wd=$NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/runs/scaffold_test \
input_generator.file_path=$NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/InputGenerator_custom.py \
reinvent_common.prior_filename=$NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/priors/formed.prior \
reinvent_common.agent_filename=$NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/priors/formed.prior \
reinvent_common.max_steps=100 \
diversity_scorer=scaffold_onlyFor a quick test with minimal steps:
# Test run with only 10 steps
python3 $NAVIDIV_ROOT/src/navidiv/reinvent/run_reinvent_2.py \
--config-name test \
--config-path $NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/conf_folder \
name=test_run \
wd=$NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/runs/test \
input_generator.file_path=$NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/InputGenerator_custom.py \
reinvent_common.prior_filename=$NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/priors/formed.prior \
reinvent_common.agent_filename=$NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/priors/formed.prior \
reinvent_common.max_steps=10 \
diversity_scorer=All_weak_constraintsEach run produces:
output_directory/
├── experiment_name/
│ ├── experiment_name_1.csv # Generated molecules
│ ├── experiment_name_1_TSNE.csv # With t-SNE coordinates
│ ├── scorer_output/ # Diversity analysis results
│ └── logs/ # REINVENT logs
- Characteristics: Balanced diversity across all metrics
- Expected outcome: Moderate property optimization with good structural variety
- Use case: General-purpose diverse molecular generation
- Characteristics: High scaffold diversity, relaxed other constraints
- Expected outcome: Many different core structures, varying ring systems
- Use case: Exploring different molecular frameworks
- Characteristics: Focus on fragment-level diversity
- Expected outcome: Varied building blocks and functional groups
- Use case: Exploring different molecular components
- Characteristics: Emphasis on sequence pattern diversity
- Expected outcome: Varied SMILES patterns and string representations
- Use case: Exploring different chemical notation patterns
Create your own diversity scorer configuration:
# custom_diversity.yaml
scorer_dicts:
# Strong scaffold constraints
- prop_dict:
scorer_name: Scaffold
scaffold_type: basic_wire_frame
min_count_fragments: 1
selection_criteria:
diff_median_score: 0.05 # Stricter score requirement
score_every: 5 # More frequent evaluation
groupby_every: 10
selection_criteria:
count_perc_ratio: 2 # Stricter ratio
Total Number of Molecules with Substructure: 25 # Lower limit
custom_alert_name: customalertsscaffold
# Relaxed fragment constraints
- prop_dict:
scorer_name: Fragments
min_count_fragments: 5 # Higher minimum
transformation_mode: none
score_every: 15 # Less frequent evaluation
groupby_every: 30
selection_criteria:
count_perc_ratio: 10 # More relaxed
Total Number of Molecules with Substructure: 100
custom_alert_name: customalerts- Lower
count_perc_ratio: Stricter diversity constraints - Higher
count_perc_ratio: More relaxed constraints - Lower
Total Number of Molecules: Stricter limits - Higher values: More permissive
- Lower
score_every: More frequent diversity checks (slower but more controlled) - Higher
score_every: Less frequent checks (faster but less controlled)
diff_median_score: Minimum score improvement required- Lower values: Stricter improvement requirements
- Higher values: More permissive acceptance
Modify InputGenerator_custom.py to handle multiple properties:
class model_params_multi_property:
def __init__(self, cfg, id):
self.property_name = cfg.params_list[id]
self.lower_bound = cfg.lower_bound[id]
self.upper_bound = cfg.higher_bound[id]
self.weight = cfg.weight[id]
self.transformation = getattr(cfg, 'transformation', ['none'] * len(cfg.params_list))[id]
class InputGeneratorCustom(InputGenerator):
def add_stage_component_multi_property(self, stage1_parameters, component):
# Implementation for multiple properties
# ... (detailed implementation)Run different diversity strategies in sequence:
# Phase 1: Broad exploration
python3 run_reinvent_2.py \
diversity_scorer=All_weak_constraints \
reinvent_common.max_steps=500
# Phase 2: Scaffold refinement
python3 run_reinvent_2.py \
diversity_scorer=scaffold_only \
reinvent_common.max_steps=500
# Phase 3: Fragment optimization
python3 run_reinvent_2.py \
diversity_scorer=fragement_only \
reinvent_common.max_steps=500Monitor your runs by checking the output directory:
# Check if run is progressing
cd $NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/
tail -f runs/test_case_2/All_constraints/logs/*.log
# Check generated molecules count
wc -l runs/test_case_2/All_constraints/All_constraints_1.csvCreate a simple analysis script:
# analysis.py
import pandas as pd
import matplotlib.pyplot as plt
import os
def analyze_run(run_path):
"""Analyze a single REINVENT run."""
csv_file = None
for file in os.listdir(run_path):
if file.endswith('_1.csv'):
csv_file = os.path.join(run_path, file)
break
if csv_file is None:
print(f"No CSV file found in {run_path}")
return None
df = pd.read_csv(csv_file)
# Basic statistics
stats = {
'total_molecules': len(df),
'unique_smiles': df['smiles'].nunique(),
'mean_score': df['total_score'].mean(),
'max_score': df['total_score'].max(),
'final_step': df['step'].max()
}
return df, stats
# Example usage
if __name__ == "__main__":
base_path = "runs/test_case_2"
# Analyze all diversity scorers
results = {}
for scorer in ['All_constraints', 'All_weak_constraints', 'scaffold_only']:
run_path = os.path.join(base_path, scorer)
if os.path.exists(run_path):
df, stats = analyze_run(run_path)
if stats:
results[scorer] = stats
print(f"\n{scorer}:")
for key, value in stats.items():
print(f" {key}: {value}")
# Plot comparison
if results:
scorers = list(results.keys())
mean_scores = [results[s]['mean_score'] for s in scorers]
unique_counts = [results[s]['unique_smiles'] for s in scorers]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.bar(scorers, mean_scores)
ax1.set_title('Mean Scores by Diversity Scorer')
ax1.set_ylabel('Mean Score')
plt.setp(ax1.get_xticklabels(), rotation=45)
ax2.bar(scorers, unique_counts)
ax2.set_title('Unique Molecules by Diversity Scorer')
ax2.set_ylabel('Unique SMILES Count')
plt.setp(ax2.get_xticklabels(), rotation=45)
plt.tight_layout()
plt.savefig('diversity_comparison.png')
plt.show()After your runs complete, use the built-in analysis tools:
# Navigate to example directory
cd $NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/
# Activate environment
conda activate reinvent4
# Analyze results with t-SNE (if not done automatically)
python3 $NAVIDIV_ROOT/src/navidiv/get_tsne.py \
--df_path runs/test_case_2/All_constraints/All_constraints_1.csv \
--step 20
# Run comprehensive diversity analysis
python3 $NAVIDIV_ROOT/src/navidiv/run_all_scorers.py \
--df_path runs/test_case_2/All_constraints/All_constraints_1_TSNE.csv \
--output_path runs/test_case_2/All_constraints/scorer_output# Error: ModuleNotFoundError: No module named 'navidiv'
# Solution: Ensure correct environment and PYTHONPATH
conda activate reinvent4
export PYTHONPATH="${PYTHONPATH}:$NAVIDIV_ROOT/src"# Error: Config file not found
# Solution: Use absolute paths or run from correct directory
cd $NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/
# Then run commands with relative paths# Error: Prior model file not found
# Solution: Check REINVENT4 installation and model paths
python -c "import reinvent; print(reinvent.__file__)"
# Update model paths in configuration files# Symptoms: Out of memory errors
# Solution: Reduce batch sizes in configuration
# Edit reinvent_common/default.yaml:
# batch_size: 64 # Reduce from default
# Or run with fewer steps for testing:
reinvent_common.max_steps=50# Symptoms: Very slow molecule generation
# Solution: Increase score_every frequency in diversity configs
# Edit diversity_scorer/*.yaml files:
score_every: 20 # Increase from 10# Test 1: Check environment
conda activate reinvent4
python -c "import navidiv, reinvent, torch; print('All dependencies available')"
# Test 2: Check REINVENT script
cd $NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/
export PYTHONPATH="${PYTHONPATH}:$NAVIDIV_ROOT/src"
python3 $NAVIDIV_ROOT/src/navidiv/reinvent/run_reinvent_2.py --help
# Test 3: Minimal run (1 step)
python3 $NAVIDIV_ROOT/src/navidiv/reinvent/run_reinvent_2.py \
--config-name test \
--config-path $NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/conf_folder \
name=minimal_test \
wd=$NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/runs/test_minimal \
input_generator.file_path=$NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/InputGenerator_custom.py \
reinvent_common.prior_filename=$NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/priors/formed.prior \
reinvent_common.agent_filename=$NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/priors/formed.prior \
reinvent_common.max_steps=1 \
diversity_scorer=All_weak_constraintsCheck logs for issues:
# Check REINVENT logs
find runs/ -name "*.log" -exec tail -20 {} \;
# Check for common error patterns
grep -r "Error\|Exception\|Failed" runs/*/logs/
# Monitor progress in real-time
tail -f runs/test_case_2/*/logs/*.logAlways test with minimal steps first:
# Quick test (10 steps)
python3 $NAVIDIV_ROOT/src/navidiv/reinvent/run_reinvent_2.py \
--config-name test \
--config-path $NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/conf_folder \
name=quick_test \
wd=$NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/runs/test_quick \
input_generator.file_path=$NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/InputGenerator_custom.py \
reinvent_common.prior_filename=$NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/priors/formed.prior \
reinvent_common.agent_filename=$NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/priors/formed.prior \
reinvent_common.max_steps=10 \
diversity_scorer=All_weak_constraints- Testing: 10-50 steps
- Development: 100-500 steps
- Production: 1000+ steps
# Monitor memory and CPU usage
htop # or top
# Monitor GPU usage (if available)
nvidia-smi
# Monitor disk space for outputs
df -h# Create organized output structure
mkdir -p experiments/$(date +%Y%m%d)_QED_diversity/
# Use descriptive names
python3 run_reinvent_2.py \
name="QED_scaffold_diversity_$(date +%H%M)" \
wd="./experiments/$(date +%Y%m%d)_QED_diversity/" \
# ... other parameters# Copy configurations for each experiment
cp -r conf_folder experiments/$(date +%Y%m%d)_config_backup/
# Document changes in README
echo "Experiment $(date): Modified scaffold constraints" >> experiments/README.mdCreate a simple analysis pipeline:
#!/bin/bash
# analyze_run.sh
RUN_DIR=$1
if [ -z "$RUN_DIR" ]; then
echo "Usage: $0 <run_directory>"
exit 1
fi
echo "Analyzing run in $RUN_DIR"
# Find CSV file
CSV_FILE=$(find $RUN_DIR -name "*_1.csv" | head -1)
if [ -z "$CSV_FILE" ]; then
echo "No CSV file found in $RUN_DIR"
exit 1
fi
echo "Found molecules file: $CSV_FILE"
# Basic statistics
echo "Total molecules: $(tail -n +2 $CSV_FILE | wc -l)"
echo "Unique SMILES: $(tail -n +2 $CSV_FILE | cut -d',' -f2 | sort -u | wc -l)"
# Run t-SNE if not exists
TSNE_FILE=${CSV_FILE/_1.csv/_1_TSNE.csv}
if [ ! -f "$TSNE_FILE" ]; then
echo "Running t-SNE analysis..."
conda run -n reinvent4 python3 $NAVIDIV_ROOT/src/navidiv/get_tsne.py \
--df_path "$CSV_FILE" --step 20
fi
# Run diversity analysis
OUTPUT_DIR="${RUN_DIR}/scorer_output"
if [ ! -d "$OUTPUT_DIR" ]; then
echo "Running diversity analysis..."
conda run -n reinvent4 python3 $NAVIDIV_ROOT/src/navidiv/run_all_scorers.py \
--df_path "$TSNE_FILE" --output_path "$OUTPUT_DIR"
fi
echo "Analysis complete. Results in $OUTPUT_DIR"This tutorial provides a practical, tested framework for running REINVENT4 with various diversity scoring strategies using the NaviDiv framework. All examples are based on working code in the Tutorials/Running_Reinvent/example/ directory.
- Setup:
conda activate reinvent4 - Navigate:
cd $NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/ - Test: Run minimal test with 10 steps
- Execute: Use
./run_reinvent.shfor full analysis - Analyze: Results automatically processed with t-SNE and diversity analysis
- Balance is crucial: Find the right trade-off between property optimization and diversity
- Start small: Always test with minimal steps first
- Monitor actively: Watch logs and resource usage during runs
- Validate results: Use post-analysis tools to confirm diversity metrics
- Document everything: Keep records of successful configurations
The diversity scorers provide powerful tools for controlling molecular generation beyond simple property optimization. Experiment with different configurations, monitor results carefully, and adjust parameters based on your specific research requirements.
- Example Directory:
$NAVIDIV_ROOT/Tutorials/Running_Reinvent/example/ - Configuration Templates: All YAML files are tested and working
- Analysis Scripts:
get_tsne.pyandrun_all_scorers.pyfor result analysis
- Custom Scorers: Extend the framework by implementing custom diversity metrics
- REINVENT4 Documentation: https://github.com/MolecularAI/REINVENT4
- NaviDiv App: Use
streamlit run app.pyfor interactive result exploration
- GitHub Issues: Report problems on the NaviDiv repository
- Configuration Help: Check working examples in the tutorial directory
- Environment Issues: Verify
reinvent4conda environment setup
Happy molecular generation with controlled diversity! 🧬⚗️🎯