Welcome to the NaviDiv application tutorial! This guide will walk you through using the web interface for molecular diversity analysis.
- Python environment with NaviDiv installed
- CSV file containing SMILES molecular data
- Optional: 'step' and 'Score' columns for evolution analysis
- Open your terminal and navigate to the NaviDiv directory
- Activate your conda environment:
conda activate NaviDiv
- Launch the Streamlit app:
streamlit run app.py
- Your browser should automatically open to http://localhost:8501
When you first open the app, you'll see:
- Title: NaviDiv - Molecular Diversity Analysis 🧬
- Description: Overview of the tool's capabilities
- Information Section: Expandable section about molecular diversity scoring functions
- File Upload Section: Where you'll load your dataset
Click on "ℹ️ About Molecular Diversity Scoring Functions" to learn about:
- N-gram String Analysis: SMILES-based pattern analysis
- Scaffold Diversity Analysis: Murcko framework decomposition
- Molecular Clustering: Similarity-based grouping
- Reference Dataset Comparison: Novel vs. known structure identification
- Ring System Analysis: Cyclic structure diversity
- Functional Group Analysis: Chemical functionality patterns
- Fragment Analysis: Basic and advanced molecular decomposition
Your CSV file should contain:
- SMILES column: Molecular structures in SMILES format
- Optional step column: For temporal analysis
- Optional Score column: For optimization tracking
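As a quick illustration, the layout above can be checked with Python's standard csv module before you open the app. The column names SMILES, step and Score come from this tutorial; the three rows and the helper name describe_columns are made up for the example.

```python
import csv
import io

# Example content: the required SMILES column plus the two optional ones.
# The rows are illustrative only.
EXAMPLE_CSV = """SMILES,step,Score
CCO,0,0.12
c1ccccc1,0,0.35
CC(=O)O,1,0.48
"""

def describe_columns(csv_text):
    """Report which of the expected columns appear in the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return {
        "has_smiles": "SMILES" in header,
        "has_step": "step" in header,
        "has_score": "Score" in header,
    }

columns = describe_columns(EXAMPLE_CSV)
print(columns)
```

The same function can be pointed at your own file by reading its text first, for example with pathlib.Path(...).read_text().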
- Enter File Path: Type or paste the full path to your CSV file in the text input
- Example: /path/to/your/molecules.csv
- Use the placeholder path as a reference format
- Click Load File: Press the "Load File" button to validate and load your dataset
- Validation: The app will check:
- File exists at the specified path
- File is in CSV format
- Contains valid SMILES data
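If you want to catch problems before opening the app, the three checks above can be approximated in a few lines. This is only a sketch and not the app's own code: the order of the checks and the helper name precheck_dataset are assumptions, the messages are borrowed from the error messages listed later in this tutorial, and the SMILES test is a placeholder that only looks for non-empty strings.

```python
import csv
import tempfile
from pathlib import Path

def precheck_dataset(path, smiles_column="SMILES"):
    """Return an error message, or None if the file passes the basic checks."""
    file_path = Path(path)
    if not file_path.is_file():
        return "File not found"
    if file_path.suffix.lower() != ".csv":
        return "File should be a CSV"
    with file_path.open(newline="") as handle:
        rows = list(csv.DictReader(handle))
    # Placeholder validity test: any non-empty string counts as a SMILES.
    # A real check would parse each string with a cheminformatics toolkit.
    has_smiles = any((row.get(smiles_column) or "").strip() for row in rows)
    if not has_smiles:
        return "No valid SMILES found"
    return None

# Small demonstration on temporary files.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "molecules.csv"
    good.write_text("SMILES\nCCO\n")
    empty = Path(tmp) / "empty.csv"
    empty.write_text("SMILES\n")
    result_ok = precheck_dataset(good)
    result_empty = precheck_dataset(empty)
    result_missing = precheck_dataset(Path(tmp) / "missing.csv")

print(result_ok, result_empty, result_missing)
```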
Once your dataset is loaded, the first step is usually to run t-SNE analysis to visualize your chemical space:
- Click "Run t-SNE" button in the sidebar
- Wait for processing - this may take a few moments depending on dataset size
- View results in the main visualization area
After running t-SNE, you'll see:
- 2D scatter plot showing molecular distribution in chemical space
- Color-coded points representing different molecules
- Interactive visualization with zoom and pan capabilities
- Two tabs for exploration:
- 🧬 All Molecules Tab: Complete dataset visualization
- 🎯 Fragment Analysis Tab: Focused fragment-based views
- Purpose: Creates 2D visualization of molecular diversity
- Method: Reduces high-dimensional molecular fingerprints to 2D coordinates
- Benefits: Reveals clustering patterns and chemical space organization
- Output: Adds t-SNE coordinates to your dataset for further analysis
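To make the dimensionality reduction step concrete, here is a minimal sketch using scikit-learn's TSNE. It is an illustration under assumptions, not the app's implementation: the fingerprint type and t-SNE settings NaviDiv uses are not stated in this tutorial, so random bit vectors stand in for real molecular fingerprints and the parameters are arbitrary.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for molecular fingerprints: 30 random 64-bit vectors.
# Real fingerprints would be computed from the SMILES strings.
rng = np.random.default_rng(0)
fingerprints = rng.integers(0, 2, size=(30, 64)).astype(float)

# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=5, init="random", random_state=0)
coords = tsne.fit_transform(fingerprints)

# One (x, y) pair per molecule, ready to be stored as two extra columns.
print(coords.shape)
```

Because t-SNE preserves local neighbourhoods rather than global distances, nearby points are more meaningful than the spacing between distant clusters.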
You can run specific diversity scoring methods to focus on particular aspects:
N-gram String Analysis examines SMILES-based patterns:
- Click "Run Scorer" in the sidebar
- Select "Ngram" from the dropdown
- Click "Run Selected Scorer"
- View results in the Analysis Results section
What N-gram analysis shows:
- Common substrings in SMILES representations
- Recurring sequence patterns
- String-based molecular motifs
- Frequency distribution of character patterns
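The idea behind n-gram counting can be sketched in plain Python. This is a simplified, character-level version for illustration: it treats every character as a token, whereas a SMILES-aware tokenizer would keep multi-character atoms such as Cl or Br together. How NaviDiv tokenizes internally is not covered here, and count_ngrams is a hypothetical helper name.

```python
from collections import Counter

def count_ngrams(smiles_list, n=2):
    """Count every substring of length n across all SMILES strings."""
    counts = Counter()
    for smiles in smiles_list:
        for i in range(len(smiles) - n + 1):
            counts[smiles[i:i + n]] += 1
    return counts

ngram_counts = count_ngrams(["CCO", "CCN", "CCC"], n=2)
print(ngram_counts.most_common(1))  # [('CC', 4)]
```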
Fragment Analysis provides detailed molecular decomposition:
- Select "Fragments_default" from the scorer dropdown
- Execute the analysis
- Explore fragment occurrence patterns
- Identify common structural motifs
Fragment analysis reveals:
- Molecular building blocks
- Structural diversity patterns
- Common and rare fragments
- Fragment frequency distributions
- Purpose: Comprehensive diversity analysis using all available scoring methods
- What it does:
- Analyzes fragments, scaffolds, clusters, and other diversity metrics
- Generates detailed reports per fragment and per step
- Output: Creates analysis files in a scorer_output directory
- Time: May take several minutes for large datasets
- Scaffold: Murcko framework decomposition
- Cluster: Similarity-based molecular grouping
- Original: Reference dataset comparison
- RingScorer: Ring system analysis
- FGscorer: Functional group analysis
- Fragments_basic: Basic fragment analysis
- Fragments_elemental: Elemental wireframe transformation
- Purpose: Run specific diversity scoring methods
- Options: Choose from N-gram, Scaffold, Cluster, Ring, Functional Group, or Fragment analysis
- Output: Targeted analysis results
After running analyses, results appear in two main tabs, Per Fragment and Per Step:
What you'll see:
- Fragment occurrence frequency charts
- Bar plots showing fragment diversity metrics
- Dropdown menu to select different analysis results
- Interactive visualizations with hover details
How to interpret:
- High frequency fragments: Common structural motifs in your dataset
- Low frequency fragments: Rare or unique structural elements
- Distribution patterns: Overall diversity characteristics
- Comparative analysis: Different scorer results side-by-side
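The high versus low frequency reading above can be reproduced on exported data with a short script. The sketch below assumes you already have, for each molecule, a list of its fragments; the input data, the 0.5 threshold and both helper names are hypothetical choices for illustration.

```python
from collections import Counter

def fragment_frequencies(fragments_per_molecule):
    """Fraction of molecules that contain each fragment."""
    total = len(fragments_per_molecule)
    counts = Counter()
    for fragments in fragments_per_molecule:
        counts.update(set(fragments))  # count each fragment once per molecule
    return {frag: count / total for frag, count in counts.items()}

def split_common_rare(frequencies, threshold=0.5):
    """Split fragments into common and rare using an arbitrary threshold."""
    common = sorted(f for f, v in frequencies.items() if v >= threshold)
    rare = sorted(f for f, v in frequencies.items() if v < threshold)
    return common, rare

# Hypothetical fragment lists for four molecules.
molecules = [
    ["c1ccccc1", "C(=O)O"],
    ["c1ccccc1", "CN"],
    ["c1ccccc1"],
    ["C1CCNCC1"],
]
freqs = fragment_frequencies(molecules)
common, rare = split_common_rare(freqs)
print(common)  # ['c1ccccc1']
print(rare)    # ['C(=O)O', 'C1CCNCC1', 'CN']
```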
Temporal Evolution Analysis:
- Shows how diversity metrics change over generation steps
- Line plots tracking diversity trends
- Useful for optimization and reinforcement learning workflows
- Reveals how molecular generation evolves over time
Key metrics to watch:
- Diversity trends: Increasing or decreasing over time
- Convergence patterns: Plateau regions indicating stability
- Optimization phases: Different stages of molecular generation
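To see what a per-step trend looks like in numbers, the sketch below computes one very simple diversity measure, the fraction of distinct SMILES per step. This metric is a stand-in chosen for illustration and is not necessarily one of the metrics the app plots; the rows are made up, and the column names SMILES and step follow this tutorial.

```python
from collections import defaultdict

def unique_fraction_per_step(rows):
    """Map each step to the fraction of distinct SMILES generated at that step."""
    by_step = defaultdict(list)
    for row in rows:
        by_step[int(row["step"])].append(row["SMILES"])
    return {
        step: len(set(smiles)) / len(smiles)
        for step, smiles in sorted(by_step.items())
    }

# Illustrative rows, shaped like the output of csv.DictReader.
rows = [
    {"SMILES": "CCO", "step": "0"},
    {"SMILES": "CCN", "step": "0"},
    {"SMILES": "CCO", "step": "1"},
    {"SMILES": "CCO", "step": "1"},
]
trend = unique_fraction_per_step(rows)
print(trend)  # {0: 1.0, 1: 0.5}
```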
Plot Interactions:
- Zoom and Pan: Use mouse wheel and drag to explore
- Hover Information: Mouse over data points for detailed information
- Selection Tools: Click and drag to select specific regions
- Download Options: Save plots in various formats
- Data Export: Download underlying data for further analysis
Navigation:
- Dropdown Selections: Choose between different analysis results
- Tab Switching: Move between fragment and step-based views
- Sidebar Controls: Access analysis tools and settings
Recommended sequence for comprehensive analysis:
- Load your dataset with molecular structures
- Run t-SNE to create 2D chemical space visualization (see Step 3)
- Run individual scorers to understand specific aspects:
- Start with N-gram analysis for sequence patterns
- Follow with Fragment analysis for structural decomposition
- Add Scaffold analysis for core frameworks
- Run All Scorers for comprehensive diversity analysis
- Compare results across different scoring methods
- Export insights for further research
Using the Fragment Analysis Tab:
- Run fragment analysis first to generate fragment data
- Switch to 🎯 Fragment Analysis tab in the visualization
- Select specific fragments from analysis results
- View highlighted molecules containing those fragments
- Compare fragment distributions across your dataset
For REINVENT or optimization datasets:
Prerequisites:
- CSV with 'step' or 'stage' column indicating generation/optimization step
- 'Score' column for tracking optimization progress
Analysis steps:
- Load temporal dataset with step information
- Run comprehensive analysis (All Scorers recommended)
- Navigate to Per Step tab in results section
- Track diversity evolution over optimization steps
- Identify optimization phases:
- Exploration phase: High diversity, varied structures
- Exploitation phase: Lower diversity, focused optimization
- Convergence phase: Stable diversity around optimal regions
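One rough way to turn the three phases into labels is a threshold rule on a per-step diversity series. The sketch below is purely a heuristic for illustration: the function name, the thresholds (0.7 for "high" diversity, 0.05 for "stable") and the example series are all assumptions, and you would need to tune them to your own metric.

```python
def label_phases(diversity, high=0.7, tolerance=0.05):
    """Label each step from a per-step diversity series (values between 0 and 1)."""
    labels = []
    for i, value in enumerate(diversity):
        change = abs(value - diversity[i - 1]) if i > 0 else None
        if value >= high:
            labels.append("exploration")   # diversity is still high
        elif change is not None and change <= tolerance:
            labels.append("convergence")   # low diversity that has stopped moving
        else:
            labels.append("exploitation")  # low diversity that is still changing
    return labels

series = [0.9, 0.8, 0.5, 0.3, 0.28, 0.27]
phases = label_phases(series)
print(phases)
```

For this series the first two steps are labelled exploration, the next two exploitation and the last two convergence.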
Comparing different datasets or conditions:
- Analyze first dataset completely
- Export or screenshot key results
- Load second dataset and repeat analysis
- Compare diversity patterns between conditions
- Document differences in fragment distributions and trends
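When documenting differences, it can help to rank fragments by how much their frequency changes between the two datasets. The sketch below assumes you have per-fragment frequencies (fraction of molecules containing each fragment) for both runs; the numbers and the helper name frequency_shift are hypothetical.

```python
def frequency_shift(freq_a, freq_b):
    """Frequency difference (b minus a) per fragment, largest absolute shift first."""
    fragments = set(freq_a) | set(freq_b)
    shifts = {f: freq_b.get(f, 0.0) - freq_a.get(f, 0.0) for f in fragments}
    # Sort by size of the shift, then by name so the order is reproducible.
    return sorted(shifts.items(), key=lambda item: (-abs(item[1]), item[0]))

# Hypothetical per-fragment frequencies from two separate analyses.
run_a = {"c1ccccc1": 0.75, "C(=O)O": 0.25}
run_b = {"c1ccccc1": 0.25, "C1CCNCC1": 0.5}
shifts = frequency_shift(run_a, run_b)
for fragment, delta in shifts:
    print(f"{fragment}: {delta:+.2f}")
```

Fragments missing from one dataset are treated as having frequency 0 there, so newly appearing and disappearing fragments show up at the top of the list.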
- Clean SMILES: Ensure valid SMILES strings
- Sufficient diversity: Include at least 100 molecules for meaningful analysis
- Consistent format: Use standard SMILES notation
- File size: Large datasets (>10,000 molecules) may take longer to process
- Memory usage: Monitor system resources during analysis
- Browser responsiveness: Use modern browsers for best performance
- File not found: Check file path and permissions
- Invalid SMILES: Review molecular structures in your dataset
- Memory errors: Try smaller datasets or increase system memory
- Slow performance: Close other applications and browser tabs
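If memory or speed is a problem, one option is to analyze a random subset first. The sketch below is a generic, standard-library way to do that and is not a NaviDiv feature: subsample_csv is a hypothetical helper that keeps the header and a fixed number of randomly chosen rows while preserving their original order, which matters for datasets with a step column.

```python
import csv
import io
import random

def subsample_csv(csv_text, max_rows, seed=0):
    """Return CSV text with the header and at most max_rows randomly chosen rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    if len(rows) > max_rows:
        # Sample row indices, then sort them to keep the original row order.
        keep = sorted(random.Random(seed).sample(range(len(rows)), max_rows))
        rows = [rows[i] for i in keep]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Ten illustrative rows reduced to three.
full = "SMILES\n" + "\n".join("C" * n for n in range(1, 11)) + "\n"
small = subsample_csv(full, max_rows=3)
print(small)
```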
- "File not found": Verify the complete file path
- "File should be a CSV": Ensure file extension is .csv
- "No valid SMILES found": Check SMILES column formatting
Here's a complete example workflow using all the available screenshots:
- Start with Landing Page (screenshots/landing_page.png):
- Launch the app using streamlit run app.py
- Review the scoring function information
- Prepare your CSV file path
- Load Your Dataset (screenshots/after_loading_dataset.png):
- Enter the file path to your molecular CSV
- Click "Load File" and verify successful loading
- Confirm your data appears in the interface
- Run t-SNE Analysis (screenshots/after_TSNE.png):
- Click "Run t-SNE" in the sidebar
- Wait for processing completion
- Explore the 2D chemical space visualization
- Switch between All Molecules and Fragment Analysis tabs
- Perform Individual Analysis:
- N-gram Analysis (screenshots/after_running_ngram_scorer.png):
- Select "Ngram" from the scorer dropdown
- Execute analysis and review string pattern results
- Fragment Analysis (screenshots/after_running_fragment_scorer.png):
- Select "Fragments_default" from the dropdown
- Run analysis and explore structural decomposition results
- Comprehensive Analysis:
- Run "All Scorers" for complete diversity assessment
- Review results in both Per Fragment and Per Step tabs
- Compare different scoring method outputs
- Interpret and Export:
- Analyze fragment frequency distributions
- Export plots and data for research documentation
- Save insights for integration with other workflows
After completing your analysis:
- Export results for further processing
- Compare different datasets by loading new files
- Integrate with REINVENT4 for generative workflows
- Use programmatic API for automated analysis
For additional help:
- Check the main README for API documentation
- Review example datasets in the test_data directory
- Report issues on the GitHub repository
- Consult the scientific publication for methodology details
Happy analyzing! 🧬




