psMDHamdan/Mutation-Prediction-Model

🧬 Influenza HA Mutation Prediction Platform

An autonomous, end-to-end deep learning bioinformatics pipeline designed to forecast real-world amino acid (AA) substitutions in the hyper-variable regions of Influenza Hemagglutinin (HA).

This repository houses a pipeline built on Meta's ESM-2 (Evolutionary Scale Modeling) protein language model, fine-tuned on historical NCBI sequence records to predict the substitutions that drive viral escape.


🔬 System Biological Architecture

The project addresses the problem of continuous HA strain divergence with a 5-step pipeline:

1. Automated Data Acquisition & Clustering (data_fetcher.py)

  • Interfaces directly with the NCBI Entrez Database to pull thousands of historical and modern HA sequences.
  • Filters purely for biologically viable, full-length sequences (500–600 amino acids).
  • Applies CD-HIT clustering at a 95% identity threshold to collapse identical and near-identical sequences, so the neural network doesn't overfit on over-represented annual strains.
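
The length filter described above can be sketched as follows. This is a minimal illustration, not the code in data_fetcher.py: the function name `filter_full_length`, the dictionary input, and the choice of inclusive bounds are all assumptions made for the example.

```python
from typing import Dict

# Length window for full-length HA stated in the list above.
# Treating both bounds as inclusive is an assumption.
MIN_LEN, MAX_LEN = 500, 600


def filter_full_length(records: Dict[str, str]) -> Dict[str, str]:
    """Keep only sequences whose length lies in [MIN_LEN, MAX_LEN].

    `records` maps a (hypothetical) record identifier to its
    amino acid sequence.
    """
    return {
        name: seq
        for name, seq in records.items()
        if MIN_LEN <= len(seq) <= MAX_LEN
    }


if __name__ == "__main__":
    demo = {"full": "M" * 566, "fragment": "M" * 120}
    print(sorted(filter_full_length(demo)))
```

Filtering by length before clustering keeps fragments from becoming cluster representatives.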

2. Native Alignment & Entropy Mapping (bio_utils.py)

  • Runs MAFFT multiple sequence alignment so that every sequence shares a common coordinate system.
  • Filters out excessively gapped portions of the alignment.
  • Computes position-wise Shannon entropy and labels the top 25% highest-entropy positions as "hyper-variable regions." Conserved positions are excluded from the prediction targets so the model concentrates on actively mutating sites.
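
As a rough sketch of the entropy step, the snippet below computes per-column Shannon entropy, H = -Σ p·log2(p), over a toy alignment and returns the highest-entropy 25% of columns. The function names, the decision to skip gap characters, and the tie-breaking behavior are assumptions for illustration and may differ from bio_utils.py.

```python
import math
from collections import Counter
from typing import List


def column_entropy(column: str) -> float:
    """Shannon entropy (in bits) of one alignment column, ignoring gaps."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def hypervariable_positions(alignment: List[str], top_fraction: float = 0.25) -> List[int]:
    """Return the 0-based indices of the highest-entropy columns.

    `alignment` is a list of equal-length aligned sequences.
    """
    length = len(alignment[0])
    entropies = [
        column_entropy("".join(seq[i] for seq in alignment))
        for i in range(length)
    ]
    k = max(1, int(length * top_fraction))
    ranked = sorted(range(length), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    toy = ["ACDE", "ACDF", "ACEG", "ACEH"]
    print(hypervariable_positions(toy))
```

In the toy alignment the first two columns are fully conserved (entropy 0), the third has two equally frequent residues (1 bit), and the fourth has four (2 bits), so only the last column is selected.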

3. Chronological Temporal Splitting (dataset.py)

  • To prevent future-data memorization (data leakage), the pipeline implements strict Temporal Splitting:
    • Train: Ancestral Strains (Pre-2020)
    • Validation: Intermediate Strains (2021–2022)
    • Test: Modern Unseen Strains (2023+)
  • Builds masked sequence tensors, mapping each variable target position in the alignment back to its index in the raw (ungapped) sequence.
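
The year-based assignment above could look like the sketch below. The function names and the `(year, sequence)` record shape are hypothetical. Note that the listed ranges do not mention 2020, so this sketch leaves 2020 records unassigned rather than guessing which split they belong to.

```python
from typing import Dict, List, Optional, Tuple


def assign_split(year: int) -> Optional[str]:
    """Map a collection year to a split name using the ranges listed above."""
    if year < 2020:
        return "train"
    if 2021 <= year <= 2022:
        return "val"
    if year >= 2023:
        return "test"
    return None  # 2020 is not covered by the listed ranges


def temporal_split(records: List[Tuple[int, str]]) -> Dict[str, List[str]]:
    """Group (year, sequence) pairs into train/val/test by year."""
    splits: Dict[str, List[str]] = {"train": [], "val": [], "test": []}
    for year, seq in records:
        name = assign_split(year)
        if name is not None:
            splits[name].append(seq)
    return splits


if __name__ == "__main__":
    demo = [(2015, "MKA"), (2021, "MKT"), (2024, "MKV"), (2020, "MKS")]
    print(temporal_split(demo))
```

Splitting by date rather than at random means every test sequence is strictly newer than anything seen in training, which is what makes the test set a forecast rather than an interpolation.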

4. Two-Phase ESM-2 Fine-Tuning (model.py & train.py)

  • Uses the facebook/esm2_t6_8M_UR50D checkpoint, the smallest ESM-2 model (6 transformer layers, about 8M parameters).
  • Phase 1: Freezes the entire 8M-parameter body and trains a newly initialized linear classification head at a high learning rate (lr=1e-3) to map ESM-2 sequence embeddings to the 20 standard amino acids.
  • Phase 2: Unfreezes the final 2 transformer layers so the deeper evolutionary representations can be fine-tuned in context.

5. Rigorous Mathematical Evaluation (evaluate.py)

Provides metrics that check the model's predictions against fundamental statistical baselines.


📊 Production-Ready Validation Suite

[Figure: HA Mutation Benchmark]

(Above: The 20x20 confusion matrix of predicted vs. true amino acids for the ESM-2 model on 2023+ viral strains.)

📦 Pre-Trained Weights Available: The fine-tuned ESM-2 PyTorch checkpoint (10 epochs) is included in this repository as best_model.pt. No retraining required!

This system doesn't just report "accuracy." It also tests whether predictions beat randomized background frequencies by running the following evaluations inside Colab_Autonomous.ipynb:

  1. Baseline Confrontation (PSSM & Frequencies): The model is compared against Position-Specific Scoring Matrices (PSSM) and amino acid frequencies from the ancestral training set, using Top-1, Top-3, and Top-5 accuracy. Outcome: ESM-2 outperforms the naive historical-frequency baselines.

  2. LLR Statistical Significance (True vs Random): For every prediction on an unseen strain, the system computes the Log-Likelihood Ratio, LLR = log P(mutant) - log P(wild-type). Outcome: The model assigns consistently higher likelihood to true, observed 2023+ mutations than to randomly drawn amino acids.

  3. Expected Calibration Error (ECE): Checks that the probabilities output by the token classification head track actual accuracy instead of being overconfident on wrong answers.

  4. Per-Position Difficulty Dataframes & Confusion Heatmaps: Returns a 20x20 seaborn heatmap (Predicted vs True AA) alongside a data frame that pairs each position's accuracy with its Shannon entropy.
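
The first three evaluations can be illustrated with the sketch below, which works on plain Python values rather than model tensors. The function names, the equal-width binning scheme, and the default of 10 bins are assumptions for the example, not necessarily what evaluate.py does.

```python
import math
from typing import Dict, List, Sequence


def top_k_hit(probs: Dict[str, float], true_aa: str, k: int) -> bool:
    """True if the observed residue is among the k most probable residues."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return true_aa in ranked[:k]


def log_likelihood_ratio(probs: Dict[str, float], mutant: str, wild_type: str) -> float:
    """LLR = log P(mutant) - log P(wild-type) at one position."""
    return math.log(probs[mutant]) - math.log(probs[wild_type])


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted mean gap between confidence and accuracy over equal-width bins."""
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, conf in enumerate(confidences):
        # A confidence of exactly 1.0 goes into the last bin.
        bins[min(int(conf * n_bins), n_bins - 1)].append(i)
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Hypothetical predicted distribution at one position (truncated to 3 residues).
    probs = {"K": 0.6, "N": 0.3, "A": 0.1}
    print(top_k_hit(probs, "N", k=3))
    print(log_likelihood_ratio(probs, mutant="K", wild_type="N"))
    print(expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False]))
```

A positive LLR means the model considers the mutant residue more likely than the wild-type residue at that position; an ECE near zero means the stated confidences match the observed hit rate.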


🚀 How to Run

The quickest way to run the pipeline is the GPU-backed Colab notebook:

  1. Open Colab_Autonomous.ipynb in Google Colab.
  2. Select Runtime -> Run All.
  3. The notebook will automatically install system requirements (mafft, cd-hit), download the raw NCBI sequence records, and run the deep learning pipeline.
  4. Review Cell 6, which renders the LLR metrics and the 20x20 confusion matrix, and prints the [FINAL VERDICT].

(To run on local hardware, install the Python packages biopython, torch, transformers, pandas, and seaborn, plus the mafft and cd-hit command-line tools. Then execute python3 src/main.py --subtype H1N1 --max_records 50000 --epochs 10.)
