An autonomous, end-to-end, Deep Learning bioinformatics pipeline designed to forecast real-world amino acid (AA) substitutions in the hyper-variable regions of Influenza Hemagglutinin (HA).
This repository houses a production-ready architecture powered by Meta's ESM-2 (Evolutionary Scale Modeling) Protein Language Model, rigorously fine-tuned to understand and predict viral escape dynamics natively from historical NCBI Sequence databases.
The project systematically solves the biological challenge of continuous HA strain divergence by utilizing a highly constrained, algorithmically rigorous 5-step pipeline:
- Interfaces directly with the NCBI Entrez Database to pull thousands of historical and modern HA sequences.
- Filters purely for biologically viable, full-length sequences (500–600 amino acids).
- Applies CD-HIT (clustering at a 95% sequence identity threshold) to remove redundant, near-identical sequences, so the neural network doesn't overfit on over-represented annual strains.
- Natively executes MAFFT multiple sequence alignment to computationally standardize the dataset.
- Filters out excessively gapped alignments.
- Computes position-wise Shannon entropy to identify the top 25% most variable alignment positions (the "hyper-variable regions"). The model ignores conserved positions and concentrates exclusively on actively mutating sites.
- To prevent future-data memorization (data leakage), the pipeline implements strict Temporal Splitting:
  - Train: Ancestral Strains (Pre-2020)
  - Validation: Intermediate Strains (2021–2022)
  - Test: Modern Unseen Strains (2023+)
- Dynamically creates masked continuous sequence tensors mapping aligned variable targets back to their raw index locations.
- Leverages the `facebook/esm2_t6_8M_UR50D` transformer architecture (the 6-layer, 8M-parameter ESM-2 checkpoint).
  - Phase 1: Freezes the entire 8M-parameter body and trains a freshly initialized linear classification head at a high learning rate (`lr=1e-3`) to map the model's learned representations to the 20 standard amino acid classes.
  - Phase 2: Unfreezes the final 2 transformer layers to fine-tune the deeper evolutionary representations in context.
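Two of the data-preparation steps in the list above, the entropy ranking and the temporal split, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's implementation: the function names (`column_entropy`, `hypervariable_positions`, `temporal_split`) are hypothetical, gaps are assumed to be written as `-` and excluded from the entropy calculation, and records are assumed to arrive as `(year, sequence)` pairs. The year boundaries are taken literally from the list, so 2020 falls in none of the stated ranges and is left unassigned here.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column, ignoring gaps."""
    residues = [aa for aa in column if aa != "-"]  # assumption: '-' marks a gap
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def hypervariable_positions(aligned_seqs, top_fraction=0.25):
    """Return (sorted indices of the most variable columns, all column entropies)."""
    length = len(aligned_seqs[0])
    entropies = [
        column_entropy([seq[i] for seq in aligned_seqs]) for i in range(length)
    ]
    n_keep = max(1, int(length * top_fraction))
    ranked = sorted(range(length), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:n_keep]), entropies


def temporal_split(records):
    """Split (year, sequence) records by collection year.

    Boundaries follow the list above literally; 2020 matches none of the
    stated ranges, so those records are dropped in this sketch.
    """
    splits = {"train": [], "val": [], "test": []}
    for year, seq in records:
        if year < 2020:
            splits["train"].append(seq)
        elif 2021 <= year <= 2022:
            splits["val"].append(seq)
        elif year >= 2023:
            splits["test"].append(seq)
    return splits


# Toy alignment: only columns 1 and 2 vary, and column 2 varies the most.
toy_alignment = ["MKAT", "MKGT", "MRCT", "MK-T"]
positions, entropies = hypervariable_positions(toy_alignment)

toy_records = [(2015, "MKAT"), (2020, "MKGT"), (2021, "MRCT"), (2023, "MKST")]
splits = temporal_split(toy_records)
```

Splitting by year rather than at random means the test set contains only strains that postdate everything seen in training, which is what makes the evaluation a forecast and not an interpolation.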
Provides production-grade metrics that check biological viability against fundamental statistical benchmarks.
(Above: The advanced tracking confusion matrix illustrating the ESM-2 model's 20x20 amino acid substitution mapping logic against live 2023+ viral strains.)
📦 Pre-Trained Weights Available: The fully fine-tuned 10-epoch PyTorch ESM-2 model is included in this repository as `best_model.pt`. No retraining required!
This system doesn't just report "accuracy": it tests for statistical significance over randomized background frequencies by running the following evaluations inside `Colab_Autonomous.ipynb`:
- Baseline Confrontation (PSSM & Frequencies): The model is compared against Position-Specific Scoring Matrices (PSSM) and ancestral training frequencies. Top-1, Top-3, and Top-5 accuracies are computed. Outcome: ESM-2 outperforms naive historical frequency benchmarks.
- LLR Statistical Significance (True vs Random): For every prediction on an unseen strain, the system computes the Log-Likelihood Ratio (`LLR = log P(mutant) - log P(wild-type)`). Outcome: The model assigns consistently higher likelihood to true, observed 2023+ mutations than to randomly generated amino acid noise.
- Expected Calibration Error (ECE): Verifies that the probabilities output by the token classification head reflect actual accuracy, so the model is not wildly overconfident on wrong answers.
- Per-Position Difficulty Dataframes & Confusion Heatmaps: Returns a 20x20 seaborn heatmap (Predicted vs True AA) alongside a raw data frame that maps each position's accuracy to its Shannon entropy.
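The first three evaluations above reduce to short formulas, sketched here in plain Python. This is an illustrative sketch and not the notebook's code: the function names are hypothetical, each prediction is assumed to be a dictionary mapping amino acid letters to probabilities, and the ECE uses the common equal-width binning with 10 bins, which may differ from the binning the notebook uses.

```python
import math


def top_k_accuracy(prob_rows, true_labels, k):
    """Fraction of examples whose true residue is among the k most probable."""
    hits = 0
    for probs, label in zip(prob_rows, true_labels):
        top_k = sorted(probs, key=probs.get, reverse=True)[:k]
        hits += label in top_k
    return hits / len(true_labels)


def log_likelihood_ratio(probs, mutant, wild_type):
    """LLR = log P(mutant) - log P(wild-type) at a single position."""
    return math.log(probs[mutant]) - math.log(probs[wild_type])


def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence, then take the size-weighted mean
    of |accuracy - average confidence| over the non-empty bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Toy predictions over a three-letter alphabet (real outputs cover 20 residues).
rows = [
    {"A": 0.65, "G": 0.25, "S": 0.10},
    {"A": 0.15, "G": 0.55, "S": 0.30},
]
labels = ["A", "S"]

top1 = top_k_accuracy(rows, labels, k=1)
top2 = top_k_accuracy(rows, labels, k=2)
llr = log_likelihood_ratio(rows[0], mutant="G", wild_type="A")
ece = expected_calibration_error([0.65, 0.55], [True, False])
```

A positive LLR means the model finds the substitution more likely than keeping the wild-type residue. Comparing the LLR distribution of observed mutations against that of random substitutions is what separates a real signal from background frequency effects.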
To run the pipeline out of the box on a cloud GPU, use the Colab notebook:
- Open `Colab_Autonomous.ipynb` in Google Colab.
- Select Runtime -> Run All.
- The platform will automatically install system requirements (`mafft`, `cd-hit`), pull raw NCBI binaries, and orchestrate the Deep Learning pipeline.
- Review Cell 6, which renders the rigorous Mathematical LLR Metrics, 20x20 Confusion Matrix, and outputs the exact `[FINAL VERDICT]`.
(To operate natively on local hardware, install `biopython`, `torch`, `transformers`, `pandas`, `seaborn`, `mafft`, and `cd-hit`. Execute `python3 src/main.py --subtype H1N1 --max_records 50000 --epochs 10`.)
