ModernGENA is a DNA foundation model family based on a ModernBERT-style encoder adapted for genomic sequence modeling.
We are releasing modernGENA base and large, designed to be faster and more efficient than our previous generation of GENA models.
Start from our modernGENA examples and see the paper *Back to BERT in 2026: ModernGENA as a Strong, Efficient Baseline for DNA Foundation Models*.
| Model | Hugging Face |
|---|---|
| modernGENA base | AIRI-Institute/moderngena-base |
| modernGENA large | AIRI-Institute/moderngena-large |
- ModernBERT-based encoder architecture
- Upsampling of regulatory and coding regions during pretraining
- Hybrid local/global attention
- RoPE positional embeddings
- End-to-end unpadding
- FlashAttention-based efficient inference on compatible hardware
- Same 32k BPE tokenizer as GENA-LM for straightforward transition from previous GENA workflows
- 443 vertebrate genome assemblies
- 353,574,093,776 bp total
- Includes both forward strand and reverse complement sequences
- Excludes sequences containing ambiguous symbols other than A/C/G/T
- Sampling window: [-16 kbp, +8 kbp] around each unique TSS
- Overlapping intervals merged with BEDTools, both strands included
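The preprocessing steps listed above can be illustrated with a small, pure-Python sketch. It is not the authors' pipeline: the function names are hypothetical, coordinates are treated as 0-based half-open on a single chromosome, and it assumes that the -16 kbp side of the window is upstream relative to the direction of transcription. The merge step only mimics the default behaviour of `bedtools merge` (overlapping and book-ended intervals are joined).

```python
from typing import Iterable, List, Tuple

# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_unambiguous(seq: str) -> bool:
    """True if the sequence is non-empty and contains only A, C, G, T."""
    return bool(seq) and set(seq.upper()) <= set("ACGT")


def tss_window(tss: int, strand: str,
               upstream: int = 16_000, downstream: int = 8_000) -> Tuple[int, int]:
    """Half-open [start, end) window around a TSS.

    Assumption: 'upstream' is measured against the direction of
    transcription, so the window is mirrored for '-' strand genes.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    else:
        start, end = tss - downstream, tss + upstream
    return max(start, 0), end  # clip at the chromosome start


def merge_intervals(intervals: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping or book-ended intervals on one chromosome."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of opening a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Two TSSs at the same position on opposite strands yield one merged region.
windows = [tss_window(20_000, "+"), tss_window(20_000, "-")]
print(merge_intervals(windows))  # [(4000, 36000)]
```

In a real workflow the merged regions would be extracted from each assembly, filtered with a check like `is_unambiguous`, and emitted together with their reverse complements.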
Inference efficiency on an NVIDIA A100 (80 GB). Models from the primary baseline set are benchmarked. Throughput is averaged over 10 timing runs per model.
ModernGENA improves downstream performance on NT bench and leads among comparable-size models. Circles denote task-specific ranks, while diamonds indicate the average rank across tasks.
import importlib.util
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("AIRI-Institute/gena-lm-bert-base-t2t")
model_kwargs = {"trust_remote_code": True}
if importlib.util.find_spec("flash_attn") is not None:
    model_kwargs["attn_implementation"] = "flash_attention_2"
model = AutoModel.from_pretrained("AIRI-Institute/moderngena-base", **model_kwargs)

Swap the model name to AIRI-Institute/moderngena-large to use modernGENA large.
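Once a model is loaded, a common next step is to turn DNA sequences into fixed-size embeddings. The following is a minimal sketch rather than an official recipe. `build_model_kwargs` and `embed_sequences` are hypothetical helper names, and the sketch assumes that the first element of the model output holds padded last hidden states of shape (batch, tokens, hidden), as standard Hugging Face encoders return. It uses the default attention implementation so that it can also run on CPU.

```python
import importlib.util


def build_model_kwargs(use_flash_attention=None):
    """Mirror the loading logic above: opt into FlashAttention only if available."""
    if use_flash_attention is None:
        use_flash_attention = importlib.util.find_spec("flash_attn") is not None
    kwargs = {"trust_remote_code": True}
    if use_flash_attention:
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs


def embed_sequences(sequences, model_name="AIRI-Institute/moderngena-base"):
    """Return one mean-pooled embedding per DNA sequence (illustrative helper)."""
    # Heavy dependencies are imported here so build_model_kwargs can be
    # reused without PyTorch installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("AIRI-Institute/gena-lm-bert-base-t2t")
    # Default attention keeps the sketch hardware-agnostic.
    model = AutoModel.from_pretrained(
        model_name, **build_model_kwargs(use_flash_attention=False)
    )
    model.eval()

    batch = tokenizer(list(sequences), return_tensors="pt", padding=True)
    with torch.no_grad():
        # Pass only the two inputs every encoder accepts.
        outputs = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
        )

    hidden = outputs[0]  # assumed: (batch, tokens, hidden) last hidden states
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    # Average over real tokens only, ignoring padding positions.
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
```

Calling `embed_sequences(["ACGTACGTAGCTAGCTAGCTA", "GGCCTTAAGGCC"])` should then return a tensor with one row per input sequence, provided the output-shape assumption holds for this model.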
See examples/modernGENA for:
- sequence classification (sequence_classification/)
- token regression (token_regression/)
From the repository root:
conda env create -f examples/modernGENA/environment.yml
conda activate moderngena-example
bash examples/modernGENA/sequence_classification/download_and_prepare_data.sh
python examples/modernGENA/sequence_classification/train.py
