AIRI-Institute/GENA_LM

modernGENA

ModernGENA is a DNA foundation model family based on a ModernBERT-style encoder adapted for genomic sequence modeling.

We are releasing modernGENA base and large, designed to be faster and more efficient than our previous generation of GENA models.

Start with our modernGENA examples and see the paper "Back to BERT in 2026: ModernGENA as a Strong, Efficient Baseline for DNA Foundation Models".

Pre-trained models

Model            | Hugging Face
-----------------|--------------------------------
modernGENA base  | AIRI-Institute/moderngena-base
modernGENA large | AIRI-Institute/moderngena-large

Technical features

  • ModernBERT-based encoder architecture
  • Regulatory and coding regions upsampling during pretraining
  • Hybrid local/global attention
  • RoPE positional embeddings
  • End-to-end unpadding
  • FlashAttention-based efficient inference on compatible hardware
  • Same 32k BPE tokenizer as GENA-LM for straightforward transition from previous GENA workflows
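
The hybrid local/global attention listed above can be illustrated with a small sketch. This is a generic ModernBERT-style pattern for illustration only; the actual layer layout, window size, and mask construction in modernGENA may differ, and the helper names are hypothetical:

```python
def attention_mask(seq_len, window, is_global):
    """Boolean mask: mask[i][j] is True if position i may attend to position j."""
    if is_global:
        # Global layers: full all-to-all attention
        return [[True] * seq_len for _ in range(seq_len)]
    # Local layers: each position attends only within a sliding window
    return [[abs(i - j) <= window // 2 for j in range(seq_len)]
            for i in range(seq_len)]

def layer_pattern(num_layers, global_every=3):
    """Alternate layer types, e.g. one global layer every `global_every` layers."""
    return [layer % global_every == 0 for layer in range(num_layers)]

local = attention_mask(8, window=4, is_global=False)
glob = attention_mask(8, window=4, is_global=True)
```

Local layers keep attention cost linear in sequence length, while the periodic global layers let distant genomic positions exchange information.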

Pretraining corpus

  • 443 vertebrate genome assemblies
  • 353,574,093,776 bp total
  • Includes both forward strand and reverse complement sequences
  • Excludes sequences containing ambiguity symbols (any character other than A/C/G/T)
  • Sampling window: [-16 kbp, +8 kbp] around each unique transcription start site (TSS)
  • Overlapping intervals merged with BEDTools, both strands included
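
The TSS-centred sampling described above can be sketched as follows. This is a simplified illustration, not the actual pipeline (interval merging in the real corpus build used BEDTools, and the helper names here are hypothetical):

```python
def tss_window(tss, strand, upstream=16_000, downstream=8_000):
    """[-16 kbp, +8 kbp] window around a TSS, strand-aware, clipped at 0."""
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    else:
        # On the minus strand, "upstream" points toward larger coordinates
        start, end = tss - downstream, tss + upstream
    return max(start, 0), end

def merge_intervals(intervals):
    """Merge overlapping or book-ended [start, end) intervals,
    analogous to `bedtools merge` with default settings."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For example, two genes whose windows overlap contribute a single merged interval to the corpus rather than duplicated sequence.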

Benchmarking

Figure 1: Inference efficiency on an NVIDIA A100 (80 GB)

Inference efficiency on an NVIDIA A100 (80 GB). Models from the primary baseline set are benchmarked. Throughput is averaged over 10 timing runs per model.

Figure 2: ModernGENA task-rank distribution

ModernGENA improves downstream performance on the NT benchmark and leads among models of comparable size. Circles denote task-specific ranks; diamonds indicate the average rank across tasks.

Quick start

Load a pretrained model

```python
import importlib.util

from transformers import AutoTokenizer, AutoModel

# modernGENA reuses the GENA-LM 32k BPE tokenizer
tokenizer = AutoTokenizer.from_pretrained("AIRI-Institute/gena-lm-bert-base-t2t")

model_kwargs = {"trust_remote_code": True}
# Enable FlashAttention 2 when the flash-attn package is installed
if importlib.util.find_spec("flash_attn") is not None:
    model_kwargs["attn_implementation"] = "flash_attention_2"

model = AutoModel.from_pretrained("AIRI-Institute/moderngena-base", **model_kwargs)
```

Swap the model name to AIRI-Institute/moderngena-large to use modernGENA large.

Run training examples

See examples/modernGENA for:

  • sequence classification (sequence_classification/)
  • token regression (token_regression/)

From the repository root:

```shell
# Create and activate the example environment
conda env create -f examples/modernGENA/environment.yml
conda activate moderngena-example

# Download the data and launch sequence-classification fine-tuning
bash examples/modernGENA/sequence_classification/download_and_prepare_data.sh
python examples/modernGENA/sequence_classification/train.py
```

Citation

About

GENA-LM is a family of transformer masked language models trained on human DNA sequences.
