AIRI-Institute/GENA_LM

modernGENA

ModernGENA is a DNA foundation model family based on a ModernBERT-style encoder adapted for genomic sequence modeling.

We are releasing modernGENA base and large, designed to be faster and more efficient than our previous generation of GENA models.

Start with our modernGENA examples and see the paper "Back to BERT in 2026: ModernGENA as a Strong, Efficient Baseline for DNA Foundation Models".

Pre-trained models

Model            | Hugging Face
-----------------|--------------------------------
modernGENA base  | AIRI-Institute/moderngena-base
modernGENA large | AIRI-Institute/moderngena-large

Technical features

  • ModernBERT-based encoder architecture
  • Regulatory and coding regions upsampling during pretraining
  • Hybrid local/global attention
  • RoPE positional embeddings
  • End-to-end unpadding
  • FlashAttention-based efficient inference on compatible hardware
  • Same 32k BPE tokenizer as GENA-LM for straightforward transition from previous GENA workflows
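
The hybrid local/global attention listed above can be illustrated with a small sketch. This is a generic ModernBERT-style pattern for illustration only; the actual layer layout, window size, and mask construction in modernGENA may differ, and the helper names are hypothetical:

```python
def attention_mask(seq_len, window, is_global):
    """Boolean mask: mask[i][j] is True if position i may attend to position j."""
    if is_global:
        # Global layers: full all-to-all attention
        return [[True] * seq_len for _ in range(seq_len)]
    # Local layers: each position attends only within a sliding window
    return [[abs(i - j) <= window // 2 for j in range(seq_len)]
            for i in range(seq_len)]

def layer_pattern(num_layers, global_every=3):
    """Alternate layer types, e.g. one global layer every `global_every` layers."""
    return [layer % global_every == 0 for layer in range(num_layers)]

local = attention_mask(8, window=4, is_global=False)
glob = attention_mask(8, window=4, is_global=True)
```

Local layers keep attention cost linear in sequence length, while the periodic global layers let distant genomic positions exchange information.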

Pretraining corpus

  • 443 vertebrate genome assemblies
  • 353,574,093,776 bp total
  • Includes both forward strand and reverse complement sequences
  • Excludes sequences containing ambiguity symbols (any character other than A/C/G/T)
  • Sampling window: [-16 kbp, +8 kbp] around each unique transcription start site (TSS)
  • Overlapping intervals merged with BEDTools, both strands included
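
The TSS-centred sampling described above can be sketched as follows. This is a simplified illustration, not the actual pipeline (interval merging in the real corpus build used BEDTools, and the helper names here are hypothetical):

```python
def tss_window(tss, strand, upstream=16_000, downstream=8_000):
    """[-16 kbp, +8 kbp] window around a TSS, strand-aware, clipped at 0."""
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    else:
        # On the minus strand, "upstream" points toward larger coordinates
        start, end = tss - downstream, tss + upstream
    return max(start, 0), end

def merge_intervals(intervals):
    """Merge overlapping or book-ended [start, end) intervals,
    analogous to `bedtools merge` with default settings."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For example, two genes whose windows overlap contribute a single merged interval to the corpus rather than duplicated sequence.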

Benchmarking

Figure 1: Inference efficiency on an NVIDIA A100 (80 GB)

Inference efficiency on an NVIDIA A100 (80 GB). Models from the primary baseline set are benchmarked. Throughput is averaged over 10 timing runs per model.

Figure 2: ModernGENA task-rank distribution

ModernGENA improves downstream performance on the NT benchmark and leads among models of comparable size. Circles denote task-specific ranks; diamonds indicate the average rank across tasks.

Quick start

Load a pretrained model

```python
import importlib.util

from transformers import AutoTokenizer, AutoModel

# modernGENA reuses the GENA-LM 32k BPE tokenizer
tokenizer = AutoTokenizer.from_pretrained("AIRI-Institute/gena-lm-bert-base-t2t")

model_kwargs = {"trust_remote_code": True}
# Enable FlashAttention 2 when the flash-attn package is installed
if importlib.util.find_spec("flash_attn") is not None:
    model_kwargs["attn_implementation"] = "flash_attention_2"

model = AutoModel.from_pretrained("AIRI-Institute/moderngena-base", **model_kwargs)
```

Swap the model name to AIRI-Institute/moderngena-large to use modernGENA large.

Run training examples

See examples/modernGENA for:

  • sequence classification (sequence_classification/)
  • token regression (token_regression/)

From the repository root:

```shell
# Create and activate the example environment
conda env create -f examples/modernGENA/environment.yml
conda activate moderngena-example

# Download the data and launch sequence-classification fine-tuning
bash examples/modernGENA/sequence_classification/download_and_prepare_data.sh
python examples/modernGENA/sequence_classification/train.py
```

Citation

About

GENA-LM is a family of transformer masked language models trained on human DNA sequences.
