Implementation of NOVA: Non-Contrastive Vision-Language Learning with Predictive Embedding Alignment on top of stable-pretraining.
NOVA trains a randomly initialized ViT image encoder to predict embeddings from a frozen ClinicalBERT text encoder. It uses MSE alignment to the text anchor plus joint SIGReg regularization over all image-view predictions and the text embedding:
loss = (1 - lambda) * MSE(image_views, text_anchor) + lambda * SIGReg([image_views, text_anchor])
The code follows the neighboring LeVLJEPA-release structure, but replaces the learnable text encoder/cross-prediction setup with the paper's frozen ClinicalBERT target stack and MIMIC-style radiology datasets.
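The objective above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in forwards.py: `mse`, `nova_loss`, and the `regularizer` argument are hypothetical names, averaging the MSE over image views is an assumption, and SIGReg itself is not implemented here (any callable that maps a list of embeddings to a scalar stands in for it).

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two embeddings of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def nova_loss(
    image_views: List[Vector],
    text_anchor: Vector,
    regularizer: Callable[[List[Vector]], float],
    lam: float = 0.02,
) -> float:
    """Combine alignment and regularization as (1 - lam) * MSE + lam * reg.

    `regularizer` is a stand-in for SIGReg (assumption: it takes the image-view
    predictions together with the text embedding and returns a scalar).
    """
    # Alignment: every image-view prediction is pulled toward the text anchor.
    # Averaging over views is an assumption of this sketch.
    align = sum(mse(v, text_anchor) for v in image_views) / len(image_views)
    # Regularization is computed jointly over all views plus the text anchor.
    reg = regularizer(list(image_views) + [text_anchor])
    return (1.0 - lam) * align + lam * reg


# Toy usage with a dummy regularizer that returns a constant.
views = [[1.0, 0.0], [0.0, 1.0]]
anchor = [1.0, 1.0]
loss = nova_loss(views, anchor, regularizer=lambda embs: 2.0, lam=0.02)
```

With lambda=0.02 the alignment term carries 98% of the weight, so the regularizer acts as a small correction rather than the main training signal.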
- main.py - Hydra/stable-pretraining training entry point
- forwards.py - NOVA, InfoNCE, and SigLIP forward/loss functions
- callbacks.py - gradient clipping, embedding diagnostics, checkpointing, zero-shot eval
- utils/dataset.py - MIMIC/CheXpert/ChestX-ray14 manifest datasets and augmentations
- utils/eval.py - binary prompt zero-shot AUC evaluation
- configs/ - ViT-S/ViT-B and objective configs
uv sync
For local framework development, install the parent checkout instead:
uv pip install -e ../stable-pretraining
Training is manifest-driven so protected datasets stay outside the repo. A training CSV/parquet/jsonl needs at least:
image_path,impression,ViewPosition
p10/p10000032/s50414267/xxx.jpg,"No acute cardiopulmonary abnormality.",PA
If the impression column is missing, set data.report_col to a column containing the full radiology report; the loader then extracts the IMPRESSION section.
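As a rough idea of what extracting the IMPRESSION section can look like, here is a regex-based sketch. It is an assumption about the approach, not the loader in utils/dataset.py: `extract_impression` is a hypothetical helper, and it assumes section headers are upper-case words followed by a colon at the start of a line.

```python
import re

# Assumption: a section header is an upper-case phrase ending in ':' at line start.
_NEXT_SECTION = re.compile(r"^\s*([A-Z][A-Z /]+):", re.MULTILINE)


def extract_impression(report: str) -> str:
    """Return the text of the IMPRESSION section, or '' if there is none."""
    start = re.search(r"IMPRESSION:\s*", report, flags=re.IGNORECASE)
    if start is None:
        return ""
    rest = report[start.end():]
    # Stop at the next section header, if any follows the impression.
    nxt = _NEXT_SECTION.search(rest)
    text = rest[: nxt.start()] if nxt else rest
    # Collapse line breaks and repeated whitespace into single spaces.
    return " ".join(text.split())


report = (
    "FINDINGS: Lungs are clear.\n\n"
    "IMPRESSION: No acute cardiopulmonary\nabnormality.\n\n"
    "RECOMMENDATION: None."
)
impression = extract_impression(report)
```

Reports without an IMPRESSION header yield an empty string here; whether such rows should be dropped or fall back to another section is a design choice this sketch leaves open.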
Evaluation manifests need an image path and binary label columns. CheXpert-style uncertain labels (-1) are treated as negative by default.
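The label handling described above can be illustrated with a small sketch. `binarize_label` and its `uncertain_as_positive` flag are hypothetical names, and treating blank cells as negative is an extra assumption that the text above does not state.

```python
import csv
import io


def binarize_label(raw, uncertain_as_positive=False):
    """Map a CheXpert-style cell (1, 0, -1, or blank) to a 0/1 label."""
    if raw is None or str(raw).strip() == "":
        return 0  # assumption: blank (unmentioned) counts as negative
    value = float(raw)
    if value == -1.0:
        # Uncertain labels are negative unless the caller opts in.
        return 1 if uncertain_as_positive else 0
    return 1 if value == 1.0 else 0


# A tiny in-memory evaluation manifest: image path plus binary label columns.
manifest = io.StringIO(
    "image_path,Edema,Cardiomegaly\n"
    "a.jpg,1.0,-1.0\n"
    "b.jpg,0.0,\n"
)
label_cols = ["Edema", "Cardiomegaly"]
labels = [
    [binarize_label(row[col]) for col in label_cols]
    for row in csv.DictReader(manifest)
]
```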
python main.py \
  data.train_manifest=/path/to/mimic_train.csv \
  data.image_root=/path/to/images \
  run_name=nova_vitb
ViT-S:
python main.py model=small run_name=nova_vits
Multi-GPU is handled by Lightning:
python main.py devices=8 batch_size=256
The default configs/nova.yaml matches the paper setup:
- frozen emilyalsentzer/Bio_ClinicalBERT
- ViT-B/16 from scratch
- embedding dimension 64
- predictor hidden width 2048
- 2 global crops at 224, 6 local crops at 96
- AdamW, cosine decay 1e-4 -> 1e-5
- batch size 256, 100 epochs
- lambda=0.02, gradient clipping 1.0, bf16 mixed precision
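To make the "cosine decay 1e-4 -> 1e-5" entry concrete, the schedule can be written as a one-line formula. This is a sketch under assumptions: `cosine_lr` is a hypothetical helper, the decay is taken per step over the whole run, and no warmup is modeled (the list above does not mention one either way).

```python
import math


def cosine_lr(step, total_steps, lr_max=1e-4, lr_min=1e-5):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    # Clamp progress to [0, 1] so steps past the end stay at lr_min.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


start_lr = cosine_lr(0, 1000)
mid_lr = cosine_lr(500, 1000)
end_lr = cosine_lr(1000, 1000)
```

The learning rate starts at 1e-4, passes through the midpoint 5.5e-5 halfway through training, and ends at 1e-5.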
Add datasets under evals in a config or CLI override. Example:
evals:
  - name: chexpert
    enabled: true
    manifest: /path/to/chexpert_test.csv
    image_root: /path/to/CheXpert-v1.0
    image_col: image_path
    label_cols: [Atelectasis, Cardiomegaly, Edema, Pleural Effusion, Consolidation]
    positive_prompts: [atelectasis, cardiomegaly, edema, pleural effusion, consolidation]
    negative_prompts: [no atelectasis, no cardiomegaly, no edema, no pleural effusion, no consolidation]
The callback reports per-label AUC and macro AUC every eval_every_n_steps.
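For intuition, binary-prompt zero-shot scoring and the per-label/macro AUC can be sketched as follows. This is not the code in utils/eval.py: `cosine`, `prompt_score`, `auc`, and `macro_auc` are hypothetical helpers, and scoring an image by the difference between its similarity to the positive and negative prompt embeddings is an assumption about how the two prompts are compared.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def prompt_score(image_emb, pos_emb, neg_emb):
    """Higher when the image is closer to the positive prompt than the negative."""
    return cosine(image_emb, pos_emb) - cosine(image_emb, neg_emb)


def auc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(per_label):
    """Unweighted mean of the per-label AUCs."""
    return sum(per_label.values()) / len(per_label)


edema_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
summary = macro_auc({"Edema": edema_auc, "Cardiomegaly": 1.0})
```

The pairwise AUC is quadratic in the number of samples, which is fine for illustration; a rank-based implementation is the usual choice for full test sets.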
The same frozen ClinicalBERT + ViT stack can train the comparison objectives:
python main.py --config-name infonce
python main.py --config-name siglip
python main.py --config-name medclip
These are intentionally single-crop objectives, matching the paper's distinction from NOVA's multi-crop training.
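As a point of reference for the contrastive baselines, a symmetric InfoNCE loss over one image embedding per report can be sketched in plain Python. This is the textbook formulation, not the code in forwards.py: `log_softmax` and `info_nce` are hypothetical names, the input is assumed to be a square image-by-text similarity matrix with matched pairs on the diagonal, and the temperature of 0.07 is an assumed default.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE over an n x n image-by-text similarity matrix."""
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    # Image-to-text: each row should put its mass on the diagonal entry.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: same, over the columns of the matrix.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


uninformative = info_nce([[0.0, 0.0], [0.0, 0.0]])  # all pairs look the same
well_aligned = info_nce([[1.0, 0.0], [0.0, 1.0]])   # matched pairs stand out
```

When all similarities are equal the loss equals log(n), and it approaches zero as matched pairs dominate their row and column.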