This repository contains the implementation behind our submission to the closed-world track of the "What Happens Next?" challenge (École Polytechnique, course CSC_43M04_EP), where it finished first out of 35 participants. The task is to anticipate the action that is about to occur from only the four frames covering the first forty percent of a Something-Something-V2 clip. The closed-world setting forbids any external data or pretrained weights: only the data provided by the challenge may be used, which makes the problem an exercise in data efficiency.
Authors: Baptiste Duperray, Clarisse Douysset. The full report is in docs/report_en.pdf, a condensed version in docs/methodology.md, and the complete list of experiments in docs/exploration.md.
The data is a 33-class subset of Something-Something-V2 (class 27 is empty, so 32 classes are populated): 44,993 training clips, 6,745 validation clips and 6,912 test clips. Each clip is represented by four 224x224 JPEG frames taken from the first forty percent of the source video, so the problem is one of anticipation rather than recognition: the model must infer the upcoming action from a very short visual prefix. The evaluation metric is top-1 accuracy.
The closed-world rule is the central constraint: no external corpus and no pretrained weights are allowed, so every representation must be learned from the challenge data alone. Two further constraints shaped the work. First, the GPU budget is limited: jobs run on a shared university cluster, one card per job and with a fixed deadline, which bounds the number of self-supervised epochs that can be accumulated, since a full pretraining over the whole corpus takes several days. Second, the modest dataset size, about 52,000 labelled clips, limits the number of parameters that can be trained before models begin to memorise rather than generalise. Both constraints favour data-efficient methods over raw capacity.
Before adding capacity, we asked what kind of errors the best from-scratch model actually makes. Decomposing the 3,770 validation errors of an EfficientFormer-L encoder (validation accuracy 0.4411) into interpretable categories gives the following picture.
| Error type | Share |
|---|---|
| Temporal inversion (Folding/Unfolding, Opening/Closing, Covering/Uncovering) | 3.9% |
| Left/right mirror (classes 18 and 19) | 0.0% |
| Pretending versus real action (an intrinsic four-frame ceiling) | 4.0% |
| Fine visual confusions | 92.1% |
The conclusion is that the bottleneck is neither the temporal head nor the ordering of the frames, but the
quality of the visual representation; the mirror confusions are already at zero thanks to a label-aware flip
described below. The entire effort therefore went into learning better representations from the challenge
data, rather than into more elaborate sequence models. This analysis is reproduced by
whatnext.analysis.error_decomposition.
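
A minimal sketch of how such a decomposition can be computed from (true, predicted) label pairs. Only the 18/19 mirror pair comes from the text; the other class ids, the category names and both function names are hypothetical placeholders, not the mapping or the interface used by whatnext.analysis.error_decomposition.

```python
from collections import Counter

# Only the 18/19 mirror pair is taken from the README. Every other class id
# below is a HYPOTHETICAL placeholder chosen for illustration.
MIRROR_PAIRS = {frozenset({18, 19})}
TEMPORAL_PAIRS = {frozenset({0, 1}), frozenset({2, 3})}  # e.g. Opening/Closing (assumed ids)
PRETEND_PAIRS = {frozenset({4, 5})}                      # pretending vs real (assumed ids)


def categorise(true_label, pred_label):
    """Return the error category of one misclassified clip."""
    pair = frozenset({true_label, pred_label})
    if pair in TEMPORAL_PAIRS:
        return "temporal_inversion"
    if pair in MIRROR_PAIRS:
        return "mirror"
    if pair in PRETEND_PAIRS:
        return "pretend_vs_real"
    return "fine_visual"  # everything that is not a known structured confusion


def decompose(y_true, y_pred):
    """Share of each error category among the misclassified clips only."""
    errors = [(t, p) for t, p in zip(y_true, y_pred) if t != p]
    if not errors:
        return {}
    counts = Counter(categorise(t, p) for t, p in errors)
    return {name: count / len(errors) for name, count in counts.items()}


# Five clips, four of them misclassified (one per category).
shares = decompose([0, 18, 4, 7, 7], [1, 19, 5, 9, 7])
```

The shares are normalised by the number of errors, not by the number of clips, which matches a table whose rows sum to 100%.
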
Every model has the same structure: an encoder produces one feature vector per frame, and a temporal head aggregates the four vectors into 33 logits. Four ideas combine to reach the first-place result.
Self-supervised pretraining on the challenge data. Since no external weights are permitted, we pretrain our own encoders by masked modelling on the challenge frames. The first family (m4) is a 2D masked autoencoder on a ViT-B/16 backbone, with a masking ratio of 0.75 and a normalised-pixel reconstruction loss. At finetuning time its last six transformer blocks each receive a zero-initialised temporal attention, which is an exact identity at the first step and therefore leaves the pretrained spatial weights undisturbed; a Perceiver head follows the encoder. The second family (m5) is a tubelet VideoMAE on a ViT-S/16 backbone; its ViT-B variant (m5b) is seeded from an iBOT initialisation that we also pretrain on the challenge frames. As baselines we also trained encoders entirely from scratch: pure convolutional networks, 3D convolutional networks, and convolution-attention hybrids (EfficientFormer, MaxViT, NextViT). The hybrids reach up to about 0.44 in validation, whereas a pure Vision Transformer trained from scratch collapses to roughly six percent, the majority-class baseline, because it cannot learn locality from so little data.
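
To make the masking ratio concrete, here is a small sketch of the random patch selection for one 224x224 frame cut into 16x16 patches. The frame size, patch size and ratio come from the text; the function name and the use of Python's `random` module are illustrative assumptions, not the repository's implementation.

```python
import random

IMAGE_SIZE = 224   # frame side in pixels (from the text)
PATCH_SIZE = 16    # ViT-B/16 patch side (from the text)
MASK_RATIO = 0.75  # masking ratio (from the text)


def random_patch_mask(rng):
    """Split the patch indices of one frame into (visible, masked) lists."""
    per_side = IMAGE_SIZE // PATCH_SIZE         # 14 patches per side
    num_patches = per_side * per_side           # 196 patches per frame
    num_masked = int(num_patches * MASK_RATIO)  # 147 patches are hidden
    order = list(range(num_patches))
    rng.shuffle(order)                          # uniform random permutation
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


visible, masked = random_patch_mask(random.Random(0))
print(len(visible), len(masked))  # 49 147
```

Under these assumptions the encoder sees only 49 of the 196 patches of each frame, which is what makes masked pretraining comparatively cheap per epoch.
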
A self-supervised cascade to make the most of a bounded budget. A completed cosine cycle has essentially zero
learning rate, so naively extending pretraining achieves little. Each generation instead reloads the encoder
but resets the optimiser, scheduler and decoder, opening a fresh cycle. Across generations the reconstruction
loss decreases monotonically (from 0.30 for the iBOT initialisation to 0.16) and downstream accuracy keeps
rising through the eighth generation, which is why additional self-supervised epochs remain our most promising
untried direction (see whatnext.ssl.pretrain_videomae).
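
The argument about the finished cosine cycle can be checked numerically. The sketch below assumes a plain cosine schedule without warmup; the base learning rate and the cycle length are made-up values, and the function names are not taken from whatnext.ssl.pretrain_videomae.

```python
import math


def cosine_lr(step, cycle_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate; it stays at min_lr once the cycle is over."""
    progress = min(step, cycle_steps) / cycle_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def cascade_lr(global_step, cycle_steps, base_lr):
    """Schedule seen by a cascade: the cycle restarts at every generation boundary."""
    return cosine_lr(global_step % cycle_steps, cycle_steps, base_lr)


BASE_LR = 1.5e-4  # assumed value, for illustration only
CYCLE = 100       # assumed number of steps per generation

end_of_cycle = cosine_lr(CYCLE, CYCLE, BASE_LR)          # essentially zero
naive_extension = cosine_lr(CYCLE + 50, CYCLE, BASE_LR)  # still at the floor
fresh_cycle = cascade_lr(CYCLE, CYCLE, BASE_LR)          # back to the base rate
```

Restarting the schedule only restores the learning rate; resetting the optimiser state and the decoder, as described above, is a separate step that this sketch does not model.
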
Auxiliary self-supervision at finetuning. Two inexpensive objectives encourage the encoder to use temporal information and to be consistent: a pair-direction head that predicts the temporal order of a frame pair (weight 0.3), and a multi-clip consistency term, a symmetric Kullback-Leibler divergence between two augmentations of the same clip (weight 0.2).
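
One way to combine these terms is sketched below in plain Python. The weights 0.3 and 0.2 come from the text; the 0.5 factor in the symmetric divergence, the epsilon and all function names are assumptions made for the example.

```python
import math

PAIR_DIRECTION_WEIGHT = 0.3  # from the text
MULTI_CLIP_WEIGHT = 0.2      # from the text


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(logits_a, logits_b):
    """Average of both KL directions between two augmented views (0.5 factor assumed)."""
    p, q = softmax(logits_a), softmax(logits_b)
    return 0.5 * (kl(p, q) + kl(q, p))


def total_loss(cls_loss, pair_loss, logits_a, logits_b):
    """Classification loss plus the two weighted auxiliary terms."""
    return (cls_loss
            + PAIR_DIRECTION_WEIGHT * pair_loss
            + MULTI_CLIP_WEIGHT * symmetric_kl(logits_a, logits_b))


view_a = [2.0, 0.5, -1.0]  # logits for one augmentation of a clip
view_b = [1.5, 0.7, -0.8]  # logits for a second augmentation of the same clip
consistency = symmetric_kl(view_a, view_b)
```

The consistency term is zero when both views yield the same distribution and grows as they disagree, so it penalises predictions that depend on the augmentation rather than on the clip.
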
Pseudo-label distillation. The rules forbid the test labels but not the test images. The strongest aggregator
assigns a hard label to every test clip by taking the arg-max of its prediction, with no confidence
threshold, and the next model is trained on the union of the real and pseudo-labelled sets. We iterated this
loop three times; the bidirectional LSTM head benefits most, improving from 0.5500 to 0.6061. Generation is
handled by whatnext.ensemble.pseudo_label and consumption by whatnext.finetune.track_a_videomae.
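
The loop can be sketched as follows, assuming the aggregator's test probabilities are available as a mapping from clip id to a probability list. The function names, the clip ids and the provenance tag are hypothetical and do not describe the interface of whatnext.ensemble.pseudo_label.

```python
def hard_pseudo_labels(test_probs):
    """Arg-max label for every test clip, with no confidence threshold."""
    return {clip_id: max(range(len(probs)), key=probs.__getitem__)
            for clip_id, probs in test_probs.items()}


def build_training_union(real_labels, pseudo_labels):
    """Union of real and pseudo-labelled samples, each tagged with its provenance."""
    union = [(clip_id, label, "real") for clip_id, label in real_labels.items()]
    union += [(clip_id, label, "pseudo") for clip_id, label in pseudo_labels.items()]
    return union


# Toy example with three classes and made-up clip ids.
test_probs = {"test_0001": [0.1, 0.7, 0.2], "test_0002": [0.40, 0.35, 0.25]}
pseudo = hard_pseudo_labels(test_probs)
train_set = build_training_union({"train_0001": 2}, pseudo)
```

Note that the second clip is kept even though its top probability is only 0.40: with no confidence threshold every test clip receives a label.
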
A per-class, log-space, bagged stacker. The final submission aggregates 95 checkpoints. The weights are learned per class, so that a member specialised on a few classes can be given a large weight there without dominating elsewhere, and they are combined in log space:

    W[i, c]     = softmax_i(R[i, c])
    logit[n, c] = sum_i W[i, c] * log P[i, n, c] + b[c]

Fitting uses AdamW with a Tikhonov pull toward a flat prior and is bagged over fifteen folds. The reported
estimate is obtained by nested cross-validation: within each fold the weights are fit on a training split,
early-stopped on a probe split, and scored on a separate held-out split that is used neither for fitting nor
for selection, which removes the optimistic bias of reporting the best-step score on the selection set. A
sidecar scheme then re-injects the 6,745 validation clips without leakage, by calibrating the weights on the
out-of-sample probabilities of a train-only model and applying them to a twin trained on the training and
validation sets together (whatnext.ensemble.stacker, sidecar, calibration).
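
The forward pass given by the two formulas above can be written in a few lines. This is a sketch of the combination rule only: fitting, the Tikhonov pull, bagging and the sidecar are omitted, and the function names and toy numbers are assumptions rather than the contents of whatnext.ensemble.stacker.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def stack_logits(R, log_probs, bias):
    """Combine members in log space with per-class weights.

    R[i][c]            raw weight of member i for class c
    log_probs[i][n][c] log-probability of class c from member i on sample n
    bias[c]            per-class bias
    """
    n_members, n_classes, n_samples = len(R), len(bias), len(log_probs[0])
    # W[:, c] is a softmax over members, computed independently for each class.
    weights = [softmax([R[i][c] for i in range(n_members)]) for c in range(n_classes)]
    return [[sum(weights[c][i] * log_probs[i][n][c] for i in range(n_members)) + bias[c]
             for c in range(n_classes)]
            for n in range(n_samples)]


# Two members, two classes, one sample. Member 0 is trusted on class 0 and
# member 1 on class 1 (illustrative numbers, not fitted weights).
R = [[4.0, -4.0],
     [-4.0, 4.0]]
probs = [[[0.9, 0.1]],   # member 0
         [[0.3, 0.7]]]   # member 1
log_probs = [[[math.log(p) for p in row] for row in member] for member in probs]
logits = stack_logits(R, log_probs, bias=[0.0, 0.0])
```

With R set to zero the weights are uniform and the rule reduces to a geometric mean of the member probabilities, which is presumably the flat prior that the regulariser pulls toward.
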
Two task-specific choices matter on four frames. Augmentation is clip-coherent: the same random parameters are applied to all four frames so that temporal coherence is preserved. The horizontal flip is label-aware: flipping a clip also swaps labels 18 (Pulling left to right) and 19 (Pulling right to left), the only mirror pair in the dataset. Without this swap, roughly 1,500 clips are mislabelled and those two classes collapse.
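
Both choices fit in one small function. The sketch below stores a frame as a list of pixel rows so that it runs without any image library; the function names and the flip probability are assumptions, and the real transform in data/clip_transforms presumably operates on tensors.

```python
import random

MIRROR_SWAP = {18: 19, 19: 18}  # the only mirror pair, per the text


def flip_frame(frame):
    """Horizontally flip one frame stored as a list of pixel rows."""
    return [row[::-1] for row in frame]


def clip_coherent_flip(frames, label, rng, p=0.5):
    """Draw ONE flip decision per clip and apply it to every frame and to the label."""
    if rng.random() >= p:
        return frames, label  # clip left untouched
    flipped = [flip_frame(frame) for frame in frames]
    return flipped, MIRROR_SWAP.get(label, label)


# A toy clip of four identical 2x2 frames labelled 18.
clip = [[[1, 2], [3, 4]] for _ in range(4)]
flipped_clip, flipped_label = clip_coherent_flip(clip, 18, random.Random(0), p=1.0)
```

Because the random draw happens once per clip rather than once per frame, the four frames can never disagree about orientation, and the label follows the same decision.
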
The progression from a single from-scratch encoder to the first-place ensemble is summarised below (validation accuracy unless stated otherwise).
| Stage | Model | val |
|---|---|---|
| From scratch (conv-attention hybrid) | EfficientFormer-L, strong augmentation | 0.4411 |
| From scratch (hybrid) | EfficientFormerV2-L / NextViT-S | 0.413 / 0.409 |
| On-data SSL (VideoMAE-S) | with Perceiver (m5p) / with LSTM (m5l) | 0.478 / 0.481 |
| On-data SSL (MAE-2D ViT-B) | with Perceiver head (m4a) | 0.4976 |
| On-data SSL (MAE-2D ViT-B) | with hybrid head (m4c) | 0.5075 |
| On-data SSL with auxiliary losses | with pair-direction and multi-clip (m4d_v2) | 0.5122 |
| Self-supervised cascade | ssl800 with attentive probe | 0.5960 |
| Cascade with pseudo-labels | ssl1000 with BiLSTM and pseudo-labels | 0.6061 |
| Ensemble of 95 checkpoints | per-class log-space bagged stacker | 0.6070 on the leaderboard (1st / 35) |
Two transitions account for most of the gain: from a hybrid trained from scratch (about 0.44) to an on-data MAE ViT-B with auxiliary losses (0.5122), and then to the cascade with pseudo-labels (0.6061 in validation, 0.6070 on the leaderboard, first by 0.0006).
The cascade is detailed below (validation accuracy by generation and head).
| Encoder | attn | perceiver | lstm | hybrid |
|---|---|---|---|---|
| ssl600 | 0.5788 | 0.5758 | 0.5729 | 0.5781 |
| ssl700 | 0.5936 | 0.5887 | 0.5778 | 0.5913 |
| ssl800 | 0.5960 | 0.5901 | 0.5810 | 0.5895 |
| ssl800 with pseudo | 0.6004 | 0.5935 | 0.5947 | 0.5981 |
| ssl1000 with pseudo | 0.5539 | 0.5508 | 0.6061 | 0.5463 |
The improvement from pretraining is positive on all 32 populated classes. The confusion matrices below compare the best from-scratch model (left, validation 0.44) with the pretrained model after pseudo-label distillation (right, validation 0.61).
[Figure: confusion matrices. Left panel: From scratch. Right panel: Pretrained with pseudo-labels.]
The submission is an aggregation of cached per-member probabilities, so the stacker runs on a laptop in a couple of minutes.

    pip install -e .   # torch and numpy are sufficient for this step
    python -m whatnext.ensemble.stacker \
        --cache-dir artifacts/probs \
        --members configs/track_a_members.txt \
        --out submission.csv

On the subset of members shipped here this reports a nested cross-validation held-out accuracy of 0.6163 (fifteen folds; each fold is scored on a split used neither for fitting nor for selection, and the test set is never consulted). The full 95-member pool scored 0.6070 on the private leaderboard. The stacker also reports which members it skipped, so coverage is never silently reduced.

    pip install -e ".[models]"   # adds timm for the encoders
    # self-supervised pretrain (one cascade step), finetune, cache probabilities, then stack
    python -m whatnext.ssl.pretrain_videomae --init-encoder <prev_gen.pt> --epochs 100
    python -m whatnext.finetune.track_a_mae2d --mae-ckpt <ssl.pt> ...
    python -m whatnext.inference.cache_probs --ckpt <ft.pt> --split val   # then --split test
    python -m whatnext.ensemble.pseudo_label --out pseudo.csv             # optional distillation round
    python -m whatnext.ensemble.stacker --cache-dir artifacts/probs

The challenge data (frames.zip and the label CSVs) is not redistributed here; it can be downloaded from the competition and placed at the repository root.

    src/whatnext/
      data/        zip_dataset (evaluation and caching), clip_transforms (clip-coherent augmentation
                   with the 18/19 swap), tta (five-crop test-time augmentation)
      models/      mae2d_vitb (MAE ViT-B with temporal attention, Perceiver and pair-direction heads),
                   mae2d_heads (LSTM, relational and hybrid heads), videomae_vits (VideoMAE ViT-S),
                   encoders_2d (timm), encoders_3d (3D convolutions), convert_videomae_ckpt
      ssl/         pretrain_mae2d, pretrain_videomae (the cascade driver)
      finetune/    track_a_mae2d, track_a_videomae (pseudo-label union), track_a_scratch
      ensemble/    stacker (per-class log-space, nested-CV bagged), sidecar (leakage-free re-injection),
                   calibration (temperature scaling and MLP/linear baselines), pseudo_label
      inference/   cache_probs (test-time augmentation and the 18/19 swap), submission
      analysis/    error_decomposition (the diagnostic of section 2)
      utils/       seeding, metrics, normalisation constants
    configs/       track_a_members.txt (the 95-member ensemble list)
    artifacts/     probs/ (cached member probabilities), pseudo_labels/ (three distillation iterations)
    docs/          report_en.pdf, methodology.md, exploration.md, figures/

Built on PyTorch and timm.






