What Happens Next? Data-efficient action anticipation from four frames

This repository contains the implementation behind our submission to the closed-world track of the "What Happens Next?" challenge (École Polytechnique, course CSC_43M04_EP), where it finished first out of 35 participants. The task is to anticipate the action that is about to occur from only the four frames covering the first forty percent of a Something-Something-V2 clip. The closed-world setting forbids any external data or pretrained weights: only the data provided by the challenge may be used, which makes the problem an exercise in data efficiency.

Authors: Baptiste Duperray, Clarisse Douysset. The full report is in docs/report_en.pdf, a condensed version in docs/methodology.md, and the complete list of experiments in docs/exploration.md.

1. Problem statement

The data is a 33-class subset of Something-Something-V2 (class 27 is empty, so 32 classes are populated): 44,993 training clips, 6,745 validation clips and 6,912 test clips. Each clip is represented by four 224x224 JPEG frames taken from the first forty percent of the source video, so the problem is one of anticipation rather than recognition: the model must infer the upcoming action from a very short visual prefix. The evaluation metric is top-1 accuracy.
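As a point of reference, the metric can be written in a few lines. This is a minimal sketch in plain Python; the function name `top1_accuracy` and the toy scores are illustrative and are not taken from the repository.

```python
def top1_accuracy(scores, labels):
    """Fraction of clips whose highest-scoring class equals the true label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # arg-max over classes
        correct += int(predicted == label)
    return correct / len(labels)


# Toy example: three clips, four classes, made-up scores.
toy_scores = [
    [0.1, 2.0, -1.0, 0.3],  # arg-max is class 1
    [1.5, 0.2, 0.1, 0.0],   # arg-max is class 0
    [0.0, 0.1, 0.2, 3.0],   # arg-max is class 3
]
toy_labels = [1, 2, 3]
toy_accuracy = top1_accuracy(toy_scores, toy_labels)  # two of three clips are correct
```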

The closed-world rule is the central constraint: no external corpus and no pretrained weights are allowed, so every representation must be learned from the challenge data alone. Two further constraints shaped the work. The first is a limited GPU budget on a shared university cluster (one card per job, with a fixed deadline), which bounds the number of self-supervised epochs that can be accumulated, since a full pretraining over the whole corpus takes several days. The second is the modest dataset size, about 52,000 labelled clips, which limits the number of parameters that can be trained before models begin to memorise rather than generalise. Both constraints favour data-efficient methods over raw capacity.

2. Error analysis and the resulting strategy

Before adding capacity, we asked what kind of errors the best from-scratch model actually makes. Decomposing the 3,770 validation errors of an EfficientFormer-L encoder (validation accuracy 0.4411) into interpretable categories gives the following picture.

| Error type | Share |
| --- | --- |
| Temporal inversion (Folding/Unfolding, Opening/Closing, Covering/Uncovering) | 3.9% |
| Left/right mirror (classes 18 and 19) | 0.0% |
| Pretending versus real action (an intrinsic four-frame ceiling) | 4.0% |
| Fine visual confusions | 92.1% |

The conclusion is that the bottleneck is neither the temporal head nor the ordering of the frames, but the quality of the visual representation; the mirror confusions are already at zero thanks to a label-aware flip described below. The entire effort therefore went into learning better representations from the challenge data, rather than into more elaborate sequence models. This analysis is reproduced by whatnext.analysis.error_decomposition.
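The decomposition itself amounts to bucketing each misclassified clip by its (true, predicted) pair. The sketch below illustrates the idea and is not the code of whatnext.analysis.error_decomposition. Only the 18/19 mirror pair is stated in this README; every other class id, and the names `categorise` and `decompose`, are hypothetical placeholders.

```python
from collections import Counter

# Only the 18/19 mirror pair comes from this README. The other ids are
# hypothetical placeholders and must be replaced by the real class ids.
MIRROR_PAIRS = {frozenset((18, 19))}
INVERSION_PAIRS = {frozenset((0, 1)), frozenset((2, 3))}  # hypothetical ids
PRETEND_PAIRS = {frozenset((4, 5))}                       # hypothetical ids


def categorise(true_label, predicted_label):
    """Return the error category of one misclassified clip."""
    pair = frozenset((true_label, predicted_label))
    if pair in INVERSION_PAIRS:
        return "temporal inversion"
    if pair in MIRROR_PAIRS:
        return "left/right mirror"
    if pair in PRETEND_PAIRS:
        return "pretending versus real"
    return "fine visual confusion"


def decompose(true_labels, predicted_labels):
    """Share of each category among the misclassified clips only."""
    counts = Counter(
        categorise(t, p)
        for t, p in zip(true_labels, predicted_labels)
        if t != p
    )
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()} if total else {}


# Toy run: six clips, four of which are misclassified.
toy_truth = [0, 18, 4, 7, 9, 10]
toy_preds = [1, 18, 5, 8, 9, 12]
toy_shares = decompose(toy_truth, toy_preds)
```

In the toy run the four errors split into one temporal inversion, one pretending-versus-real confusion and two fine visual confusions.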

3. Method

Every model has the same structure: an encoder produces one feature vector per frame, and a temporal head aggregates the four vectors into 33 logits. Five ideas, described in turn below, combine to reach the first-place result.

Self-supervised pretraining on the challenge data. Since no external weights are permitted, we pretrain our own encoders by masked modelling on the challenge frames. The first family (m4) is a 2D masked autoencoder on a ViT-B/16 backbone, with a masking ratio of 0.75 and a normalised-pixel reconstruction loss. At finetuning time its last six transformer blocks receive a zero-initialised temporal attention, which is an exact identity at the first step and therefore leaves the pretrained spatial weights undisturbed; the encoder is followed by a Perceiver head. The second family (m5) is a tubelet VideoMAE on a ViT-S/16 backbone; its ViT-B variant (m5b) is seeded from an iBOT initialisation that we also pretrain on the challenge frames. We also trained encoders entirely from scratch as baselines: pure convolutional networks, 3D convolutional networks, and convolution-attention hybrids (EfficientFormer, MaxViT, NextViT). The hybrids reach about 0.44 in validation, whereas a pure Vision Transformer trained from scratch collapses to roughly six percent, the majority-class baseline, because it cannot learn locality from so little data.
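To make the masking ratio concrete: a 224x224 frame cut into 16x16 patches gives 196 patches, of which 147 are hidden and 49 are shown to the encoder at a ratio of 0.75. The sketch below illustrates the sampling and a per-patch normalised target in plain Python. The function names are hypothetical, and the use of the population variance with a small epsilon is an assumption rather than a detail confirmed by the repository.

```python
import random

IMAGE_SIZE = 224
PATCH_SIZE = 16
MASK_RATIO = 0.75


def random_patch_mask(rng, image_size=IMAGE_SIZE, patch_size=PATCH_SIZE,
                      mask_ratio=MASK_RATIO):
    """Return (visible, masked) patch indices for one frame."""
    per_side = image_size // patch_size
    num_patches = per_side * per_side
    num_masked = int(round(num_patches * mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)  # uniform random choice of the masked patches
    return sorted(order[num_masked:]), sorted(order[:num_masked])


def normalised_target(patch_pixels, eps=1e-6):
    """Per-patch normalisation of the reconstruction target (assumed form)."""
    mean = sum(patch_pixels) / len(patch_pixels)
    variance = sum((v - mean) ** 2 for v in patch_pixels) / len(patch_pixels)
    return [(v - mean) / (variance + eps) ** 0.5 for v in patch_pixels]


visible, masked = random_patch_mask(random.Random(0))
target = normalised_target([1.0, 2.0, 3.0, 4.0])
```

Only the visible patches are encoded; the reconstruction loss would be computed on the masked ones against the normalised target.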

A self-supervised cascade to make the most of a bounded budget. By the end of a completed cosine cycle the learning rate is essentially zero, so naively extending pretraining achieves little. Each generation instead reloads the encoder but resets the optimiser, scheduler and decoder, opening a fresh cycle. Across generations the reconstruction loss decreases monotonically (from 0.30 for the iBOT initialisation to 0.16) and downstream accuracy keeps rising through the eighth generation, which is why additional self-supervised epochs remain our most promising untried direction (see whatnext.ssl.pretrain_videomae).
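The argument about the learning rate can be checked numerically. The sketch below uses a plain cosine schedule without warm-up, which is a simplifying assumption, and the names `cosine_lr` and `cascade_lr` are illustrative. It contrasts running past the end of a finished cycle with restarting a fresh cycle for each generation.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate; steps past the end stay at min_lr."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def cascade_lr(global_step, steps_per_generation, base_lr):
    """Learning rate when every generation restarts its own cosine cycle."""
    step_in_generation = global_step % steps_per_generation
    return cosine_lr(step_in_generation, steps_per_generation, base_lr)


BASE_LR = 1e-3
STEPS = 100

lr_near_end = cosine_lr(STEPS - 1, STEPS, BASE_LR)     # almost zero
lr_naive_extension = cosine_lr(STEPS + 50, STEPS, BASE_LR)  # stuck at zero
lr_after_restart = cascade_lr(STEPS, STEPS, BASE_LR)   # back to the base rate
```

In the actual cascade the restart also discards the optimiser state and the decoder, which this scalar sketch does not model.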

Auxiliary self-supervision at finetuning. Two inexpensive objectives encourage the encoder to use temporal information and to be consistent: a pair-direction head that predicts the temporal order of a frame pair (weight 0.3), and a multi-clip consistency term, a symmetric Kullback-Leibler divergence between two augmentations of the same clip (weight 0.2).
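A scalar sketch of how the two auxiliary terms could enter the total loss is given below. The weights 0.3 and 0.2 come from the text above; everything else is an assumption made for illustration: the function names, the binary cross-entropy form of the pair-direction loss, and the averaged form of the symmetric divergence.

```python
import math

PAIR_DIRECTION_WEIGHT = 0.3
CONSISTENCY_WEIGHT = 0.2


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p) (assumed normalisation)."""
    return 0.5 * (kl(p, q) + kl(q, p))


def pair_direction_loss(logit, target):
    """Binary cross-entropy on a logit; target is 1 for forward order, else 0."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def total_loss(classification_loss, pair_logit, pair_target, logits_a, logits_b):
    """Classification loss plus the two weighted auxiliary terms."""
    direction = pair_direction_loss(pair_logit, pair_target)
    consistency = symmetric_kl(softmax(logits_a), softmax(logits_b))
    return (classification_loss
            + PAIR_DIRECTION_WEIGHT * direction
            + CONSISTENCY_WEIGHT * consistency)


# Identical views give zero consistency penalty; an undecided pair head costs log 2.
example_loss = total_loss(1.0, 0.0, 1, [1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```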

Pseudo-label distillation. The rules forbid the test labels but not the test images. The strongest aggregator assigns a hard label to every test clip by taking the arg-max of its prediction, with no confidence threshold, and the next model is trained on the union of the real and pseudo-labelled sets. We iterated this loop three times; the bidirectional LSTM head benefits most, improving from 0.5500 to 0.6061. Generation is handled by whatnext.ensemble.pseudo_label and consumption by whatnext.finetune.track_a_videomae.
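One round of the loop can be sketched as follows. This is an illustration in plain Python, not the code of whatnext.ensemble.pseudo_label; the function names, the clip identifiers and the tuple layout are hypothetical.

```python
def hard_pseudo_labels(test_probabilities):
    """Arg-max label for every test clip, with no confidence threshold."""
    return {
        clip_id: max(range(len(probs)), key=probs.__getitem__)
        for clip_id, probs in test_probabilities.items()
    }


def build_training_union(labelled, test_probabilities):
    """Real (clip_id, label) pairs followed by pseudo-labelled test clips."""
    pseudo = hard_pseudo_labels(test_probabilities)
    union = [(clip_id, label, "real") for clip_id, label in labelled]
    union += [(clip_id, label, "pseudo") for clip_id, label in sorted(pseudo.items())]
    return union


# Toy data: two labelled clips and two test clips scored by an aggregator.
toy_labelled = [("train_0001", 3), ("train_0002", 18)]
toy_test_probs = {
    "test_0001": [0.10, 0.70, 0.20],  # confident prediction, class 1
    "test_0002": [0.34, 0.33, 0.33],  # barely decided, still kept as class 0
}
toy_union = build_training_union(toy_labelled, toy_test_probs)
```

The second test clip shows the consequence of having no confidence threshold: even a near-uniform prediction contributes a hard label to the next round.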

A per-class, log-space, bagged stacker. The final submission aggregates 95 checkpoints. The weights are learned per class, so that a member specialised on a few classes can be given a large weight there without dominating elsewhere, and they are combined in log space:

W[i, c] = softmax_i(R[i, c])
logit[n, c] = sum_i W[i, c] * log P[i, n, c] + b[c]
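Read literally, the two formulas above normalise the raw weights over members separately for each class, then form a weighted sum of log-probabilities. The sketch below evaluates them in plain Python on nested lists. It is an illustration and not the code of whatnext.ensemble.stacker; the function name and the clamping constant `eps` are assumptions.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def stack_logits(R, P, b, eps=1e-12):
    """Evaluate logit[n][c] from R[i][c], P[i][n][c] and b[c]."""
    num_members, num_classes, num_clips = len(R), len(b), len(P[0])
    # W[c][i]: softmax over members i, computed independently for each class c.
    W = [softmax([R[i][c] for i in range(num_members)]) for c in range(num_classes)]
    logits = []
    for n in range(num_clips):
        row = []
        for c in range(num_classes):
            mix = sum(W[c][i] * math.log(max(P[i][n][c], eps))
                      for i in range(num_members))
            row.append(mix + b[c])
        logits.append(row)
    return logits


# Two members, one clip, two classes. Equal raw weights give W = 0.5 each,
# so each logit is the log of the geometric mean of the member probabilities.
toy_R = [[0.0, 0.0], [0.0, 0.0]]
toy_P = [[[0.8, 0.2]], [[0.5, 0.5]]]
toy_b = [0.0, 0.0]
toy_logits = stack_logits(toy_R, toy_P, toy_b)
```

Because the weights are indexed by class, raising R[i][c] for one class increases that member's influence there without changing its weight on any other class.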

Fitting uses AdamW with a Tikhonov pull toward a flat prior and is bagged over fifteen folds. The reported estimate is obtained by nested cross-validation: within each fold the weights are fit on a training split, early-stopped on a probe split, and scored on a separate held-out split that is used neither for fitting nor for selection, which removes the optimistic bias of reporting the best-step score on the selection set. A sidecar scheme then re-injects the 6,745 validation clips without leakage, by calibrating the weights on the out-of-sample probabilities of a train-only model and applying them to a twin trained on the training and validation sets together (whatnext.ensemble.stacker, sidecar, calibration).
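The three-way split inside each fold can be sketched as below. The text states only that the fit, probe and held-out splits are separate and that there are fifteen folds; the rotation scheme, the split sizes, the seed and the name `nested_splits` are assumptions made for illustration.

```python
import random


def nested_splits(num_clips, num_folds=15, seed=0):
    """Yield (fit, probe, held_out) index lists, one triple per fold."""
    if num_folds < 3:
        raise ValueError("need at least three folds for three disjoint splits")
    indices = list(range(num_clips))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::num_folds] for k in range(num_folds)]
    for k in range(num_folds):
        probe_fold = (k + 1) % num_folds
        held_out = folds[k]          # scored once, never used for fitting or selection
        probe = folds[probe_fold]    # used only for early stopping
        fit = [i for j, fold in enumerate(folds)
               if j not in (k, probe_fold) for i in fold]
        yield fit, probe, held_out


# The 6,745 validation clips, split fifteen ways.
toy_splits = list(nested_splits(6745))
```

Under this scheme every clip is held out exactly once, so the per-fold held-out scores can be pooled into a single estimate.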

Two task-specific choices matter on four frames. Augmentation is clip-coherent: the same random parameters are applied to all four frames so that temporal coherence is preserved. The horizontal flip is label-aware: a flip swaps labels 18 (Pulling left to right) and 19 (Pulling right to left), the only mirror pair in the dataset. Without this swap, roughly 1,500 clips are mislabelled and those two classes collapse.
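Both choices fit in a few lines. The sketch below represents a frame as a list of pixel rows and covers only the horizontal flip; it is not the code of whatnext.data.clip_transforms, and the function names and the flip probability are assumptions.

```python
import random

MIRROR_SWAP = {18: 19, 19: 18}  # the only mirror pair, as stated above


def hflip(frame):
    """Mirror one frame, given as a list of pixel rows."""
    return [row[::-1] for row in frame]


def augment_clip(frames, label, rng, flip_probability=0.5):
    """Flip all frames of a clip together and swap the label if needed."""
    do_flip = rng.random() < flip_probability  # sampled once per clip, not per frame
    if do_flip:
        frames = [hflip(frame) for frame in frames]
        label = MIRROR_SWAP.get(label, label)
    return frames, label


# Toy clip: four identical 2x3 frames. Probability 1.0 forces the flip.
toy_frame = [[1, 2, 3], [4, 5, 6]]
toy_clip = [toy_frame] * 4
flipped_clip, flipped_label = augment_clip(toy_clip, 18, random.Random(0), 1.0)
same_clip, same_label = augment_clip(toy_clip, 18, random.Random(0), 0.0)
other_clip, other_label = augment_clip(toy_clip, 5, random.Random(0), 1.0)
```

Drawing `do_flip` once per clip is what keeps the four frames coherent; drawing it per frame would mix mirrored and unmirrored frames within the same clip.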

4. Results

The progression from a single from-scratch encoder to the first-place ensemble is summarised below (validation accuracy unless stated otherwise).

| Stage | Model | Accuracy |
| --- | --- | --- |
| From scratch (conv-attention hybrid) | EfficientFormer-L, strong augmentation | 0.4411 |
| From scratch (hybrid) | EfficientFormerV2-L / NextViT-S | 0.413 / 0.409 |
| On-data SSL (VideoMAE-S) | with Perceiver (m5p) / with LSTM (m5l) | 0.478 / 0.481 |
| On-data SSL (MAE-2D ViT-B) | with Perceiver head (m4a) | 0.4976 |
| On-data SSL (MAE-2D ViT-B) | with hybrid head (m4c) | 0.5075 |
| On-data SSL with auxiliary losses | with pair-direction and multi-clip (m4d_v2) | 0.5122 |
| Self-supervised cascade | ssl800 with attentive probe | 0.5960 |
| Cascade with pseudo-labels | ssl1000 with BiLSTM and pseudo-labels | 0.6061 |
| Ensemble of 95 checkpoints | per-class log-space bagged stacker | 0.6070 (leaderboard, 1st / 35) |

Two transitions account for most of the gain: from a hybrid trained from scratch (about 0.44) to an on-data MAE ViT-B with auxiliary losses (0.5122), and then to the cascade with pseudo-labels (0.6061 in validation, 0.6070 on the leaderboard, first by 0.0006).

The cascade is detailed below (validation accuracy by generation and head).

| Encoder | attn | perceiver | lstm | hybrid |
| --- | --- | --- | --- | --- |
| ssl600 | 0.5788 | 0.5758 | 0.5729 | 0.5781 |
| ssl700 | 0.5936 | 0.5887 | 0.5778 | 0.5913 |
| ssl800 | 0.5960 | 0.5901 | 0.5810 | 0.5895 |
| ssl800 with pseudo | 0.6004 | 0.5935 | 0.5947 | 0.5981 |
| ssl1000 with pseudo | 0.5539 | 0.5508 | 0.6061 | 0.5463 |

The improvement from pretraining is positive on all 32 populated classes. The confusion matrices below compare the best from-scratch model (left, validation 0.44) with the pretrained model after pseudo-label distillation (right, validation 0.61).

[Figure: confusion matrices. Left: from scratch. Right: pretrained with pseudo-labels.]

5. Reproducing the result without a GPU

The submission is an aggregation of cached per-member probabilities, so the stacker runs on a laptop in a couple of minutes.

pip install -e .                       # torch and numpy are sufficient for this step
python -m whatnext.ensemble.stacker \
    --cache-dir artifacts/probs \
    --members  configs/track_a_members.txt \
    --out      submission.csv

On the subset of members shipped here this reports a nested cross-validation held-out accuracy of 0.6163 (fifteen folds; each fold is scored on a split used neither for fitting nor for selection, and the test set is never consulted). The full 95-member pool scored 0.6070 on the private leaderboard. The stacker also reports which members it skipped, so coverage is never silently reduced.

6. Reproducing from scratch

pip install -e ".[models]"             # adds timm for the encoders

# self-supervised pretrain (one cascade step), finetune, cache probabilities, then stack
python -m whatnext.ssl.pretrain_videomae   --init-encoder <prev_gen.pt> --epochs 100
python -m whatnext.finetune.track_a_mae2d  --mae-ckpt <ssl.pt> ...
python -m whatnext.inference.cache_probs   --ckpt <ft.pt> --split val   # then --split test
python -m whatnext.ensemble.pseudo_label   --out pseudo.csv             # optional distillation round
python -m whatnext.ensemble.stacker        --cache-dir artifacts/probs

The challenge data (frames.zip and the label CSVs) is not redistributed here; it can be downloaded from the competition and placed at the repository root.

7. Repository layout

src/whatnext/
  data/        zip_dataset (evaluation and caching), clip_transforms (clip-coherent augmentation
               with the 18/19 swap), tta (five-crop test-time augmentation)
  models/      mae2d_vitb (MAE ViT-B with temporal attention, Perceiver and pair-direction heads),
               mae2d_heads (LSTM, relational and hybrid heads), videomae_vits (VideoMAE ViT-S),
               encoders_2d (timm), encoders_3d (3D convolutions), convert_videomae_ckpt
  ssl/         pretrain_mae2d, pretrain_videomae (the cascade driver)
  finetune/    track_a_mae2d, track_a_videomae (pseudo-label union), track_a_scratch
  ensemble/    stacker (per-class log-space, nested-CV bagged), sidecar (leakage-free re-injection),
               calibration (temperature scaling and MLP/linear baselines), pseudo_label
  inference/   cache_probs (test-time augmentation and the 18/19 swap), submission
  analysis/    error_decomposition (the diagnostic of section 2)
  utils/       seeding, metrics, normalisation constants
configs/       track_a_members.txt (the 95-member ensemble list)
artifacts/     probs/ (cached member probabilities), pseudo_labels/ (three distillation iterations)
docs/          report_en.pdf, methodology.md, exploration.md, figures/

Acknowledgements

Built on PyTorch and timm.
