ICML 2026

Progressive Cramming

Reliable Token Compression — and What It Reveals

Dmitrii Tarasov · Timofei Lashukov · Elizaveta Goncharova · Andrey Kuznetsov

FusionBrain Lab · HSE University · Innopolis  —  github.com/FusionBrainLab/progressive_cramming

The big claim

How much can you fit into a single token?

One embedding can encode a whole sequence

The cramming task of Kuratov et al. (2025), “Cramming 1568 tokens into a single vector and back again” (ACL) — a frozen LM reconstructs a sequence from one trained input embedding.

The answer

More than a thousand tokens — in one vector

1,568 tokens fit into a single embedding

Result from Kuratov et al. (2025).

The catch

“99%” is an illusion

Reconstruction brittleness table (PG19)

Teacher-forced accuracy is ~99%, but greedy generation collapses to ~0% — because the missing 1% lands on the first one or two tokens, and one early miss cascades.
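The gap between the two metrics can be reproduced with a toy predictor. This is a minimal sketch, not the paper's evaluation code: `make_toy_model` is a hypothetical stand-in for a frozen LM plus crammed embedding that knows the target only along the exact target prefix and misses a single position.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]


def teacher_forced_accuracy(predict: NextToken, target: Sequence[int]) -> float:
    """Fraction of positions predicted correctly when the model always sees the true prefix."""
    hits = sum(predict(target[:i]) == target[i] for i in range(len(target)))
    return hits / len(target)


def greedy_decode(predict: NextToken, length: int) -> List[int]:
    """Generate `length` tokens, feeding the model its own previous outputs."""
    out: List[int] = []
    for _ in range(length):
        out.append(predict(out))
    return out


def make_toy_model(target: Sequence[int], bad_position: int) -> NextToken:
    """Hypothetical stand-in for a frozen LM plus crammed embedding.

    It predicts the right token whenever the prefix matches the target,
    except at `bad_position`; on any off-target prefix it emits token 0.
    """
    def predict(prefix: Sequence[int]) -> int:
        i = len(prefix)
        if list(prefix) != list(target[:i]):
            return 0  # off the memorized path: nothing useful to predict
        return 0 if i == bad_position else target[i]
    return predict


target = list(range(1, 101))  # 100 distinct non-zero tokens
model = make_toy_model(target, bad_position=0)  # the single miss is the first token

tf_acc = teacher_forced_accuracy(model, target)
generated = greedy_decode(model, len(target))
greedy_acc = sum(g == t for g, t in zip(generated, target)) / len(target)

print(f"teacher-forced accuracy: {tf_acc:.2f}")  # 0.99
print(f"greedy token accuracy:   {greedy_acc:.2f}")  # 0.00
```

One miss at position 0 costs 1% under teacher forcing, but greedy decoding never returns to the memorized path.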

The fix

Progressive cramming

Progressive token-addition schematic (paper Fig. 2)

Grow the target one token at a time, warm-starting each stage; stop only when perfect reconstruction is no longer possible. Full cramming leaves ~1% teacher-forced error → ~0% greedy generation; progressive cramming keeps 0% error on the stored prefix → 100% greedy generation.
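The token-addition loop can be sketched as below. The control flow follows the description above; everything else is an assumption for illustration. `fit` is a hypothetical callable standing in for one optimization stage against the frozen LM, and `make_toy_fit` replaces real gradient descent with a tuple of fixed capacity.

```python
from typing import Callable, Sequence, Tuple, TypeVar

E = TypeVar("E")
# fit(embedding, prefix) -> (updated embedding, True if the prefix is reproduced exactly)
Fit = Callable[[E, Sequence[int]], Tuple[E, bool]]


def progressive_cram(tokens: Sequence[int], init: E, fit: Fit) -> Tuple[E, int]:
    """Grow the target one token at a time, warm-starting from the last perfect embedding.

    Returns the last embedding that reconstructed its prefix perfectly
    and the length of that prefix.
    """
    best, stored = init, 0
    for n in range(1, len(tokens) + 1):
        candidate, perfect = fit(best, tokens[:n])  # warm start from `best`
        if not perfect:
            break  # capacity reached: keep the last perfect stage
        best, stored = candidate, n
    return best, stored


def make_toy_fit(capacity: int) -> Fit:
    """Toy optimizer (assumption): the 'embedding' is a tuple holding at most `capacity` tokens."""
    def fit(embedding: tuple, prefix: Sequence[int]) -> Tuple[tuple, bool]:
        if len(prefix) > capacity:
            return embedding, False  # this stage cannot reach perfect reconstruction
        return tuple(prefix), True
    return fit


embedding, n_stored = progressive_cram(list(range(10)), init=(), fit=make_toy_fit(capacity=7))
print(n_stored, embedding)  # 7 (0, 1, 2, 3, 4, 5, 6)
```

Because the loop only ever keeps embeddings that passed the exact-match check, the returned embedding reconstructs its prefix perfectly by construction.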

Watch the optimizer walk

PC1–PC2 projection of the progressive trajectory (Llama-3.1-8B, length-1000). Each point perfectly stores the prefix so far; the basin of perfect reconstruction shrinks as tokens are added. First two components = 65.7% of variance — the path is low-dimensional.

The twist

Does perfect reconstruction = understanding?

No.

Downstream HellaSwag / ARC-E evaluation table

Prepending the crammed embedding drops accuracy on HellaSwag & ARC-E across families — even with the original prefix still in context (scored on the converged subset only).

The sharpest test

Generative MMLU collapses

5-shot MMLU: full_prefix (optimized) embedding collapses to ~0%, random control stays near baseline

5-shot MMLU, 512 samples — a single compressed embedding placed in context (full table in the appendix). Acc = accuracy, Valid = % of parseable answers.
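The two columns can be computed from raw generations roughly as follows. This is a sketch under stated assumptions: the answer-parsing rule (first standalone A/B/C/D letter) and the choice to count unparseable generations as wrong are illustrative, not taken from the paper.

```python
import re
from typing import List, Optional, Tuple

# First standalone capital A/B/C/D. Naive on purpose: a leading article "A" would also match.
ANSWER_RE = re.compile(r"\b([ABCD])\b")


def parse_answer(generation: str) -> Optional[str]:
    """Return the first standalone A/B/C/D letter, or None if the output is unparseable."""
    match = ANSWER_RE.search(generation.strip())
    return match.group(1) if match else None


def score(generations: List[str], gold: List[str]) -> Tuple[float, float]:
    """Return (Acc, Valid) in percent. Unparseable answers count as wrong (an assumption)."""
    parsed = [parse_answer(g) for g in generations]
    n = len(gold)
    correct = sum(p == g for p, g in zip(parsed, gold))
    valid = sum(p is not None for p in parsed)
    return 100.0 * correct / n, 100.0 * valid / n


generations = ["B", "Answer: C", "the the the the", "D."]
gold = ["B", "A", "C", "D"]
acc, valid = score(generations, gold)
print(acc, valid)  # 50.0 75.0
```

Reporting Valid next to Acc separates a model that answers wrongly from one that no longer produces an answer at all.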

Why? Causal attention knockout

The early layers drive the collapse

Cumulative attention knockout (Llama-3.1-8B)

Forward knockout (left) masks early→late: downstream accuracy returns to the uncompressed baseline after only the first few layers, even as reconstruction is destroyed. Reverse knockout (right) recovers only once it reaches the early layers — so the embedding does its damage by steering the early layers. The same pattern holds across model families (Llama-3.1-8B, Pythia-1.4B, SmolLM2-1.7B).
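The two sweeps differ only in which layers are masked at each step. The sketch below assumes that "knockout" means blocking attention from later positions to the crammed embedding's position inside the selected layers; the function names and the `emb_pos` parameter are hypothetical.

```python
from typing import List, Set


def forward_knockout(num_layers: int, k: int) -> List[int]:
    """Layers masked at step k of the forward sweep: the first k layers (early -> late)."""
    if not 0 <= k <= num_layers:
        raise ValueError("k must be between 0 and num_layers")
    return list(range(k))


def reverse_knockout(num_layers: int, k: int) -> List[int]:
    """Layers masked at step k of the reverse sweep: the last k layers (late -> early)."""
    if not 0 <= k <= num_layers:
        raise ValueError("k must be between 0 and num_layers")
    return list(range(num_layers - k, num_layers))


def attention_allowed(layer: int, query_pos: int, key_pos: int,
                      knocked: Set[int], emb_pos: int = 0) -> bool:
    """Causal mask, plus: in knocked-out layers no later position may attend to the embedding."""
    if key_pos > query_pos:
        return False  # standard causal masking
    if layer in knocked and key_pos == emb_pos and query_pos != emb_pos:
        return False  # knockout: cut the edge into the crammed embedding
    return True


NUM_LAYERS = 32  # Llama-3.1-8B has 32 decoder layers
fwd = set(forward_knockout(NUM_LAYERS, 4))
rev = set(reverse_knockout(NUM_LAYERS, 4))
print(sorted(fwd))  # [0, 1, 2, 3]
print(sorted(rev))  # [28, 29, 30, 31]
print(attention_allowed(layer=2, query_pos=5, key_pos=0, knocked=fwd))  # False
print(attention_allowed(layer=2, query_pos=5, key_pos=0, knocked=rev))  # True
```

Sweeping `k` from 0 to `NUM_LAYERS` in each direction and re-scoring at every step gives the two cumulative curves.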

Where capacity comes from

Capacity scales with depth & width

Heatmap of compression capacity (mean perfectly-crammed tokens) vs. retained depth and model size

Mean perfectly-crammed tokens over 50 PG19 samples (darker = more). Keep only the first N decoder layers (then finetune): capacity rises with retained depth (→) and with model size (↓), and the two axes compound — capacity isn't magic, it's bought with the reconstructor's compute.

What cramming really reveals
  • Perfect reconstruction can be brittle steering — it stores nothing the model can use.
  • Capacity isn't magic: it scales with the reconstructor's depth & width.
  • Across the tested model families, the collapse is driven by how the embedding steers the model's first few layers.
Explore the project →
code · released trajectories · paper
github.com/FusionBrainLab/progressive_cramming
Q&A · backup

Low-dimensionality is a path property

From a shared initialization, different learning rates reach equally-good solutions that are far apart and nearly orthogonal:

Solution-diversity table

Equally-good solutions are farther from each other than from the start (Sol. dist. > 1) and nearly orthogonal (Dir. cos. ≈ 0). The valid-solution set is wide & high-dimensional; one trajectory is a thin slice.
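One plausible reading of the two columns is sketched below. The exact definitions are assumptions: "Sol. dist." is taken as the distance between two solutions divided by their mean distance from the shared initialization, and "Dir. cos." as the cosine between their displacement vectors from that initialization.

```python
import math
from typing import List, Sequence

Vec = Sequence[float]


def sub(a: Vec, b: Vec) -> List[float]:
    return [x - y for x, y in zip(a, b)]


def norm(a: Vec) -> float:
    return math.sqrt(sum(x * x for x in a))


def solution_distance(init: Vec, sol_a: Vec, sol_b: Vec) -> float:
    """||a - b|| relative to the mean distance of the two solutions from the start."""
    mean_from_start = 0.5 * (norm(sub(sol_a, init)) + norm(sub(sol_b, init)))
    return norm(sub(sol_a, sol_b)) / mean_from_start


def direction_cosine(init: Vec, sol_a: Vec, sol_b: Vec) -> float:
    """Cosine between the two displacement directions from the shared initialization."""
    da, db = sub(sol_a, init), sub(sol_b, init)
    dot = sum(x * y for x, y in zip(da, db))
    return dot / (norm(da) * norm(db))


init = [0.0, 0.0, 0.0]
sol_a = [3.0, 0.0, 0.0]  # toy solutions: equally far from init, in orthogonal directions
sol_b = [0.0, 3.0, 0.0]
print(round(solution_distance(init, sol_a, sol_b), 3))  # 1.414
print(direction_cosine(init, sol_a, sol_b))  # 0.0
```

Under these definitions, two equally distant orthogonal solutions give a ratio of √2, which matches the pattern of Sol. dist. > 1 together with Dir. cos. ≈ 0.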

Trajectory dimensionality & PCA reconstruction

PCA components vs sequence length

Components for 99% variance grow ~logarithmically with length (Llama-3.1-8B).
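Counting components for a variance threshold is a cumulative sum over the PCA spectrum. The sketch below assumes the per-component variances (eigenvalues of the trajectory covariance) are already available; the example spectrum is made up.

```python
from typing import Sequence


def components_for_variance(variances: Sequence[float], threshold: float = 0.99) -> int:
    """Smallest k such that the top-k variances reach `threshold` of the total variance."""
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("total variance must be positive")
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running >= threshold * total:
            return k
    return len(ordered)


spectrum = [60.0, 25.0, 10.0, 4.5, 0.4, 0.1]  # made-up per-component variances
print(components_for_variance(spectrum))  # 4
print(components_for_variance(spectrum, 0.8))  # 2
```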

PCA reconstruction accuracy

PCA-reconstructed embeddings need many more components to recover teacher-forced accuracy than to capture 99% of the variance: the same early-token failure mode.