Sequential Embedding Updates — 12 Cycle Simulation

Simulating biweekly embedding updates. Full-model updates sign-flip at cycle 3 and oscillate; frozen-head updates drift smoothly. The oscillation doesn't collapse — but it creates periodic geometric-confusion windows.

Sequential Embedding Update Cycle Simulation

Date: 2026-04-16
Companion to: MOE_ROUTING_COLLAPSE.md


Summary

A simulation of biweekly sequential embedding updates on GPT-2 (124M) reveals two distinct failure trajectories depending on whether the output layer is updated:

  • Full-model updates: hidden/token alignment swings widely. It crosses zero at cycle 3 (a sign flip in the hidden-state/token-embedding relationship), peaks at +0.103 at cycle 6, recrosses zero at cycle 11, and ends at −0.029 at cycle 12, still well short of the −0.123 baseline. Generation quality survives because each cycle trains on different data, which prevents consistent accumulation.

  • Frozen-head updates: alignment drifts gradually toward zero with only small reversals (at cycles 2, 7, 9, and 11), never crosses zero, and shows no sign flips. The trajectory is smooth and predictable.

Both variants end with a similar net change from baseline (0.094 for full-model, 0.085 for frozen-head), but the full-model variant passes through a geometrically confused state during cycles 3-8 that the frozen-head variant never enters. This suggests that production models under full-model updates will experience periodic "off" windows corresponding to specific update cycles.


Experimental Setup

  • Base model: GPT-2 (124M, 12 layers, 768d)
  • Training data: TTOBT novel split into 12 sequential batches (simulating biweekly X/Twitter data dumps)
  • Per-cycle training: 200 steps, batch size 4, block size 256
  • Two variants:
    • Full-model: all parameters updated, lr=5e-5
    • Frozen-head: final block (11), ln_f, wte, wpe frozen; blocks 0-10 updated, lr=5e-5
  • Evaluation: held-out text from end of the book (never in any training batch), fixed positions across all cycles
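
The setup above can be sketched in a few lines. This is a minimal illustration, not the original scripts: the parameter-name prefixes assume the common GPT-2 naming layout (transformer.h.<i>, transformer.ln_f, and so on), the helper names split_cycles and is_frozen are hypothetical, and the 5% held-out fraction is an assumed value, since the report does not state how much of the book was held out.

```python
# Sketch of the experimental setup: sequential data split plus a freeze rule.
# All names and the holdout fraction are illustrative assumptions.

# Prefixes frozen in the frozen-head variant (final block 11, ln_f, wte, wpe).
# lm_head is included because GPT-2 ties it to wte; naming is an assumption.
FROZEN_PREFIXES = (
    "transformer.h.11.",
    "transformer.ln_f.",
    "transformer.wte.",
    "transformer.wpe.",
    "lm_head.",
)


def is_frozen(param_name: str) -> bool:
    """Return True if a parameter should stay fixed in the frozen-head variant."""
    return param_name.startswith(FROZEN_PREFIXES)


def split_cycles(tokens, n_cycles=12, holdout_fraction=0.05):
    """Split tokens into n_cycles contiguous training chunks and a held-out tail.

    The tail comes from the end of the text and never appears in any chunk.
    """
    n_holdout = int(len(tokens) * holdout_fraction)
    train = tokens[: len(tokens) - n_holdout]
    heldout = tokens[len(tokens) - n_holdout:]
    chunk = len(train) // n_cycles
    cycles = [train[i * chunk:(i + 1) * chunk] for i in range(n_cycles - 1)]
    cycles.append(train[(n_cycles - 1) * chunk:])  # last chunk takes the remainder
    return cycles, heldout
```

In a PyTorch training loop, a rule like is_frozen would typically be applied by setting requires_grad to False on the matching parameters before building the optimizer; the full-model variant simply skips that step.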

Results

Alignment Trajectory

Cycle      Full-model               Frozen-head
baseline   −0.1228                  −0.1228
1          −0.1336                  −0.0972
2          −0.1100                  −0.0983
3          +0.0061 ← sign flip      −0.0776
4          +0.0772                  −0.0753
5          +0.0920                  −0.0685
6          +0.1026 ← peak           −0.0575
7          +0.0701                  −0.0640
8          +0.0554                  −0.0515
9          +0.0399                  −0.0606
10         +0.0150                  −0.0410
11         −0.0053                  −0.0458
12         −0.0286                  −0.0380

The full-model alignment crossed zero at cycle 3 and went positive: the hidden states briefly pointed the wrong direction relative to token embeddings. It peaked at +0.103 (cycle 6), fell back across zero at cycle 11, and ended at −0.029, partway back toward the −0.123 baseline.

The frozen-head alignment drifted from −0.123 to −0.038 with only small reversals, never crossing zero.
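
The sign flips, peak, and net drift can be read off the table mechanically. The sketch below uses the values exactly as tabulated; the helper names are hypothetical and are not taken from the original scripts.

```python
# Alignment per cycle, copied from the table (index 0 = baseline, 1..12 = cycles).
FULL_MODEL = [-0.1228, -0.1336, -0.1100, 0.0061, 0.0772, 0.0920, 0.1026,
              0.0701, 0.0554, 0.0399, 0.0150, -0.0053, -0.0286]
FROZEN_HEAD = [-0.1228, -0.0972, -0.0983, -0.0776, -0.0753, -0.0685, -0.0575,
               -0.0640, -0.0515, -0.0606, -0.0410, -0.0458, -0.0380]


def sign_flip_cycles(series):
    """Cycles whose alignment has the opposite sign to the previous cycle."""
    return [i for i in range(1, len(series)) if series[i] * series[i - 1] < 0]


def peak_cycle(series):
    """Cycle with the largest (most positive) alignment."""
    return max(range(len(series)), key=lambda i: series[i])


def net_drift(series):
    """Absolute change between baseline and final cycle."""
    return abs(series[-1] - series[0])


print(sign_flip_cycles(FULL_MODEL))      # [3, 11]
print(sign_flip_cycles(FROZEN_HEAD))     # []
print(peak_cycle(FULL_MODEL))            # 6
print(round(net_drift(FULL_MODEL), 4))   # 0.0942
print(round(net_drift(FROZEN_HEAD), 4))  # 0.0848
```

The two variants end within about 0.01 of each other in net drift; what separates them is the path, since only the full-model series changes sign.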

Why the Full-Model Variant Didn't Collapse

Unlike the catastrophic forgetting experiments (70 epochs on the same data, ending in collapse), the sequential-update scenario trains on different data each cycle. The gradient pressure changes direction every 200 steps, preventing the consistent accumulation that drives permanent decoupling. The alignment oscillates rather than accumulating.

Full-model coherence stayed between 0.83 and 0.91 across all 12 update cycles, above the 0.73 baseline:

  • Cycle 6 (peak displacement): "jagged and deliberate, as though it had been the only space for the interrogation rooms"
  • Cycle 12 (returned): "the silence that followed was heavy, heavy with the weight of everything uncertain"

Entropy and Loss

Metric           Full-model (baseline → final)   Frozen-head (baseline → final)
Held-out loss    3.30 → 2.64                     3.30 → 2.55
Output entropy   3.60 → 1.46                     3.60 → 1.68
Coherence        0.73 → 0.85                     0.73 → 0.82

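Both variants improved on held-out loss and coherence, so the sequential training genuinely helped the model learn the domain. The frozen-head variant achieved slightly better held-out loss (2.55 vs 2.64) while retaining higher output entropy (1.68 vs 1.46): its predictions are less sharply peaked yet more accurate on held-out text, which is consistent with better calibration. The full-model variant finished with slightly higher coherence (0.85 vs 0.82).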
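
To make the entropy comparison concrete: a more sharply peaked output distribution has lower entropy, so the full-model variant's lower final entropy means more confident predictions, not less. The sketch below assumes that loss and entropy are reported in nats (natural log), which the report does not state; the two toy distributions are invented purely for illustration.

```python
import math


def entropy_nats(probs):
    """Shannon entropy of a probability distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def perplexity(loss_nats):
    """Perplexity implied by a mean cross-entropy loss given in nats."""
    return math.exp(loss_nats)


# Toy 4-token distributions (illustrative only, not model outputs).
peaked = [0.85, 0.05, 0.05, 0.05]    # confident: most mass on one token
flatter = [0.55, 0.15, 0.15, 0.15]   # less confident: mass spread out

print(entropy_nats(peaked) < entropy_nats(flatter))  # True

# Held-out losses from the table: lower loss means lower perplexity.
print(perplexity(2.55) < perplexity(2.64))           # True
```

Under the nats assumption, the held-out losses of 2.64 and 2.55 correspond to perplexities of roughly 14 and 13 respectively.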


The Combined Risk (MoE + Sequential Updates)

The cycle simulation shows the LM head oscillating. The MoE simulation shows routing concentrating on single experts. Together:

  • The router is concentrating 93.5% of deep-layer traffic on one expert
  • The LM head is oscillating with each update
  • If an update hits the dominant expert AND the LM head is near a zero-crossing → compound failure
  • The probability of compound failure is expected to increase with each cycle as routing concentration deepens
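
One way to turn the compound-risk condition into a per-cycle check is sketched below. It is a rough illustration only: the thresholds eps and share_threshold are hypothetical values chosen for the example, the function names are invented, and the dominant-expert share is held at the reported 93.5% for every cycle even though the text expects it to deepen over time.

```python
# Full-model alignment per cycle, copied from the alignment table.
FULL_MODEL = {1: -0.1336, 2: -0.1100, 3: 0.0061, 4: 0.0772, 5: 0.0920,
              6: 0.1026, 7: 0.0701, 8: 0.0554, 9: 0.0399, 10: 0.0150,
              11: -0.0053, 12: -0.0286}


def near_zero_crossing(alignment, eps=0.02):
    """True when alignment is within eps of zero (eps is a hypothetical threshold)."""
    return abs(alignment) < eps


def compound_risk(alignment, dominant_share, eps=0.02, share_threshold=0.9):
    """Flag a cycle where a near-zero alignment coincides with concentrated routing."""
    return near_zero_crossing(alignment, eps) and dominant_share >= share_threshold


# Assumption: the 93.5% dominant-expert share applies unchanged to every cycle.
flagged = [cycle for cycle, a in FULL_MODEL.items() if compound_risk(a, 0.935)]
print(flagged)  # [3, 10, 11]
```

With these illustrative thresholds the flagged cycles cluster around the two zero crossings, which is where the compound-failure argument places the risk.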

Scripts, per-cycle metrics, and visualisations are available to licensees.