
MoE Routing Collapse — The Extra Matchstick

2026-04-16 · conductor decoupling safety

A 4-expert MoE under 12 sequential update cycles: the deepest layer concentrates 93.5% of traffic on one expert by cycle 12. A single-thread architecture disguised as multi-expert. The supposed redundancy is eliminated by the collapse.

MoE Routing Collapse Under Sequential Updates

Date: 2026-04-16
Companion to: the sequential-update cycle simulation.

Summary

A 4-expert, 4-layer Mixture of Experts language model trained through 12 sequential update cycles (simulating biweekly data updates) exhibits progressive routing collapse: the router concentrates traffic onto a single expert, most strongly in the deeper layers, while the remaining experts atrophy. By cycle 12, the deepest layer assigns 93.5% of mean routing weight to one expert. Router entropy at the output-adjacent layer drops from 1.26 (close to the uniform maximum of 1.386) to 0.26 (near-deterministic), a 79% reduction.

This supports a specific prediction: Mixture of Experts architectures add a further decoupling surface (the router's softmax over experts) on top of the LM head's softmax over vocabulary. Both can collapse independently. When the dominant expert at any layer is destabilised by an update, the system has no fallback: the atrophied experts cannot compensate because they have not received enough traffic to stay calibrated.

Experimental Setup

Model

Architecture: 4-expert MoE LM with top-2 routing, 4 layers, 128-dim, 8.56M params
Router: nn.Linear(dim, n_experts) → softmax → top-k selection
Experts: each is a 2-layer MLP (dim → 4×dim → dim) with GELU
Output: tied token embeddings (same as GPT-2 pattern)
Vocab: GPT-2 tokenizer (50,257 tokens)
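
The router line above (linear projection, softmax, top-k selection) can be sketched for a single token as follows. This is a minimal stdlib illustration, not the experiment's code: the logits are made-up values standing in for the output of nn.Linear(dim, n_experts), and renormalising the selected weights so they sum to 1 is an assumption, since the setup does not say whether that is done.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(logits, k=2, renormalize=True):
    """Return {expert_index: weight} for the k most probable experts.

    renormalize=True rescales the kept weights to sum to 1
    (an assumption; the source does not specify this step).
    """
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    if renormalize:
        mass = sum(probs[i] for i in top)
        return {i: probs[i] / mass for i in top}
    return {i: probs[i] for i in top}

# Hypothetical router logits for one token (illustrative values only).
example_logits = [2.0, 0.1, -0.5, 0.3]
full_probs = softmax(example_logits)
selected = route_top_k(example_logits, k=2)
print("softmax over experts:", [round(p, 3) for p in full_probs])
print("top-2 selection:", {i: round(w, 3) for i, w in selected.items()})
```

With top-2 routing every token still reaches two experts, so a usage figure such as 93.5% is best read as a share of routing weight rather than a share of tokens.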

Training Protocol

12 sequential update cycles, each on a different data batch (contiguous chunks of the TTOBT novel)
300 training steps per cycle, batch size 8, block size 128
AdamW lr=3e-4, weight decay 0.01
Full-model updates (router + all experts + embeddings)
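
The protocol's scale follows from the numbers listed above. The sketch below only does that arithmetic; the class name UpdateSchedule and its field names are hypothetical, and the token counts are positions processed, not unique tokens.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UpdateSchedule:
    """Hyperparameters copied from the training protocol; names are illustrative."""
    cycles: int = 12
    steps_per_cycle: int = 300
    batch_size: int = 8
    block_size: int = 128
    lr: float = 3e-4
    weight_decay: float = 0.01

    @property
    def total_steps(self) -> int:
        return self.cycles * self.steps_per_cycle

    @property
    def tokens_per_step(self) -> int:
        return self.batch_size * self.block_size

    @property
    def tokens_per_cycle(self) -> int:
        return self.tokens_per_step * self.steps_per_cycle

schedule = UpdateSchedule()
print("optimizer steps in total:", schedule.total_steps)
print("token positions per step:", schedule.tokens_per_step)
print("token positions per cycle:", schedule.tokens_per_cycle)
print("token positions overall:", schedule.tokens_per_cycle * schedule.cycles)
```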

Measurements

At each checkpoint:

Router entropy per layer: H = −Σ w·ln(w) over the four expert weights, in nats. Max = ln(4) ≈ 1.386 (uniform). Low = concentrated.
Expert usage distribution: mean routing weight per expert per layer across held-out data
Training loss: next-token-prediction (NTP) cross-entropy on the cycle's batch
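
The two routing measurements can be sketched as below. The source does not state whether entropy is averaged per token or taken over the mean distribution, so the sketch returns both; the function names and the two example tokens are hypothetical.

```python
import math

def entropy(weights):
    """Shannon entropy in nats; zero weights contribute nothing."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

def routing_metrics(per_token_weights):
    """per_token_weights: one list of expert weights (summing to 1) per token.

    Returns (mean per-token entropy, mean usage per expert, entropy of mean usage).
    """
    n_tokens = len(per_token_weights)
    n_experts = len(per_token_weights[0])
    usage = [
        sum(tok[e] for tok in per_token_weights) / n_tokens
        for e in range(n_experts)
    ]
    mean_token_entropy = sum(entropy(tok) for tok in per_token_weights) / n_tokens
    return mean_token_entropy, usage, entropy(usage)

# Two illustrative tokens, each routed confidently but to different experts.
tokens = [
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
]
mean_h, usage, usage_h = routing_metrics(tokens)
print("max entropy for 4 experts:", round(math.log(4), 3))
print("mean per-token entropy:", round(mean_h, 3))
print("mean usage per expert:", [round(u, 3) for u in usage])
print("entropy of mean usage:", round(usage_h, 3))
```

The example shows why the two numbers are worth tracking separately: each token can be routed almost deterministically (low per-token entropy) while mean usage stays spread across experts.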

Results

Router Entropy Collapse

Cycle	Layer 0	Layer 1	Layer 2	Layer 3
baseline	1.273	1.279	1.273	1.258
1	1.127	0.976	0.365	0.499
3	1.065	0.670	0.310	0.557
6	0.994	0.552	0.349	0.484
9	0.940	0.459	0.514	0.384
12	0.908	0.408	0.423	0.262

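Layer 3 (deepest, closest to output) ends with the lowest entropy: from 1.258 at baseline to 0.262 at cycle 12. Layer 0 (shallowest) collapsed least: from 1.273 to 0.908. Layers 0 and 1 decline at every checkpoint; layers 2 and 3 do not. Layer 2 shows the sharpest initial drop (0.365 after cycle 1), rebounds to 0.514 at cycle 9 and ends at 0.423. Layer 3 rises between cycles 1 and 3 before falling. The trend is downward in every layer, but it is monotonic only in the two shallowest.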
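
The entropy table can be checked directly. The sketch below copies its values, computes each layer's relative drop from baseline to cycle 12, and flags which layers decline at every listed checkpoint; the variable names are illustrative.

```python
# Rows copied from the router entropy table: (checkpoint, [layer0..layer3]).
entropy_table = [
    ("baseline", [1.273, 1.279, 1.273, 1.258]),
    (1,          [1.127, 0.976, 0.365, 0.499]),
    (3,          [1.065, 0.670, 0.310, 0.557]),
    (6,          [0.994, 0.552, 0.349, 0.484]),
    (9,          [0.940, 0.459, 0.514, 0.384]),
    (12,         [0.908, 0.408, 0.423, 0.262]),
]

n_layers = 4
# One series per layer, in checkpoint order.
columns = [[row[layer] for _, row in entropy_table] for layer in range(n_layers)]

# Relative reduction from the first to the last checkpoint.
relative_drop = [(col[0] - col[-1]) / col[0] for col in columns]

# True when a layer's entropy never rises between consecutive checkpoints.
non_increasing = [
    all(later <= earlier for earlier, later in zip(col, col[1:]))
    for col in columns
]

for layer in range(n_layers):
    print(
        f"layer {layer}: drop {relative_drop[layer]:.1%}, "
        f"declines at every checkpoint: {non_increasing[layer]}"
    )
```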

Expert Usage — Single-Expert Dominance

Layer 3 expert usage over cycles:

Cycle	Expert 0	Expert 1	Expert 2	Expert 3
baseline	0.25	0.25	0.25	0.25
3	0.813	0.029	0.073	0.085
6	0.837	0.083	0.050	0.030
9	0.892	0.055	0.027	0.026
12	0.935	0.032	0.018	0.015

Expert 0 went from 25% to 93.5% of all layer-3 routing. The other three experts combined handle 6.5%.

Layer 2 expert usage (cycle 12):

Expert 0	Expert 1	Expert 2	Expert 3
0.072	0.072	0.095	0.761

A different expert dominates at layer 2 (Expert 3 at 76.1%). The coherence path runs through a specific expert at each layer: a single thread through the architecture.

Layer 0 expert usage (cycle 12):

Expert 0	Expert 1	Expert 2	Expert 3
0.215	0.239	0.215	0.331

Layer 0 (shallowest) remains more balanced. The collapse is deepest near the output — same pattern as the transformer decoupling, where the final layers are most affected.
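
One way to put the three cycle-12 usage rows on a common scale is the effective number of experts, exp(H) of the mean usage distribution: 4 for a uniform split, 1 for total dominance. This is an added illustration using the usage values quoted above, not a metric reported by the experiment. It is computed on mean usage, so it is not the same quantity as the router entropy in the earlier table.

```python
import math

def entropy(weights):
    """Shannon entropy in nats; zero weights contribute nothing."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

def effective_experts(usage):
    """exp(entropy) of a usage distribution: ranges from 1 to len(usage)."""
    return math.exp(entropy(usage))

# Cycle-12 mean usage per expert, copied from the tables in this section.
cycle12_usage = {
    "layer 0": [0.215, 0.239, 0.215, 0.331],
    "layer 2": [0.072, 0.072, 0.095, 0.761],
    "layer 3": [0.935, 0.032, 0.018, 0.015],
}

effective = {name: effective_experts(u) for name, u in cycle12_usage.items()}
for name, value in effective.items():
    print(f"{name}: about {value:.2f} effective experts out of 4")
```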

The Dominant-Expert Thread

By cycle 12, the model's coherence runs through a single path:

Layer 0: Expert 3 (33.1% — mild preference)
Layer 1: Expert 2 (39.9% — moderate dominance)
Layer 2: Expert 3 (76.1% — strong dominance)
Layer 3: Expert 0 (93.5% — near-total dominance)

This is the "one agent takes control" pattern reported by users of multi-expert systems: most of the routing weight concentrates on a specific expert at each layer, and the coherent output depends on that single thread remaining stable.

Risk Analysis

Normal Operation

The system functions well. The dominant expert produces coherent output. Metrics are normal. Training loss continues to decrease (22.2 → 4.7 over 12 cycles). Nothing visibly wrong.

Fragility

The system is a single-thread architecture disguised as a multi-expert architecture. 93.5% of the deepest layer's computation depends on one expert. If an update destabilises that expert, nearly all output is affected. The three atrophied experts (handling 6.5% combined) cannot absorb the load because:

They haven't been receiving enough traffic to stay calibrated on diverse inputs
Their weights have diverged from the output projection's expectations
The router would need to LEARN to redistribute traffic, which takes many steps

The Catastrophic Scenario

A single data batch that produces large gradients for the dominant expert's weights could rotate its hidden states out of alignment with the LM head — the same mechanism documented in the catastrophic forgetting experiments. Because 93.5% of layer-3 traffic passes through this expert, the rotation affects essentially all output simultaneously.

In a balanced 4-expert system, a single expert being destabilised affects ~25% of traffic. In the collapsed system, it affects ~94%. The MoE architecture was supposed to provide redundancy through multiple experts; the routing collapse has eliminated that redundancy.
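
The comparison above is simple arithmetic on the layer-3 usage table. The sketch treats the dominant expert's mean routing weight as the share of traffic affected when that expert is destabilised, which is the same simplification the paragraph makes; the function name blast_radius is hypothetical.

```python
# Layer-3 mean usage per expert, copied from the usage table.
layer3_usage = {
    "baseline": [0.25, 0.25, 0.25, 0.25],
    3:  [0.813, 0.029, 0.073, 0.085],
    6:  [0.837, 0.083, 0.050, 0.030],
    9:  [0.892, 0.055, 0.027, 0.026],
    12: [0.935, 0.032, 0.018, 0.015],
}

def blast_radius(usage):
    """Share of routing weight carried by the most-used expert."""
    return max(usage)

for cycle, usage in layer3_usage.items():
    print(f"cycle {cycle}: {blast_radius(usage):.1%} of layer-3 routing weight at risk")

# How much larger the exposure is at cycle 12 than in the balanced case.
amplification = blast_radius(layer3_usage[12]) / blast_radius(layer3_usage["baseline"])
print(f"amplification vs balanced routing: {amplification:.2f}x")
```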

Implications

For MoE architectures generally

The router's softmax over experts is an additional decoupling surface
Routing collapse is a natural consequence of sequential updates on varied data
Expert load-balancing losses (used in practice) slow but may not prevent the collapse
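Deeper layers tend to collapse further: layer 3 ends at the lowest entropy and layer 0 at the highest, although layers 1 and 2 finish close together (0.408 and 0.423). This is the same depth gradient seen in transformer decoupling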
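
For reference, one widely used load-balancing loss is the Switch Transformer auxiliary term, N · Σ f_i · P_i, where f_i is the fraction of tokens whose top choice is expert i and P_i is the mean router probability for expert i. The sketch below is a plain-Python illustration of that formula only. Nothing in the setup indicates it was part of this experiment, and the example batches are hypothetical.

```python
def load_balance_loss(router_probs):
    """Switch-style auxiliary loss: N * sum_i f_i * P_i.

    router_probs: one list of per-expert probabilities per token.
    Equals 1.0 when both top-1 assignments and mean probabilities are uniform,
    and grows toward N as routing concentrates on one expert.
    """
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    # f_i: fraction of tokens whose highest-probability expert is i.
    top1_counts = [0] * n_experts
    for probs in router_probs:
        top1_counts[max(range(n_experts), key=lambda i: probs[i])] += 1
    f = [c / n_tokens for c in top1_counts]
    # P_i: mean router probability assigned to expert i.
    p = [sum(probs[i] for probs in router_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Hypothetical balanced batch: each token prefers a different expert.
balanced = [
    [0.4, 0.2, 0.2, 0.2],
    [0.2, 0.4, 0.2, 0.2],
    [0.2, 0.2, 0.4, 0.2],
    [0.2, 0.2, 0.2, 0.4],
]
# Hypothetical collapsed batch: every token uses the cycle-12 layer-3 usage profile.
collapsed = [[0.935, 0.032, 0.018, 0.015] for _ in range(4)]

balanced_loss = load_balance_loss(balanced)
collapsed_loss = load_balance_loss(collapsed)
print("balanced batch loss:", round(balanced_loss, 3))
print("collapsed batch loss:", round(collapsed_loss, 3))
```

In training this term would be scaled by a small coefficient and added to the cross-entropy loss, penalising the concentrated routing seen at layer 3.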

For the unified decoupling theory

The MoE router is another instance of softmax-over-metric-free-categories. The same mechanism that causes catastrophic forgetting (LM head rotation), tometometome (GRU output collapse), and jailbreak vulnerability (hidden state displacement) also causes routing collapse in MoE. Every softmax-over-categories in the architecture is a potential decoupling surface. MoE adds N of them.

Scripts, per-cycle routing metrics, and visualisations available to licensees. Runtime: approximately 3 minutes on an RTX 4070.