Tag: safety

Portfolio · 2026-04-20

Post-Hoc Morphology Correction for Quantised LLMs

Quantised and distilled language models lose irregular morphology first — 'runned', 'childs', 'mouses' — because the irregulars are carried by a smaller fraction of parameters. A 450-entry irregulars table plus a short repair function catches and corrects these without retraining, without latency cost, and without touching the model. Closed-form failures should not be solved by statistical learners.
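As a rough illustration of the table-plus-repair-function idea, here is a minimal sketch. It is not the post's implementation: the three-entry `IRREGULARS` dictionary is a hypothetical excerpt standing in for the 450-entry table, and the names `repair` and `_match_case` are assumptions made for this example.

```python
import re

# Hypothetical three-entry excerpt; the post describes a 450-entry table.
IRREGULARS = {
    "runned": "ran",
    "childs": "children",
    "mouses": "mice",
}

# Match runs of ASCII letters so punctuation and spacing pass through untouched.
_WORD = re.compile(r"[A-Za-z]+")


def _match_case(original: str, replacement: str) -> str:
    """Carry the casing of the original word over to its replacement."""
    if original.isupper():
        return replacement.upper()
    if original[0].isupper():
        return replacement.capitalize()
    return replacement


def repair(text: str, table: dict = IRREGULARS) -> str:
    """Replace over-regularised forms found in the table; leave everything else as is."""
    def fix(match: re.Match) -> str:
        word = match.group(0)
        corrected = table.get(word.lower())
        return _match_case(word, corrected) if corrected else word

    return _WORD.sub(fix, text)


if __name__ == "__main__":
    print(repair("The childs runned after two mouses."))
```

The lookup is context-free, so a sketch like this would also rewrite legitimate uses of a listed form (for example "mouses" as a verb). A real table would need to be curated with that in mind.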
Portfolio · 2026-04-16

Jailbreak Detection Via Geometry

The decoupling mechanism operates at inference time when adversarial prompts push hidden states into unusual regions. Three cheap metrics separate in-domain from adversarial prompts on both GPT-2 and TinyLlama.
Portfolio · 2026-04-16

Sequential Embedding Updates — 12 Cycle Simulation

Simulating biweekly embedding updates. Full-model updates sign-flip at cycle 3 and oscillate; frozen-head updates drift smoothly. The oscillation doesn't collapse — but it creates periodic geometric-confusion windows.
Portfolio · 2026-04-16

MoE Routing Collapse — The Extra Matchstick

A 4-expert MoE under 12 sequential update cycles: the deepest layer concentrates 93.5% of traffic on one expert by cycle 12. A single-thread architecture disguised as multi-expert. The supposed redundancy is eliminated by the collapse.
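One way to quantify this kind of concentration is to measure each expert's share of routed tokens in a layer. The sketch below assumes the router's top-1 decisions are available as a flat list of expert indices; the function names and the synthetic assignment counts are hypothetical, chosen only to reproduce a 93.5% top share, and are not the post's data or code.

```python
import math
from collections import Counter


def routing_shares(assignments, num_experts):
    """Fraction of tokens routed to each expert, indexed by expert id."""
    total = len(assignments)
    if total == 0:
        raise ValueError("no routing assignments given")
    counts = Counter(assignments)
    return [counts.get(expert, 0) / total for expert in range(num_experts)]


def normalized_entropy(shares):
    """Routing entropy scaled to [0, 1]: 1.0 is uniform, 0.0 is fully collapsed."""
    if len(shares) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in shares if p > 0)
    return entropy / math.log(len(shares))


if __name__ == "__main__":
    # Synthetic assignments for one layer: 1000 tokens, 4 experts.
    assignments = [0] * 935 + [1] * 30 + [2] * 20 + [3] * 15
    shares = routing_shares(assignments, num_experts=4)
    print("top share:", max(shares))
    print("normalized entropy:", round(normalized_entropy(shares), 3))
```

Tracking both numbers per layer and per update cycle would show whether traffic is drifting toward a single expert well before the top share reaches the level described above.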
Portfolio · 2026-04-15

Architecture-Universal — Alice & Bob

The same three decoupling signatures appear in Meta's 2017 Alice & Bob negotiation bots: 70× hidden norm explosion, rank-6 output distribution, complete context-sensitivity collapse. 'Tometometome' is the same mechanism expressed through a different architecture.
Portfolio · 2026-04-14

Catastrophic Forgetting Is Pipeline Decoupling

The training-domain loss rises while training on the training data — impossible under classical weight-overwriting. Three mechanism tests confirm the geometric signature. Surgical recovery: resetting the LM head restores 62% of 'forgotten' capability.
Portfolio · 2026-03-22

The No-Deal Button — Training-Environment Echoes in Trained Agents

Five phases and a 50-combination sweep (27,500 rounds) of Alice & Bob self-play. The word 'button' surfaces 95 times across the sweep — a UI mechanism from the training environment that doesn't exist at inference. Temperature-gated, category-stable, concept-persistent under entropy. Announced-but-broken tools still reshape behaviour. The training environment leaves imprints in the model's navigable state space.
