
Floor 10 — Transformers: The Brilliant Architecture With the Wrong Output

The most impressive machine learning architecture ever built has a continuous heart and a one-hot mouth. Nothing wrong with the heart.

The floor

In June 2017, a team of researchers at Google Brain, Google Research, and the University of Toronto published "Attention Is All You Need". The paper introduced the Transformer, an architecture built around a mechanism called self-attention, in which every element of a sequence looks at every element, itself included, with a learned, soft, probability-weighted gaze.

Within a few years, this architecture came to underlie every frontier language model, most major image generators, every serious multi-modal system, and, as of this writing, nearly all of the commercial AI industry. It has won more completely and more quickly than any architecture in the history of the field.

The people who designed it are brilliant. The system works. None of what follows takes anything away from that.

What was picked

A fully continuous interior, forced through a discrete exit.

Look at what self-attention actually does. For every token, it computes a softmax over the tokens it is allowed to attend to, its own position included: a continuous probability distribution. It weights every value vector by that distribution and adds them up. The result is a continuous blend. In the middle, the transformer is a device that represents meaning as positions on a very rich manifold, blended together by continuous weights.
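
The blend can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: it assumes a single attention head, skips the learned query, key, and value projections by reusing the raw token vectors for all three, and runs on made-up two-dimensional numbers. The function names are illustrative. Each token's own position takes part in the softmax here.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over a whole sequence."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # One score per position in the sequence, the token's own included.
        weights = softmax([dot(q, k) / scale for k in keys])
        # Continuous blend: a weighted sum of every value vector.
        blended_row = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(blended_row)
    return outputs

# Toy sequence of three 2-d token vectors (made-up numbers).
# Assumption: no learned projections, so the same vectors serve as Q, K, and V.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
blended = self_attention(tokens, tokens, tokens)
```

Every output row is a weighted average of the value vectors, so it always lies inside the region they span. Nothing discrete happens anywhere in this step.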

In other words: Floor 8's geometry finally gets room to breathe. The interior of a transformer is the most geometrically sophisticated object computing has ever produced.

Now look at the output.

The final layer of every language transformer produces logits over a fixed vocabulary. A softmax turns them into probabilities. Training minimises the cross-entropy against a one-hot target — the "correct" next token. Sampling picks one token to emit. The rich, continuous, manifold-valued interior is crushed through the same Floor-1 threshold on every single forward pass.
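
The exit can be sketched the same way. The four-word vocabulary, the logits, and the random seed below are invented for illustration; a real vocabulary has tens of thousands of entries. The sketch shows two things: cross-entropy against a one-hot target reduces to the negative log-probability of the one "correct" entry, and sampling hands back a single index.

```python
import math
import random

def softmax(logits):
    """Turn logits into a probability distribution over the vocabulary."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_one_hot(probs, target_index):
    """With a one-hot target, every term but one vanishes: the loss is -log p(target)."""
    return -math.log(probs[target_index])

def sample(probs, rng):
    """Collapse the distribution into exactly one vocabulary index."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical vocabulary and logits, for illustration only.
vocab = ["cat", "dog", "mat", "<eos>"]
logits = [2.0, 0.5, 1.0, -1.0]

probs = softmax(logits)
loss = cross_entropy_one_hot(probs, vocab.index("cat"))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
next_token = vocab[sample(probs, rng)]
```

Whatever shape the distribution had, only one entry of it survives into `next_token`.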

The transformer spends billions of parameters learning a geometry. The loss function punishes it every time it fails to produce a bit.

What could have been picked

An architecture whose output is the same kind of object as its interior — a position, a direction, a small region on a manifold. The next "token" is not a choice from a vocabulary but a place the system moves to. Generation is a trajectory on the manifold. "Stopping" is not a special token; it is coming to rest.

This is not science fiction. It has been prototyped. Continuous-output language models have been explored. Energy-based models. Diffusion for text. They do not dominate, because the discrete-token ecosystem — the vocabularies, the benchmarks, the evaluation harnesses, the whole pipeline — rewards systems that play by the one-hot rules. A language model that produced positions instead of tokens would not have anywhere to publish its BLEU scores.
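
As a rough picture of what "a position instead of a token" could mean, here is a toy sketch. Everything in it is an assumption made for illustration: the two-dimensional embedding table, the choice of cosine distance as the loss, and the `has_come_to_rest` helper with its tolerance. It is not taken from any particular continuous-output system.

```python
import math

# Hypothetical 2-d embedding table; real embeddings have hundreds of dimensions.
EMBEDDINGS = {
    "cat": [1.0, 0.1],
    "kitten": [0.9, 0.3],
    "car": [-0.2, 1.0],
}

def cosine_distance(a, b):
    """0 when two vectors point the same way, growing as they diverge."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_word(point):
    """Read a word off the space only when a reader needs one."""
    return min(EMBEDDINGS, key=lambda w: cosine_distance(point, EMBEDDINGS[w]))

def has_come_to_rest(previous, current, tolerance=1e-3):
    """Hypothetical stopping rule: the trajectory has (almost) stopped moving."""
    step = math.sqrt(sum((c - p) ** 2 for p, c in zip(previous, current)))
    return step < tolerance

# The model's output is a position, not a vocabulary index (made-up numbers).
predicted = [1.0, 0.15]

# The loss is a distance, so a near miss costs less than a far miss.
loss_if_target_is_cat = cosine_distance(predicted, EMBEDDINGS["cat"])
loss_if_target_is_car = cosine_distance(predicted, EMBEDDINGS["car"])
```

In this sketch a vocabulary lookup still exists, but only as an optional readout through `nearest_word`; neither the loss nor the stopping rule depends on it.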

Floor 5 is still the loss function.

What we missed

The field came within one layer of the alternate timeline. One layer. The softmax-and-sample that sits at the very end of every transformer is the last token of the old world, grafted onto a machine that has already outgrown it.

Every hallucination is, among other things, a symptom of this. The model has a continuous interior state that represents a probability surface over many plausible continuations. It is forced to collapse that surface into a single choice, and the next step receives the chosen token as its new input, not the surface it was drawn from. The alternatives the model was weighing are discarded at every step.

A transformer that could output where it was on the surface, and pass that surface — not the chosen token — to the next step, would not hallucinate the way current ones do. It would still be wrong sometimes. But it would be wrong in a geometric way, where wrongness has a direction and a distance, instead of a bit-flip way, where wrongness is only measured by whether you picked the right entry in a list.
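
The difference between the two kinds of wrongness can be made concrete with a toy comparison. The three-word vocabulary, the two-dimensional embeddings, and both probability vectors are invented for illustration. Using the probability-weighted average embedding as a stand-in for "where the model is on the surface" is an assumption of this sketch, not an established method.

```python
import math

# Hypothetical vocabulary and 2-d embeddings, for illustration only.
vocab = ["cat", "kitten", "car"]
embeddings = [[1.0, 0.1], [0.9, 0.3], [-0.2, 1.0]]

def cross_entropy_one_hot(probs, target_index):
    """Bit-flip view: only the probability of the target entry matters."""
    return -math.log(probs[target_index])

def expected_embedding(probs):
    """Assumed stand-in for the 'surface': the probability-weighted mean embedding."""
    dims = len(embeddings[0])
    return [sum(p * e[d] for p, e in zip(probs, embeddings)) for d in range(dims)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

target = vocab.index("cat")

# Two wrong-leaning guesses that give the target the same probability.
near_miss = [0.4, 0.6, 0.0]  # the rest of the mass sits on "kitten"
far_miss = [0.4, 0.0, 0.6]   # the rest of the mass sits on "car"

ce_near = cross_entropy_one_hot(near_miss, target)
ce_far = cross_entropy_one_hot(far_miss, target)

dist_near = euclidean(expected_embedding(near_miss), embeddings[target])
dist_far = euclidean(expected_embedding(far_miss), embeddings[target])
```

Both guesses receive the same cross-entropy, because that loss looks only at the probability given to the target. The distance measure tells them apart: leaning toward "kitten" lands close to "cat", while leaning toward "car" lands far away.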

What the next floor will ask

If the model has a continuous interior, and we want to train it to be good, what should "good" look like?

That's Floor 11. The answer the field has converged on is a number.