Portfolio · Computing Lineage

Floor 8 — Backprop Returns, and Brings the One-Hot With It

In 1986, connectionism came back from the cold. It worked. It was also, quietly, wearing a tie. The loss function it carried was a child of Floor 5.

The floor

In October 1986, Nature published Learning representations by back-propagating errors, by David Rumelhart, Geoffrey Hinton, and Ronald Williams. Backpropagation — the chain-rule trick that lets you train a multi-layer network by passing error gradients backwards through the layers — was finally in Nature, finally legible to the wider field, finally impossible to ignore.

The ice cracked. Funding started flowing back into neural networks. Yann LeCun trained convolutional nets on handwritten digits. Jürgen Schmidhuber and Sepp Hochreiter published LSTMs. By the 2010s, deep learning was eating computer vision whole. By the late 2010s, it was eating language.

This is the story everyone knows. This is the comeback.

It is also, quietly, Floor 5 in a new coat.

What was picked

Gradient descent against a discrete target. Feed the network an image. Ask it "what digit is this?" Have it output a probability distribution over the ten digit classes. Compare that distribution to a one-hot vector — all zeros except for a 1 in the correct position. Measure the disagreement using cross-entropy. Nudge the weights in the direction that shrinks the disagreement. Repeat.

Cross-entropy is Shannon's discrete information theory, turned into a loss function. The one-hot vector is the claim that the true answer lives at a single point, and every other point is wrong by an equal amount.

A neural network trained this way spends its whole life learning to compress a continuous interior state down to a discrete output. The middle of the network is richly geometric — activations on a manifold, distances that mean something, directions that encode concepts. The output has to be a bit.

Every network you have ever used is structured this way. The pre-final layers are where the geometry lives. The final softmax is where the geometry is thrown away.

What could have been picked

Training the network against positions instead of categories. Each target is a point — or better, a region — on a manifold, and the network is asked to move its output toward that position. Two targets that are semantically similar occupy nearby points. Uncertainty is distance from any well-populated region, not a flat entropy over a discrete simplex.

This idea was not hidden. Self-organising maps (Kohonen, 1982). Hebbian learning on continuous vectors. Information geometry (Amari, since the 1980s). Manifold learning (Tenenbaum, Roweis, Saul, early 2000s). All of it was saying the same thing: the target is a region on a surface, not an index into a list.

The field kept the surface in the middle of the network — the hidden layers — and put a one-hot sign on the front door.

What we missed

A generation of systems whose output is the same kind of object as their interior. Whose answer to "which digit is this?" is a point in handwriting-space, not a probability over ten bins. Whose uncertainty is geometric — I am far from any region of the space that I know about — rather than statistical.

Systems that, crucially, can say I do not know in a way that isn't just a flat softmax. The one-hot loss does not permit a system to be between answers. It is Floor 1, once again, enforced through the gradient itself.

The return of connectionism in 1986 was a real victory. It unblocked a door that had been nailed shut for fifteen years. It would have been a much bigger victory if it had walked through the door without dragging the one-hot vector in behind it.

What the next floor will ask

What happens when you give networks words, and ask them to learn positions for those words?

That's Floor 9. It's the closest the field has come. And it still isn't quite it.