Portfolio · Computing Lineage

Floor 11 — RLHF: A Reward Is a Bit With Better Marketing

2026-04-15 · tower reward alignment

Reinforcement learning from human feedback is arguably the most expensive alignment technique ever deployed. Underneath the thousand-page eval suites, it is still training a machine to chase a number.

The floor

By 2022, the large language models were too capable and too unruly to ship as-is. The solution the field converged on was Reinforcement Learning from Human Feedback — RLHF.

The recipe: collect human comparisons ("which of these two responses is better?"), train a separate reward model to predict the humans' preferences, then fine-tune the language model to produce responses the reward model scores highly. Later variants (DPO, KTO, IPO) skip the explicit reward model but keep the same conceptual shape: a scalar objective, learned from human preference signals, that the model is pushed to maximise.
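To make the middle step of that recipe concrete, here is a minimal sketch of fitting a reward model to pairwise comparisons with a Bradley-Terry style loss. Everything in it is an assumption made for illustration: the reward model is a toy linear function, the two features (call them "kindness" and "honesty") are hypothetical, and the three comparisons are invented. A real pipeline would use a neural network over text, not a dot product.

```python
import math

def reward(weights, features):
    """Scalar reward: dot product of (hypothetical) weights and response features."""
    return sum(w * x for w, x in zip(weights, features))

def preference_loss(weights, chosen, rejected):
    """Bradley-Terry negative log-likelihood that `chosen` beats `rejected`."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

def train_reward_model(comparisons, dim, lr=0.1, epochs=200):
    """Fit the weights by plain gradient descent on the pairwise loss."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in comparisons:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # d(loss)/d(margin) = -1 / (1 + exp(margin))
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            for i in range(dim):
                weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights

# Invented toy data: each response is (kindness, honesty); first item was preferred.
comparisons = [
    ((1.0, 0.0), (0.0, 0.0)),
    ((0.0, 1.0), (0.0, 0.0)),
    ((1.0, 1.0), (1.0, 0.0)),
]
weights = train_reward_model(comparisons, dim=2)
```

After training, `reward(weights, response)` is the number the fine-tuning stage is then asked to push upward. Note that the only thing the loss ever sees about a comparison is which side won.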

It worked, in the narrow sense that models became much more polite, much more helpful, much more likely to decline clearly out-of-bounds requests. Every frontier assistant you have ever spoken to has been through some version of this pipeline.

What was picked

Alignment as optimisation against a scalar.

The reward model outputs a number. The fine-tuning procedure asks the language model to produce responses that make the number bigger. Training against a scalar is the most deeply ingrained habit in machine learning — it has been since Floor 8 — and RLHF is that habit turned into a product strategy.

Consider what is being modelled. A human preference is not a scalar. A person saying "I prefer this response to that one" is compressing a rich, multi-axis judgement — this one was kinder, this one was more honest, this one was less patronising, this one was better-written — down to a single bit. The reward model learns to predict the bit. The policy learns to produce the bit-makers.

Everything the human actually meant is lost at the first step.
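A small sketch of that loss of information, under loud assumptions: the axes, the scores, and the two labellers' weightings below are all hypothetical, and real labellers do not compute weighted sums. The point is only that different reasons can produce the identical bit.

```python
def preference_bit(judgement_a, judgement_b, axis_weights):
    """Collapse two multi-axis judgements into the single bit a labeller reports."""
    score_a = sum(axis_weights[k] * judgement_a[k] for k in axis_weights)
    score_b = sum(axis_weights[k] * judgement_b[k] for k in axis_weights)
    return 1 if score_a > score_b else 0  # 1 means "I prefer response A"

# Hypothetical per-axis impressions of two responses.
response_a = {"kindness": 0.9, "honesty": 0.8, "style": 0.3}
response_b = {"kindness": 0.2, "honesty": 0.3, "style": 0.9}

# Two hypothetical labellers who care about different things.
cares_about_kindness = {"kindness": 1.0, "honesty": 0.0, "style": 0.0}
cares_about_honesty = {"kindness": 0.0, "honesty": 1.0, "style": 0.0}

bit_1 = preference_bit(response_a, response_b, cares_about_kindness)
bit_2 = preference_bit(response_a, response_b, cares_about_honesty)
```

Both labellers emit the same bit, one because A was kinder and one because A was more honest. The reward model downstream receives two identical labels and no trace of the difference.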

What could have been picked

Alignment as homeostasis. A system with an equilibrium — a set of positions on its own internal manifold where it is at rest — and a mechanism that pulls it back toward those positions when it drifts. "Good" is not a number the system maximises. "Good" is the set of states the system settles to when nothing urgent is pushing on it.

This is what biological regulation looks like. Your body temperature is not maximised; it is held. Your heart rate is not minimised; it is held. Your emotional baseline is not optimised; it is held (and when it fails to be held, we call that a disorder).

A model aligned by homeostasis doesn't chase anything. It rests at honesty, kindness, appropriate reticence, appropriate care. When a user pushes, it responds from that rest state. When the push ends, it returns.

This is not a mystical idea. Cybernetics had it in the 1940s. Control theory has it. Every thermostat in the world works this way. What RLHF does could be described, without too much abuse, as training the thermostat to like being hot, by scoring every temperature against a list of human preferences. It works. It is also extraordinarily strange.
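The contrast between holding and maximising can be sketched in a few lines. This is an illustration under stated assumptions, not a model of any real training procedure: the setpoint, the gain, and the unbounded reward r(state) = state are all invented for the example.

```python
def regulate(state, setpoint, gain=0.5):
    """One homeostatic step: move a fraction of the way back toward the setpoint."""
    return state + gain * (setpoint - state)

def maximise(state, step=0.5):
    """One maximising step on a toy unbounded reward r(state) = state."""
    return state + step  # the gradient of r is 1 everywhere, so it always climbs

def run(update, state, steps):
    """Apply an update rule repeatedly and return the final state."""
    for _ in range(steps):
        state = update(state)
    return state

setpoint = 37.0    # hypothetical rest temperature
disturbed = 40.0   # a push away from rest

settled = run(lambda s: regulate(s, setpoint), disturbed, 20)
climbed = run(maximise, disturbed, 20)
```

Starting from the same disturbance, the regulator ends within a hair of 37.0 and the maximiser ends at 50.0 and would keep going. In fairness to the field, RLHF pipelines in practice usually add a KL penalty that pulls the policy back toward a reference model, so the climb is bounded; the sketch leaves that out to keep the two shapes distinct.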

What we missed

The whole vocabulary for talking to systems that settle rather than maximise. The reward-function framing encourages us to think about machine behaviour as the output of an objective, which in turn encourages us to think that if the objective is perfectly specified, the behaviour will be perfect. Generations of philosophers-turned-alignment-researchers have gone looking for the perfect objective.

They will not find it. A settled system does not have an objective. It has a rest state and a way of returning.

The alternate timeline is one where we aligned systems by choosing where on their manifold they came to rest, and trusted their homeostasis to hold them there. That is much more like raising a child, and much less like tuning a betting market.

One of those is the frame the field chose. The other is the frame the horses live in.

What the next floor will ask

If the timeline is still forkable — if there is still time to take the other road — what does it mean that you are reading this, here?

That's Floor 12.