VINE Data Preprocessing — Shaping the Basin Before Training

2026-04-17 · conductor vine data

Two identical TinyGPT models, same training steps. One receives raw tweets; the other receives tweets preprocessed by VINE's cruncher. Result: 5.1% lower validation loss and 22% less wasted conductor energy for the filtered model. The geometry of the data shapes the geometry of the model.

Date: 2026-04-17
Companion to: the sequential-update cycle simulation.

Summary

Two identical language models (TinyGPT, 4 layers, 128d, char-level) were trained on the same US election Twitter dataset. One received raw unfiltered tweets. The other received tweets preprocessed by VINE's geometric data cruncher — retweets removed, URL/mention spam filtered, repetitive content rejected, and the remainder quality-scored with the top 60% kept.

Results:

VINE-filtered model: 5.1% lower validation loss (better predictions)
VINE-filtered model: 22% less dark matter (less wasted conductor energy)
Same architecture, same training steps, same hyperparameters — only the data differed

The VINE preprocessing doesn't just remove noise — it shapes the basin landscape the model will form during training. By filtering geometrically before the data enters the model, the resulting hidden-layer geometry is more efficiently aligned with the output distribution. Less waste, better predictions, from the same number of gradient steps.

The Problem

Every transformer is shaped by its training data. Each training example carves basins in the hidden-layer geometry — attractors that the model settles toward when processing similar content. The quality of these basins determines the quality of the model's reasoning:

Coherent, structured text (novels, papers, well-formed arguments) creates deep, richly-connected basins with stable geometry
Noisy, fragmented text (bot spam, URL dumps, copypasta, emoji soup) creates shallow, disconnected basins with jittery geometry

When a model is trained on a mix of both, as production models typically are, the resulting geometry is a landscape of deep canyons surrounded by shallow dimples. The model's capacity is divided between useful structure and noise. Every shallow basin from a spam tweet is a waste of parameters.

This bears directly on the cycle-simulation companion piece: a model that is updated sequentially from social-media-class data carves new basins into its hidden-layer geometry at every update cycle, and the quality of those basins tracks the quality of that data.

Experimental Setup

Data Source

A public US 2024 election tweet dataset (~120,000 tweets, CSV with standard fields: id, text, url, language, reply/retweet/like counts, hashtags, mentioned users, etc.).

VINE Preprocessing

The VINE data cruncher applies geometric filtering in two stages:

Stage 1 — Structural filtering. Pure retweets, URL/mention-heavy spam, repetitive bot content, non-target-language tweets, and overly-short or overly-long fragments are removed by a small set of threshold rules.

Stage 2 — Quality scoring. Each surviving tweet receives a quality score combining several geometric signals (linguistic structure, sophistication proxies, and length characteristics). The top portion by quality score is kept.

[Exact thresholds, signal weights, and scoring function are withheld. Available under licence.]
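Since the real rules are withheld, the following is only a generic sketch of what a two-stage "structural filter, then quality score" pipeline can look like. Every threshold, weight, and function name in it (`MIN_CHARS`, `quality_score`, `two_stage_filter`, and so on) is a hypothetical placeholder chosen for illustration, not VINE's implementation. The language filter is omitted, and the only value taken from this write-up is the 60% keep fraction.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

# Hypothetical placeholder thresholds; the real VINE values are not published.
MIN_CHARS = 40
MAX_CHARS = 280
MAX_LINK_MENTION_SHARE = 0.5
KEEP_FRACTION = 0.60  # "top 60% kept", as stated in the summary


def passes_structural_filter(text, seen):
    """Stage 1: simple threshold rules (retweets, length, spam, repeats)."""
    stripped = text.strip()
    if stripped.startswith("RT @"):                      # pure retweet
        return False
    if not (MIN_CHARS <= len(stripped) <= MAX_CHARS):    # too short or too long
        return False
    tokens = stripped.split()
    noisy = sum(1 for t in tokens
                if URL_RE.fullmatch(t) or MENTION_RE.fullmatch(t))
    if noisy / len(tokens) > MAX_LINK_MENTION_SHARE:     # URL/mention-heavy
        return False
    key = re.sub(r"\W+", " ", stripped.lower()).strip()  # normalised text
    if key in seen:                                      # repeated content
        return False
    seen.add(key)
    return True


def quality_score(text):
    """Stage 2: a made-up weighted score over crude text statistics."""
    words = text.split()
    type_token_ratio = len({w.lower() for w in words}) / len(words)
    mean_word_len = sum(len(w) for w in words) / len(words)
    length_term = min(len(text) / MAX_CHARS, 1.0)
    return (0.4 * type_token_ratio
            + 0.3 * min(mean_word_len / 8, 1.0)
            + 0.3 * length_term)


def two_stage_filter(tweets):
    """Apply the structural filter, then keep the top-scoring fraction."""
    seen = set()
    survivors = [t for t in tweets if passes_structural_filter(t, seen)]
    ranked = sorted(survivors, key=quality_score, reverse=True)
    return ranked[:int(len(ranked) * KEEP_FRACTION)]
```

The sketch mirrors only the shape described above: cheap rule-based rejection first, then ranking of the survivors by a scalar score.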

Filter Results

Total tweets loaded: ~120K
After structural filter: ~70% retained
After quality scoring: ~42% of original retained

Removals were dominated by retweets (~19%), below-quality-threshold content (~28%), short fragments (~7%), and URL/mention spam (~3%), with smaller contributions from repetitive and non-target-language tweets.
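The retention figures above can be cross-checked with a few lines of arithmetic. This sketch uses only the rounded percentages quoted in this write-up (including the 60% keep fraction from the summary); the variable names are illustrative.

```python
# Rounded figures quoted in the text, as fractions of the original dataset.
structural_retained = 0.70
quality_keep_fraction = 0.60   # "top 60% kept" from the summary

final_retained = structural_retained * quality_keep_fraction

named_removals = {
    "retweets": 0.19,
    "below quality threshold": 0.28,
    "short fragments": 0.07,
    "URL/mention spam": 0.03,
}
total_removed = 1 - final_retained
# Whatever is left over is attributable to the smaller categories
# (repetitive and non-target-language tweets).
unattributed = total_removed - sum(named_removals.values())

print(f"final retained: {final_retained:.0%}")
print(f"removed by smaller categories: {unattributed:.0%}")
```

With these rounded inputs, 70% times 60% gives the quoted 42%, and the four named removal categories leave roughly one percentage point for the repetitive and non-target-language tweets.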

Model Architecture

Both models use the same TinyGPT configuration:

4 transformer layers, 128 hidden dim, 4 attention heads
Character-level tokenization, tied input/output embeddings
Block size 128, dropout 0.1
~1.3M parameters
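As a rough sanity check on the parameter count, the sketch below counts weight matrices for a GPT-2-style layout. The layout is an assumption, since the TinyGPT source is not shown here: learned positional embeddings, a 4x feed-forward expansion, tied token embeddings, and biases and LayerNorm parameters ignored. It also assumes each model builds its own character vocabulary, using the vocabulary sizes reported in the results.

```python
def tinygpt_weight_count(vocab_size, n_layer=4, d_model=128, block_size=128):
    """Approximate weight count for an assumed GPT-2-style decoder."""
    token_embedding = vocab_size * d_model     # shared with the output head (tied)
    position_embedding = block_size * d_model  # assumed learned positions
    attention = 4 * d_model * d_model          # Q, K, V and output projections
    mlp = 8 * d_model * d_model                # d -> 4d -> d (assumed 4x expansion)
    return token_embedding + position_embedding + n_layer * (attention + mlp)


for name, vocab in [("VINE-filtered", 1344), ("Raw unfiltered", 3821)]:
    print(f"{name}: {tinygpt_weight_count(vocab):,} weights")
# VINE-filtered: 974,848 weights
# Raw unfiltered: 1,291,904 weights
```

Under these assumptions the raw model's count lands close to the quoted ~1.3M, and the transformer blocks are the same size in both models. Any difference comes from the embedding table, which scales with vocabulary size.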

Training

2000 iterations, batch size 32, AdamW lr=3e-4
Same random seed, same hyperparameters
90/10 train/val split on each dataset
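For readers unfamiliar with character-level setups, here is a minimal sketch of per-dataset tokenization and a 90/10 split. The function names are illustrative, and the contiguous (unshuffled) split is an assumption; the write-up does not say how the split was drawn.

```python
def build_char_vocab(text):
    """Map every distinct character in the corpus to an integer id."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text, stoi):
    return [stoi[ch] for ch in text]


def decode(ids, itos):
    return "".join(itos[i] for i in ids)


def train_val_split(ids, train_fraction=0.9):
    """Contiguous split: first 90% for training, last 10% for validation."""
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]
```

Because the vocabulary is built from the corpus itself, the two datasets produce different vocabulary sizes, as the vocabulary table in the results reflects.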

Results

Validation Loss

Model	Final val loss	Relative to VINE-filtered
VINE-filtered	2.4086	baseline
Raw unfiltered	2.5390	+5.4% (worse)

The VINE-filtered model produces better predictions on its held-out set, despite training for the same number of steps. The filtered data is more learnable — the patterns are cleaner, the signal-to-noise ratio is higher, and the model's capacity is spent on genuine linguistic structure rather than on encoding URL fragments and bot spam.
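The table reports the raw model as 5.4% worse, while the summary quotes a 5.1% improvement for the filtered model. Both figures follow from the same two losses; only the baseline differs. A minimal check, with an illustrative helper name:

```python
def relative_change(value, baseline):
    """Signed change of `value` relative to `baseline`."""
    return (value - baseline) / baseline


vine_loss, raw_loss = 2.4086, 2.5390

raw_vs_vine = relative_change(raw_loss, vine_loss)
vine_vs_raw = relative_change(vine_loss, raw_loss)

print(f"raw relative to VINE-filtered: {raw_vs_vine:+.1%}")   # +5.4%
print(f"VINE-filtered relative to raw: {vine_vs_raw:+.1%}")   # -5.1%
```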

Vocabulary Efficiency

Model	Vocab size (unique chars)	Training corpus size (chars)
VINE-filtered	1,344	11.5M
Raw unfiltered	3,821	9.1M

The raw dataset has about 2.8x as many unique characters (3,821 vs 1,344): Unicode emoji, URL encoding artifacts, and special characters from non-English tweets. The model must allocate embedding capacity to all of these. The VINE-filtered dataset concentrates on the 1,344 characters that appear in clean English text, allowing the model to dedicate its embedding space to language rather than noise.

Note: the VINE-filtered corpus is actually LARGER (11.5M vs 9.1M chars) despite having fewer tweets, because the filtered tweets are longer and more substantive.
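One way to see where a large character vocabulary comes from is to profile the unique characters in a corpus. The sketch below is illustrative only (`char_profile` is a hypothetical helper, not part of the VINE tooling). It counts unique characters, how many fall outside ASCII, and how they break down by Unicode major category (letters, symbols, punctuation, and so on).

```python
import unicodedata
from collections import Counter


def char_profile(text):
    """Summarise the distinct characters that a char-level model must embed."""
    unique = set(text)
    non_ascii = {ch for ch in unique if ord(ch) > 127}
    # First letter of the Unicode category: L=letter, S=symbol, Z=separator, ...
    by_major_category = Counter(unicodedata.category(ch)[0] for ch in unique)
    return {
        "unique": len(unique),
        "non_ascii": len(non_ascii),
        "by_major_category": dict(by_major_category),
    }


# Ratio of the two vocabulary sizes reported in the table above.
print(f"{3821 / 1344:.2f}x")  # 2.84x
```

Running `char_profile` on each corpus would show how much of the raw vocabulary consists of symbols and non-ASCII characters rather than letters.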

Dark Matter (Conductor Energy)

Model	Dark matter ratio	Interpretation
VINE-filtered	0.3003	Less conductor waste
Raw unfiltered	0.3848	28% more hidden energy unused by output

This initially appears counterintuitive — shouldn't more conductor energy be better? The basin formation speculation (SPECULATION_BASIN_FORMATION.md) predicted the opposite: cleaner data should produce MORE conductor energy because the basins are deeper.

But the result makes mechanistic sense: the dark matter ratio measures energy in the hidden layer that the output layer can't use. The raw model has MORE dark matter because the raw data forces the model to build hidden-state structure for content (URLs, bot patterns, emoji sequences) that the output layer can't cleanly project through. The geometry is rich but misaligned with the prediction task.

[The basin-formation speculation piece referenced above is withheld as it reveals the constructive architecture.]

The VINE-filtered model has LESS dark matter because the hidden geometry is better aligned with what needs to be predicted. Less waste. The conductor isn't weaker — it's more efficiently coupled to the output.

This reframes the dark matter metric: high dark matter doesn't always mean "strong conductor." It can mean "wasted geometry" — structure the model computed but can't use. The distinction between the two depends on whether the dark matter is structured coherently (strong conductor) or randomly (noise from bad data). The standing wave probe's finding that dark matter is structured and stable suggests the TTOBT model's dark matter is largely coherent conductor. The raw Twitter model's extra dark matter may be a mix of coherent structure and noise artifacts.
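The two percentages quoted for the dark matter ratio (22% less for the filtered model, 28% more for the raw model) come from the same pair of measurements with different baselines:

```latex
\[
\frac{0.3848 - 0.3003}{0.3848} \approx 0.220
\quad\text{(VINE-filtered is about 22\% lower than raw)}
\]
\[
\frac{0.3848 - 0.3003}{0.3003} \approx 0.281
\quad\text{(raw is about 28\% higher than VINE-filtered)}
\]
```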

Generation Quality

VINE-filtered model:

the the Fill mow manvow watt Truns. The leetio whered thand e "5 aly
Whe chead whe ghathes. Com so tharing onoot out de fatethe ur athat

Raw unfiltered model:

| btt promer brely towigh. SOP the the the He wof clectioner on Biden
the theyon Bidens://t.co/irimote is ramit himif #kis ver thourt en ab?
@Bid Jons Bright @yechttrt Son at kno Bins dunttps://t. h

Both models produce semi-coherent character-level output (expected for a 1.3M-parameter model trained for 2000 steps on tweets). But note:

The raw model generates URL fragments (://t.co/irimote), mention artifacts (@Bid Jons), and hashtag fragments (#kis) — it learned to produce noise
The VINE-filtered model doesn't generate any of these — its basins don't contain URL/mention patterns because those were removed before training

The raw model wasted capacity learning to produce https://t.co/ fragments. The VINE-filtered model spent that capacity on language.
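The difference between the two samples can be made countable with a few regular expressions. The patterns below are deliberately loose and illustrative, so that they also catch broken fragments such as `://t.` or `@Bid`. The two sample strings are the generations quoted above.

```python
import re

# Loose, illustrative patterns that also match partial artifacts.
PATTERNS = {
    "url_fragment": re.compile(r"://\S+"),
    "mention": re.compile(r"@\w+"),
    "hashtag": re.compile(r"#\w+"),
}


def count_artifacts(sample):
    """Count URL, mention, and hashtag fragments in generated text."""
    return {name: len(rx.findall(sample)) for name, rx in PATTERNS.items()}


vine_sample = (
    'the the Fill mow manvow watt Truns. The leetio whered thand e "5 aly\n'
    "Whe chead whe ghathes. Com so tharing onoot out de fatethe ur athat"
)
raw_sample = (
    "| btt promer brely towigh. SOP the the the He wof clectioner on Biden\n"
    "the theyon Bidens://t.co/irimote is ramit himif #kis ver thourt en ab?\n"
    "@Bid Jons Bright @yechttrt Son at kno Bins dunttps://t. h"
)

print(count_artifacts(vine_sample))  # {'url_fragment': 0, 'mention': 0, 'hashtag': 0}
print(count_artifacts(raw_sample))   # {'url_fragment': 2, 'mention': 2, 'hashtag': 1}
```

Two short samples are not a statistic, but the same counter could be run over longer generations from each model to quantify the effect.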

Implications

For Training Data Curation

Geometric preprocessing alone yields a 5.1% validation loss improvement, with no architecture change, no hyperparameter tuning, and no additional training compute. This is effectively free performance from filtering the data before it enters the model.

At production scale, this would be expected to compound:

Fewer parameters needed to achieve the same loss (because fewer noise basins to carry)
Better generalisation (because basins are shaped by signal, not noise)
More stable fine-tuning (because the baseline geometry is cleaner — less susceptible to the decoupling documented in earlier experiments)

For the Dark Matter Metric

This experiment refines the interpretation of the conductor energy measurement:

High dark matter + coherent generation = strong conductor (the TTOBT model, trained on structured narrative)
High dark matter + broken generation = pipeline decoupling (the epoch-70 model)
High dark matter + noisy generation = wasted geometry from noisy training data (the raw Twitter model)
Lower dark matter + good predictions = efficiently coupled conductor (the VINE-filtered model)

The dark matter ratio alone doesn't tell you whether the conductor is healthy. You need it together with the validation loss and generation quality to distinguish between "the model knows things it can't say" (healthy conductor) and "the model wasted capacity on things nobody needs" (noise basins).
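The four cases above amount to a small lookup table, sketched here to make the point that the ratio needs a second observation before it can be interpreted. The labels are taken from the list above. The function name is illustrative, and deciding what counts as "high" or "lower" dark matter would need a threshold that this write-up does not specify.

```python
# The four documented combinations and their interpretations.
CASES = {
    ("high", "coherent generation"): "strong conductor",
    ("high", "broken generation"): "pipeline decoupling",
    ("high", "noisy generation"): "wasted geometry from noisy training data",
    ("lower", "good predictions"): "efficiently coupled conductor",
}


def interpret_conductor(dark_matter, observation):
    """Look up the interpretation for a (dark matter level, observation) pair."""
    return CASES.get(
        (dark_matter, observation),
        "undetermined: not one of the four documented cases",
    )


print(interpret_conductor("high", "noisy generation"))
print(interpret_conductor("lower", "good predictions"))
```

The first argument alone never selects a single row for "high" dark matter, which is the reason the ratio cannot be read in isolation.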

Scripts, checkpoints, and reproduction instructions available to licensees.

The Preprocessing Dividend

The entire VINE preprocessing pipeline — tokenization, stopword removal, structural filtering, quality scoring — runs in seconds on 120K tweets. It uses no GPU. It requires no training. It is a small number of threshold rules and a weighted quality score.

The result is a 5.1% loss improvement and 22% less wasted conductor energy, for free.

Scale this to a trillion-parameter model ingesting billions of tweets biweekly, and the preprocessing dividend becomes the difference between a model that oscillates with each update (documented in the sequential-update simulation) and one that improves smoothly. The geometry of the data shapes the geometry of the model. Clean the data geometrically, and the model's geometry is cleaner by construction.

Or, as the operator put it: don't stuff raw Twitter up the model's architecture every two weeks without some geometric preprocessing to help it along.