Etymological Layering in English Lemmatisation
English is three morphological systems pretending to be one: Germanic inflection, partly productive Latin derivation, and Greek compound roots, with a smaller layer of French borrowings on top. A layer-specific collapse rule, vocabulary-gated for the Latin and French layers, handles all four correctly; a single uniform rule cannot. The layer distribution of a text also turns out to be a cheap stylometric feature.
A vocabulary-gated collapse rule for Latin-layer morphology
Problem
A rule-based lemmatiser that handles all English suffixes with uniform rules produces two classes of errors on Latin-layer vocabulary:
- False positives. Mission is stripped to a nonexistent miss- stem, or to the unrelated English verb miss (meaning "fail to hit", from Germanic missan). Vision, station, question behave similarly.
- False negatives. The lemmatiser refuses to collapse decision, movement, expression out of caution about the first class, producing three distinct basins where English users recognise one concept with multiple roles.
Both errors are the consequence of treating English as morphologically uniform. English is not. It is three well-documented morphological systems stacked: an inherited Germanic/Old English layer that provides the inflectional machinery and most common function words, a Latin layer introduced largely through ecclesiastical and scholarly borrowing, and a Greek layer of compound roots used in scientific and philosophical vocabulary. A French/Norman layer adds further borrowings with their own smaller inventory of suffixes.
The layers coexist in modern English but have different productivity profiles.
Observation
Germanic-layer derivational suffixes (-ly, -ness, -ful, -less, -hood, -ship, -dom, -th) are productive and regular. Softly is always related to soft, stillness to still. Collapsing these is safe even when the underlying root happens not to be present in a working vocabulary.
Latin-layer derivational suffixes (-tion, -sion, -ment, -ity, -ance, -ence, -ive, -ous, -al, -able, -ible) are conditionally productive. Some Latin-derived words have a corresponding living English verb — decide / decision, move / movement, express / expression, resist / resistance. Others do not — mission, vision, station, nation, question, intelligence, patience. For the second group, English has effectively frozen the Latin noun as a standalone concept; the underlying Latin verb either never entered English as a productive root, entered and then died, or survives only in other derivatives.
Greek-layer suffixes (-logy, -graphy, -meter, -scope, -phone, -cracy) are different again. They are compound roots rather than derivational suffixes. Biology is not the nominalisation of a verb; it is bios + logos. Stripping is inappropriate; tagging the constituent roots is.
French-layer suffixes (-age, -ette, -esque) follow roughly the same pattern as Latin, with a higher proportion of frozen borrowings.
The correct behaviour therefore depends on which layer a word belongs to and, within the Latin and French layers, on whether the candidate root exists as a live English word.
Implementation
Suffixes are tagged with their layer at the specification level:
```python
LAYERS = {
    'ly': 'germanic', 'ness': 'germanic', 'ful': 'germanic',
    'less': 'germanic', 'hood': 'germanic', 'ship': 'germanic',
    'tion': 'latin', 'sion': 'latin', 'ment': 'latin',
    'ity': 'latin', 'ance': 'latin', 'ence': 'latin',
    'ive': 'latin', 'ous': 'latin', 'al': 'latin',
    'logy': 'greek', 'graphy': 'greek', 'meter': 'greek',
    'age': 'french', 'ette': 'french',
    # ...
}
```
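As a rough illustration of how such a table might be consulted, the sketch below looks up the layer of a word by suffix. The function name `detect_layer`, the longest-suffix-first ordering, and the `min_stem` threshold are assumptions made for this example and are not taken from the original implementation; the table is a subset of the one above.

```python
# Subset of the LAYERS table above, repeated so the sketch runs on its own.
LAYERS = {
    "ly": "germanic", "ness": "germanic", "ful": "germanic",
    "tion": "latin", "sion": "latin", "ment": "latin", "al": "latin",
    "logy": "greek", "graphy": "greek",
    "age": "french", "ette": "french",
}

# Assumption: try longer suffixes first, so that a longer suffix would win
# if the table ever contains one suffix nested inside another.
_SUFFIXES = sorted(LAYERS, key=len, reverse=True)


def detect_layer(word, min_stem=3):
    """Return (suffix, layer) for the first matching suffix, or (None, None).

    min_stem is a hypothetical guard: a match is rejected when fewer than
    min_stem characters would remain after removing the suffix ("fly", "age").
    """
    w = word.lower()
    for suffix in _SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return suffix, LAYERS[suffix]
    return None, None


print(detect_layer("softly"))
print(detect_layer("biology"))
print(detect_layer("fly"))
```

A purely orthographic lookup like this will still mislabel words whose ending only looks like a suffix, which is one reason the layer tag alone does not decide whether to collapse.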
The lemmatisation procedure dispatches on layer:
- Germanic layer: unconditional collapse. Candidate stems are computed by suffix-specific rules; the first plausible candidate is accepted whether or not it is present in the vocabulary. Rationale: Germanic derivational morphology is productive enough that unseen roots are highly likely to be real.
- Latin layer: vocabulary-gated collapse. Candidate stems are computed using suffix-specific rules that account for Latin morphophonological patterns (decision → decide via -sion → -de; expression → express via -ssion → -ss; exhaustion → exhaust via -tion → -t). If any candidate is present in the provided vocabulary, the collapse is committed. If none is, the surface form is tagged base_latin_frozen and retained as its own lemma.
- Greek layer: no collapse. The word is tagged with its Greek root and kept as a standalone lemma.
- French layer: vocabulary-gated, as for Latin.
The vocabulary argument to the lemmatiser is the set of known base forms — in a concept-graph system, this is the set of existing concept-node keys. The vocabulary-gated rule is then semantically meaningful: a Latin-layer collapse is committed exactly when the target concept already exists, in which case the collapse represents a grammatical role edge to that concept; it is refused when the target does not exist, in which case forcing a collapse would either create a bogus new concept or erroneously redirect to a homonym.
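The dispatch described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's implementation: the `RULES` table, its replacement endings, the `lemmatise` signature, and the fallback tag `"base"` for unsuffixed words are all hypothetical, and the Greek branch omits the root gloss.

```python
# Hypothetical rules: suffix -> (layer, edge tag, replacement endings to try).
RULES = {
    "ly":    ("germanic", "adverb_ly", [""]),
    "ness":  ("germanic", "nom_ness", [""]),
    "ssion": ("latin", "nom_sion", ["ss"]),
    "sion":  ("latin", "nom_sion", ["de", "d", "se"]),
    "tion":  ("latin", "nom_tion", ["t"]),
    "ment":  ("latin", "nom_ment", [""]),
    "logy":  ("greek", "greek_compound", []),
    "age":   ("french", "nom_age", ["", "e"]),
}
_SUFFIXES = sorted(RULES, key=len, reverse=True)  # longest suffix first


def lemmatise(word, vocabulary):
    """Return (lemma, edge_tag, layer) using layer-specific collapse rules."""
    w = word.lower()
    for suffix in _SUFFIXES:
        # Skip non-matches and matches that would leave a stem of one letter.
        if not w.endswith(suffix) or len(w) <= len(suffix) + 1:
            continue
        layer, tag, endings = RULES[suffix]
        stem = w[: -len(suffix)]
        candidates = [stem + ending for ending in endings]
        if layer == "greek":
            return w, tag, layer              # no collapse, keep the compound
        if layer == "germanic":
            return candidates[0], tag, layer  # unconditional collapse
        for candidate in candidates:          # Latin and French: gated
            if candidate in vocabulary:
                return candidate, tag, layer
        return w, f"base_{layer}_frozen", layer
    return w, "base", None                    # assumption: no suffix matched


# Stand-in for the concept-node keys of a concept graph.
vocabulary = {"still", "decide", "express", "move", "exhaust"}
for word in ["softly", "decision", "expression", "mission", "biology", "cottage"]:
    print(word, lemmatise(word, vocabulary))
```

Note that the gate in this sketch only checks membership. It does nothing to exclude an unrelated homonym that happens to be in the vocabulary, so the candidate rules themselves have to be conservative.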
Correctness
The procedure returns (lemma, edge_tag, layer). The disambiguation behaviour on representative cases:
| Input | Output |
|---|---|
| softly | (soft, adverb_ly, germanic) |
| stillness | (still, nom_ness, germanic) |
| decision | (decide, nom_sion, latin) |
| expression | (express, nom_sion, latin) |
| movement | (move, nom_ment, latin) |
| silence | (silent, nom_ence, latin) |
| resistance | (resistant, nom_ance, latin) |
| mission | (mission, base_latin_frozen, latin) |
| vision | (vision, base_latin_frozen, latin) |
| question | (question, base_latin_frozen, latin) |
| biology | (biology, greek_compound[bio+logos:word/reason], greek) |
| cottage | (cottage, base_french_frozen, french) |
All twenty-one regression cases for the layer-aware lemmatiser pass. The vocabulary used in evaluation includes the living English verb and adjective roots one would expect to find in any general-purpose English concept graph.
A stylometric consequence
The layer distribution of a text is a measurable property. Running the lemmatiser over a 5,290-word prose chapter and partitioning by narrative scene produced the following Germanic-layer and Latin-layer token percentages:
| Scene type (summary) | Germanic % | Latin % |
|---|---|---|
| Interior consciousness, narration | 11–15 | 2–4 |
| Institutional report prose | 7 | 16 |
| Dialogue (bureaucratic characters) | 9 | 8 |
The difference between narrative interiority (low Latin) and institutional reportage (high Latin) is large enough to distinguish the two by shallow morphological features alone. This was not a designed feature of the lemmatiser; it fell out of the layer-tagging. The practical implication is that the layer distribution can be used as a cheap, interpretable feature for stylometric analysis, authorship detection, or genre classification, in settings where a deep neural approach would be overkill.
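A feature of this kind could be computed roughly as follows. The sketch is an approximation under stated assumptions: it counts suffix matches directly instead of running the full lemmatiser, it uses a hypothetical subset of the suffix table, and it takes the percentage over all word tokens, since the denominator behind the figures above is not specified. The name `layer_profile` and the `min_stem` guard are likewise illustrative.

```python
import re
from collections import Counter

# Hypothetical subset of the suffix-to-layer table.
SUFFIX_LAYER = {
    "ly": "germanic", "ness": "germanic", "ful": "germanic", "less": "germanic",
    "tion": "latin", "sion": "latin", "ment": "latin", "ity": "latin",
}
_SUFFIXES = sorted(SUFFIX_LAYER, key=len, reverse=True)


def layer_profile(text, min_stem=3):
    """Return {layer: percentage of all word tokens carrying a suffix of that layer}."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for token in tokens:
        for suffix in _SUFFIXES:
            if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
                counts[SUFFIX_LAYER[suffix]] += 1
                break  # count each token at most once
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {layer: 100.0 * n / total for layer, n in counts.items()}


interior = "She walked softly in the stillness"
report = "The commission issued a resolution on implementation"
print(layer_profile(interior))
print(layer_profile(report))
```

The resulting dictionary can be fed as a two- or four-dimensional feature vector into any simple classifier, or compared directly across scenes as in the table above.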
Limitations
The vocabulary-gated procedure is only as good as the vocabulary. A vocabulary that is missing common verb roots will produce excess false-frozen tags. In a concept-graph setting this is self-correcting, because the gradual accretion of concepts into the graph progressively reduces the set of apparent frozen borrowings. In a static evaluation setting it requires a reasonably complete seed vocabulary to avoid misleading results.
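The self-correcting behaviour can be illustrated with a toy gate. The candidate stems, the `gate` helper, and the `frozen_words` function are hypothetical names introduced only for this sketch; the point is that re-running the same gate against a grown vocabulary shrinks the set of words tagged as frozen, while genuinely frozen borrowings stay frozen.

```python
def gate(surface, candidates, vocabulary):
    """Collapse to the first candidate present in the vocabulary, else keep the surface form."""
    for candidate in candidates:
        if candidate in vocabulary:
            return candidate, "collapsed"
    return surface, "frozen"


# Hypothetical candidate stems for three Latin-layer nouns.
CANDIDATES = {
    "movement": ["move"],
    "decision": ["decide"],
    "vision": ["vide", "vise"],
}


def frozen_words(vocabulary):
    """List the words the gate would currently tag as frozen."""
    return sorted(
        word for word, stems in CANDIDATES.items()
        if gate(word, stems, vocabulary)[1] == "frozen"
    )


seed = {"move"}               # sparse seed vocabulary
grown = seed | {"decide"}     # the graph later acquires the concept "decide"

print(frozen_words(seed))     # decision is falsely frozen at this point
print(frozen_words(grown))    # only the genuinely frozen word remains
```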
Some Latin- and French-layer words are in a genuinely ambiguous state: marriage has a living root marry but behaves as if frozen in most contexts; survival is clearly derived from survive but shades into frozen use in expressions like survival instinct. The current procedure commits to collapse if the root is present in the vocabulary, which is the correct default in most applications, but may over-collapse for applications that need to distinguish the compound-as-concept from the verbal-process sense.
The Greek-layer handling is coarse. The current implementation recognises ten common compound suffixes; English scientific vocabulary has a longer tail of less frequent roots (-pathy, -trophy, -genesis, -plasm) that would be straightforward to add but have not been exhaustively catalogued.
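One way to keep the Greek inventory easy to extend is a small glossary keyed by the final element, as sketched below. The `GREEK_FINALS` table, the `tag_greek` function, and the decision to leave the first element unglossed are assumptions for illustration; the tag string imitates the format shown in the results table above.

```python
# Hypothetical glossary: final element -> (Greek root, gloss).
GREEK_FINALS = {
    "logy":    ("logos", "word/reason"),
    "graphy":  ("graphein", "to write"),
    "meter":   ("metron", "measure"),
    "scope":   ("skopein", "to look at"),
    "phone":   ("phone", "sound/voice"),
    "cracy":   ("kratos", "power/rule"),
    # Longer-tail entries are added the same way:
    "pathy":   ("pathos", "feeling/suffering"),
    "genesis": ("genesis", "origin"),
}
_FINALS = sorted(GREEK_FINALS, key=len, reverse=True)  # longest first


def tag_greek(word):
    """Return (lemma, tag, layer) for a recognised Greek compound, else None.

    The word is never stripped; only its constituent parts are recorded.
    """
    w = word.lower()
    for final in _FINALS:
        if w.endswith(final) and len(w) > len(final):
            first = w[: -len(final)]
            root, gloss = GREEK_FINALS[final]
            return w, f"greek_compound[{first}+{root}:{gloss}]", "greek"
    return None


print(tag_greek("biology"))
print(tag_greek("sympathy"))
print(tag_greek("softly"))
```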
Related work
The three-layer decomposition of English morphology is standard in historical linguistics (Baugh & Cable, A History of the English Language; Algeo, The Origins and Development of the English Language). The Latinate layer's partial productivity is discussed in Aronoff's work on English word-formation rules. The application of this decomposition as a runtime gate for lemmatisation appears to be novel, though it is a small enough insight that parallel discovery is likely. The implementation here is intended as a reference and a reproducibility target, not as a claim of priority.
Raychell Langan · NEXICOG Ltd · Hampshire, UK