Language Lab
I have been slowly building a language tool that behaves like a working linguistics system. The goals are realism and traceability. I want to be able to explain why a word looks the way it does, which structures are allowed, and how a modern word emerges from older stages through explicit change.
This is a walkthrough of how the pipeline is currently structured.
Data layout
Language data lives here in this website repository, not in the language lab tool itself. The tool loads these data sets when it starts up, which centralizes my language data here for future features (e.g., a glossary, a fan name generator).
What's in that data? At a high level:
- A manifest declares which languages exist and where their files live.
- Each language has its own profile and lexicon.
- Optional layers include eras, dialects, and ontology overlays. These overlays describe how concepts relate to one another, so the system can reason about meaning instead of relying only on literal word matches.
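To make that concrete, here's roughly the shape I mean, as a minimal sketch. Every path and field name below is hypothetical; the real schema is whatever the repository's data files declare.

```python
# Hypothetical manifest shape: languages plus the files they declare.
# All paths and field names here are invented for illustration.
manifest = {
    "languages": [
        {
            "id": "thalen",
            "profile": "languages/thalen/profile.json",
            "lexicon": "languages/thalen/lexicon.json",
            # optional layers
            "eras": "languages/thalen/eras.json",
            "dialects": "languages/thalen/dialects.json",
            "ontology": "languages/thalen/ontology.json",
        }
    ]
}

def layers_for(language_id: str) -> dict:
    """Return the files a language declares (illustrative lookup only)."""
    entry = next(e for e in manifest["languages"] if e["id"] == language_id)
    return {key: path for key, path in entry.items() if key != "id"}

print(sorted(layers_for("thalen")))
# -> ['dialects', 'eras', 'lexicon', 'ontology', 'profile']
```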
That structure forces discipline. If the generator produces something incorrect, the fix usually belongs in the data rather than in the generation logic. It also means, theoretically, the tooling gets better the more data I can shove into it. That takes some of the pressure off me to build out grammatical or lexical rules like I'm Tolkien. I'm really not good at languages, so this is my programming cheat code.
The pipeline
The pipeline is a sequence of constrained choices. Each stage reduces the space of valid outputs before the next decision is made. We sort of narrow in on a valid word form rather than generate it outright.
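In code terms, you can picture it like this sketch: each stage is a function that only ever shrinks the candidate set. The stage shown is a toy stand-in, not the tool's actual logic.

```python
# Minimal sketch of the narrowing idea. Each stage takes the current
# candidate space and returns a smaller one; no stage widens it.
def generate(candidates: list[str], stages) -> list[str]:
    for stage in stages:
        candidates = stage(candidates)
        if not candidates:
            raise ValueError(f"{stage.__name__} eliminated every candidate")
    return candidates

def legal_onsets_only(candidates: list[str]) -> list[str]:
    # Toy phonotactic stage: drop forms starting with a forbidden cluster.
    return [c for c in candidates if not c.startswith("tl")]

print(generate(["tlara", "thara"], [legal_onsets_only]))  # -> ['thara']
```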
1) Language, dialect, era
The first step selects a language, an optional dialect (only when the language declares dialects), and an era. So far I have one language (Thalen), two dialects (Upper Thalen and Southern Thalen), and three eras (Ancient, Old, and Modern).
Those choices determine:
- phonology and phonotactics, the sounds the language uses and the rules that govern how those sounds can legally combine
- morphology templates and constraints, the patterns used to assemble words from smaller pieces, and the limits on which patterns are allowed
- sound change rules applied across an era chain, a list of transformations that gradually alter pronunciation over time
- dialect-specific overrides and affixes, local adjustments that reflect regional speech or cultural variation
(No, I didn't know most of these terms before I started this process. This is what search engines are for!)
Era selection matters because forms are evolved forward through declared changes. A modern output is the result of applying those changes in order, not a fresh construction. This is my digital eTolkien methodology.
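A minimal sketch of what "applying changes in order" can look like, assuming regex-style rewrite rules. The rules and forms below are invented; the real ones are declared in the era data.

```python
import re

# Toy era chain: each era carries an ordered list of rewrite rules.
ERA_CHAIN = [
    ("ancient -> old", [(r"ph", "f"), (r"k$", "kh")]),
    ("old -> modern",  [(r"kh", "h"), (r"aa", "a")]),
]

def evolve(form: str) -> str:
    for _era, rules in ERA_CHAIN:
        for pattern, replacement in rules:  # order matters within an era
            form = re.sub(pattern, replacement, form)
    return form

print(evolve("phaalak"))  # ancient 'phaalak' -> modern 'falah'
```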
2) Purpose and name classification
Next comes the purpose of the word: name, place, thing, or abstract concept.
Names are further split into personal and family forms, since those tend to follow different structural conventions. This choice controls which morphology templates are allowed and which roots are filtered out. For example, roots associated with institutions or collectives are excluded from family names unless explicitly requested. In early tests, all my families ended up being called "Clanguy" and the like. Not great.
This step establishes expectations early so later stages are operating within a compatible structure.
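A sketch of that filtering, with invented tags and an invented exclusion table, just to show the shape of the rule that keeps "Clanguy" out of family names.

```python
# Hypothetical roots and purpose-based exclusions, for illustration only.
ROOTS = [
    {"form": "kel", "tags": {"clan", "institution"}},
    {"form": "mir", "tags": {"water", "calm"}},
]

EXCLUDED_BY_PURPOSE = {
    "family_name": {"institution", "collective"},  # no more "Clanguy"
}

def roots_for(purpose: str) -> list[dict]:
    banned = EXCLUDED_BY_PURPOSE.get(purpose, set())
    return [root for root in ROOTS if not (root["tags"] & banned)]

print([root["form"] for root in roots_for("family_name")])  # -> ['mir']
```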
3) Register, contact, orthography
These are modeled as separate layers because they affect different aspects of the output word.
- Register represents social tone, influencing whether more formal or everyday vocabulary is preferred
- Contact models influence from other languages, allowing borrowing rules to activate when cultures interact. I expect this will have a greater effect as I add new languages, but I need to do more research on how these influences work: root borrowing, sound shifts, etc.
- Orthography controls how a word is written, converting an internal sound representation into a visible spelling
Keeping these layers separate allows the same underlying word to appear differently in different contexts without changing its core structure.
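Orthography is the easiest layer to show in miniature: the same internal phoneme string gets spelled differently per style. The mappings here are made up, not Thalen's.

```python
# Invented phoneme-to-spelling tables; the real ones live in the data.
ORTHOGRAPHIES = {
    "standard": {"θ": "th", "ʃ": "sh"},
    "archaic":  {"θ": "þ",  "ʃ": "sc"},
}

def spell(phonemes: list[str], style: str) -> str:
    table = ORTHOGRAPHIES[style]
    return "".join(table.get(p, p) for p in phonemes)

word = ["θ", "a", "ʃ", "a"]
print(spell(word, "standard"))  # -> 'thasha'
print(spell(word, "archaic"))   # -> 'þasca'
```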
4) Intent and semantic grounding
Intent acts as the semantic input, essentially a short description of what the word is meant to convey.
The lab breaks that intent into tags, then expands those tags through an ontology. The ontology is a network of related ideas that allows the system to connect concepts that are close in meaning even if they are not identical.
This has been the biggest time sink in the project, and seems to have the greatest degree of impact on the quality of the generated words. I don't think my work here will be done for some time to come.
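One way to picture the expansion, as a sketch: treat the ontology as a graph and walk outward a fixed number of hops from the intent tags. The edges below are invented examples.

```python
# Toy ontology: each concept points at nearby concepts.
ONTOLOGY = {
    "healing": {"medicine", "herb"},
    "medicine": {"healer"},
    "footprints": {"trail"},
}

def expand(tags: set[str], hops: int = 2) -> set[str]:
    seen, frontier = set(tags), set(tags)
    for _ in range(hops):
        frontier = {n for t in frontier for n in ONTOLOGY.get(t, set())} - seen
        seen |= frontier
    return seen

print(sorted(expand({"healing"})))
# -> ['healer', 'healing', 'herb', 'medicine']
```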
Roots are then scored based on how well they match those expanded tags. Matches can occur through:
- root identifiers, the internal labels used to track meaning
- tags, short semantic labels attached to roots
- gloss text, which is a plain-language description of what a root means, similar to a dictionary definition
Using gloss text allows the system to catch softer matches where terminology differs but meaning overlaps. For example, “healing” can connect to “medicine,” and “footprints” can connect to “trail,” even if those exact words are not present in the prompt. I've been scanning thesauruses to help out here.
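Scoring can be sketched as a weighted sum over those three match channels. The weights and the sample root are guesses for illustration, not the lab's actual numbers.

```python
# Hypothetical weights: identifier matches count most, gloss words least.
def score_root(root: dict, tags: set[str]) -> float:
    score = 0.0
    if root["id"] in tags:
        score += 3.0                              # exact identifier match
    score += 2.0 * len(root["tags"] & tags)       # tag overlap
    gloss_words = set(root["gloss"].lower().split())
    score += 1.0 * len(gloss_words & tags)        # soft match via gloss text
    return score

root = {"id": "vel", "tags": {"medicine"}, "gloss": "the art of healing wounds"}
print(score_root(root, {"healing", "medicine"}))  # -> 3.0 (one tag, one gloss word)
```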
A secondary bias layer exists for culture-specific themes, but it only activates when literal intent fails to produce matches. This prevents local flavor from overwhelming explicit prompts (an early failure mode).
5) Root selection with multi-tag coverage
When an intent contains multiple strong tags, selection favors coverage rather than repetition. This sounds obvious, but the code didn't think so at first. "Fast-fast-fast-mcgee" is not a good pattern for a name, it turns out.
If three or more tags score highly, the generator prefers templates that support multi-root compounds. Roots are chosen to represent different aspects of the intent before reinforcing a single dominant meaning.
This allows complex prompts to resolve into structured compounds instead of collapsing into a repeated theme.
That sounds very fancy, but it just means I get a bit more variety in the outputs. If I give a long phrase for the meaning of a word, some of the generated options will probably be longer, trying to encompass all those ideas. I'm building it with a range of outputs, though, so even then some options will focus on the strongest root and lean into that alone, keeping things short.
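The coverage rule itself can be sketched as a greedy pick: each root must cover tags the previous picks did not, which is exactly what stops "fast-fast-fast" compounds.

```python
# Greedy coverage-first selection over invented roots and tags.
def pick_roots(roots: list[dict], tags: set[str], max_roots: int = 3) -> list[dict]:
    chosen, uncovered = [], set(tags)
    for _ in range(max_roots):
        best = max(roots, key=lambda r: len(r["tags"] & uncovered), default=None)
        if best is None or not (best["tags"] & uncovered):
            break  # nothing left adds new coverage
        chosen.append(best)
        uncovered -= best["tags"]
    return chosen

roots = [
    {"form": "sar", "tags": {"fast"}},
    {"form": "len", "tags": {"fast", "river"}},
    {"form": "dor", "tags": {"stone"}},
]
picked = pick_roots(roots, {"fast", "river", "stone"})
print([r["form"] for r in picked])  # -> ['len', 'dor']
```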
6) Morphology and sound shaping
Once roots are selected, the system proceeds through a fixed sequence:
- apply compounding rules and affixes defined by the language
- run the result through the era chain sound changes
- validate the output against phonotactic constraints
- repair violations using minimal, constraint-aware adjustments
Repair operations favor small local fixes, usually vowel insertion guided by syllable templates, rather than broad substitutions. This keeps adjustments aligned with the language's internal logic. For instance, a root ending in a hard consonant butting up against a next root beginning with another hard consonant would make for an impossible pronunciation, so I squish a vowel in between them. Even these fixes I've managed to export into the language data itself, so the patterns can be regional or belong to a dialect rather than the whole planet.
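Here's a minimal sketch of that epenthesis repair, assuming a declared set of "hard" consonants and a dialect-declared filler vowel. Both are stand-ins for values in the data files.

```python
import re

HARD = "kgtdpb"  # hypothetical 'hard consonant' set from the language data

def repair(form: str, epenthetic: str = "a") -> str:
    """Insert the epenthetic vowel wherever two hard consonants collide."""
    pattern = re.compile(f"([{HARD}])([{HARD}])")
    while pattern.search(form):
        form = pattern.sub(rf"\1{epenthetic}\2", form, count=1)
    return form

print(repair("markdren"))  # 'kd' is an illegal cluster here -> 'markadren'
```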
7) Ranking and output
The generator produces a batch of candidates, then filters out taboo roots and structurally broken forms. Results are scored, deduplicated, and returned as five options.
Earlier entries emphasize broader semantic coverage and longer compound words, while later entries lean toward the dominant intent and shorter words. This ordering gives me a range to choose from when selecting my match.
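The ordering can be sketched as a sort key: coverage descending, and within a coverage tier, longer compounds ahead of short dominant-root forms. Fields and numbers below are invented.

```python
# Filter, dedupe, sort, and trim to five, over invented candidates.
def rank(candidates: list[dict], taboo: set[str], n: int = 5) -> list[dict]:
    seen, kept = set(), []
    for c in candidates:
        if c["form"] in seen or (c["roots"] & taboo):
            continue  # drop duplicates and taboo-rooted forms
        seen.add(c["form"])
        kept.append(c)
    # Broad coverage first; within a tier, longer compounds come earlier.
    kept.sort(key=lambda c: (-c["coverage"], -len(c["form"])))
    return kept[:n]

cands = [
    {"form": "lendora", "roots": {"len", "dor"}, "coverage": 3},
    {"form": "sar",     "roots": {"sar"},        "coverage": 1},
    {"form": "lendora", "roots": {"len", "dor"}, "coverage": 3},  # duplicate
]
print([c["form"] for c in rank(cands, taboo=set())])  # -> ['lendora', 'sar']
```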
Validation and debugging
A validator runs across all language data to catch schema errors, missing references, and unreachable forms. One enforced rule checks that proto roots can reach their modern forms through the declared era chain. This has turned out to be a huge help, since I'm not very consistent in what I shove into the system. Once it runs, it even corrects my source data to a degree.
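The reachability check is the part worth sketching: evolve every proto root through the full era chain and flag anything that lands on an illegal form. The rule functions here stand in for the real rule engine.

```python
# Toy validator: every proto root must evolve into a legal modern form.
def validate_lexicon(lexicon, era_chain, is_legal) -> list[str]:
    errors = []
    for entry in lexicon:
        form = entry["proto"]
        for era in era_chain:
            form = era["apply"](form)  # run this era's sound changes
        if not is_legal(form):
            errors.append(f"{entry['proto']!r} evolves to illegal {form!r}")
    return errors

era_chain = [
    {"apply": lambda f: f.replace("ph", "f")},
    {"apply": lambda f: f[:-1] + "h" if f.endswith("k") else f},
]
lexicon = [{"proto": "phalak"}]
print(validate_lexicon(lexicon, era_chain, is_legal=lambda f: "q" not in f))
# -> [] (empty list means 'phalak' evolves cleanly, here to 'falah')
```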
At this point the lab produces results that are explainable at every step. I can validate each one, see where things are going haywire, and course-correct. Once I generate content I feel meets a quality threshold, built-in tooling saves those words back into the master data to inform future generation.