Translation Memory & Glossary Management
A translation memory (TM) reuses past human translations by matching new source segments against stored ones, while a glossary (termbase) pins approved terms so “Sign in” never ships as “Log in” on one screen and “Connect” on the next. Skip both and you pay to re-translate strings you already own, your fuzzy-match leverage drops to zero, and inter-segment consistency breaks the moment two translators touch the same release. The classic symptom is a QA bug like: Glossary term "Dashboard" rendered as "Tablero de control" in es-ES but "Panel" in es-MX.
This is the leverage-and-consistency layer that sits beside your runtime fallback chain but operates at authoring time, before strings ever reach a bundle. A TM stores bilingual segment pairs and serves them back at fuzzy-match percentages; a termbase stores approved source/target term entries with part-of-speech and forbidden variants. Both interchange through open XML standards (TMX 1.4b for memories, TBX for termbases), so they survive a tool migration. Everything below makes TM leverage, fuzzy thresholds, and glossary enforcement explicit, portable, and CI-checkable.
Prerequisites
Concept & spec — TM, TMX, and the termbase
A translation memory is a database of bilingual translation units (TUs). Each TU pairs a source segment with one or more target-language variants. On a new segment, the TM computes a similarity score against stored sources and returns the best candidate as a match percentage: 100% is an exact match (byte-identical source, reuse verbatim), 75–99% is fuzzy (the translator post-edits a near-match), and below the floor the segment counts as new. Leverage is the share of words covered by exact and high-fuzzy matches, and it is usually the largest driver of per-release translation cost and turnaround.
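The leverage definition above can be sketched as a word-weighted ratio. This is a minimal illustration, not how any particular TMS reports leverage: the Analyzed shape, the leverage function, and the cut-off of 85 for "high-fuzzy" are assumptions made for the example.

```typescript
// Hypothetical shape: one analyzed segment with its word count and best TM score.
type Analyzed = { words: number; score: number };

// Share of words (0..1) whose best match scores at or above `threshold`.
// The default of 85 is an assumed "high-fuzzy" cut-off, not a standard value.
function leverage(segments: Analyzed[], threshold = 85): number {
  const total = segments.reduce((sum, s) => sum + s.words, 0);
  if (total === 0) return 0; // nothing to translate means nothing to leverage
  const covered = segments
    .filter((s) => s.score >= threshold)
    .reduce((sum, s) => sum + s.words, 0);
  return covered / total;
}

const release: Analyzed[] = [
  { words: 10, score: 100 }, // exact match
  { words: 6, score: 90 },   // high-fuzzy match
  { words: 4, score: 60 },   // below the floor, counts as new
];

console.log(leverage(release)); // 0.8 (16 of 20 words covered)
```

Weighting by words rather than by segment count keeps one long untranslated paragraph from being hidden behind many short exact matches.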
The interchange format is TMX (Translation Memory eXchange) 1.4b, originally published by LISA's OSCAR working group and, since LISA closed in 2011, hosted by GALA. A TMX file is XML: a <header> carries srclang, adminlang, and segtype (alongside other required attributes such as creationtool and datatype), and the <body> holds <tu> (translation unit) elements, each with one <tuv xml:lang="…"> per language wrapping a <seg>. Inline placeholders are preserved as <ph>, <bpt>/<ept>, and <it> tags so a {count} or <b> survives the round-trip. Because every serious tool reads TMX 1.4b, your accumulated leverage is portable: a tool migration is a TMX export/import, not a re-translation.
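To make the placeholder handling concrete, here is a small sketch that wraps brace-style placeholders in <ph> elements and escapes everything else. It assumes placeholders look like {name}; escapeXml and segWithPh are illustrative names, and the result is a raw XML fragment for hand-written serialization (passing it to an XML builder as text would escape it a second time).

```typescript
// Escape the five XML special characters so literal text is safe inside <seg>.
function escapeXml(s: string): string {
  return s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Wrap each {placeholder} in a numbered <ph>; all other text is escaped as-is.
function segWithPh(text: string): string {
  let n = 0;
  return text
    .split(/(\{[^{}]+\})/) // the capture group keeps placeholders in the result
    .map((part) =>
      /^\{[^{}]+\}$/.test(part)
        ? `<ph x="${++n}">${escapeXml(part)}</ph>`
        : escapeXml(part),
    )
    .join('');
}

console.log(segWithPh('You have {count} items & 1 <new>'));
// You have <ph x="1">{count}</ph> items &amp; 1 &lt;new&gt;
```

Numbering the placeholders through the x attribute lets a receiving tool pair each source placeholder with its counterpart in the target, even when word order changes.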
A glossary (termbase) is a different store: not segments but terms. The interchange standard is TBX (TermBase eXchange), an ISO 30042 XML format whose <conceptEntry> groups a concept across languages, each <langSec> holding <termSec> entries with a <term> plus notes like part of speech, usage status, and forbidden synonyms. Where a TM maximizes reuse, the termbase enforces inter-segment consistency: every occurrence of “dashboard” resolves to one approved target. Both feed the broader Core i18n Architecture & Locale Negotiation pipeline at authoring time, upstream of the runtime resolver.
Step-by-step implementation
1. Seed the TM from existing bilingual bundles
If you already ship multiple locales, you own a TM — it’s just trapped in your PO/XLIFF files. Pair each translated string with its source and emit TMX so the tool can leverage it on the next release instead of starting cold.
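import { create } from 'xmlbuilder2';

type Pair = { src: string; tgt: string; tgtLang: string };

export function toTmx(pairs: Pair[], srclang = 'en'): string {
  // One <tu> per pair, each holding a source <tuv> and a target <tuv>.
  const tu = pairs.map((p) => ({
    tuv: [
      { '@xml:lang': srclang, seg: p.src },
      { '@xml:lang': p.tgtLang, seg: p.tgt },
    ],
  }));
  return create({
    tmx: {
      '@version': '1.4', // TMX 1.4b files still declare version="1.4"
      header: {
        // TMX 1.4b requires all seven attributes; the tool name and version here are placeholders.
        '@creationtool': 'toTmx',
        '@creationtoolversion': '1.0',
        '@segtype': 'sentence',
        '@o-tmf': 'unknown',
        '@adminlang': 'en',
        '@srclang': srclang,
        '@datatype': 'plaintext',
      },
      // A single <body> wrapping every <tu>; an array directly under body would repeat <body>.
      body: { tu },
    },
  }).end({ prettyPrint: true });
}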
2. Set fuzzy-match thresholds
A threshold policy decides which matches auto-apply and which a human must review. Apply high-fuzzy matches silently, surface mid-fuzzy for post-editing, and treat the rest as new work routed to machine-translation pre-fill or a translator.
export type Band = 'exact' | 'high-fuzzy' | 'review' | 'new';

export function band(score: number): Band {
  if (score >= 100) return 'exact';     // reuse verbatim
  if (score >= 85) return 'high-fuzzy'; // auto-apply, light edit
  if (score >= 75) return 'review';     // flag for post-edit
  return 'new';                         // MT pre-fill or fresh translation
}
3. Compute the match percentage
Tools use token-level edit distance, not raw character diff, so a one-word change in a ten-word segment scores ~90%, not 50%. Normalize by token count and never let inline-tag differences alone drop a match below the floor.
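// Lowercased runs of letters or digits; punctuation and whitespace are dropped,
// so a score of 100 here means "same tokens", not byte-identical text.
function tokens(s: string): string[] {
  return s.toLowerCase().match(/\p{L}+|\p{N}+/gu) ?? [];
}

// Assumes a levenshtein(a: string[], b: string[]): number helper is in scope
// that returns the edit distance between two token arrays.
export function matchPct(a: string, b: string): number {
  const x = tokens(a), y = tokens(b);
  const dist = levenshtein(x, y); // token-level Levenshtein
  const max = Math.max(x.length, y.length) || 1; // avoid dividing by zero on empty input
  return Math.round((1 - dist / max) * 100);
}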
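A self-contained sketch of the same scoring idea follows, with the edit-distance helper written out and placeholder-only differences handled by a fixed deduction instead of a token mismatch. The PLACEHOLDER pattern (brace-style placeholders), the placeholderPenalty parameter, and its default of 5 are assumptions for illustration; commercial tools use their own, usually more elaborate, scoring.

```typescript
// Brace-style placeholders such as {count}; an assumption about the string format.
const PLACEHOLDER = /\{[^{}]+\}/g;

function tokens(s: string): string[] {
  return s.toLowerCase().match(/\p{L}+|\p{N}+/gu) ?? [];
}

// Classic dynamic-programming Levenshtein distance, applied to token arrays.
function levenshtein(a: string[], b: string[]): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const cur = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      cur[j] = Math.min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = cur;
  }
  return prev[b.length];
}

// Score the text with placeholders stripped, then deduct a fixed penalty
// when the two segments do not carry the same placeholders in the same order.
function matchPct(a: string, b: string, placeholderPenalty = 5): number {
  const phA = (a.match(PLACEHOLDER) ?? []).join('|');
  const phB = (b.match(PLACEHOLDER) ?? []).join('|');
  const x = tokens(a.replace(PLACEHOLDER, ' '));
  const y = tokens(b.replace(PLACEHOLDER, ' '));
  const max = Math.max(x.length, y.length) || 1;
  const base = Math.round((1 - levenshtein(x, y) / max) * 100);
  return phA === phB ? base : Math.max(0, base - placeholderPenalty);
}

// One changed word in a ten-word segment.
console.log(matchPct(
  'Your order will ship within two business days from Madrid',
  'Your order will ship within three business days from Madrid',
)); // 90

// Only the placeholder name differs.
console.log(matchPct('You have {count} new messages', 'You have {n} new messages')); // 95
```

Stripping placeholders before tokenizing matters because the tokenizer would otherwise treat count and n as ordinary words and score the second pair as a one-token edit.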
4. Load the termbase and build a forbidden-term index
Parse the TBX into a per-locale map of approved targets plus a set of forbidden variants. Index by lowercased source term so each term's approved and forbidden targets resolve with a single map lookup instead of a re-scan of the parsed TBX.
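// Minimal in-memory shape for parsed TBX; an assumption, adapt it to your parser's output.
type Term = { text: string; status?: string };
type ConceptEntry = { langSec: Record<string, { terms: Term[] } | undefined> };

type Entry = { approved: string[]; forbidden: string[] };

export function loadTermbase(tbx: ConceptEntry[], lang: string, srcLang = 'en'): Map<string, Entry> {
  const idx = new Map<string, Entry>();
  for (const c of tbx) {
    const src = c.langSec[srcLang]?.terms[0]?.text; // first source term names the concept
    const sec = c.langSec[lang];
    if (!src || !sec) continue; // skip concepts with no source term or no target section
    idx.set(src.toLowerCase(), {
      approved: sec.terms.filter((t) => t.status !== 'forbidden').map((t) => t.text),
      forbidden: sec.terms.filter((t) => t.status === 'forbidden').map((t) => t.text),
    });
  }
  return idx;
}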
5. Enforce glossary terms before commit
For each segment, find which source terms it contains, then assert the target uses an approved equivalent and contains no forbidden variant. Return structured violations so CI can print an actionable diff.
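type Violation = { term: string; expected: string[]; found: string };

export function checkTerms(
  src: string,
  tgt: string,
  idx: Map<string, Entry>,
  caseSensitive = false, // mirrors caseSensitiveTerms; targets compare case-insensitively by default
): Violation[] {
  const norm = (s: string) => (caseSensitive ? s : s.toLowerCase());
  const source = src.toLowerCase(); // index keys are lowercased source terms
  const target = norm(tgt);
  const out: Violation[] = [];
  for (const [term, entry] of idx) {
    // Plain substring test: simple, but "port" also matches inside "report".
    if (!source.includes(term)) continue;
    const usesApproved = entry.approved.some((a) => target.includes(norm(a)));
    const usesForbidden = entry.forbidden.find((f) => target.includes(norm(f)));
    if (usesForbidden) out.push({ term, expected: entry.approved, found: usesForbidden });
    else if (!usesApproved) out.push({ term, expected: entry.approved, found: '(missing)' });
  }
  return out;
}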
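Substring matching is the weak point of a term check like the one above: a short term can match inside a longer, unrelated word. One possible refinement is a whole-word test built on Unicode property escapes. The helper names escapeRegExp and containsTerm are illustrative, and this sketch does not handle inflected forms, which a real termbase check may need.

```typescript
// Escape regex metacharacters so a glossary term is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// True when `term` occurs in `text` as a whole word or phrase, meaning it is
// not directly preceded or followed by a letter or digit (Unicode-aware).
function containsTerm(text: string, term: string, caseSensitive = false): boolean {
  const flags = caseSensitive ? 'u' : 'iu';
  const re = new RegExp(
    `(?<![\\p{L}\\p{N}])${escapeRegExp(term)}(?![\\p{L}\\p{N}])`,
    flags,
  );
  return re.test(text);
}

console.log(containsTerm('Export the report', 'port')); // false
console.log(containsTerm('Open port 80', 'port'));      // true
console.log(containsTerm('Abrir el Panel', 'panel'));   // true (case-insensitive by default)
console.log(containsTerm('Dashboards', 'dashboard'));   // false (plural is not matched)
```

The last case shows the trade-off: strict word boundaries remove false positives but also miss plurals and other inflections, so teams often list those variants explicitly in the termbase.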
Configuration reference
| Option | Type | Description / default |
|---|---|---|
| srclang | string | TMX header source-language tag every <tu> inherits. Default 'en'. |
| segtype | 'sentence' \| 'paragraph' \| 'block' | TM segmentation granularity. Default 'sentence'. |
| minMatchScore | number | Floor below which a match is discarded as a non-leverage segment. Default 75. |
| autoApplyAt | number | Score at/above which a fuzzy match is applied without human review. Default 85. |
| penalty.placeholder | number | Score deduction when inline <ph> placeholders differ. Default 5. |
| caseSensitiveTerms | boolean | Enforce glossary term casing exactly (API, not Api). Default false. |
| forbiddenIsError | boolean | Fail CI on a forbidden term vs. warn. Default true. |
| tbxDialect | 'TBX-Basic' \| 'TBX-Core' | Termbase profile to import/export. Default 'TBX-Basic'. |
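The options in the table can be modeled as a typed object with defaults and a basic sanity check. This is one possible shape, not the schema of any specific tool: TmConfig, DEFAULTS, and resolveConfig are names invented for the sketch, and the range validation is an added assumption.

```typescript
interface TmConfig {
  srclang: string;
  segtype: 'sentence' | 'paragraph' | 'block';
  minMatchScore: number;
  autoApplyAt: number;
  penalty: { placeholder: number };
  caseSensitiveTerms: boolean;
  forbiddenIsError: boolean;
  tbxDialect: 'TBX-Basic' | 'TBX-Core';
}

// Defaults copied from the configuration reference table.
const DEFAULTS: TmConfig = {
  srclang: 'en',
  segtype: 'sentence',
  minMatchScore: 75,
  autoApplyAt: 85,
  penalty: { placeholder: 5 },
  caseSensitiveTerms: false,
  forbiddenIsError: true,
  tbxDialect: 'TBX-Basic',
};

// Merge overrides onto the defaults and reject threshold combinations that cannot work.
function resolveConfig(overrides: Partial<TmConfig> = {}): TmConfig {
  const cfg: TmConfig = {
    ...DEFAULTS,
    ...overrides,
    penalty: { ...DEFAULTS.penalty, ...overrides.penalty },
  };
  const inRange = (n: number) => n >= 0 && n <= 100;
  if (!inRange(cfg.minMatchScore) || !inRange(cfg.autoApplyAt) || cfg.minMatchScore > cfg.autoApplyAt) {
    throw new RangeError('expected 0 <= minMatchScore <= autoApplyAt <= 100');
  }
  return cfg;
}

console.log(resolveConfig({ autoApplyAt: 90 }).minMatchScore); // 75
```

Checking that the floor does not exceed the auto-apply score catches the most common misconfiguration early, where no match can ever land in the review band.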
Framework variants
Crowdin
Crowdin keeps a per-project TM and glossary. Upload an existing TMX under Translation Memory → Upload, then enable TM pre-translation to auto-fill exact and high-fuzzy matches on import. Glossaries import as TBX or CSV; mark a term Forbidden to surface it in QA checks. The CLI seeds both from your repo:
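# The file is a positional argument in recent Crowdin CLI releases; options differ
# between major versions, so confirm with `crowdin tm upload --help`.
crowdin tm upload ./memory/en-es.tmx --language es
crowdin glossary upload ./glossary/terms.tbx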
Wire Crowdin’s QA “Glossary” check into the same gate you use when connecting the Crowdin API to GitHub pull requests so a forbidden term blocks the sync PR.
Weblate
Weblate exposes the TM as Automatic suggestions and stores the glossary as a regular component with Use as a glossary enabled; individual terms can carry flags such as forbidden, read-only, or terminology. Import a TMX via the project’s Translation memory → Import, and add terms through the glossary component or a TBX upload. Enable the Does not follow glossary and Has been translated checks per component; see the broader Weblate self-hosted setup for memory sharing across components.
# Management command, run on the Weblate server
weblate import_memory memory.tmx
Node.js backend / build step
If you don’t run a TMS, run leverage and enforcement as a plain build step. Parse exported bundles, score against a committed memory.tmx, and run checkTerms over every translated segment. This keeps TM leverage and glossary enforcement in the same repo as the code, version-controlled alongside the strings.
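A build-step gate of this kind might look like the sketch below. It takes any term checker as a function, collects one line per violation, and reports an exit code. The Segment and Checker types, runGate, and the stub checker are hypothetical; treating every violation as an error when failOnViolation is true is a simplifying assumption.

```typescript
type Segment = { key: string; src: string; tgt: string };
type Violation = { term: string; expected: string[]; found: string };
type Checker = (src: string, tgt: string) => Violation[];

// Run the checker over every segment and summarize the result for CI.
function runGate(segments: Segment[], check: Checker, failOnViolation = true) {
  const lines: string[] = [];
  const level = failOnViolation ? 'error' : 'warning';
  for (const s of segments) {
    for (const v of check(s.src, s.tgt)) {
      lines.push(`${level} ${s.key}: "${v.term}" expected [${v.expected.join(', ')}], found ${v.found}`);
    }
  }
  return { lines, exitCode: failOnViolation && lines.length > 0 ? 1 : 0 };
}

// Stub checker for the demo: flags "Tablero" wherever the source mentions "dashboard".
const stub: Checker = (src, tgt) =>
  src.toLowerCase().includes('dashboard') && tgt.includes('Tablero')
    ? [{ term: 'dashboard', expected: ['Panel'], found: 'Tablero' }]
    : [];

const result = runGate(
  [
    { key: 'nav.home', src: 'Open the dashboard', tgt: 'Abrir el Tablero' },
    { key: 'nav.help', src: 'Get help', tgt: 'Obtener ayuda' },
  ],
  stub,
);

console.log(result.lines.join('\n'));
// error nav.home: "dashboard" expected [Panel], found Tablero
console.log(result.exitCode); // 1
```

In a real script the returned exitCode would be assigned to the process exit status, and the stub would be replaced by the glossary checker built from your termbase.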
memoQ / Trados (offline)
Desktop CAT tools are the canonical TMX/TBX producers. Translators work against a local TM and termbase, then deliver a TMX export you import back into the TMS. Segments and inline tags survive the round-trip because both ends speak TMX 1.4b and TBX, although tool-specific metadata may not. Keep your segtype consistent between tools or segment boundaries won’t align and leverage collapses.
Verification
Assert two things in CI: that the TM round-trips through TMX without losing placeholders, and that no committed target violates the glossary. The first pins interchange fidelity; the second is the consistency gate.
import { test, expect } from 'vitest';
import { toTmx } from './tmx';
import { checkTerms, loadTermbase } from './glossary';
// parseTbx and loadSegments are project helpers not shown in this guide; the module path is hypothetical.
import { parseTbx, loadSegments } from './fixtures';

test('TMX round-trip preserves placeholders', () => {
  const tmx = toTmx([{ src: 'Hello {name}', tgt: 'Hola {name}', tgtLang: 'es' }]);
  expect(tmx).toContain('{name}'); // the placeholder must appear unaltered in the export
});

test('no segment violates the glossary', () => {
  const idx = loadTermbase(parseTbx('glossary/terms.tbx'), 'es');
  for (const { src, tgt } of loadSegments('es')) {
    expect(checkTerms(src, tgt, idx)).toEqual([]);
  }
});
# Expected output
✓ TMX round-trip preserves placeholders
✓ no segment violates the glossary
Run the glossary assertion as a required check — the dedicated walkthrough in enforcing glossary terms in CI shows the full GitHub Actions wiring and annotated failure output.
Common pitfalls
- Treating fuzzy matches as exact. Auto-applying an 80% match ships a near-miss as final. Set autoApplyAt to 85+ and route mid-fuzzy to post-edit; the precise banding lives in enforcing glossary terms in CI.
- Placeholder-only diffs tanking the score. A {count} vs {n} change shouldn’t drop a match to 0%. Strip inline tags before scoring and apply a fixed penalty.placeholder instead.
- Glossary checked source-side only. Verifying the source contains a term proves nothing about the target. Always assert the target uses an approved equivalent and no forbidden variant.
- TM drift across regions. One shared es TM serves es-ES and es-MX identical targets. Scope memories per regional variant or the fallback chain inherits inconsistent terminology.
- Lossy TBX export. Round-tripping through CSV drops part-of-speech and forbidden flags. Use TBX-Basic for interchange, not a spreadsheet.
- Stale memory after a string rewrite. Editing source copy without updating the TU leaves a 100% match pointing at an outdated translation. Invalidate or re-confirm TUs when source segments change.
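The stale-memory pitfall can be caught mechanically if each TU is stored with the string key it came from. The sketch below compares the current source bundle against the source text recorded in the memory and lists keys that need re-confirmation. The StoredTu shape, the key-based lookup, and the whitespace normalization are assumptions for illustration; many TMs are not keyed this way.

```typescript
// Hypothetical TU record that remembers which string key produced it.
type StoredTu = { key: string; src: string; tgt: string };

// Collapse whitespace so reformatting alone does not count as a rewrite.
const normalize = (s: string): string => s.trim().replace(/\s+/g, ' ');

// Keys whose current source no longer matches the source stored with the TU.
function staleKeys(current: Map<string, string>, memory: StoredTu[]): string[] {
  return memory
    .filter((tu) => {
      const now = current.get(tu.key);
      return now !== undefined && normalize(now) !== normalize(tu.src);
    })
    .map((tu) => tu.key);
}

const bundle = new Map([
  ['cta.signin', 'Sign in to continue'],
  ['cta.save', 'Save  changes'], // extra space only, not a rewrite
]);

const memory: StoredTu[] = [
  { key: 'cta.signin', src: 'Sign in', tgt: 'Iniciar sesión' },
  { key: 'cta.save', src: 'Save changes', tgt: 'Guardar cambios' },
  { key: 'cta.removed', src: 'Old string', tgt: 'Cadena antigua' },
];

console.log(staleKeys(bundle, memory)); // [ 'cta.signin' ]
```

Keys that no longer exist in the bundle are ignored here; whether to prune those TUs or keep them for future fuzzy matches is a separate policy decision.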
FAQ
What is the difference between a translation memory and a glossary?
A translation memory stores whole bilingual segments and serves them back at a match percentage to maximize reuse, so you don’t re-translate sentences you already own. A glossary (termbase) stores individual terms with approved targets and forbidden variants to enforce consistency, so a given word always renders the same way. TMs save cost; termbases protect terminology. They interchange through TMX 1.4b and TBX respectively.
What does a fuzzy-match percentage mean?
It’s a token-level similarity score between a new source segment and the closest one in the memory. 100% is an exact match you reuse verbatim; 75–99% is a fuzzy match a translator post-edits; below the floor (commonly 75%) the segment counts as new work. Tools weight by token edit distance, so a single changed word in a long sentence still scores high. You set a threshold to decide which matches auto-apply versus require review.
Why use TMX 1.4b instead of a proprietary format?
TMX 1.4b is the industry interchange standard, so your accumulated leverage stays portable — migrating tools becomes an export/import rather than a re-translation. It preserves inline placeholders via <ph>, <bpt>/<ept>, and <it> tags, so interpolations survive the round-trip, and its <header> carries srclang and segtype so the receiving tool segments identically. Every serious TMS reads it.
How do I enforce glossary terms automatically?
Load the termbase into a per-locale index of approved and forbidden variants, then for each segment check that the target uses an approved equivalent of any source term it contains and no forbidden one. Run that as a required CI check on every pull request so a terminology violation fails the build before merge, with the full pipeline shown in the enforcing-glossary-terms-in-CI walkthrough.
Can machine translation fill segments the TM misses?
Yes. Segments scoring below the fuzzy floor are routed to a machine-translation pre-fill step, then post-edited and committed back into the TM so the next release leverages them. The MT output must still pass the same termbase check, because raw machine translation routinely ignores approved terminology.
Related
- Fallback Chain Configuration — the runtime resolver downstream of the authoring-time TM and glossary.
- Enforcing glossary terms in CI — the full GitHub Actions gate and annotated failure output.
- Crowdin Integration for Dev Teams — wiring TM pre-translation and glossary QA into a sync PR.
- Weblate Self-Hosted Setup — sharing translation memory and glossary components across a self-hosted project.