Machine-Translation Pre-fill Workflows

Machine-translation pre-fill workflows pipe newly extracted keys through DeepL, Google, or Azure before a human ever sees them, writing each result back as needs-review (never translated) so untrusted MT output can never silently ship to production. The classic failure mode is a pre-filled {count, plural, ...} block where the engine “translated” the ICU keyword one into the target language, and the build crashes at runtime with Expected plural argument selector, got "uno". This page shows how to wire MT pre-fill into your sync pipeline as part of Translation Workflows & CI/CD Pipeline Sync while protecting placeholders, enforcing glossaries, and gating every machine string behind human post-edit.

MT pre-fill never writes "translated": every machine string lands as fuzzy/needs-review and waits at the human gate.

Prerequisites

Node.js 20+ (or Python 3.11+) on the runner that executes the pre-fill step
An MT API key in CI secrets: DEEPL_AUTH_KEY, GOOGLE_APPLICATION_CREDENTIALS, or AZURE_TRANSLATOR_KEY + region
A working key-extraction stage that emits a diff of new source keys (i18next-parser, formatjs extract, or Lingui)
A locale file format that can carry a per-string review state: gettext #, fuzzy flags, XLIFF 2.1 segment state="initial" (XLIFF 1.2: state="needs-review-translation" on the target), or a JSON sidecar of metadata
A glossary export (CSV/TBX) shared with translation memory & glossary management
A merge gate (branch protection or CODEOWNERS on locales/) so no human-unreviewed string reaches main

Concept & spec — why MT output is “fuzzy”, not “translated”

Machine translation produces a candidate, not a commitment. Every serious localization format encodes this distinction with an explicit per-string state, and your pre-fill job must set it. In gettext PO, the state is the #, fuzzy flag: msgfmt leaves fuzzy entries out of the compiled catalog by default (unless it is run with --use-fuzzy), so the runtime falls back to the source, exactly the behaviour you want for unreviewed MT. In XLIFF 2.1 (OASIS XLIFF Version 2.1, §segment state) the state attribute lives on <segment> and takes the values initial, translated, reviewed, and final; MT pre-fill writes the machine text into <target> but leaves the segment state at initial, never reviewed or final. XLIFF 1.2 expresses the same idea on <target> with state="needs-review-translation" and state-qualifier="mt-suggestion".

#: src/checkout.tsx:42
#, fuzzy
msgid "Your cart has {count} items"
msgstr "Votre panier contient {count} articles"

The fuzzy flag is the contract between the MT engine and the human reviewer: it says “a machine wrote this, do not trust it.” This mirrors how the fallback chain treats a missing string — an unreviewed fuzzy entry should resolve to the source locale at runtime, not display raw MT to users. Pre-fill belongs to the broader Translation Workflows & CI/CD Pipeline Sync discipline: it sits between extraction and human review, never replacing the latter.
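As a rough illustration of the write-back contract, the sketch below renders one pre-filled entry in PO syntax with the fuzzy flag set. The PoEntry shape and the renderPoEntry and poQuote helpers are hypothetical names introduced here, and only the basic PO escapes are handled.

```typescript
// Hypothetical shape for one pre-filled catalog entry (not from any library).
interface PoEntry {
  reference: string; // source location, e.g. "src/checkout.tsx:42"
  msgid: string;
  msgstr: string;
  fuzzy: boolean; // true for every machine-written string
}

// Escape backslashes, double quotes and newlines for a PO string literal.
const poQuote = (s: string): string =>
  `"${s.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n")}"`;

// Render a single entry; the "#, fuzzy" line is what marks it as unreviewed.
function renderPoEntry(e: PoEntry): string {
  const lines = [`#: ${e.reference}`];
  if (e.fuzzy) lines.push("#, fuzzy");
  lines.push(`msgid ${poQuote(e.msgid)}`, `msgstr ${poQuote(e.msgstr)}`);
  return lines.join("\n");
}
```

A human approval would then amount to rewriting the entry with fuzzy set to false, which removes the flag line and nothing else.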

Step-by-step implementation

1. Detect only the new keys

Run your extractor and diff against the last committed source file so you translate the delta, not the whole catalog — this is the single biggest cost lever. Emit a JSON array of {key, source} for keys present in the new source but absent from the target locale.

# emit new English keys missing from the French target
i18next-parser --config i18next-parser.config.js
node scripts/diff-new-keys.mjs locales/en.json locales/fr.json > /tmp/new-keys.json

Skip any key that already has a non-fuzzy target value. Re-translating reviewed strings burns characters and silently overwrites human work.
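The diff script itself is not shown on this page. A minimal sketch of its core logic, assuming flat key/value JSON catalogs and a hypothetical diffNewKeys helper, might look like this:

```typescript
type Catalog = Record<string, string>;

// Return the {key, source} pairs that exist in the source catalog but have
// no usable value in the target catalog. Assumes flat (non-nested) catalogs.
function diffNewKeys(source: Catalog, target: Catalog): { key: string; source: string }[] {
  const work: { key: string; source: string }[] = [];
  for (const [key, text] of Object.entries(source)) {
    const existing = target[key];
    // Any non-empty target value is left alone, reviewed or not.
    if (existing !== undefined && existing !== "") continue;
    work.push({ key, source: text });
  }
  return work;
}
```

Nested catalogs would need flattening first; that step is left out here to keep the delta logic visible.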

2. Mask placeholders and ICU structure before sending

Never hand raw ICU to an MT engine. Replace every placeholder and every ICU control token with an opaque sentinel the engine will pass through untouched, then restore after. With DeepL you can alternatively wrap protected spans in inline XML tags and set tag_handling=xml so the tags are carried through.

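// protect.ts — mask {vars}, printf tokens, and HTML tags before MT
// ICU keywords, kept for reference; `#` sits outside the \b group because
// \b does not match next to `#` in text such as "{# items}".
export const ICU_KEYWORDS = /\b(?:plural|select|selectordinal|zero|one|two|few|many|other)\b|#/g;
const PLACEHOLDER = /\{[^{}]+\}|%[sd]|<[^>]+>/g;

export function protect(src: string) {
  const slots: string[] = [];
  // Each match becomes its slot index wrapped in FSI/PDI (U+2068 / U+2069)
  const masked = src.replace(PLACEHOLDER, (m) => `\u2068${slots.push(m) - 1}\u2069`);
  // ICU keywords are not masked here: they stay verbatim because only the human-text segments are sent
  return { masked, slots };
}

// Put the original tokens back; an unknown index is left in place so later validation can reject it
export const restore = (s: string, slots: string[]) =>
  s.replace(/\u2068(\d+)\u2069/g, (m, i) => slots[+i] ?? m);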

For ICU plural/select messages, translate only the human-readable sub-messages, leaving keywords (one, other, =0) and the {count, plural, skeleton fixed. The dedicated DeepL pre-translation quality gate covers validating that restoration produced parseable ICU before write-back.
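One way to keep the skeleton fixed is to mask it wholesale and leave only the sub-message text visible to the engine. The sketch below is a simplified brace-depth walker, not a full ICU parser: it assumes balanced braces and no apostrophe-quoted literals, and maskIcuSkeleton and restoreSkeleton are hypothetical helper names.

```typescript
// Sentinel delimiters: U+2068 (FSI) and U+2069 (PDI).
const OPEN = "\u2068";
const CLOSE = "\u2069";

type Mode = "text" | "arg";

// Mask everything that is ICU structure; leave human-readable text in place.
function maskIcuSkeleton(src: string): { masked: string; slots: string[] } {
  const slots: string[] = [];
  const stack: Mode[] = ["text"];
  let masked = "";
  let buf = ""; // structural characters collected while inside an argument

  const flush = (): void => {
    masked += `${OPEN}${slots.push(buf) - 1}${CLOSE}`;
    buf = "";
  };

  for (let i = 0; i < src.length; i++) {
    const ch = src[i];
    const mode = stack[stack.length - 1];
    if (mode === "text") {
      if (ch === "{") {
        stack.push("arg"); // an argument starts
        buf = "{";
      } else if (ch === "}" && stack.length > 1) {
        stack.pop(); // a sub-message ends, back inside its argument
        buf = "}";
      } else if (ch === "#" && stack.length > 1) {
        buf = "#"; // number token inside a sub-message
        flush();
      } else {
        masked += ch; // translatable text
      }
    } else {
      buf += ch;
      if (ch === "{") {
        flush(); // a sub-message starts
        stack.push("text");
      } else if (ch === "}") {
        flush(); // the argument ends
        stack.pop();
      }
    }
  }
  if (buf) flush(); // unterminated argument: keep it masked
  return { masked, slots };
}

// Inverse of maskIcuSkeleton; unknown indexes are left untouched.
const restoreSkeleton = (s: string, slots: string[]): string =>
  s.replace(/\u2068(\d+)\u2069/g, (m, i) => slots[Number(i)] ?? m);
```

For {count, plural, one {# item} other {# items}} this leaves only " item" and " items" unmasked, and restoring the untouched masked string reproduces the original message. Whether an engine translates such sentinel-heavy fragments well is a separate question, so the ICU re-parse after restoration still matters.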

3. Call the engine with the glossary attached

Send masked text in batches. Pass the engine’s native glossary so brand and domain terms resolve deterministically instead of being paraphrased.

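import * as deepl from "deepl-node";
const t = new deepl.Translator(process.env.DEEPL_AUTH_KEY!);

// Create the glossary once and reuse its ID; recreating it on every run is wasteful
const glossary = await t.createGlossary(
  "app-fr", "en", "fr",
  new deepl.GlossaryEntries({ entries: { "Workspace": "Espace de travail", "seat": "licence" } })
);

// maskedBatch is a string[]; the result holds one TextResult per input, in the same order
const out = await t.translateText(maskedBatch, "en", "fr", {
  glossary: glossary.glossaryId,   // accepts a glossary ID or a GlossaryInfo object
  tagHandling: "xml",
  formality: "prefer_more",
});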

Glossary entries must stay in sync with your CI-enforced terms; see enforcing glossary terms in CI.
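Batching can be as simple as grouping masked strings under a size cap. The following sketch assumes a hypothetical chunkByChars helper; the two limits are parameters you would set from your engine's documented request limits, and no specific values are implied here.

```typescript
// Group texts into batches that stay under a character cap and an item cap.
// A single text longer than maxChars still gets a batch of its own.
function chunkByChars(texts: string[], maxChars: number, maxItems: number): string[][] {
  const batches: string[][] = [];
  let current: string[] = [];
  let size = 0;
  for (const text of texts) {
    const full = current.length >= maxItems || size + text.length > maxChars;
    if (current.length > 0 && full) {
      batches.push(current);
      current = [];
      size = 0;
    }
    current.push(text);
    size += text.length;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

Because order is preserved within and across batches, results can be matched back to keys by position.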

4. Write back as needs-review and never as final

Restore placeholders, parse the result as ICU to confirm structure survived, then write the target value with fuzzy/needs-review set. If the ICU parse throws, drop the candidate and leave the key untranslated rather than committing a build-breaking string.

import { parse } from "@formatjs/icu-messageformat-parser";

// batch[i] is the item sent at position i; out[i] is the matching TextResult from translateText
for (const [i, { key }] of batch.entries()) {
  const candidate = restore(out[i].text, slots[key]);
  try {
    parse(candidate);                       // throws on broken ICU -> skip
    target[key] = candidate;
    meta[key] = { state: "needs-review", origin: "mt", engine: "deepl" };
  } catch {
    // leave target[key] unset so the key stays untranslated
    meta[key] = { state: "untranslated", origin: "mt-rejected" };
  }
}
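An ICU parse alone will not notice a placeholder the engine silently dropped from plain text. A complementary check, sketched here with a hypothetical slotsIntact helper, confirms before restoration that every sentinel index appears exactly once in the engine output. It assumes the FSI/PDI-wrapped numeric sentinels used in protect.ts.

```typescript
// True when every slot index 0..slotCount-1 occurs exactly once in the
// translated text. Order is not checked, since word order differs by language.
function slotsIntact(translated: string, slotCount: number): boolean {
  const seen: number[] = new Array<number>(slotCount).fill(0);
  const re = /\u2068(\d+)\u2069/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(translated)) !== null) {
    const i = Number(m[1]);
    if (i >= slotCount) return false; // the engine invented or altered an index
    seen[i] = (seen[i] ?? 0) + 1;
  }
  return seen.every((n) => n === 1);
}
```

A candidate that fails this check can be rejected the same way as one that fails the ICU parse.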

5. Gate every machine string behind a human

Open the pre-filled values as a pull request, but make the merge gate require that no state: needs-review (or #, fuzzy) entry remains for shipping locales. Reviewers post-edit in your TMS — pushing this PR into Crowdin or Weblate surfaces each fuzzy string for approval before it flips to translated.

# CI gate: fail if any unreviewed MT string targets a shipping locale
node scripts/assert-no-fuzzy.mjs locales/fr.json --max-fuzzy 0
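The gate script is referenced but not listed. A minimal sketch of its decision logic, assuming the JSON sidecar format from step 4 and hypothetical unreviewedKeys and fuzzyGate helpers, could be:

```typescript
type Meta = Record<string, { state: string }>;

// Keys whose sidecar state is still "needs-review", sorted for stable output.
function unreviewedKeys(meta: Meta): string[] {
  return Object.entries(meta)
    .filter(([, m]) => m.state === "needs-review")
    .map(([key]) => key)
    .sort();
}

// Decide whether the merge may proceed for a shipping locale.
function fuzzyGate(meta: Meta, maxFuzzy: number): { ok: boolean; message: string } {
  const remaining = unreviewedKeys(meta).length;
  return remaining <= maxFuzzy
    ? { ok: true, message: "no unreviewed MT strings over the limit" }
    : { ok: false, message: `${remaining} needs-review strings remain` };
}
```

A CLI wrapper would read the sidecar, call fuzzyGate with the --max-fuzzy value, print the message, and exit non-zero when ok is false.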

Configuration reference

| Option | Type | Description / default |
| --- | --- | --- |
| `engine` | `"deepl" \| "google" \| "azure"` | MT provider. Default `deepl` (best formality + glossary control for EU langs). |
| `prefillState` | `string` | Write-back state for new MT strings. Default `needs-review` (PO `fuzzy`, XLIFF `initial`). |
| `protectPlaceholders` | `boolean` | Mask `{vars}`, `%s`, ICU keywords, HTML before send. Default `true`. Never disable. |
| `glossaryId` | `string` | Engine glossary applied per language pair. No default — unset means no term enforcement. |
| `skipReviewed` | `boolean` | Skip keys with a non-fuzzy target. Default `true` (protects human work + cuts cost). |
| `maxCharsPerRun` | `number` | Hard character budget per CI run; abort over it. Default `50000`. |
| `formality` | `"prefer_more" \| "prefer_less" \| "default"` | DeepL/Azure tone hint. Default `default`. |
| `failOnIcuError` | `boolean` | Drop candidates that fail ICU parse after restore. Default `true`. |
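For a typed pipeline, the options above could be captured roughly as follows. The PrefillConfig interface and PREFILL_DEFAULTS constant are hypothetical names; the fields and default values are taken from the table.

```typescript
type Engine = "deepl" | "google" | "azure";
type Formality = "prefer_more" | "prefer_less" | "default";

interface PrefillConfig {
  engine: Engine;
  prefillState: string;          // write-back state for new MT strings
  protectPlaceholders: boolean;  // keep true
  glossaryId?: string;           // no default: unset means no term enforcement
  skipReviewed: boolean;
  maxCharsPerRun: number;        // hard character budget per CI run
  formality: Formality;
  failOnIcuError: boolean;
}

const PREFILL_DEFAULTS: PrefillConfig = {
  engine: "deepl",
  prefillState: "needs-review",
  protectPlaceholders: true,
  skipReviewed: true,
  maxCharsPerRun: 50000,
  formality: "default",
  failOnIcuError: true,
};

// Merge user overrides over the defaults.
const withDefaults = (overrides: Partial<PrefillConfig>): PrefillConfig => ({
  ...PREFILL_DEFAULTS,
  ...overrides,
});
```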

Framework variants

React / Next.js (formatjs): run formatjs extract to produce the source catalog, then feed the extracted defaultMessage strings into the pre-fill script. Keep ICU intact by translating only message text; write pre-filled values into lang/fr.json with a parallel lang/fr.meta.json carrying needs-review flags, so a filter step ahead of formatjs compile can exclude unreviewed entries from the production bundle.

Vue / Nuxt (vue-i18n): vue-i18n has no native fuzzy concept, so store review state in a sidecar and have your build filter it. Pre-fill into locales/fr.json, keep locales/fr.review.json, and let the runtime fall back to the base locale for any key still flagged — aligned with vue-i18n’s fallbackLocale.

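Angular (@angular/localize): Angular uses XLIFF natively. With the default XLIFF 1.2 output, mark each machine-filled <target> with state="needs-review-translation" and state-qualifier="mt-suggestion"; with XLIFF 2.0 (--format=xlf2), leave the enclosing <segment> at state="initial". extract-i18n regenerates the source file; your pre-fill step fills only the empty <target> nodes in messages.fr.xlf and leaves state="final" segments untouched.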

Node.js backend (i18next): for server strings, pre-fill locales/fr/translation.json plus a _fuzzy namespace. Configure i18next saveMissing: false in production and returnNull: false so unreviewed keys resolve through the fallback chain instead of leaking MT to API consumers.
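The sidecar-based variants share one build-time step: strip every key that is still flagged so the runtime fallback takes over. A minimal sketch, assuming flat catalogs and a hypothetical shippableCatalog helper, with keys that have no sidecar entry treated as already reviewed:

```typescript
type Catalog = Record<string, string>;
type ReviewMeta = Record<string, { state: string }>;

// Copy the catalog without keys that are still flagged in the sidecar.
// Assumption: a key with no sidecar entry is a human-reviewed string.
function shippableCatalog(target: Catalog, meta: ReviewMeta): Catalog {
  const out: Catalog = {};
  for (const [key, value] of Object.entries(target)) {
    const state = meta[key]?.state;
    if (state === "needs-review" || state === "untranslated") continue;
    out[key] = value;
  }
  return out;
}
```

The filtered result is what gets bundled; the unfiltered file stays in the repository for reviewers.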

Verification

Assert that no unreviewed machine string can reach a shipping locale and that every pre-filled string still parses as ICU.

# 1. structural check: every target value is valid ICU
node scripts/validate-icu.mjs locales/fr.json
# expected: "✓ 412 keys parsed, 0 ICU errors"

# 2. review-state gate: zero fuzzy entries for shipping locales
node scripts/assert-no-fuzzy.mjs locales/fr.json --max-fuzzy 0
# expected exit 0; non-zero with "12 needs-review strings remain" blocks merge

# .github/workflows/i18n.yml — gate fragment
- name: Block unreviewed MT
  run: |
    node scripts/validate-icu.mjs locales/*.json
    node scripts/assert-no-fuzzy.mjs locales/fr.json --max-fuzzy 0

This gate composes with broader GitHub Actions i18n CI gates that also fail builds on untranslated keys.

Common pitfalls

MT translates ICU keywords. Sending {count, plural, one {...} other {...}} raw lets the engine “translate” one/other, breaking the parser. Mask the skeleton; validate with the DeepL pre-translation quality gate.
Placeholders get reordered or dropped. Hello {name} becomes Bonjour with the variable gone. Always mask-and-restore, then assert every original slot reappears exactly once.
Pre-fill overwrites approved human translations. Skip any key whose target is already non-fuzzy — re-translation silently destroys reviewed work.
Fuzzy strings ship. Without a merge gate, needs-review MT reaches users. Enforce --max-fuzzy 0 on shipping locales.
Unbounded cost. Translating the full catalog every run, not the delta, multiplies the bill. Diff for new keys and cap maxCharsPerRun.
Glossary drift. An engine glossary that diverges from your TM produces inconsistent terms; keep both fed from one source via translation memory & glossary management.

FAQ

Should machine-translated strings ever be marked “translated”?

No. MT output is a candidate, not a commitment. Always write it as fuzzy (gettext) or with a non-final segment state (XLIFF), so the runtime falls back to the source locale and the string stays visible to a human reviewer until they post-edit and approve it.

How do I stop the MT engine from mangling ICU and placeholders?

Mask everything that isn’t human-readable text — {vars}, %s, HTML tags, and ICU control tokens (plural, select, one, other, #) — with opaque sentinels before sending, then restore afterward and re-parse the result as ICU. Drop any candidate that fails to parse rather than committing it.

How do I keep machine-translation costs under control?

Translate only newly extracted keys, not the full catalog; skip keys that already have a reviewed target; cache results so identical strings aren’t re-sent; and set a hard per-run character budget that aborts the job when exceeded. On a mature catalog most keys are unchanged between runs, so sending only the delta usually cuts character volume substantially.
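The cache and budget rules can be combined in one planning pass before any request is made. This sketch assumes a single language pair, a cache keyed by source text, and a hypothetical planRun helper; it counts JavaScript string length, which may differ from how a given engine bills characters.

```typescript
interface RunPlan {
  toSend: string[];               // unique strings that still need MT
  cached: Record<string, string>; // source text -> cached translation
  chars: number;                  // characters that will be sent
}

// Deduplicate, serve cache hits, and abort if the remainder exceeds the budget.
function planRun(sources: string[], cache: Map<string, string>, maxCharsPerRun: number): RunPlan {
  const toSend: string[] = [];
  const cached: Record<string, string> = {};
  const seen = new Set<string>();
  let chars = 0;
  for (const text of sources) {
    if (seen.has(text)) continue; // identical string already planned
    seen.add(text);
    const hit = cache.get(text);
    if (hit !== undefined) {
      cached[text] = hit;
      continue;
    }
    chars += text.length;
    if (chars > maxCharsPerRun) {
      throw new Error(`MT budget exceeded: ${chars} > ${maxCharsPerRun} characters`);
    }
    toSend.push(text);
  }
  return { toSend, cached, chars };
}
```

With several target locales or glossaries, the cache key would also need to include the language pair and glossary version.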

Does applying a glossary remove the need for human review?

No. A glossary enforces term consistency but does not guarantee correct grammar, tone, or meaning. Glossary application and human post-edit gating are complementary: the glossary reduces obvious term errors, the human gate catches everything else before a string ships.

Which engine should I pick — DeepL, Google, or Azure?

DeepL generally gives the best formality control and glossary fidelity for major European languages; Google Cloud Translation has the broadest language coverage; Azure Translator integrates cleanly with Custom Translator and its dictionary mappings. Choose per language pair, and keep the write-back contract identical across engines.

Related

DeepL pre-translation quality gate — validating restored ICU and placeholders before pre-filled strings are written back.
Translation Memory & Glossary Management — the shared term source that feeds both your TM and the MT engine glossary.
Crowdin Integration for Dev Teams — surfacing pre-filled fuzzy strings for human post-edit inside a TMS.
Weblate Self-Hosted Setup — self-hosted review of machine-suggested translations with built-in MT add-ons.
GitHub Actions i18n CI Gates — the CI layer that blocks merges while unreviewed MT strings remain.

Part of Translation Workflows & CI/CD Pipeline Sync.
