DeepL Pre-Translation Quality Gate

A DeepL pre-translation quality gate runs every machine-translated string through automated checks — placeholder parity, length ratio, and glossary adherence — before it is written back as needs-review, so broken ICU never reaches a human or a build. DeepL is genuinely strong, but it will still occasionally drop a {count} placeholder, mangle a <b> tag, “translate” the ICU keyword other, or ignore a glossary term under inflection. Without a gate, those defects land in your catalog flagged as if they were merely unreviewed prose, and the next npm run build dies on Expected "}" but found "uno". This page wires DeepL into a machine-translation pre-fill workflow with tag handling, a glossary, and a confidence check that rejects defective output instead of trusting it.

[Figure: DeepL quality gate decision flow. A source ICU string is masked, translated by DeepL with tag handling and a glossary, unmasked, then evaluated by placeholder parity, length ratio, and glossary adherence checks. A pass writes needs-review; any failure routes the string to a reject queue for human translation.]
Every DeepL result must clear three deterministic checks before it earns a needs-review state.

Root cause analysis

DeepL’s neural model treats the whole input string as natural language. Anything that looks like prose — and ICU keywords, HTML tags, and named placeholders all do — is fair game for reordering, inflection, casing changes, or outright omission. Three behaviours cause almost every broken pre-fill:

  • Placeholder loss or duplication. With tag_handling off, DeepL sees {count} as a word. German output frequently becomes {Anzahl} or drops the braces entirely; long sentences sometimes emit a placeholder twice. The ICU parser then throws on an unknown argument or a malformed token.
  • ICU keyword translation. In {count, plural, one {...} other {...}} the tokens plural, one, and other are reserved selectors defined by CLDR plural rules, not words. DeepL has no way to know that and will happily render one as uno or eins, producing a syntactically invalid message.
  • Glossary drift under morphology. Even with a glossary attached, DeepL applies terms case- and form-sensitively. A glossary entry Dashboard → Übersicht may be respected in nominative but silently re-translated when the sentence demands a different case, so your enforced glossary terms never actually appear.

The fix is two-sided: prevent corruption by masking placeholders into XML tags that DeepL is explicitly told to preserve, and detect residual corruption with a gate that compares the output against the source before trusting it.

Minimal reproducible example

Send an ICU plural straight to DeepL as plain text and the structure does not survive. This is the smallest call that reproduces the breakage:

import * as deepl from "deepl-node";
const translator = new deepl.Translator(process.env.DEEPL_KEY!);

const source = "{count, plural, one {# file} other {# files}}";
const r = await translator.translateText(source, "en", "de"); // no tag handling
console.log(r.text);
// → "{Anzahl, Plural, eins {# Datei} andere {# Dateien}}"
//     ^^^^^^  ^^^^^^  ^^^^           ^^^^^^
//     argument name, type keyword, and selectors all translated → ICU parser will throw

Feeding r.text back into your catalog and running the build yields the runtime error every team recognises:

$ npm run build
SyntaxError: Expected "plural", "select", "selectordinal" but "Plural" found. (de.json: messages.fileCount)

The string was written back as needs-review, which looks safe, but the defect is structural, not stylistic — no human post-edit step protects the compiler from a string that never reaches a translator before CI runs.

Fix with annotated code block

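Mask placeholders into DeepL’s XML tag format, translate with tag_handling: "xml" plus the glossary, unmask, then run the gate. Each token is swapped for an empty <x id="N"/> tag and kept in a local array, so the original text never reaches DeepL; the model only has to carry the tag through, and unmask restores the stored token verbatim.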

import * as deepl from "deepl-node";
const translator = new deepl.Translator(process.env.DEEPL_KEY!);

// 1. Mask runtime args ({count}) and markup (<b>, </b>) in a FLAT string.
//    The regex matches innermost braces only, so on a whole plural/select
//    message it would mask the branch bodies ({# file}) instead of the args.
//    For nested ICU, run this on each branch body, never on the scaffold.
const PLACEHOLDER = /(\{[^{}]+\}|<\/?[a-z][^>]*>)/gi;

function mask(src: string) {
  const tokens: string[] = [];
  const masked = src.replace(PLACEHOLDER, (m) => {
    const id = tokens.push(m) - 1;        // store original, index = id
    return `<x id="${id}"/>`;             // DeepL-preserved ignore tag
  });
  return { masked, tokens };
}

function unmask(text: string, tokens: string[]) {
  // Restore verbatim; an unknown id is left in place so the parity check rejects it.
  return text.replace(/<x id="(\d+)"\/>/g, (m, id) => tokens[+id] ?? m);
}

async function pretranslate(src: string) {
  const { masked, tokens } = mask(src);
  const res = await translator.translateText(masked, "en", "de", {
    tagHandling: "xml",
    ignoreTags: ["x"],                     // <x/> contents are never translated
    glossary: process.env.DEEPL_GLOSSARY_ID, // glossary ID; needs the explicit source language above
  });
  return unmask(res.text, tokens);
}
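
Placeholder loss and duplication can also be caught one step earlier, on the still-masked text, before unmask hides which id went missing. The following is a minimal sketch under that assumption: maskDefects is a hypothetical helper, not part of deepl-node, and mask is repeated from the block above only so the example runs on its own.

```typescript
const PLACEHOLDER = /(\{[^{}]+\}|<\/?[a-z][^>]*>)/gi;

// Same masking step as above, repeated so this sketch is self-contained.
function mask(src: string) {
  const tokens: string[] = [];
  const masked = src.replace(PLACEHOLDER, (m) => `<x id="${tokens.push(m) - 1}"/>`);
  return { masked, tokens };
}

// Hypothetical helper: report which mask ids are missing, duplicated, or
// unknown in the translated text, before any token is restored.
function maskDefects(translated: string, tokenCount: number) {
  const counts: number[] = [];
  for (let id = 0; id < tokenCount; id++) counts.push(0);
  const unknown: number[] = [];
  const re = /<x id="(\d+)"\s*\/>/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(translated)) !== null) {
    const id = Number(m[1]);
    if (id < tokenCount) counts[id]++;
    else unknown.push(id); // an id that mask() never issued
  }
  const missing = counts.map((n, id) => (n === 0 ? id : -1)).filter((id) => id >= 0);
  const duplicated = counts.map((n, id) => (n > 1 ? id : -1)).filter((id) => id >= 0);
  return { missing, duplicated, unknown };
}

const { masked, tokens } = mask("Hello <b>{name}</b>, you have {count} files");
// A simulated bad response: id 1 dropped, id 3 emitted twice.
const bad = 'Hallo <x id="0"/><x id="2"/>, Sie haben <x id="3"/> <x id="3"/> Dateien';
const defects = maskDefects(bad, tokens.length);
console.log(masked, defects);
```

Logging the ids rather than a bare failure makes the reject queue easier to triage, since tokens[id] tells the translator exactly which placeholder was lost.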

For nested ICU ({count, plural, ...}) translate only the human-readable branch bodies, never the scaffold. Parse the message, run pretranslate on each leaf, and re-emit the structure so plural/one/other are reconstructed by your code, not the model. The gate below catches anything that still slips through.
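
The leaf-only approach can be sketched as a small walker that rebuilds the scaffold itself and hands only branch bodies to a transform. This is an illustration under stated assumptions, not a full ICU parser: mapLeaves, mapOptions, closing, and SCAFFOLD are hypothetical names, apostrophe escaping is not handled, and the transform is synchronous for brevity, whereas pretranslate is async. A real pipeline would more likely use a proper parser such as @formatjs/icu-messageformat-parser.

```typescript
type LeafFn = (leaf: string) => string;

// Header of a plural/select argument: "{name, plural," and friends.
const SCAFFOLD = /^\s*([\w.-]+)\s*,\s*(plural|selectordinal|select)\s*,/;

// Index of the "}" that closes the "{" at `open`, or -1 if unbalanced.
function closing(msg: string, open: number): number {
  let depth = 0;
  for (let i = open; i < msg.length; i++) {
    if (msg[i] === "{") depth++;
    else if (msg[i] === "}" && --depth === 0) return i;
  }
  return -1;
}

// Apply fn to human-readable text only; re-emit the ICU scaffold from code.
function mapLeaves(msg: string, fn: LeafFn): string {
  let out = "";
  let text = "";
  const flush = () => {
    out += text.trim() !== "" ? fn(text) : text; // skip whitespace-only runs
    text = "";
  };
  let i = 0;
  while (i < msg.length) {
    if (msg[i] !== "{") { text += msg[i++]; continue; }
    const end = closing(msg, i);
    if (end < 0) throw new Error("unbalanced braces");
    const inner = msg.slice(i + 1, end);
    const head = SCAFFOLD.exec(inner);
    if (!head) {
      text += msg.slice(i, end + 1); // simple arg like {count}: stays in the leaf
    } else {
      flush();
      out += `{${head[1]}, ${head[2]},${mapOptions(inner.slice(head[0].length), fn)}}`;
    }
    i = end + 1;
  }
  flush();
  return out;
}

// Walk "selector {body}" pairs; selectors are copied, bodies are recursed into.
function mapOptions(options: string, fn: LeafFn): string {
  let out = "";
  let i = 0;
  while (i < options.length) {
    const open = options.indexOf("{", i);
    if (open < 0) break;
    const end = closing(options, open);
    if (end < 0) throw new Error("unbalanced braces");
    const selector = options.slice(i, open).trim();
    out += ` ${selector} {${mapLeaves(options.slice(open + 1, end), fn)}}`;
    i = end + 1;
  }
  return out;
}

// Stand-in transform so the effect is visible without a network call.
const shout = mapLeaves("{count, plural, one {# file} other {# files}}", (s) => s.toUpperCase());
console.log(shout);
```

Because plural, one, and other are written by the walker rather than returned by the model, they cannot come back translated. Text around a scaffold is still translated as separate fragments, so sentences that wrap a plural are better authored with the whole sentence inside each branch.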

// QA gate — runs AFTER unmask, BEFORE writing the catalog.
type GateResult = { ok: boolean; reason?: string };

function gate(source: string, target: string, glossary: Record<string, string>): GateResult {
  // (a) Placeholder parity: same multiset of {args} and tags in/out.
  const tokens = (s: string) => (s.match(PLACEHOLDER) ?? []).sort();
  const a = tokens(source), b = tokens(target);
  if (a.length !== b.length || a.some((t, i) => t !== b[i]))
    return { ok: false, reason: "placeholder_parity" };

  // (b) Length ratio: target wildly off vs. source = likely truncation/hallucination.
  //     0.4–2.5 covers normal expansion (DE ~1.3×, FI ~1.4×); outside = reject.
  const ratio = target.length / Math.max(source.length, 1);
  if (ratio < 0.4 || ratio > 2.5) return { ok: false, reason: "length_ratio" };

  // (c) Glossary adherence: every required target term must appear literally.
  for (const [src, tgt] of Object.entries(glossary))
    if (source.includes(src) && !target.toLowerCase().includes(tgt.toLowerCase()))
      return { ok: false, reason: `glossary:${src}` };

  return { ok: true };
}

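// `source` is a flat string or a single branch body. `writeEntry` stands in
// for your own catalog writer and is not shown here.
const out = await pretranslate(source);
const verdict = gate(source, out, { Dashboard: "Übersicht" });
writeEntry(out, verdict.ok ? "needs-review" : "rejected", verdict.reason);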

The key discipline: a passing string still becomes needs-review, never translated. The gate only decides whether MT output is worth showing a human, not whether it is correct.

Verification snippet

Assert that the gate rejects the exact defects DeepL produces and accepts a clean translation. This is the test that belongs in CI alongside your other i18n CI gates:

import { test, expect } from "vitest";
import { gate } from "./gate"; // placeholder path: point this at the module that exports gate()

const G = { Dashboard: "Übersicht" };

test("rejects dropped placeholder", () => {
  const r = gate("Hello {name}", "Hallo", G);
  expect(r).toEqual({ ok: false, reason: "placeholder_parity" });
});

test("rejects glossary miss", () => {
  const r = gate("Open the Dashboard", "Öffne das Panel", G);
  expect(r.reason).toBe("glossary:Dashboard");
});

test("accepts clean expansion", () => {
  const r = gate("Save {count} files", "{count} Dateien speichern", G);
  expect(r.ok).toBe(true);
});

To prove ICU survives end to end, compile every pre-filled string with @formatjs/cli:

$ npx formatjs compile de.json --out-file /dev/null && echo "ICU OK"
ICU OK   # non-zero exit = a needs-review string still contains broken ICU

When to escalate

This gate is a structural safety net, not a quality judge. It guarantees that no string with mismatched placeholders, absurd length, or a missing glossary term enters review — but a translation can clear all three checks and still be wrong in tone, register, or meaning. DeepL also has no reliable per-string confidence score, so the length ratio is a heuristic proxy, not a true probability. When pre-fill output passes the gate but reviewers keep rejecting it (high post-edit distance over several sprints), the answer is not a tighter ratio; it is more source context, a richer glossary, or human-first translation for that surface. At that point, route the strings back through the full machine-translation pre-fill workflow and reconsider whether MT belongs on that namespace at all.
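
Post-edit distance can be measured in several ways. One common choice, sketched here as an assumption rather than a prescribed metric, is character-level Levenshtein distance between the MT output and the reviewer's final text, normalised by the longer of the two. The names editDistance, postEditDistance, and meanPostEditDistance are illustrative.

```typescript
// Levenshtein distance over UTF-16 code units, two-row dynamic programming.
function editDistance(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// 0 = reviewer kept the MT output as-is, 1 = effectively rewritten.
function postEditDistance(mt: string, final: string): number {
  const longest = Math.max(mt.length, final.length);
  return longest === 0 ? 0 : editDistance(mt, final) / longest;
}

// Average over a namespace or sprint; the input shape is an assumption.
function meanPostEditDistance(pairs: Array<{ mt: string; final: string }>): number {
  if (pairs.length === 0) return 0;
  const total = pairs.reduce((sum, p) => sum + postEditDistance(p.mt, p.final), 0);
  return total / pairs.length;
}

console.log(postEditDistance("Datei speichern", "Datei sichern"));
```

Tracked per namespace over time, a mean that stays high despite a passing gate is the signal described above: the problem is context or glossary coverage, not the ratio band.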

FAQ

Why mask placeholders into <x/> tags instead of using DeepL’s HTML mode?

DeepL’s tag_handling: "xml" with ignoreTags gives you an explicit contract: text inside declared ignore tags is not translated. DeepL may still move a tag to fit the target word order, which is what you want for a placeholder. HTML mode infers structure and can still move or re-case real <b> tags. Masking everything, ICU args and markup alike, into uniform <x/> tags means a single, predictable round-trip, and your parity check compares the original tokens you stored, not whatever DeepL emits.

Can the gate run inside DeepL’s response, or must it be a separate step?

It must be separate. DeepL returns text only; it has no awareness of your ICU grammar or glossary intent beyond term seeding. The parity, length-ratio, and adherence checks all compare the unmasked output against your source string, so they can only run after the translation returns and tokens are restored. Treat the gate as a pure function you can unit-test independently of any network call.

What length-ratio bounds should I use for non-Latin or verbose languages?

The 0.4–2.5 default is deliberately wide. Tighten it per target: German and Finnish expand ~1.3–1.4×, while CJK targets contract sharply, so a 0.4 floor is appropriate there but a 2.5 ceiling is generous. Measure your own corpus: take the median target/source character ratio per locale (the same direction the gate computes) and reject beyond roughly ±2 standard deviations rather than using one global band.
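
A per-locale band along those lines might be derived as follows. This is a sketch with assumed names (Pair, Band, ratioBand); it uses the population standard deviation and expects a sample of already-reviewed translations for the locale.

```typescript
type Pair = { source: string; target: string };
type Band = { lo: number; hi: number };

// Median target/source character ratio, widened by k standard deviations.
function ratioBand(pairs: Pair[], k = 2): Band {
  const ratios = pairs
    .filter((p) => p.source.length > 0) // avoid division by zero
    .map((p) => p.target.length / p.source.length)
    .sort((x, y) => x - y);
  if (ratios.length === 0) throw new Error("no usable pairs");

  const mid = Math.floor(ratios.length / 2);
  const median = ratios.length % 2 === 1 ? ratios[mid] : (ratios[mid - 1] + ratios[mid]) / 2;

  const mean = ratios.reduce((s, r) => s + r, 0) / ratios.length;
  const variance = ratios.reduce((s, r) => s + (r - mean) ** 2, 0) / ratios.length;
  const sd = Math.sqrt(variance);

  return { lo: Math.max(0, median - k * sd), hi: median + k * sd };
}

// Toy corpus: 10-character sources with 12, 13, and 14-character targets.
const band = ratioBand([
  { source: "aaaaaaaaaa", target: "b".repeat(12) },
  { source: "aaaaaaaaaa", target: "b".repeat(13) },
  { source: "aaaaaaaaaa", target: "b".repeat(14) },
]);
console.log(band);
```

The resulting lo and hi would replace the hard-coded 0.4 and 2.5 in the gate for that locale. Very short strings skew the ratio, so it may be worth computing the band only from sources above some minimum length.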

Part of Machine-Translation Pre-fill Workflows.