Converting gettext PO to XLIFF 2.1 Without Data Loss
Converting gettext PO to XLIFF 2.1 loses data because a naive po2xliff flattens msgstr[N] plural arrays to a single segment, merges #. and # comments into one undifferentiated note, and discards the #, fuzzy flag and msgctxt disambiguator. This page walks the exact field mapping — plural index to CLDR category, comment kinds to <note category>, msgctxt to unit context, fuzzy to segment state, and #: references to <note category="location"> — then proves the conversion is lossless with a Translate Toolkit po2xliff round trip. It is the precise edge case behind the empty msgstr[1] after import that the PO / XLIFF Format Bridging overview warns about.
Root Cause Analysis
The data loss is not a bug in any single tool; it is the gap between two data models. gettext PO, defined by the GNU gettext manual, is line-oriented: one entry is an optional msgctxt, a msgid, an optional msgid_plural, and either a single msgstr or an integer-indexed msgstr[0]…msgstr[n] array, preceded by comment lines distinguished only by their two-character prefix (#. extracted/developer, #: source reference, #, flag, # translator). XLIFF 2.1, the OASIS XML vocabulary, nests <file> → <unit> → <segment> → <source>/<target>, carries status on the segment state attribute, and stores all commentary in typed <note category="..."> elements. PO is flatter and relies on positional and prefix conventions; XLIFF is hierarchical and relies on named attributes.
Four mismatches cause the silent loss. First, plurals: PO keys variants by an integer index whose meaning is defined only by the file’s Plural-Forms header, while the XLIFF 2.1 output is expected to tag each variant with a CLDR plural category (zero, one, two, few, many, other). A converter that never reads Plural-Forms cannot derive the category, so it typically writes only index 0 and drops the rest; this is the empty msgstr[1] after import. Second, comments: PO’s #. (extracted from developer comments in the source code) and # (written by the translator) are semantically different, but a converter that emits a single <note> collapses them. Third, the fuzzy flag: #, fuzzy is a boolean that must become state="needs-review-translation"; if ignored, the string imports as fully translated and skips review. That value is the XLIFF 1.2 state name, and the XLIFF 2.1 core state attribute defines only initial, translated, reviewed, and final, so confirm which value your target tool accepts. Fourth, msgctxt: PO’s one disambiguation field must drive both the unit id/name and the context, or two entries that differ only by context collide on the same <source>.
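The comment split described above can be sketched as a prefix lookup. This is a minimal illustration, not part of any converter: the function name classify_comment and the "flag" label are assumptions made here (flags end up as segment state rather than as notes), and #| and #~ lines are deliberately rejected as out of scope.

```python
# Hypothetical helper: route a PO comment line to the note category used on this page.
PREFIX_TO_CATEGORY = {
    "#.": "developer",  # extracted comment written by the developer
    "#:": "location",   # source reference
    "#,": "flag",       # e.g. fuzzy; becomes segment state, not a note
}


def classify_comment(line: str) -> tuple[str, str]:
    """Return (category, text) for one PO comment line."""
    line = line.rstrip("\n")
    if not line.startswith("#"):
        raise ValueError(f"not a comment line: {line!r}")
    prefix = line[:2]
    if prefix in ("#|", "#~"):
        # Previous-msgid and obsolete lines are not covered by this sketch.
        raise ValueError(f"unsupported comment kind: {prefix}")
    if prefix in PREFIX_TO_CATEGORY:
        return PREFIX_TO_CATEGORY[prefix], line[2:].strip()
    # "# text" (or a bare "#") is a translator comment.
    return "translator", line[1:].strip()


for raw in ("#. Shown on the cart page", "#: src/cart/summary.ts:42", "#, fuzzy"):
    print(classify_comment(raw))
```

Keeping the category next to the text from the moment the line is read is what lets the converter later emit one typed note per comment instead of a merged one.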
The plural derivation is the part most converters skip. Given Plural-Forms: nplurals=2; plural=(n != 1);, index 0 corresponds to the singular category one and index 1 to other. For a language like Polish with nplurals=3; plural=(n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2), the indices map to one/few/many. You recover the category by evaluating the formula against the CLDR sample integers for the locale, not by guessing from the index number.
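A minimal sketch of that evaluation follows, assuming Python's standard-library gettext.c2py to compile the Plural-Forms expression. The CLDR side is hand-written here for English and Polish integers only (cldr_en and cldr_pl are illustrative helpers, not a library API); a real converter would pull these rules from CLDR data.

```python
import gettext


def cldr_en(n: int) -> str:
    """CLDR integer rule for English: 1 is "one", everything else "other"."""
    return "one" if n == 1 else "other"


def cldr_pl(n: int) -> str:
    """CLDR integer rules for Polish ("other" only applies to fractions)."""
    if n == 1:
        return "one"
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"
    return "many"


def index_to_category(plural_expr, nplurals, cldr_rule, samples=range(0, 201)):
    """Map each msgstr index to the single CLDR category its sample integers fall in."""
    to_index = gettext.c2py(plural_expr)  # compiles the C expression to a Python function
    seen = {}
    for n in samples:
        seen.setdefault(to_index(n), set()).add(cldr_rule(n))
    mapping = {}
    for index in range(nplurals):
        categories = seen.get(index, set())
        if len(categories) != 1:
            # The formula and the CLDR rule disagree, so no clean mapping exists.
            raise ValueError(f"index {index} maps to {sorted(categories)}")
        mapping[index] = next(iter(categories))
    return mapping


POLISH = "(n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2)"
print(index_to_category("(n != 1)", 2, cldr_en))  # {0: 'one', 1: 'other'}
print(index_to_category(POLISH, 3, cldr_pl))      # {0: 'one', 1: 'few', 2: 'many'}
```

Raising when an index covers more than one category is deliberate: it surfaces a Plural-Forms header that does not match the locale instead of silently mislabeling a segment.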
Minimal Reproducible Example
The smallest PO entry that triggers every loss path at once combines a plural, an extracted comment, a source reference, a context, and a fuzzy flag:
#. Shown on the cart page; {count} is the running item total
#: src/cart/summary.ts:42
#, fuzzy
msgctxt "cart"
msgid "{count} item"
msgid_plural "{count} items"
msgstr[0] "{count} artigo"
msgstr[1] "{count} artigos"
A converter that ignores Plural-Forms produces this lossy XLIFF — one segment, a single merged note, no state, no context:
<unit id="u1">
  <notes>
    <note>Shown on the cart page; {count} is the running item total src/cart/summary.ts:42</note>
  </notes>
  <segment>
    <source>{count} item</source>
    <target>{count} artigo</target>
  </segment>
</unit>
msgstr[1] is gone, the location reference is fused into prose, the fuzzy flag silently became “translated”, and msgctxt "cart" vanished, so a second entry with the same msgid but a different context will now collide.
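The collision is easy to reproduce in a few lines. The sketch below is illustrative only: the "wishlist" context and the unit_id helper are assumptions introduced here, and the hash-based id is just one way to keep the identifier free of the spaces and braces that an XLIFF 2.1 id (an NMTOKEN) cannot contain.

```python
import hashlib
import re

entries = [
    {"msgctxt": "cart", "msgid": "{count} item"},
    {"msgctxt": "wishlist", "msgid": "{count} item"},  # hypothetical second context
]

# Keyed by msgid alone, the second entry overwrites the first.
by_msgid = {e["msgid"]: e for e in entries}
# Keyed by (msgctxt, msgid), both entries survive.
by_context = {(e["msgctxt"], e["msgid"]): e for e in entries}


def unit_id(msgctxt: str, msgid: str) -> str:
    """Build a context-prefixed id that contains only NMTOKEN-safe characters."""
    safe_ctx = re.sub(r"[^A-Za-z0-9_.-]", "_", msgctxt) or "_"
    digest = hashlib.sha1(msgid.encode("utf-8")).hexdigest()[:8]
    return f"{safe_ctx}.{digest}"


print(len(by_msgid), len(by_context))  # 1 2
print(unit_id("cart", "{count} item") != unit_id("wishlist", "{count} item"))  # True
```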
Fix With Annotated Code Block
The lossless target keeps every field in its correct XLIFF home. Note the two CLDR-categorized segments, the typed notes, the preserved context, and the review state:
<!-- msgctxt "cart" drives the unit id prefix AND a context note so two
     same-msgid entries never collide. The id must be an NMTOKEN (no spaces
     or braces), so the raw msgid is kept in name. -->
<unit id="cart.u1" name="{count} item">
  <notes>
    <!-- msgctxt -> context note -->
    <note category="context">cart</note>
    <!-- #. extracted/developer comment -> developer note -->
    <note category="developer">Shown on the cart page; {count} is the running item total</note>
    <!-- #: reference -> location note, kept machine-parseable -->
    <note category="location">src/cart/summary.ts:42</note>
  </notes>
  <segment id="0" state="needs-review-translation"> <!-- #, fuzzy -> needs-review -->
    <source>{count} item</source>      <!-- msgid -->
    <target>{count} artigo</target>    <!-- msgstr[0] = CLDR "one" -->
  </segment>
  <segment id="1" state="needs-review-translation">
    <source>{count} items</source>     <!-- msgid_plural -->
    <target>{count} artigos</target>   <!-- msgstr[1] = CLDR "other" -->
  </segment>
</unit>
To produce it, drive Translate Toolkit po2xliff at version 2.1 and then post-process plurals using the Plural-Forms header so each index becomes a categorized segment:
# 1. Normalize first so multi-line msgstr blocks fold and diffs stay clean.
msgcat --no-wrap locales/pt_BR/messages.po -o locales/pt_BR/messages.norm.po
# 2. Confirm Plural-Forms exists; the index->category mapping depends on it.
grep -i "Plural-Forms" locales/pt_BR/messages.norm.po
# "Plural-Forms: nplurals=2; plural=(n > 1);\n"
# 3. Convert to XLIFF 2.1 with an explicit BCP 47 target so trgLang is valid.
#    Check `po2xliff --help` first: in Translate Toolkit's stock option parser,
#    --version prints the tool version instead of selecting an XLIFF version.
po2xliff --version=2.1 -l pt-BR \
    locales/pt_BR/messages.norm.po locales/pt_BR/messages.xlf
# 4. Expand plural indices to CLDR-categorized segments the converter omits.
python tools/expand_plurals.py \
    --in locales/pt_BR/messages.xlf --locale pt-BR --in-place
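The post-processor looks up the locale's CLDR categories, assigns them to the segments in index order, and carries the fuzzy flag across, which is the step a plain po2xliff skips:
# tools/expand_plurals.py (core of the index -> CLDR category step)
from translate.lang.data import plural_tags  # dict: language code -> CLDR categories


def categories_for(locale: str) -> list[str]:
    # plural_tags is keyed by base language, so "pt-BR" is looked up as "pt".
    # e.g. en -> ["one", "other"]; pl -> ["one", "few", "many", "other"]
    base = locale.replace("_", "-").split("-")[0]
    return plural_tags.get(base, plural_tags["en"])


def expand(unit, locale: str) -> None:
    # `unit` is assumed to be this script's own wrapper around an XLIFF <unit>;
    # segments, set_category() and was_fuzzy are its helpers, not Translate Toolkit API.
    cats = categories_for(locale)
    # msgstr[0]->cats[0], msgstr[1]->cats[1], ... one segment per index.
    # Positional mapping only holds when Plural-Forms orders its indices like CLDR.
    for index, segment in enumerate(unit.segments):
        segment.set_category(cats[index])  # writes the CLDR category
        if segment.was_fuzzy:              # carry the boolean across
            segment.state = "needs-review-translation"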
Verification Snippet
Prove the conversion is lossless by round-tripping and comparing semantically, not byte-for-byte. msgcmp exits non-zero if a msgid (with its context) from the reference file is missing from the round-tripped file, so it doubles as the CI gate; the grep comparison after it catches dropped plural variants:
#!/usr/bin/env bash
set -euo pipefail
po2xliff --version=2.1 -l pt-BR messages.norm.po /tmp/rt.xlf
xliff2po -t messages.norm.po /tmp/rt.xlf /tmp/rt.po  # -t template preserves refs + order
# Non-zero exit if a msgid (with its context) is missing from the round trip.
msgcmp --use-untranslated /tmp/rt.po messages.norm.po
# Assert no plural variant was dropped: each plural entry must keep both indices.
test "$(grep -c '^msgstr\[1\]' /tmp/rt.po)" \
    = "$(grep -c '^msgstr\[1\]' messages.norm.po)" \
    || { echo "plural variant lost in round trip"; exit 1; }
echo "Round-trip integrity OK"
On success it prints Round-trip integrity OK and exits 0. If a plural collapsed, the msgstr[1] counts differ and the script exits 1; if a msgid or context was lost, msgcmp prints this message is used but not defined and fails the job first.
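The grep comparison only counts msgstr[1], which is enough for two-form locales but would miss a dropped msgstr[2] in a language such as Polish. A minimal sketch of the same check generalized to every index, using only the standard library (the function names are illustrative, not part of gettext or Translate Toolkit):

```python
import re
from collections import Counter

# Matches active plural translations only; obsolete "#~ msgstr[N]" lines do not start with msgstr.
PLURAL_INDEX = re.compile(r"^msgstr\[(\d+)\]", re.MULTILINE)


def plural_index_counts(po_text: str) -> Counter:
    """Count how many entries define each msgstr[N] index."""
    return Counter(int(index) for index in PLURAL_INDEX.findall(po_text))


def dropped_indices(original: str, round_tripped: str) -> dict[int, int]:
    """Return {index: number_of_lost_variants} for indices that shrank in the round trip."""
    before = plural_index_counts(original)
    after = plural_index_counts(round_tripped)
    return {i: before[i] - after[i] for i in sorted(before) if after[i] < before[i]}


original = 'msgid "a"\nmsgid_plural "b"\nmsgstr[0] "x"\nmsgstr[1] "y"\nmsgstr[2] "z"\n'
round_tripped = 'msgid "a"\nmsgid_plural "b"\nmsgstr[0] "x"\nmsgstr[1] "y"\n'
print(dropped_indices(original, round_tripped))  # {2: 1}
```

An empty dictionary means every plural index survived; anything else names the index that was lost and how many entries lost it.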
When to Escalate
This field-level fix is sufficient for the standard PO→XLIFF→PO bridge where PO stays canonical. It is not enough when the XLIFF round-trips through a translation management system that adds richer state (reviewed, final, subState): those values have no PO equivalent and collapse back to “not fuzzy”, so review status must be reconciled from the TMS rather than the PO file. It also breaks down for inline markup (placeholders, <pc>/<ph> tags) and for nested plural-in-select messages, where segment boundaries no longer line up with PO entries. When you hit any of these cases, stop treating PO as the source of truth and keep XLIFF canonical, following the mapping rules in the PO / XLIFF Format Bridging overview, and route the artifacts through a managed platform such as Crowdin (see Crowdin Integration for Dev Teams) or a self-hosted Weblate instance (see Weblate Self-Hosted Setup) so state is owned in one place.
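The state collapse can be made concrete with a small sketch. It assumes the fuzzy mapping used on this page and a hypothetical reconcile helper that prefers whatever state the TMS reports; the "translated" fallback for non-fuzzy entries is likewise an assumption, not a rule from either format.

```python
from typing import Optional

FUZZY_STATE = "needs-review-translation"  # the value this page maps "#, fuzzy" to


def po_flags_for(state: str) -> list[str]:
    """Project a segment state onto PO's single review flag."""
    return ["fuzzy"] if state == FUZZY_STATE else []


def reconcile(po_fuzzy: bool, tms_state: Optional[str]) -> str:
    """Prefer the TMS state when one exists; otherwise use what PO can express."""
    if tms_state is not None:
        return tms_state
    return FUZZY_STATE if po_fuzzy else "translated"


# translated, reviewed and final are indistinguishable once written to PO.
for state in ("translated", "reviewed", "final"):
    print(state, po_flags_for(state))
```

Because all three states produce an empty flag list, the PO file cannot be the place where review progress is stored; the reconcile step has to read it back from the system that owns it.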
FAQ
Why is msgstr[1] empty after converting PO to XLIFF 2.1?
Because the converter did not read the Plural-Forms header and so could not map plural index 1 to a CLDR category. A naive po2xliff then writes only index 0 and drops the rest. Confirm the PO file has a correct Plural-Forms formula, convert with po2xliff --version=2.1, then post-process to emit one CLDR-categorized segment (one, few, many, other) per index so every msgstr[N] survives.
How do PO comments and the fuzzy flag map to XLIFF 2.1?
Map #. extracted/developer comments to <note category="developer">, #: source references to <note category="location">, and # translator comments to <note category="translator"> so the three stay distinct. Map the #, fuzzy flag to segment state="needs-review-translation". Keeping comment kinds separate preserves reviewer context, and routing fuzzy into the review state stops unreviewed strings importing as final.
What happens to msgctxt during the conversion?
msgctxt is PO’s only disambiguation field, so it must drive both the XLIFF unit identity (the id/name) and a context marker; otherwise two entries that share a msgid but differ by context collapse onto one <source> and one translation overwrites the other. Prefix the unit id with the context value and verify with msgcmp that the round trip preserves the same number of entries.
Related
- PO / XLIFF Format Bridging — the full field-by-field bridge and which attributes survive each direction.
- Weblate Self-Hosted Setup — running PO and XLIFF components side by side when XLIFF becomes canonical.
- Crowdin Integration for Dev Teams — pushing and pulling XLIFF artifacts through a managed platform that owns review state.
- Handling Pluralization in Arabic and Slavic Languages — the CLDR category rules you evaluate to map plural indices correctly.