Converting gettext PO to XLIFF 2.1 Without Data Loss

Converting gettext PO to XLIFF 2.1 loses data because a naive po2xliff flattens msgstr[N] plural arrays to a single segment, merges #. and # comments into one undifferentiated note, and discards the #, fuzzy flag and msgctxt disambiguator. This page walks the exact field mapping — plural index to CLDR category, comment kinds to <note category>, msgctxt to unit context, fuzzy to segment state, and #: references to <note category="location"> — then proves the conversion is lossless with a Translate Toolkit po2xliff round trip. It is the precise edge case behind the empty msgstr[1] after import that the PO / XLIFF Format Bridging overview warns about.

[Figure: Lossless PO entry to XLIFF 2.1 unit mapping. PO fields (#. comment, #: reference, #, fuzzy, msgctxt, msgid, msgid_plural, msgstr[0], msgstr[1]) map to XLIFF 2.1 notes (developer, location), segment state, unit name and context, and the "one" and "other" segments.]
One PO entry expands into one XLIFF 2.1 unit: plural indices become CLDR-categorized segments, comment kinds split into typed notes, fuzzy becomes a segment state.

Root Cause Analysis

The data loss is not a bug in any single tool; it is the gap between two data models. gettext PO, defined by the GNU gettext manual, is line-oriented: one entry is an optional msgctxt, a msgid, an optional msgid_plural, and either a msgstr or an integer-indexed msgstr[0] to msgstr[n] array, preceded by comment lines distinguished only by their two-character prefix (#. extracted/developer, #: source reference, #, flag, and # translator). XLIFF 2.1, the OASIS XML vocabulary, nests <file>, <unit>, <segment>, and <source>/<target>, carries status on the segment state attribute, and stores all commentary in typed <note category="..."> elements. PO is flatter and relies on positional and prefix conventions; XLIFF is hierarchical and relies on named attributes.

Four mismatches cause the silent loss. First, plurals: PO keys variants by an integer index whose meaning is defined only by the file’s Plural-Forms header, while a lossless XLIFF 2.1 representation needs each variant labeled with a CLDR plural category (zero, one, two, few, many, other). A converter that never reads Plural-Forms cannot derive the category, so it typically writes only index 0 and drops the rest; this is the empty msgstr[1] after import. Second, comments: PO’s #. (extracted from the source code, written by the developer) and # (written by the translator) are semantically different, but a converter that emits a single <note> collapses them. Third, the fuzzy flag: #, fuzzy is a boolean that must become a review state on the segment, spelled state="needs-review-translation" in this mapping; if ignored, the string imports as fully translated and skips review. That value is the XLIFF 1.2 state name, and the XLIFF 2.1 core schema accepts only initial, translated, reviewed, and final, so a schema-strict pipeline has to carry the flag as one of those four plus a subState. Fourth, msgctxt: PO’s one disambiguation field must drive both the unit id/name and the context, or two entries that differ only by context collide on the same <source>.

The plural derivation is the part most converters skip. Given Plural-Forms: nplurals=2; plural=(n != 1);, index 0 corresponds to the singular category one and index 1 to other. For a language like Polish with nplurals=3; plural=(n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2), the indices map to one/few/many. You recover the category by evaluating the formula against the CLDR sample integers for the locale, not by guessing from the index number.
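The derivation can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not any converter's actual implementation: it relies on gettext.c2py, an undocumented helper in the standard library's gettext module that compiles a Plural-Forms expression into a function, and SAMPLES is a hand-copied excerpt of CLDR integer samples that a real tool would load from CLDR data. The names SAMPLES and index_to_category are illustrative.

```python
import gettext

# Hand-copied excerpt of CLDR integer samples (assumption: a real tool would
# load these from CLDR data rather than hard-code them).
SAMPLES = {
    "en": {"one": [1], "other": [0, 2, 5, 11, 100]},
    "pl": {
        "one": [1],
        "few": [2, 3, 4, 22, 23, 24],
        "many": [0, 5, 6, 11, 12, 13, 14, 21, 25, 100],
    },
}


def index_to_category(plural_expr: str, samples: dict) -> dict:
    """Map each msgstr[N] index to the CLDR category whose samples select it."""
    plural = gettext.c2py(plural_expr)  # C expression -> Python callable
    mapping = {}
    for category, numbers in samples.items():
        for n in numbers:
            index = plural(n)
            # An index selected by two different categories means the formula
            # and the sample table disagree, so fail loudly instead of guessing.
            if mapping.setdefault(index, category) != category:
                raise ValueError(
                    f"index {index} claimed by {mapping[index]!r} and {category!r}"
                )
    return mapping


POLISH = "(n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2)"
print(index_to_category("(n != 1)", SAMPLES["en"]))
print(index_to_category(POLISH, SAMPLES["pl"]))
```

For the Polish formula this maps index 0 to one, 1 to few, and 2 to many, which matches the mapping described above. The formula is passed without the trailing semicolon from the header line.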

Minimal Reproducible Example

The smallest PO entry that triggers every loss path at once — a plural, two comment kinds, a context, a reference, and a fuzzy flag:

#. Shown on the cart page; {count} is the running item total
#: src/cart/summary.ts:42
#, fuzzy
msgctxt "cart"
msgid "{count} item"
msgid_plural "{count} items"
msgstr[0] "{count} artigo"
msgstr[1] "{count} artigos"

A converter that ignores Plural-Forms produces this lossy XLIFF — one segment, a single merged note, no state, no context:

<unit id="u1">
  <notes>
    <note>Shown on the cart page; {count} is the running item total src/cart/summary.ts:42</note>
  </notes>
  <segment>
    <source>{count} item</source>
    <target>{count} artigo</target>
  </segment>
</unit>

msgstr[1] is gone, the location reference is fused into prose, the fuzzy flag silently became “translated”, and msgctxt "cart" vanished — so a second cart-context entry with the same msgid will now collide.

Fix With Annotated Code Block

The lossless target keeps every field in its correct XLIFF home. Note the two CLDR-categorized segments, the typed notes, the preserved context, and the review state:

<!-- msgctxt "cart" prefixes the unit id AND is kept as a context note, so two
     same-msgid entries never collide. The id attribute is an NMTOKEN (no
     spaces or braces), so the raw msgid stays in name and the id is a slug. -->
<unit id="cart.count-item" name="{count} item">
  <notes>
    <!-- msgctxt -> context note -->
    <note category="context">cart</note>
    <!-- #. extracted/developer comment -> developer note -->
    <note category="developer">Shown on the cart page; {count} is the running item total</note>
    <!-- #: reference -> location note, kept machine-parseable -->
    <note category="location">src/cart/summary.ts:42</note>
  </notes>
  <segment id="0" state="needs-review-translation">   <!-- #, fuzzy -> needs-review -->
    <source>{count} item</source>                     <!-- msgid -->
    <target>{count} artigo</target>                   <!-- msgstr[0] = CLDR "one" -->
  </segment>
  <segment id="1" state="needs-review-translation">
    <source>{count} items</source>                    <!-- msgid_plural -->
    <target>{count} artigos</target>                  <!-- msgstr[1] = CLDR "other" -->
  </segment>
</unit>

To produce it, drive Translate Toolkit po2xliff at version 2.1 and then post-process plurals using the Plural-Forms header so each index becomes a categorized segment:

# 1. Normalize first so multi-line msgstr blocks fold and diffs stay clean.
msgcat --no-wrap locales/pt_BR/messages.po -o locales/pt_BR/messages.norm.po

# 2. Confirm Plural-Forms exists — the index->category mapping depends on it.
grep -i "Plural-Forms" locales/pt_BR/messages.norm.po
#   "Plural-Forms: nplurals=2; plural=(n > 1);\n"

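# 3. Convert to XLIFF 2.1 with an explicit BCP 47 target so trgLang is valid.
#    Check `po2xliff --help` first: stock Translate Toolkit releases define
#    --version as "show program's version number and exit", so the flags
#    below assume a build or wrapper that accepts them.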
po2xliff --version=2.1 -l pt-BR \
  locales/pt_BR/messages.norm.po locales/pt_BR/messages.xlf

# 4. Expand plural indices to CLDR-categorized segments the converter omits.
python tools/expand_plurals.py \
  --in locales/pt_BR/messages.xlf --locale pt-BR --in-place

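The post-processor looks up the CLDR categories for the locale and stamps each segment in index order, the step a plain po2xliff skips. The positional lookup is only safe when the Plural-Forms formula orders its indices the same way as the CLDR list, so check the formula against the CLDR sample integers for every locale you add:

# tools/expand_plurals.py (core of the index -> CLDR category step)
from translate.lang.data import plural_tags   # dict: language code -> CLDR categories

def categories_for(locale: str) -> list[str]:
    # e.g. pt-BR -> ["one", "other"]; pl -> ["one", "few", "many", "other"]
    # plural_tags is a mapping, not a function: try the full locale first,
    # then the bare language code (pt-BR -> pt_BR -> pt).
    normalized = locale.replace("-", "_")
    for key in (normalized, normalized.split("_")[0]):
        if key in plural_tags:
            return plural_tags[key]
    raise KeyError(f"no CLDR plural categories known for {locale!r}")

def expand(unit, locale: str) -> None:
    # unit and segment are this script's own wrapper objects, not a
    # Translate Toolkit API.
    cats = categories_for(locale)
    if len(unit.segments) > len(cats):
        raise ValueError("more plural forms than CLDR categories")
    # msgstr[0]->cats[0], msgstr[1]->cats[1], ... one segment per index
    for index, segment in enumerate(unit.segments):
        segment.set_category(cats[index])           # writes the CLDR category
        if segment.was_fuzzy:                        # carry the boolean across
            segment.state = "needs-review-translation"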

Verification Snippet

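Prove the conversion is lossless by round-tripping and comparing semantically, not byte-for-byte. msgcmp exits non-zero when a msgid (together with its msgctxt) from the original is missing in the round-tripped file, so it doubles as the CI gate. It does not compare msgstr text, which is why the script also counts plural lines: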

#!/usr/bin/env bash
set -euo pipefail

po2xliff --version=2.1 -l pt-BR messages.norm.po /tmp/rt.xlf
xliff2po -t messages.norm.po /tmp/rt.xlf /tmp/rt.po   # -t template preserves refs + order

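# Non-zero exit if any msgctxt/msgid pair from the original is missing after the
# round trip. --use-fuzzy stops msgcmp from reporting fuzzy entries as errors.
msgcmp --use-fuzzy --use-untranslated /tmp/rt.po messages.norm.po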

# Assert no plural variant was dropped: each plural entry must keep both indices.
test "$(grep -c '^msgstr\[1\]' /tmp/rt.po)" \
   = "$(grep -c '^msgstr\[1\]' messages.norm.po)" \
   || { echo "plural variant lost in round trip"; exit 1; }

echo "Round-trip integrity OK"

On success it prints Round-trip integrity OK and exits 0. If a plural collapsed, the msgstr[1] counts differ and the script exits 1; if a msgid or context was lost, msgcmp prints this message is used but not defined and fails the job first.

When to Escalate

This field-level fix is sufficient for the standard PO→XLIFF→PO bridge where PO stays canonical. It is not enough when the XLIFF round-trips through a translation management system that adds richer state (reviewed, final, subState): those values have no PO equivalent and collapse back to “not fuzzy”, so review status must be reconciled from the TMS rather than the PO file. It also breaks down for inline markup (placeholders, <pc>/<ph> tags) and for nested plural-in-select messages, where segment boundaries no longer line up with PO entries. When you hit either case, stop treating PO as the source of truth and keep XLIFF canonical, following the mapping rules in the PO / XLIFF Format Bridging overview, and route the artifacts through a managed platform such as Crowdin (see Crowdin Integration for Dev Teams) or a self-hosted Weblate (see Weblate Self-Hosted Setup) so state is owned in one place.

FAQ

Why is msgstr[1] empty after converting PO to XLIFF 2.1?

Because the converter did not read the Plural-Forms header and so could not map plural index 1 to a CLDR category. A naive po2xliff then writes only index 0 and drops the rest. Confirm the PO file has a correct Plural-Forms formula, convert with po2xliff --version=2.1, then post-process to emit one CLDR-categorized segment (one, few, many, other) per index so every msgstr[N] survives.

How do PO comments and the fuzzy flag map to XLIFF 2.1?

Map #. extracted/developer comments to <note category="developer">, #: source references to <note category="location">, and # translator comments to <note category="translator"> so the three stay distinct. Map the #, fuzzy flag to segment state="needs-review-translation". Keeping comment kinds separate preserves reviewer context, and routing fuzzy into the review state stops unreviewed strings importing as final.
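The prefix routing can be sketched as a small classifier. This is an illustrative sketch, not Translate Toolkit code: the function name split_comments and the category strings are assumptions taken from the mapping above, and previous-msgid (#|) and obsolete (#~) lines are skipped because they are not comments.

```python
from typing import List, Tuple

# Two-character PO prefixes that map straight onto a note category.
PREFIX_TO_CATEGORY = {"#.": "developer", "#:": "location"}


def split_comments(lines: List[str]) -> Tuple[List[Tuple[str, str]], bool]:
    """Return ([(category, text), ...], is_fuzzy) for one PO entry's lines."""
    notes: List[Tuple[str, str]] = []
    fuzzy = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.startswith("#"):
            continue  # msgctxt / msgid / msgstr lines are handled elsewhere
        prefix, text = line[:2], line[2:].strip()
        if prefix == "#,":
            # Flags are comma-separated, e.g. "#, fuzzy, c-format".
            fuzzy = fuzzy or "fuzzy" in [flag.strip() for flag in text.split(",")]
        elif prefix in PREFIX_TO_CATEGORY:
            notes.append((PREFIX_TO_CATEGORY[prefix], text))
        elif prefix in ("#|", "#~"):
            continue  # previous msgid and obsolete entries are not notes
        else:
            # A bare "#" (usually followed by a space) is a translator comment.
            translator_text = line[1:].strip()
            if translator_text:
                notes.append(("translator", translator_text))
    return notes, fuzzy


entry = [
    "#. Shown on the cart page; {count} is the running item total",
    "#: src/cart/summary.ts:42",
    "#, fuzzy",
    "# Check gender agreement",
    'msgctxt "cart"',
]
print(split_comments(entry))
```

Each returned pair becomes one <note category="...">, and the boolean drives the segment state, so no comment kind is merged with another.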

What happens to msgctxt during the conversion?

msgctxt is PO’s only disambiguation field, so it must drive both the XLIFF unit identity (the id/name) and a context marker; otherwise two entries that share a msgid but differ by context collapse onto one <source> and one translation overwrites the other. Prefix the unit id with the context value and verify with msgcmp that the round trip preserves the same number of entries.
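One way to build such an id is sketched below. It is an assumption-laden variant of the prefixing rule, not a converter's actual scheme: because the XLIFF 2.1 id attribute is an NMTOKEN, a raw msgid with spaces or braces cannot be used directly, so this sketch keeps a readable context prefix and appends a short hash of the context and msgid. The function name unit_id and the 12-character digest length are illustrative choices.

```python
import hashlib
import re
from typing import Optional

CONTEXT_SEPARATOR = "\x04"  # EOT, the separator gettext uses between msgctxt and msgid


def unit_id(msgid: str, msgctxt: Optional[str] = None) -> str:
    """Build an NMTOKEN-safe unit id that differs whenever msgctxt or msgid differs."""
    key = msgid if msgctxt is None else f"{msgctxt}{CONTEXT_SEPARATOR}{msgid}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]
    if msgctxt is None:
        return f"u.{digest}"
    # Readable prefix only; uniqueness comes from the digest, not the slug.
    prefix = re.sub(r"[^A-Za-z0-9_-]+", "-", msgctxt).strip("-") or "ctx"
    return f"{prefix}.{digest}"


print(unit_id("{count} item", "cart"))
print(unit_id("{count} item", "wishlist"))
print(unit_id("{count} item"))
```

The three calls produce three different ids for the same msgid, which is the property the round trip needs; the original msgid and msgctxt still have to be stored in the unit (for example in name and a note) so xliff2po can restore them.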
