Detecting Locale from the Accept-Language Header — Edge Cases

A naive Accept-Language parser that does header.split(',')[0].split('-')[0] returns the wrong locale the moment a real browser sends q-weights, a * wildcard, a q=0 rejection, a duplicate tag, or an empty header — and the bug only surfaces in production for users whose preferences differ from the developer’s. This page catalogs the specific edge cases that break header detection: quality-value ordering, the * wildcard, malformed and duplicate tags, region-only versus language-only ranges, case sensitivity, the empty-header case, and why RFC 4647 matching beats a raw split('-').

The header looks simple, just a comma-separated list, so it invites hand-rolled parsing. But each entry can carry a ;q= weight, the entries can arrive in any order even though they usually look sorted, and a single token may be a bare language (en), a region-qualified tag (en-GB), or the wildcard *. Treating it as a sorted list of language codes is the root of almost every locale-detection bug. The value you resolve here is the seed for the fallback chain resolver, so a wrong first guess propagates through the whole pipeline.

[Figure: Naive split versus BCP 47 aware parsing. The header fr-CH, fr;q=0.9, en;q=0.8, de;q=0 is fed to two parsers. The naive split(',')[0].split('-')[0] reads only the first entry: it ignores q-weights, the wildcard, and the de;q=0 rejection, and never checks the supported set. The BCP 47 / RFC 4647 parser sorts by descending q, drops de;q=0, range-matches against the supported list, and returns the supported tag fr.]
The same header, two parsers: the naive split never consults weights, rejections, or the supported list; the RFC 4647 parser returns a supported match.

Root cause analysis

Two specifications govern this header, and skipping either produces a class of bugs. RFC 9110 §12.5.4 (which obsoleted RFC 7231) defines Accept-Language as a comma-separated list of language ranges, each with an optional q quality value between 0 and 1 with up to three decimal places; an absent q defaults to 1. Preference is carried by the q-values, not by position in the list. Clients usually emit ranges in descending q-order, but the RFC notes that listing order cannot be relied upon, so you must sort by descending q yourself. A q=0 is an explicit rejection that must remove that range from consideration, not rank it last. And * is a wildcard range that matches any language the server offers, which is useful as a catch-all but dangerous if you treat it as a literal locale code.
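The weight rules above can be sketched as a small validator plus a ranking step. This is a minimal illustration, not a library API: parseWeight and rankRanges are hypothetical names, the regular expression follows the qvalue grammar in RFC 9110 §12.4.2, and treating an invalid weight as a rejection is an assumption (a lenient server might prefer to ignore the entry instead).

```javascript
// qvalue = ( "0" [ "." 0*3DIGIT ] ) / ( "1" [ "." 0*3("0") ] )   (RFC 9110 §12.4.2)
const QVALUE = /^(?:0(?:\.\d{0,3})?|1(?:\.0{0,3})?)$/;

// Hypothetical helper: takes the raw parameters that follow a range, e.g. ['q=0.8'].
function parseWeight(params) {
  for (const raw of params) {
    const [name, value = ''] = raw.split('=').map((s) => s.trim());
    if (name.toLowerCase() !== 'q') continue;        // skip anything that is not a weight
    return QVALUE.test(value) ? Number(value) : 0;   // assumption: invalid weight => rejected
  }
  return 1;                                          // absent q defaults to 1
}

// Hypothetical helper: splits the header and orders ranges by descending weight.
function rankRanges(header) {
  return header
    .split(',')
    .map((entry) => {
      const [range, ...params] = entry.split(';');
      return { range: range.trim(), q: parseWeight(params) };
    })
    .filter((r) => r.range !== '')
    .sort((a, b) => b.q - a.q);                      // sort is stable, so ties keep wire order
}

console.log(rankRanges('en;q=0.8, fr-CH;q=0.9, de;q=0').map((r) => r.range));
// ranges in order: fr-CH, en, de
```

Entries with a weight of 0 are left in the output here only so the caller can see them; they are rejections and have to be removed before any matching happens.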

The second spec is RFC 4647, which defines how a range like en matches a tag like en-US. Its “basic filtering” says a range matches a tag if the tag equals the range or begins with the range followed by -. This is why split('-')[0] is wrong in both directions: it strips region from the tag (losing en-GB vs en-US) yet never performs the prefix match RFC 4647 requires, so a range of zh would not correctly match a supported zh-Hans tag. Language subtags are also case-insensitive (BCP 47 / RFC 5646 §2.1.1 recommends en-US casing only as a convention), so a raw string compare against EN-us silently fails. Finally, the header can be empty or absent entirely — many bots, curl without flags, and privacy tools send no Accept-Language — and a parser that assumes a non-empty string throws or returns undefined instead of cleanly hitting the fallback.
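The basic filtering rule is short enough to write out directly. The sketch below is one way to express it; matchesRange and filterTags are hypothetical helper names, and the supported list is an assumed example.

```javascript
// RFC 4647 basic filtering: a range matches a tag when, ignoring case,
// the tag equals the range or begins with the range followed by '-'.
function matchesRange(range, tag) {
  const r = range.toLowerCase();
  const t = tag.toLowerCase();
  if (r === '*') return true;               // the wildcard range matches every tag
  return t === r || t.startsWith(r + '-');  // prefix must end on a subtag boundary
}

// Returns every supported tag the range matches, in the server's own order.
function filterTags(range, supported) {
  return supported.filter((tag) => matchesRange(range, tag));
}

const exampleTags = ['en', 'en-GB', 'fr', 'zh-Hans']; // assumed supported set

console.log(filterTags('zh', exampleTags));    // [ 'zh-Hans' ]
console.log(filterTags('EN', exampleTags));    // [ 'en', 'en-GB' ]
console.log(filterTags('en-US', exampleTags)); // []
console.log(filterTags('e', exampleTags));     // []  ('e' is not a whole subtag)
```

The last two calls show what the prefix rule deliberately refuses to do: a range never matches a shorter tag, and a partial subtag never matches at all.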

Minimal reproducible example

This parser looks reasonable and passes a smoke test against the developer’s own browser, but breaks on every edge case above:

// BROKEN: assumes a sorted list, ignores q, wildcard, q=0, case, empty header
function detectLocale(header) {
  return header.split(',')[0].split('-')[0].trim();
}

detectLocale('fr-CH, fr;q=0.9, en;q=0.8, de;q=0'); // 'fr'  (lucky — looks sorted)
detectLocale('en;q=0.8, fr-CH;q=0.9');             // 'en;q=0.8'  WRONG: not a tag, and fr-CH has higher q
detectLocale('*');                                  // '*'   WRONG: wildcard treated as locale
detectLocale('EN-US');                              // 'EN'  WRONG: wrong case, never matches 'en'
detectLocale('');                                   // ''    WRONG: should hit fallback
detectLocale('en, en;q=0.5');                       // 'en'  duplicate tag, weight ignored

The second call is the giveaway: fr-CH;q=0.9 outranks en;q=0.8, but the split only ever looks at the first entry. It never parses the weight, so it returns en;q=0.8, parameter and all, which matches no supported locale. Real Safari and Firefox builds emit lists in q-order most of the time, which is exactly why this bug ships: it works until a client sends an unsorted or wildcard-bearing header.

The fix — annotated parser

Parse into structured ranges, sort by q, drop rejections, then match against your supported set with RFC 4647 basic filtering. In production prefer the accepts/negotiator libraries (the same ones Express locale negotiation uses), but this annotated version shows exactly what they do:

const SUPPORTED = ['en', 'en-GB', 'fr', 'zh-Hans'];
const FALLBACK = 'en';

function parseAcceptLanguage(header) {
  if (!header || !header.trim()) return []; // empty / absent header -> no preferences
  return header
    .split(',')
    .map((part) => {
      const [range, ...params] = part.trim().split(';');
      // q defaults to 1 when omitted; coerce a malformed or out-of-range q to 0 so it is rejected
      const qParam = params.find((p) => p.trim().toLowerCase().startsWith('q='));
      let q = qParam ? parseFloat(qParam.split('=')[1]) : 1;
      if (Number.isNaN(q) || q < 0 || q > 1) q = 0; // invalid q => treat as rejected
      return { range: range.trim().toLowerCase(), q }; // case-fold the range
    })
    .filter((r) => r.range && r.q > 0)        // drop empty tokens and q=0 rejections
    .sort((a, b) => b.q - a.q);               // SERVER must sort; clients are not trusted to
}

function detectLocale(header) {
  const ranges = parseAcceptLanguage(header);
  const supportedLower = SUPPORTED.map((s) => s.toLowerCase());
  for (const { range } of ranges) {
    if (range === '*') return SUPPORTED[0];   // wildcard -> server's top offering
    // RFC 4647 basic filtering: tag equals range OR starts with range + '-'
    const hit = supportedLower.findIndex(
      (tag) => tag === range || tag.startsWith(range + '-')
    );
    if (hit !== -1) return SUPPORTED[hit];     // return canonically-cased supported tag
  }
  return FALLBACK;                             // deterministic last resort
}

Three lines carry the fix. The .sort((a, b) => b.q - a.q) enforces that the server orders by quality rather than trusting the wire order. The r.q > 0 filter turns de;q=0 and any malformed weight into a true rejection. And the tag === range || tag.startsWith(range + '-') check is RFC 4647 basic filtering, so a range of zh-hans matches the supported zh-Hans tag and a bare en matches en-GB only by the prefix rule, never by a lossy split('-').
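One case the parser above does not cover: because q=0 ranges are filtered out before matching, a header such as en;q=0, *;q=0.5 lets the wildcard hand back SUPPORTED[0], which is the very language the client rejected. The sketch below is one possible way to close that gap by scoring each supported tag instead of walking the ranges. It rests on an assumption worth stating loudly: the most specific (longest) matching range governs a tag, which is the rule RFC 2616 §14.4 spelled out and which RFC 9110 no longer states explicitly. OFFERED, DEFAULT_LOCALE, and all function names are hypothetical.

```javascript
const OFFERED = ['en', 'en-GB', 'fr', 'zh-Hans']; // assumed supported set
const DEFAULT_LOCALE = 'en';                       // assumed fallback

// RFC 4647 basic filtering on already lowercased strings.
const covers = (range, tag) => tag === range || tag.startsWith(range + '-');

function parseEntries(header) {
  return String(header ?? '')
    .split(',')
    .map((part, index) => {
      const [range, ...params] = part.split(';').map((s) => s.trim());
      const qParam = params.find((p) => p.toLowerCase().startsWith('q='));
      const q = qParam ? Number(qParam.slice(2)) : 1;
      // Assumption: non-numeric or out-of-range weights count as rejections.
      return { range: range.toLowerCase(), q: q >= 0 && q <= 1 ? q : 0, index };
    })
    .filter((e) => e.range !== '');
}

// Finds the most specific header entry that applies to a tag; '*' is least specific.
function governingEntry(tag, entries) {
  let best = null;
  let bestLength = -1;
  for (const e of entries) {
    const length = e.range === '*' ? 0 : e.range.length;
    const applies = e.range === '*' || covers(e.range, tag.toLowerCase());
    if (applies && length > bestLength) {
      best = e;
      bestLength = length;
    }
  }
  return best; // null when the header says nothing about this tag
}

function detectLocaleStrict(header) {
  const entries = parseEntries(header);
  const candidates = OFFERED
    .map((tag) => ({ tag, entry: governingEntry(tag, entries) }))
    .filter((c) => c.entry && c.entry.q > 0)       // q=0 removes the tag outright
    .sort((a, b) => b.entry.q - a.entry.q || a.entry.index - b.entry.index);
  return candidates.length > 0 ? candidates[0].tag : DEFAULT_LOCALE;
}

console.log(detectLocaleStrict('en;q=0, *;q=0.5'));     // 'fr'
console.log(detectLocaleStrict('en-GB;q=0.8, en;q=0')); // 'en-GB'
```

With this shape, a rejection of en also covers en-GB unless the header names en-GB separately, and the wildcard only ever fills in tags the client did not mention.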

Verification snippet

Assert each edge case directly so a future “simplification” cannot quietly reintroduce the naive split:

const assert = require('node:assert/strict'); // strict mode: equal() compares with ===

// q-ordering: higher q wins even when listed second
assert.equal(detectLocale('en;q=0.8, fr;q=0.9'), 'fr');
// q=0 is a rejection, not a low rank: de removed, en chosen
assert.equal(detectLocale('de;q=0, en;q=0.5'), 'en');
// wildcard maps to the server's top supported locale, not literal '*'
assert.equal(detectLocale('*'), 'en');
// region vs language: en-GB range matches the supported en-GB tag exactly
assert.equal(detectLocale('en-GB'), 'en-GB');
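// case-insensitive: ZH-HANS folds to zh-hans and returns the canonically cased tag
assert.equal(detectLocale('ZH-HANS'), 'zh-Hans');
// no match: EN-US folds to en-us, which no supported tag equals or extends, so the fallback applies
assert.equal(detectLocale('EN-US'), 'en');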
// empty and absent headers hit the deterministic fallback
assert.equal(detectLocale(''), 'en');
assert.equal(detectLocale(undefined), 'en');
// malformed q is treated as rejected, not q=1
assert.equal(detectLocale('fr;q=banana, en;q=0.4'), 'en');
console.log('all Accept-Language edge cases pass');

Put the parser and these assertions in the same file and run it with node test-accept-language.js; every assertion above covers one of the failure modes the naive parser silently gets wrong.

When to escalate

This parser resolves the header tier correctly, but the header is only one signal. If a user clicked a language switcher or hit a /de/ URL, those deliberate choices must outrank the header — that precedence belongs in the broader Locale Negotiation Strategies ordering, not in the parser. If you need en-CH to gracefully degrade to en, or a region-qualified tag with no exact match to collapse to its base language, that lookahead logic lives in the fallback chain layer rather than here. And if you serve responses from a CDN, the negotiated result must be reflected in a Vary: Accept-Language header or the edge cache will hand one visitor’s locale to everyone.
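As a rough sketch of where those two concerns sit outside the parser: resolveLocale and withVary below are hypothetical helpers, the signal names are invented for illustration, and ranking the switcher choice above the URL is an assumption (the text only requires that both outrank the header).

```javascript
// Deliberate signals first, header-derived locale last before the fallback.
function resolveLocale({ userChoice, urlLocale, headerLocale, fallback = 'en' }) {
  return userChoice || urlLocale || headerLocale || fallback;
}

// Merges a field name into an existing Vary value without duplicating it.
function withVary(existing, field) {
  const fields = (existing || '')
    .split(',')
    .map((f) => f.trim())
    .filter(Boolean);
  if (fields.includes('*')) return '*';   // 'Vary: *' already covers everything
  const present = fields.some((f) => f.toLowerCase() === field.toLowerCase());
  return present ? fields.join(', ') : [...fields, field].join(', ');
}

console.log(resolveLocale({ urlLocale: 'de', headerLocale: 'fr' })); // 'de'
console.log(withVary('Accept-Encoding', 'Accept-Language'));
// 'Accept-Encoding, Accept-Language'
```

In a Node handler this could be applied as res.setHeader('Vary', withVary(String(res.getHeader('Vary') ?? ''), 'Accept-Language')), assuming the response may already carry a Vary value set by other middleware.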

FAQ

Is the Accept-Language header already sorted by preference?

No — RFC 9110 requires the server to interpret the q-values, and only conventionally do clients emit ranges in descending q-order. Many real headers arrive sorted, which is exactly why a parser that trusts the wire order passes testing and then fails on the first unsorted or wildcard-bearing request. Always sort by descending q yourself before picking a match.

What does a q=0 quality value mean?

A q=0 is an explicit rejection: the client is stating it does not want that language. It must remove the range from consideration entirely, not merely rank it last. So de;q=0, en means “English, and never German”; a parser that keeps de as a low-priority option can still wrongly serve German when nothing else matches.

Why is split('-') wrong for matching region tags?

Splitting on - discards the region from the tag (collapsing en-GB and en-US to en) yet never performs the prefix match RFC 4647 requires, so a range like zh would not match a supported zh-Hans tag. Use basic filtering instead: a range matches a tag when the tag equals the range or begins with the range followed by a hyphen.

Part of Locale Negotiation Strategies.