Our 8K Khmer Tokenizer Beats Every Multilingual LLM Tokenizer — Here's the Data

We benchmarked Angkor SPM v3f against SeaLLM v2/v2.5/v3 and HY-MT 1.5 on 2,009 FLORES-200 sentences. An 8K-vocab Khmer-specialized tokenizer produces 1.5x to 6x fewer tokens than tokenizers with vocabularies up to 32x larger.

March 24, 2026


Every time you send Khmer text to an LLM, you're paying a hidden tax.

Multilingual tokenizers — the ones inside GPT, Llama, Mistral, and every major LLM — were trained on web-crawl data where Khmer represents less than 0.1% of text. The result: they fragment Khmer script into tiny pieces, consuming 2x to 9x more tokens than the equivalent English text. Same meaning, same information, but Khmer users pay more per API call, fit less text in the context window, and get worse performance.

We built a tokenizer that fixes this. And we have the data to prove it.

The Benchmark

We tested five tokenizers on the FLORES-200 Khmer-English parallel dataset — 2,009 sentence pairs used as a standard benchmark in multilingual NLP research.

Tokenizers Under Test

| Tokenizer | Organization | Vocab Size | Type | Design |
|---|---|---|---|---|
| Angkor SPM v3f | Angkor Intelligence | 8,000 | SentencePiece Unigram | Khmer-specialized |
| SeaLLM v2 | DAMO-NLP-SG | 48,384 | BPE (NLLB-extended) | SEA multilingual |
| SeaLLM v2.5 | DAMO-NLP-SG | 256,000 | BPE | SEA multilingual |
| SeaLLM v3 | DAMO-NLP-SG | 151,643 | Byte-level BPE | SEA multilingual |
| HY-MT 1.5 | Tencent | 120,000 | BPE | Machine translation |

The comparison is deliberately unfair: a purpose-built 8K-vocab tokenizer against tokenizers with vocabularies 6x to 32x larger. The question is whether vocabulary size or language-specific training matters more.
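At its core, the benchmark reduces to counting tokens per sentence for each tokenizer over the same corpus. A minimal sketch of that harness, using stand-in tokenizer callables (a real run would wrap the Angkor SentencePiece model and the Hugging Face tokenizers instead):

```python
from statistics import mean

def benchmark(tokenize, sentences):
    """Return (mean tokens per sentence, total tokens) for one tokenizer."""
    counts = [len(tokenize(s)) for s in sentences]
    return mean(counts), sum(counts)

# Stand-in tokenizers for illustration only: a coarse word-level split
# and a worst-case character-level split.
word_level = str.split   # one token per whitespace-separated word
char_level = list        # one token per character

sentences = ["the quick brown fox", "jumps over the lazy dog"]
print(benchmark(word_level, sentences))   # few tokens per sentence
print(benchmark(char_level, sentences))   # many more tokens per sentence
```

The same two numbers (mean tokens per sentence, total tokens) are what the efficiency table below reports for each real tokenizer.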

Results

Token Efficiency

| Tokenizer | Mean Tokens/Sentence | vs Angkor | Total Tokens (2,009 sentences) |
|---|---|---|---|
| Angkor SPM v3f | 37.4 | baseline | 75,207 |
| SeaLLM v2 | 55.3 | 1.48x | 111,142 |
| SeaLLM v2.5 | 133.5 | 3.56x | 268,108 |
| SeaLLM v3 | 166.7 | 4.45x | 334,831 |
| HY-MT 1.5 | 222.6 | 5.95x | 447,237 |

Angkor SPM v3f wins on 98.2% of sentences (1,973 out of 2,009). The 1.8% where SeaLLM v2 wins are sentences containing Latin-script content, where its Mistral-derived English BPE vocabulary has an inherent advantage.
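The win rate is a straightforward per-sentence comparison of token counts. A sketch with hypothetical counts (not the benchmark data):

```python
def win_rate(ours, theirs):
    """Fraction of sentences where `ours` uses strictly fewer tokens."""
    wins = sum(a < b for a, b in zip(ours, theirs))
    return wins / len(ours)

# Hypothetical per-sentence token counts, for illustration only.
angkor  = [30, 42, 35, 28, 90]
seallm2 = [55, 60, 50, 41, 80]   # "wins" on the Latin-heavy last sentence
print(win_rate(angkor, seallm2))  # 0.8
```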

The Khmer/English Compression Ratio

This is the metric that matters most. Using FLORES parallel translations, we measured how many Khmer tokens each tokenizer produces relative to English tokens for the same content. A ratio of 1.0 means equal treatment. Above 1.0 means Khmer is penalized.

| Tokenizer | KM/EN Ratio | What It Means |
|---|---|---|
| Angkor SPM v3f | 0.42x | Khmer encoded MORE efficiently than English |
| SeaLLM v2 | 1.92x | Khmer costs ~2x more than English |
| SeaLLM v2.5 | 5.16x | Khmer costs ~5x more than English |
| SeaLLM v3 | 6.32x | Khmer costs ~6x more than English |
| HY-MT 1.5 | 8.55x | Khmer costs ~9x more than English |

For context: the SeaLLM ACL 2024 paper reported that stock Llama-2 has a 12.14x Khmer/English ratio. SeaLLM's NLLB-extended tokenizer brought it down to 2.67x. Angkor SPM achieves 0.42x — a 29x improvement over stock Llama-2, and 4.6x better than SeaLLM's best published result.
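Under the totals-based definition assumed here (total Khmer tokens divided by total English tokens over the parallel corpus), the ratio is one line of arithmetic; the counts below are toy values, not FLORES figures:

```python
def km_en_ratio(km_counts, en_counts):
    """Khmer/English token ratio over a parallel corpus:
    total Khmer tokens divided by total English tokens."""
    return sum(km_counts) / sum(en_counts)

# Toy per-sentence token counts for two parallel sentence pairs.
km = [42, 38]
en = [100, 100]
print(km_en_ratio(km, en))  # 0.4: Khmer cheaper than English here
```

A per-sentence average of ratios is another plausible definition; it would weight short sentences more heavily than the totals-based version shown here.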

Tokens Per Character

Raw character-level density. Values above 1.0 mean the tokenizer produces more tokens than input characters — effectively worse than character-level encoding.

| Tokenizer | TPC | Chars/Token | Assessment |
|---|---|---|---|
| Angkor SPM v3f | 0.263 | 3.80 | Excellent — multi-char tokens |
| SeaLLM v2 | 0.391 | 2.56 | Good — sub-word level |
| SeaLLM v2.5 | 0.944 | 1.06 | Poor — near character-level |
| SeaLLM v3 | 1.180 | 0.85 | Bad — worse than character-level |
| HY-MT 1.5 | 1.578 | 0.63 | Bad — 1.6 tokens per character |

SeaLLM v3 and HY-MT 1.5 literally produce more tokens than there are characters in the input. They are worse than no tokenizer at all.
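Both columns in the table derive from the same two corpus-level counts, so each is the reciprocal of the other. A sketch with toy figures chosen to land near the Angkor row:

```python
def tokens_per_char(n_tokens, n_chars):
    """TPC: below 1.0 means tokens span multiple characters (good)."""
    return n_tokens / n_chars

def chars_per_token(n_tokens, n_chars):
    """Reciprocal view: average characters covered by each token."""
    return n_chars / n_tokens

# Toy counts: 263 tokens covering 1,000 characters.
print(tokens_per_char(263, 1000))   # 0.263
print(chars_per_token(263, 1000))   # ~3.80
```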

What This Costs You

API Costs

Estimated cost processing 1 million Khmer characters at $0.01/1K tokens:

| Tokenizer | Est. Tokens | Est. Cost | Cost Multiplier |
|---|---|---|---|
| Angkor SPM v3f | 262,900 | $2.63 | baseline |
| SeaLLM v2 | 391,200 | $3.91 | 1.5x |
| SeaLLM v2.5 | 944,400 | $9.44 | 3.6x |
| SeaLLM v3 | 1,179,800 | $11.80 | 4.5x |
| HY-MT 1.5 | 1,578,100 | $15.78 | 6.0x |
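The cost column is simple arithmetic on the TPC values above. This sketch uses 0.2629 for Angkor's TPC, the unrounded value implied by the 262,900-token estimate:

```python
PRICE_PER_1K = 0.01  # dollars per 1,000 tokens, as in the table above

def est_cost(n_chars, tpc, price_per_1k=PRICE_PER_1K):
    """Estimated token count and dollar cost for n_chars of text."""
    tokens = n_chars * tpc
    return tokens, tokens / 1000 * price_per_1k

tokens, cost = est_cost(1_000_000, 0.2629)  # 1M chars at Angkor's measured TPC
print(round(tokens), round(cost, 2))
```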

Context Window

How many Khmer characters fit in a 4,096-token context window:

| Tokenizer | Max Characters | Relative Capacity |
|---|---|---|
| Angkor SPM v3f | 15,580 | 100% |
| SeaLLM v2 | 10,470 | 67% |
| SeaLLM v2.5 | 4,337 | 28% |
| SeaLLM v3 | 3,471 | 22% |
| HY-MT 1.5 | 2,595 | 17% |

A typical FLORES sentence uses 34 tokens with Angkor vs 213 with HY-MT. With a 4,096-token context window, Angkor fits ~120 Khmer sentences; HY-MT fits ~19.
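Context-window capacity is the same arithmetic inverted: divide the window size by tokens-per-character. A sketch reproducing the table's two endpoints (using 0.2629 as Angkor's unrounded TPC, as in the cost estimate):

```python
def max_chars(context_tokens, tpc):
    """How many characters fit in a context window at a given TPC."""
    return int(context_tokens / tpc)

print(max_chars(4096, 0.2629))  # Angkor SPM v3f
print(max_chars(4096, 1.578))   # HY-MT 1.5
```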

Speed

| Tokenizer | Avg Time/Sentence | vs Angkor |
|---|---|---|
| Angkor SPM v3f | 0.055 ms | fastest |
| SeaLLM v2 | 0.121 ms | 2.2x slower |
| SeaLLM v2.5 | 0.148 ms | 2.7x slower |
| SeaLLM v3 | 0.220 ms | 4.0x slower |
| HY-MT 1.5 | 0.213 ms | 3.9x slower |
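Per-sentence timing of this kind is just averaged wall-clock measurement. A minimal sketch, again with a stand-in tokenizer rather than the real models:

```python
import time

def avg_time_ms(tokenize, sentences, repeats=100):
    """Average wall-clock tokenization time per sentence, in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        for s in sentences:
            tokenize(s)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(sentences)) * 1000

sentences = ["a short test sentence"] * 10
print(avg_time_ms(str.split, sentences))  # stand-in tokenizer
```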

Why Bigger Vocab Doesn't Mean Better

The results show vocabulary size is inversely correlated with Khmer performance. Four reasons:

1. Vocabulary allocation. A 256K-vocab tokenizer covering 200+ languages may allocate only 500-2,000 tokens to Khmer. Angkor SPM dedicates all 8,000 tokens to Khmer, achieving denser coverage of Khmer morphemes and word patterns.

2. Training data composition. Multilingual tokenizers are trained on web-crawl data where Khmer represents less than 0.1% of text. BPE merge operations optimize for high-resource languages first; Khmer patterns get merged late or not at all, resulting in character-level or byte-level fragmentation.

3. Unigram vs BPE for unsegmented scripts. SentencePiece Unigram selects tokens by maximizing likelihood over the corpus, naturally capturing frequent multi-character Khmer morphemes. BPE's bottom-up merging can miss linguistically meaningful boundaries in scripts written without explicit word separators, as Khmer is.

4. SeaLLM v2's exception. SeaLLM v2 outperforms v2.5 and v3 despite a smaller vocab because it explicitly extended Mistral's tokenizer with ~16K NLLB tokens for SEA languages. This deliberate vocabulary investment is exactly what the other tokenizers lack.

Statistical Rigor

All comparisons use the Mann-Whitney U test (two-sided) on per-sentence token counts:

| Comparison | p-value | Significant? |
|---|---|---|
| Angkor vs SeaLLM v2 | 1.86 × 10^-194 | YES |
| Angkor vs SeaLLM v2.5 | < 10^-300 | YES |
| Angkor vs SeaLLM v3 | < 10^-300 | YES |
| Angkor vs HY-MT 1.5 | < 10^-300 | YES |

With n=2,009, all differences are statistically significant at extreme confidence levels.
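A Mann-Whitney U test of this shape can be reproduced with SciPy. The samples below are synthetic stand-ins for the per-sentence token counts, not the benchmark data; with distributions this well separated at n=2,009, the p-value is vanishingly small:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Synthetic per-sentence token counts with means/spreads loosely inspired
# by the Angkor and HY-MT rows above.
angkor = rng.normal(37.4, 10.0, size=2009)
hymt   = rng.normal(222.6, 60.0, size=2009)

stat, p = mannwhitneyu(angkor, hymt, alternative="two-sided")
print(p)
```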

Limitations

We want to be transparent about what this benchmark does and doesn't show:

  1. Corpus scope. FLORES-200 is news/Wikipedia-style text. Results may differ on conversational, social media, or domain-specific content.
  2. Round-trip fidelity is 99.8%, not 100%. Four sentences fail lossless reconstruction — edge cases that need investigation.
  3. The 0.42x compression ratio is partly asymmetric. Angkor SPM tokenizes English poorly (183K tokens for English FLORES vs ~53K for multilingual tokenizers). The ratio reflects both excellent Khmer performance AND poor English handling.
  4. SeaLLM v1 was not tested because the Llama-2-based model is gated on HuggingFace.

What This Means

For developers building Khmer applications: a specialized tokenizer cuts token counts, and thus API costs, by 1.5x to 6x, fits up to 6x more Khmer text in your context window, and tokenizes 2x to 4x faster.

For the broader NLP ecosystem: vocabulary size is not a proxy for language coverage quality. Low-resource languages need purpose-built tokenizers, not bigger multilingual ones. The "token tax" on languages like Khmer, Lao, Myanmar, and other Southeast Asian scripts is a real cost that affects real users.

The tokenizer model is 191KB. It runs offline. It fits in a SQLite file alongside a complete predictive keyboard engine in 19MB. We've deployed it at angkor-intelligence.com/labs where you can try the tokenizer playground and the Khmer predictive keyboard that runs on top of it.


The full benchmark report with detailed methodology is available as PDF. For SDK integration or research collaboration, reach out at nicolasdelrieu.services@gmail.com.
