Our 8K Khmer Tokenizer Beats Every Multilingual LLM Tokenizer
Every time you send Khmer text to an LLM, you're paying a hidden tax.
Multilingual tokenizers — the ones inside GPT, Llama, Mistral, and every major LLM — were trained on web-crawl data where Khmer represents less than 0.1% of text. The result: they fragment Khmer script into tiny pieces, consuming 2x to 9x more tokens than the equivalent English text. Same meaning, same information, but Khmer users pay more per API call, fit less text in the context window, and get worse performance.
We built a tokenizer that fixes this. And we have the data to prove it.
The Benchmark
We tested five tokenizers on the FLORES-200 Khmer-English parallel dataset — 2,009 sentence pairs used as a standard benchmark in multilingual NLP research.
Tokenizers Under Test
| Tokenizer | Organization | Vocab Size | Type | Design |
|---|---|---|---|---|
| Angkor SPM v3f | Angkor Intelligence | 8,000 | SP Unigram | Khmer-specialized |
| SeaLLM v2 | DAMO-NLP-SG | 48,384 | BPE (NLLB-ext) | SEA multilingual |
| SeaLLM v2.5 | DAMO-NLP-SG | 256,000 | BPE | SEA multilingual |
| SeaLLM v3 | DAMO-NLP-SG | 151,643 | Byte-level BPE | SEA multilingual |
| HY-MT 1.5 | Tencent | 120,000 | BPE | Machine translation |
The comparison is deliberately unfair: a purpose-built 8K-vocab tokenizer against models with vocabularies 6x to 32x larger. The question is whether vocabulary size or language-specific training matters more.
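The harness itself is simple: encode every sentence with every tokenizer and record per-sentence token counts. A minimal sketch, with a whitespace-split stub standing in for a real `encode()` call (loading the actual Angkor SPM or SeaLLM tokenizers via `sentencepiece` or `transformers` is assumed to happen elsewhere):

```python
from statistics import mean

def benchmark(tokenizers, sentences):
    """Per-tokenizer mean tokens/sentence and total token counts."""
    results = {}
    for name, encode in tokenizers.items():
        counts = [len(encode(s)) for s in sentences]
        results[name] = {
            "mean": mean(counts),
            "total": sum(counts),
            "counts": counts,  # kept for the per-sentence significance tests
        }
    return results

# stub encoder: whitespace split stands in for a real tokenizer's encode()
sentences = ["the quick brown fox", "hello world"]
stats = benchmark({"whitespace": str.split}, sentences)
assert stats["whitespace"]["total"] == 6 and stats["whitespace"]["mean"] == 3
```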
Results
Token Efficiency
| Tokenizer | Mean Tokens/Sentence | vs Angkor | Total Tokens (2,009 sentences) |
|---|---|---|---|
| Angkor SPM v3f | 37.4 | baseline | 75,207 |
| SeaLLM v2 | 55.3 | 1.48x | 111,142 |
| SeaLLM v2.5 | 133.5 | 3.56x | 268,108 |
| SeaLLM v3 | 166.7 | 4.45x | 334,831 |
| HY-MT 1.5 | 222.6 | 5.95x | 447,237 |
Angkor SPM v3f wins on 98.2% of sentences (1,973 out of 2,009). The 1.8% where SeaLLM v2 wins are sentences containing Latin-script content, where Mistral's English BPE vocabulary has an inherent advantage.
The Khmer/English Compression Ratio
This is the metric that matters most. Using FLORES parallel translations, we measured how many Khmer tokens each tokenizer produces relative to English tokens for the same content. A ratio of 1.0 means equal treatment. Above 1.0 means Khmer is penalized.
| Tokenizer | KM/EN Ratio | What It Means |
|---|---|---|
| Angkor SPM v3f | 0.42x | Khmer encoded MORE efficiently than English |
| SeaLLM v2 | 1.92x | Khmer costs ~2x more than English |
| SeaLLM v2.5 | 5.16x | Khmer costs ~5x more than English |
| SeaLLM v3 | 6.32x | Khmer costs ~6x more than English |
| HY-MT 1.5 | 8.55x | Khmer costs ~9x more than English |
For context: the SeaLLM ACL 2024 paper reported a 12.14x Khmer/English ratio for stock Llama-2, which SeaLLM's NLLB-extended tokenizer brought down to 2.67x. Angkor SPM achieves 0.42x: a 29x improvement over stock Llama-2, and 4.6x better than the 1.92x we measured for SeaLLM v2 in this benchmark.
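The ratio itself is just the aggregate token count over the Khmer half of the parallel corpus divided by the count over the English half. A sketch, using a character-level stub in place of a real encoder:

```python
def km_en_ratio(encode, pairs):
    """Corpus-level Khmer/English token ratio over parallel sentence pairs.
    1.0 means equal treatment; above 1.0 means Khmer is penalized."""
    km_tokens = sum(len(encode(km)) for km, _ in pairs)
    en_tokens = sum(len(encode(en)) for _, en in pairs)
    return km_tokens / en_tokens

# stub: one token per character, so the ratio reduces to relative text length
pairs = [("ខ្មែរ", "Khmer")]  # 5 code points each
assert km_en_ratio(list, pairs) == 1.0
```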
Tokens Per Character
Raw character-level density. Values above 1.0 mean the tokenizer produces more tokens than input characters — effectively worse than character-level encoding.
| Tokenizer | TPC | Chars/Token | Assessment |
|---|---|---|---|
| Angkor SPM v3f | 0.263 | 3.80 | Excellent — multi-char tokens |
| SeaLLM v2 | 0.391 | 2.56 | Good — sub-word level |
| SeaLLM v2.5 | 0.944 | 1.06 | Poor — near character-level |
| SeaLLM v3 | 1.180 | 0.85 | Bad — worse than character-level |
| HY-MT 1.5 | 1.578 | 0.63 | Bad — 1.6 tokens per character |
SeaLLM v3 and HY-MT 1.5 produce more tokens than there are characters in the input. For Khmer, they perform worse than simply treating every character as its own token.
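Tokens-per-character is total tokens divided by total input characters, and the byte-level failure mode is easy to reproduce. A sketch where raw UTF-8 bytes stand in for a byte-level BPE's worst case (no learned merges for Khmer):

```python
def tokens_per_char(encode, sentences):
    """Total tokens divided by total input characters (code points)."""
    tokens = sum(len(encode(s)) for s in sentences)
    chars = sum(len(s) for s in sentences)
    return tokens / chars

# worst case for byte-level BPE with no Khmer merges: every Khmer code point
# occupies 3 bytes in UTF-8, so TPC climbs to 3.0
assert tokens_per_char(lambda s: s.encode("utf-8"), ["ខ្មែរ"]) == 3.0
```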
What This Costs You
API Costs
Estimated cost processing 1 million Khmer characters at $0.01/1K tokens:
| Tokenizer | Est. Tokens | Est. Cost | Cost Multiplier |
|---|---|---|---|
| Angkor SPM v3f | 262,900 | $2.63 | baseline |
| SeaLLM v2 | 391,200 | $3.91 | 1.5x |
| SeaLLM v2.5 | 944,400 | $9.44 | 3.6x |
| SeaLLM v3 | 1,179,800 | $11.80 | 4.5x |
| HY-MT 1.5 | 1,578,100 | $15.78 | 6.0x |
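The cost rows follow directly from the TPC values in the previous table; a sketch of the arithmetic (`est_cost_usd` is our name for illustration, not part of any SDK):

```python
def est_cost_usd(n_chars, tokens_per_char, usd_per_1k_tokens=0.01):
    """Estimated API cost for a batch of text at a per-token price."""
    return n_chars * tokens_per_char * usd_per_1k_tokens / 1000

# reproduces the first and last table rows from the measured TPC values
assert round(est_cost_usd(1_000_000, 0.263), 2) == 2.63
assert round(est_cost_usd(1_000_000, 1.578), 2) == 15.78
```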
Context Window
How many Khmer characters fit in a 4,096-token context window:
| Tokenizer | Max Characters | Relative Capacity |
|---|---|---|
| Angkor SPM v3f | 15,580 | 100% |
| SeaLLM v2 | 10,470 | 67% |
| SeaLLM v2.5 | 4,337 | 28% |
| SeaLLM v3 | 3,471 | 22% |
| HY-MT 1.5 | 2,595 | 17% |
A typical FLORES sentence uses 34 tokens with Angkor vs 213 with HY-MT. With a 4,096-token context window, Angkor fits ~120 Khmer sentences; HY-MT fits ~19.
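Context capacity is the inverse calculation: token budget divided by tokens-per-character. A sketch:

```python
def max_chars_in_window(window_tokens, tokens_per_char):
    """How many input characters fit in a fixed token budget."""
    return int(window_tokens / tokens_per_char)

# HY-MT 1.5 at TPC 1.578 in a 4,096-token window
assert max_chars_in_window(4096, 1.578) == 2595
# roughly 6x more Khmer text fits with the specialized tokenizer
assert max_chars_in_window(4096, 0.263) // max_chars_in_window(4096, 1.578) == 6
```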
Speed
| Tokenizer | Avg Time/Sentence | vs Angkor |
|---|---|---|
| Angkor SPM v3f | 0.055 ms | fastest |
| SeaLLM v2 | 0.121 ms | 2.2x slower |
| SeaLLM v2.5 | 0.148 ms | 2.7x slower |
| SeaLLM v3 | 0.220 ms | 4.0x slower |
| HY-MT 1.5 | 0.213 ms | 3.9x slower |
Why Bigger Vocab Doesn't Mean Better
The results show that, among these tokenizers, vocabulary size is inversely correlated with Khmer performance. Three reasons, plus one exception that proves the rule:
1. Vocabulary allocation. A 256K-vocab tokenizer covering 200+ languages may allocate only 500-2,000 tokens to Khmer. Angkor SPM dedicates all 8,000 tokens to Khmer, achieving denser coverage of Khmer morphemes and word patterns.
2. Training data composition. Multilingual tokenizers are trained on web-crawl data where Khmer represents less than 0.1% of text. BPE merge operations optimize for high-resource languages first; Khmer patterns get merged late or not at all, resulting in character-level or byte-level fragmentation.
3. Unigram vs BPE for agglutinative scripts. SentencePiece Unigram selects tokens by maximizing likelihood over the corpus, naturally capturing frequent multi-character Khmer morphemes. BPE's bottom-up merging can miss linguistically meaningful boundaries in scripts without explicit word separators.
4. SeaLLM v2's exception. SeaLLM v2 outperforms v2.5 and v3 despite a smaller vocab because it explicitly extended Mistral's tokenizer with ~16K NLLB tokens for SEA languages. This deliberate vocabulary investment is exactly what the other tokenizers lack.
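The merge starvation in point 2 can be demonstrated with a toy BPE trainer (a pedagogical sketch, not any production implementation): feed it a corpus where English outweighs Khmer a few hundred to one, and every early merge slot is spent on English pairs while Khmer stays at character level.

```python
from collections import Counter

def train_bpe_merges(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.
    corpus is a list of (word, frequency) pairs."""
    vocab = {tuple(word): freq for word, freq in corpus}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] = new_vocab.get(tuple(out), 0) + freq
        vocab = new_vocab
    return merges

# web-crawl-like skew: English dominates, Khmer is a rounding error
corpus = [("the", 900), ("token", 700), ("tokens", 500), ("text", 400),
          ("ភាសា", 3), ("ខ្មែរ", 2)]
merges = train_bpe_merges(corpus, 8)
# every merge slot goes to an English pair; Khmer (U+1780+) never merges
assert all(ord(a[0]) < 0x1780 for a, b in merges)
```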
Statistical Rigor
All comparisons use the Mann-Whitney U test (two-sided) on per-sentence token counts:
| Comparison | p-value | Significant? |
|---|---|---|
| Angkor vs SeaLLM v2 | 1.86 x 10^-194 | YES |
| Angkor vs SeaLLM v2.5 | < 10^-300 | YES |
| Angkor vs SeaLLM v3 | < 10^-300 | YES |
| Angkor vs HY-MT 1.5 | < 10^-300 | YES |
With n = 2,009 sentence pairs, every difference is significant far beyond any conventional threshold.
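The same test is available as `scipy.stats.mannwhitneyu`; for reference, a self-contained sketch of the two-sided test using the large-sample normal approximation (it omits the tie correction a full library applies, which is negligible at this sample size):

```python
import math

def average_ranks(values):
    """1-based ranks, with ties assigned their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_two_sided(x, y):
    """Returns (U1, p) via the normal approximation, no tie correction."""
    n1, n2 = len(x), len(y)
    r = average_ranks(list(x) + list(y))
    u1 = sum(r[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return u1, p

# fully separated samples: U1 = 0 and a vanishingly small p-value
u1, p = mann_whitney_two_sided(list(range(100)), list(range(200, 300)))
assert u1 == 0 and p < 1e-10
```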
Limitations
We want to be transparent about what this benchmark does and doesn't show:
- Corpus scope. FLORES-200 is news/Wikipedia-style text. Results may differ on conversational, social media, or domain-specific content.
- Round-trip fidelity is 99.8%, not 100%. Four sentences fail lossless reconstruction — edge cases that need investigation.
- The 0.42x compression ratio is partly asymmetric. Angkor SPM tokenizes English poorly (183K tokens for English FLORES vs ~53K for multilingual tokenizers). The ratio reflects both excellent Khmer performance AND poor English handling.
- SeaLLM v1 was not tested because the Llama-2-based model is gated on HuggingFace.
What This Means
For developers building Khmer applications: a specialized tokenizer cuts token counts, and therefore API costs, by 1.5x to 6x, fits up to 6x more Khmer text in the same context window, and tokenizes 2x to 4x faster.
For the broader NLP ecosystem: vocabulary size is not a proxy for language coverage quality. Low-resource languages need purpose-built tokenizers, not bigger multilingual ones. The "token tax" on languages like Khmer, Lao, Myanmar, and other Southeast Asian scripts is a real cost that affects real users.
The tokenizer model is 191KB. It runs offline. It fits in a SQLite file alongside a complete predictive keyboard engine in 19MB. We've deployed it at angkor-intelligence.com/labs where you can try the tokenizer playground and the Khmer predictive keyboard that runs on top of it.
The full benchmark report with detailed methodology is available as PDF. For SDK integration or research collaboration, reach out at nicolasdelrieu.services@gmail.com.