Our 8K Khmer Tokenizer Beats Every Multilingual LLM Tokenizer
Every time you send Khmer text to an LLM, you're paying a hidden tax.
Multilingual tokenizers — the ones inside GPT, Llama, Mistral, and every major LLM — were trained on web-crawl data where Khmer represents less than 0.1% of text. The result: they fragment Khmer script into tiny pieces, consuming 2x to 9x more tokens than the equivalent English text. Same meaning, same information, but Khmer users pay more per API call, fit less text in the context window, and get worse performance.
We built a tokenizer that fixes this. And we have the data to prove it.
The Benchmark
We tested five tokenizers on the FLORES-200 Khmer-English parallel dataset — 2,009 sentence pairs used as a standard benchmark in multilingual NLP research.
Tokenizers Under Test
| Tokenizer | Organization | Vocab Size | Type | Design |
|---|---|---|---|---|
| Angkor SPM v3f | Angkor Intelligence | 8,000 | SP Unigram | Khmer-specialized |
| SeaLLM v2 | DAMO-NLP-SG | 48,384 | BPE (NLLB-ext) | SEA multilingual |
| SeaLLM v2.5 | DAMO-NLP-SG | 256,000 | BPE | SEA multilingual |
| SeaLLM v3 | DAMO-NLP-SG | 151,643 | Byte-level BPE | SEA multilingual |
| HY-MT 1.5 | Tencent | 120,000 | BPE | Machine translation |
The comparison is deliberately unfair: a purpose-built 8K-vocab tokenizer against models with vocabularies 6x to 32x larger. The question is whether vocabulary size or language-specific training matters more.
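The harness itself is simple: encode every sentence with every tokenizer and record per-sentence token counts. A minimal sketch, with a whitespace-split stub standing in for a real `encode()` call (loading the actual Angkor SPM or SeaLLM tokenizers via `sentencepiece` or `transformers` is assumed to happen elsewhere):

```python
from statistics import mean

def benchmark(tokenizers, sentences):
    """Per-tokenizer mean tokens/sentence and total token counts."""
    results = {}
    for name, encode in tokenizers.items():
        counts = [len(encode(s)) for s in sentences]
        results[name] = {
            "mean": mean(counts),
            "total": sum(counts),
            "counts": counts,  # kept for the per-sentence significance tests
        }
    return results

# stub encoder: whitespace split stands in for a real tokenizer's encode()
sentences = ["the quick brown fox", "hello world"]
stats = benchmark({"whitespace": str.split}, sentences)
assert stats["whitespace"]["total"] == 6 and stats["whitespace"]["mean"] == 3
```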
Results
Token Efficiency
| Tokenizer | Mean Tokens/Sentence | vs Angkor | Total Tokens (2,009 sentences) |
|---|---|---|---|
| Angkor SPM v3f | 37.4 | baseline | 75,207 |
| SeaLLM v2 | 55.3 | 1.48x | 111,142 |
| SeaLLM v2.5 | 133.5 | 3.56x | 268,108 |
| SeaLLM v3 | 166.7 | 4.45x | 334,831 |
| HY-MT 1.5 | 222.6 | 5.95x | 447,237 |
Angkor SPM v3f wins on 98.2% of sentences (1,973 out of 2,009). The 1.8% where SeaLLM v2 wins are sentences containing Latin-script content, where Mistral's English BPE vocabulary has an inherent advantage.
The Khmer/English Compression Ratio
This is the metric that matters most. Using FLORES parallel translations, we measured how many Khmer tokens each tokenizer produces relative to English tokens for the same content. A ratio of 1.0 means equal treatment. Above 1.0 means Khmer is penalized.
| Tokenizer | KM/EN Ratio | What It Means |
|---|---|---|
| Angkor SPM v3f | 0.42x | Khmer encoded MORE efficiently than English |
| SeaLLM v2 | 1.92x | Khmer costs ~2x more than English |
| SeaLLM v2.5 | 5.16x | Khmer costs ~5x more than English |
| SeaLLM v3 | 6.32x | Khmer costs ~6x more than English |
| HY-MT 1.5 | 8.55x | Khmer costs ~9x more than English |
For context: the SeaLLM ACL 2024 paper reported a 12.14x Khmer/English ratio for stock Llama-2, which SeaLLM's NLLB-extended tokenizer brought down to 2.67x. Angkor SPM achieves 0.42x: a 29x improvement over stock Llama-2, and 4.6x better than the 1.92x we measured for SeaLLM v2 in this benchmark.
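The ratio itself is just the aggregate token count over the Khmer half of the parallel corpus divided by the count over the English half. A sketch, using a character-level stub in place of a real encoder:

```python
def km_en_ratio(encode, pairs):
    """Corpus-level Khmer/English token ratio over parallel sentence pairs.
    1.0 means equal treatment; above 1.0 means Khmer is penalized."""
    km_tokens = sum(len(encode(km)) for km, _ in pairs)
    en_tokens = sum(len(encode(en)) for _, en in pairs)
    return km_tokens / en_tokens

# stub: one token per character, so the ratio reduces to relative text length
pairs = [("ខ្មែរ", "Khmer")]  # 5 code points each
assert km_en_ratio(list, pairs) == 1.0
```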
Tokens Per Character
Raw character-level density. Values above 1.0 mean the tokenizer produces more tokens than input characters — effectively worse than character-level encoding.
| Tokenizer | TPC | Chars/Token | Assessment |
|---|---|---|---|
| Angkor SPM v3f | 0.263 | 3.80 | Excellent — multi-char tokens |
| SeaLLM v2 | 0.391 | 2.56 | Good — sub-word level |
| SeaLLM v2.5 | 0.944 | 1.06 | Poor — near character-level |
| SeaLLM v3 | 1.180 | 0.85 | Bad — worse than character-level |
| HY-MT 1.5 | 1.578 | 0.63 | Bad — 1.6 tokens per character |
SeaLLM v3 and HY-MT 1.5 produce more tokens than there are characters in the input. For Khmer, they perform worse than simply treating every character as its own token.
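Tokens-per-character is total tokens divided by total input characters, and the byte-level failure mode is easy to reproduce. A sketch where raw UTF-8 bytes stand in for a byte-level BPE's worst case (no learned merges for Khmer):

```python
def tokens_per_char(encode, sentences):
    """Total tokens divided by total input characters (code points)."""
    tokens = sum(len(encode(s)) for s in sentences)
    chars = sum(len(s) for s in sentences)
    return tokens / chars

# worst case for byte-level BPE with no Khmer merges: every Khmer code point
# occupies 3 bytes in UTF-8, so TPC climbs to 3.0
assert tokens_per_char(lambda s: s.encode("utf-8"), ["ខ្មែរ"]) == 3.0
```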
What This Costs You
API Costs
Estimated cost processing 1 million Khmer characters at $0.01/1K tokens:
| Tokenizer | Est. Tokens | Est. Cost | Cost Multiplier |
|---|---|---|---|
| Angkor SPM v3f | 262,900 | $2.63 | baseline |
| SeaLLM v2 | 391,200 | $3.91 | 1.5x |
| SeaLLM v2.5 | 944,400 | $9.44 | 3.6x |
| SeaLLM v3 | 1,179,800 | $11.80 | 4.5x |
| HY-MT 1.5 | 1,578,100 | $15.78 | 6.0x |
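The cost rows follow directly from the TPC values in the previous table; a sketch of the arithmetic (`est_cost_usd` is our name for illustration, not part of any SDK):

```python
def est_cost_usd(n_chars, tokens_per_char, usd_per_1k_tokens=0.01):
    """Estimated API cost for a batch of text at a per-token price."""
    return n_chars * tokens_per_char * usd_per_1k_tokens / 1000

# reproduces the first and last table rows from the measured TPC values
assert round(est_cost_usd(1_000_000, 0.263), 2) == 2.63
assert round(est_cost_usd(1_000_000, 1.578), 2) == 15.78
```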
Context Window
How many Khmer characters fit in a 4,096-token context window:
| Tokenizer | Max Characters | Relative Capacity |
|---|---|---|
| Angkor SPM v3f | 15,580 | 100% |
| SeaLLM v2 | 10,470 | 67% |
| SeaLLM v2.5 | 4,337 | 28% |
| SeaLLM v3 | 3,471 | 22% |
| HY-MT 1.5 | 2,595 | 17% |
A typical FLORES sentence uses 34 tokens with Angkor vs 213 with HY-MT. With a 4,096-token context window, Angkor fits ~120 Khmer sentences; HY-MT fits ~19.
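Context capacity is the inverse calculation: token budget divided by tokens-per-character. A sketch:

```python
def max_chars_in_window(window_tokens, tokens_per_char):
    """How many input characters fit in a fixed token budget."""
    return int(window_tokens / tokens_per_char)

# HY-MT 1.5 at TPC 1.578 in a 4,096-token window
assert max_chars_in_window(4096, 1.578) == 2595
# roughly 6x more Khmer text fits with the specialized tokenizer
assert max_chars_in_window(4096, 0.263) // max_chars_in_window(4096, 1.578) == 6
```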
Speed
| Tokenizer | Avg Time/Sentence | vs Angkor |
|---|---|---|
| Angkor SPM v3f | 0.055 ms | fastest |
| SeaLLM v2 | 0.121 ms | 2.2x slower |
| SeaLLM v2.5 | 0.148 ms | 2.7x slower |
| SeaLLM v3 | 0.220 ms | 4.0x slower |
| HY-MT 1.5 | 0.213 ms | 3.9x slower |
Why Bigger Vocab Doesn't Mean Better
The results show that, among these tokenizers, vocabulary size is inversely correlated with Khmer performance. Three reasons, plus one exception that proves the rule:
1. Vocabulary allocation. A 256K-vocab tokenizer covering 200+ languages may allocate only 500-2,000 tokens to Khmer. Angkor SPM dedicates all 8,000 tokens to Khmer, achieving denser coverage of Khmer morphemes and word patterns.
2. Training data composition. Multilingual tokenizers are trained on web-crawl data where Khmer represents less than 0.1% of text. BPE merge operations optimize for high-resource languages first; Khmer patterns get merged late or not at all, resulting in character-level or byte-level fragmentation.
3. Unigram vs BPE for agglutinative scripts. SentencePiece Unigram selects tokens by maximizing likelihood over the corpus, naturally capturing frequent multi-character Khmer morphemes. BPE's bottom-up merging can miss linguistically meaningful boundaries in scripts without explicit word separators.
4. SeaLLM v2's exception. SeaLLM v2 outperforms v2.5 and v3 despite a smaller vocab because it explicitly extended Mistral's tokenizer with ~16K NLLB tokens for SEA languages. This deliberate vocabulary investment is exactly what the other tokenizers lack.
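The merge starvation in point 2 can be demonstrated with a toy BPE trainer (a pedagogical sketch, not any production implementation): feed it a corpus where English outweighs Khmer a few hundred to one, and every early merge slot is spent on English pairs while Khmer stays at character level.

```python
from collections import Counter

def train_bpe_merges(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.
    corpus is a list of (word, frequency) pairs."""
    vocab = {tuple(word): freq for word, freq in corpus}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] = new_vocab.get(tuple(out), 0) + freq
        vocab = new_vocab
    return merges

# web-crawl-like skew: English dominates, Khmer is a rounding error
corpus = [("the", 900), ("token", 700), ("tokens", 500), ("text", 400),
          ("ភាសា", 3), ("ខ្មែរ", 2)]
merges = train_bpe_merges(corpus, 8)
# every merge slot goes to an English pair; Khmer (U+1780+) never merges
assert all(ord(a[0]) < 0x1780 for a, b in merges)
```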
Statistical Rigor
All comparisons use the Mann-Whitney U test (two-sided) on per-sentence token counts:
| Comparison | p-value | Significant? |
|---|---|---|
| Angkor vs SeaLLM v2 | 1.86 x 10^-194 | YES |
| Angkor vs SeaLLM v2.5 | < 10^-300 | YES |
| Angkor vs SeaLLM v3 | < 10^-300 | YES |
| Angkor vs HY-MT 1.5 | < 10^-300 | YES |
With n = 2,009 sentence pairs, every difference is significant far beyond any conventional threshold.
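The same test is available as `scipy.stats.mannwhitneyu`; for reference, a self-contained sketch of the two-sided test using the large-sample normal approximation (it omits the tie correction a full library applies, which is negligible at this sample size):

```python
import math

def average_ranks(values):
    """1-based ranks, with ties assigned their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_two_sided(x, y):
    """Returns (U1, p) via the normal approximation, no tie correction."""
    n1, n2 = len(x), len(y)
    r = average_ranks(list(x) + list(y))
    u1 = sum(r[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return u1, p

# fully separated samples: U1 = 0 and a vanishingly small p-value
u1, p = mann_whitney_two_sided(list(range(100)), list(range(200, 300)))
assert u1 == 0 and p < 1e-10
```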
Limitations
We want to be transparent about what this benchmark does and doesn't show:
- Corpus scope. FLORES-200 is news/Wikipedia-style text. Results may differ on conversational, social media, or domain-specific content.
- Round-trip fidelity is 99.8%, not 100%. Four sentences fail lossless reconstruction — edge cases that need investigation.
- The 0.42x compression ratio is partly asymmetric. Angkor SPM tokenizes English poorly (183K tokens for English FLORES vs ~53K for multilingual tokenizers). The ratio reflects both excellent Khmer performance AND poor English handling.
- SeaLLM v1 was not tested because the Llama-2-based model is gated on HuggingFace.
What This Means
For developers building Khmer applications: a specialized tokenizer cuts token counts, and therefore API costs, by 1.5x to 6x, fits up to 6x more Khmer text in the same context window, and tokenizes 2x to 4x faster.
For the broader NLP ecosystem: vocabulary size is not a proxy for language coverage quality. Low-resource languages need purpose-built tokenizers, not bigger multilingual ones. The "token tax" on languages like Khmer, Lao, Myanmar, and other Southeast Asian scripts is a real cost that affects real users.
The tokenizer model is 191KB. It runs offline. It fits in a SQLite file alongside a complete predictive keyboard engine in 19MB. We've deployed it at angkor-intelligence.com/labs where you can try the tokenizer playground and the Khmer predictive keyboard that runs on top of it.
The full benchmark report with detailed methodology is available as PDF. For SDK integration or research collaboration, reach out at nicolasdelrieu.services@gmail.com.