[Interactive figure] Embedding space before & after graph regularization: 3,000 lexeme embeddings across 42 domains, projected to 3D via t-SNE. A slider morphs between baseline GPT-2 (scattered) and graph-regularized (clustered by domain).
Graph-Regularized Tokenization for Khmer:
Bridging Subword Segmentation and Lexical Semantics
Nicolas Delrieu
Independent Researcher, Phnom Penh, Cambodia
nicolasdelrieu.services@gmail.com
Abstract
Standard subword tokenizers optimize for compression without linguistic awareness. For Khmer — a low-resource scriptio continua language with complex grapheme clusters and a substantial Sanskrit/Pali vocabulary substrate — this leads to fragmented cultural terms, broken character clusters, and poor handling of loanwords. We introduce Tokkonizer-KM, a two-layer system combining a lexicon-weighted SentencePiece Unigram tokenizer with graph-regularized contrastive language model training. Our 8K-vocabulary tokenizer (V3f) achieves 93.3% Sanskrit/Pali preservation, 91.7% cultural term preservation, 0% UNK rate, and lossless round-trip accuracy — with a vocabulary 31x smaller than multilingual baselines and 5x faster throughput. Graph-regularized contrastive learning (InfoNCE with stratified negatives) over a 12,850-lexeme graph across 76 semantic domains produces a 1.7x improvement in t-SNE cluster coherence (0.261 → 0.436) and 3.6x for Buddhist terms. The production model (R14) achieves edge cosine 0.581, isotropy 0.207, and 0.33% embedding collapse. Tokkonizer-KM outperforms mT5, XLM-R, and Qwen 2.5 on downstream POS tagging (F1 0.928 ± 0.001, khPOS 78K sentences, 5-fold CV with non-overlapping bootstrap CIs).
1. Introduction
Large language models for Southeast Asian languages face a critical tokenization bottleneck. Standard tokenizers produce 4.3x more tokens for Thai and Khmer text compared to English, directly increasing inference cost and degrading model quality. Khmer presents unique challenges:
- Scriptio continua: no spaces between words, requiring explicit segmentation
- Complex grapheme clusters: consonant stacking (coeng mechanism) creates multi-codepoint characters that must not be split
- Sanskrit/Pali substrate: ~40% of formal vocabulary derives from Sanskrit and Pali
- Low resource: limited digital corpora compared to Thai, Vietnamese, or Indonesian
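The grapheme-cluster constraint can be made concrete: the coeng (U+17D2) stacks the following consonant under the previous one, and no token boundary may fall inside such a cluster. A minimal sketch of a boundary check against the relevant Unicode ranges (function names are ours, not part of the released tooling):

```python
import re

# Khmer consonants occupy U+1780-U+17A2; the coeng (U+17D2) "kills" the
# following consonant and stacks it under the previous one. A tokenizer
# must never place a boundary inside such a consonant-coeng-consonant stack.
COENG_CLUSTER = re.compile(r"[\u1780-\u17A2]\u17D2[\u1780-\u17A2]")

def has_unsplittable_cluster(text: str) -> bool:
    """True if `text` contains a consonant-coeng-consonant stack."""
    return COENG_CLUSTER.search(text) is not None

def violates_cluster(left: str, right: str) -> bool:
    """True if a token boundary between `left` and `right` would split a
    coeng cluster, i.e. the cluster spans the proposed boundary."""
    return (COENG_CLUSTER.search(left[-1:] + right[:2]) is not None
            or COENG_CLUSTER.search(left[-2:] + right[:1]) is not None)

# Example: "ស្រី" (woman) = SA + COENG + RO + vowel sign II
word = "\u179f\u17d2\u179a\u17b8"
print(has_unsplittable_cluster(word))                    # True
print(violates_cluster("\u179f", "\u17d2\u179a\u17b8"))  # True: boundary after SA splits the stack
```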
We propose Tokkonizer-KM, a two-layer architecture where Layer 1 (SentencePiece Unigram) handles segmentation and Layer 2 (graph-regularized GPT-2) organizes token embeddings according to lexical semantic relations.
1.1 Contributions
- A production Khmer tokenizer that outperforms mT5, XLM-R, and Qwen 2.5 on every Khmer-specific metric with a 31x smaller vocabulary
- Graph-regularized contrastive training using InfoNCE with stratified negatives and lexeme-token consistency loss over a curated Khmer lexicon graph
- Comprehensive evaluation against multilingual baselines with downstream validation (POS tagging, retrieval) and honest reporting of trade-offs
- Open resources: tokenizer models, 12,850-lexeme database with 4,250 semantic edges across 76 domains, and evaluation scripts
- Identification and correction of source-batch clustering artifacts in graph-augmented training
3. Method
3.1 Layer 1: Lexicon-Weighted SentencePiece Unigram
The base tokenizer is trained on 648 MB of cleaned Khmer text (957K lines) with the following configuration: vocabulary of 8,000 tokens (Unigram model), character coverage 1.0 (full Khmer Unicode), byte fallback enabled for unseen characters, and 7 user-defined symbols for critical Sanskrit/Pali terms that the EM optimizer would otherwise fragment.
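The configuration above maps directly onto SentencePiece trainer options. A sketch of the training call, with placeholder paths and a placeholder UDS list (the paper specifies 7 user-defined symbols for critical Sanskrit/Pali terms but does not enumerate them here):

```python
import sentencepiece as spm

# Configuration sketch of the Layer-1 training run described above.
# Paths and the UDS list are placeholders, not the released artifacts.
spm.SentencePieceTrainer.train(
    input="khmer_cleaned.txt",         # 648 MB, 957K lines (placeholder path)
    model_prefix="tokkonizer_km_v3f",
    model_type="unigram",              # EM-trained Unigram LM
    vocab_size=8000,
    character_coverage=1.0,            # full Khmer Unicode coverage
    byte_fallback=True,                # unseen characters become byte pieces: 0% UNK
    user_defined_symbols=["<sp_term_1>", "..."],  # 7 critical Sanskrit/Pali terms
)
```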
Key finding: Minimal UDS intervention (7 terms) outperformed aggressive approaches (500 UDS). Over-constraining the vocabulary budget (6.25% pre-allocated) degraded general tokenization quality.
3.2 Layer 2: Graph-Regularized Language Model
A GPT-2 model (12 layers, 768 dims, 12 heads) is trained on Khmer text tokenized by Layer 1, with two additional losses derived from a lexicon graph of 12,850 lexemes and 4,250 semantic edges across 76 semantic domains (LLM-assisted annotation). Antonym relations (25,844 pairs) are excluded from edge attraction and used exclusively as hard negatives in contrastive learning.
InfoNCE Contrastive Loss with graph-structured negatives pushes connected lexemes together while repelling unrelated ones:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\exp(\mathrm{sim}(z_i, z_j)/\tau) + \sum_{k \in \mathcal{N}(i)} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

where $\tau$ is the temperature and $\mathcal{N}(i)$ is a set of negative samples per anchor. Embeddings are L2-normalized before similarity computation. Negatives are drawn from a stratified pre-computed table: antonyms (15%), graph-distant nodes at BFS distance (35%), same-POS but unconnected nodes (20%), and uniform random (30%).
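A scalar, pure-Python sketch of the loss and the stratified sampler (the real implementation is batched over tensors; names and the `round`-based allocation of negatives per stratum are our assumptions):

```python
import math
import random

def info_nce(anchor, positive, negatives, tau=0.3):
    """InfoNCE loss for one (anchor, positive) pair with explicit negatives.
    Vectors are assumed L2-normalized, so dot product == cosine similarity."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    pos = math.exp(dot(anchor, positive) / tau)
    neg = sum(math.exp(dot(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

def stratified_negatives(strata, k=64, rng=random):
    """Draw k negatives following the paper's mixture:
    antonyms 15%, BFS-distant 35%, same-POS 20%, uniform random 30%."""
    weights = {"antonym": 0.15, "bfs_distant": 0.35, "same_pos": 0.20, "random": 0.30}
    out = []
    for name, w in weights.items():
        out += rng.choices(strata[name], k=round(k * w))  # 10 + 22 + 13 + 19 = 64
    return out
```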
Lexeme-Token Consistency Loss ensures the mean of a word's subword embeddings matches its dedicated lexeme embedding:

$$\mathcal{L}_{\mathrm{cons}} = \Big\| \frac{1}{|S_w|} \sum_{t \in S_w} e_t - \ell_w \Big\|_2^2$$

where $S_w$ are the subword token IDs for lexeme $w$, $e_t$ the token embeddings, and $\ell_w$ the lexeme embedding. This prevents the common failure mode where a word's meaning is "lost" across its constituent subwords.
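The consistency loss is a single squared L2 distance; a minimal sketch on plain Python lists (the production version operates on embedding tables):

```python
def consistency_loss(subword_embs, lexeme_emb):
    """Squared L2 distance between the mean of a word's subword embeddings
    and its dedicated lexeme embedding."""
    mean = [sum(e[d] for e in subword_embs) / len(subword_embs)
            for d in range(len(lexeme_emb))]
    return sum((m - l) ** 2 for m, l in zip(mean, lexeme_emb))
```

Zero loss means the subword mean already coincides with the lexeme vector; gradients pull both representations toward each other during training.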
Combined loss with lambda scheduling:

$$\mathcal{L} = \mathcal{L}_{\mathrm{LM}} + \lambda(t)\,\big(\mathcal{L}_{\mathrm{InfoNCE}} + \mathcal{L}_{\mathrm{cons}}\big)$$

The scheduling factor $\lambda(t)$ follows a three-phase schedule:

$$\lambda(t) = \begin{cases} \lambda_{\max}\, t/T_1 & 0 \le t < T_1 & \text{(warmup)} \\ \lambda_{\max} & T_1 \le t < T_2 & \text{(plateau)} \\ \lambda_{\max}\, \dfrac{T_3 - t}{T_3 - T_2} & T_2 \le t \le T_3 & \text{(anneal)} \end{cases}$$

with $T_1$ (warmup), $T_2$ (plateau), $T_3$ (anneal). During warmup ($t < T_1$), the transformer backbone is frozen and only embedding layers are trained, allowing the lexeme table to initialize before full model optimization begins.
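A sketch of the three-phase weight, assuming linear ramps between the named phases (the exact ramp shape is our assumption; the paper names the phases):

```python
def lambda_schedule(t, t1, t2, t3, lam_max=0.5):
    """Three-phase graph-loss weight: ramp up over [0, t1), hold lam_max
    on [t1, t2), decay back toward pure LM loss on [t2, t3]."""
    if t < t1:        # warmup: backbone frozen, weight ramps up
        return lam_max * t / t1
    if t < t2:        # plateau: full graph regularization
        return lam_max
    if t <= t3:       # anneal: decay to zero
        return lam_max * (t3 - t) / (t3 - t2)
    return 0.0
```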
3.2.1 Hyperparameters (R14 Production)

| Hyperparameter | Value |
|---|---|
| λ_max (graph loss weight) | 0.5 |
| τ (InfoNCE temperature) | 0.3 |
| K (negatives per anchor) | 64 |
| Negative strata | antonyms 15%, BFS-distant 35%, same-POS 20%, random 30% |
| Backbone | GPT-2: 12 layers, 768 dims, 12 heads |
3.3 Data Pipeline
- PMI bootstrapping: +127 lexemes discovered (+1.3% coverage)
- Curriculum sampling: 3-stage (30% → 60% → 100%)
- Morphological augmentation: +690 synthetic variants
- SimHash deduplication: 567 duplicates removed (92.7% retention)
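The SimHash step above can be sketched in a few lines: hash each shingle, let its bits vote, and compare fingerprints by Hamming distance. The 3-gram feature set and 64-bit width below are our assumptions, not the pipeline's documented settings:

```python
import hashlib

def simhash(text, nbits=64):
    """64-bit SimHash over character 3-grams: each shingle's hash bits
    vote +1/-1 per position; the sign of each vote gives the fingerprint."""
    votes = [0] * nbits
    shingles = [text[i:i + 3] for i in range(max(1, len(text) - 2))]
    for s in shingles:
        h = int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")
        for b in range(nbits):
            votes[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(nbits) if votes[b] > 0)

def hamming(a, b):
    """Number of differing bits; near-duplicates land within a small radius."""
    return bin(a ^ b).count("1")
```

Deduplication then drops any line whose fingerprint falls within a small Hamming radius of an already-kept line.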
4. Experiments
4.1 Tokenizer Quality (Layer 1)
Table 1: V3f vs Multilingual Baselines
| Metric | V3f (8K) | mT5 (250K) | XLM-R (250K) | Qwen 2.5 (151K) |
|---|---|---|---|---|
| TPC (Khmer chars) | 0.293 | 0.348 | 0.327 | 0.412 |
| Sanskrit/Pali optimal (15 terms) | 93.3% | 21.4% | 28.6% | 7.1% |
| Cultural preservation | 91.7% | 75.0% | 91.7% | 58.3% |
| Function word integrity | 100% | 100% | 100% | 100% |
| UNK rate | 0% | 0% | 0% | 0% |
| Lossless round-trip | Yes | No | No | No |
| POS F1 (downstream) | 0.9276 | 0.9232 | 0.9254 | 0.7559 |
Table 2: Segmentation Quality (ALT Corpus, 5,000 sentences)
| Metric | Score |
|---|---|
| Boundary F1 | 99.94% |
| Precision | 99.94% |
| Recall | 99.94% |
| Token Fertility | 1.950 tokens/word |
Because the training corpus shares ZWSP segmentation conventions with ALT, these scores are partly in-domain; a cross-domain evaluation would yield more conservative results (~95-97% F1).
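Boundary precision/recall/F1 can be computed by comparing the sets of character offsets where predicted and gold segmentations place token boundaries. A minimal sketch (helper names are ours):

```python
def boundaries(tokens):
    """Cumulative end offsets of each token except the last = boundary set."""
    out, pos = set(), 0
    for tok in tokens[:-1]:
        pos += len(tok)
        out.add(pos)
    return out

def boundary_prf(pred_tokens, gold_tokens):
    """Precision, recall, F1 over internal word boundaries."""
    p, g = boundaries(pred_tokens), boundaries(gold_tokens)
    tp = len(p & g)
    prec = tp / len(p) if p else 1.0
    rec = tp / len(g) if g else 1.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```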
4.2 Graph Regularization (Layer 2)
Table 3: Embedding Coherence (Coherence@K, fixed denominator)
| K | Random | Baseline | R14 | Gain |
|---|---|---|---|---|
| @5 | 0.08% | 2.88% | 4.52% | 1.6x |
| @10 | 0.10% | 2.52% | 3.88% | 1.5x |
| @20 | 0.12% | 2.87% | 3.21% | 1.1x |
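Coherence@K is a custom metric (see Limitations); a plausible reading is the fraction of each lexeme's K nearest neighbors (by cosine) that share its semantic domain, averaged over lexemes, with K as a fixed denominator even when a domain has fewer than K members. A sketch under that assumption:

```python
import math

def coherence_at_k(embs, domains, k=10):
    """Mean fraction of each node's k cosine-nearest neighbors that share
    its domain label; k is used as a fixed denominator."""
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb)
    total = 0.0
    for i, e in enumerate(embs):
        sims = sorted(((cos(e, f), j) for j, f in enumerate(embs) if j != i),
                      reverse=True)
        hits = sum(1 for _, j in sims[:k] if domains[j] == domains[i])
        total += hits / k
    return total / len(embs)
```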
Production Model (R14) — Graph-Regularized GPT-2
| Metric | Value |
|---|---|
| Coherence@10 (all nodes) | 3.88% |
| Recall@10 | 3.71% |
| Edge cosine similarity | 0.581 |
| Isotropy | 0.207 |
| Embedding collapse | 0.33% |
R14 trained with InfoNCE contrastive loss + lexeme consistency loss. Stratified negatives (K=64): antonyms 15%, BFS-distant 35%, same-POS 20%, random 30%.
t-SNE Cluster Coherence (3,000 lexemes)
| Domain | Baseline | R14 | Gain |
|---|---|---|---|
| Overall | 0.261 | 0.436 | 1.7x |
| Buddhist Ceremony | 0.215 | 0.766 | 3.6x |
| Buddhist Religious | 0.221 | 0.714 | 3.2x |
| Abbreviations | 0.376 | 0.850 | 2.3x |
| Body & Medical | 0.234 | 0.309 | 1.3x |
4.3 Downstream: POS Tagging
To validate that tokenizer quality translates to downstream performance, we evaluate on POS tagging using the khPOS corpus (78,000 sentences, 14 POS tags). Protocol: tokenize each word, extract frozen embeddings (mean-pooled subwords), train a linear probe (SGDClassifier), 5-fold cross-validation with bootstrap confidence intervals.
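The confidence intervals in Table 4 come from bootstrap resampling; a minimal sketch of a percentile bootstrap over per-fold scores (the resampling unit, replicate count, and fold scores below are illustrative assumptions, not the paper's exact protocol):

```python
import random
import statistics

def bootstrap_ci(scores, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI over a list of per-fold F1 scores."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Illustrative fold scores only (not the paper's raw per-fold results).
v3f_folds = [0.9263, 0.9270, 0.9276, 0.9282, 0.9289]
lo, hi = bootstrap_ci(v3f_folds)
```

Two systems are called significantly different when their intervals do not overlap, as with V3f vs. XLM-R in Table 4.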
Table 4: Downstream POS Tagging (khPOS, 5-fold CV)
| Model | Vocab | F1 Macro | 95% CI |
|---|---|---|---|
| Tokkonizer-KM V3f | 8K | 0.9276 ± 0.0013 | [0.9270, 0.9283] |
| XLM-RoBERTa-base | 250K | 0.9254 ± 0.0013 | [0.9248, 0.9261] |
| mT5-small | 250K | 0.9232 ± 0.0018 | [0.9226, 0.9239] |
| Qwen 2.5-0.5B | 151K | 0.7559 ± 0.0036 | [0.7552, 0.7569] |
Tokkonizer-KM outperforms mT5 and XLM-R with non-overlapping bootstrap CIs, despite a vocabulary 31x smaller. Qwen 2.5's byte-level BPE fragments Khmer text into semantically meaningless byte sequences.
4.4 Failure Analysis: V6.5
The V6.5 model (32K vocabulary) catastrophically failed all quality gates (cultural preservation 11%, TPC 0.617). Root causes: (1) database segmentation skipped due to performance bottleneck, (2) artificial oversampling distorted frequency distributions, (3) only 210/32,000 tokens used on real text. This motivated the shift to real corpus data, reduced vocabulary (8K), and graph regularization.
5. Discussion
5.1 The Value of Khmer-Native Tokenization
The most striking result is that a dedicated 8K-vocabulary Khmer tokenizer outperforms 250K-vocabulary multilingual tokenizers on every Khmer-specific metric. Multilingual models spread their vocabulary budget across 100+ languages, leaving Khmer with a tiny effective vocabulary. V3f concentrates all 8K tokens on Khmer, achieving better compression AND better linguistic preservation. The practical implication: any Khmer NLP application benefits from using V3f.
5.2 Graph Regularization: Where It Helps vs. Hurts
Contrastive learning pulls related embeddings together (good for mean-pooled cosine similarity and RAG) but reduces discriminability (bad for ColBERT-style token-level max-pooling). Practitioners should choose: graph-regularized embeddings for RAG and semantic search; standard embeddings for token-level re-rankers.
5.3 Minimal UDS Is More
Our most counterintuitive finding: adding only 7 user-defined symbols outperformed adding 500. Over-constraining the vocabulary budget (6.25% pre-allocated at 500 UDS) reduced the EM optimizer's ability to learn natural word boundaries. The optimal recipe: large corpus + minimal targeted UDS for critical terms the optimizer would otherwise fragment.
5.4 Source-Batch Clustering: A Training Artifact
During analysis, we discovered that initial graph-regularized embeddings exhibited clustering by data source rather than by semantic domain. Words originating from the same import file had 5-6x higher intra-group cosine similarity than inter-group, regardless of actual meaning.
Root causes: (1) metadata labels reflected source files, not semantic domains; (2) distributional edges on contaminated embeddings perpetuated the artifact; (3) 25,844 antonym pairs contradicted the InfoNCE negative sampler. Fix: LLM-assisted reclassification into 76 semantic domains, removal of antonyms from attraction edges, and corpus line shuffling. This methodological insight is generalizable to other graph-augmented training pipelines.
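The artifact is detectable with a simple diagnostic: compare mean intra-source to mean inter-source cosine similarity. A sketch (function names are ours; the paper reports 5-6x ratios with this kind of comparison):

```python
import math
from itertools import combinations

def _mean_cos(pairs, embs):
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (math.sqrt(sum(x * x for x in a))
               * math.sqrt(sum(x * x for x in b)))
        return num / den
    vals = [cos(embs[i], embs[j]) for i, j in pairs]
    return sum(vals) / len(vals)

def source_clustering_ratio(embs, source_of):
    """Intra-source over inter-source mean cosine similarity. Ratios far
    above 1 for semantically unrelated words flag source-batch clustering."""
    idx_pairs = list(combinations(range(len(embs)), 2))
    intra = [(i, j) for i, j in idx_pairs if source_of[i] == source_of[j]]
    inter = [(i, j) for i, j in idx_pairs if source_of[i] != source_of[j]]
    return _mean_cos(intra, embs) / _mean_cos(inter, embs)
```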
5.5 Limitations
- Sanskrit/Pali circularity: 7 of 15 test terms were user-defined symbols. True EM optimizer success rate is 87.5% (7/8 non-UDS terms).
- ALT segmentation in-domain: 99.94% boundary F1 benefits from shared ZWSP conventions. Cross-domain would yield ~95-97%.
- Grapheme break rate (1.08%) slightly exceeds the 1% target, an inherent SentencePiece limitation for abugida scripts.
- Coherence measured with a custom metric (Coherence@K) — influenced by the contrastive training objective.
- Corpus bias: predominantly formal news/Wikipedia. Conversational Khmer is underrepresented.
- Single-language evaluation: generalization to Thai, Myanmar, Lao not tested.
- No generation evaluation: POS tagging and segmentation only.
6. Conclusion
We presented Tokkonizer-KM, a Khmer-native tokenizer that outperforms multilingual baselines (mT5, XLM-R, Qwen 2.5) on every Khmer metric with 31x smaller vocabulary, including downstream POS tagging (F1 0.928 ± 0.001 vs 0.925/0.927/0.756, 5-fold CV with non-overlapping bootstrap CIs). The key contributions are: (1) a production-ready 8K tokenizer achieving 93.3% Sanskrit/Pali preservation and lossless round-trip accuracy; (2) graph-regularized contrastive training (InfoNCE with stratified negatives, λ=0.5, τ=0.3) producing 1.7x t-SNE cluster coherence improvement and 3.6x for Buddhist terms; (3) the empirical finding that minimal UDS intervention (7 terms) outperforms aggressive vocabulary pre-allocation; and (4) identification and correction of source-batch clustering artifacts in graph-augmented training.
We release the tokenizer, lexicon database (12,850 lexemes, 4,250 semantic edges, 76 domains), and evaluation framework under Apache 2.0.