Embedding Space

Before & after graph regularization

3,000 lexeme embeddings across 42 domains, projected to 3D via t-SNE. The slider morphs between baseline GPT-2 (scattered) and graph-regularized (clustered by domain).

0.025cluster coherence

Loading embeddings...

Graph-Regularized Tokenization for Khmer:
Bridging Subword Segmentation and Lexical Semantics

Nicolas Delrieu

Independent Researcher, Phnom Penh, Cambodia

nicolasdelrieu.services@gmail.com

Abstract

Standard subword tokenizers optimize for compression without linguistic awareness. For Khmer — a low-resource scriptio continua language with complex grapheme clusters and a substantial Sanskrit/Pali vocabulary substrate — this leads to fragmented cultural terms, broken character clusters, and poor handling of loanwords. We introduce Tokkonizer-KM, a two-layer system combining a lexicon-weighted SentencePiece Unigram tokenizer with graph-regularized contrastive language model training. Our 8K-vocabulary tokenizer (V3f) achieves 93.3% Sanskrit/Pali preservation, 91.7% cultural term preservation, 0% UNK rate, and lossless round-trip accuracy — with a vocabulary 31x smaller than multilingual baselines and 5x faster throughput. Graph-regularized contrastive learning (InfoNCE with stratified negatives) over a 12,850-lexeme graph across 76 semantic domains produces a 1.7x improvement in t-SNE cluster coherence (0.261 → 0.436) and 3.6x for Buddhist terms. The production model (R14) achieves edge cosine 0.581, isotropy 0.207, and 0.33% embedding collapse. Tokkonizer-KM outperforms mT5, XLM-R, and Qwen 2.5 on downstream POS tagging (F1 0.928 ± 0.001, khPOS 78K sentences, 5-fold CV with non-overlapping bootstrap CIs).

1. Introduction

Large language models for Southeast Asian languages face a critical tokenization bottleneck. Standard tokenizers produce 4.3x more tokens for Thai and Khmer text compared to English, directly increasing inference cost and degrading model quality. Khmer presents unique challenges:

Scriptio continua: no spaces between words, requiring explicit segmentation
Complex grapheme clusters: consonant stacking (coeng mechanism) creates multi-codepoint characters that must not be split
Sanskrit/Pali substrate: ~40% of formal vocabulary derives from Sanskrit and Pali
Low resource: limited digital corpora compared to Thai, Vietnamese, or Indonesian

We propose Tokkonizer-KM, a two-layer architecture where Layer 1 (SentencePiece Unigram) handles segmentation and Layer 2 (graph-regularized GPT-2) organizes token embeddings according to lexical semantic relations.

1.1 Contributions

A production Khmer tokenizer that beats mT5, XLM-R, and Qwen 2.5 on every Khmer-specific metric with 31x smaller vocabulary
Graph-regularized contrastive training using InfoNCE with stratified negatives and lexeme-token consistency loss over a curated Khmer lexicon graph
Comprehensive evaluation against multilingual baselines with downstream validation (POS tagging, retrieval) and honest reporting of trade-offs
Open resources: tokenizer models, 12,850-lexeme database with 4,250 semantic edges across 76 domains, and evaluation scripts
Identification and correction of source-batch clustering artifacts in graph-augmented training

3. Method

3.1 Layer 1: Lexicon-Weighted SentencePiece Unigram

The base tokenizer is trained on 648 MB of cleaned Khmer text (957K lines) with the following configuration: vocabulary of 8,000 tokens (Unigram model), character coverage 1.0 (full Khmer Unicode), byte fallback enabled for unseen characters, and 7 user-defined symbols for critical Sanskrit/Pali terms that the EM optimizer would otherwise fragment.

Key finding: Minimal UDS intervention (7 terms) outperformed aggressive approaches (500 UDS). Over-constraining the vocabulary budget (6.25% pre-allocated) degraded general tokenization quality.

3.2 Layer 2: Graph-Regularized Language Model

A GPT-2 model (12 layers, 768 dims, 12 heads) is trained on Khmer text tokenized by Layer 1, with two additional losses derived from a lexicon graph of 12,850 lexemes across 76 semantic domains (LLM-assisted annotation), with 4,250 semantic edges. Antonym relations (25,844 pairs) are excluded from edge attraction and used exclusively as hard negatives in contrastive learning.

InfoNCE Contrastive Loss with graph-structured negatives pushes connected lexemes together while repelling unrelated ones:

\mathcal{L}_{info} = -\frac{1}{|B|} \sum_{(i,j) \in B} \log \frac{\exp(\text{sim}(\mathbf{e}_i, \mathbf{e}_j) / \tau)}{\exp(\text{sim}(\mathbf{e}_i, \mathbf{e}_j) / \tau) + \sum_{k \in N_i} \exp(\text{sim}(\mathbf{e}_i, \mathbf{e}_k) / \tau)}

(1)

where $\tau = 0.3$ is the temperature and $N_i$ is a set of $K = 64$ negative samples per anchor. Embeddings are L2-normalized before similarity computation. Negatives are drawn from a stratified pre-computed table: antonyms (15%), graph-distant nodes at BFS distance $\geq 3$ (35%), same-POS but unconnected nodes (20%), and uniform random (30%).

Lexeme-Token Consistency Loss ensures the mean of a word's subword embeddings matches its dedicated lexeme embedding:

\mathcal{L}_{cons} = \frac{1}{|S|} \sum_{l \in S} \left\| \frac{1}{|t_l|} \sum_{k \in t_l} \mathbf{E}_{tokens}[k] - \mathbf{E}_{lex}[l] \right\|^2

(2)

where $t_l$ are the subword token IDs for lexeme $l$ . This prevents the common failure mode where a word's meaning is “lost” across its constituent subwords.

Combined loss with lambda scheduling:

\mathcal{L} = \mathcal{L}_{LM} + \lambda_{info} \cdot f(t) \cdot \mathcal{L}_{info} + \lambda_{cons} \cdot f(t) \cdot \mathcal{L}_{cons}

(3)

The scheduling factor $f(t)$ follows a three-phase schedule:

f(t) = \begin{cases} t / T_w & \text{if } t < T_w \quad \text{(warmup)} \cr 1.0 & \text{if } T_w \le t < T_w + T_p \quad \text{(plateau)} \cr \max\!\left(0.5,\; 1 - 0.5 \cdot \dfrac{t - T_w - T_p}{T_a}\right) & \text{(anneal to 0.5)} \end{cases}

(4)

with $T_w = 1{,}455$ (warmup), $T_p = 2{,}910$ (plateau), $T_a = 2{,}910$ (anneal). During warmup ( $t < T_w$ ), the transformer backbone is frozen and only embedding layers are trained, allowing the lexeme table to initialize before full model optimization begins.

3.2.1 Hyperparameters (R14 Production)

\lambda_{\text{info}}

0.5

\lambda_{\text{cons}}

1.0 \times 10^{-4}

Temperature

\tau

0.3Negatives

K

64 (stratified)Lexeme batch

|S|

256Training1 epoch, ~1.5h on NVIDIA H100 NVL

3.3 Data Pipeline

PMI bootstrapping: +127 lexemes discovered (+1.3% coverage)
Curriculum sampling: 3-stage (30% → 60% → 100%)
Morphological augmentation: +690 synthetic variants
SimHash deduplication: 567 duplicates removed (92.7% retention)

4. Experiments

4.1 Tokenizer Quality (Layer 1)

Table 1: V3f vs Multilingual Baselines

Metric	V3f (8K)	mT5 (250K)	XLM-R (250K)	Qwen 2.5 (151K)
TPC (Khmer chars)	0.293	0.348	0.327	0.412
Sanskrit/Pali optimal (15 terms)	93.3%	21.4%	28.6%	7.1%
Cultural preservation	91.7%	75.0%	91.7%	58.3%
Function word integrity	100%	100%	100%	100%
UNK rate	0%	0%	0%	0%
Lossless round-trip	Yes	No	No	No
POS F1 (downstream)	0.9276	0.9232	0.9254	0.7559

Table 2: Segmentation Quality (ALT Corpus, 5,000 sentences)

Metric	Score
Boundary F1	99.94%
Precision	99.94%
Recall	99.94%
Token Fertility	1.950 tokens/word

Training corpus shares ZWSP segmentation conventions with ALT — a cross-domain evaluation would yield more conservative results (~95-97% F1).

4.2 Graph Regularization (Layer 2)

Table 3: Embedding Coherence (Coherence@K, fixed denominator)

K	Random	Baseline	R14	Gain
@5	0.08%	2.88%	4.52%	1.6x
@10	0.10%	2.52%	3.88%	1.3x
@20	0.12%	2.87%	3.21%	1.1x

Production Model (R14) — Graph-Regularized GPT-2

Metric	Value
Coherence@10 (all nodes)	3.88%
Recall@10	3.71%
Edge cosine similarity	0.581
Isotropy	0.207
Embedding collapse	0.33%

R14 trained with InfoNCE contrastive loss + lexeme consistency loss. Stratified negatives (K=64): antonyms 15%, BFS-distant 35%, same-POS 20%, random 30%.

t-SNE Cluster Coherence (3,000 lexemes)

Domain	Baseline	R14	Gain
Overall	0.261	0.436	1.7x
Buddhist Ceremony	0.215	0.766	3.6x
Buddhist Religious	0.221	0.714	3.2x
Abbreviations	0.376	0.850	2.3x
Body & Medical	0.234	0.309	1.3x

4.3 Downstream: POS Tagging

To validate that tokenizer quality translates to downstream performance, we evaluate on POS tagging using the khPOS corpus (78,000 sentences, 14 POS tags). Protocol: tokenize each word, extract frozen embeddings (mean-pooled subwords), train a linear probe (SGDClassifier), 5-fold cross-validation with bootstrap confidence intervals.

Table 4: Downstream POS Tagging (khPOS, 5-fold CV)

Model	Vocab	F1 Macro	95% CI
Tokkonizer-KM V3f	8K	0.9276 ± 0.0013	[0.9270, 0.9283]
XLM-RoBERTa-base	250K	0.9254 ± 0.0013	[0.9248, 0.9261]
mT5-small	250K	0.9232 ± 0.0018	[0.9226, 0.9239]
Qwen 2.5-0.5B	151K	0.7559 ± 0.0036	[0.7552, 0.7569]

Tokkonizer-KM outperforms mT5 and XLM-R with non-overlapping bootstrap CIs, despite a vocabulary 31x smaller. Qwen 2.5's byte-level BPE fragments Khmer text into semantically meaningless byte sequences.

4.4 Failure Analysis: V6.5

The V6.5 model (32K vocabulary) catastrophically failed all quality gates (cultural preservation 11%, TPC 0.617). Root causes: (1) database segmentation skipped due to performance bottleneck, (2) artificial oversampling distorted frequency distributions, (3) only 210/32,000 tokens used on real text. This motivated the shift to real corpus data, reduced vocabulary (8K), and graph regularization.

5. Discussion

5.1 The Value of Khmer-Native Tokenization

The most striking result is that a dedicated 8K-vocabulary Khmer tokenizer outperforms 250K-vocabulary multilingual tokenizers on every Khmer-specific metric. Multilingual models spread their vocabulary budget across 100+ languages, leaving Khmer with a tiny effective vocabulary. V3f concentrates all 8K tokens on Khmer, achieving better compression AND better linguistic preservation. The practical implication: any Khmer NLP application benefits from using V3f.

5.2 Graph Regularization: Where It Helps vs. Hurts

Contrastive learning pulls related embeddings together (good for mean-pooled cosine similarity and RAG) but reduces discriminability (bad for ColBERT-style token-level max-pooling). Practitioners should choose: graph-regularized embeddings for RAG and semantic search; standard embeddings for token-level re-rankers.

5.3 Minimal UDS Is More

Our most counterintuitive finding: adding only 7 user-defined symbols outperformed adding 500. Over-constraining the vocabulary budget (6.25% pre-allocated at 500 UDS) reduced the EM optimizer's ability to learn natural word boundaries. The optimal recipe: large corpus + minimal targeted UDS for critical terms the optimizer would otherwise fragment.

5.4 Source-Batch Clustering: A Training Artifact

During analysis, we discovered that initial graph-regularized embeddings exhibited clustering by data source rather than by semantic domain. Words originating from the same import file had 5-6x higher intra-group cosine similarity than inter-group, regardless of actual meaning.

Root causes: (1) metadata labels reflected source files, not semantic domains; (2) distributional edges on contaminated embeddings perpetuated the artifact; (3) 25,844 antonym pairs contradicted the InfoNCE negative sampler. Fix: LLM-assisted reclassification into 76 semantic domains, removal of antonyms from attraction edges, and corpus line shuffling. This methodological insight is generalizable to other graph-augmented training pipelines.

5.5. Limitations

Sanskrit/Pali circularity: 7 of 15 test terms were user-defined symbols. True EM optimizer success rate is 87.5% (7/8 non-UDS terms).
ALT segmentation in-domain: 99.94% boundary F1 benefits from shared ZWSP conventions. Cross-domain would yield ~95-97%.
Grapheme break rate (1.08%) slightly exceeds the 1% target — an inherent SentencePiece limitation for Abugida scripts.
Coherence measured with a custom metric (Coherence@K) — influenced by the contrastive training objective.
Corpus bias: predominantly formal news/Wikipedia. Conversational Khmer is underrepresented.
Single-language evaluation: generalization to Thai, Myanmar, Lao not tested.
No generation evaluation: POS tagging and segmentation only.

6. Conclusion

We presented Tokkonizer-KM, a Khmer-native tokenizer that outperforms multilingual baselines (mT5, XLM-R, Qwen 2.5) on every Khmer metric with 31x smaller vocabulary, including downstream POS tagging (F1 0.928 ± 0.001 vs 0.925/0.927/0.756, 5-fold CV with non-overlapping bootstrap CIs). The key contributions are: (1) a production-ready 8K tokenizer achieving 93.3% Sanskrit/Pali preservation and lossless round-trip accuracy; (2) graph-regularized contrastive training (InfoNCE with stratified negatives, λ=0.5, τ=0.3) producing 1.7x t-SNE cluster coherence improvement and 3.6x for Buddhist terms; (3) the empirical finding that minimal UDS intervention (7 terms) outperforms aggressive vocabulary pre-allocation; and (4) identification and correction of source-batch clustering artifacts in graph-augmented training.

We release the tokenizer, lexicon database (12,850 lexemes, 4,250 semantic edges, 76 domains), and evaluation framework under Apache 2.0.