Embedding Space

Before & after graph regularization

3,000 lexeme embeddings across 42 domains, projected to 3D via t-SNE. The slider morphs between baseline GPT-2 (scattered) and graph-regularized (clustered by domain).

0.025cluster coherence

Loading embeddings...

Graph-Regularized Tokenization for Khmer:Bridging Subword Segmentation and Lexical Semantics

Nicolas Delrieu

Independent Researcher, Phnom Penh, Cambodia

nicolasdelrieu.services@gmail.com

Abstract

Standard subword tokenizers optimize for compression without linguistic awareness. For Khmer — a low-resource scriptio continua language with complex grapheme clusters and a substantial Sanskrit/Pali vocabulary substrate — this leads to fragmented cultural terms, broken character clusters, and poor handling of loanwords. We introduce Tokkonizer-KM, a two-layer system combining a lexicon-weighted SentencePiece Unigram tokenizer with graph-regularized contrastive language model training. Our 8K-vocabulary tokenizer (V3f) achieves 93.3% Sanskrit/Pali preservation, 91.7% cultural term preservation, 0% UNK rate, and lossless round-trip accuracy — with a vocabulary 31x smaller than multilingual baselines and 5x faster throughput. Graph-regularized contrastive learning (InfoNCE with stratified negatives) over a 12,850-lexeme graph across 76 semantic domains produces a 1.7x improvement in t-SNE cluster coherence (0.261 → 0.436) and 3.6x for Buddhist terms. The production model (R14) achieves edge cosine 0.581, isotropy 0.207, and 0.33% embedding collapse. Tokkonizer-KM outperforms mT5, XLM-R, and Qwen 2.5 on downstream POS tagging (F1 0.928 ± 0.001, khPOS 78K sentences, 5-fold CV with non-overlapping bootstrap CIs).

1. Introduction

Large language models for Southeast Asian languages face a critical tokenization bottleneck. Standard tokenizers produce 4.3x more tokens for Thai and Khmer text compared to English, directly increasing inference cost and degrading model quality. Khmer presents unique challenges:

  1. Scriptio continua: no spaces between words, requiring explicit segmentation
  2. Complex grapheme clusters: consonant stacking (coeng mechanism) creates multi-codepoint characters that must not be split
  3. Sanskrit/Pali substrate: ~40% of formal vocabulary derives from Sanskrit and Pali
  4. Low resource: limited digital corpora compared to Thai, Vietnamese, or Indonesian

We propose Tokkonizer-KM, a two-layer architecture where Layer 1 (SentencePiece Unigram) handles segmentation and Layer 2 (graph-regularized GPT-2) organizes token embeddings according to lexical semantic relations.

1.1 Contributions

  1. A production Khmer tokenizer that beats mT5, XLM-R, and Qwen 2.5 on every Khmer-specific metric with 31x smaller vocabulary
  2. Graph-regularized contrastive training using InfoNCE with stratified negatives and lexeme-token consistency loss over a curated Khmer lexicon graph
  3. Comprehensive evaluation against multilingual baselines with downstream validation (POS tagging, retrieval) and honest reporting of trade-offs
  4. Open resources: tokenizer models, 12,850-lexeme database with 4,250 semantic edges across 76 domains, and evaluation scripts
  5. Identification and correction of source-batch clustering artifacts in graph-augmented training

3. Method

3.1 Layer 1: Lexicon-Weighted SentencePiece Unigram

The base tokenizer is trained on 648 MB of cleaned Khmer text (957K lines) with the following configuration: vocabulary of 8,000 tokens (Unigram model), character coverage 1.0 (full Khmer Unicode), byte fallback enabled for unseen characters, and 7 user-defined symbols for critical Sanskrit/Pali terms that the EM optimizer would otherwise fragment.

Key finding: Minimal UDS intervention (7 terms) outperformed aggressive approaches (500 UDS). Over-constraining the vocabulary budget (6.25% pre-allocated) degraded general tokenization quality.

3.2 Layer 2: Graph-Regularized Language Model

A GPT-2 model (12 layers, 768 dims, 12 heads) is trained on Khmer text tokenized by Layer 1, with two additional losses derived from a lexicon graph of 12,850 lexemes across 76 semantic domains (LLM-assisted annotation), with 4,250 semantic edges. Antonym relations (25,844 pairs) are excluded from edge attraction and used exclusively as hard negatives in contrastive learning.

InfoNCE Contrastive Loss with graph-structured negatives pushes connected lexemes together while repelling unrelated ones:

Linfo=1B(i,j)Blogexp(sim(ei,ej)/τ)exp(sim(ei,ej)/τ)+kNiexp(sim(ei,ek)/τ)\mathcal{L}_{info} = -\frac{1}{|B|} \sum_{(i,j) \in B} \log \frac{\exp(\text{sim}(\mathbf{e}_i, \mathbf{e}_j) / \tau)}{\exp(\text{sim}(\mathbf{e}_i, \mathbf{e}_j) / \tau) + \sum_{k \in N_i} \exp(\text{sim}(\mathbf{e}_i, \mathbf{e}_k) / \tau)}
(1)

where τ=0.3\tau = 0.3 is the temperature and NiN_i is a set of K=64K = 64 negative samples per anchor. Embeddings are L2-normalized before similarity computation. Negatives are drawn from a stratified pre-computed table: antonyms (15%), graph-distant nodes at BFS distance 3\geq 3 (35%), same-POS but unconnected nodes (20%), and uniform random (30%).

Lexeme-Token Consistency Loss ensures the mean of a word's subword embeddings matches its dedicated lexeme embedding:

Lcons=1SlS1tlktlEtokens[k]Elex[l]2\mathcal{L}_{cons} = \frac{1}{|S|} \sum_{l \in S} \left\| \frac{1}{|t_l|} \sum_{k \in t_l} \mathbf{E}_{tokens}[k] - \mathbf{E}_{lex}[l] \right\|^2
(2)

where tlt_l are the subword token IDs for lexeme ll. This prevents the common failure mode where a word's meaning is “lost” across its constituent subwords.

Combined loss with lambda scheduling:

L=LLM+λinfof(t)Linfo+λconsf(t)Lcons\mathcal{L} = \mathcal{L}_{LM} + \lambda_{info} \cdot f(t) \cdot \mathcal{L}_{info} + \lambda_{cons} \cdot f(t) \cdot \mathcal{L}_{cons}
(3)

The scheduling factor f(t)f(t) follows a three-phase schedule:

f(t)={t/Twif t<Tw(warmup)1.0if Twt<Tw+Tp(plateau)max ⁣(0.5,  10.5tTwTpTa)(anneal to 0.5)f(t) = \begin{cases} t / T_w & \text{if } t < T_w \quad \text{(warmup)} \cr 1.0 & \text{if } T_w \le t < T_w + T_p \quad \text{(plateau)} \cr \max\!\left(0.5,\; 1 - 0.5 \cdot \dfrac{t - T_w - T_p}{T_a}\right) & \text{(anneal to 0.5)} \end{cases}
(4)

with Tw=1,455T_w = 1{,}455 (warmup), Tp=2,910T_p = 2{,}910 (plateau), Ta=2,910T_a = 2{,}910 (anneal). During warmup (t<Twt < T_w), the transformer backbone is frozen and only embedding layers are trained, allowing the lexeme table to initialize before full model optimization begins.

3.2.1 Hyperparameters (R14 Production)

λinfo\lambda_{\text{info}}0.5λcons\lambda_{\text{cons}}1.0×1041.0 \times 10^{-4}Temperature τ\tau0.3Negatives KK64 (stratified)Lexeme batch S|S|256Training1 epoch, ~1.5h on NVIDIA H100 NVL

3.3 Data Pipeline

  • PMI bootstrapping: +127 lexemes discovered (+1.3% coverage)
  • Curriculum sampling: 3-stage (30% → 60% → 100%)
  • Morphological augmentation: +690 synthetic variants
  • SimHash deduplication: 567 duplicates removed (92.7% retention)

4. Experiments

4.1 Tokenizer Quality (Layer 1)

Table 1: V3f vs Multilingual Baselines

MetricV3f (8K)mT5 (250K)XLM-R (250K)Qwen 2.5 (151K)
TPC (Khmer chars)0.2930.3480.3270.412
Sanskrit/Pali optimal (15 terms)93.3%21.4%28.6%7.1%
Cultural preservation91.7%75.0%91.7%58.3%
Function word integrity100%100%100%100%
UNK rate0%0%0%0%
Lossless round-tripYesNoNoNo
POS F1 (downstream)0.92760.92320.92540.7559

Table 2: Segmentation Quality (ALT Corpus, 5,000 sentences)

MetricScore
Boundary F199.94%
Precision99.94%
Recall99.94%
Token Fertility1.950 tokens/word

Training corpus shares ZWSP segmentation conventions with ALT — a cross-domain evaluation would yield more conservative results (~95-97% F1).

4.2 Graph Regularization (Layer 2)

Table 3: Embedding Coherence (Coherence@K, fixed denominator)

KRandomBaselineR14Gain
@50.08%2.88%4.52%1.6x
@100.10%2.52%3.88%1.3x
@200.12%2.87%3.21%1.1x

Production Model (R14) — Graph-Regularized GPT-2

MetricValue
Coherence@10 (all nodes)3.88%
Recall@103.71%
Edge cosine similarity0.581
Isotropy0.207
Embedding collapse0.33%

R14 trained with InfoNCE contrastive loss + lexeme consistency loss. Stratified negatives (K=64): antonyms 15%, BFS-distant 35%, same-POS 20%, random 30%.

t-SNE Cluster Coherence (3,000 lexemes)

DomainBaselineR14Gain
Overall0.2610.4361.7x
Buddhist Ceremony0.2150.7663.6x
Buddhist Religious0.2210.7143.2x
Abbreviations0.3760.8502.3x
Body & Medical0.2340.3091.3x

4.3 Downstream: POS Tagging

To validate that tokenizer quality translates to downstream performance, we evaluate on POS tagging using the khPOS corpus (78,000 sentences, 14 POS tags). Protocol: tokenize each word, extract frozen embeddings (mean-pooled subwords), train a linear probe (SGDClassifier), 5-fold cross-validation with bootstrap confidence intervals.

Table 4: Downstream POS Tagging (khPOS, 5-fold CV)

ModelVocabF1 Macro95% CI
Tokkonizer-KM V3f8K0.9276 ± 0.0013[0.9270, 0.9283]
XLM-RoBERTa-base250K0.9254 ± 0.0013[0.9248, 0.9261]
mT5-small250K0.9232 ± 0.0018[0.9226, 0.9239]
Qwen 2.5-0.5B151K0.7559 ± 0.0036[0.7552, 0.7569]

Tokkonizer-KM outperforms mT5 and XLM-R with non-overlapping bootstrap CIs, despite a vocabulary 31x smaller. Qwen 2.5's byte-level BPE fragments Khmer text into semantically meaningless byte sequences.

4.4 Failure Analysis: V6.5

The V6.5 model (32K vocabulary) catastrophically failed all quality gates (cultural preservation 11%, TPC 0.617). Root causes: (1) database segmentation skipped due to performance bottleneck, (2) artificial oversampling distorted frequency distributions, (3) only 210/32,000 tokens used on real text. This motivated the shift to real corpus data, reduced vocabulary (8K), and graph regularization.

5. Discussion

5.1 The Value of Khmer-Native Tokenization

The most striking result is that a dedicated 8K-vocabulary Khmer tokenizer outperforms 250K-vocabulary multilingual tokenizers on every Khmer-specific metric. Multilingual models spread their vocabulary budget across 100+ languages, leaving Khmer with a tiny effective vocabulary. V3f concentrates all 8K tokens on Khmer, achieving better compression AND better linguistic preservation. The practical implication: any Khmer NLP application benefits from using V3f.

5.2 Graph Regularization: Where It Helps vs. Hurts

Contrastive learning pulls related embeddings together (good for mean-pooled cosine similarity and RAG) but reduces discriminability (bad for ColBERT-style token-level max-pooling). Practitioners should choose: graph-regularized embeddings for RAG and semantic search; standard embeddings for token-level re-rankers.

5.3 Minimal UDS Is More

Our most counterintuitive finding: adding only 7 user-defined symbols outperformed adding 500. Over-constraining the vocabulary budget (6.25% pre-allocated at 500 UDS) reduced the EM optimizer's ability to learn natural word boundaries. The optimal recipe: large corpus + minimal targeted UDS for critical terms the optimizer would otherwise fragment.

5.4 Source-Batch Clustering: A Training Artifact

During analysis, we discovered that initial graph-regularized embeddings exhibited clustering by data source rather than by semantic domain. Words originating from the same import file had 5-6x higher intra-group cosine similarity than inter-group, regardless of actual meaning.

Root causes: (1) metadata labels reflected source files, not semantic domains; (2) distributional edges on contaminated embeddings perpetuated the artifact; (3) 25,844 antonym pairs contradicted the InfoNCE negative sampler. Fix: LLM-assisted reclassification into 76 semantic domains, removal of antonyms from attraction edges, and corpus line shuffling. This methodological insight is generalizable to other graph-augmented training pipelines.

5.5. Limitations

  • Sanskrit/Pali circularity: 7 of 15 test terms were user-defined symbols. True EM optimizer success rate is 87.5% (7/8 non-UDS terms).
  • ALT segmentation in-domain: 99.94% boundary F1 benefits from shared ZWSP conventions. Cross-domain would yield ~95-97%.
  • Grapheme break rate (1.08%) slightly exceeds the 1% target — an inherent SentencePiece limitation for Abugida scripts.
  • Coherence measured with a custom metric (Coherence@K) — influenced by the contrastive training objective.
  • Corpus bias: predominantly formal news/Wikipedia. Conversational Khmer is underrepresented.
  • Single-language evaluation: generalization to Thai, Myanmar, Lao not tested.
  • No generation evaluation: POS tagging and segmentation only.

6. Conclusion

We presented Tokkonizer-KM, a Khmer-native tokenizer that outperforms multilingual baselines (mT5, XLM-R, Qwen 2.5) on every Khmer metric with 31x smaller vocabulary, including downstream POS tagging (F1 0.928 ± 0.001 vs 0.925/0.927/0.756, 5-fold CV with non-overlapping bootstrap CIs). The key contributions are: (1) a production-ready 8K tokenizer achieving 93.3% Sanskrit/Pali preservation and lossless round-trip accuracy; (2) graph-regularized contrastive training (InfoNCE with stratified negatives, λ=0.5, τ=0.3) producing 1.7x t-SNE cluster coherence improvement and 3.6x for Buddhist terms; (3) the empirical finding that minimal UDS intervention (7 terms) outperforms aggressive vocabulary pre-allocation; and (4) identification and correction of source-batch clustering artifacts in graph-augmented training.

We release the tokenizer, lexicon database (12,850 lexemes, 4,250 semantic edges, 76 domains), and evaluation framework under Apache 2.0.