Khmer NLP

Tokenizer-KM V3: Khmer tokenizer playground.

Production-ready SentencePiece Unigram model with an 8,000-token vocabulary for Khmer text processing, search, and NLP pipelines, with 93.3% Sanskrit/Pali preservation and lossless round-trip fidelity.

Vocabulary: 8,000
Sanskrit/Pali preservation: 93.3%
Lossless round-trip: 100%
Speed: 5x faster than mT5


The playground accepts up to 2000 characters of input and shows two outputs: the tokens produced for the input, and the text reconstructed by decoding those tokens.
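A lossless round trip means that decoding the tokens gives back exactly the text that was entered. The sketch below shows one way such a rate could be measured. It is an illustration only, not the code behind this page: `round_trip_rate` is a hypothetical helper, and the toy encoders are stand-ins used so the example runs without the actual model.

```python
from typing import Callable, Iterable, List


def round_trip_rate(
    texts: Iterable[str],
    encode: Callable[[str], List[str]],
    decode: Callable[[List[str]], str],
) -> float:
    """Fraction of texts for which decode(encode(text)) == text exactly."""
    texts = list(texts)
    if not texts:
        raise ValueError("need at least one text")
    exact = sum(1 for t in texts if decode(encode(t)) == t)
    return exact / len(texts)


# Stand-in tokenizer (assumption): one token per Unicode code point.
# It is trivially lossless and only plays the role of the real model.
def toy_encode(text: str) -> List[str]:
    return list(text)


def toy_decode(tokens: List[str]) -> str:
    return "".join(tokens)


# A deliberately lossy stand-in: whitespace is dropped when encoding,
# so any text containing spaces cannot be reconstructed.
def lossy_encode(text: str) -> List[str]:
    return text.split()


def lossy_decode(tokens: List[str]) -> str:
    return "".join(tokens)


samples = ["សួស្តី", "ភាសាខ្មែរ", "Khmer NLP 123"]

print(f"toy tokenizer:   {round_trip_rate(samples, toy_encode, toy_decode):.1%}")
print(f"lossy tokenizer: {round_trip_rate(samples, lossy_encode, lossy_decode):.1%}")
```

With the `sentencepiece` Python package, the `encode` and `decode` methods of a loaded `SentencePieceProcessor` could be passed in place of the toy functions. The model file path would depend on how the model is distributed, which this page does not state.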
