Skip to content

Tokenizer

Khmercut : A (fast) Khmer word segmentation toolkit.

Khmercut

Tokenize Khmer text using the khmercut library from Seanghay.

This function takes a string of Khmer text as input and uses the khmercut 
library to tokenize it into individual words or tokens. This can be useful 
for various natural language processing tasks such as text analysis, 
machine learning, and information retrieval.

Examples:
    >>> from pykhmernlp.tokenizer import khmercut
    >>> khmercut("ខ្ញុំសុខសប្បាយ")
    ['ខ្ញុំ', 'សុខ', 'សប្បាយ']

Args:
    text (str): The Khmer text to tokenize. It should be a string containing 
                valid Khmer characters.

Returns:
    List[str]: A list of tokens. Each token is a string representing a word 
               or meaningful unit in the Khmer text.
from pykhmernlp.tokenizer import khmercut

khmercut("ឃាត់ខ្លួនជនសង្ស័យ០៤នាក់ ករណីលួចខ្សែភ្លើង នៅស្រុកព្រៃនប់")

# => ['ឃាត់ខ្លួន', 'ជនសង្ស័យ', '០៤', 'នាក់', ' ', 'ករណី', 'លួច', 'ខ្សែភ្លើង', ' ', 'នៅ', 'ស្រុក', 'ព្រៃនប់']