Tha

Tha: Khmer Text Normalization and Verbalization Toolkit.

Normalize

Normalize text by removing zero-width spaces (U+200B). Judging by the example output below, it also rewrites ឲ្យ as ឱ្យ.

Args:
    text (str): The text to normalize.

Returns:
    str: The normalized text.

from pykhmernlp.tha import normalize
text_normalize = "មិន\u200bឲ្យ"

print("Original text:", text_normalize)
print(normalize(text_normalize))  # Output: "មិនឱ្យ"
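
The zero-width-space step can be approximated in plain Python. This is a minimal sketch, not pykhmernlp's implementation: `strip_zero_width_spaces` is a hypothetical helper, and it covers only U+200B removal, not the ឲ្យ to ឱ្យ rewrite seen in the output above.

```python
ZERO_WIDTH_SPACE = "\u200b"  # U+200B, often inserted between Khmer words


def strip_zero_width_spaces(text: str) -> str:
    """Remove every U+200B character; all other characters are kept."""
    return text.replace(ZERO_WIDTH_SPACE, "")


print(strip_zero_width_spaces("a\u200bb"))  # ab
```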

Phone Numbers

Process phone numbers in the text by chunking digits with a specified chunk size.

Args:
    text (str): The text containing phone numbers.
    chunk_size (int): The size of each digit chunk.

Returns:
    str: The processed text with phone numbers chunked.

from pykhmernlp.tha import process_phone_numbers
text_phone_numbers = "010123123"

print("Original text:", text_phone_numbers)
print(process_phone_numbers(text_phone_numbers, chunk_size=2))  # Output: "0▁10▁12▁31▁23"
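
The chunking shown above can be sketched as follows. The example output suggests that chunks are aligned from the right, so any short chunk comes first; that alignment is an assumption drawn from this one example, and `chunk_digits` is a hypothetical helper rather than the library's code.

```python
SEP = "\u2581"  # the "▁" separator used in the outputs on this page


def chunk_digits(digits: str, chunk_size: int = 2) -> str:
    """Split a digit string into right-aligned chunks joined by SEP."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    head = len(digits) % chunk_size  # length of the short leading chunk, if any
    chunks = [digits[:head]] if head else []
    chunks += [digits[i:i + chunk_size] for i in range(head, len(digits), chunk_size)]
    return SEP.join(chunks)


print(chunk_digits("010123123", chunk_size=2))  # 0▁10▁12▁31▁23
```

With `chunk_size=3` the same input divides evenly, so no short leading chunk is produced.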

URLs and emails

Process URLs and emails in the text by replacing them with verbalized tokens (for example, "." becomes "dot").

Args:
    text (str): The text containing URLs and emails.

Returns:
    str: The processed text with URLs and emails replaced.

from pykhmernlp.tha import process_urls
text_urls = "https://google.com"

print("Original text:", text_urls)
print(process_urls(text_urls))  # Output: "google dot com"
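
A rough equivalent for the URL case above, as a sketch only: it drops the scheme and spells out each dot. `verbalize_url` is a hypothetical name, and email addresses, paths, and query strings are not handled here.

```python
import re


def verbalize_url(url: str) -> str:
    """Drop a leading scheme such as "https://" and read each "." as "dot"."""
    without_scheme = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)
    return without_scheme.replace(".", " dot ")


print(verbalize_url("https://google.com"))  # google dot com
```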

Time

Process time expressions in the text by separating the hour, the minutes, and the AM/PM letters.

Args:
    text (str): The text containing time expressions.

Returns:
    str: The processed text with time expressions formatted.

from pykhmernlp.tha import process_time
text_time = "10:23AM"

print("Original text:", text_time)
print(process_time(text_time))  # Output: "10 23▁A▁M"
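
The formatting in this example can be reproduced with a single regular expression. This is a sketch under assumptions: only 12-hour `H:MM AM/PM` strings are matched, and `format_time` is a hypothetical helper, not the library's function.

```python
import re

SEP = "\u2581"  # the "▁" separator used in the outputs on this page
TIME_RE = re.compile(r"\b(\d{1,2}):(\d{2})\s*([AaPp])([Mm])\b")


def format_time(text: str) -> str:
    """Rewrite times such as "10:23AM" as hour, minutes, and spelled-out AM/PM letters."""
    def repl(match: re.Match) -> str:
        hour, minute, first, second = match.groups()
        return f"{hour} {minute}{SEP}{first.upper()}{SEP}{second.upper()}"

    return TIME_RE.sub(repl, text)


print(format_time("10:23AM"))  # 10 23▁A▁M
```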

Date

Process date expressions in the text by replacing the separators between year, month, and day with spaces, as in the example below.

Args:
    text (str): The text containing date expressions.

Returns:
    str: The processed text with date expressions formatted.

from pykhmernlp.tha import process_date
text_date = "2024-01-02"

print("Original text:", text_date)
print(process_date(text_date))  # Output: "2024 01 02"
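
For ISO-style dates like the one above, the transformation amounts to swapping hyphens for spaces. The sketch below assumes the `YYYY-MM-DD` layout only; `format_date` is a hypothetical helper, and other date layouts the library may accept are not covered.

```python
import re

ISO_DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def format_date(text: str) -> str:
    """Replace the hyphens in YYYY-MM-DD dates with spaces."""
    return ISO_DATE_RE.sub(r"\1 \2 \3", text)


print(format_date("2024-01-02"))  # 2024 01 02
```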

Hashtags

Process hashtags in the text by removing them.

Args:
    text (str): The text containing hashtags.

Returns:
    str: The processed text with hashtags removed.

from pykhmernlp.tha import process_hashtags
text_hashtags = "Hello world #this_will_remove hello"

print("Original text:", text_hashtags)
print(process_hashtags(text_hashtags))  # Output: "Hello world  hello"
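
Hashtag removal can be sketched with one substitution. The pattern below is an assumption (a `#` followed by everything up to the next whitespace); `remove_hashtags` is a hypothetical helper. As in the output above, the spaces on both sides of the hashtag are left in place.

```python
import re

# "#" plus all following non-whitespace characters. \S is used instead of \w
# so that Khmer vowel signs and other combining marks stay inside the match.
HASHTAG_RE = re.compile(r"#\S+")


def remove_hashtags(text: str) -> str:
    """Delete hashtags but leave the surrounding whitespace untouched."""
    return HASHTAG_RE.sub("", text)


print(repr(remove_hashtags("Hello world #this_will_remove hello")))  # 'Hello world  hello'
```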

ASCII Lines

Process ASCII lines (runs of characters such as "---") in the text by removing them.

Args:
    text (str): The text containing ASCII lines.

Returns:
    str: The processed text with ASCII lines removed.

from pykhmernlp.tha import process_ascii_lines
text_ascii_lines = "Remove --- asdasd"

print("Original text:", text_ascii_lines)
print(process_ascii_lines(text_ascii_lines))  # Output: "Remove  asdasd"
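
A sketch of the same idea: treat any run of three or more rule-like characters as a line and delete it. Both the character set and the minimum length are assumptions made for illustration, and `remove_ascii_lines` is a hypothetical helper.

```python
import re

# Three or more consecutive "-", "=", "_", "*" or "~" characters (assumed set).
ASCII_LINE_RE = re.compile(r"[-=_*~]{3,}")


def remove_ascii_lines(text: str) -> str:
    """Delete runs of rule characters; surrounding whitespace is kept."""
    return ASCII_LINE_RE.sub("", text)


print(repr(remove_ascii_lines("Remove --- asdasd")))  # 'Remove  asdasd'
```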

Cambodia License Plate

Process Cambodia license plate numbers in the text by formatting them.

Args:
    text (str): The text containing Cambodia license plate numbers.

Returns:
    str: The processed text with license plate numbers formatted.

from pykhmernlp.tha import process_license_plate
text_license_plate = "1A 1234"

print("Original text:", text_license_plate)
print(process_license_plate(text_license_plate))  # Output: "1 A 12▁34"
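
The plate formatting above can be sketched as a pattern match followed by digit chunking. The plate layout assumed here (one digit, one or two capital letters, an optional space or hyphen, four digits) is inferred from the single example and may not cover every real plate; `format_plate` is a hypothetical helper.

```python
import re

SEP = "\u2581"  # the "▁" separator used in the outputs on this page
PLATE_RE = re.compile(r"\b(\d)([A-Z]{1,2})[ -]?(\d{4})\b")


def format_plate(text: str) -> str:
    """Space out the prefix of a plate and split its four digits into pairs."""
    def repl(match: re.Match) -> str:
        region, series, number = match.groups()
        return f"{region} {series} {number[:2]}{SEP}{number[2:]}"

    return PLATE_RE.sub(repl, text)


print(format_plate("1A 1234"))  # 1 A 12▁34
```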

Number - Cardinals

Process cardinal numbers in the text by converting them to Khmer words.

Args:
    text (str): The text containing cardinal numbers.

Returns:
    str: The processed text with cardinal numbers converted to Khmer words.

from pykhmernlp.tha import process_cardinals
text_cardinals = "1234"

print("Original text:", text_cardinals)
print(process_cardinals(text_cardinals))  # Output: "មួយពាន់▁ពីររយ▁សាមសិបបួន"
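
To show how such a conversion can work, here is a small sketch limited to 0 through 9999. It groups thousands, hundreds, and the final tens-and-ones part, joined by "▁" as in the output above. The range limit, the helper name `cardinal_to_khmer`, and the grouping rule are assumptions; the library's own algorithm may differ.

```python
SEP = "\u2581"  # the "▁" separator used in the outputs on this page
ONES = ["", "មួយ", "ពីរ", "បី", "បួន", "ប្រាំ",
        "ប្រាំមួយ", "ប្រាំពីរ", "ប្រាំបី", "ប្រាំបួន"]
TENS = ["", "ដប់", "ម្ភៃ", "សាមសិប", "សែសិប", "ហាសិប",
        "ហុកសិប", "ចិតសិប", "ប៉ែតសិប", "កៅសិប"]


def cardinal_to_khmer(n: int) -> str:
    """Spell out an integer from 0 to 9999 in Khmer words."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only covers 0..9999")
    if n == 0:
        return "សូន្យ"
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    tens, ones = divmod(rest, 10)
    groups = []
    if thousands:
        groups.append(ONES[thousands] + "ពាន់")
    if hundreds:
        groups.append(ONES[hundreds] + "រយ")
    if tens or ones:
        groups.append(TENS[tens] + ONES[ones])
    return SEP.join(groups)


print(cardinal_to_khmer(1234))
```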

Number - Decimals

Process decimal numbers in the text by converting them to Khmer words.

Args:
    text (str): The text containing decimal numbers.

Returns:
    str: The processed text with decimal numbers converted to Khmer words.

from pykhmernlp.tha import process_decimals
text_decimals = "123.324"

print("Original text:", text_decimals)
print(process_decimals(text_decimals))  # Output: "មួយរយ▁ម្ភៃបី▁ចុច▁បីរយ▁ម្ភៃបួន"

Number - Ordinals

Process ordinal numbers in the text by converting them to Khmer words.

Args:
    text (str): The text containing ordinal numbers.

Returns:
    str: The processed text with ordinal numbers converted to Khmer words.

from pykhmernlp.tha import process_ordinals
text_ordinals = "5th"

print("Original text:", text_ordinals)
print(process_ordinals(text_ordinals))  # Output: "ទី▁ប្រាំ"

Number - Currency

Process currency expressions in the text by converting them to Khmer words.

Args:
    text (str): The text containing currency expressions.

Returns:
    str: The processed text with currency expressions converted to Khmer words.

from pykhmernlp.tha import process_currency
text_currency = "$100"

print("Original text:", text_currency)
print(process_currency(text_currency))  # Output: "មួយរយដុល្លារ▁មួយសេន"

Parenthesis

Process parentheses in the text by removing them together with the text they enclose.

Args:
    text (str): The text containing parentheses.

Returns:
    str: The processed text with parenthesized content removed.

from pykhmernlp.tha import process_parenthesis
text_parenthesis = "Hello (this will be ignored) world"

print("Original text:", text_parenthesis)
print(process_parenthesis(text_parenthesis))  # Output: "Hello world"
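
The removal above can be sketched with a regular expression that deletes each parenthesized span along with the whitespace in front of it. Nested parentheses are not handled, and `remove_parenthesized` is a hypothetical helper, not the library's implementation.

```python
import re

# Optional leading whitespace, then "(", any non-")" characters, then ")".
PAREN_RE = re.compile(r"\s*\([^)]*\)")


def remove_parenthesized(text: str) -> str:
    """Delete parenthesized spans and the whitespace immediately before them."""
    return PAREN_RE.sub("", text)


print(remove_parenthesized("Hello (this will be ignored) world"))  # Hello world
```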

Iteration Mark

Process the Khmer iteration mark (ៗ) in the text by expanding it into an explicit repetition of the preceding word or phrase.

Args:
    text (str): The text containing the iteration mark.
    tokenizer: A function used for tokenization.

Returns:
    str: The processed text with the iteration mark replaced by the repeated word or phrase.

from pykhmernlp.tha import process_repeater
text_repeater = "គាត់បានទៅបន្តិចម្ដងៗហើយ"

def fake_tokenizer(_):
    return ["គាត់", "បាន", "ទៅ", "បន្តិច", "ម្ដង"]

print(process_repeater(text_repeater, tokenizer=fake_tokenizer))  # Output: "គាត់បានទៅបន្តិចម្ដង▁បន្តិចម្ដងហើយ"
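
The role of the tokenizer can be illustrated with a small sketch: split the text at the iteration mark, tokenize what precedes it, and repeat the trailing tokens. How the library decides how many tokens to repeat is not documented here, so the `n_tokens` parameter is a hypothetical stand-in for that decision, and `expand_iteration_mark` is not part of pykhmernlp. Only the first mark in the text is expanded.

```python
from typing import Callable, List

SEP = "\u2581"             # the "▁" separator used in the outputs on this page
ITERATION_MARK = "\u17d7"  # Khmer sign "ៗ"


def expand_iteration_mark(
    text: str,
    tokenizer: Callable[[str], List[str]],
    n_tokens: int = 1,  # hypothetical: number of trailing tokens to repeat
) -> str:
    """Replace the first iteration mark with a copy of the preceding tokens."""
    if n_tokens < 1:
        raise ValueError("n_tokens must be at least 1")
    head, mark, tail = text.partition(ITERATION_MARK)
    if not mark:
        return text  # no iteration mark present
    tokens = tokenizer(head)
    repeated = "".join(tokens[-n_tokens:])
    return f"{head}{SEP}{repeated}{tail}"


# Latin letters stand in for Khmer tokens to keep the example easy to follow.
print(expand_iteration_mark("abc\u17d7d", lambda s: ["a", "b", "c"], n_tokens=2))  # abc▁bcd
```

Called with the text and `fake_tokenizer` from the example above and `n_tokens=2`, this sketch repeats the last two tokens, which is the span repeated in the documented output.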