
For analysis purposes, I wanted to divide the UD data based on the script of the texts. However, I had a hard time finding a script to automatically detect the script of a text (partly because the word “script” is ambiguous). So I wrote the following code excerpt, which uses the Unicode script definitions.

Update 15-11-2023: The code no longer stores the ranges, but the script of every code point. This uses more RAM, but lookups are much faster.
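To illustrate the trade-off mentioned in the update, here is a minimal sketch of both lookup strategies. It is not the earlier version of the code: the binary search over ranges is just one plausible way to do a range-based lookup, and the names `SAMPLE_RANGES`, `lookup_by_range` and `lookup_by_table` are made up for this example. The four ranges are a hand-picked sample in the style of Scripts.txt, not the full file.

```python
from bisect import bisect_right

# Hand-picked sample of inclusive (start, end, script) ranges, sorted by start.
# This is an assumption for illustration; the real Scripts.txt has many more.
SAMPLE_RANGES = [
    (0x0041, 0x005A, 'Latin'),
    (0x0061, 0x007A, 'Latin'),
    (0x0400, 0x0481, 'Cyrillic'),
    (0xAC00, 0xD7A3, 'Hangul'),
]
STARTS = [start for start, _, _ in SAMPLE_RANGES]

def lookup_by_range(char):
    """Binary search over the ranges: little memory, O(log n) per lookup."""
    code = ord(char)
    # Index of the last range whose start is <= code (or -1 if there is none)
    pos = bisect_right(STARTS, code) - 1
    if pos >= 0:
        start, end, script = SAMPLE_RANGES[pos]
        if code <= end:
            return script
    return None

# Flat table: one slot per code point up to the highest one in the sample.
TABLE = [None] * (SAMPLE_RANGES[-1][1] + 1)
for start, end, script in SAMPLE_RANGES:
    for code in range(start, end + 1):
        TABLE[code] = script

def lookup_by_table(char):
    """Direct indexing: more memory, O(1) per lookup."""
    code = ord(char)
    return TABLE[code] if code < len(TABLE) else None
```

Both functions return the same answer for every character; the table simply trades memory for a single list access per character.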

import os
import urllib.request

class ScriptFinder():
    def __init__(self):
        """
        Class that loads the script definitions from Unicode; it automatically
        downloads them to a text file, and loads them into a list, where the
        index of every code point with a script holds the name of that script.
        Note that this is not very RAM efficient, but very fast for lookups.
        """
        # One slot for every Unicode code point (U+0000 to U+10FFFF)
        self.ranges = [None] * 0x110000
        path = 'scripts/Scripts.txt'
        if not os.path.isfile(path):
            os.makedirs('scripts', exist_ok=True)
            url = 'https://www.unicode.org/Public/15.0.0/ucd/Scripts.txt'
            urllib.request.urlretrieve(url, path)
        with open(path, encoding='utf-8') as scripts_file:
            for line in scripts_file:
                tok = line.split(';')
                if line[0] != '#' and len(tok) == 2:
                    char_range_hex = tok[0].strip().split('..')
                    char_range_int = [int(x, 16) for x in char_range_hex]
                    # Drop the trailing comment, keep only the script name
                    script_name = tok[1].strip().split()[0]
                    if len(char_range_int) == 1:
                        self.ranges[char_range_int[0]] = script_name
                    else:
                        # Ranges are inclusive on both ends: the first Latin
                        # range is 65-90, so both character 65 (A) and
                        # character 90 (Z) are in the Latin set.
                        for ind in range(char_range_int[0], char_range_int[1] + 1):
                            self.ranges[ind] = script_name


    def find_char(self, char):
        """
        Return the script of a single character. If a longer string
        is passed, the script of its first character is returned.

        Parameters
        ----------
        char: str
            The character to find the script of; if this is a longer
            string, only the first character is used.

        Returns
        -------
        script: str
            The name of the script, or None if not found
        """
        if len(char) == 0:
            # ord() would raise on an empty string
            return None
        if len(char) > 1:
            char = char[0]
        char_idx = ord(char)
        if char_idx >= len(self.ranges):
            return None
        return self.ranges[char_idx]

    def guess_script(self, text):
        """
        Guess the script of a piece of text. It first counts
        how many characters are in each script, and then returns
        the most frequent one. It ignores the None and Common
        (punctuation) classes of Unicode.

        Parameters
        ----------
        text: str
            The input text

        Returns
        -------
        script: str
            Name of the most frequent script, or None if no
            character in the text belongs to a counted script
        """
        classes = {}
        for char in text:
            cat = self.find_char(char)
            if cat is None or cat == 'Common':
                continue
            if cat not in classes:
                classes[cat] = 0
            classes[cat] += 1
        if len(classes) == 0:
            return None
        # On a tie, the script that was seen first in the text wins
        main_class = max(classes, key=classes.get)
        return main_class
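
For the original goal of dividing a dataset by script, the guesser can be plugged into a small grouping helper. The sketch below is only an illustration: `split_by_script` and `toy_guess` are hypothetical names, and `toy_guess` is a crude offline stand-in so that the example runs without downloading Scripts.txt.

```python
from collections import defaultdict

def split_by_script(texts, guess_script):
    """Group texts by whatever label guess_script returns for each of them."""
    groups = defaultdict(list)
    for text in texts:
        groups[guess_script(text)].append(text)
    return dict(groups)

def toy_guess(text):
    """Offline stand-in (assumption): 'Latin' if any ASCII letter occurs, else None."""
    return 'Latin' if any(c.isascii() and c.isalpha() for c in text) else None

sentences = ['Hello world', '123 !?', 'Goodbye']
groups = split_by_script(sentences, toy_guess)
```

With the class above, the call would presumably look like `split_by_script(sentences, ScriptFinder().guess_script)`; texts without any counted script end up under the `None` key.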
