Using a script to identify the script of a text
For analysis purposes, I wanted to divide the UD data based on the script of the texts. However, I had a hard time finding a script that automatically detects the script of a text (partly because the word “script” is ambiguous). So, I wrote the following code excerpt, which uses the Unicode script definitions (Scripts.txt).
Update 15-11-2023: The code no longer stores the ranges; it now stores the script name for every code point. This uses more RAM, but lookups are much faster.
import os


class ScriptFinder():
    def __init__(self):
        """
        Class that loads the script definitions from Unicode; it automatically
        downloads them to a text file, and loads them into a list, where every
        index of a valid Unicode code point holds a string with the script name.
        Note that this is not very RAM efficient, but very fast for lookups.
        """
        # Scripts.txt 15.0.0 assigns no script above U+E01EF (917999)
        self.ranges = [None] * 918000
        if not os.path.isfile('scripts/Scripts.txt'):
            os.makedirs('scripts', exist_ok=True)
            os.system('wget https://www.unicode.org/Public/15.0.0/ucd/Scripts.txt --no-check-certificate -O scripts/Scripts.txt')
        with open('scripts/Scripts.txt', encoding='utf-8') as scripts_file:
            for line in scripts_file:
                tok = line.split(';')
                # Skip comment lines and anything that is not "range ; script"
                if line.startswith('#') or len(tok) != 2:
                    continue
                char_range_hex = tok[0].strip().split('..')
                char_range_int = [int(x, 16) for x in char_range_hex]
                script_name = tok[1].strip().split()[0]
                if len(char_range_int) == 1:
                    self.ranges[char_range_int[0]] = script_name
                else:
                    # Note that we include the first and the last character
                    # of the range in the indices, so the first range for
                    # Latin is 65-90: character 65 (A) and 90 (Z) are both
                    # included in the Latin set.
                    for ind in range(char_range_int[0], char_range_int[1] + 1):
                        self.ranges[ind] = script_name

    def find_char(self, char):
        """
        Return the script of a single character. If a longer string
        is passed, it returns the script of the first character.

        Parameters
        ----------
        char: str
            The character to find the script of; if this is a longer
            string, the first character is used.

        Returns
        -------
        script: str
            The name of the script, or None if not found
        """
        if not char:
            # An empty string has no first character to look up
            return None
        if len(char) > 1:
            char = char[0]
        char_idx = ord(char)
        if char_idx >= len(self.ranges):
            return None
        return self.ranges[char_idx]

    def guess_script(self, text):
        """
        Guess the script of a piece of text. It first counts
        how many characters are in each script, and then returns
        the most frequent one. It ignores the None and Common
        (punctuation) classes of Unicode.

        Parameters
        ----------
        text: str
            The input text

        Returns
        -------
        script: str
            Name of the script, or None if no character has a known script
        """
        classes = {}
        for char in text:
            cat = self.find_char(char)
            if cat is None or cat == 'Common':
                continue
            if cat not in classes:
                classes[cat] = 0
            classes[cat] += 1
        if len(classes) == 0:
            return None
        # Most frequent script; ties go to the script that was seen first
        return max(classes, key=classes.get)
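For comparison, here is a rough sketch of the trade-off mentioned in the update note: keeping only the ranges and finding the right one with a binary search. This is not the code used above. The names DEMO_RANGES, STARTS, find_char_in_ranges and guess_script_from_ranges are made up for this example, and the table is a tiny, incomplete hand-written subset for illustration (a real version would be filled from Scripts.txt). The table stays small, but every lookup costs a binary search instead of a single list index.

```python
from bisect import bisect_right
from collections import Counter

# Illustrative, incomplete subset of (first, last, script) ranges.
# Assumption: a real table would be parsed from Scripts.txt.
# The list must be sorted by the first code point of each range.
DEMO_RANGES = [
    (0x0041, 0x005A, "Latin"),     # A-Z
    (0x0061, 0x007A, "Latin"),     # a-z
    (0x0391, 0x03A1, "Greek"),     # Α-Ρ
    (0x03B1, 0x03C9, "Greek"),     # α-ω
    (0x0410, 0x044F, "Cyrillic"),  # А-я
]
STARTS = [first for first, _, _ in DEMO_RANGES]


def find_char_in_ranges(char):
    """Return the script of one character via binary search, or None."""
    code_point = ord(char)
    # Index of the last range whose first code point is <= code_point
    idx = bisect_right(STARTS, code_point) - 1
    if idx < 0:
        return None
    first, last, script = DEMO_RANGES[idx]
    # The range is inclusive on both ends
    return script if code_point <= last else None


def guess_script_from_ranges(text):
    """Return the most frequent script in text, or None if none is known."""
    counts = Counter(
        script
        for script in map(find_char_in_ranges, text)
        if script is not None
    )
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    print(guess_script_from_ranges("Привет, world!"))
```

Because the demo table has no Common entries, punctuation and spaces simply come back as None and are skipped. With a table built from the full Scripts.txt, Common would need to be filtered out explicitly, as guess_script does above.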