
Since I started working on Twitter data, word embeddings have proven to be very useful. In the last two years, word embeddings have mostly been replaced by contextualized transformer-based embeddings; for multilingual contextualized Twitter embeddings I refer to Barbieri et al. (also available on huggingface). However, there are still many cases where word embeddings are preferable because of their efficiency.

I have prepared word embeddings for all languages included in the Multi-LexNorm shared task. The procedure was as follows:

  • Downloaded a sample of tweets from 2012 to 2020 from archive.org
  • Used the fastText language classifier for language identification; empirical results looked much better than the Twitter-provided language labels (see the sketch after this list).
  • For the code-switched language pairs, I simply concatenated both monolingual datasets, as it is non-trivial to filter for code-switched data.
  • Cleaned usernames and URLs to make the vocabulary smaller and to anonymize the data, using the following command:
sed -r 's/@[^ ][^ ]*//g' | sed -r 's/(http[s]?:\/[^ ]*|www\.[^ ]*)//g'
  • Removed duplicates with the following command (note that it stores intermediate results in /dev/shm, which should be quite large):
sort -T /dev/shm | uniq
  • Trained word2vec with the following settings:
./word2vec/word2vec -train nl.txt -output nl.bin -size 400 -window 5 -cbow 0 -binary 1 -threads 45
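
The exact language-identification script is not included in this post, but a minimal sketch of that step could look as follows. The `fasttext` Python package, the pretrained lid.176.bin model, and the file names are assumptions for illustration, not the script I actually used:

```python
# Minimal sketch: keep only tweets that fastText identifies as Dutch.
# Assumes the pretrained lid.176.bin language-identification model from
# https://fasttext.cc/docs/en/language-identification.html; file names
# are placeholders.
import fasttext

model = fasttext.load_model("lid.176.bin")

with open("tweets.txt", encoding="utf-8", errors="ignore") as inp, \
        open("nl.txt", "w", encoding="utf-8") as out:
    for line in inp:
        text = line.rstrip("\n")
        if not text:
            continue
        labels, probs = model.predict(text)  # e.g. ('__label__nl',), (0.98,)
        if labels[0] == "__label__nl":
            out.write(text + "\n")
```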

You might notice that I did not do any tokenization. This is not because I forgot: any consistent tokenization errors would lead to specific words being excluded from the vocabulary.

The sizes of the files, and the number of characters, words and tweets are:

| Language | Code | Chars | Words | Tweets | Size |
| --- | --- | --- | --- | --- | --- |
| Danish | da | 159,067,945 | 26,410,783 | 2,939,931 | 152M |
| German | de | 4,017,217,589 | 602,955,881 | 72,054,802 | 3.8G |
| English | en | 183,774,280,286 | 31,463,897,778 | 2,526,522,685 | 172G |
| Spanish | es | 75,656,330,294 | 9,602,044,523 | 765,704,695 | 53G |
| Croatian | hr | 99,558,448 | 16,352,437 | 2,007,553 | 95M |
| Indonesian | id | 15,355,311,741 | 2,479,391,528 | 196,348,197 | 15G |
| Indonesian-English | iden | 199,129,592,027 | 33,943,289,306 | 2,722,870,882 | 186G |
| Italian | it | 4,082,095,927 | 650,557,697 | 64,662,978 | 3.9G |
| Dutch | nl | 2,842,694,893 | 480,387,036 | 45,942,710 | 2.7G |
| Slovenian | sl | 192,472,502 | 22,977,241 | 3,577,682 | 184M |
| Serbian | sr | 403,058,101 | 58,043,354 | 5,903,680 | 385M |
| Turkish | tr | 11,400,083,503 | 1,461,947,731 | 133,557,943 | 11G |
| Turkish-German | trde | 15,417,301,092 | 2,064,903,612 | 205,612,745 | 15G |

The results of this procedure are hosted at http://www.itu.dk/people/robv/data/monoise/. Note that smaller/older versions for most languages can be found at http://www.itu.dk/people/robv/data/monoise-old.

When using Gensim, there can be unicode incompatibilities in some of these models; set unicode_errors='ignore' when loading the embeddings. Thanks to Elijah Rippeth for this addition.
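
A minimal loading sketch with Gensim, assuming the Dutch model is stored as nl.bin (as in the training command above):

```python
# Load one of the binary word2vec models with Gensim, ignoring any
# malformed unicode in the vocabulary.
from gensim.models import KeyedVectors

vecs = KeyedVectors.load_word2vec_format(
    "nl.bin", binary=True, unicode_errors="ignore"
)

# Since the training data was not tokenized, look tokens up on a plain
# whitespace split as well.
for tok in "dit is een voorbeeld tweet".split(" "):
    if tok in vecs:
        print(tok, vecs[tok][:5])
```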

Besides the embeddings, I have also counted uni- and bi-gram frequencies on the same data with a minimum frequency of 3. They are saved in a binary format and can be extracted using the code at https://bitbucket.org/robvanderg/utils/src/master/ngrams/.
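
The linked repository contains the actual extraction code for that binary format; purely as an illustration, counting uni- and bi-grams with a frequency cutoff of 3 over the plain-text data could be sketched as follows (file name assumed):

```python
# Rough sketch (not the extraction code linked above): count uni- and
# bi-gram frequencies over whitespace-split tweets, keeping counts >= 3.
from collections import Counter

unigrams, bigrams = Counter(), Counter()
with open("nl.txt", encoding="utf-8", errors="ignore") as f:
    for line in f:
        toks = line.split()
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))

min_freq = 3
unigrams = {w: c for w, c in unigrams.items() if c >= min_freq}
bigrams = {bg: c for bg, c in bigrams.items() if c >= min_freq}
```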
