
Since I started working on Twitter data, word embeddings have proven to be very useful. In the last two years, word embeddings have mostly been replaced by contextualized transformer-based embeddings; for multilingual contextualized Twitter embeddings I refer to Barbieri et al. (also available on huggingface). However, there are still many cases where word embeddings are preferable because of their efficiency.

I have prepared word embeddings for all languages included in the Multi-LexNorm shared task. The procedure was as follows:

  • Downloaded a sample of tweets from 2012 to 2020 from archive.org
  • Used the fastText language classifier for language identification; empirical results looked much better than the Twitter-provided language labels (see the sketch after this list).
  • For the code-switched language pairs, I simply concatenated both monolingual datasets, as it is non-trivial to filter for code-switched data.
  • Cleaned usernames and URLs to make the vocabulary smaller and to anonymize the data, using the following command:
sed -r 's/@[^ ][^ ]*//g' | sed -r 's/(http[s]?:\/[^ ]*|www\.[^ ]*)//g'
  • Removed duplicates with the following command (note that it stores intermediate results in /dev/shm, which should be quite large):
sort -T /dev/shm | uniq
  • Trained word2vec with the following settings:
./word2vec/word2vec -train nl.txt -output nl.bin -size 400 -window 5 -cbow 0 -binary 1 -threads 45
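
The exact language-identification script is not included in this post, but a minimal sketch of that step could look as follows. The `fasttext` Python package, the pretrained lid.176.bin model, and the file names are assumptions for illustration, not the script I actually used:

```python
# Minimal sketch: keep only tweets that fastText identifies as Dutch.
# Assumes the pretrained lid.176.bin language-identification model from
# https://fasttext.cc/docs/en/language-identification.html; file names
# are placeholders.
import fasttext

model = fasttext.load_model("lid.176.bin")

with open("tweets.txt", encoding="utf-8", errors="ignore") as inp, \
        open("nl.txt", "w", encoding="utf-8") as out:
    for line in inp:
        text = line.rstrip("\n")
        if not text:
            continue
        labels, probs = model.predict(text)  # e.g. ('__label__nl',), (0.98,)
        if labels[0] == "__label__nl":
            out.write(text + "\n")
```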

You might notice that I did not do any tokenization. This is not because I forgot: any consistent tokenization errors would lead to specific words being excluded from the vocabulary.

The sizes of the files, and the number of characters, words and tweets are:

| Language | Code | Chars | Words | Tweets | Size |
| --- | --- | --- | --- | --- | --- |
| Danish | da | 159,067,945 | 26,410,783 | 2,939,931 | 152M |
| German | de | 4,017,217,589 | 602,955,881 | 72,054,802 | 3.8G |
| English | en | 183,774,280,286 | 31,463,897,778 | 2,526,522,685 | 172G |
| Spanish | es | 75,656,330,294 | 9,602,044,523 | 765,704,695 | 53G |
| Croatian | hr | 99,558,448 | 16,352,437 | 2,007,553 | 95M |
| Indonesian | id | 15,355,311,741 | 2,479,391,528 | 196,348,197 | 15G |
| Indonesian-English | iden | 199,129,592,027 | 33,943,289,306 | 2,722,870,882 | 186G |
| Italian | it | 4,082,095,927 | 650,557,697 | 64,662,978 | 3.9G |
| Dutch | nl | 2,842,694,893 | 480,387,036 | 45,942,710 | 2.7G |
| Slovenian | sl | 192,472,502 | 22,977,241 | 3,577,682 | 184M |
| Serbian | sr | 403,058,101 | 58,043,354 | 5,903,680 | 385M |
| Turkish | tr | 11,400,083,503 | 1,461,947,731 | 133,557,943 | 11G |
| Turkish-German | trde | 15,417,301,092 | 2,064,903,612 | 205,612,745 | 15G |

The results of this procedure are hosted at http://www.itu.dk/people/robv/data/monoise/. Note that smaller/older versions for most languages can be found at http://www.itu.dk/people/robv/data/monoise-old.

When using Gensim, there can be unicode incompatibilities in some of these models; set unicode_errors='ignore' when loading the embeddings. Thanks to Elijah Rippeth for this addition.
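
A minimal loading sketch with Gensim, assuming the Dutch model is stored as nl.bin (as in the training command above):

```python
# Load one of the binary word2vec models with Gensim, ignoring any
# malformed unicode in the vocabulary.
from gensim.models import KeyedVectors

vecs = KeyedVectors.load_word2vec_format(
    "nl.bin", binary=True, unicode_errors="ignore"
)

# Since the training data was not tokenized, look tokens up on a plain
# whitespace split as well.
for tok in "dit is een voorbeeld tweet".split(" "):
    if tok in vecs:
        print(tok, vecs[tok][:5])
```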

Besides the embeddings, I have also counted uni- and bi-gram frequencies on the same data with a minimum frequency of 3. They are saved in a binary format and can be extracted using the code at https://bitbucket.org/robvanderg/utils/src/master/ngrams/.
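
The linked repository contains the actual extraction code for that binary format; purely as an illustration, counting uni- and bi-grams with a frequency cutoff of 3 over the plain-text data could be sketched as follows (file name assumed):

```python
# Rough sketch (not the extraction code linked above): count uni- and
# bi-gram frequencies over whitespace-split tweets, keeping counts >= 3.
from collections import Counter

unigrams, bigrams = Counter(), Counter()
with open("nl.txt", encoding="utf-8", errors="ignore") as f:
    for line in f:
        toks = line.split()
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))

min_freq = 3
unigrams = {w: c for w, c in unigrams.items() if c >= min_freq}
bigrams = {bg: c for bg, c in bigrams.items() if c >= min_freq}
```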
