Multilingual Twitter word embeddings

Ever since I started working on Twitter data, word embeddings have proven very useful. Over the last two years, they have largely been replaced by contextualized, transformer-based embeddings. For multilingual contextual Twitter embeddings, I refer to Barbieri et al. (also available on Hugging Face). However, there are still many cases where word embeddings are preferable, either because of their efficiency or because they still yield better performance. Indeed, on some tasks they have not yet been outperformed, for example lexical normalization (SOTA).

I have prepared word embeddings for all languages included in the Multi-LexNorm shared task. The procedure was as follows:

You might notice that I did not do any tokenization. This is not an oversight: any consistent errors made by a tokenizer would cause specific words to be excluded from the vocabulary.

The sizes of the files, and the number of characters, words and tweets are:

Language            Code            Chars           Words         Tweets   Size
Danish              da        159,067,945      26,410,783      2,939,931   152M
German              de      4,017,217,589     602,955,881     72,054,802   3.8G
English             en    183,774,280,286  31,463,897,778  2,526,522,685   172G
Spanish             es     75,656,330,294   9,602,044,523    765,704,695    53G
Croatian            hr         99,558,448      16,352,437      2,007,553    95M
Indonesian          id     15,355,311,741   2,479,391,528    196,348,197    15G
Indonesian-English  iden  199,129,592,027  33,943,289,306  2,722,870,882   186G
Italian             it      4,082,095,927     650,557,697     64,662,978   3.9G
Dutch               nl      2,842,694,893     480,387,036     45,942,710   2.7G
Slovenian           sl        192,472,502      22,977,241      3,577,682   184M
Serbian             sr        403,058,101      58,043,354      5,903,680   385M
Turkish             tr     11,400,083,503   1,461,947,731    133,557,943    11G
Turkish-German      trde   15,417,301,092   2,064,903,612    205,612,745    15G

The results of this procedure are hosted on: http://www.itu.dk/people/robv/data/embeds/

When loading these models with Gensim, some of them can trigger Unicode decoding errors; to avoid this, set unicode_errors='ignore' when loading the embeddings. Thanks to Elijah Rippeth for this addition.
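
As a rough illustration of the Gensim note above, loading could look like the sketch below. KeyedVectors.load_word2vec_format and its unicode_errors parameter are part of Gensim; everything else is an assumption of mine: the function name load_twitter_embeddings, the idea that the files are in binary word2vec format (hence binary=True), and the helper decode_token, which only mimics what the 'ignore' policy does to a vocabulary entry containing invalid UTF-8.

```python
def load_twitter_embeddings(path, binary=True):
    """Load word2vec-format embeddings, dropping undecodable bytes.

    Assumption: the file is in binary word2vec format; pass binary=False
    if it turns out to be the plain-text variant.
    """
    # Imported lazily so the rest of this sketch runs without Gensim installed.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(
        path, binary=binary, unicode_errors="ignore"
    )


def decode_token(raw, errors="ignore"):
    """Hypothetical helper: decode one vocabulary entry under an error policy."""
    return raw.decode("utf-8", errors=errors)


# A token that ends in a truncated multi-byte sequence.
broken = b"caf\xc3"

# With 'ignore', the stray byte is silently dropped.
print(decode_token(broken))

# With the default 'strict' policy, decoding fails instead.
try:
    decode_token(broken, errors="strict")
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)
```

Note that 'ignore' silently shortens affected words, so a handful of vocabulary entries may differ slightly from the strings that occurred in the original tweets.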

Besides the embeddings, I have also counted unigram and bigram frequencies on the same data, using a minimum frequency of 3. The results are shared at: http://www.itu.dk/people/robv/data/ngrams/
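
To make the counting concrete, here is a minimal sketch of how such frequencies could be computed. It is not the script that produced the hosted files. It assumes one tweet per string, plain whitespace splitting (in line with the choice not to tokenize), and that "minimum frequency of 3" means an n-gram is kept when it occurs at least 3 times. The function name count_ngrams and the sample tweets are made up for illustration.

```python
from collections import Counter


def count_ngrams(tweets, min_freq=3):
    """Count whitespace-delimited unigrams and bigrams, keeping frequent ones."""
    unigrams = Counter()
    bigrams = Counter()
    for tweet in tweets:
        tokens = tweet.split()  # whitespace only, no tokenizer
        unigrams.update(tokens)
        # Pair each token with its successor to form bigrams.
        bigrams.update(zip(tokens, tokens[1:]))

    def frequent(counts):
        # Assumption: "minimum frequency" is inclusive (count >= min_freq).
        return {ngram: n for ngram, n in counts.items() if n >= min_freq}

    return frequent(unigrams), frequent(bigrams)


sample = ["god morgen", "god morgen alle", "god morgen :)", "god nat"]
uni, bi = count_ngrams(sample)
print(uni)  # {'god': 4, 'morgen': 3}
print(bi)   # {('god', 'morgen'): 3}
```

For corpora of the sizes listed above, holding every n-gram in memory before filtering may not be practical, so a real run would likely need sharding or on-disk counting.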