Multilingual Twitter word embeddings
Since I started working on Twitter data, word embeddings have proven to be very useful. In the last two years, word embeddings have mostly been replaced by contextualized, transformer-based embeddings. For multilingual contextual Twitter embeddings I refer to Barbieri et al. (also available on Hugging Face). However, there are still many cases where word embeddings are preferable, because of their efficiency, or because they still result in better performance. Indeed, they are sometimes still not outperformed, for example on the task of lexical normalization (SOTA).
I have prepared word embeddings for all languages included in the Multi-LexNorm shared task. The procedure was as follows:
- Downloaded a sample of tweets from 2012-2020 from archive.org
- Used the fastText language classifier for language identification; empirically, its predictions looked much better than the Twitter-provided language labels.
- For the code-switched language pairs, I simply concatenated both monolingual datasets, as filtering for code-switched data is non-trivial.
- Cleaned usernames and URLs, to shrink the vocabulary and to anonymize the data.
- Removed duplicate tweets (note that this step stores intermediate results in /dev/shm, which should be quite large).
- Trained word2vec embeddings on the cleaned, deduplicated data.
You might notice that I did not do any tokenization. This is not because I forgot: any consistent errors made by a tokenizer would lead to specific words being excluded from the vocabulary.
The sizes of the files and the numbers of characters, words, and tweets are:
The results of this procedure are hosted at: http://www.itu.dk/people/robv/data/embeds/
Besides the embeddings, I have also counted uni- and bigram frequencies on the same data, using a minimum frequency of 3. The results are shared at: http://www.itu.dk/people/robv/data/ngrams/
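The n-gram counting can be sketched as below. The whitespace split mirrors the no-tokenization choice above; the output format of the released files is not specified in the post, so returning plain dictionaries is an assumption.

```python
# Sketch of uni- and bigram counting with a minimum-frequency cutoff.
from collections import Counter

MIN_FREQ = 3  # minimum frequency stated in the post

def count_ngrams(tweets):
    unigrams, bigrams = Counter(), Counter()
    for tweet in tweets:
        toks = tweet.split()  # whitespace split only, no tokenizer
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    # keep only n-grams occurring at least MIN_FREQ times
    prune = lambda counts: {g: n for g, n in counts.items() if n >= MIN_FREQ}
    return prune(unigrams), prune(bigrams)

uni, bi = count_ngrams(["a b a", "a b c", "a b"])
```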