
In the MultiLexNorm shared task (WNUT 2021), we made a first attempt at homogenising multiple lexical normalization datasets, covering a variety of languages, into one standard format. The project was started to improve the evaluation and comparison of existing lexical normalization models, and to shift the focus toward a larger variety of languages. We defined lexical normalization as the task of “transforming an utterance into its standard form, word by word, including both one-to-many (1-n) and many-to-one (n-1) replacements.” An example of an utterance annotated for this task would be:

most social pple r troublsome
most social people are troublesome
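
The word-by-word nature of the annotation above can be illustrated with a minimal Python sketch. It assumes a simple list of (raw, normalized) pairs, which is not necessarily the format used in MultiLexNorm; the helper names `align` and `changed` are hypothetical. It only handles the one-to-one case shown in the example. A 1-n replacement could plausibly be stored as a multi-word string on the normalized side, but that is left out here.

```python
# Hypothetical representation: one (raw, normalized) pair per input word.
raw = "most social pple r troublsome".split()
norm = "most social people are troublesome".split()


def align(raw_tokens, norm_tokens):
    """Pair each raw token with its normalization (one-to-one only)."""
    if len(raw_tokens) != len(norm_tokens):
        raise ValueError("this sketch only handles one-to-one alignments")
    return list(zip(raw_tokens, norm_tokens))


def changed(pairs):
    """Return only the pairs where the annotator replaced the raw word."""
    return [(r, n) for r, n in pairs if r != n]


pairs = align(raw, norm)
for r, n in changed(pairs):
    print(f"{r} -> {n}")
```

In this example, three of the five words are replaced and the other two are left unchanged.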

More examples and information about MultiLexNorm can be found on the task website and in the overview paper.

On this page, I collect references to datasets that were not included in MultiLexNorm, for a variety of reasons: some are word-based, some are not publicly available or shareable, some include translation or transcription, and some I only found out about after the shared task. Hopefully, the MultiLexNorm benchmark will be expanded in the future with more varied languages. Note that I focus on social media datasets here; there are also historical and medical datasets for the lexical normalization task.

| Language | Source | Notes |
|---|---|---|
| Bangla-English | Dutta et al. (2015) | Paper behind paywall |
| Chinese (Mandarin) | Li & Yarowsky (2008) | No context |
| Chinese (Mandarin) | Wang et al. (2013) | No context |
| Danish | Hansen et al. (2023) | Not public, after shared task |
| Flemish | De Clercq et al. (2013) | Not public, includes translation (to Dutch) |
| Finnish | Vehomäki (2022) | After MultiLexNorm |
| Greek | Toska (2020) | |
| Hindi-English | Bhat et al. (2018) | Includes transcription |
| Hindi-English | Makhija et al. (2020) | |
| Indonesian | Kurnia & Yulianti (2020) | There seems to be no word alignment |
| Irish | Cassidy et al. (2022) | |
| Japanese | Kaji & Kitsuregawa (2014) | |
| Japanese | 2017 | |
| Japanese | Higashiyama et al. (20 | |
| Latvian | Deksne (2019) | |
| Portuguese | Costa Bertaglia & Volpe Nunes (2016) | Small |
| Portuguese | Sanches Duran et al. (2015) | Small, Brazilian Portuguese |
| Urdu | Khan et al. (2020) | |
| Uyghur | Tursun & Çakıcı (2017) | Includes transcription |
| Vietnamese | Nguyen et al. (2015) | Not available |
| Singlish | Liu et al. (2022) | Includes translation |

Note: Dutch, Turkish, and English datasets that are not in MultiLexNorm are not listed here yet. For English, a recent survey (Zhang et al. (2022)) lists some of the datasets.
