Normalization datasets
In the MultiLexNorm shared task (WNUT 2021), we made a first attempt at homogenising multiple lexical normalization datasets in a variety of languages into one standard. This project was started to improve the evaluation and comparison of existing lexical normalization models, and to push the focus towards a larger variety of languages. We defined lexical normalization as the task of “transforming an utterance into its standard form, word by word, including both one-to-many (1-n) and many-to-one (n-1) replacements.” An example of an utterance annotated for this task would be:
most | social | pple | r | troublsome |
most | social | people | are | troublesome |
More examples and information about MultiLexNorm can be found on the task website and in the overview paper.
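To make the word-by-word definition concrete, here is a minimal Python sketch of how the aligned example above could be held in memory. The function names (`align`, `changed`, `flatten`) are hypothetical and not part of any MultiLexNorm tooling. The sketch also assumes, purely for illustration, that a 1-n replacement is stored as one normalized entry containing spaces and that an n-1 replacement puts the merged word on the first token and leaves the following entries empty; check the task documentation for the actual file format.

```python
from typing import List, Tuple


def align(raw: List[str], norm: List[str]) -> List[Tuple[str, str]]:
    """Pair each raw token with its normalization, word by word."""
    if len(raw) != len(norm):
        raise ValueError("raw and normalized sequences must have equal length")
    return list(zip(raw, norm))


def changed(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the pairs where the annotator replaced the raw token."""
    return [(r, n) for r, n in pairs if r != n]


def flatten(norm: List[str]) -> List[str]:
    """Turn aligned normalizations into a plain token sequence.

    Assumed convention (illustrative only): a 1-n entry contains spaces,
    and the trailing slots of an n-1 merge are empty strings.
    """
    tokens: List[str] = []
    for entry in norm:
        tokens.extend(entry.split())  # "".split() yields no tokens
    return tokens


# The example utterance from the text above.
raw = ["most", "social", "pple", "r", "troublsome"]
norm = ["most", "social", "people", "are", "troublesome"]

pairs = align(raw, norm)
replacements = changed(pairs)
```

Keeping the two sequences the same length is what makes the annotation "word by word": every raw token has exactly one slot, and splits and merges are expressed inside those slots rather than by changing the number of slots.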
On this page, I collect references to datasets that were not included in MultiLexNorm for a variety of reasons: some are word-based, some are not publicly available or shareable, some include translation or transcription, and some I only found out about after the shared task. Hopefully, the MultiLexNorm benchmark will be expanded in the future with more varied languages. Note that I focus on social media datasets here; there are also historical and medical datasets for the lexical normalization task.
Language | Source | Notes |
---|---|---|
Bangla-English | Dutta et al. (2015) | Paper behind paywall |
Chinese (Mandarin) | Li & Yarowsky (2008) | No context |
Chinese (Mandarin) | Wang et al. (2013) | No context |
Danish | Hansen et al. (2023) | Not public, after shared task |
Flemish | De Clercq et al. (2013) | Not public, includes translation (to Dutch) |
Finnish | Vehomäki (2022) | After MultiLexNorm |
Greek | Toska (2020) | |
Hindi-English | Bhat et al. (2018) | Includes transcription |
Hindi-English | Makhija et al. (2020) | |
Indonesian | Kurnia & Yulianti (2020) | There seems to be no word alignment |
Irish | Cassidy et al. (2022) | |
Japanese | Kaji & Kitsuregawa (2014) | |
Japanese | 2017 | |
Japanese | Higashiyama et al. (20 | |
Latvian | Deksne (2019) | |
Portuguese | Costa Bertaglia & Volpe Nunes (2016) | Small |
Portuguese | Sanches Duran et al. (2015) | Small, Brazilian Portuguese |
Urdu | Khan et al. (2020) | |
Uyghur | Tursun & Çakıcı (2017) | Includes transcription |
Vietnamese | Nguyen et al. (2015) | Not available |
Singlish | Liu et al. (2022) | Includes translation |
Note: Dutch, Turkish, and English datasets that are not in MultiLexNorm are not listed here yet. For English, a recent survey (Zhang et al., 2022) lists some of the datasets.