Below I list all the datasets I was involved in creating.

Lexical Normalization for Italian

In collaboration with Alan Ramponi, Tommaso Caselli, Michele Cafagna, and Lorenzo De Mattei, we developed a dataset for lexical normalization of Italian based on the PoSTWITA data. The dataset contains both random tweets and tweets about politics. Two examples from the dataset:

ma Grillo nn e'quello ke ha fatto a suo tempo il condono
→ ma Grillo non è quello che ha fatto a suo tempo il condono

a Roma invece è cosí primavera che sembra gia giov .
→ a Roma invece è così primavera che sembra già giovedì .
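Examples like the second one above are token-aligned: each raw token maps to exactly one normalized token. A minimal sketch of how such a pair can be represented and inspected (the helper function is hypothetical, not the released data format, which may align tokens differently):

```python
def normalization_pairs(raw, norm):
    """Pair up raw and normalized tokens and flag which ones changed.

    Assumes a one-to-one token alignment between the two versions.
    """
    raw_toks, norm_toks = raw.split(), norm.split()
    assert len(raw_toks) == len(norm_toks), "assumes 1-1 token alignment"
    return [(r, n, r != n) for r, n in zip(raw_toks, norm_toks)]

# Second example from the dataset above.
raw = "a Roma invece è cosí primavera che sembra gia giov ."
norm = "a Roma invece è così primavera che sembra già giovedì ."

pairs = normalization_pairs(raw, norm)
changed = [(r, n) for r, n, diff in pairs if diff]
# changed: [('cosí', 'così'), ('gia', 'già'), ('giov', 'giovedì')]
```

Note that not all examples align one-to-one (in the first example, "e'quello" splits into "è quello"), so a real reader would need to handle one-to-many alignments as well.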

This dataset is available at

Normalization categories

This dataset enables in-depth evaluation of normalization models (e.g. MoNoise); it is based on a taxonomy of different normalization actions. A Twitter corpus was annotated by two annotators (thanks Rik!) and is publicly available.
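The point of a category taxonomy is that a model's performance can be broken down per normalization action rather than reported as a single number. A sketch of such a per-category evaluation (the category labels and the triple layout are illustrative assumptions, not the dataset's actual annotation scheme):

```python
from collections import defaultdict

def per_category_accuracy(examples):
    """Compute accuracy per normalization category.

    examples: list of (category, gold_normalization, predicted) triples.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for cat, gold, pred in examples:
        total[cat] += 1
        correct[cat] += (pred == gold)
    return {cat: correct[cat] / total[cat] for cat in total}

# Hypothetical annotated tokens with a model's predictions.
examples = [
    ("abbreviation", "non", "non"),          # nn -> non, predicted correctly
    ("missing_apostrophe", "l'ho", "lho"),   # prediction missed the fix
    ("missing_apostrophe", "c'è", "c'è"),    # predicted correctly
]
scores = per_category_accuracy(examples)
```

A breakdown like this shows, for instance, that a model handles abbreviations well but struggles with apostrophes, which a corpus-level accuracy would hide.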

MoNoise Treebank

For this treebank I used data from two datasets commonly used in research: the LexNorm dataset and the Owoputi dataset. From these datasets I extracted all tweets that were still available in December 2018, to ensure that the non-tokenized version was also available. I used the normalization and POS annotation from [1] [2] [3]. The POS tags were automatically converted to UPOS and corrected; the dependency annotation was added from scratch with help from Gosse Bouma (thanks!). Predicted normalization was added by MoNoise. Finally, with help from Wessel Reijngoud, a layer of normalization categories, as described above, was added. The final annotation with all layers looks like:

This dataset is available in CoNLL format, with all additional layers stored in the `misc` column.
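In CoNLL-U, the MISC column (the tenth field) holds `Key=Value` pairs separated by `|`, which is how extra layers like these can be stored alongside the dependency annotation. A sketch of pulling them out of a line (the token line and the key names such as `Norm` and `NormCat` are hypothetical; the released treebank may use different keys):

```python
def parse_misc(misc_field):
    """Parse a CoNLL-U MISC column of Key=Value pairs separated by '|'."""
    if misc_field == "_":  # "_" marks an empty field in CoNLL-U
        return {}
    return dict(kv.split("=", 1) for kv in misc_field.split("|"))

# Hypothetical treebank line: token "nn", with normalization layers in MISC.
line = "3\tnn\tnon\tADV\t_\t_\t5\tadvmod\t_\tNorm=non|NormCat=abbreviation"
cols = line.split("\t")
misc = parse_misc(cols[9])
```

With the layers parsed this way, a user can recover the gold normalization, the predicted normalization, and the category for each token from a single file.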

Shared Tasks

Shared tasks are indisputably drivers of progress and interest for problems in NLP. To examine some of their characteristics and potential problems, we annotated a random sample of shared tasks by hand. The process is described in the paper Sharing is Caring: The Future of Shared Tasks.

Human Judgement for Gender Identification

Detecting gender from text is a classic author profiling task. Previous work showed that humans do a reasonable job at this task, being correct in approximately 80% of cases. Automatic methods seem to reach similar performance when enough (similar) training data is available (try it). But how well can humans and machines perform this task when they do not speak the language of the utterances? To answer this question, we had a Portuguese dataset annotated by native speakers of Dutch and a Dutch dataset annotated by native speakers of French. The dataset is available here and the paper with more details here.