Towards Domain Adaptation for Dutch Social Media Text Through Normalization

When processing data from less canonical domains, current natural language
processing systems trained on canonical (news) data perform poorly. One way of
resolving this problem is to normalize the non-canonical data to more canonical
text before processing. Examples of this approach include Tjong Kim Sang (2016)
and the CLIN27 shared task. Instead of focusing on historical text, however, we
focus on the social media domain. Numerous normalization systems already exist
for English (e.g., Baldwin et al., 2015).

For this purpose, we introduce a newly annotated corpus of 1,000 normalized
Dutch Tweets. We hope that this resource can stimulate further research in this
direction. Additionally, we present the first results on this corpus, obtained
with our own normalization system for Dutch. The output of this normalization
system can be used in a pipeline for POS tagging, parsing and other natural
language processing tasks.

Our normalization model uses word embeddings trained on Twitter and the spell
checker Aspell to generate normalization candidates. Features from the
generation step are then combined with features from two different n-gram
language models: one learned from the source domain (social media) and one from
a more canonical domain (Google n-grams). We combine these features in a binary
random forest classifier and use the classifier's confidence score to rank the
possible candidates.
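
The generate-then-rank setup described above can be illustrated with a minimal
sketch. Everything in it is an assumption made for illustration: the lookup
tables are toy data rather than real embedding, Aspell or n-gram output, the
function names are hypothetical, the original word is assumed to be kept as a
candidate, and a hand-weighted logistic function stands in for the trained
random forest classifier.

```python
import math
from typing import Callable, Dict, List, Tuple

# Toy stand-ins (hypothetical data) for the two candidate generators.
EMBEDDING_NEIGHBOURS = {"ff": [("even", 0.71), ("effe", 0.64)]}
SPELL_SUGGESTIONS = {"ff": ["of", "even"]}

# Toy unigram log-probabilities standing in for the two n-gram language models.
SOCIAL_LM = {"ff": -2.0, "even": -3.0, "effe": -3.5, "of": -2.5}
CANONICAL_LM = {"ff": -9.0, "even": -2.5, "effe": -8.0, "of": -2.0}
UNKNOWN_LOGPROB = -10.0

Features = List[float]


def generate_candidates(word: str) -> Dict[str, Features]:
    """Return candidates with features:
    [embedding similarity, spell-checker reciprocal rank,
     social-media LM score, canonical LM score]."""
    # Assumption: leaving the word unchanged is always one of the candidates.
    generation: Dict[str, List[float]] = {word: [0.0, 0.0]}
    for cand, similarity in EMBEDDING_NEIGHBOURS.get(word, []):
        generation.setdefault(cand, [0.0, 0.0])[0] = similarity
    for rank, cand in enumerate(SPELL_SUGGESTIONS.get(word, []), start=1):
        generation.setdefault(cand, [0.0, 0.0])[1] = 1.0 / rank
    return {
        cand: feats + [SOCIAL_LM.get(cand, UNKNOWN_LOGPROB),
                       CANONICAL_LM.get(cand, UNKNOWN_LOGPROB)]
        for cand, feats in generation.items()
    }


def toy_confidence(features: Features) -> float:
    """Stand-in for the classifier's confidence (NOT a random forest):
    a logistic function over hand-picked, hypothetical weights."""
    weights = [2.0, 1.0, 0.1, 0.5]
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def rank_candidates(
    word: str, confidence: Callable[[Features], float] = toy_confidence
) -> List[Tuple[str, float]]:
    """Score every candidate and sort by confidence, highest first."""
    scored = [(cand, confidence(feats))
              for cand, feats in generate_candidates(word).items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_candidates("ff")` returns the toy candidates ordered by the
stand-in confidence score. In a real system the `confidence` argument would be
replaced by the positive-class probability of a trained classifier.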

The performance of the normalization system compared to inter-annotator
agreement will be discussed in the presentation.