Lexical Normalization for Dutch Social Media Texts

Lexical normalization is the task of translating ill-formed or non-standard text into a more standard register. This can be helpful for many natural language processing pipelines, since they are usually trained on standard texts and tend to break down when they encounter the noisy text found in social media domains. Below we show an example of a normalized Dutch tweet:

tgaat goed , vdg rustig aaan .
Het gaat goed , vandaag rustig aan .

There is already some previous work on normalization for Flemish (De Clercq et al., 2013; Schulz et al., 2016). On this dataset, the performance of a state-of-the-art normalization model (van der Goot and van Noord, 2017) is much lower than on the English corpora: 42.5% vs. 86.4%. Upon inspection of the different corpora, this is due to the fact that the Flemish corpus also includes the transformation of some Flemish words into Dutch, and to the difference in the size of the training data. But even when using the same amount of training data, the performance difference remains. Besides this, the corpus makes no distinction between tokenization and normalization edits: it contains 1,322 tokenization replacements and only 708 normalization replacements, which makes tokenization far more important for the final evaluation. On top of that, the corpus is not publicly available and capitalization is not corrected.

We will annotate a new dataset of 1,000 noisy sentences taken from the SoNaR corpus (Oostdijk et al., 2013) with a normalization layer. This dataset can be used to train a normalization model and to confirm whether Dutch is really a more difficult language to normalize. 150 sentences will be annotated by two annotators to obtain an inter-annotator agreement, which also enables inspection of the types of disagreement.

We will train the existing normalization model MoNoise (van der Goot and van Noord, 2017). This normalization model is modular, because the normalization task comprises different types of replacements. The most important candidate generation modules are:

- Aspell: lexical and phonetic edit distances
- Lookup list: generated from the training data
- Word embeddings: trained on Dutch tweets collected between 2012 and 2016, using the same method as described in Tjong Kim Sang and van den Bosch (2013); the top-n closest words in the embedding space are used as candidates

The model uses features from the generation modules as well as some additional features. Of the additional features, the N-gram features are by far the best predictor; these include unigram and bigram probabilities from both standard and non-standard texts. A random forest classifier is used to predict the best normalization candidate based on these features.

At the conference, we will present a live demo of the normalization model.
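To make the inter-annotator analysis concrete, below is a minimal sketch of how token-level agreement on the 150 doubly annotated sentences could be computed. The parallel-token annotation format, the example annotations, and the choice of Cohen's kappa over binary change decisions are assumptions for illustration, not the procedure used in our annotation effort.

```python
# Sketch: raw agreement and Cohen's kappa on the binary decision
# "token was normalized or not", plus a list of the actual disagreements.
# The annotation format (parallel token lists) is an assumption for illustration.
from sklearn.metrics import cohen_kappa_score

original   = ["tgaat", "goed", ",", "vdg", "rustig", "aaan", "."]
annotator1 = ["het gaat", "goed", ",", "vandaag", "rustig", "aan", "."]
annotator2 = ["het gaat", "goed", ",", "vdg", "rustig", "aan", "."]  # hypothetical disagreement on "vdg"

changed1 = [int(o != a) for o, a in zip(original, annotator1)]
changed2 = [int(o != a) for o, a in zip(original, annotator2)]

agreement = sum(a == b for a, b in zip(annotator1, annotator2)) / len(original)
kappa = cohen_kappa_score(changed1, changed2)
disagreements = [(o, a, b) for o, a, b in zip(original, annotator1, annotator2) if a != b]

print(f"raw agreement: {agreement:.2f}, kappa on change decisions: {kappa:.2f}")
print("disagreements:", disagreements)
```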
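The candidate generation step of the modules listed above can be sketched roughly as follows. This is not the actual MoNoise implementation: difflib over a small word list stands in for the Aspell module, the optional embeddings object is assumed to be a gensim KeyedVectors model trained on Dutch tweets, and all data in the example is hypothetical.

```python
# Sketch of MoNoise-style candidate generation from three sources: a lookup
# list built from annotated training pairs, a spell-checker-like module
# (difflib stands in for Aspell here), and nearest neighbours in an
# embedding space (gensim KeyedVectors; optional in this sketch).
import difflib
from collections import defaultdict

def build_lookup(training_pairs):
    """Map each noisy token seen in training to the corrections it received."""
    lookup = defaultdict(set)
    for noisy, normalized in training_pairs:
        if noisy != normalized:
            lookup[noisy].add(normalized)
    return lookup

def generate_candidates(token, lookup, lexicon, embeddings=None, topn=10):
    candidates = {token}                          # keeping the word unchanged is always an option
    candidates |= lookup.get(token, set())        # module: lookup list from the training data
    candidates |= set(difflib.get_close_matches(  # module: stand-in for Aspell edit distances
        token, lexicon, n=topn, cutoff=0.75))
    if embeddings is not None and token in embeddings:
        candidates |= {w for w, _ in              # module: top-n closest words in embedding space
                       embeddings.most_similar(token, topn=topn)}
    return candidates

# Hypothetical toy data for illustration only.
training_pairs = [("vdg", "vandaag"), ("aaan", "aan"), ("tgaat", "het gaat")]
lexicon = ["vandaag", "aan", "goed", "rustig", "het", "gaat"]
lookup = build_lookup(training_pairs)
print(generate_candidates("aaan", lookup, lexicon))
```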
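The feature-based ranking with a random forest can be illustrated in the same spirit. The feature set and the toy unigram probabilities below are hypothetical stand-ins for the much richer features MoNoise uses (including bigram probabilities and the scores of the generation modules); the point is only to show candidates being scored and the best one selected.

```python
# Sketch of candidate ranking: each (token, candidate) pair gets a feature
# vector and a random forest scores how likely the candidate is the correct
# normalization. Feature values and training examples are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def features(token, candidate, uni_std, uni_noisy):
    return [
        float(token == candidate),          # "leave the word as is" indicator
        uni_std.get(candidate, 1e-6),       # unigram probability in standard text
        uni_noisy.get(token, 1e-6),         # unigram probability in noisy text
        abs(len(token) - len(candidate)),   # crude edit-distance proxy
    ]

# Hypothetical unigram models and labelled training pairs.
uni_std = {"vandaag": 1e-3, "aan": 2e-3, "vdg": 1e-7}
uni_noisy = {"vdg": 5e-4, "aaan": 2e-4}
train = [("vdg", "vandaag", 1), ("vdg", "vdg", 0), ("aaan", "aan", 1), ("aaan", "aaan", 0)]

X = np.array([features(t, c, uni_std, uni_noisy) for t, c, _ in train])
y = np.array([label for _, _, label in train])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def rank(token, candidates):
    scored = [(clf.predict_proba([features(token, c, uni_std, uni_noisy)])[0][1], c)
              for c in candidates]
    return max(scored)[1]  # candidate with the highest probability of being correct

print(rank("vdg", {"vdg", "vandaag"}))
```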
Bibliography

Sarah Schulz, Guy De Pauw, Orphée De Clercq, Bart Desmet, Véronique Hoste, Walter Daelemans, and Lieve Macken. 2016. Multimodular text normalization of Dutch user-generated content. ACM Transactions on Intelligent Systems and Technology, 7(4):61:1-61:22.

Orphée De Clercq, Sarah Schulz, Bart Desmet, Els Lefever, and Véronique Hoste. 2013. Normalization of Dutch user-generated content. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2013).

Nelleke Oostdijk, Martin Reynaert, Véronique Hoste, and Ineke Schuurman. 2013. The Construction of a 500 Million Word Reference Corpus of Contemporary Written Dutch. In Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme.

Erik Tjong Kim Sang and Antal van den Bosch. 2013. Dealing with big data: The case of Twitter. Computational Linguistics in the Netherlands Journal, 3.

Rob van der Goot and Gertjan van Noord. 2017. MoNoise: Modeling noise using a modular normalization system. Computational Linguistics in the Netherlands Journal, 7.