Thesis project topics.

Contact me if you are interested in any of these (robv@itu.dk).

Tokenization of social media data

In many NLP benchmarks, tokenized texts are assumed as input to our models. For standard domains, tokenization can be considered a solved problem; for social media text, however, it is non-trivial. The goal of this project is to create a multi-lingual corpus and model for this task. Steps include:
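To see why whitespace or news-style tokenizers fall short here, consider mentions, hashtags, URLs and emoticons. A minimal regex-based baseline could be sketched as follows (the patterns are illustrative only; a real social media tokenizer needs many more, e.g. for emoji and language-specific contractions):

```python
import re

# Illustrative patterns; order matters, more specific alternatives first.
TOKEN_RE = re.compile(r"""
      (?:@\w+)              # mentions: @friend
    | (?:\#\w+)             # hashtags: #nlp
    | (?:https?://\S+)      # URLs
    | (?:[:;=]-?[)(DPp])    # simple ASCII emoticons: :-) ;( =D
    | (?:\w+)               # ordinary word characters
    | (?:[^\w\s])           # remaining punctuation, one character at a time
""", re.VERBOSE)

def tokenize(text):
    """Return a list of tokens for one social media message."""
    return TOKEN_RE.findall(text)
```

A corpus-trained model would then be evaluated against such a rule-based baseline.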

Some related work:

Language identification for many languages

Language identification is a standard NLP task which is often considered solved. However, most current classifiers support only around 100 languages, or are not publicly available. This project makes use of the LTI LangID Corpus and asks the question: how do modern neural network approaches compare to simple character-based classifiers for language identification with over 1,300 languages? Relevant previous work:
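As a concrete example of the "simple character-based classifier" side of the comparison, here is a sketch in the style of rank-based character n-gram profiles (Cavnar & Trenkle); the class name and toy parameters are my own:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams, padded so word boundaries contribute."""
    padded = f" {text.strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class NgramLangID:
    """Rank-based character n-gram language identifier:
    compare a text's n-grams against per-language frequency-rank profiles."""

    def __init__(self, n=3, profile_size=300):
        self.n = n
        self.profile_size = profile_size
        self.profiles = {}

    def fit(self, texts_by_lang):
        for lang, texts in texts_by_lang.items():
            counts = Counter()
            for text in texts:
                counts.update(char_ngrams(text, self.n))
            top = [g for g, _ in counts.most_common(self.profile_size)]
            self.profiles[lang] = {g: rank for rank, g in enumerate(top)}

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        def distance(profile):
            # n-grams missing from a profile get the maximum penalty
            return sum(profile.get(g, self.profile_size) for g in grams)
        return min(self.profiles, key=lambda lang: distance(self.profiles[lang]))
```

With more than 1,300 languages, profile size and n-gram order become interesting hyperparameters in their own right.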

Multi-lingual lexical normalization

Lexical normalization is the task of converting social media language to its canonical equivalent. A traditional machine learning approach (as opposed to deep learning) still holds the state of the art, and most approaches are language-specific. However, a dataset covering 12 languages was recently introduced. More information:
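A common baseline in the normalization literature is a pure lookup model: for each raw token, output the normalization most frequently seen in the training data, and leave unseen tokens unchanged. A sketch:

```python
from collections import Counter, defaultdict

def train_lookup(pairs):
    """Learn, for each raw token, its most frequent normalization
    in the training data (a most-frequent-replacement baseline)."""
    counts = defaultdict(Counter)
    for raw, norm in pairs:
        counts[raw][norm] += 1
    return {raw: c.most_common(1)[0][0] for raw, c in counts.items()}

def normalize(tokens, lookup):
    # Unseen tokens are left as-is: most tokens need no normalization.
    return [lookup.get(tok, tok) for tok in tokens]
```

Beating this baseline across all 12 languages is already a meaningful result.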

Dependency parsing of Danish social media data

Dependency parsing is the task of finding the syntactic relations between words in a sentence. Chapter 14 of the Speech and Language Processing book contains a nice introduction to this topic. Many different languages and domains have been covered by the recent Universal Dependencies project. However, for language types that are not covered, performance is generally lower. Recently, we have collected some non-canonical data samples (DaN+), for which it is uncertain how well current methods would perform. The goal of this project is to annotate a small sample of Danish social media data to evaluate parsers. Then, a variety of approaches for adapting the parser could be studied, including the ones mentioned below:
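Once a small annotated sample exists, parsers are standardly compared with unlabeled and labeled attachment scores (UAS/LAS). Assuming gold and predicted trees are token-aligned, with each word represented by its (head, deprel) pair as in the CoNLL-U format, the metrics can be sketched as:

```python
def attachment_scores(gold, pred):
    """Compute (UAS, LAS) over token-aligned sentences.
    Each sentence is a list of (head, deprel) pairs, one per word."""
    total = uas = las = 0
    for gold_sent, pred_sent in zip(gold, pred):
        for (g_head, g_rel), (p_head, p_rel) in zip(gold_sent, pred_sent):
            total += 1
            if g_head == p_head:        # correct head: unlabeled match
                uas += 1
                if g_rel == p_rel:      # correct head and label
                    las += 1
    return uas / total, las / total
```

In practice one would use the official CoNLL evaluation script, but the core computation is this simple.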

Active learning for POS tagging

POS tagging is the task of classifying words into their syntactic category:

I see the light
PRON VERB DET NOUN

Current POS taggers are usually supervised, which means they rely on human-annotated training data. This data commonly consists of thousands of sentences. To make this process less costly, one can select a more informative sample of words and only annotate this subsample. Previous work (see below) has shown that competitive performance can be obtained with as little as 400 annotated words on English news data. However, it is unclear how this transfers to other languages and domains. In this project, the first step is to evaluate the existing method on a larger sample (i.e., the Universal Dependencies datasets), followed by possible improvements to the model.
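A standard way to pick the informative subsample is uncertainty sampling: let a tagger predict a tag distribution per token and send only the highest-entropy tokens to the annotator. A sketch, assuming some tagger (any will do) provides per-token probability distributions:

```python
import math

def token_entropy(dist):
    """Entropy of one token's predicted tag distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def select_for_annotation(predictions, budget):
    """Uncertainty sampling: return the `budget` token positions whose
    predicted tag distribution has the highest entropy.
    `predictions` is a list of (sent_id, tok_id, dist) triples, where
    `dist` maps POS tags to probabilities."""
    ranked = sorted(predictions,
                    key=lambda item: token_entropy(item[2]),
                    reverse=True)
    return [(sent_id, tok_id) for sent_id, tok_id, _ in ranked[:budget]]
```

Other selection criteria (margin, disagreement between taggers) slot into the same loop.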

Related reading:

Unsupervised code-switch detection

Code-switching is the phenomenon of switching to another language within one utterance. Many previous approaches have been evaluated for a variety of language pairs; however, they are all trained on annotated code-switched data.

To increase the usefulness of such a code-switch detector, the idea is to train a system based on two monolingual datasets to predict language labels on the word level. An example of the desired output is shown below:

@friend u perform besop apa tudey ?
un en en id id en ?
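One unsupervised baseline that fits this setup: estimate a smoothed unigram model from each monolingual corpus and label every word with the language under which it is more probable. This is only a sketch; special labels like `un` (username) and `?` (punctuation) in the example above would need extra rules, and subword or contextual models would likely do better:

```python
from collections import Counter

def unigram_model(corpus, alpha=1.0):
    """Add-alpha smoothed unigram probabilities from a monolingual
    corpus, given as a list of tokenized sentences."""
    counts = Counter(tok.lower() for sent in corpus for tok in sent)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen words
    def prob(word):
        return (counts[word.lower()] + alpha) / (total + alpha * vocab)
    return prob

def label_words(tokens, p_lang1, p_lang2, labels=("en", "id")):
    """Label each token with the language whose model scores it higher."""
    return [labels[0] if p_lang1(tok) >= p_lang2(tok) else labels[1]
            for tok in tokens]
```

Evaluation would still use annotated code-switched data, but only at test time.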

Related reading:

Conversion of NLP tasks to sequence labeling tasks

Because of the continually increasing power of sequence labelers, competitive performance on complex tasks can be obtained by reducing them to sequence labeling problems. This leads to efficient and accurate NLP models. The main setup is: 1) find a task (I have a couple in mind already, of course), 2) convert it to a sequence labeling problem, 3) train a sequence labeler on this conversion, 4) convert the predicted sequence back to the original task and evaluate. This is mainly an algorithmic project, as existing sequence labelers can be used.
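The best-known instance of this recipe is encoding labeled spans (e.g. named entities) as per-token BIO tags. The round trip, steps 2 and 4 above, can be sketched as:

```python
def spans_to_bio(n_tokens, spans):
    """Encode labeled spans as per-token BIO tags.
    `spans` is a list of (start, end, label), with `end` exclusive."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

def bio_to_spans(tags):
    """Decode BIO tags back into (start, end, label) spans."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if start is not None and tag != f"I-{label}":
            spans.append((start, i, label))
            start = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans
```

The project's interesting part is designing such an encoding and decoding pair for a task where it is not yet obvious.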

Successful examples for some NLP tasks:

Effect of sociodemographic factors on language use

Recent work has shown that including information about the origin of a text can improve performance on NLP tasks. However, it is unclear which specific sociodemographic attributes correlate with language use. Recent efforts on annotating social media data could give us more insights.