Thesis project topics.
Below I list some research ideas that I would like to supervise for a research project/thesis. These can also be seen as research directions that I'm interested in, so if you are interested in related projects feel free to contact me as well (robv@itu.dk).
Tokenization of social media data
In many NLP benchmarks, tokenized texts are assumed as input to our models. For standard domains, tokenization can be considered a solved problem; however, for social media text, tokenization is non-trivial. The goal of this project is to create a multi-lingual corpus and model for this task. Steps include:
- Find the original utterances of Multi-LexNorm
- Create a gold standard dataset based on the original and the tokenized data.
- Evaluate existing tokenizers and train your own (a minimal evaluation sketch is included below).
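As a rough illustration of the evaluation step, here is a minimal sketch that scores a tokenizer against gold tokens by comparing character-offset spans; the example sentence, the gold tokenization, and the whitespace baseline are all made up for illustration.

```python
# Minimal sketch: compare a tokenizer's output to gold tokens by aligning
# character offsets (assumes gold and system tokenizations of the same raw text).

def token_spans(raw, tokens):
    """Map each token to its (start, end) character offsets in the raw text."""
    spans, pos = [], 0
    for tok in tokens:
        start = raw.index(tok, pos)      # assumes every token occurs in the raw text
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans

def f1(gold_spans, pred_spans):
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

raw = "soo happy:)see u tmrw!!"
gold = ["soo", "happy", ":)", "see", "u", "tmrw", "!!"]
pred = raw.split()                       # naive whitespace baseline
print(f1(token_spans(raw, gold), token_spans(raw, pred)))
```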
Efficient language identification for many languages
Language identification is a standard NLP task, which is often considered to be solved. However, most current classifiers only support around 100 languages, or are not publicly available. This project makes use of the LTI LangID Corpus (with >1300 languages), and asks the question: how can we efficiently handle such a large label space, and such a wide variety in input features? A minimal baseline sketch follows the list of previous work below. Relevant previous work:
- Non-linear Mapping for Improved Identification of 1300+ Languages
- A Fast, Compact, Accurate Model for Language Identification of Codemixed Text
- Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus
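Below is a minimal baseline sketch, assuming the training data can be loaded as (text, language code) pairs; the two example sentences and label codes are placeholders. It uses hashed character n-gram features so the feature space stays fixed no matter how many languages are added, which is one simple way to start thinking about the efficiency question.

```python
# A minimal baseline sketch for large-label-space language ID; the training data
# below is a placeholder, real data loading depends on how the corpus is stored.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

train_texts = ["dette er en sætning", "this is a sentence"]   # placeholder data
train_langs = ["dan", "eng"]

model = make_pipeline(
    # character n-grams are robust across scripts; hashing keeps the
    # feature space fixed regardless of how many languages are added
    HashingVectorizer(analyzer="char_wb", ngram_range=(1, 4), n_features=2**18),
    SGDClassifier(loss="log_loss"),      # logistic loss ("log" in older scikit-learn)
)
model.fit(train_texts, train_langs)
print(model.predict(["endnu en sætning"]))
```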
Massive Multi-task Learning
Multiple recent papers have shown the effectiveness of massive multi-task learning. In this paradigm, we take a pre-trained language model, retrain it on as many NLP tasks as we can find, and use the resulting weights for our final model training. However, most of these works make use of a sequence-to-sequence architecture, instead of an autoencoder (e.g. BERT) model. The paper below does an introductory study for semantically oriented tasks with 7 datasets (19 tasks), and shows mixed results. This project aims to extend this study (note that SemEval shares ~10 datasets each year), and evaluate the effect of this scaling.
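As a sketch of what the encoder-based variant could look like, the snippet below wires a shared pre-trained encoder to one classification head per task; the task names, label counts, and model name are placeholders, and a real setup would add proper data loading, task sampling, and training loops.

```python
# A minimal sketch of multi-task fine-tuning with a shared encoder (e.g. a
# BERT-style model) and one classification head per task; tasks are made up.
import torch
from transformers import AutoModel, AutoTokenizer

encoder_name = "bert-base-multilingual-cased"
tasks = {"sentiment": 3, "nli": 3, "topic": 10}          # task -> number of labels

tokenizer = AutoTokenizer.from_pretrained(encoder_name)
encoder = AutoModel.from_pretrained(encoder_name)
heads = torch.nn.ModuleDict({t: torch.nn.Linear(encoder.config.hidden_size, n)
                             for t, n in tasks.items()})

def forward(task, sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state[:, 0]    # [CLS] representation
    return heads[task](hidden)                           # logits for this task

# One training step would sample a task, compute a loss on a batch from that
# task, and update the shared encoder plus that task's head.
logits = forward("sentiment", ["great movie", "terrible movie"])
print(logits.shape)                                      # (2, 3)
```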
Processing of long documents
Recent NLP models often have a maximum input size of 512 (sub)words for efficiency reasons. In a recent competition on predicting document similarity, the winning approach took only the first and last 128 (sub)words of two documents to compare them. This means that the majority of the input data is not taken into account, which is probably suboptimal for performance. To exploit current models, a two-layered approach can be used: first, we embed the full input in multiple windows, keeping one embedding from each window; then, we run a separate neural network over the outputs of the previous step, leading to a single label prediction.
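Here is a rough sketch of this two-layered idea, under the assumption that a standard pre-trained encoder is used for the windows; the window size, the pooling over windows, and the small second-level network are all arbitrary choices for illustration.

```python
# Rough sketch: encode a long document in fixed-size windows, keep one vector per
# window, and run a small second network over the window vectors.
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)
doc_classifier = torch.nn.Sequential(               # second-level network over windows
    torch.nn.Linear(encoder.config.hidden_size, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 2),                        # e.g. similar / not similar
)

def window_vectors(text, window=510, stride=510):   # 510 leaves room for [CLS]/[SEP]
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    vecs = []
    for start in range(0, max(len(ids), 1), stride):
        chunk = ids[start:start + window]
        batch = torch.tensor([tokenizer.build_inputs_with_special_tokens(chunk)])
        vecs.append(encoder(input_ids=batch).last_hidden_state[:, 0])  # [CLS] per window
    return torch.cat(vecs)                          # (num_windows, hidden)

doc = "a very long document " * 300
windows = window_vectors(doc)
logits = doc_classifier(windows.mean(dim=0))        # pool windows, then classify
print(logits)
```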
The effect of translationese on slot and intent detection
The tasks of slot and intent detection are crucial components of digital assistants. Intent detection aims to find the goal of an utterance, and slot detection finds relevant entities. An example of this task:
| Add | reminder | to | swim | at | 11am | tomorrow |
|---|---|---|---|---|---|---|
|  |  |  | B-TODO |  | B-DATETIME | I-DATETIME |

Intent: add-reminder
Recently, two big multi-lingual datasets have been introduced (multiAtis and xSID). However, these datasets consist of data translated from English, and translationese is known to be different from spontaneous language. This project aims to estimate the effect of this difference by generating a small sample of native non-English data (for example Danish) and evaluating it against the xSID data.
Here you can find the xSID paper, and an investigation on translationese for machine translation evaluation:
- From Masked Language Modeling to Translation: Non-English Auxiliary Tasks Improve Zero-shot Spoken Language Understanding
- The Effect of Translationese in Machine Translation Test Sets
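For the evaluation part, intent accuracy is a plain exact match, so the sketch below only shows span-level slot F1 computed from BIO labels; the exact metrics used for xSID may be defined slightly differently, and the gold/predicted tag sequences here are made up.

```python
# Small sketch: span-level slot F1 from BIO tags (a generic setup, not
# necessarily the official shared-task scorer).

def spans(bio_tags):
    """Turn BIO tags into a set of (start, end, label) spans."""
    out, start, label = set(), None, None
    for i, tag in enumerate(bio_tags + ["O"]):           # sentinel to close the last span
        if tag.startswith("B-") or tag == "O":
            if label is not None:
                out.add((start, i, label))
                label = None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
        elif tag.startswith("I-") and label is None:      # I- without B-: treat as a start
            start, label = i, tag[2:]
    return out

def slot_f1(gold_sents, pred_sents):
    tp = fp = fn = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = spans(gold), spans(pred)
        tp += len(g & p); fp += len(p - g); fn += len(g - p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = [["O", "O", "O", "B-TODO", "O", "B-DATETIME", "I-DATETIME"]]
pred = [["O", "O", "O", "B-TODO", "O", "B-DATETIME", "O"]]
print(slot_f1(gold, pred))          # 0.5: one of the two gold spans recovered exactly
```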
Multi-lingual lexical normalization
Lexical normalization is the task of converting social media language to its canonical equivalent. Most literature on this problem only tackles the task for one language. But in 2021, MultiLexNorm was introduced, including normalization datasets for 13 language variants. A wide variety of models was evaluated on this new benchmark; however, all of these trained a single model for each language. At least three of these models can be used in a multi-lingual or cross-lingual setup, which could enable more efficiency, better performance, and transfer to new languages for which no annotated data is available.
- MultiLexNorm: A Shared Task on Multilingual Lexical Normalization
- Website for dataset: Multi-LexNorm
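For word-level evaluation, one commonly used measure is the Error Reduction Rate relative to a leave-as-is baseline; the sketch below shows one formulation of it on a made-up example (for comparable numbers, the official shared-task evaluation script should be used).

```python
# Small sketch of word-level normalization evaluation with Error Reduction Rate,
# measured against the leave-as-is baseline.

def accuracy(gold, pred):
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def err(raw, gold, pred):
    base = accuracy(gold, raw)           # baseline: leave every word unchanged
    sys = accuracy(gold, pred)
    return (sys - base) / (1 - base) if base < 1 else 0.0

raw  = ["soo", "happy", "2", "see", "u", "tmrw"]
gold = ["so", "happy", "to", "see", "you", "tomorrow"]
pred = ["so", "happy", "2", "see", "you", "tomorrow"]
print(err(raw, gold, pred))              # 0.75: 3 of the 4 needed corrections made
```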
Dependency parsing of Danish social media data
Dependency parsing is the task of finding the syntactic relations between words in a sentence. Chapter 14 of the Speech and Language Processing book contains a nice introduction to this topic. Many different languages and domains have been covered by the recent Universal Dependencies project. However, for language types not covered, performance is generally lower. Recently, we have collected some non-canonical data samples: DaN+, for which it is uncertain how well current methods would perform. The goal of this project would be to annotate a small sample of Danish social media data to evaluate parsers. Then, a variety of approaches for adapting the parser could be studied, including the ones mentioned below:
- Challenges in Annotating and Parsing Spoken, Code-switched, Frisian-Dutch Data
- Cross-lingual Parsing with Polyglot Training and Multi-treebank Learning: A Faroese Case Study
- A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages
- How to Parse Low-Resource Languages: Cross-Lingual Parsing, Target Language Annotation, or Both?
- Modeling Input Uncertainty in Neural Network Dependency Parsing
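For the evaluation on the newly annotated sample, the standard measures are unlabeled and labeled attachment scores; the tiny sketch below shows the core computation on a made-up three-word sentence (in practice the official CoNLL-U evaluation script would be used).

```python
# Tiny sketch of unlabeled/labeled attachment scores (UAS/LAS).

def attachment_scores(gold, pred):
    """gold/pred: lists of sentences; each token is a (head_index, deprel) pair."""
    total = uas = las = 0
    for gsent, psent in zip(gold, pred):
        for (ghead, grel), (phead, prel) in zip(gsent, psent):
            total += 1
            if ghead == phead:
                uas += 1
                if grel == prel:
                    las += 1
    return uas / total, las / total

gold = [[(2, "nsubj"), (0, "root"), (2, "obj")]]     # "I see light"; heads are 1-based, 0 = root
pred = [[(2, "nsubj"), (0, "root"), (2, "nmod")]]
print(attachment_scores(gold, pred))                  # (1.0, 0.666...)
```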
Active learning for POS tagging
POS tagging is the task of classifying words into their syntactic category:
I | see | the | light |
---|---|---|---|
PRON | VERB | DET | NOUN |
Current POS taggers are usually supervised, which means they rely on human-annotated training data. This data commonly consists of thousands of sentences. To make this process less costly, one can select a more informative sample of words, and instead only annotate this subsample. Previous work (see below) has shown that competitive performance can be obtained with as little as 400 words on English news data. However, it is unclear how this transfers to other languages/domains. In this project, the first step is to evaluate the existing method on a larger sample (i.e. the Universal Dependencies dataset), followed by possible improvements to the model.
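One common way to pick the subsample is uncertainty sampling; the sketch below assumes we already have per-token tag probability distributions from some tagger, and simply selects the tokens with the highest entropy. The sentences and probabilities are made up.

```python
# Minimal sketch of uncertainty-based token selection for active learning.
import math

def entropy(dist):
    return -sum(p * math.log(p) for p in dist if p > 0)

def select_tokens(sentences, tag_probs, budget=400):
    """sentences: list of token lists; tag_probs: same shape, each a distribution."""
    candidates = [(entropy(tag_probs[i][j]), i, j)
                  for i, sent in enumerate(sentences)
                  for j, _ in enumerate(sent)]
    candidates.sort(reverse=True)                       # most uncertain first
    return [(i, j, sentences[i][j]) for _, i, j in candidates[:budget]]

sents = [["I", "see", "the", "light"], ["can", "you", "can", "a", "can", "?"]]
probs = [[[0.9, 0.1], [0.8, 0.2], [0.95, 0.05], [0.7, 0.3]],
         [[0.5, 0.5], [0.9, 0.1], [0.6, 0.4], [0.9, 0.1], [0.55, 0.45], [0.99, 0.01]]]
print(select_tokens(sents, probs, budget=3))
```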
Related reading:
Unsupervised code-switch detection
Code-switching is the phenomenon of switching to another language within one utterance. Many previous approaches have been evaluated for a variety of language pairs; however, they are all trained on annotated code-switched data.
To increase the usefulness of such a code-switch detector, the idea is to train a system based on two monolingual datasets to predict language labels on the word level. An example of the desired output is shown below:
@friend | u | perform | besop | apa | tudey | ? |
---|---|---|---|---|---|---|
un | en | en | id | id | en | ? |
Update 2022: Note that this project was done last year, and was published (Much Gracias: Semi-supervised Code-switch Detection for Spanish-English: How far can we get?). But there are still extensions that could be interesting: how can we identify named entities (often considered a separate class), or how can we scale to identify any number of languages at once?
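One very simple starting point for the semi-supervised idea above: train a word-level classifier on words drawn from two monolingual corpora and apply it word by word. The word lists below are tiny placeholders; real monolingual corpora (and a stronger model that uses context) would be needed.

```python
# Sketch: word-level language ID trained only on monolingual word lists.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

english_words = ["you", "perform", "today", "when", "where"]       # placeholder data
indonesian_words = ["besok", "apa", "kamu", "dimana", "kapan"]

words = english_words + indonesian_words
labels = ["en"] * len(english_words) + ["id"] * len(indonesian_words)

clf = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),   # character n-grams
    LogisticRegression(max_iter=1000),
)
clf.fit(words, labels)

utterance = ["u", "perform", "besop", "apa", "tudey"]
print(list(zip(utterance, clf.predict(utterance))))
```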
Related reading:
- Code-Mixing in Social Media Text
- Overview for the First Shared Task on Language Identification in Code-Switched Data
- Overview for the Second Shared Task on Language Identification in Code-Switched Data
- Overview of FIRE-2015 Shared Task on Mixed Script Information Retrieval
- Overview of the Mixed Script Information Retrieval (MSIR) at FIRE-2016
- A Fast, Compact, Accurate Model for Language Identification of Codemixed Text
Conversion of NLP tasks to sequence labeling tasks
Because of the continually increasing power of sequence labelers, competitive performance for complex tasks can be gained by simplifying them to sequence labeling problems. This leads to efficient and accurate NLP models. The main setup is: 1) find a task (I have a couple in mind already of course), 2) convert this to a sequence labeling problem, 3) train a sequence labeler for this conversion, 4) then convert the sequence back to the original task and evaluate (a toy example of such a conversion is sketched after the list below). This is mainly an algorithmic project, as existing sequence labelers can be used.
Successful examples on some NLP tasks:
- Biomedical Event Extraction as Sequence Labeling
- Tetra-Tagging: Word-Synchronous Parsing with Linear-Time Inference
- Viable Dependency Parsing as Sequence Labeling
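The toy example below illustrates the kind of conversion meant in step 2, roughly in the spirit of the parsing-as-tagging papers above: each token gets a single label encoding the relative position of its head plus the dependency relation. Other encodings are possible, and handling predicted label sequences that do not decode into a valid tree is part of the project.

```python
# Toy conversion: dependency trees <-> per-token labels (relative head offset + relation).

def encode(heads, rels):
    """heads: 1-based head index per token (0 = root); returns one label per token."""
    labels = []
    for i, (head, rel) in enumerate(zip(heads, rels), start=1):
        if head == 0:
            labels.append(f"root|{rel}")
        else:
            labels.append(f"{head - i}|{rel}")       # offset of the head from this token
    return labels

def decode(labels):
    heads, rels = [], []
    for i, label in enumerate(labels, start=1):
        offset, rel = label.split("|")
        heads.append(0 if offset == "root" else i + int(offset))
        rels.append(rel)
    return heads, rels

# "I see the light": 'see' is the root, 'I' and 'light' attach to 'see', 'the' to 'light'
heads, rels = [2, 0, 4, 2], ["nsubj", "root", "det", "obj"]
labels = encode(heads, rels)
print(labels)                       # ['1|nsubj', 'root|root', '1|det', '-2|obj']
assert decode(labels) == (heads, rels)
```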
Strategies for Morphological Tagging
Morphological tagging is the task of assigning labels to a sequence of tokens that describe them morphologically. This means that one word can have 0-n labels. A variety of architectures has been proposed to solve this task; however, it is unclear which method works best in which situation.
In this project you can make use of the Universal Dependencies data, which has morphological tag annotation for many languages. You can use the MaChAmp toolkit, or implement a BiLSTM tagger yourself, and evaluate at least the three most common strategies (illustrated in the sketch after this list):
- Predict the concatenation of the tags as one label (same as POS tagging, but with more labels)
- Predict morphological tags as a sequence (like machine translation)
- View the task as a multilabel prediction problem (Get a probability for each label, and set a cutoff threshold)
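As a concrete illustration of the three output formats, consider the Danish token "huset" ('the house') with the UD features Definite=Def|Gender=Neut|Number=Sing; the probabilities in the multi-label variant are made-up model outputs.

```python
# The same morphological analysis in the three output formats discussed above.
features = {"Definite": "Def", "Gender": "Neut", "Number": "Sing"}

# 1) one atomic label: the concatenation is treated exactly like a (large) POS tagset
atomic = "|".join(f"{k}={v}" for k, v in sorted(features.items()))
# -> 'Definite=Def|Gender=Neut|Number=Sing'

# 2) a sequence of labels: a decoder generates the tags one by one, like translation
sequence = [f"{k}={v}" for k, v in sorted(features.items())]
# -> ['Definite=Def', 'Gender=Neut', 'Number=Sing']

# 3) multi-label: one probability per possible tag, with a cutoff threshold
tagset = ["Definite=Def", "Gender=Neut", "Gender=Com", "Number=Sing", "Number=Plur"]
probabilities = {"Definite=Def": 0.9, "Gender=Neut": 0.8, "Gender=Com": 0.3,
                 "Number=Sing": 0.7, "Number=Plur": 0.1}     # made-up model outputs
multilabel = [t for t in tagset if probabilities[t] >= 0.5]
# -> ['Definite=Def', 'Gender=Neut', 'Number=Sing']

print(atomic, sequence, multilabel, sep="\n")
```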
Related reading:
- The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection
- Multi-Team: A Multi-attention, Multi-decoder Approach to Morphological Analysis.
Can humans recognize NLP domains?
In natural language processing, we often assume that the origin/platform of a text is its domain. These "domains" are then used to evaluate cross-domain performance. However, in many cases the domain or genre of a text might not be so clear-cut. Even though theoretical frameworks have been proposed, it is unclear whether domain properties can be identified from an utterance directly. In this project, we ask: how well can humans identify which domain a sentence is taken from, and how does this compare to automatic models? Existing datasets such as the AG News corpus or the Universal Dependencies data can be used for this project.
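For the automatic side of the comparison, a simple bag-of-words classifier is a natural starting point; the two example sentences and labels below are placeholders, where in practice the AG News corpus or Universal Dependencies genres would be used.

```python
# Sketch of a bag-of-words domain classifier baseline on placeholder data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = ["The match ended in a late equalizer.",
             "Shares dropped sharply after the earnings report."]
domains = ["sports", "business"]

baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(sentences, domains)
print(baseline.predict(["The striker scored twice before halftime."]))
```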
Related reading: