Thesis project topics

Below I list some research ideas that I would like to supervise for a research project/thesis or collaborate on. These can also be seen as research directions that I’m interested in, so if you are interested in related projects feel free to contact me as well (robv@itu.dk). Next semester (Spring 2026) I will be on sabbatical, so I will have very limited supervision time.

For more information about how I normally supervise, see: Supervision statement

Syllable/phoneme level input to language models

Subwords are the most common input unit for language models. However, there is no consensus on what they should encapsulate. Making subwords align with syllables or phonemes could benefit cross-lingual evaluation and coverage. Previous work has already shown that converting languages to the same script leads to better performance:
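As a concrete (and deliberately naive) illustration of syllable-level input, a syllable-like pre-tokenizer could look as follows. The vowel-group heuristic and the `naive_syllables` name are my own illustrative assumptions, not an existing system; a real setup would use a phonemizer or language-specific syllabification rules:

```python
import re

VOWELS = "aeiouy"

def naive_syllables(word: str) -> list[str]:
    """Split a word into rough syllable-like chunks.

    Heuristic: each chunk is a (possibly empty) consonant onset followed by a
    vowel group; trailing consonants form a final chunk. Illustration only.
    """
    chunks = re.findall(rf"[^{VOWELS}]*[{VOWELS}]+|[^{VOWELS}]+$", word.lower())
    return chunks or [word]

# These units could then replace BPE merges as the model's input vocabulary.
print(naive_syllables("tokenization"))
```

Because the chunks concatenate back to the original word, such a splitter can be dropped in as a pre-tokenization step before (or instead of) subword training.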

Does the number of terminal symbols affect generalization in LMs?

Hierarchical generalization in language models is often studied by training them on synthetic, controlled data. Since the main focus in hierarchical generalization is syntax, these studies tend to use a limited set of terminals (i.e., words), which determines the vocabulary size of the synthetic data. However, as observed in Targeted Syntactic Evaluation on the Chomsky Hierarchy, generalization is affected by the number of terminal symbols. I am interested in examining the relationship between the number of terminals and hierarchical generalization.
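The controlled setting above could be sketched as a toy generator of centre-embedded agreement strings, where the number of terminals is an explicit parameter while the hierarchical structure stays fixed. The sentence template (`n_i n_j ... v_j v_i`) and all names are illustrative assumptions, not the setup of any particular paper:

```python
import random

def make_corpus(n_terminals: int, n_sentences: int, max_depth: int = 3, seed: int = 0):
    """Generate toy centre-embedded sentences of the form n_i n_j ... v_j v_i.

    Matching subject/verb pairs nest like brackets, so the data has a
    hierarchical dependency structure; `n_terminals` controls vocabulary
    size independently of that structure.
    """
    rng = random.Random(seed)
    corpus = []
    for _ in range(n_sentences):
        depth = rng.randint(1, max_depth)
        idxs = [rng.randrange(n_terminals) for _ in range(depth)]
        sent = [f"n{i}" for i in idxs] + [f"v{i}" for i in reversed(idxs)]
        corpus.append(" ".join(sent))
    return corpus

print(make_corpus(n_terminals=5, n_sentences=3))
```

Training the same model on corpora generated with, say, `n_terminals` of 10, 100, and 1000 would isolate the effect of terminal count on hierarchical generalization.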

How do language models learn morphologically rich languages?

The learning dynamics of language models on linguistic, especially syntactic, patterns are well documented for English (Language acquisition: do children and language models follow similar learning stages?, Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training). English, however, represents only one common way of encoding syntactic information: via word order and function words such as prepositions. Other language families, such as Turkic or Slavic, encode syntax via inflection or agglutination. These languages have different typological properties, such as flexible word order and larger vocabularies. I am interested in studying the learning dynamics of language models on such languages.

Language model training for diverse languages

Many benchmarks (and much development) have focused on English (or other Latin-script languages). However, languages differ in many ways, so it is crucial to consider a varied set of them. Evaluating on all languages is impractical, so a variety of (methods for finding) subsets have been proposed, including:

One challenge is finding data for these languages; I therefore propose to start with language modeling, since it requires no annotation. You can use a similar setup (hyperparameters) for the (small) language model training as:

Translation, generation, or manual labour for instruction tuning

Instruction tuning refers to the phase of language model training in which the model learns how to respond to tasks. Many instruction tuning datasets have recently been created for English. For other languages, however, there is usually (almost) no manually created data. In this case, people typically use instructions translated from English data, or instructions generated by larger, more accurate language models. However, a systematic comparison is lacking. This project will investigate how much data each approach yields and at what cost.

Simplify then solve

There exist many constructed language varieties, designed with specific purposes in mind. Many of these aim to simplify processing, for example Basic English and Learning English. If we can build a good machine translation model into these language varieties, we can then evaluate the performance of NLP models after translation. This project will probably focus mainly on (automatic) data creation/curation.

Predict best LM for task

Hugging Face provides a unified platform that stores language models along with some metadata about them. In this project, you will use as much of this metadata as needed to automatically predict performance on benchmarks. If accurate, the resulting tool would be very useful for selecting language models in practice. You could use public benchmarks (e.g. MTEB, BIG-bench, HELM, MMLU, etc.) to obtain performance scores.
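A minimal sketch of the core idea, assuming made-up metadata rows (parameter count, log10 training tokens, benchmark score) and a simple k-nearest-neighbour predictor. In the real project the rows would come from the Hugging Face Hub API and public leaderboards; every number below is invented for illustration:

```python
import math

# Hypothetical metadata: (params in billions, log10 training tokens, benchmark score).
KNOWN_MODELS = [
    (0.1, 10.0, 31.0),
    (1.0, 11.0, 45.0),
    (7.0, 12.0, 62.0),
    (13.0, 12.3, 66.0),
    (70.0, 13.0, 75.0),
]

def predict_score(params_b: float, log_tokens: float, k: int = 2) -> float:
    """Estimate a benchmark score as the mean over the k metadata-nearest models."""
    nearest = sorted(
        KNOWN_MODELS,
        key=lambda m: math.dist((math.log10(params_b), log_tokens),
                                (math.log10(m[0]), m[1])),
    )
    return sum(m[2] for m in nearest[:k]) / k

# Query: a hypothetical 3B model trained on ~10^11.5 tokens.
print(predict_score(3.0, 11.5))
```

A learned regressor over richer metadata (architecture, license, downloads, tokenizer details) would replace this nearest-neighbour baseline, but the input/output contract stays the same.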

Predict best LM for task based on tokenizer

The tokenizer of an LM reveals quite a bit about how the model was trained, so certain properties of the tokenizer can be expected to be good predictors of performance. This project focuses on finding these properties.

Cross-lingual morphological segmenter

Previous work suggests that morphological segments are good input units for language models. However, high-quality segmenters are available for only a handful of languages. Most languages lack annotated data for this task; a cross-lingual approach would therefore enable more diverse experimentation with morphs as inputs to language models.