Open Theses
On this page you will find open thesis topics supervised by the Chair of Economic and Social Statistics.
The goal of this thesis is to transfer the idea of randomization-based inference for testing the equality of spectral densities of (multivariate) time series, introduced in Jentsch and Pauly (2015, Bernoulli), to random fields, that is, to processes indexed not only by the integers Z but, more generally, by Z^d. Such processes can model spatial data in the two-dimensional plane Z^2 or in three-dimensional space Z^3, as well as spatial data in Z^2 observed over time Z. In these settings, the covariance function becomes more complex and depends on more parameters, which are difficult to estimate in practice. To tackle this, simplifying assumptions on the covariance function, such as symmetry or separability, are often imposed. In this thesis, a frequency-domain test statistic based on nonparametric spectral density estimators is proposed. Following Jentsch and Pauly (2015, Bernoulli), the null distribution of this test is estimated by a randomization approach. Its main advantage over commonly applied bootstrap methods, for example, is that it requires no tuning parameters beyond the bandwidth.
The asymptotic distribution of the test statistic under the corresponding null hypothesis has to be derived, and it has to be checked whether the randomization approach also leads to the correct distribution. The finite-sample performance has to be investigated using Monte Carlo simulations.
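To make the randomization idea concrete, the following is a minimal sketch for the simplest case only: two univariate time series of equal length (d = 1), not random fields. It assumes an L2-type distance between moving-average-smoothed periodograms as the test statistic and randomizes by swapping the two periodogram ordinates at each Fourier frequency with probability 1/2. The function names, the smoother, and the default settings are illustrative assumptions and are not taken from Jentsch and Pauly (2015).

```python
import cmath
import math
import random


def periodogram(x):
    """Periodogram at the Fourier frequencies 2*pi*k/n, k = 1, ..., (n-1)//2."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    values = []
    for k in range(1, (n - 1) // 2 + 1):
        dft = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                  for t, c in enumerate(centered))
        values.append(abs(dft) ** 2 / (2 * math.pi * n))
    return values


def smooth(values, half_width):
    """Moving-average smoother; half_width plays the role of the bandwidth."""
    m = len(values)
    out = []
    for k in range(m):
        window = values[max(0, k - half_width):min(m, k + half_width + 1)]
        out.append(sum(window) / len(window))
    return out


def l2_statistic(per_x, per_y, half_width):
    """Mean squared difference of the two smoothed periodograms."""
    fx, fy = smooth(per_x, half_width), smooth(per_y, half_width)
    return sum((a - b) ** 2 for a, b in zip(fx, fy)) / len(fx)


def randomization_test(x, y, half_width=3, n_rand=200, seed=0):
    """Return the observed statistic and a randomization p-value."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes series of equal length")
    rng = random.Random(seed)
    per_x, per_y = periodogram(x), periodogram(y)
    observed = l2_statistic(per_x, per_y, half_width)
    exceed = 0
    for _ in range(n_rand):
        rand_x, rand_y = [], []
        for a, b in zip(per_x, per_y):
            if rng.random() < 0.5:  # swap the ordinates at this frequency
                a, b = b, a
            rand_x.append(a)
            rand_y.append(b)
        if l2_statistic(rand_x, rand_y, half_width) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_rand + 1)


# Toy data (an assumption for illustration): white noise versus an AR(1) process.
rng = random.Random(42)
white = [rng.gauss(0, 1) for _ in range(64)]
ar = [0.0]
for _ in range(63):
    ar.append(0.8 * ar[-1] + rng.gauss(0, 1))

stat, p_value = randomization_test(white, ar)
print(f"statistic = {stat:.4f}, randomization p-value = {p_value:.3f}")
```

Apart from the number of randomizations, the only quantity to choose here is the smoothing width, which mirrors the point made above about the bandwidth. Extending the periodogram and the smoother to Z^d, and justifying the procedure theoretically, is exactly what the thesis would have to address.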
If you are interested in writing this thesis, please contact Prof. Dr. Carsten Jentsch.
Classifying texts is a common task that can be used to differentiate positive and negative product reviews, to determine which pre-defined category a text matches best, or even to predict a speaker's political stance. Text classification is also applied in many more, increasingly specific, use cases and is thus an integral part of textual data analysis.
The performance of supervised text classification has increased rapidly in recent years due to the growing use of transformer models (e.g. BERT). Such models, however, depend on a training data set, which can be increasingly hard to come by for specific use cases. In addition, many of the best-performing models are pre-trained on a general-purpose data set (BERT is pre-trained on the entire English Wikipedia) and might thus lack performance when analyzing texts whose semantics differ from what the model considers normal. For instance, the analysis of texts from decades or even centuries ago might be biased or simply inaccurate due to the difference between the old texts and the Wikipedia articles used to pre-train the model.
Thus, unsupervised clustering remains a relevant task for situations in which no pre-trained supervised model is applicable. The clustering model can be trained directly on the relevant texts instead of relying on another pre-trained model or even manually selected labels. While unsupervised clustering can be performed without any a priori knowledge by clustering tf-idf scores, such procedures are often too unconstrained to match the use case at hand. An alternative is to use a priori knowledge other than manually labeled texts. Such information is often stored in lexica, which for instance contain information about which words can be interpreted negatively or positively.
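As a rough illustration of how lexicon information can replace manual labels, the sketch below assigns texts to groups based on the average polarity of the lexicon words they contain. The tiny lexicon, the function names, and the thresholding rule are hypothetical and chosen only for illustration; this is not the Lex2Sent method.

```python
import re

# Hypothetical toy lexicon; real lexica are far larger and language-specific.
LEXICON = {
    "good": 1.0, "great": 1.0, "excellent": 1.0,
    "bad": -1.0, "poor": -1.0, "terrible": -1.0,
}


def lexicon_score(text, lexicon=LEXICON):
    """Average polarity of all lexicon words found in the text (0.0 if none)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0


def cluster_by_lexicon(texts, lexicon=LEXICON):
    """Group texts by the sign of their lexicon score; no labels are needed."""
    clusters = {"positive": [], "negative": [], "undecided": []}
    for text in texts:
        score = lexicon_score(text, lexicon)
        if score > 0:
            clusters["positive"].append(text)
        elif score < 0:
            clusters["negative"].append(text)
        else:
            clusters["undecided"].append(text)
    return clusters


texts = [
    "Great product, excellent value",
    "Terrible quality and poor support",
    "Arrived on Tuesday",
]
for label, members in cluster_by_lexicon(texts).items():
    print(label, members)
```

A plain word count like this ignores negation and context and leaves texts without lexicon words undecided, which is one reason why more elaborate lexicon-based approaches are of interest.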
The goal of this thesis is to create a model that clusters texts into multiple classes. A possible starting point might be
Lange, K.-R., Rieger, J. and Jentsch, C. (2022). Lex2Sent: A bagging approach to unsupervised sentiment analysis. arXiv. DOI.
in which documents were clustered in a two-class problem using external information stored in lexica.
If you are interested in writing this thesis, please contact Kai-Robin Lange.
Nowcasting and forecasting are important components in answering econometric questions. At DoCMA (Dortmund Center for Data-based Media Analysis), two indices, the UPI (uncertainty perception indicator) and the IPI (inflation perception indicator), were developed to measure the presence of uncertainty and inflation reporting in German newspapers, broken down by topic.
The aim of this thesis is to assess to what extent the (sub-)indices improve the predictive power of established econometric models.
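One common way to assess such an improvement is an out-of-sample comparison: fit a baseline model with and without the indicator on an expanding window and compare the one-step-ahead forecast errors. The sketch below does this for an AR(1) baseline on synthetic data. The model choice, the function names, and the simulated series are assumptions for illustration; they are not the UPI or IPI data and not a prescribed design for the thesis.

```python
import math
import random


def ols(X, y):
    """Least-squares coefficients via the normal equations (Gauss-Jordan)."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * v for r, v in zip(X, y))] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def features(y, x, t, use_indicator):
    """Regressors for predicting y[t]: intercept, y[t-1], optionally x[t-1]."""
    row = [1.0, y[t - 1]]
    if use_indicator:
        row.append(x[t - 1])
    return row


def expanding_rmse(y, x, n_train, use_indicator):
    """RMSE of one-step-ahead forecasts, refitting on all data before time t."""
    errors = []
    for t in range(n_train, len(y)):
        rows = [features(y, x, s, use_indicator) for s in range(1, t)]
        beta = ols(rows, y[1:t])
        forecast = sum(b * f for b, f in
                       zip(beta, features(y, x, t, use_indicator)))
        errors.append(y[t] - forecast)
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Synthetic data (assumption): the target depends on the lagged indicator.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(120)]
y = [0.0]
for t in range(1, 120):
    y.append(0.5 * y[-1] + 1.5 * x[t - 1] + rng.gauss(0, 0.3))

rmse_ar = expanding_rmse(y, x, 60, use_indicator=False)
rmse_arx = expanding_rmse(y, x, 60, use_indicator=True)
print(f"AR(1): {rmse_ar:.3f}, AR(1) + indicator: {rmse_arx:.3f}")
```

With real data, the difference in forecast errors would typically also be tested formally, and the baseline would be one of the established models rather than a plain AR(1).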
Literature (as well as references in the following papers):
- Rieger, J., Hornig, N., Schmidt, T. and Müller, H. (2023). Early Warning Systems? Building Time Consistent Perception Indicators for Economic Uncertainty and Inflation Using Efficient Dynamic Modeling. Accepted for MUFin'23. Link. GitHub.
- Müller, H., Rieger, J., Schmidt, T. and Hornig, N. (2022). An Increasing Sense of Urgency: The Inflation Perception Indicator (IPI) to 30 June 2022 - a Research Note. DoCMA Working Paper #12. DOI. GitHub.
  Previous editions: Pressure is high (04/30/2022), A German Inflation Narrative (02/28/2022).
- Müller, H., Rieger, J. and Hornig, N. (2022). Vladimir vs. the Virus - a Tale of two Shocks. An Update on our Uncertainty Perception Indicator (UPI) to April 2022 - a Research Note. DoCMA Working Paper #11. DOI. GitHub.
  Previous editions: "Riders on the Storm" (Q1 2021), "We’re rolling" (Q4 2020), "For the times they are a-changin'" (Q3 2020).
- Shrub, Y., Rieger, J., Müller, H. and Jentsch, C. (2022). Text data rule - don't they? A study on the (additional) information of Handelsblatt data for nowcasting German GDP in comparison to established economic indicators. Ruhr Economic Papers #964. Link.
Additional information on the indicators:
- IPI
- Handelsblatt (07/25/2022)
- Handelsblatt (05/24/2022)
- Handelsblatt (03/10/2022)
If you are interested in writing this thesis, please contact Jonas Rieger.
Latent Dirichlet allocation (LDA) is still a widely used topic model in application-oriented research areas for the exploration of textual data. In our work on RollingLDA, we describe a further development of the model in terms of discrete updates. For this, three parameters have to be chosen: the initialization period, the size of the update interval, and the size of the corresponding memory for each update interval.
The goal of this thesis is to investigate the effect of the three parameters on the resulting model and to give suggestions for parameter choices in specific settings, as well as to propose adjustments for model estimation, if useful.
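To illustrate how the three parameters interact, the following sketch only computes which time periods enter the initial model and, for each update, which earlier periods serve as memory and which are newly modeled. It does not fit any LDA model. The function name `rolling_schedule` and its arguments are hypothetical and do not correspond to the interface of any existing RollingLDA implementation.

```python
def rolling_schedule(n_periods, init, update, memory):
    """Split periods 0..n_periods-1 into an initial window and update chunks.

    All arguments are counted in abstract time periods (e.g. months).
    Returns the initial range and a list of dicts with the "memory" range
    (periods used as context) and the "new" range (periods being modeled).
    """
    if not 0 < init <= n_periods:
        raise ValueError("init must lie between 1 and n_periods")
    if update <= 0 or memory <= 0:
        raise ValueError("update and memory must be positive")
    schedule = []
    start = init
    while start < n_periods:
        end = min(start + update, n_periods)  # the last chunk may be shorter
        schedule.append({
            "memory": range(max(0, start - memory), start),
            "new": range(start, end),
        })
        start = end
    return range(0, init), schedule


initial, updates = rolling_schedule(n_periods=12, init=6, update=3, memory=4)
print("initial:", list(initial))
for step in updates:
    print("memory:", list(step["memory"]), "new:", list(step["new"]))
```

Even this bookkeeping shows the trade-offs involved: a longer memory ties each update more closely to the past, while a shorter update interval means more frequent updates, each based on fewer new texts.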
Literature:
- Rieger, J., Jentsch, C. and Rahnenführer, J. (2021). RollingLDA: An Update Algorithm of Latent Dirichlet Allocation to Construct Consistent Time Series from Textual Data. Findings of the Association for Computational Linguistics: EMNLP 2021, 2337-2347. DOI. GitHub.
Example applications of RollingLDA:
- Rieger, J., Hornig, N., Schmidt, T. and Müller, H. (2023). Early Warning Systems? Building Time Consistent Perception Indicators for Economic Uncertainty and Inflation Using Efficient Dynamic Modeling. Accepted for MUFin'23. Link. GitHub.
- Bittermann, A. and Rieger, J. (2022). Finding scientific topics in continuously growing text corpora. Proceedings of the 3rd Workshop on Scholarly Document Processing. Link. GitHub. PsychTopics App.
- Lange, K.-R., Rieger, J., Benner, N. and Jentsch, C. (2022). Zeitenwenden: Detecting changes in the German political discourse. Proceedings of the 2nd Workshop on Computational Linguistics for Political Text Analysis. pdf. GitHub.
- Rieger, J., Lange, K.-R., Flossdorf, J. and Jentsch, C. (2022). Dynamic change detection in topics based on rolling LDAs. Proceedings of the Text2Story'22 Workshop. CEUR-WS 3117, 5-13. pdf. GitHub.
If you are interested in writing this thesis, please contact Jonas Rieger.