Open Theses
On this page you will find the thesis topics currently offered by the Chair of Business and Social Statistics (Lehrstuhl für Wirtschafts- und Sozialstatistik). A topic described in English indicates that the thesis should preferably be written in English.
In this thesis, the goal is to transfer the idea of randomization-based inference for testing the equality of spectral densities of (multivariate) time series in Jentsch and Pauly (2015, Bernoulli) to random fields, that is, processes indexed not only by the integers Z but, more generally, by Z^d. These processes make it possible to model spatial data in the two-dimensional plane Z^2 or in three-dimensional space Z^3, or spatial data in Z^2 observed over time Z. In such scenarios, the covariance function becomes more complex and depends on more parameters, which are difficult to estimate in practice. To tackle this, simplifying assumptions on the covariance function such as symmetry or separability are often imposed. In this thesis, a frequency-domain test statistic based on nonparametric spectral density estimators is proposed. Following Jentsch and Pauly (2015, Bernoulli), the null distribution of the test is estimated by a randomization approach. Its main advantage over commonly applied bootstrap methods, for example, is that it requires no tuning parameters beyond the bandwidth.
The asymptotic distribution of the test statistic under the corresponding null hypothesis has to be derived, and it has to be checked whether the randomization approach also yields the correct distribution. The finite-sample performance has to be investigated using Monte Carlo simulations.
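To make the randomization idea more concrete, the following is a minimal sketch for the simplest case d = 1 with two univariate series of equal length. It is an illustration under stated assumptions, not the procedure of Jentsch and Pauly (2015) and not the statistic to be developed in the thesis: the function names, the uniform smoothing kernel, the L2-type distance and the defaults `half_width` and `n_rand` are all chosen here for illustration, and only the Python standard library is used.

```python
import cmath
import math
import random


def periodogram(x):
    """Periodogram at the Fourier frequencies 2*pi*k/n for k = 1, ..., (n-1)//2."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    values = []
    for k in range(1, (n - 1) // 2 + 1):
        dft = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                  for t, v in enumerate(centered))
        values.append(abs(dft) ** 2 / (2 * math.pi * n))
    return values


def smooth(values, half_width):
    """Uniform kernel smoother; half_width plays the role of the bandwidth."""
    m = len(values)
    out = []
    for k in range(m):
        window = values[max(0, k - half_width):min(m, k + half_width + 1)]
        out.append(sum(window) / len(window))
    return out


def l2_statistic(per_x, per_y, half_width):
    """Mean squared difference of the two smoothed periodograms."""
    sx, sy = smooth(per_x, half_width), smooth(per_y, half_width)
    return sum((a - b) ** 2 for a, b in zip(sx, sy)) / len(sx)


def randomization_test(x, y, half_width=3, n_rand=199, seed=0):
    """Return the observed statistic and a randomization p-value."""
    rng = random.Random(seed)
    per_x, per_y = periodogram(x), periodogram(y)
    observed = l2_statistic(per_x, per_y, half_width)
    exceed = 0
    for _ in range(n_rand):
        rand_x, rand_y = [], []
        # At each frequency, swap the two periodogram ordinates with probability 1/2.
        for a, b in zip(per_x, per_y):
            if rng.random() < 0.5:
                a, b = b, a
            rand_x.append(a)
            rand_y.append(b)
        if l2_statistic(rand_x, rand_y, half_width) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_rand + 1)


# Toy data (assumption): white noise versus an AR(1) process with coefficient 0.8.
sim = random.Random(1)
white = [sim.gauss(0, 1) for _ in range(128)]
ar1 = [sim.gauss(0, 1)]
for _ in range(127):
    ar1.append(0.8 * ar1[-1] + sim.gauss(0, 1))
stat, p_value = randomization_test(white, ar1)
```

The intuition behind the swap is that, if both series share the same spectral density, exchanging their periodogram ordinates at a given frequency should leave the distribution of the statistic approximately unchanged, so the randomized statistics can serve as a reference distribution. Apart from the bandwidth (`half_width` here), the only further input is the number of randomizations.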
If you are interested in writing this thesis, please contact Prof. Dr. Carsten Jentsch.
Classifying texts is a common task: it can be used to distinguish positive from negative product reviews, to determine which predefined category a text matches best, or even to predict a speaker's political stance. Text classification is also applied in many further, increasingly specific use cases and is thus an integral part of textual data analysis.
Supervised classification of texts has seen rapid increases in performance in recent years due to the use of transformer models (e.g. BERT). Such models, however, depend on a training data set, which can be increasingly hard to come by for specific use cases. Also, many of the best-performing models are pre-trained on a general-purpose data set (BERT, for instance, is pre-trained on the English Wikipedia and the BooksCorpus) and might thus lack performance when analyzing texts whose semantics differ from what the model considers normal. For instance, the analysis of old texts from decades or even centuries ago might be biased or simply inaccurate due to the semantic and syntactic differences between the old texts and the Wikipedia articles used to pre-train the model.
Thus, unsupervised clustering remains relevant for situations in which no pre-trained supervised model is applicable. The clustering model can be trained directly on the relevant texts instead of relying on another pre-trained model or even manually assigned labels. While unsupervised clustering can be performed without any a priori knowledge by clustering tf-idf scores, such procedures are often not constrained enough to match the use case at hand. An alternative is to use a priori knowledge other than manually labeled texts. Such information is often stored in lexica, which contain, for instance, information about which word can be linked to which class.
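As a minimal sketch of how a lexicon can supply class information without labeled texts, the example below assigns each text to the class whose lexicon words occur most often. The toy lexicon, the class names and the function names are assumptions made for illustration. This simple word-count rule is not the Lex2Sent method and is only meant as a baseline that a thesis could improve upon.

```python
import re
from collections import Counter

# Hypothetical toy lexicon: class label -> words linked to that class.
LEXICON = {
    "sports": {"match", "goal", "team", "coach"},
    "economy": {"inflation", "market", "bank", "prices"},
    "politics": {"election", "parliament", "minister", "vote"},
}


def lexicon_scores(text, lexicon=LEXICON):
    """Count, for every class, how many tokens of the text are in its lexicon."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {label: sum(counts[word] for word in words)
            for label, words in lexicon.items()}


def assign_class(text, lexicon=LEXICON):
    """Return the class with the most lexicon hits, or None if there is no clear winner."""
    scores = lexicon_scores(text, lexicon)
    best = max(scores.values())
    if best == 0:
        return None  # no lexicon word occurs in the text
    winners = [label for label, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else None  # ties stay unassigned


label = assign_class("The central bank reacts to rising prices and inflation.")
```

Texts without lexicon words or with tied scores remain unassigned here, which illustrates one limitation of pure counting that more refined approaches have to address.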
The goal of this thesis is to create a model to cluster texts into multiple classes. A possible starting point might be
Lange, K.-R., Rieger, J. and Jentsch, C. (2022). Lex2Sent: A bagging approach to unsupervised sentiment analysis. arXiv. DOI.
in which documents were clustered in a two-class problem using external information stored in lexica.
If you are interested in writing this thesis, please contact Kai-Robin Lange.
Nowcasting and forecasting are important components in answering econometric questions. Within DoCMA (Dortmund Center for Data-based Media Analysis), two indices were developed, the UPI (uncertainty perception indicator) and the IPI (inflation perception indicator), which are intended to capture the presence of uncertainty and inflation reporting in German newspapers, broken down by topic area.
The goal of this thesis is to assess to what extent the (sub-)indices improve the predictive power of established econometric models.
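One possible way to frame such a comparison is an out-of-sample exercise: forecast the target series once with a baseline model and once with the same model augmented by a lagged indicator, then compare the forecast errors. The sketch below does this on simulated data. Everything in it is an assumption for illustration: the AR(1) baseline stands in for an "established" model, the simulated `indicator` stands in for a UPI or IPI (sub-)index, and the function names are hypothetical. Only the Python standard library is used.

```python
import math
import random


def ols(rows, targets):
    """Least-squares coefficients via the normal equations (Gauss-Jordan elimination)."""
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, targets))]
         for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda i: abs(a[i][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(p):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [a[i][p] / a[i][i] for i in range(p)]


def one_step_rmse(y, indicator=None, first_forecast=40):
    """Expanding-window one-step-ahead RMSE of y_t on 1, y_{t-1} (and indicator_{t-1})."""
    def features(t):
        row = [1.0, y[t - 1]]
        if indicator is not None:
            row.append(indicator[t - 1])
        return row

    errors = []
    for t in range(first_forecast, len(y)):
        # Estimate on observations up to t-1 only, then forecast y_t.
        beta = ols([features(s) for s in range(1, t)], y[1:t])
        forecast = sum(b * f for b, f in zip(beta, features(t)))
        errors.append(y[t] - forecast)
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Simulated toy data (assumption): the lagged indicator truly drives the target.
rng = random.Random(42)
n = 200
indicator = [rng.gauss(0, 1) for _ in range(n)]
y = [0.0]
for t in range(1, n):
    y.append(0.5 * y[t - 1] + 1.0 * indicator[t - 1] + rng.gauss(0, 1))

rmse_baseline = one_step_rmse(y)
rmse_augmented = one_step_rmse(y, indicator)
```

With real data, the comparison would additionally have to respect publication lags of the target series and of the indicators, and a formal test of equal predictive accuracy could complement the plain RMSE comparison.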
Literature (explicitly including the references cited in the following works):
- Rieger, J., Hornig, N., Schmidt, T. and Müller, H. (2023). Early Warning Systems? Building Time Consistent Perception Indicators for Economic Uncertainty and Inflation Using Efficient Dynamic Modeling. Accepted for MUFin'23. Link. GitHub.
- Müller, H., Rieger, J., Schmidt, T. and Hornig, N. (2022). An Increasing Sense of Urgency: The Inflation Perception Indicator (IPI) to 30 June 2022 - a Research Note. DoCMA Working Paper #12. DOI. GitHub.
  Previous issues: Pressure is high (04/30/2022), A German Inflation Narrative (02/28/2022).
- Müller, H., Rieger, J. and Hornig, N. (2022). Vladimir vs. the Virus - a Tale of two Shocks. An Update on our Uncertainty Perception Indicator (UPI) to April 2022 - a Research Note. DoCMA Working Paper #11. DOI. GitHub.
  Previous issues: "Riders on the Storm" (Q1 2021), "We’re rolling" (Q4 2020), "For the times they are a-changin'" (Q3 2020).
- Shrub, Y., Rieger, J., Müller, H. and Jentsch, C. (2022). Text data rule - don't they? A study on the (additional) information of Handelsblatt data for nowcasting German GDP in comparison to established economic indicators. Ruhr Economic Papers #964. Link.
Further information on the indicators:
- IPI
- Handelsblatt (07/25/2022)
- Handelsblatt (05/24/2022)
- Handelsblatt (03/10/2022)
If you are interested in writing this thesis, please contact Jonas Rieger.
Latent Dirichlet allocation (LDA) remains a widely used topic model for exploring text data in application-oriented research fields. In our work on RollingLDA, we describe an extension of the model based on discrete updates. This requires choosing three parameters in particular: the initialization period, the size of the update interval, and the size of the corresponding memory for each update interval.
The goal of this thesis is to investigate the effect of these three parameters on the resulting model, to give recommendations for choosing them in specific settings and, where appropriate, to propose adjustments to the model estimation.
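To illustrate how the three parameters interact, the sketch below derives the time windows they imply for a corpus covering a given date range. It only builds the schedule and fits no model. The function name, its arguments and the convention of half-open intervals are assumptions made for this illustration and do not reflect the interface of the RollingLDA implementation.

```python
from datetime import date, timedelta


def rolling_schedule(start, end, init_days, update_days, memory_days):
    """Return the initial window and a list of (memory_start, update_start, update_end).

    All windows are half-open date intervals [from, to). For each update, the
    memory window [memory_start, update_start) precedes the new documents in
    [update_start, update_end).
    """
    if update_days <= 0:
        raise ValueError("update_days must be positive")
    init_end = start + timedelta(days=init_days)
    updates = []
    update_start = init_end
    while update_start < end:
        update_end = min(update_start + timedelta(days=update_days), end)
        # The memory cannot reach back further than the start of the corpus.
        memory_start = max(start, update_start - timedelta(days=memory_days))
        updates.append((memory_start, update_start, update_end))
        update_start = update_end
    return (start, init_end), updates


initial, updates = rolling_schedule(
    start=date(2020, 1, 1),
    end=date(2020, 3, 1),
    init_days=31,
    update_days=10,
    memory_days=20,
)
```

A schedule like this makes the trade-offs visible before any model is fitted: a longer initialization period leaves fewer updates, shorter update intervals produce more but smaller chunks, and the memory length determines how much earlier text each update can draw on.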
Literature:
- Rieger, J., Jentsch, C. and Rahnenführer, J. (2021). RollingLDA: An Update Algorithm of Latent Dirichlet Allocation to Construct Consistent Time Series from Textual Data. Findings of the Association for Computational Linguistics: EMNLP 2021, 2337-2347. DOI. GitHub.
Example applications of RollingLDA:
- Rieger, J., Hornig, N., Schmidt, T. and Müller, H. (2023). Early Warning Systems? Building Time Consistent Perception Indicators for Economic Uncertainty and Inflation Using Efficient Dynamic Modeling. Accepted for MUFin'23. Link. GitHub.
- Bittermann, A. and Rieger, J. (2022). Finding scientific topics in continuously growing text corpora. Proceedings of the 3rd Workshop on Scholarly Document Processing. Link. GitHub. PsychTopics App.
- Lange, K.-R., Rieger, J., Benner, N. and Jentsch, C. (2022). Zeitenwenden: Detecting changes in the German political discourse. Proceedings of the 2nd Workshop on Computational Linguistics for Political Text Analysis. pdf. GitHub.
- Rieger, J., Lange, K.-R., Flossdorf, J. and Jentsch, C. (2022). Dynamic change detection in topics based on rolling LDAs. Proceedings of the Text2Story'22 Workshop. CEUR-WS 3117, 5-13. pdf. GitHub.
If you are interested in writing this thesis, please contact Jonas Rieger.