Below you will find a list of my main research projects. For a list of student projects I offer at the University of Göttingen, please see here.
Paraphrases are differently worded texts with approximately the same meaning. They are used, for example, in various natural language processing tasks to improve language model representations. But paraphrases can also be misused to plagiarize text and deceive detection methods. Plagiarism means using content without acknowledging its source, and when content is paraphrased, detection solutions have increasing difficulty identifying it. With the rise of automated paraphrasing methods, plagiarism becomes scalable and effortless: anyone with internet access can use online paraphrasing tools to paraphrase others' content while (probably) going unnoticed by the most widely used plagiarism detection software.
Our research found that traditional detection techniques often detect fewer than one in ten paraphrased plagiarism cases. We also tested recent neural language models and found that they achieve super-human performance in detecting machine paraphrases while transferring to new domains and paraphrasing tools. For a quick introduction, I recommend checking out this blog post.
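To see why traditional techniques struggle, consider that many of them rely on lexical overlap between documents. The toy sketch below (an illustration only, not the detection method from our research) measures Jaccard similarity over word sets: a verbatim copy scores perfectly, while a paraphrase with the same meaning shares almost no words with the original.

```python
# Toy illustration: overlap-based detection catches verbatim copies
# but misses paraphrases, even when the meaning is preserved.

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

original = "paraphrases are differently worded texts with approximately the same meaning"
verbatim = "paraphrases are differently worded texts with approximately the same meaning"
paraphrase = "a paraphrase restates a passage in other words while keeping its sense"

print(jaccard(original, verbatim))    # 1.0 -- a verbatim copy is easy to flag
print(jaccard(original, paraphrase))  # near zero -- an overlap check misses it
```

Neural language models avoid this blind spot by comparing texts at the level of meaning rather than surface word overlap.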
To test against machine paraphrases, we need - you guessed it - machine paraphrases. Neural language models have become a viable choice for generating, editing, translating, and paraphrasing text. Other applications benefit from machine paraphrases as well, since they allow scaling the amount of available data (especially for low-resource languages).
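The scalability argument can be made concrete with a deliberately simple sketch: once paraphrasing is automated, rewriting arbitrary amounts of text costs nothing. The dictionary-based synonym substitution below is a stand-in assumption for illustration; the actual systems we study use neural language models.

```python
# Minimal sketch of automated paraphrasing via synonym substitution.
# The tiny hand-made lexicon is an illustrative assumption; real
# paraphrasers use neural language models instead.

SYNONYMS = {
    "use": "employ",
    "content": "material",
    "without": "lacking",
    "method": "technique",
}

def paraphrase(text: str) -> str:
    # Replace every word that has a known synonym; keep the rest as-is.
    return " ".join(SYNONYMS.get(w, w) for w in text.split())

print(paraphrase("use content without a method"))
# -> "employ material lacking a technique"
```

Even this crude substitution already lowers word overlap with the source text, which is exactly what makes automated paraphrasing attractive for evading overlap-based plagiarism checks.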
During our experiments, we found that humans identify machine-paraphrased text from large language models at almost random chance, while rating the quality of paraphrases from large models nearly as high as that of human-authored texts. For an easy read with more details, I recommend the following blog posts (one, two).
This project focuses on making information about computer science research available within a few mouse clicks. To that end, we use the DBLP Discovery Dataset (D3) to show trends in publication volume, topics, impact, and more. If you are interested, we have more details on the dataset creation as well as on the tool itself.
The tool has the following structure. The frontend enables abstract access to the data through visualizations and filters. The backend provides access to the data through a well-defined REST API and makes it easy to index new data. The prediction endpoint does the heavy lifting of analyzing topics and other semantic features. The components run in separate Docker containers and can be deployed on different servers. Finally, the crawler ensures that the latest data is retrieved from the web, parsed, and stored in the backend.
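A containerized four-component architecture like the one described above could be wired together roughly as follows. This is a hypothetical docker-compose sketch; the service names, build paths, and ports are illustrative assumptions, not the project's actual configuration.

```yaml
# Hypothetical docker-compose sketch of the described architecture.
# All names, paths, and ports are assumptions for illustration.
services:
  frontend:             # visualizations and filters for abstract data access
    build: ./frontend
    ports: ["80:80"]
    depends_on: [backend]
  backend:              # well-defined REST API; indexes new data
    build: ./backend
    ports: ["8080:8080"]
  prediction:           # heavy lifting: topic analysis, semantic features
    build: ./prediction
    depends_on: [backend]
  crawler:              # retrieves, parses, and stores the latest data
    build: ./crawler
    depends_on: [backend]
```

Keeping each component in its own container is what allows them to be scaled and placed on different servers independently.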