Transparency and reproducibility are key components of open science and, like so many things, they are being reshaped by digital innovation. As part of the “Coleridge Initiative – Show US the Data” competition, Luca Papariello from the Research Studio Data Science developed a practical solution that automatically recognizes datasets in scientific publications, increasing openness and reproducibility.

Data is power. But in science, data also enables openness, reproducibility and transparency, the cornerstones of the scientific method. Research published in the scientific literature can only be reproduced when both the source code and the underlying dataset are available. However, accurately identifying the datasets used in a research paper is often a considerable challenge for researchers who wish to replicate its findings.

The competition “Coleridge Initiative – Show US the Data”, which was launched on the data science platform Kaggle on 23 March 2021, aimed to change that. Participants were asked to apply Natural Language Processing (NLP) techniques to automate the extraction of dataset mentions from scientific publications. By linking research articles to the data referenced in them, data scientists would help public authorities show how their data is used, promoting transparency and trust in evidence.
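To make the task concrete, here is a minimal sketch of the simplest possible approach: looking up a list of already known dataset names in the text of a paper. This is not the solution described in this article, which relies on transformer-based models. The function names, the two dataset titles and the sample sentence are illustrative assumptions.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and replace every run of non-alphanumerics with one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def find_dataset_mentions(paper_text: str, known_datasets: list[str]) -> list[str]:
    """Return the known dataset names that occur in the paper text.

    Matching is done on normalized text and on whole words only, so case,
    punctuation and hyphenation differences do not prevent a match.
    """
    haystack = f" {normalize(paper_text)} "
    found = []
    for name in known_datasets:
        needle = normalize(name)
        if needle and f" {needle} " in haystack:
            found.append(name)
    return found


# Hypothetical lookup list; a real one would hold many more dataset titles.
KNOWN_DATASETS = [
    "National Education Longitudinal Study",
    "Survey of Earned Doctorates",
]

# Hypothetical sentence standing in for the full text of a publication.
paper = (
    "We draw on the National Education Longitudinal Study to estimate "
    "graduation rates across school districts."
)

mentions = find_dataset_mentions(paper, KNOWN_DATASETS)
print(mentions)
```

A lookup like this can only find datasets that are already on the list. Recognizing mentions of datasets that have never been seen before requires a model that learns from context, which is where NLP techniques come in.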

Our researcher Luca Papariello from the Research Studio Data Science took part in the challenge along with more than 1600 other teams, totalling more than 1900 participants. The solution he developed finished in the top 9% of the final ranking, earning him a bronze medal.

Kaggle is the ideal platform to challenge yourself and learn new things that, at first sight, you are not sure how to tackle. It is a great complement to more theoretical resources (such as scientific articles, lectures, MOOCs, etc.) that allows you to learn on the ground by experimenting with different solutions. Also, only the best models and techniques survive the test of time in Kaggle competitions. This allows me to assess which are the most significant developments in a given field and which are doomed to fade away.

Luca Papariello, Researcher Data Science

In his implementation, Luca Papariello used recent transformer-based models and PyTorch, an open source machine-learning library. Deep learning models based on the transformer architecture have recently taken hold of the NLP world, achieving state-of-the-art results in several areas. This success is illustrated by the rapid growth of the Hugging Face ecosystem, which gives access to a plethora of (currently state-of-the-art) pre-trained models with just a few lines of code.

For the RSA FG, this competition is a commendable initiative promoting openness and follow-up research in science. However, it is also an opportunity to advance and establish digital innovation in the scientific field itself. Both data and innovation should be used for social good – furthering open science and knowledge transfer is one way to do this.