Open Data in STEM
Resources and information on open, FAIR data for researchers across science, technology, engineering, and medicine
Open data can be a tricky topic for researchers in any field, but it raises particular questions for those working in STEM subjects. Issues like patient confidentiality, third-party restrictions, and even commercial conflicts can make it difficult to turn open data into a reality for many STEM researchers.
At F1000Research, we’re here to support you in your open data journey. We ask our authors to comply with the FAIR Guidelines as part of our progressive Open Data Policy. We advocate for research data which is as open as possible, and as closed as necessary. And we’ve got a range of tools and resources to help you understand the open data landscape across STEM subjects, so your submission to F1000Research can go as smoothly as possible.
This page is packed with information about open, FAIR data in the context of STEM research, including:
- What does data look like for STEM researchers?
- Data sharing in the context of COVID-19
- Examples of open research data in action
- How to make your STEM data open and FAIR
What Research Data is Common in STEM?
STEM research is wide-ranging and varied – from epidemiology to nuclear physics, chemical engineering to neuroscience imaging. So it’s no surprise that research data in these fields is equally diverse.
Here are a few examples of what your data could look like, as a STEM researcher:
- Numerical data stored in spreadsheets, including modelling data and lab results
- Qualitative data like transcripts and questionnaire responses
- Software including models, algorithms, scripts, and logfiles
- Visual data, such as photographs, maps, neuroimages, and X-rays
- Process documents including reporting guidelines, protocols, and SOPs
Research data across STEM subjects is often digital, but non-digital objects (such as ice-core samples or medical specimens) still form part of your research data. It’s important to include these forms of data in your Data Management Plan, and have a clear strategy for digitizing, preserving, or describing them, so they’re as accessible and open as the rest of your data.
Here are a couple of examples from the F1000Research archive:
Differential dynamics of early stages of platelet adhesion and spreading on collagen IV- and fibrinogen-coated surfaces [version 2; peer review: 3 approved]
Horev, Zabary, Zarka, Sorrentino, Medalia, Zaritsky, Geiger
This Research Article, published in the Israel Science Foundation Gateway by Professor Beni Geiger (Former President of ISF), is a great example of how diverse research data in STEM can be. The Data Availability Statement includes a range of data types, including datasets saved as csv files, raw and annotated images of platelets, movies showing surface interactions, and even source code for the IRM spreading dynamics software.
The impact of the COVID-19 pandemic on self-harm and suicidal behaviour: a living systematic review [version 1; peer review: 1 approved, 2 approved with reservations]
John et al.
This Living Systematic Review considers the impact of the coronavirus pandemic on suicide and self-harm, with the latest version of the article considering evidence up to June 7th 2020. The Data Availability Statement shares details of the underlying and extended data relating to this research, along with the relevant PRISMA checklist under ‘Reporting Guidelines’ and the software behind the study, available on GitHub.
Why Choose Open Data for your STEM Research?
Get the credit you deserve
Some researchers worry about being 'scooped', but in reality sharing your data openly gives you greater recognition for your work by making it easier for others to find and cite your dataset. You can maximize its reuse and citation potential by reciprocally linking between your published article and the repository which hosts your data. This increased recognition for your research can lead to greater career progression opportunities, and future collaborations with other researchers and institutions.
Support reproducibility and transparency
Follow best practice for transparent, reproducible science by choosing FAIR, open data. Your research can only be truly validated when the data underpinning your findings can be accessed, reproduced, and interrogated. Open data is a ‘must-have’ for robust, rigorous science.
Build public trust in your research area
Some fields of study, including climate change and infectious diseases, come under public scrutiny more often than others. Sharing your research data openly helps to demystify the science in these areas and improve how results are reported in the media and understood by the general public. This, combined with open data’s role in making research more reproducible and transparent, makes open data essential for building public trust in science. This is particularly crucial for research with the potential to make real-world impact.
“I am committed and convinced more than ever, that openly sharing my data is the best possible way forward for my research” – Jana Hutter, Wellcome Postdoctoral Fellow (source)
Data sharing during the pandemic
The benefits of open data in the context of COVID-19 are clear: immediate, easy access to the data underpinning essential research on the virus is crucial for the whole research community, and for informing evidence-based clinical and public health responses in real time.
Despite the obvious challenges of the situation, the global STEM research community has reacted heroically to the COVID-19 crisis. Read the blog by the Wellcome Open Research Early Career Researchers Advisory Board to find out more about how the pandemic is changing science for the better.
Open Research Data in Action
What do our STEM authors have to say about open data?
Benchmark assessment of molecular geometries and energies from small molecule force fields [version 1; peer review: 2 approved]
Lim, Hahn, Tresadern, Bayly, Mobley
In this Research Article from the Chemical Information Science Gateway, the authors test a number of molecular mechanics force fields to determine which is most effective. The data underlying their results comes in a range of formats, including molecular geometries and energies from quantum mechanical calculations, histograms of force fields, scatter plots, statistics, and source code for conducting modelling and analysis.
“We’re very enthusiastic about open data and software.
When we make our work available in a truly reproducible manner, we find that other researchers build on, reuse, and extend the work in ways far beyond what we might have originally imagined. It increases the visibility of our work, and helps science progress better and faster. Everyone wins.
And not only that, but it decreases the amount of work we have to do in the long haul because we have to field far fewer requests for our data or for additional details on a particular step of our work.”
- Dr Mobley
ccbmlib – a Python package for modeling Tanimoto similarity value distributions [version 2; peer review: 2 approved]
Vogt and Bajorath
This Software Tool Article describes a Python package which can be used to assess the statistical significance of Tanimoto coefficients, and evaluate how molecular similarity is reflected when different fingerprint representations in RDKit are used. Using source datasets from ChEMBL, this article also shares the openly available RDKit software, and the Python library the researchers created.
“Data and software sharing is of critical relevance for the further scientific development of computational sciences. It starts with ensuring reproducibility of computational studies, but does not end there. Any steps taken towards open data, software, and science represent important contributions.”
- Dr Bajorath
How to Make your STEM Data Open and FAIR
The first step when considering data sharing is to identify any legal or ethical issues relating to your dataset. For example, a medical research dataset may include identifiable patient information; or for engineering and applied research, there may be restrictions around sharing your data for commercial reasons.
We know that many researchers have questions and concerns around sharing sensitive data, including:
- What kind of consent should you have from patients or participants?
- How can a dataset be anonymized?
- How do controlled-access repositories work?
- What about third party data?
You can find all the answers in our handy guide on sharing sensitive data. When submitting to F1000Research, remember that if your data is too sensitive to share openly for any reason, you should provide detailed instructions for readers on how to apply for access to the data.
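To give a flavor of what one early step in preparing a sensitive dataset might look like, here's a minimal Python sketch. It is not an F1000Research tool or requirement: the column names, the sample record, and the `pseudonymize` helper are all hypothetical, invented purely for illustration. It drops columns that directly identify a person and replaces the record ID with a keyed hash. Bear in mind that this is pseudonymization rather than full anonymization, because indirect identifiers (such as rare diagnoses or exact dates) can still make re-identification possible.

```python
import hashlib
import hmac

# Hypothetical column names, chosen only for illustration.
DIRECT_IDENTIFIERS = {"name", "date_of_birth", "postcode"}
ID_COLUMN = "patient_id"


def pseudonymize(rows, secret_key):
    """Drop direct identifiers and replace the record ID with a keyed hash."""
    cleaned = []
    for row in rows:
        # Keep only the columns that are not direct identifiers.
        out = {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}
        # HMAC-SHA256 yields a stable pseudonym: the same ID always maps to
        # the same value, but only for someone who holds the secret key.
        digest = hmac.new(secret_key, row[ID_COLUMN].encode("utf-8"), hashlib.sha256)
        out[ID_COLUMN] = digest.hexdigest()[:16]
        cleaned.append(out)
    return cleaned


if __name__ == "__main__":
    # A made-up record, not real patient data.
    records = [
        {"patient_id": "A001", "name": "Jane Doe", "date_of_birth": "1980-01-01",
         "postcode": "BS1 1AA", "heart_rate": "72"},
    ]
    for record in pseudonymize(records, secret_key=b"replace-with-a-private-key"):
        print(record)
```

Using a keyed hash rather than a plain one means the pseudonyms can't be rebuilt by someone who simply guesses the original IDs, provided the key is stored separately from the shared data.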
Curation of an intensive care research dataset from routinely collected patient data in an NHS trust [version 1; peer review: 2 approved]
McWilliams et al.
This Data Note describes a research database of 4,831 adult intensive care patients, who were treated in the Bristol Royal Infirmary, UK between 2015 and 2019. The Data Availability Statement explains that the underlying data is too sensitive to be made publicly available, and describes a process for researchers to apply for approval to access the dataset. This is a great example of how sensitive data can be shared in an ethical way, which is as open as possible and as closed as necessary.
When submitting to F1000Research, it’s important to remember that all articles should include details of any software that is required to view the datasets described or to replicate the analysis. This information should be included within your Data Availability Statement, under the heading 'Software Availability'.
If you’ve created your own research software, the source code should be openly available in a structured repository, written in an open source programming language, and included in your Data Availability Statement. We also ask for an archived version of the software at the time of submission, hosted on a recognized code-hosting platform such as GitHub.
Repositories for STEM Research
Depositing your data in a publicly accessible, recognized repository which assigns a persistent identifier (such as a DOI) ensures that your scientific dataset continues to be available to both humans and computers in the future. F1000Research asks that you deposit your data in a stable, recognized repository under a CC0 license before you submit.
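One reason a DOI works for both humans and computers is that it can be turned into a resolvable web address in a predictable way. As a small illustration, the Python sketch below checks that a string looks like a DOI and builds its `https://doi.org/` link. The `doi_to_url` helper and the DOI in the usage note are made-up placeholders, and the pattern is a commonly used approximation of modern DOI syntax rather than an official validator.

```python
import re
from urllib.parse import quote

# Approximate pattern for modern DOIs: "10.", a 4-9 digit prefix, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi):
    """Return the https://doi.org/ resolver URL for a DOI string."""
    doi = doi.strip()
    # Accept the "doi:" label that often appears in citations.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a recognizable DOI: {doi!r}")
    # Percent-encode unusual characters but leave slashes intact.
    return "https://doi.org/" + quote(doi, safe="/")


if __name__ == "__main__":
    # Placeholder DOI for illustration only; it does not point to a real dataset.
    print(doi_to_url("doi:10.1234/example.dataset"))
```

Including the full resolver link alongside the DOI in your Data Availability Statement makes it easier for readers and software alike to reach your dataset.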
We strongly recommend the use of community-recognized repositories. For some data types, such as genetic sequences and protein structures, it is essential that the data is deposited in GenBank and the Protein Data Bank, respectively.
Our Data Guidelines include a full breakdown of F1000Research-approved repositories by subject area and data type, but here’s a snapshot of some approved repositories for research across a few STEM subjects.
Health & Medicine
NAHDAP facilitates research on drug addiction and HIV infection by preserving and sharing research datasets relating to these fields, particularly those funded by the National Institute on Drug Abuse.
TCIA is a service which de-identifies and hosts a large archive of medical images of cancer, accessible for public download. Data in this repository is organized into ‘collections’ relating to specific diseases (e.g. lung cancer), image type (e.g. MRI, CT) or research focus, to enable better search functionality.
Project Datasphere is a leading oncology data sharing platform, hosting de-identified patient-level data from randomized clinical trials, and linked or enriched datasets.
Vivli focuses on individual participant-level data from completed clinical trials. They describe themselves as a “neutral broker between data contributor, data user, and the wider data sharing community” aiming to advance human health through effective data sharing.
Environment & Ecology
NERC has an Environmental Data Service (EDS) that offers a central point for NERC research data, consisting of a network of data centers hosting data from environmental scientists around the world. These include the British Oceanographic Data Centre, the National Geoscience Data Centre, and the Polar Data Centre, among others.
PANGAEA is a member of the ICSU World Data System, offering data publishing services across earth and environmental sciences. It operates as an open access library for the archival, publication and distribution of georeferenced data for earth system research projects.
The EarthChem Library is an open access repository for geochemical datasets and other digital resources for researchers. Established in 2005, it is popular with geoscientists working with geochemical, petrological, or geochronological data.
The WDCC collects, stores, and disseminates earth system data with a particular focus on climate simulation data, and is hosted by the German Climate Computing Center (DKRZ).
The HEPData repository for high-energy physics data is a unique, open access repository for scattering data from experimental particle physics. It comprises data points from plots and tables related to several thousand publications, including data from the Large Hadron Collider.
GitHub (“Where the world builds software”) is a popular software development platform that works well for hosting the source code behind your research software, offering easy access and collaboration options for users.
The Code Ocean platform supports faster, more collaborative computational research, with reproducibility at its core. It acts as a centralized repository to keep projects and results organized. Researchers can even embed Code Ocean ‘compute capsules’ in the body of their published article on F1000Research, so that readers and reviewers can run (and re-run) analyses without needing to leave the webpage, or download any new software.