Based on an initial keyword search, we selected the 17 most popular microbial ecology-related journals, as these were more likely to have sequence-specific data deposition instructions or requirements. We surveyed all the articles published in these journals between January 2015 and March 2019 (n = 26,927 articles, Supplementary Table S1), as concerns over data deposition practices began to grow in 2015 (ref. 14) and were soon followed by stricter standards for data availability12. Using a custom-built pattern-based text extraction algorithm followed by manual curation, we selected those studies which performed 16S rRNA gene amplicon sequencing and listed INSDC-compliant accession numbers (n = 2,015, Supplementary Table S1; 145,203 samples).
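The pattern-based extraction step can be illustrated with a minimal Python sketch. The regular expression below follows the publicly documented shapes of INSDC BioProject, study/sample/experiment/run, and BioSample accessions; it is an assumption made for illustration, not the algorithm used in this study, and the name `find_accessions` is hypothetical. Any matches would still require manual curation.

```python
import re

# Assumed accession shapes (illustrative, not the authors' exact patterns):
#   BioProject:                 PRJ + E/D/N + letter + digits    (e.g. PRJNA123456)
#   Study/sample/experiment/run: E/D/S + R + P/S/X/R + 6+ digits (e.g. SRR1234567)
#   BioSample:                  SAM + E/D/N + optional letter + digits
ACCESSION_RE = re.compile(
    r"\b(?:PRJ[EDN][A-Z]\d+|[EDS]R[PSXR]\d{6,}|SAM[EDN][A-Z]?\d+)\b"
)

def find_accessions(text: str) -> list[str]:
    """Return the unique accession-like strings found in text, sorted."""
    return sorted(set(ACCESSION_RE.findall(text)))

example = (
    "Raw reads were deposited in the SRA under BioProject PRJNA123456 "
    "(runs SRR1234567 and SRR1234568). Placeholder: SRPXXXXXX."
)
print(find_accessions(example))
```

A placeholder such as `SRPXXXXXX` is deliberately not matched, because the pattern requires digits after the prefix.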
To confirm that our parsing algorithm did not miss accession numbers in articles containing 16S rRNA gene amplicon sequencing, we randomly selected for manual inspection 150 articles which mentioned 16S rRNA but for which no accession numbers were detected. Of these, one contained a misspelled accession number, two had archived their sequences in unconventional repositories (Google Drive and GEO, a gene expression database, Supplementary Data 2), and 19 were identified as having performed 16S rRNA gene amplicon sequencing, but had not included any reference to the data. We found no cases in which accession numbers or sequence data were stored in supplementary materials. From this group, we estimate that 18% of the studies in our database (n = 469) performed 16S rRNA gene amplicon sequencing but did not provide access to the data (Fig. 1a). Four studies mentioned depositing data in dbGaP18, and we could verify the existence of three of these studies. We found that an additional 6.5% of the studies had deposited their data in the Qiita19, MG-RAST20, and figshare databases (n = 14, n = 134, and n = 24 studies, respectively). Of the estimated 2,656 studies employing 16S rRNA gene amplicon sequencing, 75.9% deposited their data in an INSDC database in the period studied (Fig. 1a).
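As a quick arithmetic check, the three categories reported above (INSDC accession numbers, no reference to the data, and alternative databases) sum to the estimated total of 2,656 studies. The short Python sketch below recomputes the shares from the counts given in the text; the variable and function names are illustrative only.

```python
# Counts reported in the text above.
insdc = 2015                  # studies listing INSDC-compliant accession numbers
no_data = 469                 # estimated studies providing no access to the data
alternative = 14 + 134 + 24   # Qiita, MG-RAST, and figshare

total = insdc + no_data + alternative

def share(count: int, total: int) -> float:
    """Percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)

print(total, share(insdc, total), share(no_data, total), share(alternative, total))
```

The share without accessible data comes out at 17.7%, which the text reports rounded to 18%.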
To obtain more precise estimates of the percentage of articles which deposited their data in each database, we focused on the subset of 635 studies which sequenced the V3–V4 region of the 16S rRNA gene between base pairs 515 and 806 (hereafter the V3–V4 subset), a target region which has gained popularity since its development and use by the Earth Microbiome Project21,22. Of these, 74.5% (n = 474) of the studies listed INSDC-compliant accession numbers within the article, but the accession numbers from 5% of the studies (n = 33) were not findable on any INSDC database. Additionally, 19% (n = 121) did not provide an identifiable link to the data, and 6.8% of the studies deposited their data in the Qiita, MG-RAST, and figshare databases (n = 9, n = 24, n = 7, respectively, Fig. 1b). Two studies provided SRA submission IDs rather than accession numbers, and their data were also inaccessible.
The increasing popularity of microbial community sequencing was evident in our data. Over the period studied, the number of studies in the V3–V4 subset rose from 56 in 2015 to 214 in 2018 (Supplementary Fig. 1a). The proportion of publications which claimed to deposit data in INSDC databases increased slightly over time, from 33/56 in 2015 to 172/214 in 2018 (χ2 = 6.6, p = 0.01, Supplementary Fig. 1b), suggesting an increasing tendency towards deposition in INSDC databases. Deposition in alternative databases decreased (χ2 = 14.04, p < 0.001, Supplementary Fig. 1c), indicating a switch to these standardized databases but not towards making data accessible in general, as the proportion of studies which did not deposit their data was remarkably stable over time (χ2 < 0.28, p = 0.6, Supplementary Fig. 1d). During this period, the number of studies without publicly available data rose from 13 in 2015 to 38 in 2018 (Supplementary Fig. 1d).
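A comparison of proportions of this kind can be sketched as follows. This is only an illustration: it applies a plain Pearson chi-square test to a 2 × 2 table contrasting the first and last full years (33 of 56 studies in 2015 versus 172 of 214 in 2018). The exact test behind the reported statistics is not specified here and presumably used every year in the period, so this sketch is not expected to reproduce χ2 = 6.6. The function name is hypothetical.

```python
import math

def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows: 2015 and 2018. Columns: claimed INSDC deposition, did not.
stat, p = chi_square_2x2(33, 56 - 33, 172, 214 - 172)
print(f"chi2 = {stat:.2f}, p = {p:.4f}")
```

A trend test across all years would be closer in spirit to the analysis described, but requires the per-year counts in Supplementary Fig. 1.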
Data deposition in any public repository is preferable to no deposition at all. However, despite the advantage of using the same platform for the housing, (re-)analysis, and storage of data, non-INSDC alternatives were not designed for the long-term storage of 16S rRNA amplicon sequencing data, and thus are likely to lead to the long-term loss of information. Qiita’s intended use is “the analysis and administration of multi-omics datasets” (https://qiita.ucsd.edu/). This platform is not designed for the long-term archiving of these data, and accordingly, Qiita includes software to facilitate deposition of sequences to the ENA, at which point MIMARKS requirements are enforced17. Similarly, MG-RAST20 is an online platform for metagenomics analyses which also facilitates sequence deposition to appropriate databases. In contrast, figshare is a general repository which hosts most forms of research output (https://figshare.com/), but it is neither sequence-specific nor richly searchable, and does not enforce community standards.
Microbiome research spans a wide range of fields including ecology, epidemiology, medicine, biotechnology, and agricultural engineering, and is likely to become more integrative in the future23. Synthesis efforts to bridge knowledge gaps across environments6 will likely rely on the ability to find data by searching databases directly, rather than resorting to a body of literature which is currently spread across journals from various fields. To ensure future reusability, it is therefore essential that microbiome data is deposited in the appropriate INSDC databases, which also store searchable metadata and allow for automatable access to large datasets, and that these databases continue to improve their searchability.
Due to the sensitive nature of unpublished data, INSDC databases allow users to upload their data and receive an accession number but keep the data private indefinitely24. This was evident in our data collection. We found that 2.2% of the studies (n = 45) listed incorrect accession numbers, for example placeholders (Supplementary Fig. 2b). Over the period studied, this proportion went up significantly (χ2 = 9.18, p < 0.001), from 1.3% in 2015 to 5.3% in 2019. Among the 2,015 articles which contained accession numbers, 7.2% (n = 146) had listed accession numbers correctly but had not made the sequence data public, and this proportion increased slightly over time from 5.9% in 2015 to 12.2% in 2019 (χ2 = 3.9, p = 0.05, Supplementary Fig. 2c), indicating that recent articles were more likely not to have made their data public at the time of manuscript publication. An additional 2.5% of the studies (n = 51) had not made their sequence metadata public, a proportion which increased over the period studied (χ2 = 14.83, p < 0.001, Supplementary Fig. 2d).
While microbiome sequence data has been lauded for its uniform format, we found that the deposited sequence files varied widely in format, often rendering them unusable. Among the 441 studies in the V3–V4 subset for which INSDC-compliant accession numbers were available and data was public in the repository (representing 45,440 samples), we found that between 2015 and 2019, 11.8% of the studies (n = 52) had uploaded a single sequence file for the entirety of the project, despite analyzing more than one sample (Fig. 2b). Currently, most sequencing platforms are able to output demultiplexed data, i.e., one or more sequence file(s) per sample. However, common legacy formats consisted of one or two files for the entirety of the run as well as a mapping file, which contained the primer barcodes used to demultiplex the sequences (i.e., sequence metadata file). INSDC platforms require sequencing data to be demultiplexed prior to deposition, and non-demultiplexed raw data is rendered unusable because any header information in the sequence files is eliminated. Our data reflected this legacy effect: between 2015 and 2019, the proportion of studies which contained a single sequence file decreased significantly from 24.5% to 9.5% (χ2 = 16.92, p < 0.001, Fig. 2c). Furthermore, over this period, the proportion of studies which used Illumina platforms increased, and the proportion which used the older 454 pyrosequencing technique decreased (χ2 = 10.96, p < 0.001; and χ2 = 10.46, p = 0.001, respectively; Fig. 2c). Our findings shed light on the effect that changes in sequencing platforms and file formats have on the scientific community’s ability to access data later.
Further variability in the formatting of sequence data complicated data reuse. For example, we found that 1.6% of the studies (n = 7) contained sequence files which lacked standard quality scores (Supplementary Fig. 3a). During sequence processing, quality scores allow users to assess the quality of the data and to exclude sequence reads with poor quality. Therefore, sequence data lacking quality scores is not reusable. We also found that 18.1% (n = 80) of the studies contained putative primer sequences, but there was no significant change over time in this proportion (χ2 = 2.33, p = 0.13, Supplementary Fig. 3b). Primer presence is not a strong determinant of whether data is reusable, and it is advised that data is archived in the rawest format possible. However, knowledge of primer presence and primer sequence identity is essential for the proper reprocessing of the data in the future, and currently, there are no standard methodologies for including this information in the metadata. Without this information, barcode and primer sequences may be interpreted as regular data. The lack of consensus on primer presence is one example of the complexities that underlie the analysis of seemingly reusable data. Other ‘hidden’ obstacles include the lack of information on the formatting of quality scores, and a lack of information on the primer sequence and length. Focusing on the V3–V4 subset allowed us to collect all possible primer sequences for this region and test for their presence; however, this is labor intensive, and forces data re-users to make inferences about the sequence formatting in their analyses, reducing the quality of research. Including extensive primer and file formatting information in the sequence metadata, as well as documentation of computational processing steps, which is automatically provided by state-of-the-art pipelines such as QIIME2 (ref. 25) or Snakemake workflows26, may greatly facilitate data re-use.
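The two checks discussed above, for putative primer sequences and for missing quality scores, can be sketched in Python as follows. This is a simplified illustration rather than the procedure used in this study. It assumes uncompressed four-line FASTQ records, tests only for an exact primer match at the start of each read (no mismatches, no reverse complements), and treats a file in which every quality character is identical as lacking informative scores. The function names are hypothetical, and the widely used 515F primer is included only as an example.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def primer_regex(primer: str) -> re.Pattern:
    """Compile a degenerate primer into a regex anchored at the read start."""
    return re.compile("^" + "".join(IUPAC[base] for base in primer.upper()))

def parse_fastq(lines):
    """Yield (sequence, quality) pairs from four-line FASTQ records."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i + 1], lines[i + 3]

def audit_reads(records, primer: str):
    """Return (fraction of reads starting with primer, True if qualities vary)."""
    pattern = primer_regex(primer)
    records = list(records)
    if not records:
        return 0.0, False
    with_primer = sum(1 for seq, _ in records if pattern.match(seq))
    quality_chars = {ch for _, qual in records for ch in qual}
    # A single repeated quality character suggests placeholder scores.
    return with_primer / len(records), len(quality_chars) > 1

FORWARD_515F = "GTGCCAGCMGCCGCGGTAA"  # example primer, not taken from this study
fastq = [
    "@read1", "GTGCCAGCAGCCGCGGTAATACGGAGG", "+", "I" * 27,
    "@read2", "TACGGAGGGTGCAAGCGTTAATCGGAA", "+", "I" * 27,
]
fraction, informative = audit_reads(parse_fastq(fastq), FORWARD_515F)
print(fraction, informative)
```

In this toy input, one of the two reads begins with the primer and both reads carry the same repeated quality character, so the file would be flagged on both counts.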
Properly labeling sequence data and including detailed metadata is essential to data reuse27. Among the studies in the V3–V4 subset which provided accession numbers, errors in labeling exceeded any other type of error (Fig. 3). Because our data collection contained 16S rRNA amplicon sequencing studies exclusively, we checked whether this information was included correctly in the sequence metadata. Among these studies, 12% (n = 53) of them had incorrectly labeled their sequences, using terms other than “Amplicon”. The percentage of studies with this error varied widely, from 18.2% (n = 6) in 2015 to 5% (n = 1) in 2019, and no trends were found over time (χ2 = 1.41, p = 0.24, Supplementary Fig. 3c).
A defining development in the field of microbial ecology has been the advent of paired-end sequencing, by which both ends of the fragment are sequenced and later aligned in silico, resulting in a higher read accuracy or in longer read lengths28. Next-generation sequencers currently output forward and reverse reads in separate files. We checked whether datasets labeled as “paired” also contained files corresponding to forward and reverse reads (i.e., were labeled appropriately). This was not the case for 16.8% of the studies targeting the V3–V4 region (n = 74), and this proportion exhibited no temporal pattern (χ2 = 2.09, p = 0.15, Supplementary Fig. 3d). Much like datasets which include putative primer sequences, when data is labeled as paired-end and only a single file per sample is available, future users must infer what the true state of the sequence data is. Upon a qualitative inspection of these datasets, we found that a common source of the error was that only the forward reads or merged reads had been deposited. This labeling error does not render the data unusable, but makes the sequencing conditions hard to understand for future users, who must reverse engineer the methods from the data format and quality information.
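The two labeling checks described above (whether amplicon data is labeled as “Amplicon”, and whether data labeled as paired has both forward and reverse files) can be sketched as a simple metadata audit. The dictionary keys below are hypothetical and merely modelled on typical archive metadata fields; the run identifiers and file names are invented for illustration.

```python
def audit_labels(runs):
    """Flag runs whose metadata labels disagree with the study design or files.

    Each run is a dict with hypothetical keys: 'run', 'library_strategy',
    'library_layout', and 'files' (a list of sequence file names).
    """
    problems = []
    for run in runs:
        if run["library_strategy"].upper() != "AMPLICON":
            problems.append((run["run"], "strategy is not AMPLICON"))
        if run["library_layout"].upper() == "PAIRED" and len(run["files"]) < 2:
            problems.append((run["run"], "labeled PAIRED but fewer than two files"))
    return problems

runs = [
    {"run": "run_A", "library_strategy": "AMPLICON", "library_layout": "PAIRED",
     "files": ["A_1.fastq.gz", "A_2.fastq.gz"]},
    {"run": "run_B", "library_strategy": "WGS", "library_layout": "PAIRED",
     "files": ["B.fastq.gz"]},
]
for run_id, problem in audit_labels(runs):
    print(run_id, problem)
```

A file count alone cannot tell whether a single file holds forward-only or merged reads, so flagged runs would still need the kind of qualitative inspection described above.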
Repopulating the archives
Errors in data deposition may render entire datasets unavailable for future research, or they may greatly complicate future data reuse. To quantify these losses, we followed the 635 studies which performed amplicon sequencing of the V3–V4 segment of the 16S rRNA gene (Fig. 3). Throughout the process of archiving data, we found that 19% of the datasets were not archived at all, while 6.3% were archived in other databases which were not designed for this task. A further 6.1% of the datasets were improperly deposited in sequence databases, while 11.5% and 8.9% were made partially unavailable (i.e., contained putative primer sequences) or completely unavailable (i.e., not demultiplexed), respectively, due to errors in data formatting. Finally, errors in labeling affected 14.6% of the available data.
Privacy issues, which are common in studies with human subjects, seem to have played only a minor role in choosing non-public repositories (4 studies in our dataset reported using dbGaP). One work-around for keeping microbial community data open is the removal of potentially identifying human reads. With the increasing number of more costly shotgun metagenomes, community standards should be formulated either for archiving in closed databases like dbGaP or for the removal step and its documentation, in the interest of re-usability without compromising privacy.
In total, only 34% of the studies identified (n = 216) contained fully reusable datasets, and 25.5% (n = 162) contained partially available datasets. A further 40.3% of studies (n = 256) contained data that was either not available or not reusable, severely limiting advances in synthetic microbiome research and compromising some of the fundamental principles in science12. An additional hurdle to data reuse is the availability of suitable metadata, and an assessment of the content and informational value of the metadata supplied for studies in the V3–V4 subset is presented in Supplementary Figs. 4–6 and 8.
Our findings show the true extent of reusability of the sequencing data which has been deposited over the past 5 years, and reveal a serious gap between the sequence data which is uploaded and that which may serve to inform future research. By identifying the main reasons for data loss (i.e., loss due to data location, errors in data deposition, errors in data formatting, and errors in data labeling), the present study provides the basis of and concrete recommendations for improved data archiving practices (Table 1). Given the plethora of pressing environmental, biotechnological, and medical challenges, preserving microbiome data is particularly relevant across fields of basic and applied research.