metaPR2 : A database of eukaryotic 18S rRNA metabarcodes with an emphasis on protists

Abstract In recent years, metabarcoding has become the method of choice for investigating the composition and assembly of microbial eukaryotic communities. The number of environmental data sets published has increased very rapidly. Although unprocessed sequence files are often publicly available, processed data, in particular clustered sequences, are rarely available in a usable format. Clustered sequences are reported as operational taxonomic units (OTUs) with different similarity levels or more recently as amplicon sequence variants (ASVs). This hampers comparative studies between different environments and data sets, for example examining the biogeographical patterns of specific groups/species, as well analysing the genetic microdiversity within these groups. Here, we present a newly‐assembled database of processed 18S rRNA metabarcodes that are annotated with the PR2 reference sequence database. This database, called metaPR2, contains 41 data sets corresponding to more than 4000 samples and 90,000 ASVs. The database, which is accessible through both a web‐based interface (https://shiny.metapr2.org) and an R package, should prove very useful to all researchers working on protist diversity in a variety of systems.

Metabarcoding, which reveals the taxa present in an environment by amplifying and then massively sequencing a standardized genetic marker (Santoferrara, 2019;Taberlet et al., 2012), has become a very powerful and widespread approach to investigate protist diversity in a range of environments (marine, freshwater, soils, microbiomes, etc.) in recent years. By far, the most common marker used for eukaryotic microbes is the gene coding for ribosomal small subunit RNA (18S rRNA). This gene has the advantage of being universal and having well annotated reference databases such as Silva or PR 2 (Guillou et al., 2013;Quast et al., 2013) which allow, for many protist groups, a precise taxonomic assignation. Within the 18S rRNA gene, several variable regions have been used as barcodes, in particular the V4 region located near the middle of the gene and the shorter V9 region located at its 3′ end (Burki et al., 2021;Pawlowski et al., 2012).
The V4 region has most often been used in recent studies (Lopes dos Santos et al., 2021). Over the years, metabarcoding has been used to study various aspects of protist diversity. The first studies aimed to simply establish the real extent of eukaryotic diversity that was underestimated with traditional clone library approaches (e.g., Stoeck et al., 2009). In marine waters, metabarcoding studies now tackle more focused questions, for example analysing the distribution of protist groups in the ocean as a function of their size , the diversity of heterotrophic protists in the deep layers of the ocean Obiol et al., 2021), detailed biogeographic distribution of specific taxa (e.g., Malviya et al., 2016;Yau et al., 2020), factors structuring marine plankton communities Sommeria-Klein et al., 2021), and the seasonal succession of taxa (e.g., Giner et al., 2019;Lambert et al., 2019).
Fewer metabarcoding studies have been carried out in freshwater and soils, but that is rapidly changing with recent implementation of some large scale studies (e.g., for soils Mahé et al., 2017).
For bacteria and archaea, large metabarcoding projects using the 16S rRNA gene have been undertaken, such as the Earth Microbiome Project which encompassed more than 23,000 samples of both freeliving and host-associated microbes, allowing inferences of global patterns of prokaryotic diversity (Thompson et al., 2017). For eukaryotes, although a few large scale sampling programs have been performed, such as Tara Oceans, Ocean Sampling Day (OSD) and Malaspina Duarte, 2015;Kopf et al., 2015) for marine systems, most eukaryotic metabarcoding studies have targeted geographically restricted specific environments. Most studies that have performed analyses on the global ocean microbiota have relied on the three data sets mentioned, in particular Tara Oceans (e.g., Ibarbalz et al., 2019;Sommeria-Klein et al., 2021). Many smaller-scale metabarcoding studies have also been carried out, in particular for environments that have not been sampled by these expeditions, such as soils or freshwater lakes and rivers (Lopes dos Santos et al., 2021). Unfortunately, it is difficult to combine the data from these studies with those of the large scale expeditions for a range of reasons. First, even if the unprocessed data files containing raw reads have been deposited to GenBank SRA (sequence read archive), secondary data (e.g., clustered sequences) and metadata (e.g., sample coordinates, temperature) are rarely available, or, if available, are hard to locate since they are stored in a range of formats (DOCX, XLSX or TXT files) as supplementary files to the studies, often protected behind a pay-wall. Clustered sequences can be provided as operational taxonomic units (OTUs) that depend on a specific similarity threshold or amplified sequence variants (ASVs, Callahan et al., 2016) that do not. OTUs clustered with different levels of similarity (e.g., 97 vs. 99%) are not directly comparable, meaning that if two studies are to be combined, clustering has to be performed de novo from the raw sequences. In contrast, ASVs from different studies can be directly compared. Secondly, taxonomic assignation is often conducted with different reference databases, such as GenBank, Silva or PR 2 (Guillou et al., 2013;Quast et al., 2013).
Some studies have tried to combine sets of samples from different environments (e.g., marine, freshwater and soil, Singer et al., 2021), but these efforts remain limited (for example, the Singer et al., 2021 study only included 122 sampling sites). The Ocean Barcode Atlas (Vernette et al., 2021) provides a web service allowing mapping of barcodes and diversity analyses. Unfortunately, at present it is restricted solely to Tara Oceans data sets and the taxonomy has not been updated since the publication of the original study . Thus, there is clearly a need to provide the protist research community with a reference database of metabarcodes which would allow full exploration of the available sequencing data by combining existing studies across different environments.
In this study, we introduce a database of metabarcodes (metaPR 2 ) containing more than 4150 samples originating from 41 published studies, most using the V4 region of the 18S rRNA gene. The database focusses on ASVs in order for the different metabarcodes to be directly comparable. All raw sequence files were reprocessed with a pipeline based on the dada2 R package (Callahan et al., 2016).
The taxonomy of the resulting ASVs was assigned using PR 2 (Guillou et al., 2013) as a reference database. We have developed a web application available in two forms (website and R package) that allows analysis, visualization and download of the data. This database will be extended in the future with novel publicly available data sets and should prove very useful to the protist research community. In addition to introducing the database, we also provide basic statistics on the database and preliminary analyses of ASV diversity across different biomes.

| Data set selection and metabarcode processing
Data sets were selected from published studies (Table 1) Notes: Data sets sequenced with 454 technology are single reads while those processed with Illumina are paired end reads. The column "Samples" corresponds to the number of samples that have more than 1000 reads after processing. The column "Region" correspond to the 18S rRNA gene region used for metabarcoding. The column "Reads" corresponds to mean number of reads per sample after processing. The column "ASVs" corresponds to the number of ASVs in the data set after removing ASVs that have less than 100 reads over all the data sets.

TA B L E 1 (Continued)
MySQL database, the coherence of which was checked with the R validate package (van der Loo & de Jonge, 2021). For each study, raw sequence files were processed independently de novo on the Roscoff ABIMS (Analysis and Bioinformatics for Marine Science) cluster. Primer sequences were removed with cutadapt version 2.8 (Martin, 2011) using the default parameters (maximum error rate = 10%) and the -g flag which removes any base upstream of the primers. Amplicon processing was performed under the R software version 3.5.1 (R Development Core Team, 2021) using the dada2 package version 1.14.0 (Callahan et al., 2016). Read quality was visualized with the plotQualityProfile function. Reads were filtered using the filterAndTrim function, adapting parameters (truncLen, minLen, truncQ, maxEE) according to overall sequence quality. Merging of the forward and reverse reads was undertaken with the mergePairs function using the default parameters (minOverlap = 12, maxMismatch = 0). Chimeras were removed using removeBimeraDenovo with default parameters. ASVs with similar sequences from different studies were merged together and identified with a unique 10 character code which corresponds to the start of the 40 character hash value of the sequence computed using the sha1 function from the R digest package. Taxonomic classification of ASVs was performed using the assignTaxonomy function from dada2 against the PR 2 database (Guillou et al., 2013) version 4.14 (https://pr2-datab ase.org/).
We did not threshold bootstrap values (minBoot = 0). ASV sequence, taxonomy assignment and bootstrap values, as well as abundance in each sample, were stored in tables in the same master database as the metadata. In order to limit the size of the online database, we removed ASVs that corresponded to less than 100 reads over all studies included in the database and did not consider sequences that had an assignment bootstrap value lower than 75% at the supergroup level. However, the master database contains all ASVs without any filter on their abundance or bootstrap values which will allow future evolution as the number of ASVs increases with the addition of new data sets. The total number of reads in each sample was normalized to 100 by dividing the number of reads for a given ASV in a given sample by the total number of reads in the sample multiplied by 100. In this way, read abundance could be expressed as % of total eukaryotic reads in the sample in visualizations (e.g., in maps, see below). Sequence processing scripts can be found at https://github. com/vaulo t/Paper -2021-Vaulo t-metap r2/tree/main/R_proce ssing.

| Metabarcode clustering
Since the data sets included in metaPR 2 used different sets of primers (see below Table S3), for the purpose of this study we clustered ASVs with 100% similarity using the -cluster_fast option of vsearch version 2.18.0. ASVs within each cluster were merged together, using the centroid ASV as the new ASV, called cASV. This led to a slight reduction in the total number of ASVs from 79,000 to 70,000 once clustered. In general, sequences included in a given cluster were widely overlapping, although a few bases could be different outside the overlap region, indicating some microdiversity within these clusters ( Figure S1). Clustering was only used in the framework of this study and the data provided in the web application are not clustered.

| Metabarcode similarity to reference sequences
In order to evaluate the similarity of ASVs to existing reference sequences, in the context of this study we followed the approach of Metz et al. (2022). We compared ASVs to sequences from the PR 2 database (Guillou et al., 2013) version 4.14 (https://pr2-datab ase. org/) using the -usearch_global option of vsearch with iddef = 2.
The similarity information was stored in the MySQL database, then retrieved and merged with the ASV information using an R script.
Alpha and beta diversity analyses were performed using the R phy-

| Ecological function
We used had not defined any function, we assigned a function based on general knowledge for protists, generating a new table (Table S1).

| Diversity analyses
Diversity analyses were performed with the phyloseq R package (McMurdie & Holmes, 2013). NMDS was based on Bray-Curtis dissimilarity. Upset plots to visualize the number of cASVs common to two or more environments were performed with the UpSetR R package.

| R shiny application
All post-processing was conducted with the R software. The data were extracted from the MySQL database using a custom script and stored in files using the R qs package that allows extremely fast loading of files (Travers, 2021). The data were post-processed using the dplyr and tidyr packages. An R shiny application was developed to interact with the database using the following R packages: shiny, DT, shinyvalidate, shinyWidgets and shinycssloaders (Sali & Attali, 2020 with only two data sets representing the V9 region (Tara Oceans and Argentinian lakes, Table 1). The most common primer pairs used for V4 ( Figure S2, Table S2 and S3) were those designed by Stoeck et al. (2010) and modified by Piredda et al. (2017). The V4 metabarcodes varied from 309 to 672 bp and were overlapping ( Figure S2).
The metaPR 2 database contains more than 4150 samples  Figure S3).
The number of samples per data set is quite heterogeneous, ranging from less than 10 to almost 1150 for Tara Oceans (Table 1).
The total number of reads analysed is almost 800 million for V9 and above 220 million for V4. The average number of reads per data set is also highly variable ranging from about 1000 in the older studies sequenced by 454 technology to almost 700,000 for Tara V9 (Table 1), which explains why overall there are more reads for V9 than V4 despite only two data sets using V9. The total number of ASVs was about 90,000. The number of ASVs in a given study ranges from less than 100 to more than 30,000 depending on both the number of samples and the depth of sequencing (Table 1). Since different studies have used different primer sets, it was necessary for the purpose of the analyses presented below to cluster ASVs with 100% similarity (cASVs, see Materials and Methods).

| Protist composition
Overall, the database is dominated by Opisthokonta (Metazoa and fungi) and Alveolata (Dinoflagellata) ( Figure S4). In this study, we decided to focus on protists and on the V4 region. The focus on protists is justified because the sampling strategy of most data sets was optimal for microbial eukaryotes. DNA from the three divisions (metazoa, plants and fungi) not included in protists were probably unevenly sampled, for example, plant seeds in soils or larval stages of metazoa in water environments. The focus on the V4 data sets that contain almost 3000 samples and 850 sites is due to that fact that the data for the V9 region are dominated by the Tara Oceans data set, which has been extensively analysed previously (e.g., de Vargas et al., 2015).
Protist sequences represent more than 41,000 ASVs (~33,000 cASVs once clustered). In terms of reads and cASVs, the database is dominated by Alveolata (in particular dinoflagellates), followed Comparing the metaPR 2 metabarcodes to reference sequences, such as those from PR 2 , reveals that there are very few novel

| Global trends across environments
Analysis of the metaPR 2 database corroborates some trends that have been observed in studies with much fewer samples. Singer of individual sample diversity, terrestrial ecosystems are most diverse, followed by rivers, oceanic and coastal environments, with lakes the least diverse in agreement with previous analyses (Singer et al., 2021), these differences all being significant ( Figure S7). Most cASVs are restricted to a single type of ecosystem, with less than 2% (620 out of 33,235) common to two or more ecosystems if we consider coastal and oceanic ecosystems together (Figure 7). This segregation based on ecosystem type is probably not linked to the use of different primers. Since we used clustered ASVs (cASVs), we grouped together similar sequences even if they originated from data sets using different primers. Moreover, some data sets from different ecosystems used the same primer sets. For example, data sets numbers 34 and 204 (ocean), number 197 (lakes) and number 199 (soils) used the same TAReuk454FWD1/TAReukREV3 primer sets. The highest number of cASVs corresponds to marine ecosystems (coastal and oceanic), followed by terrestrial and freshwater.
Interestingly, both coastal and oceanic ecosystems have a large number of specific cASVs with roughly one third purely oceanic, one third purely coastal and one third common. It is also striking that there are very few cASVs common between freshwater rivers and lakes (just above 7%). In terms of novelty, that is, of cASVs with low similarity to known sequences, terrestrial ecosystems are the least known with a median similarity below 95%, followed by rivers, lakes, coastal and oceanic ecosystems (Figure 5b). In some way, this reflects the fact that soil protists have only recently been investigated (Geisen et al., 2018). A comparison between the community structures from these different ecosystems using NMDS based on Bray-Curtis dissimilarity ( Figure 8) reveals a clear gradient: terrestrial ecosystems, followed by rivers and lakes, then coastal and oceanic ecosystems. Interestingly, river communities are the closest to soil communities, as they are probably enriched in terrestrial protists through soil drainage.

| Shiny application
With a database of such size and complexity, it is necessary to create tools that allow to explore the database and to download the data of interest (e.g., for a specific taxonomic group or environment).
We developed an R Shiny application (Figure 9 and Figure S8)  is possible to select/deselect specific data sets or groups of data sets, such as all oceanic data sets ( Figure S9). Selection can also be based on sample characteristics such as whether samples come from DNA or RNA, the ecosystem, the type of substrate (e.g., ice, water, soil), the size fraction and the depth level ( Figure S10). It is possible through reactive menus to navigate the taxonomy tree down to the species and even ASV level (potentially corresponding to cryptic species or subspecies). ASVs can be filtered based on the number of reads found for this ASV in the whole database (between 100 and 10,000). The number of total reads for a given taxonomic level can be visualized in a treemap ( Figure S11) with the number of reads cASVs the x-axis represents the fraction of reads per taxon while the yaxis represents one of the variables from the metadata (depth level, temperature, etc.). For continuous variables, bins are created. The barplot panel can also be used for time series with different levels of aggregation (year, month, day). Alpha and beta diversity ( Figure S14) can be computed for a limited number of samples (1000 maximum).
The whole set of ASV sequences can be searched using a BLASTlike query and the resulting ASVs mapped ( Figure S15). Finally, it is possible to download data sets and sample metadata as well as ASV sequences and read abundance for the data sets, samples and taxa selected ( Figure S16).
Besides being very useful for research, the metaPR 2 shiny application can also be used for teaching purposes in the field of microbial ecology. In the framework of the undergraduate course ES2304 -Microbes in Natural Systems at Nanyang Technological University (Singapore), the application was used to investigate the biogeography of several groups of phytoplankton (diatoms, bolidophytes, dinoflagellates, green algae) by groups of four students in a flippedclassroom model. Each group had to do some research on the genus it was assigned and then analyse the distribution and diversity of key species, answering questions such as whether species had ubiquitous distributions or distributions controlled by latitude or temperature and whether species appeared to contain different genotypes as reflected by the presence of several ASVs. In order to make their analysis less daunting, they only analysed the OSD, Malaspina and Tara Oceans V4 data sets. Despite the fact that they had only 1 week to discover the interface and produce their analyses, this hands-on experience resulted in very positive feedback from the students, who especially enjoyed using the platform to look at "real-world" research data.

| PER S PEC TIVE S
Like its sister database, PR 2 , which is revised every 6-12 months with the addition of novel sequences as well as with taxonomy updates, the metaPR 2 database will evolve with time to include more data sets F I G U R E 9 The metaPR 2 shiny application available at https://shiny.metap r2.org and more samples, in particular from ecosystems (e.g., extreme environments), regions (e.g., tropical and southern latitudes) and substrate (e.g., host microbiomes) that are still underrepresented. We have listed more than 280 metabarcoding studies of protist diversity, for most of which data are available from GenBank SRA (Lopes dos Santos et al., 2021). These data will be processed and incorporated into the database with probably yearly releases. The taxonomy of metaPR 2 will evolve in parallel to that of PR 2 and we will add other functional and phenotypic traits (e.g., size, mixotrophy type) as there is clear tendency to use this approach more widely for protists (Schneider et al., 2020). We will also develop novel functionalities for the R shiny application and package, for example heat maps and phylogenetic analyses. This will constitute a very rich resource that will help researchers to compare eukaryotic communities across habitats.

ACK N OWLED G EM ENTS
We are very grateful to the many researchers that answered our queries about their publications, in particular pointing us to the available metadata. We thank Javier de Campo, Catherine Ribeiro,

CO N FLI C T O F I NTE R E S T
The authors declare no competing financial interests.

DATA AVA I L A B I L I T Y S TAT E M E N T
The source code for the Shiny server has been made available as an R package from GitHub (https://github.com/pr2da tabas e/metap r2-shiny, doi: 10.5281/zenodo.5992354). The source code for this study has been made available from GitHub (https://github.com/ vaulo t/Paper -2021-Vaulo t-metapr2). Source code for sequence processing has been made available from GitHub https://github.com/ vaulo t/Paper -2021-Vaulo t-metap r2/tree/main/R_proce ssing.