S5 Supplementary Information

Be positive: customised reference databases and new, local barcodes balance false taxonomic assignments in metabarcoding studies

List of commands for producing customized reference databases and for the taxonomic assignment of metabarcoding data.

Tools used in this workflow

1. Select sequences from COInr

1.1 Download COInr DB

COInr (Meglécz, 2022a) was downloaded from www.zenodo.org/record/6555985. The database can be used as is for formatting and for selecting a target region. It comes with an associated taxonomy file in .tsv format.

1.2 COInr-WO-Insecta

Remove insect sequences from COInr

The taxa to remove are listed in metafiles/taxon_list_insecta.txt.
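The selection logic of this step can be illustrated with a minimal Python sketch. It is not the command used in the study: it assumes that each database row is a (seqID, taxID, sequence) tuple and that the taxonomy is available as a child-to-parent taxID mapping. The function names and the toy taxIDs are hypothetical.

```python
def is_descendant(tax_id, ancestor_ids, parent_of):
    """Return True if tax_id is, or descends from, any taxID in ancestor_ids."""
    seen = set()
    while tax_id not in seen:
        if tax_id in ancestor_ids:
            return True
        seen.add(tax_id)
        # The root is its own parent (or has no parent), which ends the climb.
        tax_id = parent_of.get(tax_id, tax_id)
    return False


def remove_taxa(rows, excluded_ids, parent_of):
    """Keep (seqID, taxID, sequence) rows whose taxID is outside the excluded taxa."""
    return [row for row in rows if not is_descendant(row[1], excluded_ids, parent_of)]


# Toy taxonomy with hypothetical taxIDs:
# 1 = root, 10 = Insecta, 11 = an insect species, 20 = Mollusca, 21 = a mollusc species.
parent_of = {1: 1, 10: 1, 11: 10, 20: 1, 21: 20}
rows = [("s1", 11, "ACGT"), ("s2", 21, "TTGA"), ("s3", 10, "GGCC")]

kept = remove_taxa(rows, {10}, parent_of)
print([seq_id for seq_id, _, _ in kept])
```

In this toy example only the mollusc sequence is retained, because the other two rows are assigned to the excluded taxon or to one of its descendants.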

1.3 COInr-Med

Derived from COInr-WO-Insecta and restricted to the Mediterranean marine families gathered from OBIS.

Data_S4.tsv is the list of taxonomic families present in the Mediterranean Sea.
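As an illustration only, selecting sequences by family could be sketched as follows. The sketch assumes (seqID, taxID, sequence) rows and three taxonomy lookups (parent, rank and name per taxID). All names, taxIDs and the one-item family list are toy values, not the content of Data_S4.tsv.

```python
def family_of(tax_id, parent_of, rank_of, name_of):
    """Climb the lineage from tax_id and return the family name, or None."""
    seen = set()
    while tax_id in parent_of and tax_id not in seen:
        if rank_of.get(tax_id) == "family":
            return name_of[tax_id]
        seen.add(tax_id)
        tax_id = parent_of[tax_id]
    return None


def keep_families(rows, families, parent_of, rank_of, name_of):
    """Keep rows whose lineage contains a family listed in `families`."""
    return [
        row for row in rows
        if family_of(row[1], parent_of, rank_of, name_of) in families
    ]


# Toy taxonomy with hypothetical taxIDs.
parent_of = {1: 1, 100: 1, 101: 100, 200: 1, 201: 200}
rank_of = {100: "family", 101: "species", 200: "family", 201: "species"}
name_of = {100: "Mytilidae", 101: "Mytilus galloprovincialis",
           200: "Unionidae", 201: "Unio pictorum"}

families = {"Mytilidae"}  # toy stand-in for the family list
rows = [("s1", 101, "ACGT"), ("s2", 201, "TTGA")]

selected = keep_families(rows, families, parent_of, rank_of, name_of)
print([seq_id for seq_id, _, _ in selected])
```

Note that in this sketch a sequence assigned only to a rank above family has no family in its lineage and is therefore dropped; whether that matches the behaviour of the actual command is not stated here.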

1.4 COInr-Med+

Add new barcodes to COInr-Med

1.4.1 Suggest one or more lineages for each taxon name based on the existing lineages in taxonomy.tsv

Data_S2_barcodes.tsv is a tab-separated file with the columns seqID, taxon and sequence. It can be created from Data_S2.tsv by selecting the appropriate columns.
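One possible way to select the columns is sketched below in Python. It assumes that Data_S2.tsv has a header line containing columns named exactly seqID, taxon and sequence; the extra column "locality" and the example record are invented for the demonstration.

```python
import csv
import io


def select_columns(src, dst, columns=("seqID", "taxon", "sequence")):
    """Copy only `columns` from a tab-separated input to a tab-separated output."""
    reader = csv.DictReader(src, delimiter="\t")
    writer = csv.DictWriter(
        dst,
        fieldnames=list(columns),
        delimiter="\t",
        extrasaction="ignore",  # silently drop every other column
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


# In-memory stand-in for Data_S2.tsv (hypothetical content).
src = io.StringIO(
    "seqID\ttaxon\tlocality\tsequence\n"
    "B001\tMytilus galloprovincialis\tsite_A\tACGT\n"
)
dst = io.StringIO()
select_columns(src, dst)
print(dst.getvalue(), end="")
```

With real files, `src` and `dst` would be file handles opened with `newline=""`, as recommended for the csv module.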

The outputs are a lineage file, COInr_Med_plus/format_custom/custom_lineages.tsv, and a sequence file, COInr_Med_plus/format_custom/custom_sequences.tsv.

Revise the output lineage file: complete the lineage when a taxon name is new to taxonomy.tsv, and choose between homonyms where necessary.

The revised file is saved as custom_lineages_verified.tsv.

1.4.2 Find taxID for each taxon in the lineage file

This command updates the taxonomy.tsv file by adding new taxIDs. Remember to use the generated COInr_Med_plus/add_taxids/taxonomy_updated.tsv file in all subsequent taxonomic assignment steps. The command also produces COInr_Med_plus/add_taxids/sequences_with_taxIDs.tsv, which is used in the next step.

1.4.3 Dereplicate custom sequences
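The idea behind dereplication can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of the tool used in the study: within each taxID, a sequence is dropped if it is identical to, or fully contained in, a longer retained sequence. Rows are assumed to be (seqID, taxID, sequence) tuples and the function name is hypothetical.

```python
from collections import defaultdict


def dereplicate(rows):
    """Within each taxID, drop sequences contained in a longer (or equal) kept one."""
    by_taxid = defaultdict(list)
    for seq_id, tax_id, seq in rows:
        by_taxid[tax_id].append((seq_id, seq.upper()))

    kept = []
    for tax_id, records in by_taxid.items():
        # Longest first, so shorter sequences are compared against retained longer ones.
        records.sort(key=lambda rec: len(rec[1]), reverse=True)
        retained = []
        for seq_id, seq in records:
            if not any(seq in longer for _, longer in retained):
                retained.append((seq_id, seq))
        kept.extend((seq_id, tax_id, seq) for seq_id, seq in retained)
    return kept


rows = [
    ("a", 7, "ACGTACGT"),
    ("b", 7, "CGTA"),      # substring of "a", same taxID: dropped
    ("c", 7, "ACGTACGT"),  # exact duplicate of "a": dropped
    ("d", 8, "CGTA"),      # same sequence as "b" but another taxID: kept
]
unique_rows = dereplicate(rows)
print(unique_rows)
```

Comparing only within a taxID keeps identical sequences that belong to different taxa, which matters when the database is later used for taxonomic assignment.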

1.4.4 Pool and dereplicate COInr_Med DB + custom sequences

Move the updated taxonomy file to the same folder as COInr_Med_plus.tsv.

2. Select the Leray region for each DB

Select sequences that cover at least 80% of the region amplified by the metabarcoding primer pair (the Leray region), and trim them to this region.
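The 80% coverage criterion can be illustrated with a short Python sketch. It assumes that each database sequence has already been aligned to a reference of the target region and that the alignment coordinates are known (1-based, inclusive). The function name, the parameter names and the region length of 100 are illustrative assumptions.

```python
def trim_to_region(seq, seq_start, seq_end, region_start, region_end,
                   region_len, min_cov=0.8):
    """Return the part of `seq` aligned to the region, or None if coverage is too low.

    seq_start/seq_end: aligned positions on the sequence (1-based, inclusive).
    region_start/region_end: aligned positions on the target region (1-based, inclusive).
    """
    covered = region_end - region_start + 1
    if covered / region_len < min_cov:
        return None
    return seq[seq_start - 1:seq_end]


REGION_LEN = 100  # illustrative length, not the real length of the target region

# 5 flanking bases, 85 bases aligned to the region, 5 flanking bases.
seq = "C" * 5 + "A" * 85 + "C" * 5
trimmed = trim_to_region(seq, 6, 90, 10, 94, REGION_LEN)

# Only 70 of 100 region positions are covered: the sequence is rejected.
too_short = trim_to_region("A" * 70, 1, 70, 1, 70, REGION_LEN)

print(len(trimmed), too_short)
```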

2.1 COInr

2.2 COInr-WO-Insecta

2.3 COInr-Med

2.4 COInr-Med+

3. Format the four databases for each taxonomic assignment tool

3.1 VTAM format

3.1.1 COInr

3.1.2 COInr-WO-Insecta

3.1.3 COInr-Med

3.1.4 COInr-Med+

3.2 RDP format

3.2.1 COInr

3.2.2 COInr-WO-Insecta

3.2.3 COInr-Med

3.2.4 COInr-Med+

3.3 QIIME format

3.3.1 COInr

3.3.2 COInr-WO-Insecta

3.3.3 COInr-Med

3.3.4 COInr-Med+

4. Taxonomic assignment

4.1 VTAM taxassign

Create output directory

4.1.1 COInr

4.1.2 COInr-WO-Insecta

4.1.3 COInr-Med

4.1.4 COInr-Med+

4.2 RDP classifier

4.2.1 RDP training

The "Xmx216g" option must be adjusted to the RAM available on your machine (here, 216 means 216 GB). Do not allocate all of the machine's RAM, or it will freeze.
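A small helper for choosing the value is sketched below. The 80% fraction is an arbitrary safety margin chosen for illustration, not a recommendation from the study, and the function names are hypothetical. Reading the total RAM through os.sysconf is an assumption that holds on Linux and may not work on other systems.

```python
import os


def heap_option(total_ram_gb, fraction=0.8):
    """Build a JVM maximum-heap option that uses only a fraction of the total RAM."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    heap_gb = max(1, int(total_ram_gb * fraction))
    return f"-Xmx{heap_gb}g"


def total_ram_gb():
    """Total physical RAM in GB (Linux; relies on os.sysconf)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024 ** 3


# Example with a machine that has 256 GB of RAM.
option = heap_option(256)
print(option)
```

On a Linux machine, `heap_option(total_ram_gb())` would give a value derived from the installed memory.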

Create output directories

COInr

COInr-WO-Insecta

COInr-Med

COInr-Med+

4.2.2 Taxonomic assignment with RDP classifier

Create output directory

COInr

COInr-WO-Insecta

COInr-Med

COInr-Med+

4.3 Taxonomic assignment with QIIME2

4.3.1 Import database sequences and taxonomy to QIIME2

COInr

COInr-WO-Insecta

COInr-Med

COInr-Med+

4.3.2 Import the test ASV dataset as a QIIME2 artifact

Sequences should be in capital (uppercase) letters.
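If the ASV file contains lowercase bases, it can be converted before import. The following Python sketch uppercases the sequence lines of a FASTA file and leaves the header lines unchanged; the function name and the example records are hypothetical.

```python
def uppercase_fasta(lines):
    """Yield FASTA lines with sequence lines uppercased and headers untouched."""
    for line in lines:
        line = line.rstrip("\n")
        yield line if line.startswith(">") else line.upper()


# In-memory stand-in for a FASTA file of ASVs.
fasta = [">asv1 sample=x\n", "acgtTGca\n", ">asv2\n", "ttga\n"]
converted = list(uppercase_fasta(fasta))
print("\n".join(converted))
```

With a real file, an open file handle can be passed instead of the list, and the yielded lines written to a new file.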

4.3.3 Train the QIIME2 classifier

COInr

COInr-WO-Insecta

COInr-Med

COInr-Med+

4.3.4 Classify (taxassign) with QIIME2 using the SKLEARN algorithm

Create output directory

COInr

COInr-WO-Insecta

COInr-Med

COInr-Med+

4.3.5 Classify (taxassign) with QIIME2 using the BLAST algorithm

Use three different identity thresholds: 0.97, 0.9 and 0.8.
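Running every database at every threshold gives twelve assignments. A way to enumerate them is sketched below; the output directory naming scheme and the root folder name are assumptions made for illustration, and no classifier is called.

```python
from itertools import product
from pathlib import Path

DATABASES = ["COInr", "COInr-WO-Insecta", "COInr-Med", "COInr-Med+"]
IDENTITIES = [0.97, 0.9, 0.8]


def plan_runs(out_root):
    """List (database, identity, output directory) for every combination."""
    return [
        (db, identity, Path(out_root) / f"{db}_blast_{identity}")
        for db, identity in product(DATABASES, IDENTITIES)
    ]


runs = plan_runs("taxassign_qiime_blast")  # hypothetical root folder
for db, identity, out_dir in runs:
    print(db, identity, out_dir)
```

Each planned directory could then be created with `out_dir.mkdir(parents=True, exist_ok=True)` before launching the corresponding assignment.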

Create output directory

COInr

COInr-WO-Insecta

COInr-Med

COInr-Med+

5. References

Bolyen,E. et al. (2019) Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology, 37, 852–857. DOI: 10.1038/s41587-019-0209-9.

González,A. et al. (2020) VTAM: A robust pipeline for validating metabarcoding data using internal controls. bioRxiv, 2020.11.06.371187.

Meglécz,E. (2022a) COInr a comprehensive, non-redundant COI database from NCBI-nt and BOLD. DOI: 10.5281/zenodo.6555985.

Meglécz,E. (2022b) COInr and mkCOInr: Building and customizing a non-redundant barcoding reference database from BOLD and NCBI using a lightweight pipeline. bioRxiv, 2022.05.18.492423.

Meglécz,E. (2022c) meglecz/mkCOInr: mkCOInr-v.0.2.0. DOI: 10.5281/zenodo.6961340.

Wang,Q. et al. (2007) Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl. Environ. Microbiol., 73, 5261–5267.