Swarm: robust and fast clustering method for amplicon-based studies
Frédéric Mahé, Torbjørn Rognes, Christopher Quince, Colomban de Vargas and Micah Dunthorn
Supplementary File 1
1 Disclaimer
The purpose of this document is to provide the reader with details on the bioinformatics methods we used to prepare the paper "Swarm: robust and fast clustering method for amplicon-based studies". The code snippets and shell commands presented here were executed on Debian GNU/Linux 7, and might have to be adapted to your particular system. Use them carefully.
2 Mock-communities analyses
2.1 Synopsis
Swarm's performance was tested on two mock communities, each comprising genome isolates from 49 bacterial and 10 archaeal species: one with even abundances (152,523 unique amplicons, from 1,614,012 raw reads of average length 254.6 bp), the other with uneven abundances (56,628 unique amplicons, from 641,541 raw reads of average length 253.9 bp). In both cases, the V4 region of the 16S rRNA gene was amplified with fusion Golay adaptors barcoded on the reverse end and sequenced on the MiSeq platform with 250 bp paired-end reads. The forward 16S rRNA primer was 515f (GTGNCAGCMGCCGCGGTAA). The reverse primers, barcodes and adaptors were identical to those of Caporaso et al. (2011).
Reads were quality- and adaptor-trimmed with Sickle, and the forward and reverse reads were then assembled with PandaSeq (Masella et al., 2012), requiring an overlap of at least 50 bp. Strictly identical amplicons were merged with the command described in Swarm's README, and the two datasets were subjected to five different stand-alone clustering methods, each in its latest version: CD-HIT-454 (Fu et al., 2012), DNAclust (Ghodsi et al., 2011), Swarm (v1.2.3), and Usearch (Edgar, 2010) with presorting of the amplicons by decreasing length (option -cluster_fast) and without presorting (options -usersort -cluster_smallmem). Different clustering thresholds were used: d = 1 to 20 local differences for Swarm, and t = 1 to 20% global divergence for the other methods. For each clustering method and each threshold, the first analysis was done on a fasta file sorted by decreasing abundance, and then repeated 100 times with the amplicon input order randomly shuffled.
The clustering results for the two communities were then evaluated with three metrics, using the known assignments to the 59 species as the ground truth. This ground truth was determined by mapping the amplicons against the known 16S rRNA reference sequences with Usearch (usearch_global option) and keeping only reads within 5% sequence difference of a reference. The three metrics were: recall, which quantifies the extent to which amplicons assigned to the same species are grouped together in the same OTU (i.e. not over-splitting); precision, which asks whether all amplicons in an OTU are assigned to the same species (i.e. not over-grouping); and the adjusted Rand index, which summarizes both precision and recall as the proportion of pairs of amplicons that are placed in the same OTU and belong to the same species, adjusted for the proportions expected by chance (Rand, 1971; Hubert and Arabie, 1985). The exact computations, as implemented, are given in section 2.6.4. The results were synthesized in Fig. 2 (uneven community) and Fig. 3 (even community) using R and the ggplot2 library.
(Internal dataset labels: the even community corresponds to "Fusion Golay", the uneven community to "Mirce".)
2.2 Download the datasets
The cleaned datasets are available and can be downloaded as follows:
mkdir -p ./data/{even,uneven}
cd ./data/even/
wget http://sbr2.sb-roscoff.fr/download/externe/de/fmahe/even.fasta.bz2
cd ../uneven/
wget http://sbr2.sb-roscoff.fr/download/externe/de/fmahe/uneven.fasta.bz2
2.3 Dereplication
The two mock communities were subjected to the same bioinformatics analyses; for the sake of brevity, only the analysis of the even community is described here. The MiSeq paired-end reads have already been quality-filtered, assembled and checked for chimeras. The first step is to linearize and dereplicate the dataset: each unique sequence receives a header of the form >hash_abundance, where the hash is the SHA-1 digest of the sequence.
cd ./data/even/
RAW_FASTA="19-5ng25CyclesQ5_130624_L001_R1_trim_overlap_nonchim.fasta"
CLEANED_FASTA="even.fasta"

# Linearize and dereplicate
awk 'NR==1 {print ; next} {printf /^>/ ? "\n"$0"\n" : $1} END {print}' "${RAW_FASTA}" | \
    grep -v "^>" | grep -v [^ACGTacgt] | sort -d | uniq -c | \
    while read abundance sequence ; do
        hash=$(printf "${sequence}" | sha1sum)
        hash=${hash:0:40}
        printf ">%s_%d_%s\n" "${hash}" "${abundance}" "${sequence}"
    done | sort -t "_" -k2,2nr -k1.2,1d | sed -e 's/\_/\n/2' > "${CLEANED_FASTA}"
Get some statistics:
cd ./data/even/
paste - - < even.fasta | tr "_" "\t" | \
    awk '{l += $2 * length($3) ; sum += $2} END {print NR, sum, l/sum}'
143,163 unique amplicons, representing 1,576,869 reads of average length 254.6 bp.
2.4 Taxonomic assignment
Umer Ijaz ran the dereplicated dataset through TAXAassign (v0.4), a pipeline developed in Christopher Quince's lab, and assigned the sequences to the species level using BLAST searches against NCBI's nt database.
./TAXAassign.sh -p -c 16 -r 100 -t 60 -m 60 -a "60,70,80,95,95,97" -f even.fasta
2.5 Clustering Quality Metrics
Shuffle the fasta input order, produce all the clustering results and compute the five metrics (recall, precision, NMI, Rand index and adjusted Rand index).
# Binaries
SWARM="swarm"
USEARCH="usearch7.0.1001_i86linux32"
CD_HIT="cd-hit-454"
DNACLUST="dnaclust"
PYTHON="python2.7"

# Scripts
FASTA_SHUFFLER="./src/fasta_shuffler.py"
SWARM_BREAKER="./src/swarm_breaker.py"
BUILD_CONFUSION_TABLE="./src/confusion_table.py"
COMPUTE_CLUSTER_METRICS="./src/CValidate.pl"

# Data
FASTA="even.fasta"
TAXONOMIC_ASSIGNMENTS="./results/even/even.spec.ualn"

# Intermediate results
SWARM_RESULTS=$(mktemp)
SWARM_RESULTS2=$(mktemp)
UCLUST_RESULTS_A=$(mktemp)
UCLUST_RESULTS_A2=$(mktemp)
UCLUST_RESULTS_B=$(mktemp)
UCLUST_RESULTS_B2=$(mktemp)
CD_HIT_RESULTS=$(mktemp)
CD_HIT_RESULTS2=$(mktemp)
DNACLUST_RESULTS=$(mktemp)
CONFUSION_TABLE=$(mktemp)

# Metrics
METRICS="${FASTA/.fasta/.metrics_shuffled}"

# Shuffle
cd ./data/even/
"${PYTHON}" "${FASTA_SHUFFLER}" -i "${FASTA}" -n 100
cp "${FASTA}" "${FASTA/.fasta/_shuffled_000.fasta}"

# Multi-clustering
shuffling=0
echo "method,shuffling,d,recall,precision,nmi,rand,adjustedrand" > "${METRICS}"

# All fasta files
for f in *_shuffled_*.fasta ; do
    # All clustering thresholds
    for i in {1..20} ; do
        identity="0.$(( 100 - $i ))"

        # SWARM
        "${SWARM}" -d ${i} < "${f}" > "${SWARM_RESULTS}" 2> /dev/null
        "${PYTHON}" "${SWARM_BREAKER}" -f "${f}" -s "${SWARM_RESULTS}" -d ${i} 2> /dev/null > "${SWARM_RESULTS2}"

        # UCLUST (no pre-sorting)
        "${USEARCH}" -usersort -cluster_smallmem "${f}" -id "${identity}" -uc "${UCLUST_RESULTS_A}" > /dev/null 2>&1
        awk 'BEGIN {FS="\t"}
             {
                 if ($1 == "S") clusters[$2] = $9
                 if ($1 == "H") clusters[$2] = clusters[$2] " " $9
             }
             END {
                 for (cluster in clusters) {
                     print clusters[cluster]
                 }
             }' "${UCLUST_RESULTS_A}" > "${UCLUST_RESULTS_A2}"

        # UCLUST (pre-sorting)
        "${USEARCH}" -cluster_fast "${f}" -id "${identity}" -uc "${UCLUST_RESULTS_B}" > /dev/null 2>&1
        awk 'BEGIN {FS="\t"}
             {
                 if ($1 == "S") clusters[$2] = $9
                 if ($1 == "H") clusters[$2] = clusters[$2] " " $9
             }
             END {
                 for (cluster in clusters) {
                     print clusters[cluster]
                 }
             }' "${UCLUST_RESULTS_B}" > "${UCLUST_RESULTS_B2}"

        # CD-HIT-454
        "${CD_HIT}" -i "${f}" -o "${CD_HIT_RESULTS}" -c ${identity} -M 0 > /dev/null 2>&1
        awk 'BEGIN {FS = "[>.]"} NR != 1 {printf /^>Cluster/ ? "\n" : $2" "} END {printf "\n"}' "${CD_HIT_RESULTS}.clstr" | \
            sed -e 's/ $//' > "${CD_HIT_RESULTS2}"

        # DNACLUST
        "${DNACLUST}" -s "${identity}" -i "${f}" | tr "\t" " " > "${DNACLUST_RESULTS}"

        # Compute the metrics for each clustering result (the method
        # label is derived from the clustering results file name)
        for CLUSTERING_RESULTS in "${SWARM_RESULTS2}" "${UCLUST_RESULTS_A2}" \
            "${UCLUST_RESULTS_B2}" "${CD_HIT_RESULTS2}" "${DNACLUST_RESULTS}" ; do
            "${PYTHON}" "${BUILD_CONFUSION_TABLE}" -t "${TAXONOMIC_ASSIGNMENTS}" -s "${CLUSTERING_RESULTS}" > "${CONFUSION_TABLE}"
            SCORES=$(perl "${COMPUTE_CLUSTER_METRICS}" --cfile="${CONFUSION_TABLE}")
            METHOD=${CLUSTERING_RESULTS##*/}
            METHOD=${METHOD%.*}
            echo "${METHOD},${shuffling},${i},${SCORES}" >> "${METRICS}"
        done
    done
    shuffling=$(( $shuffling + 1 ))
done

# Cleaning
rm "${CONFUSION_TABLE}" "${SWARM_RESULTS}" "${SWARM_RESULTS2}" \
   "${UCLUST_RESULTS_A}" "${UCLUST_RESULTS_A2}" \
   "${UCLUST_RESULTS_B}" "${UCLUST_RESULTS_B2}" "${CD_HIT_RESULTS}.clstr" \
   "${CD_HIT_RESULTS2}" "${DNACLUST_RESULTS}" even_shuffled_*.fasta
2.6 Scripts
Below are the stand-alone Python and Perl scripts called by the shell script above.
2.6.1 Shuffling
Change the input order of the fasta sequences.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
    Parse a fasta file and output the entries in a shuffled order.
"""

from __future__ import print_function

__author__ = "Frédéric Mahé <mahe@rhrk.uni-kl.de>"
__date__ = "2012/05/14"
__version__ = "$Revision: 2.0"

import os
import sys
from random import shuffle
from Bio import SeqIO
from Bio.Seq import Seq
from optparse import OptionParser

#**********************************************************************#
#                                                                      #
#                              Functions                               #
#                                                                      #
#**********************************************************************#


def option_parse():
    """
    Parse arguments from command line.
    """
    desc = """Parse a fasta file and output the entries in a shuffled order."""

    parser = OptionParser(
        usage="usage: %prog --input_file FILENAME --number_of_shufflings INT",
        description=desc,
        version="%prog version 2.0")

    parser.add_option("-i", "--input_file",
                      metavar="<FILENAME>",
                      action="store",
                      dest="input_file",
                      help="set <FILENAME> as input file.")

    parser.add_option("-n", "--number_of_shufflings",
                      metavar="<INT>",
                      action="store",
                      dest="number_of_shuffles",
                      type="int",
                      default=10,
                      help="create <INT> shuffled outputs (default is 10).")

    (options, args) = parser.parse_args()
    return options.input_file, options.number_of_shuffles

#**********************************************************************#
#                                                                      #
#                                Body                                  #
#                                                                      #
#**********************************************************************#

if __name__ == '__main__':

    # Parse command line options.
    input_file, number_of_shuffles = option_parse()
    basename = os.path.splitext(os.path.abspath(input_file))[0]
    extension = os.path.splitext(input_file)[1]

    # Build a list of fasta entries
    input_format = "fasta"
    with open(input_file, "rU") as input_file:
        records = SeqIO.parse(input_file, input_format)
        records_list = [(record.description, str(record.seq))
                        for record in records]

    # Is this a real fasta file? (this test needs improvement)
    if len(records_list) < 2:
        print("Error: something is wrong with the input file.",
              file=sys.stderr)
        sys.exit(-1)

    # Create and shuffle a list of indices, and output the fasta
    # entries using that list, repeat n times
    index = range(0, len(records_list))
    zeros = len(str(number_of_shuffles))
    for n in range(0, number_of_shuffles):
        shuffle(index)
        output_file = basename + "_shuffled_" + str(n).zfill(zeros) + extension
        with open(output_file, "w") as output_file:
            for i in index:
                print(">", records_list[i][0], "\n", records_list[i][1],
                      sep="", file=output_file)

    sys.exit(0)
2.6.2 Swarm breaker
Refine the clusters produced by Swarm, by detecting and breaking chains of amplicons that link distinct abundant amplicons within a swarm.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
    Detect and break chains of amplicons in a swarm.
"""

from __future__ import print_function

__author__ = "Frédéric Mahé <mahe@rhrk.uni-kl.fr>"
__date__ = "2014/01/30"
__version__ = "$Revision: 1.1"

import os
import sys
import tempfile
import itertools
import subprocess
from operator import itemgetter
from optparse import OptionParser

#*****************************************************************************#
#                                                                             #
#                                  Functions                                  #
#                                                                             #
#*****************************************************************************#


def option_parse():
    """
    Parse arguments from command line.
    """
    desc = """Detect and break chains of amplicons in a swarm. That script
    will search for the swarm binary in /usr/bin/. If swarm is installed at a
    different location, please modify the corresponding line in the function
    run_swarm."""

    parser = OptionParser(usage="usage: %prog -f filename -s filename",
                          description=desc,
                          version="%prog version 1.1")

    parser.add_option("-b", "--binary",
                      metavar="<BINARY>",
                      action="store",
                      default="/usr/bin/swarm",
                      dest="binary",
                      help="swarm binary location. Default is /usr/bin/swarm")

    parser.add_option("-f", "--fasta_file",
                      metavar="<FILENAME>",
                      action="store",
                      dest="fasta_file",
                      help="set <FILENAME> as fasta file.")

    parser.add_option("-s", "--swarm_file",
                      metavar="<FILENAME>",
                      action="store",
                      dest="swarm_file",
                      help="set <FILENAME> as swarm file.")

    parser.add_option("-d", "--differences",
                      metavar="<THRESHOLD>",
                      action="store",
                      type="int",
                      default=1,
                      dest="threshold",
                      help="set local clustering <THRESHOLD>. Default is 1")

    (options, args) = parser.parse_args()
    return options.binary, options.fasta_file, options.swarm_file, options.threshold


def fasta_parse(fasta_file):
    """
    List amplicon ids, abundances and sequences, make a list and a
    dictionary
    """
    with open(fasta_file, "rU") as fasta_file:
        all_amplicons = dict()
        for line in fasta_file:
            if line.startswith(">"):
                amplicon, abundance = line.strip(">\n").split("_")
            else:
                sequence = line.strip()
                all_amplicons[amplicon] = (int(abundance), sequence)
        return all_amplicons


def swarm_parse(swarm_file):
    """
    List amplicons contained in each swarm, sort by decreasing
    abundance. Sort the list of swarms by decreasing mass and
    decreasing size.
    """
    with open(swarm_file, "rU") as swarm_file:
        swarms = list()
        for line in swarm_file:
            amplicons = [(amplicon.split("_")[0], int(amplicon.split("_")[1]))
                         for amplicon in line.strip().split(" ")]
            # Sort amplicons by decreasing abundance and alphabetical order
            amplicons.sort(key=itemgetter(1, 0), reverse=True)
            top_amplicon, top_abundance = amplicons[0]
            swarm_size = len(amplicons)
            swarm_mass = sum([amplicon[1] for amplicon in amplicons])
            swarms.append([top_amplicon, swarm_mass, swarm_size,
                           top_abundance, amplicons])
        # Sort swarms on mass, size and seed name
        swarms.sort(key=itemgetter(1, 2, 0), reverse=True)
        return swarms


def run_swarm(binary, all_amplicons, swarm, threshold):
    """
    Write temporary fasta files, run swarm and collect the graph data
    """
    swarm_command = [binary, "-b", "-d", str(threshold)]
    with open(os.devnull, "w") as devnull:
        with tempfile.SpooledTemporaryFile() as tmp_fasta_file:
            with tempfile.SpooledTemporaryFile() as tmp_swarm_results:
                for amplicon, abundance in swarm:
                    sequence = all_amplicons[amplicon][1]
                    print(">", amplicon, "_", str(abundance), "\n", sequence,
                          sep="", file=tmp_fasta_file)
                tmp_fasta_file.seek(0)  # rewind to the beginning of the file
                proc = subprocess.Popen(swarm_command,
                                        stderr=tmp_swarm_results,
                                        stdout=devnull,
                                        stdin=tmp_fasta_file,
                                        close_fds=True)
                proc.wait()  # useful or not?
                tmp_swarm_results.seek(0)  # rewind to the beginning of the file
                graph_data = [line.strip().split("\t")[1:4]
                              for line in tmp_swarm_results
                              if line.startswith("@")]
    return graph_data


def build_graph(graph_data):
    """
    List pairwise relations in a swarm. Note that not all pairwise
    relations are stored. That's why the graph exploration must always
    start from the most abundant amplicon, and must be reiterated for
    sub-swarms after a breaking.
    """
    graph = dict()
    for line in graph_data:
        ampliconA, ampliconB, differences = line
        if ampliconA in graph:
            graph[ampliconA] += [ampliconB]
        else:
            graph[ampliconA] = [ampliconB]
    return graph


def find_path(graph, start, end, path=[]):
    """
    Recursively explore the graph and find all paths connecting two
    amplicons (http://www.python.org/doc/essays/graphs.html). As the
    graph is not complete, some pairs of amplicons cannot be linked.
    """
    path = path + [start]
    if start == end:
        return path
    if start not in graph:
        return None
    for node in graph[start]:
        if node not in path:
            newpath = find_path(graph, node, end, path)
            if newpath:
                return newpath
    return None


def graph_breaker(amplicons, graph, all_amplicons, ABUNDANT):
    """
    Find deep valleys and cut the graph
    """
    # High peaks to test (starting and ending points)
    top_amplicons = [amplicon[0] for amplicon in amplicons
                     if amplicon[1] >= ABUNDANT]
    # Ending peak is RATIO times higher than the valley
    RATIO = 50
    # Debugging
    print("## OTU ", top_amplicons[0], "\n",
          "# List potential bridges", sep="", file=sys.stderr)
    # Initialize the list of new seeds
    new_swarm_seeds = [top_amplicons[0]]
    # Break if there is no second peak
    if len(top_amplicons) < 2:
        return new_swarm_seeds, graph
    # Loop over the list of top amplicons
    pairs_of_peaks = itertools.combinations(top_amplicons, 2)
    for pair_of_peaks in pairs_of_peaks:
        start_amplicon, end_amplicon = pair_of_peaks
        path = find_path(graph, start_amplicon, end_amplicon)
        # Path can be empty if the relation has been deleted
        if path and len(path) > 1:
            abundances = [int(all_amplicons[node][0]) for node in path]
            # Find the weakest spot
            lowest = min(abundances)
            if lowest != abundances[-1]:
                # LOW VALLEY MODEL (CHANGE HERE)
                if (abundances[-1] / lowest > RATIO / 2 and abundances[0] / abundances[-1] < 10) or abundances[-1] / lowest >= RATIO:
                    # Debugging
                    print(abundances, "\tBREAK!", file=sys.stderr)
                    # Find the rightmost occurrence of the lowest point
                    index = len(abundances) - (abundances[::-1].index(lowest) + 1)
                    left_amplicon = path[index-1]
                    right_amplicon = path[index]
                    # Delete the relation from the graph
                    graph[left_amplicon].remove(right_amplicon)
                    # Remove the graph entry if the relation is now empty
                    if not graph[left_amplicon]:
                        del graph[left_amplicon]
                    # Lowest point will be a new swarm seed
                    new_swarm_seeds.append(right_amplicon)
                else:
                    print(abundances, file=sys.stderr)
    return new_swarm_seeds, graph


def swarmer(graph, seed, path=[]):
    """
    Recursively explore the graph and find all amplicons linked to the
    seed
    """
    path = path + [seed]
    if seed in graph:
        for node in graph[seed]:
            path = swarmer(graph, node, path)
    return path


def swarm_breaker(binary, all_amplicons, swarms, threshold):
    """
    Recursively inspect and break the newly produced swarms
    """
    # ARBITRARY PARAMETERS
    ABUNDANT = 100
    # Deal with each swarm
    for swarm in swarms:
        top_amplicon, swarm_mass, swarm_size, top_abundance, amplicons = swarm
        if swarm_size > 2 and top_abundance > ABUNDANT:
            # Run swarm to get the pairwise relationships
            graph_raw_data = run_swarm(binary, all_amplicons, amplicons,
                                       threshold)
            # Build the graph of pairwise relationships
            graph = build_graph(graph_raw_data)
            new_swarm_seeds, graph = graph_breaker(amplicons, graph,
                                                   all_amplicons, ABUNDANT)
            # Explore the graph and find all amplicons linked to the seeds
            observed = 0
            new_swarms = list()
            for seed in new_swarm_seeds:
                new_swarm = swarmer(graph, seed)
                observed += len(new_swarm)
                # Give to the new swarms the same structure and
                # re-order them by decreasing abundance
                amplicons = [(amplicon, all_amplicons[amplicon][0])
                             for amplicon in new_swarm]
                amplicons.sort(key=itemgetter(1), reverse=True)
                top_amplicon, top_abundance = amplicons[0]
                swarm_size = len(amplicons)
                swarm_mass = sum([amplicon[1] for amplicon in amplicons])
                new_swarms.append([top_amplicon, swarm_mass, swarm_size,
                                   top_abundance, amplicons])
            # Deal with the new swarms (no need to treat again the
            # first swarm). There will always be at least one swarm in
            # new_swarms.
            print(" ".join(["_".join([amplicon[0], str(amplicon[1])])
                            for amplicon in new_swarms[0][4]]),
                  file=sys.stdout)
            new_swarms.pop(0)
            if new_swarms:
                # Sort the rest of the new swarms by decreasing mass
                # and size. Inject them into swarm_breaker.
                new_swarms.sort(key=itemgetter(1, 2), reverse=True)
                swarm_breaker(binary, all_amplicons, new_swarms, threshold)
        else:
            # Output the swarm
            print(" ".join(["_".join([amplicon[0], str(amplicon[1])])
                            for amplicon in amplicons]),
                  file=sys.stdout)
    return None


def main():
    """
    Hypothesis: chains of amplicons happen among the most abundant
    amplicons of the swarm. The number of chains in a swarm is
    small. The abundances of each sub-swarm centroids are
    comparable. The "valleys" are deep compared to the
    "peaks". Swarm graphs are acyclical, so there is only one path
    joining two amplicons.

    Synopsis: Break bridges as you discover them. Find the weakest
    point in the chain, break on the left of that point and mark it
    as the seed of a new swarm. Repeat the process with the nth most
    abundant amplicon, until all amplicons in the arbitrary range
    have been treated.
    """
    # Parse command line options.
    binary, fasta_file, swarm_file, threshold = option_parse()
    # Load all amplicon ids, abundances and sequences
    all_amplicons = fasta_parse(fasta_file)
    # Load the swarming data
    swarms = swarm_parse(swarm_file)
    # Deal with each swarm
    swarm_breaker(binary, all_amplicons, swarms, threshold)

#*****************************************************************************#
#                                                                             #
#                                     Body                                    #
#                                                                             #
#*****************************************************************************#

if __name__ == '__main__':

    main()

    sys.exit(0)
2.6.3 Confusion table
Parse the clustering results and the taxonomic assignments to build a "confusion table", with one line per OTU and one column per species. Several OTUs can be assigned to the same species, but ideally an OTU should not contain amplicons assigned to different species. Each cell of the table contains the total number of reads belonging to that OTU and assigned to that species.
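For illustration, such a confusion table has the following shape (the species names and read counts below are hypothetical, not actual results):

OTUs_vs_Taxa	Species_A	Species_B	Unassigned
1	1052	3	0
2	0	487	12
3	0	2	75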
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
    Parse taxonomic assignments and swarm results to produce a
    confusion table.
"""

from __future__ import print_function

__author__ = "Frédéric Mahé <mahe@rhrk.uni-kl.fr>"
__date__ = "2013/09/21"
__version__ = "$Revision: 0.1"

# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.

# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.

# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

import sys
from optparse import OptionParser


def option_parser():
    """
    Parse arguments from command line.
    """
    desc = """This program reads a table of taxonomic assignments and a
    swarm clustering file and outputs a confusion table (OTUs vs
    taxonomic assignments)."""

    parser = OptionParser(usage="usage: %prog -t FILENAME -s FILENAME",
                          description=desc,
                          version="%prog version 0.1")

    parser.add_option("-t", "--taxonomic_assignments",
                      metavar="<FILENAME>",
                      action="store",
                      dest="taxonomic_assignments",
                      help="set <FILENAME> as input.")

    parser.add_option("-s", "--swarm_OTUs",
                      metavar="<FILENAME>",
                      action="store",
                      dest="swarm_OTUs",
                      help="set <FILENAME> as input.")

    (options, args) = parser.parse_args()
    return options.taxonomic_assignments, options.swarm_OTUs


def parse_taxonomy(taxonomic_assignments):
    """
    Parse taxonomy assignments
    """
    amplicon2taxonomy = dict()
    taxa = ["Unassigned"]
    with open(taxonomic_assignments, "rU") as taxonomic_assignments:
        for line in taxonomic_assignments:
            amplicon, taxon = line.split()[0:2]
            amplicon, abundance = amplicon.split("_")
            amplicon2taxonomy[amplicon] = (abundance, taxon)
            taxa.append(taxon)
    return amplicon2taxonomy, taxa


def dereplicate_taxa(taxa):
    """
    Dereplicate taxa (list of unique taxon names, and initialized
    dictionary)
    """
    taxa_list = list(set(taxa))
    taxa_dict = dict(zip(taxa_list, [0] * len(taxa_list)))
    return taxa_list, taxa_dict


def OTU_parser(taxa_dict, taxa_list, amplicon2taxonomy, amplicons):
    """
    Parse each OTU and output one line of the confusion table.
    """
    for amplicon in amplicons:
        try:
            amplicon, abundance = amplicon.split("_")
        except ValueError:
            amplicon, abundance = amplicon, 1
        try:
            abundance, taxon = amplicon2taxonomy[amplicon]
        except KeyError:
            taxon = "Unassigned"
        taxa_dict[taxon] += int(abundance)
    OTU_abundance_per_taxa = [str(taxa_dict[taxon]) for taxon in taxa_list]
    return OTU_abundance_per_taxa


if __name__ == '__main__':

    ## Parse command line arguments
    taxonomic_assignments, swarm_OTUs = option_parser()

    ## Parse taxonomy assignments
    amplicon2taxonomy, taxa = parse_taxonomy(taxonomic_assignments)

    ## Dereplicate taxa
    taxa_list, taxa_dict = dereplicate_taxa(taxa)

    ## Output the table header
    print("OTUs_vs_Taxa", "\t".join(taxa_list), sep="\t", file=sys.stdout)

    ## Parse swarm OTUs
    with open(swarm_OTUs, "rU") as swarm_OTUs:
        for i, line in enumerate(swarm_OTUs):
            amplicons = line.strip().split()
            OTU_abundance_per_taxa = OTU_parser(taxa_dict.copy(), taxa_list,
                                                amplicon2taxonomy, amplicons)
            print(str(i+1), "\t".join(OTU_abundance_per_taxa),
                  sep="\t", file=sys.stdout)

    sys.exit(0)
2.6.4 Clustering metrics
The Perl script CValidate.pl was provided by Christopher Quince and collaborators. It computes the clustering metrics (recall, precision, NMI, Rand index and adjusted Rand index) from the confusion table produced by the above Python script.
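As a compact reference for the metric definitions given in the synopsis, the short Python sketch below computes recall, precision and the adjusted Rand index from a confusion table laid out as above (OTUs as rows, species as columns). It is provided for illustration only and uses a hypothetical table; the values reported in the paper were computed with CValidate.pl, listed after it.

#!/usr/bin/env python
"""Minimal sketch of three clustering metrics (illustration only)."""

from __future__ import print_function


def recall(table):
    # For each species (column), count the reads captured by its largest OTU
    total = sum(sum(row) for row in table)
    best = sum(max(row[j] for row in table) for j in range(len(table[0])))
    return best / float(total)


def precision(table):
    # For each OTU (row), count the reads belonging to its dominant species
    total = sum(sum(row) for row in table)
    best = sum(max(row) for row in table)
    return best / float(total)


def adjusted_rand_index(table):
    # Adjusted Rand index (Hubert and Arabie, 1985), computed from pair counts
    def choose2(n):
        return n * (n - 1) // 2
    n = sum(sum(row) for row in table)
    sum_cells = sum(choose2(cell) for row in table for cell in row)
    sum_rows = sum(choose2(sum(row)) for row in table)
    sum_cols = sum(choose2(sum(row[j] for row in table))
                   for j in range(len(table[0])))
    expected = float(sum_rows * sum_cols) / choose2(n)
    return (sum_cells - expected) / (0.5 * (sum_rows + sum_cols) - expected)


if __name__ == "__main__":
    # Hypothetical confusion table: 3 OTUs (rows) x 2 species + "Unassigned"
    table = [[1052, 3, 0],
             [0, 487, 12],
             [0, 2, 75]]
    print("recall:", recall(table))
    print("precision:", precision(table))
    print("adjusted Rand index:", adjusted_rand_index(table))

NMI and the unadjusted Rand index are computed analogously by CValidate.pl.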
#!/usr/bin/perl

use strict;
use Getopt::Long;

my $tFile;

GetOptions("cfile=s" => \$tFile, "sfile=s");

my @cluster = ();

open(FILE, "$tFile") or die "Can't open $tFile\n";

my $line = <FILE>;
my $i = 0;

while($line = <FILE>){
    chomp($line);

    my @tokens = split(/\t/, $line);
    shift(@tokens);

    my $j = 0;
    foreach my $tok(@tokens){
        #print "$j $tok\n";
        $cluster[$i][$j] = $tok;
        $j++;
    }
    $i++;
}
close(FILE);

printf("%f,%f,%f,%f,%f\n", recall(@cluster), precision(@cluster), nmi(@cluster),
       randindex(@cluster), adjrandindex(@cluster));

sub precision(){
    my @cluster = @_;
    my $nN = 0;
    my $nC = scalar(@cluster);
    my $nK = scalar(@{$cluster[0]});
    my $precision = 0;

    for(my $i = 0; $i < $nC; $i++){
        my $maxS = 0;
        for(my $j = 0; $j < $nK; $j++){
            if($cluster[$i][$j] > $maxS){
                $maxS = $cluster[$i][$j];
            }
            $nN += $cluster[$i][$j];
        }
        $precision += $maxS;
    }

    return $precision/$nN;
}

sub recall(){
    my @cluster = @_;
    my $nN = 0;
    my $nC = scalar(@cluster);
    my $nK = scalar(@{$cluster[0]});
    my $recall = 0;

    for(my $i = 0; $i < $nK; $i++){
        my $maxS = 0;
        for(my $j = 0; $j < $nC; $j++){
            if($cluster[$j][$i] > $maxS){
                $maxS = $cluster[$j][$i];
            }
            $nN += $cluster[$j][$i];
        }
        $recall += $maxS;
    }

    return $recall/$nN;
}

sub choose2{
    my $N = shift;
    my $ret = $N*($N - 1);

    return int($ret/2);
}

sub randindex{
    my @cluster = @_;
    my @ktotals = ();
    my @ctotals = ();
    my $nN = 0;
    my $nC = scalar(@cluster);
    my $nK = scalar(@{$cluster[0]});
    my $cComb = 0;
    my $kComb = 0;
    my $kcComb = 0;

    for(my $i = 0; $i < $nK; $i++){
        $ktotals[$i] = 0;
        for(my $j = 0; $j < $nC; $j++){
            $ktotals[$i] += $cluster[$j][$i];
        }
        $nN += $ktotals[$i];
        $kComb += choose2($ktotals[$i]);
    }

    for(my $i = 0; $i < $nC; $i++){
        $ctotals[$i] = 0;
        for(my $j = 0; $j < $nK; $j++){
            $ctotals[$i] += $cluster[$i][$j];
        }
        $cComb += choose2($ctotals[$i]);
    }

    for(my $i = 0; $i < $nC; $i++){
        for(my $j = 0; $j < $nK; $j++){
            $kcComb += choose2($cluster[$i][$j]);
        }
    }

    my $nComb = choose2($nN);

    return ($nComb - $cComb - $kComb + 2*$kcComb)/$nComb;
}

sub adjrandindex{
    my @cluster = @_;
    my @ktotals = ();
    my @ctotals = ();
    my $nN = 0;
    my $nC = scalar(@cluster);
    my $nK = scalar(@{$cluster[0]});
    my $cComb = 0;
    my $kComb = 0;
    my $kcComb = 0;

    for(my $i = 0; $i < $nK; $i++){
        $ktotals[$i] = 0;
        for(my $j = 0; $j < $nC; $j++){
            $ktotals[$i] += $cluster[$j][$i];
        }
        $nN += $ktotals[$i];
        $kComb += choose2($ktotals[$i]);
    }

    for(my $i = 0; $i < $nC; $i++){
        $ctotals[$i] = 0;
        for(my $j = 0; $j < $nK; $j++){
            $ctotals[$i] += $cluster[$i][$j];
        }
        $cComb += choose2($ctotals[$i]);
    }

    for(my $i = 0; $i < $nC; $i++){
        for(my $j = 0; $j < $nK; $j++){
            $kcComb += choose2($cluster[$i][$j]);
        }
    }

    my $nComb = choose2($nN);
    my $temp = ($kComb*$cComb)/$nComb;
    my $ret = $kcComb - $temp;

    return $ret/(0.5*($cComb + $kComb) - $temp);
}

sub nmi{
    my @cluster = @_;
    my @ktotals = ();
    my @ctotals = ();
    my $nN = 0;
    my $nC = scalar(@cluster);
    my $nK = scalar(@{$cluster[0]});
    my $HC = 0.0;
    my $HK = 0.0;

    for(my $i = 0; $i < $nK; $i++){
        $ktotals[$i] = 0;
        for(my $j = 0; $j < $nC; $j++){
            $ktotals[$i] += $cluster[$j][$i];
        }
        $nN += $ktotals[$i];
    }

    for(my $i = 0; $i < $nC; $i++){
        $ctotals[$i] = 0;
        for(my $j = 0; $j < $nK; $j++){
            $ctotals[$i] += $cluster[$i][$j];
        }
        my $dFC = $ctotals[$i]/$nN;
        $HC += -$dFC*log($dFC);
    }

    for(my $i = 0; $i < $nK; $i++){
        my $dFK = $ktotals[$i]/$nN;
        $HK += -$dFK*log($dFK);
    }

    my $NMI = 0.0;

    for(my $i = 0; $i < $nK; $i++){
        my $NMII = 0.0;
        for(my $j = 0; $j < $nC; $j++){
            my $dF = ($nN*$cluster[$j][$i])/($ctotals[$j]*$ktotals[$i]);
            if($dF > 0.0){
                $NMII += $cluster[$j][$i]*log($dF);
            }
        }
        $NMII /= $nN;
        $NMI += $NMII;
    }

    return (2.0*$NMI)/($HC + $HK);
}
2.7 Visualization
The results were synthesized in Fig. 2 (uneven community) and Fig. 3 (even community) using R and the ggplot2 library.
First, reshape the metrics file into a long-format table (one row per method, shuffling, threshold and metric, instead of five metric columns per row):
cd ./results/even/

# Produce a long table version
sort -t "," -k1,1d -k2,2n -k3,3n even_clustering.metrics | \
    awk 'BEGIN {
             FS = ","
             OFS = ","
             split("recall precision NMI rand adjustedrand", a, " ")
         }
         {for (i=1; i<=5; i++) print $1, $2, $3, a[i], $(i+3)}' > even_clustering.metrics2

rm even_clustering.metrics
mv even_clustering.metrics2 even_clustering.metrics
Then, use this R script to produce the plots:
library(ggplot2)
library(reshape)

setwd("./results/even/")

d <- read.table("even_clustering.metrics", dec = ".", sep = ",")
colnames(d) <- c("method", "shuffling", "divergence", "metric", "value")

# Divergence values treated as a class. Force numerical sorting
d$divergence <- as.character(d$divergence)
d$divergence <- factor(d$divergence,
                       levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
                                  "11", "12", "13", "14", "15", "16", "17", "18", "19", "20"),
                       ordered = TRUE)

# Subset and rename
d2 <- subset.data.frame(d, d$metric != "NMI" & d$metric != "rand")
levels(d2$method)[4] <- "Swarm"
levels(d2$method)[1] <- "CD-HIT-454"
levels(d2$method)[2] <- "DNAclust"
levels(d2$method)[5] <- "Usearch (no presorting)"
levels(d2$method)[6] <- "Usearch (presorting)"
levels(d2$metric)[1] <- "adjusted Rand index"

# Faceted plot
ggplot(d2, aes(x = divergence, y = value)) +
    geom_boxplot(outlier.size = 1.5) +
    scale_x_discrete(breaks = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
                                "11", "12", "13", "14", "15", "16", "17", "18", "19", "20"),
                     labels = c("1", "", "3", "", "5", "", "7", "", "9", "",
                                "11", "", "13", "", "15", "", "17", "", "19", "")) +
    xlab("divergence (global clustering threshold t in %; or swarm local clustering threshold d)") +
    ylab("metric values") +
    facet_grid(metric ~ method)

ggsave(file = "even_clustering_selected_metrics.pdf", width = 16, height = 9)

# Compute median values
d4 <- melt(cast(d, metric ~ divergence ~ method, median))
colnames(d4) <- c("metric", "divergence", "method", "value")
d4$divergence <- as.character(d4$divergence)
d4$divergence <- factor(d4$divergence,
                        levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
                                   "11", "12", "13", "14", "15", "16", "17", "18", "19", "20"),
                        ordered = TRUE)
d5 <- subset.data.frame(d4, d4$method != "swarm" & d4$metric != "NMI" & d4$metric != "rand")
levels(d5$method)[4] <- "Swarm"

# Faceted plot using median values
ggplot(d5, aes(x = divergence, y = value)) +
    geom_boxplot() +
    xlab("divergence (% or swarm d value)") +
    ylab("metric values") +
    facet_grid(metric ~ method, scales = "free_y")

ggsave(file = "even_clustering_selected_metrics_medians.pdf", width = 16, height = 9)

# Output median values to a file
d7 <- cast(d5, metric ~ method ~ divergence)
write.csv(d7, file = "even_median_table.csv", quote = TRUE, eol = "\n", na = "NA")

quit(save = "no")
Date: 2014-05-12