ATP-cone sequence clusters

2019-04-05T12:49:12Z (GMT) by Daniel Lundin
The NCBI RefSeq database (2019-03-19; Haft et al. 2018 https://doi.org/10.1093/nar/gkx1068) was searched with Pfam's ATP-cone profile (accession PF03477; Finn et al. 2010 https://doi.org/10.1093/nar/gkp985), returning 44367 NCBI accessions. Ribonucleotide reductase proteins were identified using HMMER (Eddy 2011 https://doi.org/10.1371/journal.pcbi.1002195) profiles from the RNRdb database (http://rnrdb.pfitmap.org). Subsequently, sequences were clustered with UCLUST (Edgar 2010 https://doi.org/10.1093/bioinformatics/btq461) at identity to remove sequence duplicates (24477 sequences remaining).

All sequences were aligned pairwise against each other using LAST (Kiełbasa et al. 2011 https://doi.org/10.1101/gr.113985.110) and a bitscore matrix was constructed. The bitscore matrix was clustered with MCL (Enright, Dongen & Ouzounis 2002 https://doi.org/10.1093/nar/30.7.1575) using the Cluster Maker 2 (Morris et al. 2011 https://doi.org/10.1186/1471-2105-12-436) Cytoscape (Shannon et al. 2003 https://doi.org/10.1101/gr.1239303) app. Sequence pairs with bitscores below 200 were excluded from the initial network, and an inflation parameter of 2.5 was used.
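The clustering step can be illustrated with a minimal pure-Python sketch of the MCL iteration (expansion followed by inflation) on a toy bitscore network. This is not the Cluster Maker 2 implementation used for the dataset. The function name mcl, the example sequence names and bitscores, the choice of self-loop weights, and the symmetrisation by taking the larger of two scores are all assumptions made for illustration. Only the bitscore cutoff of 200 and the inflation parameter of 2.5 come from the description above.

```python
def normalise_columns(m):
    """Scale every column of a square matrix so that it sums to 1."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        for i in range(n):
            m[i][j] /= total
    return m


def mcl(edges, cutoff=200.0, inflation=2.5, max_iter=100, tol=1e-9, eps=1e-6):
    """Cluster (seq_a, seq_b, bitscore) edges with a basic MCL iteration."""
    edges = list(edges)
    nodes = sorted({x for a, b, _ in edges for x in (a, b)})
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    m = [[0.0] * n for _ in range(n)]
    for a, b, score in edges:
        if score < cutoff:
            continue  # bitscores below the cutoff are left out of the network
        i, j = index[a], index[b]
        # Assumption: the network is undirected and keeps the larger score.
        m[i][j] = m[j][i] = max(m[i][j], score)
    for i in range(n):
        # Assumption: self-loops weighted by the strongest edge (1.0 if none).
        m[i][i] = max(m[i]) or 1.0
    m = normalise_columns(m)
    for _ in range(max_iter):
        # Expansion: square the column-stochastic matrix.
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: raise entries to the inflation power, then renormalise.
        inflated = normalise_columns([[v ** inflation for v in row]
                                      for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Rows with a positive diagonal are attractors; their non-zero columns
    # are the members of the corresponding cluster.
    found = {frozenset(nodes[j] for j in range(n) if m[i][j] > eps)
             for i in range(n) if m[i][i] > eps}
    return sorted(sorted(c) for c in found)


# Invented toy network: two tight groups joined by one weak hit (150 < 200).
example_edges = [("A", "B", 500), ("B", "C", 450), ("A", "C", 480),
                 ("D", "E", 600), ("C", "D", 150)]
clusters = mcl(example_edges)
print(clusters)
```

In this toy example the weak hit between C and D falls below the cutoff, so the two groups are never connected and end up in separate clusters. A dense-matrix implementation like this one is only practical for small networks.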

The file "atp-cone_mcl_clustering.tsv" contains all information necessary to recreate the clustering, as well as the cluster number assigned to each sequence.

Column names: SUID: Cytoscape's id; accno: NCBI's accession number; mcl2.0ewc200-mcl3.0: cluster assignments, named by inflation parameter and bitscore cutoff (ewc; when used); name: sequence identifier composed of accno plus cone number; outer_inner: "inner", "middle" or "outer" when more than one cone is present in the full sequence; pclass and psubclass: RNR class and subclass; ptype: protein type; taxon; tdomain: taxonomic domain; title: NCBI's description of the sequence.
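As a rough illustration of how the clustering table might be used, the sketch below counts the number of sequences per cluster with Python's standard csv module. It assumes the file is tab-separated with a header row. The column name "mcl2.5ewc200" is a guess at how one of the cluster-assignment columns is spelled, and the accessions in the sample are invented placeholders, so both should be checked against the actual file.

```python
import csv
import io
from collections import Counter


def cluster_sizes(handle, cluster_column):
    """Count how many rows (ATP-cone sequences) fall into each cluster."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[cluster_column] for row in reader)


# Invented stand-in for a few rows of "atp-cone_mcl_clustering.tsv".
# The cluster column name is hypothetical.
sample_tsv = (
    "accno\tname\tmcl2.5ewc200\n"
    "ACC_0001\tACC_0001_1\t1\n"
    "ACC_0001\tACC_0001_2\t2\n"
    "ACC_0002\tACC_0002_1\t1\n"
)

sizes = cluster_sizes(io.StringIO(sample_tsv), "mcl2.5ewc200")
for cluster, count in sorted(sizes.items()):
    print(cluster, count)
```

For the real file, the StringIO object would be replaced by open("atp-cone_mcl_clustering.tsv", newline=""), and the column name by one of the headers actually present.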

The "precluster_assignments.tsv" file contains the results of the preclustering with USEARCH, i.e. which USEARCH cluster (first column) each accession number (second column) belongs to.
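The precluster table can be used to expand a cluster back to all of its member accessions. The sketch below groups accessions by USEARCH cluster. It assumes a tab-separated file without a header row, which is not stated in the description. The function name read_preclusters and the sample identifiers are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def read_preclusters(handle):
    """Map each USEARCH cluster id (column 1) to its accessions (column 2)."""
    members = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        cluster_id, accno = row[0], row[1]
        members[cluster_id].append(accno)
    return dict(members)


# Invented stand-in for a few rows of "precluster_assignments.tsv".
sample_tsv = "0\tACC_0001\n0\tACC_0003\n1\tACC_0002\n"

preclusters = read_preclusters(io.StringIO(sample_tsv))
for cluster_id, accessions in sorted(preclusters.items()):
    print(cluster_id, ",".join(accessions))
```

If the real file does start with a header line, that first row would need to be skipped before grouping.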