ATP-cone sequence clusters

2019-04-05T12:49:12Z (GMT) by Daniel Lundin
The NCBI RefSeq database (2019-03-19; Haft et al. 2018) was searched with Pfam's ATP-cone profile (accno: PF03477; Finn et al. 2010), returning 44367 NCBI accessions. Ribonucleotide reductase proteins were identified using HMMER (Eddy 2011) profiles from the RNRdb database. Subsequently, sequences were clustered with UCLUST (Edgar 2010) at identity to remove sequence duplicates (24477 sequences remaining).

All sequences were pairwise aligned to each other using LAST (Kiełbasa et al. 2011), and a bitscore matrix was constructed. The bitscore matrix was clustered with MCL (Enright, Dongen & Ouzounis 2002) using the Cluster Maker 2 (Morris et al. 2011) Cytoscape (Shannon et al. 2003) app. Bitscores < 200 were not included in the initial network, and an inflation parameter of 2.5 was used.
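To illustrate the clustering step, here is a minimal pure-Python sketch of Markov clustering on a bitscore network. It is not the Cluster Maker 2 implementation that produced the published clusters. The function names (filter_edges, mcl), the self-loop weighting, the convergence tolerance, and the toy sequence names are assumptions made for illustration. Only the bitscore cutoff of 200 and the inflation parameter of 2.5 come from the description above.

```python
def filter_edges(bitscores, cutoff=200.0):
    """Drop pairs with a bitscore below `cutoff` (bitscores < 200 are excluded)."""
    return {pair: score for pair, score in bitscores.items() if score >= cutoff}


def _normalise_columns(m):
    """Scale every column of a square matrix so that it sums to 1 (in place)."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                m[i][j] /= total
    return m


def mcl(edges, inflation=2.5, max_iter=100, tol=1e-9):
    """Cluster an undirected weighted network given as {(a, b): weight}.

    Illustrative sketch only: dense matrices, so suitable for small inputs.
    """
    nodes = sorted({name for pair in edges for name in pair})
    if not nodes:
        return []
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)

    # Symmetric weighted adjacency matrix.
    m = [[0.0] * n for _ in range(n)]
    for (a, b), score in edges.items():
        if a != b:
            m[index[a]][index[b]] = m[index[b]][index[a]] = float(score)
    # Self-loops (assumption: weight equal to the strongest edge of the node).
    for i in range(n):
        m[i][i] = max(m[i]) or 1.0
    _normalise_columns(m)

    for _ in range(max_iter):
        # Expansion: square the matrix (two-step random walks).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: element-wise power, then renormalise the columns.
        inflated = _normalise_columns([[v ** inflation for v in row]
                                       for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break

    # Each non-empty row lists the members of the cluster around that attractor.
    clusters = {frozenset(nodes[j] for j in range(n) if m[i][j] > 1e-6)
                for i in range(n)}
    clusters.discard(frozenset())
    return sorted(sorted(c) for c in clusters)


# Toy input with made-up sequence names; the weak c-d link is filtered out.
toy_scores = {("a", "b"): 300, ("b", "c"): 300, ("a", "c"): 300,
              ("d", "e"): 450, ("e", "f"): 450, ("d", "f"): 450,
              ("c", "d"): 150}
toy_clusters = mcl(filter_edges(toy_scores), inflation=2.5)
```

Higher inflation values tend to split the network into more, smaller clusters, which is why the cluster assignment depends on both the cutoff and the inflation parameter.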

The file "atp-cone_mcl_clustering.tsv" contains all information necessary to recreate the clustering as well as the cluster number assigned to each sequence.

Column names: SUID: Cytoscape's id, accno: NCBI's accession number, mcl2.0ewc200-mcl3.0: cluster assignments with inflation parameters and bitscore cutoff (ewc; when used), name: sequence identifier composed of accno plus cone number, outer_inner: "inner", "middle" or "outer" when more than one cone present in full sequence, pclass and psubclass: RNR class and subclass, ptype: protein type, taxon, tdomain: taxonomic domain, title: NCBI's description of the sequence.
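As a rough sketch of how the table might be summarised, the following Python counts sequences per cluster and per RNR class. It assumes the file is tab-separated with a header row holding the column names listed above. The exact spelling of the cluster columns should be checked against the real header; "mcl2.0ewc200" is used below only because it is the first name in the listed range. The function name cluster_sizes and the demo rows are hypothetical.

```python
import csv
from collections import Counter


def cluster_sizes(lines, cluster_column, class_column="pclass"):
    """Count rows per (cluster, class) pair in a tab-separated table.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Rows with an empty cluster value are skipped.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    header = reader.fieldnames or []
    for column in (cluster_column, class_column):
        if column not in header:
            raise KeyError(f"column {column!r} not found in header {header}")
    counts = Counter()
    for row in reader:
        if row[cluster_column]:
            counts[(row[cluster_column], row[class_column])] += 1
    return counts


# Made-up rows for illustration; these are not real accessions or results.
demo = [
    "accno\tmcl2.0ewc200\tpclass",
    "ACC_A\t1\tNrdA",
    "ACC_B\t1\tNrdA",
    "ACC_C\t2\tNrdD",
]
demo_counts = cluster_sizes(demo, "mcl2.0ewc200")
```

With the real file, the same function could be called on an open handle, for example open("atp-cone_mcl_clustering.tsv", newline="", encoding="utf-8"), assuming the file is UTF-8 encoded.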

The "precluster_assignments.tsv" file contains the results of the preclustering with USEARCH, i.e. which USEARCH cluster (first column) each accession number (second column) belongs to.
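The two-column layout can be turned into a lookup from precluster to member accessions with a few lines of Python. This is a sketch under assumptions: the file is taken to be tab-separated, and whether it starts with a header row is not stated, so that is left as a parameter. The function name read_preclusters and the demo identifiers are hypothetical.

```python
import csv
from collections import defaultdict


def read_preclusters(lines, has_header=False):
    """Map each precluster id (column 1) to its accession numbers (column 2).

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    reader = csv.reader(lines, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row if the file has one
    members = defaultdict(list)
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        cluster_id, accno = row[0], row[1]
        members[cluster_id].append(accno)
    return dict(members)


# Made-up identifiers for illustration only.
demo_lines = ["0\tACC_A", "0\tACC_B", "1\tACC_C"]
demo_members = read_preclusters(demo_lines)
```

Such a mapping makes it possible to expand a representative sequence in the main clustering table back to all accessions that were collapsed into it during preclustering.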