All NrdJm sequences from NCBI’s RefSeq database were downloaded from RNRdb (http://rnrdb.pfitmap.org) and clustered with USEARCH (Edgar 2010) at 90% sequence identity to reduce redundancy. Sequences that were not full-length or were of dubious quality were removed manually, and the remaining 363 sequences were aligned with ProbCons (Do et al. 2005). Reliable alignment positions were selected with the BMGE algorithm (Criscuolo and Gribaldo 2010) using the BLOSUM30 substitution matrix, yielding 363 well-aligned positions that form 350 distinct alignment patterns. A phylogeny was estimated with RAxML version 8.2.4 (Stamatakis 2014) under the PROTGAMMAAUTO model, using rapid bootstrapping with the autoMRE bootstopping criterion followed by a full maximum likelihood tree search.
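The distinction between alignment positions and distinct alignment patterns can be illustrated with a minimal Python sketch. It assumes that a pattern is simply a unique alignment column, so identical columns collapse into one pattern. The function name `count_alignment_patterns` and the toy sequences are hypothetical and are not part of this dataset. RAxML's own count may differ in detail, for example in how ambiguous characters are treated.

```python
from collections import Counter


def count_alignment_patterns(aligned_seqs):
    """Return (number of positions, number of distinct column patterns).

    Assumption: a "pattern" is a unique alignment column, compared
    character by character across all sequences.
    """
    lengths = {len(seq) for seq in aligned_seqs}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")

    # zip(*seqs) transposes rows (sequences) into columns (positions).
    columns = ["".join(column) for column in zip(*aligned_seqs)]

    # Identical columns are counted once, as a single pattern.
    patterns = Counter(columns)
    return len(columns), len(patterns)


# Toy alignment (hypothetical): positions 2 and 6 share the column "KKR".
toy_alignment = ["MKVLAK", "MKILAK", "MRVLAR"]
n_positions, n_patterns = count_alignment_patterns(toy_alignment)
print(n_positions, n_patterns)
```

In the toy alignment the six positions contain only five distinct columns, which mirrors how 363 trimmed positions can form 350 patterns.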