TY - JOUR
T1 - The statistical power of k-mer based aggregative statistics for alignment-free detection of horizontal gene transfer
AU - Huang, Guan Da
AU - Liu, Xue Mei
AU - Huang, Tian Lai
AU - Xia, Li C.
N1 - Funding Information:
L.C.X. was supported by the Innovation in Cancer Informatics Fund.
Publisher Copyright:
© 2019
PY - 2019/9
Y1 - 2019/9
N2 - Alignment-based database search and sequence comparison are commonly used to detect horizontal gene transfer (HGT). However, with the rapid increase of sequencing depth, hundreds of thousands of contigs are routinely assembled from metagenomics studies, which challenges alignment-based HGT analysis by overwhelming the known reference sequences. Detecting HGT by k-mer statistics thus becomes an attractive alternative. These alignment-free statistics have demonstrated high performance and efficiency in whole-genome and transcriptome comparisons. To adapt k-mer statistics for HGT detection, we developed two aggregative statistics, T_sum^S and T_sum^*, which subsample metagenome contigs by their representative regions and summarize the regional D_2^S and D_2^* metrics by their upper bounds. We systematically studied the aggregative statistics’ power at different k-mer sizes using simulations. Our analysis showed that, in general, the power of T_sum^S and T_sum^* increases with sequencing coverage and reaches a maximum of >80% at k = 6, with a 5% Type-I error rate and a coverage ratio >0.2x. The statistical power of T_sum^S and T_sum^* was evaluated with realistic simulations of HGT mechanism, sequencing depth, read length, and base error. We expect these statistics to be useful distance metrics for identifying HGT in metagenomic studies.
KW - Alignment-free sequence comparison
KW - Horizontal gene transfer
KW - Statistical power
KW - k-mer
UR - http://www.scopus.com/inward/record.url?scp=85071529634&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85071529634&partnerID=8YFLogxK
U2 - 10.1016/j.synbio.2019.08.001
DO - 10.1016/j.synbio.2019.08.001
M3 - Article
AN - SCOPUS:85071529634
SN - 2405-805X
VL - 4
SP - 150
EP - 156
JO - Synthetic and Systems Biotechnology
JF - Synthetic and Systems Biotechnology
IS - 3
ER -