The statistical power of k-mer based aggregative statistics for alignment-free detection of horizontal gene transfer

Guan Da Huang; Xue Mei Liu; Tian Lai Huang; Li C. Xia

doi:10.1016/j.synbio.2019.08.001

Research output: Contribution to journal › Article › peer-review

5 Scopus citations

Abstract

Alignment-based database search and sequence comparison are commonly used to detect horizontal gene transfer (HGT). However, with the rapid increase of sequencing depth, hundreds of thousands of contigs are routinely assembled from metagenomics studies, which challenges alignment-based HGT analysis by overwhelming the known reference sequences. Detecting HGT by k-mer statistics thus becomes an attractive alternative. These alignment-free statistics have demonstrated high performance and efficiency in whole-genome and transcriptome comparisons. To adapt k-mer statistics for HGT detection, we developed two aggregative statistics, $T_{sum}^{S}$ and $T_{sum}^{*}$, which subsample metagenome contigs by their representative regions and summarize the regional $D_2^{S}$ and $D_2^{*}$ metrics by their upper bounds. We systematically studied the power of the aggregative statistics at different k-mer sizes using simulations. Our analysis showed that, in general, the power of $T_{sum}^{S}$ and $T_{sum}^{*}$ increases with sequencing coverage and reaches a maximum above 80% at k = 6, at a 5% Type-I error rate and a coverage ratio above 0.2x. The statistical power of $T_{sum}^{S}$ and $T_{sum}^{*}$ was evaluated with realistic simulations of HGT mechanism, sequencing depth, read length, and base error. We expect these statistics to be useful distance metrics for identifying HGT in metagenomic studies.
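
The abstract names the regional D2^S and D2^* metrics without giving their formulas. As a rough illustration only, the sketch below computes the two statistics for a pair of sequences in the form commonly given in the alignment-free literature: k-mer counts are centred by their expected values and then normalised. It assumes an i.i.d. background model estimated from each sequence's own base frequencies, and it uses the two-background variant of D2^*; both choices are assumptions, not details taken from this record. The function names are hypothetical. The aggregative T_sum statistics (subsampling of representative regions and summarizing by upper bounds) are not implemented, because their definition is not stated here.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def kmer_counts(seq, k):
    """Count overlapping k-mers that consist only of A, C, G, T."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if all(c in ALPHABET for c in word):
            counts[word] += 1
    return counts


def base_freqs(seq):
    """Single-base frequencies, used as an i.i.d. background model (assumption)."""
    n = sum(seq.count(c) for c in ALPHABET)
    return {c: seq.count(c) / n for c in ALPHABET}


def word_prob(word, freqs):
    """Probability of a word under the i.i.d. background model."""
    p = 1.0
    for c in word:
        p *= freqs[c]
    return p


def d2_statistics(seq_x, seq_y, k):
    """Return (D2S, D2star) for two sequences at k-mer size k."""
    cx, cy = kmer_counts(seq_x, k), kmer_counts(seq_y, k)
    nx, ny = sum(cx.values()), sum(cy.values())
    fx, fy = base_freqs(seq_x), base_freqs(seq_y)
    d2s = 0.0
    d2star = 0.0
    for letters in product(ALPHABET, repeat=k):
        w = "".join(letters)
        ex = nx * word_prob(w, fx)  # expected count of w in seq_x
        ey = ny * word_prob(w, fy)  # expected count of w in seq_y
        xt = cx[w] - ex             # centred count in seq_x
        yt = cy[w] - ey             # centred count in seq_y
        norm_s = sqrt(xt * xt + yt * yt)
        if norm_s > 0:
            d2s += xt * yt / norm_s
        if ex > 0 and ey > 0:
            d2star += xt * yt / sqrt(ex * ey)
    return d2s, d2star
```

For example, `d2_statistics(contig, reference, 6)` would return the pair of scores at the k-mer size the abstract reports as most powerful; `contig` and `reference` stand for any two DNA strings. Larger values indicate more similar k-mer usage, so turning a score into a distance requires a further normalisation step that is not shown here.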

Original language	English (US)
Pages (from-to)	150-156
Number of pages	7
Journal	Synthetic and Systems Biotechnology
Volume	4
Issue number	3
DOIs	https://doi.org/10.1016/j.synbio.2019.08.001
State	Published - Sep 2019
Externally published	Yes

Keywords

Alignment-free sequence comparison
Horizontal gene transfer
Statistical power
k-mer

ASJC Scopus subject areas

Structural Biology
Biomedical Engineering
Applied Microbiology and Biotechnology
Genetics

Cite this

@article{0710b77bc36147218f1fee80fc1abb0d,
title = "The statistical power of k-mer based aggregative statistics for alignment-free detection of horizontal gene transfer",
abstract = "Alignment-based database search and sequence comparison are commonly used to detect horizontal gene transfer (HGT). However, with the rapid increase of sequencing depth, hundreds of thousands of contigs are routinely assembled from metagenomics studies, which challenges alignment-based HGT analysis by overwhelming the known reference sequences. Detecting HGT by k-mer statistics thus becomes an attractive alternative. These alignment-free statistics have demonstrated high performance and efficiency in whole-genome and transcriptome comparisons. To adapt k-mer statistics for HGT detection, we developed two aggregative statistics, $T_{sum}^{S}$ and $T_{sum}^{*}$, which subsample metagenome contigs by their representative regions and summarize the regional $D_2^{S}$ and $D_2^{*}$ metrics by their upper bounds. We systematically studied the power of the aggregative statistics at different k-mer sizes using simulations. Our analysis showed that, in general, the power of $T_{sum}^{S}$ and $T_{sum}^{*}$ increases with sequencing coverage and reaches a maximum above 80% at k = 6, at a 5% Type-I error rate and a coverage ratio above 0.2x. The statistical power of $T_{sum}^{S}$ and $T_{sum}^{*}$ was evaluated with realistic simulations of HGT mechanism, sequencing depth, read length, and base error. We expect these statistics to be useful distance metrics for identifying HGT in metagenomic studies.",
keywords = "Alignment-free sequence comparison, Horizontal gene transfer, Statistical power, k-mer",
author = "Huang, {Guan Da} and Liu, {Xue Mei} and Huang, {Tian Lai} and Xia, {Li C.}",
note = "Publisher Copyright: {\textcopyright} 2019",
year = "2019",
month = sep,
doi = "10.1016/j.synbio.2019.08.001",
language = "English (US)",
volume = "4",
pages = "150--156",
journal = "Synthetic and Systems Biotechnology",
issn = "2405-805X",
publisher = "KeAi Communications Co",
number = "3",
}

TY  - JOUR
T1  - The statistical power of k-mer based aggregative statistics for alignment-free detection of horizontal gene transfer
AU  - Huang, Guan Da
AU  - Liu, Xue Mei
AU  - Huang, Tian Lai
AU  - Xia, Li C.
PY  - 2019/9
Y1  - 2019/9
N2  - Alignment-based database search and sequence comparison are commonly used to detect horizontal gene transfer (HGT). However, with the rapid increase of sequencing depth, hundreds of thousands of contigs are routinely assembled from metagenomics studies, which challenges alignment-based HGT analysis by overwhelming the known reference sequences. Detecting HGT by k-mer statistics thus becomes an attractive alternative. These alignment-free statistics have demonstrated high performance and efficiency in whole-genome and transcriptome comparisons. To adapt k-mer statistics for HGT detection, we developed two aggregative statistics, T_sum^S and T_sum^*, which subsample metagenome contigs by their representative regions and summarize the regional D2^S and D2^* metrics by their upper bounds. We systematically studied the power of the aggregative statistics at different k-mer sizes using simulations. Our analysis showed that, in general, the power of T_sum^S and T_sum^* increases with sequencing coverage and reaches a maximum above 80% at k = 6, at a 5% Type-I error rate and a coverage ratio above 0.2x. The statistical power of T_sum^S and T_sum^* was evaluated with realistic simulations of HGT mechanism, sequencing depth, read length, and base error. We expect these statistics to be useful distance metrics for identifying HGT in metagenomic studies.
AB  - Alignment-based database search and sequence comparison are commonly used to detect horizontal gene transfer (HGT). However, with the rapid increase of sequencing depth, hundreds of thousands of contigs are routinely assembled from metagenomics studies, which challenges alignment-based HGT analysis by overwhelming the known reference sequences. Detecting HGT by k-mer statistics thus becomes an attractive alternative. These alignment-free statistics have demonstrated high performance and efficiency in whole-genome and transcriptome comparisons. To adapt k-mer statistics for HGT detection, we developed two aggregative statistics, T_sum^S and T_sum^*, which subsample metagenome contigs by their representative regions and summarize the regional D2^S and D2^* metrics by their upper bounds. We systematically studied the power of the aggregative statistics at different k-mer sizes using simulations. Our analysis showed that, in general, the power of T_sum^S and T_sum^* increases with sequencing coverage and reaches a maximum above 80% at k = 6, at a 5% Type-I error rate and a coverage ratio above 0.2x. The statistical power of T_sum^S and T_sum^* was evaluated with realistic simulations of HGT mechanism, sequencing depth, read length, and base error. We expect these statistics to be useful distance metrics for identifying HGT in metagenomic studies.
KW  - Alignment-free sequence comparison
KW  - Horizontal gene transfer
KW  - Statistical power
KW  - k-mer
UR  - http://www.scopus.com/inward/record.url?scp=85071529634&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85071529634&partnerID=8YFLogxK
U2  - 10.1016/j.synbio.2019.08.001
DO  - 10.1016/j.synbio.2019.08.001
M3  - Article
AN  - SCOPUS:85071529634
SN  - 2405-805X
VL  - 4
SP  - 150
EP  - 156
JO  - Synthetic and Systems Biotechnology
JF  - Synthetic and Systems Biotechnology
IS  - 3
ER  -
