TY - GEN
T1 - Improving functional annotation of non-synonomous SNPs with information theory
AU - Karchin, R.
AU - Kelly, L.
AU - Sali, A.
PY - 2005
Y1 - 2005
N2 - Automated functional annotation of nsSNPs requires that amino-acid residue changes are represented by a set of descriptive features, such as evolutionary conservation, side-chain volume change, effect on ligand-binding, and residue structural rigidity. Identifying the most informative combinations of features is critical to the success of a computational prediction method. We rank 32 features according to their mutual information with functional effects of amino-acid substitutions, as measured by in vivo assays. In addition, we use a greedy algorithm to identify a subset of highly informative features [1]. The method is simple to implement and provides a quantitative measure for selecting the best predictive features given a set of features that a human expert believes to be informative. We demonstrate the usefulness of the selected highly informative features by cross-validated tests of a computational classifier, a support vector machine (SVM). The SVM's classification accuracy is highly correlated with the ranking of the input features by their mutual information. Two features describing the solvent accessibility of "wild-type" and "mutant" amino-acid residues and one evolutionary feature based on superfamily-level multiple alignments produce comparable overall accuracy and 6% fewer false positives than a 32-feature set that considers physiochemical properties of amino acids, protein electrostatics, amino-acid residue flexibility, and binding interactions.
AB - Automated functional annotation of nsSNPs requires that amino-acid residue changes are represented by a set of descriptive features, such as evolutionary conservation, side-chain volume change, effect on ligand-binding, and residue structural rigidity. Identifying the most informative combinations of features is critical to the success of a computational prediction method. We rank 32 features according to their mutual information with functional effects of amino-acid substitutions, as measured by in vivo assays. In addition, we use a greedy algorithm to identify a subset of highly informative features [1]. The method is simple to implement and provides a quantitative measure for selecting the best predictive features given a set of features that a human expert believes to be informative. We demonstrate the usefulness of the selected highly informative features by cross-validated tests of a computational classifier, a support vector machine (SVM). The SVM's classification accuracy is highly correlated with the ranking of the input features by their mutual information. Two features describing the solvent accessibility of "wild-type" and "mutant" amino-acid residues and one evolutionary feature based on superfamily-level multiple alignments produce comparable overall accuracy and 6% fewer false positives than a 32-feature set that considers physiochemical properties of amino acids, protein electrostatics, amino-acid residue flexibility, and binding interactions.
UR - http://www.scopus.com/inward/record.url?scp=15944417881&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=15944417881&partnerID=8YFLogxK
M3 - Conference contribution
C2 - 15759645
AN - SCOPUS:15944417881
SN - 9812560467
SN - 9789812560469
T3 - Proceedings of the Pacific Symposium on Biocomputing 2005, PSB 2005
SP - 397
EP - 408
BT - Proceedings of the Pacific Symposium on Biocomputing 2005, PSB 2005
T2 - 10th Pacific Symposium on Biocomputing, PSB 2005
Y2 - 4 January 2005 through 8 January 2005
ER -