147b Protein Structure and Fold Recognition Using Amino Acid Interaction Models

Raghuraj K. Rao and Lakshminarayanan Samavedham. Chemical and Biomolecular Engineering, National University of Singapore, 4, Engineering Drive 4, Singapore, 117576, Singapore

Prediction of protein structural classes is an important topic in proteomics because most functional properties of proteins are determined by their structures. Amino acids are the building blocks of peptide chains, and their arrangement determines the structure and, in turn, the function of a protein. Inferring protein structure from many observations, with measured amino acid compositions as attributes, is therefore a classical reverse engineering problem. Proteins are amino acid sequences arranged into one of four basic structural classes [1]: alpha-helical (ALP), flat beta strands (BET), and the combinations ALP/BET and ALP+BET. These assignments characterize the overall structure of a protein and are widely accepted as the base classes for protein structure [2]. A further characterization problem addressed in the literature is the recognition of different fold types within these four classes [3]. Although the source of the observed proteins varies with the organism or tissue under study, the overall structure classification and fold recognition problem can always be formulated as a multivariable (20 amino acid predictor attributes), multi-class (from the four basic classes to hundreds of fold types) discriminant analysis problem [4]. Many classification approaches used to predict protein structures, such as Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), classification and regression trees (CART), and k-nearest neighbors (kNN), are based on correlation methods and employ various distance measures to distinguish the classes. More recently, machine learning algorithms such as Support Vector Machines (SVM) and Artificial Neural Networks (ANN) have been shown to provide higher prediction accuracies in proteomic studies [3]. An average successful prediction rate of 60-70% has been achieved for protein structures.
The class-to-class dissimilarities (measured in terms of distance or variance in the training data) used in LDA, SVM and NN, and the intra-class associations between variables exploited by other nonlinear methods such as ANN, need to be utilized simultaneously to achieve higher structure prediction rates. Associations among the amino acids, in the form of linear and nonlinear dependencies, can be modeled to assign samples to a specific class. Based on this idea, a new model-based protein structure prediction algorithm is proposed as an alternative classification approach, mainly for proteomic studies.

The basic systems idea adopted in this study is to build a separate model for each protein structural class, representing the nonlinear dependencies and interactions between amino acids within that class. Each model is designed to predict the concentration of a particular amino acid in a given class from all, or a few, of the other amino acid compositions. The model parameters are estimated from the training data for each class. To classify a protein sample, its measured amino acid compositions are compared with those predicted by the trained models of the different classes, and the sample is assigned to the structural class whose model gives the least prediction error. The decision is tested using prediction capabilities defined in terms of statistical error.
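
The least-prediction-error rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes ridge-regularized linear models where the abstract refers to nonlinear dependencies, uses three synthetic attributes in place of the 20 amino acid compositions, and all function names (`fit_class_models`, `prediction_error`, `classify`) are hypothetical.

```python
import numpy as np

def fit_class_models(X, ridge=1e-3):
    """For one class, fit one linear model per attribute, predicting that
    attribute from all the others plus an intercept (assumed model form)."""
    n, p = X.shape
    models = []
    for j in range(p):
        # Design matrix: every attribute except j, plus a column of ones.
        A = np.hstack([np.delete(X, j, axis=1), np.ones((n, 1))])
        # Ridge normal equations: (A^T A + ridge * I) w = A^T y
        w = np.linalg.solve(A.T @ A + ridge * np.eye(p), A.T @ X[:, j])
        models.append(w)
    return models

def prediction_error(x, models):
    """Mean squared error between the sample's attributes and the values
    the class models predict for them."""
    total = 0.0
    for j, w in enumerate(models):
        a = np.append(np.delete(x, j), 1.0)
        total += (x[j] - a @ w) ** 2
    return total / len(models)

def classify(x, class_models):
    """Assign the sample to the class whose models reproduce it best."""
    errors = {c: prediction_error(x, m) for c, m in class_models.items()}
    return min(errors, key=errors.get), errors

# Synthetic training data (illustrative only): in each class the third
# attribute depends on the first two through a class-specific relation.
rng = np.random.default_rng(0)

def make_class(coef, n=200):
    base = rng.uniform(0.0, 1.0, size=(n, 2))
    third = base @ np.asarray(coef) + rng.normal(0.0, 0.01, size=n)
    return np.column_stack([base, third])

class_models = {
    "A": fit_class_models(make_class([2.0, 1.0])),
    "B": fit_class_models(make_class([-1.0, 3.0])),
}

# A sample that follows the class A relation (1.9 = 2 * 0.8 + 0.3).
label, errors = classify(np.array([0.8, 0.3, 1.9]), class_models)
print(label, errors)
```

One caveat with this simplification: if the attributes are compositions that sum to a constant, a linear model with an intercept on all the other attributes is exact for every class, so a subset of predictors or a nonlinear model form would be needed. The abstract does not state which form was used.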

The new discrimination algorithm based on class structure models is validated using illustrative, well-studied protein structure classification problems. Established data sets from the literature are used as benchmark problems for basic four-class structure prediction and for multi-class fold recognition in large sets of proteins. Different classification performance testing procedures are adopted to compare the results of the proposed method with those of classical classification algorithms such as LDA, QDA, SVM and ANN. The proposed model-based discrimination method is observed to perform well in all cases. The results are superior to those of existing methods, especially for re-substitution and cross-validation tests. An improvement of 10-20% in overall prediction rate is achieved over the best known existing algorithms. The new approach has the potential to resolve similar system characterization problems, and extensions to other areas of proteomics are being investigated.
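
To make the two testing procedures named above concrete, the sketch below contrasts re-substitution accuracy (train and test on the same samples) with leave-one-out cross-validation accuracy. It is an illustration under stated assumptions: a simple nearest-centroid classifier stands in for the methods compared in the study, the data are synthetic, leave-one-out is only one possible form of cross validation, and all function names are hypothetical.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Stand-in classifier: store one mean vector per class."""
    labels = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in labels])
    return labels, centroids

def nearest_centroid_predict(model, X):
    """Assign each row of X to the class with the closest centroid."""
    labels, centroids = model
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return labels[d.argmin(axis=1)]

def resubstitution_accuracy(X, y):
    """Train on all samples, then test on the same samples."""
    model = nearest_centroid_fit(X, y)
    return float((nearest_centroid_predict(model, X) == y).mean())

def leave_one_out_accuracy(X, y):
    """Hold out each sample in turn, train on the rest, test on the held-out one."""
    n = len(y)
    hits = 0
    for i in range(n):
        keep = np.arange(n) != i
        model = nearest_centroid_fit(X[keep], y[keep])
        hits += int(nearest_centroid_predict(model, X[i:i + 1])[0] == y[i])
    return hits / n

# Synthetic two-class data with overlapping classes (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (30, 5)), rng.normal(1.0, 1.0, (30, 5))])
y = np.array([0] * 30 + [1] * 30)

resub = resubstitution_accuracy(X, y)
loo = leave_one_out_accuracy(X, y)
print(f"re-substitution: {resub:.2f}, leave-one-out: {loo:.2f}")
```

Re-substitution reuses the training samples for testing and therefore tends to give an optimistic estimate, which is why the two figures are usually reported side by side.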

Keywords: protein structure classification, multivariate statistics, discriminant analysis, biological systems modeling.

[1] Kuo-Chen Chou, Biochemical and Biophysical Research Communications, 264, 216–224, 1999.

[2] Chun-Ting Zhang and Ren Zhang, J. Theor. Biol., 201, 189–199, 1999.

[3] C.H.Q. Ding and I. Dubchak, Bioinformatics, 17(4), 349–358, 2001.

[4] R. Sokal and F. J. Rohlf, “Biometry: The principles and practice of statistics in biological research”, 3rd Edition, W.H. Freeman & Co., New York, 1995.