679b Model Predictive Discrimination Approach for Classification of Process and Biological Data

Raghuraj K. Rao and Lakshminarayanan Samavedham. Chemical and Biomolecular Engineering, National University of Singapore, 4, Engineering Drive 4, Singapore, 117576, Singapore

Discriminating data into multiple classes is an important task in process systems engineering and systems biology. Fault detection and isolation, pattern recognition, and product characterization based on multiple input features are some of the areas that benefit from discriminant analysis [1,2]. Data classification tools are also used extensively in biological data analysis. Classification of biological species, structural and functional classification of molecules, diagnostic studies of diseases and other areas of molecular biology often adopt discrimination methods based on multivariate statistics (MVS) [3]. The data classification algorithms applied in the literature are data specific and have shown varying degrees of performance [4]. Because most of these methods were developed and tested for two-group problems, classification rates on biological data suffer further from the multi-class nature of such data. Linear discrimination methods face additional difficulty when different classes have overlapping characteristics and cannot be distinguished on the basis of inter-class dissimilarities alone [3]. Alternatively, linear and nonlinear interactions among the features can be exploited to assign samples to a specific class. Based on this idea, a new Model Predictive Discrimination (MPD) method is proposed as an alternative classification approach.

The basic idea adopted in this study is to build, for each class, a variable predictive model that represents the associations among the attributes. Each variable is modeled as a function of either all the remaining variables or an optimally selected subset of them. The model parameters are estimated from the training data. The variable predictive model structure designed for a particular class distinctly characterizes the intra-class attribute relations for that class. It is hypothesized that a sample observation of all the variables is predicted best by the model of the class to which it belongs. The hypothesis is tested using measures of prediction capability based on statistical error. The sample to be tested is projected onto each class model to re-predict the full set of variables, and the prediction accuracy is used as the discriminating criterion.
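
The procedure described above can be illustrated with a minimal sketch. The abstract does not specify the model form, the variable-subset selection, or the exact error statistic, so the sketch below makes its own assumptions: each variable is regressed linearly (ordinary least squares with an intercept) on all the remaining variables, and the discriminating criterion is the sum of squared prediction errors over all variables. The function names `fit_mpd` and `predict_mpd` are hypothetical and are not taken from the authors' work.

```python
import numpy as np

def fit_mpd(X, y):
    """For each class, fit one linear model per variable from the other variables.

    Assumption: linear least-squares models using all remaining variables.
    Returns {class_label: [coefficient vector for variable 0, 1, ...]}.
    """
    models = {}
    for label in np.unique(y):
        Xc = X[y == label]  # training samples of this class only
        coefs = []
        for j in range(X.shape[1]):
            # Predictors: intercept column plus every variable except j.
            A = np.column_stack([np.ones(len(Xc)), np.delete(Xc, j, axis=1)])
            beta, *_ = np.linalg.lstsq(A, Xc[:, j], rcond=None)
            coefs.append(beta)
        models[label] = coefs
    return models

def predict_mpd(models, X):
    """Assign each sample to the class whose models re-predict it best.

    Assumption: prediction quality is the sum of squared errors over variables.
    """
    labels = list(models)
    errors = np.zeros((len(X), len(labels)))
    for k, label in enumerate(labels):
        for j, beta in enumerate(models[label]):
            A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
            errors[:, k] += (X[:, j] - A @ beta) ** 2
    # The class with the smallest total prediction error wins.
    return np.array(labels)[errors.argmin(axis=1)]
```

A nonlinear variant would replace the least-squares fit with a nonlinear regression of each variable on the others; the comparison of per-class prediction errors would stay the same.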

The new MPD algorithm is validated on illustrative and well-studied classification examples from process and biological applications. Experimental data sets established in the literature are used as benchmarks. The performance of the new method is compared with that of classical classification algorithms such as LDA, QDA, SVM and ANN. The MPD method is observed to perform well in all cases. Its results are superior to those of the existing methods, especially for data sets with nonlinear variable interactions. The MPD classification method can be extended to many fault detection and diagnosis applications in process and biological systems.

Keywords: discriminant analysis, data classification, fault detection and diagnosis, computational biology, pattern recognition.

[1] R.O. Duda, P.E. Hart and D.G. Stork, “Pattern classification”, John Wiley, New York, 2000.

[2] Y. Tominaga, “Comparative study of class data analysis with PCA-LDA, SIMCA, PLS, ANNs, and k-NN”, Chemom. Intell. Lab. Syst. 49, 105–115, 1999.

[3] R. Sokal and F.J. Rohlf, “Biometry: The principles and practice of statistics in biological research”, 3rd edition, W.H. Freeman & Co., New York, 1995.

[4] S. Dudoit, J. Fridlyand and T.P. Speed, “Comparison of discrimination methods for the classification of tumors using gene expression data”, J. Amer. Statist. Assoc. 97, 77–87, 2002.