573e Mocat: an Algorithm for Learning Boolean Functions from Noisy Data and Its Application to Learning Interacting Residues in a Protein

Anshul Dubey, School of Chemical and Biomolecular Engineering, Georgia Institute of Technology, 311 Ferst Dr, Atlanta, GA 30332, Matthew J. Realff, Chemical and Biomolecular Engineering, Georgia Tech, 311 Ferst Drive, N.W., Atlanta, GA 30332-0100, Jay H. Lee, Chemical and Biomolecular Engineering, Georgia Institute of Technology, Ford Environmental Science & Technology Building 311 Ferst Drive, N.W., Atlanta, GA 30332, and Andreas S. Bommarius, Georgia Institute of Technology, School of Chemical Engineering, 315 Ferst Drive N.W., Parker H. Petit Biotechnology Institute, Room 3310, Atlanta, GA 30332-0363.

The mapping of a protein's sequence to its function is one of the most fundamental challenges in protein engineering and still remains an unsolved problem. This problem, if solved, can facilitate a rational design process in which the structure to function mapping is used to screen the potential designs. The enormous size of the sequence space or the state space of the problem, which can be defined as all the possible sequences that can be created for an enzyme, adds to the complexity of the problem. However, it is known that only a small fraction of the amino acids present in a protein contribute significantly to the protein's properties. Finding these amino acid positions can greatly improve our knowledge about proteins and also help us to design better experiments to alter them.

Apart from individual amino acids, it is known that in the three dimensional structure of a protein, certain amino acids can interact with each other in order to provide maintain structural integrity or aid in its catalytic function [1, 2]. If these positions are mutated the loss of this interaction usually leads to a non-functional protein. Directed Evolution (DE) experiments [3], which probe the sequence space of a protein through mutations or recombination in search for an improved variant, frequently result in such inactive sequences. In previous work, it was shown that Boolean Learning can be used to find the interacting amino acid residues from the primary sequence of the variants that are generated during DE and their measured activity [4]. It was shown this problem can be posed for Boolean Learning by transforming the sequences into Boolean vectors. The logical function that corresponds to the specific problem of finding amino acid residues with interactions is formulated in a Disjunctive Normal Form (DNF) [4]. The One Clause at a Time (OCAT) [5] algorithm was used to learn the Boolean function and thus, obtain the interacting residues. Simulations show that the pairs can be identified with a reasonably accuracy, which declines with increasing number of pairs per sequence and the length of the sequence.

To verify the strength of this approach, sequences from the recombination of mRFP and dsRED by using both RDA-PCR [6] and DNA-shuffling [7] were used to identify the interactions that exist between different residues in their sequence. However, it was observed that due to various factors contributing to the inaccuracy in high throughput screening of the variants, experimentally obtained sequences are prone to a classification noise. This means that certain active variants can be classified as inactive and vice versa. It can be shown that the OCAT algorithm is intolerable to any level of classification noise. Thus, a modified version of the algorithm, referred to as modified OCAT or mOCAT, was developed. It is theoretically shown that mOCAT can tolerate classification noise, β < ½. An expressions for the number of examples required for reliable learning was also developed.

The problem of learning Boolean functions with classification noise has been investigated earlier. Valiant [8] first introduced the PAC learning model for Boolean functions, which was extended for classification noise by Angluin and Laird [9]. It can be shown that their algorithm does not take into account the non-relevant features, which are present in our problem. A similar algorithm was also provided by Kearns and Li [10], which was shown to be tolerant to a worst-case noise model termed malicious errors. All the above algorithms were developed to identify k-length Boolean functions only, which are a specific class of Boolean functions. On the other hand, OCAT and thus mOCAT, can identify Boolean functions with any length and can also work in the presence of irrelevant features.

Empirical results are obtained by simulating the problem of identifying interacting residues from the sequences of protein variants. It is shown that interactions can be identified for varying length of proteins and also with varying level of classification noise. A recursive extension of mOCAT is also introduced, which is shown to converge in a bounded amount of time and is proven to give better identification than mOCAT. The simulation results corroborate this improvement. This algorithm is also applied to the experimentally obtained variants of mRFP and dsRED. A pair of amino acid residues is identified to be interacting and their interaction is further verified by point mutations.

References

1. Meyer, M.M., J.J. Silberg, et al., Library analysis of SCHEMA-guided protein recombination. Protein Science: a Publication Of The Protein Society, 2003. 12(8): p. 1686-1693.

2. Saraf, M.C., A.R. Horswill, et al., FamClash: a method for ranking the activity of engineered enzymes. Proceedings Of The National Academy Of Sciences Of The United States Of America, 2004. 101(12): p. 4142-4147.

3. Petrounia, I.P. and F.H. Arnold, Designed evolution of enzymatic properties. Curr.Opin.Biotechnol., 2000. 11(4): p. 325-330.

4. Dubey, A., M.J. Realff, et al., Identifying the Interacting Positions of a Protein Using Boolean Learning and Support Vector Machines. Accepted in Computational Biology and Chemistry, 2006.

5. Triantaphyllou, E., Inference of a Minimum Size Boolean Function from Examples by Using a New Efficient Branch-and-Bound Approach. Journal of Global Optimization, 1994. 5(1): p. 69-94.

6. Ikeuchi, A., Y. Kawarasaki, et al., Chimeric Gene Library Construction by a Simple and Highly Versatile Method Using Recombination-Dependent Exponential Amplification. Biotechnology Progress, 2003. 19(5): p. 1460-1467.

7. Stemmer, W.P., Rapid evolution of a protein in vitro by DNA shuffling. Nature, 1994. 370(6488): p. 389-391.

8. Valiant, L.G., A theory of the learnable. Communications of the ACM, 1984. 27(11): p. 1134-1142.

9. Angluin, D. and P. Laird, Learning from Noisy Examples. Mach. Learn., 1988. 2(4): p. 343-370.

10. Kearns, M. and M. Li, Learning in the Presence of Malicious Errors. SIAM Journal on Computing, 1993. 22(4): p. 807-837.