124a Global Pairwise Sequence Alignment Using Integer Linear Optimization: a Path Selection Approach

Scott R. McAllister, Rohit Rajgaria, and Christodoulos A. Floudas. Chemical Engineering, Princeton University, Dept of Chemical Engineering; A215 Engineering Quadrangle, Princeton, NJ 08544

An important and well-studied problem in the area of computational biology is the sequence alignment problem. The general goals of sequence alignment methods are to identify related protein sequences and to determine the best alignment between the two sequences. This provides not only a rough measure of evolutionary distance, but also possible relationships between the protein structure and function of similar sequences. Generally, the pairwise sequence alignment problem is approached through either (i) global alignment or (ii) local alignment techniques. Global alignment algorithms strive to identify the best overall alignment that spans the length of both sequences. Needleman and Wunsch developed a dynamic programming approach to this problem that is widely used[1]. Two recently developed methods have formulated the global pairwise alignment problem as a mixed integer linear optimization problem[2,3].

A novel integer linear optimization (ILP) model has been developed to rigorously address the global pairwise sequence alignment problem[4]. This model is formulated by defining an alignment node as an active binary variable only when a position in the first sequence aligns to a position in the second sequence. The success of this approach then results from the selection of the optimal binary path variables, which become active when two neighboring node variables are activated. By using a path selection approach that employs some of the algorithmic advantages of dynamic programming methods, this integer linear optimization model gains efficiency while maintaining the rigor of the combinatorial optimization approach. The model is independent of the functional form of the gap penalties, which can efficiently be pre-computed for any given alignment path. The important components of the model formulation, in addition to its mathematical rigor, are (a) the natural introduction of functionally important conservation constraints, (b) the creation of a rank-ordered list of the highest scoring alignments, and (c) the refinement of alignments by pairwise interaction scores. For biological examples where residues must be conserved to retain function, mathematical constraints are included to enforce this functionally important conservation. Such examples of conservation constraints have been applied to the disulfide bridge residues in trypsin inhibitors and the Ser-His-Asp catalytic triad of serine proteases. Additionally, introducing a number of integer cut constraints allows for the generation of a rank-ordered list to highlight and possibly address uncertainty in the highest scoring alignment. Applications of this model feature include the alignment of helical regions of G protein-coupled receptors. The ability to include pairwise interaction scores during the alignment, without sacrificing the rigorous guarantee of optimality, will be crucial to the development of advanced techniques for detecting remotely homologous proteins in a fold recognition framework.

[1] Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970;48:443-453.

[2] McAllister SR, Rajgaria R, and Floudas CA. A template-based mixed-integer linear programming sequence alignment model. In: A. Torn and J. Zilinskas, editors, Modeling and Algorithms for Global Optimization, Nonconvex Optimization and Its Applications. Springer-Verlag. 2006a. Accepted for publication.

[3] McAllister SR, Rajgaria R, and Floudas CA. Global Pairwise Sequence Alignment Through Mixed-Integer Linear Programming: A Template-Free Approach. Optim Methods Softw 2006b. Accepted for publication.

[4] McAllister SR, Rajgaria R, and Floudas CA. A Path Selection Approach to Global Pairwise Sequence Alignment Using Integer Linear Optimization. 2006c. In preparation.