5cl Optimization Algorithms for Protein Bioinformatics

Wei Xie, Chemical and Biomolecular Engineering, University of Illinois, 600 South Mathews Avenue, Urbana, IL 61801

Research in proteins is one of the most critical activities in molecular biology, a field in which recent results strongly suggest that sophisticated computational methods contribute substantially to our fundamental understanding of protein structure and function. It is now acknowledged that most important aspects of protein research demand powerful algorithms and revolve around optimizing certain merit functions. Good algorithms in this field have changed fundamentally the way molecular biology has progressed (e.g., BLAST algorithm).

My research focuses on developing efficient algorithms for important and challenging protein bioinformatics problems. In this presentation, I will talk about our recent algorithmic studies on three such problems: (a) protein side-chain conformation prediction and design, (b) contact map overlap maximization for protein structure alignment, and (c) combinatorial library design for protein engineering. Although these problems originate from different aspects of protein bioinformatics, they share a common feature: the problems are computationally challenging yet important enough in practice, and efficient algorithms will result in research breakthroughs in molecular biology. Results show that these new optimization algorithms that we have developed shed new light on these problems.

(a) Protein Side-Chain Conformation Problem

The protein side-chain conformation problem is a very important problem in protein structure prediction and rational protein design, and therefore has been extensively studied in the literature. The side-chain problem assumes that the coordinates of backbone atoms are known, and the possible conformations of the side-chains are represented by discrete sets of statistically significant conformations called rotamers. A least energetic side-chain conformation is believed to be representative of the protein fold. Since the problem is inherently difficult and practical applications often concern large instances, efficiency of the algorithms becomes the bottleneck for current and future rational protein design projects.

By combining existing dead-end-elimination techniques with graph-theoretic results, we design a highly efficient algorithm for the side-chain problem, which outruns the popular SCWRL 3.0 package, and yet maintains a high level of prediction accuracy.

(b) Contact Map Overlap (CMO) Maximization for Protein Structure Alignment

Aligning proteins for similarity is one of the oldest and most important problems in computational biology that will continue to attract protein scientists' attention in the foreseeable future. Aligning unknown proteins with well-studied counterparts highlights important common features and indicates relations in the structures and functions. Despite overwhelming success, sequence alignment based approaches are being increasingly criticized for not being able to handle remotely related proteins. In fact, even the most powerful profile-based sequence alignment approaches currently cannot compare proteins whose sequence identity differs by more than 80% (i.e., in the so-called mid-night zone). Alternative methods to compare proteins based on their structural similarity are gaining favor, as they bypass such limit on sequence identity. Among many existing structural alignment approaches, contact map overlap (CMO) maximization is known for its sensitivity and robustness. However, solving the CMO problem is far from easy. While sequence alignment algorithms routinely align sequences of residues beyond tens of thousands in a few seconds, state-of-art CMO methods fail to align sequences of around a hundred residues.

By combining the prototypical branch-and-bound algorithm with problem-specific reduction schemes, we develop a combinatorial branch-and-reduce algorithm for the CMO problem. Preliminary results suggest that this new algorithm substantially enhances our ability to solve practically interesting CMO instances, and possesses strong potential for leading to even more efficient algorithms for the CMO problem in the future.

Designing proteins with novel or improved functions via combinatorial libraries involves first mutating or recombining existing sequences and then exploring the resultant sequences. Compared to the traditional sequence design approaches that require detailed knowledge of protein structure and function, combinatorial library design approaches are more straightforward to implement and prove quite successful in practice. An indispensable component of a successful library design approach is a powerful computational method that reaches balance between diversity and activity. Although quite a few computational studies have been conducted in the past, no detailed theoretical analysis exists that unveils the inherent difficulty of the proposed models, i.e., how efficiently they can be solved. Such analysis also provides important guidelines for the future development of models and algorithms for this problem.

We propose to study combinatorial library design problem from a computational perspective and reach several interesting conclusions from preliminary studies. First, we show that existing models for combinatorial library design vary substantially in difficulty: while some models are fairly easy in that they admit low order polynomial-time algorithms, others may demand exponential time to solve. In addition, we discover a mathematical property for the general combinatorial library design problem, which could be the key to design efficient algorithms. Using this new property, we design a new and much faster algorithm for the sequence-independent site-directed chimeragenesis (SISDC) protocol than the existing popular RASPP algorithm.