582h Evolution and Changes in the Blosum Matrix and Blocks Database

Mark Styczynski¹, Kyle L. Jensen¹, and Gregory N. Stephanopoulos². (1) Chemical Engineering, MIT, 66-264 MIT 77 Massachusetts Ave, Cambridge, MA 02139, (2) Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave., 56-439, Cambridge, MA 02139

The fidelity of amino acid sequence alignment methods depends strongly on the target frequencies implied by the underlying substitution matrices. The BLOSUM series of matrices, constructed from the Blocks 5 database, is by far the most commonly used family of scoring matrices. Since the derivation of these matrices, there have been many advances in sequence alignment methods and significant growth in protein sequence databases. However, the development of the BLOSUM matrices has not been closely scrutinized, and the matrices have never been recalculated to reflect recent changes in the Blocks database. Intuition suggests that if the Blocks database has changed, through the growth or addition of blocks, matrices computed after these changes may differ from the original BLOSUM matrices.
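To make "target frequencies implied by a substitution matrix" concrete, the following is a minimal sketch of the standard log-odds relationship, s_ij = ln(q_ij / (p_i p_j)) / λ, and its inverse. The two-letter alphabet, the toy frequencies, and the function names are illustrative assumptions and are not taken from the Blocks database; λ is set to the value corresponding to half-bit units, and scores are left unrounded so that the round trip is exact.

```python
import math

# Assumption: scores are expressed in half-bit units, so lambda = ln(2) / 2.
HALF_BIT = math.log(2) / 2

def log_odds_scores(q, p, lam=HALF_BIT):
    """Unrounded log-odds scores s_ij = ln(q_ij / (p_i * p_j)) / lambda."""
    return {(a, b): math.log(qab / (p[a] * p[b])) / lam
            for (a, b), qab in q.items()}

def implied_target_frequencies(s, p, lam=HALF_BIT):
    """Invert the log-odds relation: q_ij = p_i * p_j * exp(lambda * s_ij)."""
    return {(a, b): p[a] * p[b] * math.exp(lam * sab)
            for (a, b), sab in s.items()}

# Hypothetical two-letter alphabet with ordered-pair target frequencies.
p = {"A": 0.5, "B": 0.5}
q = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}

scores = log_odds_scores(q, p)
recovered = implied_target_frequencies(scores, p)
```

In this toy case, pairs that occur more often than expected by chance receive positive scores and the others negative scores. Published matrices round the scores to integers, so the frequencies recovered from them are only approximate.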

We begin by noting inconsistencies between the intended algorithm for the development of the BLOSUM matrices and the actual implementation of the algorithm. These inconsistencies lead to subtle, yet important, differences between the actual BLOSUM matrices that are still used today and the matrices that “should” have been derived and published originally. We analyze the impact of these differences using structurally aligned proteins from the SCOP database as a “gold standard” and find statistically significant differences between the performance of the BLOSUM matrices used today and that of the matrices that should originally have been derived.

Next, we show that updated BLOSUM matrices computed from successive releases of the Blocks database deviate from the original BLOSUM matrices. At constant re-clustering percentage, later releases of the Blocks database give rise to matrices with decreasing relative entropy, or information content. We show that this decrease in entropy is due to the addition of large, diverse families to the Blocks database. Using two separate tests, we demonstrate that isentropic matrices derived from later Blocks releases are less effective for the detection of remote homologs, and that these differences are statistically significant. Finally, we show that by removing the top 1% largest, most diverse blocks, the performance of the matrices can largely be recovered.
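The relative entropy referred to above is conventionally computed from the target frequencies as H = Σ q_ij log2(q_ij / (p_i p_j)), in bits. The sketch below illustrates that formula on a hypothetical two-letter alphabet; the frequency tables and function names are assumptions made for illustration and are not data from any Blocks release. It shows how more diverse aligned pairs, with more off-diagonal mass, yield a lower relative entropy.

```python
import math

def background_frequencies(q):
    """Marginal residue frequencies p_i = sum_j q_ij (ordered pairs)."""
    p = {}
    for (a, _b), qab in q.items():
        p[a] = p.get(a, 0.0) + qab
    return p

def relative_entropy_bits(q):
    """H = sum_ij q_ij * log2(q_ij / (p_i * p_j)), skipping zero entries."""
    p = background_frequencies(q)
    return sum(qab * math.log2(qab / (p[a] * p[b]))
               for (a, b), qab in q.items() if qab > 0)

# Hypothetical target frequencies: a conserved set and a more diverse set.
conserved = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}
diverse = {("A", "A"): 0.3, ("A", "B"): 0.2, ("B", "A"): 0.2, ("B", "B"): 0.3}

h_conserved = relative_entropy_bits(conserved)
h_diverse = relative_entropy_bits(diverse)
```

Here `h_conserved` is larger than `h_diverse`, mirroring in miniature the entropy decrease described for matrices built from later, more diverse Blocks releases.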