124b Degenerative Sequence Motifs Identification

Eric Ho1, Eric Yang2, Samuel Gunderson1, and Ioannis Androulakis2. (1) Molecular Biology and Biochemistry, Rutgers University, Piscataway, NJ 08854, (2) Biomedical Engineering, Rutgers University, Piscataway, NJ 08854

Most eukaryotic protein coding pre-mRNAs undergo different degrees of post-transcriptional processing before nuclear export. These include constitutive and/or alternative splicing, 5' capping and polyadenylation. Among these, we will focus on polyadenylation. Polyadenylation refers to the addition of adenosine nucleotides (about 200-250 in mammals or 70 in yeast) to the 3' end of a pre-mRNA; it is named poly(A) tail. Polyadenylation is non-template driven, in contrast to transcription and DNA replication.

Studies have shown that the 3' poly(A) tail helps to increase mRNA stability, signals nuclear export, and regulates mRNA translatability. Two steps are involved in polyadenylation: the cleavage of pre-mRNA during transcription and the addition of adenosine nucleotides to the 3' end of the upstream fragment. All eukaryotic pre-mRNAs except a few histones undergo polyadenylation in the nucleus. In order for the molecular machinery to recognize the location of the site of cleavage and polyA tail addition, the 3' end of a gene typically contains two cis-regulatory elements that flank this site, namely an upstream polyadenylation signal (poly(A) signal) and a downstream GU or U (GU/U) rich region. Interestingly, some genes contain multiple polyadenylation elements. Experiments have shown that the availability of alternative polyadenylation elements allows cell to switch cell fate e.g. B cells differentiation. And one controlling signal has been identified to be the level of GU/U rich binding protein [3]. It is clear that there is a regulatory pathway involving polyadenylation.

The upstream poly(A) signal is recognized by cleavage and polyadenylation specificity factor (CPSF), while the downstream GU/U-rich region is recognized by cleavage simulation factor (CstF). With the help of two additional cleavage factors CFI and CFII, CPSF is brought to interact with CstF to form a larger complex and the 3' region of pre-mRNA is brought into a loop, where pre-mRNA cleavage occurs. The poly(A) signal is a highly conserved hexa-nucleotide element located 10-35 nucleotides upstream of poly(A) site. Most genes contain AAUAAA, but a few have AUUAAA. In contrast to the poly(A) signal, the GU/U-rich region is highly variable in terms of sequence composition and location. In addition, experiments indicated that polyadenylation occurs deterministically at relatively fixed location (+/- 10) between these two elements. Therefore it is puzzling how it is possible for CstF to exhibit sequence binding specificity when the targeted sequence is highly variable. Our objective in this study is to develop a general method to identify degenerative sequence motifs such as CstF binding sites in mammalian genes.

Recent NMR studies [2] suggest that sequence selectivity of CstF-64 (RNA binding subunit of CstF) may be explained by the significant motions at the protein-RNA interface, a novel way to understand binding specificity. This structural study suggested that the key sequence element for CstF-64 interaction is two uracil nucleotides (U). In vitro biochemical studies had shown that CstF-64 bound well only to a certain subset of GU/U-rich sequences [4]. Even though several studies have proposed possible consensus sequence for CstF-64 binding, this remains an open question.

The availability of genomic data enables us to employ bioinformatics and statistical techniques to identify precisely CstF-64 targeted sequences. We speculate that CstF-64 targeted sequences possess certain latent structure which can be uncovered by removing statistically non-essential sequence elements. The assumption is that if certain sequence elements are vital to CstF targeting, they must be preserved in the 3' UTR of all genes, albeit with some degree of degeneracy and location variation. On the other hand, non-essential sequence elements can be subjected to random mutation without causing any harm to host organism. Comparatively speaking, essential sequence elements exhibit distinctive occurrences as compared to non-essential ones. Here we propose to tackle this problem by singular value decomposition (SVD), which is a well established linear algebra technique and has been applied to solve similar problems in other fields such as image compression. We have collected well annotated mammalian 3' UTR sequences from public genomic databases. Each gene associates with one sequence. For each sequence, we break it down into fixed length, overlapping short fragments (k-mers) of length k, where k is set in the range from 2 to 6. Based on that, a gene-by-k-mer matrix is constructed where each row denotes occurrences of all possible k-mers (4k) in a gene. Each column represents occurrences of a particular k-mer across all genes. We apply SVD to this matrix so that it is decomposed into three matrices as follows: M = U*S*Vt

Matrix U reflects data characteristics of genes, and matrix V reflects characteristics of k-mers. Diagonal of S serve as scale factors (non-negative) to indicate how important each eigenvector in U or V with respect to the overall dataset. The larger the singular value the more important is that eigenvector. Since we are interested to identify sequence elements which exhibit distinctive occurrences in the 3' GU/U-rich region, our analysis is focused solely on the multiplication of matrices V and S; (V* S). Each row of V*S contains principal component values of a k-mer in n-dimensional principal component space. If a k-mer is biologically unimportant, its n principal components have to be small relative to other important k-mers. We can imagine if we plot all unimportant k-mers in n-dimensional space, they will all cluster together in a hypersphere. However, if a k-mer is vital, it has to be preserved in each gene. Hence, its occurrence is different from unimportant k-mers. If we plot it in n-dimensional space, it will be positioned remotely from unimportant k-mers. By using this method, we are able to pinpoint distinctive set of tetramers and pentamers. We have found that, in general, they possess high composition of G and U, and a few with C but not A. This result agrees with earlier in vitro biochemical experiment. Consistent results have been found in human and mouse, which make sense as human and mouse CstF-64 shared high degree of homology. Furthermore, we are also interested to know whether these distinctive k-mers tend to locate in any particular positions in 3' UTR. We have constructed a position-by-k-mer matrix. By similar approach, we analyze the principal component space of positions. We have discovered that these essential sequence elements mostly concentrated between poly(A) site and 25 bases downstream, with the highest concentration at 14 to 17 downstream from poly(A) site.

In order to make use of the SVD method proposed above, one must first estimate the parameter k. The parameter k also serves another important purpose. Identifying the length of k would also hint at the length of the RNA element and hence the length of the binding domain in the protein, as well as allow for the construction of a simple classifier through the identification of sequence as either binding or non-binding.

The only piece of information which we know is that the sequence binds to GU/U-rich regions, with non-specific binding to C, but not to A. From this piece of information we can build a non-specific search algorithm based off of the Fourier transform, a method first proposed by [1]. However, given the interchangeability of the bases, we will assign G and U the value of 1, and C an intermediate value of .5. We will perform this transformation upon the set of sequences known to bind and the set of sequences which are a randomly generated set which maintain the same relative ratios of A,T,G,C's.

After the transformation has occurred, we will run an autocorrelation between the target sequence, a k-length vector of 1's. The autocorrelation will function essentially as a sliding window non-specific window matching. We will then take the distribution of autocorrelation values and plot a histogram. It is our belief that the k-length vector of 1's which gives the most discriminating PDF will be the best k-mer to utilize in the algorithm above. In addition, it is also our hypothesis that the k-mer with the most discriminant PDF will also be a likely candidate for the size of the CstF-64 binding domain.

The benefit of such a technique is that it allows for the identification of non-specific sequences. Traditional text matching algorithms look for specific or patterned matching. Utilizing the Fourier transform based autocorrelation method, we are not only able to do non-specific matching treating the bases G and T in the same way, but also account for the smaller specificity of C in a single step.

[1] Cheever, E. (1991). Hardware-based Comparisons of DNA Sequences. IEEE EMBS Conference, Orlando, Florida, IEEE.

[2] Perez Canadillas JM, Varani G. (2003) Recognition of GU-rich polyadenylation regulatory elements by human CstF-64 protein. EMBO J. 2003 Jun 2;22(11):2821-30.

[3] Shell SA, Hesse C, Morris SM Jr, Milcarek C. (2005) Elevated levels of the 64-kDa cleavage stimulatory factor (CstF-64) in lipopolysaccharide-stimulated macrophages influence gene expression and induce alternative poly(A) site selection. J Biol Chem. 2005 Dec 2;280(48):39950-61.

[4] Takagaki Y, Manley JL. (1997) RNA recognition by the human polyadenylation factor CstF. Mol Cell Biol. 1997 Jul;17(7):3907-14.