124d Selection of Informative Genes in Time-Course Gene Expression Data

Eric Yang and Ioannis Androulakis. Biomedical Engineering, Rutgers University, Piscataway, NJ 08854

One of the biggest challenges in interpreting data from high-throughput gene expression experiments is extracting the relevant genes from the large volume of data generated. Clustering is often used to segregate genes by their expression profiles, in the hope that the co-expression of certain genes will hint at both their functions and their regulatory controls (7). However, most clustering techniques assume that the entire dataset is relevant. This is not always the case, and it complicates the extraction of useful information. We propose an integrated methodology for extracting dominant expression signatures and their putative regulatory structures.

Our proposed algorithm identifies genes that maximally contribute to changes in the overall transcriptional state of the system. Specifically, we propose a novel processing of transcriptional information that first characterizes each expression profile as a waveform of normalized intensity and assigns to it a characteristic attribute. This allows each waveform to be compared with every other waveform in the set, thus enabling the grouping of similar waveforms (i.e., genes) into expression motifs. The characteristic attribute is defined using established methods of symbolic transformation and hashing in time series (5), with a novel weighting of values that is more appropriate to expression data. Subsequently, expression motifs are selected based on their ability to represent the overall transcriptional dynamics. A rank-order distribution of gene expression intensities is first derived and the waveform of this distribution is characterized, thus defining the transcriptional state. As this state changes over time, deviations from the original state are summarized and quantified using the Kolmogorov-Smirnov test. Importantly, this statistic quantifies the extent of the deviation from the initial, or ground, state of the system. Discrete packets of expression motifs can then be used in an attempt to replicate this deviation. As a result, expression motifs can be characterized by how strongly they replicate the behavior of the entire system, and rank-ordered by their contribution to the overall state change of the system (8).
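
The two building blocks described above, symbolic hashing of profiles into motifs and a Kolmogorov-Smirnov measure of state deviation, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three-letter alphabet, the breakpoints, the function names, and the toy profiles are all assumptions made for illustration, and the novel weighting of values mentioned in the text is omitted.

```python
import math
from bisect import bisect_right
from collections import defaultdict

# Assumed discretization: two breakpoints that split a standard normal
# distribution into three roughly equiprobable regions, one per symbol.
BREAKPOINTS = (-0.43, 0.43)
ALPHABET = "abc"


def z_normalize(profile):
    """Rescale a profile to zero mean and unit variance (flat profiles map to zeros)."""
    n = len(profile)
    mean = sum(profile) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in profile) / n)
    if std == 0:
        return [0.0] * n
    return [(x - mean) / std for x in profile]


def symbolize(profile):
    """Turn a profile into a symbol string that serves as its hash value."""
    return "".join(ALPHABET[bisect_right(BREAKPOINTS, z)] for z in z_normalize(profile))


def group_into_motifs(profiles):
    """Group genes whose profiles hash to the same symbol string."""
    motifs = defaultdict(list)
    for gene, profile in profiles.items():
        motifs[symbolize(profile)].append(gene)
    return dict(motifs)


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def state_deviation(profiles, t):
    """KS distance between the intensity distributions at time 0 and time t."""
    initial = [p[0] for p in profiles.values()]
    current = [p[t] for p in profiles.values()]
    return ks_statistic(initial, current)


# Hypothetical toy data: four genes measured at four time points.
toy_profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [10.0, 20.0, 30.0, 40.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [5.0, 5.0, 5.0, 5.0],
}
toy_motifs = group_into_motifs(toy_profiles)
toy_deviation = state_deviation(toy_profiles, 3)
```

In this toy set, g1 and g2 differ in scale but share a shape, so they hash to the same motif after normalization. Ranking motifs would then amount to applying `state_deviation` to subsets of genes and comparing the result with the deviation of the full system.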

The gene selection based on the fine-grained clustering just described is further analyzed in order to identify optimal ways of merging motifs with the reduced set of maximally informative motifs. To identify these related genes, we require both a method for determining which motifs lie near the maximally informative set and a method for determining when sufficient motifs have been merged into it. To determine which unselected motifs are close to the maximally informative set, we apply hierarchical clustering to the symbolic representation. Because hierarchical clustering normally has no stopping criterion for deciding whether two clusters or expression profiles should be merged, we have devised a novel one. It rests on the notion that the difficulty of building a classifier that segregates the genes in the same manner as the hash value is directly related to the similarity between two motifs. If the number of misclassified genes (i.e., genes that hash to motif A but are classified as motif B, and vice versa) is low, this suggests that the motifs are different. The stopping criterion relies on the fact that the two distributions are independent if the misclassification probability is smaller than the product of the probabilities that any given gene lies within each of the motifs. The benefit of this idea is that a stopping criterion is built in. The misclassification estimates are generated using a Naïve Bayes classifier which, rather than requiring the full chain rule to calculate the conditional probability, allows one to take the product of the individual terms p(feature_i | cluster).
The novelty of our approach stems from the fact that the features used for our classifier are the transcription factor binding sites, together with a rough probability estimate derived from the distance between a gene's expression profile and the average expression profile for that motif. The transcription factor binding sites are identified using a combination of position weight matrices and phylogenetic estimates obtained by comparing the relative probabilities of a given transcription factor binding in different mammalian organisms. Unlike other expression analysis techniques, our approach therefore merges multiple sources of data. The overall procedure is to first divide the genes into a set of motifs, then extract the maximally informative set of motifs, and finally perform hierarchical clustering upon the entire set of motifs using the maximally informative motifs as the starting point.
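
The merge test described above can be sketched as follows, under several assumptions that are not in the original: features are reduced to binary presence or absence of binding sites (the distance-based probability estimate is left out), the conditional probabilities are Laplace-smoothed, the error rate is measured on the training genes themselves, and the threshold is one possible reading of the stopping rule, namely the product of the two motif frequencies. All function names are hypothetical.

```python
import math


def train_naive_bayes(genes_by_motif):
    """Estimate a prior and smoothed p(feature_i = 1 | motif) for each motif."""
    total = sum(len(genes) for genes in genes_by_motif.values())
    n_features = len(next(iter(genes_by_motif.values()))[0])
    model = {}
    for motif, genes in genes_by_motif.items():
        prior = len(genes) / total
        # Laplace smoothing keeps every probability strictly between 0 and 1.
        p_on = [(sum(g[i] for g in genes) + 1) / (len(genes) + 2)
                for i in range(n_features)]
        model[motif] = (prior, p_on)
    return model


def classify(model, features):
    """Pick the motif with the largest log posterior, using the naive product."""
    best_motif, best_score = None, -math.inf
    for motif, (prior, p_on) in model.items():
        score = math.log(prior)
        for present, p in zip(features, p_on):
            score += math.log(p if present else 1.0 - p)
        if score > best_score:
            best_motif, best_score = motif, score
    return best_motif


def misclassification_rate(genes_by_motif):
    """Fraction of genes assigned to a motif other than the one they hash to."""
    model = train_naive_bayes(genes_by_motif)
    total = errors = 0
    for motif, genes in genes_by_motif.items():
        for features in genes:
            total += 1
            errors += classify(model, features) != motif
    return errors / total


def should_merge(genes_by_motif, n_all_genes):
    """Assumed rule: merge when the error rate reaches the product of motif frequencies."""
    sizes = [len(genes) for genes in genes_by_motif.values()]
    threshold = (sizes[0] / n_all_genes) * (sizes[1] / n_all_genes)
    return misclassification_rate(genes_by_motif) >= threshold


# Hypothetical binding-site indicators (one tuple per gene, one entry per site).
distinct_pair = {
    "A": [(1, 1, 0), (1, 1, 0), (1, 0, 0)],
    "B": [(0, 0, 1), (0, 0, 1), (0, 1, 1)],
}
similar_pair = {
    "A": [(1, 0), (1, 0), (0, 1)],
    "B": [(1, 0), (0, 1), (0, 1)],
}
```

With these toy inputs the classifier separates `distinct_pair` perfectly, so the two motifs stay apart, while a third of the genes in `similar_pair` are misclassified, so the pair is a candidate for merging. Measuring the error on the training genes is optimistic; a held-out or cross-validated estimate would be a natural refinement.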

Once the set of informative genes has been identified and a set of potential transcription factor binding sites has been selected for each class of informative profiles, a classifier is built. It determines the transcription factors that best separate the informative set from the uninformative set, and that best separate the clusters from one another, using concepts from informative feature selection (2). The strength of these regulatory interactions is then quantified (4). We can thus extract the subset of transcription factors that are activated under the experimental conditions, as well as the transcription factor signals that shape the expression profiles. These results provide hints as to the pathways that are activated under the experimental conditions. The primary advantage of this method is that the results are shaped by biologically relevant information and are therefore much easier to justify in a biological context. The result is a set of genes that share a consistent hypothetical regulatory structure, together with an identification of that structure. This stands in contrast to clustering techniques based only upon the expression profile, where the end result does not yield any hypothetical control structure.
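
One simple way to score how well a binding site separates the informative genes from the rest is information gain. The sketch below uses it as a stand-in and does not reproduce the feature selection method of reference (2); the function names, transcription factor labels, and data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(site_present, labels):
    """Drop in label entropy after splitting genes on presence of one binding site."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(site_present):
        subset = [lab for flag, lab in zip(site_present, labels) if flag == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain


def rank_binding_sites(site_matrix, labels):
    """Sort transcription factors from most to least discriminating."""
    scores = {tf: information_gain(column, labels)
              for tf, column in site_matrix.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical example: four genes, two candidate transcription factors.
gene_labels = ["informative", "informative", "uninformative", "uninformative"]
site_matrix = {
    "TF_B": [1, 0, 1, 0],  # present equally often in both classes
    "TF_A": [1, 1, 0, 0],  # present only in the informative genes
}
ranking = rank_binding_sites(site_matrix, gene_labels)
```

Here `TF_A` ranks first because its presence determines the class exactly, whereas `TF_B` carries no information about the class. The same scoring applies to separating clusters from one another if the labels are cluster identifiers.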

A number of inflammation specific data sets will be analyzed:

• expression profiling in response to corticosteroid treatment in rats (1);

• expression profiling of the inflammatory changes following burn injury in rats (6); and finally

• expression profiling in response to bacterial endotoxin induced sepsis in humans (3).

The advantages of the proposed methodology relative to other methods of analyzing temporal expression data, as well as the wealth of information it generates, will be discussed in terms of identifying informative subsets of genes, their corresponding regulatory mechanisms, and relevant functional ontologies.

1. Almon RR, DuBois DC, Pearson KE, Stephan DA, Jusko WJ. 2003. Gene arrays and temporal patterns of drug response: corticosteroid effects on rat liver. Funct Integr Genomics 3: 171-9

2. Androulakis I. 2004. Selecting maximally informative genes. Computers & Chemical Engineering. Available online: doi:10.1016/j.compchemeng.2004.08.037

3. Calvano SE, Xiao W, Richards DR, Felciano RM, Baker HV, et al. 2005. A network-based analysis of systemic inflammation in humans. Nature 437: 1032-7

4. Kao KC, Yang YL, Liao JC, Boscolo R, Sabatti C, Roychowdhury V. 2004. Network component analysis of Escherichia coli transcriptional regulation. Abstracts of Papers of the American Chemical Society 227: U216-U7

5. Patel P, Keogh E, Lin J, Lonardi S. 2002. Mining Motifs in Massive Time Series Databases. ICDM 2002: 370-7

6. Vemula M, Berthiaume F, Jayaraman A, Yarmush ML. 2004. Expression profiling analysis of the metabolic and inflammatory changes following burn injury in rats. Physiol Genomics 18: 87-98

7. Wolfe CJ, Kohane IS, Butte AJ. 2005. Systematic survey reveals general applicability of "guilt-by-association" within gene coexpression networks. BMC Bioinformatics 6: 227

8. Yang E, Berthiaume F, Yarmush ML, Androulakis IP. 2006. An integrative systems biology approach for analyzing liver hypermetabolism. Presented at 9th Int. Symp. Process Systems Engineering and 16th European Symp. Computer Aided Process Engineering, Garmisch-Partenkirchen, Germany