Analysis of Microarray Data
Data Generated From Both mRNA and aRNA Samples
|mRNA Sample||A260/280||aRNA Yield|
Hierarchical clustering analysis of the data obtained from 6912 elements was carried out using UPGMA (Unweighted Pair Group Method with Arithmetic Mean) analysis (see sidebar "Clustering Methods Used for Analyzing Microarray Data"), with an ordering function based on the input rank. This data is represented as a dendrogram (tree graph) with the closest branches of the tree representing arrays with similar gene expression patterns. Figure 3 depicts the hierarchical clustering data from all 6912 elements. The results indicate that there are broad similarities between arrays hybridized with aRNA or mRNA. Even though the overall signal patterns found on the aRNA and mRNA hybridized arrays are similar, a small subset of regions show differential expression (RKO/HCT116) signals between the aRNA and mRNA samples.
To obtain statistically significant data for the sub-regions that were distinct between the aRNA and mRNA (91 elements), a weighted average (WPGMA) analysis was carried out. The hierarchical clustering of these 91 elements is depicted in Figure 4. It is evident that very few genes clearly segregate into either mRNA or aRNA groups. It is important to note that, for those genes that do segregate, the gene expression differences (ratios) do not change direction (i.e., from RKO>HCT to HCT>RKO) but are larger in the aRNA samples than in the mRNA samples (as determined by the color shade).
An alternative method for clustering microarray data is k-means clustering. This method does not suffer from some of the problems associated with hierarchical clustering, such as irrelevance of gene expression data as clustering progresses or spurious results due to errors in assigning clusters early in the analysis (2). K-means clustering of all the elements of the HCI arrays was performed with 6 clusters (Figure 5). After 45 iterations, a total score of 1.082e+004 was computed. The most similar "similarity value" was 0 and the least similar "similarity value" was 1.798e+308. Grouping the genes in this way, to identify sets that appear to be differentially expressed between aRNA and mRNA, yielded two clusters (91 elements among clusters 5 and 6) with the largest difference between aRNA and mRNA (Figure 6).
Figure 5. K-means cluster analysis of the 6912 elements using a user-defined cluster number of 6. Forty-five iterations were carried out to group the genes within a given cluster using a data centroid-based search. The total score was calculated to be 1.082e+004. The most closely clustering genes had a similarity value of 0 and the least similar gene had a similarity value of 1.798e+308.
Figure 6. An analysis of variance calculation of K-means clusters 5 and 6. This plot indicates the confidence limit of the data. A 'p-value' of less than 0.0001 was used, indicating that the genes represented in this plot are unique at the 99.9999% cut-off value.
Analysis of variance (ANOVA) of genes in clusters 5 and 6 indicated that the clusters contained genes that behave distinctly between the mRNA and aRNA samples at a confidence limit of 99.99999% (p<0.00001). ANOVA measurements processed the gene-by-gene fluctuations from the mean value and accounted for variance across samples.
A scatter plot analysis of the raw Cy3 and Cy5 values of all 6912 elements within the 6 gene clusters is shown in Figure 7 (4 plots). The top two plots represent all the elements and the bottom two depict the genes that show the largest differences in signal. Most of the genes that are distinct between the samples are expressed at lower levels (low fluorescent signal). These differences were more pronounced in the aRNA than in the mRNA sample because the signal-to-noise ratio was typically much greater in the aRNA sample. The distinct genes in the aRNA panel might be elements that were not clearly discernible in the mRNA sample due to ribosomal RNA contamination (27% in the mRNA used for this analysis). The presence of ribosomal RNA can increase background in mRNA samples, resulting in variations in mRNA concentration between samples and decreasing the efficiency of cDNA probe synthesis. Thus the presence of ribosomal RNA could have cumulatively skewed the detection and quantification of genes that were expressed in very low amounts when mRNA or total RNA was used.
Figure 7. Scatter Plot Analysis of all Array Elements. A scatter plot of the Cy™5 vs. Cy™3 values obtained for an aRNA and an mRNA array is shown. The top two panels depict the 6 clusters (obtained after K-means clustering analysis) containing all 6912 elements. A subset of elements that are distinct between the two arrays and that deviate the most in signal intensity is depicted in the lower panels.
Amplification of RNA thus provides a means of measuring expression from genes transcribed at very low levels. In many cases the RNA concentration of an experimental sample is under the optimal required amount for synthesizing labeled cDNA for microarray analysis. MessageAmp is a viable technology for increasing the yield of useful probe and can greatly lower the starting amount of RNA required to produce biologically relevant signals.
Clustering Methods Used for Analyzing Microarray Data
UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
This analysis calculates the average Euclidean distance between each point in a cluster and all the points in a neighboring cluster. The two clusters that are closest to each other (have the smallest average distance) are connected to form the higher order cluster. These data are represented as a dendrogram (tree graph) with the closest branches of the tree representing genes with similar gene expression patterns.
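The average-linkage procedure described above can be illustrated with a minimal pure-Python sketch. It is not the software used for the analysis in this article; the function names (`euclidean`, `average_distance`, `upgma`) and the four toy expression profiles are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_distance(cluster_a, cluster_b, points):
    """Mean distance over every pair with one point in each cluster."""
    total = sum(
        euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b
    )
    return total / (len(cluster_a) * len(cluster_b))


def upgma(points):
    """Repeatedly join the two closest clusters.

    Returns the merges in order as (cluster_a, cluster_b, distance),
    where each cluster is a tuple of point indices. The merge order
    is what a dendrogram would draw.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(pair[0], pair[1], points),
        )
        merges.append((a, b, average_distance(a, b, points)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical two-array expression profiles for four genes.
profiles = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
merges = upgma(profiles)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

This brute-force version recomputes every pairwise average at each step, which is fine for a handful of profiles but far too slow for thousands of array elements; a library implementation would normally be used for data of that size.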
WPGMA (Weighted Pair Group Method with Arithmetic Mean)
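WPGMA is a clustering technique used when the cluster sizes obtained (using UPGMA) are suspected to be greatly uneven. In this analysis, the distance from a newly merged cluster to any other cluster is the simple average of the distances from its two constituent clusters, regardless of how many objects each contains, so objects in small clusters carry more weight than objects in large ones.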
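The difference between the two methods is easiest to see in the rule used to update distances after clusters A and B are merged. The formulas below are the standard textbook definitions, not taken from the analysis software used here; the symbols A, B, C, d and |A| (the number of objects in A) are notation introduced for this illustration.

```latex
\begin{align}
d_{\mathrm{UPGMA}}(A \cup B,\, C) &= \frac{|A|\, d(A, C) + |B|\, d(B, C)}{|A| + |B|} \\
d_{\mathrm{WPGMA}}(A \cup B,\, C) &= \frac{d(A, C) + d(B, C)}{2}
\end{align}
```

When A and B contain the same number of objects the two rules give the same result; they diverge only when the cluster sizes are uneven.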
K-means Clustering
This method does not suffer from some of the problems associated with hierarchical clustering, such as irrelevance of gene expression data as clustering progresses or spurious results due to mistakes in assigning clusters early in the analysis (2). K-means analysis requires prior knowledge of the number of clusters represented in the data, which is used to partition the data into clusters. Each element in the array is randomly assigned to a cluster and an average expression vector is calculated for each cluster. These vectors are used to compute the average Euclidean distances between clusters. Elements are then moved between clusters and allowed to remain in the new cluster if the distance computed after reassignment is lower than the distance when assigned to the previous cluster. After reassignment the expression vector is recalculated for each cluster, and this process is carried out iteratively until any further movement of the elements would only increase the distances.
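The iteration described above can be sketched in a few lines of Python. This is a simplified illustration, not the implementation used for Figure 5: it uses the common variant in which every element is reassigned to the cluster with the nearest average expression vector, and the function names, the `labels` and `seed` parameters, the handling of empty clusters, and the toy data are all assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Average expression vector (centroid) of a cluster."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


def k_means(points, k, max_iter=100, seed=0, labels=None):
    """Return (labels, centroids, iterations_run).

    If `labels` is None, every element starts in a randomly chosen
    cluster, as in the description above; passing explicit labels
    makes the run reproducible.
    """
    rng = random.Random(seed)
    if labels is None:
        labels = [rng.randrange(k) for _ in points]
    else:
        labels = list(labels)
    for iteration in range(1, max_iter + 1):
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random point.
            centroids.append(mean_vector(members) if members else rng.choice(points))
        # Move each element to the cluster whose centroid is closest.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no element moved, so the run has converged
            break
        labels = new_labels
    return labels, centroids, iteration


# Hypothetical profiles forming two obvious groups, with a poor starting assignment.
profiles = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
labels, centroids, n_iter = k_means(profiles, k=2, labels=[0, 1, 0, 1])
print(labels, centroids, n_iter)
```

Because the result depends on the starting assignment, k-means is often run several times from different random starts and the best-scoring partition is kept.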
Analysis of Variance (ANOVA)
ANOVA measurements process the gene-by-gene fluctuations from the mean value and account for variance across samples. The p-value indicates the overlap between samples: the smaller the p-value, the more distinct the samples tested and the lower the likelihood of overlap between them.
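As a rough illustration of the calculation behind a one-way ANOVA, the sketch below computes the F statistic for a single gene measured in two groups of samples. The function name `one_way_anova_f` and the sample values are hypothetical. Converting F into a p-value requires the F distribution, which the Python standard library does not provide; a statistics package such as SciPy (`scipy.stats.f_oneway`) would normally be used for that step.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of sample groups.

    F compares the variance between group means with the variance
    of the values inside each group; a large F means the groups
    overlap little. Assumes the within-group variance is not zero.
    """
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (v - mean) ** 2
        for group, mean in zip(groups, group_means)
        for v in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical signal values for one gene on mRNA and aRNA arrays.
mrna_values = [1.0, 2.0, 3.0]
arna_values = [4.0, 5.0, 6.0]
f_stat, df_b, df_w = one_way_anova_f([mrna_values, arna_values])
print(f_stat, df_b, df_w)
```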