# miRNA Profiling Data Analysis

Microarray signal intensities detected by the scanner are quantified by detection software. The signal intensity for each spot is calculated by subtracting the local background, and the raw data are then normalized and analyzed with data analysis software. Normalization poses a considerable challenge in microRNA expression profiling experiments because some traditional normalization methods rest on assumptions that may not be valid for microRNA expression data. Three methods are typically used to normalize 2-dye expression profiling microarray experiments:

- Latin Squares/Loop Design/Dye Swap (used in Invitrogen's NCode™ profiling services; see also the data analysis recommendations)
- M vs. A, Lowess Normalization
- Quantile Normalization

## Latin Squares/Loop Design/Dye Swap

This method fits a global linear model to the data. The design attempts to minimize the number of arrays used in the experiment while still controlling for the typical sources of variation. The method was initially described by Kerr, et al. in Analysis of Variance for Gene Expression Microarray Data.

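The Loop Design fits the following model to the observed data:

$$\log(X_{ijkg}) = \mu + \alpha_i + \delta_j + \tau_k + \gamma_g + (\alpha\gamma)_{ig} + (\tau\gamma)_{kg} + \varepsilon_{ijkg}$$

where $\varepsilon_{ijkg}$ is the residual error and the remaining terms are: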

- Xijkg – The observed signal for the ith array, jth dye, kth tissue, and gth spot on the array. Here i = 1, 2, …, I, where I is the number of arrays; j = 1, 2, since only 2 dyes are used on this platform; k = 1, 2, …, K, where K is the number of tissues in the experiment (currently in NCode™ Profiler, K = I > 2); and g = 1, 2, …, G, where G is the total number of spots on the array.
- µ – This is the overall average signal observed in the experiment
- αi – This is the additional signal observed in the ith array in addition to the overall effect of µ, independent of dye, gene and tissue.
- δj – This is the additional signal observed in the jth dye in addition to the overall effect of µ, independent of arrays, gene and tissue. Note that for the NCode application j = 1 and 2, for the Alexa 3 and 5 dyes.
- τk – This is the additional signal observed in the kth tissue in addition to the overall effect of µ, independent of arrays, dye and genes.
- γg – This is the additional signal observed in the gth gene in addition to the overall effect of µ, independent of arrays, dye, and tissue.
- αγig – This is an interaction term. It represents the additional signal observed for the combination of the ith array and the gth gene, beyond the overall effect of µ, the overall effect of the ith array αi, and the overall effect of the gth gene γg, independent of dye and tissue.
- τγkg – This is an interaction term. It represents the additional signal observed for the combination of the kth tissue and the gth gene, beyond the overall effect of µ, the overall effect of the kth tissue τk, and the overall effect of the gth gene γg, independent of dye and array. This is the term of interest, since it normalizes out the effect of the dye and the array and focuses on signal that can be explained by a particular miRNA marker in a particular tissue. If there is no particular effect associated with a tissue, this term will be near 0.

This model can be fit with many different statistical packages. P-values are typically calculated by bootstrapping the residuals of the model, depending on the experimental design and the number of replicates. A proper Latin Squares model is a two-dye experiment with two tissues and only two arrays, where the second chip is the dye swap of the first. A proper Loop Design is an extension of the Latin Squares model in which the number of samples is more than two and the number of chips equals the number of samples; the chips are not dye swapped but dye shifted. For example, a 3-sample loop design would be given by:

- Chip 1 – Sample 1 – Dye 1, Sample 2 – Dye 2
- Chip 2 – Sample 2 – Dye 1, Sample 3 – Dye 2
- Chip 3 – Sample 3 – Dye 1, Sample 1 – Dye 2
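The dye-shifted layout generalizes to any number of samples greater than two. The following is a minimal sketch of that pattern, not code from NCode™ Profiler; the function name `loop_design` and the tuple layout `(chip, sample on Dye 1, sample on Dye 2)` are illustrative assumptions.

```python
def loop_design(n_samples):
    """Return one (chip, dye1_sample, dye2_sample) tuple per chip.

    Hypothetical helper: chip i carries sample i on Dye 1 and the next
    sample on Dye 2, wrapping around so the last chip closes the loop.
    """
    if n_samples < 3:
        raise ValueError("a loop design needs more than two samples")
    layout = []
    for chip in range(1, n_samples + 1):
        dye1_sample = chip
        dye2_sample = chip % n_samples + 1  # wraps n_samples back to 1
        layout.append((chip, dye1_sample, dye2_sample))
    return layout


for chip, s1, s2 in loop_design(3):
    print(f"Chip {chip}: Sample {s1} on Dye 1, Sample {s2} on Dye 2")
```

With three samples this reproduces the three-chip layout listed above. Each sample appears exactly once on each dye, which is what distinguishes the dye shift from a dye swap.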

**Assumptions**: The model performs normalization and differential marker detection at the same time. It assumes that the effects in the model are log additive and, because of confounding, that there is no gene-dye effect.

**Pros**: Many different sources of variation are normalized/controlled for in the model. Detecting differential markers falls out of the model.

**Cons**: Calculating the P-values is computationally challenging.
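To illustrate how the tissue-by-gene term falls out of the model, consider the two-array dye-swap (Latin Squares) case. Under the log-additive assumption, averaging each tissue's log signal over both arrays cancels the array, dye, and array-by-gene terms, and centering across genes removes the overall tissue effect (assuming the tissue-by-gene terms average to zero across genes). The sketch below is a simplified illustration under those assumptions; the function name, the base-2 logarithm, and the input layout are hypothetical, and no P-values are computed.

```python
import math


def dye_swap_contrasts(array1, array2):
    """Estimate the tissue 1 minus tissue 2 tissue-by-gene contrast per gene.

    array1: list of (dye1_signal, dye2_signal) per gene,
            with tissue 1 on Dye 1 and tissue 2 on Dye 2.
    array2: same genes in the same order, dyes swapped
            (tissue 2 on Dye 1, tissue 1 on Dye 2).
    """
    diffs = []
    for (a1_d1, a1_d2), (a2_d1, a2_d2) in zip(array1, array2):
        # Each tissue is seen once per array and once per dye, so array,
        # dye, and array-by-gene effects cancel in the difference.
        tissue1 = 0.5 * (math.log2(a1_d1) + math.log2(a2_d2))
        tissue2 = 0.5 * (math.log2(a1_d2) + math.log2(a2_d1))
        diffs.append(tissue1 - tissue2)
    # The mean difference across genes estimates the overall tissue effect.
    overall = sum(diffs) / len(diffs)
    return [d - overall for d in diffs]


# Made-up background-subtracted signals for three spots.
array1 = [(1200.0, 400.0), (800.0, 790.0), (300.0, 650.0)]
array2 = [(450.0, 1100.0), (760.0, 820.0), (700.0, 280.0)]
for g, c in enumerate(dye_swap_contrasts(array1, array2), start=1):
    print(f"spot {g}: contrast = {c:+.3f}")
```

Values near 0 indicate no tissue-specific effect for that spot. Deciding which contrasts are significant would still require the residual bootstrap described above.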

## M vs. A, Lowess Normalization

This method is typically used for single dye array systems, but can be adapted for 2 dye systems. It seeks to normalize the data by assuming that, typically, there should be no differential expression on the chip. It is also worth noting that this is a within-chip normalization. Specifically, this method calculates the log ratio of the two signals for each miRNA as well as the log product of the two signals for each miRNA. The log product (x-axis) is then plotted against the log ratio (y-axis); this is typically called an M versus A plot.
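The two quantities can be computed as in the following minimal sketch. It assumes the common convention M = log2(R) − log2(G) and A = ½·(log2(R) + log2(G)), that is, A is half the log product; the base-2 logarithm, the factor of ½, and the function name `ma_values` are assumptions not stated in the text.

```python
import math


def ma_values(red, green):
    """Return (M, A) lists for paired, background-subtracted signals.

    M is the log ratio and A is half the log product (the mean log signal).
    Signals must be positive for the logarithm to be defined.
    """
    if len(red) != len(green):
        raise ValueError("both channels must have the same number of spots")
    if any(v <= 0 for v in list(red) + list(green)):
        raise ValueError("signals must be positive after background subtraction")
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a


m, a = ma_values([4.0, 16.0], [4.0, 4.0])
print(m, a)  # M = [0.0, 2.0], A = [2.0, 3.0]
```

Plotting `a` on the x-axis against `m` on the y-axis gives the M versus A plot.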

This method was initially described by Dudoit S, et al. in Statistical methods for identifying genes with differential expression in replicated cDNA microarray experiments. It was also described by Yang YH, et al. in Normalization for Two-color cDNA Microarray Data (Science and Statistics: A Festschrift for Terry Speed, Monograph Series) and by Yang YH, et al. in Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation.

The assumption here is that the trend of this plot should be y = 0. To test this, a lowess (LOcally WEighted Scatterplot Smoother) curve is fit to the plot. Lowess is a non-linear smoother that estimates the local trend in the vertical direction (log ratio) as a function of position in the horizontal direction (log product). If the lowess curve does not fall on y = 0, the data are normalized against it: for each log ratio, the corresponding lowess value is subtracted, and the result is the normalized value. This is done separately for each chip.
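The subtraction step can be sketched as follows. This is a simplified lowess, assuming a tricube-weighted local linear fit with no robustness iterations, so it will not match the output of a full statistical package exactly. The function names and the `frac` parameter (the fraction of spots used in each local fit) are illustrative assumptions.

```python
import math


def lowess_trend(a, m, frac=0.5):
    """Simplified lowess: local linear fit of m on a at every point.

    Uses tricube weights over the nearest frac * n points and omits the
    robustness iterations of the full lowess procedure.
    """
    n = len(a)
    k = min(n, max(2, math.ceil(frac * n)))  # points per neighbourhood
    trend = []
    for a0 in a:
        dist = [abs(ai - a0) for ai in a]
        h = sorted(dist)[k - 1]  # distance to the k-th nearest point
        if h > 0:
            w = [(1 - (d / h) ** 3) ** 3 if d < h else 0.0 for d in dist]
        else:
            w = [1.0 if d == 0 else 0.0 for d in dist]
        sw = sum(w)
        a_bar = sum(wi * ai for wi, ai in zip(w, a)) / sw
        m_bar = sum(wi * mi for wi, mi in zip(w, m)) / sw
        sxx = sum(wi * (ai - a_bar) ** 2 for wi, ai in zip(w, a))
        sxy = sum(wi * (ai - a_bar) * (mi - m_bar)
                  for wi, ai, mi in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        trend.append(m_bar + slope * (a0 - a_bar))
    return trend


def lowess_normalize(a, m, frac=0.5):
    """Subtract the lowess trend from each log ratio (one chip at a time)."""
    return [mi - ti for mi, ti in zip(m, lowess_trend(a, m, frac))]


# Made-up example: log ratios that drift upward with intensity.
a = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
m = [-0.4, -0.2, -0.3, 0.0, 0.1, 0.3, 0.2, 0.5]
print([round(v, 3) for v in lowess_normalize(a, m)])
```

A smaller `frac` follows the data more closely, while a larger one gives a smoother curve. With diffuse miRNA data the choice can change the result noticeably, so it is worth inspecting the fitted curve on the M versus A plot.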

**Assumptions**: Across the chip, on average, there should be no change in signal between the two channels.

**Pros**: A great method for before-after designs, where the normalized data have a natural interpretation. In arrays with lots of content, the assumption is almost always trivially true.

**Cons**: Lowess is non-trivial to apply to data sets. There can be experimental design issues. If the array has a relatively small number of probes or directed content, the assumption can fail.

**Note**: Typical mRNA data have readily discernible patterns in M vs. A plots, which makes fitting a model fairly easy. miRNA data are much more diffuse in an M vs. A plot, which makes model fitting much more difficult.

## Quantile Normalization

This is a global normalization method whose main goal is to force the histogram of every chip and channel to look the same. The actual value for any particular miRNA may be different, depending on the rank of its signal within the chip/dye combination. This method was initially described by Irizarry RA et al. as part of a larger analysis method in Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Specifically, to get a “global distribution of signal”, the largest value on every chip is replaced by the median of the largest values across all chips, the second largest value on every chip is replaced by the median of the second largest values across all chips, and so on. Once this has been done for every spot, the result is the normalized data.
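The rank-replacement procedure can be sketched as follows. This follows the description above and uses the median at each rank; some implementations use the mean of each rank instead. The function name, the list-of-lists input (one list per chip/dye combination, all of equal length), and the arbitrary handling of ties are assumptions of this sketch.

```python
from statistics import median


def quantile_normalize(chips):
    """Quantile-normalize a list of equal-length signal lists.

    The value at each rank on every chip is replaced by the median of
    the values at that rank across all chips. Ties keep their original
    order, which is an arbitrary choice made for simplicity.
    """
    n = len(chips[0])
    if any(len(chip) != n for chip in chips):
        raise ValueError("all chips must have the same number of spots")
    # Reference distribution: median of the sorted values, rank by rank.
    sorted_chips = [sorted(chip) for chip in chips]
    reference = [median(rank_values) for rank_values in zip(*sorted_chips)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])  # spot indices by rank
        out = [0.0] * n
        for rank, spot in enumerate(order):
            out[spot] = reference[rank]
        normalized.append(out)
    return normalized


chips = [[5, 2, 3], [4, 1, 6], [3, 4, 8]]
print(quantile_normalize(chips))  # [[6, 2, 4], [4, 2, 6], [2, 4, 6]]
```

After normalization every chip contains the same set of values (here 2, 4, and 6), so the histograms are identical, while the position of each value still reflects the rank of the original signal on its chip.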

**Assumption**: The histogram of signals should be the same for every chip within an experiment.

**Pros**: Has been shown in the literature to be a superior method of normalization, even when assumptions are violated. Estimated directly from the data.

**Cons**: Difficult to apply without statistical software; for example, trying to do this in Excel can be a very painful exercise.

## Interesting Articles on the Statistical Methods Used in Calculating p-values:

Efron B and Tibshirani R (1986) Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical Science 1: 54-77.

Hastie TJ and Tibshirani RJ (1990) Generalized Additive Models. Chapman and Hall, London.

Altman NS and Hua J (2006) Extending the Loop Design for Two-Channel Microarray Experiments. Genetical Research 88(3): 153-63.

## References Cited:

Irizarry RA et al. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data.

Kerr, et al. (2000) Analysis of Variance for Gene Expression Microarray Data. Journal of Computational Biology 7: 819-837.

Dudoit S, et al. Statistical methods for identifying genes with differential expression in replicated cDNA microarray experiments. Statistica Sinica 12(1): 111-139.

Yang YH, et al. (2002) Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 30(4): e15.

Yang YH, et al. (2003) Normalization for Two-color cDNA Microarray Data. Science and Statistics: A Festschrift for Terry Speed, Monograph Series, Volume 40. Edited by Goldstein DR. IMS Lecture Notes: 403-418.

**Questions**

Send your questions to our top researchers at EpiScientist@invitrogen.com.