Metabolomics Data Analysis

Turning data into knowledge

Like the other "omics" technologies, metabolomics analysis generates large datasets. These data may contain many experimental artifacts, so sophisticated software is required for efficient, high-throughput analysis that provides the statistical power to eliminate systematic bias, confidently identify compounds and explore significant findings.

Metabolomics data analysis usually consists of feature extraction, compound identification, statistical analysis and interpretation. Data analysis is a significant part of the metabolomics workflow, with compound identification being the major bottleneck. This overview reviews the challenges of data analysis for metabolomics and the strategies currently used to address them.

Once data acquisition is complete, spectral data pre-processing occurs through the following steps:

Baseline correction removes low-frequency artifacts and differences between samples that are generated by experimental and instrumental variation.
Spectral alignment can happen before or after feature/compound extraction. It is one of the main processing steps in metabolomics studies involving multiple samples, where chromatographic retention time can vary from run to run.
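
As a rough illustration of baseline correction, the sketch below estimates a slowly varying baseline as a rolling minimum and subtracts it from the signal. The rolling-minimum estimator, the function names and the toy trace are assumptions chosen for brevity; production software typically uses more robust estimators.

```python
def rolling_min_baseline(intensities, half_window):
    """Estimate the baseline at each point as the minimum in a sliding window."""
    n = len(intensities)
    baseline = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        baseline.append(min(intensities[lo:hi]))
    return baseline


def subtract_baseline(intensities, half_window=5):
    """Return the signal with the rolling-minimum baseline removed."""
    baseline = rolling_min_baseline(intensities, half_window)
    return [y - b for y, b in zip(intensities, baseline)]


# Toy trace (illustrative values): linear drift of 0.5 per scan plus one peak.
drift = [0.5 * i for i in range(20)]
peak = [0.0] * 20
peak[9], peak[10], peak[11] = 20.0, 50.0, 20.0
trace = [d + p for d, p in zip(drift, peak)]

corrected = subtract_baseline(trace, half_window=3)
```

The window must be wider than the widest real peak, otherwise the peak itself is treated as baseline and flattened.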

Feature extraction

This step involves finding and quantifying all known and unknown metabolites and extracting the relevant spectral and chromatographic information for each. Peak-based algorithms are the method of choice for MS-based studies, and peaks are detected across the entire spectrum.

Once detected, related ions indicative of a single-component chromatographic peak (adducts, multiply charged ions) are identified and grouped.

Their areas are then integrated to provide a quantification of the underlying metabolite.
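
The peak detection and integration steps can be sketched as follows. This is a minimal example, assuming a baseline-corrected chromatogram, a simple local-maximum detector and trapezoidal integration; the function names, threshold and data are hypothetical, and real peak pickers also handle noise, overlapping peaks and peak shape.

```python
def find_peaks(intensities, min_height):
    """Return indices of local maxima at or above min_height."""
    apexes = []
    for i in range(1, len(intensities) - 1):
        y = intensities[i]
        if y >= min_height and y > intensities[i - 1] and y >= intensities[i + 1]:
            apexes.append(i)
    return apexes


def peak_bounds(intensities, apex, floor=0.0):
    """Walk outward from the apex to the first point at or below the floor."""
    left = apex
    while left > 0 and intensities[left] > floor:
        left -= 1
    right = apex
    while right < len(intensities) - 1 and intensities[right] > floor:
        right += 1
    return left, right


def trapezoid_area(times, intensities, left, right):
    """Integrate the signal between two indices with the trapezoidal rule."""
    area = 0.0
    for i in range(left, right):
        width = times[i + 1] - times[i]
        area += 0.5 * (intensities[i] + intensities[i + 1]) * width
    return area


# Toy chromatogram (illustrative values): one peak centered at t = 4.
times = [float(t) for t in range(11)]
signal = [0.0, 0.0, 1.0, 4.0, 9.0, 4.0, 1.0, 0.0, 0.0, 0.0, 0.0]

apexes = find_peaks(signal, min_height=5.0)
left, right = peak_bounds(signal, apexes[0])
area = trapezoid_area(times, signal, left, right)
```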

Metabolite identification

Compound or metabolite identification is one of the major challenges of untargeted metabolomics research. However, this step must be performed in order to infer any biological or scientific meaning from a novel spectral peak.

When choosing among MS reference databases, MS/MS spectral libraries and the many other commercial and open-source databases, several factors influence the selection of available resources:

The number and types of compounds.
The nominal or accurate mass data.
The quality and curation of data.
The ability to process data batches.
The ability to customize databases/libraries.

MS Database Searching

When dealing with high-resolution accurate mass data (full scan MS), it is fairly common to compare the neutral molecular mass (derived from the m/z value) against MS databases such as METLIN, mzCloud, etc. This approach provides compound candidates, but it lacks sufficient specificity for identity confirmation.

This is why isotope pattern matching is used to confirm the empirical formula. If retention time information is also included, confident compound identification can be achieved.

Such an approach works well with data acquired from either LC- or IC-MS analysis, where the molecular ion is left intact during full scan MS. With GC-MS using electron impact (EI) or chemical ionization, the molecular ion is typically fragmented, so additional approaches are required to achieve full compound identification.
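
The accurate-mass lookup described in this section can be sketched as below. This is an illustration only: it assumes a singly protonated ion ([M+H]+), a 5 ppm tolerance and a tiny hypothetical in-memory database, and it does not use the METLIN or mzCloud interfaces.

```python
# Monoisotopic masses of the most abundant isotopes, in Da.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}
PROTON_MASS = 1.00727647  # mass of H+ in Da


def monoisotopic_mass(formula):
    """Sum isotope masses for a formula given as {element: count}."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())


def neutral_mass_from_mz(mz, charge=1):
    """Neutral mass of a protonated ion [M+zH]z+ (an assumed adduct type)."""
    return (mz - PROTON_MASS) * charge


def search(neutral_mass, database, tol_ppm=5.0):
    """Return (name, ppm error) for entries within the mass tolerance."""
    hits = []
    for name, formula in database.items():
        theoretical = monoisotopic_mass(formula)
        error_ppm = (neutral_mass - theoretical) / theoretical * 1e6
        if abs(error_ppm) <= tol_ppm:
            hits.append((name, round(error_ppm, 2)))
    return sorted(hits)


# Hypothetical toy database; glucose and fructose share the formula C6H12O6.
TOY_DATABASE = {
    "glucose": {"C": 6, "H": 12, "O": 6},
    "fructose": {"C": 6, "H": 12, "O": 6},
    "caffeine": {"C": 8, "H": 10, "N": 4, "O": 2},
    "citric acid": {"C": 6, "H": 8, "O": 7},
}

candidates = search(neutral_mass_from_mz(181.0707), TOY_DATABASE)
candidate_names = [name for name, _ in candidates]
```

Both isomers come back as candidates for the same observed m/z, which is why a mass match alone cannot confirm identity and why isotope pattern and retention time are added.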

MS/MS Spectral Library Matching

Fragmented molecular ions can be compared against MS/MS spectral libraries or EI libraries to generate more confident identification results. Combining retention time information with MS/MS library or EI library searching provides the highest level of confidence. The quality of the data in these libraries is critical for confident identification, as is the number of metabolite spectra they contain. Today, there are libraries that contain spectral data beyond just that of MS/MS. As data are continuously added to and curated within these spectral libraries, routine peak identification will improve.
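
One common way to score a query spectrum against library entries is cosine similarity on binned spectra. The sketch below assumes that scoring scheme, a fixed bin width, and a hypothetical two-entry library with made-up fragment lists; commercial and open-source library search tools use their own, often more elaborate, scoring.

```python
import math


def bin_spectrum(peaks, bin_width=0.01):
    """Map (m/z, intensity) pairs onto integer bins, summing intensity per bin."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(query, reference, bin_width=0.01):
    """Cosine of the angle between two binned spectra (0 = no shared peaks)."""
    q = bin_spectrum(query, bin_width)
    r = bin_spectrum(reference, bin_width)
    dot = sum(q[k] * r[k] for k in q.keys() & r.keys())
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_r = math.sqrt(sum(v * v for v in r.values()))
    if norm_q == 0.0 or norm_r == 0.0:
        return 0.0
    return dot / (norm_q * norm_r)


# Hypothetical spectra: lists of (fragment m/z, relative intensity).
query = [(85.03, 40.0), (127.04, 100.0), (145.05, 60.0)]
library = {
    "compound A": [(85.03, 35.0), (127.04, 100.0), (145.05, 70.0)],
    "compound B": [(91.05, 100.0), (119.05, 50.0)],
}

scores = {name: cosine_similarity(query, ref) for name, ref in library.items()}
best_match = max(scores, key=scores.get)
```

Fixed-width binning can split a peak that falls near a bin edge, so real implementations usually match peaks within an m/z tolerance instead.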

Mass Spectral Interpretation

If the metabolite or compound is not identified using the above approaches, it is possible to perform more in-depth mass spectrometry analysis, using MSⁿ and several dissociation techniques to obtain multiple fragmentation patterns. The compound fragmentation spectra are then interpreted to propose a rational structure. This is a time-consuming process.

Two approaches exist:

De novo interpretation. Without using any prior knowledge, a chemical structure is reconstructed based on its fragmentation data.

Structure correlation. MS/MS spectra are correlated with a list of searched database structures using their calculated molecular formulae.

Metabolomics statistical analysis

Metabolomics samples are typically complex and there are many interactions between metabolites and biological states. To uncover significant differences, univariate and multivariate statistical analyses (chemometric methods) use the abundance relationships between the different metabolomics components. Visualization tools to interact more productively with the data are also an integral part of this process.

1) Univariate methods (the most common statistical approach) analyze metabolomics features separately. Their main advantage is ease of use and interpretation. Several univariate methods are available for metabolomics. When assessing differences between two or more groups, parametric tests such as Student's t-test (for two groups) and ANOVA (analysis of variance) are commonly used, with box-and-whisker plots to visualize the group distributions.

The disadvantage is that this approach does not take into account interactions between the different metabolic features (correlations between metabolites from the same pathway, or metadata such as diet, gender, etc.), which increases the probability of obtaining false positive or false negative results.
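
A feature-by-feature comparison of two groups can be sketched as below. This is a minimal example of the two-sample Student's t statistic with pooled variance; the feature names and abundances are invented, and converting t to a p-value requires the t distribution, which a statistics library such as SciPy (`scipy.stats.ttest_ind`) provides.

```python
import math
from statistics import mean, variance


def student_t(group_a, group_b):
    """Two-sample Student's t statistic with pooled variance; returns (t, df)."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, df


# Hypothetical abundances per feature: (control samples, treated samples).
features = {
    "feature_1": ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]),
    "feature_2": ([10.0, 12.0, 11.0], [11.0, 10.0, 12.0]),
}

# Each feature is tested on its own, ignoring every other feature.
results = {name: student_t(ctrl, trt) for name, (ctrl, trt) in features.items()}
```

Because every feature is tested independently, a dataset with thousands of features produces thousands of tests, which is one route to the false positives mentioned above.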

2) Multivariate methods analyze metabolomics features simultaneously and can identify patterns of relationships between them. There are two groups of pattern-recognition methods: unsupervised and supervised.

Unsupervised methods are an effective way to detect patterns that are correlated with experimental or biological variables. Similarity patterns within the data are identified without taking into account the type or class of the study samples. Principal component analysis (PCA) is a common example.
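
To show what PCA does with a samples-by-features table, the sketch below finds the first principal component by power iteration on the covariance matrix. The pure-Python implementation, the fixed iteration count and the toy intensities are assumptions made to keep the example self-contained; in practice a numerical library would compute all components at once.

```python
import math


def center_columns(rows):
    """Subtract each feature's mean so the data are centered at zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance_matrix(centered):
    """Sample covariance between every pair of features."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_component(cov, iterations=200):
    """Power iteration: converges to the direction of largest variance.

    Assumes the start vector is not orthogonal to that direction.
    """
    p = len(cov)
    vec = [1.0 / (i + 1) for i in range(p)]
    for _ in range(iterations):
        new = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    return vec


# Toy table (illustrative): 6 samples x 3 features; no class labels are used.
samples = [
    [1.0, 2.0, 0.5], [1.2, 2.4, 0.4], [0.9, 1.8, 0.6],
    [3.0, 6.0, 0.5], [3.1, 6.2, 0.6], [2.9, 5.8, 0.4],
]

centered = center_columns(samples)
pc1 = first_component(covariance_matrix(centered))
pc1_scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]
```

The first three and last three samples end up on opposite sides of zero along the first component, even though the method was never told which samples belong together.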

Supervised methods take sample labels into account to identify features that are associated with a phenotype of interest, and they down-weight variance that is unrelated to that phenotype. These methods are also the basis for building prediction models. Partial least squares (PLS) is one of the most widely used supervised methods in metabolomics.

Figure: PLS-DA model of the decomposition data. PLS-DA is a supervised multivariate analysis that collapses high-dimensional data (e.g. a large number of metabolites with varying intensities) to principal components that encompass the majority of variance in the dataset. In this case the X axis is principal component 1 and the Y axis is principal component 2. Note that the samples cluster appropriately: each group clusters together and T0 is distinctly separated from the other groups.
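
The way a supervised method uses sample labels can be illustrated with the first component of a PLS model for a single 0/1 class label. This is a sketch under stated assumptions: it computes only the first weight vector (proportional to the covariance of each centered feature with the centered label), the data are invented, and it is not the model behind the figure.

```python
import math


def center(values):
    """Subtract the mean from a list of numbers."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def pls_first_component(rows, labels):
    """First PLS weight vector and sample scores for a single response.

    Features that covary with the label get large weights; features that
    vary a lot but are unrelated to the label get weights near zero.
    """
    columns = [center(list(col)) for col in zip(*rows)]
    y = center(labels)
    weights = [sum(x * yi for x, yi in zip(col, y)) for col in columns]
    norm = math.sqrt(sum(w * w for w in weights))
    weights = [w / norm for w in weights]
    scores = [sum(col[i] * w for col, w in zip(columns, weights))
              for i in range(len(rows))]
    return weights, scores


# Toy data (illustrative): feature 0 differs between classes; feature 1 has
# much larger variance but is unrelated to the class label.
X = [
    [1.0, 10.0], [1.2, -10.0], [0.8, 9.0], [1.0, -9.0],   # class 0
    [2.0, 10.0], [2.2, -10.0], [1.8, 9.0], [2.0, -9.0],   # class 1
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

weights, scores = pls_first_component(X, y)
```

Here nearly all the weight falls on feature 0, whereas an unsupervised variance-based method would be dominated by the noisy feature 1.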
