This article covers the basics of spectral flow cytometry data analysis, including data generation, visualization, and data mining. Data analysis is the last step in the spectral flow cytometry experimental process (Figure 1).
Spectral flow cytometry increases the number of parameters measured at the single-cell level, expanding the complexity and amount of information obtained and thereby transforming the study of cellular, functional, and phenotypic diversity. Traditionally, flow cytometry data has been analyzed using a hierarchy of two-dimensional plots, relying on the manual selection and identification of defined cell populations. As more parameters are measured simultaneously on greater numbers of individual cells, manual gating and data evaluation become subjective and time-consuming, and risk overlooking meaningful but undefined populations and/or cellular relationships. Data sets typically generated in spectral flow cytometry benefit from computational techniques, such as dimensionality reduction and/or clustering algorithms, to analyze, visualize, and interpret the high-dimensional data. Many analysis approaches are available to researchers using flow cytometry [4,5]. A basic sequence of steps for analyzing data with computational flow cytometry tools includes data generation, data cleaning, data visualization, and analysis.
Data generation begins with experimental planning: using standardized procedures and appropriate controls, ensuring that instrumentation is fully functional, and determining the sample size and the method for data analysis. Data cleaning identifies and removes debris, dead or dying cells, cellular aggregates, and events acquired during periods of fluidic or sample inconsistency, which are identified using a time parameter. Data cleaning also includes pre-gating, which provides preliminary cell population identification and a preliminary check of unmixing.
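The cleaning gates described above can be expressed as boolean masks over the event table. The following is a minimal sketch, not a validated gating strategy: the channel names ("FSC-A", "FSC-H", "Viability", "Time"), the function name, and every threshold are hypothetical and would need to be set for the instrument and panel in use.

```python
import numpy as np

def clean_events(events, fsc_min=20_000.0, singlet_tol=0.15,
                 viability_max=1_000.0, time_window=(5.0, 95.0)):
    """Return a boolean mask of events that pass all cleaning gates.

    `events` maps hypothetical channel names to equal-length 1D arrays.
    All thresholds are illustrative assumptions, not recommended values.
    """
    fsc_a = np.asarray(events["FSC-A"], dtype=float)
    fsc_h = np.asarray(events["FSC-H"], dtype=float)
    viability = np.asarray(events["Viability"], dtype=float)
    time = np.asarray(events["Time"], dtype=float)

    # Debris: events with very low forward scatter.
    not_debris = fsc_a >= fsc_min

    # Aggregates: area is inflated relative to height, so the area/height
    # ratio drifts away from the ratio typical of single cells.
    ratio = fsc_a / np.maximum(fsc_h, 1e-9)
    typical = np.median(ratio[not_debris])
    singlet = np.abs(ratio - typical) <= singlet_tol * typical

    # Dead or dying cells: bright in the viability dye channel.
    live = viability <= viability_max

    # Fluidic or sample inconsistency: keep only a stable time window.
    stable = (time >= time_window[0]) & (time <= time_window[1])

    return not_debris & singlet & live & stable

# Toy data: two clean events, then debris, a doublet, a dead cell,
# and an event acquired outside the stable time window.
toy = {
    "FSC-A": [50_000, 60_000, 5_000, 100_000, 55_000, 52_000],
    "FSC-H": [50_000, 60_000, 5_000, 50_000, 55_000, 52_000],
    "Viability": [100, 200, 100, 100, 5_000, 100],
    "Time": [10, 20, 30, 40, 50, 99],
}
mask = clean_events(toy)
print(mask.tolist())
```

Keeping each gate as its own mask makes it easy to report how many events each step removes before the masks are combined.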
Visualization of data can provide an overview of the cell populations present, reveal unexpected populations that might otherwise be overlooked, and, in many cases, help confirm basic assumptions about the data. For example, a t-distributed stochastic neighbor embedding (t-SNE) plot can help validate unmixing performance by confirming that cells with similar marker expression are positioned near one another in the 2D embedding. t-SNE is a relatively recent dimensionality reduction technique in which each data point is given a location in a two- or three-dimensional map. A researcher may highlight a subset of CD4-positive T cells, such as Tregs, and confirm that the t-SNE embedding has placed the Tregs within the broader region occupied by the CD4-positive T cells. Once these steps have been completed, an appropriate analysis technique can be chosen and applied to the data. If the data from the experiment is used in a published, peer-reviewed article, it can be uploaded to a public flow cytometry data repository, allowing access, review, annotation, and analysis of the flow cytometry data sets.
Data mining methods that automatically learn a model from examples are called machine learning techniques. With adequate training data and proper implementation, machine learning techniques can produce accurate and useful models that generalize beyond the training data. Researchers can then use such a model with confidence to make inferences about new data. Machine learning techniques are often classified into two categories: supervised learning and unsupervised learning. The main difference is that supervised learning uses labeled data to help predict outcomes, while unsupervised learning does not. Supervised learning methods require labeled training data, from which a model learns a mapping from input to output. These data sets are designed to train, or supervise, algorithms to classify data or predict outcomes. Unsupervised learning methods take a data set that contains only inputs, with no labeled outputs, and find structure in the data by grouping or clustering. These algorithms learn from data that has not been labeled, classified, or categorized. Unsupervised algorithms represent most of the current development for analysis of high-dimensional flow cytometry data sets. A goal of unsupervised learning methods in flow cytometry is to correctly identify and quantify cell populations. Examples of unsupervised learning techniques commonly used in flow cytometry data analysis are dimensionality reduction and clustering analysis (Figure 2).
Figure 2. Representations of data analysis visualizations. (A) Dimensionality reduction and (B) clustering techniques used for analyzing flow cytometry data.
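The distinction between supervised and unsupervised learning described above can be illustrated on synthetic two-marker data. This is a toy sketch under stated assumptions: the data are simulated, a nearest-centroid classifier stands in for supervised methods, and a plain k-means loop stands in for unsupervised clustering. Neither is the algorithm used by any particular flow cytometry tool, and all function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated events: two well-separated populations measured on two markers.
pop_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
pop_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(200, 2))
X = np.vstack([pop_a, pop_b])
y = np.array([0] * 200 + [1] * 200)  # labels, used only by the supervised model

def nearest(X, centroids):
    """Index of the closest centroid for every event."""
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)

def fit_nearest_centroid(X, y):
    """Supervised: learn one centroid per labeled class."""
    labels = np.unique(y)
    centroids = np.stack([X[y == lab].mean(axis=0) for lab in labels])
    return labels, centroids

def predict_nearest_centroid(model, X):
    labels, centroids = model
    return labels[nearest(X, centroids)]

def kmeans(X, k, n_iter=50, seed=0):
    """Unsupervised: group events without ever seeing labels."""
    local_rng = np.random.default_rng(seed)
    centroids = X[local_rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        assign = nearest(X, centroids)
        updated = np.stack([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return nearest(X, centroids), centroids

# Supervised: train on even-indexed events, evaluate on odd-indexed events.
model = fit_nearest_centroid(X[0::2], y[0::2])
accuracy = float(np.mean(predict_nearest_centroid(model, X[1::2]) == y[1::2]))

# Unsupervised: cluster numbering is arbitrary, so compare up to a label swap.
clusters, _ = kmeans(X, k=2)
agreement = float(max(np.mean(clusters == y), np.mean(clusters != y)))

print(f"supervised accuracy: {accuracy:.3f}, cluster agreement: {agreement:.3f}")
```

The supervised model needs `y` to be fit at all, whereas the clustering step recovers the same two groups from `X` alone, with cluster numbers that carry no inherent meaning.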
In dimensionality reduction techniques, the goal is to visualize all data points in a lower-dimensional space while preserving the main structure of the data. Principal component analysis (PCA) is a traditional dimensionality reduction technique that condenses data to its principal components. PCA is an established method commonly used to visualize relationships in multidimensional data with single-cell resolution. t-SNE aims to find a lower-dimensional representation that preserves the similarities between data points in the original high-dimensional space (Figure 3). Some dimensionality reduction techniques have long run times, and it may be necessary to analyze only a subsample of the data (called down-sampling). Uniform manifold approximation and projection (UMAP) is another dimensionality reduction technique that can be used for visualization in a similar way to t-SNE, generally with faster processing and a somewhat different visual layout.
Figure 3. t-SNE projection of 45-color spectral flow cytometry experimental data. (A) t-SNE analysis projection of live human PBMC populations with (B) population coloring defined.
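As a concrete illustration of down-sampling followed by dimensionality reduction, the sketch below draws a random subsample of events and projects it onto its first two principal components using a singular value decomposition. It assumes simulated data and uses PCA because it can be written in a few lines of NumPy; t-SNE and UMAP would require dedicated libraries. The function names and the data dimensions are illustrative assumptions.

```python
import numpy as np

def downsample(X, n, seed=0):
    """Randomly keep at most n events (rows), without replacement."""
    if n >= len(X):
        return X
    rng = np.random.default_rng(seed)
    return X[rng.choice(len(X), size=n, replace=False)]

def pca(X, n_components=2):
    """Project X onto its leading principal components via SVD.

    Returns the component scores and the fraction of total variance
    explained by each returned component.
    """
    centered = X - X.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return scores, explained[:n_components]

# Simulated data: 5,000 events and 10 parameters driven by 2 latent factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(5_000, 2)) * np.array([3.0, 2.0])
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(5_000, 10))

subsample = downsample(X, n=1_000)
scores, explained = pca(subsample, n_components=2)
print(scores.shape, explained.round(3))
```

A simple random subsample preserves relative population frequencies but can drop rare populations entirely, which is one reason density-dependent down-sampling schemes exist.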
Automated cluster analysis techniques first find groups of similar objects, assigning cells with similar marker profiles to the same cluster, and then visualize the clusters in two dimensions. Spanning-tree progression analysis of density-normalized events (SPADE) is a program that combines down-sampling, clustering, and a minimum spanning tree to visualize high-dimensional single-cell data. SPADE is useful for finding fold differences in expression levels between samples, although single-cell resolution is lost. A self-organizing map (SOM) is an unsupervised technique for clustering and dimensionality reduction in which a discrete representation of the input is trained. FlowSOM clusters cells with a self-organizing map and visualizes the resulting data subsets as a minimum spanning tree (MST). PhenoGraph is a recently developed algorithm that models high-dimensional space as a graph in which each cell is a node connected to its neighbors. Phenotypically similar clusters of cells are represented as sets of interconnected nodes. Unsupervised clustering and dimensionality reduction may also be combined by first running a dimensionality reduction algorithm and using the results as the input for a clustering algorithm [8,9].
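The minimum spanning tree step used for visualization by tools such as SPADE and FlowSOM can be sketched independently of the clustering method. The example below assumes cluster labels have already been assigned by some algorithm, summarizes each cluster by its median marker profile, and connects the profiles with Prim's algorithm. It is a simplified illustration, not the implementation used by either tool, and the function names are hypothetical.

```python
import numpy as np

def cluster_profiles(X, labels):
    """Median marker profile of each cluster, in sorted cluster-id order."""
    ids = np.unique(labels)
    profiles = np.stack([np.median(X[labels == c], axis=0) for c in ids])
    return ids, profiles

def minimum_spanning_tree(points):
    """Prim's algorithm on Euclidean distances between rows of `points`.

    Returns a list of (from_index, to_index, distance) edges.
    """
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True                 # start the tree at the first cluster
    best_dist = dist[0].copy()        # cheapest known link into the tree
    best_from = np.zeros(n, dtype=int)
    edges = []
    for _ in range(n - 1):
        # Pick the closest node that is not yet in the tree.
        candidates = np.where(in_tree, np.inf, best_dist)
        j = int(candidates.argmin())
        edges.append((int(best_from[j]), j, float(best_dist[j])))
        in_tree[j] = True
        # The new node may offer cheaper links to the remaining nodes.
        closer = dist[j] < best_dist
        best_dist = np.where(closer, dist[j], best_dist)
        best_from = np.where(closer, j, best_from)
    return edges

# Toy data: four clusters of three events each, measured on two markers.
X = np.array([
    [-0.1, 0.0], [0.0, 0.0], [0.1, 0.0],
    [0.9, 0.0], [1.0, 0.0], [1.1, 0.0],
    [2.9, 0.0], [3.0, 0.0], [3.1, 0.0],
    [6.9, 0.0], [7.0, 0.0], [7.1, 0.0],
])
labels = np.repeat([0, 1, 2, 3], 3)

ids, profiles = cluster_profiles(X, labels)
edges = minimum_spanning_tree(profiles)
for a, b, d in edges:
    print(f"cluster {ids[a]} -- cluster {ids[b]} (distance {d:.1f})")
```

Because the tree is built on cluster summaries rather than individual events, it stays small and readable, at the cost of the single-cell resolution noted above.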
Computational approaches are powerful tools for exploring high-dimensional flow cytometry data. Each algorithmic tool has strengths and challenges, and some are designed for a specific purpose. New tools are evolving to meet the developing needs of researchers. Understanding the functionality of different algorithms is important in selecting the optimal tool to help answer the research question. While manual gating will continue to allow testing of basic hypotheses and evaluation of data quality, the use of computational analysis of complex flow cytometry data sets has the potential to deepen our understanding of the immune system and provide insights into the complexity of biological systems.
For Research Use Only. Not for use in diagnostic procedures.