Raman spectroscopy has seen increasing use in Process Analytical Technology (PAT) over the past two decades. It is particularly well suited to measuring the progress of cell line growth in bioreactors, as it provides near-instantaneous readings of important metabolites, is non-destructive, and requires minimal maintenance. However, to be useful, raw Raman spectra must be translated into actionable metrics such as concentrations and substance identities.
For pure components and simple mixtures, the intensity of a single peak or a combination of several peaks can be used to estimate metabolite identities and concentrations, just as undergraduate students are taught to quantify chemicals in their analytical chemistry classes. However, bioreactors are extraordinarily complex systems: many metabolites, proteins, and amino acids interact with one another, cells change developmentally and die, and the cells interact with the cell culture media. As a result, Raman peaks in a bioreactor often overlap, and the baseline of the entire spectrum can vary over the course of the bioreactor's run due to fluorescence, so individual peaks can no longer be easily deciphered.
That is where chemometrics comes in.
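Before moving on, the simple single-peak approach mentioned above can be sketched in a few lines. This is only an illustration: the calibration standards, peak heights, and function names are hypothetical, and the data are made perfectly linear so the arithmetic is easy to follow.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical calibration standards: known concentration (g/L) and the
# height of one isolated Raman peak (arbitrary units).
standard_conc = [0.0, 1.0, 2.0, 4.0, 8.0]
peak_height = [5.0, 105.0, 205.0, 405.0, 805.0]

slope, intercept = fit_line(standard_conc, peak_height)


def estimate_concentration(height):
    """Invert the calibration line to turn a peak height into a concentration."""
    return (height - intercept) / slope


unknown_conc = estimate_concentration(305.0)
print(f"slope={slope:.1f}, intercept={intercept:.1f}, unknown={unknown_conc:.2f} g/L")
```

A calibration like this relies on the peak being isolated and the baseline being stable, which is exactly what breaks down in a bioreactor.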
What is Chemometrics?
Chemometrics is a subset of machine learning and, more broadly, artificial intelligence that is specifically tailored to chemical analysis. A chemometrician trains a chemometrics model to learn which features in a spectrum correlate with which metabolite identities and concentrations. The general workflow for creating chemometrics models is as follows:
- Data collection – Raman spectra are collected throughout the duration of a bioreactor run. Several dozen reference concentrations are measured over the same period by an independent method, for instance offline on another instrument. These independent values are what allow spectral features to be correlated with concentrations and identities when the chemometrics model is built.
- Preprocessing – Raman spectra are preprocessed to reduce signal noise and baseline differences caused by fluorescence. Outliers and erroneous measurements are removed.
- Training and validation data – The reference data and accompanying Raman spectra are split into separate sets. Training data are used to train the model, whereas validation data are “held out” and used to test the final model, ensuring that it measures the real Raman signal rather than “overfitting” to random noise.
- Model building – A chemometrics model is built by correlating the training spectra with the corresponding reference values. Cross-validation is often used to set model parameters. It repeatedly splits the training data, builds sub-models on one portion, and tests them against the portion left out. This helps minimize overfitting, which occurs when the model treats noise or incidental spectral features as meaningful signal.
- Model validation and application – The model is tested with the validation data to ensure results are still within specifications and then installed permanently on a Raman instrument. Now new Raman spectra can be collected and applied to the model to yield concentration and identity results almost instantly!
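The workflow above can be sketched end to end on synthetic data. Everything here is an assumption made for illustration: the simulated spectra (one analyte peak overlapping an interferent peak, plus a random baseline offset), the 30/15 split, the five-fold cross-validation, and the small hand-written PLS1 (NIPALS) routine, which stands in for whatever chemometrics software would be used in practice. Preprocessing is omitted for brevity.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model with the NIPALS algorithm."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W = np.zeros((X.shape[1], n_components))
    P = np.zeros_like(W)
    q = np.zeros(n_components)
    for k in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)      # weight vector
        t = Xk @ w                     # scores
        tt = t @ t
        p = Xk.T @ t / tt              # X loadings
        q[k] = (yk @ t) / tt           # y loading
        Xk = Xk - np.outer(t, p)       # deflate X
        yk = yk - q[k] * t             # deflate y
        W[:, k], P[:, k] = w, p
    coef = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, coef


def pls1_predict(model, X):
    x_mean, y_mean, coef = model
    return (X - x_mean) @ coef + y_mean


def cross_val_rmse(X, y, n_components, n_folds=5):
    """Root mean square error of cross-validation over contiguous folds."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    sq_errors = []
    for held_out in folds:
        keep = np.setdiff1d(np.arange(len(y)), held_out)
        model = pls1_fit(X[keep], y[keep], n_components)
        sq_errors.append((pls1_predict(model, X[held_out]) - y[held_out]) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(sq_errors))))


# 1. Data collection (simulated): 45 spectra with 200 channels each.
rng = np.random.default_rng(0)
n_samples, n_channels = 45, 200
axis = np.arange(n_channels)
analyte_peak = np.exp(-0.5 * ((axis - 90) / 6.0) ** 2)
interferent_peak = np.exp(-0.5 * ((axis - 100) / 8.0) ** 2)  # overlaps the analyte
conc = rng.uniform(0.5, 8.0, n_samples)          # "offline" reference values
interferent = rng.uniform(0.5, 8.0, n_samples)
baseline = rng.uniform(0.0, 2.0, (n_samples, 1))  # crude stand-in for fluorescence
X = (np.outer(conc, analyte_peak) + np.outer(interferent, interferent_peak)
     + baseline + rng.normal(0.0, 0.02, (n_samples, n_channels)))

# 2. Split into training and held-out validation data.
order = rng.permutation(n_samples)
train, valid = order[:30], order[30:]
X_train, y_train = X[train], conc[train]
X_valid, y_valid = X[valid], conc[valid]

# 3. Model building: choose the number of components by cross-validation.
best_k = min(range(1, 6), key=lambda k: cross_val_rmse(X_train, y_train, k))
model = pls1_fit(X_train, y_train, best_k)

# 4. Model validation on data the model has not seen.
rmsep = float(np.sqrt(np.mean((pls1_predict(model, X_valid) - y_valid) ** 2)))
print(f"components={best_k}, RMSEP={rmsep:.3f}")
```

Because the two simulated peaks overlap, a single-peak calibration would be biased by the interferent, whereas the multivariate model uses the whole spectrum to separate the two contributions.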
Some may wonder why using Raman is beneficial if reference measurements must still be collected on a separate instrument to build a chemometrics model. The reason is that once the model is built and validated, routine reference measurements are no longer needed: the model continues to make predictions on new data, and a sufficiently robust model can do so even for new bioreactor setups. Additionally, unlike independent measurements, which are often taken only a few times a day, Raman provides nearly instant feedback on the state of the reactor, which can be used to adjust reactor settings immediately and even automatically.
A Bioreactor Example
Independent data from multiple Raman spectroscopy process analyzer instruments*, probes, and bioreactor types were used to create models. Each training dataset consisted of 45 samples per bioreactor. The spectral data were reviewed, and outlier spectral spikes caused by cosmic rays were removed. The spectral region of interest was selected, and the spectra were pre-processed to remove the baseline and maximize signal-to-noise. Many pre-processing techniques were tested, including the Savitzky-Golay filter with derivatives, Automatic Whittaker Smoothing, Extended Multiplicative Scatter Correction, standard normal variate (SNV), and mean centering. The best-performing combination of pre-processing techniques varied with the specific parameter being modeled. Partial Least Squares (PLS) models were created for each property of interest, and cross-validation was performed to check the optimization of each model. Properties of interest included glucose, lactate, glutamine, glutamate, total cell density (TCD), viable cell density (VCD), and other common metabolites generated during the bioreactor culture run.
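Two of the pre-processing steps named above, SNV and mean centering, are simple enough to sketch directly. This is a minimal illustration on made-up five-channel "spectra"; the function names and the order in which the steps are applied are assumptions, not a description of the settings used in the study.

```python
import numpy as np


def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) on its own."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


def mean_center(spectra, reference_mean=None):
    """Subtract the per-channel mean; pass the training mean when treating new data."""
    if reference_mean is None:
        reference_mean = spectra.mean(axis=0)
    return spectra - reference_mean, reference_mean


# Rows 0 and 1 share the same shape but differ in offset and scale
# (row 1 = 3 * row 0 + 10); row 2 has a genuinely different shape.
raw = np.array([
    [1.0, 2.0, 4.0, 2.0, 1.0],
    [13.0, 16.0, 22.0, 16.0, 13.0],
    [1.0, 1.0, 2.0, 4.0, 2.0],
])

snv_spectra = snv(raw)
centered, training_mean = mean_center(snv_spectra)

print(np.round(snv_spectra, 3))
print(np.round(centered, 3))
```

After SNV the first two rows become identical, which shows how the transform removes additive offsets and multiplicative scaling while keeping spectral shape. Mean centering then expresses each spectrum as a deviation from the average training spectrum.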
Continuous in-line Raman spectroscopy was applied to a fed-batch CHO cell culture process. The in-line spectral data were correlated to the offline analytical data acquired for the parameters of interest. Using Raman spectroscopy to monitor process parameters first requires chemometric model building with an externally calibrated data set (independent offline data). Bioreactor samples were collected daily and analyzed offline to assess the accuracy of the values predicted by the process analyzer*. The root mean square error of calibration (RMSEC), root mean square error of cross-validation (RMSECV), and root mean square error of prediction (RMSEP) were calculated for each parameter. The RMSECV averages the prediction error over the cross-validation sub-models and was used while constructing the model. The RMSEP tests the model against “new” data that the model has not seen. The coefficient of determination, R², was recorded for each PLS model. This value indicates the proportion of the variation in the Y variable that the model predictors (X variables) can explain.
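The error metrics above share one formula and differ only in which samples are compared. A minimal sketch, using made-up offline and Raman-predicted values: applied to the calibration samples the function gives RMSEC, applied to cross-validation predictions it gives RMSECV, and applied to held-out samples it gives RMSEP. Some chemometrics packages divide by degrees of freedom rather than the sample count, so reported values may differ slightly from this simple form.

```python
import math


def rmse(reference, predicted):
    """Root mean square error between reference and predicted values."""
    n = len(reference)
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(reference, predicted)) / n)


def r_squared(reference, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot


# Hypothetical values (g/L): offline reference vs. Raman model prediction.
offline = [1.0, 2.0, 3.0, 4.0, 5.0]
raman = [1.1, 1.9, 3.2, 3.8, 5.0]

error = rmse(offline, raman)
r2 = r_squared(offline, raman)
print(f"RMSE={error:.4f} g/L, R2={r2:.3f}")
```

For these numbers the squared residuals sum to 0.10, so the RMSE is the square root of 0.02 (about 0.141 g/L) and R² is 1 - 0.10/10 = 0.99.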
It is important to note that combining several large, independent datasets from bioreactor runs of the same CHO culture process produced predictive chemometric models that are more accurate and robust. This study combined five independent datasets from previous bioreactor runs to train a large chemometric model. The calibration model was then applied to the spectral data obtained during this DynaDrive bioreactor run. The data indicate that the model predicted this new dataset accurately and that model predictions were highly correlated with offline measurements for numerous metabolites, as shown in Table 1: Correlation of Model Prediction with Offline Data Analysis.
Thermo Scientific DynaDrive S.U.B. Chemometric Model Plots: Comparison of Raman Model vs. Offline Analytical Data for Important Bioreactor Parameters.
We have seen what the Raman signal from a typical bioreactor looks like over the course of a run. Additionally, fluorescence was removed from the spectra to better analyze the signal peaks of interest. In the following steps, we will analyze our independently collected data to get concentration values at various points during our bioreactor run. Then, we will use those values to create a chemometrics model that correlates spectral changes to concentrations. Here, we used the process analyzer inline to monitor CPPs (Critical Process Parameters) in combination with chemometrics to develop predictive models. These inline Raman estimates from the platform models may enable automated feed-rate adjustments and scalable, on-demand nutrient feeding without operator intervention, potentially yielding higher product concentrations in future runs.
Notes: *The instrument used for this example was a Thermo Scientific Ramina Process Analyzer
Authors: David Kuntz, Kevin Broadbelt