The increasing complexity of raw spectral data from tandem mass spectrometry (MS/MS) proteomics is causing memory problems in computational analysis. With XML file sizes reaching the gigabyte range, working system memory can be quickly overwhelmed, which slows processing. Röst et al. (2015) supply modifications to the OpenMS library of tools that improve parsing speed, and thus performance, for proteomics data analysis.[1]
As proteomic instrumentation advances, MS/MS analysis produces more spectral data files, and those files are larger and more complex. Although mass spectrometry instruments write their output in vendor-specific file formats, open formats such as mzML also exist, allowing researchers to use open-source algorithms for processing and computational proteomics. However, open-source tools often struggle with large amounts of data, since processing power and system memory limit performance. This is frequently because the entire raw MS data file must be loaded into system memory for analysis, which becomes problematic when the file size approaches or exceeds the available system memory.
Röst et al. present a new tool, a “fast and versatile” parsing library for MS XML formats based on the OpenMS software framework. Available in C++ and Python, the tool can work around problems such as constraints in system memory. The modifications made by the authors enable the “improved” OpenMS system to deliver fast and efficient processing for computational proteomics workflows that use open-source algorithms for data analysis.
The solution proposed and implemented by Röst et al. is a high-performance, low-memory application programming interface (API) for accessing the raw MS data. By offering it in both C++ and Python, the team has ensured that the workaround is accessible to OpenMS users. They have accomplished this in several ways:
- OpenMS C++ API modification. To promote efficient processing of large XML data files, the team configured “lazy loading,” in which random read access means that only the requested data is loaded. This also makes it possible to read cached data and to query data without fully loading it into memory.
- Code analysis. The team carefully analyzed the OpenMS code to identify “critical performance bottlenecks” and bypassed them through software modifications.
- Speed and performance monitoring. With these changes enabled, the team monitored parsing speeds and analytical performance to enhance output and efficiency.
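The lazy-loading idea in the first item above can be illustrated with a minimal Python sketch. This is not the OpenMS API: the class name `LazySpectrumFile` and the one-spectrum-per-line text format are hypothetical simplifications chosen for illustration (real mzML is XML with encoded binary arrays). The sketch only shows the general pattern of indexing byte offsets once and then reading a single record on request.

```python
import io


class LazySpectrumFile:
    """Hypothetical reader (not OpenMS): index offsets once, load on request."""

    def __init__(self, handle):
        # handle must be a seekable binary file-like object.
        self._handle = handle
        self._offsets = []
        handle.seek(0)
        while True:
            pos = handle.tell()
            line = handle.readline()
            if not line:
                break
            # Only the byte offset is stored; the line itself is discarded.
            self._offsets.append(pos)

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, i):
        # Random access: jump straight to record i and parse just that record.
        self._handle.seek(self._offsets[i])
        fields = self._handle.readline().decode("utf-8").split()
        spectrum_id, peaks = fields[0], fields[1:]
        return spectrum_id, [tuple(float(v) for v in p.split(":")) for p in peaks]


# Toy data standing in for a large file on disk (assumed format: id mz:intensity ...).
data = b"scan=1 100.0:5.0 200.5:7.0\nscan=2 150.0:1.0\nscan=3 300.0:2.0 301.0:4.0\n"
reader = LazySpectrumFile(io.BytesIO(data))
second = reader[1]  # only the second record is parsed here
```

With a real file, passing `open(path, "rb")` instead of the `io.BytesIO` object would keep the peak data on disk until a specific spectrum is requested; memory use then scales with the index rather than with the file size.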
Once the code was modified, Röst et al. undertook validation and benchmarking, comparing analytical results with those obtained from a comparable tool, ProteoWizard. They found a 200x increase in read speeds together with reduced memory requirements, resulting in faster processing than with the unmodified software.
In summary, the modifications made to the program include configuring random read access to data on disk, enabling event-driven processing in which data from only one spectrum is held in memory at a time, and improving parsing speed. These changes result in faster processing without heavy memory requirements. Röst et al. have made all information available online to computational proteomics researchers through GitHub.
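Event-driven processing with only one spectrum in memory can be sketched with Python's standard-library streaming XML parser. This is an illustration under stated assumptions, not the authors' implementation: the function name `stream_spectra`, the callback signature, and the toy `<run>/<spectrum>/<peak>` layout are hypothetical and much simpler than real mzML.

```python
import io
import xml.etree.ElementTree as ET


def stream_spectra(source, consumer):
    """Call consumer(spectrum_id, peak_count) once per <spectrum> element."""
    # iterparse reads the XML incrementally instead of building the whole tree first.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "spectrum":
            # The "end" event fires after all child <peak> elements are parsed.
            consumer(elem.get("id"), len(elem.findall("peak")))
            # Drop this spectrum's peaks before the next one is parsed;
            # only an empty element shell stays attached to the root.
            elem.clear()


# Toy document (assumed layout, for illustration only).
toy = b"""<run>
  <spectrum id="s1"><peak mz="100.0" i="5.0"/><peak mz="200.5" i="7.0"/></spectrum>
  <spectrum id="s2"><peak mz="150.0" i="1.0"/></spectrum>
</run>"""

seen = []
stream_spectra(io.BytesIO(toy), lambda sid, n: seen.append((sid, n)))
```

The design choice here is that the consumer sees each spectrum as soon as it has been parsed, so an analysis step can run while the rest of the file is still unread.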
Reference
1. Röst, H.L., et al. (2015) “Fast and efficient XML data access for next-generation mass spectrometry,” PLoS One, 10(4), doi: 10.1371/journal.pone.0125108.