Sharing is good, and this includes proteomics data. Proteomics data sharing is constrained by file type and size, however, which hinders both the translation of results across systems and physical portability among users. The merging of the two initial open XML (Extensible Markup Language) file formats, mzData and mzXML, into mzML freed researchers from their equipment choices and enabled platform-independent analysis. Unfortunately, with the rise of high-resolution, high-frequency mass spectrometry (MS) spectral data, file sizes are once again straining storage and processing capacity. Compared with vendor-specific platform files, mzML files can be 4 to 18 times larger, requiring bigger data storage repositories and bringing concomitantly longer processing times. In other words, big data need big storage and improved processing power.
Teleman and colleagues (2014) recently demonstrated a solution to the problem.1 They noted that although converting vendor formats into mzML is possible, it is inefficient for the reasons explained in the previous paragraph. Their response was to develop a set of near-lossless numerical compression algorithms, which they call MS-Numpress,2 that shrink mzML files and speed up reads without compromising the primary data.
The authors managed this by writing three algorithms, each targeting one of the data essentials considered fundamental to mass spectral representation: m/z ratios, ion counts and retention times. The algorithms compress the binary data contained in mzML files, with the encoding tuned to each data type, while still allowing near-lossless reconstruction of the original values.
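To get a feel for the approach, here is a minimal Python sketch of the linear-prediction idea behind the m/z compression. The function names, scaling factor and (head, residuals) layout are illustrative assumptions for demonstration only, not the published MS-Numpress encoding, which packs the residuals into a much more compact variable-length byte stream.

```python
# Illustrative sketch only: fixed-point linear prediction for m/z arrays.
# SCALE, the function names and the (head, residuals) layout are assumptions
# for demonstration; the real MS-Numpress "linear" codec stores residuals
# far more compactly.

SCALE = 100000.0  # fixed-point factor: ~5 decimal places of m/z precision

def encode_linear(mzs):
    """Store the first two values, then only residuals from a linear fit."""
    ints = [int(round(x * SCALE)) for x in mzs]
    head = ints[:2]
    residuals = []
    for i in range(2, len(ints)):
        predicted = 2 * ints[i - 1] - ints[i - 2]  # linear extrapolation
        residuals.append(ints[i] - predicted)      # small, compressible ints
    return head, residuals

def decode_linear(head, residuals):
    """Rebuild the integer series, then undo the fixed-point scaling."""
    ints = list(head)
    for r in residuals:
        ints.append(2 * ints[-1] - ints[-2] + r)
    return [x / SCALE for x in ints]

# Round trip: the error is bounded by the fixed-point step (near-lossless).
mzs = [400.00012, 400.00034, 400.00057, 400.00081]
head, residuals = encode_linear(mzs)
restored = decode_linear(head, residuals)
assert max(abs(a - b) for a, b in zip(mzs, restored)) < 1 / SCALE
```

Because consecutive m/z values in a spectrum change slowly, the residuals are tiny integers, which is exactly the kind of data that downstream general-purpose compressors pack very tightly.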
The team then tested their algorithms on a set of 10 mass spectrometry data files obtained from different instruments, vendors and experiment types, running the process on different computers. The set included MS1 and MS2 spectra from experimental runs in data-dependent acquisition (DDA), selected reaction monitoring (SRM) and data-independent acquisition (DIA, SWATH) modes.
Comparing their method both alone and in combination with existing compression schemes such as zlib and gzip, the researchers found that MS-Numpress reduced file sizes substantially. Pairing MS-Numpress with traditional compression tools cut file sizes by 87%, at the cost of 138% longer write times, which the researchers noted could be offset by 21% faster read times.
Furthermore, when Teleman et al. converted MS-Numpress files back to their original forms, they found that data loss was minimal. Using two Orbitrap (Thermo Scientific) DDA liquid chromatography–tandem mass spectrometry (LC-MS/MS) mzML files, the team compressed and then uncompressed the data. They compared LC-MS/MS data from the original files with the twice-converted files—identifying peptides using Mascot—and found that the peptide lists were extremely similar.
Overall, Teleman et al. found that their algorithms, in combination with other compression tools, could compress files by 90% and cut read times by 50%, with minimal loss of data. They see these algorithms as “simple and robust solutions” for data handling within the proteomics community. To this end, they have enabled support within existing tools and have submitted MS-Numpress for further evaluation through the Proteomics Standards Initiative.
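For readers who want to experiment, mzML readers that support the encoding can decode numpress-compressed arrays transparently. Below is a minimal sketch using the pyteomics library; it assumes that your pyteomics installation includes numpress support (e.g., via the pynumpress package—worth verifying for your version) and that 'compressed.mzML' is a hypothetical local file converted with MS-Numpress encoding enabled.

```python
# Sketch: peeking at a numpress-compressed mzML file with pyteomics.
# Assumptions: pyteomics is installed with numpress support (e.g. the
# pynumpress package), and 'compressed.mzML' is a hypothetical local file
# written with MS-Numpress encoding enabled.
from pyteomics import mzml

with mzml.read('compressed.mzML') as reader:
    spectrum = next(iter(reader))                  # first spectrum only
    print(spectrum['id'])
    print(len(spectrum['m/z array']), 'decoded m/z values')
    print(len(spectrum['intensity array']), 'decoded intensities')
```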
Reference and Note
1. Teleman, J., et al. “Numerical compression schemes for proteomics mass spectrometry data,” Molecular & Cellular Proteomics 13, 1537–1542 (2014). doi: 10.1074/mcp.O114.037879.
2. The MS-Numpress source code is available at https://github.com/ms-numpress/ms-numpress under the Apache 2.0 license.
Post Author: Amanda Maxwell. Mixed media artist; blogger and social media communicator; clinical scientist and writer.
A digital space explorer, engaging readers by translating complex theories and subjects creatively into everyday language.