Technologies that are capable of generating quantities of data unthinkable a few years ago are now common. This allows us to begin thinking about food testing in new and exciting ways: how we characterize and classify food-related organisms, look for these organisms in our environments, and investigate their introduction into our food production processes.
Big data is a broad term for data sets that are too large or too complex to handle with traditional methods or infrastructure. Data can be considered ‘big’ for several reasons:
- Analysis can be computationally intensive
- Visualization can be complex
- The volume or velocity of data capture can be difficult to manage
- Curation or annotation can be highly labor intensive
- Searching the data can be slow
- Storing or transferring the information can run into physical limitations
Notwithstanding the infrastructure involved, tapping into large data sets can radically change our understanding and allow us to develop learning models that weren’t possible before. For instance, mathematical models that use large sales data sets to identify contaminated food products are an interesting development because they allow a more rapid response to outbreaks by leveraging data that is already available [1]. Social media adds another source of data, and several crowd-sourced epidemiological apps already attempt to track infectious disease.
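To make the sales-data idea concrete, here is a toy sketch of a likelihood-style ranking. It is not the method of the cited paper; the product names, regions, sales shares, and the `rank_products` helper are all made up for illustration. The assumption is simply that a product sold mostly where the illnesses occur is a more likely suspect.

```python
import math

def rank_products(sales_share, case_regions, floor=1e-6):
    """Rank products by the log-likelihood of the observed case regions.

    sales_share maps product -> {region: share of that product's sales}.
    case_regions lists the region of each reported illness.
    floor avoids log(0) when a product has no sales in a region.
    """
    scores = {}
    for product, share in sales_share.items():
        scores[product] = sum(
            math.log(max(share.get(region, 0.0), floor))
            for region in case_regions
        )
    # Highest log-likelihood first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical sales distributions (each product's shares sum to 1)
sales_share = {
    "sprouts": {"north": 0.7, "south": 0.2, "east": 0.1},
    "spinach": {"north": 0.3, "south": 0.4, "east": 0.3},
    "melon": {"north": 0.1, "south": 0.1, "east": 0.8},
}

# Hypothetical illness reports, one region per case
case_regions = ["north", "north", "south", "north"]

ranking = rank_products(sales_share, case_regions)
print(ranking)  # ['sprouts', 'spinach', 'melon']
```

With cases concentrated in the north, the product sold mostly in the north ranks first. A real model would also have to account for things like population density, reporting delays, and products with similar distribution patterns.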
How will big data alter the way food safety and quality testing is performed? Two points are important: you need a clear understanding of what information has value to you, and you need to apply technologies that deliver that information. Let me give a couple of quick examples:
Question 1: What percent of the Listeria monocytogenes genome is useful in determining that a sample does, or does not, contain the organism?
Answer: Based on a quick sequence analysis using 50 Listeria isolates (the inclusion set) and 20 non-Listeria isolates (the exclusion set), an informative but not huge sample, about 18% of the genome is useful. Think about that; over 80% of the genome had no value in the context of that question.
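The arithmetic behind an answer like this can be sketched in a few lines. This is a minimal illustration, not the actual analysis: it assumes each isolate is represented as the set of genome regions detected in it, and it counts a region as useful only if it appears in every inclusion isolate and in no exclusion isolate. The region labels and the `informative_fraction` helper are hypothetical.

```python
def informative_fraction(inclusion, exclusion, all_regions):
    """Return the share of regions that separate the two sets, plus the regions.

    A region is counted as informative when it is present in every
    inclusion isolate and absent from every exclusion isolate.
    """
    informative = [
        region for region in all_regions
        if all(region in isolate for isolate in inclusion)
        and not any(region in isolate for isolate in exclusion)
    ]
    return len(informative) / len(all_regions), informative

# Toy data: regions of the target genome (hypothetical labels)
target_regions = ["r1", "r2", "r3", "r4", "r5"]

# Isolates that should test positive
inclusion_set = [
    {"r1", "r2", "r3", "r4", "r5"},
    {"r1", "r2", "r3", "r4"},
]

# Isolates that should test negative
exclusion_set = [
    {"r2", "r3", "r9"},
    {"r3", "r4", "r8"},
]

fraction, regions = informative_fraction(inclusion_set, exclusion_set, target_regions)
print(f"{fraction:.0%} of regions are informative: {regions}")
# prints: 20% of regions are informative: ['r1']
```

In this toy case only one region in five answers the presence/absence question; the rest is either shared with non-target organisms or missing from some target isolates.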
Question 2: What percent of the Salmonella genome is needed to indicate that a sample is positive?
Question 3: What percent of the E. coli O104 genome was useful in characterizing the German foodborne illness outbreak in 2011?
Answer: ALL of the information was useful, because it was an uncharacterized strain that we knew very little about.
So here’s the point: all information has value, but that value depends on the question being asked and on the models that have been developed to identify the most valuable information. Framing the appropriate questions is key: what am I trying to accomplish with my testing, what information has value in meeting that goal, and what technology delivers that information as effectively and efficiently as possible?
In part two, I’ll share how we are using large data sets to develop food testing technologies that benefit food producers.
Dan Kephart, PhD, is R&D Leader, Food Safety Testing at Thermo Fisher Scientific.
1. Kaufman J, Lessler J, Harry A, Edlund S, Hu K et al. (2014) A Likelihood-Based Approach to Identifying Contaminated Food Products Using Sales Data: Performance and Challenges. PLOS Comput Biol 10(7): e1003692. doi:10.1371/journal.pcbi.1003692.