NGS Big Data


My name is Hans Karten and I am co-founder and CEO/CTO of the company. We initiated this blog because we have a strong vision about accelerating the clinical application of Next-Generation Sequencing (NGS). Large-scale genomics studies are needed to translate genomics data into true patient benefit. We need to reconsider how we organize the workflow to process, analyze and store the massive amounts of data produced by modern sequencers, to make it more efficient and cost-effective.

This is our first post, so it makes sense to tell you a bit about our company. Founded in 2011, we come from a background of high-speed, high-volume data processing. This background has proven to be a critical success factor in the genomics space, where the NGS-related Big Data phenomenon is becoming the fear and challenge of many who are responsible for IT, finance and daily operations.

Big Data Challenges

Most likely this is preaching to the choir, yet the challenges of working with genomics at scale are quite often underestimated by decision makers. This underestimation is fueled by ambition and based on the assumption that more resources and fast-paced technological innovation can solve the NGS-related Big Data problems. As a result, many initiatives are being funded in a new and growing industry.

Part of that is true. The growth and complexity of the NGS data deluge produced by sequencers has been outpacing Moore's law and will continue to do so for some time to come. Innovation power and the courage to break the sound barrier into utter silence will move this industry to the next level and the maturity it needs.

Silence of a frontrunner

The introduction of the Illumina HiSeq X™ Ten system is a prime example of raising the bar in terms of data production. It has been both disruptive and a wake-up call. Staying on the existing path of data processing (starting with mapping and variant calling) requires a big investment in IT infrastructure, storage capacity and qualified people (a scarce resource, by the way) to deal with the workload, and those who stay on it will eventually fall behind the ones who decided to move into the silence of a frontrunner.

If you think we must have reached the capacity limit of sequencers, think again. In terms of throughput, the underlying biology allows current systems to increase their output by a factor of 4-5. And this may well happen before you are ready to upgrade the investments you are making to deal with the vastly increased data production volumes.

Two important rules

Here we arrive at the core of the issue, as there is no turning back once you have made the choice to be part of this fast-growing industry and find your spot as a genomics data producer or researcher. There are two important rules that must be followed:

  1. Concentrate on the essence of the data
  2. Reduce the data as early in your pipeline as possible

If you want to keep all the details of all the data, you are defending a fortress with the gates wide open.

Let me illustrate this with two examples. If you do quality control on the data coming out of the sequencer and a read contains only N's, do you want to keep this data, or do you want to keep a reliable record of how many N's were in the data stream and discarded? If you encounter a read with a tail of such low quality that you cannot use it to make a variant call, do you want to keep this data, or a record of how many bases were trimmed from the dataset?


In both cases, you have the information you desire: the first is raw, the second is descriptive, based on a defined set of rules. They both yield the same result. We have to ask ourselves: if QC trimming, and perhaps dropping reads whose quality was not up to spec, happened before the data left the sequencer, and this were the status quo, would we be happy with it? I think we would. It seems more a matter of taking responsibility for making the choice than a true contribution to the quality of the process output.
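To make the two examples concrete, here is a minimal sketch (not our production pipeline; the function names, quality threshold and sample reads are purely illustrative) of dropping all-N reads and trimming low-quality tails while keeping only a summary record instead of the discarded raw data:

```python
def phred(qual_char):
    """Convert a FASTQ quality character (Phred+33) to a quality score."""
    return ord(qual_char) - 33

def qc_reads(reads, min_q=20):
    """reads: list of (sequence, quality_string) tuples.
    Returns the kept reads plus a summary of what was removed."""
    kept = []
    summary = {"reads_dropped": 0, "bases_trimmed": 0}
    for seq, qual in reads:
        if set(seq) == {"N"}:            # read contains only N's: drop it
            summary["reads_dropped"] += 1
            continue
        # Trim the low-quality tail from the 3' end.
        end = len(seq)
        while end > 0 and phred(qual[end - 1]) < min_q:
            end -= 1
        summary["bases_trimmed"] += len(seq) - end
        if end > 0:
            kept.append((seq[:end], qual[:end]))
        else:
            summary["reads_dropped"] += 1
    return kept, summary

reads = [
    ("ACGTACGT", "IIIIIIII"),   # high quality throughout: kept as-is
    ("NNNNNNNN", "!!!!!!!!"),   # all N's: dropped, counted in summary
    ("ACGTACGT", "IIII!!!!"),   # low-quality tail: trimmed to 4 bases
]
kept, summary = qc_reads(reads)
```

The summary dictionary carries the descriptive record the paragraph above argues for: you know exactly how many reads were dropped and how many bases were trimmed, without storing the useless raw bases themselves.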

A similar case can be made for base quality binning. People have gone through lengthy debates on continuous versus binned base quality representation. In the end, the perceived 'loss' of information has no impact at all. The reason is that the tools dealing with the data have been calibrated for continuous quality scores and will be re-calibrated to deal with binned quality scores, yielding the same results and precision. We recently presented the results of a collaborative study with deCODE Genetics on this topic in a webinar.
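For readers unfamiliar with binning: it maps the continuous range of Phred scores onto a handful of representative values. The sketch below illustrates the idea with a hypothetical 8-level binning table; the exact bin boundaries and representatives are assumptions for illustration, as real instruments define their own tables.

```python
# Hypothetical binning table: (lower bound inclusive, representative value).
# Checked from highest to lowest, so the first matching bin wins.
BINS = [
    (40, 40),
    (35, 37),
    (30, 33),
    (25, 27),
    (20, 22),
    (10, 15),
    (2, 6),
    (0, 2),
]

def bin_quality(q):
    """Map a continuous Phred score to its binned representative value."""
    for lower, rep in BINS:
        if q >= lower:
            return rep
    return 0

# A spread of continuous scores collapses onto 8 possible values.
binned = [bin_quality(q) for q in [0, 5, 12, 23, 31, 38, 41]]
```

Because quality values dominate the entropy of a FASTQ/BAM stream, reducing them to a few levels compresses dramatically better, which is why binning pays off so early in the pipeline.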

Bite the bullet

We need to bite the bullet and take a pragmatic approach, backed by experimental proof, to these fundamental issues at both the science and the governance level, to move genetic diagnostic support into the clinic, providing economical health in a healthy economy.

To be continued…

Written by:
Hans Karten
