The value of benchmarks

In our last blog post, “The Quality of Speed”, we argued that embracing new technologies to increase speed and reduce file sizes moves us forward. History consistently shows that such gains go hand in hand with quality improvements, and that they contribute to the well-being of both individuals and society.

Standards and safety nets

But hold on! How can we assert that we indeed improve on quality and that we are not, as many fear with good reason, cutting corners and trading quality for speed? Well, as all rational people do, we create standards and build safety nets to make sure that we do not regress from where we are while pushing the envelope of technology.

It boils down to a simple statement: within the bounds of defined variations from a standard, we are safe to increase speed and optimize footprint and resource usage. So we innovate our hearts out, check the numbers and go to bed with a clear conscience, knowing we did well, made a worthwhile contribution and will be patted on the back in many venues.

Holy Grail meets check box

As more and more people come to regard the defined boundaries as a true guardian, the ‘safety net’ takes on a life of its own: it transforms from a means into a goal. The standard becomes the Holy Grail, and marketing departments start pressing development to make sure that their product is indeed the ‘best’ according to this standard. Gentlemen, start your engines…

I have seen this happen in many cases. Achieving higher quality using a reference is an honest and sound process: technology improvements truly accelerate when developers are challenged and have a clear target. But I have also seen many escalations, where specific fast paths and switches are built into products just to make sure the benchmark is met and the perceived quality and performance numbers are top notch. The rat race surrounding the TPC benchmarks is a clear example of the process at hand. And yes, I stand guilty: I too reduced code paths and contention to make particular transactions go ultra-fast.

To be fair, users add fuel to the fire by putting the benchmark ‘checkbox’ on their purchase lists. If you are a power purchaser who will only buy a product that meets your checkboxes, I can assure you that all vendors will present their products with “Yes we can!” You get what you ask for.

Illumina’s truth set

In the field of genomics, things will be no different. As this industry develops, people are extremely risk-averse, and one of the securities to latch onto is a generally accepted quality benchmark. A few do exist. Illumina, for example, has a ‘truth set’ as part of its Platinum Genomes: a sample set of seventeen family members, whose pedigree makes the set largely self-validating. This means that the input data is the best of the best when it comes to quality, so when we map and call variants on this data, we are not hindered by many errors in the data itself. Neither the mapper nor the variant caller is challenged with noisy data. I think this is a good thing, but you have to understand the context of the test to value the results for your environment.

Genome In A Bottle (GIAB) consensus

Another initiative is GIAB, which is very thorough in how it determines a ‘truth set’ to compare against. GIAB takes the input of multiple sequencing platforms, aligns the data with multiple mappers and runs the results through multiple variant callers to arrive at a consensus ‘reality’, dubbed the ‘high confidence set’.

Both initiatives are sound methods to obtain a safety net that guards against regression and prevents our rapidly developing industry from going off track.
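Conceptually, scoring a call set against such a truth set boils down to counting true positives, false positives and false negatives, and deriving sensitivity and precision from them. A minimal sketch, assuming variants are simplified to (chromosome, position, ref, alt) tuples — real benchmarking tools also normalize variant representation, which set comparison glosses over:

```python
# Hypothetical, simplified variant comparison against a truth set.
# Variants are (chromosome, position, ref, alt) tuples; this layout
# is an assumption for illustration, not a real file format.

def benchmark_calls(truth, calls):
    """Return (sensitivity, precision) of a call set vs. a truth set."""
    truth, calls = set(truth), set(calls)
    tp = len(truth & calls)   # true positives: called and in the truth set
    fp = len(calls - truth)   # false positives: called, not in the truth set
    fn = len(truth - calls)   # false negatives: truth variants we missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision

truth = {("chr1", 101, "A", "G"), ("chr1", 250, "T", "C"), ("chr2", 37, "G", "A")}
calls = {("chr1", 101, "A", "G"), ("chr2", 37, "G", "A"), ("chr2", 90, "C", "T")}

sens, prec = benchmark_calls(truth, calls)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

Even this toy version shows the benchmark’s blind spot: it can only reward agreement with the chosen truth set, which is exactly where the bias discussed below creeps in.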

Red flags

There are a couple of red flags we need to wave. We come from a place where the technological methods used to gain insight into genomics data have, for a large part, come from a single source. There is no doubt that GATK, developed by the Broad Institute, and good old BWA have put their fingerprint on the result sets that are now being used as ‘benchmarks’.

As a consequence, the reference data is biased towards declaring these tools right: if you build the standard, you score 100%. As Dredd would say: “I am the Judge!”

The “truth”

There is no point in bashing; this is the nature of things and the state we are in at this moment. Three years from now, the scene will be different. There are, however, two things we need to be aware of in our field. Bear in mind that the technology to find truth in genetic data has not yet reached maturity: we are not yet finding all there is to discover. This is a no-brainer for biology, and it holds equally true for the data-processing quality and discoveries of the secondary analysis stage.

As such, a current “standard” is not the ‘Platinum Meter’; it is a snapshot that is still evolving. Marking it down as the absolute ‘truth’ will stop further discoveries, as technology is no longer challenged: the industry settles, and purchasers will only buy if you make the mark and nothing but the mark.

The other thing we need to be aware of is how tools achieve their results. Are they rule based? A combination of statistics to distinguish signal from noise, topped with post-filtering (Bayesian or orthogonal)? Or do they use training sets? And if a training set is used, does it seed a true learning system, or does it merely bias the output towards prior knowledge?

Don’t limit your view

It is easy to see how a training set that provokes a bias towards prior knowledge can tilt benchmark results, showing almost unbelievable sensitivity and precision and promoting the tool as the one to use to get the most out of your data. One has to weigh this carefully: exploiting a bias may work very well (as it does with population calling on even a few samples) if you know what to blend your data with. New discoveries, however, will not benefit from such a bias, and training sets must not become the boundary of our vision.

Concluding this thought, I state: benchmarks are good and honorable. They provide a safety net, can spur innovation and push the industry to a higher level. They can also stifle innovation and misdirect users’ decisions. In a field that has not yet settled, creators of benchmarks would do well to rotate the content, for instance every three years. This way the industry stays alert and keeps moving towards better quality, a wider spectrum of discoveries and overall better results for the health system. This too contributes to economic health in a healthy economy.

Your feedback below is highly appreciated.

Written by: Hans Karten
