As we continue our contemplation on topics that are relevant in our genomics space, we arrive at ‘The Format’ as relevant and important. We discussed the importance of the ‘lossy’ part of any data contained in a format in our first blog “NGS Big Data: Fear and Challenge”. Today we discuss ‘The Format’ itself.

The past few days I passed through Las Vegas. We stayed there for a few days, and one night my wife and I went to ‘The Steakhouse’. So we ordered a cab to Circus Circus, where this hidden gem is located.

On our way, the taxi driver asked where we were heading. I told him about the fantastic Filet Mignon and the excellent Malbec they serve and that this is the reason why it is called ‘The Steakhouse’. The man, obviously a steak connoisseur, replied that I was absolutely right. He had been to all the steakhouses in town. He makes a habit of watching the opening announcements, waits a few weeks to let the crew get the flow right and then tries them out.

So he tells me that there is another place that is even better. It is called ‘Golden Steer’, which used to be a place where the mob got their steaks. It sounded like a logical place for improvement through natural selection.

My point is that the perception of ‘The Steakhouse’, i.e. ‘there is no better’, was firmly rooted in my mind till I spoke to this man. I still don’t know if I will ever go to ‘Golden Steer’ though. Why would I, if I am happy with the food, the wine and the service of ‘The Steakhouse’? I would probably only go there if it is easy to access, or half the price, or when all other people go there and I feel lost because I don’t blend in any more. I do realize however that the venue, ‘The Format’, will only serve as a means to get me a perfect filet and a wonderful and relaxing evening.

I still remember the days when Betamax and VHS were competing to become ‘The Standard’ video player format. The reason for the battle was that the industry could not sustain two formats. One had to emerge to allow for the creation of content and of physical devices to read the tapes. After VHS won (on available content), the market was locked in, as the cost of change, i.e. the barrier to entry, was too high.

A similar battle appeared with the introduction of the Compact Disc. Here too, the requirements of physical devices caused a lock in to the format. In both cases, strangely enough, the ‘market’ never experienced a lack of freedom because of it. People enjoy watching movies and listening to music without thinking for one moment about the format. Fair is fair, in the transition to download and streaming mode, there have been a few glitches and curses.

Now, from the fluffy stuff to the core of genomics: We have grown to hate and use the BAM file as the general-purpose format for aligned reads (just used as an example). And similar to the CD or the videotape, the market is locked due to the inflexibility of devices and tools to work with other formats. This is indeed a deadly embrace: when data producers stick to BAM, there is no incentive for tools to implement other formats, and vice versa.

Now bear in mind that our world has vastly changed since the 1980s. There is hardly such a thing as a true physical device anymore. Almost every device dealing with data has embedded software. If we look again at the world of video: “They got the picture!” Just check how many codecs have been created since the inception of video processing and you can see that ‘The Format’ in this world has many colors. The trick is to distinguish between format and content. The format must carry the attributes of small footprint and high performance access. The content must carry the attributes of complete and concise high quality data, serving the needs of the user.

For video processing, codecs are developed as plugins and video players or editing tools are all built around published interfaces. So whenever a new codec becomes available, it comes with a plugin, which will be pulled into the tool and you are free to go.

From 'The Format' to 'The Interface'
The point of today is that the genomics space must move away from ‘The Format’ and move to ‘The Interface’. Every tool working with e.g. NGS reads should be constructed such that it uses a standard plugin interface, which in this case retrieves a single read or a batch of BamReads. Every new format that sees the light of day needs to supply a plugin, such that the format is immediately accessible and usable by the community.
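To make this a bit more tangible, here is a minimal sketch in Python of what such a plugin interface could look like. It is an illustration only, not an existing standard or library: the names AlignedRead, ReadFormatPlugin, register, open_reads and read_batch are all hypothetical, and the ‘toytsv’ format is invented purely to have something small to plug in. A real plugin for BAM or any other format would wrap that format's own reader behind the same interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from itertools import islice
from typing import Callable, Dict, Iterator, List, Type


@dataclass(frozen=True)
class AlignedRead:
    """The content a tool sees, independent of the file format (hypothetical)."""
    name: str
    reference: str
    position: int  # leftmost alignment position
    sequence: str


class ReadFormatPlugin(ABC):
    """'The Interface': every format plugin implements this (hypothetical)."""

    @abstractmethod
    def reads(self, path: str) -> Iterator[AlignedRead]:
        """Yield the reads stored in the file at `path`, one at a time."""


# Registry that maps a file extension to the plugin class handling it.
_REGISTRY: Dict[str, Type[ReadFormatPlugin]] = {}


def register(extension: str) -> Callable[[Type[ReadFormatPlugin]], Type[ReadFormatPlugin]]:
    """Class decorator that announces a plugin for one file extension."""
    def decorator(cls: Type[ReadFormatPlugin]) -> Type[ReadFormatPlugin]:
        _REGISTRY[extension] = cls
        return cls
    return decorator


def open_reads(path: str) -> Iterator[AlignedRead]:
    """What a tool calls: it never needs to know which format is behind it."""
    extension = path.rsplit(".", 1)[-1]
    try:
        plugin = _REGISTRY[extension]()
    except KeyError:
        raise ValueError(f"no plugin registered for '.{extension}' files") from None
    return plugin.reads(path)


def read_batch(reads: Iterator[AlignedRead], size: int) -> List[AlignedRead]:
    """Retrieve up to `size` reads from an open stream of reads."""
    return list(islice(reads, size))


@register("toytsv")
class ToyTsvPlugin(ReadFormatPlugin):
    """Invented tab-separated format: name, reference, position, sequence."""

    def reads(self, path: str) -> Iterator[AlignedRead]:
        with open(path) as handle:
            for line in handle:
                name, reference, position, sequence = line.rstrip("\n").split("\t")
                yield AlignedRead(name, reference, int(position), sequence)
```

A tool written against open_reads and read_batch does not change when a new format arrives. The author of the new format ships one class with a register line, and every tool built on the interface can read it.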

This will lower the barrier to entry for new formats and allow more innovation in this space. People can experiment with different formats and write about the pros and cons until the field settles. Then we can move to more hardware orientated solutions, which will make encoding and decoding, or compression and decompression fast and energy friendly. Medical devices will be able to handle this type of data outside of the data center and functionality will eventually move to handhelds.

Putting new formats on real trial is a really good thing. In all cases it is much better than facing a jury dreaming up theoretical cases in which a format decision ‘may’ cause a loss of result in a ‘boundary’ case. In reality, no clinic or scientist would ever base any important decision on a boundary case. Besides, a new format may have many pros that bring benefit to many. Such an argument merely sounds like a symptom to hide the cost of change.

Coming to the end of my elaboration of today, whilst enjoying a white wine at the harbor of San Diego, I strongly suggest that all tool producers in our genomics space adapt their tools to a plugin model and define the interface for both call methods and data structures, such that we get rid of the spell of ‘The Format’ and become free in the world of ‘The Interface’. This will spur innovation in the data storage space, which again will contribute to our goal of economical health in a healthy economy. I rest my case.

