2011-09-22

SAM/BAM with gapped reference

A lot of my time this week has gone into thinking and "talking" on the samtools-devel mailing list about the SAM/BAM file format and how it might be improved for (de novo) assemblies.



SAM/BAM

Anyone working with high throughput sequencing data (formerly known as Next Generation Sequencing, NGS), should be well versed with the SAM/BAM file format. SAM stands for "Sequence Alignment/Map" and is a text based format, with BAM being a compressed binary version. This is an open standard and one of the drivers behind its development was coping with the large volumes of genetic data from the 1000 (human) genomes project.

One attraction of SAM/BAM is the existence of two solid open source reference library implementations (which mostly agree), samtools written in C (which has wrappers in an number of languages including PySam for Python), and Picard written in Java.

The main draw of SAM/BAM as a format is that it was designed to scale well with large datasets, and in particular there is a standard indexing system for accessing any sub-region of a reference sequence and all the reads mapped to that region. This indexing approach has been generalized in the tabix tool (now included with samtools) for indexing tabular files which are commonly used in genome annotation.

There are now over a dozen SAM/BAM alignments viewers, of which I am most familiar with Tablet and IGV. Most of these viewers use the Picard library internally, but they all attempt to exploit the indexing system to allow fast viewing of very large datasets - accessing just the read data required to draw the region currently on screen. There is also at least one alignment editor supporting SAM/BAM, Gap5.

The Bioinformatics community seems to have settled on SAM/BAM as the de facto standard for mapping reads to a reference genome - but adoption by assembly authors has lagged. For now those of us generating (de novo) assemblies still have to wrestle with a range of formats and tools, with inter-conversions only partially supported.  It would be much nicer and easier if the major assembly tools could all produce SAM/BAM as one of their outputs, a point Nick Loman made very strongly on his blog this Monday.

Ungapped vs Gapped

One of the main topics we discussed this week in this samtools-devel mailing list thread was the use of a gapped reference sequence to show the reads mapped against. This is quite a dramatic change, since until now SAM/BAM has exclusively used an ungapped reference. On the other hand, traditional assembly formats like ACE have used a gapped reference.

Here's an example of how things may look - using the MIRA 3 assembler, which comes with a few example projects include a small sample of E. coli. Now MIRA can produce quite a few different output formats, including ACE and the rather similar MAF (MIRA assembly format). ACE is a fairly widely supported format, but has limitations like it doesn't have a mechanism for declaring paired end reads. This is one of the things MAF handles much better - but there are no assembly viewers that understand MIRA's own format. Instead, I've been using my own Python script (maf2sam.py) to convert from MAF to SAM which lets me view paired end data nicely.

Here is part of this example in ACE format shown in Tablet:
Tablet showing ACE file


How does this look in a traditional SAM or BAM file using an unpadded consensus/reference? Well notice in the above ACE screenshot there is a run of six gaps, and about a dozen reads have some bases there. In a traditional SAM/BAM file these six columns are compressed down to nothing - leaving the assembly viewer with a tricky presentation problem for showing the hidden bases.
Tablet showing BAM file (traditional ungapped reference), highlighting reads with insert at this column

Tablet showing BAM file (traditional ungapped reference), highlighting column with inserts

Tablet has several ways to try and show the bases inserted relative to the ungapped consensus/reference - including mouse over tooltips and highlights when you hover over the "I" in the CIGAR-I feature track. None of these views, nor those in other viewers like IGV shown below, are entirely satisfactory if you are used to the padded reference approach.

Here is the same region in IGV, note that the read packing algorithm is slightly different so you can't trivially work out which read is which compared to Tablet: 
IGV v2.0.10 showing BAM file (traditional ungapped reference), coloured by sample, tool tip showing one insert

Here is the same assembly converted into BAM using a padded reference sequence (this is new and experimental - but using the standard version of Tablet):
Tablet showing BAM file (experimental gapped reference)

Go back to the ACE screen shot again - can you spot the difference? No? Well it was a trick question - the read placement and presentation is identical. There are some minor graphical differences like the tool tip - you can see the SAM/BAM CIGAR string in the read tooltip for example.

Here's the same file in IGV for comparison - again it seems to cope fine with the padded reference (although it doesn't seem to show both coordinate systems):

IGV v2.0.10 showing BAM file (experimental gapped reference), coloured by sample.

So, I think this is a big win - we get the speed and scaling advantages of SAM/BAM, along with explicit support for paired reads, plus the traditional view of inserted bases. And if you like or need to refer to positions in the unpadded co-ordinate scheme, Tablet handles this very nicely (without any changes). Notice it shows all the lengths and positions in both padded and unpadded co-ordindates. Not all SAM/BAM viewers can do this - the reason Tablet can is this functionality was required for ACE and other formats.

Fingers crossed this refinement to allow a gapped reference sequence will become part of the official SAM/BAM specification shortly!

Disclaimer
I don't work on Tablet, but I know the team that does - they're over in the next building to me.



Update (Tuesday, 27 September 2011)

Since I wrote up this blog post (Thursday, 22 September 2011), the SAM/BAM specification has been updated in SVN to cover gapped reference sequences, so this is on track to be part of the SAM Format Specification v1.5 - Yay!

Discussions about other aspects of using SAM/BAM for (de novo) assemblies like conventions for read and consensus annotation are still going on at the samtools-devel mailing list (public archive).

Update (Wednesday, 28 September 2011)

Added the IGV screenshots to show how it represents inserts with a traditional unpadded reference BAM file, and the experimental padded reference BAM file.

Update (Monday, 3 October 2011)

I posted a follow up going into the details of the CIGAR pad operator etc for multiple sequence alignment in SAM/BAM without a gapped reference.

Update (Friday, 29 May 2015)

Those of you following the samtools-devel mailing list, or the SAM/BAM specification on GitHub might have noticed using a padded reference was made part of the specification. I coordinated a paper describing this work which is now available as a preprint: SAM/BAM format v1.5 extensions for de novo assemblies (DOI: 10.1101/014043).

No comments:

Post a Comment