Guest post: Luke Jostins on the twice-sequenced genome

Luke Jostins of Genetic Inference critiques a recent paper in Genome Research showcasing the first published human genome sequence generated using SOLiD technology.


While I continue my work-induced blog coma, here's a guest post from Luke Jostins, a genetic epidemiology PhD student and the author of the blog Genetic Inference, delivering a fairly scathing critique of a recent whole-genome sequencing paper based on Life Technologies' SOLiD platform.*
*McKernan et al. 2009. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Research. DOI: 10.1101/gr.091868.109


In prepublication at the moment is a paper from the labs of ABI, makers of the SOLiD sequencing system. It is the first published human genome to be sequenced entirely by SOLiD, the lowest-coverage non-454 second generation genome, and, interestingly, the second whole genome publication to not come out in either *Science* or *Nature*. There is a lot of interesting stuff going on with this paper, and with the discussions going on around it, and there are a lot of stories to tell about it.

Genome Blogging

This is a case of the genomics blogging community breaking a very interesting story hot off the press. Dan Koboldt, in a blog post at MassGenomics, reported the pre-publication of the SOLiD genome very soon after its appearance. He had a very interesting insight: the individual sequenced was the SAME INDIVIDUAL as in Bentley et al., the Illumina genome! The SOLiD paper makes no comparison, and does not even mention that the two genomes come from the same person, despite reviewing previous work done on this individual:

The NA18507 genome has not been Sanger sequenced to more than 0.5× sequence coverage (Kidd et al. 2008). However, it has been extensively genotyped as part of the HapMap project (Frazer et al. 2007) and some regions shotgun sequenced to higher depth as part of the ENCODE project (Birney et al. 2007).

Score one for blog-based reporting. Score another one for the fact that Kevin McKernan, lead author of the paper, spoke up in the comments of Dan's post, saying that they had not done a comparison due to the lack of genotype information published by Illumina. Why they don't mention this justification in the paper, or even comment on the existence of an earlier, higher-coverage second gen sequencing project on the same individual, is far from clear (and it is a bit distressing that the reviewers failed to pick up on this).

Anyway, a quick-and-dirty comparison is not too hard, so let's do one now.

Illumina Vs SOLiD

A quick word on the relevant differences between Illumina and SOLiD in terms of sequencing: the Illumina system looks at single nucleotides, each dyed with a different colour, so the sequence ACATCA would give green for A, blue for C, green for A, yellow for T, and so on. In contrast, the SOLiD system reads overlapping pairs of bases as single colours. There are 16 possible pairs (the instrument itself shares four dye colours among them, but it is easiest to picture each pair with its own colour), so the sequence ACATCA would give purple for AC, orange for CA, red for AT, and so on.

Note that the SOLiD method determines each base from two different bits of information: this makes it less prone to sequencing errors, since an incorporation error has less of an effect on the call. SOLiD's claim is that this higher accuracy allows them to call bases accurately using only two aligned reads, meaning you need lower sequence coverage: the SOLiD genome only has 17.9-fold coverage, which is a lot less than the 40-fold coverage used by Illumina.
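To make the "two bits of information per base" point concrete, here is a minimal Python sketch of two-base encoding. It assumes the commonly described four-colour scheme, in which each base gets a 2-bit code and the colour of a pair is the XOR of the two codes, so that four pairs share each colour. The function names and the numeric colour labels 0-3 are illustrative choices, not anything taken from the paper.

```python
# Assumed 2-bit codes for the bases; the colour of an overlapping pair
# is the XOR of the two codes (an assumption about the encoding, see above).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colours(seq):
    """Return one colour call (0-3) per overlapping pair of bases."""
    codes = [BASE_CODE[base] for base in seq]
    return [left ^ right for left, right in zip(codes, codes[1:])]


def changed_positions(ref_colours, alt_colours):
    """Indices at which two equal-length colour lists differ."""
    return [i for i, (r, a) in enumerate(zip(ref_colours, alt_colours)) if r != a]


reference = "ACATCA"
with_snp = "ACGTCA"  # same sequence with a real A->G change at the third base

ref_colours = to_colours(reference)
snp_colours = to_colours(with_snp)

# A true single-base change alters the two colours that overlap that base.
print(ref_colours, snp_colours, changed_positions(ref_colours, snp_colours))
```

Under this scheme a real SNP changes two adjacent colour calls, while a single mis-measured colour changes only one, which is why an isolated colour mismatch against the reference can be treated as a likely sequencing error rather than a variant.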

SNP and indel performance

The SOLiD paper calls a similar number of SNPs to the Illumina paper: just under 4M, of which just about 0.8M are new. They validated these both by genotyping about a hundred independently, and by checking their homozygous calls against the HapMap genotypes. Despite the low coverage, SOLiD had accuracy similar to Illumina's: 0.76% false positives, 99.1% of HapMap sites called, 99.16% concordance (this compares to about 0.62%, 99.3% and 99.07% from Bentley et al.).

They called about half as many indels as Illumina (0.23M versus 0.41M), but they managed to call a far wider range of sizes; they have insertions in all size ranges up to 1kb, and deletions all the way up to 100kb. However, I expect these differences are as much down to the algorithms they used to call these indels as the sequencing technology used - it would be interesting to re-analyse the Illumina and SOLiD data using the same alignment and calling software. This would be pretty easy to do; the data is all available online: Illumina and SOLiD.

Going Deeper

The SOLiD genome paper is, as far as I know, the most in-depth study of variation within a single human genome, and they try out a lot of pretty impressive tricks. Beyond indels, they also find 91 inversions and 565 copy number variations; when combined with the indel data, they effectively have information on structural variants ranging in size from 1 bp to 1Mb.

They manage to do phasing [that is, inferring whether nearby variants were inherited on the same copy of a chromosome - DM] using paired end read data; when you are very confident of every base call in every read, you can start assembling these reads into individual haplotype contigs [i.e. long stretches of sequence that can be said with high confidence to have been inherited as a block - DM]. They manage to produce blocks of phased haplotypes up to 215Kb long, and confirm their phasings are about 99% accurate compared to HapMap SNPs.

Along with this, they make an in-depth study of the functional implications of the variations they see, including looking for OMIM variants, potential gene disruptions and even a handful of potential gene fusions. They also use functional information to produce lists of protein families and gene functions that have been under different types of selection (though I have my doubts about how reliable this is).

The fudges

So, SOLiD wins, yes? They have done a much lower coverage sequence of the same genome as Illumina, matched their SNP calling for quality, beat them in range (if not quantity) of indel sizes, and do a load of other fancy tricks, all for less than half the coverage. Right?

Well, no, not really. SOLiD cheated, in a number of ways.

Firstly, they fudged their coverage; while their abstract declares a 17.9-fold coverage, their Results section declares that they aligned over 87 Gb of sequence to the genome. They generated a total of 89 Gb of sequence, but the only way to learn that is to look at their NCBI submission; this is a 31X genome. They got the 17.9-fold coverage by throwing out 41% of their reads, by applying various quality control criteria. Now there is nothing wrong with QC, but there IS something wrong with taking the best 18X coverage of a 31X genome and pretending that you have an 18-fold genome. They will have seriously enriched their sequence data for high-quality reads, and thus have much higher quality sequence than what you or I would have if we ordered 17.9-fold DNA from a sequencing department.
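The coverage arithmetic is easy to check for yourself. The sketch below assumes a mappable genome size of about 2.86 Gb, a figure that is not given in the post and was picked only because it is consistent with 89 Gb being "a 31X genome"; the function name is likewise illustrative.

```python
# ASSUMPTION: mappable genome size in Gb. Not stated in the post; chosen to be
# consistent with 89 Gb of sequence corresponding to roughly 31-fold coverage.
GENOME_GB = 2.86


def fold_coverage(sequence_gb, genome_gb=GENOME_GB):
    """Average fold coverage given total sequence and genome size (both in Gb)."""
    return sequence_gb / genome_gb


raw_coverage = fold_coverage(89.0)      # everything that was generated
reported_coverage = 17.9                # the figure quoted in the abstract
kept_fraction = reported_coverage / raw_coverage
discarded_fraction = 1 - kept_fraction

print(f"raw coverage: {raw_coverage:.1f}X")
print(f"fraction of sequence discarded: {discarded_fraction:.0%}")
```

With these inputs the raw coverage comes out at about 31X and the discarded fraction at a little over 40%, in line with the 41% of reads thrown out; the exact value shifts slightly with the genome size you assume.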

Secondly, they fudged their genotype calling. Their HapMap validation was only performed on homozygous SNPs - but homozygous SNPs are the strength of SOLiD's low-coverage, high-accuracy approach. It is heterozygous SNPs that are the weakness, since no amount of read quality can compensate for not seeing both alleles. And the authors completely fail to report their heterozygous genotype concordance. They do, however, give a graph, and by sitting down with a ruler we can get an estimate of the quality of het calls, and it is something like this: 96% called, 91.2% concordance. When averaged over all HapMap SNPs, SOLiD quality drops to something like 98.1% called, 97.5% concordance. This is not good performance, and this fact is nowhere indicated in the SOLiD paper.
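Why read quality cannot rescue heterozygous calls can be shown with a simple binomial model. This is a sketch under stated assumptions, not the paper's calling method: every read at a heterozygous site samples either allele with probability 0.5, there are no sequencing errors, and an allele counts as "seen" once at least two reads carry it (borrowing the two-read rule mentioned earlier). The function name and the `min_reads` parameter are illustrative.

```python
from math import comb


def p_both_alleles_seen(depth, min_reads=2):
    """Probability that BOTH alleles of a het site get >= min_reads reads.

    Assumes `depth` error-free reads, each drawn from either allele with
    probability 0.5 (a simplification, not the paper's model).
    """
    if depth < 2 * min_reads:
        return 0.0
    # Count the ways to split the reads so each allele gets enough of them.
    favourable = sum(comb(depth, k) for k in range(min_reads, depth - min_reads + 1))
    return favourable / 2 ** depth


for depth in (4, 8, 12, 18):
    print(depth, round(p_both_alleles_seen(depth), 4))
```

At a depth of 4 only 37.5% of heterozygous sites would show both alleles twice, rising to about 93% at a depth of 8, whereas a homozygous site needs just two reads in total. Since actual depth varies from site to site around the average, a genome with modest mean coverage will contain many sites in the low-depth range where heterozygotes are missed or miscalled.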

Finally, a relatively minor gripe, but one that kicked me while I was down. SOLiD claim that you could replicate their study in "just one or two 30-50Gb runs from a SOLiD instrument at an estimated reagent cost of under $30,000". Really? To generate 89 Gb of sequence? It took them 3 runs, and while SOLiD have at times had 50 Gb runs, they are still the exception. I expect it would cost around $40-50k to repeat this sequencing experiment, based on it taking 3-4 runs.

What is going on here?

What follows is idle speculation, and thus (hopefully) not slander. My guess is that SOLiD is attempting to reposition itself as the Low Coverage Sequencing Company. They have still failed to topple Illumina as the market leader, and I expect that they think if enough people start thinking of SOLiD as "the guys who did a good quality 18X genome", all the people who found the low coverage 1000 Genomes Project work sexy will start looking to SOLiD. But to do this, they had to fudge a few things: get the coverage down by cutting out lots of reads, obfuscate the low quality of heterozygous calls, and give an overenthusiastic estimate of how cheap their technology is.

It would be interesting to know why this paper did not end up in *Nature* or *Science*. This is only the second whole-genome sequencing project to not do so [after the SJK Korean genome, also published in Genome Research - DM], and people are saying that this is because whole genomes just aren't new enough any more. But I don't think that is the case; this isn't just any genome, it is both the first SOLiD genome, and by far the most detailed analysis of a newly-sequenced individual we've seen so far, with a lot of new and interesting stuff in it. I think that the reason it is not in one of the big two might have something to do with the methodological flaws in the paper; *Nature* or *Science* might have asked the wrong questions. But perhaps Genome Research was desperate enough to grab a genome, and not ask too many questions about what SOLiD had left out. It is distressing that it took a blogger like Dan Koboldt to start blowing this thing open.