Advanced Sequencing Technologies: methods and goals

Sunday 8 April 2012
Posted by Crystal

Introduction

In the Human Genome Project (HGP), early investments in the development of cost-effective sequencing methods contributed to its resounding success.  Over the course of a decade, through the refinement, parallelization, and automation of established sequencing methods, the HGP motivated a 100-fold reduction in sequencing costs, from 10 dollars per finished base to 10 finished bases per dollar [1].  The relevance and utility of high-throughput sequencing and sequencing centers in the wake of the HGP have been a subject of recent debate.  Nonetheless, a number of academic and commercial efforts are under way to develop new ultra-low-cost sequencing (ULCS) technologies that aim to reduce the cost of DNA sequencing by several orders of magnitude [2].  Here we discuss the motivations for ULCS and review a sampling of the technologies themselves.

Emerging ULCS technologies can be generally classified into one of four groups: (a) micro-electrophoretic methods, (b) sequencing-by-hybridization, (c) cyclic-array sequencing, and (d) single-molecule sequencing.  Most of these technologies are still at relatively early stages of development, making it difficult to gauge when any given method will become practical and live up to expectations.  Yet there is an abundance of potential, and a number of recent technical breakthroughs have contributed to increased momentum and community interest.

Until recently, the motivations for pursuing ULCS technologies were generally defined in terms of the needs and goals of the biomedical and bioagricultural research communities.  This list is long, diverse, and potentially growing (Box 1).  More recently, the primary justification for these efforts has shifted to the notion that the technology could become so affordable that sequencing the full genomes of individual patients would be justified from a health-care perspective [3-6].  "Full individual genotyping" has great potential to impact health-care via contributions to clinical diagnostics and prognostics, risk assessment, and disease prevention.  Here we use the phrase "Personal Genome Project" (PGP) to describe this goal.  As we contemplate the routine sequencing of individual human genomes, we must consider the economic, social, legal, and ethical issues raised by this technology.  What are the potential health-care benefits?  At what cost threshold does the PGP become viable?  What are the risks with respect to issues such as consent, confidentiality, discrimination, and patient psychology?  In addition to reviewing technologies, we will try to address aspects of these questions.

Traditional Sequencing

 

In 1977, two groups familiar with peptide and RNA sequencing methods made a leap forward by harnessing the single-base-resolution separation power of gel electrophoresis [7,8].  Electrophoretic sequencing was widely adopted and rapidly improved.  In 1985, a small group of scientists set the audacious goal of sequencing the entire human genome by 2005 [1,9].  The proposal met with considerable skepticism from the wider community [10,11].  At the time, many felt that the cost of DNA sequencing was far too high (about $10 per base) and the sequencing community too fragmented to complete such a vast undertaking.  Such "large-scale biology" also represented a significant diversion of resources from the traditional question-driven approach that had been so successful in laying molecular biology's foundations.

 

Five years ahead of schedule and slightly under the $3 billion budget, a useful draft sequence of the human genome was published in 2001.  Although the total project cost included years of "production" sequencing with less efficient technologies, the bulk of the draft sequencing itself cost about $300 million.  Amongst the factors underlying the HGP's achievement was the rapid pace of technical and organizational innovation.  Automation in the form of commercial sequencing machines, process miniaturization, optimization of biochemistry, and robust software were all crucial to the exponential "ramp-up" of sequencing throughput.  Managerial and organizational challenges were successfully met both at the level of coordinating the full HGP and within individual sequencing centers.  Possibly more significant was the emergence of an "open" culture with respect to technology, data, and software [1].  In refreshing contrast to the competition and consequent secrecy that have traditionally characterized many scientific disciplines, the major sequencing centers freely shared technical advances and engaged in near-instantaneous data release (i.e. the Bermuda Principles).  This approach not only broadened support for the HGP, but also undoubtedly expedited its completion.  With respect to both technology development and "large-scale biology" projects, the HGP provides excellent lessons for how the scientific community can proceed in future endeavors.


Why continue sequencing?


Sequencing the biosphere.  Through comparative genomics, we are learning a great deal about our own molecular program, as well as those of other organisms in the biosphere [12,13].  There are currently 2x10^10 bases in international databases [14].  The genomes of over 160 organisms have been fully sequenced, as well as parts of the genomes of over 100,000 taxonomic species.  It is both humbling and amusing to compare those numbers to the full complexity of sequences on earth.  By our estimate, a global biomass of over 2x10^18 g contains a total biopolymer sequence on the order of 10^38 residues.  While sequencing the entire biosphere is obviously unnecessary and impractical, it seems clear that we have sequenced only a very small fraction of interesting and useful nucleotides.
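As a rough illustration of how such an estimate can be assembled, consider the back-of-envelope sketch below; the biopolymer fraction and average residue mass are our own illustrative assumptions, not figures from the text.

```python
# Back-of-envelope estimate of total biopolymer residues in the biosphere.
# Assumptions (illustrative, not from the text): ~10% of global biomass is
# biopolymer, and an average residue (nucleotide or amino acid) is ~300 Da.
AVOGADRO = 6.022e23               # particles per mole
global_biomass_g = 2e18           # global biomass in grams (from the text)
biopolymer_fraction = 0.1         # assumed fraction of biomass that is biopolymer
residue_mass_g = 300 / AVOGADRO   # grams per residue at ~300 Da

total_residues = global_biomass_g * biopolymer_fraction / residue_mass_g
print(f"~{total_residues:.0e} residues")  # on the order of 1e38
```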

Impact on biomedical research.  A widely available ULCS technology would improve existing biological and biomedical investigations and expedite the development of a variety of new genomic and technological studies (Box 1).  Foremost amongst these goals might be efforts to determine the genetic basis of susceptibility to both common and rare human diseases.  It is occasionally claimed that all we can afford (and hence all that we want) is information on "common" (i.e. >1% in a population) single-nucleotide polymorphisms (SNPs), or their arrangements (haplotypes) [15], in order to understand so-called multifactorial or complex diseases [16].  In a non-trivial sense, all diseases are components of "complex diseases".  As we get better at genotyping and phenotyping, we simply get better at finding the factors contributing to ever lower penetrance and expressivity.  A focus on common alleles will probably be successful for alleles maintained in human populations by heterozygote advantage (such as the textbook relationship between sickle-cell anemia and malaria) but would miss most of the genetic diseases documented so far [17].  In any case, even for diseases that are amenable to the haplotype-mapping approach, ULCS would allow geneticists to move more quickly from a haplotype that is linked to a phenotype to the causative SNP(s).  Diseases confounded by genetic heterogeneity could be investigated by sequencing specific candidate loci, or whole genomes, across populations of affected individuals [18,19].  It is possible that the cost of doing accurate genotyping (e.g. $5K for 500,000 SNPs [95] and/or 30,000 genes) for tens of thousands of individuals will make more sense in the context of normal health care than in stand-alone epidemiology.  Whether for SNPs or personal genomes, this will require high levels of informed consent and security [20].

Another broad area that ULCS could significantly impact is cancer biology.  Cancer is fundamentally a disease of the genome: cycles of somatic mutation followed by clonal expansion give rise to malignant cells.  Epidemiology suggests that mutations in three to seven genes are necessary to cause malignancy, but it is also clear that different sets of genes are mutated in different cancers.  Furthermore, phenomena of genomic instability such as aneuploidy or chromosomal rearrangements can play a role in cancer progression [21].  Although a remarkable variety of different genomic mutations and aberrations have been found in tumors, patterns are beginning to emerge.  By looking for disrupted pathways rather than just individual genes, researchers are obtaining a better understanding of tumorigenesis [22].  The ability to sequence and compare complete genomes from a large number of normal, neoplastic, and malignant cells would allow us to exhaustively catalogue the molecular pathways and checkpoints that are mutated in cancer.  Such a comprehensive approach would help us to more fully decipher the combinations of mutations that in concert give rise to cancer, and thus facilitate a deeper understanding of the cellular functions that are perturbed during tumorigenesis.

ULCS has the potential to facilitate new research paradigms.  Mutagenesis studies in model and non-model organisms would be more powerful if one could inexpensively sequence large genomic regions or complete genomes across large panels of mutant pedigrees.  In studying the diversity generated by the natural mutagenesis of a specific immune response, sequencing rearranged B-cell and T-cell receptor loci across a large panel of B-cells and T-cells could become routine, rather than a major undertaking.  ULCS would also benefit the emerging fields of synthetic biology and genome engineering, both of which are becoming powerful tools for perturbing or designing complex biological systems.  This would enable the rapid selection or construction of new enzymes, new genetic networks, or perhaps even new chromosomes.  Even further afield than the above synthetic applications loom DNA computing [23] and DNA as ultracompact memory.  DNA computing uses only standard recombinant techniques for DNA editing, amplification, and detection, but because these techniques operate on many strands of DNA in parallel, the result is highly efficient and massively parallel molecular computing [24].  Furthermore, since a gram of dehydrated DNA contains approximately 10^21 bits of information, DNA could potentially store data at a density eleven orders of magnitude higher than today's DVDs [24].
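The eleven-orders-of-magnitude claim is easy to sanity-check.  In the sketch below, the DVD capacity (4.7 GB) and disc mass (~16 g) are our assumptions rather than figures from the text:

```python
import math

# Compare DNA storage density (~1e21 bits per gram, from the text) with a DVD.
# Assumed: a single-layer DVD holds 4.7 GB and a disc weighs ~16 g.
dna_bits_per_gram = 1e21
dvd_bits_per_gram = (4.7e9 * 8) / 16.0   # capacity in bits over disc mass

ratio = dna_bits_per_gram / dvd_bits_per_gram
print(f"density ratio: {ratio:.1e}")                            # ~4e11
print(f"~{math.floor(math.log10(ratio))} orders of magnitude")  # ~11
```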

The Personal Genome Project.  Perhaps the most compelling reason to pursue ULCS technology is the impact that it could have on human health via the sequencing of "personal genomes" as a component of individualized health-care.  The current level of health-care spending for the general U.S. population is approximately $5,000 per capita per year [25].  Amortized over the 76-year average lifespan for which it would be useful, a $1,000 genome would only have to produce $13 of benefit per year to "break even" in terms of cost-effectiveness.  Straightforward ways in which "full individual genotypes" could benefit patient care include clinical diagnostics and prognostics for both common and rare inherited conditions, risk assessment and prevention, and informing pharmacogenetic contraindications.  Our growing understanding of how specific genotypes and their combinations impact and determine the phenome will only increase the value of personal genomes.  If even rare inherited mutations can be comprehensively surveyed for less than some threshold cost (e.g. $5,000), it is likely that an autocatalytic paradigm shift could occur, with each new genome/phenome fact making the process more attractive, and hence leading to more genomes being analyzed.  The issue now is how this catalysis might get started.
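The break-even figure is simple amortization; a minimal check using only the numbers quoted above:

```python
# Amortize a one-time $1,000 genome over a 76-year average lifespan.
genome_cost_usd = 1000.0
lifespan_years = 76.0
print(f"break-even benefit: ${genome_cost_usd / lifespan_years:.2f} per year")  # ~$13
```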

Is the PGP feasible?  One reason for the overwhelming success of sequencing is that the number of nucleotides that can be sequenced at a given price has increased exponentially for the past 30 years (Figure 1).  This exponential trend is by no means guaranteed, and realizing a PGP in the next five years probably requires a greater commitment to technology development than was available in the pragmatic and production-oriented HGP (Figure 1).  How might this be achieved?  Obviously we cannot review technologies that are secret, but a number of truly innovative approaches have now been made fully or partially public, marking this as an important time to compare and to conceptually integrate these strategies.  We review four major approaches below (also see Figures 2 and 3).

Emerging ULCS technologies

What are the specifications for a ULCS technology capable of delivering low-cost human genomes?  Key considerations are (a) cost per raw base, (b) throughput per instrument, (c) accuracy per raw base, and (d) read-length per independent read.  With respect to these parameters, let us consider what would be required to resequence a human genome with reasonably high accuracy at a cost of $1,000.  Accuracy goals will depend on the application, ranging from 21-base RNA tags [26] to nearly error-free genomes (<10^-10).  At a minimum, the error rate of any resequencing or genotyping method must be considerably lower than the level of variation that one is trying to detect [27].  As any two human chromosomes differ at ~1 in every 1,000 bases, an error rate of 1 per 100,000 bp is a reasonable goal.  If individual errors are truly random with probability p, then the overall consensus error rate for r reads is approximately:
p^((r+1)/2) (for odd r).
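A small Monte Carlo simulation can sanity-check this approximation.  The model below is our own sketch of one plausible reading of the assumptions: errors are independent, spread uniformly over the three incorrect bases, a base call requires a strict majority of reads, and positions with no majority are treated as no-calls rather than miscalls.

```python
import random
from collections import Counter

def consensus_miscall_rate(p, r, trials=2_000_000, seed=1):
    """Estimate how often a WRONG base wins a strict majority of r reads,
    when each read errs with probability p (uniformly over 3 wrong bases).
    Positions with no strict majority are treated as no-calls, not errors."""
    rng = random.Random(seed)
    miscalls = 0
    for _ in range(trials):
        calls = [rng.choice("CGT") if rng.random() < p else "A"
                 for _ in range(r)]                    # true base is "A"
        base, count = Counter(calls).most_common(1)[0]
        if base != "A" and count > r // 2:             # wrong base won a majority
            miscalls += 1
    return miscalls / trials

p, r = 0.003, 3   # 99.7% raw accuracy, 3x coverage
print(f"simulated  : {consensus_miscall_rate(p, r):.1e}")  # ~1e-5 (statistically noisy)
print(f"p^((r+1)/2): {p ** ((r + 1) / 2):.1e}")            # 9.0e-06
```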
If a given method can achieve ~99.7% raw-base accuracy (on par with the state of the art), then 3x coverage of each base will yield the desired error rate.  However, ensuring a minimum 3x coverage of >95% of the bases of a diploid genome (6x10^9 bp) requires ~6.5x average coverage, or ~40 billion raw bases.  Achieving an accurate $1,000 genome will thus require that costs approach ~40 million raw bases per dollar, a 4- to 5-log improvement over current methods.  Although they could someday approach the cost of a $2K computer, today's integrated genomics devices typically cost $50K to $500K.  Assuming that the capital and operating costs of any new instrument will be similar to those of conventional electrophoretic sequencers, the bulk of the improvement will have to derive from an increase in the rate of sequence acquisition per device: from ~12 bases per instrument-second to ~450,000 bases per instrument-second.

With respect to read-length, it is substantially advantageous to be resequencing rather than de novo sequencing a genome.  No assembly is required; resequencing requires only that one can match sequencing reads to unique locations within an assembled canonical genome sequence, and then determine if and how a given read differs from its corresponding canonical sequence.  In a random-base model, one would expect nearly all 20 bp reads to be unique in the genome (4^20 >> 3x10^9).  In practice, owing to repetitive elements, tandem repeats, low-complexity sequence, and the substantial fraction of recently duplicated sequence [28], only ~73% of 20 bp genomic "reads" can be assigned to a single unique location in the current draft of the human genome.  Achieving >95% uniqueness, a modest goal, will require ~60 bp reads.  There are diminishing returns with longer read-lengths; achieving >99% uniqueness will require >200 bp reads.  It is also worth noting that if one is only concerned with n-mers derived from protein-coding sequences, ~88% of 20-mers and ~93% of 30-mers can be matched to a unique location in the genome.
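The ~6.5x coverage figure can be reproduced with a simple Poisson model of randomly placed reads; the sketch below (the Poisson assumption and step size are ours) also restates the throughput arithmetic:

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam ** i / math.factorial(i)
               for i in range(k + 1))

# Smallest mean coverage such that >95% of bases are covered >= 3 times,
# assuming read starts are uniform at random (Poisson-distributed depth).
mean_cov = 1.0
while 1.0 - poisson_cdf(2, mean_cov) < 0.95:
    mean_cov += 0.1
print(f"required mean coverage: ~{mean_cov:.1f}x")  # ~6.3x under this model

# Raw-base and throughput arithmetic for a $1,000 diploid genome.
raw_bases = 6e9 * 6.5                               # ~4e10 raw bases at ~6.5x
print(f"raw bases per dollar: ~{raw_bases / 1000:.1e}")       # ~4e7
print(f"days per genome at 450,000 bases/s: "
      f"{raw_bases / 450_000 / 86_400:.1f}")                  # ~1 day
```

Under this idealized model the requirement comes out slightly below the ~6.5x quoted above; real coverage is non-uniform, so the figure in the text is the safer planning number.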

Although this is only one scenario, alternative scenarios will generally involve some trade-off (e.g. lower accuracy at higher throughput, or higher accuracy at lower throughput).  With the above assumptions, a resequencing instrument capable of delivering a $1,000 human genome of reasonable coverage and high accuracy will need to achieve >60 bp reads with 99.7% raw-base accuracy, acquiring data at a rate of ~450,000 bases per second.  Keeping these numbers in mind, let us review each of the technologies of interest.

Micro-electrophoretic sequencing.  The vast preponderance of DNA sequence has been obtained via the Sanger method, based on the electrophoretic separation, with single-base resolution, of dideoxy-terminated DNA fragments.  Using 384-capillary automated sequencing machines, costs for heavily optimized sequencing centers are currently approaching $1 per 1,000 bp raw sequencing read and a throughput of ~12 bases per instrument-second.  Typically, 99.99% accuracy can be achieved with as few as three raw reads covering a given nucleotide.  Regions of sequence that have proven difficult for Sanger sequencing can be rendered accessible via mutagenesis techniques [29].  A number of teams, including the Mathies group and the Whitehead BioMEMS laboratory, are currently investigating whether costs can be further reduced by additional multiplexing and miniaturization [30,31].  Borrowing microfabrication techniques developed by the semiconductor industry, they are working to create single devices that perform DNA amplification, purification, and sequencing in an integrated fashion [32].

The primary advantage of this approach is that the fundamental principles of the sequencing method are so well tested.  Electrophoretic sequencing has already been used to successfully sequence on the order of 10^11 nucleotides.  Although the approaches being taken (e.g. miniaturization and process integration) will certainly yield significant cost reductions, achieving 4 to 5 logs of improvement may require more radical changes to the underlying engineering of electrophoretic sequencers.  Nevertheless, given that other ULCS methods are still far from proven, micro-electrophoretic sequencing may be a relatively safe bet, with a higher short-term probability of delivering reasonably low-cost genome resequencing (i.e. the "$100,000 genome").
