Sequencing a specific region of a genome

First off, I'm new to bioinformatics and I am learning about DNA sequencing.

Let's say I knew that a specific region of a genome contained information about a disease (whether or not a person had the disease).

It would make sense that we would only want to sequence that part of the genome in order to make a detector for this disease.

Would it be possible to 'cut-out' this region of the genome and only sequence that part (so we don't have to sequence the entire thing)? If not, how would we sequence only this part of the genome (it doesn't make sense to sequence the other parts as they don't give us information).

Thanks in advance.


Yes, it is possible to sequence only a specific region of the genome. The method you are describing is called targeted resequencing. Resequencing means sequencing something that has already been sequenced, so instead of assembling all of your reads from scratch you can align them to a reference sequence (in your case, the human reference genome). Targeted resequencing means you sequence only a specific region, such as a gene. This requires enriching the sample for the DNA you wish to sequence, which can be done by PCR amplification (using primers that flank your desired sequence) or by hybridisation capture (using probes complementary to your desired sequence that are fixed to a surface), among other methods.
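To make the "primers that flank your desired sequence" idea concrete, here is a minimal in-silico sketch in Python. It only mimics what PCR enrichment selects; it is not a primer design tool. The toy reference and both primer sequences are made up for illustration, and exact string matching stands in for real primer annealing.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def extract_amplicon(reference, forward_primer, reverse_primer):
    """Return the stretch of `reference` bounded by the two primers, or None.

    The forward primer matches the top strand directly. The reverse primer
    anneals to the top strand, so its reverse complement is what appears in
    `reference`.
    """
    reference = reference.upper()
    start = reference.find(forward_primer.upper())
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse_primer)
    end = reference.find(reverse_site, start + len(forward_primer))
    if end == -1:
        return None
    return reference[start:end + len(reverse_site)]


# Hypothetical toy reference: flank + forward site + target + reverse site + flank.
toy_reference = "AAAA" + "ACGGTC" + "GGGTTTCCC" + "CTGCAA" + "TTTT"
amplicon = extract_amplicon(toy_reference, "ACGGTC", "TTGCAG")
print(amplicon)
```

Only the region between (and including) the two primer sites is returned, which is the part you would go on to sequence and align to the reference.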

For information on methods for enriching a DNA sequence, give this a read:

Mamanova, L. et al. Target-enrichment strategies for next-generation sequencing. Nat. Methods. 7, 111-118 (2010).


Efficient sequence-specific isolation of DNA fragments and chromatin by in vitro enChIP technology using recombinant CRISPR ribonucleoproteins

The clustered regularly interspaced short palindromic repeats (CRISPR) system is widely used for various biological applications, including genome editing. We developed engineered DNA-binding molecule-mediated chromatin immunoprecipitation (enChIP) using CRISPR to isolate target genomic regions from cells for their biochemical characterization. In this study, we developed 'in vitro enChIP' using recombinant CRISPR ribonucleoproteins (RNPs) to isolate target genomic regions. in vitro enChIP has the great advantage over conventional enChIP of not requiring expression of CRISPR complexes in cells. We first showed that in vitro enChIP using recombinant CRISPR RNPs can be used to isolate target DNA from mixtures of purified DNA in a sequence-specific manner. In addition, we showed that this technology can be used to efficiently isolate target genomic regions, while retaining their intracellular molecular interactions, with negligible contamination from irrelevant genomic regions. Thus, in vitro enChIP technology is of potential use for sequence-specific isolation of DNA, as well as for identification of molecules interacting with genomic regions of interest in vivo in combination with downstream analysis.

© 2016 The Authors. Genes to Cells published by Molecular Biology Society of Japan and John Wiley & Sons Australia, Ltd.


Long-read sequencing to understand genome biology and cell function

Determining the sequence of DNA and RNA molecules has a huge impact on the understanding of cell biology and function. Recent advancements in next-generation short-read sequencing (NGS) technologies, drops in cost and a resolution down to the single-cell level shaped our current view on genome structure and function. Third-generation sequencing (TGS) methods further complete the knowledge about these processes based on long reads and the ability to analyze DNA or RNA at single molecule level. Long-read sequencing provides additional possibilities to study genome architecture and the composition of highly complex regions and to determine epigenetic modifications of nucleotide bases at a genome-wide level. We discuss the principles and advancements of long-read sequencing and its applications in genome biology.

Keywords: Genomics; Long-read sequencing; Nanopore sequencing; SMRT sequencing; Third-generation sequencing.


Results

Study design

To assess heterogeneity within gliomas, we dissected each of 14 tumors into 3 to 5 regions per tumor (Figure 1A; Table S1 in Additional file 1). We used the smMIP (single molecule molecular inversion probe) assay on genomic DNA isolated from each region to identify single nucleotide variants and high level copy amplifications (Figure 1B; Figure S1 in Additional file 1). smMIP probes capture target sequence into covalently linked circular molecules after polymerase extension and ligation. Following barcoding-PCR, sample pooling, sequencing, deduplication and alignment, we identified high level amplifications and point mutations (Figure 1B,C; Figure S1 in Additional file 1).

Figure 1. Experimental approach. (A) Each tumor was divided into three to five regions to assay intratumoral heterogeneity. Each individual region was subdivided into four pieces for use in next generation sequencing (NGS), histology, cell culture and xenotransplantation. (B) Molecular inversion probe method. Oligonucleotide probes were previously designed against 33 cancer genes [6]. MIPs have a common backbone sequence, molecular tag sequence as well as targeting arms homologous to regions flanking targets of interest. After polymerase extension and ligation, targeted sequence is captured within a circular molecule. Captured sequences are amplified in a barcoding-PCR reaction and multiple samples are pooled and sequenced on the same lane. After tag-correction (not shown), reads corresponding to each tumor region are mapped to the human reference sequence to be used to identify copy number amplifications and point mutations specific to one region or another. Additional details are provided in Figure S1 in Additional file 1. (C) Example of comparisons: MIP captures of regions C and D can detect both TP53 point mutation heterogeneity and EGFR amplification heterogeneity within a tumor. Tumors with mutational heterogeneity were required to share either a point mutation or copy number alteration (in this case mutation of PTEN) across all regions to ensure that differences in observed mutation were not due to varying levels of tumor cellularity.
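The deduplication step can be illustrated with a short Python sketch. This is not the authors' pipeline: it assumes each read has already been assigned a probe identifier and a molecular tag, that reads in a group have equal length, and that a simple per-position majority vote is an acceptable consensus. The probe names, tags and sequences below are hypothetical.

```python
from collections import Counter, defaultdict


def collapse_by_tag(reads):
    """Collapse reads sharing (probe, tag) into one consensus sequence each.

    `reads` is an iterable of (probe_id, molecular_tag, sequence) tuples.
    Reads with the same probe and tag are treated as PCR copies of a single
    captured molecule.
    """
    groups = defaultdict(list)
    for probe_id, tag, sequence in reads:
        groups[(probe_id, tag)].append(sequence)

    consensus = {}
    for key, sequences in groups.items():
        # Majority vote at each position; ties go to the base seen first.
        consensus[key] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*sequences)
        )
    return consensus


# Hypothetical reads: three PCR copies of one molecule (one with an error),
# plus two other independent molecules.
example_reads = [
    ("TP53_ex7", "AACGT", "ACGT"),
    ("TP53_ex7", "AACGT", "ACGT"),
    ("TP53_ex7", "AACGT", "ACTT"),
    ("TP53_ex7", "GGCTA", "ACAT"),
    ("EGFR_ex2", "AACGT", "TTTT"),
]
molecules = collapse_by_tag(example_reads)
print(len(molecules), "unique captured molecules")
```

Counting unique (probe, tag) pairs rather than raw reads is what lets coverage reflect the number of captured molecules instead of PCR duplicates.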

Across the 14 tumors and 33 genes considered in this analysis, we identified a total of 33 putative protein-altering mutations (Tables S1 and S2 in Additional file 1). Tumors had between zero and 16 putative protein-altering mutations, with a median of two. TP53 was the most commonly mutated gene, with mutations found in 8 out of 14 tumors (Figure 2A; Table S3 in Additional file 1). One tumor, BI12, had many more candidate somatic mutations than other tumors (n = 16 versus median n = 2 in other tumors). Mutations in this GBM were predominantly G > T (or C > A) transversions (8 of 16 total), possibly representing mutation from unrepaired 8-oxo-guanine damage. Most mutations were observed across all tumor regions of BI12, consistent with a defect in DNA repair arising early in the development of the tumor.
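A G > T change on one strand is the same event as C > A on the opposite strand, which is why the two are reported together. The sketch below shows one common way to tally a mutation spectrum by folding every substitution onto a pyrimidine reference base. The list of calls is hypothetical and is not taken from the study's tables.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
PURINES = {"A", "G"}


def canonical_substitution(ref, alt):
    """Report a substitution with a pyrimidine (C or T) reference base."""
    ref, alt = ref.upper(), alt.upper()
    if ref in PURINES:
        # Describe the same event on the opposite strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def is_transversion(ref, alt):
    """True when a purine is swapped for a pyrimidine or vice versa."""
    return (ref.upper() in PURINES) != (alt.upper() in PURINES)


# Hypothetical (reference base, alternate base) calls.
calls = [("G", "T"), ("C", "A"), ("C", "T"), ("G", "A")]
spectrum = Counter(canonical_substitution(r, a) for r, a in calls)
print(spectrum)
```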

Figure 2. Summary of heterogeneity observed across all samples. (A) Protein-altering mutations detected across all tumor regions. Genes mutated twice in the same tumor region are not identified here but can be found within a table of all mutations (Table S3 in Additional file 1). (B) High level gene amplifications detected by smMIP assay. Copy number was estimated by comparing all tumor samples against 12_X, a universal control from BI12 (see Figure S2 in Additional file 1 for analysis using patient matched controls). ‘Amplification’ indicates genes with coverage three-fold higher than median coverage across a sample. ‘High Amplification’ indicates genes with coverage six-fold higher than median coverage across a sample. Region X refers to brain tissue grossly uninvolved by tumor. Our approach would miss any low-level gene amplifications within these tumors.

To identify high level gene amplifications in tumors, we compared read depth of smMIP-targeted regions in each tumor against that of a control tissue. As smMIP sequencing suggested that a subset of control tissues were contaminated with tumor cells, we performed analyses using either patient-matched controls (Figure S2 in Additional file 1) or a ‘universal’ control (Figure 2B). For the latter we selected control tissue from tumor BI12, as it appeared to have the least tumor contamination based on allele fraction of known pathogenic point mutations, and restricted copy number analyses to targets with >30× coverage in control tissue from BI12 as well as targets whose GC percentage ranged from 30 to 60% (n = [illegible] capture probes). A careful review of discrepant calls when using patient-matched versus a universal control indicated that use of the universal control was more sensitive in identifying bona fide amplification events (as confirmed with Taqman assays) secondary to the contamination of a subset of control tissues with tumor cells. After applying our filters (see Materials and methods), a total of 21 genes could be assayed in a total of 62 regions across 14 tumors (Figure 2B).

The ratio of coverage of each probe was calculated relative to the control tissue (from BI12). We used DNACopy [8] to segment genes and obtain R, the mean ratio of coverage relative to control for each gene. We estimated the copy number for each gene by dividing R for each gene by the median value of R across all genes for each tissue. Genes with ratios above 3 were called as amplified. Genes with ratios above 6 were called as highly amplified. We did not measure deletion of genes using this method.
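The arithmetic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: a plain per-gene mean of probe ratios stands in for the DNACopy segmentation step, control coverage is assumed to be non-zero for every probe, and all coverage values are invented.

```python
from statistics import mean, median


def estimate_copy_ratios(tumor_cov, control_cov):
    """Return per-gene coverage ratios normalised to the sample median.

    Both arguments map gene name -> list of per-probe coverage values, with
    probes in the same order in tumor and control.
    """
    r = {}
    for gene, tumor_probes in tumor_cov.items():
        ratios = [t / c for t, c in zip(tumor_probes, control_cov[gene])]
        r[gene] = mean(ratios)  # assumption: simple mean instead of DNACopy
    centre = median(r.values())  # median R across all genes for this tissue
    return {gene: value / centre for gene, value in r.items()}


def call_amplification(ratio):
    """Apply the thresholds from the text: above 3 and above 6."""
    if ratio > 6:
        return "highly amplified"
    if ratio > 3:
        return "amplified"
    return "not amplified"


# Hypothetical coverage for one tumor region and the control tissue.
tumor = {"EGFR": [700, 700], "KIT": [400, 400], "PTEN": [100, 100],
         "TP53": [100, 100], "RB1": [50, 50]}
control = {gene: [100, 100] for gene in tumor}

copy_ratios = estimate_copy_ratios(tumor, control)
calls = {gene: call_amplification(ratio) for gene, ratio in copy_ratios.items()}
print(calls)
```

Dividing by the median across genes corrects for overall differences in sequencing depth between the tumor region and the control, on the assumption that most genes in the panel are not amplified. As the text notes, this scheme does not attempt to call deletions.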

This process identified five tumors with gene amplifications, with three having one or more regions with a highly amplified gene (Figure 2B). Three tumors had amplification of both PDGFRA and KIT, and three tumors had amplification of EGFR. We validated copy number estimates for a subset of calls using a variety of different methods, including Taqman quantitative PCR (qPCR; across all tumors for EGFR, and in tumors BI05, BI06 and BI15 for PDGFRA), as well as whole genome sequencing (in tumor BI15 for EGFR). MIP copy number estimates of EGFR were highly correlated (R² = 0.90) with delta Ct obtained by Taqman qPCR when compared across all 62 regions sequenced (Figure S3 in Additional file 1). Additionally, for five tumor regions of BI15 that were subjected to light whole-genome sequencing, EGFR copy number estimates were consistent between whole genome sequencing and smMIP techniques (Supplementary methods and Figure S4 in Additional file 1).

Tumors in which only a subset of regions possess an amplification or point mutation with no other mutation shared across regions can be the result of either mutational heterogeneity within a tumor or varying levels of tumor content between different tumor regions. As an example, tumor BI15 was called as amplified for EGFR in two out of five regions with no other somatic mutations/point mutations detected across the tumor (Figures S5 and S6 in Additional file 1). Upon close inspection of histologic slides prepared from adjacent tissue, the observed difference in amplification was most likely due to lower tumor cellularity within other regions of this tumor rather than intratumoral genetic heterogeneity. This was also seen in tumor BI04, where one region without detectable PDGFRA amplification also had lower frequencies of a TP53 mutation seen across all regions. For this reason, we chose to restrict our interpretation of intratumoral heterogeneity to tumors in which all regions also shared a point mutation or gene amplification. Three tumors met these criteria and are described below.

Spatial heterogeneity of TP53 and RB1 point mutations

One tumor exhibited clear spatial heterogeneity with respect to point mutations within the 33 genes investigated (Figure 3). BI09, an IDH1-mutant anaplastic oligodendroglioma, had a high allele fraction (>30% reads supporting mutation) inactivating mutation (R248H) in TP53 in only two regions of the tumor (A and B). This tumor had high allele fraction mutations in RB1 exclusively in two other regions (D and E) within the same tumor. Both TP53 and RB1 mutations were present at trace levels (<1%) within region C. As clinical workup indicated that BI09 had an IDH1 mutation, we investigated all regions of this tumor by Sanger sequencing and found that regions A to E shared the IDH1 R132H mutation. Sanger sequencing also validated the TP53 mutation in regions A and B as well as the RB1 mutation in regions D and E (Figure S7 in Additional file 1). Immunohistochemistry of p53 and IDH1-R132H expression on tissue adjacent to regions A to E provided additional confirmatory evidence (Figure S8 in Additional file 1). These findings are consistent with an IDH1-mutant tumor subsequently diverging to form subclones with mutations in RB1 and TP53 [9,10]. A neuropathologist (LFG-C) scored the grade and diagnosis for each of these samples blinded to the mutation type. Interestingly, the presence of TP53 mutation correlated with the higher grade histology (Table S4 in Additional file 1). The clinical significance is unknown but this serves as a potential example of how genomic heterogeneity may affect histology of a tumor.
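Allele fraction is simply the share of reads at a position that support the mutant allele. The Python sketch below computes it per region and sorts regions into bins. The cut-offs of 30% and 1%, the "intermediate" label and all read counts are assumptions chosen for illustration; they are not data from tumor BI09.

```python
def allele_fraction(alt_reads, total_reads):
    """Fraction of reads at a position that carry the mutant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def classify_fraction(fraction, high=0.30, trace=0.01):
    """Bin an allele fraction; the cut-offs are illustrative assumptions."""
    if fraction > high:
        return "high"
    if fraction < trace:
        return "trace"
    return "intermediate"


# Hypothetical (mutant reads, total reads) at one position in three regions.
region_counts = {"A": (450, 1000), "C": (5, 1000), "E": (0, 1000)}
region_calls = {
    region: classify_fraction(allele_fraction(alt, total))
    for region, (alt, total) in region_counts.items()
}
print(region_calls)
```

Comparing these per-region values for two different genes is what reveals a pattern such as one mutation dominating some regions while a second mutation dominates others.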

Figure 3. Intratumoral heterogeneity of TP53 and RB1 determined from smMIP sequencing. Tumor BI09 was sectioned into five regions (A to E). Brain tissue grossly uninvolved by tumor was used as a control (X). Each region was assayed for mutations in 33 genes, including TP53 and RB1. This plot shows the allele balance of TP53 and RB1 mutations within each tumor region. Regions A and B have a high allele fraction mutation in TP53, while regions D and E have a high allele fraction mutation in RB1. Sanger results validated TP53 and RB1 mutations in each region and also revealed that all regions shared a R132H mutation in IDH1 (Figure S7 in Additional file 1).

Spatial heterogeneity of PDGFRA and KIT amplifications

Our smMIP technique detected amplification of PDGFRA, KIT and EGFR within tumor BI05, an IDH1-wild type glioblastoma. In this tumor EGFR amplification was seen across all tumor regions, while amplification of both PDGFRA and KIT was detected in two of five regions (Figure 4A). As KIT is located near PDGFRA on chromosome 4, shared amplification of these genes is expected [11]. Taqman real-time PCR assays performed in quadruplicate confirmed both the amplification in EGFR and the amplification in PDGFRA across all assayed regions (Figure 4B). Immunohistochemistry of PDGFRA and EGFR on tissue adjacent to regions A to E provided additional confirmatory evidence (Figure S9 in Additional file 1).

Figure 4. Heterogeneity of PDGFRA amplification in BI05. (A) Copy number estimates based on smMIP probe data. PDGFRA amplification (labeled) occurs in regions A and B with no amplification in regions C, D or E. (B) Results from Taqman qPCR targeting both PDGFRA and EGFR performed in quadruplicate. PDGFRA amplification occurs in regions A and B (between four- and eight-fold amplification) with no significant amplification in regions C, D and E. EGFR amplification occurs in all regions of BI05, consistent with MIP sequencing results. Heterogeneity of PDGFRA amplification was also confirmed through immunohistochemistry of regions A and E (Figure S9 in Additional file 1). Error bars represent the mean +/- one standard deviation from quadruplicate values.

Similarly, we detected heterogeneity of PDGFRA amplification within BI06, an IDH1-mutant glioblastoma. This tumor had amplification of PDGFRA and KIT in region A not detected within other regions (Figure 5A). Taqman qPCR confirmed amplification of region A, mild amplification in region B and no amplification in regions C, D and E (Figure 5B). All other regions of this tumor had somatic mutations in PTEN, such that reduced tumor cellularity is an unlikely explanation for our observations.

Figure 5. Heterogeneity of PDGFRA amplification in BI06. (A) Copy number estimates based on smMIP probe data. PDGFRA amplification (labeled) occurs in region A with only mild amplification in region B and no clear detectable amplification in regions C, D or E. (B) Results from Taqman qPCR targeting PDGFRA performed in quadruplicate. Region X refers to a region of brain tissue grossly uninvolved by tumor. PDGFRA amplification occurs in region A (approximately four-fold amplification) with only mild amplification in regions B, C, D and E. Error bars represent the mean +/- one standard deviation from quadruplicate values.

Additional cases of heterogeneity are potential passenger mutations

A missense mutation in KRAS was observed at moderate allele fraction (10% of reads) in region D of the glioblastoma BI12 and was not detected in at least one other region (Table S3 in Additional file 1). As this mutation does not occur within known mutation hotspots and is in a tumor with signs of hypermutation (BI12), the clinical significance of this heterogeneity remains unclear. Other somatic point mutations are heterogeneous across an individual tumor but occur within genes that have another, ubiquitously distributed mutation. BI12 has missense mutations in PTEN that are observed in regions A, B and C and not in region D (Table S3 in Additional file 1). This tumor also has another high allele fraction mutation in this gene that is present across all regions of this tumor. A similar scenario is seen in the astrocytoma BI08. Regions D and E of this tumor have low allele fraction point mutations in TP53, but all regions share another high allele fraction mutation in the same gene. These results suggest that the heterogeneously distributed mutations are more likely secondary passenger mutations that occurred after PTEN and TP53 were inactivated in tumors BI12 and BI08, respectively (Table S3 in Additional file 1). An alternative possibility is that these tumors contain distinct subpopulations in which the same gene was inactivated by different mutations.



Handyside, A. H., Kontogianni, E. H., Hardy, K. & Winston, R. M. Pregnancies from biopsied human preimplantation embryos sexed by Y-specific DNA amplification. Nature 344, 768–770 (1990).

Geraedts, J. et al. Polar body array CGH for prediction of the status of the corresponding oocyte. Part I: clinical results. Hum. Reprod. 26, 3173–3180 (2011).

Alfarawati, S., Fragouli, E., Colls, P. & Wells, D. First births after preimplantation genetic diagnosis of structural chromosome abnormalities using comparative genomic hybridization and microarray analysis. Hum. Reprod. 26, 1560–1574 (2011).

Cancer Genome Atlas Research Network et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).

Ding, L. et al. Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature 481, 506–510 (2012).

Gerlinger, M. et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N. Engl. J. Med. 366, 883–892 (2012).

Hou, Y. et al. Single-cell exome sequencing and monoclonal evolution of a JAK2-negative myeloproliferative neoplasm. Cell 148, 873–885 (2012).

Xu, X. et al. Single-cell exome sequencing reveals single-nucleotide mutation characteristics of a kidney tumor. Cell 148, 886–895 (2012).

Li, Y. et al. Single-cell sequencing analysis characterizes common and cell-lineage-specific mutations in a muscle-invasive bladder cancer. Gigascience 1, 12 (2012).

Yu, C. et al. Discovery of biclonal origin and a novel oncogene SLC12A5 in colon cancer by single-cell sequencing. Cell Res. 24, 701–712 (2014).

Ni, X. et al. Reproducible copy number variation patterns among single circulating tumor cells of lung cancer patients. Proc. Natl Acad. Sci. USA 110, 21083–21088 (2013).

Lohr, J. G. et al. Whole-exome sequencing of circulating tumor cells provides a window into metastatic prostate cancer. Nat. Biotechnol. 32, 479–484 (2014).

Potter, N. E. et al. Single-cell mutational profiling and clonal phylogeny in cancer. Genome Res. 23, 2115–2125 (2013).

Papaemmanuil, E. et al. RAG-mediated recombination is the predominant driver of oncogenic rearrangement in ETV6RUNX1 acute lymphoblastic leukemia. Nat. Genet. 46, 116–125 (2014).

Jan, M. et al. Clonal evolution of preleukemic hematopoietic stem cells precedes human acute myeloid leukemia. Sci. Transl Med. 4, 149ra118 (2012).

Shintaku, H., Nishikii, H., Marshall, L. A., Kotera, H. & Santiago, J. G. On-chip separation and analysis of RNA and DNA from single cells. Anal. Chem. 86, 1953–1957 (2014).

Macaulay, I. C. et al. G&T-seq: parallel sequencing of single-cell genomes and transcriptomes. Nat. Methods 12, 519–522 (2015).

Dey, S. S., Kester, L., Spanjaard, B., Bienko, M. & van Oudenaarden, A. Integrated genome and transcriptome sequencing of the same cell. Nat. Biotechnol. 33, 285–289 (2015).

Stahlberg, A., Thomsen, C., Ruff, D. & Aman, P. Quantitative PCR analysis of DNA, RNAs, and proteins in the same single cell. Clin. Chem. 58, 1682–1691 (2012).

Lee, J. H. et al. Highly multiplexed subcellular RNA sequencing in situ. Science 343, 1360–1363 (2014). This study presents a method for acquiring single-cell transcriptomic data while retaining intercellular and intracellular spatial information.

Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 33, 495–502 (2015).

Achim, K. et al. High-throughput spatial mapping of single-cell RNA-seq data to tissue of origin. Nat. Biotechnol. 33, 503–509 (2015).

Yachida, S. & Iacobuzio-Donahue, C. A. Evolution and dynamics of pancreatic cancer progression. Oncogene 32, 5253–5260 (2013).


Genomics of Infectious Diseases and Private Industry

2.1 Sanger Sequencing

Sanger sequencing uses the sequencing-by-synthesis (SBS) approach, in which a DNA polymerase generates DNA reads from a template, the DNA molecule to be analyzed. The identity of the nucleotide at a given position is now determined using base-specific fluorescent dyes.

Sanger sequencing, although too laborious and expensive for WGS, remains routinely used when specific genes or gene fragments must be sequenced, for example for viral or bacterial genotyping, or for resistance testing when SNPs are associated with specific genome regions. For bacterial WGS, biological amplification by culture and single-colony picking is needed, whereas PCR amplification of specific genes is performed for both viruses and bacteria before the amplicons are sequenced. Since 1987, Sanger sequencing has mostly been done on ABI instruments (Thermo Fisher Scientific), a brand that now offers a series of capillary electrophoresis sequencers ranging from 1 to 96 capillaries to cover the throughput needs of different laboratories. All current ABI DNA sequencing kits use cycle sequencing protocols with one of two chemistries: dye primer chemistry or dye terminator chemistry.


Methods

Animals and sample collection

Six rams, three at the fetal stage (110 days old, 1.36 ± 0.14 kg) and three at the adult stage (2 years old, 77.98 ± 3.19 kg), were supplied by Taizhou Hailun Sheep Industry Co., Ltd. (Taizhou, China). The sheep were raised under the same conditions, with natural light and free access to food and water. All animals were fasted overnight and were then euthanized by captive bolt stunning and exsanguination. The LD muscle samples were collected from between the 12th and 13th thoracic vertebrae on the right side at the fetal and adult stages, immediately frozen in liquid nitrogen, and stored at −80 °C until use.

Library preparation

DNA was isolated from LD muscle samples using a DNA extraction kit (Tiangen, Beijing, China). The DNA concentration and quality were determined by NanoDrop (NanoDrop Technologies, Wilmington, DE, USA) and agarose gel electrophoresis. Three DNA libraries were constructed for each group. Equal amounts of genomic DNA (2 μg per sample) were fragmented to 400–500 bp by ultrasonication, followed by adenylation and end-repair. The selected fragments were treated with bisulfite and then amplified by PCR to generate the sequencing libraries.

WGBS and identification of DMRs

The library was sequenced using an Illumina HiSeq™ 2500 platform (Biomarker Technologies, Beijing, China). The peak signal was transformed into sequence data by base calling, following which the raw reads were quality-filtered to obtain the clean reads. First, reads were trimmed of the 3′ adapter sequence. Then, reads with > 10% unknown bases (N) and those of low quality (more than 50% of bases with a PHRED score ≤ 5) were removed. We also calculated the Q30 and GC content.
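
The read-filtering rules above can be illustrated with a minimal sketch. It is not the pipeline actually used; the function names and the in-memory read representation (a sequence string plus a list of integer PHRED scores) are assumptions made for illustration, and adapter trimming is omitted.

```python
def keep_read(seq, quals, max_n_frac=0.10, low_q=5, max_low_frac=0.50):
    """Return True if a read passes both filters described in the text.

    seq   : read sequence (string); 'N' marks an unknown base
    quals : per-base PHRED scores (list of ints, same length as seq)
    """
    n_frac = seq.upper().count("N") / len(seq)
    low_frac = sum(q <= low_q for q in quals) / len(quals)
    # A read is removed if > 10% of its bases are N, or if more than
    # 50% of its bases have a PHRED score <= 5.
    return n_frac <= max_n_frac and low_frac <= max_low_frac


def q30_fraction(reads):
    """Fraction of all bases with PHRED >= 30 (the usual meaning of Q30)."""
    quals = [q for _, qs in reads for q in qs]
    return sum(q >= 30 for q in quals) / len(quals)


def gc_content(reads):
    """Fraction of all bases that are G or C."""
    bases = "".join(seq.upper() for seq, _ in reads)
    return sum(b in "GC" for b in bases) / len(bases)
```

Here `reads` is a list of `(seq, quals)` pairs; a real workflow would parse these from FASTQ files.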

The clean reads were aligned to the sheep reference genome (Oar_v3.1), and bisulfite mapping of methylation sites was performed using Bismark software. Duplicate reads were defined as reads that aligned to the same region of the genome; they were used to estimate the sequencing depth and coverage. The bisulfite conversion rate is the percentage of methylated clean reads to the total number of clean reads in the genome. A binomial distribution test was applied to each C site to confirm methylation, with screening conditions of coverage ≥ 4× and false discovery rate (FDR) < 0.05.
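
A per-site binomial test with FDR control can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the null hypothesis is taken to be that methylated calls arise only from errors at a rate `error_rate` (the value 0.01 is hypothetical), and the Benjamini-Hochberg procedure is assumed for the FDR, since the text does not name one.

```python
from math import comb


def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_methylated_sites(sites, error_rate=0.01, min_cov=4, fdr=0.05):
    """sites: list of (site_id, methylated_reads, total_reads).

    error_rate is an assumed null rate, not a value from the text.
    """
    covered = [s for s in sites if s[2] >= min_cov]  # coverage >= 4x
    pvals = [binom_sf(m, n, error_rate) for _, m, n in covered]
    qvals = benjamini_hochberg(pvals)
    return [sid for (sid, _, _), q in zip(covered, qvals) if q < fdr]
```

For example, a site with 8 methylated reads out of 10 is called methylated, whereas a site covered by only 3 reads is excluded before testing.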

To identify the differentially methylated regions (DMRs) between fetal and adult samples, we referenced the model of [16] to estimate the methylation level. All C sites with read coverage > 10× were used for DMR analysis with MOABS [46]. Subsequently, DMRs were defined by the presence of at least three methylation sites in the region, and in which the difference in methylation levels was > 0.2 (> 0.3 for the CG context) and the P-value from Fisher’s exact test was < 0.05.
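
The three DMR criteria can be expressed as a small filter. The sketch below is illustrative only: MOABS performs its own statistical modelling, and pooling read counts per region into a 2×2 table for Fisher's exact test is an assumption made here to keep the example self-contained. The function names are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as observed.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def is_dmr(meth_a, unmeth_a, meth_b, unmeth_b, n_sites, context="CG"):
    """Apply the criteria from the text to one candidate region.

    meth_*/unmeth_* : pooled methylated / unmethylated read counts per group
    n_sites         : number of methylation sites in the region
    """
    level_a = meth_a / (meth_a + unmeth_a)
    level_b = meth_b / (meth_b + unmeth_b)
    min_diff = 0.3 if context == "CG" else 0.2
    p = fisher_exact_two_sided(meth_a, unmeth_a, meth_b, unmeth_b)
    return n_sites >= 3 and abs(level_a - level_b) > min_diff and p < 0.05
```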

Functional enrichment analysis

The DMR-related genes (DMGs) were compared against functional databases such as GO and KEGG by BLAST for annotation of gene function. GO enrichment analysis of the DMGs was implemented with the GOseq R package, based on the Wallenius non-central hypergeometric distribution [47]. KOBAS software was used to test the significance of DMG enrichment in the KEGG pathway analysis [48]. Pathways with a corrected P-value < 0.05 were considered to be significantly enriched. The STRING database (http://string-db.org/) was used to analyze interaction networks of selected DMGs [49].

Quantitative reverse transcription-PCR

qRT-PCR was used to measure the expression levels of DNA methyltransferase-related genes and to validate the DMGs. Total RNA was isolated from LD muscles using Trizol reagent (Invitrogen, Carlsbad, USA). cDNA was reverse transcribed from total RNA using the PrimeScript RT kit (Takara, Dalian, China). qPCR was performed on a StepOnePlus Real-Time PCR System (Life Technologies, USA) using SYBR Green Master Mix (Roche Applied Science, Mannheim, Germany). The gene primers are listed in Additional file 1. The relative expression of each gene was normalized to that of GAPDH using the 2^(−ΔΔCt) method [50].
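
For readers unfamiliar with the 2^(−ΔΔCt) calculation, a minimal sketch follows. The Ct values in the usage note are invented for illustration, and which group serves as the calibrator (control) is an assumption, since the text does not state it.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Relative expression by the 2^(-ddCt) method.

    The reference gene would be GAPDH here; 'control' is the calibrator group.
    """
    delta_sample = ct_target_sample - ct_ref_sample      # dCt in the sample
    delta_control = ct_target_control - ct_ref_control   # dCt in the calibrator
    delta_delta = delta_sample - delta_control           # ddCt
    return 2 ** (-delta_delta)
```

With hypothetical Ct values of 22 (target) and 18 (GAPDH) in the sample, and 24 and 18 in the calibrator, `relative_expression(22.0, 18.0, 24.0, 18.0)` gives a four-fold higher expression in the sample.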

Bisulfite sequencing PCR

The bisulfite sequencing PCR was used to validate DNA methylation levels of selected candidate genes. Genomic DNA was modified with sodium bisulfite using the EZ DNA Methylation-Gold™ Kit (ZymoResearch, Los Angeles, USA). Then, bisulfite-converted gDNA was subjected to PCR amplification using Zymo Taq™ DNA polymerase (ZymoResearch). The PCR products were purified using a Gel Extraction Kit (Shenggong, Shanghai, China), ligated, and cloned into the pUC18-T vector (Shenggong). Fifteen clones of each sample were randomly selected for DNA sequencing. The quantification tool for methylation analysis was used to analyze bisulfite sequencing data (QUMA http://quma.cdb.riken.jp/). Gene sequence-specific primers are listed in Additional file 2.

Association analysis

We previously screened many genes related to muscle development at two stages of Hu sheep (fetus and adult) using the Illumina platform [15]. By association analysis of the differentially methylated genes (DMGs) and the differentially expressed genes (DEGs), the set of differentially methylated DEGs at the intersection of the two was obtained. Negative correlations between DMR methylation level and the corresponding gene expression level were identified by correlation analysis (r with a negative value).
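
The association step can be sketched as a set intersection followed by a correlation coefficient. The text does not say which coefficient was used; Pearson's r is assumed here, and the gene names and values in the usage note are invented for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def methylated_degs(dmgs, degs):
    """Genes that are both differentially methylated and differentially expressed."""
    return sorted(set(dmgs) & set(degs))
```

For instance, `methylated_degs(["geneA", "geneB"], ["geneB", "geneC"])` returns `["geneB"]`, and methylation levels `[0.1, 0.5, 0.9]` paired with expression values `[9.0, 5.0, 1.0]` give r close to −1, the kind of negative relationship sought in the text.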

Statistical analysis

Statistical analyses were performed by the independent samples t-test with the SPSS 25.0 software package (SPSS Inc., Chicago, IL, USA). Results of the qRT-PCR data were expressed as means ± standard error of the mean (SEM) of three samples with three biological replicates. Differences were regarded as significant at P < 0.05.


Introduction

Gaining a single-cell view of tumor heterogeneity is crucial for improving our understanding of tumor evolution and enabling future advances in cancer research. The standard paradigm is bulk sequencing of genomic DNA derived from millions of heterogeneous cells. In bulk sequencing, the ability to resolve sub-clonality is confounded by reliance on indirect inference, frequently resulting in an ensemble view dominated by the majority clone [1,2]. While bulk sequencing has provided major insights into tumor biology, lower-throughput single-cell methods such as spectral karyotyping are often necessary to understand sub-clonal complexity and tumor evolution. Previously, methods for sequencing the DNA of single cells using next-generation sequencing approaches have often been laborious or limited to multiplexing hundreds of cells or nuclei [3,4,5,6,7,8]. In this study, we used newly emerging droplet-based shallow genome sequencing to simultaneously sequence 1475 single cells from one of the most well-studied and well-characterized benchmark cell lines, COLO829 [9], as a means to better understand the limitations of, and insights gained by, single-cell sequencing at shallow depth. We follow this analysis with a deep dive, examining data from multiple technologies and multiple samples of COLO829, in order to better understand the resulting sub-clonality, its major hallmark features, and the underlying driving biology.

The melanoma COLO829 and germline COLO829-BL tumor/normal pair have been extensively analyzed using multiple methods and technologies, making it an ideal vehicle for evaluating new and emerging genomic technologies [1,2,10,11]. This line was one of the first tumor/normal pairs to be subjected to whole-genome sequencing, where Pleasance et al. identified several hallmark events including a homozygous 12 kb deletion in PTEN, BRAF V600E, and a CDKN2A 2 bp deletion. Previous studies using bulk sequencing of the tumor line COLO829 have focused largely on developing tools and standardizations to improve copy number estimation and cancer characterization [2]. While a few of the studies found cell line complexity inconsistent with the assumption of clonality and suggestive of multiple sub-clones, in general, most analyses presumed COLO829 to be a single clone. Of the papers looking at copy number, Craig et al. observed differences among samples in chromosome 1p, and Gusnanto et al. found evidence for a mixture of clones but were unable to resolve the individual components using bulk data and methods. Much of the work on this tumor line highlighted major CNV hallmark events, as well as a series of inconsistent findings that point towards bulk sequencing methods being lossy and unable to resolve the complexity of COLO829 [11].

Beyond the difficulty of resolving clonal mixtures, an additional challenge of bulk sequencing, even in the context of a paired normal, is that without single-cell resolution there are limited informatic options for resolving relative differences in read depth into integer copy number states. At some point, most algorithms require assumptions, such as a diploid region or a tumor purity, and the veracity of these assumptions shapes overall accuracy. Even with a uniform set of algorithms applied to the same cell line, variable results are observed across samples, suggesting that there may be differences among sub-populations of cells that affect their expansion [11]. In this paper, we performed shallow single-cell sequencing of genomic DNA across 1475 cells from the same cell line, COLO829, and show that it is in fact a complex mixture and identify key structural variants that contribute to its sub-clonal evolution.
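
To make the dependence on assumptions concrete, the sketch below converts a relative read-depth ratio into a copy number estimate using a commonly used mixture model of tumor and diploid normal cells. It is an illustration only, not the method of any study cited here; the function name and parameters are hypothetical, and `depth_ratio` is assumed to be a segment's depth divided by the genome-wide average.

```python
def integer_copy_number(depth_ratio, purity, ploidy):
    """Estimate tumor copy number for one segment from bulk read depth.

    Model (an assumption): the observed ratio is
        (purity * C + (1 - purity) * 2) / (purity * ploidy + (1 - purity) * 2)
    where C is the tumor copy number and normal cells are diploid.
    Returns the raw (fractional) estimate and its nearest integer.
    """
    normal = 2 * (1 - purity)  # contribution of diploid normal cells
    raw = (depth_ratio * (purity * ploidy + normal) - normal) / purity
    return raw, round(raw)
```

The same depth ratio of 1.5 in a pure sample is read as 3 copies if the average ploidy is assumed to be 2, but as 6 copies if it is assumed to be 4, which is why the choice of assumptions shapes the result.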


GENOME PROJECTS: UNCOVERING THE BLUEPRINTS OF BIOLOGY

In the early days of genetics, scientists did not have the resources to look at more than a few genes at a time. This made the process of understanding the influence of genetics on an organism slow and arduous. Scientists were faced with the enormous task of attempting to understand genetic influence with little information to complete the task. The understanding of genes would have been very helpful in solving this problem.

The year 1995 saw the completion of the first two complete non-viral genomes, Haemophilus influenzae [1] and Mycoplasma genitalium [1], two bacteria that can cause human disease. Since then, over 100 genomes have been fully sequenced, including those of higher organisms like baker’s yeast, the fruit fly, and the nematode [2]. With the announcement in June of 2000 that the first draft of the human genome had been completed [3], scientists’ approach to biology completely changed. The entire set of human genes was now available. This represented an irresistible amount of data that bridged the bioinformatic gap that lay between biologists and their understanding of genetics.

To begin to see the significance of such a historic event, it is necessary to look at why uncovering a genome is an important biological task.

The genome refers to all DNA present in an organism.

DNA is the “genetic blueprint” that determines the genotypic make-up of each organism. In its barest form, DNA consists of two strings of nucleotides, or bases (abbreviated A, C, G, and T), wound around each other. The bases composing DNA have specific binding capabilities: A always binds to T, and C always binds to G. These binding capabilities are useful for scientists to understand since, if the nucleotide sequence of one DNA strand is determined, complementary binding allows the sequence of the other strand to be deduced.
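
The pairing rule lends itself to a tiny example. The sketch below (the function names are just for illustration) deduces one strand from the other; because the two strands run in opposite directions, the partner strand is usually written reversed, which is what `reverse_complement` does.

```python
# Base-pairing rule from the text: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Base-by-base partner of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand):
    """The partner strand written in its own reading direction."""
    return complement(strand)[::-1]
```

For the strand `ATGC`, `complement` gives `TACG` and `reverse_complement` gives `GCAT`.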

In the case of humans, DNA is organized into 24 distinct structural units called chromosomes (22 autosomes plus the X and Y sex chromosomes). Each chromosome consists of compacted coils of DNA. While much of this DNA has no known function (these stretches of DNA are conveniently referred to as spacer DNA or junk DNA), a significant portion of the DNA codes for genes. Each gene provides the information necessary to produce a protein, which is responsible for carrying out cellular functions. The complement of proteins in an organism is very important, with diseases often manifesting when a protein does not function properly.

Why Sequence Genomes, Especially Non-Human Genomes?

One of the interesting things about biological organisms is their remarkable similarity at the molecular level, despite their obvious outward differences. For instance, many genes are found in morphologically different organisms despite the phylogenetic distance between them [4]. Not only are these genes very similar in their DNA sequence composition, they also tend to perform the same functions. Thus, by understanding the function of a gene in one organism, scientists can get an idea of what function that gene may perform in a more complex organism such as humans. The knowledge gained can then be applied to various fields such as medicine, biological engineering and forensics.

The Sequencing Reaction: How the Nucleotide Composition of DNA is Determined

To understand how DNA is sequenced, one must first know a little about the structure of DNA:

  • A segment of DNA, which is ordinarily double stranded, has a specific orientation, as it has a 5′ (read as “5 prime”) and a 3′ (“3 prime”) end. This can be simply thought of as a front and tail end to the DNA segment.
  • When DNA is synthesized in the lab, the two strands are separated and new bases are added to the 3′ end; thus DNA is assembled from the 5′ to the 3′ end.
  • DNA cannot be synthesized from scratch. A short piece of DNA, called a primer, is required for the reaction to begin.
  • Primers are designed such that they are able to bind to the target DNA, the binding of which is the initiator for DNA synthesis.

DNA sequencing is accomplished by the method developed by Frederick Sanger (see Figure 1), for which he won his second Nobel Prize in 1980.


Figure 1. The Sanger sequencing reaction. Single stranded DNA is amplified in the presence of fluorescently labelled ddNTPs that serve to terminate the reaction and label all the fragments of DNA produced. The fragments of DNA are then separated via polyacrylamide gel electrophoresis and the sequence read using a laser beam and computer.

This method essentially involves amplifying a single-stranded piece of DNA many times [5]. Normally, when DNA is amplified, new deoxynucleotides (dNTPs) are added as the strand of DNA grows. The Sanger method employs special bases called dideoxynucleotides (ddNTPs). These are similar to dNTPs, except for two important differences: they have fluorescent tags attached to them (a different tag for each of the 4 ddNTPs), and they lack the 3′ hydroxyl group to which the next base would be attached, so no new bases can be added to a DNA strand after a ddNTP has been incorporated. Thus, once a ddNTP is inserted into a growing DNA strand, synthesis of that strand is stopped. After many repeated cycles of amplification, this results in all the possible lengths of DNA being represented, with every piece of synthesized DNA carrying a fluorescent label at its terminus.

Amplified DNA can then be separated according to size via gel electrophoresis. As the fluorescent DNA reaches the bottom of the gel (now separated from smallest to largest), a laser can pick up the fluorescence of each piece of DNA. The trick to the Sanger method lies in the fact that each ddNTP emits a different fluorescent signal, so that the presence of a ddNTP at the terminus can be recorded on a computer (see Figure 2). The reaction is set up so that a fluorescent ddNTP is present at every position in the DNA strand (i.e. every possible size of DNA strand is present) so that every nucleotide in the strand can be determined. A computer program can then compile the data into a coloured graph showing the determined sequence.
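
The logic of reading a sequence from terminated fragments can be mimicked in a few lines. This is a toy model, not a simulation of the chemistry: it assumes exactly one fragment of every possible length is produced, represents each fragment only by its length and the label of its terminal ddNTP, and treats electrophoresis as sorting by length.

```python
import random


def sanger_fragments(new_strand):
    """Every chain-terminated fragment as (length, labelled terminal base)."""
    fragments = [(i + 1, base) for i, base in enumerate(new_strand)]
    random.shuffle(fragments)  # in the tube, the fragments are all mixed together
    return fragments


def read_gel(fragments):
    """Separate fragments by size, then read the labels shortest to longest."""
    return "".join(base for _, base in sorted(fragments))
```

Whatever order the fragments start in, `read_gel(sanger_fragments("GATTACA"))` recovers `GATTACA`, because the fragment of length 1 reports the first base, the fragment of length 2 the second, and so on.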

In the past, the separation of the DNA strands by electrophoresis was a time-consuming step, requiring the use of radioisotopes for labelling ddNTPs. This was far from trivial, as four different sequencing reactions were required (one for each ddNTP) and the resulting sequencing gel needed to be analyzed manually. Today, fluorescent labels and new advances in gel electrophoresis have made DNA sequencing not only fast and far more accurate, but also almost fully automated, including the readout of the final sequence.


Figure 2. An electropherogram of a finished sequencing reaction. As the fragments from the sequencing reaction are resolved via electrophoresis, a laser reads the fluorescence of each fragment (blue, green, red or yellow) and compiles the data into an image. Each colour, or fluorescence intensity, represents a different nucleotide (e.g. blue for C) and reveals where that nucleotide is in the sequence.

While the Sanger method is the accepted method for sequencing DNA, one cannot sequence a complete genome using this method alone. The main reason for this is that as the pieces of DNA get larger, resolving two pieces by one base becomes virtually impossible [6]. In fact, only about 1000 bases can be sequenced accurately, a far cry from the 50 to 250 million bases that comprise a human chromosome. Furthermore, as stated above, a primer of known sequence is required for each sequencing reaction. Thus, one cannot take any piece of DNA and “just sequence it.” A known starting point, and thus some knowledge of the sequence, is required to begin the reaction. To circumvent this problem, DNA is usually cut up into smaller, more manageable chunks and then placed into a small circular piece of DNA known as a plasmid or cloning vector (a process generally referred to as cloning). The cloning vector’s sequence is known and therefore allows any piece of DNA introduced into it to be sequenced.

With these ideas in mind, scientists set out to design methods to make possible the sequencing of an entire genome. No small task when you consider that the human genome contains approximately three billion bases that needed to be sequenced.

The first method of sequencing a genome, employed by the publicly funded Human Genome Project, involves cloning a large piece of DNA into smaller pieces called sub-clones. With the use of known genetic markers (i.e. physical characteristics that have been attributed to specific areas of a chromosome) a simple and poorly resolved map of where the sub-clones would be located on a chromosome is prepared. This allows the sub-clones to be placed in an order based on the structure of the chromosome. Each individual sub-clone is then sequenced. The resulting sequence is used to create a new primer to sequence flanking regions of the DNA that could not be sequenced in the first round of reactions. This process is continued until the sequences overlap (are contiguous). These contiguous sequences can then be assembled into a group of overlapping sequences, termed a contig. As this method progresses, larger and larger contigs will be produced, until a single ordered contig of the genome is achieved.

A common name for the above method is a ‘top-down’ approach (see Figure 3). If you look at a jigsaw puzzle as an analogy, a top-down approach is similar to starting the puzzle from one corner and working your way down and across in an ordered manner, always building on the last piece that was added. The advantages of this method are that each individual clone can be sent to different people for sequencing and that each stretch of DNA only needs to be sequenced once, as the DNA has already been mapped. However, a large disadvantage of this method is the slow process of sub-cloning and mapping the clones, which requires significant human manipulation.


Figure 3. The top-down sequencing method. In this approach, a large source clone is first physically mapped before it is broken up into smaller sub-clones. This is done by taking the fragmented source clone and sequentially ordering the sub-clones, based on their original order in the source clone. This requires a physical map of the source clone to work, meaning you need to know that #1 (blue) comes before #2 (yellow) in the source clone. Once the clones have been ordered, each sub-clone is sequenced, and using the overlapping sequences of neighbouring sub-clones, the whole piece is put together.

A second method is the so-called ‘shotgun’ method of sequencing (see Figure 4), which was employed by the privately funded company Celera Genomics to sequence the human genome. This method was the subject of a good deal of debate, as it is relatively crude in comparison to the method employed by the Human Genome Project. It involves each contig being sub-cloned into smaller fragments in the same way as the top-down approach, with the exception that a physical genetic map is not created. Instead, each clone is sequenced first, and then overlapping sequences are joined together to create the contig. In other words, random clones are sequenced (as they are not ordered) in the hopes that overlapping sequences will be found to piece together the contiguous sequence.
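
The idea of joining unordered reads by their overlaps can be shown with a toy assembler. This is a greedy sketch on error-free reads from one strand, not the algorithm used by Celera or any real assembler; the minimum overlap of 3 bases and the function names are assumptions for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # nothing overlaps: the remaining pieces are separated by gaps
        merged = reads[i] + reads[j][k:]
        reads = [r for n, r in enumerate(reads) if n not in (i, j)] + [merged]
    return reads
```

Given the three reads `ATGGCGT`, `GCGTACG` and `TACGTTA` in any order, the sketch returns the single contig `ATGGCGTACGTTA`; reads that share no overlap are returned as separate contigs.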


Figure 4. Shotgun Sequencing. A relatively crude method of sequencing, shotgun sequencing does not produce a physical map of the source clone first. Instead, the source clone is fragmented, producing a random mixture, and a random sub-clone (i.e. an unordered sequencing clone of blue, yellow, black, red or green) is selected for sequencing by the Sanger method. To ensure that the whole source clone has been sequenced, this stretch of DNA must be sequenced numerous times (represented by multiples of a single coloured sub-clone) to produce an ordered overlapping sequence. Gaps in this process will occur where a sub-clone is not fully sequenced (blue coloured sub-clone).

Using the jigsaw puzzle analogy again, the shotgun method is similar to starting with random pieces of the puzzle and looking for pieces that fit them, regardless of where in the puzzle each piece originated. One major problem with this method is uncertainty. You lack an initial map to guide you, making it difficult to be sure that the entire contig is represented. To get around this problem, the same contig needs to be sequenced many times to ensure that the probability of missing a sub-clone is less than 1%. After this, the gaps between contigs must still be filled in, usually through the use of a technique called chromosome walking. The shotgun method is advantageous in that the laborious process of mapping and sub-cloning, requiring human hands, is eliminated. So, while this method requires much more sequencing than the first, it proves to be much more economical and faster because the sequencing reactions are virtually fully automated and the sequences are assembled by computer programs.
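
How many times is “many times”? A rough answer comes from a standard approximation that the article itself does not spell out: if reads land at random, the chance that a given base is never covered at an average coverage of c is about e^(−c) (the Poisson, or Lander-Waterman, model). The sketch below applies that assumption.

```python
from math import exp, log


def fraction_missed(coverage):
    """Expected fraction of bases left unsequenced at a given average coverage."""
    return exp(-coverage)


def coverage_needed(max_missed):
    """Average coverage required so that at most max_missed is left uncovered."""
    return -log(max_missed)
```

Under this model, `coverage_needed(0.01)` is about 4.6, so sequencing a region between four and five times over brings the chance of missing any given part to around 1%.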

When is a Genome Sequence Finished?

When it was announced that the first draft of the human genome was completed [3], it was commonly misreported by many media outlets that the human genome was sequenced. In fact, much more sequencing needs to be done to finish the job. This is because the genome sequence was still in the ‘draft’ stage, meaning that the genome had been sequenced about 4 to 5 times, and the data organized into fragments that are approximately 10,000 bases in size.

To prepare a high quality sequence of the human genome, potential errors in the sequence must still be statistically removed. This is done primarily by closing the gaps between contigs with additional sequencing, ultimately reducing ambiguity and ensuring that there is at most 1 error in every 10,000 bases. The finished version will require that a chromosome be sequenced about 9 to 10 times. Furthermore, not all regions of the chromosome can be cloned, resulting in them being unavailable for sequencing. Luckily, these regions, called heterochromatin, consist of telomeres and centromeres (the tips and centre of the chromosome, respectively), which are rich in repeating sequences (making cloning very difficult) and low in genes. Most of the genes reside in euchromatin, the part of the chromosome that can be sequenced. Therefore, a complete genome sequence actually refers to a high quality sequence of an organism’s euchromatin.
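
The “1 error in every 10,000 bases” target is often expressed on the Phred quality scale, where the score Q is −10 times the base-10 logarithm of the error probability. The article does not use this scale itself; the conversion is shown here as a convenience.

```python
from math import log10


def phred(error_probability):
    """Phred quality score for a given per-base error probability."""
    return -10 * log10(error_probability)


def error_probability(q):
    """Per-base error probability for a given Phred quality score."""
    return 10 ** (-q / 10)
```

An error rate of 1 in 10,000 corresponds to a Phred score of 40, while a score of 30 means 1 error in 1,000 bases.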

Benefits of Sequencing Projects

Why do we want to determine the A’s, T’s, C’s, and G’s of an organism?

When you get right down to it, a genome is the blueprint of how an organism functions. If we are interested in understanding the complexity of life (and every biologist and doctor is), having a genome to study is a big step forward.

Scientists are revving up their computers to study genomes and the benefits of this are already being seen. Take the field of medicine as an example. As the population begins to become increasingly health conscious, more attention is being paid to the ongoing research in the medical sciences. As the chromosome maps have become more detailed, genes associated with genetic diseases such as Alzheimer’s disease [7] and familial breast cancer [8] have been identified. This has led to the hope that these diseases can be identified early and that new drugs and treatments can be discovered.

Genome projects also give us insight into other organisms, which has many applications in the industrial sector [9]. Increasing knowledge about domesticated plants and animals can reduce costs in agriculture, for example, by reducing the need for pesticides. Microbes are also an important resource. It has already been shown that bacteria can be used to clean up toxic chemical and oil spills and aid in the clean-up of sewage and waste. Bacteria have also been used to replace many industrial processes that require large amounts of toxic reagents or harsh conditions, making many workplaces, and their surrounding environment, much safer.

Final Words: Where is Genome Science Taking Us?

Even though the number of completed genomes is ever increasing, the real work is just beginning. New advances in technology must accommodate the increasing amount of data, as the information available to researchers can be overwhelming. Already, new fields of science have been created by the sequencing of genomes. An example of this is functional genomics, which aims to look at the practical aspects of sequenced genomes by looking at genome-wide responses to various elements.

Finally, a whole can of ethical issues has been opened as researchers have begun patenting genes in the hope of financial reward. Is it right to patent genes that are present in all humans? Who controls the genetic information? Can the use of genetic information oppress and control people, like in the movie Gattaca? Only education, debate and time will produce these answers.

Texts Consulted and Additional Reading

1. Dale JW, von Schantz M. 2002. From Genes to Genomes: Concepts and Applications of DNA Technology. West Sussex, England / New York: Wiley. 360p.

2. Town C, ed. 2002. Functional Genomics. Dordrecht/Boston: Kluwer Academic. 200p.

3. Caporale LH. 2003. Darwin in the Genome: Molecular Strategies in Biological Evolution. New York: McGraw-Hill. 245p.

4. Rangel P, Giovannetti J. 2002. Genomes and databases on the Internet: A Practical Guide to Functions and Applications. Wymondham: Horizon Scientific. 223p.

5. Primrose SB, Twyman RM. 2003. Principles of genome analysis and genomics. Malden, MA: Blackwell Pub. 263p.

References

1. Two Bacterial Genomes Sequenced. 1995. Human Genome News, May-June 7(1).

2. Genome-Scale Science. National Centre for Biotechnology Information:

3. The Genome International Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921.

4. Griffiths et al, eds. 2002. Modern Genetic Analysis: Integrating Genes and Genomes. New York: W.H. Freeman and Co. 736p.

6. Alphey L. 1997. DNA Sequencing: From Experimental Methods to Bioinformatics. New York: Springer. 206p.

7. Lahiri DK, et al. 2003. A Critical Analysis of New Molecular Targets and Strategies for Drug Developments in Alzheimer’s Disease. Curr Drug Targets 4(2): 97-112.

8. Marsh D, Zori R. 2002. Genetic Insights into Familial Cancers — Update and Recent Discoveries. Cancer Lett 181(2): 125-64.

9. Goujon P. 2001. From Biotechnology to Genomes: The Meaning of the Double Helix. NJ: World Scientific. 728p.