HOW DNA SEQUENCING WORKS
The Basic Structure of DNA
We hear a lot about deoxyribonucleic acid (DNA) and how scientists discover mutations and other information about its specific makeup; how this information is actually obtained, however, is less widely understood. The process of determining the order of the building blocks in a strand of DNA is called DNA sequencing. To understand how sequencing is done, it helps to know the basic structure of DNA.
DNA is made up of bases (nitrogen-containing molecules). There are four such bases in DNA: adenine, guanine, cytosine, and thymine, represented by the letters A, G, C, and T, respectively. In DNA, each base is bound to a sugar molecule called deoxyribose. Another molecular group, phosphate, is attached to the sugar. The three connected molecules (the base, sugar, and phosphate) form what is called a nucleotide. A string or sequence of nucleotides forms a DNA strand.
DNA in living organisms is usually found as two connected strands in a well-defined twisted configuration called a double helix. The two strands are held together by bonds between bases: adenine pairs with thymine, and guanine pairs with cytosine. For example, a guanine on one strand forms bonds with a cytosine on the second strand. Each such pairing is referred to as a base pair.
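The pairing rule can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and it relies on two standard facts, namely that A pairs with T and G pairs with C, and that the two strands of the helix run in opposite directions, so the partner strand is conventionally written reversed.

```python
# Base-pairing rules: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the partner base for every position, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the partner strand written in its own reading direction."""
    return complement(strand)[::-1]


print(complement("GATTACA"))          # CTAATGT
print(reverse_complement("GATTACA"))  # TGTAATC
```

Knowing one strand is therefore enough to write down the other, which is why sequencing a single strand reveals the full double-stranded sequence.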
Background of DNA Sequencing
DNA sequencing is a biochemical method to determine the order of nucleotides in DNA. The technology of DNA sequencing began in the 1970s. The first methods developed were the Maxam-Gilbert and Sanger methods. For DNA to be sequenced, the two strands of the double helix must first be separated, usually by applying a high temperature; this process is called denaturation. In the Maxam-Gilbert method, the DNA is radioactively tagged, or labeled, by adding a phosphate group containing radioactive phosphorus. The DNA strand is then chemically modified at specific bases and cleaved at the sites where it was modified.
The result is a set of fragments whose lengths correspond to the cleavage locations. Each reaction cleaves at 1 or 2 of the 4 possible nucleotides. Let's say four reactions were employed, with 1 tube per reaction. The contents of each tube are separated by gel electrophoresis, with each tube's contents in its own lane on the gel. Within a lane, the DNA fragments are separated by size, with the shortest fragments traveling farthest. Because it is known where cleavage had to occur in each reaction, the nucleotide represented by each band on the gel can be worked out.
A sheet of radiographic (X-ray) film is exposed to the gel so that the radioactive bands can be seen. Starting at the bottom of the film, the first band(s) is located, and the nucleotide is identified from the lane it is in (that is, from the reaction that lane represents). Let's say that the reaction for that lane cleaves at guanine; the band then marks a guanine. The next largest band(s) above it is then found and identified, then the next, and so on. It is the presence or absence of a fragment in each lane that makes this identification possible. Once the top of the film and the last band(s) are reached, the whole sequence has been ascertained.
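The band-reading logic can be illustrated with a toy simulation in Python. It assumes the four classic Maxam-Gilbert reactions (G, A+G, C, and C+T), which the text above does not name, and it simplifies the gel by giving each band the position of the cleaved base rather than a true fragment length. The function names are hypothetical.

```python
# Assumed reaction set: each lane cleaves at one or two of the four bases.
REACTIONS = {"G": "G", "A+G": "AG", "C": "C", "C+T": "CT"}


def run_gel(sequence):
    """Return, for each lane, the set of band positions it would show.

    Simplification: a band's position is the 1-based index of the cleaved base.
    """
    return {
        lane: {i for i, base in enumerate(sequence, start=1) if base in cleaved}
        for lane, cleaved in REACTIONS.items()
    }


def read_gel(lanes, length):
    """Read bands from the bottom of the film (position 1) to the top."""
    calls = []
    for pos in range(1, length + 1):
        if pos in lanes["G"]:        # band in G lane (and also in A+G): guanine
            calls.append("G")
        elif pos in lanes["A+G"]:    # band in A+G but absent from G: adenine
            calls.append("A")
        elif pos in lanes["C"]:      # band in C lane (and also in C+T): cytosine
            calls.append("C")
        elif pos in lanes["C+T"]:    # band in C+T but absent from C: thymine
            calls.append("T")
        else:
            calls.append("N")        # no band in any lane: base unknown
    return "".join(calls)


lanes = run_gel("GATTACA")
print(read_gel(lanes, 7))  # GATTACA
```

The adenine and thymine calls depend on a band being present in one lane and absent from another, which is the presence-or-absence reasoning described above.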
The Sanger sequencing method (ultimately favored over the Maxam-Gilbert method) was developed at about the same time as the Maxam-Gilbert method. It is based on what is called the chain termination technique. With this method, elongation of a new DNA strand is terminated by incorporating a special nucleotide called a dideoxynucleotide. Four reaction tubes are also employed here, and each one contains 1 of the 4 dideoxynucleotides (the regular nucleotides previously mentioned are deoxynucleotides). Dideoxynucleotides, unlike the regular nucleotides, lack the 3’-OH group that is necessary for bond formation between two nucleotides. The reaction mix also contains radiolabeled primers (short single-stranded pieces of DNA that start off the elongation reaction).
When a dideoxynucleotide is incorporated, the elongation reaction stops there. Because this happens at different positions in different copies, the result is a set of DNA fragments of various lengths. Each reaction is also run on a gel, which is exposed to X-ray film. Each of the 4 lanes represents only 1 nucleotide, making base calling easier. As described above, the bands are identified from the bottom to the top of the film. This method was also considered more efficient than the Maxam-Gilbert method, and more sequence can be read at a time (reads of up to roughly 1,000 bases).
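Chain termination can be sketched the same way. The Python below is a simplified model, not a description of any laboratory protocol: it ignores the primer length, assumes every possible termination product is present, and uses made-up function names.

```python
def sanger_fragments(new_strand):
    """For each dideoxynucleotide tube, list the lengths of the terminated fragments.

    A fragment of length n appears in tube X when position n of the
    newly synthesized strand is X (primer length is ignored here).
    """
    tubes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        tubes[base].append(length)
    return tubes


def call_bases(tubes):
    """Read bands shortest-first (bottom of the film to the top)."""
    bands = sorted(
        (length, base) for base, lengths in tubes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


tubes = sanger_fragments("GATTACA")
print(tubes["A"])         # [2, 5, 7]
print(call_bases(tubes))  # GATTACA
```

Because each lane corresponds to exactly one base, every band gives a direct call, with no need to compare lanes as in the Maxam-Gilbert reading.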
Massively Parallel Sequencing
Massively parallel sequencing, also known as next-generation or second-generation sequencing, differs from Sanger sequencing chiefly in the speed with which sequence information can be obtained. It takes only weeks to obtain the amount of sequence information that would take years with Sanger sequencing. Next-generation sequencing (NGS) also requires less sample DNA and is much more cost effective [1].
Massively parallel sequencing refers to the simultaneous sequencing of millions of small fragments. This results in an enormous amount of data to process. Millions to a billion bases of DNA sequence can result. It is estimated that approximately 250 gigabases per week can be sequenced using NGS technology [2].
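To put the 250 gigabases per week figure in perspective, here is a rough back-of-the-envelope calculation in Python. The genome size (rounded to 3 gigabases for a human genome) and the 30-fold coverage are illustrative assumptions, not figures from the text.

```python
GIGA = 1e9


def genomes_per_week(bases_per_week, genome_size, coverage):
    """Number of genomes one week's output covers at a given read depth."""
    return bases_per_week / (genome_size * coverage)


weekly_output = 250 * GIGA  # figure quoted in the text
human_genome = 3 * GIGA     # assumption: rounded human genome size
coverage = 30               # assumption: each base is read about 30 times

print(round(genomes_per_week(weekly_output, human_genome, coverage), 2))  # 2.78
```

Under these assumptions, one week of output corresponds to between two and three human genomes sequenced at that depth.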
The formation of DNA sequencing libraries is an important part of NGS. Although automatic sequencing machines were developed to perform Sanger sequencing [3], newer sequencers with NGS-based technology have since been developed [4, 5]. One representative method, known as pyrosequencing, works as follows. When a nucleotide is added during the DNA extension reaction, a pyrophosphate molecule is released. This molecule is converted to adenosine triphosphate (the familiar ATP molecule). Using the ATP, the enzyme luciferase converts luciferin to oxyluciferin, generating an amount of light that is proportional to the amount of ATP present. It is this light that the sequencer's camera detects and that is then analyzed. These reactions are carried out on millions of DNA strands at once, hence the term massively parallel sequencing.
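The light-based readout can be illustrated with an idealized Python model. It assumes that nucleotides are offered one type at a time in a fixed repeating order (the order "TACG" here is arbitrary) and that the light signal is exactly equal to the number of bases added in that step; real signals are noisy, and the function names are hypothetical.

```python
def simulate_flowgram(new_strand, flow_order="TACG", cycles=4):
    """Return (nucleotide, signal) pairs, one per nucleotide flow.

    Idealized: the signal equals the number of bases incorporated,
    so a run of identical bases gives a proportionally larger signal.
    """
    signals = []
    pos = 0
    for _ in range(cycles):
        for nucleotide in flow_order:
            count = 0
            # Extend for as long as the next base matches the offered nucleotide.
            while pos < len(new_strand) and new_strand[pos] == nucleotide:
                count += 1
                pos += 1
            signals.append((nucleotide, count))
    return signals


def decode_flowgram(signals):
    """Rebuild the sequence from the per-flow signals."""
    return "".join(nucleotide * count for nucleotide, count in signals)


flows = simulate_flowgram("GGATTC")
print(flows[3])                # ('G', 2)
print(decode_flowgram(flows))  # GGATTC
```

A flow that produces no light simply means the offered nucleotide was not the next base, so the sequence is recovered from which flows light up and how brightly.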
It is clear that an immense amount of data can be generated by a single NGS run. When the numerous runs performed over just a few months are taken into account, the total is enormous. The challenges are many, including data processing and interpretation, secure storage of the data, and translation to the applied science and medical arenas. New bioinformatics programs are available for NGS data analysis [6], and more are being developed. Another technological advance is the development of sequencers such as Illumina's HiSeq X, which can produce nearly 2 terabases of data from a single 3-day run. These advances will make it possible to diagnose genetics-based diseases more quickly and accurately and will help scientists learn more about the connection between DNA, the environment, and how we function.
1. Tucker T, Marra M, Friedman JM. Massively parallel sequencing: the next big thing in genetic medicine. American journal of human genetics. 2009;85(2):142-54.
2. Lander ES. Initial impact of the sequencing of the human genome. Nature. 2011;470(7333):187-97.
3. Martin WJ. New technologies for large-genome sequencing. Genome. 1989;31(2):1073-80.
4. Kato K. Impact of the next generation DNA sequencers. International journal of clinical and experimental medicine. 2009;2(2):193-202.
5. Mukhopadhyay R. DNA sequencers: the next generation. Analytical chemistry. 2009;81(5):1736-40.
6. Pabinger S, Dander A, Fischer M, Snajder R, Sperk M, Efremova M, et al. A survey of tools for variant analysis of next-generation genome sequencing data. Briefings in bioinformatics. 2014;15(2):256-78.