Managing High-Throughput DNA Sequencing Data
The newest sequencing technologies are providing new insights into disease etiology, individual susceptibilities, keys to drug discovery, and more. However, many questions and challenges remain about how to understand the meaning and applicability of this newfound wealth of data. As new data is generated at record rates, scientists are working diligently to develop methods for archiving the massive volume of sequencing information and to create bioinformatics tools for analyzing these mammoth data sets.
We are seeing in real time the generation of colossal amounts of data by single next-generation sequencing (NGS) runs. With thousands of runs performed in a short period, the results are massive data sets whose significance has yet to be determined. This creates the need to efficiently process, archive, and secure the data (especially protected patient-related data). The data must then be interpreted to determine its significance and utility for translation to the clinic, regulatory entities, and various medical industries.
Data Storage and Processing
Data trimming and triage are important steps in the data storage effort. Without them, duplicated data and background information, or “noise,” would accumulate on a massive scale. Policies are developed to systematically identify and remove redundant information and artifact-related data. For example, high-throughput sequencing (HTS) reads often contain regions of low base-calling quality, particularly toward the end of the read. These regions can be identified and excluded.
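The trimming step described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular trimming tool: it assumes FASTQ-style quality strings in Phred+33 encoding, and the function names, the threshold of 20, and the example read are hypothetical choices made for the sketch.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def trim_low_quality_tail(sequence, quality_string, min_quality=20, offset=33):
    """Remove the trailing run of bases whose Phred score is below min_quality.

    Returns the trimmed (sequence, quality_string) pair.
    """
    if len(sequence) != len(quality_string):
        raise ValueError("sequence and quality string must be the same length")
    scores = phred_scores(quality_string, offset)
    end = len(scores)
    # Walk back from the end of the read while bases fall below the threshold.
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# Hypothetical read: in Phred+33, 'I' encodes a score of 40 and '#' a score of 2.
read_seq = "ACGTACGTAC"
read_qual = "IIIIIII###"
trimmed_seq, trimmed_qual = trim_low_quality_tail(read_seq, read_qual)
print(trimmed_seq, trimmed_qual)
```

Production trimmers typically use sliding windows or running-sum criteria rather than a simple trailing scan, so this sketch should be read only as an outline of the idea.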
The physical space required for the data must also be considered. Just as zip files compress large groups of files, compression methods exist to store large volumes of sequencing data and even facilitate analysis. Both general-purpose (gzip, bzip) and specialized (GenCompress, DNACompress, DNABIT) algorithms have been applied to sequencing data (1). DNABIT Compress, a DNA compression algorithm, assigns binary bits to segments of DNA bases to compress both repetitive and nonrepetitive DNA sequence (2). The binary coding significantly cuts down the file size. Reference-based compression entails aligning new sequences to a reference genome and then encoding and storing only the differences between the new and reference sequences (3).
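To make the two ideas above concrete, the following Python sketch shows a fixed two-bits-per-base packing and a substitution-only difference list against a reference. It is a simplified stand-in, not DNABIT Compress itself and not the reference-based method cited in (3): all function names are hypothetical, only the bases A, C, G, and T are handled, and the sample is assumed to be already aligned to, and the same length as, the reference.

```python
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"


def pack_2bit(sequence):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so unpacking reads bases in order.
        byte <<= 2 * (4 - len(chunk))
        packed.append(byte)
    return bytes(packed)


def unpack_2bit(packed, length):
    """Recover the first `length` bases from 2-bit packed bytes."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def diff_against_reference(reference, sample):
    """List (position, base) pairs where an equal-length sample differs."""
    if len(reference) != len(sample):
        raise ValueError("this sketch handles substitutions only")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_diff(reference, differences):
    """Rebuild the sample from the reference plus the stored differences."""
    bases = list(reference)
    for position, base in differences:
        bases[position] = base
    return "".join(bases)


# Hypothetical sequences for illustration only.
reference = "ACGTACGTAC"
sample = "ACGAACGTAC"
packed = pack_2bit(sample)
differences = diff_against_reference(reference, sample)
print(len(sample), "bases packed into", len(packed), "bytes")
print("stored differences:", differences)
```

Two bits per base gives a fourfold reduction over one byte per character. Storing only differences can shrink the data much further when the sample closely matches the reference, although real formats must also handle insertions, deletions, ambiguous bases, and quality scores.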
Data storage challenges are not limited to the technical details of how to physically store the sequencing data. Other issues concern the use and protection of individual medical information. The availability of genetic testing for consumers raises ethical concerns and questions regarding the use of stored consumer data. Studies and surveys of genetic testing companies have revealed that some companies may have used consumer data in research efforts, and that the existence of policies regarding data sharing was not always apparent (4). This demonstrates the need to develop standards and policies to protect the integrity of stored sensitive consumer DNA sequencing data.
Numerous tools exist for genomic data analysis. They vary in the algorithms they use, the software and hardware needed to run them, and the programming languages in which they are written. Bioinformatics tools are also categorized by the type of genome to be analyzed.
The challenge of the sequence analysis phase is to obtain meaningful, medically significant information from data that is now in the terabyte range. Software and apps are available for scientific teams to analyze their own data, and a growing list of companies offers analysis services. Courses and tutorials are also available to help scientists gain the skills and knowledge needed to analyze their NGS and other data.
Examples of sequence analysis objectives are variant detection, screening for protein-DNA interactions, and discovery of unique transcripts. Pabinger et al. published a survey of bioinformatics tools for variant analysis of NGS data (5). Zomer et al. developed an open source software tool with a web interface that analyzes TnSeq-derived data to find essential genes (6). The typical process for sequence analysis begins with base calling to obtain raw data reads. The reads are then either assembled de novo or aligned to a reference genome. In the case of variant detection, differences between a sample and the associated reference genome are identified.
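The last step, identifying differences between aligned reads and a reference, can be outlined with a naive pileup in Python. This is a toy sketch under strong assumptions, not how the surveyed variant callers work: reads are taken to be already aligned without gaps, each read is given as a hypothetical (start, sequence) pair, and the depth and allele-fraction thresholds are arbitrary illustrative values.

```python
from collections import Counter, defaultdict


def build_pileup(aligned_reads):
    """Count the bases observed at each reference position.

    aligned_reads is an iterable of (start, sequence) pairs, where start is
    the 0-based reference position of the read's first base (no gaps).
    """
    pileup = defaultdict(Counter)
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            pileup[start + offset][base] += 1
    return pileup


def call_variants(reference, aligned_reads, min_depth=3, min_fraction=0.8):
    """Report (position, ref_base, alt_base) where reads disagree with the reference."""
    variants = []
    pileup = build_pileup(aligned_reads)
    for position in sorted(pileup):
        if position < 0 or position >= len(reference):
            continue  # ignore bases that fall outside the reference
        counts = pileup[position]
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to support a call
        base, count = counts.most_common(1)[0]
        if base != reference[position] and count / depth >= min_fraction:
            variants.append((position, reference[position], base))
    return variants


# Hypothetical reference and reads; every read covering position 3 carries an A.
reference = "ACGTACGTAC"
reads = [(0, "ACGAAC"), (2, "GAACGT"), (3, "AACGTAC"), (1, "CGAACG")]
variants = call_variants(reference, reads)
print(variants)
```

Real variant callers also weigh base and mapping qualities, model sequencing error statistically, and handle insertions, deletions, and heterozygous sites, none of which this sketch attempts.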
Interpretation and Application
Once data has been processed and analyzed, it is necessary to know what the data means in clinical, drug discovery, and other applied bioscience settings. Software is available to assist in making inferences from analyzed sequence data and in determining medically relevant and actionable information. One example is Sequence to Medical Phenotypes (STMP), an open source pipeline for clinical interpretation of sequence data. This program allows the determination of genetic drug responses as well as genetic disease risk (7).
The ENCODE (Encyclopedia of DNA Elements) Consortium, funded by the National Human Genome Research Institute, is designed to produce a comprehensive collection of functional elements in the human genome. The goal is to provide genomic information that will help to determine the relationship between DNA sequences and disease development and management. Researchers consult this and other developing databases to determine the role of analyzed sequences in biomedicine.
The impressive and ever-growing availability of informatics tools is proving indispensable in the effort to manage and apply HTS data to biomedical and clinical work. Numerous challenges remain, and improving the validation of bioinformatics tools and the reproducibility of variant detection is an ongoing endeavor. Nevertheless, HTS data has been shown to provide significant information that can be applied to early clinical diagnosis and more successful treatment strategies.
References
1. Zhu Z, Zhang Y, Ji Z, He S, Yang X. High-throughput DNA sequence data compression. Brief Bioinform. 2015 Jan;16(1):1-15.
2. Rajarajeswari P, Apparao A. DNABIT Compress – Genome compression algorithm. Bioinformation. 2011;5(8):350-360.
3. Hsi-Yang Fritz M, Leinonen R, Cochrane G, Birney E. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 2011;21(5):734-740.
4. Niemiec E, Howard HC. Ethical issues in consumer genome sequencing: Use of consumers’ samples and data. Appl Transl Genom. 2016 Feb 1;8:23-30.
5. Pabinger S, Dander A, Fischer M, Snajder R, Sperk M, Efremova M, et al. A survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform. 2014;15(2):256-278.
6. Zomer A, Burghout P, Bootsma HJ, Hermans PW, van Hijum SA. ESSENTIALS: software for rapid analysis of high throughput transposon insertion sequencing data. PLoS One. 2012;7(8):e43012.
7. The Ashley Lab. http://ashleylab.stanford.edu/tools/stmp.html