Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing

Overview of attention for article published in PLOS ONE, July 2012

Mentioned by
2 X users
5 patents

Citations
52 Dimensions

Readers on
127 Mendeley
4 CiteULike

Title: Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing
Published in: PLOS ONE, July 2012
DOI: 10.1371/journal.pone.0041356
Pubmed ID
Authors: Justin M. Zook, Daniel Samarov, Jennifer McDaniel, Shurjo K. Sen, Marc Salit

Abstract

While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a data set used to calculate association of SSEs with various features in the reads and sequence context. This data set is typically either from a part of the data set being "recalibrated" (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 Phred-scaled quality score units, and by as much as 13 units at CpG sites. In addition, since the spike-in data used for recalibration are independent of the genome being sequenced, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has fewer dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration.
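
The idea of comparing reported base quality scores with error rates observed on reads mapped to a known spike-in sequence can be illustrated with a minimal sketch. This is not GATK's implementation. The function names, the input tuple layout, and the add-one smoothing are assumptions made for illustration. It relies only on the standard Phred definition, Q = -10 log10(p), where p is the probability that a base call is wrong.

```python
import math
from collections import defaultdict


def phred(p_error):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(p_error)


def empirical_quality(observations):
    """Estimate the empirical Phred quality for each covariate bin.

    `observations` is an iterable of (reported_q, dinucleotide, is_mismatch)
    tuples, one per base aligned to a spike-in whose true sequence is known.
    This tuple layout is a hypothetical choice for the sketch.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total bases]
    for reported_q, dinucleotide, is_mismatch in observations:
        bin_counts = counts[(reported_q, dinucleotide)]
        bin_counts[0] += int(is_mismatch)
        bin_counts[1] += 1
    # Add-one smoothing (an assumption of this sketch) keeps the estimated
    # error rate above zero, so the logarithm is always defined.
    return {
        key: phred((mismatches + 1) / (total + 2))
        for key, (mismatches, total) in counts.items()
    }
```

A bin whose empirical quality falls well below its reported quality is one where the sequencer overestimates base quality, which is the kind of discrepancy the abstract describes for certain dinucleotides.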

X Demographics

The data shown below were collected from the profiles of 2 X users who shared this research output.
Mendeley readers

The data shown below were compiled from readership statistics for 127 Mendeley readers of this research output.

Geographical breakdown

Country           Count   As %
United States     6       5%
United Kingdom    2       2%
Norway            1       <1%
Brazil            1       <1%
New Zealand       1       <1%
France            1       <1%
Denmark           1       <1%
Belgium           1       <1%
Unknown           113     89%

Demographic breakdown

Readers by professional status    Count   As %
Researcher                        52      41%
Student > Ph. D. Student          31      24%
Other                             7       6%
Student > Postgraduate            7       6%
Professor                         5       4%
Other                             14      11%
Unknown                           11      9%

Readers by discipline                           Count   As %
Agricultural and Biological Sciences            66      52%
Biochemistry, Genetics and Molecular Biology    24      19%
Medicine and Dentistry                          10      8%
Computer Science                                5       4%
Physics and Astronomy                           3       2%
Other                                           7       6%
Unknown                                         12      9%