Report for: On Combining Reference Data to Improve Imputation Accuracy

Title	On Combining Reference Data to Improve Imputation Accuracy
Published in	PLOS ONE, January 2013
DOI	10.1371/journal.pone.0055600
Pubmed ID	23383238
Authors	Jun Chen, Ji-Gang Zhang, Jian Li, Yu-Fang Pei, Hong-Wen Deng
Abstract	Genotype imputation is an important tool in human genetics studies, which uses reference sets with known genotypes and prior knowledge on linkage disequilibrium and recombination rates to infer un-typed alleles for human genetic variations at a low cost. The reference sets used by current imputation approaches are based on HapMap data, and/or based on recently available next-generation sequencing (NGS) data such as data generated by the 1000 Genomes Project. However, with different coverage and call rates for different NGS data sets, how to integrate NGS data sets of different accuracy as well as previously available reference data as references in imputation is not an easy task and has not been systematically investigated. In this study, we performed a comprehensive assessment of three strategies for using NGS data and previously available reference data in genotype imputation for both simulated data and empirical data, in order to obtain guidelines for optimal reference set construction. Briefly, we considered three strategies: strategy 1 uses one NGS data set as a reference; strategy 2 imputes samples by using multiple individual data sets of different accuracy as independent references and then combines the imputed samples, with samples based on the high accuracy reference selected when overlapping occurs; and strategy 3 combines multiple available data sets as a single reference after imputing each other. We used three software packages (MACH, IMPUTE2 and BEAGLE) to assess the performance of these three strategies. Our results show that strategy 2 and strategy 3 have higher imputation accuracy than strategy 1. Particularly, strategy 2 is the best strategy across all the conditions that we have investigated, producing the best accuracy of imputation for rare variants. Our study is helpful in guiding application of imputation methods in next generation association analyses.
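
The combining step of strategy 2, as the abstract describes it, can be sketched roughly as follows. This is only an illustration and not the authors' code: the function name `merge_imputed`, the (chromosome, position) variant key, and the dosage values are all hypothetical, and the real pipeline works on the output of MACH, IMPUTE2 or BEAGLE rather than on plain dictionaries.

```python
from typing import Dict, Tuple

# Hypothetical variant key: (chromosome, position).
Variant = Tuple[str, int]


def merge_imputed(
    high_acc: Dict[Variant, float],
    low_acc: Dict[Variant, float],
) -> Dict[Variant, float]:
    """Combine two independently imputed results for one sample.

    Where both results cover the same variant, keep the value that was
    imputed from the higher-accuracy reference (assumption: the caller
    already knows which reference is the more accurate one).
    """
    merged = dict(low_acc)    # start from the lower-accuracy result
    merged.update(high_acc)   # overlapping variants are overwritten
    return merged


# Made-up allele dosages for a single sample.
from_high = {("1", 100): 0.1, ("1", 200): 1.9}
from_low = {("1", 200): 1.2, ("1", 300): 0.5}
merged = merge_imputed(from_high, from_low)
```

In this sketch the variant at position 200 is covered by both results, so the value from the higher-accuracy reference is kept, while positions 100 and 300 are carried over unchanged.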


X Demographics

The data shown below were collected from the profiles of 2 X users who shared this research output.

Geographical breakdown

Country	Count	As %
United States	2	100%

Demographic breakdown

Type	Count	As %
Scientists	2	100%

Mendeley readers

The data shown below were compiled from readership statistics for 33 Mendeley readers of this research output.

Geographical breakdown

Country	Count	As %
United States	3	9%
Finland	1	3%
Brazil	1	3%
New Zealand	1	3%
United Kingdom	1	3%
Unknown	26	79%

Demographic breakdown

Readers by professional status	Count	As %
Researcher	14	42%
Student > Ph. D. Student	7	21%
Student > Master	4	12%
Student > Doctoral Student	2	6%
Other	2	6%
Other	3	9%
Unknown	1	3%

Readers by discipline	Count	As %
Agricultural and Biological Sciences	20	61%
Biochemistry, Genetics and Molecular Biology	2	6%
Mathematics	2	6%
Medicine and Dentistry	2	6%
Computer Science	1	3%
Other	2	6%
Unknown	4	12%
