Bayesian Estimation of the Inbreeding Coefficient for Single Nucleotide Polymorphism Using Complex Survey Data

Thumbnail Image
Publication or External Link
Xue, Zhenyi
Lahiri, Partha
Li, Yan
In genome-wide association studies (GWAS), single nucleotide polymorphism (SNP) is often used as a genetic marker to study gene-disease association. Some large scale health sample surveys have recently started collecting genetic data. There is now growing interest in developing statistical procedures using genetic survey data. This calls for innovative statistical methods that incorporate both genetic and statistical sampling. Under simple random sampling, the traditional estimator of the inbreeding coefficient is given by 1 - (number of observed heterozygotes) / (number of expected heterozygotes). Genetic data quality control reports published by the National Health and Nutrition Examination Survey (NHANES) and the Health and Retirement Study (HRS) use this simple estimator, which serves as a reasonable quality control tool to identify problems such as genotyping error. There is, however, a need to improve on this estimator by considering different features of the complex survey design. The main goal of this dissertation is to fill in this important research gap. First, a design-based estimator and its associated jackknife standard error estimator are proposed. Secondly, a hierarchical Bayesian methodology is developed using the effective sample size and genotype count. Lastly, a Bayesian pseudo-empirical likelihood estimator is proposed using the expected number of heterozygotes in the estimating equation as a constraint when maximizing the pseudo-empirical likelihood. One of the advantages of the proposed Bayesian methodology is that the prior distribution can be used to restrict the parameter space induced by the general inbreeding model. The proposed estimators are evaluated using Monte Carlo simulation studies. Moreover, the proposed estimates of the inbreeding coefficients of SNPs from APOC1 and BDNF genes are compared using the data from the 2006 Health and Retirement Study.