A Kolmogorov-Smirnov Type Two-Sample Test of Equal Distribution for Two Phase Sampling Framework
Abstract
In practice, data are sometimes collected in a two-phase sampling framework: a large i.i.d. sample is first obtained and divided into several parts, and then a subsample is drawn without replacement within each part. All subsamples together form the final sample. We specifically consider two cases in which this framework is applied: (1) two-phase stratified sampling and (2) merging data from multiple overlapping sources. We briefly introduce these two cases below.

For the case of two-phase stratified sampling, a large i.i.d. sample is obtained at phase I and then stratified based on auxiliary variables. At phase II, subsamples are drawn within strata by sampling without replacement, and information on the variable of interest is collected for the units in the subsamples. All subsamples together form the final sample, which is dependent and biased, due to the sampling without replacement and the stratification, respectively.

For the case of merging data from multiple overlapping sources, we have a large group of subjects, each of whom belongs to at least one data source. From each source, a subsample of subjects is drawn by sampling without replacement to create a dataset. All datasets are then combined to form the final data. As in the case of two-phase stratified sampling, the final data are dependent and biased, due to the sampling without replacement and the multiple data sources, respectively. In addition, since one subject can belong to multiple sources simultaneously, there is an additional issue of unidentified duplicated selection in the final data.

After the data are collected, one of the most frequently investigated research questions is whether a variable of interest has the same distribution in two groups. Most existing methods are designed for two independent groups of independent and identically distributed (i.i.d.) samples, or for dependent (paired) data measured from an i.i.d. sample of subjects. They cannot achieve the correct size of the test in our situation, because the analytical formula for the critical value is unavailable when the data are non-i.i.d., and the covariance between the two groups affects the asymptotic variance of the test statistic.

To address the statistical challenges brought by this two-phase sampling framework, we propose a modified Kolmogorov-Smirnov two-sample testing method that constructs the test statistic based on a zero-mean Gaussian process. Our contribution is to mathematically derive the covariance function of this zero-mean Gaussian process and to establish the asymptotic property of the proposed test statistic to determine the rejection region. We conduct a simulation study to evaluate our proposed method based on the empirical probability of Type I error and the empirical power for variables of interest with different types of distribution. The numerical results demonstrate that our method achieves the correct size as the sample size increases and has an asymptotic power of 1. We also apply our method to data from a Wilms tumor study, and the result further supports the validity of our method.
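
As a rough illustration of the setting described above, the following Python sketch simulates a two-phase stratified sample and computes the classical (unmodified) two-sample Kolmogorov-Smirnov statistic on it. It is not the modified test proposed in this work: the function names, the population size, the two strata, the phase II sample sizes, and the data-generating model are all assumptions made for illustration only. Because the final sample is dependent and biased, the usual i.i.d. critical values should not be applied to the resulting statistic.

```python
import numpy as np


def two_phase_stratified_sample(strata, n_per_stratum, rng):
    """Phase II: draw indices without replacement within each stratum.

    `strata` holds the stratum label of every phase I unit, and
    `n_per_stratum` maps each label to its (assumed) phase II sample size.
    """
    selected = []
    for label, n_s in n_per_stratum.items():
        idx = np.flatnonzero(strata == label)
        selected.append(rng.choice(idx, size=n_s, replace=False))
    return np.concatenate(selected)


def ks_two_sample_statistic(a, b):
    """Classical KS statistic: sup over t of |F_a(t) - F_b(t)|."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])  # the supremum is attained at a data point
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


# Phase I: a large i.i.d. sample (hypothetical model, for illustration only).
rng = np.random.default_rng(0)
N = 5000
z = rng.normal(size=N)               # auxiliary variable observed at phase I
group = rng.integers(0, 2, size=N)   # the two groups to be compared
y = z + rng.normal(size=N)           # variable of interest, same law in both groups
strata = (z > 0).astype(int)         # stratify on the auxiliary variable

# Phase II: unequal sampling fractions make the final sample biased.
sel = two_phase_stratified_sample(strata, {0: 200, 1: 600}, rng)
y_sel, g_sel = y[sel], group[sel]

stat = ks_two_sample_statistic(y_sel[g_sel == 0], y_sel[g_sel == 1])
print(f"classical KS statistic on the two-phase sample: {stat:.4f}")
```

The sketch stops at the statistic on purpose. Determining a valid rejection region in this framework requires the covariance function and asymptotic results derived in the thesis, which are not reproduced here.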