Statistical Inference Using Data From Multiple Files Combined Through Record Linkage
Record linkage methods help us combine multiple data sets from different sources when no single data set contains all the necessary information or when collecting data on additional variables is time consuming and extremely costly. Linkage errors are inevitable in the linked data set because an error-free unique identifier is unavailable and because of possible errors in measuring or recording. Even a small number of linkage errors can lead to substantial bias and increased variability in estimating the parameters of a statistical model, so the uncertainty of the record linkage process must be incorporated into the statistical analysis step. Current research in this area focuses mainly on regression analysis of linked data and treats the record linkage and statistical analysis processes as two separate steps. Because information about the record linkage process is limited, simplifying assumptions about the linkage mechanism have to be made. In reality, however, these assumptions may be violated. Also, most existing linkage error models are built on the linked data set, which contains records only for the designated links, so the information about linkage errors carried by the designated non-links is missing. In this dissertation, we provide general methodologies for both regression analysis and small area estimation using data from multiple files. A general integrated model is proposed to combine the record linkage and statistical analysis processes. The proposed linkage error models are built directly on the data values from the original sources and are based on the actual record linkage method that is used. We adapt jackknife methods to estimate the bias, variance, and mean squared error of our proposed estimators.
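The claim that a few linkage errors can bias a regression can be illustrated with a small simulation. The sketch below is not the dissertation's linkage error model. It assumes a deliberately simple, hypothetical mechanism in which a random fraction of records has its response values shuffled among themselves, and the function name `simulate_linked_slope` and all parameter values are illustrative choices.

```python
import numpy as np

def simulate_linked_slope(n=5000, beta=2.0, mismatch_rate=0.1, seed=0):
    """Compare OLS slopes on correctly linked and mislinked data.

    Assumption (illustrative only): linkage errors are a random
    permutation of y among a randomly chosen subset of records.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 1.0 + beta * x + rng.normal(scale=0.5, size=n)

    # Pick the records that will be mislinked and shuffle their responses.
    n_bad = int(round(mismatch_rate * n))
    bad = rng.choice(n, size=n_bad, replace=False)
    y_linked = y.copy()
    y_linked[bad] = y[rng.permutation(bad)]

    # np.polyfit with degree 1 returns [slope, intercept].
    slope_true = np.polyfit(x, y, 1)[0]
    slope_linked = np.polyfit(x, y_linked, 1)[0]
    return slope_true, slope_linked

slope_true, slope_linked = simulate_linked_slope()
print(f"slope with perfect links: {slope_true:.3f}")
print(f"slope with 10% mismatches: {slope_linked:.3f}")
```

Under this mechanism, a mislinked response is essentially unrelated to its covariate, so the naive slope is pulled toward zero by roughly the mismatch rate.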
To illustrate the general methodology, we give one example of estimating the regression coefficients in linear and logistic regression models, and another example of estimating a small area mean under the nested-error linear regression model. To reduce the computational burden, simplified versions of the proposed estimators, jackknife methods, and numerical algorithms are given. A Monte Carlo simulation study is devised to evaluate the performance of the proposed estimators and to investigate the difference between the standard and simplified jackknife methods.
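For readers unfamiliar with the jackknife, the following is a minimal sketch of the textbook delete-one version for estimating bias and variance. It is not the adapted or simplified jackknife developed in the dissertation. The function name `jackknife`, the returned dictionary keys, and the plug-in mean squared error (variance plus squared bias) are assumptions made for illustration.

```python
import numpy as np

def jackknife(data, estimator):
    """Textbook delete-one jackknife for a scalar estimator."""
    data = np.asarray(data)
    n = len(data)
    theta_hat = estimator(data)

    # Recompute the estimator with each observation left out in turn.
    loo = np.array([estimator(np.delete(data, i, axis=0)) for i in range(n)])
    loo_mean = loo.mean()

    bias = (n - 1) * (loo_mean - theta_hat)
    variance = (n - 1) / n * np.sum((loo - loo_mean) ** 2)
    return {
        "estimate": theta_hat,
        "bias": bias,
        "variance": variance,
        "mse": variance + bias ** 2,       # plug-in MSE, an illustrative choice
        "bias_corrected": theta_hat - bias,
    }

rng = np.random.default_rng(1)
x = rng.normal(size=50)

# np.var with its default ddof=0 is a biased variance estimator;
# the jackknife bias correction recovers the unbiased sample variance.
result = jackknife(x, np.var)
print(result["bias_corrected"], np.var(x, ddof=1))
```

Each call recomputes the estimator n times, which is why computational cost becomes a concern when the estimator itself is expensive to evaluate.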