USING DEEP GENERATIVE MACHINE LEARNING METHODS TO GENERATE SYNTHETIC POPULATION
Abstract
Population synthesis is an area of research that aims to generate synthetic data about households and individuals that are representative of large real populations. Scholars in many fields have worked on synthetic population generation, including statisticians, computer scientists, economists, social scientists, and engineers. In transportation modeling, synthetic agents are a key input for agent-based models, which are gradually replacing zone-based aggregate four-step models. Traditional methods for population synthesis include Iterative Proportional Fitting (IPF), which reweights sample data until the marginals of the variables of interest match official statistics (often from the census) for a given geographical area. Recently, machine learning algorithms have been tested and compared to IPF, which suffers from several well-known limitations. In this M.S. thesis, two advanced deep generative machine learning methods, CTGAN and TVAE, are applied to generate synthetic populations. CTGAN is a GAN-based algorithm that models the distribution of tabular data and samples rows from the learned distribution. It has been shown that CTGAN can address issues that challenge conventional GAN models, including mixed data types, non-Gaussian distributions, multimodal distributions, learning from sparse one-hot-encoded vectors, and highly imbalanced categorical columns. TVAE similarly adapts the variational autoencoder (VAE) to tabular data by adding preprocessing and modifying the loss function. As a case study, this research applies these two methods to generate a synthetic population based on a sample from the American Community Survey for the State of Maryland.
To demonstrate the performance of the proposed methods, we compare our results to those obtained with IPF and a Bayesian network, using metrics that evaluate the ability of the population synthesizer to reproduce the dependency structure and the marginals of the real population and to solve the zero-cell problem that affects IPF.
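The IPF baseline and its zero-cell problem can be illustrated with a minimal two-variable sketch. This is not the implementation used in the thesis: the function name `ipf`, the variables (household size by vehicle ownership), the seed counts, and the target marginals are all hypothetical values chosen for illustration.

```python
def ipf(seed, row_targets, col_targets, max_iter=1000, tol=1e-9):
    """Fit a 2-D table to target row and column marginals (illustrative sketch)."""
    table = [row[:] for row in seed]  # copy so the seed sample is not modified
    for _ in range(max_iter):
        # Scale each row so that its sum matches the row target.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [cell * target / total for cell in table[i]]
        # Scale each column so that its sum matches the column target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Columns match after the column step; stop once rows match too.
        error = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if error < tol:
            break
    return table


# Hypothetical sample counts: rows = household size (1, 2, 3+),
# columns = vehicles owned (0, 1, 2+). The zero cell means that no
# one-person household with 2+ vehicles was observed in the sample.
seed = [[10.0, 5.0, 0.0],
        [4.0, 12.0, 6.0],
        [1.0, 8.0, 14.0]]
row_targets = [300.0, 450.0, 250.0]  # assumed official household-size totals
col_targets = [250.0, 400.0, 350.0]  # assumed official vehicle-ownership totals

fitted = ipf(seed, row_targets, col_targets)
for row in fitted:
    print([round(cell, 1) for cell in row])
```

Because IPF only rescales existing cells, the combination that is absent from the sample stays at exactly zero in the fitted table, even if it exists in the real population. This is the zero-cell problem that generative models, which sample from a learned joint distribution, are intended to mitigate.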