MULTI-DIMENSIONAL ANALYSIS APPROACHES FOR HETEROGENEOUS SINGLE-CELL DATA

Date

2018

Abstract

Improvements in experimental techniques have led to an explosion of information in biology research. The growing number of measurements brings challenges in analyzing the resulting data, as well as opportunities to gain deeper insights into biological systems. Conventional average-based methods are ill-suited to high-dimensional datasets because they fail to take full advantage of such rich information. More importantly, they cannot capture the heterogeneity that is prevalent in biological systems. Sophisticated algorithms that use all available measurements simultaneously are therefore emerging rapidly. These algorithms excel at making full use of the information within datasets and at revealing detailed heterogeneity.

However, existing algorithms have several important disadvantages. First, specific knowledge of statistics or machine learning is required to interpret these algorithms appropriately and to tune their parameters for future use, which may lead to misuse and misinterpretation. Second, using all measurements with equal weighting runs the risk of noise contamination; information overload has also become more common in biology research, with a large volume of irrelevant measurements. Third, regardless of the quality of measurements, analysis methods that simultaneously use a large number of measurements need to avoid the “curse of dimensionality”, which warns that distance estimation and nearest-neighbor estimation lose their meaning in high-dimensional space. Yet most current sophisticated algorithms involve distance estimation and/or nearest-neighbor estimation.
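
The third point can be illustrated with a small numerical sketch. It is not taken from the dissertation: it assumes points drawn uniformly at random from the unit hypercube, and the function name relative_contrast is hypothetical. It measures how far the farthest point is from a query point relative to the nearest one; as the number of dimensions grows, this ratio typically shrinks toward zero, so the "nearest" neighbor is barely nearer than any other point.

```python
import math
import random


def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query point to n_points random points.

    Assumption for illustration only: all coordinates are drawn
    uniformly from [0, 1), which is not how real single-cell
    measurements are distributed.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # Print the contrast for an increasing number of dimensions.
    for dims in (2, 10, 100, 1000):
        print(dims, round(relative_contrast(500, dims), 3))
```

Under these assumptions the printed contrast should fall steeply from 2 to 1000 dimensions. Real measurements are correlated and far from uniform, so the effect on actual datasets may be weaker or stronger than this toy case suggests.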

In this dissertation, my goal is to build analysis methods that are complex enough to capture heterogeneity while producing results in a format that is easy to interpret and familiar to biologists and medical researchers. I tackle the dimension reduction problem not by searching for a single best subspace but by dividing the measurements into multiple subspaces and examining them one by one. I demonstrate my methods with three types of datasets: image-based high-throughput screening data, flow cytometry data, and mass cytometry data. From each dataset, I was able to discover new biological insights as well as re-validate well-established findings with my methods.
