A Comparative Study Of Outlier Detection Methods And Their Downstream Effects

Adipudi, Vikram

Files

Adipudi_umd_0117N_24315.pdf (2.34 MB)

Date

2024

Authors

Adipudi, Vikram

Advisor

Herrmann, Jeffrey W.

DRUM DOI

https://doi.org/10.13016/gkhj-8xej

Abstract

When machine learning models are fitted to a dataset, outliers in the data can cause overfitting. The resulting errors can lead to incorrect predictions and diminish the usefulness of the model. Outlier detection is therefore performed as a preprocessing step to avoid these errors and improve model performance. This study compares how different outlier detection methods affect regression, classification, and clustering methods, in order to identify which outlier detection method performs best for each task. To conduct the study, multiple outlier detection algorithms were used to clean datasets, and the cleaned data was fed into the models. Model performance with and without cleaning was then compared to identify trends. The study found that outlier detection of any kind has little impact on the supervised tasks, regression and classification. For the unsupervised task, the outlier detection and removal algorithm with the most positive impact differed from one clustering model to another. Most commonly, IForest and PCA had the greatest impact on clustering methods.

URI (handle)

http://hdl.handle.net/1903/33034

Collections

UMD Theses and Dissertations
Computer Science Theses and Dissertations
