Enhancing Machine Learning through Data-Centric Approaches: Efficiency, Generalization, and Trustworthiness

dc.contributor.advisor: Huang, Furong (en_US)
dc.contributor.author: Ding, Mucong (en_US)
dc.contributor.department: Computer Science (en_US)
dc.contributor.publisher: Digital Repository at the University of Maryland (en_US)
dc.contributor.publisher: University of Maryland (College Park, Md.) (en_US)
dc.date.accessioned: 2026-01-28T06:31:08Z
dc.date.issued: 2025 (en_US)
dc.description.abstract: This dissertation investigates Data-Centric AI, a paradigm that emphasizes the systematic engineering of data pipelines to address fundamental challenges in contemporary machine learning systems. Rather than focusing exclusively on architectural innovations or algorithmic modifications, this research demonstrates how principled improvements in data quality, collection, curation, and evaluation can serve as a powerful lever for enhancing model efficiency, generalization capabilities, and trustworthiness. This approach recognizes that the effectiveness of machine learning systems is intrinsically tied to the characteristics of the data on which they are trained and evaluated. The research is organized into three synergistic pillars, collectively advancing the state of machine learning through data-centric methodologies.

Part I: Enhancing Efficiency and Scalability. This dissertation first addresses computational bottlenecks that hinder the development and deployment of machine learning models. The narrative begins by tackling training instability in Generative Adversarial Networks through a graphical model approach that leverages conditional independence graphs to impose structural priors on the data generation process. The focus then shifts to scaling Graph Neural Networks to massive, real-world graphs through three complementary data transformation strategies: VQ-GNN employs vector quantization to maintain representative node embeddings while avoiding the neighbor explosion problem; Sketch-GNN introduces polynomial tensor sketching to achieve sublinear training complexity; and a spectral greedy algorithm provides direct coreset selection of ego-graphs for efficient training. Finally, this line of work broadens beyond model training to the entire machine learning pipeline by introducing hyperparameter-calibrated dataset condensation, which synthesizes small, carefully designed validation datasets to dramatically accelerate the costly process of hyperparameter search while preserving performance rankings across architectures.

Part II: Improving Generalization. The second thrust focuses on ensuring models perform well on new, unseen data through strategic data-centric interventions. Three complementary approaches are presented to actively improve generalization: SAFLEX introduces a self-adaptive augmentation framework that learns optimal sample weights and soft labels to refine any upstream augmentation pipeline; EnsemW2S demonstrates how ensemble knowledge can be distilled into high-quality synthetic data through a token-level weak-to-strong learning framework; and SAIL establishes an efficient online alignment method for large language models that strategically curates small amounts of high-quality data for continuous model improvement. To rigorously measure generalization capabilities in the first place, Easy2Hard-Bench is developed as a comprehensive benchmark with standardized, continuous difficulty labels spanning six diverse domains, enabling systematic profiling of language model reasoning across complexity levels and providing the community with a foundational tool to validate whether new methods truly enhance generalization.

Part III: Strengthening Trustworthiness. The final thrust addresses the critical need for robust and reliable AI systems, particularly for high-stakes applications. This research provides a focused investigation of the integrity of invisible image watermarks, a key technology for content provenance and security in the era of generative AI. To address the lack of standardized adversarial evaluation methods, WAVES (Watermark Analysis via Enhanced Stress-testing) is established as a comprehensive benchmark featuring diverse attacks ranging from classical image distortions to novel adversarial and regeneration methods, evaluated through a rigorous performance-versus-quality framework. To validate the utility of this benchmark and catalyze community progress, a large-scale NeurIPS 2024 competition is organized featuring black-box and beige-box tracks with 2,722 submissions from 298 global teams. The competition results not only demonstrate the benchmark's value but also uncover critical vulnerabilities in state-of-the-art watermarking methods, with top teams successfully removing watermarks from over 89% of images while maintaining high visual quality, thereby establishing new standards for trustworthiness evaluation in this domain. (en_US)
dc.identifier: https://doi.org/10.13016/de5v-fisy
dc.identifier.uri: http://hdl.handle.net/1903/35107
dc.language.iso: en (en_US)
dc.subject.pqcontrolled: Artificial intelligence (en_US)
dc.subject.pqcontrolled: Computer science (en_US)
dc.subject.pquncontrolled: Benchmark Development (en_US)
dc.subject.pquncontrolled: Data-Centric AI (en_US)
dc.subject.pquncontrolled: Graph Neural Networks (en_US)
dc.subject.pquncontrolled: Large Language Models (en_US)
dc.subject.pquncontrolled: Model Efficiency (en_US)
dc.subject.pquncontrolled: Trustworthy AI (en_US)
dc.title: Enhancing Machine Learning through Data-Centric Approaches: Efficiency, Generalization, and Trustworthiness (en_US)
dc.type: Dissertation (en_US)

Files

Original bundle

Name: Ding_umd_0117E_25697.pdf
Size: 28.24 MB
Format: Adobe Portable Document Format