Enhancing Machine Learning through Data-Centric Approaches: Efficiency, Generalization, and Trustworthiness

dc.contributor.advisor: Huang, Furong (en_US)
dc.contributor.author: Ding, Mucong (en_US)
dc.contributor.department: Computer Science (en_US)
dc.contributor.publisher: Digital Repository at the University of Maryland (en_US)
dc.contributor.publisher: University of Maryland (College Park, Md.) (en_US)
dc.date.accessioned: 2026-01-28T06:31:08Z
dc.date.issued: 2025 (en_US)
dc.description.abstract: This dissertation investigates Data-Centric AI, a paradigm that emphasizes the systematic engineering of data pipelines to address fundamental challenges in contemporary machine learning systems. Rather than focusing exclusively on architectural innovations or algorithmic modifications, this research demonstrates how principled improvements in data quality, collection, curation, and evaluation can serve as a powerful lever for enhancing model efficiency, generalization capabilities, and trustworthiness. This approach recognizes that the effectiveness of machine learning systems is intrinsically tied to the characteristics of the data on which they are trained and evaluated. The research is organized into three synergistic pillars, collectively advancing the state of machine learning through data-centric methodologies.

Part I: Enhancing Efficiency and Scalability. This dissertation first addresses computational bottlenecks that hinder the development and deployment of machine learning models. The narrative begins by tackling training instability in Generative Adversarial Networks through a graphical model approach that leverages conditional independence graphs to impose structural priors on the data generation process. The focus then shifts to scaling Graph Neural Networks to massive, real-world graphs through three complementary data transformation strategies: VQ-GNN employs vector quantization to maintain representative node embeddings while avoiding the neighbor explosion problem; Sketch-GNN introduces polynomial tensor sketching to achieve sublinear training complexity; and a spectral greedy algorithm provides direct coreset selection of ego-graphs for efficient training. Finally, this line of work broadens beyond model training to the entire machine learning pipeline by introducing hyperparameter-calibrated dataset condensation, which synthesizes small, carefully designed validation datasets to dramatically accelerate the costly process of hyperparameter search while preserving performance rankings across architectures.

Part II: Improving Generalization. The second thrust focuses on ensuring models perform well on new, unseen data through strategic data-centric interventions. Three complementary approaches are presented to actively improve generalization: SAFLEX introduces a self-adaptive augmentation framework that learns optimal sample weights and soft labels to refine any upstream augmentation pipeline; EnsemW2S demonstrates how ensemble knowledge can be distilled into high-quality synthetic data through a token-level weak-to-strong learning framework; and SAIL establishes an efficient online alignment method for large language models that strategically curates small amounts of high-quality data for continuous model improvement. To rigorously measure generalization capabilities in the first place, Easy2Hard-Bench is developed as a comprehensive benchmark with standardized, continuous difficulty labels spanning six diverse domains, enabling systematic profiling of language model reasoning across complexity levels and providing the community with a foundational tool to validate whether new methods truly enhance generalization.

Part III: Strengthening Trustworthiness. The final thrust addresses the critical need for robust and reliable AI systems, particularly for high-stakes applications. This research provides a focused investigation of the integrity of invisible image watermarks, a key technology for content provenance and security in the era of generative AI. To address the lack of standardized adversarial evaluation methods, WAVES (Watermark Analysis via Enhanced Stress-testing) is established as a comprehensive benchmark featuring diverse attacks ranging from classical image distortions to novel adversarial and regeneration methods, evaluated through a rigorous performance-versus-quality framework. To validate the utility of this benchmark and catalyze community progress, a large-scale NeurIPS 2024 competition is organized featuring black-box and beige-box tracks with 2,722 submissions from 298 global teams. The competition results not only demonstrate the benchmark's value but also uncover critical vulnerabilities in state-of-the-art watermarking methods, with top teams successfully removing watermarks from over 89% of images while maintaining high visual quality, thereby establishing new standards for trustworthiness evaluation in this domain. (en_US)
dc.identifier: https://doi.org/10.13016/de5v-fisy
dc.identifier.uri: http://hdl.handle.net/1903/35107
dc.language.iso: en (en_US)
dc.subject.pqcontrolled: Artificial intelligence (en_US)
dc.subject.pqcontrolled: Computer science (en_US)
dc.subject.pquncontrolled: Benchmark Development (en_US)
dc.subject.pquncontrolled: Data-Centric AI (en_US)
dc.subject.pquncontrolled: Graph Neural Networks (en_US)
dc.subject.pquncontrolled: Large Language Models (en_US)
dc.subject.pquncontrolled: Model Efficiency (en_US)
dc.subject.pquncontrolled: Trustworthy AI (en_US)
dc.title: Enhancing Machine Learning through Data-Centric Approaches: Efficiency, Generalization, and Trustworthiness (en_US)
dc.type: Dissertation (en_US)

Files

Original bundle

Name: Ding_umd_0117E_25697.pdf
Size: 28.24 MB
Format: Adobe Portable Document Format