# Extending the Scope of Provable Adversarial Robustness in Machine Learning

## Abstract

The study of provable defenses against adversarial attacks in machine learning has mostly been limited to classification tasks and static one-step adversaries. Robustness certificates are designed with a fixed adversarial budget for each input instance and with the assumption that inputs are sampled independently. The goal of this work is to expand the set of provable robustness techniques to cover more general real-world settings such as adaptive multi-step adversaries (e.g., reinforcement learning), distribution shifts (e.g., color shifts in images), and models with structured outputs (e.g., images, sets, and segmentation masks). Each setting presents unique challenges that require specially designed proof techniques. For instance, an adversarial attack on a reinforcement learning agent at a given time step can affect its performance in future time steps. Thus, certified robustness methods developed for the static setting cannot provide guarantees in a dynamic environment that evolves over time.

First, we present a robustness certificate for bounded Wasserstein shifts of the input distribution. We show that a simple procedure that randomizes the input of the model within a transformation space is provably robust to distributional shifts under that transformation. Our framework allows the datum-specific perturbation size to vary across different points in the input distribution and is general enough to include fixed-size perturbations as well. Our certificates produce guaranteed lower bounds on the performance of the model for any (natural or adversarial) shift of the input distribution within a Wasserstein ball around the original distribution.
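The core randomization procedure can be sketched in a few lines. This is a minimal illustration, not the thesis's implementation: `model` and `sample_transform` are hypothetical placeholders, and the toy transformation space here is additive brightness shifts.

```python
import numpy as np

def transformation_smoothed_predict(model, x, sample_transform, n_samples=100, rng=None):
    """Evaluate `model` on randomly transformed copies of `x` and take a
    majority vote over the predictions. Randomizing the input within a
    transformation space (here, brightness shifts) is the procedure that
    yields provable robustness to Wasserstein-bounded distribution shifts
    under that transformation. `model` and `sample_transform` are
    illustrative placeholders, not the thesis's API."""
    rng = rng or np.random.default_rng(0)
    votes = {}
    for _ in range(n_samples):
        y = model(sample_transform(x, rng))
        votes[y] = votes.get(y, 0) + 1
    return max(votes, key=votes.get)

# Toy example: a classifier that thresholds mean brightness, smoothed
# over random additive brightness shifts.
model = lambda x: int(x.mean() > 0.5)
shift = lambda x, rng: np.clip(x + rng.normal(0.0, 0.1), 0.0, 1.0)
x = np.full(4, 0.8)
print(transformation_smoothed_predict(model, x, shift))  # prints 1
```

The certificate itself then lower-bounds the smoothed model's expected performance over any distribution within the Wasserstein ball, rather than certifying a single point.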

Next, we present certifiable robustness in the setting of reinforcement learning where the adversary is allowed to track the states, actions, and observations generated in previous time steps and adapt its attack. We prove robustness guarantees for an agent following a Gaussian-smoothed policy. The goal here is to certify that the expected total reward obtained by the robust policy remains above a certain threshold under a norm-bounded adaptive adversary. Our main theoretical contribution is to prove an adaptive version of the Neyman-Pearson Lemma – a key lemma for smoothing-based certificates – where the adversarial perturbation at a particular time step is allowed to be a stochastic function of previous observations, states, and actions.
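The smoothed-policy rollout described above can be sketched as follows. This is a toy single-dimensional setup under assumed placeholder interfaces (`policy`, `env_step`), not the thesis's algorithm: the one essential ingredient it shows is that Gaussian noise is added to the observation at every time step before the policy acts.

```python
import numpy as np

def rollout_smoothed_policy(policy, env_step, obs0, sigma=0.5, horizon=10, rng=None):
    """Roll out a Gaussian-smoothed policy: at every time step the
    observation is perturbed with Gaussian noise before the policy
    chooses an action. Smoothing the policy this way is what makes the
    expected total reward certifiable against a norm-bounded adaptive
    adversary. `policy` and `env_step` are illustrative placeholders."""
    rng = rng or np.random.default_rng(0)
    obs, total_reward = obs0, 0.0
    for _ in range(horizon):
        noisy_obs = obs + rng.normal(0.0, sigma, size=np.shape(obs))
        action = policy(noisy_obs)
        obs, reward = env_step(obs, action)
        total_reward += reward
    return total_reward

# Toy 1-D environment: reward 1 whenever the action matches the sign of the state.
policy = lambda o: 1.0 if o >= 0 else -1.0
env_step = lambda s, a: (s, 1.0 if a * s > 0 else 0.0)
print(rollout_smoothed_policy(policy, env_step, obs0=2.0))  # prints 10.0
```

The certificate then guarantees that the expectation of `total_reward` stays above a threshold even when each observation is additionally perturbed by an adversary that adapts to the history of states, actions, and observations.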

We then develop a randomized smoothing-based algorithm to produce certifiably robust models for problems with structured outputs. Many machine learning problems, such as image segmentation, object detection, and image/audio-to-text systems, fall under this category. Our procedure works by evaluating the base model on a collection of noisy versions of the input point and aggregating the predictions by computing the center of the smallest ball that covers at least half of the output points. It can produce robustness certificates under a wide range of similarity (or distance) metrics in the output space, such as perceptual distance, intersection over union, and cosine distance. These certificates guarantee that the change in the output, as measured by the distance metric, remains bounded for an adversarial perturbation of the input.
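The aggregation step can be sketched with a discrete approximation: restrict candidate centers to the sampled outputs themselves and pick the one whose covering radius for half the points is smallest. This is a simplified illustration of the smallest-half-covering-ball idea, not the thesis's exact procedure, and `dist` stands in for any output-space metric.

```python
import numpy as np

def half_ball_center(outputs, dist):
    """Aggregate structured outputs: among the sampled outputs, return
    the one whose ball of smallest radius covers at least half of all
    sampled outputs. A discrete approximation of the smallest-covering-
    ball center; `dist` can be any output-space distance (perceptual
    distance, 1 - IoU, cosine distance, ...)."""
    n = len(outputs)
    k = (n + 1) // 2  # number of points needed to cover at least half
    best, best_radius = None, np.inf
    for c in outputs:
        radii = sorted(dist(c, y) for y in outputs)
        r = radii[k - 1]  # radius that covers the k nearest outputs
        if r < best_radius:
            best, best_radius = c, r
    return best

# Toy example with Euclidean distance: three clustered outputs and one outlier.
pts = [np.array([0.0, 0.0]), np.array([0.1, 0.0]),
       np.array([0.0, 0.1]), np.array([5.0, 5.0])]
dist = lambda a, b: float(np.linalg.norm(a - b))
print(half_ball_center(pts, dist))  # prints [0. 0.]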

We also study some limitations of randomized smoothing when used to defend against $\ell_p$-norm bounded adversaries for $p>2$, especially for $p = \infty$. We show that this technique suffers from the curse of dimensionality when the smoothing distribution is independent and identically distributed across the input dimensions. The certified radius shrinks as the dimensionality of the input space grows. Thus, for high-dimensional inputs such as images, randomized smoothing does not yield meaningful certificates against an $\ell_\infty$-norm bounded adversary.
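The qualitative scaling in the $\ell_\infty$ case can already be seen from a simple norm-containment argument (an illustration of the phenomenon only; the thesis proves lower bounds for general i.i.d. smoothing distributions). A classifier certified for all $\ell_2$ perturbations of radius $r$, as Gaussian smoothing provides, is certified for $\ell_\infty$ perturbations only up to a radius that vanishes with the input dimension $d$:

```latex
\[
  \{\delta : \|\delta\|_\infty \le \varepsilon\}
  \subseteq
  \{\delta : \|\delta\|_2 \le \varepsilon\sqrt{d}\}
  \quad\Longrightarrow\quad
  \varepsilon_\infty = \frac{r}{\sqrt{d}} .
\]
```

For image-scale inputs ($d$ in the tens of thousands or more), this $1/\sqrt{d}$ factor makes the resulting $\ell_\infty$ certificate negligibly small.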

We also design a method to certify confidence scores for neural network predictions under adversarial perturbations of the input. Conventional classification networks with a softmax layer output a confidence score that can be interpreted as the degree of certainty the network has about the class label. In applications like credit scoring and disease diagnosis systems where reliability is key, it is important to know how sure a model is about its predictions so that a human expert can take over if the model’s confidence is low. Our procedure uses the distribution of the confidence scores under randomized smoothing to generate stronger certificates than a naive approach that ignores the distributional information.
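A simplified version of the baseline can be sketched as follows: estimate the smoothed model's expected confidence by Monte Carlo sampling and lower-bound it with a Hoeffding bound on the sample mean. This sketch corresponds to the naive mean-based bound; the approach described above goes further by using the full empirical distribution of the scores to obtain tighter certificates. `conf_fn` is an assumed placeholder for the base model's confidence score.

```python
import numpy as np

def confidence_lower_bound(conf_fn, x, sigma=0.25, n=1000, alpha=0.001, rng=None):
    """Monte-Carlo lower bound on the expected confidence of a
    Gaussian-smoothed model at x. Uses a Hoeffding bound on the mean of
    scores in [0, 1]; the distribution-aware certificate described in
    the text is strictly tighter. `conf_fn` is a placeholder."""
    rng = rng or np.random.default_rng(0)
    scores = np.array([conf_fn(x + rng.normal(0.0, sigma, size=np.shape(x)))
                       for _ in range(n)])
    # Hoeffding: with probability >= 1 - alpha the true mean exceeds this bound.
    return scores.mean() - np.sqrt(np.log(1.0 / alpha) / (2.0 * n))

# Toy confidence function: sigmoid of the first input coordinate.
conf = lambda z: 1.0 / (1.0 + np.exp(-z[0]))
x = np.array([2.0, 0.0])
print(confidence_lower_bound(conf, x))
```

If the bound stays high under the worst-case shift of the noise distribution induced by an adversarial perturbation, the confidence score itself is certified, not just the predicted label.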

Finally, we present a certifiable defense for streaming models. In many deep learning applications such as online content recommendation and stock market analysis, models use historical data to make predictions. Robustness certificates based on the assumption of independent input samples are not directly applicable in such scenarios. We study provable robustness of machine learning models in the context of data streams, where inputs are presented as a sequence of potentially correlated items. We derive robustness certificates for models that use a fixed-size sliding window over the input stream. Our guarantees hold for the average model performance across the entire stream and are independent of stream size, making them suitable for large data streams.
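The sliding-window setup can be sketched as below. This is a toy illustration under assumed placeholders (`model` maps a window of items to a label), showing the structure the certificates apply to: a fixed-size window slides over a possibly correlated stream, and each window's prediction is smoothed with noise; the guarantee then bounds the average performance over all windows, independent of stream length.

```python
from collections import deque
import numpy as np

def stream_predictions(model, stream, window_size=3, sigma=0.1, n_samples=20, rng=None):
    """Slide a fixed-size window over a data stream and emit a
    noise-smoothed (majority-vote) prediction for each full window.
    `model` is an illustrative placeholder mapping a window to a label."""
    rng = rng or np.random.default_rng(0)
    window = deque(maxlen=window_size)  # old items fall out automatically
    preds = []
    for item in stream:
        window.append(item)
        if len(window) < window_size:
            continue  # wait until the first window fills up
        w = np.array(window)
        votes = [model(w + rng.normal(0.0, sigma, size=w.shape))
                 for _ in range(n_samples)]
        preds.append(max(set(votes), key=votes.count))
    return preds

# Toy model: label 1 if the window mean is positive.
model = lambda w: int(w.mean() > 0)
print(stream_predictions(model, [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]))  # prints [1, 1, 0, 0]
```

Because consecutive windows overlap, a single corrupted stream item affects at most `window_size` predictions, which is the structural fact that lets the average-performance guarantee be independent of the stream's length.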