Data-Driven Techniques For Vulnerability Assessments

Security vulnerabilities have puzzled researchers and practitioners for decades. As highlighted by the recent WannaCry and NotPetya ransomware campaigns, which resulted in billions of dollars of losses, weaponized exploits against vulnerabilities remain one of the main tools of cybercrime. The upward trend in the number of vulnerabilities reported annually, combined with technical obstacles to remediation, leaves vulnerable populations exposed for long windows. On the other hand, thanks to sustained efforts in application and operating system security, few vulnerabilities are ever exploited in real-world attacks. Existing severity metrics err on the side of caution and overestimate the risk posed by vulnerabilities, further hampering remediation efforts that rely on prioritization.

In this dissertation we show that severity assessments can be improved by taking into account public information about vulnerabilities and exploits. The disclosure of a vulnerability is followed by artifacts such as social media discussions, write-ups, and proof-of-concept exploits, which contain technical information about the vulnerability and its exploitation. These artifacts can be mined to detect active exploits or to predict their development. First, however, we need to answer several questions: What features are required for different tasks? What biases are present in public data, and how do they affect data-driven systems? What security threats do these systems face when deployed operationally?

We explore these questions by first collecting vulnerability-related posts on social media and analyzing both the community that produces them and the content of its discussions. This analysis reveals that victims of attacks often share their experiences online, and we leverage this finding to build an early detector of exploits active in the wild. Our detector significantly improves on the precision of existing severity metrics and detects active exploits a median of 5 days earlier than a commercial intrusion prevention product.
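The detection idea above can be illustrated with a minimal text classifier. The sketch below trains a bag-of-words Naive Bayes model to flag posts reporting in-the-wild exploitation; the example posts, keywords, and labels are illustrative assumptions, not the dissertation's actual dataset or model.

```python
# Hedged sketch: a tiny bag-of-words Naive Bayes detector for posts
# reporting in-the-wild exploitation. Toy data, not the real system.
from collections import Counter
import math

def tokenize(text):
    return text.lower().split()

def train(posts):
    """posts: list of (text, label); label 1 = active-exploit report."""
    counts = {0: Counter(), 1: Counter()}
    priors = Counter()
    for text, label in posts:
        priors[label] += 1
        counts[label].update(tokenize(text))
    return counts, priors

def predict(text, counts, priors):
    vocab = set(counts[0]) | set(counts[1])
    scores = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        score = math.log(priors[label] / sum(priors.values()))
        for tok in tokenize(text):
            # Laplace smoothing over the shared vocabulary
            score += math.log((counts[label][tok] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

posts = [
    ("patched my servers nothing unusual", 0),
    ("cve-2017-0144 exploited in the wild ransomware spreading", 1),
    ("reading the advisory for cve-2017-0144", 0),
    ("we got hit attackers exploited cve-2017-0144 overnight", 1),
]
counts, priors = train(posts)
print(predict("our network was exploited attackers in the wild", counts, priors))  # → 1
```

A production detector would of course use richer features and far more data; the point here is only the shape of the pipeline: collect posts, label exploitation reports, classify new posts.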

Next, we investigate the utility of various artifacts in predicting the development of functional exploits. We engineer features causally linked to the ease of exploitation, highlight trade-offs between the timeliness and predictive utility of various artifacts, and characterize the biases that affect the ground truth in exploit prediction tasks. Using these insights, we propose a machine learning-based system that continuously collects artifacts and predicts the likelihood that functional exploits will be developed against the corresponding vulnerabilities. We demonstrate our system's practical utility through its ability to highlight critical vulnerabilities and predict imminent exploits.
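The prediction task can be sketched as a simple probabilistic classifier over artifact-derived features. The sketch below trains a small logistic regression with stochastic gradient descent; the feature names (proof-of-concept published, write-up volume, CVSS score) and the data are hypothetical stand-ins for the engineered features described above.

```python
# Hedged sketch: logistic regression over hypothetical artifact
# features predicting whether a functional exploit gets developed.
# Feature choices and data are illustrative assumptions.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(X, y, lr=0.5, epochs=2000):
    w = [0.0] * (len(X[0]) + 1)  # bias followed by per-feature weights
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = sigmoid(z) - yi
            w[0] -= lr * err
            for j, xj in enumerate(xi):
                w[j + 1] -= lr * err * xj
    return w

def exploit_likelihood(w, x):
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

# Columns: [poc_published, write_up_volume (scaled), cvss / 10]
X = [[1, 0.8, 0.9], [1, 0.4, 0.7], [0, 0.1, 0.5], [0, 0.0, 0.3]]
y = [1, 1, 0, 0]  # 1 = a functional exploit was developed
w = fit(X, y)
print(exploit_likelihood(w, [1, 0.6, 0.8]) > 0.5)
```

The continuous-collection aspect of the real system would simply re-extract these features as new artifacts appear and re-score each vulnerability over time.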

Lastly, we explore the adversarial threats faced by data-driven security systems that rely on inputs of unknown provenance. We propose a framework for defining algorithmic threat models and for exploring adversaries with varying degrees of knowledge and capabilities. Using this framework, we model realistic adversaries that could target our systems, design data poisoning attacks to measure the systems' robustness, and highlight promising directions for future defenses against such attacks.
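The poisoning threat can be illustrated with a toy attack. Below, a nearest-centroid classifier stands in for an exploit predictor trained on crowd-sourced inputs; an attacker injects benign-looking points mislabeled as "exploited," dragging the learned centroid into the benign region and degrading test accuracy. The classifier, data, and attack budget are all synthetic assumptions, not the dissertation's actual attack.

```python
# Hedged sketch: label-flipping data poisoning against a toy
# nearest-centroid classifier. All data is synthetic.

def centroid(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def fit(X, y):
    return {c: centroid([x for x, yi in zip(X, y) if yi == c]) for c in set(y)}

def predict(model, x):
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(model, key=lambda c: dist(model[c], x))

def accuracy(model, X, y):
    return sum(predict(model, x) == yi for x, yi in zip(X, y)) / len(y)

X_train = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y_train = [0, 0, 1, 1]  # 1 = exploited
X_test, y_test = [[0.3, 0.3], [0.85, 0.85]], [0, 1]

clean = fit(X_train, y_train)
# Poison: inject benign-looking points mislabeled as exploited,
# pulling the class-1 centroid toward the benign region.
poisoned = fit(X_train + [[0.1, 0.1]] * 10, y_train + [1] * 10)

print(accuracy(clean, X_test, y_test), accuracy(poisoned, X_test, y_test))  # → 1.0 0.5
```

Measuring robustness then amounts to sweeping the attacker's budget (the number of poison points) and knowledge, which is the role of the threat-model framework described above.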