Theses and Dissertations from UMD
Permanent URI for this communityhttp://hdl.handle.net/1903/2
New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a 4 month delay in the appearance of a give thesis/dissertation in DRUM
More information is available at Theses and Dissertations at University of Maryland Libraries.
Browse
3 results
Search Results
Item Robust Low-Overhead Binary Rewriting: Design, Extensibility, And Customizability(2021) Majlesi Kupaei, Amir; Barua, Rajeev; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)Binary rewriting is the foundation of a wide range of binary analysis tools and techniques, including securing untrusted code, enforcing control-flow integrity, dynamic optimization, profiling, race detection, and taint tracking to prevent data leaks. There are two equally important and necessary criteria that a binary rewriter must have: it must be robust and incur low overhead. First, a binary rewriter must work for different binaries, including those produced by commercial compilers from a wide variety of languages, and possibly modified by obfuscation tools. Second, the binary rewriter must be low overhead. Although the off-line use of programs, such as testing and profiling, can tolerate large overheads, the use of binary rewriters in deployed programs must not introduce significant overheads; typically, it should not be more than a few percent. Existing binary rewriters have their challenges: static rewriters do not reliably work for stripped binaries (i.e., those without relocation information), and dynamic rewriters suffer from high base overhead. Because of this high overhead, existing dynamic rewriters are limited to off-line testing and cannot be practically used in deployment. In the first part, we have designed and implemented a dynamic binary rewriter called RL-Bin, a robust binary rewriter that can instrument binaries reliably with very low overhead. Unlike existing static rewriters, RL-Bin works for all benign binaries, including stripped binaries that do not contain relocation information. In addition, RL-Bin does not suffer from high overhead because its design is not based on the code-cache, which is the primary mechanism for other dynamic rewriters such as Pin, DynamoRIO, and Dyninst. RL-Bin's design and optimization methods have empowered RL-Bin to rewrite binaries with very low overhead (1.04x on average for SPECrate 2017) and very low memory overhead (1.69x for SPECrate 2017). In comparison, existing dynamic rewriters have a high runtime overhead (1.16x for DynamoRIO, 1.29x for Pin, and 1.20x for Dyninst) and have a bigger memory footprint (2.5x for DynamoRIO, 2.73x for Pin, and 2.3x for Dyninst). RL-Bin differentiates itself from other rewriters by having negligible overhead, which is proportional to the added instrumentation. This low overhead is achieved by utilizing an in-place design and applying multiple novel optimization methods. As a result, lightweight instrumentation can be added to applications deployed in live systems for monitoring and analysis purposes. In the second part, we present RL-Bin++, an improved version of RL-Bin, that handles various problematic real-world features commonly found in obfuscated binaries. We demonstrate the effectiveness of RL-Bin++ for the SPECrate 2017 benchmark obfuscated with UPX, PECompact, and ASProtect obfuscation tools. RL-Bin++ can efficiently instrument heavily obfuscated binaries (overhead averaging 2.76x, compared to 4.11x, 4.72x, and 5.31x overhead respectively caused by DynamoRIO, Dyninst, and Pin). However, the major accomplishment is that we achieved this while maintaining the low overhead of RL-Bin for unobfuscated binaries (only 1.04x). The extra level of robustness is achieved by employing dynamic deobfuscation techniques and using a novel hybrid in-place and code-cache design. Finally, to show the efficacy of RL-Bin in the development of sophisticated and efficient analysis tools, we have designed, implemented, and tested two novel applications of RL-Bin; An application-level file access permission system and a security tool for enforcing secure execution of applications. Using RL-Bin's system call instrumentation capability, we developed a fine-grained file access permission system that enables the user to define separate file access policies for each application. The overhead is very low, only 6%, making this tool practical to be used in live systems. Secondly, we designed a security enforcement tool that instruments indirect control transfer instructions to ensure that the program execution follows the predetermined anticipated path. Hence, it would protect the application from being hijacked. Our implementation showed effectiveness in detecting exploits in real-world programs while being practical with a low overhead of only 9%.Item Improving the Usability of Static Analysis Tools Using Machine Learning(2019) Koc, Ugur; Porter, Adam A.; Foster, Jeffrey S.; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)Static analysis can be useful for developers to detect critical security flaws and bugs in software. However, due to challenges such as scalability and undecidability, static analysis tools often have performance and precision issues that reduce their usability and thus limit their wide adoption. In this dissertation, we present machine learning-based approaches to improve the adoption of static analysis tools by addressing two usability challenges: false positive error reports and proper tool configuration. First, false positives are one of the main reasons developers give for not using static analysis tools. To address this issue, we developed a novel machine learning approach for learning directly from program code to classify the analysis results as true or false positives. The approach has two steps: (1) data preparation that transforms source code into certain input formats for processing by sophisticated machine learning techniques; and (2) using the sophisticated machine learning techniques to discover code structures that cause false positive error reports and to learn false positive classification models. To evaluate the effectiveness and efficiency of this approach, we conducted a systematic, comparative empirical study of four families of machine learning algorithms, namely hand-engineered features, bag of words, recurrent neural networks, and graph neural networks, for classifying false positives. In this study, we considered two application scenarios using multiple ground-truth program sets. Overall, the results suggest that recurrent neural networks outperformed the other algorithms, although interesting tradeoffs are present among all techniques. Our observations also provide insight into the future research needed to speed the adoption of machine learning approaches in practice. Second, many static program verification tools come with configuration options that present tradeoffs between performance, precision, and soundness to allow users to customize the tools for their needs. However, understanding the impact of these options and correctly tuning the configurations is a challenging task, requiring domain expertise and extensive experimentation. To address this issue, we developed an automatic approach, auto-tune, to configure verification tools for given target programs. The key idea of auto-tune is to leverage a meta-heuristic search algorithm to probabilistically scan the configuration space using machine learning models both as a fitness function and as an incorrect result filter. auto-tune is tool- and language-agnostic, making it applicable to any off-the-shelf configurable verification tool. To evaluate the effectiveness and efficiency of auto-tune, we applied it to four popular program verification tools for C and Java and conducted experiments under two use-case scenarios. Overall, the results suggest that running verification tools using auto-tune produces results that are comparable to configurations manually-tuned by experts, and in some cases improve upon them with reasonable precision.Item Improving Existing Static and Dynamic Malware Detection Techniques with Instruction-level Behavior(2019) Kim, Danny; Barua, Rajeev; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)My Ph.D. focuses on detecting malware by leveraging the information obtained at an instruction-level. Instruction-level information is obtained by looking at the instructions or disassembly that make up an executable. My initial work focused on using a dynamic binary instrumentation (DBI) tool. A DBI tool enables the study of instruction-level behavior while the malware is executing, which I show proves to be valuable in detecting malware. To expand on my work with dynamic instruction-level information, I integrated it with machine learning to increase the scalability and robustness of my detection tool. To further increase the scalability of the dynamic detection of malware, I created a two stage static-dynamic malware detection scheme aimed at achieving the accuracy of a fully-dynamic detection scheme without the high computational resources and time required. Lastly, I show the improvement of static analysis-based detection of malware by automatically generated machine learning features based on opcode sequences with the help of convolutional neural networks. The first part of my research focused on obfuscated malware. Obfuscation is the process in which malware tries to hide itself from static analysis and trick disassemblers. I found that by using a DBI tool, I was able to not only detect obfuscation, but detect the differences in how it occurred in malware versus goodware. Through dynamic program-level analysis, I was able to detect specific obfuscations and use the varying methods in which it was used by programs to differentiate malware and goodware. I found that by using the mere presence of obfuscation as a method of detecting malware, I was able to detect previously undetected malware. I then focused on using my knowledge of dynamic program-level features to build a highly accurate machine learning-based malware detection tool. Machine learning is useful in malware detection because it can process a large amount of data to determine meaningful relationships to distinguish malware from benign programs. Through the integration of machine learning, I was able to expand my obfuscation detection schemes to address a broader class of malware, which ultimately led to a malware detection tool that can detect 98.45% of malware with a 1% false positive rate. Understanding the pitfalls of dynamic analysis of malware, I focused on creating a more efficient method of detecting malware. Malware detection comes in three methods: static analysis, dynamic analysis, and hybrids. Static analysis is fast and effective for detecting previously seen malware where as dynamic analysis can be more accurate and robust against zero-day or polymorphic malware, but at the cost of a high computational load. Most modern defenses today use a hybrid approach, which uses both static and dynamic analysis, but are suboptimal. I created a two-phase malware detection tool that approaches the accuracy of the dynamic-only system with only a small fraction of its computational cost, while maintaining a real-time malware detection timeliness similar to a static-only system, thus achieving the best of both approaches. Lastly, my Ph.D. focused on reducing the need for manual feature generation by utilizing Convolutional Neural Networks (CNNs) to automatically generate feature vectors from raw input data. My work shows that using a raw sequence of opcode sequences from static disassembly with a CNN model can automatically produce feature vectors that are useful for detecting malware. Because this process is automated, it presents as a scalable method of consistently producing useful features without human intervention or labor that can be used to detect malware.