Summary

My research aims to solve system security problems via program analysis techniques.
My research is practical and results in working systems: all six of my first-author papers (ASE'13, NDSS'15/17/18, ASPLOS'16, ISSTA'17) are accompanied by practical systems.

More specifically, I have developed fundamental primitives for the investigation of advanced cyber-attacks and the analysis and prevention of ever-evolving malicious programs and payloads across multiple platforms.
Specifically, my research focuses on building systems for
(1) attack investigation (e.g., root-cause analysis and attack provenance analysis) through causality inference,
(2) software exploit prevention via input randomization, and
(3) forensic analysis focusing on cross-platform binary analysis and data file format reverse-engineering.

In recognition of my contributions, I have been honored with four prestigious awards: Best Paper Award, ACM SIGSOFT Distinguished Paper Award, Maurice H. Halstead Memorial Award, and Microsoft Most Valuable Professional Award.

Lightning Summary (TL;DR). Prior to my work, the two most widely used state-of-the-art techniques for attack investigation were taint analysis and audit logging. Unfortunately, taint analysis suffers from significant performance overhead. More importantly, both taint analysis and audit logging are inaccurate because they are imprecise approximations of causality inference.
The desired causal analysis is captured by counter-factual causality, first introduced in the 18th century by David Hume: a latter event is causally dependent on a preceding event if changes at the preceding event lead to state differences in the latter event.
Hence, I propose a precise causality inference technique that follows the original definition of counter-factual causality via dual execution, which runs two executions (the original execution and a mutated version of it) in parallel and compares the state differences between them.
Further extending the causality inference engine, I propose and develop a model-based causality inference technique that does not require any changes to end users' systems. It is more accurate and precise (0.1% FP/FN) than previous state-of-the-art attack provenance techniques.

Highlights. I propose a novel causality inference engine, LDX, which is more precise and efficient (7% average overhead) than state-of-the-art dynamic taint analysis and audit-logging techniques. LDX precisely and accurately handles control dependencies, which are a major source of inaccuracy in previous approaches.
I also propose MCI, a model-based causality inference technique that precisely infers causality from system call logs generated by existing audit-logging systems, without any additional requirements on end users' systems. Both LDX and MCI have been successfully transferred to the DARPA Transparent Computing project.

Publications. LDX (ASPLOS'16), MCI (NDSS'18)



Investigating advanced cyber attacks (e.g., APT) is challenging.
Recent cyber attacks are becoming increasingly targeted and sophisticated. In particular, an Advanced Persistent Threat (APT) is an attack that targets a specific organization and compromises its systems over a long period of time without being detected.
Investigating such attacks is challenging because, first, they happen over an extended period of time, even years, and, second, they affect many parts of the system, including executable, configuration, and data files, in complex ways. The logs to be inspected can exceed hundreds of gigabytes and are dominated by benign activity. Attack provenance often spans multiple processes and files, requiring a technique that can precisely identify dependencies.

Existing Approaches: Taint Analysis and Audit Logging.
Taint analysis and audit logging are two state-of-the-art attack investigation techniques.
1) Taint analysis tracks program dependencies by monitoring the data propagation on individual instructions. It is slow because it needs to monitor every instruction, and more importantly, inaccurate because it has difficulty handling control dependencies.
2) Audit-logging focuses on dependencies between system calls. For example, all system calls on the same file handle are causally related, and within a process, all output system calls are causally related to all input system calls. Such coarse-grained analysis often causes a large number of false positives.
For example, consider a Firefox process running for a day and downloading many files from different websites. Audit-logging tools often assume that all outputs depend on all inputs. Such an assumption is too coarse-grained: they may report that every downloaded file depends on every visited website, even though only one or two files and websites are actually related.
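
To make the imprecision concrete, the following Python sketch (the event log and site names are hypothetical, not drawn from any particular audit system) implements the coarse-grained rule described above, namely that every output event depends on every preceding input event of the same process:

# Minimal sketch of coarse-grained audit-log dependency inference.
# Events are (pid, kind, target) tuples; kind is "in" (e.g., recv/read)
# or "out" (e.g., write). The rule "every output depends on every
# preceding input of the same process" over-approximates causality.
log = [
    (100, "in",  "news.example.com"),
    (100, "in",  "evil.example.net"),
    (100, "out", "report.pdf"),
    (100, "out", "payload.exe"),
]

def coarse_dependencies(events):
    deps, seen_inputs = [], []
    for _pid, kind, target in events:
        if kind == "in":
            seen_inputs.append(target)
        else:  # every output is linked to every input seen so far
            deps += [(target, src) for src in seen_inputs]
    return deps

# Also reports report.pdf <- evil.example.net: a false positive.
print(coarse_dependencies(log))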

What we need is Causality Inference.
Counter-factual causality, first introduced in the 18th century by David Hume, can be used to describe the desired causal analysis in an attack investigation. Specifically, given two events, a latter event is causally dependent on a preceding event if changes at the preceding event lead to state differences in the latter event.
For example, suppose an incoming network message arrives and a file is then changed. The file change is causally dependent on the network message if blocking the message would prevent the file from being changed. From this perspective, I realize that the limitations of taint analysis and audit logging stem from the fact that both are imprecise approximations of causality inference.

LDX: Causality Inference Engine via Lightweight Dual Execution.
My research takes a fundamental approach: adapting the original counter-factual causality concept to programs and program executions. Given an execution, two execution points P1 and P2, and two values x and y, y is causally dependent on x if the value of y at P2 is affected (e.g., changed) when the value of x at P1 is mutated.


LDX conducts faithful counter-factual causality inference on computer systems via dual execution. Specifically, it runs two executions in parallel — the original execution and its mutated version with mutations on input syscalls. Then, it observes differences at output syscalls. Any difference indicates causality between the mutated input syscalls and the output syscalls.
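
The following Python sketch illustrates only the counter-factual principle behind this design, under the simplifying assumption that a whole program run is a deterministic function from named inputs to a list of outputs (the example program, input names, and mutation are hypothetical); LDX itself applies the same principle to real binaries at the system-call level:

# Conceptual sketch of counter-factual causality via dual execution.
def infer_causality(program, inputs, mutate):
    original = program(inputs)              # master execution
    mutated  = program(mutate(inputs))      # slave execution with a mutated input
    # Outputs that differ are causally dependent on the mutated input.
    return [i for i, (a, b) in enumerate(zip(original, mutated)) if a != b]

def example_program(inputs):
    cfg, msg = inputs["config"], inputs["network"]
    out_file = "changed" if msg == "trigger" else "unchanged"   # depends on the network input
    out_log  = "loaded " + cfg                                  # depends only on the config input
    return [out_file, out_log]

# Mutate only the network input; only output 0 differs, so only it is
# causally dependent on the network message.
deps = infer_causality(
    example_program,
    {"config": "default.cfg", "network": "trigger"},
    lambda inp: {**inp, "network": "mutated"},
)
print(deps)  # -> [0]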

LDX: Challenges and Solutions. Due to the mutation LDX introduces, the mutated execution may take a different path and hence execute a different sequence of syscalls than the original execution. A fundamental challenge for LDX is therefore to align the two executions so that they are compared at the same execution points, because comparing misaligned points leads to incorrect causality (i.e., FP/FN). To this end, I designed a novel runtime counter derived from program structure. The counter is not a simple logical timestamp; rather, it denotes execution points while ensuring a key property: the counter value indicates the relative progress of an execution, meaning that an execution with a larger counter value must be ahead of one with a smaller counter value with respect to the program structure. The counter enables the two executions to be aligned, and thus enables precise and efficient causality inference.
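
The toy, hand-instrumented Python function below only illustrates the alignment property (the increments and syscall names are made up; LDX instruments binaries and derives the increments from control-flow structure automatically): both sides of a branch advance the counter by the same amount, so two executions that take different branches still reach the subsequent system call at the same counter value and can be compared there:

# Toy illustration of a structure-derived counter for execution alignment.
# Counter increments are placed so that, no matter which branch is taken,
# the counter has the same value at the join point after the branch.
def run(flag):
    events, counter = [], 0
    counter += 1                     # entry of the branching region
    if flag:
        events.append(("syscall_A", counter))
        counter += 2                 # this branch "costs" 2 counter units
    else:
        events.append(("syscall_B", counter))
        counter += 2                 # same cost, so executions re-align here
    counter += 1                     # join point: identical in both runs
    events.append(("syscall_out", counter))
    return events

# The final output syscall occurs at the same counter value (4) in both
# executions, so the two runs can be compared at that aligned point.
print(run(True))   # [('syscall_A', 1), ('syscall_out', 4)]
print(run(False))  # [('syscall_B', 1), ('syscall_out', 4)]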

Toward a practical attack investigation system.
LDX enables precise attack investigation through causality inference. However, it needs to instrument programs to align the master and slave executions. In fact, many previous attack investigation techniques such as BEEP also require program instrumentation on end-user systems.
Unfortunately, instrumentation is not always an option. In our conversations with companies and government agencies about deploying attack investigation techniques, one common concern was that they are not allowed to instrument programs because of internal policies or contracts with software vendors.
There are automata-based techniques that construct automata representing program behaviors and then parse system call logs to identify those behaviors, but they do not take dependencies into account. In other words, they do not know the causality between the detected behaviors. For example, they may detect individual “read a file” and “write a file” behaviors, yet they do not know the relationship between them (e.g., whether the write depends on the read). This lack of dependency information significantly limits their applicability to attack provenance.

MCI: Model-based causality inference.
MCI leverages a key observation: system calls in audit logs have explicit and implicit dependencies that can be leveraged to build expressive models.
Specifically, there are two types of dependencies: explicit and implicit. Explicit dependencies are identified by file handles shared by multiple system calls. Implicit dependencies are caused by memory operations and control dependencies and are not visible in system call logs; hence, we use LDX to infer them. MCI works as follows:

1. Causal Model Generation. We first generate causal models in an offline phase. A causal model is a sequence of system calls annotated with the inter-dependencies between them. At a high level, a causal model represents a typical workload, such as opening a file; the system calls in a model are those that always appear when the program processes that workload. Given a program, we use LDX to infer causality between system calls on typical workloads and then build causal models from the results.

2. Causality Inference via Causal Models. During deployment, given a system call log and a set of causal models, MCI parses the log with the models to infer causality from it. Causality in the audit log is exposed through the dependencies annotated in the successfully parsed models over the log. More importantly, the composability of models connects causality across multiple parsed model instances, which enables the investigation of complex attacks spanning multiple workloads.
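
As a minimal Python sketch of this step (the syscall names, the single hand-written model, and the greedy matching are simplifications; MCI's actual model language and parsing are far richer), a causal model is a syscall sequence plus dependency edges between positions, and matching it against a log translates those edges into dependencies between concrete log entries:

# Minimal sketch of causality inference by parsing a log with a causal model.
# A model is a syscall sequence plus dependency edges between positions.
MODEL = {
    "syscalls": ["open", "read", "write", "close"],
    "deps": [(1, 2)],   # the write at position 2 depends on the read at position 1
}

def match_model(log, model):
    # Greedily match the model's syscall sequence against the log and
    # translate the model's dependency edges into edges between log indices.
    positions, next_pos = [], 0
    for i, (name, _args) in enumerate(log):
        if next_pos < len(model["syscalls"]) and name == model["syscalls"][next_pos]:
            positions.append(i)
            next_pos += 1
    if next_pos < len(model["syscalls"]):
        return []   # the model did not parse this log
    return [(positions[a], positions[b]) for a, b in model["deps"]]

log = [("open", "a.txt"), ("stat", "a.txt"), ("read", "a.txt"),
       ("write", "b.txt"), ("close", "a.txt")]
print(match_model(log, MODEL))  # [(2, 3)]: the write depends on the read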

Challenges: Parsing syscall logs with causal models that carry implicit dependency information raises two prominent challenges: (1) language complexity and (2) ambiguity. First, expressing the complex inter-dependencies annotated in causal models requires an expressive grammar, but a more expressive grammar describes a more complex language (e.g., context-free or context-sensitive) and hence incurs a higher parsing cost.
Second, some syscalls can be parsed by multiple models that share common parts (e.g., common prefixes). In such cases, it is difficult to decide which model is the right one. As different causalities are derived from different models, the ambiguity problem may lead to incorrect causality (i.e., FP/FN).
To address these challenges, I designed a novel model parsing algorithm, called segmented parsing, that can handle multiple levels of model complexity (e.g., regular, context-free, and context-sensitive) and substantially mitigates the ambiguity problem by leveraging explicit dependencies that can be derived directly from the log (e.g., dependencies induced by file handles). Specifically, MCI first obtains a model skeleton for each causal model: the syscalls connected by explicit dependencies. The skeleton partitions a model into segments that can each be described and parsed by automata. Moreover, causal models are composable: models for primitive operations can be composed to describe complex system-wide attack behaviors. For example, primitive models for “Edit”, “Copy”, “Paste”, and “Save” can be composed into a model representing the complex user behavior “Edit->Copy->Edit->Paste->Edit->Save” (e.g., potential information exfiltration).
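
The sketch below, in the same toy spirit (the handles and syscall names are hypothetical), illustrates only the skeleton idea: syscalls sharing an explicit dependency such as a file handle anchor the model in the log, and the remaining model syscalls are matched segment by segment between those anchors, which also helps disambiguate models that share a common prefix:

# Sketch of the "model skeleton" idea: syscalls connected by explicit
# dependencies (here, a shared file handle) anchor the model in the log;
# the remaining model syscalls only need to be matched between anchors.
def skeleton_anchors(log, handle):
    # Indices of log entries that operate on the given handle: the
    # explicit-dependency skeleton directly derivable from the log.
    return [i for i, (_name, h) in enumerate(log) if h == handle]

def parse_segment(log, lo, hi, wanted):
    # Match the model syscalls `wanted` inside log[lo:hi] (one segment).
    hits, need = [], list(wanted)
    for i in range(lo, hi):
        if need and log[i][0] == need[0]:
            hits.append(i)
            need.pop(0)
    return hits if not need else None

log = [("open", 3), ("read", 3), ("getpid", None), ("write", 4), ("close", 3)]
anchors = skeleton_anchors(log, handle=3)          # [0, 1, 4]
# Between the read anchor and the close anchor, the model expects a write.
print(parse_segment(log, anchors[1] + 1, anchors[2], ["write"]))  # [3]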



Lightning Summary (TL;DR). The cat-and-mouse game between attacks and protections never ends, because defenses tend to be attack-specific while attack vectors and methods keep evolving. My experience analyzing a large number of malicious payloads leads to a key observation: malicious payloads are carefully constructed on top of many assumptions about the underlying system (e.g., memory layout, instruction set, and system call interfaces), and hence they are very brittle. In other words, even a very simple perturbation of a payload completely breaks it.
To this end, I develop a protection system that exploits this fundamental weakness, the brittleness of malicious payloads. Specifically, it perturbs all inputs, which include any injected malicious payloads. Inputs are encoded and only permanently decoded when they are no longer exploitable; otherwise, they are temporarily decoded for the program's intended use of the input.
Moreover, I leverage program analysis and a novel constraint solving algorithm to identify when inputs can be permanently decoded. We also extend the idea to prevent malvertising (PAD, ASE'17): we encode and decode all values written to memory via certain JavaScript and ActionScript objects that are often exploited by attackers.

Highlights. A2C achieves general protection for a large set of real-world programs, including the Apache web server. It prevents a variety of attacks (e.g., heap spraying, use-after-free, buffer overflow, integer overflow, and type confusion). The protection is practical, with low overhead (6%). Its key idea is general, and we leverage it to prevent malvertising attacks, including the AdGholas malvertising campaign, which affected thousands of victims every day for over a year using a sophisticated steganography technique.

Publications. A2C (NDSS'17), PAD (ASE'17)



The cat and mouse game.
Vulnerabilities are everywhere, and even after decades of effort, the cat-and-mouse game between attackers and defense systems continues. We find that many protection systems focus on specific attack methods: some target buffer overflows, others try to mitigate ROP or use-after-free, and so on. However, protections against specific kinds of attacks cannot keep up with rapidly emerging new attacks, and building ever more attack-specific protections will not win this arms race.

Looking for a common weakness of exploits.
My approach focuses on a common weakness shared by many attacks. I studied various attacks, including ROP, use-after-free, and buffer overflow, and observed that they have one common step: at some point, a malicious payload must be injected and executed.

Little change, big difference: a small perturbation completely breaks the payload.
Examining malicious payloads, we found that they are designed with strict semantic assumptions about the environment and are therefore very brittle to any mutation. For example, when we change a few bytes in a payload by simply incrementing their values, the payload breaks. In general, malicious payloads are so carefully constructed that almost any perturbation makes their execution crash.

A2C (or Attack to Crash): General protection for various malicious payload injection attacks via input perturbation.
Motivated by this observation, I developed a general protection technique that prevents a wide range of attacks by corrupting their payloads.
Specifically, we encode all inputs, including malicious payloads, at runtime. After the encoding, payloads are corrupted and broken. However, benign inputs are corrupted as well, which would make the program crash even on benign inputs.
To ensure that the program continues to function correctly when benign inputs are provided, I developed a static analysis technique that identifies all the places that read and process inputs and selectively inserts decoding logic at some of them.
Specifically, decoding only occurs when the use of the inputs cannot be exploited. For instance, when inputs in a byte array are copied to an integer array, each byte of the inputs is padded with 3 zero bytes (as an integer is 4 bytes on 32-bit machines) before it is stored into the integer array. Constructing a meaningful payload with 3 zero bytes in every four bytes is extremely difficult, if not impossible.
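
A minimal Python sketch of the encode-on-input, decode-at-safe-use idea follows (a one-byte XOR key stands in for A2C's actual encoding scheme, and the byte-to-integer copy plays the role of the safe use described above; the real system instruments binaries and chooses decoding points via static analysis):

KEY = 0x5A  # stand-in one-byte key; A2C's actual encoding scheme differs

def encode(data):
    # Applied at input sites (e.g., right after recv/read).
    return bytes(b ^ KEY for b in data)

def widen_and_decode(encoded):
    # A safe use: each byte is decoded and widened to a 4-byte integer.
    # Three of every four payload bytes become zero, so a contiguous
    # machine-code payload cannot survive in the integer array.
    return [b ^ KEY for b in encoded]   # each entry is a full integer (0..255)

payload = b"\x90\x90\xcc"         # hypothetical injected bytes
stored  = encode(payload)         # what actually lands in program memory
print(stored)                     # scrambled: the payload's semantics are destroyed
print(widen_and_decode(stored))   # [144, 144, 204]: decoded only at the safe use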

Dividing the program state space into two sub-spaces: exploitable and post-exploitable. I proposed a novel constraint solving algorithm that identifies operations that make inputs no longer exploitable, such as the copy operation from a byte array to an integer array.
Such operations essentially divide the state space of a program into exploitable and post-exploitable sub-spaces, because the program state before the operation is exploitable but no longer so afterwards. Therefore, A2C decodes the mutated values only when they are transmitted from the exploitable space to the post-exploitable space. Notably, the exploitable space is much smaller than the post-exploitable space, which makes A2C highly efficient.



PAD to prevent malvertising via input perturbation.
My colleagues and I extend the idea of mutating inputs to break malicious payloads to one of the biggest security threats on the web, malvertising. Specifically, we inspect many malvertising incidents and analyze how they inject and execute malicious payloads.
This leads to a key observation: attackers leverage particular objects (e.g., DataView and strings) that can access contiguous buffers, because most objects in JavaScript/ActionScript (JS/AS) do not allocate large contiguous buffers that could be used to inject malicious payloads.
Hence, we override the JS/AS objects that can access contiguous memory buffers so that any value stored through them is encoded, breaking malicious payloads residing in those buffers. When a JS/AS program wants to use values stored in the buffers, we decode them so that the program functions correctly. It is important to note that malvertising exploits attack vulnerabilities in the JS/AS engine or the browser, not in the JS/AS program itself; hence, it is safe to decode the values for the JS/AS program.
Moreover, to prevent heap spraying attacks, we also override objects that can allocate contiguous buffers. We turn contiguous buffers into segmented buffers connected through a list, breaking malicious payloads. For instance, we split a string into several sub-strings and store them in different buffers.
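
The Python classes below are only a conceptual stand-in for these two instrumentations (in PAD the overriding is applied to JS/AS objects such as DataView and strings inside the engine, and the XOR key is a placeholder): writes through a buffer-like object are encoded and reads are decoded, and a string is stored as linked chunks rather than one contiguous buffer:

KEY = 0x5A  # placeholder key; PAD's actual encoding differs

class EncodedBuffer:
    # Conceptual stand-in for a wrapped DataView-like object: values are
    # encoded on write and decoded on read, so a payload sprayed through
    # this object never sits in memory in usable form.
    def __init__(self, size):
        self._mem = bytearray(size)
    def write(self, i, value):
        self._mem[i] = value ^ KEY
    def read(self, i):
        return self._mem[i] ^ KEY

class SegmentedString:
    # Conceptual stand-in for splitting a string into linked chunks, so
    # heap spraying cannot rely on one large contiguous buffer.
    def __init__(self, s, chunk=4):
        self.chunks = [s[i:i + chunk] for i in range(0, len(s), chunk)]
    def __str__(self):
        return "".join(self.chunks)

buf = EncodedBuffer(8)
buf.write(0, 0x90)
print(buf._mem[0] == 0x90)   # False: raw memory never holds the payload byte
print(buf.read(0) == 0x90)   # True: the script still sees the value it wrote
print(SegmentedString("AAAAAAAAAAAA").chunks)   # ['AAAA', 'AAAA', 'AAAA']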




Lightning Summary (TL;DR). New platforms, such as IoT platforms, keep being introduced and are becoming increasingly important. Developing secure applications on such platforms is important, but few analysis tools are available, which makes it a challenging task, and reimplementing analysis tools and applications is error-prone and time-consuming. I propose a novel execution transformation technique: transforming a platform-dependent execution, such as an IoT execution, into a platform-independent program that can be compiled, executed, and analyzed on any platform. The transformed program reproduces exactly the control and data dependencies of the original execution.
Furthermore, I extend execution transformation to program component extraction and reuse (CPR, ISSTA'17). CPR can extract a component from a platform-dependent binary program, without any source code or symbol information, and reuse it on another platform. For instance, it can extract a component from a Windows utility program and reuse it on an IoT device.
In addition, I developed a data file format reverse-engineering technique, P2C, that can reveal the file/message formats (including the types and semantics of fields) of unknown binary files.

Highlights. Leveraging PIEtrace, many sensor and IoT vulnerabilities have been successfully analyzed with advanced program analysis tools on Windows and Linux. From these analyses, we found previously unknown bugs in several programs that would have been difficult to identify using only the analysis tools available on the original platforms.
P2C has been used to reveal the semantics of C&C messages generated by the Zeus malware, one of the largest known botnets, which infected 3.6 million systems. P2C reverse-engineered the C&C messages without access to any program that can read and parse them.

Publications. PIEtrace (ASE'13), P2C (NDSS'15), CPR (ISSTA'17)



Rise of IoT platforms and importance of secure software development.
Recently, IoT has received a great deal of attention and has become an important part of our lives: drone products are being released, major companies are working on self-driving cars, surveillance cameras are now connected to the Internet, and so on.
Due to the prevalence of IoT devices, testing, debugging, and security analysis on them are becoming increasingly important. However, many advanced debugging and analysis tools do not support new IoT platforms. Not surprisingly, a study from HP revealed that 70% of IoT devices contain vulnerabilities. Notably, many of these vulnerabilities are not new; if existing analysis tools had been applicable to IoT devices, these vulnerabilities could have been detected and patched before deployment. Similarly, many existing program analysis tools cannot be applied to such new platforms: for example, taint analysis tools implemented for x86 are not applicable to programs on ARM platforms.
Such tools often need to be reimplemented again and again whenever a new platform is released.

Do not reinvent the wheel, Transform the wheel.
My approach to the lack of advanced program analysis and debugging tools on new platforms is completely different. Instead of porting or re-developing the analysis tools on every new platform, I transform platform-dependent executions, such as IoT executions, into platform-independent programs so that they can be analyzed on platforms that existing analysis tools already support.

Running Example. Consider a malicious program on a network router. A security analyst wants to understand what it does, but this is difficult because the router supports few analysis tools.
PIEtrace traces the execution of the malicious program and generates a platform-independent program, essentially a C program, which we call a trace program. The trace program reproduces the same execution, including its control and data dependencies, and can be compiled on other platforms. In this case, I compile the program on both x86-Linux and x86-Windows to leverage existing analysis tools and debuggers on both platforms. For example, OllyDbg is an excellent interactive debugger that only supports Windows; its usability and various plug-ins make analysis efficient and effective. Valgrind, which only supports Linux, provides many plug-ins that implement advanced program analysis techniques, including dynamic program slicing and taint analysis. Being able to use such tools significantly helps the analysis process.
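
To give a flavor of what a trace program does, here is a hand-written Python analogue (PIEtrace itself emits C, and the recorded values and call names below are invented for illustration): values returned by platform-specific calls during the recorded run are replayed from a log, so the replay reproduces the original control and data dependencies on any platform without touching the router's APIs:

# Hand-written analogue of a PIEtrace-style trace program (the real tool
# emits C). Values returned by platform-specific calls during the recorded
# run are replayed from a log, making the program platform-independent.
RECORDED_VALUES = iter([
    1024,          # value originally returned by a router-specific API call
    b"GET /cfg",   # network payload observed during the recorded execution
])

def replay_platform_call(_name):
    return next(RECORDED_VALUES)

def trace_program():
    buf_size = replay_platform_call("router_get_bufsize")
    data = replay_platform_call("router_recv")
    if data.startswith(b"GET"):            # same control dependency as the original run
        return "handled request with buffer of %d bytes" % buf_size
    return "ignored"

print(trace_program())   # reproduces the recorded path on any platform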


Do not reimplement programs, synthesize them: CPR.
In addition to the debugging and analysis problem, another fundamental problem on IoT platforms is the lack of applications. Re-implementing programs is error-prone and time-consuming: we often see vendors reuse old source code without even fixing known vulnerabilities, and reimplementation of existing applications frequently introduces new vulnerabilities and bugs.
We find that binary code extraction and reuse provides a better way to address these problems.
To this end, we extend the cross-platform execution transformation technique into a cross-platform binary extraction and reuse technique. Specifically, we obtain multiple platform-independent trace programs and merge them to synthesize a platform-independent program.
An important distinction from a trace program is that a synthesized program can take new inputs and produce outputs accordingly, whereas a trace program takes no inputs and simply reproduces the recorded execution path. Moreover, the synthesized program can handle inputs that were not exercised during training.
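
The toy Python sketch below conveys only the intuition behind the merging step (the functions and the branch predicate are hypothetical; CPR performs this at the binary level over recorded trace programs): two traces that recorded different paths of the same component are merged under the branch condition observed during recording, so the synthesized component can follow either path for new inputs:

# Toy sketch of merging two recorded paths into one reusable component.
# trace_a was recorded on an input where x > 0; trace_b on an input
# where x <= 0. The merged function re-introduces the branch on new inputs.
def trace_a(x):      # replays the "positive" path
    return x * 2

def trace_b(x):      # replays the "non-positive" path
    return -x

def synthesized_component(x):
    # The branch condition is recovered from the predicates observed
    # while recording the two traces.
    return trace_a(x) if x > 0 else trace_b(x)

print(synthesized_component(5))    # 10: follows the path from trace_a
print(synthesized_component(-3))   # 3: an input not seen during either recording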





Advanced Forensic Analysis: Understanding file format of unknown binary files.
Every day, our machines accumulate many unknown files generated by various programs. For example, in the "Windows" folder on my machine there is a binary file, mib.bin, whose contents I do not know. This can become a serious problem if such a file is generated by a malicious program and contains sensitive information, such as passwords or credit card numbers. Consider a situation in which one of these unknown files is leaked: one would want to know its contents and whether it contains anything sensitive.


Understanding file format from a program execution.
A file format, including the semantics of its content, can be revealed by analyzing the execution of a program that reads and processes the file. The basic idea is to monitor the program execution as it reads and uses the content and, in particular, to analyze how the program treats each field. For instance, if a field is a number, it may go through conversion functions (e.g., atoi()); if it is a string, it may go through string-related functions (e.g., strcmp()). Such hints allow us to reverse-engineer unknown file formats, including the types and semantics of their fields.
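
The Python fragment below sketches this inference rule (the offsets, callee names, and hint table are illustrative; P2C observes such calls by monitoring a binary execution): the library routines observed consuming each field determine the field's inferred type:

# Sketch: infer field types from the library calls that consume each field.
# In P2C the calls are observed by monitoring a binary execution; here the
# "observed calls" are given directly as (field_offset, callee) pairs.
TYPE_HINTS = {
    "atoi":   "integer",
    "strcmp": "string",
    "memcpy": "raw bytes",
}

def infer_field_types(observed_calls):
    types = {}
    for offset, callee in observed_calls:
        types.setdefault(offset, TYPE_HINTS.get(callee, "unknown"))
    return types

observed = [(0, "strcmp"), (16, "atoi"), (20, "memcpy")]
print(infer_field_types(observed))
# {0: 'string', 16: 'integer', 20: 'raw bytes'}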

When life gives you lemons, make lemonade: Understanding file format without a consumer program.
However, such a consumer program does not always exist. For example, malware implementing a command-and-control (C&C) protocol often contains only the logic to create outgoing messages; a consumer program that can parse those messages exists only on the attacker's side.
A unique capability of P2C is that, when no consumer program is available but a producer program is (even though the inputs that drove the producer to generate the message are not), it turns a producer program execution into a consumer program execution so that we can recover the file format and its semantics.

Transforming Producer Execution to Consumer Execution.
The transformation from a producer to a consumer is possible because consumer and producer programs are symmetric. In other words, operations in a consumer program have corresponding (symmetric) operations in a producer program.
For instance, consider a pair of client and server programs that use a simple network protocol. The producer that generates a network message may first construct the message header (A) and then the body (B); a consumer that understands the message should first read the header (A) and then the body (B). The program structures of the consumer and the producer are symmetric because they both operate on the same network message.
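
A toy Python illustration of this symmetry follows (the header layout and struct format string are invented; P2C derives the consumer behavior from a binary producer execution rather than from source code): each write the producer performs has a mirror-image read, so walking the producer's writes in order yields a parser for the message:

import struct

# Toy producer: writes a 4-byte header (body length) followed by the body.
def produce(body):
    return struct.pack("<I", len(body)) + body

# The "mirror image" consumer derived from the producer's write sequence:
# each write is replaced by the symmetric read at the same offset.
def consume(msg):
    (length,) = struct.unpack_from("<I", msg, 0)   # mirrors the header write
    body = msg[4:4 + length]                       # mirrors the body write
    return length, body

print(consume(produce(b"SECRET")))   # (6, b'SECRET')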




When there is not a single clue, forced execution comes to the rescue.
P2C has another advanced feature: even when the producer program is unknown, it can identify the true producer from a set of candidate producer programs. Specifically, it transforms the executions of candidate producers in search of a well-formed consumer execution; if a candidate is not the true producer of the unknown file, the transformation fails quickly. P2C has successfully revealed many file and network message formats given only an unknown file and a set of candidate producer programs.
For example, it revealed a secret message (e.g., a password) that had been encrypted and embedded into a seemingly benign image file by a steganography technique. Because the attacker only generates the picture and sends it to his server, no consumer program is available to us; we only have a set of suspicious programs and do not even know how to run them. P2C automatically runs them to determine which one is the producer and then transforms that producer execution into an ideal consumer execution to reveal the hidden message. It also analyzed the C&C protocol of popular malware (Zbot) within several minutes.
P2C has another advanced feature that even when we do not know a producer program, it can find out the true producer from a set of potential producer programs. Specifically, it transforms executions of candidate producer programs to find an ideal consumer program execution. It is important to note that if a potential producer is not the true producer program of an unknown file, the transformation fails quickly. P2C has successfully revealed many file/network message formats only with an unknown file and a set of candidate producer programs. Specifically, it revealed a secret message (e.g., password) encrypted and embedded into a seemingly benign image file by steganography technique. As the attacker only generates the picture and sends it to his sever, there is no consumer program available to us. We only know some suspicious programs and do not even know how to run them. P2C automatically runs them to see which program is a producer and transform the producer program execution to an ideal consumer program execution to reveal the hidden message. It also analyzed C&nC protocols of popular malware (Zbot) within several minutes.