Summary

Ph.D. Candidate, Department of Computer Science, Purdue University
image

Solving system security problems via program/binary analysis and reverse-engineering

I am a Ph.D. candidate in the Department of Computer Science at Purdue University under the supervison of Prof. Xiangyu Zhang and Prof. Dongyan Xu.

My research aims to solve system security problems via program analysis techniques (e.g., dynamic/static analysis, binary analysis, and reverse-engineering). More specifically, I have proposed and implemented fundamental primitives for the investigation and prevention of advanced cyber-attacks (e.g., Advanced Persistent Threats) and the analysis and prevention of ever-evolving malicious programs and payloads across multiple platforms.

Contacts: {yongkwon, kwon58}purdue.edu, yonghwi.kwonhotmail.com

I am on the academic job market

Resume Research Statement
Invited Talks:
Texas A&M University (Jan 26) Virginia Tech (Feb 5)
Univ. of Delaware (Feb 12)    Univ. of Texas at Dallas (Mar 1)
Case Western Reserve Univ. (Mar 7)
Univ. of California, Irvine (Mar 13)
University of Virginia (April 4)

Awards

Maurice H. Halstead Memorial Award, Purdue University, 2017
ACM SIGSOFT Distinguished Paper Award, ACM, 2013
Best Paper Award, ASE'13, 2013
Microsoft MVP (Most Valuable Professional), Microsoft, 2008-2012

Publications

System Security (9 conf. papers, 4 first authors)
- NDSS'15 (1st), ASPLOS'16 (1st), NDSS'17 (1st), NDSS'18 (1st) | ASE'17 (2nd) | WWW'18 (3rd) | WWW'17 (4th), ACSAC'17 (4th), FSE'16 (4th)
Program/Binary analysis and Reverse-engineering (4 conf. papers, 2 first authors)
- ASE'13 (1st), ISSTA'17 (1st) | ASPLOS'15 (2nd), OOPSLA'16 (2nd)

WWW'18 AdBudgetKiller: Online Advertising Budget Draining Attack
I Luk Kim, Weihang Wang, Yonghwi Kwon, Yunhui Zheng, Yousra Aafer, Weijie Meng, and Xiangyu Zhang,
In Proc. of 27th International World Wide Web Conference
NDSS'18 MCI: Modeling-based Causality Inference in Audit Logging for Attack Investigation
Yonghwi Kwon, Fei Wang, Weihang Wang, Kyu Hyung Lee, Wen-Chuan Lee, Shiqing Ma, Xiangyu Zhang, Dongyan Xu, Somesh Jha, Gabriela Ciocarlie, Ashish Gehani, and Vinod Yegneswaran,
In Proc. of 25th Network and Distributed System Security Symposium
Paper
ACSAC'17 RevARM: A Platform-Agnostic ARM Binary Rewriter for Security Applications
Taegyu Kim, Chung Hwan Kim, Hongjun Choi, Yonghwi Kwon, Brendan Saltaformaggio, Xiangyu Zhang, and Dongyan Xu,
In Proc. of 33rd Annual Conference on Computer Security Applications
Paper
ASE'17 PAD: Programming Third-party Web Advertisement Censorship
Weihang Wang, Yonghwi Kwon, Yunhui Zheng, Yousra Aafer, I Luk Kim, Wen-Chuan Lee, Yingqi Liu, Weijie Meng, Xiangyu Zhang, Patrick Eugster,
In Proc. of 32nd IEEE/ACM International Conference on Automated Software Engineering
Paper
ISSTA'17 CPR: Cross Platform Binary Code Reuse via Platform Independent Trace Program
Yonghwi Kwon, Weihang Wang, Yunhui Zheng, Xiangyu Zhang, and Dongyan Xu,
In Proc. of 26th ACM SIGSOFT International Symposium on Software Testing and Analysis
Paper | Slides
WWW'17 J-Force: Forced Execution on JavaScript
Kyungtae Kim, I Luk Kim, Chung Hwan Kim, Yonghwi Kwon, Yunhui Zheng, Xiangyu Zhang, and Dongyan Xu,
In Proc. of 26th International World Wide Web Conference
Paper
NDSS'17 A2C: Self Destructing Exploit Executions via Input Perturbation
Yonghwi Kwon, Brendan Saltaformaggio, I Luk Kim, Kyu Hyung Lee, Xiangyu Zhang, and Dongyan Xu,
In Proc. of 24th Network and Distributed System Security Symposium
Paper | Slides
OOPSLA'16 Apex: Automatic Programming Assignment Error Explanation
Dohyeong Kim, Yonghwi Kwon, Peng Liu, I Luk Kim, David Mitchel Perry, Xiangyu Zhang, and Gustavo Rodriguez-Rivera,
In Proc. of 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications
Paper | Website
FSE'16 WebRanz: Web Page Randomization For Better Advertisement Delivery and Web-Bot Prevention
Weihang Wang, Yunhui Zheng, Xinyu Xing, Yonghwi Kwon, Xiangyu Zhang, and Patrick Eugster,
In Proc. of 24th ACM SIGSOFT International Symposium on the Foundations of Software Engineering
Paper | Website
WOOT'16 Eavesdropping on Fine-Grained User Activities Within Smartphone Apps Over Encrypted Network Traffic
Brendan Saltaformaggio, Hongjun Choi, Kristen Johnson, Yonghwi Kwon, Qi Zhang, Xiangyu Zhang, Dongyan Xu, John Qian,
In Proc. of 10th USENIX Workshop on Offensive Technologies
Paper
ASPLOS'16 LDX: Causality Inference by Lightweight Dual Execution
Yonghwi Kwon, Dohyeong Kim, William N. Sumner, Kyungtae Kim, Brendan Saltaformaggio, Xiangyu Zhang, and Dongyan Xu,
In Proc. of 21st International Conference on Architectural Support for Programming Languages and Operating Systems
Paper | Slides
ASPLOS'15 Dual Execution for On the Fly Fine Grained Execution Comparison
Dohyeong Kim, Yonghwi Kwon, William N. Sumner, Xiangyu Zhang, and Dongyan Xu,
In Proc. of 20th International Conference on Architectural Support for Programming Languages and Operating Systems
Paper
NDSS'15 P2C: Understanding Output Data Files via On-the-Fly Transformation from Producer to Consumer Executions
Yonghwi Kwon, Fei Peng, Dohyeong Kim, Kyungtae Kim, Xiangyu Zhang, Dongyan Xu, Vinod Yegneswaran, and John Qian,
In Proc. of 22nd Network and Distributed System Security Symposium
Paper | Slides
ASE'13 PIEtrace: Platform Independent Executable Trace
Yonghwi Kwon, Xiangyu Zhang, and Dongyan Xu,
In Proc. of 28th IEEE/ACM International Conference on Automated Software Engineering
Paper | Slides | Website
Best Paper Award, ACM SIGSOFT Distinguished Paper Award

Resume

Systems Security, Program/Binary Analysis, Reverse-engineering

Education

May 2012 - May 2018

Ph.D. in Computer Science, Purdue University, USA

Advisors: Prof. Xiangyu Zhang and Prof. Dongyan Xu

May 2017

Master in Computer Science, Purdue University, USA

March 2004 - July 2011

Bachelor in Computer Engineering, Konkuk University, Seoul, South Korea

Summa Cum Laude. (Includes 3 years of military service at a software company for developing system security products)

Professional Experience

Sep 2011 - Dec 2011

Freelance developer, National Forensic Service, South Korea

I developed digitizers that extract important evidences from image data.

July 2006 - July 2009

Researcher, SETTEC Inc.

I developed security solutions such as DRM (Digital Rights Management) systems, on-the-fly binary protection techniques which encrypt and decrypt executable code at runtime to prevent reverse-engineering attempts. Also, I have developed anti-malware software in both kernel and user-mode.

Aug 2004 - June 2006

Student Researcher, Samsung Electronics

I developed and led many commercial projects including system utilities, network firewalls, file-filter drivers, and image processing applications. I also have developed programs on various platforms including x86, MIPS, and ARM.

July 2004 - Aug 2004

Summer Intern, Korea Telecom

I developed client software that analyzes and optimizes end-users' Windows systems.

Invited Talks

2017

CERIAS Security Seminar, Purdue University

A2C: Self Destructing Exploit Executions via Input Perturbation.

2015

CERIAS Security Seminar, Purdue University

P2C: Understanding Output Data Files via On-the-Fly Transformation from Producer to Consumer Executions

2011

Microsoft Technical Seminar, Microsoft Korea

Migration to the Visual Studio 2010

2010

Microsoft Technical Seminar, Microsoft Korea

Effective Windows Programming

2009

Microsoft Technical Seminar, Microsoft Korea

Advanced topics in Windows Programming

Samsung Eletronics Technical Seminar (Private), Samsung Electronics

Debugging Applications in Windows

Highlights and Achievements

2018

2 top-tier conference papers (NDSS'18 and WWW'18)

2017

Maurice H. Halstead Memorial Award

4 top-tier conference papers (NDSS'17, WWW'17, ISSTA'17, and ASE'17)

2016

3 top-tier conference papers (ASPLOS'16, FSE'16, and OOPSLA'16)

2015

2 top-tier conference papers (NDSS'15 and ASPLOS'15)

2013

ACM SIGSOFT Distinguished Paper Award

ASE Best Paper Award

A top-tier conference paper (ASE'13)

2011

Excellence Award in the Capstone Design Contest, Konkuk University

2010

Authored a book: Effective Windows Programming

2008

Microsoft Most Valuable Professional Award (5 Time Awardee: 2008~2012)

2003

4th Prize in the Korea Informatics Olympiad

Excellence Award in the the Creative Software Contest, Konkuk University - 2 years of scholarship

Services

Program Committee

Annual Conference on Computer Security Applications (ACSAC'18)

International Workshop on Theory and Practice of Provenance (TaPP'17)

External Reviewer

ACM Conference on Computer and Communications Security (CCS’16/15/13)

The Network and Distributed System Security Symposium (NDSS’17/14)

The International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’17)

The ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’17)

IEEE Conference on Communications and Network Security (CNS’16)

The International Conference on Software Engineering (ICSE’17)

The ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE’16,'18)

The International Symposium on Software Testing and Analysis (ISSTA’17/16/14)

The IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’16)

The International Symposium on Research in Attacks, Intrusions and Defenses (RAID’16)

Research

Summary of My Research Projects

My research aims to solve system security problems via program analysis techniques
I build practical systems. All 6 of my first author papers (ASE'13, NDSS'15/17/18, ASPLOS'16, ISSTA'17) lead to practical systems.

More specifically, I have developed fundamental primitives for the investigation and prevention of advanced cyber-attacks (e.g., Advanced Persistent Threats) and the analysis and prevention of ever-evolving malicious programs and payloads across multiple platforms.
Specifically, my research focuses on building systems for
(1) attack investigation (e.g., root-cause analysis and attack provenance analysis) through causality inference [ READ],
(2) software exploit prevention via input randomization [ READ], and
(3) forensic analysis focusing on cross-platform binary analysis and data file format reverse-engineering [ READ].

In recognition of my contributions, I have been honored with four prestigious awards: Best Paper Award, ACM SIGSOFT Distinguished Paper Award, Maurice H. Halstead Memorial Award, and Microsoft Most Valuable Professional Award.


Lightening Summary (TL;DR). Prior to my work, the two most widely used state-of-the-art techniques for attack investigation were taint analysis and audit-logging. Unfortunately, taint analysis suffers from significant performance overhead. More importantly, both taint analysis and audit-logging are inaccurate as they are imprecise approximation of causality inference.
Counter-factual causality, first introduced in the 18th century by David Hume, can be used to describe the desired causal analysis in an attack investigation. Specifically, given two events, a latter event is causally dependent on a preceding event if changes at the preceding event lead to state differences in the latter event.
Hence, I propose a precise causality inference technique that follows the original definition of counter-factual causality via dual execution which runs two different executions (e.g., the original execution and its mutated execution) in parallel and compares the state differences between the two executions.
Further extending the causality inference engine, I propose and develop a modeling-based causality inference technique that does not require any changes on end-users system. It is more accurate and precise (0.1% FP/FN) than the previous state-of-the-art attack provenance techniques.

Highlights. I propose novel causality inference engine, LDX, which is more precise and efficient (7% of average overhead) than state-of-the-art dynamic taint analysis and audit-logging techniques. LDX can precisely and accurately handle control dependencies which is a major source of inaccuracy in previous approaches.
I propose MCI, a model-based causality inference technique, precisely infers causality from system call logs generated by existing audit-logging systems without any additional requirements on end-users' systems. Both MCI and LDX are transferring to DARPA Transparent Computing Project.

Publications. LDX (ASPLOS'16), MCI (NDSS'18)

Investigating advanced cyber attacks (e.g., APT) is challenging.
Recent cyber attacks are becoming increasingly targeted and sophisticated. Especially, Advanced Persistent Threat, or APT, is an attack that targets a specific organization and compromises target systems over a long period of time without being detected.
Investigating such attacks are challenging because, first, they happen over an extended period time, even years, and, second, they affect many parts the system including executable/configuration/data files in a complex way. Logs to be inspected are more than hundreds of gigabytes including a number of benign activities. Attack provenance is often across multiple processes and files requiring a technique that can precisely identify dependencies.

Existing Approaches: Taint Analysis and Audit Logging.
Taint analysis and audit logging are two state-of-the-art attack investigation techniques.
1) Taint analysis tracks program dependencies by monitoring the data propagation on individual instructions. It is slow because it needs to monitor every instruction, and more importantly, inaccurate because it has difficulty handling control dependencies.
2) Audit-logging focuses on dependencies between system calls. For example, all system calls on the same file handle are causally related, and within a process, all output system calls are causally related to all input system calls. Such coarse-grained analysis often causes a large number of false positives.
For example, consider a Firefox program running for a day, downloading many files from different websites. Audit-logging tools often assume all outputs are dependent on all inputs. Such an assumption is too coarse-grained. Specifically, they may say all downloaded files are dependent on all websites, even though only one or two files and websites are relevant.

What we need is Causality Inference.
Counter-factual causality, first introduced in the 18th century by David Hume, can be used to describe the desired causal analysis in an attack investigation. Specifically, given two events, a latter event is causally dependent on a preceding event if changes at the preceding event lead to state differences in the latter event.
For example, consider there is an incoming network message, and then a file has been changed. In this case, changes in the configuration file are causally dependent on the network message if I block the network message, then the file will not be changed. To this end, I realize that the limitations of taint analysis and audit-logging root from their imprecise approximations of causality inference.

LDX: Causality Inference Engine via Lightweight Dual Execution.
My research takes a fundamental approach: adapting the original counter-factual causality concept in the context of program and program execution. Given an execution, two events at execution points P1 and P2, a later event is causally dependent on a preceding event, if changes at the preceding event cause state differences at the later event.


LDX conducts faithful counter-factual causality inference on computer systems via dual execution. Specifically, it runs two executions in parallel — the original execution and its mutated version with mutations on input syscalls. Then, it observes differences at output syscalls. Any difference indicates causality between the mutated input syscalls and the output syscalls.

LDX: Challenges and Solutions. Due to the mutation LDX introduces, the mutated execution may take a different path, leading to a different sequence of executed syscalls, when compared with the original execution. Hence, a fundamental challenge of LDX is to align the two executions so that they can be compared at the same execution point, because comparing executions at misaligned points leads to incorrect causality (i.e., FP/FN). To this end, I designed a novel runtime counter derived from program structure. The counter is not a simple logic timestamp, but rather denotes execution points by ensuring an important key property: The counter value indicates the relative progress of executions, meaning that an execution with a larger counter value must be ahead of another execution with a smaller counter value with respect to program structure. The counter facilitates alignment of two executions, enabling precise and efficient causality inference.

Toward a practical attack investigation system.
LDX enables precise attack investigation through causality inference. However, it needs to instrument programs to align the master and slave executions. In fact, many previous attack investigation techniques such as BEEP also require program instrumentation on end-user systems.
Unfortunately, instrumentation is not always a possible option. In our conversations with companies and government agencies to deploy many attack investigation techniques, one common concern was that we are not allowed to instrument programs because of their internal policies or contracts with software vendors.
While there are automata-based techniques where they construct automata that represent program behaviors and parse system call logs to identify the behaviors, they do not take dependencies into account. In other words, they do not know causality between the detected behaviors. For example, they may be able to detect individual “read a file” and “write a file” behaviors, while they do not know the relationship between them (e.g., there may be dependencies between the read and write operations). The lack of dependencies significantly limits their applications in attack provenance.

MCI: Model based causality inference.
MCI leverages a key observation which is "system calls in audit logs have explicit and implicit dependencies that can be leveraged to build expressive models."
Specifically, there are two types of dependencies: explicit and implicit dependences. Explicit dependencies are identified by file handles shared by multiple system calls. Implicit dependencies are caused by memory operations and control dependencies, and they are not visible in system call logs. Hence, we use LDX to infer them. MCI works as follows:

1. Causal Model Generation. Specifically, we first get causal models in the offline phase. A causal model is a sequence of system calls with annotated inter-dependencies between them. In high-level, a causal model represents a typical workload such as opening a file. System calls in a model are the ones that always appear when a program processes a typical workload. Given a program, we use LDX to infer causality between system calls on typical workloads. Then we build causal models from the results.

2. Causality Inference via Causal Models. During deployment, given a system call log and causal models, MCI parses the log with the causal models to infer causality from the log. Causality in the audit logs is exposed though causality annotated in the successfully parsed models over the log. More importantly, the composability of models essentially connects causality across multiple parsed model instances, which enables complex attack investigation across multiple workloads.

Challenges: Parsing syscall logs with causal models with implicit dependency information leads to two prominent challenges: (1) language complexity and (2) ambiguity. First, to express complex inter-dependencies annotated in causal models, expressive grammar is required while more expressive grammar describes more complex language (e.g., context-free or context-sensitive) and hence leads to higher cost in parsing.
Second, some syscalls can be parsed by multiple models that share common parts (e.g., common prefixes). In such cases, it is difficult to decide which model is the right one. As different causalities are derived from different models, the ambiguity problem may lead to incorrect causality (i.e., FP/FN).
To solve these challenges, I designed a novel model parsing algorithm called segmented parsing that can handle multiple model complexity levels (e.g., regular, context-free, and context-sensitive) and substantially mitigate the ambiguity problem by leveraging explicit dependencies that can be directly derived from the log (e.g., dependencies caused by file handles). Specifically, MCI first obtains a model skeleton of each causal model. A model skeleton consists of syscalls with explicit dependencies. The skeleton partitions a model into model segments that can be described and parsed by automata. More importantly, causal models have composability such that models for primitive operations can be composed together to describe complex system-wide attack behaviors. For example, primitive models for “Edit”, “Copy”, “Paste”, and “Save” can compose a new model that represents a complex user behavior “Edit->Copy->Edit->Paste->Edit->Save” (e.g., potential information exfiltration).



Lightening Summary (TL;DR). The cat-and-mouse game of attacks and protections is never ending due to the trend of attack specific defenses and ever-evolving attack vectors and methods. My experience of analyzing on a number of malicious payloads leads to a key observation: malicious payloads are carefully constructed on top of a number of assumptions on underlying systems (e.g., memory layouts, instruction set, and system call interfaces); hence they are extremely brittle. In other words, even a very simple perturbation on the malicious payloads would completely break their semantics and lead to crashes during executions.
To this end, I develop a protection system that exploits the fundamental weakness, the brittleness, of malicious payloads. Specifically, the protection system perturbs inputs from untrusted sources because the inputs include malicious payloads. All inputs are encoded hence the malicious payloads are also encoded and broken. The encoded inputs are only decoded when the program wants to read and process them so that the program functions correctly. Note that they are temporary decoded for the intended usage of the input by the original program, and right after the usage, the inputs are encoded again so that the decoded malicious payloads would not exist on the memory. However, this requires decoding and reencoding at a number of places, causing significant overhead up to 40%.
To solve this problem, I leverage program analysis and a novel constraint solving algorithm to identify when the inputs can be permanently decoded. Specifically, we identify certain operations which makes buffers no longer exploitable by attackers after the operations via the constraint solving algorithm.
We also extend the idea to prevent malvertising (PAD in ASE'17). We encode and decode all values written to the memory via certain JavaScript and ActionScript objects which are often exploited by attackers.

Highlights. A2C successfully achieves general protection for a large set of real-world programs, including Apache web server. It prevents a variety of attacks (e.g., heap spraying, use-after-free, buffer-overflow, integer-overflow, and type-confusion) The protection is practical with its low overhead (6%). Its key idea is general and we leverage it to prevent malvertising attacks including the AdGholas malvertising campaign which affected thousands of victims everyday for over a year using a sophisticated steganography technique.

Publications. A2C (NDSS'17), PAD (ASE'17)

The cat and mouse game.
Vulnerabilities are everywhere, and even after decades of effort, the cat and mouse game between attackers and defense systems is still going on. For attack prevention, we find that many of protection systems focus on specific attack methods. For example, some protections target buffer overflow, some others try to mitigate ROP and use-after-free, and so on. However, such protections for a specific kind of attacks could not keep up with rapidly growing new attacks. Building more attack specific protections would not let us win this arms race.

Looking for a common weakness of exploits.
My approach focuses on a common weakness of many attacks. I research various attacks including ROP, use-after-free, buffer overflow and so on, and observe that these attacks have one common step: malicious payload should be injected and executed. Specifically, they need to inject and execute malicious payload at some point.

Little Change, Big Difference = Little perturbation, Completely broken payload.
We look at the malicious payloads and find out that, malicious payloads are designed with strict semantic assumptions about the environment, hence very brittle to any mutation. For example, when we change few bytes in the payload by just incrementing the values of individual bytes, we see the payload is broken. In general, malicious payloads are very carefully designed and any perturbation on the payloads makes their executions crash.

A2C (or Attack to Crash): General protection for various malicious payload injection attacks via input perturbation.
Motivated by the observation, I developed general protection that prevents a wide range of attacks by corrupting the payloads.
Specifically, we encode all inputs including malicious payloads at runtime. After the encoding, payloads are corrupted and broken. However, benign inputs are also corrupted too, making a program execution crash on benign inputs.
To assure that the program continues to function correctly when benign inputs are provided, I developed a static analysis technique that identifies all the places that read and process inputs and selectively inserts decoding logic at some of those places.
Specifically, decoding only occurs when the use of the inputs cannot be exploited. For instance, when inputs in a byte array are copied to an integer array, each byte of the inputs is padded with 3 zero bytes (as an integer is 4 bytes on 32-bit machines) before it is stored into the integer array. Constructing a meaningful payload with 3 zero bytes in every four bytes is extremely difficult, if not impossible.

Dividing program space into two sub spaces: exploitable space and post-exploitable space. I proposed a novel constraint solving algorithm which identifies operations that make inputs no longer exploitable, such as the copy operation from a byte array to an integer array.
The operations essentially divide the state space of a program into exploitable and post-exploitable sub-spaces because the program state before the operation is exploitable, but no longer so after the operations. Therefore, A2C decodes the mutated values only when they are transmitted from the exploitable space to the post-exploitable space. Notably, the exploitable space is much smaller than the post-exploitable space — making A2C highly efficient



PAD to prevent malvertising via input perturbation.
My colleagues and I extend the idea of mutating inputs to break malicious payloads to one of the biggest security threats on the web, malvertising. Specifically, we inspect many malvertising incidents and analyze how they inject and execute malicious payloads.
To this end, we have a key observation: attackers leverage some particular objects (e.g., DataView and strings) that can access consecutive buffers because most objects in JavaScript/ActionScript (JS/AS) do not allocate large size consecutive buffers that can be leveraged to inject malicious payloads.
Hence, we override JS/AS objects that can access consecutive memory buffers so that any values stored through the objects will be encoded, breaking malicious payloads that reside in the buffers. When a JS/AS program wants to use values stored in the buffers, we decode so that a JS/AS program can function correctly. It is important to note that exploits in malvertising are attacking vulnerabilities in JS/AS engine or browsers, not the JS/AS programs. Hence, it is safe to decode the values for JS/AS programs.
Moreover, to prevent heap spraying attacks, we also override objects that can allocate consecutive buffers. We turn the consecutive buffers into segmented buffers connected through a list to break the malicious payloads. For instance, we split a string into a several sub-strings and store them into different buffers.




Lightening Summary (TL;DR). Various platforms such as IoT platforms are introduced and becoming more and more important. Developing secure applications on such new platforms is challenging because there are not many program analysis tools are available on such platforms. While lots of people are reimplementing analysis tools on the new platforms, Reimplementing them is error-prone and time-consuming, hence not sustainable.
My unique angle is reusing existing program analysis techniques to analyze program executions on such new platforms by transforming platform dependent program executions to platform independent executable traces.
I propose a novel execution transformation technique: transforming a platform-dependent execution such as an IoT execution into a platform-independent program so that it can be compiled, executed, and analyzed on any platform. The transformed program reproduces the exact same control dependencies and data dependencies of the original execution.
Furthermore, I extend the execution transformation to a program component extraction and reuse (CPR in ISSTA'17). CPR can extract a component (e.g., a set of functions that represent a certain procedures or algorithms) from a platform-dependent binary program without any source code and symbolic information, and reuse it on another platform. For instance, it can extract parts of a Windows utility program, and reuse them on another IoT device.
Besides, I also developed a data file-format reverse-engineering technique, P2C, that can reveal file/message format (including types and semantics of fields) of unknown binary files or network messages.

Highlights. Leveraging PIEtrace, many sensors and IoT vulnerabilities have been successfully analyzed by using advanced program analysis tools on Windows and Linux. From the analysis, we find unknown bugs on several programs, which would be difficult to identify if we rely on analysis tools on the same platform.
P2C has been used to reveal the semantics of CnC messages generated by the Zeus malware, one of the largest known botnets, which infected 3.6 million systems. P2C was able to reverse-engineer the CnC messages without access to a program that can read and parse the messages.

Publications. PIEtrace(ASE'13), P2C (NDSS'15), CPR (ISSTA'17)

Rise of IoT platforms and importance of secure software development.
Recently, IoT has received a lot of attention and has become a very important part of our life. Just to name a few, drone products are released, major companies working on self-driving cars, recent surveillance cameras are now connected to the Internet, and so on.
Due to the prevalence of IoT devices, testing, debugging, and security analysis on IoT devices is becoming increasingly important. However, many advanced debugging and analysis tools do not support new IoT platforms. Not surprisingly, a study from HP revealed 70% of IoT devices contain vulnerabilities. Notably, many of these vulnerabilities are not new, hence if existing analysis tools were applicable on IoT devices, then these vulnerabilities could have been detected and patched before deployment. Simiarly, there are many existing program analysis tools while they cannot be applied on such new platforms. For example, taint analysis tools implemented on x86 are not applicable to programs on ARM platforms.
Such tools often need to be reimplemented on new platforms again and again when a new platform is released.

Do not reinvent the wheel, Transform the wheel.
My approach to solving the issue of lack of advanced program analysis/debugging tools on such new platforms is completely different. Instead of porting and developing the analysis tools on every new platform, I transform platform-dependent executions such as IoT executions into platform-independent programs so that they can be analyzed on other platforms where existing analysis tools already support.

Running Example. There is a malicious program on a network router. A security analyst wants to understand what it does. However, it is difficult to do as the network router does not support many analysis tools.
PIEtrace traces the execution of the malicious program and generates a platform independent program which is essentially a C program which we call trace program. The trace program reproduces the same execution including control/data dependencies and can be compiled on other platforms. In this case, I compile the program on both x86-Linux and x86-Windows to leverage existing analysis tools and debuggers on both platforms. Specifically, OllyDbg is a great interactive debugger which only supports Windows OS. Its usability and various plug-ins make analysis very efficient and effective. Also, Valgrind only supports Linux and it has many plug-ins that implement advanced program analysis techniques including dynamic program slicing and taint analysis. Being able to use such tools significantly helps the analysis process.


Do not reimplement programs, synthesize them: CPR.
In addition to the debugging and analysis problem, another fundamental problem on IoT platforms is lack of applications. Re-implementing programs is error-prone and time-consuming. We often see many vendors just reuse old source code without even fixing known vulnerabilities. Reimplementation of existing applications often introduces new vulnerabilities and bugs.
We find that binary code extraction and reuse can provide a better way to deal with the problems.
To this end, we extend the cross-platform execution transformation technique into a cross-platform binary extract and reuse technique. Specifically, we obtain multiple platform independent trace programs. Then, we merge them to synthesize a platform-independent program.
An important distinction from a trace program is that a synthesized program can take new inputs and produce outputs accordingly while a trace program does not take any inputs and reproduce the recorded execution path. Moreover, the synthesized program can take inputs that were not exercised during the training.



Advanced Forensic Analysis: Understanding file format of unknown binary files.
Every day, we deal with a lot of unknown files on our machines, generated by various programs. For example, in "Windows" folder on my machine, there is a binary file mib.bin, which I do not know what it contains. Such problem could be very serious if the file is generated by a malicious program and contains sensitive information. For instance, it may contain secret information such as passwords and credit card numbers. Let’s consider a situation that one of such unknown files is leaked. You may want to know the contents of the file and whether it contains any sensitive information.



Understanding file format from a program execution.
A file format including semantics of its content can be revealed by analyzing a program execution that reads and processes the file. A basic idea is to monitor a program execution when the program reads and uses the content. In particular, we analyze how the program treats the contents. For instance, if it is a number, the content may go through some conversion functions (e.g., atoi()). If it is a string, it may go through some string related functions (i.e., strcmp()). Such hints allow us to reverse-engineer unknown file format including types and semantics.

When life gives you lemons, make lemonade: Understanding file format without a consumer program.
However, such a consumer program may not exist always. For example, Command and Control (C&C) protocols in malware often only include logic to create an outgoing message. A consumer program that can parse the messages is only available on the attacker's side.
A unique capability of P2C is that when there is no available consumer program, but there is a producer program while inputs for the producer program to generate the message is not available, it turns a producer program execution into a consumer program execution so that we can understand the file format and semantics.

Transforming Producer Execution to Consumer Execution.
The transformation from a producer to a consumer is possible because consumer and producer programs are symmetric. In other words, operations in a consumer program have corresponding (symmetric) operations in a producer program.
For instance, consider a pair of network client and server programs that use a simple network protocol. A producer program that generates a network message may first construct a message header (A) and then the body (B). A consumer program that can understand the network message should first read the message header (A) and then the body (B). Note that the program structure of a consumer and a producer are symmetric as they both work on the same network message.



When there is not a single clue, forced execution comes to rescue.
P2C has another advanced feature that even when we do not know a producer program, it can find out the true producer from a set of potential producer programs. Specifically, it transforms executions of candidate producer programs to find an ideal consumer program execution. It is important to note that if a potential producer is not the true producer program of an unknown file, the transformation fails quickly. P2C has successfully revealed many file/network message formats only with an unknown file and a set of candidate producer programs. Specifically, it revealed a secret message (e.g., password) encrypted and embedded into a seemingly benign image file by steganography technique. As the attacker only generates the picture and sends it to his sever, there is no consumer program available to us. We only know some suspicious programs and do not even know how to run them. P2C automatically runs them to see which program is a producer and transform the producer program execution to an ideal consumer program execution to reveal the hidden message. It also analyzed C&nC protocols of popular malware (Zbot) within several minutes.

Personal

Things keep me alive

Personal Projects

I have been maintaining my own projects such as anti-malware software, system optimizers, and my own programming language and its IDE for more than 10 years.
These applications have millions of users and have been distributed through my website [open].

Media Coverage

1. Interview (Microsoftware , monthly magazine for developers in South Korea), 2008
2. Featured in etnews, one of the major Korean software industry newspapers, as a software developer, 2008
3. Interview (PC Love, monthly magazine for end-users in South Korea), 2002

Things keep me alive

beer

Dark beers

I love dark beers, my favorites are Murphy's and Rogue HazelNut.

coffee

Coffee

I am a coffee addict. I love dark roasted coffee beans from Ethiopia and Indonesia.

Quakelive

Quakelive

I am a big fan of Quake3 and Rocket Arena since 2000. Now I play quakelive.

LouisCK

Stand-up comedy shows

Louis CK is my favorite. Click this to my favorite clip.

southpark

Southpark

SouthPark is the best animation! Love their sarcasm!

kitchen nightmares

Kitchen Nightmares

Kitchen Nightmares secretly teaches about everyday problems and solutions for Ph.D!