I am a third-year Ph.D. student at Purdue University, advised by Professor Xiangyu Zhang. I previously obtained my B.Sc. degree with Zhiyuan Honours from Shanghai Jiao Tong University (SJTU) in 2018.
I am also a core member of the CTF team 0ops. Sometimes I play with A*0*E, too.
Click to view my up-to-date CV.
My research interests mainly lie in software engineering and program analysis, especially for native code without source code.
To get my email address, run the following command in a terminal (on a GNU/Linux system):
$ echo "$(echo "ghNsgnm" | md5sum - | xxd -r -p | base64 | cut -c3-10)@purdue.edu"
Key Words:
Binary Analysis, Variable Recovery, Probabilistic Analysis, Reverse Engineering
Abstract:
     Recovering variable and data structure information from stripped binaries is a prominent challenge in binary program analysis. While various state-of-the-art techniques are effective in specific settings, such effectiveness may not generalize. This is mainly because the problem is inherently uncertain due to the information loss in compilation. Most existing techniques are deterministic and lack a systematic way of handling such uncertainty. We propose a novel probabilistic technique for variable and structure recovery. Random variables are introduced to denote the likelihood of an abstract memory location having various types and structural properties, such as being a field of some data structure. These random variables are connected through probabilistic constraints derived through program analysis. Solving these constraints produces the posterior probabilities of the random variables, which essentially denote the recovery results. Our experiments show that our technique substantially outperforms a number of state-of-the-art systems, including IDA, Ghidra, Angr, and Howard. Our case studies demonstrate that the recovered information improves binary code hardening and binary decompilation.
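For intuition, here is a minimal sketch (in Python, not the actual implementation) of how several uncertain hints about one abstract memory location could be combined into posterior type probabilities; the hint names, candidate types, and weights are all made up for illustration.

# Toy combination of uncertain type hints for one abstract memory location
# into posterior probabilities (naive-Bayes style).  Hint names and weights
# are invented; the real technique derives probabilistic constraints via
# program analysis and solves them jointly.
CANDIDATE_TYPES = ["int", "pointer", "char[]"]

hints = [  # each hint: (evidence name, likelihood of each candidate type)
    ("dereferenced_as_address", {"pointer": 0.8, "int": 0.1, "char[]": 0.1}),
    ("used_in_arithmetic",      {"int": 0.6, "pointer": 0.3, "char[]": 0.1}),
    ("passed_to_strcpy_like",   {"char[]": 0.7, "pointer": 0.2, "int": 0.1}),
]

def posterior(hints):
    """Multiply per-hint likelihoods with a uniform prior and normalize."""
    score = {t: 1.0 / len(CANDIDATE_TYPES) for t in CANDIDATE_TYPES}
    for _name, likelihood in hints:
        for t in CANDIDATE_TYPES:
            score[t] *= likelihood[t]
    z = sum(score.values())
    return {t: s / z for t, s in score.items()}

print(posterior(hints))  # "pointer" ends up with roughly 0.79 of the mass here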
Key Words:
Network Protocol Reverse Engineering, Probabilistic Analysis
Abstract:
     Network protocol reverse engineering is an important challenge with many security applications. A popular class of methods leverages network message traces. These methods rely on pair-wise sequence alignment and/or tokenization, and they have various limitations, such as difficulty handling a large number of messages and dealing with inherent uncertainty. In this paper, we propose a novel probabilistic method for network-trace-based protocol reverse engineering. It first makes use of multiple sequence alignment to align all messages and then reduces the problem to identifying the keyword field from the set of aligned fields. The keyword field determines the type of a message. The identification is probabilistic, using random variables to indicate the likelihood of each field being the true keyword. A joint distribution is constructed among the random variables and the observations of the messages. Probabilistic inference is then performed to determine the most likely keyword field, which allows messages to be properly clustered by their true types and enables the recovery of message formats and state machines. Our evaluation on 10 protocols shows that our technique substantially outperforms the state-of-the-art, and our case studies show the unique advantages of our technique in IoT protocol reverse engineering and malware analysis.
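As a rough illustration (not the paper's joint-inference model), the sketch below scores each aligned field by how similar the messages sharing that field's value look in their remaining fields, and picks the highest-scoring field as the keyword; the messages and the similarity heuristic are invented.

# Toy keyword-field identification over already-aligned messages: group
# messages by each candidate field's value and prefer the field whose groups
# are internally most similar in the remaining fields.  The real technique
# performs probabilistic inference over a joint distribution; this heuristic
# only mimics the intuition, and the messages below are made up.
from itertools import combinations

aligned = [                         # each message aligned into 4 fields
    ("LOGIN", "alice", "s01", "end"),
    ("LOGIN", "bob",   "s02", "end"),
    ("DATA",  "alice", "s03", "blob"),
    ("DATA",  "bob",   "s04", "blob"),
    ("QUIT",  "alice", "s05", "end"),
    ("QUIT",  "bob",   "s06", "end"),
]

def similarity(m1, m2, skip):
    """Fraction of fields (other than `skip`) on which two messages agree."""
    rest = [i for i in range(len(m1)) if i != skip]
    return sum(m1[i] == m2[i] for i in rest) / len(rest)

def keyword_score(field):
    groups = {}
    for m in aligned:
        groups.setdefault(m[field], []).append(m)
    sims = [similarity(a, b, field)
            for g in groups.values() if len(g) > 1
            for a, b in combinations(g, 2)]
    return sum(sims) / len(sims) if sims else 0.0

scores = {f: keyword_score(f) for f in range(len(aligned[0]))}
print(scores, "-> keyword field:", max(scores, key=scores.get))  # field 0 wins here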
Key Words:
Cyber Attack Detection, Attack Forensics, Attack Provenance Graph
Abstract:
     Cyber-attacks are becoming more persistent and complex. Most state-of-the-art attack forensics techniques either require annotating and instrumenting software applications or rely on high-quality execution profiles to serve as the basis for anomaly detection. We propose a novel attack forensics technique, ALchemist. It is based on the observations that built-in application logs provide critical high-level semantics while the audit log provides low-level, fine-grained information, and that the two share many common elements. ALchemist is hence a log fusion technique that couples application logs and the audit log to derive critical attack information invisible in either log alone. It is built on a Datalog-based relational reasoning engine and features the capabilities of inferring new relations, such as the task structure of execution (e.g., tabs in Firefox), especially in the presence of complex asynchronous execution models, and high-level dependencies between log events. Our evaluation on 15 popular applications, including Firefox, Chromium, and OpenOffice, and 14 APT attacks from the literature demonstrates that although ALchemist does not require instrumentation, it is highly effective in partitioning execution into autonomous tasks (in order to avoid bogus dependencies) and deriving precise attack provenance graphs, with very small overhead. It also outperforms NoDoze and OmegaLog, two state-of-the-art techniques that do not require instrumentation.
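To give a flavor of log fusion (only a toy, with invented event formats and a made-up one-second matching window), the sketch below joins high-level application-log events with low-level audit-log events on a shared file path and nearby timestamps, so syscalls get attributed to the task (e.g., a browser tab) that caused them; the real system expresses such inference as Datalog rules over many more relations.

# Toy log fusion: attribute audit-log syscall events to high-level
# application-log events by joining on a shared element (file path) within
# a small time window.  Field names, events, and the 1-second window are
# invented for illustration.
app_log = [    # high-level semantics, e.g. from a browser's built-in log
    {"time": 100.0, "task": "tab-1", "action": "download", "path": "/tmp/a.pdf"},
    {"time": 105.0, "task": "tab-2", "action": "download", "path": "/tmp/b.exe"},
]
audit_log = [  # low-level, fine-grained kernel audit events
    {"time": 100.2, "pid": 4242, "syscall": "write",  "path": "/tmp/a.pdf"},
    {"time": 105.1, "pid": 4242, "syscall": "write",  "path": "/tmp/b.exe"},
    {"time": 106.0, "pid": 4242, "syscall": "execve", "path": "/tmp/b.exe"},
]

WINDOW = 1.0  # seconds of assumed slack between the two logs

def fuse(app_log, audit_log):
    """Yield (task, audit event) pairs whose path matches and whose
    timestamps are close; this partitions low-level events by task."""
    for a in app_log:
        for s in audit_log:
            if s["path"] == a["path"] and abs(s["time"] - a["time"]) <= WINDOW:
                yield a["task"], s

for task, ev in fuse(app_log, audit_log):
    print(task, "->", ev["syscall"], ev["path"])
# tab-1 -> write /tmp/a.pdf
# tab-2 -> write /tmp/b.exe
# tab-2 -> execve /tmp/b.exe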
Key Words:
Malware Analysis, Forced Execution, Probabilistic Analysis
Abstract:
     Malware is a prominent security threat, and exposing malware behavior is a critical challenge. Recent malware often carries payloads that are only released when certain conditions are satisfied. It is hence difficult to fully disclose the payload by simply executing the malware. In addition, malware samples may be equipped with cloaking techniques such as VM detectors that stop execution once they detect that the malware is being monitored. Forced execution is a highly effective method to penetrate malware self-protection and expose hidden behavior by forcefully setting certain branch outcomes. However, the existing state-of-the-art forced execution technique, X-Force, is very heavyweight, requiring tracing individual instructions, reasoning about pointer alias relations on the fly, and repairing invalid pointers by on-demand memory allocation.
     We develop a lightweight and practical forced execution technique. Without losing analysis precision, it avoids tracking individual instructions and on-demand allocation. Under our scheme, a forced execution is very similar to a native one. It features a novel memory pre-planning phase that pre-allocates a large memory buffer and then initializes the buffer, and variables in the subject binary, with carefully crafted values in a random fashion before the real execution. The pre-planning is designed in such a way that dereferencing an invalid pointer has a very high chance of falling into the pre-allocated region and hence does not cause any exception, and semantically unrelated invalid pointer dereferences are highly likely to access disjoint (pre-allocated) memory regions, avoiding state corruption with probabilistic guarantees. Our experiments show that our technique is 84 times faster than X-Force, has 6.5X and 10% fewer false positives and false negatives for program dependence detection, respectively, and can expose 98% more malicious behaviors in 400 recent malware samples.
@inproceedings{DBLP:conf/sp/You0KAPSHZ20,
  author    = {Wei You and Zhuo Zhang and Yonghwi Kwon and Yousra Aafer and Fei Peng and Yu Shi and Carson Harmon and Xiangyu Zhang},
  title     = {{PMP:} Cost-effective Forced Execution with Probabilistic Memory Pre-planning},
  booktitle = {2020 {IEEE} Symposium on Security and Privacy, {SP} 2020, San Francisco, CA, USA, May 18-21, 2020},
  pages     = {1121--1138},
  publisher = {{IEEE}},
  year      = {2020},
  url       = {https://doi.org/10.1109/SP40000.2020.00035},
  doi       = {10.1109/SP40000.2020.00035},
  timestamp = {Tue, 25 Aug 2020 08:18:17 +0200},
  biburl    = {https://dblp.org/rec/conf/sp/You0KAPSHZ20.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
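For intuition about the probabilistic guarantee behind PMP's pre-planning, here is a toy Monte Carlo sketch (pure Python, nothing binary-level; the region size, access width, and uniform placement of crafted values are assumptions): values planted during pre-planning always point into the pre-allocated region, so an invalid dereference stays in bounds, and two unrelated invalid dereferences almost never touch overlapping bytes.

# Toy Monte Carlo for the pre-planning intuition: values used to initialize
# memory are crafted to point into one large pre-allocated region, so an
# invalid dereference (1) stays inside the region and (2) rarely collides
# with another, unrelated invalid dereference.  Region size, access size,
# and the uniform placement of crafted values are illustrative assumptions.
import random

REGION_BASE = 0x10000000
REGION_SIZE = 64 * 1024 * 1024        # 64 MB pre-allocated buffer
ACCESS = 8                            # bytes touched per dereference
TRIALS = 100_000

def crafted_pointer():
    """A value planted during pre-planning: always inside the region."""
    return REGION_BASE + random.randrange(0, REGION_SIZE - ACCESS)

inside = collisions = 0
for _ in range(TRIALS):
    p, q = crafted_pointer(), crafted_pointer()   # two unrelated dereferences
    inside += REGION_BASE <= p < REGION_BASE + REGION_SIZE
    collisions += abs(p - q) < ACCESS             # touch overlapping bytes
print("inside region:", inside / TRIALS)                 # 1.0 by construction
print("pairwise collision rate:", collisions / TRIALS)   # on the order of 2*ACCESS/REGION_SIZE, i.e. tiny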
Key Words:
Path Sampling, Abstract Interpretation, Binary Analysis, Data Dependence
Abstract:
     Binary program dependence analysis determines dependences between instructions and is hence important for many applications that have to deal with executables without any symbol information. A key challenge is to identify whether multiple memory read/write instructions access the same memory location. The state-of-the-art solution is value set analysis (VSA), which uses abstract interpretation to determine the set of addresses that are possibly accessed by memory instructions. However, VSA is conservative and hence leads to a large number of bogus dependences and, in turn, substantial false positives in downstream analyses such as malware behavior analysis. Furthermore, existing public VSA implementations have difficulty scaling to complex binaries.
     In this paper, we propose a new binary dependence analysis called BDA, enabled by a randomized abstract interpretation technique. It features a novel whole-program path sampling algorithm that is not biased by path length, and a per-path abstract interpretation that avoids the precision loss caused by merging paths in traditional analyses. It also provides probabilistic guarantees. Our evaluation on SPECINT2000 programs shows that it can handle complex binaries such as gcc, whereas VSA implementations from state-of-the-art platforms have difficulty producing results for many SPEC binaries. In addition, the dependences reported by BDA are 75 and 6 times fewer than those reported by Alto, a scalable binary dependence analysis tool, and by VSA, respectively, with only 0.19% of the true dependences observed during dynamic execution missed by BDA. Applying BDA to call graph generation and malware analysis shows that BDA substantially supersedes the commercial tool IDA in recovering indirect call targets and outperforms a state-of-the-art malware analysis tool, Cuckoo, by disclosing 3 times more hidden payloads.
@article{DBLP:journals/pacmpl/ZhangYTWK019,
  author    = {Zhuo Zhang and Wei You and Guanhong Tao and Guannan Wei and Yonghwi Kwon and Xiangyu Zhang},
  title     = {{BDA:} practical dependence analysis for binary executables by unbiased whole-program path sampling and per-path abstract interpretation},
  journal   = {Proc. {ACM} Program. Lang.},
  volume    = {3},
  number    = {{OOPSLA}},
  pages     = {137:1--137:31},
  year      = {2019},
  url       = {https://doi.org/10.1145/3360563},
  doi       = {10.1145/3360563},
  timestamp = {Mon, 04 Jan 2021 07:44:35 +0100},
  biburl    = {https://dblp.org/rec/journals/pacmpl/ZhangYTWK019.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
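BDA's unbiased path-sampling idea can be shown on a tiny, made-up acyclic CFG: count the number of paths from each node to the exit, then pick each branch with probability proportional to its successor's path count, which makes every complete path equally likely regardless of its length (the real algorithm additionally has to bound loops, recursion, and indirect jumps).

# Uniform whole-path sampling on a small, made-up acyclic CFG.
# count[n] = number of paths from n to the exit; sampling each branch with
# probability proportional to the successor's count makes all complete
# paths equally likely, so long paths are not under-sampled.
import random
from collections import Counter
from functools import lru_cache

cfg = {                      # node -> successors (acyclic for simplicity)
    "entry": ["a", "b"],
    "a": ["c"],
    "b": ["c", "exit"],
    "c": ["exit"],
    "exit": [],
}

@lru_cache(maxsize=None)
def count(node):
    succs = cfg[node]
    return 1 if not succs else sum(count(s) for s in succs)

def sample_path():
    node, path = "entry", ["entry"]
    while cfg[node]:
        succs = cfg[node]
        node = random.choices(succs, weights=[count(s) for s in succs], k=1)[0]
        path.append(node)
    return tuple(path)

# Each of the 3 complete paths should be drawn roughly 1/3 of the time.
print(Counter(sample_path() for _ in range(30000)))

With naive 50/50 branch choices, a path's probability decays with the number of branches along it, so long paths are rarely drawn; weighting each choice by the downstream path count removes exactly that bias.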
Key Words:
Disassembly, Binary Rewrite, Probabilistic Analysis
Abstract:
     Disassembling stripped binaries is a prominent challenge for binary analysis, due to the interleaving of code and data and the difficulty of resolving the control transfer targets of indirect calls and jumps. As a result, most existing disassemblers have both false positives (FP) and false negatives (FN). We observe that uncertainty is inevitable in disassembly due to the information loss during compilation and code generation. Therefore, we propose to model such uncertainty using probabilities, and we propose a novel disassembly technique that computes a probability for each address in the code space, indicating its likelihood of being a true positive instruction. The probability is computed from a set of features that can reach the address, including control-flow and data-flow features. Our experiments with more than two thousand binaries show that our technique does not have any FN and has only 3.7% FP. In comparison, a state-of-the-art superset disassembly technique has 85% FP. A rewriter built on our disassembly can generate binaries that are only half the size of those produced by superset disassembly and that run 3% faster. While many widely used disassemblers such as IDA and BAP suffer from missing function entries, our experiments also show that even without any function entry information, our disassembler can still achieve 0 FN and 6.8% FP.
@inproceedings{DBLP:conf/icse/MillerKSZZL19,
  author    = {Kenneth A. Miller and Yonghwi Kwon and Yi Sun and Zhuo Zhang and Xiangyu Zhang and Zhiqiang Lin},
  editor    = {Joanne M. Atlee and Tevfik Bultan and Jon Whittle},
  title     = {Probabilistic disassembly},
  booktitle = {Proceedings of the 41st International Conference on Software Engineering, {ICSE} 2019, Montreal, QC, Canada, May 25-31, 2019},
  pages     = {1187--1198},
  publisher = {{IEEE} / {ACM}},
  year      = {2019},
  url       = {https://doi.org/10.1109/ICSE.2019.00121},
  doi       = {10.1109/ICSE.2019.00121},
  timestamp = {Mon, 15 Jun 2020 17:06:47 +0200},
  biburl    = {https://dblp.org/rec/conf/icse/MillerKSZZL19.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
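As a toy illustration of the per-address probability computation in probabilistic disassembly (the tiny instruction graph, hint locations, and the 1/16 priors are all invented; the real technique derives its hints from actual control-flow and data-flow features): each hint is assigned the prior probability that it would arise from random bytes, the priors of all hints reachable from an address are multiplied, and one minus that product serves as the probability that the address holds a true instruction.

# Toy probability computation: for an address a, multiply the priors of all
# hints reachable from a along the superset-decoded control flow; the product
# is the chance that a is NOT real code yet exhibits all those hints, so we
# take P(code at a) = 1 - product.  The graph and priors are invented.
succ = {            # addr -> control-flow successors from decoding at addr
    0x100: [0x102],
    0x101: [0x105],          # an "occluded" decoding that soon dies out
    0x102: [0x104],
    0x104: [0x106],
    0x105: [],
    0x106: [],
}
hint_prior = {0x104: 1 / 16, 0x106: 1 / 16}   # e.g. def-use style hints

def reachable(a, seen=None):
    seen = set() if seen is None else seen
    if a in seen or a not in succ:
        return seen
    seen.add(a)
    for s in succ[a]:
        reachable(s, seen)
    return seen

def p_code(a):
    prod = 1.0
    for r in reachable(a):
        prod *= hint_prior.get(r, 1.0)
    return 1.0 - prod

for a in sorted(succ):
    print(hex(a), round(p_code(a), 4))
# 0x100, 0x102, 0x104 reach both hints -> ~0.996; 0x106 reaches one -> ~0.94;
# the occluded 0x101 and 0x105 reach none -> 0.0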