Lin Tan
Below are some example projects (our research moves fast: projects complete and exciting new projects start all the time):

Desired experience: Strong coding skills and motivation for research are required. A background in security or machine learning is not required but is a plus.

Possible industry involvement: Some of these projects are funded by Meta/Facebook research awards and J.P.Morgan AI research awards. 

We especially encourage applications from women, Aboriginal peoples, and other groups underrepresented in computing.

Some of the positions are funded by NSF REU, which requires U.S. citizenship or permanent residency. In your email, please indicate whether you are a U.S. citizen or permanent resident.

*** Project 1. Data-Free Model Extraction

Many deployed machine learning models, such as ChatGPT and Codex, are accessible through pay-per-query APIs. It can be profitable for an adversary to extract these models, either to steal them outright or for reconnaissance. Recent model-extraction attacks on Machine Learning as a Service (MLaaS) systems have moved towards data-free approaches, showing that it is feasible to steal models trained on difficult-to-access data. However, these attacks remain limited by the low accuracy of the extracted models and the large number of queries they issue to the models under attack. The high query cost makes such techniques infeasible against online MLaaS systems that charge per query.

In this project, we will design novel approaches that achieve higher accuracy and better query efficiency than prior data-free model extraction techniques.
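
To give a flavor of the setting, below is a minimal, hypothetical sketch of query-based data-free extraction in PyTorch (this is not the DisGUIDE algorithm; the architectures, loss, and hyperparameters are illustrative assumptions). A generator synthesizes queries, the victim is used only as a black box that returns predictions, and a student model is trained to imitate those predictions.

```python
# Hypothetical sketch of data-free model extraction: models and sizes are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F

victim = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))    # stands in for the MLaaS model
student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))   # the attacker's copy
generator = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 32))  # maps noise to synthetic queries

opt_s = torch.optim.Adam(student.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)

for step in range(200):                       # every query batch costs money on a pay-per-query API
    z = torch.randn(64, 8)

    # Generator step: craft queries on which the student and victim disagree the most.
    x = generator(z)
    with torch.no_grad():
        victim_logits = victim(x)             # black-box query: only the outputs are observed
    g_loss = -F.kl_div(F.log_softmax(student(x), dim=1),
                       F.softmax(victim_logits, dim=1), reduction="batchmean")
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

    # Student step: imitate the victim's predictions on the synthetic queries.
    x = generator(z).detach()
    with torch.no_grad():
        victim_logits = victim(x)
    s_loss = F.kl_div(F.log_softmax(student(x), dim=1),
                      F.softmax(victim_logits, dim=1), reduction="batchmean")
    opt_s.zero_grad()
    s_loss.backward()
    opt_s.step()
```

The key cost driver is the total number of victim queries, which is exactly what this project aims to reduce while keeping the student's accuracy high.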

Our recent prior work and background can be found here: [DisGUIDE-AAAI23]

*** Project 2. Language Models for Detecting and Fixing Software Bugs and Vulnerabilities

In this project, we will develop machine learning approaches, including code language models, that automatically learn bug, vulnerability, and fix patterns from historical data in order to detect and fix software bugs and security vulnerabilities. We will also study and compare general code language models and domain-specific language models.
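
As a rough, hypothetical sketch of the kind of pipeline involved (the checkpoint name, toy data, and one-step training loop are assumptions for illustration, not our method), one can fine-tune an off-the-shelf code language model on historical buggy/fixed pairs and then generate candidate patches:

```python
# Hypothetical sketch: fine-tune a seq2seq code language model to map buggy code to fixed code.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "Salesforce/codet5-base"               # assumed publicly available checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# One (buggy, fixed) pair mined from project history; in practice there are many thousands.
buggy = "if (idx <= arr.length) { return arr[idx]; }"
fixed = "if (idx < arr.length) { return arr[idx]; }"

inputs = tokenizer(buggy, return_tensors="pt")
labels = tokenizer(fixed, return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = model(**inputs, labels=labels).loss          # cross-entropy over the tokens of the fix
loss.backward()
optimizer.step()

# After fine-tuning, candidate patches are generated for previously unseen buggy code.
model.eval()
with torch.no_grad():
    patch_ids = model.generate(**inputs, max_length=64, num_beams=5)
print(tokenizer.decode(patch_ids[0], skip_special_tokens=True))
```

In the actual project we will go well beyond this sketch, e.g., comparing general versus domain-specific code language models and validating generated patches against tests and analyzers.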

Our recent prior work and background can be found here: [VulFix-ISSTA23] [CLM-ICSE23] [KNOD-ICSE23]

*** Project 3. Inferring Specifications from Software Text for Finding Bugs and Vulnerabilities

A fundamental challenge of detecting or preventing software bugs and vulnerabilities is to know programmers' intentions, formally called specifications. If we know the specification of a program (e.g., where a lock is needed, what input a deep learning model expects, etc.), a bug detection tool can check if the code matches the specification. 

Building on our expertise as the first to extract specifications from code comments to automatically detect software bugs and bad comments, in this project we will analyze new sources of software textual information (such as API documents and Stack Overflow posts) to extract specifications for bug detection. For example, the API documents of deep learning libraries such as TensorFlow and PyTorch contain rich input-constraint information about tensors.
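
For illustration only (the docstring sentence and the extraction rule below are made up; real documents require much more robust text analysis), a mined input constraint can be turned into an executable check:

```python
# Hypothetical sketch: extract a tensor constraint from an API-document sentence
# and turn it into a runtime specification check.
import re
import torch

doc_sentence = "`input` must be a 2-D tensor of type float32."

# Extract (parameter, rank, dtype) with a simple pattern; real documentation text is messier.
m = re.search(r"`(\w+)` must be a (\d+)-D tensor of type (\w+)", doc_sentence)
param, rank, dtype = m.group(1), int(m.group(2)), m.group(3)

def check(tensor):
    """Specification inferred from the document: reject inputs that violate it."""
    assert tensor.dim() == rank, f"{param} must be {rank}-D, got {tensor.dim()}-D"
    assert str(tensor.dtype) == f"torch.{dtype}", f"{param} must be {dtype}, got {tensor.dtype}"

check(torch.zeros(3, 4, dtype=torch.float32))   # satisfies the extracted specification
check(torch.zeros(3, dtype=torch.float32))      # violates the rank constraint -> AssertionError
```

Checks like this, derived automatically from text, give bug-detection tools the programmer intentions they otherwise lack.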

Our recent prior work and background can be found here: [Software Text Analytics]


*** Project 4. Testing Deep Learning Systems  

We will build cool and novel techniques to make deep learning libraries such as TensorFlow and PyTorch reliable and secure. We will build them on top of our award-winning paper (ACM SIGSOFT Distinguished Paper Award)!

Machine learning systems, including deep learning (DL) systems, demand reliability and security. DL systems consist of two key components: (1) the models and algorithms that perform complex mathematical calculations, and (2) the software that implements those algorithms and models. Here, software includes DL infrastructure code (e.g., code that performs core neural network computations) and application code (e.g., code that loads model weights). Thus, for the entire DL system to be reliable and secure, both the software implementation and the models/algorithms must be reliable and secure. If the software fails to faithfully implement a model (e.g., due to a bug), its output can be wrong even if the model is correct, and vice versa.

This project aims to use novel approaches, including differential testing, to detect and localize bugs in DL software (including code and data) and to address the test oracle challenge.
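
As a small illustrative example of a differential-testing oracle (a toy case, not the EAGLE approach itself), two mathematically equivalent ways of computing the same operation should agree within a tolerance; a large divergence flags a potential bug:

```python
# Toy differential-testing oracle: equivalent implementations of the same computation must agree.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(8, 3, 16, 16)
w = torch.randn(4, 3, 3, 3)

# Implementation A: the library's convolution.
out_a = F.conv2d(x, w, padding=1)

# Implementation B: an equivalent formulation via im2col (unfold) + matrix multiplication.
cols = F.unfold(x, kernel_size=3, padding=1)             # (N, C*k*k, H*W)
out_b = (w.view(4, -1) @ cols).view(8, 4, 16, 16)

# Differential oracle: equivalent implementations must produce close outputs.
max_diff = (out_a - out_b).abs().max().item()
print(f"max difference: {max_diff:.2e}")
assert max_diff < 1e-4, "Equivalent implementations diverge: possible bug"
```

The comparison of equivalent executions serves as the test oracle, so no manually written expected outputs are needed.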

Our recent prior work and background can be found here: [EAGLE-ICSE22] [Fairness-NeurIPS21] [Variance-ASE20]