The GoBoiler internship program is currently on hiatus.
Purdue's 63 Computer Science faculty members run a PhD program covering 11 broad research areas. The PhD program currently has around 245 PhD students from over 30 countries. Several of the faculty propose the following list of potential projects for GoBoiler interns. Please select one or more of these projects as potential work you would like to do if you are admitted to the program.
Each project link below indicates the Research Area - Faculty mentor name - Project title.
Cities are ecosystems of socio-economic entities which provide concentrated living, working, education, and entertainment options to their inhabitants. Hundreds of years ago, the significantly smaller population and the abundance of natural resources made city design, and even the functioning of cities in relation to their hinterland, quite straightforward. Unfortunately, that is not the case today with over 3.5 billion people in cities. Rather, cities, and urban spaces of all sizes, are extremely complex and their modeling is far from being solved. In this project, we aim to pull together CS, engineering, agricultural-economics, and social science to collectively exploit our unique opportunity to address this emerging problem. Research activities will span many fields, including machine learning, data science, and computer graphics/vision, and will focus on cross-disciplinary research into designing and simulating the functioning of existing and future cities. Our desire is also to pool this knowledge, identify our unique strengths, and pursue large and ambitious computing projects.
Reconstruction and geometric modeling of developing 3D biological structures are among the most interesting and most visually plausible problems in Computer Graphics. This project is part of the Crops in Silico initiative and grant, which attempts to understand how plants grow and how they can be genetically optimized to feed more people. The objective of this task is to generate biologically plausible 3D geometries of growing plants (maize, sorghum, and wheat) by reconstructing them from series of images captured over time. Each plant grows for several weeks in a controlled environment and is regularly photographed from multiple directions using RGB, multispectral, and infrared cameras. The data needs to be converted into 3D geometry using deep learning, and then the individual plants need to be combined into a functional-structural plant model that also captures the temporal dimension (growth).
Multi-agent board games of strategy are played by millions worldwide, representative examples being Monopoly, Risk, and Settlers of Catan. A complete game instance involves multi-agent, stochastic reasoning about tradeoffs, e.g., resource allocation, risk-reward premiums, staggered and uncertain reward functions, agent psychology and interactions, game-theoretic competition/collaboration, temporally varying objectives (players A and B may implicitly collaborate to force out a dominant player C, but will eventually return to competing), and innovative tactics in the face of bad luck. Variants of games like Monopoly allow collusion, racketeering, and negotiation, making them instrumental classroom tools in understanding and honing strategy. Like any game, there is both an explicit set of non-violable rules ('contract'), and many implicit rules and loopholes. Robustly navigating novel situations in such a complex domain is an important thrust for open-world AI research, especially supporting military applications (including decision support, human-machine collaborations and autonomy) in a world characterized by gray-zone and multi-domain conflicts.
Students will work closely with our students and Prof. Stonebraker's (MIT) students. The objective is to automatically extract data relevant to significant events, identify patterns related to a mission, and push relevant information efficiently to interested parties (e.g., analysts, cyber security experts, and decision makers). We are currently working with West Lafayette Police Department data from fixed cameras to identify events.
Students will work with Purdue students and Ford researchers. The goal is to explore new mobility management in vehicular networks by leveraging Software-Defined Networking (SDN) and Network Function Virtualization (NFV), and to enable ubiquitous vehicle connectivity and high QoS in Intelligent Transport System applications.
Connectivity in an automotive environment is not as ubiquitous as in a smartphone-like environment. A persistent connection is very rare in such scenarios, so it is necessary to explore technologies in vehicular communication networks that achieve effective mobility management, low latency, and high network bandwidth utilization. In addition, emerging V2X transport layer protocols such as 5GAA and DSRC demand interoperability across the various interfaces available in a vehicle. Another major concern is the heterogeneous nature of in-vehicle networks, which calls for better network resource management. To address these gaps, we propose to develop an SDN-based, low-memory-footprint architecture, both in and off the vehicle, that separates the data and control planes, with the control plane having global visibility of the network.
Binary Analysis focuses on analyzing, typically automatically, compiled binary software (e.g., compiled C/C++ code) without having its source code. In recent years many techniques (e.g., symbolic execution and fuzzing) have been developed and improved to make binary analysis suitable to automatically find software vulnerabilities (e.g., memory corruption bugs). While effective on a small scale, these techniques do not scale enough when applied to large codebases (e.g., an entire browser).
This project aims to augment these techniques by using a human-in-the-loop approach. For instance, a fuzzer (software that tries many inputs in an attempt to make a program crash) can detect that just trying random inputs is ineffective in exploring the execution of a particular program. In this case, the fuzzer could ask a human expert for guidance on how to generate more targeted inputs. Another case is to use human expertise to semi-automatically modify existing compiled software to make it easier to analyze automatically. For instance, using human help, an automated approach could remove checks that are computationally hard to bypass using symbolic execution.
Modern mobile devices (e.g., smartphones and tablets) are equipped with special hardware features (e.g., TrustZone, 'Secure Enclave', ...), which are guaranteed not to be compromised even when the main operating system (e.g., Android) gets compromised (rooted). These features can be used by third-party apps to implement authentication protocols that are guaranteed to be "safe" even on compromised devices. Unfortunately, it is currently extremely challenging for developers to use these hardware features correctly, due to issues in their implementation and the complexity of the APIs designed to control them.
In this project, we will explore this problem in two parallel directions. First, we will study shortcomings of the current APIs and how to simplify, from a developer perspective, their correct usage by, for instance, implementing proper code libraries. At the same time, we will study how these features are currently used by app developers to pinpoint common issues and vulnerabilities caused by improper usage of these APIs.
In the last few years over a billion user passwords have been exposed to the dangerous threat of offline attacks through breaches at organizations like Yahoo!, Dropbox, LinkedIn, LastPass, Adult FriendFinder and Ashley Madison. Password hashing is a crucial 'last line of defense' against an offline attacker. An attacker who obtains the cryptographic hash of a user’s password can validate password guesses offline by comparing the hashes of likely password guesses with the stolen hash value. There is no way to lock the adversary out, so the attacker is limited only by the cost of computing the password hash function millions or billions of times per user. A strong password hashing algorithm should have the following properties:
- It is prohibitively expensive for the attacker to compute the function millions or billions of times.
- It can be computed on a standard personal computer in a reasonable amount of time, so that users can still authenticate without a noticeable delay.
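The offline guessing attack described above can be sketched in a few lines. This is a minimal illustration, not the project's code: it uses the standard library's `hashlib.pbkdf2_hmac` (an iterated hash that is not memory hard) purely for portability, and the iteration count, dictionary, and helper names `hash_password` and `offline_attack` are hypothetical.

```python
import hashlib
import hmac
import os

# Hypothetical cost parameter: each password guess costs this many HMAC iterations.
ITERATIONS = 10_000


def hash_password(password: str, salt: bytes) -> bytes:
    """Salted, iterated password hash (PBKDF2-HMAC-SHA256)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)


def offline_attack(stolen_hash: bytes, salt: bytes, guesses: list):
    """Try each guess offline; nothing can lock the attacker out."""
    for attempts, guess in enumerate(guesses, start=1):
        if hmac.compare_digest(hash_password(guess, salt), stolen_hash):
            return guess, attempts
    return None, len(guesses)


# The breach leaks the salt and the hash, but not the password itself.
salt = os.urandom(16)
stolen = hash_password("letmein", salt)

dictionary = ["123456", "password", "qwerty", "letmein", "dragon"]
found, attempts = offline_attack(stolen, salt, dictionary)
print(found, attempts)
```

The attacker's total work is the number of guesses times the cost of one hash evaluation, which is why raising the per-evaluation cost (in time, and for memory hard functions also in memory) matters while a single honest login stays cheap.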
Memory hard functions (MHFs) are a crucial cryptographic primitive in the design of key-derivation functions which transform a low-entropy secret (e.g., user password) into a cryptographic key. Data-Independent memory hard functions (iMHFs) are an important variant due to their natural resistance to side-channel attacks.
Argon2, the winner of the password hashing competition, initially recommended the data-independent mode Argon2i for password hashing, but this recommendation was later changed in response to a series of space-time tradeoff attacks against Argon2i showing that the amortized Area-Time complexity of this function was significantly lower than initially believed (CRYPTO 2016). In this project students will be exposed to cutting edge research on the design and analysis of memory hard functions and will have the opportunity to help implement and evaluate state of the art constructions (e.g., EUROCRYPT 2017, CCS 2017, 2018).
An ideal student should have a strong background in mathematics and theoretical computer science (e.g., graph theory, data-structures and algorithms) and should be comfortable writing code (e.g., C, C++, C#, Python). The project can be tailored to the student's strengths. One aspect of the research will involve running intensive computational experiments. Another aspect will involve modifying current implementations of memory hard functions and evaluating these implementations. For students with an exceptionally strong background in theoretical computer science there are several challenging open problems to work on.
Program synthesis aims to automatically generate a program that satisfies the user intent expressed through some high-level specifications. For instance, one of the most popular styles of inductive synthesis, Counterexample-Guided Inductive Synthesis (CEGIS), starts with a specification in which the user defines what the desired program does. A synthesizer produces a candidate program that might satisfy the specification, and a verifier decides whether that candidate program meets the desired specification. If the specification is satisfied, we are successful; if not, the verifier provides feedback that the synthesizer uses to guide its search for new candidate programs. While program synthesis has been successfully used in areas including computer-aided education, end-user programming, and data cleaning, the application and scope of program synthesis for security and safety is largely unexplored by the technical community. In this project, we will explore algorithms and techniques to automatically generate programs from formal or informal specifications to improve the security and safety of the users and environments. We will focus on programs used to automate heterogeneous and connected sensors/actuators.
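The CEGIS loop can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the candidate space (linear functions `a*x + b` with small integer coefficients), the bounded input domain, the example specification, and the function names `spec`, `synthesize`, `verify`, and `cegis`. A real synthesizer and verifier would typically rely on constraint solvers rather than enumeration.

```python
# Bounded input domain over which the verifier checks candidates (assumption).
DOMAIN = range(-50, 51)
# Candidate coefficients the synthesizer may choose from (assumption).
COEFFS = range(-5, 6)


def spec(x: int, y: int) -> bool:
    """Example specification: the program's output must equal 3*x + 2."""
    return y == 3 * x + 2


def synthesize(examples):
    """Return the first candidate (a, b) consistent with all counterexamples so far."""
    for a in COEFFS:
        for b in COEFFS:
            if all(spec(x, a * x + b) for x in examples):
                return a, b
    return None


def verify(candidate):
    """Return an input on which the candidate violates the spec, or None if it is correct."""
    a, b = candidate
    for x in DOMAIN:
        if not spec(x, a * x + b):
            return x
    return None


def cegis():
    examples = []  # counterexamples collected from the verifier
    while True:
        candidate = synthesize(examples)
        if candidate is None:
            return None  # no program in the candidate space satisfies the spec
        counterexample = verify(candidate)
        if counterexample is None:
            return candidate  # verified over the whole domain
        examples.append(counterexample)  # feedback guides the next search


result = cegis()
print(result)
```

Each counterexample rules out at least the candidate that produced it, so the loop terminates on this finite candidate space.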
Fuzzing is the process of running a software application (the fuzz target) on test inputs produced according to an input generation policy and detecting whether any of those inputs trigger a bug. Additionally, a fuzzer can use feedback from the fuzz target to guide its input generation. A fuzzer often includes two key components: an input generation module and a bug oracle. The input generation module implements the input generation process and may optionally take in feedback from the fuzz target to guide the generation. The bug oracle signals to the fuzzer whether the generated input has triggered a bug. Fuzzing has been conventionally used to discover memory-access violation bugs, such as buffer overflows and use-after-free. The application of fuzzing to detect safety and security policy violations in the Internet of Things (IoT) and Cyber-Physical Systems (CPS), however, has not been well explored. This project aims at identifying and addressing the challenges of extending fuzzing to detect policy violations in IoT/CPS environments.
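The two components named above can be sketched as follows. This is a toy under stated assumptions: the fuzz target is a hypothetical Python function that emulates an unchecked copy into a fixed-size buffer, the input generator is purely random (no feedback), and the oracle treats an `IndexError` as the analogue of a memory-access violation.

```python
import random

BUFFER_SIZE = 8  # size of the toy target's fixed buffer (assumption)


def fuzz_target(data: bytes) -> None:
    """Toy target: copies the input into a fixed-size buffer without a length check."""
    buffer = bytearray(BUFFER_SIZE)
    for i, byte in enumerate(data):
        buffer[i] = byte  # raises IndexError once i >= BUFFER_SIZE


def generate_input(rng: random.Random, max_len: int = 16) -> bytes:
    """Input generation module: random bytes of random length (no feedback used)."""
    length = rng.randint(0, max_len)
    return bytes(rng.randrange(256) for _ in range(length))


def bug_oracle(data: bytes) -> bool:
    """Bug oracle: reports whether running the target on this input triggers the bug."""
    try:
        fuzz_target(data)
    except IndexError:
        return True
    return False


def fuzz(trials: int = 1000, seed: int = 0):
    """Generate inputs until the oracle flags one; return it, or None if none is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        data = generate_input(rng)
        if bug_oracle(data):
            return data
    return None


crashing_input = fuzz()
print(crashing_input)
```

Extending this shape to policy violations would mean replacing the oracle with a check of a safety or security policy rather than a crash, which is where the challenges described above arise.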
Adversarial samples are benign inputs that have been perturbed to induce DNN misbehavior. For instance, an adversarial human photo may be used to evade a face recognition model that unlocks a smart door. Adversarial sample detection techniques, broadly speaking, attempt to find regularities that distinguish benign from adversarial samples in the input, latent, and output spaces. This project will include exploring and formalizing adversarial detection techniques such as feature squeezing and attribute-steered detection, categorizing them based on their applications, and evaluating their effectiveness against non-adaptive and adaptive adversaries.
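As a rough illustration of feature squeezing, the sketch below compares a model's prediction on an input with its prediction on a bit-depth-reduced copy and flags the input when the two differ by more than a threshold. The model is a hypothetical stand-in (a steep logistic function of the mean pixel value), and the threshold, inputs, and function names are assumptions chosen only to make the example self-contained.

```python
import math


def squeeze_bit_depth(pixels, bits):
    """Reduce each pixel in [0, 1] to the given bit depth."""
    levels = 2 ** bits - 1
    return [round(p * levels) / levels for p in pixels]


def toy_model(pixels):
    """Hypothetical two-class model: returns [P(class 0), P(class 1)]."""
    mean = sum(pixels) / len(pixels)
    p1 = 1.0 / (1.0 + math.exp(-50.0 * (mean - 0.5)))
    return [1.0 - p1, p1]


def squeezing_score(pixels, bits=1):
    """L1 distance between predictions on the original and the squeezed input."""
    original = toy_model(pixels)
    squeezed = toy_model(squeeze_bit_depth(pixels, bits))
    return sum(abs(a - b) for a, b in zip(original, squeezed))


def is_adversarial(pixels, threshold=0.5):
    """Flag inputs whose prediction changes substantially under squeezing."""
    return squeezing_score(pixels) > threshold


benign = [0.0, 0.0, 1.0, 0.0]      # already at 1-bit depth, so squeezing is a no-op
perturbed = [0.4, 0.4, 1.0, 0.4]   # perturbation that flips the toy model's decision

print(is_adversarial(benign), is_adversarial(perturbed))
```

Squeezing removes the perturbation in this toy case, so the prediction on the perturbed input moves sharply while the benign prediction does not. An adaptive adversary who knows the detector could craft perturbations that survive squeezing, which is why evaluation against adaptive adversaries matters.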
A generative adversarial network (GAN) is an ML technique that increases the effectiveness of a generator model by training a discriminator that seeks to differentiate between real data and generated data. While GANs have proven to be quite powerful at automatically creating (almost) realistic-looking content such as images, text, and video at scale, our understanding of the nature and extent to which a GAN can be trained to identify the kinds of artifacts found in security-oriented tasks (e.g., network traffic and malware binaries) remains limited. This project explores the general security implications of sophisticated GANs in security applications. We will explore how they can harm the environment and society and characterize their fundamental trade-offs relative to other types of generative models, such as flow and autoregressive models.
Distributed applications often require sharing of data and coordination between application instances. For instance, a note-taking application may need to synchronize the application data across different devices owned by the user so that changes made in one device become visible on others. More complex distributed applications, such as Google Docs, may additionally require synchronizing data across different users. Unfortunately, existing synchronization mechanisms often require specialized servers to synchronize shared data and coordinate processes.
This project will explore novel data synchronization mechanisms that avoid the need for specialized servers that perform data synchronization and process coordination. The mechanisms developed and the prototype built for this project are expected to significantly simplify the process of building and deploying distributed applications. For this project we are seeking students with good C and Python programming skills.
Mobile devices are now ubiquitous; however, mobile application development still has a relatively steep learning curve and in practice often involves using complex specialized development environments. Currently, building even simple mobile applications generally requires developers to write a significant amount of code and learn complex concepts.
This summer project will involve building a fast prototyping framework that will enable developers to quickly develop a wide range of Android applications. For this project we seek students with experience building Android applications and Java programming skills. Additionally, students should have at least basic Python programming skills.
Hypervisors (virtual machine monitors) are a fundamental software layer that powers virtually all cloud computing environments. However, hypervisors are well known to be complex software systems that are challenging to test and implement correctly. In the context of our MultiNyx project, using symbolic execution techniques we built a system that systematically generates effective tests for virtual machines.
This summer research project will consist of developing additional components in the context of the MultiNyx project. In particular, one of the tasks will consist of developing effective methodologies to generate a sequence of x86 instructions that will cause the virtual machine to reach a given memory and register state. For this project we are seeking students with good C programming and low-level systems programming skills. Strong candidates will have at least basic understanding of x86 assembly language.
Run-time instrumentation frameworks, such as Intel Pin, make it possible to instrument applications while they execute. Such instrumentation has important applications, such as allowing efficient and selective program analysis (e.g., tracing) and enabling developers to change the behavior of applications at run time, for instance, for patching purposes.
At Purdue we are building a kernel run-time instrumentation framework that will allow developers to instrument the entire system, including the operating system kernel. As part of this project, several concrete tasks have to be developed including the implementation of selective instrumentation and the optimization mechanisms that are important to ensure efficiency. For this project we are seeking students with good C programming and low-level systems programming skills. Strong candidates will have at least basic understanding of x86 assembly language.
Testing is key to helping ensure the reliability of applications and, in practice, dynamic testing is widely used in industry due to its effectiveness. Despite this effectiveness, dynamic testing approaches are generally very computationally intensive because they require massive sets of test cases to be repeatedly executed, which consumes significant CPU and memory resources.
This project will involve developing a novel execution runtime for testing that will increase testing efficiency. To achieve this our system will reduce the amount of redundancy of traditional testing approaches using program analysis and operating system mechanisms. For this project we are seeking students with good C programming and Linux systems programming skills.
Concurrent systems are challenging to implement correctly because developers have to correctly reason about all the possible thread schedules and their impact on the communication between different threads. As a result, real-world systems often suffer from concurrency bugs that are hard to detect and fix.
At Purdue, we are working on a project to build effective tools that will help developers test and diagnose systems for concurrency bugs. This project includes the development of mechanisms and algorithms to dynamically explore the different behaviors of the target system, bug oracles that infer the correct system specification, methods to analyze test results, and optimizations that will improve the infrastructure performance. For this project we are seeking students with good C programming and Linux systems programming skills.
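To make the idea of exploring thread schedules concrete, the sketch below enumerates every interleaving of two simulated threads that each increment a shared counter with a non-atomic load followed by a store. It is only an illustration under stated assumptions: the threads are modeled as two-step sequences rather than real OS threads, and the names `schedules` and `run` are hypothetical.

```python
from collections import Counter
from itertools import combinations

OPS_PER_THREAD = 2  # each simulated thread performs: load counter, then store counter + 1


def schedules(ops_a: int, ops_b: int):
    """Yield every interleaving of thread 0 and thread 1 that preserves program order."""
    total = ops_a + ops_b
    for slots in combinations(range(total), ops_a):
        chosen = set(slots)
        yield [0 if i in chosen else 1 for i in range(total)]


def run(schedule):
    """Execute one schedule and return the final value of the shared counter."""
    counter = 0
    local = [0, 0]  # per-thread register holding the loaded value
    pc = [0, 0]     # per-thread program counter
    for tid in schedule:
        if pc[tid] == 0:
            local[tid] = counter       # step 1: load
        else:
            counter = local[tid] + 1   # step 2: store the incremented value
        pc[tid] += 1
    return counter


outcomes = Counter(run(s) for s in schedules(OPS_PER_THREAD, OPS_PER_THREAD))
print(dict(outcomes))
```

Of the six schedules, only the two fully serial ones produce the intended value 2; the other four lose an update. Even in this tiny example, an oracle needs to know that 2 is the correct result, and the number of schedules grows combinatorially with the number of steps and threads.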
Today machine learning (ML) is touching almost all aspects of our lives. From health care to finance, from computer networks to traffic signals, ML-based solutions are offering improved automated solutions. However, the current centralized approaches introduce a huge privacy risk when the training data comes from different, mutually distrusting sources. Because decentralized training data and inference environments are a reality in almost all of the above application domains, it has become important to develop privacy-preserving techniques for ML training, inference, and disclosure. In particular, we plan to work on two key privacy-preserving ML challenges. Is it possible to train models on confidential data without ever exposing the data? Can a model classify a sample without ever seeing it? In this project, we will design and evaluate a novel, specialized secure multi-party computation (MPC) design to answer the above two questions. Although some theoretical solutions are already available in the literature, our focus will be on developing MPC solutions that significantly improve efficiency relative to current approaches to privacy-preserving ML.
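One basic MPC building block is additive secret sharing, sketched below for computing a sum of private values. This is not the specialized design the project will develop; it is a minimal illustration assuming honest-but-curious parties, and the modulus and the names `share`, `reconstruct`, and `secure_sum` are chosen only for this example.

```python
import secrets

PRIME = 2 ** 61 - 1  # modulus for the shares (a Mersenne prime; assumption for this sketch)


def share(value: int, n_parties: int):
    """Split a value into n additive shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares):
    """Recombine additive shares."""
    return sum(shares) % PRIME


def secure_sum(private_inputs):
    """Each party shares its input; parties add the shares they hold, then recombine."""
    n = len(private_inputs)
    # all_shares[i][j] is the share of party i's input that is sent to party j.
    all_shares = [share(value, n) for value in private_inputs]
    # Party j only ever sees one share of each input and publishes their sum.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)]
    return reconstruct(partial_sums)


total = secure_sum([12, 30, 7])
print(total)
```

Because addition commutes with sharing, linear steps such as aggregating per-party statistics fit this pattern directly. Non-linear operations require additional protocol machinery, which is where much of the efficiency cost of MPC for ML comes from.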
Although the Human Genome Project created a catalog of ~20,000 genes, it did not tell us what goes wrong in these genes or how they work together to drive cancer. To decipher this complex logic, there has been great interest recently in applying deep learning to study cancer. In parallel, progress in experimental mapping technologies is rapidly accumulating the quantities of input/output data needed to construct deep neural networks. However, deep learning models are still "black boxes": they are difficult to interpret and provide no meaningful insight into how their decisions are made. Such models, while undoubtedly useful, are insufficient in cancer studies for which clinicians need to understand the mechanisms underlying the predictions.
We are interested in constructing a new generation of cancer deep learning frameworks that are both 'transparent' and 'transferable'. The project involves developing a hierarchical graphical neural network that can jointly infer the structure and function of tumor cells. The intuition is that the underlying structure of a tumor cell guides the neural network more efficiently than random structures of equal complexity. Such a model should be able to learn how to cluster genetic features to build a multi-layer functional hierarchy on top of molecular interaction networks, such that genetic variants can be translated more accurately. New biological subsystems will be identified by comparing this hierarchy to literature-curated references such as the Gene Ontology. By integrating cancer data from multiple cohorts, we can further identify subsystems that are enriched for genomic alterations, from which we may find novel cancer genes and drug targets.
Representation learning has recently made great strides in solving increasingly complex tasks by simply mapping vector inputs to desired outputs. Still, the fundamental problem of building neural networks that can account for pre-defined input invariances of these vectors remains largely open. For set invariances, the learned model should give the same output regardless of any permutation of the input vector; for graph invariances, the set of all isomorphic graphs is an input invariance. The goal of this project is to make progress on a mathematical framework that will result in novel representation learning methods that can encode invariances based on known relationships in the input data. The project will contemplate a variety of applications, including symbolic reasoning, reasoning with knowledge graphs, and predicting the chemical properties of molecules.
- Balasubramaniam Srinivasan, Bruno Ribeiro, On the Equivalence between Node Embeddings and Structural Graph Representations, ICLR 2020
- Changping Meng, Jiasen Yang, Bruno Ribeiro, Jennifer Neville, HATS: A Hierarchical Sequence-Attention Framework for Inductive Set-of-Sets Embeddings, KDD 2019
- Ryan L. Murphy, Balasubramaniam Srinivasan, Vinayak Rao, Bruno Ribeiro, Relational Pooling for Graph Representations, ICML 2019
- Ryan L. Murphy, Balasubramaniam Srinivasan, Vinayak Rao, Bruno Ribeiro, Janossy Pooling: Learning Deep Permutation-Invariant Functions for Variable-Size Inputs, ICLR 2019
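The set-invariance requirement described above can be illustrated with a small sketch in the spirit of Janossy pooling: averaging an order-sensitive function over all permutations of its input yields a permutation-invariant function. The order-sensitive function here is a hypothetical position-weighted sum rather than a learned network, and enumerating all permutations is only feasible for very small inputs.

```python
from itertools import permutations


def order_sensitive(seq):
    """Hypothetical permutation-sensitive function: a position-weighted sum."""
    return sum((i + 1) * x for i, x in enumerate(seq))


def janossy_pool(items, f=order_sensitive):
    """Average f over every ordering of the input, making the result order-independent."""
    orderings = list(permutations(items))
    return sum(f(p) for p in orderings) / len(orderings)


# The raw function depends on input order...
print(order_sensitive([1, 2, 3]), order_sensitive([3, 2, 1]))
# ...while the pooled version does not.
print(janossy_pool([1, 2, 3]), janossy_pool([3, 1, 2]))
```

The cost of exact averaging grows factorially with the input size, so practical methods need tractable approximations or restricted forms of this idea.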
Flare is a big data and machine learning platform developed here at Purdue. Flare can transparently accelerate pipelines implemented in Apache Spark and TensorFlow, and provides speedups of 10x-100x, thanks to cutting edge compiler technology. For this summer research project we are looking for students with a strong systems background (databases, distributed systems, compilers). The goal will be to extend Flare in one of several possible dimensions: implement code generation for distributed execution (using MPI or similar), implement streaming abstractions for incremental data processing, implement code generation for GPUs, implement a cost-based query optimizer, implement new internal compiler optimizations, implement case studies based on various workloads.
Lantern is a machine learning framework developed in Scala, based on two important and well-studied programming language concepts: delimited continuations and multi-stage programming (staging for short). Delimited continuations provide a very concise view of reverse-mode automatic differentiation (AD), which permits implementing reverse-mode AD purely via operator overloading and without any auxiliary data structures. Multi-stage programming leads to a highly efficient implementation that combines the performance benefits of deep learning frameworks based on explicit reified computation graphs (e.g., TensorFlow) with the expressiveness of pure library approaches (e.g., PyTorch). This project will extend Lantern in various ways, adding new compiler optimizations or implementing case studies on state-of-the-art deep learning models.
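The continuation-based view of reverse-mode AD can be approximated in a few lines. This sketch is not Lantern's API: Python has no delimited control operators, so the continuations are passed explicitly as callbacks, and the names `Num`, `add`, `mul`, and `grad` are hypothetical. Each operator runs the rest of the computation (the continuation `k`) and accumulates gradients once it returns, so the call stack plays the role of the tape.

```python
class Num:
    """A value together with its accumulated gradient."""

    def __init__(self, x: float):
        self.x = x
        self.d = 0.0


def add(a: Num, b: Num, k) -> None:
    y = Num(a.x + b.x)
    k(y)          # forward pass continues inside the continuation
    a.d += y.d    # backward pass runs after the continuation returns
    b.d += y.d


def mul(a: Num, b: Num, k) -> None:
    y = Num(a.x * b.x)
    k(y)
    a.d += b.x * y.d
    b.d += a.x * y.d


def grad(f, x0: float) -> float:
    """Differentiate a function written in continuation-passing style at x0."""
    x = Num(x0)

    def seed(y: Num) -> None:
        y.d = 1.0  # d(output)/d(output) = 1 starts the backward pass

    f(x, seed)
    return x.d


def f(x: Num, k) -> None:
    """f(x) = x * x + x, written with explicit continuations."""
    mul(x, x, lambda y1: add(y1, x, k))


print(grad(f, 3.0))  # derivative 2x + 1 at x = 3
```

With delimited continuations, as described above, the same structure can be hidden behind ordinary overloaded operators so that user code does not have to thread `k` by hand.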
We have several ongoing strands of work that explore the foundations of programming languages, in particular compiler transformations, effect systems, approaches for verification and static analysis. To increase the trust in our theoretical models and results, we mechanize most of these artifacts in the Coq proof assistant. For this project, we are seeking students who already have experience with Coq but who want to deepen their expertise and apply their skills to concrete research developments.
A fundamental challenge of detecting or preventing software bugs is to know programmers’ intentions, formally called specifications. If we know the specification of a program (e.g., where a lock is needed, what input a deep learning model expects, etc.), a bug detection tool can check if the code matches the specification.
Building upon our expertise as the first to extract specifications from code comments to automatically detect software bugs and bad comments, in this project we will analyze various new sources of software textual information (such as StackOverflow posts, API documentation, etc.) to extract specifications for bug detection. For example, StackOverflow.com contains a huge amount of Q&A text about how to use software libraries such as STL and Java Commons Collections. Good programming skills and strong motivation in research are required. Background in natural language processing is a plus.
Machine learning systems including deep learning (DL) systems demand reliability and security. DL systems consist of two key components:
- models and algorithms that perform complex mathematical calculations, and
- software that implements the algorithms and models.
Here software includes DL infrastructure code (e.g., code that performs core neural network computations) and the application code (e.g., code that loads model weights). Thus, for the entire DL system to be reliable and secure, both the software implementation and models/algorithms must be reliable and secure. If software fails to faithfully implement a model (e.g., due to a bug in the software), the output from the software can be wrong even if the model is correct, and vice versa.
This project aims to use novel approaches including differential testing to detect and localize bugs in DL software (including code and data) to address the testing oracle challenge. In addition, this project works on identifying and defending against adversarial input. Good programming skills and strong motivation in research are required. Background in deep learning and testing is a plus.
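Differential testing sidesteps the oracle problem by comparing implementations against each other instead of against known correct outputs. The sketch below is a toy under stated assumptions: two hypothetical softmax implementations stand in for DL infrastructure code, one of which omits the usual max-subtraction and therefore fails on large inputs. It is not the project's tooling.

```python
import math


def softmax_reference(xs):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_under_test(xs):
    """Hypothetical buggy implementation: exponentiates raw inputs and can overflow."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def run_safely(impl, xs):
    """Run an implementation, returning the exception instead of raising it."""
    try:
        return impl(xs)
    except (OverflowError, ZeroDivisionError) as exc:
        return exc


def differential_test(impl_a, impl_b, inputs, tol=1e-9):
    """Return the inputs on which the two implementations disagree."""
    disagreements = []
    for xs in inputs:
        a, b = run_safely(impl_a, xs), run_safely(impl_b, xs)
        if isinstance(a, Exception) or isinstance(b, Exception):
            if type(a) is not type(b):
                disagreements.append(xs)
        elif any(abs(p - q) > tol for p, q in zip(a, b)):
            disagreements.append(xs)
    return disagreements


test_inputs = [[1.0, 2.0, 3.0], [1000.0, 1000.0]]
failing = differential_test(softmax_reference, softmax_under_test, test_inputs)
print(failing)
```

A disagreement only shows that at least one implementation is wrong; localizing which one, and where in the code or data, is the harder part that the project targets.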
In this project, we will develop machine learning approaches to automatically learn vulnerability patterns and fix patterns from historical data to detect and fix software security vulnerabilities. Good programming skills and strong motivation in research are required. Background in security or machine learning is a plus.
Earlier work can be found here: https://www.cs.purdue.edu/homes/lintan/publications/deeplearn-tse18.pdf and https://www.cs.purdue.edu/homes/lintan/publications/priv-icse19.pdf