CS 59000: Bioinformatics Algorithms (Fall 2020)
Course Information
When: Monday/Wednesday 4:30 pm – 5:45 pm.
Where: LWSN B134.
Instructor: Jianzhu Ma, majianzhu@purdue.edu.
Office Hour: Thursday. 4:30 pm – 5:45 pm.or by appointment (send email) Where: TBD
Online discussion is available at Blackboard (mycourses.purdue.edu).
Course description:
This course covers almost all the major research directions in bioinformatics and computational biology. It is a type of “allyouneed” class if you want to develop algorithms to process and understand biological data. By completing this course, students will encounter a handful of fundamental algorithmic approaches deriving straight from very widely cited primary literature, much of which has been published in recent years. The course also introduces basic concepts in statistics, mathematics, and machine learning including deep learning, which are all necessary to understand these approaches. Classes consist of both instructor and student presentations of their final projects (see Syllabus). It goes beyond just invoking existing popular programs and packages. Instead, the aim is for you to understand the underlying principles of the most successful techniques that are currently in use, and provide you with the capacity to design and implement the next generation of tools. Another goal of the course is to tackle the research frontiers of computational biology, and give you a glimpse of how research works, expose you to current bioinformatics research directions and guide you to find the problems most interesting to you.
Prerequisites:
This course is intended to be a first course in Bioinformatics algorithms for graduate students in computer science, computer engineering, mathematics, statistics, computational life sciences, and related fields. Students from computer science, the engineering disciplines, and mathematical sciences may be concerned about how much previous exposure to biological sciences they need to study bioinformatics. The answer is, not much—students do not need any previous knowledge of biology beyond high school. But you will need a willingness to learn biological concepts needed to understand the computational problems we study. We will begin with a quick review of some of the biological concepts needed for bioinformatics, and will learn more as needed.
Students are expected to have the following background:
Basic programming skills to write a reasonably nontrivial computer program in Python or C (e.g., CS 38003 or equivalent are recommended).
At least an undergraduate course in algorithms (a graduate course in algorithms is preferred).
Undergraduate or graduate level machine learning courses (e.g., CS 37300 and CS 578 are sufficient).
Students without this background should discuss their preparation with the instructor. The introduction session in the first week of the class will give an overview of the expected background.
Target Students
This course is intended to be a first course in Bioinformatics algorithms for graduate students in computer science, computer engineering, mathematics, statistics, computational life sciences, and related fields. This is suitable as a foundational course for students planning to do research in bioinformatics with a focus on algorithms; it is also suitable for those who wish to acquire some breadth in algorithms during graduate study and learn about important recent developments in bioinformatics.
Optional textbooks
“Biological sequence analysis” by Durbin, Eddy, Krogh, Mitchison
“An introduction to bioinformatics algorithms” by Jones, Pevzner
“Network Biology:” section in Topology of molecular interaction networks
“Deep Learning” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
Online book “Bioinformatics Algorithms” [link], Phillip Compeau and Pavel Pevzner
Grading
Homework (50%)
Final projects (50%)
Assignments and exams
There will be FIVE homework assignments including both theory and programming problems. Assignments should be submitted online in Brightspace, piazza or Blackboard systems. Programming projects can be written in both R and Python. Deep learning final projects should be written in either Tensorflow and PyTorch.
Late policy
Assignments are to be submitted by the due date listed. Assignments will NOT BE accepted if they are submitted late. Additional extensions will be granted only due to serious and documented medical or family emergencies.
Final projects
Students are encouraged to work as teams of two or three. Each team can either choose the provided projects on the course website or a separate project related to Biology. If you choose to do your own project, please contact me beforehand to discuss it. Here are the bioinformatics projects you might consider:
Easy projects:
Using deep neural networks to predict protein secondary structures.
Using deep neural networks to predict protein contact/distance map.
Implement a RNA secondary structure prediction algorithm.
Implement a single cell analysis pipeline.
Predicting protein functions prediction using multiple protein interaction networks.
Implement an algorithm to correct the batch effect in single cell data.
Predicting enhancers using genomics features.
Advanced projects:
Implement a (hierarchical) network community detection algorithms to recover Gene Ontology.
Using Contralign algorithm to improve the consistency of multiple network alignment.
Using network and deep learning to predict disease related genes.
Using deep learning to predict drug response for cancer.
Protein function prediction from interaction network using graph.
Comparison and analysis of methods in 3D genome reconstruction from HiC data.
Translation of genotype to phenotype by using the molecular networks.
Syllabus (tentative)
Sequence analysis in Bioinformatics (9 classes)
Time  Topic  Contents 
08/24  Introduction  (1) Motivation (2) Basic biology knowledge (3) Syllabus and grading policy (4) Final projects and presentations Optional Reading: What Is the Role for Algorithmics and Computational Biology in Responding to the COVID19 Pandemic? [paper link] 
08/26  Dynamic programming and sequence alignment  (1) Dynamic programming (2) Local alignment (3) Global alignment 
09/02  Homology search I  (1) PAM score (2) BLOSUM score (3) Extreme Value Theory (4) Blast 
09/07  Multiple sequence alignment  (1) Star algorithms (2) CLUSTALW (3) Tcoffee (4) Mcoffee (5) MUSCLE (6) Probcons 
09/09  Advanced sequence alignment algorithms  (1) Protein sequence alignment (2) CNFpred (3) coevolutation alignment (4) ILPbased alignment 
09/14  Comparative Genomics  (1) Whole genome alignment (2) Suffix tries (3) MUMmer Optional Reading: MUMmer 1.0: Alignment of Whole Genomes. MUMmer 2.0: Fast Algorithms for Largescale Genome Alignment and Comparision. MUMmer 3.0: Versatile and open software for comparing large genomes. MUMmer 4.0: "MUMmer4: A fast and versatile genome alignment system 
09/16  Hidden Markov Models  (1) Basic concepts (2) Probability estimation (3) Finding maximum assignment (4) Parameter estimation of HMM 
09/21  Homology search II  (1) HMMbased sequence alignment (2) HMMER (3) HHpred/HHsearch (4) HHblits (5) ILPbased methods 
09/23  Molecular Evolution and Phylogenetics  (1) Basics and definitions (2) Jukes Cantor distance (3) UPGMA and Neighbor joining algorithms (4) Sankoff algorithm (5) Fitch algorithm Optional Reading: Phylogenetic tree building in the genomic age [review article]

Structure analysis in Bioinformatics (4 classes)
Time  Topic  Contents 
09/28  Protein structure alignment  (1) Basic knowledge of protein structures (2) RMSD (3) SSAP algorithm, (4) DALI algorithm (4) LOCK algorithm (5) Structure databases Optional reading: TMalign: a protein structure alignment algorithm based on the TMscore [pdf] Protein structure alignment beyond spatial proximity [pdf] Matt: Local Flexibility Aids Protein Multiple Structure Alignment [pdf] 
09/30  Protein structure prediction  (1) Templatebased modelling (2) Raptor (3) Templatefree folding (4) Rosetta Optional Reading: (1) Fold proteins by playing games [link] (2) Fold proteins by deep learning [link] 
10/05  RNA structure prediction  (1) RNA structures and functions (2) Nussinov Algorithm (3) Zucker algorithm (4) Stochastic context free grammars (SCFG) Optional Reading: RNA Secondary Structure Prediction By Learning Unrolled Algorithms [link] 
10/07  3D genome analysis  (1) Basic knowledge about HiC techniques (2) Algorithms to detect Topologically Associating Domains (TAD) (3) Algorithms to Predict EnhancerPromoter interactions (PEP) Optional reading: 1. Comprehensive Mapping of LongRange Interactions Reveals Folding Principles of the Human Genome [pdf] 2. Topological Domains in Mammalian Genomes Identified by Analysis of Chromatin Interactions [pdf] 3. A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping [pdf] 4. Singlecell HiC reveals celltocell variability in chromosome structure [pdf] 5. 3D structures of individual mammalian genomes studied by singlecell HiC [pdf]

Machine Learning in Bioinformatics (5 classes)
Time  Topic  Contents 
10/12  Research talks  Given by Prof. Majid Kazemian, Prof. Daisuke Kihara and Prof. Jianzhu Ma 
10/14  Introduction to Deep Learning  (1) Basic concepts, MLP (2) Convolutional Neural Network (3) Recurrent Neural Network (4) Long Short Term Memory (5) Resnet Optional reading: (1) Deep Learning in Bioinformatics [pdf] (2) Deep learning for computational biology [pdf] (3) Machine Learning in Genomic Medicine [pdf] (4) Awesome DeepBio [github] (5) PyTorch Performance Tuning Guide [youtube] 
10/19  Deep Learning in genomics & protein structures  (1) 1D CNN and DeepSea (2) Point cloud modelling and 3D CNN (3) Model interpretations Optional reading: (1) Adaptive Deconvolutional Networks for Mid and High Level Feature Learning [pdf] (2) Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps [pdf] (3) The Mythos of Model Interpretability [pdf] 
10/21  Deep Learning on biological networks  (1) GraphSage (2) Graph Convolutional Neural Network (3) Graph Attention Models (4) Popular datasets Optional reading: (1) The graph neural network model [pdf] (2) Gated Graph Sequence Neural Networks [pdf] (3) Graph Representation Learning Book [website] 
10/26  Deep Learning on small moleculars  (1) (2) (3)

Matrix analysis in Bioinformatics (4 classes)
Time  Topic  Contents 
10/28  Gene expression analysis  (1) Differential expression analysis (2) Dimension reduction: PCA, TSNE 
11/02  Single cell analysis I  (1) Single cell techniques (2) Missing data imputation: MAGIC, DCA algorithms (3) Batch effect correction: MNN algorithm Optional reading: (1) Singlecell RNA sequencing technologies and bioinformatics pipelines [pdf] (2) Comprehensive integration of single cell data [pdf] (3) Singlecell RNA sequencing for the study of development, physiology and disease [pdf] (4) A Systematic Evaluation of Singlecell RNAsequencing Imputation Methods [pdf] (5) A benchmark of batcheffect correction methods for singlecell RNA sequencing data [pdf] 
11/04  Single cell analysis II  (1) Trajectory embedding (2) RNA velocity (3) Bulk cell deconvolution Optional reading: (1) Lineage tracing meets singlecell omics: opportunities and challenges [pdf] 
11/09  Single cell analysis III  (1) Single cell multiomics: single cell epigenetics, single cell HiC (2) Doublets and multiplets Optional reading: (1) Sparse Inverse Covariance Estimation with the Graphical Lasso [pdf] (5) Foldit computer game [link]

Network analysis in Bioinformatics (4 classes)
Time  Topic  Contents 
11/11  Network motif detection  (1) Introduction of Network Biology (2) Basic concepts about graphs (3) Random graphs (4) Network motifs in biological networks (25) Gtries algorithm 
11/16  Network Alignment  (1) Network visualization (2) PathBLAST (3) IsoRank (4) Representationbased network alignments Optional Reading: (1) REGAL: Representation Learningbased Graph Alignment [pdf] (2) Deep Adversarial Network Alignment [pdf] 
11/18  Graphical Models I  (1) General concepts (2) Exact inference (3) Sumproduct algorithm (4) Maxproduct algorithm (5) Conditional Random Fields Optional textbooks: (1) “Probabilistic Graphical Models” by By Daphne Koller and Nir Friedman (2) “Graphical Models, Exponential Families, and Variational Inference” by Martin J. Wainwright and Michael I. Jordan (3) Chapter 8 of “Pattern Recognition and Machine Learning” by Christopher M. Bishop 
11/23  Graphical Models II  (1) Structure learning (2) Gaussian Graphical Model (3) Pseudolikelihood approximation (4) Protein contact prediction (5) DeepLearningbased structure learning Optional reading: (1) Sparse Inverse Covariance Estimation with the Graphical Lasso [pdf] (2) HighDimensional Graphs and Variable Selection with the Lasso [pdf] (3) Protein 3D Structure Computed from Evolutionary Sequence Variation [pdf] (4) AlphaFold: Using AI for scientific discovery [BLOG POST] (5) Foldit computer game [link]

Project presentations (2 classes)
Time  Topic  Presentors 
11/30  Project presentations  TBD 
12/02  Project presentations  TBD

