CS 603: Advanced Topics in Distributed Systems

Class description

Writing programs which will run over a set of physically distributed processors is inherently hard, as one has to deal with latency and failures of both processors and communication -- basically the fact that it takes two to tango.

Most industrial-scale applications developed nowadays involve some form of remote interaction. However, these applications are still mostly built in ad-hoc style, i.e., as sets of logical components, which are then somehow glued together. This provides little reliability in the presence of latency and failures.

This class will focus on models, algorithms, languages, and systems for programming distributed applications.

Instructor

Teaching assistant

Schedule

Grading

Academic Integrity

Academic honesty and ethical behavior are required in this course, as it is in all courses at Purdue University (here is the guide for academic integrity). The class will be conducted according to the policy written by Professor Eugene Spafford.

Schedule

Subject to changes

Week   Module Monday Wednesday Programming
Aug 21 - Aug 25 1 Basic

Intro
- syllabus
- distributed systems?

- background on networks and services

Reading:

?Chapters 1 and 2 from Reliable Distributed Systems

?Why do Internet Services Fail, and What Can be Done about It? D. Oppenheimer, A.Ganapathi and D. A. Patterson, 2003.

Why Do Computers Stop and What Can be Done about It?Jim Gray, 1985.

?

Definitions

- distributed systems models
- failures: omission, crash, recovery, partition, byzantine
- channels: unreliable, reliable, fair lossy
Reading:

Fundamentals of Fault-tolerant Distributed Computing in Asynchronous Environments. Felix C. Gartner, ACM Surveys 31(1):1-26, 1999.

 
Aug 28 - Sep 1 2 Events and clocks
- causality
- snapshots
- lamport clocks
- vector clocks
Reading:
?Time, Clocks, and The Ordering of Events in Distributed Systems. L. Lamport, CACM 27(7): 558-565, 1978.
?Virtual Time and Global States of Distributed Systems, F. Mattern, WDPA’89.

Reliable broadcast
- best effort
- reliable
- uniform

- terminating
- FIFO, causal, total orders
Reading:

Mullender Chap.5 /

(A Modular Approach to) Fault-tolerant Broadcasts and Related Problems. V. Hadzilacos and S. Toueg, Distributed Systems, 97-145, 1993.

 
Sep 4 - Sep 8 3 (Labor day) Fault-tolerant Broadcasts in the Crash Recovery Model (talk by Dr. R. Boichat)
Reading:
Deconstructing Paxos
. R. Boichat, P. Dutta, S. Frolund, R. Guerraoui, SIGACT News 41(1): 47-47, 2003
Chat due Friday midnight (with sockets)
Sep 11 - Sep 15 4 (Reliable broadcast cont'd) Programming support intro
- libraries, languages, reflection
- "middleware"
- limitations
- examples
Reading:
?Concurrency, Distribution, and Parallelism in Object-Oriented Programming.  J-P. Briot, R. Guerraoui, K.P. Löhr, ACM Computing Surveys 30(3): 291-329, September 1999
 
 
Sep 18 - Sep 22 5 Java RMI
- remote interfaces
- proxies
- registries
Reading: RMI spec http://java.sun.com/j2se/1.5.0/docs/guide/rmi/spec/rmiTOC.html
AspectJ
- AOP general
- static, dynamic weaving
- safety issues
Reading:
An Overview of AspectJ. G. Kiczales, E. Hilsdale, J. Hugunin, M. Kersten, J. Palm, and W.G. Griswold. ECOOP 2001, 327-353

 

 
Sep 25 - Sep 29 6 Classic Failure detectors
- accuracy, completeness
- omega
- comparisons
Reading:
Unreliable Failure Detectors for Reliable Distributed Systems. T.D. Chandra and S. Toueg, JACM, 43(2): 225-267, 1996

 

Consensus
- impossibility
- chandra&toueg
- weakest failure detector
- leader-based
Reading:
?
Impossibility of Distributed Consensus with One Faulty Process. M. Fischer, N. Lynch, and M. Paterson. JACM 32(2): 217-246, 1985.

Solving Consensus using Chandra-Toueg’s Unreliable Failure Detectors: A General Quorum-based Approach, A. Mostfaoui and M. Raynal, DISC’99, 49-63, 1999.

?Leader-based Consensus. A. Mostefaoui and M. Raynal. IPL 11(1):95-107, 2001.
 
Assignment on channels and broadcast due Saturday midnight
Oct 2 - Oct 6 7 (Consensus cont'd) Group communication and state machine replication
- state machine replication
- total order broadcast
- virtual synchrony
Reading:
Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. F.B. Schneider. ACM Surveys 22(2): 299-319, 1990
?Group Communication Specifications: A Comprehensive Survey. G. Chockler, I. Keidar, R. Vitenberg. CSUR 23(4): 427-469, 2001.

 

 
Oct 9 - Oct 13 8 (October break) (Group comm. cont'd) Revised chat due Friday midnight
Oct 16 - Oct 20 9 Byzantine failures (guest lecture by Prof. C. Nita-Rotaru)
 
Transactions
- non-blocking atomic commit
- 2PC and 3PC
Reading:
?A Formal Model of Crash Recovery in a Distributed System. D. Skeen and M. Stonebraker, IEEE TSE  9(3): 219-228, 1983.
?Non-blocking Atomic Commit in Asynchronous Distributed Systems with Failure Detectors. R. Guerraoui, DC 15(1):17-25, 2002.
 
 
Oct 23 - Oct 27 10   (absence) (absence) Assignment on consensus, atomic broadcast, and virtual synchrony due Friday midnight
Oct 30 - Nov 3 11 Advanced

Probabilistic broadcast
- channels, failures
- broadcast
- multicast
Reading:

Bimodal Multicast. K. Birman, M. Hayden, O.Ozkasap, Z. Xiao, M. Budiu, and Y. Minsky, ACM TOCS 17(2): 41-88, 1999.
Lightweight Probabilistic Broadcast. P. Eugster, R. Guerraoui, S. Handurukande, A.-M. Kermarrec, and P. Kouznetsov, ACM TOCS 21(4): 341-374, 2003.

Probabilistic agreement
- consensus

- ordering
Reading:

Randomized Protocols for Asynchronous Consensus. J. Aspnes, DC 16(2-3): 165-175, 2003.

From Binary Consensus to Multivalued Consensus in Asynchronous Message-passing Systems. A. Mostéfaoui, M. Raynal, F. Tronel, IPL 73(5-6): 207-212, 2000.

 
Nov 6 - Nov 10 12

Presentations

Fault-tolerant Queries and Aggregation in Sensor Networks. Chris

Global Snapshot Algorithms for Fault-tolerance in Virtual Machines. Ardalan

Presentations

Bounds on Distributed Failure Detection. Jayaram

 
Nov 13 - Nov 17 13

Presentations

Event Correlation Languages for Distributed Programming. Nick

Aspect-oriented Distributed Programming. Armand  

Presentations

Token-based Probabilistic Non-blocking Atomic Commit. Dasarath

Stability and Persistence in Probabilistic Broadcast. Patrick

  

 
Nov 20 - Nov 24 14

Hard problems

- uniform reliable broadcast

- terminating reliable broadcast

- leader election

(Thanksgiving)  
Nov 27 - Dec 1 15
   
Dec 4 - Dec 8 16       Final reports for individual projects due