Programming Assignment 3: Debugging a Solution to the Byzantine Generals Problem
Note: Questions are fully expected on this programming assignment. Do not hesitate to post and ask both the professor and TA. Some instruction is below, but it will doubtlessly be insufficient as a comprehensive guide. (complete, but may be updated)Background/Protocol
The purpose of PA3 is to gain a deeper understanding of the solution to the Byzantine Generals Problem, as described in the paper by Lamport which you read for March 8th. A secondary goal is to improve your skills in debugging someone else's code. In this assignment, you are being given a nearly-correct implementation of the protocol. It works when there are no malicious nodes. But it fails even when there is only a single malicious node. Your task is to understand the protocol and implementation, and to determine how to fix the implementation.
To gain a better understanding of the protocol and implementation, it would be VERY WISE to begin by reading the Blog Post by Mark Nelson, who took the time to write a more in-depth discussion of the Byzantine Generals solution than we covered in class. The implementation provided is based on this description.
For this project, you may use cloud11.cs.purdue.edu through cloud15.cs.purdue.edu. The necessary tools for building and running Mace and the model checker should all be available on those machines.
You should be able to build on your earlier Mace checkout. However, in general: you should use cmake to build Mace out-of-source, and within your scratch directory on these machines. (Mace, when compiled, is quite large, and so keeping the build on /scratch will keep your directory from causing you grief from the sysadmins)
You are welcome to do this project on any machine on which you can build Mace, which should be generally true of Linux and OSX systems. We will use a departmental gentoo build for testing them.
For this project, you only need to submit your final HW3ByzantineAgreement.mac. Any discussion questions should be answered in comments in the file. (Standard C++ commenting syntax applies)
To support this assignment, a new synchronized transport was developed, in which the newtork delivers messages in phases, separating each phase by a tick. Since this code is not fully vetted yet, it is not included in the Mace repository. You will be given a link to a gzipped-tarball of a modified model checker directory, you should replace your own model checker directory with the provided one. Also, simple interfaces for consensus are provided to keep the assignment simple. The simple interface worries only about a single proposal--a practical implementation would require supporting multiple instances of the consensus protocol (not to mention removing the need for a syncrhonized transport).
I have implemented the protocol to be generic to the number of errors a protocol is willing to tolerate. To enable this, a number of recursive routines are implemented which allow this protocol to work [once completely fixed] for any value of "M" (or "F" depending on what text you read). To enable you to work with a variety of parameters, each of the configuration parameters (number of failures, what value to propose, what nodes are malicious) has been defined as a command-line parameter which you can set in the parameter file. Some parameter files have been provided, but you should feel free to modify them at this point to test the configurations you feel are needed. NOTE: it is always assumed that node "0" is the leader.
The basic requirement of this assignment is to get the protocol working for M=1 and values of N greather than or equal to 4. I had to change only 2 lines of code in my implementation from this point to make that work. Additional credit will be awarded for making it support M=2 and M=3. Note that my present working implementation appears to work correctly for M=2, but not M=3.
The steps you should take should include:
- "Proving" the protocol works for M=0, and values of N less than or equal to 10. (Include the output of mc.log from N=10.) Note that in this case, the model checker will only explore a single path, since for M=0, the leader need only send a message to all nodes, and since messages at different nodes are independent in any round, the syncronized network only considers one order. Comment after the output about why the model checker can prove this case.
- Getting the protocol to work for M=1 and N=4. You should make sure to test cases including when the leader is malicious, and when another general is malicious. Discuss whether the model implemented for the malicious node is sufficient to guarantee correctness when the model checker terminates.
- Describing a sequence of steps which causes the protocol to fail if when M=1, there are actually 2 malicious nodes. Note: you should be able to generate this sequence of steps using the model checker and looking at the resulting error.graph and error.log. Show also the final state of the system on which the majority function returns different values for different nodes, and describe why this happened.
- You will receive 5 bonus points each for getting the system working for M=2 and M=3. For M=2, working means comprehensive search for 2-failing nodes in both cases where neither is leader and one is leader. In your code, state whether you think your code has this property. For M=3, working means getting through a "substantial" number of paths without violating any properties. Note that my current code fails on the first path for M=3.
Environment/Development
To get started on the homework, you should setup your environment as follows:
- In your existing directory, "CS505", under mace/services/.
- (Already completed for HW1): Edit mace/services/CMakeLists.txt, and list CS505 (or your directory name)
in the SERVICE_DIRS variable. To avoid problems, place it last. You can also
remove all other services EXCEPT interfaces, Transport, and SimApplication, as
these will not be needed for this assignment. This will make the build faster
and smaller. The diff from this modification should look like:
Index: CMakeLists.txt =================================================================== --- CMakeLists.txt (revision 1037) +++ CMakeLists.txt (working copy) @@ -1,4 +1,4 @@ -SET(SERVICE_DIRS interfaces Transport RequestTransport Http File RandTree ReplayTree RanSub Pastry Bamboo Chord GenericOverlayRoute GenericTreeMulticast DHT ScribeMS SplitStreamMS SimApplication) +SET(SERVICE_DIRS interfaces Transport SimApplication CS505) WRITE_FILE(${EXTERNAL_VARS_FILE} " # Service Includes and Links" APPEND) - Inside CS505, place all .mac and .mh files from ./hw3/. HW3ByzantineAgreement.mac is the file you will be editing.
- Move aside your existing model checker directory. For example:
$ cd mace/application $ mv modelchecker modelchecker.aside
- Download modelchecker.tgz, and expand it inside your mace/application directory. For example:
$ cd mace/application $ tar xzvf modelchecker.tgz
This directory includes the simulator for the syncronized transport, and also the test for HW3. - Edit mace/application/CMakeLists.txt to remove all applications other than the model checker. To do so, comment out all lines except for ADD_SUBDIRECTORY(modelchecker), by prefixing each line with a sharp (#).
- Run "make rebuild_cache" from the build directory. Since you've added a new .mac file, you need to tell the cmake build system to re-detect the files to build.
When editing Mace files (.mac), it is helpful in most editors to use the syntax highlighting for C++. If you are a Vim user, there is a vim syntax file you can use. If you wish to contribute syntax files for other editors, I am happy to share them with others.
Model checking / Debugging
To run and debug this project, we will be using the Mace modelchecker. To setup your environment for model checking:- First, compile Mace
- Then, create a symbolic link from the compiled modelchecker into mace/application/modelchecker directory in the source code. (The Mace scripts all assume the model checker resides there based on prior setup). You should find the modelchecker binary in the build/mace/application/modelchecker/ directory.
- Create a subdirectory under the model checker, for example "HW2". Place params.default into that directory. For convenience, we have also made available params.default.1error for testing with network errors, and params.default.2errors". Note that you should name them params.default when you are using them.
- In params default, you can change the number of nodes to expermient with, as well as what value of M to use, and which nodes are actually malicious. Other parameters should probably remain the same.
- To run the model checker, change to the HW3 directory, then execute "../run-modelchecker.pl". It will start searching. If it finds an error, it will output a number of files. The run-modelchecker.pl script will automatically generate a complete logfile of the erroneous run, as well as a graph of the execution. It will name these error files error.log and error.graph. Error.graph contains control characters which color the output on most terminals. Rather than looking at error.log directly, we recommend using mdb (../mdb error.log), which will open an interactive logfile debugger, allowing you to look at the execution, one step at a time. Each step will end with an output of the system state (state of all nodes in the system).
- If the model checker is not finding bugs, it will output lines of code such as:
3.0.0.0:10201 [Sim::pathComplete] Path 94208 ( 88306 in phase) ended at simulator step: 107 random step: 144 cause: STOPPING_CONDITION (OK) Path<sdepth=10>SimRandUtil<prefixlen=0,position=144,searchlen=10,walklen=134> search sequence [ 0 0 0 2 0 3 3 2 0 3 ] 10.0.0.0:10201 [Sim::pathComplete] Path 94464 ( 88562 in phase) ended at simulator step: 189 random step: 254 cause: STOPPING_CONDITION (OK) Path<sdepth=10>SimRandUtil<prefixlen=0,position=254,searchlen=10,walklen=244> search sequence [ 0 0 0 2 0 3 3 11 1 3 ]
Each line represents an execution which completed. It contains the path number, which indicates how many paths have been completed so far. It also tells you a number of other statistics such as how long the path was in simulator steps and random number requests, the length of the search sequence, and the search sequence itself. These give you an understanding of the progress in the search.
Useful hints:
- Debug first with as few nodes as possible. Problems will be easier to understand with fewer nodes.
- Use error.graph to see what's going on.
- You might consider adding other properties or assertions for things you want to verify or understand better.
- mdb assumes that the mace source directory is present in your home folder. It is recommended to create a symbolic to mace/ from the home folder.
Submission and Grading:
Submit only the file HW3ByzantineAgreement.mac, using BlackBoard. You will find the submission section in the "Assignments" section of BlackBoard for the class. Submissions will be graded according to this grading rubric (TXT).