Computer Science 308-655
Parallel and Distributed Simulation

Instructor

Carl Tropper
Office: 112N McConnell
email: carl@cs.mcgill.ca
Phone: 398-3743
Office hours: TBA
Course meets on:

Teaching assistant
Sina Meraji
e-mail: smeraj@cs.mcgill.ca
Office hours: TBA


Course Description: Discrete event simulation has long played a central role in the modeling and design of large systems, both natural and man-made. Man-made systems include VLSI circuitry, computer systems and networks, and manufacturing simulations. Natural systems include modeling of the brain and cosmology simulations. Really, the list is endless. As our desire to understand complicated natural phenomena and to build larger systems has grown, so has the size of the simulations. With the advent of parallel computers (think multi-core as well as distributed memory machines), our ability to simulate these systems can keep pace. The efforts to make use of parallel computers for simulation are the subject of this course.

Like any distributed program, a distributed simulation is composed of processes which communicate with one another. This communication may occur via shared memory in a shared memory multiprocessor or via message passing in a distributed environment. The processes in distributed simulation are each intended to simulate a portion of the system being modeled and are referred to as Logical Processes (LPs). An LP creates events, sends events to other LPs and receives events from other LPs in the course of a simulation. Associated with each LP are input queues used to store messages from other LPs, state variables which reflect the state of the physical process which is being modeled, and a local simulation clock. The purpose of the simulation clock is to capture the advance of simulation time at each LP.
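To make this concrete, here is a minimal sketch in C++ of the data an LP might carry. It is an illustration only; the names (Event, LogicalProcess, and so on) are hypothetical and not taken from any particular simulator:

    #include <queue>
    #include <vector>

    // A hypothetical event exchanged between LPs. Smaller timestamps
    // must be simulated before larger ones to preserve causality.
    struct Event {
        double timestamp;   // simulation time at which the event occurs
        int    senderLP;    // id of the LP that generated the event
        int    payload;     // application-specific data
    };

    // A hypothetical logical process: input queues for messages from other
    // LPs, state variables for the modeled physical process, and a local
    // simulation clock.
    struct LogicalProcess {
        int id;
        std::vector<std::queue<Event>> inputQueues; // one queue per incoming link
        std::vector<int> state;                     // state of the physical process
        double localClock = 0.0;                    // local simulation time

        void simulate(const Event& e) {
            localClock = e.timestamp; // the clock captures the advance of simulation time
            // ... update state, create events, and send them to other LPs ...
        }
    };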

The treatment of time in a distributed system is of fundamental importance to understanding and building distributed systems. A distributed simulation poses a particularly vexing problem. In a uniprocessor, the events in a discrete event simulation are stored in a priority queue and simulated in order of smallest timestamp first. This is easily done in a uniprocessor because all of the events in the simulation are stored in memory. In a distributed simulation, however, the events are spread out among the processors involved in the simulation. A causality violation occurs if we simulate an event at an LP and an event generated by another LP with a smaller timestamp arrives afterwards in real time. This late arrival can easily cause incorrect results in the simulation. In a military simulation, it matters in which order the two events "aim the gun" and "fire the gun" are executed.
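For contrast with the distributed case, here is the uniprocessor event loop in miniature: one priority queue, smallest timestamp first. This is a self-contained C++ sketch, not code from the course:

    #include <cstdio>
    #include <queue>
    #include <vector>

    struct Event {
        double timestamp;
        const char* action;
    };

    // Comparator making the priority queue a min-heap on timestamps.
    struct Later {
        bool operator()(const Event& a, const Event& b) const {
            return a.timestamp > b.timestamp;
        }
    };

    int main() {
        // All events live in one queue in one memory, so smallest-timestamp-first
        // ordering is trivial to enforce.
        std::priority_queue<Event, std::vector<Event>, Later> pending;
        pending.push({2.0, "fire the gun"});
        pending.push({1.0, "aim the gun"}); // must be simulated first

        while (!pending.empty()) {
            Event e = pending.top();
            pending.pop();
            std::printf("t=%.1f: %s\n", e.timestamp, e.action);
        }
        // In a distributed simulation the two events may live on different LPs,
        // and "fire" could be simulated before a late-arriving "aim":
        // a causality violation.
        return 0;
    }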

The central problem of distributed simulation is the development of synchronization algorithms which are capable of maintaining causality and which do so with minimal overhead. This overhead can be in the form of increased memory demands and increased execution time. Two primary approaches to synchronization algorithms have been developed, the conservative and the optimistic classes of algorithms. A conservative algorithm is characterized by its blocking behavior. In a conservative simulation, if one of the input queues at an LP is empty, the LP blocks, awaiting a message from another LP. This behavior ensures causality. However, it does so at a price. To begin with, when an LP suspends processing, there may be an increase in the execution time of the simulation. In addition, it is entirely possible for deadlocks to form. This occurs if a group of LPs is arranged in the form of a cycle such that each LP in the cycle is awaiting a message from its predecessor in the cycle. More generally, a deadlock occurs if a knot of LPs forms. Hence, if a conservative synchronization algorithm is used in a distributed simulation, it becomes necessary either to find means to avoid deadlocks or to detect and break them. There is a plethora of algorithms for each of these approaches.
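As an illustration of the blocking rule (not of any specific published algorithm, such as Chandy-Misra-Bryant with null messages), here is a C++ sketch of when an LP may safely process its next event, assuming one FIFO input queue per incoming link:

    #include <cstddef>
    #include <deque>
    #include <optional>
    #include <vector>

    struct Event { double timestamp; int payload; };

    // Returns the next event that is safe to simulate, or nothing if the LP
    // must block. The LP may only proceed when every input queue is non-empty,
    // because an empty queue could later deliver an event with a smaller
    // timestamp than anything currently visible.
    std::optional<Event> nextSafeEvent(std::vector<std::deque<Event>>& inputQueues) {
        int minQueue = -1;
        double minTime = 0.0;
        for (std::size_t i = 0; i < inputQueues.size(); ++i) {
            if (inputQueues[i].empty())
                return std::nullopt;             // block: await a message on this link
            double t = inputQueues[i].front().timestamp;
            if (minQueue < 0 || t < minTime) {
                minQueue = static_cast<int>(i);
                minTime = t;
            }
        }
        Event e = inputQueues[minQueue].front(); // smallest head timestamp is safe
        inputQueues[minQueue].pop_front();
        return e;
    }

A cycle of LPs, each blocked inside this function waiting on its predecessor, is exactly the deadlock described above.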

The other major class of synchronization algorithm is known as optimistic, the prime example of which is Time Warp. In optimistic algorithms, LPs maintain one input queue. All of the events which arrive from other LPs are stored in the queue in increasing order of their timestamps and are processed without any concern for the arrival of events with smaller timestamps. If such a "straggler" event arrives at an LP, the LP restores its state to the one saved just prior to the straggler's timestamp and continues with the simulation from that point. This process is known as rolling back. In order to restore previous states, the LP must checkpoint its state prior to the processing of each event. It is also necessary to cancel messages which were produced subsequent to the straggler. This is accomplished via the use of so-called anti-messages, which cancel, or annihilate, the original messages.
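A very compressed C++ sketch of the rollback mechanics follows. All names are hypothetical, state saving is done by naively copying the entire state before each event, and the message timestamp stands in for its send time; real Time Warp implementations are considerably more refined:

    #include <algorithm>
    #include <vector>

    struct Event { double timestamp; int payload; bool antiMessage = false; };

    // Each optimistically executed event is remembered together with the state
    // the LP was in just before executing it, so the execution can be undone.
    struct Processed { Event event; std::vector<int> priorState; };

    struct TimeWarpLP {
        double localClock = 0.0;
        std::vector<int> state;           // state of the modeled physical process
        std::vector<Event> inputQueue;    // pending events, kept sorted by timestamp
        std::vector<Processed> processed; // executed events plus their checkpoints
        std::vector<Event> sent;          // output messages, kept for cancellation

        // Deliver a message from another LP; a straggler triggers a rollback.
        void receive(const Event& e) {
            if (e.timestamp < localClock)
                rollback(e.timestamp);
            enqueue(e);
        }

        // Optimistically execute the next pending event, checkpointing first.
        void executeNext() {
            if (inputQueue.empty()) return;
            Event e = inputQueue.front();
            inputQueue.erase(inputQueue.begin());
            processed.push_back({e, state}); // checkpoint state before processing
            localClock = e.timestamp;
            // ... update state; record any messages sent in `sent` ...
        }

        void rollback(double stragglerTime) {
            // Undo every event executed at or after the straggler's timestamp:
            // restore the state saved before it and requeue it for re-execution.
            while (!processed.empty() &&
                   processed.back().event.timestamp >= stragglerTime) {
                state = processed.back().priorState;
                enqueue(processed.back().event);
                processed.pop_back();
            }
            localClock = processed.empty() ? 0.0
                                           : processed.back().event.timestamp;

            // Cancel messages produced subsequent to the straggler by sending
            // anti-messages, which annihilate the originals at the receiver.
            for (Event& m : sent) {
                if (m.timestamp >= stragglerTime) {
                    m.antiMessage = true;
                    // sendToNeighbor(m); // hypothetical transport call
                }
            }
            sent.erase(std::remove_if(sent.begin(), sent.end(),
                           [&](const Event& m) { return m.timestamp >= stragglerTime; }),
                       sent.end());
        }

    private:
        void enqueue(const Event& e) {
            auto byTime = [](const Event& a, const Event& b) {
                return a.timestamp < b.timestamp;
            };
            inputQueue.insert(
                std::upper_bound(inputQueue.begin(), inputQueue.end(), e, byTime), e);
        }
    };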

In this course, we plan to focus on the synchronization algorithms at the heart of distributed simulation and on the applications of these algorithms to real-world problems. The course will be conducted as a seminar course; we will read papers prior to each class and dissect them in class, trying to get a feeling for their strengths and weaknesses. There will be one programming project which will involve the construction of a distributed simulation on a network of workstations, making use of MPI to connect them. In addition, students will write a paper which summarizes and compares the research done in an area of parallel simulation.

For a summary of distributed simulation (even if it is slightly dated), you can consult a survey article written by me.

It is expected that the students maintain an adequate supply of coffee for the professor during class. A good time will be had by all.

Class Materials The course will be in the form of a research seminar. This means that students will be responsible for reading papers and also participating in discussions of the papers. Part of your grade will be the extent and quality of this participation.

We will make use of a text to cover some of the fundamentals in the area. The text is
Parallel and Distributed Simulation Systems, by Richard Fujimoto, Wiley Interscience Publishers.
It is available in the bookstore and may also be purchased at www.amazon.com. Slides based upon this text may be found at Fujimoto Slides. There are used copies of the text available on Amazon at a considerable discount from the new and Kindle versions.

The collection of papers may be found at papers
A schedule of papers to read may be found at schedule

Course Evaluation
Class participation: 20%
Project: 40%
Paper: 40%

Project The project may be found at TWproject. It is a simplified version of Time Warp, including a GVT algorithm (Samadi's algorithm). As indicated in the description, you will be expected to demo your project and hand in a written description of the program. The demos will be scheduled for the week after classes end. If you finish before the end of the semester, we should be able to arrange an earlier demo.
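For orientation on the GVT piece of the project: Global Virtual Time is a lower bound on the timestamp of any future straggler or rollback, so state saved before GVT can be reclaimed (fossil collection). The C++ sketch below merely computes that bound from a global snapshot, which no LP actually has; Samadi's algorithm exists precisely to estimate GVT in a running system, using acknowledgements to account for messages in transit. All names here are hypothetical:

    #include <algorithm>
    #include <limits>
    #include <vector>

    // A hypothetical per-LP report gathered during a GVT computation.
    struct LPReport {
        double localClock;                    // the LP's current local virtual time
        std::vector<double> unackedSendTimes; // timestamps of messages sent but
                                              // not yet acknowledged (in transit)
    };

    // GVT is the minimum over all local clocks and all in-transit message
    // timestamps. No event older than GVT can ever be rolled back, so the
    // checkpoints and message records behind it may be reclaimed.
    double computeGVT(const std::vector<LPReport>& reports) {
        double gvt = std::numeric_limits<double>::infinity();
        for (const LPReport& r : reports) {
            gvt = std::min(gvt, r.localClock);
            for (double t : r.unackedSendTimes)
                gvt = std::min(gvt, t);
        }
        return gvt;
    }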

MPI References The following references should be of help for learning MPI.
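For a first taste, here is a minimal self-contained MPI program in the style the project will use: rank 0 sends a timestamped event to rank 1. Compile with an MPI compiler wrapper (e.g., mpicxx) and run with mpirun -np 2:

    #include <mpi.h>
    #include <cstdio>

    // Rank 0 sends a timestamped "event" to rank 1 -- the basic pattern an
    // LP-to-LP message in the project would follow.
    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double timestamp = 1.0;
        if (rank == 0) {
            MPI_Send(&timestamp, 1, MPI_DOUBLE, /*dest=*/1, /*tag=*/0,
                     MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&timestamp, 1, MPI_DOUBLE, /*source=*/0, /*tag=*/0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("rank 1 received an event with timestamp %.1f\n", timestamp);
        }
        MPI_Finalize();
        return 0;
    }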

Paper The purpose of the paper is for you to read papers in an area of parallel simulation and to write a critical summary of these papers. You are free to choose the topic; try to make it close to your research area if this is at all possible. I am looking for an analysis of the papers which you read. By this I mean that I would like you to compare the papers to one another. You should describe their strengths, their weaknesses, and how they relate to one another. Tell me what they don't say in addition to what they do say. I prefer not to give you a specific number of papers to summarize; however, I expect you to do a thorough job of covering the topic. By this I mean that if a paper matters, it will be reviewed in your paper. As to the length of the paper, think of at least 10 pages.

You should first compile the list of papers which you want to read and send it to me by email, or meet with me to discuss the list. I will try to make suggestions about your list. For each paper, you should summarize its contents, write a section which explains how it is related to the other papers, and another section which analyzes its strengths and weaknesses. There should be a concluding section which summarizes the connections between the papers and indicates the major breakthroughs and their influence on other papers.

You should start by making an outline of the paper. Once this is done, it will be much easier to write it.

A list of suggested topics follows.

The paper is due on the last day of class.