Computer Science 308-655
Parallel and Distributed Simulation

Instructor

Carl Tropper
Office:112N McConnell
email: carl@cs.mcgill.ca
Phone: 398-3743
Office hours:TBA
Course meets on:Monday,Wednesday 11:30-1:00

Teaching assistant
Sina Meraji
e-mail:smeraj@@cs.mcgill.ca
Office hours: TBA

Course Description: In the area of computer networks, an on-going effort is being undertaken to provide a parallel simulation of the Internet. The intention of the project is to provide a realistic test-bed for the development of new Internet protocols and to locate performance problems in aras such as routing algorithms and congestion control algorithms. To date, it has been necessary to do experiments with actual networks to achieve the same goals. A detailed parallel simulation provides an environment in which the experimental conditions can be controlled and in which it is possible to replicate experiments. Air-traffic simulation is making use of distributed simulation as is the design of large manufacturing environments for VLSI wafers. The paralllel simulation of VLSI circuitry can give an enormous competitive advantage to a circuit manufacturer. This is an area being actively pursued at McGill. Military simulations are benefitting a great deal from advances in distributed simulation. A major goal in this area is the unification of simulations of different aspects of combat together e.g. tank warfare, close air support and infantry warfare. The inclusion of live exercises into the scenario then provides a formidable training environment. The SPEEDES simulation environment is being used in major war-gaming exercises. The High Level Architecture(HLA) has been proposed as the common framework for DoD simulation applications. The fundamental design guideline of the HLA is that the local time management mechanism used in each simulation federate (simulation component) should not be visible to other federates. Hence all forms of time management (event-driven, time-stepped, parallel discrete-event simulation and real-time driven simulation..) may be linked together. The field of Distributed Interactive Simulation is devoted to these advances. A counterpart to this military training simulation environment is a virtual laboratory , in which scientists interact with a simulation in a virtual environment. The scientist can make changes to a model of the system which he is studying, causing it to be rolled back to the point which is dictated by the changes and to resume its execution from that point. An example of this environment is the CAVE Automatic Virtual Environment.

Like any distributed program, a distributed simulation is composed of processes which communicate with one another. This communication may occur via shared memory in a shared memory multiprocessor or via message passing in a distributed environment. The processes in distributed simulation are each intended to simulate a portion of the system being modeled and are referred to as Logical Processes (LPs). An LP creates events, sends events to other LPs and receives events from other LPs in the course of a simulation. Associated with each LP are input queues used to store messages from other LPs, state variables which reflect the state of the physical process which is being modeled, and a local simulation clock. The purpose of the simulation clock is to capture the advance of simulation time at each LP.

The treatment of time in a distributed system is of fundamental importance to understanding and building distributed systems. A distributed simulation poses a particularly vexing problem. In a uniprocessor, the events in a discrete event simulation are stored in a priority queue nd simulated in the order of the smallest timestamp first. This is easily done in a uniprocessor because all of the events in the simulation are stored in memory. In a distributed simulation, however, the events are spread out among the processors involved in the simulation. A causality violation occurs if we simulate an event at an LP and an event generated by another LP with a smaller timestamp arrives afterwards in real time. This late arrival can easily cause incorrect results in the simulation. In a military simulation it matters in which order the two events "aim the gun" and "fire the gun" are executed.

The central problem of distributed simulation is the development of synchronization algorithms which are capable of maintaining causality and which do so with minimal overhead. This overhead can be in the form of increased memory demands and increased execution time. Two primary approaches to synchronization algorithms have been developed, the conservative and the optimistic classes of algorithms. A conservative algorithm is characterized by its blocking behavior. In a conservative simulation, if one of the input queues at an LP is empty, the LP blocks, awaiting a message from another LP. This behavior insures causality. However, it does so a price. To begin with, when an LP suspends processing there may be an increase in the execution time of the n addition, it is entirely possible for deadlocks to form. This occurs if a group of LPs is arranged in the form of a cycle such that each LP in the cycle is awaiting a message from its predecessor in the cycle. More generally, a deadlock occurs if a knot of LPs forms. Hence, if a conservative synchronization algorithm is used in a distributed simulation, it becomes necessary to either find means to avoid deadlocks or to detect and break them. There are a plethora of algorithms for each of these approaches.

The other major class of synchronization algorithm is known as optimistic, the prime example of which is Time Warp. In optimistic algorithms, LPs maintain one input queue. All of the events which arrive from other LPs are stored in the queue in increasing order of their timestamps and are processed without any concern for the arrival of events with smaller timestamps. If such a "straggler" event arrives at an LP, the LP restores its state just prior to the arrival of the straggler and continues with the simulation from that point. This process is known as rolling back. In order to restore previous states, the LP must checkpoint its state prior to the processing of an event. It is also necessary to cancel messages which were produced subsequent to the straggler. This is accomplished via the use of so-called anti-messages, which cancel, or annihilate the original messages.

In this course, we plan to focus on the synchronization algorithms at the heart of distributed simulation and on the applications of these algorithms to real-world problems. The course will be conducted as a seminar course; we will read papers prior to each class and dissect them in class, trying to get a feeling for their strengths and weaknesses. There will be one programming project which will involve the construction of a distributed simulation on a network of workstations, making use of MPI to connect them.

For a brief summary article on distributed simulation, you can consult survey article written by me.

It is expected that the students maintain an adequate supply of coffee for the professor during class. A good time will be had by all.

Class Materials The course will be in the form of a research seminar. This means that students will be responsible for presenting papers and also participating in discussions of the papers. Part of your grade will be the extent and quality of this participation.

We will begin the course by covering the fundamentals of the area by going through a text, and will then read selected papers in the area. The text is
Parallel and Distributed Simulation Systems, by Richard Fujimoto, Wiley Interscience Publsihers It is available in the bookstore and may also be purchased at www.amazon.com.

The papers may be found at papers

Course Evaluation
Class participation:30%
Project: 40%
Class presentations:30%

Course Projects:
Cellularproject
TWproject

Computer Science 308-655 Parallel and Distributed Simulation

Instructor

Computer Science 308-655
Parallel and Distributed Simulation