The goal of this project is to develop a scalable distributed Verilog simulation suitable for execution on both distributed and shared memory multi-processors. The raison d'etre of this project is that gate level simulation has proved to be the principle bottleneck in digital circuit design-large designs (e.g. NVIDIA's game chips) can take weeks to simulate on conventional uni-processors. It has become necessary to employ parallel architectures in order to accommodate large circuit designs and to obtain significant reductions in the run time of these simulations. We chose Verilog because of its wide-spread industrial use as a design language.


 Fittingly, we are confronted with major challenges in distributed simulation in this project. A small laundry list includes (1)the simulation has a small computational granularity,(2)a small fraction of the simulation processes are active at any time and (3)the load continuously changes its location throughout the course of the simulation. These are,in fact, generic problems in distributed simulation.  


\cite{Herv01} describes our first approach to the problem, Clustered Time Warp (CTW). CTW is a hybrid approach, making use of clusters which are simulated sequentially while Time Warp synchronization is used between clusters. The use of clusters was inspired by the low level of activity at the individual gate level. Three checkpointing algorithms and accompanying strategies for rolling back clusters or LPs within

the clusters were presented, each representing a different point on the spectrum of memory versus time trade-offs. We used small Iscas circuits and compared it to (our) sequential simulator, getting good speedup. We also developed dynamic load balancing algorithms which resulted in a 25% improvement. 


Our next effort involved the use of an open source Verilog compiler (Icarus) as a front end to a distributed simulator which was based upon CTW. We called it DVS (Distributed Verilog Simulator). DVS consists of 3 layers. The bottom layer is the communications layer, which provides a common message passing interface to the upper layers. Inside this layer, the software communication platform can be either PVM or MPI. The middle layer is a parallel discrete event simulation kernel, OOCTW, which is an object-oriented version of CTW. It provides services such as rollback, state saving and restoration, GVT computation and fossil collection to the top layer. The top layer is the distributed simulation engine, which includes an event  handler and an interpreter which executes instructions in the code space of a virtual thread. A  partitioning framework was created \cite{Li03b} and included in

this environment in order to experiment with a number of partitioning algorithms.

A simple hierarchical partitioning algorithm,CAKE,which minimizes

the inter-processor communication was developed, resulting in the best speedup when compared to breadth first search, depth first search and string partitioning. 


DVS served as an experimental  vehicle for research on event reconstruction \cite{Li04a}, reinforcement learning \cite{Wang07}, parallel Verilog compilation \cite{Wang04} compiled code simulation \cite{Wang06}, and distributed direct cancellation \cite{Zhou01}. 

and partitioning \cite{li07a}. Event reconstruction eliminates event saving by reconstructing events from the state queue. It results in a 

significant reduction in memory utilization compared to dynamic checkpointing.

Reinforcement learning is an AI technique which we use to control the optimism in Time Warp via a bounded window. Because partitioning is central to the success of distributed simulation, we briefly summarize the work on static partitioning. 


\cite{li07a} describes a static partitioning algorithm for Verilog which takes advantage of the 

hierarchical information described by Verilog modules and their instances.  A Verilog instance represents one vertex in a circuit hypergraph. The algorithm attempts to minimize the cutsize of the hypergraph while maintaining a balanced load. A vertex can be flattened into multiple vertices in the event that a sufficiently good load balance is not achieved by the algorithm. In this case, the algorithm flattens the largest instance and moves gates between the partitions in order to improve the load balance.  We compared the algorithm to {\em hmetis}, a well-known interative algorithm which extracts clusters from a flattened netlist, achieving a 4.5 reduction in the cutsize of a circuit. It also resulted in a  40\% decrease in the simulation time compared to a sequential simulation.


Our research then took a new turn-we developed a new simulation engine which we named XTW. The initial reason for doing so was to incorporate a number of advances tailored to distributed gate level simulation. 

In this project we developed a new event scheduling mechanism and a new rollback procedure \cite{Xu06}. In XTW  LPs are grouped into clusters  and  a multi-level queue, XEQ, is used to schedule events in the cluster.

Each input channel to a logical process (a gate) is

managed separately in order to maintain a FCFS property. A separate

LP input queue is also maintained together with a cluster queue, which is used

for the actual scheduling of events. The resulting event scheduling

mechanism has an O(1) complexity, a vast improvement over well-know scheduling 

 mechanisms (splay-tree, heap, red-black tree). In addition,

the cost of finding and deleting a scheduled event in the cluster

queue is $O(1)$, a distinct improvement over CTW, for

which the cost is $O(\log n)$. The new rollback procedure in XTW involves the replacement of anti-messages  by \emph{rb messages}, resulting in a lower time complexity- $O(\log n)$ versus $O(n\log n)$ for anti-messages. The output queue at an LP is eliminated as well, resulting in a large savings of memory. Experimental results on a 90,000 gate circuit clearly demonstrate that XTW is far superior to  CTW with any number of processors. CTW ran out of memory with more then

3 processors, while XTW continued to show scalable results. 


The scalability of XTW was investigated utilizing 5M,10M and 25M gate circuits, on a 

128 node Beowulf cluster-our results indicated that  XTW  scales almost 

linearly with the number of processors and scales well with the size of 

the circuits.


Several years ago, Sun made several of its old CPU designs (RTL) available to the design community to serve as benchmarks. We took advantage of this by creating gate level netlists. Our results for these (realistic) designs indicated that XTW is indeed highly scalable. (**numbers**). 

 

Recently, our work has focused on the use of AI techniques to develop dynamic load balancing algorithms and to control the optimism in Time Warp by means of a bounded window. The bonded window mechanism prevents Time Warp can become unstable in the event that its optimism runs unchecked, hence it is important to control it. A number of classical AI techniques were used including genetic algorithms, learning algorithms and simulated annealing for both of these objectives. 


Several papers which are representative of the work which we have done in this area are included below.