Of Visitors and Symbol Tables

Introduction

In the COMP-520 course at McGill, students have to implement a complete compiler for a subset of a small web-oriented language called Wig. The project is broken down into smaller tasks:

Understanding the Wig language and get a sense of how what the compiler will produce by writing small programs and their translations into a target language of our choice;
Designing the AST and pretty printing;
Implement lexical and syntactic analysis;
Implement a symbol table and apply it to an AST;
Implement static type verification;
Implement code generation.

For this post, I want to focus on the symbol table and type checking phases. My initial approach to symbol tables was flawed, and it adversely affected the subsequent phases that need to access this information. I will discuss how I implemented the symbol table itself, how I manipulated it (which is where the problems started happening), what practical consequences my design has had and how I fixed it.

Symbol tables

Symbol tables are a common data structures in compilers: they map names to semantic information. Let's consider the following program (in a C-like language):

void f(int x) {
    double y;
    y = 2*x - 3.1415;
}

The symbol table for this procedure might be implemented as a hash-table:

{
 "f" : <function, int -> void>,
 "x" : <parameter, int>,
 "y" : <localvar, double>
}

The information you keep about a symbol varies depending on your needs. Here I only included the category of symbol and its type. This is pretty straight-forward, but we must be careful, because it is possible to shadow previous declarations.

void f(int x) {
    double y;

    while (1) {
        struct foo y;
        break;
    }

    y = 2*x - 3.1415;
}

In this code snippet, the variable y is declared twice. Inside the while loop, it is of type /struct foo/ and outside it is of type /double/. How can we reflect that in the symbol table? A common solution is to represent the symbol table as a stack of hash-tables. So for our second program, the symbol table looks like this:

{ "y": <localvar, struct foo> }
{
 "f" : <function, int -> void>,
 "x" : <parameter, int>,
 "y" : <localvar, double>
}

(The symbol table is not a fixed entity, it's dynamic and its state depends on "where" we are in the program. The symbol table state above represents when the context of the program is inside the while loop.)

An abstract data type interface for a symbol table looks roughly like this:

/enterScope/: push a new empty hash-table on top of the stack
/exitScope/: pop the top-most hash-table
/addSymbol/: add a symbol and its information to the top-most hash-table
/getSymbol/: search the hash-tables from top to bottom and return the first matched symbol.

My implementation

Now that we know what a symbol table is, and how it should work, it should be relatively straightforward for most developers with experience to implement, so I'll skip that part entirely. The tricky part, is actually using it. I'll describe what I did, and how I came to find that my solution was not as well designed as I hoped.

I wrote my Wig compiler in Python. (Mini-rant: I would've actually liked to use Haskell, but my partner did not know Haskell and knew Python, so that's what we decided to use; had I known he'd abandon the course 4 weeks in, I would have pushed harder for Haskell.) Python is an OO language, and so a common way to walk through the AST is to use visitor pattern. I went ahead and added an accept method to every Node subclass and created an ASTVisitor class which had preVisit and postVisit methods for every node type. I was ready to implement my symbol table: I subclassed ASTVisitor and had something similar to this code:

class SymTableVisitor(ASTVisitor):
    def __init__(self):
        self.symtable = SymTable()

    def preVisitFunDecl(self, node):
        self.symtable.enterScope()
        self.symtable.addSymbol(node.fun_name, 'function')

    def postVisitFunDecl(self, node):
        self.symtable.exitScope()

    def preVisitVarDecl(self, node):
        self.symtable.addSymbol(node.var_name, 'variable')

Obviously there were more cases to handle in the actual code, but this should give you a good idea of how I wrote my implementation. Initially, this seemed quite clever, because symtable is always in the correct state when I visit the different node types. It's only when the time came to implement type checking that I started seeing the problems with my approach.

The Flaw

Here is the major flaw in my design, which causes many smaller problems and annoyances:

The symbol table exists only while SymTableVisitor is executing.

The ephemeral nature of the symbol table has two consequences:

Any code that needs to use the symbol table must be a sub-class of SymTableVisitor;
We cannot have any algorithm that works in two passes.

The first point has many implications: we cannot write a traversal of the AST in any other way than by using a visitor and the subclass must carefully call the proper super-class methods (e.g. preVisitFunDecl) at the appropriate point. The second point can prevent mutually recursive functions and forces us to write functions in the proper order.

A possible solution

One possible solution to the problem, and the way I did in my own project, was to modify SymTableVisitor to create a complete data structure that could then be passed to other algorithms and be queried without being in the correct visitor context. A T.A. for the class suggested that I create a dictionary mapping AST nodes (i.e. those that introduce a new scope) to a symbol table. So now, my SymTableVisitor looks more like this:

class SymTableVisitor(ASTVisitor):
    def __init__(self):
        self.symtable = SymTable()
        self.all_tables = {}

    def preVisitFunDecl(self, node):
        self.symtable.enterScope()
        self.symtable.addSymbol(node.fun_name, 'function')

    def postVisitFunDecl(self, node):
        self.all_tables[node] = copy.deepcopy(self.symtable)
        self.symtable.exitScope()

    def preVisitVarDecl(self, node):
        self.symtable.addSymbol(node.var_name, 'variable')

After the visit is complete, we can easily fetch the dictionary of symbol tables, pass it to other passes that can inspect it and update it. It's not a perfect solution, for example you need a mechanism to propagate information in a node to all its descendents, but it was more than enough to get me through the rest of the project, and implementing type checking and code generation was much easier and less error prone.