README

This directory contains the source for a mostly complete HTML 3.2 parser, based on the "-//W3C//DTD HTML 3.2 Draft 19960821//EN" DTD. Unlike most browsers, this parser is rather finicky about its input. Correct HTML 3.2 should get through it without causing an error, but it is by no means a validating parser; if you need validation, I suggest James Clark's SP SGML parser at <http://www.jclark.com/sp/index.htm>.

Certain things are not implemented properly, and I encourage you to take this parser as a starting point and improve it. Limitations I'm aware of are:

Building and running the HTML parser

The parser uses JJTree to build a simple tree representation of the HTML input. To build and run the parser:

  1. run JJTree on the source
  2. run JavaCC on the grammar file that JJTree generates
  3. compile the Java code in the usual way
  4. run the parser on some input

Here's how a build looks on my system:

adl% jjtree html-3.2.jjt
Java Compiler Compiler Version 0.6(Beta) (Tree Builder Version 0.2.2)
Copyright (c) 1996, 1997 Sun Microsystems Inc.
(type "jjtree" with no arguments for help)
Reading from file html-3.2.jjt . . .
Annotated grammar generated successfully in html-3.2.jj

adl% javacc html-3.2.jj
Java Compiler Compiler Version 0.6(Beta) (Parser Generator)
Copyright (c) 1996, 1997 Sun Microsystems Inc.
(type "javacc" with no arguments for help)
Reading from file html-3.2.jj . . .
Parser generated successfully.

adl% javac html32.java

adl% java html32 <README.html
Reading from standard input...
html
 head
  title
   PCDATA: README
 body
  h1
   PCDATA: README
  p
   PCDATA: This directory contains the source for a mostly complete HTML
      3.2 parser.  It is based upon the "-//W3C//DTD HTML 3.2 Draft
      19960821//EN" DTD.  Unlike most browsers, this parser is rather
      finicky about the input.  In addition, certain things are not
      implemented properly.  I encourage you to take this parser as a
      starting point and improve it.  Limitations I'm aware of are:
...

Notes

This parser uses JJTree Simple mode. It also uses a couple of specialized node classes for representing PCDATA and attributes. It should all seem pretty obvious once you take a look.
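
To give a feel for the idea before you open the source, here is a minimal, self-contained sketch of a tree with one specialized node type for PCDATA, printed in the same indented style as the output above. The names NodeSketch, Node, PcdataNode, label and render are made up for this illustration; they are not the classes that JJTree generates or the ones in this directory, and attribute nodes are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the node classes in this directory; the real
// class and method names may differ.
public class NodeSketch {

    // Generic element node: a name plus an ordered list of children.
    static class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        // Adds a child and returns this node so calls can be chained.
        Node add(Node child) {
            children.add(child);
            return this;
        }

        // Text printed for this node's own line in the dump.
        String label() {
            return name;
        }

        // Appends this node and its subtree, one extra space per level.
        void dump(String prefix, StringBuilder out) {
            out.append(prefix).append(label()).append('\n');
            for (Node child : children) {
                child.dump(prefix + " ", out);
            }
        }
    }

    // Specialized node that carries character data.
    static class PcdataNode extends Node {
        final String text;

        PcdataNode(String text) {
            super("PCDATA");
            this.text = text;
        }

        @Override
        String label() {
            return "PCDATA: " + text;
        }
    }

    // Renders a whole tree to a string.
    static String render(Node root) {
        StringBuilder out = new StringBuilder();
        root.dump("", out);
        return out.toString();
    }

    public static void main(String[] args) {
        Node html = new Node("html")
            .add(new Node("head")
                .add(new Node("title").add(new PcdataNode("README"))))
            .add(new Node("body")
                .add(new Node("h1").add(new PcdataNode("README"))));
        System.out.print(render(html));
    }
}
```

Overriding a single method in the specialized class keeps the tree walk itself generic, which is roughly the appeal of using a small number of custom node classes alongside the default ones.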