This directory contains the source for a mostly complete HTML 3.2 parser. It is based upon the "-//W3C//DTD HTML 3.2 Draft 19960821//EN" DTD. Unlike most browsers, this parser is rather finicky about the input. Although correct HTML 3.2 should get through it without causing an error, it is by no means a validating parser. For that I suggest you use James Clark's SP SGML parser at <http://www.jclark.com/sp/index.htm>. In addition, certain things are not implemented properly. I encourage you to take this parser as a starting point and improve it. Limitations I'm aware of are:
The parser is written using JJTree to create a simple representation of the HTML input. To build the parser you have to:
Here's how a build looks on my system:
    adl% jjtree html-3.2.jjt
    Java Compiler Compiler Version 0.6(Beta) (Tree Builder Version 0.2.2)
    Copyright (c) 1996, 1997 Sun Microsystems Inc.
    (type "jjtree" with no arguments for help)
    Reading from file html-3.2.jjt . . .
    Annotated grammar generated successfully in html-3.2.jj
    adl% javacc html-3.2.jj
    Java Compiler Compiler Version 0.6(Beta) (Parser Generator)
    Copyright (c) 1996, 1997 Sun Microsystems Inc.
    (type "javacc" with no arguments for help)
    Reading from file html-3.2.jj . . .
    Parser generated successfully.
    adl% javac html32.java
    adl% java html32 <README.html
    Reading from standard input...
    html
     head
      title
       PCDATA: README
     body
      h1
       PCDATA: README
      p
       PCDATA: This directory contains the source for a mostly complete HTML 3.2 parser. It is based upon the "-//W3C//DTD HTML 3.2 Draft 19960821//EN" DTD. Unlike most browsers, this parser is rather finicky about the input. In addition, certain things are not implemented properly. I encourage you to take this parser as a starting point and improve it. Limitations I'm aware of are:
    ...
This parser uses JJTree Simple mode. It also uses a couple of specialized node classes for representing PCDATA and attributes. It should all seem pretty obvious once you take a look.
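To give a rough idea of what such a tree looks like in code, here is a minimal, self-contained sketch. It is not the code JJTree generates for this grammar: the class names TreeSketch, Node and PcdataNode and the render method are hypothetical stand-ins, and the one-space-per-level indentation is an assumption. It only illustrates the general shape: generic nodes that hold children, plus a specialized node that carries the PCDATA text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the node classes; all names here are illustrative only.
class TreeSketch {

    /** A generic element node: a name plus an ordered list of children. */
    static class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        /** Appends a child and returns this node so calls can be chained. */
        Node add(Node child) {
            children.add(child);
            return this;
        }

        /** Label printed for this node when the tree is rendered. */
        String label() {
            return name;
        }
    }

    /** Specialized node that carries character data. */
    static class PcdataNode extends Node {
        final String text;

        PcdataNode(String text) {
            super("PCDATA");
            this.text = text;
        }

        @Override
        String label() {
            return "PCDATA: " + text;
        }
    }

    /** Renders the tree one node per line, indenting one space per level (assumed width). */
    static String render(Node root) {
        StringBuilder out = new StringBuilder();
        render(root, "", out);
        return out.toString();
    }

    private static void render(Node node, String prefix, StringBuilder out) {
        out.append(prefix).append(node.label()).append('\n');
        for (Node child : node.children) {
            render(child, prefix + " ", out);
        }
    }

    public static void main(String[] args) {
        // Build a small tree by hand; a real parser would build it from the input.
        Node doc = new Node("html")
            .add(new Node("head")
                .add(new Node("title").add(new PcdataNode("README"))))
            .add(new Node("body")
                .add(new Node("h1").add(new PcdataNode("README"))));
        System.out.print(render(doc));
    }
}
```

Keeping the text in a subclass means the element nodes stay generic, and only the nodes that need extra data pay for it. Attributes could be handled the same way with another subclass.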