This directory contains the source for a mostly complete HTML 3.2 parser. It is based on the "-//W3C//DTD HTML 3.2 Draft 19960821//EN" DTD.

Unlike most browsers, this parser is rather finicky about its input. Correct HTML 3.2 should get through it without causing an error, but it is by no means a validating parser. For validation, I suggest James Clark's SP SGML parser at <http://www.jclark.com/sp/index.htm>.

Certain things are not implemented properly, and I encourage you to take this parser as a starting point and improve it. Limitations I'm aware of are:
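The parser is written with JJTree, which builds a simple tree representation of the HTML input. To build the parser, run jjtree on html-3.2.jjt, run javacc on the html-3.2.jj file that jjtree generates, and compile the resulting html32.java with javac.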
Here's how a build looks on my system:
adl% jjtree html-3.2.jjt
Java Compiler Compiler Version 0.6(Beta) (Tree Builder Version 0.2.2)
Copyright (c) 1996, 1997 Sun Microsystems Inc.
(type "jjtree" with no arguments for help)
Reading from file html-3.2.jjt . . .
Annotated grammar generated successfully in html-3.2.jj
adl% javacc html-3.2.jj
Java Compiler Compiler Version 0.6(Beta) (Parser Generator)
Copyright (c) 1996, 1997 Sun Microsystems Inc.
(type "javacc" with no arguments for help)
Reading from file html-3.2.jj . . .
Parser generated successfully.
adl% javac html32.java
adl% java html32 <README.html
Reading from standard input...
html
 head
  title
   PCDATA: README
 body
  h1
   PCDATA: README
  p
   PCDATA: This directory contains the source for a mostly complete HTML
3.2 parser. It is based upon the "-//W3C//DTD HTML 3.2 Draft
19960821//EN" DTD. Unlike most browsers, this parser is rather
finicky about the input. In addition, certain things are not
implemented properly. I encourage you to take this parser as a
starting point and improve it. Limitations I'm aware of are:
...
This parser uses JJTree Simple mode. It also uses a couple of specialized node classes for representing PCDATA and attributes. It should all seem pretty obvious once you take a look.
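To give a rough idea of what a generic node type plus a specialized PCDATA node can look like, here is a minimal, self-contained sketch. It does not reproduce the parser's real classes: `NodeSketch`, `SimpleNodeSketch`, `PcdataNodeSketch`, `label`, and `render` are all hypothetical names made up for illustration, and the tree is built by hand rather than by a generated parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for JJTree-style node classes; all names are illustrative only.
public class NodeSketch {

    // Generic node: every element (html, head, p, ...) is one of these.
    static class SimpleNodeSketch {
        final String name;
        final List<SimpleNodeSketch> children = new ArrayList<>();

        SimpleNodeSketch(String name) {
            this.name = name;
        }

        // Adds a child and returns this node so calls can be chained.
        SimpleNodeSketch add(SimpleNodeSketch child) {
            children.add(child);
            return this;
        }

        // Label printed for this node; specialized nodes override it.
        String label() {
            return name;
        }

        // Appends this node, then its children with one more space of indent.
        void dump(String prefix, StringBuilder out) {
            out.append(prefix).append(label()).append('\n');
            for (SimpleNodeSketch child : children) {
                child.dump(prefix + " ", out);
            }
        }
    }

    // Specialized node that carries character data in addition to its name.
    static class PcdataNodeSketch extends SimpleNodeSketch {
        final String text;

        PcdataNodeSketch(String text) {
            super("PCDATA");
            this.text = text;
        }

        @Override
        String label() {
            return "PCDATA: " + text;
        }
    }

    // Renders the whole tree rooted at the given node as a string.
    static String render(SimpleNodeSketch root) {
        StringBuilder out = new StringBuilder();
        root.dump("", out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hand-built tree standing in for what a parser would produce.
        SimpleNodeSketch root = new SimpleNodeSketch("html")
            .add(new SimpleNodeSketch("head")
                .add(new SimpleNodeSketch("title")
                    .add(new PcdataNodeSketch("README"))))
            .add(new SimpleNodeSketch("body")
                .add(new SimpleNodeSketch("h1")
                    .add(new PcdataNodeSketch("README"))));
        System.out.print(render(root));
    }
}
```

The only design point the sketch tries to show is that a specialized node can reuse the generic tree-walking code and override just the part that differs, here the label it prints.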