[7707] in www-talk@info.cern.ch


Re: HTML parser in Yacc form???

daemon@ATHENA.MIT.EDU (Gavin Nicol)
Thu Mar 23 12:57:50 1995

Date: Thu, 23 Mar 1995 12:55:05 +0500
Errors-To: procmaster@www19.w3.org
Reply-To: gtn@ebt.com
From: Gavin Nicol <gtn@ebt.com>
To: Multiple recipients of list <www-talk@www10.w3.org>

>|>      Hi all,
>|>
>|>      I was wondering if there exists a specification of HTML in yacc
>|>(or BNF) form. It has probably been done, as constructing such a parser is
>|>much easier this way than with a traditional C subroutine.
> 
>Don't think about it. HTML is not an LR(1) grammar and so trying to use yacc
>is only going to cause pain. The best way of parsing SGML is with a top down
>recursive descent parser. Try to use yacc and you will end up in all sorts of
>troubles, especially with error reporting.

Phill is technically correct: one cannot parse full SGML, and hence
HTML, using YACC et al.

If one limits oneself to a subset of SGML, it is quite possible to
produce a YACC grammar. Dan Connolly has produced such a grammar for
HTML by hacking DTD2HTML, and the TEI folks have produced an
*excellent* and very *useful* subset of SGML, and the grammar is
available at:

   ftp://ftp-tei.uic.edu/pub/TEI

While these can accept some documents that are not quite legal SGML,
99.9% of the documents I've seen would be legal both within the TEI
grammar and within SGML.
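
To make the idea of a restricted subset concrete, here is a minimal
sketch of a top-down recursive descent parser for a fully tagged
document, written in Python. It is an illustration only and is not
taken from the TEI grammar or from Dan Connolly's work: the names
TOKEN, tokenize, and parse are hypothetical, and the subset assumed
here (no attributes, no entities, no comments, no omitted end tags)
is my own simplification.

```python
import re

# Hypothetical token pattern: a start tag, an end tag, or a run of text.
# Attributes, entities, comments, and SGML tag minimization are not handled.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)")


def tokenize(source):
    """Yield ('start', name), ('end', name), or ('text', data) tuples."""
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        slash, name, text = match.groups()
        if text is not None:
            yield ("text", text)
        elif slash:
            yield ("end", name.lower())
        else:
            yield ("start", name.lower())
        pos = match.end()


def parse(source):
    """Parse a fully tagged document into nested (name, children) tuples."""
    tokens = list(tokenize(source))
    index = 0

    def parse_element():
        nonlocal index
        _, name = tokens[index]  # the caller guarantees this is a start tag
        index += 1
        children = []
        while index < len(tokens):
            kind, value = tokens[index]
            if kind == "text":
                children.append(value)
                index += 1
            elif kind == "start":
                children.append(parse_element())
            else:  # an end tag must close the element we are inside
                if value != name:
                    raise SyntaxError(f"expected </{name}>, got </{value}>")
                index += 1
                return (name, children)
        raise SyntaxError(f"missing </{name}>")

    if not tokens or tokens[0][0] != "start":
        raise SyntaxError("document must begin with a start tag")
    tree = parse_element()
    if index != len(tokens):
        raise SyntaxError("content after the root element")
    return tree
```

A document that relies on end-tag omission, such as
<ul><li>a<li>b</ul>, is rejected by this sketch. Accepting it would
require consulting the DTD's content models to infer the missing end
tags, which is the part of SGML that a fixed grammar cannot express.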

