Re: early preview of a recursive descent parser compiler implemented as a macro
  Home FAQ Contact Sign in
gnu.emacs.help only
 
Advanced search
POPULAR GROUPS

more...

 Up
Re: early preview of a recursive descent parser compiler implemented as a macro         

Group: gnu.emacs.help · Group Profile
Author: Mike Mattie
Date: Jan 11, 2008 15:07

On Tue, 8 Jan 2008 02:34:36 -0800
Mike Mattie gmail.com> wrote:
> Hello,
>
> I recently found myself needing a top down parser, so I wrote up a
> parser compiler as a macro. It is turning out quite nicely for my
> purposes so I would like to submit it for comments and suggestions. I
> am still working out a couple of kinks, but it is at the point where
> I can start integrating feedback.
>
> The goal is a lightweight parser for simple jobs such as scraping a
> data structure out of essentially static buffers, which is the
> problem I was solving when I developed it.
>
> I am in-lining the actual parser compiler, and attaching a testing
> file so people can play with it if they find it interesting.
>
> Status:
>
> It's my first medium sized macro so it may have some thinko's or
> oddities. criticism welcome.
>
> Simple tests work. The backtracking hasn't been extensively tested
> yet. The and form doesn't seem to return the full AST it should.
> Other than that all the basic bits are in place.

I reworked the runtime functions a bit and the and operator works correctly
now. test-parser.el has some minor changes so I have included the latest
version.
> The catch for semantic-error throws is not in place yet. I need
> something like (condition-case) however that construct seemed a
> little over-powered for my purpose. Suggestions on handling multiple
> error conditions in a single form would be appreciated.

My next priority is completing the error handling.
> TODO:
>
> 1. add ? for optional productions. Not a big deal to implement once
> the basic implementation is sound.
> 2. Full Documentation

3. add a (define ) list so that tokens and productions can be defined
without using them in place.
> Review Questions:
>
> Do the Emacs developers consider this tool useful ? Does it fill a
> niche in Emacs ? If so I will document it.
>
> Is it elegant ?
>
> Is the terminology correct ? I am not entirely sure I used curry in
> the strict sense.
>
> Are there any essential features missing ? keeping in mind the goal
> is simplicity, not a fully optimized parser.
>
> Is it interesting in a creative sense, or have I just ground out
> another lisp right of passage ?
>
> Examples:
>
> A couple of contrived examples I have used for testing. A couple of
> things to note. Keywords are token|and|or. token defines a regex
> match, while or and and construct terminals. The definition of a
> token or terminal creates a match object at the point of definition.
>
> There is a built-in start symbol that is a or-terminal, so the first
> example will match to the start symbol either token.
>
> (parser-compile test-parser
> (token whitespace "[[:blank:]]+")
> (token word "[[:alpha:]]+"))
>
> The parser is compiled to the function value of test-parser. You can
> call it like this: (test-parser (point))
>
> It parses a single start production. Looping is done outside of the
> parser itself.
>
> as the second example shows you can reference previously created
> tokens.
>
> (parser-compile test-parser
> (and indented (token whitespace "[[:blank:]]+") (token word
> "[[:alpha:]]+")) (and inverted word whitespace))
>
> The attached file test-parser.el has some test drivers for playing
> with the parser interactively. Some of the features for defining
> tokens such as defining your own actions for tokens are shown there.
>
> P.S, here is a fun little lisp comic in the spirit of finding
> amusement in your work: http://xkcd.com/297/
>
> Without further ado here is parser.el , and enjoy !
>

Here is the second posted version with the changes indicated. If you need
a version number that is not ambiguous this posting is version 636.

;;----------------------------------------------------------------------
;; parser.el
;; Primary Author: Mike Mattie (codermattie@gmail.com)
;; Copyright: Mike Mattie (2008)
;;----------------------------------------------------------------------

;; ->Summary

;; parser.el aims to be a lightweight implementation of parsing.

;; My requirements involve building/extracting a data structure from
;; analysis of a given text. For this purpose a "top-down" parse
;; makes it easier to assemble a data structure. [TODO: why is it easier
;; this way ?]

;; ->Related Works

;; Emacs regex tables.
;;
;; Much of the parsing facilities that I have seen so far in Emacs
;; have been various forms of regex tables.

;; There are many reasons why Emacs parsing tools would evolve away
;; from the usual parser generator tool-set.

;; 1. The requirement to work incrementally: parse errors are not
;; errors, the user just has not finished typing :)

;; 2. Emacs analysis is often partial by requirement, usually only
;; significant parts of the text need to be identified for
;; features.

;; 3. Emacs can leverage the annotation facilities of
;; text properties and overlays making analysis extremely
;; modular and robust while editing.

;; CEDET.
;;
;; CEDET appears to be a parser generator implemented on-top of a
;; CLOS emulation. The fact that this sort of design is not "lispy"
;; does not negate it's value, but limits it IMHO. This tool-set
;; has also "grown around" the problem of analyzing source code as
;; a project which is bloat for simpler requirements.

;; ->Concept

;; A parser compiler that combines lexical (regex) analysis with LL
;; parsing compiled by recursive macro expansion.

;; The concept started out as an idea to build a parser as nested cond
;; forms. The matches would go in the predicate part of the clause
;; while the action part would go in the body.

;; This idea is expressed here with cond forms mutated into lambdas so
;; that previously defined matches can be referenced by productions.

;; -> Requirements

;; 1. Easy to use for trivial problems.

;; 2. Data structures are orthogonal to the parser and not hard-wired
;; in. Leafs and nodes of the AST are constructed with beginning
;; and ending positions of the text in the buffer.
;;
;; This allows the user to choose between positions, markers, overlays
;; according to the requirements.

;; -> Naming Conventions

;; parser-* internal functions
;; parser-build-* Construct the AST structures returned by the parser.
;; parser-interp-* expand definitions of terminals and non-terminals into lisp
;; parser-compile-* compile the expanded definitions into interchangeable match functions.

;; -> Key data structures

;; Matches

;; matches are a cons pair of ( parser . AST ).

;; Parser is used internally to indicate match success or failure. It
;; should be the distance matched from the current parser position.
;; It can, and often will be zero for a successful match. nil indicates
;; a failure to match.

;; AST is the data returned from the match ( production . ( start . end ) | list )
;; the car of the pair (production) is the production's symbol.

;;----------------------------------------------------------------------
;; diagnostics
;;----------------------------------------------------------------------

(defun parser-expr-diagnostic ( form ) ;; tested
(format "type(%%s) %%s" (symbol-name (type-of form)) (pp (eval form))))

(defmacro parser-diagnostic ( form from expected ) ;; tested
"syntax: (parser-diagnostic form from expected)

Where form is the expr received, from is the component issuing the diagnostic,
and expected is a message describing what the component expected"
`(concat (format "[%%s] expected: " ,from) ,expected " not: " ,(parser-expr-diagnostic form)))

;;----------------------------------------------------------------------
;; AST constructors
;;----------------------------------------------------------------------

(defun parser-build-token ( identifier ) ;; tested
"parser-make-token is a built-in constructor that records the analysis
and the location of the text (id . (begin . end))"
(cons identifier (cons (match-beginning 0) (match-end 0))))

;;----------------------------------------------------------------------
;; compiler internals
;;----------------------------------------------------------------------

;; make-match and get-match implement a table of productions and
;; tokens. Both production and tokens are interchangeable as far as
;; matching and referencing is concerned. This table gives the ability
;; to reference previously defined matches in other productions.

;; match-table is dynamically scoped by the macro, and does not appear
;; in the compiled form.

(defun parser-make-match ( symbol function ) ;; tested
"parser-make-match takes ( symbol function ) and returns a symbol
stored in the parser's match-table with the evaluated lambda
bound"
(lexical-let
((new-name (symbol-name symbol)))

(if (intern-soft new-name match-table)
(throw 'semantic-error (format "illegal redefinition of match %%s" new-name)))

(fset (intern new-name match-table) (eval function))
(intern new-name match-table)
))

(defun parser-make-anon-func ( sexp ) ;; tested
"bind an un-evaluated anonymous function to an un-interned symbol"
(let
((anon-func (make-symbol "parser-lambda")))
(fset anon-func (eval sexp))
anon-func))

(defun parser-get-match ( symbol ) ;; tested
"return the compiled match for symbol, or throw a semantic-error if it does not
exist"
(lexical-let
((existing-name (symbol-name symbol)))

(unless (intern-soft existing-name match-table)
(throw 'semantic-error (format "unkown match %%s" existing-name)))

(intern existing-name match-table)))

;;----------------------------------------------------------------------
;; compiler run-time
;;----------------------------------------------------------------------

;; these functions are defined independent of the compilation of a parser
;; for simplicity, and to avoid wasteful duplication in the macro expansion.

;; parser-position

;; The position of the parser in the input stream is maintained as a
;; stack of indexes into the buffer. As matching progresses the top of
;; the stack (car) is updated. The push and pop operations copy a
;; position to a next or previous stack position. backtrack discards
;; the top of the stack.

(defun parser-pos () ;; tested
"Return the current position of the parser in the buffer"
(car parser-position))

(defun parser-push ()
"Copy the parser position to a new stack level so the parser can backtrack when necessary."
(setq parser-position (cons (car parser-position) parser-position)))

(defun parser-pop ()
"Copy the parser position to a previous stack level When the possibility of a backtrack
has been eliminated by matching."
(let
((current (parser-pos)))
(setq parser-position (cons (car parser-position) (cddr parser-position)))
))

(defun parser-backtrack ()
"Restore the previous parser position by discarding the top of the parser-position stack."
(setq parser-position (cdr parser-position))
(goto-char (parser-pos)))

(defun parser-advance ( distance )
"Add distance to the parsing position on the current stack level.
The advancing of the parser is done relative to the current
position so that a successful match can advance 0 characters.
This is important for optional matches such as the \"?\" meta-symbol.
They want to indicate a successful match without moving the parser position."

;; we are going to get zero's as a unfortunate side-effect of
;; keeping the nesting of combination operators simple. Avoid NOP
;; work with a quick test.

(if (> distance 0)
(progn
(setq parser-position (cons (+ distance (car parser-position)) (cdr parser-position)))
(goto-char (parser-pos)))
))

(defun parser-next ()
"Compute the next parser position from the length of the entire regex's current match, plus one."
(+ 1 (- (match-end 0) (match-beginning 0))))

;; parser combination operators

;; the combination operators group matches by and,or operators as the
;; underlying mechanics of non-terminal productions. As these
;; functions are not compiled into the generated parser they have no
;; knowledge of the symbol associated with the match list, only the
;; match list itself.

(defun parser-or ( &rest match-list )
"Combine match objects by or where the first successful match is returned.
nil is returned if no matches are found"
(catch 'match
(dolist (match match-list)
(lexical-let
((match-result (funcall match)))
(if match-result
(progn
(parser-advance (car match-result))
(throw 'match (cons 0 (cdr match-result)))))
))
(throw 'match nil) ;; this is what failure looks like :)
))

(defun parser-and ( &rest match-list )
"combine the matches with and. all of the match objects must return non-nil
in the parser part or the parser will backtrack and return nil."
(parser-push)

(lexical-let
((ast (catch 'backtrack
;; we want to gather all the matches, so mapcar across the match objects.
(mapcar
(lambda (match)
(lexical-let
((match-result (funcall match)))

(if match-result
(progn
;; on match advance the parser and return the AST
(parser-advance (car match-result))
(cdr match-result))

;; on fail backtrack and return nil
(throw 'backtrack nil))
))
match-list)
)))
(if ast
(progn
(parser-pop)
(cons 0 ast)) ;; would be nice to filter optional matches here.
(progn
(parser-backtrack)
nil))
))

;;----------------------------------------------------------------------
;; syntax interpretation
;;----------------------------------------------------------------------

;; interpretation as expansion of a form into lisp is separated from
;; compilation to make the parser easier to debug and verify by
;; phase.

;; the token interpreter was split into two functions to isolate the
;; flexibility of tokens (user functions or return values for
;; constructing AST) from the hard-wired parser part.

(defun parser-interp-token-action ( identifier constructor ) ;; tested
"Translate the AST constructor definition for a token into Elisp."

(unless (symbolp identifier)
(throw 'syntax-error (parser-diagnostic identifier
"parser token"
"identifier: An unbound symbol used as an identifier"
)))

;; Warning: this code is very touchy, double quotation of AST
;; symbols is required. the saving grace is that the symbols don't
;; have a variable value bound so it fails noisily when the
;; quotation is incorrect.
(cond
((eq nil constructor) `(parser-build-token (quote ',identifier)))
((listp constructor) `(,(parser-make-anon-func constructor) (match-beginning 0) (match-end 0)))
((functionp constructor) `(,constructor (match-beginning 0) (match-end 0)))
((symbolp constructor) `(quote ',constructor))

;; all other constructor types are un-handled.
((throw 'syntax-error (parser-diagnostic identifier
"parser token: identifier" "A symbol"))))
)

(defun parser-interp-token ( syntax ) ;; tested
"Translate a token definition into a parser match function."

(lexical-let
((identifier (car syntax))
(regex (cadr syntax)) ;; eval ? many regexs are stored in variables.
(constructor (caddr syntax)))

`(lambda ()
(if (looking-at ,regex)
(cons (parser-next) ,(parser-interp-token-action identifier constructor))
nil
))
))

(defun parser-interp-production ( production )
"Translate the definition of a production, or a reference to a production
into a match object symbol."

(cond
((listp production) (parser-compile-definition production))
((symbolp production) (parser-get-match production))

(throw 'syntax-error
(parser-daignostic production
"interpret definition"
"expected a definition as a list, or a symbol as a production/token reference"))
))

;;----------------------------------------------------------------------
;; compilation
;;----------------------------------------------------------------------

;; after definitions are interpreted they are evaluated or compiled and
;; stored as match functions.

(defun parser-compile-token ( syntax ) ;; tested
"compile a token definition into a match object"
(parser-make-match (car syntax) (parser-interp-token syntax)))

(defun parser-curry-production ( identifier combine-operator &rest grammar )
"Compile a match object with a combine operator and a match function list.
I think curry is applicable, but largely it was named curry so I could
create the parser-compile-production macro."
(unless (symbolp identifier)
(parser-diagnostic identifier
"compile production"
"match identifier"))

(parser-make-match identifier
`(lambda ()
(lexical-let
((result (apply ',combine-operator ',grammar)))
(if result
(cons (car result) (cons ',identifier (cdr result)))
nil)
))
))

(defmacro parser-compile-production ( combine-function production-list )
"parser-compile-production simplifies the syntax of interpreting and compiling
a production. The construct looks hairy because it combines two operations
with quoting necessitated by apply. This macro mechanizes the tricky part
to enhance code readability."
`(apply 'parser-curry-production ;; make a match function
(car ,production-list) ;; the identifier of the production
',combine-function ;; the combine operator
(mapcar 'parser-interp-production (cdr ,production-list)))) ;; interpret the matching definition.

(defun parser-compile-definition ( term )
"parser-compile-definition is the recursive heart of the compiler."
(unless (listp term)
(throw 'syntax-error (parser-diagnostic term
"parser definition"
"expected a definition of token|or|and")))

(lexical-let
((keyword (car term))
(syntax (cdr term)))

(if (listp keyword)
(parser-compile-definition keyword)

(cond
((eq keyword 'token) (parser-compile-token syntax))
((eq keyword 'or) (parser-compile-production parser-or syntax))
((eq keyword 'and) (parser-compile-production parser-and syntax))

((throw 'syntax-error (parser-diagnostic term
"parser definition"
"definition keyword token|or|and")))
))
))

(defvar parser-mtable-init-size 13
"initial size of the match-table objarray for storing match functions. the value
was chosen based on the recommendation of prime numbers for good hashing.")

(defmacro parser-compile ( parser &rest definition )
"compile a LL parser from the given grammar."
(let
;; create a symbol table to store compiled terminal and
;; non-terminal match functions
((match-table (make-vector parser-mtable-init-size 0)))

(lexical-let
((compiled (catch 'syntax-error
`(lambda ( start-pos )
(let
((parser-position (cons start-pos nil))) ;; initialize the backtrack stack
(save-excursion
(goto-char start-pos)
;; note that the start symbol of the grammar is built in as an or combination
;; of the top-level definitions.
(lexical-let
((parse (,(apply 'parser-curry-production 'start 'parser-or
(mapcar 'parser-compile-definition definition))
)))
(if parse
;; if we have a production return the position at which the
;; parser stopped along with the AST.
(cons (car parser-position) (cdr parse))
nil))
)))
)))
(if (stringp compiled)
;; error path, print a message and return nil.
(progn
(message "parser-compile failed! %%s" compiled)
nil)

;; success path, bind the compiled parser to the parser symbol and return t.
(progn
(fset parser (eval compiled))
t))
)))

(provide 'parser)

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFHh/a6dfRchrkBInkRArv4AJ4j3owP2ERVAPxMxm/PAbb32271kACgnIrQ
mfuPZ0eiCWq6Bl+OYACp0wQ=
=7tjm
-----END PGP SIGNATURE-----
no comments
diggit! del.icio.us! reddit!