
Why I don’t use a Parser Generator

Friday, 20 Jul 2012

Posted by mortoray in Programming

23 Comments

Parser generators, like ANTLR or Bison, seem like great tools. Yet when I have to write a parser I now tend to steer clear of them, resorting to writing one manually. When I posted my first entry about Cloverleaf I was asked why I don’t use these tools. It’s a fair question, as these tools take care of a lot of the work involved in parsing. In theory they are great tools. In practice I find a lot to be desired, and I end up fighting with the tool more than using it.

Lexing and Context

One key aspect that bothers me about many of the tools is the requirement for a distinct first lexing phase. What this means is that your source is first converted to a stream of tokens, and the tree generator then works from this stream. This is a fast approach to parsing. However, it has a serious limitation: it requires your tokens to be completely context-free.

At first this may not sound too bad, but it quickly complicates the parsing. This is especially true of domain-specific languages, where it is very convenient to vary tokens based on context. Consider a very simple example using this text file:

Name Edaqa
Age 37
Group 15-B
Phone +49.(0).123.456

Each line has a tag which identifies what type of data follows it. If you had to lex this first you wouldn’t be able to come up with a satisfactory way to handle the Age, Group, and Phone values. You’d be forced to accept a more generic string and post-parse it after the tree is generated. That doesn’t seem like a good approach to me; I find context-free lexing to be a serious limitation on parsing.
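In a hand-written parser the tag itself can pick how the rest of the line is read. A minimal C++ sketch of the idea (the Record type and the validation choices are illustrative, not taken from any real tool):

    #include <sstream>
    #include <stdexcept>
    #include <string>

    // Sketch: the tag just parsed decides how the rest of the line is lexed.
    struct Record { std::string name; int age = 0; std::string group, phone; };

    void parse_line(const std::string& line, Record& rec) {
        std::istringstream in(line);
        std::string tag;
        in >> tag;
        if (tag == "Name")       in >> rec.name;   // free-form word
        else if (tag == "Age")   in >> rec.age;    // must lex as an integer here
        else if (tag == "Group") in >> rec.group;  // e.g. 15-B; could validate digits-letter
        else if (tag == "Phone") in >> rec.phone;  // e.g. +49.(0).123.456
        else throw std::runtime_error("unknown tag: " + tag);
        if (!in) throw std::runtime_error(tag + ": malformed value");
    }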

Dynamic lexer tokens are also problematic to support. This may sound like an unnecessary feature, but it is surprisingly common. Consider something as simple as the Bash or Perl heredoc “<<END”, or C++11’s raw string support. These require the lexer to use part of the opening token as the terminator of the token. Indeed, just identifying the start of the heredoc notation can require semantic information.
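A hand-written lexer handles this naturally, because the terminator is just data pulled out of the opening token. A rough C++ sketch, assuming the leading “<<” has already been consumed (the function and its contract are illustrative):

    #include <stdexcept>
    #include <string>

    // Sketch: read a heredoc body whose terminator comes from the opening
    // token itself, as in "<<END ... END". 'pos' points just past the "<<".
    std::string lex_heredoc(const std::string& src, size_t& pos) {
        size_t tag_end = src.find('\n', pos);
        if (tag_end == std::string::npos)
            throw std::runtime_error("unterminated heredoc tag");
        std::string tag = src.substr(pos, tag_end - pos); // e.g. "END"
        std::string body;
        for (size_t at = tag_end + 1; at < src.size();) {
            size_t line_end = src.find('\n', at);
            if (line_end == std::string::npos) line_end = src.size();
            std::string line = src.substr(at, line_end - at);
            if (line == tag) { pos = line_end; return body; } // terminator found
            body += line;
            body += '\n';
            at = line_end + 1;
        }
        throw std::runtime_error("heredoc not terminated by " + tag);
    }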

Shift/Reduce and Grammar Conflicts

From my time using YACC I will never forget shift/reduce errors. While ANTLR greatly improved what is accepted, I still found myself faced with conflicts, or with grammars which simply did not parse the way I intended. These generators expect your grammar to be written in a particular way, and unless you can think like the generator you’re bound to have problems. The limitations on what the generator can understand are not theoretical limitations; they can affect even the most mundane of grammars.

This is the primary area where fighting with the generator occurs. In some cases the reported error was correct: there was indeed a conflict in the grammar. In many cases, however, there was no ambiguity. Or rather, the source language I was trying to parse had an unambiguous parsing solution. It was at times extremely difficult to convince the generator to parse it the way I wanted.

On a few occasions I was never able to convince the generator of what I wanted and had to alter the syntax of the language to accommodate it. This is terrible. The language should not be altered to fit the whims of a particular parsing tool. Sure, you do have to worry about ambiguity in the language and about parser efficiency, but there are many ways to address those without breaking the language.

Syntax Tree

What comes out of a parser generator is a syntax tree that follows the grammar you entered. Usually this is not the exact syntax tree you wish to have. Instead you’d like to reorder nodes, collapse a few, and expand others. Changing the tree structure can greatly reduce the burden of further processing.

When I used ANTLR for one of my projects I was happy to discover that tree reordering was supported. This is definitely a very useful feature. Still, I felt that it was limited, and I had trouble getting some of the structures I actually wanted. I really don’t want to need a post-processing phase that massages the resulting tree. In a hand-coded parser it is relatively straightforward to shape the tree however you want at parsing time.
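In a hand-coded parser the tree shape is whatever you construct, so the reshaping happens during the parse itself. A small C++ sketch of the idea (the Node type and the collapse rule are illustrative):

    #include <memory>
    #include <string>
    #include <utility>
    #include <vector>

    // Sketch: build the tree you actually want at parse time, rather than
    // emitting grammar-shaped nodes and massaging them afterwards.
    struct Node {
        std::string kind;
        std::vector<std::unique_ptr<Node>> children;
    };

    std::unique_ptr<Node> make_group(std::vector<std::unique_ptr<Node>> kids) {
        if (kids.size() == 1)
            return std::move(kids[0]); // collapse: a group of one is just its child
        auto n = std::make_unique<Node>();
        n->kind = "group";
        n->children = std::move(kids);
        return n;
    }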

Mixed Code

At first the promise of having a pure grammar seems really appealing. You don’t have to worry about target-language constructs, and ideally multiple projects could use the grammar from a variety of languages. The trouble is that, ultimately, none of my grammars have been functional without adding target-language code. There are just too many things the generators don’t handle. Your grammar ends up filled with C, C++, Java, or whatever language you are targeting.

Mixing code into the grammar may also seem like a good idea. The trouble is that you only have bits and pieces, rather disconnected from the wrapping code. I have a hard time determining exactly what is happening: what is the context of all this code, and what scope do the variables have? Not only is the target-language code unclear, the original grammar is now so cluttered that it isn’t clear anymore either. Plus, you’ve created such a tight bond between the grammar and the target language that maintenance becomes problematic.

Other Limitations

Quite often you’ll have sections of text which just need to be parsed differently from other sections. This goes beyond simple lexing changes. For example, if you are writing a C compiler you will also want to support inline assembly, which has an entirely different syntax. This seems completely antagonistic to most parser generators I’ve seen.
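In a hand-written parser, switching syntaxes is just a call to a different routine. A toy C++ sketch (everything here is illustrative scaffolding):

    #include <string>

    // Sketch: when the statement parser sees "asm" it hands control to an
    // entirely different sub-parser with its own rules over the same stream.
    struct Stmt { std::string kind; };

    struct Parser {
        std::string input;
        size_t pos = 0;

        bool peek_keyword(const std::string& kw) const {
            return input.compare(pos, kw.size(), kw) == 0;
        }
        Stmt parse_c_statement() { return {"c-statement"}; } // stub
        Stmt parse_asm_block()   { return {"asm-block"}; }   // stub: own syntax

        Stmt parse_statement() {
            if (peek_keyword("asm"))
                return parse_asm_block(); // a completely different grammar here
            return parse_c_statement();
        }
    };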

Getting location information is a hassle, and never seems natively supported. In particular, when a parsing error does occur, it is hard to get the parser to indicate where in the source file it failed. Often the structure of the parser, whatever mechanism it uses, prevents it from identifying the error at a reasonable location (it has simply tried another rule instead of indicating an error). Even when the input does parse, it can be difficult to get the line and column number of a construct.
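In a hand-written lexer, carrying this information costs only a few lines. A minimal C++ sketch:

    #include <string>

    // Sketch: a character reader that tracks line and column, so every token
    // can be stamped with the position where it started.
    struct Pos { int line = 1, col = 1; };

    struct Reader {
        std::string src;
        size_t i = 0;
        Pos pos;

        bool done() const { return i >= src.size(); }
        char next() {
            char c = src[i++];
            if (c == '\n') { ++pos.line; pos.col = 1; }
            else           { ++pos.col; }
            return c;
        }
    };
    // An error report like "7:12: expected ')'" is then trivial to produce.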

So I don’t use them

I don’t doubt that an expert in a particular generator would be able to overcome at least a few of the issues I have. I’m not sure I should have to be an expert, though, to gain access to what I consider fundamental, necessary features. The goal of a domain-specific tool, which these are, is to make my job easier, not harder. So far I haven’t found one where that is the case (I’ve been through Yacc, Bison and ANTLR v3).

It is with a bit of regret that I don’t use such generators. They have some nice technology in them and can generate fast parsers. Each time I sit down and start writing another map of regexes and switch blocks I start longing for a better way. Unfortunately, a simple recursive descent parser has always served me well.

23 thoughts on “Why I don’t use a Parser Generator”

  1. Vladimir Levin said:

    Recursive descent means you have to re-write significant chunks of your parsing code when you make changes to your language, no?

    Also, this raises the question: how are existing major languages parsed? I thought C and C++ used one of the gnu parsers... no? Due to a typo I now want to create a project called ngu :)


    • Recursive descent makes reworking a language very straightforward. You usually only need to rework a few functions to support the new syntax, whether a new token or a new structure at that point. You change about as much of the code as you’ve changed of the language.

      I don’t know what most compilers use, but I’d suspect they don’t use generators, primarily since generators that can even handle something like C++ are still quite new (and it is still a challenge).

    • Unless it’s changed in the past year or so, Java uses a hand-written parser.

  2. Eagle said:

    What about libraries like PEGTL that allow the grammar to be embedded in the source?


    • I’ve not had a lot of experience with these (Boost Spirit being the one I briefly looked at). On the surface they might address the mixed code issue, but when I look at examples that doesn’t appear to be the case. The C++ template stuff is so different from normal C++ it’s almost a different language — but it actually isn’t, so that’s a plus.

      They also wouldn’t make the grammar language-independent.

      Whether they satisfy the needs for lexer freedom and syntax trees is hard to tell without an in-depth look. There is nothing inherent about their being source-code-based that would address these issues.

  3. Another benefit to rolling your own parser is that you can have much more precise control over the error messages generated in the event of syntax errors. A parser generator can only know so much about the intent of your grammar, and hence can only offer a limited degree of user-friendly diagnostics.


    • Agreed. I think this was one problem I had once. The generator couldn’t decide whether the rule just didn’t match (and thus must backtrack) or the user had made a syntax error.
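      In a hand-written parser that distinction is explicit. A small C++ sketch (all of the scaffolding here is illustrative, not from a real parser):

        #include <stdexcept>
        #include <string>

        // Sketch: once "if" is consumed the parser is committed to the rule,
        // so a mismatch after that point is a syntax error, never a backtrack.
        struct Parser {
            std::string src;
            size_t i = 0;

            bool try_keyword(const std::string& kw) { // may fail without consuming
                if (src.compare(i, kw.size(), kw) != 0) return false;
                i += kw.size();
                return true;
            }
            void expect(char c) { // committed: failure here is the user's error
                if (i >= src.size() || src[i] != c)
                    throw std::runtime_error(std::string("expected '") + c + "'");
                ++i;
            }
            void parse_if() {
                if (!try_keyword("if")) return; // not an if: caller tries another rule
                expect('(');                    // past this point, never backtrack
                // ... condition and body would be parsed here ...
                expect(')');
            }
        };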

  4. Thanks for this. For one of my projects I hand-coded three parsers (nothing fancy, but not trivial), and I’ve always had a nagging fear that someone examining my code would think I was being dumb for doing it the hard way.

    For one of the parsers it isn’t line and column I need, but byte offset into the source file.

    For the other parsers, I rolled my own because frankly it just seemed easier. I like being able to debug the parsing directly instead of debugging the generation of the parser.


  5. Ulya said:

    When writing a parser for a language with a more or less stable grammar, one can afford to code the parser by hand (mostly for performance reasons). However, such a parser is coded once and for all.

    But there are projects where you have to use parser generators, because your grammar changes from day to day (or becomes totally new, or you have too many grammars).

    I completely agree that it’s hard and inconvenient to use parser generators; you may spend more time resolving conflicts than hand-coding a parser, and the resulting parser may be slow and ugly.

    The only case where I would ultimately prefer a parser generator is parsing regular expressions with RE2C.


    • My experience has been that changing a well-written hand-made parser is far easier than maintaining a parser generator file. Each time I’ve used a generator I end up hitting a point where I cannot remove or modify the grammar easily without breaking something.

    • Ulya said:

      Rewriting parts of a recursive descent parser, when the language or format it conforms to has changed a bit, may be a lot easier than modifying a parser generator file.

      But if 1) you have to totally rewrite your parser every day (as a planned part of the development process), and 2) the grammar changes in layout, not in complexity, it may be far easier to rewrite the grammar file.

      IMHO, parser generators are especially useful for parsing regexps: it’s easy to define regexps, and hard to build a complex DFA. And mostly you can’t use regexps for anything greater than lexing, so you won’t face conflicts changing your grammar.

    • As part of my Leaf project I continually change parts of my grammar. Though perhaps for highly regular languages, those which are nearly context-free and require no look-ahead, a generator may be better. But even for something like math expression evaluation I’d still choose hand-written.

  6. I’m not a compiler expert, though I have written a couple of parsers, sometimes for quite complex grammars, using everything from hand-written recursive descent parsers to lex/yacc, Irony parsers (a C# parser where the grammar is declared in the language), GoldParser, ANTLR v3 and ANTLR v4. While there are indeed some valid points described in this post, I would like to revisit them with ANTLR v4, as the new version is really able to tackle some of your concerns nicely and efficiently.

    1) Context lexer/parser: ANTLR v4 comes with predicates to modify the grammar or behavior of the lexers and/or parsers based on dynamic contexts. You can even switch the grammar to contextual modes, to accept specific tokens for parts of the grammar at specific points, etc. There is also the opportunity to develop a “filter” lexer on top of the existing generated lexer in order to generate or modify the flow of tokens (for example, to correctly handle indentation-based grammars like Python’s).

    2) Shift/Reduce and Grammar Conflicts: definitely yes, with either lex/yacc or ANTLR3 I sometimes had a serious fight to correctly handle a grammar. But ANTLR v4 is just fantastic for this, as it allows left-recursive grammars (with very few restrictions). Take a look at the ANTLR v4 grammar for Java, for example (https://github.com/antlr/grammars-v4/blob/master/java/Java.g4 ); the way it describes an expression is just ridiculously easy. So far, working with ANTLR v4, I have not been hit by any kind of grammar conflict.

    3) Concerning syntax trees, I usually don’t use them directly, as I’m always writing an AST that is independent of the parser. AFAIK, ANTLR v4 doesn’t have syntax rewriting capabilities, but the new listener/visitor patterns make it quite easy to build a custom AST.

    4) Mixed code: I agree. But again, ANTLR v4 allows you to write a grammar almost entirely decoupled from the generated code (see the Java grammar). You can still use custom embedded actions in ANTLR, but most of the time they are not necessary and the work can be done entirely outside the grammar definition. Predicates in the grammar are usually quite light and don’t mess up the readability of the grammar.

    5) Concerning parser context modes, as I said, they work nicely in ANTLR v4. For source location, I have not experienced much of your problems with most of the parsers I have been using. I think that ANTLR v4 is quite good at hitting the right spot, column-wise.

    Plus, on the things that you don’t mention, and with ANTLR v4 in mind, I would say:
    - Creating a full, effective grammar from scratch takes just a few hours, even for a quite complex language (languages like C#, Java, Python, etc. are really easy to describe). You are much more productive this way.
    - Developing an efficient token lexer is laborious, mindless work; having a generator that creates optimized DFA tables for this task greatly simplifies it.
    - Error recovery is quite impressive in ANTLR v4. Achieving the same level of recovery (automatic token deletion, missing-token insertion, continuing to match a rule list, etc.) in a hand-written parser is quite difficult (particularly with backtracking).
    - Developing an efficient hand-written parser is not always as easy as it seems. Particularly when you need backtracking, the performance of a recursive descent parser can be badly hurt if this is not handled correctly (and that is absolutely not a simple task). Worse, the code of a recursive descent parser can become much more difficult to maintain and understand because of all the grammar ambiguities and the performance hacks needed to handle them.

    So for all these reasons, if you care about productivity, I strongly encourage developers to use a parser generator like ANTLR v4.

    I would still recommend developing some hand-written parsers in some specific cases like:
    - For learning how to write a parser :)
    - For performance reasons, though quite often the bottleneck in a parser is the lexer, and naive hand-written lexers will most certainly perform much worse than an efficient generated DFA-table lexer.


    • Productivity is a big reason why I don’t like the generators. When you first start a project you can put together a parser really fast (with ANTLR at least; other generators, not so fast). But then you hit a complex point and spend many more hours trying to fix the generator than you would have spent writing your own code. Granted, if you’re not already experienced with writing parsers, the generator approach will likely be faster.

      All you say about ANTLR4 sounds promising, but it reinforces my feeling that the generators are always playing catch-up. They always need new additions and extensions to handle new languages and language changes. It’s a legitimate fear that if you use a generator you’ll inevitably hit some syntax you wish to implement but can’t.

      With Leaf I’m cheating a bit. Since it’s a whole new language I’m ensuring the grammar is friendly to parse. It’s recursive descent but doesn’t require any back-tracking (well, debatable since I have a converter stage that actually rewrites one parse tree). Once I have a reasonably stable syntax I might try writing it in a generator. I suspect it’ll work in ANTLR4 based on your description.

  7. Take a look at CoCo/R. The produced code is readable, and traceable in the debugger. You resolve LL(1) conflicts with ‘if’ statements. As far as not wanting to mix code and grammar: I understand. I like to keep the code embedded in the grammar as simple as possible; the real work is done elsewhere.


    • I’m not sure I want to resolve the LL(1) conflicts; I just want them not to be there. That is, in many cases I seem to get conflicts even though my source language doesn’t appear to have them, and a recursive descent parser wouldn’t need the ‘if’ statement (the context only allows one possible answer).

  8. Craig said:

    Concluding against an entire category of tools just because you’ve used a couple of dated and ineffectual ones is a bit odd.

    Seems like a case of: https://en.wikipedia.org/wiki/Availability_heuristic

    There are plenty of other tools based on formalisms like PEG that solve basically all of the problems you mention.


    • Each tool I used was selected after considering many popular options. Each was considered a good tool at the time. Should a new tool arrive that solves all of my problems, that would be great! I still need parsers and would be happy not to write my own. My experience so far, however, tells me the likelihood of finding such a solution is still quite low.

  9. evilotto said:

    Have you considered using a parsing expression grammar (PEG) parser? They inherently eliminate the separate lexing phase and shift/reduce conflicts, and in general work more like a recursive descent parser that you would write by hand. I first discovered them a few months ago and have found them quite pleasant to work with. I’m using ‘pt’ (parse tools) from the Tcl standard library to generate parsers on the fly, or fast C-coded ones.


    • The disadvantages listed on Wikipedia leave me worried that I’d just be trading one set of problems for a new set. It does seem more flexible than some of the other parsers I used, but it still forces me to think in the grammar of the generator rather than in the grammar of the language I’m trying to parse.

      https://en.wikipedia.org/wiki/Parsing_expression_grammar

  10. Achilleas Margaritis said:

    For all the reasons you mention (and some more), I’ve built my own parser library (if you’re interested, it’s here: https://github.com/axilmar/parserlib).

    Highlights of my library: a) separation of grammar from AST, b) handling of left recursion.

    The parser is a PEG parser.


    • Since I started writing my Leaf compiler the list of things a generator would have to do for me has grown quite a bit. For example, I want to do expression parsing based on a generic syntax matched with a table of priorities for the operators. Would a PEG parser be capable of doing that, or would I have to hard-code the precedence rules in the grammar by making each operator an explicit rule?
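      Concretely, the table-driven parsing I have in mind looks roughly like this; just a sketch, with tokens pre-split for brevity, and not a claim about how Leaf actually does it:

        #include <map>
        #include <string>
        #include <vector>

        // Sketch: generic binary expressions driven by a priority table
        // (precedence climbing), instead of one grammar rule per operator.
        struct Node {
            std::string op;    // operator, or empty for a leaf
            std::string leaf;  // leaf token
            Node* lhs = nullptr;
            Node* rhs = nullptr;
        };

        static const std::map<std::string, int> kPriority = {
            {"+", 1}, {"-", 1}, {"*", 2}, {"/", 2},
        };

        Node* parse_expr(const std::vector<std::string>& toks, size_t& i, int min_prio) {
            Node* left = new Node{"", toks[i++]}; // a bare leaf; parens elided
            while (i < toks.size()) {
                auto it = kPriority.find(toks[i]);
                if (it == kPriority.end() || it->second < min_prio) break;
                std::string op = toks[i++];
                Node* right = parse_expr(toks, i, it->second + 1); // left-assoc
                left = new Node{op, "", left, right};
            }
            return left;
        }
        // On {"1","+","2","*","3"} with min_prio 0 this yields 1 + (2 * 3).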

