Hacker News

Parser generators are a different issue than lexer generators imo. With a lexer, usually it ends up being a lot easier to do it by hand, because otherwise you end up fighting the system or writing awful regular expressions. Parser generators at least let you use nice formats like BNF.

Edit: also your lexical spec is much less likely to change than your grammar is.
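To make the "easier by hand" point concrete, here is a minimal sketch of a hand-rolled lexer in Python: a single dispatch loop over characters, no regexes and no generator tool. The token names and operator set are just illustrative:

```python
# Minimal hand-written lexer sketch: one linear pass over the input,
# dispatching on the current character class.
def lex(src):
    tokens, i = [], 0
    while i < len(src):
        c = src[i]
        if c.isspace():
            i += 1                          # skip whitespace
        elif c.isdigit():
            j = i
            while j < len(src) and src[j].isdigit():
                j += 1
            tokens.append(("NUM", src[i:j]))
            i = j
        elif c.isalpha() or c == "_":
            j = i
            while j < len(src) and (src[j].isalnum() or src[j] == "_"):
                j += 1
            tokens.append(("IDENT", src[i:j]))
            i = j
        elif c in "+-*/()=":
            tokens.append(("OP", c))
            i += 1
        else:
            raise SyntaxError(f"unexpected character {c!r} at {i}")
    return tokens
```

For example, `lex("x = 42 + y")` yields `[("IDENT", "x"), ("OP", "="), ("NUM", "42"), ("OP", "+"), ("IDENT", "y")]`. Extending it (string literals, comments, multi-char operators) is mostly adding branches, which is where fighting a generator's regex language tends to start.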



so i really just want to say something like, "yeah, but Parsec is still awesome," but then i saw someone got downvoted for a similar comment with Rob Pike in place of Parsec, so what can i add to the discussion?

perhaps just that it's important to keep in mind that computer systems are things we've worked out for ourselves. these notions of "lex" and "parse" are not things given to us by nature...

...although, they do kind of fall out of the circumstances of (1) needing to emit x86 instructions and (2) the preference for writing programs in text editors. the second point gets a lot of discussion what with ideas about editors understanding parse trees and all, but i wonder what happens to the whole lexer-parser dichotomy if we keep (2) and rethink our notions of hardware. does the need for tokenization appear as a result of something fundamental to the von neumann architecture, or is it just a result of currently-vogue instruction sets?

ah well, back to making things happen with the tools at hand.


The argument for lexers has nothing to do with machine instructions: it has to do with algorithmic performance. General parsing algorithms tend to have worse-than-linear time complexity with a moderate constant (and a parser combinator library, which is really just a lightweight programming abstraction that makes it easier to develop by-hand recursive descent parsers, normally doesn't try to help with this problem), whereas a good lexer tends to have linear time complexity with a nearly zero constant. If you can separate your compilation phase into "first run a lexer, then run the grammar", you can parse much longer files (not just constantly longer, but asymptotically longer) in the same amount of time. There is no fundamental reason to separate these phases, however, and it has minimal effect on the resulting compiler: numerous compilers, even numerous compiler generators, combine these phases into a single grammar step, with characters effectively serving as the tokens. (There are also engines that try to slide between the two levels, using an algorithm more similar to a lexer at the base of your grammar, but still letting you define these pseudo-tokens as rules in the unified grammar.)
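The two-phase pipeline described above can be sketched as: a linear-time tokenizer runs first, and the recursive-descent grammar then ranges over tokens instead of raw characters. This is a toy for sums of integers only; the regex and function names are illustrative, not any particular tool's API:

```python
# Phase 1: linear-time lexer. findall skips whitespace and emits
# either a number or a single non-space operator character.
import re

TOKEN = re.compile(r"(\d+)|(\S)")

def tokenize(src):
    return [("NUM", int(num)) if num else ("OP", op)
            for num, op in TOKEN.findall(src)]

# Phase 2: recursive-descent "grammar" over the token list.
# expr := term ('+' term)*
def parse_expr(toks, i=0):
    val, i = parse_term(toks, i)
    while i < len(toks) and toks[i] == ("OP", "+"):
        rhs, i = parse_term(toks, i + 1)
        val += rhs
    return val, i

def parse_term(toks, i):
    kind, v = toks[i]
    if kind != "NUM":
        raise SyntaxError(f"expected number at token {i}")
    return v, i + 1
```

The parser never re-examines characters: by the time it runs, "42" is one unit, so whatever backtracking or lookahead the grammar does is paid per token, not per character, which is exactly the asymptotic win from separating the phases.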


I've personally found that even when working with parser generators or handwritten parsers that don't need the separation, it still helps to treat lexing and parsing as separate. Having a stable, solid set of fundamental tokens makes the resulting language easier to understand. Whenever I see a language (usually made with a PEG generator) that blurs these lines, everything feels very shifty. Yes, I know these are touchy-feely attributes I'm describing, and you can obviously achieve the same qualities without that separation, but they matter nonetheless.

It's an interesting accident, to me at least, that this separation turned out to be both optimal and useful.



