Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

This series of blogposts is amusing but it is kind of frustrating too: the end result is merely a compiler. It is a source to source translator from a source language which is very very similar to a subset of the target language. Well I guess you can call that a compiler, but it really doesn't teach the readers what an actual compiler looks like. Exaggerating a little, I would say that it feels like a few calls to sed could do the same thing using regexp.


Lots of compiler tutorials are like this - there's very little out there to explain how compilers really work.

This is my effort - trying to show genuine data structures and processes.

https://github.com/chrisseaton/rhizome


Hey, I just skimmed through this, it seems to be brilliant!! Just one question, can one go through this tutorial without much knowledge about Ruby? Should one read it with 'Ruby Under a Microscope'? I have a decent enough knowledge about compilers, have gone through Thorsten Ball's, 'Writing an Interpreter in Go'.


Only need to know basic Ruby programming - no libraries or deep knowledge.

But the project is unfinished I’m afraid - you will find dead-ends.


Oh I see, thanks for the info!! I'll see what I can gain out of it. Thanks once again for the info!!


It could not. Unless your language is regular class, then regular expressions cannot parse the language accurately.


You can build a parser around regexes though, where most of the code is regexes and then you have a little bit of code to deal with the irregularity. For instance consider arithmethic expressions consisting of constants, -, +, *, /, and parentheses. You could evaluate that using something like (expression to parse a numeral is left as exercise to reader).

   while expression is not a numeral
       replace all "\((NUMERAL)\)" with first group
       if find first "(NUMERAL)([*/])(NUMERAL)"
         replace with result
       else if find first "(NUMERAL)([+-])(NUMERAL)"
         replace with result


What you are doing there is conflating the lexing phase and the parsing phase. Regexes are perfect for recognizing tokens, but for a language with nested parentheses, you must have a push-down automaton to process it. Otherwise, you will not be able to verify that your delimiters are matched.


That's one way of looking at it I suppose. But it does require stretching the definition of token a little bit if you are eg using regexes to identify arbitrarily long multiplications. I think my prefered perspective is that you are using regexes to parse regular sublanguages and then something more powerful (e.g. a push down automaton, but there are even more powerful parsers than that) to bind those sub-parsers together.


That's why I said "if feels like".

But nonetheless even if regexps would not be able to validate syntax, it does not mean that the source-to-source transformation could not be (mostly) achieved using them, because the source and target language are quite similar.

Imagine a new language exactly like C but every semicolon is replaced by a duck "\_o<". You could not parse it using regexps, but regexp could mostly "compile" it to C as it only requires to look for all "\_o<" to replace them with ";".


This wouldn't work as you could have a semicolon in a string, to handle that you would need to know when you're in a string, and due to the existence of backslash escapable quotes in strings, you can't do that with a regex*

*A real Regular Expression. The extended versions with e.g backrefs can. But they're also Turing complete anyway


Actually, I think that the grammar for a C string with escapes characters is regular. The only bit I am worried about are the back-slash octal and back-slash hexadecimal notations. But they are not relevant if you want to detect escaped quotes in strings.


I know that. But admit that it is nitpicking and completely besides the point.


No, it’s not beside the point at all. It disproves your statement for any real language.


Sigh. My point was that it _feels_ like this because the tutorial does not actually do an actual compiler's work. It is a quite simple rewrite to a very similar language with the same level of abstraction. No language constructs have to be compiled, i.e., to be automatically reimplemented in a target language which doesn't have this.

Now, if your only concern is that everything that people feel and express have to be technically correct, well, have fun in your life.

However, if you actually want to learn about compilation it would be a better exercise to use way simpler languages but actually do the work, for example: write a simple stack VM which can only do NAND operation, and then parse and compile to it a boolean expression language which has 0, 1, not, and, or, xor. There you will need to build a simple AST and traverse it one time to translate the basic boolean operations to use only nand, and then a second time to compile the nand expression to your VM's assembly language.



C is his hammer, but tenure is tenure.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: