This series of blogposts is amusing but it is kind of frustrating too: the end r...

chrisseaton · on Aug 19, 2021

Lots of compiler tutorials are like this - there's very little out there to explain how compilers really work.

This is my effort - trying to show genuine data structures and processes.

divtiwari · on Aug 19, 2021

Hey, I just skimmed through this, it seems to be brilliant!! Just one question, can one go through this tutorial without much knowledge about Ruby? Should one read it with 'Ruby Under a Microscope'? I have a decent enough knowledge about compilers, have gone through Thorsten Ball's, 'Writing an Interpreter in Go'.

chrisseaton · on Aug 19, 2021

Only need to know basic Ruby programming - no libraries or deep knowledge.

But the project is unfinished I’m afraid - you will find dead-ends.

divtiwari · on Aug 20, 2021

Oh I see, thanks for the info!! I'll see what I can gain out of it. Thanks once again for the info!!

junon · on Aug 19, 2021

It could not. Unless your language is regular class, then regular expressions cannot parse the language accurately.

im3w1l · on Aug 19, 2021

You can build a parser around regexes though, where most of the code is regexes and then you have a little bit of code to deal with the irregularity. For instance consider arithmethic expressions consisting of constants, -, +, *, /, and parentheses. You could evaluate that using something like (expression to parse a numeral is left as exercise to reader).

   while expression is not a numeral
       replace all "\((NUMERAL)\)" with first group
       if find first "(NUMERAL)([*/])(NUMERAL)"
         replace with result
       else if find first "(NUMERAL)([+-])(NUMERAL)"
         replace with result

ashton314 · on Aug 19, 2021

What you are doing there is conflating the lexing phase and the parsing phase. Regexes are perfect for recognizing tokens, but for a language with nested parentheses, you must have a push-down automaton to process it. Otherwise, you will not be able to verify that your delimiters are matched.

im3w1l · on Aug 19, 2021

That's one way of looking at it I suppose. But it does require stretching the definition of token a little bit if you are eg using regexes to identify arbitrarily long multiplications. I think my prefered perspective is that you are using regexes to parse regular sublanguages and then something more powerful (e.g. a push down automaton, but there are even more powerful parsers than that) to bind those sub-parsers together.

p4bl0 · on Aug 19, 2021

That's why I said "if feels like".

But nonetheless even if regexps would not be able to validate syntax, it does not mean that the source-to-source transformation could not be (mostly) achieved using them, because the source and target language are quite similar.

Imagine a new language exactly like C but every semicolon is replaced by a duck "\_o<". You could not parse it using regexps, but regexp could mostly "compile" it to C as it only requires to look for all "\_o<" to replace them with ";".

beecafe · on Aug 19, 2021

This wouldn't work as you could have a semicolon in a string, to handle that you would need to know when you're in a string, and due to the existence of backslash escapable quotes in strings, you can't do that with a regex*

*A real Regular Expression. The extended versions with e.g backrefs can. But they're also Turing complete anyway

fjfaase · on Aug 19, 2021

Actually, I think that the grammar for a C string with escapes characters is regular. The only bit I am worried about are the back-slash octal and back-slash hexadecimal notations. But they are not relevant if you want to detect escaped quotes in strings.

p4bl0 · on Aug 19, 2021

I know that. But admit that it is nitpicking and completely besides the point.

gnubison · on Aug 20, 2021

No, it’s not beside the point at all. It disproves your statement for any real language.

p4bl0 · on Aug 20, 2021

Sigh. My point was that it _feels_ like this because the tutorial does not actually do an actual compiler's work. It is a quite simple rewrite to a very similar language with the same level of abstraction. No language constructs have to be compiled, i.e., to be automatically reimplemented in a target language which doesn't have this.

Now, if your only concern is that everything that people feel and express have to be technically correct, well, have fun in your life.

However, if you actually want to learn about compilation it would be a better exercise to use way simpler languages but actually do the work, for example: write a simple stack VM which can only do NAND operation, and then parse and compile to it a boolean expression language which has 0, 1, not, and, or, xor. There you will need to build a simple AST and traverse it one time to translate the basic boolean operations to use only nand, and then a second time to compile the nand expression to your VM's assembly language.

p4bl0 · on Aug 20, 2021

Here you go: https://code.up8.edu/-/snippets/3

peepholeoptim · on Aug 19, 2021

C is his hammer, but tenure is tenure.