This series of blogposts is amusing but it is kind of frustrating too: the end result is merely a compiler. It is a source to source translator from a source language which is very very similar to a subset of the target language. Well I guess you can call that a compiler, but it really doesn't teach the readers what an actual compiler looks like. Exaggerating a little, I would say that it feels like a few calls to sed could do the same thing using regexp.
Hey, I just skimmed through this, it seems to be brilliant!! Just one question, can one go through this tutorial without much knowledge about Ruby? Should one read it with 'Ruby Under a Microscope'? I have a decent enough knowledge about compilers, have gone through Thorsten Ball's, 'Writing an Interpreter in Go'.
You can build a parser around regexes though, where most of the code is regexes and then you have a little bit of code to deal with the irregularity. For instance consider arithmethic expressions consisting of constants, -, +, *, /, and parentheses. You could evaluate that using something like (expression to parse a numeral is left as exercise to reader).
while expression is not a numeral
replace all "\((NUMERAL)\)" with first group
if find first "(NUMERAL)([*/])(NUMERAL)"
replace with result
else if find first "(NUMERAL)([+-])(NUMERAL)"
replace with result
What you are doing there is conflating the lexing phase and the parsing phase. Regexes are perfect for recognizing tokens, but for a language with nested parentheses, you must have a push-down automaton to process it. Otherwise, you will not be able to verify that your delimiters are matched.
That's one way of looking at it I suppose. But it does require stretching the definition of token a little bit if you are eg using regexes to identify arbitrarily long multiplications. I think my prefered perspective is that you are using regexes to parse regular sublanguages and then something more powerful (e.g. a push down automaton, but there are even more powerful parsers than that) to bind those sub-parsers together.
But nonetheless even if regexps would not be able to validate syntax, it does not mean that the source-to-source transformation could not be (mostly) achieved using them, because the source and target language are quite similar.
Imagine a new language exactly like C but every semicolon is replaced by a duck "\_o<". You could not parse it using regexps, but regexp could mostly "compile" it to C as it only requires to look for all "\_o<" to replace them with ";".
This wouldn't work as you could have a semicolon in a string, to handle that you would need to know when you're in a string, and due to the existence of backslash escapable quotes in strings, you can't do that with a regex*
*A real Regular Expression. The extended versions with e.g backrefs can. But they're also Turing complete anyway
Actually, I think that the grammar for a C string with escapes characters is regular. The only bit I am worried about are the back-slash octal and back-slash hexadecimal notations. But they are not relevant if you want to detect escaped quotes in strings.
Sigh. My point was that it _feels_ like this because the tutorial does not actually do an actual compiler's work. It is a quite simple rewrite to a very similar language with the same level of abstraction. No language constructs have to be compiled, i.e., to be automatically reimplemented in a target language which doesn't have this.
Now, if your only concern is that everything that people feel and express have to be technically correct, well, have fun in your life.
However, if you actually want to learn about compilation it would be a better exercise to use way simpler languages but actually do the work, for example: write a simple stack VM which can only do NAND operation, and then parse and compile to it a boolean expression language which has 0, 1, not, and, or, xor. There you will need to build a simple AST and traverse it one time to translate the basic boolean operations to use only nand, and then a second time to compile the nand expression to your VM's assembly language.