Crafting Interpreters: Strings (craftinginterpreters.com)
107 points by benhoyt on Sept 24, 2018 | hide | past | favorite | 40 comments


Author here! Happy to answer questions, take feedback, receive criticism.

Waves awkwardly, shuffles feet.


Hey Bob. Still a big fan of this project. Thank you!

This chapter introduces not only strings but the concept of more complex value types that use an auxiliary struct. There's an anecdote related to that which you might find amusing - and maybe a bit appalling.

In the implementation of perl, there are multiple kinds of integer value type, in order to support its complex implicit casting semantics and additional edge cases. To pick the least wonky pair: there are plain integer values, "IVs", and ones that also carry a stringified representation of the int to cache the expensive conversion, "PVIVs". The former store the int in the primary value struct. The latter use an auxiliary struct with extra space.

Now, the reason for all this prep is that, to make the two behave the same without introducing a lot of branches in low-level code, perl always uses indirect access to the integer value (essentially "x->aux.intval"). This works because, for basic integer values, it initializes the aux pointer in the main struct to point to a spot in memory *before* the main struct, chosen so that the indirect access always lands on the right slot of the main struct itself. Madness! So much fun!

(A better explanation is at https://medium.com/booking-com-development/how-we-spent-two-...)


Shenandoah, one of the GCs in Java, uses a similar trick. On every field access, instead of reading the field directly, the VM first reads a pointer stored at the top of the object that points just below itself, so each field is read through a double indirection. This allows the application to keep running while the GC moves objects (during the evacuation phase): each time the GC relocates an object, it also updates the top pointer in the old region to reference the new region. The GC also rewrites the pointers inside the stacks, replacing each pointer to the old region with the value of the top pointer, so the pointers in the stack reference the new region too.


Ah, and I'd always wanted to read up on that and never prioritized it. Thank you!

This might be a naive question, but how do you do all that redirection under concurrent accesses? Clearly that amount of rewiring isn't going to be atomic? Or am I misunderstanding, and the only change that needs to be atomic is the top-of-structure redirection?


As you said, only the top pointer needs a CAS.

If you want more info, there is a good presentation by Christine Flood about Shenandoah at FOSDEM 2017: https://archive.fosdem.org/2017/schedule/event/shenandoah/


This is fascinating!


Is that code for "nausea-inducing"? ;)

The perl implementation is full of this sort of stuff. Some corners are hardly penetrable because of it. I'll be visiting the offices in MSN and KIR/SEA next week. Happy to rant over coffee or a drink some night if timing works out.


FYI stray period at the end of your link breaks it :)


Fixed, thank you!


I'm curious what benchmarking / research you've done on how much difference it makes to use a compact value representation (like the NaN tagging used in Wren, or using the low bits of a pointer to specify object type). Is it just a couple of percent perf improvement, or quite a large amount?

Also, either with a "compact" representation (Wren) or a less compact one like Lox, have you tried storing small strings in the object itself (<= 7 bytes for a 64-bit object size, or <= 15 bytes for two 64-bit words), and how much difference would that make?


> Is it just a couple of percent perf improvement, or quite a large amount?

This is a really, really hard question to answer with any generality. Especially when you talk about percentage perf improvements, since the performance of the rest of the system directly impacts the gain you see from changing the value representation.

At the time that I implemented it in Wren, using NaN tagging made about a 10% improvement in the handful of microbenchmarks I was using at the time, on my machine. You can toggle NaN tagging on and off with a #define, so it's worth checking the difference today on different benchmarks and on top of the other changes I've made to Wren since then.

The last I checked, Wren spent a large fraction of time in instruction decoding and dispatch (not uncommon for a bytecode VM). In an interpreter where that's more efficient (because of a JIT, threaded code, superinstructions, peephole optimizations, etc.), the value representation will probably make a larger difference.

> have you tried storing small strings in the object itself (<= 7 bytes for a 64-bit object size, or <= 15 bytes for two 64-bit words), and how much difference would that make?

I haven't, though I've briefly toyed with it. With Wren, I'm trying really hard to keep the implementation small and simple, so I try to avoid special-case optimizations like this that add a lot of complexity.

It might be worth it, but I'm not sure. It's a hard call to make. Each optimization or little feature on its own seems small, but when you add them all up, it can be the difference between an approachable, hackable codebase, and a big sprawling one.


Thank you so much for all the effort! I can only imagine - vaguely - how much hard work must go into writing such a book.


Here's the log I keep which tracks what I did every day I worked on it (except for a few months in the beginning when the project was still kind of nebulous):

https://github.com/munificent/craftinginterpreters/blob/mast...


Wow, that looks like even more work than I had imagined. I hope you enjoy writing that book as much as I enjoy reading it!

> 2018/03/29 - pancakes

:)


As always, this series is fantastic. Thank you so much for writing it.


You're welcome! I'm still planning to read the Rust book when I get some time. Other people who care about technical writing like you are an inspiration.


There are people who put out books for money, and I can see you are not one of them. I've really enjoyed the passion illustrated in your continually updated series.

(Bravo)


Are the 1st and 2nd parts mutually related? Is it ok if I start reading from the Bytecode VM part?

PS: I am enjoying the design and writing. Kudos sir!


They're pretty loosely-coupled. The bytecode VM part assumes you are familiar with the language itself, which is described in an introductory chapter [1]. It also assumes you're familiar with most of the basic structure of an interpreter and the concepts around parsing and stuff, since those are introduced when we do the tree-walk interpreter. If you know those already, you can probably skip straight to the bytecode VM. You can always consult earlier chapters if you get stuck.

[1]: http://www.craftinginterpreters.com/the-lox-language.html


Thank you very much! This is what I will do!


No question, just a big thank you. I followed along the first half and wrote my own Lox using Python. I learned a ton about a domain I would never have and may never use professionally. But writing my own interpreter helped me understand Python and love the magic of programming all that much more.


I love this series.

Implemented my own interpreter off the back of the original series. Managed to get a type system and all sorts added.

The author was even responsive via email, giving me hints on where to investigate extending it further. (Believe I was trying to add lists at the time.)

Having followed a ton of "how to write a programming language / compiler / etc." tutorials, it's the first one to really make me grok what's going on and move beyond just copy-pasting examples.

So thanks Bob. Keep up the awesome work!


Thank you! This means the world to me. :)


No thank you!

I have some spare change kicking around in a bitcoin wallet gathering dust and would like to buy you a coffee. Got an address I can send it to?

Couldn't see one on the site or your blog.


Give it to someone more needy than me. When the book is done and I have the print edition out, the easiest way to say thanks is to buy a copy and/or write a nice review.

Until then, just enjoy it. :)


If strings are immutable in Lox, then this string representation:

    struct sObjString {
      Obj obj;         
      int length;      
      char* chars;     
    };   
needlessly wastes 4 or 8 bytes for the chars pointer (among other heap management overhead related to separate allocations).

Instead, I think, you should have

    struct sObjString {
      Obj  obj;
      int  length;
      char chars[];   /* C99 flexible array member */
    };
and allocate that string object as one chunk with its characters.


Take a look at the first challenge exercise at the end of the chapter. :)


My pardon, hadn't got to that part at the moment of commenting.


Hey Bob, I super love this series. Everyone I know who's into PL likes it, so great work :)

A little thing in your section about the different UTFs: I've heard UTF-16 is a lot better than UTF-8 for East Asian languages, because pretty much all of those codepoints take multiple bytes. I don't _necessarily_ think anything you wrote was invalid, but just FYI.


However, this is only the case if you have a big blob of pure text; UTF-8 might be better if the East Asian text is interspersed with ASCII characters. For example, HTML with East Asian text is often smaller when encoded as UTF-8, because a significant portion of the content is HTML markup.


Right, well, again: if your average East Asian codepoint is 3 bytes in UTF-8 and your average markup codepoint is one byte, then as the percentage of markup in your corpus rises, the average UTF-8 size per codepoint falls asymptotically towards one byte. But not all text contains markup ;) Consider database storage, for example, or text in a game, e-books, visual novels. Anything that isn't the web or JSON, I guess, is what I'm saying -- which is a lot.


Also bear in mind that gzip (or any non-toy compression algorithm) will do much better than either on any non-contrived input.


Also on UTF-16:

> Alas, if your language needs to run on or interoperate with the browser, the JVM, or the CLR, you might be stuck with [UTF-16]

And Windows. Win32 and COM APIs use UTF-16 natively, so if you want to avoid lots of conversions on Windows, UTF-16 is the way to go. Maybe the best solution would be an abstraction that lets you choose the internal string encoding at compile time.


Hmm, interesting. I guess they'd likely end up three bytes in UTF-8. Do you have any details you can point me at?


Nothing like a paper or anything, but anecdotally: I downloaded 西遊記 from Project Gutenberg [1] and measured it. In UTF-8 it's 2,236,564 bytes; in UTF-16 it's 1,554,998 bytes.

[1] https://www.gutenberg.org/files/23962/23962-0.txt


> Don't get "voilà" confused with "viola". One means "there it is" and the other is a tiny string instrument. Yes, I did spend two hours drawing a viola just to mention that

Madness.


Since we're correcting people... a viola is not tiny. It's larger than a violin, though smaller than a cello.


Hi, why no RLE-encoded string length? You can't access the string directly then, but you can handle arbitrary string lengths and still save a lot of space. Inlined strings for short strings, and interned strings for faster comparison and sorting, are also important types. Ropes, maybe, also.


If you want to see strings implemented as immutable representations, look at the Unicon/Icon VM's representation. It has been in effective use for well over 30 years and has all sorts of nice properties.


This book is the best!



