> The key change you need to make to the RTS is, when you evaluate a thunk, rather than overwriting it with the final value, instead append the evaluated value to the thunk.

This already happens in GHC.

Any thunk that can be updated has one word reserved as a slot for the update result. This is necessary for correct operation on SMP systems, where two processors may race to evaluate the same thunk. The only data that gets overwritten is the thunk's code pointer. The binary contains one standard routine, known as the indirection code, which simply returns the value in the update slot of the currently active thunk. On update, the thunk's code pointer is set to the well-known address of this indirection code. Throughout all this, the thunk's payload remains untouched.
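To make the layout concrete, here is a toy model in C. It is only a sketch of the scheme described above: the names (`Thunk`, `indirection_code`, `square_code`, `force`, `make_square_thunk`) are made up for illustration and are not the RTS's real definitions, and it ignores the memory barriers a real SMP implementation needs.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not GHC's actual RTS structures. */
typedef struct Thunk Thunk;
typedef long (*Code)(Thunk *self);

struct Thunk {
    Code code;     /* code pointer: the only word overwritten on update */
    long result;   /* reserved one-word slot for the update result */
    long payload;  /* free variable of the thunk; never touched by the update */
};

/* The single shared indirection routine: return the update slot's value. */
static long indirection_code(Thunk *self) {
    return self->result;
}

/* Example thunk code: compute payload squared, then update the thunk. */
static long square_code(Thunk *self) {
    long value = self->payload * self->payload;
    self->result = value;           /* fill the update slot first */
    self->code = indirection_code;  /* then redirect the code pointer */
    return value;
}

/* Entering a thunk means jumping through its code pointer. */
static long force(Thunk *t) {
    assert(t != NULL && t->code != NULL);
    return t->code(t);
}

/* Build an unevaluated thunk that will compute x * x. */
static Thunk make_square_thunk(long x) {
    Thunk t = { square_code, 0, x };
    return t;
}
```

Forcing a thunk made by `make_square_thunk(7)` runs `square_code` once. Every later `force` goes through `indirection_code`, and `payload` still holds 7 afterwards.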

GHC's GC short-circuits these indirection nodes and ignores pointers in the thunk payload (since the payload has effectively become unreachable). To accomplish your goal, we would have to recover the original value of the code pointer. This can be done by storing a copy of that value in the payload (similar to what you are suggesting). A way to do this without increasing the size of any heap object is for the compiler to generate one copy of the indirection code per thunk code block, so that the original thunk code address can be recovered from the particular indirection code address via a lookup (or potentially via arithmetic). This bloats the emitted code by a small fraction, but may be worth it.
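The lookup variant might look roughly like the following C sketch. Everything in it is hypothetical: `code_table` stands in for whatever mapping the compiler would emit, and `square_ind`, `negate_ind` and `original_code` are names invented for this example. It also relies on the C guarantee that distinct functions have distinct addresses, so it would break under a linker that folds identical function bodies together.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not GHC's actual RTS structures. */
typedef struct Thunk Thunk;
typedef long (*Code)(Thunk *self);

struct Thunk {
    Code code;     /* code pointer, overwritten on update */
    long result;   /* update slot */
    long payload;  /* untouched by the update */
};

/* One indirection routine per thunk code block; all behave identically. */
static long square_ind(Thunk *self) { return self->result; }
static long negate_ind(Thunk *self) { return self->result; }

/* Shared update step: fill the slot, then install the given indirection. */
static long update(Thunk *self, Code ind, long value) {
    self->result = value;
    self->code = ind;
    return value;
}

/* Each thunk code block updates to its own private indirection routine. */
static long square_code(Thunk *self) {
    return update(self, square_ind, self->payload * self->payload);
}
static long negate_code(Thunk *self) {
    return update(self, negate_ind, -self->payload);
}

/* Assumed compiler-emitted table pairing indirection code with thunk code. */
static const struct { Code ind; Code original; } code_table[] = {
    { square_ind, square_code },
    { negate_ind, negate_code },
};

/* Recover the original code pointer of an evaluated thunk.
   Returns NULL if the thunk's code pointer is not a known indirection. */
static Code original_code(const Thunk *t) {
    assert(t != NULL);
    for (size_t i = 0; i < sizeof code_table / sizeof code_table[0]; i++) {
        if (code_table[i].ind == t->code) {
            return code_table[i].original;
        }
    }
    return NULL;
}
```

The arithmetic variant would drop the table. If each indirection routine were emitted at a fixed offset from its thunk code, the original address would be one subtraction away, but portable C cannot express that layout, so it is not shown here.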