I'm reminded of Chesterton's Fence in this. Every major ABI is listed here as con...

lostcolony · on May 8, 2021

I love seeing others bring up Chesterton's fence; it's been a reference that comes to mind with quite a lot of the WTFery I've encountered in my career (usually it remains WTFery even when looking for underlying reasons, but it at least helps remind me to question my instincts).

I don't really know enough to weigh in on this, but I can say that having pursued a lot of WTFish things in my career so far, 90% of the times I've encountered bad decisions, the explanation for it was either "it was done that way because legacy reasons" (i.e., it had to be done that way then, the reason it had to be has changed, and now it would break things to do it 'correctly') or "it was easier" (i.e., at the time the badness wasn't really going to affect anyone, or not measurably, or was very intentional tech debt, and it's only 'now' that anyone is noticing/caring).

david422 · on May 8, 2021

I've seen people make bad architectural decisions that now the company is stuck with. And it comes down to just the fact that it was a bad decision, no second guessing needed.

I've also seen "bad" decisions made due to outside constraints. These decisions look like bad decisions, except that if you try to "fix" those decisions, it becomes a lot harder than it looks.

lostcolony · on May 9, 2021

Don't get me wrong, there are plenty of times it was cluelessness. I'm just saying, I find myself going "this is stupid" far more often than it -was- stupid. It might be now, but the reasons for it then sometimes made sense.

derefr · on May 8, 2021

In this case, "it was done that way because legacy reasons" is close, but the real answer is "it was done that way because we hadn’t yet invented the parts of compiler theory required to create compilers that enforce this constraint at the type level."

matheusmoreira · on May 9, 2021

All this compiler sophistication represents a step backwards for binary interfaces. For example, C++ compilers emit such incredible machinery that it's essentially impossible for foreign code to interface with the compiled objects at the binary level. As a result everything eventually gets reduced to the C ABI: simple symbols and calling conventions.

derefr · on May 9, 2021

That's... what we're talking about. Simple symbols with calling conventions.

The rules for this proposed ABI are exactly the same as the existing amd64-SystemV C ABI, with one difference: the stack-to-stack copies aren't generated at the call-site; instead, the generated code at the call-site passes the address (in a register, or spilled to stack) for what it would have copied. The compiler generates the stack-to-stack copy in the generated function's prologue, using the address it was passed. Nothing more, nothing less. It's just moving the required location for certain generated code across the linkage, and keeping a temporary alive a little bit longer to make that work. (And in exchange, the temporary that the local stack variable gets put in isn't created at the call-site, so the register-file "pressure" of the change is net neutral.)

This is no more or less complex than the current ABI. It doesn't create more exceptions or edge-cases than the current ABI. It doesn't make the ABI harder to implement. The only thing it does, is choose differently in the matter of a basically-arbitrary choice of where to put some generated glue code (the stack-to-stack copy).

The only practical upshot of this change, is that this enables compilers to sometimes do an optimization that they can't currently do, because doing said optimization would go against the rules of the amd64-SysV ABI (i.e. a caller that pushed a register instead of copying the value wouldn't be an amd64-SysV caller any more, and wouldn't be compatible with precompiled amd64-SysV callees any more; and vice-versa for the callee.)

But if-and-when a compiler does do that optimization, it's internal to the generated function. It doesn't mean that there are two potential callee "signatures" under the proposed ABI. There's only one.

Here's what the proposed ABI would probably say about stack copies:

> "The caller always passes large values by reference; the callee always receives them by reference. If the callee is taking a parameter pass-by-value, then it's up to the compiler of the callee to insert code into the callee's function prologue to turn the passed reference into a stack-local copy of the referenced data."

With that particular legalese, the callee's generated copy is still "required" by the spec, but its effects are now also "hidden" from the caller — i.e. its observable results are no longer leaking across the linkage. Therefore, the compiler is now empowered to optimize out the callee copy, as long as it can ensure the resulting code has observably equivalent results from the caller's perspective.

Note that this isn't anything the person implementing the ABI-targeting code in the compiler has to worry about. They just write the code to generate a callee function prologue that does a stack-to-stack copy. It's the person writing the optimization pass that comes after that codegen step who can now take that stack-to-stack copy and, with static proof of read-only access by the callee in hand, drop it.

The optimization opportunity being enabled by the change, isn't part of the ABI's spec. The proposed ABI is just about moving the stack-to-stack copy into the callee. What the compiler chooses to do when targeting an ABI where the callee does stack-to-stack copies, is up to the compiler. Presumably, it will do "whatever fiendish things it can" at -O3, and "nothing much different" at -O0. Like usual.

And either way, the linkage itself looks the same. The optimization doesn't change the linkage. Any and all tooling that examines the linkage — debuggers, disassemblers, tracers, etc. — would see the same thing, whether the optimization has occurred or not. Because the optimization isn't part of the linkage; it's internal to the codegen of the callee, enabled by the (uniformly!) modified structure of the linkage.
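
A minimal C sketch of that rule, emulated by hand at the source level (a real ABI change would live in the compiler's codegen, not in user code). The names `struct big`, `sum_with_copy`, and `sum_no_copy` are hypothetical and not from the post; the explicit pointer parameter stands in for the address the caller would pass.

```c
#include <string.h>

/* Hypothetical "large" struct: too big to be passed in registers. */
struct big {
    int values[16];
};

/* The callee as the proposed rule describes it, written out by hand:
 * it receives an address, and its "prologue" makes the local copy. */
int sum_with_copy(const struct big *arg)
{
    struct big local;
    memcpy(&local, arg, sizeof local);  /* callee-side stack-to-stack copy */

    int total = 0;
    for (int i = 0; i < 16; i++)
        total += local.values[i];
    return total;
}

/* What an optimization pass could reduce it to, assuming it has proven
 * that the callee only reads the argument and that nothing else can
 * change *arg while the call is in progress. */
int sum_no_copy(const struct big *arg)
{
    int total = 0;
    for (int i = 0; i < 16; i++)
        total += arg->values[i];
    return total;
}
```

Both functions have the same signature, so a caller cannot tell which one it is linked against; only the body differs.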

rsj_hn · on May 8, 2021

Yup, there's also time dependence. Perhaps someone wrote some software in COBOL that is hard to maintain now. But rewriting it may not be worth the opportunity cost now, especially for well-tested systems that have been around for a long time and which have critical failure modes. Sometimes it's better to leave things alone and work around them, even if it results in an uglier design.

legulere · on May 8, 2021

How about those explanations:

It didn't matter before, as compilers were not optimizing as much, code had a much closer 1:1 correspondence to assembly (if you are passing it by pointer and not register, you would want to make that clear in code).

It's much easier to implement in simple compilers. On the side of the callee you don't have to check if you manipulate your arguments, which is generally hard. Being able to manipulate your arguments is another shortcut for keeping the compiler simple. On the side of the caller you don't have to check if you hand out a mutable pointer.

Finally, and most importantly: memory access was much cheaper in terms of CPU cycles. Just look at cdecl: all parameters are passed on the stack instead of in registers. Our current calling conventions stem from performance hacks like fastcall that were only optimizing for existing code (you pass big structs by pointer by convention).

moonchild · on May 8, 2021

> my gut is there is something missing here with respect to non local control flow (like exception handling, setjmp/longjmp, and fibers)

(Post author.)

Mechanically, what happens is essentially the same as what ms/arm/riscv do: the caller creates a reference and passes it to the callee. The only difference is that the callee is more restricted than it would otherwise have been in what it can do with the memory pointed to by that reference. So I don't think that there can possibly be any implications for non-local control flow.

hctaw · on May 8, 2021

Doesn't the referenced data have to be guaranteed to outlive the callee, which would only be true if the callee is guaranteed to return to the calling scope?

You can get around the immutability of the reference if your compiler implements the ABI with copy on write semantics, which I think is a reasonable compromise. But I'm still not certain how you would handle arbitrary control flow that the compiler may not be able to reason about.

If, for example, your arguments may be behind const references, how would you implement getcontext/swapcontext for your ABI? If everything is an integral value in registers or on the stack then it's really easy, but I would think it would have to be a compiler intrinsic if it depends on the function signature of the calling context, in order to perform the required copies.

brigade · on May 8, 2021

Well for one, the language says a copy is made at the time of the function call, and it's perfectly valid to modify the original before the copy is finished being used. So pretty much any potentially aliasing write or function call in the callee would force a copy, and as he notes C's aliasing rules are lax enough that that's most of them.
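
A small C sketch of that aliasing case, with hypothetical names (`struct point`, `read_after_write`, `call_with_alias`). Under today's by-value semantics the callee must still see the value as it was at the time of the call, even though it writes to the caller's original through a second pointer.

```c
struct point {
    int x;
    int y;
};

/* `p` is passed by value; `alias` may point at the caller's original. */
int read_after_write(struct point p, struct point *alias)
{
    alias->x = 100;  /* may modify the object the caller passed */
    return p.x;      /* must still be the value from the time of the call */
}

/* Passes the same object both by value and by address. */
int call_with_alias(void)
{
    struct point original = {1, 2};
    return read_after_write(original, &original);
}
```

If `p` were silently a reference to `original`, the write through `alias` would change the result, which is why a write like this would force the copy.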

Then if you care about the possibility of signal handlers modifying the original... you pretty much have to make a copy every time anyway.

temac · on May 8, 2021

Plus any potential concurrency synchro point existing would force a copy, plus using any unknown function, etc.

Using Rust and propagating the single-writer-xor-multiple-readers requirement into an ABI, this might be interesting. But with C/C++, I'm afraid copies would be forced "all" the time.

mort96 · on May 8, 2021

There's still a lot of functions which don't call unknown functions before accessing an argument passed by value, don't take the argument's address, etc. There are many simple functions such as this one:

    void print_foo(FILE *outf, struct foo foo) {
        fprintf(outf, "foo '%s': %i, %i\n", foo.name, foo.x, foo.y);
    }

That one would gain a speed-up and code-bloat reduction from the proposed ABI, and there are many like it.

But even if every single function had to fall back to making a copy, the argument is that there's still a significant code bloat saving by putting the copy in the callee rather than in the caller. After all, the instructions necessary to make a copy take some space, and with the proposed ABI, those instructions are put in the called function, rather than in every function call. Most functions are called more than once, and all functions are called at least once (hopefully), so anything which can be changed from O(number of function calls) to O(number of functions) is an improvement.

gpderetta · on May 8, 2021

Exactly, see my example elsethread. Also in C and derivatives distinct objects are guaranteed to have distinct addresses. Implicit sharing would break this.

mort96 · on May 8, 2021

It wouldn't. The compiler would just have to generate the copy when the standard demands it (such as if the function body takes the address of the object).
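
A sketch of the address-taking case, using hypothetical names (`struct A`, `param_is_global`, `check_distinct`). Because the parameter and the global are distinct objects, the comparison has to come out unequal, so a callee like this one would need its own copy.

```c
struct A {
    int m;
};

struct A global = {5};

/* Takes its parameter by value, then takes the parameter's address. */
int param_is_global(struct A a)
{
    return &a == &global;  /* distinct objects: must compare unequal */
}

/* Passes the global itself as the by-value argument. */
int check_distinct(void)
{
    return param_is_global(global);
}
```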

gpderetta · on May 8, 2021

Yes but then in many cases either (or both!) the caller and the callee might need to make a copy defeating the point of the optimization or even being worse than the original.

mort96 · on May 8, 2021

In many cases the callee would have to make a copy, yes. However:

1. In many cases, no copy would have to be made. There are lots of small non-complex functions out there where the compiler can prove that it's safe to not make a copy.

2. In many other cases, a copy has to be made. But the copy is made by the callee, not by the caller. That means that all the instructions necessary to copy the argument end up in the binary once in the callee, rather than once for every function call, leading to less code bloat (which has its own performance advantages).

In fact, a stupid compiler could just always make a copy without analyzing the function body. This would result in a compiler which generates code that's about as fast as it would be with current ABIs, but with a smaller size.

gpderetta · on May 8, 2021

You have to make a copy on the caller or the callee if the address of the object escapes, so you might end up with two extra copies even if nothing in the program mutates the object.

mort96 · on May 8, 2021

I don't understand how you end up with extra copies. My understanding is that the caller would never make a copy; it would always pass a pointer to large structs. So the absolute worst case, unless I'm missing something, is that we end up with the same number of copies as we do today (i.e. one copy per large struct passed as a parameter).

moonchild · on May 8, 2021

  struct A { int m; } global = {5};
  int f(struct A a) {
   global.m = 7;
   return a.m;
  }
  
  int main() {
   f(global);
   // need to make a copy of 'global' here
   // otherwise f will return 7 instead of 5
  }
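
The same hazard can be emulated in plain C by writing the "no copy" callee with an explicit pointer. This is only a sketch: `f_by_value` and `f_by_reference` are hypothetical names, and the pointer version stands in for a callee that was handed a reference and skipped the copy.

```c
struct A {
    int m;
};

struct A global = {5};

/* Ordinary pass-by-value: `a` is a copy taken at the call. */
int f_by_value(struct A a)
{
    global.m = 7;
    return a.m;   /* reads the copy, unaffected by the write above */
}

/* Emulates a callee that received a reference and made no copy. */
int f_by_reference(const struct A *a)
{
    global.m = 7;
    return a->m;  /* reads through the reference, so it sees the write */
}
```

With `global.m` reset to 5 before each call, `f_by_value(global)` yields 5 while `f_by_reference(&global)` yields 7, which is the observable difference a copy has to prevent.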

mort96 · on May 8, 2021

Hey, I've realized that there are two understandings of the proposed ABI: One in which the only promise is that the callee won't modify the object through the pointer, and one in which the callee promises to not modify the object through the pointer and the caller promises that nothing else will modify the object. Maybe you could shed some light on it since you're the author?

In the first version, the worst case situation is that only one copy is made, and it's always made by the caller. However, the caller has to make a copy if the object is referenced after any function is called, because that function might otherwise modify the parameter if a pointer to the caller's version of the object has leaked out somewhere.

In the second version, the worst case situation is that two copies are made where old ABIs would make just one copy (if the caller has to make a copy and the callee has to make a copy). However, the callee would only have to make a copy if it actually does something which might modify the object through the pointer passed as an argument, so the optimization would apply for more functions.

I think it's fairly clear from the article that your intended ABI is the first version, due to the sentence "In the event that a copy is needed, it will happen only once, in the callee, rather than needing to be repeated by every caller" . But in this comment, you're implying that the caller makes a copy if it can't guarantee that nothing else has a pointer to the object?

moonchild · on May 9, 2021

I should have been clearer; my intention was your second interpretation. The copying happening only once is predicated on the assumption that the struct isn't aliased, which is unlikely if you're passing it around by value.

Your first interpretation is essentially what the ms/arm/riscv abis do. The reason I don't think that works as well is—

In general, it's rare for functions to mutate their parameters by value. We can effectively treat this as an edge case, and 'compensate' by making copies in the callee when necessary. But, when does the caller need to make a copy?

Version 1: whenever the object is aliased before the call, or read from after it

Version 2: whenever the object is aliased before the call

I think using the same struct multiple times is something that happens relatively frequently, so compared with v1, v2 elides a lot of caller-side copies. In exchange, it adds a relatively small number of callee-side copies. Which, despite the few pathological cases, seems likely to be overwhelmingly worth it most of the time.

mort96 · on May 9, 2021

Sorry, I messed up. I meant to write that in the first version, the copy is made by the callee. If the copy is made by the callee, then the callee can avoid a copy if it can guarantee that the caller's version of the object isn't changed before the callee uses it, and at most one copy is made.

Anyways, your intention is clear now at least. I'd be a bit worried about an ABI which might produce two copies for one parameter. It would be interesting to analyze a bunch of real-world code and see A) how often would my version create a copy, B) how often does the MS/ARM/RISC-V version have to make a copy, C) how often would your version make a copy, and D) how often would your version require two copies.

Would also be interesting to see an analysis of code bloat due to copying parameters.

moonchild · on May 9, 2021

> If the copy is made by the callee, then the callee can avoid a copy if it can guarantee that the caller's version of the object isn't changed before the callee uses it, and at most one copy is made.

So the callee has to know what every caller of it will ever do? That's ... an ABI. The whole point is that functions can exist in a vacuum without knowledge of who they will be called by.

To be clear, I think it would be really cool if compilers could generate ad-hoc calling conventions using lto to optimize spillage, but that's not really useful as an ABI.

> would be interesting to analyze a bunch of real-world code and see A) how often would my version create a copy, B) how often does the MS/ARM/RISC-V version have to make a copy, C) how often would your version make a copy, and D) how often would your version require two copies.

> Would also be interesting to see an analysis of code bloat due to copying parameters

I agree!

mort96 · on May 9, 2021

I'm not explaining myself clearly.

The ABI I had in mind was similar to the AArch64 ABI:

> If the argument type is a Composite Type that is larger than 16 bytes, then the argument is copied to memory allocated by the caller and the argument is replaced by a pointer to the copy.

But with a slight modification to put the copy in the callee:

> If the argument type is a Composite Type that is larger than 16 bytes, then the argument is replaced by a pointer to the original object. The callee copies the pointed-to object into memory allocated by the callee.

This immediately has the advantage of less binary bloat, because the number of parameter-copying instructions in the binary becomes O(number of functions) rather than O(number of function calls). (As an aside: that can probably be a huge advantage for C++ with its large, inlined copy constructors.)

When the copy is made in the callee, we can start identifying cases where a copy isn't necessary, or cases where only certain parts of the struct have to be copied. It would have to be fairly conservative though, since unlike with your ABI, there would be no guarantee made by the caller that there are no other references to the parameter.

I think my version is a clear and obvious improvement over the status quo, with decreased binary sizes and as good or better performance. Your version is riskier, since the worst case is two copies per large parameter, but it will probably achieve zero copies in way more cases than my version. "Low risk / medium reward" versus "medium risk / probably high reward".

---

Anyways, I might end up writing a blog post on this stuff. If I do, it will refer to your blog post. How should I refer to you? Moonchild or elronnd or something else?

gpderetta · on May 8, 2021

If the address of the object escapes on the caller side then it has to make a copy as the object could be mutated or even just break the distinct address guarantee of the language.

mort96 · on May 8, 2021

I still don't understand, sorry. If the callee does something which could cause the caller's object to change, such as calling an unknown function or modifying through another pointer which might alias the parameter, the callee would just have to make a copy.

Could you provide an example of a situation where there would be more copies made using the proposed ABI than in traditional ABIs?

gpderetta · on May 8, 2021

Sure, if calling any external function or writing through any pointer would force the callee to copy the object, then yes, you can have only the callee do the copy, but then it seems that this optimization would apply only to a very small subset of functions.

mort96 · on May 8, 2021

Right. That was my understanding, but I now see that there are more ways to understand it. I don't know which is correct, so I wrote a response to moonchild's comment here: https://news.ycombinator.com/item?id=27091726

jbverschoor · on May 8, 2021

Sometimes a mistake is a decision made under the assumption that the people intended to use it are smarter / more careful than they actually are.

jcelerier · on May 8, 2021

> A correctly-specified ABI should pass large structures by immutable reference

is just not possible. CPUs don't know about `const`. So you have to work with the assumption that functions that you call can do anything to their arguments. Thus copies cannot be avoided.

mhh__ · on May 8, 2021

The CPU also doesn't know what an ABI is

CuriousCosmic · on May 8, 2021

An ABI also has a concept of defined and undefined behaviour. You can design an ABI that is fully protected against abuse but often the performance penalty for that will be huge.

Instead, what you'll do is specify the constrained inputs and expected output behaviour. From there you can rule out anything that violates those constraints as non-conformant. As long as you maintain those constraints between versions, there's no ABI breakage.

Also you can absolutely have constant references in an ABI. There may be ways of ignoring the const depending on how you design the ABI but they will be obvious abuse.

wizzwizz4 · on May 8, 2021

CPUs actually do know about const; it's called a read-only page.

Besides, that's irrelevant. There's nothing stopping my function from following every pointer on the stack and smashing up its contents; are you going to defend against that, too? If not, how is this any different?