Right, so you just thunk to a different region at the bottom of your own stack i...

anyfoo · on Nov 28, 2022

User software is allowed to switch its stack, too, and user software must be able to be interrupted. Sure, it may be possible to work around this in software with some careful elaborate protocol that involves making sure some value of SP is usable with any value of SS, and prescribing that all code adhere to it (because any code at all can be interrupted by an NMI). But that not only means using this protocol for all stack layouts, it also implies that every single “MOV SS” or “POP SS” becomes part of some elaborate sequence of instructions (something like “push some other register on the stack, save current SP in that register, change SP to the magic value, change SS… extra fun if you wanted POP SS, restore SP from the other register, restore the other register”).

That’s a lot to put up with, and in my mind qualifies as a design bug with no palatable software workaround. Not great for wanting folks to use your brand new CPU.

The protected mode transitions that you mentioned are much more palatable because they only happen in very bespoke parts of the OS kernel, usually only in bootstrap in any non-ancient OS.
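
To make the hazard concrete, here is a minimal Python sketch of the window between loading SS and loading SP. It assumes only the standard 8086 rules: a physical address is segment * 16 + offset, wrapped to 20 bits, and an interrupt pushes three words (FLAGS, CS, IP). The segment and offset values and the helper names `phys` and `interrupt_frame` are made up for illustration.

```python
def phys(seg, off):
    """8086 physical address: segment * 16 + offset, wrapped to 20 bits."""
    return ((seg << 4) + (off & 0xFFFF)) & 0xFFFFF


def interrupt_frame(ss, sp):
    """Physical addresses written when an interrupt pushes FLAGS, CS, IP."""
    addrs = []
    for _ in range(3):
        sp = (sp - 2) & 0xFFFF  # a push pre-decrements SP by one word
        addrs.append(phys(ss, sp))
    return addrs


# Hypothetical stacks: the code is moving from the old one to the new one.
old_ss, old_sp = 0x2000, 0x0100
new_ss, new_sp = 0x5000, 0x0800

# Interrupt lands after "MOV SS, new" but before "MOV SP, new":
# SS is already the new segment, SP is still the old offset.
torn = interrupt_frame(new_ss, old_sp)

# Where the frame would go if the switch had completed first.
intended = interrupt_frame(new_ss, new_sp)

print("torn frame:    ", [hex(a) for a in torn])
print("intended frame:", [hex(a) for a in intended])
```

With these made-up values the torn frame is written at the old offset inside the new segment, which is whatever memory happens to live there, rather than just below the new stack top.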

ajross · on Nov 28, 2022

> Sure, it may be possible to work around this in software with some careful elaborate protocol

Uh, yes, yes it is. That's the point. Go check the 80286 PRM for examples where the same hardware vendor inflicted far worse protocols on their users only a few years after the hack we're looking at.

The subject under discussion isn't "whether or not the 8086 implemented a good stack switching interface". Clearly it did not. It's "whether the 8086 stack switching interface as designed was so irreparably broken that it required a hardware patch[1] to fix". I think the case there is pretty weak, honestly, given the state of the industry at the time. And I found it surprising that Intel would have bothered.

[1] And increase interrupt latency!

anyfoo · on Nov 28, 2022

I’ve literally got the 80286 PRM in my room with me, the actual book, not a printout. What do you think in there is a “far worse protocol” than this abomination?

And then the 286 came along at a time when the x86 architecture was already enormously successful, part of the ubiquitous IBM PC. The 8086 on the other hand was a much simpler, completely new CPU competing with many others. There aren’t many warts in the 8086 itself, whose data sheet is a few pages instead of a book (segmentation was a plus at the time because it allowed a large address space and some easy multitasking without relocation in such a simple CPU), and still it actually only won out against other CPUs for marginal reasons. If 68k had been ready for the PC, that would likely have been it.

Intel had no capital at the time to inflict such insanity on people. And why is patching the mask in a later revision of the chip such a big deal anyway?[1]

Just look at the solution you proposed. Always gradually adjusting SS and SP in a loop until SP reaches 0, just because an NMI may hit. That’s untenable.

[1] Interrupt latency is only slightly increased on MOV/POP SS. Those shouldn’t occur so often that they become a bigger concern than the much more frequent CLI/STI blocks with many instructions in them. And the 8086 wasn’t exactly low latency anyway.

ajross · on Nov 28, 2022

Why is this an argument? Is it really so upsetting to you that I thought this was a weird thing to fix at the mask level? It... was. Chips in 1978 had bugs like this everywhere, and we all just dealt with it. You mentioned the 68k, for example, while forgetting the interrupt state bug that forced the introduction of the 68010!

Motorola, when faced with a critical bug that would affect OS implementors in a way they could technically work around (c.f. Apollo, who used two chips) but didn't want to, didn't even bother to fix the thing and made them buy the upgraded version (obviously in some sense the '010 was the equivalent "fix it in the mask" thing). They were selling chips that couldn't restart a hardware fault well into the late 80's.

> Just look at the solution you proposed.

That wasn't the solution I proposed. I proposed just having the stack reserve an interrupt frame at the top and setting SP to zero before changing SS. The code I showed was just proof that even putative users[1] who couldn't change the OS would be able to work around it.

[1] There were none. This was several years before MS-DOS; no one was shipping 8086 hardware in any significant quantities, certainly not to a market with binary software compatibility concerns.
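
A rough Python sketch of why SP = 0 can serve as the "magic value", on one reading of the proposal: a push pre-decrements SP modulo 64K, so with SP = 0 the three hardware-pushed words (FLAGS, CS, IP) always land at offsets 0xFFFA to 0xFFFF, whatever SS holds. The function names and segment values are hypothetical, and the sketch models only the hardware frame, not anything the interrupt handler pushes afterwards.

```python
def hardware_frame_offsets(sp):
    """Offsets within SS touched when an interrupt pushes FLAGS, CS, IP."""
    offsets = []
    for _ in range(3):
        sp = (sp - 2) & 0xFFFF  # 16-bit wraparound: 0 - 2 becomes 0xFFFE
        offsets.append(sp)
    return offsets


# Assumption: every stack segment keeps its top six bytes free for this.
RESERVED = range(0xFFFA, 0x10000)


def frame_is_safe(sp):
    """True if an interrupt taken with this SP stays inside the reserved area."""
    return all(off in RESERVED for off in hardware_frame_offsets(sp))


def phys(seg, off):
    """8086 physical address: segment * 16 + offset, wrapped to 20 bits."""
    return ((seg << 4) + (off & 0xFFFF)) & 0xFFFFF


# The reserved frame sits at the same offsets in the old and the new segment,
# so it does not matter on which side of "MOV SS" the interrupt arrives.
for ss in (0x2000, 0x5000):
    frame = [phys(ss, off) for off in hardware_frame_offsets(0)]
    print(hex(ss), [hex(a) for a in frame])
```

Any other SP value fails the `frame_is_safe` check unless the segment's layout happens to leave those particular offsets free, which is why the scheme constrains every stack in the system.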

anyfoo · on Nov 29, 2022

> Chips in 1978 had bugs like this everywhere, and we all just dealt with it.

Like what?

> Motorola, when faced with a critical bug [...] (c.f. Apollo, who used two chips)

I am well aware of the Apollo two chip hack, but I was not aware that this was considered a "critical bug". I thought the original 68000 just did not support virtual memory with paging, and the 68010 introduced that. Given that you needed an external MMU if you wanted to do any kind of virtual memory, which most clients did not care about, and that the 8086 and many if not most contemporaries in that class did not support any kind of virtual memory either, that seemed like a reasonable limitation to me.

The kicker is, I actually own a Siemens PC-D, a marvelous machine that has an 80186 (the differences to an 8086 don't matter here), and an actual external MMU made up of discrete 74 series logic chips. It runs SINIX, and has full memory protection through that MMU.

And yet it does not support paging either, because by the time the external circuitry recognized a fault, the CPU had already moved on further. The only thing SINIX did was kill the faulting process.

> while forgetting the interrupt state bug that forced the introduction of the 68010

Are we actually talking about the same thing then? The Apollo hack I know was because after a bus error, not an interrupt, you could not restart the faulting instruction, presumably because by the time the fault was recognized, the CPU had already continued executing an indeterminate amount. This does not matter for regular interrupts, since they are not synchronized to running code and do not restart instructions. It would have been a pretty useless chip otherwise, and yet it thrived even after its descendant was introduced.

The point is, it seemed to me Apollo really wanted to use the 68000 for something it did not support (yet), and other users of the 68000 were not affected (most did not even have the MMU). It was a marvelous, unique hack that I've actually never seen fully documented or confirmed anywhere (please let me know if you have). The PC would have been amongst the ones not affected, and the chip even spun off variants long after the 68010 was released; not having paged virtual memory just wasn't a dealbreaker for most.

> That wasn't the solution I proposed. I proposed just having the stack reserve an interrupt frame at the top and setting SP to zero before changing SS. The code I showed was just proof that even putative users[1] who couldn't change the OS would be able to work around it.

Yes, you proposed every single piece of code switching a stack, not just OS code, to do some careful dance whenever it switches the stack, on top of making sure that every stack in existence has the proper amount of unused space at its base to support that.

And then all interrupting code would have to immediately switch to its own stack (which, while required today, wasn't necessary back then, mainly because all code ran with all privileges and plenty of stack), while taking care that all registers can be properly restored on the way back, of course. Or only the NMI handler would, but then you would additionally have to save and restore the interrupt flag around your stack switching.

But okay, let's agree to disagree then. You say there is a reasonable software workaround, I say there is not, for what to me was a pretty clear design bug. Albeit one that was also easily fixed. In my mind, Intel did the sane thing back then by simply patching it up. Can't say that it hurt them.

ajross · on Nov 29, 2022

This has gone completely off the rails, and I simply don't understand why you're being so combative given you seem to have a genuine interest in the same stuff I do. But in the interests of education:

> I thought the original 68000 just did not support virtual memory with paging, and the 68010 introduced that

The 68000 "did not support virtual memory" because instructions that caused a synchronous[1] bus exception could not be restarted. It was a bug with the state pushed on the stack (though I forget the details). In fact it was a bug[2] almost exactly like the one that produced the stack mess we're talking about. And Motorola never bothered to fix it. Intel could likewise have announced "you can only assign to SS when SP is zero" and refused, with little change in device success. And I find it surprising that they didn't.

[1] Because technically the bus isn't synchronous on a 68k, thus the obvious-in-hindsight design thinko. See [2].

[2] A design error following from the failure of the engineers to anticipate the interaction between two obvious-seeming features: in the former, processor use of the stack for interrupts and stack segments; in the latter, hardware bus errors being used to report synchronous instruction results.

anyfoo · on Nov 29, 2022

I'm not sure I'm the one being combative, to be honest. So far Intel at the time seemed to think this was a design bug with significant enough consequences that they decided to fix it in some unused part of the mask. And I just agree with that.

I've been privy to such discussions myself, albeit in modern times. I'm familiar with silicon engineers and kernel engineers coming together to discuss a CPU bug or design mishap, and coming up with all sorts of crazy workarounds. Until it's eventually decided that this is just too much to inflict, and it's fixed.

Since you are interested in the 68000 thing, I think I do know the details and can explain, because I have before looked at the difference between the 68000 and the 68010 that makes paging possible. Let's look at the 68010 manual here: http://www.bitsavers.org/components/motorola/68000/MC68010_6...

(Notice how it already says "VIRTUAL MEMORY MICROPROCESSORS" on the front while the 68000 manual in the same directory does not, though I suppose it could be a good marketing move if there was an actual unintended bug, and Motorola decided to advertise the fix as a new CPU?)

Page 5-12: "Bus error exceptions occur when external logic terminates a bus cycle with a bus error signal. Whether the processor was doing instruction or exception processing, that processing is terminated, and the processor immediately begins exception processing."

It specifies external logic, which would be the MMU in the interesting cases. Everyone not using an MMU did not care, and bus errors were still valuable with or without MMU to signal illegal accesses and crash the program that caused it. The full problem is elaborated on the next page: "The value of the saved program counter does not necessarily point to the instruction that was executing when the bus error occurred, but may be advanced by up to five words. This is due to the prefetch mechanism on the MC68010 that always fetches a new instruction word as each previously fetched instruction word is used (see 7.1.2 Instruction Prefetch)."

The 68010 fixed it by putting enough (opaque, undocumented) internal CPU state on the stack to effectively undo what it did so far and restart the instruction; that state is shown (without detail, and subject to change between revisions) on the diagram.

The 68000 manual simply says: "Although this information is not sufficient in general to effect full recovery from the bus error, it does allow software diagnosis" (which should include a multitasking OS killing the task and proceeding), and none of the internal state exists on the stack.

But you may still well be right that Motorola did intend for bus errors to be restartable and failed. They would not say so explicitly in the manual of course.

But from what I can see this still only affected systems with an MMU, and then only if they wanted to do full virtual memory including paging. SINIX on the PC-D did not, they were content with swapping and merely detecting illegal accesses, like other UNIXes at the time. The 8086 problem affected everyone, including in application code.

If you know more than what I've cobbled together and can elaborate, I'd be extremely interested in it, because the Apollo DN100 hack always fascinated me to no end (as should be obvious).

anyfoo · on Nov 29, 2022

I might add, it's obvious that you are competent by the way. I think we just disagree on the severity here.