I agree. That said,
> First, ECC doesn't protect the full data chain, you can have a bitflip in a hardware flip flop (or latch open a gate that drains a line, etc...) before the value reaches the memory. Logic is known to glitch too.
Of course. DRAM ECC protects against errors in the DRAM cells. That doesn't mean other components lack their own error-reduction strategies, which together can form a complete chain.
Latches, arrays, and register files often have parity (used where the data can be reconstructed after an error is detected) or ECC bits, or use low-level circuits that are themselves hardened or redundant enough to hit a target UBER (uncorrectable bit error rate) for the full system.
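To make the parity vs. ECC distinction concrete, here is a toy sketch in Python. The function names are made up for illustration, and the Hamming(7,4) code is just the smallest textbook single-error-correcting code, not what any particular CPU or DIMM uses (real DRAM ECC typically uses wider SECDED codes over 64-bit words). A single parity bit only tells you that something flipped; the Hamming syndrome also tells you which bit, so it can be flipped back.

```python
def parity_bit(bits):
    """Even parity: XOR of all bits. Detects any single flip, locates none."""
    p = 0
    for b in bits:
        p ^= b
    return p


def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out as
    positions 1..7 = p1 p2 d1 p3 d2 d3 d4 (textbook layout)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position). Position 0 means no error seen;
    otherwise it is the 1-based position of the bit that was corrected."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3  # syndrome read as a binary position
    if pos:
        c[pos - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], pos


if __name__ == "__main__":
    data = [1, 0, 1, 1]

    # Parity: the flip is noticed, but there is no way to tell which bit.
    stored_parity = parity_bit(data)
    corrupted = data.copy()
    corrupted[2] ^= 1
    print("parity mismatch:", parity_bit(corrupted) != stored_parity)

    # Hamming(7,4): the same kind of flip is located and repaired.
    word = hamming74_encode(data)
    word[4] ^= 1
    print("decoded:", hamming74_decode(word))
```

This is why parity is good enough where a clean copy exists elsewhere and the data can simply be refetched, while storage holding the only copy needs a real correcting code.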
> Second: ECC is mostly designed to protect long term storage in DRAM. Recognize that a cert like this is a very short-term value, it's computed and then transmitted. The failure happened fast, before copies of the correct value were made. That again argues to a failure location other than a DRAM cell.
Not necessarily. Cells that sit idle apart from refresh have a certain error profile, but so do cells under constant access. In particular, "idle" cells that are in fact being disturbed by adjacent accesses certainly have a non-zero error rate and need ECC too.
My completely anecdotal guess is that this error is at least an order of magnitude more likely to have occurred in non-ECC memory (if that's what was being used) than in any other path to the CPU, in cache, or in logic on the CPU itself.