Hacker News

Bit rot is real. Last month we had weird linker errors on one of our build servers. It turned out that one of the binary libs in the build cache had a single bit flipped in the symbol table, which changed a symbol name and caused the linker errors. If the bit-flip had occurred in the .TEXT section, it wouldn’t have caused any errors at build time, and we would have released a buggy binary. It might have just crashed, but it could have silently corrupted data…


I’ve had a case where a bit flip in a TCP stream was not caught because it happened in a Singapore government deep packet inspection snoop gateway that recalculated the TCP checksum for the bit-flipped segment:

https://blog.majid.info/telco-snooping/


Sorry if this is obvious, but how do we know this isn't due to something more "innocent" like fragmentation?


Fragmentation doesn’t change the TCP checksum. The packet is reassembled and the original checksum verified against the rehydrated packet and TCP segment.
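
To make the distinction concrete, here is a minimal Python sketch of the 16-bit ones' complement checksum described in RFC 1071, which TCP uses. The payload, the fragment boundaries, and the helper names (`internet_checksum`, `flip_bit`, `receiver_accepts`) are made up for illustration, and a real TCP checksum also covers a pseudo-header and the TCP header, which this sketch leaves out.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


def receiver_accepts(data: bytes, checksum: int) -> bool:
    """The receiver recomputes the checksum and compares it to the one it was sent."""
    return internet_checksum(data) == checksum


# Hypothetical payload standing in for a TCP segment.
segment = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n"
sender_checksum = internet_checksum(segment)

# Fragmentation: the bytes are split and reassembled unchanged,
# so the sender's original checksum still verifies.
fragments = [segment[:16], segment[16:32], segment[32:]]
fragmentation_ok = receiver_accepts(b"".join(fragments), sender_checksum)

# A bit flip in transit: the sender's checksum no longer matches.
corrupted = flip_bit(segment, 42)
flip_detected = not receiver_accepts(corrupted, sender_checksum)

# A middlebox that recomputes the checksum over the corrupted bytes
# hands the receiver a consistent pair, so the flip goes unnoticed.
rewritten_checksum = internet_checksum(corrupted)
flip_hidden = receiver_accepts(corrupted, rewritten_checksum)
```

Under these assumptions `fragmentation_ok`, `flip_detected`, and `flip_hidden` all come out `True`: reassembly preserves the checksum, a single flipped bit breaks it, and recalculating it after the flip masks the damage.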


Nice article, enjoyed it.


Thanks!


I am used to ‘bit rot’ referring to code becoming obsolete due to lack of maintenance. Can’t we use another term for actual hardware errors?


It may be a regional thing, but I have never heard “bit rot” refer to legacy code. In retro computing circles, bit rot refers to hardware defects (usually in floppies or other storage media) caused by cosmic rays or other environmental hazards.


I have to agree with Kimitri here. This is the only context in which I have ever encountered the term 'bit rot'.


I agree this is the primary context, but I've seen unmaintained (or very old) software being referred to as "bit rotting" by extension. As in, forward compatibility might break due to obsolete dependencies, etc.


Yeah looks like I was confusing “software rot” and “bit rot”.

https://en.wikipedia.org/wiki/Software_rot


I’ve always understood “bit rot” to mean data getting silently corrupted on a storage device like a hard drive or SSD.


Same here. “Bit rot” is then analogous to food rot: the longer your data sits unverified, the more likely that there will be flipped bits and therefore “rotten data”.
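
One way to keep data from sitting unverified is to record digests once and re-check them periodically. Below is a minimal Python sketch of that idea; the function names (`sha256_of`, `record`, `scrub`) and the JSON manifest layout are assumptions made up for illustration, not something anyone in this thread described using.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record(root: Path, manifest: Path) -> None:
    """Write a manifest of digests for every file under root.

    The manifest should live outside root so it is not hashed itself.
    """
    digests = {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    manifest.write_text(json.dumps(digests, indent=2))


def scrub(root: Path, manifest: Path) -> list[str]:
    """Return the recorded files that are now missing or whose contents changed."""
    expected = json.loads(manifest.read_text())
    return [
        name
        for name, digest in expected.items()
        if not (root / name).is_file() or sha256_of(root / name) != digest
    ]
```

A scrub like this only says that the bytes changed, not why: a legitimate edit and a flipped bit look the same, so it fits best on data that is supposed to be immutable, such as a build cache of binary dependencies.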


"sharting"


It's also an excuse/opportunity to write a fairly Hard SF zombie story:

https://www.antipope.org/charlie/blog-static/bit-rot.html


Were you actually able to diff it down to a single bit flip?


It's just a story we usually tell the boss when the whole team wants to goof off for a few days.


Yes. Via git diff --no-index and xxd.
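
For anyone who would rather script that comparison than read hex dumps, here is a rough Python sketch that XORs two files byte by byte and reports which bits differ. It is not the git diff and xxd workflow mentioned above, just an alternative; `differing_bits` and `compare_files` are hypothetical helper names.

```python
from pathlib import Path


def differing_bits(a: bytes, b: bytes) -> list[tuple[int, int]]:
    """Return (byte_offset, bit_number) for every differing bit; bit 0 is the least significant."""
    if len(a) != len(b):
        raise ValueError("inputs differ in length, so this is not a pure bit flip")
    flips = []
    for offset, (x, y) in enumerate(zip(a, b)):
        delta = x ^ y  # set bits mark the positions where the two bytes disagree
        for bit in range(8):
            if delta & (1 << bit):
                flips.append((offset, bit))
    return flips


def compare_files(good: Path, suspect: Path) -> list[tuple[int, int]]:
    """Compare a known-good copy of a binary against a suspect copy."""
    return differing_bits(good.read_bytes(), suspect.read_bytes())


# Example: 'o' (0x6F) and 'm' (0x6D) differ only in bit 1.
print(differing_bits(b"_symbol", b"_symbml"))  # [(5, 1)]
```

A result with exactly one entry is the signature of a single bit flip; the byte offset can then be matched against the file's section layout to see whether it landed in a symbol table or in code.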


I have no words. Stories like this trigger PTSD for me. I wasn't trying to be flip. That must have been a bitch to figure out.


Well, it wasn't that hard to uncover actually. We knew that the same build succeeds on our machines. So we only had to find what the difference was between the two :)

As Arthur Conan Doyle put it: "Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth." ¯\_(ツ)_/¯


Well done, Watson. This calls for a bit of snuff. Seriously, this is the kind of thing that keeps me up at night, and it's nice to hear a happy ending =D


I'm just thinking (of course): you said "If the bit-flip had occurred", but bit flips in the .TEXT section have probably already happened. We don't know what they might already have caused, or whether they just passed without notice (an unreproducible bug, a bit flip in a function that's never called, or whatever).


Is this bit rot happening on a server with ECC RAM? Just curious.


It was a MacStadium (https://www.macstadium.com) build server, so most likely non-ECC.


Strange that build services are not just going full ECC, especially with cheaper hardware now supporting it.


The site says it's a Mac/iOS build platform. Since it's a commercial service, they're probably complying with the license and thus using actual Mac hardware, and in turn the only ECC option is the outdated, really poor-value Mac Pro. It seems more likely they're using Minis instead, or at least mostly Minis. An unfortunate thing about Apple hardware (says someone still nursing along a final 5,1 Mac Pro for a last few weeks).


Aha! I had totally forgotten about that. Solid point, thank you.


And that's why we need deterministic builds


Yes, but in this case this was a bit-flip in a third-party binary dependency that we don't have the source code for.


Then we need the Gentoo approach as well ^^



