Good article, but there are a lot of claims of '100% certainty' that aren't necessarily true. The author even states that hash functions only guarantee no collisions to a high probability.
It looks like there are about as many Earth-like planets in the universe as grains of sand on the Earth. Write your name on a grain of sand on one of those planets. Now have someone else randomly pick a single grain of sand from some planet in the universe. How certain are you that they won't pick yours?
The chance of that happening is roughly equal to the chance of a collision randomly occurring somewhere in a few quadrillion SHA256 hashes.
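To put rough numbers on the analogy (the grain and planet counts are order-of-magnitude guesses, and the collision formula is the standard birthday-bound approximation, p ≈ n(n−1)/2^257 for n hashes), a quick sketch:

```python
import math

# Hedged back-of-envelope estimates -- both are order-of-magnitude guesses.
GRAINS_OF_SAND_ON_EARTH = 7.5e18   # common rough estimate
EARTHLIKE_PLANETS = 7.5e18         # "about as many planets as grains of sand"

# Chance a randomly picked grain, out of all grains on all those planets,
# is the one with your name on it.
p_grain = 1.0 / (GRAINS_OF_SAND_ON_EARTH * EARTHLIKE_PLANETS)

# Birthday-bound chance of at least one collision among n random
# 256-bit hashes: p ~= n*(n-1) / 2^257 (valid when p is small).
n = 4e15  # "a few quadrillion" hashes
p_collision = n * (n - 1) / 2**257

print(f"grain pick:     {p_grain:.1e}")
print(f"hash collision: {p_collision:.1e}")
```

The two probabilities differ by some orders of magnitude depending on which estimates you plug in, but the point of the analogy holds either way: both are vanishingly small.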
Yes, the chance that the hash will be broken is much higher than the chance of collisions occurring randomly. I'm just responding to "hash functions only guarantee no collisions to a high probability." People really underestimate how strong that probabilistic guarantee is.
As long as the hash function remains unbroken, untrusted sources can't screw with you.
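To make that concrete: in a content-addressed system the address *is* the hash, so the consumer re-hashes whatever bytes arrive and rejects anything that doesn't match. A minimal sketch of that verification pattern (the names here are illustrative, not any particular system's API):

```python
import hashlib

def verify_content(expected_hex: str, data: bytes) -> bool:
    """Accept data from an untrusted source only if it hashes to the address."""
    return hashlib.sha256(data).hexdigest() == expected_hex

# The address is derived from the content itself...
original = b"hello, content-addressed world"
address = hashlib.sha256(original).hexdigest()

# ...so genuine bytes pass and tampered bytes fail, no trust required.
assert verify_content(address, original)
assert not verify_content(address, b"tampered bytes!")
```

The only way an untrusted source can serve you bad data under a given address is to find a second input with the same SHA-256 digest, i.e. to break the hash function.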
Hash functions tend to be broken gradually and publicly, and we migrate to new ones as they start to look shaky. It's theoretically possible for someone to privately break a function that everyone else thinks is secure, but it would be an extremely impressive achievement since lots of full-time cryptographers work on breaking these things and publish every little bit of progress they make.
Let's suppose everyone has moved to content addressing. Considering the amount of content generated every day, how much time would it take before real hash collisions start to emerge?
The number of 256-bit hashes you would have to generate in order to have a 50% chance that there's a duplicate in there is ~4*10^38. If we had a billion machines each generating a billion new hashes a second, it would take over 12 trillion years to get that many.
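The ~4*10^38 figure is the standard birthday-bound estimate, n ≈ sqrt(2 ln 2 · 2^256). A quick check of the arithmetic (assuming the stated rate of 10^18 hashes per second, i.e. a billion machines at a billion hashes each):

```python
import math

BITS = 256
# Number of random hashes needed for a 50% chance of at least one
# collision (birthday bound): n ~= sqrt(2 * ln(2) * 2^bits).
n_50 = math.sqrt(2 * math.log(2) * 2**BITS)
print(f"hashes for 50% collision odds: {n_50:.2e}")   # ~4.0e38

rate = 1e9 * 1e9                  # a billion machines * a billion hashes/sec
seconds = n_50 / rate
years = seconds / (365.25 * 24 * 3600)
print(f"years at 10^18 hashes/sec: {years:.2e}")      # ~1.3e13
```

That works out to on the order of 10^13 years, roughly a thousand times the current age of the universe.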
I don't assume computing power is constant, just that the rate of content generation is more or less constant (I think a billion devices publishing a billion new pieces of content every second is a pretty reasonable upper bound). OP asked about the odds of collisions occurring by accident due to the sheer volume of content generated and published, not about attacker scenarios.
Obviously we wouldn't use the same hash algorithm and setup for trillions of years, but the sheer absurdity of that length of time at that pace of content production shows this method will last, at least until flaws are found in SHA256.
The 50% chance he's referring to is the chance that, out of all the hashes that exist, there will be two hashes that match. He's saying that in his scenario it would be trillions of years before there's even a single duplicate.