
Even for data structures, you generally want to use a cryptographic hash function, to prevent collision attacks and similar exploits from degrading the structure's performance.

For example, if I can sufficiently reverse your hashing process, I can degrade a hash table into a linked list for a particular set of keys - something your firewalls and other security measures have virtually no chance of stopping, but which greatly increases the time it takes to process (malicious) requests against that data.
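A minimal Python sketch of the effect, assuming the attacker has already found many inputs with identical hashes (`CollidingKey` here is a stand-in for such attacker-chosen keys, not a real attack on Python's hash):

```python
# Hypothetical key type standing in for attacker-chosen colliding inputs.
class CollidingKey:
    def __init__(self, v):
        self.v = v

    def __hash__(self):
        # Every key lands in the same bucket, so lookups in that bucket
        # degenerate into a linear scan - effectively a linked list.
        return 42

    def __eq__(self, other):
        return isinstance(other, CollidingKey) and self.v == other.v


# Each insert now compares against every previously inserted colliding
# key, so n inserts cost O(n^2) instead of the expected O(n).
table = {}
for i in range(1000):
    table[CollidingKey(i)] = i
```

The table still behaves correctly - it just gets pathologically slow, which is exactly why this is a denial-of-service vector rather than a correctness bug.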

A common way to do this is to insert maliciously chosen new items, so that the structure degenerates into a list as it is built up.



Well, there are various applications where you don't necessarily need the cryptographic property. Hashing is widely used for large-scale back-end machine learning tasks (LSH, linear models, ...) where speed and randomness are the only properties that matter.


Until recently I don't think any major hash table implementation (Java, the STL, Python, Ruby, etc.) used a cryptographically secure hashing algorithm. They're just too slow - slow enough that you might as well use a tree-based data structure, which doesn't suffer from hash-collision attacks.

This is changing a bit now that SipHash is available. Unfortunately, SipHash is still a bit too slow and forces an uncomfortable trade-off.


I would be really surprised to find that a tree is faster than a fast cryptographic hash function like SipHash. A cache miss to main memory costs tens of instructions' worth of time, and that number is going up in many cases. Hash functions can be quite efficient at letting the processor run multiple instructions in parallel.

If you know the tree will be cached, and/or the keys you have to hash are large, then sure, a tree might be a win - but hashing a small key might cost only as much as two or three cache misses. Comparing tree nodes is not instruction-free either.

I think it is really a case-by-case sort of thing, but there is no substitute for measuring (and nailing down the comparison to specific hash functions).


"Unfortunately SipHash still is a bit too slow and forces an uncomfortable trade-off."

2.6x slower than DJB2 in my recent Rust trial. (SipHash is the default hasher in Rust's standard library.)


This is why hashes for hashtables and similar data structures are seeded with a random value at interpreter startup. This throws off the attack well enough to make it barely feasible, provided the hashing algorithm distributes sufficiently randomly. There was a round of DoS vulnerabilities against PHP, Ruby, Java, and others a while ago, all based on that attack vector, but those have all been fixed.
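A minimal sketch of the idea in Python - `seeded_djb2` is a hypothetical seeded variant of the DJB2 string hash for illustration, not any interpreter's actual function:

```python
import os

# The interpreter would pick SEED once at startup. An attacker who
# precomputed colliding inputs against the unseeded hash no longer
# knows how keys map to buckets in this particular process.
SEED = int.from_bytes(os.urandom(8), "little")


def seeded_djb2(s: str, seed: int = SEED) -> int:
    # Classic DJB2 (h = h * 33 + c), with the seed mixed into the
    # initial state and the result truncated to 64 bits.
    h = 5381 ^ seed
    for ch in s:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFFFFFFFFFF
    return h
```

The hash is still deterministic within one process run, so the table works normally; only the cross-process predictability the attacker relied on is gone.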


A secret seed is not sufficient to prevent DoS vulnerabilities; that is one of the things the creators of SipHash demonstrated. You can create collisions with MurmurHash3 that work regardless of the seed.
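A toy illustration of why seed-independent collisions are possible - this is not MurmurHash, just a deliberately weak hash that mixes the seed too shallowly, which is the essence of the problem:

```python
M = 1024  # number of buckets


def weak_hash(key: int, seed: int) -> int:
    # The seed only shifts every output uniformly, so any two keys
    # that collide at one seed collide at every seed - the
    # randomization buys nothing against a prepared attacker.
    return (key + seed) % M


# Keys congruent mod M collide under any seed whatsoever.
k1, k2 = 7, 7 + M
```

Real seed-independent collisions against MurmurHash exploit internal structure rather than anything this simple, but the failure mode is the same: the seed never reaches the part of the state the attacker controls.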


They've also released code to generate an arbitrarily large number of colliding keys for MurmurHash2 and CityHash64, as well as code to recover the secret key of Python's randomized hash function. At some point here you're doing cryptography whether you want to or not -- and most of these randomized hash functions are just lousy cryptography.


This is right - my wording was incorrect, excuse me. I should have said, more correctly, "This is why the hash functions used nowadays incorporate randomization with a seed determined at interpreter startup." JRuby, for example, now uses Perl's hashing algorithm; it used to rely on MurmurHash2, just as CRuby did.



