
Sounds like the algorithm noticed the duplication and picked a "winner". The new winner isn't the same as the old one.


While we are on this topic, does anyone know the current state of the art for text duplication detection algos? I understand that Google used LSH, but they must have made a lot of progress since then.

LSH: http://en.wikipedia.org/wiki/Locality-sensitive_hashing
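
For anyone curious, the basic shingling + MinHash + banding recipe behind LSH-style near-duplicate detection looks roughly like this. A toy sketch only: the shingle size, the 64 hash functions, the 16 bands, and the salted-MD5 stand-in for a proper hash family are all arbitrary choices for illustration, not whatever Google actually runs:

    import hashlib

    def shingles(text, k=5):
        # Break text into overlapping k-word shingles (the unit of comparison).
        words = text.lower().split()
        if len(words) <= k:
            return {" ".join(words)}
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def minhash_signature(shingle_set, num_hashes=64):
        # Simulate num_hashes independent hash functions by salting one
        # hash with the function index; real systems use a proper hash family.
        return [
            min(int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16)
                for s in shingle_set)
            for i in range(num_hashes)
        ]

    def estimated_jaccard(sig_a, sig_b):
        # The fraction of positions where two signatures agree estimates
        # the Jaccard similarity of the underlying shingle sets.
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    def lsh_bucket_keys(sig, bands=16):
        # Split the signature into bands; documents sharing any band key
        # become candidate pairs, so only candidates need a full check.
        rows = len(sig) // bands
        return [hash((b, tuple(sig[b * rows:(b + 1) * rows])))
                for b in range(bands)]

    a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog today"))
    b = minhash_signature(shingles("the quick brown fox jumped over the lazy dog today"))
    print(estimated_jaccard(a, b))  # high for near-duplicates, near 0 otherwise

The banding step is the whole point of LSH: instead of comparing every pair of documents, you only compare pairs that collide in at least one band bucket.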


I'm not sure about the algorithm in use, but what I hope is happening is that Google is now looking for the earliest publication of content when deduplicating. Most copycat sites have to copy their text from something that already exists, and if Google has already indexed that, they know that later versions of it are copies (and can presumably be knocked down in rank).
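
Conceptually it would look something like this. Purely speculative: Google doesn't document any of this, and the record fields, cluster key, and URLs below are all made up for illustration:

    from datetime import date

    # Hypothetical records; "sig" stands in for a near-duplicate cluster
    # key (e.g. from an LSH candidate step like the sketch above).
    docs = [
        {"url": "original.example/post",  "first_seen": date(2011, 3, 1), "sig": "abc123"},
        {"url": "scraper.example/copy",   "first_seen": date(2011, 3, 9), "sig": "abc123"},
        {"url": "unrelated.example/page", "first_seen": date(2011, 3, 5), "sig": "def456"},
    ]

    # Group documents by duplicate cluster.
    clusters = {}
    for d in docs:
        clusters.setdefault(d["sig"], []).append(d)

    # Within each cluster, the earliest-indexed copy wins; the rest are
    # presumed copies and could be demoted in ranking.
    for sig, dupes in clusters.items():
        canonical = min(dupes, key=lambda d: d["first_seen"])
        copies = [d["url"] for d in dupes if d is not canonical]
        print(canonical["url"], "beats", copies)

The obvious weakness is that "first indexed" is only a proxy for "first published", so a scraper that gets crawled before the original would still win.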



