
Simple:

respect robots.txt

find your data from sitemaps, and make sure you query at a slow rate. robots.txt can specify a crawl-delay (a cool-off period between requests); see https://en.wikipedia.org/wiki/Robots_exclusion_standard#Craw... (a sketch follows below)

Example: https://www.google.com/robots.txt
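A minimal sketch of that workflow in Python, using the standard library's urllib.robotparser. The user-agent string, the target site, and the 5-second fallback delay are placeholder assumptions for illustration, not anything the comment specifies:

    import time
    import urllib.robotparser
    import urllib.request

    USER_AGENT = "my-polite-crawler"   # assumed name; use your own
    SITE = "https://www.example.com"   # assumed target site

    # Fetch and parse the site's robots.txt
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{SITE}/robots.txt")
    rp.read()

    # Sitemap URLs listed in robots.txt (Python 3.8+); start discovery there
    print("Sitemaps:", rp.site_maps() or [])

    # Honor the declared crawl-delay if any, else fall back to a
    # conservative default (assumed value)
    delay = rp.crawl_delay(USER_AGENT) or 5

    for url in [f"{SITE}/", f"{SITE}/private/page"]:
        if not rp.can_fetch(USER_AGENT, url):
            print("Disallowed by robots.txt, skipping:", url)
            continue
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            print(url, resp.status)
        time.sleep(delay)   # keep the request rate slow

This only covers the basic Disallow/Crawl-delay/Sitemap directives; sites may also state rate expectations elsewhere (terms of service, API docs), so treat it as a floor rather than a full politeness policy.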



Yeah, that's a must-do, but I think most websites don't even bother making a robots.txt beyond "please index us, Google". However, that wouldn't necessarily mean they're happy about someone vacuuming up their whole website in a few days.



