
Not related to Scrapy, but what are some things you scrape the web for?


I once scraped every October posting from Slashdot to see long-term trends. Short story: it's dying. I project the active userbase will be gone by 2020. Curiously, the bulk of the posters were in the 100k to 300k UID range. There was also evidence of shenanigans with UID assignment: at various times they skipped even numbers or odd numbers, possibly to inflate their user count.

BTW, this will get you IP-banned, but I did get a full data set before their script caught me.


ScrapingHub (the guys behind Scrapy) offers Crawlera, which provides some sort of automatic proxying and throttling so you can scrape away without getting banned.
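
For small jobs, the throttling half of that can be approximated by hand. Here is a minimal sketch, assuming a fixed minimum delay per host is enough to stay under the radar. It does no proxying and is not how Crawlera works internally; `DomainThrottle` and `min_delay` are names made up for illustration.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum delay between requests to the same host.

    Hypothetical helper for illustration; not part of Scrapy or Crawlera.
    """

    def __init__(self, min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay      # seconds between hits on one host
        self._clock = clock             # injectable so it can be tested
        self._sleep = sleep
        self._last_request = {}         # host -> time of most recent request

    def wait(self, url):
        """Block until it is polite to request `url`; return seconds waited."""
        host = urlparse(url).netloc
        last = self._last_request.get(host)
        waited = 0.0
        if last is not None:
            remaining = self.min_delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        # Record the time at which the request is about to go out.
        self._last_request[host] = self._clock()
        return waited
```

Call `throttle.wait(url)` right before each fetch. Requests to different hosts do not delay each other, and a fixed delay like this will not help once a site has already banned your IP.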


I wonder if Netcraft already confirmed that Slashdot is dying ;)



