Someone please correct me if I'm wrong. The extension: > (3) not scraping or col...

pingswept · on Jan 19, 2014

That function just retrieves the rating for one course, based on the course ID that the user has retrieved by searching for some courses. It is certainly requesting data from Yale's server, but I wouldn't call it scraping. To qualify as scraping (at least as I've done it), you'd have to iterate over all possible IDs and store the records in a database.

For example, consider this basketball scraper: https://github.com/andrewgiessel/basketballcrawler This builds a list of all the current NBA players and then pulls stats on all of them, saving them in a giant JSON file, IIRC. That's scraping.

What Sean Haufler is doing is not massively iterated, and it's linked in time and quantity to human-generated queries. Seems different to me.

ricardobeat · on Jan 19, 2014

Scraping = extracting data from a third-party website. Even if you had a site on a specific course, fetching data for a single resource, it's still scraping.

cynwoody · on Jan 19, 2014

Since it's extracting additional data from the second-party website, I would classify it as enhanced browsing, not as scraping. Enhancing the browsing experience is the purpose of browser extensions.

moocowduckquack · on Jan 19, 2014

Scraping = extracting data from a third-party website.

That definition would seem to include some journalistic quotes.

alanh · on Jan 19, 2014

It would, it that were an automated process. Can we satisfy your quibble by using the word "automated" before "extraction"?

EGreg · on Jan 19, 2014

Scraping involves storing somethjng on a server. I am not sure a user agent can be cited for copyright infringement. Perhaps caching in violation of a server's policy can be considered that, I am not sure. Is there any precdent for caching to be considered scraping and storing by the developer of the client software?

If not, I have an interesting idea for an offline travel app :)

skywhopper · on Jan 19, 2014

I don't know the full etymology, but I first heard the term "screen scraping" about 15 years ago in the context of writing a web application that interfaced with a mainframe application. The mainframe database was not directly accessible to the webapp platform, so instead a session was opened to the mainframe over telnet, and the web application itself issued commands over that telnet session, parsed the terminal output, and dynamically generated a web UI that displayed the data or form fields or what have you.

Scraping these days may more often be a means of extracting data to an offsite cache/database, but ad-hoc on-demand scraping is consistent with the definition.

Regardless, this doesn't really matter. I don't see how any restrictions on "scraping" can possibly be enforced. Screen reader programs that allow visually impaired individuals to browse the web are "scraping". Mobile browsers that reformat pages on the fly to optimize their display on a small screen are "scraping".

Ad-hoc user-initiated scraping of this sort is really no different than just browsing the web directly. So whether one wants to argue that "scraping" is or isn't happening here, it hardly matters to whether this is ethical, and should not matter to whether it is legal (although on that topic, I'm guessing different judges would see it differently).

wisty · on Jan 19, 2014

http://en.wikipedia.org/wiki/Copyright_aspects_of_hyperlinki..., specifically http://en.wikipedia.org/wiki/Perfect_10,_Inc._v._Amazon.com,....

Precedent suggests it's OK to make thumbnails (highly transformative, and not competition), and that it's also OK to hotlink.

From wikipedia:

> conduct was excused because the value to the public of the otherwise unavailable, useful function outweighed the impact on Perfect 10 of Google's possibly superseding use.

> Moreover, in Perfect 10, the court laid down a far-reaching precedent in favor of linking and framing, which the court gave a complete pass under copyright. It concluded that "in-line linking and framing may cause some computer users to believe they are viewing a single Google webpage, [but] the Copyright Act . . . does not protect a copyright holder against acts that cause consumer confusion."

There might be some cases where ajax calls cross the line into infringement.

Caching is easier. Once again, Google smacked down a plaintiff:

http://en.wikipedia.org/wiki/Field_v._Google

> Courts usually do not require a copyright holder to affirmatively take steps to prevent infringement. In this case, however, the court found that the plaintiff had granted Google an implied, nonexclusive license to display the work because of Field’s failure in using meta tags to prevent his site from being cached by Google. This could reasonably be interpreted as a grant of a license for that use and so the courts held that a license for Google to cache the site was implied because Field failed to take the necessary steps when setting up his website.

If the server allows caching (see meta tags), it seems OK to cache it. But it depends (Google had to rely on meta, because they were caching everything ... if you specifically cache a website, you might want to see if hey have a TOS).

I'd just be a little careful in how you transform the data. Also, there's API rules.

Really, you'd need to see a lawyer.

huhtenberg · on Jan 19, 2014

> Scraping involves storing somethjng on a server.

Duh, of course it doesn't. Scraping is taking data in one format from there and presenting it in a different format here.

moocowduckquack · on Jan 19, 2014

That seems wide. I thought it was human readable to machine indexable, not human readable to human readable.

drwl · on Jan 19, 2014

> Scraping involves storing somethjng on a server.

Not really sure I agree with your definition. Website scraping, in general, is taking specifically taking data from a webpage.

ckuijjer · on Jan 19, 2014

I guess he doesn't call it scraping as the function retrieves a json document instead of retrieving some html and extracting data from it.