So you're not so hot on the whole search engine thing?
The article does slide into the sketchy side (I've always wanted an excuse to do that client-side JavaScript trick too), but I found it more interesting because of that; these aren't secrets. Maybe if I put my "won't somebody please think of the children" hat on, I'd agree that glorifying the use of trojan code to potentially DDoS someone's server, just to get around rate limits the server's owners want, is bad. But adults, and especially adult self-described hackers, should be able to read this without mock outrage. It's interesting and it's happening all the time.
You can't condemn web scraping, though; it's the backbone of the services we all depend on for most internet-related things. That's the whole point of structured markup and the World Wide Web itself.
> So you're not so hot on the whole search engine thing?
They scrape to generate links for users to go to the site. That's quite different from scraping for... any other purpose, or so it seems. Would you (anyone) argue otherwise? (Genuine curiosity.)
They also use the title, the description, and some snippets from the page, and they keep a cached version of the site and its images that you can view without having to visit the site itself. They also use this data as a product to sell advertising against.
If there weren't so much benefit for most sites in being in search engine indexes, you would think at least some would object to this scraping.
There is lots of other scraping that websites want to prevent that takes even less data than this. It just doesn't provide much in return for the website.
Google is even moving into the territory of scraping content to display. Relevant Wikipedia snippets are now being displayed on the search page as a sidebar. While Wikipedia probably doesn't care, there are plenty of other sites that would not like Google to scrape their content and display it on the search page.
Yeah, Wikipedia is Creative Commons, so that should be okay?
You are right, though: I wonder if they have the rights to the sports results and weather that they are pulling.
They have even convinced us all to go mark up our pages to help them pull stuff like ratings and reviews out.
Wikipedia explicitly allows that kind of thing with CC-BY-SA licenses, and indeed gets substantial funding from companies like answers.com that do it. (Incidentally, answers.com was the only way to see TeX equations on Wikipedia on my Android phone last time I checked, so it's not like they're adding no value.)
From what I understand, Google uses crawled data as a learning set for their translation service. There is no "this phrase was learned from: www.nytimes.com" when I do a translation, so I guess Google is still guilty?
Does it have to be a search-based interface to the indexed data?
Does finding a link to the scrapee have to be the primary purpose of the site (in which case Google would be constantly getting "worse" on this scale)?
So how prominent does the link back have to be for it to be OK?
What about the summarized data that search engines are adding these days, so you don't need to leave the Google results page to get your answer, but the data still comes from some site whose name you rarely notice?
Edit: as to your curiosity, I honestly do not see the line that you see, unless it's that the link back to the source is required. I don't know that I agree with that, but I would understand it, although that gets harder and harder the more you massage your dataset to be useful to users.