So you're not so hot on the whole search engine thing? The article does slide in...

cloverich · on Dec 9, 2012

> So you're not so hot on the whole search engine thing?

They scrape to generate links for users to go to the site. That's quite different than scraping for...any other purpose? So it seems. Would you (anyone) argue otherwise? (genuine curiosity).

robryan · on Dec 9, 2012

They are also using the title, the description, and some snippets from the page, and taking a cached version of the site and its images that you can view without having to visit the site itself. They are also using this data as a product to sell advertising against.

If there weren't so much benefit for most sites in being in search engine indexes, you would think at least some would object to this scraping.

There is plenty of other scraping that websites want to prevent, even though it takes less data than this. It just doesn't provide much in return for the website.

polyfractal · on Dec 9, 2012

Google is even moving into the territory of scraping content to display. Relevant Wikipedia snippets are now being displayed on the search page as a sidebar. While Wikipedia probably doesn't care...there are plenty of other sites that would not like Google to scrape their content and display it on the search page.

fudged71 · on Dec 9, 2012

Well, it probably sucks for Wikipedia because users aren't seeing the Jimmy Wales messages everywhere if they find the content through Google.

robryan · on Dec 10, 2012

Yeah, Wikipedia is Creative Commons, so that should be okay? You are right though. I wonder if they have the rights to the sports results and weather that they are pulling.

They have even convinced us all to go mark up our pages to help them pull out stuff like ratings and reviews.

kragen · on Dec 10, 2012

Sports results are facts and are statutorily not subject to copyright in the US.

kragen · on Dec 10, 2012

Wikipedia explicitly allows that kind of thing with CC-BY-SA licenses, and indeed gets substantial funding from companies like answers.com that do it. (Incidentally, answers.com was the only way to see TeX equations on Wikipedia on my Android phone last time I checked, so it's not like they're adding no value.)

randomdata · on Dec 9, 2012

From what I understand, Google uses crawled data as a learning set for their translation service. There is no "this phrase was learned from: www.nytimes.com" when I do a translation, so I guess Google is still guilty?

freshhawk · on Dec 9, 2012

Does it have to be a search based interface to the indexed data?

Does finding a link to the scrapee have to be the primary purpose of the site (and therefore Google would be constantly getting "worse" by this scale)?

So how prominent does the link back have to be for it to be ok?

What about the summarized data that search engines are adding these days, so you don't need to leave the Google results page to get your answer, but the data still comes from some site whose name you rarely notice?

edit: as to your curiosity, I honestly do not see the line that you see. Unless it's that the link back to the source is required. I don't know that I agree with that but I would understand it, although that gets harder and harder the more you massage your dataset to be useful to users.