
We're working on it (as always). There is a big improvement, inspired by the Stack Overflow post, on its way shortly.

If people want to help out, the best thing to do is to post examples of specific queries. Those become the "fixed points" around which we can tune until we get it right. The more example queries the better, and I'll make sure they get to the right people.

A good way to get example queries is to look through your search history, which if turned on can be found here: http://www.google.com/searchhistory



Hey moultano, I was just thinking about this problem: how can I tell that a site is spammy? Overwhelmingly, they look really really similar. For example, if you search for "diy solar homes" you'll see a wonderful example of some really spammy sites - they pop up book offers on load, and they share a kind of template with big garish fonts and a whole lot of information laid out carelessly on the page.

Then there's the "what you need, when you need it" category, and then there's the "put your google search in the title even though the page has no relevant results" category (mostly software download sites - e.g. try searching for "application to use nokia e72 with itunes" and you get this site filebuzz in the top two spots that has a whole lot of ads and a bunch of crappy non-related downloads).

So if you add a "uniqueness" index - i.e. find ways to "semantically tag" not just the textual content, but the layout and font choices etc. of particular sites - that will catch the blatant affiliate spam bullshit (diy solar homes, what you need when you need it etc.). Then just figure out a way to prevent those "file buzz" type sites from sticking my search term in the title tags (I actually have no idea how this is done) and you'll eliminate like 95% of the spam.


The top result for [diy solar homes] looks pretty good. http://www.builditsolar.com/ Looks like it has a lot of resources, though I don't know where it gets them or whether it has any claim to them. Was this one of the bad results for you?


BuildIt solar looks like a relatively genuine attempt at building an online resource for information about solar homes, as well as some kits they sell for themselves.

Likewise http://www.treehugger.com/ looks like it has genuine content - although it's obviously a little bit thin on the ground.

These guys are clearly a legitimate business selling a product (well, legitimate website anyway):

http://www.supremeheating.com.au/pool-heating-top/solar-pool...

Now compare those three sites, to these four:

http://www.diy-solar-power-for-homes.com/

http://www.energy4living.hottipsonly.com/solar-power-for-hom...

http://www.diysolarpower4home.com/

http://www.solarwindpowerguide.com/diy-solar-heating/

this one, too, links back to earth4energy - strikingly similar to "energy4living" above:

http://greenerhomediy.com/create-solar-electricity-build-diy...

then we have this incredibly reputable and highly respected forum whirlpool:

http://forums.whirlpool.net.au/archive/1539413

As a human I find it relatively easy to pick out the massively spammy sites amongst those results that are on the first page for the search "diy solar homes" - and I think that any time you were to get several sites that are similar in some set of ways to each other that rank for the same search term then they should be "de-ranked".

So for example, thousands of people use the same wordpress themes, you can't just say "they're all spammers". But if the top 5 results for a particular search all share some measurable characteristics, you could safely say "hmm there's something spammy going on here".
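A minimal sketch of that idea, assuming each result has already been reduced to a set of layout and behaviour features. The feature names, the thresholds and the `looks_coordinated` helper are all made up for illustration; nothing here is how Google actually ranks pages.

```python
from __future__ import annotations

from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two feature sets: 0.0 is disjoint, 1.0 is identical."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def looks_coordinated(
    results: list[set[str]],
    threshold: float = 0.6,   # assumed: pair counts as "similar" at or above this
    min_share: float = 0.5,   # assumed: share of similar pairs that raises the flag
) -> bool:
    """Flag a result page when enough pairs of top results share the same features."""
    pairs = list(combinations(results, 2))
    if not pairs:
        return False
    similar = sum(1 for a, b in pairs if jaccard(a, b) >= threshold)
    return similar / len(pairs) >= min_share


# Hypothetical feature sets for the top results of two searches.
spammy_page = [
    {"popup_on_load", "garish_fonts", "affiliate_links", "keyword_domain"},
    {"popup_on_load", "garish_fonts", "affiliate_links", "keyword_domain"},
    {"popup_on_load", "garish_fonts", "affiliate_links"},
]
mixed_page = [
    {"popup_on_load", "garish_fonts", "affiliate_links"},
    {"forum_archive", "plain_text"},
    {"product_catalog", "contact_page"},
]

print(looks_coordinated(spammy_page))
print(looks_coordinated(mixed_page))
```

The check is per result page rather than per site, so a popular wordpress theme on its own doesn't trip it; it only fires when most of the sites ranking for the same term look alike.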


http://www.diysolarhomes.com/ (3rd result & 4th result) definitely looks spammy. It's like the site owner just bought a bunch of keyword-domain names in order to get high rankings on search. Almost all their links look like affiliate links.

If you click anywhere on the page, you'll get a popup asking you to buy some book of theirs.


Hi -- I run BuildItSolar. It's a non-commercial site for people who want to build renewable energy projects. It's a retirement hobby, not a business. Some of the projects are my own, but many are projects that people have built and sent in the details. It's a site that is of, for, and by DIYers :) Gary


I think they could find your search terms from the referring url, but I am not sure how they are able to get their pages with your terms into the search results and get the terms into their meta description.

They must just have compiled huge lists of relatively specific search terms and have pages against each? But I would think this would be easy to identify and downrank..

It is a puzzle :/


What happened to exact search queries:

For instance, if I search for "a-r" I receive results for "ar".

I hate this. It makes it impossible to filter irrelevant results.

Or try this query: "a-c" -"ac"

This will return 0 results.


And this is where the downfall of Google begins. I also hate that some exact queries are being broad matched without my consent.


I think your search terms are too short.

See "re-elect" -"reelect"


How about adding some sort of a feedback mechanism to search? For instance, when I search for something, maybe some way to mark a result as spam and optionally relevant?

The obvious problem is that spammers would attempt to game the feedback mechanism. But a combination of things like captcha to defend against robots, limits on how many times you can flag/upvote sites in a month (feedback credits), and exposing the feedback mechanism to only real, active-for-a-long-time users above a karma threshold (Google can definitely figure this out looking at the search history, gmail account etc), might be strong enough to beat the spammers.

You could start this as a Labs feature, and see if it works well.
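Roughly what I have in mind, as a sketch only. The `Account` fields, the three limits and the `try_flag` function are invented numbers and names for illustration, not anything Google exposes.

```python
from dataclasses import dataclass

# Assumed limits, picked arbitrarily for the example.
MIN_ACCOUNT_AGE_DAYS = 365
MIN_KARMA = 100
MONTHLY_CREDITS = 20


@dataclass
class Account:
    age_days: int
    karma: int
    passed_captcha: bool
    flags_this_month: int = 0


def try_flag(account: Account) -> bool:
    """Accept one spam flag if the account is eligible, and spend a feedback credit."""
    eligible = (
        account.passed_captcha
        and account.age_days >= MIN_ACCOUNT_AGE_DAYS
        and account.karma >= MIN_KARMA
    )
    if not eligible or account.flags_this_month >= MONTHLY_CREDITS:
        return False
    account.flags_this_month += 1
    return True


veteran = Account(age_days=2000, karma=500, passed_captcha=True)
newcomer = Account(age_days=3, karma=0, passed_captcha=True)

accepted = sum(try_flag(veteran) for _ in range(25))
print(accepted)            # flags accepted before the monthly credits ran out
print(try_flag(newcomer))  # too new and no karma, so the flag is ignored
```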


Please don't dumb down Google. It happens all the time. It happened with stemming. It happened with "Instant". It happened, stealthily, quite a long time ago, when Google started to return results that contained most of the search terms, but not all, or when it returned results that contained words that appeared "in the pages linking to this page".

It happens when one tries to use allintext: and is identified as a robot (why??!?)

People should understand how to use a search engine instead of having machines (second-)guess what they're thinking.

People learn to drive; if people can't drive a car we don't give them a car that drives itself!

-- Oh, wait.


Great example - 35,000+ people a year die in the United States alone because of people's inability to drive safely.

Perhaps there is a better way.


Yes, the end of my comment was a (weak) attempt at humor, since there has been quite some talk lately about Google building cars that will drive themselves:

http://www.nytimes.com/2010/10/10/science/10google.html

Edit: actually, all I really wish for, is for allintext: to work all the time. Why would I be a robot if I'm logged in, on an account of a normally active Gmail box?

Also, why is it such a secret? Why not make it more visible? A checkbox on the home page...


Lots of bot queries do allintext: searches, while fewer humans do that query. Bots can hijack human accounts/cookies too. Email spam botnets often use the valid cookie of their owner's host computer.


Fewer humans do that query because it is kept almost secret (why do bots like allintext?)

I wouldn't mind answering a captcha from time to time, but instead Google just bans the use of allintext for several dozen minutes (or more). Really frustrating.


And number-range searches. I can't count how many times I've wound up with a number-range search query that wouldn't work for me from any IP.


So allintext is essentially a honeypot for bots? Otherwise it seems weird to offer the option at all, if all it does is get you banned for being a robot.


I just tried one of the example searches Marco lists and the result is very strange.

When I search for [2010 ira contribution limit], all the results are spammy. The real official answer (on irs.gov) doesn't even show up on the first page.

BUT, if I use Google Instant, it does show up as the first result. As soon as I hit Enter, it again disappears and only the spammy results remain.

The Instant guess suggests that it's because the IRS website ranks for the plural term with "limits" [2010 ira contribution limits], not the one with "limit".


When I search for [2010 ira contribution limit], all the results are spammy. The real official answer (on irs.gov) doesn't even show up on the first page.

The first result I get is irs.gov. Then again, we know what a bunch of shysters they are, so you still might be onto something. ;-)


Maybe because I'm in the UK.


Oh, in that case Google is definitely trying to tell you how to make a donation to a terrorist organization (IRA)


I get irs.gov at the top as well, and all the results are relevant. Are you logged into google? If so try logging out and retrying-- it would be interesting to see if the results are tailored to individuals.


Obviously you don't want this thread polluted with failed google queries (and with them working properly on duck duck go). Maybe you should offer contact info, or generate some sort of form, or even better have a 'this query didn't work right' button on search.


There is a "this query didn't work right" button on search - it's the "Give us feedback" link at the bottom of a result page, which links to:

http://www.google.com/quality_form?q=foo

The problem is that it gets polluted by lots of people who have no clue that Google is not the Internet, or (for that matter) their neighborhood handyman that they found through Google and who did an awesome job repairing their windows. A lot of people don't make a distinction between "I found what I was looking for" and "What I was looking for worked out for me", which makes a lot of the feedback a little less than useful.

This thread is as good a place as any - I'm guessing that the URL will get passed around, the appropriate teams will read it and adjust their algorithms, and if DuckDuckGo gets to improve their algorithms too, great - it's one more good search engine that people can use.


Why do people insist on blaming users when users use things in the "wrong" way? If you give a general feedback button to a user, they will give you back general feedback of all kinds - it's not that hard to understand, really. If you want to get feedback specific to the quality of the search results, then provide another button that says "Tell us if the results you got are not useful / just spam". Be clear about what exactly the button does, and you'll get meaningful reports from the users. Note that I said another button - you always need a general feedback button for people wanting to report something else, that's how you get general and specific feedback.

Also, google is one of the worst companies, if not the worst company, at dealing with user feedback. You can't just expect people to give you feedback but never return the favor in any way. People have the feeling that giving feedback to google is like throwing things into a black hole - like talking to a machine - if you know no one will answer you and you will never know if anyone even read your feedback, not even a little thank you note or a clue that the feedback was useful, there isn't much incentive to give feedback, now is there?


Google Maps (a different beast, of course) handles feedback exceptionally well. You're first thanked and promised a response, then a couple of days later you usually get a "you were right, we'll fix that and let you know when it's fixed", and then within a few months another "we fixed it, here's a link to what you reported (shown on Google Maps), please let us know if we still didn't get it right".

As a result, I enjoy reporting issues to Google Maps, because I know they will be addressed. Maps is dealing with a much more tangible and unchanging dataset than Search, though.


rmoulton at google if you prefer to email, but posting here is fine too. There is a "give feedback" link at the bottom of the search results which produces a lot of good data, but very little of it is "problems HN folks have with search."



Which result were you looking for in the DDG results?


What exactly were you looking for?


Please see my reply above to nostrademons about the usefulness (or lack thereof) of that button.


Where's the best place to post sample queries?


Right here. :)


This is why people say that Google don't get social. What if I don't want to submit a bad search right now, but in a month? How am I going to find this HN thread?

Create a better mechanism through which people can submit bad searches for human review.


I am sure you must be automatically tracking and analyzing queries where users go to page 2 and beyond. Or where they do not click on any result and instead change the query or abandon the page?
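Something like this, as a sketch. The `Session` log format and the `dissatisfaction_rate` function are hypothetical; I have no idea what Google's real logs look like.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Session:
    query: str
    max_page_viewed: int   # 1 means the user never left the first page
    clicked_result: bool   # False means they reformulated or abandoned


def dissatisfaction_rate(sessions: list) -> dict:
    """Per query, the share of sessions that went past page 1 or ended with no click."""
    totals = defaultdict(int)
    unhappy = defaultdict(int)
    for s in sessions:
        totals[s.query] += 1
        if s.max_page_viewed > 1 or not s.clicked_result:
            unhappy[s.query] += 1
    return {q: unhappy[q] / totals[q] for q in totals}


# Made-up log entries for two queries.
log = [
    Session("diy solar homes", max_page_viewed=3, clicked_result=True),
    Session("diy solar homes", max_page_viewed=1, clicked_result=False),
    Session("re-elect", max_page_viewed=1, clicked_result=True),
    Session("re-elect", max_page_viewed=2, clicked_result=True),
]

rates = dissatisfaction_rate(log)
for query, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(query, rate)
```

Queries that sort to the top would be candidates for the "fixed points" people are being asked to post by hand.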


I would guess the general Google approach to this problem is to try to improve algorithms.

I wonder if a change towards "human input" might improve things more.

For example, what if the Chrome browser had a big feedback button so that if users wanted to help improve the Google search results they could rate the usefulness of the link they just followed?


If it makes you feel better, our first experiment along those lines was in 2001: http://www.cs.unc.edu/~cutts/toolbarbeta.html

Back then (and with "Remove result" and SearchWiki) it had issues because we were trying to get people to recognize spam, and people weren't that good at recognizing spam techniques like hidden text. The more recent complaints we've had are more like "here's content I don't like." So maybe it's time that we tried something similar again.


It seems to me like an explicit "Report Spam" button would help. The problem with SearchWiki & "Remove result" is that there are many other reasons why people would click those buttons.

People are pretty good at recognising "scams", but you need to tell them that's what they're looking for, not just how the content makes them feel. It seems like the report spam button was a large factor behind gmail's spam filtering success. Would love to see the same approach applied here. I know I'd be hitting my report spam button in chrome pretty often :)


Here's a Chrome extension we wrote to allow explicit "Report spam" feedback: https://chrome.google.com/extensions/detail/efinmbicabejjhja...


I still feel like it's not simple enough to be practical. I'd only really use a button that submitted the form behind the scenes. I think the extra step of filling out the fields is likely to drop my submission rate down to times I really get pissed off :)


Wouldn't spammers just "Report Spam" their competitors?


If that strategy worked, spammers could use it in gmail to void spam filter utility. I'm sure the volume of legitimate requests would help drown that noise.


Once Google has a significant amount of real human ratings on the usefulness of a site in general, or the usefulness of the site given a specific search, machine learning techniques could then be used to predict the usefulness of unrated sites.

I just know I see a lot of worthless junky sites on the first page, and I wouldn't think it would be hard to recognize them using ML techniques, which requires training data.


So don't let just any random Joe provide feedback. Crowdsource it to longtime holders of Google accounts who've got a track record.

Recognizing that not all users are created equal is (IMHO) an incredibly powerful insight that Google and many other companies overlook. Qualified, technically literate users will be happy to volunteer, but you have to ask.


There are many dangers to this as well. Fundamentally, as soon as people are aware of the power they hold it will get abused. See: digg bury teams, reddit circle jerks, SEO link farms, etc. And the smaller the number of people that have the power the more likely, more dangerous, and more harmful their abuses can get.

I think the cases where Google breaks are the ones where some people figure out a trick/technique that lets them become a small power circle, which they then obviously abuse. When Google works, it's because they're able to algorithmically spread the power around and see at scale what is quality and what isn't. Therefore I think the better strategy isn't to concentrate power, even within an "elite" class. In fact it's the complete opposite: it should be to make sure the power is spread quite evenly among the masses.


I may come back and ask in a while, so I hope you're right. :)


I think the very first thing is to try to prevent malicious sites from appearing on the first page. For example, try searching for "lawsuit employment rejection". The very first hit is: jaysgrafx.com/char-tritan-energy-power-com-employment-san-marcos-tx-job/. Do not click on it (it forwards you to 84bf4ada.logout3.cz.cc/).


Hey moultano, what about specific sites within an industry that abuse doorway pages and link farms big time? How can we report that to you? In SERPs these domains are technically relevant to the search term, but only show up high in the results because of their deceptive practices.


Which stackoverflow post is this?



The Stack Overflow article seems to address the subject of content being scraped and republished, which is certainly a big issue. What about the equally concerning issue of the overwhelming amount of "fluff" content published with the sole purpose of passing anchor text weight back to a domain? Is anyone else noticing an exponential growth of these types of sites recently or is it just me?



