Jeff Atwood - The Sitemap Paradox (webmasters.stackexchange.com)
64 points by tzury on Nov 1, 2010 | hide | past | favorite | 18 comments


Maybe I'm missing something, but in http://stackoverflow.com/robots.txt I see:

    Sitemap: /sitemap.xml
But http://stackoverflow.com/sitemap.xml is a 404.

Perhaps he submitted a different sitemap URL to Google directly. Maybe that's the problem though?

All the examples on http://www.sitemaps.org/protocol.php#submit_robots use a fully qualified URL, not a path relative to the domain root. I've always used the fully qualified form.
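For reference, a robots.txt directive following the sitemaps.org examples would look like this (the stackoverflow.com URL here is just illustrative; the directive should carry whatever fully qualified URL actually serves the sitemap):

```
# robots.txt -- per sitemaps.org, Sitemap should be a fully qualified URL
Sitemap: http://stackoverflow.com/sitemap.xml
```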


They are probably blocking everyone except for Google from grabbing the file.



Perhaps Google checks this ... they're typically not cool with cloaking. If you want to hide your sitemap, just give it a non-guessable URL and submit it to Google through the Webmaster control panel.


It still 404s with a GoogleBot user agent.


You can do a reverse (PTR) lookup followed by a forward (A) lookup to check whether an IP is legit Googlebot, though. It's possible they're doing that.
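A sketch of that PTR-then-A check (function names are made up; Google documents that genuine crawler IPs reverse-resolve to a *.googlebot.com or *.google.com hostname that forward-resolves back to the same IP):

```python
import socket

# Suffixes Google documents for genuine crawler reverse-DNS hostnames.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_google_host(host):
    """Pure check: does a PTR hostname look like a Google crawler's?"""
    return host.rstrip(".").endswith(GOOGLE_SUFFIXES)

def is_googlebot(ip):
    """Verify a claimed Googlebot IP: reverse (PTR) lookup, suffix check,
    then a forward (A) lookup that must return the original IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)     # reverse (PTR) lookup
    except OSError:
        return False
    if not is_google_host(host):
        return False
    try:
        return socket.gethostbyname(host) == ip   # forward (A) lookup
    except OSError:
        return False
```

The double lookup matters because anyone can publish a PTR record claiming to be googlebot.com; only the forward confirmation proves the hostname really maps to that IP.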


correct, we are doing this


It makes sense that links found only in the sitemap don't get indexed. Unless you link to the sitemap somewhere on your site (robots.txt doesn't count), you're not going to pass any PageRank to the sitemap, and it's not going to pass any PageRank to the pages it links to.

Being crawled and being indexed are two different things. Sitemaps allow Googlebot to crawl your site more easily. What gets indexed has a lot to do with PageRank, and if you're not flowing PageRank efficiently through the site, you're going to have indexation problems.
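The PageRank-flow argument can be sketched with a toy power iteration (the graph and numbers are made up for illustration; Google's real ranking is vastly more involved):

```python
def pagerank(links, damping=0.85, iters=50):
    """Toy PageRank power iteration over a dict of page -> list of outlinks."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Every page gets a base share; the rest flows along links.
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            targets = outs if outs else pages  # dangling page: spread evenly
            share = damping * pr[p] / len(targets)
            for q in targets:
                new[q] += share
        pr = new
    return pr

# "orphan" is reachable only via the sitemap: nothing on the site links
# to it, so it bottoms out at the minimum score, (1 - damping) / n.
ranks = pagerank({
    "home":   ["a", "b"],
    "a":      ["home"],
    "b":      ["home"],
    "orphan": ["home"],   # links out, but nothing links in
})
```

Running this, the orphan page ends up well below the internally linked pages, which is exactly the indexation problem described above.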

Here's an interesting post on how sitemaps affect crawlers: http://www.seomoz.org/blog/do-sitemaps-effect-crawlers


PR isn't the only thing in the world. Google regularly states that a large percentage of its searches are unique.

If you've got the only site in the entire world that talks about mutant killer spider monkeys it should come up top if someone searches for that term, regardless of the page's PR.

After all, that site's the only one in the world with the foresight to predict the coming apocalypse.


we feed Google the sitemap.txt through Google Webmaster Tools


Pfft, that guy only has a 25% accept rate.


What is the 25% a reference to? Obviously missing some kind of in-joke... ?

And for those of you like me who might wonder what an "accept rate" is, it's the percentage of accepted answers to questions asked by a given user on Stack Overflow — http://blog.stackoverflow.com/2009/08/new-question-asker-fea...


In my experience, sitemaps aren't for URL discovery so much as for URL prioritization. Google will pretty much crawl your whole site whether or not you include a sitemap. Where the sitemap becomes important is in whether Google puts a page in the supplemental index or the main index. With a sitemap, you can specify a priority for each page and basically hint to Google that some pages are more important than others.

http://sitemaps.org/protocol.php#prioritydef
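For reference, a minimal sitemap using that priority hint might look like this (example.com and the paths are invented; per the protocol, priority ranges from 0.0 to 1.0, defaults to 0.5, and is only a hint relative to your own pages):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://example.com/archive/2009/old-post</loc>
    <priority>0.3</priority>
  </url>
</urlset>
```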


the consensus is that sitemap.xml works best for rapid discovery of new content -- it's not very good for discovery of deep content the crawler can't get to because of the aforementioned paradox (that is, if Google can't see you linking to your own page, it is disinclined to let the sitemap link matter)


I've been working on an e-commerce project for a client that ended up having some major issues with database imports, and the client wanted results now. Anyway, we got all of the items online and into Google with sitemaps, but the links were orphans ... you could find them via the built-in site search and from a Google search, but not through any link path ...

After about 2 or 3 months like this we started to see the traffic going down and the item pages were getting de-indexed.

Since we fixed the import (and internal politics) problem (about 6 months after de-indexing) we have seen a steady increase in traffic again and the number of indexed pages is going up slowly ... very slowly ...

My take on this experience is that you need some sort of clickable link path to your content, or any gains you might get from a sitemap will be taken away ... Sitemaps might get your pages crawled faster, they may even get them into the index faster ... but to keep them there you need good site structure ...

It is a tool and won't fix design problems ...


this is consistent with our experience as well


This is why you need two types of sitemap: the HTML kind, which can pass PageRank, and the XML kind, which can denote page priority.


we make use of google site search for our search and sitemaps are the vehicle by which we can ensure a new site of ours gets indexed within 24 hours (we have a lot of small, frequently changing sites)



