Wooing the Googlebot: 5 Steps to an Irresistible XML Sitemap
You can have the best site in the world, but you won’t rank well if it is difficult for a robot to crawl and navigate.
As standard, every site should have an XML sitemap listing its pages. Search engines use this as part of their crawl and subsequent indexing.
It is always a good idea to make these files as easy as possible for a search engine bot to read, as this can speed up the indexing of your site.
In this post I will cover five best practices for setting up and submitting your sitemaps. Not all of them will apply to every site, but on the whole, putting these into practice will make your sitemaps much easier to read and manage, and your site easier to index.
1. Split out into sections/categories
There is a standard limit on the size of a single sitemap of 10MB or 50,000 URLs, and while a search engine robot will read a sitemap up to this size, doing so can take too long and may compromise crawl time.
What to do:
For larger sites, or sites with a clear structure, you can create a few different sitemaps to split out the pages to crawl. For example, you could create a general sitemap for all your top-level pages and a separate sitemap (or sitemaps) for your category pages. Alternatively, you could create different sitemaps depending on how frequently the pages change, to set different crawl priorities across the different parts of the site.
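If you do split things out, the individual sitemaps are usually referenced from a single sitemap index file, as defined by the sitemaps.org protocol. A minimal example (the file names and dates here are just placeholders) looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.mysite.com/top-page-sitemap.xml</loc>
    <lastmod>2014-01-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.mysite.com/category-page-sitemap.xml</loc>
    <lastmod>2014-01-01</lastmod>
  </sitemap>
</sitemapindex>

You can then submit the index file on its own and the search engines will pick up each sitemap it references.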
How does this help?
Splitting out your sitemaps makes it easier for search engine bots to crawl them. It also makes it easier to see in Webmaster Tools which pages have yet to be indexed, and to manage changes to your site: when something changes, you only need to regenerate the affected sitemap rather than a new sitemap for the entire site.
2. Show the links to all your sitemaps in your robots.txt file
This is a pretty simple thing, but it is surprising how often this isn’t done right. A robots.txt file is a small text file that essentially tells a search engine robot what it should and shouldn’t crawl. You can disallow entire directories or pages on your site as well as manage other parameters that pertain to a site crawl. It is always best practice to show the location of your sitemap in the robots.txt file as it is usually the second place the robot will go to when it crawls your site.
What to do:
If you have more than one sitemap, you can list them all here in the following format:
User-agent: *
Allow: /
Disallow: /wp-admin/
Disallow: /category/
Sitemap: http://www.mysite.com/top-page-sitemap.xml
Sitemap: http://www.mysite.com/category-page-sitemap.xml
How does this help?
Essentially it is your first signpost to a search engine, pointing it in the direction that will allow it to crawl your site along the easiest path possible.
3. Make sure there are no broken links
Many sitemap generation tools will crawl your entire site using its internal link structure and produce a sitemap.xml file for you. This has its benefit (it’s quick and easy) but it is not always accurate. If your site has old pages that have been removed but are still linked to internally by mistake, those links will return a 404 error. Sometimes these URLs can end up in your XML sitemap, and that isn’t good when a search engine crawls them and hits a dead end.
What to do:
Once you have generated your sitemap, you can run it through a validator to check that it can be read properly and to catch any errors, such as those created by broken links. I use the W3C Validator (http://validator.w3.org/) most often.
If you find broken links, identify where they appear on your site and remove them, then re-generate your sitemap. If a lot of pages contain broken links, you will want to resolve these as soon as possible to ensure your site doesn’t suffer in other ways.
If you have a large site, or lots of pages to remove from your sitemaps, you can use crawling software like Screaming Frog (http://www.screamingfrog.co.uk/seo-spider/) to find them, then open the sitemap file in a text editor, identify the pages and remove them. Be careful with this, though: if you don’t edit the file appropriately you can corrupt it.
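If you would rather check the sitemap yourself, a short script can also fetch every URL it lists and flag anything that does not return a 200 status. This is only a rough sketch using Python’s standard library, and the sitemap URL is a placeholder:

import urllib.request
import urllib.error
import xml.etree.ElementTree as ET

SITEMAP_URL = "http://www.mysite.com/top-page-sitemap.xml"  # placeholder
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Download and parse the sitemap, then pull out every <loc> entry.
with urllib.request.urlopen(SITEMAP_URL) as response:
    root = ET.fromstring(response.read())
urls = [loc.text.strip() for loc in root.iter(NS + "loc")]

# Request each URL and report anything that doesn't come back as a 200.
for url in urls:
    try:
        with urllib.request.urlopen(url) as page:
            status = page.getcode()
    except urllib.error.HTTPError as error:
        status = error.code
    if status != 200:
        print(status, url)

Anything the script prints is a candidate for removal from the sitemap (and for fixing on the site itself).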
How does this help?
By removing broken links, search engines won’t hit any dead ends in your sitemap, which can speed up the crawl. Errors like these also show up in the Bing and Google Webmaster Tools dashboards and can, in some cases, stop the search engine from even looking at the file.
4. Remove 301 redirected pages
If you have internal links which are 301 redirected, it makes sense to replace them with the new destination URL, both on the site and in the sitemap, rather than leave a redundant link in place. Redundant links slow a crawl down and can slow site navigation down as well!
When it comes to creating the sitemap file, internal links which redirect may get picked up by whatever tool you use to create it. This means you could end up with a sitemap that contains both the old, redirected URLs and the intended destination URLs.
What to do:
Quite simply, scan over your site and ensure that all internal links point to their intended destination rather than to a page which has a redirect on it. I’d recommend using the Microsoft SEO Toolkit for this, as it will pick these up easily for you (you can find out about the tool and its other uses here: http://www.koozai.com/blog/search-marketing/microsoft-seo-toolkit-review/).
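If you want to double-check the URLs in the sitemap itself, a small script can flag anything that responds with a redirect. Again this is only a sketch: it assumes the third-party requests library is installed, and the list of links is a placeholder you would replace with the URLs from your own sitemap or crawl.

import requests

# Placeholder list of internal links; in practice, pull these from your
# sitemap or from a crawl of your site.
internal_links = [
    "http://www.mysite.com/old-page/",
    "http://www.mysite.com/category/widgets/",
]

for url in internal_links:
    # A HEAD request that does not follow redirects, so 301/302 responses stay visible.
    response = requests.head(url, allow_redirects=False, timeout=10)
    if response.status_code in (301, 302):
        print(url, "redirects to", response.headers.get("Location"))

Any URL it flags should have its internal links updated to point straight at the final destination, and should be replaced in the sitemap too.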
How does this help?
Removing redundant links, so that your site and sitemap only contain links to live pages, can reduce the size of the sitemap and the time it takes a search engine to read it, leaving valuable milliseconds for on-page crawling of your site.
5. Submit your sitemap to Google and Bing Webmaster Tools
Once you have created your sitemaps, it is important to submit them to Google and Bing Webmaster Tools. This gives the search engines an additional (and quicker) opportunity to crawl the sitemaps and the site itself.
What to do:
In Google Webmaster Tools, navigate to the Sitemaps section of the main dashboard, click the red ‘Add/Test Sitemap’ button at the top right, and add your sitemap URL(s) excluding the main domain. This adds them to the dashboard.
In much the same way, in the Bing Webmaster Tools dashboard you will see ‘Submit a Sitemap’ in the middle-right section of the screen. Click that and add the sitemap URL to show it in the dashboard.
How does this help?
Submitting to both these Webmaster Tools platforms allows you to better monitor your site’s indexing, to re-submit sitemaps to the search engines manually when there are changes, and to better understand which areas of your site are being crawled most often (if you split your sitemaps out).
Having a well-structured and easily accessible XML sitemap will make it much easier for search engines to find and crawl the pages on your site. Again: you can have the best site in the world, but you won’t rank well if it is difficult for a robot to crawl and navigate! It is in any site owner’s or developer’s best interests to make the site as easy and quick to crawl as possible.
I hope this post has been useful and if you have any questions or other sitemap related comments please feel free to add them!
About the Author ~ Chris Simmance
Chris Simmance is a Digital Marketing Consultant. You can connect with Chris on his website, and Twitter.