Sitemaps: Frequently Asked Questions, Special Sitemap Types, Creating Robot Whitelists, and Troubleshooting Common Errors
Published 7 months ago | Last update 7 months ago
You may ask yourself, "How can I learn about sitemaps?" It turns out that you can learn about XML sitemaps in this post.
- Creating and Formatting a Sitemap
- What do the <loc> and <lastmod> tags do in an XML sitemap?
- How to format the <lastmod> tag for an XML sitemap?
- What do the <changefreq> and <priority> tags do in an XML sitemap?
- What are the <urlset> and <url> tags in an XML sitemap?
- What are the <sitemapindex> and <sitemap> tags?
- When should I use multiple XML sitemaps instead of one? And how can I create multiple sitemaps?
- How can I create a sitemap in WordPress, Joomla, Drupal, or another CMS?
- What about creating a sitemap for website builders like Weebly, Jimdo, and Wix?
- What about eCommerce sites like WooCommerce, Magento, and Shopify?
- Special Sitemaps
- What is a video sitemap?
- What is an image sitemap?
- What is a Google News sitemap?
- Troubleshooting and Common Errors
- Do XML sitemaps need to be stored in a website's root directory?
- XML Parse Errors
- Uploading a Sitemap to Google Webmaster Tools
- Should pagination for comments/forum posts be included in an XML sitemap?
- Restricting Sitemap Access and Setting Whitelists
- Should I restrict access to my sitemap(s)?
- How can I prevent competitor businesses from scraping URLs from my website's sitemap?
- How can I only allow access to my XML sitemap(s) via a whitelist?
- DuckDuckBot (DuckDuckGo)
- BingBot (Bing)
- GoogleBot (Google)
- Slurp (Yahoo!)
- Additional Resources
Sitemaps exist. Here's some information.
Creating and Formatting a Sitemap
What do the <loc> and <lastmod> tags do in an XML sitemap?
- The <loc> tag indicates the location of the resource, i.e., the URL.
- The <lastmod> tag indicates the last time that a resource was modified. Robots (such as search engines) may take this into account when deciding whether to recrawl a page, although they generally treat this information as more of a suggestion than a rule.
How to format the <lastmod> tag for an XML sitemap?
Use the W3C datetime standard:
YYYY-MM-DD works fine. You can also get more specific by adding a time and a timezone offset after the date.
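As a rough illustration, here is a minimal Python sketch that produces both forms using only the standard library. The function names are made up for this example, and it assumes you already know the modification time of the page.

```python
from datetime import date, datetime, timedelta, timezone


def lastmod_date(day: date) -> str:
    """Return the date-only form: YYYY-MM-DD."""
    return day.strftime("%Y-%m-%d")


def lastmod_datetime(moment: datetime) -> str:
    """Return the date, time, and timezone offset form."""
    if moment.tzinfo is None:
        # Without a timezone the offset would be missing from the output.
        raise ValueError("use a timezone-aware datetime")
    return moment.isoformat(timespec="seconds")


if __name__ == "__main__":
    print(lastmod_date(date(2024, 3, 5)))
    plus_one = timezone(timedelta(hours=1))
    print(lastmod_datetime(datetime(2024, 3, 5, 14, 30, tzinfo=plus_one)))
```

Either string can be used as the text of a <lastmod> tag.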
What do the <changefreq> and <priority> tags do in an XML sitemap?
It seems highly plausible that they are, for the most part, ignored by the major search engines in favor of their own algorithms. Regardless:
- The <changefreq> tag can be set to one of these values: always, hourly, daily, weekly, monthly, yearly, never.
- The <priority> tag is a value from 0.0 to 1.0 that indicates (in your assessment) how important a page is relative to other pages on your website.
In theory, these two tags indicate to search engines how often pages should be recrawled for new information, although due to the obvious potential for abuse, it is likely that this is ignored to a large extent. If these values are taken into account, they are likely taken into account relative to the other entries in your sitemap; setting everything to a changefreq of hourly and a priority of 1.0 is unlikely to result in your website being crawled every hour.
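If you generate these values programmatically, a small check can catch invalid entries before they end up in the file. The following is only a sketch; the function name and the choice to format priority with one decimal place are assumptions made for this example.

```python
ALLOWED_CHANGEFREQ = {
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
}


def validate_hints(changefreq: str, priority: float) -> tuple[str, str]:
    """Return (changefreq, priority) as tag text, or raise ValueError."""
    if changefreq not in ALLOWED_CHANGEFREQ:
        raise ValueError(f"invalid changefreq: {changefreq!r}")
    if not 0.0 <= priority <= 1.0:
        raise ValueError(f"priority must be between 0.0 and 1.0, got {priority}")
    # One decimal place is an arbitrary formatting choice for this sketch.
    return changefreq, f"{priority:.1f}"


if __name__ == "__main__":
    print(validate_hints("weekly", 0.8))
```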
What are the <urlset> and <url> tags in an XML sitemap?
- The <urlset> tag is the parent wrapper tag for the set of <url> tags in a sitemap.
- Each <url> tag is the parent wrapper tag for the <loc>, <lastmod>, <changefreq>, and <priority> tags of a single page.
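To see how these tags nest, here is a minimal sketch that builds a small sitemap with Python's xml.etree.ElementTree. The function name and the example.com URLs are placeholders, and only <loc> and <lastmod> are written to keep it short.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(entries) -> bytes:
    """Build a sitemap from (loc, lastmod) pairs; lastmod may be None."""
    # <urlset> wraps every <url> entry in the file.
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        # Each <url> wraps the tags describing one page.
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    pages = [
        ("https://example.com/", "2024-03-05"),
        ("https://example.com/about", None),
    ]
    print(build_urlset(pages).decode("utf-8"))
```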
What are the <sitemapindex> and <sitemap> tags?
When creating multiple sitemaps, these two tags are used to structure a parent/index sitemap that points to multiple child sitemaps.
- The <sitemapindex> tag wraps all of the child elements for a parent sitemap.
- The <sitemap> tags used in a sitemap index include <loc> and <lastmod> tags that provide information about where to find child sitemaps and their time of last modification.
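A sitemap index can be built the same way as a regular sitemap, just with different tag names. This is a sketch under the same assumptions as any hand-rolled generator: the function name and the child sitemap URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(children) -> bytes:
    """Build an index from (loc, lastmod) pairs describing child sitemaps."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc, lastmod in children:
        # Each <sitemap> entry points at one child sitemap file.
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
        if lastmod:
            ET.SubElement(sitemap, "lastmod").text = lastmod
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    children = [
        ("https://example.com/sitemap-1.xml", "2024-03-05"),
        ("https://example.com/sitemap-2.xml", "2024-03-01"),
    ]
    print(build_sitemap_index(children).decode("utf-8"))
```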
When should I use multiple XML sitemaps instead of one? And how can I create multiple sitemaps?
The main reason to use multiple sitemaps is that the sitemap standard sets size limits:
You can provide multiple Sitemap files, but each Sitemap file that you provide must have no more than 50,000 URLs and must be no larger than 50MB (52,428,800 bytes). If you would like, you may compress your Sitemap files using gzip to reduce your bandwidth requirement; however the sitemap file once uncompressed must be no larger than 50MB. If you want to list more than 50,000 URLs, you must create multiple Sitemap files.
Source: The Sitemap Protocol
To implement multiple sitemaps, use the <sitemapindex> tag (and sub tags) for your main sitemap and from there link to your additional sitemaps.
More Information: https://support.google.com/webmasters/answer/75712?hl=en
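If you are generating sitemaps yourself, splitting a long URL list at the 50,000 URL limit might look roughly like this. The sitemap-N.xml naming scheme and the function names are assumptions for this sketch, and it does not check the 50MB size limit.

```python
from typing import Dict, Iterator, List, Sequence

MAX_URLS_PER_SITEMAP = 50_000


def chunk_urls(urls: Sequence[str], size: int = MAX_URLS_PER_SITEMAP) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield list(urls[start:start + size])


def plan_sitemaps(urls: Sequence[str], base_url: str) -> Dict[str, List[str]]:
    """Map each child sitemap URL to the page URLs it should list."""
    return {
        f"{base_url}/sitemap-{number}.xml": chunk
        for number, chunk in enumerate(chunk_urls(urls), start=1)
    }


if __name__ == "__main__":
    pages = [f"https://example.com/p/{i}" for i in range(120_001)]
    for name, chunk in plan_sitemaps(pages, "https://example.com").items():
        print(name, len(chunk))
```

The keys of the returned mapping are what you would list in the <sitemapindex> file.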
How can I create a sitemap in WordPress, Joomla, Drupal, or another CMS?
In general, the easiest way to create a sitemap when using a CMS is to search for a plugin that will handle its creation for you. Here are a few places to start:
What about creating a sitemap for website builders like Weebly, Jimdo, and Wix?
Wix creates a sitemap for you out of the box with nothing required on your end:
Every Wix site contains a sitemap, which is automatically generated by our server and is always kept up to date with your site’s information.
Weebly also creates a sitemap for you out of the box:
Weebly automatically generates a sitemap for you. To access your sitemap simply add /sitemap.xml to the end of your homepage.
As does Jimdo:
You can find the XML sitemap for your website by visiting www.yoursite.com/sitemap.xml (or yoursite.jimdo.com/sitemap.xml).
What about eCommerce sites like WooCommerce, Magento, and Shopify?
Check out the post by Hash Brown here:
Special Sitemaps
Video, image, and Google News sitemaps are additional types of XML files that, while outside of the broader sitemap standard, are supported by a number of services, most notably Google.
What is a video sitemap?
A video sitemap indicates additional information about video content and can be used to:
- Indicate the data and settings related to embedding and autoplaying videos
- Indicate a preferred thumbnail to show in search results
- List metadata, such as video duration and publication date
- Improve the general capability of a website to rank in video searches
What is an image sitemap?
An image sitemap indicates additional information about images, such as captions, geographic location, and licensing information.
More Information: https://support.google.com/webmasters/answer/178636?hl=en&ref_topic=4581190
What is a Google News sitemap?
A Google News sitemap indicates information designed to facilitate the inclusion of topical news articles in the Google News aggregator:
A Google News sitemap lets you control which content you submit to Google News.
Sitemaps allow Google News to find all the news articles on your site more quickly
More Information: Google Webmasters' Forum
Troubleshooting and Common Errors
Do XML sitemaps need to be stored in a website's root directory?
Storing sitemaps in other locations can be done. However, this comes with limitations related to what URLs can be included in those sitemaps and "it is strongly recommended that you place your Sitemap at the root directory of your web server."
More Information: https://www.sitemaps.org/protocol.html#location
XML Parse Errors
XML <parsererror>: This page contains the following errors: error on line 2 at column 1: Extra content at the end of the document. Below is a rendering of the page up to the first error.
To address this error, make sure that there is no leading or trailing whitespace (or other stray output) at the start or end of your XML document. In particular, the <?xml ...?> declaration must be the very first thing in the file.
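One way to catch this before a search engine does is to run the file through an XML parser yourself. Here is a minimal sketch using Python's standard library; the function name is made up, and the exact wording of the error message comes from the underlying parser, so it may differ from what a browser shows.

```python
import xml.etree.ElementTree as ET


def check_sitemap_xml(raw: bytes) -> str:
    """Return "ok" or a short description of the first parse problem."""
    try:
        ET.fromstring(raw)
    except ET.ParseError as exc:
        # The message includes the line and column of the problem.
        return f"parse error: {exc}"
    return "ok"


if __name__ == "__main__":
    good = b"<?xml version='1.0' encoding='UTF-8'?><urlset></urlset>"
    # A blank line before the declaration, e.g. from a stray newline in a template.
    bad = b"\n" + good
    print(check_sitemap_xml(good))
    print(check_sitemap_xml(bad))
```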
Uploading a Sitemap to Google Webmaster Tools
The following errors can occur when attempting to upload a sitemap to Google's Webmaster Tools console:
Your sitemap appears to be an html page. Please use a supported sitemap format instead.
It seems that this error can be caused by caching plugins. Try clearing the cache of any caching plugins, if applicable.
Additionally, if your site also has a human readable sitemap, such as [...].com/sitemap.php or [...].com/sitemap.html, make sure you are submitting the sitemap.xml file.
Sitemap contains URLs which are blocked by robots.txt.
This error indicates that your site's robots.txt file contradicts your XML sitemap. Check whether your site's robots.txt file contains any Disallow rules that match URLs listed in the sitemap, and either adjust those rules or remove the affected URLs from the sitemap.
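To find the conflicting URLs, you can test each sitemap entry against your robots.txt rules. The sketch below uses Python's urllib.robotparser; the function name and the sample rules are hypothetical, and Python's parser may not interpret every rule exactly the way a given search engine does.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, sitemap_urls, user_agent: str = "Googlebot"):
    """Return the sitemap URLs that robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in sitemap_urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    urls = [
        "https://example.com/",
        "https://example.com/private/page",
    ]
    print(blocked_urls(robots, urls))
```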
Should pagination for comments/forum posts be included in an XML sitemap?
In general, omitting pagination is a good idea, although this varies on a case-by-case basis. Considerations:
- Are your paginated pages essentially duplicates of each other?
- Is there a canonical version of the paginated pages? If so, generally include that independently.
- Would it be redundant and/or excessive to list all of the paginated URLs?
- Would you actually want users to find the 212th page of some search directory?
Note that omitting paginated URLs does not mean that they won't be indexed; it means they are not explicitly included in the list of "official" URLs on your website.
This article by Moz also points out additional considerations here:
Restricting Sitemap Access and Setting Whitelists
Should I restrict access to my sitemap(s)?
Plausibly. If preventing scraping of content is important to you, there are ways to restrict access to only allow whitelisted bots (such as specific search engines).
Note that whitelisting bot access to sitemaps prevents many useful bots (for example, obscure search engines, as well as site analysis tools like Majestic and Ahrefs) from crawling your site. It is often the case that if you are concerned about having your content scraped by bots, there is a different, broader problem with your website or business strategy.
How can I prevent competitor businesses from scraping URLs from my website's sitemap?
You might try the strategy that Stack Exchange uses:
Although it can still be found/seen to some extent by accessing the page that Google has cached.
How can I only allow access to my XML sitemap(s) via a whitelist?
Look up the documentation for the specific website/bot that you want to whitelist. Here is where you can find information on how to do this for common bots:
DuckDuckGo's web crawler's requests can be identified by IP address:
DuckDuckBot is the Web crawler for DuckDuckGo. It respects WWW::RobotRules and originates from these IP addresses:
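The address list itself is not reproduced here. As a rough sketch of how an IP-based whitelist check could look, the example below uses Python's ipaddress module. The networks are placeholders from the reserved documentation ranges, not DuckDuckBot's real addresses; replace them with the addresses published by the crawler you want to allow.

```python
import ipaddress

# Placeholder networks (reserved documentation ranges), NOT real crawler IPs.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]


def is_whitelisted(remote_addr: str) -> bool:
    """Return True if remote_addr falls inside one of the allowed networks."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a valid IP address at all.
        return False
    return any(address in network for network in ALLOWED_NETWORKS)


if __name__ == "__main__":
    for candidate in ("192.0.2.10", "203.0.113.5"):
        print(candidate, is_whitelisted(candidate))
```

Your web server or application would call a check like this before serving the sitemap and return a 403 otherwise.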
How to verify Bingbot traffic:
If you see what appears to be Bingbot traffic in your server logs based on a user agent string, for example Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm), and you want to know if this traffic really is originating from a Bing server, you can take the following steps:
- Perform a reverse DNS lookup using the IP address from the logs to verify that it resolves to a name that end with search.msn.com
- Do a forward DNS lookup using the name from step 1 to confirm that it resolves back to the same IP address
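The two steps above can be sketched in Python with the socket module. This is an illustration rather than Bing's own tooling: the function name is made up, the lookups are passed in as parameters so they can be swapped out for testing, and the default forward lookup only handles IPv4.

```python
import socket


def verify_crawler(ip, allowed_suffixes=(".search.msn.com",),
                   reverse_lookup=None, forward_lookup=None) -> bool:
    """Reverse-resolve ip, check the hostname suffix, then forward-confirm it."""
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda name: socket.gethostbyname_ex(name)[2])
    try:
        # Step 1: reverse DNS lookup on the IP address from the logs.
        hostname = reverse_lookup(ip).lower().rstrip(".")
        if not hostname.endswith(tuple(allowed_suffixes)):
            return False
        # Step 2: forward lookup must resolve back to the same IP address.
        return ip in forward_lookup(hostname)
    except OSError:
        # Covers failed lookups (socket.herror and socket.gaierror).
        return False


if __name__ == "__main__":
    # Requires network access; the address is a documentation placeholder.
    print(verify_crawler("192.0.2.1"))
```

The same pattern should work for other crawlers that document a reverse DNS check; pass the hostname suffixes they publish as allowed_suffixes.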
Documentation for how to do this is also provided by Google:
You can verify if a web crawler accessing your server really is Googlebot (or another Google user-agent). This is useful if you're concerned that spammers or other troublemakers are accessing your site while claiming to be Googlebot.
Information on detecting Yahoo's crawlers can be found here:
- Information on how to generate sitemaps for eCommerce websites on various common platforms:
- More information on sitemaps and standards:
August Garcia is some guy who used to sell Viagra on the Internet. He made this website to LARP as a sysadmin while posting about garbage like user-agent spoofing, spintax, the only good keyboard, virtual assitants from Pakistan, links with the rel="nofollow" attribute, proxies, sin, the developer console, literally every link building method, and other junk.