256 Kilobytes

Sitemaps: Frequently Asked Questions, Special Sitemap Types, Creating Robot Whitelists, and Troubleshooting Common Errors

Articles in Search Engine Optimization | By August R. Garcia

Published Tue, 11 Dec 2018 05:22:29 -0800 | Last update Tue, 11 Dec 2018 05:32:07 -0800

You may ask yourself, "How can I learn about sitemaps?" It turns out that you can learn about XML sitemaps in this post.


Sitemaps exist. Here's some information.

Creating and Formatting a Sitemap

What do the <loc> and <lastmod> tags do in an XML sitemap?

  • <loc>
    • The <loc> tag indicates the location of the resource. I.e., the URL.
  • <lastmod>
    • The <lastmod> tag indicates the last time that a resource was modified. Robots (such as search engines) may take this into account when deciding whether to recrawl a page, although they generally treat this information as more of a suggestion than a rule.

How to format the <lastmod> tag for an XML sitemap?

Use the W3C datetime standard:

YYYY-MM-DD works fine. You can also get more specific, such as this example from the W3C datetime standard:

  • 1997-07-16T19:20:30+01:00
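Both forms can be produced with Python's standard library. This is a minimal sketch; the helper names lastmod_date and lastmod_datetime are made up for illustration and are not part of any sitemap tooling.

```python
from datetime import date, datetime, timedelta, timezone


def lastmod_date(d: date) -> str:
    """Date-only form: YYYY-MM-DD."""
    return d.isoformat()


def lastmod_datetime(dt: datetime) -> str:
    """Full form with a UTC offset, e.g. 1997-07-16T19:20:30+01:00."""
    if dt.tzinfo is None:
        # Without a timezone, isoformat() would omit the offset entirely.
        raise ValueError("attach a timezone so the offset is unambiguous")
    # Drop microseconds so the output matches the seconds-precision example.
    return dt.replace(microsecond=0).isoformat()


print(lastmod_date(date(1997, 7, 16)))
print(lastmod_datetime(
    datetime(1997, 7, 16, 19, 20, 30, tzinfo=timezone(timedelta(hours=1)))
))
```

The second call reproduces the example value above. Requiring a timezone-aware datetime is a design choice in this sketch, not something the standard forces on you.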

What do the <changefreq> and <priority> tags do in an XML sitemap?

It seems highly plausible that they are, for the most part, ignored by the major search engines in favor of their own algorithms. Regardless:

  • <changefreq>
    • This tag can be set to one of these values: always, hourly, daily, weekly, monthly, yearly, never.
  • <priority>
    • The <priority> tag is a value from 0.0 to 1.0 that indicates--in your assessment--how important a page is relative to other pages on your website.

In theory, these two tags indicate to search engines how often pages should be recrawled for new information, although due to the obvious potential for abuse, it is likely that they are ignored to a large extent. If these values are taken into account, they are likely considered relative to the other entries in your sitemap; setting everything to a changefreq of hourly and a priority of 1.0 is unlikely to result in your website being crawled every hour.

More Information: https://www.v9digital.com/blog/2011/12/27/sitemap-xml-why-changefreq-priority-are-important/

What are the <urlset> and <url> tags in an XML sitemap?

  • <urlset>
    • A parent wrapper tag for the set of <url> tags in a sitemap
  • <url>
    • A parent wrapper tag for the <loc>, <lastmod>, <changefreq>, and <priority> tags
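Putting the tags together, here is a rough sketch of generating a small sitemap with Python's standard library. The build_sitemap function and the example.com URL are assumptions for illustration; the namespace URI is the one defined by the sitemap protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from dicts with 'loc' and optional other tags.

    Each entry may contain 'loc', 'lastmod', 'changefreq', and 'priority'.
    Returns the document as UTF-8 bytes, including the XML declaration.
    """
    # <urlset> wraps the whole set of <url> entries.
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        # Each <url> wraps the tags describing one page.
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    xml_bytes = build_sitemap([
        {"loc": "https://example.com/", "lastmod": "2018-12-11",
         "changefreq": "weekly", "priority": 0.8},
    ])
    print(xml_bytes.decode("utf-8"))
```

ElementTree escapes characters such as & in URLs on output, which is one reason to build the file this way rather than by string concatenation.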

What are the <sitemapindex> and <sitemap> tags?

When creating multiple sitemaps, these two tags are used to structure a parent/index sitemap that points to multiple child sitemaps.

    • <sitemapindex>
      • The <sitemapindex> tag wraps all of the child elements for a parent sitemap.
    • <sitemap>
      • The <sitemap> tags used in a sitemap index include <loc> and <lastmod> tags that provide information about where to find child sitemaps and their time of last modification.
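A sitemap index can be generated in much the same way as a regular sitemap. The sketch below assumes a hypothetical build_sitemap_index helper and made-up child sitemap URLs on example.com.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(children):
    """Build a sitemap index from (loc, lastmod) pairs.

    lastmod may be None, in which case the tag is left out.
    Returns the document as UTF-8 bytes, including the XML declaration.
    """
    # <sitemapindex> wraps all of the child <sitemap> entries.
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc, lastmod in children:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
        if lastmod:
            ET.SubElement(sitemap, "lastmod").text = lastmod
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    print(build_sitemap_index([
        ("https://example.com/sitemap-1.xml", "2018-12-11"),
        ("https://example.com/sitemap-2.xml", None),
    ]).decode("utf-8"))
```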

    When should I use multiple XML sitemaps instead of one? And how can I create multiple sitemaps?

    The main reason to use multiple sitemaps is the size limit in the sitemap standard:

    You can provide multiple Sitemap files, but each Sitemap file that you provide must have no more than 50,000 URLs and must be no larger than 50MB (52,428,800 bytes). If you would like, you may compress your Sitemap files using gzip to reduce your bandwidth requirement; however the sitemap file once uncompressed must be no larger than 50MB. If you want to list more than 50,000 URLs, you must create multiple Sitemap files.

    Source: The Sitemap Protocol

    To implement multiple sitemaps, use the <sitemapindex> tag (and sub tags) for your main sitemap and from there link to your additional sitemaps.
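As a sketch of the splitting step, the helper below breaks a long URL list into groups that respect the 50,000-URL limit quoted above, and a second helper checks the 50MB limit on a generated file. The function names and the example.com URLs are assumptions for illustration.

```python
from itertools import islice

MAX_URLS_PER_SITEMAP = 50_000      # limit from the sitemap protocol
MAX_SITEMAP_BYTES = 52_428_800     # 50MB, measured uncompressed


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Yield lists of at most `size` URLs, preserving order."""
    iterator = iter(urls)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def fits_size_limit(xml_bytes: bytes) -> bool:
    """Check the uncompressed size of one generated sitemap file."""
    return len(xml_bytes) <= MAX_SITEMAP_BYTES


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(120_000)]
    for number, chunk in enumerate(chunk_urls(urls), start=1):
        print(f"sitemap-{number}.xml would hold {len(chunk)} URLs")
```

Each chunk would become one child sitemap, and the parent index would then list one entry per child file.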

    More Information: https://support.google.com/webmasters/answer/75712?hl=en

    How can I create a sitemap in WordPress, Joomla, Drupal, or another CMS?

    In general, the easiest way to create a sitemap when using a CMS is to search for a plugin that will handle its creation for you. Here are a few places to start:

    What about creating a sitemap for website builders like Weebly, Jimdo, and Wix?

    Wix creates a sitemap for you out of the box with nothing required on your end:

    Every Wix site contains a sitemap, which is automatically generated by our server and is always kept up to date with your site’s information.

    Source:  https://support.wix.com/en/article/submitting-your-sitemap-directly-to-google

    Weebly also creates a sitemap for you out of the box:

    Weebly automatically generates a sitemap for you. To access your sitemap simply add /sitemap.xml to the end of your homepage.

    Source:  https://www.weebly.com/seo/sitemap

    As does Jimdo:

    You can find the XML sitemap for your website by visiting www.yoursite.com/sitemap.xml (or yoursite.jimdo.com/sitemap.xml).

    Source:  https://support.jimdo.com/search-engine-optimization/manage-your-sitemap/

    What about eCommerce sites like WooCommerce, Magento, and Shopify?

    Check out the post by Hash Brown here:

    Special Sitemaps

    Video, image, and Google News sitemaps are additional types of XML files that--while outside of the broader sitemap standard--are supported by a number of other websites, such as (most notably) Google.

    What is a video sitemap?

    A video sitemap indicates additional information about video content and can be used to:

    • Indicate the data and settings related to embedding and autoplaying videos
    • Indicate a preferred thumbnail to show in search results
    • List metadata, such as video duration and publication date
    • Improve the general capability of a website to rank in video searches
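For a sense of what this looks like in practice, here is a rough sketch that attaches video metadata to a sitemap entry. It assumes Google's video extension namespace and tag names (video:video, video:thumbnail_loc, video:duration, and so on); verify them against Google's current documentation before relying on them. The build_video_sitemap helper and all example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Assumed namespace for Google's video sitemap extension.
VIDEO_NS = "http://www.google.com/schemas/sitemap-video/1.1"

VIDEO_TAGS = ("thumbnail_loc", "title", "description",
              "content_loc", "duration", "publication_date")


def build_video_sitemap(page_url, video):
    """Build a one-entry video sitemap; `video` maps tag names to values."""
    # Declare both namespaces on the root element.
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS,
                                   "xmlns:video": VIDEO_NS})
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = page_url
    # Video metadata lives in a prefixed block inside the <url> entry.
    entry = ET.SubElement(url, "video:video")
    for tag in VIDEO_TAGS:
        if tag in video:
            ET.SubElement(entry, f"video:{tag}").text = str(video[tag])
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    print(build_video_sitemap(
        "https://example.com/videos/demo.html",
        {"thumbnail_loc": "https://example.com/thumbs/demo.jpg",
         "title": "Demo video",
         "description": "A short demonstration.",
         "content_loc": "https://example.com/media/demo.mp4",
         "duration": 600,
         "publication_date": "2018-12-11"},
    ).decode("utf-8"))
```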

    More Information:

    What is an image sitemap?

    An image sitemap indicates additional information about images, such as captions, geographic location, and licensing information.

    More Information: https://support.google.com/webmasters/answer/178636?hl=en&ref_topic=4581190

    What is a Google News sitemap?

    A Google News sitemap provides information designed to facilitate the inclusion of topical news articles in the Google News aggregator:

    A Google News sitemap lets you control which content you submit to Google News.

    [...]

    Sitemaps allow Google News to find all the news articles on your site more quickly

    More Information: Google Webmasters' Forum

    Troubleshooting and Common Errors

    Do XML sitemaps need to be stored in a website's root directory?

    Storing sitemaps in other locations can be done. However, this comes with limitations related to what URLs can be included in those sitemaps and "it is strongly recommended that you place your Sitemap at the root directory of your web server."

    More Information: https://www.sitemaps.org/protocol.html#location
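The location limitation can be sketched as a simple prefix check: per the protocol page linked above, a sitemap at http://example.com/catalog/sitemap.xml can list URLs under http://example.com/catalog/ but not under http://example.com/images/. The allowed_in_sitemap helper below is hypothetical and deliberately ignores the cross-submission options that the protocol also describes.

```python
from urllib.parse import urlsplit


def allowed_in_sitemap(sitemap_url: str, page_url: str) -> bool:
    """Return True if page_url sits at or below the sitemap's directory."""
    sitemap, page = urlsplit(sitemap_url), urlsplit(page_url)
    # The scheme and host must match exactly.
    if (sitemap.scheme, sitemap.netloc) != (page.scheme, page.netloc):
        return False
    # "/catalog/sitemap.xml" -> "/catalog/", "/sitemap.xml" -> "/"
    base_dir = sitemap.path.rsplit("/", 1)[0] + "/"
    return (page.path or "/").startswith(base_dir)


if __name__ == "__main__":
    nested = "http://example.com/catalog/sitemap.xml"
    print(allowed_in_sitemap(nested, "http://example.com/catalog/show?item=23"))
    print(allowed_in_sitemap(nested, "http://example.com/images/logo.png"))
```

A sitemap stored in the root directory passes this check for every URL on the same scheme and host, which is the practical argument for keeping it there.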

    XML Parse Errors

    XML <parsererror>: This page contains the following errors: error on line 2 at column 1: Extra content at the end of the document Below is a rendering of the page up to the first error.

    To address this error, make sure that there is no leading whitespace (or any other output) before the XML declaration and no stray content after the closing root tag.
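A quick local check can catch this class of error before a search engine does. The sketch below runs the file through Python's standard XML parser and reports the first problem it finds; the xml_error helper is a made-up name, and the parser's messages will differ from the browser's wording.

```python
import xml.etree.ElementTree as ET


def xml_error(data: bytes):
    """Return the parser's error message, or None if the XML is well-formed."""
    try:
        ET.fromstring(data)
    except ET.ParseError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    good = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
            b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
            b'</urlset>\n')
    print(xml_error(good))                  # well-formed
    print(xml_error(b"\n" + good))          # blank line before the declaration
    print(xml_error(good + b"<urlset/>"))   # extra content after the root
```

To check a real file, read it in binary mode and pass the bytes to xml_error.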

    Uploading a Sitemap to Google Webmaster Tools

    The following errors can occur when attempting to upload a sitemap to Google's Webmaster Tools console:

    Your sitemap appears to be an html page. Please use a supported sitemap format instead.

    It seems that this error can be caused by caching plugins. Try clearing the cache of any caching plugins, if applicable.

    Additionally, if your site also has a human readable sitemap, such as [...].com/sitemap.php or [...].com/sitemap.html, make sure you are submitting the sitemap.xml file.

    Sitemap contains URLs which are blocked by robots.txt.

    This error indicates that your site's robots.txt file contradicts your XML sitemap. Check to see if your site's robots.txt file contains any Disallow rules and either adjust those rules or adjust the sitemap itself.
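To find the conflicting entries, you can test each sitemap URL against the robots.txt rules with Python's built-in parser. This is a minimal sketch: the blocked_urls helper, the sample rules, and the example.com URLs are assumptions, and the standard library parser may not match every rule exactly the way a given search engine does.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, urls, user_agent: str = "*"):
    """Return the URLs that the given robots.txt text disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    sitemap_urls = [
        "https://example.com/",
        "https://example.com/private/report",
    ]
    for url in blocked_urls(robots, sitemap_urls):
        print("blocked by robots.txt:", url)
```

Anything this reports should either be removed from the sitemap or unblocked in robots.txt, depending on which file reflects what you actually want.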

    Should pagination for comments/forum posts be included in an XML sitemap?

    In general, omitting pagination is a good idea, although this varies on a case-by-case basis. Considerations:

    • Are your paginated pages essentially duplicates of each other?
    • Is there a canonical version of the paginated pages? If so, generally include that independently.
    • Would it be redundant and/or excessive to list all of the paginated URLs?
    • Would you actually want users to find the 212th page of some search directory?

    Note that omitting paginated URLs does not mean that they won't be indexed; it means they are not explicitly included in the list of "official" URLs on your website.

    This article by Moz also points out additional considerations here:

    Restricting Sitemap Access and Setting Whitelists

    Should I restrict access to my sitemap(s)?

    Plausibly. If preventing scraping of content is important to you, there are ways to restrict access to only allow whitelisted bots (such as specific search engines).

    Note that whitelisting bot access to sitemaps prevents many useful bots (for example, obscure search engines, as well as site analysis tools like Majestic and Ahrefs) from using them to crawl your site. It is often the case that if you are concerned about having your content scraped by bots, there is a different, broader problem with your website or business strategy.

    How can I prevent competitor businesses from scraping URLs from my website's sitemap?

    You might try the strategy that Stackexchange uses:

    Note that a restricted sitemap can still be found/seen to some extent by accessing the page that Google has cached.

    How can I only allow access to my XML sitemap(s) via a whitelist?

    Look up the documentation for the specific website/bot that you want to whitelist. Here is where you can find information on how to do this for common bots:

    DuckDuckBot (DuckDuckGo)

    DuckDuckGo's web crawler's requests can be identified by IP address:

    DuckDuckBot is the Web crawler for DuckDuckGo. It respects WWW::RobotRules and originates from these IP addresses:

    • 107.20.237.51 
    • 23.21.226.191 
    • 107.21.1.8 
    • 54.208.102.37

    Source:  https://duckduckgo.com/duckduckbot
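A whitelist check against these addresses might look like the sketch below. The IP list is copied from the quote above and may well be out of date, so confirm it against the source before use; the is_duckduckbot_ip helper is a made-up name, and wiring it into your web server or application is left out.

```python
import ipaddress

# Copied from the DuckDuckBot quote above; assumed current, may have changed.
DUCKDUCKBOT_IPS = frozenset(
    ipaddress.ip_address(ip)
    for ip in ("107.20.237.51", "23.21.226.191", "107.21.1.8", "54.208.102.37")
)


def is_duckduckbot_ip(remote_addr: str) -> bool:
    """Return True if the request's IP address is on the whitelist."""
    try:
        return ipaddress.ip_address(remote_addr) in DUCKDUCKBOT_IPS
    except ValueError:
        # Not a valid IP address at all.
        return False


if __name__ == "__main__":
    print(is_duckduckbot_ip("107.21.1.8"))
    print(is_duckduckbot_ip("203.0.113.5"))
```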

    BingBot (Bing)

    How to verify Bingbot traffic:

    If you see what appears to be Bingbot traffic in your server logs based on a user agent string, for example Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm), and you want to know if this traffic really is originating from a Bing server, you can take the following steps:

    • Perform a reverse DNS lookup using the IP address from the logs to verify that it resolves to a name that ends with search.msn.com
    • Do a forward DNS lookup using the name from step 1 to confirm that it resolves back to the same IP address

    Source: https://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26
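The two steps above can be sketched with Python's socket module. The verify_crawler_ip function is a hypothetical helper: the resolver arguments exist only so the logic can be exercised without network access, and the suffix tuple is taken from the quote above. A production check would probably also cache results, since DNS lookups are slow.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes=(".search.msn.com",),
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Reverse-then-forward DNS check for a claimed crawler IP address."""
    # Step 1: the reverse lookup must give a name under an allowed domain.
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False
    if not hostname.lower().rstrip(".").endswith(allowed_suffixes):
        return False
    # Step 2: the forward lookup of that name must include the original IP.
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses
```

Calling verify_crawler_ip with an address from your server logs performs both lookups live. Other crawlers that document the same reverse/forward procedure could be handled by passing a different allowed_suffixes tuple, taken from that crawler's own documentation.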

    GoogleBot (Google)

    Documentation for how to do this is also provided by Google:

    You can verify if a web crawler accessing your server really is Googlebot (or another Google user-agent). This is useful if you're concerned that spammers or other troublemakers are accessing your site while claiming to be Googlebot.

    Source: https://support.google.com/webmasters/answer/80553

    Slurp (Yahoo!)

    Information on detecting Yahoo's crawlers can be found here:

    Additional Resources
