Several sitemap.xml generators for Sitecore are publicly available on the Marketplace. Here’s what makes Constellation.Foundation.SitemapXml a superior option:
- Multi-site out of the box. Will generate a unique Sitemap for every site in your installation.
- Includes robots.txt support. The robots.txt file points to the Sitemap file.
- Robots.txt includes support for disallows, including unique disallows per site.
- Support for SSL/TLS in URLs. Constellation obeys all the settings on your <site /> nodes, including the scheme attribute.
- Robots.txt and sitemap.xml files are generated on the fly, and cached in memory.
- Background agent to ensure these files are always cached, and always up-to-date.
- Ability to customize how Items get added to the sitemap.xml file.
- Ability to add non-Sitecore URLs to the sitemap.xml file.
- Full support for advanced sitemap.xml attributes including priority, last modified, and modification frequency.
- Each site in your install can have unique robots.txt and sitemap.xml requirements.
Constellation.Foundation.SitemapXml can be used as a plug-n-play Sitemap option for smaller, simpler sites, but it was designed as an API. Developers will likely have to customize the “last mile”. This was done to support the significant variations in Sitecore Information Architecture practices around the concepts of site maps and automatically generated navigation, which often share logic.
Note that no static files are generated, so there is no need for weird filenames and redirects! Two HttpHandlers are installed near the top of the Handlers section of your web.config. One handles GET requests for robots.txt, the other for sitemap.xml.
In both cases the hostname is inspected and used to create a Sitecore SiteContext, which establishes exactly what should be included in the response.
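To illustrate the idea of mapping a request hostname to a site, here is a minimal, language-neutral sketch in Python. It is not the component's code (which is .NET and relies on Sitecore's own site resolution); `SiteDefinition` and `resolve_site` are hypothetical names, and the exact-match rule is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SiteDefinition:
    """Hypothetical stand-in for the settings on a <site /> node."""
    name: str
    host_name: str
    scheme: str = "http"


def resolve_site(request_host: str, sites: Iterable[SiteDefinition]) -> Optional[SiteDefinition]:
    """Return the first site whose host name matches the request's Host header.

    Assumption: matching is exact and case-insensitive, and any port is ignored.
    """
    host = request_host.split(":")[0].lower()
    for site in sites:
        if site.host_name.lower() == host:
            return site
    return None
```

Once a site has been resolved this way, everything else (which disallows apply, which Items are crawled, which scheme is used) can be looked up from that site's settings.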
Robots.txt
The robots.txt handler relies exclusively on Sitecore configuration files for its settings. There is a set of global “disallows” that will be included on all sites’ robots.txt files. Optionally you can create site-specific disallows that will be appended to the global disallows when the file is constructed.
Limitations:
- There is no accommodation for including/excluding certain agents. All search engines are treated equally.
- The robots.txt file will always include a reference to the sitemap.xml file for the site.
Sometimes it’s necessary to put an incomplete site on the public web for review. To prevent search engines from crawling sites that are still in development, you can set the global “allowed” flag to false. This tells all search engines that find your “testing” Sitecore installation that they should not crawl the sites within it. A common practice is to use Sitecore’s role-based or environment-based configuration to set the “allowed” flag to false on all lower environments.
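The way these settings combine can be sketched in a few lines of Python. This is an illustration only, not the handler's implementation: `build_robots_txt` and its parameters are hypothetical, and the output for a disallowed installation (a single `Disallow: /`) is an assumption about what "should not crawl" looks like in practice.

```python
from typing import Sequence


def build_robots_txt(
    sitemap_url: str,
    global_disallows: Sequence[str],
    site_disallows: Sequence[str] = (),
    allowed: bool = True,
) -> str:
    """Compose a robots.txt body for one site."""
    # All agents are treated equally, so there is a single wildcard group.
    lines = ["User-agent: *"]
    if not allowed:
        # Assumption: a blocked installation disallows everything.
        lines.append("Disallow: /")
    else:
        # Site-specific disallows are appended after the global ones.
        for path in [*global_disallows, *site_disallows]:
            lines.append(f"Disallow: {path}")
    # The sitemap reference is always included.
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"
```

For example, `build_robots_txt("https://www.example.com/sitemap.xml", ["/sitecore"], ["/private"])` yields a file with two Disallow lines followed by the Sitemap line.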
Sitemap.xml
Basic Program Flow
- A Handler or Agent asks the Repository for a sitemap.xml document for a particular site.
- The Repository creates an internal Generator, which is responsible for the actual XML document.
- The Generator calls the CrawlerManager which reads the configuration files and creates the appropriate Crawlers for the run.
- The Generator runs each Crawler in turn.
- Each Crawler reviews Items (or other 3rd party objects), creating ISitemapNode objects for each, which are returned to the Generator.
- The Generator iterates over all ISitemapNodes, evaluating them for suitability for inclusion in the final sitemap.xml document.
- The Generator uses suitable ISitemapNodes to create the <url/> elements in the XML document.
- The finalized XML document is returned to the Repository.
- The Repository caches the document, and then returns it to the calling object.
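The flow above can be sketched as a small Python model. The class names mirror the roles described (Repository, Generator, Crawler), but every signature here is hypothetical, and the CrawlerManager's configuration lookup is reduced to a plain function passed into the Repository.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


@dataclass
class Node:
    """Minimal stand-in for an ISitemapNode."""
    location: str
    valid: bool = True


# A crawler takes a site name and returns candidate nodes.
Crawler = Callable[[str], Iterable[Node]]


class Generator:
    def __init__(self, crawlers: List[Crawler]):
        self.crawlers = crawlers

    def generate(self, site_name: str) -> str:
        # Run each crawler in turn and collect its nodes.
        nodes: List[Node] = []
        for crawl in self.crawlers:
            nodes.extend(crawl(site_name))
        # Only suitable nodes become <url/> elements.
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for node in nodes:
            if node.valid:
                url = ET.SubElement(urlset, "url")
                ET.SubElement(url, "loc").text = node.location
        return ET.tostring(urlset, encoding="unicode")


class Repository:
    def __init__(self, crawlers_for_site: Callable[[str], List[Crawler]]):
        self._crawlers_for_site = crawlers_for_site  # plays the CrawlerManager role
        self._cache: Dict[str, str] = {}

    def get_sitemap(self, site_name: str) -> str:
        # Generate once per site, then serve from the in-memory cache.
        if site_name not in self._cache:
            generator = Generator(self._crawlers_for_site(site_name))
            self._cache[site_name] = generator.generate(site_name)
        return self._cache[site_name]
```

A handler or agent would only ever call `get_sitemap`; whether the document is freshly generated or served from cache is invisible to the caller. Cache expiry is left out of this sketch.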
Handlers and Agents
These components are responsible for calling the Sitemap Repository and delivering the resulting document.
Repository
This is the “public API” for retrieving Sitemaps. It handles both generation and caching of sitemap.xml documents.
Crawler
Crawlers are used to generate a list of ISitemapNode objects. This is typically done by inspecting the Sitecore content tree (or a 3rd party system). More than one Crawler can supply ISitemapNodes to a single Sitemap. For example, you might have one Crawler for Page Items and another for Products that come from a 3rd party system. Constellation supports a default group of Crawlers as well as the ability to define a unique group of Crawlers for a particular site.
ISitemapNode
This is the “public API” for a particular Sitemap.xml <url/> element. Developers are responsible for generating objects that can provide the facts on this Interface. It is these facts that are used to determine whether a given Item (or 3rd party object) is added to the Sitemap document.
ISitemapNode is responsible for the following facts:
- ChangeFrequency – an enumeration based on the sitemap.xml DOM
- IsPage – used primarily for Sitecore Items, allows the developer to specify that a given Item is, in fact, a URL that can be visited.
- Location – the authoritative, canonical URL to the Item. The word “location” comes from the sitemap.xml DOM.
- Priority – a decimal ranging from 0 to 1 that tells the search engine how important the page is.
- ShouldIndex – a boolean flag allowing the developer to specify whether search engines should crawl the Item.
- HasPresentation – a boolean flag specifying whether the Sitecore Item has presentation details set.
- LastModified – a DateTime value specifying when the Item was last modified.
- IsValidForInclusionInSitemapXml() – a method that must be implemented by the developer, taking the above facts into consideration. If true, the Node will be included in the sitemap.xml document.
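As a rough illustration of how these facts fit together, here is a Python sketch of a node and the `<url/>` element it could produce. The real interface is a .NET one; the Python names, the default values, and the particular validity rule below are assumptions. The change-frequency values and element names follow the sitemaps.org protocol.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from xml.etree import ElementTree as ET


class ChangeFrequency(Enum):
    # Values defined for <changefreq> by the sitemap protocol.
    ALWAYS = "always"
    HOURLY = "hourly"
    DAILY = "daily"
    WEEKLY = "weekly"
    MONTHLY = "monthly"
    YEARLY = "yearly"
    NEVER = "never"


@dataclass
class PageNode:
    """Hypothetical node carrying the facts listed above."""
    location: str
    last_modified: datetime
    change_frequency: ChangeFrequency = ChangeFrequency.WEEKLY
    priority: float = 0.5
    is_page: bool = True
    has_presentation: bool = True
    should_index: bool = True

    def is_valid_for_inclusion_in_sitemap_xml(self) -> bool:
        # One possible rule: every flag must be true and priority must be in range.
        return (
            self.is_page
            and self.has_presentation
            and self.should_index
            and 0.0 <= self.priority <= 1.0
        )


def to_url_element(node: PageNode) -> ET.Element:
    """Render a node as a sitemap <url/> element."""
    url = ET.Element("url")
    ET.SubElement(url, "loc").text = node.location
    ET.SubElement(url, "lastmod").text = node.last_modified.date().isoformat()
    ET.SubElement(url, "changefreq").text = node.change_frequency.value
    ET.SubElement(url, "priority").text = f"{node.priority:.1f}"
    return url
```

A custom implementation would typically read flags such as `should_index` from fields on the Page Item rather than hard-coding defaults.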
Default Behavior
Although this component was intended to be customized, there is a “safe default” behavior that will suit Sitecore installations with a lot of small, very basic sites.
Default Robots.txt
A simple robots.txt will be rendered that excludes the /sitecore path from search indexing and references the sitemap.xml file.
Default Sitemap Crawler
The default Crawler starts at the Context Site’s “home” Item and iterates through all descendants until it reaches Sitecore’s maximum tree depth (default is 20). All of these Items will be turned into ISitemapNodes for evaluation. The evaluation rules follow:
Default Item Inclusion Rules
By default Items will be included in the final Sitemap.xml document if:
- They are published in the database used for content delivery
- They have a Layout assigned to the Default device
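The default crawl and the two inclusion rules can be modeled with a depth-limited tree walk. This Python sketch uses a hypothetical `Item` class in place of Sitecore Items. It assumes the home Item itself is a candidate and counts home as depth 0; neither detail is confirmed by the description above.

```python
from dataclasses import dataclass, field
from typing import Iterator, List

MAX_TREE_DEPTH = 20  # the default depth mentioned above


@dataclass
class Item:
    """Hypothetical stand-in for a Sitecore Item."""
    name: str
    is_published: bool = True
    has_layout: bool = True
    children: List["Item"] = field(default_factory=list)


def crawl(home: Item, max_depth: int = MAX_TREE_DEPTH) -> Iterator[Item]:
    """Yield home and its descendants, depth-first, down to max_depth levels."""
    stack = [(home, 0)]
    while stack:
        item, depth = stack.pop()
        yield item
        if depth < max_depth:
            # Reversed so children are visited in their original order.
            stack.extend((child, depth + 1) for child in reversed(item.children))


def is_included(item: Item) -> bool:
    """Apply the two default rules: published, and a layout on the default device."""
    return item.is_published and item.has_layout
```

Note that an Item without a layout (a folder, for instance) is skipped but its children are still visited, so pages nested under folders still reach the sitemap.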
Default Item URL Rules
Absolute URLs for Items in the sitemap.xml document will be created with the following options:
- Default LinkManager options as provided by LinkManager.GetDefaultUrlOptions().
- options.AlwaysIncludeServerUrl set to true.
- options.Site set to the SiteContext used to build the current sitemap.xml document.
This will respect language embedding rules as well as settings on the <site/> node for targetHostName and scheme. The exact result may differ slightly if you have a custom LinkProvider in play.
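To make the effect of these options concrete, here is a simplified Python model of how an absolute URL might be assembled from the site's scheme and targetHostName. It does not call or reproduce Sitecore's LinkManager; `SiteSettings` and `absolute_url` are hypothetical, and placing the language as the first path segment is just one possible language embedding setting.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import quote


@dataclass(frozen=True)
class SiteSettings:
    """Hypothetical stand-in for the relevant <site/> attributes."""
    target_host_name: str
    scheme: str = "http"


def absolute_url(site: SiteSettings, item_path: str, language: Optional[str] = None) -> str:
    """Build scheme://host/[language/]path for an Item path relative to home."""
    segments = [s for s in item_path.strip("/").split("/") if s]
    if language:
        # Assumption: the language is embedded as the first path segment.
        segments.insert(0, language)
    path = "/" + "/".join(quote(s) for s in segments)
    return f"{site.scheme}://{site.target_host_name}{path}"
```

With a site configured for `https` and `www.example.com`, the path `/products` in language `en` comes out as `https://www.example.com/en/products`.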
Installation
Constellation.Foundation.SitemapXml is available on NuGet.
In Visual Studio, fire up the Package Manager console and install into a .NET Web Application project:
PM> Install-Package Constellation.Foundation.SitemapXml
If you deploy your solution without any further changes you should get a serviceable robots.txt and sitemap.xml for every site in your installation.
Next Steps
If you have fields on your Page Items specifically around search engine indexing or sitemap generation, you can create a custom crawler and ISitemapNode combination to get more accurate sitemap.xml results. The configuration file that ships with this component gives details on how to override the default crawler for this purpose.
You can review the source code for this component and the rest of Constellation on GitHub.
If you don’t have SEO or Sitemap fields on your Page Items currently and would like to add these features, Constellation has your back.
- Constellation.Feature.PageTagging will add the fields and metatags you need for basic search engine indexing.
- Constellation.Feature.PageTagging.SitemapXml is an implementation of this component and the PageTagging library. With it you get the complete SEO + Sitemap package you need to bootstrap a new Sitecore installation.
To install both:
PM> Install-Package Constellation.Feature.PageTagging
followed by:
PM> Install-Package Constellation.Feature.PageTagging.SitemapXml
Deploy your solution and publish to enjoy.