So we’re setting up a crawl of a TWiki site as one source in a suite of content sources.
So far so good, once the authentication was sorted we noticed a problem, only the root Url of the site was getting crawled.
Various ideas were thrown around about nofollow and noindex attributes but we couldn’t find anything wrong with our configuration and nothing seemed to fit the problem.
I noticed that this particular TWiki installation was case sensitive to Urls by accident (thought those days were gone, configurable apparently) and that got me thinking.
By kicking a crawl off i noticed that SharePoint was requesting lower case urls from the Site for every link on the home page getting a 404 and stopping.
Ok penny drops but why is SharePoint sending a lower case url, well… this is by design as part of the crawler’s normalization of urls (http://blogs.msdn.com/b/enterprisesearch/archive/2010/07/09/crawling-case-sensitive-repositories-using-sharepoint-server-2010.aspx)
In 2010 if you’re setting up a crawl rule that checkbox you’ve ignored called Match Case (badly named surely Preserve Url Casing would get the point across better) just needs to be set and viola the crawler will preserve the case of Urls it requests.