So we’re setting up a crawl of a TWiki site as one source in a suite of content sources.
So far so good. Once the authentication was sorted we noticed a problem: only the root URL of the site was getting crawled.
Various ideas were thrown around about nofollow and noindex attributes, but we couldn’t find anything wrong with our configuration and nothing seemed to fit the problem.
I noticed, by accident, that this particular TWiki installation was case sensitive about URLs (I thought those days were gone, but it's configurable apparently) and that got me thinking.
By kicking off a crawl I noticed that SharePoint was requesting lower-case URLs from the site for every link on the home page, getting a 404 and stopping.
OK, the penny drops, but why is SharePoint sending a lower-case URL? Well… this is by design, as part of the crawler’s normalization of URLs (http://blogs.msdn.com/b/enterprisesearch/archive/2010/07/09/crawling-case-sensitive-repositories-using-sharepoint-server-2010.aspx)
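To make the failure concrete, here's a rough Python sketch of what I think was going on. It isn't SharePoint's code: the host name, the page paths and the helper functions are all made up for illustration, and the "site" is just a set of paths compared case sensitively.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_lowercase(url: str) -> str:
    """Lower-case every part of a URL.

    Assumption: this stands in for the crawler's default normalization;
    the real crawler's rules may differ in detail.
    """
    parts = urlsplit(url)
    return urlunsplit(tuple(part.lower() for part in parts))


# Hypothetical case-sensitive site: only these exact paths exist.
PAGES = {
    "/twiki/bin/view/Main/WebHome",
    "/twiki/bin/view/Main/TWikiUsers",
}


def status_for(url: str) -> int:
    """Return 200 if the path matches exactly (case included), else 404."""
    return 200 if urlsplit(url).path in PAGES else 404


# A link as it appears on the home page (hypothetical host name).
link = "http://wiki.example.com/twiki/bin/view/Main/TWikiUsers"

original_status = status_for(link)      # what a browser would get
lowered = normalize_lowercase(link)     # what the crawler ends up requesting
lowered_status = status_for(lowered)    # what the crawler gets back

print(link, original_status)
print(lowered, lowered_status)
```

The root URL has no mixed-case path to mangle, which would explain why it was the only page that got crawled.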
In SharePoint 2010, if you’re setting up a crawl rule, that checkbox you’ve ignored called Match Case (badly named; surely Preserve Url Casing would get the point across better) just needs to be set and voilà, the crawler will preserve the case of the URLs it requests.
8. March 2008 19:47
Picture this: you're an IT bod working for Thomson Financial. You've got I don't know how many sites subscribing to your global financial news feed and you need to check that your latest update is working, so you key a test message into the system. Hmm, that's funny, nothing is coming out on the test feed reader.
Try a second one, and a third, and you just keep on going...
Now, you did check it's the test system you were logged in to?
It's even harder to ignore on some other sites... looks like they might have lost ignore 1 and ignore 2 though; maybe they meant these to go around the world :-)