I’m pleased to say that I’ve found a robust way to automatically check whether Google is still “seeing” content at the article-level URLs indexed by JURN. It’s a software-based solution: basically ‘dark side’ SEO software that I’ve turned to the good side. It prepends the site: modifier to each of the URLs in the JURN index, checks whether each URL is actually indexed by Google, and logs any that are wholly un-indexed. It just chugs away in the background, and is deliberately very slow so as not to trigger flood-control blocking measures. But it’s certainly better than doing the checking by hand.
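For anyone curious about the mechanics, here is a minimal Python sketch of the general idea. It is not the actual software described above. The function names, the `is_indexed` checker and the 60-second default delay are all assumptions made for illustration, and how the lookup itself is performed is deliberately left to whatever checker you supply.

```python
import time
from urllib.parse import urlsplit


def to_site_query(url):
    """Turn a URL into a 'site:' query string, dropping any http(s) scheme."""
    url = url.strip()
    # urlsplit only recognises the host when a scheme or leading '//' is present
    parts = urlsplit(url if "://" in url else "//" + url)
    target = parts.netloc + parts.path
    if parts.query:
        target += "?" + parts.query
    return "site:" + target


def find_unindexed(urls, is_indexed, delay_seconds=60.0):
    """Return the URLs that the supplied checker reports as not indexed.

    is_indexed is a hypothetical callable (an assumption of this sketch):
    it takes a 'site:' query string and returns True or False.
    """
    unindexed = []
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines in the list
        if not is_indexed(to_site_query(url)):
            unindexed.append(url)
        time.sleep(delay_seconds)  # pace the checks so they stay slow
    return unindexed
```

The list returned by `find_unindexed` plays the part of the log of wholly un-indexed URLs. Passing the checker in as a parameter keeps the sketch independent of any particular lookup method.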
If you have such a list you want to check, it’s probably best to remove or cut back any URLs containing multiple wildcards, such as /*/*/. Google has also been known to choke on URLs containing question marks (it can treat them as evidence of someone trying a scripting exploit on Google), although I’m not seeing this happen during the checking. But if you’re doing the checking in blocks of 200, it’s not difficult to correct those sorts of URLs first.
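The pre-check tidy-up can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, "multiple wildcards" is taken to mean more than one asterisk, and flagged URLs are set aside for correction by hand rather than rewritten automatically.

```python
def needs_attention(url):
    """Flag URLs with more than one '*' wildcard or with a question mark."""
    return url.count("*") > 1 or "?" in url


def split_for_review(urls):
    """Separate a URL list into (ready_to_check, flagged_for_correction)."""
    ready, flagged = [], []
    for url in urls:
        url = url.strip()
        if not url:
            continue  # ignore blank lines
        (flagged if needs_attention(url) else ready).append(url)
    return ready, flagged


def in_blocks(items, size=200):
    """Split a list into consecutive blocks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Running `split_for_review` on each block from `in_blocks` gives a short list of awkward URLs to fix before that block is checked.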