Sunday, October 28, 2012

Re: Can't get Google Crawler to index my GWT-based site no matter what I do. Help!

On Sunday, October 28, 2012 8:50:27 AM UTC-7, Joseph Lust wrote:
I see you're using places and the URL tokens. Are you using GWTP? There is some built-in crawler support there that you could use, or at least investigate if you're not using GWTP.


Sincerely,
Joseph

Thanks for responding!

Yes, you are correct. I am using places and activities (but not GWTP). I use straight GWT throughout the entire app (including all the editor + validation stuff).

GWTP does include a canned filter for handling crawlability. I have a very similar one, albeit slightly optimized:

import java.io.IOException;
import java.io.PrintWriter;
import java.net.URLDecoder;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Matcher;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

/**
 * Special filter that adds support for Google crawling as outlined here
 * ({@link https://developers.google.com/webmasters/ajax-crawling/docs/getting-started}).
 * MemcacheUtil is an application-specific helper that is not shown in this post.
 *
 * @author Benjamin Possolo
 */
public class GoogleCrawlerFilter implements Filter {

    private static final Logger log = Logger.getLogger(GoogleCrawlerFilter.class.getName());

    //one headless browser per request thread, created lazily
    private static final ThreadLocal<WebClient> webClient = new ThreadLocal<WebClient>(){
        @Override
        protected WebClient initialValue() {
            log.info("Instantiating headless browser");
            WebClient wc = new WebClient(BrowserVersion.FIREFOX_3_6);
            wc.setThrowExceptionOnScriptError(false);
            wc.setThrowExceptionOnFailingStatusCode(false);
            wc.setCssEnabled(false);
            return wc;
        }
    };

    @Override
    public void init(FilterConfig config) throws ServletException {}

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {

        HttpServletRequest req = (HttpServletRequest)request;
        HttpServletResponse resp = (HttpServletResponse)response;
        String queryString = req.getQueryString();

        if( queryString != null && queryString.contains("_escaped_fragment_") ){
            log.info("Detected request from Google Crawler");

            //google requests the URL with the place fragment as a query parameter.
            //they do this because URL fragments (the portion after the hash #) are
            //not sent with an HTTP request.
            //convert the ugly URL to the real url that uses the hashbang
            queryString = queryString.replaceFirst("&?_escaped_fragment_=", "#!");
            queryString = URLDecoder.decode(queryString, "UTF-8");

            StringBuilder pageToCrawlSb = new StringBuilder(req.getScheme()).append("://").append(req.getServerName());
            if( req.getServerPort() > 0 )
                pageToCrawlSb.append(':').append(req.getServerPort());
            pageToCrawlSb.append(req.getRequestURI());
            //if other query parameters precede the fragment, restore the '?' separator
            if( ! queryString.startsWith("#!") )
                pageToCrawlSb.append('?');
            pageToCrawlSb.append(queryString);
            String pageToCrawl = pageToCrawlSb.toString();
            log.log(Level.INFO, "Page being crawled: {0}", pageToCrawl);

            //check if a snapshot of the requested page already exists
            String htmlSnapshot = MemcacheUtil.getHtmlSnapshot(pageToCrawl);
            if( htmlSnapshot == null ){
                try{
                    //use HtmlUnit to render the requested page
                    long start = System.currentTimeMillis();
                    log.info("Using headless browser to fetch page");
                    HtmlPage page = webClient.get().getPage(pageToCrawl);
                    log.info("Pumping javascript event loop for 8 seconds");
                    webClient.get().getJavaScriptEngine().pumpEventLoop(8000); //execute javascript for 8 seconds
                    long end = System.currentTimeMillis();
                    log.log(Level.INFO, "Time to generate page snapshot: {0} seconds", ((end - start) / 1000L));

                    //we add a special message to the top of the page so that anyone seeing the snapshot will
                    //know it is meant for Google crawling.
                    //note: pageToCrawl is inserted unescaped; HTML-escape it if untrusted input is a concern
                    String snapshotMsg = new StringBuilder("<body>\n\n")
                        .append("<hr />\n")
                        .append("<center>\n")
                        .append("  <h3>\n")
                        .append("    You are viewing a non-interactive page that is intended for the crawler.<br/>\n")
                        .append("    You probably want to see this page: <a href=\"" + pageToCrawl + "\">" + pageToCrawl + "</a>\n")
                        .append("  </h3>\n")
                        .append("</center>\n")
                        .append("<hr />\n")
                        .toString();

                    htmlSnapshot = page.asXml();
                    //quoteReplacement keeps any '$' or '\' in the URL from being read as a regex group reference
                    htmlSnapshot = htmlSnapshot.replaceFirst("<body[^>]*>", Matcher.quoteReplacement(snapshotMsg));

                    //store the rendered page in memcache
                    MemcacheUtil.putHtmlSnapshot(pageToCrawl, htmlSnapshot);
                }
                finally{
                    webClient.get().closeAllWindows();
                }
            }

            //send the html snapshot back to the crawler
            resp.setContentType("text/html; charset=UTF-8");
            PrintWriter writer = resp.getWriter();
            writer.print(htmlSnapshot);
        }
        else{
            chain.doFilter(request, response);
        }
    }

    @Override
    public void destroy() {
        //this is never called on Google App Engine
    }
}
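
In case it helps anyone debugging the same problem, here is a minimal sketch of just the URL-rewriting step from the filter above, pulled out so it can be run without a servlet container. The class name EscapedFragmentUrls and the method toHashbangUrl are made up for this illustration and are not part of my app; the logic mirrors what the filter does with the request's scheme, server name, port, URI and query string.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper (not part of the original app): mirrors the filter's
// conversion of an "_escaped_fragment_" request back into a hashbang URL.
final class EscapedFragmentUrls {

    private EscapedFragmentUrls() {}

    public static String toHashbangUrl(String scheme, String host, int port,
                                       String requestUri, String queryString) {
        // Turn "[&]_escaped_fragment_=" into "#!", then percent-decode the result.
        String qs = queryString.replaceFirst("&?_escaped_fragment_=", "#!");
        try {
            qs = URLDecoder.decode(qs, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }

        StringBuilder sb = new StringBuilder(scheme).append("://").append(host);
        if (port > 0) {
            sb.append(':').append(port);
        }
        sb.append(requestUri);
        // If ordinary query parameters precede the fragment, restore the '?'.
        if (!qs.startsWith("#!")) {
            sb.append('?');
        }
        sb.append(qs);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Crawler request with only the escaped fragment.
        System.out.println(toHashbangUrl("http", "example.com", 80, "/",
                "_escaped_fragment_=place%3Aid%3D42"));
        // Crawler request with a regular parameter before the fragment.
        System.out.println(toHashbangUrl("http", "example.com", 8080, "/app.html",
                "locale=en&_escaped_fragment_=home"));
    }
}
```

Note that, like the filter, this always writes the port into the URL, so a request on the default port comes out as http://example.com:80/#!place:id=42. That string is also the memcache key, so it has to be built the same way on every request.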

