Google Web Toolkit: Re: Can't get Google Crawler to index my GWT-based site no matter what I do. Help!

Sunday, October 28, 2012

On Sunday, October 28, 2012 8:50:27 AM UTC-7, Joseph Lust wrote:

I see you're using places and URL tokens. Are you using GWTP? There is some built-in crawler support there that you could use, or at least investigate if you're not using GWTP.

http://code.google.com/p/gwt-platform/wiki/CrawlerSupport

Sincerely,
Joseph

Thanks for responding!

Yes, you are correct. I am using places and activities (but not GWTP). I use straight GWT throughout the entire app (including all the editor + validation stuff).

GWTP does include a canned filter for handling crawlability. I have a very similar one, albeit slightly optimized.

import java.io.IOException;
import java.io.PrintWriter;
import java.net.URLDecoder;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Matcher;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

/**
 * Special filter that adds support for Google crawling as outlined here:
 * {@link https://developers.google.com/webmasters/ajax-crawling/docs/getting-started}
 *
 * MemcacheUtil is an application-specific helper that is not shown here.
 *
 * @author Benjamin Possolo
 */
public class GoogleCrawlerFilter implements Filter {

    private static final Logger log = Logger.getLogger(GoogleCrawlerFilter.class.getName());

    //one headless browser per request thread; WebClient is not thread-safe
    private static final ThreadLocal<WebClient> webClient = new ThreadLocal<WebClient>(){
        @Override
        protected WebClient initialValue() {
            log.info("Instantiating headless browser");
            WebClient wc = new WebClient(BrowserVersion.FIREFOX_3_6);
            wc.setThrowExceptionOnScriptError(false);
            wc.setThrowExceptionOnFailingStatusCode(false);
            wc.setCssEnabled(false);
            return wc;
        }
    };

    @Override
    public void init(FilterConfig config) throws ServletException {}

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {

        HttpServletRequest req = (HttpServletRequest)request;
        HttpServletResponse resp = (HttpServletResponse)response;
        String queryString = req.getQueryString();

        if( queryString != null && queryString.contains("_escaped_fragment_") ){

            log.info("Detected request from Google Crawler");

            //google requests the URL with the place fragment as a query parameter.
            //they do this because URL fragments (the portion after the hash #) are
            //not sent with an HTTP request.
            //convert the ugly URL to the real url that uses the hashbang
            queryString = queryString.replaceFirst("&?_escaped_fragment_=", "#!");
            queryString = URLDecoder.decode(queryString, "UTF-8");

            StringBuilder pageToCrawlSb = new StringBuilder(req.getScheme()).append("://").append(req.getServerName());
            if( req.getServerPort() > 0 )
                pageToCrawlSb.append(':').append(req.getServerPort());
            pageToCrawlSb.append(req.getRequestURI());
            if( ! queryString.startsWith("#!") )
                pageToCrawlSb.append('?');
            pageToCrawlSb.append(queryString);
            String pageToCrawl = pageToCrawlSb.toString();
            log.log(Level.INFO, "Page being crawled: {0}", pageToCrawl);

            //check if a snapshot of the requested page already exists
            String htmlSnapshot = MemcacheUtil.getHtmlSnapshot(pageToCrawl);

            if( htmlSnapshot == null ){
                try{
                    //use HtmlUnit to render the requested page
                    long start = System.currentTimeMillis();
                    log.info("Using headless browser to fetch page");
                    HtmlPage page = webClient.get().getPage(pageToCrawl);
                    log.info("Pumping javascript event loop for 8 seconds");
                    webClient.get().getJavaScriptEngine().pumpEventLoop(8000); //execute javascript for 8 seconds
                    long end = System.currentTimeMillis();
                    log.log(Level.INFO, "Time to generate page snapshot: {0} seconds", ((end - start) / 1000L));

                    //we add a special message to the top of the page so that anyone seeing the snapshot will
                    //know it is meant for Google crawling
                    String snapshotMsg = new StringBuilder("<body>\n\n")
                        .append("<hr />\n")
                        .append("<center>\n")
                        .append(" <h3>\n")
                        .append(" You are viewing a non-interactive page that is intended for the crawler.<br/>\n")
                        .append(" You probably want to see this page: <a href=\"" + pageToCrawl + "\">" + pageToCrawl + "</a>\n")
                        .append(" </h3>\n")
                        .append("</center>\n")
                        .append("<hr />\n")
                        .toString();

                    htmlSnapshot = page.asXml();
                    //quote the replacement so a '$' or '\' in the URL is not treated as a regex group reference
                    htmlSnapshot = htmlSnapshot.replaceFirst("<body[^>]*>", Matcher.quoteReplacement(snapshotMsg));

                    //store the rendered page in memcache
                    MemcacheUtil.putHtmlSnapshot(pageToCrawl, htmlSnapshot);
                }
                finally{
                    webClient.get().closeAllWindows();
                }
            }

            //send the html snapshot back to the crawler
            resp.setContentType("text/html; charset=UTF-8");
            PrintWriter writer = resp.getWriter();
            writer.print(htmlSnapshot);
        }
        else{
            chain.doFilter(request, response);
        }
    }

    @Override
    public void destroy() {
        //this is never called on Google App Engine
    }
}
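
To sanity-check the URL rewriting outside a servlet container, the same steps can be pulled into a small standalone helper. This is only a minimal sketch and not part of the filter above: the class name EscapedFragmentUrls, the method toHashBangUrl, and the example.com URLs are made up for illustration. The rewriting logic itself mirrors what doFilter does with the query string.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper (name invented for this sketch) that rebuilds the
// "pretty" #! URL from the pieces of a crawler request.
final class EscapedFragmentUrls {

    private EscapedFragmentUrls() {}

    static String toHashBangUrl(String scheme, String host, int port,
                                String path, String queryString) {
        // Turn "_escaped_fragment_=" (with an optional leading '&') into "#!"
        String rewritten = queryString.replaceFirst("&?_escaped_fragment_=", "#!");
        try {
            // The crawler percent-encodes the fragment, so decode it back
            rewritten = URLDecoder.decode(rewritten, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM
            throw new IllegalStateException(e);
        }

        StringBuilder url = new StringBuilder(scheme).append("://").append(host);
        if (port > 0) {
            url.append(':').append(port);
        }
        url.append(path);
        // Other query parameters, if any, still need their '?' separator
        if (!rewritten.startsWith("#!")) {
            url.append('?');
        }
        return url.append(rewritten).toString();
    }

    public static void main(String[] args) {
        // prints http://example.com:80/app.html#!profile:id=42
        System.out.println(toHashBangUrl("http", "example.com", 80, "/app.html",
                "_escaped_fragment_=profile%3Aid%3D42"));
        // prints http://example.com:80/app.html?locale=en#!home
        System.out.println(toHashBangUrl("http", "example.com", 80, "/app.html",
                "locale=en&_escaped_fragment_=home"));
    }
}
```

Note that, like the filter, this assumes _escaped_fragment_ is the last query parameter. Anything that follows it would end up inside the fragment.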
