Thursday, May 15, 2014

Re: Can you tell me Step-By-Step Guidelines of How to use HtmlUnit to make GWT app Crawlable?

I am about 70% done, but I still get an error.

OK, here is what I did: I downloaded htmlunit-2.14, unzipped it, and copied these jar files into my lib folder:

htmlunit-2.14
commons-codec
commons-collections
commons-io
commons-logging
cssparser
htmlunit-core-js
nekohtml
commons-lang3
httpclient
httpmime
jetty-websocket
xalan
xercesImpl

After that, I created "public class CrawlServlet implements Filter" as mentioned above:



@Override
public void doFilter(ServletRequest request, ServletResponse response,
        FilterChain chain) throws IOException, ServletException {
    // TODO Auto-generated method stub
    HttpServletRequest httpRequest = (HttpServletRequest) request;
    String requestQueryString = httpRequest.getQueryString();

    if ((requestQueryString != null) && (requestQueryString.contains("_escaped_fragment_"))) {
        // rewrite the URL back to the original #! version
        String url_with_hash_fragment = requestQueryString.replace("?_escaped_fragment_=", "!#");

        // remember to unescape any %XX characters
        //url_with_hash_fragment = rewriteQueryString(url_with_escaped_fragment);

        // use the headless browser to obtain an HTML snapshot
        final WebClient webClient = new WebClient();
        HtmlPage page = webClient.getPage(url_with_hash_fragment);

        // important!  Give the headless browser enough time to execute JavaScript
        // The exact time to wait may depend on your application.
        webClient.waitForBackgroundJavaScript(2000);

        // return the snapshot
        PrintWriter out = response.getWriter();
        out.println(page.asXml());
    } else {
        try {
            // not an _escaped_fragment_ URL, so move up the chain of servlet (filters)
            chain.doFilter(request, response);
        } catch (ServletException e) {
            System.err.println("Servlet exception caught: " + e);
            e.printStackTrace();
        }
    }
}

OK, then I ran my GWT app in Eclipse and opened the URL "http://127.0.0.1:8888/Myproject.html?gwt.codesvr=127.0.0.1:9997?_escaped_fragment_=article".

Here is the error in Eclipse:

[ERROR] 500 - GET /Myproject.html?gwt.codesvr=127.0.0.1:9997?_escaped_fragment_=article (127.0.0.1) 4840 bytes
   Request headers
      Accept: text/html, application/xhtml+xml, */*
      Accept-Language: en-AU
      User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
      Accept-Encoding: gzip, deflate
      Host: 127.0.0.1:8888
      Connection: keep-alive
      Cookie: JSESSIONID=5eehbjnnhsz6m6hlk7el8tu; SILFORACOOKIE=5eehbjnnhsz6m6hlk7el8tu
   Response headers
      Set-Cookie: JSESSIONID=ro2z2xqrbi0j93zfrb1uihl8;Path=/
      Set-Cookie: SILFORACOOKIE=ro2z2xqrbi0j93zfrb1uihl8;Path=/
      Content-Type: text/html;charset=ISO-8859-1
      Cache-Control: must-revalidate,no-cache,no-store
      Content-Length: 4840



In the Chrome browser it showed:


HTTP ERROR 500

Problem accessing /Myproject.html. Reason: Server Error


Caused by: java.net.MalformedURLException: no protocol: gwt.codesvr=127.0.0.1:9997!#article
    at java.net.URL.<init>(Unknown Source)
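
The "no protocol" message can be reproduced outside the servlet. The stack trace shows the failure inside the java.net.URL constructor, which rejects any string that does not start with a scheme such as "http://". That is consistent with only the rewritten query string, rather than a full URL, being handed to the headless browser. A minimal sketch (the class name NoProtocolDemo and the helper parsesAsUrl are made up for illustration):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class NoProtocolDemo {

    // Hypothetical helper: true if java.net.URL accepts the string.
    public static boolean parsesAsUrl(String candidate) {
        try {
            new URL(candidate);
            return true;
        } catch (MalformedURLException e) {
            // For a string without a scheme the message starts with "no protocol: "
            return false;
        }
    }

    public static void main(String[] args) {
        // The string from the stack trace: a bare query string with "!#" appended.
        System.out.println(parsesAsUrl("gwt.codesvr=127.0.0.1:9997!#article"));
        // A full URL with scheme, host, and path.
        System.out.println(parsesAsUrl("http://127.0.0.1:8888/Myproject.html#!article"));
    }
}
```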




Do you know how to fix it?

On Friday, May 16, 2014 12:06:24 AM UTC+10, Jens wrote:
HtmlUnit is bundled as a jar file, so you can put it (and all its dependencies) into WEB-INF/lib of your war.

Then you need to write a servlet that takes the Googlebot's request, rewrites the _escaped_fragment_ parameter back to the original #!<token> URL, and starts HtmlUnit with that URL. The rendered page is then returned by the servlet.
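
The rewrite step described here could look roughly like the sketch below. It is not code from this thread: the class EscapedFragmentUrls and the method toHashBangUrl are hypothetical names, and the sketch assumes _escaped_fragment_ is the last parameter in the query string. In a filter, the first argument would typically come from httpRequest.getRequestURL().toString() and the second from httpRequest.getQueryString(), so that the result keeps its scheme, host, and path.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class EscapedFragmentUrls {

    private static final String PARAM = "_escaped_fragment_=";

    /**
     * Hypothetical helper: rebuilds the "#!" URL from the request URL
     * (scheme, host, path) and the raw query string.
     * Assumes _escaped_fragment_ is the last query parameter.
     */
    public static String toHashBangUrl(String requestUrl, String queryString) {
        int start = queryString.indexOf(PARAM);
        if (start < 0) {
            throw new IllegalArgumentException("no _escaped_fragment_ parameter: " + queryString);
        }
        String token;
        try {
            // Unescape %XX sequences in the token. Note that URLDecoder also
            // turns '+' into a space, which may or may not suit your tokens.
            token = URLDecoder.decode(queryString.substring(start + PARAM.length()), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
        // Whatever precedes the parameter is the ordinary query string;
        // drop the separator left dangling at its end.
        String rest = queryString.substring(0, start);
        while (rest.endsWith("&") || rest.endsWith("?")) {
            rest = rest.substring(0, rest.length() - 1);
        }
        StringBuilder url = new StringBuilder(requestUrl);
        if (!rest.isEmpty()) {
            url.append('?').append(rest);
        }
        return url.append("#!").append(token).toString();
    }

    public static void main(String[] args) {
        System.out.println(toHashBangUrl(
                "http://127.0.0.1:8888/Myproject.html",
                "gwt.codesvr=127.0.0.1:9997&_escaped_fragment_=article"));
    }
}
```

The returned string is a complete URL, so it can be passed to webClient.getPage(...) without triggering a "no protocol" error.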

At the bottom is an example:



The rendered page that you serve the Google Bot does not have to be a 1:1 copy of your original page. It is enough if the same content is available, styling is irrelevant. For example compare:



-- J.
