Jaunt Webscraping/Automation Tutorial - Extra Topics

Extra Topics for Jaunt v. 1.6.1

The following examples assume knowledge of the basic functionality of the Jaunt API. If you have not already done so, familiarize yourself with the Webscraping Tutorial. To use Jaunt, download and extract the zip file. The zip file contains the licensing agreement, javadocs documentation, example files, release notes, and a jar file (Java 1.6). The jar file must be included in your classpath/project, at which point you will be able to recompile and/or run the example files.

Example 1: Iterating through an Element's child Nodes.

Example1.java:

try{
  UserAgent userAgent = new UserAgent();
  userAgent.openContent("<html>Welcome!<div>Under Construction</div><!-- todo: add more --></html>");
  
  Element html = userAgent.doc.findFirst("<html>");     //find the html element
  List<Node> childNodes = html.getChildNodes();         //retrieve the child nodes as a List  
  
  for(Node node : childNodes){                          //iterate through the list of Nodes.
    if(node.getType() == Node.ELEMENT_TYPE){            //determine whether the node is an Element
      System.out.println("element: " + ((Element)node).outerHTML());   //print the element and its content
    }
    else if(node.getType() == Node.TEXT_TYPE){          //determine whether the node is Text
      System.out.println("text: " + ((Text)node).toString());          //print the text
    }
    else if(node.getType() == Node.COMMENT_TYPE){       //determine whether the node is a Comment
      System.out.println("comment: " + ((Comment)node).toString());    //print the comment
    }  
  }
}
catch(JauntException e){
  System.err.println(e);
}

This example illustrates how to traverse an Element's child nodes. Because Element objects have powerful search methods, traversing child nodes using this technique is rarely required for searching.

When using getChildNodes to iterate through a list of Node children (as on line 6), the node type can be determined either by checking the value of getType() (as on lines 9, 12, and 15) or by attempting to cast the value to different types (ie, Element, Comment or Text) and catching ClassCastExceptions. Another useful method for iterating through children (not shown) is getChildElements(), which returns only Elements.

On lines 13 and 16 the Node object is cast to Text/Comment before calling toString(). Casting is actually unnecessary here, due to dynamic binding with toString(); however, it is performed to illustrate the relevant classes to use when casting.
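As an alternative to catching ClassCastExceptions, the node kind can be tested with instanceof before casting. The sketch below is for comparison only: it does not use Jaunt, but the JDK's built-in org.w3c.dom classes (whose Element, Text, and Comment types are unrelated to Jaunt's classes of the same names). The class name ChildNodeKinds and the method describeChildren are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Comment;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;

public class ChildNodeKinds {

    // Describes each immediate child of the root element of the given XML string.
    // Uses the JDK DOM parser (an assumption for this sketch), not the Jaunt parser.
    static List<String> describeChildren(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            List<String> out = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                Node node = children.item(i);
                if (node instanceof Element) {            // test the kind first, then cast safely
                    out.add("element: " + ((Element) node).getTagName());
                } else if (node instanceof Comment) {
                    out.add("comment: " + ((Comment) node).getData().trim());
                } else if (node instanceof Text) {
                    out.add("text: " + ((Text) node).getData().trim());
                }
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html>Welcome!<div>Under Construction</div><!-- todo: add more --></html>";
        for (String line : describeChildren(xml)) {
            System.out.println(line);
        }
    }
}
```

Testing with instanceof avoids using exceptions for control flow, which is generally the more idiomatic choice in Java.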

The Comment object is used to represent standard HTML/XML comments as well as doctype definitions, processing instructions, and CDATA sections, which can be differentiated by checking the Comment's type (ie, calling getCommentType()). Example 2 deals with comment-related searches.

Example 2: Accessing Comments, DOCTYPE definitions, processing instructions, and CDATA.

message.xml:

<?xml version="1.0" encoding="ISO-8859-1"?>
<?id 568375?>
<!DOCTYPE MML SYSTEM "MML.dtd">
<message>
  <content>
    <!-- last edited Jan 12, 2013 -->
    This XML document is a test of Message Markup Language (MML) 
    <![CDATA[ MML has tags like <message> and <content> ]]>
  </content>
</message>
<!-- copyright 2013  -->

Example2.java:

try{ 
  UserAgent userAgent = new UserAgent(); 
  userAgent.visit("http://jaunt-api.com/examples_advanced/message.xml");
   
  Comment result = userAgent.doc.getComment(2);              //get doc's 3rd child comment
  System.out.println("result: " + result);                   //print the result
  Comment doctype = userAgent.doc.getFirst(Comment.DOCTYPE); //get doc's first child doctype
  System.out.println("doctype: " + doctype);                 //print the doctype
   
  List<Comment> pis = userAgent.doc.getEach(Comment.PROCESSING_INSTRUCTION); //get doc's immediate child PIs
  for(Comment pi : pis) System.out.println("processing instruction: " + pi); //print the list of PIs.
        
  Comment cdata = userAgent.doc.findFirst(Comment.CDATA);    //find first CDATA section in document
  System.out.println("cdata: " + cdata);                     //print the CDATA section
   
  List<Comment> comments = userAgent.doc.findEach(Comment.REGULAR); //find each regular comment in document
  for(Comment comment : comments) System.out.println("comment: " + comment); //print list of regular comments
}
catch(JauntException e){
  System.err.println(e);
}

This example illustrates a variety of Comment-related search methods. The Comment class represents "regular" comments that have the form <!-- comment -->, but also represents DOCTYPE definitions, processing instructions, CDATA sections, and (not shown) irregular comments of the form <! irregular comment >.

The method names used in this example closely mirror those previously encountered, used to search for Elements. Although they work in a similar fashion with respect to get vs. find and each vs every, it should be noted that their method signatures differ. Since Comments do not have children, searches for comments are never chained, and therefore multiple results are returned in a List instead of an <#elements> container.

Another difference is that the Comment-related search methods do not accept a String query, but rather a constant, which is used to specify the type of comment being targeted. The getComment(int) method (line 5), like the getElement(int) method, accepts an integer index n identifying which immediate child Comment to retrieve. The index is zero-based, so getComment(2) retrieves the third child comment.

Example 3: Using Filters to block/allow content when parsing.

messageList.xml:

<?xml version="1.0"?>
<messageList>
  <message id="0">Hello there</message>
  <message id="1">What's up?</message>
  <message id="2">Anyone there?</message>
  <notification type="disconnection"/>
</messageList>

Example3.java:

try{
  UserAgent userAgent = new UserAgent(); 
  userAgent.settings.genericXMLMode = true;          //set mode for processing generic XML
      
  userAgent.setFilter(new Filter(){                  //subclass Filter to create custom filter
    public boolean childElementAllowed(Element parent, Element child){ //override callback method
      if(child.getName().equals("message")){         //only allow tags named 'message'
        child.removeAttribute("id");                 //remove 'id' attribute, if present
        return true;
      }
      else return false;
    }
  });
  userAgent.visit("http://jaunt-api.com/examples_advanced/messageList.xml"); //open content 
  System.out.println("Filtered document:\n" + userAgent.doc.innerXML());     //print filtered document.    
}
catch(JauntException e){
  System.err.println(e);
}

This example illustrates using a filter to remove/alter content. The UserAgent method setFilter(FilterCallback) can be used two ways. You can either create your own class that satisfies the FilterCallback interface (three methods), or you can subclass Filter (as in this example). The class Filter already implements FilterCallback and allows everything to pass through the filter (ie, each of the three methods always returns true). In subclassing it, you need only override the method(s) that handle what you want to filter out. In this example, we only override the method for filtering elements, while the methods for filtering text and comments are left to allow all comments/text.

Each of the callback methods that you override is called by the parser in order to check whether any given element/text/comment should be added to the DOM tree. On line 5, our custom filter is passed as an argument to setFilter(FilterCallback). These filter settings will be retained by the UserAgent until removeFilter() has been called. See below for the printed output for this example.

Output:

<?xml version="1.0"?>

  <message>Hello there</message>
  <message>What's up?</message>
  <message>Anyone there?</message>

Example 4: Injecting/Overriding HTTP Request Headers.

Example4.java:

try{
  UserAgent userAgent = new UserAgent(); 
  userAgent.settings.showHeaders = true;      //change settings to auto-print header information
    
  System.out.println("SENDING DEFAULT, AUTO-MANAGED REQUEST HEADERS...\n");
  userAgent.sendGET("http://jaunt-api.com/examples_advanced/hello.xml");   //perform HTTP request
  System.out.println("DOCUMENT:\n" + userAgent.doc.innerXML());   //print retrieved document
    
  System.out.println("SENDING MODIFIED REQUEST HEADERS...\n");
  userAgent.sendGET("http://jaunt-api.com/examples_advanced/hello.xml",    
    "user-agent:Mozilla/4.0", "foo:bar");                         //specify new/modified headers
  System.out.println("DOCUMENT:\n" + userAgent.doc.innerXML());   //print retrieved document
}
catch(JauntException e){
  System.err.println(e);
}

This example illustrates sending two HTTP GET requests -- the first with default request headers and the second with additional/modified request headers. To begin, the UserAgent settings are changed to automatically display sent/received headers to the console (line 3).

For the first request (line 6), UserAgent's sendGET(String) method is invoked, which is functionally identical to the visit(String) method seen in previous examples. The retrieved document is then printed as XHTML (line 7). Examining the printed request headers reveals that they consist of the default headers specified in userAgent.settings.defaultRequestHeaders, as well as headers that are auto-managed by the UserAgent, such as any cookie-related headers and basic authentication headers.

For the second request (line 10), the sendGET(String, String ...) method is called. Note that the second argument allows a variable number of Strings. Each String should contain the name and value of a request header, separated by a colon. The headers specified here are sent in addition to (or in place of) the default headers and headers that are auto-managed by the UserAgent. Therefore, the header "foo:bar" is sent in addition to the automatically managed headers, but the header "user-agent:Mozilla/4.0" overrides the user-agent header specified in UserAgent.settings.defaultRequestHeaders.
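The "in addition to (or in place of)" behavior described above can be pictured as a merge of name:value strings over a set of defaults. The following is a minimal sketch of that idea using only the JDK; it is not Jaunt's implementation. The class HeaderMerge, its merge method, and the default header values are hypothetical, and the case-insensitive matching of header names is an assumption based on how HTTP header names are normally compared.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class HeaderMerge {

    // Merges "name:value" strings over a map of default headers.
    // A supplied header replaces a default with the same (case-insensitive) name;
    // any other supplied header is added alongside the defaults.
    static Map<String, String> merge(Map<String, String> defaults, String... headers) {
        Map<String, String> merged = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        merged.putAll(defaults);
        for (String header : headers) {
            int colon = header.indexOf(':');              // split at the first colon only
            if (colon <= 0) {
                throw new IllegalArgumentException("expected name:value, got: " + header);
            }
            String name = header.substring(0, colon).trim();
            String value = header.substring(colon + 1).trim();
            merged.put(name, value);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = new LinkedHashMap<>();
        defaults.put("User-Agent", "DefaultAgent/1.0");   // hypothetical default value
        defaults.put("Accept", "*/*");                    // hypothetical default value

        Map<String, String> sent = merge(defaults, "user-agent:Mozilla/4.0", "foo:bar");
        for (Map.Entry<String, String> e : sent.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

Splitting at the first colon only means that header values which themselves contain colons (a URL, for instance) are left intact.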

Example 5: Sending HEAD, POST, PUT, and DELETE Requests.

Example5.java:

try{
  UserAgent userAgent = new UserAgent();

  System.out.println("SENDING HEAD REQUEST...\n");
  userAgent.sendHEAD("http://jaunt-api.com/examples_advanced/hello.xml");   //send HTTP HEAD Request
  System.out.println("RESPONSE HEADERS:\n" + userAgent.response.getHeaders());
     
  System.out.println("SENDING POST REQUEST...\n");
  userAgent.sendPOST("http://tomcervenka.site90.com/handlePost.php",
    "username=tom&password=secret");                               //send HTTP POST Request with queryString
  System.out.println("DOCUMENT:\n" + userAgent.doc.innerXML());    //print retrieved Document
      
  /** this section requires TARGET_SERVER to support PUT requests
  System.out.println("SENDING PUT REQUEST...\n");
  userAgent.sendPUT("http://TARGET_SERVER/examples_advanced/hello.xml", //send HTTP PUT Request with updated content
    "<?xml version=\"1.0\"?>Hi Mom!");
  System.out.println("DOCUMENT:\n" + userAgent.doc.innerXML());    //print retrieved Document
  */
      
  /** this section requires TARGET_SERVER to support DELETE requests
  System.out.println("SENDING DELETE REQUEST...\n");                  
  userAgent.sendDELETE("http://TARGET_SERVER/examples_advanced/hello.xml"); //send HTTP DELETE request
  System.out.println("DOCUMENT:\n" + userAgent.doc.innerXML());    //print retrieved Document
  */
}
catch(JauntException e){
  System.err.println(e);
}

This example illustrates sending a variety of HTTP requests aside from the GET Request, which was illustrated in the previous example. GET, HEAD, and POST requests are typically supported by default on webservers, whereas DELETE and PUT requests require server-side configuration (which has not been enabled for these examples).

The HEAD request (line 5) creates (and returns) an HttpResponse property (UserAgent.response), whose header information is printed on line 6. The POST request (line 9) accepts queryString-formatted data as the second parameter and creates (and returns) a Document object (UserAgent.doc), which is then printed on line 11. The PUT request (line 15) accepts updated content as the second parameter and creates (and returns) a Document object (UserAgent.doc), which is then printed on line 17. The DELETE request (line 22) deletes the resource specified by the first parameter and creates (and returns) a Document object (UserAgent.doc), which is then printed on line 23.

In each case where a Document object is created/returned, the document reference is guaranteed to be non-null (unless an exception is thrown), although the document may be empty (ie, hold no content).
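The queryString-formatted data passed to the POST request above ("username=tom&password=secret") contains only plain characters. When field values may contain characters such as '&', '=' or spaces, form data of this kind is normally percent-encoded first. Below is a minimal sketch of building such a string with the JDK's URLEncoder. The class QueryStringBuilder and its build method are hypothetical helpers, and whether Jaunt expects the data to be pre-encoded in this way is an assumption that should be checked against the javadocs.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryStringBuilder {

    // Joins the fields as name=value pairs separated by '&', encoding each name and value.
    static String build(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(field.getKey())).append('=').append(encode(field.getValue()));
        }
        return sb.toString();
    }

    // Encodes one name or value as application/x-www-form-urlencoded text.
    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is not supported", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();   // preserves insertion order
        fields.put("username", "tom");
        fields.put("password", "s&cret 1");
        System.out.println(build(fields));
    }
}
```

A LinkedHashMap is used so that the fields appear in the order in which they were added.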

Example 6: Using content handlers to retrieve JS/CSS/GIF/etc files.

Example6.java:

try{
  //create UserAgent and content handlers.
  UserAgent userAgent = new UserAgent();    
  HandlerForText handlerForText = new HandlerForText();
  HandlerForBinary handlerForBinary = new HandlerForBinary();

  //register each handler with a specific content-type
  userAgent.setHandler("text/css", handlerForText);
  userAgent.setHandler("text/javascript", handlerForText);
  userAgent.setHandler("application/x-javascript", handlerForText);
  userAgent.setHandler("image/gif", handlerForBinary);
  userAgent.setHandler("image/jpeg", handlerForBinary);

  //retrieve CSS content as String
  userAgent.visit("http://jaunt-api.com/syntaxhighlighter/styles/shCore.css");
  System.out.println(handlerForText.getContent());
    
  //retrieve JS content as String
  userAgent.visit("http://jaunt-api.com/syntaxhighlighter/scripts/shCore.js");
  System.out.println(handlerForText.getContent());
    
  //retrieve GIF content as byte[], and print its length
  userAgent.visit("http://jaunt-api.com/background.gif");
  System.out.println(handlerForBinary.getContent().length);   
} 
catch(JauntException e){
  System.err.println(e);
}

This example illustrates using content handlers to read a CSS file, a JS file, and a GIF file.

On lines 4-5, one HandlerForText and one HandlerForBinary are created. On lines 8-12, the userAgent associates various content types with one of the two handlers. On the subsequent lines, the userAgent is directed to a css file, a js file and a gif file. In each case the handler object is used to retrieve the content in either text or binary form, depending on which handler was used.
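To make the idea of associating content types with handlers more concrete, here is a small self-contained sketch of a registry that dispatches a response body to whichever handler was registered for its content type. This is not Jaunt's implementation: HandlerRegistry, ContentHandler, TextCollector, and dispatch are hypothetical names, and stripping parameters such as "; charset=UTF-8" before the lookup is an assumption of the sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class HandlerRegistry {

    // Hypothetical callback that receives the raw body of a response.
    interface ContentHandler {
        void handle(byte[] body);
    }

    // Keeps the most recently handled body as a String.
    static class TextCollector implements ContentHandler {
        private String content;
        public void handle(byte[] body) { content = new String(body, StandardCharsets.UTF_8); }
        String getContent() { return content; }
    }

    private final Map<String, ContentHandler> handlers = new HashMap<>();

    // Associates a content type (eg, "text/css") with a handler.
    void setHandler(String contentType, ContentHandler handler) {
        handlers.put(normalize(contentType), handler);
    }

    // Passes the body to the handler registered for the content type.
    // Returns false if no handler is registered for that type.
    boolean dispatch(String contentTypeHeader, byte[] body) {
        ContentHandler handler = handlers.get(normalize(contentTypeHeader));
        if (handler == null) {
            return false;
        }
        handler.handle(body);
        return true;
    }

    // Drops any parameters (eg, "; charset=UTF-8") and lower-cases the type.
    private static String normalize(String contentType) {
        int semicolon = contentType.indexOf(';');
        String bare = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return bare.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        HandlerRegistry registry = new HandlerRegistry();
        TextCollector text = new TextCollector();
        registry.setHandler("text/css", text);
        registry.setHandler("text/javascript", text);   // one handler may serve several types

        registry.dispatch("text/css; charset=UTF-8", "body{}".getBytes(StandardCharsets.UTF_8));
        System.out.println(text.getContent());
    }
}
```

In this sketch a shared handler holds only the content it handled most recently, which is why the content would need to be read after each request and before the next one.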

Example 7: Using an HTTP proxy (UserAgent-level config).

Example7.java:

UserAgent userAgent = new UserAgent();  
userAgent.setProxyHost("3.14.159.68");        //specify the proxy host (ip address)
userAgent.setProxyPort(8080);                 //specify the proxy port
                                              //visit a (non-https://) URL through the proxy
userAgent.visit("http://amazon.com");                
System.out.println(userAgent.doc.innerXML()); //print the retrieved document

In this example, the UserAgent is configured to make requests through an HTTP proxy that does not require username/password. The proxy is specified by its hostname and port number on lines 2-3 (bogus proxy information shown). The retrieved document is then printed as XHTML. To unset the proxy, the proxyHost should be specified as null in the method setProxyHost(String). Since this configuration is performed at UserAgent-level, it's possible to create multiple UserAgents, each of which uses a different proxy at the same time (eg, one per thread). Note that when dealing with proxies, response times may lag. See UserAgentSettings.responseTimeout for setting a hard time limit on responses before throwing a ResponseException.

Example 8: Using an HTTP proxy with username/password (System-level config).

Example8.java:

//specify http proxy at System level.
System.setProperty("http.proxyHost", "3.14.159.68"); 
System.setProperty("http.proxyPort", "8080");
//specify username/password credentials at System level.
ProxyAuthenticator.setCredentials("tom", "secret");
                                              
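//create the UserAgent after the System-level proxy has been configured
UserAgent userAgent = new UserAgent();
//visit a (non-https://) URL through the proxy 
userAgent.visit("http://amazon.com");                   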
System.out.println(userAgent.doc.innerXML()); //print the retrieved document

In this example, the System is configured to send requests through an HTTP proxy that requires username/password. The proxy is specified by its hostname and port number on lines 2-3 (bogus proxy information shown). The retrieved document is then printed as XHTML. To unset the proxy, the System properties should be cleared using System.clearProperty(String). Since this configuration is performed at System-level, it is applied to all UserAgents that are subsequently created.

Example 9: Using an HTTPS proxy (System-level config).

Example9.java:

//specify https proxy at System level.
System.setProperty("https.proxyHost", "12.345.67.8"); 
System.setProperty("https.proxyPort", "80");     

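//create the UserAgent after the System-level proxy has been configured
UserAgent userAgent = new UserAgent();
//visit a (https://) URL through the proxy 
userAgent.visit("https://google.com");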
System.out.println(userAgent.doc.innerXML()); //print the retrieved document

In this example, the System is configured to send requests through an HTTPS proxy (ie, an SSL tunneling proxy). The proxy is specified by its hostname and port number on lines 2-3 (bogus proxy information shown). The retrieved document is then printed as XHTML. To unset the proxy, the System properties should be cleared using System.clearProperty(String). Since this configuration is performed at System-level, it is applied to all UserAgents that are subsequently created.
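Setting and later clearing the System-level proxy properties used in Examples 8 and 9 can be wrapped in a pair of small helpers. The sketch below uses only standard JDK calls (System.setProperty and System.clearProperty). The class SystemProxyConfig and its method names are hypothetical, and the address shown is the same bogus proxy information used in Example 8.

```java
public class SystemProxyConfig {

    // Sets the proxy host/port properties for the given scheme ("http" or "https").
    static void setProxy(String scheme, String host, int port) {
        System.setProperty(scheme + ".proxyHost", host);
        System.setProperty(scheme + ".proxyPort", Integer.toString(port));
    }

    // Clears the proxy host/port properties for the given scheme.
    static void clearProxy(String scheme) {
        System.clearProperty(scheme + ".proxyHost");
        System.clearProperty(scheme + ".proxyPort");
    }

    public static void main(String[] args) {
        setProxy("http", "3.14.159.68", 8080);     // bogus proxy information
        System.out.println("http.proxyHost = " + System.getProperty("http.proxyHost"));
        System.out.println("http.proxyPort = " + System.getProperty("http.proxyPort"));

        clearProxy("http");                        // unset the proxy again
        System.out.println("http.proxyHost = " + System.getProperty("http.proxyHost"));
    }
}
```

Because these properties are global to the JVM, clearing them when the proxy is no longer needed keeps the setting from silently affecting other code running in the same process.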