
Jaunt Webscraping Tutorial - Quickstart

Overview of Webscraping with Jaunt

The Jaunt package contains the class UserAgent, which represents a headless browser. When the UserAgent loads an HTML or XML page, it creates a Document object. The Document object exposes the content as a tree of Nodes, such as Element objects, Text objects, and Comment objects. HTML documents, for example, begin with the <html> Element, whose child nodes are <head> and <body> Elements. Each Element contains zero or more attributes (such as <body class='foo'>) and zero or more child Nodes. The children can be Text objects, Comment objects, or other Element objects.

When creating the Document, the UserAgent may encounter malformed HTML/XML. The parser automatically deals with the kinds of dirty data found online and is guaranteed to create a parse tree to represent the content, with relative urls automatically converted to absolute urls. The parser generally preserves non-validating structures if they may reflect the intent of the document creator.
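The conversion of relative urls to absolute urls can be illustrated with the Java standard library alone. The following is a minimal sketch using java.net.URI; it is not Jaunt's own implementation, and the class name, method name, and example.com urls are made up for illustration.

```java
import java.net.URI;

// Illustrative only: shows how a relative link resolves against the url of its page.
class UrlResolveSketch {

    // Resolve a (possibly relative) href against the url of the page it appeared on.
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/docs/index.html";       // hypothetical page url
        System.out.println(toAbsolute(page, "faq.html"));         // sibling document
        System.out.println(toAbsolute(page, "../img/logo.png"));  // parent directory
        System.out.println(toAbsolute(page, "/download"));        // server root
        System.out.println(toAbsolute(page, "http://other.example/x")); // already absolute
    }
}
```

An href that is already absolute is returned unchanged, so the same call can be applied to every link on a page.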

Jaunt's document model is very similar to the W3C DOM, but it is not identical. One difference is that entity references (such as &nbsp;) are represented as regular text. In addition to exposing the tree structure of the DOM, the Document also provides access to higher-level components such as Form and Table. These components provide convenience methods for submitting forms and extracting data from tables. The UserAgent also provides access to the HttpResponse object, which holds headers for the most recent HTTP response.

To begin using Jaunt, download and extract the zip file. The zip file contains the licensing agreement, javadocs documentation, example files, release notes, and a jar file (Java 1.6). Include the jar file in your classpath/project, at which point you will be able to recompile and/or run the example files. If your IDE supports it, configure javadoc integration (eg, in your Eclipse project, configure java build path, expand the entry for the jar file, select the path to the javadocs folder).

Example 1: Create a UserAgent, visit a url, print the HTML.
try{
  UserAgent userAgent = new UserAgent();                       //create new userAgent (headless browser).
  userAgent.visit("http://oracle.com");                        //visit a url   
  System.out.println(userAgent.doc.innerHTML());               //print the document as HTML
}
catch(JauntException e){         //if an HTTP/connection error occurs, handle JauntException.
  System.err.println(e);
}
    
This example illustrates creating a UserAgent, visiting a webpage, and printing the document as HTML.

When the userAgent visits a url (line 3), it creates a Document object (userAgent.doc) to represent the parsed content, whether it's HTML, XHTML, or XML. On line 4, the document content is printed as HTML. In order to print the document as XML instead, the method innerXML() can be called. Malformed tags and/or missing closing tags are automatically corrected, so the printed output may not be identical to the original/unparsed content. The UserAgent method getSource() provides the original/unaltered source.

Example 2: UserAgent settings, searching using findFirst.
try{
  UserAgent userAgent = new UserAgent();                       //create new userAgent (headless browser).
  System.out.println("SETTINGS:\n" + userAgent.settings);      //print the userAgent's default settings.
  userAgent.settings.autoSaveAsHTML = true;                    //change settings to autosave last visited page.

  userAgent.visit("http://oracle.com");                        //visit a url.
  String title = userAgent.doc.findFirst("<title>").getText(); //get child text of title element.
  System.out.println("\nOracle's website title: " + title);    //print the title 

  userAgent.visit("http://amazon.com");                        //visit another url.
  title = userAgent.doc.findFirst("<title>").getText();        //get child text of title element.
  System.out.println("\nAmazon's website title: " + title);    //print the title  
}
catch(JauntException e){   //if title element isn't found or HTTP/connection error occurs, handle JauntException.
  System.err.println(e);          
}
This example illustrates visiting two urls, in each case extracting and printing the title of the webpage.

On line 4 autosaving is enabled, which means that anytime a page is visited it will be autosaved as LAST_VISITED.html in the directory specified by settings.outputPath. The related setting autoSaveAsXML can be used to save the document in XML format rather than HTML. Autosaving is useful in development and debugging, since LAST_VISITED.html can be checked with Chrome/Firefox/etc to examine the DOM structure. See UserAgentSettings for all the settings of a UserAgent.

The document's findFirst(String) method (lines 7 and 11) accepts a query that (in simple cases) resembles an HTML tag, and searches the document tree until it finds a matching element. It should be noted that the query "<title>" will match any Element whose tagname is title (case insensitive), whether or not the element has additional attributes. As we'll see in later examples, the tagname portion of the query is actually a regular expression, which provides a powerful syntax for pattern matching. For example, the query "<h(1|2)>" would match any h1 or h2 tag. Example 11 provides a full account of the query syntax.

Note that in this example we catch JauntException, which is the superclass of all other Jaunt-related Exceptions. Later examples will demonstrate handling HTTP/connection errors separately from search-related errors.

Example 3: Opening HTML from a String, retrieving Element text.
try{
  UserAgent userAgent = new UserAgent();                        
                                                                //open HTML from a String.
  userAgent.openContent("<html><body>WebPage <div>Hobbies:<p>beer<p>skiing</div> Copyright 2013</body></html>");
  Element body = userAgent.doc.findFirst("<body>");
  Element div = body.findFirst("<div>");
  
  System.out.println("body's text: " + body.getText());         //join child text of body element
  System.out.println("body's innerText: " + body.innerText());  //join all text within body element
  System.out.println("div's text: " + div.getText());           //join child text of div element
  System.out.println("div's innerText: " + div.innerText());    //join all text within the div element
}
catch(JauntException e){                          
  System.err.println(e);          
}
This example illustrates opening HTML content from a String and printing the text content of specific Elements.

On line 6 we see that the findFirst(String) method can be invoked on an Element (or the Document as on the previous line). When invoked on an Element, the search is restricted to that Element's descendants. The findFirst(String) method actually belongs to the class Element, and is inherited by Document.

On lines 8 - 11 we see that the getText() method returns a String concatenation (joining) of the text children of the element, whereas the innerText() method returns a String concatenation of all Text descendants. If an element does not contain any text, either method will return an empty String.

The variation innerText(String, boolean, boolean) accepts additional parameters for inserting a delimiter between concatenated texts, specifying whether to exclude script tag text, and/or indicating whether HTML/XHTML entity references within the text should be replaced with their character equivalents (eg converting &amp; to &).
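The difference between joining child text and joining all descendant text can be sketched with a tiny hand-built tree. This is a minimal illustration, not Jaunt code: the Node class and the childText/allText helpers are hypothetical stand-ins for Element, getText() and innerText(), and the tree shape (each <p> as a child of the <div>) is an assumption about how the Example 3 markup is structured.

```java
import java.util.ArrayList;
import java.util.List;

class TextJoinSketch {

    // A node is either a text node (text != null) or an element with children.
    static final class Node {
        final String tag;   // null for text nodes
        final String text;  // null for elements
        final List<Node> children = new ArrayList<>();

        Node(String tag, String text) { this.tag = tag; this.text = text; }

        static Node elementOf(String tag, Node... kids) {
            Node n = new Node(tag, null);
            for (Node k : kids) n.children.add(k);
            return n;
        }

        static Node textOf(String s) { return new Node(null, s); }
    }

    // Join only the text nodes that are direct children of the element.
    static String childText(Node element) {
        StringBuilder sb = new StringBuilder();
        for (Node c : element.children) {
            if (c.text != null) sb.append(c.text);
        }
        return sb.toString();
    }

    // Join every text node found anywhere below the element, in document order.
    static String allText(Node element) {
        StringBuilder sb = new StringBuilder();
        for (Node c : element.children) {
            if (c.text != null) sb.append(c.text);
            else sb.append(allText(c));
        }
        return sb.toString();
    }

    // Hand-built model of the body used in Example 3 (assumed structure).
    static Node sampleBody() {
        Node div = Node.elementOf("div",
                Node.textOf("Hobbies:"),
                Node.elementOf("p", Node.textOf("beer")),
                Node.elementOf("p", Node.textOf("skiing")));
        return Node.elementOf("body",
                Node.textOf("WebPage "), div, Node.textOf(" Copyright 2013"));
    }

    public static void main(String[] args) {
        Node body = sampleBody();
        Node div = body.children.get(1);
        System.out.println("body child text: [" + childText(body) + "]");
        System.out.println("body all text:   [" + allText(body) + "]");
        System.out.println("div child text:  [" + childText(div) + "]");
        System.out.println("div all text:    [" + allText(div) + "]");
    }
}
```

In this model the child text of body skips everything inside the div, while the all-text variant includes it.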

Example 4: Element properties, attributes and parents.
try{
  UserAgent userAgent = new UserAgent();    
  userAgent.visit("http://intel.com");
  
  Element anchor = userAgent.doc.findFirst("<a href>");            //find 1st anchor element with href attribute
  System.out.println("anchor element: " + anchor);                 //print the anchor element
  System.out.println("anchor's tagname: " + anchor.getName());     //print the anchor's tagname
  System.out.println("anchor's href attribute: " + anchor.getAt("href"));  //print the anchor's href attribute
  System.out.println("anchor's parent Element: " + anchor.getParent());    //print the anchor's parent element
  
  Element meta = userAgent.doc.findFirst("<head>").findFirst("<meta>");    //find 1st meta element in head section
  System.out.println("meta element: " + meta);                     //print the meta element
  System.out.println("meta's tagname: " + meta.getName());         //print the meta's tagname
  System.out.println("meta's content attribute: " + meta.getAt("content"));//print the meta's content attribute
  System.out.println("meta's parent Element: " + meta.getParent());        //print the meta's parent element
}
catch(JauntException e){              
  System.err.println(e);          
}
This example illustrates visiting a website, searching for specific elements, and printing various attributes and properties of those elements.

The search query on line 5 specifies not only the tagname but also that the Element must contain an href attribute. As previously noted, the tagname portion of the query is a regular expression. The attributename section of the query, however, is not; it is matched as a case-insensitive String.

On lines 6, 9, 12 and 15, an Element's toString() method is implicitly called, which returns a String representation of the Element excluding its children. See Example 5 for how to obtain a String representation of an Element that does include its children.

On line 8, the getAt(String) method is called to retrieve the attribute value associated with the (case insensitive) attribute name href. If the anchor tag did not have an href attribute, calling getAt(String) would throw a NotFound Exception. The related method getAtString(String) returns an empty String rather than throwing a NotFound Exception if the attribute value does not exist (not shown).
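The two lookup styles described above (throw when the attribute is missing, or fall back to an empty String) can be sketched with a plain Map. This is only an illustration of the design choice: getStrict and getLenient are hypothetical helpers, and java.util.NoSuchElementException stands in for Jaunt's NotFound Exception.

```java
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.TreeMap;

class AttributeLookupSketch {

    // Attribute names are compared case-insensitively, as described in the text.
    static Map<String, String> attributes() {
        Map<String, String> attrs = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        attrs.put("href", "http://example.com/");   // hypothetical attribute value
        return attrs;
    }

    // Strict style: a missing attribute is treated as an error.
    static String getStrict(Map<String, String> attrs, String name) {
        String value = attrs.get(name);
        if (value == null) {
            throw new NoSuchElementException("no attribute named " + name);
        }
        return value;
    }

    // Lenient style: a missing attribute becomes an empty String.
    static String getLenient(Map<String, String> attrs, String name) {
        String value = attrs.get(name);
        return value == null ? "" : value;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = attributes();
        System.out.println(getStrict(attrs, "HREF"));                 // found despite the case difference
        System.out.println("[" + getLenient(attrs, "title") + "]");  // missing, so empty
    }
}
```

The strict style suits cases where a missing attribute means the page layout has changed; the lenient style suits optional attributes.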

An example of chaining search methods can be seen on line 11, where the document is searched for the first head Element, which is subsequently searched for the first meta Element. In this case, the same result would be obtained by simply calling userAgent.doc.findFirst("<meta>"), however if the head section did not contain a meta element, this search would scan the entire document until it found a match rather than only searching the head section.

Example 5: Opening HTML from a file, altering an Element's content.
<html>
   <div class='images'>
     <img src='image1.jpg'><br>
     <img src='image2.jpg'><br>
   </div> 
   <p>
     visit again soon!
   </p>
</html>
try{
  UserAgent userAgent = new UserAgent();
  userAgent.open(new File("images.htm"));  //open the HTML (or XML) from a file
   
  Element div = userAgent.doc.findFirst("<div class=images>");  //find first div whose class matches 'images'  
  System.out.println("div as HTML:\n" + div.outerHTML());       
  System.out.println("div's content as HTML:\n" + div.innerHTML());
  System.out.println("div as XML:\n" + div.outerXML(2));             //specify that indentation is 2 spaces
  System.out.println("div's content as XML:\n" + div.innerXML(2));   //specify that indentation is 2 spaces
 
  //make some changes
  div.innerHTML("<img src='presto.gif'><br>Presto!");          //replace div's content with different elements.
  System.out.println("Altered document as HTML:\n" + userAgent.doc.innerHTML());  //print the altered document.
}
catch(JauntException e){
   System.err.println(e);
}
This example illustrates opening content from a local file, searching for specific elements, printing those elements as HTML/XML, then altering the HTML/XML content of an element, and finally printing the entire document.

The query used on line 5 can be read as "find the first element which has a tagname that matches the case-insensitive regular expression div and which has an attribute whose name matches the case-insensitive String class, where the value of the attribute matches the case-insensitive regular expression images". Note that on line 5, the attribute value within the query is unquoted, but as with HTML, quotes are optional.

Because the DOM represents HTML and XML using the same internal model, switching between the two when printing is a matter of output formatting. When using the indenting options (ie outerHTML(int) and outerXML(int)), whitespace characters are added to the output string in order to indent each node (including existing whitespace nodes), so this indenting whitespace will appear in addition to any that was already present.

The interchangeability between HTML and XML is also seen in the methods that require HTML or XML input, such as innerHTML(String) on line 12; such methods accept either format.

Example 6: Detecting HTTP errors and connection errors with the Response object.

try{
  UserAgent userAgent = new UserAgent();      
  userAgent.visit("http://oracle.com");       
  System.out.println("Response:\n" + userAgent.response);  //print response data
}
catch(ResponseException e){                                //catch HTTP/Connection error
  HttpResponse response = e.getResponse();                 //or check userAgent.response
  if(response != null){                                    //print response data field by field
    System.err.println("Requested url: " + response.getRequestedUrlMsg()); //print the requested url
    System.err.println("HTTP error code: " + response.getStatus());        //print HTTP error code
    System.err.println("Error message: " + response.getMessage());         //print HTTP status message
  }
  else{
    System.out.println("Connection error, no response!");
  }
} 
When the UserAgent attempts to visit a url, it's possible that the connection to the webserver will fail or that an HTTP error code will be returned. The HttpResponse object (userAgent.response) contains information about the webserver response. If no error occurs, UserAgent.response can be examined for details regarding the response, as on line 4.

If the connection fails or an HTTP error occurs, the UserAgent.visit(String) method will throw a ResponseException. The ResponseException also contains a reference to the response; however, in this case it's possible that the response is null (indicating that no response was received due to a connection error). Since the response object could be null, that possibility is checked on line 8 before invoking any of its methods for the printing steps. A simpler alternative would be to print the ResponseException e, which would show the same (and more) information (see next Example).

In some cases, a webserver response will redirect the UserAgent to visit another url. In the case of a sequence of redirected requests and responses, userAgent.response represents the most recent response in the chain.

Example 7: Handling HTTP errors, connection errors, and search Exceptions.
try{
  UserAgent userAgent = new UserAgent();  //find the first anchor having href, get href value (below)
  String firstAnchorUrl = userAgent.visit("http://amazon.com").findFirst("<a href>").getAt("href");
  userAgent.visit(firstAnchorUrl);                              //visit url
  System.out.println("location:" + userAgent.getLocation());    //print the current location (url).
}
catch(SearchException e){        //if an element or attribute isn't found, catch the exception.
  System.err.println(e);         //printing exception shows details regarding origin of error
}
catch(ResponseException e){      //in case of HTTP/Connection error, catch ResponseException
  System.err.println(e);         //printing exception shows HTTP error information or connection error
}

This example illustrates visiting a series of urls and handling the various types of errors that can occur. On line 3, the UserAgent visits http://amazon.com, searches for the first anchor that has an href attribute, and then returns the href value. Note that these steps can occur in a single statement because the visit(String) method returns a Document. When working with hyperlinks, it can also be useful to take advantage of Document's convenience methods, such as getHyperlink(String) and findAttributeValues(String query) (not shown).

On line 7, any search-related errors are caught, which in this case would be a possible NotFound Exception, which is a subclass of SearchException.

On line 10, any HTTP and connection-related errors are caught. The ResponseException itself is printed here, which automatically shows information regarding the HttpResponse (or lack thereof) that caused the problem.
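Catching search errors and response errors separately relies on the exception hierarchy: one common superclass, with separate branches for search and response failures. The sketch below mirrors that shape with stand-in classes; ScrapeError, SearchError, NotFoundError and ResponseError are hypothetical names used only for illustration (they are not Jaunt classes), and classify is a made-up helper.

```java
class ExceptionHierarchySketch {

    // Stand-in for the common superclass (JauntException in the tutorial).
    static class ScrapeError extends Exception {
        ScrapeError(String message) { super(message); }
    }

    // Stand-in for search-related errors (SearchException).
    static class SearchError extends ScrapeError {
        SearchError(String message) { super(message); }
    }

    // Stand-in for a specific search error (NotFound), a subclass of SearchError.
    static class NotFoundError extends SearchError {
        NotFoundError(String message) { super(message); }
    }

    // Stand-in for HTTP/connection errors (ResponseException).
    static class ResponseError extends ScrapeError {
        ResponseError(String message) { super(message); }
    }

    // Simulates a scrape that fails in different ways and reports which handler ran.
    static String classify(int failureMode) {
        try {
            if (failureMode == 1) throw new ResponseError("HTTP 404");
            if (failureMode == 2) throw new NotFoundError("no <a href>");
            return "ok";
        } catch (SearchError e) {        // also catches NotFoundError, its subclass
            return "search: " + e.getMessage();
        } catch (ResponseError e) {
            return "response: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(0));
        System.out.println(classify(1));
        System.out.println(classify(2));
    }
}
```

A single catch of the common superclass would handle both branches at once, which is what the earlier examples do when they catch JauntException.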

Example 8: Searching using findEach, iterating through search results.
try{
  UserAgent userAgent = new UserAgent();
  userAgent.visit("http://amazon.com");    

  Elements tables = userAgent.doc.findEach("<table>");       //find non-nested tables    
  System.out.println("Found " + tables.size() + " tables:");
  for(Element table : tables){                               //iterate through Results
    System.out.println(table.outerHTML() + "\n----\n");      //print each element and its contents
  }    
                                                        
  Elements ols = userAgent.doc.findEach("<table>").findEach("<ol>");//find non-nested ol's within non-nested tables
  System.out.println("Found " + ols.size() + " OLs:");
  for(Element ol : ols){                                     //iterate through Results
    System.out.println(ol.outerHTML() + "\n----\n");         //print each element and its contents
  }  
}
catch(ResponseException e){
  System.out.println(e);
}
The findEach(String) method walks the document tree searching for any Element that matches the specified query, but any such elements are not themselves searched further. So for example Elements tables (line 5) will contain all the non-nested table elements (ie, it will not include tables that exist within other tables). The class Elements is a container for holding search results, but is itself a searchable element - it actually has the same search methods that Element does via inheritance. A good way of thinking about class Elements is as a <#elements> tag whose children are the elements returned from the search.

Although Elements tables on line 5 contains tables from the search, it should be noted that the container is not considered the parent of any of those search results. In other words, calling getParent() on any of the table Elements would return their parent Element in the document; it would not return the Elements container.

A benefit of the Elements class itself being searchable is that it allows searches to be easily chained. On line 11, for example, the document is first searched for non-nested tables, and then the resulting Elements container is subsequently searched for non-nested <ol> elements.

If the findEach(String) method does not locate any Elements that match the query, an empty Elements container is returned.

Example 9: Searching using findEvery vs. findEach
<html>
  <body>
    <div>vegetables</div>
    <div>fruits</div>
    <p class='meat'>
      Meats
      <div>chicken</div>
      <div>beef</div>
    </p>
    <div class='nut'>
      Nuts
      <div>peanuts</div>
      <div>walnuts</div>
    </div>
  </body>
</html>
try{
  UserAgent userAgent = new UserAgent(); 
  userAgent.visit("http://jaunt-api.com/examples/food.htm");
   
  Elements elements = userAgent.doc.findEvery("<div>");             //find all divs in the document
  System.out.println("Every div: " + elements.size() + " results"); //report number of search results.
   
  elements = userAgent.doc.findEach("<div>");                       //find all non-nested divs
  System.out.println("Each div: " + elements.size() + " results");  //report number of search results.
                                                                    //find non-nested divs within <p class='meat'>
  elements = userAgent.doc.findFirst("<p class=meat>").findEach("<div>"); 
  System.out.println("Meat search: " + elements.size() + " results");//report number of search results.
}
catch(JauntException e){
  System.err.println(e);
}
The findEvery(String) search (line 5) retrieves every div Element in the document (seven divs). The findEach(String) method (line 8) retrieves only five divs, since it will not find the nested divs. The last findEach(String) method (line 11) retrieves the divs that are children of <p class='meat'> (two results).

As with the findEach(String) method, findEvery(String) returns a <#elements> search results container. If no Elements are found, an empty <#elements> container is returned (no Exception is thrown).
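The difference between the two traversals can be sketched on a hand-built model of the food.htm tree. This is a minimal illustration of the "every" and "each" semantics as described above, not Jaunt's implementation; the El class and the countEvery/countEach helpers are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class NestedSearchSketch {

    // Bare-bones element: a tagname plus child elements (text is ignored here).
    static final class El {
        final String tag;
        final List<El> children = new ArrayList<>();

        El(String tag, El... kids) {
            this.tag = tag;
            for (El k : kids) children.add(k);
        }
    }

    // "Every" style: collect all matching descendants, nested or not.
    static void every(El root, String tag, List<El> out) {
        for (El c : root.children) {
            if (c.tag.equals(tag)) out.add(c);
            every(c, tag, out);             // always keep descending
        }
    }

    // "Each" style: collect matching descendants but do not search inside a match.
    static void each(El root, String tag, List<El> out) {
        for (El c : root.children) {
            if (c.tag.equals(tag)) out.add(c);
            else each(c, tag, out);         // descend only through non-matches
        }
    }

    static int countEvery(El root, String tag) {
        List<El> out = new ArrayList<>();
        every(root, tag, out);
        return out.size();
    }

    static int countEach(El root, String tag) {
        List<El> out = new ArrayList<>();
        each(root, tag, out);
        return out.size();
    }

    // Model of the markup shown in Example 9.
    static El food() {
        El meat = new El("p", new El("div"), new El("div"));
        El nut = new El("div", new El("div"), new El("div"));
        El body = new El("body", new El("div"), new El("div"), meat, nut);
        return new El("html", body);
    }

    public static void main(String[] args) {
        El html = food();
        El meat = html.children.get(0).children.get(2);   // the <p class='meat'> element
        System.out.println("every div: " + countEvery(html, "div"));     // 7
        System.out.println("each div:  " + countEach(html, "div"));      // 5
        System.out.println("each div within p: " + countEach(meat, "div")); // 2
    }
}
```

The only difference between the two methods is whether the traversal continues into an element that has already matched.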

Example 10: Searching using getElement and getEach, search method summary
<html>
  <body>
    <div>vegetables</div>
    <div>fruits</div>
    <p class='meat'>
      Meats
      <div>chicken</div>
      <div>beef</div>
    </p>
    <div class='nut'>
      Nuts
      <div>peanuts</div>
      <div>walnuts</div>
    </div>
  </body>
</html>
try{ 
  UserAgent userAgent = new UserAgent(); 
  userAgent.visit("http://jaunt-api.com/examples/food.htm");
  Element element;
   
  element = userAgent.doc.getElement(0);                     //retrieve 1st child element within the doc.      
  System.out.println("result1: " + element);                 //print the element
   
  element = userAgent.doc.getElement(0).getElement(0).getElement(3);     //get 4th child of 1st child of 1st child
  System.out.println("result2: " + element);                             //print the element
   
  element = userAgent.doc.findFirst("<p class=meat>").getElement(1);     //retrieve 2nd child element of p
  System.out.println("result3: " + element.outerHTML());                 //print the element and its content
   
  Elements elements = userAgent.doc.findFirst("<body>").getEach("<div>");//get body's child divs
  System.out.println("result4 has " + elements.size() + " divs:\n");     //print the search results
  System.out.println(elements.innerHTML(2));                             //print elements, indenting by 2
}
catch(JauntException e){
  System.err.println(e);
}
This example illustrates a variety of search methods whose names begin with 'get', which indicates that they search only children (as opposed to 'find' methods, which search all descendants).

The getElement(int) method on line 6 retrieves the first child of the document container (there is only one child -- the <html> element). On line 9, several getElement(int) methods are chained together to create a path to the <div class='nut'> element. On lines 12-13 a similar technique is used to retrieve and print <div>beef</div>. On line 15, the findFirst(String) method first retrieves the <body> element, then the getEach(String) method searches only the child elements of <body> for div elements. The results of the search (three divs) are then printed. When reviewing the output, remember that only the children of <#elements> constitute a search result.

Search Method Summary: a table of search methods
The following table summarizes the most important search methods covered in previous examples.
        First                      Each                      Every                      Scope
get     getFirst(String query)     getEach(String query)     --                         searches children only
find    findFirst(String query)    findEach(String query)    findEvery(String query)    searches children/descendants to any depth

First: searches for the first Element that matches the query; returns the Element or throws NotFound.
Each: searches for matching, non-nested Elements, which are returned in an Elements container.
Every: searches for all matching Elements, which are returned in an Elements container.

Example 11: Searching with regular expressions.
<html>
  <body>
    <p id='1'>hi</p>
    <span id='2'>bonjour</span>
    <div id='3'>hola</div>
    <p id='4'>ahoj</p>
  </body>
</html>
try{
  UserAgent userAgent = new UserAgent();  
  userAgent.visit("http://jaunt-api.com/examples/hello.htm");

  Elements elements = userAgent.doc.findEvery("<div|span>");      //find every element whose tagname is div or span.
  System.out.println("search results:\n" + elements.innerHTML()); //print the search results

  elements = userAgent.doc.findEvery("<p id=1|4>");               //find every p element whose id is 1 or 4
  System.out.println("search results:\n" + elements.innerHTML()); //print the search results

  elements = userAgent.doc.findEvery("< id=[2-6]>");              //find every element (any name) with id from 2-6
  System.out.println("search results:\n" + elements.innerHTML()); //print the search results

  elements = userAgent.doc.findEvery("<p>ho");      //find every p whose joined child text contains 'ho' (regex) 
  System.out.println("search results:\n" + elements.innerHTML()); //print the search results
}
catch(JauntException e){
  System.err.println(e);
}
This example illustrates using regular expressions within search queries. [Note that regular expressions written inside Java string literals need doubled backslashes, eg \\s]. A search query has the general form:
<tagnameRegex attributeName='attributeValueRegex'>childTextRegex
where multiple attributes are allowed. In order for the query to match against an element, all parts of the query (ie, the tagnameRegex, attribute name, attributeValueRegex and childTextRegex) must match if they are specified.
tagnameRegex:
If tagnameRegex is a whitespace character, it will match any tagname. Otherwise, the tagnameRegex will be treated as case-insensitive and be evaluated against entire tagnames (ie will not match substrings). The tagnameRegex must begin with either an alphabetical character or an opening parenthesis, and may not contain whitespace (though it may contain \\s, which matches whitespace).
attributeName:
If no attributes are included in the query, the query will match any attributes in a candidate element (including one without attributes). Otherwise, the attributeName in the query is matched as a case-insensitive string, not as a regular expression.
attributeValueRegex:
If attributeValueRegex is not present, the attributeName in the query will be matched against candidate attributeNames irrespective of their attributeValues. If attributeValueRegex is present, it will be treated as case-insensitive and be evaluated against the entire corresponding attribute value (ie will not match substrings).
childTextRegex:
If childTextRegex is not present, the query will match any child text (including lack of text). Otherwise, childTextRegex will be evaluated against the concatenation of Text children of the Element. It's important to note that the childTextRegex is case sensitive and will match against substrings.
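The three matching rules above (whole-string, case-insensitive matching for tagnames and attribute values; case-sensitive substring matching for child text) can be sketched with java.util.regex. This is an illustration of the rules as stated, not Jaunt's query engine: the helper names are hypothetical, only a single id attribute is modelled, and a null query part stands for "not specified".

```java
import java.util.regex.Pattern;

class QueryMatchSketch {

    // Tagnames and attribute values: case-insensitive, must match the whole string.
    static boolean fullMatchIgnoreCase(String regex, String candidate) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(candidate).matches();
    }

    // Child text: case-sensitive, may match any substring.
    static boolean containsMatch(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    // True if an element with the given tag, id and child text satisfies every
    // specified part of the query; a null part matches anything.
    static boolean matches(String tagRegex, String idRegex, String textRegex,
                           String tag, String id, String text) {
        return (tagRegex == null || fullMatchIgnoreCase(tagRegex, tag))
            && (idRegex == null || fullMatchIgnoreCase(idRegex, id))
            && (textRegex == null || containsMatch(textRegex, text));
    }

    public static void main(String[] args) {
        // Elements of hello.htm as {tag, id, child text}.
        String[][] elements = {
            {"p", "1", "hi"}, {"span", "2", "bonjour"}, {"div", "3", "hola"}, {"p", "4", "ahoj"}
        };
        for (String[] e : elements) {
            System.out.println(e[0] + " id=" + e[1]
                + "  <div|span>: " + matches("div|span", null, null, e[0], e[1], e[2])
                + "  <p id=1|4>: " + matches("p", "1|4", null, e[0], e[1], e[2])
                + "  < id=[2-6]>: " + matches(null, "[2-6]", null, e[0], e[1], e[2])
                + "  <p>ho: " + matches("p", null, "ho", e[0], e[1], e[2]));
        }
        // In Java source, the regex \s is written with a doubled backslash.
        System.out.println(fullMatchIgnoreCase("a\\sb", "a b"));
    }
}
```

Note that "hola" contains "ho" but belongs to a div, so the query <p>ho matches only the p whose text is "ahoj".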
Example 12: Filling-out form fields in sequence using the Document object.
    try{
      UserAgent userAgent = new UserAgent(); 
      userAgent.visit("http://jaunt-api.com/examples/signup.htm");
      
      userAgent.doc.apply(     //fill-out the form by applying a sequence of inputs
        "tom@mail.com",        //string input is applied to textfield
        "(advanced)",          //bracketed string (regular expression) selects a menu item
        "no comment",          //string input is applied to textarea
        1                      //integer specifies index of radiobutton choice
      );  
      userAgent.doc.submit("create trial account"); //press the submit button labelled 'create trial account'
      System.out.println(userAgent.getLocation());  //print the current location (url)
    }
    catch(JauntException e){ 
      System.out.println(e);
    }
Form manipulation can occur at several different levels. Using the Form component (see example 15) is a convenient way to fill-out and submit a specific form on a page (such as when a page contains more than one editable form). The present example uses the Document object to skip the step of identifying the form when there is only one editable form on the page. It allows the user to fill-out editable fields by specifying a sequence of input values. The input values are applied starting with the first field (eg textfield, radiobutton, checkbox, menu, etc), or starting at whichever field currently has focus (see the next example for altering focus).

On lines 5-10, the apply(Object ... args) method of the document is called, which can be used for filling-out any sequence of editable textfields, password fields, textareas, radiobuttons, checkboxes, menus, or file upload dialogues. In this case, the sequence of inputs has the following effect: it fills-out the textfield with tom@mail.com, selects the menu option that matches the regular expression (advanced), fills-out the textarea with the text no comment, and finally selects the radiobutton at index 1 (ie, the second radiobutton). Although not shown in this example, boolean values (true/false) can be used to check/uncheck checkboxes, the string "\t" can be used to skip the next field, and a File object can be used to specify a file for file-upload buttons. On line 11, the submit button is pressed, which submits the form that was filled out and on line 12, the url of the followup page is printed, using getLocation().

It is worth noting that Form objects also have their own apply(Object ... args) method. So in a similar fashion, a sequence of inputs can be applied to a specific form (ie, not necessarily the first one).
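A method that accepts Object ... args can decide what each input means by inspecting its runtime type. The sketch below shows that dispatch idea for the input kinds listed above; it is an assumption about how such a method could be organised, not Jaunt's implementation, and the describe helper and its output strings are made up for illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class ApplyDispatchSketch {

    // Translate a mixed sequence of inputs into a description of each form action.
    static List<String> describe(Object... inputs) {
        List<String> steps = new ArrayList<>();
        for (Object in : inputs) {
            if (in instanceof Boolean) {
                steps.add(((Boolean) in) ? "check checkbox" : "uncheck checkbox");
            } else if (in instanceof Integer) {
                steps.add("select radiobutton at index " + in);
            } else if (in instanceof File) {
                steps.add("upload file " + ((File) in).getName());
            } else if ("\t".equals(in)) {
                steps.add("skip field");
            } else if (in instanceof String
                    && ((String) in).startsWith("(") && ((String) in).endsWith(")")) {
                steps.add("select menu item matching regex " + in);
            } else {
                steps.add("type text: " + in);
            }
        }
        return steps;
    }

    public static void main(String[] args) {
        // Same input sequence as Example 12.
        for (String step : describe("tom@mail.com", "(advanced)", "no comment", 1)) {
            System.out.println(step);
        }
    }
}
```

The int literal 1 is autoboxed to an Integer when passed as an Object, which is why the instanceof Integer branch handles it.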

Example 13: Filling-out form fields by label with the Document object (textfields, password fields, checkboxes).
try{ 
  UserAgent userAgent = new UserAgent();  
  userAgent.visit("http://jaunt-api.com/examples/login.htm");

  userAgent.doc.fillout("Username:", "tom");       //fill out the component labelled 'Username:' with "tom"
  userAgent.doc.fillout("Password:", "secret");    //fill out the component labelled 'Password:' with "secret"
  userAgent.doc.choose(Label.RIGHT, "Remember me");//choose the component right-labelled 'Remember me'.
  userAgent.doc.submit();                          //submit the form
  System.out.println(userAgent.getLocation());     //print the current location (url)
}
catch(JauntException e){ 
  System.err.println(e);
}
This example illustrates using the Document object to fill-out/manipulate specific form fields (such as textfields, checkboxes, radiobuttons, etc) on the basis of how they are visibly labelled, whether or not the form is the first/only form on the page. Anytime a specific form field is filled-out, the focus automatically moves to the next visible field in the same form. Since focus automatically moves to the next field, the method apply(Object ... args) (see previous example) can be called to continue filling out the next/remaining fields, whether or not they are labelled.

On lines 5 and 6, the fillout(String, String) method of the document is called, which is used for filling out textfields, password fields, and textarea fields. The first argument is a case-insensitive and spacing-insensitive String used to match the text label to the left of the fields. The second argument is the value to be entered into the field.

On line 7, the choose(short, String) method is called, which can be used for choosing checkboxes and radiobuttons. The first parameter specifies the orientation of the label relative to the checkbox/radiobutton. The second parameter is a case-insensitive and spacing-insensitive String for matching the text of the label.

On line 8, the submit button is pressed, which submits the form. If there is more than one form on the page, the 'active' form is submitted. The active form is determined by the first field to be filled out. Once a particular form is active, attempting to fill out an input from a different form causes a NotFound Exception to be thrown. A MultipleFound Exception is thrown if the specified label text matches more than one label of the active form.

On line 9, the url of the followup page is printed, using getLocation().
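The label arguments to fillout and choose are described as case-insensitive and spacing-insensitive. One way to picture such a comparison is to normalise both strings before comparing them. The sketch below assumes that "spacing-insensitive" means whitespace differences are ignored; that reading, and the normalize/sameLabel helpers, are assumptions for illustration rather than Jaunt's actual matching code.

```java
import java.util.Locale;

class LabelMatchSketch {

    // Remove all whitespace and lower-case the label so that only its letters matter.
    static String normalize(String label) {
        return label.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
    }

    // Two labels are considered the same if their normalised forms are equal.
    static boolean sameLabel(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        System.out.println(sameLabel("Username:", "user name :"));     // true
        System.out.println(sameLabel("Remember me", "REMEMBER  ME"));  // true
        System.out.println(sameLabel("Password:", "Username:"));       // false
    }
}
```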

try{  //same steps, using fluent method invocation
  UserAgent userAgent = new UserAgent();      
  userAgent.visit("http://jaunt-api.com/examples/login.htm")  
    .fillout("Username:", "tom")     
    .fillout("Password:", "secret")  
    .choose(Label.RIGHT, "Remember me")
    .submit();                         
  System.out.println(userAgent.getLocation()); 
}
catch(JauntException e){ 
  System.err.println(e);
}
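The label arguments passed to fillout and choose are described above as case-insensitive and spacing-insensitive. The tutorial does not show how Jaunt implements this, so the following is only a stand-alone sketch of one plausible reading, in which all whitespace is ignored and letter case is folded before comparing. The class LabelMatchSketch and its methods are hypothetical and are not part of the Jaunt API.

```java
import java.util.Locale;

class LabelMatchSketch {
    // Hypothetical helper: drop all whitespace and fold to lower case.
    static String normalize(String label) {
        return label.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
    }

    // Two labels are treated as equal if their normalized forms are identical.
    static boolean sameLabel(String requested, String onPage) {
        return normalize(requested).equals(normalize(onPage));
    }

    public static void main(String[] args) {
        System.out.println(sameLabel("Username:", "  user name : ")); // differs only in case/spacing
        System.out.println(sameLabel("Password:", "PASSWORD:"));      // differs only in case
        System.out.println(sameLabel("Username:", "Password:"));      // different label text
    }
}
```

Under this assumption, "Username:" would match a page label written as "user name :" but not one written as "Password:".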
Example 14: Filling-out form fields by label with the Document object (select fields, textarea fields, radiobuttons)
try{ 
  UserAgent userAgent = new UserAgent();  
  userAgent.visit("http://jaunt-api.com/examples/signup.htm");
  Document doc = userAgent.doc;
  
  doc.fillout("E-mail:", "tom@mail.com");  //fill out the (textfield) component labelled "E-mail:"
  doc.choose("Account Type:", "advanced"); //choose "advanced" from the menu labelled "Account Type:"
  doc.fillout("Comments:", "no comment");  //fill out the (textarea) component labelled "Comments:"
  doc.choose(Label.RIGHT, "No thanks");    //choose the (radiobutton) component right-labelled "No thanks"
  doc.submit("create trial account");      //press the submit button labelled 'create trial account'
  System.out.println(userAgent.getLocation());  //print the current location (url)
}
catch(JauntException e){                   
  System.err.println(e);
}
Like the previous example, this example illustrates the document-level technique for filling out forms by targeting field labels.

The choose(String, String) method (line 7) is used to select menu items or menulist items, though it does not support making multiple selections from a menulist. The fillout(String, String) method (lines 6 and 8) has been previously covered, though here we see it used with a textarea field. The choose(short, String) method on line 9 is used to select a radio button.

On line 10, the form is submitted by specifying the label of the submit button (submit(String)). This functionality is important when the form has more than one submit button; otherwise, the method submit() would suffice. Specifying a label that does not match a submit button of the active form results in a NotFound Exception, which is a subclass of JauntException (caught on line 13). As previously noted, the active form is the form targeted by the first fillout/choose/select operation.

Example 15: Filling-out form fields by name with the Form object (textfields, password fields, checkboxes, select fields, textareas, radiobuttons).
<html>
Sign up:<br>
<form name="signup" action="http://jaunt-api.com/examples/signup2Response.htm">
  E-mail:<input type="text" name="email"><br>
  Password:<input type="password" name="pw"><br>
  Remember me <input type="checkbox" name="remember"><br>
  Account Type:<select name="account"><option>regular<option>advanced</select><br>
  Comments:<br><textarea name='comment'></textarea><br>
  <input type="radio" name="inform" value="yes" checked>Inform me of updates<br>
  <input type="radio" name="inform" value="no">No thanks<br>
  <input type="submit" name="action" value="create account">
  <input type="submit" name="action" value="create trial account">
</form>
</html>
try{ 
  UserAgent userAgent = new UserAgent();  
  userAgent.visit("http://jaunt-api.com/examples/signup2.htm");

  Form form = userAgent.doc.getForm(0);       //get the document's first Form
  form.setTextField("email", "tom@mail.com"); //or form.set("email", "tom@mail.com");
  form.setPassword("pw", "secret");           //or form.set("pw", "secret"); 
  form.setCheckBox("remember", true);         //or form.set("remember", "on");
  form.setSelect("account", "advanced");      //or form.set("account", "advanced"); 
  form.setTextArea("comment", "no comment");  //or form.set("comment", "no comment");
  form.setRadio("inform", "no");              //or form.set("inform", "no"); 
  form.submit("create trial account");        //click the submit button labelled 'create trial account'
  System.out.println(userAgent.getLocation());//print the current location (url)
}
catch(JauntException e){                    
  System.err.println(e);
}
This example illustrates manipulating a specific form on the page by using a Form component. Through the form component, each field can be accessed by its name (ie, the value of its name attribute). A form component can be obtained from the document object in a variety of ways, including by specifying the index of the form (as on line 5), or by using the Document's search methods to find a form on the basis of its button text or by using a search query, eg: userAgent.doc.getForm("<form name=signup>").

On lines 6-11, input fields of various types are identified by their (case-insensitive) names and filled out with specific values. The setSelect(String, String) operation on line 9 is used to set a dropdown menu or selection list to a single value; however, it can be called more than once to make multiple selections in a selection list where multiple selections are enabled. All the methods for setting values by name throw a NotFound Exception if the input field's name cannot be matched.

On line 12, the form object is submitted by specifying the label of the submit button (submit(String)). Specifying the submit button is important when the form has more than one submit button or when the submit button contributes a name-value pair required by the application; otherwise, the method submit() would suffice.
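To make the point about the submit button's name-value pair concrete, here is a minimal stand-alone sketch of how a query string changes depending on which submit button is pressed. It uses only the JDK and does not call Jaunt; the class SubmitPairSketch and the method buildQuery are hypothetical, and the exact request that Jaunt generates may differ in field order or encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class SubmitPairSketch {
    // Hypothetical helper: encode the filled-in fields, then append the pressed
    // submit button's name/value pair (skipped when submitName is null).
    static String buildQuery(Map<String, String> fields, String submitName, String submitValue) {
        Map<String, String> pairs = new LinkedHashMap<String, String>(fields);
        if (submitName != null) pairs.put(submitName, submitValue);
        StringBuilder query = new StringBuilder();
        for (Map.Entry<String, String> pair : pairs.entrySet()) {
            if (query.length() > 0) query.append('&');
            query.append(encode(pair.getKey())).append('=').append(encode(pair.getValue()));
        }
        return query.toString();
    }

    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("email", "tom@mail.com");
        fields.put("inform", "no");
        // Both buttons in the example form share the name "action" but differ in value.
        System.out.println(buildQuery(fields, "action", "create account"));
        System.out.println(buildQuery(fields, "action", "create trial account"));
        System.out.println(buildQuery(fields, null, null)); // no button pair at all
    }
}
```

Because both buttons in the example form are named "action", the server can only distinguish a regular signup from a trial signup by the value contributed by the pressed button.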

Example 16: Generating a form's request permutations.
<html>
<h1>Movie Search:</h1>
<form name="srch" action="http://jaunt-api.com/examples/searchResponse.htm">
  Movie Keyword:<input type="text" name="keyword"><br>
  Movie Genre:<select name="movieType"><option>Drama<option>Horror</select><br>
  Language:
  <input type='radio' name='lang' value='english'>English
  <input type='radio' name='lang' value='french'>French<br>
  <input type="submit" value="submit search">
</form>
</html>
UserAgent userAgent = new UserAgent();  
userAgent.visit("http://jaunt-api.com/examples/search.htm");

Form form = userAgent.doc.getForm("<form name=srch>");            //retrieve Form object by its name.
form.addPermutationTarget("keyword", new String[]{"cat", "dog"}); //specify search terms to permute through
form.addPermutationTarget("movieType");            //specify that movietype field will be permuted (all values)
form.addPermutationTarget("lang");                 //specify that lang field will be permuted (all values)
List<HttpRequest> requests = form.getRequestPermutations();       //generate list of request permutations
  
System.out.println("request permutations:");
for(HttpRequest request : requests){               //print each request permutation
  System.out.println(request.asUrl());
}  
This example illustrates how to automatically permute a form through different combinations of input in order to generate a list of requests. In this case, the form is a search interface for a movie database. Being able to generate a comprehensive list of request objects for the search form is a simpler and faster solution than laboriously changing each field one at a time after each form submission.

On line 4, the Form object is retrieved by the form's name, and in the next three lines (5-7) permutation targets are added. Each permutation target identifies a specific form field (by its name) and defines all the possible values through which it should be permuted. In some cases, the possible values are by definition embedded as part of the component (such as menus, radiobuttons, and checkboxes). In other cases (such as textfields, password fields, and textareas) the permutation values need to be specified, since there are an unlimited number of possible inputs. In such cases the permutation values are provided in a String array, as on line 5, where the textfield is set to be permuted through the search terms 'cat' and 'dog'.

On lines 11-13, each generated HttpRequest is printed as a URL (see output below). The UserAgent can directly accept the HttpRequest objects, using UserAgent.send(HttpRequest).

For additional control, the form can be redefined at the DOM level before defining permutation targets. For example, you may wish to remove the first entry of a dropdown menu, if it is simply a blank option, rather than have it generate a meaningless request permutation. Any DOM-level manipulation must occur before the form is acquired through the document's getForm(String) method.

Output:
http://jaunt-api.com/examples/searchResponse.htm?keyword=cat&movieType=Drama&lang=english
http://jaunt-api.com/examples/searchResponse.htm?keyword=cat&movieType=Drama&lang=french
http://jaunt-api.com/examples/searchResponse.htm?keyword=cat&movieType=Horror&lang=english
http://jaunt-api.com/examples/searchResponse.htm?keyword=cat&movieType=Horror&lang=french
http://jaunt-api.com/examples/searchResponse.htm?keyword=dog&movieType=Drama&lang=english
http://jaunt-api.com/examples/searchResponse.htm?keyword=dog&movieType=Drama&lang=french
http://jaunt-api.com/examples/searchResponse.htm?keyword=dog&movieType=Horror&lang=english
http://jaunt-api.com/examples/searchResponse.htm?keyword=dog&movieType=Horror&lang=french
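The eight URLs above are the Cartesian product of the three permutation targets (2 keywords x 2 genres x 2 languages). The following stand-alone sketch reproduces that list with plain JDK collections, to show the idea behind the permutation. It is not Jaunt's implementation: the class PermutationSketch and the method permute are hypothetical, and no URL-encoding is applied because the example values contain no special characters.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PermutationSketch {
    // Hypothetical helper: build one URL per combination of field values.
    // Fields are expanded in insertion order, so the last field varies fastest.
    static List<String> permute(String action, Map<String, String[]> targets) {
        List<String> queries = new ArrayList<String>();
        queries.add(""); // start with a single empty query string
        for (Map.Entry<String, String[]> target : targets.entrySet()) {
            List<String> extended = new ArrayList<String>();
            for (String prefix : queries) {
                for (String value : target.getValue()) {
                    String separator = prefix.isEmpty() ? "?" : "&";
                    extended.add(prefix + separator + target.getKey() + "=" + value);
                }
            }
            queries = extended;
        }
        List<String> urls = new ArrayList<String>();
        for (String query : queries) urls.add(action + query);
        return urls;
    }

    public static void main(String[] args) {
        Map<String, String[]> targets = new LinkedHashMap<String, String[]>();
        targets.put("keyword", new String[]{"cat", "dog"});      // values supplied by the caller
        targets.put("movieType", new String[]{"Drama", "Horror"}); // values taken from the <select>
        targets.put("lang", new String[]{"english", "french"});    // values taken from the radiobuttons
        for (String url : permute("http://jaunt-api.com/examples/searchResponse.htm", targets)) {
            System.out.println(url);
        }
    }
}
```

Adding a third keyword would raise the count from 8 to 12, which is why removing meaningless options (such as a blank first menu entry) before permuting can noticeably reduce the number of requests.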
Example 17: Table traversal
<html>
  <table class="stocks" border="1">
    <tr><td>MSFT</td><td>GOOG</td><td>APPL</td></tr>
    <tr><td>$31.58</td><td>$896.57</td><td>$465.25</td></tr>
  </table>
</html>
    try{
      UserAgent userAgent = new UserAgent(); 
      userAgent.visit("http://jaunt-api.com/examples/stocks.htm");
      
      Element table = userAgent.doc.findFirst("<table class=stocks>");  //find table element
      Elements tds = table.findEach("<td|th>");                         //find non-nested td/th elements
      for(Element td: tds){                                             //iterate through td/th's
        System.out.println(td.outerHTML());                             //print each td/th element
      }
    }
    catch(JauntException e){
      System.err.println(e);
    }
This example does not introduce any new concepts. Rather, it illustrates a technique for traversing a table. The findEach(String) method is used to collect every non-nested td/th descendant of the table element (line 6). The parameter "<td|th>" is a query that uses the regular expression td|th to match the tagname.
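As a rough illustration of how a regular expression such as td|th can select tagnames, the sketch below filters a list of tagnames with java.util.regex. It assumes the expression must match the whole tagname and that case is ignored; the tutorial does not spell out either detail, and the class TagNameQuerySketch is hypothetical rather than part of Jaunt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class TagNameQuerySketch {
    // Hypothetical helper: keep only the tagnames that the regex matches in full.
    static List<String> filter(String tagRegex, String... tagNames) {
        Pattern pattern = Pattern.compile(tagRegex, Pattern.CASE_INSENSITIVE);
        List<String> matched = new ArrayList<String>();
        for (String name : tagNames) {
            if (pattern.matcher(name).matches()) matched.add(name);
        }
        return matched;
    }

    public static void main(String[] args) {
        // "thead" starts with "th" but is not matched, because the whole name must match.
        System.out.println(filter("td|th", "tr", "td", "th", "thead", "TD"));
    }
}
```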
Example 18: Table data extraction using the Table component.
try{
  UserAgent userAgent = new UserAgent(); 
  userAgent.visit("http://jaunt-api.com/examples/schedule.htm");
  Table table = userAgent.doc.getTable("<table class=schedule>");   //get Table component via search query
      
  System.out.println("\nColumn having 'Mon':");
  Elements elements = table.getCol("mon");                                  //get entire column containing 'Mon'
  for(Element element : elements) System.out.println(element.outerHTML());  //iterate through & print elements
      
  System.out.println("\nColumn below 'Tue':");                              
  elements = table.getColBelow("tue");                                      //get column elements below 'Tue'
  for(Element element : elements) System.out.println(element.outerHTML());  //iterate through & print elements
      
  System.out.println("\nFirst row:");
  elements = table.getRow(0);                                               //get row at row index 0.
  for(Element element : elements) System.out.println(element.outerHTML());  //iterate through & print elements
      
  System.out.println("\nRow right of '2:00pm':");
  elements = table.getRowRightOf("2:00pm");                                 //get row elements right of 2:00pm
  for(Element element : elements) System.out.println(element.outerHTML());  //iterate through & print elements
      
  System.out.println("\nCell for fri at 10:00am:");                        
  Element element = table.getCell("fri", "10:00am");             //get element at intersection of col/row
  System.out.println(element.outerHTML());                       //print element
      
  System.out.println("\nCell at position 3,3:");
  element = table.getCell(3,3);                                  //get element at col index 3, row index 3
  System.out.println(element.outerHTML());                       //print element
}
catch(JauntException e){
  System.err.println(e);
}
This example illustrates most of the data-extraction methods of the Table component. The Table component makes it easy to extract the td/th elements for a particular row, column, or cell, or for the segment of a row/column relative to a cell containing specific text.

On line 4, a table component is acquired via an element query (queries are covered in the examples on search methods). It can also be acquired by calling Document's getTable(int) method, which takes the table index as an argument (indexing starts at zero and refers to non-nested tables).

Several methods in this example accept a regular expression for matching the text within a particular cell (td/th element). These regular expressions are matched in a case-insensitive way against the innerText() of the td/th elements. In cases where more than one td/th matches the regular expression, the first encountered cell constitutes the match, where the table is processed row by row, left to right, top to bottom.
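The first-match rule can be pictured with a small stand-alone sketch that scans a grid of cell texts row by row, left to right, and then collects the cells below the match. The grid contents, the class CellScanSketch and its methods are all hypothetical, and the sketch assumes the regular expression must match the entire cell text, which the tutorial does not state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class CellScanSketch {
    // Returns {rowIndex, colIndex} of the first matching cell, or null if none matches.
    // Rows are scanned top to bottom, and cells within a row left to right.
    static int[] findFirstCell(String[][] cells, String regex) {
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        for (int row = 0; row < cells.length; row++) {
            for (int col = 0; col < cells[row].length; col++) {
                if (pattern.matcher(cells[row][col]).matches()) return new int[]{row, col};
            }
        }
        return null;
    }

    // Collects the cells in the same column, below the first matching cell.
    static List<String> columnBelow(String[][] cells, String regex) {
        List<String> below = new ArrayList<String>();
        int[] hit = findFirstCell(cells, regex);
        if (hit == null) return below;
        for (int row = hit[0] + 1; row < cells.length; row++) {
            if (hit[1] < cells[row].length) below.add(cells[row][hit[1]]);
        }
        return below;
    }

    public static void main(String[] args) {
        String[][] schedule = {              // made-up data, not the contents of schedule.htm
            {"",        "Mon",  "Tue"},
            {"9:00am",  "Math", "Art"},
            {"10:00am", "Tue",  "Gym"}       // a second "Tue" that is never reached
        };
        System.out.println(columnBelow(schedule, "tue"));
    }
}
```

In this grid the header cell "Tue" in the first row is found before the later "Tue" in the third row, so the column below it is the one that is returned.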

Example 19: Using the HTML cache.
try{
  UserAgent userAgent = new UserAgent(); 
  String url = "http://northernbushcraft.com";
  userAgent.setCacheEnabled(true);         //caching turned on
  userAgent.visit(url);                    //cache empty, so HTML page requested via http & saved in cache.
  userAgent.visit(url);                    //when revisiting, page pulled from filesystem cache - no http request.
  System.out.println(userAgent.response);  //response object shows that content was cached, no response headers
      
  userAgent.setCacheEnabled(false);        //caching turned off
  userAgent.visit(url);                    //page is once again retrieved via http request. 
  System.out.println(userAgent.response);  //print response object, which now shows response headers
}
catch(JauntException e){
  System.err.println(e);
}
This example illustrates using the default HTML cache. The cache saves the original HTML/XHTML source of webpages locally in a directory called jaunt_cache, which is located in the directory specified by UserAgentSettings.outputPath (see Example 2 for accessing/altering settings).

The purpose of the HTML cache is to provide a means of accessing frequently-required HTML pages from local storage rather than repeatedly making HTTP requests. When screen-scraping data from a large website, it's common to run your program multiple times while refining/testing the scraping algorithm. By enabling caching, you can avoid repeatedly hitting the webserver, since the UserAgent will first check the cache to see whether the document is available locally.

To use the HTML cache you must first enable it, as seen on line 4. Disabling the cache (as seen on line 9) reverts the UserAgent to making HTTP requests rather than pulling content from the cache. However, disabling the cache does not delete the contents of the cache; the cache folder persists until it is manually deleted.
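The cache-first behaviour described above can be sketched without Jaunt as follows. This is only an illustration of the idea (check local storage first, fall back to a request, keep files when caching is switched off). The class HtmlCacheSketch, the Fetcher interface, and the file-naming scheme are hypothetical, and a stub stands in for the real HTTP request so the sketch runs offline.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class HtmlCacheSketch {
    interface Fetcher { String fetch(String url) throws IOException; } // stands in for an HTTP request

    private final Path cacheDir;
    private final Fetcher fetcher;
    private boolean cacheEnabled = false;

    HtmlCacheSketch(Path cacheDir, Fetcher fetcher) {
        this.cacheDir = cacheDir;
        this.fetcher = fetcher;
    }

    void setCacheEnabled(boolean enabled) { cacheEnabled = enabled; } // never deletes cached files

    String visit(String url) throws IOException {
        // Hypothetical naming scheme; a real cache would need a collision-free key.
        Path file = cacheDir.resolve(Integer.toHexString(url.hashCode()) + ".html");
        if (cacheEnabled && Files.exists(file)) {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8); // cache hit
        }
        String html = fetcher.fetch(url);                                        // cache miss or cache off
        if (cacheEnabled) {
            Files.createDirectories(cacheDir);
            Files.write(file, html.getBytes(StandardCharsets.UTF_8));
        }
        return html;
    }

    // Returns {fetches while enabled, total fetches, files left in the cache directory}.
    static int[] runDemo() {
        try {
            Path dir = Files.createTempDirectory("cache_sketch");
            final int[] fetches = {0};
            HtmlCacheSketch agent = new HtmlCacheSketch(dir, url -> {
                fetches[0]++;
                return "<html>stub for " + url + "</html>";
            });
            agent.setCacheEnabled(true);
            agent.visit("http://example.com/"); // fetched, then stored
            agent.visit("http://example.com/"); // served from the cache
            int whileEnabled = fetches[0];
            agent.setCacheEnabled(false);
            agent.visit("http://example.com/"); // fetched again, cache untouched
            return new int[]{whileEnabled, fetches[0], dir.toFile().list().length};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] result = runDemo();
        System.out.println("fetches while enabled: " + result[0]);
        System.out.println("total fetches: " + result[1]);
        System.out.println("cached files remaining: " + result[2]);
    }
}
```

In the demo, the second visit is answered from disk, the third visit triggers a new fetch because caching is off, and the cached file is still present afterwards.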

Continue to Extra Topics
