
Jaunt Webscraping Tutorial - Quickstart

Overview of Webscraping with Jaunt

The Jaunt package contains the class UserAgent, which represents a headless browser. When the UserAgent loads an HTML or XML page, it creates a Document object. The Document object exposes the content as a tree of Nodes, such as Element objects, Text objects, and Comment objects. For example, an HTML document has the following tree structure: it begins with the <html> Element, whose child nodes are the <head> and <body> Elements. Each Element contains zero or more attributes (such as <body class='foo'>) and zero or more child Nodes. The children can be Text Nodes, Comment Nodes, or other Elements.

When creating the Document, the UserAgent may encounter malformed HTML/XML. The parser automatically corrects malformed tags and converts relative urls to absolute urls but does not otherwise alter the document structure.
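To make the relative-to-absolute conversion concrete, here is a minimal sketch of what resolving an href against a page url means. It uses the JDK's java.net.URI rather than Jaunt's parser, so it illustrates the idea and is not Jaunt's actual implementation; the class name, method name, and example.com urls are hypothetical.

```java
import java.net.URI;

// Illustrative only: relies on the JDK's java.net.URI, not on Jaunt's internal parser.
class UrlResolveSketch {

    // Resolves a possibly relative href against the url of the page it appeared on.
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/docs/index.html";    // hypothetical page url
        System.out.println(toAbsolute(page, "faq.html"));        // sibling document
        System.out.println(toAbsolute(page, "../img/logo.png")); // parent-relative path
        System.out.println(toAbsolute(page, "/download"));       // host-relative path
        System.out.println(toAbsolute(page, "http://other.example.org/x")); // already absolute
    }
}
```

With the page url above, the four hrefs resolve to http://example.com/docs/faq.html, http://example.com/img/logo.png, http://example.com/download, and http://other.example.org/x respectively.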

In addition to exposing the DOM, the Document also provides high-level utility classes for webscraping. For example, the classes Form and Table provide convenience methods for submitting forms and extracting data from tables. For lower-level work, the UserAgent provides access to HTTP Requests and Responses and the ability to manage cookies.

To begin using Jaunt, download and extract the zip file. The zip file contains the licensing agreement, javadocs documentation, example files, release notes, and a jar file (Java 1.6). Include the jar file in your classpath/project, at which point you will be able to recompile and/or run the example files. If your IDE supports it, configure javadoc integration (eg, in your Eclipse project, configure java build path, expand the entry for the jar file, select the path to the javadocs folder).

Example 1: Create a UserAgent, visit a url, print the HTML.
try{
  UserAgent userAgent = new UserAgent();                       //create new userAgent (headless browser).
  userAgent.visit("http://oracle.com");                        //visit a url   
  System.out.println(userAgent.doc.innerHTML());               //print the document as HTML
}
catch(JauntException e){         //if an HTTP/connection error occurs, handle JauntException.
  System.err.println(e);
}
    
This example illustrates creating a UserAgent, visiting a webpage, and printing the document as HTML.

When the userAgent visits a url (line 3), it creates a Document object (userAgent.doc) to represent the parsed content, whether it's HTML, XHTML, or XML. On line 4, the document content is printed as HTML. In order to print the document as XML instead, the method innerXML() can be called. Malformed tags and/or missing closing tags are automatically corrected, so the printed output may not be identical to the original/unparsed content. The UserAgent method getSource() provides the original/unaltered source.

Example 2: UserAgent settings, searching using findFirst.
try{
  UserAgent userAgent = new UserAgent();                       //create new userAgent (headless browser).
  System.out.println("SETTINGS:\n" + userAgent.settings);      //print the userAgent's default settings.
  userAgent.settings.autoSaveAsHTML = true;                    //change settings to autosave last visited page.

  userAgent.visit("http://oracle.com");                        //visit a url.
  String title = userAgent.doc.findFirst("<title>").getChildText(); //get child text of title element.
  System.out.println("\nOracle's website title: " + title);    //print the title 

  userAgent.visit("http://amazon.com");                        //visit another url.
  title = userAgent.doc.findFirst("<title>").getChildText();   //get child text of title element.
  System.out.println("\nAmazon's website title: " + title);    //print the title  
}
catch(JauntException e){   //if title element isn't found or HTTP/connection error occurs, handle JauntException.
  System.err.println(e);          
}
This example illustrates visiting two urls, in each case extracting and printing the title of the webpage.

On line 4 autosaving is enabled, which means that anytime a page is visited it will be autosaved as LAST_VISITED.html in the directory specified by settings.outputPath. The related setting autoSaveAsXML can be used to save the document in XML format rather than HTML. Autosaving is useful in development and debugging, since LAST_VISITED.html can be checked with Chrome/Firefox/etc to examine the DOM structure. See UserAgentSettings for all the settings of a UserAgent.

The document's findFirst(String) method (lines 7 and 11) accepts a tagQuery that (in simple cases) resembles an HTML tag, and searches the document tree until it finds a matching element. Note that the query "<title>" will match any Element whose tagname is title (case insensitive), whether or not the element has additional attributes. As we'll see in later examples, the tagname portion of the query is actually a regular expression, which provides a powerful syntax for pattern matching. For example, the query "<h(1|2)>" would match any h1 or h2 tag. Example 11 provides a full account of the tagQuery syntax.

Note that in this example we catch JauntException, which is the superclass of all other Jaunt-related Exceptions. Later examples will demonstrate handling HTTP/connection errors separately from search-related errors.

Example 3: Opening HTML from a String, retrieving an Element's text.
try{
  UserAgent userAgent = new UserAgent();                        
  //open HTML from a String.
  userAgent.openContent("<html><body>WebPage <div>Hobbies:<p>beer<p>skiing</div> Copyright 2013</body></html>");
  Element body = userAgent.doc.findFirst("<body>");
  Element div = body.findFirst("<div>");
  
  System.out.println("body's childtext: " + body.getChildText());   //join child text of body element
  System.out.println("-----------");
  System.out.println("all body's text: " + body.getTextContent());  //join all text within body element
  System.out.println("-----------");
  System.out.println("div's child text: " + div.getChildText());    //join child text of div element
  System.out.println("-----------");
  System.out.println("all div's text: " + div.getTextContent());    //join all text within the div element
}
catch(JauntException e){                          
  System.err.println(e);          
}
This example illustrates opening HTML content from a String, searching for specific elements, and then retrieving the text of those elements.

On line 6 we see that the findFirst(String) method can be invoked on an Element (or the Document as on the previous line). When invoked on an Element, the search is restricted to that Element's descendants. The findFirst(String) method actually belongs to the class Element, and is inherited by Document.

On lines 8 - 14 we see that the getChildText() method returns a String concatenation (joining) of the text children of the element, whereas the getTextContent() method returns a String concatenation of all Text descendants. If an element does not contain any text, either method will return an empty String.

The variation getTextContent(String, boolean, boolean) accepts additional parameters for inserting a delimiter between concatenated texts, specifying whether to exclude script tag text, and/or indicating whether HTML/XHTML entity references within the text should be replaced with their character equivalents (eg converting &amp; to &).
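The difference between joining child text and joining all descendant text can be sketched with the DOM classes that ship with the JDK. This is an illustration under assumptions, not Jaunt code: the Element and Node types below are org.w3c.dom classes (not Jaunt's), the helper names are hypothetical, and the sample markup is a well-formed version of the String used in this example. Jaunt's exact whitespace handling may differ.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: models "child text" vs "all text" with the JDK's W3C DOM, not with Jaunt.
class ChildTextSketch {

    // Well-formed version of the markup opened in Example 3.
    static final String SAMPLE =
        "<body>WebPage <div>Hobbies:<p>beer</p><p>skiing</p></div> Copyright 2013</body>";

    // Parses the XML and returns its root element; parse failures become unchecked exceptions.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Joins only the Text nodes that are direct children of the element.
    static String childText(Element element) {
        StringBuilder joined = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                joined.append(child.getNodeValue());
            }
        }
        return joined.toString();
    }

    // Joins the Text nodes of all descendants, in document order.
    static String allText(Element element) {
        return element.getTextContent();
    }

    public static void main(String[] args) {
        Element body = parseRoot(SAMPLE);
        Element div = (Element) body.getElementsByTagName("div").item(0);
        System.out.println("body's child text: [" + childText(body) + "]");
        System.out.println("all body's text:   [" + allText(body) + "]");
        System.out.println("div's child text:  [" + childText(div) + "]");
        System.out.println("all div's text:    [" + allText(div) + "]");
    }
}
```

In this model the body's child text skips everything inside the div, while the body's full text includes "Hobbies:", "beer" and "skiing".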

Example 4: Accessing an Element's attributes/properties
try{
  UserAgent userAgent = new UserAgent();    
  userAgent.visit("http://intel.com");
  
  Element anchor = userAgent.doc.findFirst("<a href>");            //find 1st anchor element with href attribute
  System.out.println("anchor element: " + anchor);                 //print the anchor element
  System.out.println("anchor's tagname: " + anchor.getName());     //print the anchor's tagname
  System.out.println("anchor's href attribute: " + anchor.getAt("href"));  //print the anchor's href attribute
  System.out.println("anchor's parent Element: " + anchor.getParent());    //print the anchor's parent element
  
  Element meta = userAgent.doc.findFirst("<head>").findFirst("<meta>");    //find 1st meta element in head section
  System.out.println("meta element: " + meta);                     //print the meta element
  System.out.println("meta's tagname: " + meta.getName());         //print the meta's tagname
  System.out.println("meta's content attribute: " + meta.getAt("content"));//print the meta's content attribute
  System.out.println("meta's parent Element: " + meta.getParent());        //print the meta's parent element
}
catch(JauntException e){              
  System.err.println(e);          
}
This example illustrates visiting a website, searching for specific elements, and printing various attributes and properties of those elements.

The tagQuery <a href> on line 5 specifies not only the tagname but also that the Element must contain an href attribute. As previously noted, the tagname portion of the query is a regular expression. The attributename, however, is not; it is matched as a case-insensitive String.

On lines 6, 9, 12 and 15, an Element's toString() method is implicitly called, which returns a String representation of the Element excluding its children. See Example 5 for how to obtain a String representation of an Element that does include its children.

On line 8, the getAt(String) method is called to retrieve the attribute value associated with the (case insensitive) attribute name href. If the anchor tag did not have an href attribute, calling getAt(String) would throw a NotFound Exception. The related method getAtString(String) differs in that it returns an empty String rather than throwing a NotFound Exception if the attribute value does not exist (not shown).

An example of chaining search methods can be seen on line 11, where the document is searched for the first head Element, which is subsequently searched for the first meta Element. In this case, the same result would be obtained by simply calling userAgent.doc.findFirst("<meta>"). However the latter search would be slower if no meta tag was present, since it would search the entire document rather than only searching the head section.

Example 5: Opening HTML from a file, accessing inner/outerHTML and inner/outerXML.
<html>
   <div class='colors'>redgreen</div> 
   <p>visit again soon!</p>
</html>
try{
  UserAgent userAgent = new UserAgent();
  userAgent.open(new File("path/to/colors.htm"));  //open the HTML (or XML) from a file
   
  Element div = userAgent.doc.findFirst("<div class=colors>");  //find first div whose class matches 'colors'  
  System.out.println("div's outerHTML():\n" + div.outerHTML());   
  System.out.println("-------------");    
  System.out.println("div's innerHTML():\n" + div.innerHTML());
  System.out.println("-------------");  
  System.out.println("div's outerXML(2):\n" + div.outerXML(2)); //2 extra spaces used per indent
  System.out.println("-------------");
  System.out.println("div's innerXML(2):\n" + div.innerXML(2)); //2 extra spaces used per indent
  System.out.println("-------------");
 
  //make some changes
  div.innerHTML("<h1>Presto!</h1>");          //replace div's content with different elements.
  System.out.println("Altered document as HTML:\n" + userAgent.doc.innerHTML());  //print the altered document.
}
catch(JauntException e){
   System.err.println(e);
}
This example illustrates opening HTML content from a local file, searching for specific elements, printing those elements as HTML/XML, then altering the HTML/XML content of an element, and finally printing the entire document. Before running this example, you will need to edit line 3 to point to the location of colors.htm on your local filesystem.

The query used on line 5 can be read as "find the first element whose tagname matches the case-insensitive regular expression div and which has an attribute whose name matches the case-insensitive String class, where the value of the attribute matches the case-insensitive regular expression colors" (colors being the class value used in colors.htm). Note that on line 5 the attribute value within the query is unquoted; quotes are optional.

Because the DOM represents HTML and XML using the same internal model, switching between the two when printing is a matter of output formatting. When using the indenting options (ie outerHTML(int) and outerXML(int)), whitespace characters are added to the output string in order to indent each node (including existing whitespace nodes), so this indenting whitespace will appear in addition to any that was already present.

The interchangeability between HTML and XML is also seen in the methods that require HTML or XML input, such as innerHTML(String) on line 16; such methods accept either format.

Example 6: Detecting HTTP errors and connection errors with the Response object.

try{
  UserAgent userAgent = new UserAgent();      
  userAgent.visit("http://oracle.com");       
  System.out.println("Response:\n" + userAgent.response);  //print response data
}
catch(ResponseException e){                                //catch HTTP/Connection error
  HttpResponse response = e.getResponse();                 //or check userAgent.response
  if(response != null){                                    //print response data field by field
    System.err.println("Requested url: " + response.getRequestedUrlMsg()); //print the requested url
    System.err.println("HTTP error code: " + response.getStatus());        //print HTTP error code
    System.err.println("Error message: " + response.getMessage());         //print HTTP status message
  }
  else{
    System.out.println("Connection error, no response!");
  }
} 
When the UserAgent attempts to visit a url, it's possible that the connection to the webserver will fail or that an HTTP error code will be returned. The HttpResponse object (userAgent.response) contains information about the webserver response. If no error occurs, UserAgent.response can be examined for details regarding the response, as on line 4.

If the connection fails or an HTTP error occurs, the UserAgent.visit(String) method will throw a ResponseException. The ResponseException also contains a reference to the response; however, in this case it's possible that the response is null (indicating that no response was received due to a connection error). Since the response object could be null, that possibility is checked on line 8 before invoking any of its methods for the printing steps. A simpler alternative would be to print the ResponseException e, which would show the same (and more) information (see next Example).

In some cases, a webserver response will redirect the UserAgent to visit another url. In the case of a sequence of redirected requests and responses, userAgent.response represents the most recent response in the chain.

Example 7: Handling HTTP errors, connection errors, and search Exceptions.
try{
  UserAgent userAgent = new UserAgent();  //find the first anchor having href, get href value (below)
  String firstAnchorUrl = userAgent.visit("http://amazon.com").findFirst("<a href>").getAt("href");
  userAgent.visit(firstAnchorUrl);                              //visit url
  System.out.println("location:" + userAgent.getLocation());    //print the current location (url).
}
catch(SearchException e){        //if an element or attribute isn't found, catch the exception.
  System.err.println(e);         //printing exception shows details regarding origin of error
}
catch(ResponseException e){      //in case of HTTP/Connection error, catch ResponseException
  System.err.println(e);         //printing exception shows HTTP error information or connection error
}

This example illustrates visiting a series of urls and handling the various types of errors that can occur. On line 3, the UserAgent visits http://amazon.com, searches for the first anchor that has an href attribute, and then returns the href value. Note that these steps can be chained together because the visit(String) method returns the Document. When working with hyperlinks, it can also be useful to take advantage of Document's convenience methods, such as getHyperlink(String) and findAttributeValues(String query) (not shown).

On line 7, any search-related errors are caught, which in this case would be a possible NotFound Exception, which is a subclass of SearchException.

On line 10, any HTTP and connection-related errors are caught. The ResponseException itself is printed here, which automatically shows information regarding the HttpResponse (or lack thereof) that caused the problem.

Example 8: Searching using findEach, iterating through search results.
try{
  UserAgent userAgent = new UserAgent();
  userAgent.visit("http://amazon.com");    

  Elements tables = userAgent.doc.findEach("<table>");       //find non-nested tables    
  System.out.println("Found " + tables.size() + " tables:");
  for(Element table : tables){                               //iterate through search results
    System.out.println(table.outerHTML() + "\n----\n");      //print each element and its contents
  }     
}
catch(ResponseException e){
  System.out.println(e);
}
This example demonstrates the findEach(tagQuery) method.

On line 5, the findEach method is invoked on the document, so it walks the document tree searching for any Elements that match the query "<table>". Any such elements are returned in an Elements object, which is a container for search results. The defining feature of the findEach search is that when it finds an element that matches the query, it does not search further into that element. So in this example, the findEach method only returns non-nested tables (ie, it does not include tables that occur within other tables).

Class Elements has convenience methods that make the search results themselves easily searchable. The search methods are similar to those already covered in class Element (eg, findFirst, findEach, etc).

One benefit of class Elements itself being searchable is that it allows searches to be easily chained together. A good way of thinking about class Elements is as an <#elements> tag in which each child is a single search result.

If the findEach(String) method does not locate any Elements that match the query, an empty Elements container is returned.

Example 9: Searching using findEvery vs. findEach
<html>
  <body>
    <div>vegetables</div>
    <div>fruits</div>
    <div class='meat'>
      Meats
      <div>chicken</div>
      <div>beef</div>
    </div>
    <div class='nut'>
      Nuts
      <div>peanuts</div>
      <div>walnuts</div>
    </div>
  </body>
</html>
try{
  UserAgent userAgent = new UserAgent(); 
  userAgent.visit("http://jaunt-api.com/examples/food.htm");
   
  Elements elements = userAgent.doc.findEvery("<div>");             //find all divs in the document
  System.out.println("Every div: " + elements.size() + " results"); //report number of search results.
   
  elements = userAgent.doc.findEach("<div>");                       //find all non-nested divs
  System.out.println("Each div: " + elements.size() + " results");  //report number of search results.
                                                                    //find non-nested divs within <div class='meat'>
  elements = userAgent.doc.findFirst("<div class=meat>").findEach("<div>"); 
  System.out.println("Meat search: " + elements.size() + " results");//report number of search results.
}
catch(JauntException e){
  System.err.println(e);
}
The findEvery method operates by examining all the descendants of an element (or of a document). Every Element that matches the tagQuery is added to the Elements container, which is returned by the method. As discussed in the previous example, class Elements is a container for search results that is itself searchable.

On line 5, the findEvery search is invoked on the document, so it retrieves every div Element in the document (eight divs). The findEach(String) method (line 8) retrieves only four divs, since it will not find the nested divs. The last findEach method (line 11) is not invoked on the document object but rather on a particular Element. It retrieves the two divs that are children of <div class='meat'>.

As with the findEach method, if the findEvery method does not find any Elements that match the tagQuery, an empty <#elements> container is returned (no Exception is thrown).
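The two traversal strategies can be sketched with the JDK's built-in DOM. This is a model of the behaviour described above, not Jaunt's implementation: the class and helper names are hypothetical, the Element type is org.w3c.dom.Element, matching is by plain tag name rather than by tagQuery, and the markup is a well-formed copy of the food example embedded as a String.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative only: models the "every" and "each" traversals with the JDK's W3C DOM.
class FindEachVsEverySketch {

    // Well-formed copy of the food markup used in this example.
    static final String FOOD =
        "<html><body>"
        + "<div>vegetables</div><div>fruits</div>"
        + "<div class='meat'>Meats<div>chicken</div><div>beef</div></div>"
        + "<div class='nut'>Nuts<div>peanuts</div><div>walnuts</div></div>"
        + "</body></html>";

    // Parses the XML and returns its root element; parse failures become unchecked exceptions.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // "Every": collects all matching descendants, including those nested inside other matches.
    static List<Element> findEvery(Element root, String tag) {
        List<Element> results = new ArrayList<>();
        collect(root, tag, true, results);
        return results;
    }

    // "Each": collects matching descendants but does not search inside an element that matched.
    static List<Element> findEach(Element root, String tag) {
        List<Element> results = new ArrayList<>();
        collect(root, tag, false, results);
        return results;
    }

    // Depth-first walk over the descendants of parent, in document order.
    private static void collect(Element parent, String tag, boolean searchInsideMatches, List<Element> out) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            Element child = (Element) n;
            boolean matches = child.getTagName().equalsIgnoreCase(tag);
            if (matches) {
                out.add(child);
            }
            if (!matches || searchInsideMatches) {
                collect(child, tag, searchInsideMatches, out);
            }
        }
    }

    public static void main(String[] args) {
        Element html = parseRoot(FOOD);
        System.out.println("Every div: " + findEvery(html, "div").size() + " results");
        System.out.println("Each div: " + findEach(html, "div").size() + " results");
        Element meat = findEvery(html, "div").get(2); // third div in document order is class='meat'
        System.out.println("Meat search: " + findEach(meat, "div").size() + " results");
    }
}
```

For the embedded markup, this model reports 8 results for the "every" search, 4 for the "each" search, and 2 for the search restricted to the meat div.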

Example 10: Searching using getElement and getEach, search method summary
<html>
  <body>
    <div>vegetables</div>
    <div>fruits</div>
    <div class='meat'>
      Meats
      <div>chicken</div>
      <div>beef</div>
    </div>
    <div class='nut'>
      Nuts
      <div>peanuts</div>
      <div>walnuts</div>
    </div>
  </body>
</html>
try{ 
  UserAgent userAgent = new UserAgent(); 
  userAgent.visit("http://jaunt-api.com/examples/food.htm");
  
  Element body = userAgent.doc.findFirst("<body>");                    //find body element
  Element element = body.getElement(2);                                //retrieve 3rd child element within the body.      
  System.out.println("result1: " + element);                           //print the element
   
  String text = body.getElement(3).getElement(0).getChildText();       //get text of 1st child of 4th child of body.
  System.out.println("result2: " + text);                              //print the text
   
  element = body.findFirst("<div class=meat>").getElement(1);          //retrieve 2nd child element of div
  System.out.println("result3: " + element.outerHTML());               //print the element and its content
   
  Elements elements = body.getEach("<div>");                           //get body's child divs
  System.out.println("result4 has " + elements.size() + " divs:\n");   //print the search results
  System.out.println(elements.innerHTML(2));                           //print elements, indenting by 2
}
catch(JauntException e){
  System.err.println(e);
}
This example illustrates a variety of search methods whose names begin with 'get', which indicates that they search only children (as opposed to 'find' methods, which search all descendants).

The getElement(int) method on line 6 retrieves the third child element of the body element (the index is zero-based). On line 9, several getElement(int) methods are chained together to create a path to the <div>peanuts</div> element. On lines 12-13 a similar technique is used to retrieve and print <div>beef</div>. On line 15 the getEach(String) method searches the child elements of <body> for div elements. The results of the search (four divs) are then printed. When reviewing the output, remember that each child of <#elements> constitutes a single search result.

Search Method Summary: a table of search methods
The following table summarizes the most important search methods covered in previous examples.
        First                      Each                      Every                      Scope
get     getFirst(String query)     getEach(String query)     --                         searches children only
find    findFirst(String query)    findEach(String query)    findEvery(String query)    searches children/descendants to any depth

First: searches for the first Element that matches the query; returns the Element or throws NotFound.
Each:  searches for matching, non-nested Elements, which are returned in an Elements container.
Every: searches for all matching Elements, which are returned in an Elements container.

Example 11: Searching with regular expressions.
<html>
  <body>
    <p id='1'>hi</p>
    <span id='2'>bonjour</span>
    <div id='3'>hola</div>
    <p id='4'>ahoj</p>
  </body>
</html>
      UserAgent userAgent = new UserAgent();  
      userAgent.visit("http://jaunt-api.com/examples/hello.htm");

      Elements elements = userAgent.doc.findEvery("<div|span>");//find every element whose tagname is div or span.
      System.out.println("results1:\n" + elements.innerHTML()); //print the search results

      elements = userAgent.doc.findEvery("<p id=1|4>");         //find every p element whose id is 1 or 4
      System.out.println("results2:\n" + elements.innerHTML()); //print the search results

      elements = userAgent.doc.findEvery("< id=[2-6]>");        //find every element (any name) with id from 2-6
      System.out.println("results3:\n" + elements.innerHTML()); //print the search results

      elements = userAgent.doc.findEvery("<p>ho");        //find every p whose joined child text contains 'ho' (regex)
      System.out.println("results4:\n" + elements.innerHTML()); //print the search results

      elements = userAgent.doc.findEvery("<p|div>^ho");   //find every p or div whose child text starts with 'ho'
      System.out.println("results5:\n" + elements.innerHTML()); //print the search results

      elements = userAgent.doc.findEvery("<p>^(hi|ahoj)");//find every p whose child text starts with 'hi' or 'ahoj'
      System.out.println("results6:\n" + elements.innerHTML()); //print the search results
This example illustrates using regular expressions within search queries. Note that in Java string literals, regular-expression backslashes must be doubled (eg "\\s"). A search query has the general form:
<tagnameRegex attributeName='attributeValueRegex'>childTextRegex
where multiple attributes are allowed. In order for the query to match against an element, all parts of the query (ie, the tagnameRegex, attribute name, attributeValueRegex and childTextRegex) must match if they are specified.
tagnameRegex:
If tagnameRegex is a whitespace character, it will match any tagname. Otherwise, the tagnameRegex will be treated as case-insensitive and be evaluated against entire tagnames (ie will not match substrings). The tagnameRegex must begin with either an alphabetical character or a round opening bracket, and may not contain whitespace (though it may contain \\s, which matches whitespace).
attributeName:
If no attributes are included in the query, the query will match any attributes in a candidate element (including one without attributes). Otherwise, the attributeName in the query is matched as a case-insensitive string, not as a regular expression.
attributeValueRegex:
If attributeValueRegex is not present, the attributeName in the query will be matched against candidate attributeNames irrespective of their attributeValues. If attributeValueRegex is present, it will be treated as case-insensitive and be evaluated against the entire corresponding attribute value (ie will not match substrings).
childTextRegex:
If childTextRegex is not present, the query will match any child text (including lack of text). Otherwise, childTextRegex will be evaluated against the concatenation of Text children of the Element. It's important to note that the childTextRegex is case sensitive and will match against substrings.
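The matching rules above (whole-string, case-insensitive matching for tagnames and attribute values; substring, case-sensitive matching for child text) can be tried out with java.util.regex. The sketch below only models those rules as described; it is not Jaunt's query engine, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative only: models the tagQuery matching rules with java.util.regex, not with Jaunt.
class TagQueryRegexSketch {

    // Tagnames: case-insensitive, and the regex must match the entire tagname.
    static boolean tagNameMatches(String tagnameRegex, String tagName) {
        return Pattern.compile(tagnameRegex, Pattern.CASE_INSENSITIVE).matcher(tagName).matches();
    }

    // Attribute values: case-insensitive, and the regex must match the entire value.
    static boolean attributeValueMatches(String attributeValueRegex, String value) {
        return Pattern.compile(attributeValueRegex, Pattern.CASE_INSENSITIVE).matcher(value).matches();
    }

    // Child text: case-sensitive, and a match anywhere in the text is sufficient.
    static boolean childTextMatches(String childTextRegex, String childText) {
        return Pattern.compile(childTextRegex).matcher(childText).find();
    }

    public static void main(String[] args) {
        System.out.println(tagNameMatches("div|span", "SPAN"));   // whole name, any case
        System.out.println(tagNameMatches("div|span", "spans"));  // substring is not enough
        System.out.println(tagNameMatches("h\\d", "h3"));         // backslash doubled in the Java literal
        System.out.println(attributeValueMatches("[2-6]", "26")); // must match the entire value
        System.out.println(childTextMatches("ho", "ahoj"));       // substring match
        System.out.println(childTextMatches("^ho", "ahoj"));      // anchored at start of text
        System.out.println(childTextMatches("HO", "hola"));       // case-sensitive
    }
}
```

Running it prints true, false, true, false, true, false, false, which matches the behaviour described for results1 through results6 in this example.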
Example 12: Filling-out form fields in sequence using Document.apply().
    try{
      UserAgent userAgent = new UserAgent(); 
      userAgent.visit("http://jaunt-api.com/examples/signup.htm");
      
      userAgent.doc.apply(     //fill-out the form by applying a sequence of inputs
        "tom@mail.com",        //string input is applied to textfield
        "(advanced)",          //bracketed string (regular expression) selects a menu item
        "no comment",          //string input is applied to textarea
        1                      //integer specifies index of radiobutton choice
      );  
      userAgent.doc.submit("create trial account"); //press the submit button labelled 'create trial account'
      System.out.println(userAgent.getLocation());  //print the current location (url)
    }
    catch(JauntException e){ 
      System.out.println(e);
    }
Form manipulation can occur at several different levels. Using the Form component (see example 15) is a convenient way to fill-out and submit a specific form on a page (such as when a page contains more than one editable form). The present example uses the Document object to skip the step of identifying the form when there is only one editable form on the page. It allows the user to fill-out editable fields by specifying a sequence of input values. The input values are applied starting with the first field (eg textfield, radiobutton, checkbox, menu, etc), or starting at whichever field currently has focus (see the next example for altering focus).

On lines 5-10, the apply(Object ... args) method of the document is called, which can be used for filling-out any sequence of editable textfields, password fields, textareas, radiobuttons, checkboxes, menus, or file upload dialogues. In this case, the sequence of inputs has the following effect: it fills-out the textfield with tom@mail.com, selects the menu option that matches the regular expression (advanced), fills-out the textarea with the text no comment, and finally selects the radiobutton at index 1 (ie, the second radiobutton). Although not shown in this example, boolean values (true/false) can be used to check/uncheck checkboxes, the string "\t" can be used to skip the next field, and a File object can be used to specify a file for file-upload buttons. On line 11, the submit button is pressed, which submits the form that was filled out, and on line 12 the url of the followup page is printed using getLocation().

It is worth noting that Form objects also have their own apply(Object ... args) method. So in a similar fashion, a sequence of inputs can be applied to a specific form (ie, not necessarily the first one).

Example 13: Filling-out form fields by label with the Document object (textfields, password fields, checkboxes).
try{ 
  UserAgent userAgent = new UserAgent();  
  userAgent.visit("http://jaunt-api.com/examples/login.htm");

  userAgent.doc.filloutField("Username:", "tom");          //fill the field labelled 'Username:' with "tom"
  userAgent.doc.filloutField("Password:", "secret");       //fill the field labelled 'Password:' with "secret"
  userAgent.doc.chooseCheckBox("Remember me", Label.RIGHT);//choose the component right-labelled 'Remember me'
  userAgent.doc.submit();                                  //submit the form
  System.out.println(userAgent.getLocation());             //print the current location (url)
}
catch(JauntException e){ 
  System.err.println(e);
}
This example illustrates using the Document object to fill-out/manipulate specific form fields (such as textfields, checkboxes, radiobuttons, etc) on the basis of how they are visibly labelled, whether or not the form is the first/only form on the page. Anytime a specific form field is filled-out, the focus automatically moves to the next visible field in the same form. Since focus automatically moves to the next field, the method apply(Object ... args) (see previous example) can be called to continue filling out the next/remaining fields, whether or not they are labelled.

On lines 5 and 6, the filloutField(String, String) method of the document is called, which is used for filling out textfields, password fields, and textarea fields. The first argument is a case-insensitive and spacing-insensitive String used to match the text label to the left of the fields. The second argument is the value to be entered into the field.
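One way to picture case-insensitive and spacing-insensitive label matching is to normalize both Strings before comparing them. The sketch below is an assumption about what such matching could look like (all whitespace dropped, letters lower-cased); the exact rule Jaunt applies is not specified here, and the class and method names are hypothetical.

```java
import java.util.Locale;

// Illustrative only: one possible reading of "case-insensitive and spacing-insensitive" matching.
class LabelMatchSketch {

    // Assumption: spacing-insensitivity is modeled by removing all whitespace.
    static String normalize(String label) {
        return label.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
    }

    // Two labels match when their normalized forms are equal.
    static boolean labelsMatch(String query, String visibleLabel) {
        return normalize(query).equals(normalize(visibleLabel));
    }

    public static void main(String[] args) {
        System.out.println(labelsMatch("Username:", "  user name: ")); // differs only in case and spacing
        System.out.println(labelsMatch("Username:", "Password:"));     // different label text
    }
}
```

Under this assumption, "Username:" matches a label rendered as "  user name: " but not "Password:".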

On line 7, the chooseCheckBox(String, short) method is called to check the checkbox. The first parameter is a case-insensitive and spacing-insensitive String for matching the text of the label. The second parameter specifies the orientation of the label relative to the checkbox.

On line 8, the submit button is pressed, which submits the form. If there is more than one form on the page, the 'active' form is submitted. The active form is determined by the first field to be filled out. Once a particular form is active, attempting to fill out an input from a different form causes a NotFound Exception to be thrown. A MultipleFound Exception is thrown if the specified label text matches more than one label of the active form.

On line 9, the url of the followup page is printed, using getLocation().

try{  //same steps, using fluent method invocation
  UserAgent userAgent = new UserAgent();      
  userAgent.visit("http://jaunt-api.com/examples/login.htm")  
    .filloutField("Username:", "tom")     
    .filloutField("Password:", "secret")  
    .chooseCheckBox("Remember me", Label.RIGHT)
    .submit();                         
  System.out.println(userAgent.getLocation()); 
}
catch(JauntException e){ 
  System.err.println(e);
}
Example 14: Filling-out form fields by label with the Document object (select fields, textarea fields, radiobuttons)
try{ 
  UserAgent userAgent = new UserAgent();  
  userAgent.visit("http://jaunt-api.com/examples/signup.htm");
  Document doc = userAgent.doc;
  
  doc.filloutField("E-mail:", "tom@mail.com");     //fill out the (textfield) component labelled "E-mail:"
  doc.chooseMenuItem("Account Type:", "advanced"); //choose "advanced" from the menu labelled "Account Type:"
  doc.filloutField("Comments:", "no comment");     //fill out the (textarea) component labelled "Comments:"
  doc.chooseRadioButton("No thanks", Label.RIGHT); //choose the (radiobutton) component right-labelled "No thanks"
  doc.submit("create trial account");              //press the submit button labelled 'create trial account'
  System.out.println(userAgent.getLocation());     //print the current location (url)
}
catch(JauntException e){                   
  System.out.println(e);
}
As in the previous example, this example illustrates the document-level technique for filling out forms by field label.

The chooseMenuItem method (line 7) is used to select menu items or menulist items, though it does not support making multiple selections from a menulist. The filloutField method (lines 6 and 8) has been previously covered, though here we see it used with a textarea field. The chooseRadioButton method on line 9 is used to select a radio button.

On line 10, the form is submitted by specifying the label of the submit button (submit(String)). This functionality is important when the form has more than one submit button; otherwise, the method submit() would suffice. Specifying a label that does not match a submit button of the active form results in a NotFound Exception, which is a subclass of JauntException (caught on line 18). As previously noted, the active form is the form targeted by the first fillout/choose/select operation.

Example 15: Filling-out form fields by name with the Form object (textfields, password fields, checkboxes, menus, textareas, radiobuttons).
<html>
Sign up:<br>
<form name="signup" action="http://jaunt-api.com/examples/signup2Response.htm">
  E-mail:<input type="text" name="email"><br>
  Password:<input type="password" name="pw"><br>
  Remember me <input type="checkbox" name="remember"><br>
  Account Type:<select name="account"><option>regular<option>advanced</select><br>
  Comments:<br><textarea name='comment'></textarea><br>
  <input type="radio" name="inform" value="yes" checked>Inform me of updates<br>
  <input type="radio" name="inform" value="no">No thanks<br>
  <input type="submit" name="action" value="create account">
  <input type="submit" name="action" value="create trial account">
</form>
</html>
try{ 
  UserAgent userAgent = new UserAgent();  
  userAgent.visit("http://jaunt-api.com/examples/signup2.htm");

  Form form = userAgent.doc.getForm(0);       //get the document's first Form
  form.setTextField("email", "tom@mail.com"); //or form.set("email", "tom@mail.com");
  form.setPassword("pw", "secret");           //or form.set("pw", "secret"); 
  form.setCheckBox("remember", true);         //or form.set("remember", "on");
  form.setSelect("account", "advanced");      //or form.set("account", "advanced"); 
  form.setTextArea("comment", "no comment");  //or form.set("comment", "no comment");
  form.setRadio("inform", "no");              //or form.set("inform", "no"); 
  form.submit("create trial account");        //click the submit button labelled 'create trial account'
  System.out.println(userAgent.getLocation());//print the current location (url)
}
catch(JauntException e){                    
  System.err.println(e);
}
This example illustrates manipulating a specific form on the page by using a Form component. Through the form component, each field can be accessed by its name (ie, the value of its name attribute). A form component can be obtained from the document object in a variety of ways, including by specifying the index of the form (as on line 5), by using the Document's search methods to find a form on the basis of its button text, or by using a search query, eg: userAgent.doc.getForm("<form name=signup>").

On lines 6-11, input fields of various types are identified by their (case-insensitive) names and filled out with specific values. The setSelect(String, String) operation on line 9 sets a dropdown menu or selection list to a single value; however, it can be called more than once to make multiple selections in a selection list where multiple selections are enabled. All the methods for setting values by name throw a NotFound Exception if the input field's name cannot be matched.

On line 12, the form object is submitted by specifying the label of the submit button (submit(String)). Specifying the submit button is important when the form has more than one submit button or when the submit button contributes a name-value pair required by the application; otherwise, the method submit() would suffice.
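The point about the submit button contributing a name-value pair can be seen in a small standalone sketch. In standard HTML form submission, only the button that is actually pressed adds its name and value to the request. The class SubmitButtonSketch and its buildQuery method below are hypothetical illustrations and do not use Jaunt; the field names are taken from the form above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration (not Jaunt code): how a pressed submit button
// adds its own name-value pair to the submitted form data.
class SubmitButtonSketch {

  // Build a query string from the field values plus the pressed button (if any).
  static String buildQuery(Map<String, String> fields, String buttonName, String buttonValue) {
    Map<String, String> all = new LinkedHashMap<String, String>(fields);
    if (buttonName != null) all.put(buttonName, buttonValue); // only the pressed button is sent
    StringBuilder query = new StringBuilder();
    for (Map.Entry<String, String> entry : all.entrySet()) {
      if (query.length() > 0) query.append('&');
      query.append(encode(entry.getKey())).append('=').append(encode(entry.getValue()));
    }
    return query.toString();
  }

  // URL-encode a single name or value as UTF-8.
  static String encode(String text) {
    try {
      return URLEncoder.encode(text, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e); // UTF-8 is always supported
    }
  }

  // A few of the fields from the signup form, with the values used in the example.
  static Map<String, String> sampleFields() {
    Map<String, String> fields = new LinkedHashMap<String, String>();
    fields.put("email", "tom@mail.com");
    fields.put("inform", "no");
    return fields;
  }

  public static void main(String[] args) {
    // No button specified: the "action" pair is absent.
    System.out.println(buildQuery(sampleFields(), null, null));
    // Button specified: the server can tell which of the two buttons was pressed.
    System.out.println(buildQuery(sampleFields(), "action", "create trial account"));
  }
}
```

Because both submit buttons in the form share the name "action", the value is the only thing that tells the server which kind of account was requested.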

Example 16: Generating a form's request permutations.
<html>
<h1>Movie Search:</h1>
<form name="srch" action="http://jaunt-api.com/examples/searchResponse.htm">
  Movie Keyword:<input type="text" name="keyword"><br>
  Movie Genre:<select name="movieType"><option>Drama<option>Horror</select><br>
  Language:
  <input type='radio' name='lang' value='english'>English
  <input type='radio' name='lang' value='french'>French<br>
  <input type="submit" value="submit search">
</form>
</html>
UserAgent userAgent = new UserAgent();  
userAgent.visit("http://jaunt-api.com/examples/search.htm");

Form form = userAgent.doc.getForm("<form name=srch>");            //retrieve Form object by its name.
form.addPermutationTarget("keyword", new String[]{"cat", "dog"}); //specify search terms to permute through
form.addPermutationTarget("movieType");            //specify that movietype field will be permuted (all values)
form.addPermutationTarget("lang");                 //specify that lang field will be permuted (all values)
List<HttpRequest> requests = form.getRequestPermutations();       //generate list of request permutations
  
System.out.println("request permutations:");
for(HttpRequest request : requests){               //print the list of request permutations
  System.out.println(request.asUrl());
}  
This example illustrates how to automatically permute a form through different combinations of input in order to generate a list of requests. In this case, the form is a search interface for a movie database. Generating a comprehensive list of request objects for the search form is simpler and faster than laboriously changing each field one at a time after each form submission.

On line 4, the Form object is retrieved by the form's name, and in the next three lines (5-7) permutation targets are added. Each permutation target identifies a specific form field (by its name) and defines all the possible values through which it should be permuted. In some cases, the possible values are by definition embedded as part of the component (such as menus, radiobuttons, and checkboxes). In other cases (such as textfields, password fields, and textareas) the permutation values need to be specified, since there are an unlimited number of possible inputs. In such cases the permutation values are provided in a String array, as on line 5, where the textfield is set to be permuted through the search terms 'cat' and 'dog'.

On lines 11-13, each generated HttpRequest is printed as a URL (see output below). The UserAgent can directly accept the HttpRequest objects, using UserAgent.send(HttpRequest).

For additional control, the form can be redefined at the DOM level before defining permutation targets. For example, you may wish to remove the first entry of a dropdown menu, if it is simply a blank option, rather than have it generate a meaningless request permutation. Any DOM-level manipulation must occur before the form is acquired through the document's getForm(String) method.

Output:
http://jaunt-api.com/examples/searchResponse.htm?keyword=cat&movieType=Drama&lang=english
http://jaunt-api.com/examples/searchResponse.htm?keyword=cat&movieType=Drama&lang=french
http://jaunt-api.com/examples/searchResponse.htm?keyword=cat&movieType=Horror&lang=english
http://jaunt-api.com/examples/searchResponse.htm?keyword=cat&movieType=Horror&lang=french
http://jaunt-api.com/examples/searchResponse.htm?keyword=dog&movieType=Drama&lang=english
http://jaunt-api.com/examples/searchResponse.htm?keyword=dog&movieType=Drama&lang=french
http://jaunt-api.com/examples/searchResponse.htm?keyword=dog&movieType=Horror&lang=english
http://jaunt-api.com/examples/searchResponse.htm?keyword=dog&movieType=Horror&lang=french
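The eight URLs above are the cartesian product of the three permutation targets (2 keywords x 2 genres x 2 languages). The following standalone sketch reproduces that list without Jaunt, as a way of seeing what the permutation step amounts to. The class PermutationSketch and its methods are hypothetical, and the sketch skips URL-encoding because the sample values contain no special characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration (not Jaunt code): request permutations as a
// cartesian product of the values of each permutation target.
class PermutationSketch {

  // Combine every value of every target; the first target varies slowest.
  static List<String> permute(String action, Map<String, List<String>> targets) {
    List<String> queries = new ArrayList<String>();
    queries.add(""); // start with a single empty query string
    for (Map.Entry<String, List<String>> target : targets.entrySet()) {
      List<String> extended = new ArrayList<String>();
      for (String prefix : queries) {
        for (String value : target.getValue()) {
          String pair = target.getKey() + "=" + value;
          extended.add(prefix.isEmpty() ? pair : prefix + "&" + pair);
        }
      }
      queries = extended; // each pass multiplies the list by the target's value count
    }
    List<String> urls = new ArrayList<String>();
    for (String query : queries) urls.add(action + "?" + query);
    return urls;
  }

  // The targets of the movie search form, in the order they were added.
  static List<String> movieSearchUrls() {
    Map<String, List<String>> targets = new LinkedHashMap<String, List<String>>();
    targets.put("keyword", Arrays.asList("cat", "dog"));        // values supplied by the caller
    targets.put("movieType", Arrays.asList("Drama", "Horror")); // values embedded in the select
    targets.put("lang", Arrays.asList("english", "french"));    // values embedded in the radiobuttons
    return permute("http://jaunt-api.com/examples/searchResponse.htm", targets);
  }

  public static void main(String[] args) {
    for (String url : movieSearchUrls()) System.out.println(url);
  }
}
```

The number of requests grows multiplicatively with each added target, which is one reason to remove meaningless options (such as a blank menu entry) before generating permutations.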
Example 17: Table traversal
<html>
  <table class="stocks" border="1">
    <tr><td>MSFT</td><td>GOOG</td><td>APPL</td></tr>
    <tr><td>$31.58</td><td>$896.57</td><td>$465.25</td></tr>
  </table>
</html>
    try{
      UserAgent userAgent = new UserAgent(); 
      userAgent.visit("http://jaunt-api.com/examples/stocks.htm");
      
      Element table = userAgent.doc.findFirst("<table class=stocks>");  //find table element
      Elements tds = table.findEach("<td|th>");                         //find non-nested td/th elements
      for(Element td: tds){                                             //iterate through td/th's
        System.out.println(td.outerHTML());                             //print each td/th element
      }
    }
    catch(JauntException e){
      System.err.println(e);
    }
This example does not introduce any new concepts. Rather, it illustrates a technique for traversing a table. The findEach(String) method is used to collect every non-nested td/th descendant of the table element (line 6). The parameter "<td|th>" is a query that uses the regular expression td|th to match the tagname.
Example 18: Table text extraction using the Table component.
try{
  UserAgent userAgent = new UserAgent();
  userAgent.visit("http://jaunt-api.com/examples/schedule.htm");
  Element tableElement = userAgent.doc.findFirst("<table class=schedule>");   //find table Element
  Table table = new Table(tableElement);                   //create Table component 

  System.out.println("\nText of first column:");                            
  List<String> results = table.getTextFromColumn(0);       //get text from first column
  for(String text : results) System.out.println(text);     //iterate through results & print     
      
  System.out.println("\nText of column containing 'Mon':");
  results = table.getTextFromColumn("Mon");                //get text from column containing 'Mon'
  for(String text : results) System.out.println(text);     //iterate through results & print 
		   
  System.out.println("\nText of first row:");
  results = table.getTextFromRow(0);                       //get text from first row
  for(String text : results) System.out.println(text);     //iterate through results & print 
    
  System.out.println("\nText of row containing '2:00pm':");
  results = table.getTextFromRow("2:00pm");                //get text from row containing '2:00pm'
  for(String text : results) System.out.println(text);     //iterate through results & print
      
  System.out.println("\nCreate Map of text from first two columns:");  
  Map<String, String> map = table.getTextFromColumns(0, 1);//create map containing text from cols 0 and 1
  for(String key : map.keySet()){                          //print keys (from col 0) and values (from col 1) 
    System.out.println(key + ":" + map.get(key));           
  }
}
catch(JauntException e){
  System.out.println(e);
}
This example illustrates the text-extraction methods of the Table component, which is a utility object that makes it easy to extract the text content of a particular row, column, or cell of a table. It can also be used to extract text from two columns into a Map, where one column constitutes the keys of the Map and the second column constitutes the values.
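The two-columns-into-a-Map idea can be pictured with plain Java collections. This is a minimal sketch rather than the Table component's implementation: the class ColumnsToMapSketch is hypothetical, the table rows are made-up sample data (not the contents of schedule.htm), and the use of an insertion-ordered map is an assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration (not Jaunt code): pairing the text of two table
// columns into a Map, row by row.
class ColumnsToMapSketch {

  // For each row, the cell at keyCol becomes the key and the cell at valueCol the value.
  static Map<String, String> textFromColumns(List<List<String>> rows, int keyCol, int valueCol) {
    Map<String, String> map = new LinkedHashMap<String, String>(); // keeps row order (assumption)
    for (List<String> row : rows) {
      if (keyCol < row.size() && valueCol < row.size()) {          // skip rows that are too short
        map.put(row.get(keyCol), row.get(valueCol));
      }
    }
    return map;
  }

  // Made-up sample rows: time, first day, second day.
  static List<List<String>> sampleRows() {
    return Arrays.asList(
        Arrays.asList("9:00am", "Math", "Art"),
        Arrays.asList("10:00am", "History", "Music"));
  }

  public static void main(String[] args) {
    Map<String, String> map = textFromColumns(sampleRows(), 0, 1);
    for (String key : map.keySet()) {
      System.out.println(key + ":" + map.get(key));
    }
  }
}
```

Note that with any Map-based approach, two rows with the same key text cannot both be kept; in this sketch the later row overwrites the earlier one.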

On line 4, the target table is located using the tagQuery <table class=schedule>; the table Element is then passed into the constructor for a Table component. Note that the Document object provides a number of alternative ways to create a Table component in a single step, including Document.getTable(String tagQuery) and Document.getTableByText(String... regex).

As you peruse each data extraction method, note that several of them accept a regular expression for matching the text within a particular cell (td/th element). These regular expressions are matched in a case-insensitive way against the inner text of the td/th elements (see Element.getTextContent()). Regular expressions are matched against text using Matcher.matches(), which performs whole-string matching as opposed to substring matching. In cases where more than one td/th matches the regular expression, the first matching cell is used, where the table is processed row by row, left to right, top to bottom.
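The difference between whole-string and substring matching is easy to trip over, so here is a short sketch using java.util.regex directly. The class CellRegexSketch and the sample cell texts are hypothetical; only Pattern and Matcher are standard Java.

```java
import java.util.regex.Pattern;

// Hypothetical illustration: whole-string matching (Matcher.matches) versus
// substring matching (Matcher.find), both case-insensitive.
class CellRegexSketch {

  // True only if the regex matches the entire cell text.
  static boolean matchesWhole(String regex, String cellText) {
    return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(cellText).matches();
  }

  // True if the regex matches anywhere inside the cell text.
  static boolean matchesAnywhere(String regex, String cellText) {
    return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(cellText).find();
  }

  public static void main(String[] args) {
    System.out.println(matchesWhole("mon", "Mon"));       // true: same text apart from case
    System.out.println(matchesWhole("mon", "Monday"));    // false: "day" is left unmatched
    System.out.println(matchesAnywhere("mon", "Monday")); // true: substring match
    System.out.println(matchesWhole("mon.*", "Monday"));  // true: pattern covers the whole text
  }
}
```

In practice this means that if a header cell reads "Monday", a pattern such as "mon.*" is needed rather than "mon".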

Example 19: Table cell extraction using the Table component.
try{
  UserAgent userAgent = new UserAgent(); 
  userAgent.visit("http://jaunt-api.com/examples/schedule.htm");
  Table table = userAgent.doc.getTable("<table class=schedule>");   //get Table component via search query
      
  System.out.println("\nColumn having 'Mon':");
  Elements elements = table.getCol("mon");                                  //get entire column containing 'Mon'
  for(Element element : elements) System.out.println(element.outerHTML());  //iterate through & print elements
      
  System.out.println("\nColumn below 'Tue':");                              
  elements = table.getColBelow("tue");                                      //get column elements below 'Tue'
  for(Element element : elements) System.out.println(element.outerHTML());  //iterate through & print elements
      
  System.out.println("\nFirst row:");
  elements = table.getRow(0);                                               //get row at row index 0.
  for(Element element : elements) System.out.println(element.outerHTML());  //iterate through & print elements
      
  System.out.println("\nRow right of '2:00pm':");
  elements = table.getRowRightOf("2:00pm");                                 //get row elements right of 2:00pm
  for(Element element : elements) System.out.println(element.outerHTML());  //iterate through & print elements
      
  System.out.println("\nCell for fri at 10:00am:");                        
  Element element = table.getCell("fri", "10:00am");             //get element at intersection of col/row
  System.out.println(element.outerHTML());                       //print element
      
  System.out.println("\nCell at position 3,3:");
  element = table.getCell(3,3);                                  //get element at col index 3, row index 3
  System.out.println(element.outerHTML());                       //print element
}
catch(JauntException e){
  System.err.println(e);
}
This example illustrates the cell-extraction methods of the Table component. The Table component makes it easy to extract the td/th elements for a particular row, column, or cell, or for the segment of a row/column relative to a cell containing specific text.

On line 4, a table component is acquired via an element query (queries are covered in the examples on search methods). It can also be acquired by calling Document's getTable(int) method, which takes the table index as an argument (indexing starts at zero and refers to non-nested tables).

Several methods in this example accept a regular expression for matching the text within a particular cell (td/th element). These regular expressions are matched in a case-insensitive way against the innerText() of the td/th elements. In cases where more than one td/th matches the regular expression, the first encountered cell will constitute the match, where the table is processed row by row, left to right, top to bottom.
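The "first encountered cell" rule and the col/row intersection lookup can be sketched over a plain two-dimensional array. This is an illustration of the described behaviour, not the Table component's code: the class CellLookupSketch, its methods, and the sample grid are all hypothetical, and the grid deliberately contains the text "Fri" twice.

```java
import java.util.regex.Pattern;

// Hypothetical illustration (not Jaunt code): finding the first matching cell
// in row-major order and looking up the cell at a column/row intersection.
class CellLookupSketch {

  // Scan row by row, left to right; return {row, col} of the first match, or null.
  static int[] findFirst(String[][] grid, String regex) {
    Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    for (int row = 0; row < grid.length; row++) {
      for (int col = 0; col < grid[row].length; col++) {
        if (pattern.matcher(grid[row][col]).matches()) return new int[]{row, col};
      }
    }
    return null;
  }

  // The column comes from the first cell matching colRegex, the row from the
  // first cell matching rowRegex; the result is the cell where they intersect.
  static String cellAt(String[][] grid, String colRegex, String rowRegex) {
    int[] colHit = findFirst(grid, colRegex);
    int[] rowHit = findFirst(grid, rowRegex);
    if (colHit == null || rowHit == null) return null;
    return grid[rowHit[0]][colHit[1]];
  }

  // Made-up sample data; "Fri" appears twice on purpose.
  static String[][] sampleGrid() {
    return new String[][]{
        {"Time",    "Thu", "Fri"},
        {"10:00am", "Gym", "Chess"},
        {"2:00pm",  "Lab", "Fri"}
    };
  }

  public static void main(String[] args) {
    int[] hit = findFirst(sampleGrid(), "fri");
    System.out.println("first 'fri' at row " + hit[0] + ", col " + hit[1]); // the header cell wins
    System.out.println(cellAt(sampleGrid(), "fri", "10:00am"));
  }
}
```

Because the header row is scanned first, the duplicate "Fri" in the last row never affects which column is selected.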

Example 20: Pagination Discovery
try{
  UserAgent userAgent = new UserAgent();    
  userAgent.visit("https://google.com");                 //visit google.com
  userAgent.doc.apply("seashells").submit();             //apply search term and submit form

  Hyperlink nextPageLink = userAgent.doc.nextPageLink(); //get hyperlink to next page of results
  nextPageLink.follow();                                 //visit next page (p 2).
  System.out.println("location: " + userAgent.getLocation());  //print current location (url)
}  
catch(JauntException e){
  System.err.println(e);
}
This example illustrates using the pagination discovery feature of class Document in order to navigate through a series of paginated web pages, such as those produced by search engines or database-type web interfaces.

On line 3 the browser visits google.com, and on the following line a search for "seashells" is applied to the textfield and then submitted.

On line 6 the method nextPageLink() is invoked, which returns a Hyperlink object. The hyperlink object represents a link to the next page of results (page 2). On the next two lines, the browser follows the hyperlink and then prints the current location of the UserAgent.

See also the related methods Document.nextPageLink(Element) and Document.nextPageLinkExists().

Example 21: Using the HTML cache.
try{
  UserAgent userAgent = new UserAgent(); 
  String url = "http://northernbushcraft.com";
  userAgent.setCacheEnabled(true);         //caching turned on
  userAgent.visit(url);                    //cache empty, so HTML page requested via http & saved in cache.
  userAgent.visit(url);                    //when revisiting, page pulled from filesystem cache - no http request.
  System.out.println(userAgent.response);  //response object shows that content was cached, no response headers
      
  userAgent.setCacheEnabled(false);        //caching turned off
  userAgent.visit(url);                    //page is once again retrieved via http request. 
  System.out.println(userAgent.response);  //print response object, which now shows response headers
}
catch(JauntException e){
  System.err.println(e);
}
This example illustrates using the default HTML cache. The cache saves the original HTML/XHTML source of webpages locally in a directory called jaunt_cache, which is located in the directory specified by UserAgentSettings.outputPath (see Example 2 for accessing/altering settings).

The purpose of the HTML cache is to provide a means of accessing frequently-required HTML pages from local storage rather than repeatedly making HTTP requests. When screen-scraping data from a large website, it's common to run your program multiple times while refining/testing the scraping algorithm. By enabling caching, you can avoid repeatedly hitting the webserver, since the UserAgent will first check the cache to see whether the document is available locally.

To use the HTML cache you must first enable it, as seen on line 4. Disabling the cache (as seen on line 9) reverts the UserAgent to making HTTP requests rather than pulling content from the cache. However, disabling the cache does not delete the contents of the cache; the cache folder persists until it is manually deleted.
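The check-the-cache-first behaviour described above can be sketched with a small file-backed cache. This is not how Jaunt implements its cache: the class HtmlCacheSketch is hypothetical, the fetcher function stands in for a real HTTP request, and naming cache files by the URL's hash code is an assumption made for brevity (a real cache would need to handle hash collisions).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Function;

// Hypothetical illustration (not Jaunt code): a file-backed HTML cache that
// is consulted before the "network" when caching is enabled.
class HtmlCacheSketch {
  private final Path cacheDir;
  private final Function<String, String> fetcher; // stands in for the HTTP request
  private boolean cacheEnabled = false;
  int fetchCount = 0;                             // how many times the fetcher was used

  HtmlCacheSketch(Path cacheDir, Function<String, String> fetcher) {
    this.cacheDir = cacheDir;
    this.fetcher = fetcher;
  }

  void setCacheEnabled(boolean enabled) { cacheEnabled = enabled; }

  String visit(String url) {
    // Assumed naming scheme: one file per URL, named by the URL's hash code.
    Path file = cacheDir.resolve(Integer.toHexString(url.hashCode()) + ".html");
    try {
      if (cacheEnabled && Files.exists(file)) {   // cache hit: no fetch needed
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
      }
      fetchCount++;
      String html = fetcher.apply(url);           // cache miss or cache disabled
      if (cacheEnabled) Files.write(file, html.getBytes(StandardCharsets.UTF_8));
      return html;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Mirrors the sequence of visits in the example; returns the number of fetches.
  static int demoFetchCount() {
    try {
      Path dir = Files.createTempDirectory("cache_sketch");
      HtmlCacheSketch agent = new HtmlCacheSketch(dir, url -> "<html>" + url + "</html>");
      agent.setCacheEnabled(true);
      agent.visit("http://example.com");  // miss: fetched and saved
      agent.visit("http://example.com");  // hit: read from the temp directory
      agent.setCacheEnabled(false);
      agent.visit("http://example.com");  // cache bypassed: fetched again
      return agent.fetchCount;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("fetches: " + demoFetchCount());
  }
}
```

As in the description above, disabling the cache in this sketch only bypasses it; the saved file stays on disk until it is deleted.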

Continue to Extra Topics
