The following examples assume knowledge of the basic functionality of the Jaunt API. If you have not already done so, familiarize yourself with the Webscraping Tutorial. To use Jaunt, download and extract the zip file. The zip file contains the licensing agreement, javadocs documentation, example files, release notes, and a jar file (Java 1.6). The jar file must be included in your classpath/project, at which point you will be able to recompile and/or run the example files.
try{
  UserAgent userAgent = new UserAgent();
  userAgent.openContent("<html>Welcome!<div>Under Construction</div><!-- todo: add more --></html>");
  Element html = userAgent.doc.findFirst("<html>");                     //find the html element

  List<Node> childNodes = html.getChildNodes();                         //retrieve the child nodes as a List

  for(Node node : childNodes){                                          //iterate through the list of Nodes.
    if(node.getType() == Node.ELEMENT_TYPE){                            //determine whether the node is an Element
      System.out.println("element: " + ((Element)node).outerHTML());    //print the element and its content
    }
    else if(node.getType() == Node.TEXT_TYPE){                          //determine whether the node is Text
      System.out.println("text: " + ((Text)node).toString());           //print the text
    }
    else if(node.getType() == Node.COMMENT_TYPE){                       //determine whether the node is a Comment
      System.out.println("comment: " + ((Comment)node).toString());     //print the comment
    }
  }
}
catch(JauntException e){
  System.err.println(e);
}
When using getChildNodes() to iterate through a list of Node children (as on line 6), the node type can be determined either by checking the value of getType() (as on lines 9, 12, and 15) or by attempting to cast the node to different types (i.e., Element, Comment, or Text) and catching ClassCastExceptions. Another useful method for iterating through children (not shown) is getChildElements(), which returns only Elements.
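The two ways of identifying a node's type can be compared side by side in a small, self-contained sketch. Note that this sketch does not use Jaunt: it relies on the JDK's own org.w3c.dom classes as a stand-in, and the class name NodeKinds and its helper methods are hypothetical. The runtime-class check uses instanceof, which avoids having to catch ClassCastExceptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Comment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical helper class; uses the JDK DOM API, not Jaunt.
class NodeKinds {

    // Strategy 1: inspect the node's type code.
    static String byTypeCode(Node node) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE: return "element";
            case Node.TEXT_NODE:    return "text";
            case Node.COMMENT_NODE: return "comment";
            default:                return "other";
        }
    }

    // Strategy 2: test the node's runtime class before casting.
    static String byClass(Node node) {
        if (node instanceof Element) return "element";
        if (node instanceof Comment) return "comment";
        if (node instanceof Text)    return "text";
        return "other";
    }

    // Parse the markup and return the child nodes of its root element.
    static NodeList rootChildren(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getChildNodes();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
    }

    // Classify each child of the root element using the chosen strategy.
    static List<String> classifyChildren(String xml, boolean useTypeCode) {
        NodeList children = rootChildren(xml);
        List<String> kinds = new ArrayList<>();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            kinds.add(useTypeCode ? byTypeCode(child) : byClass(child));
        }
        return kinds;
    }

    public static void main(String[] args) {
        String markup = "<html>Welcome!<div>Under Construction</div><!-- todo: add more --></html>";
        System.out.println(classifyChildren(markup, true));
        System.out.println(classifyChildren(markup, false));
    }
}
```

Both strategies should report the same sequence of node kinds for the same markup; the type-code check is usually the more readable of the two.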
On lines 13 and 16 the Node object is cast to Text/Comment before calling toString(). Casting is actually unnecessary here, because toString() is dynamically bound; however, it is performed to illustrate the relevant classes to use when casting.
The Comment object is used to represent standard HTML/XML comments as well as doctype definitions, processing instructions, and CDATA sections, which can be differentiated by checking the Comment's type (i.e., calling getCommentType()). Example 2 deals with comment-related searches.
<?xml version="1.0" encoding="ISO-8859-1"?>
<?id 568375?>
<!DOCTYPE MML SYSTEM "MML.dtd">
<message>
  <content>
    <!-- last edited Jan 12, 2013 -->
    This XML document is a test of Message Markup Language (MML)
    <![CDATA[ MML has tags like <message> and <content> ]]>
  </content>
</message>
<!-- copyright 2013 -->
try{
  UserAgent userAgent = new UserAgent();
  userAgent.visit("http://jaunt-api.com/examples_advanced/message.xml");

  Comment result = userAgent.doc.getComment(2);                               //get doc's 3rd child comment
  System.out.println("result: " + result);                                    //print the result

  Comment doctype = userAgent.doc.getFirst(Comment.DOCTYPE);                  //get doc's first child doctype
  System.out.println("doctype: " + doctype);                                  //print the doctype

  List<Comment> pis = userAgent.doc.getEach(Comment.PROCESSING_INSTRUCTION);  //get doc's immediate child PIs
  for(Comment pi : pis) System.out.println("processing instruction: " + pi);  //print the list of PIs.

  Comment cdata = userAgent.doc.findFirst(Comment.CDATA);                     //find first CDATA section in document
  System.out.println("cdata: " + cdata);                                      //print the CDATA section

  List<Comment> comments = userAgent.doc.findEach(Comment.REGULAR);           //find each regular comment in document
  for(Comment comment : comments) System.out.println("comment: " + comment);  //print list of regular comments
}
catch(JauntException e){
  System.err.println(e);
}
The Comment class represents not only regular comments of the form <!-- regular comment -->, but also DOCTYPE definitions, processing instructions, CDATA sections, and (not shown) irregular comments of the form <! irregular comment >.
The method names used in this example closely mirror those encountered previously for searching for Elements. Although they work similarly with respect to get vs. find and each vs. every, their method signatures differ. Since Comments do not have children, searches for comments are never chained, and therefore multiple results are returned in a List instead of an <#elements> container.
Another difference is that the Comment-related search methods do not accept a String query, but rather a constant that specifies the type of comment being targeted. The getComment(int) method (line 5), like the getElement(int) method, accepts a zero-based integer n identifying which immediate child Comment to retrieve (so getComment(2) retrieves the third).
<?xml version="1.0"?>
<messageList>
  <message id="0">Hello there</message>
  <message id="1">What's up?</message>
  <message id="2">Anyone there?</message>
  <notification type="disconnection"/>
</messageList>
try{
  UserAgent userAgent = new UserAgent();
  userAgent.settings.genericXMLMode = true;                              //set mode for processing generic XML

  userAgent.setFilter(new Filter(){                                      //subclass Filter to create custom filter
    public boolean childElementAllowed(Element parent, Element child){   //override callback method
      if(child.getName().equals("message")){                             //only allow tags named 'message'
        child.removeAttribute("id");                                     //remove 'id' attribute, if present
        return true;
      }
      else return false;
    }
  });
  userAgent.visit("http://jaunt-api.com/examples_advanced/messageList.xml");   //open content
  System.out.println("Filtered document:\n" + userAgent.doc.innerXML());       //print filtered document.
}
catch(JauntException e){
  System.err.println(e);
}
setFilter(FilterCallback) can be used in two ways. You can either create your own class that implements the FilterCallback interface (three methods), or you can subclass Filter (as in this example). The class Filter already implements FilterCallback and allows everything to pass through the filter (i.e., each of the three methods always returns true). In subclassing it, you need only override the method(s) that handle what you want to filter out. In this example, we override only the method for filtering elements, while the methods for filtering text and comments are left to allow all text and comments.
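The design described above (an interface with three callbacks, plus a pass-through base class that is subclassed to override only what matters) can be sketched in plain Java. None of the types below belong to Jaunt: TinyFilterCallback, TinyFilter, FilterDemo, and all of their method names are hypothetical stand-ins, and nodes are reduced to plain Strings to keep the sketch self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a three-method filter interface.
interface TinyFilterCallback {
    boolean elementAllowed(String tagName);
    boolean textAllowed(String text);
    boolean commentAllowed(String comment);
}

// Pass-through base class: every callback allows everything.
class TinyFilter implements TinyFilterCallback {
    public boolean elementAllowed(String tagName) { return true; }
    public boolean textAllowed(String text) { return true; }
    public boolean commentAllowed(String comment) { return true; }
}

class FilterDemo {

    // Subclass the pass-through filter and override only the element callback.
    static TinyFilterCallback messagesOnly() {
        return new TinyFilter() {
            @Override
            public boolean elementAllowed(String tagName) {
                return "message".equals(tagName);
            }
        };
    }

    // Plays the role of the parser: ask the filter before keeping each tag.
    static List<String> keepAllowed(List<String> tagNames, TinyFilterCallback filter) {
        List<String> kept = new ArrayList<>();
        for (String tagName : tagNames) {
            if (filter.elementAllowed(tagName)) {
                kept.add(tagName);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("message", "message", "message", "notification");
        System.out.println(keepAllowed(tags, messagesOnly()));
    }
}
```

The benefit of the base class is that a subclass stays short: the text and comment callbacks keep their allow-everything behavior without being mentioned.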
Each of the callback methods that you override is called by the parser in order to check whether any given element/text/comment should be added to the DOM tree. On line 5, our custom filter is passed as an argument to setFilter(FilterCallback). These filter settings will be retained by the UserAgent until removeFilter() has been called. See below for the printed output for this example.
<?xml version="1.0"?>
<message>Hello there</message>
<message>What's up?</message>
<message>Anyone there?</message>
try{
  UserAgent userAgent = new UserAgent();
  userAgent.settings.showHeaders = true;                                   //change settings to auto-print header information

  System.out.println("SENDING DEFAULT, AUTO-MANAGED REQUEST HEADERS...\n");
  userAgent.sendGET("http://jaunt-api.com/examples_advanced/hello.xml");   //perform HTTP request
  System.out.println("DOCUMENT:\n" + userAgent.doc.innerXML());            //print retrieved document

  System.out.println("SENDING MODIFIED REQUEST HEADERS...\n");
  userAgent.sendGET("http://jaunt-api.com/examples_advanced/hello.xml",
    "user-agent:Mozilla/4.0", "foo:bar");                                  //specify new/modified headers
  System.out.println("DOCUMENT:\n" + userAgent.doc.innerXML());            //print retrieved document
}
catch(JauntException e){
  System.err.println(e);
}
For the first request (line 6), UserAgent's sendGET(String) method is invoked, which is functionally identical to the visit(String) method seen in previous examples. The retrieved document is then printed as XHTML (line 7). Examining the printed request headers reveals that they consist of the default headers specified in userAgent.settings.defaultRequestHeaders, as well as headers that are auto-managed by the UserAgent, such as any cookie-related headers and basic authentication headers.
For the second request (line 10), the sendGET(String, String ...) method is called. Note that the second argument allows a variable number of Strings. Each String should contain the name and value of a request header, separated by a colon. The headers specified here are sent in addition to (or in place of) the default headers and headers that are auto-managed by the UserAgent. Therefore, the header "foo:bar" is sent in addition to the automatically managed headers, but the header "user-agent:Mozilla/4.0" overrides the user-agent header specified in UserAgent.settings.defaultRequestHeaders.
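The "in addition to (or in place of)" behavior can be illustrated with a small plain-Java sketch. This is not Jaunt's implementation: the class HeaderMerge, its merge method, and the default header values used in main are all hypothetical, and the sketch only models the described semantics of "name:value" strings being layered over a set of defaults. Header names are lowercased before comparison because HTTP header names are case-insensitive.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper modelling how per-request headers layer over defaults.
class HeaderMerge {

    // Start from the defaults, then apply each "name:value" string on top.
    static Map<String, String> merge(Map<String, String> defaults, String... overrides) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : defaults.entrySet()) {
            merged.put(entry.getKey().toLowerCase(Locale.ROOT), entry.getValue());
        }
        for (String header : overrides) {
            int colon = header.indexOf(':');   // split at the first colon only
            if (colon < 0) {
                throw new IllegalArgumentException("expected name:value, got: " + header);
            }
            String name = header.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = header.substring(colon + 1).trim();
            merged.put(name, value);           // replaces a default of the same name, otherwise adds
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = new LinkedHashMap<>();
        defaults.put("user-agent", "ExampleAgent/1.0");   // made-up default value
        defaults.put("accept", "*/*");                    // made-up default value
        System.out.println(merge(defaults, "user-agent:Mozilla/4.0", "foo:bar"));
    }
}
```

Splitting at the first colon only means that header values which themselves contain colons (such as URLs) are preserved intact.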
try{
  UserAgent userAgent = new UserAgent();

  System.out.println("SENDING HEAD REQUEST...\n");
  userAgent.sendHEAD("http://jaunt-api.com/examples_advanced/hello.xml");   //send HTTP HEAD Request
  System.out.println("RESPONSE HEADERS:\n" + userAgent.response.getHeaders());

  System.out.println("SENDING POST REQUEST...\n");
  userAgent.sendPOST("http://tomcervenka.site90.com/handlePost.php",
    "username=tom&password=secret");                                        //send HTTP POST Request with queryString
  System.out.println("DOCUMENT:\n" + userAgent.doc.innerXML());             //print retrieved Document

  /** this section requires TARGET_SERVER to support PUT requests
  System.out.println("SENDING PUT REQUEST...\n");
  userAgent.sendPUT("http://TARGET_SERVER/examples_advanced/hello.xml",     //send HTTP PUT Request with updated content
    "<?xml version=\"1.0\"?>Hi Mom! ");
  System.out.println("DOCUMENT:\n" + userAgent.doc.innerXML());             //print retrieved Document
  */

  /** this section requires TARGET_SERVER to support DELETE requests
  System.out.println("SENDING DELETE REQUEST...\n");
  userAgent.sendDELETE("http://TARGET_SERVER/examples_advanced/hello.xml"); //send HTTP DELETE request
  System.out.println("DOCUMENT:\n" + userAgent.doc.innerXML());             //print retrieved Document
  */
}
catch(JauntException e){
  System.err.println(e);
}
The HEAD request (line 5) creates (and returns) an HttpResponse object (UserAgent.response), whose header information is printed on line 6. The POST request (line 9) accepts queryString-formatted data as the second parameter and creates (and returns) a Document object (UserAgent.doc), which is then printed on line 11. The PUT request (line 15) accepts updated content as the second parameter and creates (and returns) a Document object (UserAgent.doc), which is then printed immediately afterward. The DELETE request (line 22) deletes the resource specified by the first parameter and creates (and returns) a Document object (UserAgent.doc), which is likewise printed immediately afterward.
In each case where a Document object is created/returned, the document reference is guaranteed to be non-null (unless an exception is thrown), although the document may be empty (i.e., hold no content).
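The POST example passes its data as a hand-written query string. When the names or values may contain reserved characters (spaces, ampersands, and so on), it is safer to build that string programmatically. The following is a minimal sketch using only the JDK's URLEncoder; the class QueryStrings and its encode method are hypothetical helpers, not part of Jaunt.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for building queryString-formatted data.
class QueryStrings {

    // Join URL-encoded name=value pairs with '&', preserving insertion order.
    static String encode(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(urlEncode(entry.getKey()))
              .append('=')
              .append(urlEncode(entry.getValue()));
        }
        return sb.toString();
    }

    // URL-encode a single name or value as UTF-8.
    private static String urlEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is required to be supported", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("username", "tom");
        params.put("password", "secret");
        System.out.println(encode(params));
    }
}
```

The resulting String could then be supplied as the second argument of sendPOST in place of the literal used in the example.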
try{
  //create UserAgent and content handlers.
  UserAgent userAgent = new UserAgent();
  HandlerForText handlerForText = new HandlerForText();
  HandlerForBinary handlerForBinary = new HandlerForBinary();

  //register each handler with a specific content-type
  userAgent.setHandler("text/css", handlerForText);
  userAgent.setHandler("text/javascript", handlerForText);
  userAgent.setHandler("application/x-javascript", handlerForText);
  userAgent.setHandler("image/gif", handlerForBinary);
  userAgent.setHandler("image/jpeg", handlerForBinary);

  //retrieve CSS content as String
  userAgent.visit("http://jaunt-api.com/syntaxhighlighter/styles/shCore.css");
  System.out.println(handlerForText.getContent());

  //retrieve JS content as String
  userAgent.visit("http://jaunt-api.com/syntaxhighlighter/scripts/shCore.js");
  System.out.println(handlerForText.getContent());

  //retrieve GIF content as byte[], and print its length
  userAgent.visit("http://jaunt-api.com/background.gif");
  System.out.println(handlerForBinary.getContent().length);
}
catch(JauntException e){
  System.err.println(e);
}
On lines 4-5, one HandlerForText and one HandlerForBinary are created. On lines 8-12, the userAgent associates various content types with one of the two handlers. On the subsequent lines, the userAgent is directed to a CSS file, a JS file, and a GIF file. In each case the handler object is used to retrieve the content in either text or binary form, depending on which handler was used.
UserAgent userAgent = new UserAgent();
userAgent.setProxyHost("3.14.159.68");           //specify the proxy host (ip address)
userAgent.setProxyPort(8080);                    //specify the proxy port

//visit a (non-https://) URL through the proxy
userAgent.visit("http://amazon.com");
System.out.println(userAgent.doc.innerXML());    //print the retrieved document
To stop routing requests through the proxy, pass null to the method setProxyHost(String). Since this configuration is performed at UserAgent level, it's possible to create multiple UserAgents that simultaneously utilize different proxies (e.g., one per thread). Note that when dealing with proxies, response times may lag. See UserAgentSettings.responseTimeout for setting a hard time limit on responses before throwing a ResponseException.
//specify http proxy at System level.
System.setProperty("http.proxyHost", "3.14.159.68");
System.setProperty("http.proxyPort", "8080");

//specify username/password credentials at System level.
ProxyAuthenticator.setCredentials("tom", "secret");

//create the UserAgent after the System-level settings are in place
UserAgent userAgent = new UserAgent();

//visit a (non-https://) URL through the proxy
userAgent.visit("http://amazon.com");
System.out.println(userAgent.doc.innerXML());    //print the retrieved document
To remove these proxy settings, call System.clearProperty(String) for each property. Since this configuration is performed at System level, it is applied to all UserAgents that are subsequently created.
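Because the System-level proxy settings are ordinary JVM system properties, they can be set and cleared with the standard System methods. The sketch below uses only the JDK; the class SystemProxy and its two helper methods are hypothetical names chosen for illustration.

```java
// Hypothetical helper that toggles the JVM-wide HTTP proxy properties.
class SystemProxy {

    // Set the standard JVM properties for an HTTP proxy.
    static void enableHttpProxy(String host, int port) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", Integer.toString(port));
    }

    // Remove both properties so that later connections no longer use the proxy.
    static void disableHttpProxy() {
        System.clearProperty("http.proxyHost");
        System.clearProperty("http.proxyPort");
    }

    public static void main(String[] args) {
        enableHttpProxy("3.14.159.68", 8080);
        System.out.println("proxy host: " + System.getProperty("http.proxyHost"));
        disableHttpProxy();
        System.out.println("proxy host after clearing: " + System.getProperty("http.proxyHost"));
    }
}
```

Keeping the set and clear calls together in one place makes it harder to leave a stale proxy setting behind, since these properties affect the whole JVM rather than a single object.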
//specify https proxy at System level.
System.setProperty("https.proxyHost", "12.345.67.8");
System.setProperty("https.proxyPort", "80");

//create the UserAgent after the System-level settings are in place
UserAgent userAgent = new UserAgent();

//visit a (https://) URL through the proxy
userAgent.visit("https://google.com");
System.out.println(userAgent.doc.innerXML());    //print the retrieved document
To remove these proxy settings, call System.clearProperty(String) for each property. Since this configuration is performed at System level, it is applied to all UserAgents that are subsequently created.