The Java ecosystem offers a number of tools for parsing HTML. The best of them deal gracefully with real-world online data, which is often dirty and unpredictably formatted. "Graceful", in this case, means not only parsing without choking, but also switching seamlessly between HTML and XHTML. Jaunt's parser, which handles both HTML and XML, is guaranteed to generate a parse tree for even the messiest non-validating data.
Beyond acting as a parser and exposing low-level DOM mechanics, Jaunt also provides high-level convenience functions. The package accommodates three levels of abstraction:
The following Jaunt program visits, fills out, and submits a login form:
    try{
      UserAgent userAgent = new UserAgent();   //create new useragent (headless browser)
      userAgent.visit("http://jaunt-api.com/examples/login.htm")
        .fillout("Username:", "tom")           //fill-out form fields by text labels
        .fillout("Password:", "secret")
        .choose(Label.RIGHT, "Remember me")
        .submit();                             //submit form, then print current url
      System.out.println("location: " + userAgent.getLocation());
    }
    catch(JauntException e){
      System.err.println(e);
    }
But Jaunt can make life even easier. Instead of writing code to navigate back and forth through a form interface for multiple submissions (e.g. search forms), the developer can automatically generate a form's request permutations. Each request represents a submission for one possible combination of inputs, so there is no need to manipulate the form inputs one at a time, and the developer is free to focus on extracting data from the results pages.
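To make the idea of request permutations concrete, here is a minimal sketch in plain Java. It does not use Jaunt's own API. The class name `FormPermutations`, the field names, and the `example.com` URL are all hypothetical, chosen only for illustration. It takes each form field with its list of candidate values and builds one query string per combination, which is simply a Cartesian product of the inputs.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hand-rolled stand-in, not part of the Jaunt library.
public class FormPermutations {

    // Returns one URL-encoded query string per combination of field values.
    // Fields are combined in the iteration order of the map.
    public static List<String> permutations(Map<String, List<String>> fields) {
        List<String> queries = new ArrayList<>();
        queries.add(""); // a single empty query to extend field by field
        for (Map.Entry<String, List<String>> field : fields.entrySet()) {
            List<String> extended = new ArrayList<>();
            for (String prefix : queries) {
                for (String value : field.getValue()) {
                    String pair = encode(field.getKey()) + "=" + encode(value);
                    extended.add(prefix.isEmpty() ? pair : prefix + "&" + pair);
                }
            }
            queries = extended;
        }
        return queries;
    }

    // Form-style URL encoding of a single name or value.
    public static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical search form: two menus with 2 and 3 options.
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("category", List.of("books", "music"));
        fields.put("sort", List.of("price", "date", "rating"));

        // 2 x 3 options yield six distinct requests.
        for (String query : permutations(fields)) {
            System.out.println("http://example.com/search?" + query);
        }
    }
}
```

The number of requests grows multiplicatively with each field, so in practice it pays to restrict the candidate values to those that actually matter for the data being collected.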
Jaunt is beta software and is ready for a test drive. The website (http://jaunt-api.com) provides a quickstart tutorial with plenty of short, simple examples for each of Jaunt's most important features. Try it out and provide feedback for the next release!