Home | Javadocs | Web-Scraping Tutorial | JSON Querying Tutorial | FAQ | Download

Jaunt FAQ

Frequently Asked Questions
Answers
  1. What is the status of the project?

    Jaunt was created and is currently maintained by Tom Cervenka (tom dot cervenka at gmail dot com).

  2. Will Jaunt remain free once it's out of Beta?

    Jaunt 1.x will remain available as a free monthly download. Alternative editions will also remain available for those preferring longer licensing terms. Note that the Enterprise Edition is ideal for businesses/organizations, since the license allows for an unlimited number of users within the organization and never expires. Aside from licensing terms and expiration, the Enterprise edition does not differ from the standard editions in its API/javadocs.

  3. Why should I use Jaunt rather than HtmlUnit/Phantomjs/Selenium/Mechanize/etc?

    Jaunt is small and fast compared to products that support Javascript (eg WebKit-based tools). Unlike those solutions, Jaunt provides HTTP-level control for accessing headers and performing REST calls, as well as the ability to query JSON payloads. Because it's lightweight, it's relatively easy to scale, for example by using one UserAgent per thread.
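    The one-agent-per-thread pattern mentioned above can be sketched with a thread pool and a ThreadLocal. This is only an illustration that runs without Jaunt on the classpath: StubAgent is a hypothetical stand-in for Jaunt's UserAgent, and fetch is an invented method that performs no network access. In a real program, a UserAgent would presumably take the place of StubAgent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerThreadAgentSketch {

    // Hypothetical stand-in for Jaunt's UserAgent; it only records which thread created it.
    static class StubAgent {
        private final String owner = Thread.currentThread().getName();

        String fetch(String url) {
            return owner + " fetched " + url;
        }
    }

    // Each worker thread lazily creates, then reuses, its own StubAgent.
    private static final ThreadLocal<StubAgent> AGENT = ThreadLocal.withInitial(StubAgent::new);

    static List<String> fetchAll(List<String> urls, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                // The task runs on a pool thread, so AGENT.get() returns that thread's agent.
                pending.add(pool.submit(() -> AGENT.get().fetch(url)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get()); // results keep the order of the input URLs
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> urls = List.of("http://example.com/a", "http://example.com/b", "http://example.com/c");
        fetchAll(urls, 2).forEach(System.out::println);
    }
}
```

    Because no agent is shared between threads, the agents themselves need no synchronization.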

    The library is designed to hide unnecessary complexity while still providing full DOM-level control. For example, Jaunt enables your program to fill out and submit HTML forms without relying on XPath or CSS selectors, which are cumbersome and can break when page structure/style changes. Instead, form fields can be targeted based on how they are visibly labelled, or even simpler, filled out from start to end by specifying a sequence of inputs. DOM-level control is also powerful and easy; the query syntax used to search for elements resembles HTML/XML. Search methods throw informative exceptions, freeing the developer from explicit error checking for null return values.

    Jaunt also provides high-level components for common web-scraping tasks. For example, the Table component allows you to extract a row or column of data with a single statement, either by specifying row/column indexes or by regex text matching. Another example is the Form component, which allows you to automatically permute a form's fields through the different possible combinations of user input and generate a list of HttpRequests representing each possible state -- a huge time saver for functional testing or when crawling through complex search interfaces. Jaunt's built-in support for HTTP proxies and HTML caching is a life-saver when scraping large numbers of webpages.

  4. Why doesn't Jaunt work when scraping data from [some site]?

    If you encounter a problem using Jaunt, it may be due to a bug in Jaunt, and you should post your question to the forum (be sure to mention which version of Jaunt you're using and, if possible, submit code that demonstrates the bug). Be aware that Jaunt does not support Javascript, and therefore cannot be used to scrape content or perform actions on a webpage if that content/action requires Javascript. If you require Javascript support, see Jaunt's new, free, sister product, Jauntium.

    To determine whether a page/site requires Javascript, disable Javascript in your regular browser (eg Chrome) and then visit the page using your browser. If the page/site still functions normally, then Javascript support is not required.

    In cases where Javascript is required for the page to render, you may still be able to use Jaunt for data extraction by bypassing Javascript and communicating directly with the (often undocumented) REST endpoints. Use your browser's developer tools to examine network traffic and determine the appropriate HTTP requests. Then use Jaunt to create the requests and parse the XML/JSON data from those endpoints. Another strategy for bypassing Javascript is to change the user-agent header (see UserAgentSettings) to that of a mobile browser. The website may serve a mobile-friendly (possibly non-Javascript) version of the content.
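    As a rough illustration of what such a request looks like at the HTTP level, the sketch below builds a GET request carrying a mobile user-agent header. It deliberately uses the JDK's own java.net.http classes (Java 11+) rather than Jaunt's API, so it runs without Jaunt installed. The endpoint URL and the user-agent string are made-up examples; substitute the values you observe in your browser's developer tools.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class MobileRequestSketch {

    // Example mobile user-agent string (an assumption; any current mobile browser string will do).
    static final String MOBILE_UA =
        "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) AppleWebKit/605.1.15 "
        + "(KHTML, like Gecko) Version/16.0 Mobile/15E148 Safari/604.1";

    // Builds, but does not send, a GET request that asks for JSON as a mobile browser would.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
            .header("User-Agent", MOBILE_UA)
            .header("Accept", "application/json")
            .GET()
            .build();
    }

    public static void main(String[] args) {
        // Hypothetical endpoint, used only to show the shape of the request.
        HttpRequest request = buildRequest("https://example.com/api/items?page=1");
        System.out.println(request.method() + " " + request.uri());
        request.headers().map().forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```

    With Jaunt itself, the user-agent header is changed through UserAgentSettings, as noted above, and the response would be parsed with Jaunt's XML/JSON support.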

  5. Help! Jaunt Expired, shortly after I upgraded to a newer version!

    A common cause of this problem is that a previous (and expired) version of Jaunt is still on the classpath, and that's where the expiration message comes from. Be sure to delete any previous versions of Jaunt, or at least remove them from your classpath. Finally, to ensure that you are running the version that you think you are, print UserAgent.getVersionInfo(), which will tell you both the version that you're using and its expiration date.
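    One quick way to spot a stale jar is to list the classpath entries whose file name contains "jaunt". The sketch below uses only the JDK; the class and method names are illustrative and not part of Jaunt. It assumes the jar file names contain the word "jaunt", and it only sees what the java.class.path system property reports, so jars added by a custom class loader or an application server may not appear.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ClasspathCheck {

    // Returns every classpath entry whose file name contains "jaunt" (case-insensitive).
    static List<String> findJauntEntries(String classpath, String separator) {
        List<String> matches = new ArrayList<>();
        for (String entry : classpath.split(Pattern.quote(separator))) {
            String fileName = new File(entry).getName().toLowerCase();
            if (fileName.contains("jaunt")) {
                matches.add(entry);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String classpath = System.getProperty("java.class.path");
        List<String> hits = findJauntEntries(classpath, File.pathSeparator);
        hits.forEach(System.out::println);
        if (hits.size() > 1) {
            System.out.println("More than one Jaunt jar found; remove the older ones.");
        }
    }
}
```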

  6. How do I avoid the error java.lang.OutOfMemoryError: Java heap space?

    Increase your heap allocation using the -Xmx flag when running your program (eg java -Xmx2G myProgram allocates 2 gigs). You can also control the maximum allowed length of an XML/HTML page by setting the maxBytes property of UserAgentSettings to something other than the default of -1 (no limit). For example, userAgent.settings.maxBytes = 250000 tells the parser to process only the first 250,000 bytes of content, which will result in a truncated document if the response content is larger. When using this setting, you can check whether the resultant document has been truncated by examining the boolean property Document.truncated (as of Jaunt 1.1.1). Even when a document is truncated, it is truncated gracefully, ie any incomplete tags are closed.
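    To confirm that the -Xmx setting actually reached the JVM (for example when an IDE or build tool launches your program), you can print the maximum heap at startup. This minimal sketch uses only the standard Runtime API; the class name HeapCheck is illustrative.

```java
class HeapCheck {

    // Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes).
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() reports the most heap the JVM will attempt to use.
        long maxHeapBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + toMegabytes(maxHeapBytes) + " MB");
    }
}
```

    The reported figure can come out slightly below the requested -Xmx value, depending on the JVM and garbage collector in use.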

    As of Jaunt 1.1.1, a bug has been fixed whereby Jaunt would read (potentially large) files if they were of an unsupported content-type, which could result in an OutOfMemoryError.

  7. What features are on the horizon?

    The featureset after v. 1.0 will continue to be driven by user demand/feedback.

  8. Where can I go for support, discussion, bug reporting?

    Support, discussion, and bug reporting are available at the Jaunt Google Group.


