org.jsoup
Class Jsoup

java.lang.Object
  extended by org.jsoup.Jsoup

public class Jsoup
extends Object

The core public access point to the jsoup functionality.

Author:
Jonathan Hedley

Method Summary
static String clean(String bodyHtml, String baseUri, Whitelist whitelist)
          Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.
static String clean(String bodyHtml, Whitelist whitelist)
          Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.
static Document parse(File in, String charsetName)
          Parse the contents of a file as HTML.
static Document parse(File in, String charsetName, String baseUri)
          Parse the contents of a file as HTML.
static Document parse(String html)
          Parse HTML into a Document.
static Document parse(String html, String baseUri)
          Parse HTML into a Document.
static Document parse(URL url, int timeoutMillis)
          Fetch a URL, and parse it as HTML.
static Document parseBodyFragment(String bodyHtml)
          Parse a fragment of HTML, with the assumption that it forms the body of the HTML.
static Document parseBodyFragment(String bodyHtml, String baseUri)
          Parse a fragment of HTML, with the assumption that it forms the body of the HTML.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

parse

public static Document parse(String html,
                             String baseUri)
Parse HTML into a Document. The parser will make a sensible, balanced document tree out of any HTML.

Parameters:
html - HTML to parse
baseUri - The URL the HTML was retrieved from. Used to resolve relative URLs to absolute URLs when they occur before the HTML declares a <base href> tag.
Returns:
the parsed Document, with a sensible, balanced tree
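
As a rough illustration of how baseUri affects link resolution, the sketch below parses a snippet with an unclosed tag and reads back the first link as an absolute URL. The class name ParseWithBaseUri, the example.com URLs, and the calls select(...), first() and absUrl(...) are assumptions for illustration; they are not documented on this page and may differ between jsoup releases.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseWithBaseUri {
    // Returns the first link's href resolved against baseUri, or "" if there is no link.
    static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        // select/first/absUrl are assumed helper methods, not listed on this page.
        Element link = doc.select("a").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        // The <p> is deliberately left unclosed; the parser is expected to balance the tree.
        String html = "<p>See the <a href='/docs/index.html'>docs</a>";
        System.out.println(firstAbsoluteLink(html, "http://example.com/app/"));
    }
}
```

Because the href starts with a slash, it is resolved against the host of the base URI rather than against the /app/ path.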

parse

public static Document parse(String html)
Parse HTML into a Document. As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag.

Parameters:
html - HTML to parse
Returns:
the parsed Document, with a sensible, balanced tree
See Also:
parse(String, String)

parse

public static Document parse(URL url,
                             int timeoutMillis)
                      throws IOException
Fetch a URL, and parse it as HTML.

Parameters:
url - URL to fetch (with a GET). The protocol must be http or https.
timeoutMillis - Connection and read timeout, in milliseconds. If the timeout is exceeded, an IOException is thrown.
Returns:
The parsed HTML.
Throws:
IOException - If the final server response is not 200 OK (redirects are followed), or if there is an error reading the response stream.

parse

public static Document parse(File in,
                             String charsetName,
                             String baseUri)
                      throws IOException
Parse the contents of a file as HTML.

Parameters:
in - file to load HTML from
charsetName - character set of file contents. If you don't know the charset, generally the best guess is UTF-8.
baseUri - The URL the HTML was retrieved from. Relative URLs are resolved against it to produce absolute URLs.
Returns:
the parsed Document, with a sensible, balanced tree
Throws:
IOException - if the file could not be found, or read, or if the charsetName is invalid.
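
A minimal sketch of the file-based overload, assuming a UTF-8 file written to a temporary location. The class name ParseFileExample, the helper bodyTextOfFile, the example.com base URI, and the Element.text() call are illustrative assumptions rather than anything this page documents.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseFileExample {
    // Writes html to a temporary UTF-8 file, parses it, and returns the body text.
    static String bodyTextOfFile(String html) {
        File tmp = null;
        try {
            tmp = File.createTempFile("jsoup-example", ".html");
            Files.write(tmp.toPath(), html.getBytes(StandardCharsets.UTF_8));
            // The charset name must match how the file was written.
            Document doc = Jsoup.parse(tmp, "UTF-8", "http://example.com/");
            return doc.body().text(); // text() is an assumed Element method
        } catch (IOException e) {
            // Covers a missing or unreadable file and an invalid charset name.
            throw new UncheckedIOException(e);
        } finally {
            if (tmp != null) {
                tmp.delete();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(bodyTextOfFile("<p>caf\u00e9</p>"));
    }
}
```

Wrapping the IOException keeps the helper easy to call from a quick experiment; production code would more likely propagate or handle it directly.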

parse

public static Document parse(File in,
                             String charsetName)
                      throws IOException
Parse the contents of a file as HTML. The location of the file is used as the base URI to qualify relative URLs.

Parameters:
in - file to load HTML from
charsetName - character set of file contents. If you don't know the charset, generally the best guess is UTF-8.
Returns:
the parsed Document, with a sensible, balanced tree
Throws:
IOException - if the file could not be found, or read, or if the charsetName is invalid.
See Also:
parse(File, String, String)

parseBodyFragment

public static Document parseBodyFragment(String bodyHtml,
                                         String baseUri)
Parse a fragment of HTML, with the assumption that it forms the body of the HTML.

Parameters:
bodyHtml - body HTML fragment
baseUri - URL to resolve relative URLs against.
Returns:
a Document whose body holds the parsed fragment
See Also:
Document.body()

parseBodyFragment

public static Document parseBodyFragment(String bodyHtml)
Parse a fragment of HTML, with the assumption that it forms the body of the HTML.

Parameters:
bodyHtml - body HTML fragment
Returns:
a Document whose body holds the parsed fragment
See Also:
Document.body()
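
The following sketch shows one way a fragment might be parsed and read back through Document.body(). The class name BodyFragmentExample, the helper fragmentText, and the Element.text() call are assumptions for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BodyFragmentExample {
    // Parses a body fragment and returns the combined text of the resulting body.
    static String fragmentText(String bodyHtml) {
        Document doc = Jsoup.parseBodyFragment(bodyHtml);
        // The fragment is placed inside the document body, so body() is the entry point.
        return doc.body().text(); // text() is an assumed Element method
    }

    public static void main(String[] args) {
        // The <i> is left unclosed on purpose.
        System.out.println(fragmentText("<b>bold</b> and <i>italic"));
    }
}
```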

clean

public static String clean(String bodyHtml,
                           String baseUri,
                           Whitelist whitelist)
Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.

Parameters:
bodyHtml - input untrusted HTML
baseUri - URL to resolve relative URLs against
whitelist - white-list of permitted HTML elements
Returns:
safe HTML
See Also:
Cleaner.clean(Document)

clean

public static String clean(String bodyHtml,
                           Whitelist whitelist)
Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.

Parameters:
bodyHtml - input untrusted HTML
whitelist - white-list of permitted HTML elements
Returns:
safe HTML
See Also:
Cleaner.clean(Document)
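
A hedged sketch of cleaning untrusted input. It assumes the Whitelist class lives in org.jsoup.safety and offers a basic() factory that permits simple text formatting such as b and p; neither detail is stated on this page, and both may vary by jsoup release. The class name CleanExample and the helper sanitize are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanExample {
    // Filters untrusted HTML through an assumed basic white-list.
    static String sanitize(String untrustedHtml) {
        // Whitelist.basic() is an assumption; substitute the white-list your release provides.
        return Jsoup.clean(untrustedHtml, Whitelist.basic());
    }

    public static void main(String[] args) {
        String untrusted = "<p>Hi <script>alert(1)</script><b>there</b></p>";
        System.out.println(sanitize(untrusted));
    }
}
```

The script element is not on the assumed white-list, so it should be dropped from the output while the permitted formatting is kept.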


Copyright © 2009-2010 Jonathan Hedley. All Rights Reserved.