Jsoup (jsoup 1.7.2 API)

org.jsoup
Class Jsoup

java.lang.Object
  org.jsoup.Jsoup

public class Jsoup
extends Object

The core public access point to the jsoup functionality.

Author:: Jonathan Hedley

Method Summary
`static String`	`clean(String bodyHtml, String baseUri, Whitelist whitelist)` Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.
`static String`	`clean(String bodyHtml, String baseUri, Whitelist whitelist, Document.OutputSettings outputSettings)` Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.
`static String`	`clean(String bodyHtml, Whitelist whitelist)` Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.
`static Connection`	`connect(String url)` Creates a new `Connection` to a URL.
`static boolean`	`isValid(String bodyHtml, Whitelist whitelist)` Test if the input HTML has only tags and attributes allowed by the Whitelist.
`static Document`	`parse(File in, String charsetName)` Parse the contents of a file as HTML.
`static Document`	`parse(File in, String charsetName, String baseUri)` Parse the contents of a file as HTML.
`static Document`	`parse(InputStream in, String charsetName, String baseUri)` Read an input stream, and parse it to a Document.
`static Document`	`parse(InputStream in, String charsetName, String baseUri, Parser parser)` Read an input stream, and parse it to a Document.
`static Document`	`parse(String html)` Parse HTML into a Document.
`static Document`	`parse(String html, String baseUri)` Parse HTML into a Document.
`static Document`	`parse(String html, String baseUri, Parser parser)` Parse HTML into a Document, using the provided Parser.
`static Document`	`parse(URL url, int timeoutMillis)` Fetch a URL, and parse it as HTML.
`static Document`	`parseBodyFragment(String bodyHtml)` Parse a fragment of HTML, with the assumption that it forms the `body` of the HTML.
`static Document`	`parseBodyFragment(String bodyHtml, String baseUri)` Parse a fragment of HTML, with the assumption that it forms the `body` of the HTML.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Method Detail

parse

public static Document parse(String html,
                             String baseUri)

Parse HTML into a Document. The parser will make a sensible, balanced document tree out of any HTML.

Parameters:: html - HTML to parse; baseUri - The URL where the HTML was retrieved from. Used to resolve relative URLs to absolute URLs, that occur before the HTML declares a <base href> tag.
Returns:: sane HTML
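
As an illustration of how baseUri affects relative links, the following is a minimal sketch. The class name BaseUriExample and the helper firstLinkAbsUrl are hypothetical names chosen for this example, and the URLs are placeholders.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class; not part of jsoup.
class BaseUriExample {
    // Returns the absolute form of the first link's href,
    // or "" if there is no link or the URL cannot be resolved.
    static String firstLinkAbsUrl(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/about\">About</a></p>";
        // The relative href is resolved against the supplied base URI.
        System.out.println(firstLinkAbsUrl(html, "http://example.com/docs/"));
    }
}
```

The attribute value itself stays relative in the tree; absUrl("href") resolves it against the base URI on request.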

parse

public static Document parse(String html,
                             String baseUri,
                             Parser parser)

Parse HTML into a Document, using the provided Parser. You can provide an alternate parser, such as a simple XML (non-HTML) parser.

Parameters:: html - HTML to parse; baseUri - The URL where the HTML was retrieved from. Used to resolve relative URLs to absolute URLs, that occur before the HTML declares a <base href> tag.; parser - alternate parser to use.
Returns:: sane HTML
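
A sketch of supplying an alternate parser, assuming Parser.xmlParser() is the XML parser meant here. The class XmlParserExample, the helper firstItemName, and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

// Hypothetical example class; not part of jsoup.
class XmlParserExample {
    // Parses the input as XML (no html/head/body normalisation)
    // and returns the text of the first <name> inside an <item>.
    static String firstItemName(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element name = doc.select("item > name").first();
        return name == null ? "" : name.text();
    }

    public static void main(String[] args) {
        String xml = "<catalog><item><name>Widget</name></item>"
                   + "<item><name>Gadget</name></item></catalog>";
        System.out.println(firstItemName(xml));
    }
}
```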

parse

public static Document parse(String html)

Parse HTML into a Document. As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag.

Parameters:: html - HTML to parse
Returns:: sane HTML
See Also:: parse(String, String)
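
The effect of a missing base URI can be sketched as follows. BaseHrefExample and firstLinkAbsUrl are hypothetical names, and the sample markup is an assumption made for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class; not part of jsoup.
class BaseHrefExample {
    // Parses without a base URI and tries to resolve the first link.
    // Returns "" when the link cannot be made absolute.
    static String firstLinkAbsUrl(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        // No <base href>: the relative link has nothing to resolve against.
        System.out.println("[" + firstLinkAbsUrl("<a href=\"/about\">About</a>") + "]");

        // With <base href>: the document itself supplies the base.
        String withBase = "<html><head><base href=\"http://example.com/\"></head>"
                        + "<body><a href=\"/about\">About</a></body></html>";
        System.out.println("[" + firstLinkAbsUrl(withBase) + "]");
    }
}
```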

connect

public static Connection connect(String url)

Creates a new Connection to a URL. Use to fetch and parse an HTML page.

Usage examples:

Document doc = Jsoup.connect("http://example.com").userAgent("Mozilla").data("name", "jsoup").get();
Document doc = Jsoup.connect("http://example.com").cookie("auth", "token").post();

Parameters:: url - URL to connect to. The protocol must be http or https.
Returns:: the connection. You can add data, cookies, and headers; set the user-agent, referrer, method; and then execute.

parse

public static Document parse(File in,
                             String charsetName,
                             String baseUri)
                      throws IOException

Parse the contents of a file as HTML.

Parameters:: in - file to load HTML from; charsetName - (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).; baseUri - The URL where the HTML was retrieved from, to resolve relative links against.
Returns:: sane HTML
Throws:: IOException - if the file could not be found, or read, or if the charsetName is invalid.
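
A minimal sketch of parsing a file with an explicit charset and base URI. The class ParseFileExample, its helpers, the temporary file, and the example.com base URI are all assumptions made for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical example class; not part of jsoup.
class ParseFileExample {
    // Parses the file as UTF-8 and returns the document title.
    static String titleOf(File in) throws IOException {
        Document doc = Jsoup.parse(in, "UTF-8", "http://example.com/");
        return doc.title();
    }

    // Writes a small HTML page to a temporary file and parses it back.
    static String demo() {
        try {
            File tmp = File.createTempFile("jsoup-example", ".html");
            tmp.deleteOnExit();
            String html = "<html><head><title>Hello</title></head><body><p>Hi</p></body></html>";
            Files.write(tmp.toPath(), html.getBytes(StandardCharsets.UTF_8));
            return titleOf(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```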

parse

public static Document parse(File in,
                             String charsetName)
                      throws IOException

Parse the contents of a file as HTML. The location of the file is used as the base URI to qualify relative URLs.

Parameters:: in - file to load HTML from; charsetName - (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
Returns:: sane HTML
Throws:: IOException - if the file could not be found, or read, or if the charsetName is invalid.
See Also:: parse(File, String, String)

parse

public static Document parse(InputStream in,
                             String charsetName,
                             String baseUri)
                      throws IOException

Read an input stream, and parse it to a Document.

Parameters:: in - input stream to read. Make sure to close it after parsing.; charsetName - (optional) character set of the stream contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).; baseUri - The URL where the HTML was retrieved from, to resolve relative links against.
Returns:: sane HTML
Throws:: IOException - if the stream could not be read, or if the charsetName is invalid.
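
Because the caller is responsible for closing the stream, one way to use this overload is inside a try-with-resources block. This is a sketch only: ParseStreamExample and its helpers are hypothetical names, and an in-memory stream stands in for a real source.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class; not part of jsoup.
class ParseStreamExample {
    // Parses the bytes as UTF-8 HTML and returns the first paragraph's text.
    static String firstParagraph(byte[] htmlBytes) throws IOException {
        // try-with-resources closes the stream once parsing is done.
        try (InputStream in = new ByteArrayInputStream(htmlBytes)) {
            Document doc = Jsoup.parse(in, "UTF-8", "http://example.com/");
            Element p = doc.select("p").first();
            return p == null ? "" : p.text();
        }
    }

    // Convenience wrapper that converts the checked exception.
    static String demo(String html) {
        try {
            return firstParagraph(html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("<p>First</p><p>Second</p>"));
    }
}
```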

parse

public static Document parse(InputStream in,
                             String charsetName,
                             String baseUri,
                             Parser parser)
                      throws IOException

Read an input stream, and parse it to a Document. You can provide an alternate parser, such as a simple XML (non-HTML) parser.

Parameters:: in - input stream to read. Make sure to close it after parsing.; charsetName - (optional) character set of the stream contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).; baseUri - The URL where the HTML was retrieved from, to resolve relative links against.; parser - alternate parser to use.
Returns:: sane HTML
Throws:: IOException - if the stream could not be read, or if the charsetName is invalid.

parseBodyFragment

public static Document parseBodyFragment(String bodyHtml,
                                         String baseUri)

Parse a fragment of HTML, with the assumption that it forms the body of the HTML.

Parameters:: bodyHtml - body HTML fragment; baseUri - URL to resolve relative URLs against.
Returns:: sane HTML document
See Also:: Document.body()

parseBodyFragment

public static Document parseBodyFragment(String bodyHtml)

Parse a fragment of HTML, with the assumption that it forms the body of the HTML.

Parameters:: bodyHtml - body HTML fragment
Returns:: sane HTML document
See Also:: Document.body()
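
A short sketch of how a fragment ends up inside a full document shell, reachable through Document.body(). BodyFragmentExample and its helpers are hypothetical names; the unclosed tags in the sample input are deliberate.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class; not part of jsoup.
class BodyFragmentExample {
    // Returns the text of the first <p> found in the parsed fragment's body.
    static String paragraphText(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        Element p = doc.body().select("p").first();
        return p == null ? "" : p.text();
    }

    // Returns how many direct child elements the body received.
    static int bodyChildCount(String fragment) {
        return Jsoup.parseBodyFragment(fragment).body().children().size();
    }

    public static void main(String[] args) {
        String fragment = "<p>Hello <b>world"; // unclosed tags are balanced by the parser
        System.out.println(paragraphText(fragment));
        System.out.println(bodyChildCount(fragment));
        // The balanced markup can be inspected with body().html().
        System.out.println(Jsoup.parseBodyFragment(fragment).body().html());
    }
}
```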

parse

public static Document parse(URL url,
                             int timeoutMillis)
                      throws IOException

Fetch a URL, and parse it as HTML. Provided for compatibility; in most cases use connect(String) instead.

The encoding character set is determined by the content-type header or http-equiv meta tag, or falls back to UTF-8.

Parameters:: url - URL to fetch (with a GET). The protocol must be http or https.; timeoutMillis - Connection and read timeout, in milliseconds. If exceeded, IOException is thrown.
Returns:: The parsed HTML.
Throws:: MalformedURLException - if the request URL is not an HTTP or HTTPS URL, or is otherwise malformed; HttpStatusException - if the response is not OK and HTTP response errors are not ignored; UnsupportedMimeTypeException - if the response mime type is not supported and those errors are not ignored; SocketTimeoutException - if the connection times out; IOException - if a connection or read error occurs
See Also:: connect(String)
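
One possible way to handle the exceptions listed above is sketched below. FetchExample, describeFetch, and the returned message strings are hypothetical; the more specific exceptions are caught before the general IOException because they are its subclasses.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.SocketTimeoutException;
import java.net.URL;

import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.UnsupportedMimeTypeException;

// Hypothetical example class; not part of jsoup.
class FetchExample {
    // Fetches the page and returns its title, or a short description of what went wrong.
    static String describeFetch(String address, int timeoutMillis) {
        try {
            URL url = new URL(address);
            return "title: " + Jsoup.parse(url, timeoutMillis).title();
        } catch (MalformedURLException e) {
            return "malformed URL: " + e.getMessage();
        } catch (HttpStatusException e) {
            return "HTTP error " + e.getStatusCode();
        } catch (UnsupportedMimeTypeException e) {
            return "unsupported content type: " + e.getMimeType();
        } catch (SocketTimeoutException e) {
            return "timed out";
        } catch (IOException e) {
            return "I/O error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A non-http(s) protocol is rejected without any network access.
        System.out.println(describeFetch("ftp://example.com/", 3000));
    }
}
```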

clean

public static String clean(String bodyHtml,
                           String baseUri,
                           Whitelist whitelist)

Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.

Parameters:: bodyHtml - input untrusted HTML (body fragment); baseUri - URL to resolve relative URLs against; whitelist - white-list of permitted HTML elements
Returns:: safe HTML (body fragment)
See Also:: Cleaner.clean(Document)
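
A minimal sketch of cleaning untrusted input, assuming the stock Whitelist.basic() white-list is acceptable for the application. CleanExample, sanitize, and the sample input are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

// Hypothetical example class; not part of jsoup.
class CleanExample {
    // Filters the untrusted body fragment through Whitelist.basic().
    // The base URI lets relative links be resolved before protocol checks.
    static String sanitize(String untrusted, String baseUri) {
        return Jsoup.clean(untrusted, baseUri, Whitelist.basic());
    }

    public static void main(String[] args) {
        String dirty = "<p><a href=\"/profile\" onclick=\"steal()\">Me</a></p>"
                     + "<script>alert(1)</script>";
        // The script element and the onclick attribute are not on the white-list.
        System.out.println(sanitize(dirty, "http://example.com/"));
    }
}
```

Whitelist.basic() also enforces rel="nofollow" on links, so the cleaned output can contain attributes that were not in the input.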

clean

public static String clean(String bodyHtml,
                           Whitelist whitelist)

Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.

Parameters:: bodyHtml - input untrusted HTML (body fragment); whitelist - white-list of permitted HTML elements
Returns:: safe HTML (body fragment)
See Also:: Cleaner.clean(Document)

clean

public static String clean(String bodyHtml,
                           String baseUri,
                           Whitelist whitelist,
                           Document.OutputSettings outputSettings)

Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted tags and attributes.

Parameters:: bodyHtml - input untrusted HTML (body fragment); baseUri - URL to resolve relative URLs against; whitelist - white-list of permitted HTML elements; outputSettings - document output settings; use to control pretty-printing and entity escape modes
Returns:: safe HTML (body fragment)
See Also:: Cleaner.clean(Document)
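
As a sketch of the outputSettings parameter, the following turns off pretty-printing so the cleaned fragment comes back without added line breaks. CleanOutputSettingsExample and sanitizeCompact are hypothetical names, and the choice of Whitelist.relaxed() is an assumption.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist;

// Hypothetical example class; not part of jsoup.
class CleanOutputSettingsExample {
    // Cleans the input and serialises the result without pretty-printing.
    static String sanitizeCompact(String untrusted) {
        Document.OutputSettings settings = new Document.OutputSettings().prettyPrint(false);
        return Jsoup.clean(untrusted, "", Whitelist.relaxed(), settings);
    }

    public static void main(String[] args) {
        String dirty = "<div><p>One</p><p>Two</p></div><script>x()</script>";
        System.out.println(sanitizeCompact(dirty));
    }
}
```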

isValid

public static boolean isValid(String bodyHtml,
                              Whitelist whitelist)

Test if the input HTML has only tags and attributes allowed by the Whitelist. Useful for form validation. The input HTML should still be run through the cleaner to set up enforced attributes, and to tidy the output.

Parameters:: bodyHtml - HTML to test; whitelist - whitelist to test against
Returns:: true if no tags or attributes were removed; false otherwise
See Also:: clean(String, org.jsoup.safety.Whitelist)
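
A small sketch of validating form input and then cleaning it anyway, as recommended above. IsValidExample and its helpers are hypothetical names, and Whitelist.basic() is an assumed policy.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

// Hypothetical example class; not part of jsoup.
class IsValidExample {
    // True if the input contains only tags and attributes permitted by Whitelist.basic().
    static boolean isAcceptable(String html) {
        return Jsoup.isValid(html, Whitelist.basic());
    }

    // Rejects invalid input; valid input is still cleaned so that
    // enforced attributes are applied and the output is tidied.
    static String acceptOrReject(String html) {
        if (!isAcceptable(html)) {
            return null; // caller would report a validation error
        }
        return Jsoup.clean(html, Whitelist.basic());
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("<b>bold</b> text"));
        System.out.println(isAcceptable("<p onclick=\"x()\">hi</p>"));
        System.out.println(acceptOrReject("<script>alert(1)</script>"));
    }
}
```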
