Browser

trait Browser

A client able to retrieve and parse HTML pages from the web and from local resources.

An implementation of Browser can fetch pages via HTTP GET or POST requests, parse the downloaded page, and return a net.ruippeixotog.scalascraper.model.Document instance, which can be queried through the scraper DSL or through its own methods.

Different Browser implementations can model pages with different runtime behavior. For example, some browsers only parse the HTML content of a page without executing any of its scripts, while others run JavaScript and produce Document instances with dynamic content. Consult the documentation of each implementation for the semantics of its Document and net.ruippeixotog.scalascraper.model.Element implementations.
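As a minimal sketch of the workflow described above, the following script-style snippet (suitable for the Scala REPL or a `.sc` file) assumes the library's jsoup-backed JsoupBrowser implementation and its DSL imports. The HTML markup is made up for illustration.

```scala
import net.ruippeixotog.scalascraper.browser.{Browser, JsoupBrowser}
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

// Assumption: JsoupBrowser is the implementation in use; it parses static
// HTML and does not execute scripts.
val browser: Browser = JsoupBrowser()

// Parse an in-memory page (illustrative markup only).
val doc = browser.parseString(
  "<html><head><title>Demo</title></head><body><h1>Hello</h1></body></html>")

// Query the document through the scraper DSL...
val heading: String = doc >> text("h1")

// ...or through the Document's own methods.
val title: String = doc.title
```

Swapping in a different Browser implementation should leave the querying code unchanged, since it only relies on the Document interface.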

Supertypes

class Object
trait Matchable
class Any

Type members

Types

type DocumentType <: Document

The concrete type of documents created by this browser.

Value members

Abstract methods

def clearCookies(): Unit

Clears the cookie store of this browser.

def cookies(url: String): Map[String, String]

Returns the current set of cookies stored in this browser for a given URL.

Value parameters:
url: the URL whose stored cookies are to be returned

Returns:

a mapping of cookie names to their respective values.
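The two cookie methods can be combined as in the script-style sketch below. It assumes the JsoupBrowser implementation; `cookiesAfterVisit` is a hypothetical helper, and whether a given server sets any cookies depends on the site being visited.

```scala
import net.ruippeixotog.scalascraper.browser.{Browser, JsoupBrowser}

// Hypothetical helper: visit a page, then report the cookies now stored
// for that URL. It is defined here but not called, to avoid network access.
def cookiesAfterVisit(browser: Browser, url: String): Map[String, String] = {
  browser.get(url)
  browser.cookies(url)
}

val browser = JsoupBrowser()

// Empty the cookie store, then read it back for a placeholder URL.
browser.clearCookies()
val remaining: Map[String, String] = browser.cookies("https://example.com")
```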

def get(url: String): DocumentType

Retrieves and parses a web page using a GET request.

Value parameters:
url: the URL of the page to retrieve

Returns:

a Document containing the retrieved web page.

def parseFile(file: File, charset: String): DocumentType

Parses a local HTML file with a specified charset.

Value parameters:
file: the HTML file to parse
charset: the charset of the file

Returns:

a Document containing the parsed web page.

def parseInputStream(inputStream: InputStream, charset: String): DocumentType

Parses an input stream whose content is encoded in the specified charset. The provided input stream is always closed before this method returns or throws an exception.

Value parameters:
inputStream: the input stream to parse
charset: the charset of the input stream content

Returns:

a Document containing the parsed web page.
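The script-style sketch below illustrates both the charset parameter and the closing guarantee. It assumes the JsoupBrowser implementation; the page content and the close-tracking subclass are illustrative additions.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import net.ruippeixotog.scalascraper.browser.JsoupBrowser

val browser = JsoupBrowser()

// Illustrative page, encoded as ISO-8859-1 rather than UTF-8.
val bytes = "<html><head><title>Café</title></head><body></body></html>"
  .getBytes(StandardCharsets.ISO_8859_1)

// Track close() calls so the documented guarantee can be observed.
var closed = false
val in = new ByteArrayInputStream(bytes) {
  override def close(): Unit = { closed = true; super.close() }
}

// The charset passed here must match the encoding of the stream content.
val doc = browser.parseInputStream(in, "ISO-8859-1")
val title: String = doc.title
```

Because the browser closes the stream itself, callers should not need a separate try/finally around the call for that purpose.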

def parseString(html: String): DocumentType

Parses an HTML string.

Value parameters:
html: the HTML string to parse

Returns:

a Document containing the parsed web page.

def post(url: String, form: Map[String, String]): DocumentType

Submits a form via a POST request and parses the resulting page.

Value parameters:
url: the URL to which the form is submitted
form: a map containing the form fields to submit with their respective values

Returns:

a Document containing the resulting web page.
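A hedged sketch of how get and post might be used together follows. `searchSite` is a hypothetical helper, and the URLs and the form field name `q` are placeholders rather than part of the library. The function is only defined, not called, since running it would require network access.

```scala
import net.ruippeixotog.scalascraper.browser.Browser
import net.ruippeixotog.scalascraper.model.Document

// Hypothetical helper: the URLs and form field names are placeholders.
def searchSite(browser: Browser, query: String): (Document, Document) = {
  // GET: retrieve and parse the landing page.
  val home = browser.get("https://example.com/")

  // POST: submit form fields as a name -> value map and parse the response.
  val results = browser.post("https://example.com/search", Map("q" -> query))

  (home, results)
}
```

Both calls return the browser's DocumentType, which is a subtype of Document, so the results can be returned under the general Document type.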

def userAgent: String

The user agent used by this browser to retrieve HTML pages from the web.

def withProxy(proxy: Proxy): Browser

Returns a new browser that uses the provided proxy for all connections.

Concrete methods

def parseFile(file: File): DocumentType

Parses a local HTML file encoded in UTF-8.

Value parameters:
file: the HTML file to parse

Returns:

a Document containing the parsed web page.

def parseFile(path: String, charset: String): DocumentType

Parses a local HTML file with a specified charset.

Value parameters:
path: the path in the local filesystem where the HTML file is located
charset: the charset of the file

Returns:

a Document containing the parsed web page.

def parseFile(path: String): DocumentType

Parses a local HTML file encoded in UTF-8.

Value parameters:
path: the path in the local filesystem where the HTML file is located

Returns:

a Document containing the parsed web page.
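The parseFile overloads can be compared side by side. This script-style sketch assumes the JsoupBrowser implementation and writes a throwaway temporary file so that it does not depend on any existing path.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import net.ruippeixotog.scalascraper.browser.JsoupBrowser

val browser = JsoupBrowser()

// Create a temporary HTML file with illustrative content.
val path = Files.createTempFile("page", ".html")
Files.write(
  path,
  "<html><head><title>Local</title></head><body></body></html>"
    .getBytes(StandardCharsets.UTF_8))

val fromFile = browser.parseFile(path.toFile)            // File, read as UTF-8
val fromPath = browser.parseFile(path.toString)          // path string, read as UTF-8
val explicit = browser.parseFile(path.toString, "UTF-8") // path string, explicit charset

// Remove the temporary file once parsing is done.
Files.delete(path)
```

The one-argument overloads are a convenience for UTF-8 content; a file in another encoding needs one of the overloads that takes a charset.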

def parseResource(name: String, charset: String): DocumentType

Parses a resource with a specified charset.

Value parameters:
name: the name of the resource to parse
charset: the charset of the resource

Returns:

a Document containing the parsed web page.