Article

An article

title: title of the article
cleanedArticleText: stores the plain text of the article, stripped of HTML and formatting: just raw text with paragraphs separated by newlines. This is probably what you want to use.
metaDescription: the meta description field in the HTML source
metaKeywords: the meta keywords field in the HTML source
canonicalLink: canonical link of this article, if found in the meta data
domain: domain of the article we're parsing
topNode: holds the top Element we think is a candidate for the main body of the article
topImage: holds the top Image object that we think represents this article
tags: holds a set of tags that may have been in the article; these are not meta keywords
movies: holds a list of any movies we found on the page, such as YouTube or Vimeo embeds
finalUrl: stores the final URL that we're going to try to fetch content against; this would be expanded if any escaped fragments were found in the starting URL
linkHash: stores the MD5 hash of the URL to use for various identification tasks
rawHtml: stores the raw HTML straight from the network connection
doc: the JSoup Document object
rawDoc: the original JSoup Document, built from the original HTML without any cleaning applied to it
publishDate: the publish date of the article, when it could be determined
additionalData: a property bucket for consumers of goose to store custom data extractions. This is populated by an implementation of goose.extractors.AdditionalDataExtractor, which is executed before document cleansing within goose.CrawlingActor#crawl.
openGraphData: Facebook Open Graph data found in the article's meta tags
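
Several of these fields are Option values, so callers have to handle the case where nothing was extracted. A minimal sketch of that handling follows. ArticleSummary is a hypothetical stand-in, not part of the library: it carries only three of the fields above, and publishDate is simplified to Option[String] so the example has no dependencies.

```scala
// Hypothetical stand-in for Article: only three of its fields, with
// publishDate simplified to Option[String] to keep the sketch dependency-free.
final case class ArticleSummary(
    title: String,
    cleanedArticleText: Option[String],
    publishDate: Option[String]
)

object ArticleSummary {
  // Split the cleaned text into non-empty paragraphs.
  // A missing text (None) yields an empty list rather than an error.
  def paragraphs(a: ArticleSummary): List[String] =
    a.cleanedArticleText
      .map(_.split("\n").toList.map(_.trim).filter(_.nonEmpty))
      .getOrElse(Nil)

  // Build a one-line description that tolerates a missing publish date.
  def headline(a: ArticleSummary): String =
    a.publishDate.fold(a.title)(d => s"${a.title} ($d)")
}
```

The same map/getOrElse and fold patterns should carry over to topNode and topImage, which are also Option values.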

Linear Supertypes

Serializable, Serializable, Product, Equals, AnyRef, Any

Instance Constructors

new Article(title: String, cleanedArticleText: Option[String], metaDescription: String, metaKeywords: String, canonicalLink: String, domain: String, topNode: Option[Element], topImage: Option[Image], tags: Set[String], movies: List[Element], finalUrl: String, linkHash: String, rawHtml: String, doc: Document, rawDoc: Document, publishDate: Option[DateTime], additionalData: Map[String, String], openGraphData: OpenGraphData)

Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
val additionalData: Map[String, String]

A property bucket for consumers of goose to store custom data extractions. This is populated by an implementation of goose.extractors.AdditionalDataExtractor which is executed before document cleansing within goose.CrawlingActor#crawl
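
Since additionalData is a plain Map[String, String], reading from it is ordinary map access. A small sketch, with made-up contents: the keys "author" and "section" are hypothetical, and the real keys depend on whichever AdditionalDataExtractor implementation populated the map.

```scala
object AdditionalDataExample {
  // Hypothetical contents; real keys depend on the extractor in use.
  val additionalData: Map[String, String] = Map("author" -> "Jane Doe")

  // get returns an Option, so a missing key is not an error.
  val author: Option[String] = additionalData.get("author")

  // getOrElse supplies a fallback for keys the extractor did not set.
  val section: String = additionalData.getOrElse("section", "unknown")
}
```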
final def asInstanceOf[T0]: T0

Definition Classes
Any
val canonicalLink: String

of this article if found in the meta data
val cleanedArticleText: Option[String]

stores the plain text of the article, stripped of HTML and formatting: just raw text with paragraphs separated by newlines. This is probably what you want to use.
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
val doc: Document

the JSoup Document object
val domain: String

of this article we're parsing
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
val finalUrl: String

stores the final URL that we're going to try to fetch content against; this would be expanded if any escaped fragments were found in the starting URL
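
The description does not spell out what the expansion looks like. The sketch below assumes "escaped fragments" refers to the "#!" (hash-bang) convention of the AJAX crawling scheme, where the fragment is moved into an _escaped_fragment_ query parameter. EscapedFragment and expand are hypothetical names, and the library's actual rewriting may differ (for example, in how it percent-encodes the fragment, which this sketch does not do).

```scala
object EscapedFragment {
  // Rewrite "#!fragment" into an "_escaped_fragment_=fragment" query parameter.
  // URLs without "#!" are returned unchanged.
  def expand(url: String): String = {
    val idx = url.indexOf("#!")
    if (idx < 0) url
    else {
      val base = url.substring(0, idx)
      val fragment = url.substring(idx + 2)
      // Append to an existing query string if there is one.
      val sep = if (base.contains("?")) "&" else "?"
      s"$base${sep}_escaped_fragment_=$fragment"
    }
  }
}
```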
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
val linkHash: String

stores the MD5 hash of the url to use for various identification tasks
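
For consumers who need to reproduce or compare such a hash, an MD5 hex digest of a URL can be computed with the JDK alone. This is a sketch under assumptions: LinkHash and md5Hex are hypothetical names, and the page does not state which URL (starting or final), which character encoding, or which output format the library uses, so UTF-8 and lowercase hex are guesses.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object LinkHash {
  // MD5 digest of the string's UTF-8 bytes, rendered as 32 lowercase hex characters.
  def md5Hex(s: String): String =
    MessageDigest
      .getInstance("MD5")
      .digest(s.getBytes(StandardCharsets.UTF_8))
      .map(b => "%02x".format(b & 0xff))
      .mkString
}
```

MD5 is adequate here only as an identifier; it should not be relied on for anything security-sensitive.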
val metaDescription: String

description field in HTML source
val metaKeywords: String

field in the HTML source
val movies: List[Element]

holds a list of any movies we found on the page like youtube, vimeo
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
val openGraphData: OpenGraphData

Facebook Open Graph data found in the article's meta tags
val publishDate: Option[DateTime]

the publish date of the article, when it could be determined
val rawDoc: Document

the original JSoup Document, built from the original HTML without any cleaning applied to it
val rawHtml: String

stores the RAW HTML straight from the network connection
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
val tags: Set[String]

holds a set of tags that may have been in the article, these are not meta keywords
val title: String

of the article
val topImage: Option[Image]

holds the top Image object that we think represents this article
val topNode: Option[Element]

holds the top Element we think is a candidate for the main body of the article
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )

Related Doc: package gander
