Class

gander

Article

Related Doc: package gander

final case class Article(title: String, cleanedArticleText: Option[String], metaDescription: String, metaKeywords: String, canonicalLink: String, domain: String, topNode: Option[Element], topImage: Option[Image], tags: Set[String], movies: List[Element], finalUrl: String, linkHash: String, rawHtml: String, doc: Document, rawDoc: Document, publishDate: Option[DateTime], additionalData: Map[String, String], openGraphData: OpenGraphData) extends Product with Serializable

An article extracted from a web page, together with its metadata.

title

the title of the article

cleanedArticleText

stores the lovely, pure text from the article, stripped of HTML, formatting, etc.: just raw text with paragraphs separated by newlines. This is probably what you want to use.

metaDescription

the meta description field in the HTML source

metaKeywords

the meta keywords field in the HTML source

canonicalLink

the canonical link of this article, if found in the metadata

domain

the domain of the article being parsed

topNode

holds the top Element we think is a candidate for the main body of the article

topImage

holds the top Image object that we think represents this article

tags

holds a set of tags that may have been in the article; these are not the meta keywords

movies

holds a list of any movies found on the page, such as YouTube or Vimeo videos

finalUrl

stores the final URL that content is fetched from; this is expanded if any escaped fragments were found in the starting URL

linkHash

stores the MD5 hash of the URL to use for various identification tasks

rawHtml

stores the RAW HTML straight from the network connection

doc

the JSoup Document object

rawDoc

the original JSoup Document, built from the original HTML without any cleaning applied

publishDate

the publish date of the article, when it could be determined

additionalData

A property bucket for consumers of goose to store custom data extractions. This is populated by an implementation of goose.extractors.AdditionalDataExtractor, which is executed before document cleansing within goose.CrawlingActor#crawl.

openGraphData

Facebook Open Graph data found in the article's meta tags
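
Several of the fields above are optional, cleanedArticleText in particular, so callers have to handle the case where extraction produced nothing. The following is a minimal sketch of that handling. It does not use the real Article class: ArticleSummary, firstParagraph and describe are hypothetical names introduced here so the example compiles without gander, JSoup or Joda-Time on the classpath. Only the field names and their types mirror the signature documented above.

```scala
// Hypothetical stand-in that mirrors two of Article's fields with stdlib types.
// It is NOT part of gander; it exists only to keep this sketch self-contained.
final case class ArticleSummary(
    title: String,
    cleanedArticleText: Option[String]
)

object ArticleSummaryExample {

  /** First non-blank paragraph of the cleaned text, if any text was extracted.
    * Relies on the documented convention that paragraphs are separated by newlines.
    */
  def firstParagraph(a: ArticleSummary): Option[String] =
    a.cleanedArticleText
      .flatMap(_.split("\n").map(_.trim).find(_.nonEmpty))

  /** One-line description that tolerates a missing body. */
  def describe(a: ArticleSummary): String = {
    val body = firstParagraph(a).getOrElse("<no text extracted>")
    s"${a.title}: $body"
  }
}
```

The same pattern (map, flatMap or getOrElse on the Option) should carry over to the other optional fields such as topNode, topImage and publishDate.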

Linear Supertypes
Serializable, Serializable, Product, Equals, AnyRef, Any

Instance Constructors

  1. new Article(title: String, cleanedArticleText: Option[String], metaDescription: String, metaKeywords: String, canonicalLink: String, domain: String, topNode: Option[Element], topImage: Option[Image], tags: Set[String], movies: List[Element], finalUrl: String, linkHash: String, rawHtml: String, doc: Document, rawDoc: Document, publishDate: Option[DateTime], additionalData: Map[String, String], openGraphData: OpenGraphData)

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. val additionalData: Map[String, String]

    A property bucket for consumers of goose to store custom data extractions. This is populated by an implementation of goose.extractors.AdditionalDataExtractor, which is executed before document cleansing within goose.CrawlingActor#crawl.

  5. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  6. val canonicalLink: String

    the canonical link of this article, if found in the metadata

  7. val cleanedArticleText: Option[String]

    stores the lovely, pure text from the article, stripped of HTML, formatting, etc.: just raw text with paragraphs separated by newlines. This is probably what you want to use.

  8. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  9. val doc: Document

    the JSoup Document object

  10. val domain: String

    the domain of the article being parsed

  11. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  12. val finalUrl: String

    stores the final URL that content is fetched from; this is expanded if any escaped fragments were found in the starting URL

  13. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  14. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  15. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  16. val linkHash: String

    stores the MD5 hash of the URL to use for various identification tasks

  17. val metaDescription: String

    the meta description field in the HTML source

  18. val metaKeywords: String

    the meta keywords field in the HTML source

  19. val movies: List[Element]

    holds a list of any movies found on the page, such as YouTube or Vimeo videos

  20. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  21. final def notify(): Unit

    Definition Classes
    AnyRef
  22. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  23. val openGraphData: OpenGraphData

    Facebook Open Graph data found in the article's meta tags

  24. val publishDate: Option[DateTime]

    the publish date of the article, when it could be determined

  25. val rawDoc: Document

    the original JSoup Document, built from the original HTML without any cleaning applied

  26. val rawHtml: String

    stores the RAW HTML straight from the network connection

  27. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  28. val tags: Set[String]

    holds a set of tags that may have been in the article; these are not the meta keywords

  29. val title: String

    the title of the article

  30. val topImage: Option[Image]

    holds the top Image object that we think represents this article

  31. val topNode: Option[Element]

    holds the top Element we think is a candidate for the main body of the article

  32. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  33. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  34. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
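
The linkHash member listed above is documented only as an MD5 hash of the URL. As a rough illustration of what such a value looks like, the sketch below computes a lowercase hexadecimal MD5 digest with the JDK's MessageDigest. LinkHashSketch and md5Hex are hypothetical names, and the UTF-8 encoding and hex formatting are assumptions: this page does not say how the library normalizes or encodes the URL before hashing, so the result may not match the linkHash of a real Article.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Hypothetical helper, not part of gander.
object LinkHashSketch {

  /** Lowercase hex MD5 digest of the UTF-8 bytes of `url` (encoding is an assumption). */
  def md5Hex(url: String): String = {
    val digest = MessageDigest
      .getInstance("MD5")
      .digest(url.getBytes(StandardCharsets.UTF_8))
    // Mask each signed byte to 0..255 and render it as two hex digits.
    digest.map(b => f"${b & 0xff}%02x").mkString
  }
}
```

A digest like this is suitable as an identifier or cache key for a URL; MD5 should not be relied on for anything security-sensitive.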
