gander.extractors

StandardContentExtractor

object StandardContentExtractor extends ContentExtractor

Created by Jim Plush User: jim Date: 8/15/11

Linear Supertypes
ContentExtractor, AnyRef, Any

Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. val ARROWS_SPLITTER: StringSplitter

    Definition Classes
    ContentExtractor
  7. val A_REL_TAG_SELECTOR: String

    Definition Classes
    ContentExtractor
  8. val COLON_SPLITTER: StringSplitter

    Definition Classes
    ContentExtractor
  9. val DASH_SPLITTER: StringSplitter

    Definition Classes
    ContentExtractor
  10. val ESCAPED_FRAGMENT_REPLACEMENT: StringReplacement

    Definition Classes
    ContentExtractor
  11. val MOTLEY_REPLACEMENT: StringReplacement

    Definition Classes
    ContentExtractor
  12. val NO_STRINGS: Set[String]

    Definition Classes
    ContentExtractor
  13. val PIPE_SPLITTER: StringSplitter

    Definition Classes
    ContentExtractor
  14. val SPACE_SPLITTER: StringSplitter

    Definition Classes
    ContentExtractor
  15. val TITLE_REPLACEMENTS: ReplaceSequence

    Definition Classes
    ContentExtractor
  16. val TOP_NODE_TAGS: TagsEvaluator

    Definition Classes
    ContentExtractor
  17. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  18. def calculateBestNodeBasedOnClustering(doc: Document): Option[Element]

    Looks for clusters of paragraphs. A cluster is scored on the number of stopwords it contains and the number of consecutive paragraphs it groups together, which should identify the block of text this node surrounds. The score also takes into account how high up the page the paragraphs are: comments are usually at the bottom and should get a lower score.

    // todo refactor this long method

    returns

    Definition Classes
    ContentExtractor
  19. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  20. def doTitleSplits(title: String, splitter: StringSplitter): String

    Based on a delimiter in the title, takes the longest piece or applies custom logic based on the site.

    title
    splitter
    returns

    Definition Classes
    ContentExtractor
  21. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  22. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  23. def extractTags(doc: Document): Set[String]

    Definition Classes
    ContentExtractor
  24. def extractVideos(node: Element): List[Element]

    Pulls out videos we like.

    returns

    Definition Classes
    ContentExtractor
  25. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  26. def getCanonicalLink(doc: Document, finalUrl: String): String

    If the article has a meta canonical link set, use that.

    Definition Classes
    ContentExtractor
  27. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  28. def getDomain(url: String): String

    Definition Classes
    ContentExtractor
  29. def getLogger(): Logger

    Definition Classes
    ContentExtractor
  30. def getMetaDescription(doc: Document): String

    If the article has a meta description set in the source, use that.

    Definition Classes
    ContentExtractor
  31. def getMetaKeywords(doc: Document): String

    If the article has meta keywords set in the source, use that.

    Definition Classes
    ContentExtractor
  32. def getShortText(e: String, max: Int): String

    Definition Classes
    ContentExtractor
  33. def getSiblingContent(currentSibling: Element, baselineScoreForSiblingParagraphs: Int): Option[String]

    Adds any siblings that may have a decent score to this node.

    currentSibling
    returns

    Definition Classes
    ContentExtractor
  34. def getTitle(doc: Document): String

    Definition Classes
    ContentExtractor
  35. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  36. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  37. def isNodeScoreThreshholdMet(node: Element, e: Element): Boolean

    Definition Classes
    ContentExtractor
  38. def isTableTagAndNoParagraphsExist(e: Element): Boolean

    Definition Classes
    ContentExtractor
  39. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  40. final def notify(): Unit

    Definition Classes
    AnyRef
  41. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  42. def postExtractionCleanup(targetNode: Element): Element

    Removes any divs that look like non-content, clusters of links, or paragraphs with no gusto.

    targetNode
    returns

    Definition Classes
    ContentExtractor
  43. def printTraceLog(topNode: Element): Unit

    Definition Classes
    ContentExtractor
  44. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  45. def toString(): String

    Definition Classes
    AnyRef → Any
  46. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  47. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  48. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  49. def walkSiblings[T](node: Element)(work: (Element) ⇒ T): Seq[T]

    Definition Classes
    ContentExtractor
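
The title-splitting rule described for doTitleSplits (split the title on a delimiter and keep the longest piece) can be sketched as follows. This is a minimal illustration, not the library's implementation: `longestPiece` is a hypothetical helper, and a plain string delimiter stands in for `StringSplitter`, whose API is not shown on this page. The site-specific custom logic is left out.

```scala
import java.util.regex.Pattern

object TitleSplitSketch {
  // Hypothetical helper: split `title` on a literal delimiter and keep the longest piece.
  // Falls back to the trimmed title when splitting yields no non-empty piece.
  def longestPiece(title: String, delimiter: String): String = {
    val pieces = title
      .split(Pattern.quote(delimiter)) // quote so "|" is not treated as regex alternation
      .map(_.trim)
      .filter(_.nonEmpty)
    if (pieces.isEmpty) title.trim else pieces.maxBy(_.length)
  }
}
```

For a title such as "Acme News | Rocket sled sets new record" with the delimiter "|", the sketch keeps "Rocket sled sets new record", on the assumption that the site name is usually the shorter piece.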
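
The scoring idea described for calculateBestNodeBasedOnClustering (stopword counts, runs of consecutive paragraphs, and a lower score further down the page) can be sketched on plain strings. Everything here is an assumption made for illustration: the tiny stopword list, the `minStopwords` threshold, the `Cluster` type, and the scoring formula are hypothetical, and the real method works on `Document` and `Element` values rather than strings.

```scala
object ClusterScoreSketch {
  // Tiny illustrative stopword list (an assumption; a real list is much longer).
  val stopwords: Set[String] =
    Set("the", "a", "and", "of", "to", "in", "is", "it", "that", "was")

  def stopwordCount(text: String): Int =
    text.toLowerCase.split("\\W+").count(stopwords.contains)

  // A run of consecutive text-like paragraphs: index of the first paragraph
  // and the stopword count of each paragraph in the run.
  final case class Cluster(start: Int, counts: Vector[Int])

  // Group consecutive paragraphs whose stopword count reaches `minStopwords`.
  def clusters(paragraphs: Vector[String], minStopwords: Int = 2): Vector[Cluster] =
    paragraphs.map(stopwordCount).zipWithIndex.foldLeft(Vector.empty[Cluster]) {
      case (acc, (count, idx)) =>
        if (count < minStopwords) acc
        else acc.lastOption match {
          // The previous cluster ends right before this paragraph: extend it.
          case Some(c) if c.start + c.counts.size == idx =>
            acc.init :+ c.copy(counts = c.counts :+ count)
          case _ => acc :+ Cluster(idx, Vector(count))
        }
    }

  // Assumed formula: total stopwords times run length, scaled down the
  // further from the top of the page the cluster starts.
  def score(c: Cluster, totalParagraphs: Int): Double = {
    val positionWeight = 1.0 - c.start.toDouble / math.max(totalParagraphs, 1)
    c.counts.sum * c.counts.size * positionWeight
  }

  def best(paragraphs: Vector[String]): Option[Cluster] = {
    val found = clusters(paragraphs)
    if (found.isEmpty) None else Some(found.maxBy(score(_, paragraphs.size)))
  }
}
```

Under these assumptions, a two-paragraph run near the top of the page outranks a single comment-like paragraph at the bottom, which is the behaviour the description asks for.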
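
The description of getSiblingContent (add siblings that have a decent score) suggests a comparison against the baseline score it receives. The sketch below shows one way such a check could look. It is an assumption, not the library's logic: `siblingContent`, the `fraction` cutoff, the stopword-based scoring, and the use of plain strings instead of `Element` are all hypothetical.

```scala
object SiblingContentSketch {
  // Tiny illustrative stopword list (an assumption; a real list is much longer).
  val stopwords: Set[String] =
    Set("the", "a", "and", "of", "to", "in", "is", "it", "that", "was")

  def stopwordCount(text: String): Int =
    text.toLowerCase.split("\\W+").count(stopwords.contains)

  // Keep the sibling's paragraphs whose stopword count reaches an assumed
  // fraction of the baseline score; return None when nothing qualifies.
  def siblingContent(
      siblingParagraphs: Seq[String],
      baselineScore: Int,
      fraction: Double = 0.3 // hypothetical cutoff, chosen for illustration
  ): Option[String] = {
    val cutoff = baselineScore * fraction
    val kept = siblingParagraphs.filter(p => stopwordCount(p) >= cutoff)
    if (kept.isEmpty) None else Some(kept.mkString("\n"))
  }
}
```

Returning an `Option[String]` mirrors the documented signature: a sibling with nothing worth keeping contributes `None` rather than an empty string.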
