
class HTMLReader extends Serializable

Class to parse and read HTML files.

Linear Supertypes
Serializable, Serializable, AnyRef, Any

Instance Constructors

  1. new HTMLReader(titleFontSize: Int = 16, storeContent: Boolean = false, timeout: Int = 0, headers: Map[String, String] = Map.empty)

    titleFontSize

    Minimum font size threshold used as part of heuristic rules to detect title elements based on formatting (e.g., bold, centered, capitalized). By default, it is set to 16.

    storeContent

    Whether to include the raw file content in the output DataFrame as a separate 'content' column, alongside the structured output. By default, it is set to false.

    timeout

    Timeout value in seconds for reading remote HTML resources. Applied when fetching content from URLs. By default, it is set to 0.

    headers

    HTTP headers to send with the URL request. By default, no headers are set.

    The reader supports two types of input path:

    htmlPath: a path to an HTML file or to a directory of HTML files, e.g. "path/html/files"

    url: the URL, or set of URLs, of a website, e.g. "https://www.wikipedia.org"
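The constructor parameters above can be combined using named arguments. The following is a minimal sketch, not the library's reference usage: the import path `com.johnsnowlabs.reader.HTMLReader` is an assumption (the package name is not shown on this page), and the header values are hypothetical placeholders.

```scala
// Assumed package; adjust the import if HTMLReader lives elsewhere in your version.
import com.johnsnowlabs.reader.HTMLReader

// Hypothetical header values; replace them with whatever the target site requires.
val requestHeaders = Map(
  "User-Agent" -> "Mozilla/5.0 (compatible; docs-example)",
  "Accept-Language" -> "en-US"
)

val configuredReader = new HTMLReader(
  titleFontSize = 12,      // treat text of 12pt and above as title candidates
  storeContent = true,     // also keep the raw file content in a 'content' column
  timeout = 30,            // seconds to wait when fetching remote HTML
  headers = requestHeaders // sent with each URL request
)

// Fetching a URL requires network access and a working Spark environment.
val pageDF = configuredReader.read("https://www.wikipedia.org")
pageDF.printSchema()
```

Any parameter left out keeps the default shown in the constructor signature.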

    Example

    val path = "./html-files/fake-html.html"
    val htmlReader = new HTMLReader()
    val htmlDF = htmlReader.read(path)
    htmlDF.show()
    +--------------------+--------------------+
    |                path|                html|
    +--------------------+--------------------+
    |file:/content/htm...|[{Title, My First...|
    +--------------------+--------------------+
    
    htmlDF.printSchema()
    root
     |-- path: string (nullable = true)
     |-- html: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- elementType: string (nullable = true)
     |    |    |-- content: string (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)

    For more examples please refer to this notebook.
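Because the `html` column is an array of structs, it is often easier to work with one row per element. The sketch below shows one way to flatten it with standard Spark SQL functions. To stay independent of any HTML file, it builds a small DataFrame by hand with the same shape as the schema above; the `Element` case class, the sample rows, the `NarrativeText` type and the `pageNumber` metadata key are hypothetical stand-ins, not values taken from the library. It is written for `spark-shell` or a notebook, where case classes can be declared inline.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical stand-in mirroring the element struct in the schema above.
case class Element(elementType: String, content: String, metadata: Map[String, String])

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("html-elements-sketch")
  .getOrCreate()
import spark.implicits._

// Hand-built rows shaped like the reader output (sample values only).
val sampleDF = Seq(
  ("file:/tmp/sample.html", Seq(
    Element("Title", "Sample heading", Map("pageNumber" -> "1")),
    Element("NarrativeText", "Sample paragraph.", Map("pageNumber" -> "1"))
  ))
).toDF("path", "html")

// One row per parsed element, with struct fields promoted to top-level columns.
val elements = sampleDF
  .select(col("path"), explode(col("html")).as("element"))
  .select(col("path"), col("element.elementType"), col("element.content"))

// Keep only the elements the reader classified as titles.
val titles = elements.filter(col("elementType") === "Title")
titles.show(truncate = false)
```

The same `select`/`explode` chain should apply to a DataFrame returned by `read`, provided its array column has the name and shape shown in the schema.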

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  6. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  7. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  8. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  9. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  10. def getOutputColumn: String
  11. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  12. def htmlToHTMLElement(html: String): Array[HTMLElement]
  13. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  14. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  15. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  16. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  17. def read(inputURLs: Array[String]): DataFrame

    inputURLs

    Array of URLs to read, e.g. Array("https://www.wikipedia.org", "https://www.example.com")

    returns

    DataFrame with the parsed content of each URL.

  18. def read(inputSource: String): DataFrame

    inputSource

    Path to an HTML file or a directory of HTML files, or the URL of a web page, e.g. "https://www.wikipedia.org"

    returns

    DataFrame with the parsed HTML content.

  19. def setOutputColumn(value: String): HTMLReader.this.type
  20. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  21. def toString(): String
    Definition Classes
    AnyRef → Any
  22. def urlToHTMLElement(url: String): Array[HTMLElement]
  23. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  24. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  25. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
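The members specific to `HTMLReader` in the list above (`setOutputColumn`, `getOutputColumn`, `htmlToHTMLElement` and the two `read` overloads) can be exercised together. This is a rough sketch under stated assumptions: the import path is assumed, the output column name `elements` and the sample HTML string are hypothetical, and the `elementType` and `content` fields on `HTMLElement` are inferred from the DataFrame schema shown earlier rather than from the class definition.

```scala
// Assumed package; adjust the import if HTMLReader lives elsewhere in your version.
import com.johnsnowlabs.reader.HTMLReader

// setOutputColumn returns the reader itself, so it chains onto the constructor.
val reader = new HTMLReader().setOutputColumn("elements")
println(reader.getOutputColumn)

// Parse an in-memory HTML string; no file or network access is involved here.
val sampleHtml = "<html><body><h1>Sample heading</h1><p>Sample paragraph.</p></body></html>"
val parsed = reader.htmlToHTMLElement(sampleHtml)
// Field names below are assumed to match the schema's elementType and content.
parsed.foreach(e => println(s"${e.elementType}: ${e.content}"))

// read(inputURLs: Array[String]) fetches several pages in one call (needs network).
val urls = Array("https://www.wikipedia.org", "https://www.example.com")
val pagesDF = reader.read(urls)
pagesDF.printSchema()
```

`urlToHTMLElement` has the same return type as `htmlToHTMLElement` and takes a URL instead of an HTML string, so the loop over `parsed` should carry over unchanged.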
