Package

com.github.nielsenbe.sparkwikiparser

wikipedia

package wikipedia

Type Members

  1. case class InputPage(id: Long, ns: Long, redirect: InputRedirect, restrictions: Option[String], title: String, revision: InputRevision) extends Product with Serializable

    Serialized form of XML

  2. case class InputRedirect(_VALUE: String, _title: String) extends Product with Serializable

    Serialized form of XML

  3. case class InputRevision(timestamp: String, comment: String, format: String, id: Long, text: InputWikiText) extends Product with Serializable

    Serialized form of XML

  4. case class InputWikiText(_VALUE: String, _space: String) extends Product with Serializable

    Serialized form of XML

  5. sealed trait WikipediaElement extends AnyRef

    Common type for the wikipedia nodes

  6. case class WikipediaHeader(parentPageId: Int, parentRevisionId: Int, headerId: Int, title: String, level: Int) extends WikipediaElement with Product with Serializable

    Container to hold header section data.

    parentPageId

    Wikimedia Id for the page

    parentRevisionId

    Revision Id the element is associated with

    headerId

    Unique (to the page) identifier for a header.

    title

    Header text

    level

    Header depth. 1 is the lead section, H2 = 2, H3 = 3, etc.

  7. case class WikipediaLink(parentPageId: Int, parentRevisionId: Int, parentHeaderId: Int, elementId: Int, destination: String, text: String, linkType: String, subType: String, pageBookmark: String) extends WikipediaElement with Product with Serializable

    HTTP link to either an internal page or an external page.

    parentPageId

    Wikimedia Id for the page

    parentRevisionId

    Revision Id the element is associated with

    parentHeaderId

    The header the element is a child of.

    elementId

    Unique (to the page) integer for an element.

    destination

    Link target: for internal links, the Wikipedia title; otherwise the domain. Internal destinations may (and often do) point to redirects, which needs to be taken into account when analysing links.

    text

    The textual overlay for a link. If empty, the destination will be used.

    linkType

    WIKIMEDIA or EXTERNAL

    subType

    Namespace for WIKIMEDIA links or the domain for external links

    pageBookmark

    We separate the page bookmark from the domain for analytic purposes. www.test.com#page_bookmark becomes www.test.com and page_bookmark.
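
The bookmark separation described for pageBookmark can be sketched as follows. This is only an illustration: `BookmarkSketch` and `splitBookmark` are hypothetical names that are not part of this package, and returning an empty string when there is no bookmark is an assumption rather than documented behaviour.

```scala
object BookmarkSketch {
  // Hypothetical helper (not part of the library): splits a raw link target
  // at the first '#' into (destination, pageBookmark).
  // Assumption: a target without a bookmark yields an empty bookmark string.
  def splitBookmark(raw: String): (String, String) = {
    val idx = raw.indexOf('#')
    if (idx < 0) (raw, "")
    else (raw.substring(0, idx), raw.substring(idx + 1))
  }
}
```

With the example from the description, `BookmarkSketch.splitBookmark("www.test.com#page_bookmark")` yields the pair `("www.test.com", "page_bookmark")`.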

  8. case class WikipediaPage(id: Long, title: String, redirect: String, nameSpace: Long, revisionId: Long, revisionDate: Long, parserMessage: String, headerSections: List[WikipediaHeader], texts: List[WikipediaText], templates: List[WikipediaTemplate], links: List[WikipediaLink], tags: List[WikipediaTag], tables: List[WikipediaTable]) extends Product with Serializable

    Domain object for a Wikipedia page. Structured representation of a page's metadata plus parsed wiki code.

    id

    Unique wikipedia ID from the dump file.

    title

    Wikipedia page's title.

    nameSpace

    The page's wiki namespace. The signature types this field as Long, so expect the numeric namespace ID rather than its text name. See https://en.wikipedia.org/wiki/Wikipedia:Namespace

    revisionId

    Identifier for the last revision.

    revisionDate

    Date for when the page was last updated.

    parserMessage

    SUCCESS or the error message

    headerSections

    Flattened list of wikipedia header sections

    texts

    Natural language portion of page

    templates

    MediaWiki templates

    links

    Wikimedia and external links

    tags

    Handful of extended tags

    tables

    Wikimedia tables converted to HTML
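
Because internal link destinations may point to redirects, link counts taken from a page's links list are usually resolved against a redirect table first. The following is a minimal sketch under stated assumptions: `Link` is a local stand-in that carries only two of the WikipediaLink fields, the redirect map is supplied by the caller, and only a single redirect hop is followed. None of these names come from the library.

```scala
object LinkAnalysisSketch {
  // Local stand-in with only the WikipediaLink fields this sketch needs.
  final case class Link(destination: String, linkType: String)

  // redirects: title of a redirect page -> title it points to.
  // Assumption: the caller builds this map; only one hop is followed.
  def resolve(destination: String, redirects: Map[String, String]): String =
    redirects.getOrElse(destination, destination)

  // Counts internal (WIKIMEDIA) links per resolved destination title.
  def countInternalTargets(links: Seq[Link],
                           redirects: Map[String, String]): Map[String, Int] =
    links
      .filter(_.linkType == "WIKIMEDIA")
      .groupBy(l => resolve(l.destination, redirects))
      .map { case (target, ls) => target -> ls.size }
}
```

External links are dropped here purely to keep the example short; a real analysis might group them by subType instead.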

  9. case class WikipediaTable(parentPageId: Int, parentRevisionId: Int, parentHeaderId: Int, elementId: Int, tableHtmlType: String, caption: String, html: String) extends WikipediaElement with Product with Serializable

    Contains info about a table.

    parentPageId

    Wikimedia Id for the page

    parentRevisionId

    Revision Id the element is associated with

    parentHeaderId

    The header the element is a child of.

    elementId

    Unique (to the page) integer for an element.

    tableHtmlType

    The primary HTML element of the table: TABLE, OL, UL, or DL.

    caption

    Table title (if any).

    html

    Table converted to HTML form. Wiki tables are tricky to capture in a common structured form. Columns and rows can be merged. Table header tags can be abused. We default to leaving it in HTML and let the caller deal with it.

  10. case class WikipediaTag(parentPageId: Int, parentRevisionId: Int, parentHeaderId: Int, elementId: Int, tag: String, tagValue: String) extends WikipediaElement with Product with Serializable

    Contains info about an HTML tag. Mostly these are tags that Sweble cannot parse.

    Special XML tags that are not handled elsewhere in the code. For the most part, ref and math are the main ones.

    parentPageId

    Wikimedia Id for the page

    parentRevisionId

    Revision Id the element is associated with

    parentHeaderId

    The header the element is a child of.

    elementId

    Unique (to the page) integer for an element.

    tag

    Tag name (without brackets)

    tagValue

    Contents inside the tags

  11. case class WikipediaTemplate(parentPageId: Int, parentRevisionId: Int, parentHeaderId: Int, elementId: Int, templateType: String, parameters: List[(String, String)]) extends WikipediaElement with Product with Serializable

    Templates are a special MediaWiki construct that allows code to be shared among pages.

    For example, {{Global warming}} will create a table with links that are common to all global-warming-related pages.

    parentPageId

    Wikimedia Id for the page

    parentRevisionId

    Revision Id the element is associated with

    parentHeaderId

    The header the element is a child of.

    elementId

    Unique (to the page) integer for an element.

    templateType

    Template name, definition can be found via https://en.wikipedia.org/wiki/Template:[Template name]

    parameters

    Templates can have 0..n parameters. These may be named (arg=val) or just referenced sequentially. In this code they are represented as a list of tuples (arg, value). If an argument is not named, a placeholder of *POS_[0 based index] is used.
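
The parameter representation can be illustrated with a short sketch. `TemplateParamSketch` and `labelParameters` are hypothetical names, not the library's parser, and the sketch assumes that the index in the placeholder is the argument's position in the full argument list (the description does not say whether named arguments are counted).

```scala
object TemplateParamSketch {
  // Hypothetical helper: turns raw template arguments such as
  // Seq("1994", "title=Foo") into (name, value) pairs.
  // Assumption: the *POS_ index is the position in the full argument list.
  def labelParameters(rawArgs: Seq[String]): List[(String, String)] =
    rawArgs.zipWithIndex.map { case (arg, i) =>
      arg.split("=", 2) match {
        case Array(name, value) => (name.trim, value.trim) // named: arg=val
        case _                  => (s"*POS_$i", arg.trim)  // positional
      }
    }.toList
}
```

For the input `Seq("1994", "title=Foo")` this produces `List(("*POS_0", "1994"), ("title", "Foo"))`.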

  12. case class WikipediaText(parentPageId: Int, parentRevisionId: Int, parentHeaderId: Int, text: String) extends WikipediaElement with Product with Serializable

    Natural language part of the Wikipedia page.

    Natural text of a page. Wikicode parsing isn't an exact process, so some artifacts and junk are to be expected.

    parentPageId

    Wikimedia Id for the page

    parentRevisionId

    Revision Id the element is associated with

    parentHeaderId

    The header the element is a child of.

    text

    Text fragment

  13. case class WkpParserConfiguration(parseText: Boolean, parseTemplates: Boolean, parseLinks: Boolean, parseTags: Boolean, parseTables: Boolean, parseRefTags: Boolean) extends Product with Serializable

  14. class WkpParserState extends AnyRef

    Used to pass parser state between page nodes

Value Members

  1. object WkpParser

  2. package sparkdbbuild
