Serialized form of XML
Common type for the wikipedia nodes
Container to hold header section data.
HTTP link to either an internal page or an external page.
Wikimedia Id for the page
Revision Id element is associated with
The header the element is a child of.
Unique (to the page) integer for an element.
URL. For internal links, the Wikipedia title; otherwise the domain. Internal links may (and often do) point to redirects. This needs to be taken into account when analysing links.
The textual overlay for a link. If empty, the destination is used.
WIKIMEDIA or EXTERNAL
Namespace for WIKIMEDIA links or the domain for external links
We separate the page bookmark from the domain for analytic purposes. www.test.com#page_bookmark becomes www.test.com and page_bookmark.
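The bookmark split described above can be sketched as follows. This is a minimal illustration, not the parser's actual code; the function name `split_bookmark` is hypothetical, and it assumes the bookmark is everything after the first `#`.

```python
from typing import Tuple


def split_bookmark(destination: str) -> Tuple[str, str]:
    """Split a link destination into (domain, page_bookmark).

    Assumption: the bookmark is whatever follows the first '#'.
    A destination without '#' yields an empty bookmark.
    """
    domain, _separator, bookmark = destination.partition("#")
    return domain, bookmark


print(split_bookmark("www.test.com#page_bookmark"))
```

Keeping the two parts in separate fields lets links be grouped by domain without every bookmark variant counting as a distinct destination.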
Domain object for a Wikipedia page. Structured representation of a page's metadata plus parsed wiki code.
Unique wikipedia ID from the dump file.
Wikipedia page's title.
Text name of a wiki's namespace. See https://en.wikipedia.org/wiki/Wikipedia:Namespace
Identifier for the last revision.
Date for when the page was last updated.
SUCCESS or the error message
Flattened list of wikipedia header sections
Natural language portion of page
Wikimedia templates
Wikimedia and external links
Handful of extended tags
Wikimedia tables converted to HTML
Contains info about a table.
Wikimedia Id for the page
Revision Id element is associated with
The header the element is a child of.
Unique (to the page) integer for an element.
The primary HTML element of the table: TABLE, OL, UL, or DL.
Table title (if any).
Table converted to HTML form. Wiki tables are tricky to capture in a common structured form. Columns and rows can be merged. Table header tags can be abused. We default to leaving it in HTML and let the caller deal with it.
Contains info about an HTML tag. Mostly these are tags that Sweble cannot parse.
Special XML tags that are not handled elsewhere in the code. For the most part, ref and math are the main ones.
Wikimedia Id for the page
Revision Id element is associated with
The header the element is a child of.
Unique (to the page) integer for an element.
Tag name (without brackets).
Contents inside the tags.
Templates are a special MediaWiki construct that allows code to be shared among pages.
For example {{Global warming}} will create a table with links that are common to all GW related pages.
Wikimedia Id for the page
Revision Id element is associated with
The header the element is a child of.
Unique (to the page) integer for an element.
Template name, definition can be found via https://en.wikipedia.org/wiki/Template:[Template name]
Templates can have 0..n parameters. These may be named (arg=val) or just referenced sequentially. In this code they are represented as a list of (arg, value) tuples. If an argument is not named, a placeholder of *POS_[0-based index] is used.
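A short sketch of the parameter representation described above. The function name `label_parameters` is hypothetical, and the sketch assumes the 0-based index counts a parameter's position among all parameters (named and unnamed); the source does not say whether named parameters are skipped when counting.

```python
from typing import List, Optional, Tuple


def label_parameters(
    params: List[Tuple[Optional[str], str]]
) -> List[Tuple[str, str]]:
    """Return (arg, value) tuples, filling in *POS_n for unnamed arguments.

    Assumption: n is the position in the full parameter list.
    """
    labeled = []
    for index, (name, value) in enumerate(params):
        key = name if name else f"*POS_{index}"
        labeled.append((key, value))
    return labeled


# An invented template call with one unnamed and one named parameter.
print(label_parameters([(None, "Paris"), ("country", "France")]))
```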
Natural language part of the wikipedia page.
Natural text of a page. Wikicode parsing isn't an exact process, so some artifacts and junk are to be expected.
Wikimedia Id for the page
Revision Id element is associated with
The header the element is a child of.
Text fragment.
Used to pass parser state between page nodes
Container to hold header section data.
Wikimedia Id for the page
Revision Id element is associated with
Unique (to the page) identifier for a header.
Header text
Header depth. 1 is the lead section, H2 = 2, H3 = 3, etc.
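To illustrate the depth convention, here is a rough sketch that derives the depth from raw wikitext heading markup, where `== Title ==` is an H2 and `=== Title ===` is an H3. This is a stand-in for illustration only: the regular expression and the name `parse_heading` are assumptions, and the real pipeline gets headers from its wikicode parser rather than from a regex.

```python
import re
from typing import Optional, Tuple

# Matches wikitext headings such as "== History ==" (H2) up to H6.
HEADING = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$")


def parse_heading(line: str) -> Optional[Tuple[int, str]]:
    """Return (depth, header text) for a heading line, else None.

    Text before the first heading belongs to the lead section,
    which has depth 1 and no heading markup of its own.
    """
    match = HEADING.match(line)
    if match is None:
        return None
    return len(match.group(1)), match.group(2)


print(parse_heading("== History =="))
```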