case class DatastoreGoogleNGram(groupName: String, artifactName: String, version: Int, frequencyCutoff: Int) extends Product with Serializable
A class that parses Google N-Gram data
(http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html) to provide
information about a requested n-gram.
Takes the datastore location details for a data directory and parses each file, which is
expected to be in the following format
(from https://docs.google.com/document/d/14PWeoTkrnKk9H8_7CfVbdvuoFZ7jYivNTkBX2Hj7qLw/edit):
head_word<TAB>syntactic-ngram<TAB>total_count<TAB>counts_by_year
The counts_by_year format is a tab-separated list of year<comma>count items.
Years are sorted in ascending order, and only years with non-zero counts are included.
The syntactic-ngram format is a space-separated list of tokens, each token format is:
“word/pos-tag/dep-label/head-index”.
The word field can contain any non-whitespace character.
The other fields can contain any non-whitespace character except for ‘/’.
pos-tag is a Penn-Treebank part-of-speech tag.
dep-label is a stanford-basic-dependencies label.
head-index is an integer, pointing to the head of the current token.
“1” refers to the first token in the list, “2” to the second,
and “0” indicates that the head is the root of the fragment.
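The line format described above can be sketched in Scala roughly as follows. This is a minimal illustration and not the class's actual implementation: `Token`, `NGramRecord`, and `NGramLineParser` are hypothetical names introduced here, and the sample line in the usage note is invented. Because the word field may itself contain ‘/’, the sketch splits each token from the right.

```scala
// Hypothetical types for illustration only; not part of the documented API.
final case class Token(word: String, posTag: String, depLabel: String, headIndex: Int)

final case class NGramRecord(
  headWord: String,
  tokens: Seq[Token],
  totalCount: Long,
  countsByYear: Seq[(Int, Long)]
)

object NGramLineParser {

  // Parses "word/pos-tag/dep-label/head-index".
  // The word may contain '/', so the last three fields are located from the right.
  def parseToken(raw: String): Token = {
    val headSep = raw.lastIndexOf('/')
    val depSep = raw.lastIndexOf('/', headSep - 1)
    val posSep = raw.lastIndexOf('/', depSep - 1)
    require(posSep > 0, s"Malformed token: $raw")
    Token(
      word = raw.substring(0, posSep),
      posTag = raw.substring(posSep + 1, depSep),
      depLabel = raw.substring(depSep + 1, headSep),
      headIndex = raw.substring(headSep + 1).toInt
    )
  }

  // Parses head_word<TAB>syntactic-ngram<TAB>total_count<TAB>counts_by_year,
  // where counts_by_year is itself a tab-separated list of year,count items.
  def parseLine(line: String): NGramRecord = {
    val fields = line.split("\t")
    require(fields.length >= 4, s"Expected at least 4 tab-separated fields: $line")
    val tokens = fields(1).split(" ").toSeq.map(parseToken)
    val counts = fields.drop(3).toSeq.map { item =>
      item.split(",") match {
        case Array(year, count) => (year.toInt, count.toLong)
        case _ => throw new IllegalArgumentException(s"Malformed year,count item: $item")
      }
    }
    NGramRecord(fields(0), tokens, fields(2).toLong, counts)
  }
}
```

For example, calling `NGramLineParser.parseLine("cease\tcease/VB/ccomp/0 to/TO/aux/3 exist/VB/xcomp/1\t4\t1834,3\t1850,1")` on this made-up line yields a record with head word `cease`, three tokens, a total count of 4, and two year/count pairs.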
Linear Supertypes
Serializable, Serializable, Product, Equals, AnyRef, Any