We're going to start looking for where the clusters of paragraphs are. We'll score a cluster based on the number of stopwords and the number of consecutive paragraphs together, which should form the cluster of text that this node is around. Also score on how high up the paragraphs are: comments are usually at the bottom and should get a lower score.
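A minimal sketch of the scoring idea described above, not the original implementation. The stopword list, the `min_stopwords` cutoff, the run-length bonus, and the linear position decay are all assumptions chosen for illustration.

```python
# Hypothetical, deliberately tiny stopword list; a real one would be much larger.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "was", "with",
})


def stopword_count(text):
    """Count words in text that appear in STOPWORDS, ignoring case and edge punctuation."""
    return sum(
        1 for word in text.lower().split()
        if word.strip(".,;:!?\"'()") in STOPWORDS
    )


def score_paragraphs(paragraphs, min_stopwords=2):
    """Return one score per paragraph; a higher score means more article-like."""
    total = len(paragraphs)
    scores = []
    run_length = 0  # consecutive paragraphs that passed the stopword cutoff
    for index, text in enumerate(paragraphs):
        count = stopword_count(text)
        if count < min_stopwords:
            run_length = 0
            scores.append(0.0)
            continue
        run_length += 1
        # Assumed weighting: 1.0 at the top of the page, falling linearly to
        # 0.5 at the bottom, so comment-like text near the end scores lower.
        position_weight = 1.0 - 0.5 * index / max(total - 1, 1)
        scores.append((count + run_length - 1) * position_weight)
    return scores
```

The same paragraph scores lower when it sits at the bottom of the page, and a paragraph that directly follows another qualifying paragraph gets a small bonus for being part of a run.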
// todo refactor this long method
Based on a delimiter in the title, take the longest piece, or do some custom logic based on the site.
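One way the "take the longest piece" rule might look, as a sketch. The delimiter set (`|`, `-`, `»`, `:` surrounded by whitespace) and the function name `clean_title` are assumptions, and the site-specific custom logic is left out.

```python
import re

# Hypothetical delimiter set: a separator character with whitespace on both sides.
TITLE_DELIMITERS = re.compile(r"\s+[|\-»:]\s+")


def clean_title(raw_title):
    """Split the title on a delimiter and return the longest piece."""
    pieces = [piece.strip() for piece in TITLE_DELIMITERS.split(raw_title)]
    pieces = [piece for piece in pieces if piece]
    if not pieces:
        return raw_title.strip()
    # max() returns the first of several equally long pieces.
    return max(pieces, key=len)
```

Requiring whitespace around the delimiter keeps hyphenated words and "Title: Subtitle" style headlines intact; a stricter or looser rule may suit particular sites better.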
pulls out videos we like
If the article has a meta canonical link set in the source, use that as the URL.
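A sketch of the canonical-link lookup using only the standard library's `html.parser`. The class and function names are hypothetical, and falling back to the fetched URL when no canonical link exists is an assumption.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # Attribute values can be None for valueless attributes.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        href = (attributes.get("href") or "").strip()
        if "canonical" in rel_tokens and href:
            self.canonical = href


def canonical_url(html, fallback_url):
    """Return the canonical URL from the markup, or fallback_url if none is set."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical or fallback_url
```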
If the article has a meta description set in the source, use that.
If the article has meta keywords set in the source, use those.
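The description and keywords lookups follow the same pattern, so a single helper can serve both. This is an illustrative sketch with hypothetical names (`MetaTagParser`, `meta_content`); returning an empty string when the tag is missing is an assumption.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect <meta name="..." content="..."> pairs, keeping the first value per name."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").strip().lower()
        content = (attributes.get("content") or "").strip()
        if name and content and name not in self.meta:
            self.meta[name] = content


def meta_content(html, name):
    """Return the content of the named meta tag, or an empty string if absent."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta.get(name.lower(), "")
```

Calling `meta_content(html, "description")` or `meta_content(html, "keywords")` then covers both cases.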
Adds any siblings that may have a decent score to this node.
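What counts as a "decent score" is not stated here. The sketch below assumes a sibling qualifies when its score is at least a fixed fraction of the top node's score; the `ratio` default and the function name are hypothetical.

```python
def gather_with_siblings(sibling_scores, top_index, ratio=0.25):
    """Return indices, in document order, of the top node plus qualifying siblings.

    sibling_scores holds one score per child of the shared parent, and
    top_index is the position of the best-scoring node among them.
    """
    threshold = sibling_scores[top_index] * ratio  # assumed cutoff rule
    return [
        index
        for index, score in enumerate(sibling_scores)
        if index == top_index or score >= threshold
    ]
```

Keeping document order matters because the selected siblings are meant to be read together with the top node as one piece of text.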
Remove any divs that look like non-content, clusters of links, or paragraphs with no gusto.
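A rough sketch of this kind of cleanup, assuming each div has already been reduced to its visible text and the portion of that text inside links. The `Block` type, the link-density cutoff, and the minimum word count are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str       # all visible text in the block
    link_text: str  # the part of that text that sits inside <a> tags


def link_density(block):
    """Fraction of the block's words that are link text; empty blocks count as all links."""
    words = len(block.text.split())
    if words == 0:
        return 1.0
    return len(block.link_text.split()) / words


def looks_like_content(block, max_link_density=0.5, min_words=5):
    """Keep blocks with enough words that are not dominated by links."""
    enough_words = len(block.text.split()) >= min_words
    return enough_words and link_density(block) <= max_link_density


def remove_non_content(blocks):
    """Drop navigation-like link clusters and very short paragraphs."""
    return [block for block in blocks if looks_like_content(block)]
```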
Created by Jim Plush User: jim Date: 8/15/11