com.twitter.scalding.examples

WeightedPageRank

class WeightedPageRank extends Job

Weighted PageRank for the given graph. Starting from the given pagerank, the job performs one iteration and tests for convergence. If the result has not yet converged, the job clones itself and starts the next PageRank job with the updated pagerank as input.

This class is very similar to the PageRank class. The main differences are:
  1. weighted pagerank is supported
  2. the reset pagerank is pregenerated, possibly through a previous job
  3. dead pagerank is evenly distributed

Options:
  --pwd: working directory; the job will read/generate the following files there:
    numnodes: total number of nodes
    nodes: nodes file <'src_id, 'dst_ids, 'weights, 'mass_prior>
    pagerank: the page rank file, e.g. pagerank_0, pagerank_1, etc.
    totaldiff: the current max pagerank delta

Optional arguments:
  --weighted: do weighted pagerank, default false
  --curiteration: what is the current iteration, default 0
  --maxiterations: how many iterations to run, default 20
  --jumpprob: probability of a random jump, default 0.1
  --threshold: total difference before finishing early, default 0.001

Linear Supertypes
Job, Serializable, FieldConversions, LowPriorityFieldConversions, AnyRef, Any

Instance Constructors

  1. new WeightedPageRank(args: Args)

Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. val ALPHA: Double

  7. val CURITERATION: Int

  8. val MAXITERATIONS: Int

  9. val PWD: String

  10. val ROW_TYPE_1: Int

  11. val ROW_TYPE_2: Int

  12. val THRESHOLD: Double

  13. val WEIGHTED: Boolean

  14. def anyToFieldArg(f: Any): Comparable[_]

    Attributes
    protected
    Definition Classes
    LowPriorityFieldConversions
  15. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  16. def asList(f: Fields): List[Comparable[_]]

    Definition Classes
    FieldConversions
  17. def asSet(f: Fields): Set[Comparable[_]]

    Definition Classes
    FieldConversions
  18. def buildFlow: Flow[_]

    combine the config, flowDef and the Mode to produce a flow

    Definition Classes
    Job
  19. def classIdentifier: String

    Definition Classes
    Job
  20. def clear: Unit

    Definition Classes
    Job
  21. def clone(nextargs: Args): Job

    Copy this job. By default, this uses reflection and the single-argument Args constructor.

    Definition Classes
    Job
  22. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  23. def config: Map[AnyRef, AnyRef]

    This is the exact config that is passed to the Cascading FlowConnector. By default:
      1. if there are no spill thresholds in mode.config, we replace with defaultSpillThreshold
      2. we overwrite io.serializations with ioSerializations
      3. we overwrite cascading.tuple.element.comparator.default to defaultComparator
      4. we add some scalding keys for debugging/logging

    Tip: override this method, call super, and ++ your additional map to add or overwrite more options

    This returns Map[AnyRef, AnyRef] for compatibility with older code

    Definition Classes
    Job
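
    The tip about overriding config can be sketched as follows. This is only an illustration of the "call super, then ++" pattern: BaseJob is a hypothetical stand-in for Job so that the snippet compiles without Scalding on the classpath, and the keys and values are made up.

    ```scala
    object ConfigOverrideSketch {
      // Hypothetical stand-in for com.twitter.scalding.Job (assumption, not the real class).
      class BaseJob {
        def config: Map[AnyRef, AnyRef] =
          Map[AnyRef, AnyRef]("base.key" -> "base.value")
      }

      class MyJob extends BaseJob {
        // Call super first, then ++ the extra entries; entries on the right
        // overwrite entries with the same key on the left.
        override def config: Map[AnyRef, AnyRef] =
          super.config ++ Map[AnyRef, AnyRef]("my.custom.key" -> "my.value")
      }
    }
    ```

    In a real job the same override would sit in the class that extends Job, and the added keys would be whatever Cascading or Hadoop options the job needs.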
  24. implicit def dateParser: DateParser

    Override this to control how dates are parsed

    Definition Classes
    Job
  25. def defaultComparator: Option[Class[_ <: Comparator[_]]]

    Override this if you want to customize comparisons/hashing for your job. The config method overwrites using this before sending to cascading. The one we use by default is needed to make joins in the Fields-API more robust to Long vs Int differences. If you only use the Typed-API, consider changing this to return None.

    Definition Classes
    Job
  26. def defaultMode(fromFields: Fields, toFields: Fields): Fields

    Rather than give the full power of cascading's selectors, we have a simpler set of rules encoded below:
      1) If the input is non-definite (ALL, GROUP, ARGS, etc...), ALL is the output. Perhaps only fromFields=ALL will make sense.
      2) If one of from or to is a strict super set of the other, SWAP is used.
      3) If they are equal, REPLACE is used.
      4) Otherwise, ALL is used.

    Definition Classes
    FieldConversions
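
    The four rules can be modelled on plain sets of field names. The sketch below is a simplified model, not the library code: the Mode type, its three values, and the use of Option[Set[String]] to represent a non-definite input are all assumptions made for illustration, whereas the real method works on Cascading Fields objects.

    ```scala
    object DefaultModeSketch {
      // Hypothetical stand-ins for the Cascading selectors named in the rules.
      sealed trait Mode
      case object All extends Mode
      case object Swap extends Mode
      case object Replace extends Mode

      // from = None models a non-definite input (ALL, GROUP, ARGS, ...);
      // from = Some(names) models a definite list of field names.
      def defaultMode(from: Option[Set[String]], to: Set[String]): Mode =
        from match {
          case None                                      => All     // rule 1
          case Some(f) if f == to                        => Replace // rule 3
          case Some(f) if f.subsetOf(to) || to.subsetOf(f) => Swap  // rule 2
          case Some(_)                                   => All     // rule 4
        }
    }
    ```

    Equality is checked before the subset test so that the remaining subset case is always a strict one.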
  27. def defaultSpillThreshold: Int

    Keep 100k tuples in memory by default before spilling. Turn this up as high as you can without getting OOM.

    This is ignored if there is a value set in the incoming jobConf on Hadoop

    Definition Classes
    Job
  28. def doPageRank(nodeRows: RichPipe, inputPagerank: RichPipe): RichPipe

    One iteration of pagerank.
    inputPagerank: <'src_id_input, 'mass_input>
    return: <'src_id, 'mass_n, 'mass_input>

    Here is a high-level view of the unweighted algorithm. Let
      N: number of nodes
      inputPagerank(N_i): prob of walking to node i
      d(N_j): N_j's out degree
    then
      pagerankNext(N_i) = \sum_{j points to i} inputPagerank(N_j) / d(N_j)
      deadPagerank = (1 - \sum_{i} pagerankNext(N_i)) / N
      randomPagerank(N_i) = userMass(N_i) * ALPHA + deadPagerank * (1 - ALPHA)
      pagerankOutput(N_i) = randomPagerank(N_i) + pagerankNext(N_i) * (1 - ALPHA)

    For the weighted algorithm, let
      w(N_j, N_i): weight from N_j to N_i
      tw(N_j): N_j's total out weights
    then
      pagerankNext(N_i) = \sum_{j points to i} inputPagerank(N_j) * w(N_j, N_i) / tw(N_j)
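
    The formulas above can be tried out on in-memory collections. The following is a minimal sketch, not the job's pipe-based implementation: the Node case class and the iterate function are hypothetical names, userMass is assumed to be the 'mass_prior column, and weights are assumed to be positive.

    ```scala
    object PageRankSketch {
      // One row of the nodes file: <'src_id, 'dst_ids, 'weights, 'mass_prior>.
      final case class Node(srcId: Int, dstIds: Seq[Int], weights: Seq[Double], massPrior: Double)

      def iterate(nodes: Seq[Node], input: Map[Int, Double],
                  alpha: Double, weighted: Boolean): Map[Int, Double] = {
        // Mass that each node sends along each of its out-edges.
        val sent: Seq[(Int, Double)] = nodes.flatMap { node =>
          val mass = input.getOrElse(node.srcId, 0.0)
          val shares: Seq[Double] =
            if (weighted) { val tw = node.weights.sum; node.weights.map(_ / tw) }
            else Seq.fill(node.dstIds.size)(1.0 / node.dstIds.size)
          node.dstIds.zip(shares).map { case (dst, share) => dst -> mass * share }
        }
        // pagerankNext(N_i): total mass arriving at each node.
        val next: Map[Int, Double] =
          sent.groupBy(_._1).map { case (id, xs) => id -> xs.map(_._2).sum }
        // deadPagerank: mass lost at nodes without out-edges, spread evenly.
        val dead = (1.0 - next.values.sum) / nodes.size
        nodes.map { node =>
          val random = node.massPrior * alpha + dead * (1 - alpha)
          node.srcId -> (random + next.getOrElse(node.srcId, 0.0) * (1 - alpha))
        }.toMap
      }
    }
    ```

    When the input pagerank and the prior mass each sum to 1, the output of this sketch also sums to 1, because the dead mass is redistributed rather than dropped.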

  29. final def ensureUniqueFields(left: Fields, right: Fields, rightPipe: Pipe): (Fields, Pipe)

    Definition Classes
    FieldConversions
  30. implicit def enumValueToFields(x: Value): Fields

    Definition Classes
    FieldConversions
  31. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  32. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  33. implicit def fieldFields[T <: TraversableOnce[Field[_]]](f: T): RichFields

    Definition Classes
    FieldConversions
  34. implicit def fieldToFields(f: Field[_]): RichFields

    Definition Classes
    FieldConversions
  35. implicit def fields[T <: TraversableOnce[Symbol]](f: T): Fields

    Definition Classes
    FieldConversions
  36. implicit def fieldsToRichFields(fields: Fields): RichFields

    We can't set the field Manifests because cascading doesn't (yet) expose field type information in the Fields API.

    Definition Classes
    FieldConversions
  37. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  38. implicit val flowDef: FlowDef

    Attributes
    protected
    Definition Classes
    Job
  39. implicit def fromEnum[T <: Enumeration](enumeration: T): Fields

    Multi-entry fields. These are higher priority than Product conversions so that List will not conflict with Product.

    Definition Classes
    FieldConversions
  40. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  41. def getField(f: Fields, idx: Int): Fields

    Definition Classes
    FieldConversions
  42. def getInputPagerank(fileName: String): Pipe

  43. def getNodes(fileName: String): Pipe

    read the pregenerated nodes file <'src_id, 'dst_ids, 'weights, 'mass_prior>

  44. def getNumNodes(fileName: String): Pipe

    the total number of nodes, single line file

  45. def handleStats(statsData: CascadingStats): Unit

    Attributes
    protected
    Definition Classes
    Job
  46. def hasInts(f: Fields): Boolean

    Definition Classes
    FieldConversions
  47. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  48. val inputPagerank: Pipe

  49. implicit def intFields[T <: TraversableOnce[Int]](f: T): Fields

    Definition Classes
    FieldConversions
  50. implicit def intToFields(x: Int): Fields

    Definition Classes
    FieldConversions
  51. implicit def integerToFields(x: Integer): Fields

    Definition Classes
    FieldConversions
  52. def ioSerializations: List[Class[_ <: Serialization[_]]]

    These are user-defined serializations in addition to (but deduped with) the required serializations.

    Definition Classes
    Job
  53. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  54. implicit def iterableToRichPipe[T](iter: Iterable[T])(implicit set: TupleSetter[T], conv: TupleConverter[T]): RichPipe

    Definition Classes
    Job
  55. def keepAlive: Unit

    Use this if a map or reduce phase takes a while before emitting tuples.

    Definition Classes
    Job
  56. def listeners: List[FlowListener]

    Definition Classes
    Job
  57. implicit def mode: Mode

    Definition Classes
    Job
  58. def name: String

    Definition Classes
    Job
  59. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  60. final def newSymbol(avoid: Set[Symbol], guess: Symbol, trial: Int = 0): Symbol

    Definition Classes
    FieldConversions
    Annotations
    @tailrec()
  61. def next: Option[Job]

    Test for convergence; if the pagerank has not yet converged, kick off the next iteration.

    Definition Classes
    WeightedPageRank → Job
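
    The convergence test can be sketched as a pure function over the documented options. This is an assumption-laden illustration: nextIteration is a hypothetical name, and the exact comparison operators and the off-by-one handling of the iteration counter are guesses rather than a statement of what the job does.

    ```scala
    object NextIterationSketch {
      // Returns the iteration number to run next, or None when the chain of jobs
      // should stop (iteration budget used up, or the delta is within the threshold).
      def nextIteration(curIteration: Int, maxIterations: Int,
                        totalDiff: Double, threshold: Double): Option[Int] =
        if (curIteration + 1 < maxIterations && totalDiff > threshold)
          Some(curIteration + 1)
        else
          None
    }
    ```

    In the job itself, a Some result would correspond to cloning the job with an updated --curiteration argument, and None to finishing.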
  62. val nodes: Pipe

  63. final def notify(): Unit

    Definition Classes
    AnyRef
  64. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  65. val numNodes: Pipe

  66. val outputFileName: String

  67. val outputPagerank: RichPipe

  68. implicit def parseAnySeqToFields[T <: TraversableOnce[Any]](anyf: T): Fields

    Useful to convert f : Any* to Fields. This handles mixed cases ("hey", 'you). Not sure we should be this flexible, but given that Cascading will throw an exception before scheduling the job, I guess this is okay.

    Definition Classes
    FieldConversions
  69. implicit def pipeToRichPipe(pipe: Pipe): RichPipe

    You should never call this directly; it is here to make the DSL work. Just know that you can treat a Pipe as a RichPipe within a Job.

    Definition Classes
    Job
  70. implicit def productToFields(f: Product): Fields

    Handles treating any TupleN as a Fields object. This is low priority because List is also a Product, but this method will not work for List (because List is Product2(head, tail), and so productIterator won't work as expected). Lists are handled by an implicit in FieldConversions, which has higher priority.

    Definition Classes
    LowPriorityFieldConversions
  71. implicit def read(src: Source): Pipe

    This is implicit so that a Source can be used as the argument to a join or other method that accepts Pipe.

    Definition Classes
    Job
  72. def run: Boolean

    Definition Classes
    Job
  73. implicit def scaldingConfig: Config

    This is here so that Mappable.toIterator can find an implicit config

    Attributes
    protected
    Definition Classes
    Job
  74. def skipStrategy: Option[FlowSkipStrategy]

    Definition Classes
    Job
  75. implicit def sourceToRichPipe(src: Source): RichPipe

    This implicit is to enable RichPipe methods directly on Source objects, such as map/flatMap, etc...

    Note that Mappable is a subclass of Source, and Mappable already has mapTo and flatMapTo BUT WITHOUT incoming fields used (see the Mappable trait). This creates some confusion when using these methods (this is an unfortunate mistake in our design that was not noticed until later). To remove ambiguity, explicitly call .read on any Source that you begin operating with a mapTo/flatMapTo.

    Definition Classes
    Job
  76. def stepListeners: List[FlowStepListener]

    Definition Classes
    Job
  77. def stepStrategy: Option[FlowStepStrategy[_]]

    Specify a callback to run before the start of each flow step.

    Defaults to what Config.getReducerEstimator specifies.

    Definition Classes
    Job
    See also

    ExecutionContext.buildFlow

  78. implicit def strFields[T <: TraversableOnce[String]](f: T): Fields

    Definition Classes
    FieldConversions
  79. implicit def stringToFields(x: String): Fields

    Definition Classes
    FieldConversions
  80. implicit def symbolToFields(x: Symbol): Fields

    '* means Fields.ALL, otherwise we take the .name

    Definition Classes
    FieldConversions
  81. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  82. def timeout[T](timeout: AbsoluteDuration)(t: ⇒ T): Option[T]

    Definition Classes
    Job
  83. implicit def toPipe[T](iter: Iterable[T])(implicit set: TupleSetter[T], conv: TupleConverter[T]): Pipe

    Definition Classes
    Job
  84. def toString(): String

    Definition Classes
    AnyRef → Any
  85. val totalDiff: Pipe

  86. implicit def tuple2ToFieldsPair[T, U](pair: (T, U))(implicit tf: (T) ⇒ Fields, uf: (U) ⇒ Fields): (Fields, Fields)

    Definition Classes
    FieldConversions
  87. implicit def unitToFields(u: Unit): Fields

    Definition Classes
    FieldConversions
  88. def validate: Unit

    Definition Classes
    Job
  89. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  90. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  91. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  92. def write(pipe: Pipe, src: Source): Unit

    This is only here for Java jobs, which cannot automatically access the implicit Pipe => RichPipe that makes pipe.write( ) convenient.

    Definition Classes
    Job
