Combines the config, flowDef, and the Mode to produce a Flow.
Copies this job. By default, this uses reflection and the single-argument Args constructor.
This is the exact config that is passed to the Cascading FlowConnector. By default:
- if there are no spill thresholds in mode.config, we replace them with defaultSpillThreshold
- we overwrite io.serializations with ioSerializations
- we overwrite cascading.tuple.element.comparator.default with defaultComparator
- we add some Scalding keys for debugging/logging
Tip: to add or overwrite more options, override this method, call super, and ++ your additional map.
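The override tip relies on Scala's Map ++ semantics, where entries on the right overwrite entries with the same key on the left. A minimal sketch, assuming made-up config contents (the real super.config comes from the Job):

```scala
// Hypothetical base config, standing in for what super.config would return.
def baseConfig: Map[AnyRef, AnyRef] =
  Map("io.serializations" -> "...", "mapred.reduce.tasks" -> "10")

// An override in the style described above: call super, then ++ your map.
// Keys on the right of ++ overwrite keys already present on the left.
def config: Map[AnyRef, AnyRef] =
  baseConfig ++ Map("mapred.reduce.tasks" -> "100", "my.custom.key" -> "on")
```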
This returns Map[AnyRef, AnyRef] for compatibility with older code
Override this to control how dates are parsed
Override this if you want to customize comparisons/hashing for your job; the config method uses this to overwrite the defaults before sending the config to Cascading. The comparator we use by default is needed to make joins in the Fields-API more robust to Long vs Int differences. If you only use the Typed-API, consider changing this to return None
Rather than give the full power of Cascading's selectors, we have a simpler set of rules encoded below:
1) If the input is non-definite (ALL, GROUP, ARGS, etc.), ALL is the output. Perhaps only fromFields=ALL will make sense.
2) If one of from or to is a strict superset of the other, SWAP is used.
3) If they are equal, REPLACE is used.
4) Otherwise, ALL is used.
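The four rules above can be sketched as a pure-Scala decision function. This is an illustration only: the real implementation works on Cascading Fields objects, and the names here (OutputSelector, defaultMode, fromDefinite) are assumptions:

```scala
// Hypothetical encoding of the selector rules using plain Sets of field names.
sealed trait OutputSelector
case object All extends OutputSelector
case object Swap extends OutputSelector
case object Replace extends OutputSelector

def defaultMode(from: Set[String], to: Set[String], fromDefinite: Boolean): OutputSelector =
  if (!fromDefinite) All                                  // rule 1: ALL, GROUP, ARGS, etc.
  else if (from == to) Replace                            // rule 3: equal field sets
  else if (from.subsetOf(to) || to.subsetOf(from)) Swap   // rule 2: strict superset
  else All                                                // rule 4: fallback
```

Checking equality before the subset test keeps rule 2 restricted to strict supersets, since subsetOf also holds for equal sets.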
Keep 100k tuples in memory by default before spilling; turn this up as high as you can without getting an OOM.
This is ignored if a value is already set in the incoming jobConf on Hadoop
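The "ignored if already set" behavior amounts to applying the default only when the incoming config has no value for the key. A sketch using a plain Map in place of a Hadoop JobConf (the key name is Cascading's spill-threshold property; the helper name and default value here are assumptions):

```scala
// Apply the default spill threshold only when the incoming conf lacks one.
val SpillThresholdKey = "cascading.spill.list.threshold"
val defaultSpillThreshold = 100 * 1000 // the 100k-tuples default described above

def withSpillDefault(conf: Map[String, String]): Map[String, String] =
  if (conf.contains(SpillThresholdKey)) conf // incoming value wins
  else conf + (SpillThresholdKey -> defaultSpillThreshold.toString)
```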
One iteration of pagerank. inputPagerank: <'src_id_input, 'mass_input>; returns <'src_id, 'mass_n, 'mass_input>
Here is a high-level view of the unweighted algorithm:
let N: number of nodes
    inputPagerank(N_i): prob of walking to node i
    d(N_j): N_j's out-degree
then
    pagerankNext(N_i) = \sum_{j points to i} inputPagerank(N_j) / d(N_j)
    deadPagerank = (1 - \sum_{i} pagerankNext(N_i)) / N
    randomPagerank(N_i) = userMass(N_i) * ALPHA + deadPagerank * (1 - ALPHA)
    pagerankOutput(N_i) = randomPagerank(N_i) + pagerankNext(N_i) * (1 - ALPHA)
For the weighted algorithm:
let w(N_j, N_i): weight from N_j to N_i
    tw(N_j): N_j's total out-weight
then
    pagerankNext(N_i) = \sum_{j points to i} inputPagerank(N_j) * w(N_j, N_i) / tw(N_j)
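The unweighted formulas can be checked with a small pure-Scala computation of one iteration. The graph, starting pagerank, ALPHA, and userMass below are made-up example inputs, not part of the job:

```scala
val ALPHA = 0.1 // random-jump probability
// adjacency: node -> nodes it points to
val out = Map(0 -> List(1, 2), 1 -> List(2), 2 -> List(0))
val nodes = out.keys.toList
val n = nodes.size
val inputPagerank = Map(0 -> 0.4, 1 -> 0.3, 2 -> 0.3)
val userMass = nodes.map(i => i -> 1.0 / n).toMap // uniform reset mass

// pagerankNext(i) = sum over j pointing to i of inputPagerank(j) / d(j)
val pagerankNext = nodes.map { i =>
  i -> out.toList.collect { case (j, dsts) if dsts.contains(i) =>
    inputPagerank(j) / dsts.size
  }.sum
}.toMap

val deadPagerank = (1.0 - pagerankNext.values.sum) / n
val output = nodes.map { i =>
  i -> (userMass(i) * ALPHA + deadPagerank * (1 - ALPHA) +
        pagerankNext(i) * (1 - ALPHA))
}.toMap
```

Since every node in this toy graph has out-edges, no mass dies, deadPagerank is 0, and the output masses still sum to 1.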
We can't set the field Manifests because cascading doesn't (yet) expose field type information in the Fields API.
Multi-entry fields. These are higher priority than the Product conversions so that List does not conflict with Product.
Read the pregenerated nodes file: <'src_id, 'dst_ids, 'weights, 'mass_prior>
The total number of nodes, a single-line file
These are user-defined serializations IN ADDITION to (but deduplicated against) the required serializations
Use this if a map or reduce phase takes a while before emitting tuples.
Test for convergence; if not converged yet, kick off the next iteration
Useful to convert f: Any* to Fields. This handles mixed cases ("hey", 'you). Not sure we should be this flexible, but given that Cascading will throw an exception before scheduling the job, I guess this is okay.
You should never call this directly; it is here to make the DSL work. Just know that you can treat a Pipe as a RichPipe within a Job
Handles treating any TupleN as a Fields object. This is low priority because List is also a Product, but this method will not work for List (because List is Product2(head, tail), so productIterator won't work as expected). Lists are handled by an implicit in FieldConversions, which has higher priority.
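The List caveat can be seen directly in plain Scala: a non-empty List is a :: (cons) cell, which is a Product2 of head and tail, so productIterator does not yield the list elements the way it does for a real TupleN:

```scala
// A non-empty List is ::(head, tail), i.e. a Product2, so productIterator
// yields exactly two items: the head and the entire tail.
val xs = List(1, 2, 3)
val parts = xs.asInstanceOf[Product].productIterator.toList // List(1, List(2, 3))

// A real TupleN behaves as expected: one item per element.
val t = (1, 2, 3)
val elems = t.productIterator.toList // List(1, 2, 3)
```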
This is implicit so that a Source can be used as the argument to a join or other method that accepts Pipe.
This is here so that Mappable.toIterator can find an implicit config
This implicit is to enable RichPipe methods directly on Source objects, such as map/flatMap, etc. Note that Mappable is a subclass of Source, and Mappable already has mapTo and flatMapTo BUT WITHOUT the incoming fields used (see the Mappable trait). This creates some confusion when using these methods (this is an unfortunate mistake in our design that was not noticed until later). To remove the ambiguity, explicitly call .read on any Source on which you begin operating with a mapTo/flatMapTo.
Specify a callback to run before the start of each flow step.
Defaults to what Config.getReducerEstimator specifies.
ExecutionContext.buildFlow
'* means Fields.ALL; otherwise we take the .name
This is only here for Java jobs, which cannot automatically access the implicit Pipe => RichPipe conversion that makes pipe.write( ) convenient
Weighted pagerank for the given graph: start from the given pagerank, perform one iteration, and test for convergence; if not converged yet, clone itself and start the next pagerank job with the updated pagerank as input.
This class is very similar to the PageRank class; the main differences are:
1. it supports weighted pagerank
2. the reset pagerank is pregenerated, possibly through a previous job
3. dead pagerank is evenly distributed
Options:
--pwd: working directory; will read/generate the following files there:
    numnodes: total number of nodes, a single-line file
    nodes: nodes file <'src_id, 'dst_ids, 'weights, 'mass_prior>
    pagerank: the pagerank file, e.g. pagerank_0, pagerank_1, etc.
    totaldiff: the current max pagerank delta
Optional arguments:
--weighted: do weighted pagerank; default false
--curiteration: the current iteration; default 0
--maxiterations: how many iterations to run; default 20
--jumpprob: probability of a random jump; default 0.1
--threshold: total difference before finishing early; default 0.001