Copies the tuple, since Cascading may change it after the end of an operation (and it is not safe to assume the consumer has not kept a reference to this tuple).
Copies the tupleEntry, since Cascading may change it after the end of an operation (and it is not safe to assume the consumer has not kept a reference to this tupleEntry).
By default we only set two keys: io.serializations and cascading.tuple.element.comparator.default. Override this method, call the base implementation, and ++ your additional map to set more options.
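A minimal sketch of such an override (the extra Hadoop key below is only illustrative, and in some Scalding versions config also takes an implicit Mode):
{{{
import com.twitter.scalding._

class MyJob(args: Args) extends Job(args) {
  // keep the two default keys and append our own setting
  override def config: Map[AnyRef, AnyRef] =
    super.config ++ Map("mapred.reduce.tasks" -> "16")
}
}}}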
Rather than give the full power of Cascading's selectors, we have a simpler set of rules, encoded below (and sketched in code after this list):
1) if the input is non-definite (ALL, GROUP, ARGS, etc.), ALL is the output; perhaps only fromFields = ALL will make sense.
2) if one of from or to is a strict superset of the other, SWAP is used.
3) if they are equal, REPLACE is used.
4) otherwise, ALL is used.
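That sketch, written as a hypothetical outputSelector helper (illustrative only, not the library's actual implementation):
{{{
import cascading.tuple.Fields

def outputSelector(fromFields: Fields, toFields: Fields): Fields =
  if (!fromFields.isDefined || !toFields.isDefined)
    Fields.ALL      // rule 1: non-definite input
  else if (fromFields == toFields)
    Fields.REPLACE  // rule 3: identical field sets
  else if (fromFields.contains(toFields) || toFields.contains(fromFields))
    Fields.SWAP     // rule 2: one is a strict superset of the other
  else
    Fields.ALL      // rule 4: everything else
}}}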
The basic idea is to groupBy the dst key with BOTH the nodeset and the edge rows. The nodeset rows carry the old page-rank; the edge rows are reversed, so we can get the incoming page-rank from the nodes that point to each destination.
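A conceptual sketch of that step in plain Scala, on toy data (not the job's actual Cascading pipeline):
{{{
object PageRankStep extends App {
  val ranks = Map(1L -> 1.0, 2L -> 1.0, 3L -> 1.0)   // nodeset rows: old page-rank per node
  val edges = Seq((1L, 2L), (1L, 3L), (2L, 3L))       // (src, dst)
  val outDeg = edges.groupBy(_._1).map { case (n, es) => (n, es.size) }

  // reverse the edges: key by dst, carrying the rank each src sends along the edge
  val incoming = edges.map { case (src, dst) => (dst, ranks(src) / outDeg(src)) }

  // "groupBy the dst key" and sum the incoming contributions per destination
  val newRank = incoming.groupBy(_._1).map { case (dst, rs) => (dst, rs.map(_._2).sum) }
  println(newRank)   // node 2 receives 0.5, node 3 receives 1.5
}
}}}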
Multi-entry fields. These are higher priority than Product conversions so that List will not conflict with Product.
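For example, assuming the field-conversion implicits are in scope (e.g. via the com.twitter.scalding.Dsl import):
{{{
import cascading.tuple.Fields
import com.twitter.scalding.Dsl._

object FieldListExample extends App {
  // a List of Symbols becomes a multi-entry Fields, so field lists can be built programmatically
  val cols = List('user, 'count, 'score)
  val f: Fields = cols   // equivalent to new Fields("user", "count", "score")
  println(f)
}
}}}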
Override this function to change how you generate a pipe of (Long, String, Double), where the first entry is the nodeid, the second is the list of neighbors as a comma-separated (no spaces) string of the numeric nodeids, and the third is the initial page rank (if not starting from a previous run, this should be 1.0).
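A minimal sketch of such an override, assuming the PageRank example job and a hypothetical --graph argument (the exact method signature may differ between versions):
{{{
import com.twitter.scalding._
import com.twitter.scalding.examples.PageRank

class MyPageRank(args: Args) extends PageRank(args) {
  // read (nodeid, comma-separated neighbors, initial rank) from a custom TSV location
  override def initialize(nodeCol: Symbol, neighCol: Symbol, pageRank: Symbol) =
    Tsv(args("graph")).read
      .mapTo((0, 1, 2) -> (nodeCol, neighCol, pageRank)) {
        t: (Long, String, Double) => t
      }
}
}}}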
NOTE: if you want to run until convergence, the initialize method must read the same EXACT format as the output method writes. This is your job!
Here is where we check for convergence and then run the next job if we have not converged yet.
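A sketch of what this can look like on Job.next; the error-reading step is left as a hypothetical stub, and the argument names follow the options documented below:
{{{
import com.twitter.scalding._

class IteratingJob(args: Args) extends Job(args) {
  // hypothetical stub: however you read back the L1 error written under --errorOut
  def readL1Error(): Double = ???

  // check convergence; if the error is still above the threshold, schedule another round
  override def next: Option[Job] =
    args.optional("convergence").map(_.toDouble).flatMap { threshold =>
      if (readL1Error() > threshold) Some(clone(args)) else None
    }
}
}}}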
You should never call these directly; they are here to make the DSL work. Just know that you can treat a Pipe as a RichPipe and vice-versa within a Job.
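For example, within a Job the conversions let you call RichPipe methods on a plain cascading Pipe (paths and field names here are made up):
{{{
import cascading.pipe.Pipe
import com.twitter.scalding._

class WordLengthJob(args: Args) extends Job(args) {
  val p: Pipe = Tsv(args("input"), 'word).read          // a plain cascading Pipe
  p.map('word -> 'len) { w: String => w.length }        // RichPipe method via the implicit
    .write(Tsv(args("output")))
}
}}}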
Useful to convert f : Any* to Fields. This handles mixed cases ("hey", 'you). Not sure we should be this flexible, but given that Cascading will throw an exception before scheduling the job, I guess this is okay.
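A small sketch of that flexibility, assuming the field-conversion implicits are in scope via the Dsl import:
{{{
import cascading.tuple.Fields
import com.twitter.scalding.Dsl._

object MixedNamesExample extends App {
  // a String and a Symbol can be mixed in the same field list
  val f: Fields = ("hey", 'you)
  println(f)   // a two-entry Fields: "hey", "you"
}
}}}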
Handles treating any TupleN as a Fields object. This is low priority because List is also a Product, but this method will not work for List (because List is Product2(head, tail), so productIterator won't work as expected). Lists are handled by an implicit in FieldConversions, which has higher priority.
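For instance (again assuming the Dsl implicits are in scope):
{{{
import cascading.tuple.Fields
import com.twitter.scalding.Dsl._

object TupleFieldsExample extends App {
  // a Tuple3 of Symbols becomes a three-entry Fields via productIterator
  val f: Fields = ('user, 'count, 'score)
  println(f)
  // note: List('user, 'count) does NOT go through this conversion; it is handled
  // by the higher-priority List implicit described above
}
}}}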
'* means Fields.ALL; otherwise we take the Symbol's .name.
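For example (assuming the Dsl implicits are in scope):
{{{
import cascading.tuple.Fields
import com.twitter.scalding.Dsl._

object SymbolFieldsExample extends App {
  val all: Fields = '*      // the special '* Symbol maps to Fields.ALL
  val one: Fields = 'user   // any other Symbol maps to a single named field, "user"
  println(all == Fields.ALL)   // true
}
}}}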
Options:
--input: the three-column TSV with node, comma-separated out-neighbors, and initial pagerank (set to 1.0 on the first run).
--output: the name of the TSV you want to write to, in the same format as above.
Optional arguments:
--errorOut: name of where to write the L1 error between the input page-rank and the output. If this is omitted, we don't compute the error.
--iterations: how many iterations to run inside this job. Default is 1; 10 is about as much as Cascading can handle.
--jumpprob: probability of a random jump; default is 0.15.
--convergence: if this is set, after every --iterations steps we check the error and see if we should continue. Since the error check is expensive (involving a join), you should avoid doing it too frequently; 10 iterations is probably a good number to set.
--temp: the name where we will store a temporary output so we can compare to the previous run for convergence checking. If convergence is set, this MUST be set.