JoinAlgorithms

Abstract Value Members

abstract def pipe: Pipe

Concrete Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def asInstanceOf[T0]: T0

Definition Classes
Any
def blockJoinWithSmaller(fs: (Fields, Fields), otherPipe: Pipe, rightReplication: Int = 1, leftReplication: Int = 1, joiner: Joiner = new InnerJoin, reducers: Int = 1): Pipe

Performs a block join, otherwise known as a replicate fragment join (RF join).
Performs a block join, otherwise known as a replicate fragment join (RF join). The input params leftReplication and rightReplication control the replication of the left and right pipes respectively.
This is useful in cases where the data has extreme skew. A symptom of this is that we may see a job stuck for a very long time on a small number of reducers.
A block join is way to get around this: we add a random integer field and a replica field to every tuple in the left and right pipes. We then join on the original keys and on these new dummy fields. These dummy fields make it less likely that the skewed keys will be hashed to the same reducer.
The final data size is right * rightReplication + left * leftReplication but because of the fragmentation, we are guaranteed the same number of hits as the original join.
If the right pipe is really small then you are probably better off with a joinWithTiny. If however the right pipe is medium sized, then you are better off with a blockJoinWithSmaller, and a good rule of thumb is to set rightReplication = left.size / right.size and leftReplication = 1
Finally, if both pipes are of similar size, e.g. in case of a self join with a high data skew, then it makes sense to set leftReplication and rightReplication to be approximately equal.
Note
You can only use an InnerJoin or a LeftJoin with a leftReplication of 1 (or a RightJoin with a rightReplication of 1) when doing a blockJoin.
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
def coGroupBy(f: Fields, j: JoinMode = InnerJoinMode)(builder: (CoGroupBuilder) ⇒ GroupBuilder): Pipe

This method is used internally to implement all joins.
This method is used internally to implement all joins. You can use this directly if you want to implement something like a star join, e.g., when joining a single pipe to multiple other pipes. Make sure that you call this method on the larger pipe to make the grouping as efficient as possible.
If you are only joining two pipes, then you are better off using joinWithSmaller/joinWithLarger/joinWithTiny/leftJoinWithTiny.
def crossWithSmaller(p: Pipe, replication: Int = 20): Pipe

Does a cross-product by doing a blockJoin.
Does a cross-product by doing a blockJoin. Useful when doing a large cross, if your cluster can take it. Prefer crossWithTiny
def crossWithTiny(tiny: Pipe): Pipe

Doing a cross product with even a moderate sized pipe can create ENORMOUS output.
WARNING
Doing a cross product with even a moderate sized pipe can create ENORMOUS output. The use-case here is attaching a constant (e.g. a number or a dictionary or set) to each row in another pipe. A common use-case comes from a groupAll and reduction to one row, then you want to send the results back out to every element in a pipe
This uses joinWithTiny, so tiny pipe is replicated to all Mappers. If it is large, this will blow up. Get it: be foolish here and LOSE IT ALL!
Use at your own risk.
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
def hashCode(): Int

Definition Classes
AnyRef → Any
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
def joinWithLarger(fs: (Fields, Fields), that: Pipe, joiner: Joiner = new InnerJoin, reducers: Int = 1): Pipe

same as reversing the order on joinWithSmaller
def joinWithSmaller(fs: (Fields, Fields), that: Pipe, joiner: Joiner = new InnerJoin, reducers: Int = 1): Pipe

Joins the first set of keys in the first pipe to the second set of keys in the second pipe.
Joins the first set of keys in the first pipe to the second set of keys in the second pipe. All keys must be unique UNLESS it is an inner join, then duplicated join keys are allowed, but the second copy is deleted (as cascading does not allow duplicated field names).
Smaller here means that the values/key is smaller than the left.
Avoid going crazy adding more explicit join modes. Instead do for some other join mode with a larger pipe:
```
.then { pipe => other.
          joinWithSmaller(('other1, 'other2)->('this1, 'this2), pipe, new FancyJoin)
      }
```
def joinWithTiny(fs: (Fields, Fields), that: Pipe): Pipe

This does an assymmetric join, using cascading's "HashJoin".
This does an assymmetric join, using cascading's "HashJoin". This only runs through this pipe once, and keeps the right hand side pipe in memory (but is spillable).
Choose this when Left > max(mappers,reducers) * Right, or when the left side is three orders of magnitude larger.
joins the first set of keys in the first pipe to the second set of keys in the second pipe. Duplicated join keys are allowed, but the second copy is deleted (as cascading does not allow duplicated field names).
Warning
This does not work with outer joins, or right joins, only inner and left join versions are given.
def joinerToJoinModes(j: Joiner): (Product with Serializable with JoinMode, Product with Serializable with JoinMode)
def leftJoinWithLarger(fs: (Fields, Fields), that: Pipe, reducers: Int = 1): Pipe

This is joinWithLarger with joiner parameter fixed to LeftJoin.
This is joinWithLarger with joiner parameter fixed to LeftJoin. If the item is absent on the right put null for the keys and values
def leftJoinWithSmaller(fs: (Fields, Fields), that: Pipe, reducers: Int = 1): Pipe

This is joinWithSmaller with joiner parameter fixed to LeftJoin.
This is joinWithSmaller with joiner parameter fixed to LeftJoin. If the item is absent on the right put null for the keys and values
def leftJoinWithTiny(fs: (Fields, Fields), that: Pipe): HashJoin
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
def skewJoinWithLarger(fs: (Fields, Fields), otherPipe: Pipe, sampleRate: Double = 0.001, reducers: Int = 1, replicator: SkewReplication = SkewReplicationA()): Pipe
def skewJoinWithSmaller(fs: (Fields, Fields), otherPipe: Pipe, sampleRate: Double = 0.001, reducers: Int = 1, replicator: SkewReplication = SkewReplicationA()): Pipe

Performs a skewed join, which is useful when the data has extreme skew.
Performs a skewed join, which is useful when the data has extreme skew.
For example, imagine joining a pipe of Twitter's follow graph against a pipe of user genders, in order to find the gender distribution of the accounts every Twitter user follows. Since celebrities (e.g., Justin Bieber and Lady Gaga) have a much larger follower base than other users, and (under a standard join algorithm) all their followers get sent to the same reducer, the job will likely be stuck on a few reducers for a long time. A skewed join attempts to alleviate this problem.
This works as follows:
1. First, we sample from the left and right pipes with some small probability, in order to determine approximately how often each join key appears in each pipe. 2. We use these estimated counts to replicate the join keys, according to the given replication strategy. 3. Finally, we join the replicated pipes together.
sampleRate
This controls how often we sample from the left and right pipes when estimating key counts.
replicator
Algorithm for determining how much to replicate a join key in the left and right pipes. Note: since we do not set the replication counts, only inner joins are allowed. (Otherwise, replicated rows would stay replicated when there is no counterpart in the other pipe.)
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def toString(): String

Definition Classes
AnyRef → Any
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )

Related Docs: object JoinAlgorithms | package scalding

trait JoinAlgorithms extends AnyRef

Abstract Value Members

abstract def pipe: Pipe

Concrete Value Members

final def !=(arg0: Any): Boolean

final def ##(): Int

final def ==(arg0: Any): Boolean

final def asInstanceOf[T0]: T0

def blockJoinWithSmaller(fs: (Fields, Fields), otherPipe: Pipe, rightReplication: Int = 1, leftReplication: Int = 1, joiner: Joiner = new InnerJoin, reducers: Int = 1): Pipe

Note

def clone(): AnyRef

def coGroupBy(f: Fields, j: JoinMode = InnerJoinMode)(builder: (CoGroupBuilder) ⇒ GroupBuilder): Pipe

def crossWithSmaller(p: Pipe, replication: Int = 20): Pipe

def crossWithTiny(tiny: Pipe): Pipe

WARNING

final def eq(arg0: AnyRef): Boolean

def equals(arg0: Any): Boolean

def finalize(): Unit

final def getClass(): Class[_]

def hashCode(): Int

final def isInstanceOf[T0]: Boolean

def joinWithLarger(fs: (Fields, Fields), that: Pipe, joiner: Joiner = new InnerJoin, reducers: Int = 1): Pipe

def joinWithSmaller(fs: (Fields, Fields), that: Pipe, joiner: Joiner = new InnerJoin, reducers: Int = 1): Pipe

def joinWithTiny(fs: (Fields, Fields), that: Pipe): Pipe

Warning

def joinerToJoinModes(j: Joiner): (Product with Serializable with JoinMode, Product with Serializable with JoinMode)

def leftJoinWithLarger(fs: (Fields, Fields), that: Pipe, reducers: Int = 1): Pipe

def leftJoinWithSmaller(fs: (Fields, Fields), that: Pipe, reducers: Int = 1): Pipe

def leftJoinWithTiny(fs: (Fields, Fields), that: Pipe): HashJoin

final def ne(arg0: AnyRef): Boolean

final def notify(): Unit

final def notifyAll(): Unit

def skewJoinWithLarger(fs: (Fields, Fields), otherPipe: Pipe, sampleRate: Double = 0.001, reducers: Int = 1, replicator: SkewReplication = SkewReplicationA()): Pipe

def skewJoinWithSmaller(fs: (Fields, Fields), otherPipe: Pipe, sampleRate: Double = 0.001, reducers: Int = 1, replicator: SkewReplication = SkewReplicationA()): Pipe

final def synchronized[T0](arg0: ⇒ T0): T0

def toString(): String

final def wait(): Unit

final def wait(arg0: Long, arg1: Int): Unit

final def wait(arg0: Long): Unit

Inherited from AnyRef

Inherited from Any

Ungrouped