org.apache.cassandra.hadoop
Class ColumnFamilyInputFormat

java.lang.Object
  extended by org.apache.hadoop.mapreduce.InputFormat<java.nio.ByteBuffer,java.util.SortedMap<java.nio.ByteBuffer,IColumn>>
      extended by org.apache.cassandra.hadoop.ColumnFamilyInputFormat
All Implemented Interfaces:
org.apache.hadoop.mapred.InputFormat<java.nio.ByteBuffer,java.util.SortedMap<java.nio.ByteBuffer,IColumn>>

public class ColumnFamilyInputFormat
extends org.apache.hadoop.mapreduce.InputFormat<java.nio.ByteBuffer,java.util.SortedMap<java.nio.ByteBuffer,IColumn>>
implements org.apache.hadoop.mapred.InputFormat<java.nio.ByteBuffer,java.util.SortedMap<java.nio.ByteBuffer,IColumn>>

Hadoop InputFormat allowing map/reduce against Cassandra rows within one ColumnFamily.

At minimum, you need to set the column family and the predicate (a description of which columns to extract from each row) in your Hadoop job Configuration. The ConfigHelper class is provided to make this simple:

    ConfigHelper.setInputColumnFamily
    ConfigHelper.setInputSlicePredicate

You can also configure the number of rows per InputSplit with ConfigHelper.setInputSplitSize. The split size should be "as big as possible, but no bigger": each InputSplit is read from Cassandra with multiple get_range_slices queries, and the per-call overhead of get_range_slices is high, so larger split sizes are better -- but if a split is too large, the task will run out of memory. The default split size is 64k rows.
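
To make these configuration steps concrete, here is a minimal sketch of a job setup using this input format. The setInputColumnFamily, setInputSlicePredicate, and setInputSplitSize calls are the ones named above; the connection helpers (setInputInitialAddress, setInputRpcPort, setInputPartitioner) follow common usage of ConfigHelper in this era but vary by Cassandra version, and the "Keyspace1"/"Standard1" names are illustrative assumptions.

    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;
    import org.apache.cassandra.utils.ByteBufferUtil;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CassandraInputJobSetup
    {
        public static Job configure(Configuration conf) throws Exception
        {
            Job job = new Job(conf, "cassandra-input-example");
            job.setInputFormatClass(ColumnFamilyInputFormat.class);

            // Required: the keyspace and column family to read
            // (hypothetical names).
            ConfigHelper.setInputColumnFamily(job.getConfiguration(), "Keyspace1", "Standard1");

            // Required: which columns to extract from each row; an
            // open-ended slice range selects every column.
            SlicePredicate predicate = new SlicePredicate().setSlice_range(
                    new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                                   ByteBufferUtil.EMPTY_BYTE_BUFFER,
                                   false,               // not reversed
                                   Integer.MAX_VALUE)); // max columns per row
            ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

            // Optional: rows per InputSplit (the default is 64k rows).
            ConfigHelper.setInputSplitSize(job.getConfiguration(), 64 * 1024);

            // Assumed connection settings -- adjust to your cluster.
            ConfigHelper.setInputInitialAddress(job.getConfiguration(), "localhost");
            ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
            ConfigHelper.setInputPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");

            return job;
        }
    }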


Field Summary
static java.lang.String CASSANDRA_HADOOP_MAX_KEY_SIZE
static int CASSANDRA_HADOOP_MAX_KEY_SIZE_DEFAULT
static java.lang.String MAPRED_TASK_ID

Constructor Summary
ColumnFamilyInputFormat()

Method Summary
 org.apache.hadoop.mapreduce.RecordReader<java.nio.ByteBuffer,java.util.SortedMap<java.nio.ByteBuffer,IColumn>> createRecordReader(org.apache.hadoop.mapreduce.InputSplit inputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext)
 org.apache.hadoop.mapred.RecordReader<java.nio.ByteBuffer,java.util.SortedMap<java.nio.ByteBuffer,IColumn>> getRecordReader(org.apache.hadoop.mapred.InputSplit split, org.apache.hadoop.mapred.JobConf jobConf, org.apache.hadoop.mapred.Reporter reporter)
 org.apache.hadoop.mapred.InputSplit[] getSplits(org.apache.hadoop.mapred.JobConf jobConf, int numSplits)
 java.util.List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MAPRED_TASK_ID

public static final java.lang.String MAPRED_TASK_ID
See Also:
Constant Field Values

CASSANDRA_HADOOP_MAX_KEY_SIZE

public static final java.lang.String CASSANDRA_HADOOP_MAX_KEY_SIZE
See Also:
Constant Field Values

CASSANDRA_HADOOP_MAX_KEY_SIZE_DEFAULT

public static final int CASSANDRA_HADOOP_MAX_KEY_SIZE_DEFAULT
See Also:
Constant Field Values

Constructor Detail

ColumnFamilyInputFormat

public ColumnFamilyInputFormat()

Method Detail

getSplits

public java.util.List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
                                                                 throws java.io.IOException
Specified by:
getSplits in class org.apache.hadoop.mapreduce.InputFormat<java.nio.ByteBuffer,java.util.SortedMap<java.nio.ByteBuffer,IColumn>>
Throws:
java.io.IOException

createRecordReader

public org.apache.hadoop.mapreduce.RecordReader<java.nio.ByteBuffer,java.util.SortedMap<java.nio.ByteBuffer,IColumn>> createRecordReader(org.apache.hadoop.mapreduce.InputSplit inputSplit,
                                                                                                                                         org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext)
                                                                                                                                  throws java.io.IOException,
                                                                                                                                         java.lang.InterruptedException
Specified by:
createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<java.nio.ByteBuffer,java.util.SortedMap<java.nio.ByteBuffer,IColumn>>
Throws:
java.io.IOException
java.lang.InterruptedException
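
The RecordReader created here presents each row to the mapper as a java.nio.ByteBuffer key (the row key) paired with a SortedMap<ByteBuffer,IColumn> of the columns selected by the slice predicate. A minimal sketch of a Mapper consuming these pairs follows; the Text/IntWritable output types, the UTF-8 key decoding, and the class name are illustrative assumptions.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.SortedMap;

    import org.apache.cassandra.db.IColumn;
    import org.apache.cassandra.utils.ByteBufferUtil;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ColumnCountMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, IntWritable>
    {
        @Override
        public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
                throws IOException, InterruptedException
        {
            // The row key arrives as a raw ByteBuffer; this assumes
            // UTF-8-encoded row keys.
            String rowKey = ByteBufferUtil.string(key.duplicate());
            // Emit how many columns the slice predicate returned for this row.
            context.write(new Text(rowKey), new IntWritable(columns.size()));
        }
    }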

getSplits

public org.apache.hadoop.mapred.InputSplit[] getSplits(org.apache.hadoop.mapred.JobConf jobConf,
                                                       int numSplits)
                                                throws java.io.IOException
Specified by:
getSplits in interface org.apache.hadoop.mapred.InputFormat<java.nio.ByteBuffer,java.util.SortedMap<java.nio.ByteBuffer,IColumn>>
Throws:
java.io.IOException

getRecordReader

public org.apache.hadoop.mapred.RecordReader<java.nio.ByteBuffer,java.util.SortedMap<java.nio.ByteBuffer,IColumn>> getRecordReader(org.apache.hadoop.mapred.InputSplit split,
                                                                                                                                   org.apache.hadoop.mapred.JobConf jobConf,
                                                                                                                                   org.apache.hadoop.mapred.Reporter reporter)
                                                                                                                            throws java.io.IOException
Specified by:
getRecordReader in interface org.apache.hadoop.mapred.InputFormat<java.nio.ByteBuffer,java.util.SortedMap<java.nio.ByteBuffer,IColumn>>
Throws:
java.io.IOException


Copyright © 2012 The Apache Software Foundation