org.apache.cassandra.hadoop
Class ColumnFamilyOutputFormat

java.lang.Object
  extended by org.apache.hadoop.mapreduce.OutputFormat<java.nio.ByteBuffer,java.util.List<org.apache.cassandra.thrift.Mutation>>
      extended by org.apache.cassandra.hadoop.ColumnFamilyOutputFormat
All Implemented Interfaces:
org.apache.hadoop.mapred.OutputFormat<java.nio.ByteBuffer,java.util.List<org.apache.cassandra.thrift.Mutation>>

public class ColumnFamilyOutputFormat
extends org.apache.hadoop.mapreduce.OutputFormat<java.nio.ByteBuffer,java.util.List<org.apache.cassandra.thrift.Mutation>>
implements org.apache.hadoop.mapred.OutputFormat<java.nio.ByteBuffer,java.util.List<org.apache.cassandra.thrift.Mutation>>

ColumnFamilyOutputFormat is a Hadoop OutputFormat that lets reduce tasks store keys and their corresponding values as Cassandra rows and their columns in a given ColumnFamily.

As with ColumnFamilyInputFormat, you must set the Keyspace and ColumnFamily in your Hadoop job Configuration. The ConfigHelper.setOutputColumnFamily(org.apache.hadoop.conf.Configuration, java.lang.String, java.lang.String) method is provided to make this simple.
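
A minimal sketch of the job setup described above, assuming the Hadoop and Cassandra jars are on the classpath. The class name OutputJobSetupSketch, the job name, and the keyspace and column family names are placeholders chosen for illustration. The mapper, reducer, and input side of the job are omitted, as is any connection configuration the cluster may require.

```java
import java.nio.ByteBuffer;
import java.util.List;

import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical driver class; only the output side of the job is shown.
public class OutputJobSetupSketch {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "cassandra-output-sketch");

        // Send the reduce output to Cassandra instead of the DFS.
        job.setOutputFormatClass(ColumnFamilyOutputFormat.class);

        // Key and value types match the type parameters in the class signature:
        // ByteBuffer row keys and List<Mutation> values.
        job.setOutputKeyClass(ByteBuffer.class);
        job.setOutputValueClass(List.class);

        // Placeholder keyspace and column family names.
        ConfigHelper.setOutputColumnFamily(job.getConfiguration(),
                                           "ExampleKeyspace",
                                           "ExampleColumnFamily");

        // Mapper, reducer, and input configuration would go here
        // before submitting the job.
    }
}
```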

For performance, this class uses a lazy write-back caching mechanism: its record writer batches the mutations created from the reduce's inputs in a task-specific map and periodically commits them by sending a batch mutate request to Cassandra.
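
The write-back idea can be illustrated with a small, self-contained sketch. This is not the ColumnFamilyRecordWriter implementation: the class WriteBackBufferSketch, its count-based flush threshold, and the Consumer standing in for a batch mutate request are all assumptions made for illustration, and the threshold is not claimed to correspond to the BATCH_THRESHOLD field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical illustration of lazy write-back batching.
class WriteBackBufferSketch<K, M> implements AutoCloseable {
    private final int batchThreshold;              // assumed trigger: pending mutation count
    private final Consumer<Map<K, List<M>>> sink;  // stands in for a batch mutate request
    private Map<K, List<M>> pending = new LinkedHashMap<>();
    private int pendingCount = 0;

    WriteBackBufferSketch(int batchThreshold, Consumer<Map<K, List<M>>> sink) {
        if (batchThreshold < 1) {
            throw new IllegalArgumentException("batchThreshold must be >= 1");
        }
        this.batchThreshold = batchThreshold;
        this.sink = sink;
    }

    // Buffer the mutations for one row key; send a batch once the threshold is reached.
    void write(K key, List<M> mutations) {
        if (mutations.isEmpty()) {
            return;
        }
        pending.computeIfAbsent(key, k -> new ArrayList<>()).addAll(mutations);
        pendingCount += mutations.size();
        if (pendingCount >= batchThreshold) {
            flush();
        }
    }

    // Hand all pending mutations to the sink in one call and start a fresh batch.
    void flush() {
        if (pendingCount == 0) {
            return;
        }
        Map<K, List<M>> batch = pending;
        pending = new LinkedHashMap<>();
        pendingCount = 0;
        sink.accept(batch);
    }

    // Flush whatever is still buffered when the task finishes.
    @Override
    public void close() {
        flush();
    }
}
```

Grouping mutations by row key before sending means repeated writes to the same key travel in one request, and the final flush in close() ensures nothing buffered is lost when the task ends.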


Nested Class Summary
static class ColumnFamilyOutputFormat.NullOutputCommitter
          An OutputCommitter that does nothing.
 
Field Summary
static java.lang.String BATCH_THRESHOLD
           
static java.lang.String QUEUE_SIZE
           
 
Constructor Summary
ColumnFamilyOutputFormat()
           
 
Method Summary
 void checkOutputSpecs(org.apache.hadoop.fs.FileSystem filesystem, org.apache.hadoop.mapred.JobConf job)
          Deprecated. 
 void checkOutputSpecs(org.apache.hadoop.mapreduce.JobContext context)
          Check for validity of the output-specification for the job.
static org.apache.cassandra.thrift.Cassandra.Client createAuthenticatedClient(org.apache.thrift.transport.TSocket socket, org.apache.hadoop.conf.Configuration conf)
          Return a client based on the given socket that points to the configured keyspace, and is logged in with the configured credentials.
 org.apache.hadoop.mapreduce.OutputCommitter getOutputCommitter(org.apache.hadoop.mapreduce.TaskAttemptContext context)
          The OutputCommitter for this format does not write any data to the DFS.
 org.apache.cassandra.hadoop.ColumnFamilyRecordWriter getRecordWriter(org.apache.hadoop.fs.FileSystem filesystem, org.apache.hadoop.mapred.JobConf job, java.lang.String name, org.apache.hadoop.util.Progressable progress)
          Deprecated. 
 org.apache.cassandra.hadoop.ColumnFamilyRecordWriter getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext context)
          Get the RecordWriter for the given task.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

BATCH_THRESHOLD

public static final java.lang.String BATCH_THRESHOLD
See Also:
Constant Field Values

QUEUE_SIZE

public static final java.lang.String QUEUE_SIZE
See Also:
Constant Field Values
Constructor Detail

ColumnFamilyOutputFormat

public ColumnFamilyOutputFormat()
Method Detail

checkOutputSpecs

public void checkOutputSpecs(org.apache.hadoop.mapreduce.JobContext context)
Check for validity of the output-specification for the job.

Specified by:
checkOutputSpecs in class org.apache.hadoop.mapreduce.OutputFormat<java.nio.ByteBuffer,java.util.List<org.apache.cassandra.thrift.Mutation>>
Parameters:
context - information about the job
Throws:
java.io.IOException - when output should not be attempted

getOutputCommitter

public org.apache.hadoop.mapreduce.OutputCommitter getOutputCommitter(org.apache.hadoop.mapreduce.TaskAttemptContext context)
                                                               throws java.io.IOException,
                                                                      java.lang.InterruptedException
The OutputCommitter for this format does not write any data to the DFS.

Specified by:
getOutputCommitter in class org.apache.hadoop.mapreduce.OutputFormat<java.nio.ByteBuffer,java.util.List<org.apache.cassandra.thrift.Mutation>>
Parameters:
context - the task context
Returns:
an output committer
Throws:
java.io.IOException
java.lang.InterruptedException

checkOutputSpecs

@Deprecated
public void checkOutputSpecs(org.apache.hadoop.fs.FileSystem filesystem,
                             org.apache.hadoop.mapred.JobConf job)
                      throws java.io.IOException
Deprecated. 

Implements the deprecated OutputFormat interface so that this class can be used with Hadoop streaming.

Specified by:
checkOutputSpecs in interface org.apache.hadoop.mapred.OutputFormat<java.nio.ByteBuffer,java.util.List<org.apache.cassandra.thrift.Mutation>>
Throws:
java.io.IOException

getRecordWriter

@Deprecated
public org.apache.cassandra.hadoop.ColumnFamilyRecordWriter getRecordWriter(org.apache.hadoop.fs.FileSystem filesystem,
                                                                            org.apache.hadoop.mapred.JobConf job,
                                                                            java.lang.String name,
                                                                            org.apache.hadoop.util.Progressable progress)
                                                                     throws java.io.IOException
Deprecated. 

Implements the deprecated OutputFormat interface so that this class can be used with Hadoop streaming.

Specified by:
getRecordWriter in interface org.apache.hadoop.mapred.OutputFormat<java.nio.ByteBuffer,java.util.List<org.apache.cassandra.thrift.Mutation>>
Throws:
java.io.IOException

getRecordWriter

public org.apache.cassandra.hadoop.ColumnFamilyRecordWriter getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext context)
                                                                     throws java.io.IOException,
                                                                            java.lang.InterruptedException
Get the RecordWriter for the given task.

Specified by:
getRecordWriter in class org.apache.hadoop.mapreduce.OutputFormat<java.nio.ByteBuffer,java.util.List<org.apache.cassandra.thrift.Mutation>>
Parameters:
context - the information about the current task.
Returns:
a RecordWriter to write the output for the job.
Throws:
java.io.IOException
java.lang.InterruptedException

createAuthenticatedClient

public static org.apache.cassandra.thrift.Cassandra.Client createAuthenticatedClient(org.apache.thrift.transport.TSocket socket,
                                                                                     org.apache.hadoop.conf.Configuration conf)
                                                                              throws org.apache.cassandra.thrift.InvalidRequestException,
                                                                                     org.apache.thrift.TException,
                                                                                     org.apache.cassandra.thrift.AuthenticationException,
                                                                                     org.apache.cassandra.thrift.AuthorizationException
Return a client based on the given socket that points to the configured keyspace, and is logged in with the configured credentials.

Parameters:
socket - a socket pointing to a particular node, seed or otherwise
conf - a job configuration
Returns:
a cassandra client
Throws:
org.apache.cassandra.thrift.InvalidRequestException
org.apache.thrift.TException
org.apache.cassandra.thrift.AuthenticationException
org.apache.cassandra.thrift.AuthorizationException


Copyright © 2011 The Apache Software Foundation