org.apache.accumulo.core.client.mapred
Class InputFormatBase<K,V>

java.lang.Object
  extended by org.apache.accumulo.core.client.mapred.InputFormatBase<K,V>
All Implemented Interfaces:
org.apache.hadoop.mapred.InputFormat<K,V>
Direct Known Subclasses:
AccumuloInputFormat, AccumuloRowInputFormat

public abstract class InputFormatBase<K,V>
extends Object
implements org.apache.hadoop.mapred.InputFormat<K,V>

This abstract InputFormat class allows MapReduce jobs to use Accumulo as the source of K,V pairs.

Subclasses must implement InputFormat.getRecordReader(InputSplit, JobConf, Reporter) to provide a RecordReader for K,V.

A static base class, RecordReaderBase, is provided to retrieve Accumulo Key/Value pairs, but one must implement its RecordReader.next(Object, Object) to transform them to the desired generic types K,V.

See AccumuloInputFormat for an example implementation.


Nested Class Summary
static class InputFormatBase.RangeInputSplit
          The Class RangeInputSplit.
protected static class InputFormatBase.RecordReaderBase<K,V>
          An abstract base class to be used to create RecordReader instances that convert from Accumulo Key/Value pairs to the user's K/V types.
 
Field Summary
protected static org.apache.log4j.Logger log
           
 
Constructor Summary
InputFormatBase()
           
 
Method Summary
static void addIterator(org.apache.hadoop.mapred.JobConf job, IteratorSetting cfg)
          Encode an iterator on the input for this job.
static void fetchColumns(org.apache.hadoop.mapred.JobConf job, Collection<Pair<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>> columnFamilyColumnQualifierPairs)
          Restricts the columns that will be mapped over for this job.
protected static boolean getAutoAdjustRanges(org.apache.hadoop.mapred.JobConf job)
          Determines whether a configuration has auto-adjust ranges enabled.
protected static Set<Pair<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>> getFetchedColumns(org.apache.hadoop.mapred.JobConf job)
          Gets the columns to be mapped over from this job.
protected static String getInputTableName(org.apache.hadoop.mapred.JobConf job)
          Gets the table name from the configuration.
protected static Instance getInstance(org.apache.hadoop.mapred.JobConf job)
          Initializes an Accumulo Instance based on the configuration.
protected static List<IteratorSetting> getIterators(org.apache.hadoop.mapred.JobConf job)
          Gets a list of the iterator settings (for iterators to apply to a scanner) from this configuration.
protected static org.apache.log4j.Level getLogLevel(org.apache.hadoop.mapred.JobConf job)
          Gets the log level from this configuration.
protected static String getPrincipal(org.apache.hadoop.mapred.JobConf job)
          Gets the user name from the configuration.
protected static List<Range> getRanges(org.apache.hadoop.mapred.JobConf job)
          Gets the ranges to scan over from a job.
protected static Authorizations getScanAuthorizations(org.apache.hadoop.mapred.JobConf job)
          Gets the authorizations to set for the scans from the configuration.
 org.apache.hadoop.mapred.InputSplit[] getSplits(org.apache.hadoop.mapred.JobConf job, int numSplits)
          Read the metadata table to get tablets and match up ranges to them.
protected static TabletLocator getTabletLocator(org.apache.hadoop.mapred.JobConf job)
          Initializes an Accumulo TabletLocator based on the configuration.
protected static byte[] getToken(org.apache.hadoop.mapred.JobConf job)
          Gets the password from the configuration.
protected static String getTokenClass(org.apache.hadoop.mapred.JobConf job)
          Gets the serialized token class from the configuration.
protected static Boolean isConnectorInfoSet(org.apache.hadoop.mapred.JobConf job)
          Determines if the connector has been configured.
protected static boolean isIsolated(org.apache.hadoop.mapred.JobConf job)
          Determines whether a configuration has isolation enabled.
protected static boolean isOfflineScan(org.apache.hadoop.mapred.JobConf job)
          Determines whether a configuration has the offline table scan feature enabled.
static void setAutoAdjustRanges(org.apache.hadoop.mapred.JobConf job, boolean enableFeature)
          Controls the automatic adjustment of ranges for this job.
static void setConnectorInfo(org.apache.hadoop.mapred.JobConf job, String principal, AuthenticationToken token)
          Sets the connector information needed to communicate with Accumulo in this job.
static void setInputTableName(org.apache.hadoop.mapred.JobConf job, String tableName)
          Sets the name of the input table, over which this job will scan.
static void setLocalIterators(org.apache.hadoop.mapred.JobConf job, boolean enableFeature)
          Controls the use of the ClientSideIteratorScanner in this job.
static void setLogLevel(org.apache.hadoop.mapred.JobConf job, org.apache.log4j.Level level)
          Sets the log level for this job.
static void setMockInstance(org.apache.hadoop.mapred.JobConf job, String instanceName)
          Configures a MockInstance for this job.
static void setOfflineTableScan(org.apache.hadoop.mapred.JobConf job, boolean enableFeature)
           Enable reading offline tables.
static void setRanges(org.apache.hadoop.mapred.JobConf job, Collection<Range> ranges)
          Sets the input ranges to scan for this job.
static void setScanAuthorizations(org.apache.hadoop.mapred.JobConf job, Authorizations auths)
          Sets the Authorizations used to scan.
static void setScanIsolation(org.apache.hadoop.mapred.JobConf job, boolean enableFeature)
          Controls the use of the IsolatedScanner in this job.
static void setZooKeeperInstance(org.apache.hadoop.mapred.JobConf job, String instanceName, String zooKeepers)
          Configures a ZooKeeperInstance for this job.
protected static boolean usesLocalIterators(org.apache.hadoop.mapred.JobConf job)
          Determines whether a configuration uses local iterators.
protected static void validateOptions(org.apache.hadoop.mapred.JobConf job)
          Check whether a configuration is fully configured to be used with an Accumulo InputFormat.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.mapred.InputFormat
getRecordReader
 

Field Detail

log

protected static final org.apache.log4j.Logger log
Constructor Detail

InputFormatBase

public InputFormatBase()
Method Detail

setConnectorInfo

public static void setConnectorInfo(org.apache.hadoop.mapred.JobConf job,
                                    String principal,
                                    AuthenticationToken token)
                             throws AccumuloSecurityException
Sets the connector information needed to communicate with Accumulo in this job.

WARNING: The serialized token is stored in the configuration and shared with all MapReduce tasks. It is BASE64 encoded to provide a charset safe conversion to a string, and is not intended to be secure.
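
To make the warning concrete, the following minimal sketch shows why BASE64 encoding offers no protection: anyone who can read the stored string can decode it without a key. It assumes the JDK's java.util.Base64 as a stand-in for whatever codec the library uses internally; the class and method names (TokenEncodingDemo, encode, decode) are hypothetical and are not part of the Accumulo API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical demo class; not part of the Accumulo API.
final class TokenEncodingDemo {

  // Encode raw token bytes as a Base64 string, a charset-safe form
  // that can be stored as a plain configuration value.
  static String encode(byte[] token) {
    return Base64.getEncoder().encodeToString(token);
  }

  // Reverse the encoding. No key or secret is required, which is why
  // the stored value must not be treated as protected.
  static byte[] decode(String stored) {
    return Base64.getDecoder().decode(stored);
  }

  public static void main(String[] args) {
    String stored = encode("secret".getBytes(StandardCharsets.UTF_8));
    System.out.println(stored); // the value a task could read from its configuration
    System.out.println(new String(decode(stored), StandardCharsets.UTF_8));
  }
}
```

In practice this means that access to the job configuration should be restricted to the same degree as access to the credentials themselves.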

Parameters:
job - the Hadoop job instance to be configured
principal - a valid Accumulo user name (user must have Table.CREATE permission)
token - the user's password
Throws:
AccumuloSecurityException
Since:
1.5.0

isConnectorInfoSet

protected static Boolean isConnectorInfoSet(org.apache.hadoop.mapred.JobConf job)
Determines if the connector has been configured.

Parameters:
job - the Hadoop context for the configured job
Returns:
true if the connector has been configured, false otherwise
Since:
1.5.0
See Also:
setConnectorInfo(JobConf, String, AuthenticationToken)

getPrincipal

protected static String getPrincipal(org.apache.hadoop.mapred.JobConf job)
Gets the user name from the configuration.

Parameters:
job - the Hadoop context for the configured job
Returns:
the user name
Since:
1.5.0
See Also:
setConnectorInfo(JobConf, String, AuthenticationToken)

getTokenClass

protected static String getTokenClass(org.apache.hadoop.mapred.JobConf job)
Gets the serialized token class from the configuration.

Parameters:
job - the Hadoop context for the configured job
Returns:
the class name of the serialized token
Since:
1.5.0
See Also:
setConnectorInfo(JobConf, String, AuthenticationToken)

getToken

protected static byte[] getToken(org.apache.hadoop.mapred.JobConf job)
Gets the password from the configuration. WARNING: The password is stored in the Configuration and shared with all MapReduce tasks. It is BASE64 encoded to provide a charset safe conversion to a string, and is not intended to be secure.

Parameters:
job - the Hadoop context for the configured job
Returns:
the decoded user password
Since:
1.5.0
See Also:
setConnectorInfo(JobConf, String, AuthenticationToken)

setZooKeeperInstance

public static void setZooKeeperInstance(org.apache.hadoop.mapred.JobConf job,
                                        String instanceName,
                                        String zooKeepers)
Configures a ZooKeeperInstance for this job.

Parameters:
job - the Hadoop job instance to be configured
instanceName - the Accumulo instance name
zooKeepers - a comma-separated list of zookeeper servers
Since:
1.5.0

setMockInstance

public static void setMockInstance(org.apache.hadoop.mapred.JobConf job,
                                   String instanceName)
Configures a MockInstance for this job.

Parameters:
job - the Hadoop job instance to be configured
instanceName - the Accumulo instance name
Since:
1.5.0

getInstance

protected static Instance getInstance(org.apache.hadoop.mapred.JobConf job)
Initializes an Accumulo Instance based on the configuration.

Parameters:
job - the Hadoop context for the configured job
Returns:
an Accumulo instance
Since:
1.5.0
See Also:
setZooKeeperInstance(JobConf, String, String), setMockInstance(JobConf, String)

setLogLevel

public static void setLogLevel(org.apache.hadoop.mapred.JobConf job,
                               org.apache.log4j.Level level)
Sets the log level for this job.

Parameters:
job - the Hadoop job instance to be configured
level - the logging level
Since:
1.5.0

getLogLevel

protected static org.apache.log4j.Level getLogLevel(org.apache.hadoop.mapred.JobConf job)
Gets the log level from this configuration.

Parameters:
job - the Hadoop context for the configured job
Returns:
the log level
Since:
1.5.0
See Also:
setLogLevel(JobConf, Level)

setInputTableName

public static void setInputTableName(org.apache.hadoop.mapred.JobConf job,
                                     String tableName)
Sets the name of the input table, over which this job will scan.

Parameters:
job - the Hadoop job instance to be configured
tableName - the name of the table this job will read from
Since:
1.5.0

getInputTableName

protected static String getInputTableName(org.apache.hadoop.mapred.JobConf job)
Gets the table name from the configuration.

Parameters:
job - the Hadoop context for the configured job
Returns:
the table name
Since:
1.5.0
See Also:
setInputTableName(JobConf, String)

setScanAuthorizations

public static void setScanAuthorizations(org.apache.hadoop.mapred.JobConf job,
                                         Authorizations auths)
Sets the Authorizations used to scan. Must be a subset of the user's authorizations. Defaults to the empty set.
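
The subset requirement can be pictured with plain sets of authorization labels. This is only an illustrative model under the assumption that authorizations behave like a set of strings; it does not use the Accumulo Authorizations class, and AuthSubsetDemo and isValidScanAuths are hypothetical names.

```java
import java.util.Set;

// Hypothetical model of the subset rule; not part of the Accumulo API.
final class AuthSubsetDemo {

  // Scan authorizations are acceptable only if every label is also
  // held by the user. The empty set (the default) always qualifies.
  static boolean isValidScanAuths(Set<String> userAuths, Set<String> scanAuths) {
    return userAuths.containsAll(scanAuths);
  }
}
```

For a user holding "public" and "admin", a scan with only "public" or with no labels passes this check, while a scan requesting "secret" does not.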

Parameters:
job - the Hadoop job instance to be configured
auths - the user's authorizations
Since:
1.5.0

getScanAuthorizations

protected static Authorizations getScanAuthorizations(org.apache.hadoop.mapred.JobConf job)
Gets the authorizations to set for the scans from the configuration.

Parameters:
job - the Hadoop context for the configured job
Returns:
the Accumulo scan authorizations
Since:
1.5.0
See Also:
setScanAuthorizations(JobConf, Authorizations)

setRanges

public static void setRanges(org.apache.hadoop.mapred.JobConf job,
                             Collection<Range> ranges)
Sets the input ranges to scan for this job. If not set, the entire table will be scanned.

Parameters:
job - the Hadoop job instance to be configured
ranges - the ranges that will be mapped over
Since:
1.5.0

getRanges

protected static List<Range> getRanges(org.apache.hadoop.mapred.JobConf job)
                                throws IOException
Gets the ranges to scan over from a job.

Parameters:
job - the Hadoop context for the configured job
Returns:
the ranges
Throws:
IOException - if the ranges have been encoded improperly
Since:
1.5.0
See Also:
setRanges(JobConf, Collection)

fetchColumns

public static void fetchColumns(org.apache.hadoop.mapred.JobConf job,
                                Collection<Pair<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>> columnFamilyColumnQualifierPairs)
Restricts the columns that will be mapped over for this job.

Parameters:
job - the Hadoop job instance to be configured
columnFamilyColumnQualifierPairs - a collection of pairs of Text objects, each corresponding to a column family and column qualifier. If the column qualifier is null, the entire column family is selected. An empty set is the default and is equivalent to scanning all columns.
Since:
1.5.0
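
The selection rules described for columnFamilyColumnQualifierPairs can be sketched as a small matcher. This is a simplified model that assumes String names and java.util.Map.Entry in place of Text and Pair; ColumnFilterDemo, column, and isFetched are hypothetical names, and the real filtering is performed by the scanner rather than by code like this.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Collection;
import java.util.Map.Entry;

// Hypothetical model of the column selection rules; not part of the Accumulo API.
final class ColumnFilterDemo {

  // Build a (family, qualifier) pair; a null qualifier selects the whole family.
  static Entry<String, String> column(String family, String qualifier) {
    return new SimpleImmutableEntry<>(family, qualifier);
  }

  // Decide whether a column would be mapped over for the given selection.
  static boolean isFetched(Collection<Entry<String, String>> fetched,
                           String family, String qualifier) {
    if (fetched.isEmpty()) {
      return true; // the default: an empty selection means all columns
    }
    for (Entry<String, String> col : fetched) {
      if (!col.getKey().equals(family)) {
        continue; // different family, keep looking
      }
      if (col.getValue() == null || col.getValue().equals(qualifier)) {
        return true; // whole family selected, or exact qualifier match
      }
    }
    return false;
  }
}
```

With the selection (attr, null) and (meta, size), every column in the attr family is fetched, meta:size is fetched, and meta:owner is not.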

getFetchedColumns

protected static Set<Pair<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>> getFetchedColumns(org.apache.hadoop.mapred.JobConf job)
Gets the columns to be mapped over from this job.

Parameters:
job - the Hadoop context for the configured job
Returns:
a set of columns
Since:
1.5.0
See Also:
fetchColumns(JobConf, Collection)

addIterator

public static void addIterator(org.apache.hadoop.mapred.JobConf job,
                               IteratorSetting cfg)
Encode an iterator on the input for this job.

Parameters:
job - the Hadoop job instance to be configured
cfg - the configuration of the iterator
Since:
1.5.0

getIterators

protected static List<IteratorSetting> getIterators(org.apache.hadoop.mapred.JobConf job)
Gets a list of the iterator settings (for iterators to apply to a scanner) from this configuration.

Parameters:
job - the Hadoop context for the configured job
Returns:
a list of iterators
Since:
1.5.0
See Also:
addIterator(JobConf, IteratorSetting)

setAutoAdjustRanges

public static void setAutoAdjustRanges(org.apache.hadoop.mapred.JobConf job,
                                       boolean enableFeature)
Controls the automatic adjustment of ranges for this job. This feature merges overlapping ranges, then splits them to align with tablet boundaries. Disabling this feature will cause exactly one Map task to be created for each specified range.

By default, this feature is enabled.
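
The merge-then-split behavior can be illustrated with a simplified model. The sketch below assumes integer half-open intervals [start, end) in place of Accumulo Range objects and a plain array of split points in place of tablet boundaries; RangeAdjustDemo and its methods are hypothetical and do not reproduce the library's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of range auto-adjustment; not part of the Accumulo API.
final class RangeAdjustDemo {

  // Step 1: merge overlapping (or touching) half-open intervals [start, end).
  static List<int[]> merge(List<int[]> ranges) {
    List<int[]> sorted = new ArrayList<>(ranges);
    sorted.sort(Comparator.comparingInt((int[] r) -> r[0]));
    List<int[]> merged = new ArrayList<>();
    for (int[] r : sorted) {
      int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
      if (last != null && r[0] <= last[1]) {
        last[1] = Math.max(last[1], r[1]); // extend the previous interval
      } else {
        merged.add(new int[] {r[0], r[1]}); // copy so the input is not mutated
      }
    }
    return merged;
  }

  // Step 2: cut each merged interval at every boundary strictly inside it.
  static List<int[]> splitAtBoundaries(List<int[]> merged, int[] boundaries) {
    int[] sortedBoundaries = boundaries.clone();
    Arrays.sort(sortedBoundaries);
    List<int[]> pieces = new ArrayList<>();
    for (int[] r : merged) {
      int start = r[0];
      for (int b : sortedBoundaries) {
        if (b > start && b < r[1]) {
          pieces.add(new int[] {start, b});
          start = b;
        }
      }
      pieces.add(new int[] {start, r[1]});
    }
    return pieces;
  }

  // Merge first, then align the result with the boundaries.
  static List<int[]> adjust(List<int[]> ranges, int[] boundaries) {
    return splitAtBoundaries(merge(ranges), boundaries);
  }
}
```

In this model, the three input ranges [0,15), [10,35), and [40,45) with boundaries at 20 and 30 become four pieces: [0,20), [20,30), [30,35), and [40,45). With the feature disabled, the same input would instead yield one Map task per specified range, three in total.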

Parameters:
job - the Hadoop job instance to be configured
enableFeature - the feature is enabled if true, disabled otherwise
Since:
1.5.0
See Also:
setRanges(JobConf, Collection)

getAutoAdjustRanges

protected static boolean getAutoAdjustRanges(org.apache.hadoop.mapred.JobConf job)
Determines whether a configuration has auto-adjust ranges enabled.

Parameters:
job - the Hadoop context for the configured job
Returns:
false if the feature is disabled, true otherwise
Since:
1.5.0
See Also:
setAutoAdjustRanges(JobConf, boolean)

setScanIsolation

public static void setScanIsolation(org.apache.hadoop.mapred.JobConf job,
                                    boolean enableFeature)
Controls the use of the IsolatedScanner in this job.

By default, this feature is disabled.

Parameters:
job - the Hadoop job instance to be configured
enableFeature - the feature is enabled if true, disabled otherwise
Since:
1.5.0

isIsolated

protected static boolean isIsolated(org.apache.hadoop.mapred.JobConf job)
Determines whether a configuration has isolation enabled.

Parameters:
job - the Hadoop context for the configured job
Returns:
true if the feature is enabled, false otherwise
Since:
1.5.0
See Also:
setScanIsolation(JobConf, boolean)

setLocalIterators

public static void setLocalIterators(org.apache.hadoop.mapred.JobConf job,
                                     boolean enableFeature)
Controls the use of the ClientSideIteratorScanner in this job. Enabling this feature will cause the iterator stack to be constructed within the Map task, rather than within the Accumulo TServer. To use this feature, all classes needed for those iterators must be available on the classpath for the task.

By default, this feature is disabled.

Parameters:
job - the Hadoop job instance to be configured
enableFeature - the feature is enabled if true, disabled otherwise
Since:
1.5.0

usesLocalIterators

protected static boolean usesLocalIterators(org.apache.hadoop.mapred.JobConf job)
Determines whether a configuration uses local iterators.

Parameters:
job - the Hadoop context for the configured job
Returns:
true if the feature is enabled, false otherwise
Since:
1.5.0
See Also:
setLocalIterators(JobConf, boolean)

setOfflineTableScan

public static void setOfflineTableScan(org.apache.hadoop.mapred.JobConf job,
                                       boolean enableFeature)

Enable reading offline tables. By default, this feature is disabled and only online tables are scanned. This will make the map reduce job directly read the table's files. If the table is not offline, then the job will fail. If the table comes online during the map reduce job, it is likely that the job will fail.

To use this option, the map reduce user will need access to read the Accumulo directory in HDFS.

Reading the offline table will create the scan time iterator stack in the map process, so any iterators that are configured for the table will need to be on the mapper's classpath. The accumulo-site.xml may need to be on the mapper's classpath if HDFS or the Accumulo directory in HDFS are non-standard.

One way to use this feature is to clone a table, take the clone offline, and use the clone as the input table for a map reduce job. If you plan to map reduce over the data many times, it may be better to compact the table, clone it, take it offline, and use the clone for all map reduce jobs. The reason to do this is that compaction will reduce each tablet in the table to one file, and it is faster to read from one file.

There are two possible advantages to reading a table's files directly out of HDFS. First, you may see better read performance. Second, it will support speculative execution better. When reading an online table, speculative execution can put more load on an already slow tablet server.

By default, this feature is disabled.

Parameters:
job - the Hadoop job instance to be configured
enableFeature - the feature is enabled if true, disabled otherwise
Since:
1.5.0

isOfflineScan

protected static boolean isOfflineScan(org.apache.hadoop.mapred.JobConf job)
Determines whether a configuration has the offline table scan feature enabled.

Parameters:
job - the Hadoop context for the configured job
Returns:
true if the feature is enabled, false otherwise
Since:
1.5.0
See Also:
setOfflineTableScan(JobConf, boolean)

getTabletLocator

protected static TabletLocator getTabletLocator(org.apache.hadoop.mapred.JobConf job)
                                         throws TableNotFoundException
Initializes an Accumulo TabletLocator based on the configuration.

Parameters:
job - the Hadoop context for the configured job
Returns:
an Accumulo tablet locator
Throws:
TableNotFoundException - if the table name set on the configuration doesn't exist
Since:
1.5.0

validateOptions

protected static void validateOptions(org.apache.hadoop.mapred.JobConf job)
                               throws IOException
Check whether a configuration is fully configured to be used with an Accumulo InputFormat.

Parameters:
job - the Hadoop context for the configured job
Throws:
IOException - if the context is improperly configured
Since:
1.5.0

getSplits

public org.apache.hadoop.mapred.InputSplit[] getSplits(org.apache.hadoop.mapred.JobConf job,
                                                       int numSplits)
                                                throws IOException
Read the metadata table to get tablets and match up ranges to them.
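
As a rough illustration of matching a range to tablets, the sketch below models the tablet metadata as a sorted map from each tablet's inclusive end row to a server location. All of this is assumed for illustration: the real method reads the Accumulo metadata table through a TabletLocator, and TabletMatchDemo, locationsFor, and the server names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of matching a row range to tablets; not part of the Accumulo API.
final class TabletMatchDemo {

  // tabletsByEndRow maps each tablet's inclusive end row to its location.
  // lastTabletLocation hosts the final tablet, which has no end row.
  static List<String> locationsFor(TreeMap<String, String> tabletsByEndRow,
                                   String lastTabletLocation,
                                   String startRow, String endRow) {
    List<String> locations = new ArrayList<>();
    // Tablets whose end row is >= startRow, visited in row order.
    for (Map.Entry<String, String> tablet
        : tabletsByEndRow.tailMap(startRow, true).entrySet()) {
      locations.add(tablet.getValue());
      if (tablet.getKey().compareTo(endRow) >= 0) {
        return locations; // this tablet contains the end of the range
      }
    }
    locations.add(lastTabletLocation); // the range runs into the last tablet
    return locations;
  }
}
```

With tablets ending at rows "g" and "p" plus a final unbounded tablet, a range from "e" to "k" touches the first two tablets, so a split aligned with tablets would be produced for each of them.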

Specified by:
getSplits in interface org.apache.hadoop.mapred.InputFormat<K,V>
Throws:
IOException


Copyright © 2013 Apache Accumulo Project. All Rights Reserved.