@Deprecated @InterfaceAudience.Private public class DistributedCache extends Object
DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars etc.) needed by applications.
Applications specify the files to be cached, via urls (hdfs:// or http://), in the JobConf. The DistributedCache assumes that the files specified via urls are already present on the FileSystem at the path specified by the url and are accessible by every machine in the cluster.
The framework copies the necessary files onto the slave node before any tasks for the job are executed on that node. Its efficiency stems from two facts: the files are copied only once per job, and archives can be cached and are un-archived on the slaves.
DistributedCache can be used to distribute simple, read-only data/text files and/or more complex types such as archives, jars etc. Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes. Jars may be optionally added to the classpath of the tasks, a rudimentary software distribution mechanism. Files have execution permissions.
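The archive formats listed above can be told apart by file extension. The following is a minimal sketch of such a classification, not Hadoop's own code: the class `ArchiveFormats`, its `Format` enum, and the case-insensitive matching are all assumptions made for illustration. Note that whether an entry is un-archived at all depends on it being added as an archive, not on its name alone.

```java
import java.util.Locale;

/** Illustrative helper, not part of Hadoop: maps a file name to an archive format. */
final class ArchiveFormats {

    /** The archive formats named in the text, plus a catch-all. */
    enum Format { ZIP, TAR, GZIPPED_TAR, UNKNOWN }

    /** Classifies by extension; case-insensitive matching is an assumption. */
    static Format detect(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".zip")) {
            return Format.ZIP;
        }
        // Check the gzipped variants before plain ".tar".
        if (name.endsWith(".tar.gz") || name.endsWith(".tgz")) {
            return Format.GZIPPED_TAR;
        }
        if (name.endsWith(".tar")) {
            return Format.TAR;
        }
        return Format.UNKNOWN;
    }

    public static void main(String[] args) {
        String[] names = {"map.zip", "mytar.tar", "mytgz.tgz", "mytargz.tar.gz", "lookup.dat"};
        for (String name : names) {
            System.out.println(name + " -> " + detect(name));
        }
    }
}
```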
In older versions of Hadoop Map/Reduce, users could optionally ask for symlinks to be created in the working directory of the child task. In the current version symlinks are always created. If the URL does not have a fragment, the name of the file or directory will be used. If multiple files or directories map to the same link name, the last one added will be used. All others will not even be downloaded.
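The link-name rule above (fragment if present, otherwise the file or directory name, with the last addition winning a conflict) can be sketched with `java.net.URI` alone. This is an illustration of the rule as described, not the framework's implementation; the class `CacheLinkNames` and its methods are hypothetical, and the sketch assumes hierarchical URIs with a non-empty path.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative helper, not part of Hadoop. */
final class CacheLinkNames {

    /** Returns the URI fragment if there is one, else the last path segment. */
    static String linkName(URI uri) {
        String fragment = uri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        String path = uri.getPath();
        // Drop a trailing slash so that a directory yields its own name.
        if (path.length() > 1 && path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Maps each link name to the URI that wins it; a later entry replaces an earlier one. */
    static Map<String, URI> resolve(List<URI> urisInAddOrder) {
        Map<String, URI> winners = new LinkedHashMap<>();
        for (URI uri : urisInAddOrder) {
            winners.put(linkName(uri), uri);
        }
        return winners;
    }

    public static void main(String[] args) {
        List<URI> uris = List.of(
            URI.create("/myapp/lookup.dat#lookup.dat"),
            URI.create("/myapp/map.zip"),
            URI.create("/other/data.bin#lookup.dat")); // same link name as the first entry
        resolve(uris).forEach((name, uri) -> System.out.println(name + " -> " + uri));
    }
}
```

In this sketch the third URI displaces the first because both map to the link name `lookup.dat`, which mirrors the "last one added" behaviour described above.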
DistributedCache tracks modification timestamps of the cache files. Clearly the cache files should not be modified by the application or externally while the job is executing.
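To make the timestamp idea concrete, here is a minimal sketch of recording a file's modification time and later checking that it has not changed. It uses local files through `java.nio.file` rather than a Hadoop FileSystem, and the class `CacheTimestampCheck` and its methods are hypothetical; how the framework itself reacts to a changed timestamp is not described here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

/** Illustrative helper, not part of Hadoop. */
final class CacheTimestampCheck {

    /** Records the modification time (milliseconds since the epoch). */
    static long recordTimestamp(Path file) throws IOException {
        return Files.getLastModifiedTime(file).toMillis();
    }

    /** Throws if the file's modification time differs from the recorded one. */
    static void verifyUnchanged(Path file, long recordedMillis) throws IOException {
        long current = Files.getLastModifiedTime(file).toMillis();
        if (current != recordedMillis) {
            throw new IOException("Cache file " + file + " changed: expected timestamp "
                + recordedMillis + " but found " + current);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("lookup", ".dat");
        try {
            long recorded = recordTimestamp(file);
            verifyUnchanged(file, recorded); // passes: nothing has touched the file

            // Simulate an external modification by moving the timestamp forward.
            Files.setLastModifiedTime(file, FileTime.fromMillis(recorded + 60_000));
            try {
                verifyUnchanged(file, recorded);
            } catch (IOException e) {
                System.out.println("Detected: " + e.getMessage());
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```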
Here is an illustrative example on how to use the DistributedCache:

    // Setting up the cache for the application

    1. Copy the requisite files to the FileSystem:

    $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
    $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
    $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
    $ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
    $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
    $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz

    2. Setup the application's JobConf:

    JobConf job = new JobConf();
    DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);
    DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
    DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
    DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
    DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
    DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);

    3. Use the cached files in the Mapper or Reducer:

    public static class MapClass extends MapReduceBase
        implements Mapper<K, V, K, V> {

      private Path[] localArchives;
      private Path[] localFiles;

      public void configure(JobConf job) {
        // Get the cached archives/files through the symlink in the working directory
        File f = new File("./map.zip/some/file/in/zip.txt");
      }

      public void map(K key, V value,
                      OutputCollector<K, V> output, Reporter reporter)
          throws IOException {
        // Use data from the cached archives/files here
        // ...
        // ...
        output.collect(key, value);
      }
    }

It is also very common to use the DistributedCache by using GenericOptionsParser.
This class includes methods that should be used by users (specifically those mentioned in the example above, as well as addArchiveToClassPath(Path, Configuration)), as well as methods intended for use by the MapReduce framework (e.g., JobClient).

Constructor and Description |
---|
DistributedCache() Deprecated. |

Modifier and Type | Method and Description |
---|---|
static void | addArchiveToClassPath(org.apache.hadoop.fs.Path archive, org.apache.hadoop.conf.Configuration conf) Deprecated. Use Job.addArchiveToClassPath(Path) instead |
static void | addArchiveToClassPath(org.apache.hadoop.fs.Path archive, org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.FileSystem fs) Deprecated. Add an archive path to the current set of classpath entries. |
static void | addCacheArchive(URI uri, org.apache.hadoop.conf.Configuration conf) Deprecated. Use Job.addCacheArchive(URI) instead |
static void | addCacheFile(URI uri, org.apache.hadoop.conf.Configuration conf) Deprecated. Use Job.addCacheFile(URI) instead |
static void | addFileToClassPath(org.apache.hadoop.fs.Path file, org.apache.hadoop.conf.Configuration conf) Deprecated. Use Job.addFileToClassPath(Path) instead |
static void | addFileToClassPath(org.apache.hadoop.fs.Path file, org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.FileSystem fs) Deprecated. Add a file path to the current set of classpath entries. |
static boolean | checkURIs(URI[] uriFiles, URI[] uriArchives) Deprecated. This method checks if there is a conflict in the fragment names of the uris. |
static void | createSymlink(org.apache.hadoop.conf.Configuration conf) Deprecated. This is a NO-OP. |
static org.apache.hadoop.fs.Path[] | getArchiveClassPaths(org.apache.hadoop.conf.Configuration conf) Deprecated. Use JobContext.getArchiveClassPaths() instead |
static long[] | getArchiveTimestamps(org.apache.hadoop.conf.Configuration conf) Deprecated. Use JobContext.getArchiveTimestamps() instead |
static boolean[] | getArchiveVisibilities(org.apache.hadoop.conf.Configuration conf) Deprecated. Get the booleans on whether the archives are public or not. |
static URI[] | getCacheArchives(org.apache.hadoop.conf.Configuration conf) Deprecated. Use JobContext.getCacheArchives() instead |
static URI[] | getCacheFiles(org.apache.hadoop.conf.Configuration conf) Deprecated. Use JobContext.getCacheFiles() instead |
static org.apache.hadoop.fs.Path[] | getFileClassPaths(org.apache.hadoop.conf.Configuration conf) Deprecated. Use JobContext.getFileClassPaths() instead |
static long[] | getFileTimestamps(org.apache.hadoop.conf.Configuration conf) Deprecated. Use JobContext.getFileTimestamps() instead |
static boolean[] | getFileVisibilities(org.apache.hadoop.conf.Configuration conf) Deprecated. Get the booleans on whether the files are public or not. |
static org.apache.hadoop.fs.Path[] | getLocalCacheArchives(org.apache.hadoop.conf.Configuration conf) Deprecated. Use JobContext.getLocalCacheArchives() instead |
static org.apache.hadoop.fs.Path[] | getLocalCacheFiles(org.apache.hadoop.conf.Configuration conf) Deprecated. Use JobContext.getLocalCacheFiles() instead |
static boolean | getSymlink(org.apache.hadoop.conf.Configuration conf) Deprecated. Symlinks are always created. |
static void | setCacheArchives(URI[] archives, org.apache.hadoop.conf.Configuration conf) Deprecated. Use Job.setCacheArchives(URI[]) instead |
static void | setCacheFiles(URI[] files, org.apache.hadoop.conf.Configuration conf) Deprecated. Use Job.setCacheFiles(URI[]) instead |

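The `checkURIs` entry in the table above describes a check for conflicting fragment names across the file and archive URIs. The sketch below shows one way such a check could work; it is not the Hadoop implementation. The class `FragmentConflictCheck`, the case-insensitive comparison, and the choice to skip URIs without a fragment are all assumptions, and the real method may treat those cases differently.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative helper, not part of Hadoop. */
final class FragmentConflictCheck {

    /**
     * Returns true when no two URIs across both arrays share a fragment.
     * Assumptions: null arrays are treated as empty, URIs without a fragment
     * are skipped, and fragments are compared case-insensitively.
     */
    static boolean fragmentsAreUnique(URI[] uriFiles, URI[] uriArchives) {
        Set<String> seen = new HashSet<>();
        return addAll(seen, uriFiles) && addAll(seen, uriArchives);
    }

    /** Adds each fragment to the set; returns false on the first duplicate. */
    private static boolean addAll(Set<String> seen, URI[] uris) {
        if (uris == null) {
            return true;
        }
        for (URI uri : uris) {
            String fragment = uri.getFragment();
            if (fragment == null) {
                continue;
            }
            if (!seen.add(fragment.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        URI[] files = { URI.create("/myapp/lookup.dat#data") };
        URI[] archives = { URI.create("/myapp/map.zip#data") };
        // The file and the archive both claim the fragment "data".
        System.out.println(fragmentsAreUnique(files, archives));
    }
}
```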
@Deprecated public static void setCacheArchives(URI[] archives, org.apache.hadoop.conf.Configuration conf)
Use Job.setCacheArchives(URI[]) instead.
Parameters:
archives - The list of archives that need to be localized
conf - Configuration which will be changed

@Deprecated public static void setCacheFiles(URI[] files, org.apache.hadoop.conf.Configuration conf)
Use Job.setCacheFiles(URI[]) instead.
Parameters:
files - The list of files that need to be localized
conf - Configuration which will be changed

@Deprecated public static URI[] getCacheArchives(org.apache.hadoop.conf.Configuration conf) throws IOException
Use JobContext.getCacheArchives() instead.
Parameters:
conf - The configuration which contains the archives
Throws:
IOException

@Deprecated public static URI[] getCacheFiles(org.apache.hadoop.conf.Configuration conf) throws IOException
Use JobContext.getCacheFiles() instead.
Parameters:
conf - The configuration which contains the files
Throws:
IOException

@Deprecated public static org.apache.hadoop.fs.Path[] getLocalCacheArchives(org.apache.hadoop.conf.Configuration conf) throws IOException
Use JobContext.getLocalCacheArchives() instead.
Parameters:
conf - Configuration that contains the localized archives
Throws:
IOException

@Deprecated public static org.apache.hadoop.fs.Path[] getLocalCacheFiles(org.apache.hadoop.conf.Configuration conf) throws IOException
Use JobContext.getLocalCacheFiles() instead.
Parameters:
conf - Configuration that contains the localized files
Throws:
IOException

@Deprecated public static long[] getArchiveTimestamps(org.apache.hadoop.conf.Configuration conf)
Use JobContext.getArchiveTimestamps() instead.
Parameters:
conf - The configuration which stored the timestamps

@Deprecated public static long[] getFileTimestamps(org.apache.hadoop.conf.Configuration conf)
Use JobContext.getFileTimestamps() instead.
Parameters:
conf - The configuration which stored the timestamps

@Deprecated public static void addCacheArchive(URI uri, org.apache.hadoop.conf.Configuration conf)
Use Job.addCacheArchive(URI) instead.
Parameters:
uri - The uri of the cache to be localized
conf - Configuration to add the cache to

@Deprecated public static void addCacheFile(URI uri, org.apache.hadoop.conf.Configuration conf)
Use Job.addCacheFile(URI) instead.
Parameters:
uri - The uri of the cache to be localized
conf - Configuration to add the cache to

@Deprecated public static void addFileToClassPath(org.apache.hadoop.fs.Path file, org.apache.hadoop.conf.Configuration conf) throws IOException
Use Job.addFileToClassPath(Path) instead.
Parameters:
file - Path of the file to be added
conf - Configuration that contains the classpath setting
Throws:
IOException

public static void addFileToClassPath(org.apache.hadoop.fs.Path file, org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.FileSystem fs) throws IOException
Parameters:
file - Path of the file to be added
conf - Configuration that contains the classpath setting
fs - FileSystem with respect to which file should be interpreted.
Throws:
IOException

@Deprecated public static org.apache.hadoop.fs.Path[] getFileClassPaths(org.apache.hadoop.conf.Configuration conf)
Use JobContext.getFileClassPaths() instead.
Parameters:
conf - Configuration that contains the classpath setting

@Deprecated public static void addArchiveToClassPath(org.apache.hadoop.fs.Path archive, org.apache.hadoop.conf.Configuration conf) throws IOException
Use Job.addArchiveToClassPath(Path) instead.
Parameters:
archive - Path of the archive to be added
conf - Configuration that contains the classpath setting
Throws:
IOException

public static void addArchiveToClassPath(org.apache.hadoop.fs.Path archive, org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.FileSystem fs) throws IOException
Parameters:
archive - Path of the archive to be added
conf - Configuration that contains the classpath setting
fs - FileSystem with respect to which archive should be interpreted.
Throws:
IOException

@Deprecated public static org.apache.hadoop.fs.Path[] getArchiveClassPaths(org.apache.hadoop.conf.Configuration conf)
Use JobContext.getArchiveClassPaths() instead.
Parameters:
conf - Configuration that contains the classpath setting

@Deprecated public static void createSymlink(org.apache.hadoop.conf.Configuration conf)
Parameters:
conf - the jobconf

@Deprecated public static boolean getSymlink(org.apache.hadoop.conf.Configuration conf)
Parameters:
conf - the jobconf

public static boolean[] getFileVisibilities(org.apache.hadoop.conf.Configuration conf)
Parameters:
conf - The configuration which stored the timestamps

public static boolean[] getArchiveVisibilities(org.apache.hadoop.conf.Configuration conf)
Parameters:
conf - The configuration which stored the timestamps

public static boolean checkURIs(URI[] uriFiles, URI[] uriArchives)
Parameters:
uriFiles - The uri array of urifiles
uriArchives - the uri array of uri archives

Copyright © 2017 Apache Software Foundation. All Rights Reserved.