Details
- Type: Bug
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: CDH4.2.1
- Fix Version/s: None
- Component/s: MapReduce
- Labels: None
Description
MapFileOutputFormat.getReaders fails on the _SUCCESS file written by Crunch (starting with 0.7.x), on any file other than data/index, and on any folder that contains anything other than data and index (empty folders also error). At the very least it should honor a PathFilter to exclude such entries, such as the hiddenFileFilter found in FileInputFormat. The latest beta version [1] appears to take a FileSystem argument named "ignored" (which is itself ignored as currently written).
Below is example code that resolves the issue:
public static MapFile.Reader[] getReaders(final Path dir, final Configuration conf) throws IOException {
  final FileSystem fs = dir.getFileSystem(conf);
  final Path[] names = FileUtil.stat2Paths(fs.listStatus(dir, hiddenFileFilter));
  // sort names, so that hash partitioning works
  Arrays.sort(names);
  final MapFile.Reader[] parts = new MapFile.Reader[names.length];
  for (int i = 0; i < names.length; i++) {
    parts[i] = new MapFile.Reader(names[i], conf);
  }
  return parts;
}

private static final PathFilter hiddenFileFilter = new PathFilter() {
  @Override
  public boolean accept(final Path p) {
    final String name = p.getName();
    return !name.startsWith("_") && !name.startsWith(".");
  }
};
I'm not sure what the standard way to implement a hiddenFileFilter is, but it should be platform independent. Other filtering approaches may be better as well, particularly since the MapFile format always expects both a data file and an index file.
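As a rough illustration of the stricter "data + index" check mentioned above, here is a minimal sketch that uses plain java.nio.file on a local directory rather than the Hadoop FileSystem API. The class MapFilePartLister and its methods are hypothetical names, not part of Hadoop, and the file names "data" and "index" are assumed to match MapFile.DATA_FILE_NAME and MapFile.INDEX_FILE_NAME. A real fix would presumably express the same test as a PathFilter passed to FileSystem.listStatus.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: local-filesystem analogue of the filtering getReaders could apply.
final class MapFilePartLister {
    // Assumed to match Hadoop's MapFile.DATA_FILE_NAME and MapFile.INDEX_FILE_NAME.
    static final String DATA_FILE_NAME = "data";
    static final String INDEX_FILE_NAME = "index";

    private MapFilePartLister() {}

    // Same rule as the hiddenFileFilter above: skip names starting with "_" or ".".
    static boolean isHidden(Path p) {
        String name = p.getFileName().toString();
        return name.startsWith("_") || name.startsWith(".");
    }

    // A candidate part must be a directory holding both a data file and an index file.
    static boolean looksLikeMapFile(Path p) {
        return Files.isDirectory(p)
                && Files.isRegularFile(p.resolve(DATA_FILE_NAME))
                && Files.isRegularFile(p.resolve(INDEX_FILE_NAME));
    }

    // Lists the usable parts under dir, sorted so that hash partitioning still lines up.
    static List<Path> listParts(Path dir) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                if (!isHidden(child) && looksLikeMapFile(child)) {
                    parts.add(child);
                }
            }
        }
        Collections.sort(parts);
        return parts;
    }
}
```

With this check, a _SUCCESS marker, a stray file, and an empty folder are all skipped instead of causing an error. The trade-off is that a partially written part directory would be dropped silently, which may or may not be the desired behavior.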
For the fix version, my project needs this patched back to at least 4.3.0 (ideally 4.2.1, the version we currently use, though we will be upgrading at some point).