  1. CDH (READ-ONLY)
  2. DISTRO-522

MapFileOutputFormat.getReaders errors on Crunch _SUCCESS file

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: CDH4.2.1
    • Fix Version/s: None
    • Component/s: MapReduce
    • Labels: None

      Description

      MapFileOutputFormat.getReaders fails on the _SUCCESS file written by Crunch (starting with 0.7.x), on any file other than data/index, and on any folder that does not contain only data and index (empty folders also cause an error). At the least it should honor a PathFilter to exclude such entries, such as the hiddenFileFilter found in FileInputFormat. The latest beta version [1] appears to take a FileSystem argument named "ignored" (which is indeed ignored as currently written).

      Below is example code that resolves the issue:

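      import java.io.IOException;
      import java.util.Arrays;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.FileUtil;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.fs.PathFilter;
      import org.apache.hadoop.io.MapFile;

      // Skips hidden entries such as _SUCCESS, _logs and dot-files.
      private static final PathFilter hiddenFileFilter = new PathFilter() {
          @Override
          public boolean accept(final Path p) {
              final String name = p.getName();
              return !name.startsWith("_") && !name.startsWith(".");
          }
      };

      // Opens one MapFile.Reader per non-hidden entry under dir.
      public static MapFile.Reader[] getReaders(final Path dir,
              final Configuration conf) throws IOException {
          final FileSystem fs = dir.getFileSystem(conf);
          // List only the entries accepted by the filter.
          final Path[] names = FileUtil.stat2Paths(fs.listStatus(dir, hiddenFileFilter));
          // sort names, so that hash partitioning works
          Arrays.sort(names);

          final MapFile.Reader[] parts = new MapFile.Reader[names.length];
          for (int i = 0; i < names.length; i++) {
              parts[i] = new MapFile.Reader(names[i], conf);
          }
          return parts;
      }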
      

      I'm not sure what the standard way to define a hiddenFileFilter is, but it should be platform independent. There may be better ways to filter as well, particularly since a MapFile directory is always expected to contain data + index files.
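
      As a rough illustration of the data + index idea, here is a minimal sketch that keeps only subdirectories containing both files. It is written against java.nio.file on a local filesystem so it runs without Hadoop on the classpath; that substitution is an assumption for illustration only, and a real patch would perform the same checks through the Hadoop FileSystem API. The class and method names (MapFileDirLister, isMapFileDir, listMapFileDirs) are hypothetical; the file names "data" and "index" are the ones the MapFile format uses.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Hadoop or Crunch.
class MapFileDirLister {
    // File names a MapFile directory is expected to contain.
    static final String DATA_FILE_NAME = "data";
    static final String INDEX_FILE_NAME = "index";

    // True only for a directory that holds both a data and an index file.
    // Plain files such as _SUCCESS and empty directories are rejected.
    static boolean isMapFileDir(Path p) {
        return Files.isDirectory(p)
                && Files.isRegularFile(p.resolve(DATA_FILE_NAME))
                && Files.isRegularFile(p.resolve(INDEX_FILE_NAME));
    }

    // Lists the MapFile directories directly under dir, sorted by path
    // so that part-r-00000, part-r-00001, ... keep their partition order.
    static List<Path> listMapFileDirs(Path dir) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                if (isMapFileDir(child)) {
                    parts.add(child);
                }
            }
        }
        Collections.sort(parts);
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // Build a fake job output directory: one valid part, one marker file,
        // one empty directory.
        Path out = Files.createTempDirectory("mapfile-out");
        Path part = Files.createDirectory(out.resolve("part-r-00000"));
        Files.createFile(part.resolve(DATA_FILE_NAME));
        Files.createFile(part.resolve(INDEX_FILE_NAME));
        Files.createFile(out.resolve("_SUCCESS"));
        Files.createDirectory(out.resolve("empty"));

        for (Path p : listMapFileDirs(out)) {
            System.out.println(p.getFileName());
        }
    }
}
```

      Compared with a name-based hidden-file filter, this check does not depend on naming conventions at all, at the cost of extra existence checks per entry.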

      For the fix version, my project needs this patched back to at least 4.3.0 (ideally 4.2.1, the version we currently use, although we will upgrade at some point).

      [1]
      http://search-hadoop.com/c/MapReduce:hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/MapFileOutputFormat.java%7C%7C+%2522done+%2522map+output%2522

            People

            • Assignee: Unassigned
            • Reporter: catbauer24 Charles Hansen