Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.0
    • Component/s: Data Module
    • Labels: None

Description

When using a combination of YARN + Crunch + Kite + S3 (org.apache.hadoop.fs.s3a.S3AFileSystem), we observed that data records are copied a number of times after being written to S3, via S3 copy requests. This happens because the objects are moved first from /path/.temp/job_X/mr/attempt_X to /path/.temp/job_X/mr/job_X, and then to their final destination in /path. In each location they are also renamed from .filename.tmp to filename. This adds up to a total of 5 renames after the initial write.

Because S3 does not natively support renaming objects, the filesystem must implement rename as a copy operation, which is very slow. The first 3 copies happen concurrently across all reduce tasks. The last 2 are particularly painful, as they are performed serially for all output files by the application master.
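
For illustration, a minimal sketch of what one of these renames looks like at the FileSystem API level (bucket name illustrative, paths taken from the layout above). S3 has no rename primitive, so S3AFileSystem services the call by copying the object to the destination key and then deleting the source key, i.e. the whole object is rewritten server-side for every "rename":

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3ARenameCost {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"), new Configuration());

        // One of the five post-write renames described above. Under S3AFileSystem
        // this is not a metadata-only operation: it issues an S3 COPY request to
        // the destination key followed by a DELETE of the source key.
        Path src = new Path("s3a://bucket/path/.temp/job_X/mr/job_X/.filename.tmp");
        Path dst = new Path("s3a://bucket/path/.temp/job_X/mr/job_X/filename");
        fs.rename(src, dst);
      }
    }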

To make matters worse, S3AFileSystem does not set the multipart copy part size and threshold in its transfer manager configuration [1], and the default threshold is 5 GB, which is larger than our Crunch output files, so a full-object remote copy is performed. We are using snappy-compressed Avro.
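
For reference, these knobs are exposed by the AWS SDK's TransferManagerConfiguration, which S3AFileSystem could set when it builds its transfer manager. A minimal sketch, with purely illustrative threshold and part-size values (not what HADOOP-12891 proposes):

    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.transfer.TransferManager;
    import com.amazonaws.services.s3.transfer.TransferManagerConfiguration;

    public class TransferManagerCopyTuning {
      public static void main(String[] args) {
        TransferManager transfers = new TransferManager(new AmazonS3Client());

        // The SDK default multipart copy threshold is 5 GB, so anything smaller
        // (e.g. our snappy-compressed Avro output files) is copied with a single
        // full-object COPY request.
        TransferManagerConfiguration tmConf = new TransferManagerConfiguration();
        tmConf.setMultipartCopyThreshold(128L * 1024 * 1024); // illustrative: 128 MB
        tmConf.setMultipartCopyPartSize(64L * 1024 * 1024);   // illustrative: 64 MB
        transfers.setConfiguration(tmConf);

        // Copies initiated through this TransferManager are now split into
        // multipart copy-part requests for objects above the configured threshold.
      }
    }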

As far as a long-term solution goes, the Hadoop project is working on a substantial enhancement [2] to enable improved interactions with "Object Store" (non-)filesystems [3] such as S3. As is evident from the Jira activity, this is a well-known deficiency. Until that work is done, Kite can inspect the filesystem URI and, when writing to a known object store, elect to skip all of the renaming (i.e., copying) overhead. A new configuration property can be introduced to override this inference, and the resulting change in write behavior, if desired; a rough sketch follows.
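
One possible shape of that inference plus override, assuming a hypothetical property name and scheme list (neither is an existing Kite option):

    import java.net.URI;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.conf.Configuration;

    public class ObjectStoreWriteBehavior {

      // Hypothetical property name, used only for illustration.
      public static final String SKIP_TEMP_RENAME_PROP = "kite.writer.skip-temp-rename";

      // URI schemes known to be backed by object stores rather than
      // real (rename-capable) filesystems.
      private static final Set<String> OBJECT_STORE_SCHEMES =
          new HashSet<>(Arrays.asList("s3", "s3a", "s3n", "swift"));

      /**
       * Returns true if the temp-directory + rename commit sequence should be
       * skipped for this output location. The URI-based inference can be
       * overridden explicitly via the configuration property above.
       */
      public static boolean skipTempAndRename(URI outputUri, Configuration conf) {
        boolean looksLikeObjectStore =
            OBJECT_STORE_SCHEMES.contains(outputUri.getScheme());
        return conf.getBoolean(SKIP_TEMP_RENAME_PROP, looksLikeObjectStore);
      }
    }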

We noticed a recent Crunch patch [4] that parallelized some of the renaming. However, that code is not actually consumed by Kite, and skipping the temp file + rename step entirely is much preferable anyway.

[1] https://issues.apache.org/jira/browse/HADOOP-12891
[2] https://issues.apache.org/jira/browse/HADOOP-9565
[3] https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md#warning-object-stores-are-not-filesystems
[4] https://issues.apache.org/jira/browse/CRUNCH-580

People

    • Assignee: Unassigned
    • Reporter: Andrew Olson (noslowerdna)
    • Votes: 0
    • Watchers: 2
