Kite SDK (READ-ONLY) / KITE-1025

Wrap Parquet and Avro input formats with CombineFileInputFormat

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.2.0
    • Component/s: None
    • Labels: None

      Description

      Due to various historical changes in the way Kite works with its own InputFormat, Crunch's CrunchCombineFileInputFormat is no longer applied automatically when reading file-based datasets via Crunch.

      This means that each file in an input dataset results in its own input split, and therefore its own map task, when the dataset is read. With many small files, the overhead of the extra map tasks can noticeably degrade performance.

      It would be very useful if Kite were to automatically use CombineFileInputFormat's ability to combine multiple small files into a single input split when processing data via Crunch or MapReduce.
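
To illustrate why combining matters, here is a minimal, self-contained sketch of the idea. It is not Kite's, Crunch's, or Hadoop's implementation: the class name `CombineSplitSketch`, the `combine` method, and the 128 MB limit are all hypothetical, and the real CombineFileInputFormat also takes node and rack locality into account, which this sketch ignores. It only shows how greedy packing of small files reduces the number of input splits, and therefore map tasks.

```java
import java.util.ArrayList;
import java.util.List;

public class CombineSplitSketch {

    // Hypothetical helper: greedily packs files (given by length in bytes)
    // into splits of at most maxSplitBytes each. A file larger than the
    // limit ends up alone in its own split.
    static List<List<Long>> combine(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if adding this file would exceed the limit.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        List<Long> files = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            files.add(mb); // assumed workload: 1000 files of 1 MB each
        }
        // Without combining, each file is its own split (one map task per file).
        System.out.println("Uncombined splits: " + files.size());
        // With combining, files are packed up to the assumed 128 MB limit.
        System.out.println("Combined splits:   " + combine(files, 128 * mb).size());
    }
}
```

Under these assumed numbers, 1000 one-file splits collapse to 8 combined splits, which is the kind of reduction in map tasks this issue asks Kite to provide automatically.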

              People

              • Assignee: Gabriel Reid (greid)
              • Reporter: Gabriel Reid (greid)
              • Votes: 0
              • Watchers: 2