  1. Kite SDK (READ-ONLY)
  2. KITE-1101

Inefficient Scanning for Matching Partitions in FileSystemPartitionIterator

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.0
    • Component/s: Data Module
    • Labels:
      None

      Description

      We are seeing poor job startup performance: it can take 30 minutes to an hour or longer to kick off an MR job. Two factors come into play:

      • Detecting the files that match the partition
      • Calculating the splits

      In this issue I want to focus on the first factor: finding the files that should be included for processing. If you partition your data and are only looking for a subset of it, the code as written [1] appears to enumerate all of the files in the dataset and then filter out the paths that do not match the partition constraints. This leads to many calls to HDFS to list file statuses for directories that contain no data matching the partition restrictions.

      [1] - https://github.com/kite-sdk/kite/blob/3945ce5d512f06f8a79557acd0344229ea3dc919/kite-data/kite-data-core/src/main/java/org/kitesdk/data/spi/filesystem/FileSystemPartitionIterator.java#L110
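One possible way to avoid the extra listings is to apply the partition constraints at each directory level while descending, so that non-matching subtrees are never listed. The following is only a minimal sketch of that idea, not the change that was made in Kite. It assumes a Hive-style layout such as year=2015/month=2, uses java.nio.file on a local directory as a stand-in for HDFS, and every name in it (PartitionPruningSketch, matchingPartitions, LISTINGS) is hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Hypothetical illustration; not part of Kite or Hadoop.
final class PartitionPruningSketch {

    // Counts directory listings; each one stands in for an HDFS list-status call.
    static final AtomicInteger LISTINGS = new AtomicInteger();

    // Returns the leaf partition directories under root whose directory name
    // at every level passes that level's filter. levelFilters.get(i) is
    // applied to directory names at depth i below root.
    static List<Path> matchingPartitions(Path root, List<Predicate<String>> levelFilters)
            throws IOException {
        List<Path> matches = new ArrayList<>();
        descend(root, 0, levelFilters, matches);
        return matches;
    }

    private static void descend(Path dir, int depth, List<Predicate<String>> filters,
                                List<Path> matches) throws IOException {
        if (depth == filters.size()) {
            matches.add(dir); // every partition level matched
            return;
        }
        LISTINGS.incrementAndGet();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                String name = child.getFileName().toString();
                // Prune here: a rejected directory is never listed.
                if (Files.isDirectory(child) && filters.get(depth).test(name)) {
                    descend(child, depth + 1, filters, matches);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small year/month layout: 3 years x 3 months.
        Path root = Files.createTempDirectory("dataset");
        for (int year = 2013; year <= 2015; year++) {
            for (int month = 1; month <= 3; month++) {
                Files.createDirectories(root.resolve("year=" + year).resolve("month=" + month));
            }
        }
        Predicate<String> yearIs2015 = "year=2015"::equals;
        Predicate<String> monthIs2 = "month=2"::equals;
        List<Path> hits = matchingPartitions(root, Arrays.asList(yearIs2015, monthIs2));
        // Prints "1 partition(s), 2 listing(s)"
        System.out.println(hits.size() + " partition(s), " + LISTINGS.get() + " listing(s)");
    }
}
```

In this toy layout the pruned walk lists only the root and the year=2015 directory. An enumerate-then-filter pass would list the root and all three year directories before discarding anything, and the gap grows with the number of partitions that do not match.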

              People

              • Assignee: Micah Whitacre (mkwhitacre)
              • Reporter: Micah Whitacre (mkwhitacre)
              • Votes: 0
              • Watchers: 2
