We are seeing poor job start-up performance: it can take 30 minutes to an hour or longer to kick off an MR job. Two factors come into play:
- The time to find the files matching the requested partitions
- The time to calculate the splits
In this issue I want to focus on the first factor: finding the files that should be included for processing. If your data is partitioned and you only need a subset of it, the code as written appears to enumerate every file in the dataset and then filter out the paths that do not match the partition constraints. This generates a large number of HDFS calls to list file statuses for directories that contain no data matching the partition restrictions.
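To make the cost concrete, here is a minimal local-filesystem sketch of the two strategies. The layout (`dt=...` partition directories), the function names, and the counts are all hypothetical illustrations, not the actual code path; in the real code each per-directory visit corresponds to a `listStatus` RPC against the NameNode, which is where the start-up time goes.

```python
import os
import tempfile

def list_all_then_filter(root, wanted_partition):
    """Described current behaviour: walk every directory under the dataset
    root, then discard paths outside the requested partition. Each directory
    visited stands in for one listStatus call."""
    list_calls = 0
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        list_calls += 1
        for f in filenames:
            path = os.path.join(dirpath, f)
            if wanted_partition in path:
                matches.append(path)
    return matches, list_calls

def prune_then_list(root, wanted_partition):
    """Sketch of the proposed behaviour: construct the partition directory
    path from the constraints first, and list only inside it."""
    list_calls = 0
    matches = []
    part_dir = os.path.join(root, wanted_partition)
    for dirpath, dirnames, filenames in os.walk(part_dir):
        list_calls += 1
        matches.extend(os.path.join(dirpath, f) for f in filenames)
    return matches, list_calls

# Toy partitioned dataset: five daily partitions, one file each.
root = tempfile.mkdtemp()
for day in range(1, 6):
    d = os.path.join(root, f"dt=2024-01-{day:02d}")
    os.makedirs(d)
    with open(os.path.join(d, "part-00000"), "w") as fh:
        fh.write("data")

full, full_calls = list_all_then_filter(root, "dt=2024-01-03")
pruned, pruned_calls = prune_then_list(root, "dt=2024-01-03")
print(full_calls, pruned_calls)    # enumerate-everything visits every partition dir
```

Both approaches return the same single matching file, but the enumerate-then-filter version lists every partition directory (root plus all five partitions here), while the pruned version touches only the one directory that can contain matching data. With thousands of partitions the difference is thousands of avoidable NameNode round trips.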