We are seeing poor job start-up performance: it can take 30 minutes to an hour or longer to kick off an MR job. Two factors come into play:
- The time to find the files matching the requested partitions
- The time to calculate the splits
In this issue I want to focus on the first factor: finding the files that should be included for processing. If your data is partitioned and you only need a subset of it, the code as written appears to enumerate every file in the dataset and then filter out the paths that do not match the partition constraints. This generates a large number of HDFS calls to list file statuses for directories that contain no data matching the partition restrictions.
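To make the cost concrete, here is a minimal local-filesystem sketch of the two strategies. The layout (`dt=...` partition directories), the function names, and the counts are all hypothetical illustrations, not the actual code path; in the real code each per-directory visit corresponds to a `listStatus` RPC against the NameNode, which is where the start-up time goes.

```python
import os
import tempfile

def list_all_then_filter(root, wanted_partition):
    """Described current behaviour: walk every directory under the dataset
    root, then discard paths outside the requested partition. Each directory
    visited stands in for one listStatus call."""
    list_calls = 0
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        list_calls += 1
        for f in filenames:
            path = os.path.join(dirpath, f)
            if wanted_partition in path:
                matches.append(path)
    return matches, list_calls

def prune_then_list(root, wanted_partition):
    """Sketch of the proposed behaviour: construct the partition directory
    path from the constraints first, and list only inside it."""
    list_calls = 0
    matches = []
    part_dir = os.path.join(root, wanted_partition)
    for dirpath, dirnames, filenames in os.walk(part_dir):
        list_calls += 1
        matches.extend(os.path.join(dirpath, f) for f in filenames)
    return matches, list_calls

# Toy partitioned dataset: five daily partitions, one file each.
root = tempfile.mkdtemp()
for day in range(1, 6):
    d = os.path.join(root, f"dt=2024-01-{day:02d}")
    os.makedirs(d)
    with open(os.path.join(d, "part-00000"), "w") as fh:
        fh.write("data")

full, full_calls = list_all_then_filter(root, "dt=2024-01-03")
pruned, pruned_calls = prune_then_list(root, "dt=2024-01-03")
print(full_calls, pruned_calls)    # enumerate-everything visits every partition dir
```

Both approaches return the same single matching file, but the enumerate-then-filter version lists every partition directory (root plus all five partitions here), while the pruned version touches only the one directory that can contain matching data. With thousands of partitions the difference is thousands of avoidable NameNode round trips.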