Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.0.0
Fix Version/s: 1.1.0
Component/s: Command-line Interface, Data Module
Labels: None
Description
To minimize the number of files we are creating, we currently use the CrunchDatasets.partition(...) method, which allows us to configure a total number of writers. However, we have hit situations where a job's performance is greatly degraded by a single partition with a high volume of data. Because the configured number of writers only distributes partitions across reducers, one reducer still carries the burden of shuffling/sorting all of that partition's data. For example, in the Copy/Compaction task, if I compact a single partition and specify 10 writers, 9 of them do nothing while 1 does all the work.
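For reference, a minimal sketch of how we use the current API today; the dataset URI and job class are illustrative, not taken from this issue:

import org.apache.avro.generic.GenericRecord;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.Datasets;
import org.kitesdk.data.crunch.CrunchDatasets;

public class PartitionExample {
  public static void main(String[] args) {
    // Load an existing partitioned dataset (URI is illustrative).
    Dataset<GenericRecord> dataset =
        Datasets.load("dataset:hdfs:/data/events", GenericRecord.class);

    Pipeline pipeline = new MRPipeline(PartitionExample.class);
    PCollection<GenericRecord> records =
        pipeline.read(CrunchDatasets.asSource(dataset));

    // Cap output at 10 writers total. Records are routed by partition,
    // so every record in one hot partition still lands on one reducer.
    PCollection<GenericRecord> partitioned =
        CrunchDatasets.partition(records, dataset, 10);

    pipeline.write(partitioned, CrunchDatasets.asTarget(dataset));
    pipeline.done();
  }
}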
Another situation I'm thinking of: I have a single partition full of small files (10 MB to 1 GB each), but the total amount of data is over 2 TB. Using the partition compact command, I'd like to end up with larger files. I don't need one single 2 TB file; something in the 200-300 GB range would let me clean up HDFS and keep the job quick. I can't really use the Compact command on that partition because it would force everything into a single file, which is a lot for one reducer to shuffle/sort.
It would be ideal if, instead of only specifying a total number of writers, we could also specify the number of writers per partition, as sketched below. I have a few use cases for this in my regular processing as well.
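A rough sketch of the call being requested; the fourth argument and its meaning are hypothetical here, not a method that exists in 1.0.0:

// Hypothetical overload requested by this issue (not in 1.0.0):
// keep 10 writers overall, but let up to 5 reducers share a single
// hot partition so no one reducer shuffles/sorts the full 2 TB alone.
PCollection<GenericRecord> partitioned =
    CrunchDatasets.partition(records, dataset, 10 /* writers */, 5 /* per partition */);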