Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.0.0
Fix Version/s: 1.1.0
Component/s: Command-line Interface, Data Module
Labels: None
Description
To minimize the number of files we are creating, we currently use the CrunchDatasets.partition(...) method, which allows us to configure a total number of writers. However, we have hit situations where a job's performance is greatly degraded by a single partition with a high volume of data. Because the configured number of writers only distributes partitions across reducers, one reducer still carries the burden of shuffling/sorting all of that partition's data. For example, in the Copy/Compaction task, if I compact a single partition and specify 10 writers, 9 of them do nothing while 1 does all the work.
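For reference, a minimal sketch of how we use the current API today; the dataset URI and job class are illustrative, not taken from this issue:

import org.apache.avro.generic.GenericRecord;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.Datasets;
import org.kitesdk.data.crunch.CrunchDatasets;

public class PartitionExample {
  public static void main(String[] args) {
    // Load an existing partitioned dataset (URI is illustrative).
    Dataset<GenericRecord> dataset =
        Datasets.load("dataset:hdfs:/data/events", GenericRecord.class);

    Pipeline pipeline = new MRPipeline(PartitionExample.class);
    PCollection<GenericRecord> records =
        pipeline.read(CrunchDatasets.asSource(dataset));

    // Cap output at 10 writers total. Records are routed by partition,
    // so every record in one hot partition still lands on one reducer.
    PCollection<GenericRecord> partitioned =
        CrunchDatasets.partition(records, dataset, 10);

    pipeline.write(partitioned, CrunchDatasets.asTarget(dataset));
    pipeline.done();
  }
}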
Another situation I'm thinking of: I have a single partition full of small files (10 MB to 1 GB each), but the total amount of data is over 2 TB. Using the partition compact command, I'd like to end up with larger files. I don't need one single 2 TB file; something in the 200-300 GB range would let me clean up HDFS and keep the job quick. I can't really use the Compact command on that partition because it would force everything into a single file, which is a lot for one reducer to shuffle/sort.
It would be ideal if, instead of only specifying a total number of writers, we could also specify the number of writers per partition, as sketched below. I have a few use cases for this in my regular processing as well.
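A rough sketch of the call being requested; the fourth argument and its meaning are hypothetical here, not a method that exists in 1.0.0:

// Hypothetical overload requested by this issue (not in 1.0.0):
// keep 10 writers overall, but let up to 5 reducers share a single
// hot partition so no one reducer shuffles/sorts the full 2 TB alone.
PCollection<GenericRecord> partitioned =
    CrunchDatasets.partition(records, dataset, 10 /* writers */, 5 /* per partition */);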