Kite SDK (READ-ONLY)
KITE-1011

Configurable Number of Writers Per Partition



      To minimize the number of files we create, we currently use the CrunchDatasets.partition(...) method, which lets us configure a total number of writers. However, we have hit situations where job performance is badly hurt by a single partition with a high volume of data. Since the configured number of writers only distributes whole partitions across reducers, one reducer still carries the burden of shuffling/sorting all of that partition's data. For example, in the Copy/Compaction task, if I compact a single partition and specify 10 writers, 9 do nothing while 1 does all the work.
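The skew described above can be shown with a minimal sketch (illustrative only, not Kite source): when records are routed to writers by their partition key alone, a single hot partition always lands on one writer, no matter how many writers are configured.

```java
import java.util.HashMap;
import java.util.Map;

// A minimal sketch (not Kite source) of why a fixed total writer count
// does not help with skew: records are routed by partition key alone,
// so every record in one partition goes to the same writer.
public class WriterSkew {

    // Route a record to one of numWriters writers by partition key.
    static int writerFor(String partitionKey, int numWriters) {
        return Math.floorMod(partitionKey.hashCode(), numWriters);
    }

    public static void main(String[] args) {
        int numWriters = 10;
        Map<Integer, Integer> load = new HashMap<>();

        // 1,000 records, all in the same (hypothetical) partition:
        for (int i = 0; i < 1000; i++) {
            load.merge(writerFor("year=2014/day=001", numWriters), 1, Integer::sum);
        }

        // Only one of the ten writers receives any data.
        System.out.println("writers used: " + load.size()); // prints 1
        System.out.println("records on busiest writer: "
                + load.values().stream().mapToInt(Integer::intValue).max().getAsInt()); // prints 1000
    }
}
```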

      Another situation I'm thinking of: I have a single partition full of small files (10 MB - 1 GB), but in total the data amounts to over 2 TB. Using the partition compact command I'd like to end up with larger files; I don't need a single 2 TB file, but files in the 200-300 GB range would help me clean up HDFS and make for a quicker job. I can't really use the Compact command on that partition because it would force everything into a single file, which is a lot for one reducer to shuffle/sort.

      It would be ideal if, in addition to specifying a total number of writers, we could also specify the number of writers per partition. I have a few use cases for this in my regular processing as well.
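One hypothetical way to get the proposed behavior is to salt the shuffle key, so a single partition fans out over several writers while each writer still handles only that partition's data. Note that writersPerPartition and the salting scheme below are assumptions for illustration, not an existing CrunchDatasets API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of the proposed behavior; writersPerPartition is
// an assumed parameter, not an existing CrunchDatasets API. Salting the
// shuffle key spreads one partition over several writers, each still
// writing only that partition's data, so output files stay partition-pure.
public class WritersPerPartition {

    // Route a record to "partitionKey#salt", i.e. to one of
    // writersPerPartition writers dedicated to this partition.
    static String writerFor(String partitionKey, int writersPerPartition, Random rng) {
        return partitionKey + "#" + rng.nextInt(writersPerPartition);
    }

    public static void main(String[] args) {
        Random rng = new Random(42); // seeded for a repeatable demo
        Map<String, Integer> load = new HashMap<>();

        // 1,000 records, one hot partition, 4 writers per partition:
        for (int i = 0; i < 1000; i++) {
            load.merge(writerFor("year=2014/day=001", 4, rng), 1, Integer::sum);
        }

        // The hot partition is now split across up to 4 writers, so no
        // single reducer shuffles/sorts all the data, and the partition
        // ends up as a handful of large files instead of one huge one.
        System.out.println("writers used: " + load.size());
    }
}
```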




            • Assignee:
              mkwhitacre Micah Whitacre
            • Votes:
              0 Vote for this issue
            • Watchers:
              3 Start watching this issue

