Kite SDK (READ-ONLY) / KITE-426

File writer caching should increase if thrashing

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.13.0
    • Fix Version/s: 0.15.0
    • Component/s: None
    • Labels: None

      Description

      As an example, I created a list of 10,000 timestamps between 1 Jan 2014 and 1 Jan 2015 and loaded them into a Hive dataset using the csv-import command. The import didn't finish in a reasonable amount of time: the timestamps were in random order, so the writer cache, which has a hard limit of 10 writers, almost never had a cache hit. Instead, each new writer forced an old writer to be closed, and the dataset wrote approximately one file per record.
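
      The thrashing can be reproduced with a small simulation. This is only a sketch, not Kite's code: it assumes one partition per day, assumes the cache evicts the least recently used writer (the report only says that new writers force old ones to close), and counts one file per writer opened. The names count_files and random_day_keys are hypothetical.

```python
import random
from collections import OrderedDict
from datetime import datetime, timedelta


def count_files(partition_keys, cache_size):
    """Count files opened when writers sit in an LRU cache (assumed policy)."""
    cache = OrderedDict()  # partition key -> placeholder for an open writer
    files_opened = 0
    for key in partition_keys:
        if key in cache:
            # Cache hit: reuse the open writer and mark it most recently used.
            cache.move_to_end(key)
        else:
            # Cache miss: a new writer is opened, which starts a new file.
            files_opened += 1
            cache[key] = True
            if len(cache) > cache_size:
                # Over the limit: close the least recently used writer.
                cache.popitem(last=False)
    return files_opened


def random_day_keys(n, seed=0):
    """Return n day-level partition keys for random timestamps in 2014."""
    rng = random.Random(seed)
    start = datetime(2014, 1, 1)
    span_seconds = int((datetime(2015, 1, 1) - start).total_seconds())
    return [
        (start + timedelta(seconds=rng.randrange(span_seconds))).date()
        for _ in range(n)
    ]


keys = random_day_keys(10_000)
files_small_cache = count_files(keys, cache_size=10)
files_large_cache = count_files(keys, cache_size=365)
files_sorted_input = count_files(sorted(keys), cache_size=10)
```

      Under these assumptions, a limit of 10 opens close to one file per record on random input, while either a limit that covers every day or sorted input opens one file per distinct day.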

      I think the default writer cache limit should be set very high to avoid this situation, and the limit should be configurable through a descriptor configuration property (kite.writer-cache-size). For most non-id partitioners, we have a cardinality hint. By default, I'd like to set the writer cache limit to around 10% of the expected number of partitions, or 100. We should then add warnings when the cache size reaches half of that limit. For year/month/day partitioning, the maximum cardinality is about 365 * 5 = 1,825, so 10% is around 180 days' worth of writers. That's enough to ensure some cache hits in most situations.
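
      The proposed sizing rule might look like the following sketch. It is an illustration, not the implemented fix. Only the property name kite.writer-cache-size comes from this report; the function names are hypothetical. It also makes two assumptions: that 100 is the fallback when no cardinality hint is available, and that the warning fires once the number of open writers reaches half the limit.

```python
CACHE_SIZE_PROPERTY = "kite.writer-cache-size"  # property name from the report


def default_cache_size(expected_partitions=None, percent=10, fallback=100):
    """Return about 10% of the expected partition count.

    Assumption: the fallback of 100 applies when there is no cardinality hint.
    """
    if expected_partitions is None:
        return fallback
    # Integer arithmetic avoids floating-point rounding surprises.
    return max(1, expected_partitions * percent // 100)


def resolve_cache_size(properties, expected_partitions=None):
    """Prefer an explicit descriptor property over the computed default."""
    configured = properties.get(CACHE_SIZE_PROPERTY)
    if configured is not None:
        return int(configured)
    return default_cache_size(expected_partitions)


def should_warn(open_writers, cache_size):
    """Assumed warning rule: open writers have reached half of the limit."""
    return 2 * open_writers >= cache_size


# Year/month/day partitioning with a cardinality hint of about 365 * 5.
ymd_limit = resolve_cache_size({}, expected_partitions=365 * 5)
override_limit = resolve_cache_size({CACHE_SIZE_PROPERTY: "500"}, 365 * 5)
no_hint_limit = resolve_cache_size({})
```

      With a hint of 365 * 5 partitions this yields a limit of 182 writers, in line with the "around 180 days" estimate above.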

      We should also consider adding age limits and other policies.

            People

            • Assignee: Ryan Blue
            • Reporter: Ryan Blue