Kite SDK (READ-ONLY) / KITE-846

Document dataset storage properties for Parquet

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.17.1
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      There are several dataset properties that affect how files are written. This is especially relevant for Parquet because it buffers records in memory. Here are the relevant points:

      1. kite.writer.cache-size controls the number of files kept open by an HDFS or Hive dataset writer.

      A writer opens one file per partition it needs to write a record to. When the writer receives a record that belongs in a new partition (that is, one for which there isn't already an open file), it creates a new file in that partition. If the number of open files then exceeds the cache size, the least recently used file is closed.
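The eviction behavior described above can be sketched with a plain access-ordered map. This is only an illustration of least-recently-used eviction, not Kite's actual writer code: the class name, the in-memory StringBuilder standing in for an open file, and the closedPartitions list are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: an LRU cache of "open files", one per partition. */
class PartitionWriterCacheSketch {
    // Records which partitions had their file closed, in eviction order.
    private final List<String> closed = new ArrayList<>();
    private final Map<String, StringBuilder> open;

    PartitionWriterCacheSketch(final int cacheSize) {
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.open = new LinkedHashMap<String, StringBuilder>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, StringBuilder> eldest) {
                if (size() > cacheSize) {
                    closed.add(eldest.getKey()); // stand-in for closing the real file
                    return true;
                }
                return false;
            }
        };
    }

    /** Appends a record to the partition's file, opening a new file if needed. */
    void write(String partition, String record) {
        StringBuilder file = open.get(partition); // get() counts as a use
        if (file == null) {
            file = new StringBuilder();
            open.put(partition, file); // may evict the least recently used file
        }
        file.append(record).append('\n');
    }

    List<String> closedPartitions() {
        return closed;
    }

    int openCount() {
        return open.size();
    }

    public static void main(String[] args) {
        PartitionWriterCacheSketch cache = new PartitionWriterCacheSketch(2);
        cache.write("a", "r1");
        cache.write("b", "r2");
        cache.write("a", "r3"); // "a" is now more recently used than "b"
        cache.write("c", "r4"); // third partition exceeds the cache size of 2
        System.out.println(cache.closedPartitions()); // prints [b]
    }
}
```

With a cache size of 2, writing to a third partition closes whichever of the first two was written to least recently, which is why records that arrive clustered by partition cause fewer file closes than records that arrive interleaved.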

      2. Kite will pass on descriptor properties to the underlying file formats.

      Parquet defines parquet.block.size, which is approximately the amount of data that will be buffered before writing a group of records (a "row group"). This size defaults to 128MB.

      The amount of data kept in memory for each file could be up to the Parquet block size in bytes, which means that the upper bound for a writer's memory consumption is parquet.block.size * kite.writer.cache-size. It is important that this doesn't exceed a reasonable portion of the heap memory allocated to the process, or else the write could fail with an OutOfMemoryError.
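As a rough worked example of that bound, the sketch below multiplies the two settings and compares the result with a share of the heap. The class and method names are hypothetical, and the cache size of 10, the 2 GB heap, and the 50% heap share are illustrative assumptions rather than Kite defaults or recommendations.

```java
/** Hypothetical helper for estimating a Parquet writer's buffering upper bound. */
class WriterMemoryBoundSketch {

    /** Upper bound in bytes: parquet.block.size * kite.writer.cache-size. */
    static long upperBoundBytes(long parquetBlockSizeBytes, int writerCacheSize) {
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(parquetBlockSizeBytes, (long) writerCacheSize);
    }

    /** True if the bound stays within the given fraction of the maximum heap. */
    static boolean fitsInHeap(long boundBytes, long maxHeapBytes, double heapFraction) {
        return boundBytes <= (long) (maxHeapBytes * heapFraction);
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // 128 MB Parquet block size
        long heap = 2L * 1024 * 1024 * 1024;   // assumed 2 GB heap for illustration

        long tenFiles = upperBoundBytes(blockSize, 10);
        long twoFiles = upperBoundBytes(blockSize, 2);

        System.out.println(tenFiles);                        // prints 1342177280
        System.out.println(fitsInHeap(tenFiles, heap, 0.5)); // prints false
        System.out.println(fitsInHeap(twoFiles, heap, 0.5)); // prints true
    }
}
```

In a real process, Runtime.getRuntime().maxMemory() could supply the heap size in place of the assumed constant.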

      Users can set both properties on their datasets, though we don't recommend decreasing parquet.block.size. They can set a property either on a descriptor using the builder or with the --set option on the command line:

      kite-dataset update <uri> --set kite.writer.cache-size=2
      

              People

              • Assignee:
                Unassigned
              • Reporter:
                Ryan Blue