KITE-1095

Empty directory in partition range prevents a complete read of dataset

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.0
    • Component/s: Data Module
    • Labels:
      None

      Description

      Consider a file-system-backed dataset (HDFS or otherwise) that is partitioned by some value. An empty directory within the range being queried prevents the remaining non-empty directories from being read. This worked in Kite 1.0.0 but is broken in 1.1.0.

      A quick test using the kite CLI demonstrates this behavior:

      curl http://central.maven.org/maven2/org/kitesdk/kite-tools/1.1.0/kite-tools-1.1.0-binary.jar -o kite-dataset
      chmod +x kite-dataset
      
      echo 'username,email
      Sam,sam@test.com
      Bob,bob@test.com' > data.csv
      more data.csv
      hadoop fs -copyFromLocal data.csv /user/xyz/
      
      ./kite-dataset csv-schema data.csv --class User -o user.avsc
      more user.avsc 
      
      ./kite-dataset partition-config nominaltime:provided -s user.avsc -o part.json
      more part.json
      
      ./kite-dataset create dataset:hdfs:/user/xyz/users -s user.avsc -p part.json
      ./kite-dataset info dataset:hdfs:/user/xyz/users
      
      // https://groups.google.com/a/cloudera.org/forum/#!searchin/cdk-dev/cli$20csv$20partition/cdk-dev/Cis22jZCeUA/_Fw91Xz-3c4J
      ./kite-dataset csv-import hdfs:/user/xyz/data.csv view:hdfs:/user/xyz/users?nominaltime=20151106  --no-compaction
      ./kite-dataset show view:hdfs:/user/xyz/users?nominaltime=20151106
      > {"username": "Sam", "email": "sam@test.com"}
        {"username": "Bob", "email": "bob@test.com"}
      
      hadoop fs -mkdir /user/xyz/users/nominaltime=20151107
      
      ./kite-dataset csv-import hdfs:/user/xyz/data.csv view:hdfs:/user/xyz/users?nominaltime=20151108  --no-compaction
      ./kite-dataset show view:hdfs:/user/xyz/users?nominaltime=20151108
      > {"username": "Sam", "email": "sam@test.com"}
        {"username": "Bob", "email": "bob@test.com"}
      
      ./kite-dataset show 'view:hdfs:/user/xyz/users?nominaltime=[20151106,20151108]'
      // This URI should qualify partitions 20151106 and 20151108 but returns only one set of users.
      > {"username": "Sam", "email": "sam@test.com"}
        {"username": "Bob", "email": "bob@test.com"}
      
      ./kite-dataset show 'view:hdfs:/user/xyz/users?nominaltime=[20151107,20151108]'
      // No result but should include the partition 20151108
      
      hadoop fs -rmdir /user/xyz/users/nominaltime=20151107
      ./kite-dataset show 'view:hdfs:/user/xyz/users?nominaltime=[20151106,20151108]'
      // Removing the empty directory/partition returns all data expected.
      > {"username": "Sam", "email": "sam@test.com"}
        {"username": "Bob", "email": "bob@test.com"}
        {"username": "Sam", "email": "sam@test.com"}
        {"username": "Bob", "email": "bob@test.com"}
      
      // The same test with 1.0.0 qualifies as expected
      curl http://central.maven.org/maven2/org/kitesdk/kite-tools/1.0.0/kite-tools-1.0.0-binary.jar -o kite-dataset-1.0.0
      chmod +x kite-dataset-1.0.0
      
      hadoop fs -mkdir /user/xyz/users/nominaltime=20151107
      
      ./kite-dataset-1.0.0 show 'view:hdfs:/user/xyz/users?nominaltime=[20151106,20151108]'
      // Includes partitions 20151106 and 20151108, skips over the empty partition.
      > {"username": "Sam", "email": "sam@test.com"}
        {"username": "Bob", "email": "bob@test.com"}
        {"username": "Sam", "email": "sam@test.com"}
        {"username": "Bob", "email": "bob@test.com"}
      
      ./kite-dataset-1.0.0 show 'view:hdfs:/user/xyz/users?nominaltime=[20151107,20151108]'
      // Includes partition 20151108, skips over the empty partition.
      > {"username": "Sam", "email": "sam@test.com"}
        {"username": "Bob", "email": "bob@test.com"}
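
The behavior expected from a range query, as shown by the 1.0.0 output above, can be sketched in plain shell. This is only an illustration, not Kite's implementation: the function name `nonempty_partitions` is hypothetical, and it walks a local directory tree as a stand-in for HDFS. The point is that an empty partition directory is skipped and the scan continues to the next one.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Kite: print the nominaltime values of the
# partition directories under a dataset root that fall inside the inclusive
# range [lo, hi] and contain at least one entry.
# Assumption: the local filesystem stands in for HDFS.
nonempty_partitions() {
  local root=$1 lo=$2 hi=$3 dir value
  for dir in "$root"/nominaltime=*/; do
    [ -d "$dir" ] || continue          # the glob matched nothing
    dir=${dir%/}                       # drop the trailing slash
    value=${dir##*nominaltime=}        # keep only the partition value
    # Ignore partitions outside the requested range.
    if [ "$value" -lt "$lo" ] || [ "$value" -gt "$hi" ]; then
      continue
    fi
    # Skip an empty directory instead of ending the scan.
    if [ -z "$(ls -A "$dir")" ]; then
      continue
    fi
    printf '%s\n' "$value"
  done
}
```

With directories for 20151106, 20151107 (empty) and 20151108 under a root, calling `nonempty_partitions <root> 20151106 20151108` lists the first and last partitions only, which matches what the 1.0.0 CLI returns for the same range.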
      

            People

            • Assignee: mkwhitacre (Micah Whitacre)
            • Reporter: scheeser (John Scheeser)
            • Votes: 0
            • Watchers: 1
