Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: None
- Fix Version/s: 1.2.0
- Component/s: Data Module
- Labels: None
Description
Suppose you have a file system-backed dataset (HDFS or otherwise) that is partitioned by some value. An empty directory within the range being queried prevents subsequent non-empty directories from being read. This worked in Kite 1.0.0 but is broken in 1.1.0.
A quick test using the kite CLI demonstrates this behavior:
curl http://central.maven.org/maven2/org/kitesdk/kite-tools/1.1.0/kite-tools-1.1.0-binary.jar -o kite-dataset
chmod +x kite-dataset
echo 'username,email
Sam,sam@test.com
Bob,bob@test.com' > data.csv
more data.csv
hadoop fs -copyFromLocal data.csv /user/xyz/
./kite-dataset csv-schema data.csv --class User -o user.avsc
more user.avsc
./kite-dataset partition-config username:identity -s user.avsc -o part.json
more part.json
./kite-dataset create dataset:hdfs:/user/xyz/users -s user.avsc -p part.json
./kite-dataset info dataset:hdfs:/user/xyz/users

// https://groups.google.com/a/cloudera.org/forum/#!searchin/cdk-dev/cli$20csv$20partition/cdk-dev/Cis22jZCeUA/_Fw91Xz-3c4J
./kite-dataset csv-import hdfs:/user/xyz/data.csv 'view:hdfs:/user/xyz/users?nominaltime=20151106' --no-compaction
./kite-dataset show 'view:hdfs:/user/xyz/users?nominaltime=20151106'
> {"username": "Sam", "email": "sam@test.com"}
{"username": "Bob", "email": "bob@test.com"}

hadoop fs -mkdir /user/xyz/users/nominaltime=20151107

./kite-dataset csv-import hdfs:/user/xyz/data.csv 'view:hdfs:/user/xyz/users?nominaltime=20151108' --no-compaction
./kite-dataset show 'view:hdfs:/user/xyz/users?nominaltime=20151108'
> {"username": "Sam", "email": "sam@test.com"}
{"username": "Bob", "email": "bob@test.com"}

./kite-dataset show 'view:hdfs:/user/xyz/users?nominaltime=[20151106,20151108]'
// This URI should qualify partitions 20151106 and 20151108 but only returns one set of users.
> {"username": "Sam", "email": "sam@test.com"}
{"username": "Bob", "email": "bob@test.com"}

./kite-dataset show 'view:hdfs:/user/xyz/users?nominaltime=[20151107,20151108]'
// No result, but it should include partition 20151108.

hadoop fs -rmdir /user/xyz/users/nominaltime=20151107
./kite-dataset show 'view:hdfs:/user/xyz/users?nominaltime=[20151106,20151108]'
// Removing the empty directory/partition returns all of the expected data.
> {"username": "Sam", "email": "sam@test.com"}
{"username": "Bob", "email": "bob@test.com"}
{"username": "Sam", "email": "sam@test.com"}
{"username": "Bob", "email": "bob@test.com"}

// The same test with 1.0.0 qualifies partitions as expected.
curl http://central.maven.org/maven2/org/kitesdk/kite-tools/1.0.0/kite-tools-1.0.0-binary.jar -o kite-dataset-1.0.0
chmod +x kite-dataset-1.0.0
hadoop fs -mkdir /user/xyz/users/nominaltime=20151107

./kite-dataset-1.0.0 show 'view:hdfs:/user/xyz/users?nominaltime=[20151106,20151108]'
// Includes partitions 20151106 and 20151108, skips over the empty partition.
> {"username": "Sam", "email": "sam@test.com"}
{"username": "Bob", "email": "bob@test.com"}
{"username": "Sam", "email": "sam@test.com"}
{"username": "Bob", "email": "bob@test.com"}

./kite-dataset-1.0.0 show 'view:hdfs:/user/xyz/users?nominaltime=[20151107,20151108]'
// Includes partition 20151108, skips over the empty partition.
> {"username": "Sam", "email": "sam@test.com"}
{"username": "Bob", "email": "bob@test.com"}
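For reference, the expected semantics (the 1.0.0 behavior above) can be sketched on a local file system. This is only an illustration under stated assumptions, not Kite's implementation: the function name partitions_in_range is hypothetical, the directory layout nominaltime=<value> mirrors the reproduction, and plain local directories stand in for HDFS. The point it shows is that an empty partition directory should contribute no records and should not stop the scan of later partitions.

```shell
#!/bin/sh
# Hypothetical helper (not part of Kite): print the nominaltime values of
# all non-empty partition directories under $1 that fall in [$2, $3].
partitions_in_range() {
  root=$1
  lo=$2
  hi=$3
  for dir in "$root"/nominaltime=*; do
    [ -d "$dir" ] || continue
    value=${dir##*=}
    # Keep only partitions inside the inclusive range.
    [ "$value" -ge "$lo" ] && [ "$value" -le "$hi" ] || continue
    # An empty partition holds no data: skip it and keep scanning.
    [ -n "$(ls -A "$dir")" ] || continue
    echo "$value"
  done
}

# Demo: 20151107 exists but is empty, as in the reproduction above.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/nominaltime=20151106" \
         "$demo_root/nominaltime=20151107" \
         "$demo_root/nominaltime=20151108"
touch "$demo_root/nominaltime=20151106/data.avro" \
      "$demo_root/nominaltime=20151108/data.avro"
partitions_in_range "$demo_root" 20151106 20151108
rm -rf "$demo_root"
```

With this layout the demo lists 20151106 and 20151108, matching what the 1.0.0 CLI returns for the range [20151106,20151108].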