Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.10.1
Fix Version/s: None
Component/s: Data Module
Labels: None
Description
Passing a partitioned dataset, or a leaf partition of one, to CrunchDatasets.asTarget(Dataset) and writing to the resulting target produces the following error:
org.apache.crunch.CrunchRuntimeException: Path already exists: hdfs://localhost:58425/kite/%2Fsource%3Aint64%2Ftype_0%3Aint64%2Fpayload%3Aint64/source=source_0/batch=2014_02_04_14_25_12
at org.apache.crunch.io.impl.FileTargetImpl.handleExisting(FileTargetImpl.java:257)
at org.apache.crunch.impl.mr.MRPipeline.write(MRPipeline.java:212)
at org.apache.crunch.impl.mr.MRPipeline.write(MRPipeline.java:200)
...
The dataset stores Avro data. The cause appears to be that creating a partitioned dataset[1] also creates the partition directory. Crunch then uses that directory as its target directory, but the file target expects the directory not to exist. I have to create the dataset at this point as well, because not specifying the autoCreate boolean returns null.
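The conflict described above can be reproduced in isolation with plain java.nio, without Kite or Crunch on the classpath. The following is only a minimal sketch under stated assumptions: the class ExistingPathSketch, the Mode enum, and the methods createPartition and handleExisting are hypothetical stand-ins invented for illustration, not the actual Kite or Crunch APIs. It shows why eagerly creating the partition directory makes a "fail if the path exists" check reject the write, and how an append-style mode would let it through.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExistingPathSketch {

    /** Hypothetical write modes for this sketch (not Crunch's own enum). */
    enum Mode { FAIL_IF_EXISTS, APPEND }

    /**
     * Stand-in for a target-side existence check.
     * Returns true if the write may proceed, and throws if the path
     * exists while the mode forbids writing into an existing path.
     */
    static boolean handleExisting(Path target, Mode mode) {
        if (!Files.exists(target)) {
            return true; // nothing there yet, safe to write
        }
        if (mode == Mode.APPEND) {
            return true; // existing directory is acceptable when appending
        }
        throw new IllegalStateException("Path already exists: " + target);
    }

    /**
     * Stand-in for dataset creation that eagerly creates the partition
     * directory, as the report describes.
     */
    static Path createPartition(Path root, String partition) throws IOException {
        return Files.createDirectories(root.resolve(partition));
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("kite-sketch");
        Path partition = createPartition(root, "source=source_0");

        try {
            // The directory was just created, so the strict check rejects it.
            handleExisting(partition, Mode.FAIL_IF_EXISTS);
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }

        // An append-style mode tolerates the pre-created directory.
        System.out.println("Append allowed: " + handleExisting(partition, Mode.APPEND));
    }
}
```

In this sketch the strict mode fails only because the partition directory was created before the check ran. That ordering is the same one the stack trace above points to.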
There are likely other combinations where this bug occurs (e.g. Parquet data, non-partitioned datasets).
I also noticed that the method is deprecated in master, but only on the implementation, not on the interface.
Issue Links
- depends on KITE-347 Implement DatasetTarget#handleExisting and WriteMode support (Resolved)