Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 1.0.0
- Fix Version/s: 1.1.0
- Component/s: Data Module
- Labels: None
Description
When using kite-dataset copy to write to a dataset that has both a schema change and partitioning, the wrong field is read from the input records when applying the field partitioners. The root cause is that we use the destination schema and partition strategy to access fields from records that still use the source schema. If you've added any fields to the schema before the field that's used for partitioning, you'll end up accessing the wrong field. In the best case, you'll get a class cast exception such as this one:
2015-03-27 19:47:48,411 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
    at org.kitesdk.data.spi.partition.CalendarFieldPartitioner.apply(CalendarFieldPartitioner.java:35)
    at org.kitesdk.data.crunch.CrunchDatasets$AvroStorageKey.reuseFor(CrunchDatasets.java:356)
    at org.kitesdk.data.crunch.CrunchDatasets$GetStorageKey.map(CrunchDatasets.java:335)
    at org.kitesdk.data.crunch.CrunchDatasets$GetStorageKey.map(CrunchDatasets.java:302)
    at org.apache.crunch.fn.ExtractKeyFn.map(ExtractKeyFn.java:59)
    at org.apache.crunch.fn.ExtractKeyFn.map(ExtractKeyFn.java:29)
    at org.apache.crunch.MapFn.process(MapFn.java:34)
    at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:98)
    at org.apache.crunch.impl.mr.emit.IntermediateEmitter.emit(IntermediateEmitter.java:56)
    at org.apache.crunch.MapFn.process(MapFn.java:34)
    at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:98)
    at org.apache.crunch.impl.mr.emit.IntermediateEmitter.emit(IntermediateEmitter.java:56)
    at org.apache.crunch.MapFn.process(MapFn.java:34)
    at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:98)
    at org.apache.crunch.impl.mr.emit.IntermediateEmitter.emit(IntermediateEmitter.java:56)
    at org.apache.crunch.MapFn.process(MapFn.java:34)
    at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:98)
    at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:109)
    at org.apache.crunch.impl.mr.run.CrunchMapper.map(CrunchMapper.java:60)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
If the field at that position happens to have the same type, you'd end up silently partitioning on the wrong values.
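For illustration, here is a minimal sketch of the position mismatch. The "Event" schemas and field names are hypothetical (not taken from the failing copy job), and this is not the exact Kite code path, only the field-resolution problem it runs into: the partition source field is resolved against the destination schema, but the value is read from a record that still uses the source schema.

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class PositionMismatchSketch {
      public static void main(String[] args) {
        // Source schema: "timestamp" (the partition source field) is at position 0.
        Schema source = SchemaBuilder.record("Event").fields()
            .requiredLong("timestamp")
            .requiredInt("retries")
            .endRecord();

        // Destination schema: a new field was added before "timestamp",
        // so "timestamp" is now at position 1.
        Schema dest = SchemaBuilder.record("Event").fields()
            .requiredString("event_id")
            .requiredLong("timestamp")
            .requiredInt("retries")
            .endRecord();

        // Input record produced with the source schema.
        GenericRecord record = new GenericData.Record(source);
        record.put("timestamp", 1427485668411L);
        record.put("retries", 3);

        // Resolving the position in the destination schema but reading that
        // position from the source-schema record returns "retries" (an Integer)
        // instead of the Long timestamp, which is the kind of value the
        // CalendarFieldPartitioner then fails to cast.
        int pos = dest.getField("timestamp").pos();   // 1 in the destination schema
        Object value = record.get(pos);               // Integer 3, not the timestamp
        System.out.println(value.getClass() + ": " + value);
      }
    }

In this sketch the mismatch is visible because the types differ; if the neighboring field had also been a long, the copy would have partitioned on it without any error, which is the silent-corruption case described above.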