Kite SDK (READ-ONLY) / KITE-973

Copying to a destination that has a schema change and partitioning doesn't work

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.1.0
    • Component/s: Data Module
    • Labels:
      None

      Description

      When using kite-dataset copy to write to a dataset that has both a schema change and partitioning, the wrong field is read from the input records when the field partitioners are applied. The root cause is that we use the destination schema and partition strategy to access fields, by position, from records that use the source schema. If you've added any fields to the schema before the field that's used for partitioning, you'll end up accessing the wrong field. In the best case, you'll get a ClassCastException such as this one:

      2015-03-27 19:47:48,411 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
      	at org.kitesdk.data.spi.partition.CalendarFieldPartitioner.apply(CalendarFieldPartitioner.java:35)
      	at org.kitesdk.data.crunch.CrunchDatasets$AvroStorageKey.reuseFor(CrunchDatasets.java:356)
      	at org.kitesdk.data.crunch.CrunchDatasets$GetStorageKey.map(CrunchDatasets.java:335)
      	at org.kitesdk.data.crunch.CrunchDatasets$GetStorageKey.map(CrunchDatasets.java:302)
      	at org.apache.crunch.fn.ExtractKeyFn.map(ExtractKeyFn.java:59)
      	at org.apache.crunch.fn.ExtractKeyFn.map(ExtractKeyFn.java:29)
      	at org.apache.crunch.MapFn.process(MapFn.java:34)
      	at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:98)
      	at org.apache.crunch.impl.mr.emit.IntermediateEmitter.emit(IntermediateEmitter.java:56)
      	at org.apache.crunch.MapFn.process(MapFn.java:34)
      	at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:98)
      	at org.apache.crunch.impl.mr.emit.IntermediateEmitter.emit(IntermediateEmitter.java:56)
      	at org.apache.crunch.MapFn.process(MapFn.java:34)
      	at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:98)
      	at org.apache.crunch.impl.mr.emit.IntermediateEmitter.emit(IntermediateEmitter.java:56)
      	at org.apache.crunch.MapFn.process(MapFn.java:34)
      	at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:98)
      	at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:109)
      	at org.apache.crunch.impl.mr.run.CrunchMapper.map(CrunchMapper.java:60)
      	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
      	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
      	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
      	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:415)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
      	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
      

      If the field at that position happens to have the same type as the partition field, no exception is thrown and you end up silently partitioning on the wrong values.
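
The mismatch described above can be illustrated with a minimal, self-contained sketch. This is not Kite's code: the class `PositionalLookupSketch`, its methods, and the field names `id`, `region`, `timestamp`, and `count` are all hypothetical, and a schema is modeled simply as an ordered list of field names. It only shows how resolving a field position against the destination schema, and then reading a record laid out by the source schema, picks up the wrong value.

```java
import java.util.Arrays;
import java.util.List;

class PositionalLookupSketch {
    // Position of a field in a schema, modeled here as an ordered list of names.
    static int positionOf(List<String> schema, String field) {
        return schema.indexOf(field);
    }

    // Reads a field from a record by position, resolving the position
    // against whichever schema the caller passes in.
    static Object read(Object[] record, List<String> schema, String field) {
        return record[positionOf(schema, field)];
    }

    public static void main(String[] args) {
        // Hypothetical source schema: id (long), timestamp (long), count (int).
        List<String> source = Arrays.asList("id", "timestamp", "count");
        // The destination adds "region" before the partition field "timestamp".
        List<String> destination = Arrays.asList("id", "region", "timestamp", "count");

        // A record laid out according to the source schema.
        Object[] record = {42L, 1427485668411L, 7};

        // Correct: position resolved against the schema the record was written with.
        long ts = (Long) read(record, source, "timestamp");
        System.out.println("source lookup: " + ts);

        // Mismatch: position 2 in the destination is "timestamp", but position 2
        // in the record holds "count" (an Integer), so the cast to Long fails.
        try {
            long wrong = (Long) read(record, destination, "timestamp");
            System.out.println("destination lookup: " + wrong);
        } catch (ClassCastException e) {
            System.out.println("destination lookup failed with ClassCastException");
        }
    }
}
```

If `count` were stored as a long in this sketch, the second lookup would not throw; it would return the value of `count` in place of the timestamp, which corresponds to the silent mis-partitioning case.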


              People

              • Assignee: Joey Echeverria (joey)
              • Reporter: Joey Echeverria (joey)
              • Votes: 0
              • Watchers: 2
