Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 1.0.0
- Fix Version/s: 1.1.0
- Component/s: Data Module
- Labels: None
Description
When using kite-dataset copy to write to a dataset that has both a schema change and partitioning, the wrong field is read from the input records when applying the field partitioners. The root cause is that we use the destination schema and partition strategy to access fields from records that still use the source schema. If you've added any fields to the schema before the field that's used for partitioning, you'll end up accessing the wrong field. In the best case, you'll get a class cast exception such as this one:
2015-03-27 19:47:48,411 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
    at org.kitesdk.data.spi.partition.CalendarFieldPartitioner.apply(CalendarFieldPartitioner.java:35)
    at org.kitesdk.data.crunch.CrunchDatasets$AvroStorageKey.reuseFor(CrunchDatasets.java:356)
    at org.kitesdk.data.crunch.CrunchDatasets$GetStorageKey.map(CrunchDatasets.java:335)
    at org.kitesdk.data.crunch.CrunchDatasets$GetStorageKey.map(CrunchDatasets.java:302)
    at org.apache.crunch.fn.ExtractKeyFn.map(ExtractKeyFn.java:59)
    at org.apache.crunch.fn.ExtractKeyFn.map(ExtractKeyFn.java:29)
    at org.apache.crunch.MapFn.process(MapFn.java:34)
    at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:98)
    at org.apache.crunch.impl.mr.emit.IntermediateEmitter.emit(IntermediateEmitter.java:56)
    at org.apache.crunch.MapFn.process(MapFn.java:34)
    at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:98)
    at org.apache.crunch.impl.mr.emit.IntermediateEmitter.emit(IntermediateEmitter.java:56)
    at org.apache.crunch.MapFn.process(MapFn.java:34)
    at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:98)
    at org.apache.crunch.impl.mr.emit.IntermediateEmitter.emit(IntermediateEmitter.java:56)
    at org.apache.crunch.MapFn.process(MapFn.java:34)
    at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:98)
    at org.apache.crunch.impl.mr.run.RTNode.process(RTNode.java:109)
    at org.apache.crunch.impl.mr.run.CrunchMapper.map(CrunchMapper.java:60)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
If the field at that position happens to have the same type, you'd end up silently partitioning on the wrong values.
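For illustration, here is a minimal sketch of the position mismatch. The "Event" schemas and field names are hypothetical (not taken from the failing copy job), and this is not the exact Kite code path, only the field-resolution problem it runs into: the partition source field is resolved against the destination schema, but the value is read from a record that still uses the source schema.

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class PositionMismatchSketch {
      public static void main(String[] args) {
        // Source schema: "timestamp" (the partition source field) is at position 0.
        Schema source = SchemaBuilder.record("Event").fields()
            .requiredLong("timestamp")
            .requiredInt("retries")
            .endRecord();

        // Destination schema: a new field was added before "timestamp",
        // so "timestamp" is now at position 1.
        Schema dest = SchemaBuilder.record("Event").fields()
            .requiredString("event_id")
            .requiredLong("timestamp")
            .requiredInt("retries")
            .endRecord();

        // Input record produced with the source schema.
        GenericRecord record = new GenericData.Record(source);
        record.put("timestamp", 1427485668411L);
        record.put("retries", 3);

        // Resolving the position in the destination schema but reading that
        // position from the source-schema record returns "retries" (an Integer)
        // instead of the Long timestamp, which is the kind of value the
        // CalendarFieldPartitioner then fails to cast.
        int pos = dest.getField("timestamp").pos();   // 1 in the destination schema
        Object value = record.get(pos);               // Integer 3, not the timestamp
        System.out.println(value.getClass() + ": " + value);
      }
    }

In this sketch the mismatch is visible because the types differ; if the neighboring field had also been a long, the copy would have partitioned on it without any error, which is the silent-corruption case described above.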