  1. Kite SDK (READ-ONLY)
  2. KITE-927

Reading a dataset using Crunch with old schema causes failure

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.18.0
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels: None

      Description

      Reading a dataset using a Crunch source with an older schema introduces job failures.

      Here are the steps to reproduce the error:

      1. Create a Dataset using Avro specific (generated) classes, and load data (via Crunch or DatasetWriter). Assume the schema has 3 fields.
      2. Read the Dataset using a Crunch Source (all 3 fields are read).
      3. Update the Dataset schema to have 4 fields (add a passive field).
      4. Read from the Dataset using the old schema (without the new field); a job failure occurs.
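
The steps above can be modeled without Hadoop. The following is a minimal sketch, assuming the failure comes from assigning values by the writer's field positions into a class generated from the older schema. It is not Avro or Kite code: the class `OldUser`, the method `readPositionally`, and the field names are hypothetical, and only the `"Bad index"` message is taken from the stack trace in this report.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of the failure; illustrative only, not Avro or Kite code.
class BadIndexSketch {

    // Stands in for a class generated from the original 3-field schema.
    // Field names are hypothetical. Values are assigned by position, and
    // any position the class does not know about is rejected.
    static class OldUser {
        Object username, email, age;

        void put(int index, Object value) {
            switch (index) {
                case 0: username = value; break;
                case 1: email = value; break;
                case 2: age = value; break;
                default: throw new RuntimeException("Bad index");
            }
        }
    }

    // Fills an OldUser by walking the values in the order they were written,
    // treating the writer's field positions as if they were the reader's.
    static OldUser readPositionally(List<Object> writtenValues) {
        OldUser user = new OldUser();
        for (int i = 0; i < writtenValues.size(); i++) {
            user.put(i, writtenValues.get(i));
        }
        return user;
    }

    public static void main(String[] args) {
        // Steps 1-2: a record written with 3 fields is read with the 3-field class.
        readPositionally(Arrays.<Object>asList("belinda", "belinda@example.com", 30));
        System.out.println("3-field record read");

        // Steps 3-4: a record written with 4 fields is read into the old class.
        try {
            readPositionally(Arrays.<Object>asList("belinda", "belinda@example.com", 30, "blue"));
            System.out.println("4-field record read");
        } catch (RuntimeException e) {
            System.out.println("4-field record failed: " + e.getMessage());
        }
    }
}
```

In this model the fourth value has no matching position in the old class, which is the same kind of error the task log reports from `User.put`.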

      I noticed this behavior only when using the Crunch SourceTarget; it does not happen when no MapReduce logic is involved.

      Here is an example to demonstrate this behavior. I created it based on the dataset-hbase example. An HDFS dataset is created with an initial schema and is later updated with a newer schema (a passive schema change). Any read from the dataset using the initial schema will then fail. The Readme has instructions to run the example. Note that the project is set up to work with CDH4, and it works with the QuickStart VM.

      Here is the stack trace of the error from one of the tasks that was trying to read from the Dataset:

      2015-02-18 08:53:55,272 INFO org.apache.hadoop.mapred.MapTask: Processing split: hdfs://localhost.localdomain:8020/user/cloudera/test_crunch_passivity/users/username_copy=belinda/0ee42014-39f7-4da7-b353-a4966feee481.avro:0+630
      2015-02-18 08:53:55,774 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
      2015-02-18 08:53:55,781 WARN org.apache.hadoop.mapred.Child: Error running child
      org.apache.avro.AvroRuntimeException: Bad index
      	at org.kitesdk.examples.data.User.put(User.java:50)
      	at org.apache.avro.generic.GenericData.setField(GenericData.java:530)
      	at org.apache.avro.generic.GenericData.setField(GenericData.java:547)
      	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
      	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
      	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
      	at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
      	at org.apache.avro.mapreduce.AvroRecordReaderBase.nextKeyValue(AvroRecordReaderBase.java:118)
      	at org.apache.avro.mapreduce.AvroKeyRecordReader.nextKeyValue(AvroKeyRecordReader.java:53)
      	at org.kitesdk.data.spi.AbstractKeyRecordReaderWrapper.nextKeyValue(AbstractKeyRecordReaderWrapper.java:55)
      	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:483)
      	at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
      	at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
      	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
      	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
      	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
      	at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:396)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
      	at org.apache.hadoop.mapred.Child.main(Child.java:262)
      2015-02-18 08:53:55,787 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
      

      I believe this is a valid scenario with passive schema changes, where consumers of a dataset may use an old version of the schema to read from it.
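
The expected behavior can be illustrated with a sketch of name-based schema resolution, in which the reader's schema decides which fields are kept. This is a simplified stdlib-only model under stated assumptions, not Avro's implementation: the method `resolve`, the field names, and the use of a map for reader defaults are all hypothetical. Avro's documented resolution rules match record fields by name, ignore writer fields the reader does not declare, and use the reader's default for fields the writer lacks.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of reader/writer schema resolution; not Avro or Kite code.
class ResolveByNameSketch {

    // readerDefaults maps each reader field name to its default value.
    // Writer fields are matched to reader fields by name: unknown writer
    // fields are skipped, and reader fields absent from the writer keep
    // their default. (Avro itself raises an error when a missing field has
    // no default; this sketch simply keeps whatever the map holds.)
    static Map<String, Object> resolve(List<String> writerFields,
                                       List<Object> writtenValues,
                                       Map<String, Object> readerDefaults) {
        Map<String, Object> record = new LinkedHashMap<String, Object>(readerDefaults);
        for (int i = 0; i < writerFields.size(); i++) {
            String name = writerFields.get(i);
            if (record.containsKey(name)) {
                record.put(name, writtenValues.get(i));
            }
        }
        return record;
    }

    public static void main(String[] args) {
        // Reader still uses the old 3-field schema (hypothetical field names).
        Map<String, Object> oldReader = new LinkedHashMap<String, Object>();
        oldReader.put("username", null);
        oldReader.put("email", null);
        oldReader.put("age", null);

        // Data was written with the newer 4-field schema.
        List<String> newWriter = Arrays.asList("username", "email", "age", "favoriteColor");
        List<Object> values = Arrays.<Object>asList("belinda", "belinda@example.com", 30, "blue");

        // The extra writer field is dropped instead of causing a failure.
        System.out.println(resolve(newWriter, values, oldReader));
    }
}
```

Under this model, a consumer holding the old schema reads the three fields it knows and never sees the added one.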

            People

            • Assignee: Ryan Brush (rbrush)
            • Reporter: Nithin Asokan (nasokan)