Details
Type: Bug
Status: Open
Priority: Blocker
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Hive
Labels: None
Environment: CDH 5.7.0 @ CentOS 6
Description
We upgraded from CDH 5.4.4 to 5.7.0 and queries such as this one now fail:
create table hivebug.simple ( `mycol` array<struct<clickid:int,issourceloss:boolean,transitiontype:string>> ) STORED AS PARQUET;
insert into hivebug.simple select mycol from data.dump limit 1;
We can select the data just fine; it is writing it that fails. The reducer dies with an exception like this:
java.lang.NegativeArraySizeException
	at org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryArray.adjustArraySize(LazyBinaryArray.java:108)
	at org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryArray.parse(LazyBinaryArray.java:136)
	at org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryArray.getListLength(LazyBinaryArray.java:210)
	at org.apache.hadoop.hive.serde2.lazybinary.objectinspector.LazyBinaryListObjectInspector.getListLength(LazyBinaryListObjectInspector.java:63)
	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter$ListDataWriter.write(DataWritableWriter.java:259)
	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter$GroupDataWriter.write(DataWritableWriter.java:199)
	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter$MessageDataWriter.write(DataWritableWriter.java:215)
	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:88)
	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
	at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:116)
	at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
	at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
	at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:111)
	at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:124)
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:697)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
	at org.apache.hadoop.hive.ql.exec.LimitOperator.processOp(LimitOperator.java:51)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
	at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
	at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:244)
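For what it's worth, a NegativeArraySizeException in adjustArraySize suggests that the list length parsed out of the LazyBinary record is negative. The following standalone sketch (plain Java, not Hive source; the -1 length is an assumption) only illustrates that failure mode: allocating an array with a negative size throws exactly this exception.

public class NegativeSizeDemo {
  public static void main(String[] args) {
    int parsedListLength = -1;                         // hypothetical corrupted list length
    Object[] elements = new Object[parsedListLength];  // throws java.lang.NegativeArraySizeException
    System.out.println(elements.length);
  }
}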
Or, if you run:
create table test_xy stored as parquet as SELECT * FROM data.dump WHERE year = "2016" and month = "03" and day in ("12", "13", "14") and mycol in("abc", "def", "ghi") order by timestamp asc
you get a slightly different one:
java.lang.IllegalArgumentException: fromIndex(0) > toIndex(-1)
	at java.util.Arrays.rangeCheck(Arrays.java:794)
	at java.util.Arrays.fill(Arrays.java:2084)
	at org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryArray.parse(LazyBinaryArray.java:163)
	at org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryArray.getListLength(LazyBinaryArray.java:210)
	at org.apache.hadoop.hive.serde2.lazybinary.objectinspector.LazyBinaryListObjectInspector.getListLength(LazyBinaryListObjectInspector.java:63)
	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter$ListDataWriter.write(DataWritableWriter.java:259)
	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter$GroupDataWriter.write(DataWritableWriter.java:199)
	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter$MessageDataWriter.write(DataWritableWriter.java:215)
	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:88)
	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
	at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:116)
	at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
	at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
	at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:111)
	at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:124)
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:697)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
	at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
	at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:244)
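This second exception points at the same kind of bad length value: Arrays.fill is being asked to fill up to index -1. A minimal sketch (again plain Java with an assumed length of -1, not Hive source) reproduces the "fromIndex(0) > toIndex(-1)" message from the trace:

import java.util.Arrays;

public class FillRangeDemo {
  public static void main(String[] args) {
    boolean[] elementInited = new boolean[8];
    int parsedListLength = -1;                              // hypothetical corrupted list length
    Arrays.fill(elementInited, 0, parsedListLength, false); // throws IllegalArgumentException: fromIndex(0) > toIndex(-1)
  }
}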
By the way, just before the crash I can see a suspicious INFO message in the logs:
2016-05-27 12:37:54,197 INFO [main] org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryStruct: Missing fields! Expected 8 fields but only got 8! Ignoring similar problems.
I looked at a diff between Hive 1.1.0-cdh5.4.4 and 1.1.0-cdh5.7.0 to see what had changed and noticed large Parquet-related changes (ql/src/java/org/apache/hadoop/hive/ql/io/parquet), so I suspect this is where the bug was introduced.