Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 0.17.0
- Fix Version/s: None
- Component/s: Data Module
- Labels: None
Description
We have a Crunch job, run periodically on YARN through Oozie, that calculates some stats for a Kite dataset that's set up as a Hive external table.
In one environment, everything works correctly. The job config as recorded by the JobHistory server looks like this:
hive.metastore.uris=thrift://server1.abc.net:9083
kite.inputPartitionDir=hdfs://ingestiondev/wolfe/storage
kite.inputUri=dataset:hive://server1.abc.net:9083/wolfe/default/storage?hdfs:host=ingestiondev
In another, similar environment the job is failing with this map task exception:
2014-11-04 18:36:09,402 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.IllegalArgumentException: Missing Hive MetaStore connection URI
    at org.kitesdk.data.spi.hive.MetaStoreUtil.<init>(MetaStoreUtil.java:78)
    at org.kitesdk.data.spi.hive.HiveAbstractMetadataProvider.getMetaStoreUtil(HiveAbstractMetadataProvider.java:56)
    at org.kitesdk.data.spi.hive.HiveAbstractMetadataProvider.resolveNamespace(HiveAbstractMetadataProvider.java:237)
    at org.kitesdk.data.spi.hive.HiveAbstractMetadataProvider.resolveNamespace(HiveAbstractMetadataProvider.java:222)
    at org.kitesdk.data.spi.hive.HiveAbstractMetadataProvider.load(HiveAbstractMetadataProvider.java:95)
    at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.load(FileSystemDatasetRepository.java:191)
    at org.kitesdk.data.Datasets.load(Datasets.java:69)
    at org.kitesdk.data.Datasets.load(Datasets.java:113)
    at org.kitesdk.data.mapreduce.DatasetKeyInputFormat.load(DatasetKeyInputFormat.java:226)
    at org.kitesdk.data.mapreduce.DatasetKeyInputFormat.setConf(DatasetKeyInputFormat.java:172)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:726)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
The config for the failing job looks like this:
hive.metastore.uris=thrift://server1.xyz.net:9083,thrift://server2.xyz.net:9083
kite.inputPartitionDir=hdfs://wario/wolfe/default/storage
kite.inputUri=dataset:hive:/wolfe/default/storage?hdfs:host=wario
I haven't tracked down how the kite.inputUri property is constructed, but it seems odd that it contains the metastore host:port only for the successful job. I think the key difference is likely the multiple URIs in the hive.metastore.uris property for the unsuccessful job. A quick search found some Kite code that doesn't appear to handle multiple URIs correctly [1] (not sure if this is ultimately the culprit for the issue we're seeing, but it does look like a bug).
We're using CDH 5.1.0.1 and Kite 0.17.0.
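To illustrate the suspicion about multiple metastore URIs: hive.metastore.uris is a comma-separated list, so any code that reads it has to split the value before parsing each entry as a URI. Below is a minimal sketch of that handling. The class name MetastoreUris and its parse method are hypothetical and are not part of Kite or Hive; this only shows the kind of handling we'd expect, and we haven't confirmed that its absence is what causes the failure above.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper (not Kite code): splits a hive.metastore.uris value
 * into individual URIs instead of treating the whole string as one URI.
 */
final class MetastoreUris {
  private MetastoreUris() {}

  /** Returns one URI per non-empty, comma-separated entry; empty list for null or blank input. */
  static List<URI> parse(String propertyValue) {
    List<URI> uris = new ArrayList<>();
    if (propertyValue == null) {
      return uris;
    }
    for (String part : propertyValue.split(",")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        // URI.create throws IllegalArgumentException on a malformed entry
        uris.add(URI.create(trimmed));
      }
    }
    return uris;
  }
}
```

With the value from the failing job, MetastoreUris.parse(...) yields two entries, and the host and port of the first could then be used wherever a single metastore host:port is needed (for example, when building a dataset URI like the one in the working environment).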