[KITE-762] Multiple URIs in hive.metastore.uris configuration may be problematic for Crunch+Kite - Cloudera Open Source

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.17.0
Fix Version/s: None
Component/s: Data Module
Labels: None

Description

We have a Crunch job, run periodically on YARN through Oozie, that calculates some stats for a Kite dataset that's set up as a Hive external table.

In one environment, everything works correctly. The job config as recorded by the JobHistory server looks like this:

hive.metastore.uris=thrift://server1.abc.net:9083
kite.inputPartitionDir=hdfs://ingestiondev/wolfe/storage
kite.inputUri=dataset:hive://server1.abc.net:9083/wolfe/default/storage?hdfs:host=ingestiondev

In another similar environment the job is failing with this map task exception:

2014-11-04 18:36:09,402 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.IllegalArgumentException: Missing Hive MetaStore connection URI
	at org.kitesdk.data.spi.hive.MetaStoreUtil.<init>(MetaStoreUtil.java:78)
	at org.kitesdk.data.spi.hive.HiveAbstractMetadataProvider.getMetaStoreUtil(HiveAbstractMetadataProvider.java:56)
	at org.kitesdk.data.spi.hive.HiveAbstractMetadataProvider.resolveNamespace(HiveAbstractMetadataProvider.java:237)
	at org.kitesdk.data.spi.hive.HiveAbstractMetadataProvider.resolveNamespace(HiveAbstractMetadataProvider.java:222)
	at org.kitesdk.data.spi.hive.HiveAbstractMetadataProvider.load(HiveAbstractMetadataProvider.java:95)
	at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.load(FileSystemDatasetRepository.java:191)
	at org.kitesdk.data.Datasets.load(Datasets.java:69)
	at org.kitesdk.data.Datasets.load(Datasets.java:113)
	at org.kitesdk.data.mapreduce.DatasetKeyInputFormat.load(DatasetKeyInputFormat.java:226)
	at org.kitesdk.data.mapreduce.DatasetKeyInputFormat.setConf(DatasetKeyInputFormat.java:172)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:726)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

The config for the failing job looks like this:

hive.metastore.uris=thrift://server1.xyz.net:9083,thrift://server2.xyz.net:9083
kite.inputPartitionDir=hdfs://wario/wolfe/default/storage
kite.inputUri=dataset:hive:/wolfe/default/storage?hdfs:host=wario
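The two kite.inputUri values differ in whether they carry a metastore host and port. The following is a minimal sketch that makes the difference visible using only java.net.URI; it is not Kite's own URI handling, and the class and method names (DatasetUriInspect, metastoreAuthority) are hypothetical.

```java
import java.net.URI;

public class DatasetUriInspect {

  /**
   * Returns "host:port" (or just "host" when no port is given) from a
   * dataset URI such as dataset:hive://host:9083/path, or null when the
   * URI carries no host at all.
   */
  public static String metastoreAuthority(String datasetUri) {
    // "dataset:hive:..." has no slash right after the first colon, so
    // java.net.URI treats it as opaque; the storage URI is its
    // scheme-specific part.
    URI outer = URI.create(datasetUri);
    URI inner = URI.create(outer.getRawSchemeSpecificPart());
    if (inner.getHost() == null) {
      return null;
    }
    if (inner.getPort() == -1) {
      return inner.getHost();
    }
    return inner.getHost() + ":" + inner.getPort();
  }

  public static void main(String[] args) {
    // Values copied from the two job configs in this report.
    String working =
        "dataset:hive://server1.abc.net:9083/wolfe/default/storage?hdfs:host=ingestiondev";
    String failing = "dataset:hive:/wolfe/default/storage?hdfs:host=wario";
    System.out.println("working: " + metastoreAuthority(working));
    System.out.println("failing: " + metastoreAuthority(failing));
  }
}
```

In this sketch the URI from the failing job yields no host, so anything loading that dataset in a map task would presumably need to obtain the metastore location from somewhere else, such as hive.metastore.uris.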

I haven't tracked down how the kite.inputUri property is constructed, but it seems odd that it contains the metastore host:port only for the successful job. The key difference is most likely that hive.metastore.uris lists multiple URIs for the unsuccessful job. A quick search found some Kite code that doesn't appear to handle multiple URIs correctly [1]. I'm not sure this is ultimately the cause of the failure we're seeing, but it does look like a bug.
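For illustration, here is a minimal sketch of how a comma-separated hive.metastore.uris value could be split into individual URIs before a host and port are read from it. It is not the Kite code referenced in [1] and not the eventual fix; the class and method names (MetastoreUris, parse) are hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class MetastoreUris {

  /** Splits a comma-separated hive.metastore.uris value into URIs. */
  public static List<URI> parse(String value) {
    List<URI> uris = new ArrayList<>();
    if (value == null) {
      return uris;
    }
    for (String part : value.split(",")) {
      String trimmed = part.trim();
      // Skip empty entries, e.g. from a trailing comma.
      if (!trimmed.isEmpty()) {
        uris.add(URI.create(trimmed));
      }
    }
    return uris;
  }

  public static void main(String[] args) {
    // Value copied from the failing job's config in this report.
    String value =
        "thrift://server1.xyz.net:9083,thrift://server2.xyz.net:9083";
    for (URI uri : parse(value)) {
      System.out.println(uri.getHost() + ":" + uri.getPort());
    }
  }
}
```

Each entry is parsed on its own, so a caller can pick the first URI or try them in order rather than treating the whole property value as a single URI.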

We're using CDH 5.1.0.1 and Kite 0.17.0.

[1] https://github.com/kite-sdk/kite/blob/release-0.17.0/kite-data/kite-data-hive/src/main/java/org/kitesdk/data/spi/hive/HiveAbstractDatasetRepository.java#L88-94

Issue Links

links to

Pull request

People

Assignee: Szabolcs Vasas
Reporter: Andrew Olson
Votes: 0
Watchers: 2

Dates

Created: 05/Nov/14 9:23 PM
Updated: 21/Aug/17 11:24 AM
Resolved: 21/Aug/17 11:24 AM