[KITE-1048] Hadoop MR for Hive tries to write temporary folder to invalid dataset - Cloudera Open Source

Details

Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 1.1.0
Fix Version/s: None
Component/s: Data Module
Labels:
- hadoop
- hive
- kite
- kitesdk
Environment:
Hadoop 2.6.0, HDP2.2

Description

I have a MapReduce job that reads and parses text and writes its results to a Hive table.

The job is configured (shortened) like this:

Configuration conf = new HiveConfiguration();
Job job = Job.getInstance(conf);

FileInputFormat.addInputPaths(job, inputPaths);
job.setInputFormatClass(TextInputFormat.class);
AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.LONG));
AvroJob.setMapOutputValueSchema(job, Tweet.getClassSchema());

DatasetKeyOutputFormat.ConfigBuilder configBuilder = DatasetKeyOutputFormat.configure(job);
configBuilder.overwrite("dataset:hive:mydataset");
configBuilder.withType(Tweet.class);

The job fails with the following exception:

15/07/17 00:57:56 INFO mapreduce.Job: Job job_1436989639392_0015 failed with state FAILED due to: Job setup failed : java.lang.IllegalArgumentException: Unknown repository URI pattern: dataset:hdfs://hdfs.XXX.com:8020/tmp/default/.temp/job_1436989639392_0015
    at org.kitesdk.data.spi.Registration.lookupPatternByRepoUri(Registration.java:74)
    at org.kitesdk.data.URIBuilder.<init>(URIBuilder.java:109)
    at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.create(FileSystemDatasetRepository.java:144)
    at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat.loadOrCreateJobDataset(DatasetKeyOutputFormat.java:584)
    at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat.access$300(DatasetKeyOutputFormat.java:67)
    at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$MergeOutputCommitter.setupJob(DatasetKeyOutputFormat.java:369)
    at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobSetup(CommitterEventHandler.java:254)
    at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:234)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

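I traced the stack trace down a bit, but couldn't find where the hostname gets added to this dataset URI. The job is configured with dataset:hive:mydataset, yet job setup fails on a dataset:hdfs://... URI that points at a temporary folder.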
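
For anyone comparing the two URIs, here is a minimal sketch that splits the configured URI and the URI from the exception into their parts. It uses only java.net.URI from the JDK, not Kite's own URI handling, and the class name DatasetUriInspector and its describe method are hypothetical helpers made up for this illustration. It does not show where Kite builds the temporary URI; it only makes the difference between the two strings explicit.

```java
import java.net.URI;

// Diagnostic helper only: relies on java.net.URI, not on Kite's URI classes.
// DatasetUriInspector and describe() are hypothetical names for this sketch.
class DatasetUriInspector {

    /** Splits "dataset:<storage URI>" into the wrapper scheme and the wrapped storage URI. */
    static String describe(String datasetUri) {
        URI outer = URI.create(datasetUri);
        // Everything after the first "scheme:" prefix,
        // e.g. "hive:mydataset" or "hdfs://host:port/path".
        URI inner = URI.create(outer.getSchemeSpecificPart());
        return "wrapper=" + outer.getScheme()
            + " storage=" + inner.getScheme()
            + " host=" + inner.getHost()
            + " port=" + inner.getPort()
            + " path=" + inner.getPath();
    }

    public static void main(String[] args) {
        // URI passed to configBuilder.overwrite(...) in the job setup.
        System.out.println(describe("dataset:hive:mydataset"));
        // URI reported in the IllegalArgumentException.
        System.out.println(describe(
            "dataset:hdfs://hdfs.XXX.com:8020/tmp/default/.temp/job_1436989639392_0015"));
    }
}
```

The configured URI carries no host at all (storage scheme hive), while the failing one wraps a full HDFS location with host hdfs.XXX.com and port 8020, so the hostname appears to come from the filesystem location used for the temporary job dataset rather than from the job configuration shown above.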

People

Assignee: Unassigned
Reporter: Dominik Hübner

Dates

Created: 17/Jul/15 8:23 PM
Updated: 17/Jul/15 8:23 PM