Kite SDK (READ-ONLY) / KITE-1048

Hadoop MR for Hive tries to write temporary folder to invalid dataset

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.1.0
    • Fix Version/s: None
    • Component/s: Data Module
    • Environment:
      Hadoop 2.6.0, HDP2.2

      Description

      I have a MapReduce job that reads and parses text and writes its results to a Hive table.

      The job is configured (shortened) like this:

      Configuration conf = new HiveConfiguration();
      Job job = Job.getInstance(conf);

      FileInputFormat.addInputPaths(job, inputPaths);
      job.setInputFormatClass(TextInputFormat.class);
      AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.LONG));
      AvroJob.setMapOutputValueSchema(job, Tweet.getClassSchema());

      DatasetKeyOutputFormat.ConfigBuilder configBuilder = DatasetKeyOutputFormat.configure(job);
      configBuilder.overwrite("dataset:hive:mydataset");
      configBuilder.withType(Tweet.class);

      The job fails with the following exception:

      15/07/17 00:57:56 INFO mapreduce.Job: Job job_1436989639392_0015 failed with state FAILED due to: Job setup failed : java.lang.IllegalArgumentException: Unknown repository URI pattern: dataset:hdfs://hdfs.XXX.com:8020/tmp/default/.temp/job_1436989639392_0015
      at org.kitesdk.data.spi.Registration.lookupPatternByRepoUri(Registration.java:74)
      at org.kitesdk.data.URIBuilder.<init>(URIBuilder.java:109)
      at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.create(FileSystemDatasetRepository.java:144)
      at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat.loadOrCreateJobDataset(DatasetKeyOutputFormat.java:584)
      at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat.access$300(DatasetKeyOutputFormat.java:67)
      at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$MergeOutputCommitter.setupJob(DatasetKeyOutputFormat.java:369)
      at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobSetup(CommitterEventHandler.java:254)
      at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:234)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:745)

      I followed the stack trace a few frames down, but couldn't find where the hostname is added to this dataset URI.
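
      To make the difference between the configured URI and the rejected one easier to see, both strings can be taken apart with the JDK's java.net.URI. This is only a sketch and an assumption about how to inspect the strings; it is not Kite's own URI handling. The class and method names (DatasetUriParts, innerUri) are made up for illustration, and the hostname is the redacted placeholder from the log above.

```java
import java.net.URI;

public class DatasetUriParts {

    // Returns the storage URI nested inside a "dataset:" URI.
    // "dataset:hdfs://..." is an opaque URI to java.net.URI, so everything
    // after the first colon is available as the scheme-specific part.
    static URI innerUri(String datasetUri) {
        URI outer = URI.create(datasetUri);
        return URI.create(outer.getSchemeSpecificPart());
    }

    public static void main(String[] args) {
        // The URI passed to overwrite(...) in the job configuration.
        URI configured = innerUri("dataset:hive:mydataset");
        // The URI reported in the IllegalArgumentException.
        URI rejected = innerUri(
            "dataset:hdfs://hdfs.XXX.com:8020/tmp/default/.temp/job_1436989639392_0015");

        // Configured URI: scheme "hive", no host component at all.
        System.out.println(configured.getScheme() + " host=" + configured.getHost());

        // Rejected URI: scheme "hdfs" with an explicit host and port,
        // followed by the temporary job path.
        System.out.println(rejected.getScheme()
            + " host=" + rejected.getHost()
            + " port=" + rejected.getPort()
            + " path=" + rejected.getPath());
    }
}
```

      The configured URI has the scheme hive and no host, while the rejected one is a full hdfs:// location with host and port, so the host seems to come from wherever the temporary job dataset's location is resolved rather than from the string passed to overwrite(...).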

            People

            • Assignee: Unassigned
            • Reporter: dhuebner Dominik Hübner