Description
Sqoop currently uses a two-tier model for importing into Hive. First it writes the results to an intermediary directory in HDFS. Then it creates a Hive table and runs a LOAD DATA command to pull the data into the Hive warehouse subdirectory. This works fine for the first import, where the "part-m-xxxx.gz" files are loaded by Hive (LOAD DATA merely moves them from the intermediary directory to the warehouse subdirectory).
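For reference, the LOAD DATA step is effectively the following HiveQL (the path and table name here are hypothetical); in HDFS it amounts to a move rather than a copy:

{code:sql}
-- Moves the part files out of the intermediary directory
-- and into the table's subdirectory under the Hive warehouse.
LOAD DATA INPATH '/user/someuser/sometable' INTO TABLE sometable;
{code}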
With the next call, however, the AppendUtils class checks the intermediary directory for the last part file. It never finds one - the Hive LOAD DATA moved all of them - so it names the new files starting at "part-m-0000.gz". When Hive then attempts to load the data (which, again, is just a move from that directory into its warehouse subdirectory), it fails, because a file named "part-m-0000.gz" already exists in that subdirectory.
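A walk-through of the failure, with hypothetical paths:

{noformat}
# After import #1, before LOAD DATA:
/user/someuser/sometable/part-m-0000.gz

# After LOAD DATA (the intermediary directory is now empty):
/user/hive/warehouse/sometable/part-m-0000.gz

# Import #2: AppendUtils finds no part files in the intermediary
# directory, so it starts numbering from zero again:
/user/someuser/sometable/part-m-0000.gz

# LOAD DATA #2 fails: part-m-0000.gz already exists in the warehouse subdir.
{noformat}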
The solution is to have AppendUtils check the Hive warehouse directory (when --warehouse-dir and --hive-import are both specified) to determine what the next part name should be. Sqoop can then be executed multiple times to append to the Hive table. A sketch of the idea follows.
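A minimal sketch of that check, assuming Hadoop's FileSystem API; the class and method names here are hypothetical, not the actual patch, and the real logic would live in AppendUtils:

{code:java}
// Hypothetical helper: scan a directory for existing "part-m-NNNN" files
// and return the next free index. AppendUtils would call this against the
// table's Hive warehouse subdirectory instead of the intermediary directory.
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WarehousePartScan {
  // Matches names like part-m-0003 or part-m-0003.gz.
  private static final Pattern PART_PATTERN =
      Pattern.compile("^part-m-(\\d+)(\\..*)?$");

  /** Returns the next unused part index under dir, or 0 if none exist. */
  public static int nextPartIndex(Configuration conf, Path dir)
      throws IOException {
    FileSystem fs = dir.getFileSystem(conf);
    if (!fs.exists(dir)) {
      return 0;
    }
    int max = -1;
    for (FileStatus stat : fs.listStatus(dir)) {
      Matcher m = PART_PATTERN.matcher(stat.getPath().getName());
      if (m.matches()) {
        max = Math.max(max, Integer.parseInt(m.group(1)));
      }
    }
    return max + 1;
  }
}
{code}

With this, the second import would name its files starting at the first index not already present in the warehouse subdirectory, and the subsequent LOAD DATA would no longer collide.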