[SQOOP-129] Newlines in RDBMS Fields Break Hive - Cloudera Open Source

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.1.0
Fix Version/s: 1.3.0
Component/s: hive, import
Labels: None

Description

At the moment, Hive supports only newlines as record delimiters, and it also treats carriage returns as record delimiters. Any newline or carriage return inside an RDBMS field will therefore cause Hive to misread the table after it is imported with Sqoop.
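To illustrate the failure, here is a minimal Python sketch that imitates an unescaped text import and a line-oriented reader. The rows, the Ctrl-A field delimiter, and the reader itself are assumptions made up for the example; this is not Sqoop or Hive code.

```python
import re

# Hypothetical rows as they might come out of an RDBMS; the last value
# of the second row contains an embedded newline.
rows = [
    ("1", "alice", "first comment"),
    ("2", "bob", "line one\nline two"),
]

FIELD_DELIM = "\x01"   # assumed field delimiter (Ctrl-A)
RECORD_DELIM = "\n"

# Write the rows as delimited text with no escaping or enclosing.
text = "".join(FIELD_DELIM.join(row) + RECORD_DELIM for row in rows)

# Read them back, treating both "\n" and "\r" as record terminators.
lines = [line for line in re.split(r"[\r\n]", text) if line]
records = [line.split(FIELD_DELIM) for line in lines]

# Two rows were written, but three records come back, and the last one
# holds only the tail of the broken field.
print(len(rows), len(records))
print(records[-1])
```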

The current Sqoop docs do note:

Hive does not support enclosing and escaping characters. You must choose unambiguous field and record-terminating delimiters without the help of escaping and enclosing characters when working with Hive; this is a limitation of Hive's input parsing abilities.

The problem is that users currently cannot choose their own record-terminating delimiters. Rather than requiring users to preprocess fields and strip all newlines and carriage returns before running a Sqoop job, it would be immensely useful to let users specify replacement characters for both record-terminating and field-terminating delimiters (replacements because, as the Sqoop docs quoted above note, there is no escaping).
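A rough sketch of the requested behaviour, in Python for brevity. The function names, the default replacement string, and the Ctrl-A field delimiter are all hypothetical; the attached patches may implement this differently.

```python
def sanitize_field(value, field_delim="\x01", replacement=" "):
    """Replace delimiter characters inside a single string field.

    Hypothetical helper: "\r\n" is handled first so that a Windows-style
    line ending becomes one replacement rather than two.
    """
    for delim in ("\r\n", "\r", "\n", field_delim):
        value = value.replace(delim, replacement)
    return value


def to_record(row, field_delim="\x01", replacement=" "):
    """Join already-stringified fields into one delimiter-safe record."""
    return field_delim.join(
        sanitize_field(v, field_delim, replacement) for v in row
    )


record = to_record(("2", "bob", "line one\nline two"))
print(repr(record))
```

Passing `replacement=""` drops the offending characters instead of substituting for them. Either way the data is altered, which is the trade-off of having no escaping.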

Attachments

0001-SQOOP-129.-Newlines-in-RDBMS-fields-break-Hive.patch (9 kB, 05/May/11 12:07 AM, Jonathan Hsieh)
sqoop.patch (10 kB, 08/Nov/10 2:26 PM, Brian Muller)

Issue Links

relates to

SQOOP-190 Sqoop shouldn't use generated SqoopRecord.toString in text output cases. (Status: Open)

People

Assignee: Jonathan Hsieh
Reporter: Brian Muller
Votes: 0
Watchers: 6

Dates

Created: 08/Nov/10 2:13 PM
Updated: 05/May/11 12:08 AM
Resolved: 05/May/11 12:08 AM