[KITE-228] readSequenceFile command should not reuse the identity of Hadoop Writable objects - Cloudera Open Source

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.6.0, 0.7.0, 0.8.0, 0.8.1
Fix Version/s: 0.9.0
Component/s: Morphlines Module
Labels:
None

Description

The readSequenceFile morphline command should not reuse the "key" and "value" Hadoop Writable objects across rows.

Downstream commands such as loadSolr, or the HBase indexer, buffer several records before sending them to Solr. If the buffered records all reference the same Hadoop Writable object as their primary key id, the result is nonsensical: every record suddenly appears to be the same record (same id), because each one sees whatever content was last written into that shared object.

A work-around is to insert the commands

toString { field : key }
toString { field : value }

immediately after the readSequenceFile command in your morphline. This converts the key and value from Hadoop Writable objects to distinct String objects, so the identity of the key and of the value is different for each row.
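The aliasing problem and the effect of the work-around can be sketched in plain Java. This is only an illustration, not the Kite or Hadoop code: MutableKey is a hypothetical stand-in for a mutable Hadoop Writable (for example Text), and the buffer is an ordinary list standing in for whatever a downstream command keeps before flushing.

```java
import java.util.ArrayList;
import java.util.List;

public class WritableReuseDemo {

    /** Hypothetical stand-in for a mutable Hadoop Writable such as Text. */
    static final class MutableKey {
        private String content = "";
        void set(String content) { this.content = content; }
        @Override public String toString() { return content; }
    }

    /** Buggy pattern: one instance is overwritten per row and buffered by reference. */
    public static List<String> bufferSharedInstance(String[] rows) {
        MutableKey key = new MutableKey();
        List<Object> buffer = new ArrayList<>();
        for (String row : rows) {
            key.set(row);      // the reader overwrites the same object in place
            buffer.add(key);   // the buffer stores a reference, not a copy
        }
        return render(buffer);
    }

    /** Work-around pattern: a distinct String is buffered for each row. */
    public static List<String> bufferDistinctCopies(String[] rows) {
        MutableKey key = new MutableKey();
        List<Object> buffer = new ArrayList<>();
        for (String row : rows) {
            key.set(row);
            buffer.add(key.toString());  // analogous to toString { field : key }
        }
        return render(buffer);
    }

    /** Renders the buffered objects the way a later flush would see them. */
    private static List<String> render(List<Object> buffer) {
        List<String> out = new ArrayList<>();
        for (Object o : buffer) {
            out.add(o.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String[] rows = {"id-1", "id-2", "id-3"};
        System.out.println(bufferSharedInstance(rows));  // prints [id-3, id-3, id-3]
        System.out.println(bufferDistinctCopies(rows));  // prints [id-1, id-2, id-3]
    }
}
```

With the shared instance, every buffered entry reports the id of the last row read. Converting to a String per row gives each buffered record its own immutable value.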

People

Assignee:

Wolfgang Hoschek

Reporter:

Wolfgang Hoschek

Dates

Created:

17/Nov/13 9:55 PM

Updated:

17/Nov/13 10:07 PM

Resolved:

17/Nov/13 10:07 PM