Description
(Demonstrated this bug to Aaron K. at Cloudera Hackathon on 7/27).
I discovered that when importing a table that uses an unsigned bigint as the primary key, the auto-generated split intervals are wrong. To reproduce:
mysql> create table TestInfo (
userid bigint(20) unsigned NOT NULL DEFAULT '0',
name varchar(100) COLLATE utf8_unicode_ci DEFAULT '',
primary key(userid)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
mysql> INSERT INTO TestInfo VALUES (14, 'foo'), (7863696997872966707, 'bar');
$ sqoop import --connect jdbc:mysql://localhost/sqoop --username root -P --warehouse-dir /tmp --table TestInfo --split-by userid --where 'userid>0'
I'll attach the MySQL query log. In short, Sqoop generates a number of split intervals, some of them with negative bounds, and the resulting imported dataset contains duplicate rows:
$ hadoop fs -getmerge /tmp/TestInfo .
$ cat TestInfo
14,foo
14,foo
7863696997872966707,bar
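The negative interval bounds would be consistent with signed 64-bit overflow while stepping from the minimum to the maximum key. I have not confirmed this against the Sqoop source, so the sketch below is only a guess at the mechanism: `naiveSplits` and `checkedSplits` are hypothetical helpers, not Sqoop code, and the split count of 4 is an assumption. It uses the two userid values from the table above.

```java
import java.util.ArrayList;
import java.util.List;

class SplitOverflowDemo {
    // Hypothetical naive splitter (NOT Sqoop's actual code): walks from min to
    // max in fixed steps using signed long arithmetic.
    static List<Long> naiveSplits(long min, long max, int numSplits) {
        List<Long> points = new ArrayList<>();
        long step = Math.max(1L, (max - min) / numSplits);
        // "cur += step" can exceed Long.MAX_VALUE and wrap to a negative value,
        // which still satisfies "cur <= max", so the loop keeps emitting points.
        for (long cur = min; cur <= max; cur += step) {
            points.add(cur);
        }
        return points;
    }

    // Same idea, but stops before an addition that would pass max.
    // Assumes 0 <= min <= max, so "max - cur" itself cannot overflow.
    static List<Long> checkedSplits(long min, long max, int numSplits) {
        List<Long> points = new ArrayList<>();
        long step = Math.max(1L, (max - min) / numSplits);
        long cur = min;
        while (true) {
            points.add(cur);
            if (max - cur < step) {
                break; // the next step would go past max
            }
            cur += step;
        }
        return points;
    }

    public static void main(String[] args) {
        long min = 14L;                  // smallest userid in TestInfo
        long max = 7863696997872966707L; // largest userid in TestInfo
        System.out.println("naive:   " + naiveSplits(min, max, 4));
        System.out.println("checked: " + checkedSplits(min, max, 4));
    }
}
```

With these inputs the naive loop wraps negative after its fifth point and then walks back up through ranges it has already covered, which would explain both the negative bounds and the duplicated row. Note that neither helper handles unsigned values above Long.MAX_VALUE; those would need BigInteger or similar.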
Little help? Would be happy to provide additional info as requested.
Thanks.