Description
Not sure if this is a Sqoop bug or a Hive bug. The steps below reproduce an issue that originally showed up with a larger table.
1. Import the first part of the table from MSSQL with Sqoop (see attached log):
sqoop import --driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver' --connect 'jdbc:sqlserver://xxxx.net;database=xxxx;user=xxx;password=xxxx' --table campaign --hive-import --check-column id --incremental append --last-value 0 --mysql-delimiters --where 'id < 3000' --compress
The --where clause is supplied so that the next step has new rows to bring in.
2. Get the max id with Hive:
hive> select max(id) from campaign;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201106231927_0098, Tracking URL = http://hadoop-dev-master1:50030/jobdetails.jsp?jobid=job_201106231927_0098
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=hadoop-dev-master1:8021 -kill job_201106231927_0098
2011-06-27 18:25:09,050 Stage-1 map = 0%, reduce = 0%
2011-06-27 18:25:12,066 Stage-1 map = 100%, reduce = 0%
2011-06-27 18:25:19,098 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201106231927_0098
OK
2999
Time taken: 14.629 seconds
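For scripting the hand-off of the last value, the same check can be run non-interactively; a minimal sketch:
hive -e 'select max(id) from campaign;'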
3. Bring in the rest of the table using:
sqoop import --driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver' --connect 'jdbc:sqlserver://xxxx.net;database=xxxx;user=xxx;password=xxxx' --table campaign --hive-import --check-column id --incremental append --last-value 2999 --mysql-delimiters --compress
4. Get the max id with Hive again; it is still 2999:
hive> select max(id) from campaign;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201106231927_0099, Tracking URL = http://hadoop-dev-master1:50030/jobdetails.jsp?jobid=job_201106231927_0099
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=hadoop-dev-master1:8021 -kill job_201106231927_0099
2011-06-27 18:25:37,423 Stage-1 map = 0%, reduce = 0%
2011-06-27 18:25:40,444 Stage-1 map = 100%, reduce = 0%
2011-06-27 18:25:47,488 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201106231927_0099
OK
2999
Time taken: 18.053 seconds
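To confirm that none of the appended rows are visible (rather than just the max being stale), a count over the check column can be compared; a sketch using the same table and boundary value:
hive -e 'select count(*) from campaign;'
hive -e 'select count(*) from campaign where id > 2999;'
With --compress in use, the second count would be expected to come back 0, consistent with the stale max above.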
------------
Here is the HDFS listing of the Hive table directory after both imports:
Found 8 items
-rw-r--r--   3 kernst mposse  12114 2011-06-27 18:23 /user/hive/warehouse/campaign/part-m-00000.gz
-rw-r--r--   3 kernst mposse   5405 2011-06-27 18:24 /user/hive/warehouse/campaign/part-m-00000.gz_copy_1
-rw-r--r--   3 kernst mposse  13330 2011-06-27 18:23 /user/hive/warehouse/campaign/part-m-00001.gz
-rw-r--r--   3 kernst mposse   5480 2011-06-27 18:24 /user/hive/warehouse/campaign/part-m-00001.gz_copy_1
-rw-r--r--   3 kernst mposse  14337 2011-06-27 18:23 /user/hive/warehouse/campaign/part-m-00002.gz
-rw-r--r--   3 kernst mposse   5377 2011-06-27 18:24 /user/hive/warehouse/campaign/part-m-00002.gz_copy_1
-rw-r--r--   3 kernst mposse  14118 2011-06-27 18:23 /user/hive/warehouse/campaign/part-m-00003.gz
-rw-r--r--   3 kernst mposse   5362 2011-06-27 18:24 /user/hive/warehouse/campaign/part-m-00003.gz_copy_1
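The files from the second import appear to be the *_copy_1 entries (timestamp 18:24); note that after the rename they no longer end in .gz. To check that the appended rows are at least physically present in HDFS, one of them can be decompressed by hand, e.g.:
hadoop fs -cat /user/hive/warehouse/campaign/part-m-00000.gz_copy_1 | gunzip -c | head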
--------------
The above works (data from the second import shows up) if --compress is not used.
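In other words, the second import shows up in Hive when step 3 is run with --compress dropped (and the first import run the same way):
sqoop import --driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver' --connect 'jdbc:sqlserver://xxxx.net;database=xxxx;user=xxx;password=xxxx' --table campaign --hive-import --check-column id --incremental append --last-value 2999 --mysql-delimiters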