SQOOP-230

Sqoop incremental hive import and compression

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.2.0
    • Fix Version/s: None
    • Component/s: hive, import
    • Labels: None
    • Environment: RHEL, CDH 3

      Description

Not sure whether this is a Sqoop bug or a Hive bug. The steps below reproduce the issue, which first showed up with a larger table.

1. Bring in the first part of the table from MSSQL with Sqoop (see attached log): "sqoop import --driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver' --connect 'jdbc:sqlserver://xxxx.net;database=xxxx;user=xxx;password=xxxx' --table campaign --hive-import --check-column id --incremental append --last-value 0 --mysql-delimiters --where 'id < 3000' --compress"

      "where" supplied so next step would have data to bring in.
      2. get max id with hive -->
      hive> select max(id) from campaign;
      Total MapReduce jobs = 1
      Launching Job 1 out of 1
      Number of reduce tasks determined at compile time: 1
      In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
      In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
      In order to set a constant number of reducers:
      set mapred.reduce.tasks=<number>
      Starting Job = job_201106231927_0098, Tracking URL = http://hadoop-dev-master1:50030/jobdetails.jsp?jobid=job_201106231927_0098
      Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=hadoop-dev-master1:8021 -kill job_201106231927_0098
      2011-06-27 18:25:09,050 Stage-1 map = 0%, reduce = 0%
      2011-06-27 18:25:12,066 Stage-1 map = 100%, reduce = 0%
      2011-06-27 18:25:19,098 Stage-1 map = 100%, reduce = 100%
      Ended Job = job_201106231927_0098
      OK
      2999
      Time taken: 14.629 seconds

3. Bring in the rest of the table with: "sqoop import --driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver' --connect 'jdbc:sqlserver://xxxx.net;database=xxxx;user=xxx;password=xxxx' --table campaign --hive-import --check-column id --incremental append --last-value 2999 --mysql-delimiters --compress"

4. Get the max id with Hive again: it is unchanged at 2999.
      hive> select max(id) from campaign;
      Total MapReduce jobs = 1
      Launching Job 1 out of 1
      Number of reduce tasks determined at compile time: 1
      In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
      In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
      In order to set a constant number of reducers:
      set mapred.reduce.tasks=<number>
      Starting Job = job_201106231927_0099, Tracking URL = http://hadoop-dev-master1:50030/jobdetails.jsp?jobid=job_201106231927_0099
      Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=hadoop-dev-master1:8021 -kill job_201106231927_0099
      2011-06-27 18:25:37,423 Stage-1 map = 0%, reduce = 0%
      2011-06-27 18:25:40,444 Stage-1 map = 100%, reduce = 0%
      2011-06-27 18:25:47,488 Stage-1 map = 100%, reduce = 100%
      Ended Job = job_201106231927_0099
      OK
      2999
      Time taken: 18.053 seconds
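
For what it's worth, the source side can be sanity-checked directly over JDBC. Below is a minimal sketch (the connection string mirrors the redacted one above, and the exact WHERE clauses Sqoop generates may differ slightly) counting the rows each incremental run should pull; run 2 should be non-zero:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class IncrementalBoundCheck {
    public static void main(String[] args) throws Exception {
        // Older SQLServer JDBC drivers need an explicit load (pre-JDBC4):
        Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
        Connection c = DriverManager.getConnection(
                "jdbc:sqlserver://xxxx.net;database=xxxx;user=xxx;password=xxxx");
        Statement s = c.createStatement();
        // Roughly the predicate the first run derives
        // (--last-value 0 combined with --where 'id < 3000'):
        ResultSet r1 = s.executeQuery(
                "SELECT COUNT(*) FROM campaign WHERE id > 0 AND id < 3000");
        r1.next();
        System.out.println("run 1 rows: " + r1.getLong(1));
        // ...and the second run (--last-value 2999, no --where):
        ResultSet r2 = s.executeQuery(
                "SELECT COUNT(*) FROM campaign WHERE id > 2999");
        r2.next();
        System.out.println("run 2 rows: " + r2.getLong(1));
        c.close();
    }
}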

      ------------
Here is the listing of the Hive table's warehouse directory in HDFS after both imports:

      Found 8 items
-rw-r--r--   3 kernst mposse 12114 2011-06-27 18:23 /user/hive/warehouse/campaign/part-m-00000.gz
-rw-r--r--   3 kernst mposse  5405 2011-06-27 18:24 /user/hive/warehouse/campaign/part-m-00000.gz_copy_1
-rw-r--r--   3 kernst mposse 13330 2011-06-27 18:23 /user/hive/warehouse/campaign/part-m-00001.gz
-rw-r--r--   3 kernst mposse  5480 2011-06-27 18:24 /user/hive/warehouse/campaign/part-m-00001.gz_copy_1
-rw-r--r--   3 kernst mposse 14337 2011-06-27 18:23 /user/hive/warehouse/campaign/part-m-00002.gz
-rw-r--r--   3 kernst mposse  5377 2011-06-27 18:24 /user/hive/warehouse/campaign/part-m-00002.gz_copy_1
-rw-r--r--   3 kernst mposse 14118 2011-06-27 18:23 /user/hive/warehouse/campaign/part-m-00003.gz
-rw-r--r--   3 kernst mposse  5362 2011-06-27 18:24 /user/hive/warehouse/campaign/part-m-00003.gz_copy_1
      --------------
      The above works (data from the second import shows up) if --compress is not used.
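
Note that the second import does land data in HDFS: the four *_copy_1 files above are non-empty. The problem appears to be reading them back. On a name collision during LOAD DATA, Hive renames the incoming file by appending _copy_N after the whole name, so the copies end in ".gz_copy_1" rather than ".gz", and Hadoop picks a decompression codec purely by file extension. A minimal sketch of that lookup (my reading of the symptom, not a confirmed root cause):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookup {
    public static void main(String[] args) {
        CompressionCodecFactory f = new CompressionCodecFactory(new Configuration());
        // ".gz" matches GzipCodec, so the first-import files decompress fine:
        System.out.println(f.getCodec(
                new Path("/user/hive/warehouse/campaign/part-m-00000.gz")));
        // The renamed copy no longer ends in ".gz": no codec matches,
        // getCodec returns null, and the gzip bytes are read as plain text:
        System.out.println(f.getCodec(
                new Path("/user/hive/warehouse/campaign/part-m-00000.gz_copy_1")));
    }
}

If that is what is happening, renaming the copies so ".gz" is again the final suffix should make the second import's rows visible. A hypothetical workaround sketch (the class and loop are mine, not part of Sqoop or Hive):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FixCopySuffix {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/user/hive/warehouse/campaign");
        for (FileStatus st : fs.listStatus(dir)) {
            String name = st.getPath().getName();     // e.g. part-m-00000.gz_copy_1
            int i = name.indexOf(".gz_copy_");
            if (i >= 0) {
                // Move "_copy_N" in front of the extension:
                String fixed = name.substring(0, i) + name.substring(i + 3) + ".gz";
                fs.rename(st.getPath(), new Path(dir, fixed)); // part-m-00000_copy_1.gz
            }
        }
    }
}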


People

• Assignee: Unassigned
• Reporter: Ken Ernst (ken.ernst)
• Votes: 0
• Watchers: 2
