Uploaded image for project: 'CDH (READ-ONLY)'
  1. CDH (READ-ONLY)
  2. DISTRO-608

UNION ALL in hive returns incorrect results on Hbase backed table

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: CDH 5.0.2
    • Fix Version/s: None
    • Component/s: Hive
    • Labels:
      None

      Description

      The issue can be recreated with following steps
      1) In hbase
      create 'TABLE_EMP','default'
      2) On hive
      sudo -u hive hive
      CREATE EXTERNAL TABLE TABLE_EMP(FIRST_NAME string,LAST_NAME string,CDS_UPDATED_DATE string,CDS_PK string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES("hbase.columns.mapping" = "default:FIRST_NAME,default:LAST_NAME,default:CDS_UPDATED_DATE,:key", "hbase.scan.cache" = "500", "hbase.scan.cacheblocks" = "false" ) TBLPROPERTIES("hbase.table.name" = "TABLE_EMP",'serialization.null.format'='');
      3) On hbase insert the following data
      put 'TABLE_EMP', '1', 'default:FIRST_NAME', 'Srini'
      put 'TABLE_EMP', '1', 'default:LAST_NAME', 'P'
      put 'TABLE_EMP', '1', 'default:CDS_UPDATED_DATE', '2014-06-16 00:00:00'
      put 'TABLE_EMP', '2', 'default:FIRST_NAME', 'Aravind'
      put 'TABLE_EMP', '2', 'default:LAST_NAME', 'K'
      put 'TABLE_EMP', '2', 'default:CDS_UPDATED_DATE', '2014-06-16 00:00:00'
      4) On hive execute the following query
      hive
      SELECT *
      FROM (
      SELECT CDS_PK
      FROM TABLE_EMP
      WHERE
      CDS_PK >= '0'
      AND CDS_PK <= '9'
      AND CDS_UPDATED_DATE IS NOT NULL
      UNION ALL SELECT CDS_PK
      FROM TABLE_EMP
      WHERE
      CDS_PK >= 'a'
      AND CDS_PK <= 'z'
      AND CDS_UPDATED_DATE IS NOT NULL
      )t ;
      5) Output of the query
      1
      1
      2
      2
      6) Output of just
      SELECT CDS_PK
      FROM TABLE_EMP
      WHERE
      CDS_PK >= '0'
      AND CDS_PK <= '9'
      AND CDS_UPDATED_DATE IS NOT NULL
      is
      1
      2
      7) Output of just
      SELECT CDS_PK
      FROM TABLE_EMP
      WHERE
      CDS_PK >= 'a'
      AND CDS_PK <= 'z'
      AND CDS_UPDATED_DATE IS NOT NULL
      Empty
      8) UNION is used to combine the result from multiple SELECT statements into a single result set. Hive currently only supports UNION ALL (bag union), in which duplicates are not eliminated
      Accordingly above query should return output
      1
      2
      instead it is giving wrong output
      1
      1
      2
      2

      Current work-around for the problem is to use DISTINCT in one of the sub queries.
      Have created an ASF upstream hive jira HIVE-7248 for the same.

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              mkempanna Mala Chikka Kempanna
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated: