[KITE-195] PartitionStrategy names must not duplicate columns in Hive - Cloudera Open Source

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.8.0
Fix Version/s: 0.12.0
Component/s: Data Module
Labels:
None

Description

While running the dataset example in a presentation, I ran into a problem with partitioning and Hive. Hive exposes the values from partition folder names as a separate column (by default, at least) and throws an error when a data column has the same name. This hasn't come up with the other examples because their partitioning creates new columns (e.g., a year partition derived from timestamp, so the names differ), but it happens when you change the CreateUserDatasetGenericPartitioned repository to use HCatalog, because the hash partition on username duplicates the username column.

It is nice to make partition information available as columns in Hive, so partitioning automatically gives users access to year, month, day, etc. rather than needing to transform the timestamp again. So I think the solution is to check whether the partition name conflicts with an existing column and, if it does, create a new name from the type of partitioner and the given name. Throwing an error to prevent the duplicate-name configuration isn't an option: PartitionStrategy works for any entity type and schema, so by the time the conflict is caught, the strategy is already set.
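The conflict check described above could be sketched as follows. This is only an illustration: `PartitionNames` and `resolve` are hypothetical names, not part of Kite, and the set of column names is assumed to have been read from the dataset schema elsewhere.

```java
import java.util.Set;

// Hypothetical helper for illustration only; not part of the Kite API.
final class PartitionNames {
  private PartitionNames() {}

  // Returns the partition name unchanged when it does not clash with a data
  // column; otherwise derives a new name from the given name and the
  // partitioner type (e.g. "username" + "hash" gives "username-hash").
  // This sketch does not handle the case where the derived name also clashes.
  static String resolve(String partitionName,
                        String partitionerType,
                        Set<String> dataColumns) {
    if (!dataColumns.contains(partitionName)) {
      return partitionName;
    }
    return partitionName + "-" + partitionerType;
  }
}
```

With data columns `username` and `email`, a hash partition named `username` would be renamed, while a `year` partition derived from a timestamp would keep its name.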

This requires adding a second name method to FieldPartitioner, longName, that formats the name with the field's function. For example, HashPartitioner#longName would return <column-name>-hash. Then methods that embed the name are free to use the most appropriate name.
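A minimal sketch of what the two name methods might look like. The classes below are simplified stand-ins that borrow the names used in this issue; they are not the real Kite classes, and the constructor signature and the `getName` accessor are assumptions.

```java
// Simplified stand-in for Kite's FieldPartitioner; not the real class.
abstract class FieldPartitioner {
  private final String name;

  protected FieldPartitioner(String name) {
    this.name = name;
  }

  // The name as configured by the user, e.g. "username".
  String getName() {
    return name;
  }

  // The configured name combined with the partitioner's function.
  abstract String longName();
}

// Simplified stand-in: a real hash partitioner would also take a bucket
// count and implement the hashing, both omitted here.
final class HashPartitioner extends FieldPartitioner {
  HashPartitioner(String name) {
    super(name);
  }

  @Override
  String longName() {
    return getName() + "-hash";
  }
}
```

Code that embeds a partition name, such as the Hive table definition, could then use `getName()` when it is free and fall back to `longName()` when it collides with a data column.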

People

Assignee:

Ryan Blue

Reporter:

Ryan Blue

Votes:

0

Watchers:

0

Dates

Created:

19/Oct/13 1:02 AM

Updated:

06/Mar/14 11:06 PM

Resolved:

06/Mar/14 11:06 PM