Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.8.0
Fix Version/s: 0.12.0
Component/s: Data Module
Labels: None
Description
While running the dataset example in a presentation, I ran into a problem with partitioning and Hive. Hive exposes the values encoded in partition folder names as a separate column (by default, at least) and throws an error when a data column has the same name. This hasn't come up with the other examples because their partitioning creates new column names (e.g., year is derived from timestamp, so the two names differ). It does happen when you change the CreateUserDatasetGenericPartitioned repository to use HCatalog, because the hash partition on username has the same name as the username column.
It is nice to make partition information available as columns in Hive, so that partitioning automatically gives users access to year, month, day, etc., rather than requiring them to transform the timestamp again. So I think the solution is to check whether the partition name conflicts with an existing column and, if it does, create a new name from the type of partitioner and the given name. Throwing an error to prevent the duplicate-name configuration isn't an option, because PartitionStrategy works for any entity type and schema, so by the time the conflict could be detected, the strategy is already set.
This requires adding a second name method to FieldPartitioner, longName, that combines the name with the partitioner's function. For example, HashPartitioner#longName would return <column-name>-hash. Methods that embed the name are then free to use whichever name is most appropriate.
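A minimal sketch of the proposal, to make the naming rule concrete. The classes below are simplified, hypothetical stand-ins rather than the real FieldPartitioner and HashPartitioner. The helper partitionColumnName and the sample column names are assumptions introduced only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class PartitionNames {

  /** Hypothetical, simplified stand-in for FieldPartitioner (not the real class). */
  static abstract class FieldPartitioner {
    private final String name;

    FieldPartitioner(String name) {
      this.name = name;
    }

    /** The configured partition name, e.g. "username". */
    String getName() {
      return name;
    }

    /** The name qualified by the partitioner's function, e.g. "username-hash". */
    abstract String longName();
  }

  /** Hypothetical stand-in for HashPartitioner. */
  static class HashPartitioner extends FieldPartitioner {
    HashPartitioner(String name) {
      super(name);
    }

    @Override
    String longName() {
      return getName() + "-hash";
    }
  }

  /**
   * Assumed helper: picks the column name to expose for a partition.
   * Uses the short name unless it collides with a data column,
   * in which case it falls back to longName().
   */
  static String partitionColumnName(FieldPartitioner fp, Set<String> schemaColumns) {
    return schemaColumns.contains(fp.getName()) ? fp.longName() : fp.getName();
  }

  public static void main(String[] args) {
    Set<String> columns = new HashSet<>(Arrays.asList("username", "email"));
    // Conflicts with the "username" data column, so the long name is used.
    System.out.println(partitionColumnName(new HashPartitioner("username"), columns));
    // No conflict, so the short name is kept.
    System.out.println(partitionColumnName(new HashPartitioner("bucket"), columns));
  }
}
```

Resolving the name at the point where it is embedded (rather than rejecting the strategy) matches the constraint above: the strategy is schema-independent, so the conflict is only visible once a schema is known.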