Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.1.0
Fix Version/s: None
Component/s: Command-line Interface
Labels: None
Environment: Any.
Description
CSV files tend to have empty strings like so:
val1,,val3
Here, the two adjacent commas become an empty string when the record lands in Avro. Of course, two commas next to each other can be interpreted as either null or an empty string. The case for null is even stronger when the empty string is typically written as ,"", because ,, could then safely be considered NULL.
The Kite CLI's CSV import should allow the user to define what null means for the CSV files. Often, an empty string should be treated as null. In other circumstances, the providers of the CSV file may agree to have \N represent null fields, so that empty strings and nulls can be told apart.
Especially since Hive allows TBLPROPERTIES to determine what represents a null field, I think it's important that, when importing to Avro or Parquet specifically, you can define up front what it means to be null (since those formats support nullable fields).
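
To make the request concrete, here is a minimal Python sketch of the behavior I have in mind. It is not Kite code and does not reflect any existing Kite option; `read_csv_with_nulls` and `null_token` are hypothetical names used only for illustration. The idea is that the user supplies the string that stands for null, and any field equal to it is loaded as null instead of as a string.

```python
import csv
import io


def read_csv_with_nulls(text, null_token=""):
    """Parse CSV text, mapping any field equal to null_token to None.

    Hypothetical helper: null_token stands in for a user-supplied
    import setting, which Kite does not necessarily provide.
    """
    reader = csv.reader(io.StringIO(text))
    return [
        [None if field == null_token else field for field in row]
        for row in reader
    ]


# Empty string treated as null (the ",," case from the description).
rows_empty_as_null = read_csv_with_nulls("val1,,val3\n", null_token="")

# \N treated as null, so a genuinely empty field stays an empty string.
rows_backslash_n = read_csv_with_nulls("val1,\\N,\n", null_token="\\N")

print(rows_empty_as_null)
print(rows_backslash_n)
```

One limitation of this sketch: with its default settings, Python's `csv.reader` does not report whether a field was quoted, so ,, and ,"", both arrive as an empty string. Distinguishing them would need a parser that preserves quoting information, which is one reason an explicit token such as \N may be the simpler convention.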