[KITE-768] csv-import should use headers to define fields - Cloudera Open Source

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 0.17.0
Fix Version/s: None
Component/s: None
Labels: None

Description

Currently, csv-import assumes that every field is present in the CSV and appears in the same order as in the dataset's schema. It checks that the headers match the dataset, but if a field is missing or out of order, the import fails.

For example, create:
test2.csv:

Id,Value2
1,value!

test2.avsc:

{
  "type" : "record",
  "name" : "Test",
  "namespace" : "com.cloudera",
  "doc" : "Schema generated by Kite",
  "fields" : [ {
    "name" : "Id",
    "type" : "long"
  }, {
    "name" : "Value",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "Value2",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}

Then run:

$ ./kite-dataset create test_incomplete_csv -s test2.avsc 
$ ./kite-dataset csv-import test2.csv test_incomplete_csv
Argument error: Incompatible schema field order
[prints schemas]

It should be able to figure out that the second column corresponds to Value2 if the header matches the dataset definition.
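
The requested behavior can be illustrated with a minimal Python sketch. This is not Kite's implementation; the function name read_by_header and the (name, required) representation of schema fields are hypothetical, chosen only for this example. It assumes that optional fields missing from the CSV should be filled with null, and it leaves values as strings rather than converting them to the schema's types.

```python
import csv
import io


def read_by_header(csv_text, schema_fields):
    """Map CSV columns to schema fields by header name, not by position.

    schema_fields is a list of (name, required) pairs in the order the
    dataset declares them (a simplification of the Avro schema above).
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    names = [name for name, _ in schema_fields]

    # A column the dataset does not know about is an error.
    unknown = [h for h in header if h not in names]
    if unknown:
        raise ValueError(f"CSV columns not in schema: {unknown}")

    # A required field with no matching column is an error;
    # optional fields may be absent.
    missing = [n for n, required in schema_fields
               if required and n not in header]
    if missing:
        raise ValueError(f"Required fields missing from CSV: {missing}")

    # Column position of each header name.
    index = {h: i for i, h in enumerate(header)}

    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Emit fields in schema order; absent optional fields become None.
        records.append(
            {n: (row[index[n]] if n in index else None) for n in names}
        )
    return records


fields = [("Id", True), ("Value", False), ("Value2", False)]
print(read_by_header("Id,Value2\n1,value!\n", fields))
```

With the test2.csv contents above, the second column is assigned to Value2 and Value is left as None. A file with the same columns in a different order would produce the same record.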

Issue Links

depends on
KITE-797 Read CSV fields by header, if present (Resolved)
KITE-800 Improve CSV schema validation (Resolved)

People

Assignee: Ryan Blue
Reporter: Alan Jackoway
Votes: 0
Watchers: 3

Dates

Created: 06/Nov/14 5:32 PM
Updated: 09/Dec/14 1:23 PM
Resolved: 09/Dec/14 1:23 PM