  1. Kite SDK (READ-ONLY)
  2. KITE-1046

Infer Schema from N records for json-schema command


    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.2.0
    • Component/s: Command-line Interface
    • Labels:
    • Environment:
      RHEL 6.2, CDH 5.1.4


      There is a great blog post that goes into more depth on ingest tips for JSON files.


      In there, we see this statement:

      "Kite creates a schema for each of the first 20 records, then merges them to produce an overall result."

      Looking at the code, it seems that only the first 10 records are sampled (the third argument to the inferSchema call):

          Schema sampleSchema = JsonUtil.inferSchema(
              open(samplePaths.get(0)), recordName, 10);

      Regardless, I think this should be a value the user can supply, especially for JSON. JSON files in particular can vary a lot in which fields end up being nullable: a field may appear in all of the first 10 records on which the schema is defined, yet it is IMHO highly likely that later records will not have that field at all.

      I suggest adding a new option like so:


      In the context of json-schema, it seems reasonable to me that num-records would mean the number of records used to determine the schema. Let me know if this is doable!
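
      To illustrate why the sample size matters, here is a minimal, self-contained sketch. It does not use Kite's JsonUtil or reproduce its merge logic; the class SampleSizeSketch, the methods inferNullability and sampleRecords, and the "email" field are all hypothetical names made up for the example. Records are modeled as plain maps, and a field is treated as nullable when at least one sampled record lacks it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not Kite code and not Kite's inference algorithm.
public class SampleSizeSketch {

    /**
     * Looks at the first numRecords records and returns field name -> nullable.
     * A field counts as nullable if any sampled record is missing it or maps it to null.
     */
    public static Map<String, Boolean> inferNullability(
            List<Map<String, Object>> records, int numRecords) {
        int limit = Math.min(numRecords, records.size());
        // Number of sampled records in which each field had a non-null value.
        Map<String, Integer> seen = new LinkedHashMap<>();
        for (int i = 0; i < limit; i++) {
            for (Map.Entry<String, Object> e : records.get(i).entrySet()) {
                if (e.getValue() != null) {
                    seen.merge(e.getKey(), 1, Integer::sum);
                } else {
                    seen.putIfAbsent(e.getKey(), 0);
                }
            }
        }
        Map<String, Boolean> nullable = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : seen.entrySet()) {
            nullable.put(e.getKey(), e.getValue() < limit);
        }
        return nullable;
    }

    /** Made-up data: 15 records, where only the first 10 carry an "email" field. */
    public static List<Map<String, Object>> sampleRecords() {
        List<Map<String, Object>> records = new ArrayList<>();
        for (int i = 0; i < 15; i++) {
            Map<String, Object> record = new LinkedHashMap<>();
            record.put("id", i);
            if (i < 10) {
                record.put("email", "user" + i + "@example.com");
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> records = sampleRecords();
        // Sampling 10 records: "email" looks required.
        System.out.println(inferNullability(records, 10)); // {id=false, email=false}
        // Sampling all 15 records: "email" is correctly seen as nullable.
        System.out.println(inferNullability(records, 15)); // {id=false, email=true}
    }
}
```

      With a fixed sample of 10, the schema would declare "email" as required and the later records would not match it, which is the case a user-supplied record count is meant to cover.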


              • Assignee:
                mladkov Mladen Kovacevic
              • Votes:
                0
              • Watchers:
                2


                • Created: