Kite SDK (READ-ONLY) / KITE-1046

Infer Schema from N records for json-schema command

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.2.0
    • Component/s: Command-line Interface
    • Labels: None
    • Environment: RHEL 6.2, CDH 5.1.4

      Description

      We have this great blog post that goes into more depth on ingest tips for JSON files:

      http://ingest.tips/2015/02/23/kite-adds-json-support/

      In there, we see this statement:

      "Kite creates a schema for each of the first 20 records, then merges them to produce an overall result."

      Looking at the code, however, it seems that only the first 10 records are sampled (the third argument to JsonUtil.inferSchema):

          Schema sampleSchema = JsonUtil.inferSchema(
              open(samplePaths.get(0)), recordName, 10);
      

      Regardless, I think this should be a value the user can supply, especially for JSON. JSON files in particular can vary a lot in which fields end up being nullable: a field may appear in each of the first 10 records used to define the schema, yet IMHO it is highly likely that some later records will not have that field at all.
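To illustrate the concern, here is a minimal, self-contained sketch. It does not use Kite's JsonUtil; the class name SampleSizeSketch, the helper nullableFields, and the sample data are all hypothetical. Records are modeled as plain maps, and a field counts as nullable when it is missing from at least one sampled record.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; this is not Kite's JsonUtil implementation.
class SampleSizeSketch {

  // Returns the fields that are absent from at least one of the first
  // numRecords records, i.e. the fields a merged schema would have to
  // mark as nullable.
  static Set<String> nullableFields(List<Map<String, Object>> records, int numRecords) {
    int limit = Math.min(numRecords, records.size());

    // First pass: collect every field name seen in the sample.
    Set<String> seen = new TreeSet<>();
    for (int i = 0; i < limit; i++) {
      seen.addAll(records.get(i).keySet());
    }

    // Second pass: a field is nullable if any sampled record lacks it.
    Set<String> nullable = new TreeSet<>();
    for (int i = 0; i < limit; i++) {
      for (String field : seen) {
        if (!records.get(i).containsKey(field)) {
          nullable.add(field);
        }
      }
    }
    return nullable;
  }

  // Builds 25 made-up records: "id" is always present, while "email" is
  // missing from the 15th record onward.
  static List<Map<String, Object>> sampleRecords() {
    List<Map<String, Object>> records = new ArrayList<>();
    for (int i = 0; i < 25; i++) {
      Map<String, Object> record = new LinkedHashMap<>();
      record.put("id", i);
      if (i < 14) {
        record.put("email", "user" + i + "@example.com");
      }
      records.add(record);
    }
    return records;
  }

  public static void main(String[] args) {
    List<Map<String, Object>> records = sampleRecords();
    System.out.println("10 records: nullable = " + nullableFields(records, 10));
    System.out.println("20 records: nullable = " + nullableFields(records, 20));
  }
}
```

With this made-up data, a sample of 10 records reports no nullable fields, while a sample of 20 reports email as nullable. A schema built from the smaller sample would therefore not account for the later records that lack that field.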

      I suggest adding a new option, like so:

      --num-records
      

      In the context of json-schema, it seems reasonable for num-records to mean the number of records used to determine the schema. Let me know if this is doable!
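As a rough sketch of the requested behavior, the option could fall back to the currently hard-coded 10 when it is not supplied. This is not how Kite's CLI actually parses options; the class NumRecordsOptionSketch and the method parseNumRecords are hypothetical names, and only the flag name --num-records and the default of 10 come from this issue.

```java
// Hypothetical sketch; Kite's CLI does not necessarily parse options this way.
class NumRecordsOptionSketch {

  // Default taken from the hard-coded value quoted in this issue.
  static final int DEFAULT_NUM_RECORDS = 10;

  // Returns the value following "--num-records", or the default if the
  // option is absent.
  static int parseNumRecords(String[] args) {
    for (int i = 0; i < args.length; i++) {
      if ("--num-records".equals(args[i])) {
        if (i + 1 >= args.length) {
          throw new IllegalArgumentException("--num-records requires a value");
        }
        int value = Integer.parseInt(args[i + 1]);
        if (value < 1) {
          throw new IllegalArgumentException("--num-records must be at least 1");
        }
        return value;
      }
    }
    return DEFAULT_NUM_RECORDS;
  }

  public static void main(String[] args) {
    int numRecords = parseNumRecords(args);
    // In the real command, this value would take the place of the literal
    // 10 passed as the third argument to JsonUtil.inferSchema.
    System.out.println("Would infer schema from " + numRecords + " records");
  }
}
```

Keeping 10 as the default would leave existing invocations unchanged, while users with highly variable JSON could pass a larger sample size.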

              People

              • Assignee: mladkov Mladen Kovacevic
              • Reporter: mladkov Mladen Kovacevic
              • Votes: 0
              • Watchers: 2