Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.17.0
    • Fix Version/s: 0.17.1
    • Component/s: None
    • Labels:
      None

      Description

      Currently, a NumberFormatException when coercing CSV data into a numeric type will result in a null value. This is to catch the case where the empty string is passed and should be considered null rather than a value.

      The unintended consequence is that any invalid string will result in a null value. If the field type is Long and a decimal value is present, this will also result in a null that could be replaced by the default value for the field. That would mean that the import process would silently drop data that appears valid and replace it with the default. Instead, this case should throw an exception.

        Attachments

          Issue Links

            Activity

            Hide
            blue Ryan Blue added a comment -

            Fixed in PR #271. Because this is a breaking change, I've added a descriptor property, kite.csv.throw-number-format-exceptions that defaults to false. In a later release we should change the default to true.

            Show
            blue Ryan Blue added a comment - Fixed in PR #271 . Because this is a breaking change, I've added a descriptor property, kite.csv.throw-number-format-exceptions that defaults to false. In a later release we should change the default to true.
            Hide
            tom Tom White added a comment -

            I'm not sure we need to have a descriptor property for this. How about treating the empty string as null, but throw a NumberFormatException for invalid non-empty strings? The current behaviour is arguably wrong since it silently drops data, as you point out.

            Show
            tom Tom White added a comment - I'm not sure we need to have a descriptor property for this. How about treating the empty string as null, but throw a NumberFormatException for invalid non-empty strings? The current behaviour is arguably wrong since it silently drops data, as you point out.
            Hide
            blue Ryan Blue added a comment -

            The behavior you describe is what happens with the property set. The logic when a NumberFormatException is caught is to return null if the string is empty or if the property is set for compatibility. I'm fine removing the property, do you think we would want to wait until 0.18.0?

            Show
            blue Ryan Blue added a comment - The behavior you describe is what happens with the property set. The logic when a NumberFormatException is caught is to return null if the string is empty or if the property is set for compatibility. I'm fine removing the property, do you think we would want to wait until 0.18.0?
            Hide
            tom Tom White added a comment -

            We could put this in 0.17.1 as you have it now, then for 0.18.0 remove the property entirely.

            Show
            tom Tom White added a comment - We could put this in 0.17.1 as you have it now, then for 0.18.0 remove the property entirely.
            Hide
            blue Ryan Blue added a comment -

            Sounds good to me. I'll merge it with the property and add some release notes.

            Show
            blue Ryan Blue added a comment - Sounds good to me. I'll merge it with the property and add some release notes.
            Hide
            blue Ryan Blue added a comment - - edited

            I updated the pull request with the other changes that have been merged. It is ready to commit if we want to, but I realized that there's no way to set the descriptor property when running csv-import, which is the primary use case for this.

            I think that we should commit the change without the property. There are two cases where a user might trigger the behavior we're trying to avoid changing. The first case is a schema with a long when there is float data in the file. In that case, the user would need to explicitly disable schema checking to avoid an exception. Then, the user would get nulls or the default when importing the data. In this case, I think getting a null or the default is actually a bug, so it's important to get this fixed because without the schema check there isn't anything that will tell the user this is happening. Given that the user had to explicitly ignore a valid error before, I don't think this behavior needs to be preserved.

            The second case is when a user has mixed data in a column, sometimes numeric and sometimes invalid, like "NaN". For this case, I think the new behavior is still more correct. If the data has strings like "NaN", then it is still incorrect to return null or a default. If the string is simply mixed data (e.g., 1, 2, 3, "airplane", 5), then rejecting it is definitely the right thing to do.

            My update removes the property entirely so the default behavior is to throw NumberFormatException. I'd definitely like to hear other opinions on whether this should be merged now or in 0.18.0.

            Show
            blue Ryan Blue added a comment - - edited I updated the pull request with the other changes that have been merged. It is ready to commit if we want to, but I realized that there's no way to set the descriptor property when running csv-import, which is the primary use case for this. I think that we should commit the change without the property. There are two cases where a user might trigger the behavior we're trying to avoid changing. The first case is a schema with a long when there is float data in the file. In that case, the user would need to explicitly disable schema checking to avoid an exception. Then, the user would get nulls or the default when importing the data. In this case, I think getting a null or the default is actually a bug, so it's important to get this fixed because without the schema check there isn't anything that will tell the user this is happening. Given that the user had to explicitly ignore a valid error before, I don't think this behavior needs to be preserved. The second case is when a user has mixed data in a column, sometimes numeric and sometimes invalid, like "NaN". For this case, I think the new behavior is still more correct. If the data has strings like "NaN", then it is still incorrect to return null or a default. If the string is simply mixed data (e.g., 1, 2, 3, "airplane", 5), then rejecting it is definitely the right thing to do. My update removes the property entirely so the default behavior is to throw NumberFormatException. I'd definitely like to hear other opinions on whether this should be merged now or in 0.18.0.
            Hide
            tom Tom White added a comment -

            I agree with this analysis. The patch preserves the behaviour where empty or missing strings are interpreted as nulls; the difference is that invalid numbers now throw an exception.

            Show
            tom Tom White added a comment - I agree with this analysis. The patch preserves the behaviour where empty or missing strings are interpreted as nulls; the difference is that invalid numbers now throw an exception.
            Show
            tom Tom White added a comment - Fixed in https://github.com/kite-sdk/kite/commit/dd9ee1b038186e2cd2ac2b1de858a2c2aea0d016

              People

              • Assignee:
                blue Ryan Blue
                Reporter:
                blue Ryan Blue
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: