Spark / SPARK-18076

Fix default Locale used in DateFormat, NumberFormat to Locale.US


Details

    Description

      Many parts of the code use DateFormat and NumberFormat instances. Although the behavior of these formats is mostly determined by things like format strings, the exact behavior can vary with the platform's default locale. The locale defaults to "en", but it can be set to something else through environment variables. When it is, the same code can succeed or fail based on the locale alone:

      import java.text._
      import java.util._
      
      def parse(s: String, l: Locale) = new SimpleDateFormat("yyyyMMMdd", l).parse(s)
      
      parse("1989Dec31", Locale.US)
      Sun Dec 31 00:00:00 GMT 1989
      
      parse("1989Dec31", Locale.UK)
      Sun Dec 31 00:00:00 GMT 1989
      
      parse("1989Dec31", Locale.CHINA)
      java.text.ParseException: Unparseable date: "1989Dec31"
        at java.text.DateFormat.parse(DateFormat.java:366)
        at .parse(<console>:18)
        ... 32 elided
      
      parse("1989Dec31", Locale.GERMANY)
      java.text.ParseException: Unparseable date: "1989Dec31"
        at java.text.DateFormat.parse(DateFormat.java:366)
        at .parse(<console>:18)
        ... 32 elided
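
      The transcript above only exercises DateFormat. A minimal sketch of the same effect for NumberFormat follows; the helper name render and the sample value are illustrative assumptions, not code from Spark.

```scala
import java.text.NumberFormat
import java.util.Locale

// Hypothetical helper: format the same value under an explicit locale.
def render(x: Double, l: Locale): String = NumberFormat.getInstance(l).format(x)

val us = render(1234.5, Locale.US)       // "1,234.5"
val de = render(1234.5, Locale.GERMANY)  // "1.234,5"

// Text produced under one locale does not read back as the same value under another.
val reparsed = NumberFormat.getInstance(Locale.US).parse(de).doubleValue()
```

      The grouping and decimal separators swap between the two locales, so a string written under one locale is silently misread under the other rather than rejected.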
      

      Where not otherwise specified, I believe all instances in the code should default to a fixed value, and that should probably be Locale.US. This matches the JVM's default, and it specifies both language ("en") and region ("US") to remove ambiguity. It also most closely matches the current behavior (unless the default locale was changed), because the code currently defaults to "en".
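
      As a rough sketch of the proposal, assuming the fix simply passes Locale.US to each formatter: the helpers parseDefault and parseFixed below are hypothetical names, and Locale.setDefault is used only to simulate a JVM started under a German locale.

```scala
import java.text.SimpleDateFormat
import java.util.Locale
import scala.util.Try

// Hypothetical helpers: one relies on the default locale, the other pins Locale.US.
def parseDefault(s: String) = Try(new SimpleDateFormat("yyyyMMMdd").parse(s))
def parseFixed(s: String)   = Try(new SimpleDateFormat("yyyyMMMdd", Locale.US).parse(s))

val saved = Locale.getDefault
val (withDefault, withFixed) =
  try {
    Locale.setDefault(Locale.GERMANY)  // simulate a non-English default locale
    (parseDefault("1989Dec31"), parseFixed("1989Dec31"))
  } finally Locale.setDefault(saved)   // always restore the original default
```

      Under these assumptions withDefault is a Failure wrapping the ParseException, while withFixed succeeds regardless of the default locale.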

      This affects SQL date/time functions. At the moment, the only SQL function that lets the user specify language/country is "sentences", which is consistent with Hive.

      It affects dates passed in the JSON API.

      It potentially affects some strings rendered in the UI. Although this isn't a correctness issue, there may be an argument for not letting that vary.

      It affects a bunch of instances where dates are formatted into strings for things like IDs or file names, which is far less likely to cause a problem, but worth making consistent.
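
      To illustrate the ID and file name case, here is a small sketch. The helper stamp, the pattern, and the Thai locale tag are assumptions chosen for illustration; they are not taken from the Spark code.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale, TimeZone}

// Hypothetical helper: render a timestamp as an ID-like string under a given locale.
def stamp(d: Date, l: Locale): String = {
  val fmt = new SimpleDateFormat("yyyyMMddHHmmss", l)
  fmt.setTimeZone(TimeZone.getTimeZone("UTC"))  // pin the zone so only the locale varies
  fmt.format(d)
}

val epoch = new Date(0L)
val fixedId = stamp(epoch, Locale.US)  // "19700101000000"
// Illustrative non-US locale; it may select a different calendar and digit set.
val otherId = stamp(epoch, Locale.forLanguageTag("th-TH-u-nu-thai"))
```

      Even a purely numeric pattern can produce a different string under another locale, which is why pinning the locale keeps generated names consistent.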

      The other occurrences are in tests.

      The downside to this change is also its upside: the behavior no longer depends on the default JVM locale, but it also can't be adjusted through the default JVM locale. For example, if you wanted to parse some dates in a way that depended on a non-US locale (not just the format string), that would no longer be possible. There's no means of specifying this, for example, in SQL functions for parsing dates. However, controlling this by globally changing the locale isn't exactly great either.
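
      For comparison, plain JVM code can still opt into a non-US locale by naming it explicitly, which is the capability the SQL functions lack. The sketch below assumes a hypothetical helper parseWith and a German sample string.

```scala
import java.text.SimpleDateFormat
import java.util.Locale
import scala.util.Try

// Hypothetical helper: parse with an explicitly chosen locale instead of the default.
def parseWith(s: String, l: Locale) = Try(new SimpleDateFormat("dd MMMM yyyy", l).parse(s))

val german  = parseWith("31 Dezember 1989", Locale.GERMANY)  // month name matches this locale
val fixedUs = parseWith("31 Dezember 1989", Locale.US)       // "Dezember" is not an English month
```

      Passing the locale per call keeps the choice local to the code that needs it, rather than changing the default for the whole JVM.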

      The purpose of this change is to make the current default behavior deterministic and fixed. PR coming.

      CC hyukjin.kwon

          People

            srowen Sean R. Owen