Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-25206

wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

Rank to TopRank to BottomAttach filesAttach ScreenshotBulk Copy AttachmentsBulk Move AttachmentsVotersWatch issueWatchersCreate sub-taskConvert to sub-taskLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      In current Spark 2.3.1, below query returns wrong data silently.

      spark.range(10).write.parquet("/tmp/data")
      sql("DROP TABLE t")
      sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
      
      scala> sql("select * from t where id > 0").show
      +---+
      | ID|
      +---+
      +---+
      
      

       

      Root Cause

      After deep dive, it has two issues, both are related to different letter cases between Hive metastore schema and parquet schema.

      1. Wrong column is pushdown.

      Spark pushdowns FilterApi.gt(intColumn("ID"), 0: Integer) into parquet, but ID does not exist in /tmp/data (parquet is case sensitive, it has id actually).
      So no records are returned.

      Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema to do the pushdown, perfect for this issue.

      2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even spark.sql.caseSensitive set to false.

      SPARK-25132 addressed this issue already.

       

      The biggest difference is, in Spark 2.1, user will get Exception for the same query:

      Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!

      So they will know the issue and fix the query.

      But in Spark 2.3, user will get the wrong results sliently.

       

      To make the above query work, we need both SPARK-25132 and SPARK-24716.

       

      Yuming WangWenchen FanXiao Li, any thoughts? Should we backport it?

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            yucai yucai
            Votes:
            0 Vote for this issue
            Watchers:
            11 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment