Spark / SPARK-22505

toDF() / createDataFrame() type inference doesn't work as expected


Description

      df = sc.parallelize([('1','a'),('2','b'),('3','c')]).toDF(['should_be_int','should_be_str'])
      df.printSchema()
      

      produces

      root
       |-- should_be_int: string (nullable = true)
       |-- should_be_str: string (nullable = true)
      

      Notice that `should_be_int` has a `string` datatype, even though, according to the documentation
      (https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection):

      Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table, and the types are inferred by sampling the whole dataset, similar to the inference that is performed on JSON files.

      Schema inference works as expected when reading delimited files like

      spark.read.format('csv').option('inferSchema', True)...
      

      but not when using toDF() / createDataFrame() API calls.

      Affects version: Spark 2.2.


            People

              Assignee: Unassigned
              Reporter: Ruslan Dautkhanov (Tagar)
              Votes: 0
              Watchers: 3
