Spark / SPARK-7178

Improve DataFrame documentation and code samples

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.3.1
    • Fix Version/s: None
    • Component/s: SQL
    • Sprint: Spark 1.5 doc/QA sprint

    Description

      Expressing AND and OR conditions is not straightforward when using the new DataFrame API.

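      The current convention, familiar to Pandas users, is to use the bitwise operators & and | instead of AND and OR. When using these, however, you need to wrap each comparison in parentheses, because & and | bind more tightly than comparison operators such as == and >, so an unparenthesized condition is grouped around the bitwise operator first.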
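The precedence problem can be checked without Spark at all. The sketch below asks Python's own parser how it groups a condition with and without parentheses. The column names `age` and `name`, the `df` variable, and the `top_level_node` helper are hypothetical and used for illustration only.

```python
import ast

# Hypothetical filter conditions; the column names `age` and `name` are
# illustrative only and do not come from the issue.
unparenthesized = "df.age > 21 & df.name == 'Bob'"
parenthesized = "(df.age > 21) & (df.name == 'Bob')"


def top_level_node(expression):
    """Return the name of the outermost AST node Python builds for the expression."""
    return type(ast.parse(expression, mode="eval").body).__name__


# Without parentheses, & binds first, so Python sees the chained comparison
# df.age > (21 & df.name) == 'Bob' rather than a combination of two conditions.
print(top_level_node(unparenthesized))  # Compare

# With parentheses, each comparison is grouped first and & combines the results.
print(top_level_node(parenthesized))  # BinOp
```

In PySpark, the parenthesized form is the one to pass to a filter, for example `df.filter((df.age > 21) & (df.name == 'Bob'))`, assuming `df` is an existing DataFrame with those columns.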

      Also, working with StructTypes is a bit confusing. The Python tab of the following section implies that you can pass tuples directly when creating a DataFrame: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema

      However, the following code errors out unless we explicitly use Rows:

      from pyspark.sql import Row
      from pyspark.sql.types import *
      
      # The schema is encoded in a string.
      schemaString = "a"
      
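      # Build one map<string,int> field per name in the schema string.
      fields = [StructField(field_name, MapType(StringType(), IntegerType())) for field_name in schemaString.split()]
      schema = StructType(fields)
      
      # sqlContext is assumed to be the SQLContext that the pyspark shell provides.
      # The record is wrapped in Row explicitly, as described above.
      df = sqlContext.createDataFrame([Row(a={'b': 1})], schema)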
      

          People

            Assignee: Unassigned
            Reporter: Chris Fregly (cfregly)
            Votes: 0
            Watchers: 3