Description
AND and OR are not straightforward when using the new DataFrame API.
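The current convention, already familiar to Pandas users, is to use the bitwise operators & and | in place of AND and OR. When using them, however, you need to wrap each comparison in parentheses, because & and | bind more tightly than comparison operators such as > and ==. Without the parentheses, the bitwise operator is applied before the comparisons on either side of it.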
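A quick way to see why the parentheses matter is to ask Python how it parses each form. The sketch below uses only the standard-library ast module; df.a and df.b are hypothetical column names that are parsed but never evaluated, so no Spark installation is needed.

```python
import ast


def top_level_node(expr: str) -> str:
    """Return the name of the outermost AST node Python builds for expr."""
    return type(ast.parse(expr, mode="eval").body).__name__


# df.a and df.b are hypothetical columns; the strings are only parsed.
unparenthesized = "df.a > 1 & df.b < 2"
parenthesized = "(df.a > 1) & (df.b < 2)"

# Without parentheses Python builds one chained comparison,
# df.a > (1 & df.b) < 2, because & binds tighter than > and <.
print(top_level_node(unparenthesized))  # Compare

# With parentheses the two comparisons are built first and then
# combined with &, which is the intended filter expression.
print(top_level_node(parenthesized))  # BinOp

# The middle operand of the chained comparison is the stray `1 & df.b`.
middle = ast.parse(unparenthesized, mode="eval").body.comparators[0]
print(type(middle).__name__)  # BinOp
```

In the unparenthesized form, the chained comparison is evaluated with an implicit `and`, so Python has to convert an intermediate column expression to a plain boolean. That is the step that goes wrong for column-like objects.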
Working with StructTypes is also a bit confusing. The Python tab of the following section implies that you can pass tuples directly when creating a DataFrame: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
However, the following code errors out unless we explicitly use Row objects:
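from pyspark.sql import Row
from pyspark.sql.types import IntegerType, MapType, StringType, StructField, StructType

# sqlContext is assumed to already exist (for example, in the PySpark shell).

# The schema is encoded in a string.
schemaString = "a"

# Each field is a map from string keys to integer values.
fields = [StructField(field_name, MapType(StringType(), IntegerType())) for field_name in schemaString.split()]
schema = StructType(fields)

# The row is wrapped in an explicit Row rather than passed as a bare tuple.
df = sqlContext.createDataFrame([Row(a={'b': 1})], schema)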