[SPARK-7178] Improve DataFrame documentation and code samples - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.3.1
Fix Version/s: None
Component/s: SQL
Labels:
- dataframe

Target Version/s: 1.5.0
Sprint: Spark 1.5 doc/QA sprint

Description

Combining conditions with AND and OR is not straightforward in the new DataFrame API.

The current convention, familiar to Pandas users, is to use the bitwise operators & and | in place of AND and OR. When using them, however, you need to wrap each sub-expression in parentheses, because in Python & and | bind more tightly than comparison operators such as == and >.
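The precedence problem can be seen without Spark at all. The sketch below uses only Python's standard `ast` module to show how the parser groups the two spellings; the names `df`, `age`, and `score` are hypothetical and are never evaluated, only parsed.

```python
import ast


def top_level_node(expr: str) -> str:
    """Return the type name of the outermost AST node of a Python expression."""
    return type(ast.parse(expr, mode="eval").body).__name__


# Hypothetical filter conditions; the strings are parsed, not executed.
unparenthesized = "df.age > 21 & df.score < 50"
parenthesized = "(df.age > 21) & (df.score < 50)"

# Without parentheses, & binds first, so Python sees the chained comparison
# df.age > (21 & df.score) < 50.
chained = ast.parse(unparenthesized, mode="eval").body
middle_is_bitand = isinstance(chained.comparators[0], ast.BinOp)

print(top_level_node(unparenthesized))  # Compare
print(middle_is_bitand)                 # True
print(top_level_node(parenthesized))    # BinOp
```

With parentheses, the outermost operation is the & itself, which is what a DataFrame filter expects. Without them, the expression is a chained comparison, which Python evaluates with an implicit `and`; on column objects that usually surfaces as an error rather than the intended condition.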

Also, working with StructTypes is a bit confusing. The Python tab of https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema implies that you can pass tuples directly when creating a DataFrame.

However, the following code errors out unless we explicitly use Row objects:

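from pyspark.sql import Row
from pyspark.sql.types import IntegerType, MapType, StringType, StructField, StructType

# The schema is encoded in a string of space-separated field names.
schemaString = "a"

# Each field is a map from string keys to integer values.
fields = [StructField(field_name, MapType(StringType(), IntegerType())) for field_name in schemaString.split()]
schema = StructType(fields)

# Works with an explicit Row; sqlContext is assumed to be the SQLContext provided by the PySpark shell.
df = sqlContext.createDataFrame([Row(a={'b': 1})], schema)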

People

Assignee: Unassigned

Reporter: Chris Fregly

Votes: 0

Watchers: 3

Dates

Created: 28/Apr/15 00:17

Updated: 19/Aug/15 05:37

Resolved: 19/Aug/15 05:37