Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 1.4.1, 1.5.2, 1.6.3, 2.1.1, 2.2.0
Description
Since SPARK-2883, Apache Spark has supported Apache ORC inside the `sql/hive` module, with a Hive dependency. This issue aims to add a new, faster ORC data source inside `sql/core` and eventually to replace the old ORC data source. This issue uses the latest Apache ORC 1.4.0 (released yesterday).
There are four key benefits.
- Speed: Uses Spark `ColumnarBatch` and ORC `RowBatch` together, which is faster than the current implementation in Spark.
- Stability: Apache ORC 1.4.0 includes many fixes, and we can rely more on the ORC community.
- Usability: Users can use `ORC` data sources without the Hive module, i.e., without `-Phive`.
- Maintainability: Reduces the Hive dependency and allows the old legacy code to be removed later.
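To illustrate the usability point, here is a minimal sketch of reading and writing ORC from a session built without Hive support. It assumes a Spark version in which the `sql/core` ORC data source is available (2.3 or later) and in which the implementation can be selected through `spark.sql.orc.impl`, the setting that came out of the related SPARK-20728 work. The application name, output path, and sample rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object OrcWithoutHiveSketch {
  def main(args: Array[String]): Unit = {
    // Note: enableHiveSupport() is deliberately not called here.
    val spark = SparkSession.builder()
      .appName("orc-native-sketch") // hypothetical name
      .master("local[*]")
      // Assumption: Spark 2.3+; "native" selects the sql/core ORC reader/writer,
      // "hive" selects the older sql/hive one.
      .config("spark.sql.orc.impl", "native")
      .getOrCreate()

    import spark.implicits._

    val path = "/tmp/orc-native-sketch" // hypothetical output location

    // Write a tiny DataFrame as ORC, then read it back.
    Seq((1, "a"), (2, "b")).toDF("id", "name")
      .write.mode("overwrite").orc(path)

    spark.read.orc(path).show()

    spark.stop()
  }
}
```

On a Spark build without `-Phive` that predates this change, the same `orc(...)` calls would be expected to fail, since ORC support lived only in `sql/hive`.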
Issue Links
- blocks
  - SPARK-20901 Feature parity for ORC with Parquet (Open)
  - SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core (Resolved)
  - SPARK-21787 Support for pushing down filters for DateType in native OrcFileFormat (Resolved)
- is blocked by
  - SPARK-21422 Depend on Apache ORC 1.4.0 (Resolved)
- is related to
  - SPARK-35274 old hive table's all columns are read when column pruning applies in spark3.0 (Open)
- supersedes
  - SPARK-19109 ORC metadata section can sometimes exceed protobuf message size limit (Resolved)
  - SPARK-21791 ORC should support column names with dot (Closed)
- links to