  1. Spark
  2. SPARK-14098

Generate Java code to build CachedColumnarBatch and get values from CachedColumnarBatch when DataFrame.cache() is called

Details

    • Type: Umbrella
    • Status: Resolved
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SQL

    Description

      Here is a design document for this change (**TODO: Update the document**).

      This JIRA implements a new in-memory cache feature used by DataFrame.cache and Dataset.cache. The following is the basic design, based on discussions with Sameer, Weichen, Xiao, Herman, and Nong.

      • Use ColumnarBatch with ColumnVector, which are the common data representations for columnar storage
      • Use multiple compression schemes (such as RLE, intdelta, and so on) for each ColumnVector in a ColumnarBatch, chosen according to its data type
      • Generate simple code, specialized for each in-memory cache, to build that cache
      • Generate code that reads data directly from ColumnVector for the in-memory cache by whole-stage codegen
      • Enhance ColumnVector to keep UnsafeArrayData
      • Use a primitive-type array for uncompressed primitive data types in ColumnVector
      • Use byte[] for UnsafeArrayData and compressed data
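As a rough illustration of the RLE scheme mentioned above, the following is a minimal sketch of run-length encoding for an int column. The class RleSketch and its (value, runLength) pair layout are assumptions made for this example only; they are not Spark's actual compression classes or on-disk format, and the ticket proposes byte[] for compressed data where this sketch uses int[] for readability.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Spark.
final class RleSketch {
    // Encodes values as (value, runLength) pairs: [v0, n0, v1, n1, ...].
    static int[] encode(int[] values) {
        int[] out = new int[values.length * 2]; // worst case: every run has length 1
        int pos = 0;
        int i = 0;
        while (i < values.length) {
            int v = values[i];
            int run = 1;
            while (i + run < values.length && values[i + run] == v) {
                run++;
            }
            out[pos++] = v;
            out[pos++] = run;
            i += run;
        }
        return Arrays.copyOf(out, pos);
    }

    // Expands the (value, runLength) pairs back into a primitive int array.
    static int[] decode(int[] encoded, int numRows) {
        int[] out = new int[numRows];
        int pos = 0;
        for (int i = 0; i < encoded.length; i += 2) {
            Arrays.fill(out, pos, pos + encoded[i + 1], encoded[i]);
            pos += encoded[i + 1];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] column = {7, 7, 7, 7, 3, 3, 9};
        int[] encoded = encode(column);
        System.out.println(Arrays.toString(encoded)); // prints [7, 4, 3, 2, 9, 1]
        System.out.println(Arrays.equals(column, decode(encoded, column.length))); // prints true
    }
}
```

RLE only pays off for columns with long runs of repeated values, which is one reason a per-column, per-type choice of scheme is useful.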

      Based on this design, this JIRA generates two kinds of Java code for DataFrame.cache()/Dataset.cache()

      • Generate Java code to build CachedColumnarBatch, which keeps data in ColumnarBatch
      • Generate Java code to get a value of each column from ColumnarBatch
        • a. Get a value directly from ColumnarBatch in code generated by whole-stage codegen (primary path)
        • b. Get a value through an iterator if whole-stage codegen is disabled (e.g. when the number of columns is more than 100), as a backup path
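The difference between the two read paths can be sketched as follows. This is a simplified illustration under stated assumptions: Batch, sumDirect, rows, and sumViaIterator are hypothetical names invented here, and real generated code would operate on ColumnarBatch and ColumnVector rather than on bare arrays.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical classes for illustration; not Spark's generated code.
final class AccessPathSketch {
    // Stand-in for a cached batch holding two uncompressed int columns.
    static final class Batch {
        final int[] a;
        final int[] b;
        final int numRows;

        Batch(int[] a, int[] b) {
            if (a.length != b.length) {
                throw new IllegalArgumentException("columns must have equal length");
            }
            this.a = a;
            this.b = b;
            this.numRows = a.length;
        }
    }

    // Primary path: a tight loop that reads column values directly,
    // similar in spirit to what whole-stage codegen could inline.
    static long sumDirect(Batch batch) {
        long sum = 0;
        for (int row = 0; row < batch.numRows; row++) {
            sum += (long) batch.a[row] + batch.b[row];
        }
        return sum;
    }

    // Backup path: expose the batch row by row through an iterator.
    static Iterator<int[]> rows(Batch batch) {
        return new Iterator<int[]>() {
            private int row = 0;

            @Override
            public boolean hasNext() {
                return row < batch.numRows;
            }

            @Override
            public int[] next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                int[] r = {batch.a[row], batch.b[row]};
                row++;
                return r;
            }
        };
    }

    // Consumes the iterator; each row costs an allocation and two calls.
    static long sumViaIterator(Batch batch) {
        long sum = 0;
        Iterator<int[]> it = rows(batch);
        while (it.hasNext()) {
            int[] r = it.next();
            sum += (long) r[0] + r[1];
        }
        return sum;
    }

    public static void main(String[] args) {
        Batch batch = new Batch(new int[]{1, 2, 3}, new int[]{10, 20, 30});
        System.out.println(sumDirect(batch));      // prints 66
        System.out.println(sumViaIterator(batch)); // prints 66
    }
}
```

Both paths return the same result; the direct path avoids per-row object creation and iterator calls, which is why it is the preferred one when whole-stage codegen is available.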

          People

            Assignee: Unassigned
            Reporter: Kazuaki Ishizaki (kiszk)