Hive / HIVE-22337

Improve and Expand Text-Based SerDes

Details

    Description

      • Add new SerDe package just for text-based formats: org.apache.hadoop.hive.serde2.text.*
      • Add new SerDe package just for text-based log formats: org.apache.hadoop.hive.serde2.text.log.*
      • Create a coherent hierarchy for processing delimited data: AbstractSerDe -> TextSerDe -> EncodingAwareTextSerde -> DelimitedSerDe -> CsvTextSerDe
      • Create a coherent hierarchy for processing regex'ed data: AbstractSerDe -> TextSerDe -> EncodingAwareTextSerde -> RegexSerDe -> CommonFormatLogSerDe
      • Create some standard text processors for super-quick out-of-the-box processing: TSV SerDe and CSV SerDe
      • Create some standard log processors for super-quick out-of-the-box processing: Apache Common Log Format and Apache Combined Log Format (Apache HTTP Server Log Parsers)
      • Better default behaviors for processing text
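      As an illustration of the regex-based log processing proposed above, the following is a minimal sketch of parsing an Apache Common Log Format line in plain Java. It is not taken from the attached patches: the class name `CommonLogFormatSketch`, the exact regular expression, and the choice to return an all-null row for non-matching lines are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hive: parses one Common Log Format line.
final class CommonLogFormatSketch {

    // Fields: host, identity, user, timestamp, request, status, size
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$");

    private static final int COLUMN_COUNT = 7;

    // Returns one String per column. A line that does not match (including a
    // blank line) yields a row of nulls instead of an exception (assumed
    // behavior, chosen to mirror the "query without failures" goal).
    static String[] parse(String line) {
        String[] row = new String[COLUMN_COUNT];
        Matcher m = CLF.matcher(line);
        if (m.matches()) {
            for (int i = 0; i < COLUMN_COUNT; i++) {
                row[i] = m.group(i + 1);
            }
        }
        return row;
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        System.out.println(String.join(" | ", parse(line)));
    }
}
```

      A SerDe built on this idea would map each capture group to one table column, so the column list stays fixed by the format rather than by user configuration.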

      The default behavior should allow users to quickly query data without any failures.

      1. When a blank line is encountered, insert a 'null' value for each column
      2. When there are fewer fields in the data than defined in the table schema, shift all available fields left, and fill in 'null' values for all remaining fields
      3. When there are too many fields in the data, the last field in the results will contain all remaining values. Currently, the extra data is silently swallowed and a warning is issued in the YARN logs. A normal user will never see this warning, especially if the job completes successfully. It is better to provide all of the data by default than to hide anything.
      CSV SerDe
      "1,2,3"    = ["1","2","3"]
      "1,2,"     = ["1","2",null]
      ""         = [null,null,null]
      "1,2,3,4"  = ["1","2","3,4"]
      
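      The three default behaviors and the CSV examples above can be sketched in plain Java as follows. This is a minimal illustration, not the code from the attached patches: the class `LenientDelimitedSplit` and its `split` method are hypothetical names, empty fields are assumed to map to null (which reproduces the `"1,2,"` example), and quoting and escaping are deliberately ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hive: splits one line into exactly
// columnCount fields using the lenient defaults described in this issue.
final class LenientDelimitedSplit {

    static List<String> split(String line, char delimiter, int columnCount) {
        if (columnCount < 1) {
            throw new IllegalArgumentException("columnCount must be >= 1");
        }
        List<String> fields = new ArrayList<>(columnCount);
        int start = 0;
        // Consume at most columnCount - 1 delimited fields from the left.
        while (fields.size() < columnCount - 1) {
            int end = line.indexOf(delimiter, start);
            if (end < 0) {
                break; // fewer fields than columns
            }
            fields.add(emptyToNull(line.substring(start, end)));
            start = end + 1;
        }
        // Whatever remains, including extra delimiters, goes into one field
        // so that surplus data is kept rather than dropped.
        fields.add(emptyToNull(line.substring(start)));
        // Pad missing trailing columns with null.
        while (fields.size() < columnCount) {
            fields.add(null);
        }
        return fields;
    }

    // Assumption: an empty field is reported as null.
    private static String emptyToNull(String s) {
        return s.isEmpty() ? null : s;
    }

    public static void main(String[] args) {
        for (String line : new String[] {"1,2,3", "1,2,", "", "1,2,3,4"}) {
            System.out.println("\"" + line + "\" = " + split(line, ',', 3));
        }
    }
}
```

      Limiting the number of splits to the column count is what preserves the overflow in the last field; a plain unlimited split would have to discard or reassemble the extra values afterwards.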

      Attachments

        1. HIVE-22337.2.patch
          85 kB
          David Mollitor
        2. HIVE-22337.1.patch
          83 kB
          David Mollitor

          People

            Assignee: David Mollitor (belugabehr)
            Reporter: David Mollitor (belugabehr)

            Dates

              Created:
              Updated:

              Time Tracking

              Estimated: Not Specified
              Remaining: 0h
              Logged: 40m