  1. Hive
  2. HIVE-27590

Make LINES TERMINATED BY work when creating table

Details

    • Improvement
    • Status: Open
    • Blocker
    • Resolution: Unresolved
    • 3.1.3
    • None
    • Hive, SQL
    • None
    • Any

    Description

      Currently, the only way to set a custom line (record) delimiter when creating a table in Hive is to write a custom InputFormat and reference it in the DDL:

      package abc.hive;
      public class MyFstTextInputFormat extends FileInputFormat<LongWritable, Text> implements JobConfigurable {
       ...
      }

      create table test  (
          id string,
          name string
      )
      STORED AS INPUTFORMAT 'abc.hive.MyFstTextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

      If tables use several different record delimiters, a separate custom InputFormat has to be written for each one.

      Unfortunately, the ideal syntax is not supported yet:

      create table test  (  
          id string,  
          name string  
      )  
      row format delimited fields terminated by '\t'  -- supported
      LINES TERMINATED BY '|@|' ;   -- not supported  
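
To make the intended semantics concrete, here is a minimal sketch in plain Java (no Hadoop or Hive classes) of how data would be interpreted under the DDL above: records are separated by the literal string '|@|' and fields by a tab. The class and method names are hypothetical and exist only for illustration; this is not how Hive reads files internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustration only: hypothetical helper, not part of Hive or Hadoop.
final class RecordDelimiterSketch {

    // Splits raw text into records on a literal (non-regex) delimiter.
    static List<String> splitRecords(String data, String recordDelimiter) {
        if (recordDelimiter.isEmpty()) {
            throw new IllegalArgumentException("record delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = data.indexOf(recordDelimiter, start)) >= 0) {
            records.add(data.substring(start, idx));
            start = idx + recordDelimiter.length();
        }
        // Keep a trailing record that is not followed by a delimiter.
        if (start < data.length()) {
            records.add(data.substring(start));
        }
        return records;
    }

    // Splits one record into fields; limit -1 keeps trailing empty fields.
    static String[] splitFields(String record, char fieldDelimiter) {
        return record.split(Pattern.quote(String.valueOf(fieldDelimiter)), -1);
    }

    public static void main(String[] args) {
        String data = "1\tAlice|@|2\tBob|@|";
        for (String record : splitRecords(data, "|@|")) {
            String[] fields = splitFields(record, '\t');
            System.out.println("id=" + fields[0] + ", name=" + fields[1]);
        }
    }
}
```

With a multi-character delimiter such as '|@|', a newline inside a field no longer ends the record, which is the main reason to want LINES TERMINATED BY in the first place.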

      I have a solution that supports setting line delimiters when creating tables just like above.

      1. Create a new HiveTextInputFormat class to replace the TextInputFormat class.

      HiveTextInputFormat reads a <pathToDelimiter> file and selects the record delimiter for each input file based on the prefix of the file's path.

      public class HiveTextInputFormat extends FileInputFormat<LongWritable, Text>
        implements JobConfigurable {
        ....
        public RecordReader<LongWritable, Text> getRecordReader(
                                                InputSplit genericSplit, JobConf job,
                                                Reporter reporter)
          throws IOException {

          reporter.setStatus(genericSplit.toString());
          // getPath() is defined on FileSplit, not on InputSplit, so cast first
          FileSplit fileSplit = (FileSplit) genericSplit;
          // Default delimiter from the job configuration (null keeps the default newline handling)
          String delimiter = job.get("textinputformat.record.delimiter");
          // Path of the file backing this split
          String filePath = fileSplit.getPath().toUri().getPath();
          // Path-prefix -> delimiter mapping obtained by parsing the <pathToDelimiter> file
          Map<String, String> pathToDelimiterMap = parsePathToDelimiter();
          for (Map.Entry<String, String> entry : pathToDelimiterMap.entrySet()) {
            // Configured path prefix
            String configPath = entry.getKey();
            // If configPath is a prefix of filePath, use the delimiter configured for it
            if (filePath.startsWith(configPath)) {
              delimiter = entry.getValue();
            }
          }
          byte[] recordDelimiterBytes = null;
          if (null != delimiter) {
            recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
          }
          return new LineRecordReader(job, fileSplit, recordDelimiterBytes);
        }
      }
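
One detail of the prefix lookup is worth spelling out. With a plain startsWith check, the delimiter that wins when several configured paths match depends on map iteration order, and a prefix such as /warehouse/test also matches /warehouse/testing. The following is a minimal sketch of one possible refinement, assuming the longest matching prefix should win and that prefixes should only match on path-component boundaries. Both rules are my assumptions, not something the proposal above fixes, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: hypothetical helper, independent of Hadoop and Hive.
final class DelimiterResolverSketch {

    // Returns the delimiter of the longest configured prefix that matches filePath,
    // or defaultDelimiter (which may be null) when no prefix matches.
    static String resolveDelimiter(String filePath,
                                   Map<String, String> pathToDelimiter,
                                   String defaultDelimiter) {
        String best = defaultDelimiter;
        int bestLength = -1;
        for (Map.Entry<String, String> entry : pathToDelimiter.entrySet()) {
            String prefix = entry.getKey();
            if (isPathPrefix(prefix, filePath) && prefix.length() > bestLength) {
                bestLength = prefix.length();
                best = entry.getValue();
            }
        }
        return best;
    }

    // True when prefix equals path, ends with '/', or is followed by '/' in path,
    // so that /warehouse/test does not match /warehouse/testing.
    static boolean isPathPrefix(String prefix, String path) {
        if (!path.startsWith(prefix)) {
            return false;
        }
        return path.length() == prefix.length()
            || prefix.endsWith("/")
            || path.charAt(prefix.length()) == '/';
    }

    public static void main(String[] args) {
        Map<String, String> config = new LinkedHashMap<>();
        config.put("/warehouse", "\n");
        config.put("/warehouse/test", "|@|");
        System.out.println(resolveDelimiter("/warehouse/test/part-00000", config, null));
    }
}
```

Whether longest-prefix matching is the right rule is open for discussion; the point is only that the rule should be deterministic.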

      2. Modify Hive's CREATE TABLE handling to support LINES TERMINATED BY.

      create table test  (
          id string,
          name string
      )
      LINES TERMINATED BY '|@|'
      LOCATION  hdfs_path;

      When a user executes the SQL above, Hive will add the entry (hdfs_path, '|@|') to the <pathToDelimiter> file.
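
The format of the <pathToDelimiter> file is not specified yet. As a minimal sketch, assuming a Java properties file is acceptable (it escapes characters such as ':' in paths and newlines in delimiters automatically), registering and reading back an entry could look like the code below. The class name, method names, and file format are all hypothetical, and the sketch uses the local file system only; storing the file on HDFS and guarding against concurrent writers are not shown.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Illustration only: hypothetical storage format for the <pathToDelimiter> file.
final class PathToDelimiterFileSketch {

    // Loads all entries; a missing file is treated as an empty mapping.
    static Properties load(Path file) throws IOException {
        Properties props = new Properties();
        if (Files.exists(file)) {
            try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                props.load(in);
            }
        }
        return props;
    }

    // Adds or overwrites the entry for tablePath, then rewrites the whole file.
    // No locking is done, so concurrent writers could lose updates.
    static void register(Path file, String tablePath, String delimiter) throws IOException {
        Properties props = load(file);
        props.setProperty(tablePath, delimiter);
        try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            props.store(out, "path prefix -> record delimiter");
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("pathToDelimiter", ".properties");
        register(file, "/warehouse/test", "|@|");
        System.out.println(load(file).getProperty("/warehouse/test"));
        Files.delete(file);
    }
}
```

A real implementation would more likely keep this mapping in table properties in the metastore than in a side file, but that is a design question for the discussion.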

      Finally, set HiveTextInputFormat as the default INPUTFORMAT.

      Looking forward to receiving your suggestions and feedback!

      If you accept my idea, I hope you can assign the task to me. My GitHub account is: lvhu-goodluck

      I would really like to contribute code to the community.

          People

            lvhu18@163.com lvhu