  Hadoop Map/Reduce
  MAPREDUCE-7450

Set the record delimiter for the input file based on its path


Details

    • Type: Improvement
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.3.6
    • Fix Version/s: None
    • Component/s: client
    • Labels: None
    • Environment: Any

    • Hadoop Flags: Reviewed

    Description

      In a MapReduce program, the record delimiter used when reading input files can easily be set through the textinputformat.record.delimiter parameter.
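
      For example, in a plain MapReduce driver this is a single call on the job configuration (a minimal sketch; the driver class name and the "|@|" delimiter are illustrations only):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

      public class DelimiterJobDriver {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // One global setting: every file read through TextInputFormat in this
          // job is split into records on "|@|" instead of newlines.
          conf.set("textinputformat.record.delimiter", "|@|");
          Job job = Job.getInstance(conf, "delimiter-example");
          job.setInputFormatClass(TextInputFormat.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          // Mapper/reducer setup omitted.
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }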
      The same parameter is just as easy to set from other engines such as Spark, for example:

      spark.sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter", "|@|")
      val rdd = spark.sparkContext.newAPIHadoopFile(...) 

      But once textinputformat.record.delimiter is set, it takes effect for all input files, while in real scenarios different files often use different record delimiters.

      Hive is more restrictive. Since Hive does not expose a programmatic hook, the record delimiter cannot be changed through the methods above, and changing it in a configuration file would affect every Hive table. The only way to change the record delimiter for a single table is to write a custom TextInputFormat class and reference it in the table definition. The current approach in Hive looks like this:

      package abc.hive;

      public class MyFstTextInputFormat extends FileInputFormat<LongWritable, Text>
          implements JobConfigurable {
        ...
      }

      CREATE TABLE test (
          id string,
          name string
      )
      STORED AS
          INPUTFORMAT 'abc.hive.MyFstTextInputFormat'
          OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

      If several different record delimiters are needed, a separate TextInputFormat has to be written for each one, because the delimiter ends up hardcoded in each class, as the sketch below illustrates.
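
      A rough sketch of what each such class contains (my illustration of the pattern, not the actual MyFstTextInputFormat source):

      package abc.hive;

      import java.io.IOException;
      import java.nio.charset.StandardCharsets;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.FileInputFormat;
      import org.apache.hadoop.mapred.FileSplit;
      import org.apache.hadoop.mapred.InputSplit;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.mapred.JobConfigurable;
      import org.apache.hadoop.mapred.LineRecordReader;
      import org.apache.hadoop.mapred.RecordReader;
      import org.apache.hadoop.mapred.Reporter;

      public class MyFstTextInputFormat extends FileInputFormat<LongWritable, Text>
          implements JobConfigurable {

        @Override
        public void configure(JobConf job) {
          // Nothing to configure: the delimiter is fixed at compile time.
        }

        @Override
        public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
          reporter.setStatus(split.toString());
          // Hardcoded record delimiter: supporting a second delimiter means
          // writing a second, nearly identical class.
          byte[] recordDelimiterBytes = "|@|".getBytes(StandardCharsets.UTF_8);
          return new LineRecordReader(job, (FileSplit) split, recordDelimiterBytes);
        }
      }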

      My idea is to modify the TextInputFormat class so that the record delimiter for each input file can be chosen by matching a prefix of the file path.
      Concretely, TextInputFormat would be modified along these lines:

      public class TextInputFormat extends FileInputFormat<LongWritable, Text>
        implements JobConfigurable {
        ....
        public RecordReader<LongWritable, Text> getRecordReader(InputSplit genericSplit,
                                                JobConf job, Reporter reporter)
          throws IOException {
          reporter.setStatus(genericSplit.toString());
          // Default delimiter, exactly as today.
          String delimiter = job.get("textinputformat.record.delimiter");
          // Path of the file this split belongs to. InputSplit itself has no
          // getPath(), so the split is cast to FileSplit first.
          String filePath = ((FileSplit) genericSplit).getPath().toUri().getPath();
          // Mapping from path prefixes to delimiters, parsed from configuration
          // (parsePathDelimiterConfig is a hypothetical helper, sketched below).
          Map<String, String> pathToDelimiterMap = parsePathDelimiterConfig(job);
          for (Map.Entry<String, String> entry : pathToDelimiterMap.entrySet()) {
            String configPath = entry.getKey();
            // If configPath is a prefix of filePath, use its configured delimiter.
            if (filePath.startsWith(configPath)) {
              delimiter = entry.getValue();
              break;
            }
          }
          byte[] recordDelimiterBytes = null;
          if (null != delimiter) {
            recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
          }
          return new LineRecordReader(job, (FileSplit) genericSplit,
              recordDelimiterBytes);
        }
      }
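
      The parsePathDelimiterConfig helper above does not exist yet; one possible shape for it, assuming the mapping is carried in a single new property (the property name textinputformat.record.delimiter.per.path and its path=delimiter,path=delimiter format are assumptions of mine, not an existing Hadoop parameter):

      import java.util.LinkedHashMap;
      import java.util.Map;
      import org.apache.hadoop.conf.Configuration;

      // Hypothetical helper: parses e.g.
      //   textinputformat.record.delimiter.per.path=/data/tableA=|@|,/data/tableB=##
      // into a prefix-to-delimiter map, preserving the configured order.
      private static Map<String, String> parsePathDelimiterConfig(Configuration conf) {
        Map<String, String> map = new LinkedHashMap<>();
        String raw = conf.get("textinputformat.record.delimiter.per.path");
        if (raw == null) {
          return map;
        }
        for (String entry : raw.split(",")) {
          int idx = entry.indexOf('=');
          if (idx > 0) {
            map.put(entry.substring(0, idx), entry.substring(idx + 1));
          }
        }
        return map;
      }

      A real patch would also have to define an escaping rule for delimiters that contain ',' or '=', and a precedence rule for when several configured prefixes match the same file (for example, longest prefix wins).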

      With per-path record delimiters in place, changing a delimiter no longer requires writing code, and Hadoop and Spark users no longer have to keep adjusting the configuration parameter from job to job.

      Looking forward to receiving your suggestions and feedback!

      If you accept my idea, I hope you can assign the task to me. My GitHub account is lvhu-goodluck.
      I really hope to contribute code to the community.


          People

            Assignee: Unassigned
            Reporter: lvhu (lvhu18@163.com)
            Votes: 0
            Watchers: 1


          Time Tracking

            Original Estimate: 672h
            Remaining Estimate: 672h
            Time Spent: Not Specified