Description
Currently, the only way to use a custom line (record) delimiter when creating a table in Hive is to write a custom InputFormat and reference it in the table definition:
package abc.hive;

public class MyFstTextInputFormat extends FileInputFormat<LongWritable, Text> implements JobConfigurable {
    ...
}

create table test (
    id string,
    name string
)
INPUTFORMAT 'abc.hive.MyFstTextInputFormat'
If tables use several different record delimiters, a separate custom TextInputFormat has to be written for each one.
Unfortunately, the ideal syntax is not supported yet:
create table test (
    id string,
    name string
)
row format delimited
fields terminated by '\t'      -- supported
LINES TERMINATED BY '|@|' ;    -- not supported
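To make the intended semantics concrete, here is a minimal sketch of what a record delimiter of '|@|' would mean for the raw file content. The class and method names (RecordSplitDemo, splitRecords) are hypothetical and only for illustration; this is not how Hive or Hadoop implement record reading, which works on byte streams and input splits.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration: split raw text into records on a multi-character delimiter. */
final class RecordSplitDemo {
    private RecordSplitDemo() {}

    static List<String> splitRecords(String data, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int start = 0;
        int idx;
        // Each occurrence of the delimiter closes one record.
        while ((idx = data.indexOf(delimiter, start)) >= 0) {
            records.add(data.substring(start, idx));
            start = idx + delimiter.length();
        }
        // Keep a trailing record that is not followed by a delimiter.
        if (start < data.length()) {
            records.add(data.substring(start));
        }
        return records;
    }
}
```

With this sketch, the content "1\tAlice|@|2\tBob|@|" yields two records, each of which would then be split into the id and name fields on '\t'.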
I have a solution that supports setting the line delimiter at table creation, as in the statement above.
1. Create a new HiveTextInputFormat class to replace the TextInputFormat class.
HiveTextInputFormat reads a <pathToDelimiter> file and chooses the record delimiter for each input file based on the prefix of its path.
public class HiveTextInputFormat extends FileInputFormat<LongWritable, Text> implements JobConfigurable {
    ....
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit genericSplit, JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(genericSplit.toString());
        // Default delimiter
        String delimiter = job.get("textinputformat.record.delimiter");
        // Obtain the path of the file (InputSplit itself has no getPath(), so cast to FileSplit first)
        FileSplit fileSplit = (FileSplit) genericSplit;
        String filePath = fileSplit.getPath().toUri().getPath();
        // Obtain the path-to-delimiter mapping by parsing the <pathToDelimiter> file
        Map<String, String> pathToDelimiterMap = parsePathToDelimiter();
        for (Map.Entry<String, String> entry : pathToDelimiterMap.entrySet()) {
            // Configured path
            String configPath = entry.getKey();
            // If configPath is a prefix of filePath, use the delimiter configured for that path
            if (filePath.startsWith(configPath)) {
                delimiter = entry.getValue();
            }
        }
        byte[] recordDelimiterBytes = null;
        if (null != delimiter) {
            recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
        }
        return new LineRecordReader(job, fileSplit, recordDelimiterBytes);
    }
}
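The prefix lookup in getRecordReader can be isolated and tested without Hadoop. The sketch below is one possible variation, not the proposed implementation: it assumes that when several configured paths match, the longest one should win, and that a configured path should only match on a directory boundary (so that /warehouse/test does not match /warehouse/test2). Both rules are assumptions added here; the plain startsWith loop above takes whichever matching entry is iterated last. The names DelimiterResolver, resolve, and isPathPrefix are hypothetical.

```java
import java.util.Map;

/** Hypothetical helper: choose a record delimiter by longest matching path prefix. */
final class DelimiterResolver {
    private DelimiterResolver() {}

    static String resolve(String filePath, Map<String, String> pathToDelimiter,
                          String defaultDelimiter) {
        String best = defaultDelimiter;
        int bestLength = -1;
        for (Map.Entry<String, String> entry : pathToDelimiter.entrySet()) {
            String configPath = entry.getKey();
            // Prefer the most specific (longest) configured path that matches.
            if (isPathPrefix(configPath, filePath) && configPath.length() > bestLength) {
                best = entry.getValue();
                bestLength = configPath.length();
            }
        }
        return best;
    }

    // True if prefix equals path, or prefix ends at a '/' boundary inside path.
    static boolean isPathPrefix(String prefix, String path) {
        if (!path.startsWith(prefix)) {
            return false;
        }
        return path.length() == prefix.length()
                || prefix.endsWith("/")
                || path.charAt(prefix.length()) == '/';
    }
}
```

Keeping this logic in a small static helper would let getRecordReader stay a thin wrapper around LineRecordReader while the matching rules are unit tested separately.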
2. Modify Hive's CREATE TABLE handling to support <LINES TERMINATED BY>:
create table test (
id string,
name string
)
LINES TERMINATED BY '|@|'
LOCATION hdfs_path;
When a user executes the SQL above, Hive adds the entry (hdfs_path, '|@|') to the <pathToDelimiter> file.
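The format of the <pathToDelimiter> file is not specified in this proposal. As a rough sketch only, assume one entry per line in the form path, a tab character, then the delimiter. Under that assumption, writing and parsing entries could look like the following; the class and method names (PathToDelimiterFile, formatEntry, parse) are hypothetical, and delimiters containing a newline would need escaping, which this sketch does not handle.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical reader/writer for an assumed "path<TAB>delimiter" line format. */
final class PathToDelimiterFile {
    private PathToDelimiterFile() {}

    // Builds the line that CREATE TABLE would append for (hdfs_path, delimiter).
    static String formatEntry(String path, String delimiter) {
        return path + "\t" + delimiter;
    }

    // Parses all entries; later lines for the same path override earlier ones.
    static Map<String, String> parse(Reader in) throws IOException {
        Map<String, String> result = new LinkedHashMap<>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            int tab = line.indexOf('\t');
            if (tab <= 0) {
                continue; // skip blank lines and lines without a path
            }
            result.put(line.substring(0, tab), line.substring(tab + 1));
        }
        return result;
    }
}
```

In a real implementation the reader would come from the file system that holds the mapping file, and dropping a table would also need to remove its entry; neither is covered here.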
Finally, set HiveTextInputFormat as the default INPUTFORMAT.
Looking forward to receiving your suggestions and feedback!
If you accept my idea, I hope you can assign the task to me. My GitHub account is: lvhu-goodluck
I really hope to contribute code to the community.