Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.1.1
Fix Version/s: None
Description
In the transactional subsystems, several code paths check whether a data file has ROW__ID fields. Every such check opens a new Reader for that file/split, even within the context of the same query. We could optimize this by caching the result, or by checking once and saving the result for later. It may also be unnecessary to do this for every split. An example call stack:
OrcFile.createReader(Path, OrcFile$ReaderOptions) line: 105
AcidUtils$MetaDataFile.isRawFormatFile(Path, FileSystem) line: 2026
AcidUtils$MetaDataFile.isRawFormat(Path, FileSystem) line: 2022
AcidUtils.parsedDelta(Path, String, FileSystem) line: 1007
OrcRawRecordMerger$TransactionMetaData.findWriteIDForSynthetcRowIDs(Path, Path, Configuration) line: 1231
OrcRawRecordMerger.discoverOriginalKeyBounds(Reader, int, Reader$Options, Configuration, OrcRawRecordMerger$Options) line: 722
OrcRawRecordMerger.<init>(Configuration, boolean, Reader, boolean, int, ValidWriteIdList, Reader$Options, Path[], OrcRawRecordMerger$Options) line: 1022
OrcInputFormat.getReader(InputSplit, Options) line: 2108
OrcInputFormat.getRecordReader(InputSplit, JobConf, Reporter) line: 2006
FetchOperator$FetchInputFormatSplit.getRecordReader(JobConf) line: 776
FetchOperator.getRecordReader() line: 344
FetchOperator.getNextRow() line: 540
FetchOperator.pushRow() line: 509
FetchTask.fetch(List) line: 146
Here, the check is repeated for each split.
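
One possible shape for the caching idea is sketched below. It is not the actual Hive code: RawFormatCache and its expensiveCheck parameter are hypothetical names, and the predicate merely stands in for whatever currently opens the Reader (for example the isRawFormat path in the stack above). The sketch assumes a file's raw-format status does not change for the lifetime of the cache (for example, one query), and it keys on the path string only.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

/**
 * Hypothetical cache that remembers, per file path, whether the file is in
 * "raw" format (i.e. has no ROW__ID fields). Illustration only; not a Hive API.
 */
final class RawFormatCache {
    // Path string -> result of the expensive check.
    private final Map<String, Boolean> cache = new ConcurrentHashMap<>();

    // Stand-in for the real check, which would open a Reader on the file.
    private final Predicate<String> expensiveCheck;

    RawFormatCache(Predicate<String> expensiveCheck) {
        this.expensiveCheck = expensiveCheck;
    }

    /** Runs the expensive check at most once per distinct path. */
    boolean isRawFormat(String path) {
        return cache.computeIfAbsent(path, expensiveCheck::test);
    }

    /** Number of distinct paths checked so far. */
    int size() {
        return cache.size();
    }
}
```

With a cache like this scoped to a single query, several splits of the same file would trigger one Reader open instead of one per split. Where such a cache would live and when it would be discarded are left open here.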