Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Duplicate
-
0.9.1
-
None
-
None
-
None
Description
2019-05-17 10:08:10,317 [INFO] [Fetcher_B \{Map_1} #1] |shuffle.Fetcher|: Failed to read data to memory for InputAttemptIdentifier [inputIdentifier=1, attemptNumber=0, pathComponent=attempt_1557383221332_0289_1_00_000001_0_10003, spillType=0, spillId=-1]. len=25, decomp=11. ExceptionMessage=Not a valid ifile header 2019-05-17 10:08:10,317 [WARN] [Fetcher_B \{Map_1} #1] |shuffle.Fetcher|: Failed to shuffle output of InputAttemptIdentifier [inputIdentifier=1, attemptNumber=0, pathComponent=attempt_1557383221332_0289_1_00_000001_0_10003, spillType=0, spillId=-1] from XXXXX java.io.IOException: Not a valid ifile header at org.apache.tez.runtime.library.common.sort.impl.IFile$Reader.verifyHeaderMagic(IFile.java:859) at org.apache.tez.runtime.library.common.sort.impl.IFile$Reader.isCompressedFlagEnabled(IFile.java:866) at org.apache.tez.runtime.library.common.sort.impl.IFile$Reader.readToMemory(IFile.java:616) at org.apache.tez.runtime.library.common.shuffle.ShuffleUtils.shuffleToMemory(ShuffleUtils.java:121) at org.apache.tez.runtime.library.common.shuffle.Fetcher.fetchInputs(Fetcher.java:950) at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:599) at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:486) at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:284) at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:76) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108) at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41) at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)
How to reproduce:
1) create two files - file1.csv and file2.csv 2) add two fields to the csv files as below one,two one,two one,two 3) In Hive: use testdb; create external table test1(s1 string, s2 string) row format delimited fields terminated by ',' stored as textfile location '/user/usera/test1'; 4) Copy one csv file to hdfs - /user/usera/test1 hdfs dfs -put ./file1.csv /user/usera/test1/ 5) select count(*) from testdb.test1; => works fine. 6) copy the second csv file to HDFS hdfs dfs -put ./file2.csv /user/usera/test1/ 7) select * from testdb.test1; => Can see all data from 2 hdfs files. 8) select count(*) from testdb.test1; => get this issue.
Similar ticket: https://issues.apache.org/jira/browse/TEZ-3699
Attachments
Issue Links
- duplicates
-
TEZ-3894 Tez intermediate outputs implicitly rely on permissive umask for shuffle
- Closed