Hadoop Common
HADOOP-17755

EOF reached error reading ORC file on S3A


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 3.2.0
    • Fix Version/s: None
    • Component/s: fs/s3
    • Labels: None
    • Environment: Hadoop 3.2.0

    Description

      Hi, I am trying to do some transformations using Spark 3.1.1 with Hadoop 3.2 on K8s, accessing data over s3a.

      I have around 700 GB of data to read and around 200 executors (5 vCores and 30 GB each).

      It is able to read most of the files in the problematic stage (Scan ORC => Filter => Project) but fails on a few files at the end with the error below. The file mentioned in the error is around 140 MB, and all the other files are of similar size.
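      For context, the failing stage corresponds to a job shape roughly like the sketch below. This is a hedged illustration, not the actual job: the predicate, the projected columns, and the assumption of an existing SparkSession named spark are all placeholders.

      // Illustrative shape of the failing stage (Scan ORC => Filter => Project).
      // `spark`, the predicate, and the column names are hypothetical.
      import spark.implicits._
      val df = spark.read.orc("s3a://<bucket-with-prefix>/")
        .filter($"some_column" === "some_value")   // hypothetical predicate
        .select("col_a", "col_b")                  // hypothetical projection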

      I am able to read and rewrite the specific file mentioned, which suggests the file is not corrupted.
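      The single-file check was essentially the following minimal sketch (the output prefix is illustrative); both the read and the rewrite succeed for this file.

      // Read the exact file named in the error and rewrite it elsewhere.
      // Success of both steps suggests the ORC file itself is intact.
      val one = spark.read.orc(
        "s3a://<bucket-with-prefix>/part-00001-5e22a873-82a5-4781-9eb9-473b483396bd.c000.zlib.orc")
      one.write.orc("s3a://<bucket-with-prefix>/rewrite-check/")   // hypothetical output prefix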

      Let me know if further information is required.


      java.io.IOException: Error reading file: s3a://<bucket-with-prefix>/part-00001-5e22a873-82a5-4781-9eb9-473b483396bd.c000.zlib.orc
          at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1331)
          at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
          at org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:96)
          at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
          at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
          at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
          at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
          at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:511)
          at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
          at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
          at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:177)
          at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
          at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
          at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
          at org.apache.spark.scheduler.Task.run(Task.scala:131)
          at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
          at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
          at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          at java.base/java.lang.Thread.run(Unknown Source)
      Caused by: java.io.EOFException: End of file reached before reading fully.
          at org.apache.hadoop.fs.s3a.S3AInputStream.readFully(S3AInputStream.java:702)
          at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
          at org.apache.orc.impl.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:566)
          at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readFileData(RecordReaderUtils.java:285)
          at org.apache.orc.impl.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:1237)
          at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1105)
          at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1256)
          at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1291)
          at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1327)
          ... 20 more
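      From the trace, the failure surfaces where ORC's RecordReaderUtils.readDiskRanges asks FSDataInputStream.readFully (backed by S3AInputStream) to fill a buffer for a stripe's byte range; S3AInputStream.readFully throws java.io.EOFException when the stream ends before the requested number of bytes has been read. A hedged standalone sketch of that code path is below; the offset and length values are hypothetical, not taken from the actual failure.

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.{FSDataInputStream, Path}

      // Sketch of the failing call path: a positioned readFully over an s3a
      // stream. If the requested range runs past the end of the object, the
      // S3A stream raises EOFException("End of file reached before reading fully").
      val path = new Path(
        "s3a://<bucket-with-prefix>/part-00001-5e22a873-82a5-4781-9eb9-473b483396bd.c000.zlib.orc")
      val fs = path.getFileSystem(new Configuration())
      val in: FSDataInputStream = fs.open(path)
      try {
        val offset = 0L                  // hypothetical stripe offset
        val len = 1 << 20                // hypothetical range length (1 MiB)
        val buf = new Array[Byte](len)
        in.readFully(offset, buf, 0, len)
      } finally {
        in.close()
      }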

People

    Assignee: Unassigned
    Reporter: Arghya Saha (arghya18)
    Votes: 0
    Watchers: 5
