Parquet / PARQUET-2339

ArrayIndexOutOfBoundsException when writing Parquet from Avro in Apache Hudi


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.12.3
    • Fix Version/s: None
    • Component/s: parquet-avro, parquet-mr
    • Labels: None
    • Environment: Amazon EMR 6.12.x, Apache Hudi 0.13.1, Apache Spark 3.4.0, Linux in Docker

    Description

      While writing an Apache Hudi table using the DeltaStreamer utility, I receive an exception from the Parquet `AvroWriteSupport` class:

      ```
      23/08/17 22:43:50 ERROR HoodieCreateHandle: Error writing record HoodieRecord{key=HoodieKey { recordKey=id:05a3065f8cf0494f9dc449307a0fddd8,idx:01 partitionPath=event.year=2023/event.month=08/event.day=17/event.hour=22}, currentLocation='null', newLocation='null'}
      java.lang.ArrayIndexOutOfBoundsException: Index 5 out of bounds for length 5
          at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addBinary(MessageColumnIO.java:476) ~[parquet-column-1.12.3-amzn-0.jar:1.12.3-amzn-0]
          at org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:358) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
          at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:287) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
          at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:200) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
          at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:174) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
          at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:138) ~[parquet-hadoop-1.12.3-amzn-0.jar:1.12.3-amzn-0]
          at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:310) ~[parquet-hadoop-1.12.3-amzn-0.jar:1.12.3-amzn-0]
          at org.apache.hudi.io.storage.HoodieBaseParquetWriter.write(HoodieBaseParquetWriter.java:80) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
          at org.apache.hudi.io.storage.HoodieAvroParquetWriter.writeAvroWithMetadata(HoodieAvroParquetWriter.java:67) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
          at org.apache.hudi.io.storage.HoodieAvroFileWriter.writeWithMetadata(HoodieAvroFileWriter.java:45) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
          at org.apache.hudi.io.storage.HoodieFileWriter.writeWithMetadata(HoodieFileWriter.java:39) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
          at org.apache.hudi.io.HoodieCreateHandle.doWrite(HoodieCreateHandle.java:147) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
          at org.apache.hudi.io.HoodieWriteHandle.write(HoodieWriteHandle.java:175) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
          at org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:98) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
          at org.apache.hudi.execution.CopyOnWriteInsertHandler.consume(CopyOnWriteInsertHandler.java:42) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
          at org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:67) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
          at org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:80) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
          at org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:39) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
          at org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119) ~[hudi-utilities-bundle.jar:0.13.1-amzn-0]
          at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46) ~[scala-library-2.12.15.jar:?]
          at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) ~[scala-library-2.12.15.jar:?]
          at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) ~[scala-library-2.12.15.jar:?]
          at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:223) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
          at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:352) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
          at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1552) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
          at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1462) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
          at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1526) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
          at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1349) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
          at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:375) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
          at org.apache.spark.rdd.RDD.iterator(RDD.scala:326) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
          at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
          at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
          at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
          at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
          at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
          at org.apache.spark.scheduler.Task.run(Task.scala:141) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
          at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
          at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
          at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) ~[spark-core_2.12-3.4.0-amzn-0.jar:3.4.0-amzn-0]
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
          at java.lang.Thread.run(Thread.java:833) ~[?:?]

      ```
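      For reference, the failing call path can be exercised outside Hudi with a plain `AvroParquetWriter`, which routes records through `AvroWriteSupport.write` into `MessageColumnIO` exactly as in the trace above. The sketch below is hypothetical: the real table schema is not attached to this report, and the array-of-nullable-strings field is only a guess at the kind of shape involved, prompted by the list-structure flag mentioned next.

      ```java
      import org.apache.avro.Schema;
      import org.apache.avro.generic.GenericData;
      import org.apache.avro.generic.GenericRecord;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.parquet.avro.AvroParquetWriter;
      import org.apache.parquet.hadoop.ParquetWriter;

      import java.util.Arrays;

      public class AvroWriteRepro {
          public static void main(String[] args) throws Exception {
              // Hypothetical schema; the actual Hudi table schema is not included
              // in this report. The nullable-element array is a guess.
              Schema schema = new Schema.Parser().parse(
                  "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
                      + "{\"name\":\"id\",\"type\":\"string\"},"
                      + "{\"name\":\"tags\",\"type\":{\"type\":\"array\","
                      + "\"items\":[\"null\",\"string\"]}}]}");

              GenericRecord record = new GenericData.Record(schema);
              record.put("id", "05a3065f8cf0494f9dc449307a0fddd8");
              record.put("tags", Arrays.asList("a", null, "b"));

              // Same write path as the stack trace above:
              // AvroWriteSupport.write -> InternalParquetRecordWriter.write
              try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                      .<GenericRecord>builder(new Path("/tmp/parquet-2339-repro.parquet"))
                      .withSchema(schema)
                      .withConf(new Configuration())
                      .build()) {
                  writer.write(record);
              }
          }
      }
      ```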

      I have tried setting `spark.hadoop.parquet.avro.write-old-list-structure: false`, but the issue persists.
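
      For context, the `spark.hadoop.` prefix simply copies that key into the Hadoop `Configuration` that the Parquet writer reads, so the equivalent direct form is a one-line `Configuration` change. A minimal sketch, assuming the setting actually reaches the executor-side writer (Hudi constructs its own `ParquetWriter` internally, so that propagation is part of what needs verifying):

      ```java
      import org.apache.hadoop.conf.Configuration;

      /** Hypothetical helper showing the direct Hadoop form of the flag tried above. */
      public class ListStructureConf {
          public static Configuration writerConf() {
              Configuration conf = new Configuration();
              // Same property the spark.hadoop.* key maps to; parquet-avro reads it
              // to choose between the legacy 2-level and the spec's 3-level list encoding.
              conf.setBoolean("parquet.avro.write-old-list-structure", false);
              return conf;
          }
      }
      ```

      The resulting `Configuration` would be passed to the writer via `withConf(...)`, as in the repro sketch above.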

      People

        Assignee: Unassigned
        Reporter: David Palmer (cloventt)
        Votes: 0
        Watchers: 2
