Kudu / KUDU-2633

Missing documentation about Spark's KuduContext API


Details

    • Patch

    Description

      There is currently no reference documentation for the methods of KuduContext.
      The only available resource shows a few usage examples:
      https://kudu.apache.org/docs/developing.html#_spark_integration_best_practices

      Even when the dependency is included in the IDE, there is no documentation for any of the methods.

      For example, I got a SparkException (which does not describe the actual error) when I accidentally inserted rows into a table that already contained the same rows. Nothing about KuduContext's insertRows method indicates that it can throw an exception.
      Example exception:

      18/12/05 11:26:35 ERROR core.JobRunShell: Job DEFAULT.EventKpisConsumer threw an unhandled Exception: 
      org.apache.spark.SparkException: Job aborted due to stage failure: Aborting TaskSet 109.0 because task 3 (partition 3) cannot run anywhere due to node and executor blacklist.  Blacklisting behavior can be configured via spark.blacklist.*.
      	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1524)
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1512)
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1511)
      	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
      	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1511)
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
      	at scala.Option.foreach(Option.scala:257)
      	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
      	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1739)
      	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1694)
      	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1683)
      	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
      	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
      	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2031)
      	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2052)
      	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2071)
      	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096)
      	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:926)
      	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
      	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
      	at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:924)
      	at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply$mcV$sp(Dataset.scala:2340)
      	at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2340)
      	at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2340)
      	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
      	at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2827)
      	at org.apache.spark.sql.Dataset.foreachPartition(Dataset.scala:2339)
      	at org.apache.kudu.spark.kudu.KuduContext.writeRows(KuduContext.scala:246)
      	at org.apache.kudu.spark.kudu.KuduContext.insertRows(KuduContext.scala:197)
      	at com.xpandit.bdu.altice.EventKpisKafkaConsumer.run(EventKpisKafkaConsumer.scala:193)
      	at com.xpandit.bdu.altice.scheduling.RunnableInterruptableJob.execute(CronScheduler.scala:73)
      	at org.quartz.core.JobRunShell.run(JobRunShell.java:207)
      	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:560)
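
Until the API is documented, one possible way to get closer to the actual error is to walk the cause chain of the exception thrown by the write call. The sketch below is only an illustration: the object `RootCause` and its helpers are hypothetical names, not part of Kudu or Spark, and it assumes the SparkException carries a cause at all. The trace above has no "Caused by" section, so in a case like this one the underlying Kudu error may only be visible in the executor logs.

```scala
import scala.annotation.tailrec

// Hypothetical helper, not part of Kudu or Spark.
object RootCause {

  /** Follows getCause until the end of the chain and returns the last exception. */
  @tailrec
  def rootCause(t: Throwable): Throwable = {
    val cause = t.getCause
    if (cause == null) t else rootCause(cause)
  }

  /** Renders every exception in the chain on one line, outermost first. */
  def describeChain(t: Throwable): String =
    Iterator.iterate(t)(_.getCause)
      .takeWhile(_ != null)
      .take(32) // guard against unusually long or cyclic chains
      .map(e => s"${e.getClass.getName}: ${e.getMessage}")
      .mkString(" <- ")
}

// Possible use around a write (kuduContext, df and "my_table" are placeholders):
//
//   try kuduContext.insertRows(df, "my_table")
//   catch {
//     case e: org.apache.spark.SparkException =>
//       System.err.println(RootCause.describeChain(e))
//       throw e
//   }
```

This does not replace proper documentation of which exceptions insertRows can raise; it only makes whatever information the exception already carries easier to read.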
      

          People

            Assignee: Unassigned
            Reporter: Ricardo Gaspar (RikG)
            Votes: 1
            Watchers: 3