  Spark / SPARK-5452

We are migrating TeraData SQL to Spark SQL. The query is taking a long time. Please have a look at this issue


Details

    • Type: Test
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.2.0
    • Fix Version/s: None
    • Component/s: Spark Shell

    Description

      Hi Team,

      We are migrating TeraData SQL to Spark SQL. Because of its complexity we have split the query into the four sub-queries below,
      and we are running them through a HiveContext.
      ====================================================
      // Step 1: party accounts of type '04' that have a non-null LMS_ACCOUNT_ID
      val HIVETMP1 = hc.sql("SELECT PARTY_ACCOUNT_ID AS PARTY_ACCOUNT_ID,LMS_ACCOUNT_ID AS LMS_ACCOUNT_ID FROM VW_PARTY_ACCOUNT WHERE PARTY_ACCOUNT_TYPE_CODE IN('04') AND LMS_ACCOUNT_ID IS NOT NULL")

      HIVETMP1.registerTempTable("VW_HIVETMP1")

      // Step 2: first GROUP BY - 12-week (84-day) spend per PARTY_ACCOUNT_ID, computed from the
      // large transaction table joined to the small SYSTEM_OPTION_TABLE (which supplies the
      // reference date), then joined back to the step-1 accounts
      val HIVETMP2 = hc.sql("SELECT PACCNT.LMS_ACCOUNT_ID AS LMS_ACCOUNT_ID, 'NULL' AS RANDOM_PARTY_ACCOUNT_ID ,'NULL' AS MOST_RECENT_SPEND_LA ,STXN.PARTY_ACCOUNT_ID AS MAX_SPEND_12WKS_LA ,STXN.MAX_SPEND_12WKS_LADATE AS MAX_SPEND_12WKS_LADATE FROM VW_HIVETMP1 AS PACCNT INNER JOIN (SELECT STXTMP.PARTY_ACCOUNT_ID AS PARTY_ACCOUNT_ID, SUM(CASE WHEN (CAST(STXTMP.TRANSACTION_DATE AS DATE ) > DATE_SUB(CAST(CONCAT(SUBSTRING(SYSTMP.OPTION_VAL,1,4),'',SUBSTRING(SYSTMP.OPTION_VAL,5,2),'',SUBSTRING(SYSTMP.OPTION_VAL,7,2)) AS DATE),84)) THEN STXTMP.TRANSACTION_VALUE ELSE 0.00 END) AS MAX_SPEND_12WKS_LADATE FROM VW_SHOPPING_TRANSACTION_TABLE AS STXTMP INNER JOIN SYSTEM_OPTION_TABLE AS SYSTMP ON STXTMP.FLAG = SYSTMP.FLAG AND SYSTMP.OPTION_NAME = 'RID' AND STXTMP.PARTY_ACCOUNT_TYPE_CODE IN('04') GROUP BY STXTMP.PARTY_ACCOUNT_ID) AS STXN ON PACCNT.PARTY_ACCOUNT_ID = STXN.PARTY_ACCOUNT_ID WHERE STXN.MAX_SPEND_12WKS_LADATE IS NOT NULL")

      HIVETMP2.registerTempTable("VW_HIVETMP2")

      // Step 3: second GROUP BY - maximum MAX_SPEND_12WKS_LA per LMS_ACCOUNT_ID
      val HIVETMP3 = hc.sql("SELECT LMS_ACCOUNT_ID,MAX(MAX_SPEND_12WKS_LA) AS MAX_SPEND_12WKS_LA, 1 AS RANK FROM VW_HIVETMP2 GROUP BY LMS_ACCOUNT_ID")

      HIVETMP3.registerTempTable("VW_HIVETMP3")

      // Step 4: join step 2 back to step 3 to keep the rows matching each LMS account's maximum
      val HIVETMP4 = hc.sql(" SELECT PACCNT.LMS_ACCOUNT_ID,'NULL' AS RANDOM_PARTY_ACCOUNT_ID ,'NULL' AS MOST_RECENT_SPEND_LA,STXN.MAX_SPEND_12WKS_LA AS MAX_SPEND_12WKS_LA,1 AS RANK1 FROM VW_HIVETMP2 AS PACCNT INNER JOIN VW_HIVETMP3 AS STXN ON PACCNT.LMS_ACCOUNT_ID = STXN.LMS_ACCOUNT_ID AND PACCNT.MAX_SPEND_12WKS_LA = STXN.MAX_SPEND_12WKS_LA")

      HIVETMP4.registerTempTable("WT03_ACCOUNT_BHVR3")

      // Step 5: write the final result out to HDFS as text
      HIVETMP4.saveAsTextFile("hdfs:/file/")

      ==========================
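      One caching point may be worth checking first (a sketch, not a verified fix): VW_HIVETMP2 is referenced by both HIVETMP3 and HIVETMP4, and Spark 1.2 does not reuse that sub-plan automatically, so the expensive join plus the first GROUP BY over the 19.5 GB transaction table can be recomputed within the single saveAsTextFile action. Caching the temp table through the HiveContext avoids the recomputation:

      // Sketch only: cache the reused temp table so the large join + GROUP BY runs once.
      HIVETMP2.registerTempTable("VW_HIVETMP2")
      hc.cacheTable("VW_HIVETMP2")

      // ... define HIVETMP3 and HIVETMP4 exactly as above, then saveAsTextFile ...

      // Free executor memory once the result has been written.
      hc.uncacheTable("VW_HIVETMP2")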

      This query has two GROUP BY clauses which run over huge files (19.5 GB), and it took 40 minutes to produce the final result. Are there any changes to the runtime environment or configuration settings in Spark which could improve the query performance? (A configuration sketch follows the environment details below.)

      Below are our environment and configuration details:

      Environment details:
      Number of nodes: 4
      RAM: 62 GB per node
      Storage: 9 TB per node
      Total cores: 48

      Spark Configuration:

      .set("spark.default.parallelism","64")
      .set("spark.driver.maxResultSize","2G")
      .set("spark.driver.memory","10g")
      .set("spark.rdd.compress","true")
      .set("spark.shuffle.spill.compress","true")
      .set("spark.shuffle.compress","true")
      .set("spark.shuffle.consolidateFiles","true/false")
      .set("spark.shuffle.spill","true/false")

      Data file sizes:
      SHOPPING_TRANSACTION: 19.5 GB
      PARTY_ACCOUNT: 1.4 GB
      SYSTEM_OPTIONS: 11.6 KB
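
      A hedged configuration sketch for the cluster above (4 nodes, 62 GB RAM each, 48 cores total); the values are assumptions that would need measuring. The main point is that the two GROUP BYs and the joins run as Spark SQL shuffles, which are sized by spark.sql.shuffle.partitions (default 200) rather than by spark.default.parallelism:

      // Sketch only - illustrative values, not tuned for this workload.
      val conf = new org.apache.spark.SparkConf()
        .set("spark.executor.memory", "24g")      // assumption: leave headroom for the OS on a 62 GB node
        .set("spark.default.parallelism", "96")   // roughly 2x the 48 cores, used by plain RDD operations
      val sc = new org.apache.spark.SparkContext(conf)
      val hc = new org.apache.spark.sql.hive.HiveContext(sc)

      // Spark SQL joins and aggregations take their shuffle partition count from this setting.
      hc.setConf("spark.sql.shuffle.partitions", "96")

      // The 11.6 KB SYSTEM_OPTIONS table is a natural broadcast-join candidate;
      // spark.sql.autoBroadcastJoinThreshold can pick it up once Hive statistics exist,
      // e.g. via hc.sql("ANALYZE TABLE SYSTEM_OPTION_TABLE COMPUTE STATISTICS noscan")  (assumption: it is a Hive table)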

      Please help us to resolve the above issue.

      Thanks,

People

Assignee: Unassigned
Reporter: irfan