HIVE-20270

Don't serialize hashCode for groupByKey

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: Spark

    Description

      Similar to HIVE-20032, but for groupByKey. The tricky part with groupByKey is that the hashCode must be preserved until the key has been partitioned (via the HashPartitioner); after that it no longer needs to be preserved. The groupByKey operator in Spark does require a hashCode, since it puts everything in a map, but that hashCode can differ from the one specified in HiveKey. The hashCode in HiveKey matters only for determining which partition the key is assigned to.

      The drawback is that computing the hashCode for each HiveKey might require more CPU resources, so we should profile the change to check.
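
      The split described above can be illustrated with a minimal, self-contained sketch. It does not use Hive or Spark classes: SketchKey, GroupingKey, and PartitionThenGroup are hypothetical names made up for this example, and the non-negative modulo in partitionFor is only assumed to mirror what a hash partitioner does. The point it shows is that the externally assigned hash is needed only to choose a partition, while the grouping map can work from a hash recomputed over the key bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for HiveKey: key bytes plus an externally assigned hash.
final class SketchKey {
    final byte[] bytes;
    final int partitionHash; // only needed to pick a partition

    SketchKey(String s, int partitionHash) {
        this.bytes = s.getBytes(StandardCharsets.UTF_8);
        this.partitionHash = partitionHash;
    }
}

// Hypothetical key as seen after partitioning: the assigned hash is gone.
final class GroupingKey {
    private final byte[] bytes;

    GroupingKey(byte[] bytes) {
        this.bytes = bytes;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof GroupingKey && Arrays.equals(bytes, ((GroupingKey) o).bytes);
    }

    // Recomputed from the bytes on demand; this is the extra CPU cost to profile.
    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }
}

final class PartitionThenGroup {
    // Assumption: partition index is the non-negative modulo of the assigned hash.
    static int partitionFor(SketchKey key, int numPartitions) {
        return Math.floorMod(key.partitionHash, numPartitions);
    }

    // Groups values within one partition using only the key bytes.
    static Map<GroupingKey, List<String>> group(List<SketchKey> keys, List<String> values) {
        Map<GroupingKey, List<String>> groups = new HashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            groups.computeIfAbsent(new GroupingKey(keys.get(i).bytes), k -> new ArrayList<>())
                  .add(values.get(i));
        }
        return groups;
    }
}
```

      In this sketch the grouping result depends only on byte equality, so any hash that is consistent with equals would do for the map; whether recomputing it is cheaper than serializing the original value is the open question to profile.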

          People

            stakiar Sahil Takiar
