Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Incomplete
-
None
-
None
Description
Consider an array_compact UDF that removes null values from an array. There is no easy way to implement this UDF because an explicit return type with a TypeTag is required by code generation.
The interesting observation here is that the output type of `array_compact` is the same as its input type. In general, there is a broad class of UDFs, especially collection-oriented ones, whose output types are functions of the input types. In our Spark work we have found collection manipulation UDFs to be very powerful for cleaning up data and substantially improving performance, in particular, avoiding explode followed by groupBy. It would be nice if Spark made adding these types of UDFs very easy.
I won't go into possible ways to implement this under the covers as there are many options but I do want to point out that it is possible to communicate the right type information to Spark without changing the signature for UDF registration using placeholder types, e.g.,
sealed trait UDFArgumentAtPosition case class ArgPos1 extends UDFArgumentAtPosition case class ArgPos2 extends UDFArgumentAtPosition // ... case class Struct[ArgPos <: UDFArgumentAtPosition](value: Row) case class ArrayElement[ArgPos <: UDFArgumentAtPosition, A : TypeTag](value: A) case class MapKey[ArgPos <: UDFArgumentAtPosition, A : TypeTag](value: A) case class MapValue[ArgPos <: UDFArgumentAtPosition, A : TypeTag](value: A) // Functions are stubbed just to show compilation succeeds def arrayCompact[A : TypeTag](xs: Seq[A]): ArgPos1 = null def arraySum[A : Numeric : TypeTag](xs: Seq[A]): ArrayElement[ArgPos1, A] = ArrayElement(implicitly[Numeric[A]].zero)