Uploaded image for project: 'Beam'
  1. Beam
  2. BEAM-13266

DataFrame API errors should identify culprit operation in user code

Details

    • Improvement
    • Status: Open
    • P2
    • Resolution: Unresolved
    • None
    • None
    • dsl-dataframe
    • None

    Description

      The DataFrame API aims to catch errors in pipeline code at pipeline construction time as much as possible. Ideally, flawed user code will be caught during proxy generation and bubble up an error from pandas.

      However, there are edge cases where DataFrame operations validate at construction time, but still produce errors at execution time, based on the actual data. For example a user might try to use the modulo operator on a string column. This is a valid operation, but it performs string interpolation, not a modulo as the user intended.

      The above situation will raise an error at execution time, but it has a very obtuse stacktrace, with a tree of evaluate/evaluate_at calls. The culprit from the user's code is nowhere in the astacktrace.

      We should catch errors like this at execution time and add a pointer to the line in the user's
      code that created this expression. We'll likely need to add this metadata to DataFrame expressions.

      Attachments

        Activity

          People

            Unassigned Unassigned
            bhulette Brian Hulette
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: