Uploaded image for project: 'Beam'
  1. Beam
  2. BEAM-7131

Spark portable runner appears to be repeating work (in TFX example)

Details

    Description

      I've been trying to run the TFX Chicago taxi example [1] on the Spark portable runner. TFDV works fine, but the preprocess step (preprocess_flink.sh [2]) fails with the following error:

      RuntimeError: AlreadyExistsError: file already exists [while running 'WriteTransformFn/WriteTransformFn']

      Assets are being written multiple times to different temp directories, which is okay, but the error occurs when they are copied to the same permanent output directory. Specifically, the copy tree operation in transform_fn_io.py [3] is run twice with the same output directory. The error doesn't occur when that code is modified to allow overwriting existing files, but that's only a shallow fix. While the TF transform should probably be made idempotent, this is also an issue with the Spark runner, which shouldn't be repeating work like this regularly (in the absence of a failure condition).

      [1] https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi

      [2] https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/preprocess_flink.sh

      [3] https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py#L33-L45

      Attachments

        Activity

          People

            ibzib Kyle Weaver
            ibzib Kyle Weaver
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 7h
                7h