  1. Beam
  2. BEAM-8734

Optimize the inference of element_type when writing a list of objects to FileBasedCache

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: P3
    • Resolution: Fixed
    • Affects Version/s: 2.16.0
    • Fix Version/s: Missing
    • Component/s: sdk-py-core
    • Labels: None

    Description

      The proposed FileBasedCache.write method allows the user to write a list of arbitrary objects to a cache. The element_type and the appropriate coder for the list are inferred using the apache_beam.testing.datatype_inference.infer_element_type function. This works well for small to moderately sized lists, but is likely to be very inefficient when a large amount of data is written.

      Two approaches to solving this issue have been considered:

      1. We could infer the element_type from only the first N elements (e.g. the first 100) of the provided list. In the majority of cases this should produce the correct element_type for the whole list, since every element is likely to have the same data type. Where the inferred element_type is incorrect, we could catch the resulting errors and infer the element_type again from a larger portion of the data.

      2. If inferring the element_type in the first call to FileBasedCache.write takes too long, we could instruct the user to split the write into two calls: a first call with a small but representative sample of the data, and a second call with the rest. Since the element_type is inferred only the first time that anything is written to a cache, subsequent calls would not have the same constraint on the number of elements.
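
      The first approach could look roughly like the sketch below. It is illustrative only and does not use the Beam API: infer_type_set and encode are hypothetical stand-ins for infer_element_type and for a coder, and write_with_sampled_inference is a made-up name. The assumption is that an element whose type was missed by the sample causes a TypeError at encoding time, which triggers a retry with a larger sample.

```python
from itertools import islice


def infer_type_set(elements):
    # Hypothetical stand-in for infer_element_type: the set of concrete
    # Python types observed among the given elements.
    return frozenset(type(e) for e in elements)


def encode(element, allowed_types):
    # Hypothetical stand-in for a coder: rejects any element whose type
    # was not covered by the inferred type set.
    if type(element) not in allowed_types:
        raise TypeError("unexpected type: %s" % type(element).__name__)
    return repr(element)


def write_with_sampled_inference(elements, sample_size=100):
    # Infer from the first `sample_size` elements only. If encoding the
    # full list fails, double the sample and retry, up to the whole list.
    n = max(1, sample_size)
    while True:
        inferred = infer_type_set(islice(elements, n))
        try:
            return inferred, [encode(e, inferred) for e in elements]
        except TypeError:
            if n >= len(elements):
                raise
            n = min(len(elements), n * 2)
```

      For a homogeneous list the sample is inspected once and no retry happens. A list whose odd element sits beyond the sample costs one failed encoding pass before the wider inference succeeds, so the approach only pays off when such lists are rare.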


          People

            Assignee: Unassigned
            Reporter: Alexey Strokach (ostrokach)


              Time Tracking

                Estimated: 72h
                Remaining: 72h
                Logged: Not Specified