  1. Beam
  2. BEAM-10573

CSV files are loaded several times if they are too large

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P1
    • Resolution: Won't Fix
    • Affects Version/s: 2.22.0
    • Fix Version/s: Missing
    • Component/s: io-py-files
    • Labels: None
    • Important

    Description

      I have this small sample:

       

      import csv

      import apache_beam as beam
      import apache_beam.io.filebasedsource


      class CsvFileSource(apache_beam.io.filebasedsource.FileBasedSource):
          def read_records(self, file_name, range_tracker):
              # Note: range_tracker is not used; the whole file is read on every call.
              with open(file_name, 'r', newline='') as file:
                  reader = csv.DictReader(file)
                  print("Load CSV file")
                  for rec in reader:
                      yield rec


      if __name__ == '__main__':
          with beam.Pipeline() as p:
              count_feature = (
                  p
                  | 'create' >> beam.io.Read(CsvFileSource("myFile.csv"))
                  | 'count element' >> beam.combiners.Count.Globally()
                  | 'Print' >> beam.Map(print)
              )

       

       

      For some reason, if the CSV file is too large, it is loaded several times.

      For example, a file with 80000 rows (18.5 MB) is loaded 5 times.

      At the end, I have 400000 elements in my PCollection.
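
One plausible explanation, which this report does not confirm: `FileBasedSource` is splittable by default, so a runner may divide a large file into several byte ranges and call `read_records` once per range. A `read_records` that never consults its `range_tracker` would then return the whole file for every range, which matches 5 x 80000 = 400000. The stdlib-only sketch below simulates that effect without Beam. The helpers `make_csv`, `split_into_ranges`, `read_ignoring_range` and `read_unsplit` are hypothetical names made up for illustration, and the 1000-row file is a stand-in for the real one.

```python
import csv
import io


def make_csv(num_rows):
    """Build an in-memory CSV with a header line and num_rows data rows."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "value"])
    for i in range(num_rows):
        writer.writerow([i, i * 2])
    return buf.getvalue()


def split_into_ranges(size, num_splits):
    """Divide [0, size) into contiguous (start, stop) ranges (hypothetical helper)."""
    step = -(-size // num_splits)  # ceiling division
    return [(start, min(start + step, size)) for start in range(0, size, step)]


def read_ignoring_range(text, start, stop):
    """Mimics the reported read_records: start and stop are never consulted."""
    yield from csv.DictReader(io.StringIO(text))


def read_unsplit(text):
    """Reads the source once, as a single unsplit unit."""
    return list(csv.DictReader(io.StringIO(text)))


text = make_csv(1000)
ranges = split_into_ranges(len(text), 5)  # assumption: 5 splits, as in the report

# Every range triggers a full read, so the records are multiplied by the split count.
records = [
    rec
    for start, stop in ranges
    for rec in read_ignoring_range(text, start, stop)
]
print(len(ranges), len(records), len(read_unsplit(text)))
```

If this is indeed the cause, two approaches should avoid the duplication in Beam itself: passing `splittable=False` to the `FileBasedSource` constructor so the file is read as a single unit, or making `read_records` honor the positions supplied by `range_tracker`.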

       

          People

            Assignee: Unassigned
            Reporter: Equ1nox julien richard