Apache Arrow / ARROW-11250

[Python] Inconsistent behavior calling ds.dataset()


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version: 2.0.0
    • Fix Version: 3.0.0
    • Component: Python

    Description

      In a Jupyter notebook, I have noticed that sometimes I am not able to read a dataset which certainly exists on Azure Blob Storage. The filesystem is created with fsspec:

       

      fs = fsspec.filesystem(protocol="abfs", account_name=account_name, account_key=account_key)
      
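
      A quick sanity check is to ask this same filesystem object about the path before handing it to pyarrow; the calls below are standard fsspec methods (a minimal sketch, reusing the fs object created above):

      # Sanity check using standard fsspec methods on the fs object from above.
      print(fs.exists("dev/test-split"))  # True if the path resolves on the account
      print(fs.isdir("dev/test-split"))   # True if it is a directory (a dataset root)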

       
      One example of this is reading a dataset in one cell:

       

      ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)

       

      Then in another cell I try to read the same dataset:

       

      ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
      
      
      ---------------------------------------------------------------------------
      FileNotFoundError                         Traceback (most recent call last)
      <ipython-input-514-bf63585a0c1b> in <module>
      ----> 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
      
      /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
          669     # TODO(kszucs): support InMemoryDataset for a table input
          670     if _is_path_like(source):
      --> 671         return _filesystem_dataset(source, **kwargs)
          672     elif isinstance(source, (tuple, list)):
          673         if all(_is_path_like(elem) for elem in source):
      
      /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
          426         fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
          427     else:
      --> 428         fs, paths_or_selector = _ensure_single_source(source, filesystem)
          429 
          430     options = FileSystemFactoryOptions(
      
      /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _ensure_single_source(path, filesystem)
          402         paths_or_selector = [path]
          403     else:
      --> 404         raise FileNotFoundError(path)
          405 
          406     return filesystem, paths_or_selector
      
      FileNotFoundError: dev/test-split
      

       

      If I reset the kernel, it works again. It also works if I change the path slightly, for example by adding a "/" at the end (so basically it just does not work if I read the same dataset twice):

       

      ds.dataset("dev/test-split/", partitioning="hive", filesystem=fs)
      
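
      If the stale state lives in fsspec's directory-listing cache (an assumption, not something confirmed here), clearing that cache between calls would be a lighter workaround than resetting the kernel; fs.invalidate_cache() is a standard fsspec method:

      # Sketch of a possible workaround, assuming the repeated-read failure comes
      # from a stale directory-listing cache on the fsspec filesystem (unconfirmed).
      fs.invalidate_cache()  # optionally scope it, e.g. fs.invalidate_cache("dev")
      dataset = ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)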

       

       

      The other strange behavior I have noticed is that reading a dataset inside my Jupyter notebook is fast:

       

      %%time
      dataset = ds.dataset(
          "dev/test-split",
          partitioning=ds.partitioning(pa.schema([("date", pa.date32())]), flavor="hive"),
          filesystem=fs,
          exclude_invalid_files=False,
      )

      CPU times: user 1.98 s, sys: 0 ns, total: 1.98 s
      Wall time: 2.58 s

       

      Now, on the exact same server, when I try to run the same code against the same dataset in Airflow, it takes over 3 minutes (comparing the timestamps in my logs between right before I read the dataset and immediately after the dataset is available to filter):

      [2021-01-14 03:52:04,011] INFO - Reading dev/test-split
      [2021-01-14 03:55:17,360] INFO - Processing dataset in batches
      

      This is probably not a pyarrow issue, but what are some potential causes that I can look into? I have one example where it takes 9 seconds to read the dataset in Jupyter, but then 11 minutes in Airflow. I don't know what to investigate - as I mentioned, the Jupyter notebook and Airflow are on the same server, and both are deployed using Docker. Airflow is using the CeleryExecutor.
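
      One way to narrow this down is to time the object-store listing separately from the Arrow dataset construction, since dataset discovery is dominated by listing and stat-ing files; a sketch, assuming the same fs object and path as above:

      # Sketch: separate the fsspec listing cost from the dataset construction cost.
      import time

      start = time.perf_counter()
      files = fs.find("dev/test-split")  # recursive listing via fsspec
      print(f"listing {len(files)} files took {time.perf_counter() - start:.2f} s")

      start = time.perf_counter()
      dataset = ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
      print(f"dataset construction took {time.perf_counter() - start:.2f} s")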

       

          People

            Assignee: Unassigned
            Reporter: Lance Dacey (ldacey)
            Votes: 0
            Watchers: 4
