  1. Apache Arrow
  2. ARROW-7584

[Python] Improve ergonomics of new FileSystem API


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Python

    Description

      The new Python FileSystem API is nice but seems to be very verbose to use.

      The documentation of the old FS API is here

      Here are some examples

      Filesystem access:

      Before:
      fs.ls()
      fs.mkdir()
      fs.rmdir()

      Now:
      fs.get_target_stats()
      fs.create_dir()
      fs.delete_dir()

      What is the advantage of the longer method names? The short ones are clear and much easier to use, so this seems like an easy change. The short names are also consistent with what hdfs does in the fs API, and they work naturally with a local filesystem.

      File opening:

      Before:
      fs.open(path, mode='rb', buffer_size=None)

      Now:
      fs.open_input_file()
      fs.open_input_stream()
      fs.open_output_stream()

      It seems more natural to follow Python's standard open function, which also works for local file access. I am not sure whether this is easy to do, given the `_wrap_output_stream` method.

      Possible solutions

      • If the current Python API is still unused, we could just rename the methods.
      • We could keep everything as is and add some alias methods. I think this would make the FileSystem class a bit messy, because there would always be two methods that do the same work.
      • Make everything compatible with fsspec and reference the spec, see https://issues.apache.org/jira/browse/ARROW-7102.
        I like the idea of a https://github.com/intake/filesystem_spec repo. Some comments on the solutions proposed there:
        Make a fsspec wrapper for pyarrow.fs => this seems strange to me: it would mean wrapping, in yet another repo, a FileSystem that is not good enough on its own
        Make a pyarrow.fs wrapper for fsspec => if the wrapper becomes the documented "official" pyarrow FileSystem, that is fine I think; otherwise it would be yet another wrapper on top of the "official" pyarrow fs

      Tensorflow RFC on FileSystems

      Tensorflow is also doing some standardization work on their FileSystem:
      https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations

      It is not clear (to me) what they will do with the Python file API, though. It seems like they will also just wrap the C code back into tf.Gfile.

      Other considerations on FS ergonomics

      In the long run I would also like to enhance the FileSystem API and add more methods that build on the basic ones to provide new features, for example:

      with fs.open(path, "wb") as fd:
        res = {"a": "bc"}
        json.dump(res, fd)
      

      instead of

      with fs.open(path, "wb") as fd:
        res = {"a": "bc"}
        fd.write(json.dumps(res))
      

      or, as currently (with the old API, which required encoding each time; untested with the new one)

      with fs.open(path, "wb") as fd:
        res = {"a": "bc"}
        fd.write(json.dumps(res).encode())
      
      with hdfs.open("file", 'wb') as outfile:
        pickle.dump({"a": "b"}, outfile)
      
      with hdfs.open("file", 'rb') as infile:
        pickle.load(infile)
      
      • not clear how to make this also work when reading from files
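One possible approach for both directions, sketched under assumptions: wrap the binary stream in the standard library's io.TextIOWrapper so that text-based json.dump and json.load work without manual encoding. `dump_json` and `load_json` are hypothetical helper names, and the sketch assumes the stream returned by the filesystem implements enough of the standard binary file interface for TextIOWrapper; it is demonstrated here with io.BytesIO only.

```python
import io
import json


def dump_json(obj, binary_stream, encoding="utf-8"):
    """Write obj as JSON text to a stream opened in binary mode."""
    text = io.TextIOWrapper(binary_stream, encoding=encoding)
    json.dump(obj, text)
    # detach() flushes pending text and leaves the underlying
    # stream open, so the caller's `with` block still owns it.
    text.detach()


def load_json(binary_stream, encoding="utf-8"):
    """Read JSON text from a stream opened in binary mode."""
    text = io.TextIOWrapper(binary_stream, encoding=encoding)
    try:
        return json.load(text)
    finally:
        text.detach()


if __name__ == "__main__":
    buf = io.BytesIO()  # stands in for a file opened with "wb"
    dump_json({"a": "bc"}, buf)
    buf.seek(0)  # rewind, as if reopening with "rb"
    print(load_json(buf))
```

The same wrapping idea is not needed for pickle, which already reads and writes bytes.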

            People

              Assignee: Unassigned
              Reporter: Fabian Höring (fhoering)
              Votes: 0
              Watchers: 5

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 2h 50m