Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Duplicate
Description
The new Python FileSystem API is nice, but it seems very verbose to use.
The documentation of the old FS API is here.
Here are some examples.
Filesystem access:
Before:
fs.ls()
fs.mkdir()
fs.rmdir()
Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()
What is the advantage of the longer method names? The short ones seem just as clear and are much easier to use. This looks like an easy change. It is also consistent with what hdfs does in its fs API, and it works naturally with a local filesystem.
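If renaming is not wanted, thin aliases would already help. A minimal sketch of such aliases, assuming the new-API names shown above (the parameter names here are illustrative, not the actual signatures):

class ShortNameAliases:
    # Mixin adding the short hdfs-style names as thin wrappers
    # around the longer new-API methods.
    def ls(self, paths_or_selector):
        return self.get_target_stats(paths_or_selector)

    def mkdir(self, path, recursive=True):
        return self.create_dir(path, recursive=recursive)

    def rmdir(self, path):
        return self.delete_dir(path)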
File opening:
Before:
with fs.open(path, mode='rb', buffer_size=None)
Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()
It seems more natural to match Python's standard open() function, which works for local file access as well. Not sure whether this is easy to do, as there is a `_wrap_output_stream` method.
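As a sketch, a Python-style open could dispatch on the mode to the new stream methods (the mode handling and the open_append_stream case are assumptions, not pyarrow's actual API):

def fs_open(fs, path, mode="rb"):
    # Map built-in open() style modes to the new stream methods.
    if mode in ("rb", "r"):
        return fs.open_input_stream(path)
    if mode in ("wb", "w"):
        return fs.open_output_stream(path)
    if mode in ("ab", "a"):
        return fs.open_append_stream(path)
    raise ValueError("unsupported mode: %r" % (mode,))

# Usage would then mirror the built-in open():
# with fs_open(fs, "data.bin", "wb") as f:
#     f.write(b"payload")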
Possible solutions
- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods, but I think that would make the FileSystem class a bit messy because there would always be 2 methods doing the same work
- Make everything compatible with fsspec and reference the spec; see https://issues.apache.org/jira/browse/ARROW-7102
I like the idea of the https://github.com/intake/filesystem_spec repo. Some comments on the proposed solutions there:
- Make an fsspec wrapper for pyarrow.fs => seems strange to me; it would mean wrapping, in yet another repo, a FileSystem that is not good enough on its own
- Make a pyarrow.fs wrapper for fsspec => if the wrapper becomes the documented "official" pyarrow FileSystem, I think that is fine; otherwise it would be yet another wrapper on top of the pyarrow "official" fs
Tensorflow RFC on FileSystems
Tensorflow is also doing some standardization work on their FileSystem:
https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations
It is not clear (to me) what they will do with the Python file API, though. It seems like they will also just wrap the C code back into tf.Gfile.
Other considerations on FS ergonomics
In the long run I would also like to enhance the FileSystem API with more methods that build on the basic ones to provide new features, for example (sketches of some of these follow after the list):
- introduce put and get on top of the streams, to directly upload/download files
- introduce touch from dask/hdfs3
- introduce du from dask/hdfs3
- check if the selector works with globs, or add https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes; already implemented by https://github.com/dask/hdfs3/blob/master/hdfs3/utils.py#L96); it would allow directly using some Python APIs like json.dump:
with fs.open(path, "wb") as fd:
    res = {"a": "bc"}
    json.dump(res, fd)

instead of

with fs.open(path, "wb") as fd:
    res = {"a": "bc"}
    fd.write(json.dumps(res))

or, as currently with the old API (which requires encoding each time; untested with the new one):

with fs.open(path, "wb") as fd:
    res = {"a": "bc"}
    fd.write(json.dumps(res).encode())
- implement readline, needed for:

with hdfs.open("file", "wb") as outfile:
    pickle.dump({"a": "b"}, outfile)
with hdfs.open("file", "rb") as infile:
    pickle.load(infile)
- not clear how to make this also work when reading from files
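As a sketch of the put/touch/du items above, built only on the basic methods (the helper names and signatures are illustrative; get_target_stats and the .size attribute are taken from the examples earlier in this issue):

import shutil

def put(fs, local_path, remote_path):
    # Upload a local file by copying it into an output stream.
    with open(local_path, "rb") as src, fs.open_output_stream(remote_path) as dst:
        shutil.copyfileobj(src, dst)

def touch(fs, path):
    # Create an empty file, as dask/hdfs3 does.
    with fs.open_output_stream(path):
        pass

def du(fs, paths):
    # Sum the sizes reported by the stats the new API already returns.
    return sum(stat.size for stat in fs.get_target_stats(paths))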
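For the string-writing and readline items, one possible approach (a sketch, assuming the returned streams behave like standard binary file objects) is to wrap them in io.TextIOWrapper, similar to what the dask/hdfs3 utils linked above do:

import io

def open_text(fs, path, mode="r", encoding="utf-8"):
    # Wrap a binary stream so that str data and readline work on top of it.
    if mode == "r":
        raw = fs.open_input_stream(path)
    elif mode == "w":
        raw = fs.open_output_stream(path)
    else:
        raise ValueError("unsupported mode: %r" % (mode,))
    return io.TextIOWrapper(raw, encoding=encoding)

# json.dump can then write strings directly:
# with open_text(fs, "res.json", "w") as fd:
#     json.dump({"a": "bc"}, fd)
# and readline is available on the reading side:
# with open_text(fs, "res.json", "r") as fd:
#     line = fd.readline()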
Issue Links
- is related to: ARROW-7102 [Python] Make filesystems compatible with fsspec (Resolved)
- relates to: ARROW-8780 [Python] A fsspec-compatible wrapper for pyarrow.fs filesystems (Resolved)
1. [Python] Replace FileSystem.get_target_stats by FileSystem.ls/info (Closed, Unassigned)