Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Duplicate
Description
The new Python FileSystem API is nice, but it seems very verbose to use.
The documentation of the old FS API is here.
Here are some examples.
Filesystem access:
Before:
fs.ls()
fs.mkdir()
fs.rmdir()
Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()
What is the advantage of the longer method names? The short ones seem just as clear and are much easier to use. This looks like an easy change. It is also consistent with what hdfs does in its fs API, and it works naturally with a local filesystem.
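If renaming is not wanted, thin aliases would already help. A minimal sketch of such aliases, assuming the new-API names shown above (the parameter names here are illustrative, not the actual signatures):

class ShortNameAliases:
    # Mixin adding the short hdfs-style names as thin wrappers
    # around the longer new-API methods.
    def ls(self, paths_or_selector):
        return self.get_target_stats(paths_or_selector)

    def mkdir(self, path, recursive=True):
        return self.create_dir(path, recursive=recursive)

    def rmdir(self, path):
        return self.delete_dir(path)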
File opening:
Before:
with fs.open(path, mode='rb', buffer_size=None)
Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()
It seems more natural to match Python's standard open() function, which works for local file access as well. Not sure whether this is easy to do, as there is a `_wrap_output_stream` method.
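As a sketch, a Python-style open could dispatch on the mode to the new stream methods (the mode handling and the open_append_stream case are assumptions, not pyarrow's actual API):

def fs_open(fs, path, mode="rb"):
    # Map built-in open() style modes to the new stream methods.
    if mode in ("rb", "r"):
        return fs.open_input_stream(path)
    if mode in ("wb", "w"):
        return fs.open_output_stream(path)
    if mode in ("ab", "a"):
        return fs.open_append_stream(path)
    raise ValueError("unsupported mode: %r" % (mode,))

# Usage would then mirror the built-in open():
# with fs_open(fs, "data.bin", "wb") as f:
#     f.write(b"payload")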
Possible solutions
- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods, but I think that would make the FileSystem class a bit messy because there would always be 2 methods doing the same work
- Make everything compatible with fsspec and reference the spec; see https://issues.apache.org/jira/browse/ARROW-7102
I like the idea of the https://github.com/intake/filesystem_spec repo. Some comments on the proposed solutions there:
- Make an fsspec wrapper for pyarrow.fs => seems strange to me; it would mean wrapping, in yet another repo, a FileSystem that is not good enough on its own
- Make a pyarrow.fs wrapper for fsspec => if the wrapper becomes the documented "official" pyarrow FileSystem, I think that is fine; otherwise it would be yet another wrapper on top of the pyarrow "official" fs
Tensorflow RFC on FileSystems
Tensorflow is also doing some standardization work on their FileSystem:
https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations
It is not clear (to me) what they will do with the Python file API, though. It seems like they will also just wrap the C code back into tf.Gfile.
Other considerations on FS ergonomics
In the long run I would also like to enhance the FileSystem API with more methods that build on the basic ones to provide new features, for example (sketches of some of these follow after the list):
- introduce put and get on top of the streams, to directly upload/download files
- introduce touch from dask/hdfs3
- introduce du from dask/hdfs3
- check if the selector works with globs, or add https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes; already implemented by https://github.com/dask/hdfs3/blob/master/hdfs3/utils.py#L96); it would allow directly using some Python APIs like json.dump:
with fs.open(path, "wb") as fd:
    res = {"a": "bc"}
    json.dump(res, fd)

instead of

with fs.open(path, "wb") as fd:
    res = {"a": "bc"}
    fd.write(json.dumps(res))

or, as currently with the old API (which requires encoding each time; untested with the new one):

with fs.open(path, "wb") as fd:
    res = {"a": "bc"}
    fd.write(json.dumps(res).encode())
- implement readline, needed for:

with hdfs.open("file", "wb") as outfile:
    pickle.dump({"a": "b"}, outfile)
with hdfs.open("file", "rb") as infile:
    pickle.load(infile)
- not clear how to make this also work when reading from files
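As a sketch of the put/touch/du items above, built only on the basic methods (the helper names and signatures are illustrative; get_target_stats and the .size attribute are taken from the examples earlier in this issue):

import shutil

def put(fs, local_path, remote_path):
    # Upload a local file by copying it into an output stream.
    with open(local_path, "rb") as src, fs.open_output_stream(remote_path) as dst:
        shutil.copyfileobj(src, dst)

def touch(fs, path):
    # Create an empty file, as dask/hdfs3 does.
    with fs.open_output_stream(path):
        pass

def du(fs, paths):
    # Sum the sizes reported by the stats the new API already returns.
    return sum(stat.size for stat in fs.get_target_stats(paths))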
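For the string-writing and readline items, one possible approach (a sketch, assuming the returned streams behave like standard binary file objects) is to wrap them in io.TextIOWrapper, similar to what the dask/hdfs3 utils linked above do:

import io

def open_text(fs, path, mode="r", encoding="utf-8"):
    # Wrap a binary stream so that str data and readline work on top of it.
    if mode == "r":
        raw = fs.open_input_stream(path)
    elif mode == "w":
        raw = fs.open_output_stream(path)
    else:
        raise ValueError("unsupported mode: %r" % (mode,))
    return io.TextIOWrapper(raw, encoding=encoding)

# json.dump can then write strings directly:
# with open_text(fs, "res.json", "w") as fd:
#     json.dump({"a": "bc"}, fd)
# and readline is available on the reading side:
# with open_text(fs, "res.json", "r") as fd:
#     line = fd.readline()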
Issue Links
- is related to: ARROW-7102 [Python] Make filesystems compatible with fsspec (Resolved)
- relates to: ARROW-8780 [Python] A fsspec-compatible wrapper for pyarrow.fs filesystems (Resolved)
1. [Python] Replace FileSystem.get_target_stats by FileSystem.ls/info (Closed, Unassigned)