Apache Arrow / ARROW-6416

[Python] Confusing API & documentation regarding chunksizes

Details

    Description

      In my opinion, the Python API and documentation regarding chunk sizes are confusing.

      Example:

      https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchFileWriter.html#pyarrow.RecordBatchFileWriter.write_table

      def write_table(self, Table table, chunksize=None):
      """
      Write RecordBatch to stream

      Parameters
      ----------
      batch : RecordBatch
       
      This suggests that the file will be written with a fixed chunk size, when in fact the chunksize parameter is only an upper bound on the size of the chunks to be written.
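
To make the difference between a fixed chunk size and an upper bound concrete, here is a minimal pure-Python sketch. It is a model of the behaviour described above, not pyarrow's implementation: `split_with_upper_bound` is a hypothetical helper, and it assumes that chunks larger than the limit are split while smaller chunks are left as they are and never merged.

```python
from typing import List


def split_with_upper_bound(chunk_lengths: List[int], max_chunksize: int) -> List[int]:
    """Model of upper-bound chunking (hypothetical helper, not a pyarrow API).

    Each existing chunk is split into pieces of at most max_chunksize rows.
    Small chunks are passed through unchanged and are never merged.
    """
    if max_chunksize <= 0:
        raise ValueError("max_chunksize must be positive")
    result = []
    for length in chunk_lengths:
        remaining = length
        while remaining > 0:
            step = min(remaining, max_chunksize)
            result.append(step)
            remaining -= step
    return result


# A table whose data already sits in chunks of 5 and 2 rows.
sizes = split_with_upper_bound([5, 2], max_chunksize=3)
print(sizes)  # [3, 2, 2]
```

Under these assumptions the output chunks are not uniformly 3 rows long, which is why a name such as max_chunksize describes the parameter better than chunksize.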

      In my opinion this parameter should be renamed max_chunksize to avoid confusion and reflect its true purpose.

      This would also improve naming consistency in the code base, since the C++ implementation already names this parameter max_chunksize in cpp/src/arrow/ipc/writer.cc:

      Status RecordBatchWriter::WriteTable(const Table& table, int64_t max_chunksize)

      Similarly, the parameter should be renamed in pyarrow.Table.to_batches(self, chunksize=None).

       

            People

              ARF1 ARF

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 0.5h