CASSANDRA-10835: CqlInputFormat creates too small splits for map Hadoop tasks


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Normal
    • Resolution: Fixed
    • Fix Version/s: 2.2.5, 3.0.3, 3.2
    • Component/s: None
    • Labels: None
    • Severity: Normal

    Description

      In C* versions < 2.2, CqlInputFormat used the number of rows to define the split size.
      The default split size was 64K rows:

          private static final int DEFAULT_SPLIT_SIZE = 64 * 1024;
      

      The doc:

      * You can also configure the number of rows per InputSplit with
       *   ConfigHelper.setInputSplitSize. The default split size is 64k rows.
       

      The new split algorithm assumes that the split size is in bytes, so by default (or with old configs) it creates really small Hadoop map tasks.
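
      For illustration, here is a minimal sketch (the keyspace, table and address are hypothetical) of a typical Hadoop job that configures the split size through ConfigHelper, as the doc above describes. Against C* < 2.2 the configured value meant rows per split; the new algorithm reads the same value as a byte count, so 64 * 1024 now yields ~64 KB splits and a flood of tiny map tasks:

          import org.apache.cassandra.hadoop.ConfigHelper;
          import org.apache.cassandra.hadoop.cql3.CqlInputFormat;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.mapreduce.Job;

          public class SplitSizeExample
          {
              public static void main(String[] args) throws Exception
              {
                  Job job = Job.getInstance(new Configuration(), "job-over-cassandra");
                  job.setInputFormatClass(CqlInputFormat.class);

                  Configuration conf = job.getConfiguration();
                  ConfigHelper.setInputInitialAddress(conf, "127.0.0.1");
                  ConfigHelper.setInputColumnFamily(conf, "my_keyspace", "my_table");
                  ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner");

                  // Same numeric value, two meanings:
                  // < 2.2: 65536 rows per split; new algorithm: 65536 bytes per split.
                  ConfigHelper.setInputSplitSize(conf, 64 * 1024);
              }
          }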

      There are two ways to fix it:
      1. Update the doc and increase the default value to something like 16MB (see the sketch after this list).
      2. Make C* compatible with the older versions.
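
      A minimal sketch of what option 1 could look like, assuming the value stays byte-based and only the default (and the doc) change:

          // Hypothetical option 1: keep the byte-based interpretation, default to ~16 MB per split
          private static final int DEFAULT_SPLIT_SIZE = 16 * 1024 * 1024;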

      I like the second option, as it will not surprise people who upgrade from older versions. I do not expect many new users to start using Hadoop.

      Attachments

        1. cassandra-2.2-10835-2.txt
          6 kB
          Artem Aliev
        2. cassandra-3.0.1-10835.txt
          2 kB
          Artem Aliev
        3. cassandra-3.0.1-10835-2.txt
          6 kB
          Artem Aliev


          People

            Assignee: Unassigned
            Reporter: Artem Aliev (artem.aliev)
            Joshua McKenzie
            Votes: 0
            Watchers: 7
