CASSANDRA-2436

Secondary Index Updates Invalidate Data Set

Details

    • Critical

    Description

      Creating a column family with a secondary index, a column validator, and a default validator, and then renaming or dropping the index, causes read errors and leaves the existing data set unreadable.

      Updating the CF to restore the old index does not resolve the problem. Writes continue to succeed, but any read that touches a row affected by the change fails. The only workaround I have found is to know exactly which columns were affected before the CF change, then iterate through all the rows and insert the same column name with a NULL value. The catch is that you _must_ already know the row keys, because you cannot read the CF to discover them.
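      The workaround described above could be sketched roughly as follows. This is a minimal illustration, not tested against a live cluster: `overwrite_stale_column` is a hypothetical helper name, the empty string stands in for the "NULL value" (an assumption), and `cf` is expected to expose an `insert(key, columns)` method in the style of pycassa's `ColumnFamily.insert`. The row keys have to be supplied by the caller, since reads fail. `FakeColumnFamily` is an in-memory stand-in used only to exercise the loop.

```python
def overwrite_stale_column(cf, row_keys, column_name, null_value=""):
    """Overwrite column_name in every listed row; return the number of rows written.

    cf is any object with an insert(key, columns) method. null_value is an
    assumed stand-in for the "NULL value" mentioned in the report.
    """
    written = 0
    for key in row_keys:
        # Replace the stale value so later reads no longer hit the old bytes.
        cf.insert(key, {column_name: null_value})
        written += 1
    return written


class FakeColumnFamily(object):
    """In-memory stand-in for a column family (hypothetical, for illustration)."""

    def __init__(self):
        self.rows = {}

    def insert(self, key, columns):
        self.rows.setdefault(key, {}).update(columns)


fake = FakeColumnFamily()
# Simulate the row from the report: 'colour' holds the 8-byte form of 1234.
fake.insert("key", {"colour": b"\x00\x00\x00\x00\x00\x00\x04\xd2"})
count = overwrite_stale_column(fake, ["key"], "colour")
print(count)
```

      With a real connection, the same function would be called with the pycassa column family object and the list of known row keys.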

      1) create a secondary index on a column with a validator and a default validator
      2) insert a row
      3) read and verify the row
      4) update the CF/index/name/validator
      5) read the CF and get an error (CLI or Pycassa)

      CLI Commands to create the row and CF/Index

      create column family cf_testing with comparator=UTF8Type and default_validation_class=UTF8Type and column_metadata=[{column_name: colour, validation_class: LongType, index_type: KEYS}];

      set cf_testing['key']['colour']='1234';
      list cf_testing;

      update column family cf_testing with comparator=UTF8Type and default_validation_class=UTF8Type and column_metadata=[{column_name: color, validation_class: LongType, index_type: KEYS}];

      ERROR from the CLI:

      list cf_testing;
      Using default limit of 100
      -------------------
      RowKey: key
      invalid UTF8 bytes 00000000000004d2
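
      The bytes in this error can be checked by hand. The sketch below shows that `00000000000004d2` is the 8-byte big-endian encoding of 1234, the value inserted earlier, and that the same bytes are not valid UTF-8. One plausible reading (an assumption, not confirmed in this report) is that after the update the `colour` column no longer has its own metadata, so its stored value is read with the default UTF8Type validator.

```python
import struct

# Hex payload reported by the CLI: "invalid UTF8 bytes 00000000000004d2"
raw = bytes(bytearray.fromhex("00000000000004d2"))

# Read as a big-endian signed 64-bit integer, the bytes give the inserted value.
(as_long,) = struct.unpack(">q", raw)
print(as_long)

# Read as UTF-8 text, decoding fails: 0xd2 opens a two-byte sequence
# but no continuation byte follows it.
try:
    raw.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
print(decodes_as_utf8)
```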

      Here is the Pycassa client code that shows this error too.

      badindex.py

      #!/usr/local/bin/python2.7

      import sys

      import pycassa

      def main():
          # Connect to the keyspace that holds cf_testing.
          try:
              keyspace = "badindex"
              serverPoolList = ['localhost:9160']
              pool = pycassa.connect(keyspace, serverPoolList)
          except Exception:
              print "couldn't get a connection"
              sys.exit(1)

          # Reading the affected row is what triggers the error.
          cfname = "cf_testing"
          cf = pycassa.ColumnFamily(pool, cfname)
          results = cf.get_range(start='key', finish='key', row_count=1)
          for key, columns in results:
              print key, '=>', columns

      if __name__ == "__main__":
          main()

          People

            Assignee: Unassigned
            Reporter: Dexter Fryar (u20110407)