Uploaded image for project: 'Thrift'
  1. Thrift
  2. THRIFT-4638

When I use Thrift to serialize map in C++ to disk, and then de-serialize it using Python, I do not get back the same object

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • None

    Description

      Summary: when I use Thrift to serialize map in C++ to disk, and then de-serialize it using Python, I do not get back the same object.

       

      See also https://stackoverflow.com/questions/52377410/when-i-use-thrift-to-serialize-map-in-c-to-disk-and-then-de-serialize-it-usin

      I have the following Thrift model file, which (as you can see) contains a map indexed by a struct:

      struct Coordinate

      { 1: required i32 x; 2: required i32 y; }

      struct Terrain

      { 1: required map<Coordinate, i32> altitude_samples; }

      I use the following C++ code to create an object with 3 coordinates in the map (see the repo for complete code for all snippets below):

      Terrain terrain;
      add_sample_to_terrain(terrain, 10, 10, 100);
      add_sample_to_terrain(terrain, 20, 20, 200);
      add_sample_to_terrain(terrain, 30, 30, 300);

      where:

      void add_sample_to_terrain(Terrain& terrain, int32_t x, int32_t y, int32_t altitude)

      { Coordinate coordinate; coordinate.x = x; coordinate.y = y; std::pair<Coordinate, int32_t> sample(coordinate, altitude); terrain.altitude_samples.insert(sample); }

      I use the following C++ code to serialize an object to disk:

      shared_ptr<TFileTransport> transport(new TFileTransport("terrain.dat"));
      shared_ptr<TBinaryProtocol> protocol(new TBinaryProtocol(transport));
      terrain.write(protocol.get());

      Important note: for this to work correctly, I had to implement the function Coordinate::operator<. Thrift does generate the declaration for the Coordinate::operator< but does not generate the implementation of Coordinate::operator<. The reason for this is that Thrift does not understand the semantics of the struct and hence cannot guess the correct implementation of the comparison operator. This is discussed at http://mail-archives.apache.org/mod_mbox/thrift-user/201007.mbox/%3C4C4E08BD.8030407@facebook.com%3E

      // Thrift generates the declaration but not the implementation of operator< because it has no way
      // of knowning what the criteria for the comparison are. So, provide the implementation here.
      bool Coordinate::operator<(const Coordinate& other) const
      {
      if (x < other.x)

      { return true; } else if (x > other.x) { return false; } else if (y < other.y) { return true; }

      else

      { return false; }

      }

      Then, finally, I use the following Python code to de-serialize the same object from disk:

      file = open("terrain.dat", "rb")
      transport = thrift.transport.TTransport.TFileObjectTransport(file)
      protocol = thrift.protocol.TBinaryProtocol.TBinaryProtocol(transport)
      terrain = Terrain()
      terrain.read(protocol)
      print(terrain)

      This Python program outputs:

      Terrain(altitude_samples=None)

      In other words, the de-serialized Terrain contains no terrain_samples, instead of the expected dictionary with 3 coordinates.

      I am 100% sure that the file terrain.dat contains valid data: I also de-serialized the same data using C++ and in that case, I do get the expected results (see repo for details)

      I suspect that this has something to do with the comparison operator.

      I gut feeling is that I should have done something similar in Python with respect to the comparison operator as I did in C++. But I don't know what that missing something would be.

      Additional information added on 19-Sep-2018:

      Here is a hexdump of the encoding produced by the C++ encoding program:

      Offset: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
      00000000: 01 00 00 00 0D 02 00 00 00 00 01 01 00 00 00 0C ................
      00000010: 01 00 00 00 08 04 00 00 00 00 00 00 03 01 00 00 ................
      00000020: 00 08 02 00 00 00 00 01 04 00 00 00 00 00 00 0A ................
      00000030: 01 00 00 00 08 02 00 00 00 00 02 04 00 00 00 00 ................
      00000040: 00 00 0A 01 00 00 00 00 04 00 00 00 00 00 00 64 ...............d
      00000050: 01 00 00 00 08 02 00 00 00 00 01 04 00 00 00 00 ................
      00000060: 00 00 14 01 00 00 00 08 02 00 00 00 00 02 04 00 ................
      00000070: 00 00 00 00 00 14 01 00 00 00 00 04 00 00 00 00 ................
      00000080: 00 00 C8 01 00 00 00 08 02 00 00 00 00 01 04 00 ..H.............
      00000090: 00 00 00 00 00 1E 01 00 00 00 08 02 00 00 00 00 ................
      000000a0: 02 04 00 00 00 00 00 00 1E 01 00 00 00 00 04 00 ................
      000000b0: 00 00 00 00 01 2C 01 00 00 00 00 .....,.....

      The first 4 bytes are 01 00 00 00

      Using a debugger to step through the Python decoding function reveals that:

      • This is being decoded as a struct (which is expected)
      • The first byte 01 is interpreted as the field type. 01 means field type VOID.
      • The next two bytes are interpreted as the field id. 00 00 means field ID 0.
      • For field type VOID, nothing else is read and we continue to the next field.
      • The next byte is interpreted as the field type. 00 means STOP.
      • We top reading data for the struct.
      • The final result is an empty struct.

      All off the above is consistent with the information at https://github.com/apache/thrift/blob/master/doc/specs/thrift-binary-protocol.md which describes the Thrift binary encoding format

      My conclusion thus far is that the C++ encoder appears to produce an "incorrect" binary encoding (I put incorrect in quotes because certainly something as blatant as that would have been discovered by lots of other people, so I am sure that I am still missing something).

      Additional information added on 19-Sep-2018:

      It appears that the C++ implementation of TFileTransport has the concept of "events" when writing to disk.

      The output which is written to disk is divided into a sequence of "events" where each "event" is preceded by a 4-byte length field of the event, followed by the contents of the event.

      Looking at the hexdump above, the first couple of events are:

      `0100 0000 0d` : Event length 1, event value `0d`

      `02 0000 0000 01` : Event length 2, event value `00 01`

      Etc.

      The Python implementation of TFileTransport does not understand this concept of events when parsing the file.

      It appears that the problem is one of the following two:

      1) Either the C++ code should not be inserting these event lengths into the encoded file,

      2) Or the Python code should understand these event lengths when decoding the file.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              brunorijsman Bruno Rijsman
              Votes:
              1 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated: