Nutch / NUTCH-1997

Add CBOR "magic header" to CommonCrawlDataDumper output

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.10
    • Component/s: tool

    Description

      For each file extracted from Nutch crawled data, CommonCrawlDataDumper wraps a single string value, representing the JSON text, into CBOR.
      For instance, using the Unix hexdump tool, we can see that, as expected, the first byte of every file is "0x7A": the first three bits are "011", the major type for text strings, and the following five bits are "11010", meaning that a uint32_t encodes the length of the text that follows. The next 4 bytes (a 32-bit unsigned integer) encode the correct length, as described in RFC 7049. Therefore, no CBOR tag is currently included in the file (a list of registered CBOR tags is available from IANA).
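
      The byte layout described above can be checked with a small sketch. This is an illustration only, not code from Nutch: the class name CborTextHeader and its methods are hypothetical, and it assumes the file always starts with the 0x7A header followed by a 4-byte big-endian length, as observed with hexdump.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Nutch) that inspects a CBOR text-string header.
final class CborTextHeader {

    /** Major type: the high-order 3 bits of the initial byte. */
    static int majorType(int initialByte) {
        return (initialByte & 0xFF) >>> 5;
    }

    /** Additional information: the low-order 5 bits of the initial byte. */
    static int additionalInfo(int initialByte) {
        return initialByte & 0x1F;
    }

    /** Encodes a string as 0x7A + 4-byte big-endian length + UTF-8 bytes. */
    static byte[] encodeText(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(5 + utf8.length)   // big-endian by default
                .put((byte) 0x7A)
                .putInt(utf8.length)
                .put(utf8)
                .array();
    }

    /** Reads the declared text length from a buffer that starts with 0x7A. */
    static long textLength(byte[] data) {
        if (data.length < 5 || (data[0] & 0xFF) != 0x7A) {
            throw new IllegalArgumentException("not a 0x7A text-string header");
        }
        // Mask to treat the 4 bytes as an unsigned 32-bit integer.
        return ByteBuffer.wrap(data, 1, 4).getInt() & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        byte[] encoded = encodeText("{\"url\":\"http://example.org/\"}");
        System.out.println("major type      = " + majorType(encoded[0]));
        System.out.println("additional info = " + additionalInfo(encoded[0]));
        System.out.println("declared length = " + textLength(encoded));
    }
}
```

      Major type 3 with additional information 26 is exactly the "011" + "11010" pattern from the hexdump, so a file that begins this way carries a plain text string and no tag.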
      To support CBOR detection in Apache Tika (as described in TIKA-1610), it would be useful if the CommonCrawlDataDumper tool could add the self-describing CBOR "magic header" (tag 55799) to its CBOR-encoded output files.
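
      As a rough sketch of what the request amounts to at the byte level: RFC 7049 encodes tag 55799 as the three bytes 0xD9 0xD9 0xF7, which are placed in front of the existing CBOR data item. The class and method names below are hypothetical and this is not the attached patch; it only shows the idea under the assumption that the output is available as a byte array.

```java
import java.util.Arrays;

// Hypothetical helper (not the NUTCH-1997 patch) for the self-describe CBOR tag.
final class CborSelfDescribe {

    // Tag 55799: major type 6 (0xC0) with a 2-byte tag number 0xD9F7.
    private static final byte[] SELF_DESCRIBE_TAG = {
        (byte) 0xD9, (byte) 0xD9, (byte) 0xF7
    };

    /** Returns a copy of the CBOR data item with the magic header prepended. */
    static byte[] withMagicHeader(byte[] cbor) {
        byte[] out = new byte[SELF_DESCRIBE_TAG.length + cbor.length];
        System.arraycopy(SELF_DESCRIBE_TAG, 0, out, 0, SELF_DESCRIBE_TAG.length);
        System.arraycopy(cbor, 0, out, SELF_DESCRIBE_TAG.length, cbor.length);
        return out;
    }

    /** Checks whether the data starts with the three magic-header bytes. */
    static boolean hasMagicHeader(byte[] data) {
        return data.length >= SELF_DESCRIBE_TAG.length
                && Arrays.equals(
                        Arrays.copyOf(data, SELF_DESCRIBE_TAG.length),
                        SELF_DESCRIBE_TAG);
    }

    public static void main(String[] args) {
        // A tiny CBOR text string: header 0x62 (text, length 2) followed by "{}".
        byte[] item = {0x62, '{', '}'};
        byte[] tagged = withMagicHeader(item);
        System.out.println("untagged has header: " + hasMagicHeader(item));
        System.out.println("tagged has header:   " + hasMagicHeader(tagged));
    }
}
```

      A detector only needs to compare the first three bytes of a file against this fixed sequence, which is what makes the tag useful as a magic number.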
      Thanks a lot to Lukeliush for this great research, and to chrismattmann for supporting me on this work.

      Attachments

        1. NUTCH-1997.patch
          2 kB
          Giuseppe Totaro

            People

              chrismattmann Chris A. Mattmann
              gostep Giuseppe Totaro