Nutch / NUTCH-1975

New configuration for CommonCrawlDataDumper tool

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.9
    • Fix Version/s: 1.10
    • Component/s: tool
    • Labels: None

    Description

      Hi all, attached is a new patch that adds support for new options to CommonCrawlDataDumper.
      The new options are passed to the CommonCrawlFormat object (which provides the methods that create the JSON output) through a configuration object (CommonCrawlConfig).
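
      The idea of handing options to the formatter through a configuration object can be sketched as below. This is only an illustration: the classes DumperConfigSketch and HeaderFormatSketch are hypothetical stand-ins, not the CommonCrawlConfig and CommonCrawlFormat classes from the patch, and the name/value shape of the array entries is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical stand-in for a configuration object (not the patch's CommonCrawlConfig).
final class DumperConfigSketch {
    final boolean jsonArray; // mirrors the intent of a "headers as JSON array" switch

    DumperConfigSketch(boolean jsonArray) {
        this.jsonArray = jsonArray;
    }
}

// Hypothetical stand-in for a format object that reads its options from the config.
final class HeaderFormatSketch {
    private final DumperConfigSketch config;

    HeaderFormatSketch(DumperConfigSketch config) {
        this.config = config;
    }

    // Renders headers either as a JSON sub-object or as a JSON array.
    // Assumption: array entries are {"name": ..., "value": ...} pairs.
    String headers(Map<String, String> headers) {
        if (config.jsonArray) {
            StringJoiner array = new StringJoiner(",", "[", "]");
            for (Map.Entry<String, String> e : headers.entrySet()) {
                array.add("{\"name\":" + quote(e.getKey())
                        + ",\"value\":" + quote(e.getValue()) + "}");
            }
            return array.toString();
        }
        StringJoiner object = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : headers.entrySet()) {
            object.add(quote(e.getKey()) + ":" + quote(e.getValue()));
        }
        return object.toString();
    }

    // Minimal quoting: escapes backslashes and double quotes only.
    // A real JSON library would also handle control characters.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        Map<String, String> h = new LinkedHashMap<>();
        h.put("Content-Type", "text/html");
        System.out.println(new HeaderFormatSketch(new DumperConfigSketch(false)).headers(h));
        System.out.println(new HeaderFormatSketch(new DumperConfigSketch(true)).headers(h));
    }
}
```

      Keeping the switches in one object means a new option can be added without changing the formatter's constructor or method signatures.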

      With this patch, CommonCrawlDataDumper supports the following options:

      • -SimpleDateFormat: enables timestamps in GMT epoch (milliseconds) format.
      • -epochFilename: extracted files are organized in a reversed-DNS directory tree based on the FQDN of the web page, followed by a SHA1 hash of the complete URL. Scraped data is stored in these directories as individual GMT-timestamped files named with the epoch time (in milliseconds) plus the file extension.
      • -jsonArray: organizes both request and response headers into a JSON array instead of a JSON sub-object.
      • -reverseKey: uses the same layout as described for the -epochFilename option, with underscores in place of directory separators.
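
      The reversed-DNS layout described for -epochFilename and -reverseKey can be sketched roughly as follows. This is a guess at the scheme from the description above, not the code in the patch: the class EpochLayoutSketch and its methods are hypothetical, and details such as lowercase hex for the hash and hashing the URL as UTF-8 are assumptions.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper illustrating the described layout; not part of Nutch.
final class EpochLayoutSketch {

    // "www.example.com" becomes "com/example/www" when sep is "/".
    static String reverseHost(String host, String sep) {
        String[] labels = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]);
            if (i > 0) {
                sb.append(sep);
            }
        }
        return sb.toString();
    }

    // SHA-1 of the complete URL as lowercase hex (the case is an assumption).
    static String sha1Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 must be available on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    // Builds <reversed host><sep><sha1 of URL><sep><epoch millis>.<ext>.
    // With reverseKey the separator is "_" instead of "/".
    // Assumes the URL has a host; URI.getHost() returns null otherwise.
    static String key(String url, long epochMillis, String ext, boolean reverseKey) {
        String sep = reverseKey ? "_" : "/";
        String host = URI.create(url).getHost();
        return reverseHost(host, sep) + sep + sha1Hex(url) + sep + epochMillis + "." + ext;
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/index.html";
        long now = System.currentTimeMillis();
        System.out.println(key(url, now, "html", false));
        System.out.println(key(url, now, "html", true));
    }
}
```

      Reversing the host labels groups pages from the same domain under a common prefix, which is presumably why the underscore variant is usable as a flat key.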

      You can use the options above in addition to the options already supported, as described in the Nutch wiki page.
      This patch builds on NUTCH-1974.

      Thanks to chrismattmann and annieburgess for supporting me in this work.

      Attachments

        1. NUTCH-1975.patch
          34 kB
          Giuseppe Totaro
        2. NUTCH-1975.v02.patch
          33 kB
          Giuseppe Totaro
        3. NUTCH-1975.v03.patch
          33 kB
          Giuseppe Totaro

          People

            Assignee: chrismattmann Chris A. Mattmann
            Reporter: gostep Giuseppe Totaro
            Votes: 0
            Watchers: 4