Hadoop Common / HADOOP-18477 Über-jira: S3A Hadoop 3.3.9 features / HADOOP-17943

Add s3a tool to convert S3 server logs to avro/csv files

Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.2
    • Fix Version/s: None
    • Component/s: fs/s3
    • Labels: None

    Description

      With S3A Auditing, we have code in hadoop-aws to parse s3 log entries, including splitting up the referrer into its fields.
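
As a rough illustration of what "splitting up the referrer into its fields" means, here is a minimal Java sketch. It assumes the audit information arrives as query parameters on the referrer URL (for example op, p1, pr). The class name ReferrerSplitter and the sample values are hypothetical; this is not the parser that ships in hadoop-aws.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not the hadoop-aws class. */
final class ReferrerSplitter {
  private ReferrerSplitter() {
  }

  /**
   * Split the query string of a referrer URL into key/value pairs.
   * Returns an empty map for a missing ("-"), empty or unparseable referrer.
   */
  static Map<String, String> splitReferrer(String referrer) {
    Map<String, String> fields = new LinkedHashMap<>();
    if (referrer == null || referrer.isEmpty() || referrer.equals("-")) {
      return fields;
    }
    String query;
    try {
      query = URI.create(referrer).getRawQuery();
    } catch (IllegalArgumentException e) {
      return fields; // not a valid URI: nothing to split
    }
    if (query == null) {
      return fields;
    }
    for (String pair : query.split("&")) {
      if (pair.isEmpty()) {
        continue;
      }
      int eq = pair.indexOf('=');
      String key = eq < 0 ? pair : pair.substring(0, eq);
      String value = eq < 0 ? "" : pair.substring(eq + 1);
      fields.put(decode(key), decode(value));
    }
    return fields;
  }

  /** URL-decode a token, falling back to the raw text if it is malformed. */
  private static String decode(String s) {
    try {
      return URLDecoder.decode(s, StandardCharsets.UTF_8);
    } catch (IllegalArgumentException e) {
      return s;
    }
  }
}
```

Keeping the result as an ordered map of strings means attributes the tool has never seen before still pass through untouched.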

      But we don't have an easy way of using it. I've done some early work in Spark (https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/com/cloudera/spark/cloud/s3/S3LogRecordParser.scala), but that code doesn't work, and it doesn't do the audit splitting either.

      And, given that the S3 audit logs can be small on a lightly loaded store, a Spark job is not always justified.

      Proposed

      We add:

      1. a utility parser class to take a row and split it into a record,
      2. which can be saved to Avro through a schema we define,
      3. or exported to CSV with/without headers (with: easy to understand; without: files can be concatenated with cat);
      4. a mapper so this can be used in MR jobs (could even make it a committer test);
      5. and a "hadoop s3guard/hadoop s3" entry point so you can do it on the CLI:

         hadoop s3 parselogs -format avro -out s3a://dest/path -recursive s3a://stevel-london/logs/bucket1/*

      This would take all files under the path, then load, parse and emit the output.
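
Step 1 of the proposal (take a row, split it into a record) could look roughly like the sketch below. It assumes the usual S3 server access log layout: space-separated fields, the timestamp wrapped in [ ] and the request URI, referrer and user agent wrapped in double quotes. The class name LogRowSplitter and the column names are hypothetical labels, and quotes embedded inside a quoted field are not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a row splitter; not the hadoop-aws parser. */
final class LogRowSplitter {
  /** Assumed labels for the leading columns of an S3 server access log row. */
  static final List<String> FIELD_NAMES = List.of(
      "bucket_owner", "bucket", "time", "remote_ip", "requester",
      "request_id", "operation", "key", "request_uri", "http_status",
      "error_code", "bytes_sent", "object_size", "total_time",
      "turnaround_time", "referrer", "user_agent", "version_id");

  private LogRowSplitter() {
  }

  /** Split a row on spaces, keeping [..] and ".." groups as single fields. */
  static List<String> split(String row) {
    List<String> fields = new ArrayList<>();
    int i = 0;
    int n = row.length();
    while (i < n) {
      char c = row.charAt(i);
      if (c == ' ') {
        i++;
        continue;
      }
      // closing delimiter for a grouped field, or '\0' for a plain token
      char close = c == '"' ? '"' : (c == '[' ? ']' : '\0');
      int start;
      int end;
      if (close != '\0') {
        start = i + 1;
        end = row.indexOf(close, start);
        if (end < 0) {
          end = n; // unterminated group: take the rest of the row
        }
        i = end + 1;
      } else {
        start = i;
        end = row.indexOf(' ', start);
        if (end < 0) {
          end = n;
        }
        i = end;
      }
      fields.add(row.substring(start, end));
    }
    return fields;
  }

  /** Pair values with column names; unnamed trailing columns get field_N. */
  static Map<String, String> toRecord(List<String> values) {
    Map<String, String> record = new LinkedHashMap<>();
    for (int i = 0; i < values.size(); i++) {
      String name = i < FIELD_NAMES.size() ? FIELD_NAMES.get(i) : "field_" + i;
      record.put(name, values.get(i));
    }
    return record;
  }
}
```

A record built this way is just an ordered map of strings, so the same object could feed either an Avro writer or a CSV/TSV writer; typed columns (status as int, sizes as long) would be a later refinement.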

      Design issues

      • Would you combine all files, or emit a new .avro or .csv file for each one?
      • What's a good Avro schema to cope with new context attributes?
      • CSV nuances: tabs vs spaces; use opencsv or implement the (escaping?) writer ourselves?
        me: TSV, with a minimal escaping and quoting emitter of our own. Can use opencsv in the test suite.
      • Would you want an initial filter during processing, especially on HTTP status codes?
        me: no, though I could see the benefit for 503s. Best to let people load the output into a notebook or spreadsheet and go from there.
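
A "minimal escaping and quoting" TSV emitter could be as small as the sketch below. The quoting rule is an assumption on my part, not a decision recorded in this issue: a field is wrapped in double quotes only when it contains a tab, a double quote or a line break, and embedded quotes are doubled, so a standard CSV reader configured with a tab separator should be able to read it back. TsvEmitter is a hypothetical name.

```java
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical minimal TSV writer; the quoting rules are an assumption. */
final class TsvEmitter {
  private TsvEmitter() {
  }

  /** Quote a field only if it contains a tab, quote, CR or LF. */
  static String escape(String field) {
    if (field == null) {
      return "";
    }
    boolean needsQuotes = field.indexOf('\t') >= 0
        || field.indexOf('"') >= 0
        || field.indexOf('\n') >= 0
        || field.indexOf('\r') >= 0;
    if (!needsQuotes) {
      return field;
    }
    // double any embedded quotes, then wrap the whole field in quotes
    return "\"" + field.replace("\"", "\"\"") + "\"";
  }

  /** Join one row of fields with tabs, without a trailing newline. */
  static String row(List<String> fields) {
    StringJoiner joiner = new StringJoiner("\t");
    for (String field : fields) {
      joiner.add(escape(field));
    }
    return joiner.toString();
  }

  /**
   * Emit all rows, optionally preceded by a header line.
   * Header-less output can be concatenated across files with cat.
   */
  static String emit(List<String> header, List<List<String>> rows,
      boolean withHeader) {
    StringBuilder sb = new StringBuilder();
    if (withHeader) {
      sb.append(row(header)).append('\n');
    }
    for (List<String> r : rows) {
      sb.append(row(r)).append('\n');
    }
    return sb.toString();
  }
}
```

Quoting only when needed keeps typical log rows free of quote characters, which makes the output easy to eyeball and to process with line-oriented tools.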

          People

            Assignee: Mehakmeet Singh (mehakmeet)
            Reporter: Steve Loughran (stevel@apache.org)