S3A / S3Guard internally collect several useful metrics that we should consider exposing to Impala users. The full list of statistics can be found in o.a.h.fs.s3a.Statistic. The stats include the number of S3 operations performed (put, get, etc.), invocation counts for various FileSystem methods, and stream statistics (bytes read, bytes written, etc.).
Some interesting stats that stand out:
- "stream_aborted": "Count of times the TCP stream was aborted" - the number of TCP connection aborts; a high value would indicate performance issues
- "stream_read_exceptions": "Number of exceptions invoked on input streams" - incremented whenever an IOException is caught while reading (these exceptions don't always get propagated to Impala because they trigger a retry)
- "store_io_throttled": "Requests throttled and retried" - looks like it tracks the number of times the fs retries an operation because the original request hit a throttling exception
- "s3guard_metadatastore_retry": "S3Guard metadata store retry events" - looks like it tracks the number of times the fs retries S3Guard operations
- "s3guard_metadatastore_throttled" : "S3Guard metadata store throttled events" - similar to "store_io_throttled" but looks like it is specific to S3Guard
We should consider how to expose these metrics via Impala logs / runtime profiles.
There are a few options:
- S3AFileSystem exposes StorageStatistics specific to S3A / S3Guard via the FileSystem#getStorageStatistics method. S3AStorageStatistics seems to include all the S3A / S3Guard metrics; however, I think the stats might be aggregated globally, which would make it hard to create per-query metrics
- S3AInstrumentation exposes all the metrics as well, and looks like it is per-fs instance, so it is not aggregated globally; S3AInstrumentation extends o.a.h.metrics2.MetricsSource so perhaps it is exposed via some API (haven't looked into this yet)
- S3AInputStream#toString dumps the statistics from o.a.h.fs.s3a.S3AInstrumentation.InputStreamStatistics and S3AFileSystem#toString dumps them all as well
- S3AFileSystem updates the stats in o.a.h.fs.Statistics.StatisticsData as well (e.g. bytesRead, bytesWritten, etc.)
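Impala has an hdfs-fs-cache as well, so hdfsFs objects get shared across threads. As a result, even per-fs-instance stats would likely combine activity from multiple concurrent queries.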
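One way to approximate scoped metrics from cumulative counters is to snapshot them before and after a unit of work and report only the difference. The sketch below shows just the diffing step on plain maps. The class name StatsDelta and the method delta are hypothetical, not part of Hadoop or Impala. In practice the maps would presumably be filled by iterating FileSystem#getStorageStatistics().getLongStatistics() and copying each LongStatistic's getName() / getValue(). Because fs objects are shared, a delta taken this way would also include activity from any other thread using the same instance during that window.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: diffs two snapshots of cumulative counters.
// Populating the snapshots from Hadoop is assumed to look roughly like:
//   Iterator<StorageStatistics.LongStatistic> it =
//       fs.getStorageStatistics().getLongStatistics();
//   while (it.hasNext()) { LongStatistic s = it.next(); map.put(s.getName(), s.getValue()); }
public class StatsDelta {

    /**
     * Returns (after - before) for every counter whose value changed.
     * Counters absent from 'before' are treated as starting at 0.
     */
    public static Map<String, Long> delta(Map<String, Long> before,
                                          Map<String, Long> after) {
        Map<String, Long> changed = new TreeMap<>();
        for (Map.Entry<String, Long> e : after.entrySet()) {
            long diff = e.getValue() - before.getOrDefault(e.getKey(), 0L);
            if (diff != 0) {
                changed.put(e.getKey(), diff);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Made-up example values, for illustration only.
        Map<String, Long> before = new TreeMap<>();
        before.put("stream_aborted", 2L);
        before.put("store_io_throttled", 5L);

        Map<String, Long> after = new TreeMap<>();
        after.put("stream_aborted", 2L);
        after.put("store_io_throttled", 8L);
        after.put("stream_read_exceptions", 1L);

        // prints {store_io_throttled=3, stream_read_exceptions=1}
        System.out.println(delta(before, after));
    }
}
```

Only counters that changed are reported, which keeps the output small enough to attach to a log line or a runtime profile entry.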