Details
Sub-task
Status: Closed
Major
Resolution: Duplicate
1.6.0, 2.1.0
None
None
Linux(Generic)
Description
When running long-running streaming applications, HDFS storage fills up with large *.inprogress files in the hdfs://spark-history/ directory.
For example:
hadoop fs -du -h /spark-history
234 /spark-history/<Application_1_ID>.inprogress
46.6 G /spark-history/<Application_2_ID>.inprogress
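Output like the above can be scanned to flag oversized in-progress logs. The following is a minimal Python sketch, assuming only the two line shapes shown above (a bare byte count, or a number followed by a unit letter) and binary (1024-based) units. The helper names and the 1 GB threshold are illustrative, and other Hadoop versions may print additional columns that this does not handle.

```python
# Hypothetical helpers for the `hadoop fs -du -h` line shapes shown above.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_du_line(line):
    """Return (size_in_bytes, path) for one line of du output."""
    tokens = line.split()
    path = tokens[-1]
    size = float(tokens[0])
    # A token between the number and the path is taken to be the unit letter.
    if len(tokens) == 3:
        size *= UNITS[tokens[1].upper()]
    return int(size), path


def large_inprogress(lines, threshold_bytes=1024**3):
    """Return paths of .inprogress files larger than threshold_bytes."""
    result = []
    for line in lines:
        if not line.strip():
            continue
        size, path = parse_du_line(line)
        if path.endswith(".inprogress") and size > threshold_bytes:
            result.append(path)
    return result


sample = [
    "234 /spark-history/<Application_1_ID>.inprogress",
    "46.6 G /spark-history/<Application_2_ID>.inprogress",
]
print(large_inprogress(sample))
```

With the two sample lines, only the second path exceeds the threshold. Paths containing spaces are not handled by this sketch.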
Instead of continuing to write to a single very large (multi-GB) .inprogress file, Spark should rotate the current log file when it reaches a given size (for example, 100 MB) or after a given interval,
and perhaps expose a configuration parameter for the size/interval.
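To make the request concrete, here is a rough Python sketch of size- and interval-based rotation. It is not Spark's implementation: the class name RollingEventLog, the <app_id>.<index> file naming, and the parameters are all hypothetical, and it writes to the local filesystem rather than HDFS.

```python
import os
import time


class RollingEventLog:
    """Hypothetical writer that starts a new file once a size or age limit is hit."""

    def __init__(self, directory, app_id, max_bytes=100 * 1024 * 1024,
                 max_age_seconds=None, clock=time.time):
        self.directory = directory
        self.app_id = app_id
        self.max_bytes = max_bytes
        self.max_age_seconds = max_age_seconds
        self.clock = clock
        self.index = 0
        self._open_new_file()

    def _path(self, suffix=""):
        return os.path.join(self.directory, f"{self.app_id}.{self.index}{suffix}")

    def _open_new_file(self):
        self.file = open(self._path(".inprogress"), "wb")
        self.bytes_written = 0
        self.opened_at = self.clock()

    def _should_roll(self, incoming):
        # Never roll an empty file, so every completed segment holds an event.
        if self.bytes_written == 0:
            return False
        too_big = self.bytes_written + incoming > self.max_bytes
        too_old = (self.max_age_seconds is not None
                   and self.clock() - self.opened_at >= self.max_age_seconds)
        return too_big or too_old

    def _finish_current(self):
        # Close the segment and drop the .inprogress suffix to mark it complete.
        self.file.close()
        os.rename(self._path(".inprogress"), self._path())

    def write_event(self, event_json):
        data = (event_json + "\n").encode("utf-8")
        if self._should_roll(len(data)):
            self._finish_current()
            self.index += 1
            self._open_new_file()
        self.file.write(data)
        self.bytes_written += len(data)

    def close(self):
        self._finish_current()
```

In a scheme like this, only the newest segment is held open by the writer. Completed segments are closed, so they could in principle be archived or deleted to stay within a quota.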
This is also mentioned in SPARK-12140 as a concern.
Support for rotating the log files is important because users may have a limited HDFS quota, and these large files consume it.
Users also have no viable workaround:
1) The files cannot be moved to another location, because moving a file causes event logging to stop.
2) Copying the .inprogress file to another location and then truncating the original fails, because the file is still open for writing by EventLoggingListener:
hdfs dfs -truncate -w 0 /spark-history/<application_id>.inprogress
truncate: Failed to TRUNCATE_FILE /spark-history/<application_id>.inprogress for DFSClient_NONMAPREDUCE_<#ID>on <IP> because this file lease is currently owned by DFSClient_NONMAPREDUCE_<#ID> on <IP>
The only available workaround is to disable event logging for streaming applications by setting "spark.eventLog.enabled" to false.
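One way to apply this setting per application is the --conf option of spark-submit. The Python sketch below only assembles such a command line. The function name and the application file streaming_app.py are placeholders, not part of this report.

```python
def build_submit_command(app_path, event_log_enabled):
    """Build a spark-submit argument list that sets spark.eventLog.enabled explicitly."""
    value = "true" if event_log_enabled else "false"
    return ["spark-submit", "--conf", f"spark.eventLog.enabled={value}", app_path]


# streaming_app.py is a placeholder application name.
cmd = build_submit_command("streaming_app.py", event_log_enabled=False)
print(" ".join(cmd))
# prints: spark-submit --conf spark.eventLog.enabled=false streaming_app.py
```

The trade-off is that, with event logging disabled, no event log is written for the History Server to read for that application.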