  1. Hadoop HDFS
  2. HDFS-14617 Improve fsimage load time by writing sub-sections to the fsimage index
  3. HDFS-15987

Improve oiv tool to parse fsimage file in parallel with delimited format

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.0
    • Component/s: tools
    • Hadoop Flags: Reviewed

    Description

      The purpose of this Jira is to improve the oiv tool so that it parses an fsimage file with sub-sections (see HDFS-14617) in parallel when using the Delimited format.

      1. Serial parsing is time-consuming

      When a large fsimage is parsed serially with the Delimited format (e.g. `hdfs oiv -p Delimited -t <tmp> ...`), the time is spent as follows:

      1) Loading string table:                 -> Not time consuming.
      2) Loading inode references:             -> Not time consuming
      3) Loading directories in INode section: -> Slightly time consuming (3%)
      4) Loading INode directory section:      -> A bit time consuming (11%)
      5) Output:                               -> Very time consuming (86%)

      Therefore, output is the stage that benefits most from parallelization.
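As a rough sanity check on how much a parallel output stage can help, the sketch below applies Amdahl's law to the 86% output share from the breakdown above. It assumes the other stages stay serial and ignores any cost of merging per-thread files, so it gives an upper bound rather than a prediction. The function name `amdahl_speedup` is illustrative only.

```python
def amdahl_speedup(parallel_fraction: float, threads: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the runtime scales with threads."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


# Share of the serial run spent in the output stage (86%, from the breakdown above).
OUTPUT_FRACTION = 0.86

for threads in (1, 2, 4, 8):
    bound = amdahl_speedup(OUTPUT_FRACTION, threads)
    print(f"{threads} thread(s): at most {bound:.2f}x faster overall")
```

With 4 threads the bound is about 2.8x; the remaining 14% of serial work caps the achievable gain at roughly 7x no matter how many threads are used.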

      2. How to output in parallel

      The sub-sections are split, in order, into one group per thread. Each thread processes its group and writes the result to its own output file. Finally, the per-thread files are merged, in group order, into the final output.
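The scheme described above (ordered groups, one output file per thread, then a merge) can be sketched as follows. This is a minimal illustration of the structure only, not the oiv implementation: `split_in_order`, `write_group`, `parallel_output`, and the `render` callback are hypothetical names, and sub-sections are modeled as plain Python values.

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor


def split_in_order(items, groups):
    """Split `items` into at most `groups` contiguous chunks, preserving order."""
    groups = max(1, min(groups, len(items)))
    size, extra = divmod(len(items), groups)
    chunks, start = [], 0
    for i in range(groups):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def write_group(group, path, render):
    """Render every sub-section of one group into this worker's own file."""
    with open(path, "w", encoding="utf-8") as out:
        for sub_section in group:
            for line in render(sub_section):
                out.write(line + "\n")


def parallel_output(sub_sections, threads, final_path, render):
    """Write the groups concurrently, then concatenate the parts in group order."""
    chunks = split_in_order(sub_sections, threads)
    tmp_dir = tempfile.mkdtemp(prefix="oiv_parts_")
    try:
        paths = [os.path.join(tmp_dir, f"part-{i}") for i in range(len(chunks))]
        with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
            futures = [pool.submit(write_group, chunk, path, render)
                       for chunk, path in zip(chunks, paths)]
            for future in futures:
                future.result()  # re-raise any error from a worker
        # Merge step: appending the parts in group order keeps the original order.
        with open(final_path, "wb") as merged:
            for path in paths:
                with open(path, "rb") as part:
                    shutil.copyfileobj(part, merged)
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

Because each group is a contiguous, ordered slice, concatenating the part files reproduces the line order of a serial run. Python threads are used here only to show the shape of the approach; they do not speed up CPU-bound work the way JVM threads can.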

      3. The result of a test

       Input fsimage file info:
       3.4G, 12 sub-sections, 55976500 INodes
       -----------------------------------------
       Threads  TotalTime  OutputTime  MergeTime
       1        18m37s     16m18s      –
       4        8m7s       4m49s       41s
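To make the table easier to read, the short calculation below converts the reported times to seconds and derives the speedups. The issue does not say whether MergeTime is counted inside OutputTime, so the output-stage speedup is computed both ways; the variable names are illustrative.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert an XmYs duration from the table into seconds."""
    return minutes * 60 + seconds


# Values copied from the table above.
total_1, output_1 = to_seconds(18, 37), to_seconds(16, 18)
total_4, output_4, merge_4 = to_seconds(8, 7), to_seconds(4, 49), to_seconds(0, 41)

total_speedup = total_1 / total_4
output_speedup = output_1 / output_4
# Alternative reading: merge time is extra work on top of the output stage.
output_plus_merge_speedup = output_1 / (output_4 + merge_4)

print(f"total:          {total_speedup:.2f}x")
print(f"output:         {output_speedup:.2f}x")
print(f"output + merge: {output_plus_merge_speedup:.2f}x")
```

With 4 threads the whole run is about 2.3x faster and the output stage alone about 3.4x faster (about 3.0x if the 41s merge is added to it).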

       

       

       

      Attachments

        1. Improve_oiv_tool_001.pdf
          76 kB
          Hongbing Wang

            People

              Assignee: wanghongbing Hongbing Wang
              Reporter: wanghongbing Hongbing Wang
                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 6h 40m