TIKA-644: Parsing a Microsoft Word doc with style "Heading X" where X > 6 creates invalid HTML with tags <h7>, <h8>, etc.

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: 0.10
    • Component/s: parser

    Description

      org.apache.tika.parser.microsoft.WordExtractor translates heading styles with a level greater than 6 into "h" tags of the same level, which makes the XHTML invalid. The XHTML DTD defines only heading elements h1 to h6:
      <!ENTITY % heading "h1|h2|h3|h4|h5|h6">

      Changing line 380 from:
      tag = "h"+num;
      to:
      tag = "h"+Math.min(num, 6);

      will resolve this by mapping every heading level above 6 to h6.
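
The proposed clamp can be sketched in isolation as follows. This is a minimal illustration, not the actual WordExtractor code: the class name HeadingTagSketch, the method tagForLevel, and the constant MAX_HEADING_LEVEL are hypothetical names introduced here, and only the Math.min(num, 6) expression comes from the report.

```java
// Hypothetical stand-alone sketch; not part of Tika.
class HeadingTagSketch {
    // Highest heading level the XHTML DTD defines (h1 to h6).
    private static final int MAX_HEADING_LEVEL = 6;

    // Builds the tag name for a Word "Heading <num>" style,
    // clamping levels above 6 to h6 as the report proposes.
    static String tagForLevel(int num) {
        return "h" + Math.min(num, MAX_HEADING_LEVEL);
    }

    public static void main(String[] args) {
        // Print the mapping for heading levels 1 through 9.
        for (int level = 1; level <= 9; level++) {
            System.out.println("Heading " + level + " -> <" + tagForLevel(level) + ">");
        }
    }
}
```

With this change, levels 1 to 6 keep their own tag, while "Heading 7" and above all produce h6, so the output stays within the DTD's heading set at the cost of flattening the deeper levels.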

          People

            Assignee: Nick Burch (nick)
            Reporter: chris hudson (chud)

              Time Tracking

                Original Estimate: 5m
                Remaining Estimate: 5m
                Time Spent: Not Specified