Tika / TIKA-3774

Fix ignoreCharsets param of Icu4jEncodingDetector

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.4.1
    • Component/s: parser
    • Labels: None

    Description

      The ignoreCharsets parameter was introduced in TIKA-3516 to exclude undesired charsets in advance, but it is not working as expected: detection returns as soon as the first ignored charset is found, when it should continue to the next candidate charsets. The attached (corrupted) file used to be detected as windows-1252 by Tika 1.x, but since TIKA-3516 it is detected as IBM420. The ignoreCharsets param should be able to ignore IBM420. I'll push a fix shortly.
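
      The difference between returning at the first ignored charset and continuing to the next candidate can be sketched as follows. This is only an illustration, not the Icu4jEncodingDetector source: the class IgnoreCharsetsSketch, the methods detectBuggy and detectFixed, and the plain list of candidate names (ordered best match first) are all hypothetical, and the buggy variant shows just one plausible reading of the reported behaviour.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the detection loop; not the actual Tika code.
class IgnoreCharsetsSketch {

    // Reported behaviour (one plausible reading): the loop returns as soon as
    // it meets an ignored charset, so later candidates are never examined.
    static String detectBuggy(List<String> rankedCandidates, Set<String> ignoreCharsets) {
        for (String name : rankedCandidates) {
            if (ignoreCharsets.contains(name)) {
                return null; // gives up here instead of moving on
            }
            return name;
        }
        return null;
    }

    // Expected behaviour: skip ignored charsets and return the first
    // candidate that is not in the ignore set.
    static String detectFixed(List<String> rankedCandidates, Set<String> ignoreCharsets) {
        for (String name : rankedCandidates) {
            if (ignoreCharsets.contains(name)) {
                continue; // ignored, try the next candidate
            }
            return name;
        }
        return null; // every candidate was ignored
    }

    public static void main(String[] args) {
        // Assumed candidate order for illustration: IBM420 ranked first.
        List<String> ranked = Arrays.asList("IBM420", "windows-1252");
        Set<String> ignored = new HashSet<>(Collections.singletonList("IBM420"));

        System.out.println("buggy: " + detectBuggy(ranked, ignored));
        System.out.println("fixed: " + detectFixed(ranked, ignored));
    }
}
```

      With IBM420 in the ignore set, the fixed loop falls through to windows-1252, which matches the Tika 1.x result described above.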

      Attachments

        1. test_avoid_IBM420_charset.html
          2 kB
          Luís Filipe Nassif

          People

            Assignee: lfcnassif Luís Filipe Nassif
            Reporter: lfcnassif Luís Filipe Nassif
            Votes: 0
            Watchers: 2
