Details
Type: Task
Status: Resolved
Priority: Major
Resolution: Fixed
Description
tika-eval can run on XHTML output from Tika. When it does, it maintains counts of the tags it encounters, then reports sums of those tags per file type and comparisons of the tags extracted.
When tika-eval is run against text output from Tika, the tag report queries take 30 seconds per tag type on a million files because of the joins.
In Tika 2.x, let's turn off tag reports by default, but allow users to include them if needed with the existing -rf (reports file) command-line option.