  1. Lucene - Core
  2. LUCENE-10493

Can we unify the Viterbi search logic in the tokenizers of kuromoji and nori?

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 10.0 (main)
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

    Description

      We now have common dictionary interfaces for kuromoji and nori (LUCENE-10393). A natural next question: can we also unify the Japanese and Korean tokenizers?

      The core methods of the two tokenizers are `parse()` and `backtrace()`, which compute the minimum-cost path by Viterbi search. The goal of this issue is to factor them out into a separate class (in analysis-common) that is shared between JapaneseTokenizer and KoreanTokenizer.
      The algorithm for finding the minimum-cost path is of course language-agnostic, so I think this should be possible in principle. The most difficult part is likely to be the N-best path calculation, which is supported only by JapaneseTokenizer and not by KoreanTokenizer.
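
To illustrate why the search itself is language-agnostic, here is a minimal sketch of a 1-best Viterbi search over a token lattice. It is not the code in JapaneseTokenizer or KoreanTokenizer: `ViterbiSketch`, `Token`, `ConnectionCosts`, and the use of context id 0 for the sentence boundaries are all hypothetical names and assumptions made for this example, and N-best is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of a language-agnostic Viterbi search; not Lucene's actual API. */
final class ViterbiSketch {
  static final int BOS_EOS_ID = 0; // assumed context id for sentence start/end
  static final long UNREACHABLE = Long.MAX_VALUE;

  /** A candidate token covering text[start, end); assumes end > start. */
  static final class Token {
    final String surface;
    final int start, end, wordCost, leftId, rightId;

    Token(String surface, int start, int end, int wordCost, int leftId, int rightId) {
      this.surface = surface;
      this.start = start;
      this.end = end;
      this.wordCost = wordCost;
      this.leftId = leftId;
      this.rightId = rightId;
    }
  }

  /** Hypothetical stand-in for a dictionary's connection-cost lookup. */
  interface ConnectionCosts {
    int get(int prevRightId, int nextLeftId);
  }

  /** Returns the minimum-cost token sequence covering [0, length), or an empty list if none exists. */
  static List<Token> minCostPath(int length, List<Token> lattice, ConnectionCosts conn) {
    int n = lattice.size();
    long[] best = new long[n]; // best[i]: cheapest cost of any path ending with token i
    int[] prev = new int[n];   // prev[i]: predecessor of token i on that path, or -1
    List<List<Integer>> startingAt = new ArrayList<>();
    List<List<Integer>> endingAt = new ArrayList<>();
    for (int pos = 0; pos <= length; pos++) {
      startingAt.add(new ArrayList<>());
      endingAt.add(new ArrayList<>());
    }
    for (int i = 0; i < n; i++) {
      Token t = lattice.get(i);
      startingAt.get(t.start).add(i);
      endingAt.get(t.end).add(i);
      best[i] = UNREACHABLE;
      prev[i] = -1;
    }

    // Forward pass: every token ending at pos is final before tokens starting at pos are visited.
    for (int pos = 0; pos < length; pos++) {
      for (int cur : startingAt.get(pos)) {
        Token t = lattice.get(cur);
        if (pos == 0) {
          best[cur] = (long) conn.get(BOS_EOS_ID, t.leftId) + t.wordCost;
          continue;
        }
        for (int p : endingAt.get(pos)) {
          if (best[p] == UNREACHABLE) continue;
          long cost = best[p] + conn.get(lattice.get(p).rightId, t.leftId) + t.wordCost;
          if (cost < best[cur]) {
            best[cur] = cost;
            prev[cur] = p;
          }
        }
      }
    }

    // Pick the cheapest token that reaches the end of the text, including the EOS connection.
    int last = -1;
    long bestTotal = UNREACHABLE;
    for (int p : endingAt.get(length)) {
      if (best[p] == UNREACHABLE) continue;
      long total = best[p] + conn.get(lattice.get(p).rightId, BOS_EOS_ID);
      if (total < bestTotal) {
        bestTotal = total;
        last = p;
      }
    }

    // Backtrace from the last token to the first, then reverse into reading order.
    List<Token> path = new ArrayList<>();
    for (int i = last; i != -1; i = prev[i]) {
      path.add(lattice.get(i));
    }
    Collections.reverse(path);
    return path;
  }
}
```

Nothing in this sketch depends on Japanese or Korean: the language-specific parts are how the lattice is populated from the dictionary and how the resulting path is turned into tokens. An N-best variant would need to keep more than one predecessor per token, which is where the two tokenizers differ.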

          People

            Assignee: Tomoko Uchida
            Reporter: Tomoko Uchida
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 4h 40m