Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-8123

Question about how to retrieve by TFIDFSimilarity query on lucene

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Invalid
    • 7.2
    • None
    • core/query/scoring
    • None
    • New

    Description

      Hi, all.
      Recently, we were performing experiment on Lucene based on TFIDF.
      We want to get the similar documents from the corpus, of which the similarity between each document (d) and the given query (q) is no less than a threshold. We use the following scoring function.
      sum(tf(t,d) * idf(t) * tf(t,q) * idf(t))/(norm(d) * norm(q))
      where norm is defined as sqrt( sum(tf(t,d) * idf(t) * tf(t,d) * idf(t)) ).

      We perform this query by scanning the related docIds of all terms in the query, and the related docIds are derived from function PostingsEnum docEnum = MultiFields.getTermDocsEnum(indexReader, "text", term.bytes()) . After the inner products of these related documents have been computed, the final similarities are computed by dividing these inner products by their norms.

      However, when the documents scale up, e.g., more than ten million titles of twitter's text filed each on average has 10 terms, the runtime is unacceptable (more than ten seconds) since we always need to merge 0.5~2 million documents to generate the inner products. Does Lucene provide more efficient interface to generate ranked results based on TFIDF, or directly filter out the dissimilar documents (in lucene core) for a given threshold in the range of (0, 1)?

      Best
      Wenhai

      Attachments

        Activity

          People

            Unassigned Unassigned
            lwhay Wenhai Li
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: