  1. Spark
  2. SPARK-28081

word2vec 'large' count value too low for very large corpora

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4.3
    • Fix Version/s: 2.3.4, 2.4.4, 3.0.0
    • Component/s: ML
    • Labels:
      None

      Description

      The word2vec implementation operates on word counts and uses a hard-coded value of 1e9 to mean "a very large count, larger than any actual count". However, this logic fails if a very large corpus contains words that really do occur more than 1e9 times. We can probably improve the implementation to handle very large counts in general.
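
To illustrate the failure mode, here is a minimal Python sketch. It assumes the count array feeds a two-pointer Huffman merge in the style of the reference word2vec C code, with internal-node slots pre-filled by the sentinel. The function name `unmerged_leaves`, the extra padding, and the sample counts are hypothetical and are not taken from Spark's source.

```python
def unmerged_leaves(counts, sentinel):
    """Run a two-pointer Huffman merge over `counts` (sorted descending)
    and return how many leaf words were never attached to the tree.

    Illustrative sketch only; not Spark's actual implementation.
    """
    n = len(counts)
    # Leaves occupy slots [0, n); internal-node slots start out holding the
    # sentinel. The array is padded to 3n so the broken case cannot run
    # past the end (the padding is an assumption of this sketch).
    count = list(counts) + [sentinel] * (2 * n)
    pos1, pos2 = n - 1, n  # pos1 walks leaves downward, pos2 walks internal nodes upward

    def pick_smallest():
        nonlocal pos1, pos2
        if pos1 >= 0 and count[pos1] < count[pos2]:
            chosen = pos1
            pos1 -= 1
        else:
            chosen = pos2
            pos2 += 1
        return chosen

    for a in range(n - 1):
        min1 = pick_smallest()
        min2 = pick_smallest()
        # The new internal node's count is the sum of its two children.
        count[n + a] = count[min1] + count[min2]

    # Any leaf still at or below pos1 was never chosen.
    return pos1 + 1


# Two words occur more than 1e9 times, so the 1e9 sentinel is no longer
# "larger than any actual count".
word_counts = [3_000_000_000, 2_000_000_000, 5, 3]

small_sentinel = unmerged_leaves(word_counts, 10**9)
large_sentinel = unmerged_leaves(word_counts, 2**63 - 1)
print(small_sentinel, large_sentinel)
```

With the 1e9 sentinel, the merge prefers unfilled internal-node slots over the two most frequent words, so those words are left out of the tree. A sentinel that exceeds any possible sum of counts, such as the maximum 64-bit integer used here, avoids this in the sketch.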

              People

              • Assignee:
                srowen Sean Owen
                Reporter:
                srowen Sean Owen
              • Votes:
                0
                Watchers:
                1

                Dates

                • Created:
                  Updated:
                  Resolved: