SPARK-28081

word2vec 'large' count value too low for very large corpora

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4.3
    • Fix Version/s: 2.3.4, 2.4.4, 3.0.0
    • Component/s: ML
    • Labels: None

    Description

      The word2vec implementation operates on word counts and uses a hard-coded value of 1e9 to mean "a very large count, larger than any actual count". However, this logic fails if a very large corpus contains words that really do occur more than 1e9 times. We can probably improve the implementation to handle very large counts better in general.
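
      To make the failure mode concrete, here is a minimal sketch in Python. It is not Spark's code. It assumes the sentinel is used the way the reference word2vec implementation uses one when building its Huffman tree: slots for internal nodes are pre-filled with the "very large" value, and a two-pointer merge picks the smaller of the next leaf and the next internal node. The names `merge_order` and `uses_unbuilt_node` are hypothetical helpers introduced for this illustration.

```python
SENTINEL = 10**9  # the hard-coded "larger than any actual count" value from the description


def merge_order(counts, sentinel):
    """Return the (first, second) node index pairs merged while building a Huffman tree.

    `counts` must be sorted in descending order. Internal-node slots are
    pre-filled with `sentinel`, which is assumed to exceed every real count.
    """
    n = len(counts)
    # Extra slack (2 * n slots) so a misbehaving run cannot index past the end.
    count = list(counts) + [sentinel] * (2 * n)
    pos1, pos2 = n - 1, n  # pos1 walks leaves right-to-left, pos2 walks internal nodes
    merges = []
    for a in range(n - 1):
        picked = []
        for _ in range(2):
            # Take the smaller of the next unused leaf and the next internal node.
            if pos1 >= 0 and count[pos1] < count[pos2]:
                picked.append(pos1)
                pos1 -= 1
            else:
                picked.append(pos2)
                pos2 += 1
        count[n + a] = count[picked[0]] + count[picked[1]]  # new internal node
        merges.append(tuple(picked))
    return merges


def uses_unbuilt_node(merges, n):
    """True if any merge step consumed an internal node that did not exist yet."""
    return any(idx >= n + a for a, pair in enumerate(merges) for idx in pair)


small = [5, 3, 2]
big = [3 * 10**9, 2 * 10**9, 5]  # two words occur more than 1e9 times

ok_small = uses_unbuilt_node(merge_order(small, SENTINEL), len(small))
broken_big = uses_unbuilt_node(merge_order(big, SENTINEL), len(big))
# One possible remedy (an assumption, not necessarily the actual fix):
# derive the sentinel from the data so it always exceeds every count.
fixed_big = uses_unbuilt_node(merge_order(big, sum(big) + 1), len(big))
```

      With the small counts the merge is well-formed. With counts above 1e9, a real leaf count compares as larger than the placeholder, so the merge consumes an internal node that has not been built yet. Python integers do not overflow, so this sketch isolates only the comparison problem; a fixed-width integer implementation would also need a wide enough type for the summed counts.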

            People

              Assignee: Sean R. Owen (srowen)
              Reporter: Sean R. Owen (srowen)
              Votes: 0
              Watchers: 1

              Dates

                Created:
                Updated:
                Resolved: