[SPARK-28081] word2vec 'large' count value too low for very large corpora - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.4.3
Fix Version/s: 2.3.4, 2.4.4, 3.0.0
Component/s: ML
Labels: None

Description

The word2vec implementation operates on word counts and uses a hard-coded value of 1e9 to mean "a very large count, larger than any actual count". However, this causes the logic to fail if a very large corpus contains words that really do occur more than 1e9 times. We can probably improve the implementation to better handle very large counts in general.
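
To illustrate how a too-small sentinel can break count-based logic, here is a minimal sketch modeled on the two-pointer Huffman-tree construction used by the original word2vec C code. That the affected code has this shape is an assumption, and `build_parents` and `sentinel` are hypothetical names chosen for illustration. The placeholder slots for internal nodes are pre-filled with the sentinel, so a real count above the sentinel makes the merge pick empty placeholders instead of actual words.

```python
def build_parents(counts, sentinel):
    """Return the parent index of each node in a Huffman tree.

    counts must be sorted in descending order. Slots for internal
    nodes are pre-filled with `sentinel`, which is assumed to be
    larger than any real count. Illustrative sketch only.
    """
    v = len(counts)
    # Extra padding keeps the broken case from running off the array.
    count = list(counts) + [sentinel] * (2 * v)
    parent = [-1] * (3 * v)
    pos1, pos2 = v - 1, v  # pos1 walks the leaves, pos2 the internal nodes
    for a in range(v - 1):
        picked = []
        for _ in range(2):
            # Take the smaller of the next leaf and the next internal node.
            if pos1 >= 0 and count[pos1] < count[pos2]:
                picked.append(pos1)
                pos1 -= 1
            else:
                picked.append(pos2)
                pos2 += 1
        new = v + a
        count[new] = count[picked[0]] + count[picked[1]]
        parent[picked[0]] = new
        parent[picked[1]] = new
    return parent[: 2 * v - 1]


big_counts = [3_000_000_000, 2_000_000_000]
ok = build_parents(big_counts, 2**63 - 1)  # sentinel above every count
broken = build_parents(big_counts, 10**9)  # sentinel below the real counts
```

With the larger sentinel both words are attached to the root node. With the 1e9 sentinel neither word ever receives a parent, because the placeholder slots always look "smaller" than the real counts.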

Issue Links

links to

GitHub Pull Request #24893

People

Assignee: Sean R. Owen

Reporter: Sean R. Owen

Votes: 0

Watchers: 1

Dates

Created: 17/Jun/19 13:59

Updated: 19/Jun/19 01:31

Resolved: 19/Jun/19 01:29