Spark / SPARK-25441

calculate term frequency in CountVectorizer()


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      Currently, CountVectorizer() cannot output TF (term frequency); it only produces raw term counts. I hope an option for this can be added.

      TF is defined as in https://en.m.wikipedia.org/wiki/Tf–idf
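
      For reference, the relative-frequency form of TF that this request appears to have in mind can be written as below. The symbols are assumptions introduced here for illustration: f_{t,d} is the raw count of term t in document d, and the sum runs over all terms t' in d.

```latex
\[
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}
\]
```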

       

      Example:

      >>> from pyspark.ml.feature import CountVectorizer
      >>> df = spark.createDataFrame(
      ...     [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])],
      ...     ["label", "raw"])
      >>> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
      >>> model = cv.fit(df)
      >>> model.transform(df).limit(1).show(truncate=False)

      label  raw        vectors
      0      [a, b, c]  (3,[0,1,2],[1.0,1.0,1.0])

       

      Instead, I want:

      0      [a, b, c]  (3,[0,1,2],[0.33,0.33,0.33])

      That is, each vector is divided by its sum (here 3), so that the new vector sums to 1 for every row (document).
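
      The requested normalization can be sketched in plain Python, without Spark, to make the arithmetic concrete. This is only an illustration under stated assumptions: `term_frequencies` is a hypothetical helper, not part of any Spark API, and the vocabulary order ["a", "b", "c"] is assumed to match the indices in the example above.

```python
from collections import Counter


def term_frequencies(tokens, vocabulary):
    """Return (size, indices, values): raw counts divided by their row sum."""
    index = {term: i for i, term in enumerate(vocabulary)}
    # Count only in-vocabulary tokens, keyed by vocabulary index.
    counts = Counter(index[t] for t in tokens if t in index)
    total = sum(counts.values())
    indices = sorted(counts)
    # With no in-vocabulary tokens, indices is empty and no division occurs.
    values = [counts[i] / total for i in indices]
    return len(vocabulary), indices, values


vocabulary = ["a", "b", "c"]  # assumed order, matching the example above
row0 = term_frequencies(["a", "b", "c"], vocabulary)
row1 = term_frequencies(["a", "b", "b", "c", "a"], vocabulary)
print(row0)
print(row1)
```

      Each row's values sum to 1, as the request describes. Within Spark itself, one possible workaround (not shown here) would be to apply pyspark.ml.feature.Normalizer with p=1.0 to the count vectors, since L1 normalization of non-negative counts gives the same result.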

       


          People

            Assignee: Unassigned
            Reporter: Xinyong Tian (Ben2018)
            Votes: 0
            Watchers: 4
