Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-20468

Refactor the ALS code

    XMLWordPrintableJSON

Details

    Description

      The current ALS implementation (org.apache.spark.ml.recommendation) is quite the beast — 21 classes, traits, and objects across 1,500+ lines, all in one file. Here are some things I think could improve the clarity and maintainability of the code:

      • The file can be split into more manageable parts. In particular, ALS, ALSParams, ALSModel, and ALSModelParams can be in separate files for better readability.
      • Certain parts can be encapsulated or moved to clarify the intent. For example:
        • The ALS.train method is currently defined in the ALS companion object, and it seems to take 12 individual parameters that are all members of the ALS class. This method can be made an instance method.
        • The code that creates in-blocks and out-blocks in the body of ALS.train, along with the partitionRatings and makeBlocks methods in the ALS companion object, can be moved into a separate case class that holds the blocks. This has the added benefit of allowing us to write specific Scaladoc to explain the logic behind these block objects, as their usage is certainly nontrivial yet is fundamental to the implementation.
        • The KeyWrapper and UncompressedInBlockSort classes could be hidden within UncompressedInBlock to clarify the scope of their usage.
        • Various builder classes could be encapsulated in the companion objects of the classes they build.
      • The code can be formatted more clearly. For example:
        • Certain methods such as ALS.train and ALS.makeBlocks can be formatted more clearly and have comments added explaining the reasoning behind key parts. That these methods form the core of the ALS logic makes this doubly important for maintainability.
        • Parts of the code that use while loops with manually incremented counters can be rewritten as for loops.
        • Where non-idiomatic Scala code is used that doesn't improve performance much, clearer code can be substituted. (This in particular should be done very carefully if at all, as it's apparent the original author spent much time and pains optimizing the code to significantly improve its runtime profile.)
      • The documentation (both Scaladocs and inline comments) can be clarified where needed and expanded where incomplete. This is especially important for parts of the code that are written imperatively for performance, as these parts don't benefit from the intuitive self-documentation of Scala's higher-level language abstractions. Specifically, I'd like to add documentation fully explaining the key functionality of the in-block and out-block objects, their purpose, how they relate to the overall ALS algorithm, and how they are calculated in such a way that new maintainers can ramp up much more quickly.

      The above is not a complete enumeration of improvements but a high-level survey. All of these improvements will, I believe, add up to make the code easier to understand, extend, and maintain. This issue will track the progress of this refactoring so that going forward, authors will have an easier time maintaining this part of the project.

      Attachments

        Activity

          People

            Unassigned Unassigned
            danielyli Daniel Li
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: