
The challenge of matrix factorization for building recommendation systems often comes from scale. With large amounts of data (millions of users or items), it becomes difficult to run matrix factorization on a single machine. In this case, alternating least squares (ALS) is often used as a scalable algorithm. Here I benchmark some existing packages/systems on small and large scale data.
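For intuition, ALS alternates between two closed-form ridge-regression solves: fix the item factors and solve for the user factors, then fix the user factors and solve for the item factors. Below is a minimal dense sketch of that loop (illustrative only: it treats every entry of R as observed, whereas production ALS implementations fit only the observed ratings and work on sparse, distributed data):

```python
import numpy as np

def als_dense(R, k=64, n_iter=10, reg=0.1):
    """Factorize R (m x n) as U @ V.T with k latent dimensions."""
    m, n = R.shape
    rng = np.random.default_rng(0)
    U = rng.standard_normal((m, k))
    V = rng.standard_normal((n, k))
    I = reg * np.eye(k)
    for _ in range(n_iter):
        # Fix V, solve (V.T V + reg I) U.T = V.T R.T for U
        U = np.linalg.solve(V.T @ V + I, V.T @ R.T).T
        # Fix U, solve (U.T U + reg I) V.T = U.T R for V
        V = np.linalg.solve(U.T @ U + I, U.T @ R).T
    return U, V
```

Because each row of U (and of V) can be solved independently given the other factor matrix, the updates parallelize naturally, which is what makes ALS attractive at scale.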

Data in use

MovieLens 25M Dataset: 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users.

Amazon review data (2018): 233 million reviews (44 million users, 15 million items; extremely sparse)

Packages/Systems under test

Python package implicit, which has CUDA acceleration

Sklearn TruncatedSVD

Spark MLlib: ALS

Results

The hidden dimension is set to 64 and the maximum number of iterations to 10.
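For reference, a minimal sketch of how the single-machine runs can be invoked with these settings (the toy input matrix and exact benchmark flags are assumptions; the parameter names come from the packages' public APIs):

```python
import implicit
import scipy.sparse as sp
from scipy.sparse.linalg import svds
from sklearn.decomposition import TruncatedSVD

# Small random stand-in for the MovieLens user-item rating matrix.
user_item = sp.random(10_000, 2_000, density=0.01, format="csr")

# implicit: ALS with 64 factors and 10 iterations (use_gpu=True for the GPU run).
als = implicit.als.AlternatingLeastSquares(factors=64, iterations=10, use_gpu=False)
als.fit(user_item)  # note: implicit versions before 0.5 expect an item-user matrix

# sklearn: randomized truncated SVD.
svd = TruncatedSVD(n_components=64, algorithm="randomized", n_iter=10)
user_factors = svd.fit_transform(user_item)

# scipy: ARPACK-based sparse SVD.
u, s, vt = svds(user_item, k=64)
```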

Data           Package                                       Hardware                         Time
MovieLens      implicit                                      i7-7700k                         70.2 s
MovieLens      implicit                                      Nvidia 1080                      9.94 s
MovieLens      sklearn TruncatedSVD (algorithm=randomized)   i7-7700k                         27.2 s
MovieLens      scipy.sparse.linalg.svds (ARPACK)             i7-7700k                         16.8 s
MovieLens      Spark MLlib*                                  6 executors, 15 cores (D13 v2)   74.6 s
MovieLens      Spark MLlib                                   9 executors, 64 cores            52 s
Amazon review  Spark MLlib                                   11 executors, 80 cores           2 h (64 dim); 20 min (16 dim)

Time spent by each package on different hardware.
* Spark MLlib times include IO.

Challenges with Spark MLlib ALS on large data

Even with a Spark cluster, working with tens of millions of users and items is still challenging. There are a few pitfalls I encountered:

StringIndexer fails due to OOM: some method is needed to convert user/item IDs into continuous indices. Using StringIndexer is the most straightforward way; however, it turns out not to scale well. This is solved by using zipWithIndex instead, as sketched below.
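A minimal sketch of the zipWithIndex workaround (the column names `user`, `item`, and `rating` and the input path are assumptions; the actual code is in the spark_als.py linked below):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
ratings = spark.read.parquet("ratings.parquet")  # hypothetical input path

# Build a (user id -> dense integer index) mapping without StringIndexer.
user_index = (
    ratings.select("user").distinct().rdd
    .map(lambda row: row[0])
    .zipWithIndex()                # yields (user_id, index) pairs
    .toDF(["user", "user_idx"])
)
ratings = ratings.join(user_index, on="user")  # repeat similarly for item ids
```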

randomSplit does something unexpected: because the split is evaluated lazily, the upstream lineage can be recomputed for each split, so rows may land in both splits or in neither. This is solved by calling persist before feeding data into it, as sketched below.
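A minimal sketch of the fix (the 99/1 split matches the evaluation below; the seed is an assumption):

```python
# Materialize the indexed ratings once so both splits are drawn from
# the same data instead of two independent recomputations of the lineage.
ratings = ratings.persist()
train, test = ratings.randomSplit([0.99, 0.01], seed=42)
```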

After solving these two issues and tuning the memory limits, I was finally able to run ALS on such a large dataset. I evaluated the result by holding out 1% of the data as a test set and treating a user rating >= 3 as a positive label; in this way, the predicted rating is evaluated as a classification task. The baseline model simply uses the average rating as the prediction. The code is here: https://github.com/mathsam/MachineLearning/blob/master/RecommendationSys/spark_als.py. A condensed sketch of the evaluation follows.
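A hedged sketch (the real code is in the linked spark_als.py; column names are assumptions carried over from the sketches above):

```python
from pyspark.ml.recommendation import ALS
from pyspark.sql import functions as F

als = ALS(rank=64, maxIter=10, userCol="user_idx", itemCol="item_idx",
          ratingCol="rating", coldStartStrategy="drop")
model = als.fit(train)
pred = model.transform(test)

# Threshold both the true and predicted ratings at 3 to get binary labels.
pred = (pred
        .withColumn("label", (F.col("rating") >= 3).cast("int"))
        .withColumn("pred_label", (F.col("prediction") >= 3).cast("int")))
accuracy = pred.filter(F.col("pred_label") == F.col("label")).count() / pred.count()

# Baseline: predict the global average training rating for every pair.
mean_rating = train.agg(F.avg("rating")).first()[0]
```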

[Figure: comparison of baseline and ALS.]
