
The challenge of matrix factorization for building recommendation systems often comes from scale. With large amounts of data (millions of users or items), it becomes difficult to run matrix factorization on a single machine. In this case, alternating least squares (ALS) is often used as a scalable algorithm. Here I benchmark some existing packages/systems on small and large scale data.
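For intuition, ALS alternates between two closed-form ridge-regression solves: fix the item factors and solve for the user factors, then fix the user factors and solve for the item factors. Below is a minimal dense sketch of that loop (illustrative only: it treats every entry of R as observed, whereas production ALS implementations fit only the observed ratings and work on sparse, distributed data):

```python
import numpy as np

def als_dense(R, k=64, n_iter=10, reg=0.1):
    """Factorize R (m x n) as U @ V.T with k latent dimensions."""
    m, n = R.shape
    rng = np.random.default_rng(0)
    U = rng.standard_normal((m, k))
    V = rng.standard_normal((n, k))
    I = reg * np.eye(k)
    for _ in range(n_iter):
        # Fix V, solve (V.T V + reg I) U.T = V.T R.T for U
        U = np.linalg.solve(V.T @ V + I, V.T @ R.T).T
        # Fix U, solve (U.T U + reg I) V.T = U.T R for V
        V = np.linalg.solve(U.T @ U + I, U.T @ R).T
    return U, V
```

Because each row of U (and of V) can be solved independently given the other factor matrix, the updates parallelize naturally, which is what makes ALS attractive at scale.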

Data in use

MovieLens 25M Dataset: 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users.

Amazon review data (2018): 233 million reviews (44 million users, 15 million items; extremely sparse)

Packages/Systems under test

Python package implicit, which has CUDA acceleration

Sklearn TruncatedSVD

Spark MLlib: ALS

Results

The hidden dimension is set to 64 and the maximum number of iterations to 10.
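For reference, a minimal sketch of how the single-machine runs can be invoked with these settings (the toy input matrix and exact benchmark flags are assumptions; the parameter names come from the packages' public APIs):

```python
import implicit
import scipy.sparse as sp
from scipy.sparse.linalg import svds
from sklearn.decomposition import TruncatedSVD

# Small random stand-in for the MovieLens user-item rating matrix.
user_item = sp.random(10_000, 2_000, density=0.01, format="csr")

# implicit: ALS with 64 factors and 10 iterations (use_gpu=True for the GPU run).
als = implicit.als.AlternatingLeastSquares(factors=64, iterations=10, use_gpu=False)
als.fit(user_item)  # note: implicit versions before 0.5 expect an item-user matrix

# sklearn: randomized truncated SVD.
svd = TruncatedSVD(n_components=64, algorithm="randomized", n_iter=10)
user_factors = svd.fit_transform(user_item)

# scipy: ARPACK-based sparse SVD.
u, s, vt = svds(user_item, k=64)
```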

Data           Package                                       Hardware                         Time
MovieLens      implicit                                      i7-7700k                         70.2 s
MovieLens      implicit                                      Nvidia 1080                      9.94 s
MovieLens      sklearn TruncatedSVD (algorithm=randomized)   i7-7700k                         27.2 s
MovieLens      scipy.sparse.linalg.svds (ARPACK)             i7-7700k                         16.8 s
MovieLens      Spark MLlib*                                  6 executors, 15 cores (D13 v2)   74.6 s
MovieLens      Spark MLlib                                   9 executors, 64 cores            52 s
Amazon review  Spark MLlib                                   11 executors, 80 cores           2 h (64 dim); 20 min (16 dim)

Time spent by each package on different hardware.
* Spark MLlib times include IO.

Challenges with Spark MLlib ALS on large data

Even with a Spark cluster, working with tens of millions of users and items is still challenging. There are a few pitfalls I encountered:

StringIndexer fails due to OOM: some method is needed to convert user/item IDs into continuous indices. Using StringIndexer is the most straightforward way; however, it turns out not to scale well. This is solved by using zipWithIndex instead, as sketched below.
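A minimal sketch of the zipWithIndex workaround (the column names `user`, `item`, and `rating` and the input path are assumptions; the actual code is in the spark_als.py linked below):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
ratings = spark.read.parquet("ratings.parquet")  # hypothetical input path

# Build a (user id -> dense integer index) mapping without StringIndexer.
user_index = (
    ratings.select("user").distinct().rdd
    .map(lambda row: row[0])
    .zipWithIndex()                # yields (user_id, index) pairs
    .toDF(["user", "user_idx"])
)
ratings = ratings.join(user_index, on="user")  # repeat similarly for item ids
```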

randomSplit does something unexpected: because the split is evaluated lazily, the upstream lineage can be recomputed for each split, so rows may land in both splits or in neither. This is solved by calling persist before feeding data into it, as sketched below.
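A minimal sketch of the fix (the 99/1 split matches the evaluation below; the seed is an assumption):

```python
# Materialize the indexed ratings once so both splits are drawn from
# the same data instead of two independent recomputations of the lineage.
ratings = ratings.persist()
train, test = ratings.randomSplit([0.99, 0.01], seed=42)
```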

After solving these two issues and tuning the memory limits, I was finally able to run ALS on such a large dataset. I evaluated the result by holding out 1% of the data as a test set and treating a user rating >= 3 as a positive label; in this way, the predicted rating is evaluated as a classification task. The baseline model simply uses the average rating as the prediction. The code is here: https://github.com/mathsam/MachineLearning/blob/master/RecommendationSys/spark_als.py. A condensed sketch of the evaluation follows.
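A hedged sketch (the real code is in the linked spark_als.py; column names are assumptions carried over from the sketches above):

```python
from pyspark.ml.recommendation import ALS
from pyspark.sql import functions as F

als = ALS(rank=64, maxIter=10, userCol="user_idx", itemCol="item_idx",
          ratingCol="rating", coldStartStrategy="drop")
model = als.fit(train)
pred = model.transform(test)

# Threshold both the true and predicted ratings at 3 to get binary labels.
pred = (pred
        .withColumn("label", (F.col("rating") >= 3).cast("int"))
        .withColumn("pred_label", (F.col("prediction") >= 3).cast("int")))
accuracy = pred.filter(F.col("pred_label") == F.col("label")).count() / pred.count()

# Baseline: predict the global average training rating for every pair.
mean_rating = train.agg(F.avg("rating")).first()[0]
```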

[Figure: comparison of baseline and ALS.]
