Edit: As pointed out in the comments my initial claim that it beats the winning solution turned out to be false. The prize was judged on a dataset that was set in a future time as compared to the training set.
If you are familiar with the Netflix prize challenge, you would remember the final solution that got the 1M prize.. It was a mix of solutions from a few winning teams and probably was not the most elegant of solutions. There were some reports that Netflix did not use the final solution.
Here is a much simpler Neural network based solution that beats the top result on a validation set carved from the original dataset. and should not take more than 3 hours to run and few minutes to code.
Code for the model as implemented in Keras
movie_count = 17771
user_count = 2649430
model_left = Sequential()
model_left.add(Embedding(movie_count, 60, input_length=1))
model_right = Sequential()
model_right.add(Embedding(user_count, 20, input_length=1))
model = Sequential()
model.add(Merge([model_left, model_right], mode='concat'))
model.add(Flatten())
model.add(Dense(64))
model.add(Activation('sigmoid'))
model.add(Dense(64))
model.add(Activation('sigmoid'))
model.add(Dense(64))
model.add(Activation('sigmoid'))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adadelta')
In short there are 2 embeddings a 60 dimensional embedding for movies and a 20 dimensional embedding for each of the users. The embeddings are concatenated to form a single 80 dimensional input vector.
The Embedding layer in Keras provides a mapping from an integer to a vector of specified length initialized randomly. For instance this code
model_left.add(Embedding(movie_count, 60, input_length=1))
initializes a matrix of dimensions movie_count x 60 ( 17771 x 60) randomly. Similarly for users, a matrix of dimensions ( 2649430 x 20) is initialized. During training phase, the vector for each user and movie is updated so as to reduce the error. Ultimately at the end of training all users with similar interests should move closer and similar movies should move closer in their respective embedding space.
The network learns a mapping between this embedding vector and the rating. The model is fit as:
model.fit([tr[:,0].reshape((L,1)), tr[:,1].reshape((L,1))], tr[:,2].reshape((L,1)), batch_size=24000, nb_epoch=42, validation_data=([ ts[:,0].reshape((M,1)), ts[:,0].reshape((M,1))], ts[:,2].reshape((M,1))))
In the above code, tr is the training data, which are triples of movie_id, user_id, rating. L is 90432456 and M is 10048051, (around 90%, 10%) split for training and validation data. Another thing to note is that the data set is randomly shuffled before split and the training is done on the random set.
After about 40 epochs you should see the validation error around: 0.7006 or a root mean squared error of 0.837. This is around 2% better than the million dollar prize winning error rate of 0.8553.
Epoch 39/64 90432456/90432456 [==============================] - 1183s - loss: 0.6475 - val_loss: 0.7006
This model does not consider the time of rating date. The results can probably improved by using all available data.