Overview
It is well-known that missing values are one of the biggest challenges in data science
projects.
You might know that k-nearest-neighbour-based Collaborative Filtering is also called
"memory-based" Collaborative Filtering. Luckily, data scientists and researchers have
been working hard to solve the missing value problem in k-neighbourhood-based
Collaborative Filtering, and have found solutions.
In this assignment, you are required to tackle the missing value problem in
Collaborative Filtering by predicting the missing values. Specifically, an existing solution for
predicting the missing values in Collaborative Filtering is provided in a report named
"Effective Missing Data Prediction for Collaborative Filtering". Please read this report
carefully, then complete the following tasks.
Task 1: Implementation
In this task, you are required to implement the solution in the provided report so as to
predict the missing values in Collaborative Filtering.
Note that you are required to write your own implementation. Please do not use
any libraries that are related to Recommender Systems or Collaborative Filtering.
If you use any of these libraries, your implementation part will be invalid.
We provide Python framework code (named assignment3 framework.ipynb) to help
you get started, and this will also automate the correctness marking. The framework
also includes the training data and the test data.
Please only put your own code in the provided cell in the framework, as shown in
Figure 1. Please DO NOT CHANGE anything in the other cells of the framework,
otherwise the changes might cause errors during the automatic marking.
Please provide detailed comments to explain your implementation. How much detail
should you provide? Please take the comments in the ipynb files
from Week 10 (knn based cf updated.zip) as examples of the level of detail
expected in your solution. You might find the following information useful:
https://www.w3schools.com/python/python_comments.asp
Task 2: Presentation
• The presentation should:
  - Explain, clearly and completely and in your own words, how the solution in the
    provided report predicts the missing values in Collaborative Filtering.
  - Explain, clearly and completely, why the solution in the provided report can
    tackle the missing value problem in Collaborative Filtering.
  - Explain, clearly and completely, how you implemented the solution.
• The presentation should be no more than 10 minutes.
• Your presentation slides should be either:
  - Microsoft PowerPoint slides (with audio inserted for each slide by using Insert
    > Audio > Record Audio), or
  - your own presentation slides (e.g. a PDF version), submitted together with
    your own recording of the presentation (in mp4 or avi format).
Note:
1. Main menu > Kernel > Restart & Run All.
   Wait till you see the output displayed properly. You should see all the data
   printed and graphs displayed.
Effective Missing Data Prediction for Collaborative Filtering
Hao Ma, Irwin King and Michael R. Lyu
Dept. of Computer Science and Engineering
The Chinese University of Hong Kong
Shatin, N.T., Hong Kong
{ hma, king, lyu }@cse.cuhk.edu.hk
ABSTRACT
Memory-based collaborative filtering algorithms have been
widely adopted in many popular recommender systems, al-
though these approaches all suffer from data sparsity and
poor prediction quality problems. Usually, the user-item
matrix is quite sparse, which directly leads to inaccurate rec-
ommendations. This paper focuses the memory-based col-
laborative filtering problems on two crucial factors: (1) sim-
ilarity computation between users or items and (2) miss-
ing data prediction algorithms. First, we use the enhanced
Pearson Correlation Coefficient (PCC) algorithm by adding
one parameter which overcomes the potential decrease of
accuracy when computing the similarity of users or items.
Second, we propose an effective missing data prediction al-
gorithm, in which information of both users and items is
taken into account. In this algorithm, we set the similarity
threshold for users and items respectively, and the prediction
algorithm will determine whether to predict the missing
data or not. We also address how to predict the missing data
by employing a combination of user and item information.
Finally, empirical studies on dataset MovieLens have shown
that our newly proposed method outperforms other state-
of-the-art collaborative filtering algorithms and it is more
robust against data sparsity.
Categories and Subject Descriptors: H.3.3 [Informa-
tion Systems]: Information Search and Retrieval - Informa-
tion Filtering
General Terms: Algorithm, Performance, Experimenta-
tion.
Keywords: Collaborative Filtering, Recommender System,
Data Prediction, Data Sparsity.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGIR’07, July 23–27, 2007, Amsterdam, The Netherlands.
Copyright 2007 ACM XXXXXXXXXX/07/0007 ...$5.00.
1. INTRODUCTION
Collaborative filtering is the method which automatically
predicts the interest of an active user by collecting rating
information from other similar users or items, and related
techniques have been widely employed in some large, famous
commercial systems, such as Amazon¹ and Ebay². The
underlying assumption of collaborative filtering is that the
active user will prefer those items which the similar users
prefer. The research of collaborative filtering started from
memory-based approaches which utilize the entire user-item
database to generate a prediction based on user or item sim-
ilarity. Two types of memory-based methods have been
studied: user-based [2, 7, 10, 22] and item-based [5, 12,
17]. User-based methods first look for some similar users
who have similar rating styles with the active user and then
employ the ratings from those similar users to predict the
ratings for the active user. Item-based methods share the
same idea with user-based methods. The only difference
is user-based methods try to find the similar users for an
active user but item-based methods try to find the simi-
lar items for each item. Whether in user-based approaches
or in item-based approaches, the computation of similarity
between users or items is a very critical step. Notable similarity
computation algorithms include the Pearson Correlation
Coefficient (PCC) [16] and Vector Space Similarity (VSS)
algorithm [2].
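As an illustration of the user-based idea described above, the following Python sketch predicts one rating for an active user from the ratings of similar users. It is a minimal sketch and not the method proposed in this paper: the function name predict_user_based, the NaN encoding of missing ratings, the precomputed similarity vector sims, and the mean-centred weighted average (a widely used formulation of user-based prediction) are all assumptions made for illustration.

```python
import numpy as np

def predict_user_based(ratings, sims, a, i):
    """Predict the rating of active user `a` for item `i` (illustrative only).

    ratings: 2-D float array (users x items), np.nan marks a missing rating.
    sims:    1-D array, sims[u] is the similarity between user u and user a
             (assumed to be computed elsewhere, e.g. with PCC).
    Assumes every user has at least one observed rating.
    """
    mean_a = np.nanmean(ratings[a])  # average observed rating of the active user
    num = den = 0.0
    for u in range(ratings.shape[0]):
        # Skip the active user, users who did not rate item i,
        # and users who are not positively similar.
        if u == a or np.isnan(ratings[u, i]) or sims[u] <= 0:
            continue
        # Use the neighbour's deviation from their own mean, so that
        # generous and strict raters are put on a comparable footing.
        num += sims[u] * (ratings[u, i] - np.nanmean(ratings[u]))
        den += sims[u]
    # Fall back to the active user's mean when no neighbour qualifies.
    return float(mean_a + num / den) if den > 0 else float(mean_a)

R = np.array([[4.0, np.nan, 2.0],
              [5.0, 5.0, 2.0],
              [1.0, 2.0, 5.0]])
sims_to_user0 = np.array([1.0, 0.8, -0.5])  # made-up similarities for the example
print(predict_user_based(R, sims_to_user0, a=0, i=1))
```

An item-based variant would follow the same pattern with the roles of users and items exchanged.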
Although memory-based approaches have been widely used
in recommendation systems [12, 16], the problem of inaccurate
recommendation results still exists in both user-based
and item-based approaches. The fundamental problem of
memory-based approaches is the data sparsity of the user-
item matrix. Many recent algorithms have been proposed
to alleviate the data sparsity problem. In [21], Wang et
al. proposed a generative probabilistic framework to exploit
more of the data available in the user-item matrix by fusing
all ratings with a predictive value for a recommendation to
be made. Xue et al. [22] proposed a framework for collaborative
filtering which combines the strengths of memory-based
approaches and model-based approaches by introducing
a smoothing-based method, and solved the data sparsity
problem by predicting all the missing data in a user-item ma-
trix. Although the simulation showed that this approach can
achieve better performance than other collaborative filtering
algorithms, the cluster-based smoothing algorithm limited
the diversity of users in each cluster and predicting all the
missing data in the user-item matrix could bring negative
influence for the recommendation of active users.
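To make the notion of data sparsity concrete, the following sketch measures the fraction of observed entries in a small user-item matrix. The NaN encoding of missing ratings, the helper name density and the toy matrix are assumptions of this sketch, not part of the paper.

```python
import numpy as np

def density(ratings):
    """Fraction of user-item cells that hold an observed rating.

    ratings: 2-D float array (users x items), np.nan marks a missing rating.
    Sparsity is simply 1 - density.
    """
    return float(np.count_nonzero(~np.isnan(ratings)) / ratings.size)

# Toy matrix: 3 users x 4 items, 5 observed ratings out of 12 cells.
R = np.array([[5.0, np.nan, np.nan, 1.0],
              [np.nan, 3.0, np.nan, np.nan],
              [np.nan, np.nan, 4.0, 2.0]])
print(f"density = {density(R):.3f}, sparsity = {1 - density(R):.3f}")
```

The lower the density, the fewer co-rated items any two users share, which is what makes similarity estimates unreliable.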
¹ http://www.amazon.com/.
² http://www.half.ebay.com/.
SIGIR 2007 Proceedings Session 2: Routing and Filtering
In this paper, we first use PCC-based significance weighting
to compute similarity between users and items, which
overcomes the potential decrease of similarity accuracy. Second,
we propose an effective missing data prediction algorithm
which exploits the information both from users and
items. Moreover, this algorithm will predict the missing
data of a user-item matrix if and only if we think it will bring
positive influence for the recommendation of active users
instead of predicting every missing data of the user-item
matrix. The simulation shows our novel approach achieves
better performance than other state-of-the-art collaborative
filtering approaches.
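The two ideas in this paragraph can be sketched as follows. The exact formulas are deferred to later sections, so this sketch rests on stated assumptions: it uses a common form of significance weighting, which scales a similarity by min(n, gamma) / gamma where n is the number of co-rated items, and it reads "predict only when it helps" as requiring at least one neighbour whose similarity exceeds a threshold. The names significance_weighted, neighbours_above, gamma and threshold are hypothetical.

```python
def significance_weighted(sim, n_common, gamma):
    """Devalue a similarity that is based on few co-rated items.

    sim:      raw similarity (e.g. a PCC value)
    n_common: number of items (or users) both sides have in common
    gamma:    assumed cut-off; with n_common >= gamma the similarity is unchanged
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return min(n_common, gamma) / gamma * sim

def neighbours_above(sims, threshold):
    """Indices of neighbours whose similarity exceeds the threshold.

    An empty result is interpreted here as "leave the entry missing".
    """
    return [idx for idx, s in enumerate(sims) if s > threshold]

# Two users with the same raw similarity but very different overlap.
print(significance_weighted(0.9, n_common=3, gamma=30))   # strongly devalued
print(significance_weighted(0.9, n_common=50, gamma=30))  # kept as is

# Only sufficiently similar neighbours are allowed to contribute.
print(neighbours_above([0.2, 0.7, 0.5], threshold=0.4))
print(neighbours_above([0.1, 0.2], threshold=0.4))
```

Under these assumptions, a missing entry with no qualifying neighbour is simply not filled in, rather than being predicted from weak evidence.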
The remainder of this paper is organized as follows. In
Section 2, we provide an overview of several major approaches
for collaborative filtering. Section 3 shows the method of
similarity computation. The framework of our missing data
prediction and collaborative filtering is introduced in Sec-
tion 4. The results of an empirical analysis are presented in
Section 5, followed by a conclusion in Section 6.
2. RELATED WORK
In this section, we review several major approaches for
collaborative filtering. Two types of collaborative filtering
approaches are widely studied: memory-based and model-based.
2.1 Memory-based Approaches
The memory-based approaches are the most popular prediction
methods and are widely adopted in commercial collaborative
filtering systems [12, 16]. The most analyzed examples
of memory-based collaborative filtering include user-based
approaches [2, 7, 10, 22] and item-based approaches [5,
12, 17]. User-based approaches predict the ratings of active
users based on the ratings of similar users found, and item-based
approaches predict the ratings of active users based
on the information of similar items computed. User-based
and item-based approaches often use PCC algorithm [16]
and VSS algorithm [2] as the similarity computation meth-
ods. PCC-based collaborative filtering generally can achieve
higher performance than the other popular algorithm VSS,
since it considers the differences of user rating styles.
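The effect of rating styles can be seen in a small example. The sketch below compares VSS (the cosine of the raw rating vectors) with PCC (the cosine after subtracting each user's own mean) on two users who both rate generously but disagree on every item. The function names vss and pcc and the toy ratings are assumptions of this sketch, and it assumes both users rated all four items and neither gave a constant rating.

```python
import numpy as np

def vss(x, y):
    """Vector Space Similarity: cosine of the angle between raw rating vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def pcc(x, y):
    """Pearson correlation: cosine after removing each user's mean rating.

    Assumes neither vector is constant (otherwise the norm would be zero).
    """
    return vss(x - x.mean(), y - y.mean())

# Both users give only 4s and 5s, but prefer opposite items.
a = np.array([5.0, 4.0, 5.0, 4.0])
b = np.array([4.0, 5.0, 4.0, 5.0])

print(round(vss(a, b), 3))  # 0.976: looks almost identical
print(round(pcc(a, b), 3))  # -1.0: preferences are exactly opposite
```

Because VSS works on raw ratings, two generous raters appear similar even when their preferences conflict; mean-centring in PCC removes that effect.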
2.2 Model-based Approaches
In the model-based approaches, training datasets are used
to train a predefined model. Examples of model-based ap-
proaches include clustering models [11, 20, 22], aspect mod-
els [8, 9, 19] and latent factor model [3]. [11] presented
an algorithm for collaborative filtering based on hierarchical
clustering, which tried to balance robustness and accuracy of
predictions, especially when little data were available. Au-
thors in [8] proposed an algorithm based on a generaliza-
tion of probabilistic latent semantic analysis to continuous-
valued response variables. The model-based approaches are
often time-consuming to build and update, and cannot cover
as diverse a user range as the memory-based approaches
do [22].
2.3 Other Related Approaches
In order to take the advantages of memory-based and
model-based approaches, hybrid collaborative filtering methods
have been studied recently [14, 22]. [1, 4] unified collaborative
filtering and content-based filtering, which achieved
significant improvements over the standard approaches. At
the same time, in order