## About recosystem Package

`recosystem` is an R wrapper of the `LIBMF` library developed by Yu-Chin Juan, Wei-Sheng Chin, Yong Zhuang, Bo-Wen Yuan, Meng-Yuan Yang, and Chih-Jen Lin (https://www.csie.ntu.edu.tw/~cjlin/libmf/), an open source library for recommender systems using parallel matrix factorization (Chin, Yuan, et al. 2015).

## Highlights of LIBMF and recosystem

`LIBMF` is a high-performance C++ library for large scale matrix factorization. `LIBMF` itself is a parallelized library, meaning that users can take advantage of multicore CPUs to speed up the computation. It also uses advanced CPU features, such as the SSE3 and AVX instruction sets, to further improve performance (Chin, Yuan, et al. 2015).

`recosystem` is a wrapper of `LIBMF`, hence it inherits most of the features of `LIBMF`, and additionally provides a number of user-friendly R functions to simplify data processing and model building. Also, unlike most other R packages for statistical modeling that store the whole dataset and model object in memory, `LIBMF` (and hence `recosystem`) can significantly reduce memory use: for instance, the constructed model that contains information for prediction can be stored on the hard disk, and the output can be written directly to a file rather than kept in memory.

## A Quick View of Recommender System

The main task of a recommender system is to predict unknown entries in the rating matrix based on observed values, as shown in the table below:

| | item_1 | item_2 | item_3 | … | item_n |
|---|---|---|---|---|---|
| user_1 | 2 | 3 | ?? | … | 5 |
| user_2 | ?? | 4 | 3 | … | ?? |
| user_3 | 3 | 2 | ?? | … | 3 |
| … | … | … | … | … | |
| user_m | 1 | ?? | 5 | … | 4 |

Each cell containing a number is the rating given by some user to a specific item, while cells marked with question marks are unknown ratings that need to be predicted. In the literature, this problem is also called collaborative filtering, matrix completion, matrix recovery, etc.

A popular technique to solve the recommender system problem is the matrix factorization method. The idea is to approximate the whole rating matrix \(R_{m\times n}\) by the product of two matrices of lower dimensions, \(P_{m\times k}\) and \(Q_{n\times k}\), such that

\[R\approx PQ'\]

Let \(p_u\) be the \(u\)-th row of \(P\) and \(q_v\) be the \(v\)-th row of \(Q\); then the rating given by user \(u\) to item \(v\) is predicted as \(p_u q'_v\).
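As a rough illustration of this prediction rule, the base R sketch below builds small random factor matrices and forms the predicted ratings. The sizes `m`, `n`, `k` and the random values are made up for the example; in practice \(P\) and \(Q\) come from model training.

```r
set.seed(1)
m = 4; n = 5; k = 2               # hypothetical sizes: users, items, latent factors
P = matrix(rnorm(m * k), m, k)    # user matrix, one row p_u per user
Q = matrix(rnorm(n * k), n, k)    # item matrix, one row q_v per item

R_hat = P %*% t(Q)                # m x n matrix of predicted ratings, R ~ PQ'

u = 2; v = 3
pred_uv = sum(P[u, ] * Q[v, ])    # p_u q_v' for a single (user, item) pair
all.equal(pred_uv, R_hat[u, v])   # both routes give the same prediction
```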

A typical solution for \(P\) and \(Q\) is given by the following optimization problem (Chin, Zhuang, et al. 2015a, 2015b):

\[\min_{P,Q} \sum_{(u,v)\in R} \left[f(p_u,q_v;r_{u,v})+\mu_P||p_u||_1+\mu_Q||q_v||_1+\frac{\lambda_P}{2} ||p_u||_2^2+\frac{\lambda_Q}{2} ||q_v||_2^2\right]\]

where \((u,v)\) are locations of observed entries in \(R\), \(r_{u,v}\) is the observed rating, \(f\) is the loss function, and \(\mu_P,\mu_Q,\lambda_P,\lambda_Q\) are penalty parameters to avoid overfitting.
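To make the objective concrete, here is a minimal base R sketch that evaluates it for given \(P\) and \(Q\). It assumes the loss \(f\) is the squared error \((r_{u,v}-p_u q'_v)^2\), which is only one possible choice, and the function name `mf_objective` is hypothetical rather than part of `recosystem`.

```r
# Objective value for observed triplets (u, v, r), using 1-based indices.
# Assumption: f is the squared error (r - p_u q_v')^2.
mf_objective = function(P, Q, u, v, r,
                        mu_P = 0, mu_Q = 0, lambda_P = 0, lambda_Q = 0) {
    pu = P[u, , drop = FALSE]    # row of P for each observed entry
    qv = Q[v, , drop = FALSE]    # row of Q for each observed entry
    loss = (r - rowSums(pu * qv))^2
    pen_l1 = mu_P * rowSums(abs(pu)) + mu_Q * rowSums(abs(qv))
    pen_l2 = lambda_P / 2 * rowSums(pu^2) + lambda_Q / 2 * rowSums(qv^2)
    sum(loss + pen_l1 + pen_l2)
}

P = matrix(c(1, 0, 0, 1), 2, 2)
Q = matrix(c(2, 0, 0, 3), 2, 2)
# Entries (1,1) and (2,2) are fitted exactly, so only the L2 penalties remain
obj = mf_objective(P, Q, u = c(1, 2), v = c(1, 2), r = c(2, 3),
                   lambda_P = 0.1, lambda_Q = 0.1)
obj
```

Note that, as in the formula, the penalty terms are added once per observed entry, so users and items with more ratings are penalized more heavily.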

The process of solving for the matrices \(P\) and \(Q\) is referred to as model training, and the selection of penalty parameters is called parameter tuning. In `recosystem`, we provide convenient functions for these two tasks, and additionally have functions for model exporting (outputting the \(P\) and \(Q\) matrices) and prediction.

## Data Input and Output

Each step in the recommender system involves data input and output, as the table below shows:

| Step | Input | Output |
|---|---|---|
| Model training | Training data set | – |
| Parameter tuning | Training data set | – |
| Exporting model | – | User matrix \(P\), item matrix \(Q\) |
| Prediction | Testing data set | Predicted values |

Data may have different formats and types of storage. For example, the input data set may be saved in a file or stored as R objects, and users may want the output to be written directly to a file or returned as R objects for further processing. In `recosystem`, we use two classes, `DataSource` and `Output`, to handle data input and output in a unified way.

An object of class `DataSource` specifies the source of a data set (either training or testing), which can be created by the following functions:

- `data_file()`: Specifies a data set from a file on the hard disk
- `data_memory()`: Specifies a data set from R objects
- `data_matrix()`: Specifies a data set from a sparse matrix

An object of class `Output` describes how the result should be output, and is typically returned by one of the functions below:

- `out_file()`: Result should be saved to a file
- `out_memory()`: Result should be returned as R objects
- `out_nothing()`: Nothing should be output

More data source formats and output options may be supported in the future along with the development of this package.
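The sketch below shows how these constructors might be called. The vectors hold the first few observed entries of the example rating matrix with 0-based indices, and the variable names are chosen only for illustration.

```r
library(recosystem)

# Training data held in R vectors (0-based user and item indices)
user   = c(0, 0, 1, 1, 2, 2)
item   = c(0, 1, 1, 2, 0, 1)
rating = c(2, 3, 4, 3, 3, 2)
train_src = data_memory(user, item, rating, index1 = FALSE)

# A data source backed by a file: the example file shipped with the package
file_src = data_file(system.file("dat", "smalltrain.txt", package = "recosystem"))

# Output descriptors
to_file   = out_file(tempfile())   # write results to a temporary file
to_memory = out_memory()           # return results as R objects
to_none   = out_nothing()          # discard results
```

These objects only describe where data comes from or goes to; they are then passed to the model methods for training, exporting, and prediction.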

## Data Format

The data file for the training set needs to be arranged in sparse matrix triplet form, i.e., each line in the file contains three numbers

`user_index item_index rating`

User index and item index may start with either 0 or 1, and this can be specified by the `index1` parameter in `data_file()` and `data_memory()`. For example, with `index1 = FALSE`, the training data file for the rating matrix at the beginning of this article may look like

```
0 0 2
0 1 3
1 1 4
1 2 3
2 0 3
2 1 2
...
```
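If the ratings start out as an R matrix with `NA` for unknown entries, a file in this triplet form can be produced with base R alone. The following is a sketch on the first three users and items of the example table; the matrix `R` and the temporary file path are assumptions made for the illustration.

```r
# Rating matrix with NA for unknown entries (first three users and items)
R = matrix(c( 2,  3, NA,
             NA,  4,  3,
              3,  2, NA), nrow = 3, byrow = TRUE)

# Work on the transpose so that entries are listed user by user
obs = which(!is.na(t(R)), arr.ind = TRUE)
triplets = data.frame(user   = obs[, "col"] - 1,   # 0-based user index
                      item   = obs[, "row"] - 1,   # 0-based item index
                      rating = t(R)[obs])

train_path = tempfile(fileext = ".txt")
write.table(triplets, train_path, row.names = FALSE, col.names = FALSE)
readLines(train_path)
```

The resulting file can then be passed to `data_file()` with `index1 = FALSE`.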

Since version 0.4, `recosystem` supports two special types of matrix factorization: binary matrix factorization (BMF) and one-class matrix factorization (OCMF). BMF requires ratings to take values in `{-1, 1}`, and OCMF requires all ratings to be positive.

The testing data file is similar to the training data file, but since the ratings in testing data are usually unknown, the `rating` entry in the testing data file can be omitted, or can be replaced by any placeholder such as `0` or `?`.

The testing data file for the same rating matrix would be

```
0 2
1 0
2 2
...
```

Example data files are contained in the `<recosystem>/dat` directory (or `<recosystem>/inst/dat` for the source package).

## Usage of recosystem

The usage of `recosystem` is quite simple, mainly consisting of the following steps:

- Create a model object (a Reference Class object in R) by calling `Reco()`.
- (Optionally) call the `$tune()` method to select the best tuning parameters from a set of candidate values.
- Train the model by calling the `$train()` method. A number of parameters can be set inside the function, possibly coming from the result of `$tune()`.
- (Optionally) export the model via `$output()`, i.e. write the factorization matrices \(P\) and \(Q\) into files or return them as R objects.
- Use the `$predict()` method to compute predicted values.

Below is an example on some simulated data:

```
library(recosystem)
set.seed(123) # This is a randomized algorithm
train_set = data_file(system.file("dat", "smalltrain.txt", package = "recosystem"))
test_set = data_file(system.file("dat", "smalltest.txt", package = "recosystem"))
r = Reco()
opts = r$tune(train_set, opts = list(dim = c(10, 20, 30), lrate = c(0.1, 0.2),
                                     costp_l1 = 0, costq_l1 = 0,
                                     nthread = 1, niter = 10))
opts
```

```
## $min
## $min$dim
## [1] 20
##
## $min$costp_l1
## [1] 0
##
## $min$costp_l2
## [1] 0.1
##
## $min$costq_l1
## [1] 0
##
## $min$costq_l2
## [1] 0.01
##
## $min$lrate
## [1] 0.1
##
## $min$loss_fun
## [1] 0.9804937
##
##
## $res
## dim costp_l1 costp_l2 costq_l1 costq_l2 lrate loss_fun
## 1 10 0 0.01 0 0.01 0.1 0.9996368
## 2 20 0 0.01 0 0.01 0.1 1.0040111
## 3 30 0 0.01 0 0.01 0.1 0.9967101
## 4 10 0 0.10 0 0.01 0.1 0.9930384
## 5 20 0 0.10 0 0.01 0.1 0.9804937
## 6 30 0 0.10 0 0.01 0.1 0.9921565
## 7 10 0 0.01 0 0.10 0.1 0.9857116
## 8 20 0 0.01 0 0.10 0.1 1.0006225
## 9 30 0 0.01 0 0.10 0.1 0.9891277
## 10 10 0 0.10 0 0.10 0.1 0.9826748
## 11 20 0 0.10 0 0.10 0.1 0.9807865
## 12 30 0 0.10 0 0.10 0.1 0.9863404
## 13 10 0 0.01 0 0.01 0.2 1.1022376
## 14 20 0 0.01 0 0.01 0.2 1.0266608
## 15 30 0 0.01 0 0.01 0.2 1.0039170
## 16 10 0 0.10 0 0.01 0.2 1.0734307
## 17 20 0 0.10 0 0.01 0.2 1.0393326
## 18 30 0 0.10 0 0.01 0.2 1.0003177
## 19 10 0 0.01 0 0.10 0.2 1.0769594
## 20 20 0 0.01 0 0.10 0.2 1.0323938
## 21 30 0 0.01 0 0.10 0.2 1.0061849
## 22 10 0 0.10 0 0.10 0.2 1.0365456
## 23 20 0 0.10 0 0.10 0.2 1.0023265
## 24 30 0 0.10 0 0.10 0.2 1.0044131
```

`r$train(train_set, opts = c(opts$min, nthread = 1, niter = 20))`

```
## iter tr_rmse obj
## 0 2.2673 5.3765e+04
## 1 1.0267 1.3667e+04
## 2 0.8372 1.0147e+04
## 3 0.7977 9.4773e+03
## 4 0.7703 9.0439e+03
## 5 0.7402 8.5967e+03
## 6 0.7048 8.1202e+03
## 7 0.6609 7.5638e+03
## 8 0.6133 7.0246e+03
## 9 0.5614 6.4770e+03
## 10 0.5110 5.9985e+03
## 11 0.4633 5.5846e+03
## 12 0.4203 5.2436e+03
## 13 0.3833 4.9761e+03
## 14 0.3510 4.7545e+03
## 15 0.3240 4.5818e+03
## 16 0.3005 4.4356e+03
## 17 0.2808 4.3158e+03
## 18 0.2640 4.2181e+03
## 19 0.2493 4.1321e+03
```

```
## Write predictions to file
pred_file = tempfile()
r$predict(test_set, out_file(pred_file))
```

`## prediction output generated at /tmp/Rtmpv6QpAN/file12da161a6e28`

`print(scan(pred_file, n = 10))`

```
## [1] 3.76629 2.85805 3.13870 3.22261 2.88342 2.93686 2.71680 2.96046 2.78316
## [10] 3.65473
```

```
## Or, directly return an R vector
pred_rvec = r$predict(test_set, out_memory())
head(pred_rvec, 10)
```

```
## [1] 3.766289 2.858053 3.138700 3.222606 2.883424 2.936856 2.716800 2.960459
## [9] 2.783164 3.654734
```
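The optional export step is not covered by the example above. The sketch below shows one way it might look, returning both factor matrices as R objects; it retrains a small model so that it runs on its own, and the option values (`dim = 10`, `niter = 5`, `verbose = FALSE`) are arbitrary choices for illustration rather than recommended settings.

```r
library(recosystem)
set.seed(123)

train_set = data_file(system.file("dat", "smalltrain.txt", package = "recosystem"))
r = Reco()
r$train(train_set, opts = list(dim = 10, nthread = 1, niter = 5, verbose = FALSE))

# Return P and Q as R matrices instead of writing them to files
mats = r$output(out_memory(), out_memory())
dim(mats$P)   # one row per user, one column per latent factor
dim(mats$Q)   # one row per item, one column per latent factor
```

Passing `out_file()` objects instead would write each matrix to disk, which keeps memory use low for large models.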

Detailed help documents for each function are available in the topics `?recosystem::Reco`, `?recosystem::tune`, `?recosystem::train`, `?recosystem::output` and `?recosystem::predict`.

## Performance Improvement with Extra Installation Options

To build `recosystem` from source, one needs a C++ compiler that supports the C++11 standard.

Also, there are some flags in the file `src/Makevars` (`src/Makevars.win` for Windows systems) that may significantly affect performance. It is strongly suggested to set proper flags according to your type of CPU before compiling the package, in order to achieve the best performance:

- The default `Makevars` provides generic options that should apply to most CPUs.
- If your CPU supports SSE3, add

```
PKG_CPPFLAGS += -DUSESSE
PKG_CXXFLAGS += -msse3
```

- If your CPU supports AVX in addition to SSE3, add

```
PKG_CPPFLAGS += -DUSEAVX
PKG_CXXFLAGS += -mavx
```

After editing the `Makevars` file, run `R CMD INSTALL recosystem`, where `recosystem` is the package source directory, to install the package.

## References

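Chin, Wei-Sheng, Yong Zhuang, Yu-Chin Juan, and Chih-Jen Lin. 2015a. "A Fast Parallel Stochastic Gradient Method for Matrix Factorization in Shared Memory Systems." *ACM TIST*. https://www.csie.ntu.edu.tw/~cjlin/papers/libmf/libmf_journal.pdf.

Chin, Wei-Sheng, Yong Zhuang, Yu-Chin Juan, and Chih-Jen Lin. 2015b. "A Learning-Rate Schedule for Stochastic Gradient Methods to Matrix Factorization." *PAKDD*. https://www.csie.ntu.edu.tw/~cjlin/papers/libmf/mf_adaptive_pakdd.pdf.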