# Classic algorithm notes: anomaly detection and recommendation system

This article is the notes and code reproduction part (clustering, dimensionality reduction) of teacher Wu Enda's machine learning course .

Author: Huang Haiguang

Note: Notes, assignments (including data, original assignment files), and videos are all downloaded in github.

I will successively post the course notes and course codes on the public account "Machine Learning Beginners", so stay tuned. This is the fifth part: anomaly detection and recommendation system, the first nine weeks of the original tutorial, including notes and operating codes (formerly coursework is  OCTAVE , here is the reproduction of python code)

Completed part:

Part 1: Return

Part 2: Logistic regression

Part 3: Support Vector Machine

Part 4: Unsupervised learning

Note's markdown file 

Pdf file of notes 

## 15. Anomaly Detection

### 15.1 Motivation of the problem

Reference document: 15-1-Problem Motivation (8 min).mkv

In the next series of videos, I will introduce you to the anomaly detection ( Anomaly detection ) problem. This is a common application of machine learning algorithms. An interesting thing about this algorithm is that although it is mainly used for unsupervised learning problems, from some perspectives, it is similar to some supervised learning problems.

What is anomaly detection? To explain this concept, let me give an example:

Imagine you are an aircraft engine manufacturer. When the aircraft engine you produce comes out of the production line, you need to perform QA (Quality Control Test). As part of this test, you measure some characteristic variables of the aircraft engine, such as the engine The heat generated during operation, or the vibration of the engine, etc.

In this way, you have a data set, from to, if you produce an engine, you draw these data into a chart, it looks like this:

Every point and every cross here is your unlabeled data. In this way, the anomaly detection problem can be defined as follows: We assume that one day later, you have a new aircraft engine flowing out of the production line, and your new aircraft engine has characteristic variables. The so-called anomaly detection problem is: we want to know whether this new aircraft engine has some kind of abnormality, or in other words, we want to judge whether this engine needs further testing. Because, if it looks like a normal engine, then we can ship it directly to the customer without further testing.

Given a data set, we assume that the data set is normal, and we want to know whether the new data is abnormal, that is, the probability that the test data does not belong to the set of data. The model we build should be able to tell us the possibility of belonging to a set of data based on the location of the test data.

In the above figure, the data in the blue circle is more likely to belong to this group of data, and the more remote the data, the less likely it is to belong to this group of data.

This method is called density estimation and is expressed as follows:

Fraud detection:

The model is the possibility for us to belong to a set of data by detecting abnormal users.

Anomaly detection is mainly used to identify deception. For example, for user-related data collected online, a feature vector may include such as: how often the user logs in, the pages visited, the number of posts posted on the forum, and even the typing speed. Try to build a model based on these characteristics, which can be used to identify users who do not fit the model.

Another example is to detect a data center. Features may include: memory usage, number of disks accessed, CPU load, network traffic, etc. Based on these characteristics, a model can be constructed to determine whether some computers may be wrong.

### 15.2 Gaussian distribution

Reference video: 15-2-Gaussian Distribution (10 min).mkv

In this video, I will introduce the Gaussian distribution, also known as the normal distribution. Review the basic knowledge of Gaussian distribution.

Usually if we think that the variable conforms to the Gaussian distribution, then its probability density function is:

We can use the existing data to predict the sum of the population. The calculation method is as follows:

Example of Gaussian distribution:

Note: In machine learning, we usually only divide by the variance, not in statistics. By the way, in actual use, the difference between whether you choose to use or actually is very small. As long as you have a fairly large training set, most people in the field of machine learning are more accustomed to using this version of the formula. The formulas of these two versions are slightly different in theoretical and mathematical properties, but in actual use, the difference between them is very small and almost negligible.

### 15.3 Algorithm

Reference video: 15-3-Algorithm (12 min).mkv

In this video, I will apply Gaussian distribution to develop anomaly detection algorithm.

Anomaly detection algorithm:

For a given data set, we have to calculate the estimated value of and for each feature.

Once we have obtained the estimated values of the mean and variance, given a new training instance, calculate according to the model:

At that time, it was abnormal.

The following figure is a training set with two features, and the distribution of the features:

The following three-dimensional chart represents the density estimation function, and the axis is the estimated value based on the values of the two features:

We choose one, which will be our judgment boundary. At that time, the predicted data is normal data, otherwise it is abnormal.

In this video, we introduced how to fit, which is the probability value of, to develop an anomaly detection algorithm. At the same time, in this lesson, we also give out fitting parameters through the given data set, perform parameter estimation, obtain parameters and, and then detect new samples to determine whether the new samples are abnormal.

In the next course, we will study this algorithm in depth and at the same time introduce in more depth how to make the algorithm work more effectively.

### 15.4 Develop and evaluate an anomaly detection system

Reference video: 15-4-Developing and Evaluating an Anomaly Detection System (13 min). mkv

The anomaly detection algorithm is an unsupervised learning algorithm, which means that we cannot tell us whether the data is really anomalous based on the value of the result variable. We need another method to help test whether the algorithm is effective. When we develop an anomaly detection system, we start with labeled (abnormal or normal) data, we select a part of the normal data from it to construct the training set, and then use the remaining normal data and abnormal data to form an intersection. Inspection set and test set.

For example: we have data for 10,000 normal engines and data for 20 abnormal engines. We distribute the data like this:

The data of 6000 normal engines is used as the training set

The data of 2000 normal engines and 10 abnormal engines are used as a cross-check set

The data of 2000 normal engines and 10 abnormal engines are used as the test set

The specific evaluation methods are as follows:

1. According to the test set data, we estimate the mean and variance of the features and construct the function
2. For the cross-check set, we try to use different values as the threshold, and predict whether the data is abnormal, and choose according to the value or the ratio of precision to recall
3. After selection, predict the test set, calculate the value of the abnormal inspection system, or the ratio of the precision rate to the recall rate

### 15.5 Comparison of anomaly detection and supervised learning

Reference video: 15-5-Anomaly Detection vs. Supervised Learning (8 min).mkv

The anomaly detection system we built before also uses labeled data, which is somewhat similar to supervised learning. The following comparison helps to choose whether to use supervised learning or anomaly detection:

Comparison of the two:

abnormal detectionSupervised learning
Very small number of positive categories (abnormal data), a large number of negative categories ()At the same time there are a large number of positive and negative classes
Many different kinds of exceptions are very difficult. The algorithm is trained based on a very small amount of forward data.There are enough forward class instances to be used for training the algorithm, and the forward class instances encountered in the future may be very similar to those in the training set.
The anomalies encountered in the future may be very different from the anomalies that have been mastered.
For example: Fraud detection and production (such as aircraft engines) to detect computer operating conditions in data centersFor example: mail filter weather forecast tumor classification

I hope this lesson will enable you to understand the characteristics of a learning problem, and allow you to treat this problem as an anomaly detection or a supervised learning problem. In addition, for some problems that many technology companies may encounter, generally speaking, the number of positive samples is very small, sometimes even 0, that is, there are too many different abnormal types that have not been seen before For these problems, the algorithm that should usually be used is the anomaly detection algorithm.

### 15.6 Select features

Reference video: 15-6-Choosing What Features to Use (12 min).mkv

For anomaly detection algorithms, the features we use are crucial. Let s talk about how to select features:

The anomaly detection assumes that the features conform to the Gaussian distribution. If the distribution of the data is not Gaussian, the anomaly detection algorithm can also work, but it is better to convert the data into a Gaussian distribution, for example, using a logarithmic function:, where is a non-negative constant; or, is A score between 0-1, and so on. (Editor's note: In python , usually use

np.log1p()
Function, that is, can avoid negative results, the inverse function is
np.expm1()
)

Error Analysis:

A common problem is that some abnormal data may also have higher values, which are considered normal by the algorithm. In this case, error analysis can help us. We can analyze the data that is incorrectly predicted by the algorithm as normal, and observe whether we can find some problems. We may find from the problem that we need to add some new features, and the new algorithm obtained after adding these new features can help us better perform anomaly detection.

Anomaly detection error analysis:

We can usually obtain some new and better features by combining some related features (the feature value of abnormal data is abnormally large or small). For example, in the case of detecting the computer status of a data center, we can Use the ratio of CPU load to network traffic as a new feature. If the value is abnormally large, it may mean that the server is in some problems.

In this video, we introduced how to select features and perform some small transformations on the features to make the data more like a normal distribution, and then input the data into the anomaly detection algorithm. At the same time, it also introduces the error analysis method when establishing the feature to capture the possibility of various abnormalities. I hope you can understand how to choose good feature variables through these methods, so as to help your anomaly detection algorithm capture a variety of different abnormal situations.

### 15.7 Multivariate Gaussian distribution (optional)

Reference video: 15-7-Multivariate Gaussian Distribution (Optional) (14 min).mkv

If we have two related features, and the range of the two features is relatively wide, in this case, the general Gaussian distribution model may not be able to identify abnormal data well. The reason is that the general Gaussian distribution model tries to capture the deviation of two features at the same time, thus creating a relatively large decision boundary.

The following figure shows two related features. The magenta line (the range can be large or small according to the difference of ) is the judgment boundary obtained by the general Gaussian distribution model. It is obvious that the data point represented by the green X is likely to be abnormal Value, but its value is still within the normal range. The multivariate Gaussian distribution will create a decision boundary as shown by the blue curve in the figure.

In the general Gaussian distribution model, our calculation method is: by calculating the probability corresponding to each feature separately and then multiplying it up. In the multivariate Gaussian distribution model, we will construct the covariance matrix of the feature and use all the features Calculate together.

We first calculate the average of all features, and then calculate the covariance matrix:

Note: where is a vector, each unit of which is the mean value of a row of data in the original feature matrix. Finally, we calculate the multivariate Gaussian distribution: where:

Is a fixed matrix in Octave using

det(sigma)
Calculation

Is the inverse matrix, let's see how the covariance matrix affects the model:

The figure above shows 5 different models, analyzed from left to right:

1. Is a general Gaussian distribution model
2. Through the covariance matrix, feature 1 has a small deviation while maintaining the deviation of feature 2
3. Through the covariance matrix, feature 2 has a large deviation while maintaining the deviation of feature 1
4. Through the covariance matrix, on the basis of not changing the original deviation of the two features, increase the positive correlation between the two
5. Through the covariance matrix, on the basis of not changing the original deviation of the two features, increase the negative correlation between the two

The relationship between the multivariate Gaussian distribution model and the original Gaussian distribution model:

It can be proved that the original Gaussian distribution model is a subset of the multivariate Gaussian distribution model, that is, as shown in the first 1, 2, 3, and 3 examples in the above figure, if the covariance matrix is only in the diagonal unit When there is a non-zero value on it, it is the original Gaussian distribution model.

Comparison of the original Gaussian distribution model and the multivariate Gaussian distribution model:

Original Gaussian distribution modelMultivariate Gaussian distribution model
Cannot capture the correlation between features but can be solved by combining featuresAutomatically capture the correlation between features
Low computational cost, adaptable to large-scale featuresThe same applies when the computational cost is high and the training set is small.
Must have, otherwise the covariance matrix is irreversible, usually required. In addition, feature redundancy will also cause the covariance matrix to be irreversible.

The original Gaussian distribution model is widely used. If there are correlations between features to some extent, we can capture these correlations by constructing new features.

If the training set is not too large and does not have too many features, we can use a multivariate Gaussian distribution model.

### 15.8 Use multivariate Gaussian distribution for anomaly detection (optional)

Reference video: 15-8-Anomaly Detection using the Multivariate Gaussian Distribution (Optional) (14 min).mkv

In the last video we talked about, about the multivariate Gaussian distribution, I saw some of the various distribution models built, when you change the parameters, and. In this video, let us use these ideas and apply them to develop a different anomaly detection algorithm.

To review the multivariate Gaussian distribution and the multivariate normal distribution:

The distribution has two parameters, and. The covariance matrix of this one-dimensional vector and is a kind of matrix. And the probability of the formula here, such as by and parameterization, and your variables and, you can get a range of different distributions, you know, these are three samples, those we have seen in the previous video.

Therefore, let us talk about parameter fitting or parameter estimation:

I have a set of samples that is a dimensional vector, and I think my samples come from a multivariate Gaussian distribution. How can I try to estimate my parameters and and standard formulas?

It is estimated that they are the average of your training samples.

And set: This is actually only when we use the PCA algorithm, sometimes it is written out. So you only need to insert the above two formulas, which will give you your estimated parameters and your estimated parameters. So, the data set given here is how you estimate and. Let's use this method and just plug it into the anomaly detection algorithm. So, how do we develop an anomaly detection algorithm with all of this?

1. we take our training set and our fitting model, we calculate, we must know, the setting is the same as the description.

As shown in the figure, the distribution is most in the center, and the range of the circle outside is smaller.

And the probability that the point is the way out here is very low.

The relationship between the original model and the multivariate Gaussian model is shown in the figure:

Among them: the covariance matrix is:

The comparison between the original model and the multivariate Gaussian distribution is shown in the figure:

## 16. Recommender Systems

### 16.1 Formalizing the problem

Reference video: 16-1-Problem Formulation (8 min).mkv

In the next video, I want to talk about the recommendation system. I want to talk about the recommendation system for two reasons:

1. just because it is an important application in machine learning. In the past few years, I occasionally visit different technology companies in Silicon Valley. I often chat with people working here who are committed to machine learning applications. I often ask them, what is the most important application of machine learning, or what do you want to improve most? What are the application of machine learning. The answer I hear most often is the recommendation system. Now, there are many groups in Silicon Valley trying to build a good recommendation system. Therefore, if you consider a website like Amazon, or Netflix or eBay, or iTunes Genius , there are many websites or systems that try to recommend new products to users. For example, Amazon recommends new books to you, Netflix tries to recommend new movies to you, and so on. These recommendation systems are judged by browsing what books you have bought in the past or what movies you have reviewed in the past. These systems will bring a large part of the revenue, for example for Amazon and companies like Netflix. Therefore, the improvement of the performance of the recommendation system will have a substantial and direct impact on these companies.

Recommendation system is an interesting problem. In academic machine learning, therefore, we can go to an academic machine learning conference. The recommendation system problem actually receives very little attention, or at least it occupies a small share in the academic world. However, if you look at what is happening, many technology companies that are capable of building these systems seem to occupy a high priority among many companies. This is one of the reasons why I discuss it in this class.

The second reason I want to discuss the recommendation system is: the last few episodes of this class video I want to discuss some big ideas in machine learning and share with you. We have also seen in this lesson that for machine learning, features are very important, and the features you choose will have a great impact on the performance of your learning algorithm. Therefore, there is a big idea in machine learning. It addresses some problems, not all problems, but some problems. There are algorithms that can automatically learn a set of good features for you. Therefore, don't try to design manually, and hand-written code is what we have done so far. There are some settings, you can have an algorithm, just learn the characteristics of its use, the recommendation system is an example of type settings. There are many others, but through the recommendation system, we will appreciate a small part of the idea of feature learning. At least, you will be able to understand an example of this. I think the big idea in machine learning is the same. Therefore, let us begin to discuss the formalization of the recommendation system problem.

We start with an example to define the problem of the recommendation system.

If we are a movie supplier, we have 5 movies and 4 users, and we ask users to rate the movies.

The first three movies are romance movies, and the latter two are action movies. We can see that Alice and Bob seem to be more inclined to romance movies, while Carol and Dave seem to be more towards action movies. And no one user has rated all the movies. We hope to build an algorithm to predict how many points each of them might give to movies they haven't watched, and use this as a basis for recommendation.

Here are some tags:

Represents the number of users

Represents the number of movies

If user j has rated the movie too much,

Rating the movie on behalf of the user

Represents the total number of movies rated by users

### 16.2 Content-based recommendation system

Reference video: 16-2-Content Based Recommendations (15 min).mkv

In a content-based recommendation system algorithm, we assume that there are some data about the things we want to recommend, and these data are the characteristics of these things.

In our example, we can assume that every movie has two characteristics, such as the degree of romance of the movie and the degree of action of the movie.

Then each movie has a feature vector, if the feature vector of the first movie is [0.9 0].

Next, we will build a recommendation system algorithm based on these characteristics. Assuming that we use a linear regression model, we can train a linear regression model for each user, such as the parameters of the first user's model. So we have:

User's parameter vector

Feature vector of movie

For users and movies, we predict that the score is:

Cost function

For users, the cost of the linear regression model is the sum of squares of the prediction error, plus the regularization term:

Which means that we only count those movies that users have rated. In a general linear regression model, both the error term and the regular term should be multiplied by, and we will remove them here. And we do not regularize the variance term.

The above cost function is only for one user. In order to learn all users, we sum the cost functions of all users:

If we want to use the gradient descent method to solve the optimal solution, we calculate the partial derivative of the cost function to get the gradient descent update formula:

### 16.3 Collaborative filtering

Reference video: 16-3-Collaborative Filtering (10 min).mkv

In the previous content-based recommendation system, for each movie, we have mastered the available features, and used these features to train the parameters of each user. Conversely, if we have the user's parameters, we can learn the characteristics of the movie.

But if we have neither user parameters nor movie characteristics, neither of these two methods are feasible. The collaborative filtering algorithm can learn both at the same time.

Our optimization goal was changed to target and proceed at the same time.

The results of the partial derivative of the cost function are as follows:

Note: In the collaborative filtering algorithm, we usually do not use the variance term, if necessary, the algorithm will learn it automatically. The steps to use the collaborative filtering algorithm are as follows:

1. Initially some random small values
2. Minimize the cost function using gradient descent algorithm
3. After training the algorithm, we predict the user s rating for the movie

The feature matrix obtained through this learning process contains important data about movies. These data are not always readable by humans, but we can use these data as a basis for recommending movies to users.

For example, if a user is watching a movie, we can find another movie based on the distance between the feature vectors of the two movies.

### 16.4 Collaborative filtering algorithm

Reference video: 16-4-Collaborative Filtering Algorithm (9 min).mkv

Collaborative filtering optimization goals:

Given, estimate:

Given, estimate:

Simultaneously minimize and:

### 16.5 Vectorization: Low-rank matrix decomposition

Reference video: 16-5-Vectorization_ Low Rank Matrix Factorization (8 min).mkv

In the last few videos, we talked about the collaborative filtering algorithm. In this video, I will talk about the vectorization implementation of the algorithm and talk about other things you can do about the algorithm.

for example:

1. When a product is given, can you find other products related to it?
2. A user recently saw a product, if there are other related products, you can recommend it to him.

What I will do is: implement a method of selection and write out the predictions of the collaborative filtering algorithm.

We have a data set of five movies. What I will do is to group the ratings of these users' movies and store them in a matrix.

We have five movies and four users, so this matrix is a matrix with 5 rows and 4 columns, and it stores the user rating data of these movies in the matrix:

MovieAlice (1)Bob (2)Carol (3)Dave (4)
Love at last5500
Romance forever5??0
Cute puppies of love?40?
Nonstop car chases0054
Swords vs. karate005?

Roll out rating:

Find related videos:

Now that you have learned the feature parameter vector, then we will have a very convenient method to measure the similarity between two movies. For example, if a movie has a feature vector, can you find a different movie and ensure that the distance between the feature vectors of the two movies is very small, then it can strongly indicate that the movie and the movie have a certain degree of Similar, at least in a sense, some people like movies, and maybe they are more likely to be interested in movies too. To sum up, when a user is watching a certain movie, if you want to find 5 movies that are very similar to the movie, in order to recommend 5 new movies to the user, what you need to do is to find out the movies in these different movies The distance between the movie and the movie we are looking for is the smallest, so you can recommend several different movies to your users.

Through this method, I hope you can know how to perform a vectorized calculation to calculate ratings for all users and all movies. At the same time, I hope you can also master the method to find related movies and products by learning the characteristic parameters.

### 16.6 Implementation details: mean normalization

Reference video: 16-6-Implementational Detail_ Mean Normalization (9 min).mkv

Let's look at the following user rating data:

If we add a new user Eve , and Eve has not rated any movies, then on what basis do we recommend movies for Eve ?

We first need to normalize the average of the result matrix, and subtract the average of all users' ratings for a movie from each user's rating for that movie:

Then we use this new matrix to train the algorithm. If we want to use the newly trained algorithm to predict the score, we need to add the average back to predict that, for Eve , our new model will think that her score for each movie is the average score of the movie.

# Code section-anomaly detection and recommendation system

In this exercise, we will use Gaussian model to implement anomaly detection algorithm and apply it to detect faulty servers on the network. We will also see how to use collaborative filtering to build a recommendation system and apply it to a movie recommendation dataset.

## Anomaly detection

Our first task is to use a Gaussian model to detect whether unlabeled examples in the dataset should be considered anomalies. We start with a simple two-dimensional data set to help visualize what the algorithm is doing.

```Import numpy AS NP
Import PANDAS AS PD
Import matplotlib.pyplot AS PLT
Import Seaborn AS SB
from scipy.io Import loadmat
duplicated code```
```data = loadmat( 'data/ex8data1.mat' )
X = data[ 'X' ]
X.shape
Copy code```
```( 307 , 2 )
copy the code```
```fig, ax = plt.subplots(figsize=( 12 , 8 ))
ax.scatter(X[:, 0 ], X[:, 1 ])
plt.show()
Copy code```

This is a very tight cluster, and several values are far away from the cluster. In this simple example, these can be considered abnormal. To figure it out, we are estimating the Gaussian distribution for each feature in the data. To do this, we will create a function that returns the mean and variance of each feature.

```def  estimate_gaussian ( X ):
mu = X.mean(axis= 0 )
sigma = X.var(axis = 0 )

return mu, sigmaCopy
code```
```mu, sigma = estimate_gaussian(X)
mu, sigma
Copy code```
```(Array ([ 14.11222578 , 14.99771051 ]), Array ([ 1.83263141 , 1.70974533 ]))
Copy the code```

Now that we have our model parameters, we need to determine the probability threshold, which indicates that a sample should be considered an anomaly. To this end, we need to use a set of labeled verification data (where the real abnormal samples have been labeled), and given different thresholds to identify the performance of the model.

```Xval = data[ 'Xval' ]
yval = data[ 'yval' ]

Xval.shape, yval.shape
Copy code```
```(( 307 , 2 ), ( 307 , 1 ))
to copy the code```

We also need a way to calculate the probability that a data point belongs to a normal distribution. Fortunately SciPy has this built-in method.

```from scipy import stats
dist = stats.norm(mu[ 0 ], sigma[ 0 ])
dist.pdf( 15 )
Copy code```
```0.1935875044615038Copy
code```

We can also pass the array to the probability density function and get the probability density of each point in the data set.

```dist.pdf (X-[:, 0 ]) [ 0 : 50 ]
Copy the code```
```Array ([ .183842   , .20221694 , .21746136 , .19778763 , .20858956 ,
.21652359 , .16991291 , 0.15123542 , .1163989 , .1594734 ,
0.21716057 , 0.21760472 , .20141857 , 0.20157497 , .21711385 ,
.21758775 , .21695576 , 0.2138258 , .21057069 , .1173018 ,
0.20765108 ,0.21717452 , 0.19510663 , 0.21702152 , 0.17429399 ,
0.15413455 , 0.21000109 , 0.20223586 , 0.21031898 , 0.21313426 ,
0.16158946 , 0.2170794 , 0.17825767 , 0.17414633 , 0.1264951 ,
0.19723662 , 0.14538809 , 0.21766361 , 0.21191386 , 0.21729442 ,
0.21238912 , 0.18799417 , 0.21259798, 0.21752767  ,0.20616968 ,
0.21520366 , 0.1280081 , 0.21768113 , 0.21539967 , 0.16913173 ])
Copy the code```

We calculate and save the probability density of each value in the data set given the aforementioned Gaussian model parameters.

```p = np.zeros((X.shape[ 0 ], X.shape[ 1 ]))
p[:, 0 ] = stats.norm(mu[ 0 ], sigma[ 0 ]).pdf(X[:, 0 ])
p[:, 1 ] = stats.norm(mu[ 1 ], sigma[ 1 ]).pdf(X[:, 1 ])

p.shape
Copy code```
```( 307 , 2 )
copy the code```

We also need to do this for the validation set (using the same model parameters). We will use these probabilities in combination with real labels to determine the optimal probability threshold for assigning data points as anomalies.

```pval = np.zeros((Xval.shape[ 0 ], Xval.shape[ 1 ]))
pval[:, 0 ] = stats.norm(mu[ 0 ], sigma[ 0 ]).pdf(Xval[:, 0 ])
pval[:, 1 ] = stats.norm(mu[ 1 ], sigma[ 1 ]).pdf(Xval[:, 1 ])
pval.shape
Copy code```
```( 307 , 2 )
copy the code```

Next, we need a function to find the best threshold for a given probability density value and true label. To do this, we will calculate F1 scores for different epsilon values. F1 is a function of the number of true positives, false positives and false negatives. The equation is in the exercise text.

```def  select_threshold ( pval, yval ):
best_epsilon = 0
best_f1 = 0
f1 = 0

step = (pval. max ()-pval. min ())/1000

for epsilon in np.arange(pval. min (), pval. max (), step):
preds = pval <epsilon
tp = np. sum (np.logical_and(preds == 1 , yval == 1 )).astype( float )
fp = np. sum (np.logical_and(preds == 1 , yval == 0 )).astype( float )
fn = np. sum (np.logical_and(preds == 0 , yval == 1 )).astype( float )

precision = tp/(tp + fp)
recall = tp/(tp + fn)
f1 = ( 2 * precision * recall)/(precision + recall)

if f1> best_f1:
best_f1 = f1
best_epsilon = epsilon

return best_epsilon, best_f1
copy the code```
```epsilon, f1 = select_threshold(pval, yval)
epsilon, f1
Copy code```
```( .009566706005956842 , .7142857142857143 )
Copy the code```

Finally, we can apply the threshold to the data set and visualize the results.

```# indexes of the values  considered to be outliers
outliers = np.where(p <epsilon)
outliers
Copy code```
```(array([ 300 , 301 , 301 , 303 , 303 , 304 , 306 , 306 ], dtype = int64 ),
Array ([ . 1 , 0 , . 1 , 0 , . 1 , 0 , 0 , . 1 ], DTYPE = Int64 ))
copying the code```
```fig, ax = plt.subplots(figsize=( 12 , 8 ))
ax.scatter(X[:, 0 ], X[:, 1 ])
ax.scatter(X[outliers[ 0 ], 0 ], X[outliers[ 0 ], 1 ], s= 50 , color= 'r' , marker= 'o' )
plt.show()
Copy code```

Red dots are points that are marked as outliers. These seem reasonable. The upper right corner with some separation (but not marked) may also be an outlier, but it is quite close.

## Collaborative filtering

The recommendation engine uses similarity metrics based on items and users to check the user's historical preferences in order to provide suggestions for new "things" that may be of interest to the user. In this exercise, we will implement a specific recommendation system algorithm called collaborative filtering and apply it to a dataset of movie ratings.

We first load and check the data we will use.

```= loadmat Data ( 'Data/ex8_movies.mat' )
copying the code```

Y is an array containing levels from 1 to 5 (number of movies x number of users). R is an array of "indicators" that contains a binary value indicating whether the user has rated the movie. Both should have the same dimensions.

```Y = data[ 'Y' ]
R = data[ 'R' ]
Y.shape, R.shape
Copy code```
```(( 1682 , 943 ), ( 1682 , 943 ))
replicate the code```

We can evaluate the average rating of movies by ranking Y on average.

```The Y [ . 1 , np.where (R & lt [ . 1 ,:] == . 1 ) [ 0 ]]. Mean ()
to copy the code```
```3.2061068702290076Copy
code```

We can also try to "visualize" the data by rendering the matrix into an image. We can't collect too much from here, but it does give us an idea of the relative density of users and movies.

```fig, ax = plt.subplots(figsize=( 12 , 12 ))
ax.imshow(Y)
ax.set_xlabel( 'Users' )
ax.set_ylabel( 'Movies' )
fig.tight_layout()
plt.show()
Copy code```

Next, we will implement the cost function of collaborative filtering. Intuitively, "cost" refers to the degree to which a set of movie rating predictions deviate from the true predictions. The cost equation is given in the exercise text. It is based on two sets of parameter matrices called X and Theta in the text. These "expand" into the "parameter" input so that SciPy's optimization package can be used later. Please note that I have included the array/matrix shape (for the data we used in this exercise) in the comments to help illustrate how the matrix interaction works.

# Cost function

```def  cost ( params, Y, R, num_features ):
Y = np.matrix(Y)   # (1682, 943)
R = np.matrix(R)   # (1682, 943)
num_movies = Y.shape[ 0 ]
num_users = Y.shape[ 1 ]

# reshape the parameter array into parameter matrices
X = np.matrix(np.reshape(params[:num_movies * num_features], (num_movies, num_features)))   # (1682, 10)
Theta = np.matrix(np.reshape(params [num_movies * num_features:], (num_users, num_features)))   # (943, 10)

# initializations
J = 0

# compute the cost
error = np.multiply((X * Theta.T)-Y, R)   # (1682, 943)
squared_error = np.power(error, 2 )   # (1682, 943)
J = ( 1./2 ) * np. sum (squared_error)

return J
copy the code```

To test this, we provide a set of pre-training parameters that we can evaluate. In order to keep the evaluation time short, we will only look at a small piece of data.

```params_data = loadmat( 'data/ex8_movieParams.mat' )
X = params_data[ 'X' ]
Theta = params_data[ 'Theta' ]
X.shape, Theta.shape
Copy code```
```(( 1682 , 10 ), ( 943 , 10 ))
copying the code```
```users = 4
movies = 5
features = 3

X_sub = X[:movies, :features]
Theta_sub = Theta[:users, :features]
Y_sub = Y[:movies, :users]
R_sub = R[:movies, :users]

params = np.concatenate((np.ravel(X_sub), np.ravel(Theta_sub)))

cost(params, Y_sub, R_sub, features)
Copy code```
```22.224603725685675Copy
code```

Next we need to implement gradient calculation. Just like we used the neural network implementation in exercise 4, we will extend the cost function to calculate the gradient.

```def  cost ( params, Y, R, num_features ):
Y = np.matrix(Y)   # (1682, 943)
R = np.matrix(R)   # (1682, 943)
num_movies = Y.shape[ 0 ]
num_users = Y.shape[ 1 ]

# reshape the parameter array into parameter matrices
X = np.matrix(
np.reshape(params[:num_movies * num_features],
(num_movies, num_features)))   # (1682, 10)
Theta = np.matrix(
np.reshape(params[num_movies * num_features:],
(num_users, num_features)))   # (943, 10)

# initializations
J = 0
X_grad = np.zeros(X.shape)   # (1682, 10)
Theta_grad = np.zeros(Theta.shape)   # (943, 10)

# compute the cost
error = np.multiply((X * Theta.T)-Y, R)   # (1682, 943)
squared_error = np.power(error, 2 )   # (1682, 943)
J = ( 1./2 ) * np. sum (squared_error)

# calculate the gradients
X_grad = error * Theta
Theta_grad = error.T * X

# unravel the gradient matrices into a single array
code```
```J, grad = cost(params, Y_sub, R_sub, features)
Copy code```
```( 22.224603725685675 ,
Array ([ -2.52899165 ,    7.57570308 ,   -1.89979026 ,   -0.56819597 ,
3.35265031 ,   -0.52339845 ,   -0.83240713 ,    4.91163297 ,
-0.76677878 ,   -0.38358278 ,    2.26333698 ,   -0.35334048 ,
-0.80378006 ,    4.74271842 ,   -0.74040871 , -10.5680202 ,
4.62776019 ,   -7.16004443 ,   -3.05099006 ,    1.16441367,
-3.47410789 ,    0.5         ,    0.5         ,    0.5         ,
0.5         ,    0.5         ,    0.5         ]))
Copy the code```

Our next step is to add regularization to the cost and gradient calculations. We will create a final regularized version of the function (note that this version contains an additional "learning rate" parameter, called "lambda" in the text).

```def  cost ( params, Y, R, num_features, learning_rate ):
Y = np.matrix(Y)   # (1682, 943)
R = np.matrix(R)   # (1682, 943)
num_movies = Y.shape[ 0 ]
num_users = Y.shape[ 1 ]

# reshape the parameter array into parameter matrices
X = np.matrix(np.reshape(params[:num_movies * num_features], (num_movies, num_features)))   # (1682, 10)
Theta = np.matrix(np.reshape(params [num_movies * num_features:], (num_users, num_features)))   # (943, 10)

# initializations
J = 0
X_grad = np.zeros(X.shape)   # (1682, 10)
Theta_grad = np.zeros(Theta.shape)   # (943, 10)

# compute the cost
error = np.multiply((X * Theta.T)-Y, R)   # (1682, 943)
squared_error = np.power(error, 2 )   # (1682, 943)
J = ( 1./2 ) * np. sum (squared_error)

# add the cost regularization
J = J + ((learning_rate/2 ) * np. sum (np.power(Theta, 2 )))
J = J + ((learning_rate/2 ) * np. sum (np.power(X, 2 )))

# calculate the gradients with regularization
X_grad = (error * Theta) + (learning_rate * X)
Theta_grad = (error.T * X) + (learning_rate * Theta)

# unravel the gradient matrices into a single array

code```
```J, grad = cost(params, Y_sub, R_sub, features, 1.5 )
Copy code```
```( 31.34405624427422 ,
Array ([ -0.95596339 ,    6.97535514 ,   -0.10861109 ,    0.60308088 ,
2.77421145 ,    .25839822 ,    .12985616 ,    4.0898522 ,
-0.89247334 ,    .29684395 ,    1.06300933 ,    0.66738144 ,
0.60252677 ,    4.90185327 ,   -0.19747928 , -10.13985478 ,
2.10136256 ,   -6.76563628 ,   -2.29347024 ,    0.48244098,
-2.99791422 ,   -0.64787484 ,   -0.71820673 ,    1.27006666 ,
1.09289758 ,   -0.40784086 ,    0.49026541 ]))
Copy the code```

This result again matches the expected output of the exercise code, so it seems that regularization is normal. Before we train the model, we have a final step. Our task is to create our own movie ratings so that we can use the model to generate personalized recommendations. Provide us with a file linking the movie index to its title. Then we load the file into the dictionary.

```movie_idx = {}
f = open ( 'data/movie_ids.txt' ,encoding = 'gbk' )
for line in f:
tokens = line.split( '' )
tokens[- 1 ] = tokens[- 1 ][:- 1 ]
movie_idx [ int (tokens [ 0 ]) - . 1 ] = '' .join (tokens [ . 1 :])
Copy the code```
```movie_idx[ 0 ]
Copy code```
```'Toy Story (1995)'
Copy code```

We will use the score provided in the exercise.

```ratings = np.zeros(( 1682 , 1 ))

ratings[ 0 ] = 4
ratings[ 6 ] = 3
ratings[ 11 ] = 5
ratings[ 53 ] = 4
ratings[ 63 ] = 5
ratings[ 65 ] = 3
ratings[ 68 ] = 5
ratings[ 97 ] = 2
ratings[ 182 ] = 4
ratings[ 225 ] = 5
ratings[ 354 ] = 5

print ( 'Rated {0} with {1} stars.' . format (movie_idx[ 0 ], str ( int (ratings[ 0 ]))))
print ( 'Rated {0} with {1} stars.' . format (movie_idx[ 6 ], str ( int (ratings[ 6 ]))))
print ( 'Rated {0} with {1} stars.' . format (movie_idx[ 11 ], str ( int (ratings[ 11 ])) ))
print ( 'Rated {0} with {1} stars.'.  format(movie_idx[ 53 ], str ( int (ratings[ 53 ]))))
print ( 'Rated {0} with {1} stars.' . format (movie_idx[ 63 ], str ( int (ratings[ 63 ]) )))
print ( 'Rated {0} with {1} stars.' . format (movie_idx[ 65 ], str ( int (ratings[ 65 ]))))
print ( 'Rated {0} with {1} stars. ' . format (movie_idx[ 68 ], str ( int (ratings[ 68 ]))))
print ( 'Rated {0} with {1} stars.' . format (movie_idx[ 97 ], str ( int (ratings[ 97 ]))))
print ( 'Rated { 0} with {1} stars.' . Format (movie_idx[ 182 ], str ( int (ratings[ 182 ]))))
print ( 'Rated {0} with {1} stars.' . Format (movie_idx[ 225 ] , str ( int (ratings[ 225]))))
Print ( 'Rated with {{0}}. 1 Stars.' . The format (movie_idx [ 354 ], STR ( int (ratings [ 354 ]))))
copying the code```
```Rated Toy Story ( 1995 ) with 4 stars.
Rated Twelve Monkeys ( 1995 ) with 3 stars.
Rated Usual Suspects, The ( 1995 ) with 5 stars.
Rated Outbreak ( 1995 ) with 4 stars.
Rated Shawshank Redemption, The ( 1994 ) with 5 stars.
Rated While You Were Sleeping ( 1995 ) with 3 stars.
Rated Forrest Gump ( 1994 ) with 5 stars.
Rated Silence of the Lambs, The ( 1991 ) with 2 stars.
Rated Alien ( 1979 ) with 4 stars.
Rated Die Hard 2 ( 1990 ) with 5 stars.
Sphere Rated ( 1998 ) with . 5 Stars.
Duplicated code```

We can add our own rating vector to the existing data set to be included in the model.

```R = data[ 'R' ]
Y = data[ 'Y' ]
Y = np.append(Y, ratings, axis = 1 )
R = np.append(R, ratings != 0 , axis = 1 )
Y.shape, R.shape, ratings.shape
Copy code```
```(( 1682 , 944 ), ( 1682 , 944 ), ( 1682 , 1 ))
to copy the code```

We are not just preparing to train collaborative filtering models. We only need to define some variables and normalize the ratings.

```movies = Y.shape[ 0 ]   # 1682
users = Y.shape[ 1 ]   # 944
features = 10
learning_rate = 10.
X = np.random.random(size=(movies, features))
Theta = np.random.random(size=(users, features))
params = np.concatenate((np.ravel(X), np.ravel(Theta)))

X.shape, Theta.shape, params.shape
Copy code```
```(( 1682 , 10 ), ( 944 , 10 ), ( 26,260 ,))
copy the code```
```Ymean = np.zeros((movies, 1 ))
Ynorm = np.zeros((movies, users))

for i in  range (movies):
idx = np.where(R[i,:] == 1 )[ 0 ]
Ymean[i] = Y[i,idx].mean()
Ynorm[i,idx] = Y[i,idx]-Ymean[i]

Ynorm.mean()
Copy code```
```5.507036456515984e-19Copy
code```
```from scipy.optimize import minimize

fmin = minimize(fun=cost, x0=params, args=(Ynorm, R, features, learning_rate),
= Method 'the CG' , JAC = True , Options = { 'maxiter' : 100 })
copy the code```
```X = np.matrix(np.reshape(fmin.x[:movies * features], (movies, features)))
Theta = np.matrix(np.reshape(fmin.x[movies * features:], (users, features)))
X.shape, Theta.shape
Copy code```
```(( 1682 , 10 ), ( 944 , 10 ))
copying the code```

The parameters we trained are X and Theta. We can use these to create some suggestions for the users we add.

```predictions = X * Theta.T
my_preds = predictions[:,- 1 ] + Ymean
my_preds.shape
Copy code```
```( 1682 , 1 )
copy the code```
```sorted_preds = np.sort(my_preds, axis = 0 )[::- 1 ]
sorted_preds [: 10 ]
Copy the code```
```matrix([[ 5.00000277 ],
[ 5.00000236 ],
[ 5.00000172 ],
[ 5.00000167 ],
[ 5.00000018 ],
[ 4.99999996 ],
[ 4.99999993 ],
[ 4.99999963 ],
[ 4.99999936 ],
[ 4.99999933 ]])
Copy code```

This gave us a top ranking rating, but we lost the index of these ratings. We actually need to use the argsort function to predict the movie corresponding to the rating.

```idx = np.argsort(my_preds, axis = 0 )[::- 1 ]
idx
Copy code```
```matrix([[ 1499 ],
[  813 ],
[ 1535 ],
...,
[ 313 ],
[ 829 ],
[ 1431 ]], DTYPE = Int64 )
copying the code```
```print ( "Top 10 movie predictions:" )
for i in  range ( 10 ):
= J int (IDX [I])
Print ( '' Predicted Rating of {0}}. 1 for Movie {. ' . the format ( STR ( a float (my_preds [J])), movie_idx [J]))
Copy the code```
```Top 10 movie predictions:
Predicted rating of 5.0000027697456755  for movie Santa with Muscles ( 1996 ).
Predicted rating of 5.000002356341384  for movie Great Day in Harlem, A ( 1994 ).
Predicted rating of 5.000001717626692  for movie Aiqing wansui ( 1994 ).
Predicted rating of 5.000001669423294  for movie Star Kid ( 1997 ).
Predicted rating of 5.000000180977937  for movie Prefontaine ( 1997 ).
Predicted rating of 4.999999960309181  for movie They Made Me a Criminal ( 1939 ).
Predicted rating of 4.999999928299842  for movie Marlene Dietrich: Shadow and Light ( 1996 ).
Predicted rating of 4.999999629264883  for movie Someone Else 's America (1995).
Predicted rating of 4.9999993570025705 for movie Saint of Fort Washington, The (1993).
Predicted rating of 4.999999325074885 for movie Entertaining Angels: The Dorothy Day Story (1996).
Copy code```

The recommended movie does not actually match the content in the exercise text, I did not find the reason.

References/

 Machine learning course:  www.coursera.org/course/ml
 Huang Haiguang : https://github.com/fengdu78
 github:  github.com/fengdu78/Co...
 Assignment code: https://github.com/fengdu78/Coursera-ML-AndrewNg-Notes/blob/master/code/ex8-anomaly%20detection%20and%20recommendation/ML-Exercise8.ipynb
 markdown file:  github.com/fengdu78/Co...
 pdf file: https://github.com/fengdu78/Coursera-ML-AndrewNg-Notes/blob/master/Machine Learning Personal Notes Full Version v5.4-A4 Printed Version.pdf