CN114647773A - Improved collaborative filtering method based on multiple linear regression and third-party credit

Info

Publication number: CN114647773A
Application number: CN202011504132.XA
Authority: CN
Inventors: 朱赟; 于士浩; 郑闻悦; 高连峰; 陈剑
Original assignee: Gannan Normal University
Current assignee: Gannan Normal University
Priority date: 2020-12-17
Filing date: 2020-12-17
Publication date: 2022-06-21
Anticipated expiration: 2040-12-17
Also published as: CN114647773B

Abstract

Because its steps are simple and easy to understand and its computing performance is good, the collaborative filtering algorithm has become a popular research topic in recommendation systems. The traditional collaborative filtering algorithm nevertheless has many defects, such as sparse matrices, the cold-start problem, the user trust problem, and defects in the similarity calculation formula. By analyzing these problems, an improved collaborative filtering algorithm based on multiple linear regression and third-party credit is provided, which implements the construction of a third-party credit model and the calculation of credit similarity. The implementation results show that the improved algorithm is superior to the traditional collaborative filtering algorithm: it effectively alleviates the sparsity problem, the user trust problem, and the similarity-formula problem of collaborative filtering.

Description

Improved collaborative filtering method based on multiple linear regression and third-party credit

Technical Field

The invention belongs to the field of intelligent recommendation algorithms, and particularly relates to an improved collaborative filtering algorithm based on multiple linear regression and third-party credit.

Background

In a rapidly developing information age, it is increasingly important to mine the commonalities in massive data and discover the underlying patterns. The improved collaborative filtering algorithm finds users' potential needs in the data and, given the rapid growth of the Internet, can be applied to many aspects of life. When a user wants a favorite song each morning to start the day in a good mood, the collaborative filtering algorithm can help; when a user wants to find a favorite food to replenish energy, it can help; when a user who has overeaten wants to lose weight and cannot decide which sports equipment to choose, it can help. In short, collaborative filtering has penetrated every aspect of life, and its applications are too numerous to list.

However, the conventional collaborative filtering algorithm has many disadvantages, such as sparse matrices, the cold-start problem, the user trust problem, and defects in the similarity calculation formula. How to alleviate or solve these defects is a major problem, and many scholars have proposed solutions. Yaojingbo, Yuyicheng and others proposed PCA dimension reduction to improve the sparse matrix problem; the similarity measurement method proposed by Jinming, Mengjun and others addresses the cold-start problem of the system; the multiple-similarity fusion proposed by Wangbosheng, Haowangbo and others alleviates the data sparsity and cold-start problems. In these works, however, the fusion of multiple kinds of information based on third-party trust is often overlooked. When it is overlooked, the system may generate recommendations from false information, the recommendation error grows, and other users may also be affected to some degree. This poses a challenge to the recommendation algorithm, and taking third-party credit into account may effectively address these issues. In the improved collaborative filtering algorithm based on multiple linear regression and third-party credit, the fields are first quantized from information such as the attribute features of the items and the credit value of the user; a multiple linear regression equation relating the score value to this information is then established and the favorite vector of the user is solved, which alleviates the sparse matrix problem; the Euclidean distance is used to find the neighbor users of the target user; and finally the recommendation formula is used to generate recommendations for the user.

Disclosure of Invention

In the big data era, the amount of user information processed by a system can be on the order of hundreds of millions. Such large numbers of users and items inevitably make the user-item matrix highly sparse, which is a major challenge for the recommendation system. To address the sparse user-item matrix, the invention provides an improved recommendation algorithm based on multiple linear regression and third-party credit agencies.

The improved recommendation algorithm integrates information such as the place where a user acted on an item in the past, the credit values that third-party organizations assign to the user, and the attribute features of the item. It mainly addresses the sparse user-item matrix and malicious user scores. The method comprises the following steps:

(1) Construct a weighted credit model based on third-party individuals or credit agencies.

(2) Construct the item feature vector associated with the user from the features of the items in the system and the credit obtained in step (1):

nat_i＝(nat_i1,nat_i2,...,nat_ij,...,nat_in)

(3) Construct a multiple linear regression equation for each user from the nat_i vectors obtained in step (2) and the user's scores for the items in the system:

y_i＝b₀+b₁x₁+b₂x₂+...+b_nx_n+μ_i

(4) Solve each user's scoring vector, which integrates the multi-source information, from the multiple linear regression equation of step (3):

user_i＝(y_i1,y_i2,...,y_ij,...,y_in)

(5) Use the vector obtained in step (4) to represent the user's preferences, calculate the Euclidean distance between the target user and every other user, derive the similarity between them from that distance, and, following the KNN idea, select the N users most similar to the target user as its neighbors.

(6) Generate recommendations for the user with the recommendation formula, based on the user similarity obtained in step (5).

Step one, the construction of the third-party organization-based weighted credit model specifically comprises the following steps:

User credit is an important factor. On the Internet, some users are unwilling to give genuine scores or maliciously inflate scores, so much of the data is to some degree false. The concept of a third-party trust authority is therefore introduced. Suppose there are m third-party trust authorities CA and n users, together with a credit value matrix giving each authority's credit value for each user, expressed as follows:

CA＝{CA₁,CA₂,...,CA_i,...,CA_m}

user＝(user₁,user₂,...,user_i,...,user_n)

First, the third-party trust authorities CA are sorted in descending order of credibility according to known official data. The sorted third-party trust authorities ACA, and the credit value matrix of each sorted authority for each user, are as follows:

ACA＝{ACA₁,ACA₂,...,ACA_i,...,ACA_m}

Then, the sorted third-party trust authorities ACA are graded; the grades are shown in the following table:

Finally, the comprehensive credit CCRE of the user is obtained by considering two cases according to grade. For trust authorities of the same grade, the processed credit is obtained as a trimmed mean (the mean after discarding the extreme values), by the following formula:

where c denotes the number of third-party trust authorities in a given evaluation grade, m denotes a row, q denotes the n users, max denotes the most approved third-party trust authority in the current evaluation grade, and min denotes the least approved third-party trust authority in the current evaluation grade.
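The within-grade trimmed mean described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's formula, which is not reproduced in this text: trimming is taken to drop the single highest and single lowest credit value, and a grade with fewer than three authorities falls back to the plain mean. The function name `trimmed_mean` and the sample values are hypothetical.

```python
def trimmed_mean(credits):
    """Average the credit values of one grade after dropping the extremes.

    Assumption: one maximum and one minimum value are dropped. With fewer
    than three values there is nothing left to trim, so the plain mean is
    returned instead.
    """
    if not credits:
        raise ValueError("at least one credit value is required")
    if len(credits) < 3:
        return sum(credits) / len(credits)
    ordered = sorted(credits)
    kept = ordered[1:-1]  # drop one minimum and one maximum
    return sum(kept) / len(kept)


# Hypothetical credit values given to one user by four same-grade authorities.
grade_a_credits = [90.0, 70.0, 80.0, 20.0]
grade_a_credit = trimmed_mean(grade_a_credits)  # mean of 70.0 and 80.0
```

Dropping the extremes limits the influence of a single authority that rates a user unusually high or low. If the source formula instead removes authorities by approval rank rather than by value, only the line that selects `kept` would change.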

After the above transformation, the credit matrix can be represented by the following formula:

For trust authorities of different grades, the comprehensive credit of the user is obtained from the trust weights assigned above, expressed as follows:

where Credit^T_ac(m, n) denotes the transpose of the credit value matrix. The formula yields an n × 1 vector of comprehensive user credit based on the third-party trust authorities.
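The cross-grade combination can be illustrated with a short sketch. It assumes the comprehensive credit is a plain weighted sum of the per-grade credit values, which is how the transpose-times-weights description reads; the exact formula is not reproduced in this text. The function name `composite_credit` and the per-grade credit values are hypothetical, while the weights 0.6, 0.3 and 0.1 are those of the worked example in the Detailed Description.

```python
def composite_credit(grade_credits, weights):
    """Combine per-grade credit values into one composite credit per user.

    grade_credits: one row per grade, one column per user (grades x users).
    weights: one trust weight per grade.
    Returns one composite credit per user, i.e. the n x 1 result.
    """
    if len(grade_credits) != len(weights):
        raise ValueError("need exactly one weight per grade")
    n_users = len(grade_credits[0])
    # Equivalent to multiplying the transposed credit matrix by the weights.
    return [
        sum(w * row[u] for w, row in zip(weights, grade_credits))
        for u in range(n_users)
    ]


weights = [0.6, 0.3, 0.1]  # grades A, B, C
grade_credits = [
    [80.0, 50.0],  # grade A credit for user 1 and user 2 (hypothetical)
    [70.0, 90.0],  # grade B
    [90.0, 40.0],  # grade C
]
ccre = composite_credit(grade_credits, weights)
```

With these numbers the first user obtains 0.6·80 + 0.3·70 + 0.1·90 = 78 and the second 0.6·50 + 0.3·90 + 0.1·40 = 61, so the higher-graded authority dominates the result.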

Step two, the nat vector is composed of the attributes of the item and the comprehensive credit from the third-party individuals or organizations. Each item scored by a user has a nat vector, so a user who has scored m items has m corresponding nat vectors. Suppose a nat vector has n fields in total. If an item lacks a given field, the value of that field is usually set to 0; if an item has the attribute field, the value of the field is obtained from the coefficients of the multiple linear regression equation. In addition, every vector carries the user's comprehensive credit value, region, and season information, where the region value is represented by the zip code and the season values are shown in the following table:

From these vectors and the known score values, a multiple linear regression equation relating the score value to each field of the nat vector can then be established.
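As an illustration of the nat vector layout, the sketch below builds one vector for a single rated item. Everything specific in it is an assumption made for the example: the field names, the season codes (the season table is not reproduced in this text), the zip code, and the function name `build_nat_vector`. Only the rules themselves come from the description: a missing attribute field is set to 0, and the credit, zip code and season appear in every vector.

```python
# Hypothetical season codes; the source's season table is not available here.
SEASON_CODES = {"spring": 1, "summer": 2, "autumn": 3, "winter": 4}


def build_nat_vector(item_attributes, field_names, credit, zip_code, season):
    """Build one nat vector for a single (user, item) rating.

    Attribute fields that the item lacks are set to 0. The composite credit,
    the zip code (standing in for the region) and the season code are
    appended to every vector.
    """
    attribute_part = [float(item_attributes.get(name, 0.0)) for name in field_names]
    shared_part = [float(credit), float(int(zip_code)), float(SEASON_CODES[season])]
    return attribute_part + shared_part


field_names = ["sleeve_length", "thickness", "price_level"]  # hypothetical fields
jacket = {"thickness": 3, "price_level": 2}                  # has no sleeve_length
nat = build_nat_vector(jacket, field_names, credit=78.0, zip_code="341000", season="winter")
```

Here `nat` has six entries: three attribute fields (the missing one filled with 0) followed by the credit, the zip code and the season code.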

Step three, constructing a multiple linear regression equation for each user, specifically as follows:

For each user, a multiple linear regression equation relating the score value to each field of the nat vector is established from the user's item feature vectors (nat vectors) and the corresponding score values. The specific formula is as follows:

y_i＝b₀+b₁nat_i1+b₂nat_i2+...+b_nnat_in+μ_i

where b_n denotes the coefficient of the n-th factor influencing the score value y, b₀ denotes the constant term, μ_i denotes the random error, and y_i denotes the score value given by the regression equation for the i-th user and a given nat vector.

After the coefficients of the multiple linear regression equation are obtained from the formula, the favorite vector of the user can be derived. The favorite vectors are used in place of the user-item matrix, which addresses the matrix sparsity problem. Suppose the user's favorite vector is denoted Preference, expressed as follows:

Preference_i＝(b_i0+b_i1+μ_i1,b_i0+b_i2+μ_i2,...,b_i0+b_in+μ_in)

where Preference_i denotes the favorite vector of the i-th user; the other variables are defined above.
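The regression step can be sketched with ordinary least squares. The code below is a minimal standard-library illustration, not the MATLAB implementation mentioned in the Detailed Description: it solves the normal equations by Gauss-Jordan elimination, and it forms the favorite vector as b₀ + b_j with the error terms taken as zero, which is an assumption. The function name `fit_linear_regression` and the data are hypothetical.

```python
def fit_linear_regression(rows, scores):
    """Ordinary least squares for y = b0 + b1*x1 + ... + bn*xn.

    Solves the normal equations (X^T X) b = X^T y by Gauss-Jordan
    elimination with partial pivoting. Returns [b0, b1, ..., bn].
    """
    design = [[1.0] + [float(v) for v in row] for row in rows]  # intercept column
    size = len(design[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(size)]
        + [sum(r[i] * y for r, y in zip(design, scores))]
        for i in range(size)
    ]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("fields are collinear; coefficients are not unique")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(size):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


# Hypothetical data for one user: four rated items with two nat fields each.
nat_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
scores = [3.0, 2.0, 4.0, 6.0]  # consistent with y = 1 + 2*x1 + 1*x2

coefficients = fit_linear_regression(nat_vectors, scores)
# Favorite vector with the error terms assumed to be zero: b0 + b_j per field.
preference = [coefficients[0] + b for b in coefficients[1:]]
```

A field that takes the same value in all of a user's nat vectors is collinear with the intercept, so this sketch raises an error in that case instead of choosing one of the many possible solutions.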

Step four, the neighbor users of the target user are calculated by using the Euclidean distance method, and the method comprises the following specific steps:

The Euclidean distance is a common distance definition that is simple to calculate and gives fairly accurate results. The idea is as follows: in the m-dimensional space, the favorite vectors of the current user and the target user are subtracted; the smaller the resulting value distance(m, n), the more similar the item preferences of the two users, and conversely, the larger the value, the more dissimilar their item preferences. The calculation formula is as follows:

where x_mn denotes the n-th attribute of user m, k denotes the total number of users, distance(m, n) denotes the distance between the m-th user and the n-th user, and Sim(m, n) denotes the similarity of the m-th user to the n-th user.

Using the formula for Sim(m, n), the preference similarity between every pair of users is calculated, giving the users' preference similarity matrix, as follows:

After the similarity matrix is computed, to reduce unnecessary calculation and following the KNN idea, the m largest values in each row of the Sim similarity matrix are taken and substituted into the recommendation formula below to obtain the final result.
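The neighbor search can be sketched as follows. The Euclidean distance is standard, but the mapping from distance to similarity is not reproduced in this text, so the sketch assumes Sim = 1 / (1 + distance), a common choice that gives 1 for identical vectors and decreases as the distance grows. The function names and the favorite vectors are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two favorite vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def similarity(u, v):
    """Assumed mapping from distance to similarity: 1 / (1 + distance)."""
    return 1.0 / (1.0 + euclidean_distance(u, v))


def top_neighbors(preferences, target, m):
    """Return the m users most similar to `target` as (user, similarity) pairs."""
    scored = [
        (user, similarity(preferences[target], vec))
        for user, vec in preferences.items()
        if user != target
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:m]


# Hypothetical favorite vectors for four users.
preferences = {
    "u1": [3.0, 2.0],
    "u2": [3.0, 2.0],
    "u3": [6.0, 6.0],
    "u4": [3.0, 5.0],
}
neighbors = top_neighbors(preferences, "u1", 2)
```

For target `u1` the distances to `u2`, `u3` and `u4` are 0, 5 and 3, so the two retained neighbors are `u2` and `u4`. Keeping only the top m entries per row is what limits the later prediction to a small neighborhood.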

Step five, recommendations are generated for the user, specifically as follows:

After the similarity matrix is obtained in the previous step, a recommendation result can be given to the user according to the recommendation formula, which is as follows:

where R_ij denotes user i's score for item j in the user-item score matrix, Sim(m, i) is the preference similarity between users m and i, R̄_m denotes the average score that the current user m has given to the items they have historically rated, R̄_i denotes the average score that user i has given to the items they have historically rated, k denotes the number of users selected as most similar to user m, and pre(m, j) denotes the predicted score of user m for item j.
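The quantities listed above match the widely used mean-centered neighborhood prediction, so the sketch below assumes that form: the target user's average score plus the similarity-weighted average of the neighbors' deviations from their own averages. The source formula is not reproduced in this text and may differ in detail. The function name `predict_score`, the rule of skipping neighbors who have not rated the item, and all ratings are hypothetical.

```python
def predict_score(target, item, ratings, neighbors):
    """Mean-centered, similarity-weighted prediction of a score.

    ratings: {user: {item: score}}; neighbors: [(user, similarity), ...].
    Neighbors who have not rated the item are skipped (assumption). If no
    neighbor rated it, the target user's own average score is returned.
    """
    def mean_score(user):
        scores = ratings[user].values()
        return sum(scores) / len(scores)

    numerator = 0.0
    denominator = 0.0
    for user, sim in neighbors:
        if item in ratings[user]:
            numerator += sim * (ratings[user][item] - mean_score(user))
            denominator += abs(sim)
    base = mean_score(target)
    return base if denominator == 0 else base + numerator / denominator


# Hypothetical ratings and neighbor similarities.
ratings = {
    "u1": {"shirt": 4.0, "coat": 2.0},
    "u2": {"shirt": 5.0, "coat": 3.0, "scarf": 4.0},
    "u4": {"shirt": 2.0, "scarf": 4.0},
}
neighbors = [("u2", 1.0), ("u4", 0.25)]
pre = predict_score("u1", "scarf", ratings, neighbors)
```

In this example `u2` rates the scarf exactly at their own average and `u4` rates it one point above theirs, so the prediction for `u1` is 3 + (0.25 · 1) / 1.25 = 3.2. Centering on each user's average compensates for users who score systematically high or low.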

Drawings

FIG. 1 is a flow chart of an improved collaborative filtering algorithm based on multiple linear regression and third party credit.

FIG. 2 is a diagram of a user's composite credit value based on a weighted credit model for a third party individual.

FIG. 3 shows, for each of the 20 users, the similarity values of the 3 users most similar to that user.

FIG. 4 is a graph of each user's predicted scores for unscored items, obtained with the improved collaborative filtering algorithm based on multiple linear regression and third-party credit.

Detailed Description

To show the working steps of the invention clearly and intuitively, the MATLAB tool is used together with a practical case to describe in detail the improved collaborative filtering algorithm based on multiple linear regression and third-party organization credit.

The case background is to recommend clothing according to the target user's preferences using an improved collaborative filtering algorithm based on multiple linear regression and third party credit. Assuming that there are 3 third party individuals available, their weights are assigned as follows based on official data:

Third-party individual	Evaluation grade	Trust weight
CA₁	A	0.6
CA₂	B	0.3
CA₃	C	0.1

Each third-party individual scores each user, and the user's composite credit value is calculated as follows.

The ordinate of FIG. 2 is the users' composite credit value. After the composite credit value of each user is obtained, the item feature vectors associated with the user are constructed from the features of the items in the system and the credit obtained in the previous step; a multiple linear regression equation is constructed for each user from these feature vectors, and the user's favorite vector is obtained from the solved coefficients. The similarity between users is then calculated with the Euclidean distance described above, and the Sim similarity matrix is constructed. Following the KNN idea, the m largest values in each row of the Sim similarity matrix are taken and substituted into the recommendation formula; FIG. 3 shows the 3 largest similarity values of each row. After the similarity values are obtained, recommendations are generated for the user; the result is shown in FIG. 4.

Claims

1. An improved collaborative filtering method based on multiple linear regression and third-party credit, characterized in that: first, a weighted credit model based on third-party individuals is constructed; second, a credit value matrix for the behavior users have generated on items is collected; the favorite vectors are quantized to obtain a linear regression equation, and the N users most similar to the target user are selected as neighbors of the target user; finally, recommendations are generated for the user with a recommendation formula.

(1) The weighted credit model calculation is described as:

where Credit^T_ac(m, n) denotes the transpose of the credit value matrix, and ω_a, ω_b, ω_c respectively denote the trust weights of trust authorities of different grades.

(2) The credit value matrix calculation is described as:

where a is a constant, c denotes the number of third-party individuals in a given evaluation grade, m denotes a row, q denotes the n users, max denotes the most approved third-party individual in the current evaluation grade, and min denotes the least approved third-party individual in the current evaluation grade.

(3) The favorites vector calculation is described as:

nat_i＝(nat_i1,nat_i2,...,nat_ij,...,nat_in)

where nat_in denotes the n-th attribute of the i-th user, and nat_i denotes the favorite vector of the i-th user.

(4) The quantitative construction of each user's favorite vector is described as:

y_i＝b₀+b₁nat_i1+b₂nat_i2+...+b_nnat_in+μ_i

where b_n denotes the coefficient of the n-th factor influencing the score value y, b₀ denotes the constant term, μ_i denotes the random error, and y_i denotes the score value obtained for the i-th user for a given nat vector.

(5) The calculation of the similarity between the target user and other users is described as:

where distance(m, n) denotes the distance between the target user m and user n, k denotes the total number of users in the system, and Sim(m, n) denotes the similarity value between the target user m and user n; the N users most similar to the target user are selected as neighbors of the target user.

(6) The recommendation formula is described as:

where R̄_m denotes the average score that the current user m has given to the items they have historically rated, R̄_i denotes the average score that user i has given to the items they have historically rated, k denotes the number of users selected as most similar to user m, and pre(m, j) denotes the predicted score of user m for item j. Recommendations are generated for the user with the recommendation formula, based on the obtained user similarity.