CN109063120B

CN109063120B - Collaborative filtering recommendation method and device based on clustering

Info

Publication number: CN109063120B
Application number: CN201810863191.2A
Authority: CN
Inventors: 高志鹏; 李博; 杨杨; 王颖; 谭清; 王茜; 肖楷乐
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2018-08-01
Filing date: 2018-08-01
Publication date: 2021-05-28
Anticipated expiration: 2038-08-01
Also published as: CN109063120A

Abstract

Embodiments of the invention provide a clustering-based collaborative filtering recommendation method and device. The method comprises: obtaining tag genome vectors of first items; dividing the first items into a first number of cluster classes based on their tag genome vectors; for each target item: when the target item and a second item belong to the same cluster class, calculating their correlation coefficient based on a preset type of distance between the target item and the second item; when the target item and the second item belong to different cluster classes, calculating their correlation coefficient based on the Pearson correlation coefficient of the target item and the second item; performing a weighted summation of the target user's preset scores for the second items and the correlation coefficients of the target item and the second items to obtain the target user's prediction score for the target item; and recommending to the target user the target items whose prediction scores meet a preset condition. Applying the embodiments of the invention can improve the objectivity of recommendation scoring.

Description

Collaborative filtering recommendation method and device based on clustering

Technical Field

The invention relates to the technical field of recommendation algorithms, in particular to a collaborative filtering recommendation method and device based on clustering.

Background

With the rapid development of internet technology, the internet provides users with a massive variety of information that enriches and facilitates people's work and life. However, finding the information of interest within this mass of information is time-consuming and laborious for the user. Recommendation algorithms have emerged for this reason: they do not require the user to state explicit needs, but analyze the user's interests and needs from the user's historical behavior in order to recommend items that satisfy those interests and needs.

Specifically, the collaborative filtering recommendation algorithm is one of the most widely applied recommendation algorithms. Its processing steps are as follows:

Step one: acquire the history of interaction between users and items from a preset public data set. Typically, the public data set may be obtained from a website dedicated to recommendation-system research. The history of user-item interaction includes a user-item scoring matrix, a user-item consumption matrix, and the like, wherein the user-item scoring matrix includes a plurality of items and the scores that users have given to each item. It should be noted that the scores obtained by an item may come from different users. The user-item scoring matrix is referred to below simply as the scoring matrix and is used as the example for explanation.

For convenience of explanation, the user to whom items are recommended is referred to as the target user, and an item recommended to the target user is referred to as a target item; the items in the scoring matrix are referred to as first items. An item recommended to the target user should be one that the target user has not used, that is, one the target user has not scored, so the target items are the first items that the target user has not scored.

Step two: for each target item, first calculate the correlation coefficient of the target item and each second item, where the second items are the first items other than the target items; then perform a weighted summation of the target user's scores for the second items in the scoring matrix and the correlation coefficients of the target item and the second items, to calculate the target user's prediction score for the target item.

Specifically, the Pearson correlation coefficient may be used as the correlation coefficient between items; it is calculated as shown in formula (1):

$$\rho_{ij} = \frac{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)(r_{uj} - \bar{r}_j)}{\sqrt{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)^2}\,\sqrt{\sum_{u \in U_{ij}} (r_{uj} - \bar{r}_j)^2}} \quad (1)$$

In formula (1), the target item is item i; the second item is item j; $\rho_{ij}$ is the Pearson correlation coefficient of item i and item j; $U_{ij} = U_i \cap U_j$ is the set of common users who have scored both item i and item j; u is a user in the common user set $U_{ij}$; $r_{ui}$ is the score given to item i by user u; $\bar{r}_i$ is the mean of the scores obtained by item i; $r_{uj}$ is the score given to item j by user u; and $\bar{r}_j$ is the mean of the scores obtained by item j.

Step three: recommend to the target user the target items whose prediction scores meet a preset condition.

Specifically, the target items may be sorted by the target user's prediction score for each of them, and a preset number of target items with the highest prediction scores may be recommended to the target user; alternatively, the target items whose prediction scores exceed a score threshold may be recommended to the target user.
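The item-to-item correlation coefficient of formula (1) can be illustrated with a short sketch. It computes the standard Pearson correlation over the common user set $U_{ij}$, taking each item's mean over all scores that item obtained. The dictionary layout (user id mapped to score), the function name `pearson`, and the convention of returning 0.0 when the coefficient is undefined are assumptions made for illustration only.

```python
from math import sqrt

def pearson(ratings_i, ratings_j):
    """Pearson correlation of two items over their common users.

    ratings_i, ratings_j: dict of user id -> score (hypothetical layout).
    Returns 0.0 when there are no common users or a deviation sum is zero
    (an illustrative convention, not taken from the source).
    """
    common = ratings_i.keys() & ratings_j.keys()      # U_ij = U_i ∩ U_j
    if not common:
        return 0.0
    mean_i = sum(ratings_i.values()) / len(ratings_i)  # mean score of item i
    mean_j = sum(ratings_j.values()) / len(ratings_j)  # mean score of item j
    num = sum((ratings_i[u] - mean_i) * (ratings_j[u] - mean_j) for u in common)
    den_i = sqrt(sum((ratings_i[u] - mean_i) ** 2 for u in common))
    den_j = sqrt(sum((ratings_j[u] - mean_j) ** 2 for u in common))
    if den_i == 0 or den_j == 0:
        return 0.0
    return num / (den_i * den_j)

item_a = {"u1": 5, "u2": 3, "u3": 1}
item_b = {"u1": 4, "u2": 3, "u3": 2}
rho = pearson(item_a, item_b)  # both items are ranked identically by the three users
```

Because every quantity in the computation is a user-supplied score, the sketch also makes visible why this coefficient inherits the subjectivity of the scores.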

It can be seen that, in the collaborative filtering recommendation algorithm, the correlation coefficients between the target item and the other items depend entirely on the scores the items have obtained. These scores are given subjectively by users and can easily compromise the objectivity of the calculated prediction scores. For example, if item A has received only one low score, given by a user who was in a bad mood, while the quality of item A is in fact very good, the prediction score calculated for item A by the collaborative filtering recommendation algorithm will be low and will lack objectivity.

Disclosure of Invention

The embodiment of the invention aims to provide a collaborative filtering recommendation method and device based on clustering so as to improve the objectivity of recommendation scoring. The specific technical scheme is as follows:

the embodiment of the invention provides a collaborative filtering recommendation method based on clustering, which comprises the following steps:

acquiring a tag genome vector of a first article from a preset tag genome information matrix, wherein the tag genome vector is used for describing inherent attributes of the first article;

dividing the first article into a preset first number of cluster classes based on the tag genome vector of the first article by using a preset clustering algorithm;

for each target item: when the target item and a second item belong to the same cluster class, calculating the correlation coefficient of the target item and the second item based on a preset type of distance between the tag genome vector of the target item and the tag genome vector of the second item, wherein a target item is a first item that the target user has not scored, and a second item is a first item other than the target items; when the target item and the second item belong to different cluster classes, calculating the correlation coefficient of the target item and the second item based on the Pearson correlation coefficient of the target item and the second item; and performing a weighted summation of the target user's preset scores for the second items and the correlation coefficients of the target item and the second items to obtain the target user's prediction score for the target item;

and recommending to the target user the target items whose prediction scores meet a preset condition.

The embodiment of the invention also provides a collaborative filtering recommendation device based on clustering, which comprises:

the first acquisition module is used for acquiring a tag genome vector of a first article from a preset tag genome information matrix, wherein the tag genome vector is used for describing inherent attributes of the first article;

the dividing module is used for dividing the first items into a preset first number of cluster classes based on the tag genome vectors of the first items by using a preset clustering algorithm;

a first computing module, used for, for each target item: when the target item and a second item belong to the same cluster class, calculating the correlation coefficient of the target item and the second item based on a preset type of distance between the tag genome vector of the target item and the tag genome vector of the second item, wherein a target item is a first item that the target user has not scored, and a second item is a first item other than the target items; when the target item and the second item belong to different cluster classes, calculating the correlation coefficient of the target item and the second item based on the Pearson correlation coefficient of the target item and the second item; and performing a weighted summation of the target user's preset scores for the second items and the correlation coefficients of the target item and the second items to obtain the target user's prediction score for the target item;

and the recommending module is used for recommending to the target user the target items whose prediction scores meet a preset condition.

The embodiment of the invention further provides electronic equipment, which comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete mutual communication through the communication bus;

a memory for storing a computer program;

and the processor is used for realizing any one of the collaborative filtering recommendation methods based on the clustering when executing the program stored in the memory.

An embodiment of the present invention further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are executed on a computer, the computer is enabled to execute any one of the above-mentioned clustering-based collaborative filtering recommendation methods.

An embodiment of the present invention further provides a computer program product containing instructions, which when run on a computer, causes the computer to execute any one of the above-mentioned clustering-based collaborative filtering recommendation methods.

In the clustering-based collaborative filtering recommendation method and device provided by the embodiments of the invention, tag genome vectors of the first items are first obtained from a preset tag genome information matrix, the tag genome vector describing the inherent attributes of a first item. Then, using a preset clustering algorithm, the first items are divided into a preset first number of cluster classes based on their tag genome vectors, so that first items belonging to the same cluster class have similar inherent attributes. Next, for each target item: when the target item and a second item belong to the same cluster class, the correlation coefficient of the target item and the second item is calculated based on a preset type of distance between the tag genome vector of the target item and the tag genome vector of the second item, wherein a target item is a first item that the target user has not scored, and a second item is a first item other than the target items; when the target item and the second item belong to different cluster classes, the correlation coefficient of the target item and the second item is calculated based on their Pearson correlation coefficient; and the target user's preset scores for the second items and the correlation coefficients of the target item and the second items are weighted and summed to obtain the target user's prediction score for the target item. Finally, the target items whose prediction scores meet a preset condition are recommended to the target user.

In this way, after the first items are classified according to their tag genome vectors, the correlation coefficient for first items belonging to the same cluster class can be calculated from the preset type of distance between their tag genome vectors. The tag genome vector describes the inherent attributes of an item, does not change with the subjective will of users, and is therefore objective. The calculated correlation coefficient is consequently objective as well, the resulting prediction score is more objective, and the problem of subjective user scores undermining the objectivity of the prediction score is avoided.

Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic flow chart of a collaborative filtering recommendation method based on clustering according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a detailed process of step 103 according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating a detailed procedure of substep 12 of the present invention;

FIG. 4 is a flowchart illustrating a detailed procedure of substep 13 of the present invention;

FIG. 5 is a flow chart of determining an optimal value of a parameter according to an embodiment of the present invention;

FIG. 6 is a flowchart illustrating a detailed process of step 503 according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a collaborative filtering recommendation device based on clustering according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

An embodiment of the present invention provides a clustering-based collaborative filtering recommendation method. Referring to Fig. 1, which is a schematic flow diagram of the method, the method may include the following steps:

Step 101, obtaining tag genome vectors of the first items from a preset tag genome information matrix.

Wherein the tag genome vector is used to describe an intrinsic property of the first item.

In this step, the tag genome vector of each first item may be obtained from the preset tag genome information matrix, so that the first items can be classified according to their tag genome vectors in the subsequent steps. The tag genome vector describes the inherent attributes of a first item, and "first items" refers to all the items included in the tag genome information matrix, typically a plurality of items.

The tag genome information matrix is obtained from user comments on various websites, by manual collection or by machine learning, and constitutes context information about the items. In practical applications, it can also be obtained from public data sets on websites dedicated to recommendation-system research, for example the public data sets of the MovieLens website. The tag genome information matrix contains the tag genome vector of each first item. The tag genome vector describes the inherent attributes of the item: the degree of correlation between the item and each feature or tag is represented by a decimal between 0 and 1, and the larger the value, the higher the weight of the item on that feature or tag, that is, the closer the inherent attributes of the item are to that feature or tag.

For example, if the first item is a movie and the tags are "horror", "romance" and "comedy", a tag genome vector of [0.9, 0.8, 0.1] means that the relevance of the movie to the tag "horror" is 0.9, to the tag "romance" is 0.8, and to the tag "comedy" is 0.1. The movie is therefore closest to the tag "horror" and can be regarded as a horror movie.

It can be understood that the tag genome vector has objectivity because the tag or feature in the tag genome vector is an inherent attribute of the article, exists objectively, and does not change according to the subjective intention of the user.

Step 102, dividing the first item into a preset first number of cluster classes based on the tag genome vector of the first item by using a preset clustering algorithm.

In this step, a preset clustering algorithm may be used to divide the first items into a preset first number of cluster classes based on their tag genome vectors. The preset clustering algorithm may be the K-means clustering algorithm, the mean shift clustering algorithm, or the like.

Since the tags or features in the tag genome vector are inherent attributes of an item and exist objectively, first items belonging to the same cluster class have similar objective features. Calculating the correlation coefficient from the tag genome vectors for first items in the same cluster class therefore yields a highly objective correlation coefficient.

In one implementation, the preset clustering algorithm is the K-means clustering algorithm, and the first items are divided into K cluster classes based on their tag genome vectors as follows:

First, input the tag genome information matrix G.

Second, randomly initialize K cluster center points, denoted $\mu_1, \mu_2, \ldots, \mu_K \in \mathbb{R}^n$, where $\mathbb{R}^n$ is the n-dimensional vector space and n is the number of features or tags in the tag genome information matrix G.

Third, for each first item i, calculate the cluster class $c^{(i)}$ to which its genome vector $g_i$ belongs:

$$c^{(i)} := \arg\min_{j} \lVert g_i - \mu_j \rVert^2 \quad (2)$$

In formula (2), $g_i$ is the genome vector of first item i, $g_i \in G$; $c^{(i)}$ is the cluster class to which $g_i$ belongs; $\mu_j$ is a cluster center point, $j \in \{1, \ldots, K\}$. Formula (2) defines $c^{(i)}$ as the index j that minimizes $\lVert g_i - \mu_j \rVert^2$.

Fourth, for each cluster class j, recalculate its center point $\mu_j$:

$$\mu_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, g_i}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}} \quad (3)$$

In formula (3), $g_i$ is the genome vector of first item i, $g_i \in G$; $c^{(i)}$ is the cluster class to which $g_i$ belongs; $\mu_j$ is a cluster center point, $j \in \{1, \ldots, K\}$; m is the total number of first items, $i \in \{1, \ldots, m\}$. Formula (3) defines $\mu_j$ as the mean of the genome vectors currently assigned to cluster class j.

The third and fourth steps are repeated until convergence. For further details of dividing the first items into K cluster classes with the K-means clustering algorithm, reference may be made to the prior art; they are not repeated here.
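The four K-means steps above can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: the function names, the choice of initializing centers by sampling K rows of G, the iteration cap, and the handling of empty cluster classes are all assumptions.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(G, k, iters=100, seed=0):
    """Cluster the rows of G (tag genome vectors) into k cluster classes."""
    rng = random.Random(seed)
    # Step two (assumed variant): initialize centers from k distinct rows of G.
    centers = [list(g) for g in rng.sample(G, k)]
    assign = None
    for _ in range(iters):
        # Step three: assign each vector to its nearest center.
        new_assign = [min(range(k), key=lambda j: sq_dist(g, centers[j])) for g in G]
        if new_assign == assign:   # assignments unchanged: converged
            break
        assign = new_assign
        # Step four: move each center to the mean of its members.
        for j in range(k):
            members = [g for g, c in zip(G, assign) if c == j]
            if members:            # an empty cluster keeps its old center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers

# Four hypothetical tag genome vectors forming two obvious groups.
G = [[0.9, 0.8, 0.1], [0.85, 0.75, 0.2], [0.1, 0.2, 0.9], [0.15, 0.1, 0.95]]
labels, centers = kmeans(G, k=2)
```

The numeric label attached to each cluster class depends on the initialization, so only the grouping (which items share a label) is meaningful.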

Step 103, for each target item: when the target item and a second item belong to the same cluster class, calculating the correlation coefficient of the target item and the second item based on a preset type of distance between the tag genome vector of the target item and the tag genome vector of the second item; when the target item and the second item belong to different cluster classes, calculating the correlation coefficient of the target item and the second item based on the Pearson correlation coefficient of the target item and the second item; and performing a weighted summation of the target user's preset scores for the second items and the correlation coefficients of the target item and the second items to obtain the target user's prediction score for the target item.

A target item is a first item that the target user has not scored; a second item is a first item other than the target items.

Since the first items have been divided into a plurality of cluster classes in step 102, this step performs the following processing for each target item. Referring to Fig. 2, which is a detailed flowchart of step 103 in the embodiment of the present invention:

Sub-step 11: when the target item and the second item belong to the same cluster class, calculate the correlation coefficient of the target item and the second item based on a preset type of distance between the tag genome vector of the target item and the tag genome vector of the second item.

The preset type of distance may be the Euclidean distance, the Manhattan distance, the Mahalanobis distance, or the like, and can be chosen according to actual conditions.

In one implementation, when the target item and the second item belong to the same cluster class, the Euclidean distance between the tag genome vector of the target item and the tag genome vector of the second item is calculated, and the calculated Euclidean distance value is used as the correlation coefficient of the target item and the second item.

The method for calculating the Euclidean distance is known from the prior art and is not described here again.

Since the tags or features in the tag genome vector are inherent attributes of an item and exist objectively, the correlation coefficient calculated from the tag genome vectors of first items belonging to the same cluster class is highly objective.
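As a small illustration of the distance computation in sub-step 11, the sketch below evaluates the Euclidean and Manhattan distances between two tag genome vectors. The function names and the example vectors are hypothetical; the Mahalanobis distance is omitted because it additionally needs a covariance estimate.

```python
from math import sqrt

def euclidean(g_i, g_j):
    """Euclidean distance between two tag genome vectors of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(g_i, g_j)))

def manhattan(g_i, g_j):
    """Manhattan (L1) distance between two tag genome vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(g_i, g_j))

g_target = [0.9, 0.8, 0.1]   # hypothetical vector of a target item
g_second = [0.6, 0.4, 0.1]   # hypothetical vector of a second item in the same cluster class
eps = euclidean(g_target, g_second)
l1 = manhattan(g_target, g_second)
```

Only the tag genome vectors enter this computation, so the result does not depend on any user score.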

Sub-step 12: when the target item and the second item belong to different cluster classes, calculate the correlation coefficient of the target item and the second item based on the Pearson correlation coefficient of the target item and the second item.

In one implementation, referring to Fig. 3, which is a detailed flowchart of sub-step 12 in the embodiment of the present invention, sub-step 12 may include:

Sub-step 121: calculate the Pearson correlation coefficient of the target item and the second item.

For the detailed process of sub-step 121, reference may be made to formula (1) and the related description in the Background section; it is not repeated here.

Sub-step 122: generate a scaled Pearson correlation coefficient of the target item and the second item based on the number of common users of the two items and the calculated Pearson correlation coefficient, and use the scaled Pearson correlation coefficient as the correlation coefficient of the target item and the second item.

It should be noted that the preset scoring matrix, like the tag genome information matrix, may be obtained from public data sets on websites dedicated to recommendation-system research, for example the public data sets of the MovieLens website. The preset scoring matrix includes a plurality of items and the scores that users have given to each item. However, the preset scoring matrix is sparse: a large number of its items have received few user scores, so items have very few common users. If the Pearson correlation coefficient were used directly as the correlation coefficient between items, it would not truly reflect the correlation between the items, and its accuracy would be poor.

Therefore, the embodiment of the present invention takes the number of common users into account when calculating the correlation coefficient, so that the coefficient is more accurate and truly reflects the correlation between items.

In a specific implementation, the scaled Pearson correlation coefficient of the target item and the second item may be generated according to formula (4), based on the number of their common users and the calculated Pearson correlation coefficient.

In formula (4), the target item is item i; the second item is item j; $s_{ij}$ is the scaled Pearson correlation coefficient of item i and item j; $\rho_{ij}$ is the Pearson correlation coefficient of item i and item j; $n_{ij}$ is the number of common users of item i and item j; and $\lambda_1$ is the common-user-number parameter.

Specifically, on the basis of the Pearson correlation coefficient, the embodiment introduces the common-user-number parameter $\lambda_1$, which increases the influence of the number of common users on the correlation coefficient and compensates for the adverse effect of items having very few common users. The correlation coefficient calculated in this way reflects the correlation between items more faithfully and accurately, which in turn makes the calculated prediction score more objective.

In practical applications, the value of $\lambda_1$ may be 100.
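The text describes formula (4) only through its symbols. A minimal sketch follows, assuming the commonly used shrinkage form $s_{ij} = \frac{n_{ij}}{n_{ij} + \lambda_1}\,\rho_{ij}$; this form is an assumption made for illustration and may differ from the formula in the original filing. The function name `scaled_correlation` is likewise hypothetical.

```python
def scaled_correlation(rho, n_common, lam=100.0):
    """Shrink a correlation coefficient toward zero when few users are shared.

    rho:      correlation coefficient of items i and j (rho_ij)
    n_common: number of common users of items i and j (n_ij)
    lam:      common-user-number parameter (lambda_1); 100 is the value the text suggests
    NOTE: the shrinkage form n / (n + lam) * rho is an assumed reading of formula (4).
    """
    return n_common / (n_common + lam) * rho

s_few = scaled_correlation(0.8, n_common=5)      # little evidence: strongly shrunk
s_many = scaled_correlation(0.8, n_common=900)   # ample evidence: close to rho
```

Under this assumed form, a coefficient supported by only a handful of common users contributes far less to the weighted summation than the same coefficient supported by hundreds of users.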

Sub-step 13: perform a weighted summation of the target user's preset scores for the second items and the correlation coefficients of the target item and the second items to obtain the target user's prediction score for the target item.

In this step, since the preset scoring matrix already contains the target user's scores for the second items, these scores can be read from the preset scoring matrix and weighted and summed with the correlation coefficients calculated in the preceding sub-steps to obtain the target user's prediction score for the target item.

In the preceding sub-steps, different calculation methods are used, according to the classification result, for items belonging to the same cluster class and for items belonging to different cluster classes, so the calculated correlation coefficients are both highly objective and accurate. The target user's prediction score for the target item obtained in this step is therefore also highly objective.

In one implementation, the target user's preset scores for the second items and the correlation coefficients of the target item and the second items are weighted and summed according to formula (5) to obtain the target user's prediction score for the target item.

In formula (5), the target item is item i; the second item is item j; the target user is user u; $\hat{r}_{ui}$ is the predicted score of user u for item i; $r_{uj}$ is the score given by user u to item j; $s_{ij}$ is the scaled Pearson correlation coefficient of item i and item j; $\rho_{ij}$ is the Pearson correlation coefficient of item i and item j; $\varepsilon_{ij}$ is the Euclidean distance value between the tag genome vector of item i and the tag genome vector of item j; $\kappa_u$ is the set of items, other than item i, that user u has scored; $p_{ij}$ is an adjustment factor; and, where the first items are divided into k cluster classes, the cluster-set term denotes the set of items belonging to the same cluster class as item i.

Specifically, in formula (5), according to the classification result of step 102, for second items belonging to the same cluster class as the target item, the prediction score is calculated from the Euclidean distance value between the tag genome vectors of the target item and the second item; since the tags or features in the tag genome vector are inherent attributes of an item and exist objectively, the correlation coefficient calculated from the tag genome vectors is highly objective. For second items belonging to a different cluster class from the target item, the prediction score is calculated from the scaled Pearson correlation coefficient between the target item and the second item, which truly reflects the correlation between the items and is more accurate. In conclusion, the calculated prediction score is more objective and truly reflects the target user's evaluation of the target item.
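The cluster-dependent weighted summation can be sketched as below. The exact form of formula (5), including how the adjustment factor $p_{ij}$ enters, is not recoverable from the text, so the sketch makes loud assumptions: the coefficient for each scored item is chosen by cluster membership, and the sum is normalized by the total absolute weight. The function name, argument names, and the constant coefficient functions in the example are hypothetical.

```python
def predict_score(target, user_scores, cluster_of, same_cluster_coeff, cross_cluster_coeff):
    """Predict the user's score for `target` from the user's known scores.

    user_scores:         dict item -> score given by the user (items other than target)
    cluster_of:          dict item -> cluster class label
    same_cluster_coeff:  callable (i, j) -> coefficient from tag genome vectors
    cross_cluster_coeff: callable (i, j) -> coefficient from user scores (e.g. scaled correlation)
    ASSUMPTION: normalizing by the sum of absolute weights is illustrative only.
    """
    num = den = 0.0
    for j, r_uj in user_scores.items():
        if cluster_of[j] == cluster_of[target]:
            w = same_cluster_coeff(target, j)    # same cluster class: genome-based coefficient
        else:
            w = cross_cluster_coeff(target, j)   # different cluster class: score-based coefficient
        num += w * r_uj
        den += abs(w)
    return num / den if den else None

cluster_of = {"i": 0, "a": 0, "b": 1}
user_scores = {"a": 4.0, "b": 2.0}
pred = predict_score("i", user_scores, cluster_of,
                     same_cluster_coeff=lambda i, j: 0.5,
                     cross_cluster_coeff=lambda i, j: 0.25)
```

Passing the two coefficient functions as arguments keeps the sketch independent of the particular distance or correlation chosen in sub-steps 11 and 12.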

Step 104, recommending to the target user the target items whose prediction scores meet a preset condition.

In this step, the target items may be sorted by the target user's prediction scores calculated in step 103, and a preset number of target items with the highest prediction scores may be recommended to the target user; alternatively, the target items whose prediction scores exceed a score threshold may be recommended to the target user. The condition can be chosen according to actual requirements and is not described further here.
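The two selection rules mentioned for step 104 can be sketched as follows. The function names and the example prediction scores are hypothetical.

```python
def top_n(predictions, n):
    """Return the n items with the highest prediction scores, best first."""
    ranked = sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)
    return [item for item, _ in ranked[:n]]

def above_threshold(predictions, threshold):
    """Return the items whose prediction score exceeds the threshold."""
    return [item for item, score in predictions.items() if score > threshold]

predictions = {"a": 4.2, "b": 3.1, "c": 4.8}   # hypothetical prediction scores
picked_top = top_n(predictions, 2)
picked_thr = above_threshold(predictions, 4.0)
```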

Because the calculated prediction score is more objective and truly reflects the target user's evaluation of the target item, the target items selected for their high prediction scores better match the target user's preferences, and the user experience is better.

As can be seen, in the clustering-based collaborative filtering recommendation method proposed in the embodiment of the present invention, after the first items are classified according to their tag genome vectors, the correlation coefficient for first items belonging to the same cluster class can be calculated from the Euclidean distance value between their tag genome vectors. The tag genome vector describes the inherent attributes of an item, does not change with the subjective will of users, and is therefore objective. The calculated correlation coefficient is consequently objective as well, the resulting prediction score is more objective, and the problem of subjective user scores undermining the objectivity of the prediction score is avoided.

In an alternative implementation, referring to Fig. 4, which is a detailed flowchart of sub-step 13 in the embodiment of the present invention, sub-step 13 may specifically include:

Sub-step 131: perform a weighted summation of the target user's preset scores for the second items and the correlation coefficients of the target item and the second items.

Sub-step 132: adjust the result of the weighted summation using a personalization parameter, a user bias adjustment parameter, and an item bias adjustment parameter to obtain the target user's prediction score for the target item.

It should be noted that, since the score of the second item by the preset target user is obtained from the preset score matrix, the score in the preset score matrix may be affected by various bias factors. Specifically, the bias factors may include user bias and item bias, where the user bias refers to that some users are used to score items higher, and the item bias refers to that some items are likely to obtain higher scores due to the influence of external factors such as advertising. Therefore, the scores in the preset scoring matrix may not accurately reflect the user's preference and the quality of the goods.

Therefore, in the embodiment of the present invention, the result of sub-step 131 is adjusted by using the personalized parameter, the user bias adjustment parameter and the item bias adjustment parameter, so that the calculated prediction score of the target user for the target item is more accurate and better reflects the user's preference and the quality of the item.

In one implementation, the preset score of the target user for the second item and the correlation coefficient of the target item and the second item may be weighted and summed by using formula (6); using the personalized parameters, the user deviation adjustment parameters and the article deviation adjustment parameters to adjust on the basis of the result of the weighted summation to obtain the prediction score of the target user on the target article;

In formula (6), the target item is item i; the second item is item j; the target user is user u; μ is the mean score in the preset scoring matrix; b_u is the user bias adjustment parameter for user u; b_i is the item bias adjustment parameter for item i; α_u is the personalized parameter for user u.

Specifically, formula (6) builds on formula (5) by adding the mean score μ of the preset scoring matrix, the user bias adjustment parameter b_u and the item bias adjustment parameter b_i. In addition, because each user is influenced by similar items to a different extent, a personalized parameter α_u is introduced for each user. As a result, the calculated prediction score of the target user for the target item is more accurate and better reflects the user's preference and the quality of the item.

To obtain a more objective and accurate prediction score, the optimal values of the user bias adjustment parameter b_u, the item bias adjustment parameter b_i and the personalized parameter α_u in formula (6) may be determined before the prediction score is calculated with that formula.

Referring to fig. 5, fig. 5 is a flowchart of determining the optimal values of the parameters according to an embodiment of the present invention. As shown in fig. 5, the steps of determining the user bias adjustment parameter b_u, the item bias adjustment parameter b_i and the personalized parameter α_u in the calculation formula of the second prediction score are as follows:

step 501, obtaining a preset number of sample sets from a preset scoring matrix.

Each sample set comprises a user, an item and a score given to the item by the user.

In this step, a preset number of sample sets may be obtained from a preset scoring matrix, so as to minimize the loss function according to the preset number of sample sets.

Since the scoring matrix records the scores that users have given to items, a user, an item and the score given by that user to that item can be read from the matrix, and these three values together form one sample set. The number of sample sets may be chosen according to actual conditions.
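As a minimal sketch of this step, the following Python function collects (user, item, score) sample sets from a small scoring matrix. The matrix layout (rows as users, columns as items, `None` for an unscored item) and the names `extract_sample_sets` and `limit` are assumptions made for illustration; the embodiment does not specify them.

```python
def extract_sample_sets(score_matrix, limit=None):
    """Collect (user, item, score) triples from a scoring matrix.

    Assumed layout: score_matrix[u][i] is the score user u gave item i,
    or None when user u has not scored item i.
    """
    sample_sets = []
    for u, row in enumerate(score_matrix):
        for i, score in enumerate(row):
            # Stop as soon as the preset number of sample sets is reached.
            if limit is not None and len(sample_sets) >= limit:
                return sample_sets
            if score is not None:
                sample_sets.append((u, i, score))
    return sample_sets


matrix = [
    [5, None, 3],
    [None, 4, None],
]
print(extract_sample_sets(matrix))
```

Passing `limit` caps the number of sample sets, which corresponds to the preset number mentioned above.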

Step 502, regarding each sample set, taking the user in the sample set as a target user, and taking the article in the sample set as a target article; and calculating the prediction score of the target user on the target item, and taking the calculated prediction score as the prediction score corresponding to the sample set.

In this step, the following processing may be performed for each sample set in the preset number of sample sets:

the first step is to take the user in the sample set as a target user and take the item in the sample set as a target item.

Specifically, for one sample set in a preset number of sample sets, the user in the sample set is taken as a target user, the item in the sample set is taken as a target item, and a prediction score is calculated according to the target user and the target item.

And secondly, calculating the prediction score of the target user on the target item, and taking the calculated prediction score as the prediction score corresponding to the sample set.

Specifically, the formula shown in formula (6) is used to calculate the prediction score of the target user for the target item, and the calculated prediction score is used as the prediction score corresponding to the sample set, so as to minimize the loss function value according to the prediction score and the real score.

It should be noted that the score originally present in the sample set is the user's true score for the item, and the calculated score is the user's predicted score for the item. Here, the prediction score is calculated so as to optimize the parameters, and the difference between the true score and the prediction score is minimized, so that the prediction score calculated by the formula shown in formula (6) is closer to the true score. It will be appreciated that, in practical applications, when a target item is recommended to a target user by calculating a prediction score, the prediction score is not recalculated for an item for which the target user has given a score, since the target item should be an item that the target user has not used or purchased.

Step 503, based on each sample set and the prediction scores corresponding to each sample set, performing minimization processing on a preset loss function, where the preset loss function is shown in formula (7);

In the preset loss function shown in formula (7), κ is the preset number of sample sets; item i is the item in a sample set; user u is the user in that sample set; b_u is the user bias adjustment parameter for user u; b_i is the item bias adjustment parameter for item i; α_u is the personalized parameter for user u; r̂_ui is the calculated prediction score of user u for item i; r_ui is the score given by user u to item i, obtained from the preset scoring matrix; and λ₂ is a preset regularization parameter.

In this step, the loss function shown in formula (7) may be minimized according to each sample set that has been calculated and the prediction score corresponding to each sample set, so as to obtain a minimized loss function.
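Formula (7) itself is not reproduced in this text. The sketch below therefore assumes one common form: the sum, over all sample sets, of the squared difference between the true score and the prediction score, plus λ₂ times the squared parameters b_u, b_i and α_u. This assumed form agrees with the update rules in formulas (10) and (11) up to a constant factor that can be absorbed into the step size, but it is an assumption rather than the formula of the embodiment. The names `regularized_loss` and `predict` are hypothetical.

```python
def regularized_loss(sample_sets, predict, b_user, b_item, alpha, lam2):
    """Assumed form of the loss: squared error plus an L2 penalty.

    sample_sets: iterable of (u, i, r_ui) triples.
    predict:     callable returning the prediction score for (u, i).
    b_user, b_item, alpha: dicts holding b_u, b_i and alpha_u.
    lam2:        regularization parameter (lambda_2 in the text).
    """
    total = 0.0
    for u, i, r_ui in sample_sets:
        e_ui = r_ui - predict(u, i)  # true score minus prediction score
        penalty = b_user[u] ** 2 + b_item[i] ** 2 + alpha[u] ** 2
        total += e_ui ** 2 + lam2 * penalty
    return total


# Toy usage with a constant stand-in predictor.
loss = regularized_loss(
    [(0, 0, 4.0)], lambda u, i: 3.0, {0: 1.0}, {0: 0.0}, {0: 2.0}, lam2=0.5
)
print(loss)
```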

In one implementation manner, referring to fig. 6, fig. 6 is a specific flowchart of step 503 in the embodiment of the present invention, including:

Sub-step 5031, for each sample set, initializing the personalized parameter α_u, the user bias adjustment parameter b_u and the item bias adjustment parameter b_i, and initializing the step size γ and the number of iterations.

Specifically, the initial values of the personalized parameter α_u, the user bias adjustment parameter b_u and the item bias adjustment parameter b_i may be set to random(0, 1), 0 and 0, respectively; the initial value of the step size γ may be 0.04; and the initial value of the number of iterations may be 0.
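The initialization described above can be sketched as follows. Reading "random(0, 1)" as a uniform draw from the interval [0, 1) is an assumption, and the function name `init_parameters` and the dictionary layout are hypothetical.

```python
import random


def init_parameters(users, items, seed=None):
    """Initialize alpha_u, b_u, b_i, the step size and the iteration count."""
    rng = random.Random(seed)
    alpha = {u: rng.random() for u in users}  # uniform draw in [0, 1)
    b_user = {u: 0.0 for u in users}          # user bias adjustment parameters
    b_item = {i: 0.0 for i in items}          # item bias adjustment parameters
    gamma = 0.04                              # initial step size from the text
    iterations = 0                            # initial number of iterations
    return alpha, b_user, b_item, gamma, iterations


alpha, b_user, b_item, gamma, iterations = init_parameters(["u1", "u2"], ["i1"], seed=1)
print(alpha, b_user, b_item, gamma, iterations)
```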

Sub-step 5032, setting both the user set U and the item set I as empty sets.

Specifically, the user set U and the item set I are respectively set as empty sets, so that the users and the items in the processed sample set are respectively added into the user set U and the item set I by performing subsequent steps.

Sub-step 5033, randomly obtaining one sample set from the preset number of sample sets κ.

The sample set comprises a user u, an item i, the score r_ui given by user u to item i and obtained from the preset scoring matrix, and the calculated prediction score r̂_ui of user u for item i.

Specifically, the sample set used in sub-steps 5034 to 5037 is randomly obtained from the preset number of sample sets κ. It comprises the user u, the item i, the true score r_ui given by user u to item i and obtained from the preset scoring matrix, and the prediction score r̂_ui of user u for item i calculated according to formula (6), so that the loss function value can be calculated from the prediction score and the true score.

Sub-step 5034, determining whether the user set U includes the user u in the sample set and whether the item set I includes the item i in the sample set; if so, returning to sub-step 5033; if not, proceeding to sub-step 5035.

In particular, a determination is made as to whether the sample set selected in sub-step 5033 has been processed, it being understood that the users and items in the processed sample set should already exist in user set U and item set I. For the already processed sample set, the processing is not performed in the current iteration, and the substep 5033 is executed to randomly select a sample set again.

This is because the number of sample sets in the preset number of sample sets κ is large, so the conventional stochastic gradient descent algorithm may incur excessive time complexity. To complete the minimization of the loss function in a reasonable time, the embodiment of the present invention uses a modified stochastic gradient descent algorithm: each sample set in the preset number of sample sets κ is processed only once in each iteration, and a sample set that has already been processed in the current iteration is skipped when it is selected again.

In this way, the modified stochastic gradient descent algorithm can significantly reduce the time complexity of minimizing the loss function, so that the minimization can be completed in a reasonable time, while also helping to prevent overfitting.

Sub-step 5035, based on the difference e_ui between the score r_ui in the sample set and the prediction score r̂_ui, updating the personalized parameter α_u, the user bias adjustment parameter b_u and the item bias adjustment parameter b_i, putting the user u in the sample set into the user set U, and putting the item i in the sample set into the item set I.

In this step, when the user set U does not include the user u in the sample set and the item set I does not include the item i in the sample set, the sample set may be processed.

Specifically, the difference e_ui between the score r_ui in the sample set and the prediction score r̂_ui is first calculated, as shown in formula (8):

e_ui=r_ui-r̂_ui (8)

In formula (8), r̂_ui is the calculated prediction score of user u for item i; r_ui is the score given by user u to item i, obtained from the preset scoring matrix; and e_ui is the difference between the score r_ui and the prediction score r̂_ui.

Then, the personalized parameter α_u, the user bias adjustment parameter b_u and the item bias adjustment parameter b_i are updated according to the following formulas (9) to (11), respectively.

In formula (9), γ is the step size; e_ui is the difference between the score r_ui and the prediction score r̂_ui; r_uj is the prediction score of user u for item j; s_ij is the scaled Pearson correlation coefficient of item i and item j; ρ_ij is the Pearson correlation coefficient of item i and item j; ε_ij is the Euclidean distance between the tag genome vector of item i and the tag genome vector of item j; α_u is the personalized parameter for user u; λ₂ is a preset regularization parameter; and κ_u is the set of items, other than item i, that user u has scored.

In a specific implementation, the update formula for the personalized parameter α_u shown in formula (9) is obtained by taking the derivative with respect to the personalized parameter α_u of formula (6); for the detailed derivation, reference may be made to the prior art, and details are not repeated here.

b_u←b_u+γ·(e_ui-λ₂·b_u) (10)

In formula (10), b_u is the user bias adjustment parameter for user u; γ is the step size; e_ui is the difference between the score r_ui and the prediction score r̂_ui; and λ₂ is a preset regularization parameter.

In a specific implementation, the update formula for the user bias adjustment parameter b_u shown in formula (10) is obtained by taking the derivative with respect to the user bias adjustment parameter b_u of formula (6); for the detailed derivation, reference may be made to the prior art, and details are not repeated here.

b_i←b_i+γ·(e_ui-λ₂·b_i) (11)

In formula (11), b_i is the item bias adjustment parameter for item i; γ is the step size; e_ui is the difference between the score r_ui and the prediction score r̂_ui; and λ₂ is a preset regularization parameter.

In a specific implementation, the update formula for the item bias adjustment parameter b_i shown in formula (11) is obtained by taking the derivative with respect to the item bias adjustment parameter b_i of formula (6); for the detailed derivation, reference may be made to the prior art, and details are not repeated here. The left-pointing arrow in formulas (9) to (11) means that the value of the expression on the right side of the arrow replaces the value of the parameter on the left side of the arrow.
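The two bias updates in formulas (10) and (11) can be written directly as code. The sketch below covers only those two formulas; the update of α_u in formula (9) is left out because that formula is not reproduced in this text. The function name `update_biases` is hypothetical.

```python
def update_biases(b_u, b_i, e_ui, gamma, lam2):
    """Apply formulas (10) and (11) once and return the new (b_u, b_i).

    e_ui is the true score minus the prediction score, gamma is the step
    size and lam2 is the regularization parameter.
    """
    new_b_u = b_u + gamma * (e_ui - lam2 * b_u)  # formula (10)
    new_b_i = b_i + gamma * (e_ui - lam2 * b_i)  # formula (11)
    return new_b_u, new_b_i


# A positive error (the true score is above the prediction) raises both biases.
print(update_biases(0.0, 0.0, e_ui=1.0, gamma=0.04, lam2=0.1))
```

Both new values are computed from the same e_ui, so the order of the two updates does not matter.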

Substep 5036, reducing the step size γ, and returning to execute substep 5033;

sub-step 5037, when all sample sets in the preset number of sample sets κ are traversed, the number of iterations is increased by one, and sub-step 5038 is performed.

Specifically, when all sample sets in the preset number of sample sets κ are traversed, it is indicated that the iteration process is completed, and the iteration number may be increased by one; and a smaller step size is used in the next iteration process to ensure convergence and prevent oscillation.

Sub-step 5038, judging whether the number of iterations exceeds a preset iteration threshold; if so, proceeding to sub-step 5039; otherwise, returning to sub-step 5032.

Specifically, when the number of iterations exceeds the preset iteration threshold, the minimization of the preset loss function is completed by performing sub-step 5039; when the number of iterations does not exceed the preset iteration threshold, sub-step 5032 is executed to start the next iteration, and this repeats until the number of iterations exceeds the preset iteration threshold.

Sub-step 5039, the minimization process of the preset loss function is completed.

Specifically, when the number of iterations exceeds a preset number of iterations threshold, it is indicated that the minimization process for the preset loss function has been completed.
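Sub-steps 5031 to 5039 can be sketched as the loop below, under several stated assumptions. Because formula (6) is not reproduced in this text, the sketch uses the simplified stand-in predictor μ + b_u + b_i and updates only b_u and b_i with formulas (10) and (11); the α_u update of formula (9) is omitted. The multiplicative step-size `decay` and the use of a shuffled pass to visit each sample set once per iteration are also assumptions, and all function and parameter names are hypothetical.

```python
import random


def minimize_bias_loss(sample_sets, mu, iteration_threshold=20,
                       gamma=0.04, lam2=0.1, decay=0.99, seed=0):
    """Sketch of the modified stochastic gradient descent loop (bias terms only)."""
    rng = random.Random(seed)
    # Sub-step 5031: initialize the parameters and the iteration count.
    b_user = {u: 0.0 for u, _, _ in sample_sets}
    b_item = {i: 0.0 for _, i, _ in sample_sets}
    iterations = 0
    while True:
        # Sub-step 5032: empty the user set U and the item set I.
        seen_users, seen_items = set(), set()
        # Sub-step 5033: visit the sample sets in random order.
        order = list(sample_sets)
        rng.shuffle(order)
        for u, i, r_ui in order:
            # Sub-step 5034: skip when both u and i are already recorded.
            if u in seen_users and i in seen_items:
                continue
            # Sub-step 5035: formula (8) with the stand-in predictor,
            # then formulas (10) and (11).
            e_ui = r_ui - (mu + b_user[u] + b_item[i])
            b_user[u] += gamma * (e_ui - lam2 * b_user[u])
            b_item[i] += gamma * (e_ui - lam2 * b_item[i])
            seen_users.add(u)
            seen_items.add(i)
            # Sub-step 5036: reduce the step size (assumed decay factor).
            gamma *= decay
        # Sub-step 5037: all sample sets traversed, count one iteration.
        iterations += 1
        # Sub-steps 5038 and 5039: stop once the threshold is exceeded.
        if iterations > iteration_threshold:
            return b_user, b_item


b_user, b_item = minimize_bias_loss([(0, 0, 5.0), (1, 0, 4.0), (1, 1, 2.0)], mu=3.0)
print(b_user, b_item)
```

Note that the check in sub-step 5034 is made against the user set and the item set rather than against individual sample sets, so a sample set whose user and item were each recorded through other sample sets is also skipped in that iteration.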

And step 504, determining the optimal values of the personalized parameters, the user deviation adjustment parameters and the article deviation adjustment parameters from the minimized loss function.

In this step, the optimal values of the personalized parameter, the user bias adjustment parameter and the item bias adjustment parameter are determined from the minimized loss function obtained by the minimization, and these optimal values are substituted into the calculation formula of the second prediction score shown in formula (6), so as to obtain the optimized calculation formula of the second prediction score.

It can be seen that, in the present embodiment, the optimal values of the user bias adjustment parameter b_u, the item bias adjustment parameter b_i and the personalized parameter α_u in the calculation formula of the second prediction score may be obtained by minimizing the preset loss function. This yields the optimized calculation formula of the second prediction score, so that the prediction score calculated with it is more accurate and objective.

The embodiment of the present invention further provides a collaborative filtering recommendation device based on clustering, referring to fig. 7, where fig. 7 is a schematic structural diagram of the collaborative filtering recommendation device based on clustering according to the embodiment of the present invention, and the device includes:

a first obtaining module 701, configured to obtain a tag genome vector of a first item from a preset tag genome information matrix, where the tag genome vector is used to describe an inherent attribute of the first item;

a dividing module 702, configured to divide the first items into a preset first number of cluster classes based on the tag genome vectors of the first items by using a preset clustering algorithm;

a first calculation module 703, configured to, for each target item: when the target item and a second item belong to the same cluster class, calculate a correlation coefficient of the target item and the second item based on a preset type of distance between the tag genome vector of the target item and the tag genome vector of the second item, where the target item is an item among the first items that has not been scored by the target user, and the second item is an item among the first items other than all target items; when the target item and the second item belong to different cluster classes, calculate the correlation coefficient of the target item and the second item based on the Pearson correlation coefficient of the target item and the second item; and perform weighted summation on the preset score of the target user for the second item and the correlation coefficient of the target item and the second item to obtain the prediction score of the target user for the target item;

and a recommending module 704, configured to recommend, to the target user, the target item whose prediction score meets the preset condition.

Optionally, the first calculating module 703 is specifically configured to:

calculate the Pearson correlation coefficient of the target item and the second item; and

generate a scaled Pearson correlation coefficient of the target item and the second item based on the number of common users of the target item and the second item and the calculated Pearson correlation coefficient, and take the scaled Pearson correlation coefficient as the correlation coefficient of the target item and the second item.

Optionally, the first calculating module 703 is specifically configured to:

generate the scaled Pearson correlation coefficient of the target item and the second item, based on the number of common users of the target item and the second item and the calculated Pearson correlation coefficient, according to the following formula;

in the formula, the target item is item i; the second item is item j; s_ij is the scaled Pearson correlation coefficient of item i and item j; ρ_ij is the Pearson correlation coefficient of item i and item j; n_ij is the number of common users of item i and item j; and λ₁ is a parameter for the number of common users.
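The scaling formula is not reproduced in this text. As a hedged illustration, the sketch below computes the Pearson correlation coefficient ρ_ij over the scores of the common users and then applies the shrinkage s_ij = n_ij / (n_ij + λ₁) · ρ_ij, a form often used for this purpose. Both the shrinkage form and the choice of taking means over common users only are assumptions, and the function name `scaled_pearson` is hypothetical.

```python
from math import sqrt


def scaled_pearson(scores_i, scores_j, lam1):
    """Scaled Pearson correlation of two items (assumed shrinkage form).

    scores_i, scores_j: dicts mapping a user to the score that user gave
    item i and item j respectively. lam1 is the parameter lambda_1.
    """
    common = [u for u in scores_i if u in scores_j]  # users who scored both
    n_ij = len(common)
    if n_ij < 2:
        return 0.0  # correlation is undefined for fewer than two users
    x = [scores_i[u] for u in common]
    y = [scores_j[u] for u in common]
    mean_x = sum(x) / n_ij
    mean_y = sum(y) / n_ij
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant scores carry no correlation information
    rho_ij = cov / sqrt(var_x * var_y)
    return n_ij / (n_ij + lam1) * rho_ij


print(scaled_pearson({1: 1.0, 2: 2.0, 3: 3.0}, {1: 2.0, 2: 4.0, 3: 6.0, 4: 1.0}, lam1=3.0))
```

With this form, a pair of items with few common users has its correlation pulled toward zero, so a high correlation supported by little evidence carries less weight.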

Optionally, the preset type of distance includes a Euclidean distance;

the first calculation module 703 is specifically configured to:

perform, according to the following formula, weighted summation on the preset score of the target user for the second item and the correlation coefficient of the target item and the second item, to obtain the prediction score of the target user for the target item;

in the formula, the target item is item i; the second item is item j; the target user is user u;

Optionally, the first calculating module 703 is specifically configured to:

perform weighted summation on the preset score of the target user for the second item and the correlation coefficient of the target item and the second item; and

adjust the result of the weighted summation by using the personalized parameter, the user bias adjustment parameter and the item bias adjustment parameter, to obtain the prediction score of the target user for the target item.

Optionally, the first calculating module 703 is specifically configured to:

perform, by using the following formula, weighted summation on the preset score of the target user for the second item and the correlation coefficient of the target item and the second item, and adjust the result of the weighted summation by using the personalized parameter, the user bias adjustment parameter and the item bias adjustment parameter, to obtain the prediction score of the target user for the target item;

in the formula, the target item is item i; the second item is item j; the target user is user u; μ is the mean score in the preset scoring matrix; b_u is the user bias adjustment parameter for user u; b_i is the item bias adjustment parameter for item i; α_u is the personalized parameter for user u;

Optionally, the apparatus further comprises:

the second obtaining unit is used for obtaining a preset number of sample sets from a preset scoring matrix, and each sample set respectively comprises a user, an article and a score given to the article by the user;

the second calculation module is used for regarding each sample set, taking the user in the sample set as a target user, and taking the article in the sample set as a target article; calculating the prediction score of the target user on the target item by using a calculation formula of a second prediction score, and taking the calculated prediction score as the prediction score corresponding to the sample set;

the minimizing module is used for performing minimizing processing on a preset loss function based on each sample set and the prediction scores corresponding to the sample sets to obtain a minimized loss function, and the preset loss function is shown as the following formula;

in the preset loss function, κ is the preset number of sample sets; item i is the item in each sample set; user u is the user in each sample set; b_u is the user bias adjustment parameter for user u; b_i is the item bias adjustment parameter for item i; α_u is the personalized parameter for user u; r̂_ui is the calculated prediction score of user u for item i; r_ui is the score given by user u to item i, obtained from the preset scoring matrix; λ₂ is a preset regularization parameter;

and the determining module is used for determining the optimal values of the personalized parameters, the user deviation adjusting parameters and the article deviation adjusting parameters from the minimized loss function.

Optionally, the minimization module is specifically configured to:

for each sample set, initialize the personalized parameter α_u, the user bias adjustment parameter b_u and the item bias adjustment parameter b_i, and initialize the step size γ and the number of iterations;

set both the user set U and the item set I to empty sets;

obtain one sample set from the preset number of sample sets κ, where the sample set comprises a user u, an item i, the score r_ui given by user u to item i and obtained from the preset scoring matrix, and the calculated prediction score r̂_ui of user u for item i;

judge whether the user u of the sample set is contained in the user set U and whether the item i of the sample set is contained in the item set I;

if yes, return to the step of obtaining one sample set from the preset number of sample sets κ;

if not, update the personalized parameter α_u, the user bias adjustment parameter b_u and the item bias adjustment parameter b_i based on the difference e_ui between the score r_ui in the sample set and the prediction score r̂_ui, put the user u in the sample set into the user set U, and put the item i in the sample set into the item set I;

reduce the step size γ, and return to the step of obtaining one sample set from the preset number of sample sets κ;

when all sample sets in the preset number of sample sets κ have been traversed, increase the number of iterations by one;

judge whether the number of iterations exceeds a preset iteration threshold; if yes, complete the minimization of the preset loss function; if not, return to the step of setting both the user set U and the item set I to empty sets.

As can be seen, in the clustering-based collaborative filtering recommendation device proposed in the embodiment of the present invention, after the first items are classified according to their tag genome vectors, the correlation coefficient for items belonging to the same cluster class may be calculated based on the Euclidean distance between their tag genome vectors. A tag genome vector describes the inherent attributes of an item and does not change with the subjective will of users, so it is objective. The correlation coefficient calculated from it is therefore also objective, the resulting prediction score is more objective, and the problem of the prediction score being skewed by users' subjective scores is avoided.

An embodiment of the present invention further provides an electronic device. Referring to fig. 8, fig. 8 is a schematic structural diagram of the electronic device provided in the embodiment of the present invention. As shown in fig. 8, the electronic device comprises a processor 81, a communication interface 82, a memory 83 and a communication bus 84, wherein the processor 81, the communication interface 82 and the memory 83 communicate with each other through the communication bus 84,

a memory 83 for storing a computer program;

the processor 81 is configured to implement the following steps when executing the program stored in the memory 83:

The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is used in the figure, but this does not mean that there is only one bus or only one type of bus.

The communication interface is used for communication between the electronic equipment and other equipment.

The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.

The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

The method provided by the embodiment of the invention can be applied to electronic equipment. Specifically, the electronic device may be: desktop computers, laptop computers, intelligent mobile terminals, servers, and the like. Without limitation, any electronic device that can implement the present invention is within the scope of the present invention.

An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored in the storage medium, and when the computer program is executed by a processor, the steps of the cluster-based collaborative filtering recommendation method are implemented.

Embodiments of the present invention further provide a computer program product containing instructions that, when executed on a computer, cause the computer to perform the steps of the above-mentioned cluster-based collaborative filtering recommendation method.

An embodiment of the present invention further provides a computer program, which when running on a computer, causes the computer to execute the steps of the above-mentioned cluster-based collaborative filtering recommendation method.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus/electronic device/storage medium/computer program product/computer program embodiment comprising instructions, the description is relatively simple as it is substantially similar to the method embodiment, and reference may be made to some descriptions of the method embodiment for relevant points.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A collaborative filtering recommendation method based on clustering is characterized by comprising the following steps:

2. The method of claim 1, wherein the step of calculating the correlation coefficient of the target item and the second item based on the Pearson correlation coefficient of the target item and the second item comprises:

generating a scaled Pearson correlation coefficient of the target item and the second item based on the number of common users of the target item and the second item and the calculated Pearson correlation coefficient, and taking the scaled Pearson correlation coefficient of the target item and the second item as the correlation coefficient of the target item and the second item, wherein the common users are users who have scored both of the different items;

the step of generating the scaled Pearson correlation coefficient of the target item and the second item based on the number of common users of the target item and the second item and the Pearson correlation coefficient comprises:

3. The method of claim 2, wherein the preset type of distance comprises a euclidean distance;

the step of performing weighted summation on the preset prediction score of the target user on the second item and the correlation coefficient of the target item and the second item to obtain the prediction score of the target user on the target item comprises the following steps:

according to the following formula, weighting and summing the preset prediction score of the target user on the second object and the correlation coefficient of the target object and the second object to obtain the prediction score of the target user on the target object;

in the formula, the target item is item i; the second item is item j; the target user is user u;

a predicted score for user u for item i; r_uj is the prediction score of user u for item j; s_ij is the scaled Poisson correlation coefficient of item i and item j; ε_ij is the Euclidean distance between the tag genome vector of item i and the tag genome vector of item j; κ_u is the set of items, other than item i, that user u has scored; p_ij is an adjustment factor;

a set of items belonging to the same cluster class as item i in the case where the first item is divided into k cluster classes; α_u is a personalization parameter for user u.
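The formula of this claim is not reproduced in the text, so the following Python code is only a sketch of the general idea: the correlation coefficient is taken from the Euclidean distance of the tag genome vectors when items i and j share a cluster class, and from the scaled correlation coefficient otherwise, and the user's scores are then combined as a normalized weighted sum. The mapping 1 / (1 + ε_ij) from distance to weight, the normalization, and all function and variable names are assumptions; the adjustment factor p_ij and the personalization parameter α_u are left out because their exact role cannot be recovered here.

```python
import math


def euclidean_distance(vec_a, vec_b):
    """Euclidean distance between two equally long tag genome vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))


def correlation_coefficient(i, j, cluster_of, tag_vectors, scaled_corr):
    """Pick the coefficient according to cluster membership."""
    if cluster_of[i] == cluster_of[j]:
        eps_ij = euclidean_distance(tag_vectors[i], tag_vectors[j])
        return 1.0 / (1.0 + eps_ij)  # assumed distance-to-weight mapping
    # Different cluster classes: look up the scaled coefficient s_ij.
    return scaled_corr.get((i, j), scaled_corr.get((j, i), 0.0))


def predict_score(i, user_scores, cluster_of, tag_vectors, scaled_corr):
    """Weighted sum of the user's scores r_uj over items j other than i."""
    numerator = 0.0
    denominator = 0.0
    for j, r_uj in user_scores.items():
        if j == i:
            continue
        weight = correlation_coefficient(i, j, cluster_of, tag_vectors, scaled_corr)
        numerator += weight * r_uj
        denominator += abs(weight)
    if denominator == 0.0:
        return None  # no usable neighbor for this user and item
    return numerator / denominator
```

Dividing by the sum of absolute weights keeps the result on the same scale as the input scores; the claimed formula may normalize differently.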

4. The method according to claim 3, wherein the step of performing weighted summation on the preset prediction score of the target user for the second item and the correlation coefficient of the target item and the second item to obtain the prediction score of the target user for the target item comprises:

weighting and summing the preset prediction score of the target user for the second item and the correlation coefficient of the target item and the second item;

5. The method according to claim 4, wherein the step of weighting and summing the preset prediction score of the target user for the second item and the correlation coefficient of the target item and the second item, and adjusting the result of the weighted summation using the personalization parameter, the user deviation adjustment parameter and the item deviation adjustment parameter to obtain the prediction score of the target user for the target item, comprises:

weighting and summing, by using the following formula, the preset prediction score of the target user for the second item and the correlation coefficient of the target item and the second item, and adjusting the result of the weighted summation using the personalization parameter, the user deviation adjustment parameter and the item deviation adjustment parameter to obtain the prediction score of the target user for the target item;

a set of items belonging to the same cluster class as item i in the case where the first item is divided into k cluster classes; b_uj = μ + b_u + b_i.
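The baseline term above is a plain sum of an overall value μ, a user deviation and an item deviation. The following Python code is a small illustrative sketch of that sum. It assumes that μ is the mean of all known scores and that unknown users or items contribute a deviation of zero; neither assumption is stated in the claim, and the function names are hypothetical.

```python
def global_mean(ratings):
    """Mean of all known scores; ratings is a list of (user, item, score)."""
    return sum(score for _, _, score in ratings) / len(ratings)


def baseline(mu, user_bias, item_bias, user, item):
    """mu + b_u + b_i, with missing deviations treated as zero (assumption)."""
    return mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)


ratings = [("u1", "a", 4.0), ("u1", "b", 2.0), ("u2", "a", 3.0)]
mu = global_mean(ratings)
estimate = baseline(mu, {"u1": 0.5}, {"a": -0.25}, "u1", "a")
```

Here a user who tends to score half a point above average and an item that tends to be scored a quarter point below average shift the overall mean by the sum of both deviations.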

6. The method of claim 5, wherein the following formula is used,

a set of items belonging to the same cluster class as item i in the case where the first item is divided into k cluster classes; b_uj = μ + b_u + b_i;

before the step of performing weighted summation on the preset prediction score of the target user for the second item and the correlation coefficient of the target item and the second item, the method further comprises:

acquiring a preset number of sample sets from a preset scoring matrix, wherein each sample set comprises a user, an item and the score given to the item by the user;

for each sample set, taking the user in the sample set as the target user and the item in the sample set as the target item; calculating the prediction score of the target user for the target item, and taking the calculated prediction score as the prediction score corresponding to the sample set;

based on each sample set and the prediction scores corresponding to each sample set, performing minimization processing on a preset loss function to obtain a minimized loss function, wherein the preset loss function is shown in the following formula;

and determining the optimal values of the personalization parameter, the user deviation adjustment parameter and the item deviation adjustment parameter from the minimized loss function.
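The loss function is not reproduced in this text. As a hedged illustration of the steps above, the following Python code assumes a squared-error loss with L2 regularization and an additive prediction of the form μ + b_u + b_i + α_u · n_ui, where n_ui stands for a precomputed weighted-sum term. The model form, the regularization weight `lam`, the default values and all names are assumptions, and stochastic gradient descent is used here as one possible minimization procedure.

```python
def sgd_fit(samples, mu, gamma=0.05, lam=0.01, iterations=200):
    """Fit alpha_u, b_u and b_i by stochastic gradient descent.

    samples: list of (user, item, score, neighborhood_term) tuples,
    where neighborhood_term is an assumed precomputed weighted sum.
    """
    alpha, b_user, b_item = {}, {}, {}
    for _ in range(iterations):
        for u, i, r_ui, n_ui in samples:
            a = alpha.setdefault(u, 1.0)
            bu = b_user.setdefault(u, 0.0)
            bi = b_item.setdefault(i, 0.0)
            err = r_ui - (mu + bu + bi + a * n_ui)
            # Gradient steps on the regularized squared error.
            alpha[u] = a + gamma * (err * n_ui - lam * a)
            b_user[u] = bu + gamma * (err - lam * bu)
            b_item[i] = bi + gamma * (err - lam * bi)
    return alpha, b_user, b_item


def squared_error(samples, mu, alpha, b_user, b_item):
    """Sum of squared prediction errors over the sample sets."""
    total = 0.0
    for u, i, r_ui, n_ui in samples:
        pred = (mu + b_user.get(u, 0.0) + b_item.get(i, 0.0)
                + alpha.get(u, 1.0) * n_ui)
        total += (r_ui - pred) ** 2
    return total
```

Comparing `squared_error` before and after `sgd_fit` on the same sample sets gives a quick check that the chosen step size is small enough for the loss to decrease.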

7. The method according to claim 6, wherein the step of minimizing the preset loss function based on each sample set and the prediction score corresponding to each sample set to obtain a minimized loss function comprises:

for each sample set, setting the personalization parameter α_u, the user deviation adjustment parameter b_u and the item deviation adjustment parameter b_i, and initializing the step size γ and the number of iterations;

setting both the user set U and the item set I to empty sets;

if not, determining the score r_ui in the sample set and the prediction score

8. A collaborative filtering recommendation device based on clustering, comprising:

9. The apparatus of claim 8,

the first computing module is specifically configured to

the first computing module is further specifically configured to: