CN113269609A

CN113269609A - User similarity calculation method, calculation system, device and storage medium

Info

Publication number: CN113269609A
Application number: CN202110570380.2A
Authority: CN
Inventors: 霍慧
Original assignee: China United Network Communications Group Co Ltd
Current assignee: China United Network Communications Group Co Ltd
Priority date: 2021-05-25
Filing date: 2021-05-25
Publication date: 2021-08-17

Abstract

The present disclosure provides a user similarity calculation method, a calculation system, a computer device, and a storage medium, the method including: acquiring a user-commodity scoring matrix; correcting the scores in the user-commodity scoring matrix based on a preset time weight to obtain a new user-commodity scoring matrix; calculating, for the new user-commodity scoring matrix, the score difference of any two users for each commonly scored commodity; classifying the score differences and calculating the frequency of the score differences of each category; calculating the improved information entropy of all category score differences; and calculating the similarity between any two users in the new user-commodity scoring matrix according to the information entropy and a preset similarity calculation method. The technical scheme of the disclosure enables the scores to reflect user preferences more truly; meanwhile, introducing the information entropy alleviates the data sparseness problem, so that the similarity calculation result is closer to the actual situation and commodity recommendation is more accurate.

Description

User similarity calculation method, calculation system, device and storage medium

Technical Field

The disclosure belongs to the technical field of electronic commerce, and particularly relates to a user similarity calculation method, a user similarity calculation system, a computer device and a computer readable storage medium.

Background

The Collaborative Filtering (CF) algorithm is a representative algorithm in recommendation systems and is widely applied on major e-commerce platforms. Collaborative filtering mainly comprises the user-based collaborative filtering (User-CF) algorithm and the commodity-based collaborative filtering (Item-CF) algorithm. As shown in fig. 1, the key of the User-CF algorithm is to find users similar to a target user, aggregate the commodities preferred by those similar users, and recommend them to the target user. The method comprises the following three steps: 1. acquiring user-commodity scoring information; 2. calculating user similarity according to the user-commodity scoring information, sorting the users by similarity, and taking the top N most similar users as the neighbor user set; 3. predicting, according to the scores given to commodities by the neighbor user set, the scores of commodities unknown to the target user, and recommending the commodities with the highest predicted scores to the user.
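By way of non-limiting illustration, steps 2 and 3 above can be sketched in Python as follows. The sketch assumes the pairwise similarities have already been computed by some method and uses a similarity-weighted average for the prediction, which is one common choice rather than something prescribed here; the names `top_n_neighbors`, `predict_score`, `ratings` and `sims` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical data layout: user -> {commodity -> score}
Ratings = Dict[str, Dict[str, float]]


def top_n_neighbors(target: str,
                    sims: Dict[Tuple[str, str], float],
                    n: int) -> List[Tuple[str, float]]:
    """Step 2: rank the other users by similarity to `target`, keep the top n."""
    scored = [(v, s) for (u, v), s in sims.items() if u == target]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


def predict_score(item: str,
                  neighbors: List[Tuple[str, float]],
                  ratings: Ratings) -> float:
    """Step 3: similarity-weighted average of the neighbors' scores for `item`.

    Returns 0.0 when no neighbor has scored the item (an assumed fallback).
    """
    num = 0.0
    den = 0.0
    for user, sim in neighbors:
        if item in ratings[user]:
            num += sim * ratings[user][item]
            den += abs(sim)
    return num / den if den else 0.0


ratings = {
    "u": {"a": 5.0, "b": 3.0},
    "v": {"a": 4.0, "b": 3.0, "c": 4.0},
    "w": {"a": 1.0, "c": 2.0},
}
# Assumed, precomputed similarities of the target user "u" to the other users.
sims = {("u", "v"): 0.9, ("u", "w"): 0.1}

neighbors = top_n_neighbors("u", sims, n=2)
prediction = predict_score("c", neighbors, ratings)  # predicted score of "u" for "c"
```

In this toy data the prediction for commodity "c" is dominated by user "v", the more similar neighbor, which is why the accuracy of the similarity values matters so much for the final recommendation.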

It can be seen that user similarity calculation is the key to the User-CF algorithm. The user similarity calculation is completed based on a user-commodity scoring matrix, and the strategies that can be used include cosine similarity, modified cosine similarity, the Pearson correlation coefficient, Jaccard similarity and the like.
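As a minimal sketch of two of the conventional strategies named above, the following Python code computes cosine similarity and the Pearson correlation coefficient between two users. Restricting both measures to the commonly scored commodities, and returning 0.0 when there is too little overlap, are assumptions of this sketch; the function names are hypothetical.

```python
import math
from typing import Dict

Scores = Dict[str, float]  # commodity -> score for one user


def cosine_similarity(ru: Scores, rv: Scores) -> float:
    """Cosine of the angle between the two users' scores on common commodities."""
    items = set(ru) & set(rv)
    dot = sum(ru[i] * rv[i] for i in items)
    norm_u = math.sqrt(sum(ru[i] ** 2 for i in items))
    norm_v = math.sqrt(sum(rv[i] ** 2 for i in items))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pearson_correlation(ru: Scores, rv: Scores) -> float:
    """Pearson correlation of the two users' scores on common commodities."""
    items = sorted(set(ru) & set(rv))
    if len(items) < 2:
        return 0.0  # not enough common scores to measure correlation
    mean_u = sum(ru[i] for i in items) / len(items)
    mean_v = sum(rv[i] for i in items) / len(items)
    cov = sum((ru[i] - mean_u) * (rv[i] - mean_v) for i in items)
    std_u = math.sqrt(sum((ru[i] - mean_u) ** 2 for i in items))
    std_v = math.sqrt(sum((rv[i] - mean_v) ** 2 for i in items))
    return cov / (std_u * std_v) if std_u and std_v else 0.0


u = {"a": 1.0, "b": 2.0, "c": 3.0}
v = {"a": 2.0, "b": 4.0, "c": 6.0, "d": 5.0}
new_user = {"z": 5.0}  # shares no scored commodity with u

sim_cos = cosine_similarity(u, v)
sim_pearson = pearson_correlation(u, v)
sim_sparse = cosine_similarity(u, new_user)
```

The last call illustrates the limitation of these measures: when two users share no scored commodity, nothing can be computed from the scores and the sketch can only fall back to 0.0.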

Because the existing user similarity calculation is completed based on the user-commodity scoring matrix, it requires the data set to contain sufficient user behavior information. When a user has little historical behavior, or a new user has no historical behavior information at all, there is not enough common commodity scoring information among users; that is, the user-commodity scoring matrix is sparse, the similarity calculation among users is inaccurate, and highly accurate recommendation is difficult to make. Moreover, the existing collaborative filtering algorithm treats all commodities accessed by the user equally and does not fully consider the contribution of recently accessed commodities to the measurement of user interest, so the recommendation reliability and recommendation precision of the recommendation system are not high.

Disclosure of Invention

The present disclosure provides a method, a system, a computer device and a storage medium for calculating user similarity, so that a score can reflect user preferences more truly; and the problem of data sparseness is relieved, the similarity calculation result is more in line with the actual situation, and the commodity recommendation is more accurate.

In a first aspect, an embodiment of the present disclosure provides a method for calculating user similarity, including:

acquiring a user-commodity scoring matrix;

modifying the scores in the user-commodity scoring matrix based on the preset time weight to obtain a new user-commodity scoring matrix;

calculating the scoring difference value of any two users for each common scoring commodity aiming at the new user-commodity scoring matrix;

classifying the score difference values and calculating the frequency of the score difference values of each category;

calculating the improved information entropy of all the category score difference values according to the frequency of each category score difference value;

and calculating the similarity between any two users in the new user-commodity scoring matrix according to the information entropy and a preset similarity calculation method.

Further, the score in the user-commodity scoring matrix is corrected based on the preset time weight, and the score is obtained by adopting the following formula:

in formulae (1) and (2), t(u_i) and t(v_i) respectively represent the scoring time of user u and user v for commodity i; w_t(u_i) and w_t(v_i) are the preset time weight calculation formulas for user u and user v respectively; t(0) represents the earliest scoring time at which user u and user v scored commodities; α represents a time decay parameter and reflects the speed at which user interest changes; T represents a time window; u_i and v_i respectively represent the scores of user u and user v for commodity i; u′_i and v′_i respectively represent the corrected scores of user u and user v for commodity i; and i = 1, …, n.

Further, the scoring difference of any two users for each common scoring commodity is calculated for the new user-commodity scoring matrix, and the scoring difference is obtained by adopting the following formula:

dif(u′,v′)＝(u₁′-v₁′,…,u_i′-v_i′,…,u_n′-v_n′)＝(d₁,…,d_i,…,d_n) (3)

in formula (3), dif(u′,v′) represents the score differences of user u and user v over their commonly scored commodities; d₁,…,d_i,…,d_n respectively represent the score differences of user u and user v for commonly scored commodity 1, …, commodity i, …, and commodity n.
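Formula (3) can be illustrated by the following short Python sketch, which takes the corrected scores of two users and returns the signed differences over the commodities both have scored. The dictionary layout and the name `score_differences` are hypothetical, and the commodities are ordered alphabetically only to make the output deterministic.

```python
from typing import Dict, List


def score_differences(u_new: Dict[str, float], v_new: Dict[str, float]) -> List[float]:
    """dif(u', v'): element-wise u'_i - v'_i over the commonly scored commodities."""
    common = sorted(set(u_new) & set(v_new))
    return [u_new[i] - v_new[i] for i in common]


# Corrected scores u'_i and v'_i; only "a" and "b" are scored by both users.
u_new = {"a": 4.0, "b": 2.0, "c": 5.0}
v_new = {"a": 3.0, "b": 4.0, "d": 1.0}

diffs = score_differences(u_new, v_new)  # [1.0, -2.0]
```

Commodities scored by only one of the two users ("c" and "d" here) do not contribute to the difference vector.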

Further, the frequency of the difference value of each category score is calculated by the following formula:

fre(dif(u′,v′))＝(p₁,p₂,…,p_j,…,p_k) (4)

in formula (4), fre(dif(u′,v′)) represents the frequency of each category of score difference after the score differences of user u and user v over their commonly scored commodities are divided into k categories; dif(u′,v′) represents the score differences of user u and user v for each commonly scored commodity; k represents the number of categories into which the score differences are divided; and p_j represents the probability with which a score difference of the j-th category occurs.

Further, the improved information entropy of all the category score difference values is calculated by adopting the following formula:

in formula (5), H′(fre(dif(u′,v′))) represents the improved information entropy of the score differences of all categories after the score differences of user u and user v over their commonly scored commodities are divided into k categories;

the expression in formula (5) is the improved information entropy calculation formula, wherein d(p_j) represents the score difference whose distribution probability is p_j.

Further, the similarity between any two users in the new user-commodity scoring matrix is calculated according to the information entropy and a preset similarity calculation method, and is obtained by adopting the following formula:

in formula (6), sim(u′,v′) represents the similarity between user u and user v; I_u and I_v respectively represent the sets of commodities scored by user u and user v;

|I_u∩I_v|/|I_u∪I_v| is the Jaccard similarity calculation formula.

In a second aspect, an embodiment of the present disclosure provides a system for calculating user similarity, including:

an acquisition module configured to acquire a user-commodity scoring matrix;

the score correction module is set to correct scores in the user-commodity score matrix based on a preset time weight to obtain a new user-commodity score matrix;

the first calculation module is configured to calculate, for the new user-commodity scoring matrix, the score difference of any two users for each commonly scored commodity, and to classify the score differences and calculate the frequency of the score differences of each category;

a second calculation module configured to calculate the improved information entropy of all category score differences according to the frequency of each category score difference, and to calculate the similarity between any two users in the new user-commodity scoring matrix according to the information entropy and a preset similarity calculation method.

Further, the score correction module is specifically configured to:

correct the scores in the user-commodity scoring matrix by using formula (1) and formula (2).

In a third aspect, an embodiment of the present disclosure further provides a computer device, including a memory and a processor, where the memory stores a computer program, and when the processor runs the computer program stored in the memory, the processor executes the method for calculating the user similarity according to any one of the first aspect.

In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium including a computer program which, when run on a computer, causes the computer to perform the method for calculating user similarity according to any one of the first aspect.

Beneficial effects:

According to the user similarity calculation method, calculation system, computer device and storage medium provided by the disclosure, a user-commodity scoring matrix is acquired; the scores in the user-commodity scoring matrix are corrected based on the preset time weight to obtain a new user-commodity scoring matrix; for the new user-commodity scoring matrix, the score difference of any two users for each commonly scored commodity is calculated; the score differences are classified and the frequency of the score differences of each category is calculated; the improved information entropy of all category score differences is calculated according to the frequency of each category score difference; and the similarity between any two users in the new user-commodity scoring matrix is calculated according to the information entropy and a preset similarity calculation method. The technical scheme considers the influence of time on user interest and introduces a time weight to correct the user scores, so that the scores reflect user preferences more truly; meanwhile, information entropy is introduced into the user similarity calculation, which alleviates the data sparseness problem, brings the similarity calculation result closer to the actual situation, and makes commodity recommendation more accurate.

Drawings

FIG. 1 is a schematic diagram of a user-based collaborative filtering recommendation algorithm in the prior art;

fig. 2 is a schematic flowchart of a method for calculating user similarity according to a first embodiment of the present disclosure;

fig. 3 is an architecture diagram of a computing system for user similarity according to a second embodiment of the present disclosure;

fig. 4 is an architecture diagram of a computer device according to a third embodiment of the present disclosure.

Detailed Description

In order to make the technical solutions of the present disclosure better understood by those skilled in the art, the present disclosure is further described in detail below with reference to the accompanying drawings and examples.

In which the terminology used in the embodiments of the disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in the disclosed embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

Because the existing user similarity calculation is completed based on the user-commodity scoring matrix, it requires the data set to contain sufficient user behavior information. When a user has little historical behavior, or a new user has no historical behavior information at all, there is not enough common commodity scoring information among users; that is, the user-commodity scoring matrix is sparse, so the similarity calculation among users is inaccurate and highly accurate recommendation is difficult to make. In addition, the traditional collaborative filtering algorithm treats all commodities accessed by the user equally and does not fully consider the contribution of recently accessed commodities to the measurement of user interest, so the recommendation reliability and recommendation precision are not high.

The following describes the technical solutions of the present disclosure and how to solve the above problems in detail with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.

Fig. 2 is a schematic flowchart of a method for calculating user similarity in a collaborative filtering algorithm according to a first embodiment of the present disclosure. As shown in fig. 2, the method includes:

step S101: acquiring a user-commodity scoring matrix;

step S102: modifying the scores in the user-commodity scoring matrix based on the preset time weight to obtain a new user-commodity scoring matrix;

step S103: calculating the scoring difference value of any two users for each common scoring commodity aiming at the new user-commodity scoring matrix;

step S104: classifying the score difference values and calculating the frequency of the score difference values of each category;

step S105: calculating the improved information entropy of all the category score difference values according to the frequency of each category score difference value;

step S106: and calculating the similarity between any two users in the new user-commodity scoring matrix according to the information entropy and a preset similarity calculation method.

User similarity calculation is the key of the User-CF algorithm. It is performed based on a user-commodity scoring matrix, for example the user-commodity scoring matrix R_{m×n} as follows:

R_{m×n}＝
[ R_11  R_12  …  R_1n ]
[ R_21  R_22  …  R_2n ]
[ …     …     …  …    ]
[ R_m1  R_m2  …  R_mn ]

wherein m represents the number of users, n represents the number of commodities, and R_mn represents the score given by the m-th user to the n-th commodity; the similarity of users is calculated using the row vectors. The strategies that can be used include cosine similarity, modified cosine similarity, the Pearson correlation coefficient and the like.

Considering that user interest changes over time, a time weight is introduced to correct the scores in the user-commodity scoring matrix and to construct a new user-commodity scoring matrix, so that the scores reflect the users' preferences more truly. After the correction, a user's recent scores carry a higher weight, so the scores better reflect the user's current interests.

Then, based on the new user-commodity scoring matrix, the score differences of the commodities commonly scored by user u and user v are calculated and frequency analysis is performed: the score differences are classified and the frequency of each category is calculated. Information entropy is then calculated in order to calculate user similarity. Information entropy can be understood in terms of the occurrence probability of specific information (the occurrence probability of discrete random events); it reflects the degree of disorder of a system, and the lower the information entropy, the more ordered the system. User similarity is inversely related to information entropy: the larger the information entropy, the greater the difference between two users and the less similar they are; the smaller the information entropy, the smaller the difference between two users and the more similar they are. The calculation formula of the information entropy is as follows:

H(U)＝-Σ_{i=1}^{n} p_i·log p_i (7)

in formula (7), n represents the number of information types in the sample U, and p_i represents the probability that the information numbered i occurs in the sample U. In an implementation of the embodiment of the present disclosure, the information entropy may be improved: besides the frequency of the score differences, the score difference itself also influences the calculation result, so the score difference itself is added to formula (7) when the information entropy is calculated.

The time weight w_t reduces the proportion of the user's long-term interest and increases the proportion of short-term interest, so that the user's current interest is better reflected. The time decay parameters in the time weights w_t of different users are the same.
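The effect of a time weight can be illustrated with the Python sketch below. Formulas (1) and (2) are not reproduced in this text, so the sketch does not implement them; it substitutes a generic exponential decay, chosen purely for illustration, in which a score given at the earliest scoring time is down-weighted most and a score given at the end of the time window keeps its full value. The function names and the exact form of the weight are assumptions.

```python
import math


def time_weight(t_score: float, t_earliest: float, window: float, alpha: float) -> float:
    """Illustrative stand-in weight, NOT the patent's formula (1)/(2).

    t_score    : time at which the score was given
    t_earliest : earliest scoring time, playing the role of t(0)
    window     : length of the time window
    alpha      : time decay parameter; larger values discount old scores more
    """
    elapsed = (t_score - t_earliest) / window  # 0.0 at t(0), 1.0 at the window end
    return math.exp(-alpha * (1.0 - elapsed))


def corrected_score(score: float, t_score: float, t_earliest: float,
                    window: float, alpha: float) -> float:
    """Corrected score: the original score multiplied by the time weight."""
    return score * time_weight(t_score, t_earliest, window, alpha)


alpha = 0.5      # same decay parameter for every user
window = 100.0   # e.g. days

old = corrected_score(4.0, t_score=0.0, t_earliest=0.0, window=window, alpha=alpha)
recent = corrected_score(4.0, t_score=100.0, t_earliest=0.0, window=window, alpha=alpha)
```

With these assumed values the two identical raw scores of 4.0 end up different after correction: the recent one is unchanged while the old one is reduced, which is the qualitative behavior described above.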

in formula (3), dif(u′,v′) represents the score differences of user u and user v over their commonly scored commodities; d₁,…,d_i,…,d_n respectively represent the score differences of user u and user v for commonly scored commodity 1, …, commodity i, …, and commodity n.

Through the corrected commodity scoring matrix, the scoring difference of the two users on the commonly scored commodities under the current condition can be obtained, and the influence of scoring of the two users at different time on the similarity of the two users is eliminated.

fre(dif(u′,v′))＝(p₁,p₂,…,p_j,…,p_k) (4)

Frequency analysis is performed on the score differences to obtain their distribution characteristics. For example, if the score differences of the commodities commonly scored by user u and user v are (1, 2, 2, 3), then the frequencies of the 3 categories with score differences 1, 2 and 3 are (1/4, 1/2, 1/4).
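The worked example above can be reproduced with a few lines of Python. The sketch assumes, as in the example, that every distinct score difference forms its own category; how differences would be binned into categories in other cases is not specified here. The name `difference_frequencies` is hypothetical, and exact fractions are used so that the result matches the example literally.

```python
from collections import Counter
from fractions import Fraction
from typing import Dict, Sequence


def difference_frequencies(diffs: Sequence[int]) -> Dict[int, Fraction]:
    """fre(dif(u', v')): relative frequency of each category of score difference.

    Assumption of this sketch: each distinct difference value is one category.
    """
    counts = Counter(diffs)
    total = len(diffs)
    return {d: Fraction(c, total) for d, c in sorted(counts.items())}


freqs = difference_frequencies([1, 2, 2, 3])
# freqs maps 1 -> 1/4, 2 -> 1/2, 3 -> 1/4
```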

in formula (5), H′(fre(dif(u′,v′))) represents the improved information entropy of the score differences of all categories after the score differences of user u and user v over their commonly scored commodities are divided into k categories;

Besides the frequency of the score differences, the score difference itself also influences the calculation result. For example, if dif(u′,v′)＝(1, 2, 3) and dif(u′,w′)＝(3, 4, 5), the information entropy calculation results are identical, yet user u is actually more similar to user v than to user w. Therefore, the information entropy calculation formula is improved by incorporating the score difference values.
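The limitation described above can be checked numerically. The Python sketch below computes the ordinary (unimproved) information entropy of a list of score differences, treating each distinct value as one category and using base-2 logarithms; both choices are assumptions of the sketch. It does not implement the improved entropy, whose formula is not reproduced in this text.

```python
import math
from collections import Counter
from typing import Sequence


def shannon_entropy(diffs: Sequence[int]) -> float:
    """Ordinary information entropy of the score-difference distribution (base 2)."""
    total = len(diffs)
    probs = [count / total for count in Counter(diffs).values()]
    return -sum(p * math.log2(p) for p in probs)


h_uv = shannon_entropy([1, 2, 3])     # user u versus user v
h_uw = shannon_entropy([3, 4, 5])     # user u versus user w
h_example = shannon_entropy([1, 2, 2, 3])

same = (h_uv == h_uw)  # True: the entropy sees only the frequencies
```

Both `h_uv` and `h_uw` equal log2(3), because each list consists of three equally frequent categories; the ordinary entropy cannot tell that the differences in the second list are larger, which is the motivation for adding the score difference itself to the calculation.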

|I_u∩I_v|/|I_u∪I_v| is the Jaccard similarity calculation formula.

Jaccard similarity does not consider the scores that users give to commodities; it considers only whether a user has shown a preference for a commodity, that is, the ratio of the number of commodities scored by both users to the total number of commodities scored by either user. The value lies in [0, 1]: when the value is 0, the two users have no common preference, and when the value is 1, the two users' preferences are consistent.

I_u and I_v respectively represent the sets of commodities scored by user u and user v.
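A minimal Python sketch of the Jaccard similarity over the two commodity sets is given below. Returning 0.0 when neither user has scored anything is an assumed convention, and the name `jaccard_similarity` is hypothetical. The sketch covers only the Jaccard term; how it is combined with the improved information entropy is not reproduced in this text and is not implemented here.

```python
from typing import Set


def jaccard_similarity(items_u: Set[str], items_v: Set[str]) -> float:
    """|I_u ∩ I_v| / |I_u ∪ I_v| for the commodity sets scored by two users."""
    union = items_u | items_v
    if not union:
        return 0.0  # assumed convention when both sets are empty
    return len(items_u & items_v) / len(union)


I_u = {"a", "b", "c"}
I_v = {"b", "c", "d"}

sim_partial = jaccard_similarity(I_u, I_v)        # 2 common out of 4 in total
sim_none = jaccard_similarity({"a"}, {"b"})       # no common preference
sim_same = jaccard_similarity(I_u, set(I_u))      # consistent preferences
```

The three calls correspond to the cases discussed above: partial overlap, no common preference (value 0) and fully consistent preferences (value 1).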

According to the embodiment of the disclosure, the change of the user interest along with time is considered, and the time weight is introduced to correct the user score, so that the current interest preference of the user is reflected more truly; meanwhile, an information entropy calculation idea is introduced, the similarity of the user is calculated by improving and combining the Jaccard similarity, the problem of data sparseness is relieved, the similarity calculation result is more in line with the actual situation, and the recommendation result is more accurate.

Fig. 3 is an architecture diagram of a computing system for user similarity according to a second embodiment of the present disclosure, as shown in fig. 3, including:

an acquisition module 1 configured to acquire a user-commodity scoring matrix;

the score correction module 2 is configured to correct scores in the user-commodity score matrix based on a preset time weight to obtain a new user-commodity score matrix;

a first calculating module 3, configured to calculate, for the new user-commodity scoring matrix, the score difference of any two users for each commonly scored commodity, and to classify the score differences and calculate the frequency of the score differences of each category;

a second calculation module 4, configured to calculate the improved information entropy of all category score differences according to the frequency of each category score difference, and to calculate the similarity between any two users in the new user-commodity scoring matrix according to the information entropy and a preset similarity calculation method.

Further, the score correction module 2 is specifically configured to:

correct the scores of the users for the commodities in the user-commodity scoring matrix by using formula (1) and formula (2):

in formulae (1) and (2), t(u_i) and t(v_i) respectively represent the scoring time of user u and user v for commodity i; w_t(u_i) and w_t(v_i) are the preset time weight calculation formulas for user u and user v respectively; t(0) represents the earliest scoring time at which user u and user v scored commodities; α represents a time decay parameter and reflects the speed at which user interest changes; T represents a time window; u_i and v_i respectively represent the scores of user u and user v for commodity i; u′_i and v′_i respectively represent the corrected scores of user u and user v for commodity i; and i = 1, …, n.

Further, the first calculating module 3 is specifically configured to:

calculating the difference value of the scores of any two users for each common score commodity by adopting a formula (3):

dif(u′，v′)＝(u₁′-v₁′,…,u_i′-v_i′，…，u_n′-v_n′)＝(d₁，…,d_i,…,d_n) (3)

in the formula (3), dif (u ', v') represents the difference value of the scores of the user u and the user v on the common score commodities; d₁,…,d_i,…,d_nRepresenting the difference in the scores of user u and user v for commonly scored items 1, …, item i, …, and item n, respectively.

Further, the first calculating module 3 is further configured to:

the frequency of the difference value of each category score is calculated by the following formula:

fre(dif(u′,v′))＝(p₁,p₂,…,p_j,…,p_k) (4)

Further, the second calculating module 4 is specifically configured to:

the improved information entropy of all category score differences is calculated using the following formula:

Further, the second calculating module 4 is specifically further configured to:

calculating the similarity between any two users in the new user-commodity scoring matrix by adopting the following formula:

the formula is a Jaccard similarity calculation formula.

The user similarity calculation system in this embodiment of the present disclosure is used for implementing the user similarity calculation method of the first (method) embodiment, so it is described only briefly; for details, refer to the related description in the method embodiment, which is not repeated here.

Furthermore, as shown in fig. 4, a computer device according to a third embodiment of the present disclosure further includes a memory 10 and a processor 20, where the memory 10 stores a computer program, and when the processor 20 runs the computer program stored in the memory 10, the processor 20 executes the above-mentioned methods for calculating the user similarity.

In addition, the embodiments of the present disclosure also provide a computer-readable storage medium in which computer-executable instructions are stored; when at least one processor of a user equipment executes the computer-executable instructions, the user equipment performs the methods described above in any of their possible implementations.

Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC (Application Specific Integrated Circuit). Additionally, the ASIC may reside in user equipment. Of course, the processor and the storage medium may also reside as discrete components in a communication device.

It is to be understood that the above embodiments are merely exemplary embodiments that are employed to illustrate the principles of the present disclosure, and that the present disclosure is not limited thereto. It will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the disclosure, and these are to be considered as the scope of the disclosure.

Claims

1. A method for calculating user similarity is characterized by comprising the following steps:

acquiring a user-commodity scoring matrix;

2. The calculation method according to claim 1, wherein the score in the user-commodity score matrix is corrected based on the preset time weight, and the following formula is adopted:

in formulae (1) and (2), t(u_i) and t(v_i) respectively represent the scoring time of user u and user v for commodity i; w_t(u_i) and w_t(v_i) are the preset time weight calculation formulas for user u and user v respectively; t(0) represents the earliest scoring time at which user u and user v scored commodities; α represents a time decay parameter and reflects the speed at which user interest changes; T represents a time window; u_i and v_i respectively represent the scores of user u and user v for commodity i; u′_i and v′_i respectively represent the corrected scores of user u and user v for commodity i; and i = 1, …, n.

3. The calculation method according to claim 2, wherein for the new user-commodity scoring matrix, the scoring difference of any two users for each common scored commodity is calculated, and the following formula is adopted to obtain:

dif(u′，v′)＝(u₁′-v₁′，…，u_i′-v_i′，…，u_n′-v_n′)＝(d₁，…，d_i，…，d_n) (3)

in the formula (3), dif (u ', v') represents the difference value of the scores of the user u and the user v on the common score commodities; d₁，…，d_i，…，d_nRepresenting the difference in the scores of user u and user v for commonly scored items 1, …, item i, …, and item n, respectively.

4. The method of claim 2, wherein the frequency of calculating the difference between the respective category scores is obtained by using the following formula:

fre(dif(u′，v′))＝(p₁，p₂，…，p_j，…，p_k) (4)

5. The calculation method according to claim 4, wherein the improved information entropy of all the category score differences is calculated by using the following formula:

6. The calculation method according to claim 5, wherein the similarity between any two users in the new user-commodity rating matrix is calculated according to the information entropy and a preset similarity calculation method, and is obtained by adopting the following formula:

the formula is a Jaccard similarity calculation formula.

7. A system for calculating user similarity, comprising:

an acquisition module configured to acquire a user-commodity scoring matrix;

8. The computing system of claim 7, wherein the score modification module is specifically configured to:

9. A computer device characterized by comprising a memory in which a computer program is stored and a processor that executes the user similarity calculation method according to any one of claims 1 to 6 when the processor runs the computer program stored in the memory.

10. A computer-readable storage medium, comprising a computer program which, when run on a computer, causes the computer to carry out the method of calculating user similarity as claimed in any one of claims 1 to 6.