CN111831890A

CN111831890A - User similarity generation method and device, storage medium and computer equipment

Info

Publication number: CN111831890A
Application number: CN201910306528.4A
Authority: CN
Inventors: 杨毅; 李冰锋; 冯晓强; 李彪; 范欣
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2019-04-17
Filing date: 2019-04-17
Publication date: 2020-10-27
Anticipated expiration: 2039-04-17
Also published as: CN111831890B

Abstract

The application relates to a user similarity generation method and apparatus, a storage medium, and a computer device. The method comprises the following steps: acquiring the viewing times corresponding to a user identifier under more than one content category; generating a user feature vector corresponding to the user identifier according to the viewing times; determining the degree of correlation between any two content categories of the more than one content categories; combining the degrees of correlation to obtain a correlation matrix among the more than one content categories; and generating a user similarity according to the correlation matrix and the user feature vectors respectively corresponding to at least two user identifiers. The scheme provided by the application can improve the accuracy of the generated user similarity.

Description

User similarity generation method and device, storage medium and computer equipment

Technical Field

The present application relates to the field of computer technologies, and in particular, to a user similarity generating method, apparatus, storage medium, and computer device.

Background

With the rapid development of computer technology, more and more content is acquired through computers, and content needs to be recommended to users in more and more scenarios, such as the recommendation of news, videos, or advertisements. Current content recommendation generally recommends, based on user similarity, content liked by users similar to the target user.

However, the user similarity generated in the conventional content recommendation process is usually calculated based on the ratio of the number of the same clicked contents in the recently browsed contents among users; due to the personalized difference of the users, different users have different browsing contents, so that the user similarity obtained by calculation in different sample spaces has lower accuracy.

Disclosure of Invention

Based on this, it is necessary to provide a user similarity generating method, apparatus, storage medium and computer device for solving the technical problem of low accuracy of user similarity generated in the conventional manner.

A user similarity generating method comprises the following steps:

acquiring the viewing times corresponding to the user identification under more than one content category;

generating a user characteristic vector corresponding to the user identifier according to the viewing times;

respectively determining the correlation degree between any two content categories of the more than one content categories;

combining the correlations to obtain a correlation matrix between the more than one content categories;

and generating user similarity according to the correlation matrix and the user characteristic vectors corresponding to the at least two user identifications respectively.

A user similarity generating apparatus comprising:

the acquisition module is used for acquiring the viewing times corresponding to the user identification under more than one content category;

the first generation module is used for generating a user characteristic vector corresponding to the user identifier according to the viewing times;

a determining module, configured to determine a correlation between any two content categories of the more than one content categories respectively;

the combination module is used for combining the relevance degrees to obtain a relevance degree matrix among the more than one content category;

and the second generation module is used for generating the user similarity according to the correlation matrix and the user characteristic vectors corresponding to the at least two user identifications.

A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to perform the steps of the above-described user similarity generating method.

A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the user similarity generation method described above.

According to the user similarity generating method, the user similarity generating device, the storage medium and the computer equipment, after the viewing times corresponding to the user identifications under more than one content category are obtained, the user feature vectors corresponding to the user identifications are generated according to the viewing times, the user feature vectors with strong interpretability are constructed from macroscopic content classification, the clicking frequency is used as the vector elements of the user feature vectors, and the influence caused by personalized difference is avoided; then, the correlation degrees between any two content categories in the more than one content categories are respectively determined, the correlation degrees are combined to obtain a correlation degree matrix between the more than one content categories, and the user similarity is generated according to the correlation degree matrix and the user characteristic vectors corresponding to the at least two user identifications. Therefore, the similarity calculation of the user is enlarged to a macroscopic content classification level, and the association between different content categories is taken into consideration, so that the generated user similarity is more accurate and reliable.

Drawings

FIG. 1 is a diagram of an application environment of a user similarity generation method in one embodiment;

FIG. 2 is a flowchart illustrating a user similarity generating method according to an embodiment;

FIG. 3 is a diagram of an interface for content presentation in one embodiment;

FIG. 4 is a schematic diagram of a content recommendation based on user similarity in one embodiment;

FIG. 5 is a diagram of an interface for content recommendation based on user similarity, according to an embodiment;

FIG. 6 is a schematic diagram illustrating the generation of user similarities in one embodiment;

FIG. 7 is a flow diagram that illustrates content recommendation based on user similarity, in one embodiment;

FIG. 8 is a block diagram showing the structure of a user similarity generating apparatus according to an embodiment;

FIG. 9 is a block diagram showing the structure of a user similarity generating apparatus in another embodiment;

FIG. 10 is a block diagram showing a configuration of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

Fig. 1 is an application environment diagram of a user similarity generating method in one embodiment. Referring to fig. 1, the user similarity generating method is applied to a user similarity generating system. The user similarity generating system includes a terminal 110 and a server 120. The terminal 110 and the server 120 are connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers. The terminal 110 and the server 120 may each execute the user similarity generating method independently, or may execute it cooperatively. The terminal 110 may also perform the user similarity generating method through an application program running thereon.

As shown in fig. 2, in one embodiment, a user similarity generation method is provided. The embodiment is mainly exemplified by applying the method to a computer device, and the computer device may specifically be the terminal 110 or the server 120 in fig. 1. Referring to fig. 2, the user similarity generating method specifically includes the following steps:

S202, the viewing times corresponding to the user identification under more than one content category are obtained.

The content category is the category to which content belongs, that is, a type obtained by dividing the content. The division may be based on an attribute inherent to the content itself. For example, by data format, news may be divided into text news, video news, or picture news. For another example, by the semantics expressed by the content, news may be divided into entertainment news, sports news, social news, and the like. The division of the content may also be a custom manual division, such as the manually divided news sites in a news application: video, movies, sports, duds, Beijing, entertainment, and the like.

The content refers to information displayed to a user through a computer device, and may specifically be text, image or video and combined information, such as the content shown in fig. 3. The content is specifically, for example, promotion information, application programs, video, audio, news, articles or commodities. Typically a large amount of content may be included under one content category. For example, a large amount of entertainment news may be included under an entertainment news site.

It can be understood that more than one content category together cover a wide range of content. Generating the user feature vector from the number of times the user views content under these content categories therefore broadens the data source as much as possible, so that the resulting user feature vector truly reflects the interest preference of the user.

In one embodiment, the number of views in a content category may include the number of times the content in the content category is further viewed through user manipulation after being displayed. The user operation includes a click operation, a touch operation, a voice operation, a physical key operation or a shaking operation, and the like, which can trigger the operation of further viewing the content. The number of views under a content category may also include the number of accesses into the content category.

It is understood that on a terminal such as a smart phone or a tablet computer, when a user views content of interest through an application, the application usually presents the content in groups according to content categories. The user may select to enter the content category of interest, so that the application program may display the content under the content category in a manner of combining a title and a thumbnail, and form a plurality of pieces of content into a waterfall flow list, as shown in fig. 3. The user may further view detailed information of the content by clicking on the content of interest. That is, the content category selected by the user to enter may also reflect the user's interest preference to some extent.

In particular, the computer device may count the respective number of views of each user under more than one content category within a statistical period. The statistical period is a time period for counting the number of viewing times, and may specifically be one or more than one natural week, or one or more than one natural month, or the like.

For example, the computer device may count the number of times that user a, user B, and user C view news under each news site in a news application within one month.

In an embodiment, the computer device may also divide the time interval of the statistical period, assign different weights to the number of times of viewing in different time intervals, and perform weighted summation on the number of times of viewing in each time interval according to the corresponding weights, thereby obtaining the number of times of viewing in the statistical period.

As will be appreciated by those skilled in the art, the interest preferences of a user may change as time shifts, and may affect the viewing behavior of the user on content. The viewing behavior of the user on the content may also change over time. Considering the time attenuation effect, the recent viewing times can better reflect the current interest preference of the user, and then, when counting the viewing times of a statistical period, the computer device can give different weights to the viewing times of different time intervals, so that the accuracy of data statistics can be improved by focusing on the viewing times of a certain time interval.
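The time-weighted counting described above can be sketched as follows. This is a minimal illustration rather than the application's implementation: the function name `weighted_view_count`, the four weekly intervals, and the specific weights are assumptions chosen for the example. The text only requires that different time intervals receive different weights.

```python
def weighted_view_count(counts_by_interval, weights):
    """Return the weighted sum of per-interval view counts for one category."""
    if len(counts_by_interval) != len(weights):
        raise ValueError("need exactly one weight per time interval")
    return sum(c * w for c, w in zip(counts_by_interval, weights))


# Hypothetical data: views of one category in four consecutive weeks, oldest first.
weekly_views = [10, 4, 6, 8]
# Hypothetical weights: recent weeks count more than older ones (time decay).
weights = [0.1, 0.2, 0.3, 0.4]

period_count = weighted_view_count(weekly_views, weights)
print(round(period_count, 2))
```

With increasing weights, the 8 views in the most recent week contribute more to the total than the 10 views in the oldest week.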

S204, generating a user characteristic vector corresponding to the user identifier according to the viewing times.

The user feature vector is data that represents the user's interest preferences in mathematical form. Specifically, the computer device may vectorize the number of views, which is in text form, to obtain the user feature vector corresponding to the user identifier. For example, "XXX" in text form may be represented in the mathematical form "[0 0 0 1 0 0 0 0 0 0 0 ...]", in which case "[0 0 0 1 0 0 0 0 0 0 0 ...]" is the result of vectorizing "XXX", that is, the vector of "XXX". It is to be understood that the vector into which data of other forms is converted is not limited here, as long as the data can be expressed mathematically.

It can be understood that the viewing behavior of the user on the content under different content categories can well reflect different interest preferences of the user. For example, if the user a views news under the entertainment category and the celebrity category more, and the user B views news under the military category and the financial category more, it is obvious that the interest preference difference between the user a and the user B is larger. Based on the above, it can be considered that the interest preference of the user can be well represented by mining the viewing activity degree of different users on the content in different content categories, so as to distinguish the users.

Specifically, the computer device may generate a vector element according to the number of views under each content category, and then combine the vector elements as a user feature vector corresponding to the user identification.

In one embodiment, S204 includes: taking the viewing times corresponding to the user identification under each content category as vector elements; combining each vector element to obtain an initial feature vector corresponding to the user identifier; and normalizing the initial feature vector to obtain a user feature vector corresponding to the user identifier.

Where the vector elements are the units that make up the vector. Specifically, the computer device may directly use the viewing times corresponding to the user identifier under each content category as vector elements, and then combine these vector elements to obtain an initial feature vector corresponding to the user identifier.

For example, for user u, the computer device may count the respective viewing times of the user under more than one content category in one statistical period, and then use these viewing times as the characteristic expression of the user. Assuming that the number of the more than one content categories is n, the interest preference of user u can be expressed as an n-dimensional vector $(u_1, u_2, \ldots, u_i, \ldots, u_n)$, where $u_i$, $i \in [1, n]$, is the number of views of user u under content category i.
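As a minimal sketch of building the initial feature vector, the snippet below counts views per content category from a log of viewing events. The category list, the log format (one category label per view), and the name `initial_feature_vector` are hypothetical and serve only to illustrate the mapping from viewing times to vector elements.

```python
from collections import Counter


def initial_feature_vector(view_log, categories):
    """Map a list of viewed-category labels to a count vector over `categories`."""
    counts = Counter(view_log)
    # Categories the user never viewed get a count of 0.
    return [counts[c] for c in categories]


# Hypothetical fixed ordering of the n content categories (here n = 4).
categories = ["entertainment", "sports", "finance", "society"]
# Hypothetical log: the category of each item the user viewed in the period.
view_log = ["sports", "entertainment", "sports", "finance", "sports"]

u = initial_feature_vector(view_log, categories)
print(u)  # [1, 3, 1, 0]
```

Keeping the category order fixed across all users is what makes each dimension comparable between users.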

It will be appreciated that since the user's interest preferences are generally more focused, more focus is on content in several particular content categories and less focus is on content in other content categories. This may result in the initial feature vector having large values for some dimensions of the vector elements and small values for some dimensions of the vector elements. To smooth the effects of larger vector elements, the initial feature vectors may be normalized.

Specifically, the computer device normalizes the initial feature vector to obtain the user feature vector:

[Formula (1): not recoverable from the source text]

where the user feature vector is represented as a row vector with dimension 1 × n.
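One possible normalization is sketched below. Because the normalization formula itself is not available in this text, dividing each element by the sum of all elements is an assumption made purely for illustration. Other schemes, such as dividing by the Euclidean norm of the vector, would fit the description equally well.

```python
def normalize(vector):
    """Scale a count vector so its elements sum to 1 (assumed normalization)."""
    total = sum(vector)
    if total == 0:
        # A user with no views has no preference signal; keep an all-zero vector.
        return [0.0] * len(vector)
    return [v / total for v in vector]


# Hypothetical initial feature vector: views per content category.
initial = [1, 3, 1, 0]
user_vector = normalize(initial)
print(user_vector)
```

Under this assumed scheme each element can be read as the share of the user's views that fall under one content category, which matches the "weight" interpretation given in the text.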

It should be noted that the user feature vector obtained in the embodiment of the present application is an interpretable vector. By interpretable vector, it is meant that each dimension of the vector has a specific meaning. For example, for the user feature vector in the embodiment of the present application, each dimension of the user feature vector represents a weight obtained by normalizing the number of viewing times in one content category. The first dimension of the user feature vector can be set as the viewing times weight under the entertainment category, and the second dimension can be set as the viewing times weight under the sports category. Based on this, when constructing the user feature vector according to the embodiment of the present application, if a certain user is recommended entertainment news information and the user intends to know the reason why the entertainment news information is recommended to him, it can be interpreted that the number of times of viewing the content in the entertainment category is the largest among the number of times of viewing the content category in the history of the user.

In the embodiment, when the user feature vector is constructed, the user viewing behavior is mapped to the vector with the dimensionality as the content category, each dimension represents the number of times that the user views the content under the content category, the user feature vector with strong interpretability is constructed from a macroscopic content classification level, the problems caused by content timeliness and user personalized difference are avoided, and the subsequent user similarity calculation is facilitated.

S206, determining the correlation degree between any two content categories in the more than one content categories respectively.

Conventionally, after constructing the user feature vectors, the computer device would, for any two users, take the similarity of their user feature vectors as the similarity of the two users. A similarity calculated in this way considers only the viewing behavior of the users under different content categories and ignores the association between content categories, which may make the user similarity unreliable. For example, if user A focuses on the fashion category and not on the entertainment category, and user B focuses on the entertainment category and not on the fashion category, then directly using the similarity of the user feature vectors as the similarity of users A and B gives a similarity of zero, because the fashion category and the entertainment category are two different content categories. In fact, entertainment is strongly related to fashion, and if the relevance between the two content categories is taken into account, the similarity between the two users should not be zero. Based on this, it is necessary to take into account the relevance between content categories when generating the user similarity.
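The fashion and entertainment example can be checked numerically. The sketch below assumes only two content categories and uses a plain inner product as the vector similarity, which is one common choice and not necessarily the one a conventional system would use.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Dimensions: [fashion, entertainment]. Values are hypothetical normalized weights.
user_a = [1.0, 0.0]  # views only fashion content
user_b = [0.0, 1.0]  # views only entertainment content

plain_similarity = dot(user_a, user_b)
print(plain_similarity)  # 0.0
```

The two vectors share no non-zero dimension, so any similarity built only from element-wise products is zero regardless of how related the two categories are.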

In particular, the computer device may determine a degree of correlation between any two of the more than one content categories, respectively. For example, the more than one content category includes: entertainment category, sports category, financial category, and social category. After the computer device obtains the number of views under the four content categories, the relevance between every two of the four content categories, namely the entertainment category, the sports category, the finance category and the social category, needs to be determined.

In one embodiment, S206 includes: for any two content categories in the more than one content category, determining the correlation degree between any two content categories according to the overlapping degree of the contents under any two content categories.

The overlapping degree of contents in any two content categories refers to the ratio of the same contents in the two content categories. It can be understood that each content category includes a large amount of content, and when the content range covered by the two content categories is fixed, the more the same content in the content included in the two content categories is, the more the two content categories are related. In this embodiment, the degree of correlation between any two content categories is determined according to the degree of overlapping of the content under any two content categories, so that the obtained degree of correlation between any two content categories is accurate and has strong interpretability.

In one embodiment, for any two content categories of the more than one content categories, determining the degree of correlation between any two content categories according to the degree of overlapping of the contents under any two content categories comprises: for any two content categories in more than one content category, determining the intersection of the contents under any two content categories and the union of the contents under any two content categories; the ratio of the intersection to the union is taken as the degree of correlation between any two content categories.

The intersection of the contents in any two content categories refers to the same content in the contents in the two content categories. The union of the contents in any two content categories refers to the combination of the contents in the two content categories. Specifically, the computer device may calculate the degree of correlation between any two content categories by:

$$r(c_i, c_j) = \frac{|d_i \cap d_j|}{|d_i \cup d_j|} \tag{2}$$

where $c_i$ and $c_j$ are any two of the more than one content categories, $d_i$ is the set of contents under content category $c_i$, and $d_j$ is the set of contents under content category $c_j$.

It is understood that the correlation degree between any two content categories calculated by equation (2) is the Jaccard similarity coefficient (Jaccard index) of the two content sets; the Jaccard distance is one minus this coefficient.

In this embodiment, the correlation between any two content categories is calculated with the Jaccard similarity coefficient, so that the calculation is convenient and the interpretability is strong.
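The ratio of intersection to union can be sketched with Python sets. The content identifiers and category names below are hypothetical, and returning 0.0 when both sets are empty is a convention chosen here, since the text does not address that case.

```python
def jaccard(d_i, d_j):
    """Ratio of shared contents to all distinct contents of two categories."""
    union = d_i | d_j
    if not union:
        return 0.0  # assumed convention for two empty categories
    return len(d_i & d_j) / len(union)


# Hypothetical content IDs under two categories.
entertainment = {"n1", "n2", "n3", "n4"}
fashion = {"n3", "n4", "n5"}

r = jaccard(entertainment, fashion)
print(r)  # 2 shared items out of 5 distinct items -> 0.4
```

The measure is symmetric, so `jaccard(a, b)` and `jaccard(b, a)` give the same value.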

In one embodiment, for any two content categories of the more than one content categories, the computer device may also determine the intersection of the contents under the two content categories and the geometric mean of the amounts of content under the two content categories; the ratio of the size of the intersection to the geometric mean is taken as the degree of correlation between the two content categories.
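A sketch of this variant follows. It assumes that the geometric mean is taken over the numbers of contents in the two categories, which is an interpretation of the text rather than something it states explicitly. The function name and sample data are hypothetical.

```python
import math


def overlap_geometric(d_i, d_j):
    """Intersection size divided by the geometric mean of the two set sizes."""
    if not d_i or not d_j:
        return 0.0  # assumed convention when a category has no content
    return len(d_i & d_j) / math.sqrt(len(d_i) * len(d_j))


# Hypothetical content IDs under two categories.
entertainment = {"n1", "n2", "n3", "n4"}
fashion = {"n3", "n4", "n5"}

r_geo = overlap_geometric(entertainment, fashion)
print(round(r_geo, 3))
```

Like the ratio of intersection to union, this value lies in [0, 1] and equals 1 only when the two sets are identical.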

In the above embodiment, the degree of correlation between any two content categories is determined according to the degree of overlapping of the content under any two content categories, so that the obtained degree of correlation between any two content categories is accurate and has strong interpretability.

In one embodiment, S206 includes: for any two content categories in more than one content category, respectively determining content vectors corresponding to the contents in any two content categories; calculating content category vectors corresponding to the content categories according to content vectors corresponding to the content under any two content categories; and taking the correlation degree between the content category vectors corresponding to the arbitrary two content categories as the correlation degree between the arbitrary two content categories.

Where a content vector is data that represents the characteristics of the content in a mathematical form. The content category vector is data that mathematically represents the characteristics of the content category.

In particular, the computer device may map each content into a content vector according to the semantics of each content, the value of each dimension of the content vector representing a feature having certain semantics and grammatical interpretations. For each content category, the computer device may then derive a content category vector for the content category based on the content vector for the content under the content category. Here, the computer device may fuse content vectors of content under each content category to obtain a content category vector for the corresponding content category. The fusion here may be dimensional concatenation of vectors, or vector averaging by vector elements, or weighted averaging, or the like.

Further, after obtaining the content category vectors of the content categories, the computer device may use the correlation between the content category vectors corresponding to any two content categories as the correlation between the two content categories. Here, the correlation between the vectors may be measured by a similarity or a distance between the vectors, and may specifically be obtained from the cosine similarity, the Pearson correlation coefficient, the Euclidean distance, the Jaccard distance, and the like.
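The vector-based route can be sketched as follows, picking two of the options the text lists: element-wise averaging as the fusion step and cosine similarity as the correlation. The three-dimensional content vectors are made-up values standing in for the output of whatever model maps content to vectors.

```python
import math


def mean_vector(vectors):
    """Fuse content vectors into one category vector by element-wise averaging."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# Hypothetical content vectors for the items under two categories.
entertainment_vectors = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]
fashion_vectors = [[0.6, 0.4, 0.0], [0.4, 0.6, 0.0]]

entertainment_vec = mean_vector(entertainment_vectors)
fashion_vec = mean_vector(fashion_vectors)
correlation = cosine(entertainment_vec, fashion_vec)
print(round(correlation, 3))
```

Unlike the overlap-based measures, this approach can assign a non-zero correlation to two categories that share no content items, as long as their contents are semantically close.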

In a particular embodiment, when the content is an article, the computer device may map an article to a topic vector via the topic model. Each dimension in the topic vector represents a topic. Then, for an article category, the topic vectors of the articles in the article category can be fused to obtain the category topic vector of the corresponding article category. The computer device can further use the correlation degree between the category topic vectors corresponding to the two article categories as the correlation degree between the two article categories. The topic model is, for example, an LDA (Latent Dirichlet Allocation) model.

In a particular embodiment, the computer device may also use an embedding model, such as word2vec or doc2vec, to map content to the corresponding content vectors.

In the above embodiment, the content categories are mapped into a vector space and the similarity between categories is calculated in that space, so that interpretable user feature vectors are still constructed at the macroscopic level while the performance improvement brought by a latent semantic model is also taken into account.

S208, combining the correlation degrees to obtain a correlation degree matrix among more than one content category.

In particular, the computer device may arrange the pairwise degrees of correlation among the more than one content categories as the elements of a matrix, thereby obtaining the correlation matrix among the more than one content categories.

For example, assuming that the number of the more than one content categories is n, the correlation between any content category i and content category j in the more than one content categories is $r_{ij}$. Then the correlation matrix among the more than one content categories is

$$R = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ r_{21} & r_{22} & \cdots & r_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ r_{n1} & r_{n2} & \cdots & r_{nn} \end{pmatrix}$$

Assuming that the content is news, $r_{ij}$ is the number of news items belonging to both news category i and news category j divided by the number of all distinct news items under news categories i and j, and its value lies in [0, 1]. When $r_{ij}$ is 0, news category i and news category j share no news items. When $r_{ij}$ is 1, the news under news category i is exactly the same as the news under news category j.
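Assembling the matrix from pairwise correlations can be sketched as below, using the ratio of shared news to all distinct news as the correlation. The category names, news IDs, and the helper name `correlation_matrix` are hypothetical.

```python
def jaccard(d_i, d_j):
    """Shared items divided by all distinct items of two sets."""
    union = d_i | d_j
    return len(d_i & d_j) / len(union) if union else 0.0


def correlation_matrix(contents_by_category, categories):
    """Build the n x n matrix whose element (i, j) is the correlation r_ij."""
    return [
        [jaccard(contents_by_category[ci], contents_by_category[cj]) for cj in categories]
        for ci in categories
    ]


# Hypothetical news IDs under three news categories.
categories = ["entertainment", "fashion", "finance"]
contents_by_category = {
    "entertainment": {"n1", "n2", "n3", "n4"},
    "fashion": {"n3", "n4", "n5"},
    "finance": {"n6", "n7"},
}

R = correlation_matrix(contents_by_category, categories)
for row in R:
    print(row)
```

The resulting matrix is symmetric, and its diagonal is 1 for every non-empty category because a category fully overlaps with itself.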

S210, generating user similarity according to the correlation matrix and the user characteristic vectors corresponding to the at least two user identifications.

The user similarity is used for measuring the similarity degree between users, and the greater the similarity is, the more similar the users are. It can be understood that the correlation matrix obtained in the foregoing steps reflects the correlation between two content categories in more than one content category; meanwhile, the user feature vector obtained in the previous step reflects the behavior characteristics of the user under each content category in more than one content category independently. In order to obtain more reliable user similarity, the behavior characteristics of the user under each content category and the relevance between the content categories should be taken into consideration; then, the computer device may generate the user similarity between any two users according to the correlation matrix and the user feature vectors corresponding to the at least two user identifiers.

In one embodiment, S210 includes: for any two different user identifications, multiplying the user characteristic vector corresponding to one user identification by the correlation matrix, and then multiplying the user characteristic vector by the transpose of the user characteristic vector corresponding to the other user identification to obtain the user similarity corresponding to any two different user identifications.

It can be understood that the user feature vector is a row vector, and the number of columns in the correlation matrix is the same as the number of columns in the user feature vector. The user similarity is data for measuring the degree of similarity between users, and should be a scalar. Then, when the operation is performed based on two row vectors and a matrix with the same column number as the row vector, and the operation result is intended to be a scalar, based on the principle of matrix operation, one of the row vectors may be multiplied by the matrix, and then multiplied by the transpose of the other row vector (i.e., column vector), so that an operation result that is a scalar may be obtained.

Specifically, the computer device may generate the user similarity according to the following formula:

$$\mathrm{sim}(u_1, u_2) = \vec{u}_1 \, R \, \vec{u}_2^{\mathsf{T}} \tag{3}$$

where $\vec{u}_1$ is the user feature vector corresponding to one user identifier (the first user identifier) of any two different user identifiers, $\vec{u}_2$ is the user feature vector corresponding to the other user identifier (the second user identifier), $\vec{u}_1^{\mathsf{T}}$ and $\vec{u}_2^{\mathsf{T}}$ are the transposes of $\vec{u}_1$ and $\vec{u}_2$ respectively, and R is the correlation matrix.
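The product of one row vector, the correlation matrix, and the transpose of the other row vector can be sketched in plain Python as follows. The two-category vectors and the correlation value 0.4 between fashion and entertainment are hypothetical; the function name `user_similarity` is likewise chosen only for this illustration.

```python
def user_similarity(u1, R, u2):
    """Compute u1 * R * u2^T for row vectors u1, u2 and a square matrix R."""
    n = len(u1)
    if len(u2) != n or len(R) != n or any(len(row) != n for row in R):
        raise ValueError("u1, u2 and R must share the same dimension n")
    # Step 1: row vector times matrix gives another row vector of length n.
    u1_R = [sum(u1[i] * R[i][j] for i in range(n)) for j in range(n)]
    # Step 2: row vector times column vector gives a scalar.
    return sum(u1_R[j] * u2[j] for j in range(n))


# Dimensions: [fashion, entertainment]; hypothetical correlation of 0.4 between them.
R = [[1.0, 0.4],
     [0.4, 1.0]]
user_a = [1.0, 0.0]  # views only fashion content
user_b = [0.0, 1.0]  # views only entertainment content

print(user_similarity(user_a, R, user_b))  # 0.4
```

Although the two users share no content category, the similarity is non-zero because the off-diagonal element of the matrix links the two categories.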

Expanding equation (3) according to the definition of matrix multiplication gives the following equation:

$$\mathrm{sim}(u_1, u_2) = \sum_{i=1}^{n} \sum_{j=1}^{n} u_{1i} \, r_{ij} \, u_{2j} \tag{4}$$

where $u_{1i}$ is the number of views corresponding to the first user identifier under content category i, $r_{ij}$ is the degree of correlation between content category i and content category j, and $u_{2j}$ is the number of views corresponding to the second user identifier under content category j.

As can be seen from equation (4), when the similarity between user A (the user identified by the first user identifier) and user B (the user identified by the second user identifier) is calculated, the content categories that the two users pay attention to are not treated in isolation. The association between any two content categories is taken into account, and each product of the two users' vector elements is weighted by the correlation between the corresponding content categories, so that the resulting user similarity is more reliable.
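The element-wise form in equation (4) can be written directly as a double sum. The sketch below also shows a special case that follows from the formula: when the correlation matrix is the identity matrix (each category correlated only with itself), the result reduces to the plain inner product of the two vectors. All sample values are hypothetical.

```python
def similarity_double_sum(u1, R, u2):
    """Sum of u1[i] * R[i][j] * u2[j] over all category pairs (i, j)."""
    return sum(
        u1[i] * R[i][j] * u2[j]
        for i in range(len(u1))
        for j in range(len(u2))
    )


# Hypothetical normalized feature vectors over two content categories.
user_a = [0.2, 0.8]
user_b = [0.5, 0.5]

identity = [[1.0, 0.0],
            [0.0, 1.0]]
related = [[1.0, 0.4],
           [0.4, 1.0]]

isolated = similarity_double_sum(user_a, identity, user_b)
weighted = similarity_double_sum(user_a, related, user_b)
print(round(isolated, 3), round(weighted, 3))
```

With the identity matrix only the terms with i equal to j survive, which is the isolated-category comparison; the off-diagonal correlations add the cross-category terms on top of it.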

Therefore, when the user similarity is generated for a large number of users, the user similarity between every two users in the users can be generated according to the embodiment of the application.
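Generating the similarity for every pair among many users can be sketched with `itertools.combinations`. The user identifiers, vectors, and matrix values are hypothetical, and this brute-force loop over all pairs is only one straightforward way to organize the computation.

```python
from itertools import combinations


def pairwise_similarities(user_vectors, R):
    """Return {(id_a, id_b): similarity} for every unordered pair of users."""
    n = len(R)
    sims = {}
    for (id_a, u_a), (id_b, u_b) in combinations(sorted(user_vectors.items()), 2):
        sims[(id_a, id_b)] = sum(
            u_a[i] * R[i][j] * u_b[j] for i in range(n) for j in range(n)
        )
    return sims


# Hypothetical correlation matrix and user feature vectors over two categories.
R = [[1.0, 0.4],
     [0.4, 1.0]]
user_vectors = {
    "user_a": [1.0, 0.0],
    "user_b": [0.0, 1.0],
    "user_c": [0.5, 0.5],
}

sims = pairwise_similarities(user_vectors, R)
for pair, value in sims.items():
    print(pair, round(value, 3))
```

For m users this evaluates m(m-1)/2 pairs, so the cost grows quadratically with the number of users.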

According to the user similarity generating method, after the viewing times corresponding to the user identifications under more than one content category are obtained, the user characteristic vectors corresponding to the user identifications are generated according to the viewing times, the user characteristic vectors with strong interpretability are constructed from the macroscopic content classification, the clicking frequency is used as the vector elements of the user characteristic vectors, and the influence caused by personalized difference is avoided; then, the correlation degrees between any two content categories in the more than one content categories are respectively determined, the correlation degrees are combined to obtain a correlation degree matrix between the more than one content categories, and the user similarity is generated according to the correlation degree matrix and the user characteristic vectors corresponding to the at least two user identifications. Therefore, the similarity calculation of the user is enlarged to a macroscopic content classification level, and the association between different content categories is taken into consideration, so that the generated user similarity is more accurate and reliable.

In one embodiment, the user similarity generating method further includes: and when the user similarity reaches a preset user similarity threshold, recommending the content corresponding to the other user identifier according to one user identifier of any two different user identifiers.

The preset user similarity threshold is a preset boundary value for deciding whether users are similar. When the user similarity between two users reaches the preset user similarity threshold, the two users can be determined to be similar, that is, their interest preferences are similar; when the user similarity does not reach the threshold, the two users can be determined not to be similar, that is, their interest preferences differ.

Specifically, when the user similarity calculated for any two different users reaches the preset user similarity threshold, the computer device may determine that the two users are similar. Content that interests one of the users is then likely to interest the other, so the content corresponding to one of the users may be recommended to the other user. Such content recommendation may also be referred to as user-based collaborative filtering (UserCF) recommendation.

For example, fig. 4 is a schematic diagram illustrating the principle of content recommendation based on user similarity in one embodiment. Referring to fig. 4, the computer device may calculate the user similarity between user 1 and user 2 in the user similarity generating manner provided by the embodiments of the present application. When the user similarity between the two users reaches the preset user similarity threshold, the two users can be considered similar users. The news viewed by the two users can then be recommended to each other, that is, the news viewed by user 1 is recommended to user 2, and the news viewed by user 2 is recommended to user 1.

When content recommendation is performed among similar users, the content recommended to the user may be content that is not viewed by the user but is viewed by the similar users. Referring to fig. 5, assume that user 1, when browsing news 510, has an interest in the news 510 and has further clicked to view it. Then when a content recommendation is needed for user 2, news 510 may be recommended to user 2 if user 2 is a similar user to user 1.
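
The threshold-based recommendation between similar users can be sketched as follows. This is only an illustration under stated assumptions: the function `recommend`, the dictionary layout, the user and news identifiers, and the threshold value 0.6 are all hypothetical, and the user similarities are taken as already computed.

```python
def recommend(target, viewed, similarity, threshold):
    """Return content viewed by users similar to `target` that `target` has not viewed."""
    recommended = set()
    for other, items in viewed.items():
        if other == target:
            continue
        pair = frozenset((target, other))
        # "Reaches the threshold" is interpreted here as >= threshold.
        if similarity.get(pair, 0.0) >= threshold:
            recommended |= items - viewed[target]
    return recommended


# Hypothetical viewing records and precomputed user similarities.
viewed = {
    "user1": {"news510", "news511"},
    "user2": {"news511"},
    "user3": {"news900"},
}
similarity = {
    frozenset(("user1", "user2")): 0.8,
    frozenset(("user2", "user3")): 0.1,
}

result = recommend("user2", viewed, similarity, threshold=0.6)
print(result)
```

Here user 2 receives the news that the similar user 1 viewed and user 2 has not, while the content of the dissimilar user 3 is not recommended.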

In addition, when the user similarity generating method provided by the embodiments of the application was applied in an online recommendation test of a recommendation application, the recall rate improved substantially, from 40% to 90%.

In this embodiment, the user similarity obtained by taking into account both the behavior characteristics of users under each content category and the correlation between content categories is more accurate and reliable, so that the effect of subsequent content recommendation based on the user similarity is greatly improved.

In one embodiment, S210 includes: performing singular value decomposition on the correlation matrix to obtain a spatial variation matrix; mapping the sparse user characteristic vector into a dense target vector through a space variation matrix; and generating user similarity according to the target vectors corresponding to the at least two user identifications.

Wherein a spatial variation matrix is a matrix for mapping vectors in one space to vectors in another space. In this embodiment, the user feature vector is mapped to a vector space with the columns of the spatial variation matrix as basis vectors to obtain a new target vector.

Specifically, after obtaining a correlation matrix between more than one content category, the computer device may perform singular value decomposition on the correlation matrix as shown in the following equation, since the correlation matrix is a symmetric matrix:

$R = Q \Lambda Q^{T}$ (5)

wherein R is the correlation matrix, Q is an orthogonal matrix composed of eigenvectors, and Λ is a real diagonal matrix whose diagonal elements are the eigenvalues. If the number of content categories in S202 is n, then R, Q and Λ are all matrices of dimension n × n.
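
The decomposition in equation (5) can be checked numerically with a short sketch. It assumes NumPy's `numpy.linalg.eigh`, which is intended for symmetric matrices; for a symmetric positive semidefinite matrix this eigendecomposition coincides with the singular value decomposition. The 3 × 3 matrix below is a hypothetical correlation matrix.

```python
import numpy as np

# Hypothetical symmetric correlation matrix for n = 3 content categories.
R = np.array([
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.3],
    [0.2, 0.3, 1.0],
])

# eigh returns real eigenvalues and orthonormal eigenvectors
# (the eigenvectors are the columns of Q).
eigenvalues, Q = np.linalg.eigh(R)
Lam = np.diag(eigenvalues)

reconstructed = Q @ Lam @ Q.T
print(np.allclose(reconstructed, R))     # checks R = Q Λ Q^T
print(np.allclose(Q.T @ Q, np.eye(3)))   # checks that Q is orthogonal
```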

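Further, the computer device may decompose equation (5) based on the operation rules of matrices as shown in the following equation:

$R = Q \Lambda Q^{T} = (Q \Lambda^{1/2})(Q \Lambda^{1/2})^{T} = \Delta \Delta^{T}$ (6)

wherein $\Delta = Q \Lambda^{1/2}$ is the spatial variation matrix. Then, combining formula (3) with formula (6) yields the following formula:

$\vec{u}_1 R \vec{u}_2^{T} = \vec{u}_1 \Delta \Delta^{T} \vec{u}_2^{T} = (\vec{u}_1 \Delta)(\vec{u}_2 \Delta)^{T}$ (7)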

Those skilled in the art will appreciate that the inner product of two vectors may be viewed as a projection of one vector onto the other. For example, in two-dimensional space, the projection of the vector (2,3) onto the vector (1,0) is (2,3)·(1,0) = 2×1 + 3×0 = 2, and the projection of the vector (2,3) onto the vector (0,1) is (2,3)·(0,1) = 2×0 + 3×1 = 3, so that the coordinates of the vector (2,3) are (2,3) in the coordinate system spanned by (1,0) and (0,1).

Assuming that the number of content categories in S202 is n, the vector space formed is an n-dimensional space. In formula (7), $\vec{u}_1 \Delta$ represents taking the inner product of $\vec{u}_1$ with each column vector of Δ respectively; the vector elements of the resulting new vector (i.e., $\vec{u}_1 \Delta$) can be regarded as the projections of $\vec{u}_1$ onto the column vectors of Δ respectively, that is, the new coordinates of $\vec{u}_1$ under the new basis vectors (the column vectors of Δ).

Based on this, it can be seen that the user similarity defined by equation (3) is in fact the dot product of the new vectors obtained by mapping the user feature vectors calculated by equation (1) into a new vector space using the matrix Δ. This is a User Similarity Measure based on Space Mapping (USMSM): first, the correlation between any two content categories is calculated to obtain a correlation matrix between the content categories; then the correlation matrix is decomposed, and the user feature vectors are mapped to a new space using the matrix obtained by the decomposition; finally, the user similarity is calculated using the new vectors in the new space.
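
The equivalence between the direct computation and the dot product in the mapped space can be verified with a small numerical sketch. It assumes the mapping matrix is taken as Δ = QΛ^{1/2} (one choice that satisfies R = ΔΔ^T) and that R is positive semidefinite so that the square root is real; the vectors `u1`, `u2` and the matrix `R` are hypothetical.

```python
import numpy as np

R = np.array([
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.3],
    [0.2, 0.3, 1.0],
])
u1 = np.array([0.7, 0.3, 0.0])   # hypothetical user feature vectors
u2 = np.array([0.0, 0.4, 0.6])

eigenvalues, Q = np.linalg.eigh(R)
# Assumption: R is positive semidefinite; tiny negative eigenvalues caused
# by rounding are clipped to zero so the square root stays real.
delta = Q @ np.diag(np.sqrt(np.clip(eigenvalues, 0.0, None)))

direct = u1 @ R @ u2                    # u1 R u2^T computed directly
mapped = (u1 @ delta) @ (u2 @ delta)    # dot product of the mapped vectors
print(np.isclose(direct, mapped))
```

The two values agree up to floating-point error, which is why the mapping matrix does not need to be computed when only the similarity value is required.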

For example, FIG. 6 shows a schematic diagram of generating user similarity in one embodiment. Referring to fig. 6, assuming that the content of a certain recommendation application is divided into n content categories, the computer device may count the number of views of user 1 and user 2 under the n content categories of the recommendation application, and normalize the counted data to obtain the user feature vector $\vec{u}_1$ of user 1 and the user feature vector $\vec{u}_2$ of user 2. The computer device may calculate the correlation for any two of the n content categories and combine the correlations to obtain a correlation matrix R between the n content categories. The computer device can then decompose the correlation matrix R to obtain the spatial variation matrix Δ; by the spatial variation matrix Δ, $\vec{u}_1$ and $\vec{u}_2$ are mapped to a new space to obtain the new vectors $\vec{u}_1 \Delta$ and $\vec{u}_2 \Delta$ of the new space; the user similarity between user 1 and user 2 is then calculated based on the formula $(\vec{u}_1 \Delta)(\vec{u}_2 \Delta)^{T}$, that is, based on the formula $\vec{u}_1 R \vec{u}_2^{T}$.

It is to be understood that the present embodiment explains why it is reasonable to define the user similarity using equation (3). In the process of actually generating the user similarity, after the user feature vectors corresponding to any two users and the correlation matrix between the content categories are obtained, the similarity between the two users can be calculated directly using formula (3), and the matrix Δ for mapping the user feature vectors to a new space does not need to be calculated.

Taken together, the embodiments of the application show that, first, the user similarity measure based on space mapping incorporates the correlation between content categories into the calculation of the user similarity, and therefore yields a more accurate and reliable user similarity than a calculation based on isolated content categories.

Second, the user feature vectors based on the number of views under each content category are sparse. This is because a user's interests are generally concentrated: the user focuses on content under a few specific content categories and pays no attention to content under the other content categories. The vector elements corresponding to the content categories of no interest are likely to be 0 in the user feature vector. For example, a user who loves sports may pay attention to content categories such as sports and fitness, but not to content categories such as fashion, art and pets; the vector elements corresponding to the latter content categories in this user's feature vector are then mostly 0, and a vector containing many zeros is sparse. The value of the i-th dimension of the vector $\vec{u} \Delta$, obtained by mapping the user feature vector $\vec{u}$ with the matrix Δ, is the product of $\vec{u}$ and the i-th column of the matrix Δ; as long as $\vec{u}$ is not perpendicular to the i-th column of the matrix Δ, this product is not 0, that is, $\vec{u} \Delta$ is dense. The densified vector makes the calculation of user similarity more robust.
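
The densifying effect of the mapping can be illustrated with a short sketch, again assuming Δ = QΛ^{1/2} with a positive semidefinite R. The matrix and the one-hot feature vector below are hypothetical.

```python
import numpy as np

R = np.array([
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.3],
    [0.2, 0.3, 1.0],
])
eigenvalues, Q = np.linalg.eigh(R)
# Assumption: R is positive semidefinite (negative rounding noise is clipped).
delta = Q @ np.diag(np.sqrt(np.clip(eigenvalues, 0.0, None)))

sparse_u = np.array([1.0, 0.0, 0.0])   # user who views a single category
dense_u = sparse_u @ delta             # i-th entry = sparse_u · (i-th column of delta)

nonzero_before = int(np.count_nonzero(sparse_u))
nonzero_after = int(np.count_nonzero(np.abs(dense_u) > 1e-12))
print(nonzero_before, nonzero_after)
```

Each entry of the mapped vector is zero only if the feature vector is perpendicular to the corresponding column of the mapping matrix, which does not happen for this example.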

In addition, the matrix Q obtained by decomposing the correlation matrix R is an orthogonal matrix and Λ is a real diagonal matrix, so the matrix Δ is a full-rank matrix. The new vector obtained by mapping with the matrix Δ therefore retains as much information as possible.

In a specific embodiment, a news recommendation scenario in a news recommendation application is taken as an example. Referring to fig. 7, when news recommendation is required for a target user, the computer device may execute S702 to obtain a target user identifier (which identifies the target user) and the viewing times corresponding to the target user identifier under each news site in the news recommendation application. Based on site-access user collaborative filtering (Site-UCF), the computer device executes S704: the viewing times corresponding to the target user identifier under each news site are used as vector elements; the vector elements are combined to obtain an initial feature vector corresponding to the target user identifier; and the initial feature vector is normalized to obtain the user feature vector corresponding to the target user identifier.

The computer device may execute S706 in parallel to select more than one reference user identifier and obtain the viewing times corresponding to each reference user identifier under each news site in the news recommendation application. Based on site-access user collaborative filtering (Site-UCF), the computer device executes S708: the viewing times corresponding to each reference user identifier under each news site are used as vector elements; the vector elements are combined to obtain an initial feature vector corresponding to each reference user identifier; and each initial feature vector is normalized to obtain the user feature vector corresponding to each reference user identifier.
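
The construction of a user feature vector in S704 and S708 can be sketched as follows. The normalization scheme is an assumption made for illustration (division by the user's total number of views); the function name `user_feature_vector` and the site names are hypothetical.

```python
def user_feature_vector(view_counts, sites):
    """Build a normalized feature vector from per-site view counts."""
    # One vector element per news site, in a fixed site order.
    initial = [float(view_counts.get(site, 0)) for site in sites]
    total = sum(initial)
    if total == 0:
        return initial  # user with no views: keep the all-zero vector
    # Assumption: normalize by the total number of views.
    return [count / total for count in initial]


sites = ["sports", "finance", "tech", "fashion"]   # hypothetical news sites
target_views = {"sports": 6, "tech": 2}

vector = user_feature_vector(target_views, sites)
print(vector)
```

The same function would be applied to the target user and to every reference user so that all feature vectors share the same site order.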

The computer device may further execute S710 in parallel, and determine, for any two news sites in each news site in the news recommendation application program, an intersection of news under any two news sites and a union of news under any two news sites; continuing to execute S712, and taking the ratio of the intersection to the union as the correlation between any two news sites; and combining the relevancy degrees to obtain a relevancy degree matrix among news sites in the news recommendation application program.
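
Steps S710 and S712 can be sketched in plain Python, taking the ratio of the intersection to the union as the ratio of their sizes. The helper names `jaccard` and `correlation_matrix`, the site names, and the news identifiers are hypothetical.

```python
def jaccard(a, b):
    """Ratio of the intersection size to the union size of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def correlation_matrix(site_news, sites):
    """Combine the pairwise correlations into a symmetric matrix."""
    return [[jaccard(site_news[s], site_news[t]) for t in sites] for s in sites]


# Hypothetical news identifiers published under each news site.
site_news = {
    "sports": {"n1", "n2", "n3"},
    "fitness": {"n2", "n3", "n4"},
    "fashion": {"n5"},
}
sites = ["sports", "fitness", "fashion"]

R = correlation_matrix(site_news, sites)
for row in R:
    print(row)
```

Sites that share news obtain a correlation between 0 and 1, a site is fully correlated with itself, and sites with no common news have a correlation of 0.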

Thus, the computer device may execute S714 to generate the user similarity between the target user and a reference user according to the user feature vector corresponding to the target user identifier, the correlation matrix between the news sites, and the user feature vector corresponding to that reference user identifier. Then, the computer device may execute S716 and recommend, to the terminal corresponding to the target user identifier, the news corresponding to each reference user identifier whose user similarity reaches the preset user similarity threshold.

The operation between any two user feature vectors and the correlation matrix can be decomposed into the operation between new vectors obtained by mapping the two user feature vectors to another vector space by the same matrix. In principle, it can be understood that: the method is based on a collaborative filtering algorithm of user visiting sites and a user similarity measurement method based on space mapping, namely, vectors of user site levels are mapped into new vectors for similarity calculation by decomposing a correlation matrix among sites.

It should be understood that, although the steps in the flowcharts of the above embodiments are shown in sequence as indicated by the arrows, the steps are not necessarily executed in that sequence. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited to the order shown, and the steps may be performed in other orders. Moreover, at least some of the steps in the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

As shown in fig. 8, in one embodiment, a user similarity generating apparatus 800 is provided. Referring to fig. 8, the user similarity generating apparatus 800 includes: an acquisition module 801, a first generation module 802, a determination module 803, a combination module 804, and a second generation module 805.

An obtaining module 801, configured to obtain the viewing times corresponding to the user identifiers in more than one content category.

A first generating module 802, configured to generate a user feature vector corresponding to the user identifier according to the number of viewing times.

A determining module 803, configured to determine a correlation between any two content categories of the more than one content categories, respectively.

A combining module 804, configured to combine the correlations to obtain a correlation matrix between more than one content category.

A second generating module 805, configured to generate a user similarity according to the correlation matrix and the user feature vectors corresponding to the at least two user identifiers.

In one embodiment, the first generating module 802 is further configured to use the viewing times corresponding to the user identifier under each content category as a vector element; combining each vector element to obtain an initial feature vector corresponding to the user identifier; and normalizing the initial feature vector to obtain a user feature vector corresponding to the user identifier.

In one embodiment, the determining module 803 is further configured to determine, for any two content categories of the more than one content categories, a degree of correlation between any two content categories according to a degree of overlapping of contents under any two content categories.

In one embodiment, the determining module 803 is further configured to determine, for any two content categories of the more than one content categories, an intersection of the contents under any two content categories and a union of the contents under any two content categories; the ratio of the intersection to the union is taken as the degree of correlation between any two content categories.

In one embodiment, the determining module 803 is further configured to, for any two content categories of the more than one content categories, respectively determine content vectors corresponding to the contents in any two content categories; calculating content category vectors corresponding to the content categories according to content vectors corresponding to the content under any two content categories; and taking the correlation degree between the content category vectors corresponding to the arbitrary two content categories as the correlation degree between the arbitrary two content categories.
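
The vector-based variant of the correlation can be sketched as follows. The aggregation and the correlation measure are assumptions made for illustration: the content category vector is taken as the mean of the content vectors under the category, and the correlation is taken as the cosine similarity between two content category vectors. The function names and the 2-dimensional content vectors are hypothetical.

```python
import math


def category_vector(content_vectors):
    """Average the content vectors under one category (assumed aggregation)."""
    dim = len(content_vectors[0])
    count = len(content_vectors)
    return [sum(v[k] for v in content_vectors) / count for k in range(dim)]


def cosine(a, b):
    """Cosine similarity, used here as the assumed correlation measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical 2-dimensional content vectors under two content categories.
sports = [[1.0, 0.0], [1.0, 0.0]]
fitness = [[1.0, 1.0], [1.0, 1.0]]

corr = cosine(category_vector(sports), category_vector(fitness))
print(round(corr, 4))
```

Other aggregations (for example a weighted mean) or other correlation measures could be substituted without changing how the resulting values are combined into the correlation matrix.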

In an embodiment, the second generating module 805 is further configured to, for any two different user identifiers, multiply the user feature vector corresponding to one of the user identifiers by the correlation matrix, and then multiply by the transpose of the user feature vector corresponding to the other user identifier, so as to obtain the user similarity corresponding to any two different user identifiers.

In one embodiment, the second generation module 805 is further configured to generate the user similarity according to the following formula:

$\mathrm{sim}(u_1, u_2) = \vec{u}_1 R \vec{u}_2^{T}$

wherein $\vec{u}_1$ is the user feature vector corresponding to one of any two different user identifiers, $\vec{u}_2$ is the user feature vector corresponding to the other of the two different user identifiers, $\vec{u}_2^{T}$ is the transpose of $\vec{u}_2$, and R is the correlation matrix.

As shown in fig. 9, in one embodiment, the user similarity generating apparatus 800 further includes: a recommending module 806, configured to recommend, when the user similarity reaches a preset user similarity threshold, the content corresponding to the other user identifier according to one user identifier of any two different user identifiers.

In an embodiment, the second generating module 805 is further configured to perform singular value decomposition on the correlation matrix to obtain a spatial variation matrix; mapping the sparse user characteristic vector into a dense target vector through a space variation matrix; and generating user similarity according to the target vectors corresponding to the at least two user identifications.

After the user similarity generating device 800 obtains the viewing times corresponding to the user identifiers under more than one content category, it generates the user feature vectors corresponding to the user identifiers according to the viewing times. Highly interpretable user feature vectors are thus constructed at the level of macroscopic content classification, with click frequency as the vector elements, which avoids the influence of personalized differences. Then, the degree of correlation between any two of the content categories is determined, the correlations are combined to obtain a correlation matrix between the content categories, and the user similarity is generated according to the correlation matrix and the user feature vectors corresponding to at least two user identifiers. The user similarity calculation is thereby raised to the level of macroscopic content classification and takes the association between different content categories into account, so that the generated user similarity is more accurate and reliable.

FIG. 10 is a diagram illustrating an internal structure of a computer device in one embodiment. The computer device may specifically be the terminal 110 or the server 120 in fig. 1. As shown in fig. 10, the computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the user similarity generating method. The internal memory may also have a computer program stored therein, which when executed by the processor, causes the processor to perform the user similarity generating method. Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, the user similarity generating apparatus provided in the present application may be implemented in the form of a computer program, and the computer program may be run on a computer device as shown in fig. 10. The memory of the computer device may store various program modules constituting the user similarity generating apparatus, such as the acquisition module 801, the first generation module 802, the determination module 803, the combination module 804, and the second generation module 805 shown in fig. 8. The computer program constituted by the respective program modules causes the processor to execute the steps in the user similarity generation method of the embodiments of the present application described in the present specification.

For example, the computer device shown in fig. 10 may, through the obtaining module 801 in the user similarity generating apparatus shown in fig. 8, obtain the viewing times corresponding to the user identifiers under more than one content category. The first generating module 802 generates the user feature vector corresponding to the user identifier according to the viewing times. The determining module 803 determines the degree of correlation between any two content categories of the more than one content categories. The combining module 804 combines the correlations to obtain a correlation matrix between the more than one content categories. The second generating module 805 generates the user similarity according to the correlation matrix and the user feature vectors corresponding to the at least two user identifiers.

In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the user similarity generation method described above. Here, the steps of the user similarity generating method may be steps in the user similarity generating methods of the above embodiments.

In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the steps of the user similarity generation method described above. Here, the steps of the user similarity generating method may be steps in the user similarity generating methods of the above embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A user similarity generating method comprises the following steps:

2. The method of claim 1, wherein generating the user feature vector corresponding to the user identifier according to the number of viewing times comprises:

viewing times corresponding to the user identification under each content category are used as vector elements;

combining each vector element to obtain an initial feature vector corresponding to the user identifier;

normalizing the initial feature vector to obtain a user feature vector corresponding to the user identifier.

3. The method of claim 1, wherein the separately determining the degree of correlation between any two of the more than one content categories comprises:

and for any two content categories in the more than one content categories, determining the correlation degree between the any two content categories according to the overlapping degree of the contents under the any two content categories.

4. The method according to claim 3, wherein the determining, for any two content categories of the more than one content categories, a degree of correlation between the any two content categories according to a degree of overlapping of contents under the any two content categories comprises:

for any two content categories of the more than one content categories, determining an intersection of the contents under the any two content categories and a union of the contents under the any two content categories;

and taking the ratio of the intersection to the union as the correlation degree between any two content categories.

5. The method of claim 1, wherein the separately determining the degree of correlation between any two of the more than one content categories comprises:

for any two content categories of the more than one content categories, respectively determining content vectors corresponding to the contents under the any two content categories;

calculating content category vectors corresponding to the content categories according to the content vectors corresponding to the content under the two content categories;

and taking the correlation degree between the content category vectors corresponding to the arbitrary two content categories as the correlation degree between the arbitrary two content categories.

6. The method according to claim 1, wherein the generating user similarity according to the correlation matrix and the user feature vectors corresponding to the at least two user identifiers comprises:

for any two different user identifications, multiplying the user characteristic vector corresponding to one user identification by the correlation matrix, and then multiplying the user characteristic vector by the transpose of the user characteristic vector corresponding to the other user identification to obtain the user similarity corresponding to any two different user identifications.

7. The method of claim 6, wherein the user similarity is generated according to the following formula:

$\mathrm{sim}(u_1, u_2) = \vec{u}_1 R \vec{u}_2^{T}$

wherein $\vec{u}_1$ is the user feature vector corresponding to one of the two different user identifiers, $\vec{u}_2$ is the user feature vector corresponding to the other of the two different user identifiers, $\vec{u}_2^{T}$ is the transpose of $\vec{u}_2$, and R is the correlation matrix.

8. The method of claim 6, further comprising:

and when the user similarity reaches a preset user similarity threshold, recommending the content corresponding to the other user identifier according to one user identifier of the two different user identifiers.

9. The method according to claim 1, wherein the generating user similarity according to the correlation matrix and the user feature vectors corresponding to the at least two user identifiers comprises:

performing singular value decomposition on the correlation matrix to obtain a spatial variation matrix;

mapping the sparse user characteristic vector into a dense target vector through the space variation matrix;

and generating user similarity according to the target vectors corresponding to the at least two user identifications.

10. A user similarity generating apparatus comprising:

11. The apparatus of claim 10, wherein the first generating module is further configured to use the number of views corresponding to the user identifier under each of the content categories as a vector element; combining each vector element to obtain an initial feature vector corresponding to the user identifier; normalizing the initial feature vector to obtain a user feature vector corresponding to the user identifier.

12. The apparatus of claim 10, wherein the determining module is further configured to determine, for any two content categories of the more than one content categories, a degree of correlation between the any two content categories according to a degree of overlapping of contents under the any two content categories.

13. The apparatus of claim 10, wherein the second generating module is further configured to, for any two different user identifiers, multiply the user feature vector corresponding to one of the user identifiers by the correlation matrix, and then multiply by the transpose of the user feature vector corresponding to the other user identifier, so as to obtain the user similarity corresponding to the any two different user identifiers.

14. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 9.

15. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 9.