CN104063445A

CN104063445A - Method and system for measuring similarity

Info

Publication number: CN104063445A
Application number: CN201410267170.6A
Authority: CN
Inventors: 朱宝
Original assignee: Baidu Mobile Network Technology (beijing) Co Ltd
Current assignee: Baidu Mobile Network Technology (beijing) Co Ltd
Priority date: 2014-06-16
Filing date: 2014-06-16
Publication date: 2014-09-24
Anticipated expiration: 2034-06-16
Also published as: CN104063445B

Abstract

The invention relates to a method and system for measuring similarity. The method comprises: a data acquisition step of acquiring behavioral data of users and feature data of items; a behavior-based similarity calculation step of calculating the similarity between items from the behavioral data; a feature-based similarity calculation step of calculating the similarity between items from the feature data; and a similarity synthesis step of using a Bayesian formula to synthesize the similarity obtained from the behavioral data with the similarity obtained from the feature data.

Description

A method and system for similarity measurement

Technical field

The present invention relates to the field of information processing, and in particular to a method and system for similarity measurement in that field.

Background

Similarity measurement is currently used in many fields, for example in the Internet industry, and similarity analysis is carried out with a variety of existing similarity measurement methods.

Existing similarity measurements fall into two classes. The first is based on behavioral data, such as the methods for computing object similarity in matrix factorization and collaborative filtering. The second computes similarity from feature data and uses user behavior to learn the feature similarity, as with genetic algorithms. Both classes share the same limitation: they consider either behavioral data alone or object features alone when computing similarity. Neither makes combined use of behavioral data and feature data to obtain the best similarity result.

Summary of the invention

The present invention has been made in view of the above problem. Its object is to provide a similarity measurement method and system that effectively synthesize the similarity result based on object features with the similarity result based on behavioral data.

A similarity measurement method according to the present invention comprises the following steps: a data acquisition step of obtaining behavioral data of users and feature data of items; a behavior-based similarity calculation step of calculating the similarity between items based on the behavioral data; a feature-based similarity calculation step of calculating the similarity between items based on the feature data; and a similarity synthesis step of synthesizing the similarity obtained from the behavioral data and the similarity obtained from the feature data using the following Bayesian formula,

{sim}^{'''} (b_{i}, b_{j}) = \frac{{sim}^{'} (b_{i}, b_{j}) * {sim}^{''} (b_{j}, b_{i})}{\underset{j}{Σ} {sim}^{'} (b_{i}, b_{j}) * {sim}^{''} (b_{j}, b_{i})}

Here b_i and b_j denote items, with indices i, j = 1, 2, .... The prior probability density sim'(b_i, b_j) is the similarity between item b_i and item b_j based on the feature data; the conditional probability density sim''(b_j, b_i) is the similarity between item b_j and item b_i based on the behavioral data; and sim'''(b_i, b_j) is the Bayesian similarity between item b_i and item b_j after synthesis.
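The Bayesian synthesis can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function name `bayes_combine`, the uniform fallback for an all-zero row, and the example numbers are assumptions that do not appear in the source.

```python
from typing import List

Matrix = List[List[float]]


def bayes_combine(sim_feature: Matrix, sim_behavior: Matrix) -> Matrix:
    """Combine feature-based and behavior-based item similarities.

    sim_feature[i][j]  plays the role of the prior sim'(b_i, b_j).
    sim_behavior[j][i] plays the role of the conditional sim''(b_j, b_i).
    Each output row is normalized over j, as in the Bayesian formula.
    """
    n = len(sim_feature)
    combined: Matrix = []
    for i in range(n):
        # Numerator terms sim'(b_i, b_j) * sim''(b_j, b_i) for every j.
        products = [sim_feature[i][j] * sim_behavior[j][i] for j in range(n)]
        total = sum(products)
        if total == 0.0:
            # Assumption (not in the source): use a uniform row when every
            # product is zero, to avoid dividing by zero.
            combined.append([1.0 / n] * n)
        else:
            combined.append([p / total for p in products])
    return combined


# Hypothetical 3-item example; the numbers are illustrative only.
feature = [[0.6, 0.3, 0.1],
           [0.3, 0.5, 0.2],
           [0.1, 0.2, 0.7]]
behavior = [[0.5, 0.4, 0.1],
            [0.2, 0.6, 0.2],
            [0.1, 0.3, 0.6]]
posterior = bayes_combine(feature, behavior)
```

Because of the normalization over j, every row of `posterior` sums to 1, so each row can be read as a distribution over candidate similar items.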

In the above similarity measurement method, the behavior-based similarity calculation step may comprise the following steps: generating, from the obtained behavioral data, a user-item relational matrix and an item-user relational matrix; generating, from these two relational matrices, a user-to-item probability matrix and an item-to-user probability matrix; and multiplying the item-to-user probability matrix by the user-to-item probability matrix to obtain the item-to-item similarity matrix.

Alternatively, in the behavior-based similarity calculation step, given users a in the obtained user set, items b in the obtained item set, and the number sim(a, b) of operations performed by user a on item b, counted without distinguishing the type of operation, the similarity sim''(b_j, b_i) between items b_j and b_i of the item set may be calculated by the following formula to generate the similarity matrix,

\begin{matrix} {sim}^{''} (b_{j}, b_{i}) = \\ k * \underset{m}{Σ} (\frac{sim (a_{m}, b_{j}) * sim (a_{m}, b_{i})}{\underset{n}{Σ} sim (a_{m}, b_{n}) * \underset{n}{Σ} sim (a_{m}, b_{n}) * \underset{n}{Σ} sim (a_{n}, b_{j}) * \underset{n}{Σ} sim (a_{n}, b_{i})}) \end{matrix}

Here i, j, m and n are indices of elements in the sets, and k is a normalization factor.

In the above similarity measurement method, the similarity matrix may be fed through the same similarity-matrix calculation once more, yielding an enhanced item-to-item similarity matrix with strengthened similarity associations, which is then used as the behavior-based similarity between item b_j and item b_i.

The above similarity measurement method may further comprise, before the behavior-based similarity is calculated, a white noise compensation step: for any user whose number of operations on items is below a predetermined number, the number of operations is supplemented up to the predetermined number.
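The source does not say how the supplemented operations are distributed over the items. The following is a minimal sketch that assumes the deficit is spread evenly over all items; the function name `compensate_counts` and the parameter `min_ops` are hypothetical.

```python
from typing import List


def compensate_counts(counts: List[List[float]], min_ops: float) -> List[List[float]]:
    """Pad the operation counts of low-activity users up to min_ops.

    counts[u][i] is the number of operations by user u on item i.
    Assumption (not in the source): the missing operations are spread
    evenly over all items, like uniform white noise.
    """
    n_items = len(counts[0])
    result: List[List[float]] = []
    for row in counts:
        deficit = min_ops - sum(row)
        if deficit > 0:
            pad = deficit / n_items
            result.append([c + pad for c in row])
        else:
            # Users at or above the threshold are left unchanged.
            result.append(list(row))
    return result


# Hypothetical counts for three users and four items.
raw = [[0, 0, 0, 0],
       [1, 0, 0, 0],
       [3, 2, 0, 0]]
padded = compensate_counts(raw, min_ops=4)
```

Under this assumption a user with no history ends up with a flat profile, while a user with a little history keeps the relative preference already observed.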

In the above similarity measurement method, the feature-based similarity calculation step may comprise the following steps: generating, from the obtained feature data, an item-attribute relational matrix and an attribute-item relational matrix; generating, from these two relational matrices, an item-to-attribute probability matrix and an attribute-to-item probability matrix; and multiplying the item-to-attribute probability matrix by the attribute-to-item probability matrix to obtain the item-to-item similarity matrix.

Alternatively, in the feature-based similarity calculation step, given items b in the obtained item set, known attributes c in the attribute set, and the value sim(c, b) of known attribute c for item b, the similarity sim'(b_i, b_j) between items b_i and b_j of the item set may be calculated by the following formula to generate the item-to-item similarity matrix for known attributes,

\begin{matrix} {sim}^{'} (b_{i}, b_{j}) = \\ k * \underset{m}{Σ} (\frac{sim (c_{m}, b_{i}) * sim (c_{m}, b_{j})}{\underset{n}{Σ} sim (c_{m}, b_{n}) * \underset{n}{Σ} sim (c_{m}, b_{n}) * \underset{n}{Σ} sim (c_{n}, b_{i}) * \underset{n}{Σ} sim (c_{n}, b_{j})}) \end{matrix}

The above similarity measurement method may further comprise a white noise compensation step for the feature-based similarity: for unknown attributes, the similarities between any item and the other items are set to identical values that sum to 1, giving a white noise compensation matrix of item-to-item similarity for unknown attributes; the item-to-item similarity matrix for known attributes and this white noise compensation matrix are then summed in a predetermined ratio to give the white-noise-compensated feature-based similarity matrix.
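One possible reading of this step is sketched below under two assumptions that are not in the source: the white noise matrix gives every entry of a row the value 1/n (including the item itself), and "summed in a predetermined ratio" is taken to mean a convex combination controlled by a hypothetical parameter `noise_ratio`.

```python
from typing import List


def blend_with_white_noise(sim_known: List[List[float]],
                           noise_ratio: float) -> List[List[float]]:
    """Mix a known-attribute similarity matrix with a uniform matrix.

    Assumptions (not in the source): every white-noise entry equals 1/n,
    so each noise row sums to 1, and the two matrices are combined as
    (1 - noise_ratio) * known + noise_ratio * noise.
    """
    n = len(sim_known)
    uniform = 1.0 / n
    return [[(1.0 - noise_ratio) * s + noise_ratio * uniform for s in row]
            for row in sim_known]


# Hypothetical similarity matrix for three items with known attributes.
known = [[0.7, 0.2, 0.1],
         [0.2, 0.6, 0.2],
         [0.1, 0.3, 0.6]]
blended = blend_with_white_noise(known, noise_ratio=0.1)
```

With this choice, rows that summed to 1 before blending still sum to 1 afterwards, and no pair of items is left with a similarity of exactly zero.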

A similarity measurement system according to the present invention comprises: a data acquisition unit that obtains behavioral data of users and feature data of items; a behavior-based similarity calculation unit that calculates the similarity between items based on the behavioral data; a feature-based similarity calculation unit that calculates the similarity between items based on the feature data; and a similarity synthesis unit that synthesizes the similarity obtained from the behavioral data and the similarity obtained from the feature data using the following Bayesian formula,

{sim}^{'''} (b_{i}, b_{j}) = \frac{{sim}^{'} (b_{i}, b_{j}) * {sim}^{''} (b_{j}, b_{i})}{\underset{j}{Σ} {sim}^{'} (b_{i}, b_{j}) * {sim}^{''} (b_{j}, b_{i})}

Here b_i and b_j denote items, and the indices i and j are positive integers. The prior probability density sim'(b_i, b_j) is the similarity between item b_i and item b_j based on the feature data; the conditional probability density sim''(b_j, b_i) is the similarity between item b_j and item b_i based on the behavioral data; and sim'''(b_i, b_j) is the Bayesian similarity between item b_i and item b_j after synthesis.

In the above similarity measurement system, the behavior-based similarity calculation unit may comprise: a mathematical modeling unit that generates, from the obtained behavioral data, a user-item relational matrix and an item-user relational matrix; a probability matrix generation unit that generates, from these two relational matrices, a user-to-item probability matrix and an item-to-user probability matrix; and a similarity calculation unit that multiplies the item-to-user probability matrix by the user-to-item probability matrix to obtain the item-to-item similarity matrix.

Alternatively, in the behavior-based similarity calculation unit, given users a in the obtained user set, items b in the obtained item set, and the number sim(a, b) of operations performed by user a on item b, counted without distinguishing the type of operation, the similarity sim''(b_j, b_i) between items b_j and b_i of the item set may be calculated by the following formula to generate the similarity matrix,

\begin{matrix} {sim}^{''} (b_{j}, b_{i}) = \\ k * \underset{m}{Σ} (\frac{sim (a_{m}, b_{j}) * sim (a_{m}, b_{i})}{\underset{n}{Σ} sim (a_{m}, b_{n}) * \underset{n}{Σ} sim (a_{m}, b_{n}) * \underset{n}{Σ} sim (a_{n}, b_{j}) * \underset{n}{Σ} sim (a_{n}, b_{i})}) \end{matrix}

The above similarity measurement system may further comprise a similarity enhancement unit that feeds the similarity matrix through the calculation of the similarity calculation unit once more, yielding an enhanced item-to-item similarity matrix with strengthened similarity associations, which is then used as the behavior-based similarity between item b_j and item b_i.

The above similarity measurement system may further comprise a white noise compensation unit for the behavioral data: before the similarity calculation unit calculates the behavior-based similarity, for any user whose number of operations on items is below a predetermined number, the number of operations is supplemented up to the predetermined number.

In the above similarity measurement system, the feature-based similarity calculation unit may comprise: a mathematical modeling unit that generates, from the obtained feature data, an item-attribute relational matrix and an attribute-item relational matrix; a probability generation unit that generates, from these two relational matrices, an item-to-attribute probability matrix and an attribute-to-item probability matrix; and a similarity calculation unit that multiplies the item-to-attribute probability matrix by the attribute-to-item probability matrix to obtain the item-to-item similarity matrix.

Alternatively, in the feature-based similarity calculation unit, given items b in the obtained item set, known attributes c in the attribute set, and the value sim(c, b) of known attribute c for item b, the similarity sim'(b_i, b_j) between items b_i and b_j of the item set may be calculated by the following formula to generate the item-to-item similarity matrix for known attributes,

\begin{matrix} {sim}^{'} (b_{i}, b_{j}) = \\ k * \underset{m}{Σ} (\frac{sim (c_{m}, b_{i}) * sim (c_{m}, b_{j})}{\underset{n}{Σ} sim (c_{m}, b_{n}) * \underset{n}{Σ} sim (c_{m}, b_{n}) * \underset{n}{Σ} sim (c_{n}, b_{i}) * \underset{n}{Σ} sim (c_{n}, b_{j})}) \end{matrix}

The above similarity measurement system may further comprise a white noise compensation unit for the feature-based similarity: for unknown attributes, it sets the similarities between any item and the other items to identical values that sum to 1, giving a white noise compensation matrix of item-to-item similarity for unknown attributes; it then sums the item-to-item similarity matrix for known attributes and this white noise compensation matrix in a predetermined ratio to give the white-noise-compensated feature-based similarity matrix.

With the above similarity measurement method and system, a similarity measurement result can be obtained that takes into account both the similarity based on feature data and the similarity based on behavioral data.

Brief description of the drawings

Fig. 1 is a flowchart of the similarity measurement method of Embodiment 1;

Fig. 2 is a flowchart of the similarity measurement method of Embodiment 2;

Fig. 3 is a block diagram of the similarity measurement system;

Fig. 4 is a flowchart of the method for enhancing similarity association of Embodiment 1;

Fig. 5 is a flowchart of the method for enhancing similarity association of Embodiment 2;

Fig. 6 is a block diagram of the similarity measurement system with enhanced similarity association;

Fig. 7 is a flowchart of another similarity measurement method;

Fig. 8 is a block diagram of another similarity measurement system;

Fig. 9 is a flowchart of one white noise compensation method;

Fig. 10 is a flowchart of another white noise compensation method;

Fig. 11 is a flowchart of the method for Bayesian synthesis of the behavior-based similarity and the feature-based similarity;

Fig. 12 is a flowchart of calculating the behavior-based similarity;

Fig. 13 is a flowchart of calculating the feature-based similarity;

Fig. 14 is a block diagram of the system for Bayesian synthesis of the behavior-based similarity and the feature-based similarity.

Detailed description of the embodiments

In personalized recommendation, the users, the items, and the users' operation history on items are known. The following describes how to calculate the similarity between users, or between items, when the attribute vectors of the users and items are unknown.

Similarity calculation when attribute vector values are uniformly distributed

The present invention provides a new definition of similarity, introduced first for the case in which attribute vector values are uniformly distributed from negative infinity to positive infinity.

An object can be described by an n-dimensional attribute vector. Let the attribute vector of object a be [a[1], a[2], a[3], ..., a[n]] and that of object b be [b[1], b[2], b[3], ..., b[n]]. Then sim(a, b) denotes the similarity between object a and object b for a given weight k and variance vector [δ²[1], δ²[2], δ²[3], ..., δ²[n]].

sim (a, b) = k \cdot \prod_{i = 1}^{n} \frac{1}{\sqrt{2 π} δ [i]} e^{(- \frac{{(a [i] - b [i])}^{2}}{2 δ {[i]}^{2}})}

Formula 1
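When the attribute vectors are known, Formula 1 can be evaluated directly. The sketch below is illustrative only; the function name `gaussian_similarity` and its parameter names are hypothetical, and `sigma` holds the standard deviations δ[i] (the square roots of the variances).

```python
import math
from typing import Sequence


def gaussian_similarity(a: Sequence[float],
                        b: Sequence[float],
                        sigma: Sequence[float],
                        k: float = 1.0) -> float:
    """Evaluate Formula 1 for two attribute vectors of equal length.

    Each attribute contributes the normal density with mean a[i] and
    standard deviation sigma[i], evaluated at b[i]; the contributions
    are multiplied together and scaled by the weight k.
    """
    if not (len(a) == len(b) == len(sigma)):
        raise ValueError("a, b and sigma must have the same length")
    result = k
    for a_i, b_i, s_i in zip(a, b, sigma):
        exponent = -((a_i - b_i) ** 2) / (2.0 * s_i ** 2)
        result *= math.exp(exponent) / (math.sqrt(2.0 * math.pi) * s_i)
    return result


# Hypothetical two-attribute objects.
same = gaussian_similarity([1.0, 2.0], [1.0, 2.0], [1.0, 1.0])
far = gaussian_similarity([1.0, 2.0], [3.0, 0.0], [1.0, 1.0])
```

The value is symmetric in `a` and `b` and is largest when the two vectors coincide, which matches the intuition of a similarity measure.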

For example, with a single attribute whose value is uniformly distributed from negative to positive infinity, the similarity between object a and object b is the probability density at b[1] of an attribute variable x that follows the normal distribution N(a[1], δ[1]²). When the attribute vectors are unknown, this density cannot be computed directly from the normal distribution formula, but it can be computed from existing operation history data. Even with unknown attribute vectors, the convolution of normal distributions is again a normal distribution, and this property can be used to establish the association between objects, so that the similarity is obtained from probability density values that can actually be measured. This definition of similarity is therefore useful for similarity analysis of items or users whose attributes are hidden.

When the attribute vectors of the objects are known, the similarity follows by substituting the vector values.

For the case of unknown attribute vectors, the following examples are given.

Embodiment 1

We first take the continuous case as an example, with all given weights equal to 1. The example is book recommendation in an online bookstore, and the similarity measurement method is described with reference to Fig. 1. First, as shown in step S1, the server collects all user information and all book information of the online bookstore, together with the full history of users clicking to read books. Let the set of all books of the online bookstore be M (m1, m2, ...) and the set of all users be N (n1, n2, ...), and assume that the attribute values of the elements of sets M and N are uniformly distributed between negative and positive infinity. The following describes how, knowing no attribute information about either the books or the users, the similarity between users can be obtained from the history of user operations on books.

Suppose user n1 in user set N wishes to read book m1, which has a single attribute with value μ (written u in the formulas). Another user n2 in the user set wishes to read book m2, whose attribute value is x. The similarity between the book m1 that user n1 wishes to read and the book m2 that user n2 wishes to read, that is, the user-to-user similarity f_0(x), is given for a given variance δ² by Formula 2, according to the definition above.

f_{0} (x) = \frac{1}{\sqrt{2 π} δ} e^{(- \frac{{(x - u)}^{2}}{2 δ^{2}})}

Formula 2

In practice, however, we know neither the attribute values of the books m1 and m2 that the users wish to read nor, consequently, the similarity between them. What we do know from the operation history is that user n1 actually clicked to read book m3, and we can compute the probability D1 of user n1's clicks on book m3 relative to this user's clicks on all books. Since the book user n1 actually operated on is m3, whose attribute value we denote y, the book m1 that user n1 wishes to read should be similar to book m3.

Regard user n1's click-reading of book m3 as one measurement: the attribute value μ of the book m1 that user n1 wishes to read is the quantity being measured, the attribute value of the book m3 actually read is the measured value, and the attribute values of all books form the measurement range. If the books' attribute values are infinitely many and uniformly distributed from negative to positive infinity, then the sample mean of the measurements (the mean attribute value of the books actually read) is the maximum likelihood estimate of the measured quantity, and the measured sample values follow a normal distribution with expectation μ and some unknown variance. That is, the probability density at a sample value is taken as the similarity between that sample value and the true measured quantity. Accordingly, when the attribute value y of the book m3 actually read by user n1 is used to measure the attribute value μ of the book m1 that user n1 wishes to read, the probability density g(y) of book m3 is given by Formula 3.

g (y) = \frac{1}{\sqrt{2 π} δ} e^{(- \frac{{(y - u)}^{2}}{2 δ^{2}})}

Formula 3

As noted above, g(y) is a probability density value that can be computed from the operation history data. That is, as shown in step S2, the probability g(y) of user n1's clicks on book m3 relative to user n1's clicks on all books is computed from the historical record.

Similarly, given that some user clicked to read book m3, the probability that this user was n1 is known: it is the probability D2 of user n1's clicks on book m3 relative to all users' clicks on book m3, which can be computed. Likewise, when the attribute value x of the book m2 that user n2 wishes to read is used to measure the attribute value y of the book m3 actually read by user n1, the probability density z(x) takes the similar form of Formula 4.

z (x) = \frac{1}{\sqrt{2 π} δ} e^{(- \frac{{(x - y)}^{2}}{2 δ^{2}})}

Formula 4

As noted above, z(x) is also a probability density value that can be computed from the operation history data. That is, as shown in step S3, the probability z(x) of user n1's clicks on book m3 relative to all users' clicks on book m3 is computed from the historical record.

The attribute values x and u are now linked by one operation: convolving g(y) and z(x) yields an expression that approximates f_0(x). The result of this integration is named f(x) to distinguish it from f_0(x), and is given by Formula 5.

f (x) = \int_{- \infty}^{+ \infty} g (y) \cdot z (x) dy = \int_{- \infty}^{+ \infty} \frac{1}{\sqrt{2 π} δ} e^{(- \frac{{(y - u)}^{2}}{2 δ^{2}})} \cdot \frac{1}{\sqrt{2 π} δ} e^{(- \frac{{(x - y)}^{2}}{2 δ^{2}})} dy

Formula 5

That is, as shown in step S4, the convolution of g(y) and z(x) is computed according to Formula 5; since g(y) and z(x) are known, the value of f(x) can be obtained. f(x) is the similarity between book m1 and book m2, that is, between the book user n1 likes to read and the book user n2 likes to read, and hence the similarity between user n1 and user n2. Because the probability densities D1 and D2 of g(y) and z(x) can be computed by counting users' historical behavior data, the similarity sim(x, u) = f(x) is obtained even though the attribute vectors of books m1 and m2 are unknown, which gives the user-to-user similarity.

Similarly, computing the convolution of z(x) and g(y) gives the book-to-book similarity.

Furthermore, evaluating Formula 5 yields Formula 6; the derivation is omitted here.

f (x) = \frac{1}{\sqrt{2 π} \sqrt{2} δ} e^{(- \frac{{(x - u)}^{2}}{2 {(\sqrt{2} δ)}^{2}})}

Formula 6

Formula 6 shows that f(x) conforms to the similarity measure of Formula 1. Comparing Formula 6 with Formula 2, f(x) and f_0(x) have the same form, except that the given variance becomes 2δ².
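The step from Formula 5 to Formula 6 can be checked numerically: integrating the product of the two normal densities over y should reproduce a normal density in x with mean u and variance 2δ². The sketch below uses a simple midpoint rule; the function names and the test values are hypothetical and chosen only for illustration.

```python
import math


def normal_pdf(x: float, mean: float, sigma: float) -> float:
    """Density of the normal distribution N(mean, sigma^2) at x."""
    return (math.exp(-((x - mean) ** 2) / (2.0 * sigma ** 2))
            / (math.sqrt(2.0 * math.pi) * sigma))


def convolved_similarity(x: float, u: float, sigma: float,
                         steps: int = 20000) -> float:
    """Approximate Formula 5 by midpoint-rule integration over y."""
    # Integrate far enough to cover both peaks (at y = u and y = x).
    lo = min(x, u) - 10.0 * sigma
    hi = max(x, u) + 10.0 * sigma
    dy = (hi - lo) / steps
    total = 0.0
    for step in range(steps):
        y = lo + (step + 0.5) * dy
        total += normal_pdf(y, u, sigma) * normal_pdf(x, y, sigma)
    return total * dy


def closed_form(x: float, u: float, sigma: float) -> float:
    """Formula 6: normal density with standard deviation sqrt(2) * sigma."""
    return normal_pdf(x, u, math.sqrt(2.0) * sigma)


numeric = convolved_similarity(1.3, 0.5, 0.8)
exact = closed_form(1.3, 0.5, 0.8)
```

The two values agree to within the integration error, which illustrates why the convolved quantity can stand in for f_0(x) up to a change of variance.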

When the number of attributes is not 1, the above process can be regarded as statistics over the joint density of many independent attributes. For a convolution of independent normally distributed random variables, the variance is the sum of the variances of the individual normal distributions. If the given variances obtained according to Formula 2 are (δ_1², δ_2², δ_3², ...), where each entry is the statistical variance of one independent attribute, then, because the attributes are independent, a similarity with given variances (2δ_1², 2δ_2², 2δ_3², ...) can be derived.

The above derivation requires the assumption that sets M and N are uniformly distributed between negative and positive infinity, and additionally that the attribute values of the elements of M and N remain unchanged. In concrete cases, however, this basic principle and method can still be used to compute a similarity result that satisfies the definition.

Embodiment 1 gives an example of the continuous case. Corresponding to the method for the continuous case, the similarity measurement method for the discrete case is introduced next.

Embodiment 2

This example calculates the similarity between users, or between items, in order to recommend items to users in online shopping; the objects compared are thus pairs of users or pairs of items. The description refers to Fig. 2. First, as shown in step S21 of Fig. 2, the server collects information from user registration and login, from the items sold on the website, and from users' operations on items. The collected information thus covers the users, the items, and the interactions between them, yielding data on users, items, and users' operations on items. The server analyzes this information into a user set User, an item set Item, and the record of users' operations on items. Here the operations of each user on items are mutually independent, and every operation carries the same meaning: it expresses that the user is interested in the item. Table 1 shows the interactions between the existing user set User and item set Item. a_ij denotes the number of operations by user Useri on item Itemj, where i is the user index, j is the item index, and i and j are integers. For example, user User1 has operated on items Item1, Item2, Item3 and Item4 a_11, a_12, a_13 and a_14 times respectively, and so on. Suppose no user has operated on item Item4, so a_14 = a_24 = a_34 = a_44 = 0, and suppose user User4 has not operated on any item, so a_41 = a_42 = a_43 = a_44 = 0.

Table 1

	Item1	Item2	Item3	Item4
User1	a_11	a_12	a_13	a_14
User2	a_21	a_22	a_23	a_24
User3	a_31	a_32	a_33	a_34
User4	a_41	a_42	a_43	a_44

In step S22, a mathematical model is built from the data obtained above by expressing Table 1 as a matrix, which gives the following user-item relational matrix a.

Matrix a

(\begin{matrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{matrix})

As shown in step S23 of Fig. 2, the user-to-item probability matrix is calculated. In matrix a, each row corresponds to a user; each entry, the number of times this user operated on the item, is divided by the total number of operations performed by this user, giving matrix A. If the user of a row has performed no operations, all elements of that row are set to the same value such that they sum to 1. Matrix A can be regarded as the statistical distribution obtained when actual items are used to measure the items the user wishes to obtain, that is, the user-to-item probability matrix. Here the probability density of an item relative to a user is assumed to follow normal densities with identical or similar variances.

A_ij denotes an element of matrix A: the ratio of the number of operations by user Useri on item Itemj to the total number of operations by user Useri on all items, as shown in Formula 7, where k is the number of items.

A_{ij} = \frac{a_{ij}}{Σ_{j = 1}^{k} a_{ij}}

Formula 7

Matrix A

(\begin{matrix} A_{11} & A_{12} & A_{13} & A_{14} \\ A_{21} & A_{22} & A_{23} & A_{24} \\ A_{31} & A_{32} & A_{33} & A_{34} \\ A_{41} & A_{42} & A_{43} & A_{44} \end{matrix})

The first row of the matrix means: the probability that Item1 is operated on by User1 is A_11 = a_11 / (a_11 + a_12 + a_13 + a_14), and likewise the probabilities that Item2, Item3 and Item4 are operated on by User1 are A_12, A_13 and A_14. The second row means: the probabilities that Item1, Item2, Item3 and Item4 are operated on by User2 are A_21, A_22, A_23 and A_24. The remaining rows follow in the same way. Since User4 is assumed to have performed no operations, the entries of that row take equal values summing to 1: A_41 = A_42 = A_43 = A_44 = 0.25.
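The construction of matrix A from the count matrix a, including the uniform row for a user with no operations, can be sketched as follows. The function name and the example counts are hypothetical; only the normalization rule comes from the text.

```python
from typing import List


def user_to_item_probabilities(counts: List[List[int]]) -> List[List[float]]:
    """Row-normalize a user-by-item count matrix (Formula 7).

    A user with no recorded operations gets a uniform row, so that
    every row of the result sums to 1.
    """
    n_items = len(counts[0])
    probabilities: List[List[float]] = []
    for row in counts:
        total = sum(row)
        if total == 0:
            probabilities.append([1.0 / n_items] * n_items)
        else:
            probabilities.append([c / total for c in row])
    return probabilities


# Hypothetical counts: Item4 is never operated on, User4 never operates.
a = [[2, 1, 1, 0],
     [0, 3, 1, 0],
     [1, 0, 4, 0],
     [0, 0, 0, 0]]
A = user_to_item_probabilities(a)
```

In this example the first row becomes 0.5, 0.25, 0.25, 0 and the last row becomes four entries of 0.25, mirroring the treatment of User4 described above.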

As shown in step S24 of Fig. 2, the item-to-user probability matrix is calculated. The item-user relational matrix is b, the transpose of a: b = a^T,

Matrix b

(\begin{matrix} a_{11} & a_{21} & a_{31} & a_{41} \\ a_{12} & a_{22} & a_{32} & a_{42} \\ a_{13} & a_{23} & a_{33} & a_{43} \\ a_{14} & a_{24} & a_{34} & a_{44} \end{matrix})

For matrix b, take article as capable, successively each user is operated to the number of operations of these article, the operation total degree being carried out divided by these article.If the article that this row is corresponding were not operated, these row of matrix get that to meet element value all identical and and be 1 value.Matrix B can be regarded user as and wish that the article that obtain measure the statistical distribution of actual object, the i.e. probability matrix of article to user.Here, described article are the normal distribution density of obeying identical or close variance to the probability density of user's probability.

B_ij denotes an element of matrix B: the ratio of the number of times article Itemj was operated on by user Useri to the total number of times Itemj was operated on by all users, as shown in Formula 8, where h is the total number of users.

B_{ij} = \frac{a_{ij}}{Σ_{i = 1}^{h} a_{ij}}

Formula 8

Matrix B

(\begin{matrix} B_{11} & B_{21} & B_{31} & B_{41} \\ B_{12} & B_{22} & B_{32} & B_{42} \\ B_{13} & B_{23} & B_{33} & B_{43} \\ B_{14} & B_{24} & B_{34} & B_{44} \end{matrix})

The first row of matrix B gives the probabilities for Item1: the probability that Item1 is operated on by User1 is B₁₁, where B₁₁ = a₁₁/(a₁₁ + a₂₁ + a₃₁ + a₄₁); likewise, the probability that Item1 is operated on by User2 is B₂₁, that by User3 is B₃₁, and that by User4 is B₄₁. The second row of matrix B gives the probabilities for Item2: the probability that Item2 is operated on by User1 is B₁₂, and so on. If Item4 has not been operated on at all, the elements of its row take identical values that sum to 1, i.e. B₁₄ = B₂₄ = B₃₄ = B₄₄ = 0.25.
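Formula 8 can be sketched in the same way, normalizing each article's counts over users. The sketch below assumes the layout shown for Matrix B, with articles as rows and users as columns; the function name and the counts are hypothetical.

```python
import numpy as np

def article_to_user_probability(a):
    """Normalize each article's counts over users (Formula 8).

    The result has articles as rows and users as columns, so the entry in
    row j, column i is B_ij in the notation of the text.
    """
    b = np.asarray(a, dtype=float).T               # b = a^T: articles x users
    h = b.shape[1]                                 # number of users
    totals = b.sum(axis=1, keepdims=True)          # operations per article
    safe = np.where(totals == 0, 1.0, totals)
    # Articles never operated on get identical values that sum to 1.
    return np.where(totals == 0, 1.0 / h, b / safe)

# Hypothetical counts: 4 users x 4 articles; Item4 was never operated on.
a = np.array([[2, 1, 0, 0],
              [0, 3, 1, 0],
              [2, 0, 1, 0],
              [0, 0, 0, 0]])
B = article_to_user_probability(a)
print(B)
```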

In step S25, once the above probabilities have been calculated, the convolution of embodiment 1 is taken in its discrete form; that is, matrices A and B are multiplied to obtain the user-to-user similarity matrix AB, AB = A*B.

AB_ij denotes an element of matrix AB: AB₁₁ = A₁₁*B₁₁ + A₁₂*B₁₂ + A₁₃*B₁₃ + A₁₄*B₁₄, AB₁₂ = A₁₁*B₂₁ + A₁₂*B₂₂ + A₁₃*B₂₃ + A₁₄*B₂₄, and so on.

Matrix AB

(\begin{matrix} {AB}_{11} & {AB}_{12} & {AB}_{13} & {AB}_{14} \\ {AB}_{21} & {AB}_{22} & {AB}_{23} & {AB}_{24} \\ {AB}_{31} & {AB}_{32} & {AB}_{33} & {AB}_{34} \\ {AB}_{41} & {AB}_{42} & {AB}_{43} & {AB}_{44} \end{matrix})

This similarity is the similarity under some unknown weight k and some unknown variance vector, and the matrix AB is the user-to-user similarity matrix. For example, its first row gives the similarity AB₁₁ between User1 and User1, the similarity AB₁₂ between User1 and User2, the similarity AB₁₃ between User1 and User3, and the similarity AB₁₄ between User1 and User4; the other rows follow in the same way.

If the product B*A is computed instead, the article-to-article similarity matrix BA = B*A is obtained.

BA_ij denotes an element of matrix BA: BA₁₁ = B₁₁*A₁₁ + B₂₁*A₂₁ + B₃₁*A₃₁ + B₄₁*A₄₁, BA₁₂ = B₁₁*A₁₂ + B₂₁*A₂₂ + B₃₁*A₃₂ + B₄₁*A₄₂, and so on.

Matrix E

(\begin{matrix} {BA}_{11} & {BA}_{12} & {BA}_{13} & {BA}_{14} \\ {BA}_{21} & {BA}_{22} & {BA}_{23} & {BA}_{24} \\ {BA}_{31} & {BA}_{32} & {BA}_{33} & {BA}_{34} \\ {BA}_{41} & {BA}_{42} & {BA}_{43} & {BA}_{44} \end{matrix})

For example, the first row of matrix E (the matrix BA) gives the similarity BA₁₁ between Item1 and Item1, the similarity BA₁₂ between Item1 and Item2, the similarity BA₁₃ between Item1 and Item3, and the similarity BA₁₄ between Item1 and Item4. The second, third and fourth rows of matrix E follow in the same way.
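The two products can be checked on a small case. The sketch below uses a hypothetical 3-user, 3-article count matrix and a hypothetical helper `normalize_rows`, and stores B with articles as rows (the layout shown for Matrix B), so that A*B is user-to-user and B*A is article-to-article.

```python
import numpy as np

def normalize_rows(m):
    """Divide each row by its sum; all-zero rows become uniform."""
    m = np.asarray(m, dtype=float)
    s = m.sum(axis=1, keepdims=True)
    return np.where(s == 0, 1.0 / m.shape[1], m / np.where(s == 0, 1.0, s))

# Hypothetical operation counts: rows are users, columns are articles.
a = np.array([[2, 1, 0],
              [0, 3, 1],
              [1, 0, 1]])

A = normalize_rows(a)      # users x articles (Formula 7)
B = normalize_rows(a.T)    # articles x users (Formula 8)

AB = A @ B                 # user-to-user similarity
BA = B @ A                 # article-to-article similarity
print(AB)
print(BA)
```

Because A and B each have rows that sum to 1, the rows of AB and BA also sum to 1, so each row can be read as a distribution of similarity over the other users or articles.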

In the present embodiment, neither the attributes of the articles nor the attributes of the users are known. However, because the convolution of normal distributions is still a normal distribution, the history of user operations on articles can be used to obtain the probability matrix of articles with respect to users and the probability matrix of users with respect to articles, from which the similarity between users or the similarity between articles can be calculated. Articles can then be recommended to users on this basis, raising the likelihood that a recommended article is accepted by the user.

Fig. 3 shows a similarity measurement system 300, which comprises a data collection module 301, a mathematical model building unit 302, a probability matrix generation unit 303 and a similarity calculation unit 304. The data collection module 301 collects, for example, the registered and logged-in users, the articles sold on a website, and the historical data of user operations on articles. The mathematical model building unit 302 builds a mathematical model of the interaction between users and articles and generates the user-article interaction matrix. From this interaction matrix, the probability matrix generation unit 303 calculates, for each user, the ratio of the number of times each article was operated on by that user to the number of times the article was operated on by all users, and generates the corresponding probability matrix of users to articles; it also calculates, for each article, the ratio of the number of times each user operated on that article to the number of times that user operated on all articles, and generates the corresponding probability matrix of articles to users. The similarity calculation unit 304 calculates the product of the probability matrix of users to articles and the probability matrix of articles to users to obtain the user-to-user similarity matrix, or the product of the probability matrix of articles to users and the probability matrix of users to articles to obtain the article-to-article similarity matrix.

Provided the assumptions are satisfied, the similarity results obtained with the similarity measurement method of the present invention compare favorably with those of earlier similarity calculation methods.

With the above definition of similarity, which exploits the properties of the normal distribution, the similarity between articles can be calculated even when the attribute vectors are unknown. Its application is not limited to the above embodiments; it can be applied to similarity calculation between any comparison objects whose attributes are unknown.

Operation for enhancing the similarity association of the above similarity

Embodiment 3 applies the similarity-association enhancement operation to the result obtained in embodiment 1. As is known, a larger variance means more associated results, but the error increases correspondingly.

Fig. 4 is a flowchart of the method for enhancing the similarity association of embodiment 1; embodiment 3 is described with reference to Fig. 4. Using the similarity definition of Formula 1 and the similarity results obtained in embodiment 1, in step S41 of Fig. 4 the similarity between any books m_x and m_y and the similarity between m_y and m_z are convolved over m_y, as shown in Formula 9. This yields an association between m_x and m_z, which widens the range of similarity association between books and strengthens that association, giving the enhanced similarity sim(m_x, m_z). Through the operation of Formula 9, the variance satisfying Formula 1 becomes 4δ².

sim (m_{x}, m_{z}) = \int_{- \infty}^{+ \infty} sim (m_{x}, m_{y}) \cdot sim (m_{y}, m_{z}) \, dm_{y}

Formula 9

From Formula 1 and Formula 9, the result of Formula 10 is obtained, where C₀ is a constant.

sim (m_{x}, m_{z}) = C_{0} \frac{1}{\sqrt{2 π} \cdot \sqrt{2 δ}} e^{(- \frac{{(m_{x} - m_{z})}^{2}}{2 {(\sqrt{2 δ})}^{2}})}

Formula 10

Although the increased variance strengthens the similarity association between books, the error grows at the same time. To reduce the error, the variance of the similarity is brought back to 2δ²: in step S42 of Fig. 4, the variance retraction operation of Formula 11 is carried out, giving the variance-retracted enhanced similarity sim'(m_x, m_z).

{sim}^{'} (m_{x}, m_{z}) = \frac{{sim}^{2} (m_{x}, m_{z})}{\int_{- \infty}^{+ \infty} {sim}^{2} (m_{x}, m_{z}) \, dm_{x}}

Formula 11

From Formula 10 and Formula 11, Formula 12 is obtained; the variance becomes 2δ² again. Here C₀, C₁ and C₀' are all constants.

As described above, the variance is 2δ² again, so the similarity association between comparison objects is strengthened while the error is kept unchanged.

In this way the variance goes from 2δ² to 4δ² and back to 2δ², yielding a wider set of books that have a similarity association with m_x, from which books with high similarity can be selected for recommendation. The enhanced similarity value sim'(m_x, m_z) can be obtained according to Formula 11.
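The change in variance can be checked numerically. The sketch below assumes that the similarity of Formula 1 is a normal density in the difference between two objects with variance 2δ²; the grid, δ = 1 and the helper names are illustrative assumptions. It applies the convolution of Formula 9 and then the square-and-normalize step of Formula 11.

```python
import numpy as np

def gaussian(d, var):
    """Normal density in the difference d with the given variance."""
    return np.exp(-d**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def variance_of(density, grid):
    """Variance of a sampled density, using simple Riemann sums."""
    step = grid[1] - grid[0]
    p = density / (density.sum() * step)
    mean = (grid * p).sum() * step
    return ((grid - mean) ** 2 * p).sum() * step

delta = 1.0                                  # assumed value, for illustration
var0 = 2.0 * delta**2                        # variance assumed for Formula 1
grid = np.linspace(-20.0, 20.0, 4001)        # differences m_x - m_z
step = grid[1] - grid[0]

sim = gaussian(grid, var0)
enhanced = np.convolve(sim, sim, mode="same") * step        # Formula 9
retracted = enhanced**2 / ((enhanced**2).sum() * step)      # Formula 11

print(variance_of(sim, grid), variance_of(enhanced, grid), variance_of(retracted, grid))
```

Under these assumptions the three variances come out close to 2δ², 4δ² and 2δ²: convolving two normal densities adds their variances, and squaring a normal density halves its variance.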

Embodiment 3 gives an example for the continuous case. Corresponding to that method for enhancing similarity association in the continuous case, the method for the discrete case is introduced below.

Embodiment 4

Fig. 5 is a flowchart of the method for enhancing the similarity association of embodiment 2; embodiment 4 is described with reference to Fig. 5. Embodiment 4 applies the similarity-association enhancement operation to the similarity matrix obtained in embodiment 2. Here, the weights are generally chosen so that the similarities sum to 1.

Take, for example, the similarity matrix AB representing the user-to-user similarity. In step S51 of Fig. 5, the similarity enhancement is first calculated for it, further extending the association between users. The enhanced similarity matrix is f = (AB)*(AB)^T.

f_ij denotes an element of matrix f: f₁₁ = AB₁₁*AB₁₁ + AB₁₂*AB₁₂ + AB₁₃*AB₁₃ + AB₁₄*AB₁₄, f₁₂ = AB₁₁*AB₂₁ + AB₁₂*AB₂₂ + AB₁₃*AB₂₃ + AB₁₄*AB₂₄, and so on.

Matrix f

(\begin{matrix} f_{11} & f_{12} & f_{13} & f_{14} \\ f_{21} & f_{22} & f_{23} & f_{24} \\ f_{31} & f_{32} & f_{33} & f_{34} \\ f_{41} & f_{42} & f_{43} & f_{44} \end{matrix})

Matrix f is the enhanced user-to-user similarity matrix. The enhancement operation widens the range of association between users, so that users whose calculated similarity was zero now have an association between them. Likewise, calculating (BA)*(BA)^T gives the enhanced similarity matrix with strengthened association between articles.

With this enhancement, the variance satisfied by the user-to-user similarity doubles, and its error doubles as well. To keep the error consistent with its original size, as shown in step S52 of Fig. 5, the variance retraction operation of Formula 13 is applied to the enhanced similarity matrix. Matrix g denotes the user-to-user similarity matrix after variance retraction. In this operation, f_ij denotes an element of the enhanced matrix f, g_ij denotes an element of the enhanced matrix g after variance retraction, i is the row index, j is the column index, h is the maximum column index, and i, j and h are integers greater than zero.

g_{ij} = \frac{{f_{ij}}^{2}}{Σ_{j = 1}^{h} {f_{ij}}^{2}}

Formula 13

Matrix g

(\begin{matrix} g_{11} & g_{12} & g_{13} & g_{14} \\ g_{21} & g_{22} & g_{23} & g_{24} \\ g_{31} & g_{32} & g_{33} & g_{34} \\ g_{41} & g_{42} & g_{43} & g_{44} \end{matrix})

In matrix g, g₁₁ denotes the variance-retracted enhanced similarity between User1 and User1: g₁₁ = f₁₁²/(f₁₁² + f₁₂² + f₁₃² + f₁₄²). The other elements follow in the same way.
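Steps S51 and S52 can be sketched together. The similarity matrix below is hypothetical and chosen so that the first and third users start with zero similarity; the function names are illustrative only.

```python
import numpy as np

def enhance(sim):
    """Step S51: f = sim * sim^T."""
    sim = np.asarray(sim, dtype=float)
    return sim @ sim.T

def retract_variance(f):
    """Step S52 (Formula 13): square each element, then normalize each row."""
    sq = np.asarray(f, dtype=float) ** 2
    return sq / sq.sum(axis=1, keepdims=True)   # assumes no all-zero row

# Hypothetical user-to-user similarity matrix; users 1 and 3 are unrelated.
AB = np.array([[0.5, 0.5, 0.0],
               [0.5, 0.25, 0.25],
               [0.0, 0.5, 0.5]])

f = enhance(AB)
g = retract_variance(f)
print(f)
print(g)
```

Here `f[0, 2]` is 0.25 even though `AB[0, 2]` is 0, illustrating how the enhancement creates an association through the shared second user, and each row of `g` sums to 1.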

Fig. 6 shows a similarity measurement system 600 for enhancing similarity association, which comprises a similarity matrix acquisition unit 601, a similarity enhancement unit 602 and a variance retraction unit 603. The similarity matrix acquisition unit 601 obtains the similarity matrix between comparison objects. Like the similarity measurement system 300 shown in Fig. 3, it comprises a data collection module, a mathematical model building unit, a probability matrix generation unit and a similarity calculation unit. The data collection module collects the registered and logged-in users, the articles sold on a website, and the historical data of user operations on articles. The mathematical model building unit builds a mathematical model of the interaction between users and articles and generates the user-article interaction matrix. From this interaction matrix, the probability matrix generation unit calculates, for each user, the ratio of the number of times each article was operated on by that user to the number of times the article was operated on by all users, and generates the corresponding probability matrix of articles with respect to users; it also calculates, for each article, the ratio of the number of times each user operated on that article to the number of times that user operated on all articles, and generates the corresponding probability matrix of users with respect to articles. The similarity calculation unit calculates the product of the probability matrix of articles with respect to users and the probability matrix of users with respect to articles to obtain the user-to-user similarity matrix, or the product of the probability matrix of users with respect to articles and the probability matrix of articles with respect to users to obtain the article-to-article similarity matrix. The similarity enhancement unit 602 calculates the product of the similarity matrix between comparison objects and its own transpose, thereby obtaining the enhanced similarity matrix with strengthened similarity association between the comparison objects. The variance retraction unit 603 takes, as each new element, the ratio of the square of each element of the enhanced similarity matrix to the sum of squares of the elements in that element's row, thereby obtaining the similarity matrix between comparison objects after variance retraction.

In embodiments 3 and 4, the similarity-association enhancement operation is applied to the similarity matrices obtained in embodiments 1 and 2, increasing the association between comparison objects (for example between users or between articles), and the variance retraction operation is then applied so that the error introduced by the enhancement stays consistent with the original. The above embodiments thus yield a similarity between compared objects with a wider association range and unchanged error.

Calculation of similarity when the data is skewed

In the above similarity calculation, each attribute vector value, i.e. each piece of behavior data, needs to be uniformly distributed from negative infinity to positive infinity. When this does not hold and the data is skewed, the resulting article-to-article similarity matrix may be asymmetric, and no further similarity enhancement operation can be carried out. To obtain a symmetric similarity matrix, the similarity result of the above method can be approximated so as to obtain a more accurate similarity.

The approximation of the similarity result is described in detail below.

Fig. 7 is a flowchart of another similarity measurement method. Referring to Fig. 7, first, as shown in step S71, the elements of set a and set b and the operation relationship data between the elements are obtained; then, as shown in step S72, the similarity values between the elements within set b are obtained from this data. For example, given the elements of set a and set b, the undifferentiated similarity operation count is denoted sim(item_a, item_b); here it refers to the operation relationship between element item_a in set a and element item_b in set b. The similarity sim'(Item_b_i, Item_b_j) between elements within set b is found using the following formula:

\begin{matrix} {sim}^{'} (Item_b_{i}, Item_b_{j}) = \\ k * \underset{m}{Σ} (\frac{sim (Item_a_{m}, Item_b_{i}) * sim (Item_a_{m}, Item_b_{j})}{\underset{n}{Σ} sim (Item_a_{m}, Item_b_{n}) * \underset{n}{Σ} sim (Item_a_{m}, Item_b_{n}) * \underset{n}{Σ} sim (Item_a_{n}, Item_b_{i}) * \underset{n}{Σ} sim (Item_a_{n}, Item_b_{j})}) \end{matrix}

Formula 14

Here sim'(Item_b_i, Item_b_j) denotes the similarity between elements Item_b_i and Item_b_j in set b, and is a value approximating the similarity result obtained with the method of Formula 1. k is a normalization factor; after normalization, definition 2 is an approximation of the similarity result of definition 1. Since the undifferentiated similarity operation count is denoted sim(item_a, item_b), sim(item_a_m, item_b_i), for example, denotes the count between element item_a_m in set a and element item_b_i in set b, and sim(item_a_m, item_b_j) denotes the count between element item_a_m in set a and element item_b_j in set b. m, n, i and j are all indices of elements in the sets.

For Formula 14, let the time at which a similarity operation occurs be t(item_a, item_b), i.e. the point in time at which element item_a in set a operated on element item_b in set b. The similarity between elements within set b is then found with the following formula. Let G denote the following part of Formula 14:

\begin{matrix} G = \\ \frac{sim (Item_a_{m}, Item_b_{i}) * sim (Item_a_{m}, Item_b_{j})}{\underset{n}{Σ} sim (Item_a_{m}, Item_b_{n}) * \underset{n}{Σ} sim (Item_a_{m}, Item_b_{n}) * \underset{n}{Σ} sim (Item_a_{n}, Item_b_{i}) * \underset{n}{Σ} sim (Item_a_{n}, Item_b_{j})} \end{matrix}

When the time-dependent filter factor f(t(item_a_m, item_b_i), t(item_a_m, item_b_j)) is taken into account, the similarity formula is as follows:

{sim}^{'} (Item_b_{i}, Item_b_{j}) = k * \underset{m}{Σ} (G * f (t (Item_a_{m}, Item_b_{i}), t (Item_a_{m}, Item_b_{j})))

Formula 15

Here k is a normalization factor, and f(t(item_a_m, item_b_i), t(item_a_m, item_b_j)) is some function of time whose value is larger the closer the two times are. For instance, a low-pass filter function can be used so that close times give larger values. An example of the time filter function f is Formula 16.

f (t (Item_a_{m}, Item_b_{i}), t (Item_a_{m}, Item_b_{j})) = β^{| t (Item_a_{m}, Item_b_{i}) - t (Item_a_{m}, Item_b_{j}) |}

Formula 16

Here β is the low-pass filter coefficient, greater than 0 and less than 1.
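Formulas 15 and 16 can be sketched as below. Several points are assumptions made for illustration: the outer sum runs over the elements of set a, the factor k normalizes each row to sum to 1, every element of set b has at least one operation, and the function names, β value and timestamps are hypothetical.

```python
import numpy as np

def time_filter(t_i, t_j, beta=0.9):
    """Formula 16: beta ** |t_i - t_j|, with 0 < beta < 1."""
    return beta ** np.abs(t_i - t_j)

def time_weighted_similarity(counts, times, beta=0.9):
    """Formula 15 for a counts matrix (set a as rows, set b as columns)."""
    counts = np.asarray(counts, dtype=float)
    times = np.asarray(times, dtype=float)
    row = counts.sum(axis=1)          # operations by each element of set a
    col = counts.sum(axis=0)          # operations on each element of set b
    n = counts.shape[1]
    s = np.zeros((n, n))
    for m in range(counts.shape[0]):
        if row[m] == 0:
            continue                  # this element of set a contributes nothing
        g = np.outer(counts[m], counts[m]) / row[m] ** 2
        w = time_filter(times[m][:, None], times[m][None, :], beta)
        s += g * w
    s /= np.outer(col, col)           # assumes no all-zero column
    return s / s.sum(axis=1, keepdims=True)   # k: rows sum to 1

counts = np.array([[1, 1, 0], [1, 0, 1], [2, 0, 0]])
same_time = np.zeros((3, 3))                     # filter value is 1 everywhere
spread = np.array([[0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])

print(time_weighted_similarity(counts, same_time))
print(time_weighted_similarity(counts, spread, beta=0.5))
```

When all timestamps coincide the filter equals 1 and the result reduces to Formula 14; when the first element of set a operated on its two elements of set b at different times, their mutual similarity shrinks.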

Fig. 8 is a block diagram of another similarity measurement system. This similarity measurement system of the present invention comprises: a data acquisition unit, which obtains the elements item_a in set a, the elements item_b in set b, and the undifferentiated similarity operation counts sim(item_a, item_b) of elements item_a in set a on elements item_b in set b; and a similarity calculation unit, which calculates the similarity sim'(Item_b_i, Item_b_j) between elements item_b_i and item_b_j within set b based on Formula 14. The similarity calculation for skewed data described here is not limited to skewed data; it is equally applicable when the data is uniformly distributed. Likewise, the enhanced similarity calculation for skewed data described below is also applicable to uniformly distributed data.

Calculation of enhanced similarity when the data is skewed

The matrix calculated by the above similarity measurement method and system can undergo a further similarity enhancement operation. In this operation, Formula 14 is applied again to the similarity matrix formed from the obtained similarity values between elements within set b, together with the transpose of that similarity matrix, and the result is normalized, yielding the enhanced similarity.

White noise compensation

The above embodiments use a single attribute as an example, but an article has multiple attributes. When operating on articles, a user may evaluate certain attributes of certain articles, yet some articles have no user evaluation data at all, or the user did not evaluate every attribute of the articles operated on. As a result, the interaction data on user evaluations of article attributes is sparse.

One white noise compensation method comprises the following steps: calculating, for each sample in the sample space, the mean of the measured values of its attribute vector as the estimate of that attribute vector; calculating the mean of these estimates over all samples; for each sample whose number of measured values of the attribute vector is less than a predetermined number, using the mean of the estimates over all samples as additional measured values of that sample's attribute vector, so that the number of measured values is supplemented to the predetermined number; and recalculating, for each sample so supplemented, the mean of the measured values of the attribute vector as its estimate.

Another white noise compensation method comprises the following steps: calculating, for each sample in the sample space, the mean of the measured values of its attribute vector as the estimate of that attribute vector; calculating the mean of all measured values of the attribute vector over all samples; for each sample whose number of measured values of the attribute vector is less than a predetermined number, using the mean of all measured values over all samples as additional measured values of that sample's attribute vector, so that the number of measured values is supplemented to the predetermined number; and recalculating, for each sample so supplemented, the mean of the measured values of the attribute vector as its estimate.

The case of sample statistics with multiple independent attributes is described with reference to Fig. 9. As a hypothetical example, a website hosts many films, and rating and viewing duration are the attribute vectors of a film. Suppose the rating and the user viewing duration of a certain film are to be determined. Three users are known to have watched and rated the film. The measured ratings on this website are 7, 5 and 8, and the measured viewing durations are 1.4, 1.6 and 1.5 respectively, as shown in Table 2.

Table 2

	Rating (out of 10)	Viewing duration (hours)
User 1	7	1.4
User 2	5	1.6
User 3	8	1.5

First, as shown in step S91, the mean of the measured ratings and the mean of the measured viewing durations for this film are calculated: the mean rating is (7+5+8)/3 = 6.67, and the mean viewing duration is (1.4+1.6+1.5)/3 = 1.5.

It is known that a film must have been watched by 30 or more users before the measurements reflect its actual rating and viewing duration on the website. Because this website has only 3 evaluations of the film, which is too few, white noise compensation is applied to the film's two attribute vectors on this website.

As shown in step S92, to predict the film's rating and viewing duration accurately, the website looks up other similar websites, obtains each website's mean rating and mean viewing duration for the film, and calculates the mean of these means of measured values over all websites, including this one, as shown in Table 3 below.

Table 3

	Rating (out of 10)	Viewing duration (hours)
Film	6	1.2

As shown in step S93, the mean of means above is used to apply white noise compensation to the film's two attributes on this website, supplementing the number of attribute vector measurements to 30. As shown in step S94, the mean of the film's attribute vector measurements after supplementation is calculated as the estimate. The website's predicted rating and viewing duration for the film are:

([7, 1.4] + [5, 1.6] + [8, 1.5] + 27*[6, 1.2])/30 = [6.0667, 1.230]

The website therefore predicts a rating of 6.0667 points and a viewing duration of 1.230 hours for this film, which is a more accurate prediction.
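The padding step can be sketched as follows, using the numbers of Table 2 and Table 3. The function name `white_noise_compensate` and its parameters are hypothetical.

```python
import numpy as np

def white_noise_compensate(measurements, fill_value, target_count):
    """Pad a sample's measurements with fill_value up to target_count rows,
    then return the mean of the padded measurements as the estimate."""
    measurements = np.asarray(measurements, dtype=float)
    missing = max(target_count - len(measurements), 0)
    padding = np.tile(np.asarray(fill_value, dtype=float), (missing, 1))
    return np.vstack([measurements, padding]).mean(axis=0)

# Table 2: [rating, viewing duration in hours] from the three users.
film = [[7, 1.4], [5, 1.6], [8, 1.5]]
# Table 3: mean of the per-website means, used as the fill value.
estimate = white_noise_compensate(film, fill_value=[6, 1.2], target_count=30)
print(np.round(estimate, 4))
```

With only 3 real measurements out of 30, the estimate stays close to the cross-website mean; as real measurements accumulate, the padding carries less weight.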

Alternatively, step S92 in Fig. 9 can be replaced as shown in step S102 of Fig. 10: the mean of all websites' measured values for the rating attribute and the mean of all websites' measured values for the viewing duration attribute are calculated. Then, as shown in step S103, these means of measured values are used to supplement the number of the film's attribute vector measurements on this website to 30, and the mean of the attribute vector measurements after white noise supplementation is calculated as the estimate.

On the basis of the above description, a method is introduced below that uses the Bayes formula to synthesize the similarity based on behavior data and the similarity based on feature data.

The description refers to Fig. 11 through Fig. 13. Fig. 11 is a flowchart of the Bayesian synthesis method for the similarity based on behavior data and the similarity based on feature data; Fig. 12 is a flowchart of calculating the similarity from behavior data. As shown in step S111 of Fig. 11, the feature data and behavior data are first obtained. Then, as shown in step S112, the article-to-article similarity matrix based on behavior data is calculated. Specifically, as shown in Fig. 12, first, in step S121, the users, the articles and the data on user operations on articles are obtained from the behavior data; then, in step S122, this data is used to calculate the article-to-article similarity matrix. As shown in step S123, the similarity enhancement operation can additionally be applied to this article-to-article similarity matrix.

With the similarity calculation method and similarity enhancement method described above, similarity can be measured from behavior data even when the attribute vectors are unknown: the similarity values between comparison objects are obtained from the behavior data, and the similarity enhancement operation is applied. When behavior data is sparse, the white noise compensation method described above can also be applied to stabilize the statistics. This yields the similarity values based on behavior data.

As shown in step S113 of Fig. 11, the article-to-article similarity matrix based on feature data is calculated. Likewise, the above similarity calculation method and similarity enhancement method can be used to calculate the article-to-article similarity based on feature data. Specifically, as shown in Fig. 13, in step S131 the articles, the attributes, and the attribute values of the articles for the corresponding attributes are obtained. In step S132, the article-to-article similarity matrix is calculated from this data using Formula 1 or Formula 14. In step S133, white noise compensation is applied to the attribute information, and the white-noise-compensated similarity matrix is calculated according to the proportions in which known and unknown attribute information contribute to the similarity.

An article has both known and unknown attributes. For the known attributes, the attribute values expressing the relationship between articles and attributes can be obtained, and the article-to-article similarity based on attribute information can be calculated from them with the method of Formula 1 or Formula 14. Because the known attributes account for only part of an article's attribute information, the similarity result so calculated needs white noise compensation. The compensation works as follows: for the unknown attribute information, the similarity it contributes is assumed to be white noise, so that the similarity between any article and every other article is identical. This gives a second similarity matrix. Adding this matrix to the similarity matrix calculated from the known attributes in a certain ratio gives the similarity matrix calculated by the attribute method. The proportion of an article's attribute information that is known cannot be determined in advance, so cross-validation is used to determine this ratio coefficient and reach the best recommendation effect.

As shown in step S114 of Fig. 11, the Bayes formula is used to synthesize the article-to-article similarity matrix based on behavior data and the article-to-article similarity matrix based on feature data. Specifically, for each pair of articles, the Bayes formula combines the white-noise-supplemented similarity based on feature data with the similarity based on behavior data: the similarity result calculated from feature data serves as the prior distribution, and the similarity result calculated from behavior data serves as the conditional distribution, as shown in the following formula.

{sim}^{'''} (b_{i}, b_{j}) = \frac{{sim}^{'} (b_{i}, b_{j}) * {sim}^{''} (b_{j}, b_{i})}{\underset{j}{Σ} {sim}^{'} (b_{i}, b_{j}) * {sim}^{''} (b_{j}, b_{i})}

Formula 17

Here b_i and b_j denote articles, with the subscripts taking the values 1, 2, ...; the prior probability density sim'(b_i, b_j) is the similarity between article b_i and article b_j based on feature data; the conditional probability density sim''(b_j, b_i) is the similarity between article b_j and article b_i based on behavior data; and sim'''(b_i, b_j) is the Bayesian similarity between article b_i and article b_j after synthesis. This similarity is likewise an estimate of the similarity defined by Formula 1.
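Formula 17 amounts to an element-wise product followed by row normalization, which can be sketched as below. The function name `bayes_combine` and the prior matrix are hypothetical; the behavior-based matrix reuses the values of the sim_log example in this description.

```python
import numpy as np

def bayes_combine(sim_feature, sim_behavior):
    """Formula 17: the prior sim'(b_i, b_j) times the conditional
    sim''(b_j, b_i), normalized over j for each article b_i."""
    prior = np.asarray(sim_feature, dtype=float)
    cond = np.asarray(sim_behavior, dtype=float).T   # entry (i, j) = sim''(b_j, b_i)
    joint = prior * cond
    return joint / joint.sum(axis=1, keepdims=True)  # assumes no all-zero row

# Hypothetical feature-based prior; behavior-based values from the example.
sim_feature = np.array([[0.6, 0.2, 0.2],
                        [0.2, 0.6, 0.2],
                        [0.2, 0.2, 0.6]])
sim_behavior = np.array([[0.4286, 0.2857, 0.2857],
                         [0.2000, 0.8000, 0.0000],
                         [0.2000, 0.0000, 0.8000]])
combined = bayes_combine(sim_feature, sim_behavior)
print(combined)
```

A pair whose behavior-based similarity is zero stays at zero after synthesis, and a uniform prior leaves only the row-normalized behavior-based term.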

The following example describes how the Bayes formula is used to synthesize the similarity based on feature data and the similarity based on behavior data. Because the similarity calculation method based on Formula 1 applies only when the data is uniformly distributed, whereas the method of Formula 14 applies to any data, the method of Formula 14 is used in this example.

For example the relation of user and article (behavioral data) is as follows

	Article 1	Article 2	Article 3
User 1	1	1	0
User 2	1	0	1
User 3	2	0	0

The relation between articles and attributes (feature data) is as follows:

	Attribute 1	Attribute 2	Attribute 3
Article 1	1	1	1
Article 2	1	0	0
Article 3	0	1	0

The relational matrix log_a between users and articles, based on the behavior data (log data), is

(\begin{matrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 2 & 0 & 0 \end{matrix})

The relational matrix tag_c between articles and attributes, based on the feature data (tag data), is

(\begin{matrix} 1 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{matrix})

From the relational matrix log_a between users and articles, formula 14 gives the article-to-article similarity matrix sim_log based on the behavior data:

(\begin{matrix} 0.4286 & 0.2857 & 0.2857 \\ 0.2000 & 0.8000 & 0 \\ 0.2000 & 0 & 0.8000 \end{matrix})
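As a non-limiting illustration, the calculation of sim_log can be sketched in Python as follows. The sketch assumes that formula 14 has the form recited in claim 3, that the constant k equals 1 (it cancels when each row is normalized), and that normalization means dividing each row by its sum. The function names item_similarity and normalize_rows are hypothetical and are introduced only for this illustration.

```python
# Illustrative sketch only; the function names and k = 1 are assumptions.

def item_similarity(rel, k=1.0):
    """Unnormalized article-to-article similarity from a relation matrix.

    rel[m][j] is the number of operations of user m on article j.
    """
    row_sums = [sum(row) for row in rel]        # total operations per user
    col_sums = [sum(col) for col in zip(*rel)]  # total operations per article
    n = len(col_sums)
    sim = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            if col_sums[j] == 0 or col_sums[i] == 0:
                continue  # an article nobody operated on keeps similarity 0
            acc = 0.0
            for row, rs in zip(rel, row_sums):
                if rs:  # skip users without any operation
                    acc += row[j] * row[i] / (rs * rs)
            sim[j][i] = k * acc / (col_sums[j] * col_sums[i])
    return sim


def normalize_rows(mat):
    """Divide each row by its sum so that every row adds up to 1."""
    out = []
    for row in mat:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out


log_a = [[1, 1, 0],
         [1, 0, 1],
         [2, 0, 0]]

raw = item_similarity(log_a)   # symmetric before normalization
sim_log = normalize_rows(raw)
for row in sim_log:
    print(["%.4f" % x for x in row])
```

Under these assumptions, the sketch reproduces the values of sim_log given above to four decimal places.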

If time filtering is applied, the result can be multiplied here by a time-dependent filter function. The calculation below proceeds without time filtering.

The matrix above is the similarity matrix after normalization; the similarity matrix before normalization is symmetric. The normalized similarity matrix is then enhanced: formula 14 is applied again to the similarity matrix sim_log, and the result is normalized, giving the following enhanced similarity matrix sim_log_enhance:

(\begin{matrix} 0.3795 & 0.3102 & 0.3102 \\ 0.3154 & 0.6150 & 0.0696 \\ 0.3154 & 0.0696 & 0.6150 \end{matrix})
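The enhancement can be sketched as a second application of the same calculation to the normalized matrix. The sketch below again assumes the form of formula 14 recited in claim 3 with k = 1 and row-sum normalization, and it takes as input the fractions 3/7, 2/7, 1/5 and 4/5, to which the four-decimal values of sim_log correspond. The function name similarity_pass is hypothetical.

```python
# Illustrative sketch only; similarity_pass is a hypothetical helper name.

def similarity_pass(rel):
    """Apply the assumed formula 14 once, then normalize each row to sum 1."""
    row_sums = [sum(row) for row in rel]
    col_sums = [sum(col) for col in zip(*rel)]
    n = len(col_sums)
    result = []
    for j in range(n):
        raw = []
        for i in range(n):
            # Sum over rows of the co-occurrence, weighted by squared row sums.
            acc = sum(row[j] * row[i] / (rs * rs)
                      for row, rs in zip(rel, row_sums) if rs)
            denom = col_sums[j] * col_sums[i]
            raw.append(acc / denom if denom else 0.0)
        total = sum(raw)
        result.append([x / total if total else 0.0 for x in raw])
    return result


# Normalized behavior-based similarity from the example (unrounded values).
sim_log = [[3 / 7, 2 / 7, 2 / 7],
           [1 / 5, 4 / 5, 0.0],
           [1 / 5, 0.0, 4 / 5]]

sim_log_enhance = similarity_pass(sim_log)
for row in sim_log_enhance:
    print(["%.4f" % x for x in row])
```

In this sketch, the article pairs 2 and 3, which have zero similarity in sim_log, receive a small non-zero similarity because both are related to article 1, which is the association-strengthening effect described above.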

The enhanced article-to-article similarity based on the behavior data has been calculated above; the article-to-article similarity based on the feature data (tag data) is calculated below.

First, the transposed relational matrix (tag_c)^T is obtained from the relational matrix tag_c between articles and attributes, which is based on the feature data (tag data). Formula 14 is then applied to (tag_c)^T in the same manner as described above to obtain the similarity between articles, and the result is normalized. The resulting normalized article-to-article similarity matrix sim_tag is

(\begin{matrix} 0.5000 & 0.2500 & 0.2500 \\ 0.2500 & 0.7500 & 0 \\ 0.2500 & 0 & 0.7500 \end{matrix})
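The feature-based calculation can be illustrated with exact rational arithmetic, which shows that the entries of sim_tag are exactly 1/2, 1/4 and 3/4. The sketch assumes the form of the formula recited in claim 7 with k = 1, row-sum normalization, and a transposed matrix whose rows are attributes and whose columns are articles. The function name similarity_from_relation is hypothetical.

```python
from fractions import Fraction

# Illustrative sketch only; similarity_from_relation is a hypothetical name.

def similarity_from_relation(rel):
    """Row-normalized similarity between the columns of rel (exact fractions)."""
    row_sums = [sum(row) for row in rel]
    col_sums = [sum(col) for col in zip(*rel)]
    n = len(col_sums)
    sim = []
    for i in range(n):
        raw = []
        for j in range(n):
            acc = sum((Fraction(row[i] * row[j], rs * rs)
                       for row, rs in zip(rel, row_sums) if rs), Fraction(0))
            denom = col_sums[i] * col_sums[j]
            raw.append(acc / denom if denom else Fraction(0))
        total = sum(raw, Fraction(0))
        sim.append([x / total if total else Fraction(0) for x in raw])
    return sim


# Rows are articles, columns are attributes, as in the table above.
tag_c = [[1, 1, 1],
         [1, 0, 0],
         [0, 1, 0]]

# Transpose so that rows are attributes and columns are articles.
tag_c_t = [list(col) for col in zip(*tag_c)]

sim_tag = similarity_from_relation(tag_c_t)
for row in sim_tag:
    print([str(x) for x in row])
```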

White noise compensation is performed for the current attribute vectors. Suppose that 6 attribute vectors are needed in total, so that white noise must be compensated. The attribute value of the compensating white noise is 1/(6-3), and the matrix W of the compensating attribute vectors is

(\begin{matrix} 0.3333 & 0.3333 & 0.3333 \\ 0.3333 & 0.3333 & 0.3333 \\ 0.3333 & 0.3333 & 0.3333 \end{matrix})

Suppose that the compensated attribute information contributes 9/10 of the similarity and that the attribute information actually available contributes 1/10. The estimated article-to-article similarity based on the attribute information is then sim_tag' = (sim_tag*0.1) + (W*0.9), which gives the following matrix sim_tag':

(\begin{matrix} 0.3500 & 0.3250 & 0.3250 \\ 0.3250 & 0.3750 & 0.3000 \\ 0.3250 & 0.3000 & 0.3750 \end{matrix})
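The weighted sum above can be sketched as follows. The values 6, 3 and 0.1 are those of the example; the function name add_white_noise and the parameter name known_ratio are hypothetical and are introduced only for this illustration.

```python
# Illustrative sketch only; add_white_noise and known_ratio are assumed names.

def add_white_noise(sim, noise, known_ratio):
    """Blend a similarity matrix with a white noise matrix, element by element.

    known_ratio is the weight of the known attribute information;
    the white noise matrix receives the remaining weight.
    """
    return [[known_ratio * s + (1.0 - known_ratio) * w
             for s, w in zip(sim_row, noise_row)]
            for sim_row, noise_row in zip(sim, noise)]


sim_tag = [[0.50, 0.25, 0.25],
           [0.25, 0.75, 0.00],
           [0.25, 0.00, 0.75]]

total_attributes = 6   # attribute vectors needed in total (from the example)
known_attributes = 3   # attribute vectors actually available
noise_value = 1.0 / (total_attributes - known_attributes)
W = [[noise_value] * 3 for _ in range(3)]

sim_tag_prime = add_white_noise(sim_tag, W, known_ratio=0.1)
for row in sim_tag_prime:
    print(["%.4f" % x for x in row])
```

Because each row of sim_tag and each row of W sums to 1, each row of the blended matrix also sums to 1 for any ratio between 0 and 1.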

Here the known attribute information is assumed to account for 10%; the actual proportion needs to be determined by cross-validation so that the RMSE of the result is minimized.

Next, the enhanced article-to-article similarity matrix sim_log_enhance based on the behavior data and the article-to-article similarity matrix sim_tag' based on the feature data and supplemented with white noise, both obtained above, are synthesized using Bayesian formula 17. Here the feature-based similarity matrix sim_tag' with white noise added is the prior information (sim'(b_i, b_j) in the formula), and the behavior-based article-to-article similarity matrix sim_log_enhance is the conditional information (sim''(b_j, b_i) in the formula). The synthesis proceeds as follows:

The numerator of the Bayesian formula is the element-wise product of the matrix sim_tag' and the transposed matrix (sim_log_enhance)^T. Normalizing this element-wise product gives the similarity matrix that combines the similarity based on the behavior data and the similarity based on the feature data, that is, the normalized Bayesian similarity matrix sim_BAYES:

(\begin{matrix} 0.3932 & 0.3034 & 0.3034 \\ 0.2862 & 0.6546 & 0.0592 \\ 0.2862 & 0.0592 & 0.6546 \end{matrix})
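The synthesis by formula 17 can be sketched as follows, using the four-decimal values of sim_tag' and sim_log_enhance given above as inputs. Because these inputs are already rounded, the result agrees with sim_BAYES only to within about 0.0001 per entry. The function name bayes_combine is hypothetical.

```python
# Illustrative sketch only; bayes_combine is a hypothetical helper name.

def bayes_combine(prior, cond):
    """Synthesize two similarity matrices according to formula 17.

    prior[i][j] plays the role of sim'(b_i, b_j) and
    cond[j][i] plays the role of sim''(b_j, b_i).
    """
    n = len(prior)
    combined = []
    for i in range(n):
        # Numerator: element-wise product of prior and the transpose of cond.
        products = [prior[i][j] * cond[j][i] for j in range(n)]
        total = sum(products)  # denominator: sum over all articles
        combined.append([p / total if total else 0.0 for p in products])
    return combined


sim_tag_prime = [[0.3500, 0.3250, 0.3250],
                 [0.3250, 0.3750, 0.3000],
                 [0.3250, 0.3000, 0.3750]]

sim_log_enhance = [[0.3795, 0.3102, 0.3102],
                   [0.3154, 0.6150, 0.0696],
                   [0.3154, 0.0696, 0.6150]]

sim_bayes = bayes_combine(sim_tag_prime, sim_log_enhance)
for row in sim_bayes:
    print(["%.4f" % x for x in row])
```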

Through the above calculation, the similarity based on the behavior data and the similarity based on the feature data are effectively synthesized, so that a good similarity result is obtained.

Figure 14 is a block diagram of a system for Bayesian synthesis of the similarity based on the behavior data and the similarity based on the feature data. With reference to Figure 14, the similarity measurement system 141 comprises: a data acquisition unit 142, which acquires behavior data about users and feature data of articles; a behavior-data-based similarity calculation unit 143, which calculates the similarity between articles based on the behavior data; a feature-data-based similarity calculation unit 144, which calculates the similarity between articles based on the feature data; and a similarity synthesis unit 145, which synthesizes the similarity obtained from the behavior data and the similarity obtained from the feature data using the following Bayesian formula:

{sim}^{'''} (b_{i}, b_{j}) = \frac{{sim}^{'} (b_{i}, b_{j}) * {sim}^{''} (b_{j}, b_{i})}{\underset{j}{Σ} {sim}^{'} (b_{i}, b_{j}) * {sim}^{''} (b_{j}, b_{i})}

The above description uses the similarity between articles as the example throughout, but the method is equally applicable to calculating the similarity between users. First, the behavior data and the feature data of the users are acquired. The user-to-user similarity based on the behavior data and the user-to-user similarity based on the feature data can then be calculated by the methods described above, and the two are synthesized using the Bayesian formula, yielding a user-to-user Bayesian similarity that combines the users' behavior data and feature data. Likewise, the white noise compensation methods for the behavior data and the feature data and the method of enhancing similarity described above can also be applied to the calculation of user-to-user similarity.

According to the similarity measurement method and system of the present invention, the similarity based on the behavior data and the similarity based on the feature data can be effectively synthesized, yielding a more accurate similarity based on both the behavior data and the feature data. Those skilled in the art will also understand that many further optional embodiments and improvements can be used to implement the present invention, and that the above embodiments and examples merely illustrate one or more embodiments.

As described above, the present invention provides a similarity measurement method and system that combine the similarity based on the behavior data and the similarity based on the feature data. The present invention is not limited to the above embodiments; anything within the scope of this technical concept is included in the scope of the present invention.

Claims

1. A similarity measurement method, characterized by comprising the following steps:

a data acquisition step of acquiring behavior data about users and feature data of articles;

a behavior-data-based similarity calculation step of calculating the similarity between articles based on the behavior data;

a feature-data-based similarity calculation step of calculating the similarity between articles based on the feature data; and

a similarity synthesis step of synthesizing the similarity obtained from the behavior data and the similarity obtained from the feature data using the following Bayesian formula:

{sim}^{'''} (b_{i}, b_{j}) = \frac{{sim}^{'} (b_{i}, b_{j}) * {sim}^{''} (b_{j}, b_{i})}{\underset{j}{Σ} {sim}^{'} (b_{i}, b_{j}) * {sim}^{''} (b_{j}, b_{i})}

2. The similarity measurement method according to claim 1, characterized in that

the behavior-data-based similarity calculation step comprises the following steps:

generating, from the acquired behavior data, a relational matrix between users and articles and a relational matrix between articles and users;

generating, from the relational matrix between users and articles and the relational matrix between articles and users, a user-to-article probability matrix and an article-to-user probability matrix; and

multiplying the article-to-user probability matrix by the user-to-article probability matrix to calculate the similarity matrix between articles.

3. The similarity measurement method according to claim 1, characterized in that

in the behavior-data-based similarity calculation step, using the acquired users a in the user set, the articles b in the article set, and the undifferentiated number of operations sim(a, b) of user a in the user set on article b in the article set, the similarity sim''(b_j, b_i) between article b_j and article b_i in the article set is calculated based on the following formula to generate the similarity matrix:

\begin{matrix} {sim}^{''} (b_{j}, b_{i}) = \\ k * \underset{m}{Σ} (\frac{sim (a_{m}, b_{j}) * sim (a_{m}, b_{i})}{\underset{n}{Σ} sim (a_{m}, b_{n}) * \underset{n}{Σ} sim (a_{m}, b_{n}) * \underset{n}{Σ} sim (a_{n}, b_{j}) * \underset{n}{Σ} sim (a_{n}, b_{i})}) \end{matrix}

4. The similarity measurement method according to claim 3, characterized in that

the similarity matrix is calculated again using the method by which the similarity matrix was calculated, to obtain an enhanced article-to-article similarity matrix in which the similarity associations are strengthened, and the enhanced similarity matrix is used as the similarity result between article b_j and article b_i based on the behavior data.

5. The similarity measurement method according to claim 1, characterized in that

before the similarity based on the behavior data is calculated, the method further comprises a white noise compensation step: for a user whose number of operations on articles is lower than a predetermined number, the number of operations is supplemented up to the predetermined number.

6. The similarity measurement method according to claim 1, characterized in that

the feature-data-based similarity calculation step comprises the following steps:

generating, from the acquired feature data, a relational matrix between articles and attributes and a relational matrix between attributes and articles;

generating, from the relational matrix between articles and attributes and the relational matrix between attributes and articles, an article-to-attribute probability matrix and an attribute-to-article probability matrix; and

multiplying the article-to-attribute probability matrix by the attribute-to-article probability matrix to calculate the similarity matrix between articles.

7. The similarity measurement method according to claim 1, characterized in that

in the feature-data-based similarity calculation step, using the acquired articles b in the article set, the known attributes c in the attribute set, and the attribute value sim(c, b) of known attribute c in the attribute set for article b in the article set, the similarity sim'(b_i, b_j) between article b_i and article b_j in the article set is calculated based on the following formula to generate the article-to-article similarity matrix for the known attributes:

\begin{matrix} {sim}^{'} (b_{i}, b_{j}) = \\ k * \underset{m}{Σ} (\frac{sim (c_{m}, b_{i}) * sim (c_{m}, b_{j})}{\underset{n}{Σ} sim (c_{m}, b_{n}) * \underset{n}{Σ} sim (c_{m}, b_{n}) * \underset{n}{Σ} sim (c_{n}, b_{i}) * \underset{n}{Σ} sim (c_{n}, b_{j})}) \end{matrix}

8. The similarity measurement method according to claim 6 or 7, characterized in that

the method further comprises a white noise compensation step for the similarity based on the feature data: for the unknown attributes, the similarities between any article and the other articles are set to be identical and to sum to 1, giving a white noise compensation matrix of article-to-article similarity for the unknown attributes; and the article-to-article similarity matrix for the known attributes and the white noise compensation matrix of article-to-article similarity for the unknown attributes are summed in a predetermined ratio to obtain the feature-based similarity matrix supplemented with white noise.

9. A similarity measurement system, characterized by comprising:

a data acquisition unit, which acquires behavior data about users and feature data of articles;

a behavior-data-based similarity calculation unit, which calculates the similarity between articles based on the behavior data;

a feature-data-based similarity calculation unit, which calculates the similarity between articles based on the feature data; and

a similarity synthesis unit, which synthesizes the similarity obtained from the behavior data and the similarity obtained from the feature data using the following Bayesian formula:

{sim}^{'''} (b_{i}, b_{j}) = \frac{{sim}^{'} (b_{i}, b_{j}) * {sim}^{''} (b_{j}, b_{i})}{\underset{j}{Σ} {sim}^{'} (b_{i}, b_{j}) * {sim}^{''} (b_{j}, b_{i})}

10. The similarity measurement system according to claim 9, characterized in that

the behavior-data-based similarity calculation unit comprises:

a mathematical model building unit, which generates, from the acquired behavior data, a relational matrix between users and articles and a relational matrix between articles and users;

a probability matrix generation unit, which generates, from the relational matrix between users and articles and the relational matrix between articles and users, a user-to-article probability matrix and an article-to-user probability matrix; and

a similarity calculation unit, which multiplies the article-to-user probability matrix by the user-to-article probability matrix to calculate the similarity matrix between articles.

11. The similarity measurement system according to claim 9, characterized in that

in the behavior-data-based similarity calculation unit, using the acquired users a in the user set, the articles b in the article set, and the undifferentiated number of operations sim(a, b) of user a in the user set on article b in the article set, the similarity sim''(b_j, b_i) between article b_j and article b_i in the article set is calculated based on the following formula to generate the similarity matrix:

\begin{matrix} {sim}^{''} (b_{j}, b_{i}) = \\ k * \underset{m}{Σ} (\frac{sim (a_{m}, b_{j}) * sim (a_{m}, b_{i})}{\underset{n}{Σ} sim (a_{m}, b_{n}) * \underset{n}{Σ} sim (a_{m}, b_{n}) * \underset{n}{Σ} sim (a_{n}, b_{j}) * \underset{n}{Σ} sim (a_{n}, b_{i})}) \end{matrix}

12. The similarity measurement system according to claim 11, characterized in that

the system further comprises a similarity enhancement unit, which calculates the similarity matrix again using the method of the similarity calculation unit, to obtain an enhanced article-to-article similarity matrix in which the similarity associations are strengthened, the enhanced similarity matrix being used as the similarity result between article b_j and article b_i based on the behavior data; and

the similarity synthesis unit performs the synthesis by the Bayesian formula using the enhanced similarity matrix.

13. The similarity measurement system according to claim 9, characterized in that

the system further comprises a white noise compensation unit for the behavior data, which, before the similarity calculation unit calculates the similarity based on the behavior data, supplements the number of operations up to a predetermined number for a user whose number of operations on articles is lower than the predetermined number.

14. The similarity measurement system according to claim 9, characterized in that

the feature-data-based similarity calculation unit comprises:

a mathematical model building unit, which generates, from the acquired feature data, a relational matrix between articles and attributes and a relational matrix between attributes and articles;

a probability generation unit, which generates, from the relational matrix between articles and attributes and the relational matrix between attributes and articles, an article-to-attribute probability matrix and an attribute-to-article probability matrix; and

a similarity calculation unit, which multiplies the article-to-attribute probability matrix by the attribute-to-article probability matrix to calculate the similarity matrix between articles.

15. The similarity measurement system according to claim 9, characterized in that

in the feature-data-based similarity calculation unit, using the acquired articles b in the article set, the known attributes c in the attribute set, and the attribute value sim(c, b) of known attribute c in the attribute set for article b in the article set, the similarity sim'(b_i, b_j) between article b_i and article b_j in the article set is calculated based on the following formula to generate the article-to-article similarity matrix for the known attributes:

\begin{matrix} {sim}^{'} (b_{i}, b_{j}) = \\ k * \underset{m}{Σ} (\frac{sim (c_{m}, b_{i}) * sim (c_{m}, b_{j})}{\underset{n}{Σ} sim (c_{m}, b_{n}) * \underset{n}{Σ} sim (c_{m}, b_{n}) * \underset{n}{Σ} sim (c_{n}, b_{i}) * \underset{n}{Σ} sim (c_{n}, b_{j})}) \end{matrix}

16. The similarity measurement system according to claim 14 or 15, characterized in that

the system further comprises a white noise compensation unit for the similarity based on the feature data, which, for the unknown attributes, sets the similarities between any article and the other articles to be identical and to sum to 1, giving a white noise compensation matrix of article-to-article similarity for the unknown attributes, and which sums the article-to-article similarity matrix for the known attributes and the white noise compensation matrix of article-to-article similarity for the unknown attributes in a predetermined ratio to obtain the feature-based similarity matrix supplemented with white noise.