This application is a divisional application of patent application No. 201210228726.1, filed on June 25, 2012 and entitled "A method and system for acquiring personalized features of users and documents".
Detailed Description
The method of the present invention is described in further detail below with reference to the accompanying drawings.
The detailed description of the method covers the following parts. First, the meaning of the user set, the document set and the feature set is explained, together with the parameter vector representation of users and documents. Next, the parameter vector update algorithm for users and documents is described. Then the ordering vector representation of documents and the document ranking algorithm based on document parameter vectors are described, followed by the personalized document retrieval method based on a query vector. Finally, a system for acquiring personalized features of users and documents is described.
The meaning of the user set U, the document set D and the feature set K is explained first.
A server connected to the Internet stores a user set U composed of multiple user identifiers and a document set D composed of multiple document identifiers. A user identifier is the unique identifier of a user on the Internet and is one of a user account, a mobile phone number, a Cookie identification code, an IP address, an Email address and an instant messaging number. A document identifier is the unique identifier of a document on the Internet, such as the URL of a Web page. The user set U contains M elements and the document set D contains N elements.
The server also stores a feature set K composed of multiple feature identifiers; the feature set K contains L elements. The features in K are selected from the features of the users in U and the features of the documents in D. Users and documents share the same feature set K. If a user has the "music" feature, the user likes music; if a document has the "music" feature, the document concerns a musical topic.
The parameter vector representation of users and documents is described below. It is similar to the vector representation of the vector space model (VSM): feature items are the basic units of user features or document features. In the method and system of this patent, the set of degrees of correlation between a user and each feature is the parameter vector of the user, and the set of degrees of correlation between a document and each feature is the parameter vector of the document.
Fig. 1 shows the parameter vector representation of each user in the user set U. The parameter vector of any user m (m ∈ U) is U(m) = (uwm1, uwm2, ..., uwmk, ..., uwmL), where uwmk denotes the degree of correlation between user m and feature k (k ∈ K). In addition, the degrees of correlation between each user in U and feature k are gathered into one vector, called the k-th user column vector of the user set U: (uw1k, uw2k, ..., uwMk).
Fig. 2 shows the parameter vector representation of each document in the document set D. The parameter vector of any document n (n ∈ D) is D(n) = (dwn1, dwn2, ..., dwnk, ..., dwnL), where dwnk denotes the degree of correlation between document n and feature k (k ∈ K). In addition, the degrees of correlation between each document in D and feature k are gathered into one vector, called the k-th document column vector of the document set D: (dw1k, dw2k, ..., dwNk).
The degree of correlation is a real value that indicates how closely a user or a document is related to a feature in the feature set K. If a user or document is associated more with the music feature and less with the sports feature, we say that its degree of correlation with the music feature is high and its degree of correlation with the sports feature is low. In feature selection, some features are correlated with one another, so the dimension of the feature set K can be reduced by reducing the correlation between features, which lowers the server storage requirement and improves the efficiency of the algorithm. Some features need not be included in the feature set directly, because their degrees of correlation can be computed from the degrees of correlation of one or several other features in K.
The methods for setting the initial value of the parameter vector of a user or document are described below through three examples. The initial values are usually restricted so that, for any m ∈ U, n ∈ D and k ∈ K, uwmk ∈ [0, 1] and dwnk ∈ [0, 1]. If no initial value is set for the parameter vector of a user or document, it defaults to the zero vector.
Example 1 sets the initial value of the parameter vector of user m (m ∈ U) or document n (n ∈ D) manually. For instance, with the total number of features L = 5 and the feature set K = (science, education, finance, music, sports), set U(m) = (uwm1, uwm2, uwm3, uwm4, uwm5) = (0, 0.9, 0, 1, 0). That is, the degree of correlation between user m and the "education" feature is 0.9, that with the "music" feature is 1, and that with every other feature is zero. The initial value of the parameter vector D(n) = (dwn1, dwn2, ..., dwnk, ..., dwnL) of document n can be set in the same way.
Example 2 sets the initial value of the parameter vector of user m (m ∈ U). User m first submits a set of documents H (H ⊆ D), where the parameter vector of document r (r ∈ H) is (dwr1, dwr2, ..., dwrL). Then, for each k ∈ K, set uwmk = (σ1/s) · ∑(r ∈ H) dwrk, or uwmk = (σ1/s) · ∑(r ∈ H) [dwrk / (∑(k ∈ K) dwrk)], where s is the number of elements in the set H and σ1 is a preset positive constant. In a similar way, user m can also select a group of users in the user set U to compute the initial value of the parameter vector of user m.
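The two variants of Example 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `init_user_vector`, the flag `normalize_each_doc`, the default σ1 = 1 and the decision to skip documents whose vector sums to zero are all hypothetical choices, not part of the method as described.

```python
from typing import List


def init_user_vector(doc_vectors: List[List[float]],
                     sigma1: float = 1.0,
                     normalize_each_doc: bool = False) -> List[float]:
    """Initial uwmk from the parameter vectors of the documents in H.

    Plain variant:      uwmk = (sigma1 / s) * sum_r dwrk
    Normalized variant: uwmk = (sigma1 / s) * sum_r dwrk / (sum_k dwrk)
    """
    s = len(doc_vectors)
    if s == 0:
        return []
    num_features = len(doc_vectors[0])
    acc = [0.0] * num_features
    for dw in doc_vectors:
        total = sum(dw)
        for k in range(num_features):
            if normalize_each_doc:
                # Assumption: a document with an all-zero vector contributes nothing.
                if total > 0:
                    acc[k] += dw[k] / total
            else:
                acc[k] += dw[k]
    return [sigma1 * x / s for x in acc]


# Two submitted documents over K = (science, education, finance, music, sports).
H = [[0.0, 0.8, 0.0, 1.0, 0.0],
     [0.0, 0.4, 0.0, 0.0, 0.6]]
plain = init_user_vector(H)
normalized = init_user_vector(H, normalize_each_doc=True)
```

With σ1 = 1 the normalized variant yields a vector whose components sum to 1, which keeps every uwmk inside [0, 1].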
Example 3 sets the initial value of the parameter vector of a document. A category directory is a special kind of document; a portal site, for example, usually includes category directories such as news, music, sports, finance and technology. We assume that the documents under the same category directory share certain features; for example, the documents under the sports directory are all related to sports. If document n (n ∈ D) is a document under category directory h (h ∈ D), the initial value of the parameter vector of document n is determined by the parameter vector of the category directory h. For example, for each k ∈ K, set dwnk = σ2 · dwhk, where σ2 is a preset positive constant.
Fig. 3 is a flow chart of the parameter vector update algorithm for users and documents. The following steps are executed in a server connected to the Internet:
S11. Store a user set U composed of multiple user identifiers and a document set D composed of multiple document identifiers; store a feature set K composed of multiple feature identifiers;
S12. Set the initial value of the parameter vector for at least one user in the user set U or at least one document in the document set D;
S13. Receive a signal indicating that any user m (m ∈ U) accesses any document n (n ∈ D);
S14. According to the signal, read the parameter vector U(m) = (uwm1, uwm2, ..., uwmk, ..., uwmL) of user m, where uwmk denotes the degree of correlation between user m and feature k (k ∈ K);
S15. According to the signal, read the parameter vector D(n) = (dwn1, dwn2, ..., dwnk, ..., dwnL) of document n, where dwnk denotes the degree of correlation between document n and feature k (k ∈ K);
S16. Apply the parameter vector update algorithm to update the parameter vectors of user m and document n. Let the updated parameter vector of user m be U*(m) = (uwm1*, uwm2*, ..., uwmk*, ..., uwmL*) and the updated parameter vector of document n be D*(n) = (dwn1*, dwn2*, ..., dwnk*, ..., dwnL*); the algorithm then includes:
U*(m) = F1[U(m), D(n)];
D*(n) = F2[U(m), D(n)];
After step S16 has been executed, return to step S13.
Here F1(·) and F2(·) are each functions with U(m) and D(n) as independent variables. User m stands for any user in the user set U, not one specific user, and document n stands for any document in the document set D, not one specific document. For example, the signal may carry m = 1023, n = 3428 on one execution of step S13, and m = 33456, n = 28477 on the next execution of step S13.
In one application example of the method of Fig. 3, for each k ∈ K, uwmk* is an increasing function of dwnk, and dwnk* is an increasing function of uwmk.
In one application example of the method of Fig. 3, for each k ∈ K, both uwmk* and dwnk* are decreasing functions of the frequency with which user m accesses the document set D. The frequency is the number of times user m accesses documents in the document set D within a set time period, divided by the length of that time period.
In one application example of the method of Fig. 3, for each k ∈ K, uwmk* is a decreasing function of ∑(k ∈ K) dwnk, and dwnk* is a decreasing function of ∑(k ∈ K) uwmk.
In one application example of the method of Fig. 3, the signals are extracted at random from the Web log within a set time. Within a set time, the same number of access signals is extracted for each active user in the user set U, and these signals serve as the input signals of the method of Fig. 3. An active user is a user who accesses the document set D at least a set number of times within a set time. Inactive users cannot use the method of Fig. 3 to update the parameter vectors of users and documents.
In the method of Fig. 3, after the parameter vector update algorithm has been executed a set number of times t1, the k-th user column vector (uw1k, uw2k, ..., uwMk) is normalized for each feature k ∈ K; after the parameter vector update algorithm has been executed a set number of times t2, the k-th document column vector (dw1k, dw2k, ..., dwNk) is normalized for each feature k ∈ K; t1 and t2 are positive integers. Executing the parameter vector update algorithm once means executing step S16 once. The normalization method includes the following specific application examples.
Example 1: The k-th user column vector (uw1k, uw2k, ..., uwMk) of the user set U is normalized as follows: the set {uw1k, uw2k, ..., uwMk} is sorted in descending order and the value of the element ranked M1 is taken as the reference value; then, for each m ∈ U, uwmk = 1 is set if uwmk is greater than or equal to the reference value, and otherwise uwmk is rescaled by the reference value. The k-th document column vector (dw1k, dw2k, ..., dwNk) of the document set D is normalized in the same way: the set {dw1k, dw2k, ..., dwNk} is sorted in descending order and the value of the element ranked N1 is taken as the reference value; then, for each n ∈ D, dwnk = 1 is set if dwnk is greater than or equal to the reference value, and otherwise dwnk is rescaled by the reference value. M1 and N1 are preset positive constants.
Example 2: The k-th document column vector (dw1k, dw2k, ..., dwNk) of the document set D is normalized as follows. First, the set {dw1k, dw2k, ..., dwNk} is sorted and, according to the sorted order, divided into r groups with approximately equal numbers of elements, such that for any two groups a and b, either every element of group a is greater than or equal to every element of group b, or every element of group a is less than or equal to every element of group b. The smallest value in each group is taken to form the set {s1, s2, ..., sr}, with s1 < s2 < ... < sr. Then, for each n ∈ D: if dwnk < s1, set dwnk = 0; if sm ≤ dwnk ≤ s(m+1), set dwnk = g1(sm); if dwnk > sr, set dwnk = 1. Here g1(sm) is an increasing function with g1(sm) ∈ (0, 1), for example g1(sm) = sm/sr; the index m satisfies 1 ≤ m < r, and r is a preset positive integer. The k-th user column vector of the user set U can be normalized by the same method.
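The grouping-based normalization of Example 2 can be sketched in Python as below. Several details are assumptions made only for illustration: the function name `bucket_normalize`, the way group boundaries are chosen from the sorted list, the handling of a value that equals a boundary (it is assigned to the highest group whose minimum it reaches, capped at group r - 1) and the choice g1(sm) = sm/sr.

```python
from typing import List


def bucket_normalize(column: List[float], r: int) -> List[float]:
    """Normalize one column vector by splitting its sorted values into r groups."""
    ordered = sorted(column)
    n = len(ordered)
    # Minimum of each of the r nearly equal groups: s1 <= s2 <= ... <= sr.
    bounds = [ordered[(g * n) // r] for g in range(r)]
    s_r = bounds[-1]
    result = []
    for value in column:
        if value < bounds[0]:
            result.append(0.0)          # dwnk < s1
        elif value > s_r:
            result.append(1.0)          # dwnk > sr
        else:
            # Largest index m (0-based) with s_m <= value, capped so s_(m+1) exists.
            m = max(j for j in range(r) if bounds[j] <= value)
            if r > 1:
                m = min(m, r - 2)
            # Assumed g1(sm) = sm / sr.
            result.append(bounds[m] / s_r if s_r > 0 else 0.0)
    return result


column_k = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
normalized_k = bucket_normalize(column_k, r=3)
```

For this column the group minima are 0.1, 0.3 and 0.5, so the two smallest values map to 0.1/0.5, the next three to 0.3/0.5, and the largest value, which exceeds sr, maps to 1.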
In one application example of the method of Fig. 3, after step S16 has been executed, the method further includes setting uwmk = uwmk* and dwnk = dwnk* for each k ∈ K.
In one application example of the method of Fig. 3, the method satisfies uwmk* ≥ uwmk and dwnk* ≥ dwnk for each k ∈ K.
In the method of Fig. 3, the type T of the signal is at least one of the following: T = 1 indicates that user m clicks a link to document n; T = 2 indicates that user m types in the address of document n; T = 3 indicates that user m marks document n as liked (for example, with Facebook's Like or Google's +1); T = 4 indicates that user m forwards document n; T = 5 indicates that user m comments on document n; T = 6 indicates that user m bookmarks document n.
Application example 1
In one application example of the method of Fig. 3, the parameter vector update algorithm is specifically:
uwmk* = β1 · uwmk + λ1(n, m, T) · f1(dwnk) (for each k ∈ K)
dwnk* = β2 · dwnk + λ2(m, n, T) · f2(uwmk) (for each k ∈ K)
where λ1(n, m, T) is the influence coefficient of document n on user m for signal type T, and λ2(m, n, T) is the influence coefficient of user m on document n for signal type T; β1 and β2 are preset positive constants; f1(dwnk) is an increasing function of dwnk and f2(uwmk) is an increasing function of uwmk. For example, f1(dwnk) = σ3 · dwnk and f2(uwmk) = σ4 · uwmk; or f1(dwnk) = σ5 · {1/[1 + exp(-dwnk)]} and f2(uwmk) = σ6 · {1/[1 + exp(-uwmk)]}, where σ3, σ4, σ5 and σ6 are preset positive constants.
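A minimal Python sketch of this update rule follows. The helper names `linear`, `sigmoid` and `update_vectors` are hypothetical, as are the sample values; both new vectors are computed from the old values of U(m) and D(n), matching U*(m) = F1[U(m), D(n)] and D*(n) = F2[U(m), D(n)].

```python
import math
from typing import Callable, List, Tuple


def linear(sigma: float) -> Callable[[float], float]:
    """f(x) = sigma * x, the first example form of f1 and f2."""
    return lambda x: sigma * x


def sigmoid(sigma: float) -> Callable[[float], float]:
    """f(x) = sigma * 1 / (1 + exp(-x)), the second example form of f1 and f2."""
    return lambda x: sigma / (1.0 + math.exp(-x))


def update_vectors(uw_m: List[float], dw_n: List[float],
                   lam1: float, lam2: float,
                   f1: Callable[[float], float],
                   f2: Callable[[float], float],
                   beta1: float = 1.0,
                   beta2: float = 1.0) -> Tuple[List[float], List[float]]:
    """Return (U*(m), D*(n)) for one access signal, feature by feature."""
    new_uw = [beta1 * u + lam1 * f1(d) for u, d in zip(uw_m, dw_n)]
    new_dw = [beta2 * d + lam2 * f2(u) for u, d in zip(uw_m, dw_n)]
    return new_uw, new_dw


# Hypothetical values: a user weighted toward feature 2 opens a feature-1 document.
new_user, new_doc = update_vectors([0.0, 0.5], [1.0, 0.0],
                                   lam1=0.01, lam2=0.01,
                                   f1=linear(1.0), f2=linear(1.0))
```

Each access nudges the user toward the features of the document and the document toward the features of the user, with λ1 and λ2 controlling the step size.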
In application example 1, a threshold dCk is set for the k-th document column vector for each feature k ∈ K, and f1(dwnk) = 0 is taken if dwnk ≤ dCk; likewise, a threshold uCk is set for the k-th user column vector for each feature k ∈ K, and f2(uwmk) = 0 is taken if uwmk ≤ uCk. Here dCk equals the component ranked a1 among the components of the k-th document column vector (dw1k, dw2k, ..., dwNk), uCk equals the component ranked a2 among the components of the k-th user column vector (uw1k, uw2k, ..., uwMk), and a1 and a2 are preset positive integers.
In application example 1, specific implementations of λ1(n, m, T) and λ2(m, n, T) include the following examples.
Example 1: λ1(n, m, T) and λ2(m, n, T) are set to constants, for example λ1(n, m, T) = c1 and λ2(m, n, T) = c2, where c1 and c2 are preset positive constants such as c1 = c2 = 0.01.
Example 2: λ1(n, m, T) and λ2(m, n, T) are each decreasing functions of the frequency with which user m accesses the document set D. For example, set λ1(n, m, T) = 1/g2[freq(m)] and λ2(m, n, T) = 1/g2[freq(m)], where g2(x) is an increasing function. For instance, g2(x) is a piecewise function: g2(x) = 1 when x < a3, and g2(x) = 1 + a4 · (x - a3) when x ≥ a3, where a3 and a4 are preset positive constants. freq(m) is the frequency with which user m accesses documents in the document set D.
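The frequency-based coefficient of Example 2 can be sketched in Python as below. The function names are hypothetical, and the defaults a3 = 200 and a4 = 0.01 are borrowed from the values used later in application example 2 purely for illustration.

```python
def access_frequency(access_count: int, period_length: float) -> float:
    """freq(m): accesses to the document set D divided by the period length."""
    return access_count / period_length


def g2(x: float, a3: float, a4: float) -> float:
    """Piecewise increasing function: 1 below a3, then 1 + a4 * (x - a3)."""
    return 1.0 if x < a3 else 1.0 + a4 * (x - a3)


def lambda_from_frequency(freq: float, a3: float = 200.0, a4: float = 0.01) -> float:
    """lambda1 = lambda2 = 1 / g2[freq(m)], a decreasing function of freq(m)."""
    return 1.0 / g2(freq, a3, a4)


light_user = lambda_from_frequency(access_frequency(100, 1.0))
heavy_user = lambda_from_frequency(access_frequency(300, 1.0))
```

A user below the threshold a3 keeps full influence, while a user who accesses documents very often has each individual access weighted less, which limits the effect of automated or excessive clicking.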
Example 3: Set λ1(n, m, T) = 1/g3[∑(k ∈ K) dwnk] and λ2(m, n, T) = 1/g3[∑(k ∈ K) uwmk], where g3(x) is an increasing function. For instance, g3(x) is a piecewise function: g3(x) = 1 when x < a5, and g3(x) = 1 + a6 · (x - a5) when x ≥ a5, where a5 and a6 are preset positive constants. When computing ∑(k ∈ K) dwnk, take dwnk = 0 if dwnk ≤ min_dCk; when computing ∑(k ∈ K) uwmk, take uwmk = 0 if uwmk ≤ min_uCk; min_dCk and min_uCk are preset positive constants.
Example 4: λ1(n, m, T) = d1(n) · u2(m) and λ2(m, n, T) = u1(m) · d2(n), where d1(n) indicates whether the parameter vector of document n may be used to update the parameter vectors of users in the user set U, u2(m) indicates whether the parameter vector of user m may be updated by the parameter vectors of documents in the document set D, u1(m) indicates whether the parameter vector of user m may be used to update the parameter vectors of documents in the document set D, and d2(n) indicates whether the parameter vector of document n may be updated by the parameter vectors of users in the user set U. u1(m), u2(m), d1(n) and d2(n) are preset parameters whose value is 0 or 1, where 1 means yes and 0 means no. The purpose of this example is to prevent malicious attacks: the parameter vectors of some documents (or users) that have not passed reliability certification cannot be used to update the parameter vectors of other users (or documents), and the parameter vectors of some important documents (or users) cannot be updated by the parameter vectors of other users (or documents).
Example 5: λ1(n, m, T) = s1(T) and λ2(m, n, T) = s2(T), where T is the type of the signal by which the user accesses the document, and s1(T) and s2(T) are each functions of T.
Example 6: λ1(n, m, T) is an increasing function of the number of times document n has been accessed or of its PageRank value, and λ2(m, n, T) is an increasing function of the number of followers of user m.
Example 7: λ1(n, m, T) and λ2(m, n, T) are each increasing functions of the similarity sim(m, n) between the parameter vectors of user m and document n. For example, λ1(n, m, T) = 1 + c3 · sim(m, n) and λ2(m, n, T) = 1 + c4 · sim(m, n), where c3 and c4 are preset constants greater than or equal to 1, and sim(m, n) = [∑(k ∈ K) (uwmk · dwnk)] / {[∑(k ∈ K) (uwmk)^2]^(1/2) · [∑(k ∈ K) (dwnk)^2]^(1/2)}. The meaning of this example is that the higher the similarity between the parameter vectors of a user and a document, the larger the proportionality coefficient with which they "vote" for each other. When computing sim(m, n), take dwnk = 0 if dwnk ≤ min_dCk and take uwmk = 0 if uwmk ≤ min_uCk, where min_dCk and min_uCk are preset positive constants.
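The similarity of Example 7 is a cosine similarity with small components suppressed. A minimal Python sketch follows; the function names, the zero default thresholds (the text calls for positive constants) and the convention of returning 0 when either vector is all zero are assumptions made for illustration.

```python
import math
from typing import List


def sim(uw_m: List[float], dw_n: List[float],
        min_uc: float = 0.0, min_dc: float = 0.0) -> float:
    """Cosine similarity between U(m) and D(n) after zeroing small components."""
    u = [0.0 if x <= min_uc else x for x in uw_m]
    d = [0.0 if x <= min_dc else x for x in dw_n]
    dot = sum(a * b for a, b in zip(u, d))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in d))
    # Assumption: similarity is 0 when either vector has no non-zero component.
    return dot / norm if norm > 0 else 0.0


def lambda_from_similarity(similarity: float, c: float = 3.0) -> float:
    """lambda = 1 + c * sim(m, n), with c >= 1."""
    return 1.0 + c * similarity


same_interest = sim([1.0, 0.0], [1.0, 0.0])
no_overlap = sim([1.0, 0.0], [0.0, 1.0])
thresholded = sim([0.9, 0.05], [0.8, 0.9], min_uc=0.1)
```

In the last call the user component 0.05 falls below the threshold and is treated as zero, so only the first feature contributes to the numerator.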
Example 8: λ1(n, m, T) and λ2(m, n, T) are generated by combining at least two of the methods of Examples 1 to 7 above. For example, when freq(m) > a3:
λ1(n, m, T) = c1 · {1 + c3 · sim(m, n)} · {1/[1 + a4 · (freq(m) - a3)]} · {d1(n) · u2(m)} · s1(T)
λ2(m, n, T) = c2 · {1 + c4 · sim(m, n)} · {1/[1 + a4 · (freq(m) - a3)]} · {u1(m) · d2(n)} · s2(T).
In application example 1, after this specific parameter vector update algorithm has been executed a set number of times, the k-th document column vector (dw1k, dw2k, ..., dwNk) and the k-th user column vector (uw1k, uw2k, ..., uwMk) need to be normalized separately for each feature k ∈ K.
Application example 2
This is a specific implementation of application example 1. For ease of explanation, assume that there are two users and three documents on the Internet and that each user and each document has two features, that is, the user set U = {1, 2}, the document set D = {1, 2, 3} and the feature set K = {1, 2}. The parameter vectors of user 1 and user 2 are (uw11, uw12) and (uw21, uw22) respectively, and the parameter vectors of document 1, document 2 and document 3 are (dw11, dw12), (dw21, dw22) and (dw31, dw32) respectively. Here uwmk (m ∈ U, k ∈ K) denotes the degree of correlation between user m and feature k, and dwnk (n ∈ D, k ∈ K) denotes the degree of correlation between document n and feature k.
Assume that the server receives a signal that user 2 accesses document 3, with signal type T = 1. The parameter vectors of user 2 and document 3 are then updated according to the following parameter vector update algorithm:
uw21* = β1 · uw21 + λ1(3, 2, 1) · dw31; uw22* = β1 · uw22 + λ1(3, 2, 1) · dw32
dw31* = β2 · dw31 + λ2(2, 3, 1) · uw21; dw32* = β2 · dw32 + λ2(2, 3, 1) · uw22
where β1 = β2 = 1; λ1(3, 2, 1) is the influence coefficient of document 3 on user 2 for signal type T = 1; λ2(2, 3, 1) is the influence coefficient of user 2 on document 3 for signal type T = 1. For example:
λ1(3, 2, 1) = c1 · {1 + c3 · sim(2, 3)} · {1/[1 + a4 · (freq(2) - a3)]} · {d1(3) · u2(2)} · s1(1)
λ2(2, 3, 1) = c2 · {1 + c4 · sim(2, 3)} · {1/[1 + a4 · (freq(2) - a3)]} · {u1(2) · d2(3)} · s2(1)
where c1 = c2 = 0.01, c3 = c4 = 3, sim(2, 3) = (uw21 · dw31 + uw22 · dw32) / {[(uw21)^2 + (uw22)^2]^(1/2) · [(dw31)^2 + (dw32)^2]^(1/2)}, a3 = 200, a4 = 0.01, d1(3) = u2(2) = u1(2) = d2(3) = 1, s1(1) = 2, s2(1) = 1, and freq(2) > a3 is assumed.
After the above parameter vector update algorithm has been executed, the following assignments are made: uw21 = uw21*, uw22 = uw22*, dw31 = dw31* and dw32 = dw32*.
After the above parameter vector update algorithm has been executed, the user column vectors (uw11, uw21) and (uw12, uw22) are normalized, and the document column vectors (dw11, dw21, dw31) and (dw12, dw22, dw32) are normalized.
The user column vectors are normalized as follows: let temp1 = max(uw11, uw21), then for feature k = 1 set uw11 = uw11/temp1 and uw21 = uw21/temp1; let temp2 = max(uw12, uw22), then for feature k = 2 set uw12 = uw12/temp2 and uw22 = uw22/temp2.
The document column vectors are normalized as follows: let temp1 = max(dw11, dw21, dw31), then for feature k = 1 set dw11 = dw11/temp1, dw21 = dw21/temp1 and dw31 = dw31/temp1; let temp2 = max(dw12, dw22, dw32), then for feature k = 2 set dw12 = dw12/temp2, dw22 = dw22/temp2 and dw32 = dw32/temp2.
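Application example 2 can be traced end to end with a short Python sketch. The constants c1 to c4, a3, a4, s1(1), s2(1) and the flags come from the example; the initial parameter vectors and freq(2) = 300 are hypothetical values chosen only so that the arithmetic can be followed, and the function names are likewise illustrative.

```python
import math


def cosine(a, b):
    """sim(m, n) without thresholds; returns 0 if either vector is all zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def normalize_columns(vectors):
    """Divide every column (one feature across all rows) by its maximum."""
    num_features = len(next(iter(vectors.values())))
    for k in range(num_features):
        top = max(v[k] for v in vectors.values())
        if top > 0:
            for v in vectors.values():
                v[k] = v[k] / top


# Hypothetical initial parameter vectors (not given in the example).
UW = {1: [0.5, 0.2], 2: [1.0, 0.0]}
DW = {1: [0.3, 0.6], 2: [0.2, 0.4], 3: [1.0, 0.0]}
C1 = C2 = 0.01
C3 = C4 = 3.0
A3, A4 = 200.0, 0.01
S1 = {1: 2.0}
S2 = {1: 1.0}
FREQ_2 = 300.0  # hypothetical, chosen so that freq(2) > a3


def run_example():
    m, n, signal_type = 2, 3, 1
    users = {i: list(v) for i, v in UW.items()}
    docs = {i: list(v) for i, v in DW.items()}
    damping = 1.0 / (1.0 + A4 * (FREQ_2 - A3))
    s = cosine(users[m], docs[n])
    # d1(3) = u2(2) = u1(2) = d2(3) = 1, so the gating factors equal 1.
    lam1 = C1 * (1.0 + C3 * s) * damping * 1.0 * S1[signal_type]
    lam2 = C2 * (1.0 + C4 * s) * damping * 1.0 * S2[signal_type]
    old_u, old_d = users[m], docs[n]
    users[m] = [u + lam1 * d for u, d in zip(old_u, old_d)]  # beta1 = 1
    docs[n] = [d + lam2 * u for u, d in zip(old_u, old_d)]   # beta2 = 1
    normalize_columns(users)
    normalize_columns(docs)
    return lam1, lam2, users, docs


lam1, lam2, users_after, docs_after = run_example()
```

With these assumed inputs sim(2, 3) = 1 and the damping factor is 0.5, so λ1(3, 2, 1) = 0.04 and λ2(2, 3, 1) = 0.02; after normalization the largest entry of every column equals 1.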
Fig. 4 shows the ordering vector representation of each document in the document set D.
The core technology of a search engine is its ranking algorithm, of which the best known is the PageRank algorithm. The standard PageRank algorithm can be expressed by the following formula:
PR(p) = (1 - d)/N + d · ∑(i ∈ T) [PR(i)/C(i)] (1)
where the set T is the set of in-link web pages of web page p (p ∈ D), that is, the pages that link to p, and C(i) is the number of out-links of web page i (i ∈ T); d is the probability that a user reaches web page p by following a link from another web page; 1 - d is the probability that a user reaches web page p without following a link from another web page (for example, by typing its URL), d ∈ (0, 1); PR(p) is the ranking value of web page p in the document set D, and N is the number of web pages in the document set D. In addition, the initial ranking value of each web page is set to 1/N. Here, each element of the document set D is a web page.
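The standard PageRank iteration can be sketched in Python as follows, assuming the commonly used normalized form PR(p) = (1 - d)/N + d · ∑(i ∈ T) PR(i)/C(i), which matches the quantities defined above. The function name, the adjacency-list input format, the fixed iteration count and the value d = 0.85 are assumptions for illustration; the sketch also assumes every page has at least one out-link and that every link target is itself a key of the input.

```python
from typing import Dict, List


def pagerank(out_links: Dict[str, List[str]],
             d: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Iterate PR(p) = (1 - d)/N + d * sum over in-links i of PR(i)/C(i)."""
    pages = list(out_links)
    n = len(pages)
    # Build the in-link set T of every page from the out-link lists.
    in_links: Dict[str, List[str]] = {p: [] for p in pages}
    for i, targets in out_links.items():
        for p in targets:
            in_links[p].append(i)
    pr = {p: 1.0 / n for p in pages}  # initial ranking value 1/N
    for _ in range(iterations):
        pr = {p: (1.0 - d) / n
                 + d * sum(pr[i] / len(out_links[i]) for i in in_links[p])
              for p in pages}
    return pr


cycle = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
mixed = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In the three-page cycle every page keeps the value 1/3, and in any graph without dangling pages the ranking values continue to sum to 1.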
The shortcoming of the standard PageRank algorithm is that each web page on the Internet has only one ranking value; that is, the algorithm assumes that every user rates the importance of the same web page identically. In other words, the PageRank algorithm does not take into account the personalized differences between the users who submit search queries. The existing ranking algorithm therefore needs to be improved.
We extend the traditional PageRank value: the one-dimensional ranking value PR(p) of any document p in the document set D is extended to a multidimensional ordering vector based on domain features. Let the ordering vector of any document p (p ∈ D) be [PR(p, 1), PR(p, 2), ..., PR(p, k), ..., PR(p, L)], where PR(p, k) denotes the ranking value of document p in the document set D under feature k (k ∈ K). The ranking values of all documents under feature k ∈ K are gathered into one vector, called the k-th sequence column vector of the document set D, that is, (PR(1, k), PR(2, k), ..., PR(N, k)).
Fig. 5 is a flow chart of the document ordering vector update algorithm. Suppose the document set D contains at least two document subsets, where each document in the subset S (S ⊆ D) contains at least one link pointing to another document in the document set D, and each document in the subset E (E ⊆ D) is pointed to by a link contained in at least one document in the subset S; furthermore S ∪ E = D and S ∩ E ≠ Φ, where Φ is the empty set. The ordering vector update algorithm is then as follows: the ranking value of any document p in the document set D under feature k (k ∈ K) is a function of the ranking value under feature k of each in-link document of p and of the degree of correlation between that in-link document and feature k.
The ordering vector update algorithm includes the following two specific application examples.
Example 1: The ranking value of any document p (p ∈ D) in the document set D under feature k ∈ K is defined by formula (2):
where the set T is the set of in-link documents of document p; d is the probability that a user reaches document p by following a link from another document; 1 - d is the probability that a user reaches document p without following a link from another document (for example, by typing its URL), d ∈ (0, 1); PR(i, k) is the ranking value of document i under feature k (k ∈ K); dwik is the degree of correlation between document i and feature k (k ∈ K); N is the number of documents in the document set D. In addition, for each document i ∈ D and each feature k ∈ K, the initial ranking value of document i is set to PR(i, k) = 1/N.
Formula (2) can be written in the following vector form, formula (3):
where k ∈ K, one of the vectors is the all-ones column vector, and A is a non-negative matrix, A = (aij)N×N, defined as follows:
Example 2: The ranking value of any document p (p ∈ D) in the document set D under feature k ∈ K is defined by formula (4):
where the set T is the set of in-link documents of document p; d is the probability that a user reaches document p by following a link from another document; 1 - d is the probability that a user reaches document p without following a link from another document (for example, by typing its URL), d ∈ (0, 1); PR(i, k) is the ranking value of document i under feature k (k ∈ K); dwik is the degree of correlation between document i and feature k (k ∈ K); C(i) is the number of out-link documents of document i (i ∈ T); N is the number of documents in the document set D. In addition, for each document i ∈ D and each feature k ∈ K, the initial ranking value of document i is set to PR(i, k) = 1/N.
The vector form of formula (4) can also be written in the form of formula (3), again with the all-ones column vector, where the non-negative matrix A = (aij)N×N is defined as follows:
To guarantee the validity of formula (3), several restrictions need to be placed on the link relations between the documents in the document set D. For example, dangling pages (Dangling Page) and every link pointing to them are removed; after the ranking values of the other documents have been computed, the dangling pages and the links pointing to them are restored, and the ranking values of the dangling pages are computed according to formula (3).
Formula (3) can be solved approximately by the power iteration method (Power Method), that is, by computing the k-th sequence column vector of the document set D iteratively, starting from an initial value and producing a new sequence column vector at each iteration n. The power iteration method includes the following steps:
R10. Choose any feature k ∈ K;
R11. Generate the non-negative matrix A according to formula (2) or formula (4);
R12. Set the initial value of the k-th sequence column vector of the document set D, and set n = 0;
R13. Execute formula (3), that is, compute the sequence column vector of step n + 1 from the sequence column vector of step n;
R14. Normalize the sequence column vector of step n + 1;
R15. Determine whether the stopping condition on ε is satisfied or n > STEP; if so, terminate; otherwise set n = n + 1 and return to step R13.
Here ε and STEP are preset positive constants, and the remaining symbol denotes the component of the vector with the largest modulus.
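Steps R10 to R15 can be sketched in Python as below. Because the exact formulas are not reproduced in this text, the sketch rests on loudly stated assumptions: the update is taken to be PR(p, k) = (1 - d)/N + d · ∑(i ∈ T) PR(i, k) · dwik, optionally divided by C(i) as in Example 2; normalization divides by the component of largest modulus; and the stopping test compares the largest component-wise change with ε. The function name, the argument names and d = 0.85 are likewise hypothetical.

```python
from typing import Dict, List, Tuple


def feature_rank(out_links: Dict[str, List[str]],
                 dw_k: Dict[str, float],
                 d: float = 0.85,
                 eps: float = 1e-9,
                 max_steps: int = 1000,
                 divide_by_outdegree: bool = True) -> Dict[str, float]:
    """Power iteration for the k-th sequence column vector (assumed update form)."""
    docs = list(out_links)
    n_docs = len(docs)
    # R11: weighted in-links, i.e. the non-zero entries of the assumed matrix A.
    weighted_in: Dict[str, List[Tuple[str, float]]] = {p: [] for p in docs}
    for i, targets in out_links.items():
        for p in targets:
            w = dw_k[i] / len(targets) if divide_by_outdegree else dw_k[i]
            weighted_in[p].append((i, w))
    rank = {p: 1.0 / n_docs for p in docs}  # R12: initial value 1/N
    for _ in range(max_steps):              # R15: stop after STEP iterations
        # R13: one application of the assumed vector update.
        nxt = {p: (1.0 - d) / n_docs
                  + d * sum(w * rank[i] for i, w in weighted_in[p])
               for p in docs}
        # R14: normalize by the component of largest modulus.
        top = max(abs(v) for v in nxt.values())
        nxt = {p: v / top for p, v in nxt.items()}
        delta = max(abs(nxt[p] - rank[p]) for p in docs)
        rank = nxt
        if delta < eps:                     # R15: assumed convergence test
            break
    return rank


uniform = feature_rank({"a": ["b"], "b": ["c"], "c": ["a"]},
                       {"a": 1.0, "b": 1.0, "c": 1.0})
chain = feature_rank({"a": ["b"], "b": ["c"], "c": ["b"]},
                     {"a": 1.0, "b": 1.0, "c": 1.0})
```

Running the function once per feature k, with `dw_k` taken from the k-th document column vector, would produce one sequence column vector per feature; a document with no in-links, such as "a" in the second call, always receives the lowest value.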
Fig. 6 is a flow chart of the personalized document retrieval method based on the query vector and the ordering vector. The method includes executing the following steps in a server:
S10. Update the parameter vectors of multiple documents in the document set D and of multiple users in the user set U according to the parameter vector update algorithm; the specific implementation includes steps S11 to S16 of Fig. 3;
S20. Set the initial value of the ordering vector of each document in the document set D;
S30. For each feature k (k ∈ K), iteratively update the k-th sequence column vector of the document set D using the ordering vector update algorithm, that is, update the ordering vector of each document in the document set D;
S40. Receive the query vector set by user q (q ∈ U) and the search condition submitted by user q, and extract search keywords from the search condition; the search condition may be all the information submitted by the user in the search dialog;
S50. Retrieve from the document set D a group of documents Q that match the search keywords;
S60. Compute the personalized ranking value of each document in the group of documents Q according to the query vector and the ordering vector of each document in Q;
S70. Sort the group of documents Q according to the personalized ranking values, and send links to multiple documents in Q to user q according to the sorted order.
In the method of Fig. 6, let the query vector of user q be (swq1, swq2, ..., swqk, ..., swqL), where swqk denotes the ranking value, in the document set D and under feature k (k ∈ K), of the documents being queried, swqk ∈ [0, 1]. Examples of how the query vector is set are given below.
In the first method, user q selects features from the feature set K and sets the ranking values of the documents being queried, for example swq2 = 0.00023 and swq6 = 0.00061, with all other vector components equal to 0.
In the second method, user q submits a group of document identifiers Sq = {..., r, ...}. The ordering vector of document r (r ∈ Sq) is [PR(r, 1), PR(r, 2), ..., PR(r, k), ..., PR(r, L)], so for each feature k ∈ K the query vector of user q is set to swqk = (σ7/s) · ∑(r ∈ Sq) PR(r, k), or swqk = (σ7/s) · ∑(r ∈ Sq) {PR(r, k) / ∑(k ∈ K) PR(r, k)}, where s is the number of elements in the set Sq and σ7 is a preset positive constant.
In one application example of the method of Fig. 6, the personalized ranking value UR(i, q) of document i (i ∈ Q), based on the query vector submitted by user q, is defined as the similarity between the query vector (swq1, swq2, ..., swqk, ..., swqL) of user q and the ordering vector [PR(i, 1), PR(i, 2), ..., PR(i, k), ..., PR(i, L)] of document i, for example
UR(i, q) = {∑(k ∈ K) [PR(i, k) · swqk]} / {[∑(k ∈ K) (PR(i, k))^2]^(1/2) · [∑(k ∈ K) (swqk)^2]^(1/2)}
where PR(i, k) denotes the ranking value of document i in the document set D under feature k (k ∈ K), and swqk denotes the ranking value, in the document set D and under feature k (k ∈ K), of the documents being queried. When computing UR(i, q), for any k ∈ K, take PR(i, k) = 0 if PR(i, k) < min_PR, and take swqk = 0 if swqk < min_SW. min_PR and min_SW are preset positive constants.
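The personalized ranking value UR(i, q) and the sorting of step S70 can be sketched in Python as follows. The function names, the dictionary layout of the ordering vectors, the zero default thresholds and the sample numbers are assumptions for illustration only.

```python
import math
from typing import Dict, List


def personalized_rank(query: List[float], ordering: List[float],
                      min_sw: float = 0.0, min_pr: float = 0.0) -> float:
    """UR(i, q): cosine similarity after zeroing components below the thresholds."""
    q = [0.0 if x < min_sw else x for x in query]
    p = [0.0 if x < min_pr else x for x in ordering]
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    # Assumption: the value is 0 when either vector has no non-zero component.
    return dot / norm if norm > 0 else 0.0


def sort_documents(query: List[float],
                   ordering_vectors: Dict[str, List[float]]) -> List[str]:
    """Return document identifiers by descending personalized ranking value."""
    return sorted(ordering_vectors,
                  key=lambda i: personalized_rank(query, ordering_vectors[i]),
                  reverse=True)


# Hypothetical ordering vectors over two features; the query favors feature 2.
Q = {"x": [0.9, 0.1], "y": [0.1, 0.9]}
ordered_ids = sort_documents([0.0, 1.0], Q)
```

Because the similarity depends only on the direction of the two vectors, a document ranks highly when its ranking values are distributed across features in the same proportions as the query vector.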
Fig. 7 is the individualized document search method flow chart based on query vector and parameter vector.The method includes
Following steps are executed in server:
A10. according to the parameter vector more new algorithm, the parameter vector of multiple documents and institute in the document sets D are updated
State the parameter vector that user collects multiple users in U;Concrete methods of realizing includes step S11 described in Fig. 3 to the step S16;
A20. the search condition that the query vector and the user q for receiving user q (q ∈ D) setting are submitted, and described
Search key is extracted in search condition;Wherein described search condition can be set as all letters that user submits in search dialogue
Breath;
A30. Retrieve, in the document set D, a group of documents Q matching the search keywords;
A40. According to the query vector and the parameter vector of each document in the group of documents Q, calculate the personalized ordering value of each document in the group of documents Q;
A50. Sort the group of documents Q according to the personalized ordering values, and send the links of multiple documents in the group of documents Q to the user q according to the sorting result.
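Steps A20 to A50 can be sketched in Python as follows, assuming step A10 has already brought the parameter vectors up to date. The sketch is illustrative only: the function names, the in-memory `documents` dictionary, whitespace keyword extraction, and substring keyword matching are assumptions made for the example and are not prescribed by the method.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def personalized_search(query_vector, search_condition, documents):
    """Return document identifiers ordered by personalized ordering value.

    documents -- dict: document identifier -> {"text": str, "param_vector": list}
                 (a hypothetical stand-in for the document set D)
    """
    # A20: extract search keywords (assumption: split on whitespace).
    keywords = search_condition.lower().split()

    # A30: retrieve the group of documents Q matching the keywords
    # (assumption: every keyword must occur in the document text).
    group_q = [
        doc_id
        for doc_id, rec in documents.items()
        if all(kw in rec["text"].lower() for kw in keywords)
    ]

    # A40: personalized ordering value of each document in Q.
    scored = [(cosine(query_vector, documents[d]["param_vector"]), d) for d in group_q]

    # A50: sort by personalized ordering value, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]
```

In a deployed system the returned identifiers would be turned into links and sent to the user q; here the ordered list itself stands in for that result.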
In the method of Fig. 7, let the query vector of the user q be (swq1, swq2, ..., swqk, ..., swqL), where swqk indicates the degree of correlation between the queried documents and feature k (k ∈ K), and swqk ∈ [0, 1]. The query vector can be set in the following ways.
The first is that the user q selects features from the feature set K and sets a degree of correlation for each selected feature, for example swq2 = 0.8 and swq6 = 0.9, with all other vector components set to 0.
The second is to assign the parameter vector of the user q to the query vector.
The third is that the user q submits a set of user identifiers or document identifiers Sq = {..., r, ...}. When Sq is a set of user identifiers (Sq ⊆ U), the parameter vector of the user r (r ∈ Sq) is (uwr1, uwr2, ..., uwrL); therefore, for each feature k ∈ K, the query vector of the user q is set to swqk = (σ8/s) ∑(r ∈ Sq) uwrk, or to swqk = (σ8/s) ∑(r ∈ Sq) [uwrk / ∑(k ∈ K) uwrk]. When Sq is a set of document identifiers (Sq ⊆ D), the parameter vector of the document r (r ∈ Sq) is (dwr1, dwr2, ..., dwrL); therefore, for each feature k ∈ K, the query vector of the user q is set to swqk = (σ9/s) ∑(r ∈ Sq) dwrk, or to swqk = (σ9/s) ∑(r ∈ Sq) [dwrk / ∑(k ∈ K) dwrk]. Here s is the number of elements in the set Sq, and σ8 and σ9 are preset positive constants.
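The third setting method can be sketched in Python as follows. The same computation serves for a set of users or a set of documents, since only the source of the parameter vectors differs. The function name, the `normalize` switch selecting between the two formulas, and the decision to skip all-zero vectors in the normalized variant are assumptions made for this example.

```python
def query_vector_from_set(param_vectors, sigma=1.0, normalize=False):
    """Build the query vector of user q from the parameter vectors of Sq.

    param_vectors -- parameter vectors of the users or documents r in Sq,
                     each of length L
    sigma         -- the preset positive constant (sigma 8 or sigma 9)
    normalize     -- False: swqk = (sigma/s) * sum_r w_rk
                     True:  swqk = (sigma/s) * sum_r [w_rk / sum_k w_rk]
    """
    s = len(param_vectors)
    if s == 0:
        raise ValueError("Sq must contain at least one element")

    length = len(param_vectors[0])
    sums = [0.0] * length
    for vec in param_vectors:
        total = sum(vec) if normalize else 1.0
        if total == 0:
            # Assumption: an all-zero vector contributes nothing in the
            # normalized variant; the description does not cover this case.
            continue
        for k, weight in enumerate(vec):
            sums[k] += weight / total

    return [(sigma / s) * value for value in sums]
```

The normalized variant gives every element of Sq the same total influence on the query vector, regardless of how large its individual parameter values are.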
In an application example of the method of Fig. 7, the personalized ordering value UR(i, q) of a document i (i ∈ Q), based on the query vector submitted by the user q, is defined as the similarity between the query vector (swq1, swq2, ..., swqk, ..., swqL) of the user q and the parameter vector (dwi1, dwi2, ..., dwiL) of the document i, i.e.,
$$
UR(i, q) = \frac{\sum_{k} sw_{qk}\, dw_{ik}}{\left[\sum_{k} sw_{qk}^{2}\right]^{1/2} \left[\sum_{k} dw_{ik}^{2}\right]^{1/2}}.
$$
One application scenario of the method of Fig. 7 is microblogging. After a user posts a microblog document, the initial value of the parameter vector of that microblog document can be set: the parameter vector of the user who posted the microblog, multiplied by a preset constant, is assigned to the parameter vector of the microblog document. After the microblog server receives a signal of a user accessing a microblog document (such as a signal generated by a forward, comment or favorite action), it reads the parameter vector of the user and the parameter vector of the microblog document according to the user identifier and the microblog document identifier contained in the signal, and then updates both parameter vectors according to the parameter vector update algorithm. When a user opens the microblog, the user can filter and screen the information posted by other people in the user's relationship network by means of a preset query vector. The method is as follows: the user first presets a query vector; the similarity between the query vector and the parameter vector of each microblog document received by the user is then taken as the personalized ordering value of that microblog document, and the microblog documents received by the user are filtered and screened according to the magnitude of the personalized ordering value. For example, only the microblog documents whose personalized ordering values rank in the top 30% are sent to the querying user.
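The microblog scenario can be sketched in Python as follows. This is a simplified illustration only: the function names, the default constant 0.5, the integer `top_percent` parameter, and rounding the number of retained documents up are assumptions, and the parameter vector update triggered by access signals is omitted.

```python
import math


def initial_microblog_vector(author_vector, constant=0.5):
    """Initial parameter vector of a new microblog document: the author's
    parameter vector multiplied by a preset constant (0.5 is an assumption)."""
    return [constant * w for w in author_vector]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def filter_microblogs(query_vector, received, top_percent=30):
    """Keep the received microblog documents ranking in the top `top_percent`
    percent by personalized ordering value.

    received -- dict: microblog document identifier -> parameter vector
    """
    ranked = sorted(
        received,
        key=lambda doc_id: cosine(query_vector, received[doc_id]),
        reverse=True,
    )
    # Integer ceiling of len(ranked) * top_percent / 100 (rounding up is
    # an assumption; the description only says "top 30%").
    keep = (len(ranked) * top_percent + 99) // 100
    return ranked[:keep]
```

For example, with ten received microblog documents and the default `top_percent=30`, the three documents most similar to the preset query vector are retained.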
Fig. 8 is a structural diagram of a system for acquiring personalized features of users and documents. The system 200 includes the following functional modules:
User set, document set and feature set setup module 211: stores, in the user database 220, the user set U composed of multiple user identifiers; stores, in the document database 230, the document set D composed of multiple document identifiers; and stores, in the feature database 240, the feature set K composed of multiple feature identifiers;
User and document initial value setup module 212: sets a parameter vector initial value for at least one user in the user set U and stores it in the user database 220; sets a parameter vector initial value for at least one document in the document set D and stores it in the document database 230; and sets an ordering vector initial value for each document in the document set D; for users and documents whose parameter vector initial value has not been set, the parameter vector initial value defaults to the zero vector;
User access signal acquisition module 213: acquires the signal of any user m (m ∈ U) (102) accessing any document n (n ∈ D) and stores the signal in the web log database 250; the signal of the user m (102) accessing the document n is sent to at least one application server, the application servers including a portal site server 301, a social network server 302, a search engine server 303 and an instant messaging server 304;
User and document parameter vector update module 214: according to the signal, reads the parameter vector of the user m (102) from the user database 220 and the parameter vector of the document n from the document database 230; then applies the parameter vector update algorithm to update the parameter vectors of the user m (102) and the document n; and finally writes the updated parameter vector of the user m (102) to the user database 220 and the updated parameter vector of the document n to the document database 230;
Document ordering vector update module 215: in the document set D, takes the link relationships between documents, the ordering vector initial value of each document and the parameter vector of each document as input data, applies the ordering vector update algorithm to iteratively update the ranking value of each document in the document set D at each feature k (k ∈ K), and updates the document database 230 with the updated ranking values; the link relationships between the documents are determined by the document links contained in each document in the document set D;
User query module 216: first, receives the query vector set by the querying user q and the search condition submitted by the user q, and extracts search keywords from the search condition; then, retrieves, in the document set D, a group of documents Q matching the search keywords; next, calculates the personalized ordering value of each document in the group of documents Q, either according to the query vector and the ordering vector of each document in the group of documents Q, or according to the query vector and the parameter vector of each document in the group of documents Q; finally, sorts the group of documents Q according to the personalized ordering values, and sends the links of multiple documents in the group of documents Q to the user q according to the sorting result.
The application examples described above are only preferred application examples of the present invention and are not intended to limit the scope of protection of the present invention.