CN108959579A

CN108959579A - A system for obtaining personalized features of users and documents

Info

Publication number: CN108959579A
Application number: CN201810739450.0A
Authority: CN
Inventors: 祁勇
Original assignee: Jing Zhuqiang
Current assignee: Zhu Yanling
Priority date: 2012-06-25
Filing date: 2012-06-25
Publication date: 2018-12-07
Anticipated expiration: 2032-06-25
Also published as: CN108959579B; CN103514237A; CN103514237B

Abstract

The invention proposes a method and a system for obtaining personalized features of users and documents. The method uses signals of users accessing documents to update the personalized features of users and documents automatically. The personalized features of a user are updated according to the personalized features of the documents that the user accesses; the personalized features of a document are updated according to the personalized features of the users who access it. Based on the personalized features obtained for users and documents, personalized document ranking can be implemented in a search engine, and personalized information filtering and screening can be implemented in a social network. The method can improve the precision of search engines and the efficiency of information retrieval in social networks. It can also improve the resistance of page ranking algorithms to cheating.

Description

A system for obtaining personalized features of users and documents

This application is a divisional application of the patent application filed on June 25, 2012, with application No. 201210228726.1, entitled "A method and system for obtaining personalized features of users and documents".

Technical field

The present invention relates to the field of the internet, and in particular to a method and a system for obtaining personalized features of users and documents.

Background art

Search engines and social networks are the main tools for obtaining information on the internet. Both share a common drawback: they cannot filter and screen information according to the personalized features of the user. For example, when different users enter the same keywords in the same search engine, the returned search results are identical, regardless of which user submitted the query; when different users build the same relationship network in the same social network, the information they obtain is also identical, regardless of which user built the relationship network.

A search engine is an application that uses information retrieval techniques to collect, index, and rank web pages on a large scale and presents web pages to the querying user according to the ranking results. The core technology of a search engine is its ranking algorithm, the best known being Google's PageRank algorithm. The input of this algorithm is the link relationships between web pages, which web page authors construct according to their own subjective intentions. Although it fully reflects the personal preferences of web page authors and their understanding of the link relationships between web pages, it cannot reflect the personal preferences of the search engine's users. Users who work in different industries or have different interests usually assess the importance of the same web page differently, yet existing ranking techniques such as PageRank cannot distinguish these differences and can only provide a single page ranking to all users. This is a drawback of existing search technology. One feasible technical solution is to improve search results by combining the personalized features of users and web pages, so that the ranking of each web page depends not only on the link relationships between web pages but also on the personalized features of the user who submits the search query and of the queried web page. Analysis has shown that using the personalized features of users and web pages can improve the precision of a search engine and reduce the scanning and browsing of irrelevant information by users.

A social network is a platform on which people communicate with each other over the internet. In a social network, users obtain information through the relationship networks they build themselves, for example by following other people or adding friends to receive the information those people publish. The more people a user follows or adds as friends, the more information the user obtains. For fear of missing important or interesting information, users usually follow more people or add more friends. However, once the number of users in the relationship network exceeds Dunbar's number of 150, social networks such as microblogs and Facebook gradually become services that bombard users with information. The reason is that existing social network technology requires users to receive all information published by all users in their relationship network and does not let them receive this information selectively by category. This is a drawback of existing social network technology. One feasible technical solution is to make the information a user obtains depend not only on the relationship network the user has built but also on the personalized features of the user and of the information obtained. This helps filter and screen the massive volume of information on social networks effectively and improves their information retrieval efficiency. For convenience of description, each piece of information that a user obtains on a social network (such as a microblog post) is also regarded as a document with a unique network address.

A necessary condition for realizing the two technical solutions above is that the personalized features of users and web documents can be obtained. Obtaining them on the internet is often difficult, however, mainly for the following reasons. The first is the problem of obtaining personalized information automatically. It is estimated that the internet currently has hundreds of billions of web pages and 2 billion users, so maintaining the personalized features of web documents and users by hand is impractical. How to obtain the personalized features of users and web documents automatically is an open problem. The second is the problem of updating personalized information. Over time, personal information such as a user's interests, workplace, industry, and level of education changes, but it is difficult to require most users to update their personalized information in real time. The third is the problem of semantic differences in personalized information. Among the personalized features that users set, features that use different terms but have the same meaning are difficult to classify effectively. The fourth is the problem of completeness of personalized information. The personal information that users provide on websites is usually rather brief. For example, a user's description of their interests usually amounts to a few items such as liking music, playing baseball, or reading, and it is difficult to require users to describe all the fields they are interested in.

In summary, how to obtain the personalized features of users and documents effectively, and how to use those features to improve the precision of search engines and the information retrieval efficiency of social networks, is a problem that urgently needs to be solved.

Summary of the invention

In view of the above problems in the prior art, the object of the present invention is to provide a method and a system for obtaining personalized features of users and documents, which obtain the personalized features of users and documents automatically and use those features to help users filter and screen the information they obtain on the internet.

To achieve the above object, the invention proposes a method for obtaining personalized features of users and documents, characterized in that:

In a server connected to the internet, a user set U composed of multiple user identifiers and a document set D composed of multiple document identifiers are stored, and a feature set K composed of multiple feature identifiers is stored;

In the server, an initial parameter vector value is set for at least one user in the user set U or at least one document in the document set D;

In the server, the following steps are performed repeatedly:

Receive a signal indicating that any user m (m ∈ U) accesses any document n (n ∈ D);

According to the signal, read the parameter vector of the user m, U(m) = (uwm1, uwm2, ..., uwmk, ..., uwmL), where uwmk denotes the relevance of the user m to feature k (k ∈ K);

According to the signal, read the parameter vector of the document n, D(n) = (dwn1, dwn2, ..., dwnk, ..., dwnL), where dwnk denotes the relevance of the document n to feature k (k ∈ K);

Apply the parameter vector update algorithm to update the parameter vectors of the user m and the document n. If the updated parameter vector of the user m is U*(m) = (uwm1*, uwm2*, ..., uwmk*, ..., uwmL*) and the updated parameter vector of the document n is D*(n) = (dwn1*, dwn2*, ..., dwnk*, ..., dwnL*), then the parameter vector update algorithm includes:

U*(m) = F1[U(m), D(n)];

D*(n) = F2[U(m), D(n)];

where F1() and F2() are each functions of the independent variables U(m) and D(n).

Compared with the prior art, the present invention enables personalized document ranking, which in turn improves the precision of search engines and the information retrieval efficiency of social networks. In addition, using the personalized features of web documents can improve the resistance of web page ranking algorithms to cheating.

Brief description of the drawings

Fig. 1 shows the parameter vector representation of each user in the user set U;

Fig. 2 shows the parameter vector representation of each document in the document set D;

Fig. 3 is a flow chart of the parameter vector update algorithm for users and documents;

Fig. 4 shows the ranking vector representation of each document in the document set D;

Fig. 5 is a flow chart of the document ranking vector update algorithm;

Fig. 6 is a flow chart of the personalized document search method based on query vectors and ranking vectors;

Fig. 7 is a flow chart of the personalized document search method based on query vectors and parameter vectors;

Fig. 8 is a structural diagram of a system for obtaining personalized features of users and documents.

Specific embodiment

The method of the present invention is described in further detail below with reference to the drawings.

The description of the specific embodiments of the method consists of the following parts. First, the meanings of the user set, the document set, and the feature set are explained, together with the parameter vector representation of users and documents. Then the parameter vector update algorithm for users and documents is explained. Next come the ranking vector representation of documents and the document ranking algorithm based on document parameter vectors, followed by the personalized document search method based on query vectors. Finally, a system for obtaining personalized features of users and documents is described.

The meanings of the user set U, the document set D, and the feature set K are explained first.

In a server connected to the internet, a user set U composed of multiple user identifiers and a document set D composed of multiple document identifiers are stored. A user identifier is the unique identifier of a user on the internet and is one of a user account, a mobile phone number, a cookie identification code, an IP address, an email address, or an instant messaging number. A document identifier is the unique identifier of a document on the internet, such as the URL of a web page document. The user set U contains M elements, and the document set D contains N elements.

In the server connected to the internet, a feature set K composed of multiple feature identifiers is stored; the feature set K contains L elements. The features in the feature set K are chosen from the features of the users in the user set U and the features of the documents in the document set D. Users and documents use the same feature set K. If a user has the "music" feature, the user likes music; if a document has the "music" feature, the document relates to a musical topic.

The representation of the parameter vectors of users and documents is described next. It is similar to the vector representation of the vector space model (VSM); that is, feature items serve as the basic units of user features or document features. In the method and system of this patent, the set of relevance values between a user and each feature is used as the parameter vector of the user, and the set of relevance values between a document and each feature is used as the parameter vector of the document.

Fig. 1 shows the parameter vector representation of each user in the user set U. The parameter vector of any user m (m ∈ U) in the user set U is defined as U(m) = (uwm1, uwm2, ..., uwmk, ..., uwmL), where uwmk denotes the relevance of the user m to feature k (k ∈ K). In addition, the relevance values of all users in the user set U to feature k are collected to form a vector, called the k-th user column vector (uw1k, uw2k, ..., uwMk) of the user set U.

Fig. 2 shows the parameter vector representation of each document in the document set D. The parameter vector of any document n (n ∈ D) in the document set D is defined as D(n) = (dwn1, dwn2, ..., dwnk, ..., dwnL), where dwnk denotes the relevance of the document n to feature k (k ∈ K). In addition, the relevance values of all documents in the document set D to feature k are collected to form a vector, called the k-th document column vector (dw1k, dw2k, ..., dwNk) of the document set D.
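As an illustration of the representation in Fig. 1 and Fig. 2, the sketch below stores parameter vectors in dictionaries keyed by identifier and reads out the column vector for one feature. The feature names, the identifiers, the numeric values, and the helper `column_vector` are hypothetical and chosen only for this example.

```python
# Hypothetical feature set K with L = 3 features (names are illustrative only).
FEATURES = ["science", "music", "sport"]

# Parameter vectors U(m) and D(n): one relevance value per feature in K.
user_vectors = {
    "user_a": [0.0, 0.9, 0.1],
    "user_b": [0.4, 0.0, 1.0],
}
doc_vectors = {
    "doc_x": [0.2, 0.8, 0.0],
    "doc_y": [0.0, 0.1, 0.7],
    "doc_z": [1.0, 0.0, 0.0],
}

def column_vector(vectors, k):
    """Collect the relevance of every user (or document) to feature index k."""
    return [vec[k] for vec in vectors.values()]

music = FEATURES.index("music")
user_column = column_vector(user_vectors, music)  # (uw1k, ..., uwMk)
doc_column = column_vector(doc_vectors, music)    # (dw1k, ..., dwNk)
```

Storing one row per user or document keeps the per-signal update cheap, while the column view is what the per-feature normalization described later operates on.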

The relevance is a real number that indicates how closely a user or document is related to a feature in the feature set K. If a user or document is associated more with the music feature and less with the sports feature, we say that its relevance to the music feature is high and its relevance to the sports feature is low. In addition, some features are correlated with each other, so during feature selection the dimension of the feature set K can be reduced by reducing the correlation between features, which lowers the storage demand on the server and improves the efficiency of the algorithm. Some features need not be included in the feature set directly, because their relevance can be calculated from the relevance of one or several other features in the feature set K.

The methods for setting the initial parameter vector values of users or documents are explained below using three examples. The initial values are usually set in the range uwmk ∈ [0, 1] and dwnk ∈ [0, 1] for any m ∈ U, n ∈ D, and k ∈ K. If no initial value is set for the parameter vector of a user or document, the initial value defaults to the zero vector.

Example 1 is a method of manually setting the initial parameter vector value of user m (m ∈ U) or document n (n ∈ D). For example, set the number of features L = 5 and the feature set K = (science, education, finance, music, sports), and set U(m) = (uwm1, uwm2, uwm3, uwm4, uwm5) = (0, 0.9, 0, 1, 0). That is, the relevance of user m to the "education" feature is 0.9, its relevance to the "music" feature is 1, and its relevance to the other features is zero. The initial value of the parameter vector D(n) = (dwn1, dwn2, ..., dwnk, ..., dwnL) of the document n can be set in the same way.

Example 2 is a method of setting the initial parameter vector value of user m (m ∈ U). The user m first submits a set of documents H, in which the parameter vector of each document r (r ∈ H) is (dwr1, dwr2, ..., dwrL). Then, for each k ∈ K, set uwmk = (σ1/s)·∑(r ∈ H) dwrk or uwmk = (σ1/s)·∑(r ∈ H) [dwrk/(∑(k ∈ K) dwrk)], where s is the number of elements in the set H and σ1 is a preset positive constant. Using a similar method, the user m can also select a group of users in the user set U to calculate the initial parameter vector value of the user m.
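The two variants of Example 2 can be sketched as a single function. This is a minimal sketch, assuming H is non-empty; the function name `init_user_vector`, the `normalize` switch, and the handling of a document whose components sum to zero are assumptions made for the example.

```python
def init_user_vector(doc_vectors_in_h, sigma1=1.0, normalize=False):
    """Initial value of U(m) computed from the documents in a non-empty set H.

    normalize=False: uwmk = (sigma1 / s) * sum over r in H of dwrk
    normalize=True:  each dwrk is first divided by the sum of that
                     document's components over all features.
    """
    s = len(doc_vectors_in_h)
    num_features = len(doc_vectors_in_h[0])
    result = [0.0] * num_features
    for vec in doc_vectors_in_h:
        total = sum(vec)
        for k, dwrk in enumerate(vec):
            if normalize:
                # Assumption: a document whose components sum to zero adds nothing.
                result[k] += dwrk / total if total > 0 else 0.0
            else:
                result[k] += dwrk
    return [sigma1 * value / s for value in result]

plain = init_user_vector([[1.0, 0.0], [0.0, 1.0]])
scaled = init_user_vector([[2.0, 2.0], [0.0, 4.0]], normalize=True)
```

The normalized variant gives every document in H the same total weight, so a document with large relevance values does not dominate the user's initial vector.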

Example 3 is a method of setting the initial parameter vector value of a document. A category directory is a special kind of document; portal websites, for example, usually include category directories such as news, music, sports, finance, and technology. We assume that the documents under the same category directory share certain features; for example, the documents under the sports directory all relate to sports. If document n (n ∈ D) is a document under category directory h (h ∈ D), the initial parameter vector value of the document n is determined by the parameter vector of the category directory h. For example, for each k ∈ K, set dwnk = σ2·dwhk, where σ2 is a preset positive constant.

Fig. 3 is a flow chart of the parameter vector update algorithm for users and documents. Specifically, the following steps are executed in a server connected to the internet:

S11. Store a user set U composed of multiple user identifiers and a document set D composed of multiple document identifiers; store a feature set K composed of multiple feature identifiers;

S12. Set an initial parameter vector value for at least one user in the user set U or at least one document in the document set D;

S13. Receive a signal indicating that any user m (m ∈ U) accesses any document n (n ∈ D);

S14. According to the signal, read the parameter vector of the user m, U(m) = (uwm1, uwm2, ..., uwmk, ..., uwmL), where uwmk denotes the relevance of the user m to feature k (k ∈ K);

S15. According to the signal, read the parameter vector of the document n, D(n) = (dwn1, dwn2, ..., dwnk, ..., dwnL), where dwnk denotes the relevance of the document n to feature k (k ∈ K);

S16. Apply the parameter vector update algorithm to update the parameter vectors of the user m and the document n. If the updated parameter vector of the user m is U*(m) = (uwm1*, uwm2*, ..., uwmk*, ..., uwmL*) and the updated parameter vector of the document n is D*(n) = (dwn1*, dwn2*, ..., dwnk*, ..., dwnL*), then the algorithm includes:

U*(m) = F1[U(m), D(n)];

D*(n) = F2[U(m), D(n)];

After step S16 has been executed, return to step S13.

where F1() and F2() are each functions of the independent variables U(m) and D(n). The user m represents any user in the user set U rather than one specific user, and the document n represents any document in the document set D rather than one specific document. For example, the signal may have m = 1023 and n = 3428 on one execution of step S13, and m = 33456 and n = 28477 on the next execution of step S13.
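The loop over steps S13 to S16 can be sketched as follows. This is only an illustration: the function `run_updates`, the in-memory dictionaries, and the particular `f1` and `f2` are hypothetical stand-ins for F1 and F2, which the text leaves open. The sketch also writes the updated vectors back after each signal.

```python
def run_updates(signals, user_vectors, doc_vectors, f1, f2):
    """Apply steps S13-S16 once for each received (m, n) access signal."""
    for m, n in signals:            # S13: user m accessed document n
        u = user_vectors[m]         # S14: read U(m)
        d = doc_vectors[n]          # S15: read D(n)
        # S16: both new vectors are computed from the old U(m) and D(n).
        new_u = f1(u, d)
        new_d = f2(u, d)
        user_vectors[m] = new_u
        doc_vectors[n] = new_d

# Hypothetical F1 and F2: add a fixed share of the other side's vector.
def f1(u, d):
    return [uk + 0.5 * dk for uk, dk in zip(u, d)]

def f2(u, d):
    return [dk + 0.25 * uk for uk, dk in zip(u, d)]

users = {1: [1.0, 0.0]}
docs = {7: [0.0, 2.0]}
run_updates([(1, 7)], users, docs, f1, f2)
```

Computing both results before storing either one matters: F2 must see the user vector from before the update, not the value F1 has just produced.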

In one application example of the method of Fig. 3, for each k ∈ K, uwmk* is an increasing function of dwnk, and dwnk* is an increasing function of uwmk.

In one application example of the method of Fig. 3, for each k ∈ K, both uwmk* and dwnk* are decreasing functions of the frequency with which the user m accesses the document set D. The frequency is the number of times the user m accesses documents in the document set D within a set time period divided by the length of that period.

In one application example of the method of Fig. 3, for each k ∈ K, uwmk* is a decreasing function of ∑(k ∈ K) dwnk, and dwnk* is a decreasing function of ∑(k ∈ K) uwmk.

In one application example of the method of Fig. 3, the signals are extracted at random from the web log within a set time. Within that time, the same number of access signals is extracted for each active user in the user set U as the input signals of the method of Fig. 3. An active user is a user who accesses the document set D at least a set number of times within the set time. Inactive users cannot use the method of Fig. 3 to update the parameter vectors of users and documents.
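A possible way to draw the same number of signals for every active user is sketched below. The log format, the function `sample_signals`, and its parameters are assumptions for illustration; the text does not prescribe a log structure or a sampling routine.

```python
import random
from collections import defaultdict

def sample_signals(log, min_accesses, per_user, seed=0):
    """Pick the same number of random access signals for every active user.

    log: list of (user_id, doc_id) records within the chosen time window.
    A user counts as active with at least min_accesses records.
    Assumption: min_accesses >= per_user, so each active user has enough records.
    """
    by_user = defaultdict(list)
    for user_id, doc_id in log:
        by_user[user_id].append((user_id, doc_id))
    rng = random.Random(seed)
    sampled = []
    for records in by_user.values():
        if len(records) >= min_accesses:
            sampled.extend(rng.sample(records, per_user))
    return sampled

log = [("a", 1), ("a", 2), ("a", 3), ("b", 1)]
picked = sample_signals(log, min_accesses=3, per_user=2)
```

Sampling a fixed count per user limits how much a single very active account can move the parameter vectors, which is consistent with the anti-cheating aim stated earlier.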

In the method of Fig. 3, after the parameter vector update algorithm has been executed a set number of times t1, the k-th user column vector (uw1k, uw2k, ..., uwMk) is normalized for each feature k ∈ K; after the parameter vector update algorithm has been executed a set number of times t2, the k-th document column vector (dw1k, dw2k, ..., dwNk) is normalized for each feature k ∈ K; t1 and t2 are positive integers. Executing the parameter vector update algorithm once means executing step S16 once. The normalization methods include the following specific application examples.

Example 1: The k-th user column vector (uw1k, uw2k, ..., uwMk) of the user set U is normalized as follows: the set {uw1k, uw2k, ..., uwMk} is sorted in descending order, and the element ranked M1 is assigned to [...]; then, for each m ∈ U, if [...], uwmk = 1 is set; otherwise [...] is set. The k-th document column vector (dw1k, dw2k, ..., dwNk) of the document set D is normalized as follows: the set {dw1k, dw2k, ..., dwNk} is sorted in descending order, and the element ranked N1 is assigned to [...]; then, for each n ∈ D, if [...], dwnk = 1 is set; otherwise [...] is set. Here M1 and N1 are preset positive constants.

Example 2: The k-th document column vector (dw1k, dw2k, ..., dwNk) of the document set D is normalized as follows. First, the set {dw1k, dw2k, ..., dwNk} is sorted and, according to the sorted order, divided into r groups with approximately equal numbers of elements, such that for any two groups a and b, either every element of group a is greater than or equal to every element of group b, or every element of group a is less than or equal to every element of group b. The smallest value in each group is taken to form the set {s1, s2, ..., sr}, with s1 < s2 < ... < sr. Then, for each n ∈ D: if dwnk < s1, set dwnk = 0; if sm ≤ dwnk ≤ s(m+1), set dwnk = g1(sm); if dwnk > sr, set dwnk = 1. Here g1(sm) is an increasing function with g1(sm) ∈ (0, 1), for example g1(sm) = sm/sr; 1 ≤ m < r, and r is a preset positive number. The same method can be used to normalize the k-th user column vector of the user set U.
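The group-based normalization of Example 2 might be sketched as below, using g1(sm) = sm/sr. The function name `normalize_by_groups`, the way the sorted values are sliced into groups, and the tie rule at group boundaries are assumptions; the sketch also assumes r ≥ 2, at least r values, and positive, strictly increasing group minima.

```python
from bisect import bisect_right

def normalize_by_groups(values, r):
    """Map values to [0, 1] using the minima s1 < ... < sr of r ordered groups."""
    ordered = sorted(values)
    size = len(ordered) / r
    # The minimum of each group is the first element of each slice.
    s = [ordered[int(i * size)] for i in range(r)]
    result = []
    for x in values:
        if x < s[0]:
            result.append(0.0)
        elif x > s[-1]:
            result.append(1.0)
        else:
            # Assumption: on a boundary, use the largest m < r with s_m <= x.
            m = min(bisect_right(s, x) - 1, r - 2)
            result.append(s[m] / s[-1])  # g1(s_m) = s_m / s_r
    return result

normalized = normalize_by_groups([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], r=3)
```

Because every value in a group receives the same output, the result depends on rank rather than magnitude, which makes the column less sensitive to a few extreme values.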

One application example of the method of Fig. 3 further includes, after step S16 has been executed, setting uwmk = uwmk* and dwnk = dwnk* for each k ∈ K.

In one application example of the method of Fig. 3, the method satisfies uwmk* ≥ uwmk and dwnk* ≥ dwnk for each k ∈ K.

In the method of Fig. 3, the type T of the signal is at least one of the following: T = 1 indicates that the user m clicks a link to the document n; T = 2 indicates that the user m types the address of the document n; T = 3 indicates that the user m marks the document n as liked (for example, Facebook's Like or Google's +1); T = 4 indicates that the user m forwards the document n; T = 5 indicates that the user m comments on the document n; T = 6 indicates that the user m bookmarks the document n.

Application example 1

In one application example of the method of Fig. 3, the parameter vector update algorithm specifically includes:

uwmk* = β1·uwmk + λ1(n, m, T)·f1(dwnk) (for each k ∈ K)

dwnk* = β2·dwnk + λ2(m, n, T)·f2(uwmk) (for each k ∈ K)

where λ1(n, m, T) is the influence coefficient of the document n on the user m for signal type T, and λ2(m, n, T) is the influence coefficient of the user m on the document n for signal type T; β1 and β2 are preset positive constants; f1(dwnk) is an increasing function of dwnk, and f2(uwmk) is an increasing function of uwmk. For example, f1(dwnk) = σ3·dwnk and f2(uwmk) = σ4·uwmk; or f1(dwnk) = σ5·{1/[1 + exp(-dwnk)]} and f2(uwmk) = σ6·{1/[1 + exp(-uwmk)]}, where σ3, σ4, σ5, and σ6 are preset positive constants.
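The update rule of application example 1, with the linear and logistic choices of f1 and f2 mentioned above, can be sketched as follows. The function names, default values, and argument order are assumptions for illustration; the influence coefficients are passed in as plain numbers.

```python
import math

def linear(x, sigma=1.0):
    return sigma * x                             # f(x) = sigma * x

def sigmoid(x, sigma=1.0):
    return sigma * (1.0 / (1.0 + math.exp(-x)))  # f(x) = sigma / (1 + exp(-x))

def update_pair(u, d, lam1, lam2, beta1=1.0, beta2=1.0, f1=linear, f2=linear):
    """Return (U*(m), D*(n)), both computed from the vectors before the update.

    uwmk* = beta1 * uwmk + lam1 * f1(dwnk)
    dwnk* = beta2 * dwnk + lam2 * f2(uwmk)
    """
    new_u = [beta1 * uk + lam1 * f1(dk) for uk, dk in zip(u, d)]
    new_d = [beta2 * dk + lam2 * f2(uk) for uk, dk in zip(u, d)]
    return new_u, new_d

new_u, new_d = update_pair([1.0, 0.0], [0.0, 2.0], lam1=0.5, lam2=0.25)
```

With β1 = β2 = 1 and non-negative coefficients, each component can only grow, which is why the periodic column normalization described earlier is needed to keep values bounded.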

In application example 1, for each feature k ∈ K a threshold dCk is set for the k-th document column vector, and if dwnk ≤ dCk, f1(dwnk) = 0 is taken; for each feature k ∈ K a threshold uCk is set for the k-th user column vector, and if uwmk ≤ uCk, f2(uwmk) = 0 is taken. Here dCk equals the component ranked a1 among the components of the k-th document column vector (dw1k, dw2k, ..., dwNk); uCk equals the component ranked a2 among the components of the k-th user column vector (uw1k, uw2k, ..., uwMk); a1 and a2 are preset positive integers.
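A small sketch of the rank-based thresholds follows. It assumes that ranking means descending order, which the passage does not state explicitly; the helper names `rank_threshold` and `gated` are hypothetical.

```python
def rank_threshold(column, rank):
    """Value of the component ranked `rank` (1-based), assuming descending order."""
    return sorted(column, reverse=True)[rank - 1]

def gated(value, threshold, f):
    """Return f(value), or 0 when the value does not exceed the column threshold."""
    return 0.0 if value <= threshold else f(value)

doc_column = [0.1, 0.9, 0.5, 0.3]        # k-th document column vector (illustrative)
d_ck = rank_threshold(doc_column, 2)     # threshold dCk with a1 = 2
weak = gated(0.5, d_ck, lambda x: x)     # at the threshold, so it contributes 0
strong = gated(0.9, d_ck, lambda x: 2 * x)
```

Under this reading, only the top-ranked documents for a feature can raise a user's relevance to that feature, and likewise for users acting on documents.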

In application example 1, specific implementations of λ1(n, m, T) and λ2(m, n, T) include the following examples:

Example 1: λ1(n, m, T) and λ2(m, n, T) are set to constants, for example λ1(n, m, T) = c1 and λ2(m, n, T) = c2, where c1 and c2 are preset positive constants such as c1 = c2 = 0.01.

Example 2: λ1(n, m, T) and λ2(m, n, T) are each decreasing functions of the frequency with which the user m accesses the document set D. For example, set λ1(n, m, T) = 1/g2[freq(m)] and λ2(m, n, T) = 1/g2[freq(m)], where g2(x) is an increasing function. For example, g2(x) is a piecewise function: g2(x) = 1 when x < a3, and g2(x) = 1 + a4·(x - a3) when x ≥ a3, where a3 and a4 are preset positive constants. freq(m) is the frequency with which the user m accesses documents in the document set D.
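The frequency-based damping of Example 2 can be written directly from the piecewise definition. The default values a3 = 200 and a4 = 0.01 are the ones used in application example 2; the function names are hypothetical.

```python
def g2(x, a3, a4):
    """Piecewise function: 1 below a3, then linear in x with slope a4."""
    return 1.0 if x < a3 else 1.0 + a4 * (x - a3)

def frequency_coefficient(freq_m, a3=200.0, a4=0.01):
    """lambda = 1 / g2(freq(m)); the same value serves lambda1 and lambda2 here."""
    return 1.0 / g2(freq_m, a3, a4)

light_user = frequency_coefficient(100.0)   # below a3: no damping
heavy_user = frequency_coefficient(300.0)   # above a3: influence is reduced
```

The effect is that a user who generates an unusually large number of accesses has each access count for less.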

Example 3: Set λ1(n, m, T) = 1/g3[∑(k ∈ K) dwnk] and λ2(m, n, T) = 1/g3[∑(k ∈ K) uwmk], where g3(x) is an increasing function. For example, g3(x) is a piecewise function: g3(x) = 1 when x < a5, and g3(x) = 1 + a6·(x - a5) when x ≥ a5, where a5 and a6 are preset positive constants. When calculating ∑(k ∈ K) dwnk, dwnk = 0 is taken if dwnk ≤ min_dCk; when calculating ∑(k ∈ K) uwmk, uwmk = 0 is taken if uwmk ≤ min_uCk; min_dCk and min_uCk are preset positive constants.

Example 4: λ1(n, m, T) = d1(n)·u2(m) and λ2(m, n, T) = u1(m)·d2(n), where d1(n) indicates whether the parameter vector of document n may be used to update the parameter vectors of users in the user set U, u2(m) indicates whether the parameter vector of user m may be updated by the parameter vectors of documents in the document set D, u1(m) indicates whether the parameter vector of user m may be used to update the parameter vectors of documents in the document set D, and d2(n) indicates whether the parameter vector of document n may be updated by the parameter vectors of users in the user set U. u1(m), u2(m), d1(n), and d2(n) are preset parameters whose value is 0 or 1, where 1 means yes and 0 means no. The purpose of this example is to prevent malicious attacks: the parameter vectors of some documents (or users) that have not passed reliability certification cannot be used to update the parameter vectors of other users (or documents), and the parameter vectors of some important documents (or users) cannot be updated by the parameter vectors of other users (or documents).

Example 5: λ1(n, m, T) = s1(T) and λ2(m, n, T) = s2(T), where T is the type of the signal of the user accessing the document, and s1(T) and s2(T) are each functions of T.

Example 6: λ1(n, m, T) is an increasing function of the number of times the document n has been accessed or of its PageRank value, and λ2(m, n, T) is an increasing function of the number of followers of the user m.

Example 7: λ1(n, m, T) and λ2(m, n, T) are each increasing functions of the similarity sim(m, n) between the parameter vectors of the user m and the document n. For example, λ1(n, m, T) = 1 + c3·sim(m, n) and λ2(m, n, T) = 1 + c4·sim(m, n), where c3 and c4 are preset constants greater than or equal to 1, and sim(m, n) = [∑(k ∈ K) (uwmk·dwnk)] / {[∑(k ∈ K) (uwmk)^2]^(1/2) · [∑(k ∈ K) (dwnk)^2]^(1/2)}. The meaning of this example is that the higher the similarity between the parameter vectors of a user and a document, the larger the proportionality coefficient with which they "vote" for each other. When calculating sim(m, n), dwnk = 0 is taken if dwnk ≤ min_dCk, and uwmk = 0 is taken if uwmk ≤ min_uCk, where min_dCk and min_uCk are preset positive constants.
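The similarity in Example 7 is the cosine of the angle between the two parameter vectors. A minimal sketch follows, assuming a single scalar minimum for all features and a similarity of 0 when either vector is all zeros; both of these choices, and the function names, are assumptions made for the example.

```python
import math

def similarity(u, d, min_u=0.0, min_d=0.0):
    """Cosine similarity sim(m, n) after zeroing components at or below the minimums."""
    u = [0.0 if x <= min_u else x for x in u]
    d = [0.0 if x <= min_d else x for x in d]
    dot = sum(x * y for x, y in zip(u, d))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in d))
    # Assumption: the similarity is taken as 0 when either vector is all zeros.
    return dot / norm if norm > 0 else 0.0

def similarity_coefficient(u, d, c=3.0):
    """lambda = 1 + c * sim(m, n), with c >= 1."""
    return 1.0 + c * similarity(u, d)

same = similarity([3.0, 4.0], [3.0, 4.0])
unrelated = similarity([1.0, 0.0], [0.0, 1.0])
boosted = similarity_coefficient([1.0, 0.0], [1.0, 0.0])
```

Zeroing small components before computing the cosine keeps weak, possibly noisy relevance values from inflating the similarity between a user and a document.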

Example 8: λ1(n, m, T) and λ2(m, n, T) are generated by combining at least two of the methods in Examples 1 to 7 above. For example, when freq(m) > a3:

λ1(n, m, T) = c1·{1 + c3·sim(m, n)}·{1/[1 + a4·(freq(m) - a3)]}·{d1(n)·u2(m)}·s1(T)

λ2(m, n, T) = c2·{1 + c4·sim(m, n)}·{1/[1 + a4·(freq(m) - a3)]}·{u1(m)·d2(n)}·s2(T).
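The combined coefficient of Example 8 is a product of the individual factors. The sketch below is one way to compute it; the function name and parameter names are hypothetical, and the frequency factor is written so that it also covers freq(m) below a3, following the piecewise g2 of Example 2.

```python
def combined_lambda(c, c_sim, sim, freq_m, a3, a4, allow_source, allow_target, s_type):
    """Product of the constant, similarity, frequency, permission, and type factors.

    allow_source and allow_target are the 0/1 permission flags of Example 4,
    and s_type is the value of s1(T) or s2(T) for the signal type.
    """
    damping = 1.0 if freq_m < a3 else 1.0 + a4 * (freq_m - a3)
    return (c * (1.0 + c_sim * sim) * (1.0 / damping)
            * (allow_source * allow_target) * s_type)

lam1 = combined_lambda(0.01, 3.0, 1.0, 300.0, 200.0, 0.01, 1, 1, 2.0)
blocked = combined_lambda(0.01, 3.0, 1.0, 300.0, 200.0, 0.01, 0, 1, 2.0)
```

Because the factors multiply, a single zero permission flag switches the influence off entirely, whatever the similarity or signal type.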

In application example 1, after the specific parameter vector update algorithm has been executed a set number of times, the k-th document column vector (dw1k, dw2k, ..., dwNk) and the k-th user column vector (uw1k, uw2k, ..., uwMk) need to be normalized for each feature k ∈ K.

Application example 2

This is a specific implementation of application example 1. For ease of explanation, assume that there are two users and three documents on the internet and that each user and each document has two features, that is, the user set U = {1, 2}, the document set D = {1, 2, 3}, and the feature set K = {1, 2}. The parameter vectors of user 1 and user 2 are (uw11, uw12) and (uw21, uw22) respectively, and the parameter vectors of document 1, document 2, and document 3 are (dw11, dw12), (dw21, dw22), and (dw31, dw32) respectively. Here uwmk (m ∈ U, k ∈ K) denotes the relevance of the user m to feature k, and dwnk (n ∈ D, k ∈ K) denotes the relevance of the document n to feature k.

Assume that the server receives a signal indicating that user 2 accesses document 3, with signal type T = 1. The parameter vectors of user 2 and document 3 are then updated according to the following parameter vector update algorithm:

uw21* = β1·uw21 + λ1(3, 2, 1)·dw31; uw22* = β1·uw22 + λ1(3, 2, 1)·dw32

dw31* = β2·dw31 + λ2(2, 3, 1)·uw21; dw32* = β2·dw32 + λ2(2, 3, 1)·uw22

where β1 = β2 = 1; λ1(3, 2, 1) denotes the influence coefficient of document 3 on user 2 for signal type T = 1; λ2(2, 3, 1) denotes the influence coefficient of user 2 on document 3 for signal type T = 1. For example:

λ1(3, 2, 1) = c1·{1 + c3·sim(2, 3)}·{1/[1 + a4·(freq(2) - a3)]}·{d1(3)·u2(2)}·s1(1)

λ2(2, 3, 1) = c2·{1 + c4·sim(2, 3)}·{1/[1 + a4·(freq(2) - a3)]}·{u1(2)·d2(3)}·s2(1)

where c1 = c2 = 0.01, c3 = c4 = 3, sim(2, 3) = (uw21·dw31 + uw22·dw32)/{[(uw21)^2 + (uw22)^2]^(1/2)·[(dw31)^2 + (dw32)^2]^(1/2)}, a3 = 200, a4 = 0.01, d1(3) = u2(2) = u1(2) = d2(3) = 1, s1(1) = 2, s2(1) = 1, and it is assumed that freq(2) > a3.

After the parameter vector update algorithm above has been executed, the following assignments are made: uw21 = uw21*, uw22 = uw22*, dw31 = dw31*, and dw32 = dw32*.

After the parameter vector update algorithm above has been executed, the user column vectors (uw11, uw21) and (uw12, uw22) are normalized, and the document column vectors (dw11, dw21, dw31) and (dw12, dw22, dw32) are normalized.

The user column vectors are normalized as follows: let temp1 = max(uw11, uw21), then for feature k = 1 set uw11 = uw11/temp1 and uw21 = uw21/temp1; let temp2 = max(uw12, uw22), then for feature k = 2 set uw12 = uw12/temp2 and uw22 = uw22/temp2.

The document column vectors are normalized as follows: let temp1 = max(dw11, dw21, dw31), then for feature k = 1 set dw11 = dw11/temp1, dw21 = dw21/temp1 and dw31 = dw31/temp1; let temp2 = max(dw12, dw22, dw32), then for feature k = 2 set dw12 = dw12/temp2, dw22 = dw22/temp2 and dw32 = dw32/temp2.
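The update and normalization steps of this example can be sketched in Python as follows. The constants c1 to c4, a3, a4, s1(1), s2(1) and β1 = β2 = 1 are taken from the text; the initial vector values and freq(2) = 300 are hypothetical numbers chosen only so that the arithmetic is easy to follow, and the function names are illustrative.

```python
import math

def cosine(u, v):
    """sim(m, n): cosine similarity between a user and a document vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def influence(c, c_sim, sim, freq, a3, a4, weight, s):
    """lambda = c * (1 + c_sim*sim) * 1/(1 + a4*(freq - a3)) * weight * s."""
    return c * (1 + c_sim * sim) * (1.0 / (1 + a4 * (freq - a3))) * weight * s

def update(user, doc, freq, beta1=1.0, beta2=1.0):
    """Return the updated (user, doc) parameter vectors for one access signal."""
    sim = cosine(user, doc)
    lam1 = influence(0.01, 3, sim, freq, 200, 0.01, 1 * 1, 2)  # s1(1) = 2
    lam2 = influence(0.01, 3, sim, freq, 200, 0.01, 1 * 1, 1)  # s2(1) = 1
    new_user = [beta1 * u + lam1 * d for u, d in zip(user, doc)]
    new_doc = [beta2 * d + lam2 * u for u, d in zip(user, doc)]
    return new_user, new_doc

def normalize_columns(rows):
    """Divide every feature column by its maximum (temp1, temp2 in the text)."""
    maxima = [max(col) for col in zip(*rows)]
    return [[x / m if m else x for x, m in zip(row, maxima)] for row in rows]

# Hypothetical initial values (not from the source).
users = {1: [0.5, 1.0], 2: [1.0, 0.0]}
docs = {1: [1.0, 0.2], 2: [0.3, 1.0], 3: [0.0, 1.0]}

# User 2 accesses document 3 with signal type T = 1; freq(2) = 300 > a3.
users[2], docs[3] = update(users[2], docs[3], freq=300)
user_rows = normalize_columns([users[1], users[2]])
doc_rows = normalize_columns([docs[1], docs[2], docs[3]])
```

Both new vectors are computed from the old values of uw2k and dw3k, which matches the text's step of assigning the starred values only after the update has been executed.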

Fig. 4 shows how the ordering vector of each document in document set D is represented.

The core technology of a search engine is its ranking algorithm, the best known of which is the PageRank algorithm. The standard PageRank algorithm can be expressed by the following formula.
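For reference, the standard PageRank formula is commonly written as shown below, using the symbols defined in the next paragraph. This is the textbook form; that the patent's formula (1) normalizes the (1 - d) term by N in exactly this way is an assumption, suggested by the initial ranking value of 1/N.

```latex
PR(p) = \frac{1-d}{N} + d \sum_{i \in T} \frac{PR(i)}{C(i)}
```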

Here, set T is the set of in-linking web pages of web page p (p ∈ D), and C(i) is the number of out-links of web page i (i ∈ T); d is the probability that a user accesses web page p by following a link from another web page; 1 - d is the probability that a user accesses web page p without following a link from another web page (for example, by typing in the URL), with d ∈ (0,1); PR(p) is the ranking value of web page p in document set D, and N is the number of web pages in document set D. The initial ranking value of each web page is set to 1/N. Here, each element of document set D is a web page.

The shortcoming of the standard PageRank algorithm is that each web page on the Internet has only one ranking value; that is, the algorithm assumes that every user rates the importance of a given web page identically. In other words, the PageRank algorithm does not take into account the individual differences between the users who submit search queries. The existing ranking algorithm therefore needs to be improved.

We extend the traditional PageRank value: the one-dimensional ranking value PR(p) of any document p in document set D is extended to a multidimensional ordering vector based on domain features. Let the ordering vector of any document p (p ∈ D) be [PR(p, 1), PR(p, 2), ..., PR(p, k), ..., PR(p, L)], where PR(p, k) denotes the ranking value of document p in document set D under feature k (k ∈ K). Collecting the ranking values of all documents under feature k ∈ K forms a vector, called the k-th ordering column vector of document set D, that is, the vector whose components are PR(1, k), PR(2, k), ..., PR(N, k).

Fig. 5 is a flowchart of the document ordering vector update algorithm. Assume that document set D contains at least two document subsets: each document in document subset S contains at least one link pointing to another document in document set D, and each document in document subset E is pointed to by a link contained in at least one document of document subset S; moreover, S ∪ E = D and S ∩ E ≠ Φ, where Φ is the empty set. The ordering vector update algorithm is then as follows: the ranking value of any document p in document set D under feature k (k ∈ K) is a function of the ranking value, under feature k, of each in-linking document of document p and of the relevance of that in-linking document to feature k.

The ordering vector update algorithm includes the following two specific application examples.

Example 1: the ranking value of any document p (p ∈ D) in document set D under feature k ∈ K is defined as:

Here, set T is the set of in-linking documents of document p; d is the probability that a user accesses document p by following a link from another document; 1 - d is the probability that a user accesses document p without following a link from another document (for example, by typing in the URL), with d ∈ (0,1); PR(i, k) is the ranking value of document i under feature k (k ∈ K); dwik is the relevance of document i to feature k (k ∈ K); and N is the number of documents in document set D. In addition, for each document i ∈ D and each feature k ∈ K, the initial ranking value of document i is set to PR(i, k) = 1/N.

Formula (2) can be stated in the following vector form:

Here, k ∈ K; the formula uses the all-ones column vector; and A is a non-negative matrix, A = (aij)N×N, defined as follows:

Example 2: the ranking value of any document p (p ∈ D) in document set D under feature k ∈ K is defined as:

Here, set T is the set of in-linking documents of document p; d is the probability that a user accesses document p by following a link from another document; 1 - d is the probability that a user accesses document p without following a link from another document (for example, by typing in the URL), with d ∈ (0,1); PR(i, k) is the ranking value of document i under feature k (k ∈ K); dwik is the relevance of document i to feature k (k ∈ K); C(i) is the number of out-linked documents of document i (i ∈ T); and N is the number of documents in document set D. In addition, for each document i ∈ D and each feature k ∈ K, the initial ranking value of document i is set to PR(i, k) = 1/N.

The vector form of formula (4) can also be stated in the form of formula (3), again using the all-ones column vector; the non-negative matrix A = (aij)N×N is defined as follows:

To guarantee the validity of formula (3), several restrictions must be placed on the link relationships between the documents in document set D. For example, dangling pages (Dangling Page) and the links pointing to them are removed; after the ranking values of the other documents have been computed, the dangling pages and the links pointing to them are restored, and the ranking values of the dangling pages are computed according to formula (3).

Formula (3) can be solved approximately by power iteration (Power Method), which computes the k-th ordering column vector of document set D step by step, starting from an initial value and producing a new ordering column vector at each iteration n. The power iteration method comprises the following steps:

R10. Choose any feature k ∈ K;

R11. Generate the non-negative matrix A according to formula (2) or formula (4);

R12. Set the initial value of the k-th ordering column vector of document set D, and set n = 0;

R13. Execute formula (3), that is, compute the ordering column vector of step n+1 from the ordering column vector of step n;

R14. Normalize the ordering column vector of step n+1, that is, divide it by its component of largest modulus;

R15. Determine whether the convergence condition with threshold ε is satisfied or n > STEP; if so, terminate; otherwise set n = n+1 and return to step R13.

Here ε and STEP are preset positive constants, and the normalizing quantity is the component of the vector that is largest in modulus.
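Steps R12 to R15 can be sketched as a generic power iteration loop. The sketch below is written under stated assumptions: the update of step R13 is passed in as a function because formula (3) and the matrix A are not reproduced in this text; `damped_step` is a hypothetical stand-in with the usual damped form d·A·x + (1 - d)/N; and the convergence test (largest componentwise change smaller than ε) is likewise an assumption.

```python
def max_modulus(x):
    """Return the component of x that is largest in absolute value."""
    return max(x, key=abs)

def power_iteration(step, x0, eps=1e-8, max_steps=100):
    """Iterate x <- normalize(step(x)) until the change is below eps
    or the step counter exceeds max_steps (STEP in the text)."""
    x = list(x0)                           # R12: initial value, n = 0
    n = 0
    while True:
        y = step(x)                        # R13: one application of the update
        m = max_modulus(y)
        if m != 0:
            y = [v / m for v in y]         # R14: divide by max-modulus component
        diff = max(abs(a - b) for a, b in zip(y, x))
        x = y
        if diff < eps or n > max_steps:    # R15: stop on convergence or step limit
            return x
        n += 1

def damped_step(A, d):
    """Hypothetical update x -> d*A*x + (1 - d)/N for an N x N matrix A."""
    size = len(A)
    def step(x):
        return [d * sum(A[i][j] * x[j] for j in range(size)) + (1 - d) / size
                for i in range(size)]
    return step

# Hypothetical 3-document link matrix: column j spreads document j's weight
# over the documents it links to.
A = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]
ranks = power_iteration(damped_step(A, 0.85), [1 / 3] * 3)
```

Because each iterate is divided by its largest component, the returned vector has a maximum of exactly 1; only the relative order of the components is meaningful.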

Fig. 6 is a flowchart of the personalized document search method based on the query vector and the ordering vector. The method comprises executing the following steps in a server:

S10. Update the parameter vectors of multiple documents in document set D and the parameter vectors of multiple users in user set U according to the parameter vector update algorithm; the concrete implementation comprises steps S11 to S16 described in Fig. 3;

S20. Set the initial value of the ordering vector of each document in document set D;

S30. For each feature k (k ∈ K), iteratively update the k-th ordering column vector of document set D using the ordering vector update algorithm, thereby updating the ordering vector of each document in document set D;

S40. Receive the query vector set by user q (q ∈ U) and the search condition submitted by user q, and extract search keywords from the search condition; the search condition may be all the information the user submits in the search dialog;

S50. Retrieve from document set D a group of documents Q matching the search keywords;

S60. Compute the personalized ordering value of each document in the group of documents Q according to the query vector and the ordering vector of each document in Q;

S70. Sort the group of documents Q according to the personalized ordering values, and send the links of multiple documents in Q to user q according to the sorted result.

In the method of Fig. 6, let the query vector of user q be (swq1, swq2, ..., swqk, ..., swqL), where swqk denotes the ranking value in document set D, under feature k (k ∈ K), of the documents being queried, with swqk ∈ [0,1]. Examples of how the query vector can be set follow.

In the first method, user q selects features from feature set K and sets the ranking values of the documents being queried, for example swq2 = 0.00023 and swq6 = 0.00061, with all other vector components set to 0.

In the second method, user q submits a group of document identifiers Sq = {..., r, ...}. The ordering vector of document r (r ∈ Sq) is [PR(r, 1), PR(r, 2), ..., PR(r, k), ..., PR(r, L)], so for each feature k ∈ K the query vector of user q is set to swqk = (σ7/s)·∑(r ∈ Sq) PR(r, k), or to swqk = (σ7/s)·∑(r ∈ Sq) {PR(r, k) / ∑(k ∈ K) PR(r, k)}, where s is the number of elements of set Sq and σ7 is a preset positive constant.

In one application example of the method of Fig. 6, the personalized ordering value UR(i, q) of document i (i ∈ Q), based on the query vector submitted by user q, is defined as the similarity between the query vector (swq1, swq2, ..., swqk, ..., swqL) of user q and the ordering vector [PR(i, 1), PR(i, 2), ..., PR(i, k), ..., PR(i, L)] of document i, for example

UR(i, q) = {∑(k ∈ K) [PR(i, k)·swqk]} / {[∑(k ∈ K) (PR(i, k))^2]^(1/2) · [∑(k ∈ K) (swqk)^2]^(1/2)}

Here, PR(i, k) denotes the ranking value of document i in document set D under feature k (k ∈ K), and swqk denotes the ranking value in document set D, under feature k (k ∈ K), of the documents being queried. When UR(i, q) is computed, for any k ∈ K, if PR(i, k) < min_PR, then PR(i, k) is taken as 0; if swqk < min_SW, then swqk is taken as 0. min_PR and min_SW are preset positive constants.
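A minimal Python sketch of this personalized ordering value, including the min_PR and min_SW cut-offs, is given below. The function names, the sample ordering vectors and the returning of 0 when either vector is entirely zero are assumptions made for illustration.

```python
import math

def personalized_order_value(pr, sw, min_pr=0.0, min_sw=0.0):
    """UR(i, q): cosine similarity between a document's ordering vector pr
    and the query vector sw, after zeroing components below the cut-offs."""
    pr = [v if v >= min_pr else 0.0 for v in pr]
    sw = [v if v >= min_sw else 0.0 for v in sw]
    dot = sum(p * s for p, s in zip(pr, sw))
    norm = math.sqrt(sum(p * p for p in pr)) * math.sqrt(sum(s * s for s in sw))
    return dot / norm if norm else 0.0   # assumption: 0 when a vector is all zero

def sort_documents(ordering_vectors, query, **cutoffs):
    """Return document ids sorted by descending personalized ordering value."""
    scores = {doc: personalized_order_value(vec, query, **cutoffs)
              for doc, vec in ordering_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ordering vectors for three matched documents, two features.
matched = {"a": [0.5, 0.0], "b": [0.0, 0.5], "c": [0.3, 0.3]}
order = sort_documents(matched, [1.0, 0.0])
```

With this query, which only weights feature 1, document "a" comes first and document "b", which has no weight on feature 1, comes last.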

Fig. 7 is a flowchart of the personalized document search method based on the query vector and the parameter vector. The method comprises executing the following steps in a server:

A10. Update the parameter vectors of multiple documents in document set D and the parameter vectors of multiple users in user set U according to the parameter vector update algorithm; the concrete implementation comprises steps S11 to S16 described in Fig. 3;

A20. Receive the query vector set by user q (q ∈ U) and the search condition submitted by user q, and extract search keywords from the search condition; the search condition may be all the information the user submits in the search dialog;

A30. Retrieve from document set D a group of documents Q matching the search keywords;

A40. Compute the personalized ordering value of each document in the group of documents Q according to the query vector and the parameter vector of each document in Q;

A50. Sort the group of documents Q according to the personalized ordering values, and send the links of multiple documents in Q to user q according to the sorted result.

In the method of Fig. 7, let the query vector of user q be (swq1, swq2, ..., swqk, ..., swqL), where swqk denotes the relevance of the documents being queried to feature k (k ∈ K), with swqk ∈ [0,1]. The query vector can be set in several ways.

In the first method, user q selects features from feature set K and sets their relevance, for example swq2 = 0.8 and swq6 = 0.9, with all other vector components set to 0.

In the second method, the parameter vector of user q is assigned to the query vector.

In the third method, user q submits a group of user identifiers or document identifiers Sq = {..., r, ...}. When Sq is a set of user identifiers, the parameter vector of user r (r ∈ Sq) is (uwr1, uwr2, ..., uwrL), so for each feature k ∈ K the query vector of user q is set to swqk = (σ8/s)·∑(r ∈ Sq) uwrk, or to swqk = (σ8/s)·∑(r ∈ Sq) [uwrk / (∑(k ∈ K) uwrk)]. When Sq is a set of document identifiers, the parameter vector of document r (r ∈ Sq) is (dwr1, dwr2, ..., dwrL), so for each feature k ∈ K the query vector of user q is set to swqk = (σ9/s)·∑(r ∈ Sq) dwrk, or to swqk = (σ9/s)·∑(r ∈ Sq) [dwrk / (∑(k ∈ K) dwrk)]. Here s is the number of elements of set Sq, and σ8 and σ9 are preset positive constants.
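The third method amounts to a scaled average of the parameter vectors of the submitted users or documents, optionally after dividing each vector by the sum of its components. A minimal sketch follows; the function name, the `sigma` argument (standing for σ8 or σ9) and the sample vectors are illustrative assumptions.

```python
def query_from_vectors(vectors, sigma=1.0, per_vector_normalize=False):
    """Build a query vector swq from the parameter vectors of the items in Sq.

    per_vector_normalize=False: swqk = (sigma/s) * sum_r w_rk
    per_vector_normalize=True:  swqk = (sigma/s) * sum_r [w_rk / sum_k w_rk]
    """
    s = len(vectors)                 # number of elements of Sq
    n_features = len(vectors[0])
    query = []
    for k in range(n_features):
        total = 0.0
        for vec in vectors:
            if per_vector_normalize:
                denom = sum(vec)
                # assumption: an all-zero vector contributes nothing
                total += vec[k] / denom if denom else 0.0
            else:
                total += vec[k]
        query.append(sigma / s * total)
    return query

# Hypothetical parameter vectors of two submitted users.
plain = query_from_vectors([[0.2, 0.8], [0.6, 0.4]])
normalized = query_from_vectors([[1.0, 3.0], [2.0, 2.0]], per_vector_normalize=True)
```

The normalized variant keeps an item with large parameter values from dominating the query vector, since every item then contributes a total weight of 1 before scaling.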

In one application example of the method of Fig. 7, the personalized ordering value UR(i, q) of document i (i ∈ Q), based on the query vector submitted by user q, is defined as the similarity between the query vector (swq1, swq2, ..., swqk, ..., swqL) of user q and the parameter vector (dwi1, dwi2, ..., dwiL) of document i, that is,

UR(i, q) = [∑k (swqk·dwik)] / {[∑k (swqk)^2]^(1/2) · [∑k (dwik)^2]^(1/2)}.

One application scenario of the method of Fig. 7 is microblogging. After a user publishes a microblog document, the initial value of the parameter vector of that document can be set: the parameter vector of the publishing user, multiplied by a preset constant, is assigned to the parameter vector of the microblog document. When the microblog server receives a signal that a user has accessed a microblog document (for example, a signal generated by a forward, comment or favorite action), it reads the parameter vector of the user and the parameter vector of the microblog document according to the user identifier and the microblog document identifier contained in the signal, and then updates both parameter vectors according to the parameter vector update algorithm. When a user opens the microblog, the user can filter and screen the information published by others in the user's relationship network by means of a preset query vector. The method is as follows: the user first presets a query vector; the similarity between the query vector and the parameter vector of each microblog document received by the user is then taken as the personalized ordering value of that microblog document; and the microblog documents received by the user are filtered and screened according to the magnitude of the personalized ordering value. For example, only the microblog documents whose personalized ordering value ranks in the top 30% are sent to the querying user.
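The microblog scenario can be sketched as follows. The 30% cut-off comes from the text; the function names, the integer rounding-up of the cut-off, the preset constant 0.5 and the sample vectors are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Similarity between a query vector and a parameter vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def initial_post_vector(author_vector, constant):
    """Initial parameter vector of a new post: the author's vector times a constant."""
    return [constant * w for w in author_vector]

def filter_posts(query, posts, keep_percent=30):
    """Keep the posts whose personalized ordering value ranks in the top
    keep_percent percent (rounded up; the rounding rule is an assumption)."""
    scored = sorted(posts.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    keep = (len(scored) * keep_percent + 99) // 100   # integer ceiling
    return [post_id for post_id, _ in scored[:keep]]

# Ten hypothetical posts whose similarity to the query decreases with i.
posts = {f"p{i}": [1.0, float(i)] for i in range(10)}
visible = filter_posts([1.0, 0.0], posts)
new_post = initial_post_vector([0.5, 1.0], 0.5)
```

Here the query vector weights only feature 1, so the three posts most concentrated on that feature are the ones delivered to the user.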

Fig. 8 is a structural diagram of a system for obtaining personalized features of users and documents. The system 200 comprises the following functional modules:

User set, document set and feature set setting module 211: stores in the user database 220 a user set U composed of multiple user identifiers; stores in the document database 230 a document set D composed of multiple document identifiers; and stores in the feature database 240 a feature set K composed of multiple feature identifiers;

User and document initial value setting module 212: sets a parameter vector initial value for at least one user in user set U and stores it in the user database 220; sets a parameter vector initial value for at least one document in document set D and stores it in the document database 230; sets an ordering vector initial value for each document in document set D; for users and documents whose parameter vector initial value has not been set, the parameter vector initial value defaults to the zero vector;

User-access-document signal acquisition module 213: acquires the signal of any user m (m ∈ U) (102) accessing any document n (n ∈ D), the signal being stored in the web log database 250; the signal of user m (102) accessing document n is sent to at least one application server, the application servers including a portal site server 301, a social network server 302, a search engine server 303 and an instant messaging server 304;

User and document parameter vector update module 214: according to the signal, reads the parameter vector of user m (102) from the user database 220 and the parameter vector of document n from the document database 230; then applies the parameter vector update algorithm to update the parameter vectors of user m (102) and document n; and finally updates the user database 220 and the document database 230 with the updated parameter vector of user m (102) and the updated parameter vector of document n, respectively;

Document ordering vector update module 215: for document set D, takes the link relationships between documents, the ordering vector initial value of each document and the parameter vector of each document as input data, applies the ordering vector update algorithm to iteratively update the ranking value of each document in document set D under each feature k (k ∈ K), and updates the document database 230 with the updated ranking values; the link relationships between documents are determined by the document links contained in each document in document set D;

User query module 216: first, receives the query vector set by querying user q and the search condition submitted by user q, and extracts search keywords from the search condition; then retrieves from document set D a group of documents Q matching the search keywords; next, computes the personalized ordering value of each document in the group of documents Q according to the query vector and the ordering vector of each document in Q, or according to the query vector and the parameter vector of each document in Q; finally, sorts the group of documents Q according to the personalized ordering values, and sends the links of multiple documents in Q to user q according to the sorted result.

The application examples described above are merely preferred application examples of the present invention and are not intended to limit the scope of protection of the present invention.

Claims

1. A system for obtaining personalized features of users and documents, characterized in that the system comprises the following functional modules:

a user set, document set and feature set setting module: stores in a user database a user set U composed of multiple user identifiers; stores in a document database a document set D composed of multiple document identifiers; and stores in a feature database a feature set K composed of multiple feature identifiers;

a user and document initial value setting module: sets a parameter vector initial value for at least one user in the user set U and stores it in the user database; sets a parameter vector initial value for at least one document in the document set D and stores it in the document database; sets an ordering vector initial value for each document in the document set D; for users and documents whose parameter vector initial value has not been set, the parameter vector initial value defaults to the zero vector;

a user-access-document signal acquisition module: acquires the signal of any user m (m ∈ U) accessing any document n (n ∈ D), the signal being stored in a web log database;

a user and document parameter vector update module: according to the identifiers of the user m and the document n contained in the signal, reads the parameter vector of the user m from the user database and the parameter vector of the document n from the document database; then updates the parameter vectors of the user m and the document n by means of a parameter vector update algorithm; and finally updates the user database and the document database with the updated parameter vectors of the user m and the document n, respectively;

a document ordering vector update module: for the document set D, takes the link relationships between documents, the ordering vector initial value of each document and the parameter vector of each document as input data, applies an ordering vector update algorithm to iteratively update the ranking value of each document in the document set D under each feature k (k ∈ K), and updates the document database with the updated ranking values; the link relationships between documents are determined by the document links contained in each document in the document set D;

a user query module: first, receives the query vector set by querying user q (q ∈ U) and the search condition submitted by the user q, and extracts search keywords from the search condition; then retrieves from the document set D a group of documents Q matching the search keywords; next, computes the personalized ordering value of each document in the group of documents Q according to the query vector and the ordering vector of each document in Q, or according to the query vector and the parameter vector of each document in Q; finally, sorts the group of documents Q according to the personalized ordering values, and sends the links of multiple documents in Q to the user q according to the sorted result;

according to the signal, collected by the user-access-document signal acquisition module, of the user m accessing the document n, the parameter vector U(m) = (uwm1, uwm2, ..., uwmk, ..., uwmL) of the user m is read, where uwmk denotes the relevance of the user m to feature k (k ∈ K);

according to the signal, the parameter vector D(n) = (dwn1, dwn2, ..., dwnk, ..., dwnL) of the document n is read, where dwnk denotes the relevance of the document n to feature k (k ∈ K);

the parameter vector update algorithm is applied to update the parameter vectors of the user m and the document n; letting the updated parameter vector of the user m be U*(m) = (uwm1*, uwm2*, ..., uwmk*, ..., uwmL*) and the updated parameter vector of the document n be D*(n) = (dwn1*, dwn2*, ..., dwnk*, ..., dwnL*), the parameter vector update algorithm comprises:

U*(m) = F1[U(m), D(n)];

D*(n) = F2[U(m), D(n)];

where F1() and F2() are each functions with U(m) and D(n) as independent variables;

for each feature k ∈ K, uwmk* and dwnk* are each a decreasing function of the frequency with which the user m accesses the document set D;

in one application example of the parameter vector update algorithm, uwmk* and dwnk* are updated as follows:

uwmk* = β1·uwmk + λ1(n, m, T)·f1(dwnk) (for each k ∈ K)

dwnk* = β2·dwnk + λ2(m, n, T)·f2(uwmk) (for each k ∈ K)

where λ1(n, m, T) is the influence coefficient of the document n on the user m under type T of the signal, and λ2(m, n, T) is the influence coefficient of the user m on the document n under type T of the signal; β1 and β2 are preset positive constants; f1(dwnk) is an increasing function of dwnk, and f2(uwmk) is an increasing function of uwmk; for each k ∈ K, uwmk* is a decreasing function of ∑(k ∈ K) dwnk, and dwnk* is a decreasing function of ∑(k ∈ K) uwmk; λ1(n, m, T) and λ2(m, n, T) are each a decreasing function of the frequency with which the user m accesses the document set D.

2. The system according to claim 1, characterized in that, for each feature k ∈ K, uwmk* is an increasing function of dwnk, and dwnk* is an increasing function of uwmk.

3. The system according to claim 1, characterized in that, after the parameter vector update algorithm has been executed a set number of times, for each feature k ∈ K the k-th user column vector (uw1k, uw2k, ..., uwMk) is normalized; and after the parameter vector update algorithm has been executed a set number of times, for each feature k ∈ K the k-th document column vector (dw1k, dw2k, ..., dwNk) is normalized.

4. The system according to claim 3, characterized in that λ1(n, m, T) and λ2(m, n, T) are each an increasing function of the similarity between the parameter vector of the user m and the parameter vector of the document n.