Summary of the invention
In view of the above defects in the prior art, the technical problem to be solved by the present invention is how to provide accurate information that accounts for the differences between users.
To solve this technical problem, in one aspect the invention provides a resource recommendation method based on users' potential demand, the method comprising the steps of:
S1, clustering the resources and extracting their topics using text clustering and topic mining algorithms;
S2, based on the clustering result, calculating the descriptors under each topic to obtain a thesaurus for the corresponding field;
S3, automatically indexing the resources with the thesaurus, and calculating the descriptors contained in each individual resource;
S4, calculating the user's attention degree for a given topic from the user's operation records on individual resources combined with the user's attributes; building user demand models and calculating the topic similarity between users; and calculating the authority of specified information with respect to a topic from the relations among the data in the individual resources;
S5, screening the resources according to the user demand model, and recommending the resources with a higher matching degree to the user.
Preferably, in step S1, an improved hierarchical topic extraction model, hLDA, is used to perform the clustering and topic extraction.
Preferably, in step S4, the topic similarity between users u and v is calculated as follows:
Build the respective demand models $M_u$ and $M_v$ of users u and v, and denote the respective topic sets of $M_u$ and $M_v$ as $T_u$ and $T_v$;
From the topics contained in $M_u$ and $M_v$, build the topic set $\{T_1, T_2, \ldots, T_n\}$, where n is the sum of the numbers of topics contained in $M_u$ and $M_v$;
For each topic $T_i$, calculate the attention degrees $S(u, T_i)$ and $S(v, T_i)$ of users u and v for $T_i$;
In the topic space $\{T_1, T_2, \ldots, T_n\}$, build the topic attention vectors U and V of users u and v: $U = \{S(u, T_1), S(u, T_2), \ldots, S(u, T_n)\}$ and $V = \{S(v, T_1), S(v, T_2), \ldots, S(v, T_n)\}$; the cosine of the angle between the vectors U and V is taken as the topic similarity between u and v.
Preferably, in step S5, screening resources according to the user demand model $M_u$ comprises the steps of:
For each topic contained in $M_u$, putting the standard descriptors under the topic and the corresponding auxiliary words into a vocabulary Dic; after all topics have been processed, Dic contains all standard descriptors and auxiliary words in the model $M_u$;
For each topic T contained in $M_u$, obtaining all documents that contain the topic and putting them into a set Docs; after all topics have been processed, Docs is the set of documents that contain at least one topic in $M_u$;
For each document in Docs, counting the total number of occurrences $TF_{Dic}$ in the document of the words in Dic; after all documents in Docs have been counted, sorting the documents by $TF_{Dic}$ and recommending the top-ranked documents to the user.
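The screening steps above can be sketched in Python as follows. This is only an illustrative sketch: the function name `screen_resources`, the dictionary-based inputs, and the `top_k` parameter are assumptions introduced here and are not part of the original description, which does not fix how many top-ranked documents are recommended.

```python
from collections import Counter


def screen_resources(demand_topics, docs_by_topic, doc_tokens, top_k=3):
    """Rank documents for one user by TF_Dic (hypothetical helper).

    demand_topics: {topic: set of standard descriptors and auxiliary words}
    docs_by_topic: {topic: iterable of ids of documents containing that topic}
    doc_tokens:    {document id: list of tokens of the document}
    """
    # Vocabulary Dic: every descriptor and auxiliary word in the demand model.
    dic = set()
    for words in demand_topics.values():
        dic |= set(words)

    # Set Docs: documents that contain at least one topic of the model.
    docs = set()
    for topic in demand_topics:
        docs |= set(docs_by_topic.get(topic, ()))

    # TF_Dic: total number of occurrences of Dic words in each document.
    tf_dic = {}
    for doc_id in docs:
        counts = Counter(doc_tokens[doc_id])
        tf_dic[doc_id] = sum(counts[w] for w in dic)

    # Sort by TF_Dic, descending; ties are broken by document id (assumption).
    ranked = sorted(docs, key=lambda d: (-tf_dic[d], d))
    return [(d, tf_dic[d]) for d in ranked[:top_k]]
```

Documents that belong to none of the user's topics never enter Docs, so they are not ranked even if they happen to contain words from Dic.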
Preferably, for a user u, the user demand model $M_u$ is expressed as $M_u = (A_u, T_u)$, where $A_u$ is the attribute set of user u, $A_u = \{a_1, a_2, \ldots, a_n\}$, each attribute $a_i$ being an attribute associated with demand; and $T_u$ is the set of topics that user u pays attention to, that is, the set of the topics $T_i$ ($i = 1, 2, \ldots, n$) that user u pays attention to.
Preferably, in step S2, the descriptors are calculated using mutual information:
After the mutual information between each candidate descriptor and the corresponding topic has been calculated, the candidates are sorted in descending order of mutual information; finally, the several candidates with the largest mutual information are taken as the descriptors of the topic.
Preferably, in step S2, after the descriptors have been calculated:
The calculated descriptors are also reviewed manually, and the descriptors that pass the review enter a standard thesaurus;
Meanwhile, the hierarchical relations between descriptors are used to establish broader-term and narrower-term relations between the descriptors in the standard thesaurus;
And HowNet is used as a synonym dictionary to calculate the synonyms of each descriptor in the standard thesaurus.
Preferably, in step S5, similarity-based recommendation is also made to the target user, according to the topic similarity between users, using the demand model of the user with the highest similarity; and/or
Authority-based recommendation is made to the user according to the authority of the specified information with respect to the topic.
In another aspect, the present invention also provides a resource recommendation system based on users' potential demand, the system comprising:
A preprocessing module, for clustering the resources and extracting their topics using text clustering and topic mining algorithms;
A thesaurus module, for calculating, based on the clustering result, the descriptors under each topic to obtain a thesaurus for the corresponding field;
An indexing module, for automatically indexing the resources with the thesaurus and calculating the descriptors contained in each individual resource;
A computing module, for calculating the user's attention degree for a given topic from the user's operation records on individual resources combined with the user's attributes; building user demand models and calculating the topic similarity between users; and calculating the authority of specified information with respect to a topic from the relations among the data in the individual resources;
A recommendation module, for screening the resources according to the user demand model and recommending the resources with a higher matching degree to the user.
The invention provides a resource recommendation method and system based on users' potential demand. Because a user's potential information needs are closely related to the user's own professional field, mining those potential needs on the basis of the professional field makes it possible to recommend, more accurately, information resources that match the user's demand.
Embodiment
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are preferred modes of carrying out the invention; the description is intended to illustrate the general principles of the invention and not to limit its scope. The scope of protection of the invention is defined by the claims. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the invention, without creative effort, fall within the scope of protection of the invention.
Existing search engines have limited and rather passive information-filtering capability. To address this problem, personalized recommendation technology emerged: the information resource server analyzes a user's needs and actively pushes to the user information that may interest the user. The main feature of a recommender system is that it is an active push process, which overcomes the defect of the information-pull model of traditional search engines: users often do not know how to express their information needs accurately, or do not yet know what their information needs are, and so cannot obtain valuable information from a search engine.
The core of personalized recommendation technology is how to analyze and mine a user's potential information needs, for example by using the user's operation logs (such as browsing records for books, songs, films and other resources) to analyze the user's personal preferences, geographic location and so on, and on that basis recommending other related information resources to the target user. Among current personalized recommendation techniques, collaborative filtering is the most studied and the most widely applied; it derives the content recommended to the target user from an analysis of other users' operation logs, and its recommendations are highly personalized. Current recommendation methods based on the idea of collaborative filtering fall broadly into two classes: recommendation algorithms based on user similarity, and recommendation algorithms based on content filtering. A similarity-based algorithm describes the association between users and resources by building a user-resource matrix, calculates the similarity between users on that basis, and then recommends to the target user the information of users similar to the target user. A content-filtering algorithm obtains a user's feature model by analyzing the content of the information resources the user has browsed, then scores resources with the feature model, and the high-scoring resources are recommended to the target user. Combined with methods such as machine learning and data mining, these methods alleviate, to some extent, the sparsity problem of the user-item rating matrix.
Most of the above research targets common Internet information resources such as web pages, songs, and videos. The categories of resources that interest users are broad, so collaborative filtering and content filtering have difficulty determining accurately the field to which a user's resource needs belong. Structured or semi-structured resources with relatively complete metadata (typically professional digital information resources, such as those of universities, research institutes, and large enterprises) differ markedly from public Internet resources in the following ways: the information resources have very clear professional-domain categories; they have relatively complete metadata, such as author, keywords, and classification number; users generally need to authenticate to use a digital library or a scientific and technical information platform, and their identity information is fairly definite, including not only a user name but also information such as institutional affiliation; and users acquire resources with a stronger sense of purpose, the resources obtained being closely related to the research fields that interest them. Traditional personalized recommendation methods based on public Internet resources therefore cannot meet the requirement of professional information systems for more accurate personalized recommendation based on professional-domain interest.
In an embodiment of the present invention, mining users' potential information needs on the basis of their professional field makes it possible to recommend, more accurately, information resources that match the user's demand. Referring to Fig. 1, in one embodiment of the invention, the resource recommendation method based on users' potential demand comprises the steps of:
S1, clustering the resources and extracting their topics using text clustering and topic mining algorithms;
S2, based on the clustering result, calculating the descriptors under each topic to obtain a thesaurus for the corresponding field;
S3, automatically indexing the resources with the thesaurus, and calculating the descriptors contained in each individual resource;
S4, calculating the user's attention degree for a given topic from the user's operation records on individual resources combined with the user's attributes; building user demand models and calculating the topic similarity between users; and calculating the authority of specified information with respect to a topic from the relations among the data in the individual resources;
S5, screening the resources according to the user demand model, and recommending the resources with a higher matching degree to the user.
Various preferred modes of the above embodiment are further explained below. In a preferred embodiment of the invention, in order to better highlight the technical principles and practical effect of the invention, the scope of the resources is limited to scientific and technical information. Those skilled in the art should understand, however, that scientific and technical information is only one specific category among all data resources. The technical solution of the invention can obviously also be applied directly to other structured or semi-structured resources with relatively complete metadata (structured web documents marked up in a general format, such as XML or HTML; resources further described by well-defined fields, such as patent documents; or other resources that have undergone rough classification), so the preferred embodiment should not be regarded as limiting the invention.
Users of a scientific and technical information system generally screen for resources of interest in three ways: by subject term, by document author, and by the author's research institution. Each scientific document has specific topics, and each topic is described by a group of descriptors; each document is associated with its authors, each author has specific research fields, and an author's research fields can be described by the topics of the documents the author has published; each author is associated with his or her research institution, such as a university or research institute; documents may be topically related to one another; and authors, and likewise institutions, may have cooperative relations such as co-authored publications and shared science and technology projects. There are therefore latent and fairly complex associations among users' research topics, scientific documents, authors, and research institutions. The present invention makes full use of these associations as the basis for mining users' potential demand and for making recommendations. Moreover, the invention not only recommends scientific documents to the user, but also uses the associations among research topics, documents, authors, and research institutions to recommend authoritative authors and authoritative research institutions in the user's fields of interest.
In a preferred embodiment of the invention, in order to build the thesaurus effectively, the topics contained in the information resources must first be extracted. The most widely used topic extraction method is the LDA topic model, a commonly used three-level Bayesian generative probabilistic topic model that captures the relations among words, documents, and latent semantic topics. Its number of parameters does not grow linearly with the size of the document collection, it generalizes well, and it is a very popular model in fields such as machine learning and information retrieval. However, before the LDA model is applied to topic mining on a large document collection, the number of topics K must be specified manually in advance, and in general it is impossible to determine in advance how many topics a given large document collection contains. Moreover, the traditional LDA model cannot cluster documents automatically during topic extraction, so the extracted topics have no semantic hierarchy.
Therefore, step S1 of the present invention preferably uses an improved hierarchical topic extraction model, hLDA, to extract the topics of the resources. The extracted topics have a semantic hierarchy, and the documents can be clustered automatically at the same time. More importantly, the improved hLDA model of the invention makes full use of the citation relations in scientific documents: documents linked by a citing or cited relation are more likely to belong to the same topic, and more likely to be clustered together. The improved hLDA model of the invention is shown in Fig. 2:
In the improved hLDA model, the following symbols denote the parameters of the document clustering and topic extraction model. The node T represents the set of paths of an L-level tree; γ is the prior hyperparameter of the path probability distribution of the tree; nCRP is a stochastic process that assigns probability distributions to trees of unbounded breadth and unbounded depth; $C_1, C_2, C_3, \ldots, C_L$ represent the nodes in the tree; α is the proportion among the latent topics, the hyperparameter of the prior distribution of the document collection over the latent topics at the topic levels of its tree; θ is the distribution of a document over topics, which follows the Dirichlet distribution $\mathrm{Dir}(\theta \mid \alpha)$ and represents the weight of each latent topic of the target document m within its topic levels; Z represents the topics contained in a document; W represents the words in a document; β represents the term distribution of each node topic in the tree; η is the hyperparameter of the prior distribution of the topic-term distributions. The random variable s represents the citation relation between documents m and m': s = 0 means that document m does not cite document m', so the topics of document m are determined entirely by the document's own topic prior α and its own topic distribution θ; if s = 1, the topics of document m are determined jointly by m and m', and the parameter λ determines the proportion in which a topic comes from the cited document m' or from document m itself. The distribution of λ depends on the prior probability ψ.
Based on the above improved hLDA model, the process that generates the words of a document is as follows:
For each topic $k \in T$ in the tree, generate the topic's term distribution $\beta \sim \mathrm{Dirichlet}(\eta)$;
For each document m, generate its path in the tree according to $C_m \sim \mathrm{nCRP}(\gamma)$;
For the L-dimensional topic distribution of the document: if s = 0, generate $\theta_m \sim \mathrm{Dirichlet}(\alpha)$; if s = 1, first generate $\lambda \sim \mathrm{Dirichlet}(\psi)$, where λ determines the proportion in which a topic comes from the cited document m' or from document m itself, and then generate $\theta_m \sim \lambda\,\mathrm{Dirichlet}(\alpha) + (1 - \lambda)\,\mathrm{Dirichlet}(\alpha')$;
For the n-th word in the document, select the topic assigned to the word, $Z_{m,n} \sim \mathrm{Mult}(\theta_m)$, and then select the word $W_{m,n} \mid \{Z_{m,n}, C_m, \beta\} \sim \mathrm{Mult}(\beta, C_m[Z_{m,n}])$.
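The document-level part of this generative process can be illustrated with a short Python sketch using only the standard library. It is a simplified illustration under stated assumptions, not the model of the invention: the path $C_m$ is taken as given (nCRP sampling is omitted), λ is drawn as the first component of a two-dimensional Dirichlet so that it can act as a scalar mixing ratio, and the names `dirichlet`, `sample_theta`, and `sample_words` are hypothetical.

```python
import random


def dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_theta(alpha, rng, alpha_cited=None, psi=(1.0, 1.0)):
    """Topic distribution over the L levels of a document's path.

    alpha_cited is None when s = 0 (no cited document m');
    otherwise s = 1 and the cited document's prior is mixed in.
    """
    own = dirichlet(alpha, rng)
    if alpha_cited is None:
        return own
    # Assumption: lambda is the first component of a 2-dim Dirichlet(psi).
    lam = dirichlet(psi, rng)[0]
    cited = dirichlet(alpha_cited, rng)
    return [lam * a + (1.0 - lam) * b for a, b in zip(own, cited)]


def sample_words(theta, path_topics, n_words, rng):
    """path_topics[l] is the {word: probability} distribution of the
    level-l topic on the document's path (the path itself is given)."""
    words = []
    for _ in range(n_words):
        # Z_{m,n} ~ Mult(theta_m): pick a level of the path.
        level = rng.choices(range(len(theta)), weights=theta)[0]
        dist = path_topics[level]
        # W_{m,n}: pick a word from the term distribution of that topic.
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        words.append((level, word))
    return words
```

Because the mixture of two probability vectors with weights λ and 1 - λ is again a probability vector, `sample_theta` always returns a valid distribution over the L levels.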
In clustering and topic extraction, Gibbs sampling is used to estimate the model parameters. Gibbs sampling only needs to estimate the variable $Z_{m,i}$ (the topic assigned to the i-th word $W_i$ in document m) and the variable $C_{m,l}$ (the level-l topic of document m on its path in the topic hierarchy tree). The whole Gibbs sampling process is divided into the following two steps:
First, the variable $Z_{m,i}$ is estimated; its conditional posterior probability distribution is expressed as follows:
Wherein $Z_{m,-i}$ denotes the topic assignments of all other words $W_k$ ($k \neq i$) in document m; $Z_{m',-i}$ denotes the topic assignments of all other words $W_k$ ($k \neq i$) in document m'; and the count terms of the expression denote, in order: the number of occurrences of word $W_i$ assigned topic j in document m; the number of words assigned topic j in document m; the total number of words assigned topic j; the total number of words in document m; the number of occurrences of word $W_i$ assigned topic j in document m'; the number of words assigned topic j in document m'; and the total number of words in document m'. The parameter λ determines the proportion in which a topic comes from the cited document m' or from document m itself; α and β are the priors of the document-topic distribution and the topic-term distribution, respectively.
Second, the variable $C_{m,l}$ is estimated; its conditional posterior probability distribution is expressed as follows:
$$p(C_m \mid W, C_{-m}, Z) \propto p(W_m \mid C, W_{-m}, Z)\, p(C_m \mid C_{-m})$$
Wherein $W_{-m}$ and $C_{-m}$ denote, respectively, the words of all documents other than document m and the paths of those documents in the topic hierarchy tree. By Bayes' rule, $p(W_m \mid C, W_{-m}, Z)$ is the likelihood of document m, and $p(C_m \mid C_{-m})$ is the prior probability of $C_m$ in the topic hierarchy tree. $p(W_m \mid C, W_{-m}, Z)$ is calculated by the following formula:
Wherein the count terms of the formula denote, in order: the number of times word w in document m is assigned topic $C_{m,l}$; the number of times all words in document m are assigned topic $C_{m,l}$; the number of times word w in all documents other than document m is assigned topic $C_{m,l}$; and the number of times all words in all documents other than document m are assigned topic $C_{m,l}$. W is the total number of words in the dictionary, and $\Gamma(\cdot)$ is the standard gamma function.
Through topic-model mining, the latent topics in the scientific documents are discovered automatically, and the documents in the collection are clustered according to the automatically discovered topics. The next important step is to find the group of descriptors that represents each topic. The descriptors of each topic can be calculated as follows: each topic is represented by the set of documents that belong to it, which has already been obtained by the topic extraction and document clustering algorithm of step S1; on this basis, for a topic and the set of documents belonging to it, a group of words that can represent the document set is found. This group is obtained by calculating the mutual information between each word and the documents belonging to the topic and selecting the top M words that best represent the topic.
In step S2 of the present invention, descriptors are calculated using mutual information. Mutual information is a useful information measure of the correlation between two sets of events. The mutual information of two random variables X and Y is defined as:
$$I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$
where $p(x, y)$ is the joint probability distribution function of X and Y, and $p(x)$ and $p(y)$ are the marginal probability distribution functions. Specifically, the flow of descriptor calculation using mutual information is shown in Fig. 3, and the concrete steps are:
S21, when filtering descriptors, define random variables U and C: when a document contains descriptor t, U takes the value $e_t = 1$, and when a document does not contain descriptor t, U takes the value $e_t = 0$; when a document belongs to topic c, C takes the value $e_c = 1$, and when a document does not belong to topic c, C takes the value $e_c = 0$;
S22, for a word t in a topic c, the mutual information of word t and topic c is:
$$I(U; C) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} P(U = e_t, C = e_c) \log_2 \frac{P(U = e_t, C = e_c)}{P(U = e_t)\, P(C = e_c)}$$
With maximum likelihood estimation, the above formula becomes:
$$I(U; C) = \frac{N_{11}}{N} \log_2 \frac{N N_{11}}{N_{1.} N_{.1}} + \frac{N_{01}}{N} \log_2 \frac{N N_{01}}{N_{0.} N_{.1}} + \frac{N_{10}}{N} \log_2 \frac{N N_{10}}{N_{1.} N_{.0}} + \frac{N_{00}}{N} \log_2 \frac{N N_{00}}{N_{0.} N_{.0}}$$
where $N_{10}$ is the number of documents that contain descriptor t but are not in topic c; $N_{11}$ is the number of documents that contain descriptor t and are in topic c; $N_{01}$ is the number of documents that do not contain descriptor t but are in topic c; $N_{00}$ is the number of documents that neither contain descriptor t nor are in topic c; $N_{1.}$ is the total number of documents that contain descriptor t; $N_{.1}$ is the total number of documents in topic c; $N_{0.}$ is the total number of documents that do not contain descriptor t; $N_{.0}$ is the total number of documents not in topic c; and N is the total number of documents.
S23, suppose the set of candidate descriptors mined from all documents under a topic c is $W = \{w_1, w_2, \ldots, w_n\}$. After the mutual information between each candidate descriptor and topic c has been calculated, the candidates are sorted in descending order of mutual information. Finally, the several candidates with the largest mutual information are taken as the descriptors of the topic.
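Steps S22 and S23 can be sketched in Python from the four document counts defined above. The function names `mutual_information` and `top_descriptors` and the parameter `m` (how many candidates to keep) are illustrative assumptions; a zero cell is treated as contributing zero, following the usual convention that 0·log 0 = 0.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between a term and a topic,
    estimated by maximum likelihood from document counts."""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00  # documents with / without the term
    n_1, n_0 = n11 + n01, n10 + n00  # documents in / not in the topic

    def cell(n_tc, n_t, n_c):
        # A zero cell contributes nothing (0 * log 0 is taken as 0).
        if n_tc == 0:
            return 0.0
        return (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))

    return (cell(n11, n1_, n_1) + cell(n01, n0_, n_1)
            + cell(n10, n1_, n_0) + cell(n00, n0_, n_0))


def top_descriptors(candidates, m):
    """candidates: {word: (n11, n10, n01, n00)}; return the m words with
    the largest mutual information, in descending order."""
    scored = sorted(((mutual_information(*counts), word)
                     for word, counts in candidates.items()), reverse=True)
    return [word for _, word in scored[:m]]
```

A word whose presence is independent of the topic scores 0, while a word that occurs in exactly the documents of a topic covering half the collection scores 1 bit.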
In addition, after the descriptors have been calculated, they can also be reviewed manually, and the descriptors that pass the review enter the standard thesaurus. Meanwhile, the hierarchical relations between descriptors are used to establish broader-term and narrower-term relations between the descriptors in the standard thesaurus, and HowNet is used as a synonym dictionary to calculate the synonyms of each descriptor in the standard thesaurus.
Automatic indexing with standard descriptors is basic work in automatic text processing: once the thesaurus has been constructed, the standard descriptors contained in each scientific document are calculated automatically. This can effectively improve the precision of user retrieval, text classification, and the matching of scientific documents to user demand.
Further, the following symbols are used to describe the automatic document indexing algorithm. The standard thesaurus NW is the set of standard descriptors, $NW = \{nw_1, nw_2, \ldots, nw_N\}$, where N is the number of standard descriptors in the standard thesaurus. For each $nw_i \in NW$ ($i = 1, 2, \ldots, N$), a synonym set is defined, where K is the number of synonyms of the standard descriptor $nw_i$. The set of all synonyms of the standard descriptors in the standard thesaurus NW is denoted SW. The set of scientific documents is denoted D, $D = \{d^{(1)}, d^{(2)}, \ldots, d^{(M)}\}$, where M is the number of documents. For each document $d^{(i)}$, a set of the standard descriptors appearing in the document and a set of the synonyms of standard descriptors appearing in the document are defined. Based on this, the automatic document indexing flow used in step S3 of the present invention is shown in Fig. 4, and the concrete steps are:
S31, add NW and SW to the dictionary of the word segmenter and segment the documents; each document $d^{(i)} \in D$ is then represented as a set of terms;
S32, for each term w in the term set of $d^{(i)}$, if $w \in NW$, add w to the document's set of standard descriptors;
S33, for each term w in the term set of $d^{(i)}$, if $w \in SW$, add w to the document's set of synonyms;
S34, for each word w in the document's set of synonyms, find the corresponding standard descriptor nw through the synonym relations of the standard thesaurus, and add nw to the document's set of standard descriptors;
S35, at this point the set of all standard descriptors contained in document $d^{(i)}$ has been calculated, where L denotes the number of standard descriptors contained in $d^{(i)}$;
S36, for each standard descriptor w in that set, calculate the weight of w with respect to the document:
Wherein $w_t$, $w_a$, $w_{fls}$, and $w_c$ are the weights for occurrences of the standard descriptor w in the title, in the abstract, in the first and last sentences of body paragraphs, and in the other parts of body paragraphs, respectively; $f_t$, $f_a$, $f_{fls}$, and $f_c$ are the numbers of occurrences of w in the title, in the abstract, in the first and last sentences of body paragraphs, and in the other parts of the body, respectively; $w_{len}$ is the length of w; and $w_{tf/idf}$ is the tf/idf value of w.
The 5 words with the largest weights are taken as the automatic indexing terms of the scientific document.
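The matching part of the indexing flow (steps S32 to S35) can be sketched in Python as below. The sketch assumes the document has already been segmented into tokens and that the synonym relations of the thesaurus are available as a mapping from each synonym to its standard descriptor; the name `index_document` and both data structures are illustrative assumptions. The weighting of step S36 is not included.

```python
def index_document(tokens, standard_descriptors, synonym_to_standard):
    """Return the set of standard descriptors contained in one document.

    tokens:               list of terms produced by the word segmenter
    standard_descriptors: set of standard descriptors (NW)
    synonym_to_standard:  {synonym in SW: its standard descriptor in NW}
    """
    found = set()
    for w in tokens:
        # S32: the term is itself a standard descriptor.
        if w in standard_descriptors:
            found.add(w)
        # S33 and S34: the term is a synonym, so map it to its descriptor.
        if w in synonym_to_standard:
            found.add(synonym_to_standard[w])
    return found
```

For example, with `standard_descriptors = {"topic model"}` and `synonym_to_standard = {"LDA": "topic model"}`, a document whose tokens contain only `"LDA"` is still indexed with `"topic model"`.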
The user's topic attention degree describes how interested the user is in a given topic. In measuring the user's topic attention degree, the present invention considers the following factors: the number of times the user has browsed documents belonging to the topic; the number of times the user has downloaded documents belonging to the topic; and the authority of the authors or research institutions of the documents belonging to the topic that the user has browsed or downloaded. These factors rest on the following reasonable assumption: the more often the user has browsed or downloaded documents belonging to the topic, and the higher the authority of the authors or research institutions of those documents, the higher the user's attention degree for the topic.
The following symbols are used to describe the calculation of the user's topic attention degree: the attention degree of user u for topic T is denoted $S(u, T)$; the set of documents browsed by user u is denoted $D_{bu}$; the set of documents downloaded by user u is denoted $D_{du}$; the set of authors of the documents browsed or downloaded by the user is $A_u$; the set of institutions to which those authors belong is $O_u$; the authority of author a for topic T is denoted $C(a, T)$; and the authority of institution o for topic T is denoted $C(o, T)$. $S(u, T)$ is calculated in the following steps:
For each document in the document sets $D_{bu}$ and $D_{du}$, extract the topics of the document;
Count the number of documents in $D_{bu}$ that contain topic T;
Count the number of documents in $D_{du}$ that contain topic T;
Count the number of documents among all scientific documents that contain topic T, denoted $N_T$;
Calculate $S(u, T)$ using the following formula:
Wherein N is the total number of documents, and the formula includes the inverse document frequency of the topic: if topic T occurs in more documents, the topic is more common, and the weights of the browse count and the download count are reduced accordingly.
Besides user operation logs, another key factor that a recommendation algorithm needs to consider is the similarity between users: when the content to recommend to a target user cannot be determined from the operation logs, recommendations can be made to the target user according to the content recommended to similar users. The similarity between users can be calculated on the basis of different user features. Because the recommendation algorithm proposed by the present invention is based on user demand, and the user demand model is built on topics, the invention calculates the similarity between users on the basis of topics.
Preferably, the topic similarity between users u and v is calculated as follows:
Build the respective demand models $M_u$ and $M_v$ of users u and v, and denote the respective topic sets of $M_u$ and $M_v$ as $T_u$ and $T_v$;
From the topics contained in $M_u$ and $M_v$, build the topic set $\{T_1, T_2, \ldots, T_n\}$, where n is the sum of the numbers of topics contained in $M_u$ and $M_v$;
For each topic $T_i$, calculate the attention degrees $S(u, T_i)$ and $S(v, T_i)$ of users u and v for $T_i$;
In the topic space $\{T_1, T_2, \ldots, T_n\}$, build the topic attention vectors U and V of users u and v: $U = \{S(u, T_1), S(u, T_2), \ldots, S(u, T_n)\}$ and $V = \{S(v, T_1), S(v, T_2), \ldots, S(v, T_n)\}$; the cosine of the angle between the vectors U and V is taken as the topic similarity between u and v.
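A minimal Python sketch of this similarity calculation is given below. It assumes each user's attention degrees are already available as a mapping from topic to $S(\cdot, T)$, builds the topic space from the distinct topics of both users, and treats a topic missing from one user's model as attention degree 0; the name `topic_similarity` and these conventions are assumptions made for illustration.

```python
import math


def topic_similarity(attention_u, attention_v):
    """Cosine of the angle between two users' topic attention vectors.

    attention_u, attention_v: {topic: attention degree S(user, topic)}
    """
    # Topic space: the distinct topics of both demand models.
    topics = sorted(set(attention_u) | set(attention_v))
    u = [attention_u.get(t, 0.0) for t in topics]
    v = [attention_v.get(t, 0.0) for t in topics]

    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a user with no recorded attention matches nobody
    return dot / (norm_u * norm_v)
```

Users who pay attention to no common topic get similarity 0, and users whose attention vectors are proportional get similarity 1.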
Wherein, for a user u, the demand model is denoted $M_u$ and is represented by the two-tuple (A, T), i.e. $M_u = (A_u, T_u)$, where $A_u$ is the attribute set of user u, $A_u = \{a_1, a_2, \ldots, a_n\}$, each attribute $a_i$ being an attribute associated with demand, such as specialty, institution, department, or position; and $T_u$ is the set of topics that user u pays attention to, that is, the set of the topics $T_i$ ($i = 1, 2, \ldots, n$) that user u pays attention to. Each topic $T_i$ is a set composed of a number of elements, the i-th element of which is $\{NW_i, SNW_i : \{S_1, S_2, \ldots, S_n\}\}$, where $NW_i$ is a standard descriptor describing topic $T_i$, and $SNW_i$ is the auxiliary word set of the standard descriptor $NW_i$, used to supplement the description of topic $T_i$. The auxiliary word set $SNW_i$ consists of two parts: the synonyms of the standard descriptor $NW_i$ in the thesaurus, and the keywords of the articles browsed by the user.
For user u, the set of themes that user u pays attention to in the demand description model M_u is obtained as follows:
According to the user operation log records, find the document set D that the user has browsed and downloaded;
Perform theme extraction on the documents to obtain the theme set T_d of the document set D, and take it as the set T_u of themes that user u pays attention to;
For each theme T_i in the theme set T_d, build its content as follows: from the document set D, find the document subset belonging to theme T_i; for each document d in this subset, calculate the standard descriptors it contains and add them to the standard descriptor set of theme T_i; for each standard descriptor NW_i in that set, add its synonyms in the thesaurus to the auxiliary word set SNW_i of NW_i; at the same time, for each document d in the document subset, if d contains the standard descriptor NW_i, also add the keywords of document d to the auxiliary word set SNW_i of NW_i; finally, add the element {NW_i, SNW_i: {S_1, S_2, ..., S_n}} to theme T_i.
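The construction of one theme's content can be sketched as below. Several points are assumptions made for the sake of a runnable example: documents are dictionaries with hypothetical `text` and `keywords` fields, a document "contains" a descriptor when the descriptor occurs as a substring of its text, and the thesaurus synonyms are supplied as a plain dictionary.

```python
def build_theme_elements(theme_docs, descriptors, synonyms):
    """Build {NW_i: SNW_i} for one theme T_i.

    theme_docs  -- documents of D that belong to T_i, each a dict with
                   'text' (str) and 'keywords' (list of str)
    descriptors -- standard descriptors taken from the thesaurus
    synonyms    -- dict mapping a descriptor to its thesaurus synonyms
    """
    elements = {}
    for doc in theme_docs:
        for nw in descriptors:
            if nw in doc["text"]:  # simple substring test (assumption)
                # First sighting of NW_i: seed SNW_i with its synonyms.
                aux = elements.setdefault(nw, set(synonyms.get(nw, ())))
                # Keywords of every document that contains NW_i join SNW_i.
                aux.update(doc["keywords"])
    return elements
```

A descriptor that occurs in none of the theme's documents never enters the result, which matches the rule that only descriptors actually found in the documents are added to the theme.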
In addition to recommending information resources according to user demand, the recommendation system implemented by the present invention also recommends to the user authoritative authors and authoritative research institutions related to the user's demand. Because user demand is described by a theme-based model, the authority of authors and institutions with respect to a given theme must be calculated.
The theme authority of authors and institutions is calculated from the cooperative relationships among authors and among institutions. As shown in Figure 5, the theme authority of authors and institutions is calculated as follows:
(1) Use the theme extraction algorithm to calculate the themes of all documents, and select the documents containing the designated theme to form the document set of that theme. Use the co-publication relationships of the documents in this set to build the cooperative relationship graph among authors and the cooperative relationship graph among institutions. Merge the author relationship graph and the institution relationship graph into one heterogeneous network.
On this heterogeneous network, three random walk models are established, as shown in Figure 6: the author random walk model G(A), built from the co-authorship relationships among authors; the institution random walk model G(O), built from the cooperation relationships among institutions; and the author-institution random walk model G(AO), built from the affiliation relationships between authors and institutions. The weight of each edge in the graphs is obtained as a weighted combination of the following factors: the number of jointly published documents, and the number of citations of each jointly published document. An author popularity assessment model C(A) is also established. C(A) mainly uses two features of an author in the information system, namely the number of times documents containing the designated theme are collected and the number of times documents containing the designated theme are cited, to measure the author's popularity with respect to a particular theme within the system, as one factor in the author's influence.
(2) For the homogeneous-node random walk models G(A) and G(O), the conventional PageRank algorithm is adopted: the relationships among nodes of the same type and their closeness (weights) are iterated to calculate the PageRank value of each node [the two iteration formulas, one for G(A) and one for G(O), are not reproduced here].
Here A_a and A_o are the adjacency matrices of the network graphs G_a and G_o, with the number of co-authored papers between authors or between institutions as the edge weights of the adjacency matrix; M_a and M_o are the probability transition matrices with which G_a and G_o jump from the current state to the next state; I is the column vector whose components are all 1; I^T is the transpose of I; n_a and n_o are the dimensions of the adjacency matrices; the remaining symbol [not reproduced here] denotes the PageRank distribution over the whole network at the n-th iteration.
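Because the iteration formulas themselves are not available in this text, the sketch below uses the standard damped PageRank power iteration on a weighted adjacency matrix. The damping factor 0.85, the uniform jump for nodes without outgoing edges, and the function name are assumptions, and the formulas in the original filing may differ.

```python
def pagerank(adj, damping=0.85, iters=100, tol=1e-10):
    """Standard damped PageRank by power iteration (assumed form).

    adj[i][j] is the weight of the edge i -> j, e.g. the number of papers
    co-authored by authors i and j.
    """
    n = len(adj)
    # Row-normalise the weights into a transition matrix; a node without
    # outgoing edges jumps uniformly to every node.
    trans = []
    for row in adj:
        total = sum(row)
        trans.append([w / total for w in row] if total > 0 else [1.0 / n] * n)

    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [
            (1 - damping) / n
            + damping * sum(rank[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]
        converged = sum(abs(a - b) for a, b in zip(new, rank)) < tol
        rank = new
        if converged:
            break
    return rank
```

With co-authorship counts as weights, an author who collaborates heavily with well-ranked authors receives a larger share of their rank than an occasional collaborator.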
(3) For the mixed random walk model G(AO) of heterogeneous nodes, the idea of HITS is adopted: author nodes are regarded as hubs and institution nodes as authorities, and a Co-PageRank algorithm is established to calculate the metric values of authors and institutions [the two formulas are not reproduced here].
Here the parameter λ determines the importance of the bipartite network G_AO in the assignment of the peer metric values; the influence of the hybrid network G_AO on the peer metric values can be controlled by adjusting λ. A_AO is the probability transition matrix from institutions to authors, and correspondingly A_OA is the probability transition matrix from authors to institutions; both probability transition matrices are built from the adjacency matrix of the affiliation relationships between authors and institutions. The other symbols have the same meanings as described above.
(4) The main indicators of the author popularity assessment model C(A) are the collection rate of the author's documents in the system and the number of times those documents are cited. The metric is: C(A) = f_a + r_a;
where f_a denotes the collection rate of the documents containing the particular theme published by the author, measured as the ratio of the number of times those documents are collected to the total number of times all documents on that theme are collected; r_a denotes the citation ratio of the author's documents, measured as the ratio of the number of times the author's documents containing the particular theme are cited to the total number of times all documents on that theme are collected.
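A small worked sketch of C(A) = f_a + r_a follows. It takes both ratios over the total collection count, exactly as the definitions above state; the function and parameter names are hypothetical, and returning 0 when the theme has no collections at all is an assumption.

```python
def author_popularity(author_collects, author_citations, total_collects):
    """C(A) = f_a + r_a for one author and one theme.

    author_collects  -- times the author's documents on the theme were collected
    author_citations -- times the author's documents on the theme were cited
    total_collects   -- times all documents on the theme were collected
    """
    if total_collects == 0:
        return 0.0  # assumption: no collection data means zero popularity
    f_a = author_collects / total_collects    # collection rate
    r_a = author_citations / total_collects   # citation ratio
    return f_a + r_a
```

For example, an author whose documents were collected 5 times and cited 10 times, on a theme whose documents were collected 50 times in total, has f_a = 0.1 and r_a = 0.2.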
(5) Integrate the modules above to build a comprehensive evaluation model. The final authority values of institutions and authors with respect to a particular theme T are: PR_o = λ·PR_o(G(O)) + (1 − λ)·PR_o(G(AO)) and PR_a = α·PR_a(G(AO)) + β·PR_a(G(A)) + χ·PR_a(C(A));
where α, β, χ and λ are weight factors that control the importance of each module to the final authority. The influence of each module can be adjusted by tuning these parameters. The three parameters α, β and χ satisfy α + β + χ = 1.
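The two weighted sums can be written directly as code. The function names are hypothetical and the inputs are assumed to be already computed scores from the individual modules; the only behaviour added beyond the formulas is a check that α + β + χ = 1.

```python
def institution_authority(pr_o, pr_ao, lam):
    """PR_o = lam * PR_o(G(O)) + (1 - lam) * PR_o(G(AO))."""
    return lam * pr_o + (1 - lam) * pr_ao

def author_authority(pr_ao, pr_a, popularity, alpha, beta, chi):
    """PR_a = alpha * PR_a(G(AO)) + beta * PR_a(G(A)) + chi * PR_a(C(A))."""
    if abs(alpha + beta + chi - 1.0) > 1e-9:
        raise ValueError("alpha + beta + chi must equal 1")
    return alpha * pr_ao + beta * pr_a + chi * popularity
```

Raising χ shifts the final author score toward in-system popularity, while raising β shifts it toward the co-authorship network.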
Finally, in step S5, the information resources, authoritative authors, and authoritative institutions that match the demand are selected according to the user demand model M_u and recommended to the user:
For each theme contained in M_u, put the standard descriptors under this theme and the corresponding auxiliary words into the vocabulary Dic; after all themes have been processed, Dic contains all the standard descriptors and auxiliary words in the model M_u;
For each theme T contained in M_u, obtain all documents containing this theme and put them into the set Docs; after all themes have been processed, Docs is the set of all documents that contain at least one theme in M_u;
For each document in the set Docs, count the total number of occurrences TF_dic in the document of the words in the vocabulary Dic; after all documents in Docs have been counted, sort the documents by TF_dic and recommend the top-ranked documents to the user.
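The ranking by TF_dic can be sketched as follows. The sketch assumes that Docs has already been filtered to documents containing at least one theme of M_u, that vocabulary entries are lower-case single words, and that splitting on word characters is an acceptable tokenisation; the function name and the `top_k` parameter are hypothetical.

```python
import re
from collections import Counter

def recommend_documents(vocabulary, docs, top_k=3):
    """Rank documents by TF_dic, the total occurrences of Dic words.

    vocabulary -- set of lower-case words (Dic)
    docs       -- dict mapping a document id to its text (Docs)
    """
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(re.findall(r"\w+", text.lower()))
        # Counter returns 0 for words that do not occur in the document.
        scores[doc_id] = sum(counts[word] for word in vocabulary)
    # Highest TF_dic first; ties broken by document id for determinism.
    ranked = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ranked[:top_k]
```

Multi-word descriptors would need phrase matching rather than token counting, which this sketch does not attempt.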
Preferably, the matching authoritative authors and authoritative institutions are selected according to the model M_u as follows:
For each theme contained in M_u, calculate the user's attention degree for this theme; after all themes have been processed, sort them by attention degree and take the top-ranked themes as the candidate theme set;
For each theme contained in the candidate theme set, calculate the authority of all authors and research institutions under this theme; sort by authority and recommend the top-ranked authoritative authors and research institutions to the user;
If the number of recommended resources, authors, or institutions obtained in the above process is too small, further use step S4 to find the most similar user, and use that user's demand model to make similar recommendations to the target user.
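The two sorting steps above can be sketched as below, assuming the attention degrees and the per-theme authority values have already been computed. All names and the default cut-offs are hypothetical, and the fallback to the most similar user is left out of the sketch.

```python
def top_k(scores, k):
    """Return the k keys with the highest scores (ties broken by key)."""
    return sorted(scores, key=lambda key: (-scores[key], key))[:k]

def recommend_authorities(theme_attention, authority_by_theme,
                          n_themes=2, n_per_theme=2):
    """Pick candidate themes by attention, then top authorities per theme.

    theme_attention    -- dict mapping a theme to the user's attention degree
    authority_by_theme -- dict mapping a theme to {author or institution: authority}
    """
    recommendations = {}
    for theme in top_k(theme_attention, n_themes):
        candidates = authority_by_theme.get(theme, {})
        recommendations[theme] = top_k(candidates, n_per_theme)
    return recommendations
```

A caller could check whether the returned lists are too short and, if so, repeat the call with the demand model of the most similar user.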
A person of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the methods of the above embodiments. The storage medium can be a ROM/RAM, magnetic disk, optical disc, memory card, or the like. Accordingly, corresponding to the above method, the present invention also discloses a resource recommendation system based on users' potential demand, comprising:
A preprocessing module, for clustering the resources and extracting their themes using text clustering and theme mining algorithms;
A thesaurus module, for calculating the descriptors under each theme based on the clustering result, to obtain the thesaurus of the corresponding field;
An indexing module, for automatically indexing the resources using the thesaurus and calculating the descriptors contained in each individual resource;
A computing module, for calculating the user's attention degree for a given theme by combining the user's operation records on individual resources with the user's attributes; building the user demand model and calculating the topic similarity between users; and calculating the authority of designated information with respect to a theme using the relationships among the data in the individual resources;
A recommending module, for screening the resources according to the user demand model and recommending the resources with a higher matching degree to the user.
The present invention provides a resource recommendation method and system based on users' potential demand. It exploits the close relationship between a user's potential information demand and the user's own professional field: by mining the user's potential information demand on the basis of the professional field, it can more accurately recommend to the user information resources that match the user's demand.
The above description illustrates and describes some preferred embodiments of the present invention. As stated above, however, it should be understood that the present invention is not limited to the forms disclosed herein, and the description should not be regarded as excluding other embodiments. The invention can be used in various other combinations, modifications, and environments, and can be altered within the scope of the inventive concept described herein through the above teachings or through the skill or knowledge of the relevant art. Changes and variations made by those skilled in the art that do not depart from the spirit and scope of the present invention shall all fall within the protection scope of the appended claims.