A method for mapping query keywords to related questions
Technical field
The invention belongs to the technical field of information retrieval, and in particular relates to a method for mapping query keywords to related questions.
Background art
With the development of Web 2.0, community-based question answering sites (abbreviated CQA) have become increasingly popular, and more and more people share knowledge by asking and answering questions on CQA sites. Compared with querying a search engine with keywords, a question on a CQA site expresses the user's need more explicitly. In addition, each question on a CQA site is answered by many users and one best answer is marked, so the user's information retrieval need can be better satisfied. Given that CQA exists as such a high-quality information platform, mapping the query keywords provided by a user to questions on CQA not only provides the user with answers, but also allows the user's need to be understood in depth and expressed as an explicit question to serve web search results.
When the query keywords provided by a user are mapped to questions on CQA, some of the query keywords may not all be contained in a question, so a criterion must be established for judging the relevance between query keywords and questions. Meanwhile, after multiple questions related to the query keywords have been obtained, the questions that accurately reflect the user's need should be presented. Furthermore, because many questions on CQA are similar to one another, similar questions can be grouped into the same category and a representative question selected from each category, rather than all of them being displayed, in order to meet the user's varied needs.
Summary of the invention
To solve the above technical problem, the invention provides a method for mapping query keywords to related questions. For a given information need, people can either pose a question directly or select relevant words from the question and use them as a query; these words are called topic words. By analyzing the topic words of the query keywords and of all questions, candidate questions related to the query keywords are obtained; the candidate questions are then ranked and classified to accurately obtain the questions to which the user's query keywords are mapped.
The technical solution adopted by the invention is a method for mapping query keywords to related questions, comprising the following steps:
Step 1: Crawl questions from a CQA site and record the category to which each question belongs, obtaining a question set PS consisting of N questions, denoted PS = {P_1, P_2, ..., P_N}. For each question P_j in PS, extract its noun phrases with a standard POS (part-of-speech) tagger program, then combine them with the word naming the question's category to obtain the corresponding topic-word set PTS_j. For a query consisting of n words, denoted q = {w_1, w_2, ..., w_n}, compute the topic-word score Tgrade(w_i) of each word w_i in q, and add every word whose score exceeds the threshold θ_t to the topic-word set of q, where θ_t ∈ [0,1]. If the topic-word set of a question contains the topic-word set of the query keywords, the question is added to the candidate question set CPS_q of the query keywords; otherwise the question is regarded as unrelated to the query keywords and is not considered further. The topic-word score Tgrade(w_i) of each word w_i in q is computed as:
where n is the number of words in the query q; w_i is a word in q; N is the number of questions in the question set PS; Times(w_i|PTS_j) is the number of occurrences of w_i in the topic-word set PTS_j of question P_j in PS; and ptimes(w_i|P_j) is the number of occurrences of w_i in question P_j of PS.
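The thresholding and containment test of Step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the Tgrade scores are taken as given inputs because their formula is not reproduced in this text, and the function names, question ids, scores and topic-word sets are all hypothetical.

```python
from typing import Dict, List, Set


def query_topic_words(scores: Dict[str, float], theta_t: float) -> Set[str]:
    """Keep the query words whose topic-word score exceeds theta_t."""
    return {word for word, score in scores.items() if score > theta_t}


def candidate_questions(question_topics: Dict[str, Set[str]],
                        query_topics: Set[str]) -> List[str]:
    """Return ids of questions whose topic-word set contains the query's."""
    return [qid for qid, topics in question_topics.items()
            if query_topics <= topics]  # subset test


# Hypothetical Tgrade scores and topic-word sets, for illustration only.
tgrade = {"ipod": 0.9, "downloaded": 0.2, "videos": 0.7}
q_topics = query_topic_words(tgrade, theta_t=0.5)
pts = {
    "P1": {"ipod", "videos", "itunes"},
    "P2": {"ipod", "battery"},
    "P3": {"ipod", "videos"},
}
cps_q = candidate_questions(pts, q_topics)
```

With these made-up inputs, "downloaded" falls below the threshold, so only the questions whose topic words include both "ipod" and "videos" become candidates.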
Step 2: For each question P_c in CPS_q, the higher the relevance between P_c and the query q, the more likely the question is to reflect the user's current information retrieval need accurately; the relevance between a question and the query keywords can therefore serve as an important reference for selecting the final question set. Compute the relevance between each question P_c in CPS_q and the query q, denoted Cor(P_c, q), as:
where N_c is the number of questions in the candidate question set CPS_q; n is the number of words in the query q; w_i is a word in q; ctimes(w_i|P_c) is the number of occurrences of w_i in question P_c of CPS_q; length(P_c) is the number of words in question P_c; N is the number of questions in the question set PS; ptimes(w_i|P_j) is the number of occurrences of w_i in question P_j of PS; and λ ∈ (0,1) is a given suppression factor.
Step 3: Construct a graph G whose nodes are the questions in CPS_q, each question corresponding to one node of G. Initially G contains only nodes, and no edge exists between any two nodes V_m and V_n. Then, for any two nodes V_m and V_n in G, corresponding to questions P_m and P_n in CPS_q, compute the topic-word coverage Cover(P_m, P_n). If Cover(P_m, P_n) exceeds the given threshold θ_c, an edge exists from node V_m to node V_n; otherwise no edge exists between V_m and V_n. Here θ_c ∈ [0,1]. The topic-word coverage Cover(P_m, P_n) is computed as:
where PTS_m is the topic-word set of question P_m; ||PTS_m|| denotes the number of elements in PTS_m; cos(P_m, P_n) is the cosine similarity of the two questions; and α ∈ (0,1) is a given suppression factor.
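The edge-building rule of Step 3 can be sketched as below, assuming a coverage function is supplied by the caller (the Cover formula itself is not reproduced in this text). The sketch treats edges as directed, following the wording "from node V_m to node V_n"; that reading, the function name `build_graph` and the toy coverage values are assumptions.

```python
from itertools import permutations
from typing import Callable, Dict, List, Set


def build_graph(questions: List[str],
                cover: Callable[[str, str], float],
                theta_c: float) -> Dict[str, Set[str]]:
    """Directed adjacency sets: edge m -> n when cover(m, n) > theta_c."""
    graph: Dict[str, Set[str]] = {q: set() for q in questions}  # nodes only
    for m, n in permutations(questions, 2):  # every ordered pair, m != n
        if cover(m, n) > theta_c:
            graph[m].add(n)
    return graph


# Hypothetical coverage values standing in for Cover(P_m, P_n).
toy_cover = {("P1", "P3"): 0.8, ("P3", "P1"): 0.4, ("P1", "P2"): 0.1,
             ("P2", "P1"): 0.6, ("P2", "P3"): 0.2, ("P3", "P2"): 0.3}
g = build_graph(["P1", "P2", "P3"], lambda m, n: toy_cover[(m, n)], 0.5)
```

Because the coverage need not be symmetric, an edge from P1 to P3 can exist without the reverse edge, as in this toy input.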
Step 4: For each question P_c in CPS_q, the more often it is visited, the more popular the question is and the more likely it is to be the question corresponding to the current keyword query; the popularity of a question therefore serves as an important reference for selecting the final question set. Let Wel(P_c) denote the popularity of each question P_c in CPS_q; Wel(P_c) is computed as:
where N_c is the number of questions in the candidate question set CPS_q; adj(P_c) denotes the set of nodes connected to question P_c in graph G; v is a node in adj(P_c); Deg(v) is the degree of node v; and d ∈ (0,1) is a given suppression factor.
Step 5: For each question P_c in CPS_q, combine its popularity with its relevance to the query keywords to compute the comprehensive score Grade(P_c) of the question, and sort the questions in CPS_q in descending order of comprehensive score to obtain the sorted question set RP. The comprehensive score Grade(P_c) is computed as:
Grade(P_c) = log(Cor(P_c, q)) + log(Wel(P_c)), (c = 1, 2, ..., N_c)
where N_c is the number of questions in the candidate question set CPS_q; Cor(P_c, q) is the relevance between question P_c and the query q; and Wel(P_c) is the popularity of question P_c.
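The scoring and sorting of Step 5 can be sketched as follows, assuming the relevance and popularity values have already been computed and are strictly positive. The function name `rank_by_grade` and the numeric values are hypothetical; the natural logarithm is used here, and since the text does not state the base of log, that choice is an assumption that does not affect the ordering.

```python
import math
from typing import Dict, List


def rank_by_grade(cor: Dict[str, float], wel: Dict[str, float]) -> List[str]:
    """Sort question ids by log(Cor) + log(Wel), highest first."""
    grade = {qid: math.log(cor[qid]) + math.log(wel[qid]) for qid in cor}
    return sorted(grade, key=grade.get, reverse=True)


# Hypothetical relevance and popularity values, for illustration only.
cor = {"P1": 0.30, "P2": 0.10, "P3": 0.25}
wel = {"P1": 0.20, "P2": 0.50, "P3": 0.30}
rp = rank_by_grade(cor, wel)
```

Summing the logarithms ranks questions by the product Cor × Wel, so a question needs both reasonable relevance and reasonable popularity to rank highly.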
Step 6: Initialize an empty set FP and add the first question in RP to FP. Then take each remaining question P_r in RP in turn and compute the cosine similarity csim between P_r and each question in FP, recording the maximum cosine similarity maxcsim and the corresponding question P_f in FP, with the score Grade(P_r) being added to Grade(P_f). If maxcsim is less than the given threshold θ_s, where θ_s ∈ [0,1], P_r is added to FP; if maxcsim is greater than θ_s, P_r is regarded as similar to P_f, and the number N_fq of questions similar to P_f is recorded.
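The greedy pass of Step 6 can be sketched as below. Several points are assumptions rather than statements of the patented method: cosine similarity is computed over raw word counts; the wording of the step can be read as merging scores in every case or only for similar questions, and the sketch assumes the latter; a similarity exactly equal to θ_s is treated as similar. The function names and sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def cosine(a: str, b: str) -> float:
    """Cosine similarity of two texts over raw word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def select_representatives(
        rp: List[str], text: Dict[str, str],
        grade: Dict[str, float], theta_s: float
) -> Tuple[List[str], Dict[str, float], Dict[str, int]]:
    """Greedy pass over RP (non-empty, already sorted by grade)."""
    fp = [rp[0]]
    merged = dict(grade)       # copy of the grades, updated for FP members
    similar = {rp[0]: 0}       # N_fq: number of similar questions per member
    for pr in rp[1:]:
        pf = max(fp, key=lambda f: cosine(text[pr], text[f]))
        maxcsim = cosine(text[pr], text[pf])
        if maxcsim < theta_s:
            fp.append(pr)      # dissimilar to everything kept so far
            similar[pr] = 0
        else:                  # assumption: ties count as similar
            merged[pf] += grade[pr]   # assumption: merge only when similar
            similar[pf] += 1
    return fp, merged, similar


# Hypothetical question texts and grades, for illustration only.
text = {"P1": "how to download videos to ipod",
        "P2": "how do i download videos to my ipod",
        "P3": "ipod battery replacement cost"}
grade = {"P1": -2.0, "P2": -2.5, "P3": -3.0}
fp, merged, similar = select_representatives(["P1", "P2", "P3"],
                                             text, grade, theta_s=0.5)
```

In this toy run the second question is folded into the first, while the battery question is kept as a separate representative.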
Step 7: Update the comprehensive score of each question P_t in the set FP, sort the questions in FP in descending order of updated score, and return the sorted set FP. The score is updated by the formula:
where Grade(P_t)_Old is the score of question P_t in FP before the update; N_tq is the number of questions similar to P_t; and Grade(P_t)_New is the score of question P_t in FP after the update.
Preferably, in step 1, θ_t ∈ [0.3,0.9].
Preferably, in step 2, λ ∈ (0.2,0.9).
Preferably, in step 3, α ∈ (0.1,1), θ_c ∈ [0.3,0.9].
Preferably, in step 4, d ∈ (0.1,1).
Preferably, in step 6, θ_s ∈ [0.2,0.9].
The invention maps a user's query keywords to explicit questions, so that the high-quality information on CQA can be integrated into search engines and the questions and answers related to the user's query keywords can be obtained directly, giving a fuller understanding of the user's need and a better search experience.
Brief description of the drawings
Fig. 1: Flow chart of the embodiment of the invention.
Fig. 2: The three top-scoring questions in the set RP in the embodiment of the invention.
Fig. 3: The three top-scoring questions in the sorted set FP in the embodiment of the invention.
Fig. 4: The top three results obtained by submitting the query keywords to a search engine in the embodiment of the invention.
Detailed description of the invention
To help those of ordinary skill in the art understand and implement the invention, the invention is described in further detail below with reference to the accompanying drawings and an embodiment. It should be understood that the embodiment described here serves only to illustrate and explain the invention and is not intended to limit it.
The invention provides a method for mapping query keywords to related questions: for given query keywords, the questions related to them are mapped.
Referring to Fig. 1, the invention comprises the following steps:
Step 1: Select the "iPod" category on Yahoo! Answers and crawl its questions to obtain the question set PS, denoted PS = {P_1, P_2, ..., P_N}. For each question P_j in PS, extract its noun phrases with a standard POS (part-of-speech) tagger program, then combine them with "iPod" to obtain the corresponding topic-word set PTS_j.
Given the query q = "iPod downloaded videos", compute the topic-word score Tgrade(w_i) of each word w_i in q, and add every word whose score exceeds the threshold θ_t to the topic-word set of q, where θ_t ∈ [0,1]. If the topic-word set of a question contains the topic-word set of the query keywords, the question is added to the candidate question set CPS_q of the query keywords; otherwise the question is regarded as unrelated to the query keywords and is not considered further. The topic-word score Tgrade(w_i) of each word w_i in q is computed as:
where n is the number of words in the query q; w_i is a word in q; N is the number of questions in the question set PS; Times(w_i|PTS_j) is the number of occurrences of w_i in the topic-word set PTS_j of question P_j in PS; and ptimes(w_i|P_j) is the number of occurrences of w_i in question P_j of PS.
Step 2: For each question P_c in CPS_q, the higher the relevance between P_c and the query q, the more likely the question is to reflect the user's current information retrieval need accurately; the relevance between a question and the query keywords can therefore serve as an important reference for selecting the final question set. Compute the relevance between each question P_c in CPS_q and the query q, denoted Cor(P_c, q), as:
where N_c is the number of questions in the candidate question set CPS_q; n is the number of words in the query q; w_i is a word in q; ctimes(w_i|P_c) is the number of occurrences of w_i in question P_c of CPS_q; length(P_c) is the number of words in question P_c; N is the number of questions in the question set PS; ptimes(w_i|P_j) is the number of occurrences of w_i in question P_j of PS; and λ ∈ (0,1) is a given suppression factor.
Step 3: Construct a graph G whose nodes are the questions in CPS_q, each question corresponding to one node of G. Initially G contains only nodes, and no edge exists between any two nodes V_m and V_n. Then, for any two nodes V_m and V_n in G, corresponding to questions P_m and P_n in CPS_q, compute the topic-word coverage Cover(P_m, P_n). If Cover(P_m, P_n) exceeds the given threshold θ_c, an edge exists from node V_m to node V_n; otherwise no edge exists between V_m and V_n. Here θ_c ∈ [0,1]. The topic-word coverage Cover(P_m, P_n) is computed as:
where PTS_m is the topic-word set of question P_m; ||PTS_m|| denotes the number of elements in PTS_m; cos(P_m, P_n) is the cosine similarity of the two questions; and α ∈ (0,1) is a given suppression factor.
Step 4: For each question P_c in CPS_q, the more often it is visited, the more popular the question is and the more likely it is to be the question corresponding to the current keyword query; the popularity of a question therefore serves as an important reference for selecting the final question set. Let Wel(P_c) denote the popularity of each question P_c in CPS_q; Wel(P_c) is computed as:
where N_c is the number of questions in the candidate question set CPS_q; adj(P_c) denotes the set of nodes connected to question P_c in graph G; v is a node in adj(P_c); Deg(v) is the degree of node v; and d ∈ (0,1) is a given suppression factor.
Step 5: For each question P_c in CPS_q, combine its popularity with its relevance to the query keywords to compute the comprehensive score Grade(P_c) of the question, and sort the questions in CPS_q in descending order of comprehensive score to obtain the sorted question set RP. The three top-scoring questions in RP are shown in Fig. 2. The comprehensive score Grade(P_c) is computed as:
Grade(P_c) = log(Cor(P_c, q)) + log(Wel(P_c)), (c = 1, 2, ..., N_c)
where N_c is the number of questions in the candidate question set CPS_q; Cor(P_c, q) is the relevance between question P_c and the query q; and Wel(P_c) is the popularity of question P_c.
Step 6: Initialize an empty set FP and add the first question in RP to FP. Then take each remaining question P_r in RP in turn and compute the cosine similarity csim between P_r and each question in FP, recording the maximum cosine similarity maxcsim and the corresponding question P_f in FP, with the score Grade(P_r) being added to Grade(P_f). If maxcsim is less than the given threshold θ_s, where θ_s ∈ [0,1], P_r is added to FP; if maxcsim is greater than θ_s, P_r is regarded as similar to P_f, and the number N_fq of questions similar to P_f is recorded.
Step 7: Update the comprehensive score of each question P_t in the set FP, and sort the questions in FP in descending order of updated score. The three top-scoring questions in the sorted set FP are shown in Fig. 3, and the top three results obtained by submitting the query keywords to a search engine are shown in Fig. 4. Return the sorted set FP. The score is updated by the formula:
where Grade(P_t)_Old is the score of question P_t in FP before the update; N_tq is the number of questions similar to P_t; and Grade(P_t)_New is the score of question P_t in FP after the update.
The invention maps a user's query keywords to explicit questions, which not only provides the user with answers but also allows the user's need to be understood in depth and expressed as an explicit question to serve web search results, thereby giving a better search experience.
In step 1, θ_t ∈ [0.3,0.9].
In step 2, λ ∈ (0.2,0.9).
In step 3, α ∈ (0.1,1), θ_c ∈ [0.3,0.9].
In step 4, d ∈ (0.1,1).
In step 6, θ_s ∈ [0.2,0.9].
It should be understood that the parts not elaborated in this specification belong to the prior art.
It should be understood that the above description of the preferred embodiment is relatively detailed and should not therefore be regarded as limiting the scope of patent protection of the invention. Those of ordinary skill in the art, guided by the invention and without departing from the scope protected by its claims, may also make substitutions or modifications, all of which fall within the protection scope of the invention; the scope of protection claimed shall be determined by the appended claims.