CN106294656A

CN106294656A - A method for mapping query keywords to relevant questions

Info

Publication number: CN106294656A
Application number: CN201610631777.7A
Authority: CN
Inventors: 黄浩; 颜钱; 李宗鹏
Original assignee: Wuhan University WHU
Current assignee: Nanjing Yuanfeng Intelligent Technology Co.,Ltd.
Priority date: 2016-08-04
Filing date: 2016-08-04
Publication date: 2017-01-04
Anticipated expiration: 2036-08-04
Also published as: CN106294656B

Abstract

The invention discloses a method for mapping query keywords to relevant questions. Question information is first crawled; the topic words of the query keywords and of the questions are then extracted, and a candidate question set CPS_q is selected. For each question in CPS_q, its degree of relevance to the query keywords is calculated, a comprehensive score is calculated by combining the degree of relevance with the popularity, and the questions in CPS_q are sorted in descending order of score to obtain the set RP. Next, by calculating the cosine similarity between the questions in RP, representative questions are selected from each group of similar questions to form the set FP. Finally, the comprehensive score of each question in FP is updated, the questions in FP are sorted in descending order of score, and the sorted set FP is returned as the questions relevant to the query keywords. The invention can directly obtain the questions and answers relevant to the user's query keywords, thereby understanding the user's needs more fully and providing a better search experience.

Description

A method for mapping query keywords to relevant questions

Technical field

The invention belongs to the technical field of information retrieval, and in particular relates to a method for mapping query keywords to relevant questions.

Background art

With the development of Web 2.0, community-based question answering sites (abbreviated CQA) have become increasingly popular, and more and more people share knowledge by asking and answering questions on CQA. Compared with querying a search engine by keywords, a question on CQA expresses the user's need more explicitly; moreover, each question on CQA is backed by answers from many users, one of which is marked as the best answer, so the user's information retrieval needs can be better met. Given a high-quality information platform such as CQA, mapping the query keywords provided by a user to questions on CQA not only provides the user with answers but also gives a deeper understanding of the user's needs, and the resulting explicit questions can be used to serve web search results.

When the query keywords provided by a user are mapped to questions on CQA, some query keywords may not all be contained in a question, so a criterion must be established for judging the relevance between query keywords and questions. In addition, once multiple questions relevant to the query keywords have been obtained, the questions that accurately reflect the user's needs should be presented. Furthermore, because many questions on CQA are similar to one another, similar questions can be grouped into one category in order to satisfy the user's diverse needs, and a representative question can be selected from each category rather than showing all of them.

Summary of the invention

To solve the above technical problem, the invention provides a method for mapping query keywords to relevant questions. For a given information need, a person can either pose a question directly or select relevant words from the question to query with; these words are called topic words. By analysing the topic words of the query keywords and of all questions, the candidate questions relevant to the query keywords are obtained; the candidate questions are then ranked and classified to accurately obtain the questions to which the user's query keywords map.

The technical solution adopted by the invention is a method for mapping query keywords to relevant questions, comprising the following steps:

Step 1: Crawl questions from CQA and record the category of each question, obtaining a question set PS consisting of N questions, denoted PS = {P₁, P₂, ..., P_N}. For each question P_j in PS, extract its noun phrases with a standard POS tagger program, then combine them with the words of its category to obtain the corresponding topic word set PTS_j. For a query keyword q consisting of n words, denoted q = {w₁, w₂, ..., w_n}, calculate the topic word score Tgrade(w_i) of each word w_i in q, and add the words whose score exceeds the threshold θ_t, θ_t ∈ [0,1], to the topic word set corresponding to q. If the topic word set of a question contains the topic word set of the query keyword, add the question to the candidate question set CPS_q of the query keyword; otherwise treat the question as unrelated to the query keyword and disregard it. The topic word score Tgrade(w_i) of each word w_i in q is calculated as:

\mathrm{Tgrade}(w_i) = \frac{\sum_{j=1}^{N} \mathrm{Times}(w_i \mid PTS_j)}{\sum_{j=1}^{N} \mathrm{ptimes}(w_i \mid P_j)}, \quad (i = 1, 2, \ldots, n)

where n is the number of words in the query keyword q; w_i is a word in q; N is the number of questions in the question set PS; Times(w_i|PTS_j) is the number of occurrences of w_i in the topic word set PTS_j corresponding to question P_j of PS; and ptimes(w_i|P_j) is the number of occurrences of w_i in question P_j of PS.
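The topic word score and the candidate selection of step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `tgrade` and `candidate_questions`, the default threshold, and the sample data are assumptions, questions are assumed to be already tokenized, and each PTS_j is represented as a Python set, so Times(w_i|PTS_j) is either 0 or 1.

```python
def tgrade(word, questions, topic_sets):
    """Tgrade(w) = sum_j Times(w | PTS_j) / sum_j ptimes(w | P_j)."""
    # Times: occurrences of the word in each topic word set (0 or 1 for a set).
    in_topics = sum(1 for pts in topic_sets if word in pts)
    # ptimes: occurrences of the word in each tokenized question.
    in_questions = sum(tokens.count(word) for tokens in questions)
    return in_topics / in_questions if in_questions else 0.0


def candidate_questions(query, questions, topic_sets, theta_t=0.5):
    """Return (indices of candidate questions, topic word set of the query)."""
    # Query words whose score exceeds theta_t become the query's topic words.
    query_topics = {w for w in query if tgrade(w, questions, topic_sets) > theta_t}
    # A question is a candidate if its topic word set contains the query's.
    candidates = [j for j, pts in enumerate(topic_sets) if query_topics <= pts]
    return candidates, query_topics


# Hypothetical sample data (tokenized questions and their topic word sets).
questions = [
    ["how", "to", "download", "videos", "to", "ipod"],
    ["ipod", "battery", "life"],
    ["download", "videos", "ipod", "free"],
]
topic_sets = [{"ipod", "videos"}, {"ipod", "battery", "life"}, {"ipod", "videos"}]
print(candidate_questions(["ipod", "download", "videos"], questions, topic_sets))
```

In this sketch, a query whose words all score at or below θ_t has an empty topic word set, so every question becomes a candidate; the source does not say how that case should be handled.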

Step 2: For each question P_c in CPS_q, the higher the degree of relevance between P_c and the query keyword q, the more likely P_c accurately reflects the user's current information retrieval need; the degree of relevance between a question and the query keyword can therefore serve as an important reference for selecting the final question set. Calculate the degree of relevance between each question P_c in CPS_q and the query keyword q, denoted Cor(P_c, q), as follows:

\mathrm{Cor}(P_c, q) = \prod_{i=1}^{n} \left( \lambda \times \frac{\mathrm{ctimes}(w_i \mid P_c)}{\mathrm{length}(P_c)} + (1 - \lambda) \frac{\sum_{j=1}^{N} \mathrm{ptimes}(w_i \mid P_j)}{\sum_{k=1}^{n} \sum_{j=1}^{N} \mathrm{ptimes}(w_k \mid P_j)} \right), \quad (c = 1, 2, \ldots, N_c)

where N_c is the number of questions in the candidate question set CPS_q; n is the number of words in the query keyword q; w_i is a word in q; ctimes(w_i|P_c) is the number of occurrences of w_i in question P_c of CPS_q; length(P_c) is the number of words in P_c; N is the number of questions in the question set PS; ptimes(w_i|P_j) is the number of occurrences of w_i in question P_j of PS; and λ, λ ∈ (0,1), is a given damping factor.
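As a worked illustration of the relevance formula, the sketch below multiplies, over the query words, a λ-weighted mix of the word's frequency in the candidate question and its share of all query-word occurrences in PS. The function name `cor`, the default λ = 0.5, and the sample data are assumptions made for illustration only.

```python
def cor(query, candidate, questions, lam=0.5):
    """Cor(P_c, q): product over query words of
    lam * ctimes(w|P_c)/length(P_c) + (1 - lam) * (corpus share of w)."""
    # sum_j ptimes(w_i | P_j) for each query word.
    totals = {w: sum(tokens.count(w) for tokens in questions) for w in query}
    # sum_k sum_j ptimes(w_k | P_j), summed over the query words.
    grand_total = sum(totals[w] for w in query)
    score = 1.0
    for w in query:
        local = candidate.count(w) / len(candidate)
        background = totals[w] / grand_total if grand_total else 0.0
        score *= lam * local + (1 - lam) * background
    return score


# Hypothetical tokenized question set PS and query keyword q.
questions = [
    ["how", "to", "download", "videos", "to", "ipod"],
    ["ipod", "battery", "life"],
    ["download", "videos", "ipod", "free"],
]
query = ["ipod", "download", "videos"]
for c in (0, 2):
    print(c, cor(query, questions[c], questions))
```

With this data the shorter question (index 2) scores higher than the longer one (index 0), because the same query words make up a larger fraction of its length.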

Step 3: Construct a graph G with the questions in CPS_q as nodes, each question corresponding to one node of G. Initially G contains only nodes, and there is no edge between any two nodes V_m and V_n. Then, for any two nodes V_m and V_n in G, corresponding to questions P_m and P_n in CPS_q, calculate the topic word coverage Cover(P_m, P_n) of P_m and P_n. If Cover(P_m, P_n) exceeds the given threshold θ_c, θ_c ∈ [0,1], there is an edge from node V_m to node V_n; otherwise there is no edge between V_m and V_n. The topic word coverage Cover(P_m, P_n) is calculated as:

where PTS_m is the topic word set of question P_m; ||PTS_m|| denotes the number of elements in PTS_m; cos(P_m, P_n) is the cosine similarity of the two questions; and α, α ∈ (0,1), is a given damping factor.

Step 4: For each question P_c in CPS_q, the more times it has been accessed, the more popular it is and the more likely it is the question corresponding to the current keyword query; the popularity of a question is therefore used as an important reference for selecting the final question set. Let Wel(P_c) denote the popularity of each question P_c in CPS_q; Wel(P_c) is calculated as:

\mathrm{Wel}(P_c) = \frac{1}{N_c} + d \sum_{v \in adj(P_c)} \frac{\mathrm{Wel}(v)}{\deg(v)}, \quad (c = 1, 2, \ldots, N_c)

where N_c is the number of questions in the candidate question set CPS_q; adj(P_c) is the set of nodes connected to question P_c in graph G; v is a node in adj(P_c); deg(v) is the degree of node v; and d, d ∈ (0,1), is a given damping factor.
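Because Wel(P_c) is defined in terms of the popularity of its neighbours, one way to evaluate it is fixed-point iteration, in the manner of PageRank. The sketch below is an assumption-laden illustration: the source does not state how the recursion is solved, the graph is treated here as undirected with a symmetric adjacency list, and the name `popularity`, the default d = 0.85, the tolerance, and the sample graph are all hypothetical.

```python
def popularity(adj, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate Wel(c) = 1/N_c + d * sum_{v in adj(c)} Wel(v)/deg(v) until stable.

    adj maps each node to the list of its neighbours (assumed symmetric).
    """
    n = len(adj)
    wel = {c: 1.0 / n for c in adj}  # assumed starting values
    for _ in range(max_iter):
        new = {
            c: 1.0 / n + d * sum(wel[v] / len(adj[v]) for v in adj[c])
            for c in adj
        }
        delta = max(abs(new[c] - wel[c]) for c in adj)
        wel = new
        if delta < tol:  # stop once no score moves by more than tol
            break
    return wel


# Hypothetical graph: question 0 is linked to 1 and 2; question 3 is isolated.
graph = {0: [1, 2], 1: [0], 2: [0], 3: []}
print(popularity(graph, d=0.5))
```

For d < 1 the update is a contraction on such a graph, so the iteration settles; an isolated node keeps the base value 1/N_c.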

Step 5: For each question P_c in CPS_q, combine its popularity with its degree of relevance to the query keyword to calculate the comprehensive score Grade(P_c) of the question, and sort the questions in CPS_q in descending order of comprehensive score to obtain the sorted question set RP. The comprehensive score Grade(P_c) is calculated as:

Grade(P_c) = log(Cor(P_c, q)) + log(Wel(P_c)), (c = 1, 2, ..., N_c)

where N_c is the number of questions in the candidate question set CPS_q; Cor(P_c, q) is the degree of relevance between question P_c and the query keyword q; and Wel(P_c) is the popularity of question P_c.
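The scoring and sorting of step 5 amount to a few lines. In this sketch the function name `rank_candidates` and the sample scores are assumptions, and both Cor and Wel are assumed to be strictly positive so that the logarithm is defined.

```python
import math


def rank_candidates(cor_scores, wel_scores):
    """Return (RP, grades): question ids sorted by Grade = log(Cor) + log(Wel)."""
    grades = {
        c: math.log(cor_scores[c]) + math.log(wel_scores[c])
        for c in cor_scores
    }
    # Descending order of comprehensive score gives the ranked set RP.
    rp = sorted(grades, key=grades.get, reverse=True)
    return rp, grades


# Hypothetical relevance and popularity values for two candidate questions.
rp, grades = rank_candidates({"a": 0.02, "b": 0.01}, {"a": 0.3, "b": 0.9})
print(rp, grades)
```

Since log(Cor) + log(Wel) = log(Cor × Wel), the ranking is the same as ranking by the product of the two values; here "b" ranks above "a" because 0.01 × 0.9 exceeds 0.02 × 0.3.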

Step 6: Initialize an empty set FP and add the first question in RP to FP. Then take each remaining question P_r in RP in turn, calculate the cosine similarity csim between P_r and each question in FP, and record the maximum cosine similarity maxcsim and the corresponding question P_f in FP. Add the score Grade(P_r) of P_r to Grade(P_f); at the same time, if maxcsim is less than the given threshold θ_s, θ_s ∈ [0,1], add P_r to FP; if maxcsim is greater than θ_s, consider P_r similar to P_f and record the number N_fq of questions similar to P_f.

Step 7: Update the comprehensive score of each question P_t in the set FP, sort the questions in FP in descending order of the updated score, and return the sorted set FP. The score is updated by the formula:

\mathrm{Grade}(P_t)_{New} = \frac{\mathrm{Grade}(P_t)_{Old}}{N_{tq}}

where Grade(P_t)_Old is the score of question P_t in FP before the update; N_tq is the number of questions similar to P_t; and Grade(P_t)_New is the score of question P_t in FP after the update.
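Steps 6 and 7 together can be sketched as a greedy pass over RP followed by a score update. Several points here are assumptions rather than statements of the source: this sketch accumulates Grade(P_r) into Grade(P_f) only when P_r is judged similar to P_f (the wording of step 6 could also be read as accumulating unconditionally), it counts a representative as similar to itself so that N_tq is at least 1, it treats maxcsim equal to θ_s as similar, and it computes cosine similarity on raw term counts. The names `cosine` and `select_representatives` and the sample data are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two token lists using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def select_representatives(rp, grades, tokens, theta_s=0.5):
    """Return (FP sorted by updated score, updated scores)."""
    if not rp:
        return [], {}
    fp = [rp[0]]
    total = {rp[0]: grades[rp[0]]}  # accumulated Grade(P_f)_Old
    count = {rp[0]: 1}              # N_fq, counting the representative itself
    for r in rp[1:]:
        # Most similar question already in FP, and its similarity maxcsim.
        best = max(fp, key=lambda f: cosine(tokens[r], tokens[f]))
        maxcsim = cosine(tokens[r], tokens[best])
        if maxcsim < theta_s:
            fp.append(r)            # new representative question
            total[r] = grades[r]
            count[r] = 1
        else:
            total[best] += grades[r]  # fold P_r into its representative
            count[best] += 1
    updated = {f: total[f] / count[f] for f in fp}  # Grade_New = Grade_Old / N
    return sorted(fp, key=updated.get, reverse=True), updated


# Hypothetical ranked set RP with scores and tokenized question text.
tokens = {
    "a": ["download", "videos", "ipod"],
    "b": ["download", "videos", "to", "ipod"],
    "c": ["ipod", "battery", "life"],
}
grades = {"a": -3.0, "b": -4.0, "c": -5.0}
print(select_representatives(["a", "b", "c"], grades, tokens))
```

Under these assumptions the updated score of a representative is the average of the scores in its group of similar questions, which is one plausible reading of dividing the accumulated score by N_tq.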

Preferably, in step 1, θ_t ∈ [0.3, 0.9].

Preferably, in step 2, λ ∈ (0.2, 0.9).

Preferably, in step 3, α ∈ (0.1, 1) and θ_c ∈ [0.3, 0.9].

Preferably, in step 4, d ∈ (0.1, 1).

Preferably, in step 6, θ_s ∈ [0.2, 0.9].

The invention maps the user's query keywords to explicit questions, so that the high-quality information on CQA can be integrated into a search engine and the questions and answers relevant to the user's query keywords can be obtained directly, thereby understanding the user's needs more fully and providing a better search experience.

Brief description of the drawings

Fig. 1: Flow chart of the embodiment of the invention.

Fig. 2: The three top-scoring questions in the set RP in the embodiment of the invention.

Fig. 3: The three top-scoring questions in the sorted set FP in the embodiment of the invention.

Fig. 4: The three top-ranked results obtained by querying a search engine with the query keyword in the embodiment of the invention.

Detailed description of the invention

To help those of ordinary skill in the art understand and implement the invention, it is described in further detail below with reference to the drawings and an embodiment. It should be understood that the embodiment described here serves only to illustrate and explain the invention and is not intended to limit it.

The invention provides a method for mapping query keywords to relevant questions: for a given query keyword, the questions relevant to it are mapped.

Referring to Fig. 1, the invention comprises the following steps:

Step 1: Select the "iPod" category on Yahoo! Answers and crawl its questions, obtaining the question set PS, denoted PS = {P₁, P₂, ..., P_N}. For each question P_j in PS, extract its noun phrases with a standard POS tagger program, then combine them with "iPod" to obtain the corresponding topic word set PTS_j.

Given the query keyword q = "iPod downloaded videos", calculate the topic word score Tgrade(w_i) of each word w_i in q, and add the words whose score exceeds the threshold θ_t, θ_t ∈ [0,1], to the topic word set corresponding to q. If the topic word set of a question contains the topic word set of the query keyword, add the question to the candidate question set CPS_q of the query keyword; otherwise treat the question as unrelated to the query keyword and disregard it. The topic word score Tgrade(w_i) of each word w_i in q is calculated as:

\mathrm{Tgrade}(w_i) = \frac{\sum_{j=1}^{N} \mathrm{Times}(w_i \mid PTS_j)}{\sum_{j=1}^{N} \mathrm{ptimes}(w_i \mid P_j)}, \quad (i = 1, 2, \ldots, n)

\mathrm{Cor}(P_c, q) = \prod_{i=1}^{n} \left( \lambda \times \frac{\mathrm{ctimes}(w_i \mid P_c)}{\mathrm{length}(P_c)} + (1 - \lambda) \frac{\sum_{j=1}^{N} \mathrm{ptimes}(w_i \mid P_j)}{\sum_{k=1}^{n} \sum_{j=1}^{N} \mathrm{ptimes}(w_k \mid P_j)} \right), \quad (c = 1, 2, \ldots, N_c)

\mathrm{Wel}(P_c) = \frac{1}{N_c} + d \sum_{v \in adj(P_c)} \frac{\mathrm{Wel}(v)}{\deg(v)}, \quad (c = 1, 2, \ldots, N_c)

Step 5: For each question P_c in CPS_q, combine its popularity with its degree of relevance to the query keyword to calculate the comprehensive score Grade(P_c) of the question, and sort the questions in CPS_q in descending order of comprehensive score to obtain the sorted question set RP. The three top-scoring questions in RP are shown in Fig. 2. The comprehensive score Grade(P_c) is calculated as:

Grade(P_c) = log(Cor(P_c, q)) + log(Wel(P_c)), (c = 1, 2, ..., N_c)

Step 7: Update the comprehensive score of each question P_t in the set FP, sort the questions in FP in descending order of the updated score, and return the sorted set FP. The three questions with the highest comprehensive scores in the sorted set FP are shown in Fig. 3. The three top-ranked results obtained by querying a search engine with the query keyword are shown in Fig. 4. The score is updated by the formula:

\mathrm{Grade}(P_t)_{New} = \frac{\mathrm{Grade}(P_t)_{Old}}{N_{tq}}

The invention maps the user's query keywords to explicit questions, which not only provides the user with answers but also gives a deeper understanding of the user's needs; the explicit questions can be used to serve web search results, thereby providing a better search experience.

In step 1, θ_t ∈ [0.3, 0.9].

In step 2, λ ∈ (0.2, 0.9).

In step 3, α ∈ (0.1, 1) and θ_c ∈ [0.3, 0.9].

In step 4, d ∈ (0.1, 1).

In step 6, θ_s ∈ [0.2, 0.9].

It should be understood that the parts not elaborated in this specification belong to the prior art.

It should be understood that the above description of the preferred embodiment is relatively detailed and should not therefore be regarded as limiting the scope of patent protection of the invention. Those of ordinary skill in the art, under the teaching of the invention and without departing from the scope protected by the claims, may also make substitutions or variations, all of which fall within the protection scope of the invention. The scope of protection claimed by the invention shall be determined by the appended claims.

Claims

1. A method for mapping query keywords to relevant questions, characterized by comprising the following steps:

Step 1: Crawl questions from CQA and record the category of each question, obtaining a question set PS, denoted PS = {P₁, P₂, ..., P_N}. For each question P_i in PS, extract its noun phrases with a standard POS tagger program, then combine them with the words of its category to obtain the corresponding topic word set PTS_i. For a query keyword q consisting of n words, denoted q = {w₁, w₂, ..., w_n}, calculate the topic word score Tgrade(w_i) of each word w_i in q, and add the words whose score exceeds the threshold θ_t (θ_t ∈ [0,1]) to the topic word set corresponding to q. If the topic word set of a question contains the topic word set of the query keyword, add the question to the candidate question set CPS_q of the query keyword. The topic word score Tgrade(w_i) of each word w_i in q is calculated as:

\mathrm{Tgrade}(w_i) = \frac{\sum_{j=1}^{N} \mathrm{Times}(w_i \mid PTS_j)}{\sum_{j=1}^{N} \mathrm{times}(w_i \mid P_j)}, \quad (i = 1, 2, \ldots, n)

where n is the number of words in q; w_i is a word in q; N is the number of questions in PS; Times(w_i|PTS_j) is the number of occurrences of w_i in PTS_j; and times(w_i|P_j) is the number of occurrences of w_i in P_j;

Step 2: For each question P_c in CPS_q, the higher the degree of relevance between P_c and the query keyword q, the more likely P_c accurately reflects the user's current information retrieval need. Let Cor(P_c, q) denote the degree of relevance between P_c and the query keyword q; Cor(P_c, q) is calculated as:

\mathrm{Cor}(P_c, q) = \prod_{i=1}^{n} \left( \lambda \times \frac{\mathrm{times}(w_i \mid P_c)}{\mathrm{length}(P_c)} + (1 - \lambda) \frac{\sum_{j=1}^{N} \mathrm{times}(w_i \mid P_j)}{\sum_{k=1}^{n} \sum_{j=1}^{N} \mathrm{times}(w_k \mid P_j)} \right), \quad (c = 1, 2, \ldots, N_c)

where N_c is the number of questions in CPS_q; n is the number of words in q; w_i is a word in q; times(w_i|P_c) is the number of occurrences of w_i in P_c; length(P_c) is the number of words in P_c; N is the number of questions in PS; and λ (λ ∈ (0,1)) is a given damping factor;

Step 3: Construct a graph G with each question in CPS_q as a node of G, then calculate the topic word coverage Cover(P_i, P_j) of any two questions P_i and P_j in CPS_q. If Cover(P_i, P_j) exceeds the given threshold θ_c (θ_c ∈ [0,1]), there is an edge from P_i to P_j. The topic word coverage Cover(P_i, P_j) is calculated as:

where PTS_i is the topic word set of question P_i; ||PTS_i|| denotes the number of elements in PTS_i; cos(P_i, P_j) is the cosine similarity of the two questions; and α (α ∈ (0,1)) is a given damping factor;

Step 4: For each question P_c in CPS_q, the more times it has been accessed, the more popular it is and the more likely it is the question corresponding to the current keyword query. Let Wel(P_c) denote the popularity of P_c; Wel(P_c) is calculated as:

\mathrm{Wel}(P_c) = \frac{1}{N_c} + d \sum_{v \in adj(P_c)} \frac{\mathrm{Wel}(v)}{\deg(v)}, \quad (c = 1, 2, \ldots, N_c)

where N_c is the number of questions in CPS_q; adj(P_c) is the set of nodes connected to P_c in graph G; v is a node in adj(P_c); deg(v) is the degree of node v; and d (d ∈ (0,1)) is a given damping factor;

Step 5: For each question P_c in CPS_q, combine its popularity with its degree of relevance to the query keyword to calculate the comprehensive score Grade(P_c) of the question, and sort the questions in CPS_q in descending order of comprehensive score to obtain the sorted question set RP. The comprehensive score Grade(P_c) is calculated as:

Grade(P_c) = log(Cor(P_c|q)) + log(Wel(P_c))

where Cor(P_c|q) is the degree of relevance between P_c and q, and Wel(P_c) is the popularity of P_c;

Step 6: Initialize an empty set FP and add the first question in RP to FP. Then take each remaining question P_r in RP in turn, calculate the cosine similarity csim between P_r and each question in FP, and record the maximum cosine similarity maxcsim and the corresponding question P_f in FP. Add the score Grade(P_r) of P_r to Grade(P_f); at the same time, if maxcsim is less than the given threshold θ_s (θ_s ∈ [0,1]), add P_r to FP; otherwise consider P_r similar to P_f and record the number N_fq of questions similar to P_f;

Step 7: Update the comprehensive score of each question in the set FP, sort the questions in FP in descending order of the updated score, and return the sorted set FP. The score is updated by the formula:

\mathrm{Grade}(P_f)_{New} = \frac{\mathrm{Grade}(P_f)_{Old}}{N_{fq}}

where Grade(P_f)_Old is the score of question P_f in FP before the update; N_fq is the number of questions similar to P_f; and Grade(P_f)_New is the score of question P_f in FP after the update.

2. The method for mapping query keywords to relevant questions according to claim 1, characterized in that, in step 1, θ_t ∈ [0.3, 0.9].

3. The method for mapping query keywords to relevant questions according to claim 1, characterized in that, in step 2, λ ∈ (0.2, 0.9).

4. The method for mapping query keywords to relevant questions according to claim 1, characterized in that, in step 3, α ∈ (0.1, 1) and θ_c ∈ [0.3, 0.9].

5. The method for mapping query keywords to relevant questions according to claim 1, characterized in that, in step 4, d ∈ (0.1, 1).

6. The method for mapping query keywords to relevant questions according to claim 1, characterized in that, in step 6, θ_s ∈ [0.2, 0.9].