CN105786817A

CN105786817A - Method for recommending high-utility search engine query based on query reconstruction graph

Info

Publication number: CN105786817A
Application number: CN201410796485.XA
Authority: CN
Inventors: 王建国; 黄哲学; 姜青山
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2014-12-18
Filing date: 2014-12-18
Publication date: 2016-07-20

Abstract

The invention discloses a method for recommending high-utility search engine queries based on a query reconstruction graph. The method includes the following steps: S1, predicting the reconstructed query q' of a query q in a query log; S2, constructing a query reconstruction graph G = {Q, U, i, E_qr, E_qc, E_qi}; and S3, recommending high-utility queries using the query reconstruction graph, in which the utility of each URL is obtained by an absorbing-state random walk and a group of queries without redundant search results is recommended. By constructing the query reconstruction graph and performing a random walk on it, the method can recommend a group of queries without redundant search results and guide the user to more relevant search results.

Description

Method for recommending high-utility search engine queries based on a query reconstruction graph

Technical field

The present invention relates to the technical field of search engine queries, and in particular to a method for recommending high-utility search engine queries based on a query reconstruction graph.

Background

Given historical log data, query recommendation technology builds a knowledge base. The knowledge base consists of two parts: a set of initial queries Q and the corresponding candidate query sets, where each candidate query set is associated with one initial query in Q and is ranked. When the search engine receives an initial query q from a user, the query recommendation knowledge base recommends the top K candidate queries to the user, and they are displayed at the bottom or on either side of the search results page for q. In this way, search engine query recommendation helps users find useful search results quickly. As shown in Fig. 1, Baidu displays the recommended related queries in the "Related searches" section at the bottom of the page. This technology is now used in many commercial search engines, such as Baidu, Google, and Bing.

Existing query recommendation is based on similarity rather than utility (usefulness). Existing methods rank the candidate query set {q₁, q₂, ..., q_m} of an initial query q ∈ Q mainly by a similarity function S(q, q_i), where S is computed for q and q_i from different kinds of query log data.

The first kind of data used to compute the similarity between two queries is the URLs clicked in the query log. A query-URL bipartite graph is built from the clicked URLs in the query log and is then used to compute the similarity between queries. Beeferman and Berger (2000) apply an agglomerative clustering algorithm to the query-URL bipartite graph to cluster queries and then find related queries to recommend. Craswell and Szummer (2007) apply two random walk processes to propagate query similarity on the query-URL bipartite graph and obtain better similarity scores between queries. Li et al. (2008) fold the query-URL bipartite graph into an affinity graph and recommend similar queries with a ranking method based on hierarchical agglomerative clustering. Liu and Sun (2008) transform the undirected query-URL bipartite graph into a directed bipartite graph and apply a random walk to find queries similar to the initial query. Instead of a random walk, Ma et al. (2012) use heat diffusion to simulate information propagation on the directed query-URL bipartite graph and then recommend similar queries.

Search session log data is also used to compute the similarity between two queries. A search session is the series of queries issued by the same user within a period of time. Fonseca et al. (2003) treat a search session as a transaction of queries and use association rule mining to find associated queries to recommend. Huang et al. (2003) represent each query as a vector over search sessions, in which each component records the number of times the query occurs in the corresponding search session; the similarity of two queries is computed from their query vectors. Given the current search session, He et al. (2009) propose a mixture variable memory Markov model, built from search sessions, to predict the next query to be selected.

From adjacent queries in search sessions, Boldi et al. (2008, 2009) construct a query-flow graph and apply a random walk starting from the initial query to measure the similarity between queries. Anagnostopoulos et al. (2010) propose a method that perturbs the transition probabilities of the query-flow graph to maximize the expected utility of the random walk. Bordino et al. (2010) propose a method that maps a large query-flow graph into a low-dimensional space, thereby reducing the cost of computing similarity between queries.

Existing similarity-based search engine query recommendation methods recommend to the user the candidate queries most similar to the initial query, but the search results of similar queries are often useless, that is, irrelevant. For example, given the initial query "iphone available time market", whose information need is "what's the time of iphone to sell on the market", the queries recommended by a similarity-based method include "iphone market sale time", "iphone selling market", and "iphone release date". All three recommendations look similar to the initial query, yet their search results show that only the last one finds relevant search results. A recommendation without relevant search results is of no use to the user.

Therefore, in view of the above technical problem, it is necessary to provide a method for recommending high-utility search engine queries based on a query reconstruction graph.

Summary of the invention

In view of this, an object of the present invention is to provide a method for recommending high-utility search engine queries based on a query reconstruction graph, so as to recommend queries that guide the user to more relevant search results.

To achieve the above object, the technical solution provided by the embodiments of the present invention is as follows:

A method for recommending high-utility search engine queries based on a query reconstruction graph, the method comprising:

S1, predicting the reconstructed query q' of a query q in the query log;

S2, constructing a query reconstruction graph G = {Q, U, i, E_qr, E_qc, E_qi}, wherein

Q = {q₁, q₂, ..., q_n} represents all distinct queries in the query log;

U = {u₁, u₂, ..., u_m} represents all distinct URLs clicked by users in the query log;

i represents the interruption vertex, and every query q ∈ Q has an edge to i;

E_qr = {(q, q') | q ∈ Q, q' ∈ R(q)} represents the edges from queries to their reconstructed queries, where R(q) is the set of reconstructions of query q;

E_qc = {(q, u) | q ∈ Q, u ∈ C(q)} represents the edges from queries to the URLs clicked for them, where C(q) is the set of URLs clicked for query q;

E_qi = {(q, i) | q ∈ Q} represents the edges from queries to the interruption vertex i;

S3, recommending high-utility queries using the query reconstruction graph: obtaining the utility of each URL by an absorbing-state random walk, and recommending a group of queries without redundant search results.

As a further improvement of the present invention, step S1 comprises:

S11, query normalization;

S12, query keyword extraction;

S13, keyword matching.

As a further improvement of the present invention, step S11 is specifically:

Converting uppercase letters in the query to lowercase, collapsing consecutive spaces into a single space, and deleting any leading and trailing spaces.

As a further improvement of the present invention, step S12 is specifically:

Performing query segmentation with an unsupervised n-gram-based method; for a query x = {x₁, ..., x_n}:

PMI(x_i, x_{i+1}) = \log \frac{p(x_i, x_{i+1})}{p(x_i) \cdot p(x_{i+1})},

where p(x_i, x_{i+1}) is the probability that the bigram (x_i, x_{i+1}) occurs, and p(x_i) and p(x_{i+1}) are the frequencies with which the words x_i and x_{i+1} occur; when the PMI value of two consecutive words is lower than the threshold 0.895, the two words are separated.

As a further improvement of the present invention, step S13 is specifically:

Matching two keywords k₁ and k₂ by exact matching, approximate matching, or semantic matching, wherein:

for exact matching, k₁ = k₂;

for approximate matching, the Levenshtein edit distance between k₁ and k₂ is less than 2;

for semantic matching, the Wu and Palmer measure of k₁ and k₂ in WordNet is greater than 0.5, the Wu and Palmer measure being defined as

wup(k_1, k_2) = \frac{2 \cdot depth(LCS)}{depth(k_1) + depth(k_2)},

where LCS denotes the least common subsumer (the nearest common ancestor), and depth(k) denotes the distance from the synset of keyword k in WordNet to the root node.

As a further improvement of the present invention, "obtaining the utility of each URL by an absorbing-state random walk" in step S3 comprises:

Setting the URL vertices and the interruption vertex i in the query reconstruction graph G as absorbing states and the query vertices as transient states. The query reconstruction graph G contains n query nodes, m URL nodes, and 1 interruption node. In the absorbing-state random walk, the corresponding transition probability matrix is:

P = \begin{bmatrix} P_Q & P_U & P_I \\ 0 & I_U & 0 \\ 0 & 0 & 1 \end{bmatrix},

where P_Q is the n × n matrix of transition probabilities between query vertices, P_U is the n × m matrix of transition probabilities from queries to URLs, P_I is the n × 1 matrix of transition probabilities from queries to the interruption vertex, and I_U is the m × m identity matrix.

As a further improvement of the present invention, step S3 further comprises:

Computing the absorbing-state distribution in the query reconstruction graph G:

P^t = \begin{bmatrix} P_Q^t & \sum_{k=0}^{t-1} P_Q^k P_U & \sum_{k=0}^{t-1} P_Q^k P_I \\ 0 & I_U & 0 \\ 0 & 0 & 1 \end{bmatrix},

where P^t[i, j] is the probability of reaching vertex j from vertex i after t steps of the random walk.

As a further improvement of the present invention, "recommending a group of queries without redundant search results" in step S3 is specifically:

Computing the utility score U(q) of each query q, sorting the queries by U(q), and placing them into the candidate query set C;

Selecting from C the query q that maximizes U(q).

The present invention has the following beneficial effects:

By constructing a query reconstruction graph and performing a random walk on it, a group of queries without redundant search results can be recommended, guiding the user to queries with more relevant search results.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the search results returned by Baidu for the query "patent application" in the prior art, together with the recommended related queries.

Fig. 2 is a schematic flowchart of the method of the present invention for recommending high-utility search engine queries based on a query reconstruction graph.

Detailed description

To help those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort shall fall within the scope of protection of the present invention.

Referring to Fig. 2, the present invention discloses a method for recommending high-utility search engine queries based on a query reconstruction graph, comprising:

S2, constructing a query reconstruction graph G = {Q, U, i, E_qr, E_qc, E_qi}.

A user submits a query q to the search engine and obtains useful information by clicking some of the search results C(q) of q. If satisfied, the user stops searching; if not, the user goes on to reconstruct the query as q'. Only queries that help the user find useful documents are useful queries. However, whether a document in the search results of q is useful cannot be read directly from the log. We therefore first assess the user's satisfaction with query q and then let that satisfaction reflect the utility of the clicked search results of q: the search results of a satisfied query q have high utility, and the search results of an unsatisfied query q have low utility.

S1, predicting query reconstruction

Query reconstruction means that, in order to obtain more information, the user submits a query q' that revises the previous query q. A query reconstruction is therefore regarded as a signal that the user was not satisfied with the previous query q. For a query q' to count as a reconstruction of q, the two queries must serve the same information need. The present invention predicts query reconstruction in the following steps:

1. Query normalization

Convert all uppercase letters in the query to lowercase, collapse consecutive spaces into a single space, and delete any leading and trailing spaces.
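The normalization step can be sketched in a few lines of Python. The function name `normalize_query` is hypothetical, and treating tabs and other whitespace the same way as spaces is an assumption of this sketch rather than something the description states.

```python
import re


def normalize_query(query: str) -> str:
    """Lowercase the query, collapse whitespace runs, and trim both ends."""
    lowered = query.lower()
    # Assumption: any run of whitespace (not only spaces) becomes one space.
    collapsed = re.sub(r"\s+", " ", lowered)
    return collapsed.strip()
```

For example, `normalize_query("  New   York  City ")` yields the same string as `normalize_query("new york city")`, so both forms map to one query vertex.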

2. Query keywords

Lexical similarity between queries has often been used to identify related queries. Its problem is that it produces false negatives, for example with synonyms, and false positives, for example q: 'weather in new york city' and q': 'hotels in new york city'. These two queries share 80% of their words, so any lexical similarity feature would predict q' to be a reconstruction of q.

The present invention addresses this problem by converting queries into keywords. What is needed is to split q into the two keywords 'weather' and 'new york city', and q' into the two keywords 'hotels' and 'new york city'; then only 50% of the keywords are identical. Each query is segmented into keywords. Query segmentation is a technique for processing a user's search query that divides the words of the query into independent phrases or semantic units. Some supervised query segmentation techniques are prohibitively expensive and hard to reproduce, so the present invention uses an unsupervised n-gram-based method for query segmentation. Specifically, for a query x = {x₁, ..., x_n}:

PMI(x_i, x_{i+1}) = \log \frac{p(x_i, x_{i+1})}{p(x_i) \cdot p(x_{i+1})},

where p(x_i, x_{i+1}) is the probability that the bigram (x_i, x_{i+1}) occurs, and p(x_i) and p(x_{i+1}) are the frequencies with which the words x_i and x_{i+1} occur. When the PMI value of two consecutive words is lower than the threshold 0.895, the two words are separated.
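The PMI-based segmentation can be illustrated with the following minimal sketch. It assumes that the probabilities are relative frequencies estimated from the query log itself, that the logarithm is natural, and that a bigram never seen in the log is always split; the names `train_counts`, `pmi`, and `segment` are hypothetical.

```python
import math
from collections import Counter

PMI_THRESHOLD = 0.895  # threshold value given in the description


def train_counts(queries):
    """Count unigrams and adjacent-word bigrams over normalized queries."""
    unigrams, bigrams = Counter(), Counter()
    for query in queries:
        words = query.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def pmi(w1, w2, unigrams, bigrams):
    """PMI(w1, w2) = log( p(w1, w2) / (p(w1) * p(w2)) )."""
    if bigrams[(w1, w2)] == 0:
        return float("-inf")  # assumption: unseen bigrams are always split
    p_xy = bigrams[(w1, w2)] / sum(bigrams.values())
    p_x = unigrams[w1] / sum(unigrams.values())
    p_y = unigrams[w2] / sum(unigrams.values())
    return math.log(p_xy / (p_x * p_y))


def segment(query, unigrams, bigrams, threshold=PMI_THRESHOLD):
    """Split between consecutive words whose PMI is below the threshold."""
    words = query.split()
    if not words:
        return []
    segments = [[words[0]]]
    for prev, cur in zip(words, words[1:]):
        if pmi(prev, cur, unigrams, bigrams) < threshold:
            segments.append([cur])      # weak association: start a new keyword
        else:
            segments[-1].append(cur)    # strong association: extend the keyword
    return [" ".join(seg) for seg in segments]
```

With counts trained on a log in which "new" is always followed by "york", `segment` keeps "new york" together as one keyword while splitting it from a neighbouring word it has never been seen next to.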

3. Keyword matching

The present invention defines three ways to match two keywords k₁ and k₂:

A) Exact matching

k₁ = k₂.

B) Approximate matching

To handle spelling variations and misspellings, k₁ and k₂ are considered a match if the Levenshtein edit distance between them is less than 2.

C) Semantic matching

To handle semantic variation between k₁ and k₂, they are considered a match if their Wu and Palmer measure in WordNet is greater than 0.5. The Wu and Palmer measure is defined as

wup(k_1, k_2) = \frac{2 \cdot depth(LCS)}{depth(k_1) + depth(k_2)},

where LCS denotes the least common subsumer (the nearest common ancestor), and depth(k) denotes the distance from the synset of keyword k in WordNet to the root node.
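The three matching rules can be sketched together as follows. To keep the example self-contained, WordNet is replaced by a tiny hypothetical taxonomy (`TOY_PARENT`), and the root is assumed to have depth 1; a real implementation would look up synsets in WordNet instead. All function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


# Hypothetical toy taxonomy (child -> parent), standing in for WordNet.
TOY_PARENT = {
    "entity": None,
    "building": "entity",
    "hotel": "building",
    "inn": "building",
    "phenomenon": "entity",
    "weather": "phenomenon",
}


def path_to_root(node):
    """Return the node followed by all of its ancestors up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = TOY_PARENT[node]
    return path


def depth(node):
    # Assumption: the root has depth 1 (number of nodes on the path to the root).
    return len(path_to_root(node))


def wu_palmer(k1, k2):
    """wup = 2 * depth(LCS) / (depth(k1) + depth(k2))."""
    ancestors_of_k2 = set(path_to_root(k2))
    lcs = next(n for n in path_to_root(k1) if n in ancestors_of_k2)
    return 2 * depth(lcs) / (depth(k1) + depth(k2))


def keywords_match(k1, k2):
    """Apply exact, approximate, then semantic matching in that order."""
    if k1 == k2:
        return True
    if levenshtein(k1, k2) < 2:
        return True
    if k1 in TOY_PARENT and k2 in TOY_PARENT:
        return wu_palmer(k1, k2) > 0.5
    return False
```

In this toy taxonomy 'hotel' and 'inn' share the parent 'building', so they match semantically, whereas 'hotel' and 'weather' only share the root and do not.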

Given a query q, we build a model with keywords as features to predict whether the query q' that immediately follows it in a search session is a reconstruction of q.

If q' is not a reconstruction of q and at least one search result of q was clicked, q is considered a satisfied query.

S2, constructing the query reconstruction graph

Preprocessing:

Reorganize the query log into a series of search sessions;

Extract all pairs (q, q') of queries that occur consecutively in a search session;
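The two preprocessing steps can be sketched as below. The description does not say how sessions are delimited, so the 30-minute inactivity gap used here is purely an assumption, as are the input row layout `(user_id, timestamp_seconds, query)` and the function names.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumption: a 30-minute pause starts a new session


def build_sessions(log_rows):
    """Group (user_id, timestamp_seconds, query) rows into per-user sessions."""
    by_user = defaultdict(list)
    for user_id, timestamp, query in log_rows:
        by_user[user_id].append((timestamp, query))

    sessions = []
    for events in by_user.values():
        events.sort()  # chronological order within one user
        current, last_ts = [], None
        for timestamp, query in events:
            if last_ts is not None and timestamp - last_ts > SESSION_GAP_SECONDS:
                sessions.append(current)
                current = []
            current.append(query)
            last_ts = timestamp
        if current:
            sessions.append(current)
    return sessions


def consecutive_pairs(sessions):
    """Return every pair (q, q') of adjacent queries within a session."""
    return [(q, q_next) for s in sessions for q, q_next in zip(s, s[1:])]
```

The resulting pairs are the candidates that the reconstruction predictor of step S1 then classifies.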

Definition of the query reconstruction graph:

G = {Q, U, i, E_qr, E_qc, E_qi}, wherein

Q = {q₁, q₂, ..., q_n} represents all distinct queries in the query log;

U = {u₁, u₂, ..., u_m} represents all distinct URLs clicked by users in the query log;

i represents the interruption vertex, and every query q ∈ Q has an edge to i;
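One possible in-memory representation of G is sketched below, under the assumption that predicted reconstruction pairs and click pairs are already available as plain tuples. The class name, the field names, and the `INTERRUPT` label for vertex i are hypothetical.

```python
from dataclasses import dataclass, field

INTERRUPT = "<interrupt>"  # hypothetical label for the interruption vertex i


@dataclass
class QueryReconstructionGraph:
    queries: set = field(default_factory=set)  # Q
    urls: set = field(default_factory=set)     # U
    e_qr: set = field(default_factory=set)     # edges (q, q')
    e_qc: set = field(default_factory=set)     # edges (q, u)
    e_qi: set = field(default_factory=set)     # edges (q, i)


def build_graph(reconstructions, clicks):
    """Build G from (q, q') reconstruction pairs and (q, url) click pairs."""
    graph = QueryReconstructionGraph()
    for q, q_reconstructed in reconstructions:
        graph.queries.update((q, q_reconstructed))
        graph.e_qr.add((q, q_reconstructed))
    for q, url in clicks:
        graph.queries.add(q)
        graph.urls.add(url)
        graph.e_qc.add((q, url))
    # Every query has an edge to the interruption vertex.
    graph.e_qi = {(q, INTERRUPT) for q in graph.queries}
    return graph
```

Sets are used so that repeated log entries collapse into a single vertex or edge; the counts needed for transition probabilities would be kept separately.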

S3, recommending high-utility queries using the query reconstruction graph

Obtaining the utility of each URL by an absorbing-state random walk:

An absorbing-state random walk is performed on the query reconstruction graph G. The URL vertices and the interruption vertex i in G are set as absorbing states, and the query vertices as transient states. When the random walk reaches a query q in G, there are three possibilities for the next step: the user is satisfied and stops searching; the user is unsatisfied and goes on to reconstruct the query; or the user is unsatisfied and stops searching. We predict the user's behavior from whether query q has a reconstruction and whether the search results of q were clicked:

Table 1. Relation between whether a query has a reconstruction, whether its search results were clicked, and user behavior

Reconstruction	Search results clicked	User behavior
No	Yes	Stop searching (satisfied)
Yes	Yes	Reconstruct query (unsatisfied)
Yes	No	Reconstruct query (unsatisfied)
No	No	Interrupt (unsatisfied)

Given a query q, let f(q) be the number of times q occurs in the query log, let f_r(q) be the number of times q occurs and has a reconstruction, let f_nrc(q) be the number of times q occurs, has no reconstruction, and has clicked search results, and let f_nrnc(q) be the number of times q occurs, has no reconstruction, and has no clicked search results. Clearly, f_r(q) + f_nrc(q) + f_nrnc(q) = f(q).
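The mapping from Table 1 to the four counters can be sketched as follows, assuming each occurrence of a query has already been labelled with two booleans (reconstructed or not, clicked or not). The function names and the observation layout are hypothetical.

```python
from collections import Counter, defaultdict


def behavior_key(has_reconstruction: bool, has_click: bool) -> str:
    """Map one row of Table 1 to the counter it increments."""
    if has_reconstruction:
        return "f_r"       # user reconstructs the query (unsatisfied)
    if has_click:
        return "f_nrc"     # user stops searching (satisfied)
    return "f_nrnc"        # user interrupts (unsatisfied)


def tally(observations):
    """Count f, f_r, f_nrc and f_nrnc per query.

    observations: iterable of (query, has_reconstruction, has_click).
    """
    counts = defaultdict(Counter)
    for query, has_reconstruction, has_click in observations:
        counts[query]["f"] += 1
        counts[query][behavior_key(has_reconstruction, has_click)] += 1
    return counts
```

Because every occurrence increments exactly one of the three behavior counters, the identity f_r(q) + f_nrc(q) + f_nrnc(q) = f(q) holds by construction.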

According to Table 1, the transition probabilities of the random walk on G are defined as:

P(q' | q) = \frac{f_r(q)}{f(q)} \cdot \frac{f(q, q')}{\sum_{q'' \in R(q)} f(q, q'')},

where q' is a reconstruction of query q, and f(q, q') is the number of times query q co-occurs with its reconstruction q';

P(i | q) = \frac{f_{nrnc}(q)}{f(q)},

where i is the interruption node;

P(u | q) = \frac{f_{nrc}(q)}{f(q)} \cdot \frac{f_{nrc}(q, u)}{\sum_{u' \in C(q)} f(q, u')},

where u is a clicked URL in the search results of query q, f(q, u') is the number of times u' was clicked for q, and f_nrc(q, u) is the number of times q had no reconstruction and u was clicked.

Because the document (URL) vertices and the interruption vertex are all absorbing states, once the random walk reaches one of them it cannot move to any other vertex. Therefore P(u | u) = 1 and P(i | i) = 1.
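The outgoing transition probabilities of a single query vertex can be sketched as follows. As an assumption made here so that each row sums to 1, the click term is normalized by the total of the non-reconstructed click counts; the function name and argument layout are hypothetical.

```python
def transition_row(f_r, f_nrc, f_nrnc, recon_counts, click_counts):
    """Return (to_queries, to_urls, to_interrupt) for one query q.

    recon_counts: {q': f(q, q')} for the reconstructions of q.
    click_counts: {u: f_nrc(q, u)}, clicks seen when q was not reconstructed.
    """
    f = f_r + f_nrc + f_nrnc  # total occurrences of q
    recon_total = sum(recon_counts.values())
    click_total = sum(click_counts.values())  # assumption: normalizer for clicks

    to_queries = {
        q2: (f_r / f) * (count / recon_total)
        for q2, count in recon_counts.items()
    }
    to_urls = {
        u: (f_nrc / f) * (count / click_total)
        for u, count in click_counts.items()
    }
    to_interrupt = f_nrnc / f
    return to_queries, to_urls, to_interrupt
```

For a query seen four times (twice reconstructed, once clicked without reconstruction, once abandoned), half of the probability mass goes to its reconstructions, a quarter to clicked URLs, and a quarter to the interruption vertex.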

Suppose the query reconstruction graph G contains n query nodes, m URL nodes, and 1 interruption node. In the absorbing-state random walk, the corresponding transition probability matrix can be written as

P = \begin{bmatrix} P_Q & P_U & P_I \\ 0 & I_U & 0 \\ 0 & 0 & 1 \end{bmatrix},

where P_Q is the n × n matrix of transition probabilities between query vertices, P_U is the n × m matrix of transition probabilities from queries to URLs, P_I is the n × 1 matrix of transition probabilities from queries to the interruption vertex, and I_U is the m × m identity matrix.

Because the above transition probability matrix is reducible, it has no stable stationary distribution. An alternative way to compute the absorbing-state distribution is to calculate:

P^t = \begin{bmatrix} P_Q^t & \sum_{k=0}^{t-1} P_Q^k P_U & \sum_{k=0}^{t-1} P_Q^k P_I \\ 0 & I_U & 0 \\ 0 & 0 & 1 \end{bmatrix},
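The query-to-URL block of the t-step matrix, the sum of P_Q^k P_U for k from 0 to t - 1, can be accumulated iteratively without forming the full matrix P. The sketch below uses plain Python lists to stay self-contained; the helper names are hypothetical, and a practical implementation would more likely use a sparse matrix library.

```python
def mat_mul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def mat_add(a, b):
    """Add two dense matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def absorption_after(p_q, p_u, steps):
    """Return sum_{k=0}^{steps-1} P_Q^k P_U, the query-to-URL block of P^steps."""
    n = len(p_q)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # P_Q^0
    total = [[0.0] * len(p_u[0]) for _ in range(n)]
    for _ in range(steps):
        total = mat_add(total, mat_mul(power, p_u))
        power = mat_mul(power, p_q)  # advance to the next power of P_Q
    return total
```

Entry [q, u] of the result is the probability that a walk starting at query q has been absorbed at URL u within the given number of steps; passing P_I in place of P_U gives the corresponding probability for the interruption vertex.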

Recommending a group of queries without redundant search results:

Let s_i be the absorbing-state random walk score of URL u_i, from which the utility score U(q) of query q is obtained. The queries are sorted by U(q) and placed into the candidate query set C. The specific algorithm is:

In summary, by constructing a query reconstruction graph and performing a random walk on it, the present invention can recommend a group of queries without redundant search results, guiding the user to queries with more relevant search results.

It is obvious to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments and can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in all respects as exemplary and non-restrictive. The scope of the invention is defined by the appended claims rather than by the above description, and all changes falling within the meaning and range of equivalents of the claims are intended to be embraced by the invention. No reference sign in the claims shall be construed as limiting the claim concerned.

Furthermore, although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way only for clarity; those skilled in the art should take it as a whole, and the technical solutions of the embodiments may be suitably combined to form other embodiments that those skilled in the art can understand.

Claims

1. A method for recommending high-utility search engine queries based on a query reconstruction graph, characterized in that the method comprises:

Q = {q₁, q₂, ..., q_n} represents all distinct queries in the query log;

i represents the interruption vertex, and every query q ∈ Q has an edge to i;

E_qr = {(q, q') | q ∈ Q, q' ∈ R(q)} represents the edges from queries to their reconstructed queries, where R(q) is the set of reconstructions of query q;

E_qc = {(q, u) | q ∈ Q, u ∈ C(q)} represents the edges from queries to the URLs clicked for them, where C(q) is the set of URLs clicked for query q;

2. The method according to claim 1, characterized in that step S1 comprises:

S11, query normalization;

S12, query keyword extraction;

S13, keyword matching.

3. The method according to claim 2, characterized in that step S11 is specifically:

4. The method according to claim 2, characterized in that step S12 is specifically:

Performing query segmentation with an unsupervised n-gram-based method; for a query x = {x₁, ..., x_n}:

PMI(x_i, x_{i+1}) = \log \frac{p(x_i, x_{i+1})}{p(x_i) \cdot p(x_{i+1})},

where p(x_i, x_{i+1}) is the probability that the bigram (x_i, x_{i+1}) occurs, and p(x_i) and p(x_{i+1}) are the frequencies with which the words x_i and x_{i+1} occur; when the PMI value of two consecutive words is lower than the threshold 0.895, the two words are separated.

5. The method according to claim 2, characterized in that step S13 is specifically:

for exact matching, k₁ = k₂;

6. The method according to claim 1, characterized in that "obtaining the utility of each URL by an absorbing-state random walk" in step S3 comprises:

P = \begin{bmatrix} P_Q & P_U & P_I \\ 0 & I_U & 0 \\ 0 & 0 & 1 \end{bmatrix},

where P_Q is the n × n matrix of transition probabilities between query vertices, P_U is the n × m matrix of transition probabilities from queries to URLs, P_I is the n × 1 matrix of transition probabilities from queries to the interruption vertex, and I_U is the m × m identity matrix.

7. The method according to claim 6, characterized in that step S3 further comprises:

Computing the absorbing-state distribution in the query reconstruction graph G:

P^t = \begin{bmatrix} P_Q^t & \sum_{k=0}^{t-1} P_Q^k P_U & \sum_{k=0}^{t-1} P_Q^k P_I \\ 0 & I_U & 0 \\ 0 & 0 & 1 \end{bmatrix},

where P^t[i, j] is the probability of reaching vertex j from vertex i after t steps of the random walk.

8. The method according to claim 6, characterized in that "recommending a group of queries without redundant search results" in step S3 is specifically:

Selecting from C the query q that maximizes U(q).