CN100595759C - Method and device for query expansion, and related-term dictionary

Publication number
CN100595759C
Authority
CN
China
Prior art keywords
query
term
degree
correlation
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 200710097501
Other languages
Chinese (zh)
Other versions
CN101281523A (en)
Inventor
童征宇
汤帜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Peking University
Founder Apabi Technology Ltd
Original Assignee
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Application filed by Peking University, Peking University Founder Group Co Ltd and Beijing Founder Apabi Technology Co Ltd
Priority to CN 200710097501
Publication of CN101281523A
Application granted
Publication of CN100595759C
Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a query expansion method and device and a related-term dictionary, addressing the problem that existing search systems cannot perform query expansion automatically and effectively to support users' retrieval behavior. The method comprises: dividing a user's query behavior records into query units, each containing at least one query event, according to the user's identity and access times; periodically calculating the correlation between search terms in the query events and updating the related-term dictionary according to the calculated correlations; and retrieving from the related-term dictionary the related terms most strongly correlated with the term the user enters, thereby forming the query expansion result. The invention can be applied to a variety of search systems; it assists users in querying, reduces their query burden and improves their search efficiency.

Description

Query expansion method and device, and related-term dictionary
Technical field
The present invention relates to query expansion technology in the field of information retrieval, and in particular to a query expansion method and device and a related-term dictionary.
Background technology
With the rapid development of computer and Internet technology, the content and volume of digital information on the network have grown sharply. Faced with this ocean of electronic information, search systems have become an important tool for making effective use of network resources. Mainstream search systems are essentially all built on full-text search, which works as follows: the content to be searched is segmented into short word sequences, and an index is built over the character strings in each sequence. When the user enters a term or sentence, it is segmented in the same way and compared against the index, and the list of links to the documents containing the matching words is shown to the user.
Full-text search falls into two classes by application: Internet search engines and professional retrieval systems.
Internet search engines deal with the large number of disorganized web pages on the Internet. Their main purpose is to find useful reference information and screen out harmful information, ranking useful pages as near the top as possible.
Professional retrieval systems, by contrast, work on information that has already been curated, and their query results must achieve high recall and high precision at the same time. Recall is the ratio of the number of relevant documents retrieved by a search to the total number of relevant documents in the system's document collection; it reflects how complete the retrieved information is. Precision is the ratio of the number of useful documents retrieved by a search to the total number of documents retrieved; it is the key to ensuring that what is found is useful.
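The two ratios defined above can be illustrated with a minimal sketch. The function name and the sample document IDs are hypothetical and serve only to make the definitions concrete.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for one search.

    retrieved: IDs of the documents the search returned.
    relevant:  IDs of all relevant documents in the collection.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical search: 4 documents returned, 5 relevant documents exist, 2 overlap.
recall, precision = recall_and_precision({1, 2, 3, 4}, {2, 4, 6, 8, 10})
print(recall, precision)  # 0.4 0.5
```

A search that returns more documents tends to raise recall at the cost of precision, which is why the text requires both to be high at once.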
Of course, most users search for information mainly with tools such as Internet search engines. The search engine matches against the term the user enters and displays the list of matching results. Such searches, however, suffer from two problems:
(1) Chinese contains many synonyms and polysemous words, and its modes of expression are diverse. The terms users choose when building a query expression are often non-standard and differ considerably from the words or phrases used in the search system's document index, which causes the "expression mismatch" problem in information retrieval.
(2) Users typically query with very few terms, usually one or two, which cannot describe their information need in detail. The problem is more pronounced when users are themselves unclear about what they need.
Under these circumstances, a user's initial query is usually rough, insufficiently specialized and incomplete, and its results cannot satisfy the user's needs well.
To address these problems, existing search systems adopt query expansion. Query expansion displays other terms related to the term the user entered and so helps the user reformulate an accurate and effective query expression. To some extent this compensates for the difference between the user's wording and that of the candidate passages, so that the candidate documents the user needs are retrieved with as few omissions as possible. The user's retrieval becomes more accurate and the user's burden (the total effort the user expends during retrieval) is reduced. For most non-professional users, and for information seekers of the "I don't know what I want, but I'll know it when I see it" kind, it improves retrieval efficiency especially effectively.
In practice, however, search systems that offer query expansion still exhibit the following problems:
First, current Chinese search engines build a retrieval dictionary by counting the terms users enter and how often each is searched. When a user searches, the characters of the entered term are matched against this dictionary, the matches are sorted by how well the characters match and by how often they have been searched, and the sorted list is presented to the user as the query expansion result. For example, if the user enters "computer", terms such as "computer newspaper", "computer quotation" and "notebook computer" appear in the expansion result. This approach is convenient and intuitive and can suggest possible related terms, but the expansion results remain constrained by the term the user initially chose. It copes poorly with the rapidly growing stock of new words on the network and cannot meet the need for news-related associations that change over time, so the user's search efficiency cannot be guaranteed.
Second, professional retrieval systems, typified by PubMed, rely on a thesaurus compiled by specialists, which defines relationships between words such as synonyms and related terms. Query expansion through this thesaurus achieves high recall and high precision, standardizes users' query terms and assists their queries. But the thesaurus must be compiled by specialists who define the relationships between words, which is time-consuming and costly to maintain. The approach generally suits only search systems in specialized fields or thematic database query systems, not large-scale search systems in non-specialized fields. It likewise copes poorly with the rapidly growing stock of new words on the network and cannot handle news-related associations that change over time.
In summary, there is currently no good method by which a general search system can address these user search problems and carry out query expansion automatically and efficiently to give effective support to users' retrieval behavior.
Summary of the invention
In view of the above problems in the prior art, the object of the invention is to provide a query expansion method and device and a related-term dictionary. Performing query expansion with this dictionary assists users in querying, lightens their query burden and improves their search efficiency.
To achieve this object, the invention provides a query expansion method comprising:
(A) dividing a user's query behavior records into at least one query unit according to the user's identifier and access times, each query unit comprising at least one query event;
(B) periodically calculating the correlation between terms in two different query events of the same query unit, or the correlation between terms in any two query events, and updating the related-term dictionary according to the calculated correlations;
(C) retrieving from the related-term dictionary the related terms with the greatest correlation to the term the user enters when querying, to form the query expansion result.
In step (A), the query behavior record comprises the terms entered when the user queries and the query result entries the user clicks to visit; the identifier is the IP address.
A query unit comprises at least one query event. If the time between two adjacent query events from the same IP address is less than a certain time threshold, the two query events belong to the same query unit.
In step (B), the correlation between terms in two different query events of the same query unit is calculated as:
$$\mathrm{Weight}(W_{in}, W_{jm}) = \mathrm{WeightFactor}(W_{in}, W_{jm}) \times \frac{\mathrm{OldWeight}(W_{in}, W_{jm})^{p} \times (a - 1) + 1}{a}$$
where $j > i$; $W_{in}$ denotes the $n$-th term of the $i$-th query and $W_{jm}$ the $m$-th term of the $j$-th query; $\mathrm{OldWeight}(W_{in}, W_{jm})$ is the previous correlation between terms $W_{in}$ and $W_{jm}$; $\mathrm{OldWeight}(W_{in}, W_{jm})^{p}$ takes values in $(0, 1)$; $p$ is a real number greater than zero; and $a$ is a natural number greater than 2.
Preferably, $p$ is 1.
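As a minimal sketch, the update can be written as a small function. Two points are assumptions rather than statements of the source: the flattened formula is read as WeightFactor multiplied by the fraction (OldWeight^p × (a − 1) + 1) / a, and WeightFactor is taken as an input because the text does not define it. The function name and sample values are hypothetical.

```python
def updated_weight(old_weight, weight_factor, a=3, p=1.0):
    """Weight = WeightFactor * (OldWeight**p * (a - 1) + 1) / a.

    Assumption: WeightFactor multiplies the whole fraction.
    weight_factor is supplied by the caller; its definition is not in the text.
    """
    if a <= 2:
        raise ValueError("a must be a natural number greater than 2")
    if p <= 0:
        raise ValueError("p must be a real number greater than zero")
    return weight_factor * (old_weight ** p * (a - 1) + 1) / a


# Hypothetical values: previous correlation 0.5, WeightFactor 1.0, a = 4, p = 1.
print(updated_weight(0.5, 1.0, a=4))  # 0.625
```

Under this reading with WeightFactor equal to 1, each update moves the correlation a fraction 1/a of the way from its old value towards 1, so a larger a makes the dictionary react more slowly to new evidence.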
In step (B), the correlation between terms in any two query events is calculated by computing the similarity between the query result entries the user clicked in a new query event and in an old query event, and inferring from it the correlation between the terms of the two events:
$$\mathrm{Sim}(S_{new}, S_{old}) = \frac{\mathrm{Common}(S_{new}.CDoc,\ S_{old}.CDoc)}{\mathrm{Max}(S_{new}.CDoc,\ S_{old}.CDoc)}$$
where $S_{new}$ is the new query event and $S_{old}$ is the old query event; $\mathrm{Sim}(S_{new}, S_{old})$ is the similarity between the query result entries the user clicked in the new and old query events; $S_{new}.CDoc$ is the list of IDs of the query result entries the user clicked in the new query event, and $S_{old}.CDoc$ is the corresponding list for the old query event; $\mathrm{Common}(S_{new}.CDoc, S_{old}.CDoc)$ is the number of clicked entries with the same ID in the two events; and $\mathrm{Max}(S_{new}.CDoc, S_{old}.CDoc)$ is the larger of the numbers of entries clicked in the two events.
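The click similarity can be sketched in a few lines: the number of shared clicked entry IDs divided by the larger of the two click counts. The function name and document IDs are hypothetical, and treating the ID lists as sets (so repeated clicks on one entry count once) is an assumption.

```python
def click_similarity(new_clicked, old_clicked):
    """Sim = |common clicked IDs| / max(|new clicked IDs|, |old clicked IDs|)."""
    new_ids, old_ids = set(new_clicked), set(old_clicked)
    larger = max(len(new_ids), len(old_ids))
    if larger == 0:
        return 0.0  # neither event has clicks, so there is no evidence of similarity
    return len(new_ids & old_ids) / larger


# Hypothetical events: 2 shared entries, the larger click list has 4 entries.
print(click_similarity([101, 102, 103], [102, 103, 104, 105]))  # 0.5
```

Dividing by the larger count keeps the value in [0, 1] and gives 1 only when both events clicked exactly the same entries.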
If $\mathrm{Sim}(S_{new}, S_{old})$ exceeds a threshold preset by the system, the correlation between the terms of the new and old query events is then calculated as:
$$\mathrm{Weight}(W_{\mathrm{new},i}, W_{\mathrm{old},j}) = \mathrm{Sim}(S_{new}, S_{old}) \times \frac{\mathrm{OldWeight}(W_{\mathrm{new},i}, W_{\mathrm{old},j})^{q} \times (a - 1) + 1}{a}$$
where $W_{\mathrm{new},i}$ is the $i$-th term entered in the new query event and $W_{\mathrm{old},j}$ is the $j$-th term entered in the old query event; $\mathrm{Weight}(W_{\mathrm{new},i}, W_{\mathrm{old},j})$ is the correlation between the two terms and $\mathrm{OldWeight}(W_{\mathrm{new},i}, W_{\mathrm{old},j})$ is their previous correlation; $\mathrm{OldWeight}(W_{\mathrm{new},i}, W_{\mathrm{old},j})^{q}$ takes values in $(0, 1)$; $q$ is a real number greater than zero; and $a$ is a natural number greater than 2.
Preferably, $q$ is 1.
In step (B), updating the related-term dictionary according to the calculated correlations comprises:
If a pair of terms whose correlation has been calculated is already stored in the related-term dictionary, raising the correlation value stored for that pair; otherwise, adding the pair to the related-term dictionary.
After the related-term dictionary has been updated according to the calculated correlations, the method further comprises a step of adjusting the dictionary:
If a pair of correlated terms in the related-term dictionary has not been accessed within a certain time interval, the correlation value between the pair is reduced.
The correlation value is reduced as follows:
$$\mathrm{Weight} = \frac{n \times \mathrm{OldWeight} - 1}{n - 1}$$
where OldWeight is the pair's previous correlation value and $n$ is a natural number greater than 2.
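A short worked derivation may help show why this reduction behaves sensibly. It assumes the update rule of step (B) is read as $(\mathrm{OldWeight} \times (a-1) + 1)/a$ with the multiplying factor equal to 1 and the exponent equal to 1; that reading is an assumption about the flattened formula, not something the text states. Write $f$ for the update and $g$ for the reduction:

$$f(x) = \frac{x\,(a-1) + 1}{a}, \qquad g(y) = \frac{n\,y - 1}{n - 1}$$

Choosing $n = a$, the reduction exactly undoes one update:

$$g\bigl(f(x)\bigr) = \frac{\bigl(x\,(a-1) + 1\bigr) - 1}{a - 1} = x$$

The reduction always lowers a value below 1 and leaves the value 1 unchanged:

$$g(y) - y = \frac{y - 1}{n - 1} < 0 \quad (y < 1), \qquad g(1) = 1$$

For example, with $n = 4$ and $\mathrm{OldWeight} = 0.625$:

$$\mathrm{Weight} = \frac{4 \times 0.625 - 1}{4 - 1} = \frac{1.5}{3} = 0.5$$

Note also that $g(y) \le 0$ whenever $y \le 1/n$, so an implementation would presumably clamp the result at 0.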
In step (C), the query expansion result is sorted. One sorting method orders the related terms by the size of their correlation with the term the user entered when querying.
Alternatively, in step (C), the sorted query expansion result orders the related terms by their weight values, where the weight value of a related term is calculated as $w = s \times \lg(n + 1)$;
here $s$ is the correlation between the term the user entered when querying and the related term, and $n$ is the number of times the related term has been searched.
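The weighted ordering can be sketched as follows, reading lg as the base-10 logarithm (the usual meaning of the notation). The function name, the candidate terms and their counts are hypothetical.

```python
import math


def rank_expansions(candidates):
    """Sort related terms by w = s * lg(n + 1), largest first.

    candidates: list of (term, s, n), where s is the correlation with the
    user's term and n is how many times the related term has been searched.
    """
    scored = [(term, s * math.log10(n + 1)) for term, s, n in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical candidates for one user term.
ranked = rank_expansions([
    ("server", 0.8, 9),        # w = 0.8 * lg(10)  = 0.8
    ("workstation", 0.9, 0),   # w = 0.9 * lg(1)   = 0.0
    ("mainframe", 0.5, 99),    # w = 0.5 * lg(100) = 1.0
])
print([term for term, _ in ranked])  # ['mainframe', 'server', 'workstation']
```

The logarithm damps the search count, so a very popular term cannot outrank a strongly correlated one by popularity alone, while a term that has never been searched gets weight 0.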
In step (C), the query expansion result can be combined with the term the user entered when querying to build a new query expression, or used directly as a query expression, and offered to the user for use;
alternatively, the query expansion result is combined with the term the user entered in an "OR" relationship, takes part in the current query as the query expression, and the query results obtained are shown to the user.
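A minimal sketch of the "OR" combination follows. The quoted-term, `OR`-joined query syntax is a hypothetical one chosen for illustration; a real search system would use its own query language.

```python
def build_or_query(user_terms, expansion_terms):
    """Join the user's terms and the expansion terms with OR, dropping duplicates."""
    seen = set()
    parts = []
    for term in list(user_terms) + list(expansion_terms):
        if term not in seen:
            seen.add(term)
            parts.append(f'"{term}"')  # quote each term; hypothetical syntax
    return " OR ".join(parts)


print(build_or_query(["computer"], ["server", "workstation", "computer"]))
# "computer" OR "server" OR "workstation"
```

Because the expanded expression takes part in the query directly, the user sees broader results without having to pick a suggestion and search again.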
Before step (A), the method further comprises: the search system stores users' query behavior records in its log file, and the log file is filtered.
In step (B), the period can be one day.
To achieve the above object, the invention also provides a query expansion device comprising:
a query behavior record division unit, for dividing a user's query behavior records into at least one query unit according to the user's identifier and access times, each query unit comprising at least one query event;
a correlation calculation unit, for periodically calculating the correlation between terms in two different query events of the same query unit, or the correlation between terms in any two query events;
a related-term dictionary update unit, for updating the related-term dictionary according to the calculated correlations;
a retrieval unit, for retrieving from the related-term dictionary the related terms with the greatest correlation to the term the user enters when querying, to form the query expansion result.
The query expansion device may further comprise a related-term dictionary adjustment unit, which, based on the query expansion results formed by the retrieval unit, reduces the correlation value of any pair of terms in the related-term dictionary that has not been accessed within a certain time interval.
The related-term dictionary comprises:
a term storage unit, for storing the terms entered in queries in the user query behavior records;
a correlation value storage unit, for storing the correlation values between the terms in the term storage unit;
a last-access-time recording unit, for recording the time at which each pair of correlated terms in the correlation value storage unit was last accessed.
In the correlation value storage unit, the term sequence numbers are arranged in ascending or descending order.
In the last-access-time recording unit, the last access times of the correlated terms are arranged in chronological order.
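The three storage units can be sketched as one small in-memory structure. This is an illustration under assumptions, not the patented layout: the class and method names are hypothetical, the ordering of sequence numbers is read as "each pair is stored with its smaller sequence number first", and a lookup is taken to count as an access.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class RelatedTermDictionary:
    terms: list = field(default_factory=list)        # term storage: index = sequence number
    weights: dict = field(default_factory=dict)      # (low_id, high_id) -> correlation value
    last_access: dict = field(default_factory=dict)  # (low_id, high_id) -> last access time

    def term_id(self, term):
        """Return the sequence number of a term, registering it if it is new."""
        if term not in self.terms:
            self.terms.append(term)
        return self.terms.index(term)

    def pair_key(self, term_a, term_b):
        """Order the two sequence numbers ascending so each pair has a single key."""
        a, b = self.term_id(term_a), self.term_id(term_b)
        return (a, b) if a < b else (b, a)

    def set_weight(self, term_a, term_b, weight, when):
        key = self.pair_key(term_a, term_b)
        self.weights[key] = weight
        self.last_access[key] = when

    def get_weight(self, term_a, term_b, when):
        """Look up a pair; a successful lookup refreshes its last access time."""
        key = self.pair_key(term_a, term_b)
        if key in self.weights:
            self.last_access[key] = when
        return self.weights.get(key, 0.0)


d = RelatedTermDictionary()
d.set_weight("server", "computer", 0.4, datetime(2007, 4, 1))
print(d.get_weight("computer", "server", datetime(2007, 4, 2)))  # 0.4
```

Storing each pair under an ordered key means the correlation between two terms is found regardless of the order in which they are given.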
By analyzing and learning from the queries and visits that users make on the search system, the invention updates the related-term dictionary and performs query expansion according to the semantic correlation between the terms in the dictionary. This assists users in querying, lightens their query burden and greatly improves precision and recall.
Description of drawings
Fig. 1 is an overall flowchart of the method of the invention;
Fig. 2 is a flowchart of a preferred embodiment of the method of the invention;
Fig. 3 is a flowchart of a further optimized embodiment of the method shown in Fig. 2;
Fig. 4 is a flowchart of the query process of the method of the invention;
Fig. 5 shows the query expansion result interface of the invention;
Fig. 6 is a structural diagram of the device of the invention;
Fig. 7 is a structural diagram of a further optimized embodiment of the device shown in Fig. 6;
Fig. 8 is a structural diagram of the related-term dictionary of the invention.
Embodiment
To make the purpose, scheme and effects of the invention clearer, embodiments of the invention are described in detail below with reference to the accompanying drawings:
As shown in Fig. 1, the overall flow of the query expansion method of the invention comprises:
(S1) dividing a user's query behavior records into at least one query unit according to the user's identifier and access times, each query unit comprising at least one query event;
(S2) periodically calculating the correlation between terms in two different query events of the same query unit, or the correlation between terms in any two query events;
(S3) updating the related-term dictionary according to the calculated correlations;
(S4) retrieving from the related-term dictionary the related terms with the greatest correlation to the term the user enters when querying, to form the query expansion result.
By analyzing and learning from the queries and visits that users make on the search system, the method updates the related-term dictionary and performs query expansion according to the semantic correlation between the terms in the dictionary. This assists users in querying, lightens their query burden and greatly improves precision and recall.
A first concrete implementation of the above method is described below with reference to Fig. 2:
In step (S1), the user query behavior record comprises the terms entered when the user queries and the query result entries the user clicks to visit.
In a concrete implementation, step (S1) can be:
(S11) The search system stores users' query behavior records in its text-format log file. The log file is first filtered so that only the terms entered in queries and the records of clicked query result entries are kept. This greatly reduces the amount of data to be processed and so improves the system's processing efficiency.
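The filtering in (S11) can be sketched as below. The log layout is entirely hypothetical, since the text does not describe it: each line is assumed to hold four tab-separated fields (timestamp, IP address, action, payload), with the actions `QUERY` and `CLICK` marking the two record types to keep.

```python
KEPT_ACTIONS = ("QUERY", "CLICK")  # hypothetical action names


def filter_log(lines):
    """Keep only query and click records from a text log (hypothetical format)."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip malformed lines
        timestamp, ip, action, payload = fields
        if action in KEPT_ACTIONS:
            kept.append((timestamp, ip, action, payload))
    return kept


sample = [
    "2007-04-01 09:00:00\t10.0.0.1\tQUERY\tcomputer\n",
    "2007-04-01 09:00:05\t10.0.0.1\tPAGEVIEW\t/help\n",
    "2007-04-01 09:00:09\t10.0.0.1\tCLICK\tdoc-17\n",
    "malformed line\n",
]
print(len(filter_log(sample)))  # 2
```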
(S12) The filtered log file is analyzed and learned from: the user's query behavior records are divided into at least one query event and query unit according to the user's IP address and access times.
A query unit is the process by which one of the user's information needs is met: the period of continuous use during which a user at the same IP address searches is recorded as one query unit.
The search system further divides a query unit by individual query; the resulting smaller unit is called a query event and comprises the terms the user entered and the corresponding query result entries the user clicked.
A query unit therefore comprises at least one query event. The time between two adjacent query events from the same IP user can be used to decide whether they belong to the same query unit: if that interval is less than a certain time threshold, the two query events belong to the same query unit.
The successive query events in a query unit form a process in which the user keeps adjusting the terms in light of the previous query results, so the queries made within one query unit are considered related.
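The division into query units can be sketched as follows. The event tuple layout, the function name and the 30-minute threshold are hypothetical; the text says only that a "certain time threshold" is used.

```python
from collections import defaultdict


def split_into_query_units(events, gap_seconds=1800):
    """Group query events into query units.

    events: iterable of (ip, timestamp_seconds, terms).
    Two adjacent events from the same IP belong to one unit when the time
    between them is less than gap_seconds (hypothetical default: 30 minutes).
    """
    by_ip = defaultdict(list)
    for event in events:
        by_ip[event[0]].append(event)

    units = []
    for ip in sorted(by_ip):
        ordered = sorted(by_ip[ip], key=lambda e: e[1])
        current = [ordered[0]]
        for previous, event in zip(ordered, ordered[1:]):
            if event[1] - previous[1] < gap_seconds:
                current.append(event)   # same information need continues
            else:
                units.append(current)   # gap too long: start a new unit
                current = [event]
        units.append(current)
    return units


events = [
    ("10.0.0.1", 0, ["computer"]),
    ("10.0.0.1", 120, ["laptop"]),
    ("10.0.0.1", 10000, ["weather"]),
    ("10.0.0.2", 60, ["server"]),
]
print([len(unit) for unit in split_into_query_units(events)])  # [2, 1, 1]
```

Here "computer" and "laptop" fall into one query unit and so become candidates for correlation, while "weather", entered much later, starts a new unit.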
In step (S2) of the flow shown in Fig. 1, the correlation between terms can be calculated in either of the following two ways:
(S21A) Periodically calculate the correlation between terms in two different query events of the same query unit, as follows:
$$\mathrm{Weight}(W_{in}, W_{jm}) = \mathrm{WeightFactor}(W_{in}, W_{jm}) \times \frac{\mathrm{OldWeight}(W_{in}, W_{jm})^{p} \times (a - 1) + 1}{a}$$
where [Figure C20071009750100132: formula image not reproduced] and $j > i$; $W_{in}$ denotes the $n$-th term of the $i$-th query and $W_{jm}$ the $m$-th term of the $j$-th query; $\mathrm{OldWeight}(W_{in}, W_{jm})$ is the previous correlation between terms $W_{in}$ and $W_{jm}$; $\mathrm{OldWeight}(W_{in}, W_{jm})^{p}$ takes values in $(0, 1)$; $p$ is a real number greater than zero; and $a$ is a natural number greater than 2. Preferably, $p$ is 1.
(S21B) Periodically calculate the correlation between terms in any two query events. The similarity between the query result entries the user clicked in a new query event and in an old query event is computed, and the correlation between the terms of the two events is inferred from it:
$$\mathrm{Sim}(S_{new}, S_{old}) = \frac{\mathrm{Common}(S_{new}.CDoc,\ S_{old}.CDoc)}{\mathrm{Max}(S_{new}.CDoc,\ S_{old}.CDoc)}$$
where $S_{new}$ is the new query event and $S_{old}$ is the old query event; $\mathrm{Sim}(S_{new}, S_{old})$ is the similarity between the query result entries the user clicked in the new and old query events; $S_{new}.CDoc$ is the list of IDs of the query result entries the user clicked in the new query event, and $S_{old}.CDoc$ is the corresponding list for the old query event; $\mathrm{Common}(S_{new}.CDoc, S_{old}.CDoc)$ is the number of clicked entries with the same ID in the two events; and $\mathrm{Max}(S_{new}.CDoc, S_{old}.CDoc)$ is the larger of the numbers of entries clicked in the two events.
If $\mathrm{Sim}(S_{new}, S_{old})$ exceeds the threshold preset by the system, the correlation between the terms of the new and old query events is calculated further; otherwise the following calculation is not performed.
The correlation between the terms of the new and old query events is calculated as:
$$\mathrm{Weight}(W_{\mathrm{new},i}, W_{\mathrm{old},j}) = \mathrm{Sim}(S_{new}, S_{old}) \times \frac{\mathrm{OldWeight}(W_{\mathrm{new},i}, W_{\mathrm{old},j})^{q} \times (a - 1) + 1}{a}$$
where $W_{\mathrm{new},i}$ is the $i$-th term entered in the new query event and $W_{\mathrm{old},j}$ is the $j$-th term entered in the old query event; $\mathrm{Weight}(W_{\mathrm{new},i}, W_{\mathrm{old},j})$ is the correlation between the two terms and $\mathrm{OldWeight}(W_{\mathrm{new},i}, W_{\mathrm{old},j})$ is their previous correlation; $\mathrm{OldWeight}(W_{\mathrm{new},i}, W_{\mathrm{old},j})^{q}$ takes values in $(0, 1)$; $q$ is a real number greater than zero; and $a$ is a natural number greater than 2. Preferably, $q$ is 1.
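The thresholded calculation of (S21B) can be sketched for all term pairs of two events at once. Several points are assumptions and not taken from the text: Sim is read as multiplying the whole fraction, a pair not yet stored is treated as having an old correlation of 0, and the threshold value 0.5 and all names are hypothetical.

```python
from itertools import product


def update_from_similar_events(weights, new_terms, old_terms, sim,
                               threshold=0.5, a=3, q=1.0):
    """Update pair correlations in place when two events' click similarity is high.

    weights: dict mapping frozenset({term1, term2}) -> correlation.
    Assumptions: Sim multiplies the whole fraction; unseen pairs start at 0.
    """
    if sim <= threshold:
        return weights  # similarity too low: no calculation is performed
    for w_new, w_old in product(new_terms, old_terms):
        if w_new == w_old:
            continue  # a term is not paired with itself
        key = frozenset((w_new, w_old))
        old = weights.get(key, 0.0)
        weights[key] = sim * (old ** q * (a - 1) + 1) / a
    return weights


weights = {}
update_from_similar_events(weights, ["laptop"], ["notebook computer"], sim=0.6)
pair = frozenset(("laptop", "notebook computer"))
print(round(weights[pair], 3))  # 0.2
```

Two users who phrase a need differently but click the same results thus contribute to the correlation between their terms, even though the terms share no characters.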
Of course, the calculation of the correlation between terms is not limited to using these two methods separately: the methods of (S21A) and (S21B) can also be combined, so that the calculated correlations are more comprehensive and accurate and the user is offered more comprehensive and accurate query expansion results when querying.
In steps (S21A) and (S21B), the period can be set to one day, and the calculation can generally be scheduled for the early morning, when the search system's servers are relatively idle.
In step (S3) of the flow shown in Fig. 1, updating the related-term dictionary according to the calculated correlations comprises:
(S31) If a pair of terms whose correlation has been calculated is already stored in the related-term dictionary, the correlation value stored for that pair is raised; otherwise, the pair is added to the related-term dictionary.
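Step (S31) can be sketched as a small helper. The text says only that the stored value is raised; keeping the larger of the stored and newly calculated values is an assumption made here so that an update never lowers a correlation, and the names are hypothetical.

```python
def apply_calculated_weight(dictionary, term_a, term_b, calculated):
    """Store a calculated correlation for a pair of terms.

    dictionary: dict mapping frozenset({term_a, term_b}) -> correlation.
    Assumption: "raising" is implemented as keeping the larger value.
    """
    key = frozenset((term_a, term_b))
    if key in dictionary:
        dictionary[key] = max(dictionary[key], calculated)  # raise, never lower
    else:
        dictionary[key] = calculated                        # new pair: add it
    return dictionary[key]


d = {}
apply_calculated_weight(d, "computer", "server", 0.3)         # pair added
apply_calculated_weight(d, "server", "computer", 0.2)         # lower value ignored
print(apply_calculated_weight(d, "computer", "server", 0.5))  # 0.5
```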
As shown in Fig. 3, after the related-term dictionary is updated in step (S31), it can be further adjusted, as follows:
(S32) If a pair of correlated terms in the related-term dictionary has not been accessed within a certain time interval, the correlation value between the pair is reduced.
The correlation value is reduced as follows:
$$\mathrm{Weight} = \frac{n \times \mathrm{OldWeight} - 1}{n - 1}$$
where OldWeight is the pair's previous correlation value and $n$ is a natural number greater than 2.
In this step, the time interval can be set to 15 days in practice. User query behavior records contain various kinds of noise. For terms that have gone unaccessed for a long time (that is, noise), the correlation is gradually reduced, which lessens the influence of noise on the system.
For example, during the craze for the "Super Girl" contest, terms such as "corn" and "bean jelly" co-occurred very frequently, but once the craze had passed they were very unlikely to co-occur. Therefore, for related terms in the dictionary that have not been accessed within a certain time, the correlation is reduced gradually until it finally reaches 0. This lessens the influence on the system of terms that appear frequently for a short time and then only rarely for a long time afterwards.
However, for terms whose correlation is fixed at 1, the correlation is not reduced.
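The adjustment step can be sketched as one pass over the dictionary. The 15-day interval, the reduction formula, the floor at 0 and the exemption for pairs fixed at 1 come from the text; the data layout, the function name and the choice n = 4 in the example are hypothetical.

```python
from datetime import datetime, timedelta


def decay_stale_pairs(weights, last_access, now, n=3,
                      max_idle=timedelta(days=15)):
    """Reduce the correlation of pairs not accessed within max_idle.

    weights:     dict pair -> correlation value
    last_access: dict pair -> datetime of the last access
    Applies Weight = (n * OldWeight - 1) / (n - 1), floored at 0.
    """
    for pair, old in list(weights.items()):
        if old >= 1.0:
            continue  # correlation fixed at 1: never reduced
        if now - last_access[pair] < max_idle:
            continue  # accessed recently: leave unchanged
        weights[pair] = max(0.0, (n * old - 1) / (n - 1))
    return weights


now = datetime(2007, 5, 1)
stale, fresh = datetime(2007, 4, 1), datetime(2007, 4, 28)
weights = {"A": 0.625, "B": 1.0, "C": 0.9, "D": 0.2}
last_access = {"A": stale, "B": stale, "C": fresh, "D": stale}
print(decay_stale_pairs(weights, last_access, now, n=4))
# {'A': 0.5, 'B': 1.0, 'C': 0.9, 'D': 0.0}
```

Running such a pass once per period lets a pair that was popular only briefly fade out over successive periods, while pairs that users keep accessing are untouched.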
The concrete implementation of step (S4) in the flow shown in Fig. 1 is described in detail below with reference to Fig. 4. It comprises:
(S41) the user enters a term through the search engine when querying;
(S42) the query expansion result is obtained from the related-term dictionary;
(S43) the IDs of the query result entries and the query expansion result are displayed;
(S44) it is determined whether the user is satisfied with the query results; if so, retrieval ends; otherwise, a new query expression is built from the query expansion result and the query is run again.
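Step (S42) can be sketched as a lookup that returns the terms most strongly correlated with the user's term. The dictionary layout (pairs of terms mapped to a correlation), the function name, the `top_k` cut-off and the sample values are hypothetical.

```python
def expand_query(dictionary, term, top_k=5):
    """Return up to top_k related terms for a term, strongest correlation first.

    dictionary: dict mapping (term1, term2) -> correlation value.
    """
    related = []
    for (first, second), weight in dictionary.items():
        if term == first:
            related.append((second, weight))
        elif term == second:
            related.append((first, weight))
    related.sort(key=lambda item: item[1], reverse=True)
    return [other for other, _ in related[:top_k]]


dictionary = {
    ("computer", "server"): 0.8,
    ("workstation", "computer"): 0.6,
    ("corn", "bean jelly"): 0.9,
}
print(expand_query(dictionary, "computer", top_k=2))  # ['server', 'workstation']
```

The suggestions need not contain the characters of the user's term, because they come from learned correlations rather than from string matching.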
In step (S43), the query expansion results can be arranged in a certain order for the user's convenience. One sorting method orders them by the size of the correlation between the term the user entered when querying and each related term.
To make the query expansion result match the user's needs better, the results can instead be sorted by the weight value of each related term, calculated as $w = s \times \lg(n + 1)$;
here $s$ is the correlation between the term the user entered when querying and the related term, and $n$ is the number of times the related term has been searched.
The popularity with which a term is searched can further be used as a basis for sorting the query expansion results.
The query expansion result interface is shown in Fig. 5. The user enters the term "computer" through the search engine, and the query expansion results are "mainframe", "server", "workstation", "computing machine" and so on, rather than only results that contain the characters of "computer"; to a certain extent this breaks through the restriction of the term the user initially entered. "Mainframe", "server", "workstation" and "computing machine" are called related terms. They are sorted by the size of their correlation, or by their correlation combined with the number of times they have been searched, which better satisfies the user's actual information needs and improves the user's search efficiency.
In existing query expansion methods, the expansion results do not take part in the current search: they are shown directly to the user, who selects a search term from them and then searches again. In the query expansion method of the present invention, the expansion results can be combined with the search term entered by the user to build a new query expression, or can serve directly as the query expression, and be offered to the user for use.
Alternatively, the expansion results are combined with the search term entered by the user through an "OR" relation, the resulting query expression takes part in the current query, and the query results obtained are shown to the user.
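As a rough illustration of the "OR" combination, the sketch below joins the user's search term and the expansion results into one expression. The quoted `OR` syntax and the function name are assumptions; the text does not specify the query language of the search system.

```python
def build_or_query(user_term: str, related_terms: list[str]) -> str:
    """Join the user's term and the expansion results with OR, dropping duplicates."""
    seen = []
    for term in [user_term, *related_terms]:
        if term and term not in seen:
            seen.append(term)
    # Quoting each term is an assumption about the target query syntax.
    return " OR ".join(f'"{t}"' for t in seen)


query = build_or_query("computer", ["server", "workstation", "server"])
```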
With the method of the present invention, the log file can be learned from quickly and conveniently, and the related search term dictionary can be supplemented and updated automatically. Query expansion is thus performed according to the semantic associations between search terms without extensive manual intervention, which assists the user in querying, lightens the user's query burden, and improves the precision and recall of the information the user retrieves.
To accompany the query expansion method above, the present invention also provides a query expansion device. As shown in Figure 6, it comprises:
A query behavior record division unit 61, which divides a user's query behavior records into at least one query unit according to the user's identifier and access time, each query unit comprising at least one query event;
A relevance calculation unit 62, which periodically calculates the relevance between search terms in two different query events of the same query unit, or the relevance between search terms in any two query events;
A related search term dictionary update unit 63, which updates the related search term dictionary according to the calculated relevance between search terms;
A retrieval unit 64, which retrieves from the related search term dictionary the related search terms most relevant to the search term entered by the user when querying, forming the query expansion results.
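The division performed by unit 61 can be sketched as follows. The sketch assumes, as the claims state, that the identifier is the IP address and that two adjacent query events of the same IP address belong to the same query unit when they are less than a time threshold apart. The event layout `(ip, timestamp_seconds, terms)` and the 1800-second default are hypothetical; the text only speaks of "a certain time threshold".

```python
from collections import defaultdict


def split_into_query_units(events, threshold_seconds=1800):
    """Group (ip, timestamp_seconds, terms) events into query units.

    Adjacent events of one IP address stay in the same unit while the gap
    between them is below threshold_seconds (an assumed default value).
    """
    by_ip = defaultdict(list)
    for event in events:
        by_ip[event[0]].append(event)

    units = []
    for ip_events in by_ip.values():
        ip_events.sort(key=lambda e: e[1])  # order by access time
        current = [ip_events[0]]
        for prev, nxt in zip(ip_events, ip_events[1:]):
            if nxt[1] - prev[1] < threshold_seconds:
                current.append(nxt)
            else:
                units.append(current)
                current = [nxt]
        units.append(current)
    return units


events = [
    ("1.1.1.1", 0, ["computer"]),
    ("1.1.1.1", 100, ["server"]),
    ("1.1.1.1", 5000, ["printer"]),
    ("2.2.2.2", 50, ["workstation"]),
]
units = split_into_query_units(events)
```

Each resulting unit contains at least one query event, and only events inside the same unit are paired when relevance is computed within a query unit.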
Further, as shown in Figure 7, the query expansion device of the present invention also comprises a related search term dictionary regulating unit 65, which, based on the query expansion results formed by the retrieval unit, reduces the relevance value of any pair of search terms in the related search term dictionary that has not been accessed within a certain time interval.
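The reduction applied by the regulating unit can be illustrated with the formula given in the claims, Weight = (n × OldWeight − 1) / (n − 1), with n a natural number greater than 2. In the sketch below the default n = 10, the idle limit, and the choice to drop a pair once its reduced value is no longer positive are assumptions; the text does not say what happens in that case.

```python
def decay_relevance(old_weight: float, n: int = 10) -> float:
    """Apply Weight = (n * OldWeight - 1) / (n - 1), with n a natural number > 2."""
    if n <= 2:
        raise ValueError("n must be a natural number greater than 2")
    return (n * old_weight - 1) / (n - 1)


def regulate(pairs: dict, now: float, max_idle: float, n: int = 10) -> dict:
    """Return a regulated copy of {pair: (weight, last_access_time)}.

    Pairs idle for longer than max_idle are reduced once. Dropping a pair whose
    reduced weight is no longer positive is an assumption of this sketch.
    """
    regulated = {}
    for pair, (weight, last_access) in pairs.items():
        if now - last_access > max_idle:
            weight = decay_relevance(weight, n)
            if weight <= 0:
                continue
        regulated[pair] = (weight, last_access)
    return regulated


pairs = {(1, 2): (0.55, 0.0), (1, 3): (0.05, 0.0), (2, 3): (0.55, 90.0)}
result = regulate(pairs, now=100.0, max_idle=50.0)
```

Two properties follow directly from the formula: a relevance value of exactly 1 is left unchanged, and any value at or below 1/n becomes non-positive after one reduction.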
In addition, the related search term dictionary used in the query expansion method of the present invention has the structure shown in Figure 8 and comprises:
A search term storage unit 81, which stores the search terms entered for the queries in the user's query behavior records;
A relevance value storage unit 82, which stores the relevance values between the search terms in the search term storage unit;
A last access time recording unit 83, which records the last access time of the search terms that have a relevance value in the relevance value storage unit.
Its concrete structure is shown in the following tables:

| Name | Type | Description |
| --- | --- | --- |
| Word | Nvarchar | Search term |
| WordID | Int | Key, auto-increment |

Table 1

| Name | Type | Description |
| --- | --- | --- |
| Relevance ID | Int | Key, auto-increment |
| WordID 1 | Int | ID of the corresponding search term in Table 1 |
| WordID 2 | Int | ID of the corresponding search term in Table 1 |
| Weight | Double | Relevance between the two search terms, in the interval (0, 1] |

Table 2

| Name | Type | Description |
| --- | --- | --- |
| Relevance ID | Int | Relevance ID in Table 2 |
| LastAccessTime | DateTime | Time of the last access |

Table 3
Table 1 corresponds to the search term storage unit, Table 2 to the relevance value storage unit, and Table 3 to the last access time recording unit.
In Table 2, that is, in the relevance value storage unit, the two search term IDs of each record are kept in a fixed ascending or descending order, so that either WordID 1 > WordID 2 or WordID 1 < WordID 2 holds throughout.
In Table 3, that is, in the last access time recording unit, the last access times of the search terms that have a relevance value are arranged chronologically, which speeds up data access.
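The three tables can be mirrored by a small in-memory structure, sketched below under stated assumptions. The class and method names are hypothetical, the ascending order WordID 1 < WordID 2 is chosen arbitrarily from the two permitted orders, and treating a retrieval as an "access" that refreshes LastAccessTime is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class RelatedTermDictionary:
    word_ids: dict = field(default_factory=dict)     # Table 1: Word -> WordID
    weights: dict = field(default_factory=dict)      # Table 2: (WordID 1, WordID 2) -> Weight
    last_access: dict = field(default_factory=dict)  # Table 3: pair -> LastAccessTime

    def word_id(self, word: str) -> int:
        """Return the WordID of a term, assigning the next incremental ID if new."""
        if word not in self.word_ids:
            self.word_ids[word] = len(self.word_ids) + 1
        return self.word_ids[word]

    def _pair(self, a: str, b: str) -> tuple:
        """Canonical key with WordID 1 < WordID 2, so each pair is stored once."""
        i, j = self.word_id(a), self.word_id(b)
        return (i, j) if i < j else (j, i)

    def set_weight(self, a: str, b: str, weight: float, now: float) -> None:
        """Store the relevance of a pair and record when it was written."""
        if a == b:
            raise ValueError("a pair needs two different search terms")
        if not 0 < weight <= 1:
            raise ValueError("Weight must lie in the interval (0, 1]")
        pair = self._pair(a, b)
        self.weights[pair] = weight
        self.last_access[pair] = now

    def related(self, word: str, now: float, top_k: int = 5) -> list:
        """Return up to top_k related terms by descending weight and mark them accessed."""
        if word not in self.word_ids:
            return []
        wid = self.word_ids[word]
        words = {i: w for w, i in self.word_ids.items()}
        hits = sorted(
            (pair for pair in self.weights if wid in pair),
            key=lambda pair: self.weights[pair],
            reverse=True,
        )[:top_k]
        for pair in hits:
            self.last_access[pair] = now
        return [words[pair[0] if pair[1] == wid else pair[1]] for pair in hits]


d = RelatedTermDictionary()
d.set_weight("computer", "server", 0.8, now=1.0)
d.set_weight("workstation", "computer", 0.6, now=2.0)
expansion = d.related("computer", now=5.0)
```

Keeping the two IDs of a pair in one fixed order means that ("workstation", "computer") and ("computer", "workstation") map to the same record, which is one plausible reason for the ordering rule on Table 2.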
The technical effects of the present invention are as follows:
Existing mainstream search systems have essentially all adopted global search technology, and users who lack search experience often run into difficulty when choosing search terms to build a search expression; having the search system provide query expansion can improve retrieval effectiveness. The method of the present invention collects user query behavior records and analyzes and learns from them to discover the associations between search terms, automatically supplementing the related search term dictionary. Query expansion is thus performed according to the semantic associations between search terms without extensive manual intervention, which effectively helps the user query accurately and efficiently. Because no large investment of manpower is needed to maintain the related search term dictionary, the method suits search systems with large data volumes and heavy traffic and greatly improves the user's search efficiency, achieving high precision and high recall.
The preferred embodiments above serve only to describe the technical solution of the present invention in detail. Equivalent changes and modifications that those skilled in the art make to the present invention without departing from its spirit and principle shall all fall within the scope of protection defined by the present invention.

Claims (21)

1. A query expansion method, characterized by comprising the following steps:
(A) dividing a user's query behavior records into at least one query unit according to the user's identifier and access time, each query unit comprising at least one query event;
(B) periodically calculating the relevance between search terms in two different query events of the same query unit, or the relevance between search terms in any two query events, and updating the related search term dictionary according to the calculated relevance between search terms;
(C) retrieving from the related search term dictionary the related search terms most relevant to the search term entered by the user when querying, to form the query expansion results.
2. The query expansion method according to claim 1, characterized in that, in step (A), the query behavior records comprise the search terms entered by the user when querying and the query result entries the user clicked to visit; and the identifier is the IP address.
3. The query expansion method according to claim 2, characterized in that, if the time between two adjacent query events of a user with the same IP address is less than a certain time threshold, the two query events belong to the same query unit.
4. The query expansion method according to claim 1, characterized in that, in step (B), the relevance between search terms in two different query events of the same query unit is calculated as:
$$\mathrm{Weight}(W_{in}, W_{jm}) = \frac{\mathrm{WeightFactor}(W_{in}, W_{jm}) \times \mathrm{OldWeight}(W_{in}, W_{jm})^{p} \times (a - 1) + 1}{a}$$
where
[Figure C2007100975010002C2]
and $j > i$; $W_{in}$ denotes the n-th search term of the i-th query and $W_{jm}$ the m-th search term of the j-th query; $\mathrm{OldWeight}(W_{in}, W_{jm})$ is the old relevance between the search terms $W_{in}$ and $W_{jm}$; $\mathrm{OldWeight}(W_{in}, W_{jm})^{p}$ takes values in (0, 1); p is a real number greater than zero; and a is a natural number greater than 2.
5. The query expansion method according to claim 4, characterized in that the value of p is 1.
6. The query expansion method according to claim 1, 4 or 5, characterized in that, in step (B), calculating the relevance between search terms in any two query events comprises: calculating the similarity between the query result entries the user clicked to visit in a new query event and in an old query event, and inferring from it the relevance between the search terms in the query events, as follows:
$$\mathrm{Sim}(S_{new}, S_{old}) = \frac{\mathrm{Common}(S_{new}.CDoc, S_{old}.CDoc)}{\mathrm{Max}(S_{new}.CDoc, S_{old}.CDoc)}$$
where $S_{new}$ is the new query event and $S_{old}$ is the old query event; $\mathrm{Sim}(S_{new}, S_{old})$ is the similarity between the query result entries the user clicked to visit in the new query event and in the old query event; $S_{new}.CDoc$ is the list of IDs of the query result entries the user clicked to visit in the new query event, and $S_{old}.CDoc$ is the corresponding list for the old query event; $\mathrm{Common}(S_{new}.CDoc, S_{old}.CDoc)$ is the number of clicked entries with the same ID in the query results of the new and old query events; and $\mathrm{Max}(S_{new}.CDoc, S_{old}.CDoc)$ is the larger of the numbers of query result entries the user clicked to visit in the new query event and in the old query event.
7. The query expansion method according to claim 6, characterized in that, if the value of $\mathrm{Sim}(S_{new}, S_{old})$ exceeds a threshold preset by the system, the relevance between the search terms in the new query event and the old query event is further calculated as:
$$\mathrm{Weight}(W_{newi}, W_{oldj}) = \frac{\mathrm{Sim}(S_{new}, S_{old}) \times \mathrm{OldWeight}(W_{newi}, W_{oldj})^{q} \times (a - 1) + 1}{a}$$
where $W_{newi}$ is the i-th search term entered in the new query event and $W_{oldj}$ is the j-th search term entered in the old query event; $\mathrm{Weight}(W_{newi}, W_{oldj})$ is the relevance between the search terms $W_{newi}$ and $W_{oldj}$; $\mathrm{OldWeight}(W_{newi}, W_{oldj})$ is their old relevance; $\mathrm{OldWeight}(W_{newi}, W_{oldj})^{q}$ takes values in (0, 1); q is a real number greater than zero; and a is a natural number greater than 2.
8. The query expansion method according to claim 7, characterized in that the value of q is 1.
9. The query expansion method according to claim 8, characterized in that, in step (B), updating the related search term dictionary according to the calculated relevance between search terms comprises:
if a pair of search terms whose relevance has been calculated is already stored in the related search term dictionary, increasing the relevance value stored in the dictionary for that pair; otherwise, adding the pair of search terms to the related search term dictionary.
10. The query expansion method according to claim 9, characterized in that, after the step of updating the related search term dictionary according to the calculated relevance between search terms, the method further comprises a step of regulating the related search term dictionary:
if a pair of search terms with a relevance value in the related search term dictionary has not been accessed within a certain time interval, reducing the relevance value of that pair.
11. The query expansion method according to claim 10, characterized in that the relevance value of the pair of search terms is reduced as follows:
$$\mathrm{Weight} = \frac{n \times \mathrm{OldWeight} - 1}{n - 1}$$
where OldWeight is the old relevance value of the pair of search terms and n is a natural number greater than 2.
12. The query expansion method according to claim 1, characterized in that the query expansion results are sorted query expansion results, and the sorting method comprises: sorting the related search terms by the size of their relevance to the search term entered by the user when querying.
13. The query expansion method according to claim 1, characterized in that the query expansion results are sorted query expansion results, and the sorting method comprises:
sorting by the weight of each related search term, the weight being computed as: w = s × lg(n + 1);
where s is the relevance between the search term entered by the user when querying and the related search term, and n is the number of times the related search term has been searched.
14. The query expansion method according to claim 13, characterized in that, in step (C), the query expansion results can be combined with the search term entered by the user when querying to build a new query expression, or can serve directly as the query expression, and be offered to the user for use;
or the query expansion results are combined with the search term entered by the user when querying through an "OR" relation, the resulting query expression takes part in the current query, and the query results obtained are shown to the user.
15. The query expansion method according to claim 1, characterized in that, before step (A), the method further comprises: the search system storing the user's query behavior records in a log file of the search system, and filtering the log file.
16. The query expansion method according to claim 1, characterized in that, in step (B), the period is one day.
17. A query expansion device, characterized by comprising:
a query behavior record division unit, which divides a user's query behavior records into at least one query unit according to the user's identifier and access time, each query unit comprising at least one query event;
a relevance calculation unit, which periodically calculates the relevance between search terms in two different query events of the same query unit, or the relevance between search terms in any two query events;
a related search term dictionary update unit, which updates the related search term dictionary according to the calculated relevance between search terms;
a retrieval unit, which retrieves from the related search term dictionary the related search terms most relevant to the search term entered by the user when querying, forming the query expansion results.
18. The query expansion device according to claim 17, characterized by further comprising: a related search term dictionary regulating unit, which, based on the query expansion results formed by the retrieval unit, reduces the relevance value of any pair of search terms in the related search term dictionary that has not been accessed within a certain time interval.
19. The query expansion device according to claim 17, characterized in that the related search term dictionary comprises:
a search term storage unit, which stores the search terms entered for the queries in the user's query behavior records;
a relevance value storage unit, which stores the relevance values between the search terms in the search term storage unit;
a last access time recording unit, which records the last access time of the search terms that have a relevance value in the relevance value storage unit.
20. The query expansion device according to claim 19, characterized in that the search term IDs in the relevance value storage unit are arranged in ascending or descending order.
21. The query expansion device according to claim 19 or 20, characterized in that, in the last access time recording unit, the last access times of the search terms that have a relevance value are arranged chronologically.
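As an illustration of claims 6 and 7, the sketch below computes the click similarity and the resulting relevance. Two points are assumptions: the clicked-entry ID lists are treated as sets, and the relevance formula is read with the whole expression Sim × OldWeight^q × (a − 1) + 1 as the numerator over a. The function names and the default a = 3 are hypothetical.

```python
def click_similarity(new_clicked, old_clicked) -> float:
    """Sim = Common / Max over the clicked-entry IDs of two query events."""
    new_ids, old_ids = set(new_clicked), set(old_clicked)
    largest = max(len(new_ids), len(old_ids))
    if largest == 0:
        return 0.0  # no clicks in either event: treated as no similarity (assumption)
    return len(new_ids & old_ids) / largest


def updated_relevance(factor: float, old_weight: float, a: int = 3, exponent: float = 1.0) -> float:
    """(factor * old_weight**exponent * (a - 1) + 1) / a, under the reading stated above."""
    if a <= 2:
        raise ValueError("a must be a natural number greater than 2")
    return (factor * old_weight ** exponent * (a - 1) + 1) / a


sim = click_similarity([1, 2, 3], [2, 3, 4, 5])
weight = updated_relevance(sim, old_weight=0.5)
```

With factor 1 and exponent 1 this update is the exact inverse of the reduction in claim 11 when a = n, since W = (OldWeight × (a − 1) + 1) / a gives OldWeight = (a × W − 1) / (a − 1).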

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200710097501 CN100595759C (en) 2007-04-25 2007-04-25 Method and device for enquire enquiry extending as well as related searching word stock

Publications (2)

Publication Number Publication Date
CN101281523A CN101281523A (en) 2008-10-08
CN100595759C true CN100595759C (en) 2010-03-24

Family

ID=40013999

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200710097501 Expired - Fee Related CN100595759C (en) 2007-04-25 2007-04-25 Method and device for enquire enquiry extending as well as related searching word stock

Country Status (1)

Country Link
CN (1) CN100595759C (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI676906B (en) * 2018-04-13 2019-11-11 和碩聯合科技股份有限公司 Prompt method and computer system thereof

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101770499A (en) * 2009-01-07 2010-07-07 上海聚力传媒技术有限公司 Information retrieval method in search engine and corresponding search engine
CN101650742B (en) * 2009-08-27 2015-01-28 中兴通讯股份有限公司 System and method for prompting search condition during English search
CN101853298B (en) * 2010-05-26 2012-08-15 上海大学 Event-oriented query expansion method
CN102646096A (en) * 2011-02-18 2012-08-22 鸿富锦精密工业(深圳)有限公司 Linked word searching system and method
CN102262660B (en) * 2011-07-15 2013-05-29 北京百度网讯科技有限公司 Method and device implemented by computer and used for obtaining search result
CN102915314B (en) * 2011-08-05 2018-07-31 深圳市世纪光速信息技术有限公司 A kind of Automatic error correction pair generation method and system
CN102955821A (en) * 2011-08-30 2013-03-06 北京百度网讯科技有限公司 Method and device for carrying out expansion processing on query sequence
CN103207881B (en) * 2012-01-17 2016-03-02 阿里巴巴集团控股有限公司 Querying method and device
CN103294670B (en) * 2012-02-22 2018-07-06 腾讯科技(深圳)有限公司 A kind of searching method and system based on vocabulary
CN103365910B (en) * 2012-04-06 2017-02-15 腾讯科技(深圳)有限公司 Method and system for information retrieval
CN103678415A (en) * 2012-09-25 2014-03-26 腾讯科技(深圳)有限公司 Method and device for processing project log data
CN103970748B (en) * 2013-01-25 2019-01-29 腾讯科技(深圳)有限公司 A kind of related keyword words recommending method and device
CN104462375B (en) * 2014-12-09 2018-08-10 北京百度网讯科技有限公司 Search processing method based on barrage media and system
CN105975596A (en) * 2016-05-10 2016-09-28 上海珍岛信息技术有限公司 Query expansion method and system of search engine
CN107220380A (en) * 2017-06-27 2017-09-29 北京百度网讯科技有限公司 Question and answer based on artificial intelligence recommend method, device and computer equipment
CN107885875B (en) * 2017-11-28 2022-07-08 北京百度网讯科技有限公司 Synonymy transformation method and device for search words and server
CN108959555B (en) * 2018-06-29 2020-10-30 北京百度网讯科技有限公司 Query type expansion method and device, computer equipment and storage medium
CN111597297A (en) * 2019-02-21 2020-08-28 北京京东尚科信息技术有限公司 Article recall method, system, electronic device and readable storage medium
CN111143666A (en) * 2019-12-04 2020-05-12 深圳市智微智能软件开发有限公司 Steel mesh inventory query method and system
CN115221872B (en) * 2021-07-30 2023-06-02 苏州七星天专利运营管理有限责任公司 Vocabulary expansion method and system based on near-sense expansion
CN113609370B (en) * 2021-08-06 2023-12-12 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN101281523A (en) 2008-10-08

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220615

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: FOUNDER APABI TECHNOLOGY Ltd.

Patentee after: Peking University

Address before: 100871, Beijing, Haidian District, Cheng Fu Road, No. 298, Zhongguancun Fangzheng building, 5 floor

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: FOUNDER APABI TECHNOLOGY Ltd.

Patentee before: Peking University

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100324