Background Art
With the rapid development of computer and Internet technology, the volume of digitized information on the network has grown sharply. Faced with this ocean of electronic information, search systems have become an important tool for making effective use of Internet resources. Existing mainstream search systems have essentially all adopted full-text search technology, whose principle is as follows: the content to be retrieved is divided into short word sequences, and an index is generated over the character strings in each word sequence. When the user enters a search term or sentence, it is segmented in the same way and compared with the index, and a list of links to the articles under the matching vocabulary is then displayed to the user.
Full-text search technology falls into two classes according to application: Internet search engines and professional retrieval systems.
An Internet search engine targets the large number of disorganized web pages on the Internet. Its main purpose is to find useful reference information and to screen out harmful information, ranking useful pages as near the top as possible.
A professional retrieval system, by contrast, works on information that has already been curated and is comparatively useful, and it requires query results with both high recall and high precision. Recall is the ratio of the amount of relevant data retrieved by a given search to the total amount of relevant data in the system's document store; it reflects how comprehensive the retrieved information is. Precision is the ratio of the amount of useful data retrieved by a given search to the total amount of data retrieved; it is the key to ensuring that useful data is found.
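The two ratios defined above can be illustrated with a minimal sketch. The function name and the representation of documents as ID collections are assumptions made for illustration only.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    retrieved: IDs returned by the search (hypothetical representation)
    relevant:  IDs of all relevant documents in the document store
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For instance, a search that returns four documents of which two are relevant, against a store holding three relevant documents, has a precision of 2/4 and a recall of 2/3.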
Most users, however, search for information mainly with tools such as Internet search engines. When a user searches, the search engine matches against the search terms the user entered and displays the list of matching result IDs to the user. Such searches suffer from the following two problems:
(1) Chinese contains a large number of synonyms and polysemous words, and Chinese expression is highly varied. The search terms a user chooses when building a query expression are often not fully standard and may differ greatly from the words or phrases used in the search system's document index, which causes the "expression mismatch" problem in information retrieval.
(2) Users enter few search terms per query, typically one or two, which cannot describe their information need in detail. The problem is especially pronounced when users are themselves unclear about what information they need.
In these circumstances the user's initial query is usually rough, insufficiently professional, and incomplete, and the results it returns cannot satisfy the user's need well.
To address these problems, existing search systems adopt query expansion technology. Query expansion displays other related search terms associated with the term the user entered, helping the user reformulate an accurate and effective query expression. To some extent this compensates for the gap between the user's wording and that of candidate passages, so that the candidate documents the user needs are retrieved with as few omissions as possible. It makes retrieval more precise and reduces the user burden (the total effort the user expends during retrieval). For the majority of non-professional users, and for information seekers who say "I do not know what I want, but I know it when I see it", it can improve retrieval efficiency effectively.
In practice, however, search systems that provide query expansion have been found to have the following problems:
First, current Chinese search engines build a retrieval dictionary from statistics on the search terms users enter and the number of times each is searched. When a user searches, the characters of the entered term are matched against this retrieval dictionary, the matches are sorted by how well the characters match and by how often each has been searched, and the sorted list is presented to the user as the query expansion result. For example, if the user enters "computer", terms related to "computer" such as "computer newspaper", "computer quotation", and "notebook computer" appear in the query expansion result. This approach is convenient and intuitive and can suggest possible related search terms, but the expansion result is still constrained by the term the user initially chose. It does not adapt to the new words that appear rapidly on the network, nor can it meet the demand for news-type related words that change over time, so the user's search efficiency cannot be guaranteed.
Second, professional retrieval systems, represented by PubMed and the like, rely on a dictionary compiled by professionals that defines the relations between words, such as synonyms and related terms. Query expansion through this dictionary achieves high recall and high precision, and also standardizes the user's query words and assists the user's query. However, this approach requires professionals to compile the dictionary and define the relations between words, which is time-consuming and costly to maintain. It is generally suited only to search systems in professional domains or to thematic database query systems, not to large-volume search systems in non-professional domains. Moreover, this approach likewise does not adapt to rapidly appearing new words and cannot handle news-type related words that change over time.
In summary, general-purpose search systems currently have no comparably good method that addresses these user search problems, performs query expansion automatically and efficiently, and provides effective support for the user's retrieval behavior.
Summary of the invention
In view of the above problems in the prior art, the object of the present invention is to provide a query expansion method and device and a related-term dictionary. Query expansion performed with this related-term dictionary assists the user's query, reduces the user's query burden, and improves the user's search efficiency.
To achieve the above object, the invention provides a query expansion method comprising the following steps:
(A) dividing a user's query behavior records into at least one query unit according to the user's identifier and access time, each query unit comprising at least one query event;
(B) periodically calculating the relevance between search terms in two different query events of the same query unit, or the relevance between search terms in any two query events, and updating the related-term dictionary according to the calculated relevance;
(C) when the user queries, retrieving from the related-term dictionary the related search terms having the highest relevance to the search term the user entered, to form the query expansion result.
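The lookup in step (C) can be sketched as follows. This is only an illustration: the in-memory mapping from unordered term pairs to relevance values, the function name, and the cutoff `k` are all assumptions, not part of the described method.

```python
def expand_query(term, related_terms, k=3):
    """Return up to k terms most relevant to `term`.

    related_terms: hypothetical in-memory stand-in for the dictionary of
    related search terms, mapping frozenset({term_a, term_b}) to a
    relevance value.
    """
    scored = []
    for pair, weight in related_terms.items():
        if term in pair and len(pair) == 2:
            (other,) = pair - {term}  # the partner term of the pair
            scored.append((weight, other))
    # Highest relevance first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [other for _, other in scored[:k]]
```

Pairs that do not contain the entered term are ignored, so the result depends only on stored relevance and not on shared characters.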
In step (A), the query behavior record comprises the search terms entered when the user queries and the query result entries the user clicked to visit; the identifier is the IP address.
Each query unit comprises at least one query event. If the time between two adjacent query events from the same IP address is less than a given time threshold, the two query events belong to the same query unit.
In step (B), the relevance between search terms in two different query events of the same query unit is calculated as follows:
where j > i; $W_{in}$ denotes the n-th search term of the i-th query and $W_{jm}$ denotes the m-th search term of the j-th query; $\mathrm{OldWeight}(W_{in}, W_{jm})$ is the old relevance between $W_{in}$ and $W_{jm}$; the value of $\mathrm{OldWeight}(W_{in}, W_{jm})^{p}$ lies in (0, 1); p is a real number greater than zero; and a is a natural number greater than 2.
Preferably, p is 1.
In step (B), the method of calculating the relevance between search terms in any two query events comprises: calculating the similarity between the query result entries the user clicked in a new query event and those clicked in an old query event, and inferring from it the relevance between the search terms of the two query events, as follows:
where $S_{new}$ is the new query event and $S_{old}$ is the old query event; $\mathrm{Sim}(S_{new}, S_{old})$ is the similarity between the query result entries the user clicked in the new query event and in the old query event; $S_{new}.CDoc$ denotes the ID list of the query result entries the user clicked in the new query event; $S_{old}.CDoc$ denotes the ID list of the query result entries the user clicked in the old query event; $\mathrm{Common}(S_{new}.CDoc, S_{old}.CDoc)$ is the number of entries with the same ID that the user clicked in the query results of both the new and the old query event; and $\mathrm{Max}(S_{new}.CDoc, S_{old}.CDoc)$ is the larger of the numbers of entries the user clicked in the new and the old query event.
If the value of $\mathrm{Sim}(S_{new}, S_{old})$ exceeds a threshold preset by the system, the relevance between the search terms of the new and old query events is further calculated as follows:
where $W_{new_i}$ is the i-th search term entered in the new query event and $W_{old_j}$ is the j-th search term entered in the old query event; $\mathrm{Weight}(W_{new_i}, W_{old_j})$ is the relevance between $W_{new_i}$ and $W_{old_j}$; $\mathrm{OldWeight}(W_{new_i}, W_{old_j})$ is the old relevance between $W_{new_i}$ and $W_{old_j}$; the value of $\mathrm{OldWeight}(W_{new_i}, W_{old_j})^{q}$ lies in (0, 1); q is a real number greater than zero; and a is a natural number greater than 2.
Preferably, q is 1.
In step (B), updating the related-term dictionary according to the calculated relevance comprises:
if a pair of search terms whose relevance has been calculated is already stored in the related-term dictionary, raising the relevance value stored for that pair; otherwise, adding the pair to the related-term dictionary.
After the related-term dictionary has been updated according to the calculated relevance, the method further comprises a step of adjusting the related-term dictionary:
if a pair of search terms that has a relevance value in the related-term dictionary has not been accessed within a given time interval, reducing the relevance value of that pair.
The relevance value of the pair is reduced as follows:
where OldWeight is the old relevance value of the pair of search terms, and n is a natural number greater than 2.
In step (C), the query expansion result is a sorted query expansion result, and the sorting method comprises: sorting the related search terms by the size of their relevance values with respect to the search term the user entered.
Alternatively, in step (C), the sorting method comprises: sorting by the size of the weight of each related search term, the weight being calculated as w = s × lg(n + 1),
where s is the relevance between the search term the user entered and the related search term, and n is the number of times the related search term has been searched.
In step (C), the query expansion result may be combined with the search term the user entered to build a new query expression, or be used directly as a query expression, and be submitted directly to the user for use;
alternatively, the query expansion result may be combined with the search term the user entered through an "or" relation and take part in the current query as the query expression, with the query results obtained being shown to the user.
Before step (A), the method further comprises: the search system storing the user's query behavior records in a log file of the search system, and filtering that log file.
In step (B), the period may be one day.
To achieve the above object, the invention also provides a query expansion device comprising:
a query behavior record dividing unit, configured to divide a user's query behavior records into at least one query unit according to the user's identifier and access time, each query unit comprising at least one query event;
a relevance calculating unit, configured to periodically calculate the relevance between search terms in two different query events of the same query unit, or the relevance between search terms in any two query events;
a related-term dictionary updating unit, configured to update the related-term dictionary according to the calculated relevance;
a retrieval unit, configured to retrieve from the related-term dictionary the related search terms having the highest relevance to the search term the user entered, to form the query expansion result.
The query expansion device may further comprise a related-term dictionary adjusting unit, configured to reduce, according to the query expansion result formed by the retrieval unit, the relevance value of any pair of search terms in the related-term dictionary that has not been accessed within a given time interval.
The related-term dictionary comprises:
a search term storage unit, configured to store the search terms entered in the queries recorded in the user query behavior records;
a relevance value storage unit, configured to store the relevance values between the search terms in the search term storage unit;
a last access time recording unit, configured to record the last access time of the search terms that have a relevance value in the relevance value storage unit.
In the relevance value storage unit, the search term sequence numbers are arranged in ascending or descending order.
In the last access time recording unit, the last access times of the search terms having a relevance value are arranged in order of access time.
By analyzing and learning from the query and access behavior of users on the search system, the invention updates the related-term dictionary and performs query expansion according to the semantic relevance between search terms in that dictionary. This assists the user's query, reduces the user's query burden, and greatly improves the precision and recall of the user's searches.
Embodiment
To make the purpose, solution, and effects of the invention clearer, embodiments of the invention are described in detail below with reference to the accompanying drawings:
The overall flow of the query expansion method of the invention, shown in Fig. 1, comprises:
(S1) dividing a user's query behavior records into at least one query unit according to the user's identifier and access time, each query unit comprising at least one query event;
(S2) periodically calculating the relevance between search terms in two different query events of the same query unit, or the relevance between search terms in any two query events;
(S3) updating the related-term dictionary according to the calculated relevance;
(S4) when the user queries, retrieving from the related-term dictionary the related search terms having the highest relevance to the search term the user entered, to form the query expansion result.
By analyzing and learning from the query and access behavior of users on the search system, the above method updates the related-term dictionary and performs query expansion according to the semantic relevance between search terms in that dictionary. This assists the user's query, reduces the user's query burden, and greatly improves the precision and recall of the user's searches.
A first specific implementation flow of the above method is described below with reference to Fig. 2:
In step (S1), the user query behavior record comprises the search terms entered when the user queries and the query result entries the user clicked to visit.
In a specific implementation, step (S1) may be carried out as follows:
(S11) The search system stores the user's query behavior records in a text-format log file of the search system. The text-format log file is first filtered so that only the records of search terms entered by users and of query result entries clicked are kept, which greatly reduces the amount of data to be processed and so improves the system's processing efficiency.
(S12) The filtered log file is analyzed and learned from, and the user's query behavior records are divided into at least one query event and query unit according to the user's IP address and access time.
A query unit is the process by which one information need of the user is satisfied: the continuous period of use during which the same IP user performs retrieval is recorded as one query unit.
The search system further divides a query unit by individual query; the smaller unit obtained is called a query event, which comprises the search terms the user entered and the corresponding query result entries the user clicked to visit.
A query unit therefore comprises at least one query event. The time between two adjacent query events of the same IP user can serve as the basis for deciding whether they belong to the same query unit: if that time is less than a given time threshold, the two query events belong to the same query unit.
The queries made in the several query events of one query unit form a process in which the user keeps adjusting the search terms according to the previous query results, so the queries made within one query unit are considered related.
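The division into query units described above can be sketched as follows. The event tuple layout, the function name, and the 30-minute default threshold are assumptions made for illustration; the text only specifies "a given time threshold".

```python
from collections import defaultdict

def split_into_query_units(events, threshold_seconds=1800):
    """Group query events into query units.

    events: iterable of (ip, timestamp_seconds, terms, clicked_ids) tuples
            (hypothetical layout of a filtered log record).
    threshold_seconds: assumed value; the source does not fix the threshold.
    """
    by_ip = defaultdict(list)
    for event in events:
        by_ip[event[0]].append(event)

    units = []
    for ip_events in by_ip.values():
        ip_events.sort(key=lambda e: e[1])  # order by access time
        current = [ip_events[0]]
        for prev, cur in zip(ip_events, ip_events[1:]):
            if cur[1] - prev[1] < threshold_seconds:
                current.append(cur)      # gap below threshold: same query unit
            else:
                units.append(current)    # gap too long: start a new query unit
                current = [cur]
        units.append(current)
    return units
```

Each returned unit is a time-ordered list of query events from one IP address, which is the input the relevance calculation needs.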
In step (S2) of the flow shown in Fig. 1, the relevance between search terms can be calculated in the following two ways:
(S21A) periodically calculating the relevance between search terms in two different query events of the same query unit, as follows:
where j > i; $W_{in}$ denotes the n-th search term of the i-th query and $W_{jm}$ denotes the m-th search term of the j-th query; $\mathrm{OldWeight}(W_{in}, W_{jm})$ is the old relevance between $W_{in}$ and $W_{jm}$; the value of $\mathrm{OldWeight}(W_{in}, W_{jm})^{p}$ lies in (0, 1); p is a real number greater than zero; and a is a natural number greater than 2. Optimally, p is 1.
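The term pairs that (S21A) operates on can be enumerated as in the sketch below, which covers only the pairing of the n-th term of the i-th query with the m-th term of the j-th query for j > i, not the relevance formula itself. The function name is hypothetical, and skipping pairs of identical terms is an assumption.

```python
from itertools import combinations, product

def cross_event_term_pairs(query_unit):
    """List term pairs drawn from two different query events of one unit.

    query_unit: list of query events in time order, each given as a list
    of search terms (hypothetical representation).
    """
    pairs = []
    # combinations() keeps input order, so `earlier` always precedes `later`
    # (this is the j > i condition).
    for earlier, later in combinations(query_unit, 2):
        for w_in, w_jm in product(earlier, later):
            if w_in != w_jm:  # assumption: a term is not paired with itself
                pairs.append((w_in, w_jm))
    return pairs
```

A pair that arises from several event combinations appears several times in the output, which lets a caller count co-occurrences if the chosen formula needs them.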
(S21B) periodically calculating the relevance between search terms in any two query events: the similarity between the query result entries the user clicked in a new query event and those clicked in an old query event is calculated, and the relevance between the search terms of the two query events is inferred from it, as follows:
where $S_{new}$ is the new query event and $S_{old}$ is the old query event; $\mathrm{Sim}(S_{new}, S_{old})$ is the similarity between the query result entries the user clicked in the new query event and in the old query event; $S_{new}.CDoc$ denotes the ID list of the query result entries the user clicked in the new query event; $S_{old}.CDoc$ denotes the ID list of the query result entries the user clicked in the old query event; $\mathrm{Common}(S_{new}.CDoc, S_{old}.CDoc)$ is the number of entries with the same ID that the user clicked in the query results of both the new and the old query event; and $\mathrm{Max}(S_{new}.CDoc, S_{old}.CDoc)$ is the larger of the numbers of entries the user clicked in the new and the old query event.
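A minimal sketch of the click similarity follows. It assumes that Sim is the ratio of Common to Max, which the definitions above suggest but which is not confirmed by a formula reproduced in this text; the function name and the handling of empty click lists are also assumptions.

```python
def click_similarity(new_cdoc, old_cdoc):
    """Similarity of two clicked-entry ID lists.

    Assumption: Sim = Common / Max, i.e. the number of shared entry IDs
    divided by the larger of the two click counts.
    """
    new_ids, old_ids = set(new_cdoc), set(old_cdoc)
    larger = max(len(new_ids), len(old_ids))
    if larger == 0:
        return 0.0  # assumption: no clicks on either side means no similarity
    return len(new_ids & old_ids) / larger
```

Under this reading the value lies between 0 and 1 and reaches 1 only when both events clicked exactly the same entries.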
If the value of $\mathrm{Sim}(S_{new}, S_{old})$ exceeds a threshold preset by the system, the relevance between the search terms of the new and old query events is further calculated; otherwise the following calculation is not performed.
The relevance between the search terms of the new and old query events is further calculated as follows:
where $W_{new_i}$ is the i-th search term entered in the new query event and $W_{old_j}$ is the j-th search term entered in the old query event; $\mathrm{Weight}(W_{new_i}, W_{old_j})$ is the relevance between $W_{new_i}$ and $W_{old_j}$; $\mathrm{OldWeight}(W_{new_i}, W_{old_j})$ is the old relevance between $W_{new_i}$ and $W_{old_j}$; the value of $\mathrm{OldWeight}(W_{new_i}, W_{old_j})^{q}$ lies in (0, 1); q is a real number greater than zero; and a is a natural number greater than 2. Optimally, q is 1.
Of course, the calculation of relevance between search terms is not limited to using these two methods separately. The calculation of (S21A) and that of (S21B) can also be combined, so that the calculated relevance is more comprehensive and accurate and a more comprehensive and accurate query expansion result can be offered when the user queries.
In steps (S21A) and (S21B), the period may be set to one day, and the calculation can generally be run in the early morning, when the search system server is relatively idle.
In step (S3) of the flow shown in Fig. 1, updating the related-term dictionary according to the calculated relevance comprises:
(S31) if a pair of search terms whose relevance has been calculated is already stored in the related-term dictionary, raising the relevance value stored for that pair; otherwise, adding the pair to the related-term dictionary.
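The branch in (S31) can be sketched as follows. How a stored value is raised is not specified here, so the sketch takes that rule as a caller-supplied function `raise_fn`; that parameter, the function name, and the in-memory mapping are all hypothetical.

```python
def update_dictionary(dictionary, pair_weights, raise_fn):
    """Apply step (S31) to an in-memory stand-in for the dictionary.

    dictionary:   maps frozenset({term_a, term_b}) to a stored relevance.
    pair_weights: maps (term_a, term_b) to a newly calculated relevance.
    raise_fn:     hypothetical rule (old, new) -> raised value.
    """
    for (w1, w2), new_weight in pair_weights.items():
        key = frozenset((w1, w2))  # pairs are unordered
        if key in dictionary:
            dictionary[key] = raise_fn(dictionary[key], new_weight)
        else:
            dictionary[key] = new_weight  # pair not stored yet: add it
    return dictionary
```

Using an unordered key means ("server", "computer") and ("computer", "server") update the same entry.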
As shown in Fig. 3, after the related-term dictionary is updated in step (S31), the dictionary can be further adjusted. The adjustment comprises:
(S32) if a pair of search terms that has a relevance value in the related-term dictionary has not been accessed within a given time interval, reducing the relevance value of that pair.
The relevance value of the pair is reduced as follows:
where OldWeight is the old relevance value of the pair of search terms, and n is a natural number greater than 2.
In this step, the given time interval may be set to 15 days in an implementation. User query behavior records contain various kinds of noise data. For pairs of search terms that have gone unaccessed for a long time (that is, noise data), the relevance is reduced gradually, which lessens the influence of noise data on the system.
For example, terms such as "corn" and "bean jelly" co-occurred very frequently during the craze over the "Super Girl" contest, but once the craze had passed they were very unlikely to co-occur. Therefore, for related search terms in the related-term dictionary that have not been accessed within a given time, the relevance is reduced gradually until it finally reaches 0. This lessens the influence on the system of terms that appear frequently for a short time and rarely for a long time afterwards.
However, for search terms whose relevance is fixed at 1, the relevance is not reduced.
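The adjustment can be sketched as below. The reduction rule itself is passed in as `reduce_fn` because its formula is not reproduced in this text; that parameter, the function name, and the in-memory mappings are hypothetical, while the 15-day default follows the interval mentioned above.

```python
from datetime import datetime, timedelta

def decay_stale_pairs(weights, last_access, now, reduce_fn,
                      max_idle=timedelta(days=15)):
    """Reduce the relevance of pairs not accessed within max_idle.

    weights:     maps a pair key to its relevance value.
    last_access: maps the same key to a datetime of last access.
    reduce_fn:   hypothetical rule old_weight -> reduced weight.
    """
    for pair, weight in list(weights.items()):
        if weight == 1.0:
            continue  # relevance fixed at 1 is never reduced
        if now - last_access[pair] > max_idle:
            weights[pair] = reduce_fn(weight)
    return weights
```

Running this once per period lowers a stale pair step by step, which matches the gradual reduction described above.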
The specific implementation of step (S4) of the flow shown in Fig. 1 is described in detail below with reference to Fig. 4, and comprises:
(S41) when querying, the user enters a search term through the search engine;
(S42) the query expansion result is obtained from the related-term dictionary;
(S43) the IDs of the query result entries and the query expansion result are displayed;
(S44) it is judged whether the query result is satisfactory; if so, retrieval ends; otherwise a new query expression is constructed from the query expansion result and the query is run again.
In step (S43), the query expansion result can be arranged in a certain order for the user's convenience. One sorting method is to sort the related search terms by the size of their relevance values with respect to the search term the user entered.
To make the query expansion result meet the user's needs better, the result can instead be sorted by the size of the weight of each related search term, the weight being calculated as w = s × lg(n + 1),
where s is the relevance between the search term the user entered and the related search term, and n is the number of times the related search term has been searched.
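The weighted sort can be sketched as follows, reading lg as the base-10 logarithm. The function name and the tuple layout of the candidates are assumptions made for illustration.

```python
import math

def rank_related_terms(candidates):
    """Sort related search terms by w = s * lg(n + 1), largest first.

    candidates: list of (term, s, n) where s is the relevance to the
    entered term and n is how many times the related term was searched
    (hypothetical layout).
    """
    weighted = [(s * math.log10(n + 1), term) for term, s, n in candidates]
    weighted.sort(key=lambda item: -item[0])
    return [term for _, term in weighted]
```

With this weight a term that has never been searched (n = 0) gets w = 0 regardless of s, so a moderately relevant but frequently searched term can outrank a highly relevant but rarely searched one.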
The popularity with which a search term is searched can further serve as a basis for sorting the query expansion result.
The interface showing the effect of query expansion is illustrated in Fig. 5: the user enters the search term "computer" through the search engine, and the query expansion result is "mainframe", "server", "workstation", "computing machine", and so on, rather than only expansion results that contain the characters of "computer". To some extent this breaks through the limitation of the term the user initially entered. The words "mainframe", "server", "workstation", and "computing machine" are called related search terms. They are sorted by the size of their relevance, or by their relevance combined with the number of times they have been searched, which meets the user's actual information need better and improves the user's search efficiency.
In existing query expansion methods, the query expansion result does not take part in the current retrieval: it is shown directly to the user, who then selects a term from it and searches again. In the query expansion method of the invention, by contrast, the query expansion result may be combined with the search term the user entered to build a new query expression, or be used directly as a query expression, and be submitted directly to the user for use;
alternatively, the query expansion result may be combined with the search term the user entered through an "or" relation and take part in the current query as the query expression, with the query results obtained being shown to the user.
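The "or" combination can be sketched as below. The quoted-term syntax and the `OR` keyword are assumptions about a hypothetical query language; an actual search system would use its own expression syntax.

```python
def build_or_expression(term, expansion_terms):
    """Combine the entered term with expansion terms through OR.

    The output syntax ('"a" OR "b"') is a hypothetical query language.
    """
    # Keep the entered term first and drop any duplicate of it.
    parts = [term] + [t for t in expansion_terms if t != term]
    return " OR ".join(f'"{p}"' for p in parts)
```

Submitting the combined expression lets the expansion result take part in the current query instead of requiring a second search by the user.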
With the method of the invention, the log file can be learned from quickly and conveniently and the related-term dictionary supplemented and updated automatically, so that query expansion is performed according to the semantic associations between search terms without extensive manual intervention. This assists the user's query, reduces the user's query burden, and in turn improves the precision and recall of the user's searches.
Corresponding to the above query expansion method, the invention also provides a query expansion device whose structure, shown in Fig. 6, comprises:
a query behavior record dividing unit 61, configured to divide a user's query behavior records into at least one query unit according to the user's identifier and access time, each query unit comprising at least one query event;
a relevance calculating unit 62, configured to periodically calculate the relevance between search terms in two different query events of the same query unit, or the relevance between search terms in any two query events;
a related-term dictionary updating unit 63, configured to update the related-term dictionary according to the calculated relevance;
a retrieval unit 64, configured to retrieve from the related-term dictionary the related search terms having the highest relevance to the search term the user entered, to form the query expansion result.
Further, as shown in Fig. 7, the query expansion device of the invention also comprises a related-term dictionary adjusting unit 65, configured to reduce, according to the query expansion result formed by the retrieval unit, the relevance value of any pair of search terms in the related-term dictionary that has not been accessed within a given time interval.
In addition, the structure of the related-term dictionary used in the query expansion method of the invention, shown in Fig. 8, comprises:
a search term storage unit 81, configured to store the search terms entered in the queries recorded in the user's query behavior records;
a relevance value storage unit 82, configured to store the relevance values between the search terms in the search term storage unit;
a last access time recording unit 83, configured to record the last access time of the search terms that have a relevance value in the relevance value storage unit.
Its specific structure is shown in the following tables:
Name | Type | Description
Word | Nvarchar | Search term
WordID | Int | Key, auto-increment
Table 1
Name | Type | Description
Relevance ID | Int | Key, auto-increment
WordID 1 | Int | ID of the corresponding search term in Table 1
WordID 2 | Int | ID of the corresponding search term in Table 1
Weight | Double | Relevance between the search terms, in the interval (0, 1]
Table 2
Name | Type | Description
Relevance ID | Int | Relevance ID in Table 2
LastAccessTime | DateTime | Time of last access
Table 3
Table 1 corresponds to the search term storage unit, Table 2 to the relevance value storage unit, and Table 3 to the last access time recording unit.
In Table 2, that is, in the relevance value storage unit, the search term sequence numbers are arranged in ascending or descending order, meaning that either WordID 1 > WordID 2 or WordID 1 < WordID 2 holds throughout.
In Table 3, that is, in the last access time recording unit, the last access times of the search terms having a relevance value are arranged in order of access time, which can improve data access speed.
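The three tables can be sketched with Python's built-in `sqlite3` module as below. The table names, the SQLite column types standing in for Nvarchar, Int, Double, and DateTime, the choice of the WordID 1 < WordID 2 ordering, and the index on the access time are all assumptions made for illustration.

```python
import sqlite3

def create_related_term_store(conn):
    """Create hypothetical tables mirroring Tables 1 to 3."""
    conn.executescript("""
        CREATE TABLE word (
            WordID INTEGER PRIMARY KEY AUTOINCREMENT,
            Word   TEXT NOT NULL
        );
        CREATE TABLE relevance (
            RelevanceID INTEGER PRIMARY KEY AUTOINCREMENT,
            WordID1 INTEGER NOT NULL REFERENCES word(WordID),
            WordID2 INTEGER NOT NULL REFERENCES word(WordID),
            Weight  REAL NOT NULL CHECK (Weight > 0 AND Weight <= 1),
            CHECK (WordID1 < WordID2),   -- one fixed ordering per pair
            UNIQUE (WordID1, WordID2)
        );
        CREATE TABLE last_access (
            RelevanceID    INTEGER PRIMARY KEY REFERENCES relevance(RelevanceID),
            LastAccessTime TEXT NOT NULL
        );
        CREATE INDEX idx_last_access_time ON last_access(LastAccessTime);
    """)

def store_pair(conn, id_a, id_b, weight):
    """Insert a pair with the smaller WordID first, as the ordering requires."""
    lo, hi = sorted((id_a, id_b))
    conn.execute(
        "INSERT INTO relevance (WordID1, WordID2, Weight) VALUES (?, ?, ?)",
        (lo, hi, weight),
    )
```

Fixing the order of the two IDs means each unordered pair has exactly one possible row, so a lookup never has to check both orderings.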
The technical effects of the invention are now described as follows:
Existing mainstream search systems have essentially all adopted full-text search technology. Users who lack search experience often run into difficulty when choosing search terms to build an expression, and query expansion provided by the search system can improve retrieval results. The method of the invention collects user query behavior records, analyzes and learns from them to discover the associations between search terms, and supplements the related-term dictionary automatically, so that query expansion is performed according to the semantic associations between search terms without extensive manual intervention, effectively assisting the user in querying precisely and efficiently. Moreover, the method does not require a large investment of manpower to organize the related-term dictionary and is suited to search systems with large data volumes and heavy traffic. It greatly improves the user's search efficiency, that is, it achieves high precision and high recall.
The above preferred embodiments serve only to describe the technical solution of the invention specifically. Those skilled in the art may make equivalent changes and modifications to the invention without departing from its spirit and principle, and all such changes and modifications shall fall within the scope of protection defined by the invention.