CN101131706A

CN101131706A - Query amending method and system thereof

Info

Publication number: CN101131706A
Application number: CNA2007101753268A
Authority: CN
Inventors: 高立琦; 刘挺
Original assignee: Harbin Institute of Technology; Beijing Kingsoft Software Co Ltd; Beijing Jinshan Digital Entertainment Technology Co Ltd
Current assignee: Harbin Institute of Technology; Beijing Kingsoft Software Co Ltd; Beijing Jinshan Digital Entertainment Technology Co Ltd
Priority date: 2007-09-28
Filing date: 2007-09-28
Publication date: 2008-02-27
Anticipated expiration: 2027-09-28
Also published as: CN101131706B

Abstract

The invention discloses a query correction method and system, which solve the problem that current search engines cannot correctly analyze the various erroneous queries that users input, so that retrieval fails. The method includes: presetting a language model using the retrieval resources; invoking the corresponding correction operations on each originally input query word to obtain multiple representations of that query word, the representations including the original input; combining the representations of the query words to obtain word sequences in various combined forms; and invoking the language model to calculate the occurrence probability of each word sequence and taking the word sequences with high occurrence probability as query suggestion results. The invention can uniformly handle the various input errors and incomplete inputs that occur in queries, such as spelling mistakes and incomplete word forms, thereby correcting queries automatically and helping users use the search engine efficiently.

Description

Query correction method and system

Technical Field

The invention relates to a search engine technology, in particular to a query correction method and a query correction system in a search engine.

Background

A query refers to a user's input, specifically text input herein, when using a search engine. The query expresses the user's information need, and only by correctly understanding the query can the search engine provide accurate retrieval results and high-quality service. However, when a current search engine analyzes a user's query, its accuracy is limited by several factors, such as misspelled words, incorrect word forms, incompletely entered words, and Chinese homophones, so that the search engine cannot correctly "understand" the user's intention and the returned results do not meet the user's needs.

Existing search engines, such as Google and Baidu, all pay great attention to query processing and provide query correction functions such as query completion and spell checking. For example, when an incomplete form of "computer science" is entered in Google, the returned results page prompts "Do you mean: computer science", the words are completed automatically, and the retrieval results are also information related to "computer science".

However, the query correction methods commonly used in current search engines are single and isolated: each can perform either "spell checking" or "word form completion", and if the input query contains several kinds of errors, the existing methods have difficulty handling them simultaneously. For example, if "computing scien nad techno" is entered, none of the methods used by current search engines can process it correctly, and no suggestion such as "computer science and technology" is given. Even if the query is processed several times, for example "word form completion" followed by "spell checking", the existing methods still have difficulty giving an accurate prompt, because the separate passes cannot judge which word should be chosen as the prompt: "scien" can be completed as "science" or "scientist", and "techno" can be completed as "technology", "technical", and so on. Therefore, retrieval fails when the query input by the user contains multiple errors or is otherwise unfavorable for the search engine to correctly analyze the user's search intent.

Disclosure of Invention

The invention aims to provide a query correction method and a query correction system to solve the problem that the conventional search engine cannot correctly analyze various wrong queries input by a user so as to cause retrieval failure.

In order to solve the technical problem, according to a specific embodiment provided by the invention, the invention discloses the following technical scheme:

a query modification method, comprising:

presetting a language model by utilizing retrieval resources;

calling corresponding correction operation to correct each query word input originally to obtain multiple representations corresponding to each query word, wherein the multiple representations comprise the representations input originally;

obtaining word sequences in various combination forms according to various expressions of each query word;

and calling the language model to calculate the occurrence probability of the word sequences, and determining the word sequences with high occurrence probability as query suggestion results.

Wherein the language model comprises a unary (unigram) and/or multi-gram (n-gram) language model.

Wherein the step of establishing the binary (bigram) language model comprises: preprocessing all retrieval resources to obtain each term; counting the number of occurrences of each term in all retrieval resources, including the number of occurrences of unary terms and of binary terms; and substituting the occurrence counts of all the unary terms and binary terms into the following formulas:

$$P(w) = \frac{C(w)}{C(*)}$$

which represents the probability of occurrence of a unary term, where $C(w)$ represents the number of occurrences of the unary term $w$ and $C(*) = \sum_{w} C(w)$ represents the sum of the occurrence counts of all unary terms;

$$P(w_i \mid w_j) = \frac{P(w_i, w_j)}{P(w_j)} = \frac{C(w_i, w_j)/C(*,*)}{C(w_j)/C(*)}$$

which represents the probability that the word $w_i$ occurs given the word $w_j$, where $C(w_i, w_j)$ represents the number of co-occurrences of the binary term formed by $w_i$ and $w_j$, $C(*)$ represents the sum of the occurrence counts of all unary terms, and $C(*,*) = \sum_{w_i, w_j} C(w_i, w_j)$ represents the sum of the occurrence counts of all binary terms.

The step of calling the language model to calculate the occurrence probability of a word sequence comprises: for each word sequence $S = w_1 w_2 \ldots w_n$, substituting the corresponding $P(w)$ and $P(w_i \mid w_j)$ values from the language model into the formula $P(w_1 w_2 \ldots w_n) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})$ to calculate the occurrence probability of the word sequence.

Preferably, the step of calling the language model to calculate the word sequence occurrence probability includes:

step 1, for the first query word, calling the calculated $P(w)$ values in the language model to obtain the occurrence probability of each representation of the query word, and selecting a preset number of representations with high occurrence probability;

step 2, for the second query word, using the formula $P(w_1 w_2) = P(w_1)\,P(w_2 \mid w_1)$ to calculate the occurrence probability of the word sequences comprising the first query word and the second query word, and selecting a preset number of word sequences with high occurrence probability; wherein $w_1$ is a representation of the first query word selected in step 1, and $w_2$ ranges over the representations of the second query word;

proceeding as in step 2 for each subsequent query word in turn, using the formula $P(w_1 w_2 \ldots w_n) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})$ to calculate the occurrence probability of the word sequences, and finally obtaining a preset number of word sequences $S = w_1 w_2 \ldots w_n$ containing all query words.

The method further comprises: displaying the query suggestion result as prompt information, or performing retrieval directly according to the query suggestion result.

A query revision system comprising:

the model generating unit is used for presetting a language model by utilizing retrieval resources;

a data interface for receiving a query input;

the query processing engine is used for calling corresponding correction operation to correct each query term which is input originally to obtain multiple representations corresponding to each query term, wherein the multiple representations comprise the representation of the original input;

the query suggestion generation unit is used for obtaining word sequences in various combination forms according to various expressions of each query word; and calling the language model to calculate the occurrence probability of the word sequences, and determining the word sequences with high occurrence probability as query suggestion results.

The system further comprises: a preprocessing unit, used for performing word segmentation (for Chinese) or tokenization (for English) on the original input to obtain the word sequence of the original input.

The query suggestion result generated by the query suggestion generation unit is displayed as prompt information through the data interface, or is sent directly to a retrieval unit for retrieval.

The model generation unit can establish a unary or multi-gram language model, wherein the process of establishing the binary language model comprises: preprocessing all retrieval resources to obtain each term; counting the number of occurrences of each term in all retrieval resources, including the number of occurrences of unary terms and of binary terms; and substituting the occurrence counts of all the unary terms and binary terms into the following formulas:

$$P(w) = \frac{C(w)}{C(*)}$$

where $C(w)$ represents the number of occurrences of the unary term $w$ and $C(*) = \sum_{w} C(w)$ represents the sum of the occurrence counts of all unary terms;

$$P(w_i \mid w_j) = \frac{P(w_i, w_j)}{P(w_j)} = \frac{C(w_i, w_j)/C(*,*)}{C(w_j)/C(*)}$$

where $C(w_i, w_j)$ represents the number of co-occurrences of the binary term formed by $w_i$ and $w_j$, $C(*)$ represents the sum of the occurrence counts of all unary terms, and $C(*,*) = \sum_{w_i, w_j} C(w_i, w_j)$ represents the sum of the occurrence counts of all binary terms.

Preferably, the process of generating the query suggestion result by the query suggestion generation unit includes:

step 1, for the first query word, calling the calculated $P(w)$ values in the language model to obtain the occurrence probability of each representation of the query word, and selecting a preset number of representations with high occurrence probability;

step 2, for the second query word, using the formula $P(w_1 w_2) = P(w_1)\,P(w_2 \mid w_1)$ to calculate the occurrence probability of the word sequences comprising the first query word and the second query word, and selecting a preset number of word sequences with high occurrence probability; wherein $w_1$ is a representation of the first query word selected in step 1, and $w_2$ ranges over the representations of the second query word.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

The invention can uniformly handle the various input errors and incomplete inputs in a query, such as spelling errors and incomplete word forms, so as to correct the query automatically and help the user use the search engine effectively. Multiple correction analyses are applied to the query words input by the user: each correction operation, such as spell checking, word form completion, word form reduction, or synonym replacement, is applied to each query word to obtain multiple representations of that query word, and these are combined into word sequences in multiple combined forms. The occurrence probability of each word sequence is then calculated using a language model established in advance from the retrieval resources, and the word sequences with high occurrence probability are taken as query suggestion results. A result can be shown as an explicit prompt or passed as an implicit query to the retrieval part of the system, which helps the user retrieve satisfactory results as far as possible and improves the efficiency with which the user uses the search engine.

Drawings

FIG. 1 is a flow chart illustrating steps of a query modification method according to an embodiment of the present invention;

FIG. 2 is a diagram of a query internal data structure in a preferred embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating the process of obtaining the first n optimal results in the embodiment shown in FIG. 2;

FIG. 4 is a diagram of the display effect of a page of the query suggestion result in the embodiment shown in FIG. 2;

FIG. 5 is a structural diagram of a query modification system according to an embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

The embodiment of the invention provides a query correction method for the case where a query input by the user contains multiple errors, or is otherwise unfavorable for the search engine to correctly analyze the user's search intent, so that retrieval fails. The method performs multiple correction analyses on the query words input by the user, uses a language model established in advance from the retrieval resources as the measure, and then uses an efficient search algorithm to find the query correction suggestions that best fit the language model. These suggestions can be shown as explicit prompts or passed as implicit queries to the retrieval part, so that the user can retrieve the expected results as far as possible, does not need to modify the query repeatedly, and uses the search engine more efficiently.

Referring to fig. 1, a flowchart of steps of a query modification method according to an embodiment of the present invention is shown.

Steps 101-103 are the process of presetting the language model using the retrieval resources, as follows:

Step 101, processing all retrieval resources (such as web pages). The specific process comprises: extracting the body text of each web page according to specific requirements and converting it according to the page's encoding (which mainly affects non-English characters); then performing word segmentation on Chinese text and tokenization on English text, optionally followed by operations such as word form reduction and lowercasing, according to application requirements. This processing yields all the terms of all the retrieval resources, where the terms comprise Chinese terms or English terms.
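The preprocessing of step 101 for an English web page can be sketched as follows. This is only a minimal illustration under stated assumptions: the names `TextExtractor` and `page_to_terms` are hypothetical, the page encoding is assumed to be known, whitespace splitting stands in for a real tokenizer, and word form reduction is omitted.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Note: this naive version also keeps script/style text if present.
        if data.strip():
            self.parts.append(data.strip())


def page_to_terms(raw_bytes, encoding):
    """Decode a page with its declared encoding, extract text, lowercase, split."""
    html = raw_bytes.decode(encoding, errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    return text.lower().split()


raw = "<html><body><h1>Computer Science</h1><p>and Technology</p></body></html>".encode("utf-8")
terms = page_to_terms(raw, "utf-8")
print(terms)
```

For Chinese text, the whitespace split would be replaced by a Chinese word segmenter, which this sketch does not attempt.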

Step 102, counting the number of occurrences of each term in all retrieval resources. For example, the search resource contains "abcdabef", and the statistical result is "a:2,b. In addition, to prepare for the construction of the language model below, besides counting the occurrences of single words it is also necessary to count the number of co-occurrences of two words, such as "a_b:2, b_c.
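The counting in step 102 can be illustrated on the example string. The sketch below assumes that each letter of "abcdabef" stands for one term and that a co-occurrence means two adjacent terms; the variable names are illustrative only.

```python
from collections import Counter

terms = list("abcdabef")  # assumption: one letter = one term

# Occurrence counts of single (unary) terms.
unigram_counts = Counter(terms)

# Co-occurrence counts of adjacent term pairs (binary terms).
bigram_counts = Counter(zip(terms, terms[1:]))

print(dict(unigram_counts))
print({f"{a}_{b}": c for (a, b), c in bigram_counts.items()})
```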

When the data volume of the retrieval resources is large, statistics are generally collected by multi-way merging or distributed processing: the resources are first divided into several groups, counts are collected group by group, and the partial counts are finally merged. Counting sequences of three or more words follows the same principle as counting two-word sequences, but the amount of statistical data is larger. The counting method is independent of language and is suitable for Chinese, English, and other languages.
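The group-by-group counting and final merge described above might look like the following. This is a single-process sketch with hypothetical function names and toy data; in a real deployment each `count_group` call could run on a separate worker or machine before the merge.

```python
from collections import Counter


def count_group(docs):
    """Count unary and adjacent binary terms for one group of tokenized documents."""
    uni, bi = Counter(), Counter()
    for terms in docs:
        uni.update(terms)
        bi.update(zip(terms, terms[1:]))
    return uni, bi


def merge(partials):
    """Merge the per-group counts into global counts."""
    total_uni, total_bi = Counter(), Counter()
    for uni, bi in partials:
        total_uni.update(uni)  # Counter.update adds counts
        total_bi.update(bi)
    return total_uni, total_bi


# Toy data: two groups of already tokenized documents.
groups = [
    [["computer", "science"], ["computer", "technology"]],
    [["computer", "science", "and", "technology"]],
]
unigrams, bigrams = merge(count_group(g) for g in groups)
print(unigrams["computer"], bigrams[("computer", "science")])
```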

In this embodiment, what is counted is the number of co-occurrences of two or more adjacent words; of course, non-adjacent words may also be counted, or counting may follow other rules, depending on the model building process.

Step 103, establishing a statistical language model. The language model provides a probability distribution that is called directly in subsequent computation. Based on the processing in the above steps, a unary, binary, or multi-gram language model can be established, where multi-gram refers to a model established from the occurrence counts of sequences of several words. A binary language model is taken as the example below. If the original retrieval resource data is sufficient, a language model using three-word terms (trigrams) performs better than one using binary terms; the principle for building a trigram or higher-order model is the same as for the binary model and is not described in detail here.

The binary language model describes the probability of occurrence of a whole sentence through the statistical information of pairs of adjacent words. Establishing the language model means pre-calculating $P(w)$ and $P(w_i \mid w_j)$ from the data of step 102, for use in estimating probabilities in subsequent processing. The calculation formulas for $P(w)$ and $P(w_i \mid w_j)$ are as follows:

$$P(w) = \frac{C(w)}{C(*)} \qquad \text{(I)}$$

Formula I represents the probability of occurrence of a unary term, where $C(w)$ represents the number of occurrences of the unary term $w$ and $C(*) = \sum_{w} C(w)$ represents the sum of the occurrence counts of all unary terms;

$$P(w_i \mid w_j) = \frac{P(w_i, w_j)}{P(w_j)} = \frac{C(w_i, w_j)/C(*,*)}{C(w_j)/C(*)} \qquad \text{(II)}$$

Formula II is a conditional probability formula that represents the probability that the word $w_i$ occurs given the word $w_j$, where $C(w_i, w_j)$ represents the number of co-occurrences of the binary term formed by $w_i$ and $w_j$, $C(*)$ represents the sum of the occurrence counts of all unary terms, and $C(*,*) = \sum_{w_i, w_j} C(w_i, w_j)$ represents the sum of the occurrence counts of all binary terms.

After the formulas are applied, the calculated values are obtained for every unary term and binary term in all retrieval resources, and these results are stored to be called in subsequent steps.
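The pre-calculation of step 103 can be sketched from the symbol definitions above. The sketch assumes that the conditional probability is the joint probability $C(w_i, w_j)/C(*,*)$ divided by the unary probability $C(w_j)/C(*)$, and that a binary term is an adjacent pair in which $w_j$ precedes $w_i$; the function name and the dictionary layout are hypothetical.

```python
from collections import Counter


def build_bigram_model(unigram_counts, bigram_counts):
    """Return (p_uni, p_cond), where p_cond[(w_i, w_j)] estimates P(w_i | w_j)."""
    c_star = sum(unigram_counts.values())       # C(*): total unary count
    c_star_star = sum(bigram_counts.values())   # C(*,*): total binary count
    p_uni = {w: c / c_star for w, c in unigram_counts.items()}
    p_cond = {}
    for (w_j, w_i), c in bigram_counts.items():  # key = (previous word, word)
        p_joint = c / c_star_star
        p_cond[(w_i, w_j)] = p_joint / p_uni[w_j]
    return p_uni, p_cond


# Toy corpus from step 102, one letter per term (an assumption).
terms = list("abcdabef")
p_uni, p_cond = build_bigram_model(Counter(terms), Counter(zip(terms, terms[1:])))
print(p_uni["a"], p_cond[("b", "a")])
```

On this tiny corpus the estimate for "b" given "a" slightly exceeds 1, because the two totals differ (8 unary terms, 7 binary terms); on a large corpus the totals are nearly equal and the value approaches the usual relative frequency.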

Steps 101-103 are performed by the retrieval system in advance, so that the language model is already established when users submit queries. The following steps are the query correction operations performed by the retrieval system when the user inputs a query.

Step 104, the user enters a query in the search box and triggers a query event. The queries addressed by the embodiment of the invention are generally natural language input, not structured queries such as a "boolean query", for example "+ computer-science (and | | | or)". Suppose the user inputs "computing scien nad techno" while the intended input is "computer science and technology": "computer" is mis-entered as "computing", "science" and "technology" are entered incompletely, and "and" is misspelled as "nad".

Step 105, preprocessing the obtained query by word segmentation (for Chinese) or tokenization (for English) to form a word sequence. Word segmentation extracts the most commonly used words from the query. For example, if the query is "the new intellectual property right measures launched by the Chinese government", the segmentation result may be "Chinese", "government", "intellectual property right", "measures", or "Chinese government", "intellectual property right measures", and so on; groupings that are not commonly used combinations can be effectively excluded, which reduces the number of terms to be searched. Tokenization divides the input English into words and punctuation marks; for example, "good job!" becomes "good", "job", "!". Punctuation or certain "stop words" (words that do not contribute to query analysis and correction) may also be removed according to specific needs. For the above example "computing scien nad techno", the result of the analysis is: "computing", "scien", "nad", "techno".
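The English tokenization in step 105 can be sketched with a regular expression. The function name, its parameters, and the stop-word handling are assumptions made for illustration; Chinese word segmentation needs a dictionary-based or statistical segmenter and is not covered here.

```python
import re


def tokenize_query(query, drop_punctuation=False, stop_words=frozenset()):
    """Split an English query into word and punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    tokens = re.findall(r"\w+|[^\w\s]", query)
    if drop_punctuation:
        tokens = [t for t in tokens if re.match(r"\w", t)]
    return [t for t in tokens if t.lower() not in stop_words]


print(tokenize_query("good job!"))
print(tokenize_query("computing scien nad techno"))
```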

Step 106, performing correction operations on each query word obtained from the word segmentation or tokenization. Which correction operations are used depends on actual requirements; this example uses "spell checking", "word form reduction", and "word form completion" for illustration, but the invention is not limited to these correction operations. The system processes "computing" and finds that "compute" and "computer" can be obtained through "word form reduction"; processes "scien" and obtains "science", "scientist", and "scientists" through "word form completion"; processes "nad" and obtains "and" and "nap" through "spell checking"; and processes "techno" to obtain "technology", "technical", and so on.
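Two of the correction operations in step 106 can be sketched against a small vocabulary. Everything here is an assumption for illustration: the vocabulary `VOCAB` is hypothetical, word form completion is reduced to prefix matching, spell checking is reduced to a single-edit search (deletion, transposition, replacement, or insertion), and word form reduction is omitted.

```python
import string

# Hypothetical vocabulary; a real system would derive it from the retrieval resources.
VOCAB = {"compute", "computer", "computing", "science", "scientist",
         "scientists", "and", "nap", "technology"}


def complete(word, vocab=VOCAB):
    """Word form completion, simplified to prefix matching."""
    return sorted(v for v in vocab if v.startswith(word) and v != word)


def edits1(word):
    """All strings one edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in string.ascii_lowercase}
    inserts = {a + c + b for a, b in splits for c in string.ascii_lowercase}
    return deletes | transposes | replaces | inserts


def spell_check(word, vocab=VOCAB):
    """Vocabulary words within one edit of word."""
    return sorted((edits1(word) & vocab) - {word})


def representations(word):
    """Original input first, followed by candidates from each correction operation."""
    out = [word]
    for cand in complete(word) + spell_check(word):
        if cand not in out:
            out.append(cand)
    return out


print(representations("nad"))
print(representations("scien"))
```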

Through step 106, multiple representations are obtained for each query word. Combining the representations of the query words yields word sequences in multiple combined forms, that is, multiple sets of candidate results, such as "computer scientist and technology" and "computing scientists nap technology", and including the user's original input "computing scien nad techno".
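The combination of representations into candidate word sequences is a Cartesian product, which can be sketched as below. The lists are illustrative and only partly taken from the example; each list puts the original input first.

```python
from itertools import product

# Representations per query word, original input first (illustrative lists).
representations = [
    ["computing", "computer"],
    ["scien", "science", "scientist", "scientists"],
    ["nad", "and", "nap"],
    ["techno", "technology", "technical"],
]

# Every way of picking one representation per query word.
candidates = [" ".join(seq) for seq in product(*representations)]

print(len(candidates))
print(candidates[0])
```

The number of candidates is the product of the list lengths, which grows quickly with query length; this is the motivation for the pruned search of the preferred embodiment.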

Step 107, calling the calculated values in the language model, calculating the occurrence probability of each word sequence in the search resources by adopting the following formula, and taking the word sequence with high occurrence probability as a query suggestion result.

Assume that a word sequence is $S = w_1 w_2 \ldots w_n$. The formula for estimating the occurrence probability of the word sequence with the language model (still taking the binary language model as the example) is:

$$P(w_1 w_2 \ldots w_n) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1}) \qquad \text{(III)}$$

In Formula III, each factor $P(w_1)$, $P(w_2 \mid w_1)$, …, $P(w_n \mid w_{n-1})$ can be obtained directly from the calculated values in the binary language model. Thus, for each set of candidate results, the language model can be invoked to calculate its occurrence probability, and a larger probability value indicates that the set of candidate results conforms better to the language model established in step 103.

The language model is used as the measure in this embodiment because the language model established in step 103 is built from the retrieval resources: a query suggestion result with a higher probability value matches the retrieval resources better, which helps the search engine retrieve the resources the user wants.
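Scoring one candidate with Formula III can be sketched as follows. The probability tables hold made-up values, the key layout `p_cond[(w_i, w_j)]` is a hypothetical choice, and the small `floor` for unseen terms is an assumption, since the text does not say how unseen words or pairs are handled. Logarithms are summed instead of multiplying probabilities to avoid numerical underflow on long sequences.

```python
import math


def sequence_log_prob(words, p_uni, p_cond, floor=1e-9):
    """log P(w1 ... wn) = log P(w1) + sum of log P(w_k | w_{k-1})."""
    logp = math.log(p_uni.get(words[0], floor))
    for prev, cur in zip(words, words[1:]):
        logp += math.log(p_cond.get((cur, prev), floor))
    return logp


# Made-up model values for illustration only.
p_uni = {"computer": 0.02, "computing": 0.01}
p_cond = {("science", "computer"): 0.3, ("science", "computing"): 0.05}

good = sequence_log_prob(["computer", "science"], p_uni, p_cond)
bad = sequence_log_prob(["computing", "scien"], p_uni, p_cond)
print(math.exp(good), good > bad)
```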

Step 108, processing the query suggestion result: either displaying it to the user as prompt information for selection, or passing it to the retrieval part of the system as an implicit query for retrieval.

For step 107, the invention also provides a preferred embodiment for obtaining the query suggestion results, which can quickly find a preset number of top-ranked results and thereby greatly improves the processing efficiency of the system. The specific method is as follows:

First, a directed graph structure is constructed based on step 106, as shown in FIG. 2; the edge between two nodes is not unique. The steps of constructing the directed graph are as follows:

(1) If n terms exist after the analysis in step 105, the graph has n + 1 nodes, numbered 0 to n;

(2) The graph represents each term by an edge. First, the originally input query words are added, the edge corresponding to the i-th query word being (i, i + 1); these edges form a path that corresponds to the originally input word sequence;

(3) For each query word, the words obtained by the correction operations are added. Taking "computing" as an example, the correction operations yield "compute" and "computer", and these two words are added as edges between nodes 0 and 1. In this way multiple paths are obtained, each corresponding to a word sequence.
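The graph construction steps above can be sketched with a plain edge list. The function name and the `(start, end, word)` tuple layout are hypothetical; query words are indexed from 0 so that the first word lies between nodes 0 and 1.

```python
def build_query_graph(representations):
    """representations[i] lists the forms of the i-th query word, original first."""
    n = len(representations)
    nodes = list(range(n + 1))  # n terms give n + 1 nodes, numbered 0..n
    # First add the original input as a path of edges (i, i + 1).
    edges = [(i, i + 1, forms[0]) for i, forms in enumerate(representations)]
    # Then add the corrected forms as parallel edges between the same nodes.
    for i, forms in enumerate(representations):
        edges.extend((i, i + 1, form) for form in forms[1:])
    return nodes, edges


nodes, edges = build_query_graph([
    ["computing", "computer"],
    ["scien", "science", "scientist"],
])
print(nodes)
print(edges)
```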

Then, on the basis of the constructed directed graph, the binary language model and Formula III are combined and a dynamic programming algorithm is used to obtain the optimal result, namely the word sequence corresponding to the path with the maximum occurrence probability. In query suggestion, the optimal result is sometimes not what the user really wants, because the information needs of different users are not fully consistent and are highly subjective. Therefore, this step also provides a method for finding the top n results on the basis of the dynamic programming algorithm.
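The text does not spell out the dynamic programming algorithm. One plausible reading, sketched here as an assumption, is a Viterbi-style pass over the query positions that keeps, for each representation of the current word, the best-scoring path ending in it. The model values are made up, the key layout `p_cond[(w_i, w_j)]` is hypothetical, and the `floor` for unseen terms is an assumption.

```python
import math


def best_sequence(representations, p_uni, p_cond, floor=1e-9):
    """Return the single most probable word sequence under a bigram model."""
    # best[form] = (log-probability, sequence) of the best path ending in form
    best = {w: (math.log(p_uni.get(w, floor)), [w]) for w in representations[0]}
    for forms in representations[1:]:
        nxt = {}
        for cur in forms:
            nxt[cur] = max(
                (score + math.log(p_cond.get((cur, prev), floor)), seq + [cur])
                for prev, (score, seq) in best.items()
            )
        best = nxt
    return max(best.values())[1]


# Made-up model values for illustration only.
p_uni = {"computer": 0.02, "computing": 0.01}
p_cond = {
    ("science", "computer"): 0.3,
    ("science", "computing"): 0.05,
    ("and", "science"): 0.2,
    ("technology", "and"): 0.1,
}
representations = [
    ["computing", "computer"],
    ["scien", "science"],
    ["nad", "and"],
    ["techno", "technology"],
]
print(best_sequence(representations, p_uni, p_cond))
```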

Referring to fig. 3, it is a schematic diagram of the process of obtaining the first n optimal results, where each edge is numbered. The method is described as follows:

(1) For the first query word, the calculated $P(w)$ values in the language model are called to obtain the occurrence probability of each representation of the query word, and a preset number of representations with high occurrence probability are selected;

For example, the first query word in FIG. 3 has three representations: (1) "computing", (5) "compute", and (6) "computer". Sorted by their $P(w)$ values in the binary language model, the order is (6), (1), (5). Assuming that the preset number n is 4, the top 4 edges after sorting are retained and edges beyond this limit are removed. In this example, (6), (1), and (5) are all retained.

(2) For the second query word, the formula $P(w_1 w_2) = P(w_1)\,P(w_2 \mid w_1)$ is used to calculate the occurrence probability of the word sequences comprising the first query word and the second query word, and a preset number of word sequences with high occurrence probability are selected; wherein $w_1$ is a representation of the first query word selected in step (1), and $w_2$ ranges over the representations of the second query word;

For example, the second query word in FIG. 3 has four representations: (2) "scien", (7) "science", (8) "scientist", and (9) "scientists". The representations (6), (1), (5) of the first query word are combined with the representations (2), (7), (8), (9) of the second query word, the corresponding values in the binary language model are substituted into the formula $P(w_1 w_2) = P(w_1)\,P(w_2 \mid w_1)$ to calculate each occurrence probability, the calculated values are sorted from high to low, and the first 4 word sequences are finally selected: (6)(7), (6)(8), (1)(7), (1)(8).

(3) Each subsequent query word is processed in turn by the method of step (2), using Formula III, $P(w_1 w_2 \ldots w_n) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})$, to calculate the occurrence probability of the word sequences, and a preset number of word sequences $S = w_1 w_2 \ldots w_n$ containing all query words are finally obtained.

For example, ranking the word sequences that include the third query word gives (6)(7)(11), (6)(8)(11), (1)(7)(11), (1)(8)(11); ranking the word sequences that include the fourth query word gives the optimal path (6)(7)(11)(12) and the suboptimal paths (6)(8)(11)(12), (1)(7)(11)(13), (1)(8)(11)(12). The optimal suggestion result (6)(7)(11)(12) corresponds to a path from node 0 to node n + 1 in the figure, i.e., (0, 1)(1, 2)…(n, n + 1), as indicated by the bold line in FIG. 3.

Referring to FIG. 4, a page display of the query suggestion results is shown. In this example, the suggestion results are ranked from high to low by "generation probability", and at most 10 results are shown. The first result is the optimal result, that is, the query suggestion result closest to the retrieval resources, which can be used directly as the search engine's "query suggestion" for the user's reference.

As can be seen from the above, the preferred method sorts and filters after the calculation for each additional query word, keeps only the several word sequences that best fit the language model for subsequent calculation, and performs no further calculation on word sequences with low occurrence probability in the language model, which saves a large amount of computation and improves efficiency.
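The top-n procedure of the preferred embodiment, extending the kept sequences by one query word at a time and pruning to a preset number after each step, can be sketched as a beam search. The function name, the made-up model values, the key layout `p_cond[(w_i, w_j)]`, and the `floor` for unseen terms are all assumptions for illustration.

```python
import math


def top_n_sequences(representations, p_uni, p_cond, n=4, floor=1e-9):
    """Keep the n most probable partial sequences after each query word."""
    beam = [(math.log(p_uni.get(w, floor)), [w]) for w in representations[0]]
    beam = sorted(beam, key=lambda item: item[0], reverse=True)[:n]
    for forms in representations[1:]:
        extended = [
            (score + math.log(p_cond.get((cur, seq[-1]), floor)), seq + [cur])
            for score, seq in beam
            for cur in forms
        ]
        # Sort from high to low and discard everything beyond the preset number.
        beam = sorted(extended, key=lambda item: item[0], reverse=True)[:n]
    return [(" ".join(seq), math.exp(score)) for score, seq in beam]


# Made-up model values for illustration only.
p_uni = {"computer": 0.02, "computing": 0.01}
p_cond = {
    ("science", "computer"): 0.3,
    ("science", "computing"): 0.05,
    ("and", "science"): 0.2,
    ("technology", "and"): 0.1,
}
representations = [
    ["computing", "computer"],
    ["scien", "science"],
    ["nad", "and"],
    ["techno", "technology"],
]
results = top_n_sequences(representations, p_uni, p_cond, n=4)
for text, prob in results:
    print(text, prob)
```

Because low-probability prefixes are dropped early, at most n times the number of representations of the next word are scored at each step, instead of the full product of all representation counts.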

It should be noted that the method provided by the embodiment of the invention is not limited by language and can be used for query correction in both English and Chinese. Because each language has its own characteristics, the corresponding query correction operations need to be selected according to the practical application, but the principle is exactly the same; the differences lie in the preprocessing of each language and in the chosen query word correction methods. For example, Chinese can use automatic Chinese word segmentation, English can use English tokenization, and so on.

For example, in Chinese processing, Chinese input also suffers from a problem similar to English "misspelling", namely "wrongly written characters": a set phrase may be written with one wrong character, or the place name "乌鲁木齐" (Ürümqi) may be entered as "鸟鲁木齐", in which the first character is replaced by the visually similar character for "bird". Such errors may be subjective (for example, the user does not know the correct character) or objective (for example, the wrong character is selected when using a pinyin input method). Generally, Chinese input containing wrongly written characters affects the search engine's results, so that the information the user really wants cannot be found. In addition, Chinese can also use the query processing method of word form completion; for example, when the input is "calculate", the prompts are "computer", "calculator", and so on.

Under the method provided by the embodiment of the invention, various query expansion methods are combined, and various ways of processing Chinese queries can be realized. For example, assume that the input is "look-up English letters". Chinese word segmentation is performed first, giving "look-up", "English", and "letters". Then a "wrongly written character check" is performed on "look-up", which finds the candidate word "searching"; "English" is processed, and word form completion gives results such as "English" and "hero"; homophone replacement is performed on "letters", which gives the candidate word "subtitles". Combined with a pre-established Chinese language model, the combinations of all candidate words are searched efficiently, and "searching for English subtitles" is found to be the input most likely to represent the user's intention.

Aiming at the query correction method, the invention also provides an embodiment of a query correction system. Referring to fig. 5, it is a structural diagram of an embodiment of the system, and the system includes a model generating unit 501, a data interface 502, a preprocessing unit 503, a query processing engine 504, and a query suggestion generating unit 505.

The model generating unit 501 is configured to build a language model using the search resources, where the language model includes a univariate model, a bivariate model, and a trigram model, or build a multivariate model according to application requirements. Taking a bigram language model as an example, the model gives the P (w) value of each unary term and the P (w) of the bigram term in all retrieval resources by using formulas I and II _i |w _j ) The value is obtained. The process of establishing the model can be referred to the aforementioned steps 101-103, and will not be described in detail here.

The data interface 502 is the external interface of the query correction system and is used for receiving query inputs and returning query suggestion results. The query words input by the user are transmitted through the data interface 502 to the preprocessing unit 503 for preprocessing. The preprocessing includes a series of operations such as tokenization or word segmentation, after which the word sequence of the original input is obtained.

The preprocessed word sequence is transmitted to a query processing engine 504 for modification, and the query processing engine 504 is configured to invoke various modification operations to modify each query word in the word sequence to obtain multiple representations corresponding to each query word, so as to obtain multiple candidate words. In the figure, only a plurality of correction operations of spell checking, word form completion, word form restoration and synonym replacement are listed, and other correction functions can be added according to practical application.
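One way to picture the query processing engine is as a loop over a list of pluggable correction operations. The sketch below is hypothetical throughout: the three toy operations use tiny lookup tables in place of real spell checking, word form completion, and word form restoration, and `expand_query` is a name introduced here. The original word is always kept as the first candidate, so that the original input remains one of the representations.

```python
def spell_check(word):
    # Toy lookup table standing in for a real spelling corrector.
    return {"serch": ["search"], "engish": ["english"]}.get(word, [])


def complete_word_form(word):
    # Toy prefix completion against a tiny vocabulary.
    vocabulary = ["computer", "calculator", "search"]
    return [v for v in vocabulary if v.startswith(word) and v != word]


def restore_word_form(word):
    # Toy word form restoration: strip a trailing plural "s".
    return [word[:-1]] if word.endswith("s") and len(word) > 3 else []


# Further correction operations can be appended to this list.
OPERATIONS = [spell_check, complete_word_form, restore_word_form]


def expand_query(words, operations=OPERATIONS):
    """Return one candidate list per query word, original word first."""
    expanded = []
    for word in words:
        candidates = [word]
        for operation in operations:
            for candidate in operation(word):
                if candidate not in candidates:
                    candidates.append(candidate)
        expanded.append(candidates)
    return expanded


expanded = expand_query(["serch", "comp", "songs"])
print(expanded)
```

Keeping the operations in a plain list mirrors the remark that other correction functions can be added according to the application.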

Each corrected query word yields a plurality of candidate words, and combining the candidate words in different ways yields a plurality of word sequences, namely a plurality of candidate results. The query suggestion generating unit 505 is configured to invoke the language model to calculate the occurrence probability of each word sequence in the retrieval resources and to take the word sequences with high occurrence probability as the query suggestion results. For the bigram language model, the calculation follows the aforementioned formula III.
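The scoring of candidate word sequences with a bigram model can be sketched as below, using the chain form $P(w_1 w_2 \ldots w_n) = P(w_1) P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})$. The probability tables are invented for illustration, and the function names are assumptions. The bigram table is keyed by (previous word, current word) and holds the conditional probability of the current word given the previous one; unseen entries are treated as probability 0, which a real system would likely smooth.

```python
def sequence_probability(sequence, p_unigram, p_bigram):
    """Multiply P(w1) by P(w_k | w_{k-1}) for each following word."""
    if not sequence:
        return 0.0
    probability = p_unigram.get(sequence[0], 0.0)
    for previous, current in zip(sequence, sequence[1:]):
        probability *= p_bigram.get((previous, current), 0.0)
    return probability


def rank_sequences(sequences, p_unigram, p_bigram):
    """Return (probability, sequence) pairs, most probable first."""
    scored = [(sequence_probability(s, p_unigram, p_bigram), s) for s in sequences]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Hypothetical model values, not taken from any real corpus.
P_UNIGRAM = {"serch": 0.001, "search": 0.05}
P_BIGRAM = {
    ("search", "english"): 0.2,
    ("serch", "english"): 0.01,
    ("english", "subtitles"): 0.3,
    ("english", "letters"): 0.02,
}

candidate_sequences = [
    ["serch", "english", "letters"],
    ["serch", "english", "subtitles"],
    ["search", "english", "letters"],
    ["search", "english", "subtitles"],
]
ranked = rank_sequences(candidate_sequences, P_UNIGRAM, P_BIGRAM)
print(ranked[0])
```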

Preferably, the query suggestion generation unit 505 provides a method for obtaining the top n optimal results, as shown in fig. 3. The first result obtained in this way is the optimal result, i.e., the query suggestion result closest to the retrieval resources, and can be used directly as the "query suggestion" of the search engine for the user's reference. The second result is the next best, and the third and later-ranked results can serve as further, lower-ranked query suggestion results.
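A top-n search of this kind is commonly realized as a beam search, which extends partial word sequences one query word at a time and keeps only the n most probable ones. The sketch below assumes that the first step ranks the first word's candidates by unigram probability and that later steps multiply in the bigram conditional probability; the probability tables and the name `top_n_sequences` are hypothetical, and the details of fig. 3 may differ.

```python
import heapq


def top_n_sequences(candidate_lists, p_unigram, p_bigram, n=3):
    """Beam search: keep the n most probable partial sequences per position."""
    if not candidate_lists:
        return []
    # Step 1: score each representation of the first query word by P(w1).
    beam = [(p_unigram.get(w, 0.0), (w,)) for w in candidate_lists[0]]
    beam = heapq.nlargest(n, beam, key=lambda item: item[0])
    # Step 2 onward: extend every kept sequence with each next candidate.
    for candidates in candidate_lists[1:]:
        extended = [
            (probability * p_bigram.get((sequence[-1], w), 0.0), sequence + (w,))
            for probability, sequence in beam
            for w in candidates
        ]
        beam = heapq.nlargest(n, extended, key=lambda item: item[0])
    return beam


# Hypothetical model values, not taken from any real corpus.
P_UNIGRAM = {"serch": 0.001, "search": 0.05}
P_BIGRAM = {
    ("search", "english"): 0.2,
    ("serch", "english"): 0.01,
    ("english", "subtitles"): 0.3,
    ("english", "letters"): 0.02,
}
candidate_lists = [["serch", "search"], ["english"], ["letters", "subtitles"]]

suggestions = top_n_sequences(candidate_lists, P_UNIGRAM, P_BIGRAM, n=2)
for probability, sequence in suggestions:
    print(" ".join(sequence), probability)
```

Compared with enumerating every combination, the beam bounds the work at each position to n times the number of candidates for that word, at the cost of possibly discarding a sequence that would have ranked highly later.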

The query suggestion result generated by the query suggestion generation unit 505 can be displayed as prompt information through the data interface 502 for selection by the user; or can be transmitted to a retrieval unit to be directly retrieved as an implicit query suggestion.

In summary, the embodiments of the present invention provide a novel query correction method. Given that user queries often contain errors, automatic correction or query prompting enables the search engine to retrieve the results the user expects even when the input is erroneous. This spares the user from modifying the query repeatedly and helps the user use the search engine more efficiently.

For the parts of the system shown in fig. 5 that are not described in detail, reference may be made to the relevant parts of the method shown in fig. 1, and for the sake of brevity, they will not be described in detail here.

The query correction method and system provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to help in understanding the method and its core idea. Meanwhile, a person skilled in the art may, following the idea of the invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims

1. A query modification method, comprising:

presetting a language model by utilizing retrieval resources;

2. The method of claim 1, wherein: the language model includes a unigram language model and/or a multi-gram language model.

3. The method of claim 2, wherein the building of the bigram language model comprises:

preprocessing all retrieval resources to obtain each term;

counting the number of occurrences of each term in all retrieval resources, including the numbers of occurrences of unary terms and of binary terms;

and substituting the numbers of occurrences of all the unary terms and the binary terms into the following formulas to calculate:

$P(w) = \dfrac{C(w)}{\sum_{w^{*}} C(w^{*})}$ represents the probability of occurrence of a unary term, where $C(w)$ represents the number of occurrences of the unary term $w$, and $\sum_{w^{*}} C(w^{*})$ represents the sum of the numbers of occurrences of all unary terms;

$P(w_i \mid w_j) = \dfrac{C(w_i, w_j) / \sum_{w^{*}, w^{**}} C(w^{*}, w^{**})}{C(w_j) / \sum_{w^{*}} C(w^{*})}$ represents the probability that the word $w_i$ occurs given the word $w_j$, where $C(w_i, w_j)$ represents the number of co-occurrences of the binary term $w_i$ and $w_j$, $\sum_{w^{*}} C(w^{*})$ represents the sum of the numbers of occurrences of all unary terms, and $\sum_{w^{*}, w^{**}} C(w^{*}, w^{**})$ represents the sum of the numbers of occurrences of all binary terms.

4. The method of claim 3, wherein the step of invoking the language model to calculate the probability of occurrence of a word sequence comprises: for each word sequence $S = w_1 w_2 \ldots w_n$, substituting the corresponding $P(w)$ and $P(w_i \mid w_j)$ values in the language model into the formula $P(w_1 w_2 \ldots w_n) = P(w_1) P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})$ to calculate the probability of occurrence of the word sequence.

5. The method of claim 3, wherein said step of invoking a language model to calculate a probability of occurrence of a word sequence comprises:

step 2, for the second query term, using the formula $P(w_1 w_2) = P(w_1) P(w_2 \mid w_1)$ to calculate the occurrence probability of the word sequences comprising the first query word and the second query word, and selecting a preset number of word sequences with high occurrence probability; wherein $w_1$ is the first query term representation selected in step 1, and $w_2$ ranges over the various representations of the second query term;

proceeding as in step 2 for each subsequent query word in turn, using the formula $P(w_1 w_2 \ldots w_n) = P(w_1) P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})$ to calculate the occurrence probability of the word sequences, and finally obtaining a predetermined number of word sequences $S = w_1 w_2 \ldots w_n$ containing all query words.

6. The method of claim 1, further comprising: displaying the query suggestion result as prompt information, or performing retrieval directly according to the query suggestion result.

7. A query revision system, comprising:

a model generation unit for presetting a language model by using retrieval resources;

a data interface for receiving a query input;

8. The system of claim 7, further comprising: a preprocessing unit for performing tokenization or word segmentation preprocessing on the original input to obtain the word sequence of the original input.

9. The system of claim 7, wherein: the query suggestion result generated by the query suggestion generating unit is displayed as prompt information through the data interface, or is sent directly to a retrieval unit for retrieval.

10. The system of claim 7, wherein: the model generating unit can establish a unigram or multi-gram language model, wherein the process of establishing the bigram language model comprises the following steps:

preprocessing all retrieval resources to obtain each term;

$\sum_{w^{*}} C(w^{*})$ represents the sum of the numbers of occurrences of all unary terms;

$P(w_i \mid w_j) = \dfrac{C(w_i, w_j) / \sum_{w^{*}, w^{**}} C(w^{*}, w^{**})}{C(w_j) / \sum_{w^{*}} C(w^{*})}$ represents the probability that the word $w_i$ occurs given the word $w_j$, where $C(w_i, w_j)$ represents the number of co-occurrences of the binary term $w_i$ and $w_j$, $\sum_{w^{*}} C(w^{*})$ represents the sum of the numbers of occurrences of all unary terms, and $\sum_{w^{*}, w^{**}} C(w^{*}, w^{**})$ represents the sum of the numbers of occurrences of all binary terms.

11. The system of claim 7, wherein the process of the query suggestion generation unit generating query suggestion results comprises:

step 2, for the second query term, using the formula $P(w_1 w_2) = P(w_1) P(w_2 \mid w_1)$ to calculate the occurrence probability of the word sequences containing the first and second query words, and selecting a predetermined number of word sequences with high occurrence probability; wherein $w_1$ is the first query term representation selected in step 1, and $w_2$ ranges over the various representations of the second query term;

proceeding as in step 2 for each subsequent query word in turn, using the formula $P(w_1 w_2 \ldots w_n) = P(w_1) P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})$ to calculate the occurrence probability of the word sequences, and finally obtaining a predetermined number of word sequences $S = w_1 w_2 \ldots w_n$ containing all query words.