CN101131706A - Query amending method and system thereof - Google Patents


Info

Publication number
CN101131706A
Authority
CN
China
Prior art keywords
query
word
occurrence
term
language model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2007101753268A
Other languages
Chinese (zh)
Other versions
CN101131706B (en)
Inventor
高立琦
刘挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Beijing Kingsoft Software Co Ltd
Beijing Jinshan Digital Entertainment Technology Co Ltd
Original Assignee
Harbin Institute of Technology
Beijing Kingsoft Software Co Ltd
Beijing Jinshan Digital Entertainment Technology Co Ltd
Application filed by Harbin Institute of Technology, Beijing Kingsoft Software Co Ltd, Beijing Jinshan Digital Entertainment Technology Co Ltd
Priority to CN2007101753268A
Publication of CN101131706A
Application granted
Publication of CN101131706B
Current legal status: Active

Abstract

The invention discloses a query correction method and system, solving the problem that current search engines cannot correctly analyze the various erroneous queries that users input, which causes retrieval to fail. The method includes: presetting a language model using the retrieval resources; calling the corresponding correction operations to correct each originally input query word, obtaining multiple representations of each query word, including the originally input representation; obtaining word sequences in various combinations from the multiple representations of each query word; and using the language model to compute the occurrence probability of each word sequence, taking the word sequences with high occurrence probability as the query suggestion result. The invention can handle in a unified way the various input errors or incomplete inputs that appear in queries, such as spelling mistakes and incomplete word forms, so as to correct the query automatically and help users use the search engine efficiently.

Description

Query correction method and system
Technical Field
The invention relates to a search engine technology, in particular to a query correction method and a query correction system in a search engine.
Background
A query refers to a user's input, here specifically text input, when using a search engine. The query expresses the user's information need, and a search engine can return accurate retrieval results and provide high-quality service only by correctly understanding the query. However, when a user's query is analyzed, the accuracy of current search engines is limited by several factors, such as misspelled words, malformed word forms, incompletely entered words, and Chinese homophones, so that the search engine cannot correctly "understand" the user's intention and the returned results do not meet the user's needs.
Existing search engines, such as Google and Baidu, all pay great attention to query processing and provide query correction functions such as query completion and spelling checks. For example, when an incomplete form of "computer science" is entered in Google, the results page prompts "Do you mean: computer science", the word is completed automatically, and the retrieval results are also related to "computer science".
However, the query correction methods commonly used in current search engines are single-purpose and isolated: each performs either "spell checking" or "word form completion", and if the input query contains several kinds of errors, the existing methods have difficulty handling them at the same time. For example, if "computing scien nad techno" is entered, none of the methods used by current search engines can process it correctly, and no suggestion such as "computer science and technology" is given. Even if the query is processed in several passes, such as "word form completion" followed by "spell check", it is still hard to give an accurate prompt, because the separate passes cannot judge well which word should be selected: for example, "scien" can be completed as "science" or "scientist", and "techno" can be completed as "technology", "technical", and so on. Retrieval therefore fails when the user's query contains several kinds of errors or is otherwise in a form that prevents the search engine from correctly analyzing the user's intent.
Disclosure of Invention
The invention aims to provide a query correction method and a query correction system to solve the problem that the conventional search engine cannot correctly analyze various wrong queries input by a user so as to cause retrieval failure.
In order to solve the technical problem, according to a specific embodiment provided by the invention, the invention discloses the following technical scheme:
a query modification method, comprising:
presetting a language model by utilizing retrieval resources;
calling corresponding correction operation to correct each query word input originally to obtain multiple representations corresponding to each query word, wherein the multiple representations comprise the representations input originally;
obtaining word sequences in various combination forms according to various expressions of each query word;
and calling the language model to calculate the occurrence probability of the word sequences, and determining the word sequences with high occurrence probability as query suggestion results.
Wherein the language model comprises a unary (single-word) and/or multi-word language model.
Wherein, the step of establishing the binary language model comprises: preprocessing all retrieval resources to obtain each term; counting the number of occurrences of each term in all retrieval resources, including the numbers of occurrences of unary terms and binary terms; and substituting the numbers of occurrences of all the unary terms and binary terms into the following formulas:
$P(w) = C(w)/C(*)$, which represents the probability of occurrence of a unary term, where $C(w)$ represents the number of occurrences of the unary term $w$ and $C(*) = \sum_{w} C(w)$ represents the total number of occurrences of all unary terms;
$P(w_i \mid w_j)$ [formula image not reproduced], which represents the probability that the word $w_i$ occurs given that the word $w_j$ is present, where $C(w_i, w_j)$ represents the number of co-occurrences of the binary term $w_i$ and $w_j$, $C(*)$ represents the total number of occurrences of all unary terms, and $\sum_{w_i, w_j} C(w_i, w_j)$ represents the total number of occurrences of all binary terms.
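The conditional-probability formula is given only as an image in the source. A minimal sketch of one possible reading, assuming it is the joint relative frequency of the pair divided by the relative frequency of the conditioning word (a natural combination of the three quantities named above), is shown below; the exact form in the patent may differ:

```latex
\[
P(w_i \mid w_j) \;=\; \frac{P(w_i, w_j)}{P(w_j)}
\;\approx\; \frac{C(w_i, w_j) \,/\, \sum_{u,v} C(u, v)}{C(w_j) \,/\, C(*)}
\]
```

When the two totals are nearly equal, as in a long text, this estimate is close to the familiar ratio $C(w_i, w_j)/C(w_j)$.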
The step of calling the language model to calculate the occurrence probability of the word sequences comprises: for each word sequence $S = w_1 w_2 \dots w_n$, substituting the corresponding $P(w)$ and $P(w_i \mid w_j)$ values from the language model into the formula $P(w_1 w_2 \dots w_n) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})$ to calculate the occurrence probability of the word sequence.
Preferably, the step of calling the language model to calculate the word sequence occurrence probability includes:
step 1, for the first query word, calling the calculation results of $P(w)$ in the language model to obtain the occurrence probability of each representation of the query word, and selecting a preset number of representations with the highest occurrence probability;
step 2, for the second query word, using the formula $P(w_1 w_2) = P(w_1)\,P(w_2 \mid w_1)$ to calculate the occurrence probability of the word sequences comprising the first and second query words, and selecting a preset number of word sequences with the highest occurrence probability; wherein $w_1$ is a representation of the first query word selected in step 1, and $w_2$ ranges over the representations of the second query word;
for each subsequent query word in turn, proceeding as in step 2 and using the formula $P(w_1 w_2 \dots w_n) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})$ to calculate the occurrence probability of the word sequences, finally obtaining a predetermined number of word sequences $S = w_1 w_2 \dots w_n$ containing all query words.
The method further comprises the following steps: and displaying the query suggestion result as prompt information, or directly retrieving according to the query suggestion result.
A query revision system comprising:
the model generating unit is used for presetting a language model by utilizing retrieval resources;
a data interface for receiving a query input;
the query processing engine is used for calling corresponding correction operation to correct each query term which is input originally to obtain multiple representations corresponding to each query term, wherein the multiple representations comprise the representation of the original input;
the query suggestion generation unit is used for obtaining word sequences in various combination forms according to various expressions of each query word; and calling the language model to calculate the occurrence probability of the word sequences, and determining the word sequences with high occurrence probability as query suggestion results.
The system further comprises: a preprocessing unit, used for performing word segmentation (for Chinese) or tokenization (for English) on the original input to obtain the word sequence of the original input.
And the query suggestion result generated by the query suggestion generation unit is displayed as prompt information through a data interface or directly sent to a retrieval unit for retrieval.
The model generation unit can establish a unary or multi-word language model, wherein the process of establishing the binary language model comprises: preprocessing all retrieval resources to obtain each term; counting the number of occurrences of each term in all retrieval resources, including the numbers of occurrences of unary terms and binary terms; and substituting the numbers of occurrences of all the unary terms and binary terms into the following formulas:
$P(w) = C(w)/C(*)$, which represents the probability of occurrence of a unary term, where $C(w)$ represents the number of occurrences of the unary term $w$ and $C(*) = \sum_{w} C(w)$ represents the total number of occurrences of all unary terms;
$P(w_i \mid w_j)$ [formula image A20071017532600081 not reproduced], which represents the probability that the word $w_i$ occurs given that the word $w_j$ is present, where $C(w_i, w_j)$ represents the number of co-occurrences of the binary term $w_i$ and $w_j$, $C(*)$ represents the total number of occurrences of all unary terms, and $\sum_{w_i, w_j} C(w_i, w_j)$ represents the total number of occurrences of all binary terms.
Preferably, the process of generating the query suggestion result by the query suggestion generation unit includes:
step 1, for the first query word, calling the calculation results of $P(w)$ in the language model to obtain the occurrence probability of each representation of the query word, and selecting a preset number of representations with the highest occurrence probability;
step 2, for the second query word, using the formula $P(w_1 w_2) = P(w_1)\,P(w_2 \mid w_1)$ to calculate the occurrence probability of the word sequences comprising the first and second query words, and selecting a preset number of word sequences with the highest occurrence probability; wherein $w_1$ is a representation of the first query word selected in step 1, and $w_2$ ranges over the representations of the second query word;
for each subsequent query word in turn, proceeding as in step 2 and using the formula $P(w_1 w_2 \dots w_n) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})$ to calculate the occurrence probability of the word sequences, finally obtaining a predetermined number of word sequences $S = w_1 w_2 \dots w_n$ containing all query words.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
The invention can handle in a unified way the various input errors or incomplete inputs in a query, such as spelling errors and incomplete word forms, so as to correct the query automatically and help the user use the search engine effectively. The query words input by the user undergo several correction analyses: each correction operation, such as spell check, word form completion, word form reduction, or synonym replacement, is applied to a query word to obtain multiple representations of it, and these are combined into word sequences in multiple combination forms. The occurrence probability of each word sequence is then calculated using a language model established in advance from the retrieval resources, and the word sequences with high occurrence probability are taken as the query suggestion result. The result can be shown as an explicit prompt, or passed as an implicit query to the retrieval part of the system, which helps the user retrieve a satisfactory result and improves the efficiency with which the user uses the search engine.
Drawings
FIG. 1 is a flow chart illustrating steps of a query modification method according to an embodiment of the present invention;
FIG. 2 is a diagram of a query internal data structure in a preferred embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating the process of obtaining the first n optimal results in the embodiment shown in FIG. 2;
FIG. 4 is a diagram of the display effect of a page of the query suggestion result in the embodiment shown in FIG. 2;
FIG. 5 is a structural diagram of a query modification system according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The embodiment of the invention provides a query correction method for the case where a query input by a user contains several kinds of errors, or is otherwise in a form that prevents the search engine from correctly analyzing the user's retrieval intention, so that retrieval fails. The query words input by the user undergo several correction analyses; a language model established in advance from the retrieval resources is used as the measure; and an efficient search algorithm then finds the query correction suggestions that best fit the language model. These suggestions can be shown as an explicit prompt or passed as an implicit query to the retrieval part. This helps the user retrieve the expected result, avoids repeated manual modification of the query, and improves the efficiency with which the user uses the search engine.
Referring to fig. 1, a flowchart of steps of a query modification method according to an embodiment of the present invention is shown.
Wherein, the steps 101-103 are the process of using the retrieval resource to preset the language model, as follows:
Step 101, processing all retrieval resources (such as web pages). The specific process comprises: extracting the text of each web page according to specific requirements and converting it according to the page's encoding (this mainly affects non-English characters); then performing word segmentation on Chinese text and tokenization on English text, optionally followed by operations such as word form reduction and lowercasing according to application requirements. Through this processing, all the terms of all the retrieval resources are obtained, including Chinese terms and English terms.
Step 102, counting the number of occurrences of each term in all retrieval resources. For example, the search resource contains "abcdabef", and the statistical result is "a:2, b…". In addition, in preparation for the construction of the language model below, besides counting the number of occurrences of single words, it is also necessary to count the number of co-occurrences of two words, such as "a_b:2, b_c…".
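The counting in step 102 can be sketched as follows. This is a minimal illustration, assuming that each letter of "abcdabef" stands for one term and that co-occurrence means adjacency; the function name `count_terms` is hypothetical.

```python
from collections import Counter

def count_terms(terms):
    """Count single terms and adjacent term pairs in one token list."""
    unigrams = Counter(terms)
    # zip(terms, terms[1:]) yields each pair of neighbouring terms once.
    bigrams = Counter(zip(terms, terms[1:]))
    return unigrams, bigrams

# Assumption: every character of the sample resource is one term.
terms = list("abcdabef")
unigrams, bigrams = count_terms(terms)
print(unigrams["a"], bigrams[("a", "b")])
```

A `Counter` returns 0 for unseen terms or pairs, which keeps later lookups simple.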
When the data volume of the retrieval resources is large, a multi-path merging or distributed processing mode is generally adopted for statistics, namely, the resources are firstly divided into a plurality of groups, the statistics is carried out group by group, and finally, the statistics is combined. The statistical principle of three words and more than three words is consistent with that of two words, but the statistical data amount is larger. The statistical mode has no relation with languages, and is suitable for Chinese, english and other languages.
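The group-by-group counting followed by a merge might look like the sketch below. It assumes whitespace-separated terms and runs the groups sequentially; in a real distributed setting each group would be counted on a separate worker. The names `count_group` and `merge_counts` are hypothetical.

```python
from collections import Counter

def count_group(docs):
    """Count the terms of one group of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

def merge_counts(partials):
    """Combine the per-group counters into one overall counter."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds counts, does not overwrite them
    return total

# Two toy groups of documents, invented for illustration.
groups = [["a b c", "a b"], ["b c d"]]
merged = merge_counts(count_group(g) for g in groups)
print(merged)
```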
In this embodiment, the statistics is the number of times of common occurrence of two adjacent words or multiple words, and of course, multiple words that are not adjacent may be counted according to different model building processes, or the statistics may be performed according to other rules.
Step 103, establishing a statistical language model. The language model provides a probability distribution that is called directly in subsequent computation. Based on the processing in the above steps, a unary, binary, or multi-word language model can be established, where multi-word refers to a model built from the occurrence counts of sequences of several words. A binary language model is used as the example below. If the original retrieval resource data is large enough, a three-word (trigram) language model performs better than a binary one; the principle for building models of three or more words is the same as for the binary model and is not described in detail here.
The binary language model describes the probability of occurrence of a whole sentence using the statistics of pairs of adjacent words. Establishing the language model means pre-calculating $P(w)$ and $P(w_i \mid w_j)$ from the data of step 102, for use in estimating probabilities in subsequent processing. The formulas for $P(w)$ and $P(w_i \mid w_j)$ are as follows:
$P(w) = C(w)/C(*)$    (I)
Formula I represents the probability of occurrence of a unary term, where $C(w)$ represents the number of occurrences of the unary term $w$ and $C(*) = \sum_{w} C(w)$ represents the total number of occurrences of all unary terms;
[Formula II: image A20071017532600102 not reproduced]
Formula II is a conditional probability formula expressing the probability that the word $w_i$ occurs given that the word $w_j$ is present, where $C(w_i, w_j)$ represents the number of co-occurrences of the binary term $w_i$ and $w_j$, $C(*)$ represents the total number of occurrences of all unary terms, and $\sum_{w_i, w_j} C(w_i, w_j)$ represents the total number of occurrences of all binary terms.
After this processing, the values of the formulas for every unary term and binary term in the retrieval resources are available, and these results are called in the subsequent steps.
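Precomputing the two lookup tables could be sketched as below. Because formula II survives only as an image, this sketch swaps in the standard maximum-likelihood estimate C(prev, cur) / C(prev), which may differ from the patent's exact normalization. The function name `build_bigram_model` and the toy corpus are hypothetical.

```python
from collections import Counter

def build_bigram_model(sentences):
    """Return (p_uni, p_cond) lookup tables from tokenized sentences."""
    uni, bi = Counter(), Counter()
    for tokens in sentences:
        uni.update(tokens)
        bi.update(zip(tokens, tokens[1:]))
    total_uni = sum(uni.values())  # C(*), total of all unary terms
    # Formula I: P(w) = C(w) / C(*)
    p_uni = {w: c / total_uni for w, c in uni.items()}
    # Assumed stand-in for formula II: P(cur | prev) = C(prev, cur) / C(prev)
    p_cond = {(prev, cur): c / uni[prev] for (prev, cur), c in bi.items()}
    return p_uni, p_cond

# Toy corpus, invented for illustration only.
corpus = [
    ["computer", "science"],
    ["computer", "science"],
    ["computer", "scientist"],
]
p_uni, p_cond = build_bigram_model(corpus)
print(p_uni["computer"], p_cond[("computer", "science")])
```

Storing both tables as plain dictionaries matches the idea that later steps only look values up and never recount.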
The process of the above steps 101-103 requires the search system to process in advance, and the language model is established for the user to query. The following steps are query correction operations performed by the retrieval system when the user inputs a query.
Step 104, the user enters a query in the search box and triggers a query event. The queries addressed by the embodiment of the present invention are natural language inputs in general, not structured queries such as a "boolean query", for example "+ computer-science (and | | | or)". Suppose the user inputs "computing scien nad techno", while the intended input is "computer science and technology": "computer" is mis-entered as "computing", "science" and "technology" are entered incompletely, and "and" is misspelled as "nad".
Step 105, performing word segmentation or tokenization preprocessing on the obtained query to form a word sequence. Word segmentation extracts the most commonly used word units from the query. For example, if the query is "the new intellectual property right measures launched by the chinese government", the segmentation result may be "the chinese", "the government", "the intellectual property right", "the measures", or "the chinese government", "the intellectual property right measures", and so on; collocations that are not common combinations, for example "the government", can be effectively excluded, which reduces the number of units to be searched. Tokenization refers to splitting the input into English words and punctuation marks; for example, "good job!" becomes "good", "job", "!". Punctuation or certain "stop words" (words that do not contribute to query analysis and correction) may also be removed according to specific needs. For the above example "computing scien nad techno", the result after analysis is: "computing", "scien", "nad", "techno".
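For English input, the tokenization in step 105 can be sketched with a regular expression. This is a minimal illustration; the stop-word list and the function name `tokenize` are hypothetical, and lowercasing is an assumed normalization.

```python
import re

STOP_WORDS = {"the", "a", "of"}  # hypothetical stop-word list

def tokenize(query, drop_punct=False, drop_stop=False):
    """Split a query into lowercase words and punctuation marks."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    tokens = re.findall(r"\w+|[^\w\s]", query.lower())
    if drop_punct:
        tokens = [t for t in tokens if re.match(r"\w", t)]
    if drop_stop:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

print(tokenize("good job!"))
print(tokenize("computing scien nad techno"))
```

Chinese input would need a word segmenter instead, which is outside the scope of this sketch.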
Step 106, performing correction operations on each query word obtained from the word segmentation or tokenization. Which correction operations to use depends on actual requirements; this example uses "spell check", "word form reduction", and "word form completion", but the present invention is not limited to these correction operations. The system processes "computing" and finds that the words "compute" and "computer" can be obtained through "word form reduction"; processing "scien", "word form completion" yields "science", "scientist", and "scientists"; processing "nad", "spell check" yields "and" and "nap"; processing "techno" yields "technology", "technical", and so on.
Through step 106, multiple representations are obtained for each query word. Combining the representations of the query words yields word sequences in multiple combination forms, that is, multiple candidate results, such as "computer scientist and technology" and "computing scientists nap technology", as well as the user's original input "computing scien nad techno".
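Enumerating every combination of representations is a Cartesian product, sketched below. The candidate lists are illustrative: they follow the running example, with the original input placed first in each list, and "compute" is an assumed result of word form reduction.

```python
from itertools import product

# One list per query word; the first entry is the original input.
candidates = [
    ["computing", "compute", "computer"],
    ["scien", "science", "scientist", "scientists"],
    ["nad", "and", "nap"],
    ["techno", "technology", "technical"],
]

# Every way of choosing one representation per query word.
sequences = [" ".join(words) for words in product(*candidates)]
print(len(sequences))
print(sequences[0])
```

The number of sequences is the product of the list lengths, which grows quickly with query length and motivates the pruned search described later.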
Step 107, calling the calculated values in the language model, calculating the occurrence probability of each word sequence in the search resources by adopting the following formula, and taking the word sequence with high occurrence probability as a query suggestion result.
Assume that a word sequence is $S = w_1 w_2 \dots w_n$. The formula for estimating the occurrence probability of the word sequence using the language model is (still taking the binary language model as an example):
$P(w_1 w_2 \dots w_n) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})$    (III)
In formula III, each factor $P(w_1)$, $P(w_2 \mid w_1)$, …, $P(w_n \mid w_{n-1})$ can be taken directly from the calculated values in the binary language model. Thus, for each candidate result, the language model can be called to calculate its occurrence probability, and a larger probability value indicates that the candidate conforms better to the language model established in step 103.
The language model is used as a measure in the present embodiment, because the language model established in step 103 is established from the retrieval resources, the query suggestion result with the higher probability value has the higher matching degree with the retrieval resources, which is beneficial for the search engine to retrieve the resources desired by the user.
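Scoring a candidate with the chain of unary and conditional probabilities can be sketched as follows. Working in log space and using a small `floor` value for unseen words or pairs are assumptions of this sketch, not part of the source; the probabilities are toy numbers and the name `sequence_log_prob` is hypothetical.

```python
import math

def sequence_log_prob(words, p_uni, p_cond, floor=1e-9):
    """log P(w1..wn) = log P(w1) + sum over k of log P(w_k | w_{k-1})."""
    logp = math.log(p_uni.get(words[0], floor))
    for prev, cur in zip(words, words[1:]):
        logp += math.log(p_cond.get((prev, cur), floor))
    return logp

# Toy probabilities, invented for illustration only.
p_uni = {"computer": 0.02, "computing": 0.005}
p_cond = {("computer", "science"): 0.3, ("computing", "science"): 0.01}

cands = [
    ["computer", "science"],
    ["computing", "science"],
    ["computing", "scien"],
]
best = max(cands, key=lambda s: sequence_log_prob(s, p_uni, p_cond))
print(" ".join(best))
```

Summing logarithms gives the same ranking as multiplying the probabilities while avoiding underflow on long sequences.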
Step 108, processing the query suggestion result: either displaying it to the user as prompt information for selection, or passing it to the retrieval part of the system as an implicit query for retrieval.
For step 107, the invention also provides a preferred embodiment for obtaining the query suggestion result. The preferred embodiment can quickly find a preset number of top-ranked results, which greatly improves the processing efficiency of the system. The specific method is as follows:
first, a directed graph structure is constructed based on step 106, and as shown in fig. 2, the edge between two nodes is not unique. The steps of constructing the directed graph are as follows:
(1) If there are n terms after the analysis in step 105, the graph has n + 1 nodes, numbered 0 to n;
(2) The graph represents a term by using an edge, firstly, an original input query word is added, the edge corresponding to the ith query word is (i, i + 1), and a path is formed by the edge, so that the original input word sequence is obtained;
(3) For each query word, the words obtained by the correction operations are added. Taking "computing" as an example, the correction operations yield "compute" and "computer", and these two words are added as edges between nodes 0 and 1. In this way multiple paths, and hence multiple word sequences, are obtained.
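The three construction steps above can be sketched as a small lattice builder. This is a minimal illustration: parallel edges between the same pair of nodes are stored as a list of labels, and the name `build_lattice` and the correction table are hypothetical.

```python
from collections import defaultdict

def build_lattice(original, corrections):
    """Return (node_count, edges); edges[(i, i+1)] lists the i-th word's forms."""
    edges = defaultdict(list)
    for i, word in enumerate(original):
        edges[(i, i + 1)].append(word)  # the original input goes in first
        for alt in corrections.get(word, []):
            if alt not in edges[(i, i + 1)]:
                edges[(i, i + 1)].append(alt)
    # n query words give n + 1 nodes, numbered 0 to n.
    return len(original) + 1, dict(edges)

n_nodes, edges = build_lattice(
    ["computing", "scien"],
    {
        "computing": ["compute", "computer"],
        "scien": ["science", "scientist", "scientists"],
    },
)
print(n_nodes, edges[(0, 1)])
```

Any path that picks one label per consecutive edge spells out one candidate word sequence.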
Then, on the basis of the constructed directed graph, a dynamic programming algorithm combining the binary language model and formula III is used to obtain the optimal result, namely the word sequence corresponding to the path with the maximum occurrence probability. In query suggestion, the optimal result is sometimes not what the user really wants, because the information needs of different users are not identical and are highly subjective. Therefore, this step also provides a method for finding the top n results on the basis of the dynamic programming algorithm.
Referring to fig. 3, it is a schematic diagram of the process of obtaining the first n optimal results, where each edge is numbered. The method is described as follows:
(1) For the first query word, the calculation results of $P(w)$ in the language model are called to obtain the occurrence probability of each representation of the query word, and a preset number of representations with the highest occurrence probability are selected;
For example, the first query word in FIG. 3 has three representations: (1) "computing", (5) "compute", and (6) "computer". Sorted by their $P(w)$ values in the binary language model, the order is (6), (1), (5). Assuming that the predetermined number n is 4, the top 4 edges after sorting are retained and the edges beyond that limit are removed. In this example, all of (6), (1), and (5) are retained.
(2) For the second query word, the formula $P(w_1 w_2) = P(w_1)\,P(w_2 \mid w_1)$ is used to calculate the occurrence probability of the word sequences comprising the first and second query words, and a preset number of word sequences with the highest occurrence probability are selected; wherein $w_1$ is a representation of the first query word selected in step (1), and $w_2$ ranges over the representations of the second query word;
For example, the second query word in FIG. 3 has four representations: (2) "scien", (7) "science", (8) "scientist", and (9) "scientists". The representations (6), (1), (5) of the first query word are each combined with the representations (2), (7), (8), (9) of the second query word, the corresponding values in the binary language model are substituted into the formula $P(w_1 w_2) = P(w_1)\,P(w_2 \mid w_1)$ to calculate the occurrence probability of each combination, the calculated values are sorted from high to low, and the first 4 word sequences are finally selected: (6)(7), (6)(8), (1)(7), (1)(8).
(3) For each subsequent query word in turn, the method of step (2) is applied with formula III, $P(w_1 w_2 \dots w_n) = P(w_1)\,P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})$, to calculate the occurrence probability of the word sequences, finally obtaining a predetermined number of word sequences $S = w_1 w_2 \dots w_n$ containing all query words.
For example, ranking the word sequences that include the third query word gives (6)(7)(11), (6)(8)(11), (1)(7)(11), (1)(8)(11); ranking the word sequences that include the fourth query word gives the optimal path (6)(7)(11)(12) and the sub-optimal paths (6)(8)(11)(12), (1)(7)(11)(13), (1)(8)(11)(12). The optimal suggestion (6)(7)(11)(12) corresponds to a path from node 0 to node n+1 in the figure, i.e., (0, 1)(1, 2)…(n, n+1), as indicated by the bold line in FIG. 3.
Referring to FIG. 4, a page display effect diagram of a query suggestion result is shown. In this example, the suggested results are ranked from high to low according to the "generation probability", and the maximum number of results is 10. The first result is the optimal result, that is, the closest query suggestion result in the retrieval resource, which can be directly used as the "query suggestion" of the search engine for the user's reference.
It can be seen from the above that, the preferred method is to perform sorting and screening after adding the calculation result of each query word, select several word sequences conforming to the language model for subsequent calculation, and not perform subsequent calculation on the word sequences with low probability of occurrence in the language model, so that a large amount of calculation can be saved, thereby improving the calculation efficiency.
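The sort-and-prune procedure can be sketched as a beam-style search that keeps the n best partial sequences after each query word. All probabilities below are toy values invented for illustration, the `floor` value for unseen words and pairs is an assumption, and the name `top_n_sequences` is hypothetical.

```python
import math

def top_n_sequences(candidates, p_uni, p_cond, n=4, floor=1e-9):
    """Keep the n most probable partial sequences after each query word."""
    beam = [([w], math.log(p_uni.get(w, floor))) for w in candidates[0]]
    beam = sorted(beam, key=lambda x: x[1], reverse=True)[:n]
    for options in candidates[1:]:
        # Extend every surviving sequence with every form of the next word.
        extended = [
            (seq + [w], lp + math.log(p_cond.get((seq[-1], w), floor)))
            for seq, lp in beam
            for w in options
        ]
        beam = sorted(extended, key=lambda x: x[1], reverse=True)[:n]
    return [(" ".join(seq), math.exp(lp)) for seq, lp in beam]

p_uni = {"computer": 0.02, "computing": 0.005, "compute": 0.001}
p_cond = {
    ("computer", "science"): 0.3, ("computer", "scientist"): 0.05,
    ("computing", "science"): 0.02, ("computing", "scientist"): 0.01,
    ("science", "and"): 0.2, ("scientist", "and"): 0.1,
    ("and", "technology"): 0.1, ("and", "technical"): 0.01,
}
candidates = [
    ["computing", "compute", "computer"],
    ["scien", "science", "scientist", "scientists"],
    ["nad", "and", "nap"],
    ["techno", "technology", "technical"],
]
results = top_n_sequences(candidates, p_uni, p_cond, n=4)
for text, prob in results:
    print(text, prob)
```

Pruning trades completeness for speed: a sequence discarded early is never revisited, so a small n can miss a result that an exhaustive search would rank highly.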
It should be noted that the method provided by the embodiment of the present invention is not limited by language and can be used for query correction in both English and Chinese. Because each language has its own characteristics, the query correction operations must be chosen according to the practical application, but the principle is the same; the differences lie in the preprocessing for each language and in the word correction methods selected. For example, Chinese can use automatic Chinese word segmentation, English can use English tokenization, and so on.
For example, in Chinese processing, a problem similar to English "misspelling", namely "wrongly written characters", also occurs in Chinese input, such as "no thoughts" (which shall be "no thoughts") and "bird luqi" (which shall be "wulu wood qi"). Such errors may be subjective (e.g., the user does not remember which character is correct) or objective (e.g., a wrong character is selected when using a pinyin input method). Generally, Chinese input containing "wrongly written characters" affects the search results of the search engine, so that the information the user really wants cannot be found. In addition, Chinese can also use the "word form completion" query processing method: for example, if the input is "calculate", the prompts are "computer", "calculator", and so on.
With the method provided by the embodiment of the invention, various query expansion methods are combined so that various kinds of processing of Chinese queries can be realized. For example, assume the input is "look-up English letters". Chinese word segmentation is performed first, yielding "look-up", "English", and "letters". A wrongly-written-character check on "look-up" then finds the candidate word "searching"; word form completion on "English" gives results such as "English" and "hero"; and homophone replacement on "letters" gives the candidate word "subtitles". A pre-established Chinese language model is then used to search the combinations of all candidate words efficiently, and "searching for English subtitles" is found to be the input most likely to represent the user's intention.
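The combination step in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the candidate lists reuse the translated words from the example, while the probability values, the `FLOOR` constant for unseen events, and the function names are hypothetical and are not taken from the patent. The sketch scores every combination exhaustively, which is practical only for small candidate sets.

```python
from itertools import product

# Candidate words per segmented query word (original input first).
candidates = [
    ["look-up", "searching"],
    ["English", "hero"],
    ["letters", "subtitles"],
]

# Hypothetical toy probabilities standing in for a pre-built language model.
unigram = {"look-up": 0.01, "searching": 0.05}
bigram = {
    ("searching", "English"): 0.2,
    ("English", "subtitles"): 0.3,
    ("English", "letters"): 0.05,
}
FLOOR = 1e-6  # assumed probability for events missing from the toy tables


def sequence_probability(words):
    """P(w1..wn) = P(w1) * P(w2|w1) * ... * P(wn|wn-1) under a bigram model."""
    p = unigram.get(words[0], FLOOR)
    for prev, cur in zip(words, words[1:]):
        p *= bigram.get((prev, cur), FLOOR)
    return p


def best_sequences(candidates, top_n=3):
    """Score every combination of candidate words and return the top_n."""
    scored = [(sequence_probability(seq), seq) for seq in product(*candidates)]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    for prob, seq in best_sequences(candidates):
        print(" ".join(seq), prob)
```

With these made-up numbers the highest-scoring combination is "searching English subtitles", mirroring the outcome described in the example.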
For the foregoing query correction method, the invention also provides an embodiment of a query correction system. FIG. 5 is a structural diagram of an embodiment of the system, which includes a model generating unit 501, a data interface 502, a preprocessing unit 503, a query processing engine 504, and a query suggestion generating unit 505.
The model generating unit 501 is configured to build a language model from the retrieval resources. The language model may include a unigram model, a bigram model, and a trigram model, or a higher-order n-gram model may be built according to application requirements. Taking a bigram language model as an example, the model uses formulas I and II to give the $P(w)$ value of each unary term and the $P(w_i \mid w_j)$ value of each binary term over all retrieval resources. The process of establishing the model is described in the aforementioned steps 101-103 and is not repeated here.
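A minimal sketch of such a model builder follows, assuming the retrieval resources have already been segmented into lists of terms. The unary probability follows $P(w) = C(w)/C(*)$. The conditional probability uses the standard maximum-likelihood estimate $C(w_j, w_i)/C(w_j)$, which is an assumption made for this sketch and may differ from the patent's formula II; the function name `build_bigram_model` is likewise hypothetical.

```python
from collections import Counter


def build_bigram_model(documents):
    """Count unary and binary terms, then turn the counts into probabilities.

    documents: iterable of token lists (already segmented).
    Returns (p_unary, p_binary), where p_binary is keyed by (previous, current).
    """
    unary, binary = Counter(), Counter()
    for tokens in documents:
        unary.update(tokens)
        binary.update(zip(tokens, tokens[1:]))  # adjacent term pairs

    total = sum(unary.values())  # C(*): occurrences of all unary terms
    p_unary = {w: c / total for w, c in unary.items()}

    # Assumed estimate: P(current | previous) = C(previous, current) / C(previous)
    p_binary = {
        (prev, cur): c / unary[prev] for (prev, cur), c in binary.items()
    }
    return p_unary, p_binary


if __name__ == "__main__":
    docs = [
        ["search", "english", "subtitles"],
        ["search", "english", "letters"],
    ]
    p_unary, p_binary = build_bigram_model(docs)
    print(p_unary["search"], p_binary[("english", "subtitles")])
```

A production model would normally add smoothing so that unseen term pairs do not receive zero probability; that is omitted here to keep the sketch short.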
The data interface 502 provides the external interface of the query correction system, receiving query inputs and returning query suggestion results. The query entered by the user is passed through the data interface 502 to the preprocessing unit 503, where preprocessing such as word segmentation or tokenization produces the word sequence of the original input.
The preprocessed word sequence is passed to the query processing engine 504, which invokes various correction operations on each query word in the word sequence to obtain multiple representations of that query word, that is, multiple candidate words. The figure lists only several correction operations (spell checking, word form completion, word form restoration, and synonym replacement); other correction functions can be added according to the practical application.
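One way to picture the query processing engine is as a list of interchangeable correction functions applied to each query word. The sketch below is illustrative only: the lookup tables, the function names, and the `OPERATIONS` list are hypothetical stand-ins for real spell-checking, completion, and synonym resources. The original word is always kept as the first candidate, matching the requirement that the representations include the original input.

```python
# Hypothetical lookup tables standing in for real correction resources.
SPELLING = {"serch": ["search"]}
COMPLETION = {"calculat": ["calculate", "calculator"]}
SYNONYMS = {"movie": ["film"]}


def spell_check(word):
    """Return spelling corrections for word, if any."""
    return SPELLING.get(word, [])


def complete_word_form(word):
    """Return word form completions for word, if any."""
    return COMPLETION.get(word, [])


def replace_synonym(word):
    """Return synonyms for word, if any."""
    return SYNONYMS.get(word, [])


# New correction functions can be appended without changing the engine.
OPERATIONS = [spell_check, complete_word_form, replace_synonym]


def generate_candidates(words, operations=OPERATIONS):
    """Return one candidate list per query word, original word first."""
    result = []
    for word in words:
        candidates = [word]  # the original representation is always kept
        for operation in operations:
            for candidate in operation(word):
                if candidate not in candidates:
                    candidates.append(candidate)
        result.append(candidates)
    return result


if __name__ == "__main__":
    print(generate_candidates(["serch", "movie"]))
```

Keeping the operations in a plain list reflects the remark that further correction functions can be added as the application requires.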
Each query word yields several candidate words after correction, and combining the candidate words in different ways yields several word sequences, that is, several candidate results. The query suggestion generating unit 505 is configured to invoke the language model to calculate the occurrence probability of each word sequence in the retrieval resources and to take the word sequences with a high occurrence probability as query suggestion results. For a bigram language model, the calculation formula is the foregoing formula III.
Preferably, the query suggestion generating unit 505 uses the method shown in FIG. 3 to obtain the top n results. The first result is the optimal result, that is, the query suggestion that best matches the retrieval resources, and it can be used directly as the search engine's "query suggestion" for the user's reference. The second result is the next best, and the third and later-ranked results can serve as further suboptimal query suggestions.
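The top-n procedure can be sketched as a step-by-step search that keeps only a fixed number of partial word sequences after each query word. This is a hedged illustration rather than the patent's exact procedure: the function name `top_n_sequences`, the parameter `beam_width`, the `floor` value for unseen events, and the toy probabilities are all assumptions introduced here.

```python
def top_n_sequences(candidates, p_unary, p_binary, beam_width=10, floor=1e-9):
    """Return up to beam_width (probability, word sequence) pairs, best first.

    candidates: one list of candidate words per query word.
    p_unary:    {word: P(word)}
    p_binary:   {(previous, current): P(current | previous)}
    """
    def keep_best(scored):
        return sorted(scored, key=lambda item: item[0], reverse=True)[:beam_width]

    # Step 1: score each representation of the first query word with P(w).
    beam = keep_best([(p_unary.get(w, floor), (w,)) for w in candidates[0]])

    # Step 2 onward: extend every kept sequence by each representation of the
    # next query word, multiply in P(w_i | w_{i-1}), and prune again.
    for options in candidates[1:]:
        extended = [
            (prob * p_binary.get((seq[-1], w), floor), seq + (w,))
            for prob, seq in beam
            for w in options
        ]
        beam = keep_best(extended)
    return beam


if __name__ == "__main__":
    cands = [["a", "b"], ["c", "d"]]
    p1 = {"a": 0.6, "b": 0.4}
    p2 = {("a", "c"): 0.5, ("a", "d"): 0.1, ("b", "c"): 0.9, ("b", "d"): 0.2}
    print(top_n_sequences(cands, p1, p2, beam_width=2))
```

In the toy data, a width of 2 ranks the sequence ("b", "c") first, whereas a width of 1 discards "b" after the first step and returns ("a", "c") instead. This shows why the number of sequences kept per step trades accuracy against computation.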
The query suggestion result generated by the query suggestion generation unit 505 can be displayed as prompt information through the data interface 502 for selection by the user; or can be transmitted to a retrieval unit to be directly retrieved as an implicit query suggestion.
In summary, the embodiments of the present invention provide a novel query correction method. Because user queries often contain errors, the method uses automatic correction or query prompts so that the search engine can still retrieve the results the user expects even when the input is erroneous. This spares the user from modifying the query repeatedly and helps the user use the search engine more efficiently.
For the parts of the system shown in fig. 5 that are not described in detail, reference may be made to the relevant parts of the method shown in fig. 1, and for the sake of brevity, they will not be described in detail here.
The query modification method and system provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to help in understanding the method and its core idea. A person skilled in the art may, following the idea of the invention, vary the specific embodiments and the scope of application. Accordingly, the content of this description should not be construed as limiting the invention.

Claims (11)

1. A query modification method, comprising:
presetting a language model by utilizing retrieval resources;
calling corresponding correction operation to correct each query word input originally to obtain multiple representations corresponding to each query word, wherein the multiple representations comprise the representations input originally;
obtaining word sequences in various combination forms according to various expressions of each query word;
and calling the language model to calculate the occurrence probability of the word sequences, and determining the word sequences with high occurrence probability as query suggestion results.
2. The method of claim 1, wherein: the language model includes unigram and/or n-gram language models.
3. The method of claim 2, wherein the building of the bigram language model comprises:
preprocessing all retrieval resources to obtain each term;
counting the occurrence frequency of each term in all retrieval resources, wherein the occurrence frequency comprises the occurrence frequency of a unary term and a binary term;
and substituting the occurrence times of all the unary terms and the binary terms into the following formula to calculate:
$P(w) = C(w)/C(*)$, representing the probability of occurrence of a unary term, where $C(w)$ represents the number of occurrences of the unary term $w$, and $C(*)$ (Figure A2007101753260002C1) represents the sum of the numbers of occurrences of all unary terms;
$P(w_i \mid w_j)$ (the defining formula is an image in the source and is not reproduced here), representing the probability that the word $w_i$ occurs given that the word $w_j$ is present, where $C(w_i, w_j)$ represents the number of co-occurrences of the binary term $w_i$ and $w_j$, $C(*)$ represents the sum of the numbers of occurrences of all unary terms, and the quantity shown in Figure A2007101753260002C4 represents the sum of the numbers of occurrences of all binary terms.
4. The method of claim 3, wherein the step of invoking the language model to calculate the probability of occurrence of a word sequence comprises: for each word sequence $S = w_1 w_2 \ldots w_n$, substituting the corresponding $P(w)$ and $P(w_i \mid w_j)$ values in the language model into the formula $P(w_1 w_2 \ldots w_n) = P(w_1) P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})$ to calculate the probability of occurrence of the word sequence.
5. The method of claim 3, wherein said step of invoking a language model to calculate a probability of occurrence of a word sequence comprises:
step 1, for the first query word, calling the calculation results of $P(w)$ in the language model to obtain the occurrence probability of each representation of the query word, and selecting a preset number of representations with high occurrence probability;
step 2, for the second query word, using the formula $P(w_1 w_2) = P(w_1) P(w_2 \mid w_1)$ to calculate the occurrence probability of each word sequence comprising the first query word and the second query word, and selecting a preset number of word sequences with high occurrence probability, wherein $w_1$ is a representation of the first query word selected in step 1 and $w_2$ ranges over the various representations of the second query word;
proceeding as in step 2 for each subsequent query word in turn, using the formula $P(w_1 w_2 \ldots w_n) = P(w_1) P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})$ to calculate the occurrence probability of the word sequences, and finally obtaining a predetermined number of word sequences $S = w_1 w_2 \ldots w_n$ containing all query words.
6. The method of claim 1, further comprising: displaying the query suggestion result as prompt information, or performing retrieval directly according to the query suggestion result.
7. A query revision system, comprising:
a model generation unit for presetting a language model by using retrieval resources;
a data interface for receiving a query input;
a query processing engine for calling corresponding correction operations to correct each originally input query word to obtain multiple representations corresponding to each query word, wherein the multiple representations include the originally input representation;
a query suggestion generation unit for obtaining word sequences in various combination forms according to the various representations of each query word, calling the language model to calculate the occurrence probability of the word sequences, and determining the word sequences with high occurrence probability as query suggestion results.
8. The system of claim 7, further comprising: a preprocessing unit for performing word segmentation or tokenization preprocessing on the original input to obtain the word sequence of the original input.
9. The system of claim 7, wherein: the query suggestion result generated by the query suggestion generation unit is displayed as prompt information through the data interface, or is sent directly to a retrieval unit for retrieval.
10. The system of claim 7, wherein: the model generation unit can establish a unigram or n-gram language model, wherein the process of establishing the bigram language model comprises:
preprocessing all retrieval resources to obtain each term;
counting the occurrence frequency of each term in all retrieval resources, wherein the occurrence frequency comprises the occurrence frequency of a unary term and a binary term;
and substituting the occurrence times of all the unary terms and the binary terms into the following formula to calculate:
$P(w) = C(w)/C(*)$, representing the probability of occurrence of a unary term, where $C(w)$ represents the number of occurrences of the unary term $w$, and $C(*)$ (Figure A2007101753260004C1) represents the sum of the numbers of occurrences of all unary terms;
$P(w_i \mid w_j)$ (the defining formula is an image in the source and is not reproduced here), representing the probability that the word $w_i$ occurs given that the word $w_j$ is present, where $C(w_i, w_j)$ represents the number of co-occurrences of the binary term $w_i$ and $w_j$, $C(*)$ represents the sum of the numbers of occurrences of all unary terms, and the quantity shown in Figure A2007101753260004C4 represents the sum of the numbers of occurrences of all binary terms.
11. The system of claim 7, wherein the process of the query suggestion generation unit generating query suggestion results comprises:
step 1, for the first query word, calling the calculation results of $P(w)$ in the language model to obtain the occurrence probability of each representation of the query word, and selecting a preset number of representations with high occurrence probability;
step 2, for the second query word, using the formula $P(w_1 w_2) = P(w_1) P(w_2 \mid w_1)$ to calculate the occurrence probability of each word sequence containing the first and second query words, and selecting a preset number of word sequences with high occurrence probability, wherein $w_1$ is a representation of the first query word selected in step 1 and $w_2$ ranges over the various representations of the second query word;
proceeding as in step 2 for each subsequent query word in turn, using the formula $P(w_1 w_2 \ldots w_n) = P(w_1) P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})$ to calculate the occurrence probability of the word sequences, and finally obtaining a predetermined number of word sequences $S = w_1 w_2 \ldots w_n$ containing all query words.
CN2007101753268A 2007-09-28 2007-09-28 Query amending method and system thereof Active CN101131706B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2007101753268A CN101131706B (en) 2007-09-28 2007-09-28 Query amending method and system thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2007101753268A CN101131706B (en) 2007-09-28 2007-09-28 Query amending method and system thereof

Publications (2)

Publication Number Publication Date
CN101131706A true CN101131706A (en) 2008-02-27
CN101131706B CN101131706B (en) 2010-10-13

Family

ID=39128973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2007101753268A Active CN101131706B (en) 2007-09-28 2007-09-28 Query amending method and system thereof

Country Status (1)

Country Link
CN (1) CN101131706B (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102027720A (en) * 2008-05-30 2011-04-20 国际商业机器公司 Message handling
CN102623010A (en) * 2012-02-29 2012-08-01 北京百度网讯科技有限公司 Method and device for establishing language model and method and device for recognizing voice
WO2013007210A1 (en) * 2011-07-14 2013-01-17 腾讯科技(深圳)有限公司 Character input method, device and system
CN102937976A (en) * 2012-10-17 2013-02-20 北京奇虎科技有限公司 Drop-down prompting method and apparatus based on input prefix
CN103258025A (en) * 2013-05-08 2013-08-21 百度在线网络技术(北京)有限公司 Method for generating co-occurrence key words and method and system for providing associated search terms
CN103324620A (en) * 2012-03-20 2013-09-25 北京百度网讯科技有限公司 Method and device for rectifying marking results
WO2014040536A1 (en) * 2012-09-13 2014-03-20 腾讯科技(深圳)有限公司 Method, system, and storage medium for information search
CN103838739A (en) * 2012-11-21 2014-06-04 百度在线网络技术(北京)有限公司 Method and system for detecting error correction words in search engine
CN104331222A (en) * 2014-03-26 2015-02-04 广州三星通信技术研究有限公司 Method and equipment for inputting character by utilizing input method
CN105095222A (en) * 2014-04-25 2015-11-25 阿里巴巴集团控股有限公司 Unit word replacing method, search method and replacing apparatus
CN106294517A (en) * 2015-06-12 2017-01-04 富士通株式会社 Information processor and method
CN107291730A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 Method, device and the probabilistic dictionaries construction method of correction suggestion are provided query word
CN107491447A (en) * 2016-06-12 2017-12-19 百度在线网络技术(北京)有限公司 Establish inquiry rewriting discrimination model, method for distinguishing and corresponding intrument are sentenced in inquiry rewriting
CN107943781A (en) * 2016-10-13 2018-04-20 北京国双科技有限公司 Keyword recognition method and device
CN108846103A (en) * 2018-06-19 2018-11-20 北京天工矩阵信息技术有限公司 A kind of data query method and device
CN110069143A (en) * 2018-01-22 2019-07-30 北京搜狗科技发展有限公司 A kind of information is anti-error to entangle method, apparatus and electronic equipment
CN111274802A (en) * 2018-11-19 2020-06-12 阿里巴巴集团控股有限公司 Validity judgment method and device for address data
CN112800314A (en) * 2021-01-26 2021-05-14 浙江香侬慧语科技有限责任公司 Method, system, storage medium and device for automatic completion of search engine query
WO2024045926A1 (en) * 2022-08-29 2024-03-07 浙江极氪智能科技有限公司 Multimedia recommendation method and recommendation apparatus, and head unit system and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102867040B (en) * 2012-08-31 2015-03-18 中国科学院计算技术研究所 Chinese search engine mixed speech-oriented query error correction method and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050149499A1 (en) * 2003-12-30 2005-07-07 Google Inc., A Delaware Corporation Systems and methods for improving search quality
CN1560767A (en) * 2004-02-24 2005-01-05 珠海市汉易通信息科技有限公司 Automatic fully adding method for word input
US7254774B2 (en) * 2004-03-16 2007-08-07 Microsoft Corporation Systems and methods for improved spell checking

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102027720A (en) * 2008-05-30 2011-04-20 国际商业机器公司 Message handling
WO2013007210A1 (en) * 2011-07-14 2013-01-17 腾讯科技(深圳)有限公司 Character input method, device and system
US9176941B2 (en) 2011-07-14 2015-11-03 Tencent Technology (Shenzhen) Company Limited Text inputting method, apparatus and system based on a cache-based language model and a universal language model
CN102623010A (en) * 2012-02-29 2012-08-01 北京百度网讯科技有限公司 Method and device for establishing language model and method and device for recognizing voice
CN102623010B (en) * 2012-02-29 2015-09-02 北京百度网讯科技有限公司 A kind ofly set up the method for language model, the method for speech recognition and device thereof
CN103324620B (en) * 2012-03-20 2016-04-27 北京百度网讯科技有限公司 A kind of method and apparatus that annotation results is rectified a deviation
CN103324620A (en) * 2012-03-20 2013-09-25 北京百度网讯科技有限公司 Method and device for rectifying marking results
CN103678358A (en) * 2012-09-13 2014-03-26 腾讯科技(深圳)有限公司 Information search method and system
WO2014040536A1 (en) * 2012-09-13 2014-03-20 腾讯科技(深圳)有限公司 Method, system, and storage medium for information search
US20150302056A1 (en) * 2012-09-13 2015-10-22 Tencent Technology (Shenzhen) Company Limited Method, system, and storage medium for information search
CN102937976A (en) * 2012-10-17 2013-02-20 北京奇虎科技有限公司 Drop-down prompting method and apparatus based on input prefix
CN103838739A (en) * 2012-11-21 2014-06-04 百度在线网络技术(北京)有限公司 Method and system for detecting error correction words in search engine
CN103258025A (en) * 2013-05-08 2013-08-21 百度在线网络技术(北京)有限公司 Method for generating co-occurrence key words and method and system for providing associated search terms
CN103258025B (en) * 2013-05-08 2016-08-31 百度在线网络技术(北京)有限公司 Generate the method for co-occurrence keyword, the method that association search word is provided and system
CN104331222B (en) * 2014-03-26 2018-06-05 广州三星通信技术研究有限公司 Use the method and apparatus of input method input character
CN104331222A (en) * 2014-03-26 2015-02-04 广州三星通信技术研究有限公司 Method and equipment for inputting character by utilizing input method
CN105095222A (en) * 2014-04-25 2015-11-25 阿里巴巴集团控股有限公司 Unit word replacing method, search method and replacing apparatus
CN106294517A (en) * 2015-06-12 2017-01-04 富士通株式会社 Information processor and method
CN107291730A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 Method, device and the probabilistic dictionaries construction method of correction suggestion are provided query word
CN107291730B (en) * 2016-03-31 2020-07-31 阿里巴巴集团控股有限公司 Method and device for providing correction suggestion for query word and probability dictionary construction method
CN107491447A (en) * 2016-06-12 2017-12-19 百度在线网络技术(北京)有限公司 Establish inquiry rewriting discrimination model, method for distinguishing and corresponding intrument are sentenced in inquiry rewriting
CN107491447B (en) * 2016-06-12 2021-01-22 百度在线网络技术(北京)有限公司 Method for establishing query rewrite judging model, method for judging query rewrite and corresponding device
CN107943781A (en) * 2016-10-13 2018-04-20 北京国双科技有限公司 Keyword recognition method and device
CN110069143A (en) * 2018-01-22 2019-07-30 北京搜狗科技发展有限公司 A kind of information is anti-error to entangle method, apparatus and electronic equipment
CN108846103A (en) * 2018-06-19 2018-11-20 北京天工矩阵信息技术有限公司 A kind of data query method and device
CN108846103B (en) * 2018-06-19 2021-01-15 北京天工矩阵信息技术有限公司 Data query method and device
CN111274802A (en) * 2018-11-19 2020-06-12 阿里巴巴集团控股有限公司 Validity judgment method and device for address data
CN111274802B (en) * 2018-11-19 2023-04-18 阿里巴巴集团控股有限公司 Validity judgment method and device for address data
CN112800314A (en) * 2021-01-26 2021-05-14 浙江香侬慧语科技有限责任公司 Method, system, storage medium and device for automatic completion of search engine query
WO2024045926A1 (en) * 2022-08-29 2024-03-07 浙江极氪智能科技有限公司 Multimedia recommendation method and recommendation apparatus, and head unit system and storage medium

Also Published As

Publication number Publication date
CN101131706B (en) 2010-10-13

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C53 Correction of patent for invention or patent application
CB03 Change of inventor or designer information

Inventor after: Gao Liqi

Inventor after: Liu Ting

Inventor after: Zhang Yu

Inventor after: Zhao Yanyan

Inventor after: Zhao Shiqi

Inventor before: Gao Liqi

Inventor before: Liu Ting

COR Change of bibliographic data

Free format text: CORRECT: INVENTOR; FROM: GAO LIQI LIU TING TO: GAO LIQI LIU TING ZHANG YU ZHAO YANYAN ZHAO SHIQI