CN106708893A

CN106708893A - Error correction method and device for search query term

Info

Publication number: CN106708893A
Application number: CN201510791328.4A
Authority: CN
Inventors: 薛文伟
Original assignee: Huawei Technologies Co Ltd
Current assignee: Shenzhen Dimension Data Technology Co ltd
Priority date: 2015-11-17
Filing date: 2015-11-17
Publication date: 2017-05-24
Anticipated expiration: 2035-11-17
Also published as: WO2017084506A1; CN106708893B

Abstract

The invention provides an error correction method and device for a search query term. After receiving the query term in a search request, the error correction device first obtains an error correction candidate sub-term list for each sub-term in the query term; then, according to a pre-established N-Gram language model, it scores online, in real time, each error correction candidate term spliced from the error correction candidate sub-term lists to obtain an error correction target term list; finally, it outputs a final error correction result according to the score of each error correction target term in that list. Because the method and device rely on online analysis and calculation, the search query term entered by the user is corrected in real time, with no need for offline mass data processing or a historical error correction database and no dependence on large-scale search logs or user feedback. For application scenarios such as mobile application markets, error correction capability can be effectively improved.

Description

Search query word error correction method and device

Technical field

The present invention relates to the field of information technology, and more particularly to a search query word error correction method and device.

Background

Search is the basic and core function of any information or content retrieval system (such as an Internet search engine, a product search engine, a book retrieval system, or an HR employee search system). The user enters an unstructured text query word; the information system then uses technologies such as indexing, full-text search, and distributed computing to match the query word, returns the list of content items the user is looking for, and sorts it according to certain algorithmic rules. However, because of user operating errors, cognitive limitations, special input habits, and other reasons, errors in search query words are very common in information retrieval systems. Query word errors noticeably degrade search result quality and user experience, so query word error correction has become a very important dimension of search optimization.

At present, conventional Internet search engines, and mobile application market products backed by a search engine, mainly perform query word error correction by offline analysis, generating pairs of error correction source words and target words. Specifically, through offline data processing and analysis that combines various manual and automatic methods, and based on some predefined rules and thresholds, query word pairs are statistically derived from large-scale user search and click logs and stored in a historical error correction database. Each query word pair includes an erroneous source query word and a corrected target query word, for example ("中信证卷", "中信证券"), where the target word is CITIC Securities. When the search engine receives a query word, it only needs to look it up in the historical error correction database: if a matching source word is found, the query word is replaced with the corresponding target word before the index is searched and results are output; if there is no match, results are retrieved using the original query word.

This approach of offline error detection and correction followed by online matching requires large-scale search logs and user feedback, and is essentially tailored to search engine application scenarios. For a mobile application market, the scale of search logs and user feedback is several orders of magnitude smaller than that of a search engine, so when a search query word contains an error, the error correction capability drops markedly and the correct application often cannot be found.

Summary of the invention

In view of the above drawbacks of the prior art, the present invention provides a search query word error correction method and device for improving error correction capability.

In a first aspect, an embodiment of the present invention provides a search query word error correction method, including:

receiving a query word in a search request;

obtaining, according to a pre-established word error correction mapping table, an error correction candidate sub-word list for each sub-word in the query word;

obtaining an error correction target word list according to the error correction candidate sub-word lists and a pre-established N-Gram language model;

outputting an error correction result according to the error correction target word list.

With reference to the first aspect, in a first possible implementation of the first aspect, obtaining the error correction target word list according to the error correction candidate sub-word lists and the pre-established N-Gram language model specifically includes:

performing string concatenation on the error correction candidate sub-word lists to obtain an error correction candidate word list;

calculating, according to the pre-established N-Gram language model, a word score for each error correction candidate word in the error correction candidate word list, and sorting the error correction candidate words by their word scores to obtain the error correction target word list.

With reference to the first aspect, in a second possible implementation of the first aspect, obtaining the error correction target word list according to the error correction candidate sub-word lists and the pre-established N-Gram language model specifically includes:

taking the error correction candidate sub-word list of the first sub-word in the query word as a first intermediate result, and performing an error correction target word search operation on the first intermediate result, where the error correction target word search operation includes:

obtaining, according to the N-Gram language model, an N-Gram score for each candidate word in the first intermediate result;

sorting the first intermediate result by N-Gram score and, when the number of candidate words in the intermediate result exceeds a preset threshold L, taking the first L candidate words of the intermediate result as a second intermediate result;

splicing the error correction candidate sub-word list of the second sub-word in the query word with the second intermediate result, taking the spliced words as a new first intermediate result, and returning to perform the error correction target word search operation, until the second intermediate result corresponding to the last sub-word in the sub-word list is obtained, and taking that second intermediate result as the error correction target word list.
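The search operation of the second implementation can be sketched as a width-limited greedy search. The following is only a minimal illustration under stated assumptions: the function name `search_targets`, the `score` callback (standing in for the N-Gram score), and the `width` parameter (standing in for the preset threshold L) are hypothetical names, not taken from the patent.

```python
from typing import Callable, List


def search_targets(candidate_lists: List[List[str]],
                   score: Callable[[str], float],
                   width: int) -> List[str]:
    """Extend candidates one sub-word at a time, keeping at most `width` of them."""
    if not candidate_lists:
        return []
    # The candidate list of the first sub-word is the first intermediate result.
    intermediate = list(candidate_lists[0])
    for position in range(len(candidate_lists)):
        # Sort by score (highest first) and truncate to the threshold L;
        # this yields the second intermediate result.
        intermediate = sorted(intermediate, key=score, reverse=True)[:width]
        if position + 1 < len(candidate_lists):
            # Splice with the next sub-word's candidates to form
            # the new first intermediate result.
            intermediate = [prefix + nxt
                            for prefix in intermediate
                            for nxt in candidate_lists[position + 1]]
    return intermediate
```

With a small `width`, the number of strings scored at each step stays bounded by `width` times the size of the next candidate list, instead of growing with the product of all list sizes.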

With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, the method further includes:

deleting, from each intermediate result, the candidate words whose N-Gram score is less than or equal to the N-Gram score of the corresponding query word.

With reference to the second or third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the method further includes:

if the time spent so far on obtaining the error correction target word list according to the error correction candidate sub-word lists and the pre-established N-Gram language model exceeds a preset time, splicing the current intermediate result with the sub-words corresponding to the error correction candidate sub-word lists that have not yet been spliced, and taking the result as the error correction target word list.

With reference to the first aspect or any one of the second to fourth possible implementations of the first aspect, in a fifth possible implementation of the first aspect, before the error correction result is output according to the error correction target word list, the method further includes:

performing result filtering on the error correction target word list.

With reference to the fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, performing result filtering on the error correction target word list specifically includes at least one of the following methods:

calculating, according to the N-Gram language model, a sentence score for each error correction target word in the error correction target word list, and sorting the error correction target word list by sentence score;

checking all sub-words of each error correction target word in the error correction target word list, multiplying the score of any error correction target word that contains a near-pinyin sub-word by a penalty factor, and then sorting the error correction target word list by score, where the pinyin of a near-pinyin sub-word is an approximate pinyin of the pinyin of the corresponding sub-word of the query word;

deleting from the error correction target word list the error correction target words whose score is lower than an expected threshold, where the expected threshold is determined according to the number of sub-words of the query word.

With reference to the first aspect or any one of the second to sixth possible implementations of the first aspect, in a seventh possible implementation of the first aspect, obtaining, according to the pre-established word error correction mapping table, the error correction candidate sub-word list of each sub-word in the query word specifically includes:

performing word segmentation on the query word to obtain a sub-word list of the query word;

obtaining, according to the pre-established word error correction mapping table, the error correction candidate sub-word list of each sub-word in the sub-word list.

With reference to the seventh possible implementation of the first aspect, in an eighth possible implementation of the first aspect, the word error correction mapping table includes at least one of a Chinese pinyin mapping table, an English word index table, and a homograph dictionary;

obtaining, according to the pre-established word error correction mapping table, the error correction candidate sub-word list of each sub-word in the sub-word list specifically includes at least one of the following methods:

obtaining, according to the Chinese pinyin mapping table, homophone or near-homophone error correction candidates for each Chinese or pinyin sub-word in the sub-word list;

obtaining, according to the English word index table, error correction candidates for each English sub-word in the sub-word list;

obtaining, according to the homograph dictionary, homograph error correction candidates for each Chinese sub-word in the sub-word list.

With reference to the eighth possible implementation of the first aspect, in a ninth possible implementation of the first aspect, obtaining, according to the English word index table, the error correction candidates for each English sub-word in the sub-word list specifically includes:

for each English sub-word in the sub-word list, obtaining, according to the English word index table, the first M words in ascending order of edit distance from the English sub-word, together with the corpus occurrence count of each word;

scoring and ranking the M words according to their edit distances and corpus occurrence counts, and selecting the top N words in the ranking as the error correction candidates of the English sub-word, where M and N are positive integers and M is greater than N.
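The two-stage selection for English sub-words can be illustrated as follows. This is a sketch only: the patent does not state how edit distance and occurrence count are combined, so the formula `count / (1 + distance)` is an assumption chosen for illustration, and the names `edit_distance`, `english_candidates`, and `word_counts` are hypothetical. A real index table would also avoid scanning every word.

```python
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def english_candidates(sub_word: str, word_counts: Dict[str, int],
                       m: int, n: int) -> List[str]:
    """Take the M nearest words, re-rank them, and keep the top N (M > N)."""
    # Stage 1: the M words with the smallest edit distance (ties broken alphabetically).
    nearest = sorted(word_counts,
                     key=lambda w: (edit_distance(sub_word, w), w))[:m]

    # Stage 2: hypothetical combined score, frequent and close words rank first.
    def combined(word: str) -> float:
        return word_counts[word] / (1 + edit_distance(sub_word, word))

    return sorted(nearest, key=lambda w: (-combined(w), w))[:n]
```

For example, with the toy counts `{"facebook": 500, "facebox": 3, "face": 200, "weibo": 400}`, calling `english_candidates("facebok", counts, m=3, n=2)` first keeps the three closest words and then prefers the frequent ones among them.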

In a second aspect, an embodiment of the present invention provides a search query word error correction device, including:

a receiving module, configured to receive a query word in a search request;

an error correction preprocessing module, configured to obtain, according to a pre-established word error correction mapping table, an error correction candidate sub-word list for each sub-word in the query word;

an error correction processing module, configured to obtain an error correction target word list according to the error correction candidate sub-word lists and a pre-established N-Gram language model;

an output module, configured to output an error correction result according to the error correction target word list.

With reference to the second aspect, in a first possible implementation of the second aspect, the error correction processing module is specifically configured to:

perform string concatenation on the error correction candidate sub-word lists to obtain an error correction candidate word list;

With reference to the second aspect, in a second possible implementation of the second aspect, the error correction processing module is specifically configured to:

With reference to the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the error correction processing module is further configured to:

With reference to the second or third possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the error correction processing module is further configured to:

With reference to the second aspect or any one of the second to fourth possible implementations of the second aspect, in a fifth possible implementation of the second aspect, the device further includes:

an error correction result filtering module, configured to perform result filtering on the error correction target word list.

With reference to the fifth possible implementation of the second aspect, in a sixth possible implementation of the second aspect, the error correction result filtering module is specifically configured to:

With reference to the second aspect or any one of the second to sixth possible implementations of the second aspect, in a seventh possible implementation of the second aspect, the error correction preprocessing module is specifically configured to:

With reference to the seventh possible implementation of the second aspect, in an eighth possible implementation of the second aspect, the word error correction mapping table includes at least one of a Chinese pinyin mapping table, an English word index table, and a homograph dictionary;

the error correction preprocessing module is specifically configured to:

With reference to the eighth possible implementation of the second aspect, in a ninth possible implementation of the second aspect, the error correction preprocessing module is specifically configured to:

According to the search query word error correction method and device provided by the embodiments of the present invention, after receiving the query word in a search request, the search query word error correction device first obtains the error correction candidate sub-word list of each sub-word in the query word; then, according to the pre-established N-Gram language model, it scores online, in real time, each error correction candidate word spliced from the error correction candidate sub-word lists, obtaining an error correction target word list; finally, it outputs the final error correction result according to the score of each error correction target word in that list. Because it is based on online analysis and calculation, it corrects the search query word entered by the user in real time, without offline mass data processing or a historical error correction database and without relying on large-scale search logs or user feedback; for application scenarios such as mobile application markets, error correction capability can be effectively improved.

Brief description of the drawings

Fig. 1 is a schematic diagram of the system architecture of a search error correction system;

Fig. 2 is a schematic flowchart of Embodiment 1 of the search query word error correction method provided by the present invention;

Fig. 3 is a schematic flowchart of Embodiment 2 of the search query word error correction method provided by the present invention;

Fig. 4 is a schematic flowchart of a string segmentation method provided by the present invention;

Fig. 5 is a schematic flowchart of Embodiment 3 of the search query word error correction method provided by the present invention;

Fig. 6 is a schematic flowchart of Embodiment 4 of the search query word error correction method provided by the present invention;

Fig. 7(a) to Fig. 7(d) are schematic diagrams of the algorithm flow of the greedy algorithm provided by the present invention;

Fig. 8 is a schematic structural diagram of Embodiment 1 of the search query word error correction device provided by the present invention;

Fig. 9 is a schematic structural diagram of Embodiment 2 of the search query word error correction device provided by the present invention.

Detailed description of the embodiments

The method and device of the embodiments of the present invention can be applied to mobile application markets, employee search systems, book retrieval systems, and the like, where the search scale is relatively small, and can also be applied to Internet search engines, where the search scale is larger.

The method and device provided by the embodiments of the present invention aim to solve the technical problem in the prior art that, for application scenarios with a small search scale such as mobile application markets, offline error detection and correction followed by online matching yields low error correction capability.

Fig. 1 is a schematic diagram of the system architecture of a search error correction system. As shown in Fig. 1, the search error correction system includes a user terminal and a content retrieval system, where the content retrieval system includes a search subsystem, an error correction subsystem, an index file, and a content database. Through the graphical interface or programming interface of the user terminal, the user can submit a search query word to the search subsystem of the content retrieval system. Before calling its own algorithm to search for related content, the search subsystem first sends the received query word to the error correction subsystem to determine whether it contains a spelling error. Communication between the two subsystems can use any standard protocol (such as TCP or HTTP) and data interchange format (such as JSON or XML).

If the error correction subsystem determines that the query word is wrong, it returns m correction suggestions (i.e. error correction target words), sorted by their scores; otherwise it indicates that the query word is correct. The value of m can serve as an interface parameter that represents the maximum number of error correction target words the error correction subsystem may return for one query word, and can be specified dynamically by the search subsystem when it calls the error correction subsystem interface. To spare the user a difficult choice, only one correction suggestion may be returned, i.e. the default value of m can be set to 1. Besides specifying the maximum number of error correction target words, another way to set the interface parameters is for the search subsystem to specify a score threshold Γ, so that the error correction subsystem returns only error correction target words whose score is higher than this threshold. The two parameter modes can be used at the same time.
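The two interface parameters can be combined in a few lines. The sketch below is illustrative only: `select_suggestions`, `max_results` (the parameter m), and `min_score` (the threshold Γ) are hypothetical names, and the scored pairs would in practice come from the error correction subsystem.

```python
from typing import List, Optional, Tuple


def select_suggestions(scored: List[Tuple[str, float]],
                       max_results: int = 1,
                       min_score: Optional[float] = None) -> List[Tuple[str, float]]:
    """Return at most `max_results` suggestions, optionally above `min_score`."""
    # Sort suggestions by score, highest first.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if min_score is not None:
        # Keep only suggestions scoring strictly higher than the threshold.
        ranked = [pair for pair in ranked if pair[1] > min_score]
    # Cap the list at m entries (default 1, as suggested in the text).
    return ranked[:max_results]
```

An empty return value corresponds to the case where no suggestion is offered to the search subsystem.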

After the search subsystem receives the error correction target words returned by the error correction subsystem, it uses one or more of them to search its index and retrieve from its content database, and finally returns the search result list corresponding to the query word to the user.

The search query word error correction device provided by the present invention is a further improvement on the error correction subsystem in the above system architecture; it can be integrated into the error correction subsystem or replace it.

The technical solution of the present invention is described in detail below with specific embodiments. The following specific embodiments can be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments.

Fig. 2 is a schematic flowchart of Embodiment 1 of the search query word error correction method provided by the present invention. The method can be executed by a search query word error correction device. As shown in Fig. 2, the method of this embodiment includes:

Step S201: receive the query word in a search request.

Specifically, after the search query word error correction device receives the query word sent by the search subsystem, it can use an existing error determination method to judge whether the query word is wrong. If not, it indicates that the query word is correct; if so, it performs step S202 below to correct the query word.

Step S202: according to the pre-established word error correction mapping table, obtain the error correction candidate sub-word list of each sub-word in the query word.

Specifically, the search query word error correction device includes data files used for error correction, including a training corpus, from which the word error correction mapping table can be created. A query word can be split into several sub-words, which may be Chinese, pinyin, or English; according to the word error correction mapping table, the error correction candidate sub-word list of each of these sub-words can be obtained.

For example, the query word "qbpinyin书入法" can be split into ["qb", "pin", "yin", "书", "入", "法"]. According to the word error correction mapping table, the error correction candidate sub-word lists of these sub-words can be obtained: English "qb" → ["qq"]; pinyin "pin" → ["拼", "品", ...] and "yin" → ["音", "因", ...]; Chinese "书" → ["输", "数", ...]. The sub-words "入" and "法" are handled similarly and are not illustrated here.
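As a rough illustration of the lookup in step S202, the mapping table can be pictured as a dictionary from a sub-word to its correction candidates. Everything in this sketch is an assumption for illustration: the names `MAPPING_TABLE` and `candidate_lists` are hypothetical, only the "qb" entry comes from the example above, and keeping each sub-word as a candidate for itself is a design choice of the sketch rather than something the text specifies.

```python
from typing import Dict, List

# Toy mapping table. The "qb" entry follows the example in the text;
# the "facebok" entry is invented for illustration.
MAPPING_TABLE: Dict[str, List[str]] = {
    "qb": ["qq"],
    "facebok": ["facebook"],
}


def candidate_lists(sub_words: List[str],
                    table: Dict[str, List[str]]) -> List[List[str]]:
    """Build one error correction candidate list per sub-word."""
    lists = []
    for sub_word in sub_words:
        # Assumption: the original sub-word stays in its own candidate list,
        # so a correctly spelled sub-word can survive unchanged.
        candidates = [sub_word]
        for candidate in table.get(sub_word, []):
            if candidate not in candidates:
                candidates.append(candidate)
        lists.append(candidates)
    return lists
```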

Step S203: obtain the error correction target word list according to the error correction candidate sub-word lists and the pre-established N-Gram language model.

Specifically, the N-Gram language model can be created from the training corpus, and can be used to calculate the probability of a given piece of text with respect to the training corpus. The model rests on the assumption that the occurrence of the N-th word in a sentence depends only on the preceding N-1 words and is independent of all other words. The N-Gram probability (score) of a phrase or sentence is the product of the occurrence probabilities of its words, and these probabilities can be obtained by maximum likelihood estimation, directly counting how many times the N words occur together in the corpus. Suppose a phrase consists of the N words W_1, W_2, …, W_N in order; then, given that the first N-1 words have occurred, the probability of the N-th word is P(W_N | W_1 W_2 … W_{N-1}) = C(W_1 W_2 … W_N) / C(W_1 W_2 … W_{N-1}), where C(·) is the number of times the word sequence occurs in the corpus. Because training an N-Gram model with a large N requires a huge corpus, suffers from severe data sparsity, and has high time complexity, the commonly used models are the bigram (N=2) and the trigram (N=3); N=3 is preferred in this embodiment. In addition, to address data sparsity, a smoothing strategy can be adopted when implementing the N-Gram language model so that the N-Gram probabilities of the phrases and sentences that have occurred in the corpus sum to 1 and the N-Gram probability of any phrase or sentence is not 0.
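The counting formula can be made concrete with a very small model. The sketch below is a simplification under stated assumptions: it uses a bigram (N=2) rather than the trigram preferred in this embodiment, add-k smoothing is only one possible smoothing strategy (the text does not name one), and the class name `BigramModel`, the `<s>` start marker, and the parameter `k` are hypothetical.

```python
import math
from collections import Counter
from typing import List


class BigramModel:
    """Bigram model with add-k smoothing, scored in log space."""

    def __init__(self, corpus: List[List[str]], k: float = 1.0):
        self.k = k
        self.unigrams: Counter = Counter()
        self.bigrams: Counter = Counter()
        for sentence in corpus:
            tokens = ["<s>"] + sentence  # "<s>" marks the sentence start
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def prob(self, prev: str, word: str) -> float:
        # Unsmoothed estimate would be C(prev word) / C(prev);
        # adding k keeps unseen pairs above zero.
        return ((self.bigrams[(prev, word)] + self.k)
                / (self.unigrams[prev] + self.k * self.vocab_size))

    def log_score(self, tokens: List[str]) -> float:
        # Sum of log probabilities equals the log of their product.
        padded = ["<s>"] + tokens
        return sum(math.log(self.prob(p, w)) for p, w in zip(padded, padded[1:]))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied; rankings are unchanged because the logarithm is monotonic.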

In this embodiment, string concatenation can be performed on the error correction candidate sub-word lists, and the N-Gram probability (score) of each error correction candidate word spliced from those lists is then calculated according to the N-Gram language model. The candidate words can be sorted or filtered by score, finally yielding an error correction target word list. In this list, a higher score means the error correction target word is more likely to be the correct query word. For example, P("爱奇艺") = 0.853 and P("爱奇异") = 0.012, so "爱奇艺" (iQIYI) is more likely to be correct than "爱奇异".
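Splicing and scoring can be sketched with a Cartesian product over the candidate lists. This is an illustration only: `rank_spliced` is a hypothetical name, and the `score` callback stands in for the N-Gram scoring; any toy scores passed to it are made up.

```python
from itertools import product
from typing import Callable, List, Tuple


def rank_spliced(candidate_lists: List[List[str]],
                 score: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Splice one candidate per sub-word, score each result, sort descending."""
    # product() enumerates every combination of one candidate per sub-word.
    spliced = ("".join(parts) for parts in product(*candidate_lists))
    scored = [(word, score(word)) for word in spliced]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The number of spliced words is the product of the list lengths, so this exhaustive form is only practical for short queries with few candidates per sub-word.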

Step S204: output the error correction result according to the error correction target word list.

According to the score of each error correction target word in the list obtained in the above step, one or more error correction target words with higher scores can be selected and output to the search system as the error correction result. The final number of error correction results may of course be 0, in which case the search query word error correction device can return 0 error correction results, or directly return the erroneous query word, and so on.

According to the search query word error correction method provided by this embodiment, after receiving the query word in a search request, the search query word error correction device first obtains the error correction candidate sub-word list of each sub-word in the query word; then, according to the pre-established N-Gram language model, it scores online, in real time, each error correction candidate word spliced from the error correction candidate sub-word lists, obtaining an error correction target word list; finally, it outputs the final error correction result according to the score of each error correction target word in that list. Because it is based on online analysis and calculation, it corrects the search query word entered by the user in real time, without offline mass data processing or a historical error correction database and without relying on large-scale search logs or user feedback; for application scenarios such as mobile application markets, error correction capability can be effectively improved.

Fig. 3 is a schematic flowchart of Embodiment 2 of the search query word error correction method provided by the present invention. This embodiment further describes step S202 of the embodiment shown in Fig. 2. On the basis of that embodiment, as shown in Fig. 3, step S202 (obtaining, according to the pre-established word error correction mapping table, the error correction candidate sub-word list of each sub-word in the query word) specifically includes:

Step S301: perform word segmentation on the query word to obtain the sub-word list of the query word.

Specifically, the query word may contain Chinese, pinyin, and English, and processing it includes Chinese word segmentation, pinyin segmentation, and English segmentation. For Chinese word segmentation, the query can be cut in single-character mode, e.g. "天天酷跑" is segmented into "天 天 酷 跑", or in word mode, e.g. "天天酷跑" is segmented into "天天 酷跑". In single-character mode, if the total number of sub-words after splitting exceeds a preset threshold (such as 8), subsequent error correction can be skipped and 0 error correction results returned directly. Pinyin segmentation and English segmentation can be collectively called string segmentation, which can be carried out with segmentation algorithms such as reverse maximum matching, forward maximum matching, bidirectional maximum matching, or minimum segmentation. The concrete string segmentation scheme is illustrated below using the reverse maximum matching algorithm as an example.

Fig. 4 is a schematic flowchart of a string segmentation method provided by the present invention. As shown in Fig. 4, the segmentation algorithm is reverse maximum matching and the input string is "facebookweibo". Reverse matching preferentially outputs segmentations with fewer total words but longer individual words. The algorithm scans the string from back to front and checks whether the current prefix substring is pinyin or English; specifically, it can decide whether the substring is legal pinyin or English by checking whether it exists in the word error correction mapping table. If the algorithm finds a cut point, such as "facebook", it recursively checks the suffix substring ("weibo"), until it successfully outputs a segmentation or the segmentation fails.
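The scan described for Fig. 4 can be sketched as a short recursive function for the single-segmentation case. The sketch follows the description literally (the cut point moves from the end of the string towards the front, so the longest legal prefix is tried first, then the suffix is processed recursively). The name `cut_once` and the plain set `lexicon`, which stands in for the lookup in the word error correction mapping table, are assumptions of this illustration.

```python
from typing import List, Optional, Set


def cut_once(text: str, lexicon: Set[str]) -> Optional[List[str]]:
    """Return the first segmentation found, or None if segmentation fails."""
    if not text:
        return []  # nothing left to cut: success
    # Move the cut point from the back of the string to the front,
    # so longer prefixes are tested before shorter ones.
    for cut in range(len(text), 0, -1):
        prefix = text[:cut]
        if prefix in lexicon:
            # A legal prefix was found; recursively cut the suffix.
            rest = cut_once(text[cut:], lexicon)
            if rest is not None:
                return [prefix] + rest
    return None  # no legal prefix leads to a complete segmentation
```

With a lexicon containing "facebook", "face", "fa", and "weibo", the input "facebookweibo" yields `["facebook", "weibo"]`, the segmentation with the fewest and longest words.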

In this embodiment, a system configuration parameter can specify whether a string undergoes single segmentation (only the first segmentation result is obtained), multiple segmentation (the first preset number of segmentation results are obtained), or full segmentation (all possible segmentation results are obtained). If the mode is not single segmentation (i.e. it is multiple or full segmentation), then after finding a cut point the algorithm must continue scanning forward to find the next legal prefix, such as "face" and "fa". Each new cut point corresponds to a recursive pass over its suffix. If the string is long and full segmentation has too many candidates, the algorithm may take a long time; a threshold can therefore be set so that the algorithm terminates automatically once the segmentation time exceeds it (as shown in the right-hand column of Fig. 4), and merges and outputs the partial list of segmentation results obtained so far (as shown in the left-hand column of Fig. 4).
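Single, multiple, and full segmentation, together with the time threshold, can be expressed with one generator. This is a sketch under assumptions: `all_cuts`, `cut`, `max_results` (1 for single segmentation, a preset number for multiple, `None` for full), and `time_budget_s` are hypothetical names, and the set `lexicon` again stands in for the mapping table lookup.

```python
import time
from typing import Iterator, List, Optional, Set


def all_cuts(text: str, lexicon: Set[str], deadline: float) -> Iterator[List[str]]:
    """Lazily yield every segmentation, longest prefixes first."""
    if not text:
        yield []
        return
    for cut_point in range(len(text), 0, -1):
        if time.monotonic() > deadline:
            return  # time threshold exceeded: stop exploring
        prefix = text[:cut_point]
        if prefix in lexicon:
            # Each legal prefix starts its own recursion over the suffix.
            for rest in all_cuts(text[cut_point:], lexicon, deadline):
                yield [prefix] + rest


def cut(text: str, lexicon: Set[str],
        max_results: Optional[int] = 1,
        time_budget_s: float = 0.05) -> List[List[str]]:
    """Collect up to `max_results` segmentations within the time budget."""
    deadline = time.monotonic() + time_budget_s
    results: List[List[str]] = []
    for segmentation in all_cuts(text, lexicon, deadline):
        results.append(segmentation)
        if max_results is not None and len(results) >= max_results:
            break
    # On timeout, whatever was found so far is returned as a partial list.
    return results
```

Because the generator is lazy, single segmentation stops after the first result without exploring the shorter prefixes "face" and "fa".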

In addition, in this embodiment a system configuration parameter can also specify whether strings (pinyin and English) are segmented independently or in mixed mode. With independent segmentation, the pinyin string "weibo" can be segmented into "wei bo" and the English string "angrybirds" into "angry birds", but "facebookweibo" will not be split, because its segmentation result mixes pinyin and English. Mixed segmentation is more time-consuming but more widely applicable, and the choice can be made according to the application scenario.

Step S302: according to the pre-established word error correction mapping table, obtain the error correction candidate sub-word list of each sub-word in the sub-word list.

As mentioned in the above embodiment, the word error correction mapping table can be created from the training corpus; according to this table, the error correction candidate sub-word list of each sub-word in the sub-word list can then be obtained.

In this embodiment, after the word error correction mapping table is successfully created, it can be serialized and stored as a binary file on a storage device external to the search query word error correction device, and updated whenever the training corpus changes; each time the device starts, the file is quickly loaded into memory for use.

In addition, in this embodiment the training corpus is closely tied to the application scenario: the set of text values of all indexable fields of the content items to be searched in that scenario (such as applications, web pages, personnel, or books) is selected as the training corpus. For example, in a mobile application market the indexable fields are the names and descriptions of published applications; in an employee search system they are employee name, department, address, project history, and so on; in a book retrieval system they are book title, author, full text, and so on.

As an optional implementation, the query word log of a recent period (such as one month), with erroneous query words removed, can be used to expand the training corpus. Furthermore, a general-purpose corpus unrelated to the application scenario (such as news articles) need not be used as training corpus, which reduces the "invalid error correction" problem in which an error correction target word matches no indexable field value of any content item (that is, it occurs 0 times in the training corpus). Of course, part of a general-purpose corpus can also be added to the training corpus according to the particular system implementation or business requirements.

Additionally, the training corpus applicable to the search query word error correction device changes continuously over time. In the mobile application market scenario, for example, new applications are continually published, some old applications are taken offline, and the query word log keeps growing. The device can periodically update the word error correction mapping table according to how much the training corpus has changed. Specifically, the device can automatically rebuild the external word error correction mapping table from the latest training corpus at a fixed time each day (such as in the early morning) and reload it into memory; alternatively, the device can provide an administrator user interface and programming interface that allow the system administrator to update and reload the data structure file manually at any time. The two approaches can be used together.

Optionally, the word error correction mapping table includes at least one of: a Chinese character-pinyin mapping table, an English word index table, and a homograph dictionary. Step S302 can then specifically include at least one of the following implementations:

First implementation: according to the Chinese character-pinyin mapping table, obtain the homophone or near-homophone error correction candidate words of each Chinese or pinyin sub-word in the sub-word list, for example "ink marks" - ["learning by heart", "not anxious", ...] and "bright" - ["people", "name", ...].

Second implementation: according to the English word index table, obtain the error correction candidate words of each English sub-word in the sub-word list, for example "twiter" - ["twitter"].

Third implementation: according to the homograph dictionary, obtain the homograph error correction candidate words of each Chinese sub-word in the sub-word list, for example "say" - ["day"].

Further, in the above implementations, obtaining the error correction candidate words of each English sub-word in the sub-word list according to the English word index table can specifically include the following.

Specifically, when an English word needs correction, the English word index table is used to quickly locate the top M words closest to it in edit distance. Each of those words is then given a score that combines edit distance and corpus occurrence count, and the N highest-scoring words are output as error correction candidate words. The weighting can be designed to favor legal English words that are close to the original spelling and occur frequently in the corpus, for example: score = corpus occurrence count / edit distance.

In addition, whether an English word needs correction can be determined from the English word index table. Specifically, it can be determined that only English words absent from the English word index table need error correction candidate words looked up; alternatively, it can be determined that English words whose occurrence count in the English word index table is below a preset threshold also need error correction candidate words looked up.
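The English candidate lookup described above can be sketched as follows. This is a hedged illustration: `word_counts` (a plain dictionary of corpus occurrence counts) stands in for the English word index table, a linear scan stands in for the fast index lookup, and the names `english_candidates`, `m`, `n`, and `min_count` are assumptions introduced here.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def english_candidates(word, word_counts, m=10, n=3, min_count=1):
    """Return up to n correction candidates for `word`, best first."""
    # A word seen at least min_count times is treated as legal: no correction.
    if word_counts.get(word, 0) >= min_count:
        return []
    others = (w for w in word_counts if w != word)
    # Keep the m words nearest in edit distance (ties broken alphabetically).
    nearest = sorted(others, key=lambda w: (edit_distance(word, w), w))[:m]
    # Example weighting from the text: score = corpus count / edit distance.
    scored = [(word_counts[w] / edit_distance(word, w), w) for w in nearest]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [w for _, w in scored[:n]]
```

Setting `min_count` above 1 corresponds to the alternative in which rarely occurring words are also sent for correction.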

In the search query word error correction method provided by this embodiment, word segmentation is flexible and adaptable, and the word error correction mapping table is built for the specific application scenario and is therefore comprehensive and reliable. As a result, the error correction results obtained from the error correction candidate sub-word lists derived from the mapping table have high confidence, and the error correction capability is strong.

Fig. 5 is a schematic flowchart of embodiment three of the search query word error correction method provided by the present invention. This embodiment is one specific implementation of step S203 in the embodiment shown in Fig. 2. On the basis of the above embodiments, as shown in Fig. 5, in this embodiment step S203, obtaining the error correction target word list according to each error correction candidate sub-word list and the pre-established N-Gram language model, specifically includes:

Step S501: perform string concatenation on the error correction candidate sub-word lists to obtain an error correction candidate word list.

Specifically, each sub-word of the query word has a corresponding error correction candidate sub-word list. Concatenating the strings of these lists in order yields an error correction candidate word list, in which each error correction candidate word is a correction candidate for the whole query word and has the same number of sub-words as the query word.

Step S502: according to the pre-established N-Gram language model, calculate the word score of each error correction candidate word in the error correction candidate word list, and sort the error correction candidate words by their word scores to obtain the error correction target word list.

After the word probability of each error correction candidate word in the list has been calculated with the N-Gram language model, the word probability can be used directly as the candidate's word score, or the word probabilities can be normalized and the normalized values used as the word scores. The error correction candidate words are then sorted by score to obtain an error correction target word list in descending order of word score, where a higher score indicates a higher likelihood that the error correction target word is the correct query word. When the error correction result is output, the first, or the first several, error correction target words in the list can be selected.
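Steps S501 and S502 can be sketched as follows. This is only an illustration under assumptions: the `score` callable stands in for the N-Gram language model's word probability, and the function name `rank_candidates` and parameter `top_k` are hypothetical.

```python
from itertools import product

def rank_candidates(candidate_lists, score, top_k=3):
    """Concatenate one entry from each sub-word list, score, normalize, sort."""
    # S501: every in-order combination of candidate sub-words is one candidate word.
    words = ["".join(parts) for parts in product(*candidate_lists)]
    # S502: score each candidate word with the (assumed) language-model callable.
    scored = [(score(w), w) for w in words]
    total = sum(s for s, _ in scored) or 1.0  # avoid dividing by zero
    ranked = sorted(((s / total, w) for s, w in scored),
                    key=lambda item: (-item[0], item[1]))
    return ranked[:top_k]
```

Because the number of candidate words is the product of the list lengths, this exhaustive form is practical only when the query word has few sub-words.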

The search query word error correction method provided by this embodiment uses a simple algorithm and is suitable for scenarios in which the query word has few sub-words.

Fig. 6 is a schematic flowchart of embodiment four of the search query word error correction method provided by the present invention. This embodiment is another specific implementation of step S203 in the embodiment shown in Fig. 2. On the basis of the above embodiments, as shown in Fig. 6, in this embodiment step S203, obtaining the error correction target word list according to each error correction candidate sub-word list and the pre-established N-Gram language model, specifically includes:

Step S601: according to the N-Gram language model, obtain the N-Gram score of each candidate word in the first intermediate result.

Step S602: sort the first intermediate result by N-Gram score and, when the number of candidate words in the intermediate result exceeds a preset threshold L, take the first L candidate words of the intermediate result as the second intermediate result.

Step S603: concatenate the error correction candidate sub-word list of the second sub-word in the query word with the second intermediate result to form a new first intermediate result, and return to step S601, until the second intermediate result corresponding to the last sub-word in the sub-word list is obtained; that second intermediate result is used as the error correction target word list.

The method of this embodiment is a search algorithm based on a heuristic, greedy principle. Figs. 7(a) to 7(d) are schematic flowcharts of the greedy algorithm provided by the present invention; they show the processing of a sample query word, "paper is warded off in full open robbery war". This query word contains three errors: a pinyin homophone ("robbing" - "rifle"), an approximate pinyin ("bright" - "people"), and a homograph ("warding off" - "wall"). The correct error correction result should be "whole people's gunbattle wallpaper". In practical application scenarios a user's query word does not usually contain so many kinds of error at once; the example is chosen only to illustrate the algorithm flow more clearly.

Assume that Chinese word segmentation of the query word yields a sub-word list of length N. This example uses single-character segmentation (each Chinese character is one sub-word), so the sub-word list is ["complete", "bright", "robbing", "war", "warding off", "paper"], that is, N = 6. For each sub-word, its error correction candidate sub-word list, which is the union of all its homophones, near-homophones, and homographs, can be determined from the Chinese character-pinyin mapping table, the approximate pinyin conversion rules, and the homograph dictionary. As shown in Fig. 7(a), for each sub-word, CL1 denotes the list of sub-words with the same pinyin, CL2 denotes the list of sub-words with approximate pinyin, and CL3 denotes the list of homograph sub-words, so the error correction candidate sub-word list of the sub-word is CL = CL1 ∪ CL2 ∪ CL3.

The greedy algorithm computes in multiple rounds, the number of rounds being the length N of the sub-word list. The algorithm first takes the error correction candidate sub-word list of the first sub-word in the query word as the first intermediate result and performs the error correction target word search operation on it. The operation calculates the N-gram score (that is, the word probability) of each candidate word in the first intermediate result and then sorts the candidate words by N-gram score, for example P("complete") > P("power") > ... > P("circle"), giving the sorted first intermediate result. If the length of the first intermediate result exceeds a system preset threshold L (such as 30), the part of the list beyond L is truncated and only the first L candidate words are kept, giving the second intermediate result TR1. This is the first round of processing, Round1; see Fig. 7(a).

In the second round, Round2, the algorithm concatenates each entry of the error correction candidate sub-word list of the second sub-word with each entry of TR1, obtaining a new first intermediate result. It then performs the error correction target word search operation on the new first intermediate result: it calculates the N-gram score of each word and sorts and truncates by score and threshold L as before, obtaining a list such as P("whole people") > P("complete bright") > ... > P("circle name"), that is, the new second intermediate result TR2; see Fig. 7(b). In the third round, Round3, the algorithm concatenates the error correction candidate sub-word list of the third sub-word pairwise with TR2 and obtains TR3; see Fig. 7(c).

The algorithm continues looping. In the final, Nth round (the 6th round here), RoundN, the candidate list of the Nth sub-word is concatenated pairwise with the intermediate result TRN-1 to obtain TRN, which is the error correction target word list R = TRN; see Fig. 7(d).
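The round-by-round processing above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: `score` is an assumed callable standing in for the N-gram word probability, and `beam_width` plays the role of the threshold L.

```python
def greedy_search(candidate_lists, score, beam_width=30):
    """Build corrections one sub-word per round, keeping the best L prefixes."""
    beam = [""]  # before Round1 the only prefix is the empty string
    for candidates in candidate_lists:
        # Concatenate every kept prefix with every candidate for this sub-word.
        expanded = [prefix + sub for prefix in beam for sub in candidates]
        # Sort by score (descending) and truncate to the threshold L.
        expanded.sort(key=score, reverse=True)
        beam = expanded[:beam_width]  # this is TR_i for the current round
    return beam  # TR_N, the error correction target word list
```

Each round scores at most L times the length of one candidate list, so the work grows linearly with the number of sub-words instead of multiplicatively as in exhaustive concatenation.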

In this embodiment, optionally, while the error correction target word list is being obtained, candidate words in each intermediate result whose N-Gram score is less than or equal to the N-Gram score of the corresponding query word can also be deleted.

Specifically, for the intermediate result of each round, candidate words in the list whose score is less than or equal to that of the original query word are removed. For example, if P("circle is bright") < P("paper is warded off in full open robbery war"), the candidate word "circle is bright" is removed from TR2.

The method relies on the N-gram scoring formula, for example P("whole people's gunbattle") = P("complete") * P("people" | "complete") * P("rifle" | "whole people") * P("war" | "whole people's rifle"). By this formula, the N-gram score of a word can never exceed that of any of its prefix words, for example P("the bright gunbattle wallpaper of circle") < P("circle is bright"). If the score of a prefix word is already below that of the original query word, no candidate word built on that prefix needs further consideration. The method filters out unqualified candidate words as early as possible, which shrinks the search space of the algorithm and reduces its time complexity.
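The pruning rule can be illustrated with a toy chain-rule scorer. This is a sketch under assumptions: the bigram table, the `unseen` fallback probability, and the function names are hypothetical, and a bigram model is used here only for brevity.

```python
def make_bigram_scorer(bigram_prob, unseen=1e-6):
    """Return score(word) = product of P(char | previous char), starting from <s>."""
    def score(word):
        p, prev = 1.0, "<s>"
        for ch in word:
            p *= bigram_prob.get((prev, ch), unseen)  # each factor is at most 1
            prev = ch
        return p
    return score

def prune_below_query(candidates, score, query):
    """Drop candidates that score no higher than the original query word."""
    floor = score(query)
    return [c for c in candidates if score(c) > floor]
```

Since every factor is at most 1, extending a prefix can only keep or lower its score, which is why a prefix that already scores no higher than the original query word can be discarded together with everything built on it.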

In addition, candidate words in an intermediate result that match neither the N-gram language model nor the original query word can also be removed. For example, suppose "circle is bright" is an N-gram that does not exist in the N-gram language model. The model then outputs a very low probability using its smoothing strategy and also indicates that the word does not match the model; since the word does not match the original query word either, it can be removed.

In this embodiment, optionally, while the error correction target word list is being obtained, if the time currently spent obtaining the list according to the error correction candidate sub-word lists and the pre-established N-Gram language model exceeds a preset time, the current intermediate result can be concatenated with the sub-words corresponding to the error correction candidate sub-word lists that have not yet been concatenated, and the result used as the error correction target word list.

Specifically, timing starts when the algorithm executes the first round and is monitored continuously during execution. If, during round i (i < N), the accumulated time is found to exceed a system preset time (such as 100 milliseconds), the algorithm terminates early and returns the current partial error correction result, so that overly long candidate lists for certain sub-words do not keep the algorithm from finishing for a long time.

Taking the example of Figs. 7(a) to 7(d), suppose the algorithm times out during the third round. TR3 is then taken as the partial error correction result, the remaining sub-words of the original query word, "paper is warded off in the war", are appended to it directly, and the resulting error correction target word list is the output, for example ["whole people rob war and ward off paper", "paper is warded off in whole people's gunbattle", ...].
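The early-termination behaviour can be sketched as follows. This is a hedged illustration: `sub_words` is the segmented original query word, `score` is an assumed stand-in for the N-gram score, and the injectable `clock` parameter is a hypothetical convenience that makes the timeout testable.

```python
import time

def greedy_search_with_deadline(sub_words, candidate_lists, score,
                                beam_width=30, time_limit_s=0.1,
                                clock=time.monotonic):
    """Greedy search that falls back to the uncorrected tail on timeout."""
    start = clock()  # timing starts with the first round
    beam = [""]
    for i, candidates in enumerate(candidate_lists):
        # Before each later round, check the accumulated time.
        if i > 0 and clock() - start > time_limit_s:
            tail = "".join(sub_words[i:])  # uncorrected remainder of the query
            return [prefix + tail for prefix in beam]
        expanded = [prefix + sub for prefix in beam for sub in candidates]
        expanded.sort(key=score, reverse=True)
        beam = expanded[:beam_width]
    return beam
```

Checking the clock once per round keeps the overhead small; the price is that a single very slow round can overrun the limit before the check is reached.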

In addition, as mentioned in the above embodiments, Chinese word segmentation can use single-character mode or phrase mode. In phrase mode, when the above greedy algorithm is used, the segmentation of the query word and that of the correct error correction result may fail to match. For example, the erroneous query word "silent mark weather" is segmented as "silent mark weather", whereas the error correction target word is segmented as "ink marks weather", and the two do not match. In this case, full segmentation can be applied to the pinyin of the query word to obtain the various segmented pinyin strings, such as "moji tianqi", "mo ji tianqi", and "mo ji tian qi"; the greedy error correction result search algorithm is then called on each pinyin string in turn and the results are merged.

The search query word error correction method provided by this embodiment processes the error correction candidate sub-word lists with a greedy algorithm to obtain the final error correction target word list, which effectively improves the operating efficiency of the system.

On the basis of the above embodiments, in one embodiment of the present invention, before the error correction result is output according to the error correction target word list in step S204, the method further includes: performing result filtering on the error correction target word list, so as to ensure that high-confidence error correction results are output, reduce the false correction rate, and improve the user experience.

In this embodiment, result filtering can be performed using at least one of the following implementations:

First implementation: according to the N-Gram language model, calculate the sentence score of each error correction target word in the error correction target word list, and sort the error correction target word list by sentence score.

Specifically, word probability works well for filtering intermediate results during the error correction search, because the scoring there involves incomplete sequences of sub-words. Sentence probability may work better for filtering the final results, because in most application scenarios the query word is a complete phrase (such as a mobile application name, an employee's full name, or a book title).

In this embodiment, for each target word output by the error correction result search module, the N-gram sentence probability, rather than the word probability, is recalculated and used as its score. After each error correction target word has been rescored, the result list is sorted by the new scores. The N-gram sentence probability is calculated as: P(<s> W1 W2 … Wn </s>) = P(W1 | <s>) * P(W2 | <s> W1) * … * P(</s> | <s> W1 W2 … Wn), where <s> and </s> are the sentence start and end symbols defined by the N-gram language model.
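The sentence probability can be sketched as follows, computed in log space to avoid underflow. This is an illustration under assumptions: `cond_prob(token, history)` is a hypothetical callable standing in for the language model, and the `order` parameter truncates the history to the usual N-gram window rather than using the full history written in the formula.

```python
import math

def sentence_log_prob(sub_words, cond_prob, order=2):
    """Sum of log P(token | history) over W1..Wn and the end symbol </s>."""
    tokens = ["<s>"] + list(sub_words) + ["</s>"]
    total = 0.0
    for i in range(1, len(tokens)):
        # Keep at most (order - 1) preceding tokens as the conditioning history.
        history = tuple(tokens[max(0, i - order + 1):i])
        total += math.log(cond_prob(tokens[i], history))
    return total  # n + 1 factors for a target word with n sub-words
```

Sorting the error correction target words by this value in descending order gives the same ranking as sorting by the sentence probability itself.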

Second implementation: check all sub-words of each error correction target word in the error correction target word list; for an error correction target word containing approximate-pinyin sub-words, multiply its score by a penalty factor, and then sort the error correction target word list by score, where an approximate-pinyin sub-word is one whose pinyin is an approximate pinyin of the pinyin of the corresponding sub-word of the query word.

Specifically, an approximate-pinyin word is less likely than a homophone to be the error correction target, because a user is generally less likely to mistype a word in the query as an approximate-pinyin word than as a homophone. The score of an error correction target word containing approximate-pinyin sub-words can therefore be multiplied by a penalty factor; the more approximate-pinyin sub-words the target word contains, the greater the penalty.

Assume that after segmentation the query word has N sub-words in total, and that each error correction target word in the error correction target word list also has N sub-words after segmentation. All sub-words of each error correction target word are checked, and the number of sub-words whose pinyin is an approximate pinyin of, rather than identical to, the pinyin of the corresponding sub-word of the original query word is counted and denoted M. For example, if the original query word is "full name gunbattle" and the error correction target word is "whole people's gunbattle", then N = 4 and M = 1. For an error correction target word containing approximate-pinyin sub-words, its score is multiplied by a penalty factor p with a value in (0, 1), for example p = (1 + N - M) / (1 + N). After the penalty factor has been applied to the scores of the affected error correction target words, the list is re-sorted. 1 is added to both the numerator and the denominator of the formula so that p is not 0 when N = M.
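The example penalty factor can be sketched as follows. The function names and the tuple layout of `scored_targets` are assumptions made for this illustration.

```python
def penalty_factor(n_sub_words, n_approx):
    """p = (1 + N - M) / (1 + N); equals 1 when M == 0 and stays above 0 when M == N."""
    return (1 + n_sub_words - n_approx) / (1 + n_sub_words)

def apply_pinyin_penalty(scored_targets, n_sub_words):
    """scored_targets: list of (target, score, M) tuples; returns re-sorted (target, score)."""
    rescored = [(target, score * penalty_factor(n_sub_words, m))
                for target, score, m in scored_targets]
    return sorted(rescored, key=lambda item: item[1], reverse=True)
```

With N = 4 and M = 1, as in the example above, the factor is 4/5; with N = M = 4 it is 1/5 rather than 0.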

Third implementation: delete from the error correction target word list the error correction target words whose score is below an expected threshold, where the expected threshold is determined by the number of sub-words of the query word.

Specifically, a system parameter α can be set with a value range of (0, 1). Assume the query word has n sub-words after segmentation; its sentence probability is then the product of n+1 probabilities. The expected threshold can be set to αⁿ⁺¹: the score of each error correction target word must be higher than the (n+1)th power of α, otherwise the word is filtered out. In addition, to avoid filtering out correct error correction target words because the expected value is too high, another system parameter β can be set and αⁿ⁺¹ changed to α^(β*(n+1)), where β can take any real value (such as 1.5, 2, or 3).
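The expected-threshold filter can be sketched as follows, reading the relaxed threshold as α raised to the power β*(n+1). The function names and the `(target, score)` pair layout are assumptions made for this illustration.

```python
def expected_threshold(n, alpha, beta=1.0):
    """alpha ** (beta * (n + 1)); beta > 1 lowers the threshold since 0 < alpha < 1."""
    return alpha ** (beta * (n + 1))

def filter_by_threshold(scored_targets, n, alpha, beta=1.0):
    """Keep (target, score) pairs whose score is above the expected threshold."""
    threshold = expected_threshold(n, alpha, beta)
    return [(target, score) for target, score in scored_targets if score > threshold]
```

The scores passed in are assumed to be plain sentence probabilities; if log probabilities are used instead, the threshold would be β*(n+1)*log(α).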

Optionally, to ensure real-time error correction performance, reduce the average response time of a single error correction request, and raise overall system throughput, the method of this embodiment can include the following four caching mechanisms, each of which can be implemented with a hash table:

First, the error correction result cache: the key of the hash table is the query word (case-insensitive) and the value is the error correction result.

In this embodiment, the mapping between a query word that has undergone error correction and its error correction result can be stored in the table. When error correction is subsequently performed, the table is queried first; if the query word matches the key of a cache entry in the table, the corresponding result is returned directly without repeating the above error correction process.

Second, the N-gram score cache: the key of the hash table is an N-gram phrase, such as "music" or "iQIYI", and the value is the N-gram language model score of that phrase.

Specifically, when scoring a word, the N-gram language model has to compute the probability in real time before returning it. For words already computed, the mapping between word and score can be stored in the table. On later score calculations the table is queried first; if the word matches the key of a cache entry, the corresponding result is returned directly without recomputation by the N-gram language model.

Third, the N-gram state cache: the key of the hash table is an N-gram phrase and the value is a Boolean indicating whether the phrase occurred in the training corpus, that is, whether it matches the N-gram language model, in other words whether it is a legal N-gram.

While word probabilities are being calculated with the N-gram language model, some phrases may not match the model (that is, they never occurred in the training corpus). For an N-gram that does not exist in the model, the model outputs a very low probability using its smoothing strategy and also indicates that the phrase does not match the model. The phrase can then be stored in the table with its Boolean value set to false, so that when the same phrase is encountered in a later score calculation, the table shows directly that the phrase is illegal.

Fourth, the intermediate result cache: the key of the hash table is a pinyin string, such as "ai" or "tiantian", and the value is the intermediate result that the greedy algorithm produces for that pinyin string after sorting and truncation, that is, the second intermediate result.

During the error correction target word search, if several different query words, earlier and later, share the same pinyin prefix substring, the intermediate result corresponding to that substring can be taken directly from the table without being recalculated.

For the phrase mode of Chinese word segmentation, this method can effectively reduce the amount of computation and improve the operating efficiency of the system.
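The first two caching mechanisms can be sketched with ordinary dictionaries, which are hash tables in Python. This is an illustration under assumptions: the class name `CachedCorrector` and the injected callables `correct_fn` and `score_fn` are hypothetical, and the state cache and intermediate result cache would follow the same look-up-then-compute pattern.

```python
class CachedCorrector:
    """Wraps a corrector and a scorer with hash-table caches."""

    def __init__(self, correct_fn, score_fn):
        self._correct_fn = correct_fn  # full error correction pipeline (assumed)
        self._score_fn = score_fn      # N-gram language model scoring (assumed)
        self.result_cache = {}         # lower-cased query word -> correction result
        self.score_cache = {}          # phrase -> N-gram score

    def correct(self, query):
        key = query.lower()  # keys are case-insensitive
        if key not in self.result_cache:
            self.result_cache[key] = self._correct_fn(query)
        return self.result_cache[key]

    def score(self, phrase):
        if phrase not in self.score_cache:
            self.score_cache[phrase] = self._score_fn(phrase)
        return self.score_cache[phrase]
```

A production cache would also need eviction and invalidation when the word error correction mapping table or language model is reloaded; those concerns are outside this sketch.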

The search query word error correction method provided by this embodiment outputs the error correction result only after performing result filtering on the error correction target word list, which effectively reduces the false correction rate and improves the user experience.

Fig. 8 is a schematic structural diagram of embodiment one of the search query word error correction device provided by the present invention. The device of this embodiment can be a standalone device or can be integrated into an error correction subsystem. As shown in Fig. 8, the device of this embodiment includes a receiving module 10, an error correction preprocessing module 20, an error correction processing module 30, and an output module 40, wherein:

Receiving module 10, configured to receive the query word in a search request;

Error correction preprocessing module 20, configured to obtain, according to the pre-established word error correction mapping table, the error correction candidate sub-word list of each sub-word in the query word;

Error correction processing module 30, configured to obtain the error correction target word list according to each error correction candidate sub-word list and the pre-established N-Gram language model;

Output module 40, configured to output the error correction result according to the error correction target word list.

The search query word error correction device provided by this embodiment of the present invention can perform the above method embodiments; its implementation principle and technical effects are similar and are not repeated here.

In one embodiment of the present invention, the error correction processing module 30 is specifically configured to:

In another embodiment of the present invention, the error correction processing module 30 is specifically configured to:

In this embodiment, as an optional implementation, the error correction processing module 30 is further configured to: delete from each intermediate result the candidate words whose N-Gram score is less than or equal to the N-Gram score of the corresponding query word.

As another optional implementation, the error correction processing module 30 is further configured to: if the time currently spent obtaining the error correction target word list according to each error correction candidate sub-word list and the pre-established N-Gram language model exceeds a preset time, concatenate the current intermediate result with the sub-words corresponding to the error correction candidate sub-word lists that have not yet been concatenated, and use the result as the error correction target word list.

Fig. 9 is a schematic structural diagram of embodiment two of the search query word error correction device provided by the present invention. This embodiment is a further optimization of the embodiment shown in Fig. 8. As shown in Fig. 9, on the basis of the embodiment shown in Fig. 8, the device of this embodiment further includes:

Error correction result filtering module 50, configured to perform result filtering on the error correction target word list.

As a specific embodiment of the present invention, the error correction result filtering module 50 is specifically configured to:

In this implementation, optionally, the error correction preprocessing module 20 is specifically configured to:

Further, the word error correction mapping table includes at least one of: a Chinese character-pinyin mapping table, an English word index table, and a homograph dictionary;

The error correction preprocessing module 20 is then specifically configured to:

As a specific embodiment of the present invention, the error correction preprocessing module 20 is specifically configured to:

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A search query word error correction method, characterized by comprising:

receiving a query word in a search request;

obtaining, according to a pre-established word error correction mapping table, an error correction candidate sub-word list of each sub-word in the query word;

obtaining an error correction target word list according to each error correction candidate sub-word list and a pre-established N-Gram language model;

outputting an error correction result according to the error correction target word list.

2. The method according to claim 1, characterized in that obtaining the error correction target word list according to each error correction candidate sub-word list and the pre-established N-Gram language model specifically comprises:

calculating, according to the pre-established N-Gram language model, a word score of each error correction candidate word in the error correction candidate word list, and sorting the error correction candidate words according to their word scores to obtain the error correction target word list.

3. The method according to claim 1, characterized in that the obtaining of the error correction target word list according to each error correction candidate sub-word list and the pre-established N-Gram language model specifically comprises:

Taking the error correction candidate sub-word list of the first sub-word in the query word as a first intermediate result, and performing an error correction target word search operation on the first intermediate result, wherein the error correction target word search operation comprises:

Obtaining, according to the N-Gram language model, an N-Gram score for each candidate word in the first intermediate result;

Sorting the first intermediate result by N-Gram score and, when the number of candidate words in the intermediate result exceeds a preset threshold L, truncating the intermediate result to its top L candidate words as a second intermediate result;

Splicing the error correction candidate sub-word list of the second sub-word in the query word with the second intermediate result to form a new first intermediate result, and returning to perform the error correction target word search operation, until the second intermediate result corresponding to the sub-word list of the last sub-word is obtained, and taking that second intermediate result as the error correction target word list.
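The search operation of claim 3 resembles a beam search of width L over the per-sub-word candidate lists. The following is a minimal sketch under stated assumptions, not the claimed implementation: the bigram table `TOY_BIGRAMS`, the back-off value `UNSEEN`, and the function names `ngram_score` and `search_targets` are hypothetical, and a real system would query a trained N-Gram language model instead of a hard-coded dictionary.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical toy bigram log-probabilities standing in for a trained N-Gram model.
TOY_BIGRAMS = {
    ("<s>", "angry"): -1.0,
    ("<s>", "angle"): -2.5,
    ("angry", "birds"): -0.5,
    ("angry", "beards"): -4.0,
    ("angle", "birds"): -3.5,
    ("angle", "beards"): -4.5,
}
UNSEEN = -10.0  # assumed penalty for bigrams missing from the table


def ngram_score(words: Sequence[str]) -> float:
    """Sum of bigram log-probabilities over the sequence, with a start marker."""
    padded = ["<s>", *words]
    return sum(TOY_BIGRAMS.get((a, b), UNSEEN) for a, b in zip(padded, padded[1:]))


def search_targets(
    candidate_lists: List[List[str]],
    limit: int,
    score: Callable[[Sequence[str]], float] = ngram_score,
) -> List[Tuple[str, float]]:
    """Return up to `limit` spliced candidates, best score first.

    Assumes `candidate_lists` is non-empty and ordered like the query's sub-words.
    """
    # First intermediate result: the candidate list of the first sub-word.
    intermediate = [(w,) for w in candidate_lists[0]]
    position = 0
    while True:
        # Score, sort, and keep the top `limit` entries (second intermediate result).
        ranked = sorted(intermediate, key=score, reverse=True)[:limit]
        position += 1
        if position == len(candidate_lists):
            return [(" ".join(c), score(c)) for c in ranked]
        # Splice with the next sub-word's candidates (new first intermediate result).
        intermediate = [c + (w,) for c in ranked for w in candidate_lists[position]]
```

For example, `search_targets([["angry", "angle"], ["birds", "beards"]], limit=2)` ranks "angry birds" first under the toy table. Truncating to L entries at each step keeps the number of scored candidates bounded instead of growing with the product of all list sizes.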

4. The method according to claim 3, characterized in that the method further comprises:

5. The method according to claim 3 or 4, characterized in that the method further comprises:

If the time currently spent obtaining the error correction target word list according to each error correction candidate sub-word list and the pre-established N-Gram language model exceeds a preset time, splicing the current intermediate result with the sub-words corresponding to the error correction candidate sub-word lists that have not yet been spliced, and taking the result as the error correction target word list.
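The time-limit fallback of claim 5 can be illustrated by adding a deadline check to the search loop. This is a sketch only: the function name `search_with_deadline`, the parameter `budget_s`, and the injectable `clock` are assumptions made for illustration, and the scoring function is supplied by the caller rather than taken from any real N-Gram library.

```python
import time
from typing import Callable, List, Sequence


def search_with_deadline(
    sub_words: List[str],
    candidate_lists: List[List[str]],
    score: Callable[[Sequence[str]], float],
    limit: int,
    budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Beam-style search that falls back to the original sub-words on timeout.

    Assumes `candidate_lists` is non-empty and aligned with `sub_words`.
    """
    start = clock()
    intermediate = [(w,) for w in candidate_lists[0]]
    last = len(candidate_lists) - 1
    for position in range(len(candidate_lists)):
        ranked = sorted(intermediate, key=score, reverse=True)[:limit]
        if position == last:
            return [" ".join(c) for c in ranked]
        if clock() - start > budget_s:
            # Out of time: append the query's own sub-words for every
            # position whose candidate list has not been spliced yet.
            tail = tuple(sub_words[position + 1:])
            return [" ".join(c + tail) for c in ranked]
        intermediate = [c + (w,) for c in ranked for w in candidate_lists[position + 1]]
    return []
```

On timeout, the already-corrected prefix is kept and the remaining sub-words are passed through uncorrected, so the caller still receives a complete query string within the time budget.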

6. The method according to any one of claims 1-5, characterized in that, before the outputting of the error correction result according to the error correction target word list, the method further comprises:

Performing result filtering on the error correction target word list.

7. The method according to claim 6, characterized in that the performing of result filtering on the error correction target word list specifically comprises at least one of the following:

Calculating, according to the N-Gram language model, a sentence score for each error correction target word in the error correction target word list, and sorting the error correction target word list by sentence score;

Checking all sub-words of each error correction target word in the error correction target word list, multiplying the score of any error correction target word that contains an approximate-pinyin sub-word by a penalty factor, and then sorting the error correction target word list by score, wherein the pinyin of an approximate-pinyin sub-word and the pinyin of the corresponding sub-word of the query word are approximate pinyin;

Deleting from the error correction target word list any error correction target word whose score is below an expectation threshold, wherein the expectation threshold is determined according to the number of sub-words in the query word.
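Two of the filtering steps of claim 7, the approximate-pinyin penalty and the expectation threshold, can be sketched as follows. Several details are assumptions made only for illustration: scores are taken to be positive values where higher is better, the penalty factor is taken to be below 1, the threshold is computed as `per_word_floor ** n` for a query of n sub-words, and the set `near_pinyin` of (query sub-word, corrected sub-word) pairs is a hypothetical stand-in for a real approximate-pinyin check. The claims do not specify any of these formulas.

```python
from typing import List, Sequence, Set, Tuple

Target = Tuple[Tuple[str, ...], float]  # (sub-words of the target word, score)


def filter_targets(
    targets: Sequence[Target],
    query_sub_words: Sequence[str],
    near_pinyin: Set[Tuple[str, str]],
    penalty: float = 0.5,
    per_word_floor: float = 0.1,
) -> List[Target]:
    """Penalize approximate-pinyin corrections, drop low scores, and re-sort."""
    # Assumed rule: the threshold shrinks as the query gets longer.
    threshold = per_word_floor ** len(query_sub_words)
    kept: List[Target] = []
    for sub_words, score in targets:
        for original, corrected in zip(query_sub_words, sub_words):
            if corrected != original and (original, corrected) in near_pinyin:
                score *= penalty  # apply the penalty once per target word
                break
        if score >= threshold:
            kept.append((sub_words, score))
    return sorted(kept, key=lambda t: t[1], reverse=True)
```

Under these assumptions, a correction that relies on an approximate pinyin match must beat an exact match by a wide margin to rank above it, and targets whose score falls below the length-dependent threshold are removed before output.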

8. The method according to any one of claims 1-7, characterized in that the obtaining, according to the pre-established word error correction mapping table, of the error correction candidate sub-word list for each sub-word in the query word specifically comprises:

Performing word segmentation on the query word to obtain a sub-word list of the query word;

Obtaining, according to the pre-established word error correction mapping table, the error correction candidate sub-word list of each sub-word in the sub-word list.

9. The method according to claim 8, characterized in that the word error correction mapping table comprises at least one of: a Chinese character pinyin mapping table, an English word index table, and a homograph dictionary;

The obtaining, according to the pre-established word error correction mapping table, of the error correction candidate sub-word list of each sub-word in the sub-word list specifically comprises at least one of the following:

Obtaining, according to the Chinese character pinyin mapping table, homophone or near-homophone error correction candidate words for each Chinese or pinyin sub-word in the sub-word list;

Obtaining, according to the English word index table, error correction candidate words for each English sub-word in the sub-word list;

Obtaining, according to the homograph dictionary, homograph error correction candidate words for each Chinese sub-word in the sub-word list.

10. The method according to claim 9, characterized in that the obtaining, according to the English word index table, of the error correction candidate words for each English sub-word in the sub-word list specifically comprises:

For each English sub-word in the sub-word list, obtaining, according to the English word index table, the first M words sorted in ascending order of edit distance from the English sub-word, together with the corpus occurrence count of each word;

Scoring and sorting the M words according to their edit distances and corpus occurrence counts, and selecting the top N words in the sorted result as the error correction candidate words of the English sub-word, wherein M and N are positive integers and M is greater than N.
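The two-stage selection of claim 10 (M nearest words by edit distance, then top N by a combined score) can be sketched as below. The brute-force scan over `corpus_counts` stands in for the English word index table, and the combined score `log(1 + count) - 2 * distance` is a hypothetical weighting chosen for illustration; the claims do not specify how edit distance and corpus occurrence count are combined.

```python
import math
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def english_candidates(sub_word: str, corpus_counts: Dict[str, int], m: int, n: int) -> List[str]:
    """Pick the M nearest words by edit distance, then the top N by combined score."""
    # Stage 1: M words in ascending order of edit distance (ties broken alphabetically).
    nearest = sorted(corpus_counts, key=lambda w: (edit_distance(sub_word, w), w))[:m]

    # Stage 2: assumed scoring that rewards frequent words and penalizes distance.
    def combined(word: str) -> float:
        return math.log1p(corpus_counts[word]) - 2.0 * edit_distance(sub_word, word)

    return sorted(nearest, key=combined, reverse=True)[:n]
```

With hypothetical counts `{"weather": 900, "whether": 1200, "wether": 3, "feather": 200, "leather": 150}`, the misspelling "wheather" with M = 3 and N = 2 yields "whether" and "weather": both are one edit away, and the more frequent word ranks first.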

11. A search query word error correction device, characterized by comprising:

A receiving module, configured to receive a query word in a search request;

An error correction preprocessing module, configured to obtain, according to a pre-established word error correction mapping table, an error correction candidate sub-word list for each sub-word in the query word;

An error correction processing module, configured to obtain an error correction target word list according to each error correction candidate sub-word list and a pre-established N-Gram language model;

An output module, configured to output an error correction result according to the error correction target word list.

12. The device according to claim 11, characterized in that the error correction processing module is specifically configured to:

13. The device according to claim 11, characterized in that the error correction processing module is specifically configured to:

14. The device according to claim 13, characterized in that the error correction processing module is further configured to:

15. The device according to claim 13 or 14, characterized in that the error correction processing module is further configured to:

16. The device according to any one of claims 11-15, characterized in that the device further comprises:

An error correction result filtering module, configured to perform result filtering on the error correction target word list.

17. The device according to claim 16, characterized in that the error correction result filtering module is specifically configured to:

18. The device according to any one of claims 11-17, characterized in that the error correction preprocessing module is specifically configured to:

19. The device according to claim 18, characterized in that the word error correction mapping table comprises at least one of: a Chinese character pinyin mapping table, an English word index table, and a homograph dictionary;

The error correction preprocessing module is specifically configured to:

20. The device according to claim 19, characterized in that the error correction preprocessing module is specifically configured to: