Summary of the Invention
In view of the above drawbacks of the prior art, the present invention provides a search query word error correction method and device for improving error correction capability.
In a first aspect, an embodiment of the present invention provides a search query word error correction method, including:
receiving a query word in a search request;
obtaining, according to a pre-built word error correction mapping table, an error correction candidate sub-word list for each sub-word in the query word;
obtaining an error correction target word list according to each error correction candidate sub-word list and a pre-built N-Gram language model; and
outputting an error correction result according to the error correction target word list.
With reference to the first aspect, in a first possible implementation of the first aspect, obtaining the error correction target word list according to each error correction candidate sub-word list and the pre-built N-Gram language model specifically includes:
performing string concatenation on the error correction candidate sub-word lists to obtain an error correction candidate word list; and
calculating, according to the pre-built N-Gram language model, a word score for each error correction candidate word in the error correction candidate word list, and sorting the error correction candidate words by their word scores to obtain the error correction target word list.
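The first implementation above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `rank_candidates` and the lookup table `toy_scores` are hypothetical, and the table merely stands in for a pre-built N-Gram language model.

```python
from itertools import product

def rank_candidates(candidate_lists, score):
    """Concatenate one sub-word from each list, then rank the results by score."""
    words = ["".join(parts) for parts in product(*candidate_lists)]
    return sorted(words, key=score, reverse=True)

# Hypothetical stand-in for an N-Gram model: a fixed table of word scores.
toy_scores = {"qqpinyin": 0.9, "qbpinyin": 0.1}

def toy_score(word):
    return toy_scores.get(word, 0.0)

ranked = rank_candidates([["qb", "qq"], ["pin"], ["yin"]], toy_score)
print(ranked)
```

Because every combination is generated before scoring, the number of candidate words grows multiplicatively with the number of sub-words, which is the cost the second implementation limits with the threshold L.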
With reference to the first aspect, in a second possible implementation of the first aspect, obtaining the error correction target word list according to each error correction candidate sub-word list and the pre-built N-Gram language model specifically includes:
taking the error correction candidate sub-word list of the first sub-word in the query word as a first intermediate result, and performing an error correction target word search operation on the first intermediate result, where the error correction target word search operation includes:
obtaining, according to the N-Gram language model, the N-Gram score of each candidate word in the first intermediate result;
sorting the first intermediate result by N-Gram score and, when the number of candidate words in the intermediate result exceeds a preset threshold L, truncating the intermediate result to its first L candidate words as a second intermediate result; and
concatenating the error correction candidate sub-word list of the next sub-word in the query word (initially the second sub-word) with the second intermediate result to form a new first intermediate result, and returning to perform the error correction target word search operation, until the second intermediate result corresponding to the last sub-word in the sub-word list is obtained, and taking that second intermediate result as the error correction target word list.
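The iterative search above resembles a beam search with beam width L. The sketch below is one possible reading, under stated assumptions: `beam_search`, `toy_score`, and `TARGET` are hypothetical names, and the toy scorer (the count of leading characters shared with a fixed string) only stands in for an N-Gram score.

```python
def beam_search(candidate_lists, score, beam_width):
    """Extend partial corrections one sub-word at a time, keeping the top beam_width."""
    beam = []
    for i, candidates in enumerate(candidate_lists):
        if i == 0:
            expanded = list(candidates)          # first intermediate result
        else:
            # concatenate every kept prefix with every candidate of the next sub-word
            expanded = [prefix + sub for prefix in beam for sub in candidates]
        expanded.sort(key=score, reverse=True)   # rank by the stand-in score
        beam = expanded[:beam_width]             # second intermediate result
    return beam

TARGET = "qqpinyin"  # hypothetical "correct" string used only by the toy scorer

def toy_score(text):
    """Count the leading characters that text shares with TARGET."""
    shared = 0
    for a, b in zip(text, TARGET):
        if a != b:
            break
        shared += 1
    return shared

lists = [["qb", "qq"], ["pin", "bin"], ["yin", "yan"]]
best = beam_search(lists, toy_score, beam_width=2)
print(best)
```

With a beam width of 2, at most 2 × 2 = 4 strings are scored per step in this example, instead of all 8 full combinations.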
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, the method further includes:
deleting, from each intermediate result, the candidate words whose N-Gram score is less than or equal to the N-Gram score of the corresponding query word.
With reference to the second or third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the method further includes:
if the time already spent obtaining the error correction target word list according to each error correction candidate sub-word list and the pre-built N-Gram language model exceeds a preset time, concatenating the current intermediate result with the sub-words corresponding to the error correction candidate sub-word lists that have not yet been concatenated, and taking the result as the error correction target word list.
With reference to the first aspect, or any one of the second to fourth possible implementations of the first aspect, in a fifth possible implementation of the first aspect, before outputting the error correction result according to the error correction target word list, the method further includes:
performing result filtering on the error correction target word list.
With reference to the fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, performing result filtering on the error correction target word list specifically includes at least one of the following methods:
calculating, according to the N-Gram language model, a sentence score for each error correction target word in the error correction target word list, and sorting the error correction target word list by sentence score;
checking all sub-words of each error correction target word in the error correction target word list, multiplying the score of any error correction target word that contains a near-pinyin sub-word by a penalty factor, and then sorting the error correction target word list by score, where a near-pinyin sub-word is a sub-word whose pinyin is only approximately the same as the pinyin of the corresponding sub-word of the query word; and
deleting from the error correction target word list the error correction target words whose score is below an expected threshold, where the expected threshold is determined according to the number of sub-words in the query word.
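The penalty and threshold filters above might be combined as in the following sketch. The source does not specify the penalty factor or how the expected threshold depends on the sub-word count, so the values `penalty=0.5` and `per_subword_floor=0.1`, the rule `per_subword_floor ** n_subwords`, and the name `filter_targets` are all illustrative assumptions.

```python
def filter_targets(scored_targets, has_near_pinyin, n_subwords,
                   penalty=0.5, per_subword_floor=0.1):
    """Penalise near-pinyin corrections, drop low scores, and sort best first.

    scored_targets: dict mapping target word -> score (treated as a probability).
    has_near_pinyin: set of target words that contain a near-pinyin sub-word.
    penalty and per_subword_floor are assumed values, not taken from the source.
    """
    # Assumed rule: the threshold shrinks as the query has more sub-words.
    threshold = per_subword_floor ** n_subwords
    adjusted = {
        word: score * penalty if word in has_near_pinyin else score
        for word, score in scored_targets.items()
    }
    kept = [(word, score) for word, score in adjusted.items() if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

filtered = filter_targets({"a": 0.5, "b": 0.3, "c": 0.005}, {"a"}, n_subwords=2)
print([word for word, _ in filtered])
```

In this example "a" drops below "b" after the penalty, and "c" is removed because its score is below the assumed threshold.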
With reference to the first aspect, or any one of the second to sixth possible implementations of the first aspect, in a seventh possible implementation of the first aspect, obtaining, according to the pre-built word error correction mapping table, the error correction candidate sub-word list of each sub-word in the query word specifically includes:
performing word segmentation on the query word to obtain a sub-word list of the query word; and
obtaining, according to the pre-built word error correction mapping table, the error correction candidate sub-word list of each sub-word in the sub-word list.
With reference to the seventh possible implementation of the first aspect, in an eighth possible implementation of the first aspect, the word error correction mapping table includes at least one of a Chinese character-pinyin mapping table, an English word index table, and a homograph dictionary;
obtaining, according to the pre-built word error correction mapping table, the error correction candidate sub-word list of each sub-word in the sub-word list specifically includes at least one of the following methods:
obtaining, according to the Chinese character-pinyin mapping table, homophone or near-homophone error correction candidate words for each Chinese or pinyin sub-word in the sub-word list;
obtaining, according to the English word index table, error correction candidate words for each English sub-word in the sub-word list; and
obtaining, according to the homograph dictionary, homograph error correction candidate words for each Chinese sub-word in the sub-word list.
With reference to the eighth possible implementation of the first aspect, in a ninth possible implementation of the first aspect, obtaining, according to the English word index table, the error correction candidate words of each English sub-word in the sub-word list specifically includes:
for each English sub-word in the sub-word list, obtaining from the English word index table the first M words in ascending order of edit distance from the English sub-word, together with the corpus occurrence count of each word; and
scoring and ranking the M words according to their edit distances and corpus occurrence counts, and selecting the top N words in the ranking as the error correction candidate words of the English sub-word, where M and N are positive integers and M is greater than N.
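The two-stage selection above (nearest M by edit distance, then top N by a combined score) could look like the sketch below. The source does not give the scoring formula, so `count / (1 + distance)` is an assumption, as are the names `edit_distance` and `english_candidates`. A plain dictionary of word counts stands in for the English word index table, and the linear scan is for illustration only.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = cur
    return prev[-1]

def english_candidates(subword, word_counts, m, n):
    """Keep the m nearest words by edit distance, then the n best by combined score."""
    nearest = sorted(word_counts, key=lambda w: edit_distance(subword, w))[:m]

    def combined(word):
        # Assumed formula: frequent words and small distances score higher.
        return word_counts[word] / (1 + edit_distance(subword, word))

    return sorted(nearest, key=combined, reverse=True)[:n]

counts = {"twitter": 120, "twister": 15, "titter": 2, "facebook": 300}
print(english_candidates("twiter", counts, m=3, n=1))
```

Here "facebook" is frequent but is excluded in the first stage because its edit distance is large, which is why M is chosen larger than N.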
In a second aspect, an embodiment of the present invention provides a search query word error correction device, including:
a receiving module, configured to receive a query word in a search request;
an error correction preprocessing module, configured to obtain, according to a pre-built word error correction mapping table, an error correction candidate sub-word list for each sub-word in the query word;
an error correction processing module, configured to obtain an error correction target word list according to each error correction candidate sub-word list and a pre-built N-Gram language model; and
an output module, configured to output an error correction result according to the error correction target word list.
With reference to the second aspect, in a first possible implementation of the second aspect, the error correction processing module is specifically configured to:
perform string concatenation on the error correction candidate sub-word lists to obtain an error correction candidate word list; and
calculate, according to the pre-built N-Gram language model, a word score for each error correction candidate word in the error correction candidate word list, and sort the error correction candidate words by their word scores to obtain the error correction target word list.
With reference to the second aspect, in a second possible implementation of the second aspect, the error correction processing module is specifically configured to:
take the error correction candidate sub-word list of the first sub-word in the query word as a first intermediate result, and perform an error correction target word search operation on the first intermediate result, where the error correction target word search operation includes:
obtaining, according to the N-Gram language model, the N-Gram score of each candidate word in the first intermediate result;
sorting the first intermediate result by N-Gram score and, when the number of candidate words in the intermediate result exceeds a preset threshold L, truncating the intermediate result to its first L candidate words as a second intermediate result; and
concatenating the error correction candidate sub-word list of the next sub-word in the query word (initially the second sub-word) with the second intermediate result to form a new first intermediate result, and returning to perform the error correction target word search operation, until the second intermediate result corresponding to the last sub-word in the sub-word list is obtained, and taking that second intermediate result as the error correction target word list.
With reference to the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the error correction processing module is further configured to:
delete, from each intermediate result, the candidate words whose N-Gram score is less than or equal to the N-Gram score of the corresponding query word.
With reference to the second or third possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the error correction processing module is further configured to:
if the time already spent obtaining the error correction target word list according to each error correction candidate sub-word list and the pre-built N-Gram language model exceeds a preset time, concatenate the current intermediate result with the sub-words corresponding to the error correction candidate sub-word lists that have not yet been concatenated, and take the result as the error correction target word list.
With reference to the second aspect, or any one of the second to fourth possible implementations of the second aspect, in a fifth possible implementation of the second aspect, the device further includes:
an error correction result filtering module, configured to perform result filtering on the error correction target word list.
With reference to the fifth possible implementation of the second aspect, in a sixth possible implementation of the second aspect, the error correction result filtering module is specifically configured to:
calculate, according to the N-Gram language model, a sentence score for each error correction target word in the error correction target word list, and sort the error correction target word list by sentence score;
check all sub-words of each error correction target word in the error correction target word list, multiply the score of any error correction target word that contains a near-pinyin sub-word by a penalty factor, and then sort the error correction target word list by score, where a near-pinyin sub-word is a sub-word whose pinyin is only approximately the same as the pinyin of the corresponding sub-word of the query word; and
delete from the error correction target word list the error correction target words whose score is below an expected threshold, where the expected threshold is determined according to the number of sub-words in the query word.
With reference to the second aspect, or any one of the second to sixth possible implementations of the second aspect, in a seventh possible implementation of the second aspect, the error correction preprocessing module is specifically configured to:
perform word segmentation on the query word to obtain a sub-word list of the query word; and
obtain, according to the pre-built word error correction mapping table, the error correction candidate sub-word list of each sub-word in the sub-word list.
With reference to the seventh possible implementation of the second aspect, in an eighth possible implementation of the second aspect, the word error correction mapping table includes at least one of a Chinese character-pinyin mapping table, an English word index table, and a homograph dictionary;
the error correction preprocessing module is specifically configured to:
obtain, according to the Chinese character-pinyin mapping table, homophone or near-homophone error correction candidate words for each Chinese or pinyin sub-word in the sub-word list;
obtain, according to the English word index table, error correction candidate words for each English sub-word in the sub-word list; and
obtain, according to the homograph dictionary, homograph error correction candidate words for each Chinese sub-word in the sub-word list.
With reference to the eighth possible implementation of the second aspect, in a ninth possible implementation of the second aspect, the error correction preprocessing module is specifically configured to:
for each English sub-word in the sub-word list, obtain from the English word index table the first M words in ascending order of edit distance from the English sub-word, together with the corpus occurrence count of each word; and
score and rank the M words according to their edit distances and corpus occurrence counts, and select the top N words in the ranking as the error correction candidate words of the English sub-word, where M and N are positive integers and M is greater than N.
In the search query word error correction method and device provided by the embodiments of the present invention, after receiving the query word in a search request, the search query word error correction device first obtains the error correction candidate sub-word list of each sub-word in the query word. It then uses the pre-built N-Gram language model to score, online and in real time, the error correction candidate words formed by concatenating the error correction candidate sub-word lists, and obtains the error correction target word list. Finally, it outputs the final error correction result according to the scores of the error correction target words in that list. Because the method is based on online analysis and computation, it corrects the search query word entered by the user in real time, requires neither offline processing of massive data nor a historical error correction database, and does not rely on large-scale search logs or user feedback. For application scenarios such as mobile application markets, it can effectively improve error correction capability.
Detailed Description
The method and device of the embodiments of the present invention can be applied to systems with a small search scale, such as mobile application markets, employee retrieval systems, and book retrieval systems, and can also be applied to Internet search engines with a large search scale.
The method and device provided by the embodiments of the present invention are intended to solve the technical problem in the prior art that, for application scenarios with a small search scale such as mobile application markets, the approach of offline error detection and correction followed by online matching results in low error correction capability.
Fig. 1 is a schematic diagram of the system architecture of a search error correction system. As shown in Fig. 1, the search error correction system includes a user terminal and a content retrieval system, where the content retrieval system includes a search subsystem, an error correction subsystem, index files, and a content database. The user can submit a search query word to the search subsystem of the content retrieval system through the graphical interface or programming interface of the user terminal. Before calling its own algorithm to search for related content, the search subsystem first sends the received query word to the error correction subsystem to determine whether it contains a spelling error. Communication between the two subsystems may use any standard protocol (such as TCP or HTTP) and data interchange format (such as JSON or XML).
If the error correction subsystem determines that the query word is wrong, it returns m correction suggestions (that is, error correction target words), ranked by their scores; otherwise it indicates that the query word is correct. The value of m can be an interface parameter representing the maximum number of error correction target words the error correction subsystem may return for one query word, and can be specified dynamically by the search subsystem when it calls the error correction subsystem interface. To spare the user a difficult choice, only one correction suggestion may be returned, that is, the default value of m may be set to 1. Besides specifying the maximum number of error correction target words, another way of setting interface parameters is for the search subsystem to specify a score threshold Γ, in which case the error correction subsystem returns only error correction target words whose score is higher than this threshold. The two parameters can be used together.
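The two interface parameters described above can be illustrated with a short sketch. The function name `select_suggestions` and the parameter name `min_score` (standing for the threshold Γ) are hypothetical; the source does not define a concrete interface.

```python
def select_suggestions(ranked, m=1, min_score=None):
    """Apply the score threshold (if any), then cap the result at m suggestions.

    ranked: list of (target_word, score) pairs, best first.
    min_score: optional threshold; only strictly higher scores are kept.
    """
    if min_score is not None:
        ranked = [(word, score) for word, score in ranked if score > min_score]
    return ranked[:m]

ranked = [("a", 0.9), ("b", 0.5), ("c", 0.1)]
print(select_suggestions(ranked))                      # default m = 1
print(select_suggestions(ranked, m=3, min_score=0.3))  # both parameters together
```

An empty list from this function would correspond to the case in which no correction suggestion is returned.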
After the search subsystem receives the error correction target words returned by the error correction subsystem, it uses one or more of them to look up its index and retrieve from its content database, and finally returns the list of search results corresponding to the query word to the user.
The search query word error correction device provided by the present invention is a further improvement on the error correction subsystem in the above system architecture; it can be integrated into the error correction subsystem or replace it.
The technical solution of the present invention is described in detail below with specific embodiments. The following specific embodiments may be combined with one another, and identical or similar concepts or processes may not be repeated in some embodiments.
Fig. 2 is a schematic flowchart of Embodiment 1 of the search query word error correction method provided by the present invention. The method may be executed by the search query word error correction device. As shown in Fig. 2, the method of this embodiment includes:
Step S201: receive a query word in a search request.
Specifically, after the search query word error correction device receives the query word sent by the search subsystem, it may use an existing error detection method to determine whether the query word is wrong. If not, it indicates that the query word is correct; if so, it performs step S202 below to carry out error correction on the query word.
Step S202: obtain, according to a pre-built word error correction mapping table, the error correction candidate sub-word list of each sub-word in the query word.
Specifically, the search query word error correction device includes data files used for error correction processing, including a training corpus set, from which the word error correction mapping table can be created. A query word can be split into several sub-words, which may be Chinese, pinyin, or English, and the error correction candidate sub-word lists of these sub-words can be obtained from the word error correction mapping table.
For example, the query word "qbpinyin书入法" can be split into ["qb", "pin", "yin", "书", "入", "法"]. The error correction candidate sub-word lists of these sub-words can then be obtained from the word error correction mapping table: English "qb" → ["qq"]; pinyin "pin" → ["拼" (spell), "品" (product), ...] and "yin" → ["音" (sound), "因" (because), ...]; Chinese "书" (book) → ["输" (defeated), "数" (number), ...]. The sub-words "入" (enter) and "法" (method) are handled similarly and are not illustrated further.
Step S203: obtain an error correction target word list according to each error correction candidate sub-word list and a pre-built N-Gram language model.
Specifically, the N-Gram language model can be created from the training corpus set and can be used to calculate the probability of a given word or phrase based on that corpus. The model rests on the assumption that the occurrence of the N-th word in a sentence depends only on the preceding N-1 words and is unrelated to any other word. The N-gram probability (score) of a phrase or sentence is the product of the occurrence probabilities of its words, and these probabilities can be obtained by maximum likelihood estimation, directly counting in the corpus the number of times the N words occur together. Suppose a phrase consists of the N words W_1, W_2, ..., W_N in order. After the first N-1 words have occurred, the probability of occurrence of the N-th word is:
P(W_N | W_1 W_2 ... W_{N-1}) = C(W_1 W_2 ... W_N) / C(W_1 W_2 ... W_{N-1}),
where C(·) is the number of times the word sequence occurs in the corpus. When N is large, training an N-gram model requires a huge corpus, the data is severely sparse, and the time complexity is high, so the binary Bi-gram (N=2) and the ternary Tri-gram (N=3) are the models commonly used at present; N=3 is preferred in this embodiment. In addition, to address data sparsity, a smoothing strategy may be used when implementing the N-gram language model, so that the N-gram probabilities still sum to 1 while no word or phrase has an N-gram probability of 0.
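The count-based estimate and the role of smoothing can be made concrete with a small sketch. For brevity it uses a Bi-gram (N=2) rather than the preferred Tri-gram, and add-one (Laplace) smoothing as one possible smoothing strategy; the source does not name a specific one. The class name `BigramModel` and the toy corpus are hypothetical.

```python
from collections import Counter

class BigramModel:
    """Bi-gram model with add-one smoothing, estimated from tokenised sentences."""

    def __init__(self, sentences):
        self.context = Counter()   # C(W_1): how often a word is followed by something
        self.bigrams = Counter()   # C(W_1 W_2)
        vocab = set()
        for words in sentences:
            padded = ["<s>", *words, "</s>"]
            vocab.update(padded[1:])
            self.context.update(padded[:-1])
            self.bigrams.update(zip(padded, padded[1:]))
        self.vocab_size = len(vocab)

    def prob(self, prev, word):
        # Smoothed version of C(prev word) / C(prev); never returns 0.
        return (self.bigrams[(prev, word)] + 1) / (self.context[prev] + self.vocab_size)

    def score(self, words):
        """Product of the conditional probabilities of each word given its predecessor."""
        p, prev = 1.0, "<s>"
        for word in words:
            p *= self.prob(prev, word)
            prev = word
        return p

model = BigramModel([["angry", "birds"], ["angry", "birds"], ["angry", "bird"]])
print(model.score(["angry", "birds"]), model.score(["angry", "brids"]))
```

The misspelled sequence still receives a non-zero score because of smoothing, but a lower one than the sequence seen in the corpus, which is what allows candidates to be ranked.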
In this embodiment, string concatenation may be performed on the error correction candidate sub-word lists, and the N-Gram probability (score) of each error correction candidate word formed by the concatenation is then calculated according to the N-Gram language model. The candidate words can be sorted or screened by score to finally obtain an error correction target word list. In the error correction target word list, a higher score means the error correction target word is more likely to be the correct query word. For example, if P("爱奇艺") = 0.853 and P("爱奇异") = 0.012, then "爱奇艺" (iQIYI) is more likely to be correct than "爱奇异".
Step S204: output an error correction result according to the error correction target word list.
According to the score of each error correction target word in the error correction target word list obtained in the above step, one or more error correction target words with higher scores can be selected and output to the search system as the error correction result. The final error correction result may also be empty, in which case the search query word error correction device may return 0 error correction results or simply return the erroneous query word.
In the search query word error correction method provided by this embodiment, after receiving the query word in a search request, the search query word error correction device first obtains the error correction candidate sub-word list of each sub-word in the query word. It then uses the pre-built N-Gram language model to score, online and in real time, the error correction candidate words formed by concatenating the error correction candidate sub-word lists, and obtains the error correction target word list. Finally, it outputs the final error correction result according to the scores of the error correction target words in that list. Because the method is based on online analysis and computation, it corrects the search query word entered by the user in real time, requires neither offline processing of massive data nor a historical error correction database, and does not rely on large-scale search logs or user feedback. For application scenarios such as mobile application markets, it can effectively improve error correction capability.
Fig. 3 is a schematic flowchart of Embodiment 2 of the search query word error correction method provided by the present invention. This embodiment further explains step S202 of the embodiment shown in Fig. 2. On the basis of the embodiment shown in Fig. 2, as shown in Fig. 3, in this embodiment step S202, obtaining the error correction candidate sub-word list of each sub-word in the query word according to the pre-built word error correction mapping table, specifically includes:
Step S301: perform word segmentation on the query word to obtain the sub-word list of the query word.
Specifically, the query word may include Chinese, pinyin, and English, so processing the query word includes Chinese word segmentation, pinyin segmentation, and English segmentation. Chinese word segmentation may split in single-character mode, for example "天天酷跑" ("everyday cruel to run") is segmented as "天 天 酷 跑", or in word mode, for example "天天酷跑" is segmented as "天天 酷跑". In single-character mode, if the total number of sub-words after splitting exceeds a preset threshold (for example, 8), subsequent error correction may be skipped and 0 error correction results returned directly. Pinyin segmentation and English segmentation may be collectively referred to as character string segmentation, which may use segmentation algorithms such as reverse maximum matching, forward maximum matching, bidirectional maximum matching, or minimum segmentation. The specific scheme of character string segmentation is described below, taking the reverse maximum matching algorithm as an example.
Fig. 4 is a schematic flowchart of a character string segmentation method provided by the present invention. As shown in Fig. 4, the segmentation algorithm is the reverse maximum matching algorithm and the input character string is "facebookweibo". Reverse matching preferentially outputs segmentations with fewer words in total but longer individual words. The algorithm scans the character string from back to front and checks whether the current prefix substring is pinyin or English; specifically, it may judge whether the substring is legal pinyin or English by checking whether it exists in the word error correction mapping table. If the algorithm finds a cut point, such as "facebook", it recursively checks the suffix substring ("weibo") until it either successfully outputs a segmentation or the segmentation fails.
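The scan described above can be sketched as a short recursive function that returns the first segmentation found (the single-segmentation case). The name `segment` and the set `lexicon` are hypothetical; the set merely stands in for the lookup in the word error correction mapping table.

```python
def segment(text, lexicon):
    """Return the first segmentation of text into lexicon entries, or None on failure."""
    if not text:
        return []
    # Move the cut point from the back of the string toward the front,
    # so the longest legal prefix is tried first.
    for end in range(len(text), 0, -1):
        prefix = text[:end]
        if prefix in lexicon:
            rest = segment(text[end:], lexicon)   # recurse on the suffix substring
            if rest is not None:
                return [prefix] + rest
    return None

lexicon = {"fa", "face", "book", "facebook", "we", "wei", "bo"}
print(segment("facebookweibo", lexicon))
print(segment("xyz", lexicon))
```

Because longer prefixes are tried first, "facebook" is preferred over "face" + "book", matching the stated preference for fewer, longer words.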
In this embodiment, a system configuration parameter may set whether a character string undergoes single segmentation (only the first segmentation result is obtained), multiple segmentation (the first preset number of segmentation results are obtained), or full segmentation (all possible segmentation results are obtained). For anything other than single segmentation (that is, multiple or full segmentation), after finding a cut point the algorithm must continue scanning forward to find the next legal prefix, such as "face" and "fa". Each new cut point corresponds to its own recursion on the suffix. If the character string is long and there are too many full-segmentation candidates, the algorithm may take a long time. A threshold may therefore be set so that, once the segmentation time exceeds the threshold, the algorithm terminates automatically (as shown in the right-hand column of Fig. 4) and merges and outputs the partial list of segmentation results obtained so far (as shown in the left-hand column of Fig. 4).
In addition, in this embodiment, a system configuration parameter may also set whether character strings (pinyin and English) are segmented independently or in mixed mode. With independent segmentation, the pinyin string "weibo" can be segmented as "wei bo" and the English string "angrybirds" can be segmented as "angry birds", but "facebookweibo" is not segmented, because its segmentation result would mix pinyin and English. Mixed segmentation is more time-consuming but more widely applicable, and can be configured according to the application scenario.
Step S302: obtain, according to the pre-built word error correction mapping table, the error correction candidate sub-word list of each sub-word in the sub-word list.
As mentioned in the above embodiment, the word error correction mapping table can be created from the training corpus set, and the error correction candidate sub-word list of each sub-word in the sub-word list can then be obtained from it. In this embodiment, after the word error correction mapping table has been created, it can be serialized and stored as a binary file in a storage device external to the search query word error correction device. The file is updated whenever the training corpus set changes, and is quickly loaded into memory for use each time the device starts.
In addition, in this embodiment, the training corpus set is closely tied to the application scenario: it takes as training corpus the set of text values of all indexable fields of the content items to be searched in that scenario (such as applications, web pages, personnel, or books). For example, in a mobile application market, the indexable fields are the names and descriptions of published applications; in an employee retrieval system, the indexable fields are the employee's name, department, address, project history, and so on; in a book retrieval system, the indexable fields are the book title, author, full text, and so on.
As an optional implementation, the query word log of a recent period (for example, 1 month) may be used to expand the training corpus set, after erroneous query words have been removed. Furthermore, general corpora unrelated to the application scenario (such as news articles) may be left out of the training corpus set, to reduce the problem of "invalid corrections", in which an error correction target word matches no indexable field value of any content item (that is, it occurs 0 times in the training corpus set). Of course, part of a general corpus may also be added to the training corpus set, depending on the specific system implementation or business requirements.
Additionally, the training corpus set applicable to the search query word error correction device changes continuously over time. In the mobile application market scenario, for example, new applications keep going online, some old applications go offline, and the query word log keeps growing. The search query word error correction device can therefore update the word error correction map table periodically, according to how much the training corpus set has changed. Specifically, the device may, at a fixed time each day (for example, in the morning), automatically update the external word error correction map table from the latest training corpus set and reload it into memory. Alternatively, the device may provide an administrator user interface and an API through which the system administrator can manually update and reload the data structure file at any time. The two approaches can be used in combination.
Optionally, the word error correction map table includes at least one of a Chinese character pinyin mapping table, an English word index table and a homograph dictionary. Step S302 can then specifically include at least one of the following implementations:
The first implementation: according to the Chinese character pinyin mapping table, obtain the homophone or near-homophone error correction candidate words of each Chinese or pinyin sub-word in the sub-word list, such as "ink marks" - ["learning by heart", "not anxious", ...] and "bright" - ["people", "name", ...].
The second implementation: according to the English word index table, obtain the error correction candidate words of each English sub-word in the sub-word list, such as "twiter" - ["twitter"].
The third implementation: according to the homograph dictionary, obtain the homograph error correction candidate words of each Chinese sub-word in the sub-word list, such as "say" - ["day"].
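The three lookups above can be pictured as dictionary queries whose results are merged. The following Python sketch is illustrative only: the three toy tables, their contents (which reuse the examples from the text as opaque tokens) and the function name candidate_sub_words are hypothetical, and keeping the original sub-word as the first candidate is an assumption of this sketch.

```python
from typing import Dict, List

# Hypothetical toy tables; real tables would be built from the training corpus.
PINYIN_TABLE: Dict[str, List[str]] = {"bright": ["people", "name"]}
ENGLISH_INDEX: Dict[str, List[str]] = {"twiter": ["twitter"]}
HOMOGRAPH_DICT: Dict[str, List[str]] = {"say": ["day"]}


def candidate_sub_words(sub_word: str) -> List[str]:
    """Merge the candidates found in every table, without duplicates.

    The original sub-word is kept as the first candidate (an assumption),
    so that a correctly spelled sub-word can survive unchanged.
    """
    candidates = [sub_word]
    for table in (PINYIN_TABLE, ENGLISH_INDEX, HOMOGRAPH_DICT):
        for candidate in table.get(sub_word, []):
            if candidate not in candidates:
                candidates.append(candidate)
    return candidates
```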
Further, in the above implementation, obtaining the error correction candidate words of each English sub-word in the sub-word list according to the English word index table can specifically include:
for each English sub-word in the sub-word list, obtaining from the English word index table the first M words when ordered by edit distance to the English sub-word from small to large, together with the corpus occurrence count of each of these words;
scoring and ranking the M words according to their edit distances and corpus occurrence counts, and choosing the first N words of the ranking as the error correction candidate words of the English sub-word, where M and N are positive integers and M is greater than N.
Specifically, when an English word needs error correction, the M words closest to it in edit distance can be located quickly through the English word index table. Each of these words is then scored by a weighted combination of its edit distance and its corpus occurrence count, and the N highest-scoring words are output as error correction candidate words. The weighted scoring method can be designed to favor legal English words that are close to the original spelling and occur frequently in the corpus, for example: score = corpus occurrence count / edit distance.
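The two-stage selection (nearest M by edit distance, then best N by score) can be sketched in Python as follows. This is a sketch under stated assumptions: the index is modeled as a plain dictionary of corpus counts and scanned linearly, whereas a real English word index table would locate near neighbors without a full scan; the names edit_distance, english_candidates and TOY_COUNTS are hypothetical, and a word at distance 0 is given an infinite score as a convention of this sketch.

```python
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def english_candidates(word: str, corpus_counts: Dict[str, int], m: int, n: int) -> List[str]:
    """Return the N best of the M indexed words nearest to `word`."""
    # Stage 1: the M words with the smallest edit distance (ties broken alphabetically).
    nearest = sorted(corpus_counts, key=lambda w: (edit_distance(word, w), w))[:m]

    # Stage 2: score = corpus occurrence count / edit distance.
    def score(w: str) -> float:
        d = edit_distance(word, w)
        return corpus_counts[w] / d if d else float("inf")

    return sorted(nearest, key=score, reverse=True)[:n]


# Toy corpus counts for illustration only.
TOY_COUNTS = {"twitter": 900, "twister": 40, "tweet": 300, "writer": 120}
top = english_candidates("twiter", TOY_COUNTS, m=3, n=2)
```

With these toy counts, "twister" and "twitter" are both one edit away, but the much higher corpus count of "twitter" places it first.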
In addition, whether an English word needs error correction can be determined through the English word index table. Specifically, error correction candidate words may be looked up only for English words that are not in the English word index table. Alternatively, they may also be looked up for English words whose occurrence count in the English word index table is below a preset threshold.
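Both decision rules reduce to a single count comparison, as the following short Python sketch suggests. The function name needs_correction and the parameter min_count are hypothetical, and the sketch assumes every indexed word has a count of at least 1, so that min_count=1 reproduces the "not in the index" rule.

```python
from typing import Dict


def needs_correction(word: str, corpus_counts: Dict[str, int], min_count: int = 1) -> bool:
    """True if candidates should be looked up for `word`.

    min_count=1 means only words absent from the index are corrected;
    a larger min_count also corrects rarely seen words.
    """
    return corpus_counts.get(word, 0) < min_count


# Toy index for illustration only.
toy_counts = {"twitter": 900, "twiter": 2}
```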
In the search query word error correction method provided by the present embodiment, word segmentation is flexible and adaptable, and the word error correction map table is built for the specific application scenario and is comprehensive and reliable. The error correction results obtained from the error correction candidate sub-word lists derived from this table therefore have high confidence, and the error correction capability is strong.
Fig. 5 is a schematic flowchart of embodiment three of the search query word error correction method provided by the present invention. The present embodiment is a specific implementation of step S203 in the embodiment shown in Fig. 2 above. On the basis of the above embodiments, as shown in Fig. 5, step S203 in the present embodiment, obtaining the error correction target word list according to each error correction candidate sub-word list and the pre-built N-Gram language model, specifically includes:
Step S501: perform string concatenation on the error correction candidate sub-word lists to obtain an error correction candidate word list.
Specifically, each sub-word of the query word corresponds to one error correction candidate sub-word list. Concatenating the strings of these lists in order yields an error correction candidate word list, whose entries are the error correction candidates for the whole query word. Each error correction candidate word has the same number of sub-words as the query word.
Step S502: according to the pre-built N-Gram language model, calculate the word score of each error correction candidate word in the error correction candidate word list, and sort the error correction candidate words by word score to obtain the error correction target word list.
After the word probability of each error correction candidate word in the error correction candidate word list has been calculated with the N-Gram language model, the word probability can be used directly as the word score of the candidate word, or the normalized word probabilities can be used as the word scores. The error correction candidate words are then sorted by score, giving an error correction target word list in descending order of word score. A higher score indicates that the error correction target word is more likely to be the correct query word. When the error correction result is output, the first error correction target word in the list, or the first several, can be chosen for output.
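Steps S501 and S502 amount to taking the Cartesian product of the candidate sub-word lists and ranking the concatenations. A minimal Python sketch follows; the function name rank_candidates, the toy score table and the abstract sub-word tokens are assumptions for illustration, and the score function stands in for the N-Gram language model.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple


def rank_candidates(
    candidate_lists: Sequence[Sequence[str]],
    score: Callable[[str], float],
    normalize: bool = False,
) -> List[Tuple[str, float]]:
    """Concatenate one candidate per sub-word in order, then rank by score."""
    words = ["".join(parts) for parts in product(*candidate_lists)]
    scores = {w: score(w) for w in words}
    if normalize:
        total = sum(scores.values())
        if total:
            scores = {w: s / total for w, s in scores.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy stand-in for the language model: a fixed table of word probabilities.
TOY_SCORES = {"ax": 0.1, "ay": 0.4, "bx": 0.3, "by": 0.2}
ranked = rank_candidates([["a", "b"], ["x", "y"]], lambda w: TOY_SCORES.get(w, 0.0))
```

The number of concatenations is the product of the list lengths, which is why this exhaustive approach suits only query words with few sub-words.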
The search query word error correction method provided by the present embodiment uses a simple algorithm and is suitable for scenarios in which the query word has few sub-words.
Fig. 6 is a schematic flowchart of embodiment four of the search query word error correction method provided by the present invention. The present embodiment is another specific implementation of step S203 in the embodiment shown in Fig. 2 above. On the basis of the above embodiments, as shown in Fig. 6, step S203 in the present embodiment, obtaining the error correction target word list according to each error correction candidate sub-word list and the pre-built N-Gram language model, specifically includes:
Taking the error correction candidate sub-word list of the first sub-word in the query word as the first intermediate result, and performing an error correction target word search operation on the first intermediate result, where the error correction target word search operation includes:
Step S601: according to the N-Gram language model, obtain the N-Gram score of each candidate word in the first intermediate result.
Step S602: sort the first intermediate result by N-Gram score and, when the number of candidate words in the intermediate result exceeds a preset threshold L, keep the first L candidate words of the intermediate result as the second intermediate result.
Step S603: splice the error correction candidate sub-word list of the second sub-word in the query word with the second intermediate result to form a new first intermediate result, and return to step S601, until the second intermediate result corresponding to the last sub-word in the sub-word list is obtained; this second intermediate result is used as the error correction target word list.
The method of the present embodiment is a search algorithm based on a heuristic, greedy principle. Fig. 7(a) to Fig. 7(d) are schematic diagrams of the flow of the greedy algorithm provided by the present invention. They show the processing of a sample query word, "paper is warded off in full open robbery war". This query word contains three errors: a pinyin homophone ("robbing" - "rifle"), an approximate pinyin ("bright" - "people") and a homograph ("warding off" - "wall"). The correct error correction result is "whole people's gunbattle wallpaper". In practical application scenarios, a user's query word does not usually contain so many kinds of error at once; they are combined here only to illustrate the algorithm flow better.
Assume that Chinese word segmentation of the query word yields a sub-word list of length N. This example uses the single-character segmentation mode (each Chinese character is one sub-word), so the sub-word list is ["complete", "bright", "robbing", "war", "warding off", "paper"], that is, N = 6. For each sub-word, its error correction candidate sub-word list can be determined from the Chinese character pinyin mapping table, the approximate pinyin conversion rules and the homograph dictionary, as the union of all its homophones, near-homophones and homographs. As shown in Fig. 7(a), for each sub-word, CL1 denotes its homophone sub-word list, CL2 its near-homophone sub-word list and CL3 its homograph sub-word list, so the error correction candidate sub-word list of the sub-word is CL = CL1 ∪ CL2 ∪ CL3.
The greedy algorithm proceeds in multiple rounds, and the number of rounds equals the length N of the sub-word list. The algorithm first takes the error correction candidate sub-word list of the 1st sub-word in the query word (that is, the first sub-word) as the first intermediate result and performs the error correction target word search operation on it. Specifically, it calculates the N-gram score (that is, the word probability) of each candidate word in the first intermediate result and sorts the candidate words by N-gram score, for example P("complete") > P("power") > ... > P("circle"), which gives the sorted first intermediate result. If the length of the first intermediate result exceeds a system preset threshold L (for example 30), the part of the list beyond L is cut off and only the first L candidate words are kept, giving the second intermediate result TR1. This is the first round of processing, Round1; see Fig. 7(a).
In the second round of processing, Round2, the algorithm splices the error correction candidate sub-word list of the 2nd sub-word (that is, the second sub-word) pairwise with TR1 to obtain a new first intermediate result, and performs the error correction target word search operation on it. That is, it calculates the N-gram score of each word in the new first intermediate result, then sorts and truncates the list by score and threshold L in the same way, giving a list such as P("whole people") > P("complete bright") > ... > P("circle name"). This is the new second intermediate result TR2; see Fig. 7(b). In the third round of processing, Round3, the algorithm splices the error correction candidate sub-word list of the 3rd sub-word pairwise with TR2 to obtain TR3; see Fig. 7(c).
The algorithm continues to loop. In the last round, the N-th round (here the 6th), RoundN, the candidate list of the N-th sub-word is spliced pairwise with the intermediate result TRN-1 to obtain TRN, which is the error correction target word list R = TRN; see Fig. 7(d).
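The round-by-round procedure is essentially a beam search with beam width L. The Python sketch below illustrates it under stated assumptions: the function name greedy_targets, the abstract sub-word tokens and the fixed toy score table are hypothetical, and the score function stands in for the N-gram word probability.

```python
from typing import Callable, List, Sequence


def greedy_targets(
    candidate_lists: Sequence[Sequence[str]],
    score: Callable[[str], float],
    limit: int,
) -> List[str]:
    """Run one round per sub-word, keeping at most `limit` candidates each round."""
    second_intermediate: List[str] = []
    for round_index, sub_words in enumerate(candidate_lists):
        if round_index == 0:
            # Round 1: the first candidate sub-word list is the first intermediate result.
            first_intermediate = list(sub_words)
        else:
            # Later rounds: pairwise splicing of the previous TR with the next list.
            first_intermediate = [p + s for p in second_intermediate for s in sub_words]
        # Score, sort in descending order and truncate to the threshold L.
        first_intermediate.sort(key=score, reverse=True)
        second_intermediate = first_intermediate[:limit]
    return second_intermediate


# Toy stand-in for the N-gram word probability of each (partial) candidate word.
TOY_SCORES = {
    "a": 0.6, "b": 0.4,
    "ax": 0.3, "ay": 0.2, "bx": 0.25, "by": 0.05,
    "axm": 0.1, "axn": 0.15, "bxm": 0.2, "bxn": 0.01,
}
targets = greedy_targets(
    [["a", "b"], ["x", "y"], ["m", "n"]],
    lambda w: TOY_SCORES.get(w, 0.0),
    limit=2,
)
```

In this toy run the beam after the second round is ["ax", "bx"], so only four of the eight possible three-token strings are ever scored in the last round.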
In the present embodiment, optionally, while the error correction target word list is being obtained, the candidate words in each intermediate result whose N-Gram score is less than or equal to the N-Gram score of the corresponding query word can also be deleted.
Specifically, for the intermediate result of each round, the candidate words in the list that score no higher than the original query word are removed. For example, if P("circle is bright") < P("paper is warded off in full open robbery war"), the candidate word "circle is bright" is removed from TR2.
This method is based on the N-gram scoring formula, for example P("whole people's gunbattle") = P("complete") * P("people" | "complete") * P("rifle" | "whole people") * P("war" | "whole people's rifle"). Because every factor in the formula is a probability, the N-gram score of a word is never higher than that of any of its prefixes, for example P("the bright gunbattle wallpaper of circle") < P("circle is bright"). If the score of a prefix is already below that of the original query word, none of the candidate words built on that prefix needs to be considered further. The method filters out unqualified candidate words as early as possible, which reduces the search space of the algorithm and its time complexity.
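The pruning argument rests on the chain-rule product never increasing as a word grows. The following Python sketch illustrates this with a hypothetical table of conditional probabilities; the names COND_PROB, FLOOR, word_probability and prune are assumptions, and FLOOR merely stands in for whatever smoothed probability a real model would return for an unseen N-gram.

```python
from math import prod
from typing import Dict, List, Sequence, Tuple

# Hypothetical conditional probabilities P(token | history) for illustration.
COND_PROB: Dict[Tuple[str, Tuple[str, ...]], float] = {
    ("a", ()): 0.5,
    ("b", ("a",)): 0.4,
    ("c", ("a", "b")): 0.5,
}
FLOOR = 1e-6  # stand-in for the smoothed probability of an unseen N-gram


def word_probability(tokens: Sequence[str]) -> float:
    """Chain rule: P(w1..wk) = P(w1) * P(w2 | w1) * ... * P(wk | w1..wk-1)."""
    return prod(
        COND_PROB.get((tok, tuple(tokens[:i])), FLOOR) for i, tok in enumerate(tokens)
    )


def prune(intermediate: List[Sequence[str]], query_score: float) -> List[Sequence[str]]:
    """Drop candidates scoring no higher than the original query word."""
    return [cand for cand in intermediate if word_probability(cand) > query_score]


kept = prune([["a", "b"], ["x", "y"]], query_score=1e-3)
```

Since each factor is at most 1, a prefix that already falls below the query score can never recover, which is what makes the early deletion safe.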
In addition, candidate words in an intermediate result that match neither the N-gram language model nor the original query word can also be removed. For example, suppose "circle is bright" is an N-gram that does not exist in the N-gram language model. The model then outputs a very low probability using its smoothing strategy and also indicates that the word does not match the model. Since the word does not match the original query word either, it can be removed.
In the present embodiment, optionally, while the error correction target word list is being obtained, if the time spent so far on obtaining the error correction target word list from the error correction candidate sub-word lists and the pre-built N-Gram language model exceeds a preset time, the current intermediate result can be spliced with the sub-words corresponding to the error correction candidate sub-word lists that have not yet been spliced, and the result used as the error correction target word list.
Specifically, timing starts when the algorithm begins the first round and is monitored continuously while it runs. If, during the i-th round (i < N), the accumulated time is found to exceed a system preset time (for example 100 milliseconds), the algorithm terminates early and returns the current partial error correction result. This prevents the algorithm from running for a long time because the candidate lists of some sub-words are too long.
Taking the example shown in Fig. 7(a) to Fig. 7(d): assuming the algorithm times out during the third round, TR3 is taken as the partial error correction result, and the remaining sub-words of the original query word, "paper is warded off in the war", are spliced directly onto it. The resulting error correction target word list is the output, for example ["whole people rob war and ward off paper", "paper is warded off in whole people's gunbattle", ...].
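The early-termination rule can be sketched by adding a time budget to the greedy loop. In the Python sketch below, the function name greedy_with_deadline, the injectable clock parameter (used so the behavior can be demonstrated deterministically) and the toy data are all assumptions; checking the budget at the end of each round is one possible reading of the description.

```python
import time
from typing import Callable, List, Sequence


def greedy_with_deadline(
    sub_words: Sequence[str],
    candidate_lists: Sequence[Sequence[str]],
    score: Callable[[str], float],
    limit: int,
    budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Greedy search that falls back to the original sub-words on timeout."""
    start = clock()
    beam: List[str] = []
    for i, cands in enumerate(candidate_lists):
        spliced = list(cands) if i == 0 else [p + c for p in beam for c in cands]
        beam = sorted(spliced, key=score, reverse=True)[:limit]
        rounds_left = i < len(candidate_lists) - 1
        if rounds_left and clock() - start > budget_s:
            # Timeout: splice the uncorrected remainder of the query onto each result.
            suffix = "".join(sub_words[i + 1:])
            return [word + suffix for word in beam]
    return beam


TOY_SCORES = {
    "a": 0.6, "b": 0.4,
    "ax": 0.3, "ay": 0.2, "bx": 0.25, "by": 0.05,
    "axm": 0.1, "axn": 0.15, "bxm": 0.2, "bxn": 0.01,
}
toy_score = lambda w: TOY_SCORES.get(w, 0.0)
query = ["b", "y", "n"]
lists = [["a", "b"], ["x", "y"], ["m", "n"]]

# A fake clock that reports a timeout after the second round.
ticks = iter([0.0, 0.05, 0.2])
partial = greedy_with_deadline(query, lists, toy_score, 2, 0.1, clock=lambda: next(ticks))
```

Here the third sub-word of the query, "n", is appended unchanged to both entries of the second-round beam.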
In addition, as mentioned in the above embodiments, Chinese word segmentation can use the single-character mode or the phrase mode. In the phrase mode, when the above greedy algorithm is used, the segmentation of the query word may not line up with that of the correct error correction result. For example, the erroneous query word "silent mark weather" is segmented as "silent mark weather", while the error correction target word is segmented as "ink marks weather", and the two do not match. In this case, the pinyin of the query word can be fully segmented to obtain the various pinyin segmentation strings, such as "moji tianqi", "mo ji tianqi" and "mo ji tian qi". The greedy error correction result search algorithm is then called on each pinyin string in turn, and the results are merged.
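One way to picture the full segmentation is to enumerate every way of grouping consecutive pinyin syllables into phrases. The Python sketch below assumes the query pinyin has already been split into syllables; the function name all_groupings is hypothetical, and the embodiment may well generate or restrict the segmentations differently.

```python
from itertools import product
from typing import List, Sequence


def all_groupings(syllables: Sequence[str]) -> List[List[str]]:
    """Enumerate every grouping of consecutive syllables into phrases.

    Each of the len(syllables) - 1 gaps is either a cut or a join,
    so k syllables yield 2 ** (k - 1) segmentations.
    """
    if not syllables:
        return []
    results = []
    for cuts in product([False, True], repeat=len(syllables) - 1):
        groups, current = [], syllables[0]
        for syllable, cut in zip(syllables[1:], cuts):
            if cut:
                groups.append(current)
                current = syllable
            else:
                current += syllable
        groups.append(current)
        results.append(groups)
    return results


segmentations = all_groupings(["mo", "ji", "tian", "qi"])
```

Each resulting list would then be passed to the greedy search as its own sub-word list, and the per-segmentation results merged.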
The search query word error correction method provided by the present embodiment processes the error correction candidate sub-word lists with a greedy algorithm to obtain the final error correction target word list, which effectively improves the running efficiency of the system.
On the basis of the above embodiments, in one embodiment of the present invention, before the error correction result is output according to the error correction target word list in step S204, the method further includes: performing result filtering on the error correction target word list, so that only error correction results with high confidence are output, the proportion of wrong corrections is reduced and the user experience is improved.
In the present embodiment, result filtering can be carried out using at least one of the following implementations:
The first implementation: according to the N-Gram language model, calculate the sentence score of each error correction target word in the error correction target word list, and sort the error correction target word list by sentence score.
Specifically, the word probability works well for filtering intermediate results during the error correction search, because the intermediate results are incomplete sequences of sub-words. The sentence probability, in contrast, may work better for filtering the final results, because in most application scenarios the query word is a complete phrase (such as a mobile application name, an employee's full name or a book title).
In the present embodiment, for each target word output by the error correction result search module, its N-gram sentence probability, rather than its word probability, is recalculated as its score. After every error correction target word has been rescored, the result list is sorted by the new scores. The N-gram sentence probability is calculated as: P(<s> W1 W2 ... Wn </s>) = P(W1 | <s>) * P(W2 | <s> W1) * ... * P(</s> | <s> W1 W2 ... Wn), where <s> and </s> are the sentence start and end symbols defined by the N-gram language model.
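The sentence probability differs from the word probability only in the two boundary symbols, which add one extra factor. A minimal Python sketch follows; the function name sentence_probability, the toy conditional table and the floor value for unseen histories are assumptions for illustration.

```python
from typing import Callable, Dict, Sequence, Tuple

History = Tuple[str, ...]


def sentence_probability(
    tokens: Sequence[str],
    cond_prob: Callable[[str, History], float],
) -> float:
    """P(<s> W1..Wn </s>) as a product of n + 1 conditional probabilities."""
    padded = ["<s>", *tokens, "</s>"]
    p = 1.0
    for i in range(1, len(padded)):
        p *= cond_prob(padded[i], tuple(padded[:i]))
    return p


# Hypothetical model: a lookup table with a small floor for unseen entries.
TOY_COND: Dict[Tuple[str, History], float] = {
    ("a", ("<s>",)): 0.5,
    ("b", ("<s>", "a")): 0.5,
    ("</s>", ("<s>", "a", "b")): 0.25,
}


def toy_cond_prob(token: str, history: History) -> float:
    return TOY_COND.get((token, history), 1e-6)


sentence_score = sentence_probability(["a", "b"], toy_cond_prob)
```

For the two-token example the product has three factors, 0.5 * 0.5 * 0.25, and the last factor is the one that rewards candidates that form a complete phrase.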
The second implementation: check all sub-words of each error correction target word in the error correction target word list; for an error correction target word that contains an approximate pinyin sub-word, multiply its score by a penalty factor, and then sort the error correction target word list by score. An approximate pinyin sub-word is one whose pinyin is an approximate pinyin of the pinyin of the corresponding sub-word of the query word.
Specifically, an approximate pinyin word is less likely than a homophone to be the error correction target, because a user is generally less likely to mistype a word in the query as a word with approximate pinyin than as a homophone. The score of an error correction target word containing approximate pinyin sub-words can therefore be multiplied by a penalty factor; the more approximate pinyin sub-words the target word contains, the heavier the penalty.
Assume that the query word has N sub-words after segmentation, so that each error correction target word in the error correction target word list also has N sub-words after segmentation. All sub-words of each error correction target word are checked, and the number of sub-words whose pinyin is an approximate pinyin of, rather than identical to, the pinyin of the corresponding sub-word of the original query word is counted and denoted M. For example, if the original query word is "full name gunbattle" and the error correction target word is "whole people's gunbattle", then N = 4 and M = 1. The score of an error correction target word containing approximate pinyin sub-words is multiplied by a penalty factor p with a value in (0, 1), for example p = (1 + N - M) / (1 + N). After the penalty factor has been applied to the scores of the affected error correction target words, the list is re-sorted. 1 is added to both the numerator and the denominator of the formula so that p is not 0 when N = M.
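The example formula can be checked with a few lines of Python. The function names penalty_factor and apply_penalty are hypothetical, and the count M is taken as an input here because deciding whether two pinyin are approximate depends on conversion rules that this sketch does not model.

```python
def penalty_factor(n: int, m: int) -> float:
    """p = (1 + N - M) / (1 + N), where M of the N sub-words are approximate pinyin."""
    if not 0 <= m <= n:
        raise ValueError("m must satisfy 0 <= m <= n")
    return (1 + n - m) / (1 + n)


def apply_penalty(score: float, n: int, m: int) -> float:
    """Penalize only target words that contain approximate pinyin sub-words."""
    return score * penalty_factor(n, m) if m > 0 else score


# The example from the text: N = 4 sub-words, M = 1 approximate pinyin sub-word.
example_p = penalty_factor(4, 1)
```

For the example in the text the factor is 4/5, and even when every sub-word is an approximate pinyin match (M = N = 4) the factor stays positive at 1/5.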
The third implementation: delete from the error correction target word list the error correction target words whose score is below an expectation threshold, where the expectation threshold is determined by the number of sub-words in the query word.
Specifically, a system parameter α with a value range of (0, 1) can be set. If the query word has n sub-words after segmentation, its sentence probability is the product of n + 1 probabilities. The expectation threshold can then be set to α^(n+1); that is, the score of each error correction target word must be higher than the (n+1)-th power of α, otherwise the word is filtered out. In addition, to avoid filtering out correct error correction target words because the expectation threshold is too high, another system parameter β can be set and α^(n+1) changed to α^(β*(n+1)), where β can take any real value (such as 1.5, 2 or 3).
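The threshold and the filter can be sketched in Python as follows. The function names expectation_threshold and filter_targets and the toy scores are assumptions; treating a score exactly equal to the threshold as filtered out is a choice of this sketch, since the text leaves the boundary case open.

```python
from typing import List, Tuple


def expectation_threshold(n: int, alpha: float, beta: float = 1.0) -> float:
    """alpha ** (beta * (n + 1)) for a query word with n sub-words."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    return alpha ** (beta * (n + 1))


def filter_targets(
    scored: List[Tuple[str, float]], n: int, alpha: float, beta: float = 1.0
) -> List[Tuple[str, float]]:
    """Keep only the target words whose score is above the threshold."""
    threshold = expectation_threshold(n, alpha, beta)
    return [(word, score) for word, score in scored if score > threshold]


# Toy scored targets for a query word with n = 3 sub-words.
kept = filter_targets([("good", 0.1), ("bad", 0.01)], n=3, alpha=0.5)
```

With α = 0.5 and n = 3 the threshold is 0.5^4 = 0.0625; setting β = 2 lowers it to 0.5^8, which lets more low-scoring targets through.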
Optionally, to ensure the real-time performance of error correction, reduce the average response time of a single error correction request and raise the overall throughput of the system, the method of the present embodiment can include the following four caching mechanisms, each of which can be implemented with a hash table:
The first, an error correction result cache: the key of the hash table is the query word (case-insensitive) and the value is the error correction result.
In the present embodiment, the mapping between a query word that has been through error correction and its error correction result can be stored. In subsequent error correction, the table is queried first; if the query word matches the key of a cache entry, the corresponding result is returned directly, without going through the above error correction process again.
The second, an N-gram score cache: the key of the hash table is an N-gram phrase, such as "music" or "love strange skill", and the value is the N-gram language model score of the phrase.
Specifically, when a word is scored, the N-gram language model has to compute the probability in real time before returning it. For a word that has already been scored, the mapping between the word and its score can be stored in the table. In subsequent score calculations the table is queried first; if the word being scored matches the key of a cache entry, the corresponding result is returned directly, without being calculated by the N-gram language model again.
The third, an N-gram state cache: the key of the hash table is an N-gram phrase and the value is a Boolean indicating whether the phrase occurred in the training corpus, that is, whether it matches the N-gram language model, or in other words whether it is a legal N-gram.
While word probabilities are being calculated with the N-gram language model, some phrases may not match the model (that is, they did not occur in the training corpus). For an N-gram that does not exist in the model, the N-gram language model outputs a very low probability using its smoothing strategy and also indicates that the phrase does not match the model. The phrase can then be stored in the table with its Boolean value set to false, so that when the same phrase is encountered in a later score calculation, the table shows directly that the phrase is illegal.
The fourth, an intermediate result cache: the key of the hash table is a pinyin string, such as "ai" or "tiantian", and the value is the intermediate result obtained for that pinyin string in the greedy algorithm after sorting and truncation, that is, the second intermediate result.
During the error correction target word search, if several different query words, one after another, share the same pinyin prefix substring, the intermediate result corresponding to that substring can be taken directly from the table without being recalculated. For the phrase mode of Chinese word segmentation, this method can effectively reduce the amount of computation and improve the running efficiency of the system.
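Two of the four caches are sketched below in Python, using the built-in dict as the hash table. The class name ResultCache, the wrapper cached_scorer and the toy model are hypothetical; folding case with str.casefold is one way to obtain case-insensitive keys, and cache eviction and invalidation after a corpus update are deliberately left out.

```python
from typing import Callable, Dict, List, Optional


class ResultCache:
    """Error correction result cache keyed by the case-folded query word."""

    def __init__(self) -> None:
        self._table: Dict[str, List[str]] = {}

    def get(self, query: str) -> Optional[List[str]]:
        return self._table.get(query.casefold())

    def put(self, query: str, result: List[str]) -> None:
        self._table[query.casefold()] = result


def cached_scorer(score: Callable[[str], float]) -> Callable[[str], float]:
    """Wrap a language model score function with an N-gram score cache."""
    table: Dict[str, float] = {}

    def wrapper(phrase: str) -> float:
        if phrase not in table:
            table[phrase] = score(phrase)  # computed by the model only once
        return table[phrase]

    return wrapper


# Toy usage: the model is called a single time for a repeated phrase.
results = ResultCache()
results.put("Twiter", ["twitter"])

model_calls: List[str] = []


def toy_model_score(phrase: str) -> float:
    model_calls.append(phrase)
    return 0.25


score = cached_scorer(toy_model_score)
first, second = score("music"), score("music")
```

The N-gram state cache and the intermediate result cache would follow the same pattern, with a Boolean or a truncated candidate list as the stored value.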
The search query word error correction method provided by the present embodiment outputs the error correction result only after result filtering has been performed on the error correction target word list, which effectively reduces the proportion of wrong corrections and improves the user experience.
Fig. 8 is a schematic structural diagram of embodiment one of the search query word error correction device provided by the present invention. The device of the present embodiment can be a standalone device or can be integrated in an error correction subsystem. As shown in Fig. 8, the device of the present embodiment includes a receiving module 10, an error correction preprocessing module 20, an error correction processing module 30 and an output module 40, where:
the receiving module 10 is configured to receive the query word in a search request;
the error correction preprocessing module 20 is configured to obtain, according to the pre-built word error correction map table, the error correction candidate sub-word list of each sub-word in the query word;
the error correction processing module 30 is configured to obtain the error correction target word list according to each error correction candidate sub-word list and the pre-built N-Gram language model;
the output module 40 is configured to output the error correction result according to the error correction target word list.
The search query word error correction device provided by the embodiment of the present invention can perform the above method embodiments; its implementation principle and technical effect are similar and are not repeated here.
In one embodiment of the present invention, the error correction processing module 30 is specifically configured to:
perform string concatenation on the error correction candidate sub-word lists to obtain an error correction candidate word list;
calculate, according to the pre-built N-Gram language model, the word score of each error correction candidate word in the error correction candidate word list, and sort the error correction candidate words by word score to obtain the error correction target word list.
In another embodiment of the present invention, the error correction processing module 30 is specifically configured to:
take the error correction candidate sub-word list of the first sub-word in the query word as the first intermediate result and perform an error correction target word search operation on the first intermediate result, where the error correction target word search operation includes:
obtaining, according to the N-Gram language model, the N-Gram score of each candidate word in the first intermediate result;
sorting the first intermediate result by N-Gram score and, when the number of candidate words in the intermediate result exceeds a preset threshold L, keeping the first L candidate words of the intermediate result as the second intermediate result;
splicing the error correction candidate sub-word list of the second sub-word in the query word with the second intermediate result to form a new first intermediate result, and returning to perform the error correction target word search operation, until the second intermediate result corresponding to the last sub-word in the sub-word list is obtained, this second intermediate result being used as the error correction target word list.
In the present embodiment, as an optional implementation, the error correction processing module 30 is further configured to delete from each intermediate result the candidate words whose N-Gram score is less than or equal to the N-Gram score of the corresponding query word.
As another optional implementation, the error correction processing module 30 is further configured to: if the time spent so far on obtaining the error correction target word list according to the error correction candidate sub-word lists and the pre-built N-Gram language model exceeds a preset time, splice the current intermediate result with the sub-words corresponding to the error correction candidate sub-word lists that have not yet been spliced, and use the result as the error correction target word list.
The search query word error correction device provided by the embodiment of the present invention can perform the above method embodiments; its implementation principle and technical effect are similar and are not repeated here.
Fig. 9 is a schematic structural diagram of embodiment two of the search query word error correction device provided by the present invention. The present embodiment is a further optimization of the embodiment shown in Fig. 8 above. As shown in Fig. 9, on the basis of the embodiment shown in Fig. 8, the device of the present embodiment further includes:
an error correction result filtering module 50, configured to perform result filtering on the error correction target word list.
As a specific implementation of the present invention, the error correction result filtering module 50 is specifically configured to:
calculate, according to the N-Gram language model, the sentence score of each error correction target word in the error correction target word list, and sort the error correction target word list by sentence score;
check all sub-words of each error correction target word in the error correction target word list, multiply the score of any error correction target word that contains an approximate pinyin sub-word by a penalty factor, and then sort the error correction target word list by score, where an approximate pinyin sub-word is one whose pinyin is an approximate pinyin of the pinyin of the corresponding sub-word of the query word;
delete from the error correction target word list the error correction target words whose score is below an expectation threshold, where the expectation threshold is determined by the number of sub-words in the query word.
In the present embodiment, optionally, the error correction preprocessing module 20 is specifically configured to:
perform word segmentation on the query word to obtain the sub-word list of the query word;
obtain, according to the pre-built word error correction map table, the error correction candidate sub-word list of each sub-word in the sub-word list.
Further, the word error correction map table includes at least one of a Chinese character pinyin mapping table, an English word index table and a homograph dictionary;
the error correction preprocessing module 20 is then specifically configured to:
obtain, according to the Chinese character pinyin mapping table, the homophone or near-homophone error correction candidate words of each Chinese or pinyin sub-word in the sub-word list;
obtain, according to the English word index table, the error correction candidate words of each English sub-word in the sub-word list;
obtain, according to the homograph dictionary, the homograph error correction candidate words of each Chinese sub-word in the sub-word list.
As a specific implementation of the present invention, the error correction preprocessing module 20 is specifically configured to:
for each English sub-word in the sub-word list, obtain from the English word index table the first M words when ordered by edit distance to the English sub-word from small to large, together with the corpus occurrence count of each of these words;
score and rank the M words according to their edit distances and corpus occurrence counts, and choose the first N words of the ranking as the error correction candidate words of the English sub-word, where M and N are positive integers and M is greater than N.
The search query word error correction device provided by the embodiment of the present invention can perform the above method embodiments; its implementation principle and technical effect are similar and are not repeated here.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.