CN110032722A - Text error correction method and device - Google Patents

Publication number
CN110032722A
Authority
CN
China
Legal status
Pending
Application number
CN201810030108.3A
Other languages
Chinese (zh)
Inventor
吴晓东
邵荣防
郝晖
谢群群
陈贱辉
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation

Abstract

The invention discloses a text error correction method and device, and relates to the field of computer technology. The method comprises: obtaining the pinyin sequence of a text to be corrected; searching a mixed dictionary tree to obtain a candidate text set matching the pinyin sequence of the text to be corrected, the mixed dictionary tree containing correspondences between pinyin and Chinese words and English words; and determining the error correction result of the text to be corrected according to an error correction model and the candidate text set. Through these steps, text that mixes Chinese, English, and pinyin can be corrected well, improving the coverage and applicability of text error correction.

Description

Text error correction method and device
Technical field
The present invention relates to the field of computer technology, and in particular to a text error correction method and device.
Background art
In recent years, query error correction has been widely used in retrieval systems and has achieved good results. With the development of the Internet industry, query error correction is also receiving increasing attention in other Internet fields (such as e-commerce).
Existing query error correction techniques fall mainly into two types: text error correction based on user sessions, and text error correction based on probability models. The first method mines the session logs of user searches for candidate correction pairs that users actively rewrote, and uses the rewritten term as the correct search term after correction. The second method takes user search terms with high click volume as the correction candidate set, uses a statistical model to compute the probability of each candidate text, and takes the candidate with the highest probability as the correct search term after correction.
In the course of realizing the present invention, the inventors found at least the following problems in the prior art: first, the prior art cannot handle query error correction for mixed Chinese, English, and pinyin well; second, the prior art processes query error correction for long-tail terms slowly, so its timeliness is poor.
Summary of the invention
In view of this, the present invention provides a text error correction method and device that can handle the correction of text mixing Chinese, English, and pinyin well, improving the coverage and applicability of text error correction.
To achieve the above object, according to a first aspect of the present invention, a text error correction method is provided.
The text error correction method of the present invention comprises: obtaining the pinyin sequence of a text to be corrected; searching a mixed dictionary tree to obtain a candidate text set matching the pinyin sequence of the text to be corrected, the mixed dictionary tree containing correspondences between pinyin and Chinese words and English words; and determining the error correction result of the text to be corrected according to an error correction model and the candidate text set.
Optionally, the step of obtaining the pinyin sequence of the text to be corrected comprises: if the text to be corrected consists of Chinese characters, taking the pinyin of the Chinese characters as the pinyin sequence of the text to be corrected; if the text to be corrected consists of non-Chinese characters, taking the non-Chinese characters themselves as the pinyin sequence of the text to be corrected; if the text to be corrected consists of both Chinese characters and non-Chinese characters, taking the whole formed by the pinyin of the Chinese characters and the non-Chinese characters themselves as the pinyin sequence of the text to be corrected; wherein the non-Chinese characters include digits, English words, and/or pinyin.
Optionally, the step of searching the mixed dictionary tree to obtain the candidate text set matching the pinyin sequence of the text to be corrected comprises: searching the mixed dictionary tree based on a forward maximum matching algorithm and a reverse maximum matching algorithm, and determining the candidate text set matching the pinyin sequence according to the forward maximum matching result and the reverse maximum matching result.
Optionally, the step of determining the error correction result of the text to be corrected according to the error correction model and the candidate text set comprises: computing, based on multiple error correction models, the evaluation factors of each candidate text in the candidate text set; fusing the multiple evaluation factors to obtain the evaluation value of the candidate text; and determining the error correction result of the text to be corrected according to the evaluation value.
Optionally, the multiple error correction models include at least two of the following: a noisy channel error correction model, an edit distance error correction model, and a pinyin distance error correction model.
Optionally, where the multiple error correction models include the noisy channel error correction model, the edit distance error correction model, and the pinyin distance error correction model, the step of computing the evaluation factors of each candidate text in the candidate text set based on the multiple error correction models comprises: computing the noisy channel probability of the candidate text based on the noisy channel error correction model, and taking it as the first evaluation factor of the candidate text; computing the edit distance of the candidate text based on the edit distance error correction model, and determining the second evaluation factor of the candidate text according to the edit distance; and computing the pinyin distance of the candidate text based on the pinyin distance error correction model, and determining the third evaluation factor of the candidate text according to the pinyin distance.
Optionally, the step of computing the pinyin distance of the candidate text based on the pinyin distance error correction model comprises: for the characters in the text to be corrected and in the candidate text, comparing one by one whether the letters making up their pinyin are identical and whether their tones are identical; determining the pinyin distance of each character according to the comparison result, and taking the sum of the pinyin distances of the characters as the pinyin distance of the candidate text.
Optionally, the forward maximum matching result and the reverse maximum matching result each include at least one candidate text fragment; the method further comprises: performing an edit operation on the pinyin sequence of a candidate text fragment; searching the mixed dictionary tree according to the edited pinyin sequence to obtain newly added candidate text fragments matching the edited pinyin sequence; and constructing the candidate text set matching the pinyin sequence of the text to be corrected from the candidate text fragments and the newly added candidate text fragments.
Optionally, the step of performing an edit operation on the pinyin sequence of a candidate text fragment comprises: where the candidate text fragment contains Chinese characters, performing a fuzzy-sound edit operation on the pinyin of the Chinese characters; where the candidate text fragment contains English words, performing insertion, replacement, transposition, and/or deletion edit operations on the English words.
Optionally, the method further comprises: obtaining the pinyin sequences of training sample words, and building the mixed dictionary tree from the pinyin sequences of the training sample words.
Optionally, the method further comprises: before the step of obtaining the pinyin sequences of the training sample words and building the mixed dictionary tree from them, cleaning source data to obtain the training sample words.
To achieve the above object, according to a second aspect of the present invention, a search method is provided.
The search method of the present invention comprises: receiving an input text; where the input text is determined to be a text to be corrected, obtaining the pinyin sequence of the input text; searching a mixed dictionary tree to obtain a candidate text set matching the pinyin sequence of the input text, the mixed dictionary tree containing correspondences between pinyin and Chinese words and English words; determining the error correction result of the input text according to an error correction model and the candidate text set; and obtaining search results according to the error correction result of the input text, and sending the search results.
To achieve the above object, according to a third aspect of the present invention, a search error correction method is provided.
The search error correction method of the present invention comprises: receiving an input text; where the input text is determined to be a text to be corrected, obtaining the pinyin sequence of the input text; searching a mixed dictionary tree to obtain a candidate text set matching the pinyin sequence of the input text, the mixed dictionary tree containing correspondences between pinyin and Chinese words and English words; determining the error correction results of the input text according to an error correction model and the candidate text set; and sorting the error correction results of the input text, and sending the sorted error correction results.
To achieve the above object, according to a fourth aspect of the present invention, a text error correction device is provided.
The text error correction device of the present invention comprises: an acquisition module for obtaining the pinyin sequence of a text to be corrected; a search module for searching a mixed dictionary tree to obtain a candidate text set matching the pinyin sequence of the text to be corrected, the mixed dictionary tree containing correspondences between pinyin and Chinese words and English words; and a determining module for determining the error correction result of the text to be corrected according to an error correction model and the candidate text set.
Optionally, the acquisition module obtaining the pinyin sequence of the text to be corrected comprises: if the text to be corrected consists of Chinese characters, the acquisition module takes the pinyin of the Chinese characters as the pinyin sequence of the text to be corrected; if the text to be corrected consists of non-Chinese characters, the acquisition module takes the non-Chinese characters themselves as the pinyin sequence of the text to be corrected; if the text to be corrected consists of both Chinese characters and non-Chinese characters, the acquisition module takes the whole formed by the pinyin of the Chinese characters and the non-Chinese characters themselves as the pinyin sequence of the text to be corrected; wherein the non-Chinese characters include digits, English words, and/or pinyin.
Optionally, the search module searching the mixed dictionary tree to obtain the candidate text set matching the pinyin sequence of the text to be corrected comprises: the search module searches the mixed dictionary tree based on a forward maximum matching algorithm and a reverse maximum matching algorithm, and determines the candidate text set matching the pinyin sequence according to the forward maximum matching result and the reverse maximum matching result.
Optionally, the determining module determining the error correction result of the text to be corrected according to the error correction model and the candidate text set comprises: the determining module computes, based on multiple error correction models, the evaluation factors of each candidate text in the candidate text set; the determining module fuses the multiple evaluation factors to obtain the evaluation value of the candidate text; and the determining module determines the error correction result of the text to be corrected according to the evaluation value.
Optionally, the multiple error correction models include at least two of the following: a noisy channel error correction model, an edit distance error correction model, and a pinyin distance error correction model.
Optionally, where the multiple error correction models include the noisy channel error correction model, the edit distance error correction model, and the pinyin distance error correction model, the determining module computing the evaluation factors of each candidate text in the candidate text set based on the multiple error correction models comprises: the determining module computes the noisy channel probability of the candidate text based on the noisy channel error correction model, and takes it as the first evaluation factor of the candidate text; the determining module computes the edit distance of the candidate text based on the edit distance error correction model, and determines the second evaluation factor of the candidate text according to the edit distance; and the determining module computes the pinyin distance of the candidate text based on the pinyin distance error correction model, and determines the third evaluation factor of the candidate text according to the pinyin distance.
Optionally, the determining module computing the pinyin distance of the candidate text based on the pinyin distance error correction model comprises: for the characters in the text to be corrected and in the candidate text, the determining module compares one by one whether the letters making up their pinyin are identical and whether their tones are identical; the determining module determines the pinyin distance of each character according to the comparison result, and takes the sum of the pinyin distances of the characters as the pinyin distance of the candidate text.
Optionally, the forward maximum matching result and the reverse maximum matching result each include at least one candidate text fragment; the device further comprises an editing module for performing an edit operation on the pinyin sequence of a candidate text fragment; the search module is further configured to search the mixed dictionary tree according to the edited pinyin sequence to obtain newly added candidate text fragments matching the edited pinyin sequence, and to construct the candidate text set matching the pinyin sequence of the text to be corrected from the candidate text fragments and the newly added candidate text fragments.
Optionally, the editing module performing an edit operation on the pinyin sequence of a candidate text fragment comprises: where the candidate text fragment contains Chinese characters, the editing module performs a fuzzy-sound edit operation on the pinyin of the Chinese characters; where the candidate text fragment contains English words, the editing module performs insertion, replacement, transposition, and/or deletion edit operations on the English words.
Optionally, the device further comprises a building module for obtaining the pinyin sequences of training sample words and building the mixed dictionary tree from the pinyin sequences of the training sample words.
Optionally, the device further comprises a cleaning module for cleaning source data to obtain the training sample words.
To achieve the above object, according to a fifth aspect of the present invention, an electronic device is provided.
The electronic device of the present invention comprises: one or more processors; and a storage device for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the text error correction method of the present invention.
To achieve the above object, according to a sixth aspect of the present invention, a computer-readable medium is provided.
The computer-readable medium of the present invention stores a computer program which, when executed by a processor, implements the text error correction method of the present invention.
One embodiment of the above invention has the following advantages or beneficial effects: by obtaining the pinyin sequence of the text to be corrected, searching the mixed dictionary tree to obtain candidate texts matching that pinyin sequence, computing the evaluation value of each candidate text, and determining the error correction result of the text to be corrected according to the evaluation values, the embodiment can handle the correction of text mixing Chinese, English, and pinyin well, improving the coverage and applicability of text error correction.
Further effects of the above non-conventional optional implementations are described below in conjunction with specific embodiments.
Brief description of the drawings
The drawings are provided for a better understanding of the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of the main steps of a text error correction method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the main steps of a text error correction method according to another embodiment of the present invention;
Fig. 3 is a schematic diagram of the main steps of a text error correction method according to yet another embodiment of the present invention;
Fig. 4 is a schematic diagram of a mixed dictionary tree according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the main modules of a text error correction device according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of the main modules of a text error correction device according to another embodiment of the present invention;
Fig. 7 is an exemplary system architecture diagram to which embodiments of the present invention can be applied;
Fig. 8 is a schematic structural diagram of a computer system suitable for implementing the electronic device of an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described below with reference to the drawings, including various details of the embodiments to aid understanding; these should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art should recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
It should be noted that, where no conflict arises, the embodiments of the present invention and the features in the embodiments may be combined with one another.
Fig. 1 is a schematic diagram of the main steps of a text error correction method according to an embodiment of the present invention. As shown in Fig. 1, the text error correction method of this embodiment comprises:
Step S101: obtain the pinyin sequence of the text to be corrected.
Specifically, step S101 comprises: if the text to be corrected consists of Chinese characters, taking the pinyin of the Chinese characters as the pinyin sequence of the text to be corrected; if the text to be corrected consists of non-Chinese characters, taking the non-Chinese characters themselves as the pinyin sequence of the text to be corrected; if the text to be corrected consists of both Chinese characters and non-Chinese characters, taking the whole formed by the pinyin of the Chinese characters and the non-Chinese characters themselves as the pinyin sequence of the text to be corrected; wherein the non-Chinese characters include digits, English words, and/or pinyin.
For example, if the text to be corrected is the Chinese phrase for "ladies' sports shoes", its pinyin sequence is "nv shi yun dong xie". If the text to be corrected is "iphone8", its pinyin sequence is "iphone8". If the text to be corrected is "adidas" followed by the Chinese phrase for "men's sports shoes", its pinyin sequence is "adidas nan shi yun dong xie".
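The three cases of step S101 can be sketched in Python as follows. This is only an illustrative sketch: the table `CHAR_TO_PINYIN` and the function `to_pinyin_sequence` are hypothetical names, the table covers just the characters needed here, and the Chinese strings are assumed reconstructions of the examples above; a real system would rely on a full pronunciation dictionary.

```python
import re

# Hypothetical, deliberately tiny character-to-pinyin table (illustration only).
CHAR_TO_PINYIN = {
    "女": "nv", "男": "nan", "士": "shi",
    "运": "yun", "动": "dong", "鞋": "xie",
}

def to_pinyin_sequence(text: str) -> str:
    """Chinese characters become pinyin; digits, English words and pinyin are kept as is."""
    tokens = []
    # Each piece is either one Chinese character or a run of non-Chinese, non-space characters.
    for piece in re.findall(r"[\u4e00-\u9fff]|[^\u4e00-\u9fff\s]+", text):
        # Characters missing from the toy table fall through unchanged.
        tokens.append(CHAR_TO_PINYIN.get(piece, piece))
    return " ".join(tokens)

print(to_pinyin_sequence("女士运动鞋"))        # Chinese only
print(to_pinyin_sequence("iphone8"))          # non-Chinese only
print(to_pinyin_sequence("adidas男士运动鞋"))  # mixed
```

Because non-Chinese runs pass through unchanged, a query typed directly in pinyin or English is already its own pinyin sequence, which is what lets one lookup structure serve all three input types.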
Step S102: search the mixed dictionary tree to obtain a candidate text set matching the pinyin sequence of the text to be corrected.
The mixed dictionary tree contains correspondences between pinyin and Chinese words and English words. In the mixed dictionary tree, each node stores one character. In addition, the node that stores the last character of a pinyin also stores the words corresponding to that pinyin, which may be Chinese words or English words. For example, suppose the mixed dictionary tree contains the pinyin "hua wei". The root node is then empty, the nodes below it store the characters "h", "u", "a", "w", "e", "i" in turn, and the node storing "i" additionally stores the words corresponding to that pinyin sequence, such as "Huawei" and a homophone of it (glossed here as "dividing into").
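A minimal Python sketch of such a tree is given below, assuming a plain character trie. The class and method names (`TrieNode`, `MixedTrie`, `insert`, `lookup`) are hypothetical, and the Chinese words are assumed reconstructions of the "hua wei" example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.words = []     # words whose pinyin ends at this node

class MixedTrie:
    def __init__(self):
        self.root = TrieNode()  # the root node stores no character

    def insert(self, pinyin: str, word: str) -> None:
        """Store one character per node; attach the word to the last node."""
        node = self.root
        for ch in pinyin.replace(" ", ""):
            node = node.children.setdefault(ch, TrieNode())
        if word not in node.words:
            node.words.append(word)

    def lookup(self, pinyin: str) -> list:
        """Return the words stored for exactly this pinyin, or an empty list."""
        node = self.root
        for ch in pinyin.replace(" ", ""):
            node = node.children.get(ch)
            if node is None:
                return []
        return list(node.words)

trie = MixedTrie()
trie.insert("hua wei", "华为")   # "Huawei"
trie.insert("hua wei", "划为")   # the homophone glossed as "dividing into"
trie.insert("iphone", "iphone")  # an English word is keyed by its own spelling
print(trie.lookup("hua wei"))
```

Dropping the spaces before insertion follows the "h", "u", "a", "w", "e", "i" layout described above; one consequence of this choice in the sketch is that different syllable splits of the same letter string share a node.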
By way of example, suppose the text to be corrected is the mistyped phrase glossed as "dividing into mobile phone", whose pinyin sequence is "hua wei shou ji". The candidate text set obtained in step S102 then contains the following candidate texts: "Huawei mobile phone", "dividing into mobile phone", "Huawei collection", and "dividing into collection".
Step S103: determine the error correction result of the text to be corrected according to an error correction model and the candidate text set.
In this embodiment, a mixed dictionary tree is built in advance; the pinyin sequence of the text to be corrected is obtained, the mixed dictionary tree is searched to obtain the candidate text set matching that pinyin sequence, and the error correction result of the text to be corrected is determined according to the error correction model and the candidate text set. In this way, text mixing Chinese, English, and pinyin can be corrected well, improving the coverage and applicability of text error correction.
Fig. 2 is a schematic diagram of the main steps of a text error correction method according to another embodiment of the present invention. As shown in Fig. 2, the text error correction method of this embodiment comprises:
Step S201: obtain the pinyin sequence of the text to be corrected.
Specifically, step S201 comprises: if the text to be corrected consists of Chinese characters, taking the pinyin of the Chinese characters as the pinyin sequence of the text to be corrected; if the text to be corrected consists of non-Chinese characters, taking the non-Chinese characters themselves as the pinyin sequence of the text to be corrected; if the text to be corrected consists of both Chinese characters and non-Chinese characters, taking the whole formed by the pinyin of the Chinese characters and the non-Chinese characters themselves as the pinyin sequence of the text to be corrected; wherein the non-Chinese characters include digits, English words, and/or pinyin.
Step S202: search the mixed dictionary tree based on a forward maximum matching algorithm and a reverse maximum matching algorithm, and determine the candidate text set matching the pinyin sequence of the text to be corrected according to the forward maximum matching result and the reverse maximum matching result.
Specifically, in the forward maximum matching algorithm and the reverse maximum matching algorithm of this embodiment, the pinyin sequence of the text to be corrected is first segmented, and the mixed dictionary tree is then searched with the resulting pinyin sequence fragments to obtain the forward maximum matching result and the reverse maximum matching result. The candidate text set matching the pinyin sequence is then determined from these two results. The candidate text set is the set of all candidate texts. The forward maximum matching result or the reverse maximum matching result includes at least one candidate text fragment. When a matching result contains only one candidate text fragment, that fragment is itself a candidate text matching the pinyin sequence of the text to be corrected. When a matching result contains multiple candidate text fragments, the fragments can be spliced together to obtain candidate texts.
By way of example, suppose the text to be corrected is the mistyped phrase glossed as "when-female sports shoes", whose pinyin sequence is "nv shi yun dong xie". In the forward maximum matching algorithm:
1) The mixed dictionary tree is first searched with "nv shi yun dong xie". If the pinyin sequence "nv shi yun dong xie" exists in the mixed dictionary tree, the match succeeds, and the words corresponding to "nv shi yun dong xie" in the mixed dictionary tree are taken as candidate texts, that is, as the forward maximum matching result.
2) If the pinyin sequence "nv shi yun dong xie" does not exist in the mixed dictionary tree, the match advances by one character length, dropping the last syllable; that is, the mixed dictionary tree is searched with the pinyin sequence fragment "nv shi yun dong". If this fragment exists in the mixed dictionary tree, the match succeeds, and the words corresponding to "nv shi yun dong" are taken as candidate text fragments; the mixed dictionary tree is then searched with the remaining fragment "xie". If the fragment "xie" exists in the mixed dictionary tree, the match succeeds, and the words corresponding to "xie" are taken as candidate text fragments. The forward maximum matching result then includes the candidate text fragments corresponding to "nv shi yun dong" and the candidate text fragments corresponding to "xie".
3) If the pinyin sequence fragment "nv shi yun dong" does not exist in the mixed dictionary tree, the step "advance by one character length and search the mixed dictionary tree with the new pinyin sequence fragment" is executed iteratively until the forward maximum matching result is obtained.
By way of example, suppose the text to be corrected is the mistyped phrase glossed as "when-female sports shoes", whose pinyin sequence is "nv shi yun dong xie". In the reverse maximum matching algorithm:
1) The mixed dictionary tree is first searched with "nv shi yun dong xie". If the pinyin sequence "nv shi yun dong xie" exists in the mixed dictionary tree, the match succeeds, and the words corresponding to "nv shi yun dong xie" in the mixed dictionary tree are taken as candidate texts, that is, as the reverse maximum matching result.
2) If the pinyin sequence "nv shi yun dong xie" does not exist in the mixed dictionary tree, the match moves back by one character length, dropping the first syllable; that is, the mixed dictionary tree is searched with the pinyin sequence fragment "shi yun dong xie". If this fragment exists in the mixed dictionary tree, the match succeeds, and the words corresponding to "shi yun dong xie" are taken as candidate text fragments; the mixed dictionary tree is then searched with the remaining fragment "nv". If the fragment "nv" exists in the mixed dictionary tree, the match succeeds, and the words corresponding to "nv" are taken as candidate text fragments. The reverse maximum matching result then includes the candidate text fragments corresponding to "shi yun dong xie" and the candidate text fragments corresponding to "nv".
3) If the pinyin sequence fragment "shi yun dong xie" does not exist in the mixed dictionary tree, the step "move back by one character length and search the mixed dictionary tree with the new pinyin sequence fragment" is executed iteratively until the reverse maximum matching result is obtained.
By way of example, suppose the text to be corrected is the mistyped phrase glossed as "when-female sports shoes", whose pinyin sequence is "nv shi yun dong xie". Suppose the forward maximum matching algorithm segments the pinyin sequence into "nv shi" and "yun dong xie", and that in the mixed dictionary tree the words corresponding to "nv shi" are "ladies'" and "when female" and the word corresponding to "yun dong xie" is "sports shoes". The candidate text fragments are then "ladies'", "when female", and "sports shoes", so the candidate texts obtained by forward maximum matching are "ladies' sports shoes" and "when-female sports shoes". Suppose the reverse maximum matching algorithm segments the pinyin sequence into "xie" and "nv shi yun dong", and that in the mixed dictionary tree the word corresponding to "nv shi yun dong" is "ladies' sports" and the words corresponding to "xie" are "shoes" and "tool". The candidate text fragments are then "ladies' sports", "shoes", and "tool", so the candidate texts obtained by reverse maximum matching are "ladies' sports shoes" and "ladies' sports tool". The candidate texts matching the pinyin sequence of the text to be corrected, obtained from the forward and reverse maximum matching results together, are therefore "ladies' sports shoes", "when-female sports shoes", and "ladies' sports tool".
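The two matching passes and the merging of their results can be sketched in Python as follows. This is a hedged illustration rather than the patented implementation: a plain dictionary `LEXICON` (hypothetical) stands in for the mixed dictionary tree, the function names are invented for this sketch, and the Chinese words are assumed reconstructions of the glossed fragments.

```python
from itertools import product

# Hypothetical toy lexicon standing in for the mixed dictionary tree.
LEXICON = {
    "nv shi": ["女士", "女时"],        # "ladies'", "when female"
    "yun dong xie": ["运动鞋"],        # "sports shoes"
    "nv shi yun dong": ["女士运动"],   # "ladies' sports"
    "xie": ["鞋", "械"],               # "shoes", "tool"
}

def forward_max_match(syllables):
    """Left to right: try the longest fragment first, dropping syllables from the end."""
    fragments, start = [], 0
    while start < len(syllables):
        for end in range(len(syllables), start, -1):
            key = " ".join(syllables[start:end])
            if key in LEXICON:
                break
        else:  # nothing matched: keep the single unknown syllable as its own fragment
            key, end = syllables[start], start + 1
        fragments.append(key)
        start = end
    return fragments

def reverse_max_match(syllables):
    """Right to left: try the longest fragment first, dropping syllables from the front."""
    fragments, end = [], len(syllables)
    while end > 0:
        for start in range(0, end):
            key = " ".join(syllables[start:end])
            if key in LEXICON:
                break
        else:  # nothing matched: keep the single unknown syllable as its own fragment
            key, start = syllables[end - 1], end - 1
        fragments.insert(0, key)
        end = start
    return fragments

def splice(fragments):
    """Splice every combination of the words stored for each fragment."""
    options = [LEXICON.get(f, [f]) for f in fragments]
    return ["".join(parts) for parts in product(*options)]

def candidate_set(pinyin):
    """Merge forward and reverse candidates, dropping duplicates but keeping order."""
    syllables = pinyin.split()
    merged = []
    for text in splice(forward_max_match(syllables)) + splice(reverse_max_match(syllables)):
        if text not in merged:
            merged.append(text)
    return merged

print(candidate_set("nv shi yun dong xie"))
```

With all four fragments in one lexicon, the forward pass of this sketch yields the "nv shi yun dong" + "xie" split and the reverse pass yields the "nv shi" + "yun dong xie" split, which is the opposite attribution to the worked example above; the merged candidate set is the same three texts either way.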
In this embodiment of the present invention, cutting and matching the pinyin sequence of the text to be corrected with both the forward maximum matching algorithm and the reverse maximum matching algorithm not only speeds up error correction of the text to be corrected (long-tail words in particular), keeping error correction timely, but also improves the accuracy and coverage of text error correction.
Step S203: calculate the evaluation factors of each candidate text in the candidate text set based on multiple error correction models.
The multiple error correction models may include at least two of the following: a noisy channel error correction model, an edit distance error correction model, and a pinyin distance error correction model.
For example, in one optional embodiment, the multiple error correction models consist of the noisy channel error correction model and the edit distance error correction model. The evaluation factor of the candidate text obtained from the noisy channel error correction model is the noisy channel probability of the candidate text, and the evaluation factor obtained from the edit distance error correction model is the edit distance of the candidate text.
In another optional embodiment, the multiple error correction models consist of the noisy channel error correction model, the edit distance error correction model and the pinyin distance error correction model. The evaluation factor of the candidate text obtained from the noisy channel error correction model is the noisy channel probability of the candidate text; the evaluation factor obtained from the edit distance error correction model is the edit distance of the candidate text; and the evaluation factor obtained from the pinyin distance error correction model is the pinyin distance of the candidate text.
Step S204: merge the multiple evaluation factors to obtain the assessed value of the candidate text.
Step S205: determine the error correction result of the text to be corrected according to the assessed value.
For example, the candidate text with the largest assessed value may be taken as the error correction result of the text to be corrected. Alternatively, the one or more candidate texts whose assessed value exceeds a preset threshold may be taken as the error correction result.
In this embodiment of the present invention, the multiple evaluation factors are calculated in parallel by the multiple error correction models, the factors are merged into the assessed value of each candidate text, and the error correction result of the text to be corrected is determined from the assessed value. This not only improves the accuracy of query error correction but also increases the processing speed of the text error correction method, keeping it timely. Through steps S201 to S205, error correction of text that mixes Chinese, English and pinyin is handled well, which improves the coverage and applicability of text error correction.
Fig. 3 is a schematic diagram of the main steps of a text error correction method according to another embodiment of the present invention. As shown in Fig. 3, the text error correction method of this embodiment includes:
Step S301: clean the source data to obtain the training sample words.
For example, the source data may include user search log data, commodity title data, and the like. In one optional embodiment, the user search log data may be cleaned as follows:
1) Calculate the confidence of each search term, and filter out the search terms whose confidence is below a preset threshold.
For example, the pv (number of searches), ctr (click volume) and gmv (gross turnover) of a search term may be counted first, and the confidence of the search term then calculated from these three indicators as follows:
Confidence = a*pv + b*ctr + c*gmv
where Confidence denotes the confidence of the search term, a, b and c are preset constant coefficients, pv denotes the number of searches, ctr denotes the click volume, and gmv denotes the gross turnover.
Further, in this example, different preset thresholds may be set for search terms that contain no Chinese characters and for search terms that contain Chinese characters. For example, the preset threshold may be set to 500 for search terms without Chinese characters and to 10 for search terms with Chinese characters.
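A minimal sketch of this confidence filter is given below. The coefficient values (all 1.0), the function names, and the use of the CJK Unified Ideographs range to detect Chinese characters are assumptions made for illustration; only the formula and the two example thresholds come from the description.

```python
import re

HAN = re.compile(r"[\u4e00-\u9fff]")  # assumed test for "contains a Chinese character"

def confidence(pv, ctr, gmv, a=1.0, b=1.0, c=1.0):
    """Confidence = a*pv + b*ctr + c*gmv; a, b, c are placeholder coefficients."""
    return a * pv + b * ctr + c * gmv

def keep_search_term(term, pv, ctr, gmv,
                     han_threshold=10, non_han_threshold=500):
    """Keep a term unless its confidence is below the threshold for its kind."""
    threshold = han_threshold if HAN.search(term) else non_han_threshold
    return confidence(pv, ctr, gmv) >= threshold
```

With these placeholder coefficients, a term containing Chinese characters with pv=5, ctr=3, gmv=4 has confidence 12 and is kept, whereas "iphone" with the same counts falls below 500 and is filtered out.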
2) Segment the search terms in the user search log data; retain pure-Chinese words whose length is less than or equal to a first length threshold (for example, 5), and retain non-pure-Chinese words whose length lies between a second length threshold (for example, 2) and a third length threshold (for example, 10). The first, second and third length thresholds may be set flexibly as required, provided the implementation of the present invention is not affected.
3) Filter out, based on a dictionary, the search terms that include pinyin.
4) Filter out the search terms that consist purely of digits, and the search terms that include special characters.
Through the above steps, wrongly written words, long-tail words and the like entered by users are filtered out as far as possible, which reduces the noise in the training sample words.
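Rules 2) to 4) can be sketched as a single predicate. Everything concrete here is an assumption for illustration: the words are taken to be already segmented, the length bounds are treated as inclusive, the pinyin dictionary is a four-entry placeholder, and any character outside Chinese characters, ASCII letters, digits and spaces is counted as a special character.

```python
import re

PURE_HAN = re.compile(r"[\u4e00-\u9fff]+")
PURE_DIGITS = re.compile(r"[0-9]+")
ALLOWED = re.compile(r"[\u4e00-\u9fffA-Za-z0-9 ]+")   # anything else is "special"
PINYIN_SYLLABLES = {"shou", "ji", "hua", "wei"}       # placeholder pinyin dictionary

def keep_word(word, max_han_len=5, min_mixed_len=2, max_mixed_len=10):
    # Rule 4: drop pure digits and anything containing special characters.
    if PURE_DIGITS.fullmatch(word) or not ALLOWED.fullmatch(word):
        return False
    # Rule 3: drop words that contain a known pinyin syllable.
    if any(token in PINYIN_SYLLABLES for token in word.lower().split()):
        return False
    # Rule 2: length limits differ for pure-Chinese and other words.
    if PURE_HAN.fullmatch(word):
        return len(word) <= max_han_len
    return min_mixed_len <= len(word) <= max_mixed_len
```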
Further, in this optional embodiment, the commodity title data may be cleaned as follows: new words in the commodity title data are mined with a new-word discovery algorithm based on left and right entropy, and the mined new words are then filtered by a set of rules (for example, removing commodity titles that consist purely of digits).
Step S302: obtain the pinyin sequences of the training sample words, and construct the mixed dictionary tree from the pinyin sequences of the training sample words.
Specifically, this step includes: obtaining the pinyin sequence of each word (i.e. each training sample word) in the cleaned data, and then placing the characters of the pinyin sequence, one per node from top to bottom, into child nodes under the root node. In addition, all training sample words that share a pinyin sequence are placed in the child node that holds the last character of that pinyin sequence.
For example, suppose the pinyin sequence of the training sample words is "hua wei" and the training sample words corresponding to this pinyin sequence are "Huawei" and "dividing into". The root node is then set to empty, and "h", "u", "a", "w", "e" and "i" are placed, in that order from top to bottom, into child nodes under the root node; "Huawei" and "dividing into" are placed in the child node that holds "i". In this embodiment of the present invention, constructing the mixed dictionary tree provides good support for query error correction of mixed Chinese, English and pinyin input.
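The construction described above corresponds to a character-level trie whose nodes may carry a word list. The sketch below is illustrative only; the class and method names are assumptions, and spaces between syllables are dropped so that the path for "hua wei" is h, u, a, w, e, i as in the example.

```python
class Node:
    def __init__(self):
        self.children = {}   # next character -> Node
        self.words = []      # words whose pinyin sequence ends at this node

class MixedDictionaryTree:
    def __init__(self):
        self.root = Node()   # the root node holds no character

    def insert(self, pinyin, word):
        node = self.root
        for ch in pinyin.replace(" ", ""):
            node = node.children.setdefault(ch, Node())
        if word not in node.words:
            node.words.append(word)

    def lookup(self, pinyin):
        node = self.root
        for ch in pinyin.replace(" ", ""):
            node = node.children.get(ch)
            if node is None:
                return []
        return list(node.words)

tree = MixedDictionaryTree()
tree.insert("hua wei", "Huawei")
tree.insert("hua wei", "dividing into")
tree.insert("iphone", "iphone")   # an English word is keyed by its own spelling
```

Because an English word is stored under its own letters, Chinese, English and pinyin input can all be looked up through the same `lookup` call.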
Step S303: obtain the pinyin sequence of the text to be corrected.
For how this step is implemented, refer to the description of step S201 in the embodiment shown in Fig. 2.
Step S304: search the mixed dictionary tree based on the forward maximum matching algorithm and the reverse maximum matching algorithm, and determine, from the forward maximum matching result and the reverse maximum matching result, the candidate text set matching the pinyin sequence of the text to be corrected.
The candidate text set is the set of all candidate texts. The forward maximum matching result or the reverse maximum matching result includes at least one candidate text fragment. Specifically, when a matching result includes only one candidate text fragment, that fragment is itself a candidate text matching the pinyin sequence of the text to be corrected. When a matching result includes multiple candidate text fragments, the fragments may be spliced to obtain a candidate text. For how this step is implemented, refer to the description of step S202 in the embodiment shown in Fig. 2.
Further, to improve the accuracy and coverage of text error correction, the text error correction method of this embodiment may also include the following steps: performing an edit operation on the pinyin sequence of the candidate text fragments obtained in step S304; searching the mixed dictionary tree with the edited pinyin sequence to obtain newly added candidate text fragments matching the edited pinyin sequence; and constructing, from the candidate text fragments and the newly added candidate text fragments, the candidate text set matching the pinyin sequence of the text to be corrected.
For example, when the text to be corrected is "sport footwear when female", the candidate text fragments obtained from the forward maximum matching result are "Ms", "when female" and "sport footwear", the candidate text fragments obtained from the reverse maximum matching result are "Ms's movement", "shoes" and "tool", and the newly added candidate text fragment obtained through the edit operation is "Lv Shi". The following candidate texts are then obtained: "Ms's sport footwear", "sport footwear when female", "Lv Shi sport footwear" and "Ms moves tool".
Specifically, performing the edit operation on the pinyin sequence of a candidate text fragment includes:
Step A: when the candidate text fragment includes Chinese characters, perform a fuzzy-pronunciation edit operation on the pinyin of the Chinese characters.
The fuzzy-pronunciation edit operations may include: front and back nasal edits, such as conversion between an and ang, ian and iang, uan and uang, en and eng, uen and ueng, and in and ing; flat-tongue and retroflex edits, such as conversion between z and zh, c and ch, and s and sh; and north-south pronunciation conversions, such as conversion between n and l, b and p, h and f, u and v, i and u, and i and v. For example, if the candidate text fragment is "Lv Shi", performing the edit operation on its pinyin "lv shi" yields the edited pinyin sequence "nv shi".
Step B: when the candidate text fragment includes English words, perform insertion, replacement, exchange and/or deletion edit operations on the English words.
In this embodiment of the present invention, steps A and B implement the edit operation on the pinyin sequence of the candidate text fragments. Obtaining newly added candidate text fragments that match the edited pinyin sequence, and constructing candidate texts matching the pinyin sequence of the text to be corrected from both the candidate text fragments and the newly added candidate text fragments, increases the number of candidate texts and improves the coverage of text error correction.
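Step A can be sketched as generating fuzzy variants of each pinyin syllable. The sketch below covers only the initial-consonant and nasal-final pairs listed above (the u/v, i/u, i/v and uen/ueng conversions and the English edits of step B are left out), edits one syllable at a time, and does not check that a variant is a valid syllable; all three choices are assumptions. An invalid variant is harmless in practice because it simply finds no match in the mixed dictionary tree.

```python
INITIAL_PAIRS = [("z", "zh"), ("c", "ch"), ("s", "sh"),
                 ("n", "l"), ("b", "p"), ("h", "f")]
FINAL_PAIRS = [("an", "ang"), ("ian", "iang"), ("uan", "uang"),
               ("en", "eng"), ("in", "ing")]

def _swap_table(pairs):
    table = {}
    for a, b in pairs:
        table[a] = b
        table[b] = a
    return table

INITIAL_SWAPS = _swap_table(INITIAL_PAIRS)
FINAL_SWAPS = _swap_table(FINAL_PAIRS)

def fuzzy_variants(syllable):
    """Variants of one pinyin syllable with its initial or its final swapped."""
    variants = set()
    # Longest first, so "zh" is tried before "z" and "iang" before "ang".
    for initial in sorted(INITIAL_SWAPS, key=len, reverse=True):
        if syllable.startswith(initial):
            variants.add(INITIAL_SWAPS[initial] + syllable[len(initial):])
            break
    for final in sorted(FINAL_SWAPS, key=len, reverse=True):
        if syllable.endswith(final):
            variants.add(syllable[:-len(final)] + FINAL_SWAPS[final])
            break
    return variants

def fuzzy_sequences(pinyin):
    """Edited pinyin sequences obtained by changing exactly one syllable."""
    syllables = pinyin.split()
    edited = set()
    for i, syllable in enumerate(syllables):
        for variant in fuzzy_variants(syllable):
            edited.add(" ".join(syllables[:i] + [variant] + syllables[i + 1:]))
    return edited
```

For the example in the text, `fuzzy_sequences("lv shi")` contains "nv shi", which can then be looked up in the mixed dictionary tree to find newly added candidate text fragments.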
Step S305: calculate the noisy channel probability of the candidate text based on the noisy channel error correction model, and take it as the first evaluation factor of the candidate text.
Specifically, the noisy channel probability of the candidate text may be calculated according to the following formula:
P = P(q|c) * P(c)
where P is the noisy channel probability of the candidate text, q denotes the text to be corrected, c denotes the candidate text, P(q|c) denotes the transition probability between the candidate text and the text to be corrected, and P(c) denotes the prior probability of the candidate text.
Further, P(q|c) and P(c) may be calculated according to the following formula:
where freq(c) denotes the number of occurrences of the candidate text c in the training corpus, freq(q, c) denotes the number of times the text to be corrected and the candidate text occur together in the training corpus, and |C| denotes the total number of words in the training corpus.
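The formula itself is not reproduced in the text above. Given the three quantities that are defined, relative-frequency estimates are a natural reading; the following is offered as an assumption on that basis, not as the formula as filed:

```latex
P(q \mid c) = \frac{\mathrm{freq}(q, c)}{\mathrm{freq}(c)}, \qquad
P(c) = \frac{\mathrm{freq}(c)}{|C|}
```

Under this reading the product P(q|c) * P(c) reduces to freq(q, c) / |C|, so the first evaluation factor would grow with how often the text to be corrected and the candidate text are observed together.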
Step S306: calculate the edit distance of the candidate text based on the edit distance error correction model, and determine the second evaluation factor of the candidate text from the edit distance.
Specifically, the edit distance of the candidate text is the minimum number of edit operations needed to turn the text to be corrected into the candidate text, where an edit operation may be an insertion, a replacement, an exchange or a deletion. For example, if the text to be corrected is "by machine" and the candidate text is "mobile phone", the edit distance of the candidate text is 1. Likewise, if the text to be corrected is "iphoe" and the candidate text is "iphone", the edit distance of the candidate text is 1.
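An edit distance that counts insertion, replacement, exchange and deletion can be computed with the optimal string alignment variant of the Damerau-Levenshtein distance. The description does not name a specific algorithm, so this choice, and the reading of "exchange" as swapping two adjacent characters, are assumptions.

```python
def edit_distance(source, target):
    """Minimum number of insertions, deletions, replacements and
    adjacent-character exchanges turning source into target."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete every character of source
    for j in range(cols):
        d[0][j] = j                      # insert every character of target
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # replacement or match
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # exchange
    return d[-1][-1]
```

For the example above, `edit_distance("iphoe", "iphone")` is 1 (one insertion); a swapped pair such as "ipohne" also comes out as 1 because the exchange is counted as a single operation.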
Optionally, the second evaluation factor of the candidate text satisfies:
where μ_edit denotes the second evaluation factor of the candidate text, d_edit denotes the edit distance of the candidate text, max{L1, L2} denotes the larger of the word lengths of the text to be corrected and the candidate text, L1 denotes the word length of the text to be corrected, and L2 denotes the word length of the candidate text.
Step S307: calculate the pinyin distance of the candidate text based on the pinyin distance error correction model, and determine the third evaluation factor of the candidate text from the pinyin distance.
Specifically, the pinyin distance of the candidate text may be calculated as follows: the characters of the text to be corrected and of the candidate text are compared one by one to check whether their pinyin letters are the same and whether their tones are the same; the pinyin distance of each character is determined from the comparison result, and the sum of the pinyin distances of all characters is taken as the pinyin distance of the candidate text. When the text to be corrected and the candidate text include non-Chinese parts (such as English words or digits), identical characters at the same position in the non-Chinese part are regarded as having the same pinyin letters and the same tone, and different characters at the same position are regarded as having different pinyin letters and different tones.
For example, suppose the text to be corrected is "by machine" and the candidate text is "mobile phone", and the two are compared character by character. The pinyin letters of "by" and "hand" are both "shou", but their tones differ, so the pinyin distance of the first character is 1 (same pinyin letters) + 0 (different tones) = 1. The pinyin letters of "machine" and "machine" are both "ji" and their tones are the same, so the pinyin distance of the second character is 1 (same pinyin letters) + 1 (same tone) = 2. The pinyin distance of the candidate text "mobile phone" is therefore 3.
For example, suppose the text to be corrected is "ipd" and the candidate text is "ipad", and the two are compared character by character. The first character "i" of the text to be corrected and of the candidate text has the same pinyin letters and the same tone, so the pinyin distance of the first character is 2. The second character "p" of the text to be corrected and of the candidate text has the same pinyin letters and the same tone, so the pinyin distance of the second character is 2. The third character "d" of the text to be corrected and the third character "a" of the candidate text have different pinyin letters and different tones, so the pinyin distance of the third character is 0. The fourth character of the text to be corrected is empty and the fourth character of the candidate text is "d"; their pinyin letters and tones both differ, so the pinyin distance of the fourth character is 0. The pinyin distance of the candidate text "ipad" is therefore 4.
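The character-by-character comparison can be sketched as follows. The input representation is an assumption: each Chinese character is given as a (pinyin letters, tone number) pair and each non-Chinese character as a plain one-character string. The tone numbers used for the first example (4 versus 3) are likewise assumed; the text only states that the two tones differ.

```python
from itertools import zip_longest

def unit_distance(a, b):
    """Score one position: 1 for same pinyin letters plus 1 for same tone."""
    if a is None or b is None:
        return 0                          # one text has no character here
    if isinstance(a, str) or isinstance(b, str):
        return 2 if a == b else 0         # non-Chinese: all or nothing
    (letters_a, tone_a), (letters_b, tone_b) = a, b
    return int(letters_a == letters_b) + int(tone_a == tone_b)

def pinyin_distance(query_units, candidate_units):
    """Sum of the per-position scores over both texts."""
    return sum(unit_distance(a, b)
               for a, b in zip_longest(query_units, candidate_units))

# The two worked examples from the text.
first = pinyin_distance([("shou", 4), ("ji", 1)], [("shou", 3), ("ji", 1)])
second = pinyin_distance(list("ipd"), list("ipad"))
```

The two calls give 3 and 4, matching the worked examples. As defined, a larger value means the candidate sounds closer to the text to be corrected.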
Optionally, the third evaluation factor of the candidate text satisfies:
where ν_pinyin denotes the third evaluation factor of the candidate text, d_pinyin denotes the pinyin distance of the candidate text, max{L1, L2} denotes the larger of the word lengths of the text to be corrected and the candidate text, L1 denotes the word length of the text to be corrected, and L2 denotes the word length of the candidate text.
Step S308: merge the first, second and third evaluation factors to obtain the assessed value of the candidate text.
Optionally, the first, second and third evaluation factors may be merged according to the following formula:
Score = a1*P + b1*μ_edit + c1*ν_pinyin
where Score denotes the assessed value of the candidate text, a1, b1 and c1 are preset constant coefficients, P is the first evaluation factor, μ_edit is the second evaluation factor, and ν_pinyin is the third evaluation factor.
Step S309: take the candidate text with the largest assessed value as the error correction result of the text to be corrected.
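Steps S308 and S309 amount to a weighted sum followed by taking the maximum. In the sketch below the coefficient defaults and the factor values are invented purely for illustration; the description only says that a1, b1 and c1 are preset constants.

```python
def fuse(p, mu_edit, nu_pinyin, a1=1.0, b1=1.0, c1=1.0):
    """Score = a1*P + b1*mu_edit + c1*nu_pinyin."""
    return a1 * p + b1 * mu_edit + c1 * nu_pinyin

def best_candidate(factors, **weights):
    """factors maps candidate text -> (P, mu_edit, nu_pinyin)."""
    return max(factors, key=lambda text: fuse(*factors[text], **weights))

# Hypothetical factor values for the three candidates of the running example.
factors = {
    "Ms's sport footwear": (0.02, 0.8, 0.9),
    "sport footwear when female": (0.001, 1.0, 1.0),
    "Ms moves tool": (0.0001, 0.8, 0.7),
}
```

With equal weights the unchanged input "sport footwear when female" scores highest here, whereas `best_candidate(factors, a1=100.0)` selects "Ms's sport footwear". The example is meant only to show that the choice of coefficients decides how much the corpus-based probability can outweigh surface similarity.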
In this embodiment of the present invention, the first, second and third evaluation factors are calculated by the noisy channel error correction model, the edit distance error correction model and the pinyin distance error correction model respectively, and are merged to obtain the assessed value of the candidate text, which further improves the accuracy of text error correction.
The structure of the mixed dictionary tree of this embodiment of the present invention is described below with reference to Fig. 4. As shown in Fig. 4, the mixed dictionary tree includes the correspondence between pinyin and Chinese words and English words. Specifically, the mixed dictionary tree includes multiple paths, each consisting of the root node and child nodes under the root node. The root node is empty, and each child node under the root node holds one character. In addition, certain child nodes also store the Chinese words or English words corresponding to a specific pinyin (the pinyin formed by the characters on the path from the root node to that child node).
For example, one path in Fig. 4 includes, from top to bottom: the root node, the child node holding "h", the child node holding "u", the child node holding "a", the child node holding "w", the child node holding "e", and the child node holding "i". The child node holding "u" also stores the words corresponding to the pinyin sequence "hu", such as "Hu" and "tiger"; the child node holding "a" also stores the words corresponding to the pinyin sequence "hua", such as "China" and "flower"; and the child node holding "i" also stores the words corresponding to the pinyin sequence "hua wei", such as "Huawei" and "dividing into".
In addition, the present invention also provides a search method. The search method of this embodiment of the present invention includes:
Step 1: receive input text.
Step 2: when the input text is determined to be text to be corrected, obtain the pinyin sequence of the input text.
In this step, all input text may be treated as text to be corrected, or only part of the input text may be. For example, a list of common erroneous texts may be set in advance, and input text entered by a user is determined to be text to be corrected when it appears in that list.
Step 3: search the mixed dictionary tree to obtain the candidate text set matching the pinyin sequence of the input text. The mixed dictionary tree includes the correspondence between pinyin and Chinese words and English words.
Step 4: determine the error correction result of the input text according to the error correction model and the candidate text set.
Step 5: obtain search results according to the error correction result of the input text, and send the search results.
In a specific implementation, the search results may be sent to a user terminal, which displays them.
In this embodiment of the present invention, the above steps provide good support for error correction of Chinese search terms, English search terms, pinyin search terms, and search terms that arbitrarily mix Chinese, English and pinyin. This improves the coverage and applicability of search term error correction, so that the user's search intent is better understood and the user experience is improved.
In addition, the present invention also provides a search error correction method. The search error correction method of this embodiment of the present invention includes:
Step 1: receive input text.
Step 2: when the input text is determined to be text to be corrected, obtain the pinyin sequence of the input text.
In this step, all input text may be treated as text to be corrected, or only part of the input text may be. For example, a list of common erroneous texts may be set in advance, and input text entered by a user is determined to be text to be corrected when it appears in that list.
Step 3: search the mixed dictionary tree to obtain the candidate text set matching the pinyin sequence of the input text. The mixed dictionary tree includes the correspondence between pinyin and Chinese words and English words.
Step 4: determine the error correction result of the input text according to the error correction model and the candidate text set.
Step 5: sort the error correction results of the input text, and send the sorted error correction results.
In a specific implementation, when multiple error correction results are obtained, they may be sorted by the assessed value obtained in the text error correction method, and the sorted error correction results sent to the user terminal. After receiving the sorted error correction results, the user terminal may display them to the user as prompts.
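The ranking in step 5 can be sketched as a filter followed by a sort on the assessed value. The function name, the threshold and the scores below are hypothetical and serve only to illustrate the ordering.

```python
def rank_results(scored, threshold):
    """Keep results whose assessed value exceeds the threshold, best first."""
    kept = [(text, score) for text, score in scored.items() if score > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [text for text, _ in kept]

# Hypothetical assessed values for three candidates.
scored = {
    "Ms's sport footwear": 0.9,
    "sport footwear when female": 0.4,
    "Ms moves tool": 0.6,
}
ranked = rank_results(scored, threshold=0.5)
```

Here `ranked` lists "Ms's sport footwear" before "Ms moves tool", which is the order in which a user terminal would show the prompts.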
In this embodiment of the present invention, the above steps provide good support for error correction of Chinese search terms, English search terms, pinyin search terms, and search terms that arbitrarily mix Chinese, English and pinyin. This improves the coverage and applicability of search term error correction, so that the user's search intent is better understood and the user experience is improved.
Fig. 5 is a schematic diagram of the main modules of a text error correction device according to one embodiment of the present invention. As shown in Fig. 5, the text error correction device 500 of this embodiment includes an acquisition module 501, a search module 502 and a determination module 503.
The acquisition module 501 is configured to obtain the pinyin sequence of the text to be corrected.
Specifically, the acquisition module 501 obtains the pinyin sequence of the text to be corrected as follows: if the text to be corrected consists of Chinese characters, the acquisition module 501 takes the pinyin of the Chinese characters as the pinyin sequence of the text to be corrected; if the text to be corrected consists of non-Chinese characters, the acquisition module 501 takes the non-Chinese characters themselves as the pinyin sequence of the text to be corrected; if the text to be corrected consists of Chinese characters and non-Chinese characters, the acquisition module 501 takes the whole formed by the pinyin of the Chinese characters and the non-Chinese characters themselves as the pinyin sequence of the text to be corrected. The non-Chinese characters include digits, English words and/or pinyin.
The search module 502 is configured to search the mixed dictionary tree to obtain the candidate text set matching the pinyin sequence of the text to be corrected.
The mixed dictionary tree includes the correspondence between pinyin and Chinese words and English words. In the mixed dictionary tree, each node stores one character. In addition, the node storing the last character of a pinyin sequence also stores all the words corresponding to that pinyin sequence, each of which may be a Chinese word or an English word.
For example, suppose the text to be corrected is "dividing mobile phone into" and its pinyin sequence is "hua wei shou ji". The candidate text set obtained by the search module 502 then includes the following candidate texts: "Huawei's mobile phone", "dividing mobile phone into", "Huawei's collection" and "dividing collection into".
The determination module 503 is configured to determine the error correction result of the text to be corrected according to the error correction model and the candidate text set.
In this embodiment of the present invention, the mixed dictionary tree is constructed in advance; the acquisition module obtains the pinyin sequence of the text to be corrected; the search module searches the mixed dictionary tree to obtain the candidate text set matching that pinyin sequence; and the determination module determines the error correction result of the text to be corrected according to the error correction model and the candidate text set. Text error correction for mixed Chinese, English and pinyin is thus handled well, which improves the coverage and applicability of text error correction.
Fig. 6 is a schematic diagram of the main modules of a text error correction device according to another embodiment of the present invention. As shown in Fig. 6, the text error correction device 600 of this embodiment includes a cleaning module 601, a construction module 602, an acquisition module 603, a search module 604 and a determination module 605.
The cleaning module 601 is configured to clean the source data to obtain the training sample words.
For example, the source data may include user search log data, commodity title data, and the like. For how the cleaning module 601 cleans the source data, refer to the description of data cleaning in the embodiment shown in Fig. 3.
The construction module 602 is configured to obtain the pinyin sequences of the training sample words and to construct the mixed dictionary tree from them. Specifically, the construction module 602 obtains the pinyin sequence of each word (i.e. each training sample word) in the cleaned data, and then places the characters of the pinyin sequence, one per node from top to bottom, into child nodes under the root node. In addition, the construction module 602 places all training sample words that share a pinyin sequence in the child node that holds the last character of that pinyin sequence.
For example, suppose the pinyin sequence of the training sample words is "hua wei" and the training sample words corresponding to this pinyin sequence are "Huawei" and "dividing into". The root node is then set to empty, and "h", "u", "a", "w", "e" and "i" are placed, in that order from top to bottom, into child nodes under the root node; "Huawei" and "dividing into" are placed in the child node that holds "i". In this embodiment of the present invention, the mixed dictionary tree constructed by the construction module 602 provides good support for query error correction of mixed Chinese, English and pinyin input.
The acquisition module 603 is configured to obtain the pinyin sequence of the text to be corrected.
For how the acquisition module 603 obtains the pinyin sequence of the text to be corrected, refer to the description of the acquisition module 501 in the embodiment shown in Fig. 5.
The search module 604 is configured to search the mixed dictionary tree based on the forward maximum matching algorithm and the reverse maximum matching algorithm, and to determine, from the forward maximum matching result and the reverse maximum matching result, the candidate text set matching the pinyin sequence of the text to be corrected.
Specifically, in both the forward maximum matching algorithm and the reverse maximum matching algorithm, the search module 604 first cuts the pinyin sequence of the text to be corrected and then searches the mixed dictionary tree with the resulting pinyin sequence fragments, to obtain the forward maximum matching result and the reverse maximum matching result. The search module 604 then determines the candidate texts matching the pinyin sequence from the forward maximum matching result and the reverse maximum matching result. The forward maximum matching result or the reverse maximum matching result includes at least one candidate text fragment. When a matching result includes one candidate text fragment, that fragment serves as a candidate text matching the pinyin sequence of the text to be corrected. When a matching result includes multiple candidate text fragments, the fragments may be spliced to obtain a candidate text.
For example, when the text to be corrected is "sport footwear when female", the candidate text fragments obtained from the forward maximum matching result are "Ms", "when female" and "sport footwear", and the candidate text fragments obtained from the reverse maximum matching result are "Ms's movement", "shoes" and "tool". Splicing then yields the following candidate texts: "Ms's sport footwear", "sport footwear when female" and "Ms moves tool".
In this embodiment of the present invention, the search module 604 cuts and matches the pinyin sequence of the text to be corrected with both the forward maximum matching algorithm and the reverse maximum matching algorithm. This not only speeds up error correction of the text to be corrected (long-tail words in particular), keeping text error correction timely, but also improves the accuracy and coverage of text error correction.
Further, in order to improve the accuracy rate and coverage rate of text error correction, text error correction device 600 may also include that editor Module.The editor module carries out edit operation for the pinyin sequence to candidate text fragments.Also, in the optional implementation Example in, searching module 604 be also used to according to edited pinyin sequence search mixing lexicographic tree, with obtain with it is described edited The matched newly-increased candidate text fragments of pinyin sequence;And searching module 604 is used for according to the candidate text fragments, increases newly Candidate text fragments building and the matched candidate text set of the pinyin sequence to corrected text.
For example, when corrected text be " sport footwear when female ", the candidate text fragments obtained based on Forward Maximum Method result For " Ms ", " when female " and " sport footwear ", the candidate text fragments obtained based on reversed maximum matching result be " Ms's movement ", " shoes " and " tool " are " Lv Shi " by the newly-increased candidate text fragments that edit operation obtains, then the candidate text set obtained includes Following candidate's text: " Ms's sport footwear ", " sport footwear when female ", " Lv Shi sport footwear " and " Ms moves tool ".
Specifically, editor module carries out edit operation to the pinyin sequence of candidate text fragments can include: in the candidate In the case that text fragments include Chinese character, editor module carries out the edit operation of fuzzy phoneme to the phonetic of the Chinese character;Described In the case that candidate text fragments include English words, editor module is inserted into the English words, is replaced, is exchanged and/or is deleted The edit operation removed.
In embodiments of the present invention, providing the editing module increases the number of candidate texts and thereby improves the coverage of text error correction.
Determining module 605 is used to calculate the assessed value of each candidate text and to determine the error correction result of the text to be corrected according to the assessed value. Specifically, determining module 605 calculates the evaluation factors of the candidate text based on multiple error correction models, fuses the multiple evaluation factors to obtain the assessed value of the candidate text, and determines the error correction result of the text to be corrected according to the assessed value.
The multiple error correction models may include at least two of the following: a noisy channel error correction model, an edit distance error correction model, and a pinyin distance error correction model.
In an optional example, the multiple error correction models include the noisy channel error correction model, the edit distance error correction model and the pinyin distance error correction model. In this example, calculating the evaluation factors of the candidate text based on the multiple error correction models comprises operation one, operation two and operation three.
Operation one: determining module 605 calculates the noisy channel probability of the candidate text based on the noisy channel error correction model, and uses it as the first evaluation factor of the candidate text.
Specifically, determining module 605 can calculate the noisy channel probability of the candidate text according to the following formula:
P = P(q|c) * P(c);
Wherein, P is the noisy channel probability of the candidate text, q denotes the text to be corrected, c denotes the candidate text, P(q|c) denotes the transition probability between the candidate text and the text to be corrected, and P(c) denotes the prior probability of the candidate text.
Further, P(q|c) and P(c) can be calculated according to the following formula:
Wherein, freq(c) denotes the number of occurrences of candidate text c in the training corpus, freq(q, c) denotes the number of times the text to be corrected and the candidate text occur together in the training corpus, and |C| denotes the total number of words in the training corpus.
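The formula itself is not reproduced in this text. One form consistent with the symbol definitions above, offered only as an assumption (these are the usual maximum-likelihood estimates for a noisy channel model), would be:

```latex
P(q \mid c) = \frac{\mathrm{freq}(q, c)}{\mathrm{freq}(c)},
\qquad
P(c) = \frac{\mathrm{freq}(c)}{|C|}
```

Under this reading, the product P(q|c) * P(c) simplifies to freq(q, c) / |C|, so a candidate scores higher the more often it co-occurs with the text to be corrected in the training corpus.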
Operation two: determining module 605 calculates the edit distance of the candidate text based on the edit distance error correction model, and determines the second evaluation factor of the candidate text according to the edit distance.
The edit distance of a candidate text is the minimum number of edit operations needed to change the text to be corrected into the candidate text, where an edit operation may be an insertion, replacement, exchange or deletion. For example, if the text to be corrected is "by machine" and the candidate text is "mobile phone", the edit distance of the candidate text is 1. Likewise, if the text to be corrected is "iphoe" and the candidate text is "iphone", the edit distance of the candidate text is 1.
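An edit distance that counts insertion, replacement, exchange and deletion can be computed with the standard dynamic-programming recurrence. The sketch below uses the optimal string alignment variant, in which an exchange swaps two adjacent characters; treating "exchange" this way is an assumption, since the text does not define it further.

```python
def edit_distance(source, target):
    """Minimum number of insertions, deletions, replacements and adjacent
    exchanges needed to turn `source` into `target`."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i          # delete every character of source
    for j in range(cols):
        dist[0][j] = j          # insert every character of target
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # replacement or match
            )
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)  # exchange
    return dist[-1][-1]
```

For the English example in the text, one insertion turns "iphoe" into "iphone", so the function returns 1.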
Optionally, determining module 605 can calculate the second evaluation factor of the candidate text according to the following formula:
Wherein, μ_edit denotes the second evaluation factor of the candidate text, d_edit denotes the edit distance of the candidate text, and max{L1, L2} denotes the larger of the word lengths of the text to be corrected and the candidate text, where L1 is the word length of the text to be corrected and L2 is the word length of the candidate text.
Operation three: determining module 605 calculates the pinyin distance of the candidate text based on the pinyin distance error correction model, and determines the third evaluation factor of the candidate text according to the pinyin distance.
Specifically, calculating the pinyin distance of the candidate text based on the pinyin distance error correction model may include: for the characters in the text to be corrected and the candidate text, determining module 605 compares, position by position, whether the letters making up their pinyin are identical and whether their tones are identical; determining module 605 then determines the pinyin distance of each character from the comparison result, and takes the sum of the pinyin distances of the characters as the pinyin distance of the candidate text.
Where the text to be corrected and the candidate text include non-Chinese-character parts (such as English words or digits), identical characters at the same position in the non-Chinese-character parts are treated as having identical pinyin letters and identical tones, and different characters at the same position are treated as having different pinyin letters and different tones.
For example, if the text to be corrected is "by machine" and the candidate text is "mobile phone", the pinyin distance of the first character is 1 (pinyin identical) + 0 (tone different) = 1, and the pinyin distance of the second character is 1 (pinyin identical) + 1 (tone identical) = 2. The pinyin distance of the candidate text "mobile phone" is therefore 3.
As another example, suppose the text to be corrected is "ipd" and the candidate text is "ipad". The first character "i" is the same in both, so its pinyin letters and tone are identical and its pinyin distance is 2. The second character "p" is also the same in both, so its pinyin distance is 2. The third character "d" of the text to be corrected differs from the third character "a" of the candidate text in both pinyin letters and tone, so the pinyin distance of the third character is 0. The text to be corrected has no fourth character, while the fourth character of the candidate text is "d"; the pinyin letters and tone both differ, so the pinyin distance of the fourth character is 0. The pinyin distance of the candidate text "ipad" is therefore 4.
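The position-by-position comparison can be sketched as below. Each character is represented as a hypothetical `(letters, tone)` pair; converting Chinese characters to such pairs is assumed to happen elsewhere. Note that a larger value means a closer match, although the text calls the quantity a distance.

```python
from itertools import zip_longest


def literal_units(text):
    """Non-Chinese characters stand for themselves as both 'letters' and 'tone',
    so equal characters score 2 and different characters score 0."""
    return [(ch, ch) for ch in text]


def pinyin_distance(query_units, candidate_units):
    """Sum, over aligned positions, 1 for identical pinyin letters plus 1 for
    an identical tone. A position missing on either side contributes 0."""
    total = 0
    for query, candidate in zip_longest(query_units, candidate_units):
        if query is None or candidate is None:
            continue
        total += int(query[0] == candidate[0]) + int(query[1] == candidate[1])
    return total


# English example from the text: "ipd" against "ipad".
english_score = pinyin_distance(literal_units("ipd"), literal_units("ipad"))

# Chinese example with assumed syllables: same letters throughout,
# tone differing on the first character only.
chinese_score = pinyin_distance([("shou", 4), ("ji", 1)], [("shou", 3), ("ji", 1)])
```

With these inputs the function reproduces the two worked examples: 4 for "ipd" against "ipad", and 3 for the two-character Chinese case.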
Optionally, determining module 605 can calculate the third evaluation factor of the candidate text according to the following formula:
Wherein, ν_pinyin denotes the third evaluation factor of the candidate text, d_pinyin denotes the pinyin distance of the candidate text, and max{L1, L2} denotes the larger of the word lengths of the text to be corrected and the candidate text, where L1 is the word length of the text to be corrected and L2 is the word length of the candidate text.
Further, in this example, after the first, second and third evaluation factors are obtained, determining module 605 can fuse them to obtain the assessed value of the candidate text. After the assessed value is obtained, determining module 605 can take the candidate text with the largest assessed value as the error correction result of the text to be corrected. Alternatively, determining module 605 can take one or more candidate texts whose assessed value exceeds a preset threshold as the error correction result of the text to be corrected.
Optionally, determining module 605 can fuse the first, second and third evaluation factors according to the following formula:
Score = a1 * P + b1 * μ_edit + c1 * ν_pinyin
Wherein, Score denotes the assessed value of the candidate text, a1, b1 and c1 are preset constant coefficients, P is the first evaluation factor, μ_edit is the second evaluation factor, and ν_pinyin is the third evaluation factor.
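The fusion step can be sketched as follows. The weighted sum follows the Score formula; the two normalisations are assumptions, because the formulas for the second and third evaluation factors are not reproduced in this text. Here the edit factor is taken as 1 - d_edit / max{L1, L2} and the pinyin factor as d_pinyin / (2 * max{L1, L2}), so that both lie between 0 and 1 and larger means a better candidate. All function names and default coefficients are hypothetical.

```python
def edit_factor(d_edit, len_query, len_candidate):
    """Assumed normalisation: 1.0 for identical texts, lower as edits increase."""
    longest = max(len_query, len_candidate)
    return 1.0 if longest == 0 else 1.0 - d_edit / longest


def pinyin_factor(d_pinyin, len_query, len_candidate):
    """Assumed normalisation: each position contributes at most 2."""
    longest = max(len_query, len_candidate)
    return 1.0 if longest == 0 else d_pinyin / (2 * longest)


def fuse(p, mu_edit, nu_pinyin, a1=1.0, b1=1.0, c1=1.0):
    """Score = a1 * P + b1 * mu_edit + c1 * nu_pinyin."""
    return a1 * p + b1 * mu_edit + c1 * nu_pinyin


def select(scores, threshold=None):
    """Return the best candidate, or every candidate above the threshold."""
    if not scores:
        return []
    if threshold is None:
        return [max(scores, key=scores.get)]
    kept = [text for text, value in scores.items() if value > threshold]
    return sorted(kept, key=scores.get, reverse=True)


# "ipd" -> "ipad": edit distance 1, pinyin distance 4, lengths 3 and 4.
# The noisy channel probability 0.01 is a made-up illustrative value.
score_ipad = fuse(0.01, edit_factor(1, 3, 4), pinyin_factor(4, 3, 4))
```

In practice the coefficients a1, b1 and c1 would need tuning, since P is typically much smaller than the two normalised factors.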
In embodiments of the present invention, determining module 605 calculates the first, second and third evaluation factors based on the noisy channel error correction model, the edit distance error correction model and the pinyin distance error correction model respectively, and fuses the three evaluation factors to obtain the assessed value of the candidate text, which further improves the accuracy of text error correction. The device of the embodiments of the present invention handles error correction of text that mixes Chinese, English and pinyin, improving the coverage and applicability of text error correction.
Fig. 7 shows an exemplary system architecture 700 to which the text error correction method or text error correction device of the embodiments of the present invention can be applied.
As shown in Fig. 7, system architecture 700 may include terminal devices 701, 702, 703, network 704 and server 705. Network 704 is the medium that provides communication links between terminal devices 701, 702, 703 and server 705. Network 704 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
A user can use terminal devices 701, 702, 703 to interact with server 705 through network 704, to receive or send messages and the like. Various communication client applications can be installed on terminal devices 701, 702, 703, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients and social platform software.
Terminal devices 701, 702, 703 can be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, laptop computers and desktop computers.
Server 705 can be a server providing various services, for example a back-end management server that supports a shopping website browsed by users with terminal devices 701, 702, 703. The back-end management server can perform processing such as query error correction on received data such as search terms, and feed the error correction result back to the terminal devices.
It should be noted that the text error correction method provided by the embodiments of the present invention is generally executed by server 705; accordingly, the text error correction device is generally arranged in server 705.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 7 are merely illustrative. There may be any number of terminal devices, networks and servers, as the implementation requires.
Further, the present invention also provides an electronic device, comprising: one or more processors; and a storage device for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the text error correction method of the present invention.
Fig. 8 shows a structural schematic diagram of a computer system 800 suitable for implementing the electronic device of the present invention. The computer system shown in Fig. 8 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present invention.
As shown in Fig. 8, computer system 800 includes a central processing unit (CPU) 801, which can perform various appropriate actions and processing according to a program stored in read-only memory (ROM) 802 or a program loaded from storage section 808 into random access memory (RAM) 803. RAM 803 also stores the various programs and data required for the operation of system 800. CPU 801, ROM 802 and RAM 803 are connected to one another through bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
The following components are connected to I/O interface 805: an input section 806 including a keyboard, a mouse and the like; an output section 807 including a cathode ray tube (CRT), a liquid crystal display (LCD) and the like, as well as a loudspeaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card or a modem. Communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disc, a magneto-optical disk or a semiconductor memory, is mounted on drive 810 as needed, so that a computer program read from it can be installed into storage section 808 as needed.
In particular, according to the disclosed embodiments of the present invention, the process described above with reference to the flow chart may be implemented as a computer software program. For example, the disclosed embodiments include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through communication section 809, and/or installed from removable medium 811. When the computer program is executed by central processing unit (CPU) 801, the above-described functions defined in the system of the present invention are executed.
It should be noted that the computer-readable medium of the present invention can be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium can be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present invention, a computer-readable storage medium can be any tangible medium that contains or stores a program, where the program can be used by or in connection with an instruction execution system, apparatus or device. Also in the present invention, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal can take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, and such a medium can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device. Program code contained on a computer-readable medium can be transmitted over any suitable medium, including but not limited to wireless, electric wire, optical cable, RF and the like, or any suitable combination of the above.
The flow charts and block diagrams in the drawings illustrate the architecture, functions and operations of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each box in a flow chart or block diagram can represent a module, a program segment or a part of code, and the module, program segment or part of code contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes can occur in an order different from that marked in the drawings. For example, two boxes shown in succession can in fact be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each box in a block diagram or flow chart, and any combination of boxes in a block diagram or flow chart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules described in the embodiments of the present invention can be implemented in software or in hardware. The described modules can also be arranged in a processor; for example, a processor may be described as including an obtaining module, a searching module and a determining module. The names of these modules do not, under certain conditions, limit the modules themselves; for example, the obtaining module may also be described as "a module for obtaining the pinyin sequence of the text to be corrected".
In another aspect, the present invention also provides a computer-readable medium. The computer-readable medium may be included in the device described in the above embodiments, or it may exist separately without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to perform the following flow: obtaining the pinyin sequence of the text to be corrected; searching the mixing lexicographic tree to obtain a candidate text set matching the pinyin sequence of the text to be corrected, where the mixing lexicographic tree includes the correspondence between pinyin and Chinese words and English words; and determining the error correction result of the text to be corrected according to the error correction model and the candidate text set.
The above specific embodiments do not limit the protection scope of the present invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, sub-combinations and substitutions can occur. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
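The flow above depends on a tree that maps pinyin to both Chinese and English words. A minimal sketch of such a structure is shown below as an ordinary prefix tree. Keying the tree on toneless pinyin letters, and storing English words under their own spelling, are assumptions; the class and method names are hypothetical.

```python
class TrieNode:
    __slots__ = ("children", "words")

    def __init__(self):
        self.children = {}   # next letter -> child node
        self.words = []      # words whose key ends at this node


class MixedTrie:
    """Prefix tree holding Chinese words (keyed by pinyin) and English words
    (keyed by their own letters) side by side."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, word):
        node = self.root
        for letter in key:
            node = node.children.setdefault(letter, TrieNode())
        node.words.append(word)

    def lookup(self, key):
        """Return the words stored under exactly this key, or an empty list."""
        node = self.root
        for letter in key:
            node = node.children.get(letter)
            if node is None:
                return []
        return list(node.words)


trie = MixedTrie()
trie.insert("shouji", "手机")   # Chinese word keyed by its pinyin
trie.insert("iphone", "iphone")  # English word keyed by itself
trie.insert("ipad", "ipad")
```

Looking up `"shouji"` returns the Chinese word stored under that pinyin, while a key that is only a prefix of a stored entry, such as `"shou"`, returns nothing.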

Claims (16)

1. A text error correction method, characterized in that the method comprises:
Obtaining the pinyin sequence of a text to be corrected;
Searching a mixing lexicographic tree to obtain a candidate text set matching the pinyin sequence of the text to be corrected, wherein the mixing lexicographic tree includes the correspondence between pinyin and Chinese words and English words;
Determining the error correction result of the text to be corrected according to an error correction model and the candidate text set.
2. The method according to claim 1, characterized in that the step of obtaining the pinyin sequence of the text to be corrected comprises:
If the text to be corrected consists of Chinese characters, taking the pinyin of the Chinese characters as the pinyin sequence of the text to be corrected; if the text to be corrected consists of non-Chinese characters, taking the non-Chinese characters themselves as the pinyin sequence of the text to be corrected; if the text to be corrected consists of Chinese characters and non-Chinese characters, taking the whole formed by the pinyin of the Chinese characters and the non-Chinese characters themselves as the pinyin sequence of the text to be corrected; wherein the non-Chinese characters include digits, English words and/or pinyin.
3. The method according to claim 1, characterized in that the step of searching the mixing lexicographic tree to obtain the candidate text set matching the pinyin sequence of the text to be corrected comprises:
Searching the mixing lexicographic tree based on a forward maximum matching algorithm and a reverse maximum matching algorithm, and determining the candidate text set matching the pinyin sequence according to the forward maximum matching result and the reverse maximum matching result.
4. The method according to claim 1, characterized in that the step of determining the error correction result of the text to be corrected according to the error correction model and the candidate text set comprises:
Calculating the evaluation factors of each candidate text in the candidate text set based on multiple error correction models; fusing the multiple evaluation factors to obtain the assessed value of the candidate text; and determining the error correction result of the text to be corrected according to the assessed value.
5. The method according to claim 4, characterized in that the multiple error correction models include at least two of the following: a noisy channel error correction model, an edit distance error correction model, and a pinyin distance error correction model.
6. The method according to claim 5, characterized in that, where the multiple error correction models include the noisy channel error correction model, the edit distance error correction model and the pinyin distance error correction model, the step of calculating the evaluation factors of each candidate text in the candidate text set based on the multiple error correction models comprises:
Calculating the noisy channel probability of the candidate text based on the noisy channel error correction model, and using it as the first evaluation factor of the candidate text; calculating the edit distance of the candidate text based on the edit distance error correction model, and determining the second evaluation factor of the candidate text according to the edit distance; and calculating the pinyin distance of the candidate text based on the pinyin distance error correction model, and determining the third evaluation factor of the candidate text according to the pinyin distance.
7. The method according to claim 6, characterized in that the step of calculating the pinyin distance of the candidate text based on the pinyin distance error correction model comprises:
For the characters in the text to be corrected and the candidate text, comparing, position by position, whether the letters making up their pinyin are identical and whether their tones are identical; determining the pinyin distance of each character according to the comparison result, and taking the sum of the pinyin distances of the characters as the pinyin distance of the candidate text.
8. The method according to claim 3, characterized in that the forward maximum matching result and the reverse maximum matching result each include at least one candidate text fragment;
The method further comprises: performing edit operations on the pinyin sequence of the candidate text fragments; searching the mixing lexicographic tree according to the edited pinyin sequence to obtain newly added candidate text fragments matching the edited pinyin sequence; and constructing, from the candidate text fragments and the newly added candidate text fragments, the candidate text set matching the pinyin sequence of the text to be corrected.
9. The method according to claim 8, characterized in that the step of performing edit operations on the pinyin sequence of the candidate text fragments comprises:
Where the candidate text fragment includes Chinese characters, applying fuzzy-pinyin edit operations to the pinyin of the Chinese characters; where the candidate text fragment includes English words, applying insertion, replacement, exchange and/or deletion edit operations to the English words.
10. The method according to claim 1, characterized in that the method further comprises:
Obtaining the pinyin sequences of training sample words, and constructing the mixing lexicographic tree according to the pinyin sequences of the training sample words.
11. The method according to claim 10, characterized in that the method further comprises:
Before the step of obtaining the pinyin sequences of the training sample words and constructing the mixing lexicographic tree according to the pinyin sequences of the training sample words, cleaning source data to obtain the training sample words.
12. A search method, characterized in that the method comprises:
Receiving an input text;
Where the input text is determined to be a text to be corrected, obtaining the pinyin sequence of the input text;
Searching a mixing lexicographic tree to obtain a candidate text set matching the pinyin sequence of the input text, wherein the mixing lexicographic tree includes the correspondence between pinyin and Chinese words and English words;
Determining the error correction result of the input text according to an error correction model and the candidate text set;
Obtaining a search result according to the error correction result of the input text, and sending the search result.
13. A search error correction method, characterized in that the method comprises:
Receiving an input text;
Where the input text is determined to be a text to be corrected, obtaining the pinyin sequence of the input text;
Searching a mixing lexicographic tree to obtain a candidate text set matching the pinyin sequence of the input text, wherein the mixing lexicographic tree includes the correspondence between pinyin and Chinese words and English words;
Determining the error correction results of the input text according to an error correction model and the candidate text set;
Sorting the error correction results of the input text, and sending the sorted error correction results.
14. A text error correction device, characterized in that the device comprises:
An obtaining module, for obtaining the pinyin sequence of a text to be corrected;
A searching module, for searching a mixing lexicographic tree to obtain a candidate text set matching the pinyin sequence of the text to be corrected, wherein the mixing lexicographic tree includes the correspondence between pinyin and Chinese words and English words;
A determining module, for determining the error correction result of the text to be corrected according to an error correction model and the candidate text set.
15. An electronic device, characterized by comprising:
One or more processors;
A storage device, for storing one or more programs,
When the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1 to 11.
16. A computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 11.
CN201810030108.3A 2018-01-12 2018-01-12 Text error correction method and device Pending CN110032722A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810030108.3A CN110032722A (en) 2018-01-12 2018-01-12 Text error correction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810030108.3A CN110032722A (en) 2018-01-12 2018-01-12 Text error correction method and device

Publications (1)

Publication Number Publication Date
CN110032722A true CN110032722A (en) 2019-07-19

Family

ID=67234834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810030108.3A Pending CN110032722A (en) 2018-01-12 2018-01-12 Text error correction method and device

Country Status (1)

Country Link
CN (1) CN110032722A (en)


Similar Documents

Publication Publication Date Title
CN110032722A (en) Text error correction method and device
CN104156454B (en) Error correction method and device for search terms
CN105574092B (en) Information mining method and device
CN109299458A (en) Entity recognition method, device, equipment and storage medium
CN109271631A (en) Segmenting method, device, equipment and storage medium
CN110162767A (en) Method and apparatus for text error correction
JP6517352B2 (en) Method and system for providing translation information
CN108628830B (en) Semantic recognition method and device
CN108768840A (en) Method and apparatus for account management
CN102750280A (en) Computer processing method and system for search
US20160092421A1 (en) Text Editing Method and Apparatus, and Server
CN109992766B (en) Method and device for extracting target words
CN104462051A (en) Word segmentation method and device
US20210042470A1 (en) Method and device for separating words
CN110069698A (en) Information-pushing method and device
CN107943895A (en) Information-pushing method and device
CN103514230A (en) Method and device used for training language model according to corpus sequence
CN106681598A (en) Information input method and device
CN110276065A (en) Method and apparatus for processing product reviews
CN110874396A (en) Keyword extraction method and device and computer storage medium
CN111861596A (en) Text classification method and device
KR101931624B1 (en) Trend Analyzing Method for Fassion Field and Storage Medium Having the Same
CN110309293A (en) Text recommendation method and device
CN105929979B (en) Long sentence input method and device
CN111538830A (en) French retrieval method, French retrieval device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination