CN106598939B

CN106598939B - Text error correction method and device, server, and storage medium

Info

Publication number: CN106598939B
Application number: CN201610922072.0A
Authority: CN
Inventors: 焦增涛
Original assignee: Beijing Sankuai Online Technology Co Ltd
Current assignee: Beijing Sankuai Online Technology Co Ltd
Priority date: 2016-10-21
Filing date: 2016-10-21
Publication date: 2019-09-17
Anticipated expiration: 2036-10-21
Also published as: CN106598939A

Abstract

The invention discloses a text error correction method and device. The method comprises: collecting a first corpus in the form of participle pairs (pairs of segmented words or phrases); annotating both participles of each participle pair in pinyin; determining the similarity between the pinyin of the two participles of the participle pair, the similarity indicating the degree of similarity between the pinyin of the first participle and the pinyin of the second participle of the pair; and, if the similarity meets a preset condition, determining the two participles of the participle pair to be error-correction participles of each other, or determining the first participle to be the error-correction participle of the second participle.

Description

Text error correction method and device, server, and storage medium

Technical field

The present invention relates to electronic technology, and in particular to a text error correction method and device, a server, and a storage medium.

Background art

Text error correction is widely used in text input scenarios such as input methods, search engines, and speech recognition. It attempts to correct errors that may exist in the text a user enters (for example, keywords in Chinese or English) and recommends the likely correct input to the user. For Chinese, text error correction also needs to detect word-choice errors, pinyin errors, character-form errors, and other errors in the user's input, and to recommend the correct keyword that the user probably intended to enter. Error correction can therefore guide users as they enter keywords and can correct keyword errors, which occur frequently. In text error correction, the similarity between the keyword being corrected and the correct keyword determines the accuracy of the correction. Current methods of calculating similarity mainly work as follows: the initials and finals of Mandarin are divided into several groups by pronunciation type; the note similarity within a group is defined as 1 and the note similarity between different groups is defined as 0; the Chinese characters are aligned by position, the pronunciation similarity at each position is calculated one by one, and the results are averaged to give the similarity. The shortcoming of this scheme is that defining the similarity as 1 within a group and 0 between groups cannot accurately describe how similar two notes are. It ignores differences in similarity within a group: for example, among the labials, the pair b [glass] and p [slope] and the pair b [glass] and m [touching] differ in how similar their pronunciations are. It also treats the similarity between groups as exactly zero, although, for example, the labial b [glass] and the velar g [brother] are pronounced somewhat alike.

Summary of the invention

In view of this, embodiments of the present invention provide a text error correction method and device, a server, and a storage medium to solve at least one problem existing in the prior art. Note transition probabilities are mined from texts with similar pronunciation and used as the note similarity, which can improve the error correction probability.

The technical solutions of the embodiments of the present invention are implemented as follows:

In a first aspect, an embodiment of the present invention provides a text error correction method, the method comprising:

collecting a first corpus in the form of participle pairs;

annotating both participles of each participle pair in the first corpus in pinyin;

determining the similarity between the pinyin of the two participles of the participle pair, the similarity indicating the degree of similarity between the pinyin of the first participle and the pinyin of the second participle of the pair;

if the similarity meets a preset condition, determining the two participles of the participle pair to be error-correction participles of each other, or determining the first participle to be the error-correction participle of the second participle.

In a second aspect, an embodiment of the present invention provides a text error correction device, the device comprising a first forming unit, an annotation unit, a first determination unit, and a second determination unit, wherein:

the first forming unit is configured to collect a first corpus in the form of participle pairs;

the annotation unit is configured to annotate both participles of each participle pair in the first corpus in pinyin;

the first determination unit is configured to determine the similarity between the pinyin of the two participles of the participle pair, the similarity indicating the degree of similarity between the pinyin of the first participle and the pinyin of the second participle of the pair;

the second determination unit is configured to, if the similarity meets a preset condition, determine the two participles of the participle pair to be error-correction participles of each other, or determine the first participle to be the error-correction participle of the second participle.

In a third aspect, an embodiment of the present invention provides a server, the server comprising a processor and an external communication interface, the processor being configured to:

collect a first corpus in the form of participle pairs;

if the similarity meets a preset condition, determine the two participles of the participle pair to be error-correction participles of each other, or determine the first participle to be the error-correction participle of the second participle;

form an error correction dictionary from the participle pairs whose similarity meets the preset condition;

send the error correction dictionary to a terminal through the external communication interface.

In a fourth aspect, an embodiment of the present invention provides a computer storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the text error correction method provided in the first aspect above.

Embodiments of the present invention provide a text error correction method and device, a server, and a storage medium, in which: a first corpus is collected in the form of participle pairs; both participles of each participle pair in the first corpus are annotated in pinyin; the similarity between the pinyin of the two participles of the participle pair is determined, the similarity indicating the degree of similarity between the pinyin of the first participle and the pinyin of the second participle of the pair; and, if the similarity meets a preset condition, the two participles of the participle pair are determined to be error-correction participles of each other, or the first participle is determined to be the error-correction participle of the second participle. In this way, note transition probabilities are mined from texts with similar pronunciation and used as the note similarity, which can improve the error correction probability.

Brief description of the drawings

Fig. 1 is a first schematic flowchart of the text error correction method according to an embodiment of the present invention;

Fig. 2-1 is a second schematic flowchart of the text error correction method according to an embodiment of the present invention;

Fig. 2-2 is a schematic diagram of the relationship between the first computing device and the second computing device according to an embodiment of the present invention;

Fig. 2-3 is a third schematic flowchart of the text error correction method according to an embodiment of the present invention;

Fig. 3-1 is a fourth schematic flowchart of the text error correction method according to an embodiment of the present invention;

Fig. 3-2 is a schematic flowchart of step S301 in Fig. 3-1;

Fig. 3-3 is a schematic flowchart of step S302 in Fig. 3-1;

Fig. 3-4 is a schematic flowchart of step S324 in Fig. 3-3;

Fig. 4 is a first schematic structural diagram of the text error correction device according to an embodiment of the present invention;

Fig. 5 is a second schematic structural diagram of the text error correction device according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of the server according to an embodiment of the present invention.

Detailed description of the embodiments

The technical solution of the present invention is further elaborated below with reference to the drawings and specific embodiments.

To solve the aforementioned technical problem, an embodiment of the present invention provides a text error correction method for forming the error-correction participles that correspond to a participle. In implementation, the method can be carried out by a processor of a first computing device invoking program code, and the program code can of course be stored in a computer storage medium; the first computing device therefore includes at least a processor and a storage medium. The first computing device may be any type of electronic device with information processing capability, for example a mobile phone, tablet computer, desktop computer, personal digital assistant, navigator, digital telephone, video telephone, or television set.

Fig. 1 is a first schematic flowchart of the text error correction method according to an embodiment of the present invention. As shown in Fig. 1, the method includes:

Step S101: collect a first corpus in the form of participle pairs;

Here, step S101 is a corpus collection step. In implementation, step S101 can collect the corpus from the following channels: Chinese dictionaries of near-homophones, dictionaries of dialect and standard pronunciations with easily confused notes, annotated speech recognition errors, and annotated errors from online input methods. The corpus is collected in the form of participle pairs (pairs of phrase segments), for example: "logging off"-"leg is slightly imperial", "coupons"-"cash equivalent volume", "comrades"-"bobbins", "comrades"-"notice door", "dried shrimp"-"villagers", "salted vegetables are too expensive"-"let's start the meeting", "sausage pickled melon"-"chief of township's speech", and "comrades"-"let's start the meeting". It should be noted that the first corpus is allowed to contain erroneous participle pairs. For example, the pinyin of the pair "comrades"-"let's start the meeting" differ considerably, so the pair would not normally be regarded as a pinyin-similar participle pair. Other embodiments of the present invention further include a second corpus used to form the initial-final similarity matrix. The second corpus may differ from the first corpus: the second corpus can be regarded as a standard corpus and should not contain erroneous participle pairs, whereas the first corpus may contain erroneous participle pairs. The embodiment shown in Fig. 1 turns the first corpus into the participle sets provided by this embodiment.

Step S102: annotate both participles of each participle pair in the first corpus in pinyin;

Here, continuing the example in step S101 above, "salted vegetables are too expensive" is annotated with the pinyin "xian-cai-tai-gui", "let's start the meeting" with "xian-zai-kai-hui", "sausage pickled melon" with "xiang-chang-jiang-gua", and "chief of township's speech" with "xiang-zhang-jiang-hua".

Step S103: determine the similarity between the pinyin of the two participles of the participle pair, the similarity indicating the degree of similarity between the pinyin of the first participle and the pinyin of the second participle of the pair;

Here, continuing the example in step S101 above, the step determines, for example, the similarity between the pinyin of "logging off" and "leg is slightly imperial", the similarity between the pinyin of "comrades" and "bobbins", or the similarity between the pinyin of "comrades" and "let's start the meeting".

Here, determining the similarity between the pinyin of the two participles of the participle pair includes: determining that similarity using a preset initial-final similarity matrix. The process of determining the initial-final similarity matrix is described in other embodiments.

Step S104: determine whether the similarity meets a preset condition;

Here, the preset condition may be a threshold, and determining whether the similarity meets the preset condition includes: determining whether the similarity is greater than the threshold. If the similarity is greater than the threshold, the similarity is determined to meet the preset condition; if the similarity is less than or equal to the threshold, the similarity is determined not to meet the preset condition.

Step S105: if the similarity meets the preset condition, determine the two participles of the participle pair to be error-correction participles of each other, or determine the first participle to be the error-correction participle of the second participle.

Here, continuing the example in step S103 above, step S105 can determine two participles to be error-correction participles of each other. For example, "salted vegetables are too expensive" is determined to be the error-correction participle of "let's start the meeting", and "let's start the meeting" is determined to be the error-correction participle of "salted vegetables are too expensive"; likewise, "sausage pickled melon" is determined to be the error-correction participle of "chief of township's speech", and "chief of township's speech" is determined to be the error-correction participle of "sausage pickled melon". Step S105 can also determine the first participle to be the error-correction participle of the second participle. For example, "comrades" can be determined to be the error-correction participle of "bobbins", and "logging off" can be determined to be the error-correction participle of "leg is slightly imperial". As can be seen, determining two participles to be error-correction participles of each other is a two-way error correction mechanism, while determining the first participle to be the error-correction participle of the second participle is a one-way error correction mechanism. In the two-way mechanism, both participles are everyday expressions used in work, life, or study, and only the contexts in which the two are used differ. For example, "let's start the meeting" and "salted vegetables are too expensive" are error-correction participles of each other (the two-way mechanism): "let's start the meeting" is generally used at work, while "salted vegetables are too expensive" is generally used in daily life. In the one-way mechanism, the participle being corrected is generally regarded as an erroneous word. For example, "logging off" is the error-correction participle of "leg is slightly imperial", that is, "logging off" is used to correct "leg is slightly imperial", and "leg is slightly imperial" is generally regarded as an erroneous participle. Likewise, "comrades" is the error-correction participle of "bobbins", that is, "comrades" is used to correct "bobbins" or "notice door", and "bobbins" or "notice door" is generally regarded as an erroneous participle. It should be noted that the one-way mechanism can be converted into the two-way mechanism under certain conditions; for example, in some situations "bobbins" or "notice door" may also be regarded as a correct word.

In the embodiment of the present invention, the error-correction participles determined in step S105 can form an error correction dictionary. That is, the method further includes: forming an error correction dictionary from the participle pairs whose similarity meets the preset condition; and sending the error correction dictionary to a terminal. The error correction dictionary contains several participle sets, and each participle set contains at least one error-correction participle, obtained from the pinyin similarity calculation, for correcting the participle to be corrected. For example, the participle set corresponding to "comrades" contains "bobbins" and "notice door"; the participle set corresponding to "let's start the meeting" contains "salted vegetables are too expensive"; the participle set corresponding to "notice door" contains "bobbins" and "comrades"; and the participle set corresponding to "leg is slightly imperial" contains "logging off".
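The threshold test of step S104 and the participle sets of the error correction dictionary can be illustrated with a minimal Python sketch. The function name `build_correction_dict`, the threshold value 0.8, and the similarity scores are assumptions made for illustration only; the embodiment does not prescribe them.

```python
from collections import defaultdict

def build_correction_dict(scored_pairs, threshold=0.8, bidirectional=True):
    """Build participle sets from (first, second, similarity) triples.

    Keys are participles to be corrected; values are sets of their
    error-correction participles. The threshold is a hypothetical value.
    """
    corrections = defaultdict(set)
    for first, second, similarity in scored_pairs:
        if similarity > threshold:          # strictly greater, as in step S104
            corrections[second].add(first)  # first corrects second
            if bidirectional:               # two-way mechanism of step S105
                corrections[first].add(second)
    return dict(corrections)

# Made-up similarity scores for pairs taken from the examples above.
pairs = [
    ("let's start the meeting", "salted vegetables are too expensive", 0.9),
    ("comrades", "let's start the meeting", 0.1),
]
print(build_correction_dict(pairs))
```

With `bidirectional=False` the same function yields the one-way mechanism, in which only the second participle of each pair receives a correction.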

In the embodiment above, step S102 annotates both participles of a participle pair in pinyin. When annotating the pinyin, step S102 includes the following steps:

Step S121: determine whether the participle pair contains Arabic numerals;

Step S122: if the participle pair contains Arabic numerals, convert the Arabic numerals into the corresponding Chinese characters;

Here, suppose the participle is "speed 8" or "Mo Tai 168"; converting the Arabic numerals in the participle into Chinese characters gives "speed eight" or "Mo Tai one six eight".

Step S123: annotate the participles of the participle pair, after conversion into Chinese characters, in pinyin;

Here, continuing the example in step S122 above, suppose the participle is "speed 8" or "Mo Tai 168". When the two participles are annotated, the pinyin of "speed 8" is "su-ba", and the pinyin of "Mo Tai 168" is "mo-tai-yi-liu-ba" or "mo-tai-yao-liu-ba". The character "one" in "Mo Tai one six eight" is a polyphone and can be annotated as "yi" or "yao".

Step S124: determine whether the participle pair contains a polyphone;

Step S125: if the participle pair does not contain a polyphone, annotate both participles of the participle pair in pinyin.

Here, for example, the participle pair "leg is slightly imperial"-"logging off" does not contain a polyphone, so the two participles are annotated in pinyin: the pinyin of "leg is slightly imperial" is "tui-cu-long", and the pinyin of "logging off" is "tui-chu-xi-tong".

Step S126: if the participle pair contains a polyphone, go on to determine whether either of the two participles of the pair contains a polyphonic word;

Here, for example, the participle pair is "PianYiFang"-"variation side" or "PianYiFang"-"derogatory sense side". The character rendered "just" in "PianYiFang" (the character 便) is a polyphone whose pinyin are "pián" (second tone) and "biàn" (fourth tone). The method therefore goes on to determine whether either participle of the pair contains a polyphonic word. For example, "cheap" in "PianYiFang" is a polyphonic word whose pinyin include "bian-yi" and "pian-yi". In general, the corpus is collected in the form of participle pairs, so to improve efficiency this embodiment directly determines whether the participle pair contains a polyphonic word. For example, the character rendered "list" (the character 单) is a polyphone: it is pronounced "shàn" (fourth tone) as a surname, "chán" in the title of the ancient Xiongnu monarch, and "dān" (first tone) in general phrases such as "loneliness". When the corpus is collected, if the participle pair is "loneliness"-"fighting single-handed", then although the character is a polyphone, "loneliness" is not a polyphonic word, so the pinyin of "loneliness" is unique and there is no need to annotate the character with the three pinyin "chán", "shàn", and "dān". The method provided in this embodiment can therefore significantly improve computational efficiency. If a polyphone in a participle pair appears as a single character rather than within a phrase, every pinyin of that character needs to be annotated; however, because the corpus is collected in the form of participle pairs, as noted above, this case is very rare.

Step S127: if at least one of the two participles of the participle pair contains a polyphonic word, annotate the two or more pinyin corresponding to the polyphonic word as part or all of the pinyin of the corresponding participle of the pair.

Here, continuing the example above, the pinyin annotated for "PianYiFang" include "bian-yi-fang" and "pian-yi-fang".

As can be seen from the above, step S102 is in effect a process of converting a participle pair into pinyin. In implementation, this step can be realized by looking up a table of Chinese characters and pinyin, in which every character of the Chinese dictionary has its corresponding pinyin. The processing steps are as follows: 1) when a non-polyphone is encountered, look it up in the table directly and convert it to pinyin; 2) when a polyphone is encountered, look up the words that the character forms with its neighbouring characters: if such a word exists in the table and has a unique pronunciation, convert it to pinyin; if the word exists but still has several pronunciations, use a language model to determine the pronunciation (for example, "cheap": pian-yi, bian-yi); 3) if no such word exists in the table, use the default pronunciation (each polyphone has a default pronunciation); 4) when Arabic numerals are encountered, convert them into the corresponding Chinese characters and then look those up; 5) when English characters are encountered, leave them without pinyin annotation; 6) when a Chinese character that is not in the table is encountered, skip the character and set the pinyin of that position to empty.
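The numeral conversion of steps S121 to S123 and the enumeration of polyphone readings can be sketched in Python as follows. This is only an illustration under stated assumptions: the table `PINYIN` is a hypothetical miniature lookup table, the Chinese strings "速8" and "莫泰168" are assumed reconstructions of the examples "speed 8" and "Mo Tai 168" from the pinyin given above, and the word-level disambiguation of polyphones (word table, language model, default pronunciation) is deliberately left out.

```python
from itertools import product

# Arabic digit -> Chinese numeral character.
DIGITS = dict(zip("0123456789", "零一二三四五六七八九"))

# Hypothetical miniature character-to-pinyin table; a polyphone lists
# several readings. A real table would cover the whole dictionary.
PINYIN = {
    "速": ["su"], "莫": ["mo"], "泰": ["tai"],
    "一": ["yi", "yao"], "六": ["liu"], "八": ["ba"],
}

def digits_to_hanzi(text):
    """Replace every Arabic digit with its Chinese character (step S122)."""
    return "".join(DIGITS.get(ch, ch) for ch in text)

def candidate_pinyin(text):
    """Return every pinyin string the text can take (character level only)."""
    chars = digits_to_hanzi(text)
    # A character missing from the table gets an empty pinyin.
    readings = [PINYIN.get(ch, [""]) for ch in chars]
    return ["-".join(combo) for combo in product(*readings)]

print(candidate_pinyin("速8"))
print(candidate_pinyin("莫泰168"))
```

Enumerating all readings is the fallback for a polyphone that cannot be resolved at word level; where a word has a unique pronunciation, a fuller implementation would keep only that reading.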

It should be noted that, within steps S121 to S127, there is no strict order of execution between steps S121 to S123 and steps S124 to S127. In implementation, steps S121 to S123 can be executed first and then steps S124 to S127; alternatively, steps S124 to S127 can be executed first and then steps S121 to S123.

In other embodiments of the present invention, step S103, which determines the similarity between the pinyin of the two participles of the participle pair, includes:

Step S131: align the initials of the pinyin of the two participles of the participle pair, and align the finals of the pinyin of the two participles;

Here, in other embodiments of the present invention, the initials and the finals of the pinyin of the two participles are aligned in the way that yields the most identical pronunciations. For example, the pinyin of "logging off"-"leg is slightly imperial" are aligned as follows:

"logging off" --- t-ui-ch-u-x-i-t-ong;

"leg is slightly imperial" --- t-ui-c-u-_-_-l-ong;

During alignment, to obtain the most identical pronunciations, the final "ong" of "dragon" (pinyin "long") is aligned with the final "ong" of "system" (pinyin "tong"), rather than aligning the pinyin of "dragon" with the pinyin of "being" (pinyin "xi"); "_" indicates a gap, that is, a position with no corresponding note. In this example, "logging off" has four characters and "leg is slightly imperial" has three. The initial alignment is made in order: the pinyin of "leg" corresponds to the pinyin of "moving back", the pinyin of "thick" corresponds to the pinyin of "out", the pinyin of "dragon" corresponds to the pinyin of "being", and the pinyin of "system" has no counterpart. The first group ("leg" and "moving back") and the second group ("thick" and "out") have very high similarity, but the third group ("dragon" and "being") has very low similarity. In this case the embodiment applies a shift: the third group is changed so that the pinyin of "being" has no counterpart, and the fourth group is changed so that the pinyin of "dragon" corresponds to the pinyin of "system". After the shift, the similarity of the first, second, and fourth groups is relatively high. In the related art, when pronunciation-based text similarity is used, the two texts are aligned position by position in character order; the drawback is that when one text has an extra or a missing character, every subsequent position is misaligned. The embodiment of the present invention instead aligns by the most identical pronunciations, which keeps the two texts aligned even when one of them has an extra or a missing character.

Step S132: calculate the transition probability of the pinyin of the first participle of the participle pair being converted into the pinyin of the second participle;

Step S133: determine the similarity between the pinyin of the two participles of the participle pair according to the transition probability.
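The alignment by most identical pronunciations in step S131 can be sketched in Python. The embodiment does not name an algorithm, so the sketch below swaps in a standard dynamic-programming alignment (in the style of Needleman-Wunsch, with zero gap cost) as an assumption; the function name `align` and the representation of a syllable as an `(initial, final)` tuple are likewise hypothetical.

```python
def same_notes(s1, s2):
    """Number of identical notes (initial, final) between two syllables."""
    return (s1[0] == s2[0]) + (s1[1] == s2[1])

def align(a, b):
    """Align two syllable lists so that the count of identical notes is maximal.

    Returns a list of (syllable_a, syllable_b) pairs; None marks a gap.
    """
    n, m = len(a), len(b)
    # best[i][j] = maximum identical notes when aligning a[i:] with b[j:]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            best[i][j] = max(best[i + 1][j + 1] + same_notes(a[i], b[j]),
                             best[i + 1][j], best[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if best[i][j] == best[i + 1][j + 1] + same_notes(a[i], b[j]):
            pairs.append((a[i], b[j])); i += 1; j += 1
        elif best[i][j] == best[i + 1][j]:
            pairs.append((a[i], None)); i += 1   # gap in b
        else:
            pairs.append((None, b[j])); j += 1   # gap in a
    pairs += [(s, None) for s in a[i:]] + [(None, s) for s in b[j:]]
    return pairs

logging_off = [("t", "ui"), ("ch", "u"), ("x", "i"), ("t", "ong")]
leg = [("t", "ui"), ("c", "u"), ("l", "ong")]
for pair in align(logging_off, leg):
    print(pair)
```

On this example the syllable "x-i" is left without a counterpart and "t-ong" is paired with "l-ong", which matches the shifted alignment described for step S131.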

In other embodiments of the present invention, two ways of implementing step S132 are provided below:

Mode one: The first way is relatively simple. The number of differing notes between the first participle and the second participle is determined, and the transition probability is then determined from that number and the length of the note string of the first participle or of the second participle. The length of the note string may be the number of characters of the first participle multiplied by 2, the number of characters of the second participle multiplied by 2, or the sum of the numbers of characters of the first and second participles multiplied by 2; because the pinyin of one Chinese character consists of an initial and a final, the length of a note string is twice the number of Chinese characters. Take the participle pair "logging off"-"leg is slightly imperial" as an example. Suppose the participle to be corrected (the first participle) is "leg is slightly imperial" and the error-correction participle (the second participle) is "logging off". The number of differing notes in the pair is 4, namely "ch", "x", "i", and "t", where a note is an initial or a final. The transition probability may then be calculated as 4 ÷ 6 (6 being the length of the note string of the first participle), 4 ÷ 8 (8 being the length of the note string of the second participle), or 4 ÷ (6+8) (14 being the sum of the lengths of the note strings of the first and second participles). Suppose instead that the participle to be corrected (the first participle) is "logging off" and the error-correction participle (the second participle) is "leg is slightly imperial". The number of differing notes in the pair is 2, namely "c" and "l". The transition probability may then be calculated as 2 ÷ 8 (8 being the length of the note string of the first participle), 2 ÷ 6 (6 being the length of the note string of the second participle), or 2 ÷ (8+6) (14 being the sum of the lengths of the note strings of the first and second participles).

It should be noted that the transition probability calculated above is inversely related to the similarity: the smaller the transition probability, the greater the similarity, and the larger the transition probability, the smaller the similarity. The transition probability lies in [0, 1], that is, it is greater than or equal to 0 and less than or equal to 1. A transition probability of 0 indicates that the notes of the first participle and the second participle are identical, for example "comrades"-"notice"; a transition probability of 1 indicates that the notes of the first participle and the second participle are entirely different, for example "comrades"-"let's start the meeting". So that the value corresponds more intuitively to the degree of similarity, the similarity can be calculated with the following relation: similarity = 1 - transition probability. The similarity calculated in this way also lies in [0, 1]: a similarity of 0 indicates that the pinyin of the two participles of the pair are entirely different, and a similarity of 1 indicates that the pinyin of the two participles of the pair are identical.
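The arithmetic of mode one can be checked with a short Python sketch. The counting rule used here, namely counting the notes of the second participle that have no identical counterpart in the first, is inferred from the worked example and is an assumption, as are the function names and the aligned-pair representation (with `None` marking a gap).

```python
from fractions import Fraction

def differing_notes(aligned):
    """Count notes of the second participle that differ from the first."""
    count = 0
    for first, second in aligned:
        if second is None:          # gap in the second participle: nothing to count
            continue
        if first is None:           # whole syllable unmatched: initial and final differ
            count += 2
        else:
            count += (first[0] != second[0]) + (first[1] != second[1])
    return count

def mode_one(aligned):
    """Return (transition probability, similarity) for the three denominators."""
    len_first = 2 * sum(1 for first, _ in aligned if first is not None)
    len_second = 2 * sum(1 for _, second in aligned if second is not None)
    diff = differing_notes(aligned)
    denominators = (len_first, len_second, len_first + len_second)
    return [(Fraction(diff, d), 1 - Fraction(diff, d)) for d in denominators]

# First participle "leg is slightly imperial", second participle "logging off".
ALIGNED = [
    (("t", "ui"), ("t", "ui")),
    (("c", "u"), ("ch", "u")),
    (None, ("x", "i")),
    (("l", "ong"), ("t", "ong")),
]
for probability, similarity in mode_one(ALIGNED):
    print(probability, similarity)
```

The three probabilities correspond to 4 ÷ 6, 4 ÷ 8, and 4 ÷ 14 in reduced form; swapping the two participles gives 2 differing notes, as in the second worked example.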

Mode two: The second way uses the preset initial-final similarity matrix to calculate the transition probability of the pinyin of the first participle of the participle pair being converted into the pinyin of the second participle. In this mode, step S132, calculating the transition probability of the pinyin of the first participle being converted into the pinyin of the second participle, includes:

Step S1321: if the characters at the same position of the two aligned participles are homophones, increase the score Score by 1 and advance both the position in the first participle and the position in the second participle by 1;

Step S1322: if the characters at the same position of the two aligned participles are not homophones, determine the score Score of the pinyin of the first participle and the pinyin of the second participle according to the preset initial-final similarity matrix;

Step S1323: determine a normalized final score Sf according to the score Score, the number of characters of the first participle, and the number of characters of the second participle;

Step S1324: determine, according to the final score Sf, the transition probability of the pinyin of the first participle of the participle pair being converted into the pinyin of the second participle.
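The score accumulation of steps S1321 and S1322 can be sketched in Python, heavily simplified. Everything below is an assumption made for illustration: the similarity values in `INITIAL_TABLE` are invented, the helper names are hypothetical, the first preset value is taken as 0.8, the two syllable lists are assumed to be already aligned and of equal length, and the comparison with the next position that the embodiment performs when the product is too small is omitted. The normalization of step S1323 is not sketched because its formula is not given in this passage.

```python
def make_similarity(table):
    """Wrap a symmetric lookup table; identical notes have similarity 1."""
    def similarity(a, b):
        if a == b:
            return 1.0
        return table.get(frozenset((a, b)), 0.0)
    return similarity

# Invented values standing in for entries of the initial-final similarity matrix.
INITIAL_TABLE = {frozenset(("ch", "zh")): 0.9, frozenset(("g", "h")): 0.85}
initial_sim = make_similarity(INITIAL_TABLE)
final_sim = make_similarity({})

def raw_score(first, second, first_preset=0.8):
    """Accumulate Score over two aligned lists of (initial, final) syllables."""
    score = 0.0
    for (i1, f1), (i2, f2) in zip(first, second):
        if (i1, f1) == (i2, f2):
            score += 1.0                    # homophones: add 1 (step S1321)
            continue
        s = initial_sim(i1, i2) * final_sim(f1, f2)
        if s > first_preset:                # product S above the first preset value
            score += s
        # Otherwise the embodiment compares against the next position (omitted).
    return score

sausage = [("x", "iang"), ("ch", "ang"), ("j", "iang"), ("g", "ua")]
chief = [("x", "iang"), ("zh", "ang"), ("j", "iang"), ("h", "ua")]
print(raw_score(sausage, chief))
```

The example uses the pinyin "xiang-chang-jiang-gua" and "xiang-zhang-jiang-hua" from step S102; two syllables match exactly and two contribute their similarity products.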

In other embodiments of the present invention, determining the score Score of the pinyin of the first participle and the pinyin of the second participle according to the preset initial-final similarity matrix includes:

Step S13221: obtain, according to the preset initial-final similarity matrix, the similarity between the initials and the similarity between the finals of the characters at the same position of the two participles;

Step S13222: if the product S of the similarity between the initials and the similarity between the finals is greater than a first preset value, increase the score Score by S and advance both the position in the first participle and the position in the second participle by 1;

Step S13223: if the product S of the similarity between the initials and the similarity between the finals of the characters at the same position of the two participles is less than or equal to the first preset value, obtain, according to the initial-final similarity matrix, the following similarities between the character at the current position of the first participle and the character at the next position of the second participle: the similarity between the initials, the similarity between the finals, the similarity between the initial and the final, and the similarity between the final and the initial;

Here, the first preset value and the second and third preset values mentioned below can be empirical values. The first preset value can be the same as the second and third preset values, for example all equal to 0.8, or they can of course differ.

Step S13224: determine a first maximum value, the first maximum value being the maximum among the following three values: the product S of the similarity between the initials and the similarity between the finals of the character at the current position of the first participle and the character at the next position of the second participle; the similarity between the initial of the character at the current position of the first participle and the final of the character at the next position of the second participle; and the similarity between the final of the character at the current position of the first participle and the initial of the character at the next position of the second participle;
Step S13224, determines the first maximum value, and first maximum value is the word and the of the current location of the first participle The product S's of similarity between similarity and simple or compound vowel of a Chinese syllable, the first participle between the initial consonant of the word of the next positions of two participles works as The present bit of similarity and the first participle between the simple or compound vowel of a Chinese syllable of the word of the next position of the initial consonant of the word of front position and the second participle The maximum value between similarity this three between the initial consonant of the word of the next position of the simple or compound vowel of a Chinese syllable for the word set and the second participle；

Step S13225, judges whether first maximum value is greater than the second preset value, before calculating score Score is Score adds first maximum value before obtaining, and the position of the position of the participle centering first participle and the second participle is all added 1；

Step S13226, it is similar according to the initial and the final if first maximum value is less than or equal to the second preset value Similarity, rhythm between the initial consonant of the word for the current location that degree matrix obtains the word of the next position of the first participle and second segments Similarity, initial consonant between mother and the similarity between simple or compound vowel of a Chinese syllable and the similarity between simple or compound vowel of a Chinese syllable and initial consonant；

Step S13227, determines the second maximum value, and second maximum value is the word and the of the next position of the first participle The product S of similarity between similarity and simple or compound vowel of a Chinese syllable between the initial consonant of the word of the current locations of two participles, under the first participle The next bit of similarity and the first participle between the simple or compound vowel of a Chinese syllable of the word of the current location of the initial consonant of the word of one position and the second participle The maximum value between similarity this three between the initial consonant of the word of the current location of the simple or compound vowel of a Chinese syllable for the word set and the second participle；

Step S13228, judges whether second maximum value is greater than third preset value, before calculating score Score is Score adds second maximum value before obtaining, and the position of the position of the participle centering first participle and the second participle is all added 1, word after then judging the location updating of the first participle and the second participle whether unisonance, through the above-mentioned steps traversal first participle The phonetic of phonetic and the second participle simultaneously calculates score Score.

In embodiments of the present invention, a method of determining the initial-and-final similarity matrix is also provided. Determining the initial-and-final similarity matrix comprises:

Step S140, collect a second corpus in the form of participle pairs, and mark both participles of each pair in the second corpus in pinyin;

Here, the second corpus is used to build the initial-and-final similarity matrix and may differ from the first corpus described above. The second corpus can be regarded as a standard corpus, i.e. it should not contain erroneous participle pairs, whereas the first corpus may contain erroneous participle pairs. The first corpus is used to form the participle sets provided by the invention through the embodiment shown in Fig. 1.

Here, the pinyin can be marked with the method described above, for example the method of aligning by the most identical pronunciations.

Step S141, determine the total number of times the first note is mispronounced in the second corpus, the first note being an initial or a final;

Here, the second corpus can be a standard corpus, so the richness of the corpus determines the accuracy of error correction to some extent. In this embodiment, in addition to the dictionaries mentioned above, corpus collection also draws on session logs between users: corpus material is mined from online session logs. The main purpose of this step is to take the application field into account and to mine, from historical user logs, error correction candidates that meet the application goal. The idea behind mining session logs is likewise to use the pronunciation similarity of text as the similarity measure. In general there are two main methods: a) user sessions (for example customer service sessions): mine cases where the user actively repairs a pronunciation error, i.e. mine from the session context the similar-sounding results among the differing inputs of repeated attempts; b) manually customized key phrases for the field: mine error-prone candidates by manually customizing key phrases of the field according to the business goal, and mining from a large volume of logs the results that sound similar to the customized phrases.

Step S142, determine the number of times the first note is mispronounced as the second note;

Step S143, determine the probability that the first note transfers to the second note according to the total number of times the first note is mispronounced and the number of times the first note is mispronounced as the second note;

Step S144, determine the total number of times the second note is mispronounced in the second corpus, the second note being an initial or a final;

Step S145, determine the number of times the second note is mispronounced as the first note;

Step S146, determine the probability that the second note transfers to the first note according to the total number of times the second note is mispronounced and the number of times the second note is mispronounced as the first note;

Step S147, determine the similarity between the first note and the second note according to the probability that the first note transfers to the second note and the probability that the second note transfers to the first note.

Here, the pair "let's start the meeting"-"salted vegetables are too expensive" is taken as an example. First mark the pair in pinyin as follows:

"let's start the meeting": x ian z ai t ai h ui;

"salted vegetables are too expensive": x ian c ai t ai g ui;

After alignment, the inconsistent notes are z and c, and h and g.

Now calculate the probability that the first note z transfers to the second note c, i.e. the note transition probability p(c|z). All collected participle pairs whose pronunciations are confused are aligned, the occurrences of each kind of inconsistent note are counted, and the note transition probability p(c|z) is calculated:

p(c|z) = count(z->c)/count(z) (1);

In formula (1): p(c|z) is the transition probability that note z is mispronounced as note c; count(z->c) is the number of times note z is mispronounced as c in the second corpus; count(z) is the total number of times note z is mispronounced in the second corpus;

The probability p(z|c) that the second note c is mispronounced as the first note z is calculated in the same way.

Then the pronunciation similarity Sim(c, z) between note z and note c is determined according to the probability p(c|z) that the first note z transfers to the second note c and the probability p(z|c) that the second note c is mispronounced as the first note z. In implementation it can be obtained with formula (2):

Sim(c, z) = (p(c|z) + p(z|c))/2 (2).
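Formulas (1) and (2) can be illustrated with a minimal Python sketch. The function name `note_similarity`, the input layout (lists of notes already aligned position by position) and the third toy pair are assumptions made for illustration only; they are not taken from the patent text.

```python
from collections import Counter


def note_similarity(confused_pairs):
    """Build p(c|z) and Sim(c, z) from aligned, confusable note sequences.

    confused_pairs: iterable of (notes_a, notes_b); both are lists of notes
    aligned position by position (an assumed input layout).
    """
    pair_counts = Counter()   # count(z->c)
    total_counts = Counter()  # count(z)
    for notes_a, notes_b in confused_pairs:
        for z, c in zip(notes_a, notes_b):
            if z != c:
                pair_counts[(z, c)] += 1
                total_counts[z] += 1

    def p(c, z):
        # Formula (1): p(c|z) = count(z->c) / count(z); 0.0 if z was never confused.
        return pair_counts[(z, c)] / total_counts[z] if total_counts[z] else 0.0

    def sim(c, z):
        # Formula (2): Sim(c, z) = (p(c|z) + p(z|c)) / 2.
        return (p(c, z) + p(z, c)) / 2

    return p, sim


# Toy data: the first two pairs reuse the example above in both directions,
# the third pair is invented purely for illustration.
pairs = [
    ("x ian z ai t ai h ui".split(), "x ian c ai t ai g ui".split()),
    ("x ian c ai t ai g ui".split(), "x ian z ai t ai h ui".split()),
    ("z ai".split(), "zh ai".split()),
]
p, sim = note_similarity(pairs)
```

With this toy data, z is confused twice in total and once as c, so p(c|z) is 0.5, while c is confused only as z, so p(z|c) is 1.0. Averaging the two directions makes Sim symmetric even though the transition probabilities are not.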

It should be noted that notes include initials and finals, so the initial-and-final similarity matrix actually comprises at least three matrices: the similarity matrix between initials and initials, the similarity matrix between initials and finals, and the similarity matrix between finals and finals. Assuming there are 21 initials, the similarity matrix between initials and initials is a 21 × 21 square matrix; assuming there are 39 finals, the similarity matrix between finals and finals is a 39 × 39 square matrix, and the similarity matrix between initials and finals is a 21 × 39 matrix. In the later embodiments, if the transition probability between two initials needs to be determined, the similarity matrix between initials and initials can be queried directly; if the transition probability between two finals needs to be determined, the similarity matrix between finals and finals can be queried directly; if the transition probability between an initial and a final needs to be determined, the similarity matrix between initials and finals can be queried directly.

Based on the foregoing embodiments, an embodiment of the present invention further provides a text error correction method. In implementation, the method can be realized by a processor of a second computing device calling program code, and the program code can be stored in a computer storage medium; the second computing device therefore comprises at least a processor and a storage medium. The second computing device can be any type of electronic device with information processing capability, for example a mobile phone, tablet computer, desktop computer, personal digital assistant, navigator, digital telephone, video telephone, television set, etc.

Fig. 2-1 is the second schematic flowchart of the text error correction method of an embodiment of the present invention. As shown in Fig. 2-1, the method comprises:

Step S201, determine the participles to be corrected, which are participles in the sentence input by the user;

Here, in implementation, the text input by the user is often a sentence or several consecutive participles, so the text needs to be split, and it can be split at participle boundaries. For example, if the user's input is "bobbins, let's start the meeting for we", function words such as modal particles and auxiliary words can be removed during splitting and the text is broken into participles, giving "bobbins-we-let's start the meeting". After splitting, the participles to be corrected are determined to include "bobbins", "we" and "let's start the meeting".

Step S202, judge whether there is a participle set corresponding to the participle to be corrected, the participle set comprising at least one error correction participle, obtained by calculation from pinyin similarity, for correcting the participle to be corrected;

Here, taking "bobbins" as an example, it is judged whether "bobbins" has a corresponding participle set; as known from the above, the participle set of "bobbins" includes "comrades" and "notice door". Likewise, taking "let's start the meeting" as an example, it is judged whether "let's start the meeting" has a corresponding participle set; as known from the above, the participle set of "let's start the meeting" includes "salted vegetables are too expensive".

Step S203, determine the first language model score, which is the language model score of the participle to be corrected in the sentence;

Here, continuing the example of step S202, the language model score of "bobbins" and the language model score of "let's start the meeting" are determined.

Step S204, determine the second language model scores, which are the language model scores in the sentence of each error correction participle in the participle set of the participle to be corrected;

Here, continuing the example of step S202, the language model scores of "comrades" and "notice door" and the language model score of "salted vegetables are too expensive" are determined.

Step S205, judge whether any of the second language model scores is greater than the first language model score, to obtain a judgment result;

Here, continuing the example of the above steps, "let's start the meeting" has only one corresponding second language model score, namely that of "salted vegetables are too expensive", whereas "bobbins" has two, namely those of "comrades" and "notice door". The judgment therefore consists of judging whether the language model score of "salted vegetables are too expensive" is greater than that of "let's start the meeting", whether the language model score of "comrades" is greater than that of "bobbins", and whether the language model score of "notice door" is greater than that of "bobbins". Assuming the language model score of "comrades" is higher than those of "bobbins" and "notice door", and the language model score of "let's start the meeting" is higher than that of "salted vegetables are too expensive", then for "let's start the meeting" the judgment result is that no second language model score is greater than the first language model score, and for "bobbins" the judgment result is that there is a second language model score greater than the first language model score.

Step S206, correct the participle to be corrected according to the judgment result.

Here, step S206, correcting the participle to be corrected according to the judgment result, comprises:

Step S2061, if any of the second language model scores is greater than the first language model score, the error correction participle with the highest language model score is determined as the error correction word for the participle to be corrected. Here, step S206 further comprises: replacing the first participle with the error correction word of the first participle, and outputting the result.

Step S2062, if none of the second language model scores is greater than the first language model score, the participle to be corrected is not corrected.

In this example, assuming the language model score of "comrades" is higher than those of "bobbins" and "notice door", "bobbins" is corrected to "comrades"; assuming the language model score of "let's start the meeting" is higher than that of "salted vegetables are too expensive", "let's start the meeting" is not corrected.
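The decision logic of steps S202 to S206 can be sketched in Python as follows. This is only an illustrative sketch: the language model is passed in as a hypothetical callable `lm_score` that scores a list of participles, and the word scores in `TOY_WORD_SCORES` are invented so that the example reproduces the outcome assumed in the text.

```python
def correct_sentence(participles, correction_sets, lm_score):
    """Replace a participle only when a candidate scores higher in the sentence."""
    first_score = lm_score(participles)                      # step S203
    corrected = list(participles)
    for i, word in enumerate(participles):
        candidates = correction_sets.get(word)               # step S202
        if not candidates:
            continue
        best_word, best_score = word, first_score
        for cand in candidates:                              # step S204
            trial = participles[:i] + [cand] + participles[i + 1:]
            score = lm_score(trial)
            if score > best_score:                           # steps S205 and S2061
                best_word, best_score = cand, score
        corrected[i] = best_word                             # unchanged in the S2062 case
    return corrected


# Invented toy scores standing in for a real language model.
TOY_WORD_SCORES = {
    "comrades": 0.9, "bobbins": 0.2, "notice door": 0.1, "we": 0.5,
    "let's start the meeting": 0.8, "salted vegetables are too expensive": 0.3,
}


def toy_lm_score(words):
    return sum(TOY_WORD_SCORES.get(w, 0.0) for w in words)


correction_sets = {
    "bobbins": ["comrades", "notice door"],
    "let's start the meeting": ["salted vegetables are too expensive"],
}
result = correct_sentence(
    ["bobbins", "we", "let's start the meeting"], correction_sets, toy_lm_score
)
```

Each candidate is scored in the original sentence with only the one participle swapped, which mirrors the wording "language model score in the sentence"; scoring several replacements jointly would be a different design.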

It should be noted that, in implementation, step S202 can judge whether a participle set corresponding to the participle to be corrected exists by querying preset association information. The association information, which can be implemented as a list, an association relation, etc., indicates the correspondence between participles to be corrected and participle sets. The association information can be preset (obtained from the first computing device); it can also be issued by the first computing device to the second computing device, or requested by the second computing device from the first computing device. In other words, as shown in Fig. 2-2, the first computing device 10 that implements the technical solution shown in Fig. 1 can be regarded as a server for the second computing devices 21 and 22 that implement the method shown in Fig. 2-1, and the second computing devices can be regarded as terminals of the first computing device. The first computing device 10 can also update the association information on the users' second computing devices 21 and 22 periodically or irregularly.

In other embodiments of the invention, as shown in Fig. 2-3, on the basis of the method shown in Fig. 1, the method further comprises:

Step S230, the terminal sends an error correction request to the server, the error correction request carrying the sentence input by the user;

Here, a client is installed on the terminal side, and the client can take the form of an application program (App, Application). The user inputs a sentence (or text) at the terminal, the client detects the sentence input by the user, carries the sentence in an error correction request, and sends the error correction request to the server.

Step S231, the server receives the error correction request sent by the terminal;

Step S232, the server determines the participles to be corrected, which are participles in the sentence input by the user;

Here, the text input by the user is generally a sentence or several consecutive participles, so the text needs to be split, and it can be split at participle boundaries. For example, if the user's input is "bobbins, let's start the meeting for we", function words such as modal particles and auxiliary words can be removed during splitting and the text is broken into participles, giving "bobbins-we-let's start the meeting". After splitting, the participles to be corrected are determined to include "bobbins", "we" and "let's start the meeting".

Step S233, the server judges whether a participle set corresponding to the participle to be corrected exists in the error correction dictionary;

Step S234, if a participle set corresponding to the participle to be corrected exists, the server determines the first language model score and the second language model scores. The first language model score is the language model score of the participle to be corrected in the sentence; the participle set comprises at least one error correction participle for correcting the participle to be corrected; the second language model scores are the language model scores in the sentence of each error correction participle in the participle set of the participle to be corrected;

Step S235, judge whether any of the second language model scores is greater than the first language model score;

Step S236, if any of the second language model scores is greater than the first language model score, the server determines the error correction participle with the highest language model score as the error correction word for the participle to be corrected;

Here, steps S232 to S236 above are similar to steps S201 to S206 in the embodiment shown in Fig. 2-1; those skilled in the art can refer to the embodiment shown in Fig. 2-1 to understand steps S232 to S236.

Step S237, the server carries the error correction word in a first error correction response and sends the first error correction response to the terminal.

Step S238, if none of the second language model scores is greater than the first language model score, or if no participle set corresponding to the participle to be corrected exists, the server sends a second error correction response, which indicates that the participle to be corrected is not corrected.

Step S239, the terminal receives the error correction response sent by the server. When the received response is determined to be a first error correction response, the terminal corrects the sentence input by the user according to the first error correction response; when the received response is determined to be a second error correction response, the sentence input by the user is not corrected.

In the embodiment shown in Fig. 2-1, language model scoring is completed on the terminal side, whereas in this embodiment the server completes language model scoring at the request of the terminal. It can be seen that, when the error correction method consumes relatively little of the terminal's hardware, the method shown in Fig. 2-1 can be used, so that text error correction can be completed without networking, i.e. offline. When the error correction method consumes relatively much hardware, the method shown in Fig. 2-3 can be used, which saves hardware resources on the terminal but requires the terminal to be networked with the server.

Based on the foregoing embodiments, an embodiment of the present invention provides a text error correction method based on Chinese pronunciation similarity. It can be applied to correcting Chinese speech recognition results and Chinese pinyin input method results, and can also be used directly as a feature in Chinese semantic similarity calculation. Fig. 3-1 is the fourth schematic flowchart of the text error correction method of an embodiment of the present invention. As shown in Fig. 3-1, the method comprises:

Step S301, mine the pronunciation similarity dictionary;

Here, as shown in Fig. 3-2, step S301 comprises the following steps:

Step S311, collect phrase pairs whose pronunciations are easily confused;

Here, this corpus collection step can collect corpus from the following channels: dictionaries of near-homophone words in Chinese; dictionaries comparing easily confused dialect pronunciations with standard pronunciation; speech recognition error annotation results; online input method error annotation results.

Here, the corpus is collected in the form of phrase pairs, such as "logging off" --- "leg is slightly imperial", "generation gold note" --- "cash equivalent volume", "comrades" --- "bobbins", "dried shrimp" --- "villagers", "salted vegetables are too expensive" --- "let's start the meeting", "sausage pickled melon" --- "chief of township's speech".

Step S312, convert the phrase pairs to pinyin;

This step is realized by looking up a Chinese character-pinyin table to obtain the pinyin of each character in a Chinese dictionary. The processing steps are as follows: 1) for a non-polyphonic character, look up the table directly and convert to pinyin; 2) for a polyphonic character, look up the words that the character forms with the surrounding characters: if such a word exists and has a unique pronunciation, convert to pinyin; if such a word exists but still has multiple pronunciations, use a language model to determine the pronunciation (for example, the pronunciations of "cheap" include "pian-yi" and "bian-yi"); if no such word exists, use the default pronunciation (polyphonic characters have a default pronunciation in the table); 3) for Arabic numerals, convert them to the corresponding Chinese characters and then look up the table; 4) English characters are not processed; 5) for a Chinese character not in the table, skip the character and set the position to empty.
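The lookup rules above can be sketched in Python with a very small table. The tables `CHAR_PINYIN`, `WORD_PINYIN` and `DIGITS` are tiny hypothetical stand-ins for a real character-pinyin dictionary, the word lookup only tries the two-character word starting at the current position, and the language-model disambiguation of rule 2 is left out; all of these are simplifying assumptions.

```python
# Hypothetical miniature tables; a real system would load a full dictionary.
CHAR_PINYIN = {"宜": "yi", "方": "fang", "二": "er", "便": "bian"}  # "便" holds its default reading
WORD_PINYIN = {"便宜": ["pian", "yi"], "方便": ["fang", "bian"]}
DIGITS = {"2": "二"}


def to_pinyin(text):
    """Return one pinyin syllable (or None) per character of `text`."""
    chars = [DIGITS.get(ch, ch) for ch in text]              # rule 3: digits to characters
    result = [None] * len(chars)
    i = 0
    while i < len(chars):
        word = "".join(chars[i:i + 2])
        if word in WORD_PINYIN:                              # rule 2: word with a unique reading
            result[i:i + 2] = WORD_PINYIN[word]
            i += 2
            continue
        ch = chars[i]
        if ch.isascii() and ch.isalpha():                    # rule 4: English left unprocessed
            result[i] = ch
        else:
            # rule 1 (plain lookup), rule 2 fallback (default reading),
            # or None for rule 5 (character missing from the table)
            result[i] = CHAR_PINYIN.get(ch)
        i += 1
    return result
```

For example, `to_pinyin("便宜")` resolves the polyphonic character through the word table, while `to_pinyin("便")` alone falls back to the default reading stored in the character table.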

Step S313, split the pinyin into initials and finals and align;

Here, since only a few notes are mispronounced in a pair of similar-sounding phrases, the alignment method of the most identical pronunciations is used, for example:

"let's start the meeting": x ian z ai t ai h ui;

"salted vegetables are too expensive": x ian c ai t ai g ui;

After alignment, the inconsistent notes are z and c, and h and g.
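The splitting and comparison in this example can be sketched in Python as below. The sketch assumes the 21 standard initials, treats syllables without one of them as having an empty initial, and compares two phrases of equal length position by position; the general "most identical pronunciations" alignment for phrases of different lengths is not implemented here. The function names are illustrative.

```python
# Two-letter initials come first so that "zh" is not mistaken for "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]


def split_syllable(syllable):
    """Split one pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # no initial, e.g. "ai"


def mismatched_notes(syllables_a, syllables_b):
    """List the (note_a, note_b) pairs that differ at the same position."""
    diffs = []
    for a, b in zip(syllables_a, syllables_b):
        (init_a, final_a), (init_b, final_b) = split_syllable(a), split_syllable(b)
        if init_a != init_b:
            diffs.append((init_a, init_b))
        if final_a != final_b:
            diffs.append((final_a, final_b))
    return diffs


diffs = mismatched_notes(["xian", "zai", "tai", "hui"], ["xian", "cai", "tai", "gui"])
```

For the two pinyin strings of the example, the sketch reports the same inconsistent notes as the text: z versus c, and h versus g.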

Step S314, calculate the transition probabilities between initials and finals;

Here, all collected confusable pronunciation pairs are aligned according to the above method, the occurrences of each kind of inconsistent note are counted, and the transition probability p(c|z) that note z is mispronounced as note c is calculated:

p(c|z) = count(z->c)/count(z);

where p(c|z) is the transition probability that note z is mispronounced as note c; count(z->c) is the number of times note z is mispronounced as c in the corpus; count(z) is the total number of times note z is mispronounced in the corpus.

Step S315, calculate the similarity score between any two notes;

Here, from p(c|z) and p(z|c) calculated in the previous step, the pronunciation similarity between note z and note c is defined as: Sim(c, z) = (p(c|z) + p(z|c))/2;

By calculating the similarity between any two notes, an initial-and-final similarity matrix between notes is obtained, which comprises the similarity matrices between initials and initials, between initials and finals, and between finals and finals.

Step S302, calculate phrase pronunciation similarity;

Based on the pronunciation similarity between notes calculated in step S301, this step calculates the pronunciation similarity between any two given phrases. The detailed process is shown in Fig. 3-3 and comprises:

Step S321, preprocess Arabic numerals, for example convert "2" to "two", to make pinyin extraction easier;

Step S322, convert Chinese characters to pinyin, as in step S312;

Step S323, split the pronunciation of each character in the pinyin string into initial and final;

Step S324, traverse the two pinyin strings character by character and calculate the similarity score;

Here, assume that the current position in the first participle is pos₁ and the current position in the second participle is pos₂; ScoreSS, ScoreYY and ScoreSY are the similarity scores between initial and initial, between final and final, and between initial and final respectively, all of which can be obtained by querying the above initial-and-final similarity matrix; Score is the score. The similarity score is then calculated as shown in Fig. 3-4, comprising:

Step S3241, start, and set pos₁=1, pos₂=1;

Step S3242, judge whether the character at the current position of the first participle is identical to the character at the current position of the second participle. If identical, then Score+=1, pos₁+=1, pos₂+=1, and continue, i.e. return to step S3242; if not identical, go to step S3243;

Step S3243, judge whether S=ScoreSS*ScoreYY is greater than 0.8. If S=ScoreSS*ScoreYY>0.8, then Score+=S, pos₁+=1, pos₂+=1, and continue, i.e. return to step S3242; if S=ScoreSS*ScoreYY≤0.8, determine the similarity against the adjacent character of the first participle and the second participle, i.e. go to step S3244.

Step S3244, if S=ScoreSS*ScoreYY≤0.8, judge whether, at pos₁ and pos₂+1, S=max(ScoreSS*ScoreYY, ScoreSY1, ScoreSY2)>0.8;

If, at pos₁ and pos₂+1, S=max(ScoreSS*ScoreYY, ScoreSY1, ScoreSY2)>0.8, then Score+=S, pos₁+=1, pos₂+=2, and continue, i.e. return to step S3242; if, at pos₁ and pos₂+1, S=max(ScoreSS*ScoreYY, ScoreSY1, ScoreSY2)≤0.8, go to step S3245;

Step S3245, judge whether, at pos₁+1 and pos₂, S=max(ScoreSS*ScoreYY, ScoreSY1, ScoreSY2)>0.8;

If, at pos₁+1 and pos₂, S=max(ScoreSS*ScoreYY, ScoreSY1, ScoreSY2)>0.8, then Score+=S, pos₁+=2, pos₂+=1, and continue, i.e. return to step S3242;

When the traversal through steps S3242 to S3245 ends, Score is the similarity score of the two participles.
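The traversal of steps S3241 to S3245 can be sketched in Python as follows. The sketch makes several assumptions that are not in the text: each participle is given as a list of (initial, final) tuples, positions are 0-based, "identical" is taken to mean the same initial and final, the note similarity is supplied as a hypothetical callable `sim(x, y)`, and when none of the three checks exceeds the threshold both positions advance by one without adding to the score (the text does not state what happens in that case).

```python
def traversal_score(a, b, sim, threshold=0.8):
    """Accumulate Score over two lists of (initial, final) tuples."""

    def window_max(x, y):
        # max(ScoreSS*ScoreYY, ScoreSY1, ScoreSY2) for characters x and y
        (init_x, final_x), (init_y, final_y) = x, y
        return max(sim(init_x, init_y) * sim(final_x, final_y),
                   sim(init_x, final_y),
                   sim(final_x, init_y))

    score, i, j = 0.0, 0, 0                                  # S3241 (0-based here)
    while i < len(a) and j < len(b):
        if a[i] == b[j]:                                     # S3242
            score += 1
            i += 1
            j += 1
            continue
        s = sim(a[i][0], b[j][0]) * sim(a[i][1], b[j][1])    # S3243
        if s > threshold:
            score += s
            i += 1
            j += 1
            continue
        if j + 1 < len(b):                                   # S3244: pos1 against pos2+1
            s = window_max(a[i], b[j + 1])
            if s > threshold:
                score += s
                i += 1
                j += 2
                continue
        if i + 1 < len(a):                                   # S3245: pos1+1 against pos2
            s = window_max(a[i + 1], b[j])
            if s > threshold:
                score += s
                i += 2
                j += 1
                continue
        i += 1                                               # assumed fallback, not in the text
        j += 1
    return score


# Invented toy similarities standing in for the initial-and-final similarity matrix.
TOY_SIM = {frozenset(("z", "c")): 0.9, frozenset(("h", "g")): 0.85}


def toy_sim(x, y):
    return 1.0 if x == y else TOY_SIM.get(frozenset((x, y)), 0.0)


first = [("x", "ian"), ("z", "ai"), ("t", "ai"), ("h", "ui")]
second = [("x", "ian"), ("c", "ai"), ("t", "ai"), ("g", "ui")]
score = traversal_score(first, second, toy_sim)
```

The two look-ahead checks are what make the score tolerant of one extra or one missing character: when the current characters do not match, the traversal tries skipping one character in either participle before giving up on the position.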

Step S325, normalize the similarity score as follows:

Sf = Score*2/(Size1*Size2)

where Sf is the final score after normalization, Score is the score obtained from the traversal in the previous step, Size1 is the number of characters of the first Chinese character string, and Size2 is the number of characters of the second Chinese character string;

Step S303, mine error correction candidates;

Based on the similarity calculation method of the previous step, error correction candidates are mined from online interaction logs. The main purpose of this step is to take the application field into account and to mine, from historical user logs, error correction candidates that meet the application goal.

The idea behind mining error correction candidates is the same as for conventional error correction problems; the difference is that the pronunciation similarity of text is used as the similarity measure. There are two main methods: a) user sessions (for example customer service sessions): mine cases where the user actively repairs a pronunciation error, i.e. mine from the session context the similar-sounding results among the differing inputs of repeated attempts; b) manually customized key phrases for the field: mine error-prone candidates by manually customizing key phrases of the field according to the business goal, and mining from a large volume of logs the results that sound similar to the customized phrases.

Step S304, error correction；

Online error correction is carried out to (participle is gathered) based on error correction candidate, the thinking of the embodiment of the present invention is as follows:

1) user inputs S0 participle；

Adjacent multiple word combination phrases search whether that there are error correction candidates (to attempt adjacent 1 to 4 phrases respectively from candidate The phrase of conjunction), there are error correction candidates then to replace corresponding phrase in original input, as a kind of user may input Si (i=1, 2, ,).

2) The language model scores of the original user input S0 and of each possible input Si are calculated (a language model score measures the fluency of a sentence).

3) The score of S0 is compared with the scores of the Si.

If the score of S0 is the highest, no error correction is performed; if the score of some Si is higher, the replacement contained in that Si is used as the correction.
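Steps 1) to 3) can be sketched as below. This is a minimal illustration, not the implementation of the embodiment: the candidate table is assumed to be a dictionary from a phrase to its possible corrections, only one replacement is applied per alternative, and `toy_lm_score` is a hypothetical stand-in for a real language model that simply counts known bigrams.

```python
from typing import Callable, Dict, List

Tokens = List[str]


def generate_alternatives(
    tokens: Tokens, candidates: Dict[str, List[str]], max_len: int = 4
) -> List[Tokens]:
    """Build inputs Si by replacing one run of 1..max_len adjacent tokens
    with each of its error-correction candidates."""
    alternatives: List[Tokens] = []
    for n in range(1, max_len + 1):
        for start in range(len(tokens) - n + 1):
            phrase = "".join(tokens[start:start + n])
            for fix in candidates.get(phrase, []):
                alternatives.append(tokens[:start] + [fix] + tokens[start + n:])
    return alternatives


def correct(
    tokens: Tokens,
    candidates: Dict[str, List[str]],
    lm_score: Callable[[Tokens], float],
) -> Tokens:
    """Return S0 unless some Si has a strictly higher language model score."""
    best, best_score = tokens, lm_score(tokens)
    for alt in generate_alternatives(tokens, candidates):
        score = lm_score(alt)
        if score > best_score:
            best, best_score = alt, score
    return best


# Hypothetical toy language model: counts bigrams found in a small whitelist.
GOOD_BIGRAMS = {("想", "订"), ("订", "外卖")}


def toy_lm_score(tokens: Tokens) -> float:
    return float(sum((a, b) in GOOD_BIGRAMS for a, b in zip(tokens, tokens[1:])))


corrected = correct(["我", "想", "定", "外卖"], {"定": ["订"]}, toy_lm_score)
```

Requiring a strictly higher score keeps the original input whenever no alternative is clearly more fluent, which matches the "no error correction" branch above.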

As can be seen from the above embodiments, the embodiment of the present invention uses the phonetic-symbol transition probabilities mined from pronunciation-similar texts as the phonetic-symbol similarity, and relaxes the alignment requirement on the pinyin, that is, the most similar phonetic symbol is sought within a permitted window. When a participle containing Arabic numerals is processed, the Arabic numerals are first converted into Chinese characters, so that the similarity between a participle containing Arabic numerals and other participles can be calculated. With the above technical means, the technical solution provided by the embodiment of the present invention has the following technical advantages: 1) the pronunciation similarity is obtained by a statistics-based method whose data come from user behavior, so it better represents the similarity between phonetic symbols in real application scenarios and gives more accurate results; 2) the degree of pronunciation similarity can be obtained between phonetic symbols of different pronunciation types as well as of the same pronunciation type, and it is a floating-point value, so the similarities between different phonetic symbols are more comparable; 3) when positions are aligned to calculate the pronunciation similarity of texts, the optimal alignment may be sought within a window, which makes the similarity calculation robust to missing or extra characters.

Based on the foregoing embodiments, an embodiment of the present invention provides a text error correction device. Each unit included in the device, each module included in each unit, and each submodule included in each module can be implemented by a processor in a first computing device, or of course by a specific logic circuit. In specific embodiments, the processor may be a central processing unit (CPU), a microprocessor (MPU), a digital signal processor (DSP), a field programmable gate array (FPGA), or the like.

Fig. 4 is a first schematic diagram of the composition of the text error correction device of the embodiment of the present invention. As shown in Fig. 4, the device includes a first forming unit 401, a labeling unit 402, a first determination unit 403, a first judging unit 404 and a second determination unit 405, wherein:

The first forming unit 401 is configured to collect a first corpus in the form of participle pairs;

The labeling unit 402 is configured to label both participles of the participle pair in pinyin form;

The first determination unit 403 is configured to determine the pinyin similarity between the two participles of the participle pair, the similarity indicating the degree of similarity between the pinyin of the first participle and the pinyin of the second participle of the participle pair;

The first judging unit 404 is configured to judge whether the similarity satisfies a preset condition;

The second determination unit 405 is configured to, if the similarity satisfies the preset condition, identify the two participles of the participle pair as error-correction participles of each other.

In other embodiments of the present invention, the labeling unit includes a first judgment module and a first labeling module, wherein:

The first judgment module is configured to judge whether the participle pair contains a polyphone;

The first labeling module is configured to, if the participle pair does not contain a polyphone, label both participles of the participle pair in pinyin form.

In other embodiments of the present invention, the labeling unit further includes a second judgment module and a second labeling module, wherein:

The second judgment module is configured to, if the participle pair contains a polyphone, continue to judge whether either of the two participles of the participle pair contains a polyphonic word;

The second labeling module is configured to, if at least one of the two participles of the participle pair contains a polyphonic word, label the two or more pinyin readings corresponding to the polyphonic word as part or all of the pinyin of the corresponding participle of the participle pair.
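One way to picture the labeling with two or more pinyin readings is to enumerate every combination of readings for the characters of a participle. The sketch below assumes that a polyphonic word is a character with more than one reading; `READINGS` is a hypothetical toy table (a real system would use a full pinyin dictionary), and `pinyin_variants` is an illustrative name rather than anything defined in the embodiment.

```python
from itertools import product
from typing import Dict, List

# Hypothetical toy reading table, tones omitted.
READINGS: Dict[str, List[str]] = {
    "银": ["yin"],
    "行": ["hang", "xing"],
    "重": ["zhong", "chong"],
    "庆": ["qing"],
}


def pinyin_variants(word: str) -> List[List[str]]:
    """Return every pinyin sequence for the word, one per combination of readings."""
    per_char = [READINGS[ch] for ch in word]
    return [list(combo) for combo in product(*per_char)]


variants = pinyin_variants("银行")
```

A participle without any polyphonic word yields exactly one sequence, so the same function also covers the case handled by the first labeling module.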

In other embodiments of the present invention, the labeling unit includes a third judgment module, a conversion module and a third labeling module, wherein:

The third judgment module is configured to judge whether the participle pair contains Arabic numerals;

The conversion module is configured to, if the participle pair contains Arabic numerals, convert the Arabic numerals into the corresponding Chinese characters;

The third labeling module is configured to label, in pinyin form, the participles of the participle pair after conversion into Chinese characters.
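The conversion step can be illustrated with a digit-by-digit mapping. This is an assumption made for the sketch: the embodiment does not state whether multi-digit numbers are read digit by digit or as a quantity (for example, whether "12" becomes 一二 or 十二). `DIGITS` and `digits_to_chinese` are hypothetical names.

```python
# Digit-by-digit mapping from Arabic numerals to Chinese characters.
DIGITS = {
    "0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
    "5": "五", "6": "六", "7": "七", "8": "八", "9": "九",
}


def digits_to_chinese(text: str) -> str:
    """Replace each Arabic digit with its Chinese character; leave other characters as-is."""
    return "".join(DIGITS.get(ch, ch) for ch in text)


converted = digits_to_chinese("7天酒店")
```

After this conversion the participle can be labeled in pinyin like any other, so that it becomes comparable with participles written entirely in Chinese characters.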

In other embodiments of the present invention, the first determination unit includes an alignment module, a computing module and a first determining module, wherein:

The alignment module is configured to align the initials of the pinyin of the two participles of the participle pair and to align the finals of the pinyin of the two participles;

The computing module is configured to calculate the transition probability of the pinyin of the first participle of the participle pair being converted into the pinyin of the second participle;

The first determining module is configured to determine the pinyin similarity between the two participles of the participle pair according to the transition probability.

In other embodiments of the present invention, the alignment module is configured to align the initials of the pinyin of the two participles of the participle pair and align the finals of the pinyin of the two participles using the alignment that yields the most identical pronunciations.

In other embodiments of the present invention, the computing module includes a computational submodule, a first determining submodule, a second determining submodule and a conversion submodule, wherein:

The computational submodule is configured to, if the characters at the same position of the two aligned participles are homophones, add 1 to the score Score and increment by 1 both the position in the first participle and the position in the second participle of the participle pair;

The first determining submodule is configured to, if the characters at the same position of the two aligned participles are not homophones, determine the score Score of the pinyin of the first participle and the pinyin of the second participle according to a preset initial-final similarity matrix;

The second determining submodule is configured to determine the normalized final score Sf according to the score Score, the number of characters of the first participle and the number of characters of the second participle;

The conversion submodule is configured to determine, according to the final score Sf, the transition probability of the pinyin of the first participle of the participle pair being converted into the pinyin of the second participle.

In other embodiments of the present invention, the second determining submodule is configured to:

Obtain, according to the preset initial-final similarity matrix, the similarity between the initials and the similarity between the finals of the characters at the same position of the two participles;

If the product S of the similarity between the initials and the similarity between the finals is greater than a first preset value, add S to the score Score and increment by 1 both the position in the first participle and the position in the second participle of the participle pair;

If the product S of the similarity between the initials and the similarity between the finals of the characters at the same position of the two participles is less than or equal to the first preset value, obtain, according to the initial-final similarity matrix, for the character at the current position of the first participle and the character at the next position of the second participle: the similarity between the initials, the similarity between the finals, the similarity between the initial and the final, and the similarity between the final and the initial;

Determine a first maximum, the first maximum being the largest of the following three values: the product S of the similarity between the initials and the similarity between the finals of the character at the current position of the first participle and the character at the next position of the second participle; the similarity between the initial of the character at the current position of the first participle and the final of the character at the next position of the second participle; and the similarity between the final of the character at the current position of the first participle and the initial of the character at the next position of the second participle;

Judge whether the first maximum is greater than a second preset value; if so, the score Score is the previously obtained Score plus the first maximum, and both the position in the first participle and the position in the second participle of the participle pair are incremented by 1;

If the first maximum is less than or equal to the second preset value, obtain, according to the initial-final similarity matrix, for the character at the next position of the first participle and the character at the current position of the second participle: the similarity between the initials, the similarity between the finals, the similarity between the initial and the final, and the similarity between the final and the initial;

Determine a second maximum, the second maximum being the largest of the following three values: the product S of the similarity between the initials and the similarity between the finals of the character at the next position of the first participle and the character at the current position of the second participle; the similarity between the initial of the character at the next position of the first participle and the final of the character at the current position of the second participle; and the similarity between the final of the character at the next position of the first participle and the initial of the character at the current position of the second participle;

Judge whether the second maximum is greater than a third preset value; if so, the score Score is the previously obtained Score plus the second maximum, and both the position in the first participle and the position in the second participle of the participle pair are incremented by 1. Then judge whether the characters at the updated positions of the first participle and the second participle are homophones, and traverse the pinyin of the first participle and the pinyin of the second participle through the above steps to calculate the score Score.
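The traversal described above can be sketched as follows. Several points are assumptions made for illustration only: each character is represented as an (initial, final) tuple with tones ignored, the similarity matrix is a dictionary keyed by symbol pairs and treated as symmetric, a position where no branch passes its threshold contributes 0 and is simply skipped, and the three preset values default to 0.5. The names `sim`, `cross_max` and `traverse_score` are hypothetical.

```python
from typing import Dict, List, Tuple

Syllable = Tuple[str, str]          # (initial, final), tone ignored
Matrix = Dict[Tuple[str, str], float]


def sim(matrix: Matrix, a: str, b: str) -> float:
    """Look up the similarity of two symbols; identical symbols score 1."""
    if a == b:
        return 1.0
    return matrix.get((a, b), matrix.get((b, a), 0.0))


def cross_max(matrix: Matrix, x: Syllable, y: Syllable) -> float:
    """Largest of: initial*final product, initial-vs-final, final-vs-initial."""
    return max(
        sim(matrix, x[0], y[0]) * sim(matrix, x[1], y[1]),
        sim(matrix, x[0], y[1]),
        sim(matrix, x[1], y[0]),
    )


def traverse_score(first: List[Syllable], second: List[Syllable], matrix: Matrix,
                   t1: float = 0.5, t2: float = 0.5, t3: float = 0.5) -> float:
    """Accumulate Score while walking both participles one position at a time."""
    i = j = 0
    score = 0.0
    while i < len(first) and j < len(second):
        a, b = first[i], second[j]
        if a == b:                                   # homophones
            gain = 1.0
        else:
            gain = sim(matrix, a[0], b[0]) * sim(matrix, a[1], b[1])
            if gain <= t1:                           # look ahead in the second participle
                gain = cross_max(matrix, a, second[j + 1]) if j + 1 < len(second) else 0.0
                if gain <= t2:                       # look ahead in the first participle
                    gain = cross_max(matrix, first[i + 1], b) if i + 1 < len(first) else 0.0
                    if gain <= t3:
                        gain = 0.0                   # assumption: no match adds nothing
        score += gain
        i += 1                                       # both positions advance by 1
        j += 1
    return score


toy_matrix: Matrix = {("b", "p"): 0.6}
same = traverse_score([("b", "ei"), ("j", "ing")], [("b", "ei"), ("j", "ing")], toy_matrix)
near = traverse_score([("b", "a")], [("p", "a")], toy_matrix)
shifted = traverse_score([("j", "ing"), ("d", "ong")],
                         [("x", "i"), ("j", "ing"), ("d", "ong")], toy_matrix)
```

The `shifted` example shows the effect of the window: although the second participle has one extra leading character, each character of the first participle still finds its match at the next position of the second.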

In other embodiments of the present invention, the device further includes a third determination unit configured to determine the initial-final similarity matrix. The third determination unit further includes a second determining module, a third determining module, a fourth determining module, a fifth determining module, a sixth determining module, a seventh determining module and an eighth determining module, wherein:

The second determining module is configured to determine the total number of times a first phonetic symbol is mispronounced in the second corpus, the first phonetic symbol being an initial or a final;

The third determining module is configured to determine the number of times the first phonetic symbol is mispronounced as a second phonetic symbol;

The fourth determining module is configured to determine the probability of the first phonetic symbol transferring to the second phonetic symbol according to the total number of times the first phonetic symbol is mispronounced and the number of times the first phonetic symbol is mispronounced as the second phonetic symbol;

The fifth determining module is configured to determine the total number of times the second phonetic symbol is mispronounced in the second corpus, the second phonetic symbol being an initial or a final;

The sixth determining module is configured to determine the number of times the second phonetic symbol is mispronounced as the first phonetic symbol;

The seventh determining module is configured to determine the probability of the second phonetic symbol transferring to the first phonetic symbol according to the total number of times the second phonetic symbol is mispronounced and the number of times the second phonetic symbol is mispronounced as the first phonetic symbol;

The eighth determining module is configured to determine the similarity between the first phonetic symbol and the second phonetic symbol according to the probability of the first phonetic symbol transferring to the second phonetic symbol and the probability of the second phonetic symbol transferring to the first phonetic symbol.
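The counting performed by these modules can be sketched as below. The transfer probability is taken here as the number of times one symbol is mispronounced as another divided by the total number of times the first symbol is mispronounced. How the two directional probabilities are combined into one similarity is not stated in this passage; averaging them is purely an assumption of the sketch, as are the names `transition_probabilities` and `symbol_similarity` and the toy counts.

```python
from collections import defaultdict
from typing import Dict, Tuple

Pair = Tuple[str, str]


def transition_probabilities(confusions: Dict[Pair, int]) -> Dict[Pair, float]:
    """confusions[(a, b)] = times a was mispronounced as b; returns P(a -> b)."""
    totals: Dict[str, int] = defaultdict(int)
    for (src, _dst), count in confusions.items():
        totals[src] += count
    return {(src, dst): count / totals[src]
            for (src, dst), count in confusions.items()}


def symbol_similarity(probs: Dict[Pair, float], a: str, b: str) -> float:
    """Assumed combination: the mean of P(a -> b) and P(b -> a)."""
    if a == b:
        return 1.0
    return (probs.get((a, b), 0.0) + probs.get((b, a), 0.0)) / 2


# Hypothetical toy confusion counts between initials.
confusions = {("b", "p"): 6, ("b", "m"): 2, ("p", "b"): 3, ("p", "m"): 1}
probs = transition_probabilities(confusions)
bp = symbol_similarity(probs, "b", "p")
bm = symbol_similarity(probs, "b", "m")
```

Filling a table with `symbol_similarity` for every pair of initials and every pair of finals would give a floating-point matrix of the kind used in the scoring steps.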

It should be pointed out that the description of the above device embodiments is similar to the description of the above method embodiments, and the device embodiments have beneficial effects similar to those of the method embodiments. For technical details not disclosed in the device embodiments of the present invention, please refer to the description of the method embodiments of the present invention.

Based on the foregoing embodiments, an embodiment of the present invention provides a text error correction device. Each unit included in the device can be implemented by a processor in a second computing device, or of course by a specific logic circuit. In specific embodiments, the processor may be a central processing unit (CPU), a microprocessor (MPU), a digital signal processor (DSP), a field programmable gate array (FPGA), or the like.

Fig. 5 is a second schematic diagram of the composition of the text error correction device of the embodiment of the present invention. As shown in Fig. 5, the device includes a fourth determination unit 501, a second judging unit 502, a fifth determination unit 503, a sixth determination unit 504, a third judging unit 505 and an error correction unit 506, wherein:

The fourth determination unit 501 is configured to determine a participle to be corrected, the participle to be corrected being a participle in the sentence input by the user;

The second judging unit 502 is configured to judge whether there is a participle set corresponding to the participle to be corrected, the participle set including at least one error-correction participle, obtained by calculation according to pinyin similarity, for correcting the participle to be corrected;

The fifth determination unit 503 is configured to determine a first language model score, the first language model score being the language model score of the participle to be corrected in the sentence;

The sixth determination unit 504 is configured to determine second language model scores, the second language model scores being the language model scores, in the sentence, of each error-correction participle in the participle set;

The third judging unit 505 is configured to judge whether any of the second language model scores is greater than the first language model score, and to obtain a judgment result;

The error correction unit 506 is configured to perform error correction on the participle to be corrected according to the judgment result.

In other embodiments of the present invention, the error correction unit is configured to: if any of the second language model scores is greater than the first language model score, determine the error-correction participle with the highest language model score as the error-correction word for the participle to be corrected; if none of the second language model scores is greater than the first language model score, perform no error correction on the participle to be corrected.

In other embodiments of the present invention, the device further includes a first forming unit, a labeling unit, a first determination unit, a first judging unit, a second determination unit and a second forming unit, wherein:

The labeling unit is configured to label both participles of the participle pair in pinyin form;

The first judging unit is configured to judge whether the similarity satisfies a preset condition;

The second determination unit is configured to, if the similarity satisfies the preset condition, identify the two participles of the participle pair as error-correction participles of each other;

The second forming unit is configured to form the participle set from the error-correction participles.

Based on the foregoing embodiments, an embodiment of the present invention provides a computing device. Fig. 6 is a schematic diagram of the composition of the server of the embodiment of the present invention. As shown in Fig. 6, the computing device 600 may include components such as at least one processor 601, at least one communication bus 602, a user interface 603, at least one external communication interface 604 and a memory 605 for storing an executable program. The communication bus 602 is used to realize connection and communication among the processor 601, the user interface 603, the external communication interface 604 and the memory 605. The user interface 603 may include a display screen and a keyboard. The external communication interface 604 may optionally include a wired interface and a wireless interface.

The processor 601 is configured to:

Collect a first corpus in the form of participle pairs;

Send the error correction dictionary to a terminal through the external communication interface 604.

It should be pointed out that the description of the above server embodiment is similar to the description of the above method, and the server embodiment has the same beneficial effects as the method embodiments. For technical details not disclosed in the server embodiment of the present invention, those skilled in the art should refer to the description of the method embodiments of the present invention.

It should be noted that, in the embodiment of the present invention, if the above text error correction method is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the embodiment of the present invention, in essence, or the part of it that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk or an optical disc. Thus, the embodiments of the present invention are not limited to any specific combination of hardware and software. Correspondingly, an embodiment of the present invention further provides a computer storage medium in which computer-executable instructions are stored, the computer-executable instructions being used to execute the text error correction method in the embodiments of the present invention.

It should be understood that references throughout the specification to "one embodiment" or "an embodiment" mean that a particular feature, structure or characteristic related to the embodiment is included in at least one embodiment of the present invention. Therefore, occurrences of "in one embodiment" or "in an embodiment" throughout the specification do not necessarily refer to the same embodiment. In addition, these particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present invention, the magnitude of the sequence numbers of the above processes does not imply an order of execution; the order of execution of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention. The sequence numbers of the embodiments of the present invention are for description only and do not represent the merits of the embodiments.

It should be noted that, in this document, the terms "include", "comprise" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.

In the several embodiments provided in this application, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms. The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, the functional units in the embodiments of the present invention may all be integrated into one processing unit, or each unit may serve as a separate unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.

Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium, and when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a read-only memory (ROM), a magnetic disk or an optical disc. Alternatively, if the above integrated unit of the present invention is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the embodiment of the present invention, in essence, or the part of it that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a ROM, a magnetic disk or an optical disc.

The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall be covered by the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.

Claims

1. A text error correction method, characterized in that the method comprises:

Collecting a first corpus in the form of participle pairs;

Determining the pinyin similarity between the two participles of the participle pair, the similarity indicating the degree of similarity between the pinyin of the first participle and the pinyin of the second participle of the participle pair;

If the similarity satisfies a preset condition, identifying the two participles of the participle pair as error-correction participles of each other, or identifying the first participle as the error-correction participle of the second participle;

Wherein the preset condition is a threshold, and the similarity satisfies the preset condition when the similarity is greater than the threshold;

The labeling of both participles of a participle pair in the first corpus in pinyin form comprises: if the participle pair contains a polyphone and at least one of the two participles of the participle pair contains a polyphonic word, labeling the participle of the participle pair that contains the polyphonic word in the form of two or more pinyin readings.

2. The method according to claim 1, characterized in that the labeling of both participles of the participle pair in pinyin form further comprises:

If the participle pair does not contain a polyphone, labeling both participles of the participle pair in pinyin form.

3. The method according to claim 1, characterized in that the labeling of both participles of the participle pair in pinyin form comprises:

If the participle pair contains Arabic numerals, converting the Arabic numerals into the corresponding Chinese characters;

Labeling, in pinyin form, the participles of the participle pair after conversion into Chinese characters.

4. The method according to any one of claims 1 to 3, characterized in that the determining of the pinyin similarity between the two participles of the participle pair comprises:

Aligning the initials of the pinyin of the two participles of the participle pair and aligning the finals of the pinyin of the two participles;

Calculating the transition probability of the pinyin of the first participle of the participle pair being converted into the pinyin of the second participle;

Determining the pinyin similarity between the two participles of the participle pair according to the transition probability.

5. The method according to claim 4, characterized in that the aligning of the initials of the pinyin of the two participles of the participle pair and the aligning of the finals of the pinyin of the two participles comprises:

Aligning the initials of the pinyin of the two participles of the participle pair and aligning the finals of the pinyin of the two participles using the alignment that yields the most identical pronunciations.

6. The method according to claim 4, characterized in that the calculating of the transition probability of the pinyin of the first participle of the participle pair being converted into the pinyin of the second participle comprises:

Determining the number of phonetic symbols that differ between the first participle and the second participle;

Determining the transition probability according to the number of differing phonetic symbols and the length of the phonetic-symbol string of the first participle or of the second participle.

7. The method according to claim 4, characterized in that the calculating of the transition probability of the pinyin of the first participle of the participle pair being converted into the pinyin of the second participle comprises:

If the characters at the same position of the two aligned participles are homophones, adding 1 to the score Score and incrementing by 1 both the position in the first participle and the position in the second participle of the participle pair;

If the characters at the same position of the two aligned participles are not homophones, determining the score Score of the pinyin of the first participle and the pinyin of the second participle according to a preset initial-final similarity matrix;

Determining the normalized final score Sf according to the score Score, the number of characters of the first participle and the number of characters of the second participle;

Determining, according to the final score Sf, the transition probability of the pinyin of the first participle of the participle pair being converted into the pinyin of the second participle.

8. The method according to claim 1, characterized in that the determining of the pinyin similarity between the two participles of the participle pair comprises:

Determining the pinyin similarity between the two participles of the participle pair using a preset initial-final similarity matrix.

9. The method according to claim 7 or 8, characterized in that determining the initial-final similarity matrix comprises:

Collecting a second corpus in the form of participle pairs;

Labeling both participles of each participle pair in the second corpus in pinyin form;

Determining the total number of times a first phonetic symbol is mispronounced in the second corpus, the first phonetic symbol being an initial or a final;

Determining the number of times the first phonetic symbol is mispronounced as a second phonetic symbol;

Determining the probability of the first phonetic symbol transferring to the second phonetic symbol according to the total number of times the first phonetic symbol is mispronounced and the number of times the first phonetic symbol is mispronounced as the second phonetic symbol;

Determining the total number of times the second phonetic symbol is mispronounced in the second corpus, the second phonetic symbol being an initial or a final;

Determining the number of times the second phonetic symbol is mispronounced as the first phonetic symbol;

Determining the probability of the second phonetic symbol transferring to the first phonetic symbol according to the total number of times the second phonetic symbol is mispronounced and the number of times the second phonetic symbol is mispronounced as the first phonetic symbol;

Determining the similarity between the first phonetic symbol and the second phonetic symbol according to the probability of the first phonetic symbol transferring to the second phonetic symbol and the probability of the second phonetic symbol transferring to the first phonetic symbol, and forming the initial-final similarity matrix from the similarity between the first phonetic symbol and the second phonetic symbol.

10. the method according to claim 1, wherein the method also includes:

The error correction dictionary is sent to terminal.

11. The method according to claim 1, characterized in that the method further comprises:

receiving an error correction request sent by a terminal, the error correction request carrying a sentence input by a user;

determining a participle to be corrected, the participle to be corrected being a participle in the sentence input by the user;

if a participle set corresponding to the participle to be corrected exists, determining a first language model score and second language model scores, the first language model score being the language model score of the participle to be corrected in the sentence, the participle set including at least one error correction participle for correcting the participle to be corrected, and the second language model scores being the language model scores, in the sentence, of the respective error correction participles in the participle set;

if any of the second language model scores is greater than the first language model score, determining the error correction participle with the highest language model score as the error correction word for the participle to be corrected;

carrying the error correction word in a first error correction response, and sending the first error correction response to the terminal.

12. The method according to claim 11, characterized in that the method further comprises:

if none of the second language model scores is greater than the first language model score, or if no participle set corresponding to the participle to be corrected exists, sending a second error correction response, the second error correction response indicating that the participle to be corrected is not corrected.
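The decision logic of claims 11 and 12 can be sketched as follows. This is only an illustrative sketch: the bigram table, the toy scoring function `toy_lm_score`, the dictionary contents, and the convention of returning `None` for the second error correction response are all hypothetical and stand in for whatever language model and response format an implementation actually uses.

```python
# Hypothetical bigram counts standing in for a real language model.
BIGRAM_COUNTS = {("我", "想"): 5, ("想", "吃"): 4, ("吃", "火锅"): 6}

def toy_lm_score(words):
    """Toy language model score: sum of the counts of adjacent word pairs."""
    return sum(BIGRAM_COUNTS.get(pair, 0) for pair in zip(words, words[1:]))

def choose_correction(sentence, position, correction_dict, lm_score):
    """Return the error correction word for sentence[position], or None.

    A returned word corresponds to the first error correction response;
    None corresponds to the second response (no correction is made).
    """
    candidates = correction_dict.get(sentence[position])
    if not candidates:
        return None  # no participle set for this participle
    first_score = lm_score(sentence)
    scored = []
    for candidate in candidates:
        replaced = sentence[:position] + [candidate] + sentence[position + 1:]
        scored.append((lm_score(replaced), candidate))
    best_score, best = max(scored, key=lambda item: item[0])
    # Correct only if some candidate scores above the original participle.
    return best if best_score > first_score else None

# Hypothetical error correction dictionary and segmented input sentence.
correction_dict = {"火过": ["火锅", "火车"]}
result = choose_correction(["我", "想", "吃", "火过"], 3, correction_dict, toy_lm_score)
```

Scoring the whole sentence with each candidate substituted is one simple way to obtain "the language model score of a participle in the sentence"; a real system might instead score only a local context window.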

13. A text error correction device, characterized in that the device comprises a first formation unit, a marking unit, a first determination unit and a second determination unit, wherein:

the first determination unit is configured to determine the pinyin similarity between the two participles of the participle pair, the similarity indicating the degree of similarity between the pinyin of the first participle and the pinyin of the second participle of the participle pair;

the second determination unit is configured to, if the similarity meets a preset condition, identify the two participles of the participle pair as mutual error correction participles;

the marking unit includes a second labeling module, the second labeling module being configured to, if at least one of the two participles of the participle pair includes a polyphonic character, mark the participle containing the polyphonic character in the form of two or more pinyin readings.
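What the second labeling module does with a polyphonic character can be pictured with a short Python sketch. The reading table `READINGS` and the function `mark_pinyin` are hypothetical, and enumerating every combination of readings is an assumption; the claim only requires that such a participle be marked with two or more pinyin readings.

```python
from itertools import product

# Hypothetical table of pinyin readings per character (tones omitted).
READINGS = {
    "银": ["yin"],
    "行": ["hang", "xing"],    # polyphonic character
    "长": ["chang", "zhang"],  # polyphonic character
    "城": ["cheng"],
}

def mark_pinyin(participle):
    """List every pinyin reading of a participle, one string per combination."""
    per_character = [READINGS[ch] for ch in participle]
    return [" ".join(combo) for combo in product(*per_character)]

marked = mark_pinyin("银行")
```

A participle without polyphonic characters yields a single reading, so the same function covers both labeling cases.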

14. A server, characterized in that the server comprises a processor and an external communication interface, the processor being configured to:

collect a first corpus in the form of participle pairs;

send the error correction dictionary to a terminal through the external communication interface;

15. A computer storage medium, characterized in that computer-executable instructions are stored in the computer storage medium, the computer-executable instructions being used to perform the text error correction method according to any one of claims 1 to 12.