CN106250364A - Text modification method and device

Info

Publication number: CN106250364A
Application number: CN201610573610.XA
Authority: CN
Inventors: 刘江; 胡加学; 金泽蒙; 赵乾; 于振华
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2016-07-20
Filing date: 2016-07-20
Publication date: 2016-12-21

Abstract

Embodiments of the present invention provide a text modification method and device. The method includes: obtaining text data to be corrected; obtaining a correct word, the correct word being used to replace the erroneous words in the text data that correspond to it; and finding and replacing the erroneous words in the text data according to the correct word. In the present invention, when text errors are found in a text, the user does not need to provide any erroneous word and only needs to input the correct word; the system then automatically searches for each corresponding erroneous word according to the correct word. For example, the user only needs to input the correct word "dark reddish purple", without pointing out whether the corresponding erroneous word is "dark reddish purple" or "fall", and each corresponding erroneous word is found automatically according to the correct word. Because the user only provides the correct word and does not have to point out the erroneous words one by one, correction efficiency is greatly improved. Omissions of erroneous words that manual searching may cause are also avoided, which improves correction accuracy.

Description

Text modification method and device

Technical field

The present invention relates to the field of information processing, and in particular to a text modification method and device.

Background

Traditionally, people input text by typing. As technology has developed, new ways of inputting (or generating) text have appeared, such as converting speech into text with speech recognition technology, or converting the text in a picture into text with OCR technology. However, both traditional typing and the new text input methods face the same problem: the continual emergence of new words (such as network slang) has a considerable impact on the original dictionary of the input system or recognition system. The many homophones, synonyms and similar-looking words produced by these new words seriously reduce input accuracy, so the input text often contains erroneous words. For example, when a user inputs the network word "dark reddish purple" (meaning "so") by voice, it may be wrongly recognized as "dark reddish purple", "fall purple" or "fall son" when converted into text.

When an erroneous word is found during checking, the common approach in the prior art is for the user to move the cursor to the position of the erroneous word and re-enter the correct word to replace it, or to use software to automatically search for and replace a given erroneous word throughout the text, thereby completing the text correction. However, in the course of realizing the present invention, the inventors found that these prior-art correction methods are very inefficient, because the user has to point out the erroneous words one by one. Take the word "dark reddish purple" mentioned above as an example. When the user finds that it has been wrongly recognized as "dark reddish purple", a full-text search and replace is needed. When the user then finds that it has been wrongly recognized as "fall purple", another full-text search and replace is needed, and when the user finds that it has been wrongly recognized as "fall son", yet another is needed. In other words, the user may need at least three full-text search-and-replace passes to correct the various erroneous forms of this one word. In addition, because the erroneous words must be identified manually, the accuracy of the prior art is relatively low: other erroneous forms of "dark reddish purple" may exist in the text that the user does not notice during checking, so they are missed.

Summary of the invention

The present invention provides a text modification method and device to improve the efficiency and accuracy of text correction.

According to a first aspect of the embodiments of the present invention, a text modification method is provided, the method including:

obtaining text data to be corrected;

obtaining a correct word, the correct word being used to replace the erroneous word in the text data that corresponds to the correct word;

finding and replacing the erroneous word in the text data according to the correct word.

Optionally, finding and replacing the erroneous word in the text data according to the correct word includes:

performing word segmentation on the text data to split the text data into a plurality of segmented words;

forming a word pair from the correct word and each segmented word;

extracting the similarity between the correct word and the segmented word in each word pair, the similarity including font similarity, semantic similarity and acoustic similarity;

obtaining, according to the similarity of each word pair and a preset decision model, the probability that each word pair is a target word pair, a target word pair being a word pair whose segmented word is an erroneous word corresponding to the correct word;

determining the target word pairs according to the probability of each word pair and a preset algorithm;

replacing the segmented word of each target word pair in the text data with the correct word.

Optionally, after performing word segmentation on the text data and before forming a word pair from the correct word and each segmented word, the method further includes:

combining every two adjacent single characters obtained after word segmentation into one segmented word.

Optionally, extracting the font similarity between the correct word and the segmented word in each word pair includes:

if the correct word and the segmented word in the current word pair have the same number of characters, converting each character of the correct word and of the segmented word into its four-corner code (quadrangle coding), and taking as the font similarity the average, over the corresponding characters of the correct word and the segmented word, of the ratio of the number of identical four-corner code digits to the total number of four-corner code digits;

if the correct word and the segmented word in the current word pair have different numbers of characters, taking as the font similarity the minimum edit distance between the correct word and the segmented word obtained with a dynamic programming algorithm.

Optionally, extracting the semantic similarity between the correct word and the segmented word in each word pair includes:

vectorizing the correct word and the segmented word of the current word pair respectively to obtain word vectors;

taking the distance between the word vectors of the correct word and the segmented word as the semantic similarity.

Optionally, extracting the acoustic similarity between the correct word and the segmented word in each word pair includes:

determining the minimum edit distance path of the correct word and the segmented word of the current word pair in a pinyin character conversion distance table;

obtaining the pinyin character conversion distance between the correct word and the segmented word according to the pinyin character conversion distance of each pinyin character on the minimum edit distance path;

obtaining the acoustic distance between the correct word and the segmented word according to their pinyin character conversion distance, and taking the acoustic distance as the acoustic similarity.

Optionally, determining the target word pairs according to the probability of each word pair and the preset algorithm includes:

comparing the probability of each word pair with a preset threshold;

determining the word pairs whose probability is greater than the preset threshold as target word pairs.

Optionally, determining the target word pairs according to the probability of each word pair and the preset algorithm includes:

sorting the word pairs in descending order of probability;

determining a preset number of top-ranked word pairs as target word pairs.

Optionally, determining the target word pairs according to the probability of each word pair and the preset algorithm includes:

looking up the correct word and the segmented word of the current word pair respectively in a preset vocabulary, the preset vocabulary storing the correspondences between correct words and erroneous words;

if the erroneous word found in the preset vocabulary using the correct word of the current word pair is the same as the segmented word of the current word pair, and the correct word found in the preset vocabulary using the segmented word of the current word pair as an erroneous word is the same as the correct word of the current word pair, determining that the current word pair is a target word pair;

if the erroneous word found in the preset vocabulary using the correct word of the current word pair differs from the segmented word of the current word pair, and the correct word found in the preset vocabulary using the segmented word of the current word pair as an erroneous word also differs from the correct word of the current word pair, determining that the current word pair is not a target word pair;

if only one of the two lookups matches, that is, only the erroneous word found using the correct word of the current word pair is the same as the segmented word of the current word pair, or only the correct word found using the segmented word of the current word pair as an erroneous word is the same as the correct word of the current word pair, asking the user and determining according to the user's instruction whether the current word pair is a target word pair.
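The three vocabulary-lookup cases above can be sketched as follows. This is only an illustrative sketch: the function name `classify_pair`, the two dictionaries and the return labels are hypothetical, and storing the vocabulary as two separate mappings (correct word to erroneous words, and erroneous word to correct words) is an assumption made so that the two lookups can disagree.

```python
def classify_pair(correct_word, segmented_word, errors_by_correct, corrects_by_error):
    """Decide whether (correct_word, segmented_word) is a target word pair.

    errors_by_correct: maps a correct word to the set of its known erroneous words.
    corrects_by_error: maps an erroneous word to the set of its known correct words.
    Both mappings are hypothetical representations of the preset vocabulary.
    """
    # Lookup 1: use the correct word to find erroneous words.
    forward = segmented_word in errors_by_correct.get(correct_word, set())
    # Lookup 2: use the segmented word, treated as an erroneous word, to find correct words.
    backward = correct_word in corrects_by_error.get(segmented_word, set())

    if forward and backward:
        return "target"        # both lookups agree
    if not forward and not backward:
        return "not_target"    # neither lookup matches
    return "ask_user"          # only one lookup matches: defer to the user


# Illustrative entries only.
errors_by_correct = {"U.S.": {"not had", "often cross"}}
corrects_by_error = {"not had": {"U.S."}}

print(classify_pair("U.S.", "not had", errors_by_correct, corrects_by_error))
print(classify_pair("U.S.", "often cross", errors_by_correct, corrects_by_error))
print(classify_pair("U.S.", "go", errors_by_correct, corrects_by_error))
```

Returning a label instead of a boolean keeps the "ask the user" case explicit, so the caller can decide how to prompt.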

According to a second aspect of the embodiments of the present invention, a text correction device is provided, the device including:

a text acquisition module, configured to obtain text data to be corrected;

a correct word acquisition module, configured to obtain a correct word, the correct word being used to replace the erroneous word in the text data that corresponds to the correct word;

a replacement module, configured to find and replace the erroneous word in the text data according to the correct word.

Optionally, the replacement module includes:

a word segmentation submodule, configured to perform word segmentation on the text data to split the text data into a plurality of segmented words;

a word pair generation submodule, configured to form a word pair from the correct word and each segmented word;

a similarity extraction submodule, configured to extract the similarity between the correct word and the segmented word in each word pair, the similarity including font similarity, semantic similarity and acoustic similarity;

a probability obtaining submodule, configured to obtain, according to the similarity of each word pair and a preset decision model, the probability that each word pair is a target word pair, a target word pair being a word pair whose segmented word is an erroneous word corresponding to the correct word;

a target word pair determining submodule, configured to determine the target word pairs according to the probability of each word pair and a preset algorithm;

a replacement submodule, configured to replace the segmented word of each target word pair in the text data with the correct word.

Optionally, the replacement module further includes:

a single-character combination submodule, configured to combine every two adjacent single characters obtained after word segmentation into one segmented word.

Optionally, when extracting the font similarity between the correct word and the segmented word in each word pair, the similarity extraction submodule is configured to:

if the correct word and the segmented word in the current word pair have the same number of characters, convert each character of the correct word and of the segmented word into its four-corner code, and take as the font similarity the average, over the corresponding characters, of the ratio of the number of identical four-corner code digits to the total number of four-corner code digits; if the numbers of characters differ, take as the font similarity the minimum edit distance between the correct word and the segmented word obtained with a dynamic programming algorithm.

Optionally, when extracting the semantic similarity between the correct word and the segmented word in each word pair, the similarity extraction submodule is configured to:

vectorize the correct word and the segmented word of the current word pair respectively to obtain word vectors; and take the distance between the word vectors of the correct word and the segmented word as the semantic similarity.

Optionally, when extracting the acoustic similarity between the correct word and the segmented word in each word pair, the similarity extraction submodule is configured to:

determine the minimum edit distance path of the correct word and the segmented word of the current word pair in the pinyin character conversion distance table; obtain the pinyin character conversion distance between the correct word and the segmented word according to the pinyin character conversion distance of each pinyin character on the minimum edit distance path; and obtain the acoustic distance between the correct word and the segmented word according to their pinyin character conversion distance, taking the acoustic distance as the acoustic similarity.

Optionally, the probability obtaining submodule is configured to:

compare the probability of each word pair with a preset threshold; and determine the word pairs whose probability is greater than the preset threshold as target word pairs.

Optionally, the probability obtaining submodule is configured to:

sort the word pairs in descending order of probability; and determine a preset number of top-ranked word pairs as target word pairs.

Optionally, the probability obtaining submodule is configured to:

look up the correct word and the segmented word of the current word pair respectively in a preset vocabulary that stores the correspondences between correct words and erroneous words; determine that the current word pair is a target word pair if both lookups match; determine that it is not a target word pair if neither lookup matches; and, if only one lookup matches, ask the user and decide according to the user's instruction.

The technical solutions provided by the embodiments of the present invention can have the following beneficial effects:

In the present invention, when text errors are found in a text, the user does not need to provide any erroneous word and only needs to input the correct word; the system then automatically searches for each corresponding erroneous word according to the correct word. For example, when the user finds that the word "dark reddish purple" has been wrongly written as "dark reddish purple", "fall" and so on, the user only needs to input the correct word, i.e. "dark reddish purple". The user does not need to point out whether the corresponding erroneous word is "dark reddish purple" or "fall", let alone the position of each erroneous word. The system automatically finds each corresponding erroneous word according to the correct word and automatically replaces the erroneous words so determined with the correct word, thereby completing the text correction. Because the user only provides the correct word and does not have to point out the erroneous words one by one, correction efficiency is greatly improved. Omissions of erroneous words that manual searching may cause are also avoided, which improves correction accuracy.

It should be understood that the above general description and the following detailed description are only exemplary and explanatory and do not limit the present invention.

Brief description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present invention and, together with the description, serve to explain the principles of the present invention.

Fig. 1 is a flowchart of a text modification method according to an exemplary embodiment of the present invention;

Fig. 2 is a flowchart of a text modification method according to an exemplary embodiment of the present invention;

Fig. 3 is a flowchart of a text modification method according to an exemplary embodiment of the present invention;

Fig. 4 is a flowchart of a text modification method according to an exemplary embodiment of the present invention;

Fig. 5 is a schematic diagram of a minimum edit distance path according to an exemplary embodiment of the present invention;

Fig. 6 is a flowchart of a text modification method according to an exemplary embodiment of the present invention;

Fig. 7 is a flowchart of a text modification method according to an exemplary embodiment of the present invention;

Fig. 8 is a flowchart of a text modification method according to an exemplary embodiment of the present invention;

Fig. 9 is a schematic diagram of a text correction device according to an exemplary embodiment of the present invention;

Fig. 10 is a schematic diagram of a text correction device according to an exemplary embodiment of the present invention;

Fig. 11 is a schematic diagram of a text correction device according to an exemplary embodiment of the present invention.

Detailed description of the invention

Exemplary embodiments will be described in detail here, with examples shown in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of devices and methods consistent with some aspects of the present invention as detailed in the appended claims.

Fig. 1 is a flowchart of a text modification method according to an exemplary embodiment of the present invention. The method can be used on mobile terminals such as mobile phones and on devices such as PCs and servers.

As shown in Fig. 1, the method may include:

Step S101: obtain text data to be corrected.

The text data to be corrected can be determined according to the user's needs. This embodiment does not limit the source of the text data to be corrected: it may, for example, be text entered manually by the user, text data obtained by speech recognition, or text data obtained by OCR (Optical Character Recognition).

Step S102: obtain a correct word, the correct word being used to replace the erroneous word in the text data that corresponds to the correct word.

In this embodiment, when a text error is found, the user only needs to input the correct word and does not need to point out which erroneous words correspond to it or where they are.

Step S103: find and replace the erroneous word in the text data according to the correct word.

This embodiment does not limit how exactly the erroneous word in the text data is found and replaced according to the correct word. An example is described below with reference to Fig. 2:

As shown in Fig. 2, in this embodiment or some other embodiments of the present invention, finding and replacing the erroneous word in the text data according to the correct word, i.e. step S103, may include:

Step S201: perform word segmentation on the text data to split the text data into a plurality of segmented words.

The word segmentation method used may be, for example, a method based on conditional random fields. This embodiment does not limit it.

For example, if the text data to be corrected is "I think go not had", the word segmentation result is "I / think / go / not had", where "not had" is an erroneous word that needs to be corrected to "U.S.".

In addition, to avoid missing some words during segmentation, this embodiment can also combine every two adjacent single characters obtained after segmentation into one segmented word, that is, combine each single character with the single character that follows it. For example, the segmentation result above contains several consecutive single characters, namely "I", "think" and "go". After these single characters are combined, the segmented words obtained are "I think" and "think go".
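The combination of adjacent single characters described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name is hypothetical, and the tokens are placeholders in which "A", "B" and "C" stand for three consecutive single characters and "DE" for a two-character word, since the single-character test `len(token) == 1` only makes sense on the original characters and not on English glosses.

```python
def combine_adjacent_singles(tokens):
    """Join each pair of adjacent single-character tokens into one candidate word."""
    combined = []
    for left, right in zip(tokens, tokens[1:]):
        # Only two neighbouring single characters are combined;
        # multi-character tokens are already words and are left alone.
        if len(left) == 1 and len(right) == 1:
            combined.append(left + right)
    return combined


# Placeholder segmentation result: three single characters, then one word.
tokens = ["A", "B", "C", "DE"]
print(combine_adjacent_singles(tokens))
```

The combined words are added to the candidate list alongside the original segmented words, rather than replacing them.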

Step S202: form a word pair from the correct word and each segmented word.

For example, the correct word "U.S." in the example above can form the following word pairs with the segmented words obtained: "U.S.-I", "U.S.-think", "U.S.-go", "U.S.-not had", "U.S.-I think" and "U.S.-think go".
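Forming the word pairs of step S202 can be sketched as below. The function name is hypothetical, and skipping repeated candidates is an assumption added here for convenience; the description itself does not say whether duplicates are removed.

```python
def build_word_pairs(correct_word, candidates):
    """Pair the correct word with every distinct candidate, keeping first-seen order."""
    seen = set()
    pairs = []
    for word in candidates:
        if word not in seen:  # assumption: identical candidates are paired only once
            seen.add(word)
            pairs.append((correct_word, word))
    return pairs


# Segmented words plus the combined single-character words from the example.
candidates = ["I", "think", "go", "not had", "I think", "think go"]
for pair in build_word_pairs("U.S.", candidates):
    print(pair)
```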

Step S203: extract the similarity between the correct word and the segmented word in each word pair, the similarity including font similarity, semantic similarity and acoustic similarity.

This embodiment does not limit how exactly these three similarities are extracted. Those skilled in the art can design the extraction themselves according to different needs and scenarios, and any such design that can be used here does not depart from the spirit and scope of protection of the present invention.

Step S204: obtain, according to the similarity of each word pair and a preset decision model, the probability that each word pair is a target word pair, a target word pair being a word pair whose segmented word is an erroneous word corresponding to the correct word.

The decision model can be built in advance. For example, a large amount of text data can be collected in advance, and the erroneous words in the text data, together with the correct word corresponding to each erroneous word, can be identified manually. After the correct word is paired with the segmented words obtained by segmenting the text data, each word pair can be manually labeled as to whether it is a target word pair, i.e. whether it is a real "correct word-erroneous word" pair. For the labels, 0 and 1 can be used as the label feature: if the current word pair is a real "correct word-erroneous word" pair it is labeled 1, otherwise 0. Then the similarity of the two words in each word pair is extracted, i.e. the font similarity, semantic similarity and acoustic similarity. Finally, the similarities and the label features are used as training data to train the decision model. During training, the similarity of each word pair is used as the input of the model and the label feature of each word pair as the output of the model, and the model parameters are updated. When the parameter update is finished, the decision model is obtained.

When the decision model is used, the similarity of the two words in each word pair is taken as the input of the decision model, which then outputs the probability that the word pair is a real "correct word-erroneous word" pair.
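As an illustration of such a decision model, the sketch below fits a logistic regression on the three similarity features with plain gradient descent. The description does not name a model type, so logistic regression is an assumption made here, as are the function names, the hyperparameters and the toy training rows, which are made up purely for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability that a word pair is a real (correct word, erroneous word) pair."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def train(samples, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(samples, labels):
            error = predict_proba(weights, bias, features) - label
            for k, x in enumerate(features):
                grad_w[k] += error * x
            grad_b += error
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias


# Made-up toy data, for illustration only. Each row is
# (font similarity, semantic similarity, acoustic similarity);
# label 1 marks a manually annotated real pair, label 0 any other pair.
samples = [(0.5, 0.6, 0.9), (0.75, 0.4, 0.8), (0.0, 0.1, 0.2), (0.25, 0.2, 0.1)]
labels = [1, 1, 0, 0]

weights, bias = train(samples, labels)
print(predict_proba(weights, bias, (0.5, 0.5, 0.85)))
```

Any classifier that outputs a probability could take the place of this one; the only part taken from the description is the interface, with three similarities in and one probability out.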

Step S205: determine the target word pairs according to the probability of each word pair and a preset algorithm.

After the probability that each word pair is a target word pair has been obtained, the real target word pairs can be selected according to the preset algorithm. This embodiment does not limit the specific content of the preset algorithm. Those skilled in the art can design it themselves according to different needs and scenarios, and any such design that can be used here does not depart from the spirit and scope of protection of the present invention.

Step S206: replace the segmented word of each target word pair in the text data with the correct word.

For example, if the correct word is "U.S." and the target word pair is "U.S.-not had", "U.S." can be used to replace "not had" throughout the text data, thereby completing the correction.
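The replacement of step S206 can be sketched on the segmented text as follows. The function name is hypothetical, and working on the token list rather than on the raw string is an assumption of this sketch.

```python
def replace_target_words(tokens, correct_word, target_pairs):
    """Replace every token that is the erroneous side of a target word pair."""
    wrong_words = {wrong for correct, wrong in target_pairs if correct == correct_word}
    return [correct_word if token in wrong_words else token for token in tokens]


tokens = ["I", "think", "go", "not had"]
target_pairs = [("U.S.", "not had")]
print(replace_target_words(tokens, "U.S.", target_pairs))
```

A target pair whose erroneous word was built by combining two adjacent single characters spans two tokens, so it would need an extra merge step that this sketch leaves out.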

In this embodiment, when text errors are found in a text, the user does not need to provide any erroneous word and only needs to input the correct word; the system then automatically searches for each corresponding erroneous word according to the correct word. For example, when the user finds that the word "dark reddish purple" has been wrongly written as "dark reddish purple", "fall" and so on, the user only needs to input the correct word, i.e. "dark reddish purple". The user does not need to point out whether the corresponding erroneous word is "dark reddish purple" or "fall", let alone the position of each erroneous word. The system automatically finds each corresponding erroneous word according to the correct word and automatically replaces the erroneous words so determined with the correct word, thereby completing the text correction. Because the user only provides the correct word and does not have to point out the erroneous words one by one, correction efficiency is greatly improved. Omissions of erroneous words that manual searching may cause are also avoided, which improves correction accuracy.

How to extract the similarity between the correct word and the segmented word in each word pair, i.e. step S203, is further illustrated below.

In this embodiment or some other embodiments of the present invention, extracting the font similarity between the correct word and the segmented word in each word pair may specifically include:

If the correct word and the segmented word in the current word pair have the same number of characters, each character of the correct word and of the segmented word is converted into its four-corner code (quadrangle coding), and the average, over the corresponding characters of the correct word and the segmented word, of the ratio of the number of identical four-corner code digits to the total number of four-corner code digits is taken as the font similarity.

The specific calculation is shown in formula (1):

$$T = \frac{1}{n} \sum_{i=1}^{n} \frac{l_i}{L_i} \tag{1}$$

where $T$ is the font similarity of the two words in the word pair, $n$ is the number of characters in each word of the word pair, $l_i$ is the number of identical four-corner code digits for the $i$-th characters of the two words, and $L_i$ is the total number of four-corner code digits for the $i$-th characters of the two words (usually 4).

For example, the font similarity of the word pair "to go-think go" is calculated as follows:

The four-corner code of "to" is 2722.

The four-corner code of "think" is 4633.

For the 1st characters, "to" and "think", the total number of four-corner code digits is 4 but no digits are identical, so $l_1 / L_1 = 0/4 = 0$. The 2nd characters, "go" and "go", are the same, so $l_2 / L_2 = 4/4 = 1$. According to formula (1), the font similarity of this word pair is therefore $(0 + 1)/2 = 0.5$.
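Formula (1) can be sketched in code as below. The function name is hypothetical, and comparing the code digits position by position is an assumption, since the description only speaks of the "number of identical codes". The code "0000" used for the second characters is a placeholder, not the real four-corner code of that character; because both words share the character, any common code gives a ratio of 1.

```python
def font_similarity(codes_a, codes_b):
    """Formula (1): mean over characters of identical code digits / total code digits.

    codes_a and codes_b are lists with one four-corner code string per character.
    """
    if len(codes_a) != len(codes_b):
        raise ValueError("formula (1) applies only to words with the same number of characters")
    ratios = []
    for code_a, code_b in zip(codes_a, codes_b):
        # l_i: digits that match at the same position (assumed comparison rule)
        identical = sum(1 for x, y in zip(code_a, code_b) if x == y)
        # L_i: total number of code digits, usually 4
        ratios.append(identical / len(code_a))
    return sum(ratios) / len(ratios)


# First characters use the codes 2722 and 4633 from the worked example;
# "0000" is a placeholder code for the shared second character.
print(font_similarity(["2722", "0000"], ["4633", "0000"]))
```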

If the correct word and the segmented word in the current word pair have different numbers of characters, the minimum edit distance between the correct word and the segmented word, obtained with a dynamic programming algorithm, can be used as the font similarity. This can be implemented with existing techniques and is not described further here.
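For reference, a minimum edit distance computed by dynamic programming can be sketched as follows. The description only refers to existing techniques, so the standard Levenshtein distance with unit costs for insertion, deletion and substitution is an assumption, and the function name is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b, keeping one DP row at a time."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]  # deleting the first i items of a
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))
```

Because the function only iterates over its arguments, it works on strings of characters as well as on lists of tokens.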

As shown in Fig. 3, in this embodiment or some other embodiments of the present invention, extracting the semantic similarity between the correct word and the segmented word in each word pair may specifically include:

Step S301: vectorize the correct word and the segmented word of the current word pair respectively to obtain word vectors.

Step S302: take the distance between the word vectors of the correct word and the segmented word as the semantic similarity.

As an example, a method such as Word2Vec can be used to vectorize each word in the word pair. Once the word vector of each word in the pair has been obtained, the distance between the two word vectors can be the cosine distance, the Euclidean distance and so on. The calculation is the same as in the prior art and is not described in detail here.
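The two distances mentioned above can be sketched as follows, assuming the word vectors are already available as plain lists of floats. The function names are hypothetical, and the cosine distance is taken here as one minus the cosine similarity, which is one common convention.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 0 for vectors pointing the same way, 1 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between the two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Tiny hand-made vectors, used only to exercise the functions.
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
```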

As shown in Fig. 4, in this embodiment or some other embodiments of the present invention, extracting the acoustic similarity between the correct word and the segmented word in each word pair may specifically include:

Step S401: determine the minimum edit distance path of the correct word and the segmented word of the current word pair in the pinyin character conversion distance table.

Step S402: obtain the pinyin character conversion distance between the correct word and the segmented word according to the pinyin character conversion distance of each pinyin character on the minimum edit distance path.

Step S403: obtain the acoustic distance between the correct word and the segmented word according to their pinyin character conversion distance, and take the acoustic distance as the acoustic similarity.

The acoustic similarity refers to the similarity of two words in pronunciation and is represented by the acoustic distance of the two words: the closer the acoustic distance of the two words, the higher the acoustic similarity. It can be calculated from the pinyin character conversion distance of the two words, that is, from the conversion distances of pairs of pinyin characters in the pinyin character conversion distance table (also called the pinyin character conversion confusion matrix). Table 1 shows part of the pinyin character conversion confusion matrix. The first row and the first column list the pinyin characters that are converted into each other, and the value at the intersection of two characters is their conversion distance.

Table 1

	a	ai	an	ang	ao	b	c	ch	d	e	ei	en	eng
a	-	0.67	0.65	0.72	0.6	1	1	1	1	0.6	0.893	0.88	0.927
ai	0.67	-	0.7	0.95	0.928	1	1	1	1	0.914	0.763	0.866	0.928
an	0.654	0.699	-	0.6	0.938	1	1	1	1	0.954	0.944	0.67	0.832
ang	0.716	0.95	0.6	-	0.793	1	1	1	1	0.972	0.971	0.877	0.737

The acoustic distance of the two words is calculated from their pinyin character conversion distance, for example as shown in formula (2):

$$D_{acou}(a_1, a_2) = \frac{1}{1 + D_{edit}(a_1, a_2)} \tag{2}$$

where $D_{acou}(a_1, a_2)$ is the acoustic distance of the two words and $D_{edit}(a_1, a_2)$ is the pinyin character conversion distance of the two words. To obtain $D_{edit}(a_1, a_2)$, the minimum edit distance path of the two words in the pinyin character conversion distance table can be found with a dynamic programming method, and the pinyin character conversion distances of the pinyin characters on this path are then fused. The fusion can be, for example, averaging, simple accumulation or weighted accumulation.

For example, the pinyin character conversion distance of the two words "report a case to the security authorities" and "standby dish" is calculated as follows:

1) Convert each word into pinyin

report a case to the security authorities -> bao an

standby dish -> bei cai

2) Look up the pinyin character conversion confusion matrix (i.e. the pinyin character conversion distance table) to obtain the pinyin character conversion distance of each pair of pinyin characters, as shown in Table 2:

Table 2

	b	ao	an
b	0	1	1
ei	1	0.976	0.944
c	1	1	1
ai	1	0.928	0.699

3) Use a dynamic programming method to calculate the pinyin character conversion distance of the two words

In the calculation, a dynamic programming method can be used to search the pinyin character conversion distance table for the minimum edit distance path. Fusing the values on this path gives the pinyin character conversion distance of the two words. As shown in Fig. 5, the shaded area is the minimum edit distance path. Fusing the pinyin character conversion distances on the minimum edit distance path, for example by simple accumulation, gives the pinyin character conversion distance of the two words, i.e. 0+0+0.976+1+0.699=2.675.
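The worked example can be reproduced with the sketch below. It is written under an assumption about how the path is searched: the path runs from the top-left cell of Table 2 to the bottom-right cell and may move down, right or diagonally, and the cell values on the cheapest such path are summed (simple accumulation). The function names are hypothetical, and the exact search used to draw Fig. 5 may differ. With the Table 2 values this search gives 0 + 0.976 + 1 + 0.699, the same total as in the example.

```python
def pinyin_conversion_distance(cost):
    """Sum of cell values along the cheapest monotone path through the cost table.

    cost[i][j] is the conversion distance between the i-th pinyin character of
    one word and the j-th pinyin character of the other.
    """
    rows, cols = len(cost), len(cost[0])
    best = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                previous = 0.0  # the path starts in the top-left cell
            else:
                candidates = []
                if i > 0:
                    candidates.append(best[i - 1][j])      # step down
                if j > 0:
                    candidates.append(best[i][j - 1])      # step right
                if i > 0 and j > 0:
                    candidates.append(best[i - 1][j - 1])  # diagonal step
                previous = min(candidates)
            best[i][j] = cost[i][j] + previous
    return best[-1][-1]


def acoustic_distance(d_edit):
    """Formula (2): D_acou = 1 / (1 + D_edit)."""
    return 1.0 / (1.0 + d_edit)


# Table 2: rows b, ei, c, ai ("bei cai"); columns b, ao, an ("bao an").
table2 = [
    [0.0, 1.0, 1.0],
    [1.0, 0.976, 0.944],
    [1.0, 1.0, 1.0],
    [1.0, 0.928, 0.699],
]

d_edit = pinyin_conversion_distance(table2)
print(round(d_edit, 3), round(acoustic_distance(d_edit), 3))
```

Replacing the plain sum with an average or a weighted sum would implement the other fusion options mentioned for $D_{edit}$.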

In addition, step S205, namely determining target word pairs according to the probability of each word pair and a preset algorithm, can be implemented in several ways, which are illustrated below with reference to Fig. 6 to Fig. 8:

Referring to Fig. 6, in this embodiment or some other embodiments of the present invention, determining target word pairs according to the probability of each word pair and a preset algorithm may include:

Step S601, determine the magnitude relationship between the probability of each word pair and a predetermined threshold.

Step S602, determine the word pairs whose probability is greater than the predetermined threshold as target word pairs.

Alternatively, referring to Fig. 7, in this embodiment or some other embodiments of the present invention, determining target word pairs according to the probability of each word pair and a preset algorithm may include:

Step S701, sort the word pairs in descending order of their probability.

Step S702, determine a preset number of top-ranked word pairs as target word pairs.
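The two selection rules of Fig. 6 and Fig. 7 can be sketched in a few lines. This is only an illustration: the function names, the placeholder word pairs and the probability values are all hypothetical, and the patent does not fix a threshold or a preset number.

```python
def select_by_threshold(scored_pairs, threshold):
    """Fig. 6: keep word pairs whose probability is greater than the threshold."""
    return [pair for pair, prob in scored_pairs if prob > threshold]


def select_top_n(scored_pairs, n):
    """Fig. 7: sort by probability in descending order and keep the first n pairs."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in ranked[:n]]


# Hypothetical (correct word, participle word) pairs with made-up probabilities.
scored_pairs = [
    (("correct", "candidate_a"), 0.92),
    (("correct", "candidate_b"), 0.15),
    (("correct", "candidate_c"), 0.67),
]

print(select_by_threshold(scored_pairs, 0.5))
print(select_top_n(scored_pairs, 1))
```

The threshold rule returns a variable number of target word pairs, while the ranking rule always returns at most the preset number, which is one reason a caller might prefer one over the other.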

Alternatively, referring to Fig. 8, in this embodiment or some other embodiments of the present invention, determining target word pairs according to the probability of each word pair and a preset algorithm may include:

Step S801, look up the correct word and the participle word of the current word pair in a default vocabulary respectively, wherein the default vocabulary stores the correct correspondences between correct words and erroneous words.

The default vocabulary stores error-prone correct words and their corresponding erroneous words, for example "U.S. - do not had", "U.S. - often cross", and so on. The vocabulary can be built in advance by domain experts.

Step S802, if the erroneous word found in the default vocabulary using the correct word of the current word pair is identical to the participle word of the current word pair, and the correct word found in the default vocabulary using the participle word of the current word pair as an erroneous word is identical to the correct word of the current word pair, the current word pair is determined to be a target word pair.

Step S803, if the erroneous word found in the default vocabulary using the correct word of the current word pair differs from the participle word of the current word pair, and the correct word found in the default vocabulary using the participle word of the current word pair as an erroneous word also differs from the correct word of the current word pair, the current word pair is determined not to be a target word pair.

Step S804, if only one of the two cases occurs, that is, only the erroneous word found in the default vocabulary using the correct word of the current word pair is identical to the participle word of the current word pair, or only the correct word found in the default vocabulary using the participle word of the current word pair as an erroneous word is identical to the correct word of the current word pair, the user is queried, and whether the current word pair is a target word pair is determined according to the user's instruction. If the user confirms, the current word pair is determined to be a target word pair; if the user does not confirm, the current word pair is determined not to be a target word pair.
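Steps S801 to S804 amount to a three-way decision, which the following sketch illustrates. It assumes that the default vocabulary is held as two separate one-to-one lookup tables (correct word to erroneous word, and erroneous word to correct word), which is one way the two lookups could disagree; the patent does not specify the storage layout. The names `Decision` and `check_with_vocabulary` and the placeholder entries are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    TARGET = "target"          # step S802
    NOT_TARGET = "not_target"  # step S803
    ASK_USER = "ask_user"      # step S804


def check_with_vocabulary(correct, participle, correct_to_error, error_to_correct):
    """Compare both lookups of step S801 and classify the current word pair."""
    # Lookup using the correct word: does it lead to the participle word?
    forward = correct_to_error.get(correct) == participle
    # Lookup using the participle word as an erroneous word: does it lead back?
    backward = error_to_correct.get(participle) == correct
    if forward and backward:
        return Decision.TARGET
    if not forward and not backward:
        return Decision.NOT_TARGET
    return Decision.ASK_USER


# Hypothetical vocabulary entries, for illustration only.
correct_to_error = {"right_a": "wrong_a", "right_b": "wrong_b"}
error_to_correct = {"wrong_a": "right_a", "wrong_c": "right_b"}

print(check_with_vocabulary("right_a", "wrong_a", correct_to_error, error_to_correct))
print(check_with_vocabulary("right_a", "wrong_x", correct_to_error, error_to_correct))
print(check_with_vocabulary("right_b", "wrong_b", correct_to_error, error_to_correct))
```

When the result is `Decision.ASK_USER`, the caller would prompt the user and treat the pair as a target word pair only if the user confirms.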

It should be noted that the three approaches of Fig. 6 to Fig. 8 can also be used in combination, either two at a time or all three together; this embodiment does not limit the combination or the order in which the approaches are combined. For example, the word pairs whose probability is greater than the threshold can be selected first, then sorted by probability, and a preset number of top-ranked word pairs determined as target word pairs. As another example, the word pairs can first be sorted by probability, a preset number of top-ranked word pairs selected, and the default vocabulary then used to screen them. As a further example, the word pairs whose probability is greater than the threshold can be selected first, and the default vocabulary then used for a second screening; and so on.
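One way to picture these combinations is as a chain of interchangeable screening stages applied in a chosen order. The sketch below is an assumption about structure, not the patent's implementation: the stage names are hypothetical, the data is made up, and the vocabulary stage is simplified to a membership test without the user query of step S804.

```python
def by_threshold(threshold):
    """Stage: keep pairs whose probability is greater than the threshold."""
    return lambda scored: [(p, s) for p, s in scored if s > threshold]


def top_n(n):
    """Stage: keep the n pairs with the highest probability."""
    return lambda scored: sorted(scored, key=lambda item: item[1], reverse=True)[:n]


def in_vocabulary(vocabulary):
    """Stage: keep pairs listed in a set of (correct word, erroneous word) entries."""
    return lambda scored: [(p, s) for p, s in scored if p in vocabulary]


def select_targets(scored, stages):
    """Apply the stages in order and return the surviving word pairs."""
    for stage in stages:
        scored = stage(scored)
    return [pair for pair, _ in scored]


# Hypothetical scored pairs and vocabulary.
scored = [(("C", "w1"), 0.9), (("C", "w2"), 0.8), (("C", "w3"), 0.3)]
vocabulary = {("C", "w2")}

print(select_targets(scored, [by_threshold(0.5), top_n(1)]))
print(select_targets(scored, [top_n(2), in_vocabulary(vocabulary)]))
print(select_targets(scored, [by_threshold(0.5), in_vocabulary(vocabulary)]))
```

Keeping each stage as a function from a scored list to a scored list means the order of combination is just the order of the list passed to `select_targets`.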

The following are device embodiments of the present invention, which can be used to perform the method embodiments of the present invention. For details not disclosed in the device embodiments, please refer to the method embodiments of the present invention.

Fig. 9 is a schematic diagram of a text correcting device according to an exemplary embodiment of the present invention. The device can be used in mobile terminals such as mobile phones and in equipment such as PCs and servers.

Referring to Fig. 9, the device may include:

Text acquisition module 901, for obtaining text data to be revised;

Correct word acquisition module 902, for obtaining a correct word, the correct word being used to replace the erroneous word in the text data that corresponds to the correct word;

Replacement module 903, for finding and replacing the erroneous word in the text data according to the correct word.

Referring to Fig. 10, in this embodiment or some other embodiments of the present invention, the replacement module may include:

Participle submodule 1001, for performing word segmentation on the text data, splitting the text data into multiple participle words;

Word pair generation submodule 1002, for forming a word pair from the correct word and each participle word;

Similarity extraction submodule 1003, for extracting the similarity between the correct word and the participle word in each word pair, the similarity including font similarity, semantic similarity and acoustic similarity;

Probability obtaining submodule 1004, for obtaining, according to the similarity of each word pair and a default decision model, the probability that each word pair is a target word pair, a target word pair being a word pair whose participle word is the erroneous word corresponding to the correct word;

Target word pair determination submodule 1005, for determining target word pairs according to the probability of each word pair and a preset algorithm;

Replacement submodule 1006, for replacing, in the text data, the participle word of each target word pair with the correct word.
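The way submodules 1001 to 1006 hand data to one another can be sketched as a small pipeline. Everything concrete here is a stand-in chosen only so that the example runs: whitespace segmentation, an anagram test as the "similarity", an identity "decision model", and a 0.5 threshold are all hypothetical and are not the methods described in the patent.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

WordPair = Tuple[str, str]           # (correct word, participle word)
ScoredPair = Tuple[WordPair, float]  # word pair with its probability


@dataclass
class ReplacementModule:
    segment: Callable[[str], List[str]]                    # submodule 1001
    similarity: Callable[[str, str], Sequence[float]]      # submodule 1003
    probability: Callable[[Sequence[float]], float]        # submodule 1004
    select: Callable[[List[ScoredPair]], List[WordPair]]   # submodule 1005
    joiner: str = ""                                       # how words are rejoined

    def run(self, text: str, correct_word: str) -> str:
        words = self.segment(text)
        # Submodule 1002: pair the correct word with every participle word.
        pairs = [(correct_word, w) for w in words]
        scored = [(p, self.probability(self.similarity(*p))) for p in pairs]
        targets = {w for _, w in self.select(scored)}
        # Submodule 1006: replace the participle word of each target word pair.
        return self.joiner.join(correct_word if w in targets else w for w in words)


# Toy stand-ins, for illustration only.
module = ReplacementModule(
    segment=str.split,
    similarity=lambda c, w: [1.0 if sorted(c) == sorted(w) else 0.0],
    probability=lambda features: features[0],
    select=lambda scored: [p for p, prob in scored if prob > 0.5],
    joiner=" ",
)
print(module.run("I like teh cat", "the"))  # I like the cat
```

Passing each submodule in as a callable keeps the pipeline independent of any particular segmenter, similarity measure, or decision model.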

Referring to Fig. 11, in this embodiment or some other embodiments of the present invention, the replacement module may further include:

Single-character combination submodule 1101, for combining two adjacent single characters obtained after word segmentation into one participle word.
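A minimal sketch of such a combination step is given below. It assumes a greedy left-to-right merge in which each single character is used at most once and the merged word replaces the two single characters; the text above does not state how runs of three or more single characters are handled, so that behaviour is an assumption, as is the function name.

```python
def combine_adjacent_single_chars(words):
    """Merge each pair of adjacent single-character words into one word."""
    merged = []
    i = 0
    while i < len(words):
        is_pair = (
            i + 1 < len(words)
            and len(words[i]) == 1
            and len(words[i + 1]) == 1
        )
        if is_pair:
            merged.append(words[i] + words[i + 1])
            i += 2  # both single characters are consumed
        else:
            merged.append(words[i])
            i += 1
    return merged


print(combine_adjacent_single_chars(["ab", "c", "d", "ef"]))  # ['ab', 'cd', 'ef']
print(combine_adjacent_single_chars(["a", "b", "c"]))         # ['ab', 'c']
```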

In this embodiment or some other embodiments of the present invention, when extracting the font similarity between the correct word and the participle word in each word pair, the similarity extraction submodule may specifically be used for:

In this embodiment or some other embodiments of the present invention, when extracting the semantic similarity between the correct word and the participle word in each word pair, the similarity extraction submodule may specifically be used for:

In this embodiment or some other embodiments of the present invention, when extracting the acoustic similarity between the correct word and the participle word in each word pair, the similarity extraction submodule may specifically be used for:

In this embodiment or some other embodiments of the present invention, the probability obtaining submodule may specifically be used for:

As for the device in the above embodiments, the specific manner in which each unit and module performs its operations has been described in detail in the embodiments of the related method, and will not be elaborated here.

Those skilled in the art will readily conceive of other embodiments of the present invention after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses or adaptations of the present invention that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the invention being indicated by the appended claims.

It should be understood that the present invention is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims

1. A text modification method, characterized in that the method comprises:

obtaining text data to be revised;

The method according to claim 1, characterized in that finding and replacing the erroneous word in the text data according to the correct word comprises:

forming a word pair from the correct word and each participle word;

extracting the similarity between the correct word and the participle word in each word pair, the similarity including font similarity, semantic similarity and acoustic similarity;

obtaining, according to the similarity of each word pair and a default decision model, the probability that each word pair is a target word pair, a target word pair being a word pair whose participle word is the erroneous word corresponding to the correct word;

The method according to claim 2, characterized in that, after word segmentation is performed on the text data and before the correct word and each participle word are formed into word pairs, the method further comprises:

The method according to claim 2, characterized in that extracting the font similarity between the correct word and the participle word in each word pair comprises:

if the correct word and the participle word of the current word pair have the same number of characters, converting each character of the correct word and of the participle word into its four-corner code, and taking as the font similarity the average, over each pair of corresponding characters in the correct word and the participle word, of the ratio of the number of identical codes in their four-corner codes to the total number of codes in a four-corner code;

if the correct word and the participle word of the current word pair have different numbers of characters, taking the minimum edit distance between the correct word and the participle word, obtained with a dynamic programming algorithm, as the font similarity.

The method according to claim 2, characterized in that extracting the semantic similarity between the correct word and the participle word in each word pair comprises:

The method according to claim 2, characterized in that extracting the acoustic similarity between the correct word and the participle word in each word pair comprises:

determining the minimum edit distance path of the correct word and the participle word of the current word pair in the pinyin character conversion table;

obtaining the pinyin character conversion distance between the correct word and the participle word according to the pinyin character conversion distances of the pinyin characters on the minimum edit distance path;

obtaining the acoustic distance between the correct word and the participle word according to the pinyin character conversion distance between the correct word and the participle word, and taking the acoustic distance as the acoustic similarity.

The method according to claim 2, characterized in that determining target word pairs according to the probability of each word pair and a preset algorithm comprises:

looking up the correct word and the participle word of the current word pair in a default vocabulary respectively, wherein the default vocabulary stores the correct correspondences between correct words and erroneous words;

if the erroneous word found in the default vocabulary using the correct word of the current word pair is identical to the participle word of the current word pair, and the correct word found in the default vocabulary using the participle word of the current word pair as an erroneous word is identical to the correct word of the current word pair, determining that the current word pair is a target word pair;

if the erroneous word found in the default vocabulary using the correct word of the current word pair differs from the participle word of the current word pair, and the correct word found in the default vocabulary using the participle word of the current word pair as an erroneous word also differs from the correct word of the current word pair, determining that the current word pair is not a target word pair;

if only the erroneous word found in the default vocabulary using the correct word of the current word pair is identical to the participle word of the current word pair, or only the correct word found in the default vocabulary using the participle word of the current word pair as an erroneous word is identical to the correct word of the current word pair, querying the user, and determining whether the current word pair is a target word pair according to the user's instruction.

10. A text correcting device, characterized in that the device comprises:

a text acquisition module, for obtaining text data to be revised;

a correct word acquisition module, for obtaining a correct word, the correct word being used to replace the erroneous word in the text data that corresponds to the correct word;

11. The device according to claim 10, characterized in that the replacement module comprises:

a participle submodule, for performing word segmentation on the text data, splitting the text data into multiple participle words;

a similarity extraction submodule, for extracting the similarity between the correct word and the participle word in each word pair, the similarity including font similarity, semantic similarity and acoustic similarity;

a probability obtaining submodule, for obtaining, according to the similarity of each word pair and a default decision model, the probability that each word pair is a target word pair, a target word pair being a word pair whose participle word is the erroneous word corresponding to the correct word;

a replacement submodule, for replacing, in the text data, the participle word of each target word pair with the correct word.

12. The device according to claim 11, characterized in that the replacement module further comprises:

13. The device according to claim 11, characterized in that, when extracting the font similarity between the correct word and the participle word in each word pair, the similarity extraction submodule is used for:

14. The device according to claim 11, characterized in that, when extracting the semantic similarity between the correct word and the participle word in each word pair, the similarity extraction submodule is used for:

vectorizing the correct word and the participle word of the current word pair respectively to obtain word vectors; and taking the distance between the word vectors of the correct word and the participle word as the semantic similarity.

15. The device according to claim 11, characterized in that, when extracting the acoustic similarity between the correct word and the participle word in each word pair, the similarity extraction submodule is used for:

determining the minimum edit distance path of the correct word and the participle word of the current word pair in the pinyin character conversion table; obtaining the pinyin character conversion distance between the correct word and the participle word according to the pinyin character conversion distances of the pinyin characters on the minimum edit distance path; and obtaining the acoustic distance between the correct word and the participle word according to the pinyin character conversion distance between the correct word and the participle word, and taking the acoustic distance as the acoustic similarity.

16. The device according to claim 11, characterized in that the probability obtaining submodule is used for:

determining the magnitude relationship between the probability of each word pair and a predetermined threshold; and determining the word pairs whose probability is greater than the predetermined threshold as target word pairs.

17. The device according to claim 11, characterized in that the probability obtaining submodule is used for:

sorting the word pairs in descending order of their probability; and determining a preset number of top-ranked word pairs as target word pairs.

18. The device according to claim 11, characterized in that the probability obtaining submodule is used for: