CN1228565A - Computer file automatic error detection and error correction device and its method

Info

Publication number: CN1228565A
Application number: CN97114702A
Authority: CN
Inventors: 张俊盛; 林翠芬
Original assignee: RUIYANG ZIXUN CO Ltd
Current assignee: RUIYANG ZIXUN CO Ltd
Priority date: 1997-07-18
Filing date: 1997-07-18
Publication date: 1999-09-15

Abstract

The present invention uses a "two-pass word segmentation" method: a first segmentation pass restores the sentence to an error-free underlying form, and a second segmentation pass converts it into the correct text, providing error detection and correction with high recall and high precision. First, the sentence undergoes word segmentation analysis to determine the pronunciation and shape of each character. All characters are then converted into a sound code form and a shape code form. Next, the resulting sound codes and shape codes are used to look up characters and words in a dictionary, and the sentence is segmented a second time according to the words found. Finally, the corrected sentence is formed from the result.

Description

Device and method for automatic error detection and correction in computer documents

The invention relates to a device and method for detecting and correcting errors in documents, and in particular to a two-stage word segmentation device and method for error detection and correction applicable to Chinese and Japanese documents. The device and method of the invention use two passes of word segmentation to correctly detect wrongly written characters and correct them.

With the spread of computer applications, processing documents by computer has become standard practice in modern business. For any computer-processed document, correct content is the prerequisite for all further processing. Ensuring the correctness of document content has therefore become a central problem in the field of document processing.

In applications that process Chinese (both Simplified and Traditional) or Japanese text, wrongly written or misused characters occur regardless of whether the document was entered by keyboard, by voice, or by OCR, or was obtained by document retrieval.

A "wrongly written character" usually means a Chinese character whose form is wrong because strokes were added, dropped, changed, or misplaced during machine recognition or handwriting, or because of keyboard errors such as mistyped, extra, or missing keystrokes or a wrong choice among candidates. A "misused character" means that, because of a mistaken understanding, another character was used in place of the intended one. In addition, differences in usage between the Chinese used in mainland China and that used in Taiwan are a common problem in documents today, especially documents obtained through Simplified/Traditional conversion. These and other character-level errors are referred to collectively below as "wrongly written or misused characters."

For wrongly written or misused characters in a document, the conventional practice is manual proofreading after the document has been entered. Because manual proofreading is very time-consuming, the electronics and information industry has developed systems that use a computer to detect and/or correct such characters automatically, to meet users' need for automatic or semi-automatic correction in bulk.

Taiwan invention patent No. 59572 discloses an "automatic detection method and detection device for Chinese wrongly written characters." This method automatically detects wrongly written or misused characters in a Chinese document so that the user can correct them. It uses a statistical method: the characters of a sentence are first given a preliminary word segmentation, and low-frequency single-character words that rarely occur are picked out and marked as possible errors. The method offers high recall, but it cannot suggest the correct character, its precision is low, and its character adjacency table has a huge number of parameters, so processing speed cannot be improved.

Taiwan patent application No. 83103817 discloses a "method and device for automatic correction of Chinese wrongly written or misused characters." This method first converts the text into sets of similar characters and then segments the similar-character sets into words. Using this single segmentation pass, it scores each combination of character strings, finds possible errors, and offers correction suggestions. Because the similar-character sets are very large, segmentation is quite time-consuming; and because the inter-word adjacency table used for scoring has a huge number of parameters, complete sample statistics are hard to collect. The method is therefore inconvenient in practice.

There is therefore an urgent need for a device and method for automatic error detection and correction in computer documents that offers high recall and high precision and can run faster. There is also a need for a device and method that provides automatic error detection and correction regardless of the input method used to create the document.

One object of the invention is to provide a device and method for automatic error detection and correction in computer documents with high recall and high precision.

Another object of the invention is to provide such a device and method with improved speed.

A further object of the invention is to provide such a device and method applicable to documents produced with different input methods.

The inventors have found that a "two-pass word segmentation" method, in which a first segmentation pass restores the sentence to an error-free underlying form and a second pass converts it into the correct text, provides error detection and correction with high recall and high precision. In the method of the invention, the sentence is first analyzed by word segmentation to determine the pronunciation and shape of its characters. All characters are then converted into a sound code form and a shape code form. Words are then looked up in a dictionary according to the resulting sound codes or shape codes, and the sentence is segmented a second time according to the words found. Finally, the sentence suggested to the user as a correction is formed from the result of the second segmentation.

Because the device and method of the invention combine two-pass word segmentation, part-of-speech analysis, and sound/shape code conversion, they achieve high recall, high precision, and high speed.

These points are discussed below:

1. The first segmentation pass works because the lengths and frequencies of the segmented words make it possible to locate likely error positions effectively, which ensures high recall.

2. The part-of-speech analysis carried out during the first segmentation pass uses a part-of-speech adjacency table. Its results show how well the parts of speech of neighboring words fit together, so that ordinary words used in a normal combination are not mistaken for error positions, which improves precision. The part-of-speech adjacency table has the character of syntactic analysis and is more general than a word-to-word adjacency table. Experiments show that it works very well.

3. Taking an analysis with 100 part-of-speech classes as an example, the part-of-speech adjacency table has somewhat more than 1,000 entries, whereas a word adjacency table easily has hundreds of thousands. Using the part-of-speech adjacency table for error analysis therefore saves lookup time and speeds up processing.

The above and other objects and advantages of the invention will become clearer from the following detailed description taken together with the drawings.

Fig. 1 is a system flowchart of the method for automatic error detection and correction in computer documents according to the invention.

Fig. 2 is a system diagram of the second-stage word segmentation subsystem of the device for automatic error detection and correction in computer documents according to the invention.

Table I shows part of the contents of a shape code lookup table suitable for the invention.

The inventors have found that the wrongly written or misused characters most common in computer documents today are homophones (or near-homophones), characters of similar shape, and Simplified/Traditional conversion errors. Homophone errors occur most often in words entered with a phonetic input method, for example "mean value" (平均值) mistyped as "average matter" (平均質), or "must" mistyped as "taboo" or "closing." Similar-shape errors occur more often in documents entered with a shape-based input method (for example the Cangjie input method) or by OCR. For example, with the Cangjie input method "market situation" (市場形勢; the Cangjie code of 勢 is 土戈大尸) may be mistyped as 市場形劫 (the Cangjie code of 劫 is also 土戈大尸), or OCR may misread " " as "seriously." Simplified/Traditional conversion errors occur mainly where one Simplified character corresponds to several Traditional characters, for example 後面 ("behind") wrongly rendered as 后面.

Because wrongly written or misused characters in computer documents arise from more than one cause, a method for detecting and correcting errors in computer documents must handle errors of every cause. To that end, the invention uses different code lookup tables together with a two-stage word segmentation technique, so that errors arising from different causes are each detected and corrected.

Fig. 1 is a system flowchart of the method for automatic error detection and correction in computer documents according to the invention. The method of the invention is described below with reference to the drawings.

When the device and method of the invention are used to detect wrongly written or misused characters, the system first, in step 101, takes a passage of fixed length from the document and, using punctuation marks as boundaries, treats the text before each punctuation mark as one unit, a "sentence," to be processed.
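The sentence-extraction step can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the punctuation set `PUNCTUATION` and the function name `split_sentences` are hypothetical, since the text only says that punctuation marks bound each unit.

```python
import re

# Hypothetical punctuation set (assumption); the source does not list the marks.
PUNCTUATION = "，。！？；：、,.!?;:"


def split_sentences(text):
    """Return the runs of characters that precede each punctuation mark."""
    pattern = "[" + re.escape(PUNCTUATION) + "]"
    # Drop empty pieces produced by consecutive or trailing punctuation.
    return [part for part in re.split(pattern, text) if part]


if __name__ == "__main__":
    for unit in split_sentences("其平均質頗為可信。造成市場形劫與力量。"):
        print(unit)
```

Each returned unit would then be passed on to the first-stage segmentation.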

At this processing stage, every character is assigned a code according to a general-purpose encoding. Suitable encodings include the BIG5 code commonly used in industry.

Next, in step 102, the system looks in a dictionary encoded in the general-purpose encoding for every word formed by a substring of two or more characters of the sentence. The words (substrings) obtained in this step may overlap one another, so a procedure is needed to select a sequence of words that do not overlap and that follow one another without gaps. Then, in step 103, the best word segmentation of the sentence is determined by fixed rules from the length, frequency, and part-of-speech adjacency of the words obtained. A segmentation method suitable for this step is the conventional one disclosed in Taiwan patent No. 81105610, "Chinese document compression processing method and device." This completes the first-stage word segmentation.
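The dictionary lookup of step 102 might look like the sketch below. The toy dictionary, the `max_len` limit, and the Chinese rendering of the example sentence are assumptions made for illustration; the source does not specify how the lookup is implemented.

```python
def find_dictionary_words(sentence, dictionary, max_len=4):
    """Return (start, word) for every substring of 2+ characters in the dictionary."""
    hits = []
    for start in range(len(sentence)):
        for length in range(2, max_len + 1):
            word = sentence[start:start + length]
            # Skip slices that ran past the end of the sentence.
            if len(word) == length and word in dictionary:
                hits.append((start, word))
    return hits


if __name__ == "__main__":
    toy_dictionary = {"平均", "平均值", "頗為", "可信"}  # hypothetical entries
    print(find_dictionary_words("其平均質頗為可信", toy_dictionary))
```

Because the hits can overlap, a later step still has to pick a non-overlapping sequence that covers the sentence.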

The first-stage word segmentation uses the existing dictionary to pre-segment the sentence, which saves time in subsequent processing.

Fig. 2 is a system diagram of the second-stage word segmentation subsystem of the invention. As shown, the subsystem comprises an original document store 201, a source-document-to-code conversion device 202, a code lookup table 203, a code file store 204, a code-to-target-document conversion device 205, an output code conversion lookup table 206, and a target document store 207.

If the code lookup table 203 is a sound code lookup table, each character can be coded according to its pronunciation, for example using its phonetic symbols as its code. In this table, homophones therefore have the same code. If the code lookup table is a shape code lookup table, characters that are close in shape, or that are easily confused on input, are grouped into character sets (clusters), and one character of each set serves as the code of that set.
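The two kinds of lookup table can be illustrated with a small sketch. All table entries here are hypothetical: the sound codes are zhuyin spellings without tone marks, and the single shape cluster pairs 勢 and 劫, which share a Cangjie code. The choice of the first member as the cluster code is also an assumption.

```python
# Toy sound code table: homophones map to the same code (assumed entries).
SOUND_CODE = {"質": "ㄓ", "值": "ㄓ", "平": "ㄆㄧㄥ", "均": "ㄐㄩㄣ"}

# Toy shape clusters: the first member of each cluster is used as its code.
SHAPE_CLUSTERS = [["劫", "勢"]]
SHAPE_CODE = {ch: cluster[0] for cluster in SHAPE_CLUSTERS for ch in cluster}


def to_codes(text, table):
    """Convert each character to its code; unknown characters stand for themselves."""
    return [table.get(ch, ch) for ch in text]


if __name__ == "__main__":
    print(to_codes("平均質", SOUND_CODE))
    print(to_codes("形勢", SHAPE_CODE))
```

After conversion, a wrong character and the intended one become indistinguishable, which is what lets the second pass recover the intended word.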

Table I shows part of the contents of a shape code lookup table suitable for the invention. In this table, column 1 lists the member characters of each character set, columns 2 and 3 give their Cangjie codes, and column 4 gives the code of the character set.

In addition, if the code lookup table 203 is a Simplified/Traditional character code lookup table, it contains a character table that maps each single Simplified character to its several Traditional characters, and uses the pronunciation code (such as phonetic symbols) as the code.
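The one-to-many relation behind such a table can be sketched as below. The three entries are well-known cases chosen for illustration; the table name `ONE_TO_MANY` and the function are hypothetical, and the sketch enumerates candidates directly rather than going through pronunciation codes.

```python
from itertools import product

# Simplified characters that correspond to several Traditional characters.
ONE_TO_MANY = {
    "后": ["后", "後"],
    "面": ["面", "麵"],
    "干": ["干", "乾", "幹"],
}


def traditional_candidates(text):
    """Enumerate every Traditional rendering of a short Simplified string."""
    options = [ONE_TO_MANY.get(ch, [ch]) for ch in text]
    return ["".join(choice) for choice in product(*options)]


if __name__ == "__main__":
    print(traditional_candidates("后面"))
```

Only a dictionary lookup and the segmentation score can then decide which of the enumerated renderings is an actual word in context.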

After each sentence is converted into codes in step 104, it is stored in the code file store 204 in step 105, to be converted by the code-to-target-document conversion device 205. In step 106, the conversion device 205 uses the codes of the sentence to find, in a dictionary indexed by code, the substrings whose codes match code strings in the sentence, and records them. Then, in step 107, the conversion device 205 determines the best word segmentation from four factors of the recorded substrings: word length, word frequency, part-of-speech adjacency, and number of changed characters. Finally, in step 108, a new sentence is assembled from the selected words.
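The code-indexed lookup of step 106 could be sketched as follows. The index layout (a tuple of per-character codes as key), the toy word list, and the function names are assumptions; the source only states that the dictionary is encoded by code.

```python
from collections import defaultdict


def build_code_index(words, table):
    """Index dictionary words by the tuple of their per-character codes."""
    index = defaultdict(list)
    for word in words:
        key = tuple(table.get(ch, ch) for ch in word)
        index[key].append(word)
    return index


def find_candidates(sentence, index, table, max_len=4):
    """Return (start, word) for every dictionary word whose codes match the sentence."""
    codes = [table.get(ch, ch) for ch in sentence]
    found = []
    for start in range(len(codes)):
        for length in range(1, max_len + 1):
            if start + length > len(codes):
                break
            key = tuple(codes[start:start + length])
            for word in index.get(key, []):
                found.append((start, word))
    return found


if __name__ == "__main__":
    sound = {"質": "ㄓ", "值": "ㄓ"}  # hypothetical sound codes
    idx = build_code_index(["平均", "平均值", "質"], sound)
    print(find_candidates("平均質", idx, sound))
```

Note that the lookup returns the correct word 平均值 even though the input contains 質, because both share the same code.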

If necessary, the conversion device 205 can display the result in step 109 for the user to confirm; otherwise it corrects the sentence automatically. In step 110, using the codes of the output code conversion lookup table 206, it converts the changed sentence into a target document in the general-purpose encoding and stores it in the target document store 207.

In the invention, the conversion device 205 may determine the best word segmentation by any conventional method, for example the one described in the aforementioned patent application No. 83103817. The example of the invention, however, uses the following procedure, which may give better results.

The segmentation is computed by choosing a sequence of words W_i, i = 1 to n, such that

\sum_{i=1}^{n} \left( 20 \times |W_i| - \log_{10} \mathrm{Prob}(W_i) - \log_{10} \mathrm{Prob}(POS_i \mid POS_{i-1}) - 30 \times C_i \right)

is a maximum,

where POS_i is the part of speech of W_i and C_i is the number of changed characters in W_i.

Definitions:

Word length (|W_i|): the number of characters in a Chinese word. For example, the word length of "happy" (快樂) is 2 and that of "fast" (快) is 1.

Word frequency (Prob(W_i)): the frequency with which a word occurs in text. For example, if "happy" occurs 100 times in a corpus of 1,000,000 words, its word frequency is 0.0001.

Part-of-speech adjacency probability (Prob(POS_i | POS_{i-1})): the probability that a word of part of speech Y occurs at position i of a sentence, given that a word of part of speech X occurs at position i-1. For example, if verbs occur 100 times in a corpus and are followed by a noun 32 times, the adjacency probability of verb and noun is:

Prob(noun | verb) = 0.32.

Number of changed characters (C_i): in the second segmentation pass, the number of characters in which the candidate word with the same sound code (or the same shape code) differs from the original input at the same position. For example, "situation" is the same-shape-code candidate for "shape misfortune" at the same position, so the number of changed characters is 1.
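A minimal sketch of how the per-segmentation value could be computed is given below. It follows the signs of the formula exactly as printed. The function name, the dictionary-based probability tables, and the `start_tag` used for the part of speech before the first word are assumptions, since the text does not say how POS_0 is handled.

```python
import math


def segmentation_score(words, pos_tags, changes, word_prob, pos_trans_prob,
                       start_tag="<s>"):
    """Sum 20*|W_i| - log10 Prob(W_i) - log10 Prob(POS_i|POS_{i-1}) - 30*C_i."""
    total = 0.0
    prev = start_tag  # assumed placeholder for the tag before the first word
    for word, pos, changed in zip(words, pos_tags, changes):
        total += 20 * len(word)                            # word length term
        total -= math.log10(word_prob[word])               # word frequency term
        total -= math.log10(pos_trans_prob[(prev, pos)])   # adjacency term
        total -= 30 * changed                              # changed characters
        prev = pos
    return total


if __name__ == "__main__":
    # Word frequency estimated from counts: 100 occurrences in 1,000,000 words.
    word_prob = {"快樂": 100 / 1_000_000}
    pos_trans_prob = {("<s>", "VH"): 0.1}  # hypothetical value
    print(segmentation_score(["快樂"], ["VH"], [0], word_prob, pos_trans_prob))
```

Candidate segmentations of the same sentence would each be scored this way and then compared.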

Without wishing to be bound by any theory, the above formula is based on conclusions drawn from numerous research reports on Chinese computer processing and from the inventors' long-term development and testing. They are described as follows:

1. The simplest and most effective rule, the "longest word first" principle, achieves an accuracy above 90%. However, when two segmentations yield words of the same length, the "longest word first" principle cannot decide which to adopt.

2. In that situation, statistics on how words are ordinarily used help select the correct segmentation in most cases. These statistics include word frequency and the frequency with which adjacent parts of speech occur together. For example, in terms of part of speech, "degree adverb ‖ adjective" is better than "time adverb ‖ verb"; therefore, during segmentation, the latter should take priority.

3. The effect of the formula varies to some degree with the vocabulary contained in the dictionary and the type of text processed. The formula was fixed after detailed analysis and experimental adjustment on a large number of texts of different types.
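The "longest word first" principle of point 1 is commonly realized as forward maximum matching. The sketch below shows that technique, which is named here as an illustration and is not confirmed by the source as the inventors' procedure; the toy dictionary and `max_len` are assumptions.

```python
def longest_match_segment(sentence, dictionary, max_len=4):
    """Greedy forward segmentation that always takes the longest dictionary word."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest possible piece first, falling back to one character.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            if length == 1 or piece in dictionary:
                words.append(piece)
                i += length
                break
    return words


if __name__ == "__main__":
    toy_dictionary = {"平均", "平均值", "頗為", "可信"}  # hypothetical entries
    print(longest_match_segment("其平均值頗為可信", toy_dictionary))
```

A greedy rule like this has no way to break ties between equally long alternatives, which is why point 2 brings in frequency and part-of-speech statistics.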

Embodiments of the invention are described below:

Embodiment 1: Correction of homophone errors

First, in step 101, the system takes a passage of fixed length from the document to be processed in the original document store 201 and, using punctuation marks as boundaries, selects one unit, a "sentence," to be processed:

"其平均質頗為可信" ("Its average matter is rather credible").

The document to be processed is in the BIG5 code announced by the Ministry of Education, and its codes are:

其 平 均 質 頗 為 可 信

A8E4 A5AD A7A1 BDE8 BBE1 ACB0 A569 AB48
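The code listing above can be checked with a short script. This sketch relies only on Python's built-in `big5` codec; the helper name `decode_big5_hex` is hypothetical.

```python
def decode_big5_hex(hex_codes):
    """Decode a whitespace-separated string of BIG5 hex codes into text."""
    raw = bytes.fromhex("".join(hex_codes.split()))
    return raw.decode("big5")


if __name__ == "__main__":
    sentence = decode_big5_hex("A8E4 A5AD A7A1 BDE8 BBE1 ACB0 A569 AB48")
    # Print each character next to its two-byte code.
    for ch in sentence:
        print(ch, ch.encode("big5").hex().upper())
```

Each character occupies two bytes, so the eight codes correspond to the eight characters of the example sentence.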

In this sentence, the character 質 ("matter") is a homophone error for 值 ("value").

During processing, in step 102 the system looks in a dictionary encoded in BIG5 for every word formed by a substring of two or more characters of the sentence. In step 103, the words obtained are pre-segmented in the first stage by a conventional segmentation method. The result is as follows:

︱其︱平均︱質︱頗為︱可信︱

In step 104, it is assumed that the errors to be handled are homophone errors. The source-document-to-code conversion device 202 therefore fetches the sound code lookup table from the code lookup table 203, converts the sentence to be processed into sound codes, and in step 105 stores the result in the code file store 204:

︱其︱平均︱質︱頗為︱可信︱
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
ㄑㄧ ㄆㄧㄥ ㄐㄩㄣ ㄓ ㄆㄛ ㄨㄟ ㄎㄜ ㄒㄧㄣ

In step 106, the code-to-target-document conversion device 205 uses the sound codes of the sentence to find, in a dictionary 208 encoded by sound code, the substrings whose sound codes match sound code strings in the sentence, together with their parts of speech:

︱其︱平均︱質︱頗為︱可信︱
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
ㄑㄧ ㄆㄧㄥ ㄐㄩㄣ ㄓ ㄆㄛ ㄨㄟ ㄎㄜ ㄒㄧㄣ
︱Nh︱VH︱Na︱Df︱VH︱, and

︱其︱平均值︱頗為︱可信︱
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
ㄑㄧ ㄆㄧㄥ ㄐㄩㄣ ㄓ ㄆㄛ ㄨㄟ ㄎㄜ ㄒㄧㄣ
︱Nh︱Na︱Df︱VH︱, and so on.

Then, in step 107, the conversion device 205 determines the best word segmentation from four factors of the recorded substrings (word length, word frequency, part-of-speech adjacency, and number of changed characters), namely the segmentation for which

\sum_{i=1}^{n} \left( 20 \times |W_i| - \log_{10} \mathrm{Prob}(W_i) - \log_{10} \mathrm{Prob}(POS_i \mid POS_{i-1}) - 30 \times C_i \right)

is a maximum,

where POS_i is the part of speech of W_i and C_i is the number of changed characters in W_i.

Word length (|W_i|): the number of characters in a Chinese word. The word length of 平均值 ("mean value") is 3 and that of 平均 ("average") is 2.

Word frequency (Prob(W_i)): the frequency with which a word occurs in a text of 1,000,000 words. For example, 平均值 occurs 1 time, 平均 occurs 101 times, and 質 occurs 33 times.

Part-of-speech adjacency probability (Prob(POS_i | POS_{i-1})): the probability that a word of part of speech Y occurs at position i of a sentence, given that a word of part of speech X occurs at position i-1. In the example above, the adjacency probability of 平均值 occurring after 其 is Prob(Na | Nh); that of 平均 occurring after 其 is Prob(VH | Nh); and that of 質 occurring after 平均 is Prob(Na | VH).

Number of changed characters (C_i): in the second segmentation pass, the number of characters in which the homophone candidate differs from the original input at the same position. For example, 平均值 is the homophone candidate for 平均質 at the same position, so the number of changed characters is 1.
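The number of changed characters amounts to a position-by-position comparison, as in this short sketch. The function name `change_count` is hypothetical, and the Chinese renderings of the two example pairs ("mean value" versus "average matter", "situation" versus "shape misfortune") are assumed.

```python
def change_count(candidate, original):
    """Count positions at which the candidate word differs from the original text."""
    if len(candidate) != len(original):
        raise ValueError("candidate and original must have the same length")
    return sum(1 for a, b in zip(candidate, original) if a != b)


if __name__ == "__main__":
    print(change_count("平均值", "平均質"))
    print(change_count("形勢", "形劫"))
```

Since each changed character costs 30 in the formula, a candidate that rewrites more of the input needs correspondingly stronger support from the other terms.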

The results of the calculation include the following:

︱其︱平均︱質︱頗為︱可信︱: 2.498 × 10^-8

︱其︱平均值︱頗為︱可信︱: 3.194 × 10^-5

In step 108, the conversion device 205 selects the words of the segmentation with the higher value calculated above and assembles the new sentence:

︱其︱平均值︱頗為︱可信︱

In addition, the conversion device 205 can display the result in step 109 for the user to confirm; otherwise it corrects the sentence automatically. In step 110, the code-to-target-document conversion device 205 uses the codes of the output code conversion lookup table 206 to convert the changed sentence into a target document in the general-purpose encoding and stores it in the target document store 207. This completes the correction of homophone errors.

Embodiment 2: Correction of similar-shape errors

"造成市場形劫與力量" ("cause market shape to rob and strength").

This sentence contains a similar-shape error; the intended text is "造成市場形勢與力量" ("causing the market situation and strength").

Next, in step 102, the system looks in a dictionary encoded in BIG5 for every word formed by a substring of two or more characters of the sentence. In step 103, the words obtained are pre-segmented in the first stage by a conventional segmentation method. The result is as follows:

︱造成︱市場︱形︱劫︱與︱力量︱

In step 104, it is assumed that the errors to be handled are similar-shape errors. The source-document-to-code conversion device 202 therefore fetches the shape code lookup table from the code lookup table 203, converts the sentence to be processed into shape codes, and in step 105 stores the result in the code file store 204:

︱造成︱市場︱形︱劫︱與︱力量︱

↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓

The ︱ trunk is afraid of that the ︱ baa chivalrous ︱ of ︱ that givens to flattery robs ︱ and pulls ︱ six horse ︱.

The shape code lookup table groups characters that are close in shape, or that are easily confused on input, into character sets (clusters), and uses one character of each set as the code of that set. For example, "trunk" above is the code representing the similar-shape character set "Zao, Week, widely different, trunk ...", "fear" is the code representing the similar-shape character set "one-tenth is favored with ashamed fearness ...", "baa" is the code representing the similar-shape character set "city Xin Mieyang", and so on.

In step 106, the code-to-target-document conversion device 205 uses the shape codes of the sentence to find, in a dictionary 208 encoded by shape code, the substrings whose codes match code strings in the sentence:

︱造成︱市場︱形︱劫︱與︱力量︱

The ︱ trunk is afraid of that the ︱ baa chivalrous ︱ of ︱ that givens to flattery robs ︱ and pulls ︱ six horse ︱

︱VK33︱Nc30︱Na42︱VD45︱Ca24︱Na41︱, and

︱造成︱市場︱形勢︱與︱力量︱

The ︱ trunk is afraid of that the ︱ baa chivalrous misfortune of the ︱︱ that givens to flattery pulls six yards ︱ of ︱

︱VK33︱Nc30︱Na99︱Ca24︱Na41︱, and so on.

Then, in step 107, the conversion device 205 determines the best word segmentation from four factors of the recorded substrings: word length, word frequency, part-of-speech adjacency, and number of changed characters.

The segmentation is computed by choosing a sequence of words W_i, i = 1 to n, such that

\sum_{i=1}^{n} \left( 20 \times |W_i| - \log_{10} \mathrm{Prob}(W_i) - \log_{10} \mathrm{Prob}(POS_i \mid POS_{i-1}) - 30 \times C_i \right)

is a maximum.

The results of the calculation are:

︱造成︱市場︱形︱劫︱與︱力量︱: 3.697 × 10^-5

︱造成︱市場︱形勢︱與︱力量︱: 2.184 × 10^-2

The segmentation with the higher value is selected, giving the new sentence:

︱造成︱市場︱形勢︱與︱力量︱

In addition, the conversion device 205 can display the result in step 109 for the user to confirm; otherwise it corrects the sentence automatically. In step 110, the code-to-target-document conversion device 205 uses the codes of the output code conversion lookup table 206 to convert the changed sentence into a target document in the general-purpose encoding and stores it in the target document store 207. This completes the correction of similar-shape errors.

The above method of correcting similar-shape errors can be applied to documents produced by any input method that relies on the shape features of characters. Applicable examples include documents entered with Cangjie codes and documents entered by OCR.

Embodiment 3: Correction of Simplified/Traditional conversion errors

"老板從后面端來湯面與鹵豆腐干" ("The boss brought noodle soup and braised dried bean curd from the back").

This sentence contains Simplified/Traditional conversion errors; the intended text is "老闆從後面端來湯麵與鹵豆腐乾".

Next, in step 102, the system looks in a Simplified/Traditional bilingual dictionary for every word formed by a substring of two or more characters of the sentence. In step 103, the words obtained are pre-segmented in the first stage by a conventional segmentation method. The result is as follows:

︱老板︱從︱后︱面︱端︱來︱湯︱面︱與︱鹵︱豆腐干︱

In step 104, it is assumed that the errors to be handled are Simplified/Traditional conversion errors. The source-document-to-code conversion device 202 therefore fetches the Simplified/Traditional conversion code lookup table from the code lookup table 203, converts the sentence to be processed into Simplified/Traditional conversion codes, and in step 105 stores the result in the code file store 204:

↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓

老 板 從 后 面 端 來 湯 面 與 鹵 豆 腐 干

The Simplified/Traditional character code lookup table 203 contains a character table that maps each single Simplified character to its several Traditional characters, and uses one of those Traditional characters as the code.

In step 106, the code-to-target-document conversion device 205 uses the Simplified/Traditional conversion codes of the sentence to find, in a dictionary 208 encoded by code, the substrings whose codes match code strings in the sentence:

︱老板︱從︱后︱面︱端︱來︱湯︱面︱與︱鹵︱豆腐︱干︱
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
老 板 從 后 面 端 來 湯 面 與 鹵 豆 腐 干
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
︱Na︱Pb︱Na︱Na︱Vc︱Vc︱Na︱Na︱Ca︱Vc︱Na︱Na︱, and

︱老闆︱從︱後面︱端︱來︱湯麵︱與︱鹵︱豆腐乾︱
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
老 板 從 后 面 端 來 湯 面 與 鹵 豆 腐 干
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
︱Na︱Pb︱Nc︱Vc︱Vc︱Na︱Ca︱Vc︱Na︱, and so on.

The segmentation is computed by choosing a set of words W_i, i = 1 to n, such that

\sum_{i=1}^{n} \left( 20 \times |W_{i}| - \log_{10} \mathrm{Prob}(W_{i}) - \log_{10} \mathrm{Prob}(POS_{i} \mid POS_{i-1}) - 30 \times C_{i} \right)

is maximized.
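The expression can be evaluated directly once the four quantities are known for each word. The following Python sketch does so with the signs and the constants 20 and 30 exactly as printed above; the `Word` record, its field names, and the sample probabilities are assumptions for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class Word:
    length: int      # |W_i|: number of characters in the word
    prob: float      # Prob(W_i): word frequency
    pos_prob: float  # Prob(POS_i | POS_{i-1}): part-of-speech transition probability
    changed: int     # C_i: number of characters changed relative to the input


def segmentation_score(words, alpha=20.0, beta=30.0):
    """Sum the per-word terms of the expression, keeping the signs as printed in the text."""
    return sum(
        alpha * w.length
        - math.log10(w.prob)
        - math.log10(w.pos_prob)
        - beta * w.changed
        for w in words
    )
```

For a single two-character word with `prob=0.01`, `pos_prob=0.1`, and no changed characters, the terms are 40 + 2 + 1 - 0, giving a score of 43.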

The computed results are as follows:

︱ face ︱ end ︱ Come ︱ Soup ︱ face ︱ is with ︱ Halogen ︱ dried bean curd ︱ behind the boss ︱︱ From ︱: 1.876 * 10 ^-9

The old Board ︱ of ︱ From ︱ Hou face ︱ end ︱ Come ︱ Soup Surface ︱ is with ︱ Halogen ︱ bean curd universe ︱: 3.284 * 10 ^-3

The old Board ︱ of ︱ From ︱ Hou face ︱ end ︱ Come ︱ Soup Surface ︱ is with ︱ Halogen ︱ bean curd universe ︱

Embodiment four: correction of OCR recognition errors

The user feeds a sheet of paper printed with the text " Hair and go out the crisp Sound sound of golden Genus one Specifications Ring " into a scanner for OCR recognition. The recognition result is as follows; for each character position, the glyph comparison procedure yields 1 to 10 candidates:

Columns: number of candidates, followed by candidates one through ten, each given as a character and its score.

10 Hair 03886 to step on 04408 appropriate 04775 kind of 04797 principle 04799 egg 04849 Service 04870 flea 05017 ridge 05020 concubine 0503810 to go out 03464 mountain 03961 by 05015 ending 05057 and go to modern 03754 enterprise 03781 of few 0531310 gold medal, 03166 full 0 3285 of 05129 mortar 05301 to close for 03878 house 04043 the present 04054 of 03984 Kui to contain 04448 and cover 0447510 stone 04940 on 04616 native 04623 * 04880

05059 holds 05244 Code, 05315 Mom, 05405 rotten 05427 horse, 05466 Buddhist nun, 05486 milk, 05540 Do 0554203-02000-02010-02030,00,000 00,000 00,000 00,000 00,000 00,000 0000010 lemon, 03726 kerria, 04088 rubber, 04402 strain, 04417 chinaberry, 04548 lemon, 04559 high post, 04654 Mining 04682 sows 04702 behaviour 0472110 04768 helped 0477810 Sound, 04191 bud 04851 how 04871 last of the ten Heavenly stems, 04875 Song, 04934 public affairs 04935 was held a memorial ceremony for 04940 fraud, 04946 rub-a-dub 04952 tea, 0495510 sound, 03232 former times, 03708 hot 03811 temple, 03944 blue or green 04,024 04,042 ten 04125 cards, 04,130 thousand 04139 hand 04251 when 04685 Trial, 05061 rate, 05129 zither 05183 was climbed 05200 Music, 05208 jail 05254 and inspired confidence in 05282 soldier, 05288 honor 0532810 crisp 03468 and kneel 03950 born of the same parents 04359 and take off 04403 mast 04494 and execute 03486 of 04585 wrist, 04647 rudder, 04766 puppet, 04921 tripe 0492210 and angle 04119 Approximately 04121 to drink 04474 Jun, 04531 bamboo, 04662 young 04684 Yo 04692

During processing, the system first takes the first candidates of a passage of fixed length from the OCR candidate data area 201 of the document to be processed and, in step 101, selects one unit delimited by punctuation marks, a "sentence", as the object of processing:

“ Hair goes out the crisp Sound sound of metal and stone one Specifications "
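Step 101 amounts to cutting the text at punctuation marks. A minimal Python sketch of that step is given below; the function name `split_sentences` and the particular set of punctuation marks are assumptions, since the text does not list which marks act as boundaries.

```python
import re

# Assumed boundary marks: common CJK and ASCII sentence punctuation.
BOUNDARY = re.compile(r"[。，、；：！？,.;:!?]")


def split_sentences(text):
    """Split text at punctuation marks and drop empty pieces."""
    return [piece.strip() for piece in BOUNDARY.split(text) if piece.strip()]
```

Each returned piece then plays the role of one "sentence" for the later steps.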

Inspection of the sentence to be processed shows that two characters in it, " stone " and " ", are wrong characters for " Genus " and " " respectively.

Next, in step 102, the system finds, in a dictionary encoded in BIG5, the words formed by any sub-string of two or more characters in the sentence. In step 103, the words found are given a first-stage preliminary segmentation using a commonly used segmentation method. The result is as follows:

︱ Hair goes out the ︱ Sound sound ︱ of the crisp ︱ of ︱ gold ︱ stone ︱ one Specifications ︱ Xiang ︱
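The text does not say which commonly used segmentation method is applied in step 103. One common choice is forward maximum matching, sketched below in Python as an assumption: at each position the longest dictionary word is taken, and a character with no match stands alone. The dictionary contents and the name `max_match` are illustrative.

```python
def max_match(sentence, dictionary, max_len=4):
    """Greedy left-to-right segmentation into non-overlapping, contiguous words."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words
```

For instance, `max_match("金屬聲音", {"金屬", "聲音"})` yields the two dictionary words in order, while a string with no dictionary matches is returned character by character.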

In step 104, the error type to be processed is preset as OCR recognition errors. The source-document-to-code conversion device 202 therefore retrieves the OCR similar-shape lookup table from the code lookup table 203, converts the sentence to be processed into OCR codes, and in step 105 stores the result in the code file memory 204:

↓↓ ↓ ↓ ↓↓ ↓ ↓ ↓

The full stone ︱ one strain ︱ Ring ︱ Tuo ︱ Approximately ︱ Sound sound ︱ of ︱ mountain-climbing ︱

The shape code lookup table groups characters whose glyphs are similar, and which are therefore easily confused during OCR recognition, into character sets (clusters), and uses one character of each set as its code. For example, the "stepping on" above represents the following set of similar-shaped characters: " Hair steps on clear"; "mountain" represents the set "mountain goes out celestial being ..."; "entirely" represents the set "full gold with ..."; and so on.
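One way to picture how such clusters arise is to merge pairs of confusable characters until every character points at a single representative. The Python sketch below does this with a small union-find structure. The pairs used in the usage note (山/出, 出/仙, 全/金) are a guess at the characters behind the translated examples, and the name `build_shape_codes` is hypothetical.

```python
def build_shape_codes(confusable_pairs):
    """Merge confusable pairs into clusters and map each character to one representative."""
    parent = {}

    def find(ch):
        parent.setdefault(ch, ch)
        while parent[ch] != ch:
            parent[ch] = parent[parent[ch]]  # path halving keeps chains short
            ch = parent[ch]
        return ch

    for a, b in confusable_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # the earlier root stays the representative
    return {ch: find(ch) for ch in parent}
```

Calling `build_shape_codes([("山", "出"), ("出", "仙"), ("全", "金")])` places 山, 出, and 仙 in one cluster coded 山, and 全 and 金 in another coded 全.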

The document conversion device 205, in step 106, uses the shape codes of the sentence to find, in a code-encoded dictionary 208, the sub-strings whose codes match those of strings in the sentence:

︱ mountain-climbing ︱ gold ︱ stone ︱ one strain ︱ Ring ︱ Tuo ︱ Approximately ︱ Sound sound ︱

︱ VR ︱ Na ︱ Na ︱ Da ︱ A ︱ A ︱ Ta ︱ Na ︱, and

︱ Hair goes out the ︱ Sound sound ︱ of the crisp ︱ of ︱ gold Genus ︱ one Specifications ︱ Ring

︱ Hair mountain ︱ metal and stone ︱ one strain ︱ Ring Tuo ︱ Approximately ︱ Sound sound ︱

︱ VR ︱ Na ︱ Da ︱ VH ︱ Ta ︱ Na ︱.
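The lookup in step 106 can be pictured as indexing the dictionary by code rather than by spelling, so that every dictionary word whose code equals the code of a sub-string becomes a candidate for that position. The Python sketch below assumes a character-to-code mapping `code_of` that maps one character to one code character, and a small word list; both are hypothetical, as are the function names.

```python
from collections import defaultdict


def encode(text, code_of):
    """Convert text to its code string, one code character per input character."""
    return "".join(code_of.get(ch, ch) for ch in text)


def candidates_by_code(sentence, words, code_of):
    """Return (start, end, word) for each dictionary word whose code matches a sub-string."""
    index = defaultdict(list)
    for word in words:
        index[encode(word, code_of)].append(word)

    coded = encode(sentence, code_of)
    found = []
    for start in range(len(coded)):
        for end in range(start + 1, len(coded) + 1):
            for word in index.get(coded[start:end], []):
                found.append((start, end, word))
    return found
```

If 末 and 未 share the code 未, the misrecognized input 末來 produces the dictionary word 未來 as a candidate for positions 0 to 2, even though that spelling never appeared in the input.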

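The segmentation is computed by choosing a set of words W_i, i = 1 to n, such that

\sum_{i=1}^{n} \left( 20 \times |W_{i}| - \log_{10} \mathrm{Prob}(W_{i}) - \log_{10} \mathrm{Prob}(POS_{i} \mid POS_{i-1}) - 30 \times C_{i} \right)

is maximized.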
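Choosing the set of words that maximizes a sum of per-word terms is a best-path problem over the candidate words, which dynamic programming can solve. The sketch below is one way to do it in Python, under the assumption that each candidate is given as `(start, end, word, score)` with the per-word score already computed. The part-of-speech transition term depends on the previous word; here it is assumed to be folded into that score, so this is a simplification and not the procedure of the device itself.

```python
def best_segmentation(length, candidates):
    """Pick contiguous, non-overlapping words covering [0, length) with maximal total score."""
    neg_inf = float("-inf")
    best = [neg_inf] * (length + 1)  # best[k]: best total score for the prefix of length k
    back = [None] * (length + 1)     # back[k]: (start, word) of the last word in that prefix
    best[0] = 0.0

    # Process candidates by end position so that best[start] is final when it is used.
    for start, end, word, score in sorted(candidates, key=lambda c: c[1]):
        if best[start] > neg_inf and best[start] + score > best[end]:
            best[end] = best[start] + score
            back[end] = (start, word)

    if length > 0 and back[length] is None:
        return None  # the candidates cannot cover the whole sentence

    words, k = [], length
    while k > 0:
        start, word = back[k]
        words.append(word)
        k = start
    return list(reversed(words))
```

Given candidates `a`, `b`, and `c` with score 1 each and a two-character candidate `ab` with score 5, the function prefers `["ab", "c"]` over `["a", "b", "c"]`.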

The computed results are as follows:

︱ Hair goes out the ︱ Sound sound ︱ of the crisp ︱ of ︱ gold ︱ stone ︱ one Specifications ︱ Xiang ︱: 2.19*10 ^-12

︱ Hair goes out the ︱ Sound sound ︱ of the crisp ︱ of ︱ gold Genus ︱ one Specifications ︱ Ring: 3.86*10 ^-9

︱ Hair goes out the ︱ Sound sound ︱ of the crisp ︱ of ︱ gold Genus ︱ one Specifications ︱ Ring.

In addition, the conversion device 205 can display the processing result in step 109 for the user to confirm; otherwise the sentence content is corrected automatically, and in step 110 the code-to-target-code conversion device 205 converts the changed sentence, according to the codes of the output code conversion lookup table 206, into a target document encoded in the common coding scheme, which is stored in the target document memory 207. This completes the correction of OCR recognition errors.

With the method provided by the present invention, the correction may produce a correct character that was not among the candidates.

Embodiment five: correction of errors in a Japanese manuscript

" the chemical naturally The research of それは The Ru.”

Inspection of the sentence to be processed shows that " chemistry " in it is a homophone error for " science ".

Next, in step 102, the system finds, in a Japanese dictionary, the words formed by any Japanese sub-string in the sentence. In step 103, the words found are given a first-stage preliminary segmentation using a commonly used segmentation method. The result is as follows:

︱それでは︱ nature ︱ chemistry ︱ The ︱ research ︱ The Ru ︱

In step 104, the error type to be processed is preset as homophone errors. The source-document-to-code conversion device 202 therefore retrieves the Japanese sound code lookup table from the code lookup table 203, converts the sentence to be processed into Japanese sound codes, and in step 105 stores the result in the code file memory 204:

︱それでは︱ nature ︱ chemistry ︱ The ︱ research ︱ The Ru ︱.

︱So re de wa︱Shi zen︱Ka gaku︱O︱Ken kyuu︱Su ru︱
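Converting to a sound code makes homophones indistinguishable, which is what lets the later dictionary lookup propose a different word with the same reading. A small Python sketch follows; the reading table `READINGS` is a hand-written assumption that covers only a few words, and romanized readings stand in for whatever code format the table 203 actually uses.

```python
# Hypothetical reading table: word -> romanized reading used as the sound code.
READINGS = {
    "自然": "shizen",
    "化学": "kagaku",
    "科学": "kagaku",
    "研究": "kenkyuu",
}


def sound_code(words):
    """Replace each segmented word by its reading; words without an entry are kept."""
    return [READINGS.get(w, w) for w in words]


def homophones(word):
    """List the other words in the table that share the reading of the given word."""
    reading = READINGS.get(word)
    if reading is None:
        return []
    return [w for w, r in READINGS.items() if r == reading and w != word]
```

Because 化学 and 科学 share the reading `kagaku`, `homophones("化学")` returns 科学, the kind of same-sounding alternative that the second segmentation pass can then weigh.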

The Japanese sound code lookup table 203 uses the pronunciation of each word as its code, so that all homophones share the same code.

The text file conversion device 205, in step 106, uses the Japanese pronunciation codes of the sentence to find, in a code-encoded dictionary 208, the sub-strings whose codes match those of strings in the sentence:

︱それでは︱ nature ︱ chemistry ︱ The ︱ research ︱ The Ru ︱.

︱So re de wa︱Shi zen︱Ka gaku︱O︱Ken kyuu︱Su ru︱

︱ conjunction ︱ noun ︱ noun ︱ object ︱ verbal noun ︱ verb III ︱, and

︱それでは︱ natural science ︱ The ︱ research ︱ The Ru ︱.

︱So re de wa︱Shi zen Ka gaku︱O︱Ken kyuu︱Su ru︱

︱ conjunction ︱ noun ︱ object ︱ verbal noun ︱ verb III ︱, and so on.

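The segmentation is computed by choosing a set of words W_i, i = 1 to n, such that

\sum_{i=1}^{n} \left( 20 \times |W_{i}| - \log_{10} \mathrm{Prob}(W_{i}) - \log_{10} \mathrm{Prob}(POS_{i} \mid POS_{i-1}) - 30 \times C_{i} \right)

is maximized.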

The computed results are as follows:

︱それでは︱ nature ︱ chemistry ︱ The ︱ research ︱ The Ru ︱: 3.71 * 10 ^-9

︱それでは︱ natural science ︱ The ︱ research ︱ The Ru ︱: 2.92 * 10 ^-6

Finally, in step 108, the conversion device 205 selects the vocabulary with the higher computed value and recomposes a new sentence:

" それでは natural science The research The Ru "
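The selection in step 108 is a plain maximum over the scored candidates. In the Python sketch below the two values are the ones reported above, while the labels `"nature | chemistry"` and `"natural science"` are shorthand assumptions standing in for the two candidate segmentations; `pick_best` is a hypothetical name.

```python
def pick_best(scored):
    """Return the candidate with the highest computed value."""
    return max(scored, key=lambda item: item[1])[0]


scored_candidates = [
    ("nature | chemistry", 3.71e-9),
    ("natural science", 2.92e-6),
]
best = pick_best(scored_candidates)
```

Here `best` is the second candidate, since 2.92e-6 exceeds 3.71e-9.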

In addition, the conversion device 205 can display the processing result in step 109 for the user to confirm; otherwise the sentence content is corrected automatically, and in step 110 the code-to-target-code conversion device 205 converts the changed sentence, according to the codes of the output code conversion lookup table 206, into a target document encoded in the common coding scheme, which is stored in the target document memory 207. This completes the correction of Japanese homophone errors.

The above describes embodiments of the device and method of the present invention for automatic error detection and correction in computer files. From this description, those skilled in the art will readily understand the spirit of the invention and can make various changes and extensions accordingly. All such changes that do not depart from the spirit of the invention fall within the scope of its claims.

Claims

1. A device for automatic error detection and correction in computer files, comprising:

a to-be-processed document acquisition device, which contains an original document memory and can take a passage of a certain length from a document to be processed, treat it as a "sentence", and store it in the original document memory as the object of processing;

a pre-segmentation device, which contains a dictionary encoded in the same coding scheme as the sentence to be processed and, according to certain rules, arranges the words formed by any sub-strings of the sentence in a non-overlapping, contiguous manner;

a source-document-to-code conversion device, which contains a code lookup table and a code file memory and can convert the text of the sentence to be processed into codes according to the code lookup table and store them in the code file memory;

a code-to-target-document conversion device, which contains a code dictionary and can, according to the codes of the sentence to be processed, find in that dictionary the sub-strings whose codes match those of strings contained in the sentence, determine the best segmentation and the applicable vocabulary according to the characteristics of at least two sub-strings of the sentence, and correct the content of the sentence; and

an output device, which contains an output code conversion lookup table and a target document memory and can convert the code-encoded sentence, according to the output code conversion lookup table, into a coding format for output and store it in the target document memory.

2. The device as claimed in claim 1, characterized in that the code-to-target-document conversion device determines the best segmentation and the applicable vocabulary according to, among others: the word length of the sentence to be processed (|W_i|: the number of characters in a word); the word frequency (Prob(W_i): the frequency with which a word occurs in general text); the part-of-speech transition probability (Prob(POS_i | POS_{i-1}): the probability that a word of part of speech Y occurs at position i given that a word of part of speech X occurs at position i-1 of the sentence); and the number of changed characters (C_i: the number of characters by which the word proposed in the code-to-target-document conversion differs from the original input at the same position).

3. The device as claimed in claim 2, characterized in that the code-to-target-document conversion device determines the best segmentation and the applicable vocabulary according to the characteristics of all sub-strings of the sentence to be processed.

4. The device as claimed in claim 3, characterized in that the code-to-target-document conversion device determines the segmentation and the applicable vocabulary by choosing a set of words W_i, i = 1 to n, such that

\sum_{i=1}^{n} \left( \alpha \times |W_{i}| - \log_{10} \mathrm{Prob}(W_{i}) - \log_{10} \mathrm{Prob}(POS_{i} \mid POS_{i-1}) - \beta \times C_{i} \right)

is maximized,

wherein POS_i is the part of speech of W_i, |W_i| is the word length, Prob(W_i) is the word frequency, Prob(POS_i | POS_{i-1}) is the part-of-speech transition probability, and C_i is the number of changed characters of W_i.

5. The device as claimed in claim 1, 2, 3 or 4, characterized in that the code-to-target-document conversion device further contains an interface device, which contains a display device and can display the processing result during the code-to-target-document conversion for the user to confirm, and which determines the segmentation and the applicable vocabulary according to the user's instruction.

6. The device as claimed in claim 1, 2, 3 or 4, characterized in that the code lookup table is a Chinese/Japanese sound code lookup table whose coding rule is the pronunciation of the characters; in this lookup table, all homophones have the same code.

7. The device as claimed in claim 1, 2, 3 or 4, characterized in that the code lookup table is a shape code lookup table, in which characters whose glyphs are similar or which are easily confused and entered by mistake are grouped into character sets (clusters), and each character set is assigned a code.

8. The device as claimed in claim 1, 2, 3 or 4, characterized in that the code lookup table is a simplified/traditional code lookup table, which comprises the set of character groups in which a single simplified character corresponds to several traditional characters, and each character group is assigned a code.

9. The device as claimed in claim 8, characterized in that the code lookup table uses the character code of one representative traditional character of each simplified/traditional character group as that group's code.

10. A method for automatic error detection and correction in computer files, comprising:

sentence acquisition: taking a passage of a certain length from a document to be processed and treating it as a "sentence", the object of processing;

pre-segmentation: arranging, according to certain rules, the words formed by any sub-strings of the sentence to be processed in a non-overlapping, contiguous manner;

code conversion: converting the text of the sentence to be processed into codes according to a code lookup table;

correction processing: according to the codes of the sentence to be processed, finding, in a dictionary compiled with these codes, the sub-strings whose codes match those of strings contained in the sentence, determining the best segmentation and the applicable vocabulary according to the characteristics of at least two sub-strings of the sentence, and correcting the content of the sentence, which is kept in code form; and

target code conversion: converting the code-encoded sentence, according to an output code conversion lookup table, into a coding format for output, as the output target document.

11. The method as claimed in claim 10, characterized in that the correction processing uses, as the basis for determining the best segmentation and the applicable vocabulary, among others: the "word length" of the sentence to be processed (|W_i|: the number of characters in a word); the "word frequency" (Prob(W_i): the frequency with which a word occurs in general text); the "part-of-speech transition probability" (Prob(POS_i | POS_{i-1}): the probability that a word of part of speech Y occurs at position i given that a word of part of speech X occurs at position i-1 of the sentence); and the "number of changed characters" (C_i: the number of characters by which the word proposed in the code-to-target-document conversion differs from the original input at the same position).

12. The method as claimed in claim 11, characterized in that the correction processing comprises determining the best segmentation and the applicable vocabulary according to the characteristics of all sub-strings of the sentence to be processed.

13. The method as claimed in claim 12, characterized in that the correction processing comprises choosing, among the sub-strings in the dictionary whose codes match those of strings contained in the sentence to be processed, a set of words W_i, i = 1 to n, such that

\sum_{i=1}^{n} \left( \alpha \times |W_{i}| - \log_{10} \mathrm{Prob}(W_{i}) - \log_{10} \mathrm{Prob}(POS_{i} \mid POS_{i-1}) - \beta \times C_{i} \right)

is maximized.

14. The method as claimed in claim 10, 11, 12 or 13, characterized in that the correction processing further comprises a step of displaying the processing result for the user to confirm and determining the segmentation and the applicable vocabulary according to the user's instruction.

15. The method as claimed in claim 10, 11, 12 or 13, characterized in that the code lookup table is a Chinese/Japanese sound code lookup table whose coding rule is the pronunciation of the characters; in this lookup table, all homophones have the same code.

16. The method as claimed in claim 10, 11, 12 or 13, characterized in that the code lookup table is a shape code lookup table, in which characters whose glyphs are similar or which are easily confused and entered by mistake are grouped into character sets (clusters), and each character set is assigned a code.

17. The method as claimed in claim 10, 11, 12 or 13, characterized in that the code lookup table is a simplified/traditional code lookup table, which comprises the set of character groups in which a single simplified character corresponds to several traditional characters, and each character group is assigned a code.

18. The method as claimed in claim 17, characterized in that the code lookup table uses the character code of one representative traditional character of each simplified/traditional character group as that group's code.