CN105279149A - Chinese text automatic correction method

Info

Publication number: CN105279149A
Application number: CN201510688403.4A
Authority: CN
Inventors: 刘云翔; 杜杰; 李晓丹; 郑力; 杜俊; 刘续博
Original assignee: Shanghai Institute of Technology
Current assignee: Shanghai Institute of Technology
Priority date: 2015-10-21
Filing date: 2015-10-21
Publication date: 2016-01-27

Abstract

The invention discloses a Chinese text automatic correction method. The method comprises the following steps of: a) inputting a to-be-corrected Chinese text, and performing word segmentation preprocessing on the Chinese text sentence by sentence; b) searching for one-character words, two-character words or disperse strings of three or more than three characters occurring in the text subjected to word segmentation sentence by sentence; c) performing continuous determination on the disperse strings occurring in the text subjected to word segmentation by adopting an N-gram model, and checking text word level errors for each single sentence in combination with a word forming probability of separate characters; and d) constructing an error correction knowledge base to generate an error correction candidate text. According to the Chinese text automatic correction method provided by the invention, the one-character words, two-character words or disperse strings of three or more than three characters occurring in the text subjected to word segmentation are searched for sentence by sentence, the disperse strings occurring in the text subjected to word segmentation are subjected to continuous determination by adopting the N-gram model to determine identification errors, and the error correction knowledge base is constructed to generate the error correction candidate text, so that error checking and correcting processes are combined very well, and the method has the characteristics of high error checking speed and high error correcting efficiency.

Description

Automatic correction method for Chinese text

Technical field

The present invention relates to text correction methods, and in particular to a method for automatically correcting Chinese text.

Background technology

With the rapid development of modern laser phototypesetting technology and the electronic publishing industry, ensuring that information is conveyed correctly has become an important research topic. People now use computers for writing, editing and typesetting, and errors inevitably creep into the text, for example extra characters, missing characters, transposed characters, misspelled English words and non-standard punctuation. A dedicated proofreading system is therefore needed to proofread manuscripts. In the long run, informatization is the trend of future social development, and the volume of electronic information and manuscripts that people face keeps growing. Traditional manual proofreading requires a proofreader to read and check the text word by word and sentence by sentence, and it cannot keep up with the rapid growth of electronic text in terms of either cost or efficiency. The demand for an automatic proofreading system with high accuracy and high efficiency is therefore increasingly urgent.

Automatic proofreading has considerable practical value and a wide range of applications. In publishing, automatic text proofreading can greatly reduce the workload of staff, free them from tedious work, speed up the publishing cycle and promote the development of the whole industry. In text recognition, error detection and error correction techniques are needed to revise the results of speech recognition, OCR text recognition and the like. In text editing, many editing systems such as Word provide automatic error detection and automatically flag errors in the input text. In human-machine interfaces, such as database query and natural-language interfaces, a certain degree of fault tolerance is required. In systems such as computer-aided teaching, input sentences need to be analyzed so that errors can be found and possible correct answers offered.

Automatic proofreading also has considerable theoretical significance. As a discipline, it belongs to natural language understanding and involves many of its basic areas, such as automatic word segmentation, part-of-speech tagging and syntactic analysis, so it is a research topic of real academic value. Research in natural language processing has now reached the stage of processing large-scale real text, and real text may contain errors. Automatic proofreading technology studies how to find and handle exactly these errors, so its development will improve the fault tolerance of other natural language processing tasks and further promote natural language processing research as a whole.

Summary of the invention

The technical problem to be solved by the present invention is to provide an automatic correction method for Chinese text that can automatically analyze electronic text, find and mark errors, and correct them, so that error detection and error correction are well integrated, with fast error detection and efficient error correction.

To solve the above technical problem, the present invention provides an automatic correction method for Chinese text, comprising the following steps: a) inputting the Chinese text to be proofread and performing word segmentation preprocessing on it sentence by sentence; b) searching, sentence by sentence, for single-character, two-character, or three-or-more-character loose strings in the segmented text; c) using an N-gram model to make successive judgments on the loose strings found in the segmented text, and checking each sentence for character- and word-level errors in combination with the word-formation probability of individual characters; d) constructing an error-correction knowledge base and generating error-correction candidate texts.

In the above method, step a) takes the Chinese text to be proofread from voice or keyboard input, and the preprocessing comprises tidying grammatical errors in the text to be proofread and checking the input by pattern matching.

In the above method, the voice input of the Chinese text to be proofread in step a) proceeds as follows: speech input is received from a microphone and converted into a voice stream that the computer can accept; pattern matching is performed on the voice stream to generate candidate character and word combinations; and a language model is used to identify the candidate combinations.

In the above method, the keyboard input of the Chinese text to be proofread in step a) proceeds as follows: characters and words are encoded in advance, keystroke signals are converted into a code sequence that the computer accepts, and the code sequence is associated with the character and word encoding scheme.

In the above method, step c) judges a loose string of three or more characters as follows: the probability that each character in the loose string forms a word on its own is judged, giving a first error coefficient; a bigram continuation model is then used to judge the probability that each pair of adjacent characters forms a word, giving a second error coefficient; a trigram continuation model is then used to judge the probability that each group of three adjacent characters forms a word, giving a third error coefficient; and all error coefficients are summed to give the final character- and word-level error coefficient of the text.

In the above method, step c) judges a loose string of four consecutive characters $W_k W_{k+1} W_{k+2} W_{k+3}$ as follows:

c1) judging, for each of $W_k$, $W_{k+1}$, $W_{k+2}$ and $W_{k+3}$, the probability that the character forms a word on its own; if the probability $P$ that a character occurs alone is $P = 0$, that position is erroneous and the error coefficient $K_1 \mathrel{+}= 1.5$;

c2) taking $W_{k-2}$ as the start position and $W_{k+4}$ as the end position, judging with the bigram word-continuation model, using the co-occurrence frequency $R$ of two consecutive words as the criterion; if $R = 0$, the error coefficient $K_4 \mathrel{+}= 0.2$; if $R \ge 1$, $K_2 \mathrel{-}= 1.0$;

c3) taking $W_{k-1}$ as the start position and $W_{k+4}$ as the end position, judging with the bigram word-continuation model, using the co-occurrence frequency $R$ of two consecutive words as the criterion; if $R = 0$, the error coefficient $K_3 \mathrel{+}= 0.5$; if $1 < R < 2$, $K_3 \mathrel{+}= 0.2$; if $R \ge 2$, $K_3 \mathrel{-}= 1.0$;

c4) taking the first of the two characters preceding $W_k$ as the start position and the second character following $W_{k+3}$ as the end position, judging with the trigram character model, using the co-occurrence frequency $R$ of three consecutive characters as the criterion; if $R = 0$, the error coefficient $K_4 \mathrel{+}= 0.2$; if $R \ge 1$, $K_4 \mathrel{-}= 1.0$;

c5) taking the character preceding $W_k$ as the start position and the character following $W_{k+3}$ as the end position, judging with the bigram character model, using the co-occurrence frequency $R$ of two consecutive characters as the criterion; if $R = 0$, the error coefficient $K_5 \mathrel{+}= 0.8$; if $1 < R < 3$, $K_5 \mathrel{+}= 0.5$; if $R \ge 3$, $K_5 \mathrel{-}= 1.0$;

c6) for a given character under examination, summing the error coefficients obtained, i.e. $K = K_1 + K_2 + K_3 + K_4 + K_5$; if $K \ge 1.5$, the position is erroneous and the erroneous text is marked.

In the above method, step d) ranks the generated error-correction candidate texts as follows: each candidate replaces the original erroneous text, steps b) and c) are repeated on the resulting sentence to detect errors again and obtain the corresponding error coefficient, and the candidates are ranked by the size of their error coefficients.

In the above method, step d) constructs the error-correction knowledge bases from the error characteristics of the text and a likelihood matching method, and the knowledge bases comprise a wrongly-written-character dictionary, a confusable-word dictionary, a similar-code dictionary and/or a character-driven bidirectional dictionary.

Compared with the prior art, the present invention has the following beneficial effects: the method searches, sentence by sentence, for single-character, two-character, or three-or-more-character loose strings in the segmented text, uses an N-gram model to make successive judgments on those loose strings in order to identify errors, and constructs an error-correction knowledge base to generate error-correction candidate texts. Error detection and error correction are thereby well integrated, with fast error detection and efficient error correction.
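The loose-string search described above can be sketched in a few lines. This is only an illustration under stated assumptions: the text does not define how a loose string is delimited, so the sketch assumes it is a maximal run of single-character tokens in the segmenter output, and the function name `find_loose_strings` and the sample tokens are hypothetical.

```python
from typing import List, Tuple


def find_loose_strings(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return (start_index, length) for each maximal run of one-character tokens.

    Assumption: a "loose string" is a run of consecutive single-character
    tokens left over after word segmentation.
    """
    runs: List[Tuple[int, int]] = []
    start = None  # index where the current run began, or None if not in a run
    for i, tok in enumerate(tokens):
        if len(tok) == 1:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:  # close a run that reaches the end of the sentence
        runs.append((start, len(tokens) - start))
    return runs


# Hypothetical segmenter output for one sentence.
tokens = ["我们", "在", "学", "校", "里", "学习"]
print(find_loose_strings(tokens))
```

Each returned pair gives the position and length of a run, so runs of length one, two, or three and more can be routed to the corresponding check. Punctuation tokens would also count as single characters here, so a real pipeline would presumably filter them out first.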

Brief description of the drawings

Fig. 1 is a schematic flowchart of the automatic Chinese text correction of the present invention;

Fig. 2 is a schematic diagram of the preprocessing that the present invention applies to the Chinese text to be corrected;

Fig. 3 is a schematic diagram of how the present invention obtains the Chinese text to be corrected from keyboard input;

Fig. 4 is a schematic diagram of how the present invention obtains the Chinese text to be corrected from voice input;

Fig. 5 is a schematic diagram of the knowledge-base-driven recognition of Chinese characters from a speech signal in the present invention;

Fig. 6 is a detailed schematic flowchart of the automatic Chinese text error correction of the present invention.

Detailed description of the embodiments

The invention is further described below with reference to the drawings and embodiments.

Fig. 1 is a schematic flowchart of the automatic Chinese text correction of the present invention.

Referring to Fig. 1, the automatic correction method for Chinese text provided by the present invention comprises the following steps:

a) The Chinese text to be proofread is input, and word segmentation preprocessing is performed on it sentence by sentence. The text may be entered by voice or by keyboard, and the preprocessing comprises tidying grammatical errors in the text and checking the input by pattern matching. Keyboard input proceeds as shown in Fig. 3: characters and words are encoded in advance, keystroke signals are converted into a code sequence that the computer accepts, and the code sequence is associated with the character and word encoding scheme. Voice input proceeds as shown in Figs. 4 and 5: speech input is received from a microphone and converted into a voice stream that the computer can accept; pattern matching is performed on the voice stream to generate candidate character and word combinations; and a language model is used to identify the candidate combinations.

b) Single-character, two-character, or three-or-more-character loose strings in the segmented text are searched for sentence by sentence.

c) An N-gram model is used to make successive judgments on the loose strings found in the segmented text, and each sentence is checked for character- and word-level errors in combination with the word-formation probability of individual characters. A loose string of three or more characters is judged as follows: the probability that each character in the loose string forms a word on its own is judged, giving a first error coefficient; a bigram continuation model is then used to judge the probability that each pair of adjacent characters forms a word, giving a second error coefficient; a trigram continuation model is then used to judge the probability that each group of three adjacent characters forms a word, giving a third error coefficient; and all error coefficients are summed to give the final character- and word-level error coefficient of the text. N-gram is a language model commonly used in large-vocabulary continuous speech recognition; for Chinese it is referred to as the Chinese Language Model (CLM).

d) An error-correction knowledge base is constructed and error-correction candidate texts are generated. Specifically, the knowledge bases may be constructed from the error characteristics of the text and a likelihood matching method, and they comprise a wrongly-written-character dictionary, a confusable-word dictionary, a similar-code dictionary and/or a character-driven bidirectional dictionary. To make selection easier for the user, the present invention may also rank the generated candidates as follows: each candidate replaces the original erroneous text, steps b) and c) are repeated on the resulting sentence to detect errors again and obtain the corresponding error coefficient, and the candidates are ranked by the size of their error coefficients.
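The candidate generation and ranking of step d) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the toy `CONFUSION` dictionary stands in for the knowledge bases, and `toy_error_score` stands in for re-running steps b) and c); neither is the scoring defined by the method.

```python
from typing import Callable, Dict, List

# Hypothetical confusable-character dictionary (toy data, not from the patent).
CONFUSION: Dict[str, List[str]] = {"门": ["们", "闷"]}

# Hypothetical set of character bigrams assumed to be attested in a corpus.
KNOWN_BIGRAMS = {"我们", "们去", "去学", "学校"}


def generate_candidates(sentence: str, pos: int,
                        confusion: Dict[str, List[str]]) -> List[str]:
    """Replace the character at `pos` with each confusable alternative."""
    alternatives = confusion.get(sentence[pos], [])
    return [sentence[:pos] + alt + sentence[pos + 1:] for alt in alternatives]


def toy_error_score(sentence: str) -> float:
    """Stand-in for repeating steps b) and c): count unattested bigrams."""
    pairs = [sentence[i:i + 2] for i in range(len(sentence) - 1)]
    return float(sum(1 for p in pairs if p not in KNOWN_BIGRAMS))


def rank_candidates(candidates: List[str],
                    score: Callable[[str], float]) -> List[str]:
    """Sort candidates so that the lowest error score comes first."""
    return sorted(candidates, key=score)


# The character at index 1 is assumed to have been flagged by error detection.
candidates = generate_candidates("我门去学校", 1, CONFUSION)
ranked = rank_candidates(candidates, toy_error_score)
print(ranked)
```

Passing the scorer in as a callable keeps the ranking independent of how the error coefficient is computed, so the same function could be reused with a fuller implementation of step c).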

Referring further to Fig. 6, a specific embodiment is given below, implemented in the following steps:

Step1: Input the text to be proofread and perform word segmentation preprocessing on it using the Peking University word segmentation software;

Step2: Search the segmented text for single-character, two-character, or three-or-more-character loose strings, and treat all such places as possible sources of error. Suppose $W_k W_{k+1} W_{k+2} W_{k+3}$ is found in the text as a loose string of four consecutive characters; this place is then examined as a possible source of error, where $k$ is a natural number giving the position of the found text within the sentence.

Step3: Judge, for each of $W_k$, $W_{k+1}$, $W_{k+2}$ and $W_{k+3}$, the probability that the character forms a word on its own. If the probability $P$ that a character occurs alone is $P = 0$, that position is erroneous and the error coefficient $K_1 \mathrel{+}= 1.5$.

Step4: Taking $W_{k-2}$ as the start position and $W_{k+4}$ as the end position, judge with the bigram word-continuation model, using the co-occurrence frequency $R$ of two consecutive words as the criterion. If $R = 0$, the error coefficient $K_4 \mathrel{+}= 0.2$; if $R \ge 1$, $K_2 \mathrel{-}= 1.0$.

Step5: Taking $W_{k-1}$ as the start position and $W_{k+4}$ as the end position, judge with the bigram word-continuation model, using the co-occurrence frequency $R$ of two consecutive words as the criterion. If $R = 0$, the error coefficient $K_3 \mathrel{+}= 0.5$; if $1 < R < 2$, $K_3 \mathrel{+}= 0.2$; if $R \ge 2$, $K_3 \mathrel{-}= 1.0$.

Step6: Taking the first of the two characters preceding $W_k$ as the start position and the second character following $W_{k+3}$ as the end position, judge with the trigram character model, using the co-occurrence frequency $R$ of three consecutive characters as the criterion. If $R = 0$, the error coefficient $K_4 \mathrel{+}= 0.2$; if $R \ge 1$, $K_4 \mathrel{-}= 1.0$.

Step7: Taking the character preceding $W_k$ as the start position and the character following $W_{k+3}$ as the end position, judge with the bigram character model, using the co-occurrence frequency $R$ of two consecutive characters as the criterion. If $R = 0$, the error coefficient $K_5 \mathrel{+}= 0.8$; if $1 < R < 3$, $K_5 \mathrel{+}= 0.5$; if $R \ge 3$, $K_5 \mathrel{-}= 1.0$.

Step8: For a given character under examination, sum the error coefficients obtained by each module, i.e. $K = K_1 + K_2 + K_3 + K_4 + K_5$. If $K \ge 1.5$, the position is erroneous and the erroneous text is marked.

Step9: End.
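The accumulation in Step3 to Step8 can be sketched as follows. This is an illustration under assumptions rather than a definitive implementation: the text does not say how $R$ is aggregated over each window, so the sketch takes one precomputed integer frequency per step; the thresholds are transcribed literally, which means an integer $R = 1$ matches no branch in Step5 or Step7 and contributes nothing. Because only the sum $K$ is compared with 1.5, the sketch keeps a single running total instead of separate $K_1$ to $K_5$. The function names are hypothetical.

```python
from typing import Optional, Tuple

ERROR_THRESHOLD = 1.5  # Step8: positions with K >= 1.5 are marked as errors


def step_delta(r: int, zero_penalty: float, reward_from: int,
               mid_range: Optional[Tuple[int, int]] = None,
               mid_penalty: float = 0.0) -> float:
    """Contribution of one n-gram check, following the literal rules in the text."""
    if r == 0:
        return zero_penalty          # never seen together: raise the coefficient
    if r >= reward_from:
        return -1.0                  # seen often enough: lower the coefficient
    if mid_range is not None and mid_range[0] < r < mid_range[1]:
        return mid_penalty           # strictly inside the open middle range
    return 0.0                       # no rule in the text covers this value


def error_coefficient(p_single: float, r_step4: int, r_step5: int,
                      r_step6: int, r_step7: int) -> float:
    """Sum the contributions of Step3 to Step7 for one character."""
    k = 1.5 if p_single == 0 else 0.0                                      # Step3
    k += step_delta(r_step4, 0.2, reward_from=1)                            # Step4
    k += step_delta(r_step5, 0.5, 2, mid_range=(1, 2), mid_penalty=0.2)     # Step5
    k += step_delta(r_step6, 0.2, reward_from=1)                            # Step6
    k += step_delta(r_step7, 0.8, 3, mid_range=(1, 3), mid_penalty=0.5)     # Step7
    return k


def is_error(k: float) -> bool:
    """Step8 decision."""
    return k >= ERROR_THRESHOLD


# A character that never stands alone and has no attested n-gram context.
k = error_coefficient(p_single=0.0, r_step4=0, r_step5=0, r_step6=0, r_step7=0)
print(k, is_error(k))
```

With all frequencies at zero every step adds its penalty, so the character is flagged; with well-attested context every step subtracts 1.0 and the total falls far below the threshold.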

In summary, the automatic correction method of the present invention mainly comprises two parts, automatic error detection and error correction. It combines multi-model error detection based on a hybrid algorithm with error correction techniques to build an automatic checking model for character- and word-level errors, and, based on an analysis of how such errors are distributed in text, uses an N-gram model to make successive judgments on the loose strings that occur in the text. The invention checks character- and word-level errors in combination with the word-formation probability of individual characters and, on the basis of the constructed error-correction knowledge bases, implements an algorithm for generating correction suggestions. The knowledge bases are constructed from the error characteristics of the text and a likelihood matching method, and include a wrongly-written-character dictionary, a confusable-word dictionary, a similar-code dictionary, a character-driven bidirectional dictionary and so on, from which candidate correction suggestions are generated. The candidate suggestions are also ranked, and the ranking is realized by running error detection on each suggestion: during correction, each candidate suggestion replaces the original error, error detection is run on that position to obtain the corresponding error coefficient, and the suggestion with the smallest error coefficient is the most probable correction. The method thus combines error correction with error detection and applies error detection techniques directly to the correction process. The specific advantages are as follows:

1. Character- and word-level automatic error detection based on the N-gram model reflects information about commonly used words well: comparing the tuples that occur most frequently in the statistics against a dictionary shows that tuples corresponding to commonly used Chinese words have higher co-occurrence frequencies, so the adjacency matrix obtained from the statistics contains the set of commonly associated words.

2. The collocations of common function words are reflected well: in Chinese, some function words combine with certain words and, although they carry no concrete meaning, serve a grammatical function; tuples such as "must very", "can not" and "one-tenth" have very high co-occurrence probability.

3. The N-gram character and word adjacency matrix reflects sentence-initial and sentence-final information well.

4. Many errors are found by the character and word statistics, which shows that the N-gram adjacency matrix reflects some inherent regularities of natural language to a certain extent.
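The co-occurrence statistics behind the adjacency matrix discussed above can be sketched with a plain counter. This is a minimal illustration, not the method's own data structure: the sparse `Counter` stands in for the matrix, the boundary markers `<s>` and `</s>` are an assumed way of capturing sentence-initial and sentence-final information, and the two-sentence corpus is toy data.

```python
from collections import Counter
from typing import Iterable, Sequence


def count_ngrams(sentences: Iterable[Sequence[str]], n: int) -> Counter:
    """Count n-grams of tokens, padding each sentence with boundary markers."""
    counts: Counter = Counter()
    for units in sentences:
        padded = ["<s>"] + list(units) + ["</s>"]  # assumed boundary markers
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts


# Hypothetical segmented corpus of two sentences.
corpus = [["我们", "去", "学校"], ["我们", "在", "学校"]]
bigrams = count_ngrams(corpus, 2)
trigrams = count_ngrams(corpus, 3)

print(bigrams[("<s>", "我们")])    # frequency of a sentence-initial pair
print(bigrams[("学校", "我们")])   # unseen pairs have frequency 0
```

A lookup such as `bigrams[(a, b)]` then plays the role of the co-occurrence frequency $R$ used in the error detection steps, with unseen tuples returning 0.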

Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit it. Any person skilled in the art may make minor modifications and improvements without departing from the spirit and scope of the present invention, and the scope of protection of the present invention shall therefore be as defined by the claims.

Claims

1. An automatic correction method for Chinese text, characterized in that it comprises the following steps:

a) inputting the Chinese text to be proofread, and performing word segmentation preprocessing on the Chinese text sentence by sentence;

b) searching, sentence by sentence, for single-character, two-character, or three-or-more-character loose strings in the segmented text;

c) using an N-gram model to make successive judgments on the loose strings found in the segmented text, and checking each sentence for character- and word-level errors in combination with the word-formation probability of individual characters;

d) constructing an error-correction knowledge base and generating error-correction candidate texts.

2. The automatic correction method for Chinese text as claimed in claim 1, characterized in that step a) takes the Chinese text to be proofread from voice or keyboard input, and the preprocessing comprises tidying grammatical errors in the text to be proofread and checking the input by pattern matching.

3. The automatic correction method for Chinese text as claimed in claim 2, characterized in that the voice input of the Chinese text to be proofread in step a) proceeds as follows: speech input is received from a microphone and converted into a voice stream that the computer can accept; pattern matching is performed on the voice stream to generate candidate character and word combinations; and a language model is used to identify the candidate combinations.

4. The automatic correction method for Chinese text as claimed in claim 2, characterized in that the keyboard input of the Chinese text to be proofread in step a) proceeds as follows: characters and words are encoded in advance, keystroke signals are converted into a code sequence that the computer accepts, and the code sequence is associated with the character and word encoding scheme.

5. The automatic correction method for Chinese text as claimed in claim 1, characterized in that step c) judges a loose string of three or more characters as follows: the probability that each character in the loose string forms a word on its own is judged, giving a first error coefficient; a bigram continuation model is then used to judge the probability that each pair of adjacent characters forms a word, giving a second error coefficient; a trigram continuation model is then used to judge the probability that each group of three adjacent characters forms a word, giving a third error coefficient; and all error coefficients are summed to give the final character- and word-level error coefficient of the text.

6. The automatic correction method for Chinese text as claimed in claim 5, characterized in that step c) judges a loose string of four consecutive characters $W_k W_{k+1} W_{k+2} W_{k+3}$ as follows:

c1) judging, for each of $W_k$, $W_{k+1}$, $W_{k+2}$ and $W_{k+3}$, the probability that the character forms a word on its own; if the probability $P$ that a character occurs alone is $P = 0$, that position is erroneous and the error coefficient $K_1 \mathrel{+}= 1.5$;

c2) taking $W_{k-2}$ as the start position and $W_{k+4}$ as the end position, judging with the bigram word-continuation model, using the co-occurrence frequency $R$ of two consecutive words as the criterion; if $R = 0$, the error coefficient $K_4 \mathrel{+}= 0.2$; if $R \ge 1$, $K_2 \mathrel{-}= 1.0$;

c3) taking $W_{k-1}$ as the start position and $W_{k+4}$ as the end position, judging with the bigram word-continuation model, using the co-occurrence frequency $R$ of two consecutive words as the criterion; if $R = 0$, the error coefficient $K_3 \mathrel{+}= 0.5$; if $1 < R < 2$, $K_3 \mathrel{+}= 0.2$; if $R \ge 2$, $K_3 \mathrel{-}= 1.0$;

c4) taking the first of the two characters preceding $W_k$ as the start position and the second character following $W_{k+3}$ as the end position, judging with the trigram character model, using the co-occurrence frequency $R$ of three consecutive characters as the criterion; if $R = 0$, the error coefficient $K_4 \mathrel{+}= 0.2$; if $R \ge 1$, $K_4 \mathrel{-}= 1.0$;

c5) taking the character preceding $W_k$ as the start position and the character following $W_{k+3}$ as the end position, judging with the bigram character model, using the co-occurrence frequency $R$ of two consecutive characters as the criterion; if $R = 0$, the error coefficient $K_5 \mathrel{+}= 0.8$; if $1 < R < 3$, $K_5 \mathrel{+}= 0.5$; if $R \ge 3$, $K_5 \mathrel{-}= 1.0$;

c6) for a given character under examination, summing the error coefficients obtained, i.e. $K = K_1 + K_2 + K_3 + K_4 + K_5$; if $K \ge 1.5$, the position is erroneous and the erroneous text is marked.

7. The automatic correction method for Chinese text as claimed in claim 5, characterized in that step d) ranks the generated error-correction candidate texts as follows: each candidate replaces the original erroneous text, steps b) and c) are repeated on the resulting sentence to detect errors again and obtain the corresponding error coefficient, and the candidates are ranked by the size of their error coefficients.

8. The automatic correction method for Chinese text as claimed in claim 1, characterized in that step d) constructs the error-correction knowledge bases from the error characteristics of the text and a likelihood matching method, and the knowledge bases comprise a wrongly-written-character dictionary, a confusable-word dictionary, a similar-code dictionary and/or a character-driven bidirectional dictionary.