A collocation-based method for automatic proofreading of Chinese text
Technical field
The present invention relates to Chinese natural language processing in the field of artificial intelligence and computing, and more particularly to the automatic proofreading of Chinese text.
Background
Automatic proofreading of Chinese text is one of the main applications of natural language processing and also one of the hard problems of natural language understanding.
Chinese is entered into a computer through an input method. A growing number of people type Chinese with pinyin input methods, which accept whole words and phrases as input, so texts contain more and more errors, and many of these errors cannot be identified effectively by methods that rely only on local context.
To address this problem, the present invention proposes and implements a collocation-based method for automatic error detection and automatic proofreading of Chinese text.
Summary of the invention
Object of the invention: to overcome the deficiencies of the prior art, the present invention provides a collocation-based method for automatic proofreading of Chinese text.
Technical solution:
To solve the above technical problem, the present invention provides a collocation-based method for automatic proofreading of Chinese text, the method comprising the following steps:
1) establishing a representation structure for collocations according to the structure and characteristics of Chinese word collocations;
2) establishing, according to the collocations and word classes, an index structure from words to word classes and an index structure from words and word classes to collocations;
3) using the index structure from words and word classes to collocations established in step 2), performing automatic error detection and automatic error correction on the Chinese sentences of the text to be checked, marking the error positions, giving suggested corrections in the form of the corresponding correct words, and outputting a preliminary error-detection result;
4) verifying the error-detection result using statistical information from the text to be checked, and outputting the revised error-detection result, thereby achieving collocation-based automatic proofreading of Chinese text.
Preferably, step 3), in which the index structure from words and word classes to collocations established in step 2) is used to perform automatic error detection and automatic error correction on the Chinese sentences of the text to be checked, the error positions are marked, suggested corrections are given and a preliminary error-detection result is output, specifically comprises the following steps:
31) segmenting the sentence of the text to be checked into words;
32) traversing each word in the sentence and performing the following automatic error detection and automatic error correction:
32-1) looking up the word class of the word in the word-to-word-class index structure established in step 2), and then using the word class found to look up the corresponding set of collocation nodes in the index structure from words and word classes to collocations established in step 2); if no word class is found for the word, looking up the set of collocation nodes corresponding to the word itself in that index structure;
32-2) if the set of collocation nodes found in 32-1) is not empty, traversing each collocation node in the set and retrieving the collocation corresponding to the node from the collocation library;
32-3) according to the position of the word in the collocation and its collocate, searching the sentence for the collocate that pairs with the word; if a collocate that forms the collocation with the word is found in the sentence, verifying whether the distance between the collocate and the word in the sentence matches the distance between them in the collocation; if it matches, marking the word and the collocate as correct and ending the automatic error detection of the word; otherwise, proceeding to step 32-4);
32-4) replacing the word, one by one, with the similar words in its set of similar words and, following the method of steps 32-1), 32-2) and 32-3), checking for each similar word whether a collocate that forms a collocation with it can be found in the sentence; if no such collocation exists, proceeding to step 32-5); if one exists, marking the word as erroneous, taking the similar word as the corresponding correct word so as to give a suggested correction, and ending the automatic error detection and automatic error correction of the word;
32-5) according to the position of the word in the collocation and its collocate, searching the sentence for a word or string similar to the collocate; if a word or string similar to the collocate is found in the sentence and forms the collocation with the word, verifying whether the distance in the sentence between that similar word or string and the word matches the distance between the collocate and the word in the collocation; if it matches, marking the similar word or string as erroneous, taking the collocate as the corresponding correct word so as to give a suggested correction, and ending the automatic error detection and automatic error correction of the word; otherwise, returning to step 32) to perform automatic error detection and automatic error correction on the next word, until the end of the sentence, and outputting a preliminary error-detection result that includes the marks and the suggested corrections.
Preferably, step 4), in which the error-detection result is verified using statistical information from the text to be checked and the revised error-detection result is output, specifically comprises:
41) counting word frequencies: counting over the segmented sentences of the text to be checked to obtain the frequency of each word;
42) verifying the error-detection result: for each word marked as erroneous in the preliminary error-detection result output by step 3), determining whether its frequency counted in step 41) is not less than a preset threshold, and if so, regarding the word as correct;
43) revising the error-detection result: revising the preliminary error-detection result according to the statistical verification of step 42), and outputting the final, revised error-detection result.
Further preferably, the preset threshold is 5.
Preferably, the representation structure of a collocation in step 1) is:
Collocation Coll = <!word class 1>[<a|*>]<!word class 2>[<b|*>]<!word class 3>…<!word class p>;
where: <> denotes a required element,
[] denotes an optional element,
| denotes a choice of one alternative,
a, b and * respectively indicate that the distance between the preceding and following words is a, is b, or is unrestricted,
! is the word-class marker; !word class 1, !word class 2, !word class 3 and !word class p respectively denote a group of synonyms belonging to word class 1, word class 2, word class 3 and word class p;
a word class is defined as: <!word class p> = <entry 1 | entry 2 | … | entry q>;
a large number of collocations form the collocation library: Coll_Set = {X | X is a collocation, X = <!word class 1>[n]<!word class 2>[m]<!word class 3>…<!word class p>}.
Preferably, in step 2), according to the collocations and word classes, the following are established: the index structure mapWordToClass from words to word classes; the index structure mapClassToColl from words and word classes to collocations; the collocation library vecColl for storing collocations; and the collocation node structure CNode, which comprises the collocation index collIndex, the position wordIndex of the word within the collocation, and the length collLen of the collocation.
Here, vecColl is the abstract array representation of the collocation library Coll_Set that stores the collocations.
Further preferably, step 3), in which the index structure from words and word classes to collocations established in step 2) is used to perform automatic error detection and automatic error correction on the Chinese sentences of the text to be checked, the error positions are marked, suggested corrections are given and a preliminary error-detection result is output, specifically comprises the following steps:
31) segmenting the sentence S of the text to be checked: S = w_1 w_2 … w_n, where w_1, w_2, …, w_n are the words obtained by segmentation; the word w_i at each position of the segmented sentence is marked with the marking array flag[i], which serves as the error-position mark, where 1 ≤ i ≤ n; consistent with steps 32-3) to 32-5) below, flag[i] = 1 indicates that the word at the corresponding position is correct and flag[i] = -1 indicates that it is erroneous;
32) traversing each word w_i in sentence S and performing the following automatic error detection and automatic error correction:
32-1) looking up the word class of w_i in the word-to-word-class index structure mapWordToClass established in step 2), and then using the word class found to look up the corresponding set of collocation nodes Colls in the index structure mapClassToColl from words and word classes to collocations established in step 2); if no word class is found for w_i, looking up in mapClassToColl the set of collocation nodes Colls (that is, the collocation set Colls) corresponding to w_i itself;
32-2) if the set of collocation nodes Colls found in 32-1) is not empty, traversing each collocation node coll (that is, each collocation coll) in Colls and, according to the collocation node structure CNode of coll, retrieving from the collocation library vecColl the collocation strColl whose index is coll.collIndex;
32-3) according to the position coll.wordIndex of w_i in the collocation strColl and its collocate collWord, searching sentence S for the word collWord that pairs with w_i; if a collocate collWord that forms the collocation strColl with w_i is found in S, verifying whether the distance between collWord and w_i in S matches the distance between collWord and w_i in the collocation coll; if it matches, assigning 1 to the marks flag at the positions of the found collocate collWord and of w_i in the marking array, to indicate that the word and the collocate are correct, and ending the automatic error detection of w_i; otherwise, proceeding to step 32-4);
32-4) replacing w_i, one by one, with the words of its set of similar words sim(w_i) and, following the method of steps 32-1), 32-2) and 32-3), checking for each similar word w_j in sim(w_i) whether a collocate that forms a collocation with w_j can be found in S; if no such collocation exists, proceeding to step 32-5); if one exists, assigning -1 to the mark flag[i] at the position of w_i in the marking array, to indicate that w_i is erroneous, and storing the similar word w_j as the corresponding correct word correctWord in the error-correction array vecCorrect, whose elements have the error-detection node structure CorrectNode, so as to give a suggested correction; the error-detection node structure CorrectNode comprises the start position begin of the erroneous word w_i in the sentence, the end position end of the erroneous word in the sentence, and the corresponding correct word correctWord; the automatic error detection and automatic error correction of w_i then ends;
32-5) according to the position coll.wordIndex of w_i in the collocation strColl and its collocate collWord, searching S for a word or string similar to collWord; if a word or string similar to collWord is found in the sentence and forms the collocation strColl with w_i, verifying whether the distance in S between that similar word or string and w_i matches the distance between collWord and w_i in the collocation coll; if it matches, setting to -1 the marks flag[k..j] at the positions of the found word string w_k..w_j similar to collWord, to indicate that the similar word or string w_k..w_j is erroneous, and storing the collocate collWord as the corresponding correct word correctWord in the error-correction array vecCorrect, whose elements have the structure CorrectNode, so as to give a suggested correction, and ending the automatic error detection and automatic error correction of w_i; otherwise, returning to step 32) to perform automatic error detection and automatic error correction on the next word w_{i+1}, until the end of the sentence, and outputting a preliminary error-detection result comprising the marking array flag[i] as the error-position marks and the error-correction array vecCorrect as the correction result.
Preferably, step 4), in which the error-detection result is verified using statistical information from the text to be checked and the revised error-detection result is output, specifically comprises:
41) counting word frequencies: counting over the segmented sentences S of the text to be checked to obtain the frequency Freq(w_i) of each word w_i;
42) verifying the error-detection result: traversing the error-correction array vecCorrect and, for each possibly erroneous word, denoted word, found by step 3), making the following judgment: if Freq(word) ≥ the preset threshold, the word is regarded as correct;
43) revising the error-detection result: revising the preliminary error-detection result according to the statistical verification of step 42), and outputting the final, revised error-detection result.
Beneficial effects: the present invention provides a collocation-based method for automatic proofreading of Chinese text. According to collocations and word classes, it establishes an index structure from words to word classes and an index structure from words and word classes to collocations. Using the latter, it performs automatic error detection and automatic error correction on the Chinese sentences of the text to be checked, marks the error positions, gives suggested corrections in the form of the corresponding correct words, and outputs a preliminary error-detection result. It then verifies this result using statistical information from the text to be checked and outputs the revised result, thereby achieving collocation-based automatic proofreading of Chinese text.
Experimental results show that the collocation-based automatic proofreading method provided by the present invention achieves a recall of 81.2% and a precision of 75.6%. This precision exceeds that of the prior art and better meets the needs of practical applications, giving the method higher effectiveness and accuracy.
Detailed description of the embodiments
The present invention is further described below with reference to an embodiment.
The collocation-based method for automatic proofreading of Chinese text provided by this embodiment comprises the following steps:
1) establishing a representation structure for collocations according to the structure and characteristics of Chinese word collocations:
A collocation is a combination of words: when the co-occurrence probability of two or more words within a sentence exceeds a preset threshold, those words form a valid collocation. The words of a collocation are adjacent in some cases and separated in others. Words that belong to the same word class and share the same meaning can form the same collocation with other words. The collocation structure defined by the present invention is therefore:
Collocation Coll = <!word class 1>[<a|*>]<!word class 2>[<b|*>]<!word class 3>…<!word class p>;
where: <> denotes a required element,
[] denotes an optional element,
| denotes a choice of one alternative,
a, b and * respectively indicate that the distance between the preceding and following words is a, is b, or is unrestricted,
! is the word-class marker; !word class 1, !word class 2, !word class 3 and !word class p respectively denote a group of synonyms belonging to word class 1, word class 2, word class 3 and word class p;
a word class is defined as: <!word class p> = <entry 1 | entry 2 | … | entry q>;
a large number of collocations form the collocation library: Coll_Set = {X | X is a collocation, X = <!word class 1>[n]<!word class 2>[m]<!word class 3>…<!word class p>}.
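The representation above can be illustrated with a minimal Python sketch. It assumes the stored form <!class>[n]<!class>, with [*] or a missing bracket meaning an unrestricted distance. The names WORD_CLASSES, Collocation and parse_collocation, and the example entries, are hypothetical and do not come from the text.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical word classes: each class is a group of synonymous entries,
# mirroring <!word class p> = <entry 1 | entry 2 | ... | entry q>.
WORD_CLASSES = {
    "raise": ["提高", "提升"],
    "level": ["水平", "水准"],
}

@dataclass
class Collocation:
    # Slot names without the "!" marker, in collocation order.
    classes: List[str] = field(default_factory=list)
    # gaps[k] is the required distance between slot k and slot k+1;
    # None stands for "*" (no restriction).
    gaps: List[Optional[int]] = field(default_factory=list)

_TOKEN = re.compile(r"<!([^>]+)>|\[([^\]]*)\]")

def parse_collocation(text: str) -> Collocation:
    """Parse a string such as '<!raise>[2]<!level>' into a Collocation."""
    coll = Collocation()
    pending_gap: Optional[int] = None
    for cls, gap in _TOKEN.findall(text):
        if cls:
            if coll.classes:
                # Close the gap between the previous slot and this one.
                coll.gaps.append(pending_gap)
            coll.classes.append(cls)
            pending_gap = None
        else:
            gap = gap.strip()
            pending_gap = None if gap in ("", "*") else int(gap)
    return coll

example = parse_collocation("<!raise>[2]<!level>")
```

Keeping the gaps in a list parallel to the slots is only one possible layout; it makes the later distance check a simple comparison.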
2) establishing, according to the collocations and word classes, an index structure from words to word classes and an index structure from words and word classes to collocations:
This embodiment establishes: the index structure mapWordToClass from words to word classes; the index structure mapClassToColl from words and word classes to collocations; the collocation library vecColl for storing collocations; and the collocation node structure CNode, which comprises the collocation index collIndex, the position wordIndex of the word within the collocation, and the length collLen of the collocation.
Here, vecColl is the abstract array representation of the collocation library Coll_Set that stores the collocations.
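One way these structures could be built is sketched below in Python, keeping the identifiers mapWordToClass, mapClassToColl, vecColl and CNode from the text. The function build_indexes, its input format, the 1-based wordIndex and the assumption that each word belongs to at most one word class are illustrative choices and are not stated in the source.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class CNode:
    collIndex: int  # index of the collocation in vecColl
    wordIndex: int  # 1-based position of the slot within the collocation (assumed)
    collLen: int    # number of slots in the collocation

def build_indexes(word_classes, collocations):
    """Build the three structures from hypothetical inputs.

    word_classes: {class_name: [entry, ...]}
    collocations: list of slot lists; a slot is a class name or a literal word.
    """
    # Word -> word class (assumes one class per word).
    mapWordToClass = {}
    for cls, entries in word_classes.items():
        for word in entries:
            mapWordToClass[word] = cls

    # The collocation library, stored as a plain array.
    vecColl = list(collocations)

    # Word or word class -> collocation nodes that mention it.
    mapClassToColl = defaultdict(list)
    for idx, slots in enumerate(vecColl):
        for pos, slot in enumerate(slots, start=1):
            mapClassToColl[slot].append(CNode(idx, pos, len(slots)))

    return mapWordToClass, dict(mapClassToColl), vecColl
```

Because a CNode stores only an index into vecColl, each collocation is stored once even when several word classes point to it.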
3) using the index structure from words and word classes to collocations established in step 2), performing automatic error detection and automatic error correction on the Chinese sentences of the text to be checked, marking the error positions, giving suggested corrections in the form of the corresponding correct words, and outputting a preliminary error-detection result:
31) segmenting the sentence S of the text to be checked: S = w_1 w_2 … w_n, where w_1, w_2, …, w_n are the words obtained by segmentation; the word w_i at each position of the segmented sentence is marked with the marking array flag[i], which serves as the error-position mark, where 1 ≤ i ≤ n; consistent with steps 32-3) to 32-5) below, flag[i] = 1 indicates that the word at the corresponding position is correct and flag[i] = -1 indicates that it is erroneous; in the initial state, flag[i] = 0 (1 ≤ i ≤ n);
32) traversing each word w_i in sentence S and performing the following automatic error detection and automatic error correction:
32-01) scanning the words w_i of the sentence in order; if the end of sentence S has been reached, exiting error detection; otherwise going to step 32-02);
32-02) checking the mark flag[i] of w_i: if flag[i] = 1, w_i is a correct word, so returning to step 32-01); otherwise going to step 32-1);
32-1) looking up the word class of w_i in the word-to-word-class index structure mapWordToClass established in step 2), and then using the word class C found to look up, in the index structure mapClassToColl from words and word classes to collocations established in step 2), the set of collocation nodes corresponding to C, Colls = mapClassToColl[C]; if no word class is found for w_i, looking up the set of collocation nodes Colls corresponding to w_i in mapClassToColl, that is, using w_i itself to check whether mapClassToColl contains an index entry for w_i and, if so, taking Colls = mapClassToColl[w_i];
32-2) if the set of collocation nodes Colls found in 32-1) is empty (Colls == NULL), returning to step 32-01) to process the next word; if it is not empty, traversing each collocation node coll in Colls and retrieving the collocation corresponding to the node from the collocation library; in this embodiment, according to the collocation node structure CNode of coll, the collocation with index coll.collIndex is retrieved from the structure vecColl, i.e. strColl = vecColl[coll.collIndex];
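The lookup of steps 32-1) and 32-2) can be sketched as follows. This is a minimal Python illustration, assuming dictionary-based indexes and collocation nodes stored as (collIndex, wordIndex, collLen) tuples. The helper names lookup_colls and collocations_for and the sample data are hypothetical.

```python
# Hypothetical sample data; nodes are (collIndex, wordIndex, collLen) tuples.
mapWordToClass = {"提高": "raise", "提升": "raise"}
mapClassToColl = {"raise": [(0, 1, 2)], "水平": [(0, 2, 2)]}
vecColl = ["<!raise>[*]水平"]

def lookup_colls(word, word_to_class, class_to_coll):
    """Step 32-1): look up by word class, else by the word itself."""
    word_class = word_to_class.get(word)
    key = word_class if word_class is not None else word
    # An empty list plays the role of Colls == NULL.
    return class_to_coll.get(key, [])

def collocations_for(word):
    """Step 32-2): fetch strColl = vecColl[coll.collIndex] for each node."""
    nodes = lookup_colls(word, mapWordToClass, mapClassToColl)
    return [vecColl[coll_index] for coll_index, _, _ in nodes]
```

A word with neither a word class nor an entry of its own yields an empty result, which corresponds to moving on to the next word.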
32-3) according to the position coll.wordIndex of w_i in the collocation strColl and its collocate collWord, searching sentence S for the word collWord that pairs with w_i; in this embodiment, this comprises the following cases:
Case 1: if coll.wordIndex = 1, searching among the words w_{i+1} … w_n that follow w_i in S for a word that forms the collocation strColl with w_i;
Case 2: if coll.wordIndex = coll.collLen, searching among the words w_1 … w_{i-1} in S for a word that forms the collocation strColl with w_i;
Case 3: if coll.wordIndex != 1 && coll.wordIndex != coll.collLen, the current word w_i lies in the middle of the collocation, and words that form the collocation strColl with w_i must be sought in both w_1 … w_{i-1} and w_{i+1} … w_n;
If a collocate collWord that forms the collocation strColl with w_i is found in S, verifying whether the distance between collWord and w_i in S matches the distance between collWord and w_i in the collocation coll; if it matches, the word and the collocate are marked as correct: in this embodiment, the marks flag at the positions of the found collocate collWord and of w_i in the marking array are assigned 1 to indicate that the word and the collocate are correct, the automatic error detection of w_i ends, and the method goes to step 32-01); otherwise, the method proceeds to step 32-4);
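The three cases and the distance check of step 32-3) might look like the Python sketch below. It uses 0-based token indexes and treats the distance between two words as the difference of their token positions, which is an assumption because the text does not define how distance is measured. The names search_ranges and find_collocate are hypothetical.

```python
def search_ranges(i, n, word_index, coll_len):
    """Return the index ranges of S to search for a collocate of words[i].

    i is the 0-based position of the word; word_index is its 1-based
    position in the collocation (coll.wordIndex); coll_len is coll.collLen.
    """
    if word_index == 1:            # Case 1: search the words after w_i
        return [range(i + 1, n)]
    if word_index == coll_len:     # Case 2: search the words before w_i
        return [range(0, i)]
    return [range(0, i), range(i + 1, n)]  # Case 3: search both sides

def find_collocate(words, i, candidates, gap, ranges):
    """Return the index of a collocate whose distance to words[i] matches.

    candidates: set of words that can fill the collocate slot.
    gap: required distance, or None for "*" (no restriction).
    """
    for r in ranges:
        for j in r:
            if words[j] in candidates and (gap is None or abs(j - i) == gap):
                return j
    return None

# Usage: "提高" is the first slot of a two-slot collocation with "水平".
words = ["我们", "要", "提高", "生活", "水平"]
ranges = search_ranges(2, len(words), word_index=1, coll_len=2)
found = find_collocate(words, 2, {"水平"}, None, ranges)
```

When a match is found, the caller would set flag to 1 at both positions, as the step describes.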
32-4) using the similarity of Chinese words, obtaining the set of similar words sim(w_i) of w_i and replacing w_i, one by one, with the similar words in sim(w_i); following the method of steps 32-1), 32-2) and 32-3), checking for each similar word w_j in sim(w_i) whether a collocate that forms a collocation with w_j can be found in S; if no such collocation exists, proceeding to step 32-5); if one exists, the word is marked as erroneous: in this embodiment, the mark flag[i] at the position of w_i in the marking array is assigned -1 to indicate that w_i is erroneous; the similar word is taken as the corresponding correct word so as to give a suggested correction: in this embodiment, the similar word w_j is stored as the corresponding correct word correctWord in the error-correction array vecCorrect, whose elements have the error-detection node structure CorrectNode, i.e. in the array vector<CorrectNode> vecCorrect; the error-detection node structure CorrectNode comprises the start position begin of the erroneous word w_i in the sentence, the end position end of the erroneous word in the sentence, and the corresponding correct word correctWord; the automatic error detection and automatic error correction of w_i then ends, and the method goes to step 32-01);
In this embodiment, the CorrectNode structure holds the fields begin, end and correctWord described above.
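A minimal Python sketch of step 32-4) and of a possible CorrectNode layout follows. The field names begin, end and correctWord come from the text. Treating begin and end as character offsets with an exclusive end, the callback has_collocation (which stands in for steps 32-1) to 32-3)) and the function name try_similar_words are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class CorrectNode:
    begin: int        # start offset of the erroneous word (characters, assumed)
    end: int          # end offset of the erroneous word (exclusive, assumed)
    correctWord: str  # suggested correct word

def try_similar_words(words, i, sim, has_collocation, flag, vecCorrect):
    """Replace words[i] by each similar word and test for a collocation.

    sim: {word: [similar words]} (hypothetical similarity table).
    has_collocation(trial_words, i): hypothetical stand-in for steps
    32-1) to 32-3) applied to the substituted sentence.
    Returns True if a correction was recorded.
    """
    for candidate in sim.get(words[i], []):
        trial = words[:i] + [candidate] + words[i + 1:]
        if has_collocation(trial, i):
            flag[i] = -1  # mark w_i as erroneous
            begin = sum(len(w) for w in words[:i])
            vecCorrect.append(CorrectNode(begin, begin + len(words[i]), candidate))
            return True
    return False  # no collocation found: continue with step 32-5)

# Usage with a homophone pair: "水瓶" typed where "水平" was intended.
words = ["提高", "生活", "水瓶"]
flag = [0, 0, 0]
vecCorrect = []
fixed = try_similar_words(
    words, 2, {"水瓶": ["水平"]},
    lambda ws, i: ws[i] == "水平" and "提高" in ws,
    flag, vecCorrect,
)
```

Stopping at the first similar word that forms a collocation matches the step's "ending the automatic error detection" wording; ranking several candidates would be an extension the text does not describe.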
32-5) according to the position coll.wordIndex of w_i in the collocation strColl and its collocate collWord, searching S for a word or string similar to collWord; in this embodiment, this comprises the following cases:
Case 1: if coll.wordIndex = 1, searching among the words w_{i+1} … w_n that follow w_i in S for a word or string similar to collWord;
Case 2: if coll.wordIndex = coll.collLen, searching among the words w_1 … w_{i-1} in S for a word or string similar to collWord;
Case 3: if coll.wordIndex != 1 && coll.wordIndex != coll.collLen, the current word w_i lies in the middle of the collocation, and a word or string similar to collWord must be sought in both w_1 … w_{i-1} and w_{i+1} … w_n;
If a word or string similar to collWord is found in the sentence and forms the collocation strColl with w_i, verifying whether the distance in S between that similar word or string and w_i matches the distance between collWord and w_i in the collocation coll; if it matches, the similar word or string is marked as erroneous: in this embodiment, the marks flag[k..j] at the positions of the found word string w_k..w_j similar to collWord are set to -1, indicating that the similar word or string w_k..w_j is erroneous, that is, that the word string contains an erroneous word; the collocate collWord is taken as the corresponding correct word so as to give a suggested correction: in this embodiment, collWord is stored as the corresponding correct word correctWord in the error-correction array vecCorrect, whose elements have the error-detection node structure CorrectNode, i.e. in the array vector<CorrectNode> vecCorrect; the automatic error detection and automatic error correction of w_i then ends, and the method goes to step 32-01);
Otherwise, the method returns to step 32-01) of step 32) to perform automatic error detection and automatic error correction on the next word w_{i+1}, until the end of the sentence, and outputs a preliminary error-detection result that includes the marks and the suggested corrections; in this embodiment, the output comprises the marking array flag[i] as the error-position marks and the error-correction array vecCorrect as the correction result.
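The search for a similar word or string in step 32-5) can be sketched in Python as below. A misspelled word may be split into several tokens by segmentation, so the sketch joins short runs of adjacent tokens before comparing. The similarity table SIMILAR, the span limit max_len, the dictionary form of the correction entry and both function names are hypothetical, and the distance check is left out for brevity.

```python
# Hypothetical similarity table: collocate -> strings considered similar.
SIMILAR = {"提高": {"提搞", "题高"}}

def is_similar(text, collocate):
    return text in SIMILAR.get(collocate, ())

def find_similar_span(words, collocate, ranges, max_len=3):
    """Return (k, j) such that words[k..j], joined, is similar to collocate."""
    for r in ranges:
        for k in r:
            for j in range(k, min(k + max_len, r.stop)):
                if is_similar("".join(words[k:j + 1]), collocate):
                    return k, j
    return None

def mark_similar_span(words, span, collocate, flag, vecCorrect):
    """Set flag[k..j] = -1 and record the collocate as the suggested word."""
    k, j = span
    for p in range(k, j + 1):
        flag[p] = -1
    begin = sum(len(w) for w in words[:k])           # character offset (assumed)
    end = begin + sum(len(w) for w in words[k:j + 1])
    vecCorrect.append({"begin": begin, "end": end, "correctWord": collocate})

# Usage: "提高" was mistyped as "提搞" and segmented into two tokens;
# w_i is "水平" (last slot), so only the words before it are searched.
words = ["提", "搞", "生活", "水平"]
flag = [0, 0, 0, 0]
vecCorrect = []
span = find_similar_span(words, "提高", [range(0, 3)])
if span is not None:
    mark_similar_span(words, span, "提高", flag, vecCorrect)
```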
4) verifying the error-detection result using statistical information from the text to be checked, and outputting the revised error-detection result, thereby achieving collocation-based automatic proofreading of Chinese text:
41) counting word frequencies: counting over the segmented sentences S of the text to be checked to obtain the frequency Freq(w_i) of each word w_i;
42) verifying the error-detection result: for each word marked as erroneous in the preliminary error-detection result output by step 3), determining whether its frequency counted in step 41) is not less than the preset threshold, and if so, regarding the word as correct; in this embodiment: traversing the error-correction array vector<CorrectNode> vecCorrect and, for each possibly erroneous word, denoted word, found by step 3), making the following judgment: if Freq(word) ≥ the preset threshold a, the word is regarded as correct; in this embodiment and the corresponding experiments, the preset threshold is preferably set to 5.
43) revising the error-detection result: revising the preliminary error-detection result according to the statistical verification of step 42); in this embodiment, the entries of the error-correction array vecCorrect for the words confirmed as correct by the verification of step 42) are deleted, the marks flag at the positions of those words are set to 1, and the final, revised error-detection result is output.
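Steps 41) to 43) amount to a frequency filter over the flagged words. The Python sketch below assumes corrections are stored as (sentence index, word index, suggestion) tuples and that flags are kept per sentence. These layouts and the name verify_by_frequency are illustrative; only the threshold of 5 is taken from the text.

```python
from collections import Counter

def verify_by_frequency(sentences, flags, vecCorrect, threshold=5):
    """Drop corrections for words that are frequent in the text itself.

    sentences: list of segmented sentences (lists of words).
    flags: per-sentence marking arrays, updated in place.
    vecCorrect: list of (sent_id, pos, suggestion) tuples (assumed layout).
    Returns the corrections that survive verification.
    """
    # Step 41): word frequencies over the whole text to be checked.
    freq = Counter(word for sentence in sentences for word in sentence)

    kept = []
    for sent_id, pos, suggestion in vecCorrect:
        word = sentences[sent_id][pos]
        if freq[word] >= threshold:
            # Steps 42) and 43): frequent enough, so treat the word as
            # correct, set its mark to 1 and drop the correction entry.
            flags[sent_id][pos] = 1
        else:
            kept.append((sent_id, pos, suggestion))
    return kept
```

The idea behind the filter is that a word used repeatedly throughout the same document is more likely a deliberate term than a typing error.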
Experiment: taking the above embodiment as an example and using the parameters given in the embodiment as experimental parameters, large-scale corpus experiments were carried out on the collocation-based automatic proofreading method provided by the present invention, with multiple open tests. The experiments used a test corpus of 10,000 lines of sentences into which 600 homophone errors had been manually inserted. The results show that the method can effectively combine context to identify errors and proofread automatically. Statistical analysis shows that the collocation-based automatic proofreading method provided by the present invention achieves a recall of 81.2% and a precision of 75.6%. This precision exceeds that of the prior art and better meets the needs of practical applications, giving the method higher effectiveness and accuracy.
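For reference, the two reported figures can be related to error counts under the usual definitions of recall and precision. The text does not state the definitions it used, and the figures may be averages over several tests, so the counts below are only rough values implied by those assumptions and are not reported results.

```latex
% Assumed standard definitions; TP = correctly detected errors.
\[
\text{recall} = \frac{TP}{\text{errors inserted}}, \qquad
\text{precision} = \frac{TP}{\text{words flagged as errors}}
\]
% With 600 inserted errors, recall 0.812 and precision 0.756:
\[
TP \approx 0.812 \times 600 \approx 487, \qquad
\text{words flagged} \approx \frac{487}{0.756} \approx 644
\]
```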
The above is only a preferred embodiment of the present invention and does not limit the present invention. Any modification, equivalent substitution, improvement or the like made by those skilled in the art without departing from the technical concept of the present invention falls within the scope of protection of the present invention.