CN106547741A - A collocation-based automatic proofreading method for Chinese text - Google Patents

Publication number
CN106547741A
Authority
CN
China
Prior art keywords
word
collocation
speech
sentence
debugging
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611048520.5A
Other languages
Chinese (zh)
Other versions
CN106547741B (en
Inventor
张晓如
刘文旻
刘亮亮
吴健康
刘嘎琼
张再跃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Southern Power Grid Internet Service Co ltd
Jingchuang United Beijing Intellectual Property Service Co ltd
Original Assignee
Jiangsu University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University of Science and Technology filed Critical Jiangsu University of Science and Technology
Priority to CN201611048520.5A priority Critical patent/CN106547741B/en
Publication of CN106547741A publication Critical patent/CN106547741A/en
Application granted granted Critical
Publication of CN106547741B publication Critical patent/CN106547741B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a collocation-based automatic proofreading method for Chinese text, comprising the following steps: 1) defining a representation structure for collocations according to the structure and characteristics of Chinese word collocations; 2) building, from the collocations and word classes, an index from words to word classes and an index from words and word classes to collocations; 3) using the word-and-word-class-to-collocation index built in step 2) to automatically detect and correct errors in the Chinese sentences of the text to be checked, marking the error positions, giving correction suggestions with the corresponding correct words, and outputting a preliminary error-detection result; 4) verifying the error-detection result against statistical information from the text to be checked and outputting the revised result, thereby achieving collocation-based automatic proofreading of Chinese text. Experimental results show that the recall and precision of the method exceed those of the prior art, so that it better meets the needs of practical applications and offers higher effectiveness and accuracy.

Description

A collocation-based automatic proofreading method for Chinese text
Technical field
The present invention relates to Chinese natural language processing in the field of artificial intelligence and computing, and in particular to automatic proofreading of Chinese text.
Background
Automatic proofreading of Chinese text is one of the main applications of natural language processing and a difficult problem in natural language understanding. Chinese is entered into computers through input methods, and a growing number of people use pinyin input methods, which accept whole words and phrases as input. Texts therefore contain more and more errors, and many of these errors cannot be recognized effectively by methods that rely on local context.
To address this problem, the present invention proposes and implements a collocation-based method for automatic error detection and automatic proofreading of Chinese text.
Summary of the invention
Object of the invention: to overcome the deficiencies of the prior art, the present invention provides a collocation-based automatic proofreading method for Chinese text.
Technical solution:
To solve the above technical problem, the present invention provides a collocation-based automatic proofreading method for Chinese text, comprising the following steps:
1) defining a representation structure for collocations according to the structure and characteristics of Chinese word collocations;
2) building, from the collocations and word classes, an index from words to word classes and an index from words and word classes to collocations;
3) using the word-and-word-class-to-collocation index built in step 2) to automatically detect and correct errors in the Chinese sentences of the text to be checked, marking the error positions, giving correction suggestions with the corresponding correct words, and outputting a preliminary error-detection result;
4) verifying the error-detection result against statistical information from the text to be checked and outputting the revised result, thereby achieving collocation-based automatic proofreading of Chinese text.
Preferably, step 3), in which the word-and-word-class-to-collocation index built in step 2) is used to automatically detect and correct errors in the Chinese sentences of the text to be checked, mark the error positions, give correction suggestions with the corresponding correct words, and output a preliminary error-detection result, specifically comprises the following steps:
31) segmenting the sentences of the text to be checked into words;
32) traversing each word in the sentence and performing the following automatic error detection and correction:
32-1) looking up the word class of the word in the word-to-word-class index built in step 2), and then using the word class found to look up the corresponding set of collocation nodes in the word-and-word-class-to-collocation index built in step 2); if no word class is found for the word, looking up the set of collocation nodes for the word itself in that index;
32-2) if the set of collocation nodes found in 32-1) is not empty, traversing each collocation node in the set and fetching the collocation corresponding to the node from the collocation library;
32-3) according to the position of the word in the collocation and the collocate, searching the sentence for the collocate that pairs with the word; if a collocate that forms the collocation with the word is found in the sentence, verifying whether the distance between the collocate and the word in the sentence matches the distance between them in the collocation; if it matches, marking the word and the collocate as correct and ending error detection for the word; otherwise, proceeding to step 32-4);
32-4) replacing the word, one at a time, with each similar word in its similar-word set and, following the method of steps 32-1), 32-2) and 32-3), checking whether each similar word can form a collocation with a collocate found in the sentence; if no collocation exists, proceeding to step 32-5); if a collocation exists, marking the word as erroneous, taking the similar word as the corresponding correct word so as to give a correction suggestion, and ending automatic error detection and correction for the word;
32-5) according to the position of the word in the collocation and the collocate, searching the sentence for a word or string similar to the collocate; if a word or string similar to the collocate is found that forms the collocation with the word, verifying whether the distance between that similar word or string and the word in the sentence matches the distance between the collocate and the word in the collocation; if it matches, marking the similar word or string as erroneous, taking the collocate as the corresponding correct word so as to give a correction suggestion, and ending automatic error detection and correction for the word; otherwise, returning to step 32) to process the next word, until the end of the sentence, and then outputting the preliminary error-detection result, which includes the marks and the correction suggestions with the corresponding correct words.
Preferably, step 4), in which the error-detection result is verified against statistical information from the text to be checked and the revised result is output, specifically comprises:
41) counting word frequencies: counting, over the segmented sentences of the text to be checked, the frequency of each word;
42) verifying the error-detection result: for each word marked as erroneous in the preliminary result output by step 3), checking whether its frequency counted in step 41) is not less than a preset threshold; if so, the word is considered correct;
43) revising the error-detection result: revising the preliminary result according to the statistical verification of step 42) and outputting the final, revised error-detection result.
Further preferably, the preset threshold is 5.
Preferably, the representation structure of a collocation in step 1) is:
Collocation Coll = <!Word class 1>[<a|*>]<!Word class 2>[<b|*>]<!Word class 3>…<!Word class p>;
where: <> denotes a mandatory element,
[] denotes an optional element,
| denotes a choice of one alternative,
a, b and * denote, respectively, that the distance between the preceding and following words is a, is b, or is unrestricted,
! is the word-class marker; !Word class 1, !Word class 2, !Word class 3 and !Word class p each denote a group of synonyms belonging to word class 1, word class 2, word class 3 and word class p respectively;
A word class is defined as: <!Word class p> = <entry 1 | entry 2 | … | entry q>;
A large number of collocations form the collocation library: Coll_Set = {X | X is a collocation, X = <!Word class 1>[n]<!Word class 2>[m]<!Word class 3>…<!Word class p>}.
Preferably, in step 2), the following are built from the collocations and word classes: the word-to-word-class index mapWordToClass; the word-and-word-class-to-collocation index mapClassToColl; the collocation library vecColl, which stores the collocations; and the collocation node structure CNode, which comprises the collocation index collIndex, the position wordIndex of the word within the collocation, and the collocation length collLen.
Here vecColl is the abstract array representation of the collocation library Coll_Set.
Further preferably, step 3), in which the word-and-word-class-to-collocation index built in step 2) is used to automatically detect and correct errors in the Chinese sentences of the text to be checked, mark the error positions, give correction suggestions with the corresponding correct words, and output a preliminary error-detection result, specifically comprises the following steps:
31) segmenting a sentence S of the text to be checked: S = w_1 w_2 … w_n, where w_1, w_2, …, w_n are the words after segmentation; a flag array flag[i] (1 ≤ i ≤ n) marks the word w_i at each position and serves as the error-position marking, where flag[i] = 0 is the initial (not yet judged) state, flag[i] = 1 indicates that the word at that position is correct, and flag[i] = -1 indicates that it is erroneous;
32) traversing each word w_i in sentence S and performing the following automatic error detection and correction:
32-1) looking up the word class of w_i in the word-to-word-class index mapWordToClass built in step 2), and then using the word class found to look up the corresponding set of collocation nodes Colls in the index mapClassToColl built in step 2); if no word class is found for w_i, looking up the set of collocation nodes Colls (that is, the collocation set Colls) for w_i itself in mapClassToColl;
32-2) if the set of collocation nodes Colls found in 32-1) is not empty, traversing each collocation node coll (that is, each collocation coll) in Colls and, according to its collocation node structure CNode, fetching from the collocation library vecColl the collocation strColl whose index is coll.collIndex;
32-3) according to the position coll.wordIndex of w_i in the collocation strColl and the collocate collWord, searching sentence S for the collocate collWord; if a collocate collWord that forms the collocation strColl with w_i is found in S, verifying whether the distance between collWord and w_i in S matches the distance between them in the collocation coll; if it matches, assigning 1 to the flags of the positions of collWord and w_i in the flag array to indicate that both are correct, and ending error detection for w_i; otherwise, proceeding to step 32-4);
32-4) replacing w_i, one at a time, with each similar word in its similar-word set sim(w_i) and, following the method of steps 32-1), 32-2) and 32-3), checking whether each similar word w_j in sim(w_i) can form a collocation with a collocate found in S; if no collocation exists, proceeding to step 32-5); if a collocation exists, assigning -1 to flag[i] to indicate that w_i is erroneous, and storing the similar word w_j as the corresponding correct word correctWord in the error-correction array vecCorrect, whose elements have the error-detection node structure CorrectNode, so as to give a correction suggestion; the structure CorrectNode comprises the start position begin of the erroneous word w_i in the sentence, its end position end in the sentence, and the corresponding correct word correctWord; then ending automatic error detection and correction for w_i;
32-5) according to the position coll.wordIndex of w_i in the collocation strColl and the collocate collWord, searching sentence S for a word or string similar to collWord; if a word or string similar to collWord is found that forms the collocation strColl with w_i, verifying whether the distance between that similar word or string and w_i in S matches the distance between collWord and w_i in the collocation coll; if it matches, setting the flags flag[k..j] of the similar word string w_k..w_j to -1 to indicate that it is erroneous, storing collWord as the corresponding correct word correctWord in the error-correction array vecCorrect with the structure CorrectNode so as to give a correction suggestion, and ending automatic error detection and correction for w_i; otherwise, returning to step 32) to process the next word w_(i+1), until the end of the sentence, and then outputting the preliminary error-detection result, which comprises the flag array flag[i] as the error-position marking and the error-correction array vecCorrect as the correction result.
Preferably, step 4), in which the error-detection result is verified against statistical information from the text to be checked and the revised result is output, specifically comprises:
41) counting word frequencies: counting, over the segmented sentences S of the text to be checked, the frequency Freq(w_i) of each word w_i;
42) verifying the error-detection result: traversing the error-correction array vecCorrect and, for each possibly erroneous word word found by step 3), making the following judgment: if Freq(word) ≥ the preset threshold, the word is considered correct;
43) revising the error-detection result: revising the preliminary result according to the statistical verification of step 42) and outputting the final, revised error-detection result.
Beneficial effects: the invention provides a collocation-based automatic proofreading method for Chinese text. From the collocations and word classes it builds an index from words to word classes and an index from words and word classes to collocations. Using the latter index, it automatically detects and corrects errors in the Chinese sentences of the text to be checked, marks the error positions, gives correction suggestions with the corresponding correct words, and outputs a preliminary error-detection result. It then verifies that result against statistical information from the text to be checked and outputs the revised result, achieving collocation-based automatic proofreading of Chinese text.
Experimental results show that the method provided by the invention reaches a recall of 81.2% and a precision of 75.6%. This precision exceeds that of the prior art, so the method better meets the needs of practical applications and offers higher effectiveness and accuracy.
Detailed description of the embodiments
The present invention is further described below with reference to an embodiment.
The collocation-based automatic proofreading method for Chinese text provided by this embodiment comprises the following steps:
1) defining a representation structure for collocations according to the structure and characteristics of Chinese word collocations:
A collocation is a combination of words: when the co-occurrence probability of two or more words within a sentence exceeds a preset threshold, those words form a valid collocation. The words of a collocation may be adjacent or separated. Words with the same meaning that belong to the same word class can form the same collocation with other words. The collocation structure defined by the invention is therefore:
Collocation Coll = <!Word class 1>[<a|*>]<!Word class 2>[<b|*>]<!Word class 3>…<!Word class p>;
where: <> denotes a mandatory element,
[] denotes an optional element,
| denotes a choice of one alternative,
a, b and * denote, respectively, that the distance between the preceding and following words is a, is b, or is unrestricted,
! is the word-class marker; !Word class 1, !Word class 2, !Word class 3 and !Word class p each denote a group of synonyms belonging to word class 1, word class 2, word class 3 and word class p respectively;
A word class is defined as: <!Word class p> = <entry 1 | entry 2 | … | entry q>;
A large number of collocations form the collocation library: Coll_Set = {X | X is a collocation, X = <!Word class 1>[n]<!Word class 2>[m]<!Word class 3>…<!Word class p>}.
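The representation above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the class names `!improve` and `!level`, their entries, and the `Collocation` class with its `gap_ok` method are hypothetical, and distance is assumed to be measured as a difference of token positions, with `None` standing for the unrestricted `*`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical word classes: each class name maps to a group of synonymous entries,
# mirroring <!Word class p> = <entry 1 | entry 2 | ... | entry q>.
WORD_CLASSES = {
    "!improve": {"提高", "提升"},
    "!level": {"水平", "水准"},
}


@dataclass(frozen=True)
class Collocation:
    """One collocation: word-class slots separated by distance constraints."""
    classes: Tuple[str, ...]          # slot k holds a word-class name
    gaps: Tuple[Optional[int], ...]   # gaps[k] constrains slots k and k+1; None means '*'

    def __post_init__(self):
        if len(self.gaps) != len(self.classes) - 1:
            raise ValueError("need exactly one gap between each pair of adjacent slots")

    def gap_ok(self, k: int, observed: int) -> bool:
        """True if the observed distance between slots k and k+1 is allowed."""
        expected = self.gaps[k]
        return expected is None or expected == observed


# Coll_Set as a plain list; here a single collocation with an unrestricted distance.
COLL_SET = [Collocation(("!improve", "!level"), (None,))]
```

Storing class names rather than individual words in each slot is what lets one entry cover every synonym pair, as the definition of a word class intends.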
2) building, from the collocations and word classes, an index from words to word classes and an index from words and word classes to collocations:
This embodiment builds: the word-to-word-class index mapWordToClass; the word-and-word-class-to-collocation index mapClassToColl; the collocation library vecColl, which stores the collocations; and the collocation node structure CNode, which comprises the collocation index collIndex, the position wordIndex of the word within the collocation, and the collocation length collLen.
Here vecColl is the abstract array representation of the collocation library Coll_Set.
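One possible shape for these structures is sketched below. The field names collIndex, wordIndex and collLen come from the text; everything else is an assumption made for illustration: the function `build_indexes`, the representation of each collocation in `vec_coll` as a tuple of slot keys, the 1-based `wordIndex`, and the simplification that a word belongs to at most one word class.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CNode:
    collIndex: int  # index of the collocation in vecColl
    wordIndex: int  # 1-based position of the word (or word class) in that collocation
    collLen: int    # number of slots in the collocation


def build_indexes(word_classes, vec_coll):
    """Build mapWordToClass and mapClassToColl from word classes and a collocation list.

    word_classes: dict mapping a class name to a set of entries (hypothetical layout).
    vec_coll: list of tuples; each tuple lists the slot keys of one collocation,
              where a key is a word-class name or a literal word.
    """
    map_word_to_class = {}
    for cls, entries in word_classes.items():
        for word in entries:
            map_word_to_class[word] = cls  # assumes one class per word

    map_class_to_coll = defaultdict(list)
    for coll_index, slots in enumerate(vec_coll):
        for position, key in enumerate(slots, start=1):
            map_class_to_coll[key].append(CNode(coll_index, position, len(slots)))
    return map_word_to_class, dict(map_class_to_coll)
```

Because each CNode records where its key sits inside the collocation, a later lookup knows on which side of the current word to search for the collocate.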
3) using the word-and-word-class-to-collocation index built in step 2) to automatically detect and correct errors in the Chinese sentences of the text to be checked, marking the error positions, giving correction suggestions with the corresponding correct words, and outputting a preliminary error-detection result:
31) segmenting a sentence S of the text to be checked: S = w_1 w_2 … w_n, where w_1, w_2, …, w_n are the words after segmentation; a flag array flag[i] (1 ≤ i ≤ n) marks the word w_i at each position and serves as the error-position marking, where flag[i] = 1 indicates that the word at that position is correct and flag[i] = -1 indicates that it is erroneous; in the initial state, flag[i] = 0 (1 ≤ i ≤ n), meaning the word has not yet been judged;
32) traversing each word w_i in sentence S and performing the following automatic error detection and correction:
32-01) scanning the words w_i of the sentence in order; if the end of sentence S is reached, exiting error detection; otherwise going to step 32-02);
32-02) checking the flag flag[i] of word w_i; if flag[i] = 1, w_i is a correct word, so returning to step 32-01); otherwise going to step 32-1);
32-1) looking up the word class of w_i in the word-to-word-class index mapWordToClass built in step 2), and then using the word class C found to look up the corresponding set of collocation nodes in the index mapClassToColl built in step 2): Colls = mapClassToColl[C]; if no word class is found for w_i, looking up the set of collocation nodes Colls for w_i itself, that is, checking whether mapClassToColl contains an entry for w_i and, if it does, taking Colls = mapClassToColl[w_i];
32-2) if the set of collocation nodes found in 32-1) is empty (Colls == NULL), returning to step 32-01); if it is not empty, traversing each collocation node coll in Colls and fetching the corresponding collocation from the collocation library; in this embodiment, according to the collocation node structure CNode of coll, the collocation whose index is coll.collIndex is fetched from vecColl, i.e. strColl = vecColl[coll.collIndex];
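The two-stage lookup of steps 32-1) and 32-2) can be sketched as below. The function name `lookup_coll_nodes` and the use of plain Python dictionaries for mapWordToClass and mapClassToColl are assumptions made for illustration.

```python
def lookup_coll_nodes(word, map_word_to_class, map_class_to_coll):
    """Return the collocation nodes for a word (steps 32-1 and 32-2).

    The word's class is tried first; if the word has no class, the word itself
    is used as the key into mapClassToColl. An empty list means there is no
    collocation evidence, so the caller moves on to the next word.
    """
    word_class = map_word_to_class.get(word)
    key = word_class if word_class is not None else word
    return map_class_to_coll.get(key, [])
```

Each returned node would then be resolved against the collocation library, for example `strColl = vecColl[coll.collIndex]`, as the embodiment describes.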
32-3) according to the position coll.wordIndex of w_i in the collocation strColl and the collocate collWord, searching sentence S for the collocate collWord; in this embodiment there are the following cases:
Case 1: if coll.wordIndex = 1, searching the words w_(i+1) … w_n that follow w_i in S for a word that forms the collocation strColl with w_i;
Case 2: if coll.wordIndex = coll.collLen, searching the words w_1 … w_(i-1) that precede w_i in S for a word that forms the collocation strColl with w_i;
Case 3: if coll.wordIndex != 1 && coll.wordIndex != coll.collLen, the current word w_i lies in the middle of the collocation, so both w_1 … w_(i-1) and w_(i+1) … w_n are searched for words that form the collocation strColl with w_i;
If a collocate collWord that forms the collocation strColl with w_i is found in S, verifying whether the distance between collWord and w_i in S matches the distance between them in the collocation coll; if it matches, marking the word and the collocate as correct; in this embodiment, the flags of the positions of collWord and w_i in the flag array are assigned 1 to indicate that both are correct, error detection for w_i ends, and control returns to step 32-01); otherwise, proceeding to step 32-4);
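The three cases and the distance check can be sketched as follows. This is an illustration under assumptions rather than the patent's code: token positions are 0-based while `word_index` stays 1-based as in the text, the helper names `search_positions` and `find_collocate` are hypothetical, the collocate slot is given as a set of acceptable words, and the distance is taken to be the absolute difference of token positions, with `gap=None` standing for the unrestricted `*`.

```python
def search_positions(i, word_index, coll_len, n):
    """Token positions to scan for the collocate of the word at position i."""
    if word_index == 1:            # case 1: word opens the collocation, look right
        return list(range(i + 1, n))
    if word_index == coll_len:     # case 2: word closes the collocation, look left
        return list(range(0, i))
    return list(range(0, i)) + list(range(i + 1, n))  # case 3: look on both sides


def find_collocate(tokens, i, collocates, word_index, coll_len, gap=None):
    """Return the position of a collocate satisfying the distance constraint, or None."""
    for j in search_positions(i, word_index, coll_len, len(tokens)):
        if tokens[j] in collocates and (gap is None or abs(j - i) == gap):
            return j
    return None


tokens = ["我们", "要", "提高", "生活", "水平"]
flags = [0] * len(tokens)
j = find_collocate(tokens, 2, {"水平", "水准"}, word_index=1, coll_len=2)
if j is not None:
    flags[2] = flags[j] = 1  # both the word and its collocate are confirmed correct
```

Marking the collocate as well as the word saves work later, since step 32-02) skips any position whose flag is already 1.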
32-4) using the similarity of Chinese words, obtaining the similar-word set sim(w_i) of w_i and replacing w_i, one at a time, with each similar word in sim(w_i); following the method of steps 32-1), 32-2) and 32-3), checking whether each similar word w_j in sim(w_i) can form a collocation with a collocate found in S; if no collocation exists, proceeding to step 32-5); if a collocation exists, marking the word as erroneous; in this embodiment, flag[i] is assigned -1 to indicate that w_i is erroneous, and the similar word w_j is stored as the corresponding correct word correctWord in the error-correction array vecCorrect, whose elements have the error-detection node structure CorrectNode, i.e. it is stored in the array vector<CorrectNode> vecCorrect, so as to give a correction suggestion with the corresponding correct word; the structure CorrectNode comprises the start position begin of the erroneous word w_i in the sentence, its end position end in the sentence, and the corresponding correct word correctWord; automatic error detection and correction for w_i then ends and control returns to step 32-01);
In this embodiment, the CorrectNode structure holds the three fields just described: begin, end and correctWord.
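A sketch of CorrectNode and of the substitution test in step 32-4) is given below. The field names begin, end and correctWord are from the text. The rest is assumed for illustration: positions are taken to be character offsets with an exclusive end, `similar` is a hypothetical lookup table standing in for sim(w_i), and `has_collocation` is a hypothetical callback that stands in for re-running steps 32-1) to 32-3) on the substituted sentence.

```python
from dataclasses import dataclass


@dataclass
class CorrectNode:
    begin: int        # start offset of the erroneous word in the sentence
    end: int          # end offset of the erroneous word (exclusive, by assumption)
    correctWord: str  # suggested replacement


def try_similar_words(tokens, i, similar, has_collocation, flags, vec_correct):
    """Step 32-4: substitute each similar word for tokens[i] and test for a collocation.

    Returns True if a substitution forms a collocation, in which case tokens[i]
    is flagged as erroneous and a correction suggestion is recorded.
    """
    for candidate in similar.get(tokens[i], []):
        trial = tokens[:i] + [candidate] + tokens[i + 1:]
        if has_collocation(trial, i):
            flags[i] = -1
            begin = sum(len(t) for t in tokens[:i])
            vec_correct.append(CorrectNode(begin, begin + len(tokens[i]), candidate))
            return True
    return False


tokens = ["提高", "生活", "水瓶"]          # "水瓶" is a homophone slip for "水平"
similar = {"水瓶": ["水平"]}               # hypothetical similar-word table
flags, vec_correct = [0, 0, 0], []
found = try_similar_words(
    tokens, 2, similar,
    lambda toks, i: "提高" in toks and toks[i] == "水平",  # toy collocation check
    flags, vec_correct,
)
```

Recording offsets alongside the suggestion lets a caller highlight the erroneous span in the original sentence without re-segmenting it.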
32-5) according to the position coll.wordIndex of w_i in the collocation strColl and the collocate collWord, searching sentence S for a word or string similar to collWord; in this embodiment there are the following cases:
Case 1: if coll.wordIndex = 1, searching the words w_(i+1) … w_n that follow w_i in S for a word or string similar to collWord;
Case 2: if coll.wordIndex = coll.collLen, searching the words w_1 … w_(i-1) that precede w_i in S for a word or string similar to collWord;
Case 3: if coll.wordIndex != 1 && coll.wordIndex != coll.collLen, the current word w_i lies in the middle of the collocation, so both w_1 … w_(i-1) and w_(i+1) … w_n are searched for a word or string similar to collWord;
If a word or string similar to collWord is found that forms the collocation strColl with w_i, verifying whether the distance between that similar word or string and w_i in S matches the distance between collWord and w_i in the collocation coll; if it matches, marking the similar word or string as erroneous; in this embodiment, the flags flag[k..j] of the similar word string w_k..w_j are set to -1 to indicate that the string is erroneous, that is, that it contains a wrong word, and collWord is stored as the corresponding correct word correctWord in the error-correction array vecCorrect with the structure CorrectNode, i.e. in the array vector<CorrectNode> vecCorrect, so as to give a correction suggestion with the corresponding correct word; automatic error detection and correction for w_i then ends and control returns to step 32-01);
Otherwise, returning to step 32-01) of step 32) to process the next word w_(i+1), until the end of the sentence, and then outputting the preliminary error-detection result, which includes the marks and the correction suggestions with the corresponding correct words; in this embodiment, the output comprises the flag array flag[i] as the error-position marking and the error-correction array vecCorrect as the correction result.
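Step 32-5) differs from step 32-3) in that the match may span several adjacent tokens, because a mistyped collocate is often split by the segmenter. A minimal sketch, assuming a hypothetical similarity predicate `is_similar` and a hypothetical limit `max_span` on how many adjacent tokens are joined (neither is specified in the text):

```python
def find_similar_collocate(tokens, positions, collocate, is_similar, max_span=2):
    """Find a token run w_k..w_j, within the allowed positions, that resembles the collocate.

    Returns (k, j) with inclusive 0-based bounds, or None if nothing similar is found.
    """
    allowed = set(positions)
    for k in positions:
        for j in range(k, min(k + max_span, len(tokens))):
            if j not in allowed:      # do not run across the current word w_i
                break
            candidate = "".join(tokens[k:j + 1])
            if candidate != collocate and is_similar(candidate, collocate):
                return k, j
    return None


tokens = ["提高", "生活", "水", "瓶"]     # the segmenter has split the slip "水瓶"
flags = [0] * len(tokens)
span = find_similar_collocate(
    tokens, [1, 2, 3], "水平",
    lambda s, c: (s, c) == ("水瓶", "水平"),  # toy homophone test
)
if span is not None:
    k, j = span
    for p in range(k, j + 1):
        flags[p] = -1                 # the whole run w_k..w_j is marked erroneous
```

The distance check against the collocation would be applied to the returned span in the same way as in step 32-3); it is omitted here to keep the sketch short.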
4) verifying the error-detection result against statistical information from the text to be checked and outputting the revised result, thereby achieving collocation-based automatic proofreading of Chinese text:
41) counting word frequencies: counting, over the segmented sentences S of the text to be checked, the frequency Freq(w_i) of each word w_i;
42) verifying the error-detection result: for each word marked as erroneous in the preliminary result output by step 3), checking whether its frequency counted in step 41) is not less than a preset threshold; if so, the word is considered correct; in this embodiment, the error-correction array vector<CorrectNode> vecCorrect is traversed and, for each possibly erroneous word word found by step 3), the following judgment is made: if Freq(word) ≥ a, where a is the preset threshold, the word is considered correct; in this embodiment and the corresponding experiment, the preset threshold is preferably set to 5;
43) revising the error-detection result: revising the preliminary result according to the statistical verification of step 42); in this embodiment, the entries in the error-correction array vecCorrect for words confirmed as correct by step 42) are deleted, the flags of the corresponding positions are set to 1, and the final, revised error-detection result is output.
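Steps 41) to 43) amount to a frequency filter over the preliminary result, which can be sketched as follows. The data layout is an assumption made for illustration: each correction is a hypothetical tuple (sentence index, token index, suggestion) rather than a CorrectNode, and `flags` is a list of per-sentence flag lists.

```python
from collections import Counter


def verify_by_frequency(sentences, flags, corrections, threshold=5):
    """Drop corrections whose flagged word occurs at least `threshold` times in the text.

    sentences: list of token lists (the segmented text to be checked).
    flags: list of flag lists, parallel to sentences (1 correct, -1 erroneous, 0 unjudged).
    corrections: list of (sentence_index, token_index, suggestion) tuples.
    Returns the corrections that survive verification; flags are updated in place.
    """
    freq = Counter(word for sentence in sentences for word in sentence)  # step 41
    kept = []
    for s, t, suggestion in corrections:
        if freq[sentences[s][t]] >= threshold:   # step 42: frequent, so treated as correct
            flags[s][t] = 1                      # step 43: flag reset, suggestion dropped
        else:
            kept.append((s, t, suggestion))
    return kept
```

The rationale is that a word the author uses repeatedly is more likely a deliberate term, such as a name or domain vocabulary, than a repeated typing slip.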
Experiment:By taking above-described embodiment as an example, base of the parameter given with embodiment as experiment parameter, to present invention offer Large-scale corpus experiment is carried out in the Chinese language text auto-collation of collocation, multiple open test is lived through, experiment adopts 1 The testing material of ten thousand row sentences, at the homonym error 6 00 in manual construction language material sentence, test result indicate that:The present invention is carried For carry out wrong identification and automatic Proofreading based on what the Chinese language text auto-collation of collocation can be effectively combined context, Learn after statistical analysiss, the Chinese language text auto-collation based on collocation that the present invention is provided, its recall rate reach 81.2%, Precision reaches 75.6%.This precision has exceeded prior art, has better met the demand of practical application, has with higher Effect property and accuracy.
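The text does not define recall and precision. Assuming the usual definitions for an error-detection task, with symbols introduced here only for illustration, they would read:

```latex
% Assumed standard definitions; N_hit, N_err and N_flag are illustrative symbols.
% N_err  : number of errors present in the test corpus (600 in the experiment above)
% N_flag : number of positions the method reports as erroneous
% N_hit  : number of reported positions that are true errors
\[
  \text{Recall} = \frac{N_{\text{hit}}}{N_{\text{err}}},
  \qquad
  \text{Precision} = \frac{N_{\text{hit}}}{N_{\text{flag}}}
\]
```

Under these definitions, the frequency verification of step 4) trades a little recall for precision, since it can only remove reported errors and never adds new ones.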
The above are only preferred embodiments of the invention and do not limit it. Any modification, equivalent substitution, or improvement made by those skilled in the art without departing from the technical idea of the invention falls within its scope of protection.

Claims (8)

1. A collocation-based automatic Chinese text proofreading method, characterized in that the method comprises the following steps:
1) according to the structure and features of Chinese word collocations, establishing an expression structure for collocations;
2) according to the collocations and word classes, establishing an index structure from words to word classes, and an index structure from words and word classes to collocations;
3) using the index structure from words and word classes to collocations established in step 2), performing automatic error detection and automatic error correction on the Chinese sentences of the text to be checked, marking the error positions, giving modification suggestions with the corresponding correct words, and outputting a preliminary error-detection result;
4) verifying the error-detection result using statistical information of the text to be checked, and outputting the revised error-detection result, thereby realizing collocation-based automatic proofreading of Chinese text.
2. The collocation-based automatic Chinese text proofreading method according to claim 1, characterized in that step 3), in which the index structure from words and word classes to collocations established in step 2) is used to perform automatic error detection and automatic error correction on the Chinese sentences of the text to be checked, mark the error positions, give modification suggestions with the corresponding correct words, and output a preliminary error-detection result, specifically comprises the following steps:
31) segmenting the sentences of the text to be checked into words;
32) traversing each word in the sentence and performing the following automatic error detection and automatic error correction:
32-1) looking up the word class of the word in the word-to-word-class index structure established in step 2), and then, using the word class found and the index structure from words and word classes to collocations established in step 2), looking up the set of collocation nodes corresponding to that word class; if no word class is found for the word, looking up the set of collocation nodes corresponding to the word itself in the index structure from words and word classes to collocations established in step 2);
32-2) if the set of collocation nodes found in 32-1) is not empty, traversing each collocation node in the set and retrieving from the collocation library the collocation corresponding to that node;
32-3) according to the position of the word in the collocation and the collocate, searching the sentence for the collocate that pairs with the word; if a collocate that forms the collocation with the word is found in the sentence, verifying whether the distance between the collocate and the word in the sentence matches the distance between the collocate and the word in the collocation; if it matches, marking the word and the collocate as correct and ending automatic error detection for the word; otherwise, proceeding to step 32-4);
32-4) replacing the word, one at a time, with each similar word in the word's similar-word set and, by the method of steps 32-1), 32-2) and 32-3), checking for each similar word whether a collocate forming a collocation with it can be found in the sentence; if no collocation exists, proceeding to step 32-5); if a collocation exists, marking the word as erroneous, taking the similar word as the corresponding correct word so as to give a modification suggestion, and ending automatic error detection and automatic error correction for the word;
32-5) according to the position of the word in the collocation and the collocate, searching the sentence for a word or string similar to the collocate; if a word or string that is similar to the collocate and forms a collocation with the word is found in the sentence, verifying whether the distance between that similar word or string and the word in the sentence matches the distance between the collocate and the word in the collocation; if it matches, marking the similar word or string as erroneous, taking the collocate as the corresponding correct word so as to give a modification suggestion, and ending automatic error detection and automatic error correction for the word; otherwise, returning to step 32) to process the next word, until the end of the sentence, and outputting a preliminary error-detection result comprising the marks and the modification suggestions with the corresponding correct words.
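The distance check of step 32-3) can be sketched for the simplest case of a two-word collocation. This is a minimal Python sketch, not the claimed implementation: `find_collocate` and its parameters are hypothetical, and "distance" is assumed to mean the number of words standing between the word and its collocate (the claims do not spell this out).

```python
from typing import List, Optional


def find_collocate(sentence: List[str], i: int, collocate: str,
                   word_first: bool, distance: Optional[int] = None) -> Optional[int]:
    """Return the index of `collocate` if it pairs with sentence[i], else None.

    word_first: True if the word precedes its collocate in the collocation.
    distance:   required number of intervening words; None means unrestricted.
    """
    if word_first:
        positions = range(i + 1, len(sentence))   # look to the right of the word
    else:
        positions = range(i - 1, -1, -1)          # look to the left of the word
    for j in positions:
        if sentence[j] != collocate:
            continue
        gap = abs(j - i) - 1                      # words between the two positions
        if distance is None or gap == distance:
            return j
    return None
```

When the function returns None, the method described above falls through to the similar-word checks of steps 32-4) and 32-5).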
3. The collocation-based automatic Chinese text proofreading method according to claim 1, characterized in that step 4), in which the error-detection result is verified using statistical information of the text to be checked and the revised error-detection result is output, specifically comprises:
41) counting word frequencies: counting over the segmented sentences of the text to be checked to obtain the word frequency of each word;
42) verifying the error-detection result: for each word marked as erroneous in the preliminary error-detection result output by step 3), checking whether its word frequency counted in step 41) is not less than a preset threshold; if so, the word is considered correct;
43) correcting the error-detection result: using the outcome of the statistical verification in step 42), revising the preliminary error-detection result and outputting the revised final error-detection result.
4. The collocation-based automatic Chinese text proofreading method according to claim 3, characterized in that the preset threshold is 5.
5. The collocation-based automatic Chinese text proofreading method according to claim 1, characterized in that:
the expression structure of a collocation in step 1) is:
Coll = <!word class 1>[<a|*>]<!word class 2>[<b|*>]<!word class 3>…<!word class p>;
where: <> indicates a required element,
[] indicates an optional element,
| indicates that one of the alternatives is chosen,
a, b and * indicate respectively that the distance between the preceding and following words is a, is b, or is unrestricted,
! is the word-class marker; !word class 1, !word class 2, !word class 3 and !word class p each denote a group of synonyms belonging to word class 1, word class 2, word class 3 and word class p respectively;
a word class is defined as: <!word class p> = <entry 1 | entry 2 | … | entry q>;
a large number of collocations constitute the collocation library: Coll_Set = {X | X is a collocation, X = <!word class 1>[n]<!word class 2>[m]<!word class 3>…<!word class p>}.
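One way to hold such an expression in memory is as an ordered list of word-class names plus the gaps between neighbours. The following Python sketch parses the bracketed form used in the collocation library definition, for example `<!RAISE>[1]<!LEVEL>`. The names `Collocation` and `parse_collocation`, the class names in the usage line, and the choice to treat an omitted gap as unrestricted are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Collocation:
    classes: List[str]           # word-class names, in order
    gaps: List[Optional[int]]    # gaps[k]: distance between classes[k] and classes[k+1]; None = unrestricted


# A token is either <!name> (a word class) or [n] / [*] (a distance).
_TOKEN = re.compile(r"<!([^>]+)>|\[(\d+|\*)\]")


def parse_collocation(expr: str) -> Collocation:
    classes: List[str] = []
    gaps: List[Optional[int]] = []
    for m in _TOKEN.finditer(expr):
        name, gap = m.group(1), m.group(2)
        if name is not None:
            if classes and len(gaps) < len(classes):
                gaps.append(None)    # no gap was given: treated as unrestricted (assumption)
            classes.append(name)
        else:
            gaps.append(None if gap == "*" else int(gap))
    return Collocation(classes, gaps)


# Usage with made-up class names:
coll = parse_collocation("<!RAISE>[1]<!LEVEL>")
```

The sketch does not validate malformed input such as two distances in a row; a fuller parser would reject those.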
6. The collocation-based automatic Chinese text proofreading method according to claim 5, characterized in that:
in step 2), according to the collocations and word classes, an index structure mapWordToClass from words to word classes and an index structure mapClassToColl from words and word classes to collocations are established, together with a collocation library vecColl for storing the collocations and a collocation node structure CNode comprising the collocation index number collIndex, the position wordIndex of the word in the collocation, and the collocation length collLen.
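The index structures named in this claim can be sketched with ordinary dictionaries and lists. This is an illustrative Python sketch only: the identifiers mapWordToClass, mapClassToColl, vecColl and CNode come from the claim, while `build_indexes`, `lookup_nodes`, the input shapes, and the simplification that each word belongs to at most one word class are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CNode:
    collIndex: int   # index of the collocation in vecColl
    wordIndex: int   # position of the word (or word class) inside the collocation
    collLen: int     # number of slots in the collocation


def build_indexes(word_classes: Dict[str, List[str]], collocations: List[List[str]]):
    """word_classes: {class name: entries}; collocations: lists of slots,
    where a slot is a word-class name or a literal word."""
    mapWordToClass = {w: cls for cls, entries in word_classes.items() for w in entries}
    vecColl = list(collocations)
    mapClassToColl = defaultdict(list)
    for coll_index, slots in enumerate(vecColl):
        for word_index, slot in enumerate(slots):
            mapClassToColl[slot].append(CNode(coll_index, word_index, len(slots)))
    return mapWordToClass, dict(mapClassToColl), vecColl


def lookup_nodes(word: str, mapWordToClass, mapClassToColl) -> List[CNode]:
    """Look up by word class first; fall back to the word itself if it has no class."""
    key = mapWordToClass.get(word, word)
    return mapClassToColl.get(key, [])
```

Keying mapClassToColl by either a class name or a literal word mirrors the fallback described for words that have no word class.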
7. The collocation-based automatic Chinese text proofreading method according to claim 6, characterized in that step 3), in which the index structure from words and word classes to collocations established in step 2) is used to perform automatic error detection and automatic error correction on the Chinese sentences of the text to be checked, mark the error positions, give modification suggestions with the corresponding correct words, and output a preliminary error-detection result, specifically comprises the following steps:
31) segmenting the sentence S of the text to be checked: S = w_1 w_2 … w_n, where w_1, w_2, …, w_n denote the words after segmentation; a mark array flag[i] marks the word w_i at each position of the segmented sentence and serves as the error-position mark, where 1 ≤ i ≤ n, flag[i] = 0 indicates that the word at that position is correct, and flag[i] = 1 indicates that the word at that position is erroneous;
32) traversing each word w_i in sentence S and performing the following automatic error detection and automatic error correction:
32-1) looking up the word class of w_i in the word-to-word-class index structure mapWordToClass established in step 2), and then, using the word class found and the index structure mapClassToColl from words and word classes to collocations established in step 2), looking up the set of collocation nodes Colls corresponding to that word class; if no word class is found for w_i, looking up the set of collocation nodes Colls corresponding to the word w_i itself in mapClassToColl;
32-2) if the set of collocation nodes Colls found in 32-1) is not empty, traversing each collocation node coll in Colls and, according to the collocation node structure CNode of coll, retrieving from the collocation library vecColl the collocation strColl whose index number is coll.collIndex;
32-3) according to the position coll.wordIndex of w_i in the collocation strColl and the collocate collWord, searching sentence S for the collocate collWord that pairs with w_i; if a collocate collWord that forms the collocation strColl with w_i is found in sentence S, verifying whether the distance between collWord and w_i in sentence S matches the distance between collWord and w_i in the collocation coll; if it matches, assigning 1 to the flags at the positions of the found collocate collWord and of the word w_i in the mark array, to indicate that the word and the collocate are correct, and ending automatic error detection for w_i; otherwise, proceeding to step 32-4);
32-4) replacing w_i, one at a time, with each word of its similar-word set sim(w_i) and, by the method of steps 32-1), 32-2) and 32-3), checking for each similar word w_j in sim(w_i) whether a collocate forming a collocation with w_j can be found in sentence S; if no collocation exists, proceeding to step 32-5); if a collocation exists, assigning -1 to flag[i], the flag at the position of w_i in the mark array, to indicate that w_i is erroneous, and storing the similar word w_j as the corresponding correct word correctWord in the error-correction array vecCorrect using the error-detection node structure CorrectNode, so as to give a modification suggestion with the corresponding correct word; the node structure CorrectNode comprises the start position begin of the erroneous word w_i in the sentence, the end position end of the erroneous word in the sentence, and the corresponding correct word correctWord; ending automatic error detection and automatic error correction for w_i;
32-5) according to the position coll.wordIndex of w_i in the collocation strColl and the collocate collWord, searching sentence S for a word or string similar to the collocate collWord; if a word or string that is similar to collWord and forms the collocation strColl with w_i is found in the sentence, verifying whether the distance between that similar word or string and w_i in sentence S matches the distance between collWord and w_i in the collocation coll; if it matches, setting to -1 the flags flag[k..j] at the positions of the found word string w_k..w_j similar to collWord, to indicate that the similar word or string w_k..w_j is erroneous, and storing collWord as the corresponding correct word correctWord in the error-correction array vecCorrect using the error-detection node structure CorrectNode, so as to give a modification suggestion with the corresponding correct word; ending automatic error detection and automatic error correction for w_i; otherwise, returning to step 32) to process the next word w_(i+1), until the end of the sentence, and outputting a preliminary error-detection result comprising the mark array flag[i] as the error-position marks and the error-correction array vecCorrect as the error-correction result.
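The similar-word substitution of step 32-4) can be sketched as follows. This is a hedged Python sketch rather than the claimed implementation: `try_similar_words` is a hypothetical name, `has_collocation` is a toy stand-in for the lookup of steps 32-1) to 32-3), `KNOWN_PAIRS` and the similar-word table are invented sample data, indices are 0-based, and begin/end are token indices rather than character positions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CorrectNode:
    begin: int         # start position of the erroneous word (token index here)
    end: int           # end position of the erroneous word (token index here)
    correctWord: str   # suggested correct word


KNOWN_PAIRS = {("发挥", "作用")}   # toy collocation store, hypothetical


def has_collocation(words: List[str], i: int) -> bool:
    """Toy stand-in for steps 32-1) to 32-3): does words[i] pair with any other word?"""
    return any((words[i], w) in KNOWN_PAIRS or (w, words[i]) in KNOWN_PAIRS
               for j, w in enumerate(words) if j != i)


def try_similar_words(sentence: List[str], i: int, sim: Dict[str, List[str]],
                      check: Callable[[List[str], int], bool],
                      flag: List[int], vecCorrect: List[CorrectNode]) -> bool:
    """Replace sentence[i] by each similar word; on the first hit, mark i as erroneous."""
    for candidate in sim.get(sentence[i], []):
        trial = sentence[:i] + [candidate] + sentence[i + 1:]
        if check(trial, i):
            flag[i] = -1                                     # the original word is wrong
            vecCorrect.append(CorrectNode(i, i, candidate))  # record the suggestion
            return True
    return False   # no similar word forms a collocation: continue with step 32-5)
```

Passing the collocation check in as a callback keeps the substitution logic separate from how collocations are stored and matched.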
8. The collocation-based automatic Chinese text proofreading method according to claim 7, characterized in that step 4), in which the error-detection result is verified using statistical information of the text to be checked and the revised error-detection result is output, specifically comprises:
41) counting word frequencies: counting over the segmented sentences S of the text to be checked to obtain the word frequency Freq(w_i) of each word w_i;
42) verifying the error-detection result: traversing the error-correction array vecCorrect and, for each possibly erroneous word `word` found by step 3), making the following judgment: if Freq(word) ≥ the preset threshold, the word is considered correct;
43) correcting the error-detection result: using the outcome of the statistical verification in step 42), revising the preliminary error-detection result and outputting the revised final error-detection result.
CN201611048520.5A 2016-11-21 2016-11-21 A kind of Chinese language text auto-collation based on collocation Active CN106547741B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611048520.5A CN106547741B (en) 2016-11-21 2016-11-21 A kind of Chinese language text auto-collation based on collocation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611048520.5A CN106547741B (en) 2016-11-21 2016-11-21 A kind of Chinese language text auto-collation based on collocation

Publications (2)

Publication Number Publication Date
CN106547741A true CN106547741A (en) 2017-03-29
CN106547741B CN106547741B (en) 2019-02-15

Family

ID=58395571

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611048520.5A Active CN106547741B (en) 2016-11-21 2016-11-21 A kind of Chinese language text auto-collation based on collocation

Country Status (1)

Country Link
CN (1) CN106547741B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110705262A (en) * 2019-09-06 2020-01-17 宁波市科技园区明天医网科技有限公司 Improved intelligent error correction method applied to medical skill examination report
CN110991166A (en) * 2019-12-03 2020-04-10 中国标准化研究院 Chinese wrongly-written character recognition method and system based on pattern matching
CN110992782A (en) * 2018-10-01 2020-04-10 翌焕株式会社 Braille editing method using braille translation error output function, computer-readable recording medium, and computer program
CN111079415A (en) * 2019-11-12 2020-04-28 中国标准化研究院 Chinese automatic error checking method based on collocation conflict
CN111613214A (en) * 2020-05-21 2020-09-01 重庆农村商业银行股份有限公司 Language model error correction method for improving voice recognition capability

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101101594A (en) * 2007-06-12 2008-01-09 无敌科技(西安)有限公司 Real-time sentence-assisted writing method and system
CN102789504A (en) * 2012-07-19 2012-11-21 姜赢 Chinese grammar correcting method and system on basis of XLM (Extensible Markup Language) rule
US20140298168A1 (en) * 2013-03-28 2014-10-02 Est Soft Corp. System and method for spelling correction of misspelled keyword
CN104991889A (en) * 2015-06-26 2015-10-21 江苏科技大学 Fuzzy word segmentation based non-multi-character word error automatic proofreading method
CN105279149A (en) * 2015-10-21 2016-01-27 上海应用技术学院 Chinese text automatic correction method
CN105912600A (en) * 2016-04-05 2016-08-31 上海智臻智能网络科技股份有限公司 Question-answer knowledge base and establishing method thereof, intelligent question-answering method and system

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110992782A (en) * 2018-10-01 2020-04-10 翌焕株式会社 Braille editing method using braille translation error output function, computer-readable recording medium, and computer program
US11170182B2 (en) 2018-10-01 2021-11-09 SENSEE, Inc. Braille editing method using error output function, recording medium storing program for executing same, and computer program stored in recording medium for executing same
CN110992782B (en) * 2018-10-01 2022-07-26 色恩西株式会社 Braille editing method using braille translation error output function, computer-readable recording medium, and computer program
CN110705262A (en) * 2019-09-06 2020-01-17 宁波市科技园区明天医网科技有限公司 Improved intelligent error correction method applied to medical skill examination report
CN110705262B (en) * 2019-09-06 2023-08-29 宁波市科技园区明天医网科技有限公司 Improved intelligent error correction method applied to medical technology inspection report
CN111079415A (en) * 2019-11-12 2020-04-28 中国标准化研究院 Chinese automatic error checking method based on collocation conflict
CN110991166A (en) * 2019-12-03 2020-04-10 中国标准化研究院 Chinese wrongly-written character recognition method and system based on pattern matching
CN110991166B (en) * 2019-12-03 2021-07-30 中国标准化研究院 Chinese wrongly-written character recognition method and system based on pattern matching
CN111613214A (en) * 2020-05-21 2020-09-01 重庆农村商业银行股份有限公司 Language model error correction method for improving voice recognition capability

Also Published As

Publication number Publication date
CN106547741B (en) 2019-02-15

Similar Documents

Publication Publication Date Title
CN106547741A (en) A kind of Chinese language text auto-collation based on collocation
CN105045778B (en) A kind of Chinese homonym mistake auto-collation
CN104991889B (en) A kind of non-multi-character word error auto-collation based on fuzzy participle
CN106202153B (en) A kind of the spelling error correction method and system of ES search engine
CN107608949A (en) A kind of Text Information Extraction method and device based on semantic model
CN102033879B (en) Method and device for identifying Chinese name
CN103631858B (en) A kind of science and technology item similarity calculating method
CN109885824A (en) A kind of Chinese name entity recognition method, device and the readable storage medium storing program for executing of level
CN104731768B (en) A kind of location of incident abstracting method towards Chinese newsletter archive
CN103473217B (en) The method and apparatus of extracting keywords from text
CN109829172A (en) A kind of automatic grammer of two-way decoding based on nerve translation is corrected mistakes model
CN106528526A (en) A Chinese address semantic tagging method based on the Bayes word segmentation algorithm
CN107807910A (en) A kind of part-of-speech tagging method based on HMM
CN110175221A (en) Utilize the refuse messages recognition methods of term vector combination machine learning
CN106528533A (en) Dynamic sentiment word and special adjunct word-based text sentiment analysis method
CN106598951A (en) Dependency structure treebank acquisition method and system
CN106708812A (en) Machine translation model obtaining method and device
CN104699797A (en) Webpage data structured analytic method and device
CN107797994A (en) Vietnamese noun phrase block identifying method based on constraints random field
CN109033166A (en) A kind of character attribute extraction training dataset construction method
CN115510864A (en) Chinese crop disease and pest named entity recognition method fused with domain dictionary
CN106156013A (en) The two-part machine translation method that a kind of regular collocation type phrase is preferential
CN103714053B (en) Japanese verb identification method for machine translation
Serrano et al. Rigoberta: A state-of-the-art language model for spanish
CN106126497A (en) A kind of automatic mining correspondence executes leader section and the method for cited literature textual content fragment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20170329

Assignee: JIANGSU KEDA HUIFENG SCIENCE AND TECHNOLOGY Co.,Ltd.

Assignor: JIANGSU University OF SCIENCE AND TECHNOLOGY

Contract record no.: X2020980007325

Denomination of invention: An automatic Chinese text proofreading method based on collocation

Granted publication date: 20190215

License type: Common License

Record date: 20201029

EC01 Cancellation of recordation of patent licensing contract

Assignee: JIANGSU KEDA HUIFENG SCIENCE AND TECHNOLOGY Co.,Ltd.

Assignor: JIANGSU University OF SCIENCE AND TECHNOLOGY

Contract record no.: X2020980007325

Date of cancellation: 20201223

TR01 Transfer of patent right

Effective date of registration: 20221230

Address after: Room 606-609, Compound Office Complex Building, No. 757, Dongfeng East Road, Yuexiu District, Guangzhou, Guangdong Province, 510699

Patentee after: China Southern Power Grid Internet Service Co.,Ltd.

Address before: Room 02A-084, Building C (Second Floor), No. 28, Xinxi Road, Haidian District, Beijing 100085

Patentee before: Jingchuang United (Beijing) Intellectual Property Service Co.,Ltd.

Effective date of registration: 20221230

Address after: Room 02A-084, Building C (Second Floor), No. 28, Xinxi Road, Haidian District, Beijing 100085

Patentee after: Jingchuang United (Beijing) Intellectual Property Service Co.,Ltd.

Address before: 212003, No. 2, Mengxi Road, Zhenjiang, Jiangsu

Patentee before: JIANGSU University OF SCIENCE AND TECHNOLOGY