A System and Method for Acquiring Chinese Pivot Structures
Technical field
The present invention relates to Chinese natural language processing and the automatic identification of Chinese grammatical structures, and more particularly to a system and method for automatically recognizing Chinese pivot structures.
Background
The Chinese pivotal sentence is a special class of linguistic phenomenon. Consider the following three sentences (words are separated by spaces and tagged with their parts of speech, which makes the pivot context in each sentence stand out):
S1: "organizing committee/n invite/v them/r attend/v meeting/n"
S2: "school/n support/v graduates/n start businesses/v"
S3: "who/r let/v this/r [measure word]/q bottle/n fall/v ground/s [particle]/u [punctuation]/w"
In S1, "them" is the object of the verb "invite" and at the same time the subject of the verb "attend"; "them" is therefore the pivot of S1. In S2, "graduates" is the object of the verb "support" and at the same time the subject of the verb "start businesses"; "graduates" is therefore the pivot of S2. Likewise, in S3, "this bottle" is the object of "let" and at the same time the subject of the verb "fall"; "this bottle" is therefore the pivot of S3.
As these three typical examples show, the pivotal sentence is a common phenomenon in Chinese. Over the past thirty-odd years, well-known Chinese scholars such as Zhu Dexi, Ding Shusheng, Huang Bairong, Lv Jiping, and Wu Qisheng have studied Chinese pivotal sentences systematically from grammatical or semantic perspectives, and their work has played an important role in the understanding of such sentences.
Beyond its theoretical value and its use in Chinese teaching and training, research on pivot structures has many important practical uses as Internet applications develop across the board.
For example, Chinese pivot structures can form part of the language model used in speech recognition and can substantially assist the automatic construction of such a model.
As another example, unknown word identification has long been an important problem: given a dictionary, any word that does not occur in it is called an unknown (unregistered) word. Because the vocabulary initially included in any dictionary is limited, the dictionary must be supplemented continually in practical applications. One technical difficulty in unknown word identification, or in dictionary supplementation, is determining the left and right boundaries of an unknown word accurately.
How can pivot structures be acquired effectively by processing and analyzing a large corpus, so as to build a pivot structure library? How can one verify which verbs, combined with which nouns and under what conditions, can form a pivot structure? These questions have never received sufficient attention or study.
Summary of the invention
The present invention addresses two problems: how to acquire pivot structures effectively by processing and analyzing a large corpus so as to build a pivot structure library, and how to verify which verbs, combined with which nouns and under what conditions, can form a pivot structure. To this end, the invention provides a system and method for acquiring Chinese pivot structures.
To solve the above problems, the present invention adopts the following technical scheme. A system for acquiring Chinese pivot structures, characterized in that it comprises: module A, which performs word segmentation on the original training corpus Corpus to form the segmented corpus TCorpus; module B, which identifies the verbs in every sentence S_i of TCorpus; module C, which analyzes the sentences in TCorpus using pivot patterns, forms candidate pivot structures from the sentences that satisfy a pivot pattern, and places them in the pivot structure library SOBase to be verified; and module D, which verifies the candidate pivot structure library SOBase and outputs the final result SOBaseResult.
In the modules described above, module A uses the open-source ICTCLAS system to segment every input text in Corpus and splits each text at its natural sentence boundaries, forming simple sentences that contain no sentence-ending marks. Each sentence of TCorpus therefore has the form S_i = "W_1/pos_1 W_2/pos_2 … W_i/pos_i … W_n/pos_n", where each W_i is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word, or letter, and pos_i is its part of speech. The segmentation result produced by module A is passed to module B, which identifies the verbs or verb phrases in every sentence S_i of TCorpus. Module B applies verb merging to every sentence S_i: whenever "W_1/v W_2/v" occurs, it is merged into "W_1W_2/v", so that two or more consecutive verbs become a single verb. After merging, the adverbs that modify verbs are eliminated, that is, all modifying adverbs that precede a verb are deleted. Once verb identification and adverb elimination are complete, module B passes its result to module C. Module C analyzes the sentences in TCorpus using the pivot patterns, forms candidate pivot structures from the sentences that satisfy a pivot pattern, and places them in the pivot structure library SOBase to be verified. Once the pivot pattern analysis is complete, module C passes its result to module D, which verifies the correctness of the pivot structures. For every record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> in the candidate pivot structure library SOBase, module D performs commonness verification and diversity verification of the pivot collocation.
A method for acquiring Chinese pivot structures, characterized in that it comprises the following steps:
First step: Perform word segmentation on the original training corpus Corpus to form the segmented corpus TCorpus;
Every input text D in Corpus is segmented using the open-source ICTCLAS system, and each text is split at its natural sentence boundaries, forming simple sentences that contain no sentence-ending marks. Each sentence of TCorpus therefore has the form S_i = "W_1/pos_1 W_2/pos_2 … W_i/pos_i … W_n/pos_n", where each W_i is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word, or letter, and pos_i is its part of speech;
The part-of-speech tags used in word segmentation are those in general use in the computing community. Common tags are: a for adjective, b for distinguishing word, c for conjunction, d for adverb, h for prefix, j for abbreviation, k for suffix, m for numeral, n for noun, p for preposition, q for measure word, r for pronoun, u for auxiliary word, and z for descriptive word;
Second step: Identify the verbs or verb phrases in every sentence S_i of the segmented corpus TCorpus;
Whenever "W_1/v W_2/v" occurs, it is merged into "W_1W_2/v", so that two or more consecutive verbs become a single verb; this process is called verb merging. After merging, the adverbs that modify verbs are eliminated, that is, all modifying adverbs that precede a verb are deleted. The processed sentences are put back into TCorpus;
Third step: Analyze the sentences in TCorpus using pivot patterns, form candidate pivot structures from the sentences that satisfy a pivot pattern, and place them in the pivot structure library SOBase to be verified;
Analyzing the sentences in TCorpus using pivot patterns means using 5 pivot patterns to select the sentences in TCorpus that satisfy one of those patterns and placing them in the pivot structure library SOBase to be verified;
Specifically, for any sentence SO_i in TCorpus, if it contains more than 2 verbs or only 1 verb, the sentence is discarded. Otherwise, suppose SO_i has the form "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}", where the subscript i denotes the i-th sentence. The main task is then to check whether N_{i,2} satisfies one of the 5 pivot patterns. If it does, the pair <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> is put into SOBase; otherwise SO_i is discarded;
The 5 pivot patterns: let the general form of a pivotal sentence be "N_1 V_1 N_2 V_2 N_3", where N_2 is the pivot. When pivot structures are acquired, only pivotal sentences whose pivot N_2 satisfies one of the following patterns are considered. The underlying assumption is that, when the corpus is sufficiently large, the pivot structures of pivotal sentences whose pivots take other forms can also be obtained from pivotal sentences whose pivots satisfy the 5 patterns below:
Pattern 1: numeral + noun;
Pattern 2: numeral + measure word + noun;
Pattern 3: {this, these, that, those, and their variant demonstrative forms (e.g., "this time", "that time"), it, they (non-human)}. The elements of this set are common pronouns that usually refer to inanimate objects or animals, and each element is itself a pattern;
Pattern 4: {this, these, that, those, and their variant demonstrative forms (e.g., "this time", "that time")} + noun. This is a pivot pattern made up of a pronoun and a noun;
Pattern 5: {he, they (masculine), I, we, she, they (feminine)}. The elements of this set are common pronouns that usually refer to persons, and each element is itself a pattern;
Fourth step: Verify the candidate pivot structure library SOBase and output the final result SOBaseResult;
For every record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> in the candidate pivot structure library SOBase, two verification techniques are used: commonness verification of the pivot collocation and diversity verification of the pivot collocation. Both are necessary conditions for a pivot structure to be correct;
Commonness verification of the pivot collocation means: if SO_i = "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" is a correct pivotal sentence, then the pivot structure "V_{i,1}…V_{i,2}" should also occur in other sentences of TCorpus rather than only in the pivotal sentence SO_i;
Diversity verification of the pivot collocation means: if SO_i = "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" is a correct pivotal sentence, then pivotal sentences of the form SO'_i = "N'_{i,1} V_{i,1} N'_{i,2} V_{i,2} N'_{i,3}" and SO''_i = "N''_{i,1} V_{i,1} N''_{i,2} V_{i,2} N''_{i,3}" should also occur repeatedly in TCorpus;
The fourth step is implemented as follows:
First, two non-negative thresholds a and b are introduced, where a ∈ (0, 1] and b ∈ (0, 1].
Step D1: Set SOBaseResult to empty; it is used to store the verified, correct pivot structures;
Step D2: If SOBase is empty, go to step D6;
Step D3: Take any record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> from SOBase and remove it from SOBase;
Step D4: If cof("V_{i,1}…V_{i,2}") > a, put "V_{i,1}…V_{i,2}" into the set SOBaseResult and go to step D2;
cof("V_{i,1}…V_{i,2}") reflects the commonness of the pivot structure "V_{i,1}…V_{i,2}" and is calculated as follows: cof("V_{i,1}…V_{i,2}") = (number of sentences in TCorpus that contain the structure "V_{i,1}…V_{i,2}") / (total number of sentences in TCorpus). When cof("V_{i,1}…V_{i,2}") > a, "V_{i,1}…V_{i,2}" is regarded as a correct pivot structure;
Step D5: If muf("V_{i,1}…V_{i,2}") > b, put "V_{i,1}…V_{i,2}" into the set SOBaseResult;
muf("V_{i,1}…V_{i,2}") is a mathematical measure of the diversity of the pivot collocation. It is calculated in the following sub-steps; initially, V_{*,1} and V_{*,2} are set to the empty set;
Step D51: If SOBase contains a record <"V_x…V_{i,2}", "N_{i,1} V_x N_{i,2} V_{i,2} N_{i,3}">, put V_x into the set V_{*,1};
Step D52: If SOBase contains a record <"V_{i,1}…V_y", "N_{i,1} V_{i,1} N_{i,2} V_y N_{i,3}">, put V_y into the set V_{*,2};
Step D53: Compute muf("V_{i,1}…V_{i,2}"). The computing formula is as follows:
Step D6: Output the final pivot structure result SOBaseResult.
Beneficial effects:
By combining linguistics with computer technology, the present invention proposes a system and method for acquiring Chinese pivot structures. To deal with the complexity and diversity of pivot forms, the invention introduces pivot patterns, which greatly limit the complexity of pivot forms without reducing the effectiveness of acquisition. To deal with the complexity of Chinese word formation and sentences, and to ensure the accuracy of the pivot structures, the invention strictly verifies the acquired pivot structures from two angles: the diversity of pivot collocations and the commonness of pivot collocations.
After testing and verification on a 1 TB corpus, the system of the invention obtained 139,600 pivot structure pairs; analysis shows an accuracy of 98.2%. The invention therefore achieves good recognition performance and is fit for practical application.
Brief description of the drawings
Fig. 1 is the workflow diagram of the system for acquiring Chinese pivot structures.
Detailed description of embodiments
To explain the present invention more clearly, the following terms are defined and explained:
(1) Chinese parts of speech: A part of speech is an attribute of a Chinese word. Common parts of speech are: noun (e.g., child, graduate; tagged n), verb (e.g., invite, add; tagged v), adverb (e.g., very, often; tagged d), adjective (e.g., beautiful, fine, simple; tagged a), pronoun (e.g., these, this, it, they; tagged r), numeral (e.g., 12; tagged m), and measure word (e.g., 根, 只, 条; tagged q).
(2) The 5 pivot patterns: To control the complexity of acquiring pivot structures while still acquiring as many pivot structures as possible, the present invention introduces 5 pivot patterns. For ease of exposition, the general form of a pivotal sentence is assumed below to be "N_1 V_1 N_2 V_2 N_3", where N_2 is the pivot. When acquiring pivot structures, the invention considers only pivotal sentences whose pivot N_2 satisfies one of the following patterns (that is, we assume that, when the corpus is sufficiently large, the pivot structures of pivotal sentences whose pivots take other forms can also be obtained from pivotal sentences whose pivots satisfy the 5 patterns below):
● Pattern 1: numeral + noun. Two specific examples are "three/m people/n" and "3/m projects/n".
● Pattern 2: numeral + measure word + noun. Two specific examples are "three/m [measure word]/q people/n" and "3/m [measure word]/q plants/n".
● Pattern 3: {this, these, that, those, and their variant demonstrative forms (e.g., "this time", "that time"), it, they (non-human)}. The elements of this set are common pronouns that usually refer to inanimate objects or animals, and each element is itself a pattern.
● Pattern 4: {this, these, that, those, and their variant demonstrative forms (e.g., "this time", "that time")} + noun. This is a pivot pattern made up of a pronoun and a noun. Two specific examples are "this/r match/n" and "these/r raw materials/n".
● Pattern 5: {he, they (masculine), I, we, she, they (feminine)}. The elements of this set are common pronouns that usually refer to persons, and each element is itself a pattern.
(3) The ICTCLAS system: a free, open-source word segmentation system. It takes a text as input and outputs the segmented word sequence of that text. The ICTCLAS download address is http://ictclas.nlpir.org. After segmentation, every token is tagged with its part of speech, where a denotes adjective, b distinguishing word, c conjunction, d adverb, h prefix, j abbreviation, k suffix, m numeral, n noun, p preposition, q measure word, r pronoun, u auxiliary word, z descriptive word, and so on.
The present invention is described in more detail below with reference to the accompanying drawing and specific embodiments. In what follows, any input, whether a short sentence or a long article, is referred to as a text.
The system and method for acquiring Chinese pivot structures are divided into four main modules:
Module A: Perform word segmentation on the original training corpus Corpus to form the segmented corpus TCorpus.
Module B: Identify the verbs in every sentence S_i of the segmented corpus TCorpus.
Module C: Analyze the sentences in TCorpus using pivot patterns, form candidate pivot structures from the sentences that satisfy a pivot pattern, and place them in the pivot structure library SOBase to be verified.
Module D: Verify the candidate pivot structure library SOBase and output the final result SOBaseResult.
The workflow or method of each module is explained in detail below.
Module A: Perform word segmentation on the original training corpus Corpus to form the segmented corpus TCorpus.
We use the open-source ICTCLAS system to segment every input text D in Corpus and split each text at its natural sentence boundaries, forming simple sentences that contain no sentence-ending marks. Each sentence of TCorpus therefore has the form S_i = "W_1/pos_1 W_2/pos_2 … W_i/pos_i … W_n/pos_n", where each W_i is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word, or letter, and pos_i is its part of speech.
The part-of-speech tags used in word segmentation are those in general use in the computing community. Common tags are: a for adjective, b for distinguishing word, c for conjunction, d for adverb, h for prefix, j for abbreviation, k for suffix, m for numeral, n for noun, p for preposition, q for measure word, r for pronoun, u for auxiliary word, z for descriptive word, and so on. For example, the segmentation result of the sentence "the teacher earnestly says to those students to listen" is: "teacher/n earnestly/d say/v to/v those/r students/n listen/v".
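The sentence format S_i = "W_1/pos_1 … W_n/pos_n" can be read back into (word, tag) pairs with a few lines of code. The following is a minimal sketch, not the ICTCLAS API: the function name `parse_tagged` and the use of English glosses in place of Chinese words are assumptions made for illustration.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def parse_tagged(sentence: str) -> List[Token]:
    """Split a 'W1/pos1 W2/pos2 ...' string into (word, tag) pairs.

    Hypothetical helper: it only assumes that tokens are separated by
    whitespace and that the tag follows the last '/' of each token.
    """
    tokens: List[Token] = []
    for item in sentence.split():
        if "/" not in item:
            # No tag present: keep the word with an empty tag.
            tokens.append((item, ""))
            continue
        # rpartition splits at the LAST '/', so a '/' inside the word survives.
        word, _, pos = item.rpartition("/")
        tokens.append((word, pos))
    return tokens


tokens = parse_tagged("teacher/n earnestly/d say/v to/v those/r students/n listen/v")
verbs = [word for word, pos in tokens if pos == "v"]
```

Here `verbs` collects the words tagged v, which is the information module B works on next.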
Module B: Identify the verbs or verb phrases in every sentence S_i of the segmented corpus TCorpus.
For every sentence S_i in TCorpus, whenever "W_1/v W_2/v" occurs, it is merged into "W_1W_2/v", so that two or more consecutive verbs become a single verb; this process is called verb merging. For example, applying this processing to the sentence "teacher/n say/v to/v those/r students/n listen/v" yields the verb phrase "say to" and hence the pivotal sentence "teacher/n say to/v those/r students/n listen/v". The purpose of this step is to acquire as many pivot structures as possible.
After merging, the adverbs that modify verbs are eliminated, that is, all modifying adverbs that precede a verb are deleted. For example, verb merging turns the sentence "teacher/n earnestly/d say/v to/v those/r students/n listen/v" into the pivotal sentence "teacher/n earnestly/d say to/v those/r students/n listen/v". Adverb deletion then yields the pivotal sentence "teacher/n say to/v those/r students/n listen/v".
The present invention puts the processed sentences back into TCorpus.
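The two operations of module B can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the function names are hypothetical, merged English glosses are joined with a space (Chinese words would simply be concatenated), and "adverbs before a verb" is interpreted as the run of d-tagged tokens immediately preceding a verb.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def merge_verbs(tokens: List[Token]) -> List[Token]:
    """Merge every run of consecutive verbs ('v') into a single verb token."""
    merged: List[Token] = []
    for word, pos in tokens:
        if pos == "v" and merged and merged[-1][1] == "v":
            prev_word, _ = merged[-1]
            merged[-1] = (prev_word + " " + word, "v")
        else:
            merged.append((word, pos))
    return merged


def drop_adverbs_before_verbs(tokens: List[Token]) -> List[Token]:
    """Delete adverbs ('d') that immediately precede a verb; keep all others."""
    result: List[Token] = []
    pending: List[Token] = []  # adverbs whose following token is not yet known
    for word, pos in tokens:
        if pos == "d":
            pending.append((word, pos))
            continue
        if pos != "v":
            result.extend(pending)  # these adverbs did not modify a verb
        pending = []
        result.append((word, pos))
    result.extend(pending)  # trailing adverbs are kept
    return result


sentence = [("teacher", "n"), ("earnestly", "d"), ("say", "v"), ("to", "v"),
            ("those", "r"), ("students", "n"), ("listen", "v")]
processed = drop_adverbs_before_verbs(merge_verbs(sentence))
```

Merging runs before adverb deletion, in the order the text describes, so an adverb standing between two verbs would block their merge in this sketch.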
Module C: Analyze the sentences in TCorpus using pivot patterns, form candidate pivot structures from the sentences that satisfy a pivot pattern, and place them in the pivot structure library SOBase to be verified.
Analyzing the sentences in TCorpus using pivot patterns means using the 5 pivot patterns defined earlier to select the sentences in TCorpus that satisfy one of those patterns and placing them in the pivot structure library SOBase to be verified.
Specifically, for any sentence SO_i in TCorpus, if it contains more than 2 verbs or only 1 verb, the sentence is discarded. Otherwise, suppose SO_i has the form "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" (here the subscript i denotes the i-th sentence). The main task is then to check whether N_{i,2} satisfies one of the 5 pivot patterns. If it does, the pair <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> is put into SOBase; otherwise SO_i is discarded.
For example, in the pivotal sentence "teacher/n say to/v those/r students/n listen/v", "those/r students/n" satisfies pivot pattern 4. The invention therefore regards "say to…listen", obtained from this sentence, as a candidate pivot structure and puts <"say to…listen", "teacher/n say to/v those/r students/n listen/v"> into the candidate pivot structure library SOBase for further verification by module D.
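The selection logic of module C can be sketched as below. Several things here are assumptions made for illustration: the function names are hypothetical, the pronoun sets are small English stand-ins for the Chinese pronoun lists of patterns 3 to 5, and only the verb count and the form of N_2 are checked (N_1 and N_3 are not inspected, and N_3 may be empty, as in the "say to…listen" example).

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# English stand-ins for the Chinese pronoun sets (hypothetical, for illustration).
DEMONSTRATIVES = {"this", "these", "that", "those"}
NON_PERSONAL = DEMONSTRATIVES | {"it"}
PERSONAL = {"he", "she", "I", "we", "they"}


def pivot_pattern(n2: List[Token]) -> Optional[int]:
    """Return the number (1-5) of the pattern that N2 satisfies, else None."""
    words = [w for w, _ in n2]
    tags = [t for _, t in n2]
    if tags == ["m", "n"]:
        return 1  # numeral + noun
    if tags == ["m", "q", "n"]:
        return 2  # numeral + measure word + noun
    if tags == ["r"] and words[0] in NON_PERSONAL:
        return 3  # single non-personal pronoun
    if tags == ["r", "n"] and words[0] in DEMONSTRATIVES:
        return 4  # demonstrative pronoun + noun
    if tags == ["r"] and words[0] in PERSONAL:
        return 5  # single personal pronoun
    return None


def extract_candidate(tokens: List[Token]) -> Optional[Tuple[str, List[Token]]]:
    """Return a <structure, sentence> record for SOBase, or None to discard."""
    verb_positions = [i for i, (_, t) in enumerate(tokens) if t == "v"]
    if len(verb_positions) != 2:
        return None  # only sentences with exactly two verbs are kept
    first, second = verb_positions
    n2 = tokens[first + 1:second]  # the candidate pivot between V1 and V2
    if pivot_pattern(n2) is None:
        return None
    structure = f"{tokens[first][0]}…{tokens[second][0]}"
    return structure, tokens


sentence = [("teacher", "n"), ("say to", "v"), ("those", "r"),
            ("students", "n"), ("listen", "v")]
record = extract_candidate(sentence)
```

For the sentence above, `record` pairs the structure string "say to…listen" with the sentence itself, mirroring the record format used for SOBase.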
Module D: Verify the candidate pivot structure library SOBase and output the final result SOBaseResult.
For every record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> in the candidate pivot structure library SOBase, the present invention proposes two verification techniques: commonness verification of the pivot collocation and diversity verification of the pivot collocation. Both are necessary conditions for a pivot structure to be correct.
Commonness verification of the pivot collocation means: if SO_i = "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" is a correct pivotal sentence, then the pivot structure "V_{i,1}…V_{i,2}" should also occur in other sentences of TCorpus rather than only in the pivotal sentence SO_i.
For example, let SO_i = "the organizing committee invites them to attend the meeting". Then sentences such as SO_{i,1} = "the host invites them to take part in the interaction" and SO_{i,2} = "the owner invites them to have lunch together" can also occur elsewhere in TCorpus; that is, the pivot collocation is not tied solely to the particular pivotal sentence SO_i.
Diversity verification of the pivot collocation means: if SO_i = "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" is a correct pivotal sentence, then pivotal sentences of the form SO'_i = "N'_{i,1} V_{i,1} N'_{i,2} V_{i,2} N'_{i,3}", SO''_i = "N''_{i,1} V_{i,1} N''_{i,2} V_{i,2} N''_{i,3}", and so on should also occur repeatedly in TCorpus.
For example, let SO_i = "the organizing committee invites them to attend the meeting"; then SO'_i = "a friend invites us to attend a birthday party", SO''_i = "a friend invites her to attend the wedding", and SO'''_i = "the professor invites these foreign students to attend the exchange meeting". That is, in SO_i the pivot "them" can be replaced by words of various forms, and the pivot structure "invite…attend" in SO_i remains reasonable and correct.
Based on the above analysis and explanation of the commonness and diversity of pivot collocations, module D is implemented as follows. First, two non-negative thresholds a and b are introduced, where a ∈ (0, 1] and b ∈ (0, 1].
Step D1: Set SOBaseResult to empty; it is used to store the verified, correct pivot structures.
Step D2: If SOBase is empty, go to step D6.
Step D3: Take any record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> from SOBase and remove it from SOBase.
Step D4: If cof("V_{i,1}…V_{i,2}") > a, put "V_{i,1}…V_{i,2}" into the set SOBaseResult and go to step D2.
cof("V_{i,1}…V_{i,2}") reflects the commonness of the pivot structure "V_{i,1}…V_{i,2}" and is calculated as follows: cof("V_{i,1}…V_{i,2}") = (number of sentences in TCorpus that contain the structure "V_{i,1}…V_{i,2}") / (total number of sentences in TCorpus). When cof("V_{i,1}…V_{i,2}") > a, "V_{i,1}…V_{i,2}" is regarded as a correct pivot structure.
Step D5: If muf("V_{i,1}…V_{i,2}") > b, put "V_{i,1}…V_{i,2}" into the set SOBaseResult.
muf("V_{i,1}…V_{i,2}") is a mathematical measure of the diversity of the pivot collocation. It is calculated in the following sub-steps; initially, V_{*,1} and V_{*,2} are set to the empty set.
Step D51: If SOBase contains a record <"V_x…V_{i,2}", "N_{i,1} V_x N_{i,2} V_{i,2} N_{i,3}">, put V_x into the set V_{*,1}.
Step D52: If SOBase contains a record <"V_{i,1}…V_y", "N_{i,1} V_{i,1} N_{i,2} V_y N_{i,3}">, put V_y into the set V_{*,2}.
Step D53: Compute muf("V_{i,1}…V_{i,2}"). The computing formula is as follows:
Step D6: Output the final pivot structure result SOBaseResult.
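Steps D1 to D6 can be sketched in code as follows, with several loudly stated assumptions. The muf formula itself is not reproduced in the text, so the sketch takes it as a caller-supplied function `muf` over the two sets V_{*,1} and V_{*,2}. The sketch also assumes that control returns to step D2 after step D5, that the sets are built from the verb pairs alone (the nouns of the record are ignored), and that they are computed against the full SOBase rather than against what remains after earlier records were taken out. All function names are hypothetical.

```python
from typing import Callable, List, Set, Tuple

Token = Tuple[str, str]        # (word, part-of-speech tag)
Structure = Tuple[str, str]    # (V1, V2)
Record = Tuple[Structure, List[Token]]


def contains_structure(sentence: List[Token], structure: Structure) -> bool:
    """True if V1 occurs as a verb and V2 occurs as a later verb."""
    v1, v2 = structure
    verbs = [w for w, t in sentence if t == "v"]
    return any(v == v1 and v2 in verbs[i + 1:] for i, v in enumerate(verbs))


def cof(structure: Structure, tcorpus: List[List[Token]]) -> float:
    """Share of the sentences in TCorpus that contain the structure."""
    if not tcorpus:
        return 0.0
    hits = sum(contains_structure(s, structure) for s in tcorpus)
    return hits / len(tcorpus)


def collocation_sets(structure: Structure,
                     sobase: List[Record]) -> Tuple[Set[str], Set[str]]:
    """Steps D51 and D52: first verbs seen with V2, second verbs seen with V1."""
    v1, v2 = structure
    first = {x for (x, y), _ in sobase if y == v2}
    second = {y for (x, y), _ in sobase if x == v1}
    return first, second


def verify(sobase: List[Record], tcorpus: List[List[Token]], a: float, b: float,
           muf: Callable[[Set[str], Set[str]], float]) -> Set[Structure]:
    result: Set[Structure] = set()                 # D1
    for structure, _ in sobase:                    # D2, D3
        if cof(structure, tcorpus) > a:            # D4
            result.add(structure)
        elif muf(*collocation_sets(structure, sobase)) > b:  # D5
            result.add(structure)
    return result                                  # D6
```

Passing the diversity score in as a parameter keeps the sketch usable with whatever formula step D53 specifies, without guessing at it here.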
Experimental results
Repeated preliminary experiments showed that the best pivot acquisition results are obtained when the commonness threshold a for pivot collocations is set to 0.0006 (a = 0.0006) and the diversity threshold b for pivot collocations is set to 0.0015 (b = 0.0015). After testing and verification on a 1 TB corpus, the system of the invention obtained 139,600 pivot structure pairs; analysis shows an accuracy of 98.2%. The invention therefore achieves good recognition performance and is fit for practical application.