CN106815188A

CN106815188A - System and method for acquiring Chinese pivotal structures

Info

Publication number: CN106815188A
Application number: CN201510846489.9A
Authority: CN
Inventors: 符建辉; 王卫明; 曹阳
Original assignee: KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Current assignee: KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Priority date: 2015-11-27
Filing date: 2015-11-27
Publication date: 2017-06-09
Anticipated expiration: 2035-11-27
Also published as: CN106815188B

Abstract

The present invention relates to a system and method for acquiring Chinese pivotal structures, comprising: performing word segmentation on an original training corpus Corpus to form a segmented corpus TCorpus; identifying the verbs in every sentence S_i of TCorpus; analyzing the sentences in TCorpus with pivot patterns, forming candidate pivotal structures from the sentences that match a pivot pattern, and placing them in the to-be-verified pivotal structure base SOBase; and verifying the candidate pivotal structure base SOBase and outputting the final result SOBaseResult. By introducing pivot patterns, the invention greatly limits the complexity of pivot forms without reducing acquisition quality. Given the complexity of Chinese word formation and sentence construction, and to ensure the accuracy of the pivotal structures, the invention strictly verifies the acquired pivotal structures from two angles: "pivotal collocation diversity" and "pivotal collocation commonality".

Description

System and method for acquiring Chinese pivotal structures

Technical field

The present invention relates to Chinese natural language processing and the automatic recognition of Chinese grammatical structures, and more particularly to a system and method for automatically recognizing Chinese pivotal structures.

Background art

The Chinese pivotal sentence is a special class of linguistic phenomenon. For example, consider the following three sentences (words are separated by spaces and tagged with their parts of speech, which makes the pivot context in each sentence easier to see):

S1: "organizing committee/n invite/v them/r participate/v meeting/n"

S2: "school/n support/v graduates/n establish/v"

S3: "who/r let/v this/r only/q bottle/n fall/v ground/s/u/w"

In S1, "them" is the object of the verb "invite" and also the subject of the verb "participate", so "them" is the pivot of S1. In S2, "graduates" is the object of the verb "support" and also the subject of the verb "establish", so "graduates" is the pivot of S2. Likewise, in S3, "this bottle" is the object of "let" and also the subject of the verb "fall", so "this bottle" is the pivot of S3.

As these three typical examples show, the pivotal sentence is a common phenomenon in Chinese. Over the past 30-plus years, well-known Chinese scholars such as Zhu Dexi, Ding Shusheng, Huang Bairong, Lv Jiping, and Wu Qisheng have studied Chinese pivotal sentences systematically from grammatical or semantic perspectives, and their work has played an important role in how these sentences are understood.

Beyond its theoretical value and its use in Chinese teaching and training, research on pivotal structures has many important applications as Internet applications develop across the board.

For example, Chinese pivotal structures can serve as part of the language model in speech recognition, and they are an important aid in building such a language model automatically.

As another example, unknown-word recognition has long been an important problem: given a dictionary, a word that does not appear in it is called an unknown word. Because the vocabulary of any dictionary is limited at the outset, it must be supplemented continually in practice. One technical difficulty in unknown-word recognition, or dictionary supplementation, is accurately determining the left and right boundaries of an unknown word.

How can pivotal structures be acquired effectively by processing and analyzing a large corpus, so as to build a pivotal structure base? How can one verify which verbs, combined with which nouns and under what conditions, form a pivotal structure? These questions have never received sufficient attention or study.

Summary of the invention

To address the problems of how to acquire pivotal structures effectively by processing and analyzing a large corpus so as to build a pivotal structure base, and how to verify which verbs, combined with which nouns and under what conditions, form a pivotal structure, the invention provides a system and method for acquiring Chinese pivotal structures.

To solve the above problems, the invention adopts the following technical solution: a system for acquiring Chinese pivotal structures, characterized by comprising: module A, which performs word segmentation on the original training corpus Corpus to form the segmented corpus TCorpus; module B, which identifies the verbs in every sentence S_i of TCorpus; module C, which analyzes the sentences in TCorpus with pivot patterns, forms candidate pivotal structures from the sentences that match a pivot pattern, and places them in the to-be-verified pivotal structure base SOBase; and module D, which verifies the candidate pivotal structure base SOBase and outputs the final result SOBaseResult.

Among these modules, module A uses the open-source ICTCLAS system to segment every input text in RCorpus and splits each text at its natural sentence boundaries, forming simple sentences that contain no sentence-ending punctuation. The form of each sentence in TCorpus is therefore S_i = "W_1/pos_1 W_2/pos_2 … W_i/pos_i … W_n/pos_n", where each W_i is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word, or letter, and pos_i is its part of speech. Module A passes the segmented result to module B, which identifies the verbs or verb phrases in every sentence S_i of TCorpus. Module B applies verb merging to every sentence S_i in TCorpus: whenever "W_1/v W_2/v" occurs, it is merged into "W_1W_2/v", that is, two or more adjacent verbs are merged into a single verb; this process is called verb merging. After merging, adverbs that modify verbs are eliminated, that is, all modifying adverbs in front of a verb are deleted. Once module B has completed verb identification and adverb processing, it passes the result to module C. Module C analyzes the sentences in TCorpus with the pivot patterns, forms candidate pivotal structures from the sentences that match a pivot pattern, and places them in the to-be-verified pivotal structure base SOBase. Once module C has completed the pivot pattern analysis, it passes the result to module D, which verifies the correctness of the pivotal structures. For every record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> in the candidate pivotal structure base SOBase, module D performs pivotal collocation commonality verification and pivotal collocation diversity verification.

A method for acquiring Chinese pivotal structures, characterized by comprising the following steps:

First step: perform word segmentation on the original training corpus Corpus to form the segmented corpus TCorpus;

Every input text D in Corpus is segmented with the open-source ICTCLAS system, and each text is split at its natural sentence boundaries, forming simple sentences that contain no sentence-ending punctuation. The form of each sentence in TCorpus is therefore S_i = "W_1/pos_1 W_2/pos_2 … W_i/pos_i … W_n/pos_n", where each W_i is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word, or letter, and pos_i is its part of speech;

In segmentation algorithms, part-of-speech tags follow conventions that are common in computing. Common tags are: a for adjective, b for distinguishing word, c for conjunction, d for adverb, h for prefix, j for abbreviation, k for suffix, m for numeral, n for noun, p for preposition, q for measure word, r for pronoun, u for auxiliary word, and z for descriptive word;

Second step: identify the verbs or verb phrases in every sentence S_i of the segmented corpus TCorpus;

Whenever "W_1/v W_2/v" occurs, it is merged into "W_1W_2/v", that is, two or more adjacent verbs are merged into a single verb; this process is called verb merging. After merging, adverbs that modify verbs are eliminated, that is, all modifying adverbs in front of a verb are deleted. The processed sentences are put back into TCorpus;

Third step: analyze the sentences in TCorpus with pivot patterns, form candidate pivotal structures from the sentences that match a pivot pattern, and place them in the to-be-verified pivotal structure base SOBase;

Analyzing the sentences in TCorpus with pivot patterns means using 5 pivot patterns to pick out the sentences in TCorpus that match one of the patterns and placing them in the to-be-verified pivotal structure base SOBase;

Specifically, for any sentence SO_i in TCorpus, if it contains more than 2 verbs, or only 1 verb, the sentence is discarded. Otherwise, let the form of SO_i be "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}", where the subscript i denotes the i-th sentence. The main task is then to check whether N_{i,2} matches one of the 5 pivot patterns. If it does, the pair <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> is put into SOBase; otherwise, SO_i is discarded;

The 5 pivot patterns: let the general form of a pivotal sentence be "N_1 V_1 N_2 V_2 N_3", where N_2 is the pivot. When acquiring pivotal structures, only pivotal sentences whose pivot N_2 matches one of the following patterns are considered; that is, when the corpus is large enough, the pivotal structures of pivotal sentences whose pivots take other forms can also be obtained from pivotal sentences whose pivots match the 5 patterns below:

Pattern 1: numeral + noun;

Pattern 2: numeral + measure word + noun;

Pattern 3: {this, this, specifically, this, this position is this, these, that, that, that time, that, that position, that, those, it, they}; the elements of this set are common pronouns, usually used to refer to inanimate objects or animals, and each element is by itself a pattern;

Pattern 4: {this, this, specifically, this, this position is this, these, that, that, that time, that, that position, that, those} + noun; this is a pivot pattern made up of a pronoun and a noun;

Pattern 5: {he, they, I, we, she, they}; the elements of this set are common pronouns, usually used to refer to people, and each element is by itself a pattern;

Fourth step: verify the candidate pivotal structure base SOBase and output the final result SOBaseResult;

For every record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> in the candidate pivotal structure base SOBase, two verification techniques are applied: pivotal collocation commonality verification and pivotal collocation diversity verification. Both are necessary conditions for a pivotal structure to be correct;

Pivotal collocation commonality verification means: if SO_i = "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" is a correct pivotal sentence, then its pivotal structure "V_{i,1}…V_{i,2}" should also occur in other sentences of TCorpus, rather than only in the pivotal sentence SO_i;

Pivotal collocation diversity verification means: if SO_i = "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" is a correct pivotal sentence, then pivotal sentences of the form SO'_i = "N'_{i,1} V_{i,1} N'_{i,2} V_{i,2} N'_{i,3}" and SO''_i = "N''_{i,1} V_{i,1} N''_{i,2} V_{i,2} N''_{i,3}" should also occur repeatedly in TCorpus;

The fourth step is implemented as follows:

First, two non-negative thresholds a and b are introduced, where a ∈ (0,1] and b ∈ (0,1].

Step D1: set SOBaseResult to empty; it stores the verified, correct pivotal structures;

Step D2: if SOBase is empty, go to step D6;

Step D3: take any record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> out of SOBase;

Step D4: if cof("V_{i,1}…V_{i,2}") > a, put "V_{i,1}…V_{i,2}" into the set SOBaseResult and go to step D2;

Here cof("V_{i,1}…V_{i,2}") reflects the commonality of the pivotal structure "V_{i,1}…V_{i,2}" and is calculated as: cof("V_{i,1}…V_{i,2}") = (number of sentences in TCorpus that contain the structure "V_{i,1}…V_{i,2}") / (number of sentences in TCorpus). When cof("V_{i,1}…V_{i,2}") > a, "V_{i,1}…V_{i,2}" is regarded as a correct pivotal structure;

Step D5: if muf("V_{i,1}…V_{i,2}") > b, put "V_{i,1}…V_{i,2}" into the set SOBaseResult;

Here muf("V_{i,1}…V_{i,2}") is a mathematical measure that characterizes pivotal collocation diversity. Its calculation sub-steps are as follows. Initially, the sets V_{*,1} and V_{*,2} are empty;

Step D51: if SOBase contains a record <"V_x…V_{i,2}", "N_{i,1} V_x N_{i,2} V_{i,2} N_{i,3}">, put V_x into the set V_{*,1};

Step D52: if SOBase contains a record <"V_{i,1}…V_y", "N_{i,1} V_{i,1} N_{i,2} V_y N_{i,3}">, put V_y into the set V_{*,2};

Step D53: compute muf("V_{i,1}…V_{i,2}"); the formula is as follows:

Step D6: output the final pivotal structure result SOBaseResult.
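Steps D1 to D6 can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the patent's implementation: the name `verify` and the record layout (a `(V1, V2)` tuple paired with its sentence) are hypothetical, and because the muf formula is not reproduced in this text, both scores are passed in as caller-supplied functions `cof_fn` and `muf_fn`.

```python
from typing import Callable, Iterable, Set, Tuple

Structure = Tuple[str, str]          # (V1, V2), i.e. the structure "V1…V2"
Record = Tuple[Structure, str]       # (structure, source sentence)


def verify(
    sobase: Iterable[Record],
    cof_fn: Callable[[Structure], float],
    muf_fn: Callable[[Structure], float],
    a: float,
    b: float,
) -> Set[Structure]:
    """Follow steps D1-D6 literally; cof_fn and muf_fn are assumed helpers."""
    if not (0 < a <= 1 and 0 < b <= 1):
        raise ValueError("thresholds a and b must lie in (0, 1]")
    pending = list(sobase)
    result: Set[Structure] = set()           # D1: start with an empty result set
    while pending:                           # D2: stop when SOBase is empty
        structure, _sentence = pending.pop() # D3: take any record out of SOBase
        if cof_fn(structure) > a:            # D4: commonality test
            result.add(structure)
            continue
        if muf_fn(structure) > b:            # D5: diversity test
            result.add(structure)
    return result                            # D6: output SOBaseResult


# Toy scores, made up purely for illustration.
cof_scores = {("invite", "participate"): 0.01, ("say", "listen"): 0.0001, ("x", "y"): 0.0}
muf_scores = {("invite", "participate"): 0.0, ("say", "listen"): 0.5, ("x", "y"): 0.0}
records = [(s, "") for s in cof_scores]
accepted = verify(records, cof_scores.get, muf_scores.get, a=0.0006, b=0.0015)
```

As written in steps D4 and D5, a structure is accepted when either score exceeds its threshold; the sketch keeps that control flow rather than requiring both.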

Beneficial effects:

Drawing on linguistics and computer technology, the invention proposes a system and method for acquiring Chinese pivotal structures. To deal with the complexity and diversity of pivot forms, the invention introduces pivot patterns, which greatly limit the complexity of pivot forms without reducing acquisition quality. Given the complexity of Chinese word formation and sentence construction, and to ensure the accuracy of the pivotal structures, the invention strictly verifies the acquired pivotal structures from two angles: "pivotal collocation diversity" and "pivotal collocation commonality".

In a verification test on a 1 TB corpus, the system of the invention acquired 139,600 pivotal structures, and analysis showed an accuracy of 98.2%. The invention therefore achieves good recognition performance and is suitable for practical application.

Brief description of the drawings

Fig. 1 is a workflow diagram of the system for acquiring Chinese pivotal structures.

Detailed description of the embodiments

To explain the invention more clearly, the following definitions and terms are given:

(1) Chinese parts of speech: in Chinese, the part of speech is an attribute of a word. Common ones include: nouns (e.g., child, graduate; tagged n), verbs (e.g., invite, add; tagged v), adverbs (e.g., very, often; tagged d), adjectives (e.g., beautiful, fine, simple; tagged a), pronouns (e.g., these, this, it, they; tagged r), numerals (e.g., twelve, 12; tagged m), and measure words (e.g., root, only, bar; tagged q).

(2) The 5 pivot patterns: to control the complexity of acquiring pivotal structures while still ensuring that a large number of them can be acquired, the invention introduces 5 pivot patterns. For ease of exposition, the general form of a pivotal sentence is assumed below to be "N_1 V_1 N_2 V_2 N_3", where N_2 is the pivot. When acquiring pivotal structures, the invention considers only pivotal sentences whose pivot N_2 matches one of the following patterns (that is, we assume that when the corpus is large enough, the pivotal structures of pivotal sentences whose pivots take other forms can also be obtained from pivotal sentences whose pivots match the 5 patterns below):

● Pattern 1: numeral + noun. Two concrete examples are "three/m people/n" and "3/m projects/n".

● Pattern 2: numeral + measure word + noun. Two concrete examples are "three/m/q people/n" and "3/m/q plant/n".

● Pattern 3: {this, this, specifically, this, this position is this, these, that, that, that time, that, that position, that, those, it, they}. The elements of this set are common pronouns, usually used to refer to inanimate objects or animals, and each element is by itself a pattern.

● Pattern 4: {this, this, specifically, this, this position is this, these, that, that, that time, that, that position, that, those} + noun. This is a pivot pattern made up of a pronoun and a noun. Two concrete examples are "this/r matches/n" and "these/r raw materials/n".

● Pattern 5: {he, they, I, we, she, they}. The elements of this set are common pronouns, usually used to refer to people, and each element is by itself a pattern.

(3) The ICTCLAS system: a free, open-source word segmentation system. It takes text as input and outputs the segmented word sequence of that text. The ICTCLAS download address is http://ictclas.nlpir.org. After segmentation, each word is tagged with its part of speech, where a denotes adjective, b distinguishing word, c conjunction, d adverb, h prefix, j abbreviation, k suffix, m numeral, n noun, p preposition, q measure word, r pronoun, u auxiliary word, z descriptive word, and so on.

The invention is described in more detail below with reference to the accompanying drawing and specific embodiments. Hereinafter, any input, whether a short sentence or a long article, is referred to as a text.

The system and method for acquiring Chinese pivotal structures are divided into four main modules:

Module A: perform word segmentation on the original training corpus Corpus to form the segmented corpus TCorpus.

Module B: identify the verbs in every sentence S_i of the segmented corpus TCorpus.

Module C: analyze the sentences in TCorpus with pivot patterns, form candidate pivotal structures from the sentences that match a pivot pattern, and place them in the to-be-verified pivotal structure base SOBase.

Module D: verify the candidate pivotal structure base SOBase and output the final result SOBaseResult.

The workflow or method of each module is explained in detail below.

We segment every input text D in Corpus with the open-source ICTCLAS system and split each text at its natural sentence boundaries, forming simple sentences that contain no sentence-ending punctuation. The form of each sentence in TCorpus is therefore S_i = "W_1/pos_1 W_2/pos_2 … W_i/pos_i … W_n/pos_n", where each W_i is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word, or letter, and pos_i is its part of speech.

In segmentation algorithms, part-of-speech tags follow conventions that are common in computing. Common tags are: a for adjective, b for distinguishing word, c for conjunction, d for adverb, h for prefix, j for abbreviation, k for suffix, m for numeral, n for noun, p for preposition, q for measure word, r for pronoun, u for auxiliary word, z for descriptive word, and so on. For example, the segmentation result of the sentence "the teacher earnestly speaks to those students to listen" is: "teacher/n in earnest/d says/v to/v those/r student/n listens/v".
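The "W/pos" sentence format described above can be read into (word, tag) pairs with a few lines of Python. This is a minimal sketch, not the patent's implementation and not the ICTCLAS API: the function name `parse_tagged` is hypothetical, tokens are assumed to be separated by whitespace, and the example uses single-word English glosses in place of the Chinese words.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def parse_tagged(sentence: str) -> List[Token]:
    """Split 'W1/pos1 W2/pos2 ...' into a list of (word, tag) pairs."""
    tokens: List[Token] = []
    for item in sentence.split():
        # Split on the last '/', so a word that itself contains '/' keeps it.
        word, sep, tag = item.rpartition("/")
        if not sep:
            # No tag present: keep the word and leave the tag empty.
            word, tag = item, ""
        tokens.append((word, tag))
    return tokens


tokens = parse_tagged("teacher/n earnestly/d says/v to/v those/r student/n listens/v")
verbs = [word for word, tag in tokens if tag == "v"]
```

Keeping the tag alongside each word lets the later modules select verbs (tag v) and adverbs (tag d) without re-parsing the text.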

Module B: identify the verbs or verb phrases in every sentence S_i of the segmented corpus TCorpus.

For every sentence S_i in TCorpus, whenever "W_1/v W_2/v" occurs, it is merged into "W_1W_2/v", that is, two or more adjacent verbs are merged into a single verb; this process is called verb merging. For example, applying this processing to the sentence "teacher/n says/v to/v those/r student/n listens/v" yields the verb phrase "say to" and thus the pivotal sentence "teacher/n say to/v those/r student/n listens/v". The purpose is to acquire as many pivotal structures as possible.

After merging, adverbs that modify verbs are eliminated, that is, all modifying adverbs in front of a verb are deleted. For example, verb merging turns the sentence "teacher/n in earnest/d says/v to/v those/r student/n listens/v" into the pivotal sentence "teacher/n in earnest/d say to/v those/r student/n listens/v". Adverb deletion then yields the pivotal sentence "teacher/n say to/v those/r student/n listens/v".
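The two processing passes of module B, verb merging followed by adverb deletion, can be sketched as follows. This is an illustrative sketch under assumptions: the function names and the `joiner` parameter are hypothetical, a verb is assumed to carry exactly the tag "v" and an adverb exactly the tag "d", and "modifying adverbs in front of a verb" is read as any run of adverbs immediately preceding a verb.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def merge_verbs(tokens: List[Token], joiner: str = "") -> List[Token]:
    """Merge each run of adjacent verbs ('W1/v W2/v') into one verb token."""
    merged: List[Token] = []
    for word, tag in tokens:
        if tag == "v" and merged and merged[-1][1] == "v":
            prev_word, _ = merged[-1]
            merged[-1] = (prev_word + joiner + word, "v")
        else:
            merged.append((word, tag))
    return merged


def drop_preverbal_adverbs(tokens: List[Token]) -> List[Token]:
    """Delete adverbs (tag 'd') that stand immediately before a verb."""
    kept: List[Token] = []
    pending: List[Token] = []  # adverbs seen since the last non-adverb token
    for word, tag in tokens:
        if tag == "d":
            pending.append((word, tag))
            continue
        if tag != "v":
            kept.extend(pending)  # adverbs not followed by a verb are kept
        pending = []
        kept.append((word, tag))
    kept.extend(pending)  # trailing adverbs have no verb after them
    return kept


sentence = [("teacher", "n"), ("earnestly", "d"), ("says", "v"), ("to", "v"),
            ("those", "r"), ("student", "n"), ("listens", "v")]
processed = drop_preverbal_adverbs(merge_verbs(sentence, joiner=" "))
```

The default `joiner` is the empty string to match the "W_1W_2/v" form in the text; a space is used in the example only because the words are English glosses.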

The invention puts the processed sentences back into TCorpus.

Analyzing the sentences in TCorpus with pivot patterns means using the 5 pivot patterns designed above to pick out the sentences in TCorpus that match one of the patterns and placing them in the to-be-verified pivotal structure base SOBase.

Specifically, for any sentence SO_i in TCorpus, if it contains more than 2 verbs, or only 1 verb, the sentence is discarded. Otherwise, let the form of SO_i be "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" (where the subscript i denotes the i-th sentence). The main task is then to check whether N_{i,2} matches one of the 5 pivot patterns. If it does, the pair <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> is put into SOBase; otherwise, SO_i is discarded.

For example, in the pivotal sentence "teacher/n say to/v those/r student/n listens/v", "those/r student/n" matches pivot pattern 4. The invention therefore treats "say to ... listen", obtained from this sentence, as a candidate pivotal structure and puts <"say to ... listen", "teacher/n say to/v those/r student/n listens/v"> into the candidate pivotal structure base SOBase for further verification by module D.
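The candidate extraction performed by module C can be sketched as below. This is a simplified illustration, not the patent's implementation: the names `PIVOT_TAG_PATTERNS` and `extract_candidate` are hypothetical, and the pronoun word sets of patterns 3 to 5 are approximated by the part-of-speech tag r alone, since the sketch works on tags rather than on the original Chinese word lists.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Tag sequences standing in for the 5 pivot patterns (an approximation):
# numeral + noun, numeral + measure word + noun, pronoun, pronoun + noun.
PIVOT_TAG_PATTERNS = {("m", "n"), ("m", "q", "n"), ("r",), ("r", "n")}


def extract_candidate(
    tokens: List[Token],
) -> Optional[Tuple[Tuple[str, str], List[Token]]]:
    """Return ((V1, V2), sentence) if the sentence is a candidate, else None."""
    verb_positions = [i for i, (_, tag) in enumerate(tokens) if tag == "v"]
    if len(verb_positions) != 2:
        return None  # sentences without exactly two verbs are discarded
    first, second = verb_positions
    pivot = tokens[first + 1:second]  # N2: everything between the two verbs
    if tuple(tag for _, tag in pivot) not in PIVOT_TAG_PATTERNS:
        return None  # the pivot matches none of the patterns
    return (tokens[first][0], tokens[second][0]), tokens


teacher = [("teacher", "n"), ("say to", "v"), ("those", "r"),
           ("student", "n"), ("listens", "v")]
school = [("school", "n"), ("support", "v"), ("graduates", "n"), ("establish", "v")]
candidate = extract_candidate(teacher)
rejected = extract_candidate(school)
```

Under these assumptions the teacher sentence is accepted through the pronoun + noun pattern, while a sentence whose pivot is a bare noun is not, which is consistent with the text's decision to consider only pivots that match the listed patterns.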

For every record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> in the candidate pivotal structure base SOBase, the invention proposes two verification techniques: pivotal collocation commonality verification and pivotal collocation diversity verification. Both are necessary conditions for a pivotal structure to be correct.

Pivotal collocation commonality verification means: if SO_i = "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" is a correct pivotal sentence, then its pivotal structure "V_{i,1}…V_{i,2}" should also occur in other sentences of TCorpus, rather than only in the pivotal sentence SO_i.

For example, let SO_i = "organizing committee invites them to participate in meeting". Then SO_{i,1} = "host invites them to participate in interaction" and SO_{i,2} = "owner invites them to have lunch altogether" can also occur among the other sentences of TCorpus; that is, SO_{i,1} and SO_{i,2} do not depend solely on the particular pivotal sentence SO_i.

Pivotal collocation diversity verification means: if SO_i = "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" is a correct pivotal sentence, then pivotal sentences of the form SO'_i = "N'_{i,1} V_{i,1} N'_{i,2} V_{i,2} N'_{i,3}", SO''_i = "N''_{i,1} V_{i,1} N''_{i,2} V_{i,2} N''_{i,3}", and so on should also occur repeatedly in TCorpus.

For example, let SO_i = "organizing committee invites them to participate in meeting"; then SO'_i = "friend invites us to participate in birthday party", SO''_i = "friend invites her to participate in wedding", and SO'''_i = "professor invites these foreign students to participate in exchanging meeting". That is, in SO_i = "organizing committee invites them to participate in meeting", the pivot "them" can be replaced by words of many different forms, and the pivotal structure "invite ... participate" in SO_i remains reasonable and correct.

Based on the above analysis and explanation of pivotal collocation commonality and pivotal collocation diversity, module D is implemented as follows. First, two non-negative thresholds a and b are introduced, where a ∈ (0,1] and b ∈ (0,1].

Step D1: set SOBaseResult to empty; it stores the verified, correct pivotal structures.

Step D2: if SOBase is empty, go to step D6.

Step D3: take any record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> out of SOBase.

Step D4: if cof("V_{i,1}…V_{i,2}") > a, put "V_{i,1}…V_{i,2}" into the set SOBaseResult and go to step D2.

Here cof("V_{i,1}…V_{i,2}") reflects the commonality of the pivotal structure "V_{i,1}…V_{i,2}" and is calculated as: cof("V_{i,1}…V_{i,2}") = (number of sentences in TCorpus that contain the structure "V_{i,1}…V_{i,2}") / (number of sentences in TCorpus). When cof("V_{i,1}…V_{i,2}") > a, "V_{i,1}…V_{i,2}" is regarded as a correct pivotal structure.

Step D5: if muf("V_{i,1}…V_{i,2}") > b, put "V_{i,1}…V_{i,2}" into the set SOBaseResult.

Here muf("V_{i,1}…V_{i,2}") is a mathematical measure that characterizes pivotal collocation diversity. Its calculation sub-steps are as follows. Initially, the sets V_{*,1} and V_{*,2} are empty.

Step D51: if SOBase contains a record <"V_x…V_{i,2}", "N_{i,1} V_x N_{i,2} V_{i,2} N_{i,3}">, put V_x into the set V_{*,1}.

Step D52: if SOBase contains a record <"V_{i,1}…V_y", "N_{i,1} V_{i,1} N_{i,2} V_y N_{i,3}">, put V_y into the set V_{*,2}.
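The cof ratio from step D4 and the two verb sets collected in steps D51 and D52 can be sketched as below. Several points are assumptions rather than statements of the source: the function names are hypothetical, a sentence is taken to "contain the structure" when the verb V1 is followed later by the verb V2, and the noun slots in steps D51 and D52 are left unconstrained. The sketch stops at the two sets and does not compute muf itself.

```python
from typing import List, Sequence, Set, Tuple

Token = Tuple[str, str]       # (word, part-of-speech tag)
Structure = Tuple[str, str]   # (V1, V2)


def contains_structure(tokens: Sequence[Token], v1: str, v2: str) -> bool:
    """True if verb v1 occurs and verb v2 occurs somewhere after it."""
    seen_v1 = False
    for word, tag in tokens:
        if tag != "v":
            continue
        if seen_v1 and word == v2:
            return True
        if word == v1:
            seen_v1 = True
    return False


def cof(structure: Structure, sentences: Sequence[Sequence[Token]]) -> float:
    """Share of sentences in the corpus that contain the structure."""
    if not sentences:
        return 0.0
    v1, v2 = structure
    hits = sum(1 for tokens in sentences if contains_structure(tokens, v1, v2))
    return hits / len(sentences)


def collocation_sets(
    structure: Structure, sobase_structures: Sequence[Structure]
) -> Tuple[Set[str], Set[str]]:
    """D51: first verbs paired with V2; D52: second verbs paired with V1."""
    v1, v2 = structure
    first_verbs = {x for x, y in sobase_structures if y == v2}
    second_verbs = {y for x, y in sobase_structures if x == v1}
    return first_verbs, second_verbs


corpus: List[List[Token]] = [
    [("committee", "n"), ("invite", "v"), ("them", "r"), ("attend", "v"), ("meeting", "n")],
    [("host", "n"), ("invite", "v"), ("us", "r"), ("attend", "v"), ("party", "n")],
    [("teacher", "n"), ("speak", "v"), ("students", "n"), ("listen", "v")],
    [("weather", "n"), ("good", "a")],
]
score = cof(("invite", "attend"), corpus)
firsts, seconds = collocation_sets(
    ("invite", "attend"),
    [("invite", "attend"), ("ask", "attend"), ("invite", "join")],
)
```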

Experimental results

Repeated preliminary experiments showed that the acquired pivotal structures were best when the pivotal collocation commonality threshold a was set to 0.0006 (a = 0.0006) and the pivotal collocation diversity threshold b was set to 0.0015 (b = 0.0015). In a verification test on a 1 TB corpus, the system of the invention acquired 139,600 pivotal structures, and analysis showed an accuracy of 98.2%. The invention therefore achieves good recognition performance and is suitable for practical application.

Claims

1. A system for acquiring Chinese pivotal structures, characterized by comprising: module A, which performs word segmentation on the original training corpus Corpus to form the segmented corpus TCorpus; module B, which identifies the verbs in every sentence S_i of TCorpus; module C, which analyzes the sentences in TCorpus with pivot patterns, forms candidate pivotal structures from the sentences that match a pivot pattern, and places them in the to-be-verified pivotal structure base SOBase; and module D, which verifies the candidate pivotal structure base SOBase and outputs the final result SOBaseResult;

Among these modules, module A uses the open-source ICTCLAS system to segment every input text in RCorpus and splits each text at its natural sentence boundaries, forming simple sentences that contain no sentence-ending punctuation; the form of each sentence in TCorpus is therefore S_i = "W_1/pos_1 W_2/pos_2 … W_i/pos_i … W_n/pos_n", where each W_i is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word, or letter, and pos_i is its part of speech; module A passes the segmented result to module B, which identifies the verbs or verb phrases in every sentence S_i of TCorpus; module B applies verb merging to every sentence S_i in TCorpus, that is, whenever "W_1/v W_2/v" occurs, it is merged into "W_1W_2/v", so that two or more adjacent verbs are merged into a single verb, this process being called verb merging; after merging, adverbs that modify verbs are eliminated, that is, all modifying adverbs in front of a verb are deleted; once module B has completed verb identification and adverb processing, it passes the result to module C; module C analyzes the sentences in TCorpus with the pivot patterns, forms candidate pivotal structures from the sentences that match a pivot pattern, and places them in the to-be-verified pivotal structure base SOBase; once module C has completed the pivot pattern analysis, it passes the result to module D so as to verify the correctness of the pivotal structures; for every record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> in the candidate pivotal structure base SOBase, module D performs pivotal collocation commonality verification and pivotal collocation diversity verification.

2. a kind of Chinese and language structure obtaining method, it is characterised in that：Comprise the following steps：

Participle is carried out to every in Corpus input text D using an ICTCLAS system increased income, and by every text Natural according to sentence is decoupled, and formation does not contain the simple sentence of sentence marks；Therefore, each sentence of TCorpus The form of son is S_i=" W₁/pos₁W₂/pos₂…W_i/pos₁…W_n/pos_n", wherein each W_iBe a Chinese word, Chinese character, Punctuation mark, Arabic numerals, English word or letter, pos_iIt is its corresponding part of speech；

In segmentation methods, the mark of part of speech is current in computer circle；Common part of speech has a to represent that adjective, b are represented Distinction word, c represent conjunction, d represent adverbial word, h represent prefix word, j represent abbreviation word, k represent suffix word, m represent number, N represents that noun, p represent that preposition, q represent that measure word, r represent that pronoun, u represent that auxiliary word, z represent descriptive word；

As appearance " W₁/v W₂/ v ", then according to " W₁W₂/ v " merges treatment, will two or more verbs, conjunction And be a verb, this process is called verb merging treatment；After the treatment, Processing for removing is carried out to the adverbial word for modifying verb, All modification adverbial words that will be before verb are all deleted；Sentence after treatment is still put into TCorpus；

3rd step: the sentences in TCorpus are analyzed with the pivotal patterns; for each sentence that satisfies a pivotal pattern, a candidate pivotal structure is formed and inserted into the pivotal structure library SOBase awaiting verification;

Analyzing the sentences in TCorpus with the pivotal patterns means using 5 pivotal patterns to pick out the sentences in TCorpus that satisfy one of the patterns and inserting them into the pivotal structure library SOBase awaiting verification;

Specifically, for any sentence SO_i in TCorpus, if it contains more than 2 verbs, or contains only 1 verb, the sentence is discarded; otherwise, SO_i has the form "N_{i,1}V_{i,1}N_{i,2}V_{i,2}N_{i,3}", where the subscript i denotes the i-th sentence; the main task that follows is to check whether N_{i,2} satisfies one of the 5 pivotal patterns; if it does, the pair <"V_{i,1}…V_{i,2}", "N_{i,1}V_{i,1}N_{i,2}V_{i,2}N_{i,3}"> is put into SOBase; otherwise, SO_i is discarded;
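The verb-count filter and the decomposition of a sentence into the five segments N₁ V₁ N₂ V₂ N₃ might look as follows. This is a minimal sketch, assuming that verbs carry exactly the tag "v" and that each N segment is simply the run of tokens between the verbs; the function name `split_pivot_shape` is hypothetical.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)
Shape = Tuple[List[Token], str, List[Token], str, List[Token]]


def split_pivot_shape(tokens: List[Token]) -> Optional[Shape]:
    """Return (N1, V1, N2, V2, N3), or None when the sentence is discarded."""
    verb_positions = [i for i, (_, pos) in enumerate(tokens) if pos == "v"]
    if len(verb_positions) != 2:
        return None  # only one verb, or more than two: discard the sentence
    first, second = verb_positions
    return (
        tokens[:first],             # N1: everything before the first verb
        tokens[first][0],           # V1
        tokens[first + 1:second],   # N2: the candidate pivot
        tokens[second][0],          # V2
        tokens[second + 1:],        # N3: everything after the second verb
    )
```

The middle segment N2 is what is subsequently checked against the 5 patterns.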

The 5 pivotal patterns: let the general form of a pivotal sentence be "N₁V₁N₂V₂N₃", where N₂ is the pivot; when pivotal structures are acquired, only pivotal sentences whose pivot N₂ satisfies one of the following patterns are considered, because, when the corpus is sufficiently large, the pivotal structures of pivotal sentences whose pivot takes other forms can also be obtained from pivotal sentences whose pivot satisfies the following 5 patterns:

Pattern 1：Number+noun；

Pattern 2：Number+measure word+noun；

Pattern 3: this, this, specifically, this, this position is this, these, that, that, that time, that, that position, that, those, it, they; the elements of this set are common pronouns, typically used to refer to inanimate objects or animals, and each element is itself a pattern;

Pattern 4: this, this, specifically, this, this position is this, these, that, that, that time, that, that position, that, those + noun; this is a pivotal pattern made up of a pronoun and a noun;

Pattern 5: he, they, I, we, she, they; the elements of this set are common pronouns, typically used to refer to people, and each element is itself a pattern;
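The check of the pivot N₂ against the 5 patterns can be sketched as a small classifier. The word lists below are an assumption: the translated claim does not preserve the original Chinese words, so `DEMONSTRATIVES`, `THING_PRONOUNS` and `PERSON_PRONOUNS` are illustrative and possibly incomplete stand-ins, and the function name `match_pivot_pattern` is hypothetical. The tags m, q and n are the numeral, measure-word and noun tags listed in the claim.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# ASSUMPTION: illustrative word lists, not the patent's exact sets.
DEMONSTRATIVES = {"这", "这个", "这些", "那", "那个", "那些"}
THING_PRONOUNS = {"它", "它们"}
PERSON_PRONOUNS = {"他", "他们", "我", "我们", "她", "她们"}


def match_pivot_pattern(n2: List[Token]) -> Optional[int]:
    """Return the number (1-5) of the pattern that N2 satisfies, else None."""
    words = [word for word, _ in n2]
    tags = [pos for _, pos in n2]
    if tags == ["m", "n"]:
        return 1  # numeral + noun
    if tags == ["m", "q", "n"]:
        return 2  # numeral + measure word + noun
    if len(n2) == 1 and words[0] in DEMONSTRATIVES | THING_PRONOUNS:
        return 3  # a single demonstrative or thing-referring pronoun
    if len(n2) == 2 and words[0] in DEMONSTRATIVES and tags[1] == "n":
        return 4  # demonstrative + noun
    if len(n2) == 1 and words[0] in PERSON_PRONOUNS:
        return 5  # a single person-referring pronoun
    return None
```

A sentence whose N₂ yields `None` would be discarded; otherwise its pair of verbs and the sentence shape would be recorded in SOBase.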

For every record <"V_{i,1}…V_{i,2}", "N_{i,1}V_{i,1}N_{i,2}V_{i,2}N_{i,3}"> in the candidate pivotal structure library SOBase, two verification techniques are used: pivotal-collocation commonality verification and pivotal-collocation diversity verification; both are necessary conditions for a pivotal structure to be correct;

The pivotal-collocation commonality verification means that, if SO_i = "N_{i,1}V_{i,1}N_{i,2}V_{i,2}N_{i,3}" is a correct pivotal sentence, then the pivotal structure "V_{i,1}…V_{i,2}" occurs in other sentences in TCorpus as well, rather than only in the pivotal sentence SO_i;

The pivotal-collocation diversity verification means that, if SO_i = "N_{i,1}V_{i,1}N_{i,2}V_{i,2}N_{i,3}" is a correct pivotal sentence, then pivotal sentences of the form SO′_i = "N′_{i,1}V_{i,1}N′_{i,2}V_{i,2}N′_{i,3}" and SO″_i = "N″_{i,1}V_{i,1}N″_{i,2}V_{i,2}N″_{i,3}" should also occur repeatedly in TCorpus.

3. The method for acquiring Chinese pivotal structures according to claim 2, characterized in that the 4th step is specifically implemented as follows:

Step D2: if SOBase is empty, go to step D6;

Step D3: choose any record <"V_{i,1}…V_{i,2}", "N_{i,1}V_{i,1}N_{i,2}V_{i,2}N_{i,3}"> in SOBase and take it out of SOBase;

Step D4: if cof("V_{i,1}…V_{i,2}") > a, put "V_{i,1}…V_{i,2}" into the set SOBaseResult and go to step D2;

Here cof("V_{i,1}…V_{i,2}") reflects the commonality of the pivotal structure "V_{i,1}…V_{i,2}" and is calculated as follows: cof("V_{i,1}…V_{i,2}") = (number of sentences in TCorpus that contain the structure "V_{i,1}…V_{i,2}") / (number of sentences in TCorpus); when cof("V_{i,1}…V_{i,2}") > a, "V_{i,1}…V_{i,2}" is regarded as a correct pivotal structure;
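The cof ratio can be computed directly from the segmented corpus. The sketch below rests on an assumption the claim does not spell out: a sentence is taken to "contain" the structure when V₁ occurs as a verb and V₂ occurs as a later verb in the same sentence. The function names are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def contains_structure(tokens: List[Token], v1: str, v2: str) -> bool:
    """True when verb v1 is followed, anywhere later, by verb v2."""
    seen_v1 = False
    for word, pos in tokens:
        if pos != "v":
            continue
        if seen_v1 and word == v2:
            return True
        if word == v1:
            seen_v1 = True
    return False


def cof(corpus: List[List[Token]], v1: str, v2: str) -> float:
    """Share of corpus sentences that contain the structure "v1 ... v2"."""
    if not corpus:
        return 0.0  # assumption: an empty corpus yields 0 rather than an error
    hits = sum(1 for sentence in corpus if contains_structure(sentence, v1, v2))
    return hits / len(corpus)
```

A structure would then be accepted in step D4 when the returned value exceeds the threshold a.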

Step D5: if muf("V_{i,1}…V_{i,2}") > b, put "V_{i,1}…V_{i,2}" into the set SOBaseResult;

Here muf("V_{i,1}…V_{i,2}") is a mathematical measure that characterizes the diversity of pivotal collocations; its calculation sub-steps are as follows: initially, V_{*,1} and V_{*,2} are set to the empty set;

Step D51: if SOBase contains a record <"V_x…V_{i,2}", "N_{i,1}V_xN_{i,2}V_{i,2}N_{i,3}">, put V_x into the set V_{*,1};

Step D52: if SOBase contains a record <"V_{i,1}…V_y", "N_{i,1}V_{i,1}N_{i,2}V_yN_{i,3}">, put V_y into the set V_{*,2};

Step D53: calculate muf("V_{i,1}…V_{i,2}") with the following formula:

\mathrm{muf}(\text{``}V_{i,1} \ldots V_{i,2}\text{''}) = \frac{\mathrm{cof}(\text{``}V_{i,1} \ldots V_{i,2}\text{''})}{\sum_{V_x \in V_{*,1}} \mathrm{cof}(\text{``}V_x \ldots V_{i,2}\text{''}) + \sum_{V_y \in V_{*,2}} \mathrm{cof}(\text{``}V_{i,1} \ldots V_y\text{''})}
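Sub-steps D51 to D53 can be sketched as follows, assuming that the cof values have already been computed and are supplied as a dictionary keyed by verb pair. The names `muf`, `candidates` and `cof_values` are hypothetical, and returning `None` for a zero denominator is an assumption, since the formula leaves that case undefined. Whether the record under test is itself still present in SOBase when the sets are built is not stated in the claim; the sketch uses whatever candidate list it is given.

```python
from typing import Dict, List, Optional, Tuple

VerbPair = Tuple[str, str]  # (V1, V2) of one SOBase record


def muf(
    pair: VerbPair,
    candidates: List[VerbPair],
    cof_values: Dict[VerbPair, float],
) -> Optional[float]:
    """Diversity score of pair = (V_{i,1}, V_{i,2}) per step D53."""
    v1, v2 = pair
    # D51: first verbs V_x that appear together with the same second verb.
    first_verbs = {x for x, y in candidates if y == v2}
    # D52: second verbs V_y that appear together with the same first verb.
    second_verbs = {y for x, y in candidates if x == v1}
    denominator = sum(cof_values.get((x, v2), 0.0) for x in first_verbs)
    denominator += sum(cof_values.get((v1, y), 0.0) for y in second_verbs)
    if denominator == 0:
        return None  # assumption: undefined by the formula, so flagged here
    return cof_values.get(pair, 0.0) / denominator
```

In step D5 the returned value would be compared with the threshold b.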