CN106815188A - System and method for acquiring Chinese pivotal structures - Google Patents

System and method for acquiring Chinese pivotal structures

Info

Publication number
CN106815188A
Authority
CN
China
Prior art keywords
language
sentence
tcorpus
pattern
simultaneous
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510846489.9A
Other languages
Chinese (zh)
Other versions
CN106815188B (en)
Inventor
符建辉
王卫明
曹阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Original Assignee
KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Priority to CN201510846489.9A
Publication of CN106815188A
Application granted
Publication of CN106815188B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The present invention relates to a system and method for acquiring Chinese pivotal structures, comprising: segmenting the original training corpus Corpus into words to form the segmented corpus TCorpus; identifying the verbs in each sentence S_i of TCorpus; analyzing the sentences in TCorpus with pivot patterns, forming candidate pivotal structures from the sentences that satisfy a pivot pattern, and placing them in the pivotal structure library SOBase to await verification; and verifying the candidate pivotal structure library SOBase and outputting the final result SOBaseResult. By introducing pivot patterns, the invention greatly limits the complexity of pivot forms without reducing acquisition effectiveness. Given the complexity of Chinese word formation and sentence construction, and to ensure the accuracy of the pivotal structures, the invention strictly verifies the acquired structures from two angles: the diversity of pivotal structure collocations and the commonness of pivotal structure collocations.

Description

System and method for acquiring Chinese pivotal structures
Technical field
The present invention relates to Chinese natural language processing and the automatic recognition of Chinese grammatical structures, and more particularly to a system and method for automatically recognizing Chinese pivotal structures.
Background
The Chinese pivotal sentence is a special kind of linguistic phenomenon. Consider, for example, the following three sentences (words are separated by spaces and tagged with their parts of speech, which makes the pivot context in each sentence stand out):
S1: "organizing committee/n invite/v them/r attend/v meeting/n"
S2: "school/n support/v graduates/n found/v"
S3: "who/r let/v this/r [measure word]/q bottle/n fall/v ground/s/u/w"
In S1, "them" is the object of the verb "invite" and at the same time the subject of the verb "attend"; "them" is therefore the pivot of S1. In S2, "graduates" is the object of the verb "support" and at the same time the subject of the verb "found"; "graduates" is therefore the pivot of S2. Likewise, in S3, "this bottle" is the object of "let" and at the same time the subject of the verb "fall"; "this bottle" is therefore the pivot of S3.
As these three typical examples show, the pivotal sentence is a common phenomenon in Chinese. Over the past 30 or more years, well-known Chinese scholars such as Zhu Dexi, Ding Shusheng, Huang Bairong, Lv Jiping and Wu Qisheng have studied Chinese pivotal sentences systematically from grammatical or semantic perspectives, and their work has played an important role in how people understand these sentences.
Beyond its theoretical value and its role in Chinese teaching and training, research on pivotal structures has many important uses as Internet applications develop across the board.
For example, Chinese pivotal structures can serve as part of the language model used in speech recognition and are an important aid in building such language models automatically.
As another example, unknown-word recognition has long been an important problem: given a dictionary, words that do not appear in it are called unknown words. Because the vocabulary any dictionary starts with is limited, it must be supplemented continually in practice. One technical difficulty in unknown-word recognition, or dictionary supplementation, is determining the left and right boundaries of an unknown word accurately.
Yet how can pivotal structures be acquired effectively by processing and analyzing a large corpus, so as to build a pivotal structure library? And how can one verify which verbs, combined with which nouns and under what conditions, form a pivotal structure? These questions have never received sufficient attention or study.
Summary of the invention
To address the questions of how to acquire pivotal structures effectively by processing and analyzing a large corpus so as to build a pivotal structure library, and of how to verify which verbs, combined with which nouns and under what conditions, form a pivotal structure, the present invention provides a system and method for acquiring Chinese pivotal structures.
To solve the above problems, the present invention adopts the following technical solution. A system for acquiring Chinese pivotal structures, characterized in that it comprises: a module A that segments the original training corpus Corpus into words to form the segmented corpus TCorpus; a module B that identifies the verbs in each sentence S_i of TCorpus; a module C that analyzes the sentences in TCorpus with pivot patterns, forms candidate pivotal structures from the sentences that satisfy a pivot pattern, and places them in the pivotal structure library SOBase to await verification; and a module D that verifies the candidate pivotal structure library SOBase and outputs the final result SOBaseResult.
In the modules described above, module A uses the open-source ICTCLAS system to segment every input text in Corpus and splits each text at its natural sentence boundaries, forming simple sentences that contain no sentence marks. Each sentence in TCorpus therefore has the form S_i = "W_1/pos_1 W_2/pos_2 … W_i/pos_i … W_n/pos_n", where each W_i is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word or letter, and pos_i is its part of speech. The segmentation result produced by module A is passed to module B, which identifies the verbs or verb phrases in each sentence S_i of TCorpus. Module B performs verb merging on each sentence S_i: whenever "W_1/v W_2/v" occurs, it is merged into "W_1W_2/v", so that two or more consecutive verbs become a single verb. After this, the adverbs that modify verbs are eliminated, that is, all modifying adverbs in front of a verb are deleted. Once module B has completed verb identification and adverb handling, the result is passed to module C. Module C analyzes the sentences in TCorpus with the pivot patterns, forms candidate pivotal structures from the sentences that satisfy a pivot pattern, and places them in the pivotal structure library SOBase to await verification. Once module C has completed the pivot pattern analysis, the result is passed to module D to verify the correctness of the pivotal structures. Module D applies pivot collocation commonness verification and pivot collocation diversity verification to every record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> in the candidate pivotal structure library SOBase.
A method for acquiring Chinese pivotal structures, characterized in that it comprises the following steps:
Step 1: Segment the original training corpus Corpus into words to form the segmented corpus TCorpus;
The open-source ICTCLAS system is used to segment every input text D in Corpus, and each text is split at its natural sentence boundaries, forming simple sentences that contain no sentence marks; each sentence in TCorpus therefore has the form S_i = "W_1/pos_1 W_2/pos_2 … W_i/pos_i … W_n/pos_n", where each W_i is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word or letter, and pos_i is its part of speech;
The part-of-speech tags used in segmentation algorithms follow conventions that are common in the computing community; common tags are: a for adjective, b for distinguishing word, c for conjunction, d for adverb, h for prefix, j for abbreviation, k for suffix, m for numeral, n for noun, p for preposition, q for measure word, r for pronoun, u for auxiliary word, and z for descriptive word;
Step 2: Identify the verbs or verb phrases in each sentence S_i of the segmented corpus TCorpus;
Whenever "W_1/v W_2/v" occurs, it is merged into "W_1W_2/v", so that two or more consecutive verbs become a single verb; this process is called verb merging. After this, the adverbs that modify verbs are eliminated, that is, all modifying adverbs in front of a verb are deleted; the processed sentences are put back into TCorpus;
Step 3: Analyze the sentences in TCorpus with pivot patterns, form candidate pivotal structures from the sentences that satisfy a pivot pattern, and place them in the pivotal structure library SOBase to await verification;
Analyzing the sentences in TCorpus with pivot patterns means using the 5 pivot patterns to pick out the sentences in TCorpus that satisfy one of them and placing these in the pivotal structure library SOBase to await verification;
Specifically, for any sentence SO_i in TCorpus, if it contains more than 2 verbs, or only 1 verb, the sentence is discarded; otherwise, let SO_i have the form "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}", where the subscript i denotes the i-th sentence. The main task that follows is to check whether N_{i,2} satisfies one of the 5 pivot patterns; if it does, the pair <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> is put into SOBase; otherwise, SO_i is discarded.
The 5 pivot patterns: let the general form of a pivotal sentence be "N_1 V_1 N_2 V_2 N_3", where N_2 is the pivot. When acquiring pivotal structures, only pivotal sentences whose pivot N_2 satisfies one of the following patterns are considered; that is, it is assumed that when the corpus is large enough, the pivotal structures of pivotal sentences whose pivots take other forms can also be obtained from pivotal sentences whose pivots satisfy these 5 patterns:
Pattern 1: numeral + noun;
Pattern 2: numeral + measure word + noun;
Pattern 3: {this, this, specifically, this, this position, this, these, that, that, that time, that, that position, that, those, it, they} (several distinct Chinese forms share the same English rendering); the elements of this set are common pronouns, typically used to refer to inanimate objects or animals, and each element is itself a pattern;
Pattern 4: {this, this, specifically, this, this position, this, these, that, that, that time, that, that position, that, those} + noun; this is a pivot pattern composed of a pronoun and a noun;
Pattern 5: {he, they, I, we, she, they}; the elements of this set are common pronouns, typically used to refer to people, and each element is itself a pattern;
Step 4: Verify the candidate pivotal structure library SOBase and output the final result SOBaseResult;
Two verification techniques are applied to every record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> in the candidate pivotal structure library SOBase: pivot collocation commonness verification and pivot collocation diversity verification. Both are necessary conditions for a pivotal structure to be correct;
Pivot collocation commonness verification means that if SO_i = "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" is a correct pivotal sentence, then the pivotal structure "V_{i,1}…V_{i,2}" also occurs in other sentences in TCorpus, rather than only in the pivotal sentence SO_i;
Pivot collocation diversity verification means that if SO_i = "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" is a correct pivotal sentence, then pivotal sentences of the form SO′_i = "N′_{i,1} V_{i,1} N′_{i,2} V_{i,2} N′_{i,3}" and SO″_i = "N″_{i,1} V_{i,1} N″_{i,2} V_{i,2} N″_{i,3}" should also occur repeatedly in TCorpus;
Step 4 is carried out as follows:
First, two non-negative thresholds a and b are introduced, where a ∈ (0,1] and b ∈ (0,1].
Step D1: Set SOBaseResult to empty; it stores the verified, correct pivotal structures;
Step D2: If SOBase is empty, go to step D6;
Step D3: Take any record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> out of SOBase;
Step D4: If cof("V_{i,1}…V_{i,2}") > a, put "V_{i,1}…V_{i,2}" into the set SOBaseResult and go to step D2;
cof("V_{i,1}…V_{i,2}") reflects the commonness of the pivotal structure "V_{i,1}…V_{i,2}" and is computed as: cof("V_{i,1}…V_{i,2}") = (number of sentences in TCorpus that contain the structure "V_{i,1}…V_{i,2}") / (total number of sentences in TCorpus); when cof("V_{i,1}…V_{i,2}") > a, "V_{i,1}…V_{i,2}" is regarded as a correct pivotal structure;
Step D5: If muf("V_{i,1}…V_{i,2}") > b, put "V_{i,1}…V_{i,2}" into the set SOBaseResult;
muf("V_{i,1}…V_{i,2}") is a mathematical measure of pivot collocation diversity, computed in the following sub-steps. Initially, V_{*,1} and V_{*,2} are set to the empty set;
Step D51: If SOBase contains a record <"V_x…V_{i,2}", "N_{i,1} V_x N_{i,2} V_{i,2} N_{i,3}">, put V_x into the set V_{*,1};
Step D52: If SOBase contains a record <"V_{i,1}…V_y", "N_{i,1} V_{i,1} N_{i,2} V_y N_{i,3}">, put V_y into the set V_{*,2};
Step D53: Compute muf("V_{i,1}…V_{i,2}"); the formula is as follows:
Step D6: Output the final pivotal structure result SOBaseResult.
Beneficial effects:
Drawing on linguistics and computer technology, the present invention proposes a system and method for acquiring Chinese pivotal structures. To deal with the complexity and diversity of pivot forms, the invention introduces pivot patterns, which greatly limit the complexity of pivot forms without reducing acquisition effectiveness. Given the complexity of Chinese word formation and sentence construction, and to ensure the accuracy of the pivotal structures, the invention strictly verifies the acquired structures from two angles: the diversity of pivotal structure collocations and the commonness of pivotal structure collocations.
In a verification test on a 1 TB corpus, the system of the invention acquired 139,600 pivotal structure pairs, and analysis showed an accuracy of 98.2%. The invention therefore achieves good recognition performance and is suitable for practical application.
Brief description of the drawings
Fig. 1 is a workflow diagram of the system for acquiring Chinese pivotal structures.
Detailed description
To explain the invention more clearly, the following terms are defined and explained:
(1) Chinese parts of speech: part of speech is an attribute of a Chinese word. Common parts of speech are: nouns (e.g., child, graduate; tag n), verbs (e.g., invite, join; tag v), adverbs (e.g., very, often; tag d), adjectives (e.g., beautiful, fine, simple; tag a), pronouns (e.g., these, this, it, they; tag r), numerals (e.g., 12, twelve; tag m), and measure words (e.g., the classifiers rendered literally as "root", "only", "bar"; tag q).
(2) The 5 pivot patterns: to limit the complexity of acquiring pivotal structures while still acquiring as many of them as possible, the invention introduces 5 pivot patterns. For ease of exposition, the general form of a pivotal sentence is assumed below to be "N_1 V_1 N_2 V_2 N_3", where N_2 is the pivot. When acquiring pivotal structures, the invention considers only pivotal sentences whose pivot N_2 satisfies one of the following patterns (that is, we assume that when the corpus is large enough, the pivotal structures of pivotal sentences whose pivots take other forms can also be obtained from pivotal sentences whose pivots satisfy these 5 patterns):
● Pattern 1: numeral + noun. Two specific examples are "three/m people/n" and "3/m projects/n".
● Pattern 2: numeral + measure word + noun. Two specific examples are "three/m [measure word]/q people/n" and "3/m [measure word]/q plant/n".
● Pattern 3: {this, this, specifically, this, this position, this, these, that, that, that time, that, that position, that, those, it, they} (several distinct Chinese forms share the same English rendering). The elements of this set are common pronouns, typically used to refer to inanimate objects or animals, and each element is itself a pattern.
● Pattern 4: {this, this, specifically, this, this position, this, these, that, that, that time, that, that position, that, those} + noun. This is a pivot pattern composed of a pronoun and a noun. Two specific examples are "this/r match/n" and "these/r raw materials/n".
● Pattern 5: {he, they, I, we, she, they}. The elements of this set are common pronouns, typically used to refer to people, and each element is itself a pattern.
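The five patterns can be read as checks on the tag sequence of the pivot N_2, plus a word-set lookup for the pronoun patterns. The following is a minimal sketch under stated assumptions: the function name `match_pivot_pattern` and the three word sets are hypothetical, and the sets hold a few English stand-ins rather than the Chinese pronouns the patterns actually list.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Assumption: illustrative English stand-ins only. The real sets are the
# Chinese pronouns enumerated under patterns 3, 4 and 5.
DEMONSTRATIVES = {"this", "these", "that", "those"}
OBJECT_PRONOUNS = DEMONSTRATIVES | {"it"}
PERSON_PRONOUNS = {"he", "she", "I", "we", "they"}


def match_pivot_pattern(n2: List[Token]) -> Optional[int]:
    """Return the number (1-5) of the first pattern N2 satisfies, else None."""
    words = [w for w, _ in n2]
    tags = [p for _, p in n2]
    if tags == ["m", "n"]:                               # pattern 1
        return 1
    if tags == ["m", "q", "n"]:                          # pattern 2
        return 2
    if tags == ["r"] and words[0] in OBJECT_PRONOUNS:    # pattern 3
        return 3
    if tags == ["r", "n"] and words[0] in DEMONSTRATIVES:  # pattern 4
        return 4
    if tags == ["r"] and words[0] in PERSON_PRONOUNS:    # pattern 5
        return 5
    return None


example = match_pivot_pattern([("those", "r"), ("student", "n")])
```

Matching on tags first keeps patterns 1 and 2 independent of any word list, so only the pronoun patterns need the sets to be filled in with the actual Chinese forms.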
(3) ICTCLAS system: a free, open-source word segmentation system. It takes a text as input and outputs the segmented word sequence of that text. ICTCLAS can be downloaded from http://ictclas.nlpir.org. After segmentation, every token is tagged with its part of speech, where a denotes adjective, b distinguishing word, c conjunction, d adverb, h prefix, j abbreviation, k suffix, m numeral, n noun, p preposition, q measure word, r pronoun, u auxiliary word, z descriptive word, and so on.
The invention is described in more detail below with reference to the drawing and specific embodiments. In what follows, any piece of input, whether a short sentence or a long article, is called a text.
The system and method for acquiring Chinese pivotal structures are divided into four main modules:
Module A: Segment the original training corpus Corpus into words to form the segmented corpus TCorpus.
Module B: Identify the verbs in each sentence S_i of the segmented corpus TCorpus.
Module C: Analyze the sentences in TCorpus with pivot patterns, form candidate pivotal structures from the sentences that satisfy a pivot pattern, and place them in the pivotal structure library SOBase to await verification.
Module D: Verify the candidate pivotal structure library SOBase and output the final result SOBaseResult.
The workflow, or method, of each module is explained in detail below.
Module A: Segment the original training corpus Corpus into words to form the segmented corpus TCorpus.
We use the open-source ICTCLAS system to segment every input text D in Corpus and split each text at its natural sentence boundaries, forming simple sentences that contain no sentence marks. Each sentence in TCorpus therefore has the form S_i = "W_1/pos_1 W_2/pos_2 … W_i/pos_i … W_n/pos_n", where each W_i is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word or letter, and pos_i is its part of speech.
The part-of-speech tags used in segmentation algorithms follow conventions that are common in the computing community. Common tags are: a for adjective, b for distinguishing word, c for conjunction, d for adverb, h for prefix, j for abbreviation, k for suffix, m for numeral, n for noun, p for preposition, q for measure word, r for pronoun, u for auxiliary word, z for descriptive word, and so on. For example, the segmentation result of the sentence "the teacher earnestly speaks for those students to listen" is: "teacher/n in earnest/d says/v to/v those/r student/n listens/v".
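The "W/pos" sentence format described above is simple to load into a program. The sketch below is a minimal illustration, not ICTCLAS output handling as such: the helper name `parse_tagged` is hypothetical, the tokens are English glosses standing in for Chinese words, and it assumes tokens are separated by single spaces with the tag after the last slash.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def parse_tagged(sentence: str) -> List[Token]:
    """Split 'W1/pos1 W2/pos2 ...' into (word, tag) pairs."""
    tokens: List[Token] = []
    for item in sentence.split():
        # rpartition splits at the LAST '/', so a '/' inside the word
        # itself stays with the word.
        word, sep, pos = item.rpartition("/")
        if not sep:
            # No tag present: keep the item as the word, with an empty tag.
            word, pos = item, ""
        tokens.append((word, pos))
    return tokens


# English glosses used in place of the Chinese tokens (assumption).
tagged = "teacher/n earnestly/d say/v to/v those/r student/n listen/v"
tokens = parse_tagged(tagged)
```

Each sentence of TCorpus then becomes a list of pairs, which is the shape the later verb and pivot checks need.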
Module B: Identify the verbs or verb phrases in each sentence S_i of the segmented corpus TCorpus.
For each sentence S_i in TCorpus, whenever "W_1/v W_2/v" occurs, it is merged into "W_1W_2/v", so that two or more consecutive verbs become a single verb; this process is called verb merging. For example, the sentence "teacher/n says/v to/v those/r student/n listens/v" yields the verb phrase "say to" after this processing, giving the pivotal sentence "teacher/n say to/v those/r student/n listens/v". The purpose is to acquire as many pivotal structures as possible.
After this, the adverbs that modify verbs are eliminated, that is, all modifying adverbs in front of a verb are deleted. For example, the sentence "teacher/n in earnest/d says/v to/v those/r student/n listens/v" becomes, after verb merging, the pivotal sentence "teacher/n in earnest/d say to/v those/r student/n listens/v". After adverb deletion, it becomes the pivotal sentence "teacher/n say to/v those/r student/n listens/v".
The processed sentences are put back into TCorpus.
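The two operations of module B can be sketched as list transformations applied in the order the text gives, merging first and deleting adverbs second. This is a minimal sketch under assumptions: the function names are hypothetical, the tokens are English glosses, and merged verbs are joined with a space only because the glosses are English (Chinese words would be concatenated directly).

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def merge_verbs(tokens: List[Token]) -> List[Token]:
    """Merge each run of adjacent verbs (tag 'v') into a single verb token."""
    merged: List[Token] = []
    for word, pos in tokens:
        if pos == "v" and merged and merged[-1][1] == "v":
            prev_word, _ = merged[-1]
            merged[-1] = (prev_word + " " + word, "v")
        else:
            merged.append((word, pos))
    return merged


def drop_adverbs_before_verbs(tokens: List[Token]) -> List[Token]:
    """Delete adverbs (tag 'd') that sit directly in front of a verb."""
    kept: List[Token] = []
    for i, (word, pos) in enumerate(tokens):
        if pos == "d":
            # Skip over a run of adverbs and see whether a verb follows it.
            j = i
            while j < len(tokens) and tokens[j][1] == "d":
                j += 1
            if j < len(tokens) and tokens[j][1] == "v":
                continue
        kept.append((word, pos))
    return kept


sentence = [("teacher", "n"), ("earnestly", "d"), ("say", "v"), ("to", "v"),
            ("those", "r"), ("student", "n"), ("listen", "v")]
processed = drop_adverbs_before_verbs(merge_verbs(sentence))
```

Because merging runs first, two verbs separated by an adverb are not merged in this sketch; whether they should be is not stated in the text.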
Module C: Analyze the sentences in TCorpus with pivot patterns, form candidate pivotal structures from the sentences that satisfy a pivot pattern, and place them in the pivotal structure library SOBase to await verification.
Analyzing the sentences in TCorpus with pivot patterns means using the 5 pivot patterns designed above to pick out the sentences in TCorpus that satisfy one of them and placing these in the pivotal structure library SOBase to await verification.
Specifically, for any sentence SO_i in TCorpus, if it contains more than 2 verbs, or only 1 verb, the sentence is discarded; otherwise, let SO_i have the form "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" (where the subscript i denotes the i-th sentence). The main task that follows is to check whether N_{i,2} satisfies one of the 5 pivot patterns. If it does, the pair <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> is put into SOBase; otherwise, SO_i is discarded.
For example, in the pivotal sentence "teacher/n say to/v those/r student/n listens/v", "those/r student/n" satisfies pivot pattern 4. The invention therefore takes "say to…listen", obtained from this sentence, to be a pivotal structure and puts <"say to…listen", "teacher/n say to/v those/r student/n listens/v"> into the candidate pivotal structure library SOBase for further verification by module D.
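The selection rule of module C can be sketched as follows. The function `extract_candidate` and the predicate `demo_is_pivot` are hypothetical names, and the predicate covers only pattern 4 to keep the example short. The sketch assumes the sentence has exactly two verbs, requires some material before the first verb (N_1), and treats N_3 as optional because the worked example above has nothing after the second verb.

```python
from typing import Callable, List, Optional, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def extract_candidate(
    tokens: List[Token],
    is_pivot: Callable[[List[Token]], bool],
) -> Optional[Tuple[str, str]]:
    """Return (structure, sentence) if the sentence fits N1 V1 N2 V2 [N3]."""
    verb_idx = [i for i, (_, pos) in enumerate(tokens) if pos == "v"]
    if len(verb_idx) != 2:          # discard unless exactly two verbs
        return None
    i1, i2 = verb_idx
    n2 = tokens[i1 + 1:i2]          # candidate pivot between the two verbs
    if i1 == 0 or not n2 or not is_pivot(n2):
        return None
    structure = f"{tokens[i1][0]}…{tokens[i2][0]}"
    sentence = " ".join(f"{w}/{p}" for w, p in tokens)
    return structure, sentence


def demo_is_pivot(n2: List[Token]) -> bool:
    # Hypothetical stand-in: accepts only pattern 4 (pronoun + noun).
    return [p for _, p in n2] == ["r", "n"]


tokens = [("teacher", "n"), ("say to", "v"), ("those", "r"),
          ("student", "n"), ("listen", "v")]
candidate = extract_candidate(tokens, demo_is_pivot)
```

Passing the pattern check in as a callable keeps the sentence-shape test separate from the five patterns, so either part can be changed independently.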
Module D: Verify the candidate pivotal structure library SOBase and output the final result SOBaseResult.
For every record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> in the candidate pivotal structure library SOBase, the invention proposes two verification techniques: pivot collocation commonness verification and pivot collocation diversity verification. Both are necessary conditions for a pivotal structure to be correct.
Pivot collocation commonness verification means that if SO_i = "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" is a correct pivotal sentence, then the pivotal structure "V_{i,1}…V_{i,2}" also occurs in other sentences in TCorpus, rather than only in the pivotal sentence SO_i.
For example, let SO_i = "the organizing committee invites them to attend the meeting". Then SO_{i,1} = "the host invites them to take part in the interaction" and SO_{i,2} = "the owner invites them to have lunch together" can also occur among the other sentences in TCorpus; that is, SO_{i,1} and SO_{i,2} do not depend solely on the one particular pivotal sentence SO_i.
Pivot collocation diversity verification means that if SO_i = "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" is a correct pivotal sentence, then pivotal sentences of the form SO′_i = "N′_{i,1} V_{i,1} N′_{i,2} V_{i,2} N′_{i,3}", SO″_i = "N″_{i,1} V_{i,1} N″_{i,2} V_{i,2} N″_{i,3}", and so on should also occur repeatedly in TCorpus.
For example, let SO_i = "the organizing committee invites them to attend the meeting"; then SO′_i = "a friend invites us to attend a birthday party", SO″_i = "a friend invites her to attend a wedding", and SO‴_i = "the professor invites these foreign students to attend an exchange meeting". That is, in SO_i the pivot "them" can be replaced by words of many different forms, and the pivotal structure "invite…attend" in SO_i remains reasonable and correct.
Based on the above analysis and explanation of pivot collocation commonness and pivot collocation diversity, module D is implemented as follows. First, two non-negative thresholds a and b are introduced, where a ∈ (0,1] and b ∈ (0,1].
Step D1: Set SOBaseResult to empty; it stores the verified, correct pivotal structures.
Step D2: If SOBase is empty, go to step D6.
Step D3: Take any record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> out of SOBase.
Step D4: If cof("V_{i,1}…V_{i,2}") > a, put "V_{i,1}…V_{i,2}" into the set SOBaseResult and go to step D2.
cof("V_{i,1}…V_{i,2}") reflects the commonness of the pivotal structure "V_{i,1}…V_{i,2}" and is computed as: cof("V_{i,1}…V_{i,2}") = (number of sentences in TCorpus that contain the structure "V_{i,1}…V_{i,2}") / (total number of sentences in TCorpus). When cof("V_{i,1}…V_{i,2}") > a, "V_{i,1}…V_{i,2}" is regarded as a correct pivotal structure.
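The commonness score is a plain ratio, which the sketch below computes over a toy corpus. The names `contains_structure` and `cof` as Python functions are illustrative, and the reading of "contains the structure" as "V1 occurs as a verb and V2 occurs as a later verb in the same sentence" is an assumption, since the text does not define containment precisely.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def contains_structure(tokens: List[Token], v1: str, v2: str) -> bool:
    """Assumed reading: v1 appears as a verb and v2 as a later verb."""
    verbs = [w for w, pos in tokens if pos == "v"]
    return v1 in verbs and v2 in verbs[verbs.index(v1) + 1:]


def cof(v1: str, v2: str, tcorpus: List[List[Token]]) -> float:
    """Sentences containing 'v1…v2' divided by all sentences in TCorpus."""
    if not tcorpus:
        return 0.0
    hits = sum(1 for s in tcorpus if contains_structure(s, v1, v2))
    return hits / len(tcorpus)


# Toy corpus of English glosses (assumption), four sentences in total.
tcorpus = [
    [("committee", "n"), ("invite", "v"), ("them", "r"), ("attend", "v"), ("meeting", "n")],
    [("friend", "n"), ("invite", "v"), ("us", "r"), ("attend", "v"), ("party", "n")],
    [("school", "n"), ("support", "v"), ("graduates", "n"), ("found", "v")],
    [("teacher", "n"), ("speak", "v")],
]
score = cof("invite", "attend", tcorpus)
```

On a real corpus the ratio is very small for any single structure, which is consistent with thresholds far below 1.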
Step D5: If muf("V_{i,1}…V_{i,2}") > b, put "V_{i,1}…V_{i,2}" into the set SOBaseResult.
muf("V_{i,1}…V_{i,2}") is a mathematical measure of pivot collocation diversity, computed in the following sub-steps. Initially, V_{*,1} and V_{*,2} are set to the empty set.
Step D51: If SOBase contains a record <"V_x…V_{i,2}", "N_{i,1} V_x N_{i,2} V_{i,2} N_{i,3}">, put V_x into the set V_{*,1}.
Step D52: If SOBase contains a record <"V_{i,1}…V_y", "N_{i,1} V_{i,1} N_{i,2} V_y N_{i,3}">, put V_y into the set V_{*,2}.
Step D53: Compute muf("V_{i,1}…V_{i,2}"); the formula is as follows:
Step D6: Output the final pivotal structure result SOBaseResult.
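Steps D1 to D6 amount to a loop that drains SOBase and accepts a structure when either score exceeds its threshold. The sketch below follows the steps as written, with several assumptions: `verify` and `collect_alternatives` are hypothetical names, `cof` and `muf` are supplied by the caller, and `muf` in the example is a stub that returns 0.0 because the formula of step D53 is not reproduced in the text. The sketch also assumes D51 and D52 look at the full library rather than only the records not yet taken out, which the steps leave open.

```python
from typing import Callable, List, Set, Tuple

Structure = Tuple[str, str]      # (V1, V2)
Record = Tuple[Structure, str]   # (structure, source sentence)


def collect_alternatives(v1: str, v2: str, sobase: List[Record]):
    """Steps D51 and D52: verbs seen in each slot next to the other verb."""
    first = {x for (x, y), _ in sobase if y == v2}    # V*,1
    second = {y for (x, y), _ in sobase if x == v1}   # V*,2
    return first, second


def verify(
    sobase: List[Record],
    cof: Callable[[str, str], float],
    muf: Callable[[str, str, List[Record]], float],
    a: float,
    b: float,
) -> Set[Structure]:
    result: Set[Structure] = set()     # D1
    pending = list(sobase)
    while pending:                     # D2: stop when nothing is left
        (v1, v2), _ = pending.pop()    # D3: take a record out
        if cof(v1, v2) > a:            # D4: commonness test
            result.add((v1, v2))
            continue
        if muf(v1, v2, sobase) > b:    # D5: diversity test
            result.add((v1, v2))
    return result                      # D6


sobase = [(("invite", "attend"), "s1"), (("ask", "attend"), "s2"),
          (("invite", "join"), "s3"), (("eat", "run"), "s4")]
cof_table = {("invite", "attend"): 0.01}   # made-up scores for the demo
accepted = verify(
    sobase,
    cof=lambda v1, v2: cof_table.get((v1, v2), 0.0),
    muf=lambda v1, v2, records: 0.0,       # stub: D53 formula not given
    a=0.0006,
    b=0.0015,
)
first, second = collect_alternatives("invite", "attend", sobase)
```

As written, steps D4 and D5 accept a structure when either test passes, although the text elsewhere calls both checks necessary conditions; a stricter variant would require both scores to exceed their thresholds.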
Experimental results
Repeated preliminary experiments showed that good pivotal structure results are obtained when the pivot collocation commonness threshold is set to a = 0.0006 and the pivot collocation diversity threshold is set to b = 0.0015. In a verification test on a 1 TB corpus, the system of the invention acquired 139,600 pivotal structure pairs, and analysis showed an accuracy of 98.2%. The invention therefore achieves good recognition performance and is suitable for practical application.

Claims (3)

1. a kind of Chinese and language structure obtain system, it is characterised in that:Including carrying out participle to original training corpus Corpus, Form the modules A of participle corpus TCorpus;Every sentence S in identification participle corpus TCorpusiThe mould of middle verb Block B;The sentence in TCorpus is analyzed using simultaneous language pattern, the sentence to meeting simultaneous language pattern forms candidate and language Structure, and the module C inserted in simultaneous language structural library SOBase to be verified;Checking candidate and language structural library SOBase, And export the module D of final result SOBaseResult;
In module described above, modules A is entered using an ICTCLAS system increased income to every input text in RCorpus Row participle, and every text is decoupled according to the natural of sentence, formation does not contain the simple of sentence marks Sentence;Therefore, the form of each sentence of TCorpus is Si=" W1/posl W2/pos2 ... Wi/posi ... Wn/posn ", Wherein each Wi is Chinese word, Chinese character, punctuation mark, Arabic numerals, English word or a letter, and posi is it Corresponding part of speech;Result after modules A generation participle will be transmitted to module B, in module B identification participle corpus TCorpus Verb or verb phrases in every sentence Si;Module B carries out verb merging treatment to every sentence Si in TCorpus, There is " W1/v W2During/v ", then according to " W1W2/ v " merges treatment, will two or more verbs, merging It is a verb, this process is called verb merging treatment;After the treatment, Processing for removing is carried out to the adverbial word for modifying verb, All modification adverbial words that will be before verb are all deleted;After module B completes verb identification, adverbial word treatment, result is transmitted to module C;Module C applications and language pattern are analyzed to the sentence in TCorpus, and the sentence to meeting simultaneous language pattern forms candidate and language Structure, and insert in simultaneous language structural library SOBase to be verified;After module C completes simultaneous language pattern analysis, result is transmitted to mould Block D so as to verify and language structure correctness;Module D is to every record in candidate and language structural library SOBase<“VI, 1…VI, 2", “NI, 1VI, 1NI, 2VI, 2NI, 3”>Carry out and the common property checking of language collocation, simultaneous language collocation diversity checking.
2. A method for acquiring Chinese pivotal (兼语) structures, characterized in that it comprises the following steps:
Step 1: Segment the original training corpus Corpus into words to form the segmented corpus TCorpus;
Every input text D in Corpus is segmented with the open-source ICTCLAS system and split at its natural sentence boundaries, forming simple sentences that contain no sentence-delimiting marks; accordingly, each sentence of TCorpus has the form S_i = "W_1/pos_1 W_2/pos_2 … W_i/pos_i … W_n/pos_n", where each W_i is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word, or letter, and pos_i is its part of speech;
The part-of-speech tags used by the segmentation algorithm are those in common use in computing; the common tags are: a for adjective, b for distinguishing word, c for conjunction, d for adverb, h for prefix, j for abbreviation, k for suffix, m for numeral, n for noun, p for preposition, q for measure word, r for pronoun, u for auxiliary word, and z for descriptive word;
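As an illustration of the sentence format described in step 1, the following minimal sketch parses one line of the form "W_1/pos_1 W_2/pos_2 …" into (word, tag) pairs. It assumes tokens are separated by whitespace; the function name `parse_tagged_sentence`, the `POS_NAMES` table, and the sample sentence are hypothetical and not taken from the patent or from ICTCLAS itself.

```python
# Tag letters listed in the text, plus "v" (verb), which the later steps use.
POS_NAMES = {
    "a": "adjective", "b": "distinguishing word", "c": "conjunction",
    "d": "adverb", "h": "prefix", "j": "abbreviation", "k": "suffix",
    "m": "numeral", "n": "noun", "p": "preposition", "q": "measure word",
    "r": "pronoun", "u": "auxiliary word", "z": "descriptive word",
    "v": "verb",
}


def parse_tagged_sentence(line):
    """Split 'W1/pos1 W2/pos2 ...' into a list of (word, pos) pairs."""
    tokens = []
    for item in line.split():
        # rpartition splits at the LAST '/', so a word that itself
        # contains '/' (for example '1/2/m') keeps its full text.
        word, sep, pos = item.rpartition("/")
        if not sep:  # no tag present: keep the word with an empty tag
            word, pos = item, ""
        tokens.append((word, pos))
    return tokens


# Hypothetical segmented sentence, used only for illustration.
tokens = parse_tagged_sentence("老师/n 让/v 他/r 回答/v 问题/n")
print([(w, POS_NAMES.get(p, p)) for w, p in tokens])
```

Keeping each sentence as a list of (word, tag) pairs makes the later verb and pivot checks simple list operations.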
Step 2: Identify the verbs or verb phrases in every sentence S_i of the segmented corpus TCorpus;
When "W_1/v W_2/v" occurs, it is merged into "W_1W_2/v"; that is, two or more consecutive verbs are merged into a single verb, a process called verb merging; after merging, adverbs that modify verbs are eliminated, that is, all modifying adverbs in front of a verb are deleted; the processed sentence is put back into TCorpus;
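Step 2 can be sketched as two small passes over a tagged sentence, applied in the order the text gives: merge first, then delete adverbs. This is a sketch under assumptions: sentences are lists of (word, tag) pairs, verbs carry exactly the tag "v", adverbs carry "d", and "modifying adverbs in front of a verb" is read as the run of adverbs immediately preceding a verb. The function names are hypothetical.

```python
def merge_verbs(tokens):
    """Merge each run of consecutive verbs ('v') into one verb token."""
    merged = []
    for word, pos in tokens:
        if pos == "v" and merged and merged[-1][1] == "v":
            # Extend the previous verb: "W1/v W2/v" becomes "W1W2/v".
            merged[-1] = (merged[-1][0] + word, "v")
        else:
            merged.append((word, pos))
    return merged


def drop_preverbal_adverbs(tokens):
    """Delete every run of adverbs ('d') that directly precedes a verb."""
    kept = []
    for i, (word, pos) in enumerate(tokens):
        if pos == "d":
            # Skip to the end of this adverb run and look at what follows.
            j = i
            while j < len(tokens) and tokens[j][1] == "d":
                j += 1
            if j < len(tokens) and tokens[j][1] == "v":
                continue  # adverb modifies the following verb: drop it
        kept.append((word, pos))
    return kept


# Hypothetical example sentence.
sentence = [("他", "r"), ("也", "d"), ("不", "d"),
            ("想", "v"), ("去", "v"), ("学校", "n")]
print(drop_preverbal_adverbs(merge_verbs(sentence)))
```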
Step 3: Analyze the sentences in TCorpus with pivot patterns, form a candidate pivotal structure for each sentence that matches a pivot pattern, and insert it into the to-be-verified pivotal structure base SOBase;
Applying the pivot patterns to the sentences in TCorpus means using 5 pivot patterns to pick out the sentences in TCorpus that match one of them and inserting those sentences into the to-be-verified pivotal structure base SOBase;
Specifically, for any sentence SO_i in TCorpus: if it contains more than 2 verbs, or only 1 verb, the sentence is discarded; otherwise, if SO_i has the form "N_{i,1}V_{i,1}N_{i,2}V_{i,2}N_{i,3}", where the subscript i denotes the i-th sentence, the main task is to check whether N_{i,2} matches one of the 5 pivot patterns; if it does, the pair <"V_{i,1}…V_{i,2}", "N_{i,1}V_{i,1}N_{i,2}V_{i,2}N_{i,3}"> is put into SOBase; otherwise, SO_i is discarded;
The 5 pivot patterns: let the general form of a pivotal sentence be "N_1V_1N_2V_2N_3", where N_2 is the pivot; when acquiring pivotal structures, only pivotal sentences whose pivot N_2 matches one of the following patterns are considered, because when the corpus is sufficiently large, the pivotal structures of pivotal sentences whose pivot takes any other form can also be obtained from pivotal sentences whose pivot matches one of the following 5 patterns:
Pattern 1: numeral + noun;
Pattern 2: numeral + measure word + noun;
Pattern 3: any element of the set {this, this, this time, this, this (classifier for persons), this, these, that, that, that time, that, that (classifier for persons), that, those, it, they}, where several entries are distinct Chinese demonstrative forms that share the same English gloss; the elements of this set are common pronouns, generally used to refer to inanimate objects or animals, and each element is itself a pattern;
Pattern 4: any element of the set {this, this, this time, this, this (classifier for persons), this, these, that, that, that time, that, that (classifier for persons), that, those} + noun; this is a pivot pattern composed of a pronoun and a noun;
Pattern 5: any element of the set {he, they (masculine), I, we, she, they (feminine)}; the elements of this set are common pronouns, generally used to refer to persons, and each element is itself a pattern;
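The candidate selection of step 3 and the 5 pivot patterns can be sketched as follows. Everything here is a minimal sketch under stated assumptions: sentences are lists of (word, tag) pairs, verbs are tagged exactly "v", and the Chinese pronoun sets are illustrative stand-ins guessed from the translated lists (the original Chinese word lists are not reproduced in this text, and the demonstrative set below is deliberately partial). The names `pivot_pattern` and `extract_candidate` are hypothetical.

```python
# ASSUMPTION: illustrative stand-ins for the pronoun sets of patterns 3-5.
DEMONSTRATIVES = {"这", "这个", "这次", "这位", "这些",
                  "那", "那个", "那次", "那位", "那些"}
THING_PRONOUNS = DEMONSTRATIVES | {"它", "它们"}
PERSON_PRONOUNS = {"他", "他们", "我", "我们", "她", "她们"}


def pivot_pattern(n2):
    """Return the number (1-5) of the pattern the pivot N2 matches, else None."""
    words = [w for w, _ in n2]
    tags = [p for _, p in n2]
    if tags == ["m", "n"]:
        return 1                      # numeral + noun
    if tags == ["m", "q", "n"]:
        return 2                      # numeral + measure word + noun
    if len(n2) == 1 and words[0] in THING_PRONOUNS:
        return 3                      # pronoun for things or animals
    if len(n2) == 2 and words[0] in DEMONSTRATIVES and tags[1] == "n":
        return 4                      # demonstrative + noun
    if len(n2) == 1 and words[0] in PERSON_PRONOUNS:
        return 5                      # pronoun for persons
    return None


def extract_candidate(tokens):
    """Return the pair ("V1…V2", sentence) for a matching sentence, else None."""
    verb_idx = [i for i, (_, p) in enumerate(tokens) if p == "v"]
    if len(verb_idx) != 2:            # more than 2 verbs, or fewer: discard
        return None
    i1, i2 = verb_idx
    if pivot_pattern(tokens[i1 + 1:i2]) is None:
        return None
    structure = f"{tokens[i1][0]}…{tokens[i2][0]}"
    return (structure, "".join(w for w, _ in tokens))


# Hypothetical example sentence.
print(extract_candidate([("老师", "n"), ("让", "v"), ("他", "r"),
                         ("回答", "v"), ("问题", "n")]))
```

The sketch does not check that N_1 and N_3 are non-empty; whether the method requires that is not stated in the text.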
Step 4: Verify the candidate pivotal structure base SOBase and output the final result SOBaseResult;
For every record <"V_{i,1}…V_{i,2}", "N_{i,1}V_{i,1}N_{i,2}V_{i,2}N_{i,3}"> in the candidate pivotal structure base SOBase, two verification techniques are used: pivot collocation commonality verification and pivot collocation diversity verification; both are necessary conditions for a pivotal structure to be correct;
Pivot collocation commonality verification means that if SO_i = "N_{i,1}V_{i,1}N_{i,2}V_{i,2}N_{i,3}" is a correct pivotal sentence, then the pivotal structure "V_{i,1}…V_{i,2}" also occurs in other sentences of TCorpus, rather than only in the pivotal sentence SO_i;
Pivot collocation diversity verification means that if SO_i = "N_{i,1}V_{i,1}N_{i,2}V_{i,2}N_{i,3}" is a correct pivotal sentence, then pivotal sentences of the form SO'_i = "N'_{i,1}V_{i,1}N'_{i,2}V_{i,2}N'_{i,3}" and SO''_i = "N''_{i,1}V_{i,1}N''_{i,2}V_{i,2}N''_{i,3}" should also occur repeatedly in TCorpus.
3. The method for acquiring Chinese pivotal structures according to claim 2, characterized in that the fourth step is implemented as follows:
First, two non-negative thresholds a and b are introduced, where a ∈ (0,1] and b ∈ (0,1];
Step D1: Set SOBaseResult to empty; it is used to store the verified, correct pivotal structures;
Step D2: If SOBase is empty, go to step D6;
Step D3: Take any record <"V_{i,1}…V_{i,2}", "N_{i,1}V_{i,1}N_{i,2}V_{i,2}N_{i,3}"> in SOBase and remove it from SOBase;
Step D4: If cof("V_{i,1}…V_{i,2}") > a, put "V_{i,1}…V_{i,2}" into the set SOBaseResult and go to step D2;
cof("V_{i,1}…V_{i,2}") reflects the commonality of the pivotal structure "V_{i,1}…V_{i,2}" and is calculated as follows: cof("V_{i,1}…V_{i,2}") = (number of sentences in TCorpus that contain the structure "V_{i,1}…V_{i,2}") / (number of sentences in TCorpus); when cof("V_{i,1}…V_{i,2}") > a, "V_{i,1}…V_{i,2}" is regarded as a correct pivotal structure;
Step D5: If muf("V_{i,1}…V_{i,2}") > b, put "V_{i,1}…V_{i,2}" into the set SOBaseResult;
muf("V_{i,1}…V_{i,2}") is a mathematical measure that characterizes pivot collocation diversity; it is calculated by the following sub-steps, starting with V_{*,1} and V_{*,2} set to the empty set;
Step D51: If SOBase contains a record <"V_x…V_{i,2}", "N_{i,1}V_xN_{i,2}V_{i,2}N_{i,3}">, put V_x into the set V_{*,1};
Step D52: If SOBase contains a record <"V_{i,1}…V_y", "N_{i,1}V_{i,1}N_{i,2}V_yN_{i,3}">, put V_y into the set V_{*,2};
Step D53: Calculate muf("V_{i,1}…V_{i,2}") with the following formula:
$$\mathrm{muf}(\text{``}V_{i,1}\ldots V_{i,2}\text{''}) = \frac{\mathrm{cof}(\text{``}V_{i,1}\ldots V_{i,2}\text{''})}{\sum_{V_x \in V_{*,1}} \mathrm{cof}(\text{``}V_x\ldots V_{i,2}\text{''}) + \sum_{V_y \in V_{*,2}} \mathrm{cof}(\text{``}V_{i,1}\ldots V_y\text{''})}$$
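The two measures used in steps D4 and D5 can be sketched as below. This is a sketch under assumptions, not the patent's implementation: a sentence is taken to "contain the structure V_1…V_2" when V_1 occurs as a verb and V_2 occurs as a later verb in the same sentence; muf is read as the ratio of cof for the pair to the sum of the cof values over V_{*,1} and V_{*,2}; and SOBase is reduced to a collection of (V_1, V_2) verb pairs. Whether the record under test itself counts toward V_{*,1} and V_{*,2} is not specified, so the sketch uses whatever pairs the caller passes in. All function names and sample sentences are hypothetical.

```python
def contains_structure(tokens, v1, v2):
    """True if verb v1 is followed later in the sentence by verb v2."""
    verbs = [w for w, p in tokens if p == "v"]
    return any(v == v1 and v2 in verbs[k + 1:] for k, v in enumerate(verbs))


def cof(tcorpus, v1, v2):
    """Share of sentences in tcorpus that contain the structure v1…v2."""
    if not tcorpus:
        return 0.0
    return sum(contains_structure(s, v1, v2) for s in tcorpus) / len(tcorpus)


def muf(tcorpus, sobase_pairs, v1, v2):
    """Diversity measure: cof of the pair over the summed cof of its variants."""
    v_star_1 = {x for x, y in sobase_pairs if y == v2}   # step D51
    v_star_2 = {y for x, y in sobase_pairs if x == v1}   # step D52
    denom = (sum(cof(tcorpus, x, v2) for x in v_star_1)
             + sum(cof(tcorpus, v1, y) for y in v_star_2))
    return cof(tcorpus, v1, v2) / denom if denom else 0.0


# Hypothetical four-sentence corpus and candidate pairs.
tcorpus = [
    [("老师", "n"), ("让", "v"), ("他", "r"), ("回答", "v"), ("问题", "n")],
    [("我", "r"), ("请", "v"), ("她", "r"), ("回答", "v"), ("问题", "n")],
    [("妈妈", "n"), ("让", "v"), ("我", "r"), ("去", "v"), ("学校", "n")],
    [("他", "r"), ("喜欢", "v"), ("书", "n")],
]
pairs = {("让", "回答"), ("请", "回答"), ("让", "去")}
print(cof(tcorpus, "让", "回答"), muf(tcorpus, pairs, "让", "回答"))
```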
Step D6: Output the final pivotal structure result SOBaseResult.
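Steps D1 to D6 can be read as a simple loop over SOBase. The sketch below follows the step order literally: a structure is accepted if its cof exceeds a, and otherwise if its muf exceeds b. It assumes that control returns to step D2 after step D5, which the text does not state explicitly. The scoring functions are passed in as parameters; `verify`, `cof_fn`, and `muf_fn` are hypothetical names, and the numbers in the example are made up for illustration.

```python
def verify(sobase, cof_fn, muf_fn, a, b):
    """Return the set of structures that pass step D4 or step D5."""
    result = set()                          # D1: SOBaseResult starts empty
    pending = list(sobase)
    while pending:                          # D2: stop when SOBase is empty
        structure, _sentence = pending.pop()    # D3: take a record out
        if cof_fn(structure) > a:           # D4: commonality check
            result.add(structure)
            continue
        if muf_fn(structure) > b:           # D5: diversity check
            result.add(structure)
    return result                           # D6: output SOBaseResult


# Made-up scores for three hypothetical candidate records.
cof_scores = {"A…B": 0.5, "C…D": 0.01, "E…F": 0.01}
muf_scores = {"A…B": 0.0, "C…D": 0.4, "E…F": 0.1}
sobase = [("A…B", "s1"), ("C…D", "s2"), ("E…F", "s3")]
accepted = verify(sobase, cof_scores.get, muf_scores.get, a=0.1, b=0.3)
print(sorted(accepted))
```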
CN201510846489.9A 2015-11-27 2015-11-27 Method for acquiring Chinese and bilingual structure Active CN106815188B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510846489.9A CN106815188B (en) 2015-11-27 2015-11-27 Method for acquiring Chinese and bilingual structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510846489.9A CN106815188B (en) 2015-11-27 2015-11-27 Method for acquiring Chinese and bilingual structure

Publications (2)

Publication Number Publication Date
CN106815188A true CN106815188A (en) 2017-06-09
CN106815188B CN106815188B (en) 2020-02-18

Family

ID=59103490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510846489.9A Active CN106815188B (en) 2015-11-27 2015-11-27 Method for acquiring Chinese and bilingual structure

Country Status (1)

Country Link
CN (1) CN106815188B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937430A (en) * 2010-09-03 2011-01-05 清华大学 Method for extracting event sentence pattern from Chinese sentence
CN102737013A (en) * 2011-04-02 2012-10-17 三星电子(中国)研发中心 Device and method for identifying statement emotion based on dependency relation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
傅成宏: "现代汉语兼语结构的机器探测", 《合肥学院学报(社会科学版)》 *
陈静 等: "基于条件随机场的兼语结构自动识别", 《情报科学》 *

Also Published As

Publication number Publication date
CN106815188B (en) 2020-02-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 212009 Zhenjiang high tech Industrial Development Zone, Jiangsu, No. 668, No. twelve road.

Applicant after: Zhongke national power (Zhenjiang) Intelligent Technology Co., Ltd.

Address before: 212009 18 building, North Tower, Twin Tower Rd 468, twelve road 468, Ding Mo Jing, Jiangsu.

Applicant before: Knowology Intelligent Technology Co., Ltd.

GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Fu Jianhui

Inventor after: Wang Weimin

Inventor after: Cao Yang

Inventor before: Fu Jianhui

Inventor before: Wang Weiming

Inventor before: Cao Yang
