CN105740235A - Phrase tree to dependency tree transformation method capable of combining Vietnamese grammatical features - Google Patents


Info

Publication number
CN105740235A
Authority
CN
China
Prior art keywords
vietnamese
tree
treebank
interdependent
phrase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610064305.8A
Other languages
Chinese (zh)
Other versions
CN105740235B (en)
Inventor
郭剑毅
李英
余正涛
线岩团
毛存礼
陈玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN201610064305.8A
Publication of CN105740235A
Application granted
Publication of CN105740235B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22: Indexing; Data structures therefor; Storage structures
    • G06F16/2228: Indexing structures
    • G06F16/2246: Trees, e.g. B+trees

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a phrase tree to dependency tree conversion method that incorporates Vietnamese grammatical features, and belongs to the technical field of natural language processing. The method comprises the following steps: first, constructing a Vietnamese phrase treebank; using a center child node filter table that incorporates Vietnamese grammatical features, together with a dependency relation annotator, to convert the phrase trees in the Vietnamese phrase treebank to dependency trees, obtaining a first-level Vietnamese dependency treebank; training an MSTParser model on the manually annotated corpus of the first-level Vietnamese dependency treebank, and using the MSTParser model to extend the first-level treebank, obtaining an extended second-level Vietnamese dependency treebank; and using a dependency relation corrector to correct the corpus of the extended second-level treebank, obtaining the final third-level Vietnamese dependency treebank. The method avoids manually collecting and annotating a Vietnamese dependency treebank from scratch, saves the manpower and time needed to build the treebank, and markedly improves accuracy.

Description

A phrase tree to dependency tree conversion method incorporating Vietnamese grammatical features
Technical field
The present invention relates to a phrase tree to dependency tree conversion method incorporating Vietnamese grammatical features, and belongs to the technical field of natural language processing.
Background
Vietnam and Yunnan are joined by shared mountains and rivers, and the two peoples have a long history of contact. Language has played a very important role in their friendly exchanges, coexistence, and mutual learning, so research on the Chinese-Vietnamese language pair has real practical significance. In translating between Vietnamese and Chinese, syntactic analysis of Vietnamese is essential groundwork. Syntactic analysis means deriving the grammatical structure of a sentence according to a given grammar, and it plays a vital role in natural language processing, information extraction, machine translation, and related research. Two forms of syntactic analysis are mainly used at present: phrase structure analysis and dependency structure analysis. Phrase structure analysis segments a sentence into phrases and analyzes the hierarchical relations among them. A phrase structure tree consists of terminal nodes, non-terminal nodes, and phrase markers; its most basic component is the syntactic marker, i.e. the non-terminal node (such as noun phrase NP or verb phrase VP). Dependency structure analysis analyzes the dependency relations between the words of a sentence and makes explicit which word governs which (for example, in "I like drinking tea", "I" and "like" stand in a subject-predicate relation). Because dependency relations are so widely used, they have attracted growing attention from scholars in recent years. Although phrase structure and dependency structure differ in form, both describe the grammatical structure of a sentence, so they are structurally consistent. When converting phrase trees to dependency structure trees, exploiting syntactic features is a very effective approach. Converting Vietnamese phrase structure trees to dependency structure trees yields more Vietnamese dependency trees of higher accuracy and thereby completes the construction of a Vietnamese dependency treebank. This has become the core task of Vietnamese dependency analysis as a whole; an effective and reasonable solution to it would strongly support upper-layer applications such as Vietnamese syntactic analysis, machine translation, and information acquisition.
Summary of the invention
The invention provides a phrase tree to dependency tree conversion method incorporating Vietnamese grammatical features. It addresses two problems: manually annotating a Vietnamese dependency treebank is difficult, and existing Vietnamese dependency treebanks are scarce. The Vietnamese dependency treebank built by the invention can strongly support upper-layer applications such as Vietnamese syntactic analysis, machine translation, and information acquisition.
The technical scheme of the invention is a phrase tree to dependency tree conversion method incorporating Vietnamese grammatical features, whose specific steps are as follows:
Step1: first, construct a Vietnamese phrase treebank;
The sub-steps of Step1 are as follows:
Step1.1: obtain word-segmented Vietnamese sentences from the VLSP chunking corpus;
Step1.2: submit the segmented Vietnamese sentences to the VLSP website for analysis to obtain the corresponding Vietnamese phrase trees;
Step1.3: manually proofread the resulting Vietnamese phrase trees to obtain the Vietnamese phrase treebank.
Step2: using a center child node filter table that incorporates Vietnamese grammatical features, together with a dependency relation annotator, convert the phrase trees in the Vietnamese phrase treebank to dependency trees, obtaining a first-level Vietnamese dependency treebank;
The sub-steps of Step2 are as follows:
Step2.1: build a center child node filter table that fits the characteristics of Vietnamese, based on Vietnamese grammatical features;
Step2.2: use the center child node filter table to perform the preliminary conversion of the phrase trees in the Vietnamese phrase treebank to dependency trees;
Step2.3: by analyzing the word-order differences between Vietnamese and Chinese, and drawing on the CTB dependency relation label set, build a dependency relation label set suited to Vietnamese;
Step2.4: input the Vietnamese dependency relation label set into the dependency relation annotator;
Step2.5: use the dependency relation annotator to label the converted Vietnamese dependency trees, completing the conversion of the Vietnamese phrase trees in the phrase treebank to dependency trees and obtaining the first-level Vietnamese dependency treebank.
Step3: train an MSTParser model on the manually annotated corpus of the first-level Vietnamese dependency treebank, and use the MSTParser model to extend the first-level treebank, obtaining an extended second-level Vietnamese dependency treebank;
The sub-steps of Step3 are as follows:
Step3.1: manually proofread the corpus of the first-level Vietnamese dependency treebank, checking whether each dependency arc is correct and whether each dependency relation label is correct;
Step3.2: use the MST algorithm to train on the manually proofread corpus of the first-level Vietnamese dependency treebank, obtaining a Vietnamese dependency parsing model, i.e. the MSTParser model;
Step3.3: apply the MSTParser model to new Vietnamese sentences to extend the first-level Vietnamese dependency treebank, obtaining the second-level Vietnamese dependency treebank.
Step4: use a dependency relation corrector to correct the corpus of the extended second-level Vietnamese dependency treebank, obtaining the final third-level Vietnamese dependency treebank.
The beneficial effects of the invention are:
1. The high-quality Vietnamese dependency treebank it builds can strongly support upper-layer applications such as Vietnamese syntactic analysis, machine translation, and information acquisition;
2. The phrase tree to dependency tree conversion method incorporating Vietnamese grammatical features produces a Vietnamese dependency tree corpus of higher accuracy;
3. The proposed method for building dependency trees removes the need to annotate a Vietnamese dependency treebank manually from scratch, greatly saving the manpower and time needed to build the treebank;
4. Compared with building a Vietnamese dependency treebank from a Chinese-Vietnamese bilingual word-aligned corpus, the proposed conversion method yields a Vietnamese dependency treebank corpus of significantly higher accuracy.
Brief description of the drawings
Fig. 1 is a flow chart of a specific embodiment of the present invention;
Fig. 2 is a schematic diagram of a phrase structure tree in the present invention;
Fig. 3 is a schematic diagram of the Vietnamese dependency relations obtained by the conversion in the present invention.
Detailed description
Embodiment 1: as shown in Figs. 1-3, a phrase tree to dependency tree conversion method incorporating Vietnamese grammatical features comprises the following steps:
Step1: first, construct a Vietnamese phrase treebank;
Step2: using a center child node filter table that incorporates Vietnamese grammatical features, together with a dependency relation annotator, convert the phrase trees in the Vietnamese phrase treebank to dependency trees, obtaining a first-level Vietnamese dependency treebank;
Step3: train an MSTParser model on the manually annotated corpus of the first-level Vietnamese dependency treebank, and use the MSTParser model to extend the first-level treebank, obtaining an extended second-level Vietnamese dependency treebank;
Step4: use a dependency relation corrector to correct the corpus of the extended second-level Vietnamese dependency treebank, obtaining the final third-level Vietnamese dependency treebank.
Embodiment 2: as shown in Figs. 1-3, a phrase tree to dependency tree conversion method incorporating Vietnamese grammatical features comprises the following steps:
Step1: first, construct a Vietnamese phrase treebank;
Step2: using a center child node filter table that incorporates Vietnamese grammatical features, together with a dependency relation annotator, convert the phrase trees in the Vietnamese phrase treebank to dependency trees, obtaining a first-level Vietnamese dependency treebank;
Step3: train an MSTParser model on the manually annotated corpus of the first-level Vietnamese dependency treebank, and use the MSTParser model to extend the first-level treebank, obtaining an extended second-level Vietnamese dependency treebank;
Step4: use a dependency relation corrector to correct the corpus of the extended second-level Vietnamese dependency treebank, obtaining the final third-level Vietnamese dependency treebank.
The sub-steps of Step1 are as follows:
Step1.1: obtain word-segmented Vietnamese sentences from the VLSP chunking corpus;
Step1.2: submit the segmented Vietnamese sentences to the VLSP website for analysis to obtain the corresponding Vietnamese phrase trees;
Step1.3: manually proofread the resulting Vietnamese phrase trees to obtain the Vietnamese phrase treebank.
Embodiment 3: as shown in Figs. 1-3, a phrase tree to dependency tree conversion method incorporating Vietnamese grammatical features comprises the following steps:
Step1: first, construct a Vietnamese phrase treebank;
Step2: using a center child node filter table that incorporates Vietnamese grammatical features, together with a dependency relation annotator, convert the phrase trees in the Vietnamese phrase treebank to dependency trees, obtaining a first-level Vietnamese dependency treebank;
Step3: train an MSTParser model on the manually annotated corpus of the first-level Vietnamese dependency treebank, and use the MSTParser model to extend the first-level treebank, obtaining an extended second-level Vietnamese dependency treebank;
Step4: use a dependency relation corrector to correct the corpus of the extended second-level Vietnamese dependency treebank, obtaining the final third-level Vietnamese dependency treebank.
The sub-steps of Step1 are as follows:
Step1.1: obtain word-segmented Vietnamese sentences from the VLSP chunking corpus;
Step1.2: submit the segmented Vietnamese sentences to the VLSP website for analysis to obtain the corresponding Vietnamese phrase trees;
Step1.3: manually proofread the resulting Vietnamese phrase trees to obtain the Vietnamese phrase treebank.
The sub-steps of Step2 are as follows:
Step2.1: build a center child node filter table that fits the characteristics of Vietnamese, based on Vietnamese grammatical features;
Step2.2: use the center child node filter table to perform the preliminary conversion of the phrase trees in the Vietnamese phrase treebank to dependency trees;
Step2.3: by analyzing the word-order differences between Vietnamese and Chinese, and drawing on the CTB dependency relation label set, build a dependency relation label set suited to Vietnamese;
Step2.4: input the Vietnamese dependency relation label set into the dependency relation annotator;
Step2.5: use the dependency relation annotator to label the converted Vietnamese dependency trees, completing the conversion of the Vietnamese phrase trees in the phrase treebank to dependency trees and obtaining the first-level Vietnamese dependency treebank.
Embodiment 4: as shown in Figs. 1-3, a phrase tree to dependency tree conversion method incorporating Vietnamese grammatical features comprises the following steps:
Step1: first, construct a Vietnamese phrase treebank;
Step2: using a center child node filter table that incorporates Vietnamese grammatical features, together with a dependency relation annotator, convert the phrase trees in the Vietnamese phrase treebank to dependency trees, obtaining a first-level Vietnamese dependency treebank;
Step3: train an MSTParser model on the manually annotated corpus of the first-level Vietnamese dependency treebank, and use the MSTParser model to extend the first-level treebank, obtaining an extended second-level Vietnamese dependency treebank;
Step4: use a dependency relation corrector to correct the corpus of the extended second-level Vietnamese dependency treebank, obtaining the final third-level Vietnamese dependency treebank.
The sub-steps of Step1 are as follows:
Step1.1: obtain word-segmented Vietnamese sentences from the VLSP chunking corpus;
Step1.2: submit the segmented Vietnamese sentences to the VLSP website for analysis to obtain the corresponding Vietnamese phrase trees;
Step1.3: manually proofread the resulting Vietnamese phrase trees to obtain the Vietnamese phrase treebank.
The sub-steps of Step2 are as follows:
Step2.1: build a center child node filter table that fits the characteristics of Vietnamese, based on Vietnamese grammatical features;
Step2.2: use the center child node filter table to perform the preliminary conversion of the phrase trees in the Vietnamese phrase treebank to dependency trees;
Step2.3: by analyzing the word-order differences between Vietnamese and Chinese, and drawing on the CTB dependency relation label set, build a dependency relation label set suited to Vietnamese;
Step2.4: input the Vietnamese dependency relation label set into the dependency relation annotator;
Step2.5: use the dependency relation annotator to label the converted Vietnamese dependency trees, completing the conversion of the Vietnamese phrase trees in the phrase treebank to dependency trees and obtaining the first-level Vietnamese dependency treebank.
The sub-steps of Step3 are as follows:
Step3.1: manually proofread the corpus of the first-level Vietnamese dependency treebank, checking whether each dependency arc is correct and whether each dependency relation label is correct;
Step3.2: use the MST algorithm to train on the manually proofread corpus of the first-level Vietnamese dependency treebank, obtaining a Vietnamese dependency parsing model, i.e. the MSTParser model;
Step3.3: apply the MSTParser model to new Vietnamese sentences to extend the first-level Vietnamese dependency treebank, obtaining the second-level Vietnamese dependency treebank.
Embodiment 5: as shown in Figs. 1-3, a phrase tree to dependency tree conversion method incorporating Vietnamese grammatical features comprises the following steps:
Step1: first, construct a Vietnamese phrase treebank;
Building the Vietnamese phrase treebank corpus is the basis for converting Vietnamese phrase trees to dependency trees. Only with a high-quality corpus as a foundation can further information-processing work proceed. The phrase treebank corpus is also an indispensable part of treebank conversion research. The specific steps for building it are as follows:
1) Obtain word-segmented Vietnamese sentences from the VLSP chunking corpus;
First, download the Vietnamese chunking corpus from the VLSP website and extract 10,000 word-segmented simple Vietnamese sentences.
2) Submit the segmented Vietnamese sentences to the VLSP website for analysis to obtain the corresponding Vietnamese phrase trees;
Submitting the 10,000 segmented simple sentences to the VLSP website yields the corresponding phrase structure trees.
3) Have Vietnamese teachers and students manually proofread the resulting Vietnamese phrase trees to obtain a Vietnamese phrase treebank of higher accuracy;
To support the treebank conversion work, Vietnamese-language teachers and Vietnamese international students were asked to proofread the 10,000 phrase structure trees manually, ensuring the accuracy of the base corpus for the experiments. The structure of a Vietnamese phrase structure tree is shown in Fig. 2, where NP denotes a noun phrase and VP denotes a verb phrase.
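As an illustration of how such phrase structure trees might be held in memory, the following is a minimal sketch that reads a bracketed tree string into nested nodes. The bracketed format, the `Node` class, and the placeholder words `w1`..`w3` are assumptions made for this example; the source does not specify the file format produced by the VLSP tools.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a phrase structure tree (hypothetical representation)."""
    label: str                      # e.g. "NP", "VP", or a word-class tag such as "N"
    word: Optional[str] = None      # set only on pre-terminal nodes
    children: List["Node"] = field(default_factory=list)


def parse_tree(text: str) -> Node:
    """Parse a bracketed string such as "(S (NP (N w1)) (VP (V w2)))"."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read() -> Node:
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        node = Node(label=tokens[pos])
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.children.append(read())
            else:
                node.word = tokens[pos]   # a bare token is the word under a tag
                pos += 1
        pos += 1                          # consume ")"
        return node

    root = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return root


def leaves(node: Node) -> List[str]:
    """Return the words of the tree in sentence order."""
    if node.word is not None:
        return [node.word]
    return [w for child in node.children for w in leaves(child)]


# Placeholder words stand in for a real Vietnamese sentence.
tree = parse_tree("(S (NP (N w1)) (VP (V w2) (NP (N w3))))")
```

Keeping the word on the pre-terminal node, rather than as a separate leaf, makes it easy to read off the tag sequence that the later head-finding step needs.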
Step2: using a center child node filter table that incorporates Vietnamese grammatical features, together with a dependency relation annotator, convert the phrase trees in the Vietnamese phrase treebank to dependency trees, obtaining a first-level Vietnamese dependency treebank;
Starting from the Vietnamese phrase tree corpus built above, a center child node filter table for Vietnamese treebank conversion is first drawn up in light of Vietnamese grammatical features, and the table is used to perform the preliminary conversion from phrase trees to dependency trees. Then, taking the differences between Vietnamese and Chinese into account, the CTB dependency relation annotation scheme is modified to obtain a dependency relation label set that better fits Vietnamese grammar. Finally, the dependency relation annotator is used to label the Vietnamese dependency relations.
1) Build a center child node filter table that fits the characteristics of Vietnamese, based on Vietnamese grammatical features;
Vietnamese is a typical monosyllabic, non-inflecting, tonal language. Grammatical relations between words are expressed not through changes in the form of the words themselves but through word order, function words, and similar means. Its main features are as follows.
(1) Word order is the most important means of expressing meaning in Vietnamese grammar, and a change of word order can change the meaning (the source illustrates this with phrases containing còn in different orders; the full examples are illegible in this text). Word order in a Vietnamese sentence also generally runs from general to specific: words with more general meaning stand earlier in the sentence, and words with more specific meaning stand later (the source's example is lost).
(2) The grammatical system is highly stable. Vietnamese has been strongly influenced by other languages, especially Chinese, but this influence shows mainly at the lexical level: more than half of the vocabulary consists of Chinese loanwords or words coined from Chinese morphemes. At the syntactic level, Chinese has had little influence, and Vietnamese has kept its own grammatical system unchanged; for example, the "head first, modifier after" phrase structure rule has never changed. This "head first, modifier after" word formation pattern is characteristic: the element that describes a property of the nominal head is postposed, that is, adjectives follow the noun, and this is the most salient feature distinguishing Vietnamese from Chinese. In sentences, the pattern shows up as "right-side supplementation of meaning": a word on the right supplements the word to its left, and the further right, the more specific. For example, a Vietnamese sentence (only fragments such as "Con … ngoài cánh …" survive in this text) corresponds to the Chinese sentence meaning "That black water buffalo of my family at the end of the village is eating grass in the field outside." Chinese is exactly the opposite: modifier first, head after, with meaning supplemented from the left.
(3) Vietnamese adjectives and verbs share many grammatical properties and often serve as the predicate of a sentence, so they are collectively called "predicatives". It is very common for an adjective to take a complement directly after it, for example kém toán (the source's other examples are illegible). Chinese adjectives can also take objects, as in the words for "lustful" and "hospitable", but this is less common than in Vietnamese.
(4) Adverbial position is flexible. In Chinese, adverbials before, within, and after the clause are all common; in Vietnamese, adverbials at the front are the most common, adverbials at the end come second, and adverbials in the middle are rarer. For example: "Ngày mai …" (the rest of the example is lost).
(5) Passive sentences are more frequent in Vietnamese than in Chinese, owing to the use of words such as do (the source lists further words and two example sentences that are illegible in this text). If those two sentences were rendered in Chinese with the passive markers 被 or 得 forced in, the result would sound awkward.
To locate the head node of each phrase more reliably, the features of Vietnamese described above are built into the center child node filter table. Drawing up this table is an important part of the whole work. Table 1 shows part of the table. Each row is a triple <phrase type, search direction, priority list>. The phrase type is the phrase symbol of a non-terminal node. The search direction is the direction in which the center child node is sought inside the non-terminal node: with the value L, the search starts at the left edge of the phrase and moves right; with the value R, it starts at the right edge and moves left. The priority list gives the order in which the labels of the phrase's child nodes are tried as the center node. For example, according to the entry <VP, L, VP;V;A;AP;N;NP;S;.*>, the center child node of a VP phrase is determined as follows: scan the children of the VP from left to right, and the first child labeled VP is the center child node; if no VP child is found, scan the children from left to right again, and the first child labeled V is the center child node; and so on. If no child carries any of the labels VP, V, A, AP, N, NP, S, .*, the leftmost child is taken as the center child node by default. The center child node filter table drawn up for Vietnamese treebank conversion is shown in Table 1.
Table 1. Center child node filter table
Phrase type    Search direction    Priority
S              L                   S;VP;AP;NP;.*
SBAR           L                   SBAR;S;VP;AP;NP;.*
SQ             L                   SQ;VP;AP;NP;.*
NP             L                   NP;Nc;Nu;Np;N;P;.*
VP             L                   VP;V;A;AP;N;NP;S;.*
AP             L                   AP;A;N;S;.*
PP             L                   PP;E;VP;SBAR;AP;QP;.*
RP             R                   RP;R;T;NP;.*
XP             L                   XP;X;.*
MDP            L                   MDP;T;I;A;P;R;X;.*
UCP            L                   .*
WHADV          L                   R;.*
WHVP           L                   V;.*
QP             L                   QP;M;.*
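The lookup procedure described for Table 1 can be sketched as a small function. This is a minimal illustration, not the patent's implementation: the names `HEAD_TABLE` and `find_head_child` are hypothetical, each priority entry is treated as a regular expression (so `.*` matches any label), and function tags such as the `-DOB` in `NP-DOB` are stripped before matching, which is an assumption the source does not state.

```python
import re
from typing import Dict, List, Tuple

# Rows of Table 1: phrase type -> (search direction, priority list).
HEAD_TABLE: Dict[str, Tuple[str, List[str]]] = {
    "S":     ("L", ["S", "VP", "AP", "NP", ".*"]),
    "SBAR":  ("L", ["SBAR", "S", "VP", "AP", "NP", ".*"]),
    "SQ":    ("L", ["SQ", "VP", "AP", "NP", ".*"]),
    "NP":    ("L", ["NP", "Nc", "Nu", "Np", "N", "P", ".*"]),
    "VP":    ("L", ["VP", "V", "A", "AP", "N", "NP", "S", ".*"]),
    "AP":    ("L", ["AP", "A", "N", "S", ".*"]),
    "PP":    ("L", ["PP", "E", "VP", "SBAR", "AP", "QP", ".*"]),
    "RP":    ("R", ["RP", "R", "T", "NP", ".*"]),
    "XP":    ("L", ["XP", "X", ".*"]),
    "MDP":   ("L", ["MDP", "T", "I", "A", "P", "R", "X", ".*"]),
    "UCP":   ("L", [".*"]),
    "WHADV": ("L", ["R", ".*"]),
    "WHVP":  ("L", ["V", ".*"]),
    "QP":    ("L", ["QP", "M", ".*"]),
}


def base_label(label: str) -> str:
    """Drop a function tag, e.g. "NP-DOB" -> "NP" (assumed convention)."""
    return label.split("-")[0]


def find_head_child(phrase: str, child_labels: List[str]) -> int:
    """Return the index of the center (head) child among child_labels."""
    if not child_labels:
        raise ValueError("a phrase must have at least one child")
    # Unknown phrase types fall back to "first child from the left".
    direction, priorities = HEAD_TABLE.get(base_label(phrase), ("L", [".*"]))
    order = list(range(len(child_labels)))
    if direction == "R":
        order.reverse()                      # scan from the right edge instead
    for pattern in priorities:               # one full scan per priority entry
        for i in order:
            if re.fullmatch(pattern, base_label(child_labels[i])):
                return i
    return order[0]                          # default: first child in scan order


vp_head = find_head_child("VP", ["R", "V", "NP-DOB"])
```

For the VP entry, the children `R`, `V`, `NP-DOB` contain no `VP`, so the second priority `V` selects the child at index 1.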
2) Use the center child node filter table to perform the preliminary conversion of the phrase trees in the Vietnamese phrase treebank to dependency trees;
As an example of finding the center child node, take the phrase (VP (R …) (V còn) (NP-DOB (N …) (A nghèo))), in which the words under R and N are illegible in this text. First, look up the phrase type VP in the center child node filter table; the corresponding entry is <VP, L, VP;V;A;AP;N;NP;S;.*>. Second, scan the children of the VP phrase from left to right. No child is labeled VP, so the next priority, V, is tried, and the first child labeled V is the word (V còn). This means that "còn" is the center child node of this VP phrase.
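Once the center child of every phrase is known, the preliminary (unlabeled) conversion can be done by head percolation: the head word of each non-center child is attached to the head word of the center child. The sketch below shows this under stated assumptions: trees are nested tuples, only three rows of Table 1 are included, function tags are stripped before matching, and `w1` and `w2` are placeholders for the words that are illegible in the source example.

```python
import re
from typing import List, Tuple

# Subset of Table 1, enough for the example below.
HEAD_TABLE = {
    "S":  ("L", ["S", "VP", "AP", "NP", ".*"]),
    "NP": ("L", ["NP", "Nc", "Nu", "Np", "N", "P", ".*"]),
    "VP": ("L", ["VP", "V", "A", "AP", "N", "NP", "S", ".*"]),
}


def head_index(label: str, child_labels: List[str]) -> int:
    """Pick the center child using the filter table (tags like -DOB ignored)."""
    direction, priorities = HEAD_TABLE.get(label.split("-")[0], ("L", [".*"]))
    order = list(range(len(child_labels)))
    if direction == "R":
        order.reverse()
    for pattern in priorities:
        for i in order:
            if re.fullmatch(pattern, child_labels[i].split("-")[0]):
                return i
    return order[0]


def to_dependencies(tree) -> Tuple[List[str], List[int]]:
    """Convert a phrase tree to (words, heads).

    A pre-terminal is (tag, word); a phrase is (label, [children]).
    heads[k] is the index of the word that word k depends on, -1 for the root.
    """
    words: List[str] = []
    heads: List[int] = []

    def visit(node) -> int:
        label, body = node
        if isinstance(body, str):            # pre-terminal: record the word
            words.append(body)
            heads.append(-1)
            return len(words) - 1
        child_heads = [visit(child) for child in body]
        h = head_index(label, [child[0] for child in body])
        for k, word_index in enumerate(child_heads):
            if k != h:                       # attach non-center children
                heads[word_index] = child_heads[h]
        return child_heads[h]                # the phrase's head word percolates up

    visit(tree)
    return words, heads


example = ("VP", [("R", "w1"), ("V", "còn"),
                  ("NP-DOB", [("N", "w2"), ("A", "nghèo")])])
words, heads = to_dependencies(example)
```

In this example "còn" becomes the root, `w1` and `w2` attach to it, and "nghèo" attaches to `w2`. Relation labels are not assigned here; that is the job of the dependency relation annotator in the later sub-steps.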
3) By analyzing the word-order differences between Vietnamese and Chinese, and drawing on the CTB dependency relation label set, build a dependency relation label set suited to Vietnamese;
Based on a contrastive analysis of the grammatical differences between Vietnamese and Chinese, the Chinese dependency relation label set of the Penn Chinese Treebank (CTB) was modified. The changes mainly reflect features such as the more frequent postposing of attributives and adverbials in Vietnamese, so that the modified label set better fits the grammatical characteristics of Vietnamese.
4) Input the Vietnamese dependency relation label set into the dependency relation annotator;
The Vietnamese dependency relation label set is input into the dependency relation annotator and the feature values are set, in preparation for labeling Vietnamese dependency relations.
5) Use the dependency relation annotator to label the converted Vietnamese dependency trees, completing the conversion of the Vietnamese phrase trees in the phrase treebank to dependency trees and obtaining the first-level Vietnamese dependency treebank;
Dependency relations are determined mainly with a statistics-based method. An online algorithm is used to train the weights of the feature vector. Unlike an SVM, the online algorithm seeks to maximize the accuracy of the whole tree throughout training. It is also a margin-maximizing learning algorithm that is widely used and performs well in dependency analysis, text classification, and similar tasks. The structure of the resulting first-level Vietnamese dependency treebank is shown in Fig. 3, which gives the dependency relation between each pair of words; for example, SBV indicates a subject-predicate relation between two words.
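To make the idea of online weight training concrete, here is a minimal sketch of a multiclass perceptron that assigns a relation label to a dependency arc. It is a deliberately simpler stand-in for the margin-based online algorithm the text refers to, not the patent's annotator: the feature strings, the function names, and the label `VOB` are assumptions for illustration (only `SBV` appears in the source).

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Example = Tuple[List[str], str]   # (feature strings of one arc, gold relation label)


def score(weights: Dict[Tuple[str, str], float], feats: List[str], label: str) -> float:
    """Sum the weights of all (feature, label) pairs that fire."""
    return sum(weights.get((f, label), 0.0) for f in feats)


def predict_label(weights, labels: List[str], feats: List[str]) -> str:
    """Return the highest-scoring label (ties go to the earlier label)."""
    return max(labels, key=lambda y: score(weights, feats, y))


def train_perceptron(examples: List[Example], labels: List[str], epochs: int = 5):
    """Online training: update weights after every misclassified example."""
    weights: Dict[Tuple[str, str], float] = defaultdict(float)
    for _ in range(epochs):
        for feats, gold in examples:
            guess = predict_label(weights, labels, feats)
            if guess != gold:
                for f in feats:
                    weights[(f, gold)] += 1.0    # reward the correct label
                    weights[(f, guess)] -= 1.0   # penalize the wrong guess
    return weights


# Toy data: a noun to the left of a verb is a subject, to the right an object.
LABELS = ["SBV", "VOB"]
DATA: List[Example] = [
    (["head=V", "dep=N", "dir=left"], "SBV"),
    (["head=V", "dep=N", "dir=right"], "VOB"),
]
model = train_perceptron(DATA, LABELS)
```

The update is made one example at a time, which is what "online" means here; a margin-based variant would additionally scale each update so that the gold label outscores the guess by a fixed margin.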
Step3, train an MSTParser model on the manually annotated corpus of the first-level Vietnamese dependency treebank, and use the MSTParser model to extend the first-level Vietnamese dependency treebank, obtaining the extended second-level Vietnamese dependency treebank;
1), manually proofread the corpus of the obtained first-level Vietnamese dependency treebank; the proofreading checks whether the dependency arcs are correct and whether the dependency relation labels are correct;
In this process, 1,000 samples were randomly drawn to estimate the accuracy of the whole converted Vietnamese treebank, and the accuracy reached 89.4%. On this basis, Vietnamese teachers and Vietnamese international students manually corrected the obtained Vietnamese dependency trees to ensure the quality of the training corpus.
2), use the MST algorithm to train a Vietnamese dependency parsing model, i.e. the MSTParser model, on the manually proofread corpus of the first-level Vietnamese dependency treebank;
The MST (Maximum Spanning Trees) algorithm is trained on the dependency trees of full sentences and, during dependency parsing, uses a maximum spanning tree search to find the optimal dependency tree for the whole sentence. The dependency tree of a Vietnamese sentence S = {s_1, s_2, ..., s_n} is represented as a directed graph G = (V, E), where the words of the sentence form the vertex set V = {v_1, v_2, ..., v_n} and E is the set of directed edges, each representing the dependency of one word on another. If the dependency tree contains a directed edge from vertex i to vertex j, with i, j ∈ V, then (i, j) ∈ E. The weight of each directed edge is defined as score(i, j, y), i.e. the probability that j depends on i, where y is the dependency relation type. The weight of a dependency tree is the sum of the weights of the directed edges in the tree. The discriminative dependency parsing method thus turns the search for the optimal result into the problem of finding a maximum spanning tree in the directed graph G = (V, E):
$$T = \underset{G=(V,E)}{\operatorname{argmax}} \sum_{(i,j)\in E} \mathrm{score}(i,j,y)$$
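For a very short sentence, the objective above can be illustrated by exhaustive search over all head assignments. This is not the spanning tree algorithm used by MSTParser, only a brute-force check of the same argmax; the edge scores are invented for illustration, and the relation type y is assumed to be already folded into each score.

```python
from itertools import product

def is_tree(heads):
    """heads[d - 1] is the head of word d; 0 is an artificial root.
    Returns True if every word reaches the root without a cycle."""
    for d in range(1, len(heads) + 1):
        seen, node = set(), d
        while node != 0:
            if node in seen:
                return False          # cycle detected
            seen.add(node)
            node = heads[node - 1]
    return True

def best_tree(n, score):
    """Return (heads, total) maximizing the sum of score(head, dependent)."""
    best_total, best_heads = float("-inf"), None
    for heads in product(range(n + 1), repeat=n):
        if any(h == d for d, h in enumerate(heads, start=1)):
            continue                  # a word cannot be its own head
        if not is_tree(heads):
            continue
        total = sum(score(h, d) for d, h in enumerate(heads, start=1))
        if total > best_total:
            best_total, best_heads = total, heads
    return best_heads, best_total

# Invented scores for a three-word sentence: (head, dependent) -> weight.
scores = {(0, 1): 1, (0, 2): 5, (0, 3): 1,
          (1, 2): 2, (1, 3): 1,
          (2, 1): 4, (2, 3): 3,
          (3, 1): 1, (3, 2): 1}
heads, total = best_tree(3, lambda h, d: scores[(h, d)])
print(heads, total)                   # (2, 0, 2) 12
```

Exhaustive search grows exponentially with sentence length, which is why a dedicated maximum spanning tree algorithm is needed in practice.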
3), use the MSTParser model to parse new Vietnamese sentences and thereby extend the first-level Vietnamese dependency treebank, obtaining the second-level Vietnamese dependency treebank.
Here we use the trained model to parse Vietnamese sentences and thereby add new data to the Vietnamese dependency treebank. The sources of the Vietnamese sentences are shown in Fig. 1: a news corpus of 20,000 sentences can be obtained from Vietnamese news websites and segmented with a CRF word segmentation tool to serve as the corpus, which increases the size of the corpus to 30,000.
Step4, use the dependency relation corrector to correct the corpus of the extended second-level Vietnamese dependency treebank, obtaining the final third-level Vietnamese dependency treebank.
The present invention first draws up a list of dependency relations based on the Chinese dependency relation annotation scheme and the grammatical rules of Vietnamese; it then builds the center child node filter table according to the linguistic features of Vietnamese and uses the table to perform the preliminary conversion; finally, it uses the dependency relation annotator to label the dependency relations. Starting from the 10,000 dependency trees obtained by conversion, the MSTParser tool is used for further training to obtain more Vietnamese dependency trees, and a Vietnamese dependency parsing system is finally built.
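The preliminary conversion step summarized above could look roughly like the following sketch, which turns a phrase tree into unlabeled dependency arcs by attaching every non-center child to the center child. The tree encoding, the miniature priority table, and the placeholder tokens "w1" and "w2" are assumptions for illustration; only the VP priorities and the words "còn" and "nghèo" come from the text.

```python
# Tree node encoding (assumed): (label, word) for a leaf,
# (label, [children]) for a phrase.
HEAD_PRIORITY = {   # hypothetical miniature table, children scanned left to right
    "S": ["VP", "NP"],
    "VP": ["VP", "V", "A", "AP", "N", "NP", "S"],
    "NP": ["N", "NP"],
}

def lexical_head(node, arcs):
    """Return the head word of node and append (head_word, dependent_word) arcs."""
    label, content = node
    if isinstance(content, str):
        return content                      # a leaf is its own head
    child_heads = [lexical_head(child, arcs) for child in content]
    child_labels = [child[0].split("-")[0] for child in content]
    head_index = 0                          # fallback, mirroring the ".*" wildcard
    for wanted in HEAD_PRIORITY.get(label.split("-")[0], []):
        if wanted in child_labels:
            head_index = child_labels.index(wanted)
            break
    for i, word in enumerate(child_heads):
        if i != head_index:                 # non-center children depend on the center
            arcs.append((child_heads[head_index], word))
    return child_heads[head_index]

tree = ("VP", [("R", "w1"),
               ("V", "còn"),
               ("NP-DOB", [("N", "w2"), ("A", "nghèo")])])
arcs = []
root = lexical_head(tree, arcs)
print(root, arcs)
```

Words are used directly as node identifiers here, which only works when no word is repeated in the sentence; a fuller version would use token positions. Relation labels such as SBV would be added afterwards by the dependency relation annotator.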
The experimental results are shown in Table 2. As Table 2 shows, when a relatively large amount of Vietnamese corpus data is available, the Vietnamese dependency treebank generated by the phrase tree to dependency tree conversion method fusing Vietnamese grammatical features is significantly more accurate than those built with CRFParser and with the method that uses Chinese as an intermediary.
The evaluation metrics chosen for whole-sentence dependency parsing are the dependency arc accuracy (Unlabeled Attachment Score, UAS) and the labeled accuracy (Labeled Attachment Score, LAS), defined as follows:
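Under the standard definitions of these metrics, UAS is the fraction of words that are assigned the correct head, and LAS is the fraction of words that are assigned both the correct head and the correct relation label. The following minimal sketch computes both; the function name and the labels other than SBV are hypothetical.

```python
def attachment_scores(gold, predicted):
    """gold and predicted are equal-length lists of (head, label), one per word.
    Returns (UAS, LAS) as fractions between 0 and 1."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(gold)
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_both = sum(g == p for g, p in zip(gold, predicted))
    return correct_heads / n, correct_both / n

# Three-word sentence; head 0 denotes the artificial root.
gold = [(2, "SBV"), (0, "ROOT"), (2, "OBJ")]
pred = [(2, "SBV"), (0, "ROOT"), (2, "ADV")]   # right head, wrong label on word 3
uas, las = attachment_scores(gold, pred)
print(uas, las)
```

Since a correctly labeled arc must also have the correct head, LAS can never exceed UAS.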
Table 2: Comparison of other methods with the method of the present invention
The specific embodiments of the present invention have been explained in detail above with reference to the accompanying drawings, but the present invention is not limited to these embodiments; within the knowledge of a person of ordinary skill in the art, various changes may also be made without departing from the spirit of the present invention.

Claims (4)

1. A phrase tree to dependency tree conversion method fusing Vietnamese grammatical features, characterized in that the method comprises the following steps:
Step1, first construct a Vietnamese phrase treebank;
Step2, use the center child node filter table fusing Vietnamese grammatical features and the dependency relation annotator to convert the phrase trees in the Vietnamese phrase treebank into dependency trees, obtaining the first-level Vietnamese dependency treebank;
Step3, train an MSTParser model on the manually annotated corpus of the first-level Vietnamese dependency treebank, and use the MSTParser model to extend the first-level Vietnamese dependency treebank, obtaining the extended second-level Vietnamese dependency treebank;
Step4, use the dependency relation corrector to correct the corpus of the extended second-level Vietnamese dependency treebank, obtaining the final third-level Vietnamese dependency treebank.
2. The phrase tree to dependency tree conversion method fusing Vietnamese grammatical features according to claim 1, characterized in that step Step1 comprises the following steps:
Step1.1, obtain word-segmented Vietnamese sentences from the chunk corpus of VLSP;
Step1.2, submit the word-segmented Vietnamese sentences to the VLSP website for analysis to obtain the corresponding Vietnamese phrase trees;
Step1.3, manually proofread the obtained Vietnamese phrase trees, thereby obtaining the Vietnamese phrase treebank.
3. The phrase tree to dependency tree conversion method fusing Vietnamese grammatical features according to claim 1, characterized in that step Step2 comprises the following steps:
Step2.1, build a center child node filter table that matches the linguistic features of Vietnamese according to Vietnamese grammatical features;
Step2.2, use the center child node filter table to perform the preliminary conversion of the phrase trees in the Vietnamese phrase treebank into dependency trees;
Step2.3, by analyzing the word-order differences between Vietnamese and Chinese, and with reference to the CTB dependency relation label set, formulate a dependency relation label set suited to Vietnamese;
Step2.4, input the Vietnamese dependency relation label set into the dependency relation annotator;
Step2.5, use the dependency relation annotator to label the converted Vietnamese dependency trees, thereby completing the conversion of the Vietnamese phrase trees in the Vietnamese phrase treebank into dependency trees and obtaining the first-level Vietnamese dependency treebank.
4. The phrase tree to dependency tree conversion method fusing Vietnamese grammatical features according to claim 1, characterized in that step Step3 comprises the following steps:
Step3.1, manually proofread the corpus of the obtained first-level Vietnamese dependency treebank; the proofreading checks whether the dependency arcs are correct and whether the dependency relation labels are correct;
Step3.2, use the MST algorithm to train a Vietnamese dependency parsing model, i.e. the MSTParser model, on the manually proofread corpus of the first-level Vietnamese dependency treebank;
Step3.3, use the MSTParser model to parse new Vietnamese sentences and thereby extend the first-level Vietnamese dependency treebank, obtaining the second-level Vietnamese dependency treebank.
CN201610064305.8A 2016-01-29 2016-01-29 A phrase tree to dependency tree conversion method fusing Vietnamese grammatical features Active CN105740235B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610064305.8A CN105740235B (en) 2016-01-29 2016-01-29 A phrase tree to dependency tree conversion method fusing Vietnamese grammatical features

Publications (2)

Publication Number Publication Date
CN105740235A true CN105740235A (en) 2016-07-06
CN105740235B CN105740235B (en) 2019-02-19

Family

ID=56247132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610064305.8A Active CN105740235B (en) 2016-01-29 2016-01-29 A phrase tree to dependency tree conversion method fusing Vietnamese grammatical features

Country Status (1)

Country Link
CN (1) CN105740235B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250367A (en) * 2016-07-27 2016-12-21 昆明理工大学 The method building the interdependent treebank of Vietnamese based on the Nivre algorithm improved
CN106844333A (en) * 2016-12-20 2017-06-13 竹间智能科技(上海)有限公司 A kind of statement analytical method and system based on semantic and syntactic structure
CN107656921A (en) * 2017-10-10 2018-02-02 上海数眼科技发展有限公司 A kind of short text dependency analysis method based on deep learning
CN108280060A (en) * 2018-01-10 2018-07-13 昆明理工大学 A method of the interdependent treebank error detection of Vietnamese based on treebank conversion
CN110096715A (en) * 2019-05-06 2019-08-06 北京理工大学 A kind of fusion pronunciation character Chinese-Vietnamese statistical machine translation method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060004563A1 (en) * 2004-06-30 2006-01-05 Microsoft Corporation Module for creating a language neutral syntax representation using a language particular syntax tree
CN103020148A (en) * 2012-11-23 2013-04-03 复旦大学 System and method for converting Chinese phrase structure tree banks into interdependent structure tree banks
CN104991890A (en) * 2015-07-15 2015-10-21 昆明理工大学 Method for constructing Vietnamese dependency tree bank on basis of Chinese-Vietnamese vocabulary alignment corpora

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DAT QUOC NGUYEN ET AL.: "From Treebank Conversion to Automatic Dependency Parsing for Vietnamese", 《19TH INTERNATIONAL CONFERENCE ON APPLICATIONS OF NATURAL LANGUAGE TO INFORMATION SYSTEMS》 *
PHUONG-THAI NGUYEN ET AL.: "Building a Large Syntactically-Annotated Corpus of Vietnamese", 《PROCEEDINGS OF THE THIRD LINGUISTIC ANNOTATION WORKSHOP, ACL-IJCNLP 2009》 *
尤昉 等: "基于语义依存关系的汉语语料库的构建", 《中文信息学报》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250367A (en) * 2016-07-27 2016-12-21 昆明理工大学 The method building the interdependent treebank of Vietnamese based on the Nivre algorithm improved
CN106250367B (en) * 2016-07-27 2019-04-09 昆明理工大学 Method based on the improved Nivre algorithm building interdependent treebank of Vietnamese
CN106844333A (en) * 2016-12-20 2017-06-13 竹间智能科技(上海)有限公司 A kind of statement analytical method and system based on semantic and syntactic structure
CN107656921A (en) * 2017-10-10 2018-02-02 上海数眼科技发展有限公司 A kind of short text dependency analysis method based on deep learning
CN107656921B (en) * 2017-10-10 2021-01-08 上海数眼科技发展有限公司 Short text dependency analysis method based on deep learning
CN108280060A (en) * 2018-01-10 2018-07-13 昆明理工大学 A method of the interdependent treebank error detection of Vietnamese based on treebank conversion
CN110096715A (en) * 2019-05-06 2019-08-06 北京理工大学 A kind of fusion pronunciation character Chinese-Vietnamese statistical machine translation method

Also Published As

Publication number Publication date
CN105740235B (en) 2019-02-19

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant