CN106202037A

CN106202037A - Chunk-based Vietnamese phrase tree construction method

Info

Publication number: CN106202037A
Application number: CN201610497061.2A
Authority: CN
Inventors: 郭剑毅; 李英; 余正涛; 线岩团; 毛存礼; 陈玮
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2016-06-30
Filing date: 2016-06-30
Publication date: 2016-12-07
Anticipated expiration: 2036-06-30
Also published as: CN106202037B

Abstract

The invention relates to a chunk-based method for constructing Vietnamese phrase trees and belongs to the field of natural language processing. First, upper-level and base-level chunk annotation is performed on Vietnamese phrase trees labeled with the phrase tag set. Feature sets for upper-level and base-level chunks are then selected, and a chunk-based Vietnamese phrase treebank construction model is built. A chunk analysis tool performs chunk analysis on word-segmented Vietnamese sentences, yielding a preliminary chunk-based Vietnamese phrase treebank. Finally, a phrase treebank corrector corrects the preliminary treebank to produce the final Vietnamese phrase treebank. The invention avoids manually collecting and annotating a Vietnamese phrase treebank, saving labor and construction time. Compared with building a Vietnamese phrase treebank with a context-free grammar or with a maximum entropy model, the proposed method achieves significantly higher accuracy.

Description

Chunk-based Vietnamese phrase tree construction method

Technical field

The invention relates to a chunk-based method for constructing Vietnamese phrase trees and belongs to the field of natural language processing.

Background art

Analyzing and building phrase treebanks is very important for linguistic research, for example for extracting syntactic patterns and investigating linguistic phenomena. Treebanks are also commonly used to train word segmenters, parsers, and semantic role labelers, and these systems in turn underpin applications such as information extraction, machine translation, question answering, and text classification. In recent years, with the rapid development of machine learning and artificial intelligence, the automatic construction of phrase treebanks has become increasingly important.

Phrase-structure parsing automatically derives the grammatical structure of a sentence under a given grammar formalism: it identifies the syntactic units the sentence contains and the relations between them (Allen 1995), converting the sentence into a structured parse tree. A phrase tree is composed, according to specific grammar rules, of three kinds of symbols: terminals, nonterminals, and phrase markers. Under the grammar rules, several terminals form a phrase, which then takes part in the next reduction as a nonterminal, until the whole sentence is reduced to the root node.

Little research exists on Vietnamese phrase treebanks. Existing work on Vietnamese mainly includes the following. Nguyen C T, Nguyen T K, et al. (2006) built a Vietnamese word segmentation model using CRFs and SVMs. Le H P, Nguyen T M H, Romary L, et al. (2006) proposed a lexical link grammar for Vietnamese but did not describe how to apply it to phrase tree construction. Nguyen P T, Vu X L, Nguyen T M H, et al. (2009) only briefly outlined an approach to building Vietnamese syntax trees and reported no construction results. Dinh Dien, Thuy Ngan, Xuan Quang, et al. (2009) performed bilingual machine translation by building English-Vietnamese parallel syntax trees; the Vietnamese syntax trees built in that process have many problems, for example English and Vietnamese cannot be aligned one to one, which makes the accuracy of the Vietnamese syntax trees very low.

To address the scarcity of Vietnamese phrase treebanks and the difficulty of building them, the invention provides a new chunk-based method for constructing Vietnamese phrase trees. The method automatically derives the phrase structure tree of a Vietnamese sentence, solving the problem of constructing a Vietnamese phrase treebank. The resulting treebank provides strong support for upper-layer applications such as Vietnamese parsing, machine translation, and information extraction.

Summary of the invention

The invention provides a chunk-based method for constructing Vietnamese phrase trees. It addresses the difficulty of manually annotating a Vietnamese phrase treebank, the inconvenience of building a large Vietnamese phrase treebank, and the low accuracy and long construction time of conventional Vietnamese treebank construction methods. Compared with building a Vietnamese phrase treebank with a context-free grammar or with a maximum entropy model, the proposed method achieves significantly higher accuracy. The treebank built by the invention also provides strong support for upper-layer applications such as Vietnamese parsing, machine translation, and information extraction.

The technical solution of the invention is a chunk-based Vietnamese phrase tree construction method, the specific steps of which are as follows:

Step1: First, perform upper-level and base-level chunk annotation on the Vietnamese phrase trees labeled with the phrase tag set, and use the annotated phrase trees as the training corpus. A training corpus obtained in this way is more accurate, so the feature sets derived from it are more accurate as well.

Step2: Select the feature sets for upper-level and base-level chunks, adjust the CRF model according to the training corpus, and train the improved CRF model. Use the improved CRF model to build an upper-level chunk model and a base-level chunk model, and combine the two into a chunk-based Vietnamese phrase treebank construction model. A construction model built with the improved CRF model constructs Vietnamese phrase treebanks more effectively and with higher quality.

Step3: Use a chunk analysis tool to perform chunk analysis on word-segmented Vietnamese sentences to obtain a chunk corpus, then perform base-level and upper-level chunk analysis on that corpus to obtain a preliminary chunk-based Vietnamese phrase treebank. Building the treebank with the chunk-based construction model is significantly more accurate than building it with a context-free grammar or with a maximum entropy model.

Step4: Use a phrase treebank corrector to correct the preliminary chunk-based Vietnamese phrase treebank, yielding the final Vietnamese phrase treebank. Correcting the preliminary treebank ensures the quality of the final treebank, which can then provide corpus support for upper-layer applications such as machine translation and information extraction.

As a preferred scheme of the invention, in Step1 the upper-level and base-level chunk annotation of the manually annotated Vietnamese phrase trees proceeds as follows:

Step1.1: Formulate the tag set for Vietnamese phrase trees based on the linguistic features of Vietnamese combined with the annotation scheme of CTB, the Penn Chinese Treebank.

Step1.2: Based on the definitions of upper-level and base-level chunks, complete the upper-level and base-level chunk annotation of the Vietnamese phrase trees labeled with the tag set.

Step1.3: Use the annotated Vietnamese phrase trees, composed of upper-level and base-level chunks, as the training corpus.

As a preferred scheme of the invention, the specific steps of Step2 are as follows:

Step2.1: Adjust the CRF model according to the training corpus and train the improved CRF model.

Step2.2: Select the feature sets for upper-level and base-level chunks.

Step2.3: Using the selected feature sets and the improved CRF model, build the upper-level chunk model and the base-level chunk model, and combine the two into the chunk-based Vietnamese phrase treebank construction model.

As a preferred scheme of the invention, the specific steps of Step3 are as follows:

Step3.1: Perform chunk analysis on the word-segmented Vietnamese sentences to obtain a Vietnamese chunk corpus.

Step3.2: Use the upper-level chunk model and the base-level chunk model to perform base-level and upper-level chunk analysis on the chunk corpus, yielding the preliminary chunk-based Vietnamese phrase treebank.

The beneficial effects of the invention are:

1. Compared with building a Vietnamese phrase treebank with a context-free grammar or with a maximum entropy model, the proposed method achieves significantly higher accuracy. The treebank built by the invention also provides strong support for upper-layer applications such as Vietnamese parsing, machine translation, and information extraction.

2. A relatively large-scale Vietnamese phrase tree corpus is constructed.

3. The proposed method eliminates the process of manually annotating the Vietnamese phrase treebank, greatly saving labor and treebank construction time.

Brief description of the drawings

Fig. 1 is a flow chart of the invention.

Detailed description of the embodiments

Embodiment 1: As shown in Fig. 1, a chunk-based Vietnamese phrase tree construction method, the specific steps of which are as follows:

Embodiment 2: As shown in Fig. 1, a chunk-based Vietnamese phrase tree construction method. This embodiment is the same as Embodiment 1, wherein, as a preferred scheme of the invention, in Step1 the upper-level and base-level chunk annotation of the manually annotated Vietnamese phrase trees proceeds as follows:

Embodiment 3: As shown in Fig. 1, a chunk-based Vietnamese phrase tree construction method. This embodiment is the same as Embodiment 2, wherein, as a preferred scheme of the invention, the specific steps of Step2 are as follows:

Embodiment 4: As shown in Fig. 1, a chunk-based Vietnamese phrase tree construction method. This embodiment is the same as Embodiment 3, wherein, as a preferred scheme of the invention, the specific steps of Step3 are as follows:

Embodiment 5: As shown in Fig. 1, a chunk-based Vietnamese phrase tree construction method, the specific steps of which are as follows:

Step3: Use a chunk analysis tool to perform chunk analysis on 27,000 word-segmented Vietnamese sentences to obtain a chunk corpus, then perform base-level and upper-level chunk analysis on that corpus to obtain a preliminary chunk-based Vietnamese phrase treebank of 27,000 trees. Building the treebank with the chunk-based construction model is significantly more accurate than building it with a context-free grammar or with a maximum entropy model.

Specifically, in Step1, upper-level and base-level chunk annotation is first performed on 5,000 manually annotated Vietnamese phrase trees, and the annotated phrase trees are used as the training corpus.

Building the Vietnamese phrase treebank corpus is the foundation of Vietnamese phrase tree construction. Only with a high-quality corpus can further information processing work be built on it. The phrase treebank corpus is also an indispensable part of research on chunk-based Vietnamese phrase treebank construction. The specific steps for building the corpus are as follows:

1) Formulate the tag set for Vietnamese phrase trees based on the linguistic features of Vietnamese combined with the annotation scheme of CTB, the Penn Chinese Treebank;

Vietnamese belongs to the Austroasiatic language family and is the national language of Vietnam. Every language has its own word order, and Vietnamese relies mainly on the order of constituents to convey important syntactic information. Although the Vietnamese script is derived from a variant of the Latin alphabet, Vietnamese has three notable features that distinguish it from Western languages. The features that most strongly affect the construction of a Vietnamese phrase treebank are as follows:

First, the smallest unit of Vietnamese is the syllable. A word consists of one syllable (for example, the word for "beautiful") or of several syllables (for example, the word for "girl"). Like many Asian languages (such as Chinese, Japanese, and Thai), Vietnamese has no word delimiter. Spaces separate syllables, not words, so a Vietnamese sentence can usually be segmented into words in many different ways.
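The segmentation ambiguity described above can be illustrated with a small sketch. It enumerates every way of grouping consecutive syllables into words from a lexicon. The function name `segmentations`, the toy lexicon, and the example syllables are illustrative assumptions and do not come from the patent; a real segmenter would score the candidates rather than list them.

```python
from typing import List, Set


def segmentations(syllables: List[str], lexicon: Set[str]) -> List[List[str]]:
    """Return every grouping of consecutive syllables into lexicon words."""
    if not syllables:
        return [[]]
    results = []
    for end in range(1, len(syllables) + 1):
        # Candidate word made of the first `end` syllables, joined by spaces.
        word = " ".join(syllables[:end])
        if word in lexicon:
            for rest in segmentations(syllables[end:], lexicon):
                results.append([word] + rest)
    return results


# Toy lexicon chosen for illustration only (assumption, not from the patent).
TOY_LEXICON = {"học", "sinh", "học sinh", "sinh học"}
CANDIDATES = segmentations(["học", "sinh", "học"], TOY_LEXICON)
for candidate in CANDIDATES:
    print(candidate)
```

Even three syllables admit several word sequences here, which is why word segmentation has to precede chunk analysis.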

Second, Vietnamese is an isolating language: words do not change form, and their grammatical function is determined by their position in the sentence. Word order is therefore the most important means of expressing meaning in Vietnamese grammar, and a change of word order can change the meaning. For example, "… còn" means "son", whereas "còn …" means "human being". Word order in a Vietnamese sentence generally proceeds from general to specific: the more general a word's meaning, the earlier it appears in the sentence, and the more specific its meaning, the later it appears. For example: Anh … mua (he buys) táo (apple).

Finally, Vietnamese has a fairly fixed word order, namely subject-verb-object (SVO). That is, the usual order is subject + predicate + object. For example: Kia (that) là (is) … (some) … (classifier) nhà (house) vách … (earthen wall). Analysis of Vietnamese grammar also shows that attributives and adverbials are characteristically placed after the words they modify. For example: … (I usually eat) … quán (at a restaurant).

Based on the above features of Vietnamese and the annotation scheme of CTB (the Penn Chinese Treebank), a tag set for Vietnamese phrase trees is formulated. Part of the tag set is shown in Table 1.

Table 1: Part of the Vietnamese phrase tree tag set

Phrase type tag	Phrase type description
NP	Noun phrase
VP	Verb phrase
PP	Prepositional phrase
AP	Adjective phrase

2) Based on the definitions of upper-level and base-level chunks, complete the upper-level and base-level chunk annotation of the 5,000 Vietnamese phrase trees labeled with the tag set;

To train the base-level chunk analysis model and the upper-level chunk analysis model separately, all chunks in a syntax tree must first be divided into two parts: a base-level chunk set and an upper-level chunk set. To define the two precisely, the height of each node in a syntax tree is first defined as follows: the height of every terminal node (word) is zero, and the height of any nonterminal node is the maximum height of its child nodes plus a fixed value of 1. Next, the levels of a syntax tree in Penn Treebank format are defined as follows: a complete Penn Treebank format syntax tree can be divided into several levels, the number of levels equals the height of the root node, and each level consists of an ordered set of subtrees. The set of subtrees formed by the terminal nodes is level 0. The level-n subtree set consists of the subtrees whose height is less than or equal to n; if this set contains a subtree that is itself contained in a larger subtree of the set, only the larger subtree is kept and the contained subtree is discarded. For the parsing model implemented here, the chunk set corresponding to the ordered subtree set of level 2 is called the base-level chunk set, and the chunk sets corresponding to all subtree sets above level 2 are collectively called the upper-level chunk set.
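The height and level definitions above can be made concrete with a short sketch. The nested-tuple tree encoding, the function names, and the toy tree (including the tags S, N, and V) are assumptions made for illustration; only NP and VP appear in the patent's tag set excerpt.

```python
from typing import List, Tuple, Union

# A tree is either a word (str) or a tuple (label, child, child, ...).
Tree = Union[str, Tuple]


def height(node: Tree) -> int:
    """Terminal nodes (words) have height 0; a nonterminal is 1 + max child height."""
    if isinstance(node, str):
        return 0
    return 1 + max(height(child) for child in node[1:])


def level(node: Tree, n: int) -> List[Tree]:
    """Ordered subtrees of height <= n, keeping only the largest (maximal) ones."""
    if height(node) <= n:
        return [node]
    found: List[Tree] = []
    for child in node[1:]:
        found.extend(level(child, n))
    return found


def upper_level_subtrees(node: Tree) -> List[Tree]:
    """All subtrees whose height is greater than 2, in pre-order."""
    if isinstance(node, str):
        return []
    found = [node] if height(node) > 2 else []
    for child in node[1:]:
        found.extend(upper_level_subtrees(child))
    return found


# Toy tree (assumed for illustration): (S (NP (N Anh)) (VP (V mua) (NP (N táo))))
TREE = ("S", ("NP", ("N", "Anh")), ("VP", ("V", "mua"), ("NP", ("N", "táo"))))

BASE_LABELS = [t[0] for t in level(TREE, 2)]
UPPER_LABELS = [t[0] for t in upper_level_subtrees(TREE)]
print(height(TREE), BASE_LABELS, UPPER_LABELS)
```

In this toy tree the level-2 set contains the two NP subtrees plus the V node that is not covered by any height-2 subtree, while the VP and S subtrees fall into the upper-level set.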

Following the definitions of upper-level and base-level chunks given above, the upper-level and base-level chunk annotation of the 5,000 Vietnamese phrase trees is completed manually.

3) Use the annotated Vietnamese phrase trees, composed of upper-level and base-level chunks, as the training corpus;

The manually completed upper-level and base-level chunk annotation of the 5,000 Vietnamese phrase trees serves as the training corpus for the upper-level and base-level chunk models.

In Step2, the feature sets for upper-level and base-level chunks are selected, the CRF model is adjusted according to the training corpus, and the improved CRF model is trained. The improved CRF model is used to build the upper-level chunk model and the base-level chunk model, and the two are combined into the chunk-based Vietnamese phrase treebank construction model.

Based on the Vietnamese phrase tree corpus built above, the improved CRF model is trained to obtain the upper-level and base-level chunk models, which are then combined into the chunk-based Vietnamese phrase treebank construction model.

1) Adjust the CRF model according to the training corpus and train the improved CRF model;

Sequence labeling is an important task in fields including bioinformatics, computational linguistics, and speech recognition. In natural language processing, part-of-speech tagging and chunk analysis are both typical sequence labeling tasks, in which labels are assigned to an observed sequence. In chunk analysis, for example, a sequence labeling model labels the input sentence so that the tokens of a subsequence that can form a new chunk receive the same label. For sequence labeling, the first model that comes to mind is the hidden Markov model (HMM). An HMM is a generative model: it models the observation sequence random variable X together with the corresponding label random variable Y and computes their joint probability distribution P(X, Y). A serious problem with joint probability models is that they must enumerate all possible observation sequences, which is intractable in many domains. A model that turns the problem into a tractable one is therefore needed, and the conditional probability model is such a model. It computes the conditional probability P(Y | X) of the label random variable Y given the observation random variable X rather than the joint probability P(X, Y), which greatly simplifies the problem.

The conditional random field (CRF) model is a probabilistic framework based on the conditional probability distribution and is a typical discriminative model. Compared with other sequence labeling models, it has several inherent advantages. First, compared with the hidden Markov model, its independence assumptions are relatively relaxed. Second, compared with maximum entropy Markov models (MEMMs) and other conditional Markov models based on directed graphs, it avoids the label bias problem. Conditional random field models therefore perform well on many practical tasks.

Lafferty et al. define the probability of a label sequence y given an observation sequence x as a normalized product of potential functions, each of the form shown in Formula 1.

\exp\left(\sum_{j} \lambda_{j} t_{j}(y_{i-1}, y_{i}, x, i) + \sum_{k} \mu_{k} s_{k}(y_{i}, x, i)\right) \qquad (1)

Here t_{j}(y_{i-1}, y_{i}, x, i) is a transition feature function of the entire observation sequence and of the labels at positions i and i-1; s_{k}(y_{i}, x, i) is a state feature function of the label at position i and the observation sequence; and \lambda_{j} and \mu_{k} are the parameters of these two kinds of functions, to be estimated from the training data.

When defining feature functions, a real-valued function e(x, i) of the observation sequence is constructed to describe some characteristic of the distribution of the training data. A concrete example of e(x, i) in chunk analysis is given by Formula 2.

To simplify notation, the state feature function is written as shown in Formula 3.

s(y_{i}, x, i) = s(y_{i-1}, y_{i}, x, i) \qquad (3)

The global feature function of the conditional random field model for a given observation sequence x and label sequence y is then defined by Formula 4.

F_{j}(y, x) = \sum_{i=1}^{n} f_{j}(y_{i-1}, y_{i}, x, i) \qquad (4)

In Formula 4, each f_{j}(y_{i-1}, y_{i}, x, i) is either a state feature function s(y_{i-1}, y_{i}, x, i) or a transition feature function t(y_{i-1}, y_{i}, x, i). For a given observation sequence x, the probability distribution of its label sequence y can therefore be written in the form of Formula 5.

p(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\left(\sum_{j} \lambda_{j} F_{j}(y, x)\right) \qquad (5)

where Z(x) is the normalization factor.
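To make the global feature function and the normalized probability concrete, the following sketch evaluates a linear-chain CRF by brute force on a tiny example. The two feature functions, their weights, the B/I/O label set, and the start symbol are hypothetical choices for illustration; the patent does not specify them, and a practical implementation would compute Z(x) with dynamic programming instead of enumeration.

```python
import itertools
import math
from typing import List, Sequence

LABELS = ["B", "I", "O"]   # assumed label set for illustration
START = "<s>"              # assumed label "before" the first position


def t_b_then_i(y_prev: str, y_cur: str, x: Sequence[str], i: int) -> float:
    """Transition feature: fires when label I follows label B."""
    return 1.0 if (y_prev == "B" and y_cur == "I") else 0.0


def s_noun_begins(y_prev: str, y_cur: str, x: Sequence[str], i: int) -> float:
    """State feature: fires when a noun tag is labeled B (ignores y_prev)."""
    return 1.0 if (x[i] == "N" and y_cur == "B") else 0.0


FEATURES = [t_b_then_i, s_noun_begins]
WEIGHTS = [0.8, 1.5]       # hypothetical lambda values, not trained


def global_feature(f, y: Sequence[str], x: Sequence[str]) -> float:
    """F_j(y, x): sum of f_j(y_{i-1}, y_i, x, i) over all positions i."""
    total = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else START
        total += f(y_prev, y[i], x, i)
    return total


def score(y: Sequence[str], x: Sequence[str]) -> float:
    """Weighted sum of global features, i.e. the exponent in the probability."""
    return sum(w * global_feature(f, y, x) for w, f in zip(WEIGHTS, FEATURES))


def probability(y: Sequence[str], x: Sequence[str]) -> float:
    """p(y | x): exp(score) divided by Z(x), summed over all label sequences."""
    z = sum(math.exp(score(c, x)) for c in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(y, x)) / z


X: List[str] = ["N", "N"]  # observation sequence: two noun tags
print(probability(["B", "I"], X), probability(["O", "O"], X))
```

Because Z(x) sums over every possible label sequence, the probabilities of all candidate sequences for a given x add up to one.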

2) Select the feature sets for upper-level and base-level chunks;

The preceding text introduced the concepts of chunk-based full parsing and its analysis process. As that description shows, the parsing system is built on chunk analysis, so the performance of chunk analysis directly limits the performance of parsing as a whole. If the chunk analysis module identifies exactly those phrases that can form new chunks, the syntax tree assembled from the correct chunks is also correct. The chunk analysis model used here is based on a sequence labeling model, namely conditional random fields (CRFs). Its performance therefore depends largely on the choice of features: a good feature set gives the model strong discriminative power and improves the accuracy of the analysis. This section introduces the features used by the baseline system of the chunk-based parsing model. According to how they are applied, these features fall into two broad classes: features for base-level chunk analysis and features for upper-level chunk analysis.

Base-level chunk analysis amounts to shallow parsing with conditional random fields (CRFs), so the features used at this level are similar to those used in shallow parsing. Table 2 gives the feature templates used by the base-level chunk analysis module of the baseline system; most of them come from the work of Sha and Pereira and of Yoshimasa Tsuruoka et al.

As Table 2 shows, base-level chunk analysis uses only features of parts of speech and words. This is because base-level chunk analysis is the first layer of analysis applied to the input, and an input test sentence is simply a word sequence with part-of-speech tags, so only these two kinds of features are available.

Table 2: Features used by base-level chunk analysis in the baseline system

Feature class	Feature representation	Feature description
POS Unigram	p_i, i ∈ {-2, -1, 0, 1, 2}	Part-of-speech unigram feature
POS Bigram	p_i p_{i+1}, i ∈ {-2, -1, 0, 1}	Adjacent part-of-speech bigram feature
POS Trigram	p_{i-1} p_i p_{i+1}, i ∈ {-2, -1, 0, 1, 2}	Adjacent part-of-speech trigram feature
Word Unigram	w_i, i ∈ {-2, -1, 0, 1, 2}	Word unigram feature
Word Bigram	w_i w_{i+1}, i ∈ {-2, -1, 0, 1}	Adjacent word bigram feature
Word Trigram	w_{i-1} w_i w_{i+1}, i ∈ {0}	Adjacent word trigram feature
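The templates in Table 2 can be read as a recipe for turning each token position into a set of named features. The sketch below is one possible reading, in which the offset i is taken relative to the current position t; the function name, the padding symbol, the feature-name strings, and the example tags are assumptions, not part of the patent.

```python
from typing import Dict, List

PAD = "<PAD>"  # hypothetical symbol for positions outside the sentence


def at(seq: List[str], k: int) -> str:
    """Return seq[k], or the padding symbol when k is out of range."""
    return seq[k] if 0 <= k < len(seq) else PAD


def base_level_features(words: List[str], tags: List[str], t: int) -> Dict[str, str]:
    """Instantiate the Table 2 templates at token position t."""
    feats: Dict[str, str] = {}
    for i in (-2, -1, 0, 1, 2):
        feats[f"p[{i}]"] = at(tags, t + i)        # POS unigram
        feats[f"w[{i}]"] = at(words, t + i)       # word unigram
        feats[f"p[{i-1}]p[{i}]p[{i+1}]"] = "|".join(
            at(tags, t + i + d) for d in (-1, 0, 1))   # POS trigram
    for i in (-2, -1, 0, 1):
        feats[f"p[{i}]p[{i+1}]"] = "|".join(
            at(tags, t + i + d) for d in (0, 1))       # POS bigram
        feats[f"w[{i}]w[{i+1}]"] = "|".join(
            at(words, t + i + d) for d in (0, 1))      # word bigram
    # Word trigram is only taken at i = 0.
    feats["w[-1]w[0]w[1]"] = "|".join(at(words, t + d) for d in (-1, 0, 1))
    return feats


WORDS = ["Anh", "mua", "táo"]
TAGS = ["N", "V", "N"]     # assumed part-of-speech tags
FEATS = base_level_features(WORDS, TAGS, 1)
print(len(FEATS), FEATS["w[-1]w[0]w[1]"])
```

Each position yields 24 features under this reading: 10 unigrams, 8 bigrams, 5 part-of-speech trigrams, and 1 word trigram.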

In the chunk-based parsing described here, the chunk analysis performed at the levels above the base level is called upper-level analysis. Base-level chunk analysis operates on words and parts of speech, whereas upper-level chunk analysis operates on chunks. Because each chunk corresponds to a subtree in this parsing system, upper-level analysis can use features based on syntactic structure, for example the nonterminal syntactic label, the head word of the subtree and its part of speech, and the boundary node information of the subtree. Table 3 lists the feature templates used by upper-level chunk analysis in the baseline system; some of them come from the work of Yoshimasa Tsuruoka et al. As Table 3 shows, the baseline system uses three major classes of features: nonterminal label features, head word features, and head word part-of-speech features. With these three classes, the chunk-based parsing system already reaches fairly high performance. These features alone, however, do not fully exploit the information provided by the lower-level chunks (each of which corresponds to a syntactic subtree), and this limits the baseline system. Later sections therefore introduce additional features and corresponding methods that improve the performance of the baseline system.

Table 3: Feature templates used by upper-level chunk analysis in the baseline system

3) Using the selected feature sets and the improved CRF model, build the upper-level chunk model and the base-level chunk model, and combine the two into the chunk-based Vietnamese phrase treebank construction model;

Chunk analysis can be converted into a sequence labeling problem; what follows describes how the full parsing problem is converted into a chunk analysis problem. In their paper, Yoshimasa Tsuruoka et al. perform parsing in two stages, which they call base-level chunking and upper-level chunking. Two stages are used because the features available to base-level and upper-level chunk analysis differ. The input to base-level chunk analysis is a sentence containing only words and their parts of speech, so words and parts of speech are the only features it can use. The output of base-level chunk analysis is a chunk sequence; since each chunk can be represented as a subtree, the chunk sequence can also be viewed as a subtree sequence. This result is passed to upper-level chunk analysis, which can therefore use richer features: in addition to the basic word and part-of-speech features, it can use the syntactic information of the subtrees. To make better use of the conditional random field model and of more features, the chunk-based full parsing model is divided into two parts: a base-level chunk analysis model and an upper-level chunk analysis model.
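One common way to cast chunk analysis as sequence labeling is a begin/inside/outside (BIO) encoding, in which every token receives one label and consecutive tokens of the same chunk share the chunk type. The patent does not name its label scheme, so the BIO scheme, the function names, and the toy chunks below are assumptions used only to show the conversion in both directions.

```python
from typing import List, Tuple

Chunk = Tuple[str, List[str]]      # (chunk type, tokens); type "O" means no chunk
Tagged = Tuple[str, str]           # (token, BIO label)


def chunks_to_bio(chunks: List[Chunk]) -> List[Tagged]:
    """Flatten chunks into one (token, label) pair per token."""
    tagged: List[Tagged] = []
    for label, tokens in chunks:
        for k, token in enumerate(tokens):
            if label == "O":
                tag = "O"
            else:
                tag = ("B-" if k == 0 else "I-") + label
            tagged.append((token, tag))
    return tagged


def bio_to_chunks(tagged: List[Tagged]) -> List[Chunk]:
    """Group a labeled token sequence back into chunks."""
    chunks: List[Chunk] = []
    for token, tag in tagged:
        if tag == "O":
            chunks.append(("O", [token]))
        elif tag.startswith("B-") or not chunks or chunks[-1][0] != tag[2:]:
            # A B- label, or an I- label with no matching open chunk, starts a chunk.
            chunks.append((tag[2:], [token]))
        else:
            chunks[-1][1].append(token)
    return chunks


# Toy chunking (assumed for illustration).
CHUNKS = [("NP", ["Anh"]), ("VP", ["mua", "táo"])]
BIO = chunks_to_bio(CHUNKS)
print(BIO)
print(bio_to_chunks(BIO))
```

With such an encoding, a sequence labeling model only has to predict one label per token, and the chunks are recovered afterwards by grouping the labels.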

The two models must also be trained separately when training the chunk-based parsing model. Specifically, the base-level chunks in the training treebank are used to train the base-level chunk model, and the upper-level chunks in the training treebank are used to train the upper-level chunk model. To train the two models separately, all chunks in a syntax tree must first be divided into two parts: a base-level chunk set and an upper-level chunk set. To define the two precisely, the height of each node in a syntax tree is first defined as follows: the height of every terminal node (word) is zero, and the height of any nonterminal node is the maximum height of its child nodes plus a fixed value of 1. Next, the levels of a syntax tree in Penn Treebank format are defined as follows: a complete Penn Treebank format syntax tree can be divided into several levels, the number of levels equals the height of the root node, and each level consists of an ordered set of subtrees. The set of subtrees formed by the terminal nodes is level 0. The level-n subtree set consists of the subtrees whose height is less than or equal to n; if this set contains a subtree that is itself contained in a larger subtree of the set, only the larger subtree is kept and the contained subtree is discarded. For the parsing model implemented here, the chunk set corresponding to the ordered subtree set of level 2 is called the base-level chunk set, and the chunk sets corresponding to all subtree sets above level 2 are collectively called the upper-level chunk set.

In step Step3, a chunk parsing tool is used to perform chunk parsing on 27,000 word-segmented Vietnamese sentences to obtain a chunk corpus; base-layer and upper-layer chunk analysis is then performed on this corpus, yielding a preliminary chunk-based Vietnamese phrase treebank of 27,000 trees.

1) Perform chunk parsing on the 27,000 word-segmented Vietnamese sentences to obtain a Vietnamese chunk corpus of 27,000 sentences.

First, a word segmentation tool is used to segment the 27,000 Vietnamese sentences obtained; a chunk parsing tool is then applied to the 27,000 segmented sentences.

2) Use the upper-layer chunk model and the base-layer chunk model obtained earlier to perform base-layer and upper-layer chunk analysis on the chunk corpus, finally obtaining 27,000 preliminary chunk-based Vietnamese phrase trees.

The upper-layer chunk model and the base-layer chunk model obtained in Step2.3 are applied to the chunk corpus to perform base-layer and upper-layer chunk analysis, finally yielding 27,000 Vietnamese phrase trees.
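How two passes of chunk analysis can turn a tagged sentence into a bracketed phrase tree can be sketched as follows. This is a hedged illustration, not the patented procedure: it assumes the chunk models emit BIO-style tags (`B-X`, `I-X`, `O`), which the text does not state, and the tags are hard-coded here where the trained base-layer and upper-layer models would supply them. The function name `group` and the example labels are hypothetical.

```python
def group(items, tags):
    """Merge bracketed items into labelled constituents according to BIO tags.

    items: bracketed strings such as "(N sách)" or "(NP (N sách))"
    tags:  one tag per item: "B-X" opens chunk X, "I-X" continues it, "O" is outside
    """
    spans = []  # each entry is [label or None, [member items]]
    for item, tag in zip(items, tags):
        label = None if tag == "O" else tag[2:]
        if tag.startswith("I-") and spans and spans[-1][0] == label:
            spans[-1][1].append(item)        # continue the currently open chunk
        else:
            spans.append([label, [item]])    # start a new chunk or pass the item through
    return [f"({label} {' '.join(members)})" if label else members[0]
            for label, members in spans]


# Word-segmented, POS-tagged input (hypothetical example).
words = ["(P Tôi)", "(V đọc)", "(N sách)"]

# Pass 1: base-layer chunk tags (assumed output of the base-layer chunk model).
base = group(words, ["B-NP", "O", "B-NP"])

# Pass 2: upper-layer chunk tags over the base-layer result (assumed model output).
upper = group(base, ["O", "B-VP", "I-VP"])

# Wrap the remaining constituents under a root node.
phrase_tree = "(S " + " ".join(upper) + ")"
```

Applying the same grouping function at both layers keeps the sketch small; a real system would repeat the upper-layer pass as many times as the tree has levels above layer 2.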

In step Step4, a phrase treebank corrector is used to correct the preliminary chunk-based Vietnamese phrase treebank, yielding the final corrected Vietnamese phrase treebank.

The preliminary Vietnamese phrase treebank obtained in Step3 has some quality problems, caused mainly by the insufficient accuracy of the Vietnamese chunk corpus obtained in Step2. To address this problem, the phrase tree corrector is used to correct the preliminary Vietnamese phrase treebank, finally yielding a Vietnamese phrase treebank of higher quality.

The present invention first annotates 5,000 manually annotated Vietnamese phrase trees with subtree layers, base-layer chunk sets and upper-layer chunk sets to serve as the training treebank. It then selects the feature sets of the upper-layer and base-layer chunks, uses CRF to build the upper-layer and base-layer chunk models, and converts the results of chunk parsing into Vietnamese phrase trees. Next, a chunk parsing tool is used to perform chunk parsing on the 27,000 word-segmented Vietnamese sentences to obtain a chunk corpus, and base-layer and upper-layer chunk analysis is carried out on this corpus to obtain 27,000 Vietnamese phrase trees. Finally, the phrase treebank corrector is used to correct the newly generated Vietnamese phrase treebank, yielding the final Vietnamese phrase treebank.

The experimental results are shown in Table 4. As Table 4 shows, the Vietnamese phrase treebank generated by the chunk-based Vietnamese phrase treebank construction method achieves markedly higher precision (LP 83.45%) than the Vietnamese phrase treebank built with PCFG (80.64%) and the Vietnamese phrase treebank built with maximum entropy (78.69%).

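The evaluation uses the PARSEVAL syntactic parsing evaluation scheme, currently the most widely used evaluation standard. It consists mainly of three metrics: precision (LP), recall (LR) and the F value, where the F value takes both precision and recall into account. They are defined as follows:

LP = (number of correct constituents in the parser output) / (total number of constituents in the parser output) × 100%

LR = (number of correct constituents in the parser output) / (total number of constituents in the gold-standard trees) × 100%

F = 2 × LP × LR / (LP + LR)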
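The three PARSEVAL metrics can be computed from sets of labelled constituents, as in the minimal sketch below. It assumes the usual reading of the metrics, with a constituent counted as correct when its label and span both match the gold-standard tree and with F taken as the balanced harmonic mean of LP and LR. The function name `parseval` and the example spans are hypothetical, and details handled by full PARSEVAL scorers (for example duplicate constituents or ignored labels) are left out.

```python
def parseval(gold, predicted):
    """Return (LP, LR, F) in percent for two sets of (label, start, end) constituents."""
    correct = len(gold & predicted)          # constituents matching in label and span
    lp = 100.0 * correct / len(predicted)    # precision
    lr = 100.0 * correct / len(gold)         # recall
    f = 2 * lp * lr / (lp + lr) if lp + lr else 0.0
    return lp, lr, f


# Hypothetical three-word sentence: the parser attaches one NP with the wrong span.
gold = {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("NP", 2, 3)}
predicted = {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("NP", 1, 3)}

lp, lr, f = parseval(gold, predicted)
```

Here three of the four predicted constituents are correct and three of the four gold constituents are recovered, so all three metrics come out at 75%.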

Table 4. Comparison of other methods with the method of the present invention

Method	LR (%)	LP (%)	F value (%)
Vietnamese phrase treebank built with PCFG	81.36	80.64	81.00
Vietnamese phrase treebank built with maximum entropy	79.83	78.69	79.26
New Vietnamese phrase treebank built based on chunks	86.32	83.45	85.66

The specific embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the above embodiments. Within the scope of knowledge possessed by those of ordinary skill in the art, various changes may also be made without departing from the spirit of the present invention.

Claims

1. A chunk-based Vietnamese phrase tree construction method, characterized in that the specific steps of the chunk-based Vietnamese phrase tree construction method are as follows:

Step1: first perform upper-layer chunk and base-layer chunk annotation on the Vietnamese phrase tree annotation set, and use the annotated phrase trees as the training corpus;

Step2: select the feature sets of the upper-layer chunks and base-layer chunks; adjust the CRF model according to the training corpus and train the improved CRF model; use the improved CRF model to build the upper-layer chunk model and the base-layer chunk model; and combine the upper-layer and base-layer chunk models into the chunk-based Vietnamese phrase treebank construction model;

Step3: use a chunk parsing tool to perform chunk parsing on the word-segmented Vietnamese sentences to obtain a chunk corpus, and perform base-layer and upper-layer chunk analysis on this corpus to obtain the preliminary chunk-based Vietnamese phrase treebank;

Step4: use a phrase treebank corrector to correct the preliminary chunk-based Vietnamese phrase treebank, finally obtaining the corrected final Vietnamese phrase treebank.

2. The chunk-based Vietnamese phrase tree construction method according to claim 1, characterized in that in step Step1 the specific steps for performing upper-layer chunk and base-layer chunk annotation on the manually annotated Vietnamese phrase trees are as follows:

Step1.1: formulate the annotation set for Vietnamese phrase trees according to the linguistic characteristics of Vietnamese, in combination with the annotation scheme of CTB, i.e. the Chinese Penn Treebank;

Step1.2: in accordance with the definitions of upper-layer chunks and base-layer chunks, complete the upper-layer chunk and base-layer chunk annotation of the Vietnamese phrase tree annotation set;

Step1.3: use the annotated Vietnamese phrase trees, composed of upper-layer chunks and base-layer chunks, as the training corpus.

3. The chunk-based Vietnamese phrase tree construction method according to claim 1, characterized in that the specific steps of step Step2 are as follows:

Step2.3: using the selected feature sets of the upper-layer chunks and base-layer chunks and the improved CRF model, build the upper-layer chunk model and the base-layer chunk model, and combine the upper-layer and base-layer chunk models into the chunk-based Vietnamese phrase treebank construction model.

4. The chunk-based Vietnamese phrase tree construction method according to claim 1 or 3, characterized in that the specific steps of step Step3 are as follows:

Step3.2: use the upper-layer chunk model and the base-layer chunk model obtained to perform base-layer and upper-layer chunk analysis on the chunk corpus, finally obtaining the preliminary chunk-based Vietnamese phrase treebank.