Summary of the invention
The invention provides a chunk-based method for constructing a Vietnamese phrase treebank. It addresses three problems: manually annotating a Vietnamese phrase treebank is difficult, building a large Vietnamese phrase treebank is inconvenient, and conventional methods for constructing a Vietnamese treebank have low accuracy and take a long time. Compared with constructing a Vietnamese phrase treebank by context-free grammar or by maximum entropy, the proposed method significantly improves accuracy. The Vietnamese phrase treebank constructed by the invention also provides strong support for upper-layer applications such as Vietnamese syntactic analysis, machine translation and information extraction.
The technical scheme of the invention is a chunk-based Vietnamese phrase tree construction method, whose specific steps are as follows:
Step1: first, annotate the manually annotated Vietnamese phrase trees with upper-level chunks and base-level chunks, and take the annotated phrase trees as the training corpus. A training corpus obtained in this way is more accurate, so the feature set derived from it is also more accurate;
Step2: select the feature sets of the upper-level chunks and base-level chunks, adjust the CRF model according to the training corpus, and train the improved CRF model; use the improved CRF model to build an upper-level chunk model and a base-level chunk model, and combine the two models to obtain the chunk-based Vietnamese phrase treebank construction model. A construction model built with the improved CRF model constructs the Vietnamese phrase treebank more effectively and with higher quality;
Step3: use a chunk parsing tool to chunk the word-segmented Vietnamese sentences, thereby obtaining a chunk corpus; perform base-level and upper-level chunk parsing on this corpus to obtain the preliminary chunk-based Vietnamese phrase treebank. Constructing the treebank with the chunk-based construction model is significantly more accurate than constructing it by context-free grammar or by maximum entropy;
Step4: use a phrase treebank corrector to correct the preliminary chunk-based Vietnamese phrase treebank, finally obtaining the corrected final Vietnamese phrase treebank. Further correcting the preliminary treebank ensures the quality of the final Vietnamese phrase treebank, so that it can provide corpus support for upper-layer applications such as machine translation and information extraction.
As a preferred scheme of the invention, in step Step1 the upper-level and base-level chunk annotation of the manually annotated Vietnamese phrase trees proceeds as follows:
Step1.1: formulate the annotation set for Vietnamese phrase trees according to the linguistic features of Vietnamese, combined with the annotation scheme of the CTB, i.e. the Penn Chinese Treebank;
Step1.2: using the definitions of upper-level chunks and base-level chunks, complete the upper-level and base-level chunk annotation of the annotated Vietnamese phrase trees;
Step1.3: take the annotated Vietnamese phrase trees, composed of upper-level and base-level chunks, as the training corpus.
As a preferred scheme of the invention, the specific steps of step Step2 are as follows:
Step2.1: adjust the CRF model according to the training corpus, and train the improved CRF model;
Step2.2: select and set the feature sets of the upper-level chunks and base-level chunks;
Step2.3: use the selected feature sets and the improved CRF model to build the upper-level chunk model and the base-level chunk model, and combine the two models to obtain the chunk-based Vietnamese phrase treebank construction model.
As a preferred scheme of the invention, the specific steps of step Step3 are as follows:
Step3.1: chunk the word-segmented Vietnamese sentences to obtain a Vietnamese chunk corpus;
Step3.2: use the upper-level chunk model and the base-level chunk model to perform base-level and upper-level chunk parsing on the chunk corpus, finally obtaining the preliminary chunk-based Vietnamese phrase treebank.
The beneficial effects of the invention are:
1. Compared with constructing a Vietnamese phrase treebank by context-free grammar or by maximum entropy, the proposed method significantly improves accuracy. The Vietnamese phrase treebank constructed by the invention also provides strong support for upper-layer applications such as Vietnamese syntactic analysis, machine translation and information extraction;
2. A relatively large Vietnamese phrase tree corpus is constructed;
3. The proposed method removes the need to annotate the Vietnamese phrase treebank manually, greatly saving manpower and treebank construction time.
Embodiment 5: as shown in Figure 1, a chunk-based Vietnamese phrase tree construction method, whose specific steps are as follows:
Step1: first, annotate the manually annotated Vietnamese phrase trees with upper-level chunks and base-level chunks, and take the annotated phrase trees as the training corpus. A training corpus obtained in this way is more accurate, so the feature set derived from it is also more accurate;
Step2: select the feature sets of the upper-level chunks and base-level chunks, adjust the CRF model according to the training corpus, and train the improved CRF model; use the improved CRF model to build an upper-level chunk model and a base-level chunk model, and combine the two models to obtain the chunk-based Vietnamese phrase treebank construction model. A construction model built with the improved CRF model constructs the Vietnamese phrase treebank more effectively and with higher quality;
Step3: use a chunk parsing tool to chunk 27,000 word-segmented Vietnamese sentences, thereby obtaining a chunk corpus; perform base-level and upper-level chunk parsing on this corpus to obtain 27,000 preliminary chunk-based Vietnamese phrase trees. Constructing the treebank with the chunk-based construction model is significantly more accurate than constructing it by context-free grammar or by maximum entropy;
Step4: use a phrase treebank corrector to correct the preliminary chunk-based Vietnamese phrase treebank, finally obtaining the corrected final Vietnamese phrase treebank. Further correcting the preliminary treebank ensures the quality of the final Vietnamese phrase treebank, so that it can provide corpus support for upper-layer applications such as machine translation and information extraction.
Specifically, in step Step1, 5,000 manually annotated Vietnamese phrase trees are first annotated with upper-level chunks and base-level chunks, and the annotated phrase trees are used as the training corpus;
Building the Vietnamese phrase treebank corpus is the foundation of Vietnamese phrase tree construction. Only once a high-quality corpus has been built can further information-processing work be carried out on top of it. The phrase treebank corpus is also an indispensable part of research on chunk-based Vietnamese phrase treebank construction. The specific steps for building the phrase treebank corpus are as follows:
1) Formulate the annotation set for Vietnamese phrase trees according to the linguistic features of Vietnamese, combined with the annotation scheme of the CTB, i.e. the Penn Chinese Treebank;
Vietnamese belongs to the Austroasiatic language family and is the native language of Vietnam. Every language has its own word order, and Vietnamese relies mainly on the order of constituents to convey important syntactic information. Although the Vietnamese script is derived from a variant of the Latin alphabet, Vietnamese has three notable features that distinguish it from Western languages. The features of Vietnamese that most strongly affect phrase treebank construction are as follows:
First, the smallest unit of Vietnamese is the syllable. A word may consist of a single syllable (for example, the word meaning "beautiful") or of several syllables (for example, the word meaning "girl"). Like many Asian languages (such as Chinese, Japanese and Thai), Vietnamese has no word delimiter. A space only separates syllables, not words, so a Vietnamese sentence can often be segmented in many different ways.
Second, Vietnamese is an isolating language: words do not change form, and their grammatical function is determined by their position in the sentence. In other words, word order is the most important means of expressing meaning in Vietnamese grammar, and changing the word order can change the meaning. For example, reversing the order of two syllables changes the meaning of an expression from "son" to "human". In addition, the words in a Vietnamese sentence are generally ordered by increasing specificity: the more general a word's meaning, the earlier it appears in the sentence, and the more specific its meaning, the later it appears. For example: Anh ... mua (he buys) táo (apples).
Finally, Vietnamese has a fairly fixed word order, namely subject-verb-object (SVO). That is, the general word order is subject + predicate + object. For example: Kia (that) là (is) ... (some) ... (classifier) nhà (house) vách ... (earthen wall). Analysis of the grammatical features of Vietnamese also shows that attributives and adverbials are clearly postposed. For example: ... (I usually eat) ... quán (in the restaurant).
Based on the above features of Vietnamese and the annotation scheme of the CTB (Penn Chinese Treebank), the annotation set for Vietnamese phrase trees is formulated. Part of this annotation set is shown in Table 1.
Table 1: Part of the Vietnamese phrase tree annotation set
Phrase type label | Phrase type description
NP | Noun phrase
VP | Verb phrase
PP | Prepositional phrase
AP | Adjective phrase
2) Using the definitions of upper-level chunks and base-level chunks, complete the upper-level and base-level chunk annotation of the 5,000 annotated Vietnamese phrase trees;
In order to train the base-level chunk parsing model and the upper-level chunk parsing model separately, all the chunks in a syntax tree must first be divided into two parts: the base-level chunk set and the upper-level chunk set. To define the two clearly, the height of each node in the syntax tree is first defined as follows: the height of each terminal node (word) is zero, and the height of any non-terminal node is the maximum height of its child nodes plus a fixed value of 1. Next, the levels of a syntax tree in Penn Treebank format are defined as follows: a complete syntax tree in Penn Treebank format can be divided into several levels, the number of levels equals the height of the root node, and each level consists of an ordered set of subtrees. The set of subtrees formed by the terminal nodes is level 0. The level-n subtree set consists of the subtrees whose height is less than or equal to n; if this set contains a subtree that is itself contained in a larger subtree of the set, only the larger subtree is kept and the contained smaller subtree is discarded. To meet the needs of the syntactic parse tree model implemented here, the chunk set corresponding to the level-2 ordered subtree set is called the base-level chunk set, and the chunk sets corresponding to all subtree sets above level 2 are collectively called the upper-level chunk set.
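The height and level definitions above can be illustrated with a minimal Python sketch. The tuple-based tree encoding, the function names and the three-word example sentence are assumptions made for illustration only; they are not part of the described method.

```python
# A tree node is either a word (str) or a pair (label, [children]).
# This encoding and the example tree below are illustrative assumptions.

def height(node):
    """Terminal nodes (words) have height 0; a non-terminal node has
    the maximum height of its children plus 1."""
    if isinstance(node, str):
        return 0
    _, children = node
    return 1 + max(height(child) for child in children)

def level(node, n):
    """Ordered list of the maximal subtrees whose height is <= n.
    A subtree contained in a larger qualifying subtree is discarded."""
    if height(node) <= n:
        return [node]
    _, children = node
    result = []
    for child in children:
        result.extend(level(child, n))
    return result

def label(node):
    return node if isinstance(node, str) else node[0]

# Hypothetical tree for the toy sentence "Anh mua táo".
TREE = ("S", [("NP", [("N", ["Anh"])]),
              ("VP", [("V", ["mua"]),
                      ("NP", [("N", ["táo"])])])])

base_level = level(TREE, 2)                       # base-level chunk set
upper_levels = {n: level(TREE, n)                 # levels above level 2
                for n in range(3, height(TREE) + 1)}

print(height(TREE))                               # height of the root
print([label(t) for t in base_level])             # labels at level 2
print({n: [label(t) for t in ts] for n, ts in upper_levels.items()})
```

In this toy tree the root has height 4, level 2 contains the two NP subtrees and the lone V node, and levels 3 and 4 supply the upper-level chunks.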
According to the definitions of upper-level and base-level chunks given above, the upper-level and base-level chunk annotation of the 5,000 Vietnamese phrase trees is completed manually.
3) Take the annotated Vietnamese phrase trees, composed of upper-level and base-level chunks, as the training corpus;
The result of manually annotating the 5,000 Vietnamese phrase trees with upper-level and base-level chunks serves as the training corpus for the upper-level and base-level chunk models.
In Step2, the feature sets of the upper-level and base-level chunks are selected, the CRF model is adjusted according to the training corpus, and the improved CRF model is trained; the improved CRF model is then used to build the upper-level and base-level chunk models, which are combined to obtain the chunk-based Vietnamese phrase treebank construction model;
Based on the Vietnamese phrase tree corpus built above, the improved CRF model is trained to obtain the upper-level and base-level chunk models, and the two models are combined to obtain the chunk-based Vietnamese phrase treebank construction model.
1) Adjust the CRF model according to the training corpus, and train the improved CRF model;
Sequence labeling is an important task in fields including bioinformatics, computational linguistics and speech recognition. In natural language processing, part-of-speech tagging and chunk parsing are both typical sequence labeling tasks, in which labels are assigned to an observed sequence. In chunk parsing, for example, a sequence labeling model labels the input sentence and assigns the same label to each subsequence that can form a new chunk. For sequence labeling, the first model that comes to mind is the hidden Markov model (HMM). An HMM is a generative model: it models the observation sequence random variable X and the corresponding label random variable Y, and computes their joint probability distribution P(X, Y). The most serious problem with a joint probability model, however, is that it must enumerate all possible observation sequences, which is intractable in many domains. A model that turns the problem into a tractable one is therefore needed, and the conditional probability model is such a model. It computes the conditional probability P(Y | X) of the label random variable Y given the observation random variable X, rather than the joint probability P(X, Y), which greatly simplifies the problem.
The conditional random field (CRF) model is a probabilistic framework based on the conditional distribution and is a typical discriminative model. Compared with other sequence labeling models, the CRF model has several inherent advantages. First, compared with the hidden Markov model, its independence assumptions are relatively relaxed; second, compared with maximum entropy Markov models (MEMMs) and other conditional Markov models based on directed graphs, the CRF model avoids the label bias problem. CRF models therefore perform well on many practical tasks.
In their paper, Lafferty et al. define the probability of a label sequence y given an observation sequence x as a normalized product of potential functions, each of the form shown in Formula 1.
$$\exp\Big(\sum_{j}\lambda_j\, t_j(y_{i-1}, y_i, x, i) + \sum_{k}\mu_k\, s_k(y_i, x, i)\Big) \tag{1}$$
Here $t_j(y_{i-1}, y_i, x, i)$ is a transition feature function of the whole observation sequence and the labels at positions $i$ and $i-1$; $s_k(y_i, x, i)$ is a state feature function of the label at position $i$ and the observation sequence; $\lambda_j$ and $\mu_k$ are the parameters of these two kinds of function and must be estimated from the training data.
When defining feature functions, a real-valued function $e(x, i)$ of the observation sequence is constructed to describe some distributional characteristics of the training data. Formula 2 gives a concrete example of $e(x, i)$ in chunk parsing.
To simplify the notation, the state feature function is rewritten as in Formula 3.
$$s(y_i, x, i) = s(y_{i-1}, y_i, x, i) \tag{3}$$
The global feature function of the CRF model for a given observation sequence $x$ of length $n$ and label sequence $y$ is then defined as Formula 4.
$$F_j(y, x) = \sum_{i=1}^{n} f_j(y_{i-1}, y_i, x, i) \tag{4}$$
In this formula, $f_j(y_{i-1}, y_i, x, i)$ can be either a state feature function $s(y_{i-1}, y_i, x, i)$ or a transition feature function $t(y_{i-1}, y_i, x, i)$. For a given observation sequence $x$, the probability distribution of its label sequence $y$ can therefore be written in the form of Formula 5.
$$p(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\Big(\sum_{j} \lambda_j F_j(y, x)\Big) \tag{5}$$
Here $Z(x)$ is the normalization factor.
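To make the conditional distribution concrete, the following Python sketch evaluates a linear-chain CRF by brute force on a tiny input: it sums weighted feature functions over all positions, exponentiates, and normalizes over every possible label sequence. The label set, the two feature functions, their weights and the `<START>` symbol are illustrative assumptions; a practical CRF toolkit would compute Z(x) with dynamic programming rather than by enumeration.

```python
import math
from itertools import product

LABELS = ("B", "I", "O")  # assumed chunk labels, for illustration only

def t_inside_after_begin(y_prev, y_cur, x, i):
    """Transition feature: fires when label I follows label B."""
    return 1.0 if (y_prev, y_cur) == ("B", "I") else 0.0

def s_noun_begins_chunk(y_prev, y_cur, x, i):
    """State feature, written with the unified four-argument signature:
    fires when a token tagged N is labelled B."""
    return 1.0 if x[i] == "N" and y_cur == "B" else 0.0

# (weight, feature function) pairs; the weights are made-up values.
FEATURES = [(1.5, t_inside_after_begin), (2.0, s_noun_begins_chunk)]

def score(y, x):
    """Sum over features j and positions i of weight_j * f_j(...)."""
    total = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else "<START>"
        for weight, f in FEATURES:
            total += weight * f(y_prev, y[i], x, i)
    return total

def partition(x):
    """Normalization factor Z(x): sum of exp(score) over all labelings."""
    return sum(math.exp(score(y, x))
               for y in product(LABELS, repeat=len(x)))

def prob(y, x):
    """Conditional probability p(y | x) of one label sequence."""
    return math.exp(score(y, x)) / partition(x)

x = ["N", "V", "N"]  # observation: a part-of-speech sequence
print(prob(("B", "O", "B"), x))
print(prob(("O", "O", "O"), x))
```

Enumeration grows exponentially with sentence length, so this form is only suitable for checking the definition on very short sequences.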
2) Select and set the feature sets of the upper-level chunks and base-level chunks;
The preceding text described the concepts behind complete chunk-based syntactic analysis and its analysis process. As that description shows, the syntactic analysis system is built on chunk parsing, so the performance of chunk parsing directly limits the performance of the overall syntactic analysis. If the chunk parsing module identifies with complete accuracy which phrases can form new chunks, then the syntax tree assembled from the correct chunks is also correct. The chunk parsing model used here is based on a sequence labeling model, namely conditional random fields (CRFs). Its performance therefore depends largely on feature selection: a good set of features gives the model strong discriminative power and improves the accuracy of the analysis. This section introduces the features used by the baseline system of the chunk-based syntactic analysis model. According to their use, these features fall into two broad classes: features for base-level chunk parsing and features for upper-level chunk parsing.
Base-level chunk parsing amounts to shallow parsing with a conditional random field model, so the features used at this level are similar to those used in shallow parsing. Table 2 lists the feature templates used by the base-level chunk parsing module of the baseline system; most of them come from the work of Sha and Pereira and of Yoshimasa Tsuruoka et al.
As Table 2 shows, base-level chunk parsing uses only part-of-speech and word features. This is because base-level chunk parsing is the first level of analysis of the input data, and the input test sentence is just a word sequence with part-of-speech tags, so only these two kinds of feature are available.
Table 2: Features used by base-level chunk parsing in the baseline system
Feature category | Feature representation | Feature description
POS Unigram | p_i, i ∈ {-2, -1, 0, 1, 2} | Unigram part-of-speech feature
POS Bigram | p_i p_{i+1}, i ∈ {-2, -1, 0, 1} | Adjacent bigram part-of-speech feature
POS Trigram | p_{i-1} p_i p_{i+1}, i ∈ {-2, -1, 0, 1, 2} | Adjacent trigram part-of-speech feature
Word Unigram | w_i, i ∈ {-2, -1, 0, 1, 2} | Unigram word feature
Word Bigram | w_i w_{i+1}, i ∈ {-2, -1, 0, 1} | Adjacent bigram word feature
Word Trigram | w_{i-1} w_i w_{i+1}, i ∈ {0} | Adjacent trigram word feature
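As a rough illustration of how the templates in Table 2 could be instantiated at one position of a tagged sentence, consider the Python sketch below. The feature-string format, the `<PAD>` symbol for positions outside the sentence, the function names and the toy sentence are all assumptions; a particular CRF toolkit would express the same templates in its own template syntax.

```python
PAD = "<PAD>"  # assumed symbol for positions outside the sentence

def at(seq, k):
    """Element k of seq, or PAD when k falls outside the sentence."""
    return seq[k] if 0 <= k < len(seq) else PAD

def base_features(words, tags, t):
    """Instantiate the Table 2 templates at position t.
    Offsets i are relative to t, as in the table."""
    feats = []
    for i in (-2, -1, 0, 1, 2):
        feats.append(f"p[{i}]={at(tags, t + i)}")           # POS unigram
        feats.append(f"w[{i}]={at(words, t + i)}")          # word unigram
        feats.append(f"p[{i - 1}]p[{i}]p[{i + 1}]="         # POS trigram
                     + "/".join(at(tags, t + k) for k in (i - 1, i, i + 1)))
    for i in (-2, -1, 0, 1):
        feats.append(f"p[{i}]p[{i + 1}]="                   # POS bigram
                     f"{at(tags, t + i)}/{at(tags, t + i + 1)}")
        feats.append(f"w[{i}]w[{i + 1}]="                   # word bigram
                     f"{at(words, t + i)}/{at(words, t + i + 1)}")
    feats.append("w[-1]w[0]w[1]="                           # word trigram, i = 0
                 + "/".join(at(words, t + k) for k in (-1, 0, 1)))
    return feats

# Hypothetical three-word tagged sentence.
words = ["Anh", "mua", "táo"]
tags = ["N", "V", "N"]
for feature in base_features(words, tags, 1):
    print(feature)
```

Each position yields 24 feature strings: 5 + 4 + 5 part-of-speech features and 5 + 4 + 1 word features, matching the index ranges in the table.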
In the chunk-based syntactic analysis described here, the levels of chunk parsing above the base level are called upper-level analysis. Base-level chunk parsing operates on words and parts of speech, whereas upper-level chunk parsing operates on chunks, and in this syntactic analysis system each chunk corresponds to a subtree. Upper-level analysis can therefore use features based on syntactic structure, for example the non-terminal syntactic label, the head word of the subtree and its part of speech, and the boundary node information of the subtree. Table 3 lists the feature templates used by upper-level chunk parsing in the baseline system; some of them come from the work of Yoshimasa Tsuruoka et al. As Table 3 shows, the baseline system uses three classes of feature in total: non-terminal label features, head word features and head word part-of-speech features. With these three classes of feature the chunk-based syntactic analysis system already reaches fairly high performance. These features alone, however, do not make full use of the information provided by the lower-level chunks (each of which corresponds to a syntactic subtree), which is why the performance of the baseline system is limited. Later sections therefore introduce further features and methods that improve on the baseline system.
Table 3: Feature templates used by upper-level chunk parsing in the baseline system
3) Use the selected feature sets of the upper-level and base-level chunks and the improved CRF model to build the upper-level chunk model and the base-level chunk model, and combine the two models to obtain the chunk-based Vietnamese phrase treebank construction model;
Chunk parsing can be cast as a sequence labeling problem; what follows describes in detail how complete syntactic analysis is converted into chunk parsing. In their paper, Yoshimasa Tsuruoka et al. perform syntactic analysis in two stages, which they call base-level chunking and upper-level chunking (up-level chunking). The reason for using a two-stage method is that base-level chunk parsing and upper-level chunk parsing use different features. The input to base-level chunk parsing is a sentence containing only words and their parts of speech, so the only features available to it are words and parts of speech. The output of base-level chunk parsing is a chunk sequence; since each chunk can be represented as a subtree, this chunk sequence represents a sequence of subtrees. This result is passed to upper-level chunk parsing, which can therefore use richer features: besides the basic word and part-of-speech features, it can also use the syntactic information of the subtrees. In order to make better use of the conditional random field model and of more features, the complete chunk-based syntactic analysis model is divided into two parts: the base-level chunk parsing model and the upper-level chunk parsing model.
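The two-stage procedure can be sketched in Python as follows. The sketch assumes that each chunker returns one tag per node in a B-X / I-X / O scheme, which is a common convention but is not specified in the text; `group`, `parse` and the two lookup-table taggers are hypothetical stand-ins for trained CRF models.

```python
def group(nodes, tags):
    """Merge consecutive nodes into new subtrees according to
    B-X (begin chunk X), I-X (continue chunk X) and O (leave as is)."""
    out, open_chunk = [], None
    for node, tag in zip(nodes, tags):
        if tag.startswith("B-"):
            open_chunk = (tag[2:], [node])
            out.append(open_chunk)
        elif (tag.startswith("I-") and open_chunk is not None
              and open_chunk[0] == tag[2:]):
            open_chunk[1].append(node)
        else:  # an O tag, or an I- tag with no matching open chunk
            open_chunk = None
            out.append(node)
    return out

def parse(tagged_words, base_tagger, upper_tagger, max_rounds=20):
    """Base-level chunking once, then upper-level chunking repeated
    until a single tree remains or no further progress is made."""
    nodes = [(pos, [word]) for word, pos in tagged_words]
    nodes = group(nodes, base_tagger(nodes))
    for _ in range(max_rounds):
        if len(nodes) == 1:
            break
        merged = group(nodes, upper_tagger(nodes))
        if merged == nodes:
            break
        nodes = merged
    return nodes

# Toy taggers keyed on the label sequence (stand-ins for CRF models).
def toy_base(nodes):
    table = {"N V N": ["B-NP", "O", "B-NP"]}
    return table[" ".join(n[0] for n in nodes)]

def toy_upper(nodes):
    table = {"NP V NP": ["O", "B-VP", "I-VP"], "NP VP": ["B-S", "I-S"]}
    return table[" ".join(n[0] for n in nodes)]

result = parse([("Anh", "N"), ("mua", "V"), ("táo", "N")],
               toy_base, toy_upper)
print(result)
```

Passing the taggers in as callables keeps the grouping logic independent of how the base-level and upper-level models are trained.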
When training the chunk-based syntactic analysis model, the two models must likewise be trained separately: the base-level chunk model is trained on the base-level chunks of the training treebank, and the upper-level chunk model is trained on the upper-level chunks of the training treebank. To do so, all the chunks in a syntax tree must first be divided into two parts: the base-level chunk set and the upper-level chunk set. To define the two clearly, the height of each node in the syntax tree is first defined as follows: the height of each terminal node (word) is zero, and the height of any non-terminal node is the maximum height of its child nodes plus a fixed value of 1. Next, the levels of a syntax tree in Penn Treebank format are defined as follows: a complete syntax tree in Penn Treebank format can be divided into several levels, the number of levels equals the height of the root node, and each level consists of an ordered set of subtrees. The set of subtrees formed by the terminal nodes is level 0. The level-n subtree set consists of the subtrees whose height is less than or equal to n; if this set contains a subtree that is itself contained in a larger subtree of the set, only the larger subtree is kept and the contained smaller subtree is discarded. To meet the needs of the syntactic parse tree model implemented here, the chunk set corresponding to the level-2 ordered subtree set is called the base-level chunk set, and the chunk sets corresponding to all subtree sets above level 2 are collectively called the upper-level chunk set.
In step Step3, a chunk parsing tool is used to chunk 27,000 word-segmented Vietnamese sentences, thereby obtaining a chunk corpus; base-level and upper-level chunk parsing is then performed on this corpus to obtain 27,000 preliminary chunk-based Vietnamese phrase trees;
1) Chunk the 27,000 word-segmented Vietnamese sentences to obtain a Vietnamese chunk corpus of 27,000 sentences;
First, a word segmentation tool is used to segment the 27,000 collected Vietnamese sentences; then a chunk parsing tool is used to chunk the 27,000 segmented sentences.
2) Use the upper-level chunk model and the base-level chunk model to perform base-level and upper-level chunk parsing on the chunk corpus, finally obtaining 27,000 preliminary chunk-based Vietnamese phrase trees.
The upper-level and base-level chunk models obtained in Step2.3 are used to perform base-level and upper-level chunk parsing on the chunk corpus, finally yielding a Vietnamese phrase treebank of 27,000 trees.
In step Step4, a phrase treebank corrector is used to correct the preliminary chunk-based Vietnamese phrase treebank, finally obtaining the corrected final Vietnamese phrase treebank.
The preliminary Vietnamese phrase treebank obtained in Step3 has some quality problems, mainly because the accuracy of the Vietnamese chunk corpus obtained in Step2 is not high enough. To address this, the phrase tree corrector is used to correct the preliminary Vietnamese phrase treebank, finally yielding a Vietnamese phrase treebank of higher quality.
In summary, the invention first annotates 5,000 manually annotated Vietnamese phrase trees with subtree levels, base-level chunk sets and upper-level chunk sets, and uses them as the training treebank. It then selects the feature sets of the upper-level and base-level chunks, builds the upper-level and base-level chunk models with CRF, and converts the chunk parsing results into Vietnamese phrase trees. Next, a chunk parsing tool is used to chunk 27,000 word-segmented Vietnamese sentences to obtain a chunk corpus, and base-level and upper-level chunk parsing is performed on this corpus to obtain a Vietnamese phrase treebank of 27,000 trees. Finally, the phrase treebank corrector is used to correct the newly generated treebank, yielding the final Vietnamese phrase treebank.
The experimental results are shown in Table 4. As Table 4 shows, the Vietnamese phrase treebank generated by the chunk-based construction method is significantly more accurate than the treebanks constructed with PCFG or with maximum entropy.
Evaluation uses the PARSEVAL syntactic analysis evaluation scheme, currently a widely used standard. It consists of three measures: labeled precision (LP), labeled recall (LR) and the F value, which takes both precision and recall into account. They are defined as follows:
$$LP = \frac{\text{number of correct phrases in the analysis result}}{\text{total number of phrases in the analysis result}} \times 100\%$$
$$LR = \frac{\text{number of correct phrases in the analysis result}}{\text{total number of phrases in the reference treebank}} \times 100\%$$
$$F = \frac{2 \times LP \times LR}{LP + LR}$$
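The three PARSEVAL measures can be computed with a few lines of Python. The sketch below assumes that each phrase is represented as a (label, start, end) triple, that phrases are compared as sets, and that the F value is the balanced harmonic mean of LP and LR; the function names and the toy bracketings are illustrative only.

```python
def f_value(lp, lr):
    """Balanced F value (harmonic mean) of precision and recall."""
    return 2 * lp * lr / (lp + lr) if lp + lr else 0.0

def parseval(gold, predicted):
    """Labeled precision, labeled recall and F value for two
    collections of (label, start, end) phrase brackets."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    lp = correct / len(predicted) if predicted else 0.0
    lr = correct / len(gold) if gold else 0.0
    return lp, lr, f_value(lp, lr)

# Toy bracketings over a three-word sentence (hypothetical data).
gold = [("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("NP", 2, 3)]
predicted = [("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("NP", 1, 2)]
print(parseval(gold, predicted))

# F value from a reported precision / recall pair, in percent.
print(round(f_value(80.64, 81.36), 2))
```

Treating the brackets as sets ignores duplicate brackets with the same label and span, which is a simplification of full PARSEVAL scoring.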
Table 4: Comparison of other methods with the method of the invention
Method | LR (%) | LP (%) | F value (%)
Vietnamese phrase treebank built with PCFG | 81.36 | 80.64 | 81.00
Vietnamese phrase treebank built with maximum entropy | 79.83 | 78.69 | 79.26
New chunk-based Vietnamese phrase treebank | 86.32 | 83.45 | 85.66
The specific embodiments of the invention have been described in detail above with reference to the accompanying drawings, but the invention is not limited to these embodiments; within the knowledge of a person of ordinary skill in the art, various changes can be made without departing from the spirit of the invention.