CN106202037A - Chunk-based method for constructing Vietnamese phrase trees - Google Patents

Publication number
CN106202037A
Authority
CN
China
Prior art keywords
chunk
vietnamese
treebank
basic unit
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610497061.2A
Other languages
Chinese (zh)
Other versions
CN106202037B (en)
Inventor
郭剑毅
李英
余正涛
线岩团
毛存礼
陈玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN201610497061.2A priority Critical patent/CN106202037B/en
Publication of CN106202037A publication Critical patent/CN106202037A/en
Application granted granted Critical
Publication of CN106202037B publication Critical patent/CN106202037B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a chunk-based method for constructing Vietnamese phrase trees and belongs to the technical field of natural language processing. The method first annotates Vietnamese phrase trees, labelled with the phrase tree tag set, with upper-layer chunks and base-layer chunks. It then selects feature sets for the upper-layer and base-layer chunks and builds a chunk-based Vietnamese phrase treebank construction model. A chunk analysis tool performs chunk analysis on word-segmented Vietnamese sentences, yielding a preliminary chunk-based Vietnamese phrase treebank. Finally, a phrase treebank corrector corrects the preliminary treebank to produce the final, corrected Vietnamese phrase treebank. The invention avoids manually collecting and annotating a Vietnamese phrase treebank, saving labor and construction time. Compared with building a Vietnamese phrase treebank with a context-free grammar or with a maximum entropy method, the proposed method markedly improves accuracy.

Description

Chunk-based method for constructing Vietnamese phrase trees
Technical field
The invention relates to a chunk-based method for constructing Vietnamese phrase trees and belongs to the technical field of natural language processing.
Background
The analysis and construction of phrase treebanks plays a very important role in linguistic research, for example in extracting syntactic patterns and investigating linguistic phenomena. Treebanks are also commonly used to train systems such as word segmenters, syntactic parsers and semantic role labelers, which in turn underlie applications such as information extraction, machine translation, question answering and text classification. In recent years, with the rapid development of machine learning and artificial intelligence, the automatic construction of phrase treebanks has become increasingly important.
Phrase-structure parsing automatically derives the grammatical structure of a sentence according to a given grammar system, analyzing the syntactic units the sentence contains and the relations among them (Allen 1995), and converts the sentence into a structured syntax tree. A phrase tree is built from three kinds of symbols (terminals, nonterminals and phrase markers) according to specific grammar rules. Under those rules, several terminals form a phrase, which takes part in the next reduction as a nonterminal, until the whole sentence is reduced to the root node.
Little research exists on Vietnamese phrase treebanks. Existing work on Vietnamese mainly includes the following. Nguyen C T, Nguyen T K et al. (2006) built a Vietnamese word segmentation model using CRF and SVM, completing word segmentation for Vietnamese. Le H P, Nguyen T M H, Romary L et al. (2006) proposed a lexical link grammar for Vietnamese but did not describe how the grammar could be used to build phrase trees. Nguyen P T, Vu X L, Nguyen T M H et al. (2009) only outlined a research approach to building Vietnamese syntax trees and gave no construction results. Dinh Dien, Thuy Ngan, Xuan Quang et al. (2009) built English-Vietnamese parallel syntax trees for bilingual machine translation, but the Vietnamese syntax trees produced in that process have many problems; for example, English and Vietnamese words cannot be aligned one to one, which makes the accuracy of the Vietnamese syntax trees very low.
To address the scarcity of Vietnamese phrase treebanks and the difficulty of building them, the invention provides a new chunk-based method for constructing Vietnamese phrase trees. The method automatically derives the phrase structure tree of a Vietnamese sentence and thereby solves the problem of constructing a Vietnamese phrase treebank. The treebank built by the invention provides strong support for upper-layer applications such as Vietnamese syntactic analysis, machine translation and information extraction.
Summary of the invention
The invention provides a chunk-based method for constructing Vietnamese phrase trees. It addresses the difficulty of manually annotating a Vietnamese phrase treebank, the inconvenience of building a large-scale one, and the low accuracy and long construction time of conventional methods. Compared with building a Vietnamese phrase treebank with a context-free grammar or with a maximum entropy method, the proposed method markedly improves accuracy. The treebank built by the invention also provides strong support for upper-layer applications such as Vietnamese syntactic analysis, machine translation and information extraction.
The technical solution of the invention is a chunk-based method for constructing Vietnamese phrase trees, whose specific steps are as follows:
Step1: First, annotate the Vietnamese phrase trees, which are labelled with the phrase tree tag set, with upper-layer chunks and base-layer chunks, and use the annotated phrase trees as the training corpus. A training corpus obtained in this way has higher accuracy, so the feature sets derived from it are more accurate;
Step2: Select feature sets for the upper-layer and base-layer chunks, adjust the CRF model according to the training corpus, and train the improved CRF model. Use the improved CRF model to build an upper-layer chunk model and a base-layer chunk model, and combine the two into a chunk-based Vietnamese phrase treebank construction model. A construction model built from the improved CRF model yields a better, higher-quality Vietnamese phrase treebank;
Step3: Use a chunk analysis tool to perform chunk analysis on word-segmented Vietnamese sentences to obtain a chunk corpus, then perform base-layer and upper-layer chunk analysis on that corpus to obtain a preliminary chunk-based Vietnamese phrase treebank. Building the treebank with the chunk-based construction model is markedly more accurate than building it with a context-free grammar or with a maximum entropy method;
Step4: Use a phrase treebank corrector to correct the preliminary chunk-based Vietnamese phrase treebank, finally obtaining the corrected, final Vietnamese phrase treebank. This further correction ensures the quality of the final treebank, which can then serve as a corpus resource for upper-layer applications such as machine translation and information extraction.
As a preferred scheme of the invention, the specific steps in Step1 for annotating the manually annotated Vietnamese phrase trees with upper-layer and base-layer chunks are as follows:
Step1.1: Formulate the Vietnamese phrase tree tag set according to the linguistic features of Vietnamese, combined with the annotation scheme of CTB (the Penn Chinese Treebank);
Step1.2: Define upper-layer chunks and base-layer chunks, and complete the upper-layer and base-layer chunk annotation of the Vietnamese phrase trees labelled with that tag set;
Step1.3: Use the annotated Vietnamese phrase trees, composed of upper-layer and base-layer chunks, as the training corpus.
As a preferred scheme of the invention, the specific steps of Step2 are as follows:
Step2.1: Adjust the CRF model according to the training corpus and train the improved CRF model;
Step2.2: Select the feature sets for the upper-layer and base-layer chunks;
Step2.3: Use the selected feature sets and the improved CRF model to build the upper-layer chunk model and the base-layer chunk model, and combine the two into the chunk-based Vietnamese phrase treebank construction model.
As a preferred scheme of the invention, the specific steps of Step3 are as follows:
Step3.1: Perform chunk analysis on the word-segmented Vietnamese sentences to obtain a Vietnamese chunk corpus;
Step3.2: Use the upper-layer and base-layer chunk models to perform base-layer and upper-layer chunk analysis on the chunk corpus, finally obtaining the preliminary chunk-based Vietnamese phrase treebank.
The beneficial effects of the invention are:
1. Compared with building a Vietnamese phrase treebank with a context-free grammar or with a maximum entropy method, the proposed method markedly improves accuracy. The treebank built by the invention also provides strong support for upper-layer applications such as Vietnamese syntactic analysis, machine translation and information extraction;
2. A relatively large Vietnamese phrase tree corpus is constructed;
3. The proposed method removes the need to annotate the Vietnamese phrase treebank manually, greatly saving labor and construction time.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the invention.
Detailed description
Embodiment 1: As shown in Fig. 1, a chunk-based method for constructing Vietnamese phrase trees, whose specific steps are as follows:
Step1: First, annotate the Vietnamese phrase trees, which are labelled with the phrase tree tag set, with upper-layer chunks and base-layer chunks, and use the annotated phrase trees as the training corpus. A training corpus obtained in this way has higher accuracy, so the feature sets derived from it are more accurate;
Step2: Select feature sets for the upper-layer and base-layer chunks, adjust the CRF model according to the training corpus, and train the improved CRF model. Use the improved CRF model to build an upper-layer chunk model and a base-layer chunk model, and combine the two into a chunk-based Vietnamese phrase treebank construction model. A construction model built from the improved CRF model yields a better, higher-quality Vietnamese phrase treebank;
Step3: Use a chunk analysis tool to perform chunk analysis on word-segmented Vietnamese sentences to obtain a chunk corpus, then perform base-layer and upper-layer chunk analysis on that corpus to obtain a preliminary chunk-based Vietnamese phrase treebank. Building the treebank with the chunk-based construction model is markedly more accurate than building it with a context-free grammar or with a maximum entropy method;
Step4: Use a phrase treebank corrector to correct the preliminary chunk-based Vietnamese phrase treebank, finally obtaining the corrected, final Vietnamese phrase treebank. This further correction ensures the quality of the final treebank, which can then serve as a corpus resource for upper-layer applications such as machine translation and information extraction.
Embodiment 2: As shown in Fig. 1, a chunk-based method for constructing Vietnamese phrase trees. This embodiment is the same as Embodiment 1, wherein, as a preferred scheme of the invention, the specific steps in Step1 for annotating the manually annotated Vietnamese phrase trees with upper-layer and base-layer chunks are as follows:
Step1.1: Formulate the Vietnamese phrase tree tag set according to the linguistic features of Vietnamese, combined with the annotation scheme of CTB (the Penn Chinese Treebank);
Step1.2: Define upper-layer chunks and base-layer chunks, and complete the upper-layer and base-layer chunk annotation of the Vietnamese phrase trees labelled with that tag set;
Step1.3: Use the annotated Vietnamese phrase trees, composed of upper-layer and base-layer chunks, as the training corpus.
Embodiment 3: As shown in Fig. 1, a chunk-based method for constructing Vietnamese phrase trees. This embodiment is the same as Embodiment 2, wherein, as a preferred scheme of the invention, the specific steps of Step2 are as follows:
Step2.1: Adjust the CRF model according to the training corpus and train the improved CRF model;
Step2.2: Select the feature sets for the upper-layer and base-layer chunks;
Step2.3: Use the selected feature sets and the improved CRF model to build the upper-layer chunk model and the base-layer chunk model, and combine the two into the chunk-based Vietnamese phrase treebank construction model.
Embodiment 4: As shown in Fig. 1, a chunk-based method for constructing Vietnamese phrase trees. This embodiment is the same as Embodiment 3, wherein, as a preferred scheme of the invention, the specific steps of Step3 are as follows:
Step3.1: Perform chunk analysis on the word-segmented Vietnamese sentences to obtain a Vietnamese chunk corpus;
Step3.2: Use the upper-layer and base-layer chunk models to perform base-layer and upper-layer chunk analysis on the chunk corpus, finally obtaining the preliminary chunk-based Vietnamese phrase treebank.
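The layered analysis in Step3 can be pictured as repeated grouping: each pass assigns chunk labels to the current sequence of nodes and merges every labelled run into a new parent node, until one root remains. The sketch below is only an illustration under stated assumptions. The BIO label format (`B-NP`, `I-NP`, `O`), the function names and the placeholder words `w1`..`w4` are hypothetical, and the labels are written by hand here, whereas in the method they would come from the trained base-layer and upper-layer chunk models.

```python
def apply_chunks(nodes, bio_labels):
    """Group adjacent nodes under new parent nodes according to BIO labels.

    A node is either a word (str) or a (label, children) tuple.
    "B-X" opens a chunk of type X, "I-X" continues it, anything else stays ungrouped.
    """
    out, current = [], None
    for node, label in zip(nodes, bio_labels):
        if label.startswith("B-"):
            current = (label[2:], [node])
            out.append(current)
        elif label.startswith("I-") and current is not None and current[0] == label[2:]:
            current[1].append(node)
        else:
            # "O", or an "I-X" that does not continue the open chunk: leave node as is.
            out.append(node)
            current = None
    return out


def build_tree(tagged_words, layer_labels):
    """Apply one labelling per layer, starting from (word, POS) pairs."""
    nodes = [(pos, [word]) for word, pos in tagged_words]
    for labels in layer_labels:
        nodes = apply_chunks(nodes, labels)
    return nodes


def to_brackets(node):
    """Render a node in Penn Treebank style bracket notation."""
    if isinstance(node, str):
        return node
    return "(" + node[0] + " " + " ".join(to_brackets(c) for c in node[1]) + ")"


# Placeholder sentence and hand-written labels (assumptions, not model output).
sentence = [("w1", "N"), ("w2", "V"), ("w3", "N"), ("w4", "A")]
layers = [
    ["B-NP", "O", "B-NP", "I-NP"],  # base-layer chunks over POS-tagged words
    ["O", "B-VP", "I-VP"],          # first upper layer, over the chunks above
    ["B-S", "I-S"],                 # final upper layer, reducing to the root
]
roots = build_tree(sentence, layers)
print(to_brackets(roots[0]))
```

Keeping the grouping step separate from the labelling step mirrors the split between the chunk models and the tree they assemble: any labeller that emits one label per node can be plugged in.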
Embodiment 5: As shown in Fig. 1, a chunk-based method for constructing Vietnamese phrase trees, whose specific steps are as follows:
Step1: First, annotate the Vietnamese phrase trees, which are labelled with the phrase tree tag set, with upper-layer chunks and base-layer chunks, and use the annotated phrase trees as the training corpus. A training corpus obtained in this way has higher accuracy, so the feature sets derived from it are more accurate;
Step2: Select feature sets for the upper-layer and base-layer chunks, adjust the CRF model according to the training corpus, and train the improved CRF model. Use the improved CRF model to build an upper-layer chunk model and a base-layer chunk model, and combine the two into a chunk-based Vietnamese phrase treebank construction model. A construction model built from the improved CRF model yields a better, higher-quality Vietnamese phrase treebank;
Step3: Use a chunk analysis tool to perform chunk analysis on 27,000 word-segmented Vietnamese sentences to obtain a chunk corpus, then perform base-layer and upper-layer chunk analysis on that corpus to obtain 27,000 preliminary chunk-based Vietnamese phrase trees. Building the treebank with the chunk-based construction model is markedly more accurate than building it with a context-free grammar or with a maximum entropy method;
Step4: Use a phrase treebank corrector to correct the preliminary chunk-based Vietnamese phrase treebank, finally obtaining the corrected, final Vietnamese phrase treebank. This further correction ensures the quality of the final treebank, which can then serve as a corpus resource for upper-layer applications such as machine translation and information extraction.
Specifically, Step1 first annotates 5,000 manually annotated Vietnamese phrase trees with upper-layer and base-layer chunks, and the resulting annotated phrase trees serve as the training corpus;
Building a Vietnamese phrase treebank corpus is the foundation of Vietnamese phrase tree construction. Only once a high-quality corpus has been built can further information-processing work be carried out on it. The phrase treebank corpus is also an indispensable part of the research on chunk-based Vietnamese phrase treebank construction. The specific steps for building it are as follows:
1) Formulate the Vietnamese phrase tree tag set according to the linguistic features of Vietnamese, combined with the annotation scheme of CTB (the Penn Chinese Treebank);
Vietnamese belongs to the Austroasiatic language family and is the mother tongue of Vietnam. Every language has its own word order, and Vietnamese relies mainly on the order of constituents to convey important syntactic information. Although Vietnamese is written in a variant of the Latin alphabet, it differs from Western languages in three obvious ways. The Vietnamese features that most strongly affect the construction of a phrase treebank are as follows:
First, the smallest unit of Vietnamese is the syllable. A word consists of one syllable (as in the word for "beautiful") or several syllables (as in the word for "girl"). Like many Asian languages (such as Chinese, Japanese and Thai), Vietnamese has no word delimiter. A space only separates syllables, not words, so a Vietnamese sentence can usually be segmented in many different ways.
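To make the segmentation ambiguity concrete: if spaces mark only syllable boundaries, every gap between two syllables may or may not be a word boundary, so a sequence of n syllables has 2^(n-1) candidate segmentations. The following sketch simply enumerates them. The function name and the placeholder syllables `s1`, `s2`, `s3` are illustrative assumptions and are not real Vietnamese.

```python
from itertools import product


def segmentations(syllables):
    """Return every way to group a syllable sequence into contiguous words."""
    n = len(syllables)
    if n == 0:
        return [[]]
    results = []
    # Each of the n-1 gaps between syllables is either a word boundary or not.
    for cuts in product([False, True], repeat=n - 1):
        words, current = [], [syllables[0]]
        for syllable, cut in zip(syllables[1:], cuts):
            if cut:
                words.append(" ".join(current))
                current = [syllable]
            else:
                current.append(syllable)
        words.append(" ".join(current))
        results.append(words)
    return results


# Placeholder syllables; three syllables give 2 ** 2 = 4 candidate segmentations.
for candidate in segmentations(["s1", "s2", "s3"]):
    print(candidate)
```

A word segmenter has to pick one candidate from this set, which is why the method takes word-segmented sentences as its input.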
Second, Vietnamese is an isolating language: words do not change form, and their grammatical function is determined by their position in the sentence. Word order is therefore the most important means of expressing meaning in Vietnamese grammar, and changing the word order can change the meaning. For example, the same two words mean "son" in one order and "human being" in the reverse order. Word order in a Vietnamese sentence generally moves from general to specific: the more general a word's meaning, the earlier it appears in the sentence, and the more specific its meaning, the later it appears. For example: Anh mua (he buys) táo (apple).
Finally, Vietnamese has a fairly fixed word order, namely subject-verb-object (SVO). That is, the general order is subject + predicate + object. For example: Kia (that) là (is) (some) (classifier) nhà (house) vách (earthen wall). Analysis of Vietnamese grammar shows that attributives and adverbials are clearly postposed. For example: (I usually eat) quán (at the restaurant).
Based on the above features of Vietnamese and the annotation scheme of CTB (the Penn Chinese Treebank), the Vietnamese phrase tree tag set is formulated; part of it is shown in Table 1.
Table 1: Part of the Vietnamese phrase tree tag set
Phrase tag    Description
NP            Noun phrase
VP            Verb phrase
PP            Prepositional phrase
AP            Adjective phrase
2) Define upper-layer chunks and base-layer chunks, and complete the upper-layer and base-layer chunk annotation of the 5,000 Vietnamese phrase trees labelled with the tag set;
To train the base-layer chunk analysis model and the upper-layer chunk analysis model separately, all chunks in a syntax tree must first be divided into two parts: a base-layer chunk set and an upper-layer chunk set. To give base-layer and upper-layer chunks a clear definition, the height of each node in a syntax tree is defined first: each terminal node (word) has height zero, and every nonterminal node has a height equal to the maximum height of its child nodes plus a fixed increment of 1. Next, the layers of a syntax tree in Penn Treebank format are defined: a complete syntax tree in Penn Treebank format can be divided into several layers, the number of layers equals the height of the root node, and each layer consists of an ordered set of subtrees. The set of subtrees formed by the terminal nodes is layer 0. Layer n is the set of subtrees whose height is less than or equal to n; where a subtree in this set is contained in a larger subtree in the set, only the larger subtree is kept and the contained one is discarded. For the syntactic analysis tree model implemented here, the chunk set corresponding to the ordered subtree set of layer 2 is called the base-layer chunk set, and the chunk sets corresponding to all subtree sets above layer 2 are collectively called the upper-layer chunk set.
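The height and layer definitions above translate directly into a short recursive computation. The sketch below is an illustration under stated assumptions rather than the patent's implementation: the small bracket parser, the function names and the placeholder words `w1`..`w4` are hypothetical. It computes node heights and, for a given n, the ordered set of maximal subtrees of height at most n.

```python
def parse(text):
    """Parse a Penn Treebank style bracketed string into (label, children) tuples.

    Words are plain strings; every bracketed node is (label, [children]).
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1  # skip ")"
            return (label, children)
        word = tokens[pos]
        pos += 1
        return word

    return read()


def height(node):
    """Words have height 0; a nonterminal is 1 + the maximum height of its children."""
    if isinstance(node, str):
        return 0
    return 1 + max(height(child) for child in node[1])


def layer(node, n):
    """Ordered list of maximal subtrees whose height is at most n."""
    if height(node) <= n:
        return [node]  # keep the larger subtree, discard everything it contains
    result = []
    for child in node[1]:
        result.extend(layer(child, n))
    return result


def leaves(node):
    """Words covered by a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1] for word in leaves(child)]


tree = parse("(S (NP (N w1)) (VP (V w2) (NP (N w3) (A w4))))")
print(height(tree))
for subtree in layer(tree, 2):  # layer 2 is the layer that yields base-layer chunks
    print(subtree[0], leaves(subtree))
```

In this toy tree, layer 2 contains the two NP subtrees of height 2 plus the lone V subtree of height 1, which is not contained in any larger subtree of height at most 2.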
According to the definitions of upper-layer and base-layer chunks above, the upper-layer and base-layer chunk annotation of the 5,000 Vietnamese phrase trees is completed manually.
3) Use the annotated Vietnamese phrase trees, composed of upper-layer and base-layer chunks, as the training corpus;
The manually completed upper-layer and base-layer chunk annotations of the 5,000 Vietnamese phrase trees serve as the training corpus for the upper-layer and base-layer chunk models.
In Step2, feature sets for the upper-layer and base-layer chunks are selected, the CRF model is adjusted according to the training corpus, and the improved CRF model is trained. The improved CRF model is used to build the upper-layer and base-layer chunk models, which are combined into the chunk-based Vietnamese phrase treebank construction model;
Based on the Vietnamese phrase tree corpus built above, the improved CRF model is trained to obtain the upper-layer and base-layer chunk models, which are then combined into the chunk-based Vietnamese phrase treebank construction model.
1) Adjust the CRF model according to the training corpus and train the improved CRF model;
Sequence labeling is an important task in fields including bioinformatics, computational linguistics and speech recognition. In natural language processing, part-of-speech tagging and chunk analysis are both typical sequence labeling tasks, in which labels are assigned to an observed sequence. In chunk analysis, for example, a sequence labeling model labels the input sentence and gives the same label to each subsequence that can form a new chunk. For sequence labeling, the first model that comes to mind is the hidden Markov model (HMM). An HMM is a generative model: it models the observation sequence random variable X together with the corresponding label random variable Y and computes their joint probability distribution P(X, Y). A serious problem with joint probability models, however, is that they must enumerate all possible observation sequences, which is intractable in many domains. A model is therefore needed that turns the problem into a tractable one, and the conditional probability model is such a model. It computes the conditional distribution P(Y | X) of the label random variable Y given the observation random variable X rather than the joint distribution P(X, Y), which greatly simplifies the problem.
The conditional random field (CRF) model is a probabilistic framework that uses the conditional distribution, and it is a typical discriminative model. Compared with other sequence labeling models, it has several inherent advantages. First, compared with the hidden Markov model, its independence assumptions are more relaxed. Second, compared with maximum entropy Markov models (MEMMs) and other conditional Markov models based on directed graphs, it avoids the label bias problem. CRFs therefore perform well on many practical tasks.
Lafferty, in his article, defines the probability of a label sequence y given an observation sequence x as a normalized product of potential functions, each of the form shown in formula 1.
$$\exp\Big(\sum_j \lambda_j\, t_j(y_{i-1}, y_i, x, i) + \sum_k \mu_k\, s_k(y_i, x, i)\Big) \qquad (1)$$
Here $t_j(y_{i-1}, y_i, x, i)$ is a transition feature function of the whole observation sequence and the labels at positions $i$ and $i-1$; $s_k(y_i, x, i)$ is a state feature function of the label at position $i$ and the observation sequence; $\lambda_j$ and $\mu_k$ are the parameters of these two kinds of function and must be estimated from the training data.
When defining feature functions, a set of real-valued functions $b(x, i)$ of the observation sequence is constructed first; these functions describe certain distributional characteristics of the training data. A concrete example of $b(x, i)$ in chunk analysis is given as formula 2.
To simplify the notation, the convention in formula 3 is used.
$$s(y_i, x, i) = s(y_{i-1}, y_i, x, i) \qquad (3)$$
The global feature function of the conditional random field model for a given observation sequence x and label sequence y is defined as formula 4.
$$F_j(y, x) = \sum_{i=1}^{n} f_j(y_{i-1}, y_i, x, i) \qquad (4)$$
In the formula above, $f_j(y_{i-1}, y_i, x, i)$ may be either a state feature function $s(y_{i-1}, y_i, x, i)$ or a transition feature function $t(y_{i-1}, y_i, x, i)$. For a given observation sequence x, the probability distribution of its label sequence y can therefore be written as formula 5.
$$p(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\Big(\sum_j \lambda_j F_j(y, x)\Big) \qquad (5)$$
where $Z(x)$ is the normalization factor.
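Formulas 4 and 5 can be checked numerically on a toy problem by enumerating every label sequence, which makes the role of the normalization factor Z(x) explicit. The sketch below is a minimal illustration under stated assumptions: the label set, the two feature functions, their weights and the `START` label used for the position before the first token are all hypothetical. Practical CRF training and decoding use dynamic programming rather than enumeration.

```python
import math
from itertools import product

LABELS = ["B", "I", "O"]  # toy chunk labels (assumption)


def transition_feature(prev_y, y, x, i):
    # Hypothetical transition feature: fires when label I directly follows label B.
    return 1.0 if prev_y == "B" and y == "I" else 0.0


def state_feature(prev_y, y, x, i):
    # Hypothetical state feature: fires when a token observed as "N" is labelled B.
    # It ignores prev_y, following the notational convention of formula 3.
    return 1.0 if x[i] == "N" and y == "B" else 0.0


FEATURES = [transition_feature, state_feature]
WEIGHTS = [1.5, 0.8]  # illustrative lambda values, not estimated from data


def global_feature(f, y, x):
    """F_j(y, x): sum of f_j(y_{i-1}, y_i, x, i) over all positions (formula 4)."""
    total, prev = 0.0, "START"
    for i, label in enumerate(y):
        total += f(prev, label, x, i)
        prev = label
    return total


def score(y, x):
    """Weighted sum of global features, the exponent in formula 5."""
    return sum(w * global_feature(f, y, x) for w, f in zip(WEIGHTS, FEATURES))


def probability(y, x):
    """p(y | x, lambda) with Z(x) computed by brute-force enumeration (formula 5)."""
    z = sum(math.exp(score(cand, x)) for cand in product(LABELS, repeat=len(x)))
    return math.exp(score(tuple(y), x)) / z


x = ("N", "N")  # observation sequence given as POS tags (assumption)
best = max(product(LABELS, repeat=len(x)), key=lambda y: probability(y, x))
print(best, probability(best, x))
```

Because Z(x) sums over all candidate label sequences for the same x, the probabilities of those candidates add up to one.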
2) Select the feature sets for the upper-layer and base-layer chunks;
The preceding description covered the concepts of full chunk-based syntactic analysis and its analysis process. As that description shows, the syntactic analysis system is built on chunk analysis, so the performance of chunk analysis directly limits the performance of the overall syntactic analysis. If the chunk analysis module identifies with complete accuracy the phrases that can form new chunks, then the syntax tree assembled from those correct chunks is also correct. The chunk analysis model used here is based on a sequence labeling model, namely conditional random fields (CRFs). Its performance therefore depends largely on the choice of features: a good feature set gives the model strong discriminative power and improves analysis accuracy. This section introduces the features used by the baseline system of the chunk-based syntactic analysis model. According to how they are applied, these features fall into two broad classes: features for base-layer chunk analysis and features for upper-layer chunk analysis.
Base-layer chunk analysis amounts to shallow parsing with conditional random fields (CRFs), so the features used at this layer are similar to those used in shallow parsing. Table 2 lists some of the feature templates used by the base-layer chunk analysis module of the baseline system; most of them come from the work of Sha and Pereira and of Yoshimasa Tsuruoka et al.
As Table 2 shows, base-layer chunk analysis uses only features of parts of speech and words. This is because base-layer chunk analysis is the first layer of analysis applied to the input data, and the input test sentences are just word sequences with part-of-speech tags, so only these two kinds of features are available.
Table 2: Features used by base-level chunk parsing in the baseline system
Feature class | Feature representation | Feature description
POS Unigram | p_i, i ∈ {-2, -1, 0, 1, 2} | Unigram part-of-speech feature
POS Bigram | p_i p_{i+1}, i ∈ {-2, -1, 0, 1} | Adjacent part-of-speech bigram feature
POS Trigram | p_{i-1} p_i p_{i+1}, i ∈ {-2, -1, 0, 1, 2} | Adjacent part-of-speech trigram feature
Word Unigram | w_i, i ∈ {-2, -1, 0, 1, 2} | Unigram word feature
Word Bigram | w_i w_{i+1}, i ∈ {-2, -1, 0, 1} | Adjacent word bigram feature
Word Trigram | w_{i-1} w_i w_{i+1}, i ∈ {0} | Adjacent word trigram feature
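The templates of Table 2 can be read as window offsets around the current token. The following is a minimal sketch of how such templates might be expanded into feature strings; the function names, the padding symbol and the feature string format are assumptions made for illustration, not the format of any particular CRF toolkit.

```python
def ngram(seq, start, n, pad="<PAD>"):
    # Join n consecutive items beginning at index start, padding outside the sentence.
    items = [seq[k] if 0 <= k < len(seq) else pad for k in range(start, start + n)]
    return "/".join(items)

def base_features(words, tags, t):
    """Expand the Table 2 templates for the token at position t."""
    feats = []
    for name, seq in (("POS", tags), ("W", words)):
        for i in (-2, -1, 0, 1, 2):      # unigram templates
            feats.append(f"{name}1[{i}]={ngram(seq, t + i, 1)}")
        for i in (-2, -1, 0, 1):         # bigram templates
            feats.append(f"{name}2[{i}]={ngram(seq, t + i, 2)}")
    for i in (-2, -1, 0, 1, 2):          # POS trigram templates, centred on t + i
        feats.append(f"POS3[{i}]={ngram(tags, t + i - 1, 3)}")
    feats.append(f"W3[0]={ngram(words, t - 1, 3)}")  # word trigram, i = 0 only
    return feats

# Hypothetical tagged sentence; the tag names are illustrative.
words = ["Tôi", "đọc", "sách"]
tags = ["P", "V", "N"]
features = base_features(words, tags, 1)
```

Each position yields 24 feature strings: 9 unigram and bigram features per information source, 5 part-of-speech trigrams and 1 word trigram.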
In the chunk-based syntactic analysis introduced here, chunk parsing at the levels above the base level is called upper-level analysis. Base-level chunk parsing operates on words and parts of speech, whereas upper-level chunk parsing operates on chunks. Since each chunk corresponds to a subtree in this syntactic analysis system, upper-level analysis can use features based on syntactic structure, for example non-terminal syntactic label features, the head word of a subtree and its part of speech, and the boundary node information of a subtree. Table 3 lists the feature templates used by upper-level chunk parsing in the baseline system; some of these templates come from the work of Yoshimasa Tsuruoka et al. As Table 3 shows, the baseline system uses three major types of features: non-terminal label features, head word features and head word part-of-speech features. With these three kinds of features alone, the chunk-based syntactic analysis system already reaches fairly high performance. However, these features do not make full use of the information provided by the lower-level chunks (each of which corresponds to a syntactic subtree), which is why the performance of the baseline system is limited. The following sections therefore introduce further features and corresponding methods that improve the performance of the baseline system.
Table 3: Feature templates used by upper-level chunk parsing in the baseline system
3) Using the selected feature sets of the upper-level and base-level chunks and the improved CRF model, build the upper-level chunk model and the base-level chunk model, then combine the two models to obtain the chunk-based Vietnamese phrase treebank construction model;
When performing chunk parsing, the chunk parsing problem can be converted into a sequence labelling problem; the following describes in detail how the full syntactic parsing problem is converted into a chunk parsing problem. Yoshimasa Tsuruoka et al. describe in their paper a two-stage method for syntactic parsing. They refer to the two stages as base-level chunk parsing (base-level chunking) and upper-level chunk parsing (up-level chunking). Two stages are used because base-level and upper-level chunk parsing use different features. The input to base-level chunk parsing is a sentence containing only words and their parts of speech, so words and parts of speech are the only features available to it. The output of base-level chunk parsing is a chunk sequence; since each chunk can be represented as a subtree, this chunk sequence can be viewed as a subtree sequence. The result of base-level chunk parsing (the subtree sequence) is passed to upper-level chunk parsing, which can therefore use richer features. In addition to the basic word and part-of-speech features, upper-level chunk parsing can also use the syntactic information of the subtrees. To make better use of the conditional random field model and of more features, the chunk-based full syntactic analysis model is divided into two parts: the base-level chunk parsing model and the upper-level chunk parsing model.
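One common way to cast chunk parsing as sequence labelling is to give each token a B (begin), I (inside) or O (outside) tag. The text does not state which tag scheme is used, so the BIO encoding in the sketch below is an assumption, as are the function names and the (start, end, label) span format with an exclusive end index.

```python
def chunks_to_bio(n_tokens, chunks):
    """Encode non-overlapping (start, end, label) chunks as one BIO tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in chunks:
        tags[start] = f"B-{label}"
        for k in range(start + 1, end):
            tags[k] = f"I-{label}"
    return tags

def bio_to_chunks(tags):
    """Decode a BIO tag sequence back into (start, end, label) chunks."""
    chunks = []
    start = label = None
    for k, tag in enumerate(list(tags) + ["O"]):  # the sentinel closes a final chunk
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            chunks.append((start, k, label))
            start = label = None
        if tag.startswith("B-"):
            start, label = k, tag[2:]
    return chunks

# Four tokens with a two-token NP followed by one unchunked token and a VP.
example_chunks = [(0, 2, "NP"), (3, 4, "VP")]
example_tags = chunks_to_bio(4, example_chunks)
```

With this encoding a CRF predicts one tag per token (or per subtree at the upper level), and the decoded spans give the chunks.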
Two models must likewise be trained separately when training the chunk-based syntactic analysis model. Specifically, the base-level chunks in the training treebank are used to train the base-level chunk model, and the upper-level chunks in the training treebank are used to train the upper-level chunk model. To train the two models separately, all chunks of a syntax tree must first be divided into two parts: the base-level chunk set and the upper-level chunk set. To give base-level and upper-level chunks a clear definition, the height of each node in a syntax tree is defined first: the height of each terminal node (word) is zero, and the height of any non-terminal node is the maximum height of its child nodes plus a fixed value of 1. Next, the levels of a syntax tree in Penn Treebank format are defined as follows. A complete syntax tree in Penn Treebank format can be divided into several levels; the number of levels equals the height of the root node, and each level consists of an ordered set of subtrees. The set of subtrees formed by the terminal nodes is level 0. The level-n subtree set consists of the subtrees whose height is less than or equal to n; where one subtree in this set is contained in a larger one, only the larger subtree is kept and the contained subtree is discarded. For the purposes of the syntactic parsing model implemented here, the chunk set corresponding to the level-2 ordered subtree set is called the base-level chunk set, and the chunk sets corresponding to all subtree sets above level 2 are collectively called the upper-level chunk set.
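The height and level definitions above can be sketched in a few lines. The tree representation (a nested tuple whose first element is the node label, with words as plain strings), the example sentence and its tags are assumptions made for illustration.

```python
def height(node):
    # A terminal node (word) has height 0; any other node is 1 above its tallest child.
    if isinstance(node, str):
        return 0
    return 1 + max(height(child) for child in node[1:])

def layer(node, n):
    """Ordered list of the maximal subtrees whose height is at most n."""
    if height(node) <= n:
        return [node]  # keep the larger subtree; the subtrees inside it are discarded
    result = []
    for child in node[1:]:
        result.extend(layer(child, n))
    return result

# Hypothetical Penn-Treebank-style tree; labels are illustrative.
tree = ("S",
        ("NP", ("P", "Tôi")),
        ("VP", ("V", "đọc"),
               ("NP", ("N", "sách"), ("A", "hay"))))

level0 = layer(tree, 0)       # the words themselves
base_level = layer(tree, 2)   # subtrees behind the base-level chunk set
```

In this example the root has height 4, so the tree has four levels above the words; the level-2 set contains the two NP subtrees and the part-of-speech node of the verb, while VP and S belong to the upper levels.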
In step Step3, a chunk parsing tool is used to perform chunk parsing on the 27,000 word-segmented Vietnamese sentences, yielding a chunk corpus; base-level and upper-level chunk parsing is then performed on this corpus to obtain a preliminary chunk-based Vietnamese phrase treebank of 27,000 trees;
1) Perform chunk parsing on the 27,000 word-segmented Vietnamese sentences to obtain a Vietnamese chunk corpus of 27,000 sentences;
The 27,000 Vietnamese sentences are first segmented into words with a word segmentation tool, and the chunk parsing tool is then applied to the 27,000 segmented sentences.
2) Use the upper-level chunk model and the base-level chunk model to perform base-level and upper-level chunk parsing on the chunk corpus, finally obtaining a preliminary chunk-based Vietnamese phrase treebank of 27,000 trees.
The upper-level chunk model and the base-level chunk model obtained in Step2.3 are applied to the chunk corpus for base-level and upper-level chunk parsing, finally yielding 27,000 Vietnamese phrase trees.
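The way base-level and upper-level chunk parsing might combine into a full tree can be sketched as a bottom-up loop. The sketch below makes several assumptions: the trained CRF chunkers are replaced by a toy rule-based stand-in, the upper-level chunker is applied repeatedly until a single tree remains, and all names, rules and tags are hypothetical.

```python
# Toy rewrite rules standing in for trained CRF chunkers (illustrative only).
RULES = [(("N", "A"), "NP"), (("P",), "NP"), (("V", "NP"), "VP"), (("NP", "VP"), "S")]

def toy_chunker(nodes):
    """Return (start, end, label) groups over a sequence of subtrees."""
    labels = [node[0] for node in nodes]
    chunks, k = [], 0
    while k < len(labels):
        for pattern, new_label in RULES:
            if tuple(labels[k:k + len(pattern)]) == pattern:
                chunks.append((k, k + len(pattern), new_label))
                k += len(pattern)
                break
        else:
            k += 1  # no rule starts here; leave this subtree unchunked
    return chunks

def merge(nodes, chunks):
    """Replace each chunked span of subtrees by one new parent subtree."""
    spans = {start: (end, label) for start, end, label in chunks}
    merged, k = [], 0
    while k < len(nodes):
        if k in spans:
            end, label = spans[k]
            merged.append((label, *nodes[k:end]))
            k = end
        else:
            merged.append(nodes[k])
            k += 1
    return merged

def build_tree(tagged_words, base_chunker, upper_chunker, max_passes=50):
    nodes = [(tag, word) for word, tag in tagged_words]
    nodes = merge(nodes, base_chunker(nodes))   # base-level pass over words and tags
    for _ in range(max_passes):                 # upper-level passes over subtrees
        if len(nodes) == 1:
            break
        chunks = upper_chunker(nodes)
        if not chunks:
            break
        nodes = merge(nodes, chunks)
    return nodes

sentence = [("Tôi", "P"), ("đọc", "V"), ("sách", "N"), ("hay", "A")]
result = build_tree(sentence, toy_chunker, toy_chunker)
```

The `max_passes` guard is a defensive choice for the sketch, since a chunker that keeps proposing unary chunks would otherwise never terminate.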
In step Step4, a phrase treebank corrector is used to correct the preliminary chunk-based Vietnamese phrase treebank, finally obtaining the corrected final Vietnamese phrase treebank.
The preliminary Vietnamese phrase treebank obtained in Step3 has some quality problems, mainly because the accuracy of the Vietnamese chunk corpus obtained in Step2 is not high enough. To address this problem, a phrase tree corrector is used to correct the preliminary Vietnamese phrase treebank, finally yielding a Vietnamese phrase treebank of higher quality.
The present invention first annotates 5,000 manually annotated Vietnamese phrase trees with subtree levels, base-level chunk sets and upper-level chunk sets to serve as the training treebank. It then selects the feature sets of the upper-level and base-level chunks, uses CRF to build the upper-level and base-level chunk models, and converts the chunk parsing results into Vietnamese phrase trees. Next, a chunk parsing tool is applied to the 27,000 word-segmented Vietnamese sentences to obtain a chunk corpus, on which base-level and upper-level chunk parsing is completed to obtain 27,000 Vietnamese phrase trees. Finally, a phrase treebank corrector is used to correct the newly generated Vietnamese phrase treebank, yielding the final Vietnamese phrase treebank.
The experimental results are shown in Table 4. As Table 4 shows, the Vietnamese phrase treebank generated by the chunk-based construction method is markedly more accurate than the Vietnamese phrase treebanks built with PCFG and with maximum entropy;
The evaluation uses the PARSEVAL syntactic analysis evaluation scheme, which is currently a widely used evaluation standard. It consists mainly of three measures: precision (LP), recall (LR) and the F value, where the F value takes both precision and recall into account. They are defined as follows:
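The definitions themselves are not reproduced in the text. As a hedged illustration, the sketch below uses the commonly cited PARSEVAL definitions: LP is the share of predicted labelled brackets that also occur in the gold tree, LR is the share of gold brackets that were predicted, and the F value is assumed to be their harmonic mean. The bracket format (label, start, end) and the example brackets are hypothetical.

```python
def parseval(gold, predicted):
    """Return (LP, LR, F) for two collections of (label, start, end) brackets."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    lp = correct / len(predicted) if predicted else 0.0   # labelled precision
    lr = correct / len(gold) if gold else 0.0             # labelled recall
    f = 2 * lp * lr / (lp + lr) if lp + lr else 0.0       # harmonic mean (assumed)
    return lp, lr, f

# Hypothetical brackets for a four-word sentence.
gold_brackets = {("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 2, 4)}
predicted_brackets = {("S", 0, 4), ("NP", 0, 1), ("VP", 1, 3)}
lp, lr, f = parseval(gold_brackets, predicted_brackets)
```

Representing brackets as sets ignores repeated identical brackets, which is a simplification of the sketch.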
Table 4: Comparison of other methods with the method of the invention
Method | LR% | LP% | F value %
Vietnamese phrase treebank built with PCFG | 81.36 | 80.64 | 81.00
Vietnamese phrase treebank built with maximum entropy | 79.83 | 78.69 | 79.26
New Vietnamese phrase treebank built based on chunks | 86.32 | 83.45 | 85.66
The specific embodiments of the present invention have been explained in detail above with reference to the accompanying drawings, but the present invention is not limited to the above embodiments; within the scope of knowledge possessed by those of ordinary skill in the art, various changes can also be made without departing from the spirit of the present invention.

Claims (4)

1. A chunk-based Vietnamese phrase tree construction method, characterized in that the specific steps of the chunk-based Vietnamese phrase tree construction method are as follows:
Step1: first perform upper-level chunk and base-level chunk annotation on the Vietnamese phrase tree annotation set, and use the annotated phrase trees as the training corpus;
Step2: select the feature sets of the upper-level chunks and base-level chunks; adjust the CRF model according to the training corpus and train the improved CRF model; use the improved CRF model to build the upper-level chunk model and the base-level chunk model; combine the upper-level and base-level chunk models to obtain the chunk-based Vietnamese phrase treebank construction model;
Step3: use a chunk parsing tool to perform chunk parsing on the word-segmented Vietnamese sentences, thereby obtaining a chunk corpus; perform base-level and upper-level chunk parsing on the obtained corpus to obtain the preliminary chunk-based Vietnamese phrase treebank;
Step4: use a phrase treebank corrector to correct the preliminary chunk-based Vietnamese phrase treebank, finally obtaining the corrected final Vietnamese phrase treebank.
2. The chunk-based Vietnamese phrase tree construction method according to claim 1, characterized in that, in step Step1, the specific steps of performing upper-level chunk and base-level chunk annotation on the manually annotated Vietnamese phrase trees are as follows:
Step1.1: formulate the annotation set for Vietnamese phrase trees according to the linguistic features of Vietnamese, in combination with the annotation scheme of CTB, i.e. the Chinese Penn Treebank;
Step1.2: in combination with the definitions of upper-level chunks and base-level chunks, complete the upper-level chunk and base-level chunk annotation of the Vietnamese phrase tree annotation set;
Step1.3: use the annotated Vietnamese phrase trees, composed of upper-level chunks and base-level chunks, as the training corpus.
3. The chunk-based Vietnamese phrase tree construction method according to claim 1, characterized in that the specific steps of step Step2 are as follows:
Step2.1: adjust the CRF model according to the training corpus and train the improved CRF model;
Step2.2: select the feature sets of the upper-level chunks and base-level chunks;
Step2.3: using the selected feature sets of the upper-level and base-level chunks and the improved CRF model, build the upper-level chunk model and the base-level chunk model, then combine the two models to obtain the chunk-based Vietnamese phrase treebank construction model.
4. The chunk-based Vietnamese phrase tree construction method according to claim 1 or 3, characterized in that the specific steps of step Step3 are as follows:
Step3.1: perform chunk parsing on the word-segmented Vietnamese sentences to obtain the Vietnamese chunk corpus;
Step3.2: use the obtained upper-level chunk model and base-level chunk model to perform base-level and upper-level chunk parsing on the chunk corpus, finally obtaining the preliminary chunk-based Vietnamese phrase treebank.
CN201610497061.2A 2016-06-30 2016-06-30 Vietnamese phrase tree constructing method based on chunking Active CN106202037B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610497061.2A CN106202037B (en) 2016-06-30 2016-06-30 Vietnamese phrase tree constructing method based on chunking

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610497061.2A CN106202037B (en) 2016-06-30 2016-06-30 Vietnamese phrase tree constructing method based on chunking

Publications (2)

Publication Number Publication Date
CN106202037A true CN106202037A (en) 2016-12-07
CN106202037B CN106202037B (en) 2019-05-14

Family

ID=57463532

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610497061.2A Active CN106202037B (en) 2016-06-30 2016-06-30 Vietnamese phrase tree constructing method based on chunking

Country Status (1)

Country Link
CN (1) CN106202037B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491383A (en) * 2018-03-14 2018-09-04 昆明理工大学 A kind of Thai sentence cutting method based on maximum entropy disaggregated model and the correction of Thai syntax rule
CN110096715A (en) * 2019-05-06 2019-08-06 北京理工大学 A kind of fusion pronunciation character Chinese-Vietnamese statistical machine translation method
CN110362691A (en) * 2019-07-19 2019-10-22 大连语智星科技有限公司 A kind of tree bank building system
CN113627150A (en) * 2021-07-01 2021-11-09 昆明理工大学 Method and device for extracting parallel sentence pairs for transfer learning based on language similarity

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101377770A (en) * 2007-08-27 2009-03-04 微软公司 Method and system for analyzing Chinese group block
CN101446941A (en) * 2008-12-10 2009-06-03 苏州大学 Natural language level and syntax analytic method based on historical information
CN103500160A (en) * 2013-10-18 2014-01-08 大连理工大学 Syntactic analysis method based on sliding semantic string matching

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
刘新: ""基于层叠条件随机场的汉语句法分析技术的研究"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
周强 等: ""汉语句子的组块分析体系"", 《计算机学报》 *
耿向好 等: ""一种基于历史信息的多层次中文句法分析方法"", 《计算机应用与软件》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491383A (en) * 2018-03-14 2018-09-04 昆明理工大学 A kind of Thai sentence cutting method based on maximum entropy disaggregated model and the correction of Thai syntax rule
CN110096715A (en) * 2019-05-06 2019-08-06 北京理工大学 A kind of fusion pronunciation character Chinese-Vietnamese statistical machine translation method
CN110362691A (en) * 2019-07-19 2019-10-22 大连语智星科技有限公司 A kind of tree bank building system
CN113627150A (en) * 2021-07-01 2021-11-09 昆明理工大学 Method and device for extracting parallel sentence pairs for transfer learning based on language similarity
CN113627150B (en) * 2021-07-01 2022-12-20 昆明理工大学 Language similarity-based parallel sentence pair extraction method and device for transfer learning

Also Published As

Publication number Publication date
CN106202037B (en) 2019-05-14

Similar Documents

Publication Publication Date Title
Bod An all-subtrees approach to unsupervised parsing
CN105045778B (en) A kind of Chinese homonym mistake auto-collation
CN105957518A (en) Mongolian large vocabulary continuous speech recognition method
CN101051458B (en) Rhythm phrase predicting method based on module analysis
CN106202037B (en) Vietnamese phrase tree constructing method based on chunking
CN101650942A (en) Prosodic structure forming method based on prosodic phrase
Maamouri et al. Diacritization: A challenge to Arabic treebank annotation and parsing
CN103688254B (en) Error-detecting system based on example, method and error-detecting facility for assessment writing automatically
CN106528524A (en) Word segmentation method based on MMseg algorithm and pointwise mutual information algorithm
CN104778256A (en) Rapid incremental clustering method for domain question-answering system consultations
CN101685441A (en) Generalized reordering statistic translation method and device based on non-continuous phrase
CN104375988A (en) Word and expression alignment method and device
Kübler et al. Part of speech tagging for Arabic
Georgi From Aari to Zulu: massively multilingual creation of language tools using interlinear glossed text
Maskey et al. Bootstrapping phonetic lexicons for new languages
Rugchatjaroen et al. Efficient two-stage processing for joint sequence model-based Thai grapheme-to-phoneme conversion
Oravecz et al. Semi-automatic normalization of Old Hungarian codices
Kaur et al. Hybrid approach for spell checker and grammar checker for Punjabi
Taji et al. The columbia university-new york university abu dhabi sigmorphon 2016 morphological reinflection shared task submission
Rosen Building and Using Corpora of Non-Native Czech.
Wu et al. Modeling hip hop challenge-response lyrics as machine translation
Dembitz et al. An economic approach to big data in a minority language
Saychum et al. Efficient Thai Grapheme-to-Phoneme Conversion Using CRF-Based Joint Sequence Modeling.
Vičič et al. Automated implementation process of machine translation system for related languages
Li et al. The study of comparison and conversion about traditional Mongolian and Cyrillic Mongolian

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Yu Zhengtao

Inventor after: Li Ying

Inventor after: Guo Jianyi

Inventor after: Xian Yantuan

Inventor after: Mao Cunli

Inventor after: Chen Wei

Inventor before: Guo Jianyi

Inventor before: Li Ying

Inventor before: Yu Zhengtao

Inventor before: Xian Yantuan

Inventor before: Mao Cunli

Inventor before: Chen Wei