CN106021227A - State transition and neural network-based Chinese chunk parsing method - Google Patents


Info

Publication number
CN106021227A
CN106021227A (application CN201610324281.5A)
Authority
CN
China
Prior art keywords
word
vector
speech
chunk
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610324281.5A
Other languages
Chinese (zh)
Other versions
CN106021227B (en)
Inventor
戴新宇
程川
陈家骏
黄书剑
张建兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201610324281.5A priority Critical patent/CN106021227B/en
Publication of CN106021227A publication Critical patent/CN106021227A/en
Application granted granted Critical
Publication of CN106021227B publication Critical patent/CN106021227B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention proposes a Chinese chunk parsing method based on state transitions and neural networks. The method converts the chunk parsing task into a sequence labeling task, labels each sentence within a state-transition framework, and scores the candidate transition operations at each state with a feedforward neural network; distributed representations of words and part-of-speech tags, learned with a bidirectional long short-term memory (LSTM) neural network model, serve as additional information features of the labeling model, thereby improving chunking accuracy. Compared with other Chinese chunk parsing techniques, the state-transition framework adds chunk-level features more flexibly, the neural network learns combinations of features automatically, and the bidirectional LSTM model introduces useful additional information features; the combination of the three effectively improves the accuracy of chunk parsing.

Description

Chinese chunk parsing method based on state transitions and neural networks
Technical field
The present invention relates to a method for shallow parsing of Chinese by computer, and in particular to a method for automatic Chinese chunk parsing that combines a state-transition framework with neural networks.
Background technology
Chinese parsing is a fundamental task in Chinese information processing; its wide range of applications has attracted a large body of research and driven rapid progress in related techniques. Full syntactic parsing, because of the inherent complexity of the problem, suffers from comparatively low accuracy and slow speed, which limits its practicality. Chunk parsing, also called shallow parsing, differs from full parsing, whose goal is a complete syntax tree for the sentence: its target is to identify only the structurally simple, non-nested constituents of a sentence, such as non-nested noun phrases and verb phrases. Because it recognizes only non-nested, non-overlapping phrase constituents that satisfy certain grammatical norms, chunk parsing is less complex and faster than full syntactic parsing, and since it can serve as a preprocessing stage for many tasks such as machine translation, full syntactic parsing, and information extraction, it has received continuous attention from researchers. For Chinese, ever since the Chinese treebank appeared and researchers extracted chunking datasets from it, related research has been ongoing.
Treating chunk parsing as a sequence labeling task is a common way to model it. The procedure is: for a sentence to be analyzed, label each word from left to right, one word at a time. One labeling scheme distinguishes five kinds of tags: a word that begins a chunk of a given type (noun phrase, verb phrase, adjective phrase, etc.), a word that alone forms a chunk of a given type, a word inside a chunk, a word that ends a chunk, and a word outside any chunk. After the whole sentence has been labeled in this way, the complete chunk information is recovered from the tags. The present invention also models the Chinese chunk parsing task as sequence labeling and uses these five kinds of tags.
Statistics-based methods are widely used for chunk parsing. The usual practice is to apply classical structured-learning models based on dynamic programming, such as hidden Markov models, conditional random fields, and support vector machines; among existing patented techniques, Microsoft's 2007 patent application "Method and system of Chinese text chunking" uses a conditional random field model. However, the assumptions inherent in such models limit their use of chunk-level features, which is a significant drawback for a task like chunk parsing that takes the whole sentence as its object and benefits from considering global information. To alleviate this, a state-transition-based method is one option; such methods are widely used in full syntactic parsing and are both efficient and accurate. The procedure is: for a sentence to be analyzed, read in words from left to right, one at a time, and apply a labeling operation to each word read, with the tag types as described above. Each labeling operation corresponds to a transition of a state defined over the whole sentence (a state records which words of the current sentence have been labeled, the tag assigned to each labeled word, and which words remain unlabeled), and the choice of tag is made by a trained scoring model. When a word is labeled, the tags of all words to its left have already been determined, so the information of the already-labeled words can be fully exploited to guide the labeling of the current word, in particular the information of the chunks already recognized to its left. In order to use more chunk-level information features, the present invention adopts the state-transition approach to Chinese chunk parsing.
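As an illustration of the sentence state just described, the following is a minimal Python sketch (the names and structure are assumptions for illustration, not taken from the patent) that represents a state as the tagged prefix plus its running score; the unlabeled suffix is implicit in the prefix length:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class State:
    """A parser state: the tags chosen so far and the accumulated score."""
    tags: Tuple[str, ...] = ()   # tags of the already-labeled prefix, left to right
    score: float = 0.0           # running model score of this partial analysis

    def extend(self, tag: str, tag_score: float) -> "State":
        # labeling the next word corresponds to one state transition
        return State(self.tags + (tag,), self.score + tag_score)

s0 = State()                     # initial state: nothing labeled yet
s1 = s0.extend("B-NP", 1.7)      # first word starts a noun phrase
s2 = s1.extend("E", 2.1)         # second word ends that chunk
print(s2.tags, s2.score)         # ('B-NP', 'E') 3.8
```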
Neural networks are a commonly used machine learning method with the ability to learn automatically, from basic atomic features, how features combine; this differs from traditional methods, which require the user to design a large number of task-specific feature templates from prior knowledge such as linguistics. Neural networks have been tried extensively in Chinese information processing, but so far they have not been used for Chinese chunk parsing. Using a neural network saves the manual work of crafting large numbers of feature-combination templates, while its strong expressive power allows combinations of features to be learned automatically. On the other hand, in traditional chunking techniques the information features used when labeling each word are the words or part-of-speech information within a window of fixed size around the current word; yet analysis of the Chinese sentences in the treebank shows that much information useful for chunking often falls outside such a window, for example paired punctuation such as quotation marks, or text patterns such as "word, word, word, word, …" separated by the pause mark (、). Such information often spans a wide range and is not easily incorporated into traditional chunking techniques. To make full use of this information, the present invention applies a bidirectional long short-term memory (LSTM) neural network to the word and part-of-speech sequences of the sentence, thereby capturing more long-distance word and part-of-speech features.
Summary of the invention
Object of the invention: the models used in current Chinese chunk parsing techniques cannot make full use of chunk-level and long-distance information features, and they require manually customized, complex feature-combination templates. The present invention proposes a method based on state transitions and neural networks to alleviate these limitations and to improve the accuracy of Chinese chunk parsing.
To solve the above technical problems, the invention discloses a Chinese chunk parsing method based on state transitions and neural networks, together with supplementary notes on the training of the model parameters used during analysis.
The Chinese chunk parsing method based on state transitions and neural networks of the present invention comprises the following steps:
Step 1: the computer reads a Chinese text file containing the sentences to be analyzed, defines the types of Chinese chunks, segments each sentence to be analyzed into words, and assigns a part-of-speech tag to each word; the tag types that may be selected during labeling are determined by the current sentence state;
Step 2: analyze each sentence into Chinese chunks using the method based on state transitions and neural networks.
Step 1 comprises the following steps:
Step 1-1: define the Chinese chunk types on the basis of the 12 phrase types defined in version 4.0 of the Penn Chinese Treebank (CTB, an annotated treebank of Chinese built at the University of Pennsylvania). The chunk types are chosen by the user according to the goal at hand; traditional Chinese chunk parsing tasks generally come in two concrete flavors: one recognizes noun phrases only, the other recognizes chunks of the 12 types defined on the basis of CTB 4.0. Embodiment 1 adopts the second flavor; the meanings of the 12 phrase types are given in Table 1:
Table 1. Chinese chunk types

Type   Meaning                 Example
ADJP   Adjective phrase        developing/JJ country/NN
ADVP   Adverbial phrase        generally/AD adopt/VV
CLP    Classifier phrase       Hong Kong dollar/M and/CC dollar/M
DNP    DEG (的) phrase         more 的/DEG
DP     Determiner phrase       this/DT
DVP    DEV (地) phrase         equal/VA harmonious/VA 地/DEV
LCP    Localizer phrase        recent years/NT 来/LC
LST    List marker             (/PU one/CD )/PU
NP     Noun phrase             highway/NN project/NN
PP     Prepositional phrase    with/P complete-equipment plant/NN
QP     Quantifier phrase       one/CD 个/M
VP     Verb phrase             forever/AD bloom/VV
Here, "NN" in "country/NN" is the part-of-speech tag of the word: "NN" denotes a noun, "VV" a verb, and so on.
Step 1-2: determine the tag types that may be selected for each word during labeling, by combining the BIOES tagging scheme with the Chinese chunk types defined in step 1-1. After the chunk parsing task has been modeled as a sequence labeling task, a tagging scheme must be chosen. In English chunk parsing, two families of schemes are commonly used, BIO and BIOES, in which each word of the sentence is tagged with a combination of a chunk type and one of the BIO or BIOES symbols. In the BIO scheme, B marks the beginning of a chunk, I the inside of a chunk, and O a position outside any chunk. In the BIOES scheme, B marks the beginning of a chunk, I the inside of a chunk, E the end of a chunk, O a position outside any chunk, and S a word that alone forms a chunk. The following tagged sentence illustrates the BIOES scheme. First, a sentence segmented into chunks is given:
[NP Shanghai Pudong] [NP development and legal-system construction] [VP synchronize] 。
NP indicates that the chunk is a noun phrase and VP that it is a verb phrase; "。" belongs to no chunk. After labeling with the BIOES scheme, the sentence takes the following form:
Shanghai_B-NP Pudong_E-NP development_B-NP and_I-NP legal-system_I-NP construction_E-NP synchronize_S-VP 。_O

It should be noted that tagging in the present invention follows the BIOES scheme. Moreover, chunk types are not combined with all of BIOES: only B and S are combined with chunk types. That is, if the chunk types are type_1, type_2, …, type_k (k kinds in all), combining them with B and S gives the 2k tags B-type_1, B-type_2, …, B-type_k and S-type_1, S-type_2, …, S-type_k; adding the three plain tags I, E, O yields 2k+3 tag types in total. In the present invention k = 12, so there are 27 tag types. The example sentence tagged in this way becomes:
Shanghai_B-NP Pudong_E development_B-NP and_I legal-system_I construction_E synchronize_S-VP 。_O
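As a minimal sketch of this 2k+3 tagging scheme (illustrative code, not taken from the patent), the following converts a chunked sentence into per-word tags, combining the chunk type only with B and S:

```python
from typing import List, Optional, Tuple

def to_bioes_2k3(chunks: List[Tuple[Optional[str], List[str]]]) -> List[Tuple[str, str]]:
    """chunks: list of (chunk_type or None, words); None means outside any chunk.
    Returns (word, tag) pairs; only B and S carry the chunk type."""
    tagged = []
    for ctype, words in chunks:
        if ctype is None:                        # outside-chunk words
            tagged += [(w, "O") for w in words]
        elif len(words) == 1:                    # single-word chunk
            tagged.append((words[0], f"S-{ctype}"))
        else:
            tagged.append((words[0], f"B-{ctype}"))
            tagged += [(w, "I") for w in words[1:-1]]
            tagged.append((words[-1], "E"))
    return tagged

sent = [("NP", ["Shanghai", "Pudong"]),
        ("NP", ["development", "and", "legal-system", "construction"]),
        ("VP", ["synchronize"]),
        (None, ["。"])]
print(to_bioes_2k3(sent))
# [('Shanghai', 'B-NP'), ('Pudong', 'E'), ('development', 'B-NP'), ('and', 'I'),
#  ('legal-system', 'I'), ('construction', 'E'), ('synchronize', 'S-VP'), ('。', 'O')]
```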
Additionally, during labeling, the generation of candidate tag types for a word is restricted by certain rules; in the present invention the restrictions are as follows (a code sketch implementing these checks follows the list):
1. The first word of a sentence cannot be tagged I or E;
2. a word following a word tagged B-type_x cannot be tagged B-type_y, O, or S-type_y;
3. a word following a word tagged I cannot be tagged B-type_y, O, or S-type_y;
4. a word following a word tagged O cannot be tagged I or E;
5. a word following a word tagged E cannot be tagged I or E;
6. a word following a word tagged S-type_x cannot be tagged I or E.
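A compact Python sketch of these constraints (illustrative, not the patent's code): given the previous tag, decide whether a candidate tag is still permitted.

```python
from typing import Optional

def allowed(prev_tag: Optional[str], tag: str) -> bool:
    """Return True if `tag` may follow `prev_tag` (None = sentence start)."""
    begins_or_out = tag.startswith(("B-", "S-")) or tag == "O"
    continues = tag in ("I", "E")
    if prev_tag is None:                 # rule 1: no I/E at sentence start
        return not continues
    if prev_tag.startswith("B-") or prev_tag == "I":
        return continues                 # rules 2-3: an open chunk must continue
    # rules 4-6: after O, E, or S-*, a chunk must not be continued
    return begins_or_out

assert allowed(None, "B-NP") and not allowed(None, "I")
assert allowed("B-NP", "E") and not allowed("B-NP", "O")
assert allowed("E", "S-VP") and not allowed("S-VP", "I")
```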
In step 1 the computer reads a natural-language text file containing the sentences to be analyzed. For Chinese chunk parsing, the input is required not only to be segmented into words but also to carry a part-of-speech tag for each word. A complete input sentence is shown in Table 2:

Table 2. A complete input sentence to be analyzed
Word                  Part-of-speech tag
France                NR
national defence      NN
minister              NN
Leotard               NR
the 1st               NT
said                  VV
,                     PU
France                NR
currently             AD
studying              VV
from                  P
Bosnia-Herzegovina    NR
withdraw troops       VV
的                    DEC
plan                  NN
。                    PU
Step 2: chunk-parse each sentence read in, using the method based on state transitions and neural networks. This part of the work is carried out within the state-transition framework. In state-transition-based sequence labeling, for every sentence, words are read in from left to right, one at a time, and reading each word causes one transition of the current sentence state; a sentence state records which words of the sentence have been labeled, the tag assigned to each labeled word, and which words remain unlabeled. If the tag given to each word is unique, then once every word of the sentence has been labeled, the complete tag sequence of the sentence has been obtained. Briefly: suppose the sentence has length n, the initial state is s_1, the tag given to the t-th word is mark_t, and the state after labeling the t-th word is s_{t+1}; the whole process can then be written as

s_1 --mark_1--> s_2 --mark_2--> … --mark_n--> s_{n+1},

and the tag sequence of the whole sentence is mark_1, mark_2, …, mark_n. The present invention calls this procedure greedy search. Since the whole-sentence tagging accuracy it yields is comparatively low, the present invention instead completes the tagging of the whole sentence with beam search.
Before describing the method for post search in detail, need simply to introduce search completely: search completely is different from greedy The heart is searched for, and when being labeled for each word during search, no longer has to an annotation results, but obtains One annotation results set (i.e. state set), it is assumed that the state set that sentence before being labeled i-th word is in It is expressed as Si, therefore the state set of sentence is S before being labeled first word of sentence1, only one of which state, It is expressed asWhen being labeled first word, its candidate's marking types is defined by step 1-2, it is assumed that for state set S1 In the notation methods that can select when current word is labeled of each state be k, then to stateCarry out k completely The state set S obtained after planting mark and extension2In have k state, be expressed as(order is by score height It is ranked up);In like manner, when second word is labeled, will be to state set S2In each state carry out k kind extension, The new state set obtained will have k2Individual state, is expressed asBy that analogy, exist The state set of mark completely to whole sentence has just been obtained after the t word is extendedIf which kind of mark extended operation every time (has i.e. carried out for this) Can retain in new state after expansion, it is possible to from state set Sn+1In each state set out backtracking, reduce a pin Complete annotated sequence to this sentence, wherein by Sn+1The sequence of that state reduction of middle highest scoring is exactly that the method is to this The annotation results of sentence.Using this searching method, state set size will be made quickly to increase, this is in real operation can not Row, so the mode that have employed post search in the present invention reduces the state set after extending every time.Post is searched for and is searched completely The place of Suo Butong is: to preceding state set St-1In all states when being extended, the new state not being in control The status number of set has how many, and (choosing of m is selected depending on specific tasks the m the most only keeping score the highest by user, typically M is the biggest, and the mark precision obtained is the highest, but expense is the biggest, and the m as chosen in embodiment 1 is 4) individual, so can ensure that The size of the new state set obtained after having operated for the conditional extensions of each word is less than m.As search completely, From state set Sn+1That state of middle highest scoring is set out and is recalled forward, and the annotated sequence to this sentence that reduction obtains is i.e. For the method annotation results to this sentence.The present invention just have employed this post way of search.
Throughout step 2, n denotes the length of the sentence to be analyzed. Step 2 comprises the following steps:
Step 2-1: under a given state (the state records which words have been labeled and with which tag types, and which words remain unlabeled), when processing the t-th word, score all tag types. Here the given state is one in which the first t-1 words of the sentence have been labeled with known tags, words t through n are unlabeled, and the t-th word is the next word to be processed;
Step 2-2: given the state set S_t, when processing the t-th word, score all tag types for each state s_i^t in the set as in step 2-1. Scoring is done by computation: each tag type is assigned a real value, called the score of that type. Candidate tag types are then generated as described in step 1-2, the state is extended by labeling the word with each candidate tag, and beam search keeps the m highest-scoring new states, yielding the new state set S_{t+1};
Step 2-3: perform steps 2-1 and 2-2 for t = 1, 2, …, n, obtaining the final target state set S_{n+1}; take its highest-scoring state and trace back from it to obtain the highest-scoring tag sequence, in which every word now has a tag. This tag sequence is converted into the corresponding chunk parsing result, which is the analysis result of the current sentence.
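A small sketch (illustrative, not from the patent) of the final conversion from a BIOES tag sequence back to chunks:

```python
from typing import List, Optional, Tuple

def tags_to_chunks(words: List[str], tags: List[str]) -> List[Tuple[Optional[str], List[str]]]:
    """Invert the 2k+3 tagging: B-x ... E and S-x become chunks, O words stay bare."""
    chunks, buf, ctype = [], [], None
    for w, tag in zip(words, tags):
        if tag.startswith("B-"):
            buf, ctype = [w], tag[2:]
        elif tag == "I":
            buf.append(w)
        elif tag == "E":
            chunks.append((ctype, buf + [w])); buf, ctype = [], None
        elif tag.startswith("S-"):
            chunks.append((tag[2:], [w]))
        else:                            # tag == "O"
            chunks.append((None, [w]))
    if buf:                              # guard against an unclosed chunk
        chunks.append((ctype, buf))
    return chunks

print(tags_to_chunks(["Shanghai", "Pudong", "synchronize", "。"],
                     ["B-NP", "E", "S-VP", "O"]))
# [('NP', ['Shanghai', 'Pudong']), ('VP', ['synchronize']), (None, ['。'])]
```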
The state transition operation for a word described in the present invention is the tag classification operation applied to the word read in, under the current sentence state. When labeling the t-th word, given some state in the previous state set S_t, the set of tag types that may be applied is defined by step 1-2, and the scoring of each tag in that set is done by a feedforward neural network. Using the neural network to score the tags that may be given to the current word under a given state involves two steps: first, generating the feature information, i.e., the input of the neural network; second, using the network to score all candidate types. Step 2-1 specifically comprises the following steps:
Step 2-1-1: generate the feature vector, which consists of the basic-information feature vector and the additional-information feature vector;
Step 2-1-2: feed the feature vector generated in step 2-1-1 into a feedforward neural network and compute the scores of all candidate tag types.
First, it should be pointed out that in information processing there are two main ways to represent a feature: one-hot representation and distributed representation. A one-hot representation uses a very long vector whose length is the size of the lexicon formed by all features; only the component at the position of the feature in the lexicon is 1, and all other components are 0. A distributed representation assigns each feature a real-valued vector whose dimensionality is set by the user according to the task. Both representations are widely used in this field and are known to those skilled in the art, so they are not explained further here. The present invention uses distributed representations, i.e., each feature is assigned a real-valued vector of a chosen dimensionality; the feature dimensionality set in embodiment 1 is 50. In the present invention the generation of this input has two parts: generating the basic-information features and generating the additional-information features. Throughout step 2-1-1, the words of the sentence to be analyzed are written, from left to right, w_1, w_2, …, w_n, where w_n is the n-th word and n is a natural number; the corresponding parts of speech are written p_1, p_2, …, p_n, where p_n is the part of speech of the n-th word; and the feature vector corresponding to a feature * is written e(*). Step 2-1-1 comprises the following steps:
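As a minimal sketch of the distributed representation just described (dimension 50 as in embodiment 1; the vocabulary and initialization are illustrative assumptions), each feature indexes a row of a trainable embedding matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 50                                            # embodiment 1 uses 50
vocab = {"France": 0, "minister": 1, "NR": 2, "NN": 3}
E = rng.normal(scale=0.01, size=(len(vocab), DIM))  # trained jointly with the model

def e(feature: str) -> np.ndarray:
    """Distributed representation e(*): one 50-dim real vector per feature."""
    return E[vocab[feature]]

x = np.concatenate([e("France"), e("NR")])          # feature vectors are concatenated
print(x.shape)                                      # (100,)
```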
Step 2-1-1-1: generate the basic-information feature vector. It consists of the feature vectors of the words and parts of speech within a window centered on the current word to be labeled, and the feature vectors of the tags of already-labeled words within a window to its left. Concretely, the word feature vectors among the basic-information features are: the vector e(w_{-2}) of the second word to the left of the current word, the vector e(w_{-1}) of the first word to the left, the vector e(w_0) of the current word itself, the vector e(w_1) of the first word to the right, and the vector e(w_2) of the second word to the right;
The part-of-speech feature vectors are: e(p_{-2}) and e(p_{-1}) for the second and first words to the left of the current word, e(p_0) for the current word, and e(p_1) and e(p_2) for the first and second words to the right, together with the part-of-speech bigram vectors e(p_{-2}p_{-1}), e(p_{-1}p_0), e(p_0p_1), and e(p_1p_2) for the corresponding adjacent pairs;
In chunk parsing, the basic features used at each step to score the tag types typically include the words and parts of speech within a window centered on the position of the current word, together with the tags of the already-labeled words within a window to its left. Conventionally, the current word is written w_0, the i-th word to its left w_{-i}, and the i-th word to its right w_i; the part of speech of the current word is p_0, of the i-th word to the left p_{-i}, and of the i-th word to the right p_i. The tag features of already-labeled words differ from the above two kinds: the words and parts of speech of the whole sentence are known before the analysis starts, so their windows extend to both sides of the current word, whereas labeling proceeds from left to right, so when a word is being labeled only the tags of the words to its left are known, and the tag window extends leftward only; the tag of the i-th word to the left of the current word is written t_{-i}. The choice of i depends on the chosen window size; embodiment 1 uses i = 2 (i.e., window size 5), and the corresponding basic features are shown in Tables 3, 4, and 5:
Table 3. Basic word features: w_{-2}, w_{-1}, w_0, w_1, w_2
Table 4. Basic part-of-speech features: p_{-2}, p_{-1}, p_0, p_1, p_2, p_{-2}p_{-1}, p_{-1}p_0, p_0p_1, p_1p_2
Table 5. Tag features of already-labeled words: t_{-2}, t_{-1}
It should be noted that the above word- and part-of-speech-based features are known to those skilled in the art and widely used, so they are not described further here; see the following reference: Chen W., Zhang Y., Isahara H. An empirical study of Chinese chunking. In: Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions. Association for Computational Linguistics, 2006: 97-104.
The above tag features of already-labeled words have the same meaning as in traditional models such as hidden Markov models and conditional random fields, but they are used differently: the present invention treats them as features on a par with the preceding word and part-of-speech features, whereas traditional models handle them through dynamic programming. In such models, increasing i makes the time cost grow rapidly; in the state-transition approach of the present invention, increasing i adds little time cost, which is an advantage of the state-transition approach with respect to the speed of incorporating this kind of feature;
Step 2-1-1-2: generate the additional-information feature vector. It consists of the word and part-of-speech feature vectors of the already-recognized chunks within a window to the left of the current word to be labeled, and of the word and part-of-speech feature vectors of the current position computed with the bidirectional LSTM neural network model.
Step 2-1-1-2 comprises the following steps:
Step 2-1-1-2-1: write the second and first already-recognized chunks to the left of the current word as c_{-2} and c_{-1}; for a chunk c_i (i = -2, -1), write its first word start_word(c_i), its last word end_word(c_i), and its syntactic head word head_word(c_i), and write the part of speech of its first word start_POS(c_i), of its last word end_POS(c_i), and of its head word head_POS(c_i). Generate the word and part-of-speech feature vectors of the already-recognized chunks within the window to the left of the current position. The chunk-level word feature vectors are: e(start_word(c_{-2})), e(end_word(c_{-2})), and e(head_word(c_{-2})) for the second chunk to the left of the current word, and e(start_word(c_{-1})), e(end_word(c_{-1})), and e(head_word(c_{-1})) for the first chunk to the left;

The chunk-level part-of-speech feature vectors are: e(start_POS(c_{-2})), e(end_POS(c_{-2})), and e(head_POS(c_{-2})) for the second chunk to the left of the current word, and e(start_POS(c_{-1})), e(end_POS(c_{-1})), and e(head_POS(c_{-1})) for the first chunk to the left. The choice of i depends on the chosen window size; embodiment 1 uses i = 2, and the corresponding chunk-level features are shown in Table 6:
Table 6. Chunk-level word and part-of-speech features: start_word(c_i), end_word(c_i), head_word(c_i), start_POS(c_i), end_POS(c_i), head_POS(c_i), for i = -2, -1
It should be noted that the above chunk-level features cannot be used under traditional models such as conditional random fields because of the Markov assumption, although they have been used with a complex, pruned dynamic-programming algorithm; see the following reference: Zhou J., Qu W., Zhang F. Exploiting chunk-level features to improve phrase chunking. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012: 557-567.
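A sketch (illustrative names) of how these chunk-level features can be read off the already-labeled prefix; the head-word rule here is a stand-in assumption, since the patent does not specify how head_word is determined:

```python
from typing import List, Tuple

def chunk_features(chunks: List[Tuple[str, List[Tuple[str, str]]]]) -> dict:
    """chunks: already-recognized chunks to the left of the current word,
    each (type, [(word, pos), ...]). Returns the Table-6 features for c_-1, c_-2."""
    feats = {}
    for i in (-1, -2):
        if len(chunks) + i < 0:          # fewer than |i| chunks to the left
            continue
        words = chunks[i][1]
        head = words[-1]                 # stand-in: take the last word as the head
        feats[f"start_word(c{i})"], feats[f"start_POS(c{i})"] = words[0]
        feats[f"end_word(c{i})"], feats[f"end_POS(c{i})"] = words[-1]
        feats[f"head_word(c{i})"], feats[f"head_POS(c{i})"] = head
    return feats

left = [("NP", [("Shanghai", "NR"), ("Pudong", "NR")]),
        ("VP", [("synchronize", "VV")])]
print(chunk_features(left)["end_word(c-1)"])   # synchronize
```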
Step 2-1-1-2-2: use the bidirectional LSTM neural network model to compute the word and part-of-speech information feature vectors of the current position to be labeled. The input of the model is all words of the sentence to be analyzed and the parts of speech corresponding to them; the output is the forward word feature vectors, forward part-of-speech feature vectors, backward word feature vectors, and backward part-of-speech feature vectors. In the formulas below, tanh is the hyperbolic tangent, a real-valued function; applied to a vector it acts on each element and yields a target vector of the same dimension as the input. σ is the sigmoid function, also a real-valued function applied elementwise, yielding a target vector of the same dimension as the input. ⊙ is elementwise multiplication: two vectors of the same dimension are multiplied component by component, giving a result vector of that dimension. The four kinds of feature vectors are computed as follows:
The forward word feature vectors are written, in order, h_f(w_1), h_f(w_2), …, h_f(w_n); h_f(w_t) (t = 1, …, n) is the t-th forward word feature vector, computed as follows:

f_t^{wf} = σ(W_{fh}^{wf} h_f(w_{t-1}) + W_{fx}^{wf} e(w_t) + W_{fc}^{wf} c_{t-1}^{wf} + b_f^{wf}),
i_t^{wf} = σ(W_{ih}^{wf} h_f(w_{t-1}) + W_{ix}^{wf} e(w_t) + W_{ic}^{wf} c_{t-1}^{wf} + b_i^{wf}),
c_t^{wf} = f_t^{wf} ⊙ c_{t-1}^{wf} + i_t^{wf} ⊙ tanh(W_{ch}^{wf} h_f(w_{t-1}) + W_{cx}^{wf} e(w_t) + b_c^{wf}),
o_t^{wf} = σ(W_{oh}^{wf} h_f(w_{t-1}) + W_{ox}^{wf} e(w_t) + W_{oc}^{wf} c_t^{wf} + b_o^{wf}),
h_f(w_t) = o_t^{wf} ⊙ tanh(c_t^{wf}),

where W_{fh}^{wf}, W_{fx}^{wf}, W_{fc}^{wf}, b_f^{wf}, W_{ih}^{wf}, W_{ix}^{wf}, W_{ic}^{wf}, b_i^{wf}, W_{ch}^{wf}, W_{cx}^{wf}, b_c^{wf}, W_{oh}^{wf}, W_{ox}^{wf}, W_{oc}^{wf}, b_o^{wf} are model parameter matrices trained in advance (training is done as in the supplementary notes on the model parameter training method in this description); every element of each matrix is a real value, and this parameter group is independent of t, i.e., all computing units in one computation sequence share the same group of parameters;

f_t^{wf}, i_t^{wf}, o_t^{wf} are intermediate results of the t-th computing unit and are real-valued vectors;

e(w_t), h_f(w_{t-1}), c_{t-1}^{wf} are the inputs of the t-th computing unit and are real-valued vectors, where e(w_t) is the feature vector of the word w_t; h_f(w_t) and c_t^{wf} are the outputs of the t-th computing unit, with c_t^{wf} an auxiliary result of the LSTM model; only h_f(w_t) is ultimately used as the forward word feature vector. Since this is a sequential computation model, the outputs h_f(w_{t-1}) and c_{t-1}^{wf} of the (t-1)-th computing unit are inputs of the t-th;

products such as W_{fh}^{wf} h_f(w_{t-1}) are matrix-vector multiplications.
The forward part-of-speech feature vectors are written, in order, h_f(p_1), h_f(p_2), …, h_f(p_n); h_f(p_t) (t = 1, …, n) is the t-th forward part-of-speech feature vector, computed as follows:

f_t^{pf} = σ(W_{fh}^{pf} h_f(p_{t-1}) + W_{fx}^{pf} e(p_t) + W_{fc}^{pf} c_{t-1}^{pf} + b_f^{pf}),
i_t^{pf} = σ(W_{ih}^{pf} h_f(p_{t-1}) + W_{ix}^{pf} e(p_t) + W_{ic}^{pf} c_{t-1}^{pf} + b_i^{pf}),
c_t^{pf} = f_t^{pf} ⊙ c_{t-1}^{pf} + i_t^{pf} ⊙ tanh(W_{ch}^{pf} h_f(p_{t-1}) + W_{cx}^{pf} e(p_t) + b_c^{pf}),
o_t^{pf} = σ(W_{oh}^{pf} h_f(p_{t-1}) + W_{ox}^{pf} e(p_t) + W_{oc}^{pf} c_t^{pf} + b_o^{pf}),
h_f(p_t) = o_t^{pf} ⊙ tanh(c_t^{pf}),

where the parameter matrices with superscript pf are trained in advance as above, have real-valued elements, are independent of t, and are shared by all computing units of the sequence; f_t^{pf}, i_t^{pf}, o_t^{pf} are intermediate results and real-valued vectors; e(p_t), h_f(p_{t-1}), c_{t-1}^{pf} are the inputs of the t-th computing unit, with e(p_t) the feature vector of the part of speech p_t; h_f(p_t) and c_t^{pf} are its outputs, with c_t^{pf} auxiliary and only h_f(p_t) ultimately used as the forward part-of-speech feature vector; the outputs of the (t-1)-th computing unit are inputs of the t-th; products such as W_{fh}^{pf} h_f(p_{t-1}) are matrix-vector multiplications.
The backward word feature vectors are written, in order, h_b(w_1), h_b(w_2), …, h_b(w_n); h_b(w_t) (t = 1, …, n) is the t-th backward word feature vector, computed from right to left as follows:

f_t^{wb} = σ(W_{fh}^{wb} h_b(w_{t+1}) + W_{fx}^{wb} e(w_t) + W_{fc}^{wb} c_{t+1}^{wb} + b_f^{wb}),
i_t^{wb} = σ(W_{ih}^{wb} h_b(w_{t+1}) + W_{ix}^{wb} e(w_t) + W_{ic}^{wb} c_{t+1}^{wb} + b_i^{wb}),
c_t^{wb} = f_t^{wb} ⊙ c_{t+1}^{wb} + i_t^{wb} ⊙ tanh(W_{ch}^{wb} h_b(w_{t+1}) + W_{cx}^{wb} e(w_t) + b_c^{wb}),
o_t^{wb} = σ(W_{oh}^{wb} h_b(w_{t+1}) + W_{ox}^{wb} e(w_t) + W_{oc}^{wb} c_t^{wb} + b_o^{wb}),
h_b(w_t) = o_t^{wb} ⊙ tanh(c_t^{wb}),

where the parameter matrices with superscript wb are trained in advance as above, have real-valued elements, are independent of t, and are shared by all computing units of the sequence; f_t^{wb}, i_t^{wb}, o_t^{wb} are intermediate results and real-valued vectors; e(w_t), h_b(w_{t+1}), c_{t+1}^{wb} are the inputs of the t-th computing unit, with e(w_t) the feature vector of the word w_t; h_b(w_t) and c_t^{wb} are its outputs, with c_t^{wb} auxiliary and only h_b(w_t) ultimately used as the backward word feature vector; since the computation runs from right to left, the outputs h_b(w_{t+1}) and c_{t+1}^{wb} of the (t+1)-th computing unit are inputs of the t-th; products such as W_{fh}^{wb} h_b(w_{t+1}) are matrix-vector multiplications.
The backward part-of-speech feature vectors are written, in order, h_b(p_1), h_b(p_2), …, h_b(p_n); h_b(p_t) (t = 1, …, n) is the t-th backward part-of-speech feature vector, computed from right to left as follows:

f_t^{pb} = σ(W_{fh}^{pb} h_b(p_{t+1}) + W_{fx}^{pb} e(p_t) + W_{fc}^{pb} c_{t+1}^{pb} + b_f^{pb}),
i_t^{pb} = σ(W_{ih}^{pb} h_b(p_{t+1}) + W_{ix}^{pb} e(p_t) + W_{ic}^{pb} c_{t+1}^{pb} + b_i^{pb}),
c_t^{pb} = f_t^{pb} ⊙ c_{t+1}^{pb} + i_t^{pb} ⊙ tanh(W_{ch}^{pb} h_b(p_{t+1}) + W_{cx}^{pb} e(p_t) + b_c^{pb}),
o_t^{pb} = σ(W_{oh}^{pb} h_b(p_{t+1}) + W_{ox}^{pb} e(p_t) + W_{oc}^{pb} c_t^{pb} + b_o^{pb}),
h_b(p_t) = o_t^{pb} ⊙ tanh(c_t^{pb}),

where the parameter matrices with superscript pb are trained in advance as above, have real-valued elements, are independent of t, and are shared by all computing units of the sequence; f_t^{pb}, i_t^{pb}, o_t^{pb} are intermediate results and real-valued vectors; e(p_t), h_b(p_{t+1}), c_{t+1}^{pb} are the inputs of the t-th computing unit, with e(p_t) the feature vector of the part of speech p_t; h_b(p_t) and c_t^{pb} are its outputs, with c_t^{pb} auxiliary and only h_b(p_t) ultimately used as the backward part-of-speech feature vector; since the computation runs from right to left, the outputs h_b(p_{t+1}) and c_{t+1}^{pb} of the (t+1)-th computing unit are inputs of the t-th; products such as W_{fh}^{pb} h_b(p_{t+1}) are matrix-vector multiplications.
To make full use of the patterns of word strings and part-of-speech strings farther away from the current word in the sentence, the present invention uses the bidirectional LSTM model to compute the word and part-of-speech information features of the current position. The computation has a forward pass and a backward pass; the forward pass runs left to right and the backward pass right to left but is otherwise identical, so only the forward pass is described in detail here. Suppose the sentence length is n; the words are written, left to right, w_1, w_2, …, w_n, with corresponding feature vectors e(w_1), e(w_2), … e(w_n), and the parts of speech are written p_1, p_2, …, p_n, with corresponding feature vectors e(p_1), e(p_2), … e(p_n). The computed forward word feature vectors are written, in order, h_f(w_1), h_f(w_2), …, h_f(w_n), and the computed forward part-of-speech feature vectors h_f(p_1), h_f(p_2), …, h_f(p_n). Note that all these are trained real-valued vectors whose dimensionalities are set by the user; in embodiment 1 the dimensionality of e(w_t) and e(p_t) is 50, and that of h_f(w_t) and h_f(p_t) is 25.
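A compact numpy sketch of one forward LSTM pass implementing the five equations above (dimensions as in embodiment 1: 50-dim inputs, 25-dim outputs; the random weights are stand-ins for the trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
IN, HID = 50, 25                       # e(w_t) dim 50, h_f(w_t) dim 25 (embodiment 1)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

# one shared parameter group per sequence: W_{gh}, W_{gx}, W_{gc}, b_g per gate g
W = {g: {"h": rng.normal(size=(HID, HID)), "x": rng.normal(size=(HID, IN)),
         "c": rng.normal(size=(HID, HID)), "b": np.zeros(HID)}
     for g in "fico"}                  # forget, input, cell-candidate, output

def forward_lstm(xs: list) -> list:
    h, c, hs = np.zeros(HID), np.zeros(HID), []
    for x in xs:                       # left to right over e(w_1) .. e(w_n)
        f = sigma(W["f"]["h"] @ h + W["f"]["x"] @ x + W["f"]["c"] @ c + W["f"]["b"])
        i = sigma(W["i"]["h"] @ h + W["i"]["x"] @ x + W["i"]["c"] @ c + W["i"]["b"])
        c = f * c + i * np.tanh(W["c"]["h"] @ h + W["c"]["x"] @ x + W["c"]["b"])
        o = sigma(W["o"]["h"] @ h + W["o"]["x"] @ x + W["o"]["c"] @ c + W["o"]["b"])
        h = o * np.tanh(c)             # h_f(w_t): the vector actually used as a feature
        hs.append(h)
    return hs                          # backward pass: the same cell on reversed input

hs = forward_lstm([rng.normal(size=IN) for _ in range(4)])
print(len(hs), hs[0].shape)            # 4 (25,)
```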
In step 2-1-2 a feedforward neural network computes the scores of all tag types. After step 2-1-1 finishes, a real-valued vector has been obtained by concatenating the vectors corresponding to all the features described in step 2-1-1; its dimensionality is the sum of the dimensionalities of those feature vectors. With this vector as input, the whole feedforward network computes as follows:

H = σ(W_1 x + b_1),
O = W_2 H,

where W_1, b_1, W_2 are trained model parameter matrices whose elements are real values; x is the input vector, formed by concatenating all the feature vectors obtained in step 2-1-1, with dimensionality equal to the sum of their dimensionalities and real-valued elements; H is the hidden-layer vector of the network, an intermediate result whose dimensionality is fixed in advance (300 in embodiment 1); O is the computed output, a real-valued vector whose dimensionality equals the number of tag types selectable when labeling a word as defined in step 1-2, and whose g-th component is the score of assigning tag type g at the current step; W_1 x and W_2 H are matrix multiplications.
Step 2-2 comprises the following steps:
Step 2-2-1: for each state in the previous state set, score all tag types as in step 2-1. Let the score of a state S_x be score(S_x) and the score of tag type type_k be score(type_k). If a state s_i^t were extended with every tag type, the extension would yield K new target states (K being the total number of tag types), written s_{i1}^{t+1}, …, s_{iK}^{t+1}, and the score of the k-th of these would be computed as

score(s_{ik}^{t+1}) = score(s_i^t) + score(type_k),

where k ranges from 1 to K and all scores are real values. The candidate tag types are then determined as in step 1-2, and the state s_i^t is extended only with those candidates; if the candidates determined for a state in S_t by step 1-2 number c(i), the conditional extension yields c(i) new states, written s_{i1}^{t+1}, …, s_{ic(i)}^{t+1}.
Step 2-2-2: suppose the state set S_t contains z states (z a natural number); extend all states in S_t as in step 2-2-1, obtaining the extended states s_{11}^{t+1}, …, s_{1c(1)}^{t+1}, …, s_{z1}^{t+1}, …, s_{zc(z)}^{t+1};
Step 2-2-3: by beam search, take the m highest-scoring of all the extended states obtained in step 2-2-2, forming the new state set S_{t+1}.
Beneficial effects: compared with the widely used methods based on the Markov assumption, the state-transition-based method used by the Chinese chunk parsing method of the present invention can add chunk-level features more flexibly; at the same time, the neural network model used to score the candidate transition types of each state can automatically learn combinations of features; in addition, the use of the bidirectional LSTM neural network model introduces useful additional information features. The combination of the three effectively improves the accuracy of Chinese chunk parsing.
Description of the drawings
The above and other advantages of the present invention will become clearer from the following detailed description of embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a schematic diagram of an LSTM neural network computing unit.
Fig. 2 is a schematic diagram of the network structure of the forward LSTM computation sequence.
Fig. 3 is a schematic diagram of the structure of the feedforward neural network.
Fig. 4 is the flow chart of the present invention.
Detailed description of the embodiments
The present invention proposes a Chinese chunk parsing method based on state transitions and neural networks. When assigning a chunk tag to each word of a sentence, it first constructs the relevant information features from the information available, then scores all candidate tag types with a neural network, and then performs the state transition operation. In existing Chinese chunk parsing techniques, the assumptions inherent in the models prevent long-distance features from being used fully, and complex feature templates must be designed by hand; the method proposed by the present invention effectively alleviates both shortcomings.
As shown in Fig. 4, the invention discloses a Chinese chunk parsing method based on state transitions and neural networks that can both add chunk-level features flexibly and use a neural network model to automatically learn combinations of features; it also uses a bidirectional LSTM neural network model to introduce useful additional information features, thereby improving the accuracy of Chinese chunk parsing.
The complete Chinese chunk parsing procedure based on state transitions and neural networks of the present invention comprises the following steps:
Step 1: the computer reads a Chinese text file containing the sentences to be analyzed, defines the types of Chinese chunks, segments each sentence to be analyzed into words, and assigns a part-of-speech tag to each word; the tag types that may be selected during labeling are determined by the current sentence state;
Step 2: chunk-parse each sentence read in, using the method based on state transitions and neural networks.
Defining the Chinese chunk types and tag types in the present invention comprises the following steps:
Step 1-1: define the chunk types to be analyzed. The chunk types are chosen by the user according to the goal at hand; traditional Chinese chunk parsing tasks generally come in two concrete flavors: one recognizes noun phrases only, the other recognizes chunks of the 12 types defined on the basis of the Penn Chinese Treebank CTB 4.0;
Step 1-2: determine the tag types that may be selected when labeling each word during annotation: each word in the sentence is tagged with a combination of a chunk type and one of the BIO or BIOES symbols.
Suppose first that the sentence to be processed has length n; a sentence state is defined that records which words of the current sentence have been labeled, the tag of each labeled word, and which words remain unlabeled. The state set of the sentence before the i-th word is labeled is written S_i, with its states written s_j^i; the beam size of the beam search used is set to m. The analysis procedure of the present invention for the sentence comprises the following steps:
Step 3-1: under a given state, when processing the t-th word, score all tag types;
Step 3-2: given the state set S_t, when processing the t-th word, label each state s_i^t in the set with each candidate tag type, extending the state, and keep the m highest-scoring new states by beam search, obtaining the new state set S_{t+1};
Step 3-3: iterate steps 3-1 and 3-2 for t = 1, 2, …, n, obtaining the final target state set S_{n+1}; take its highest-scoring state and trace back from it to obtain the full tag sequence of the sentence.
In the present invention, when the t-th word is processed, given some state in the previous state set S_t, the set of tag types that may be applied is defined by step 1-2, and each tag in that set is scored by a feedforward neural network. Using the network to score the tags applicable to the current word under a given state involves two steps: first, generating the feature information, i.e., the input of the neural network; second, using the network to score all candidate types. Step 3-1 specifically comprises the following steps:
Step 3-1-1: generate the input of the feedforward neural network;
Step 3-1-2: as shown in Fig. 3, feed the feature vector generated in step 3-1-1 into the feedforward neural network and compute the scores of all candidate tag types.
The generation of the feedforward network input in the present invention has two parts: generating the basic-information features and generating the additional-information features. Step 3-1-1 comprises the following steps:
Step 3-1-1-1: generate the basic-information features, comprising the words and parts of speech within a window centered on the position of the current word, and the tags of the already-labeled words within a window to its left. The word features are e(w_{-2}), e(w_{-1}), e(w_0), e(w_1), e(w_2): the feature vectors of the second and first words to the left of the current word, the current word itself, and the first and second words to its right. The part-of-speech features are e(p_{-2}), e(p_{-1}), e(p_0), e(p_1), e(p_2), e(p_{-2}p_{-1}), e(p_{-1}p_0), e(p_0p_1), e(p_1p_2), e(p_{-2}p_{-1}p_0), e(p_{-1}p_0p_1), e(p_0p_1p_2): the feature vectors of the parts of speech of the same five words and of their adjacent bigram and trigram combinations. These feature vectors are trained real-valued vectors.
Step 3-1-1-2: generate the additional-information features, in the following two steps:
Step 3-1-1-2-1: generate the word and part-of-speech features of the already-recognized chunks within the window to the left of the current position. The chunk-level word features are e(start_word(c_{-2})), e(end_word(c_{-2})), e(head_word(c_{-2})), e(start_word(c_{-1})), e(end_word(c_{-1})), e(head_word(c_{-1})): the feature vectors of the first word, last word, and syntactic head word of the second and first chunks to the left of the current word. The chunk-level part-of-speech features are e(start_POS(c_{-2})), e(end_POS(c_{-2})), e(head_POS(c_{-2})), e(start_POS(c_{-1})), e(end_POS(c_{-1})), e(head_POS(c_{-1})): the feature vectors of the parts of speech of those same words. These feature vectors are trained real-valued vectors;
Step 3-1-1-2-2: generate the word and part-of-speech information features of the current position computed with the bidirectional LSTM neural network model. The input of this step is all words of the sentence, written left to right w_1, w_2, …, w_n, and the parts of speech corresponding to them, written left to right p_1, p_2, …, p_n. The output is the forward word feature vectors, written in order h_f(w_1), h_f(w_2), …, h_f(w_n); the forward part-of-speech feature vectors, written h_f(p_1), h_f(p_2), …, h_f(p_n); the backward word feature vectors, written h_b(w_1), h_b(w_2), …, h_b(w_n); and the backward part-of-speech feature vectors, written h_b(p_1), h_b(p_2), …, h_b(p_n). The backward pass differs from the forward pass only in the direction of computation, so only the forward pass is described in detail. For each h_f(x) (x may be w_t or p_t, t = 1, 2, … n; only the inputs and parameters differ, the computation is identical, so it is abbreviated h_f), the computation is as follows:
f_t = σ(W_{fh} h_{t-1} + W_{fx} x_t + W_{fc} c_{t-1} + b_f),
i_t = σ(W_{ih} h_{t-1} + W_{ix} x_t + W_{ic} c_{t-1} + b_i),
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_{ch} h_{t-1} + W_{cx} x_t + b_c),
o_t = σ(W_{oh} h_{t-1} + W_{ox} x_t + W_{oc} c_t + b_o),
h_t = o_t ⊙ tanh(c_t),
where W_{fh}, W_{fx}, W_{fc}, b_f, W_{ih}, W_{ix}, W_{ic}, b_i, W_{ch}, W_{cx}, b_c, W_{oh}, W_{ox}, W_{oc}, b_o are model parameter matrices trained in advance (training combines the analysis method of the present invention with maximizing the likelihood of the correct tag sequences in the training data); every element of each matrix is a real value. It should be pointed out that this parameter group is independent of t, that is, all computing units in one computation sequence share the same group of parameters; since the present invention involves forward and backward computation sequences for both words and parts of speech, there are 4 parameter groups in all. f_t, i_t, o_t are intermediate results of the t-th computing unit and are real-valued vectors. h_{t-1}, c_{t-1}, x_t are the inputs of the t-th computing unit and are real-valued vectors, where x_t is e(w_t) or e(p_t). c_t and h_t are the outputs of the t-th computing unit, but c_t is an auxiliary result of the LSTM model; only h_t is ultimately used as the word or part-of-speech feature vector, i.e., h_t is the target feature vector h_f(w_t) or h_f(p_t). It should also be pointed out that, since this is a sequential computation model, the outputs h_{t-1}, c_{t-1} of the (t-1)-th computing unit are inputs of the t-th. tanh is the hyperbolic tangent, a real-valued function; applied to a vector it acts on each element and yields a target vector of the same dimension as the input. σ is the sigmoid function, also a real-valued function applied elementwise, yielding a target vector of the same dimension as the input. ⊙ is elementwise multiplication: two vectors of the same dimension are multiplied component by component, giving a result vector of that dimension. W_{fh} h_{t-1}, W_{fx} x_t, etc. are matrix-vector multiplications.
Step 3-1-2: feed the feature vectors generated in step 3-1-1 into a feedforward neural network to compute the score of every tag type. Once step 3-1-1 is complete, the vectors of all the features described above are concatenated into one real-valued vector whose dimension is the sum of the dimensions of all the feature vectors; this vector is the input of the feedforward network, whose computation proceeds as follows:
h = σ(W1 · x + b),
o = W2 · h,
where W1, b, W2 are trained model parameter matrices whose elements are real numbers; x is the input vector, with real-valued elements; o is the computed output, a real-valued vector whose dimension equals the number of tag types that may be selected when labeling each word in the tagging process defined in step 1-2, its i-th value being the score of assigning tag type i in the current step; W1 · x and W2 · h are matrix multiplications.
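A correspondingly small sketch of this scoring network (illustrative Python; the 1400-dimensional input and the 27 output scores follow Embodiment 1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score_tags(x, W1, b, W2):
    """x: concatenated feature vector (1400 dims in Embodiment 1).
    Returns one real-valued score per tag type (27 in Embodiment 1)."""
    h = sigmoid(W1 @ x + b)   # hidden layer, h = sigma(W1*x + b)
    return W2 @ h             # o[k] = score of tag type k
```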
Step 3-2: given the state set S_t, when processing the t-th word, extend each state S_i^t in the set by labeling the word with each candidate tag type, and select the m highest-scoring new states by beam search, obtaining the new state set S_{t+1}. This comprises the following steps:
Step 3-2-1: for each state S_i^t in the current state set, score all tag types in the manner of step 3-1. Suppose state S_x has score score(S_x) and tag type type_k has score score(type_k). If every tag type were applied, extension would produce K new target states (K being the total number of tag types), denoted S_i1^{t+1}, S_i2^{t+1}, ..., S_iK^{t+1}, whose scores are computed as follows:
score(S_ik^{t+1}) = score(S_i^t) + score(type_k),
where all of these scores are real numbers. The candidate tag types are then determined by the constraint rules of step 1-2, and the state S_i^t is extended only with these candidates: if for state S_i^t the constraint rules of step 1-2 admit c(i) candidate tag types, extension yields c(i) new states, denoted S_i1^{t+1}, S_i2^{t+1}, ..., S_ic(i)^{t+1};
Step 3-2-2: extend every state in the state set S_t (assuming there are m states) in the manner of step 3-2-1; the states after all extensions are S_ik^{t+1} for i = 1 to m and k = 1 to c(i);
Step 3-2-3: take out the m highest-scoring states among all states obtained in step 3-2-2 and form the new state set S_{t+1}.
Step 3-3: for t = 1, 2, ..., n, perform steps 3-1 and 3-2 to obtain the final target state set S_{n+1}; take out its highest-scoring state, backtrack to recover the complete tag sequence of the sentence, and from it obtain the chunk parsing result of the sentence.
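A compact sketch of this whole beam decoding loop (illustrative Python reusing `score_tags` from the sketch above; `make_features` stands for steps 3-1-1-1 to 3-1-1-2-2 and `candidate_tags` for the BIOES constraint rules of step 1-2, both hypothetical helpers):

```python
def beam_decode(sentence, model, beam_size=4):
    """Each state is (score, tag_sequence). Returns the best tag sequence."""
    states = [(0.0, [])]                      # initial state set S_1
    for t in range(len(sentence)):
        expanded = []
        for score, tags in states:
            x = make_features(sentence, t, tags, model)   # step 3-1-1
            tag_scores = score_tags(x, model['W1'], model['b'], model['W2'])
            for k in candidate_tags(tags):    # constraint rules of step 1-2
                # score(S_ik^{t+1}) = score(S_i^t) + score(type_k)
                expanded.append((score + tag_scores[k], tags + [k]))
        # keep the m = beam_size highest-scoring new states
        states = sorted(expanded, key=lambda s: s[0], reverse=True)[:beam_size]
    return max(states, key=lambda s: s[0])[1]
```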
A supplementary note on the training method for the model parameters used in the analysis process of the present invention follows:
From step 2 of the analysis process it can be seen that the parameters used in the analysis of the present invention comprise the following parts (collectively called the model parameter group below):
1. The feature vector corresponding to each feature, written e(*) herein, where * ranges over the basic word and part-of-speech features of step 2-1-1-1 and the chunk-level word and part-of-speech features of step 2-1-1-2-1; that is, every word and part of speech occurring in the training corpus, and every combination of two adjacent words and of two adjacent parts of speech, corresponds to one feature vector;
2. The neural network parameters used in step 2-1-1-2-2 to compute the forward word sequence;
3. The neural network parameters used in step 2-1-1-2-2 to compute the backward word sequence;
4. The neural network parameters used in step 2-1-1-2-2 to compute the forward part-of-speech sequence;
5. The neural network parameters used in step 2-1-1-2-2 to compute the backward part-of-speech sequence;
6. The feedforward neural network parameters W1, W2 used in step 2-1-2.
Training maximizes the likelihood of the correct tag sequences in the training data and is carried out iteratively. Before training starts, the parameters in the model parameter group are given random initial values; Embodiments 1 and 2 both sample each value uniformly from the interval [-0.1, 0.1]. The labeled data set (of size D) dataset = {sent_1, sent_2, ..., sent_D} is then used to train the parameters. First a training objective is defined over the whole data set, also called the loss function; it is a function of all parameters in the whole model parameter group and is denoted L(dataset). The loss function for each sentence sent_r is written loss(sent_r); both are defined and computed as follows:
When the t-th word of a sentence is processed in the manner of step 2 of the analysis, consider any state in the current state set, written S_i^t in the notation of step 2-2. From the process of step 2-1 it is known that the score score(type_k) obtained by scoring the k-th tag type under the current state is in fact a composite function of all parameters in groups 2 to 5 of the model parameter group (denoted Θ) and of those feature vectors of the group-1 parameters that are extracted under the current state in steps 2-1-1-1 and 2-1-1-2-1. Let all feature vectors extracted when processing the t-th word under the given state S_i^t be written collectively as E(S_i^t, t). Because the score of the whole sentence must be expressed below, for convenience we write the score obtained by scoring the k-th tag type when processing the t-th word under the given state S_i^t as score(S_i^t, t, type_k); then:
score(S_i^t, t, type_k) = F(Θ, E(S_i^t, t)),
where F is the composite function, described by the process of step 2-1, that combines the four long short-term memory networks with the feedforward network, and Θ stands for all parameters in groups 2 to 5 of the model parameter group.
From the whole of step 2 it is known that, after a sentence has been processed via step 2-3, the score of each state S_i^{n+1} in the state set S_{n+1} is a composite function of all parameters in groups 2 to 5 of the model parameter group (denoted Θ) and, through the feature vectors extracted in steps 2-1-1-1 and 2-1-1-2-1 when processing each word along the path from the initial state to S_i^{n+1}, of the group-1 parameters. Suppose that each state S_i^{n+1} in the state set S_{n+1} was reached from the initial state by selecting the tag type sequence type_{i_1}, type_{i_2}, ..., type_{i_n} and passing through the state sequence S_i^1, S_i^2, ..., S_i^{n+1} (S_i^1 being the initial state and S_i^{n+1} the final state); then the score of state S_i^{n+1} is:
score(S_i^{n+1}) = Σ_{j=1}^{n} score(S_i^j, j, type_{i_j}),
Because the training sentences are all labeled data, their correct tag sequences are known. Suppose the state S_{gold}^{n+1} in the state set S_{n+1} corresponds to the correct tag sequence. The loss function for this sentence is defined as:
loss(sent_r) = -log(e^{score(S_{gold}^{n+1})} / Σ_{l=1}^{m} e^{score(S_l^{n+1})}),
where e^x denotes the exponential function and e is the base of the natural logarithm.
The loss function for the whole training data set is defined as:
L(dataset; Θ, E) = Σ_{l=1}^{D} loss(sent_l),
where Θ and E indicate that this loss function is a function of the parameters in the model parameter group.
The goal of the whole training process is to minimize this loss function. Many methods for minimizing such a loss and solving for the parameters are known to practitioners in the field; the embodiments use stochastic gradient descent.
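For orientation, an illustrative Python fragment of the per-sentence loss over the m final beam states and of the uniform initialization used in the embodiments (schematic only; gradients would be obtained by backpropagation through the feedforward network and the four LSTMs, which is not shown):

```python
import numpy as np

def init_param(shape, rng=np.random.default_rng(0)):
    # Embodiments 1 and 2: uniform random initialization in [-0.1, 0.1].
    return rng.uniform(-0.1, 0.1, size=shape)

def sentence_loss(beam_scores, gold_index):
    """beam_scores: array of scores of the m final states S_1^{n+1}..S_m^{n+1};
    gold_index: position of the state matching the correct tag sequence.
    Negative log-likelihood of the gold sequence over the beam."""
    z = beam_scores - beam_scores.max()          # stabilize the exponentials
    log_softmax = z - np.log(np.exp(z).sum())
    return -log_softmax[gold_index]

# Training step (schematic): for each sentence, decode with the beam,
# compute sentence_loss, backpropagate, then update every parameter by
#   theta <- theta - learning_rate * d(loss)/d(theta)   (SGD)
```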
Embodiment 1
First, the model parameters of the present embodiment were trained, in the manner of the supplementary note on parameter training above, on the 9,978 sentences of 728 files of the Penn Chinese Treebank (CTB) 4.0 (file numbers chtb_001 to chtb_899; note that the numbering is not continuous, so there are only 728 files).
The complete procedure by which the present embodiment applies the state-transition and neural-network based Chinese chunk parsing method of the present invention to one sentence is as follows:
Step 1-1: define the Chinese chunk types; on the basis of the Penn Chinese Treebank CTB 4.0, twelve types are defined: ADJP, ADVP, CLP, DNP, DP, DVP, LCP, LST, NP, PP, QP, VP, whose meanings are given in step 1-1 of the description;
Step 1-2: determine the tag types that may be selected when labeling each word during annotation, using the BIOES scheme. The tag types finally determined are the 27 types B-ADJP, B-ADVP, B-CLP, B-DNP, B-DP, B-DVP, B-LCP, B-LST, B-NP, B-PP, B-QP, B-VP, I, O, E, S-ADJP, S-ADVP, S-CLP, S-DNP, S-DP, S-DVP, S-LCP, S-LST, S-NP, S-PP, S-QP, S-VP;
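These 27 tag types follow mechanically from the 12 chunk types; as an aside, a small illustrative Python construction (not part of the patented method):

```python
CHUNK_TYPES = ["ADJP", "ADVP", "CLP", "DNP", "DP", "DVP",
               "LCP", "LST", "NP", "PP", "QP", "VP"]

# B-X: first word of a multi-word chunk of type X; S-X: single-word chunk;
# I: inside a chunk; E: last word of a chunk; O: outside any chunk.
TAGS = [f"{p}-{t}" for p in ("B", "S") for t in CHUNK_TYPES] + ["I", "O", "E"]
assert len(TAGS) == 27
```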
Step 2-1: the computer reads a natural-language text file containing the sentences to be analyzed. For ease of explanation, only the sentence "Shanghai/NR Pudong/NR development/NN and/CC legal-system/NN construction/NN synchronize/VV" is read in;
Step 3: at the very start the initial state set is S_1, which contains a single state S_1^1, the initial state; then the following steps are performed;
Step 3-1: process the 1st word "Shanghai", performing the following steps:
Step 3-1-1: generate the input of the feedforward network, performing the following steps:
Step 3-1-1-1: generate the basic information features. Because this is the first word, there are no words to its left; following common practice, padding words are added on its left, say "word_start", together with padding parts of speech, say "POS_start". The word features here are therefore w-2 = "word_start", w-1 = "word_start", w0 = "Shanghai", w1 = "Pudong", w2 = "development"; the part-of-speech features are p-2 = "POS_start", p-1 = "POS_start", p0 = "NR", p1 = "NR", p2 = "NN", p-2p-1 = "POS_start POS_start", p-1p0 = "POS_start NR", p0p1 = "NR NR", p1p2 = "NR NN". The vector representations corresponding to these features are then retrieved; in this embodiment the dimension of every such feature vector is set to 50, and they are real-valued vectors, e.g. the first 5 element values of e(w0) are -0.0999, 0.0599, 0.0669, -0.0786, 0.0527;
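As an aside, a sketch of this window-feature key extraction with boundary padding (illustrative Python; the right-boundary tokens "word_end"/"POS_end" are an assumption, since the embodiment only shows left padding):

```python
def window_features(words, poss, t):
    """Return the 14 basic feature keys for position t: 5 words, 5 POS tags,
    and 4 adjacent POS-tag combinations, padded at the sentence boundaries."""
    def w(i):
        if i < 0: return "word_start"
        if i >= len(words): return "word_end"      # assumed right padding
        return words[i]
    def p(i):
        if i < 0: return "POS_start"
        if i >= len(poss): return "POS_end"        # assumed right padding
        return poss[i]
    singles = [w(t-2), w(t-1), w(t), w(t+1), w(t+2),
               p(t-2), p(t-1), p(t), p(t+1), p(t+2)]
    pairs = [p(t-2) + " " + p(t-1), p(t-1) + " " + p(t),
             p(t) + " " + p(t+1), p(t+1) + " " + p(t+2)]
    return singles + pairs
```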
Step 3-1-1-2: generate the additional information features, performing the following steps:
Step 3-1-1-2-1: generate the chunk-related word and part-of-speech feature vectors. Because no chunk has yet been analyzed before this word, padding entries are likewise used: start_word(c-2) = "start_chunk_word_NULL", end_word(c-2) = "end_chunk_word_NULL", head_word(c-2) = "head_chunk_word_NULL", start_word(c-1) = "start_chunk_word_NULL", end_word(c-1) = "end_chunk_word_NULL", head_word(c-1) = "head_chunk_word_NULL", start_POS(c-2) = "start_chunk_POS_NULL", end_POS(c-2) = "end_chunk_POS_NULL", head_POS(c-2) = "head_chunk_POS_NULL", start_POS(c-1) = "start_chunk_POS_NULL", end_POS(c-1) = "end_chunk_POS_NULL", head_POS(c-1) = "head_chunk_POS_NULL". The corresponding vector representations are then retrieved; in this embodiment their dimension is also 50 throughout, and they are real-valued vectors;
Step 3-1-1-2-2: as shown in Figures 1 and 2, generate the word and part-of-speech information feature vectors of the current position computed by the bidirectional LSTM model. For the word feature vectors, the input is the vector representation of each word in the sentence; for the part-of-speech feature vectors, the input is the vector representation of each part of speech in the sentence. These vector representations are identical to those of step 3-1-1-1 for the same word or part of speech; e.g. the first 5 element values of e(w0) (w0 = "Shanghai") are still -0.0999, 0.0599, 0.0669, -0.0786, 0.0527. The parameters of the LSTM model take real values; e.g. the first 5 values of the first row of the matrix W_fh used to compute the forward word vectors are 0.13637, 0.11527, -0.06217, -0.19870, 0.03157. The feature vectors h_f and h_b corresponding to each word and part of speech are then computed; they are real-valued vectors, and in this embodiment the dimension of h_f and h_b is set to 25.
Step 3-1-2: concatenate all vectors obtained in step 3-1-1 into one real-valued vector, here of 14 × 50 + 12 × 50 + 4 × 25 = 1400 dimensions, and then obtain the scores of all 27 tag types. In this embodiment the scores are 0.7898 (B-ADJP), 0.4961 (B-ADVP), -0.1281 (B-CLP), -0.0817 (B-DNP), 0.5265 (B-DP), -0.0789 (B-DVP), 0.4362 (B-LCP), -0.2250 (B-LST), 2.9887 (B-NP), -0.0726 (B-PP), 0.1320 (B-QP), 0.4636 (B-VP), 1.6294 (E), 1.8871 (I), -0.3904 (O), 0.6985 (S-ADJP), -0.1703 (S-ADVP), -0.3287 (S-CLP), 0.1734 (S-DNP), 0.5694 (S-DP), 0.0990 (S-DVP), 0.0902 (S-LCP), -1.0364 (S-LST), 2.0767 (S-NP), -0.0179 (S-PP), -0.0606 (S-QP), 0.0941 (S-VP);
Step 3-2-1: the current state set is S_1, containing only the state S_1^1, with score(S_1^1) = 0. As specified, constraint rule 1 of step 1-2 removes the tag types I and E obtained in step 3-1-2 (score(I) = 1.8871, score(E) = 1.6294). The state S_1^1 is extended by each remaining tag type, and the score of the corresponding target state S_1k^2 is computed; because score(S_1^1) = 0, we have, for example, score(S_1,B-NP^2) = 0 + 2.9887 = 2.9887;
Step 3-2-2: extend every state in S_1 in the manner of step 3-2-1. Because there is only S_1^1, this yields 27 - 2 = 25 new states;
Step 3-2-3: from these 25 new states, select the 4 highest-scoring states to form the new state set S_2, which contains the following four states:
1. Representing "Shanghai/NR_B-NP Pudong/NR development/NN and/CC legal-system/NN construction/NN synchronize/VV", score 2.9887;
2. Representing "Shanghai/NR_S-NP Pudong/NR development/NN and/CC legal-system/NN construction/NN synchronize/VV", score 2.0767;
3. Representing "Shanghai/NR_B-ADJP Pudong/NR development/NN and/CC legal-system/NN construction/NN synchronize/VV", score 0.7898;
4. Representing "Shanghai/NR_S-ADJP Pudong/NR development/NN and/CC legal-system/NN construction/NN synchronize/VV", score 0.6985.
Step 3-3: process the remaining words in the manner of steps 3-1 and 3-2, obtaining the final target state set S_8, which contains four states:
1. Representing "Shanghai/NR_B-NP Pudong/NR_E development/NN_B-NP and/CC_I legal-system/NN_I construction/NN_E synchronize/VV_S-VP", score 24.6169;
2. Representing "Shanghai/NR_B-NP Pudong/NR_E development/NN_B-NP and/CC_I legal-system/NN_E construction/NN_S-VP synchronize/VV_S-VP", score 20.2407;
3. Representing "Shanghai/NR_B-NP Pudong/NR_E development/NN_B-NP and/CC_I legal-system/NN_I construction/NN_E synchronize/VV_B-VP", score 19.7653;
4. Representing "Shanghai/NR_B-NP Pudong/NR_E development/NN_B-NP and/CC_I legal-system/NN_I construction/NN_E synchronize/VV_O", score 19.6299.
Take out the highest-scoring state and backtrack to obtain the tag sequence of the whole sentence; the corresponding chunk parsing result is [NP Shanghai Pudong] [NP development and legal-system construction] [VP synchronize].
Embodiment 2
The algorithms used in the present invention are all implemented in C++. The machine used for the experiments of this embodiment has an Intel(R) Core(TM) i7-5930K processor with a base frequency of 3.50 GHz and 64 GB of memory. First, the model parameters of this embodiment were trained, in the manner of the supplementary note on parameter training above, on the 9,978 sentences of 728 files of the Penn Chinese Treebank (CTB) 4.0 (file numbers chtb_001 to chtb_899; the numbering is not continuous, so there are only 728 files). The test data are the 5,290 sentences of 110 further files (file numbers chtb_900 to chtb_1078; the numbering is not continuous, so there are only 110 files), on which chunk parsing was performed; the experimental results are shown in Table 7:
Table 7 Experimental results: F1-score comparison of the present method with the MBL, TBL, CRF and SVM baselines (table body not reproduced here)
Here MBL (Memory-Based Learning) is a memory-based learning method, TBL (Transformation-Based Learning) a transformation-based learning method, CRF (Conditional Random Field) a conditional random field method, and SVM (Support Vector Machine) a support vector machine method; these four are traditional machine-learning algorithms commonly used for this task. It should be noted that evaluation on this data set is the usual way of evaluating Chinese text chunking methods. It can be seen that the method of the present invention achieves a higher F1-score on this data set, which demonstrates its effectiveness.
The computation of the F1-score is as follows. Because the test set is labeled data, the correct annotation is known here; for the whole data set, let the set of all gold chunks be S(gold), of size count(gold). After each sentence of the data set has been chunk-parsed in the manner of Embodiment 1, the chunks in all analysis results form the prediction set S(predict), of size count(predict). The set of chunks identical in S(gold) and S(predict) is S(correct), of size count(correct). With the prediction precision denoted precision and the prediction recall denoted recall, each value is computed as follows:
precision = count(correct) / count(predict),
recall = count(correct) / count(gold),
F1-score = 2 × precision × recall / (precision + recall).
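This evaluation transcribes directly into code; a sketch (illustrative Python, assuming chunks are identified by sentence index, span and type so that set intersection captures "identical chunks"):

```python
def f1_score(gold_chunks, predicted_chunks):
    """Both arguments: sets of (sentence_id, start, end, chunk_type) tuples."""
    correct = len(gold_chunks & predicted_chunks)   # count(correct)
    precision = correct / len(predicted_chunks)     # over count(predict)
    recall = correct / len(gold_chunks)             # over count(gold)
    return 2 * precision * recall / (precision + recall)
```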

Claims (9)

1. A Chinese chunk parsing method based on state transitions and a neural network, characterized in that it comprises the following steps:
Step 1: a computer reads a Chinese text file containing a sentence to be analyzed, defines the types of Chinese chunks, performs word segmentation on the sentence to be analyzed and assigns each word a part-of-speech tag, and determines, according to the current sentence state, the tag types that may be selected during tagging;
Step 2: Chinese chunk parsing is performed on the sentence to be analyzed using a method based on state transitions and a neural network.
2. The method according to claim 1, characterized in that step 1 comprises the following steps:
Step 1-1: define the Chinese chunk types according to the 12 phrase types defined in Table 1;
Table 1
Type  Meaning
ADJP  Adjectival phrase
ADVP  Adverbial phrase
CLP   Classifier phrase
DNP   "DE" (的) phrase, attributive modifier
DP    Determiner phrase
DVP   "DI" (地) phrase, adverbial modifier
LCP   Localizer phrase
LST   List marker
NP    Noun phrase
PP    Prepositional phrase
QP    Quantifier (numeral-classifier) phrase
VP    Verb phrase
Step 1-2: determine the tag types that may be selected when labeling each word during annotation, by combining the BIOES tagging scheme with the Chinese chunk types defined in step 1-1.
3. The method according to claim 2, characterized in that in step 2 the process of Chinese chunk parsing is treated as a sequence labeling task, whose tag types are generated by combining the Chinese chunk types defined in step 1-1 with the BIOES tagging scheme used in step 1-2.
4. The method according to claim 3, characterized in that throughout step 2 the length of the sentence to be analyzed is denoted n, and step 2 comprises the following steps:
Step 2-1: when the t-th word is processed, score all tag types under a given state, where a given state means that the first t-1 words of the sentence to be analyzed have been tagged and their tag types are known, words t to n are untagged, and the t-th word is the next word to be processed;
Step 2-2: given the state set S_t, when processing the t-th word, score all tag types for each state in the set in the manner of step 2-1; this scoring is done by computation and assigns each tag type a real number, called the score of that type; candidate tag types are then generated in the manner described in step 1-2, the state is extended by labeling the word with each candidate tag type, and the m highest-scoring new states are selected by beam search, obtaining the new state set S_{t+1};
Step 2-3: for t = 1, 2, ..., n, iteratively perform steps 2-1 and 2-2 to obtain the final target state set S_{n+1}; take out its highest-scoring state and backtrack from that state to obtain the highest-scoring tag sequence, at which point every word has been tagged; this highest-scoring tag sequence is converted back into the corresponding chunk parsing result, which is the analysis result of the current sentence.
5. The method according to claim 4, characterized in that step 2-1 comprises the following steps:
Step 2-1-1: generate the feature vectors, which comprise basic information feature vectors and additional information feature vectors;
Step 2-1-2: compute the scores of all candidate tag types from the feature vectors generated in step 2-1-1 using a feedforward neural network.
6. The method according to claim 5, characterized in that throughout step 2-1-1 all words in the sentence to be analyzed are denoted from left to right as w_1, w_2, ..., w_n, where w_n is the n-th word and n is a natural number; the parts of speech corresponding to all words in the sentence to be analyzed are denoted from left to right as p_1, p_2, ..., p_n, where p_n is the part of speech of the n-th word; the feature vector corresponding to a feature * is denoted e(*); and step 2-1-1 comprises the following steps:
Step 2-1-1-1: generate the basic information feature vectors, which comprise the feature vectors of the word and part-of-speech features within a fixed window based on the current position, and the feature vectors of the part-of-speech combination features within that window. Specifically, the word feature vectors of the basic information features comprise: the feature vector e(w-2) of the second word before the current word, the feature vector e(w-1) of the first word before the current word, the feature vector e(w0) of the current word, the feature vector e(w1) of the first word after the current word, and the feature vector e(w2) of the second word after the current word;
The part-of-speech feature vectors comprise: the feature vector e(p-2) of the part of speech of the second word before the current word, the feature vector e(p-1) of the part of speech of the first word before the current word, the feature vector e(p0) of the part of speech of the current word, the feature vector e(p1) of the part of speech of the first word after the current word, the feature vector e(p2) of the part of speech of the second word after the current word, the feature vector e(p-2p-1) of the combined parts of speech of the second and first words before the current word, the feature vector e(p-1p0) of the combined parts of speech of the first word before the current word and the current word, the feature vector e(p0p1) of the combined parts of speech of the current word and the first word after it, and the feature vector e(p1p2) of the combined parts of speech of the first and second words after the current word;
Step 2-1-1-2: generate the additional information feature vectors, which comprise the word and part-of-speech feature vectors of the chunks already tagged within a fixed window based on the current position, and the word and part-of-speech feature vectors of the current position computed by the bidirectional long short-term memory neural network model.
7. The method according to claim 6, characterized in that step 2-1-1-2 comprises the following steps:
Step 2-1-1-2-1: denote the second and first chunks before the current word as c-2 and c-1; for a chunk c_i (i = -2, -1), denote its first word start_word(c_i), its last word end_word(c_i) and its syntactic head word head_word(c_i); denote the part of speech of the first word of chunk c_i as start_POS(c_i), that of its last word as end_POS(c_i), and that of its syntactic head word as head_POS(c_i); generate the word and part-of-speech feature vectors of the already tagged chunks within a fixed window based on the current position:
The chunk-level word feature vectors comprise: the feature vector e(start_word(c-2)) of the first word of the second chunk before the current word, the feature vector e(end_word(c-2)) of its last word, the feature vector e(head_word(c-2)) of its syntactic head word, the feature vector e(start_word(c-1)) of the first word of the first chunk before the current word, the feature vector e(end_word(c-1)) of its last word, and the feature vector e(head_word(c-1)) of its syntactic head word;
The chunk-level part-of-speech feature vectors comprise: the feature vector e(start_POS(c-2)) of the part of speech of the first word of the second chunk before the current word, the feature vector e(end_POS(c-2)) of the part of speech of its last word, the feature vector e(head_POS(c-2)) of the part of speech of its syntactic head word, the feature vector e(start_POS(c-1)) of the part of speech of the first word of the first chunk before the current word, the feature vector e(end_POS(c-1)) of the part of speech of its last word, and the feature vector e(head_POS(c-1)) of the part of speech of its syntactic head word;
Step 2-1-1-2-2: compute the word and part-of-speech information feature vectors of the current position with the bidirectional long short-term memory neural network model. The inputs of the model are all words in the sentence to be analyzed and the parts of speech corresponding to all words; the outputs are the forward word feature vectors, the forward part-of-speech feature vectors, the backward word feature vectors and the backward part-of-speech feature vectors. In the formulas below, tanh is the hyperbolic tangent, a real-valued function applied element-wise to a vector, yielding a target vector of the same dimension as the input; σ is the sigmoid function, likewise a real-valued function applied element-wise, yielding a target vector of the same dimension as the input; ⊙ is element-wise multiplication of two vectors of the same dimension, yielding a result vector of that dimension. The four kinds of feature vectors are computed as follows:
The forward word feature vectors are denoted in order h_f(w_1), h_f(w_2), ..., h_f(w_n), where h_f(w_t) is the t-th forward word feature vector, computed as follows:
f_t^wf = σ(W_fh^wf · h_f(w_{t-1}) + W_fx^wf · e(w_t) + W_fc^wf · c_{t-1}^wf + b_f^wf),
i_t^wf = σ(W_ih^wf · h_f(w_{t-1}) + W_ix^wf · e(w_t) + W_ic^wf · c_{t-1}^wf + b_i^wf),
c_t^wf = f_t^wf ⊙ c_{t-1}^wf + i_t^wf ⊙ tanh(W_ch^wf · h_f(w_{t-1}) + W_cx^wf · e(w_t) + b_c^wf),
o_t^wf = σ(W_oh^wf · h_f(w_{t-1}) + W_ox^wf · e(w_t) + W_oc^wf · c_t^wf + b_o^wf),
h_f(w_t) = o_t^wf ⊙ tanh(c_t^wf),
where W_fh^wf, W_fx^wf, W_fc^wf, b_f^wf, W_ih^wf, W_ix^wf, W_ic^wf, b_i^wf, W_ch^wf, W_cx^wf, b_c^wf, W_oh^wf, W_ox^wf, W_oc^wf, b_o^wf are trained model parameter matrices whose elements are real numbers; this group of parameters is independent of t, i.e. all computation units in one computation sequence share the same group of parameters;
f_t^wf, i_t^wf, o_t^wf are intermediate results of the t-th computation unit and are real-valued vectors;
e(w_t), h_f(w_{t-1}), c_{t-1}^wf are the inputs of the t-th computation unit, all real-valued vectors, where e(w_t) is the feature vector corresponding to word w_t; h_f(w_t), c_t^wf are the outputs of the t-th computation unit, c_t^wf being only an auxiliary result of the long short-term memory model, so that only h_f(w_t) is finally used as the forward word feature vector; since this is a sequential computation model, the outputs h_f(w_{t-1}), c_{t-1}^wf of the (t-1)-th computation unit are the inputs of the t-th computation unit;
The forward part-of-speech feature vectors are denoted in order h_f(p_1), h_f(p_2), ..., h_f(p_n), where h_f(p_t) is the t-th forward part-of-speech feature vector, computed as follows:
f_t^pf = σ(W_fh^pf · h_f(p_{t-1}) + W_fx^pf · e(p_t) + W_fc^pf · c_{t-1}^pf + b_f^pf),
i_t^pf = σ(W_ih^pf · h_f(p_{t-1}) + W_ix^pf · e(p_t) + W_ic^pf · c_{t-1}^pf + b_i^pf),
c_t^pf = f_t^pf ⊙ c_{t-1}^pf + i_t^pf ⊙ tanh(W_ch^pf · h_f(p_{t-1}) + W_cx^pf · e(p_t) + b_c^pf),
o_t^pf = σ(W_oh^pf · h_f(p_{t-1}) + W_ox^pf · e(p_t) + W_oc^pf · c_t^pf + b_o^pf),
h_f(p_t) = o_t^pf ⊙ tanh(c_t^pf),
where W_fh^pf, W_fx^pf, W_fc^pf, b_f^pf, W_ih^pf, W_ix^pf, W_ic^pf, b_i^pf, W_ch^pf, W_cx^pf, b_c^pf, W_oh^pf, W_ox^pf, W_oc^pf, b_o^pf are trained model parameter matrices whose elements are real numbers; this group of parameters is independent of t, i.e. all computation units in one computation sequence share the same group of parameters;
f_t^pf, i_t^pf, o_t^pf are intermediate results of the t-th computation unit and are real-valued vectors;
e(p_t), h_f(p_{t-1}), c_{t-1}^pf are the inputs of the t-th computation unit, all real-valued vectors, where e(p_t) is the feature vector corresponding to part of speech p_t; h_f(p_t), c_t^pf are the outputs of the t-th computation unit, c_t^pf being only an auxiliary result of the long short-term memory model, so that only h_f(p_t) is finally used as the forward part-of-speech feature vector; since this is a sequential computation model, the outputs h_f(p_{t-1}), c_{t-1}^pf of the (t-1)-th computation unit are the inputs of the t-th computation unit;
The backward word feature vectors are denoted in order h_b(w_1), h_b(w_2), ..., h_b(w_n), where h_b(w_t) is the t-th backward word feature vector, computed as follows:
f_t^wb = σ(W_fh^wb · h_b(w_{t+1}) + W_fx^wb · e(w_t) + W_fc^wb · c_{t+1}^wb + b_f^wb),
i_t^wb = σ(W_ih^wb · h_b(w_{t+1}) + W_ix^wb · e(w_t) + W_ic^wb · c_{t+1}^wb + b_i^wb),
c_t^wb = f_t^wb ⊙ c_{t+1}^wb + i_t^wb ⊙ tanh(W_ch^wb · h_b(w_{t+1}) + W_cx^wb · e(w_t) + b_c^wb),
o_t^wb = σ(W_oh^wb · h_b(w_{t+1}) + W_ox^wb · e(w_t) + W_oc^wb · c_t^wb + b_o^wb),
h_b(w_t) = o_t^wb ⊙ tanh(c_t^wb),
where W_fh^wb, W_fx^wb, W_fc^wb, b_f^wb, W_ih^wb, W_ix^wb, W_ic^wb, b_i^wb, W_ch^wb, W_cx^wb, b_c^wb, W_oh^wb, W_ox^wb, W_oc^wb, b_o^wb are trained model parameter matrices whose elements are real numbers; this group of parameters is independent of t, i.e. all computation units in one computation sequence share the same group of parameters;
f_t^wb, i_t^wb, o_t^wb are intermediate results of the t-th computation unit and are real-valued vectors; e(w_t), h_b(w_{t+1}), c_{t+1}^wb are the inputs of the t-th computation unit, all real-valued vectors, where e(w_t) is the feature vector corresponding to word w_t; h_b(w_t), c_t^wb are the outputs of the t-th computation unit, c_t^wb being only an auxiliary result of the long short-term memory model, so that only h_b(w_t) is finally used as the backward word feature vector; since this is a sequential computation model running right to left, the outputs h_b(w_{t+1}), c_{t+1}^wb of the (t+1)-th computation unit are the inputs of the t-th computation unit;
The backward part-of-speech feature vectors are denoted in order h_b(p_1), h_b(p_2), ..., h_b(p_n), where h_b(p_t) is the t-th backward part-of-speech feature vector, computed as follows:
f_t^pb = σ(W_fh^pb · h_b(p_{t+1}) + W_fx^pb · e(p_t) + W_fc^pb · c_{t+1}^pb + b_f^pb),
i_t^pb = σ(W_ih^pb · h_b(p_{t+1}) + W_ix^pb · e(p_t) + W_ic^pb · c_{t+1}^pb + b_i^pb),
c_t^pb = f_t^pb ⊙ c_{t+1}^pb + i_t^pb ⊙ tanh(W_ch^pb · h_b(p_{t+1}) + W_cx^pb · e(p_t) + b_c^pb),
o_t^pb = σ(W_oh^pb · h_b(p_{t+1}) + W_ox^pb · e(p_t) + W_oc^pb · c_t^pb + b_o^pb),
h_b(p_t) = o_t^pb ⊙ tanh(c_t^pb),
where W_fh^pb, W_fx^pb, W_fc^pb, b_f^pb, W_ih^pb, W_ix^pb, W_ic^pb, b_i^pb, W_ch^pb, W_cx^pb, b_c^pb, W_oh^pb, W_ox^pb, W_oc^pb, b_o^pb are trained model parameter matrices whose elements are real numbers; this group of parameters is independent of t, i.e. all computation units in one computation sequence share the same group of parameters;
f_t^pb, i_t^pb, o_t^pb are intermediate results of the t-th computation unit and are real-valued vectors;
e(p_t), h_b(p_{t+1}), c_{t+1}^pb are the inputs of the t-th computation unit, all real-valued vectors, where e(p_t) is the feature vector corresponding to part of speech p_t; h_b(p_t), c_t^pb are the outputs of the t-th computation unit, c_t^pb being only an auxiliary result of the long short-term memory model, so that only h_b(p_t) is finally used as the backward part-of-speech feature vector; since this is a sequential computation model running right to left, the outputs h_b(p_{t+1}), c_{t+1}^pb of the (t+1)-th computation unit are the inputs of the t-th computation unit.
8. The method according to claim 7, characterized in that step 2-1-2 uses a feedforward neural network to compute the scores of all tag types, the computation of the whole feedforward network proceeding as follows:
h = σ(W1 · x + b1),
o = W2 · h,
where W1, b1, W2 are trained model parameter matrices whose elements are real numbers; x is the input vector, formed by concatenating all feature vectors obtained in step 2-1-1, its dimension being the sum of the dimensions of all feature vectors generated in step 2-1-1, and its elements real numbers; h is the hidden-layer vector of the neural network, an intermediate result; o is the computed output, a real-valued vector whose dimension equals the number of tag types that may be selected when labeling each word in the tagging process defined in step 1-2, its g-th value being the score, a real number, of assigning tag type g in the current step; W1 · x and W2 · h are matrix multiplications.
9. The method according to claim 8, characterized in that step 2-2 comprises the following steps:
Step 2-2-1: for each state in the current state set, score all tag types in the manner of step 2-1. Suppose state S_x has score score(S_x) and tag type type_k has score score(type_k); if all tag types were applied, extension would produce K new states, denoted S_i1^{t+1}, S_i2^{t+1}, ..., S_iK^{t+1}, where K is the total number of tag types; the score of the k-th state is computed as follows:
score(S_ik^{t+1}) = score(S_i^t) + score(type_k),
where k takes values 1 to K and each score is a real number; the candidate tag types are determined in the manner of step 1-2 and the state is extended with these candidate tag types: supposing that for a state in the state set S_t the candidate tag types determined in the manner of step 1-2 number c(i), extension of that state yields c(i) new states, denoted S_i1^{t+1}, S_i2^{t+1}, ..., S_ic(i)^{t+1};
Step 2-2-2: suppose the state set S_t contains z states, z being a natural number; extend all states in S_t in the manner of step 2-2-1; the states after all extensions are S_ik^{t+1} for i = 1 to z and k = 1 to c(i);
Step 2-2-3: by beam search, take out the m highest-scoring states among all extended states obtained in step 2-2-2, forming the new state set S_{t+1}.
CN201610324281.5A 2016-05-16 2016-05-16 A kind of Chinese Chunk analysis method based on state transfer and neural network Active CN106021227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610324281.5A CN106021227B (en) 2016-05-16 2016-05-16 A kind of Chinese Chunk analysis method based on state transfer and neural network

Publications (2)

Publication Number Publication Date
CN106021227A true CN106021227A (en) 2016-10-12
CN106021227B CN106021227B (en) 2018-08-21

Family

ID=57097925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610324281.5A Active CN106021227B (en) 2016-05-16 2016-05-16 A kind of Chinese Chunk analysis method based on state transfer and neural network

Country Status (1)

Country Link
CN (1) CN106021227B (en)


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103546623A (en) * 2012-07-12 2014-01-29 百度在线网络技术(北京)有限公司 Method, device and equipment for sending voice information and text description information thereof

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHRIS ALBERTI ET AL: "Improved Transition-Based Parsing and Tagging with Neural Networks", Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing *
DAVID WEISS ET AL: "Structured Training for Neural Network Transition-Based Parsing", Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing *
HAO ZHOU ET AL: "A Neural Probabilistic Structured-Prediction Model for Transition-Based Dependency Parsing", Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing *
YING LIU ET AL: "Improving Chinese Text Chunking's Precision Using Transformation-Based Learning", 2005 Youth Project of Asia Research Center *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106547737A (en) * 2016-10-25 2017-03-29 复旦大学 Based on the sequence labelling method in the natural language processing of deep learning
CN106547737B (en) * 2016-10-25 2020-05-12 复旦大学 Sequence labeling method in natural language processing based on deep learning
CN109923557A (en) * 2016-11-03 2019-06-21 易享信息技术有限公司 Use continuous regularization training joint multitask neural network model
CN109923557B (en) * 2016-11-03 2024-03-19 硕动力公司 Training joint multitasking neural network model using continuous regularization
US11797825B2 (en) 2016-11-03 2023-10-24 Salesforce, Inc. Training a joint many-task neural network model using successive regularization
US11783164B2 (en) 2016-11-03 2023-10-10 Salesforce.Com, Inc. Joint many-task neural network model for multiple natural language processing (NLP) tasks
US11010554B2 (en) 2016-11-08 2021-05-18 Beijing Gridsum Technology Co., Ltd. Method and device for identifying specific text information
WO2018086519A1 (en) * 2016-11-08 2018-05-17 北京国双科技有限公司 Method and device for identifying specific text information
CN106776869A (en) * 2016-11-28 2017-05-31 北京百度网讯科技有限公司 Chess game optimization method, device and search engine based on neutral net
CN106776869B (en) * 2016-11-28 2020-04-07 北京百度网讯科技有限公司 Search optimization method and device based on neural network and search engine
CN107247700A (en) * 2017-04-27 2017-10-13 北京捷通华声科技股份有限公司 A kind of method and device for adding text marking
CN107168955A (en) * 2017-05-23 2017-09-15 南京大学 Word insertion and the Chinese word cutting method of neutral net using word-based context
CN107168955B (en) * 2017-05-23 2019-06-04 南京大学 Utilize the Chinese word cutting method of the word insertion and neural network of word-based context
CN107632981B (en) * 2017-09-06 2020-11-03 沈阳雅译网络技术有限公司 Neural machine translation method introducing source language chunk information coding
CN107632981A (en) * 2017-09-06 2018-01-26 沈阳雅译网络技术有限公司 A kind of neural machine translation method of introducing source language chunk information coding
CN107992479A (en) * 2017-12-25 2018-05-04 北京牡丹电子集团有限责任公司数字电视技术中心 Word rank Chinese Text Chunking method based on transfer method
CN108363695A (en) * 2018-02-23 2018-08-03 西南交通大学 A kind of user comment attribute extraction method based on bidirectional dependency syntax tree characterization
CN108363695B (en) * 2018-02-23 2020-04-24 西南交通大学 User comment attribute extraction method based on bidirectional dependency syntax tree representation
CN108446355B (en) * 2018-03-12 2022-05-20 深圳证券信息有限公司 Investment and financing event element extraction method, device and equipment
CN108446355A (en) * 2018-03-12 2018-08-24 深圳证券信息有限公司 Investment and financing event argument abstracting method, device and equipment
CN109086274A (en) * 2018-08-23 2018-12-25 电子科技大学 English social media short text time expression recognition method based on restricted model
CN112052646A (en) * 2020-08-27 2020-12-08 安徽聚戎科技信息咨询有限公司 Text data labeling method
CN112052646B (en) * 2020-08-27 2024-03-29 安徽聚戎科技信息咨询有限公司 Text data labeling method
CN112651241A (en) * 2021-01-08 2021-04-13 昆明理工大学 Chinese parallel structure automatic identification method based on semi-supervised learning
CN116227497A (en) * 2022-11-29 2023-06-06 广东外语外贸大学 Sentence structure analysis method and device based on deep neural network
CN116227497B (en) * 2022-11-29 2023-09-26 广东外语外贸大学 Sentence structure analysis method and device based on deep neural network

Also Published As

Publication number Publication date
CN106021227B (en) 2018-08-21


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant