A method for analyzing the functional components of Chinese sentences
Technical field
The present invention relates to a method for analyzing the functional components of Chinese sentences, and belongs to the
field of machine translation technology.
Background technology
Syntactic analysis is a key problem in natural language processing. The results achieved so far are not fully
satisfactory, and the field has reached a bottleneck. Syntactic analysis nevertheless remains an active research topic,
because it occupies a pivotal position in the overall natural language processing pipeline: many other natural language
processing tasks, whether at a higher or a lower level, make use of its results.
The main syntactic analysis methods fall into two classes. The first is shallow parsing, also called chunk parsing, which
no longer treats the word as the processing unit but takes the chunk as the basic unit. Some methods of this class
directly produce a flat sequence of chunks; others go on to analyze the chunks, parsing with the chunk as the unit while
ignoring the internal structure of each chunk, so the result is still a partial parse tree. The second class is full
parsing, in which the processing unit is each word of the sentence and the output is a complete syntax tree. Full parsing
can in turn be divided into phrase-structure parsing and dependency parsing. In phrase-structure parsing, the words of
the sentence are combined level by level, according to their form and their relations within phrases, into a complete
syntax tree with a phrase hierarchy. Similarly, in dependency parsing the model builds, following the definition of
dependency grammar, a complete syntax tree from the dependency relations between words.
However, none of these studies takes the functional information of the sentence into account. Phrase-structure parsing
considers phrase-level information, and dependency parsing considers the dependency relations between words; neither
captures the role that a word or group of words plays in the sentence (for example subject, predicate, or object). Zhou
Qiang et al. of Tsinghua University were the first to propose a similar concept. They cast functional component
extraction as a chunk parsing task that differs from earlier phrase chunking in that the labels are the functional
components of the sentence, and a related task was released in the CIPS-2009 evaluation. In the following years,
however, research on the topic largely stalled; only one related article was published, in 2011, in the
Journal of Chinese Information Processing.
Sentence functional components matter in many practical problems. In the word alignment task of machine translation,
for example, functional component information can improve both the speed and the accuracy of alignment by aligning words
that belong to the same component; this is simple and also agrees with linguistic regularities. Similarly, in dependency
parsing, functional component information can serve as a constraint for deleting illegal paths directly during beam
search, which speeds up the search and is likewise simple to apply. The information is also useful for research in
semantic analysis. More importantly, within the overall natural language processing pipeline, functional component
analysis can serve as a transitional task between syntactic analysis and semantic analysis: in granularity it lies above
syntactic analysis and below semantic analysis, so good results on this task would improve both of those tasks. This
line of research therefore has promising applications and deserves attention.
However, existing research is still at a very early stage and offers little prior work to build on. The main analysis
method is the functional chunk parsing of Zhou Qiang et al., and it has several defects. First, the Chinese functional
treebank is small; moreover, the treebank was produced by rule-based conversion, which raises questions about its
accuracy, and the data has not been updated since. Second, the work of both Zhou Qiang et al. and Chen Yi only annotates
the functional chunks of a Chinese sentence, producing a single-layer linear structure rather than a hierarchical
structure that could support the construction of a parse tree. In addition, as a specific task, no research has yet been
devoted to the functional components of Chinese sentences. We therefore propose a baseline model for Chinese functional
component analysis and an analysis method based on shift-reduce transition actions. In view of the contributions and
significance described above, this work is well motivated.
Summary of the invention
The object of the invention is to solve the problem that the prior art does not take the functional components of Chinese
sentences into account, and to propose a method for analyzing the functional components of Chinese sentences.
The method proceeds as follows:
Step one: process the training corpus. The training corpus is CTB5.0, the Penn Chinese Treebank. CTB5.0 is converted by
regular-expression matching into a format carrying functional component labels; the labeled sentences are corrected to
obtain a revised corpus; the revised corpus is converted into a character-granularity format and used as training data A;
Step two: treat the whole functional component analysis as a sequence of state transitions, which yields a syntactic
functional component analyzer; input training data A into the analyzer and train it to obtain Chinese sentence functional
component analysis model C;
Step three: process plain Chinese text with model C to obtain sentences carrying functional component labels; correct
these sentences to obtain a revised corpus; convert the revised corpus into the character-granularity format and use it
as training data B; combine training data A and training data B as the final training data;
Step four: input the final training data into the syntactic functional component analyzer and train it to obtain Chinese
sentence functional component analysis model D; test the Chinese sentences to be tested with model D to obtain the test
results.
Beneficial effects of the present invention are:
The invention treats the whole functional component analysis as a sequence of state transitions, which yields a syntactic
functional component analyzer. One part of the training corpus is CTB5.0 (the Penn Chinese Treebank); the other part is
obtained by processing plain Chinese text. The analyzer is trained on this corpus to obtain a functional component
analysis model, and the model is tested on the Chinese sentences to be tested (500 sentences), achieving high precision,
recall, and F value.
As shown in Table 1, when the invention is tested on 500 Chinese sentences, the precision over the whole syntactic
functional component tree is 97.38%, the recall is 97.79%, and the F value is 90.90%.
Brief description of the drawings
Fig. 1 is a framework diagram of the overall syntactic functional component analysis method;
Fig. 2 is a tree diagram showing the result of functional component analysis of a Chinese sentence, in which [SBJ] is the
subject, [PRE] is the predicate, [OBJ] is the object, [ADV] is the adverbial, [ADJ] is the modifier, and [HEAD] is the
head; IP is a sentence, NP is a noun phrase, VP is a verb phrase, ADVP is an adverb phrase, PP is a prepositional phrase,
CP is a complement phrase, ADJP is an adjective phrase, QP is a numeral-classifier phrase, PN is a pronoun, AD is an
adverb, VV is an action verb, VA is an adjectival verb, JJ is an adjective, NN is a noun, AS is an aspect particle, P is
a preposition, CD is a numeral, OD is an ordinal numeral, DEC is the particle 的, CC is a conjunction, and PU is
punctuation.
Specific embodiment
Embodiment one: the method of this embodiment for analyzing the functional components of Chinese sentences proceeds as
follows:
Step one: process the training corpus. The training corpus is CTB5.0 (the Penn Chinese Treebank), which is itself the
result of syntactic analysis. CTB5.0 is converted by regular-expression matching into a format carrying functional
component labels; the labeled sentences are corrected to obtain a revised corpus; the revised corpus is converted into a
character-granularity format and used as training data A;
Step two: apply transition-based syntactic analysis to functional component analysis, treating the whole functional
component analysis as a sequence of state transitions, which yields a syntactic functional component analyzer; input
training data A into the analyzer and train it to obtain Chinese sentence functional component analysis model C, as shown
in Fig. 1;
Step three: process plain Chinese text (containing no Latin letters or English; 10,000 news and editorial sentences
obtained from People's Net) with model C to obtain sentences carrying functional component labels; correct the commonly
occurring errors in these sentences to obtain a revised corpus; convert the revised corpus into the character-granularity
format and use it as training data B; combine training data A and training data B as the final training data;
Step four: input the final training data into the syntactic functional component analyzer and train it to obtain Chinese
sentence functional component analysis model D; test the Chinese sentences to be tested (500 sentences) with model D to
obtain the test results.
Embodiment two: this embodiment differs from embodiment one in the details of step one, in which the training corpus
CTB5.0 (the Penn Chinese Treebank, itself the result of syntactic analysis) is converted by regular-expression matching
into a format carrying functional component labels, the labeled sentences are corrected to obtain a revised corpus, and
the revised corpus is converted into a character-granularity format and used as training data A. The detailed process
is:
The training corpus is processed. The training corpus is CTB5.0 (the Penn Chinese Treebank), which is itself the result
of syntactic analysis. CTB5.0 is converted by regular-expression matching into a format carrying functional component
labels. The functional component labels cover the subject, predicate, object, adverbial, attributive, complement, and
head of the sentence, as well as the subordinate structure of the sentence. Functional component labels that are wrong or
missing in the labeled sentences are corrected to obtain a revised corpus;
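The regular-expression conversion described above can be illustrated with a minimal sketch. The source does not give the
actual rule set, so the mapping TAG_MAP, the pattern, and the function name convert are assumptions made for
illustration only; the sketch merely rewrites a hyphenated function-tag suffix on a bracketed treebank node into a
bracketed functional component label.

```python
import re

# Hypothetical mapping from treebank function-tag suffixes to the bracketed
# labels used in this document; the real conversion rules are not given.
TAG_MAP = {"SBJ": "[SBJ]", "OBJ": "[OBJ]", "ADV": "[ADV]"}

# Matches an opening bracket, a phrase category, and a hyphenated suffix,
# for example "(NP-SBJ".
PATTERN = re.compile(r"\((?P<cat>[A-Z]+)-(?P<tag>[A-Z]+)")


def convert(tree: str) -> str:
    """Rewrite known function-tag suffixes as functional component labels."""

    def repl(match: re.Match) -> str:
        label = TAG_MAP.get(match.group("tag"))
        if label is None:
            return match.group(0)  # leave unknown suffixes untouched
        return "(" + match.group("cat") + label

    return PATTERN.sub(repl, tree)


example = "(IP (NP-SBJ (PN 他)) (VP (VV 学习) (NP-OBJ (NN 科学))))"
converted = convert(example)
print(converted)
```

Sentences whose labels are wrong or missing after such a conversion would still need the correction pass described in
the text.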
Direction information is added between the Chinese characters inside each word of the revised corpus, generating a
syntax tree at character granularity in which every node carries direction information; the result is used as training
data A.
There are three directions: left (l), right (r), and coordinate (c). They indicate, respectively, that of the two child
nodes the one carrying the core meaning is the left child, that it is the right child, and that the two children have
equal status. For example, in the word 科学 ("science") the left child is 科 and the right child is 学, and the two are
coordinate. This annotation describes word-internal structure and is not a sentence-level relation;
The structural information between the characters inside a word is used to guide syntactic analysis and to generate the
character-granularity syntax tree: we annotate the relations between the characters inside each word, adding "direction"
information to each node.
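The direction annotation can be pictured with a small data structure. This is a sketch under assumptions: the class Node,
the helper two_char_word, and the method head_chars are hypothetical names, and only two-character words are handled.

```python
from dataclasses import dataclass
from typing import Optional

DIRECTIONS = {"l", "r", "c"}  # left, right, coordinate


@dataclass
class Node:
    label: str
    direction: Optional[str] = None  # None for a single-character leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def head_chars(self) -> str:
        """Return the character(s) that carry the core meaning of the node."""
        if self.left is None or self.right is None:
            return self.label
        if self.direction == "l":
            return self.left.head_chars()
        if self.direction == "r":
            return self.right.head_chars()
        # "c": both children have equal status
        return self.left.head_chars() + self.right.head_chars()


def two_char_word(word: str, direction: str) -> Node:
    """Build the character-level subtree of a two-character word."""
    if len(word) != 2 or direction not in DIRECTIONS:
        raise ValueError("expected a two-character word and one of l, r, c")
    return Node(word, direction, Node(word[0]), Node(word[1]))


science = two_char_word("科学", "c")
print(science.head_chars())
```

Longer words would need nested binary nodes, each with its own direction.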
The other steps and parameters are the same as in embodiment one.
Embodiment three: this embodiment differs from embodiment one or two in the analysis process of the syntactic functional
component analyzer in step two, which is as follows:
Each sentence in training data A is placed in the queue in turn, and the whole functional component analysis is treated
as a sequence of state transitions. Each state consists of a stack and a queue: the stack holds the syntactic functional
component tree fragments generated so far (each a part of a syntactic functional component tree), and the queue holds
the Chinese characters not yet processed;
In the initial state the stack is empty and the number of elements in the queue equals the number of Chinese characters
in the sentence;
The action for each state transition is selected by the averaged perceptron from a predefined action set. The action set
consists of shift-separate, shift-append, reduce-unary, reduce-binary, reduce-word, reduce-subword, idle, and terminate.
The averaged perceptron computes the score of each action in the current state, and the selection is made with a beam
search strategy;
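The state representation described above can be sketched as follows. The source names the actions but does not define
their exact semantics, so this sketch only models the stack, the queue, a generic shift, and a binary reduction; the
names State, initial_state, shift, reduce_binary, and is_final are assumptions, and the two-character input is a toy
example rather than a real analysis.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# A tree fragment is either a single character or (label, child, child).
Tree = Union[str, Tuple]

# Action names as listed in the text; only two are modelled below.
ACTIONS = ("shift-separate", "shift-append", "reduce-unary", "reduce-binary",
           "reduce-word", "reduce-subword", "idle", "terminate")


@dataclass(frozen=True)
class State:
    stack: Tuple[Tree, ...]  # tree fragments generated so far
    queue: Tuple[str, ...]   # characters not yet processed


def initial_state(sentence: str) -> State:
    """Empty stack; one queue element per character of the sentence."""
    return State((), tuple(sentence))


def shift(state: State) -> State:
    """Move the first character of the queue onto the stack."""
    if not state.queue:
        raise ValueError("cannot shift from an empty queue")
    return State(state.stack + (state.queue[0],), state.queue[1:])


def reduce_binary(state: State, label: str) -> State:
    """Combine the top two stack items under a new labelled node."""
    if len(state.stack) < 2:
        raise ValueError("reduce-binary needs two items on the stack")
    *rest, left, right = state.stack
    return State(tuple(rest) + ((label, left, right),), state.queue)


def is_final(state: State) -> bool:
    """Queue empty and a single IP root left on the stack."""
    return (not state.queue and len(state.stack) == 1
            and isinstance(state.stack[0], tuple) and state.stack[0][0] == "IP")


s0 = initial_state("他来")
s_final = reduce_binary(shift(shift(s0)), "IP")
print(s_final.stack[0], is_final(s_final))
```

Keeping the state immutable makes it cheap to hold several alternative states side by side, which is what a beam search
over actions requires.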
The score that the averaged perceptron computes for each action in the current state is the dot product of the feature
vector and the weight vector of the averaged perceptron. The feature vector is extracted from the Chinese sentence being
analyzed according to predefined feature templates. The general structural feature templates are as follows:
The structural feature templates related to Chinese characters are as follows:
The string features used when the syntactic functional component analyzer performs a shift-separate action are as follows:
The string features used when the syntactic functional component analyzer performs a shift-append action are as follows:
z-1.z0 z-1.z0.t-1 z0.y-1 start(ω-1).z0.t-1
The string features used when the syntactic functional component analyzer performs a reduce-word action are as follows:
In the final state the queue is empty and the stack contains only a single IP node, which is the root of the syntactic
functional component tree. Chinese sentence functional component analysis model C is obtained when training ends, and a
complete syntactic functional component tree is obtained when decoding ends, as shown in Fig. 2.
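A minimal sketch of beam search with a linear score may help make the decoding procedure concrete. It assumes sparse
binary features represented as strings, so the dot product reduces to a sum of weights; the functions score and
beam_search, the toy expand function, and the toy weights are all hypothetical and do not reproduce the feature
templates of the invention.

```python
from typing import Dict, Iterable


def score(features: Iterable[str], weights: Dict[str, float]) -> float:
    """Dot product of a sparse binary feature vector with the weight vector."""
    return sum(weights.get(f, 0.0) for f in features)


def beam_search(start, expand, is_final, beam_size=4, max_steps=100):
    """Keep the beam_size best partial analyses after every transition.

    expand(state) yields (action_score, next_state) pairs.
    """
    beam = [(0.0, start)]
    for _ in range(max_steps):
        if all(is_final(state) for _, state in beam):
            break
        candidates = []
        for total, state in beam:
            if is_final(state):
                candidates.append((total, state))  # finished items wait
                continue
            for step_score, nxt in expand(state):
                candidates.append((total + step_score, nxt))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[0], reverse=True)
        beam = candidates[:beam_size]
    return max(beam, key=lambda item: item[0])


# Toy problem: a state is a tuple of actions, final after three actions.
toy_weights = {"act=a": 1.0, "prev=a|act=b": 1.5, "prev=b|act=a": 0.5}


def toy_expand(state):
    prev = state[-1] if state else "^"
    for act in "ab":
        feats = [f"act={act}", f"prev={prev}|act={act}"]
        yield score(feats, toy_weights), state + (act,)


best = beam_search((), toy_expand, lambda s: len(s) == 3, beam_size=2)
print(best)
```

In a real analyzer the states would be stack-and-queue configurations and expand would enumerate the legal actions of
the action set.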
The whole Chinese sentence functional component analysis process mainly comprises processing the training corpus,
writing the training program, and selecting the parameters of the training model. Processing the training corpus means
correcting the annotation errors present in the corpus itself and converting the corpus into the character-granularity
format. The key parts of the training program are feature extraction and the implementation of the averaged perceptron.
Parameter selection for the training model mainly concerns the number of training iterations.
The averaged perceptron makes the classification decision over actions in a given state. Using the averaged perceptron
strategy avoids overfitting to some extent. Let the total number of iterations be T and the index of each iteration be
t, where 1 ≤ t ≤ T; let the number of sentences in the training corpus be N and the index of a sentence be n, where
1 ≤ n ≤ N. Let $w_{t,n}$ denote the weights of the model after the n-th sentence has been processed in the t-th
iteration. The weights obtained by the traditional perceptron training algorithm are then $w_{T,N}$.
These weights give the model high prediction accuracy on the training set but easily cause overfitting, so that the
prediction accuracy of the model on the test set is low. To prevent overfitting, the averaged perceptron strategy does
not use $w_{T,N}$ as the final weights, but instead uses the average
$\bar{w} = \frac{1}{NT}\sum_{t=1}^{T}\sum_{n=1}^{N} w_{t,n}$ as the weights of the model. The averaged perceptron
algorithm is as follows
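A minimal sketch of averaged perceptron training, written as a plain multi-class classifier over sparse binary features.
It is not the listing of the original document: the function names predict and train_averaged_perceptron, the update
step of 1.0, and the toy data are assumptions. The sketch accumulates the weights after every example and divides by
the number of iterations times the number of examples, which is the averaging described above.

```python
from collections import defaultdict


def predict(features, weights, classes):
    """Return the class with the highest linear score (first class on ties)."""
    return max(classes,
               key=lambda c: sum(weights.get((f, c), 0.0) for f in features))


def train_averaged_perceptron(data, classes, epochs):
    """data: list of (features, gold_class) pairs.

    Returns the average of the weights recorded after every example of
    every iteration, keyed by (feature, class).
    """
    w = defaultdict(float)      # current weights
    total = defaultdict(float)  # running sum of the weights over all steps
    for _ in range(epochs):
        for features, gold in data:
            guess = predict(features, w, classes)
            if guess != gold:
                for f in features:
                    w[(f, gold)] += 1.0   # reward the correct class
                    w[(f, guess)] -= 1.0  # penalise the wrong guess
            for key, value in w.items():
                total[key] += value
    steps = epochs * len(data)
    return {key: value / steps for key, value in total.items()}


toy_data = [(["shift"], "A"), (["reduce"], "B")]
avg_weights = train_averaged_perceptron(toy_data, ("A", "B"), epochs=2)
print(avg_weights)
```

Summing the full weight vector after every example is the simplest way to state the average; practical implementations
usually track timestamps per weight to avoid that cost.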
The other steps and parameters are the same as in embodiment one or two.
Embodiment four: this embodiment differs from any one of embodiments one to three in the details of step three, in which
functional component analysis is performed with Chinese sentence functional component analysis model C on plain Chinese
text (10,000 news and editorial sentences obtained from People's Net) to obtain sentences carrying functional component
labels, the commonly occurring errors are corrected to obtain a revised corpus, the revised corpus is converted into the
character-granularity format and used as training data B, and training data A and training data B are combined as the
final training data. The detailed process is:
Functional component analysis is performed with model C on the plain Chinese text (10,000 news and editorial sentences
obtained from People's Net) to obtain sentences carrying functional component labels, and the commonly occurring errors
are corrected. The functional component labels cover the subject, predicate, object, adverbial, attributive, complement,
and head of the sentence, as well as the subordinate structure of the sentence. Functional component labels that are
wrong or missing are corrected to obtain a revised corpus;
Direction information is added between the Chinese characters inside each word of the revised corpus, generating a
syntax tree at character granularity in which every node carries direction information; the result is used as training
data B;
There are three directions: left (l), right (r), and coordinate (c). They indicate, respectively, that of the two child
nodes the one carrying the core meaning is the left child, that it is the right child, and that the two children have
equal status.
The structural information between the characters inside a word is used to guide syntactic analysis and to generate the
character-granularity syntax tree: we annotate the relations between the characters inside each word, adding "direction"
information to each node.
Training data A and training data B together form the final training data.
The other steps and parameters are the same as in any one of embodiments one to three.
Embodiment five: this embodiment differs from any one of embodiments one to four in the details of step four, in which
the final training data is input into the syntactic functional component analyzer and trained to obtain Chinese sentence
functional component analysis model D, and the Chinese sentences to be tested (500 sentences) are tested with model D to
obtain the test results. The detailed process is:
The whole functional component analysis is treated as a sequence of state transitions, which yields the syntactic
functional component analyzer. Training the analyzer on the final training data proceeds as follows:
Each state consists of a stack and a queue: the stack holds the syntactic functional component tree fragments generated
so far (each a part of a syntactic functional component tree), and the queue holds the Chinese characters not yet
processed;
In the initial state the stack is empty and the number of elements in the queue equals the number of Chinese characters
in the sentence;
The action for each state transition is selected by the averaged perceptron from a predefined action set. The action set
consists of shift-separate, shift-append, reduce-unary, reduce-binary, reduce-word, reduce-subword, idle, and terminate.
The averaged perceptron computes the score of each action in the current state, and the selection is made with a beam
search strategy;
In the final state the queue is empty and the stack contains only a single IP node, which is the root of the syntactic
functional component tree. Chinese sentence functional component analysis model D is obtained when training ends, and a
complete syntactic functional component tree is obtained when decoding ends.
The other steps and parameters are the same as in any one of embodiments one to four.
The beneficial effects of the invention are verified by the following example:
Example one:
The method of this example for analyzing the functional components of Chinese sentences is carried out according to the
following steps:
(1) Training corpus
More than 13,000 sentences from CTB (the Penn Chinese Treebank) and 10,000 news and editorial sentences obtained from
People's Net, all processed into the character-granularity format.
(2) Training process
Initial model 1 is trained on the CTB corpus; initial model 1 is used to parse the 10,000 new sentences, and the
resulting syntactic functional component analyses are also used as training corpus; model 2 is then trained on the two
parts of the training corpus combined.
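The two-stage training procedure can be outlined as below. This is only a sketch of the data flow: the function
self_train and its callable parameters train, parse, correct, and to_char_level are hypothetical placeholders for the
components described in the text, and the stub functions in the usage lines exist purely so that the outline runs.

```python
def self_train(treebank, raw_sentences, train, parse, correct, to_char_level):
    """Train on the treebank, parse new text, then retrain on both parts."""
    # Training data A: corrected treebank trees at character granularity.
    data_a = [to_char_level(correct(tree)) for tree in treebank]
    model_1 = train(data_a)

    # Training data B: automatic analyses of the new sentences, corrected.
    auto_trees = [correct(parse(model_1, sent)) for sent in raw_sentences]
    data_b = [to_char_level(tree) for tree in auto_trees]

    # Final model: trained on the two parts combined.
    return train(data_a + data_b)


# Stub components, for illustration only.
def stub_train(data):
    return {"size": len(data)}


def stub_parse(model, sentence):
    return ("IP", sentence)


def identity(tree):
    return tree


model_2 = self_train(["t1", "t2"], ["s1"], stub_train, stub_parse,
                     identity, identity)
print(model_2)
```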
(3) Test set
500 sentences that do not occur in the training corpus are selected at random, parsed with the trained model, and then
proofread manually to ensure the accuracy of the test set.
The experimental results on the 500 proofread test sentences are shown in the following table:
F = 2PQ / (P + Q), where P is the precision and Q is the recall.
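The F value formula can be computed as in the following short sketch. The function name prf and the counts in the usage
line are illustrative assumptions and are not taken from the experiment.

```python
def prf(correct: int, predicted: int, gold: int):
    """Return precision P, recall Q and F = 2PQ / (P + Q).

    correct:   number of predicted components that match the gold standard
    predicted: number of components produced by the analyzer
    gold:      number of components in the gold standard
    """
    p = correct / predicted if predicted else 0.0
    q = correct / gold if gold else 0.0
    f = 2 * p * q / (p + q) if (p + q) else 0.0
    return p, q, f


p, q, f = prf(correct=8, predicted=10, gold=16)
print(p, q, f)
```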
The invention may also have various other embodiments. Without departing from the spirit and essence of the invention,
those skilled in the art may make corresponding changes and variations according to the invention, and all such changes
and variations shall fall within the scope of protection of the appended claims.