CN106844348A

CN106844348A - A method for analyzing the functional components of Chinese sentences

Info

Publication number: CN106844348A
Application number: CN201710077125.8A
Authority: CN
Inventors: 赵铁军; 曹海龙; 王亚楠; 徐冰; 朱聪慧; 杨沐昀; 郑德权; 马春鹏
Original assignee: Harbin Institute of Technology
Current assignee: Heilongjiang Industrial Technology Research Institute Asset Management Co ltd
Priority date: 2017-02-13
Filing date: 2017-02-13
Publication date: 2017-06-13
Anticipated expiration: 2037-02-13
Also published as: CN106844348B

Abstract

A method for analyzing the functional components of Chinese sentences. The invention aims to solve the problem that the prior art does not take the functional components of Chinese sentences into account. The process is: (1) process the training corpus by converting CTB5.0 into a form with functional-component labels and correcting it, then convert the corrected corpus into a character-granularity form, which serves as A; (2) input A into the syntactic functional component analyzer and train it to obtain Chinese sentence functional component analysis model C; (3) process plain Chinese text data to obtain sentences with functional-component labels, convert them into the character-granularity form as B, and combine A with B as the final training data; (4) test the Chinese sentences to be tested using Chinese sentence functional component analysis model D to obtain the test results. The invention is used in the field of sentence functional component analysis.

Description

A method for analyzing the functional components of Chinese sentences

Technical field

The present invention relates to a method for analyzing the functional components of Chinese sentences and belongs to the field of machine translation technology.

Background art

Syntactic analysis is a key problem in natural language processing, but the results achieved so far are not fully satisfactory and progress has reached a bottleneck. It remains an active research topic because it occupies a transitional position in the overall natural language processing pipeline: many other natural language processing tasks use its results, and both higher-level and lower-level tasks can apply them. Mainstream syntactic analysis methods fall into two classes. The first is shallow parsing, that is, chunk parsing, which no longer takes the word as the processing unit but takes the chunk as the basic unit. Within this class, some methods directly generate a new sequence as the result, while others analyze the chunks further, performing syntactic analysis with the chunk as the unit and ignoring the internal structure of each chunk; the result is still a partial parse tree. The second is full parsing, in which the processing unit is each word in the sentence and the output is a complete syntax tree. Full parsing can be further divided into phrase-structure parsing and dependency parsing. Phrase-structure parsing builds, level by level, a complete syntax tree with a phrase hierarchy from the forms of the basic sentence units and their relations within phrases. Similarly, dependency parsing builds, according to the definition of dependency grammar, a complete syntax tree from the dependency relations between words.

However, none of these studies takes the functional information in the sentence into account. Phrase-structure parsing considers phrase-level information, and dependency parsing considers the dependency relations between words; neither reflects the role that a word or group of words plays in the sentence (such as subject, predicate or object). Zhou Qiang et al. of Tsinghua University first proposed a similar concept. They converted the functional component extraction task into a chunk parsing task that differs from earlier phrase chunking in that the labels are the functional components of the sentence, and a related task was released as part of CIPS-2009. In the following years, however, related research largely stagnated, with only one related article published, in 2011 in the Journal of Chinese Information Processing.

Sentence functional components are significant in many practical problems. For example, in the word alignment task of machine translation, sentence functional component information can improve both the speed and the accuracy of word alignment by aligning words that belong to the same component; this approach is simple and consistent with linguistic rules. Similarly, in dependency parsing, sentence functional component information can serve as a constraint that directly prunes illegal paths during beam search, thereby speeding up the search; such a rule is likewise simple and easy to apply. The information is also relevant to research in semantic analysis. More importantly, within the overall natural language processing pipeline, the task can serve as a transition between syntactic analysis and semantic analysis: in terms of granularity it lies above syntactic analysis and below semantic analysis, and good results on this task can improve both. As the above shows, this research has promising applications and deserves attention.

However, existing research is still at a very early stage and there is little prior work to draw on. The main analysis method remains the functional chunk parsing of Zhou Qiang et al., and these methods have many shortcomings. First, the Chinese functional treebank is small; because the treebank was produced by conversion with manually written rules, its accuracy is limited, and the data have not been updated since. Second, the work of both Zhou Qiang et al. and Chen Yi only annotates the functional chunks of a Chinese sentence, producing a single-layer linear structure rather than a hierarchical structure, so it cannot serve the construction of a parse tree. In addition, as a specific task, there has so far been no dedicated research on the functional components of Chinese sentences. We therefore propose a baseline model for Chinese functional component analysis and an analysis method based on shift-reduce transition actions. These contributions give the work practical significance.

Summary of the invention

The object of the invention is to solve the problem that the prior art does not take the functional components of Chinese sentences into account, and to this end it proposes a method for analyzing the functional components of Chinese sentences.

The specific process of the method is:

Step one: process the training corpus, which is CTB5.0. Convert CTB5.0 by regular-expression matching into a form with functional-component labels, correct the sentences in that form to obtain a corrected corpus, and convert the corrected corpus into a character-granularity form, which serves as training data A;

CTB5.0 is the Penn Chinese Treebank;

Step two: treat the whole functional component analysis process as a series of state transitions to obtain a syntactic functional component analyzer, and input training data A into the analyzer for training to obtain Chinese sentence functional component analysis model C;

Step three: process plain Chinese text data with Chinese sentence functional component analysis model C to obtain sentences with functional-component labels, correct these sentences to obtain a corrected corpus, convert the corrected corpus into the character-granularity form as training data B, and combine training data A with training data B as the final training data;

Step four: input the final training data into the syntactic functional component analyzer for training to obtain Chinese sentence functional component analysis model D, and test the Chinese sentences to be tested using model D to obtain the test results.

The beneficial effects of the invention are:

The invention treats the whole functional component analysis process as a series of state transitions to obtain a syntactic functional component analyzer. One part of the training corpus is CTB5.0 (the Penn Chinese Treebank), and the other part is the result of processing plain Chinese text data. The analyzer is trained on this corpus to obtain a functional component analysis model, and testing that model on the Chinese sentences to be tested (500 sentences) yields high precision, recall and F value.

As shown in Table 1, when the invention is tested on 500 Chinese sentences, the precision over whole syntactic functional component trees is 97.38%, the recall is 97.79%, and the F value is 90.90%.

Brief description of the drawings

Fig. 1 is a framework diagram of the overall syntactic functional component analysis method;

Fig. 2 is a tree diagram showing the result of functional component analysis for one Chinese sentence. Among the functional labels, [SBJ] is the subject, [PRE] the predicate, [OBJ] the object, [ADV] the adverbial, [ADJ] the attributive modifier and [HEAD] the head. Among the phrase labels, IP is a sentence, NP a noun phrase, VP a verb phrase, ADVP an adverb phrase, PP a prepositional phrase, CP a complement phrase, ADJP an adjective phrase and QP a numeral-classifier phrase. Among the part-of-speech labels, PN is a pronoun, AD an adverb, VV an action verb, VA an adjectival verb, JJ an adjective, NN a noun, AS an auxiliary particle, P a preposition, CD a numeral, OD an ordinal numeral, DEC the particle 的, CC a conjunction and PU a punctuation mark.

Detailed description of the embodiments

Embodiment one: the specific process of the Chinese sentence functional component analysis method of this embodiment is:

Step one: process the training corpus, which is CTB5.0 (the Penn Chinese Treebank). The CTB5.0 corpus is itself the result of syntactic analysis. Convert CTB5.0 by regular-expression matching into a form with functional-component labels, correct the sentences in that form to obtain a corrected corpus, and convert the corrected corpus into a character-granularity form, which serves as training data A;

Step two: apply the transition-based syntactic analysis method to functional component analysis, treating the whole functional component analysis process as a series of state transitions to obtain a syntactic functional component analyzer, and input training data A into the analyzer for training to obtain Chinese sentence functional component analysis model C, as shown in Fig. 1;

Step three: use Chinese sentence functional component analysis model C to process plain Chinese text data (containing no letters or English; 10,000 sentences of news and editorials obtained from People's Net) to obtain sentences with functional-component labels, correct the commonly occurring errors in these sentences to obtain a corrected corpus, convert the corrected corpus into the character-granularity form as training data B, and combine training data A with training data B as the final training data;

Step four: input the final training data into the syntactic functional component analyzer for training to obtain Chinese sentence functional component analysis model D, and test the Chinese sentences to be tested (500 sentences) using model D to obtain the test results.
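The four steps above amount to a self-training loop. As a minimal sketch of the data flow only, assuming hypothetical `train` and `correct` callables that stand in for the analyzer training and the correction pass (neither name appears in the patent), it could be written as:

```python
from typing import Callable, List, Sequence

Tree = str  # placeholder: one character-granularity tree, serialized as text
Parser = Callable[[str], Tree]


def self_train(
    data_a: Sequence[Tree],
    raw_sentences: Sequence[str],
    train: Callable[[Sequence[Tree]], Parser],  # hypothetical trainer
    correct: Callable[[Tree], Tree],            # hypothetical correction pass
) -> Parser:
    """Train model C on A, label raw text to build B, then train model D on A + B."""
    model_c = train(data_a)                                    # step two
    data_b: List[Tree] = [correct(model_c(s)) for s in raw_sentences]  # step three
    final_data = list(data_a) + data_b                         # A combined with B
    return train(final_data)                                   # step four: model D
```

Passing the trainer and the correction pass as arguments keeps the sketch independent of how the analyzer itself is implemented.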

Embodiment two: this embodiment differs from embodiment one in the detailed process of step one, in which the training corpus CTB5.0 (the Penn Chinese Treebank), itself the result of syntactic analysis, is converted by regular-expression matching into a form with functional-component labels, the sentences in that form are corrected to obtain a corrected corpus, and the corrected corpus is converted into a character-granularity form as training data A. The detailed process is:

Process the training corpus, which is CTB5.0 (the Penn Chinese Treebank); the CTB5.0 corpus is itself the result of syntactic analysis. Convert CTB5.0 by regular-expression matching into a form with functional-component labels. The functional-component labels cover the subject, predicate, object, adverbial, attributive, complement and head of the sentence, as well as the hierarchical structure of the sentence. Correct the functional-component labels that are wrong or missing in the labeled sentences to obtain a corrected corpus;

Add direction information between the Chinese characters inside each word of the corrected corpus and generate a character-granularity syntax tree, that is, add direction information to each node of the syntax tree, which serves as training data A.
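The regular-expression conversion of CTB5.0 labels in this step might look like the sketch below. The tag mapping, the `NP[SBJ]` output notation and the choice to drop unmapped tags are all assumptions made for illustration; the patent does not give the actual patterns.

```python
import re

# Hypothetical mapping from treebank function tags to bracketed labels.
TAG_MAP = {"SBJ": "[SBJ]", "OBJ": "[OBJ]", "ADV": "[ADV]"}

# Matches an opening bracket followed by CATEGORY-FUNCTION, e.g. "(NP-SBJ".
LABEL = re.compile(r"\((?P<cat>[A-Z]+)-(?P<func>[A-Z]+)(?=\s)")


def convert(tree: str) -> str:
    """Rewrite CATEGORY-FUNCTION labels as CATEGORY[FUNCTION]."""
    def repl(match: "re.Match[str]") -> str:
        label = TAG_MAP.get(match.group("func"), "")  # assumption: drop unmapped tags
        return "(" + match.group("cat") + label
    return LABEL.sub(repl, tree)


print(convert("(IP (NP-SBJ (PN 我)) (VP (VV 爱) (NP-OBJ (NN 科学))))"))
```

A pass like this only relabels nodes; wrong or missing labels would still need the separate correction described above.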

There are three directions: left (l), right (r) and coordinate (c). They indicate, respectively, that of the two child nodes the one carrying the core meaning is the left child, that it is the right child, and that the two children have equal status. For example, in the word 科学 ("science"), the left child is 科 and the right child is 学, and the two are coordinate. The annotation here is made inside the word; this relation is not a sentence-level relation;

The structural information between the Chinese characters inside a word is used to guide syntactic analysis and to generate the character-granularity syntax tree. The relations between the characters inside each word are annotated, which adds "direction" information to each node.

The other steps and parameters are the same as in embodiment one.
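To make the direction annotation concrete, the following sketch builds a two-character word node carrying one of the three labels. The `CharNode` class and the `head_chars` helper are hypothetical names introduced here; the only example taken from the text is 科学 with the coordinate label.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CharNode:
    text: str
    direction: Optional[str] = None   # 'l', 'r' or 'c'; None for a single character
    left: Optional["CharNode"] = None
    right: Optional["CharNode"] = None


def merge(left: CharNode, right: CharNode, direction: str) -> CharNode:
    """Combine two adjacent nodes and record which child is the core."""
    if direction not in ("l", "r", "c"):
        raise ValueError("direction must be 'l', 'r' or 'c'")
    return CharNode(left.text + right.text, direction, left, right)


def head_chars(node: CharNode) -> str:
    """Return the character(s) that the direction labels mark as the core."""
    if node.direction is None:
        return node.text
    if node.direction == "l":
        return head_chars(node.left)
    if node.direction == "r":
        return head_chars(node.right)
    return head_chars(node.left) + head_chars(node.right)  # coordinate


word = merge(CharNode("科"), CharNode("学"), "c")
print(word.text, word.direction, head_chars(word))
```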

Embodiment three: this embodiment differs from embodiment one or two in that the analysis process of the syntactic functional component analyzer in step two is:

Each sentence in data A enters the queue in turn, and the whole functional component analysis process is treated as a series of state transitions. Each state consists of a stack and a queue: the stack holds the syntactic functional component tree fragments generated so far (parts of a syntactic functional component tree), and the queue holds the Chinese characters not yet processed;

In the initial state the stack is empty, and the number of elements in the queue equals the number of Chinese characters in the sentence;
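The stack-and-queue state can be sketched as follows. The `State` class and the representation of a tree fragment as a `(label, children)` tuple are assumptions for illustration; the final-state test uses the condition this embodiment gives for termination (empty queue, a single IP on the stack).

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple

Fragment = Tuple[str, list]  # assumed form: (label, children)


@dataclass
class State:
    stack: List[Fragment] = field(default_factory=list)  # fragments built so far
    queue: Deque[str] = field(default_factory=deque)     # unprocessed characters


def initial_state(sentence: str) -> State:
    """Empty stack; one queue element per character of the sentence."""
    return State(stack=[], queue=deque(sentence))


def is_final(state: State) -> bool:
    """Queue empty and exactly one fragment, the IP root, on the stack."""
    return (not state.queue
            and len(state.stack) == 1
            and state.stack[0][0] == "IP")


s = initial_state("我爱科学")
print(len(s.stack), len(s.queue), is_final(s))
```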

The action for each state transition is selected by the averaged perceptron from a predefined action set.

The predefined action set is shift-separate, shift-append, reduce-unary, reduce-binary, reduce-word, reduce-subword, idle and terminate. The averaged perceptron computes the score of each action in the current state, and the action is selected using a beam search strategy.
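Beam search over scored transitions can be sketched generically as below. This is an illustration under assumptions, not the patent's decoder: `expand` is a hypothetical callable that returns the scored successor states for whatever actions are legal in a state, and the beam keeps the highest cumulative scores.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

S = TypeVar("S")


def beam_search(
    start: S,
    expand: Callable[[S], Iterable[Tuple[float, S]]],  # (action score, next state)
    is_final: Callable[[S], bool],
    beam_size: int = 4,
    max_steps: int = 1000,
) -> Tuple[float, S]:
    """Keep the beam_size best partial derivations by cumulative score."""
    beam: List[Tuple[float, S]] = [(0.0, start)]
    for _ in range(max_steps):
        if all(is_final(state) for _, state in beam):
            break
        candidates: List[Tuple[float, S]] = []
        for total, state in beam:
            if is_final(state):
                candidates.append((total, state))  # finished items stay in the beam
                continue
            for score, nxt in expand(state):
                candidates.append((total + score, nxt))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[0], reverse=True)
        beam = candidates[:beam_size]
    return max(beam, key=lambda item: item[0])
```

Keeping finished items in the beam plays a role similar to a no-op action, letting derivations of different lengths compete on total score.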

The score that the averaged perceptron computes for each action in the current state is the dot product of the feature vector and the weight vector of the averaged perceptron. The feature vector is obtained by extracting features from the Chinese sentence to be analyzed according to predefined feature templates. The general structural feature templates are as follows:

The structural feature templates related to Chinese characters are as follows:

The string features used when the syntactic functional component analyzer performs the shift-separate action are as follows:

The string features used when the syntactic functional component analyzer performs the shift-append action are as follows:

z-1.z0 z-1.z0.t-1 z0.y-1 start(ω-1).z0.t-1

The string features used when the syntactic functional component analyzer performs the reduce-word action are as follows:

In the final state the queue is empty and the stack contains only a single IP, which is the root node of the syntactic functional component tree. Chinese sentence functional component analysis model C is obtained when training ends, and a complete syntactic functional component tree is obtained when decoding ends, as shown in Fig. 2.

The whole Chinese sentence functional component analysis process mainly comprises processing the training corpus, writing the training program, and selecting the parameters of the training model. Processing the training corpus means correcting the annotation errors present in the corpus itself and converting the corpus into the form based on character-granularity information. The key parts of the training program are feature extraction and the implementation of the averaged perceptron. Parameter selection for the training model mainly concerns the number of iteration rounds.

The averaged perceptron makes the classification decision over actions in a given state. The averaging strategy is used because it can avoid overfitting to some extent. Let the total number of iteration rounds be $T$ and the index of each round be $t$, where $0 < t < T+1$; let the total number of sentences in the corpus be $N$ and the index of a sentence be $n$, where $0 < n < N+1$. If the model weights after processing the $n$-th sentence in the $t$-th round are $w_{t,n}$, then the weights obtained by the conventional perceptron training algorithm are $w_{T,N}$.

These weights give the model high prediction accuracy on the training set but easily cause overfitting, so that the prediction accuracy of the model on the test set is not high. To prevent overfitting, the averaged perceptron strategy does not use $w_{T,N}$ as the final weights but uses $\frac{1}{NT}\sum_{t=1}^{T}\sum_{n=1}^{N} w_{t,n}$ as the weights of the model. The averaged perceptron algorithm is as follows:
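As a hedged illustration of the averaging idea (not a reproduction of the patent's own listing), the sketch below trains a simple action classifier with binary features, so that the score of an action is the dot product of a 0/1 feature vector with the weight vector, and returns the mean of the weights $w_{t,n}$ over all rounds and examples. The feature strings, the data layout and the per-decision (rather than per-sentence, beam-based) update are simplifying assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Features = List[str]
Example = Tuple[Dict[str, Features], str]  # (features per action, gold action)


def score(weights: Dict[str, float], features: Features) -> float:
    """Dot product of a binary feature vector with the weight vector."""
    return sum(weights.get(f, 0.0) for f in features)


def train_averaged_perceptron(
    data: Sequence[Example], actions: Sequence[str], rounds: int
) -> Dict[str, float]:
    w: Dict[str, float] = defaultdict(float)      # current weights w_{t,n}
    total: Dict[str, float] = defaultdict(float)  # running sum of w_{t,n}
    steps = 0
    for _ in range(rounds):                       # t = 1..T
        for feats, gold in data:                  # n = 1..N
            pred = max(actions, key=lambda a: score(w, feats[a]))
            if pred != gold:                      # standard perceptron update
                for f in feats[gold]:
                    w[f] += 1.0
                for f in feats[pred]:
                    w[f] -= 1.0
            for f, value in w.items():            # accumulate after every example
                total[f] += value
            steps += 1
    return {f: value / steps for f, value in total.items()}  # mean over N*T steps
```

Summing the full weight vector after every example is quadratic in the worst case; practical implementations usually track timestamps per feature instead, which yields the same average.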

The other steps and parameters are the same as in embodiment one or two.

Embodiment four: this embodiment differs from any of embodiments one to three in the detailed process of step three, in which Chinese sentence functional component analysis model C is used to perform functional component analysis on the data (plain Chinese text; 10,000 sentences of news and editorials obtained from People's Net) to obtain sentences with functional-component labels, the commonly occurring errors in these sentences are corrected to obtain a corrected corpus, the corrected corpus is converted into the character-granularity form as training data B, and training data A is combined with training data B as the final training data. The detailed process is:

Use Chinese sentence functional component analysis model C to perform functional component analysis on the data (plain Chinese text; 10,000 sentences of news and editorials obtained from People's Net) to obtain sentences with functional-component labels, and correct the commonly occurring errors. The functional-component labels cover the subject, predicate, object, adverbial, attributive, complement and head of the sentence, as well as the hierarchical structure of the sentence. Correct the functional-component labels that are wrong or missing to obtain a corrected corpus;

Add direction information between the Chinese characters inside each word of the corrected corpus and generate a character-granularity syntax tree, that is, add direction information to each node of the syntax tree, which serves as training data B;

There are three directions: left (l), right (r) and coordinate (c). They indicate, respectively, that of the two child nodes the one carrying the core meaning is the left child, that it is the right child, and that the two children have equal status.

Training data A and training data B are added together as the final training data.

The other steps and parameters are the same as in any of embodiments one to three.

Embodiment five: this embodiment differs from any of embodiments one to four in the detailed process of step four, in which the final training data are input into the syntactic functional component analyzer for training to obtain Chinese sentence functional component analysis model D, and the Chinese sentences to be tested (500 sentences) are tested using model D to obtain the test results. The detailed process is:

The whole functional component analysis process is treated as a series of state transitions to obtain the syntactic functional component analyzer. Inputting the final training data into the analyzer for training is specifically:

Each state consists of a stack and a queue: the stack holds the syntactic functional component tree fragments generated so far (parts of a syntactic functional component tree), and the queue holds the Chinese characters not yet processed;

The action for each state transition is selected by the averaged perceptron from a predefined action set. The predefined action set is shift-separate, shift-append, reduce-unary, reduce-binary, reduce-word, reduce-subword, idle and terminate. The averaged perceptron computes the score of each action in the current state, and the action is selected using a beam search strategy;

In the final state the queue is empty and the stack contains only a single IP, which is the root node of the syntactic functional component tree. Chinese sentence functional component analysis model D is obtained when training ends, and a complete syntactic functional component tree is obtained when decoding ends.

The other steps and parameters are the same as in any of embodiments one to four.

The beneficial effects of the invention are verified by the following example:

Example one:

The Chinese sentence functional component analysis method of this example is carried out according to the following steps:

(1) Training corpus

More than 13,000 sentences from CTB (the Penn Chinese Treebank) and 10,000 sentences of news and editorials obtained from People's Net, processed into the character-granularity form.

(2) Training process

Initial model 1 is trained on the CTB corpus. Initial model 1 is then used to parse the 10,000 new sentences, and the resulting syntactic functional component analyses are also used as training corpus. Model 2 is then trained on the two parts of the training corpus combined.

(3) Test set

500 sentences not in the training corpus are randomly selected, parsed with the trained model, and then manually proofread to ensure the accuracy of the test set.

The experimental results on the 500 proofread test sentences are shown in the following table:

F = 2PQ / (P + Q), where P is the precision and Q is the recall.

The invention may also have various other embodiments. Without departing from the spirit and essence of the invention, those skilled in the art can make various corresponding changes and modifications according to the invention, and all such changes and modifications shall fall within the scope of protection of the appended claims.
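For reference, the F value can be computed from sets of labeled constituents as in the sketch below. Treating each constituent as a `(label, start, end)` tuple is an assumption about the evaluation unit, and the toy trees are invented for illustration; they are not the patent's test data.

```python
from typing import Set, Tuple

Constituent = Tuple[str, int, int]  # assumed unit: (label, start, end)


def prf(gold: Set[Constituent], predicted: Set[Constituent]) -> Tuple[float, float, float]:
    """Return precision P, recall Q and F = 2PQ / (P + Q)."""
    correct = len(gold & predicted)
    p = correct / len(predicted) if predicted else 0.0
    q = correct / len(gold) if gold else 0.0
    f = 2 * p * q / (p + q) if p + q else 0.0
    return p, q, f


# Invented toy example: one of four predicted labels is wrong.
gold = {("IP", 0, 4), ("NP[SBJ]", 0, 1), ("VP[PRE]", 1, 4), ("NP[OBJ]", 2, 4)}
pred = {("IP", 0, 4), ("NP[SBJ]", 0, 1), ("VP[PRE]", 1, 4), ("NP[ADV]", 2, 4)}
print(prf(gold, pred))
```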

Claims

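1. A method for analyzing the functional components of Chinese sentences, characterized in that the specific process of the method is:

Step one: process the training corpus, which is CTB5.0; convert CTB5.0 by regular-expression matching into a form with functional-component labels, correct the sentences in that form to obtain a corrected corpus, and convert the corrected corpus into a character-granularity form, which serves as training data A;

CTB5.0 is the Penn Chinese Treebank;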

Step two: treat the whole functional component analysis process as a series of state transitions to obtain a syntactic functional component analyzer, and input training data A into the analyzer for training to obtain Chinese sentence functional component analysis model C;

Step three: process plain Chinese text data with Chinese sentence functional component analysis model C to obtain sentences with functional-component labels, correct these sentences to obtain a corrected corpus, convert the corrected corpus into the character-granularity form as training data B, and combine training data A with training data B as the final training data;

Step four: input the final training data into the syntactic functional component analyzer for training to obtain Chinese sentence functional component analysis model D, and test the Chinese sentences to be tested using model D to obtain the test results.

2. The method for analyzing the functional components of Chinese sentences according to claim 1, characterized in that in step one the training corpus, which is CTB5.0, is processed: CTB5.0 is converted by regular-expression matching into a form with functional-component labels, the sentences in that form are corrected to obtain a corrected corpus, and the corrected corpus is converted into a character-granularity form as training data A; the detailed process is:

Process the training corpus, which is CTB5.0, and convert CTB5.0 by regular-expression matching into a form with functional-component labels, the functional-component labels covering the subject, predicate, object, adverbial, attributive, complement and head of the sentence; correct the functional-component labels that are wrong or missing in the labeled sentences to obtain a corrected corpus;

Add direction information between the Chinese characters inside each word of the corrected corpus and generate a character-granularity syntax tree, which serves as training data A.

3. The method for analyzing the functional components of Chinese sentences according to claim 2, characterized in that the analysis process of the syntactic functional component analyzer in step two is:

Each state consists of a stack and a queue; the stack holds the syntactic functional component tree fragments generated so far, and the queue holds the Chinese characters not yet processed;

The predefined action set is shift-separate, shift-append, reduce-unary, reduce-binary, reduce-word, reduce-subword, idle and terminate; the averaged perceptron computes the score of each action in the current state, and the action is selected using a beam search strategy;

The score that the averaged perceptron computes for each action in the current state is the dot product of the feature vector and the weight vector of the averaged perceptron, the feature vector being obtained by extracting features from the Chinese sentence to be analyzed according to predefined feature templates;

In the final state the queue is empty and the stack contains only a single IP, which is the root node of the syntactic functional component tree; Chinese sentence functional component analysis model C is obtained when training ends, and a complete syntactic functional component tree is obtained when decoding ends.

4. The method for analyzing the functional components of Chinese sentences according to claim 3, characterized in that in step three Chinese sentence functional component analysis model C is used to perform functional component analysis on plain Chinese text data to obtain sentences with functional-component labels, these sentences are corrected to obtain a corrected corpus, the corrected corpus is converted into the character-granularity form as training data B, and training data A is combined with training data B as the final training data; the detailed process is:

Use Chinese sentence functional component analysis model C to perform functional component analysis on the plain Chinese text data to obtain sentences with functional-component labels, the functional-component labels covering the subject, predicate, object, adverbial, attributive, complement and head of the sentence; correct the functional-component labels that are wrong or missing to obtain a corrected corpus; add direction information between the Chinese characters inside each word of the corrected corpus and generate a character-granularity syntax tree, which serves as training data B; add training data A and training data B together as the final training data.

5. The method for analyzing the functional components of Chinese sentences according to claim 4, characterized in that in step four the final training data are input into the syntactic functional component analyzer for training to obtain Chinese sentence functional component analysis model D, and the Chinese sentences to be tested are tested using model D to obtain the test results; the detailed process is:

The whole functional component analysis process is treated as a series of state transitions to obtain the syntactic functional component analyzer; inputting the final training data into the analyzer for training is specifically:

The action for each state transition is selected by the averaged perceptron from a predefined action set; the predefined action set is shift-separate, shift-append, reduce-unary, reduce-binary, reduce-word, reduce-subword, idle and terminate; the averaged perceptron computes the score of each action in the current state, and the action is selected using a beam search strategy;

In the final state the queue is empty and the stack contains only a single IP, which is the root node of the syntactic functional component tree; Chinese sentence functional component analysis model D is obtained when training ends, and a complete syntactic functional component tree is obtained when decoding ends.