CN101446942A

CN101446942A - Semantic role labeling method for natural language sentences

Info

Publication number: CN101446942A
Application number: CNA2008102436058A
Authority: CN
Inventors: 王红玲; 朱巧明; 钱培德; 孔芳; 李培峰; 周国栋; 钱龙华
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2008-12-10
Filing date: 2008-12-10
Publication date: 2009-06-03

Abstract

The invention discloses a semantic role labeling method for natural language sentences, characterized in that Chinese syntactic parsing and semantic role labeling are performed simultaneously by a joint learning model. Using the joint model, the invention outputs the syntactic parse of a sentence and the semantic role labeling result for a given predicate at the same time. Because semantic information is added to the syntactic parsing model within the joint learning model, the trained model is better suited to the semantic role labeling task, so the semantic roles it outputs are more accurate. Meanwhile, the syntactic parse output by the joint model differs little in performance from that of a standalone syntactic parsing model, and the added semantic information can even improve parsing performance.

Description

A semantic role labeling method for natural language sentences

Technical field

The present invention relates to a method for the semantic analysis of natural language, and in particular to a method for analyzing and labeling the semantic roles of natural language sentences. It belongs to the field of natural language processing within computational linguistics.

Background technology

Semantic analysis is a key problem in natural language processing. Semantic role labeling (SRL), currently an active research topic, is a form of shallow semantic parsing; in essence it performs shallow semantic analysis at the sentence level. Given a sentence, semantic role labeling identifies, for each predicate in the sentence, the corresponding semantic constituents and assigns them semantic labels such as agent, patient, instrument, or adjunct. SRL can be applied to question answering, information extraction, text summarization, textual entailment, and other fields, and therefore has broad application prospects.

Semantic role labeling based on machine learning is usually divided into four stages: a) preprocessing, which typically filters out syntactic constituents that cannot be semantic roles; b) semantic constituent identification, which determines which labeling units are semantic roles of a given target predicate; c) semantic role classification, which assigns a role class to each unit identified as a semantic constituent; and d) postprocessing, which globally optimizes the labeled roles to determine a reasonable combination of roles. The identification and classification stages generally use local inference. Local inference decides the semantic label of each constituent in the sentence independently, without relying on the labels of other constituents; a model trained this way is called a local model. Global inference, by contrast, generally takes place in the postprocessing stage. It builds on local inference by taking the dependencies among constituent labels into account: a global model integrates the relevant hard and soft constraints to obtain a reasonable combination of semantic roles. In general, a sensible integration of local and global models can greatly improve the performance and robustness of the system.

Learning methods for the local model fall into two classes: methods based on feature vectors and methods based on kernel functions. To date, feature-vector methods have been the more successful, clearly outperforming kernel methods in both speed and accuracy.

Feature-vector methods require a large number of manually defined, discriminative feature templates; each instance is converted into a feature vector according to these templates and then used for training or prediction. Current research concentrates on feature engineering and on machine learning models. However, because feature-vector methods reflect only the local information of the unit to be labeled and cannot capture global information or structured syntactic information well, researchers have explored kernel-based methods for semantic role labeling. The basic idea of a kernel method is to map a problem that is not linearly separable in a low-dimensional space into a higher-dimensional space where it becomes linearly separable. This mapping is usually performed implicitly by evaluating a kernel function, which reduces time and space complexity. Kernel functions integrate well with learning algorithms such as support vector machines and perceptrons, and have therefore attracted wide interest.

Natural language processing tasks, including part-of-speech tagging, syntactic parsing, semantic analysis, and information extraction, are normally carried out in sequence, each task building on the result of the preceding one; semantic role analysis, for example, is usually based on the result of syntactic parsing. Syntactic parsing is a fundamental problem and key technology of natural language processing. Its task is to derive the syntactic structure of a sentence automatically according to a given grammar, that is, the syntactic units the sentence contains and the relations between them. Syntactic parsing has two main purposes: one is to determine the hierarchical ("pedigree") structure of the sentence; the other is to determine the relations between the constituents of the sentence. Typically the input is a sentence, i.e. a linear sequence of words, and the output is a nonlinear data structure such as a phrase structure tree (a syntax tree) or a directed acyclic graph (a dependency graph).

Therefore, in the prior art, syntactic parsing is always performed first to obtain a syntax tree, and semantic analysis, including semantic role labeling of the sentence, is then performed on that tree. This approach causes a series of problems. First, the earlier task cannot take the specific needs of the later task into account and so may fail to meet them. Syntactic parsing usually serves several downstream tasks, so its output is fairly generic, whereas semantic role labeling may need, in addition to this general information, specific information such as the probability of the syntax tree or of its subtrees, which syntactic parsing systems usually do not provide. Second, the performance of the later task is limited by that of the earlier one: the quality of the syntactic parse directly affects the performance of semantic role labeling. Existing studies show that, for English, the performance (F1 score) of semantic role labeling on hand-annotated syntax trees and on automatically generated syntax trees differs by 10 percentage points, while for Chinese the F1 difference reaches as much as 30 percentage points.

Summary of the invention

The object of the present invention is to provide an effective semantic role labeling method for sentences which, by establishing a joint inference model, reduces the impact of automatic syntactic parsing results on semantic role labeling performance, thereby solving the problem that semantic role labeling based on automatic syntactic parsing performs poorly.

To achieve the above object, the technical solution adopted by the present invention is a semantic role labeling method for natural language sentences that uses a joint learning model to perform Chinese syntactic parsing and semantic role labeling simultaneously, comprising the following steps:

(1) Generate the semantic role labeling model:

Generate the training file: extract features from an annotated corpus according to the features in the following table, and generate the required training file;

Position	Path	Head word and its part of speech	Predicate	Subcategorization frame	Syntactic constituent type
First and last words of the syntactic constituent	Type of the constituent's left sibling	Predicate syntactic frame	Verb class	Word preceding the syntactic constituent	Type of the constituent's parent node
Compressed path	Whether the constituent has a right sibling	Head-word type of the constituent's left sibling	Modified path	Whether the path contains the root node
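
As an illustration of two of the features listed above, the following Python sketch computes a path feature and a position feature on a toy syntax tree. The tree encoding (nested tuples addressed by child indices), the function names, and the "^" and "v" separators are assumptions made for this example; the text does not specify them.

```python
# Toy syntax tree: (label, child, child, ...); a preterminal is (POS, "word").
TREE = ("S",
        ("NP", ("PRP", "I")),
        ("VP", ("VBD", "saw"),
               ("NP", ("DT", "the"), ("NN", "man"))))

def node_at(tree, address):
    """Return the node reached by following a tuple of child indices from the root."""
    node = tree
    for i in address:
        node = node[i + 1]  # slot 0 of every node holds its label
    return node

def path_feature(tree, constituent, predicate):
    """Labels from the constituent up to the lowest common ancestor, then down to the predicate."""
    common = 0
    while (common < min(len(constituent), len(predicate))
           and constituent[common] == predicate[common]):
        common += 1
    up = [node_at(tree, constituent[:k])[0]
          for k in range(len(constituent), common - 1, -1)]
    down = [node_at(tree, predicate[:k])[0]
            for k in range(common + 1, len(predicate) + 1)]
    return "^".join(up) + "".join("v" + label for label in down)

def position_feature(constituent, predicate):
    """Whether the constituent lies before or after the predicate in the sentence."""
    return "before" if constituent < predicate else "after"

PREDICATE = (1, 0)   # address of VBD (saw)
SUBJECT = (0,)       # address of NP (I)
OBJECT = (1, 1)      # address of NP (the man)

print(path_feature(TREE, SUBJECT, PREDICATE), position_feature(SUBJECT, PREDICATE))
print(path_feature(TREE, OBJECT, PREDICATE), position_feature(OBJECT, PREDICATE))
```

In this toy encoding the subject yields the path NP^SvVPvVBD and the object yields NP^VPvVBD; a real system would extract all features in the table for every candidate constituent.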

Model generation: train a maximum entropy classifier on the training file to obtain the semantic role labeling model file. Once trained on a sufficiently large annotated corpus, this model can effectively identify the semantic roles of a given predicate in a sentence.
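
A minimal sketch of this training step, assuming each training instance is a set of feature strings paired with a role label. A maximum entropy classifier is equivalent to multinomial logistic regression; plain stochastic gradient ascent is used here for brevity and is not claimed to be the optimizer of the actual implementation. The feature strings, role labels, and hyperparameters are illustrative assumptions.

```python
import math
from collections import defaultdict

def predict_proba(weights, labels, features):
    """Softmax over the summed weights of the active features for each label."""
    scores = {label: sum(weights.get((f, label), 0.0) for f in features)
              for label in labels}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def train_maxent(examples, epochs=200, rate=0.5):
    """examples: list of (set of feature strings, label). Returns (weights, labels)."""
    labels = sorted({label for _, label in examples})
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, gold in examples:
            probs = predict_proba(weights, labels, features)
            for label in labels:
                # Log-likelihood gradient: observed count minus expected count.
                delta = (1.0 if label == gold else 0.0) - probs[label]
                for f in features:
                    weights[(f, label)] += rate * delta
    return weights, labels

# Hypothetical training instances (one per candidate constituent).
TRAIN = [
    ({"path=NP^SvVPvVBD", "position=before", "type=NP"}, "A0"),
    ({"path=NP^VPvVBD", "position=after", "type=NP"}, "A1"),
    ({"path=PP^VPvVBD", "position=after", "type=PP"}, "AM-LOC"),
    ({"path=DT^NP^VPvVBD", "position=after", "type=DT"}, "NONE"),
]

weights, labels = train_maxent(TRAIN)
probs = predict_proba(weights, labels, TRAIN[0][0])
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

The classifier returns a probability for every role, which is what allows the role probability to be combined with the parse probability later in the method.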

(2) Generate the joint learning model:

Among existing syntactic parsing models, those based on lexicalized PCFGs achieve good performance but run too slowly, with a time complexity of O(n⁵). By comparison, a history-based syntactic parsing model makes its decisions from left to right and needs only a single pass, so it is more efficient. Its drawback is that each decision can use only the chunk information to the left of the current position, and a parent node is usually generated before its child nodes; when not all child nodes have been generated, a prediction based on the available context is unreliable, which inevitably degrades performance. For example, to obtain the correct parse from the basic phrase recognition result shown in Fig. 1, the first four decisions must be {Start S, No, Start VP, No}, that is, parent node NP is generated for chunk NP (I) and parent node VP is generated for chunk VBD (saw). At that moment the other child nodes of the newly generated parent nodes NP and VP are entirely unknown: each parent node is produced when only its first child has been generated, which is often hard to get right in practice. Conversely, predicting a parent node once all of its child nodes have been established is comparatively easier and more reliable.

For this reason, the present invention uses a hierarchical, history-based syntactic parsing model that combines the advantages of the two kinds of model above and achieves good performance with low time complexity. Its basic idea is as follows: in each layer of processing, the chunks that are easiest to identify are identified first, which provides richer context for identifying the more complex chunks; the chunks that have not been merged, together with the newly identified chunks, form the input to the next layer, and the process repeats until the root node is identified. The procedure is thus a recursive, layer-by-layer process that terminates when, at some layer, all chunks are merged into a single new chunk, namely the root node of the syntax tree.
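
The layer-by-layer control flow can be sketched as follows. This is only an illustration of the loop structure: the fixed rule table stands in for the trained decision model, and the chunk labels, the rule list, and the function names are all hypothetical.

```python
# Hypothetical merge rules standing in for the trained decision model:
# a sequence of adjacent labels -> parent label.
RULES = [
    (("P", "NP"), "PP"),
    (("VV", "NP"), "VP"),
    (("PP", "VP"), "VP"),
    (("NP", "VP"), "IP"),
]

def parse_layer(chunks):
    """One layer: scan left to right and merge every span that matches a rule."""
    out, i, merged_any = [], 0, False
    while i < len(chunks):
        for pattern, parent in RULES:
            n = len(pattern)
            if tuple(c[0] for c in chunks[i:i + n]) == pattern:
                out.append((parent,) + tuple(chunks[i:i + n]))
                i += n
                merged_any = True
                break
        else:
            out.append(chunks[i])  # unmerged chunk is passed on to the next layer
            i += 1
    if not merged_any:
        raise ValueError("no rule applies to the remaining chunks")
    return out

def parse(chunks):
    """Repeat layers until a single chunk, the root, remains."""
    layers = 0
    while len(chunks) > 1:
        chunks = parse_layer(chunks)
        layers += 1
    return chunks[0], layers

CHUNKS = [("NP", "Sino-U.S."), ("P", "in"), ("NP", "Shanghai"),
          ("VV", "sign"), ("NP", "agreement")]
tree, layers = parse(list(CHUNKS))
print(layers, tree)
```

On this toy input the first layer forms PP and the inner VP, the second layer merges them into the outer VP, and the third layer produces the root.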

A complete syntactic parsing process can be divided into part-of-speech tagging, basic chunk identification, and syntactic parsing. The role of the syntactic parsing model is to predict, on the basis of basic chunk identification, the next decision for each chunk so as to obtain the correct parse. Feature templates commonly used to build the model include the head word and part of speech of the n-th tree, the current constituent label and decision label, and the contextual features of the current constituent (including unigram, bigram, and trigram information).

Incorporating semantic role labeling information into the hierarchical syntactic parsing model yields the joint learning model of the present invention, which learns syntactic parsing and semantic role labeling jointly. The proposed joint model is based on the following observation: the role constituents of a predicate w are usually its siblings or the siblings of its ancestor nodes; in fact, this principle is also widely used in pruning strategies for semantic role labeling. The joint model built on this principle consists of two parts, a syntax tree construction part and a role labeling part. Whenever the syntax tree construction part generates a new ancestor node of the target verb, the semantic role labeling model is called to determine and label the semantic role relations between the other child nodes of that ancestor node and the target verb; other semantics-related information is added at the same time so that it can influence the construction of the syntax tree. Because the syntax tree under construction keeps changing, with semantic information as a major factor in that change, and because the input of the semantic role labeling model is precisely the syntax tree as currently constructed, the output semantic roles are also continually adjusted: syntactic parsing and role labeling influence each other. The details are as follows:

Given a predicate w, whenever a new ancestor node of w (denoted node) is generated, the semantic role labeling model is called to determine the semantic role relations between the child nodes of node and the predicate w. Fig. 2 shows an example of joint learning of syntactic parsing and semantic role labeling for an English sentence. For the intermediate result in Fig. 2.a, the target predicate is VBD (closed). After its parent node VP is identified, as shown in Fig. 2.b, the semantic role labeling model is called to determine the semantic role relation between its sibling PP (at 2569.26) and the verb VBD (closed), as shown in Fig. 2.c. Then, after the VP node is merged into node S, the semantic role labeling model is called again to determine the semantic role relation between the VP's sibling NP (The Dow Jones industrials) and the predicate VBD (closed).
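
The selection of role candidates in this example can be sketched as follows, assuming trees are nested tuples and nodes are compared by identity. The helper names and the part-of-speech tags inside the NP and PP are assumptions for illustration; only VBD, VP, PP, NP, and S come from the example.

```python
def contains(node, target):
    """True if target is node itself or one of its descendants (identity comparison)."""
    if node is target:
        return True
    return any(contains(child, target)
               for child in node[1:] if isinstance(child, tuple))

def role_candidates(new_node, predicate):
    """If new_node is a newly built ancestor of the predicate, return its other children.

    These are the constituents handed to the semantic role labeling model.
    """
    if new_node is predicate or not contains(new_node, predicate):
        return []
    return [child for child in new_node[1:] if not contains(child, predicate)]

# The sentence of Fig. 2: "The Dow Jones industrials closed at 2569.26".
predicate = ("VBD", "closed")
pp = ("PP", ("IN", "at"), ("CD", "2569.26"))
np_subject = ("NP", ("DT", "The"), ("NNP", "Dow"), ("NNP", "Jones"), ("NNS", "industrials"))
vp = ("VP", predicate, pp)      # built first: PP becomes a candidate
s = ("S", np_subject, vp)       # built next: NP becomes a candidate

print([c[0] for c in role_candidates(vp, predicate)])
print([c[0] for c in role_candidates(s, predicate)])
```

Each constituent is examined exactly once, at the moment the ancestor that makes it a sibling of the predicate's projection is created.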

When the semantic role labeling model is called to determine the semantic role relation between the current node and the predicate, if the resulting role L1 is non-empty, i.e. the current node is some semantic role of the predicate, the probability of that role is incorporated into the system probability, as shown in formula (1):

prob(T) = prob(T) × prob(L1)    (1)

T* = argmax prob(T)    (2)

Here prob(T) is the syntactic parsing probability of the tree T currently being generated, and prob(L1) is the probability that the current node is labeled L1. The goal of the syntactic parsing model is to find the optimal syntax tree T*, which satisfies formula (2).

Besides semantic role information, other semantics-related information is also added to the syntactic parsing model to further strengthen the effect of semantic information.

Thus, the joint learning model is generated as follows:

Extract the training corpus: extract the syntactic parsing training corpus from a treebank, with semantic features included in the training events for syntactic parsing;

Generate the training file: add semantic features to the common features of the syntactic parsing model, and generate the training file;
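
A small numeric sketch of formulas (1) and (2). The candidate trees and all probability values are invented for illustration, and the function names are assumptions; a practical implementation would more likely work with log probabilities to avoid underflow.

```python
def update_tree_probability(tree_prob, role, role_prob):
    """Formula (1): multiply in prob(L1) only when a non-empty role L1 was assigned."""
    return tree_prob * role_prob if role is not None else tree_prob

def best_tree(scored_trees):
    """Formula (2): return the (tree, probability) pair with the highest probability."""
    return max(scored_trees, key=lambda item: item[1])

# Hypothetical partial parses with their syntactic probabilities and SRL outcomes.
candidates = [
    ("T_a", update_tree_probability(0.30, "A1", 0.9)),   # confident role
    ("T_b", update_tree_probability(0.35, "A1", 0.6)),   # less confident role
    ("T_c", update_tree_probability(0.25, None, 0.0)),   # no role: unchanged
]
winner = best_tree(candidates)
print(winner)
```

With these numbers T_b has the highest purely syntactic probability, but T_a wins after the role probabilities are multiplied in, which illustrates how semantic information can change the chosen parse.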

The common features of the syntactic parsing model are given in the following table:

Head word, constituent label, and decision label of the n-th tree
Part of speech of the head word, constituent label, and decision label of the n-th tree
Constituent label and decision label of the n-th tree
Contextual features of the n-th tree (unigram, bigram, trigram, 4-gram, etc.)

The semantic features are given in the following table:

Predicate	The current predicate verb itself
Predicate class	The verb class to which the predicate belongs
Path	The path from the current syntactic constituent to the predicate
Predicate role	The semantic role of the predicate identified by calling the semantic role labeling model

The semantic features are obtained using the semantic role labeling model of step (1);

Model generation: train a maximum entropy classifier on the training file to obtain the joint learning model file;

(3) Part-of-speech tagging: call the part-of-speech tagging module to tag the given sentence, and keep the N best part-of-speech tag sequences;

(4) Basic phrase recognition: call the basic phrase recognition module to perform basic phrase recognition on each of the N part-of-speech tagging results output by step (3), and finally keep the N best basic phrase recognition results;

(5) Syntactic parsing: call the joint learning model with the N basic phrase recognition results output by step (4) as input, and output the optimal syntactic parse and semantic role labeling result;

Here N is an integer from 10 to 20. If N is too large, too many useless intermediate results are kept during parsing, which increases system overhead; if N is too small, some correct intermediate results may be lost.
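
The N-best retention between stages can be sketched as follows. The function names, the representation of a hypothesis as a (probability, result) pair, and the toy stage function are assumptions for illustration; real tagging and chunking modules would take their place.

```python
import heapq

def keep_n_best(hypotheses, n):
    """hypotheses: iterable of (probability, result). Keep the n most probable, best first."""
    return heapq.nlargest(n, hypotheses, key=lambda h: h[0])

def expand(previous, step, n):
    """Run `step` on every kept hypothesis, then keep the n best combined results."""
    combined = [(p * q, out)
                for p, result in previous
                for q, out in step(result)]
    return keep_n_best(combined, n)

# Toy stand-ins: two tagging hypotheses, each expanded into two chunkings.
tagged = keep_n_best([(0.6, "tags1"), (0.4, "tags2")], 2)

def toy_chunker(tags):
    return [(0.7, tags + "/chunksA"), (0.3, tags + "/chunksB")]

chunked = expand(tagged, toy_chunker, 2)
print(chunked)
```

The parameter n plays the role of N: a larger value keeps more intermediate results alive at higher cost, a smaller value risks discarding the hypothesis that leads to the correct parse.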

By applying the above technical solution, the present invention has the following advantages over the prior art:

By using a joint model, the present invention can output the syntactic parse of a sentence and the semantic role labeling result for a given predicate at the same time. Because semantic information is added to the syntactic parsing model within the joint learning model, the trained model is better suited to the semantic role labeling task, so the semantic role labeling performance of its output is higher. At the same time, the syntactic parse output by the joint model does not differ greatly in performance from that of a standalone syntactic parsing model, and the added semantic information can even improve syntactic parsing performance.

Description of drawings

Fig. 1 shows the basic phrase recognition result for the sentence "I saw the man with the book".

Fig. 2 is a schematic example of joint learning of syntactic parsing and semantic role labeling, for the sentence "The Dow Jones industrials closed at 2569.26".

Fig. 3 shows the joint model's output for the Chinese sentence "Sino-U.S. signs an agreement in Shanghai" in the embodiment.

Embodiment

The present invention is further described below with reference to the drawings and an embodiment:

Embodiment: the semantic role labeling task is converted into a classification problem, and a maximum entropy classifier is trained to obtain the semantic role labeling model. The syntactic parsing task is divided into a part-of-speech tagging subtask, a basic phrase recognition subtask, and a hierarchical syntactic parsing subtask; the part-of-speech tagging and basic phrase recognition subtasks are handled by mature modules from existing syntactic parsing software. During syntactic parsing, the semantic role labeling model is called to obtain semantic role information; with the basic phrase recognition results and the semantic information as input, the optimal syntactic parse and semantic role labeling result are output.

Generation of the semantic role labeling model:

Generate the training file: extract features from an annotated corpus according to the features in Table 1, and generate the required training file;

Model generation: train a maximum entropy model on the training file to obtain the semantic role labeling model;

Table 1

Generation of the joint learning model:

Extract the training corpus: extract the hierarchical syntactic parsing training corpus from a treebank;

Generate the training file: add the semantic features (Table 2) to the common features of the syntactic parsing model (Table 3), and generate the feature file required for training;

Model generation: train a maximum entropy model on the training file to obtain the joint learning model;

Table 2

Table 3

Head word, constituent label, and decision label of the n-th tree
Part of speech of the head word, constituent label, and decision label of the n-th tree
Constituent label and decision label of the n-th tree
Contextual features of the n-th tree

For a sentence to be analyzed, the following are carried out in order:

Part-of-speech tagging: call the part-of-speech tagging module to tag the given sentence, and keep the N best part-of-speech tag sequences;

Basic phrase recognition: call the basic phrase recognition module to perform basic phrase recognition, and finally keep the N best basic phrase recognition results;

Syntactic parsing: call the joint learning model with the N basic phrase recognition results of the preceding step as input, and output the optimal syntactic parse and semantic role labeling result.

Fig. 3 shows the joint model's output for the Chinese sentence "Sino-U.S. signs an agreement in Shanghai". When a circled node is generated, the semantic role labeling model must be called to determine the semantic relation between the child nodes of that node and the predicate node (signing). The label on each edge in the figure gives the corresponding step of the syntactic parsing process, described as follows:

(1) Basic phrase NP (Sino-U.S.) is labeled as the beginning of an IP constituent, i.e. "S_IP";

(2) Whether this phrase is complete is checked; it is not, so the decision is "NO";

(3) Basic phrase P () is labeled as the beginning of a PP constituent ("S_PP");

(4) The phrase is not yet complete; the decision is "NO";

(5) Basic phrase NP (Shanghai) is labeled as the continuation of the PP constituent ("J_PP");

(6) The phrase is complete; the decision is "YES", forming syntactic constituent PP;

(7) Constituent PP is labeled as the beginning of a VP constituent ("S_VP");

(8) The phrase is not yet complete; the decision is "NO";

(9) Basic phrase W (signing) is labeled as the beginning of a VP constituent ("S_VP");

(10) The phrase is not yet complete; the decision is "NO";

(11) Basic phrase NP (agreement) is labeled as the continuation of the VP constituent ("J_VP");

(12) The phrase is complete; the decision is "YES", forming syntactic constituent VP. Because this VP is the parent node of the predicate verb W (signing), the semantic role labeling model is called to determine the relation between the node's other child NP (agreement) and the predicate node W (signing); NP (agreement) is found to be the A1 role of W, and the probability of the current parse is updated accordingly.

(13) Constituent VP is labeled as the continuation ("J_VP") of another VP constituent;

(14) The phrase is complete; the decision is "YES", forming syntactic constituent VP. Because this VP is an ancestor node of the predicate verb W (signing), the semantic role labeling model is called to determine the relation between the node's other child PP and the predicate node W; PP is found to be the AM-LOC role of W, and the probability of the current parse is updated accordingly.

(15) Constituent VP is labeled as the continuation of the IP constituent ("J_IP");

(16) The phrase is complete; the decision is "YES", forming syntactic constituent IP. Because IP is an ancestor node of the predicate verb W (signing), the semantic role labeling model is called to determine the relation between the node's other child NP (Sino-U.S.) and the predicate node W; NP is found to be the A0 role of W, and the probability of the current parse is updated accordingly.
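
The sixteen steps above can be replayed mechanically. The sketch below pairs each label decision (S_X starts a constituent X, J_X joins the innermost open X) with the following YES/NO completion check, rebuilds the tree, and records the completed nodes that dominate the predicate, i.e. the points at which the semantic role labeling model would be called. The stack-based interpretation of the decisions and all function names are assumptions made for this illustration.

```python
def dominates(node, leaf):
    """True if leaf is node itself or a descendant of node (identity comparison)."""
    return node is leaf or any(dominates(c, leaf)
                               for c in node[1:] if isinstance(c, tuple))

def replay(chunks, decisions, predicate):
    """Rebuild the tree from (label decision, completion check) pairs."""
    queue = list(chunks)
    open_nodes = []   # stack of constituents still under construction
    current = None    # constituent just completed, awaiting its own decision
    srl_calls = []    # labels of completed nodes that dominate the predicate
    for action, finished in decisions:
        item = current if current is not None else queue.pop(0)
        kind, label = action.split("_", 1)
        if kind == "S":
            open_nodes.append([label, item])
        else:  # "J": join the innermost open constituent
            assert open_nodes[-1][0] == label
            open_nodes[-1].append(item)
        if finished == "YES":
            current = tuple(open_nodes.pop())
            if dominates(current, predicate):
                srl_calls.append(current[0])
        else:
            current = None
    return current, srl_calls

predicate = ("W", "signing")
chunks = [("NP", "Sino-U.S."), ("P", ""),   # the word of P is not given in the text
          ("NP", "Shanghai"), predicate, ("NP", "agreement")]
decisions = [("S_IP", "NO"), ("S_PP", "NO"), ("J_PP", "YES"), ("S_VP", "NO"),
             ("S_VP", "NO"), ("J_VP", "YES"), ("J_VP", "YES"), ("J_IP", "YES")]

tree, srl_calls = replay(chunks, decisions, predicate)
print(tree)
print(srl_calls)
```

The recorded call points are the inner VP, the outer VP, and IP, matching steps (12), (14), and (16); the PP completed in step (6) does not dominate the predicate and triggers no call.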

Claims

1. A semantic role labeling method for natural language sentences, characterized in that a joint learning model is used to perform Chinese syntactic parsing and semantic role labeling simultaneously, comprising the following steps:

(1) Generate the semantic role labeling model:

Position	Path	Head word and its part of speech	Predicate	Subcategorization frame	Syntactic constituent type
First and last words of the syntactic constituent	Type of the constituent's left sibling	Predicate syntactic frame	Verb class	Word preceding the syntactic constituent	Type of the constituent's parent node
Compressed path	Whether the constituent has a right sibling	Head-word type of the constituent's left sibling	Modified path	Whether the path contains the root node

Model generation: train a maximum entropy classifier on the training file to obtain the semantic role labeling model file;

(2) Generate the joint learning model:

Head word, constituent label, and decision label of the n-th tree
Part of speech of the head word, constituent label, and decision label of the n-th tree
Constituent label and decision label of the n-th tree
Contextual features of the n-th tree

The semantic features are given in the following table:

Predicate	The current predicate verb itself
Predicate class	The verb class to which the predicate belongs
Path	The path from the current syntactic constituent to the predicate
Predicate role	The semantic role of the predicate identified by calling the semantic role labeling model

Wherein N is an integer from 10 to 20.