CN101520775B

CN101520775B - Chinese syntax parsing method with merged semantic information

Info

Publication number: CN101520775B
Application number: CN2009101318275A
Authority: CN
Inventors: 吴玺宏; 迟惠生; 罗定生; 林小俊; 樊杨
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2009-02-17
Filing date: 2009-04-08
Publication date: 2012-05-30
Anticipated expiration: 2029-04-08
Also published as: CN101520775A

Abstract

The invention discloses a Chinese syntactic parsing method that incorporates semantic information, belonging to the technical field of natural language processing. The method comprises the following steps: step 1), extracting semantic classes of words at different hierarchical levels according to the hyponymy relations in HowNet, to obtain an index from words to semantic classes; step 2), using a word in a syntax tree as the key to query HowNet, obtaining the semantic class of the word, and adding the semantic class to a given layer of the syntax tree; step 3), using the syntax trees processed in step 2) as training data to train a grammar, obtaining a grammar model; step 4), using the grammar model trained in step 3) to decode a sentence to be analyzed. Compared with the prior art, the invention uses semantic information to help disambiguate syntactic parsing, so that parsing performance is markedly improved.

Description

A Chinese syntactic parsing and decoding method incorporating semantic information

Technical field

The invention belongs to the technical field of natural language processing, and specifically relates to a Chinese syntactic parsing and decoding method incorporating semantic information, in which semantic knowledge is introduced into syntactic parsing to help improve parsing performance.

Background art

Syntactic parsing is a very important technology in natural language processing. It analyzes how words combine to form meaningful phrases and sentences, and reveals the deeper regularities of language. The result of syntactic parsing directly affects the understanding of natural language. In practical natural language processing applications, a high-performance parser helps improve the performance of high-level application systems such as information extraction, information retrieval, machine translation, and automatic question answering.

Syntactic parsing is the process of deriving the syntactic structure of a sentence with some algorithm, given a grammar model; the structure is usually represented as a tree. For example, for the sentence "More than half of Dalian's foreign export volume comes from 'three-capital' enterprises.", the parsing result can be represented by the tree in Fig. 1(a). In this tree, the leaf nodes at the bottom are words and are called terminal symbols; the non-leaf nodes above them are all called nonterminal symbols, and the lowest layer of non-leaf nodes, which represents parts of speech, is called the preterminal layer. Because ambiguity is pervasive in natural language, the same sentence can be assigned several different syntactic structures. Effective information and algorithms are therefore needed to resolve the ambiguity and find the most reasonable structure; this is the problem that current syntactic parsing methods all set out to solve.

Statistical methods can learn lexical and structural preference information from a corpus, and can thereby handle syntactic ambiguity to some extent. The appearance of manually annotated treebank resources (such as the Penn Treebank built at the University of Pennsylvania, USA) created the conditions for statistical parsing methods and greatly promoted the development of this kind of technology. The most widely studied model in statistical parsing is the probabilistic context-free grammar (PCFG), which describes sentence structure through a set of context-free grammar rules and assigns a probability to each rule. The advantages of this approach are its simple form and the fact that it can be processed in polynomial time.
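As a rough illustration of how a PCFG assigns a probability to each rule, the following sketch estimates rule probabilities from a treebank by relative frequency. The tuple-based tree encoding and the three toy trees are assumptions made for this example only; they are not taken from the patent.

```python
from collections import Counter

# Assumed encoding: a tree is (label, child, child, ...); a leaf child is a plain string (the word).
def rules(tree):
    """Yield (lhs, rhs) for every internal node of the tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for c in children:
        if not isinstance(c, str):
            yield from rules(c)

def estimate_pcfg(treebank):
    """Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(r for t in treebank for r in rules(t))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {(lhs, rhs): n / lhs_counts[lhs] for (lhs, rhs), n in rule_counts.items()}

# Toy treebank (hypothetical data).
toy_treebank = [
    ("S", ("NP", ("NN", "dogs")), ("VP", ("VV", "bark"))),
    ("S", ("NP", ("NN", "cats")), ("VP", ("VV", "sleep"))),
    ("S", ("NP", ("NR", "Dalian")), ("VP", ("VV", "exports"))),
]
pcfg = estimate_pcfg(toy_treebank)
```

In this toy treebank NP expands to NN in two of three cases, so the rule NP -> NN receives probability 2/3 and NP -> NR receives 1/3.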

One problem of the PCFG model stems from its conditional independence assumption, under which the expansion of any nonterminal symbol (that is, each node above the word nodes in the syntax tree) is independent of the expansion of every other nonterminal. However, studies of the statistical distribution of nonterminals at different positions in treebanks show that the expansion of a node sometimes depends on its position in the tree, and a simple PCFG ignores this. To address the problem, the basic PCFG model has to be improved, usually in one of two ways: introducing lexical information, or extending the nonterminal labels; the latter is often called the unlexicalized approach. The most representative work on introducing lexical information is head-driven parsing; in his PhD thesis, Michael Collins attached information such as head words and distance to each nonterminal in the grammar rules to make the grammar more discriminative. Unlexicalized parsing methods mainly either refine some nonterminals manually, or refine the labels automatically through unsupervised learning so as to cover more linguistic phenomena; the representative work is that of Dan Klein et al. at UC Berkeley. Both approaches have their own shortcomings, however: in the lexicalized approach, introducing lexical information brings a certain data sparseness problem, and in the unlexicalized approach it is unclear whether the automatically refined labels characterize linguistic phenomena accurately.

Summary of the invention

The object of the present invention is to provide a Chinese syntactic parsing and decoding method incorporating semantic information, which uses semantic information to help improve parsing performance and, at the same time, yields syntactically constrained semantic information from the parsing result.

Theoretical research has shown that semantic information can help syntactic disambiguation. Semantics concerns the meaning, structure, and usage of words, and the related research falls into two parts: the semantics of individual words (word sense), and how the meanings of individual words combine to form the meaning of a sentence. The main task of semantic analysis is to produce a representation of the lexical semantic units of a text and the dependencies between them. Although syntactic analysis and semantic analysis are two different aspects of language analysis, they constrain each other. In Chinese, word order strongly constrains semantics, and complex semantic relations hold between syntactic constituents. In many cases, analyzing syntactic structure from grammatical form alone cannot explain the inherent regularities of a sentence. Introducing semantics into Chinese syntactic parsing can therefore help resolve structural ambiguity.

The prerequisite for using semantic information is a predefined semantic standard, and the most direct approach is to use an existing semantic resource. The semantic resource used in this method is HowNet. HowNet is a common-sense knowledge base that describes concepts represented by Chinese and English words and the features of those concepts, and whose basic content is the relations between concepts and between the attributes that concepts have. From it, the concepts or concept attributes of a word at different levels can be obtained and used as semantic classes. For example, the semantic classes of "automobile" can be obtained as "entity|entity => thing|all things => physical|matter => inanimate|inanimate object => artifact|artifact => implement|utensil => vehicle|vehicle => LandVehicle|car", where the sequence from left to right gives the semantic classes of "automobile" at different levels of HowNet, from coarse to fine. "entity|entity" is the coarsest semantic class and covers the widest range, while "LandVehicle|car" is the finest semantic class; its meaning is the most specific and the closest to "automobile".

By examining the relation between syntactic and semantic analysis, the present invention integrates semantic information into the unlexicalized parsing process, remedies the PCFG model's lack of semantic information, and further refines the part-of-speech layer with semantic labels. Introducing semantic information helps the parser resolve ambiguity, and parsing performance improves as a result.

The basic idea of the invention is therefore that syntax and semantics are two different aspects of language analysis that act together and influence each other during analysis, and that semantic information is very helpful for resolving structural ambiguity. By incorporating semantic information into the unlexicalized parsing method, the performance of the parser is clearly improved, and the resulting analysis contains both the syntactic modification relations and the semantic class of each word.

The starting point of the invention is to obtain a high-performance parser, using semantic analysis as an auxiliary means of improving parsing performance. The base parsing model is an unlexicalized PCFG model, which refines labels automatically through unsupervised learning to improve the descriptive power of the grammar, and whose performance has surpassed that of lexicalized parsers. On this basis, the method uses HowNet as a semantic dictionary, provides a semantic class at a given level for some of the words in the treebank, attaches the semantic class to the preterminal layer of the syntax tree (the layer directly above the words), and trains on the annotated treebank to obtain a grammar model that contains semantic information. No special processing is needed in decoding to obtain a parsing result with semantic labels. Experiments show that the method effectively improves parsing performance.

The technical solution of the invention is described in detail below in three parts.

1. How semantic information is incorporated into syntactic parsing

HowNet is used as the semantic dictionary, and the sememes defined in it (a sememe being defined as the smallest unit of meaning) are used as semantic classes. Sememes stand in hyponymy relations in HowNet, as shown in Fig. 2. Semantic classes at different levels are extracted according to these relations; each word in a syntax tree is used as the key to look up its semantic class, and the semantic class is attached to the word's preterminal. To keep the semantic system consistent and to alleviate data sparseness, the semantic classes obtained for all words must lie in the same layer of HowNet.

Words with several semantic classes raise a word sense disambiguation problem. One strategy is to take the first semantic class; alternatively, we designed a sense-annotation system for polysemous words, in which the semantic classes of polysemous words are annotated manually. For words that do not exist in HowNet, no semantic information is added.

Fig. 1 shows an example of semantic annotation. Fig. 1(a) is a sentence in the treebank before annotation; Fig. 1(b) is the sentence after semantic annotation. As can be seen, the strategy for introducing semantics is simply to attach the semantic class of a word to its corresponding preterminal.

For nonterminals above the part-of-speech layer, semantic classes cannot be obtained directly from HowNet. The simplest way to add them would be a method similar to head-word extraction, treating the semantic information of a preterminal like a head word and propagating it to the upper nodes. However, words have many semantic classes, so attaching them to upper nodes could produce many more nonterminals and cause very serious data sparseness when the amount of data is insufficient. Therefore, the upper-layer nonterminals are still refined automatically by unsupervised splitting and merging, and no semantics is introduced for them.

After this processing, the preterminals above most words in the treebank are annotated with semantic classes from a given layer of HowNet. Training the parsing model on this treebank yields a grammar model that incorporates semantic information. Decoding with this grammar produces parsing results that carry semantic labels, and the parsing results are also more accurate.

2. Training the parsing model

The base parsing model adopted by the invention is an unlexicalized parsing model, which refines nonterminal node labels in an unsupervised manner to improve the descriptive power of the grammar. The model is briefly introduced below.

In recent years, unlexicalized PCFG parsing has made considerable progress, and the best models have reached the state of the art in syntactic parsing. Within the basic PCFG framework, the model refines nonterminal labels automatically through unsupervised learning, strengthening the descriptive power of the grammar. Training mainly consists of two processes, splitting and merging. Splitting divides each nonterminal into two, refining the labels; this increases the complexity of the grammar and widens its coverage of the linguistic phenomena in the treebank. Merging ensures that only the necessary splits from the splitting step are kept. Necessity is measured by the effect of a label's split on the likelihood of the whole treebank: if merging the two sub-labels of a split causes no significant drop in treebank likelihood, the split is unnecessary and the sub-labels are merged.
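The split and merge steps can be sketched as follows. This is a simplified illustration, not the patent's implementation: it splits every symbol (including the root, which practical systems usually leave unsplit), perturbs probabilities with a small random jitter to break symmetry, and omits the EM re-estimation that would follow each split. The function names, the `_0`/`_1` suffix convention, the noise level, the threshold, and the likelihood-loss values passed to `select_merges` are all hypothetical.

```python
import itertools
import random

def split_grammar(pcfg, noise=0.01, seed=0):
    """Split every nonterminal X into X_0 and X_1.

    pcfg maps (lhs, rhs) -> probability, with rhs a tuple of symbols.
    Any symbol that never appears as a left-hand side is treated as a word.
    """
    rng = random.Random(seed)
    nonterminals = {lhs for lhs, _ in pcfg}
    refined = {}
    for (lhs, rhs), p in pcfg.items():
        # Each right-hand-side nonterminal may become either sub-symbol; words stay as they are.
        options = [(f"{s}_0", f"{s}_1") if s in nonterminals else (s,) for s in rhs]
        variants = list(itertools.product(*options))
        for sub_lhs in (f"{lhs}_0", f"{lhs}_1"):
            for variant in variants:
                jitter = 1.0 + rng.uniform(-noise, noise)  # break the symmetry between sub-symbols
                refined[(sub_lhs, variant)] = p / len(variants) * jitter
    # Renormalize so that the rules of each sub-symbol sum to 1 again.
    totals = {}
    for (lhs, _), p in refined.items():
        totals[lhs] = totals.get(lhs, 0.0) + p
    return {rule: p / totals[rule[0]] for rule, p in refined.items()}

def select_merges(likelihood_loss, threshold):
    """Return the symbols whose split should be undone because merging costs little likelihood."""
    return sorted(s for s, loss in likelihood_loss.items() if loss < threshold)

toy_pcfg = {("S", ("NP", "VP")): 1.0, ("NP", ("dogs",)): 1.0, ("VP", ("bark",)): 1.0}
split = split_grammar(toy_pcfg)
# The per-symbol likelihood losses would come from the treebank; these values are made up.
to_merge = select_merges({"NP": 0.001, "VP": 2.5}, threshold=0.1)
```

The binary rule S -> NP VP yields four right-hand-side variants for each of S_0 and S_1, while each lexical rule yields one rule per sub-symbol, so the toy grammar grows from 3 rules to 12.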

Adopting this unlexicalized parsing method based on automatic splitting first of all guarantees a high-performance baseline system, and the model also makes it easy to incorporate semantic information. In addition, adding semantic information from an external semantic dictionary helps constrain the automatic splitting of syntactic labels; conversely, the subsequent automatic splitting ensures that the added semantic classes do not interfere with the division of syntactic functions.

3. Parsing (decoding) process

For a new sentence to be analyzed, its syntactic structure can be derived from the grammar model obtained in training. The basic method is to use the grammar rules of the model to derive the most probable syntax tree bottom-up by chart parsing, but the search space of this simple approach is very large. To improve efficiency, a coarse-to-fine strategy is adopted: a simple grammar model is first used for decoding to obtain a set of candidate results, and a finer grammar model then decodes within these candidates. Many impossible results are thereby discarded before the later fine-grained decoding, which reduces the search space and improves efficiency.
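A minimal sketch of bottom-up chart decoding with a coarse-to-fine filter is given below. It assumes a grammar in Chomsky normal form and uses a deliberately crude pruning rule: a refined symbol is allowed over a span only if its coarse projection was derivable over that span in the coarse pass. Practical coarse-to-fine parsers typically prune on posterior probabilities instead. The toy grammars, probabilities, and the `_0`/`_1` naming are hypothetical.

```python
import math

def cky(words, lexicon, binary, allowed=None):
    """Viterbi CKY for a PCFG in Chomsky normal form.

    lexicon: {(tag, word): prob}; binary: {(parent, left, right): prob}.
    allowed: optional predicate (i, j, symbol) -> bool used to prune chart items.
    Returns chart[(i, j)] = {symbol: (log-probability, backpointer)}.
    """
    n = len(words)
    chart = {}
    for i, w in enumerate(words):
        cell = {}
        for (tag, word), p in lexicon.items():
            if word == w and (allowed is None or allowed(i, i + 1, tag)):
                cell[tag] = (math.log(p), w)
        chart[i, i + 1] = cell
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = {}
            for k in range(i + 1, j):
                for (parent, left, right), p in binary.items():
                    if left not in chart[i, k] or right not in chart[k, j]:
                        continue
                    if allowed is not None and not allowed(i, j, parent):
                        continue  # pruned by the coarse pass
                    score = math.log(p) + chart[i, k][left][0] + chart[k, j][right][0]
                    if parent not in cell or score > cell[parent][0]:
                        cell[parent] = (score, (k, left, right))
            chart[i, j] = cell
    return chart

def build_tree(chart, i, j, symbol):
    """Follow backpointers to recover the best tree for `symbol` over words[i:j]."""
    _, back = chart[i, j][symbol]
    if isinstance(back, str):
        return (symbol, back)
    k, left, right = back
    return (symbol, build_tree(chart, i, k, left), build_tree(chart, k, j, right))

words = ["dogs", "chase", "cats"]
# Coarse pass with unrefined symbols.
coarse = cky(
    words,
    {("NN", "dogs"): 0.5, ("NN", "cats"): 0.5, ("VV", "chase"): 1.0},
    {("S", "NN", "VP"): 1.0, ("VP", "VV", "NN"): 1.0},
)

def allowed(i, j, symbol):
    # Project a refined symbol such as NN_1 back to NN and check the coarse chart.
    return symbol.rsplit("_", 1)[0] in coarse[i, j]

# Fine pass with refined symbols, restricted to items the coarse pass found possible.
fine = cky(
    words,
    {("NN_0", "dogs"): 0.9, ("NN_0", "cats"): 0.1,
     ("NN_1", "dogs"): 0.2, ("NN_1", "cats"): 0.8, ("VV_0", "chase"): 1.0},
    {("S_0", "NN_0", "VP_0"): 0.7, ("S_0", "NN_1", "VP_0"): 0.3,
     ("VP_0", "VV_0", "NN_0"): 0.4, ("VP_0", "VV_0", "NN_1"): 0.6},
    allowed,
)
best = build_tree(fine, 0, len(words), "S_0")
```

Because the fine pass only builds items that survive the filter, spans that the coarse grammar cannot cover (here the span over "dogs chase") are never expanded with refined symbols.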

Beneficial effects of the invention:

Compared with the prior art, the invention uses semantic information to help syntactic disambiguation, which effectively improves parsing performance: both the efficiency and the accuracy of parsing are significantly improved. The parser that incorporates semantic information also yields the semantic information of some of the words.

Description of drawings

Fig. 1: a syntax tree and the syntax tree after adding semantic information;

(a) a sentence in the treebank before annotation; (b) the sentence after semantic annotation;

Fig. 2: an example fragment of the sememe tree in the semantic resource HowNet;

Fig. 3: flowchart of the method of the invention.

Embodiment

Embodiments of the invention are described in detail below with reference to the drawings; the flowchart of the method is shown in Fig. 3.

1. Build the word-to-semantic-class index

According to the hyponymy relations between the sememes defined in HowNet, semantic classes at different layers, from coarse to fine, are extracted and associated with each word, thereby constructing an index from words to semantic classes. The words here carry part-of-speech information.
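The index construction can be sketched as below, assuming the hyponymy relations are available as a child-to-parent map and each (word, part of speech) pair lists its sememes. The data structures, function names, and the choice to skip senses whose chain does not reach the requested layer are assumptions of this sketch; the sememe names follow the "automobile" example given earlier in the description, and no real HowNet API is used.

```python
def hypernym_chain(sememe, parent):
    """Walk parent links up to the root and return the chain from coarsest to finest."""
    chain = [sememe]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain[::-1]

def build_index(word_sememes, parent, level):
    """Map (word, pos) -> semantic classes taken from one fixed layer (0 = coarsest).

    Senses whose chain does not reach `level` are skipped, so every class in the
    index comes from the same layer.
    """
    index = {}
    for key, sememes in word_sememes.items():
        classes = []
        for s in sememes:
            chain = hypernym_chain(s, parent)
            if level < len(chain) and chain[level] not in classes:
                classes.append(chain[level])
        if classes:
            index[key] = classes
    return index

# Child -> parent links for the chain in the "automobile" example.
parent = {
    "LandVehicle": "vehicle", "vehicle": "implement", "implement": "artifact",
    "artifact": "inanimate", "inanimate": "physical", "physical": "thing",
    "thing": "entity",
}
word_sememes = {("automobile", "NN"): ["LandVehicle"]}

coarse_index = build_index(word_sememes, parent, level=0)
fine_index = build_index(word_sememes, parent, level=7)
```

Choosing `level` selects how coarse the attached classes are: layer 0 gives "entity" for "automobile", while layer 7 gives "LandVehicle".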

2. Add semantic class information to the original treebank

For the original treebank, the word and its part of speech are used as the key to obtain semantic class information, which is then attached at the part-of-speech (preterminal) layer, refining the part-of-speech labels. Some part-of-speech labels thus come to contain semantic information.

Some words may have several different semantic classes. Two strategies are used for this case: choose the first of the semantic classes, or select one according to context by manual annotation.
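The annotation step can be sketched as follows. The tuple tree encoding, the `POS-class` label format, the `choose` callback, and the polysemous entry for "spring" with its two classes are hypothetical choices made for illustration; the patent does not specify a label format.

```python
def annotate(tree, index, choose=lambda classes, word, pos: classes[0]):
    """Return a copy of the tree with semantic classes appended to preterminal labels.

    tree: (label, child, ...); a preterminal is (pos, word).
    index: {(word, pos): [semantic classes]}.
    choose: sense-selection strategy; the default takes the first listed class.
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        word = children[0]
        classes = index.get((word, label))
        if classes:  # words missing from the index keep their plain part-of-speech label
            label = f"{label}-{choose(classes, word, label)}"
        return (label, word)
    return (label, *(annotate(c, index, choose) for c in children))

index = {
    ("automobile", "NN"): ["LandVehicle"],
    ("spring", "NN"): ["time", "waters"],  # hypothetical polysemous entry
}
tree = ("NP", ("NR", "Dalian"), ("NN", "automobile"), ("NN", "spring"))

first_sense = annotate(tree, index)

# Manual annotation modelled as a lookup table that overrides the first-sense default.
manual = {("spring", "NN"): "waters"}
by_hand = annotate(tree, index, lambda classes, word, pos: manual.get((word, pos), classes[0]))
```

Only the preterminal labels change; phrase-level labels such as NP are left for the automatic split-merge refinement.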

3. Train the grammar model

The treebank with added semantic class information is used as training data. The unlexicalized parsing model introduced above is used for grammar training, and nonterminals are refined by automatic splitting and merging during training. To investigate whether the preterminals that carry semantic information should also undergo this refinement, we carried out experiments. The results show that, after coarse-grained semantic classes are added, still refining them automatically works better than not refining them, and that this approach also works better than directly adding more discriminative fine-grained semantic classes without automatic refinement; this is described in more detail in the effect analysis below.

4. Parse the sentence to be analyzed

With the grammar model trained above, a sentence to be analyzed (already word-segmented) can be decoded by the unlexicalized parser introduced above according to the grammar model, yielding the parsing result together with the semantic annotation of the sentence.

Effect analysis:

To verify the effectiveness of the invention, we designed a series of experiments, some of which are described below.

Experimental corpus:

The training and test corpus is the Penn Chinese Treebank (UPenn Chinese Tree Bank) 2.0, which contains 325 news articles in total and is divided in the standard way: articles 1-25 as the development set, 350 sentences in total; articles 26-270 as the training set, 3172 sentences in total; articles 271-300 as the test set, 348 sentences in total.

The semantic dictionary is HowNet.

Baseline system:

The baseline system uses the unlexicalized parsing model introduced above, which splits and refines nonterminal labels automatically in an unsupervised manner. In each iteration, every existing label is split into 2, the parameters of the new labels are estimated with the EM algorithm, and the split labels are then merged according to their contribution to the likelihood.

Evaluation:

Evaluation uses EVALB, a currently widely used parsing evaluation tool. The tool takes labeled bracket matching as its criterion and reports precision, recall, and F-score.
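The bracket-matching metrics can be illustrated with the sketch below, which computes labeled precision (LP), labeled recall (LR), and F1 over phrase-level brackets. It is a simplified stand-in written for this explanation, not EVALB itself, and it leaves out EVALB's other conventions; the tree encoding and example trees are hypothetical.

```python
from collections import Counter

def brackets(tree, start=0):
    """Return (labeled spans of phrase-level nodes, end position) for a tuple tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [], start + 1  # preterminals are not scored as brackets
    spans, pos = [], start
    for c in children:
        sub, pos = brackets(c, pos)
        spans.extend(sub)
    spans.append((label, start, pos))
    return spans, pos

def prf(gold_trees, test_trees):
    """Labeled precision, recall and F1 over all sentences."""
    matched = gold_total = test_total = 0
    for g, t in zip(gold_trees, test_trees):
        gb, tb = Counter(brackets(g)[0]), Counter(brackets(t)[0])
        matched += sum((gb & tb).values())  # multiset intersection counts matching brackets
        gold_total += sum(gb.values())
        test_total += sum(tb.values())
    lp = matched / test_total
    lr = matched / gold_total
    f1 = 2 * lp * lr / (lp + lr) if lp + lr else 0.0
    return lp, lr, f1

gold = ("S", ("NP", ("NN", "dogs")), ("VP", ("VV", "chase"), ("NP", ("NN", "cats"))))
test = ("S", ("NP", ("NN", "dogs")), ("VP", ("VV", "chase"), ("NN", "cats")))
lp, lr, f1 = prf([gold], [test])
```

Here the test tree misses one of the four gold brackets (the NP over "cats") and proposes no wrong ones, so precision is 1.0, recall is 0.75, and F1 is 6/7.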

Experimental result and analysis:

The results of the baseline system on the standard CTB data set are shown in Table 1:

Table 1: Baseline system performance

Here S&M denotes the number of split-merge cycles: S&M-1 denotes one split-merge iteration, and S&M-2 denotes two split-merge iterations, that is, one more split-merge iteration on top of the grammar obtained from the first. Len denotes sentence length, i.e. the number of words in the sentence: Len <= 40 means testing only on sentences of at most 40 words, and All means testing on all sentences. LR denotes recall, LP denotes precision, and F1 denotes the F-score.

To mitigate data sparseness to some extent, we choose the top-layer semantic classes in HowNet and automatically refine all labels; the experimental results on the same data set are shown in Table 2.

Table 2: Parsing performance with coarse-grained semantic class labels added

The table above shows that, from the 4th split-merge iteration onward, parsing performance with added semantic classes exceeds the baseline system. In the 6th iteration the splits become too fine and overtraining occurs, so the F-score drops somewhat; the trend is the same in the baseline and the improved system, but the result with semantic classes is still better than the baseline. Comparing the results of the 5th iteration, the F-score rises from 80.26% to 81.63%, an absolute improvement of 1.37 points, which is quite significant in syntactic parsing research.

In addition, when trained on the newly released version 5.0 of the Penn Chinese Treebank (18782 sentences in total), the parser of the invention reaches an F-score of 86.39%. The contrast before and after adding semantic information follows a trend similar to the results on the Penn Chinese Treebank 2.0 listed above, and is not repeated here.

The invention takes an unlexicalized parser as its basis and incorporates semantic information into it. Using semantic information to help the parser disambiguate clearly improves parser performance, and the parser that incorporates semantic information also yields the semantic information of some of the words.

Claims

1. A Chinese syntactic parsing and decoding method incorporating semantic information, comprising the steps of:

1) extracting semantic classes of words at different levels according to the hyponymy relations in HowNet, to obtain an index from words to semantic classes;

2) using a word in a syntax tree as the key to query HowNet and obtain the semantic class of the word, and adding the semantic class to the preterminal layer of the syntax tree;

3) using the syntax trees processed in step 2) as training data, and performing grammar training with an unlexicalized parsing model, wherein non-preterminal symbols are refined by automatic splitting and merging, to obtain a grammar model;

4) decoding the sentence to be analyzed with the grammar model trained in step 3).

2. The method of claim 1, characterized in that the word includes part-of-speech information.

3. The method of claim 2, characterized in that the word and its part of speech are used as the key to query HowNet and obtain the semantic class of the word.

4. The method of claim 1 or 3, characterized in that semantic classes are queried in a single layer of HowNet, so that the semantic classes obtained for all words lie in the same layer of HowNet.

5. The method of claim 1, characterized in that if a word has several different semantic classes, the first of them is chosen as the semantic class of the word, or one is selected according to context by manual annotation.