CN101866337A - Part-of-speech tagging system, and device and method for training a part-of-speech tagging model

Info

Publication number: CN101866337A
Application number: CN200910132711A
Authority: CN
Inventors: 胡长建; 赵凯; 邱立坤; 沈国阳
Original assignee: NEC China Co Ltd
Current assignee: NEC China Co Ltd; Renesas Electronics China Co Ltd
Priority date: 2009-04-14
Filing date: 2009-04-14
Publication date: 2010-10-20
Anticipated expiration: 2029-04-14
Also published as: JP2010250814A; JP5128629B2; CN101866337B

Abstract

The invention relates to a part-of-speech tagging system comprising a part-of-speech tagging model training device and a part-of-speech tagging device. The model training device trains a part-of-speech tagging model layer by layer and node by node, based on a part-of-speech hierarchy tree, using tagged first texts in a part-of-speech tagging training set. The part-of-speech tagging device uses the trained part-of-speech tagging model to tag the parts of speech of texts to be tagged. The invention also relates to a part-of-speech tagging method, and to a device and a method for training the part-of-speech tagging model. With the system and method of the invention, parts of speech can be tagged against a large-scale part-of-speech tag set, and tagging precision is improved.

Description

Part-of-speech tagging system, and device and method for training a part-of-speech tagging model

Technical field

The present invention relates to the field of natural language processing, and in particular to a part-of-speech tagging system and to a device and a method for training a part-of-speech tagging model.

Background technology

With the spread of the Internet and the growing informatization of society, the volume of natural language text that computers can process has grown at an unprecedented rate, and demand is rising rapidly for applications oriented to massive information, such as text mining, information extraction, cross-language information processing and human-computer interaction. Natural language processing is one of the core technologies for meeting this demand. Part-of-speech tagging assigns the correct part of speech to each word in a text, and it is a foundation of natural language processing. Because the result of part-of-speech tagging directly affects higher-level processing (for example, word frequency statistics, syntactic analysis, chunk parsing and semantic analysis), an efficient and accurate part-of-speech tagging method and system is extremely important.

Part-of-speech tagging is a sequence labeling problem in natural language processing, and conditional random fields (CRFs) are widely used for sequence labeling in natural language. A conditional random field is in essence an undirected graphical model for computing the conditional probability of specified output node values given the input node values. It can express long-distance dependencies between elements as well as overlapping features, so it is suited to information extraction tasks with strong global dependencies. It thereby avoids the strong independence assumptions of directed graphical models such as the maximum entropy (ME) model and the hidden Markov model (HMM), overcomes the label bias problem that they exhibit, and is currently the best statistical machine learning model for labeling sequence data. Obtaining a good part-of-speech tagging model requires richer features and training with a large-scale tag set. However, training CRFs is very time-consuming and demanding of computing resources, and the training time and resource requirements grow exponentially with the number of labels. CRFs models are therefore seldom used in large systems with large tag sets (such as part-of-speech tagging systems); they are usually used in settings with few features and small training corpora. Given the high accuracy required of part-of-speech tagging, how to apply CRFs models to part-of-speech tagging with a large-scale tag set and large-scale corpus features is a problem in urgent need of a solution.

Some solutions to the above problem exist. For example, Document 1 (Cohn T, Smith A, Osborne M. Scaling conditional random fields using error-correcting codes. In Proc. the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), Ann Arbor, Michigan: Association for Computational Linguistics, June 2005, pp. 10-17) provides a method for applying CRFs to large tag sets. The document introduces error-correcting output codes (ECOC) to solve the CRF training problem under a large tag set. ECOC is an ensemble method: redundant decision functions are defined first (the encoding process), and the final classification function is then constructed from those decision functions (the decoding process). The detailed process is as follows.

Training process (encoding process)

1) Suppose the tag set has m labels (for example, NN: noun, VB: verb, JJ: adjective, RB: adverb). An ECOC is selected manually; suppose its length is n. The purpose of this code is to map each label to an n-bit vector, for example:

Table 1

With this encoding, the method converts the original tagging problem (which can also be viewed as a multi-class classification problem) into n mutually independent binary classification problems. Each column of the code corresponds to one binary classifier. For example, the third classifier, marked by the black box, distinguishes words tagged "NN" or "JJ" from words tagged "VB" or "RB".

2) The training corpus of each binary classifier is built by modifying the original corpus: put simply, each tag in the training corpus is replaced by its value in the corresponding code. For example, to build the corpus for the third classifier above, every "NN" and "JJ" tag in the original corpus is relabeled "1", and every "VB" and "RB" tag is replaced with "0". After the modified corpus is obtained, the method trains the corresponding binary classifier with the conventional CRFs training method.
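The relabeling in step 2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Document 1: Table 1 is not reproduced in this text, so the code table `ECOC_CODES`, the helper `binary_corpus` and the toy sentence are all hypothetical, chosen only so that the third bit separates "NN" and "JJ" from "VB" and "RB".

```python
# Hypothetical 5-bit code table (Table 1 of the source is not available here).
# Bit index 2, i.e. the third classifier, maps NN and JJ to 1 and VB and RB to 0.
ECOC_CODES = {
    "NN": (1, 0, 1, 0, 1),
    "VB": (0, 1, 0, 0, 1),
    "JJ": (1, 1, 1, 1, 0),
    "RB": (0, 0, 0, 1, 0),
}


def binary_corpus(tagged_sentence, bit_index, codes=ECOC_CODES):
    """Replace each tag by the bit that classifier `bit_index` should predict."""
    return [(word, codes[tag][bit_index]) for word, tag in tagged_sentence]


# Toy tagged sentence, invented for illustration.
sentence = [("NEC", "NN"), ("develops", "VB"), ("new", "JJ"), ("quickly", "RB")]
third = binary_corpus(sentence, bit_index=2)
print(third)
```

Each of the n binary corpora is produced the same way by varying `bit_index`; a binary sequence model would then be trained on each one.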

Model use (decoding process)

1) Given any sentence, for example "NEC Develops world-leading technology to prevent IP phone spam".

2) The sentence is tagged by each of the binary classifiers trained above, and the tagging results are recorded. Suppose the tagging results are as follows:

As shown above, each word thus corresponds to an n-bit vector. A commonly used comparison strategy matches this vector against the code vectors in Table 1 above to find a matching label, and the word is tagged with that label. For the word "Develops", for example, the corresponding n-bit vector is closest to the code of "VB", so the system tags "Develops" as "VB" (verb).
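The matching step can be sketched as nearest-code decoding. The source says only that a commonly used comparison strategy is applied; Hamming distance is assumed here as one such strategy, and the code table `ECOC_CODES` and the function names are hypothetical because Table 1 is not reproduced in this text.

```python
# Hypothetical 5-bit code table; Table 1 of the source is not available here.
ECOC_CODES = {
    "NN": (1, 0, 1, 0, 1),
    "VB": (0, 1, 0, 0, 1),
    "JJ": (1, 1, 1, 1, 0),
    "RB": (0, 0, 0, 1, 0),
}


def hamming(a, b):
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(bits, codes=ECOC_CODES):
    """Return the label whose code vector is closest to the observed bits."""
    return min(codes, key=lambda tag: hamming(codes[tag], bits))


# Bits produced for one word by the n binary classifiers (one bit is wrong
# relative to the VB code, which the redundancy of the code absorbs).
predicted = decode((0, 1, 0, 0, 0))
print(predicted)
```

Every word requires one prediction from each of the n classifiers plus this lookup, which is the decoding cost the text goes on to criticize.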

The existing technology does not solve the problem of applying CRFs to part-of-speech tagging with a large-scale tag set very effectively, and the method is still some distance from practical use. Specifically:

1) The performance of the method of Document 1 depends heavily on the choice of the ECOC code, but choosing a good ECOC is rather difficult.

2) The scheme does not fundamentally resolve the long training time and the heavy dependence on high-end computing resources. The training process of Document 1 must train n binary classifiers, where n depends on the chosen ECOC. For the part-of-speech tagging problem this value is large, so the training time remains long and the dependence on high-end computing resources remains. Moreover, in the decoding process every binary classifier must be applied in turn, followed by a cumbersome code-matching step, so applying the trained model is also time-consuming and likewise depends on high-end computing resources.

Summary of the invention

The present invention introduces the layering and classification of parts of speech and combines them with a stacked CRFs model, to solve the problem that conventional CRFs are difficult to apply to part-of-speech tagging under a large-scale tag set. The invention automatically analyzes the internal relations between parts of speech from the training set and builds a part-of-speech hierarchy tree that organizes all parts of speech according to those relations. Based on this hierarchy tree, the invention introduces a stacked CRFs model so that the number of labels at each layer is reduced, specifies in detail how each model draws on the others, and can finally train, automatically, a stacked CRFs part-of-speech tagging model for a large-scale tag set. To address the sparsity that the training set may exhibit, the invention also trains a part-of-speech guessing model for out-of-vocabulary (unregistered) words based on word-formation rules, further improving the precision of part-of-speech tagging.

According to a first aspect of the invention, a part-of-speech tagging system is proposed, comprising: a part-of-speech tagging model training device for training a part-of-speech tagging model layer by layer and node by node, based on a part-of-speech hierarchy tree, using tagged first texts in a part-of-speech tagging training set; and a part-of-speech tagging device for tagging the parts of speech of a text to be tagged using the trained part-of-speech tagging model.

According to a second aspect of the invention, a part-of-speech tagging method is proposed, comprising: a part-of-speech tagging model training step of training a part-of-speech tagging model layer by layer and node by node, based on a part-of-speech hierarchy tree, using tagged first texts in a part-of-speech tagging training set; and a part-of-speech tagging step of tagging the parts of speech of a text to be tagged using the trained part-of-speech tagging model.

According to a third aspect of the invention, a device for training a part-of-speech tagging model is proposed, comprising: a CRF model training corpus construction unit for constructing a CRF model training corpus by relabeling, layer by layer and node by node using a part-of-speech hierarchy tree, the tagged first texts in a part-of-speech tagging training set into second texts; and a CRF model training unit for training CRF models, correspondingly layer by layer and node by node, using the second texts produced by each relabeling of the CRF model training corpus construction unit, to obtain the part-of-speech tagging model.

According to a fourth aspect of the invention, a method for training a part-of-speech tagging model is proposed, comprising: a CRF model training corpus construction step of constructing a CRF model training corpus by relabeling, layer by layer and node by node using a part-of-speech hierarchy tree, the tagged first texts in a part-of-speech tagging training set into second texts; and a CRF model training step of training CRF models, correspondingly layer by layer and node by node, using the second texts produced by each relabeling of the CRF model training corpus construction step, to obtain the part-of-speech tagging model.

The present invention fundamentally solves the problem of using CRFs for part-of-speech tagging with a large tag set. Specifically:

1) It enables CRFs models to be used for part-of-speech tagging with a large tag set, and it removes the long training time and the dependence on high-end computing resources: the system and method proposed by the invention can train the part-of-speech tagging model on an ordinary PC;

2) It improves the precision of part-of-speech tagging, for two reasons. First, tagging a part-of-speech sequence is a task with strong global dependencies, so introducing the CRFs model effectively achieves a global optimum and improves tagging precision. Second, the mechanism for guessing the part of speech of out-of-vocabulary words based on word-formation rules effectively addresses the sparsity of the training set and also improves the overall precision of part-of-speech tagging;

3) The method of the invention is fully automatic, which greatly reduces the labor cost of training and optimizing the part-of-speech tagging model.

Description of drawings

Fig. 1a is a schematic diagram of the part-of-speech tagging system according to the first embodiment of the invention;

Fig. 1b is a flowchart of the part-of-speech tagging method according to the first embodiment of the invention;

Fig. 2 is a schematic diagram of the part-of-speech hierarchy tree construction device according to the invention;

Fig. 3 is a flowchart of the part-of-speech hierarchy tree construction method according to the invention;

Fig. 4a is an example structure of the part-of-speech hierarchy tree;

Figs. 4b and 4c are examples of the data structure of the part-of-speech hierarchy tree;

Fig. 5a is a schematic structural diagram of the part-of-speech tagging model training device according to the invention;

Fig. 5b is a flowchart of the part-of-speech tagging model training method according to the invention;

Fig. 6a is a schematic diagram of the part-of-speech tagging device according to the invention;

Fig. 6b is a flowchart of the part-of-speech tagging method according to the invention;

Fig. 7a is a schematic diagram of the part-of-speech tagging system according to the second embodiment of the invention;

Fig. 7b is a flowchart of the part-of-speech tagging method according to the second embodiment of the invention;

Fig. 8a is a schematic diagram of the part-of-speech tagging system according to the third embodiment of the invention;

Fig. 8b is a flowchart of the part-of-speech tagging method according to the third embodiment of the invention.

Embodiment

Preferred embodiments of the invention are described below with reference to the drawings. In the drawings, identical elements are denoted by identical reference symbols or numerals. In the following description, detailed descriptions of known functions and configurations are omitted so as not to obscure the subject matter of the invention.

Fig. 1a is a schematic structural diagram of the part-of-speech tagging system according to the first embodiment of the invention. The part-of-speech tagging training set 10 in the part-of-speech tagging system 1 comprises a large number of tagged texts, that is, a collection of tagged texts. The part-of-speech hierarchy tree construction device 14 analyzes the relations between parts of speech based on the tagged texts in the training set 10, and builds the part-of-speech hierarchy tree 15 from the analyzed relations so as to organize hierarchically the parts of speech that occur in the training set. The relation may be, for example, the similarity between parts of speech. The part-of-speech tagging model training device 12 trains and generates the part-of-speech tagging model 13: it reads the tagged texts from the training set 10 and, according to the layer structure information in the hierarchy tree 15, sets up the model training process to train the CRFs part-of-speech tagging model 13 used for tagging. The model obtained by training is a stacked part-of-speech tagging model. The part-of-speech tagging device 22 tags the parts of speech of the words in untagged text according to the model obtained.

Although the part-of-speech tagging system shown in Fig. 1a includes the part-of-speech hierarchy tree construction device 14, it will be appreciated that the system may omit this device and instead use an already constructed hierarchy tree, for example a manually constructed one, to tag the text to be tagged. The system may also comprise only the part-of-speech tagging model training device 12, to generate the part-of-speech tagging model 13 used for tagging.

The part-of-speech hierarchy tree 15 organizes the parts of speech hierarchically in a tree structure. Fig. 4a shows an example structure of the hierarchy tree. In this example the tree has four layers in total, numbered 0, 1, 2 and 3, where the number of nodes in layers 2 and 3 is 6. The leaf nodes of the hierarchy tree correspond to real parts of speech; the remaining nodes are virtual class names that are assigned arbitrarily. Figs. 4b and 4c show examples of the data structure of the hierarchy tree of Fig. 4a.
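One possible in-memory form of such a tree is sketched below. Figs. 4a to 4c are not reproduced in this text, so the `Node` class, the helper names and the tree contents are hypothetical: the virtual names and the leaves "v", "a", "m", "n" and "ns" follow the worked examples later in the description, while the "label4" branch and its leaf "d" are invented placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def build(name: str, spec: Dict) -> Node:
    """Build a tree from a nested dict of {child name: child spec}."""
    return Node(name, [build(child, sub) for child, sub in spec.items()])


# Illustrative tree only; it is not a copy of Fig. 4a.
TREE = build("root", {
    "label1": {"label11": {"a": {}}, "label12": {"v": {}}},
    "label2": {"label21": {"m": {}}},
    "label3": {"label31": {"n": {}}, "label32": {"ns": {}}},
    "label4": {"label41": {"d": {}}},  # placeholder branch
})


def layers(root: Node) -> List[List[str]]:
    """Node names grouped by depth, starting with layer 0 (the root)."""
    out, frontier = [], [root]
    while frontier:
        out.append([n.name for n in frontier])
        frontier = [c for n in frontier for c in n.children]
    return out


def path_to(node: Node, tag: str) -> Optional[List[str]]:
    """Names from `node` down to the leaf `tag`, or None if absent."""
    if node.name == tag and not node.children:
        return [node.name]
    for child in node.children:
        sub = path_to(child, tag)
        if sub:
            return [node.name] + sub
    return None


print(layers(TREE))
print(path_to(TREE, "v"))
```

Storing the root-to-leaf path of every real part of speech is enough to derive the label a word should carry at any layer.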

Fig. 1b is a flowchart of the part-of-speech tagging method. At S101, the part-of-speech hierarchy tree construction device 14 builds the hierarchy tree 15 to organize hierarchically the parts of speech that occur in the part-of-speech tagging training set. At S102, the part-of-speech tagging model training device 12 reads the tagged texts from the training set 10 and generates the part-of-speech tagging model 13 according to the layer structure information in the hierarchy tree 15; this model 13 is a tagging model with a stacked structure. At S103, the part-of-speech tagging device 22 uses the generated model 13 to tag the parts of speech of the input text.

How the part-of-speech hierarchy tree 15 is generated is described first, with reference to Figs. 2 and 3.

Fig. 2 is a schematic structural diagram of the part-of-speech hierarchy tree construction device 14 according to the invention. The part-of-speech feature template selection unit 140 selects a part-of-speech feature template that characterizes the grammatical behavior of a part of speech. The grammatical behavior of a part of speech can be characterized in several ways; for example, the previous word, the part of speech of the previous word, the next word and the part of speech of the next word, relative to the current word in a tagged text, can be chosen as the feature template. The feature vector construction unit 141 builds, according to the selected feature template, the corresponding feature vectors for each part of speech that occurs in the training set 10. The similarity calculation unit 142 uses the constructed feature vectors to calculate the similarity of any two parts of speech in the training set 10. The clustering unit 143 clusters all parts of speech in the training set 10 according to the calculated similarities using a conventional hierarchical clustering algorithm, and generates the hierarchy tree 15 according to a predefined rule.

Fig. 3 is a flowchart of the method by which the hierarchy tree construction device generates the hierarchy tree. At S301, the part-of-speech feature template selection unit 140 selects the features of a part of speech, for example the previous word, the part of speech of the previous word, the next word and the part of speech of the next word, relative to the current word in a tagged text, as the feature template. For the tagged text "Hong Kong/ns selects/v ten/m great/a outstanding/a youths/n", if the current word is "selects" and its part of speech is "v", the part-of-speech feature representation is as follows:

At S302, the feature vector construction unit 141 builds, according to the feature template, the corresponding feature vectors for every part of speech that occurs in the training set 10. For example, if the training set contains dz words and lz parts of speech in total, then given the features selected above, this unit builds the following vectors for any part of speech x:

1) x<previous word>, the previous-word vector: its dimension is dz, and each element gives the frequency with which a specific word appears immediately before a word of part of speech x.

2) x<previous part of speech>, the previous-part-of-speech vector: its dimension is lz, and each element gives the frequency with which a specific part of speech appears immediately before a word of part of speech x.

3) x<next word>, the next-word vector: its dimension is dz, and each element gives the frequency with which a specific word appears immediately after a word of part of speech x.

4) x<next part of speech>, the next-part-of-speech vector: its dimension is lz, and each element gives the frequency with which a specific part of speech appears immediately after a word of part of speech x.
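The four vectors can be accumulated in a single pass over the tagged corpus. The sketch below is an illustration under assumptions: it stores each vector sparsely as a `Counter` rather than as a dense array of dimension dz or lz, and the function name `context_vectors` and the feature keys are hypothetical.

```python
from collections import Counter, defaultdict

FEATURES = ("prev_word", "prev_pos", "next_word", "next_pos")


def context_vectors(corpus):
    """Map each part of speech to four sparse context-frequency vectors."""
    vecs = defaultdict(lambda: {name: Counter() for name in FEATURES})
    for sentence in corpus:
        for i, (_, pos) in enumerate(sentence):
            if i > 0:  # a previous token exists
                prev_word, prev_pos = sentence[i - 1]
                vecs[pos]["prev_word"][prev_word] += 1
                vecs[pos]["prev_pos"][prev_pos] += 1
            if i + 1 < len(sentence):  # a next token exists
                next_word, next_pos = sentence[i + 1]
                vecs[pos]["next_word"][next_word] += 1
                vecs[pos]["next_pos"][next_pos] += 1
    return dict(vecs)


# The tagged example sentence from the description, as (word, tag) pairs.
corpus = [[("Hong Kong", "ns"), ("selects", "v"), ("ten", "m"),
           ("great", "a"), ("outstanding", "a"), ("youths", "n")]]
vectors = context_vectors(corpus)
print(dict(vectors["a"]["next_pos"]))
```

Raw counts are used here; whether the source normalizes them to relative frequencies is not stated.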

At S303, the similarity calculation unit 142 calculates the similarity of any two parts of speech in the training set 10 by the following steps. For example, for parts of speech x1 and x2:

1) First calculate the similarity of each pair of corresponding feature vectors of the two parts of speech (x1, x2):

Simc(x1<previous word>, x2<previous word>),

Simc(x1<previous part of speech>, x2<previous part of speech>),

Simc(x1<next word>, x2<next word>),

Simc(x1<next part of speech>, x2<next part of speech>)

2) Calculate the overall similarity with the following formula:

$$
\begin{aligned}
\mathrm{Sim}(x_1, x_2) ={} & w_1 \cdot \mathrm{Simc}(x_1\langle\text{previous word}\rangle,\ x_2\langle\text{previous word}\rangle) \\
{}+{} & w_2 \cdot \mathrm{Simc}(x_1\langle\text{previous part of speech}\rangle,\ x_2\langle\text{previous part of speech}\rangle) \\
{}+{} & w_3 \cdot \mathrm{Simc}(x_1\langle\text{next word}\rangle,\ x_2\langle\text{next word}\rangle) \\
{}+{} & w_4 \cdot \mathrm{Simc}(x_1\langle\text{next part of speech}\rangle,\ x_2\langle\text{next part of speech}\rangle)
\end{aligned}
$$

where $w_1 + w_2 + w_3 + w_4 = 1$.
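The weighted combination can be sketched as follows. The source does not define Simc, so cosine similarity is assumed here purely for illustration; the equal default weights, the feature keys and the function names are likewise hypothetical.

```python
import math

FEATURES = ("prev_word", "prev_pos", "next_word", "next_pos")


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts (assumed Simc)."""
    dot = sum(value * v.get(key, 0) for key, value in u.items())
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity(x1, x2, weights=(0.25, 0.25, 0.25, 0.25)):
    """Sim(x1, x2): weighted sum of the four per-feature similarities."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * cosine(x1[f], x2[f]) for w, f in zip(weights, FEATURES))


# Toy vectors: x and y share only their previous-word context.
x = {f: {"w": 1} for f in FEATURES}
y = {"prev_word": {"w": 1}, "prev_pos": {"z": 1},
     "next_word": {"z": 1}, "next_pos": {"z": 1}}
print(similarity(x, x), similarity(x, y))
```

The weights let the designer decide how much word context counts relative to part-of-speech context; the source gives no values for them.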

At step S304, the clustering unit 143 clusters all parts of speech according to the calculated similarities using a hierarchical clustering algorithm (for example, the K-means clustering algorithm), and generates the hierarchy tree according to a predefined rule. In the present invention, the predefined rule may be that the number of nodes in each layer is less than n, where n is a positive integer; for example, n equals 8.
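One way to turn pairwise similarities into a layer with a bounded number of nodes is sketched below. It swaps in average-link agglomerative clustering as a stand-in, which is an assumption and not the algorithm named in the source, and it builds a single layer only; the function `agglomerate`, the parameter `max_nodes` and the toy similarity table are hypothetical.

```python
def agglomerate(items, sim, max_nodes):
    """Merge the most similar clusters until fewer than `max_nodes` remain."""
    clusters = [[item] for item in items]

    def link(a, b):
        # Average pairwise similarity between two clusters.
        return sum(sim(p, q) for p in a for q in b) / (len(a) * len(b))

    while len(clusters) >= max_nodes and len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy similarity table, invented for illustration.
TOY_SIM = {frozenset(("n", "ns")): 0.9, frozenset(("v", "a")): 0.6}


def toy_sim(p, q):
    return TOY_SIM.get(frozenset((p, q)), 0.1)


groups = agglomerate(["n", "ns", "v", "a"], toy_sim, max_nodes=3)
print(groups)
```

Applying the same procedure recursively inside each resulting group would yield deeper layers, each with a node count under the chosen bound.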

How the part-of-speech tagging model is generated is described below with reference to Figs. 5a and 5b. Fig. 5a is a structural diagram of the part-of-speech tagging model training device 12 according to the invention. The device 12 comprises a CRF model training corpus construction unit 121, a CRF model training unit 122 and a logic circuit 120. The corpus construction unit 121 relabels, layer by layer and node by node according to the hierarchy tree 15, the training texts read from the training set 10. The CRF model training unit 122 trains the CRF models, correspondingly layer by layer and node by node, on the training texts produced by each relabeling of unit 121. The logic circuit 120 controls units 121 and 122 to carry out the model training. The logic circuit 120 is loaded with the number of layers of the hierarchy tree; after units 121 and 122 finish processing each layer, it increments the layer number by 1, until all nodes of the last layer of the hierarchy tree have been processed.

Fig. 5b is a flowchart of the method by which the part-of-speech tagging model training device generates the part-of-speech tagging model. The flowchart describes a training method with a two-level nested loop. The method trains top-down: the training result of an upper layer influences the layer below it, while training within the same layer can be carried out independently. Suppose the hierarchy tree has n layers, layer i has m_i nodes, and the current node is j. First, at S601, the logic circuit 120 initializes the layer i to 0. At S602, the logic circuit 120 sets the node j to 1. Then, at S603, the CRF model training corpus construction unit 121 constructs the <i, j> CRF model training corpus: each part-of-speech label in the tagged texts of the original training set 10 is replaced with the name of the child node of the current node to which that label belongs in the hierarchy tree. At S604, the CRF model training unit 122 trains the <i, j> CRF model using the <i, j> training corpus and the selected feature template. When i = 0, the feature template selected by unit 122 comprises the two words before and the two words after the current word, the leading and trailing characters of the current word, and the co-occurrences among the two preceding and two following words. When i > 0, in addition to the layer-0 feature template, the template also comprises the parts of speech of the two preceding and two following words in the annotation result of the layer above, the co-occurrences between those parts of speech, and the co-occurrences between words and parts of speech. At S605, j is incremented by 1, and at S606 it is judged whether j is greater than m_i. If j is not greater than m_i, S603 is executed again; otherwise, at S607, i is incremented by 1 and S602 is executed, until S603 and S604 have been carried out for the nodes of all layers in the hierarchy tree. Training thus yields a stacked part-of-speech tagging model that can be applied to a large-scale tag set.
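The two-level loop of S601 to S607 can be sketched as below. This is a control-flow illustration only: `build_corpus` and `train_crf` are hypothetical callbacks standing in for units 121 and 122, and the stub versions used in the example do no real CRF training.

```python
def train_stacked(layers, corpus, build_corpus, train_crf):
    """Train one model per <layer, node> pair, top-down.

    `layers[i]` lists the names of the non-leaf nodes in layer i.
    """
    models = {}
    for i, nodes in enumerate(layers):             # S601 / S607: layer loop
        for j, node in enumerate(nodes, start=1):  # S602 / S605 / S606: node loop
            data = build_corpus(corpus, i, node)   # S603: relabel the corpus
            models[(i, j)] = train_crf(data, i)    # S604: train the <i, j> model
    return models


# Stub callbacks that only record what they were asked to do.
def stub_build(corpus, layer, node):
    return {"node": node, "sentences": len(corpus)}


def stub_train(data, layer):
    return ("model", data["node"], layer)


toy_layers = [["root"], ["label1", "label2"]]
models = train_stacked(toy_layers, [["dummy sentence"]], stub_build, stub_train)
print(sorted(models))
```

Because models in the same layer do not depend on each other, the inner loop could also run in parallel; the sequential form mirrors the flowchart.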

For example, given a fully tagged sentence:

Hong Kong/ns selects/v ten/m great/a outstanding/a youths/n

At layer 0, the <0, 1> CRF model training corpus is constructed. The sentence is first relabeled. Referring to the hierarchy tree shown in Fig. 4a, the child nodes of the first node of layer 0 are "label1", "label2", "label3" and "label4". The actual part of speech "v" in Fig. 4a corresponds to the layer-1 node named "label1" in the hierarchy tree, so every word tagged "v" in the original training set is relabeled "label1".

After the sentence is relabeled at layer 0, the following sentence is obtained:

Hong Kong/label3 selects/label1 ten/label2 great/label1 outstanding/label1 youths/label3

At layer 0, the CRF model is trained. The selected feature template comprises, for each word such as "Hong Kong" or "selects", the two words before and the two words after it, the leading and trailing characters of the current word, and the co-occurrences among the two preceding and two following words (co-occurrence refers to two words appearing together in a given context).

Next, the sentence is relabeled again at layer 1. For the first node of layer 1, <1, 1>, the <1, 1> CRF model training corpus is constructed. Referring to the hierarchy tree of Fig. 4a, the child nodes of node <1, 1> are "label11" and "label12", so the words tagged "label1" at layer 0 are refined to "label11" or "label12", that is, to the names in the set of child nodes of the current node.

For the layer-0 annotation result "Hong Kong/label3 selects/label1 ten/label2 great/label1 outstanding/label1 youths/label3", the training corpus of node <1, 1> after relabeling is:

Hong Kong/label3 selects/label12 ten/label2 great/label11 outstanding/label11 youths/label3

The CRF model of node <1, 1> is then trained. In addition to the layer-0 feature template, the selected feature template also comprises the parts of speech of the two preceding and two following words in the annotation result of the layer above, the co-occurrences between those parts of speech, and the co-occurrences between words and parts of speech. For the word "selects", for example, the features include the parts of speech "label3" and "label2" of its neighboring words "Hong Kong" and "ten", the co-occurrences between those parts of speech, and the co-occurrences between the words and the parts of speech.

Similarly, CRF model training corpus construction and CRF model training are carried out for nodes <1, 2>, <1, 3> and <1, 4>, and so on until they have been carried out for all nodes of all layers.
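The relabeling used in this walk-through can be reproduced with a small helper. It is a sketch under assumptions: each real part of speech is stored with its path of virtual node names, the `PATHS` table is hypothetical (only the labels that appear in the examples are grounded in the description), and words outside the current node are assumed to keep their label from the layer above, as the <1, 1> corpus shows.

```python
# Root-to-leaf paths for the tags in the example; illustrative, not Fig. 4a.
PATHS = {
    "ns": ["root", "label3", "label32", "ns"],
    "v": ["root", "label1", "label12", "v"],
    "m": ["root", "label2", "label21", "m"],
    "a": ["root", "label1", "label11", "a"],
    "n": ["root", "label3", "label31", "n"],
}


def relabel(tagged, layer, node, paths=PATHS):
    """Build the <layer, node> training labels for one tagged sentence.

    Words under `node` receive the name of the child of `node` on their
    path; all other words keep their label from the layer above.
    """
    out = []
    for word, tag in tagged:
        path = paths[tag]
        inside = path[layer] == node
        out.append((word, path[layer + 1] if inside else path[layer]))
    return out


sentence = [("Hong Kong", "ns"), ("selects", "v"), ("ten", "m"),
            ("great", "a"), ("outstanding", "a"), ("youths", "n")]
layer0 = relabel(sentence, 0, "root")
layer1_node1 = relabel(sentence, 1, "label1")
print(layer1_node1)
```

Calling `relabel(sentence, 1, "label3")` would give the <1, 3> corpus in the same way, with only "Hong Kong" and "youths" refined.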

Fig. 6a shows the structure of the part-of-speech tagging device. Referring to Fig. 6a, the part-of-speech tagging device 22 comprises a logic circuit 222, a CRF model feature construction unit 220 and a CRF part-of-speech tagging unit 221. According to the layered part-of-speech tagging model, the logic circuit 222 controls the CRF model feature construction unit 220 and the CRF part-of-speech tagging unit 221 to perform part-of-speech tagging. Under the control of the logic circuit 222, the CRF model feature construction unit 220 constructs, layer by layer and node by node, the feature data for applying the <i,j> CRF model to the text to be tagged, and the CRF part-of-speech tagging unit 221 correspondingly performs part-of-speech tagging layer by layer and node by node according to the feature data constructed each time by the feature construction unit 220.

Fig. 6b is a flowchart of the layered CRF part-of-speech tagging method performed by the part-of-speech tagging device. Suppose the part-of-speech tagging model has n layers, layer i has m_i nodes, and the current node is j. First, at S901, the logic circuit 222 initializes the layer index i to 0. At S902, the logic circuit 222 sets the node index j to 1. Then, at S903, the CRF model feature construction unit 220 constructs the feature data for applying the <i,j> CRF model: according to the feature templates set while training the part-of-speech tagging model, it builds the input feature data of the CRF model, using one of the following two methods depending on the layer i:

1) When i equals 0, the feature template filling process of the CRF model is carried out; that is, the relevant feature information is extracted directly from the input text to be tagged and filled into the templates to generate the input feature data of the corresponding CRF model.

2) When i is not equal to 0, in addition to the relevant feature information obtained as in layer 0, the corresponding feature information is also extracted from the result of tagging the text with the CRF models of layer i-1, to generate the input feature data of the corresponding CRF model.
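The two cases above can be illustrated with a minimal sketch. It is not the feature construction unit 220 itself: the function names `window` and `build_features`, the string format of the features, the padding symbol and the inclusion of the current word's own previous-layer tag are all assumptions made for illustration, and the co-occurrence features are omitted for brevity.

```python
from typing import List, Optional

PAD = "/"  # assumed placeholder for positions outside the sentence


def window(items: List[str], pos: int, size: int = 2) -> List[str]:
    """Return items[pos-size .. pos+size], padding out-of-range slots."""
    return [items[k] if 0 <= k < len(items) else PAD
            for k in range(pos - size, pos + size + 1)]


def build_features(words: List[str], layer: int,
                   prev_tags: Optional[List[str]] = None) -> List[List[str]]:
    """Build one feature list per word for a CRF model of the given layer."""
    rows = []
    for pos in range(len(words)):
        # Case 1): layer-0 features come straight from the input text.
        feats = [f"w[{off}]={w}"
                 for off, w in zip(range(-2, 3), window(words, pos))]
        if layer > 0:
            # Case 2): add the tags produced by layer i-1 for the same window.
            if prev_tags is None:
                raise ValueError("layers above 0 need the previous layer's tags")
            feats += [f"t[{off}]={t}"
                      for off, t in zip(range(-2, 3), window(prev_tags, pos))]
        rows.append(feats)
    return rows


words = ["Beijing", "shortlisted", "ten", "big", "livable", "city"]
layer0_tags = ["label3", "label1", "label2", "label1", "label1", "label3"]
rows0 = build_features(words, 0)
rows1 = build_features(words, 1, layer0_tags)
```

In this sketch a layer-0 row holds five word features, while a row for any higher layer holds the same five plus five tag features taken from the previous layer's output.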

At S904, based on the feature data obtained, the text to be tagged is tagged using the <i,j> CRF model of the part-of-speech tagging model 13.

At S905, j is incremented by 1, and at S906 it is determined whether j is greater than m_i. If j is not greater than m_i, S903 is performed again; otherwise, at S907, i is incremented by 1 and S902 is performed, until S903 and S904 have been carried out for the nodes of all layers in the part-of-speech hierarchy tree. By tagging the text layer by layer in this way, part-of-speech tagging with a large-scale tag set is achieved. A simple example is given below to further illustrate the whole tagging process:

Given a text to be tagged: Beijing is shortlisted among the ten great livable cities.

Layer 0 (using the <0,1> CRF model)

The result after tagging is: Beijing/label3 shortlisted/label1 ten/label2 big/label1 livable/label1 city/label3

Layer 1 (using all the CRF models of this layer)

1. Using the <1,1> CRF model gives: Beijing/label3 shortlisted/label12 ten/label2 big/label11 livable/label11 city/label3

2. Using the <1,2> CRF model ...

The tagging result after layer 1 is finished is:

Beijing/label32 shortlisted/label12 ten/label21 big/label11 livable/label11 city/label31

Layer 2

1. Using the <2,1> CRF model gives:

Beijing/label32 shortlisted/label12 ten/label21 big/a livable/a city/label31

2. Using the <2,2> CRF model ...

Finally, the complete tagging result is obtained:

Beijing/ns shortlisted/v ten/m big/a livable/a city/n
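The control flow of S901 to S907 and the refinement from layer 0 to layer 1 in the example can be sketched as follows. This is only an illustration under stated assumptions: `layered_tag` and `table_model` are hypothetical names, the lookup tables stand in for trained CRF models, and the toy hierarchy has one node in layer 0 and three nodes in layer 1.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A node model returns {position: tag} for the words it is responsible for.
NodeModel = Callable[[List[str], Optional[List[str]]], Dict[int, str]]


def table_model(parent: Optional[str], table: Dict[str, str]) -> NodeModel:
    """Toy stand-in for a trained CRF: refines words whose previous tag is `parent`."""
    def model(words: List[str], prev: Optional[List[str]]) -> Dict[int, str]:
        return {pos: table[w] for pos, w in enumerate(words)
                if (prev is None or prev[pos] == parent) and w in table}
    return model


def layered_tag(words: List[str],
                models: Dict[Tuple[int, int], NodeModel],
                nodes_per_layer: List[int]) -> List[str]:
    prev: Optional[List[str]] = None            # result of layer i-1
    i = 0                                       # S901
    while i < len(nodes_per_layer):
        cur = list(prev) if prev is not None else [""] * len(words)
        j = 1                                   # S902
        while j <= nodes_per_layer[i]:          # S906: stop once j > m_i
            # S903 + S904: the <i,j> model sees the words and layer i-1 tags.
            for pos, tag in models[(i, j)](words, prev).items():
                cur[pos] = tag
            j += 1                              # S905
        prev = cur
        i += 1                                  # S907
    return prev if prev is not None else []


words = ["Beijing", "shortlisted", "ten", "big", "livable", "city"]
models = {
    (0, 1): table_model(None, {"Beijing": "label3", "shortlisted": "label1",
                               "ten": "label2", "big": "label1",
                               "livable": "label1", "city": "label3"}),
    (1, 1): table_model("label1", {"shortlisted": "label12", "big": "label11",
                                   "livable": "label11"}),
    (1, 2): table_model("label2", {"ten": "label21"}),
    (1, 3): table_model("label3", {"Beijing": "label32", "city": "label31"}),
}
result = layered_tag(words, models, [1, 3])
```

With these toy tables, `result` reproduces the tags listed above for the end of layer 1; further layers would be added to `models` and `nodes_per_layer` in the same way.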

Fig. 7a is a schematic structural diagram of the part-of-speech tagging system according to the second embodiment of the invention. Compared with the part-of-speech tagging system shown in Fig. 1a, this system further comprises an evaluation device 16, an adjusting device 17 and a test set construction device 18. The test set construction device 18 randomly selects a set of part-of-speech-tagged texts from the part-of-speech tagging training set 10 as the test set, which serves as the set of texts to be tagged. The evaluation device 16 evaluates the result of tagging the test set with the part-of-speech tagging model; that is, it measures the tagging precision from the test result. The adjusting device 17 adjusts the part-of-speech hierarchy tree construction device 14 according to the evaluation result of the evaluation device, so as to generate a part-of-speech hierarchy tree with better performance.

Fig. 7b is a flowchart of the part-of-speech tagging method performed by this part-of-speech tagging system. Referring to Fig. 7b, at S701 the test set construction device 18 randomly extracts a subset of the part-of-speech tagging training set 10 as the test set. At S702, the part-of-speech tagging system tags the test set using the trained part-of-speech tagging model 13. At S703, the evaluation device 16 evaluates the precision of the tagged test set and sends the evaluation result to the adjusting device 17. Then, at S704, the adjusting device 17 judges the performance of the part-of-speech tagging model according to the evaluation result; when the performance does not satisfy a predetermined condition, S705 is performed, in which the thresholds W1, W2, W3 and W4 used in the part-of-speech hierarchy tree construction device 14 are adjusted to change the clustering result. At S706, the adjusting device adjusts the clustering result using heuristic rules. An example of a heuristic rule is: "n" and "ns" should be assigned to different groups.
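The evaluation at S703 and the decision at S704 can be sketched as below. The description does not state how precision is computed or what the predetermined condition is, so the per-word accuracy measure, the function names and the 0.95 threshold are all assumptions for illustration only.

```python
from typing import List, Tuple

TaggedSentence = List[Tuple[str, str]]  # (word, tag) pairs


def tagging_precision(gold: List[TaggedSentence],
                      predicted: List[TaggedSentence]) -> float:
    """S703: fraction of words whose predicted tag equals the gold tag."""
    total = 0
    correct = 0
    for g_sent, p_sent in zip(gold, predicted):
        for (_, g_tag), (_, p_tag) in zip(g_sent, p_sent):
            total += 1
            correct += int(g_tag == p_tag)
    return correct / total if total else 0.0


def needs_adjustment(precision: float, min_precision: float = 0.95) -> bool:
    """S704: True means S705 should adjust the thresholds W1..W4.

    The 0.95 default is an assumed value, not one given in the description.
    """
    return precision < min_precision


gold = [[("Beijing", "ns"), ("shortlisted", "v"), ("ten", "m"),
         ("big", "a"), ("livable", "a"), ("city", "n")]]
predicted = [[("Beijing", "n"), ("shortlisted", "v"), ("ten", "m"),
              ("big", "a"), ("livable", "a"), ("city", "n")]]
precision = tagging_precision(gold, predicted)
adjust = needs_adjustment(precision)
```

Here one of six tags differs ("ns" versus "n"), so the measured precision falls below the assumed threshold and the sketch signals that the thresholds should be adjusted.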

Fig. 8a is a structural diagram of the part-of-speech tagging system according to the third embodiment of the invention. For unregistered (out-of-vocabulary) words, there is no corresponding training data in the corpus, so the tagging precision for such words is often relatively low, which in turn affects the overall tagging precision. The part-of-speech tagging system of the invention can correct the parts of speech of unregistered words, thereby improving the overall part-of-speech tagging precision of the system. Compared with the part-of-speech tagging system shown in Fig. 1a, this system further comprises an unregistered-word part-of-speech guessing model construction device 19 and an unregistered-word part-of-speech correction device 21. The unregistered-word part-of-speech guessing model construction device 19 learns word-formation rules from the existing part-of-speech tagging training set 10 and builds the unregistered-word part-of-speech guessing model 20 based on the learned rules. The unregistered-word part-of-speech correction device 21 uses the unregistered-word part-of-speech guessing model to correct the parts of speech of unregistered words in text that has been tagged with the part-of-speech tagging model 13.

Fig. 8b shows the part-of-speech tagging method according to the third embodiment of the invention. Referring to Fig. 8b, at S801 the unregistered-word part-of-speech guessing model construction device 19 first segments the words in the part-of-speech tagging training set into immediate constituents and analyzes the attributes of the immediate constituents (that is, it finds the immediate constituents of each word in the training set and tags their attributes) to obtain word constituent sequences.

The definition of an immediate constituent is briefly explained below. The smaller units that make up a larger unit are called constituents of the larger unit, and the smaller units that directly make up a larger unit are called its immediate constituents. The items in the part-of-speech tagging training set are themselves words, not units smaller than words, so immediate constituent segmentation and attribute analysis differ from word segmentation and part-of-speech tagging in the usual sense. Instead, every word in the training set that consists of two or more characters is cut into units one level lower. For a two-character word, the lower-level units are the single characters (morphemes) that make up the word. A word of three or more characters is cut into words that exist in the dictionary (by maximum matching) plus the remaining single morphemes. Take "Ministry of Science and Technology" as an example: if the dictionary contains the two words "science" and "technology" but not "science and technology", "technology department" and so on, the word is cut into "science/technology/department"; if the dictionary contains words such as "science", "technology department" and "technology", the word is cut into "science/technology department". An immediate constituent here may therefore be either a word or a morpheme. The attribute of an immediate constituent mainly refers to its grammatical attribute, which is expressed in the form of part-of-speech tags and includes all possible part-of-speech tags.
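The segmentation rule just described can be sketched with a forward maximum-matching routine. This is a minimal illustration, not the device's actual implementation: the function name `immediate_constituents` is hypothetical, the Chinese strings are an assumed back-translation of the "Ministry of Science and Technology" example, and excluding the whole word from the match is an assumption that keeps the result one level below the word.

```python
from typing import List, Set


def immediate_constituents(word: str, dictionary: Set[str]) -> List[str]:
    """Cut a word into dictionary words (longest match first) and single morphemes."""
    if len(word) <= 2:
        # One- and two-character words are split into single characters.
        return list(word)
    parts: List[str] = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, but never accept the whole word itself.
        max_len = min(len(word) - start, len(word) - 1)
        for length in range(max_len, 1, -1):
            candidate = word[start:start + length]
            if candidate in dictionary:
                parts.append(candidate)
                start += length
                break
        else:
            # No multi-character dictionary word starts here: keep one morpheme.
            parts.append(word[start])
            start += 1
    return parts


word = "科学技术部"  # assumed original of "Ministry of Science and Technology"
split_a = immediate_constituents(word, {"科学", "技术"})
split_b = immediate_constituents(word, {"科学", "技术部", "技术"})
```

With the first dictionary the sketch yields three constituents (science/technology/department), and with the second it yields two (science/technology department), matching the two cases in the text.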

Table 1 gives the immediate constituent segmentation and attribute analysis results for the two words "cold violence" and "strafe":

Table 1. Example results of immediate constituent segmentation and immediate constituent attribute analysis. The corresponding sequences obtained are:

Cold violence → cold 2a N_B violence 4n N_E

Strafe → sweep 2v V_B penetrate 2v V_E. For the unregistered word "cold penetrating", the word constituent sequence obtained is: cold 2a penetrate 2v

At S802, the unregistered-word part-of-speech guessing model construction device 19 selects part-of-speech feature templates.

At S803, the unregistered-word part-of-speech guessing model construction device 19 converts the generated word constituent sequences using the selected part-of-speech feature templates and generates the unregistered-word part-of-speech guessing model 20 by a known machine learning algorithm. For example, the unregistered-word part-of-speech guessing model 20 is used to obtain the part of speech of the whole word "cold penetrating": POS(cold 2a V_B penetrate 2v V_E) = V.

At S804, the part-of-speech tagging system uses the generated unregistered-word part-of-speech guessing model 20 to re-tag the unregistered words in the text tagged with the part-of-speech tagging model 13.

Suppose that, for the word constituent sequence "sweep 2v V_B penetrate 2v V_E", the selected feature templates are:

// Part-of-speech of the constituent word

U01: %x[-1,2] // the former one constituent's second feature (/) ("/" denotes a null feature)

U02: %x[0,2] // the current constituent's second feature (a)

// Length of the constituent word

U03: %x[1,1] // the next one constituent's first feature (2, 2)

// The constituent word itself

U04: %x[0,0] // the current one constituent's zero feature

The word constituent sequence "sweep 2v V_B penetrate 2v V_E" is then converted into input data for a machine learning method such as CRF:

if (T(-1,2) = '/') tag = 'V_B'

if (T(0,2) = 'v') tag = 'V_B'

if (T(1,1) = '2') tag = 'V_B'

if (T(0,0) = 'sweep') tag = 'V_B'

if (T(-1,2) = 'v') tag = 'V_E'

if (T(0,2) = 'v') tag = 'V_E'

if (T(1,1) = '2') tag = 'V_E'

if (T(0,0) = 'penetrate') tag = 'V_E'
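How the templates U01 to U04 pick out feature values for one constituent can be sketched as follows. The `%x[row,col]` form resembles the feature template macros used by CRF toolkits such as CRF++, but the parser below is a hypothetical stand-in, not that toolkit's code. The column layout (0 = constituent, 1 = length, 2 = part of speech) is inferred from the template comments, and mapping any out-of-range row to "/" is an assumption, since the description only shows the null feature for a missing previous constituent.

```python
import re
from typing import Dict, List

NULL = "/"  # the description uses "/" to denote a null feature


def expand_templates(rows: List[List[str]], pos: int,
                     templates: Dict[str, str]) -> Dict[str, str]:
    """Evaluate each %x[row_offset,col] macro relative to the constituent at `pos`."""
    out: Dict[str, str] = {}
    for name, macro in templates.items():
        m = re.fullmatch(r"%x\[(-?\d+),(\d+)\]", macro)
        if m is None:
            raise ValueError(f"unsupported template: {macro}")
        row = pos + int(m.group(1))
        col = int(m.group(2))
        # Assumed boundary rule: rows outside the sequence give the null feature.
        out[name] = rows[row][col] if 0 <= row < len(rows) else NULL
    return out


# Columns: constituent, length, part of speech (assumed layout).
rows = [["sweep", "2", "v"], ["penetrate", "2", "v"]]
templates = {"U01": "%x[-1,2]", "U02": "%x[0,2]",
             "U03": "%x[1,1]", "U04": "%x[0,0]"}
first = expand_templates(rows, 0, templates)
```

For the first constituent, the four values in `first` correspond to the four conditions listed above for the tag V_B.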

Although the generated unregistered-word part-of-speech guessing model 20 is used above to re-tag the unregistered words in the final text obtained with the part-of-speech tagging model 13, it may also be used to re-tag the unregistered words in the text tagged at the current layer of the part-of-speech tagging model 13, that is, to correct the current layer's part-of-speech tagging result, which then serves as feature data for the next layer.

The invention has been described using Chinese text as an example, but it is clear that the invention can equally be applied to part-of-speech tagging of English, Japanese and other languages.

Although the invention has been described with reference to specific embodiments, it should not be limited by these embodiments but only by the claims. It should be understood that those of ordinary skill in the art can change or modify the embodiments without departing from the scope and spirit of the invention.

Claims

1. A part-of-speech tagging system, comprising:

a part-of-speech tagging model training device, configured to train a part-of-speech tagging model layer by layer and node by node, based on a part-of-speech hierarchy tree, using tagged first text in a part-of-speech tagging training set; and

a part-of-speech tagging device, configured to perform part-of-speech tagging on text to be tagged using the trained part-of-speech tagging model.

2. The part-of-speech tagging system as claimed in claim 1, wherein the part-of-speech tagging model training device comprises:

a CRF model training corpus construction unit, configured to construct a CRF model training corpus by using the part-of-speech hierarchy tree to tag, layer by layer and node by node, the tagged first text in the part-of-speech tagging training set as second text; and

a CRF model training unit, configured to train CRF models layer by layer and node by node, correspondingly using the second text tagged each time by the CRF model training corpus construction unit, to obtain the part-of-speech tagging model.

3. The part-of-speech tagging system as claimed in claim 2, wherein the CRF model training corpus construction unit performs the layer-by-layer, node-by-node tagging by replacing each tagged part of speech in the first text with the name of the child node of the current node that corresponds to the position of that part of speech in the part-of-speech hierarchy tree.

4. The part-of-speech tagging system as claimed in claim 3, wherein the CRF model training unit selects feature templates for training the CRF models layer by layer and node by node in the following manner:

(a) when the current layer is layer 0, the feature templates comprise the two words before and the two words after each word in the second text, the co-occurrence between the word preceding and the word following the current word, and the co-occurrence among the two words before and the two words after; and

(b) when the current layer is not layer 0, the feature templates comprise the feature templates selected for layer 0 together with the parts of speech, in the second text of the previous layer, of the two words before and the two words after each word, the co-occurrence among those parts of speech, and the co-occurrence between words and parts of speech.

5. The part-of-speech tagging system as claimed in claim 2, wherein the part-of-speech tagging device comprises:

a CRF model feature construction unit, configured to construct, layer by layer and node by node, feature data for applying the CRF models to the text to be tagged; and

a CRF part-of-speech tagging unit, configured to perform part-of-speech tagging layer by layer and node by node, correspondingly according to the feature data constructed each time by the CRF model feature construction unit.

6. The part-of-speech tagging system as claimed in claim 5, wherein the CRF model feature construction unit constructs the feature data of the CRF model in the following manner:

(a) when the current layer is layer 0, feature data are extracted from the text to be tagged to fill the feature templates selected for layer 0 when training the CRF models; and

(b) when the current layer is not layer 0, the feature data of layer 0 are used, and feature data are also extracted from the second text obtained by tagging the text to be tagged with the CRF models of the previous layer.

7. The part-of-speech tagging system as claimed in claim 1, further comprising:

a part-of-speech hierarchy tree construction device, configured to construct the part-of-speech hierarchy tree by analyzing the relations between the parts of speech of the tagged text in the part-of-speech tagging training set.

8. The part-of-speech tagging system as claimed in claim 7, wherein the part-of-speech hierarchy tree construction device comprises:

a part-of-speech feature template selection unit, configured to select feature templates that characterize part-of-speech features;

a feature vector construction unit, configured to construct, according to the selected feature templates, a corresponding feature vector for each part of speech in the part-of-speech tagging training set;

a similarity calculation unit, configured to calculate the similarity between parts of speech using the feature vectors; and

a clustering unit, configured to cluster the parts of speech according to the similarity to generate the part-of-speech hierarchy tree.

9. The part-of-speech tagging system as claimed in claim 8, further comprising:

a test set construction device, configured to randomly select a set of part-of-speech-tagged texts from the part-of-speech tagging training set as a test set;

an evaluation device, configured to evaluate the result of performing part-of-speech tagging, using the part-of-speech tagging model, on the text to be tagged from the test set; and

an adjusting device, configured to adjust the part-of-speech hierarchy tree according to the evaluation result.

10. The part-of-speech tagging system as claimed in claim 9, wherein the adjusting device adjusts the thresholds used by the part-of-speech hierarchy tree construction device when calculating the similarity between parts of speech.

11. The part-of-speech tagging system as claimed in claim 1 or 2, further comprising:

an unregistered-word part-of-speech guessing model construction device, configured to learn word-formation rules from the part-of-speech tagging training set and to construct an unregistered-word part-of-speech guessing model; and

an unregistered-word part-of-speech correction device, configured to perform part-of-speech tagging on unregistered words using the unregistered-word part-of-speech guessing model, and to correct the parts of speech of unregistered words that were tagged using the part-of-speech tagging model.

12. A part-of-speech tagging method, comprising:

a part-of-speech tagging model training step of training a part-of-speech tagging model layer by layer and node by node, based on a part-of-speech hierarchy tree, using tagged first text in a part-of-speech tagging training set; and

a part-of-speech tagging step of performing part-of-speech tagging on text to be tagged using the trained part-of-speech tagging model.

13. The part-of-speech tagging method as claimed in claim 12, wherein the part-of-speech tagging model training step comprises:

a CRF model training corpus construction step of constructing a CRF model training corpus by using the part-of-speech hierarchy tree to tag, layer by layer and node by node, the tagged first text in the part-of-speech tagging training set as second text; and

a CRF model training step of training CRF models layer by layer and node by node, correspondingly using the second text tagged each time in the CRF model training corpus construction step, to obtain the part-of-speech tagging model.

14. The part-of-speech tagging method as claimed in claim 13, wherein the CRF model training corpus construction step comprises a step of performing the layer-by-layer, node-by-node tagging by replacing each tagged part of speech in the first text with the name of the child node of the current node that corresponds to the position of that part of speech in the part-of-speech hierarchy tree.

15. The part-of-speech tagging method as claimed in claim 14, wherein the CRF model training step selects feature templates for training the CRF models layer by layer and node by node in the following manner:

16. The part-of-speech tagging method as claimed in claim 13, wherein the part-of-speech tagging step comprises:

a CRF model feature construction step of constructing, layer by layer and node by node, feature data for applying the CRF models to the text to be tagged; and

a CRF part-of-speech tagging step of performing part-of-speech tagging layer by layer and node by node, correspondingly according to the feature data constructed each time in the CRF model feature construction step.

17. The part-of-speech tagging method as claimed in claim 16, wherein the CRF model feature construction step constructs the feature data of the CRF model in the following manner:

(1) when the current layer is layer 0, feature data are extracted from the text to be tagged to fill the feature templates selected for layer 0 when training the CRF models; and

(2) when the current layer is not layer 0, the feature data of layer 0 are used, and feature data are also extracted from the second text obtained by tagging the text to be tagged with the CRF models of the previous layer.

18. The part-of-speech tagging method as claimed in claim 12, further comprising:

a part-of-speech hierarchy tree construction step of constructing the part-of-speech hierarchy tree by analyzing the relations between the parts of speech of the tagged text in the part-of-speech tagging training set.

19. The part-of-speech tagging method as claimed in claim 18, wherein the part-of-speech hierarchy tree construction step comprises:

a part-of-speech feature template selection step of selecting feature templates that characterize part-of-speech features;

a feature vector construction step of constructing, according to the selected feature templates, a corresponding feature vector for each part of speech in the part-of-speech tagging training set;

a similarity calculation step of calculating the similarity between parts of speech using the feature vectors; and

a clustering step of clustering the parts of speech according to the similarity to generate the part-of-speech hierarchy tree.

20. The part-of-speech tagging method as claimed in claim 19, further comprising:

a test set construction step of randomly selecting a set of part-of-speech-tagged texts from the part-of-speech tagging training set as a test set;

an evaluation step of evaluating the result of performing part-of-speech tagging, using the part-of-speech tagging model, on the text to be tagged from the test set; and

an adjustment step of adjusting the part-of-speech hierarchy tree according to the evaluation result.

21. The part-of-speech tagging method as claimed in claim 20, wherein the adjustment step comprises a step of adjusting the thresholds used in the part-of-speech hierarchy tree construction step when calculating the similarity between parts of speech.

22. The part-of-speech tagging method as claimed in claim 12 or 13, further comprising:

an unregistered-word part-of-speech guessing model construction step of learning word-formation rules from the part-of-speech tagging training set and constructing an unregistered-word part-of-speech guessing model; and

an unregistered-word part-of-speech correction step of performing part-of-speech tagging on unregistered words using the unregistered-word part-of-speech guessing model, and correcting the parts of speech of unregistered words that were tagged using the part-of-speech tagging model.

23. A device for training a part-of-speech tagging model, comprising:

24. A method for training a part-of-speech tagging model, comprising: