CN101866337B

CN101866337B - Part-or-speech tagging system, and device and method thereof for training part-or-speech tagging model

Info

Publication number: CN101866337B
Application number: CN200910132711.3A
Authority: CN
Inventors: 胡长建; 赵凯; 邱立坤; 沈国阳
Original assignee: NEC China Co Ltd
Current assignee: NEC China Co Ltd; Renesas Electronics China Co Ltd
Priority date: 2009-04-14
Filing date: 2009-04-14
Publication date: 2014-07-02
Anticipated expiration: 2029-04-14
Also published as: JP2010250814A; JP5128629B2; CN101866337A

Abstract

The invention relates to a part-or-speech tagging system, which comprises a part-or-speech tagging training model, and a part-or-speech tagging device, wherein the part-or-speech tagging training model is used for training the part-or-speech tagging model lay by layer and node by node by using tagged first texts in a part-or-speech tagging training set based on a part-or-speech hierarchy tree; and the hierarchy tree device is used for tagging part-or-speech of the texts to be tagged by using the trained part-or-speech tagging . The invention also relates to a part-or-speech tagging method, and a device and a method thereof for training the part-or-speech tagging model. According to the system and the method of the invention, the part of speech in a large scale part-or-speech tagging set is tagged, and the tagging precision of part of speech is enhanced.

Description

Part-of-speech tagging system, for training the devices and methods therefor of part-of-speech tagging model

Technical field

The present invention relates to natural language processing field, particularly, relate to a kind of part-of-speech tagging system, for training the devices and methods therefor of part-of-speech tagging model.

Background technology

Along with extensively popularizing and social informationization day by day of internet, the accessible natural language text quantity of computing machine unprecedentedly increases, towards application demand rapid growths such as the text mining of magnanimity information, information extraction, cross-language information processing, man-machine interactions, natural language processing technique is one of core technology of reply the demand.Part-of-speech tagging is that it is the basis of natural language processing to the upper correct part of speech of each word mark in text.Due to the result of part-of-speech tagging directly affect natural language processing upper strata process field (such as, word frequency statistics, syntactic analysis, chunk parsing, semantic analysis etc.), therefore obtain efficient and part-of-speech tagging method and system is extremely important accurately.

Part-of-speech tagging is a sequence labelling problem of natural language processing field, and condition random domain model (Conditional Random Fields-CRFs) is widely used in processing the sequence labelling problem in natural language.Conditional random fields is a kind of for calculating the non-directed graph model of the conditional probability of formulating output node value when the given input nodal value in essence, it has the ability of expressive element long-distance dependence and overlapping property feature, can be for the treatment of the stronger information extraction work of association of overall importance.Therefore it has avoided the strong correlation hypothesis of the Directed Graph Models such as picture maximum entropy (MaximumEntropy-ME) and hidden Markov model (Hidden Markov Model-HMM) effectively, customer service they mark biasing problems of occurring, be the best statistical machine learning model of processing at present sequence data mark problem.Obtain a reasonable part-of-speech tagging model, need to introduce abundanter feature and adopt extensive mark collection to train.But the training process of CRFs is a job very consuming time and expend computational resource, and the demand of its training time and computational resource will be with the exponential increase of mark number of labels.Therefore CRFs model is seldom used in the large scale system application with large mark set (such as part-of-speech tagging system), is conventionally used in feature applied environment less and corpus on a small scale.Consider the high accuracy requirement of part-of-speech tagging, how CRFs model being applied to the part-of-speech tagging work with extensive mark collection and large-scale training language material feature is a problem anxious to be resolved.

For the problems referred to above, exist some relevant solutions, for example: document 1 (1.Cohn T, Smith A, Osborne M.Scaling conditional randomfields using error-correcting codes.In Proc.the 43rd Annual Meeting ofthe Association for Computational Linguistics (ACL ' 05), Ann Arbor, Michigan:Association for Computational Linguistics, June 2005, pp.10-17.) provided a kind of method that CRFs is applied to large mark set.The document is introduced error correcting output codes (Error Correcting Output Code-ECOC, ECOC is a kind of assemblage method, first define redundancy decision function, be called decode procedure-coding, then constructing final classification function based on above-mentioned decision function is decode procedure-decoding) solve the CRF training problem under large mark collection.Detailed process is as follows:

Training process (cataloged procedure)

1) supposition mark collection has m label (for example, NN-noun, VB-verb, JJ-adjective, RB-adverbial word), a manually selected ECOC, suppose that its length is n, the object of this correcting code is exactly to be the vector of a n bit by label mapping, and example is as follows:

Table 1

By above-mentioned coding, the method is just by original mark problem (also can regard many classification problems as), be transformed into n separate two-value classification problem, each row coding is corresponding a two-value sorter just, such as the 3rd selected sorter of black box, the word that its object will be labeled as " NN, JJ " exactly makes a distinction with the word that is labeled as " VB, RB ".

2) corpus that builds two-value sorter (is realized by revising original language material, be exactly to be the value in corresponding coding by the mark tag modification in corpus in simple terms, such as being above-mentioned the 3rd sorter structure language material, so only need be by all being labeled as " NN " in original language material, " JJ " is labeled as " 1 " again, and own " VB ", " RB " replaces with " 0 ").Obtain after amended language material, the method adopts traditional CRFs training method to train corresponding two-value sorter.

Model use procedure (decode procedure)

1) given any one sentence, for example " NEC Develops word-leading technologyto prevent IP phone spam ".

2) use above-mentioned training all two-value sorters out to mark respectively to above-mentioned sentence, and record annotation results, suppose that annotation results is as follows:

As noted above, the corresponding n bit of vector can to(for) each word, adopts more conventional strategy just can contrast the coding vector in the above-mentioned table 1 of this vector sum, so find the label of a coupling and with it mark this word.Such as for word " Develops ", the coding that its corresponding n bit vectors and " VB " are corresponding is the most approaching, and this system is just labeled as Develops " VB-verb " so.

Current technology can not solve the part-of-speech tagging problem that CRF is applied to extensive mark collection very effectively, make the method also have distance from practical application, specifically: 1) performance of the method for document 1 depends on choosing of ECOC coding to a great extent, but it is more difficult choosing a desirable ECOC.

2) above-mentioned scheme does not fundamentally solve the serious dependence of the huge and high-end computational resource of time consumption for training.Training process in document [1] will be trained n two-value sorter, and wherein the size of n depends on that ECOC chooses, and for part-of-speech tagging problem, this value is larger, and the corresponding training time is still very long, and the dependence of high-end computational resource is still existed.In decode procedure, owing to will, to the use one by one of all two-value sorters, adding the loaded down with trivial details of codes match process, make the application of training pattern also very consuming time in addition, also have the Dependence Problem of high-end computational resource.

Summary of the invention

The present invention is the technology of introducing part of speech layering, classification, and solves traditional C RF in conjunction with stacked CRFs model and be difficult to be applied to the part-of-speech tagging problem under extensive mark collection.The present invention can analyze the inner link between part of speech automatically from training set, and organizes all parts of speech according to these inner links structure part of speech hierarchical trees.According to this part of speech hierarchical tree, the present invention introduces stacked CRFs model, and then the mark number of every layer is reduced, and set in detail the introducing relation between each model, finally can automatically train the stacked CRFs part-of-speech tagging model for extensive mark collection.Consider the Sparse Problems that training set may exist, the present invention also trains the part of speech conjecture model for unregistered word based on word-building rule, further to improve the precision of part-of-speech tagging of the present invention.

According to first aspect present invention, a kind of part-of-speech tagging system has been proposed, comprising: part-of-speech tagging model training apparatus, for utilize the first text that part-of-speech tagging training set has marked to carry out successively to train node by node part-of-speech tagging model based on part of speech hierarchical tree; And part-of-speech tagging device, for the part-of-speech tagging model that uses training, text to be marked is carried out to part-of-speech tagging.

According to second aspect present invention, a kind of part-of-speech tagging method has been proposed, comprising: part-of-speech tagging model training step, utilize the first text having marked in part-of-speech tagging training set to carry out successively to train node by node part-of-speech tagging model based on part of speech hierarchical tree; And part-of-speech tagging step, use the part-of-speech tagging model of training to carry out part-of-speech tagging to text to be marked.

According to third aspect present invention, propose a kind of for training the device of part-of-speech tagging model, comprise: CRF model training language material tectonic element, construct CRF model training language material for utilizing part of speech hierarchical tree that the first text having marked from part-of-speech tagging training set is successively labeled as to the second text node by node; And CRF model training unit, correspondingly successively train node by node CRF model to obtain part-of-speech tagging model for the second text that utilizes the each mark of CRF model training language material tectonic element.

According to fourth aspect present invention, propose a kind of for training the method for part-of-speech tagging model, comprise: CRF model training language material constitution step, utilizes part of speech hierarchical tree that the first text having marked from part-of-speech tagging training set is successively labeled as to the second text node by node and constructs CRF model training language material; And CRF model training step, utilize the second text of the each mark of CRF model training language material constitution step correspondingly successively to train node by node CRF model to obtain part-of-speech tagging model.

The present invention has fundamentally solved the part-of-speech tagging problem of CRFs for large mark collection, specifically:

1) make CRFs model can use the part-of-speech tagging work of large mark collection, and solved the dependency problem of and high-end computational resource huge to the training time, the system and method that the present invention proposes can train part-of-speech tagging model in ordinary PC;

2) improved the precision of part-of-speech tagging, reason has two: one, and part of speech sequence labelling is the work that global association is stronger, therefore introduces CRFs model and can effectively realize global optimum, can improve part-of-speech tagging precision; Its two, introduce unregistered word part of speech based on word-building rule conjecture mechanism, can effectively solve the Sparse Problems of training set, also can improve the overall precision of part-of-speech tagging;

3) method that the present invention mentions is full-automatic method, can reduce widely the cost of labor of training and optimization part-of-speech tagging model.

Brief description of the drawings

Fig. 1 a shows according to the schematic diagram of the part-of-speech tagging system of first embodiment of the invention;

Fig. 1 b is according to the process flow diagram of the part-of-speech tagging method of first embodiment of the invention;

Fig. 2 shows according to the schematic diagram of part of speech hierarchical tree construction device of the present invention;

Fig. 3 shows according to the process flow diagram of part of speech hierarchical tree construction method of the present invention;

Fig. 4 a is an example structure figure of part of speech hierarchical tree;

Fig. 4 b and 4c are examples of the data structure of part of speech hierarchical tree;

Fig. 5 a shows according to the schematic configuration diagram of part-of-speech tagging model training apparatus of the present invention;

Fig. 5 b shows according to the process flow diagram of part-of-speech tagging model training method of the present invention;

Fig. 6 a shows according to the schematic diagram of part-of-speech tagging device of the present invention;

Fig. 6 b is according to the process flow diagram of part-of-speech tagging method of the present invention;

Fig. 7 a shows according to the schematic diagram of the part-of-speech tagging system of second embodiment of the invention;

Fig. 7 b is according to the process flow diagram of the part-of-speech tagging method of second embodiment of the invention;

Fig. 8 a shows according to the schematic diagram of the part-of-speech tagging system of third embodiment of the invention;

Fig. 8 b is according to the process flow diagram of the part-of-speech tagging method of third embodiment of the invention.

Embodiment

The preferred embodiments of the present invention will be described with reference to the drawings below.In the accompanying drawings, identical element is by the reference symbol by identical or numeral.In addition, in following description of the present invention, by the specific descriptions of omitting known function and configuration, to avoid making theme of the present invention unclear.

Fig. 1 a is according to the schematic configuration diagram of the part-of-speech tagging system of first embodiment of the invention.Part-of-speech tagging training set 10 in part-of-speech tagging system 1 comprises a large amount of texts having marked, that is, and and the text collection having marked.Part of speech hierarchical tree construction device 14 is analyzed the incidence relation between part of speech for the text that marks based on part-of-speech tagging training set 10, and building part of speech hierarchical tree 15 and carry out stratification and organize the part of speech of the mark occurring in part-of-speech tagging training set according to the incidence relation of analyzing, this incidence relation can be for example the similarity between part of speech.Part-of-speech tagging model training apparatus 12 generates part-of-speech tagging model 13 for training, this part-of-speech tagging model training apparatus reads the text having marked from part-of-speech tagging training set 10, and according to the part of speech layer of structure information in part of speech hierarchical tree 15, build model training process is used for part-of-speech tagging CRFs part-of-speech tagging model 13 with training, the part-of-speech tagging model that wherein training obtains is stacked part-of-speech tagging model.Part-of-speech tagging device 22 is for marking the part of speech of the word that does not mark text according to the part-of-speech tagging model obtaining.

Although the part-of-speech tagging system shown in Fig. 1 a comprises part of speech hierarchical tree construction device 14, but, be understandably that this part-of-speech tagging system also can not comprise this part of speech hierarchical tree construction device, but text to be marked carried out to part-of-speech tagging with the part of speech hierarchical tree having built.This part of speech hierarchical tree can be for example the hierarchical tree of manual construction.And this part-of-speech tagging system can only comprise that part-of-speech tagging model training apparatus 12 generates the part-of-speech tagging model 13 for part-of-speech tagging.

Part of speech hierarchical tree 15 is organized part of speech with tree-structured hierarchical.Fig. 4 a shows an example structure of part of speech hierarchical tree, and this part of speech hierarchical tree one has 4 layers in this example, and 0,1,2,3, wherein the nodes of the 2nd and the 3rd layer is 6.What the leaf node of part of speech hierarchical tree was corresponding is real part of speech, and all the other nodes are empty class names of setting arbitrarily.Fig. 4 b and 4c show an example of the data structure of the part of speech hierarchical tree of Fig. 4 a.

Fig. 1 b shows the process flow diagram of part-of-speech tagging method.At S101, part of speech hierarchical tree construction device 14 builds part of speech hierarchical tree 15 carrys out stratification and organizes the part of speech of the mark occurring in part-of-speech tagging training set.At S102, part-of-speech tagging model training apparatus 12 reads the text having marked from part-of-speech tagging training set 10, and according to the part of speech layer of structure information in part of speech hierarchical tree 15, generating part-of-speech tagging model 13, this part-of-speech tagging model 13 is marking model of stepped construction.At S103, part-of-speech tagging device 22 utilizes the part-of-speech tagging model 13 generating to carry out part-of-speech tagging to the text of input.

First be described how generating part of speech hierarchical tree 15 in conjunction with Fig. 2 and Fig. 3 below.

Fig. 2 is according to the schematic configuration diagram of part of speech hierarchical tree construction device 14 of the present invention.Wherein part of speech feature templates selected cell 140 is for selecting the part of speech feature templates of the grammer performance that characterizes part of speech, can there is various ways to characterize the grammer of part of speech, for example can choose the front word of the current word having marked in text, front word part of speech, rear word and these several features of rear word part of speech are used as part of speech feature templates.Proper vector construction unit 141, for according to the part of speech feature templates of selecting, builds characteristic of correspondence vector for each part of speech occurring in part-of-speech tagging training set 10.Similarity calculated 142 is calculated its similarity for the proper vector of utilizing structure to any two parts of speech of part-of-speech tagging training set 10.Cluster cell 143 is for using traditional hierarchical clustering algorithm to carry out cluster to all parts of speech of part-of-speech tagging training set 10 according to the similarity of calculating, and generates part of speech hierarchical tree 15 according to pre-defined rule.

Fig. 3 shows the process flow diagram of the method for part of speech hierarchical tree construction device generation part of speech hierarchical tree.At S301, part of speech feature templates selected cell 140 selects the feature of part of speech as part of speech feature templates, for example, select to have marked front word, front word part of speech, rear word and these several features of rear word part of speech of the current word in text.For choose/v of Hong Kong/ns ten/m large/a is outstanding/this text having marked of a youth/n, the current word of selection is " choosing ", current word part of speech is " v ", its part of speech character representation is as follows:

Figure DEST_PATH_GA20171931200910132711301D00011

At S302, proper vector construction unit 141, for all parts of speech that occur in part-of-speech tagging training set 10, builds characteristic of correspondence vector according to part of speech feature templates.For example, total dz word in part-of-speech tagging training set, lz part of speech, the feature of given above-mentioned selected part of speech, this module is that any one part of speech x builds following vector so:

1) before word >, term vector-vectorial dimension is dz before x<, and vectorial corresponding element characterizes the frequency that occurs specific word before the word of x part of speech

2) before word part of speech >, word part of speech vector-vectorial dimension is lz before x<, and vectorial corresponding element characterizes the frequency that occurs specific part of speech before the word of x part of speech

3) after x<, after word >, term vector-vectorial dimension is dz, and vectorial corresponding element characterizes the frequency that occurs specific word after the word of x part of speech

4) after x<, after word part of speech >, word part of speech vector-vectorial dimension is lz, and vectorial corresponding element characterizes the frequency that occurs specific part of speech after the word of x part of speech

At S303, similarity calculated 142 is calculated its similarity for any two parts of speech to part-of-speech tagging training set 10 according to the following step.For example, for part of speech x1 and part of speech x2,

1) calculate first respectively the similarity of the character pair vector of two parts of speech (x1, x2):

Simc (word > before x1<, word > before x2<),

Simc (word part of speech > before x1<, word part of speech > before x2<),

Simc (word > after x1<, word > after x2<),

Simc (word part of speech > after x1<, word part of speech > after x2<)

2) use formula calculated population similarity below

Sim (x1, x2)=w1*Simc (word > before x1<, word > before x2<)+

W2*Simc (word part of speech > before x1<, word part of speech > before x2<)+

W3*Simc (word > after x1<, word > after x2<)+

W4*Simc (word part of speech > after x1<, word part of speech > after x2<)

Wherein w1+w2+w3+w4=1

At step S304, cluster cell 143 utilizes hierarchical clustering algorithm (for example, K-means clustering algorithm) to carry out cluster to all parts of speech according to the similarity calculating, and generates hierarchical tree according to pre-defined rule.In the present invention, this pre-defined rule can be to limit the nodes of every layer to be less than n (n is positive integer).For example, n equals 8.

Describe how to generate part-of-speech tagging model below in conjunction with Fig. 5 a and Fig. 5 b.Fig. 5 a is according to the structural drawing of part-of-speech tagging model training apparatus 12 of the present invention.Part-of-speech tagging model training apparatus 12 comprises: CRF model training language material tectonic element 121, CRF model training unit 122 and logical circuit 120.CRF model training language material tectonic element 121 successively marks the training text reading from part-of-speech tagging training set 10 node by node according to part of speech hierarchical tree 15.CRF model is correspondingly successively trained node by node according to the training text of CRF model training language material tectonic element 121 each marks in CRF model training unit 122.Logical circuit 120 is controlled CRF model training language material tectonic element 121 and CRF model training unit 122 carries out part-of-speech tagging model training.Logical circuit 120 is mounted with the level number of part of speech hierarchical tree, and CRF model training language material tectonic element 121 and CRF model training unit 122 every layer finish dealing with after, the number of plies is increased to 1, until all node end process of the last one deck to part of speech hierarchical tree.

Fig. 5 b is the process flow diagram that part-of-speech tagging model training apparatus generates the method for part-of-speech tagging model.This process flow diagram comprises the nested training method of a double-deck circulation.The method adopts Zi the training mode under pushing up.The training result of last layer has impact to lower one deck, can independently carry out with the training between layer.Suppose that part of speech hierarchical tree has n layer, i layer has m _iindividual node, present node is j.First at S601, logical circuit 120 is 0 by i layer initial assignment.At S602, logical circuit 120 is 1 by node j assignment.Afterwards at S603, CRF model training language material tectonic element 121 is constructed <i, j>CRF model training language material, by the child node title that marks part-of-speech tagging label in text and replace with the present node of this label in part of speech hierarchical tree in original part-of-speech tagging training set 10.At S604, CRF model training unit 122 utilizes <i, the feature templates training <i of j>CRF model training language material and selection, j>CRF model, wherein, in the time of i=0, the co-occurrence (co-occurrence) before and after the feature templates that CRF model training unit 122 is selected comprises between front word and each two words of rear word and front and back of each two words, current word; In the time of i > 0, except using the 0th layer of feature templates of using, also use and comprise the part of speech of each two words in front and back in last layer annotation results, and the feature templates of co-occurrence between co-occurrence, word and part of speech between part of speech.At S605, j value is increased to 1 and judge at S606 whether j is greater than m _iif j is less than m _icontinue to carry out S603, otherwise at S607, i value is increased to 1 and carry out S602, until the node of all layers in part of speech hierarchical tree has been carried out to S603 and S604, thus train the stacked part-of-speech tagging model that obtains being applied to extensive mark collection.

For example, a given sentence that mark is intact:

Choose/v of Hong Kong/ns ten/m is large/and a is outstanding/a youth/n

At the 0th layer, structure <0,1>CRF model training language material.First above-mentioned sentence is marked again.Referring to the part of speech hierarchical tree shown in Fig. 4 a, the child node of this 0 layer of the 1st node is respectively " label1 ", " label2 ", " label3 " and " label4 ".And the ground floor nodename that actual part of speech " v " in Fig. 4 a corresponds in part of speech hierarchical tree is " label1 ", so all words that are labeled as " v " in original training set, all will mark this word again for " label1 ".

After above-mentioned sentence being marked again at the 0th layer, obtain following sentence:

Choose/label1 of Hong Kong/label3 ten/label2 is large/and label1 is outstanding/label1 youth/label3

At 0 layer, training CRF model.The feature templates of selecting comprises " Hong Kong ", the co-occurrence (co-occurrence refers to the situation that two words occur in certain context simultaneously) between each two words in front and back of words such as " choosing ", front word and each two words of rear word and front and back of current word.

Afterwards, at the 1st layer, above-mentioned sentence is marked again again.To the 1st layer of the 1st node <1,1>, carries out <1,1>CRF model training language material structure.Referring to the part of speech hierarchical tree of Fig. 4 a, due to <1, the child node of 1> node comprises " label11; label12 ", so, the word that is " label1 " by 0 layer of part-of-speech tagging is further carefully designated as " label11, label12 ", i.e. the child node title set of present node

For 0 layer of choose/label1 of annotation results: Hong Kong/label3 ten/label2 large/label1 is outstanding/label1 youth/label3, at <1, the corpus after 1> node is heavily marked is:

Choose/label12 of Hong Kong/label3 ten/label2 is large/and label11 is outstanding/label11 youth/label3

Carry out afterwards <1,1> node CRF model training.The feature templates of wherein selecting, except the feature templates of the 0th layer, also comprises the part of speech of each two words in front and back in last layer annotation results, and co-occurrence between co-occurrence, word and part of speech between part of speech.For example, for " a choosing " word, the part of speech " label3 " " label2 " of each two words " Hong Kong " and " ten " before and after it, the co-occurrence between co-occurrence, word and part of speech between above-mentioned part of speech.

Similarly, to <1,2> node, <1,3> node, <1,4> node carries out respectively above-mentioned CRF model training and expects structure and CRF model training.Until all nodes of all layers have been carried out to CRF model training language material structure and CRF model training.

Fig. 6 a shows the structural drawing of part-of-speech tagging device.Referring to Fig. 6 a, part-of-speech tagging device 22 comprises logical circuit 222, CRF aspect of model tectonic element 220 and CRF part-of-speech tagging unit 221.Logical circuit 222, according to stacked part-of-speech tagging model, is controlled CRF aspect of model tectonic element 220 and CRF part-of-speech tagging unit 221 and is carried out part-of-speech tagging.CRF aspect of model tectonic element 220 is under the control of logical circuit 222, for text application <i to be marked, successively structural attitude node by node of j>CRF model, CRF part-of-speech tagging unit 221 according to the characteristic of latent structure unit 220 each structures, correspondingly successively carries out node by node part-of-speech tagging under the control of logical circuit 222.

Fig. 6 b is the process flow diagram that part-of-speech tagging device is carried out stacked CRF part-of-speech tagging method.Suppose that part of speech marking model has n layer, i layer has m _iindividual node, present node is j.First at S901, logical circuit 222 is 0 by i layer initial assignment.At S902, logical circuit 222 is 1 by node j assignment.Afterwards at S903, CRF aspect of model tectonic element 220 is application <i, j>CRF Construction of A Model characteristic, according to the feature templates of setting in training part-of-speech tagging model process, build the input feature vector data of CRFs model, for different layer i, use one of following two kinds of diverse ways:

1) when i equals 0, carry out the feature templates filling process of CRF model, that is, directly from the text to be marked of input, extract relevant characteristic information, and be filled into template, generate the input feature vector data of corresponding CRFs model.

2) when i is not equal to 0, except the relevant characteristic information obtaining in 0 layer, also comprises the result from utilizing i-1 layer CRF model to mark text to be marked and extract characteristic of correspondence information, generate the input feature vector data of corresponding CRFs model.

At S904, based on the characteristic obtaining, utilize the <i of part-of-speech tagging model 10, j>CRF model marks text to be marked.

At S905, j value is increased to 1 and judge at S906 whether j is greater than m _iif j is less than m _icontinue to carry out S903, otherwise at S907, i value is increased to 1 and carry out S902, until the node of all layers in part of speech hierarchical tree has been carried out to S903 and S904.By layer by layer text being carried out to part-of-speech tagging, realize the part-of-speech tagging of extensive mark collection thus.

Provide a simple examples below, further illustrate whole mark process:

A given text to be marked: ten large city good for habitatioies are shortlisted in Beijing.

The 0th layer (application <0,1>CRFs model)

Result after mark is: and be shortlisted for/label1 of Beijing/label3 ten/label2 is large/and label1 is livable/label1 city/label3

The 1st layer (applying the CRFs model of all these layers)

1. by <1,1>CRFs model obtains Beijing/label3 be shortlisted for/label12 ten/label2 is large/and label11 is livable/label11 city/label3

2. application <1,2>CRFs model ...

……

The 1st layer of annotation results after finishing is:

Be shortlisted for/label12 of Beijing/label32 ten/label21 is large/and label11 is livable/label11 city/label31

The 2nd layer

1. by <2,1>CRFs model obtains:

Be shortlisted for/label12 of Beijing/label32 ten/label21 is large/and a is livable/a city/label31

2. application <2,1>CRFs model ...

Finally can obtain complete annotation results:

Be shortlisted for/v of Beijing/ns ten/m is large/and a is livable/a city/n

Fig. 7 a is the schematic configuration diagram of the part-of-speech tagging system of second embodiment of the invention.Compare with the part-of-speech tagging system shown in Fig. 1 a, this part-of-speech tagging system also comprises apparatus for evaluating 16, adjusting gear 17 and test set construction device 18.Test set construction device 18 is for the test set as text collection to be marked from part-of-speech tagging text collection of the random selection of part-of-speech tagging training set 10.Apparatus for evaluating 16 for the result of utilizing part-of-speech tagging model to carry out after part-of-speech tagging test set to be marked is assessed, that is, marks precision according to the result evaluation and test of test.Adjusting gear 17 is for according to the assessment result of apparatus for evaluating, part of speech hierarchical tree construction device 14 being adjusted, thus generate dominance more can part of speech hierarchical tree.

Fig. 7 b shows the process flow diagram of the method for part-of-speech tagging system execution part-of-speech tagging.With reference to figure 7b, at S701, test set construction device 18 is random from part-of-speech tagging training set 10 extracts a subset as test set.At S702, the part-of-speech tagging model 13 that the utilization of part-of-speech tagging system trains carries out part-of-speech tagging to test set.At S703, the precision of the test set of apparatus for evaluating 16 to mark part of speech is assessed and assessment result is sent to adjusting gear 17.Afterwards at S704, adjusting gear 17 is according to the performance of assessment result court verdict marking model, and in the time that the performance of part-of-speech tagging model does not meet predetermined condition, carry out S705, to the W1 using in part of speech hierarchical tree construction device 14, W2, the threshold value adjustment of W3 and W4 changes cluster result.Utilize heuristic rule to adjust cluster result at S706 adjusting gear.Heuristic rule is for example: " n " and " ns " should assign in different groups.

Fig. 8 a is according to the structural drawing of the part-of-speech tagging system of third embodiment of the invention.For unregistered word, owing to there is not corresponding training data in corpus, therefore to the mark of this class word, often ratio of precision is lower, and then impact entirety mark precision.Part-of-speech tagging system of the present invention can be revised the part of speech that does not log in word, thereby improves the overall precision of system part-of-speech tagging.Compare with the part-of-speech tagging system shown in Fig. 1 a, this part-of-speech tagging system also comprises that not logging in word part of speech guesses model construction device 19 and do not log in word part of speech correcting device 21.Do not log in word part of speech conjecture model construction device 19 for from existing part-of-speech tagging training set 10 learning word-building rules, and word-building rule based on study creates and does not log in word part of speech conjecture model 20.Do not log in word part of speech correcting device 21 for utilizing the part of speech correction that does not log in word part of speech conjecture model the text that utilizes part-of-speech tagging model 13 to mark part of speech is not logged in word.

Fig. 8 b shows the part-of-speech tagging method according to third embodiment of the invention.With reference to figure 8b, at S801, not logging in word part of speech conjecture model construction device 19 first carries out immediate constituent cutting to the word in part-of-speech tagging training set and (the attribute of immediate constituent is analyzed, word in each part-of-speech tagging training set is found out to its immediate constituent, and the attribute of immediate constituent is marked) to obtain word components series.

The definition of immediate constituent is carried out to simplicity of explanation below.The subsection that forms a Ge great unit is referred to as the composition of large unit, and the subsection that correspondingly directly forms a Ge great unit is called immediate constituent.Word in part-of-speech tagging training set itself belongs to word, instead of the constituent less than word, so immediate constituent and immediate constituent attributive analysis are different from word segmentation and part-of-speech tagging in general sense, but word in the part-of-speech tagging training set that each is made up of two words and plural word is cut into the unit than its low one-level, such as two-character word, the unit of low one-level is exactly the single character (morpheme) that forms this two-character word, and for three words and more than three words, be word (maximum coupling) and the remaining single morpheme existing in dictionary by its cutting, such as " Ministry of Science and Technology ", suppose in dictionary exist " science ", " technology " two words, and not there is not " science and technology ", " technology department " etc., after its cutting, be exactly so " science/technology/portion ", suppose in dictionary exist " science ", " technology department ", words such as " technology ", after cutting, be exactly so " science/technology department ".Therefore, the immediate constituent here may be word, may be also morpheme.The attribute of immediate constituent mainly refers to grammatical attribute, shows with the form of part of speech mark, comprises all possible part of speech mark.

Table 1 provides immediate constituent cutting and the attributive analysis result of " cold violence, strafe " two words:

Immediate constituent	Immediate constituent length (byte)	Immediate constituent attribute
			Cold	2	A
Violence	4	N

Sweep	2	V
			Penetrate	2	V

The cutting of table 1. word immediate constituent obtains corresponding sequence with immediate constituent attributive analysis result example:

Cold violence → cold 2a N_B violence 4n N_E

Strafe → sweep 2v V_B and penetrate 2v V_E; Be " cold penetrating " for unregistered word, the word components series obtaining is so: cold 2a penetrates 2v

At S802, do not log in word part of speech conjecture model construction device 19 and select part of speech feature templates.

At S803, do not log in word part of speech conjecture model construction device 19 and utilize the part of speech feature templates of selecting to change the word components series generating, and generate and do not log in word part of speech conjecture model 20 by known machine learning algorithm.For example, utilize the part of speech that does not log in word part of speech conjecture model 20 and obtain " cold penetrating " whole word: POS (cold 2a V_B, penetrates 2v V_E)=V.

At S804, the word part of speech conjecture model 20 that do not log in of part-of-speech tagging system utilization generation marks again to the word that do not log in the text marking based on part-of-speech tagging model 13.

Suppose " to sweep 2v V_B and penetrate 2v V_E " for word components series, the feature templates of selection is:

//Part-of-speech of the constituent word

U01：％x[-1，2]//the former one constituent′s second

feature(/)(″/″denotes a null feature)

U02：％x[0，2]//the current constituent′s second feature(a)

//Length of the constituent word

U03：％x[1，1]//the next one constituent′s first feature(2，2)

//The constituent word itself

U04：％x[0，0]//the current one constituent′s zero feature

So word components series " is swept 2v V_B and is penetrated 2v V_E " and change, be converted into the input data of the machine learning methods such as CRF:

if(T(-1，2)＝′/′)tag＝′V_B′

if(T(0，2)＝′v′)tag＝′V_B′

if(T(1，1)＝′2′)tag＝′V_B′

If (T (0,0)=' sweeps ') tag=' V_B '

if(T(-1，2)＝′v′)tag＝′V_E′

if(T(0，2)＝′v′)tag＝′V_E′

if(T(1，1)＝′2′)tag＝′V_E′

If (T (0,0)=' penetrates ') tag=' V_E '

Although utilize the word part of speech conjecture model 20 that do not log in generating again to mark the word that do not log in the text obtaining based on the final mark of part-of-speech tagging model 13, but that can utilize generation does not log in word part of speech conjecture model 20 to the word that do not log in the text of current layer mark marks again based on part-of-speech tagging model 13 yet, in order to revise the part of speech mark result of current layer, afterwards for the characteristic of lower one deck.

The present invention, taking Chinese text as example, is illustrated embodiment, but very clearly, the present invention also can be equally for the part-of-speech tagging to English, Japanese and other Languages.

Although with reference to specific embodiment, invention has been described, the present invention should not limited by these embodiment, and should only be limited by claims.Should be clear, do not departing under the prerequisite of scope and spirit of the present invention, those of ordinary skill in the art can change or revise embodiment.

Claims

1. a part-of-speech tagging system, comprising:

Part-of-speech tagging model training apparatus, for utilizing the first text that part-of-speech tagging training set has marked successively to train node by node part-of-speech tagging model based on part of speech hierarchical tree; And

Part-of-speech tagging device, carries out part-of-speech tagging for the part-of-speech tagging model that uses training to text to be marked,

Wherein part-of-speech tagging model training apparatus comprises:

CRF model training language material tectonic element, constructs CRF model training language material for utilizing part of speech hierarchical tree that the first text having marked from part-of-speech tagging training set is successively labeled as to the second text node by node; And

CRF model training unit, correspondingly successively trains CRF model to obtain part-of-speech tagging model for the second text that utilizes the each mark of CRF model training language material tectonic element node by node;

Wherein CRF model training unit selects feature templates successively to train node by node CRF model in the following manner:

(a) current layer is the 0th layer, and feature templates comprises the co-occurrence between front word and each two words of rear word and front and back of each two words in the front and back of each word in the second text, current word; With

(b) current layer is not the 0th layer, and feature templates comprises the part of speech of each two words in front and back of each word in the feature templates of the 0th layer of selection and the second text of last layer, and co-occurrence between co-occurrence, word and part of speech between part of speech.

2. part-of-speech tagging system as claimed in claim 1, wherein CRF model training language material tectonic element successively marks node by node by the child node title that the mark part of speech in the first text is replaced with to the present node corresponding with the position of this part of speech in part of speech hierarchical tree.

3. part-of-speech tagging system as claimed in claim 1, wherein part-of-speech tagging device comprises:

CRF aspect of model tectonic element, for being successively structural attitude data node by node of application CRF model for text to be marked; And

CRF part-of-speech tagging unit, for correspondingly successively carrying out node by node part-of-speech tagging according to the characteristic of the each structure of characteristic tectonic element;

Wherein CRF aspect of model tectonic element builds the characteristic of CRF model according to following manner:

(a) current layer is the 0th layer, extracts the characteristic of the feature templates of the 0th layer of selection while being used for being filled in training CRF model from text to be marked; With

(b) current layer is not the 0th layer, uses the characteristic of the 0th layer and extract characteristic from utilize the second text last layer CRFs model marks text to be marked.

4. part-of-speech tagging system as claimed in claim 1, also comprises:

Part of speech hierarchical tree construction device, for by building part of speech hierarchical tree to the relation analysis between the part of speech that marks text of part-of-speech tagging training set.

5. part-of-speech tagging system as claimed in claim 4, wherein part of speech hierarchical tree construction device comprises:

Part of speech feature templates selected cell, for selecting to characterize the feature templates of part of speech feature;

Proper vector construction unit, for according to the feature templates of selecting, is the part of speech structure characteristic of correspondence vector in part-of-speech tagging training set;

Similarity calculated, for utilizing proper vector to calculate the similarity between part of speech; And

Cluster cell, for according to similarity, part of speech being carried out to cluster, to generate part of speech hierarchical tree.

6. part-of-speech tagging system as claimed in claim 5, also comprises:

Test set constructing apparatus, for the text collection that marked part of speech from the random selection of part-of-speech tagging training set as test set;

Apparatus for evaluating, for to utilizing part-of-speech tagging model to assess the result of carrying out part-of-speech tagging from the text to be marked of test set; And

Adjusting gear, for adjusting part of speech hierarchical tree according to assessment result.

7. part-of-speech tagging system as claimed in claim 6, the threshold value that its middle regulator uses when the similarity of calculating between part of speech part of speech hierarchical tree construction device is adjusted.

8. part-of-speech tagging system as claimed in claim 1, also comprises:

Do not log in word part of speech conjecture Construction of A Model device, for not logging in word part of speech conjecture model from part-of-speech tagging training set learning word-building rule structure; And

Do not log in word part of speech correcting device, do not log in word part of speech conjecture model for use and carry out part-of-speech tagging to not logging in word, and the part of speech that does not log in word that uses part-of-speech tagging model mark part of speech is revised.

9. a part-of-speech tagging method, comprising:

Part-of-speech tagging model training step, utilizes the first text having marked in part-of-speech tagging training set successively to train node by node part-of-speech tagging model based on part of speech hierarchical tree; And

Part-of-speech tagging step, is used the part-of-speech tagging model of training to carry out part-of-speech tagging to text to be marked,

Wherein part-of-speech tagging model training step comprises:

CRF model training language material constitution step, utilizes part of speech hierarchical tree that the first text having marked from part-of-speech tagging training set is successively labeled as to the second text node by node and constructs CRF model training language material; And

CRF model training step, utilizes the second text of the each mark of CRF model training language material constitution step correspondingly successively to train node by node CRF model to obtain part-of-speech tagging model;

Wherein CRF model training step selects feature templates successively to train node by node CRF model in the following manner:

10. part-of-speech tagging method as claimed in claim 9, wherein CRF model training language material constitution step comprises the step that the child node title by the mark part of speech in the first text being replaced with to the present node corresponding with the position of this part of speech in part of speech hierarchical tree successively marks node by node.

11. part-of-speech tagging methods as claimed in claim 9, wherein part-of-speech tagging step comprises:

CRF aspect of model constitution step is successively structural attitude data node by node of application CRF model for text to be marked; And

CRF part-of-speech tagging step, correspondingly successively carries out part-of-speech tagging node by node according to the characteristic of the each structure of characteristic constitution step;

Wherein CRF aspect of model constitution step is constructed the characteristic of CRF model according to following manner structure:

(1) current layer is the 0th layer, extracts the characteristic of the feature templates of the 0th layer of selection while being used for being filled in training CRF model from text to be marked; With

(2) current layer is not the 0th layer, uses the characteristic of the 0th layer and extract characteristic from utilize the second text last layer CRFs model marks text to be marked.

12. part-of-speech tagging methods as claimed in claim 9, also comprise:

Part of speech hierarchical tree construction step, builds part of speech hierarchical tree by the relation analysis between the part of speech that marks text in part-of-speech tagging training set.

13. part-of-speech tagging methods as claimed in claim 12, wherein part of speech hierarchical tree construction step comprises:

Part of speech feature templates is selected step, selects to characterize the feature templates of part of speech feature;

Proper vector construction step, according to the feature templates of selecting, for the part of speech in part-of-speech tagging training set builds characteristic of correspondence vector;

Similarity calculation procedure, utilizes proper vector to calculate the similarity between part of speech; And

Sorting procedure, carries out cluster according to similarity to part of speech, to generate part of speech hierarchical tree.

14. part-of-speech tagging methods as claimed in claim 13, also comprise:

Test set constitution step, the text collection that random selection has marked part of speech from part-of-speech tagging training set is as test set;

Appraisal procedure, to utilizing part-of-speech tagging model to assess the result of carrying out part-of-speech tagging from the text to be marked in test set; And

Set-up procedure, adjusts part of speech hierarchical tree according to assessment result.

15. part-of-speech tagging methods as claimed in claim 14, wherein set-up procedure comprises the step that threshold value that part of speech hierarchical tree construction step is used when the similarity of calculating between part of speech is adjusted.

16. part-of-speech tagging methods as claimed in claim 9, also comprise:

Do not log in word part of speech conjecture Construction of A Model step, do not log in word part of speech conjecture model from part-of-speech tagging training set learning word-building rule structure; And

Do not log in word part of speech correction step, use does not log in word part of speech conjecture model and carries out part-of-speech tagging to not logging in word, and the part of speech that does not log in word that uses part-of-speech tagging model mark part of speech is revised.

17. 1 kinds for training the device of part-of-speech tagging model, comprising:

18. 1 kinds for training the method for part-of-speech tagging model, comprising: