CN101866337A - Part-of-speech tagging system, and device and method thereof for training part-of-speech tagging model - Google Patents


Info

Publication number
CN101866337A
CN101866337A CN200910132711A CN200910132711A CN101866337A CN 101866337 A CN101866337 A CN 101866337A CN 200910132711 A CN200910132711 A CN 200910132711A CN 200910132711 A CN200910132711 A CN 200910132711A CN 101866337 A CN101866337 A CN 101866337A
Authority
CN
China
Prior art keywords
speech
tagging
model
node
speech tagging
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200910132711A
Other languages
Chinese (zh)
Other versions
CN101866337B (en
Inventor
胡长建
赵凯
邱立坤
沈国阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC China Co Ltd
Renesas Electronics China Co Ltd
Original Assignee
NEC China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC China Co Ltd
Priority to CN200910132711.3A (patent CN101866337B)
Priority to JP2010077274A (patent JP5128629B2)
Publication of CN101866337A
Application granted
Publication of CN101866337B
Legal status: Expired - Fee Related
Anticipated expiration

Abstract

The invention relates to a part-of-speech tagging system comprising a part-of-speech tagging model training device and a part-of-speech tagging device. The training device trains the part-of-speech tagging model layer by layer and node by node, based on a part-of-speech hierarchy tree, using the tagged first texts in a part-of-speech tagging training set; the tagging device tags the parts of speech of the texts to be tagged using the trained part-of-speech tagging model. The invention also relates to a part-of-speech tagging method, and to a device and a method for training the part-of-speech tagging model. The system and method of the invention make part-of-speech tagging with a large-scale tag set feasible and improve tagging precision.

Description

Part-of-speech tagging system, and device and method for training a part-of-speech tagging model
Technical field
The present invention relates to the field of natural language processing and, in particular, to a part-of-speech tagging system and to a device and method for training a part-of-speech tagging model.
Background art
With the spread of the Internet and the growing informatization of society, the amount of natural language text accessible to computers has grown to an unprecedented degree, and demand is rising quickly for applications such as text mining over massive data, information extraction, cross-language information processing, and human-computer interaction. Natural language processing is one of the core technologies for meeting this demand. Part-of-speech tagging assigns the correct part of speech to each word in a text and is a foundation of natural language processing. Because tagging results directly affect higher-level processing (for example, word frequency statistics, syntactic analysis, chunk parsing, and semantic analysis), an efficient and accurate part-of-speech tagging method and system is extremely important.
Part-of-speech tagging is a sequence labeling problem, and conditional random fields (CRFs) are widely used for sequence labeling in natural language. A CRF is in essence an undirected graphical model that computes the conditional probability of specified output node values given the input node values. It can express long-distance dependencies between elements as well as overlapping features, and is therefore suited to information extraction tasks with strong global dependencies. It avoids the strong independence assumptions of directed graphical models such as maximum entropy (ME) models and hidden Markov models (HMMs), overcomes the label bias problem they exhibit, and is at present the best statistical machine learning model for sequence labeling. A good part-of-speech tagging model requires richer features and training on a large-scale tagged corpus. However, CRF training is very time-consuming and computationally expensive, and its demands on training time and computing resources grow exponentially with the number of tags. CRF models are therefore seldom used in large systems with large tag sets (such as part-of-speech tagging systems) and are usually confined to settings with few features and small corpora. Given the high accuracy required of part-of-speech tagging, how to apply CRF models to tagging with a large tag set and a large-scale corpus is a problem in urgent need of a solution.
Some related solutions to the above problem exist. For example, document 1 (Cohn T, Smith A, Osborne M. Scaling conditional random fields using error-correcting codes. In Proc. the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), Ann Arbor, Michigan: Association for Computational Linguistics, June 2005, pp. 10-17.) gives a method for applying CRFs to a large tag set. The document introduces error-correcting output codes (ECOC) to solve the CRF training problem under a large tag set. ECOC is an ensemble method: redundant decision functions are defined first, which is the encoding step, and the final classification function is then constructed from those decision functions, which is the decoding step. The detailed process is as follows.
Training process (encoding)
1) Suppose the tag set has m labels (for example, NN-noun, VB-verb, JJ-adjective, RB-adverb). An ECOC of length n is selected manually; the purpose of this code is to map each label to an n-bit vector, as in the following example:
Table 1
With this encoding, the original tagging problem (which can also be viewed as a multi-class classification problem) is transformed into n independent binary classification problems, each column of the code corresponding to one binary classifier. For example, the third classifier, highlighted by the black box in Table 1, separates words tagged "NN" or "JJ" from words tagged "VB" or "RB".
2) Build the training corpus for each binary classifier by modifying the original corpus: put simply, each tag in the corpus is replaced with the corresponding bit of its code. To build the corpus for the third classifier above, for instance, every "NN" and "JJ" tag in the original corpus is relabeled "1", and every "VB" and "RB" tag is replaced with "0". Once the modified corpus is obtained, the corresponding binary classifier is trained with the conventional CRF training method.
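The relabeling in step 2) can be sketched as follows. Table 1 is not reproduced in this text, so the code matrix `ECOC` below is a hypothetical 5-bit example; only its third column (index 2) follows the description, mapping NN and JJ to 1 and VB and RB to 0. The function name `binary_corpus` and the sample sentence are also assumptions for illustration.

```python
# Hypothetical 5-bit code matrix (Table 1 of the source is not available here).
# Only column index 2 is taken from the text: NN, JJ -> 1 and VB, RB -> 0.
ECOC = {
    "NN": [1, 0, 1, 0, 1],
    "VB": [0, 1, 0, 0, 1],
    "JJ": [1, 1, 1, 1, 0],
    "RB": [0, 0, 0, 1, 0],
}

def binary_corpus(corpus, column):
    """Replace every tag with the bit that the chosen code column assigns to it."""
    return [[(word, ECOC[tag][column]) for word, tag in sentence]
            for sentence in corpus]

# Illustrative tagged sentence (not from the source).
corpus = [[("NEC", "NN"), ("develops", "VB"), ("new", "JJ"), ("quickly", "RB")]]
third = binary_corpus(corpus, 2)  # training corpus for the third binary classifier
```

Each of the n columns yields one such corpus, and one binary CRF would be trained on each.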
Model use (decoding)
1) Take any sentence, for example "NEC Develops world-leading technology to prevent IP phone spam".
2) Tag the sentence with each of the trained binary classifiers and record the results. Suppose the results are as follows:
Figure B2009101327113D0000031
As shown above, each word obtains an n-bit vector. A commonly used comparison strategy then matches this vector against the code vectors in Table 1 to find the closest label, and the word is tagged with that label. For the word "Develops", for example, the n-bit vector is closest to the code for "VB", so the system tags "Develops" as "VB-verb".
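The matching step can be illustrated with a small sketch. The text only says that a commonly used strategy is applied; the sketch assumes nearest-code matching by Hamming distance, and the code matrix `ECOC` is a hypothetical stand-in for Table 1, which is not reproduced here.

```python
# Hypothetical code matrix standing in for Table 1 of the source.
ECOC = {
    "NN": [1, 0, 1, 0, 1],
    "VB": [0, 1, 0, 0, 1],
    "JJ": [1, 1, 1, 1, 0],
    "RB": [0, 0, 0, 1, 0],
}

def hamming(a, b):
    """Number of positions at which two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(bits):
    """Return the label whose code is closest to the classifier outputs (assumed strategy)."""
    return min(ECOC, key=lambda label: hamming(ECOC[label], bits))

# One classifier (the last bit) disagrees with the VB code, yet VB is still the nearest.
label = decode([0, 1, 0, 0, 0])
```

The redundancy in the code is what lets a single classifier error be absorbed, but every word still requires the output of all n classifiers.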
The existing technique does not solve the problem of applying CRFs to part-of-speech tagging with a large tag set very effectively, and it remains some distance from practical use. Specifically:
1) The performance of the method of document 1 depends heavily on the choice of ECOC, but choosing a good ECOC is difficult.
2) The scheme does not fundamentally remove the long training time or the heavy dependence on high-end computing resources. Training in document 1 requires n binary classifiers, where n depends on the chosen ECOC; for part-of-speech tagging this value is large, so training still takes a long time and still depends on high-end computing resources. In decoding, moreover, every binary classifier must be applied in turn and the cumbersome code-matching step added, so applying the trained model is also time-consuming and likewise dependent on high-end computing resources.
Summary of the invention
The present invention introduces hierarchical classification of parts of speech and combines it with stacked CRF models to solve the difficulty of applying a conventional CRF to part-of-speech tagging with a large tag set. The invention automatically analyzes the intrinsic relations between parts of speech in the training set and builds a part-of-speech hierarchy tree that organizes all parts of speech according to those relations. Based on this tree, the invention introduces stacked CRF models, so that the number of tags handled at each layer is reduced, specifies the relations between the individual models, and finally trains a stacked CRF part-of-speech tagging model for a large tag set automatically. To address the data sparsity that the training set may exhibit, the invention also trains a part-of-speech guessing model for out-of-vocabulary words based on word-formation rules, further improving tagging precision.
According to a first aspect of the invention, a part-of-speech tagging system is provided, comprising: a part-of-speech tagging model training device for training a part-of-speech tagging model layer by layer and node by node, based on a part-of-speech hierarchy tree, using tagged first texts in a part-of-speech tagging training set; and a part-of-speech tagging device for tagging the parts of speech of a text to be tagged using the trained part-of-speech tagging model.
According to a second aspect of the invention, a part-of-speech tagging method is provided, comprising: a part-of-speech tagging model training step of training a part-of-speech tagging model layer by layer and node by node, based on a part-of-speech hierarchy tree, using tagged first texts in a part-of-speech tagging training set; and a part-of-speech tagging step of tagging the parts of speech of a text to be tagged using the trained part-of-speech tagging model.
According to a third aspect of the invention, a device for training a part-of-speech tagging model is provided, comprising: a CRF model training corpus construction unit for relabeling, layer by layer and node by node using the part-of-speech hierarchy tree, tagged first texts in a part-of-speech tagging training set into second texts, thereby constructing CRF model training corpora; and a CRF model training unit for training CRF models layer by layer and node by node, using the second texts produced at each relabeling by the corpus construction unit, to obtain the part-of-speech tagging model.
According to a fourth aspect of the invention, a method for training a part-of-speech tagging model is provided, comprising: a CRF model training corpus construction step of relabeling, layer by layer and node by node using the part-of-speech hierarchy tree, tagged first texts in a part-of-speech tagging training set into second texts, thereby constructing CRF model training corpora; and a CRF model training step of training CRF models layer by layer and node by node, using the second texts produced at each relabeling in the corpus construction step, to obtain the part-of-speech tagging model.
The invention fundamentally solves the problem of using CRFs for part-of-speech tagging with a large tag set. Specifically:
1) It enables CRF models to be used for part-of-speech tagging with a large tag set and removes the dependence on long training times and high-end computing resources; the proposed system and method can train a part-of-speech tagging model on an ordinary PC.
2) It improves tagging precision, for two reasons. First, part-of-speech sequence labeling is a task with strong global dependencies, so introducing CRF models achieves a global optimum effectively and improves precision. Second, the guessing mechanism for out-of-vocabulary words based on word-formation rules effectively mitigates training-set sparsity and also improves overall precision.
3) The method is fully automatic and greatly reduces the manual cost of training and optimizing a part-of-speech tagging model.
Description of drawings
Fig. 1a is a schematic diagram of the part-of-speech tagging system according to the first embodiment of the invention;
Fig. 1b is a flowchart of the part-of-speech tagging method according to the first embodiment of the invention;
Fig. 2 is a schematic diagram of the part-of-speech hierarchy tree construction device according to the invention;
Fig. 3 is a flowchart of the part-of-speech hierarchy tree construction method according to the invention;
Fig. 4a shows an example structure of the part-of-speech hierarchy tree;
Figs. 4b and 4c show examples of the data structure of the part-of-speech hierarchy tree;
Fig. 5a is a schematic structural diagram of the part-of-speech tagging model training device according to the invention;
Fig. 5b is a flowchart of the part-of-speech tagging model training method according to the invention;
Fig. 6a is a schematic diagram of the part-of-speech tagging device according to the invention;
Fig. 6b is a flowchart of the part-of-speech tagging method according to the invention;
Fig. 7a is a schematic diagram of the part-of-speech tagging system according to the second embodiment of the invention;
Fig. 7b is a flowchart of the part-of-speech tagging method according to the second embodiment of the invention;
Fig. 8a is a schematic diagram of the part-of-speech tagging system according to the third embodiment of the invention;
Fig. 8b is a flowchart of the part-of-speech tagging method according to the third embodiment of the invention.
Embodiments
Preferred embodiments of the invention are described below with reference to the drawings. In the drawings, identical elements are denoted by identical reference symbols or numerals. In the following description, detailed descriptions of known functions and configurations are omitted so as not to obscure the subject matter of the invention.
Fig. 1a is a schematic structural diagram of the part-of-speech tagging system according to the first embodiment of the invention. The part-of-speech tagging training set 10 in part-of-speech tagging system 1 contains a large number of tagged texts, that is, a collection of tagged texts. The part-of-speech hierarchy tree construction device 14 analyzes the relations between parts of speech based on the tagged texts in training set 10 and, from the analyzed relations, builds the part-of-speech hierarchy tree 15, which organizes hierarchically the parts of speech that appear as tags in the training set; the relation may be, for example, the similarity between parts of speech. The part-of-speech tagging model training device 12 trains and generates the part-of-speech tagging model 13: it reads the tagged texts from training set 10 and, according to the hierarchy information in tree 15, sets up the model training process to train the CRF part-of-speech tagging model 13 used for tagging; the model obtained is a stacked part-of-speech tagging model. The part-of-speech tagging device 22 uses the resulting model to tag the parts of speech of the words in untagged text.
Although the system shown in Fig. 1a includes the part-of-speech hierarchy tree construction device 14, it will be understood that the system may omit this device and instead use an already built part-of-speech hierarchy tree, for example a manually constructed one, to tag the text. The system may also comprise only the part-of-speech tagging model training device 12, which generates the part-of-speech tagging model 13 used for tagging.
Part-of-speech hierarchy tree 15 organizes the parts of speech in a tree hierarchy. Fig. 4a shows an example structure; the tree in this example has 4 layers, numbered 0, 1, 2, and 3, and layers 2 and 3 have 6 nodes. The leaf nodes of the tree correspond to real parts of speech; the remaining nodes are virtual class names chosen arbitrarily. Figs. 4b and 4c show examples of the data structure of the tree of Fig. 4a.
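Figs. 4a to 4c are not reproduced in this text, so the following is only a sketch of one possible data structure for such a tree: a parent-to-children mapping. The node names label1 to label4, label11, and label12 and the placement of the tags v, a, m, ns, and n follow the worked example later in the description; the root name, the tag under label4, and the overall shape are assumptions, not the tree of Fig. 4a.

```python
# Simplified, partly hypothetical hierarchy: internal nodes are virtual class names,
# leaves are real part-of-speech tags.
TREE = {
    "root": ["label1", "label2", "label3", "label4"],
    "label1": ["label11", "label12"],
    "label11": ["a"],
    "label12": ["v"],
    "label2": ["m"],
    "label3": ["ns", "n"],
    "label4": ["d"],  # hypothetical tag, not taken from the source
}

def leaves(node):
    """Real parts of speech found under a node."""
    children = TREE.get(node)
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves(child)]

def layers(root="root"):
    """Nodes grouped by depth, layer 0 first."""
    result, current = [], [root]
    while current:
        result.append(current)
        current = [child for node in current for child in TREE.get(node, [])]
    return result
```

With this representation, the layer and node indices used during training and tagging can be read off `layers()` directly.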
Fig. 1b is a flowchart of the part-of-speech tagging method. At S101, the part-of-speech hierarchy tree construction device 14 builds the part-of-speech hierarchy tree 15, which organizes hierarchically the parts of speech that appear as tags in the training set. At S102, the part-of-speech tagging model training device 12 reads the tagged texts from training set 10 and, according to the hierarchy information in tree 15, generates the part-of-speech tagging model 13, which is a tagging model with a stacked structure. At S103, the part-of-speech tagging device 22 tags the input text using the generated part-of-speech tagging model 13.
How the part-of-speech hierarchy tree 15 is generated is first described below with reference to Fig. 2 and Fig. 3.
Fig. 2 is a schematic structural diagram of the part-of-speech hierarchy tree construction device 14 according to the invention. The part-of-speech feature template selection unit 140 selects a feature template that characterizes the grammatical behavior of a part of speech. This behavior can be characterized in several ways; for example, the previous word, the part of speech of the previous word, the next word, and the part of speech of the next word, relative to the current word in the tagged text, can be chosen as the feature template. The feature vector construction unit 141 builds, according to the selected template, the corresponding feature vectors for each part of speech that occurs in training set 10. The similarity calculation unit 142 uses these feature vectors to compute the similarity of any two parts of speech in training set 10. The clustering unit 143 clusters all parts of speech in training set 10 according to the computed similarities using a conventional hierarchical clustering algorithm and generates the part-of-speech hierarchy tree 15 according to a predetermined rule.
Fig. 3 is a flowchart of the method by which the construction device generates the part-of-speech hierarchy tree. At S301, the feature template selection unit 140 selects the features of a part of speech, for example the previous word, the part of speech of the previous word, the next word, and the part of speech of the next word, relative to the current word in the tagged text. For the tagged text "Hong Kong/ns choose/v ten/m big/a outstanding/a youth/n", with current word "choose" and current part of speech "v", the feature representation is as follows:
Figure B2009101327113D0000081
At S302, the feature vector construction unit 141 builds, according to the feature template, the corresponding feature vectors for every part of speech that occurs in training set 10. For example, if the training set contains dz words and lz parts of speech, then, given the features selected above, the unit builds the following vectors for any part of speech x:
1) x<previous word>: the previous-word vector, of dimension dz; each element gives the frequency with which a particular word appears before a word of part of speech x.
2) x<previous part of speech>: the previous-part-of-speech vector, of dimension lz; each element gives the frequency with which a particular part of speech appears before a word of part of speech x.
3) x<next word>: the next-word vector, of dimension dz; each element gives the frequency with which a particular word appears after a word of part of speech x.
4) x<next part of speech>: the next-part-of-speech vector, of dimension lz; each element gives the frequency with which a particular part of speech appears after a word of part of speech x.
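As an illustration of S302, the four vectors can be accumulated in one pass over the tagged corpus. This is a sketch under assumptions: the vectors are stored sparsely as `Counter` objects rather than as dense arrays of dimension dz or lz, raw counts stand in for frequencies, and the function and key names are hypothetical.

```python
from collections import Counter, defaultdict

def build_pos_vectors(corpus):
    """For every tag, count the words and tags seen immediately before and after it."""
    vectors = defaultdict(lambda: {
        "prev_word": Counter(), "prev_tag": Counter(),
        "next_word": Counter(), "next_tag": Counter(),
    })
    for sentence in corpus:
        for k, (word, tag) in enumerate(sentence):
            if k > 0:  # a previous word exists
                prev_word, prev_tag = sentence[k - 1]
                vectors[tag]["prev_word"][prev_word] += 1
                vectors[tag]["prev_tag"][prev_tag] += 1
            if k + 1 < len(sentence):  # a next word exists
                next_word, next_tag = sentence[k + 1]
                vectors[tag]["next_word"][next_word] += 1
                vectors[tag]["next_tag"][next_tag] += 1
    return vectors

# The tagged example sentence from the description.
corpus = [[("Hong Kong", "ns"), ("choose", "v"), ("ten", "m"),
           ("big", "a"), ("outstanding", "a"), ("youth", "n")]]
vecs = build_pos_vectors(corpus)
```

For the tag "v" in this one-sentence corpus, the previous-word vector has a single non-zero entry for "Hong Kong" and the next-part-of-speech vector a single non-zero entry for "m".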
At S303, the similarity calculation unit 142 computes the similarity of any two parts of speech in training set 10 in the following steps. For example, for parts of speech x1 and x2:
1) First compute the similarity of each pair of corresponding feature vectors of the two parts of speech (x1, x2):
Simc(x1<previous word>, x2<previous word>),
Simc(x1<previous part of speech>, x2<previous part of speech>),
Simc(x1<next word>, x2<next word>),
Simc(x1<next part of speech>, x2<next part of speech>)
2) Then compute the overall similarity with the following formula:
Sim(x1, x2) = w1*Simc(x1<previous word>, x2<previous word>) +
w2*Simc(x1<previous part of speech>, x2<previous part of speech>) +
w3*Simc(x1<next word>, x2<next word>) +
w4*Simc(x1<next part of speech>, x2<next part of speech>)
where w1+w2+w3+w4=1.
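The weighted combination can be sketched as follows. The description does not define Simc, so cosine similarity is assumed here; the equal weights, the dictionary keys, and the sample vectors are likewise illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (assumed choice for Simc)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def sim(x1, x2, weights=(0.25, 0.25, 0.25, 0.25)):
    """Sim(x1, x2) as the weighted sum of the four per-feature similarities."""
    keys = ("prev_word", "prev_tag", "next_word", "next_tag")
    return sum(w * cosine(x1[k], x2[k]) for w, k in zip(weights, keys))

# Sparse vectors for two tags, taken from the single example sentence.
v_vec = {"prev_word": {"Hong Kong": 1}, "prev_tag": {"ns": 1},
         "next_word": {"ten": 1}, "next_tag": {"m": 1}}
m_vec = {"prev_word": {"choose": 1}, "prev_tag": {"v": 1},
         "next_word": {"big": 1}, "next_tag": {"a": 1}}
same = sim(v_vec, v_vec)
different = sim(v_vec, m_vec)
```

Because the weights sum to 1 and each cosine lies in [0, 1] for count vectors, the overall similarity also lies in [0, 1].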
At step S304, the clustering unit 143 clusters all parts of speech according to the computed similarities using a hierarchical clustering algorithm (for example, the K-means clustering algorithm) and generates the hierarchy tree according to a predetermined rule. In the present invention, the rule may be that the number of nodes in each layer is less than n (n being a positive integer), for example n = 8.
How the part-of-speech tagging model is generated is described below with reference to Fig. 5a and Fig. 5b. Fig. 5a is a structural diagram of the part-of-speech tagging model training device 12 according to the invention. The training device 12 comprises a CRF model training corpus construction unit 121, a CRF model training unit 122, and a logic circuit 120. The corpus construction unit 121 relabels the training texts read from training set 10 layer by layer and node by node according to the hierarchy tree 15. The training unit 122 correspondingly trains CRF models layer by layer and node by node on the training texts produced at each relabeling by unit 121. The logic circuit 120 controls units 121 and 122 during model training. It holds the layer index of the hierarchy tree and, after units 121 and 122 have finished processing a layer, increments the layer index by 1, until all nodes of the last layer of the tree have been processed.
Fig. 5b is a flowchart of the method by which the training device generates the part-of-speech tagging model. The flowchart contains a training procedure built from two nested loops. The procedure trains top-down: the training result of an upper layer affects the layer below, while training within the same layer can proceed independently. Suppose the hierarchy tree has n layers, layer i has m_i nodes, and the current node is j. First, at S601, logic circuit 120 initializes the layer index i to 0. At S602, it sets the node index j to 1. Then, at S603, unit 121 constructs the <i, j> CRF model training corpus by replacing each part-of-speech tag in the tagged texts of the original training set 10 with the name of the child node of the current node under which that tag lies in the hierarchy tree. At S604, unit 122 trains the <i, j> CRF model using the <i, j> corpus and the selected feature template. When i = 0, the feature template selected by unit 122 comprises the two words before and the two words after the current word, the first and last characters of the current word, and the co-occurrences among the two words before and after. When i > 0, in addition to the layer-0 template, the template also comprises the parts of speech of the two words before and after in the tagging result of the layer above, together with the co-occurrences between those parts of speech and between words and parts of speech. At S605, j is incremented by 1, and at S606 it is determined whether j is greater than m_i. If j is not greater than m_i, execution continues at S603; otherwise, at S607, i is incremented by 1 and execution returns to S602, until S603 and S604 have been performed for the nodes of all layers of the hierarchy tree, yielding a stacked part-of-speech tagging model applicable to a large tag set.
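The double loop of Fig. 5b can be sketched as a top-down traversal of the tree. This is only an outline under assumptions: `build_corpus` and `train_crf` are hypothetical callbacks standing in for units 121 and 122, the tree is a parent-to-children mapping, and nodes without children are skipped on the assumption that a leaf has nothing left to refine.

```python
def train_stacked(tree, root, build_corpus, train_crf):
    """Train one model per node <i, j>, layer by layer and node by node."""
    models = {}
    layer, i = [root], 0                      # S601: start at layer 0
    while layer:
        for j, node in enumerate(layer, start=1):   # S602, S605, S606: nodes 1..m_i
            if not tree.get(node):
                continue                      # leaf node: skipped (assumption)
            corpus = build_corpus(node)       # S603: relabel the corpus for <i, j>
            # S604: layers below 0 also use features from the layer above
            models[(i, j)] = train_crf(corpus, use_upper_layer_features=(i > 0))
        layer = [child for node in layer for child in tree.get(node, [])]
        i += 1                                # S607: move to the next layer
    return models

# Toy tree and stub callbacks that only record what would be trained.
toy_tree = {"root": ["label1", "label2"], "label1": ["a", "v"], "label2": ["m", "q"]}
models = train_stacked(
    toy_tree, "root",
    build_corpus=lambda node: node,
    train_crf=lambda corpus, use_upper_layer_features: (corpus, use_upper_layer_features),
)
```

Since nodes within one layer do not depend on each other, the inner loop could in principle be run in parallel, which is consistent with the remark that training within a layer is independent.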
For example, given a fully tagged sentence:
Hong Kong/ns choose/v ten/m big/a outstanding/a youth/n
At layer 0, the <0,1> CRF model training corpus is constructed. The sentence is first relabeled. Referring to the hierarchy tree of Fig. 4a, the child nodes of the first node of layer 0 are "label1", "label2", "label3", and "label4". Because the real part of speech "v" in Fig. 4a lies under the layer-1 node "label1", every word tagged "v" in the original training set is relabeled "label1".
After the relabeling at layer 0, the sentence becomes:
Hong Kong/label3 choose/label1 ten/label2 big/label1 outstanding/label1 youth/label3
At layer 0, the CRF model is then trained. The selected feature template comprises, for words such as "Hong Kong" and "choose", the two words before and after, the first and last characters of the current word, and the co-occurrences among the two words before and after (co-occurrence means that two words appear together in a given context).
Next, the sentence is relabeled again at layer 1. For the first node of layer 1, <1,1>, the <1,1> CRF model training corpus is constructed. Referring to the hierarchy tree of Fig. 4a, the child nodes of <1,1> are "label11" and "label12", so words tagged "label1" at layer 0 are refined to "label11" or "label12", that is, to the names of the child nodes of the current node.
For the layer-0 result "Hong Kong/label3 choose/label1 ten/label2 big/label1 outstanding/label1 youth/label3", the training corpus of node <1,1> after relabeling is:
Hong Kong/label3 choose/label12 ten/label2 big/label11 outstanding/label11 youth/label3
The <1,1> CRF model is then trained. In addition to the layer-0 template, the selected feature template comprises the parts of speech of the two words before and after in the tagging result of the layer above, together with the co-occurrences between those parts of speech and between words and parts of speech. For the word "choose", for example, these are the parts of speech "label3" and "label2" of the neighboring words "Hong Kong" and "ten", the co-occurrences between those parts of speech, and the co-occurrences between words and parts of speech.
Similarly, CRF model training corpus construction and CRF model training are carried out for nodes <1,2>, <1,3>, and <1,4>, and so on until they have been carried out for all nodes of all layers.
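The relabeling in this worked example can be reproduced with a short sketch. The tree below is a simplified, partly hypothetical stand-in for Fig. 4a (which is not reproduced here), chosen so that it agrees with the labels in the example; the function names and the tag under label4 are assumptions.

```python
# Simplified stand-in for the tree of Fig. 4a; "d" under label4 is hypothetical.
TREE = {
    "root": ["label1", "label2", "label3", "label4"],
    "label1": ["label11", "label12"],
    "label11": ["a"], "label12": ["v"],
    "label2": ["m"], "label3": ["ns", "n"], "label4": ["d"],
}

def leaves(node):
    """Real parts of speech found under a node."""
    children = TREE.get(node)
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves(child)]

def relabel(original, current, node):
    """Replace labels equal to `node` with the child of `node` covering the real tag.

    original: list of (word, real_tag); current: labels from the layer above,
    or None at layer 0, where every word is relabeled.
    """
    child_of = {leaf: child for child in TREE[node] for leaf in leaves(child)}
    labels = []
    for k, (word, tag) in enumerate(original):
        if current is None or current[k] == node:
            labels.append(child_of[tag])   # refine to the matching child node
        else:
            labels.append(current[k])      # outside this node: keep the upper-layer label
    return labels

sentence = [("Hong Kong", "ns"), ("choose", "v"), ("ten", "m"),
            ("big", "a"), ("outstanding", "a"), ("youth", "n")]
layer0 = relabel(sentence, None, "root")       # corpus labels for node <0,1>
layer1 = relabel(sentence, layer0, "label1")   # corpus labels for node <1,1>
```

Each model therefore only has to choose among the children of one node, plus the coarse labels kept from the layer above, which is how the number of tags per model stays small.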
Fig. 6 a shows the structural drawing of part-of-speech tagging device.Referring to Fig. 6 a, part-of-speech tagging device 22 comprises logical circuit 222, CRF aspect of model tectonic element 220 and CRF part-of-speech tagging unit 221.Logical circuit 222 is according to stacked part-of-speech tagging model, and control CRF aspect of model tectonic element 220 and CRF part-of-speech tagging unit 221 carry out part-of-speech tagging.CRF aspect of model tectonic element 220 is under the control of logical circuit 222, be text application<i to be marked, j〉CRF model structural attitude node by node successively, CRF part-of-speech tagging unit 221 according to the characteristic of latent structure unit 220 each structures, correspondingly successively carries out part-of-speech tagging node by node under the control of logical circuit 222.
Fig. 6b is a flowchart of the layered CRF part-of-speech tagging method carried out by the part-of-speech tagging device. Suppose the part-of-speech tagging model has n layers, layer i has m_i nodes, and the current node is j. First, at S901, the logic circuit 222 initializes the layer index i to 0. At S902, the logic circuit 222 sets the node index j to 1. Then, at S903, the CRF model feature construction unit 220 constructs the feature data for applying the <i,j> CRF model: according to the feature template set while training the part-of-speech tagging model, it builds the input feature data of the CRF model, using one of the following two methods depending on the layer i:
1) When i equals 0, the feature template of the CRF model is filled in; that is, the relevant feature information is extracted directly from the input text to be tagged and filled into the template to generate the input feature data of the corresponding CRF model.
2) When i is not equal to 0, in addition to the relevant feature information obtained as in layer 0, corresponding feature information is extracted from the result of tagging the text with the CRF models of layer i-1, to generate the input feature data of the corresponding CRF model.
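The two cases can be illustrated with a small feature builder. This is a sketch under stated assumptions: the feature names (`w[-1]`, `pos[+1]` style keys), the padding symbol and the function name `build_features` are hypothetical, and only one word/label co-occurrence is shown rather than the full template.

```python
PAD = "<pad>"  # hypothetical marker for positions outside the sentence


def build_features(words, layer, prev_labels=None, window=2):
    """Return one feature dict per word.

    Layer 0 uses only the words in a +/-2 window; deeper layers also add the
    labels that the previous layer assigned to the same window.
    """
    if layer > 0 and prev_labels is None:
        raise ValueError("layers above 0 need the previous layer's labels")
    n = len(words)
    feats = []
    for t in range(n):
        f = {}
        for d in range(-window, window + 1):
            k = t + d
            inside = 0 <= k < n
            f[f"w[{d}]"] = words[k] if inside else PAD
            if layer > 0:
                f[f"pos[{d}]"] = prev_labels[k] if inside else PAD
        if layer > 0:
            # Word/label co-occurrence for the current position.
            f["w[0]|pos[0]"] = f"{words[t]}|{prev_labels[t]}"
        feats.append(f)
    return feats


words = ["Beijing", "shortlisted", "ten", "big", "livable", "city"]
layer0 = ["label3", "label1", "label2", "label1", "label1", "label3"]
feats0 = build_features(words, 0)
feats1 = build_features(words, 1, layer0)
```

For the word "shortlisted", `feats1[1]` carries the previous-layer labels of its neighbours ("label3" for "Beijing", "label2" for "ten"), while `feats0[1]` contains word features only.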
At S904, based on the feature data obtained, the text to be tagged is tagged using the <i,j> CRF model of the part-of-speech tagging model 13.
At S905, the value of j is increased by 1, and at S906 it is judged whether j is greater than m_i. If j is not greater than m_i, S903 is executed again; otherwise, at S907, the value of i is increased by 1 and S902 is executed, until S903 and S904 have been carried out for the nodes of all layers in the part-of-speech hierarchy tree. By tagging the text layer by layer in this way, part-of-speech tagging with a large-scale tag set is achieved. A simple example is given below to further illustrate the whole tagging process:
Given a text to be tagged: Beijing is shortlisted among the ten most livable cities.
Layer 0 (using the <0,1> CRF model)
The result after tagging is: Beijing/label3 shortlisted/label1 ten/label2 big/label1 livable/label1 city/label3
Layer 1 (using the CRF models of all nodes in this layer)
1. The <1,1> CRF model gives: Beijing/label3 shortlisted/label12 ten/label2 big/label11 livable/label11 city/label3
2. Apply the <1,2> CRF model ...
The annotation result after layer 1 ends is:
Beijing/label32 shortlisted/label12 ten/label21 big/label11 livable/label11 city/label31
Layer 2
1. The <2,1> CRF model gives:
Beijing/label32 shortlisted/label12 ten/label21 big/a livable/a city/label31
2. Apply the <2,2> CRF model ...
Finally, the complete annotation result is obtained:
Beijing/ns shortlisted/v ten/m big/a livable/a city/n
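The control flow of S901 to S907 can be sketched as two nested loops. In this illustration each trained <i,j> CRF model is replaced by a lookup table, which is purely an assumption to keep the example self-contained. The node layout (one node at layer 0, three at layer 1, five at layer 2) is also hypothetical, and feature construction is omitted.

```python
def make_lookup_tagger(parent, table):
    """Stand-in for a trained <i,j> CRF model (an assumption of this sketch):
    it refines every word currently labeled `parent` via a lookup table."""
    def tag(words, labels):
        return [table.get(w, l) if l == parent else l
                for w, l in zip(words, labels)]
    return tag


def tag_layered(words, layers):
    """layers[i] is the list of node models of layer i, applied in order."""
    labels = [None] * len(words)
    i = 0                                  # S901
    while i < len(layers):
        j = 1                              # S902
        while j <= len(layers[i]):         # S906: stop once j > m_i
            model = layers[i][j - 1]
            labels = model(words, labels)  # S903 + S904 (features omitted)
            j += 1                         # S905
        i += 1                             # S907
    return labels


words = ["Beijing", "shortlisted", "ten", "big", "livable", "city"]
layers = [
    [make_lookup_tagger(None, {"Beijing": "label3", "shortlisted": "label1",
                               "ten": "label2", "big": "label1",
                               "livable": "label1", "city": "label3"})],
    [make_lookup_tagger("label1", {"shortlisted": "label12", "big": "label11",
                                   "livable": "label11"}),
     make_lookup_tagger("label2", {"ten": "label21"}),
     make_lookup_tagger("label3", {"Beijing": "label32", "city": "label31"})],
    [make_lookup_tagger("label11", {"big": "a", "livable": "a"}),
     make_lookup_tagger("label12", {"shortlisted": "v"}),
     make_lookup_tagger("label21", {"ten": "m"}),
     make_lookup_tagger("label31", {"city": "n"}),
     make_lookup_tagger("label32", {"Beijing": "ns"})],
]
final = tag_layered(words, layers)
```

Each node only touches words that currently carry its own label, so the labels become finer with every layer until the leaf tags are reached.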
Fig. 7a is a schematic structural diagram of the part-of-speech tagging system of the second embodiment of the invention. Compared with the part-of-speech tagging system shown in Fig. 1a, this system further comprises an evaluation device 16, an adjustment device 17 and a test set construction device 18. The test set construction device 18 randomly selects a set of part-of-speech-tagged texts from the part-of-speech tagging training set 10 to serve as the test set, that is, as the collection of texts to be tagged. The evaluation device 16 evaluates the result of tagging the test set with the part-of-speech tagging model; that is, it measures the tagging precision from the test result. The adjustment device 17 adjusts the part-of-speech hierarchy tree construction device 14 according to the evaluation result of the evaluation device, so that a better-performing part-of-speech hierarchy tree is generated.
Fig. 7b shows a flowchart of the part-of-speech tagging method carried out by this system. Referring to Fig. 7b, at S701, the test set construction device 18 randomly extracts a subset of the part-of-speech tagging training set 10 as the test set. At S702, the part-of-speech tagging system tags the test set using the trained part-of-speech tagging model 13. At S703, the evaluation device 16 evaluates the precision of the tagged test set and sends the evaluation result to the adjustment device 17. Then, at S704, the adjustment device 17 judges the performance of the tagging model from the evaluation result. When the performance of the part-of-speech tagging model does not satisfy a predetermined condition, S705 is executed: the thresholds W1, W2, W3 and W4 used in the part-of-speech hierarchy tree construction device 14 are adjusted to change the clustering result. At S706, the adjustment device uses heuristic rules to adjust the clustering result. An example of a heuristic rule is: "n" and "ns" should be assigned to different groups.
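Steps S701, S703 and S704 can be illustrated with a small evaluation helper. This is a sketch only: the text does not state which precision measure or which predetermined condition is used, so token-level accuracy and the `min_accuracy` threshold are assumptions, as are all function names.

```python
import random


def split_test_set(corpus, fraction=0.1, seed=0):
    """S701: randomly draw a subset of the tagged training set as test set."""
    rng = random.Random(seed)
    k = max(1, int(len(corpus) * fraction))
    return rng.sample(corpus, k)


def token_accuracy(gold, predicted):
    """S703: share of tokens whose predicted tag equals the gold tag.

    gold and predicted are lists of non-empty tag sequences, one per sentence.
    """
    pairs = [(g, p) for gs, ps in zip(gold, predicted) for g, p in zip(gs, ps)]
    return sum(g == p for g, p in pairs) / len(pairs)


def needs_adjustment(gold, predicted, min_accuracy=0.95):
    """S704: decide whether the clustering thresholds should be revisited.

    min_accuracy is a hypothetical stand-in for the predetermined condition.
    """
    return token_accuracy(gold, predicted) < min_accuracy


gold = [["ns", "v", "m", "a", "a", "n"]]
predicted = [["n", "v", "m", "a", "a", "n"]]
accuracy = token_accuracy(gold, predicted)
adjust = needs_adjustment(gold, predicted)
```

In this toy case one of six tags is wrong (n instead of ns), so the check asks for an adjustment; confusing n and ns is the kind of error the heuristic rule at S706 is meant to address.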
Fig. 8a is a structural diagram of the part-of-speech tagging system according to the third embodiment of the invention. Because the corpus contains no training data for unregistered words, the tagging precision for such words is often low, which in turn lowers the overall tagging precision. The part-of-speech tagging system of the invention can correct the parts of speech of unregistered words and thereby improve the overall tagging precision of the system. Compared with the part-of-speech tagging system shown in Fig. 1a, this system further comprises an unregistered-word part-of-speech guessing model construction device 19 and an unregistered-word part-of-speech correction device 21. The construction device 19 learns word-formation rules from the existing part-of-speech tagging training set 10 and builds the unregistered-word part-of-speech guessing model 20 from the learned rules. The correction device 21 uses the guessing model to correct the parts of speech of unregistered words in text that has been tagged with the part-of-speech tagging model 13.
Fig. 8b shows the part-of-speech tagging method according to the third embodiment of the invention. Referring to Fig. 8b, at S801, the unregistered-word part-of-speech guessing model construction device 19 first segments the words in the part-of-speech tagging training set into immediate constituents and analyzes the attributes of those constituents (that is, it finds the immediate constituents of each word in the training set and annotates their attributes) to obtain word constituent sequences.
The definition of an immediate constituent is briefly explained below. The smaller units that make up a larger unit are called the constituents of the larger unit, and the smaller units that directly make up a larger unit are called its immediate constituents. The entries in the part-of-speech tagging training set are themselves words, not units smaller than words, so immediate-constituent segmentation and attribute analysis differ from word segmentation and part-of-speech tagging in the usual sense. Instead, every word in the training set that consists of two or more characters is cut into units one level below it. For a two-character word, the lower-level units are the single characters (morphemes) that make it up. A word of three or more characters is cut into words that exist in the dictionary (by maximum matching) plus the remaining single morphemes. Take "Ministry of Science and Technology" as an example: if the dictionary contains the two words "science" and "technology" but not "science and technology", "technology department" and so on, the segmentation is "science/technology/department"; if the dictionary contains "science", "technology department" and "technology", the segmentation is "science/technology department". An immediate constituent may therefore be either a word or a morpheme. The attribute of an immediate constituent mainly refers to its grammatical attribute, expressed in the form of part-of-speech tags and covering all possible part-of-speech tags.
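The segmentation rule can be sketched with a dictionary-based maximum matching routine. The text only says "maximum matching", so the forward (left-to-right) direction is an assumption, as is the function name. The Chinese strings are an assumed back-translation of the example ("科学技术部" for "Ministry of Science and Technology"), since the translated text does not show the original characters.

```python
def immediate_constituents(word, dictionary):
    """Cut a word into units one level below it by forward maximum matching."""
    if len(word) <= 2:
        return list(word)            # two-character word: single morphemes
    parts, i = [], 0
    while i < len(word):
        # Try the longest candidate first; never return the whole word itself.
        longest = min(len(word) - i, len(word) - 1)
        for size in range(longest, 1, -1):
            if word[i:i + size] in dictionary:
                parts.append(word[i:i + size])
                i += size
                break
        else:
            parts.append(word[i])    # leftover single morpheme
            i += 1
    return parts


word = "科学技术部"  # assumed original of "Ministry of Science and Technology"
small = immediate_constituents(word, {"科学", "技术"})
large = immediate_constituents(word, {"科学", "技术部", "技术"})
```

With the smaller dictionary the result is science/technology/department; once "技术部" (technology department) is in the dictionary, the longer match wins and the result is science/technology department, as described above.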
Table 1 gives the immediate-constituent segmentation and attribute analysis results for the two words "cold violence" and "strafe":
Table 1. Example of word immediate-constituent segmentation and attribute analysis results, with the corresponding sequences (each constituent is followed by its length, its part of speech and its label):
cold violence → cold 2 a N_B, violence 4 n N_E
strafe → sweep 2 v V_B, penetrate 2 v V_E
For the unregistered word "cold penetrating", the word constituent sequence obtained is: cold 2 a, penetrate 2 v
At S802, the unregistered-word part-of-speech guessing model construction device 19 selects the part-of-speech feature template.
At S803, the unregistered-word part-of-speech guessing model construction device 19 converts the generated word constituent sequences using the selected part-of-speech feature template, and generates the unregistered-word part-of-speech guessing model 20 with a known machine learning algorithm. For example, using the guessing model 20, the part of speech of the whole word "cold penetrating" is obtained as: POS(cold 2 a V_B, penetrate 2 v V_E) = V.
At S804, the part-of-speech tagging system uses the generated unregistered-word part-of-speech guessing model 20 to re-tag the unregistered words in the text tagged with the part-of-speech tagging model 13.
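The last two steps can be sketched as follows: the whole-word part of speech is read off the constituent labels (V_B, V_E give V), and only words missing from the training vocabulary are re-tagged. The function names and the `guess` callable, which stands in for the guessing model 20, are assumptions of this illustration.

```python
def pos_from_constituent_tags(tags):
    """Recover the whole-word part of speech from labels such as V_B, V_E."""
    heads = {t.split("_")[0] for t in tags}
    if len(heads) != 1:
        raise ValueError(f"inconsistent constituent labels: {tags}")
    return heads.pop()


def correct_unregistered(tagged, vocabulary, guess):
    """S804: re-tag every word that is missing from the training vocabulary.

    tagged     -- list of (word, tag) pairs produced by the tagging model
    vocabulary -- set of words seen in the training set
    guess      -- callable standing in for the guessing model (hypothetical)
    """
    return [(w, guess(w) if w not in vocabulary else t) for w, t in tagged]


# The labels below are what the guessing model is assumed to output for
# the constituents of "cold penetrating".
guessed = pos_from_constituent_tags(["V_B", "V_E"])
tagged = [("cold penetrating", "n"), ("city", "n")]
corrected = correct_unregistered(tagged, {"city"}, lambda w: guessed)
```

Registered words keep the tag assigned by the tagging model; only the unregistered word receives the guessed part of speech.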
Suppose that, for the word constituent sequence "sweep 2 v V_B, penetrate 2 v V_E", the selected feature template is:
// Part-of-speech of the constituent word
U01:%x[-1,2] // the former constituent's second feature (/) ("/" denotes a null feature)
U02:%x[0,2] // the current constituent's second feature (a)
// Length of the constituent word
U03:%x[1,1] // the next constituent's first feature (2,2)
// The constituent word itself
U04:%x[0,0] // the current constituent's zeroth feature
The word constituent sequence "sweep 2 v V_B, penetrate 2 v V_E" is then converted into input data for a machine learning method such as CRF:
if (T(-1,2)='/') tag='V_B'
if (T(0,2)='v') tag='V_B'
if (T(1,1)='2') tag='V_B'
if (T(0,0)='sweep') tag='V_B'
if (T(-1,2)='v') tag='V_E'
if (T(0,2)='v') tag='V_E'
if (T(1,1)='2') tag='V_E'
if (T(0,0)='penetrate') tag='V_E'
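The way a template line such as `U01:%x[-1,2]` turns into a concrete feature can be sketched with a small macro expander. This is a simplified re-implementation for illustration, not the code of any CRF toolkit; the function name `expand` and the choice to pad every out-of-range position with "/" are assumptions (the text only states that "/" denotes a null feature).

```python
import re

MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")
NULL = "/"  # the text uses "/" for a missing neighbour


def expand(template, rows, t):
    """Expand one template line at position t of a constituent sequence.

    rows[k] holds the columns of constituent k: (word, length, pos).
    """
    def lookup(match):
        k = t + int(match.group(1))   # relative row offset
        col = int(match.group(2))     # absolute column index
        return str(rows[k][col]) if 0 <= k < len(rows) else NULL
    return MACRO.sub(lookup, template)


TEMPLATES = ["U01:%x[-1,2]", "U02:%x[0,2]", "U03:%x[1,1]", "U04:%x[0,0]"]
rows = [("sweep", "2", "v"), ("penetrate", "2", "v")]
features = [[expand(tpl, rows, t) for tpl in TEMPLATES]
            for t in range(len(rows))]
```

For the first constituent this yields the same four values as the V_B lines above (/, v, 2, sweep). For the last constituent the sketch pads the missing next row, so its U03 feature comes out as "/" rather than a length value.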
Although the above describes using the generated unregistered-word part-of-speech guessing model 20 to re-tag the unregistered words in the final tagged text obtained with the part-of-speech tagging model 13, the guessing model 20 can also be used to re-tag the unregistered words in the text tagged at the current layer; that is, the part-of-speech tagging result of the current layer is corrected and then used as feature data for the next layer.
The invention has been illustrated with embodiments that take Chinese text as an example, but the invention can clearly also be used for part-of-speech tagging of English, Japanese and other languages.
Although the invention has been described with reference to specific embodiments, the invention should not be limited by these embodiments, but only by the appended claims. It should be understood that those of ordinary skill in the art can change or modify the embodiments without departing from the scope and spirit of the invention.

Claims (24)

1. A part-of-speech tagging system, comprising:
a part-of-speech tagging model training device for training a part-of-speech tagging model layer by layer and node by node, based on a part-of-speech hierarchy tree, using first texts that have been tagged in a part-of-speech tagging training set; and
a part-of-speech tagging device for tagging the parts of speech of a text to be tagged using the trained part-of-speech tagging model.
2. The part-of-speech tagging system as claimed in claim 1, wherein the part-of-speech tagging model training device comprises:
a CRF model training corpus construction unit for constructing CRF model training corpora by relabeling, layer by layer and node by node using the part-of-speech hierarchy tree, the tagged first texts in the part-of-speech tagging training set into second texts; and
a CRF model training unit for training CRF models layer by layer and node by node, using the second text produced by each relabeling of the CRF model training corpus construction unit, to obtain the part-of-speech tagging model.
3. The part-of-speech tagging system as claimed in claim 2, wherein the CRF model training corpus construction unit relabels layer by layer and node by node by replacing each part-of-speech tag in the first text with the name of the child node of the current node that corresponds to the position of that part of speech in the part-of-speech hierarchy tree.
4. The part-of-speech tagging system as claimed in claim 3, wherein the CRF model training unit selects feature templates for training the CRF models layer by layer and node by node in the following manner:
(a) when the current layer is layer 0, the feature template comprises, for each word in the second text, the two words before and the two words after it, the preceding and following characters of the current word, and the co-occurrences among the two words before and after it; and
(b) when the current layer is not layer 0, the feature template comprises the feature template selected for layer 0 together with the parts of speech, in the second text of the previous layer, of the two words before and the two words after each word, the co-occurrences between those parts of speech, and the co-occurrences between words and parts of speech.
5. The part-of-speech tagging system as claimed in claim 2, wherein the part-of-speech tagging device comprises:
a CRF model feature construction unit for constructing, layer by layer and node by node, the feature data for applying the CRF models to the text to be tagged; and
a CRF part-of-speech tagging unit for performing part-of-speech tagging layer by layer and node by node according to the feature data constructed each time by the feature construction unit.
6. The part-of-speech tagging system as claimed in claim 5, wherein the CRF model feature construction unit constructs the feature data of the CRF model in the following manner:
(a) when the current layer is layer 0, feature data are extracted from the text to be tagged to fill the feature template selected for layer 0 when the CRF model was trained; and
(b) when the current layer is not layer 0, the layer-0 feature data are used, and feature data are also extracted from the second text obtained by tagging the text to be tagged with the CRF models of the previous layer.
7. The part-of-speech tagging system as claimed in claim 1, further comprising:
a part-of-speech hierarchy tree construction device for building the part-of-speech hierarchy tree by analyzing the relations between the parts of speech of the tagged texts in the part-of-speech tagging training set.
8. The part-of-speech tagging system as claimed in claim 7, wherein the part-of-speech hierarchy tree construction device comprises:
a part-of-speech feature template selection unit for selecting feature templates that characterize part-of-speech features;
a feature vector construction unit for constructing, according to the selected feature templates, corresponding feature vectors for the parts of speech in the part-of-speech tagging training set;
a similarity calculation unit for calculating the similarity between parts of speech using the feature vectors; and
a clustering unit for clustering the parts of speech according to the similarity to generate the part-of-speech hierarchy tree.
9. The part-of-speech tagging system as claimed in claim 8, further comprising:
a test set construction device for randomly selecting, from the part-of-speech tagging training set, a set of part-of-speech-tagged texts as a test set;
an evaluation device for evaluating the result of tagging the texts to be tagged from the test set with the part-of-speech tagging model; and
an adjustment device for adjusting the part-of-speech hierarchy tree according to the evaluation result.
10. The part-of-speech tagging system as claimed in claim 9, wherein the adjustment device adjusts the thresholds that the part-of-speech hierarchy tree construction device uses when calculating the similarity between parts of speech.
11. The part-of-speech tagging system as claimed in claim 1 or 2, further comprising:
an unregistered-word part-of-speech guessing model construction device for learning word-formation rules from the part-of-speech tagging training set and constructing an unregistered-word part-of-speech guessing model; and
an unregistered-word part-of-speech correction device for tagging the parts of speech of unregistered words using the unregistered-word part-of-speech guessing model, and for correcting the parts of speech of unregistered words that were tagged using the part-of-speech tagging model.
12. A part-of-speech tagging method, comprising:
a part-of-speech tagging model training step of training a part-of-speech tagging model layer by layer and node by node, based on a part-of-speech hierarchy tree, using first texts that have been tagged in a part-of-speech tagging training set; and
a part-of-speech tagging step of tagging the parts of speech of a text to be tagged using the trained part-of-speech tagging model.
13. The part-of-speech tagging method as claimed in claim 12, wherein the part-of-speech tagging model training step comprises:
a CRF model training corpus construction step of constructing CRF model training corpora by relabeling, layer by layer and node by node using the part-of-speech hierarchy tree, the tagged first texts in the part-of-speech tagging training set into second texts; and
a CRF model training step of training CRF models layer by layer and node by node, using the second text produced by each relabeling of the CRF model training corpus construction step, to obtain the part-of-speech tagging model.
14. The part-of-speech tagging method as claimed in claim 13, wherein the CRF model training corpus construction step comprises relabeling layer by layer and node by node by replacing each part-of-speech tag in the first text with the name of the child node of the current node that corresponds to the position of that part of speech in the part-of-speech hierarchy tree.
15. The part-of-speech tagging method as claimed in claim 14, wherein the CRF model training step selects feature templates for training the CRF models layer by layer and node by node in the following manner:
(a) when the current layer is layer 0, the feature template comprises, for each word in the second text, the two words before and the two words after it, the preceding and following characters of the current word, and the co-occurrences among the two words before and after it; and
(b) when the current layer is not layer 0, the feature template comprises the feature template selected for layer 0 together with the parts of speech, in the second text of the previous layer, of the two words before and the two words after each word, the co-occurrences between those parts of speech, and the co-occurrences between words and parts of speech.
16. The part-of-speech tagging method as claimed in claim 13, wherein the part-of-speech tagging step comprises:
a CRF model feature construction step of constructing, layer by layer and node by node, the feature data for applying the CRF models to the text to be tagged; and
a CRF part-of-speech tagging step of performing part-of-speech tagging layer by layer and node by node according to the feature data constructed each time by the feature construction step.
17. The part-of-speech tagging method as claimed in claim 16, wherein the CRF model feature construction step constructs the feature data of the CRF model in the following manner:
(1) when the current layer is layer 0, feature data are extracted from the text to be tagged to fill the feature template selected for layer 0 when the CRF model was trained; and
(2) when the current layer is not layer 0, the layer-0 feature data are used, and feature data are also extracted from the second text obtained by tagging the text to be tagged with the CRF models of the previous layer.
18. The part-of-speech tagging method as claimed in claim 12, further comprising:
a part-of-speech hierarchy tree construction step of building the part-of-speech hierarchy tree by analyzing the relations between the parts of speech of the tagged texts in the part-of-speech tagging training set.
19. The part-of-speech tagging method as claimed in claim 18, wherein the part-of-speech hierarchy tree construction step comprises:
a part-of-speech feature template selection step of selecting feature templates that characterize part-of-speech features;
a feature vector construction step of constructing, according to the selected feature templates, corresponding feature vectors for the parts of speech in the part-of-speech tagging training set;
a similarity calculation step of calculating the similarity between parts of speech using the feature vectors; and
a clustering step of clustering the parts of speech according to the similarity to generate the part-of-speech hierarchy tree.
20. The part-of-speech tagging method as claimed in claim 19, further comprising:
a test set construction step of randomly selecting, from the part-of-speech tagging training set, a set of part-of-speech-tagged texts as a test set;
an evaluation step of evaluating the result of tagging the texts to be tagged from the test set with the part-of-speech tagging model; and
an adjustment step of adjusting the part-of-speech hierarchy tree according to the evaluation result.
21. The part-of-speech tagging method as claimed in claim 20, wherein the adjustment step comprises adjusting the thresholds that the part-of-speech hierarchy tree construction step uses when calculating the similarity between parts of speech.
22. The part-of-speech tagging method as claimed in claim 12 or 13, further comprising:
an unregistered-word part-of-speech guessing model construction step of learning word-formation rules from the part-of-speech tagging training set and constructing an unregistered-word part-of-speech guessing model; and
an unregistered-word part-of-speech correction step of tagging the parts of speech of unregistered words using the unregistered-word part-of-speech guessing model, and correcting the parts of speech of unregistered words that were tagged using the part-of-speech tagging model.
23. A device for training a part-of-speech tagging model, comprising:
a CRF model training corpus construction unit for constructing CRF model training corpora by relabeling, layer by layer and node by node using a part-of-speech hierarchy tree, the tagged first texts in a part-of-speech tagging training set into second texts; and
a CRF model training unit for training CRF models layer by layer and node by node, using the second text produced by each relabeling of the CRF model training corpus construction unit, to obtain the part-of-speech tagging model.
24. A method for training a part-of-speech tagging model, comprising:
a CRF model training corpus construction step of constructing CRF model training corpora by relabeling, layer by layer and node by node using a part-of-speech hierarchy tree, the tagged first texts in a part-of-speech tagging training set into second texts; and
a CRF model training step of training CRF models layer by layer and node by node, using the second text produced by each relabeling of the CRF model training corpus construction step, to obtain the part-of-speech tagging model.
CN200910132711.3A 2009-04-14 2009-04-14 Part-or-speech tagging system, and device and method thereof for training part-or-speech tagging model Expired - Fee Related CN101866337B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN200910132711.3A CN101866337B (en) 2009-04-14 2009-04-14 Part-or-speech tagging system, and device and method thereof for training part-or-speech tagging model
JP2010077274A JP5128629B2 (en) 2009-04-14 2010-03-30 Part-of-speech tagging system, part-of-speech tagging model training apparatus and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910132711.3A CN101866337B (en) 2009-04-14 2009-04-14 Part-or-speech tagging system, and device and method thereof for training part-or-speech tagging model

Publications (2)

Publication Number Publication Date
CN101866337A true CN101866337A (en) 2010-10-20
CN101866337B CN101866337B (en) 2014-07-02

Family

ID=42958068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910132711.3A Expired - Fee Related CN101866337B (en) 2009-04-14 2009-04-14 Part-or-speech tagging system, and device and method thereof for training part-or-speech tagging model

Country Status (2)

Country Link
JP (1) JP5128629B2 (en)
CN (1) CN101866337B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103631961B (en) * 2013-12-17 2017-01-18 苏州大学张家港工业技术研究院 Method for identifying relationship between sentiment words and evaluation objects
CN108241662B (en) * 2016-12-23 2021-12-28 北京国双科技有限公司 Data annotation optimization method and device
CN109766523A (en) * 2017-11-09 2019-05-17 普天信息技术有限公司 Part-of-speech tagging method and labeling system
CN109992763A (en) * 2017-12-29 2019-07-09 北京京东尚科信息技术有限公司 Language marks processing method, system, electronic equipment and computer-readable medium
CN110348465B (en) * 2018-04-03 2022-10-18 富士通株式会社 Method for labelling a sample
CN109657230B (en) * 2018-11-06 2023-07-28 众安信息技术服务有限公司 Named entity recognition method and device integrating word vector and part-of-speech vector
CN110175236B (en) * 2019-04-24 2023-07-21 平安科技(深圳)有限公司 Training sample generation method and device for text classification and computer equipment
CN110377899A (en) * 2019-05-30 2019-10-25 北京达佳互联信息技术有限公司 A kind of method, apparatus and electronic equipment of determining word part of speech
CN110321433B (en) * 2019-06-26 2023-04-07 创新先进技术有限公司 Method and device for determining text category
CN110532391B (en) * 2019-08-30 2022-07-05 网宿科技股份有限公司 Text part-of-speech tagging method and device
CN111160034B (en) * 2019-12-31 2024-02-27 东软集团股份有限公司 Entity word labeling method, device, storage medium and equipment
JP2021162917A (en) * 2020-03-30 2021-10-11 ソニーグループ株式会社 Information processing apparatus and information processing method
CN112017786A (en) * 2020-07-02 2020-12-01 厦门市妇幼保健院(厦门市计划生育服务中心) ES-based custom word segmentation device
CN111859862B (en) * 2020-07-22 2024-03-22 海尔优家智能科技(北京)有限公司 Text data labeling method and device, storage medium and electronic device
CN112163424A (en) * 2020-09-17 2021-01-01 中国建设银行股份有限公司 Data labeling method, device, equipment and medium
CN113158659B (en) * 2021-02-08 2024-03-08 银江技术股份有限公司 Case-related property calculation method based on judicial text
CN115146642B (en) * 2022-07-21 2023-08-29 北京市科学技术研究院 Named entity recognition-oriented training set automatic labeling method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TAKU KUDO,ET AL: "Applying Conditional Random Fields to Japanese Morphological Analysis", 《IN PROC.OF EMNLP》 *

Cited By (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164426B (en) * 2011-12-13 2015-10-28 北大方正集团有限公司 A kind of method of named entity recognition and device
CN103164426A (en) * 2011-12-13 2013-06-19 北大方正集团有限公司 Method and device of recognizing named entity
CN103902525B (en) * 2012-12-28 2016-09-21 国网新疆电力公司信息通信公司 Uighur part-of-speech tagging method
CN103902525A (en) * 2012-12-28 2014-07-02 新疆电力信息通信有限责任公司 Uygur language part-of-speech tagging method
CN103150381B (en) * 2013-03-14 2016-03-02 北京理工大学 A kind of High-precision Chinese predicate identification method
CN103150381A (en) * 2013-03-14 2013-06-12 北京理工大学 High-precision Chinese predicate identification method
CN103530282A (en) * 2013-10-23 2014-01-22 北京紫冬锐意语音科技有限公司 Corpus tagging method and equipment
CN103530282B (en) * 2013-10-23 2016-07-13 北京紫冬锐意语音科技有限公司 Corpus labeling method and equipment
CN104391836A (en) * 2014-11-07 2015-03-04 百度在线网络技术(北京)有限公司 Method and device for processing feature templates for syntactic analysis
CN104391836B (en) * 2014-11-07 2017-07-21 百度在线网络技术(北京)有限公司 Method and device for processing feature templates for syntactic analysis
CN105930415A (en) * 2016-04-19 2016-09-07 昆明理工大学 Support vector machine-based Vietnamese part-of-speech tagging method
CN105955955B (en) * 2016-05-05 2018-08-28 东南大学 A kind of unsupervised part-of-speech tagging method without disambiguation based on error correcting output codes
CN105955955A (en) * 2016-05-05 2016-09-21 东南大学 Disambiguation-free unsupervised part-of-speech tagging method based on error-correcting output codes
CN106778887A (en) * 2016-12-27 2017-05-31 努比亚技术有限公司 Terminal and method for determining sentence mark sequence based on conditional random field
CN106778887B (en) * 2016-12-27 2020-05-19 瑞安市辉煌网络科技有限公司 Terminal and method for determining sentence mark sequence based on conditional random field
CN106844346B (en) * 2017-02-09 2020-08-25 北京红马传媒文化发展有限公司 Short text semantic similarity discrimination method and system based on deep learning model Word2Vec
CN106844346A (en) * 2017-02-09 2017-06-13 北京红马传媒文化发展有限公司 Short text Semantic Similarity method of discrimination and system based on deep learning model Word2Vec
CN107239444A (en) * 2017-05-26 2017-10-10 华中科技大学 A kind of term vector training method and system for merging part of speech and positional information
CN107239444B (en) * 2017-05-26 2019-10-08 华中科技大学 A kind of term vector training method and system merging part of speech and location information
CN107526724A (en) * 2017-08-22 2017-12-29 北京百度网讯科技有限公司 Method and device for annotating a corpus
CN109726386A (en) * 2017-10-30 2019-05-07 中国移动通信有限公司研究院 A kind of term vector model generating method, device and computer readable storage medium
CN109726386B (en) * 2017-10-30 2023-05-09 中国移动通信有限公司研究院 Word vector model generation method, device and computer readable storage medium
CN107832425A (en) * 2017-11-13 2018-03-23 北京神州泰岳软件股份有限公司 A kind of corpus labeling method, the apparatus and system of more wheel iteration
CN107832425B (en) * 2017-11-13 2020-03-06 中科鼎富(北京)科技发展有限公司 Multi-iteration corpus labeling method, device and system
CN108182448A (en) * 2017-12-22 2018-06-19 北京中关村科金技术有限公司 A kind of selection method and relevant apparatus for marking strategy
CN108182448B (en) * 2017-12-22 2020-08-21 北京中关村科金技术有限公司 Selection method of marking strategy and related device
CN109033084A (en) * 2018-07-26 2018-12-18 国信优易数据有限公司 A kind of semantic hierarchies tree constructing method and device
CN109344406B (en) * 2018-09-30 2023-06-20 创新先进技术有限公司 Part-of-speech tagging method and device and electronic equipment
CN109344406A (en) * 2018-09-30 2019-02-15 阿里巴巴集团控股有限公司 Part-of-speech tagging method, apparatus and electronic equipment
US11205052B2 (en) 2019-07-02 2021-12-21 Servicenow, Inc. Deriving multiple meaning representations for an utterance in a natural language understanding (NLU) framework
US11720756B2 (en) 2019-07-02 2023-08-08 Servicenow, Inc. Deriving multiple meaning representations for an utterance in a natural language understanding (NLU) framework
WO2021003313A1 (en) * 2019-07-02 2021-01-07 Servicenow, Inc. Deriving multiple meaning representations for an utterance in a natural language understanding framework
CN110457683A (en) * 2019-07-15 2019-11-15 北京百度网讯科技有限公司 Model optimization method, apparatus, computer equipment and storage medium
CN110427487A (en) * 2019-07-30 2019-11-08 中国工商银行股份有限公司 A kind of data mask method, device and storage medium
CN110427487B (en) * 2019-07-30 2022-05-17 中国工商银行股份有限公司 Data labeling method and device and storage medium
CN110781667B (en) * 2019-10-25 2021-10-08 北京中献电子技术开发有限公司 Japanese verb identification and part-of-speech tagging method for neural network machine translation
CN110781667A (en) * 2019-10-25 2020-02-11 北京中献电子技术开发有限公司 Japanese verb identification and part-of-speech tagging method for neural network machine translation
CN111401067A (en) * 2020-03-18 2020-07-10 上海观安信息技术股份有限公司 Honeypot simulation data generation method and device
CN111401067B (en) * 2020-03-18 2023-07-14 上海观安信息技术股份有限公司 Honeypot simulation data generation method and device
CN111950274A (en) * 2020-07-31 2020-11-17 中国工商银行股份有限公司 Chinese word segmentation method and device for linguistic data in professional field
CN112148877A (en) * 2020-09-23 2020-12-29 网易(杭州)网络有限公司 Corpus text processing method and device and electronic equipment
CN112148877B (en) * 2020-09-23 2023-07-04 网易(杭州)网络有限公司 Corpus text processing method and device and electronic equipment

Also Published As

Publication number Publication date
JP2010250814A (en) 2010-11-04
JP5128629B2 (en) 2013-01-23
CN101866337B (en) 2014-07-02

Similar Documents

Publication Publication Date Title
CN101866337B (en) Part-of-speech tagging system, and device and method thereof for training part-of-speech tagging model
Gupta et al. Abstractive summarization: An overview of the state of the art
CN109359293B (en) Neural-network-based Mongolian named entity recognition method and recognition system
CN108280064B (en) Combined processing method for word segmentation, part of speech tagging, entity recognition and syntactic analysis
CN101539907B (en) Part-of-speech tagging model training device and part-of-speech tagging system and method thereof
CN101251862B (en) Content-based problem automatic classifying method and system
CN103198149B (en) Method and system for query error correction
CN105843801B (en) The structure system of more translation Parallel Corpus
CN110427623A (en) Semi-structured document Knowledge Extraction Method, device, electronic equipment and storage medium
CN112417880A (en) Court electronic file oriented case information automatic extraction method
CN107330032A (en) A kind of implicit chapter relationship analysis method based on recurrent neural network
CN103885938A (en) Industry spelling mistake checking method based on user feedback
CN103309926A (en) Chinese and English-named entity identification method and system based on conditional random field (CRF)
CN109325112A (en) A kind of across language sentiment analysis method and apparatus based on emoji
CN104778256A (en) Rapid incremental clustering method for domain question-answering system consultations
CN110276069A (en) A kind of Chinese braille mistake automatic testing method, system and storage medium
CN112417854A (en) Chinese document abstraction type abstract method
CN110222338B (en) Organization name entity identification method
CN104346326A (en) Method and device for determining emotional characteristics of emotional texts
CN105868187B (en) The construction method of more translation Parallel Corpus
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN112948588B (en) Chinese text classification method for quick information editing
CN103970732B (en) Mining method and device of new word translation
CN110874408B (en) Model training method, text recognition device and computing equipment
CN112765359B (en) Text classification method based on few samples

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140702

Termination date: 20170414