CN108268447B - Labeling method for Tibetan named entities
- Publication number
- CN108268447B (application CN201810059120.7A / CN201810059120A)
- Authority
- CN
- China
- Prior art keywords
- vector
- corpus
- word
- labeled
- named entities
- Prior art date
- Legal status
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
Abstract
The invention discloses a labeling method for Tibetan named entities. A semi-supervised learning scheme is adopted: a labeled corpus is used to train a dual-granularity model, namely a coarse-granularity NER based on KNN clustering of word vectors and a fine-granularity NER based on semi-Markov CRFs; the unlabeled corpus is then labeled, the newly labeled entities are added to the labeled corpus to retrain the dual-granularity model, and the dual-granularity NER is improved iteratively. The method overcomes both the excessive dependence of supervised learning on labeled corpora and the single discrimination mode of the traditional CRFs method. It integrates entity semantic features, the interaction among named entities, and other characteristics, combines a clustering graph with a probability graph, improves model fit through the complementary strengths of the semantics and the syntactic structure of named entities, realizes collective NER, and effectively improves the accuracy and efficiency of Tibetan named entity recognition.
Description
Technical Field
The invention relates to the technical field of language processing, in particular to a labeling method of Tibetan named entities.
Background
Named Entity Recognition (NER) refers to detecting entity words composed of single characters, words, or multiple words in a text and determining which entity class they belong to: person name, place name, organization, etc. From the viewpoint of Natural Language Processing (NLP), named entity recognition mainly solves the recognition of entities not registered in a dictionary. From the perspective of knowledge discovery and acquisition, named entity recognition extracts from unstructured text the named entities related to the information the user desires. The effectiveness of named entity recognition directly affects the performance of the research and application systems built on it, such as structured representation of text, information extraction, information retrieval, machine translation, and question answering systems.
Tibetan shares certain commonalities with Chinese, English, and other written languages, but also has special characteristics. For example, the Tibetan syllable structure is built around a base (root) letter, with other letters attached before and after it and stacked above and below it to form the complete syllable. Although the dictionaries, rules, grammar, and features used in Tibetan named entity recognition differ from those of other languages, from the perspective of the methodology of named entity recognition, the methods adopted do not differ from those used for other languages.
There are many named entity recognition methods, ranging from Supervised Learning (SL) to Unsupervised Learning (UL) and from Rule-and-Dictionary-Based Learning (RDBL) to Statistical Machine Learning (SML), but each still has certain drawbacks. In a supervised learning setting, although a classifier attains good fitting performance after training on labeled data, the premise is that many linguists spend a great deal of time labeling the original corpus. As the opposite of SL, unsupervised learning avoids the cost of annotated data, but its entity recognition performance is markedly inferior because it lacks prior knowledge for training and learning. From the perspective of entity-construction rules, large numbers of rules have been acquired from annotated data to perform entity recognition; although such methods achieve a certain accuracy on small data sets, as data sets grow, and especially in the current big-data era, the main problem of rule-based entity recognition becomes evident: a rule base cannot exhaust all named-entity rules. Stated another way, RDBL does not take full advantage of the context and associated features of named entities. SML, in contrast, improves accuracy significantly precisely by fully exploiting the context-dependent characteristics of named entities in annotated data; examples include Hidden Markov Models (HMMs), Support Vector Machines (SVMs), Maximum Entropy Markov Models (MEMMs), Conditional Random Fields (CRFs), and skip-chain CRFs. Among these, conditional random fields adopt globally normalized probabilities, overcoming the label-bias problem of HMMs and MEMMs and obtaining better classification results; building on basic CRFs with artificial synonym pairs achieves better document-level NER accuracy than conventional NER algorithms.
The above statistical learning methods all consider entity recognition from a fine-grained point of view; when discriminating an NE (named entity), the CRFs method fails to consider the metric properties of features, internal features of the entity (which, for example, lack the Markov property), and so on. Furthermore, such methods rely heavily on annotated corpora, i.e., they resemble looking up entities and computing matches in a generalized dictionary (an annotated corpus containing features and named entities); when the named entity to be annotated does not appear in the generalized dictionary and its near-synonymous NEs lack similar contexts, recognition errors may increase.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a labeling method for Tibetan named entities, solving the technical problems that supervised learning depends excessively on labeled corpora and that traditional rule-based and statistical methods judge entities in isolation.
In order to solve the above technical problems, the invention adopts the following technical scheme: a method for labeling Tibetan named entities, comprising the following steps:
normalizing the unmarked data to obtain unmarked normalized corpora, and adding the newly marked named entity into the original marked corpora;
training a noun phrase annotator Semi-Markov CRFs_1 with the labeled corpus, and then using it to segment and label noun phrases in the normalized corpus;
reading the labeled corpus and the normalized corpus, establishing a CBOW model combining characters, words, phrases and named entities, and obtaining, through training of the CBOW model, a corpus matrix and a vector space of the noun-class characters, words, phrases and named entities;
based on a vector space, finding K nearest neighbor tagged named entities of the untagged noun phrases by utilizing a KNN algorithm, calculating cosine similarity between the untagged noun phrases and the K nearest neighbor tagged named entities, then selecting q named entities with similarity values larger than a preset threshold lambda from the K neighbors, wherein q is more than or equal to 0 and less than or equal to K, and if q is more than 0, taking the named entity category of the untagged noun phrases as the category of the named entity with the largest cosine similarity in the K nearest neighbors; adding the newly labeled named entity into the labeled corpus to enable the normalized corpus to obtain a part of label;
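The coarse-granularity KNN step above can be sketched as follows; the function names, the default K = 5, and the threshold λ = 0.7 are illustrative assumptions of this sketch, not values fixed by the method:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; defined as 0 for zero vectors.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def knn_label(phrase_vec, labeled, k=5, lam=0.7):
    """Coarse-granularity labeling of one unlabeled noun phrase.

    labeled: list of (vector, entity_class) pairs for labeled named entities.
    Keeps the q neighbors among the K nearest whose similarity exceeds lam;
    if q > 0, returns the class of the most similar one, else None.
    """
    sims = [(cosine_similarity(phrase_vec, v), c) for v, c in labeled]
    neighbors = sorted(sims, key=lambda t: t[0], reverse=True)[:k]
    candidates = [(s, c) for s, c in neighbors if s > lam]
    return candidates[0][1] if candidates else None
```

A phrase for which no neighbor clears the threshold (q = 0) stays unlabeled and is left to the fine-granularity tagger.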
reading sequence data of the labeled corpus, and training a fine grain degree marker Semi-Markov CRFs _ 2; and then, labeling the unlabeled named entities in the normalized corpus by using Semi-Markov CRFs _2 to realize the full labeling of the named entities.
The normalization processing comprises: word segmentation and sentence normalization, punctuation normalization, word segmentation and part-of-speech tagging normalization, and stop word normalization.
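As an illustration of the normalization sub-steps, a minimal sketch follows; the patent does not specify the concrete rules, so the delimiter handling and punctuation set here are assumptions:

```python
import re

# Hypothetical sketch of the normalization sub-steps; the character
# handling below is an assumption, not the patent's exact procedure.
TIBETAN_SHAD = "\u0f0d"  # Tibetan sentence/clause delimiter (shad)

def normalize(text):
    # 1. Sentence normalization: split on the shad delimiter.
    sentences = [s.strip() for s in text.split(TIBETAN_SHAD) if s.strip()]
    # 2. Punctuation normalization: drop stray non-Tibetan punctuation.
    sentences = [re.sub(r"[,.;:!?\"'()]", "", s) for s in sentences]
    # 3. Word segmentation / part-of-speech tagging normalization and
    # 4. stop-word removal would follow, using a Tibetan word segmenter
    #    and a stop-word list (not shown).
    return sentences
```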
The corpus matrix obtaining method comprises the following steps:
firstly, constructing a dictionary containing four subsets: characters, words, phrases, and named entities, and performing vector initialization on each element of the dictionary: each element is assigned a random vector of 400 to 600 dimensions, with the value of each dimension restricted to [-1, 1];
secondly, establishing a sliding window of length 5 and sliding it in sequence over the labeled corpus and the noun-phrase-labeled normalized corpus to read window data win = <x_{-2} x_{-1} x_0 x_{+1} x_{+2}>, where 0 denotes the center position of the window and x_0 denotes the target word;
using Context = {x_{±p}, p = 1, 2} to represent the context of x_0, the context word vectors of x_0 are preprocessed as follows, according to whether x_{±p} is a word, a phrase, or a named entity:
When x_{±p} ∈ {word}, the vector v(x_{±p}) takes the value of the word vector word_vector:

word_vector = (1 / |N_{±p}|) · Σ_{j=1}^{|N_{±p}|} character_{j,vector}

where word_vector denotes the vector corresponding to x_{±p} when it is a word, character_{j,vector} denotes the vector of the j-th Tibetan character in the word, and |N_{±p}| denotes the number of characters contained in the context word x_{±p} of the target word x_0;
when x_{±p} ∈ {phrase}, the vector v(x_{±p}) takes the value of the phrase vector chunking_vector:

chunking_vector = (1 / |N'_{±p}|) · Σ_{q=1}^{|N'_{±p}|} word_{q,vector}

where chunking_vector denotes the vector corresponding to x_{±p} when it is a phrase, word_{q,vector} denotes the vector of the q-th Tibetan word in the phrase, and |N'_{±p}| denotes the number of words contained in the context item x_{±p} of the target word x_0;
when x_{±p} ∈ {named entity}, it is processed according to its composition as characters, words, or phrases, following the corresponding cases above;
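A minimal sketch of the vector preprocessing above, assuming NumPy arrays for all element vectors: word vectors are averaged from character vectors, and phrase (chunking) vectors from word vectors:

```python
import numpy as np

def word_vector(char_vectors):
    # word_vector: mean of the vectors of the Tibetan characters in the word.
    return np.mean(char_vectors, axis=0)

def phrase_vector(word_vectors):
    # chunking_vector: mean of the vectors of the words in the phrase.
    return np.mean(word_vectors, axis=0)

def context_mean(context_vectors):
    # Context(x_0): mean of the four context vectors of the target word,
    # the input to the CBOW model.
    return np.mean(context_vectors, axis=0)
```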
then, the vector mean Context(x_0) of the context of x_0 that is input to the CBOW model is computed as

Context(x_0) = (1/4) · Σ_{p=1,2} ( v(x_{-p}) + v(x_{+p}) )

where Context(x_0) denotes the input of the CBOW model and p = 1, 2;
and establishing an objective function of the CBOW learning algorithm by Noise-Contrastive Estimation:

L(θ) = Σ_{x_0 ∈ D} [ log σ(θ · Context(x_0)) + Σ_{x'_0 ∈ NCE(x'_0)} log( 1 − σ(θ · Context(x'_0)) ) ]

where θ denotes the weight vector of Context(x_0); D denotes the corpus; σ denotes the activation function; x'_0 denotes a negative sample; NCE(x'_0) denotes the set of negative samples, to which x_0 does not belong; and Context(x'_0) denotes the word-vector mean of the context of a negative sample, in which the original target word of the window is replaced by x'_0;
Finally, learning parameters by using a random gradient ascent algorithm, and updating context word vectors; when CBOW traverses the whole corpus, a corpus matrix is obtained.
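The NCE-style objective and the stochastic gradient-ascent update can be sketched as below; this is a simplified per-window version under the assumption that θ scores the context mean directly, not a full CBOW implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nce_loglik(theta, ctx_pos, ctx_negs):
    """Per-window objective: reward the true window's context mean,
    penalize windows whose target was replaced by a negative sample."""
    ll = np.log(sigmoid(theta @ ctx_pos))
    for ctx_neg in ctx_negs:
        ll += np.log(1.0 - sigmoid(theta @ ctx_neg))
    return float(ll)

def sgd_ascent_step(theta, ctx_pos, ctx_negs, lr=0.05):
    """One stochastic gradient-ascent update of theta for one window."""
    grad = (1.0 - sigmoid(theta @ ctx_pos)) * ctx_pos
    for ctx_neg in ctx_negs:
        grad = grad - sigmoid(theta @ ctx_neg) * ctx_neg
    return theta + lr * grad
```

One pass of such updates over every window of the corpus corresponds to one traversal in the step above.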
The method for constructing the vector space comprises the following steps: and extracting a vector space generated by vectors of all the nominal characters, words, phrases and named entities from the corpus matrix.
The specific method for training the fine-granularity marker Semi-Markov CRFs_2 is as follows:
using a sliding window of length 3, one window s_w = <s_{-1} s_0 s_{+1}> of the phrase-segmentation sequence data s_x of a sentence is read from the labeled corpus, together with the corresponding window s_l = <y'_{-1} y'_0 y'_{+1}> of the phrase label sequence s_y of s_x; here s_0 denotes the target phrase, s_{-1} and s_{+1} denote its left and right context, and y'_{-1}, y'_0, y'_{+1} ∈ Y', Y' = {R, T, P, O, N}, where R denotes a person name, T denotes time, P denotes a place, O denotes an organization, and N denotes other types of phrases;
constructing state transition feature functions t_{v1}(y'_{-1}, y'_0, s_x, k') and t'_{v2}(y'_{+1}, y'_0, s_x, k'), a state feature function s'_{v3}(y'_0, s_x, k'), a segmentation feature function seg_{v4}(y'_0, s_0, ss_0, ee_0), and a feature function vector F'(s_x, s_y) = <f_1(s_x, s_y), f_2(s_x, s_y), …, f_z(s_x, s_y)>, where y'_0, y'_{-1}, and y'_{+1} denote the named-entity category labels of the current, previous, and next units of the window, respectively; k' denotes the current position of the feature function; 0 denotes the middle position of the window; ss_0 denotes the start point of the segment s_0 and ee_0 denotes its end point; each f_{z'}(s_x, s_y) sums the corresponding feature over every position, z', z'' = 1, 2, …, z; the feature function vector is formed by concatenating the v1 + v2 state transition feature functions, the v3 state feature functions, and the v4 segmentation feature functions, where z = v1 + v2 + v3 + v4;
and (3) creating a Semi-Markov conditional random field Tibetan language fine granularity marker Semi-Markov CRFs _2 according to the characteristic function, and training the marker by utilizing the marking linguistic data and combining an L-BFGS algorithm and a gradient ascending method.
Full labeling of the named entities in the normalized corpus is realized with Semi-Markov CRFs_2 as follows:
from the partially labeled normalized corpus, one window s_w = <s_{-1} s_0 s_{+1}> of size 3 of the phrase-segmentation sequence s_x of a sentence is read; the trained fine-granularity Tibetan named entity tagger traverses the noun phrases not yet labeled in the corpus and determines the category of each unlabeled entity by computing the maximum conditional probability of the named-entity label sequence s_y of s_x:

P(s_y | s_x, W') = exp(W' · F'(s_x, s_y)) / Z'_{W'}(s_x)

where |W'| = |W_{v1}| + |W_{v2}| + |W_{v3}| + |W_{v4}| denotes the vector length; according to the computed conditional probabilities, the label sequence is output with the Viterbi algorithm, finally realizing the full labeling of the named entities.
Compared with the prior art, the invention has the following beneficial effects:
1. The limitation of prior knowledge for model training in supervised CRFs learning is overcome; the internal feature information of named entities in an external dictionary is exploited, further improving the utilization of the dictionary and the accuracy of Tibetan named entity recognition.
2. The method combines features such as entity near-synonyms, the semantic similarity of entity-word contexts, and entity word vectors, and couples clustering with a probability graph to realize entity recognition collectively; compared with the CRFs method alone, it integrates features from multiple aspects and further improves the accuracy of Tibetan named entity recognition.
3. The coarse-granularity NER based on word-vector KNN clustering and the fine-granularity NER based on semi-Markov CRFs together improve model fit through the complementarity of the semantics and the syntactic structure of named entities.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a corpus normalization flow chart;
FIG. 3 is a flow chart of noun phrase tagging;
FIG. 4 is a flow chart of corpus matrix creation based on the CBOW model of word entity federation;
FIG. 5 is a flow chart of coarse grain labeling;
fig. 6 is a flow chart of fine-grained annotation.
Detailed description of the embodiments:
The invention provides a labeling method for Tibetan named entities that trains a noun phrase tagger and a fine-granularity named entity tagger on a labeled corpus and uses the unlabeled corpus, in a semi-supervised learning mode, to improve labeling performance. First, the unlabeled corpus is normalized; the labeled corpus and the normalized corpus are then used to train the noun phrase tagger Semi-Markov CRFs_1 and to segment and label noun phrases; next, a CBOW model combining characters, words, and entities is created to obtain a corpus matrix and a vector space, realizing coarse-granularity K-Nearest-Neighbor named entity labeling; the newly labeled named entities are added to the labeled corpus, the fine-granularity tagger Semi-Markov CRFs_2 is trained, and the partially labeled normalized corpus is traversed to complete the labeling of the corpus. The method overcomes the limitation that supervised learning depends excessively on labeled corpora and the problem of the traditional isolated judgment mode based on rules and statistical methods; it integrates entity semantic features, the interaction between named entities, and other characteristics, combines a clustering graph with a probability graph, realizes entity recognition collectively, and effectively improves the efficiency of Tibetan named entity recognition.
The technical solution of the present invention is explained in detail below with reference to the accompanying drawings and the specific embodiments, but the scope of the present invention is not limited to the embodiments.
FIG. 1 is a flow chart of the present invention, comprising the steps of:
Normalization processing is carried out on the unlabeled data to obtain an unlabeled normalized corpus, and newly labeled named entities are added to the original labeled corpus.
As shown in fig. 2, the method specifically includes the following steps:
Semi-Markov CRFs_1 is trained with the labeled corpus and then used to segment and label the noun phrases of the aforementioned normalized corpus.
As shown in fig. 3, the method specifically includes the following steps:
In step 301, the corpus is input: the labeled corpus and the normalized corpus are input for model training and testing, respectively.
In step 302, Semi-Markov CRFs_1 operates in two stages, training and labeling (testing). First, in the training stage, the word sequence data x = <x_1 x_2 … x_n> of a sentence, the word-sequence label data y = <y_1 y_2 … y_i … y_n>, and the corresponding phrase-segmentation sequence s = <s_1 s_2 … s_j …> are read from the labeled corpus, where x_n denotes a word; y_i denotes the current segmentation label, y_i ∈ Y, Y = {F, E}, with F denoting a non-noun label and E a noun label; and s_j denotes the j-th phrase segment of x, j ≤ n. Each phrase segment is s_j = <b_j, e_j, y_j>, where b_j denotes the start point of the current segment, e_j its end point, and y_j ∈ Y. A segmentation feature function f_k(j, x, s_j) = f_k(y_j, y_{j-1}, x, b_j, e_j) is constructed, where y_{j-1} is the label of the previous segment and k indexes the feature functions; C denotes the number of segmentation feature functions f_k(j, x, s_j) in the corpus;
the conditional probability of the noun phrase annotator is established as

P(s | x, W) = exp(W · F(x, s)) / Z_W(x)

where W is the weight vector of the segmentation feature-function vector F(x, s), and Z_W(x) = Σ_{s'} e^{W·F(x, s')} is a normalization factor, with s' ranging over all possible valid segmentations of the sequence;
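The normalization factor Z_W(x), a sum over all valid segmentations s', can be computed by a forward-style dynamic program rather than explicit enumeration; in the sketch below, score(y_prev, y, b, e) is an assumed interface standing in for W · F evaluated on one segment:

```python
import math

START = "<S>"

def logsumexp(vals):
    m = max(vals)
    return m if m == float("-inf") else m + math.log(sum(math.exp(v - m) for v in vals))

def semi_crf_log_partition(n, labels, score, L):
    """log Z_W(x): forward sum over all valid segmentations of n words,
    with segment length bounded by L."""
    alpha = {(0, START): 0.0}  # empty prefix
    for i in range(1, n + 1):
        for y in labels:
            terms = []
            for d in range(1, min(L, i) + 1):  # length of the last segment
                prevs = [START] if i - d == 0 else labels
                for yp in prevs:
                    terms.append(alpha[(i - d, yp)] + score(yp, y, i - d, i - 1))
            alpha[(i, y)] = logsumexp(terms)
    return logsumexp([alpha[(n, y)] for y in labels])
```

With all scores zero, exp(log Z) simply counts the valid labeled segmentations, which is a convenient sanity check.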
according to the labeled corpus {(x_t, s_t)}, t = 1, …, Num, where s_t denotes the noun-phrase label sequence of the t-th word sequence x_t in the labeled corpus and Num is the total number of word sequences (sentences) of the labeled corpus, the objective function of the phrase-segmentation sequence is created as

L(W) = Σ_{t=1}^{Num} log P(s_t | x_t, W)

the gradient is computed with the L-BFGS algorithm, and the weight vector W is updated by gradient ascent.
Second, in the testing stage, the word sequence data x' = <x'_1 x'_2 … x'_n> of a sentence is read from the normalized corpus, the maximum length of a noun phrase is preset to L, and the maximum conditional probability P(s | x', W) of the phrase-segmentation sequence s is computed from x', i.e., the segmentation maximizing

Σ_{i=1}^{|s|} W · f(y_{i-1}, y_i, x', b_i, e_i)

where |s| denotes the number of phrases after segmenting x' and i denotes the position of the current phrase in the phrase sequence;
the best phrase-segmentation sequence and noun-phrase labels of x' are obtained with the Viterbi algorithm:

Viterbi(i, y) = max_{y' ∈ Y, 1 ≤ d ≤ L} [ Viterbi(i − d, y') + W · f(y', y, x', i − d + 1, i) ], for i > 0

with Viterbi(i, y) = 0 if i = 0, and Viterbi(i, y) = −∞ otherwise (i < 0).
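The semi-Markov Viterbi recursion above can be sketched as follows; the score(y_prev, y, b, e) interface, standing in for W · f on one candidate segment, is an assumption of this illustration:

```python
START = "<S>"
NEG_INF = float("-inf")

def semi_markov_viterbi(x, labels, score, L):
    """Best segmentation-and-labeling of word sequence x.

    score(y_prev, y, b, e): assumed stand-in for W . f on the segment
    x[b..e] labeled y, following a segment labeled y_prev.
    Returns (best score, list of (b, e, y) segments, 0-indexed inclusive).
    """
    n = len(x)
    V, back = {(0, START): 0.0}, {}
    for i in range(1, n + 1):
        for y in labels:
            best_s, best_b = NEG_INF, None
            for d in range(1, min(L, i) + 1):  # candidate segment length
                prevs = [START] if i - d == 0 else labels
                for yp in prevs:
                    s = V[(i - d, yp)] + score(yp, y, i - d, i - 1)
                    if s > best_s:
                        best_s, best_b = s, (i - d, yp)
            V[(i, y)], back[(i, y)] = best_s, best_b
    y = max(labels, key=lambda l: V[(n, l)])
    best, segs, i = V[(n, y)], [], n
    while i > 0:  # trace the best segmentation back from position n
        b, yp = back[(i, y)]
        segs.append((b, i - 1, y))
        i, y = b, yp
    return best, segs[::-1]
```

For instance, a toy scorer that rewards only length-2 segments labeled E recovers exactly those segments over a four-word sentence.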
The character-word-entity combined CBOW is constructed and trained with the labeled corpus and the noun-phrase-labeled normalized corpus to obtain the corpus matrix and the vector space of the corpus words and noun phrases.
As shown in fig. 4, the method specifically includes the following steps:
Secondly, a sliding window of length 5 is established and slid in sequence over the labeled corpus and the normalized corpus to read the sequence data x, yielding window data win = <x_{-2} x_{-1} x_0 x_{+1} x_{+2}>, where 0 denotes the center position of the window and x_0 denotes the target word; Context = {x_{±p}, p = 1, 2} denotes the context of x_0, and the context word vectors of x_0 are preprocessed as follows, according to whether x_{±p} is a word, a phrase, or a named entity:
When x_{±p} ∈ {word}, the vector v(x_{±p}) takes the value of the word vector word_vector:

word_vector = (1 / |N_{±p}|) · Σ_{j=1}^{|N_{±p}|} character_{j,vector}

where word_vector denotes the vector corresponding to x_{±p} when it is a word, character_{j,vector} denotes the vector of the j-th Tibetan character in the word, and |N_{±p}| denotes the number of characters contained in the context word x_{±p} of the target word x_0;

when x_{±p} ∈ {phrase}, the vector v(x_{±p}) takes the value of the phrase vector chunking_vector:

chunking_vector = (1 / |N'_{±p}|) · Σ_{q=1}^{|N'_{±p}|} word_{q,vector}

where chunking_vector denotes the vector corresponding to x_{±p} when it is a phrase, word_{q,vector} denotes the vector of the q-th Tibetan word in the phrase, and |N'_{±p}| denotes the number of words contained in the context item x_{±p} of the target word x_0;
when x_{±p} ∈ {named entity}, it is processed according to its composition as characters, words, or phrases, following the corresponding cases above.
(2) Mean: the vector mean Context(x_0) of the context of x_0 that is input to CBOW is computed as

Context(x_0) = (1/4) · Σ_{p=1,2} ( v(x_{-p}) + v(x_{+p}) )

where Context(x_0) denotes the input of CBOW;
(3) The objective function of the CBOW learning algorithm is established with the Noise-Contrastive Estimation algorithm (NCE for short):

L(θ) = Σ_{x_0 ∈ D} [ log σ(θ · Context(x_0)) + Σ_{x'_0 ∈ NCE(x'_0)} log( 1 − σ(θ · Context(x'_0)) ) ]

where θ denotes the weight vector of Context(x_0); D denotes the corpus; σ denotes the activation function; x'_0 denotes a negative sample; NCE(x'_0) denotes the set of negative samples; and Context(x'_0) denotes the word-vector mean of the context of a negative sample, i.e., the target word in the window is replaced by the negative sample x'_0. The parameters are learned and the context word vectors updated with a stochastic gradient-ascent algorithm.
According to the corpus matrix, partial labeling of the named entities in the normalized corpus is realized with K-Nearest Neighbor.
As shown in fig. 5, the method specifically includes the following steps:
And 505, adding the newly labeled named entity into the labeled corpus to obtain part of the named entity label of the normalized corpus.
105, fine-grained marking
And training a fine-grained level tagger Semi-Markov CRFs _2 of the Semi-Markov conditional random field by using the tagged corpus, and carrying out full tagging on the normalized corpus by using the tagger.
As shown in fig. 6, the method specifically includes the following steps:
In step 603, Semi-Markov CRFs _2 is divided into two stages, training and testing.
(1) In the training stage, from the data read by the sliding window, state transition feature functions t_{v1}(y'_{-1}, y'_0, s_x, k') and t'_{v2}(y'_{+1}, y'_0, s_x, k'), a state feature function s'_{v3}(y'_0, s_x, k'), a segmentation feature function seg_{v4}(y'_0, s_0, ss_0, ee_0), and a feature function vector F'(s_x, s_y) = <f_1(s_x, s_y), f_2(s_x, s_y), …, f_z(s_x, s_y)> are created, where y'_0, y'_{-1}, and y'_{+1} denote the named-entity category labels of the current, previous, and next units of the window, respectively; k' denotes the current position of the feature function; 0 denotes the middle position of the window; ss_0 denotes the start point of the segment s_0 and ee_0 denotes its end point; each f_{z'}(s_x, s_y) sums the corresponding feature over every position, z', z'' = 1, 2, …, z; the feature function vector is formed by concatenating the v1 + v2 state transition feature functions, the v3 state feature functions, and the v4 segmentation feature functions, where z = v1 + v2 + v3 + v4;
the conditional probability of the named-entity annotator is established as

P(s_y | s_x, W') = exp(W' · F'(s_x, s_y)) / Z'_{W'}(s_x)

where W' is the weight vector of F'(s_x, s_y) and Z'_{W'}(s_x) is a normalization factor; then, according to the labeled corpus {(x'_m, y'_m)}, m = 1, …, M, where M is the total number of phrase sequences of the labeled corpus and x'_m and y'_m respectively denote the phrase sequence of a sentence and its corresponding Y'-label sequence, the objective log-likelihood function is established:

L'(W') = Σ_m log P(y'_m | x'_m, W') = Σ_m ( W' · F'(x'_m, y'_m) − log Z'_{W'}(x'_m) )
the gradient is calculated using the L-BFGS algorithm, and the weight vector W' is updated using the gradient ascent.
(2) In the testing (labeling) stage, a sliding window of length 3 reads one window s_w = <s_{-1} s_0 s_{+1}> of the phrase-segmentation sequence s_x of a sentence from the noun-phrase-labeled normalized corpus; the trained fine-granularity Tibetan named entity tagger traverses the noun phrases not yet labeled in the corpus and determines the category of each unlabeled entity by computing the maximum conditional probability of the named-entity label sequence s_y of s_x:

P(s_y | s_x, W') = exp(W' · F'(s_x, s_y)) / Z'_{W'}(s_x)

where |W'| = |W_{v1}| + |W_{v2}| + |W_{v3}| + |W_{v4}| denotes the vector length; according to the computed conditional probabilities, the label sequence is output with the Viterbi algorithm.
In step 604, the optimized Semi-Markov CRFs_2 traverses the partially labeled corpus to realize full labeling of the named entities, yielding a new named-entity-labeled corpus.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.
Claims (6)
1. A method for labeling Tibetan named entities, characterized by comprising the following steps:
normalizing the unmarked data to obtain unmarked normalized corpora, and adding the newly marked named entity into the original marked corpora;
training a noun phrase annotator Semi-Markov CRFs_1 with the labeled corpus, and then using it to segment and label noun phrases in the normalized corpus;
reading the labeled corpus and the normalized corpus, establishing a CBOW model combining characters, words, phrases and named entities, and obtaining a corpus matrix and a vector space of the characters, words, phrases and named entities of the nominal characters through the training of the CBOW model;
based on a vector space, finding K nearest neighbor labeled named entities of unlabeled noun phrases by using a KNN algorithm, calculating cosine similarity between the unlabeled noun phrases and the K nearest neighbor labeled named entities, then selecting q named entities with similarity values larger than a preset threshold lambda from the K neighbors, wherein q is more than or equal to 0 and less than or equal to K, and if q is more than 0, taking the class of the named entities of the unlabeled noun phrases as the class of the named entity with the maximum cosine similarity in the K nearest neighbors; adding the newly labeled named entity into the labeled corpus to enable the normalized corpus to obtain a part of label;
reading sequence data of the labeled corpus and training a fine-granularity tagger Semi-Markov CRFs_2; then labeling the unlabeled named entities in the normalized corpus with Semi-Markov CRFs_2 to realize full labeling of the named entities.
2. The method for labeling Tibetan named entities of claim 1, wherein the normalization process comprises: word segmentation and sentence normalization, punctuation normalization, word segmentation and part-of-speech tagging normalization, and stop word normalization.
3. The method for labeling Tibetan named entities according to claim 1, wherein the corpus matrix is obtained by the following method:
firstly, constructing a dictionary containing four subsets: characters, words, phrases, and named entities, and performing vector initialization on each element of the dictionary: each element is assigned a random vector of 400 to 600 dimensions, with the value of each dimension restricted to [-1, 1];
secondly, establishing a sliding window of length 5 and sliding it in sequence over the labeled corpus and the noun-phrase-labeled normalized corpus to read window data win = <x_{-2} x_{-1} x_0 x_{+1} x_{+2}>, where 0 denotes the center position of the window and x_0 denotes the target word;
using Context = {x_{±p}, p = 1, 2} to represent the context of x_0, the context word vectors of x_0 are preprocessed as follows, according to whether x_{±p} is a word, a phrase, or a named entity:
When x is±pE { word }, x±pVector of (2)Taking values as word vectors wordvecotrThe formula is as follows:
in the formula, wordvecotrDenotes x±pVectors corresponding to words, charactersjvectorVector, N, representing the jth Tibetan word in the word±pI denotes the target word x0A certain context word x±pThe number of words contained;
when x±p ∈ {phrase}, the vector of x±p takes the value of the phrase vector chunking_vector, the formula being:

chunking_vector = (1 / |N′±p|) Σq word_q_vector, q = 1, …, |N′±p|,

where chunking_vector denotes the vector corresponding to x±p when it belongs to a phrase, word_q_vector denotes the vector of the qth Tibetan word in the phrase, and |N′±p| denotes the number of words contained in the context item x±p of the target word x0;
when x±p ∈ {named entity}, it is processed correspondingly according to the character, word and phrase cases above;
then, calculating the vector mean Context(x0) of the context of x0 input to CBOW, the formula being:

Context(x0) = (1/4) Σp (x-p vector + x+p vector), p = 1, 2,

where Context(x0) represents the input of the CBOW model;
and establishing the objective function of the CBOW learning algorithm by noise-contrastive estimation (NCE), the formula being:

L(θ) = Σx0∈D [ log σ(θ · Context(x0)) + Σx′0∈NCE(x′0) log σ(−θ · Context(x′0)) ],

where θ represents the weight vector of Context(x0); D represents the corpus; σ represents the activation function; x′0 represents a negative sample; NCE(x′0) represents the set of negative samples, to which x0 does not belong; and Context(x′0) represents the context word-vector mean of a negative sample, in which the original target word of the window is replaced by x′0;
finally, learning the parameters with a stochastic gradient ascent algorithm and updating the context word vectors; when CBOW has traversed the whole corpus, the corpus matrix is obtained.
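One stochastic-gradient-ascent step of the CBOW/NCE training described in claim 3 can be sketched as below. This is a minimal illustration, not the patented procedure: the function name `nce_step`, the learning rate, the number of negative samples, and the update order are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_step(vecs, theta, window, vocab, n_neg=3, lr=0.05):
    """One stochastic-gradient-ascent step in the spirit of claim 3:
    average the context vectors of a 5-element window <x-2 x-1 x0 x+1 x+2>,
    score the target x0 and a few negative samples x0' against that mean,
    and update the weights and context vectors.  `vecs` maps each
    dictionary element to its vector; `theta` holds one weight vector
    per candidate target.  All names here are illustrative."""
    x0 = window[2]                                    # center position
    context = [w for i, w in enumerate(window) if i != 2]
    ctx_mean = np.mean([vecs[w] for w in context], axis=0)  # Context(x0)
    # positive sample x0 plus up to n_neg negative samples x0'
    negatives = [w for w in rng.choice(vocab, n_neg) if w != x0]
    for word, label in [(x0, 1.0)] + [(w, 0.0) for w in negatives]:
        g = label - sigmoid(theta[word] @ ctx_mean)   # gradient factor
        theta[word] += lr * g * ctx_mean              # update weight vector
        for w in context:                             # update context vectors
            vecs[w] += lr * g * theta[word] / len(context)
```

Sweeping this step over every window of the corpus until CBOW has traversed it once yields the updated vectors that form the corpus matrix.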
4. The method for labeling Tibetan named entities according to claim 1, wherein the vector space is constructed by extracting from the corpus matrix the vector space generated by the vectors of all nominal characters, words, phrases and named entities.
5. The method for labeling Tibetan named entities as set forth in claim 1, wherein the specific method for training the fine-grained tagger Semi-Markov CRFs _ 2 is as follows:
using a sliding window with a length of 3 units, slidingly reading from the labeled corpus one window s_w = &lt;s-1 s0 s+1&gt; of the phrase segmentation sequence s_x of a sentence, and one window s_l = &lt;y′-1 y′0 y′+1&gt; of the phrase mark sequence s_y corresponding to s_x, where s0 denotes the target phrase, s-1 and s+1 denote the preceding and following context of s0 respectively, and y′-1, y′0, y′+1 ∈ Y′, Y′ = {R, T, P, O, N}, in which R denotes a person name, T denotes a time, P denotes a place, O denotes an organization, and N denotes other types of phrases;
constructing the state transition feature functions t′v1(y′-1, y′0, s_x, k′) and t′v2(y′+1, y′0, s_x, k′), the state feature function s′v3(y′0, s_x, k′), a segmentation feature function over the start point and end point of the segment s0, and the feature function vector F′(s_x, s_y) = &lt;f1(s_x, s_y), f2(s_x, s_y), …, fz(s_x, s_y)&gt;, where y′0, y′-1 and y′+1 denote the named-entity category marks of the current, previous and next units of the window respectively; k′ denotes the current position of the feature function; 0 denotes the middle position of the window; each global feature fz′(s_x, s_y) denotes the sum of the corresponding elementary feature over every position, z′ = 1, 2, …, z; and the feature function vector is composed of the concatenation of v1 + v2 state transition feature functions, v3 state feature functions and v4 segmentation feature functions, so that z = v1 + v2 + v3 + v4;
and creating the Semi-Markov conditional random field Tibetan fine-grained tagger Semi-Markov CRFs _ 2 from the feature functions, and training the tagger on the labeled corpus by combining the L-BFGS algorithm with gradient ascent.
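The "sum of an elementary feature over every position" construction of claim 5 can be sketched as follows. The three indicator features below are illustrative stand-ins for the t′v1, t′v2 and s′v3 families; they are not the patent's actual feature set.

```python
def feature_vector(s_x, s_y):
    """Sketch of the claim-5 global feature vector F'(s_x, s_y): each
    global feature f_z' is the sum of an elementary indicator over the
    positions k' of the segmented sentence s_x with mark sequence s_y."""
    tags = ["R", "T", "P", "O", "N"]
    def trans_prev(prev, cur, x, k):   # stand-in for t'_v1(y'_{k-1}, y'_k, s_x, k')
        return 1.0 if k > 0 and prev == cur else 0.0
    def trans_next(nxt, cur, x, k):    # stand-in for t'_v2(y'_{k+1}, y'_k, s_x, k')
        return 1.0 if k + 1 < len(x) and nxt == cur else 0.0
    def state(cur, x, k):              # stand-in for s'_v3(y'_k, s_x, k')
        return 1.0 if cur in tags else 0.0
    f = [0.0, 0.0, 0.0]
    for k in range(len(s_x)):          # sum each elementary feature over positions
        prev = s_y[k - 1] if k > 0 else None
        nxt = s_y[k + 1] if k + 1 < len(s_y) else None
        f[0] += trans_prev(prev, s_y[k], s_x, k)
        f[1] += trans_next(nxt, s_y[k], s_x, k)
        f[2] += state(s_y[k], s_x, k)
    return f
```

Concatenating all v1 + v2 + v3 + v4 such summed features yields the z-dimensional vector that the trained weights W′ score against.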
6. The method for labeling Tibetan named entities as claimed in claim 1, wherein the specific method of fully labeling the named entities of the normalized corpus with Semi-Markov CRFs _ 2 is as follows:
reading from the partially labeled normalized corpus one window s_w = &lt;s-1 s0 s+1&gt; of size 3 of the phrase segmentation sequence s_x of a sentence, traversing the unlabeled noun phrases in the corpus with the trained fine-grained Tibetan named entity tagger, and determining the category of each unlabeled entity by computing the maximum conditional probability P(s_y | s_x, W′) of the named-entity mark sequence s_y of s_x, the formula being:

P(s_y | s_x, W′) = exp(W′ · F′(s_x, s_y)) / Σs_y′ exp(W′ · F′(s_x, s_y′)),
where |W′| = |Wv1| + |Wv2| + |Wv3| + |Wv4| denotes the length of the weight vector; and outputting the mark sequence with the Viterbi algorithm according to the computed conditional probabilities, finally realizing the full labeling of the named entities.
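The Viterbi decoding step of claim 6 can be sketched as a standard dynamic program over tag sequences. The per-position and transition scores below are hypothetical inputs standing in for the weighted feature sums W′ · F′(s_x, s_y); this is not the patent's decoder.

```python
def viterbi(obs_scores, trans_scores, tags):
    """Minimal Viterbi decoder in the spirit of claim 6: given a
    per-position score for each tag (obs_scores, one dict per position)
    and tag-to-tag transition scores (trans_scores[prev][cur]), return
    the highest-scoring mark sequence."""
    n = len(obs_scores)
    # best[k][t] = (best score of any path ending in tag t at k, that path)
    best = [{t: (obs_scores[0][t], [t]) for t in tags}]
    for k in range(1, n):
        cur = {}
        for t in tags:
            # pick the predecessor tag maximizing the running score
            p, (score, path) = max(
                ((pt, best[k - 1][pt]) for pt in tags),
                key=lambda item: item[1][0] + trans_scores[item[0]][t])
            cur[t] = (score + trans_scores[p][t] + obs_scores[k][t], path + [t])
        best.append(cur)
    # highest-scoring complete path
    return max(best[-1].values(), key=lambda sp: sp[0])[1]
```

Because the semi-CRF probability is a monotone function of the total score W′ · F′, maximizing the score with this recursion also maximizes P(s_y | s_x, W′).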
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810059120.7A CN108268447B (en) | 2018-01-22 | 2018-01-22 | Labeling method for Tibetan named entities |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108268447A CN108268447A (en) | 2018-07-10 |
CN108268447B true CN108268447B (en) | 2020-12-01 |
Family
ID=62776300
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810059120.7A Active CN108268447B (en) | 2018-01-22 | 2018-01-22 | Labeling method for Tibetan named entities |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108268447B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110738223B (en) * | 2018-07-18 | 2022-04-08 | 宇通客车股份有限公司 | Point cloud data clustering method and device of laser radar |
CN110020428B (en) * | 2018-07-19 | 2023-05-23 | 成都信息工程大学 | Method for jointly identifying and normalizing Chinese medicine symptom names based on semi-Markov |
CN109192201A (en) * | 2018-09-14 | 2019-01-11 | 苏州亭云智能科技有限公司 | Voice field order understanding method based on dual model identification |
CN109388801B (en) * | 2018-09-30 | 2023-07-14 | 创新先进技术有限公司 | Method and device for determining similar word set and electronic equipment |
CN110162749B (en) * | 2018-10-22 | 2023-07-21 | 哈尔滨工业大学(深圳) | Information extraction method, information extraction device, computer equipment and computer readable storage medium |
CN109657061B (en) * | 2018-12-21 | 2020-11-27 | 合肥工业大学 | Integrated classification method for massive multi-word short texts |
WO2020133039A1 (en) * | 2018-12-27 | 2020-07-02 | 深圳市优必选科技有限公司 | Entity identification method and apparatus in dialogue corpus, and computer device |
CN110298033B (en) * | 2019-05-29 | 2022-07-08 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Keyword corpus labeling training extraction system |
CN110287481B (en) * | 2019-05-29 | 2022-06-14 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Named entity corpus labeling training system |
CN110472248A (en) * | 2019-08-22 | 2019-11-19 | 广东工业大学 | A kind of recognition methods of Chinese text name entity |
CN110909548B (en) * | 2019-10-10 | 2024-03-12 | 平安科技(深圳)有限公司 | Chinese named entity recognition method, device and computer readable storage medium |
CN114943235A (en) * | 2022-07-12 | 2022-08-26 | 长安大学 | Named entity recognition method based on multi-class language model |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104133848A (en) * | 2014-07-01 | 2014-11-05 | 中央民族大学 | Tibetan language entity knowledge information extraction method |
CN104809176A (en) * | 2015-04-13 | 2015-07-29 | 中央民族大学 | Entity relationship extracting method of Zang language |
CN107133220A (en) * | 2017-06-07 | 2017-09-05 | 东南大学 | Name entity recognition method in a kind of Geography field |
CN107391485A (en) * | 2017-07-18 | 2017-11-24 | 中译语通科技(北京)有限公司 | Entity recognition method is named based on the Korean of maximum entropy and neural network model |
CN107608955A (en) * | 2017-08-31 | 2018-01-19 | 张国喜 | A kind of Chinese hides name entity inter-translation method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106484682B (en) * | 2015-08-25 | 2019-06-25 | 阿里巴巴集团控股有限公司 | Machine translation method, device and electronic equipment based on statistics |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108268447B (en) | Labeling method for Tibetan named entities | |
CN113011533B (en) | Text classification method, apparatus, computer device and storage medium | |
US9195646B2 (en) | Training data generation apparatus, characteristic expression extraction system, training data generation method, and computer-readable storage medium | |
Zitouni et al. | Maximum entropy based restoration of Arabic diacritics | |
CN111738003B (en) | Named entity recognition model training method, named entity recognition method and medium | |
CN110263325B (en) | Chinese word segmentation system | |
CN114065758B (en) | Document keyword extraction method based on hypergraph random walk | |
CN107480200B (en) | Word labeling method, device, server and storage medium based on word labels | |
CN112101027A (en) | Chinese named entity recognition method based on reading understanding | |
CN113505200A (en) | Sentence-level Chinese event detection method combining document key information | |
CN108681532B (en) | Sentiment analysis method for Chinese microblog | |
Grönroos et al. | Morfessor EM+ Prune: Improved subword segmentation with expectation maximization and pruning | |
Chen et al. | Integrating natural language processing with image document analysis: what we learned from two real-world applications | |
CN107797986B (en) | LSTM-CNN-based mixed corpus word segmentation method | |
CN114491062B (en) | Short text classification method integrating knowledge graph and topic model | |
WO2022242074A1 (en) | Multi-feature fusion-based method for named entity recognition in chinese medical text | |
CN111133429A (en) | Extracting expressions for natural language processing | |
CN113127607A (en) | Text data labeling method and device, electronic equipment and readable storage medium | |
CN116629238A (en) | Text enhancement quality evaluation method, electronic device and storage medium | |
CN112989839A (en) | Keyword feature-based intent recognition method and system embedded in language model | |
Bach et al. | Rre task: The task of recognition of requisite part and effectuation part in law sentences | |
Zhou et al. | Online handwritten Japanese character string recognition using conditional random fields | |
CN115858733A (en) | Cross-language entity word retrieval method, device, equipment and storage medium | |
Oprean et al. | Handwritten word recognition using Web resources and recurrent neural networks | |
Li et al. | Attention-based LSTM-CNNs for uncertainty identification on Chinese social media texts |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||