CN108268447B - Labeling method for Tibetan named entities - Google Patents


Info

Publication number
CN108268447B
CN108268447B CN201810059120.7A
Authority
CN
China
Prior art keywords
vector
corpus
word
labeled
named entities
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810059120.7A
Other languages
Chinese (zh)
Other versions
CN108268447A (en)
Inventor
夏建华
张进兵
韩立新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN201810059120.7A
Publication of CN108268447A
Application granted
Publication of CN108268447B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a labeling method for Tibetan named entities. It adopts a semi-supervised learning mode: the labeled corpus is used to train a dual-granularity model, namely a coarse-granularity NER based on word-vector KNN clustering and a fine-granularity NER based on semi-Markov CRFs; the unlabeled corpus is then labeled, newly labeled entities are added to the labeled corpus for retraining the dual-granularity model, and the dual-granularity NER is iteratively improved. The method overcomes the over-reliance of supervised learning on labeled corpora and the single discrimination mode of the traditional CRFs method. It integrates features such as entity semantics and the interaction among named entities, combines a clustering graph with a probability graph, improves the degree of model fit through the complementary strengths of the semantics and the grammatical structure of named entities, realizes collective NER, and effectively improves the accuracy and efficiency of Tibetan named entity recognition.

Description

Labeling method for Tibetan named entities
Technical Field
The invention relates to the technical field of language processing, in particular to a labeling method of Tibetan named entities.
Background
Named Entity Recognition (NER) refers to detecting entity words composed of single characters, words, or multiple words in a text and determining which entity class they belong to: person name, place name, organization, etc. From the viewpoint of Natural Language Processing (NLP), named entity recognition mainly solves the recognition of entities not registered in a dictionary. From the perspective of knowledge discovery and acquisition, named entity recognition extracts, from unstructured text, the named entities related to the information a user desires. The effectiveness of named entity recognition directly affects the performance of the research and application systems built on top of it, such as structured representation of text, information extraction, information retrieval, machine translation, and question answering systems.
Tibetan shares certain commonalities with Chinese, English, and other written languages, but also has special characteristics: the Tibetan syllable structure takes a base character as its core, with other letters added before and after it and stacked above and below it to form a complete syllable. Although the dictionaries, rules, grammar, and features used in Tibetan named entity recognition differ from those of other languages, from the perspective of the methodology of named entity recognition the methods adopted do not differ from those used for other languages.
There are many named entity recognition methods, ranging from Supervised Learning (SL) to Unsupervised Learning (UL) and from Rule-and-Dictionary Based Learning (RDBL) to Statistical Machine Learning (SML), but these methods still have certain drawbacks. For example, in a supervised learning setting, although the classifier fits well after training on labeled data, the premise is that many linguists spend a great deal of time labeling the original corpus. UL, the opposite of SL, avoids the cost of annotating data, but its entity recognition performance is markedly inferior because it lacks the prior knowledge needed for training and learning. In the process of labeling data, people acquire large numbers of rules and perform entity recognition from the perspective of entity construction rules; although this achieves a certain accuracy on small data sets, as data sets grow, especially in the current big-data era, the main problem exposed by rule-based entity recognition is that a rule base cannot exhaust all named entity rules. Stated another way, RDBL does not take full advantage of the context and associated features of named entities, whereas SML significantly improves accuracy precisely by exploiting the context-dependent characteristics of named entities in annotated data. Examples include Hidden Markov Models (HMMs), Support Vector Machines (SVMs), Maximum Entropy Markov Models (MEMMs), Conditional Random Fields (CRFs), and skip-chain Conditional Random Fields (skip-chain CRFs). Among these, the conditional random field adopts statistically normalized probabilities over the global scope, overcomes the label bias problem of HMMs and MEMMs, and obtains better classification results; on the basis of basic CRFs, using artificial synonym pairs achieves better document-level NER accuracy than conventional NER algorithms. The above statistical learning methods all consider entity recognition from a fine-grained point of view; when discriminating an NE (named entity), the CRFs method fails to consider the metric properties of features, the internal features of the entity (e.g., those without the Markov property), and so on. Furthermore, such methods rely heavily on the annotated corpus, which amounts to looking up entities and computing matches in a generalized dictionary (an annotated corpus containing features and named entities); recognition errors may increase when a named entity that needs to be annotated does not appear in the generalized dictionary and its near-synonym NEs do not share a similar context.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a labeling method for Tibetan named entities, solving the technical problems that supervised learning methods depend excessively on labeled corpora and that traditional rule-based and statistical methods use an isolated discrimination mode.
In order to solve the above technical problems, the technical scheme adopted by the invention is as follows: a method for labeling Tibetan named entities comprises the following steps:
normalizing the unmarked data to obtain unmarked normalized corpora, and adding the newly marked named entity into the original marked corpora;
training a noun phrase annotator Semi-Markov CRFs_1 with the labeled corpus, and then using it to carry out noun phrase segmentation and labeling on the normalized corpus;
reading the labeled corpus and the normalized corpus, establishing a CBOW model jointly over characters, words, phrases and named entities, and obtaining, through the training of the CBOW model, a corpus matrix and a vector space of the nominal characters, words, phrases and named entities;
based on the vector space, finding the K nearest labeled named entities of an unlabeled noun phrase with the KNN algorithm, calculating the cosine similarity between the unlabeled noun phrase and the K nearest labeled named entities, then selecting from the K neighbors the q named entities whose similarity values are greater than a preset threshold λ, where 0 ≤ q ≤ K; if q > 0, the named entity category of the unlabeled noun phrase takes the category of the named entity with the largest cosine similarity among the K nearest neighbors; adding the newly labeled named entities into the labeled corpus, so that the normalized corpus obtains partial labels;
reading the sequence data of the labeled corpus and training a fine-granularity marker Semi-Markov CRFs_2; then labeling the unlabeled named entities in the normalized corpus with Semi-Markov CRFs_2 to realize the full labeling of the named entities.
The normalization processing comprises: word segmentation and sentence normalization, punctuation normalization, word segmentation and part-of-speech tagging normalization, and stop word normalization.
The corpus matrix obtaining method comprises the following steps:
firstly, constructing a dictionary containing four subsets of characters, words, phrases and named entities, and carrying out a vector initialization operation on each element of the dictionary: assigning to each element a random vector of 400-600 dimensions, the value of each dimension being limited to [-1, 1];

secondly, establishing a sliding window of length 5, and sequentially reading data from the labeled corpus and the noun-phrase-labeled normalized corpus to obtain window data $win = \langle x_{-2}\ x_{-1}\ x_0\ x_{+1}\ x_{+2} \rangle$, where 0 denotes the center position of the window and $x_0$ denotes the target word;

using $Context = \{x_{\pm p}\}$, $p = 1,2$ to denote the context of $x_0$, and carrying out the preprocessing of the context word vectors of $x_0$; when $x_{\pm p}$ is a character, word, phrase or named entity, it is processed as follows:

when $x_{\pm p} \in \{\text{character}\}$, the vector $v(x_{\pm p})$ of $x_{\pm p}$ takes the value of the character vector $character^{vector}$;

when $x_{\pm p} \in \{\text{word}\}$, the vector $v(x_{\pm p})$ of $x_{\pm p}$ takes the value of the word vector $word^{vector}$, the formula being:

$$word^{vector} = \frac{1}{|N_{\pm p}|} \sum_{j=1}^{|N_{\pm p}|} character_j^{vector}$$

where $word^{vector}$ denotes the vector corresponding to the word that $x_{\pm p}$ belongs to, $character_j^{vector}$ denotes the vector of the j-th Tibetan character in the word, and $|N_{\pm p}|$ denotes the number of characters contained in the context word $x_{\pm p}$ of the target word $x_0$;

when $x_{\pm p} \in \{\text{phrase}\}$, the vector $v(x_{\pm p})$ of $x_{\pm p}$ takes the value of the phrase vector $chunking^{vector}$, the formula being:

$$chunking^{vector} = \frac{1}{|N'_{\pm p}|} \sum_{q=1}^{|N'_{\pm p}|} word_q^{vector}$$

where $chunking^{vector}$ denotes the vector corresponding to $x_{\pm p}$ when it belongs to a phrase, $word_q^{vector}$ denotes the vector of the q-th Tibetan word in the phrase, and $|N'_{\pm p}|$ denotes the number of words contained in the context word $x_{\pm p}$ of the target word $x_0$;

when $x_{\pm p} \in \{\text{named entity}\}$, it is correspondingly processed according to its category of character, word or phrase, as above;

then, calculating the vector mean $Context(x_0)$ of the context of $x_0$ input to the CBOW model, the formula being:

$$Context(x_0) = \frac{1}{4} \sum_{p=1}^{2} \big( v(x_{-p}) + v(x_{+p}) \big)$$

where $Context(x_0)$ represents the input of the CBOW model and $p = 1,2$;

and establishing the objective function of the CBOW learning algorithm with noise-contrastive estimation, the formula being:

$$L(\theta) = \sum_{x_0 \in D} \Big[ \log \sigma\big(\theta \cdot Context(x_0)\big) + \sum_{x'_0 \in NCE(x_0)} \log\big(1 - \sigma\big(\theta \cdot Context(x'_0)\big)\big) \Big]$$

where $\theta$ represents the weight vector of $Context(x_0)$; $D$ represents the corpus; $\sigma(\cdot)$ represents the activation function; $x'_0$ represents a negative sample; $NCE(x_0)$ represents the set of negative samples, to which $x_0$ does not belong; and $Context(x'_0)$ represents the context word-vector mean of a negative sample, in which the original target word in the window is replaced by $x'_0$;

finally, learning the parameters with a stochastic gradient ascent algorithm and updating the context word vectors; when CBOW has traversed the whole corpus, the corpus matrix is obtained.
The method for constructing the vector space comprises: extracting from the corpus matrix the vector space generated by the vectors of all nominal characters, words, phrases and named entities.
The specific method for training the fine-granularity marker Semi-Markov CRFs_2 is as follows:
using a sliding window of length 3 units, one window of data $s\_w = \langle s_{-1}\ s_0\ s_{+1} \rangle$ of the phrase segmentation sequence data $s\_x$ of a sentence is read from the labeled corpus, together with one window of mark data $s\_l = \langle y'_{-1}\ y'_0\ y'_{+1} \rangle$ of the phrase mark sequence $s\_y$ corresponding to $s\_x$; wherein $s_0$ denotes the target phrase, $s_{-1}$ and $s_{+1}$ respectively denote the context preceding and following $s_0$; $y'_{-1}, y'_0, y'_{+1} \in Y'$, $Y' = \{R, T, P, O, N\}$, where R denotes a person name, T denotes a time, P denotes a place, O denotes an organization, and N denotes other types of phrases;

constructing the state transition feature functions $t_{v1}(y'_{-1}, y'_0, s\_x, k')$ and $t'_{v2}(y'_{+1}, y'_0, s\_x, k')$, the state feature function $s'_{v3}(y'_0, s\_x, k')$, the segmentation feature function $seg_{v4}(y'_0, s_0, ss_0, ee_0)$, and the feature function vector $F'(s\_x, s\_y) = \langle f_1(s\_x, s\_y), f_2(s\_x, s\_y) \dots f_z(s\_x, s\_y) \rangle$, where $y'_0$, $y'_{-1}$ and $y'_{+1}$ respectively denote the named entity category marks of the current, previous and next units of the window; $k'$ represents the current position of the feature function; 0 denotes the middle position of the window; $ss_0$ denotes the start point of the segmentation $s_0$ and $ee_0$ denotes its end point; $f_{z'}(s\_x, s\_y) = \sum_{k'} f_{z''}(y'_{-1}, y'_0, s\_x, k')$ represents the sum of the features at each position, with $z', z'' = 1, 2 \dots z$; the feature function vector is composed of the concatenation of the $v1 + v2$ state transition feature functions, the $v3$ state feature functions and the $v4$ segmentation feature functions, where $z = v1 + v2 + v3 + v4$;

creating the Semi-Markov conditional random field Tibetan fine-granularity marker Semi-Markov CRFs_2 from these feature functions, and training the marker on the labeled corpus by combining the L-BFGS algorithm with gradient ascent.
Semi-Markov CRFs_2 is used to realize the full labeling of the named entities in the normalized corpus, the specific method being as follows:

from the partially labeled normalized corpus, one window of data $s\_w = \langle s_{-1}\ s_0\ s_{+1} \rangle$ of size 3 of the phrase segmentation sequence $s\_x$ of a sentence is read; the trained fine-granularity Tibetan named entity marker traverses the unlabeled noun phrases in the corpus and calculates the maximum conditional probability $P(s\_y \mid s\_x, W')$ of the named entity mark sequence $s\_y$ of $s\_x$ to determine the category of an unlabeled entity, the formula being:

$$s\_y^{*} = \arg\max_{s\_y} P(s\_y \mid s\_x, W') = \arg\max_{s\_y} \frac{\exp\big(W' \cdot F'(s\_x, s\_y)\big)}{Z'_{W'}(s\_x)}$$

where $|W'| = |W_{v1}| + |W_{v2}| + |W_{v3}| + |W_{v4}|$ represents the length of the weight vector; the mark sequence is output by the Viterbi algorithm according to the calculated conditional probabilities, finally realizing the full labeling of the named entities.
Compared with the prior art, the invention has the following beneficial effects:
1. The limitation of the prior knowledge needed for model training in supervised CRFs learning is overcome; the internal feature information of the named entities in the external dictionary is utilized, which further improves the utilization of the dictionary and the accuracy of Tibetan named entity recognition.
2. The method integrates features such as entity near-synonyms, the semantic similarity of entity word contexts, and entity word vectors, and combines clustering with a probability graph to realize entity recognition collectively; compared with the CRFs method, it integrates features from multiple aspects and further improves the accuracy of Tibetan named entity recognition.
3. The NER based on word-vector KNN clustering at the coarse granularity level and the NER based on semi-Markov CRFs at the fine granularity level improve the degree of model fit through the complementarity of the semantics and the grammatical structure of named entities.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a corpus normalization flow chart;
FIG. 3 is a flow chart of noun phrase tagging;
FIG. 4 is a flow chart of corpus matrix creation based on the CBOW model of word entity federation;
FIG. 5 is a flow chart of coarse grain labeling;
fig. 6 is a flow chart of fine-grained annotation.
Detailed Description of the Embodiments
The invention provides a labeling method for Tibetan named entities, which uses a labeled corpus to train a noun phrase marker and a fine-granularity named entity marker, and uses an unlabeled corpus to improve the performance of the labeling method in a semi-supervised learning mode. First, the unlabeled corpus is normalized; the labeled corpus and the normalized corpus are then used to train a noun phrase marker Semi-Markov CRFs_1 and to segment and label noun phrases; a CBOW model jointly over characters, words and entities is further created to obtain a corpus matrix and a vector space, so as to realize coarse-granularity K-Nearest-Neighbor named entity labeling; the newly labeled named entities are added to the labeled corpus, a fine-granularity marker Semi-Markov CRFs_2 is trained, and the partially labeled normalized corpus is traversed to realize the complete labeling of the corpus. The method overcomes the limitation that supervised learning depends excessively on labeled corpora and the isolated discrimination mode of traditional rule-based and statistical methods, integrates features such as entity semantics and the interaction between named entities, combines a clustering graph with a probability graph, realizes entity recognition collectively, and effectively improves the efficiency of Tibetan named entity recognition.
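Before the step-by-step details, the overall loop can be summarized in code. The following is a minimal, runnable sketch of the semi-supervised iteration; every function here is a hypothetical stub standing in for the corresponding step below (steps 101-105), shown only to make the control flow concrete:

```python
# Hypothetical stubs for steps 101-105; real implementations are described below.

def normalize(unlabeled):                      # step 101: corpus normalization
    return [s.strip() for s in unlabeled]

def train_np_chunker(labeled):                 # step 102: Semi-Markov CRFs_1
    return lambda corpus: corpus               # stub: returns corpus unchanged

def train_cbow(labeled, normalized):           # step 103: word-entity joint CBOW
    return {}, {}                              # stub corpus matrix, vector space

def knn_coarse_label(space, normalized):       # step 104: coarse KNN pass
    return []                                  # stub: no new entities found

def train_fine_tagger(labeled):                # step 105: Semi-Markov CRFs_2
    return lambda corpus: corpus               # stub full-labeling pass

def semi_supervised_ner(labeled, unlabeled, rounds=3):
    for _ in range(rounds):
        normalized = normalize(unlabeled)
        normalized = train_np_chunker(labeled)(normalized)
        matrix, space = train_cbow(labeled, normalized)
        labeled = labeled + knn_coarse_label(space, normalized)
        normalized = train_fine_tagger(labeled)(normalized)
        unlabeled = normalized                 # iterate on the improved corpus
    return labeled
```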
The technical solution of the present invention is explained in detail below with reference to the accompanying drawings and the specific embodiments, but the scope of the present invention is not limited to the embodiments.
FIG. 1 is a flow chart of the present invention, comprising the steps of:
step 101, corpus normalization
Normalization processing is carried out on the unlabeled data to obtain the unlabeled normalized corpus, and newly labeled named entities are added into the original labeled corpus.
As shown in fig. 2, the method specifically includes the following steps:
step 201, inputting a non-labeled corpus, reading a sentence in the corpus each time by using a window function, wherein each sliding takes a sentence as a basic unit, and the size of the sliding window takes the length of the longest sentence.
Step 202, performing word segmentation on the Tibetan sentence (using a third-party word segmentation tool) to obtain basic words, and performing normalization processing, which includes: removing illegal Tibetan sentences, i.e., sentences that do not conform to the Tibetan language model; normalizing non-Tibetan numerals; and normalizing Tibetan punctuation marks, for example, phrases or words adopt "/" as separators, while the single vertical mark (shad) used at sentence ends, the double vertical mark used at chapter ends, the quadruple vertical mark used at the end of a volume, and the cloud-head mark used for titles or chapter openings are uniformly replaced by "//" as separators.
Step 203, outputting the normalized corpus to obtain normalized data of the non-labeled corpus; the newly labeled noun phrases and named entities are added to the labeled corpus.
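A minimal sketch of the punctuation unification in step 202, assuming the Unicode code points of the Tibetan shad family (the patent itself shows these marks as images, so the exact symbol set is an assumption):

```python
import re

# Assumed Unicode code points for the Tibetan marks named in step 202.
SHAD = "\u0F0D"        # single shad, used at sentence ends
NYIS_SHAD = "\u0F0E"   # double shad, used at chapter ends
YIG_MGO = "\u0F04"     # head mark (cloud head), used for titles

def normalize_punctuation(sentence: str) -> str:
    """Unify the terminal-mark family to the '//' separator."""
    marks = f"[{SHAD}{NYIS_SHAD}{YIG_MGO}]+"
    sentence = re.sub(marks, "//", sentence)
    # words/phrases within the sentence are joined with '/'
    return "/".join(t for t in sentence.split() if t)
```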
Step 102, noun phrase annotator Semi-Markov CRFs_1
Semi-Markov CRFs_1 is trained with the labeled corpus and then used to realize the segmentation and labeling of noun phrases in the normalized corpus.
As shown in fig. 3, the method specifically includes the following steps:
Step 301, inputting the corpora. The labeled corpus and the normalized corpus are input for the training and the testing of the model, respectively.
Step 302, Semi-Markov CRFs_1 is divided into two stages: training and labeling (testing). First, in the training stage, the word sequence data $x = \langle x_1 x_2 \dots x_n \rangle$ of a sentence, the word sequence tag data $y = \langle y_1 y_2 \dots y_i \dots y_n \rangle$, and the phrase segmentation sequence $s = \langle s_1 s_2 \dots s_j \dots \rangle$ corresponding to the sentence are read from the labeled corpus, where $x_n$ denotes a word; $y_i$ denotes the current segmentation mark, $y_i \in Y$, $Y = \{F, E\}$, F denoting a non-noun mark and E denoting a noun mark; $s_j$ denotes the j-th phrase segmentation of $x$, with $j \le n$; each phrase segmentation $s_j = \langle b_j, e_j, y_j \rangle$, where $b_j$ denotes the start point of the current segmentation, $e_j$ denotes its end point, and $y_j \in Y$; the segmentation feature function $f_k(j, x, s_j) = f_k(y_j, y_{j-1}, x, b_j, e_j)$ is constructed, where $y_{j-1}$ is the previous segmentation mark, $k$ represents the index of the feature function, and $C$ represents the number of segmentation feature functions $f_k(j, x, s_j)$ contained in the corpus;

the conditional probability of the noun phrase marker is established, the formula being:

$$P(s \mid x, W) = \frac{1}{Z_W(x)} \exp\big(W \cdot F(x, s)\big)$$

where $W$ is the weight vector of the segmentation feature function vector $F(x, s)$, and $Z_W(x) = \sum_{s'} e^{W \cdot F(x, s')}$ is a normalization factor, with $s'$ ranging over all possible valid sequence segmentations;

according to the labeled corpus $\{(x_t, s_t)\}_{t=1}^{Num}$, where $s_t$ denotes the noun phrase mark sequence of the t-th word sequence $x_t$ in the labeled corpus and $Num$ is the total number of sentences of the labeled corpus, the objective function of the phrase segmentation sequence is created, the formula being:

$$L(W) = \sum_{t=1}^{Num} \log P(s_t \mid x_t, W) = \sum_{t=1}^{Num} \big( W \cdot F(x_t, s_t) - \log Z_W(x_t) \big)$$

the gradient is calculated with the L-BFGS algorithm, and the weight vector $W$ is updated by gradient ascent.
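To make the conditional probability and objective concrete, the following toy sketch computes $P(s \mid x, W)$ exactly by enumerating every segmentation of a very short sentence; the two features and their weights are hypothetical illustrations, not the patent's feature set:

```python
from itertools import product
import math

LABELS = ("F", "E")  # non-noun / noun segment marks, as defined above

def segmentations(i, n, max_len):
    """Yield all segmentations of positions i..n-1 into spans of length <= max_len."""
    if i == n:
        yield []
        return
    for j in range(i + 1, min(i + max_len, n) + 1):
        for rest in segmentations(j, n, max_len):
            yield [(i, j - 1)] + rest

def score(segs, labels, W):
    # two hypothetical features: long noun segments, and F -> E transitions
    f1 = sum(1 for (b, e), y in zip(segs, labels) if y == "E" and e - b + 1 >= 2)
    f2 = sum(1 for k in range(1, len(labels))
             if labels[k - 1] == "F" and labels[k] == "E")
    return W[0] * f1 + W[1] * f2

def conditional_probability(x, segs, labels, W, max_len=3):
    """P(s | x, W) = exp(W.F(x, s)) / Z_W(x), with Z_W(x) enumerated exactly."""
    Z = sum(math.exp(score(s, ls, W))
            for s in segmentations(0, len(x), max_len)
            for ls in product(LABELS, repeat=len(s)))
    return math.exp(score(segs, labels, W)) / Z

# example: segmentation <(0,1), (2,2)> labeled <E, F> of a 3-word sentence
print(conditional_probability(["w1", "w2", "w3"], [(0, 1), (2, 2)], ["E", "F"], (1.0, 0.5)))
```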
Second, in the testing stage, the word sequence data $x' = \langle x'_1 x'_2 \dots x'_n \rangle$ of a sentence is read from the normalized corpus, the maximum length of a noun phrase is preset to $L$, and the maximum conditional probability $P(s \mid x', W)$ of the phrase segmentation sequence $s$ is calculated for the word sequence data $x'$, the formula being:

$$s^{*} = \arg\max_{s} P(s \mid x', W) = \arg\max_{s} \frac{1}{Z_W(x')} \exp\Big( \sum_{i=1}^{|s|} W \cdot f(y_{i-1}, y_i, x', b_i, e_i) \Big)$$

where $|s|$ denotes the number of phrases after segmentation of $x'$ and $i$ denotes the position of the current phrase in the phrase sequence;

the best phrase segmentation sequence and noun phrase marks of $x'$ are obtained with the Viterbi algorithm, the recursion being:

$$Viterbi(i, y) = \max_{y',\, d = 1 \dots L} \big[ Viterbi(i - d, y') + W \cdot f(y', y, x', i - d + 1, i) \big]$$

with the proviso that $i > 0$; if $i = 0$, $Viterbi(i, y)$ takes the value 0; otherwise $Viterbi(i, y)$ takes the value $-\infty$.
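The recursion above can be decoded with a short semi-Markov Viterbi routine. This is a minimal sketch under the stated recursion; `span_score` is a hypothetical stand-in for the weighted feature sum $W \cdot f(y', y, x', i-d+1, i)$, here with 0-based spans:

```python
NEG_INF = float("-inf")

def semi_markov_viterbi(x, labels, L, span_score):
    """Best labeled segmentation; Viterbi(0, y) = 0, spans limited to length L."""
    n = len(x)
    V = {(0, y): 0.0 for y in labels}
    back = {}
    for i in range(1, n + 1):
        for y in labels:
            best, arg = NEG_INF, None
            for d in range(1, min(L, i) + 1):          # candidate span length
                for yp in labels:                      # previous segment label
                    s = V[(i - d, yp)] + span_score(yp, y, x, i - d, i - 1)
                    if s > best:
                        best, arg = s, (i - d, yp)
            V[(i, y)], back[(i, y)] = best, arg
    i, y = n, max(labels, key=lambda lab: V[(n, lab)])
    segs = []
    while i > 0:                                       # trace back the best path
        j, yp = back[(i, y)]
        segs.append((j, i - 1, y))                     # span [j, i-1] labeled y
        i, y = j, yp
    return list(reversed(segs))

# hypothetical scorer: favor two-word noun segments
best = semi_markov_viterbi(list("abcd"), ("F", "E"), L=3,
                           span_score=lambda yp, y, x, b, e:
                               1.0 if y == "E" and e - b + 1 == 2 else 0.0)
print(best)  # e.g. [(0, 1, 'E'), (2, 3, 'E')]
```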
Step 303, the trained Semi-Markov CRFs_1 is used to traverse the normalized corpus and realize its sequence segmentation and noun phrase labeling. For example, in the sentence [Tibetan text] (Chinese gloss: "Adam / Yushu / Tibetan / autonomous / prefecture / visit"), the compound noun phrase [Tibetan text] (Chinese translation: "Yushu Tibetan Autonomous Prefecture") is taken as a whole, although it contains the place-name named entity [Tibetan text] (Chinese translation: "Yushu"); non-nominal phrases are not segmented and labeled, but are still treated as basic words. Finally, the noun-phrase-labeled corpus is obtained.
Step 103, word-entity joint CBOW
The word-entity joint CBOW is constructed and trained with the labeled corpus and the noun-phrase-labeled normalized corpus to obtain the vector space of the corpus words and noun phrases.
As shown in fig. 4, the method specifically includes the following steps:
step 401, inputting noun phrase tagging corpus.
Step 402, this step is the word-entity joint Continuous Bag-of-Words (CBOW) model, through whose training the vectors of characters, words, noun phrases and named entities (collectively referred to as word vectors) are obtained. The specific process is as follows: (1) a dictionary is constructed containing all characters, words (two or more Tibetan characters that cannot be further segmented), phrases (compounds of two or more words) and named entities (labeled characters, words and phrases) of the Tibetan corpus, and a vector initialization is performed on each element of the dictionary, i.e., each element is assigned a random vector of 400-600 dimensions, the value of each dimension being limited to [-1, 1].

Secondly, a sliding window of length 5 is established, and the sequence data $x$ is read sequentially from the labeled corpus and the normalized corpus to obtain window data $win = \langle x_{-2}\ x_{-1}\ x_0\ x_{+1}\ x_{+2} \rangle$, where 0 denotes the center position of the window and $x_0$ denotes the target word; $Context = \{x_{\pm p}\}$, $p = 1,2$ denotes the context of $x_0$, and the preprocessing of the context word vectors of $x_0$ is carried out; when $x_{\pm p}$ is a character, word, phrase or named entity, it is treated as follows:

when $x_{\pm p} \in \{\text{character}\}$, the vector $v(x_{\pm p})$ of $x_{\pm p}$ takes the value of the character vector $character^{vector}$;

when $x_{\pm p} \in \{\text{word}\}$, the vector $v(x_{\pm p})$ of $x_{\pm p}$ takes the value of the word vector $word^{vector}$, the formula being:

$$word^{vector} = \frac{1}{|N_{\pm p}|} \sum_{j=1}^{|N_{\pm p}|} character_j^{vector}$$

where $word^{vector}$ denotes the vector corresponding to the word that $x_{\pm p}$ belongs to, $character_j^{vector}$ denotes the vector of the j-th Tibetan character in the word, and $|N_{\pm p}|$ denotes the number of characters contained in the context word $x_{\pm p}$ of the target word $x_0$;

when $x_{\pm p} \in \{\text{phrase}\}$, the vector of $x_{\pm p}$ is updated to the phrase vector $chunking^{vector}$:

$$chunking^{vector} = \frac{1}{|N'_{\pm p}|} \sum_{q=1}^{|N'_{\pm p}|} word_q^{vector}$$

where $chunking^{vector}$ denotes the vector corresponding to $x_{\pm p}$ when it belongs to a phrase, $word_q^{vector}$ denotes the vector of the q-th Tibetan word in the phrase, and $|N'_{\pm p}|$ denotes the number of words contained in the context word $x_{\pm p}$ of the target word $x_0$;

when $x_{\pm p} \in \{\text{named entity}\}$, it is correspondingly processed according to its category of character, word or phrase, as above.

(2) Mean: the vector mean $Context(x_0)$ of the context of $x_0$ input to CBOW is calculated, the formula being:

$$Context(x_0) = \frac{1}{4} \sum_{p=1}^{2} \big( v(x_{-p}) + v(x_{+p}) \big)$$

where $Context(x_0)$ represents the input of CBOW.

(3) The objective function of the CBOW learning algorithm is established with the Noise-Contrastive Estimation algorithm (NCE for short), the formula being:

$$L(\theta) = \sum_{x_0 \in D} \Big[ \log \sigma\big(\theta \cdot Context(x_0)\big) + \sum_{x'_0 \in NCE(x_0)} \log\big(1 - \sigma\big(\theta \cdot Context(x'_0)\big)\big) \Big]$$

where $\theta$ represents the weight vector of $Context(x_0)$; $D$ represents the corpus; $\sigma(\cdot)$ represents the activation function; $x'_0$ represents a negative sample; $NCE(x_0)$ represents the set of negative samples, to which $x_0$ does not belong; and $Context(x'_0)$ represents the context word-vector mean of a negative sample, in which the target word in the window is replaced by $x'_0$; the parameters are learned with a stochastic gradient ascent algorithm and the context word vectors are updated.
Step 403, when the CBOW traverses the whole corpus, a corpus matrix is obtained, that is, the corpus matrix includes word vectors of all the basic words and compound words.
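A minimal sketch of step 402's composition rules and the NCE-style objective term. The dictionary contents, the dimensionality, and the single-window objective below are illustrative assumptions; only the composition rules (word = mean of its character vectors, phrase = mean of its word vectors, CBOW input = mean of the four context vectors) follow the text:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 400
# stand-in "characters" with random vectors in [-1, 1], as in step (1)
char_vec = {c: rng.uniform(-1, 1, DIM) for c in "abcdefg"}

def word_vector(word):
    """Word vector = mean of the vectors of its characters."""
    return np.mean([char_vec[c] for c in word], axis=0)

def phrase_vector(words):
    """Phrase vector = mean of the vectors of its words."""
    return np.mean([word_vector(w) for w in words], axis=0)

def context_mean(vectors):
    """CBOW input: mean of the four context vectors around x0."""
    return np.mean(vectors, axis=0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nce_term(theta, ctx_pos, ctx_negs):
    """One window's contribution to the NCE objective L(theta)."""
    pos = np.log(sigmoid(theta @ ctx_pos))
    neg = sum(np.log(1.0 - sigmoid(theta @ c)) for c in ctx_negs)
    return pos + neg

theta = rng.uniform(-1, 1, DIM)
ctx = context_mean([char_vec["a"], word_vector("bc"),
                    phrase_vector(["de", "fg"]), char_vec["g"]])
print(nce_term(theta, ctx, [context_mean([char_vec["b"]] * 4)]))
```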
Step 104, coarse-granularity marker
According to the corpus matrix, partial labeling of the named entities in the normalized corpus is realized with K-Nearest Neighbor.
As shown in fig. 5, the method specifically includes the following steps:
step 501, inputting a corpus matrix.
Step 502, extracting word vectors of all the characters, words and phrases of the nominal part of speech from the corpus matrix to generate a vector space of noun phrases.
Step 503, based on the vector space, the KNN algorithm is used to find the K nearest labeled named entities of an unlabeled noun phrase, and the cosine similarity between the unlabeled noun phrase and the K nearest labeled named entities is calculated; then the q named entities whose similarity values are greater than a preset threshold λ (0 ≤ q ≤ K) are selected from the K neighbors; if q > 0, the named entity category of the unlabeled noun phrase takes the category of the named entity with the maximum cosine similarity among the K nearest neighbors.
Step 505, the newly labeled named entities are added to the labeled corpus, so that the normalized corpus obtains partial named entity labels.
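A minimal sketch of the coarse pass in steps 503 and 505: cosine similarity against labeled entity vectors, the K nearest neighbors, the threshold λ, and the best-match category. The vector size and the toy labeled set are illustrative:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def coarse_label(phrase_vec, labeled, K=5, lam=0.8):
    """Return a category if some of the K nearest neighbors clear lambda, else None."""
    scored = sorted(((cosine(phrase_vec, vec), cat) for vec, cat in labeled),
                    reverse=True)[:K]                   # K nearest labeled entities
    above = [(s, cat) for s, cat in scored if s > lam]  # the q entities with sim > lambda
    return max(above)[1] if above else None             # q > 0: best match's category

rng = np.random.default_rng(1)
labeled = [(rng.uniform(-1, 1, 8), cat) for cat in ("R", "T", "P", "O")]
print(coarse_label(rng.uniform(-1, 1, 8), labeled, K=3, lam=0.1))
```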
Step 105, fine-granularity marking
A fine-granularity Semi-Markov conditional random field tagger Semi-Markov CRFs_2 is trained with the labeled corpus, and the full tagging of the normalized corpus is carried out with this tagger.
As shown in fig. 6, the method specifically includes the following steps:
step 601, inputting linguistic data. And respectively inputting a marking corpus and a part of the marking corpus according to a training stage and a marking (testing) stage of the model, as shown by a dotted arrow in the figure.
Step 602, using a sliding window of length 3 units, one window of data $s\_w = \langle s_{-1}\ s_0\ s_{+1} \rangle$ of the phrase segmentation sequence data $s\_x$ of a sentence is read from the labeled corpus, together with one window of mark data $s\_l = \langle y'_{-1}\ y'_0\ y'_{+1} \rangle$ of the phrase mark sequence $s\_y$ corresponding to $s\_x$; wherein $s_0$ denotes the target phrase, $s_{-1}$ and $s_{+1}$ respectively denote the context preceding and following $s_0$; $y'_{-1}, y'_0, y'_{+1} \in Y'$, $Y' = \{R, T, P, O, N\}$, where R denotes a person name, T denotes a time, P denotes a place, O denotes an organization, and N denotes other types of phrases.
Step 603, Semi-Markov CRFs_2 is divided into two stages: training and testing.
(1) In the training stage, the state transition feature functions $t_{v1}(y'_{-1}, y'_0, s\_x, k')$ and $t'_{v2}(y'_{+1}, y'_0, s\_x, k')$, the state feature function $s'_{v3}(y'_0, s\_x, k')$, the segmentation feature function $seg_{v4}(y'_0, s_0, ss_0, ee_0)$, and the feature function vector $F'(s\_x, s\_y) = \langle f_1(s\_x, s\_y), f_2(s\_x, s\_y) \dots f_z(s\_x, s\_y) \rangle$ are created from the data read by the sliding window, where $y'_0$, $y'_{-1}$ and $y'_{+1}$ respectively denote the named entity category marks of the current, previous and next units of the window; $k'$ represents the current position of the feature function; 0 denotes the middle position of the window; $ss_0$ denotes the start point of the segmentation $s_0$ and $ee_0$ denotes its end point; $f_{z'}(s\_x, s\_y) = \sum_{k'} f_{z''}(y'_{-1}, y'_0, s\_x, k')$ represents the sum of the features at each position, with $z', z'' = 1, 2 \dots z$; the feature function vector is composed of the concatenation of the $v1 + v2$ state transition feature functions, the $v3$ state feature functions and the $v4$ segmentation feature functions, where $z = v1 + v2 + v3 + v4$;
the conditional probability of the named entity marker is established, the formula being:

$$P(s\_y \mid s\_x, W') = \frac{1}{Z'_{W'}(s\_x)} \exp\big(W' \cdot F'(s\_x, s\_y)\big)$$

where $W'$ is the weight vector of $F'(s\_x, s\_y)$ and $Z'_{W'}(s\_x)$ is a normalization factor; then, according to the labeled corpus $\{(x'_m, y'_m)\}_{m=1}^{M}$, where $M$ is the total number of phrase sequences of the labeled corpus and $x'_m$ and $y'_m$ respectively denote the phrase sequence of a sentence and its corresponding mark sequence over $Y'$, the objective function of the log-likelihood is established, the formula being:

$$L'(W') = \sum_{m} \log P(y'_m \mid x'_m, W') = \sum_{m} \big( W' \cdot F'(x'_m, y'_m) - \log Z'_{W'}(x'_m) \big)$$

the gradient is calculated with the L-BFGS algorithm, and the weight vector $W'$ is updated by gradient ascent.
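A minimal sketch of the four feature-function families named above, each reduced to a single hypothetical 0/1 indicator; a real marker concatenates $v1 + v2$ transition functions, $v3$ state functions and $v4$ segmentation functions into $F'(s\_x, s\_y)$:

```python
# y labels come from Y' = {R, T, P, O, N}; k is the window position (0 = middle)

def t_v1(y_prev, y_cur, s_x, k):
    """State transition feature: previous unit -> current unit."""
    return 1.0 if (y_prev, y_cur) == ("N", "P") else 0.0

def t_v2(y_next, y_cur, s_x, k):
    """State transition feature: current unit -> next unit."""
    return 1.0 if (y_cur, y_next) == ("P", "N") else 0.0

def s_v3(y_cur, s_x, k):
    """State feature at the current unit."""
    return 1.0 if y_cur == "R" and k == 0 else 0.0

def seg_v4(y_cur, s0, ss0, ee0):
    """Segmentation feature over the span [ss0, ee0] of the current phrase s0."""
    return 1.0 if y_cur != "N" and (ee0 - ss0 + 1) <= 4 else 0.0
```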
(2) In the testing (marking) stage, one window of data $s\_w = \langle s_{-1}\ s_0\ s_{+1} \rangle$ of the phrase segmentation sequence $s\_x$ of a sentence is read from the noun-phrase-labeled normalized corpus with a sliding window of length 3 units; the trained fine-granularity Tibetan named entity marker traverses the unlabeled noun phrases in the corpus and calculates the maximum conditional probability $P(s\_y \mid s\_x, W')$ of the named entity mark sequence $s\_y$ of $s\_x$ to determine the category of an unlabeled entity, the formula being:

$$s\_y^{*} = \arg\max_{s\_y} P(s\_y \mid s\_x, W') = \arg\max_{s\_y} \frac{\exp\big(W' \cdot F'(s\_x, s\_y)\big)}{Z'_{W'}(s\_x)}$$

where $|W'| = |W_{v1}| + |W_{v2}| + |W_{v3}| + |W_{v4}|$ represents the length of the weight vector; the mark sequence is output by the Viterbi algorithm according to the calculated conditional probabilities.
Step 604, the optimized Semi-Markov CRFs_2 traverses the partially labeled corpus to realize the full labeling of the named entities, obtaining a new named-entity-labeled corpus.
The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as falling within the protection scope of the present invention.

Claims (6)

1. A method for labeling Tibetan named entities, characterized by comprising the following steps:
normalizing the unmarked data to obtain unmarked normalized corpora, and adding the newly marked named entity into the original marked corpora;
training a noun phrase annotator Semi-Markov CRFs_1 with the labeled corpus, and then using it to carry out noun phrase segmentation and labeling on the normalized corpus;
reading the labeled corpus and the normalized corpus, establishing a CBOW model jointly over characters, words, phrases and named entities, and obtaining, through the training of the CBOW model, a corpus matrix and a vector space of the nominal characters, words, phrases and named entities;
based on the vector space, finding the K nearest labeled named entities of an unlabeled noun phrase with the KNN algorithm, calculating the cosine similarity between the unlabeled noun phrase and the K nearest labeled named entities, then selecting from the K neighbors the q named entities whose similarity values are greater than a preset threshold λ, where 0 ≤ q ≤ K; if q > 0, the named entity category of the unlabeled noun phrase takes the category of the named entity with the maximum cosine similarity among the K nearest neighbors; adding the newly labeled named entities into the labeled corpus, so that the normalized corpus obtains partial labels;
reading the sequence data of the labeled corpus and training a fine-granularity marker Semi-Markov CRFs_2; then labeling the unlabeled named entities in the normalized corpus with Semi-Markov CRFs_2 to realize the full labeling of the named entities.
2. The method for labeling Tibetan named entities of claim 1, wherein the normalization process comprises: word segmentation and sentence normalization, punctuation normalization, word segmentation and part-of-speech tagging normalization, and stop word normalization.
3. The method for labeling Tibetan named entities according to claim 1, wherein the corpus matrix is obtained by the following method:
firstly, constructing a dictionary containing four subsets of characters, words, phrases and named entities, and carrying out a vector initialization operation on each element of the dictionary: assigning to each element a random vector of 400-600 dimensions, the value of each dimension being limited to [-1, 1];

secondly, establishing a sliding window of length 5, and sequentially reading data from the labeled corpus and the noun-phrase-labeled normalized corpus to obtain window data $win = \langle x_{-2}\ x_{-1}\ x_0\ x_{+1}\ x_{+2} \rangle$, where 0 denotes the center position of the window and $x_0$ denotes the target word;

using $Context = \{x_{\pm p}\}$, $p = 1,2$ to denote the context of $x_0$, and carrying out the preprocessing of the context word vectors of $x_0$; when $x_{\pm p}$ is a character, word, phrase or named entity, it is processed as follows:

when $x_{\pm p} \in \{\text{character}\}$, the vector $v(x_{\pm p})$ of $x_{\pm p}$ takes the value of the character vector $character^{vector}$;

when $x_{\pm p} \in \{\text{word}\}$, the vector $v(x_{\pm p})$ of $x_{\pm p}$ takes the value of the word vector $word^{vector}$, the formula being:

$$word^{vector} = \frac{1}{|N_{\pm p}|} \sum_{j=1}^{|N_{\pm p}|} character_j^{vector}$$

where $word^{vector}$ denotes the vector corresponding to the word that $x_{\pm p}$ belongs to, $character_j^{vector}$ denotes the vector of the j-th Tibetan character in the word, and $|N_{\pm p}|$ denotes the number of characters contained in the context word $x_{\pm p}$ of the target word $x_0$;

when $x_{\pm p} \in \{\text{phrase}\}$, the vector $v(x_{\pm p})$ of $x_{\pm p}$ takes the value of the phrase vector $chunking^{vector}$, the formula being:

$$chunking^{vector} = \frac{1}{|N'_{\pm p}|} \sum_{q=1}^{|N'_{\pm p}|} word_q^{vector}$$

where $chunking^{vector}$ denotes the vector corresponding to $x_{\pm p}$ when it belongs to a phrase, $word_q^{vector}$ denotes the vector of the q-th Tibetan word in the phrase, and $|N'_{\pm p}|$ denotes the number of words contained in the context word $x_{\pm p}$ of the target word $x_0$;

when $x_{\pm p} \in \{\text{named entity}\}$, it is correspondingly processed according to its category of character, word or phrase, as above;

then, calculating the vector mean $Context(x_0)$ of the context of $x_0$ input to CBOW, the formula being:

$$Context(x_0) = \frac{1}{4} \sum_{p=1}^{2} \big( v(x_{-p}) + v(x_{+p}) \big)$$

where $Context(x_0)$ represents the input of the CBOW model and $p = 1,2$;

and establishing the objective function of the CBOW learning algorithm with noise-contrastive estimation, the formula being:

$$L(\theta) = \sum_{x_0 \in D} \Big[ \log \sigma\big(\theta \cdot Context(x_0)\big) + \sum_{x'_0 \in NCE(x_0)} \log\big(1 - \sigma\big(\theta \cdot Context(x'_0)\big)\big) \Big]$$

where $\theta$ represents the weight vector of $Context(x_0)$; $D$ represents the corpus; $\sigma(\cdot)$ represents the activation function; $x'_0$ represents a negative sample; $NCE(x_0)$ represents the set of negative samples, to which $x_0$ does not belong; and $Context(x'_0)$ represents the context word-vector mean of a negative sample, in which the original target word in the window is replaced by $x'_0$;

finally, learning the parameters with a stochastic gradient ascent algorithm and updating the context word vectors; and when the CBOW has traversed the whole corpus, obtaining the corpus matrix.
4. The method for labeling Tibetan named entities according to claim 1, wherein the method for constructing the vector space comprises: extracting from the corpus matrix the vector space generated by the vectors of all nominal characters, words, phrases and named entities.
5. The method for labeling Tibetan named entities according to claim 1, wherein the specific method for training the fine-granularity marker Semi-Markov CRFs_2 is as follows:

using a sliding window of length 3 units, one window of data $s\_w = \langle s_{-1}\ s_0\ s_{+1} \rangle$ of the phrase segmentation sequence data $s\_x$ of a sentence is read from the labeled corpus, together with one window of mark data $s\_l = \langle y'_{-1}\ y'_0\ y'_{+1} \rangle$ of the phrase mark sequence $s\_y$ corresponding to $s\_x$; wherein $s_0$ denotes the target phrase, $s_{-1}$ and $s_{+1}$ respectively denote the context preceding and following $s_0$; $y'_{-1}, y'_0, y'_{+1} \in Y'$, $Y' = \{R, T, P, O, N\}$, where R denotes a person name, T denotes a time, P denotes a place, O denotes an organization, and N denotes other types of phrases;

constructing the state transition feature functions $t_{v1}(y'_{-1}, y'_0, s\_x, k')$ and $t'_{v2}(y'_{+1}, y'_0, s\_x, k')$, the state feature function $s'_{v3}(y'_0, s\_x, k')$, the segmentation feature function $seg_{v4}(y'_0, s_0, ss_0, ee_0)$, and the feature function vector $F'(s\_x, s\_y) = \langle f_1(s\_x, s\_y), f_2(s\_x, s\_y) \dots f_z(s\_x, s\_y) \rangle$, where $y'_0$, $y'_{-1}$ and $y'_{+1}$ respectively denote the named entity category marks of the current, previous and next units of the window; $k'$ represents the current position of the feature function; 0 denotes the middle position of the window; $ss_0$ denotes the start point of the segmentation $s_0$ and $ee_0$ denotes its end point; $f_{z'}(s\_x, s\_y) = \sum_{k'} f_{z''}(y'_{-1}, y'_0, s\_x, k')$ represents the sum of the features at each position, with $z', z'' = 1, 2 \dots z$; the feature function vector is composed of the concatenation of the $v1 + v2$ state transition feature functions, the $v3$ state feature functions and the $v4$ segmentation feature functions, where $z = v1 + v2 + v3 + v4$;

creating the Semi-Markov conditional random field Tibetan fine-granularity marker Semi-Markov CRFs_2 from these feature functions, and training the marker on the labeled corpus by combining the L-BFGS algorithm with gradient ascent.
6. The method for labeling Tibetan named entities according to claim 1, wherein Semi-Markov CRFs_2 is used to realize the full labeling of the named entities in the normalized corpus, the specific method being as follows:

from the partially labeled normalized corpus, one window of data $s\_w = \langle s_{-1}\ s_0\ s_{+1} \rangle$ of size 3 of the phrase segmentation sequence $s\_x$ of a sentence is read; the trained fine-granularity Tibetan named entity marker traverses the unlabeled noun phrases in the corpus and calculates the maximum conditional probability $P(s\_y \mid s\_x, W')$ of the named entity mark sequence $s\_y$ of $s\_x$ to determine the category of an unlabeled entity, the formula being:

$$s\_y^{*} = \arg\max_{s\_y} P(s\_y \mid s\_x, W') = \arg\max_{s\_y} \frac{\exp\big(W' \cdot F'(s\_x, s\_y)\big)}{Z'_{W'}(s\_x)}$$

where $|W'| = |W_{v1}| + |W_{v2}| + |W_{v3}| + |W_{v4}|$ represents the length of the weight vector; the mark sequence is output by the Viterbi algorithm according to the calculated conditional probabilities, finally realizing the full labeling of the named entities.
CN201810059120.7A 2018-01-22 2018-01-22 Labeling method for Tibetan named entities Active CN108268447B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810059120.7A CN108268447B (en) 2018-01-22 2018-01-22 Labeling method for Tibetan named entities


Publications (2)

Publication Number Publication Date
CN108268447A CN108268447A (en) 2018-07-10
CN108268447B (en) 2020-12-01

Family

ID=62776300

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810059120.7A Active CN108268447B (en) 2018-01-22 2018-01-22 Labeling method for Tibetan named entities

Country Status (1)

Country Link
CN (1) CN108268447B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110738223B (en) * 2018-07-18 2022-04-08 宇通客车股份有限公司 Point cloud data clustering method and device of laser radar
CN110020428B (en) * 2018-07-19 2023-05-23 成都信息工程大学 Method for jointly identifying and normalizing Chinese medicine symptom names based on semi-Markov
CN109192201A (en) * 2018-09-14 2019-01-11 苏州亭云智能科技有限公司 Voice field order understanding method based on dual model identification
CN109388801B (en) * 2018-09-30 2023-07-14 创新先进技术有限公司 Method and device for determining similar word set and electronic equipment
CN110162749B (en) * 2018-10-22 2023-07-21 哈尔滨工业大学(深圳) Information extraction method, information extraction device, computer equipment and computer readable storage medium
CN109657061B (en) * 2018-12-21 2020-11-27 合肥工业大学 Integrated classification method for massive multi-word short texts
WO2020133039A1 (en) * 2018-12-27 2020-07-02 深圳市优必选科技有限公司 Entity identification method and apparatus in dialogue corpus, and computer device
CN110298033B (en) * 2019-05-29 2022-07-08 西南电子技术研究所(中国电子科技集团公司第十研究所) Keyword corpus labeling training extraction system
CN110287481B (en) * 2019-05-29 2022-06-14 西南电子技术研究所(中国电子科技集团公司第十研究所) Named entity corpus labeling training system
CN110472248A (en) * 2019-08-22 2019-11-19 广东工业大学 A kind of recognition methods of Chinese text name entity
CN110909548B (en) * 2019-10-10 2024-03-12 平安科技(深圳)有限公司 Chinese named entity recognition method, device and computer readable storage medium
CN114943235A (en) * 2022-07-12 2022-08-26 长安大学 Named entity recognition method based on multi-class language model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104133848A (en) * 2014-07-01 2014-11-05 中央民族大学 Tibetan language entity knowledge information extraction method
CN104809176A (en) * 2015-04-13 2015-07-29 中央民族大学 Entity relationship extracting method of Zang language
CN107133220A (en) * 2017-06-07 2017-09-05 东南大学 Name entity recognition method in a kind of Geography field
CN107391485A (en) * 2017-07-18 2017-11-24 中译语通科技(北京)有限公司 Entity recognition method is named based on the Korean of maximum entropy and neural network model
CN107608955A (en) * 2017-08-31 2018-01-19 张国喜 A kind of Chinese hides name entity inter-translation method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484682B (en) * 2015-08-25 2019-06-25 阿里巴巴集团控股有限公司 Machine translation method, device and electronic equipment based on statistics


Also Published As

Publication number Publication date
CN108268447A (en) 2018-07-10

Similar Documents

Publication Publication Date Title
CN108268447B (en) Labeling method for Tibetan named entities
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
US9195646B2 (en) Training data generation apparatus, characteristic expression extraction system, training data generation method, and computer-readable storage medium
Zitouni et al. Maximum entropy based restoration of Arabic diacritics
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN110263325B (en) Chinese word segmentation system
CN114065758B (en) Document keyword extraction method based on hypergraph random walk
CN107480200B (en) Word labeling method, device, server and storage medium based on word labels
CN112101027A (en) Chinese named entity recognition method based on reading understanding
CN113505200A (en) Sentence-level Chinese event detection method combining document key information
CN108681532B (en) Sentiment analysis method for Chinese microblog
Grönroos et al. Morfessor EM+ Prune: Improved subword segmentation with expectation maximization and pruning
Chen et al. Integrating natural language processing with image document analysis: what we learned from two real-world applications
CN107797986B (en) LSTM-CNN-based mixed corpus word segmentation method
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
WO2022242074A1 (en) Multi-feature fusion-based method for named entity recognition in chinese medical text
CN111133429A (en) Extracting expressions for natural language processing
CN113127607A (en) Text data labeling method and device, electronic equipment and readable storage medium
CN116629238A (en) Text enhancement quality evaluation method, electronic device and storage medium
CN112989839A (en) Keyword feature-based intent recognition method and system embedded in language model
Bach et al. Rre task: The task of recognition of requisite part and effectuation part in law sentences
Zhou et al. Online handwritten Japanese character string recognition using conditional random fields
CN115858733A (en) Cross-language entity word retrieval method, device, equipment and storage medium
Oprean et al. Handwritten word recognition using Web resources and recurrent neural networks
Li et al. Attention-based LSTM-CNNs for uncertainty identification on Chinese social media texts

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant