CN108268447B - Labeling method for Tibetan named entities - Google Patents


Info

Publication number
CN108268447B
CN108268447B CN201810059120.7A
Authority
CN
China
Prior art keywords
vector
corpus
word
labeled
named entities
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810059120.7A
Other languages
Chinese (zh)
Other versions
CN108268447A (en)
Inventor
夏建华
张进兵
韩立新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN201810059120.7A
Publication of CN108268447A
Application granted
Publication of CN108268447B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a labeling method for Tibetan named entities. It adopts a semi-supervised learning mode: the labeled corpus is used to train a dual-granularity model, namely a coarse-granularity NER based on word-vector KNN clustering and a fine-granularity NER based on semi-Markov CRFs; the unlabeled corpus is then labeled, newly labeled entities are added to the labeled corpus for retraining the dual-granularity model, and the dual-granularity NER is iteratively improved. The method overcomes the over-reliance of supervised learning on labeled corpora and the single discrimination mode of the traditional CRFs method. It integrates features such as entity semantics and the interaction among named entities, combines a clustering graph with a probability graph, improves the degree of model fit through the complementary strengths of the semantics and the grammatical structure of named entities, realizes collective NER, and effectively improves the accuracy and efficiency of Tibetan named entity recognition.

Description

Labeling method for Tibetan named entities
Technical Field
The invention relates to the technical field of language processing, in particular to a labeling method of Tibetan named entities.
Background
Named Entity Recognition (NER) refers to detecting entity words composed of single characters, words, or multiple words in a text and determining which entity class they belong to: person name, place name, organization, etc. From the viewpoint of Natural Language Processing (NLP), named entity recognition mainly solves the recognition of entities not registered in a dictionary. From the perspective of knowledge discovery and acquisition, named entity recognition extracts, from unstructured text, the named entities related to the information a user desires. The effectiveness of named entity recognition directly affects the performance of the research and application systems built on top of it, such as structured representation of text, information extraction, information retrieval, machine translation, and question answering systems.
Tibetan shares certain commonalities with Chinese, English, and other written languages, but also has special characteristics: the Tibetan syllable structure takes a base character as its core, with other letters added before and after it and stacked above and below it to form a complete syllable. Although the dictionaries, rules, grammar, and features used in Tibetan named entity recognition differ from those of other languages, from the perspective of the methodology of named entity recognition the methods adopted do not differ from those used for other languages.
There are many named entity recognition methods, ranging from Supervised Learning (SL) to Unsupervised Learning (UL) and from Rule-and-Dictionary Based Learning (RDBL) to Statistical Machine Learning (SML), but these methods still have certain drawbacks. For example, in a supervised learning setting, although the classifier fits well after training on labeled data, the premise is that many linguists spend a great deal of time labeling the original corpus. UL, the opposite of SL, avoids the cost of annotating data, but its entity recognition performance is markedly inferior because it lacks the prior knowledge needed for training and learning. In the process of labeling data, people acquire large numbers of rules and perform entity recognition from the perspective of entity construction rules; although this achieves a certain accuracy on small data sets, as data sets grow, especially in the current big-data era, the main problem exposed by rule-based entity recognition is that a rule base cannot exhaust all named entity rules. Stated another way, RDBL does not take full advantage of the context and associated features of named entities, whereas SML significantly improves accuracy precisely by exploiting the context-dependent characteristics of named entities in annotated data. Examples include Hidden Markov Models (HMMs), Support Vector Machines (SVMs), Maximum Entropy Markov Models (MEMMs), Conditional Random Fields (CRFs), and skip-chain Conditional Random Fields (skip-chain CRFs). Among these, the conditional random field adopts statistically normalized probabilities over the global scope, overcomes the label bias problem of HMMs and MEMMs, and obtains better classification results; on the basis of basic CRFs, using artificial synonym pairs achieves better document-level NER accuracy than conventional NER algorithms. The above statistical learning methods all consider entity recognition from a fine-grained point of view; when discriminating an NE (named entity), the CRFs method fails to consider the metric properties of features, the internal features of the entity (e.g., those without the Markov property), and so on. Furthermore, such methods rely heavily on the annotated corpus, which amounts to looking up entities and computing matches in a generalized dictionary (an annotated corpus containing features and named entities); recognition errors may increase when a named entity that needs to be annotated does not appear in the generalized dictionary and its near-synonym NEs do not share a similar context.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a labeling method for Tibetan named entities, solving the technical problems that supervised learning methods depend excessively on labeled corpora and that traditional rule-based and statistical methods use an isolated discrimination mode.
In order to solve the above technical problems, the technical scheme adopted by the invention is as follows: a method for labeling Tibetan named entities comprises the following steps:
normalizing the unmarked data to obtain unmarked normalized corpora, and adding the newly marked named entity into the original marked corpora;
training a noun phrase annotator Semi-Markov CRFs_1 with the labeled corpus, and then using it to carry out noun phrase segmentation and labeling on the normalized corpus;
reading the labeled corpus and the normalized corpus, establishing a CBOW model jointly over characters, words, phrases and named entities, and obtaining, through the training of the CBOW model, a corpus matrix and a vector space of the nominal characters, words, phrases and named entities;
based on the vector space, finding the K nearest labeled named entities of an unlabeled noun phrase with the KNN algorithm, calculating the cosine similarity between the unlabeled noun phrase and the K nearest labeled named entities, then selecting from the K neighbors the q named entities whose similarity values are greater than a preset threshold λ, where 0 ≤ q ≤ K; if q > 0, the named entity category of the unlabeled noun phrase takes the category of the named entity with the largest cosine similarity among the K nearest neighbors; adding the newly labeled named entities into the labeled corpus, so that the normalized corpus obtains partial labels;
reading the sequence data of the labeled corpus and training a fine-granularity marker Semi-Markov CRFs_2; then labeling the unlabeled named entities in the normalized corpus with Semi-Markov CRFs_2 to realize the full labeling of the named entities.
The normalization processing comprises: word segmentation and sentence normalization, punctuation normalization, word segmentation and part-of-speech tagging normalization, and stop word normalization.
The corpus matrix obtaining method comprises the following steps:
firstly, constructing a dictionary containing four subsets of characters, words, phrases and named entities, and carrying out a vector initialization operation on each element of the dictionary: assigning to each element a random vector of 400-600 dimensions, the value of each dimension being limited to [-1, 1];

secondly, establishing a sliding window of length 5, and sequentially reading data from the labeled corpus and the noun-phrase-labeled normalized corpus to obtain window data $win = \langle x_{-2}\ x_{-1}\ x_0\ x_{+1}\ x_{+2} \rangle$, where 0 denotes the center position of the window and $x_0$ denotes the target word;

using $Context = \{x_{\pm p}\}$, $p = 1,2$ to denote the context of $x_0$, and carrying out the preprocessing of the context word vectors of $x_0$; when $x_{\pm p}$ is a character, word, phrase or named entity, it is processed as follows:

when $x_{\pm p} \in \{\text{character}\}$, the vector $v(x_{\pm p})$ of $x_{\pm p}$ takes the value of the character vector $character^{vector}$;

when $x_{\pm p} \in \{\text{word}\}$, the vector $v(x_{\pm p})$ of $x_{\pm p}$ takes the value of the word vector $word^{vector}$, the formula being:

$$word^{vector} = \frac{1}{|N_{\pm p}|} \sum_{j=1}^{|N_{\pm p}|} character_j^{vector}$$

where $word^{vector}$ denotes the vector corresponding to the word that $x_{\pm p}$ belongs to, $character_j^{vector}$ denotes the vector of the j-th Tibetan character in the word, and $|N_{\pm p}|$ denotes the number of characters contained in the context word $x_{\pm p}$ of the target word $x_0$;

when $x_{\pm p} \in \{\text{phrase}\}$, the vector $v(x_{\pm p})$ of $x_{\pm p}$ takes the value of the phrase vector $chunking^{vector}$, the formula being:

$$chunking^{vector} = \frac{1}{|N'_{\pm p}|} \sum_{q=1}^{|N'_{\pm p}|} word_q^{vector}$$

where $chunking^{vector}$ denotes the vector corresponding to $x_{\pm p}$ when it belongs to a phrase, $word_q^{vector}$ denotes the vector of the q-th Tibetan word in the phrase, and $|N'_{\pm p}|$ denotes the number of words contained in the context word $x_{\pm p}$ of the target word $x_0$;

when $x_{\pm p} \in \{\text{named entity}\}$, it is correspondingly processed according to its category of character, word or phrase, as above;

then, calculating the vector mean $Context(x_0)$ of the context of $x_0$ input to the CBOW model, the formula being:

$$Context(x_0) = \frac{1}{4} \sum_{p=1}^{2} \big( v(x_{-p}) + v(x_{+p}) \big)$$

where $Context(x_0)$ represents the input of the CBOW model and $p = 1,2$;

and establishing the objective function of the CBOW learning algorithm with noise-contrastive estimation, the formula being:

$$L(\theta) = \sum_{x_0 \in D} \Big[ \log \sigma\big(\theta \cdot Context(x_0)\big) + \sum_{x'_0 \in NCE(x_0)} \log\big(1 - \sigma\big(\theta \cdot Context(x'_0)\big)\big) \Big]$$

where $\theta$ represents the weight vector of $Context(x_0)$; $D$ represents the corpus; $\sigma(\cdot)$ represents the activation function; $x'_0$ represents a negative sample; $NCE(x_0)$ represents the set of negative samples, to which $x_0$ does not belong; and $Context(x'_0)$ represents the context word-vector mean of a negative sample, in which the original target word in the window is replaced by $x'_0$;

finally, learning the parameters with a stochastic gradient ascent algorithm and updating the context word vectors; when CBOW has traversed the whole corpus, the corpus matrix is obtained.
The method for constructing the vector space comprises: extracting from the corpus matrix the vector space generated by the vectors of all nominal characters, words, phrases and named entities.
The specific method for training the fine-granularity marker Semi-Markov CRFs_2 is as follows:
using a sliding window of length 3 units, one window of data $s\_w = \langle s_{-1}\ s_0\ s_{+1} \rangle$ of the phrase segmentation sequence data $s\_x$ of a sentence is read from the labeled corpus, together with one window of mark data $s\_l = \langle y'_{-1}\ y'_0\ y'_{+1} \rangle$ of the phrase mark sequence $s\_y$ corresponding to $s\_x$; wherein $s_0$ denotes the target phrase, $s_{-1}$ and $s_{+1}$ respectively denote the context preceding and following $s_0$; $y'_{-1}, y'_0, y'_{+1} \in Y'$, $Y' = \{R, T, P, O, N\}$, where R denotes a person name, T denotes a time, P denotes a place, O denotes an organization, and N denotes other types of phrases;

constructing the state transition feature functions $t_{v1}(y'_{-1}, y'_0, s\_x, k')$ and $t'_{v2}(y'_{+1}, y'_0, s\_x, k')$, the state feature function $s'_{v3}(y'_0, s\_x, k')$, the segmentation feature function $seg_{v4}(y'_0, s_0, ss_0, ee_0)$, and the feature function vector $F'(s\_x, s\_y) = \langle f_1(s\_x, s\_y), f_2(s\_x, s\_y) \dots f_z(s\_x, s\_y) \rangle$, where $y'_0$, $y'_{-1}$ and $y'_{+1}$ respectively denote the named entity category marks of the current, previous and next units of the window; $k'$ represents the current position of the feature function; 0 denotes the middle position of the window; $ss_0$ denotes the start point of the segmentation $s_0$ and $ee_0$ denotes its end point; $f_{z'}(s\_x, s\_y) = \sum_{k'} f_{z''}(y'_{-1}, y'_0, s\_x, k')$ represents the sum of the features at each position, with $z', z'' = 1, 2 \dots z$; the feature function vector is composed of the concatenation of the $v1 + v2$ state transition feature functions, the $v3$ state feature functions and the $v4$ segmentation feature functions, where $z = v1 + v2 + v3 + v4$;

creating the Semi-Markov conditional random field Tibetan fine-granularity marker Semi-Markov CRFs_2 from these feature functions, and training the marker on the labeled corpus by combining the L-BFGS algorithm with gradient ascent.
Semi-Markov CRFs_2 is used to realize the full labeling of the named entities in the normalized corpus, the specific method being as follows:

from the partially labeled normalized corpus, one window of data $s\_w = \langle s_{-1}\ s_0\ s_{+1} \rangle$ of size 3 of the phrase segmentation sequence $s\_x$ of a sentence is read; the trained fine-granularity Tibetan named entity marker traverses the unlabeled noun phrases in the corpus and calculates the maximum conditional probability $P(s\_y \mid s\_x, W')$ of the named entity mark sequence $s\_y$ of $s\_x$ to determine the category of an unlabeled entity, the formula being:

$$s\_y^{*} = \arg\max_{s\_y} P(s\_y \mid s\_x, W') = \arg\max_{s\_y} \frac{\exp\big(W' \cdot F'(s\_x, s\_y)\big)}{Z'_{W'}(s\_x)}$$

where $|W'| = |W_{v1}| + |W_{v2}| + |W_{v3}| + |W_{v4}|$ represents the length of the weight vector; the mark sequence is output by the Viterbi algorithm according to the calculated conditional probabilities, finally realizing the full labeling of the named entities.
Compared with the prior art, the invention has the following beneficial effects:
1. The limitation of the prior knowledge needed for model training in supervised CRFs learning is overcome; the internal feature information of the named entities in the external dictionary is utilized, which further improves the utilization of the dictionary and the accuracy of Tibetan named entity recognition.
2. The method integrates features such as entity near-synonyms, the semantic similarity of entity word contexts, and entity word vectors, and combines clustering with a probability graph to realize entity recognition collectively; compared with the CRFs method, it integrates features from multiple aspects and further improves the accuracy of Tibetan named entity recognition.
3. The NER based on word-vector KNN clustering at the coarse granularity level and the NER based on semi-Markov CRFs at the fine granularity level improve the degree of model fit through the complementarity of the semantics and the grammatical structure of named entities.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a corpus normalization flow chart;
FIG. 3 is a flow chart of noun phrase tagging;
FIG. 4 is a flow chart of corpus matrix creation based on the CBOW model of word entity federation;
FIG. 5 is a flow chart of coarse grain labeling;
fig. 6 is a flow chart of fine-grained annotation.
Detailed Description of the Embodiments
The invention provides a labeling method for Tibetan named entities, which uses a labeled corpus to train a noun phrase marker and a fine-granularity named entity marker, and uses an unlabeled corpus to improve the performance of the labeling method in a semi-supervised learning mode. First, the unlabeled corpus is normalized; the labeled corpus and the normalized corpus are then used to train a noun phrase marker Semi-Markov CRFs_1 and to segment and label noun phrases; a CBOW model jointly over characters, words and entities is further created to obtain a corpus matrix and a vector space, so as to realize coarse-granularity K-Nearest-Neighbor named entity labeling; the newly labeled named entities are added to the labeled corpus, a fine-granularity marker Semi-Markov CRFs_2 is trained, and the partially labeled normalized corpus is traversed to realize the complete labeling of the corpus. The method overcomes the limitation that supervised learning depends excessively on labeled corpora and the isolated discrimination mode of traditional rule-based and statistical methods, integrates features such as entity semantics and the interaction between named entities, combines a clustering graph with a probability graph, realizes entity recognition collectively, and effectively improves the efficiency of Tibetan named entity recognition.
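Before the step-by-step details, the overall loop can be summarized in code. The following is a minimal, runnable sketch of the semi-supervised iteration; every function here is a hypothetical stub standing in for the corresponding step below (steps 101-105), shown only to make the control flow concrete:

```python
# Hypothetical stubs for steps 101-105; real implementations are described below.

def normalize(unlabeled):                      # step 101: corpus normalization
    return [s.strip() for s in unlabeled]

def train_np_chunker(labeled):                 # step 102: Semi-Markov CRFs_1
    return lambda corpus: corpus               # stub: returns corpus unchanged

def train_cbow(labeled, normalized):           # step 103: word-entity joint CBOW
    return {}, {}                              # stub corpus matrix, vector space

def knn_coarse_label(space, normalized):       # step 104: coarse KNN pass
    return []                                  # stub: no new entities found

def train_fine_tagger(labeled):                # step 105: Semi-Markov CRFs_2
    return lambda corpus: corpus               # stub full-labeling pass

def semi_supervised_ner(labeled, unlabeled, rounds=3):
    for _ in range(rounds):
        normalized = normalize(unlabeled)
        normalized = train_np_chunker(labeled)(normalized)
        matrix, space = train_cbow(labeled, normalized)
        labeled = labeled + knn_coarse_label(space, normalized)
        normalized = train_fine_tagger(labeled)(normalized)
        unlabeled = normalized                 # iterate on the improved corpus
    return labeled
```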
The technical solution of the present invention is explained in detail below with reference to the accompanying drawings and the specific embodiments, but the scope of the present invention is not limited to the embodiments.
FIG. 1 is a flow chart of the present invention, comprising the steps of:
step 101, corpus normalization
Normalization processing is carried out on the unlabeled data to obtain the unlabeled normalized corpus, and newly labeled named entities are added into the original labeled corpus.
As shown in fig. 2, the method specifically includes the following steps:
step 201, inputting a non-labeled corpus, reading a sentence in the corpus each time by using a window function, wherein each sliding takes a sentence as a basic unit, and the size of the sliding window takes the length of the longest sentence.
Step 202, performing word segmentation on the Tibetan sentence (using a third-party word segmentation tool) to obtain basic words, and performing normalization processing, which includes: removing illegal Tibetan sentences, i.e., sentences that do not conform to the Tibetan language model; normalizing non-Tibetan numerals; and normalizing Tibetan punctuation marks, for example, phrases or words adopt "/" as separators, while the single vertical mark (shad) used at sentence ends, the double vertical mark used at chapter ends, the quadruple vertical mark used at the end of a volume, and the cloud-head mark used for titles or chapter openings are uniformly replaced by "//" as separators.
Step 203, outputting the normalized corpus to obtain normalized data of the non-labeled corpus; the newly labeled noun phrases and named entities are added to the labeled corpus.
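A minimal sketch of the punctuation unification in step 202, assuming the Unicode code points of the Tibetan shad family (the patent itself shows these marks as images, so the exact symbol set is an assumption):

```python
import re

# Assumed Unicode code points for the Tibetan marks named in step 202.
SHAD = "\u0F0D"        # single shad, used at sentence ends
NYIS_SHAD = "\u0F0E"   # double shad, used at chapter ends
YIG_MGO = "\u0F04"     # head mark (cloud head), used for titles

def normalize_punctuation(sentence: str) -> str:
    """Unify the terminal-mark family to the '//' separator."""
    marks = f"[{SHAD}{NYIS_SHAD}{YIG_MGO}]+"
    sentence = re.sub(marks, "//", sentence)
    # words/phrases within the sentence are joined with '/'
    return "/".join(t for t in sentence.split() if t)
```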
Step 102, noun phrase annotator Semi-Markov CRFs_1
Semi-Markov CRFs_1 is trained with the labeled corpus and then used to realize the segmentation and labeling of noun phrases in the normalized corpus.
As shown in fig. 3, the method specifically includes the following steps:
Step 301, inputting the corpora. The labeled corpus and the normalized corpus are input for the training and the testing of the model, respectively.
Step 302, Semi-Markov CRFs_1 is divided into two stages: training and labeling (testing). First, in the training stage, the word sequence data $x = \langle x_1 x_2 \dots x_n \rangle$ of a sentence, the word sequence tag data $y = \langle y_1 y_2 \dots y_i \dots y_n \rangle$, and the phrase segmentation sequence $s = \langle s_1 s_2 \dots s_j \dots \rangle$ corresponding to the sentence are read from the labeled corpus, where $x_n$ denotes a word; $y_i$ denotes the current segmentation mark, $y_i \in Y$, $Y = \{F, E\}$, F denoting a non-noun mark and E denoting a noun mark; $s_j$ denotes the j-th phrase segmentation of $x$, with $j \le n$; each phrase segmentation $s_j = \langle b_j, e_j, y_j \rangle$, where $b_j$ denotes the start point of the current segmentation, $e_j$ denotes its end point, and $y_j \in Y$; the segmentation feature function $f_k(j, x, s_j) = f_k(y_j, y_{j-1}, x, b_j, e_j)$ is constructed, where $y_{j-1}$ is the previous segmentation mark, $k$ represents the index of the feature function, and $C$ represents the number of segmentation feature functions $f_k(j, x, s_j)$ contained in the corpus;

the conditional probability of the noun phrase marker is established, the formula being:

$$P(s \mid x, W) = \frac{1}{Z_W(x)} \exp\big(W \cdot F(x, s)\big)$$

where $W$ is the weight vector of the segmentation feature function vector $F(x, s)$, and $Z_W(x) = \sum_{s'} e^{W \cdot F(x, s')}$ is a normalization factor, with $s'$ ranging over all possible valid sequence segmentations;

according to the labeled corpus $\{(x_t, s_t)\}_{t=1}^{Num}$, where $s_t$ denotes the noun phrase mark sequence of the t-th word sequence $x_t$ in the labeled corpus and $Num$ is the total number of sentences of the labeled corpus, the objective function of the phrase segmentation sequence is created, the formula being:

$$L(W) = \sum_{t=1}^{Num} \log P(s_t \mid x_t, W) = \sum_{t=1}^{Num} \big( W \cdot F(x_t, s_t) - \log Z_W(x_t) \big)$$

the gradient is calculated with the L-BFGS algorithm, and the weight vector $W$ is updated by gradient ascent.
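To make the conditional probability and objective concrete, the following toy sketch computes $P(s \mid x, W)$ exactly by enumerating every segmentation of a very short sentence; the two features and their weights are hypothetical illustrations, not the patent's feature set:

```python
from itertools import product
import math

LABELS = ("F", "E")  # non-noun / noun segment marks, as defined above

def segmentations(i, n, max_len):
    """Yield all segmentations of positions i..n-1 into spans of length <= max_len."""
    if i == n:
        yield []
        return
    for j in range(i + 1, min(i + max_len, n) + 1):
        for rest in segmentations(j, n, max_len):
            yield [(i, j - 1)] + rest

def score(segs, labels, W):
    # two hypothetical features: long noun segments, and F -> E transitions
    f1 = sum(1 for (b, e), y in zip(segs, labels) if y == "E" and e - b + 1 >= 2)
    f2 = sum(1 for k in range(1, len(labels))
             if labels[k - 1] == "F" and labels[k] == "E")
    return W[0] * f1 + W[1] * f2

def conditional_probability(x, segs, labels, W, max_len=3):
    """P(s | x, W) = exp(W.F(x, s)) / Z_W(x), with Z_W(x) enumerated exactly."""
    Z = sum(math.exp(score(s, ls, W))
            for s in segmentations(0, len(x), max_len)
            for ls in product(LABELS, repeat=len(s)))
    return math.exp(score(segs, labels, W)) / Z

# example: segmentation <(0,1), (2,2)> labeled <E, F> of a 3-word sentence
print(conditional_probability(["w1", "w2", "w3"], [(0, 1), (2, 2)], ["E", "F"], (1.0, 0.5)))
```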
Second, in the testing stage, the word sequence data $x' = \langle x'_1 x'_2 \dots x'_n \rangle$ of a sentence is read from the normalized corpus, the maximum length of a noun phrase is preset to $L$, and the maximum conditional probability $P(s \mid x', W)$ of the phrase segmentation sequence $s$ is calculated for the word sequence data $x'$, the formula being:

$$s^{*} = \arg\max_{s} P(s \mid x', W) = \arg\max_{s} \frac{1}{Z_W(x')} \exp\Big( \sum_{i=1}^{|s|} W \cdot f(y_{i-1}, y_i, x', b_i, e_i) \Big)$$

where $|s|$ denotes the number of phrases after segmentation of $x'$ and $i$ denotes the position of the current phrase in the phrase sequence;

the best phrase segmentation sequence and noun phrase marks of $x'$ are obtained with the Viterbi algorithm, the recursion being:

$$Viterbi(i, y) = \max_{y',\, d = 1 \dots L} \big[ Viterbi(i - d, y') + W \cdot f(y', y, x', i - d + 1, i) \big]$$

with the proviso that $i > 0$; if $i = 0$, $Viterbi(i, y)$ takes the value 0; otherwise $Viterbi(i, y)$ takes the value $-\infty$.
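The recursion above can be decoded with a short semi-Markov Viterbi routine. This is a minimal sketch under the stated recursion; `span_score` is a hypothetical stand-in for the weighted feature sum $W \cdot f(y', y, x', i-d+1, i)$, here with 0-based spans:

```python
NEG_INF = float("-inf")

def semi_markov_viterbi(x, labels, L, span_score):
    """Best labeled segmentation; Viterbi(0, y) = 0, spans limited to length L."""
    n = len(x)
    V = {(0, y): 0.0 for y in labels}
    back = {}
    for i in range(1, n + 1):
        for y in labels:
            best, arg = NEG_INF, None
            for d in range(1, min(L, i) + 1):          # candidate span length
                for yp in labels:                      # previous segment label
                    s = V[(i - d, yp)] + span_score(yp, y, x, i - d, i - 1)
                    if s > best:
                        best, arg = s, (i - d, yp)
            V[(i, y)], back[(i, y)] = best, arg
    i, y = n, max(labels, key=lambda lab: V[(n, lab)])
    segs = []
    while i > 0:                                       # trace back the best path
        j, yp = back[(i, y)]
        segs.append((j, i - 1, y))                     # span [j, i-1] labeled y
        i, y = j, yp
    return list(reversed(segs))

# hypothetical scorer: favor two-word noun segments
best = semi_markov_viterbi(list("abcd"), ("F", "E"), L=3,
                           span_score=lambda yp, y, x, b, e:
                               1.0 if y == "E" and e - b + 1 == 2 else 0.0)
print(best)  # e.g. [(0, 1, 'E'), (2, 3, 'E')]
```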
Step 303, the trained Semi-Markov CRFs_1 is used to traverse the normalized corpus and realize its sequence segmentation and noun phrase labeling. For example, in the sentence [Tibetan text] (Chinese gloss: "Adam / Yushu / Tibetan / autonomous / prefecture / visit"), the compound noun phrase [Tibetan text] (Chinese translation: "Yushu Tibetan Autonomous Prefecture") is taken as a whole, although it contains the place-name named entity [Tibetan text] (Chinese translation: "Yushu"); non-nominal phrases are not segmented and labeled, but are still treated as basic words. Finally, the noun-phrase-labeled corpus is obtained.
Step 103, word-entity joint CBOW
The word-entity joint CBOW is constructed and trained with the labeled corpus and the noun-phrase-labeled normalized corpus to obtain the vector space of the corpus words and noun phrases.
As shown in fig. 4, the method specifically includes the following steps:
step 401, inputting noun phrase tagging corpus.
Step 402, this step is the word-entity joint Continuous Bag-of-Words (CBOW) model, through whose training the vectors of characters, words, noun phrases and named entities (collectively referred to as word vectors) are obtained. The specific process is as follows: (1) a dictionary is constructed containing all characters, words (two or more Tibetan characters that cannot be further segmented), phrases (compounds of two or more words) and named entities (labeled characters, words and phrases) of the Tibetan corpus, and a vector initialization is performed on each element of the dictionary, i.e., each element is assigned a random vector of 400-600 dimensions, the value of each dimension being limited to [-1, 1].

Secondly, a sliding window of length 5 is established, and the sequence data $x$ is read sequentially from the labeled corpus and the normalized corpus to obtain window data $win = \langle x_{-2}\ x_{-1}\ x_0\ x_{+1}\ x_{+2} \rangle$, where 0 denotes the center position of the window and $x_0$ denotes the target word; $Context = \{x_{\pm p}\}$, $p = 1,2$ denotes the context of $x_0$, and the preprocessing of the context word vectors of $x_0$ is carried out; when $x_{\pm p}$ is a character, word, phrase or named entity, it is treated as follows:

when $x_{\pm p} \in \{\text{character}\}$, the vector $v(x_{\pm p})$ of $x_{\pm p}$ takes the value of the character vector $character^{vector}$;

when $x_{\pm p} \in \{\text{word}\}$, the vector $v(x_{\pm p})$ of $x_{\pm p}$ takes the value of the word vector $word^{vector}$, the formula being:

$$word^{vector} = \frac{1}{|N_{\pm p}|} \sum_{j=1}^{|N_{\pm p}|} character_j^{vector}$$

where $word^{vector}$ denotes the vector corresponding to the word that $x_{\pm p}$ belongs to, $character_j^{vector}$ denotes the vector of the j-th Tibetan character in the word, and $|N_{\pm p}|$ denotes the number of characters contained in the context word $x_{\pm p}$ of the target word $x_0$;

when $x_{\pm p} \in \{\text{phrase}\}$, the vector of $x_{\pm p}$ is updated to the phrase vector $chunking^{vector}$:

$$chunking^{vector} = \frac{1}{|N'_{\pm p}|} \sum_{q=1}^{|N'_{\pm p}|} word_q^{vector}$$

where $chunking^{vector}$ denotes the vector corresponding to $x_{\pm p}$ when it belongs to a phrase, $word_q^{vector}$ denotes the vector of the q-th Tibetan word in the phrase, and $|N'_{\pm p}|$ denotes the number of words contained in the context word $x_{\pm p}$ of the target word $x_0$;

when $x_{\pm p} \in \{\text{named entity}\}$, it is correspondingly processed according to its category of character, word or phrase, as above.

(2) Mean: the vector mean $Context(x_0)$ of the context of $x_0$ input to CBOW is calculated, the formula being:

$$Context(x_0) = \frac{1}{4} \sum_{p=1}^{2} \big( v(x_{-p}) + v(x_{+p}) \big)$$

where $Context(x_0)$ represents the input of CBOW.

(3) The objective function of the CBOW learning algorithm is established with the Noise-Contrastive Estimation algorithm (NCE for short), the formula being:

$$L(\theta) = \sum_{x_0 \in D} \Big[ \log \sigma\big(\theta \cdot Context(x_0)\big) + \sum_{x'_0 \in NCE(x_0)} \log\big(1 - \sigma\big(\theta \cdot Context(x'_0)\big)\big) \Big]$$

where $\theta$ represents the weight vector of $Context(x_0)$; $D$ represents the corpus; $\sigma(\cdot)$ represents the activation function; $x'_0$ represents a negative sample; $NCE(x_0)$ represents the set of negative samples, to which $x_0$ does not belong; and $Context(x'_0)$ represents the context word-vector mean of a negative sample, in which the target word in the window is replaced by $x'_0$; the parameters are learned with a stochastic gradient ascent algorithm and the context word vectors are updated.
Step 403, when the CBOW traverses the whole corpus, a corpus matrix is obtained, that is, the corpus matrix includes word vectors of all the basic words and compound words.
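A minimal sketch of step 402's composition rules and the NCE-style objective term. The dictionary contents, the dimensionality, and the single-window objective below are illustrative assumptions; only the composition rules (word = mean of its character vectors, phrase = mean of its word vectors, CBOW input = mean of the four context vectors) follow the text:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 400
# stand-in "characters" with random vectors in [-1, 1], as in step (1)
char_vec = {c: rng.uniform(-1, 1, DIM) for c in "abcdefg"}

def word_vector(word):
    """Word vector = mean of the vectors of its characters."""
    return np.mean([char_vec[c] for c in word], axis=0)

def phrase_vector(words):
    """Phrase vector = mean of the vectors of its words."""
    return np.mean([word_vector(w) for w in words], axis=0)

def context_mean(vectors):
    """CBOW input: mean of the four context vectors around x0."""
    return np.mean(vectors, axis=0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nce_term(theta, ctx_pos, ctx_negs):
    """One window's contribution to the NCE objective L(theta)."""
    pos = np.log(sigmoid(theta @ ctx_pos))
    neg = sum(np.log(1.0 - sigmoid(theta @ c)) for c in ctx_negs)
    return pos + neg

theta = rng.uniform(-1, 1, DIM)
ctx = context_mean([char_vec["a"], word_vector("bc"),
                    phrase_vector(["de", "fg"]), char_vec["g"]])
print(nce_term(theta, ctx, [context_mean([char_vec["b"]] * 4)]))
```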
Step 104, coarse-granularity marker
According to the corpus matrix, partial labeling of the named entities in the normalized corpus is realized with K-Nearest Neighbor.
As shown in fig. 5, the method specifically includes the following steps:
step 501, inputting a corpus matrix.
Step 502, extracting word vectors of all the characters, words and phrases of the nominal part of speech from the corpus matrix to generate a vector space of noun phrases.
Step 503, based on the vector space, the KNN algorithm is used to find the K nearest labeled named entities of an unlabeled noun phrase, and the cosine similarity between the unlabeled noun phrase and the K nearest labeled named entities is calculated; then the q named entities whose similarity values are greater than a preset threshold λ (0 ≤ q ≤ K) are selected from the K neighbors; if q > 0, the named entity category of the unlabeled noun phrase takes the category of the named entity with the maximum cosine similarity among the K nearest neighbors.
Step 505, the newly labeled named entities are added to the labeled corpus, so that the normalized corpus obtains partial named entity labels.
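A minimal sketch of the coarse pass in steps 503 and 505: cosine similarity against labeled entity vectors, the K nearest neighbors, the threshold λ, and the best-match category. The vector size and the toy labeled set are illustrative:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def coarse_label(phrase_vec, labeled, K=5, lam=0.8):
    """Return a category if some of the K nearest neighbors clear lambda, else None."""
    scored = sorted(((cosine(phrase_vec, vec), cat) for vec, cat in labeled),
                    reverse=True)[:K]                   # K nearest labeled entities
    above = [(s, cat) for s, cat in scored if s > lam]  # the q entities with sim > lambda
    return max(above)[1] if above else None             # q > 0: best match's category

rng = np.random.default_rng(1)
labeled = [(rng.uniform(-1, 1, 8), cat) for cat in ("R", "T", "P", "O")]
print(coarse_label(rng.uniform(-1, 1, 8), labeled, K=3, lam=0.1))
```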
Step 105, fine-granularity marking
A fine-granularity Semi-Markov conditional random field tagger Semi-Markov CRFs_2 is trained with the labeled corpus, and the full tagging of the normalized corpus is carried out with this tagger.
As shown in fig. 6, the method specifically includes the following steps:
step 601, inputting linguistic data. And respectively inputting a marking corpus and a part of the marking corpus according to a training stage and a marking (testing) stage of the model, as shown by a dotted arrow in the figure.
Step 602, using a sliding window of length 3 units, one window of data $s\_w = \langle s_{-1}\ s_0\ s_{+1} \rangle$ of the phrase segmentation sequence data $s\_x$ of a sentence is read from the labeled corpus, together with one window of mark data $s\_l = \langle y'_{-1}\ y'_0\ y'_{+1} \rangle$ of the phrase mark sequence $s\_y$ corresponding to $s\_x$; wherein $s_0$ denotes the target phrase, $s_{-1}$ and $s_{+1}$ respectively denote the context preceding and following $s_0$; $y'_{-1}, y'_0, y'_{+1} \in Y'$, $Y' = \{R, T, P, O, N\}$, where R denotes a person name, T denotes a time, P denotes a place, O denotes an organization, and N denotes other types of phrases.
Step 603, Semi-Markov CRFs_2 is divided into two stages: training and testing.
(1) In the training stage, the state transition feature functions $t_{v1}(y'_{-1}, y'_0, s\_x, k')$ and $t'_{v2}(y'_{+1}, y'_0, s\_x, k')$, the state feature function $s'_{v3}(y'_0, s\_x, k')$, the segmentation feature function $seg_{v4}(y'_0, s_0, ss_0, ee_0)$, and the feature function vector $F'(s\_x, s\_y) = \langle f_1(s\_x, s\_y), f_2(s\_x, s\_y) \dots f_z(s\_x, s\_y) \rangle$ are created from the data read by the sliding window, where $y'_0$, $y'_{-1}$ and $y'_{+1}$ respectively denote the named entity category marks of the current, previous and next units of the window; $k'$ represents the current position of the feature function; 0 denotes the middle position of the window; $ss_0$ denotes the start point of the segmentation $s_0$ and $ee_0$ denotes its end point; $f_{z'}(s\_x, s\_y) = \sum_{k'} f_{z''}(y'_{-1}, y'_0, s\_x, k')$ represents the sum of the features at each position, with $z', z'' = 1, 2 \dots z$; the feature function vector is composed of the concatenation of the $v1 + v2$ state transition feature functions, the $v3$ state feature functions and the $v4$ segmentation feature functions, where $z = v1 + v2 + v3 + v4$;
the conditional probability of the named entity marker is established, the formula being:

$$P(s\_y \mid s\_x, W') = \frac{1}{Z'_{W'}(s\_x)} \exp\big(W' \cdot F'(s\_x, s\_y)\big)$$

where $W'$ is the weight vector of $F'(s\_x, s\_y)$ and $Z'_{W'}(s\_x)$ is a normalization factor; then, according to the labeled corpus $\{(x'_m, y'_m)\}_{m=1}^{M}$, where $M$ is the total number of phrase sequences of the labeled corpus and $x'_m$ and $y'_m$ respectively denote the phrase sequence of a sentence and its corresponding mark sequence over $Y'$, the objective function of the log-likelihood is established, the formula being:

$$L'(W') = \sum_{m} \log P(y'_m \mid x'_m, W') = \sum_{m} \big( W' \cdot F'(x'_m, y'_m) - \log Z'_{W'}(x'_m) \big)$$

the gradient is calculated with the L-BFGS algorithm, and the weight vector $W'$ is updated by gradient ascent.
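A minimal sketch of the four feature-function families named above, each reduced to a single hypothetical 0/1 indicator; a real marker concatenates $v1 + v2$ transition functions, $v3$ state functions and $v4$ segmentation functions into $F'(s\_x, s\_y)$:

```python
# y labels come from Y' = {R, T, P, O, N}; k is the window position (0 = middle)

def t_v1(y_prev, y_cur, s_x, k):
    """State transition feature: previous unit -> current unit."""
    return 1.0 if (y_prev, y_cur) == ("N", "P") else 0.0

def t_v2(y_next, y_cur, s_x, k):
    """State transition feature: current unit -> next unit."""
    return 1.0 if (y_cur, y_next) == ("P", "N") else 0.0

def s_v3(y_cur, s_x, k):
    """State feature at the current unit."""
    return 1.0 if y_cur == "R" and k == 0 else 0.0

def seg_v4(y_cur, s0, ss0, ee0):
    """Segmentation feature over the span [ss0, ee0] of the current phrase s0."""
    return 1.0 if y_cur != "N" and (ee0 - ss0 + 1) <= 4 else 0.0
```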
(2) In the testing (marking) stage, one window of data $s\_w = \langle s_{-1}\ s_0\ s_{+1} \rangle$ of the phrase segmentation sequence $s\_x$ of a sentence is read from the noun-phrase-labeled normalized corpus with a sliding window of length 3 units; the trained fine-granularity Tibetan named entity marker traverses the unlabeled noun phrases in the corpus and calculates the maximum conditional probability $P(s\_y \mid s\_x, W')$ of the named entity mark sequence $s\_y$ of $s\_x$ to determine the category of an unlabeled entity, the formula being:

$$s\_y^{*} = \arg\max_{s\_y} P(s\_y \mid s\_x, W') = \arg\max_{s\_y} \frac{\exp\big(W' \cdot F'(s\_x, s\_y)\big)}{Z'_{W'}(s\_x)}$$

where $|W'| = |W_{v1}| + |W_{v2}| + |W_{v3}| + |W_{v4}|$ represents the length of the weight vector; the mark sequence is output by the Viterbi algorithm according to the calculated conditional probabilities.
Step 604, the optimized Semi-Markov CRFs_2 traverses the partially labeled corpus to realize the full labeling of the named entities, obtaining a new named-entity-labeled corpus.
The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as falling within the protection scope of the present invention.

Claims (6)

1. A method for labeling Tibetan named entities, characterized by comprising the following steps:
normalizing the unmarked data to obtain unmarked normalized corpora, and adding the newly marked named entity into the original marked corpora;
training a noun phrase annotator Semi-Markov CRFs_1 with the labeled corpus, and then using it to carry out noun phrase segmentation and labeling on the normalized corpus;
reading the labeled corpus and the normalized corpus, establishing a CBOW model jointly over characters, words, phrases and named entities, and obtaining, through the training of the CBOW model, a corpus matrix and a vector space of the nominal characters, words, phrases and named entities;
based on the vector space, finding the K nearest labeled named entities of an unlabeled noun phrase with the KNN algorithm, calculating the cosine similarity between the unlabeled noun phrase and the K nearest labeled named entities, then selecting from the K neighbors the q named entities whose similarity values are greater than a preset threshold λ, where 0 ≤ q ≤ K; if q > 0, the named entity category of the unlabeled noun phrase takes the category of the named entity with the maximum cosine similarity among the K nearest neighbors; adding the newly labeled named entities into the labeled corpus, so that the normalized corpus obtains partial labels;
reading the sequence data of the labeled corpus and training a fine-granularity marker Semi-Markov CRFs_2; then labeling the unlabeled named entities in the normalized corpus with Semi-Markov CRFs_2 to realize the full labeling of the named entities.
2. The method for labeling Tibetan named entities of claim 1, wherein the normalization process comprises: word segmentation and sentence normalization, punctuation normalization, word segmentation and part-of-speech tagging normalization, and stop word normalization.
3. The method for labeling Tibetan named entities according to claim 1, wherein the corpus matrix is obtained by the following method:
firstly, constructing a dictionary containing four subsets of characters, words, phrases and named entities, and carrying out a vector initialization operation on each element of the dictionary: assigning to each element a random vector of 400-600 dimensions, the value of each dimension being limited to [-1, 1];

secondly, establishing a sliding window of length 5, and sequentially reading data from the labeled corpus and the noun-phrase-labeled normalized corpus to obtain window data $win = \langle x_{-2}\ x_{-1}\ x_0\ x_{+1}\ x_{+2} \rangle$, where 0 denotes the center position of the window and $x_0$ denotes the target word;

using $Context = \{x_{\pm p}\}$, $p = 1,2$ to denote the context of $x_0$, and carrying out the preprocessing of the context word vectors of $x_0$; when $x_{\pm p}$ is a character, word, phrase or named entity, it is processed as follows:

when $x_{\pm p} \in \{\text{character}\}$, the vector $v(x_{\pm p})$ of $x_{\pm p}$ takes the value of the character vector $character^{vector}$;

when $x_{\pm p} \in \{\text{word}\}$, the vector $v(x_{\pm p})$ of $x_{\pm p}$ takes the value of the word vector $word^{vector}$, the formula being:

$$word^{vector} = \frac{1}{|N_{\pm p}|} \sum_{j=1}^{|N_{\pm p}|} character_j^{vector}$$

where $word^{vector}$ denotes the vector corresponding to the word that $x_{\pm p}$ belongs to, $character_j^{vector}$ denotes the vector of the j-th Tibetan character in the word, and $|N_{\pm p}|$ denotes the number of characters contained in the context word $x_{\pm p}$ of the target word $x_0$;

when $x_{\pm p} \in \{\text{phrase}\}$, the vector $v(x_{\pm p})$ of $x_{\pm p}$ takes the value of the phrase vector $chunking^{vector}$, the formula being:

$$chunking^{vector} = \frac{1}{|N'_{\pm p}|} \sum_{q=1}^{|N'_{\pm p}|} word_q^{vector}$$

where $chunking^{vector}$ denotes the vector corresponding to $x_{\pm p}$ when it belongs to a phrase, $word_q^{vector}$ denotes the vector of the q-th Tibetan word in the phrase, and $|N'_{\pm p}|$ denotes the number of words contained in the context word $x_{\pm p}$ of the target word $x_0$;

when $x_{\pm p} \in \{\text{named entity}\}$, it is correspondingly processed according to its category of character, word or phrase, as above;

then, calculating the vector mean $Context(x_0)$ of the context of $x_0$ input to CBOW, the formula being:

$$Context(x_0) = \frac{1}{4} \sum_{p=1}^{2} \big( v(x_{-p}) + v(x_{+p}) \big)$$

where $Context(x_0)$ represents the input of the CBOW model and $p = 1,2$;

and establishing the objective function of the CBOW learning algorithm with noise-contrastive estimation, the formula being:

$$L(\theta) = \sum_{x_0 \in D} \Big[ \log \sigma\big(\theta \cdot Context(x_0)\big) + \sum_{x'_0 \in NCE(x_0)} \log\big(1 - \sigma\big(\theta \cdot Context(x'_0)\big)\big) \Big]$$

where $\theta$ represents the weight vector of $Context(x_0)$; $D$ represents the corpus; $\sigma(\cdot)$ represents the activation function; $x'_0$ represents a negative sample; $NCE(x_0)$ represents the set of negative samples, to which $x_0$ does not belong; and $Context(x'_0)$ represents the context word-vector mean of a negative sample, in which the original target word in the window is replaced by $x'_0$;

finally, learning the parameters with a stochastic gradient ascent algorithm and updating the context word vectors; and when the CBOW has traversed the whole corpus, obtaining the corpus matrix.
4. The method for labeling Tibetan named entities according to claim 1, wherein the method for constructing the vector space comprises: extracting from the corpus matrix the vector space generated by the vectors of all nominal characters, words, phrases and named entities.
5. The method for labeling Tibetan named entities according to claim 1, wherein the specific method for training the fine-granularity marker Semi-Markov CRFs_2 is as follows:

using a sliding window of length 3 units, one window of data $s\_w = \langle s_{-1}\ s_0\ s_{+1} \rangle$ of the phrase segmentation sequence data $s\_x$ of a sentence is read from the labeled corpus, together with one window of mark data $s\_l = \langle y'_{-1}\ y'_0\ y'_{+1} \rangle$ of the phrase mark sequence $s\_y$ corresponding to $s\_x$; wherein $s_0$ denotes the target phrase, $s_{-1}$ and $s_{+1}$ respectively denote the context preceding and following $s_0$; $y'_{-1}, y'_0, y'_{+1} \in Y'$, $Y' = \{R, T, P, O, N\}$, where R denotes a person name, T denotes a time, P denotes a place, O denotes an organization, and N denotes other types of phrases;

constructing the state transition feature functions $t_{v1}(y'_{-1}, y'_0, s\_x, k')$ and $t'_{v2}(y'_{+1}, y'_0, s\_x, k')$, the state feature function $s'_{v3}(y'_0, s\_x, k')$, the segmentation feature function $seg_{v4}(y'_0, s_0, ss_0, ee_0)$, and the feature function vector $F'(s\_x, s\_y) = \langle f_1(s\_x, s\_y), f_2(s\_x, s\_y) \dots f_z(s\_x, s\_y) \rangle$, where $y'_0$, $y'_{-1}$ and $y'_{+1}$ respectively denote the named entity category marks of the current, previous and next units of the window; $k'$ represents the current position of the feature function; 0 denotes the middle position of the window; $ss_0$ denotes the start point of the segmentation $s_0$ and $ee_0$ denotes its end point; $f_{z'}(s\_x, s\_y) = \sum_{k'} f_{z''}(y'_{-1}, y'_0, s\_x, k')$ represents the sum of the features at each position, with $z', z'' = 1, 2 \dots z$; the feature function vector is composed of the concatenation of the $v1 + v2$ state transition feature functions, the $v3$ state feature functions and the $v4$ segmentation feature functions, where $z = v1 + v2 + v3 + v4$;

creating the Semi-Markov conditional random field Tibetan fine-granularity marker Semi-Markov CRFs_2 from these feature functions, and training the marker on the labeled corpus by combining the L-BFGS algorithm with gradient ascent.
6. The method for labeling Tibetan named entities according to claim 1, wherein Semi-Markov CRFs_2 is used to realize the full labeling of the named entities in the normalized corpus, the specific method being as follows:

from the partially labeled normalized corpus, one window of data $s\_w = \langle s_{-1}\ s_0\ s_{+1} \rangle$ of size 3 of the phrase segmentation sequence $s\_x$ of a sentence is read; the trained fine-granularity Tibetan named entity marker traverses the unlabeled noun phrases in the corpus and calculates the maximum conditional probability $P(s\_y \mid s\_x, W')$ of the named entity mark sequence $s\_y$ of $s\_x$ to determine the category of an unlabeled entity, the formula being:

$$s\_y^{*} = \arg\max_{s\_y} P(s\_y \mid s\_x, W') = \arg\max_{s\_y} \frac{\exp\big(W' \cdot F'(s\_x, s\_y)\big)}{Z'_{W'}(s\_x)}$$

where $|W'| = |W_{v1}| + |W_{v2}| + |W_{v3}| + |W_{v4}|$ represents the length of the weight vector; the mark sequence is output by the Viterbi algorithm according to the calculated conditional probabilities, finally realizing the full labeling of the named entities.
CN201810059120.7A 2018-01-22 2018-01-22 Labeling method for Tibetan named entities Active CN108268447B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810059120.7A CN108268447B (en) 2018-01-22 2018-01-22 Labeling method for Tibetan named entities


Publications (2)

Publication Number Publication Date
CN108268447A CN108268447A (en) 2018-07-10
CN108268447B (en) 2020-12-01

Family

ID=62776300

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810059120.7A Active CN108268447B (en) 2018-01-22 2018-01-22 Labeling method for Tibetan named entities

Country Status (1)

Country Link
CN (1) CN108268447B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110738223B (en) * 2018-07-18 2022-04-08 宇通客车股份有限公司 Point cloud data clustering method and device of laser radar
CN110020428B (en) * 2018-07-19 2023-05-23 成都信息工程大学 Method for jointly identifying and normalizing Chinese medicine symptom names based on semi-Markov
CN109192201A (en) * 2018-09-14 2019-01-11 苏州亭云智能科技有限公司 Voice field order understanding method based on dual model identification
CN109388801B (en) * 2018-09-30 2023-07-14 创新先进技术有限公司 Method and device for determining similar word set and electronic equipment
CN110162749B (en) * 2018-10-22 2023-07-21 哈尔滨工业大学(深圳) Information extraction method, information extraction device, computer equipment and computer readable storage medium
CN109657061B (en) * 2018-12-21 2020-11-27 合肥工业大学 Integrated classification method for massive multi-word short texts
WO2020133039A1 (en) * 2018-12-27 2020-07-02 深圳市优必选科技有限公司 Entity identification method and apparatus in dialogue corpus, and computer device
CN110298033B (en) * 2019-05-29 2022-07-08 西南电子技术研究所(中国电子科技集团公司第十研究所) Keyword corpus labeling training extraction system
CN110287481B (en) * 2019-05-29 2022-06-14 西南电子技术研究所(中国电子科技集团公司第十研究所) Named entity corpus labeling training system
CN110472248A (en) * 2019-08-22 2019-11-19 广东工业大学 A kind of recognition methods of Chinese text name entity
CN110909548B (en) * 2019-10-10 2024-03-12 平安科技(深圳)有限公司 Chinese named entity recognition method, device and computer readable storage medium
CN114943235A (en) * 2022-07-12 2022-08-26 长安大学 Named entity recognition method based on multi-class language model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104133848A (en) * 2014-07-01 2014-11-05 中央民族大学 Tibetan language entity knowledge information extraction method
CN104809176A (en) * 2015-04-13 2015-07-29 中央民族大学 Entity relationship extracting method of Zang language
CN107133220A (en) * 2017-06-07 2017-09-05 东南大学 Name entity recognition method in a kind of Geography field
CN107391485A (en) * 2017-07-18 2017-11-24 中译语通科技(北京)有限公司 Entity recognition method is named based on the Korean of maximum entropy and neural network model
CN107608955A (en) * 2017-08-31 2018-01-19 张国喜 A kind of Chinese hides name entity inter-translation method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484682B (en) * 2015-08-25 2019-06-25 阿里巴巴集团控股有限公司 Machine translation method, device and electronic equipment based on statistics


Also Published As

Publication number Publication date
CN108268447A (en) 2018-07-10

Similar Documents

Publication Publication Date Title
CN108268447B (en) Labeling method for Tibetan named entities
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
US9195646B2 (en) Training data generation apparatus, characteristic expression extraction system, training data generation method, and computer-readable storage medium
Zitouni et al. Maximum entropy based restoration of Arabic diacritics
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN110263325B (en) Chinese word segmentation system
CN114065758B (en) Document keyword extraction method based on hypergraph random walk
CN107480200B (en) Word labeling method, device, server and storage medium based on word labels
CN112101027A (en) Chinese named entity recognition method based on reading understanding
CN113505200A (en) Sentence-level Chinese event detection method combining document key information
CN108681532B (en) Sentiment analysis method for Chinese microblog
Grönroos et al. Morfessor EM+ Prune: Improved subword segmentation with expectation maximization and pruning
Chen et al. Integrating natural language processing with image document analysis: what we learned from two real-world applications
CN107797986B (en) LSTM-CNN-based mixed corpus word segmentation method
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
WO2022242074A1 (en) Multi-feature fusion-based method for named entity recognition in chinese medical text
CN111133429A (en) Extracting expressions for natural language processing
CN113127607A (en) Text data labeling method and device, electronic equipment and readable storage medium
CN116629238A (en) Text enhancement quality evaluation method, electronic device and storage medium
CN112989839A (en) Keyword feature-based intent recognition method and system embedded in language model
Bach et al. Rre task: The task of recognition of requisite part and effectuation part in law sentences
Zhou et al. Online handwritten Japanese character string recognition using conditional random fields
CN115858733A (en) Cross-language entity word retrieval method, device, equipment and storage medium
Oprean et al. Handwritten word recognition using Web resources and recurrent neural networks
Li et al. Attention-based LSTM-CNNs for uncertainty identification on Chinese social media texts

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant