CN106980609A - Named entity recognition method using a conditional random field based on word vector representation - Google Patents
Named entity recognition method using a conditional random field based on word vector representation
- Publication number
- CN106980609A (application number CN201710169446.0A)
- Authority
- CN
- China
- Prior art keywords
- word
- sentence
- sigma
- sequence
- log
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Abstract
The invention provides a named entity recognition method using a conditional random field based on word vector representation, belonging to the technical field of natural language processing. The invention comprises a conditional random field algorithm based on word vector representation, and an online named entity recognition system that applies this algorithm and provides a graphical interactive interface with a B/S architecture design. The invention can perform biomedical named entity recognition on the biomedical texts to be parsed by the user. The recognition process exploits the semantic representation power of word vectors and requires little manual feature engineering; it solves the problem that the conditional random field is only effective for discrete feature representations, while retaining the advantages of the conditional random field as a discriminative undirected graph model. The invention further provides the user with retrieval of named entity interaction relation data; in addition, it provides the user with correction functions for the automatic analysis results.
Description
Technical Field
The invention belongs to the field of natural language processing and relates to a method for high-quality biological named entity recognition in biomedical texts, and in particular to a biological named entity recognition method based on the fusion of a Conditional Random Field (CRF) model and a word representation method.
Background
The task of named entity recognition is to recognize words or phrases with particular meaning that appear in text, such as names of people, places, and organizations. Named entity recognition in the biomedical field, known as biological named entity recognition (Bio-NER), aims at the automatic recognition and classification of entity names of given types present in the biomedical literature, such as proteins, genes, diseases, and cells, using biomedical text mining techniques. Biological named entity recognition is a key step of biomedical text mining and a prerequisite for deeper text mining technologies such as relation extraction, hypothesis discovery, and text classification; for example, to obtain relationships between biological entities such as genes, proteins, and diseases, these entities must first be correctly identified in the text. The basic process of the currently most widely used machine learning-based methods comprises corpus preprocessing, feature extraction, model training, and prediction. The corpus preprocessing step comprises operations on the biomedical text such as case conversion, tokenization, stemming, and stop-word removal. The features applied mainly comprise core word features, dictionary features, word formation features, morphology features, affix features, part-of-speech features, chunk features, and the like. The models built by machine learning mainly comprise Hidden Markov Models (HMMs), Support Vector Machines (SVMs), Maximum Entropy models (MEs), Maximum Entropy Markov Models (MEMMs), Conditional Random Fields (CRFs), and the like.
For example, ABNER (http://pages.cs.wisc.edu/~bsettles/abner/) is a standard named entity recognition software tool whose core is based on linear-chain CRFs. Statistical machine learning-based methods do not require manually written rules, are more robust, and can identify potential named entities that do not appear in a standard term dictionary. Dimililer et al. extracted rich feature sets including lexical, word-formation, and spelling features, performed combined classification with SVM classification models under different parameters, and applied weighted voting with an improved genetic algorithm, finally reaching an F-score of 72.51% on JNLPBA 2004. Liao et al. (Biomedical Named Entity Recognition Based on Skip-Chain CRFs, 2011, American Journal of Engineering and Technology Research) used the skip-chain CRF model, which considers interdependence information between distant entities, for named entity recognition on the JNLPBA 2004 task and obtained an F-score of 73.2% on its test set. To reduce the cost of manually extracting features, semi-supervised learning has also been introduced into machine learning methods. Li Yanpeng et al. (Incorporating Rich Background Knowledge for Gene Named Entity Classification and Recognition, 2009, BMC Bioinformatics) extracted useful information from massive unlabeled data and then used it as features to improve supervised learning, reaching an F-score of 89.05% on BioCreative II.
When a conditional random field is used for named entity recognition, manual features need to be extracted from the training corpus. Although semi-supervised learning reduces the cost of manual feature extraction to some extent, constructing manual features often requires specific professional domain knowledge, which is extremely difficult for researchers outside the field; moreover, new entity nouns cannot be predicted well, and the problem of semantic ambiguity cannot be solved well. Word vectors, which have recently made much progress in the natural language field, can effectively map each word appearing in a corpus into an n-dimensional space and have a strong ability to represent semantics: words with similar semantics lie relatively close together in the space. Deep semantic information can therefore be obtained by extracting word vector features from unlabeled corpora, avoiding the process of manual feature extraction. However, the conditional random field algorithm is better suited to discrete feature representations, and combining continuous real-valued word vectors with the CRF algorithm for named entity recognition is very challenging.
Disclosure of Invention
The invention provides a named entity recognition method using a conditional random field based on word vector representation. It first solves the problems of high cost and low generalization capability caused by manually extracted features; second, it solves the problem that the conditional random field is only effective for discrete feature representations; finally, because the semantic representation of word vectors is incorporated, it improves the accuracy of named entity recognition with conditional random fields.
The invention mainly comprises two parts: 1. converting the corpus into word vectors through deep learning; 2. modifying the conditional random field algorithm so that it can accept continuous real-valued vectors as input.
The technical scheme of the invention is as follows:
a named entity recognition method based on a conditional random field represented by a word vector comprises the following steps:
First, steps (I)-(IV) describe how to recognize the named entities in a sentence when trained parameters are already available, namely the state transition weight matrix A of size (labelnum + 1) × labelnum and the parameter matrix θ of size labelnum × (d × M + 1); the meanings of labelnum, d, and M are given in the steps below. The method for training these parameters is described in step (V).
(I) Corpus extraction and preprocessing
Each word in the corpus to be processed is converted into a d-dimensional word vector (word embedding) using the Skip-gram language model in the word2vec tool.
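A minimal sketch of this step, assuming the gensim library as the word2vec implementation; the corpus file name and the dimension d = 50 are illustrative assumptions:

```python
# Sketch of step (I): train Skip-gram word vectors (word embeddings) with gensim.
# "biomedical_corpus.txt" and d = 50 are illustrative assumptions.
from gensim.models import Word2Vec

sentences = [line.split() for line in open("biomedical_corpus.txt", encoding="utf-8")]
d = 50  # word vector dimension
model = Word2Vec(sentences, vector_size=d, sg=1, window=5, min_count=1)  # sg=1 selects Skip-gram
vec = model.wv["protein"]  # d-dimensional vector of a word occurring in the corpus
```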
(II) Tagging scheme
In the task of named entity recognition, some named entities are represented by one word and some by several words. To distinguish which part of an entity a word represents, the words need to be labeled, i.e. assigned different labels (tags). The present invention adopts the IOBES tagging scheme to label the corpus.
IOBES tagging scheme (as shown in Table 1):

Table 1

Begin | Inside | End | Single | Other
---|---|---|---|---
B | I | E | S | O
For named entities represented by several words: the word at the beginning of the named entity is labeled B (Begin), the words in the middle of the named entity are labeled I (Inside), and the word at the end of the named entity is labeled E (End).
For a named entity represented by one word: the word is labeled S (Single). For non-named entities: words that do not belong to a named entity are labeled O (Other). The number of labels is 5, denoted labelnum.
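As an illustration of the scheme, the following sketch converts entity spans into per-word IOBES tags; the span format (inclusive start and end word indices) is an assumption made here, not part of the invention:

```python
# Sketch of IOBES labeling: entity spans -> per-word tags.
def iobes_tags(num_words, entity_spans):
    tags = ["O"] * num_words                 # default: Other
    for start, end in entity_spans:          # inclusive word indices (assumed format)
        if start == end:
            tags[start] = "S"                # single-word entity
        else:
            tags[start] = "B"                # beginning of a multi-word entity
            tags[end] = "E"                  # end of the entity
            for i in range(start + 1, end):
                tags[i] = "I"                # interior words
    return tags

# "We find that hTAFII32 is the human homologue of Drosophila TAFII40"
print(iobes_tags(11, [(3, 3), (9, 10)]))
# ['O', 'O', 'O', 'S', 'O', 'O', 'O', 'O', 'O', 'B', 'E']
```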
(III) Calculation of state feature weights from word vectors
The invention is based on a linear-chain conditional random field model, so the corpus is processed sentence by sentence. For any sentence (i.e., any word sequence):
L: the length of the sentence. X = {X_1, X_2, X_3, ..., X_L}, where X denotes the sentence (word sequence) and X_i denotes the i-th word in the sentence. Y = {Y_1, Y_2, Y_3, ..., Y_L}, where Y denotes the tag sequence corresponding to the sentence and Y_i denotes the tag of the i-th word; its value is one of I, O, B, E, S. Y_i = label[j] indicates that the tag of the i-th word in the sentence is the j-th label.
1. Computation of the feature matrix from word vectors
How to obtain a good word feature representation is one of the key links in improving named entity recognition accuracy. Because the label of each word is related not only to the word itself but also to several words around it, besides obtaining the word vector of each word, the word vectors of each word and of several surrounding words are concatenated with a window method to construct the word's feature vector.
Window method: fix the window size to M. For each word X_i, taking the sentence as the unit, the word vectors of the M consecutive words X_{i-(M-1)/2}, ..., X_i, ..., X_{i+(M-1)/2} are concatenated in order, and then a constant 1 is appended at the end; the result is the feature vector of the word X_i. However, words near the beginning and end of the sentence do not have enough ((M-1)/2) words on their left or right; to overcome this boundary effect, a "none" word vector, i.e. a zero vector, is used as padding, which has the same effect as marking with "start" and "stop". Applying the window method to every word in the sentence yields the feature matrix FeatureMatrix corresponding to the sentence, of size (d × M + 1) × L.
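A minimal sketch of the window method, assuming the word vectors of a sentence are given as an (L, d) NumPy array and taking M = 3 as an illustrative window size:

```python
# Sketch of step (III).1: build the (d*M + 1) x L feature matrix with a window method.
import numpy as np

def feature_matrix(word_vecs, M):
    L, d = word_vecs.shape
    half = (M - 1) // 2
    pad = np.zeros((half, d))
    padded = np.vstack([pad, word_vecs, pad])     # zero-vector ("none") padding at both ends
    cols = []
    for i in range(L):
        window = padded[i:i + M].reshape(-1)      # concatenate M consecutive word vectors
        cols.append(np.append(window, 1.0))       # append the constant 1
    return np.stack(cols, axis=1)                 # shape (d*M + 1, L)

F = feature_matrix(np.random.randn(12, 50), M=3)
print(F.shape)  # (151, 12)
```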
2. Calculation of state feature weights from the feature matrix
Under the IOBES tagging scheme, for any word X_i the tag Y_i has five possible values. This step introduces the state feature weight corresponding to each possible value of Y_i among I, O, B, E, S.
The parameter matrix θ of size labelnum × (d × M + 1) is multiplied by the feature matrix FeatureMatrix obtained in the previous step, giving a matrix μ′ of size labelnum × L. Each value in μ′ is processed by the HardTanh function, finally yielding the state feature weight matrix μ of size labelnum × L. The element in row j and column i of μ is the state feature weight of the case where the tag Y_i of the i-th word X_i of the sentence is label[j]; it is denoted μ(Y_i, X_i).
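A minimal sketch of this computation, with HardTanh implemented as clipping to [-1, 1]; θ is random here purely for illustration:

```python
# Sketch of step (III).2: state feature weights mu = HardTanh(theta . FeatureMatrix).
import numpy as np

def hardtanh(x):
    return np.clip(x, -1.0, 1.0)  # HardTanh: -1 for x < -1, x on [-1, 1], +1 for x > 1

labelnum, d, M, L = 5, 50, 3, 12
theta = np.random.randn(labelnum, d * M + 1)  # parameter matrix, labelnum x (d*M + 1)
F = np.random.randn(d * M + 1, L)             # feature matrix from the window method
mu = hardtanh(theta @ F)                      # state feature weight matrix, labelnum x L
print(mu.shape)  # (5, 12)
```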
(IV) Evaluating the tag sequence Y to identify the named entities
Named entities are found by evaluating the tag sequence: all words labeled S, and all word strings labeled with the combination B, (zero or more) I, E, are named entities. Evaluating the tag sequence of a sentence means finding, given the sentence X, the tag sequence Y* such that the conditional probability P(Y|X) reaches its maximum when Y = Y*.
First, the state transition weight matrix A of size (labelnum + 1) × labelnum is introduced.
A: each of the first labelnum rows of A represents one label case, the last row represents the no-label case, and each column of A represents one label case. A_{m,n}, the element in row m and column n of A, represents the state transition weight for the case where the label of X_{i-1} is the label of row m and the label of X_i is the label of column n. To reflect the word position in the sentence, this state transition weight is denoted λ(Y_{i-1}, Y_i, X_i).
1. Introducing the potential function exp(Σ_j λ_j t_j(Y_{i-1}, Y_i, X, i) + Σ_k μ_k s_k(Y_i, X, i))
The symbols in the potential function are defined and explained as follows:
j: when X_i is at the beginning of the sentence, 1 ≤ j ≤ labelnum; when X_i is not at the beginning, 1 ≤ j ≤ labelnum × labelnum. j is an integer, and each different j represents one particular state transition case from label p to label q.
k: 1 ≤ k ≤ labelnum, k an integer; each different k represents one particular tag state q.
t_j(Y_{i-1}, Y_i, X, i): the state transition feature function at two adjacent tag positions.
s_k(Y_i, X, i): the state feature function at sequence position i.
λ_j: the state transition feature weight. For a specific j, corresponding to the tag transition case Y_{i-1} = label[p], Y_i = label[q], we have λ_j = λ(Y_{i-1}, Y_i, X_i) = A_{p,q}.
μ_k: the state feature weight. For a specific k, corresponding to the tag case Y_i = label[q], we have μ_k = μ(Y_i, X_i) = μ_{q,i}.
Calculation of Σ_j λ_j t_j(Y_{i-1}, Y_i, X, i): given a sentence sequence X and a corresponding given tag sequence Y, exactly one transition feature fires at position i, so the sum equals the state transition feature weight λ(Y_{i-1}, Y_i, X_i) of the word at position i and the word before it, i.e. the corresponding element of the state transition weight matrix A.
Calculation of Σ_k μ_k s_k(Y_i, X, i): given a sentence sequence X and a corresponding given tag sequence Y, the sum equals the state feature weight μ(Y_i, X_i) of the word at position i.
For each word, the word-level potential function exp(Σ_j λ_j t_j(Y_{i-1}, Y_i, X, i) + Σ_k μ_k s_k(Y_i, X, i)) is therefore obtained from the sum of the word's state transition feature weight and state feature weight, and can be written simply as exp(λ(Y_{i-1}, Y_i, X_i) + μ(Y_i, X_i)).
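As a sketch, the logarithm of the product of the word-level potentials of a given tag sequence is just the sum of these two weights over positions; the layout of A (last row for the no-label, sentence-start case) follows the description above:

```python
# Sketch: log of the sentence-level potential of a tag sequence Y for a sentence X.
import numpy as np

def sequence_score(A, mu, tags):
    # A: (labelnum+1, labelnum) transition weights, last row = no-label (sentence start)
    # mu: (labelnum, L) state feature weights; tags: L label indices in 0..labelnum-1
    labelnum, L = mu.shape
    score, prev = 0.0, labelnum
    for i, y in enumerate(tags):
        score += A[prev, y] + mu[y, i]   # lambda(Y_{i-1}, Y_i, X_i) + mu(Y_i, X_i)
        prev = y
    return score
```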
2. Conditional probability P(Y|X)
Since t_j(Y_{i-1}, Y_i, X, i) and s_k(Y_i, X, i) are both feature functions, for convenience of notation let them all be written f_j(Y_{i-1}, Y_i, X, i), and let λ_j and μ_k both be written λ_j. Then Σ_j λ_j t_j(Y_{i-1}, Y_i, X, i) + Σ_k μ_k s_k(Y_i, X, i) can be written as Σ_j λ_j f_j(Y_{i-1}, Y_i, X, i).
The conditional random field algorithm expresses, through a conditional probability, the likelihood that a sentence sequence X corresponds to a particular tag sequence Y:
P(Y|X) = exp(Σ_i Σ_j λ_j f_j(Y_{i-1}, Y_i, X, i)) / Z(X)
where exp(Σ_i Σ_j λ_j f_j(Y_{i-1}, Y_i, X, i)) is the sentence-level potential function. The sentence-level potential function is the product of the word-level potential functions of the words in the sentence, and likewise can be written simply as exp(Σ_i (λ(Y_{i-1}, Y_i, X_i) + μ(Y_i, X_i))).
Z(X) = Σ_Y exp(Σ_i Σ_j λ_j f_j(Y_{i-1}, Y_i, X, i)) is the normalization factor: the sum of the sentence-level potential functions over all possible tag sequences Y of the sentence.
3. Evaluating tag sequence Y to identify named entities
For a particular sentence sequence, Z(X) is a constant, so it suffices to find the tag sequence Y* that maximizes the sentence-level potential function exp(Σ_i (λ(Y_{i-1}, Y_i, X_i) + μ(Y_i, X_i))).
When computing Y*, the number of possible tag sequence combinations of Y grows exponentially with the number of words in the sentence sequence; finding Y* by exhaustive search would have extremely high time complexity and the computation could not finish, so the invention solves it with a dynamic programming method such as the Viterbi algorithm.
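A minimal sketch of Viterbi decoding for this model, reusing the A and μ matrices defined above (the last row of A again standing for the sentence-start, no-label case):

```python
# Sketch of step (IV): Viterbi decoding of the optimal tag sequence Y*.
import numpy as np

def viterbi(A, mu):
    labelnum, L = mu.shape
    delta = A[labelnum] + mu[:, 0]               # best score of each label at position 0
    back = np.zeros((L, labelnum), dtype=int)
    for i in range(1, L):
        # candidate scores: rows = previous label, columns = current label
        cand = delta[:, None] + A[:labelnum] + mu[:, i]
        back[i] = cand.argmax(axis=0)            # best previous label per current label
        delta = cand.max(axis=0)
    tags = [int(delta.argmax())]                 # best final label, then backtrack
    for i in range(L - 1, 0, -1):
        tags.append(int(back[i][tags[-1]]))
    return tags[::-1]
```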
(V) Parameter training
For convenience of description, the state transition weight matrix A and the parameter matrix θ are together denoted by the parameter set W, and P(Y|X) is written as P_W(Y|X). In the training set, each sentence sequence X has a correct tag sequence Y corresponding to it, which is called Y′.
The invention uses the stochastic gradient descent method to adjust the parameters W so that, for each X in the training set, the log-likelihood function value corresponding to Y = Y′ reaches a maximum.
1. The log-likelihood function corresponding to P_W(Y′|X):
log P_W(Y′|X) = Σ_i (λ(Y′_{i-1}, Y′_i, X_i) + μ(Y′_i, X_i)) − log Z(X)
When stochastic gradient descent is used to adjust the parameters, the gradient with respect to log Z(X) must be computed; therefore, although Z(X) is a constant for a given X, it still needs to be calculated. For convenience of notation, the logadd operation is defined: logadd_i(z_i) = log(Σ_i exp(z_i)). Thus:
log P_W(Y′|X) = Σ_i (λ(Y′_{i-1}, Y′_i, X_i) + μ(Y′_i, X_i)) − logadd_{∀Y} Σ_i (λ(Y_{i-1}, Y_i, X_i) + μ(Y_i, X_i))
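A minimal sketch of the logadd operation with the usual max-shift for numerical stability (equivalent to scipy.special.logsumexp):

```python
# Sketch of logadd: log(sum_i exp(z_i)), computed stably by factoring out max(z).
import numpy as np

def logadd(z):
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))
```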
Because the possible tag sequence combinations of Y increase exponentially with the number of words in the sentence sequence (when the sentence length reaches 10 there are already nearly ten million, 5^10, combinations), computing the logadd by exhausting Y is impossible; the invention therefore employs the classical recursive (forward) method, whose computation speed is linear in the sentence length L.
Let k, m denote arbitrary tags and t denote a position in the sequence. The following quantity is defined to facilitate the calculation: δ_t(k), the logadd over all partial tag sequences ending with tag k at position t of their partial scores, satisfying the recursion
δ_1(k) = λ(start, k, X_1) + μ(k, X_1)
δ_t(k) = μ(k, X_t) + logadd_m (δ_{t-1}(m) + λ(m, k, X_t)),  t > 1
Thus logadd_{∀Y} Σ_i (λ(Y_{i-1}, Y_i, X_i) + μ(Y_i, X_i)) = log Z(X) = logadd_k δ_L(k), and the result is obtained by recursive calculation.
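A minimal sketch of this forward recursion, computing log Z(X) in time linear in the sentence length; A and μ are as in the Viterbi sketch above:

```python
# Sketch of the recursive (forward) computation of log Z(X) with logsumexp.
import numpy as np
from scipy.special import logsumexp

def log_partition(A, mu):
    labelnum, L = mu.shape
    delta = A[labelnum] + mu[:, 0]   # delta_1(k): start (no-label) row of A plus mu
    for t in range(1, L):
        # delta_t(k) = mu[k, t] + logadd_m(delta_{t-1}(m) + A[m, k])
        delta = mu[:, t] + logsumexp(delta[:, None] + A[:labelnum], axis=0)
    return logsumexp(delta)          # log Z(X) = logadd_k delta_L(k)
```

Replacing each logsumexp with a max recovers exactly the Viterbi recursion, which is why the two computations share their structure.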
2. Parameter adjustment using stochastic gradient descent
The stochastic gradient method adopted by the invention is an iterative algorithm. Since the positive gradient direction is the direction in which the function value increases fastest, updating the parameters W along the positive gradient direction at each iteration step increases the objective fastest; this is why a stochastic gradient method is adopted. An example (X, Y′) is selected at random and the parameters W are updated so as to maximize the objective function; this is iterated until convergence. The iterative formula of the gradient update is:
W ← W + λ ∂log P_W(Y′|X) / ∂W
where λ is the chosen learning rate and ∂log P_W(Y′|X)/∂W is the search direction. The derivative is computed by the chain rule of differentiation, thereby finding the parameters W that maximize the objective function.
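A minimal sketch of the update loop; the gradient function loglik_grad, which would compute ∂log P_W(Y′|X)/∂W with the forward-backward algorithm, is a hypothetical placeholder here:

```python
# Sketch of step (V).2: stochastic gradient ascent on the log-likelihood.
import random

def sgd(W, train_set, lr=0.01, epochs=10):
    # W: dict of parameter arrays (A and theta); train_set: list of (X, Y_gold) pairs
    for _ in range(epochs):
        random.shuffle(train_set)
        for X, Y_gold in train_set:               # randomly selected example (X, Y')
            grad = loglik_grad(W, X, Y_gold)      # hypothetical: d log P_W(Y'|X) / dW
            for name in W:
                W[name] += lr * grad[name]        # move along the positive gradient
    return W
```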
The invention builds the conditional random field method based on word vector representation into a complete online system for biomedical named entity recognition, providing a real-time query service for researchers. Biomedical named entity recognition is an important branch of biomedical text mining, has high application value, and is a prerequisite for subsequent tasks such as relation extraction and information retrieval. On the basis of traditional methods, the invention improves the expressive and generalization capability of the features, solves the problem that the conditional random field is only effective for discrete feature representations, can help researchers in the biomedical field analyze texts automatically, and provides a function for retrieving known biomedical named entities to support research and analysis.
Detailed Description
The following further describes the specific embodiments of the present invention in combination with the technical solutions.
The system can perform high-quality gene name recognition on a given biomedical text, avoids the problems of high cost and low generalization capability caused by manually extracted features, improves the level of biomedical text recognition, and is simple and convenient for the user to operate. The system adopts a B/S (Browser/Server, mainly realized with technologies such as JSP, HTML, and JS) architecture and is divided into a view layer, a logic layer, and a data layer.
1. The user inputs the text to be parsed
As shown in Table 2, text input supports two modes, keyboard input and local file upload; the view layer receives the text to be analyzed input by the user, submits it to the logic layer, and stores it in the data layer. Suppose the text to be analyzed by the user is "We find that hTAFII32 is the human homologue of Drosophila TAFII40". The user can either (1) type the text directly into the page text box, or (2) save the text in a format such as txt or doc and upload it as a file. The former is suitable for short texts or testing, the latter for processing large texts.
Table 2: System architecture
2. The system analyzes the text to be parsed
(1) After preprocessing the text to be analyzed, such as sentence splitting and tokenization, the logic layer decomposes it into a sentence (including punctuation) of 12 tokens. A word vector table is trained with word2vec; the text input by the user is looked up in the word vector table, and the corresponding word vector is found for each word. As described earlier, the sentence is converted with the sliding window into 12 feature vectors, which are input in turn into the modified conditional random field based on word vector representation.
(2) Using the named entity recognition method based on a conditional random field with word vector representation, the optimal tag sequence "O O O B O O O O B I O O" is found for the sentence after stepwise calculation, i.e. the biomedical named entities "hTAFII32" and "Drosophila TAFII40" are recognized, and the named entities are thereby found. During analysis the result is obtained directly with the trained parameters, without training.
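Tying the earlier sketches together, a hypothetical end-to-end pass over the example sentence (model, θ, and A assumed already trained; feature_matrix, hardtanh, and viterbi as sketched above) would look like:

```python
# Sketch: tag one sentence with the trained parameters (no training at analysis time).
import numpy as np

words = "We find that hTAFII32 is the human homologue of Drosophila TAFII40".split()
vecs = np.stack([model.wv[w] if w in model.wv else np.zeros(d) for w in words])
F = feature_matrix(vecs, M=3)                 # window method, step (III).1
mu = hardtanh(theta @ F)                      # state feature weights, step (III).2
labels = ["I", "O", "B", "E", "S"]            # illustrative label index order
print([labels[t] for t in viterbi(A, mu)])    # optimal tag sequence Y*
```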
3. Manual verification of results by a user
After the user submits the data, the system obtains the optimal tag sequence for the text. If there is an obvious error, the system allows the user to correct the result; the corrected result is stored in the database, and the parameters are re-tuned against the corrected result until the result is optimal.
Claims (1)
1. A named entity recognition method based on a conditional random field represented by a word vector is characterized by comprising the following steps:
(I) Corpus extraction and preprocessing
Converting each word in the corpus to be processed into a d-dimensional word vector by using a Skip-gram language model in a word2vec tool;
(II) Tagging scheme
Assigning different labels to words, and labeling the corpus with the IOBES tagging scheme;
IOBES tagging scheme:

Table 1

Begin | Inside | End | Single | Other
---|---|---|---|---
B | I | E | S | O
For named entities represented by several words: marking a word which represents the beginning of the named entity by using a B, marking a word which represents the middle of the named entity by using an I, and marking a word which represents the end of the named entity by using an E;
for a named entity represented by one word: marking words representing the named entity by using S;
for non-named entities: marking words representing non-named entities with O; the number of labels is 5, and is represented by labelnum;
(III) Calculation of state feature weights from word vectors
The named entity recognition method is based on a linear-chain conditional random field model, and the corpus is processed sentence by sentence; for any sentence, i.e. any word sequence:
L is the length of the sentence; X = {X_1, X_2, X_3, ..., X_L}, where X denotes the sentence, i.e. the word sequence, and X_i denotes the i-th word in the sentence; Y = {Y_1, Y_2, Y_3, ..., Y_L}, where Y denotes the tag sequence corresponding to the sentence and Y_i denotes the tag of the i-th word, whose value is one of I, O, B, E, S; Y_i = label[j] indicates that the tag of the i-th word of the sentence is the j-th label;
1. Computation of the feature matrix from word vectors
Concatenating, with a window method, the word vectors of each word and of several words around it to construct the feature vector of the word;
Window method: determining the window size as M; for each word X_i, taking the sentence as the unit, concatenating in order the word vectors of the M consecutive words X_{i-(M-1)/2}, ..., X_i, ..., X_{i+(M-1)/2}, and then appending a constant 1 at the end as the feature vector of the word X_i; however, words at the beginning and end of the sentence do not have enough ((M-1)/2) adjacent words on their left or right, so a "none" word vector, i.e. a zero vector, is used as padding, which has the same effect as marking with "start" and "stop"; applying the window method to every word of the sentence yields the feature matrix FeatureMatrix of the sentence, of size (d × M + 1) × L;
2. calculation of state Feature weights from Feature Matrix
For any word X, the IOBES markup plan is adoptedi,YiThere are five possibilities, and this step will introduce YiCorresponding state feature weights in different instances of values of IOBESThe size of (d);
obtaining the parameter matrix theta with the size of labelnum × (d × M +1) and the previous stepMultiplying the feature matrix FeatureMatrix point to obtain a matrix mu 'with the size of labelnum × L, processing each numerical value in the mu' by a Hardtach function to finally obtain a state feature weight matrix mu with the size of labelnum × L, wherein the sizes of the jth row and ith column elements in the mu represent the ith word X in the sentenceiLabel Y ofiIs composed ofThe magnitude of the state characteristic weight of the timeRepresents;
(IV) Evaluating the tag sequence Y to identify the named entities
Evaluating the tag sequence to find all words labeled S and all word strings labeled with the combination B, zero or more I, E, thereby finding the named entities; evaluating the tag sequence of a sentence means finding, given the sentence X, the tag sequence Y* such that the conditional probability P(Y|X) reaches its maximum when Y = Y*;
First, the state transition weight matrix A of size (labelnum + 1) × labelnum is introduced;
A: each of the first labelnum rows of A represents one label case, the last row represents the no-label case, and each column of A represents one label case; A_{m,n}, the element in row m and column n of A, represents the state transition weight for the case where the label of X_{i-1} is the label of row m and the label of X_i is the label of column n; to reflect the word position in the sentence, this state transition weight is denoted λ(Y_{i-1}, Y_i, X_i);
1. The potential function exp(Σ_j λ_j t_j(Y_{i-1}, Y_i, X, i) + Σ_k μ_k s_k(Y_i, X, i))
The symbols in the potential function are defined and explained as follows:
j: when X_i is at the beginning of the sentence, 1 ≤ j ≤ labelnum; when X_i is not at the beginning, 1 ≤ j ≤ labelnum × labelnum, wherein j is an integer and each different j represents one particular state transition case from label p to label q;
k: 1 ≤ k ≤ labelnum, k being an integer, each different k representing one particular tag state q;
t_j(Y_{i-1}, Y_i, X, i): the state transition feature function at two adjacent tag positions;
s_k(Y_i, X, i): the state feature function at sequence position i;
λ_j: the state transition feature weight; for a specific j, corresponding to the tag transition case Y_{i-1} = label[p], Y_i = label[q], λ_j = λ(Y_{i-1}, Y_i, X_i) = A_{p,q};
μ_k: the state feature weight; for a specific k, corresponding to the tag case Y_i = label[q], μ_k = μ(Y_i, X_i) = μ_{q,i};
Calculation of Σ_j λ_j t_j(Y_{i-1}, Y_i, X, i): given a sentence sequence X and a corresponding given tag sequence Y, the sum equals the state transition feature weight λ(Y_{i-1}, Y_i, X_i) of the word at position i and the word before it, i.e. the corresponding element of the state transition weight matrix A;
Calculation of Σ_k μ_k s_k(Y_i, X, i): given a sentence sequence X and a corresponding given tag sequence Y, the sum equals the state feature weight μ(Y_i, X_i) of the word at position i;
For each word, the word-level potential function exp(Σ_j λ_j t_j(Y_{i-1}, Y_i, X, i) + Σ_k μ_k s_k(Y_i, X, i)) is obtained from the sum of the state transition feature weight and the state feature weight of the word, simply expressed as exp(λ(Y_{i-1}, Y_i, X_i) + μ(Y_i, X_i));
2. Conditional probability P(Y|X)
Since t_j(Y_{i-1}, Y_i, X, i) and s_k(Y_i, X, i) are all feature functions, they are all written as f_j(Y_{i-1}, Y_i, X, i), and λ_j and μ_k are all written as λ_j; then Σ_j λ_j t_j(Y_{i-1}, Y_i, X, i) + Σ_k μ_k s_k(Y_i, X, i) is expressed as Σ_j λ_j f_j(Y_{i-1}, Y_i, X, i);
The conditional random field algorithm expresses the likelihood that a sentence sequence X corresponds to a particular tag sequence Y through the conditional probability
P(Y|X) = exp(Σ_i Σ_j λ_j f_j(Y_{i-1}, Y_i, X, i)) / Z(X);
wherein exp(Σ_i Σ_j λ_j f_j(Y_{i-1}, Y_i, X, i)) is the sentence-level potential function, which is the product of the word-level potential functions of the words in the sentence and is simply expressed as exp(Σ_i (λ(Y_{i-1}, Y_i, X_i) + μ(Y_i, X_i)));
Z(X) = Σ_Y exp(Σ_i Σ_j λ_j f_j(Y_{i-1}, Y_i, X, i)) is the normalization factor, i.e. the sum of the sentence-level potential functions of all possible tag sequences Y corresponding to the sentence;
3. evaluating tag sequence Y to identify named entities
For a particular sentence sequence, Z(X) is a constant, so it suffices to find the tag sequence Y* that maximizes the sentence-level potential function exp(Σ_i (λ(Y_{i-1}, Y_i, X_i) + μ(Y_i, X_i)));
(V) Parameter training
The state transition weight matrix A and the parameter matrix θ are together denoted by the parameter set W, and P(Y|X) is written as P_W(Y|X); in the training set, each sentence sequence X has a correct tag sequence Y corresponding to it, which is called Y′;
The named entity recognition method adjusts the parameters W with the stochastic gradient descent method so that, for each X in the training set, the log-likelihood function value corresponding to Y = Y′ reaches the maximum;
1. The log-likelihood function corresponding to P_W(Y′|X): log P_W(Y′|X) = Σ_i (λ(Y′_{i-1}, Y′_i, X_i) + μ(Y′_i, X_i)) − log Z(X);
When stochastic gradient descent is used for parameter adjustment, the gradient with respect to log Z(X) is involved; although Z(X) is a constant for a given X, it still needs to be computed; for convenience of notation, the logadd operation is defined: logadd_i(z_i) = log(Σ_i exp(z_i)); therefore, log P_W(Y′|X) = Σ_i (λ(Y′_{i-1}, Y′_i, X_i) + μ(Y′_i, X_i)) − logadd_{∀Y} Σ_i (λ(Y_{i-1}, Y_i, X_i) + μ(Y_i, X_i));
let k, m denote any tag, t denote the position of the sequence, and define the following formula for ease of calculation:
thus, is composed ofObtaining a result by recursive calculation;
2. parameter adjustment using stochastic gradient descent
The named entity recognition method adopts a stochastic gradient method, which is an iterative algorithm; an example (X, Y′) is randomly selected and the parameters W are updated so as to maximize the objective function, iterating continuously until convergence; the iterative formula of the gradient update is W ← W + λ ∂log P_W(Y′|X) / ∂W;
wherein λ is the chosen learning rate and ∂log P_W(Y′|X)/∂W is the search direction; the derivative is computed by the chain rule of differentiation, thereby finding the parameters W that maximize the objective function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710169446.0A CN106980609A (en) | 2017-03-21 | 2017-03-21 | Named entity recognition method using a conditional random field based on word vector representation
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710169446.0A CN106980609A (en) | 2017-03-21 | 2017-03-21 | Named entity recognition method using a conditional random field based on word vector representation
Publications (1)
Publication Number | Publication Date |
---|---|
CN106980609A true CN106980609A (en) | 2017-07-25 |
Family
ID=59339001
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710169446.0A Withdrawn CN106980609A (en) | 2017-03-21 | 2017-03-21 | A kind of name entity recognition method of the condition random field of word-based vector representation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106980609A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105630768A (en) * | 2015-12-23 | 2016-06-01 | 北京理工大学 | Cascaded conditional random field-based product name recognition method and device |
CN106202054A (en) * | 2016-07-25 | 2016-12-07 | 哈尔滨工业大学 | A kind of name entity recognition method learnt based on the degree of depth towards medical field |
Non-Patent Citations (9)
Title |
---|
CHUANHAI DONG et al.: "Character-Based LSTM-CRF with Radical-Level Features for Chinese Named Entity Recognition", Lecture Notes in Computer Science |
GUILLAUME LAMPLE et al.: "Neural Architectures for Named Entity Recognition", Computer Science - Computation and Language |
JASON P.C. CHIU et al.: "Named Entity Recognition with Bidirectional LSTM-CNNs", Transactions of the Association for Computational Linguistics |
LISHUANG LI et al.: "Domain Term Extraction Based on Conditional Random Fields Combined with Active Learning Strategy", Journal of Information & Computational Science |
JING XING: "Research on Named Entity Recognition Based on Word Vectors and CRF", Wireless Internet Technology |
LI LISHUANG et al.: "Biomedical Named Entity Recognition Based on Word Representation Method", Journal of Chinese Computer Systems |
LI JIANFENG: "Research on Chinese Named Entity Recognition Incorporating External Knowledge and Its Application in the Medical Domain", China Masters' Theses Full-text Database, Information Science and Technology |
LI XINXIN: "Research on Joint Learning Methods for Sequence Labeling Problems in Natural Language Processing", China Doctoral Dissertations Full-text Database, Information Science and Technology |
GUO JIAQING: "Research on Named Entity Recognition Based on Conditional Random Fields", China Masters' Theses Full-text Database, Information Science and Technology |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107526834A (en) * | 2017-09-05 | 2017-12-29 | 北京工商大学 | Joint part of speech and the word2vec improved methods of the correlation factor of word order training |
CN107526834B (en) * | 2017-09-05 | 2020-10-23 | 北京工商大学 | Word2vec improvement method for training correlation factors of united parts of speech and word order |
CN107767036A (en) * | 2017-09-29 | 2018-03-06 | 北斗导航位置服务(北京)有限公司 | A kind of real-time traffic states method of estimation based on condition random field |
CN107797989A (en) * | 2017-10-16 | 2018-03-13 | 平安科技(深圳)有限公司 | Enterprise name recognition methods, electronic equipment and computer-readable recording medium |
CN110019711A (en) * | 2017-11-27 | 2019-07-16 | 吴谨准 | A kind of control method and device of pair of medicine text data structureization processing |
CN108229582A (en) * | 2018-02-01 | 2018-06-29 | 浙江大学 | Entity recognition dual training method is named in a kind of multitask towards medical domain |
CN108628970B (en) * | 2018-04-17 | 2021-06-18 | 大连理工大学 | Biomedical event combined extraction method based on new marker mode |
CN108628970A (en) * | 2018-04-17 | 2018-10-09 | 大连理工大学 | A kind of biomedical event joint abstracting method based on new marking mode |
CN108763201A (en) * | 2018-05-17 | 2018-11-06 | 南京大学 | A kind of open field Chinese text name entity recognition method based on semi-supervised learning |
CN108763201B (en) * | 2018-05-17 | 2021-07-23 | 南京大学 | Method for identifying text named entities in open domain based on semi-supervised learning |
CN108717410A (en) * | 2018-05-17 | 2018-10-30 | 达而观信息科技(上海)有限公司 | Name entity recognition method and system |
CN108874997A (en) * | 2018-06-13 | 2018-11-23 | 广东外语外贸大学 | A kind of name name entity recognition method towards film comment |
CN110728147B (en) * | 2018-06-28 | 2023-04-28 | 阿里巴巴集团控股有限公司 | Model training method and named entity recognition method |
CN110728147A (en) * | 2018-06-28 | 2020-01-24 | 阿里巴巴集团控股有限公司 | Model training method and named entity recognition method |
CN109147870A (en) * | 2018-07-26 | 2019-01-04 | 刘滨 | The recognition methods of intrinsic unordered protein based on condition random field |
CN109189820A (en) * | 2018-07-30 | 2019-01-11 | 北京信息科技大学 | A kind of mine safety accidents Ontological concept abstracting method |
CN109189820B (en) * | 2018-07-30 | 2021-08-31 | 北京信息科技大学 | Coal mine safety accident ontology concept extraction method |
CN109214000A (en) * | 2018-08-23 | 2019-01-15 | 昆明理工大学 | A kind of neural network card language entity recognition method based on topic model term vector |
CN109325225A (en) * | 2018-08-28 | 2019-02-12 | 昆明理工大学 | It is a kind of general based on associated part-of-speech tagging method |
CN109145303B (en) * | 2018-09-06 | 2023-04-18 | 腾讯科技(深圳)有限公司 | Named entity recognition method, device, medium and equipment |
CN109192201A (en) * | 2018-09-14 | 2019-01-11 | 苏州亭云智能科技有限公司 | Voice field order understanding method based on dual model identification |
CN111401064B (en) * | 2019-01-02 | 2024-04-19 | 中国移动通信有限公司研究院 | Named entity identification method and device and terminal equipment |
CN109635046A (en) * | 2019-01-15 | 2019-04-16 | 金陵科技学院 | A kind of protein molecule name analysis and recognition methods based on CRFs |
CN109635046B (en) * | 2019-01-15 | 2023-04-18 | 金陵科技学院 | Protein molecule name analysis and identification method based on CRFs |
CN110059320A (en) * | 2019-04-23 | 2019-07-26 | 腾讯科技(深圳)有限公司 | Entity relation extraction method, apparatus, computer equipment and storage medium |
CN112861533A (en) * | 2019-11-26 | 2021-05-28 | 阿里巴巴集团控股有限公司 | Entity word recognition method and device |
CN113051918A (en) * | 2019-12-26 | 2021-06-29 | 北京中科闻歌科技股份有限公司 | Named entity identification method, device, equipment and medium based on ensemble learning |
CN113051918B (en) * | 2019-12-26 | 2024-05-14 | 北京中科闻歌科技股份有限公司 | Named entity recognition method, device, equipment and medium based on ensemble learning |
CN112950170A (en) * | 2020-06-19 | 2021-06-11 | 支付宝(杭州)信息技术有限公司 | Auditing method and device |
CN112651241A (en) * | 2021-01-08 | 2021-04-13 | 昆明理工大学 | Chinese parallel structure automatic identification method based on semi-supervised learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106980609A (en) | Named entity recognition method using a conditional random field based on word vector representation | |
US11501182B2 (en) | Method and apparatus for generating model | |
CN108984526B (en) | Document theme vector extraction method based on deep learning | |
Collobert et al. | Natural language processing (almost) from scratch | |
Abandah et al. | Automatic diacritization of Arabic text using recurrent neural networks | |
CN110807320B (en) | Short text emotion analysis method based on CNN bidirectional GRU attention mechanism | |
CN109508459B (en) | Method for extracting theme and key information from news | |
CN106980608A (en) | A kind of Chinese electronic health record participle and name entity recognition method and system | |
CN112541356B (en) | Method and system for recognizing biomedical named entities | |
CN112487820B (en) | Chinese medical named entity recognition method | |
CN111159405B (en) | Irony detection method based on background knowledge | |
CN114818717B (en) | Chinese named entity recognition method and system integrating vocabulary and syntax information | |
CN118093834B (en) | AIGC large model-based language processing question-answering system and method | |
CN111914556A (en) | Emotion guiding method and system based on emotion semantic transfer map | |
CN114943230A (en) | Chinese specific field entity linking method fusing common knowledge | |
Xu et al. | Sentence segmentation for classical Chinese based on LSTM with radical embedding | |
Ye et al. | Improving cross-domain Chinese word segmentation with word embeddings | |
CN115238693A (en) | Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory | |
CN115510230A (en) | Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism | |
CN112329449B (en) | Emotion analysis method based on emotion dictionary and Transformer | |
CN116522165B (en) | Public opinion text matching system and method based on twin structure | |
Göker et al. | Neural text normalization for turkish social media | |
CN110929006B (en) | Data type question-answering system | |
Mahmoodvand et al. | Semi-supervised approach for Persian word sense disambiguation | |
CN114821563B (en) | Text recognition method based on multi-scale fusion CRNN model |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20170725