CN106980609A

CN106980609A - A kind of name entity recognition method of the condition random field of word-based vector representation

Info

Publication number: CN106980609A
Application number: CN201710169446.0A
Authority: CN
Inventors: 李丽双; 姜宇新; 陈曦; 冯轶然
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2017-03-21
Filing date: 2017-03-21
Publication date: 2017-07-25

Abstract

The invention provides a kind of name entity recognition method of the condition random field of word-based vector representation, belong to natural language processing technique field.Condition random field algorithm of the present invention including word-based vector representation, using incorporating the condition random field algorithm and the online name entity recognition system for providing graphical interaction interface using B/S structure designs that term vector represents.The biomedical name entity of biomedical text progress that can be to be resolved to user using the present invention is identified, the characteristics of identification process has played the semantic expressiveness of term vector, less dependence manual features are participated in, and the problem of condition random field is effective to discrete character representation is solved, and played the advantage of the undirected graph model of this discriminate of condition random field algorithm；The present invention provides the user the service of name entity interaction relation data retrieval；In addition, the present invention has also provided the user the debugging functions to automatic analysis result.

Description

Named entity recognition method based on word vector representation conditional random field

Technical Field

The invention belongs to the field of natural language processing, relates to a method for carrying out high-quality biological named entity recognition on a biomedical text, and particularly relates to a biological named entity recognition method based on the fusion of a Conditional Random Field (CRF) model and a word representation method.

Background

The task of named entity recognition is to recognize words or phrases that appear in text with particular meaning, such as names of people, places, names of organizations, etc. Named Entity Recognition in the Biomedical field, known as biological Named Entity Recognition (Bio-NER), aims at the automatic Recognition and classification of Entity names of a given type, such as proteins, genes, diseases, cells, etc., present in the Biomedical literature using Biomedical text mining techniques. Biological named entity recognition is a key step of biomedical text mining and is a prerequisite for implementing deep text mining technologies such as relationship extraction, hypothesis discovery, and text classification, for example, to obtain relationships between biological entities such as genes, proteins, and diseases, it is necessary to be able to correctly identify these biological entities from the text. The basic processes of the most widely used machine-based learning method at present include: corpus preprocessing, characteristic extraction, model training and prediction. The corpus preprocessing step comprises the operations of the biomedical text, such as case conversion, word segmentation, word drying, word stop and the like. The applied characteristics mainly comprise: core word features, dictionary features, word formation features, morphology features, affix features, part of speech features, chunk features, and the like. The method for constructing the model by machine learning mainly comprises the following steps: hidden Markov Models (HMMs), support vector machine models (SVMs), maximum entropy Models (MEs), Maximum Entropy Markov Models (MEMMs), conditional random field models (CRFs), and the like.

For example, ABNER (http:// pages. cs. wisc. edu/. bsettles/ABNER /) is a standard named entity recognition software tool, the core of which is based on linear chain CRF. The statistical machine learning-based method does not need to manually make rules, has higher robustness, and can identify potential named entities which do not appear in a standard term dictionary. The method comprises the steps of extracting rich feature sets including vocabulary features, formation features, spelling features and the like by Dimililer and the like, performing combined classification by using different parameter SVM classification models, performing weighted voting by using an improved genetic algorithm, and finally enabling the F value of JNLPBA2004 to reach 72.51%. Liao et al (biological negative registration based on skip-chain CRFs, 2011, American Journal of Engineering and Technology Research) used the skip-chain CRF model to perform named entity recognition on the JNLPBA2004 task by considering interdependence information between distant entities, and obtained an F score of 73.2% on the test set of this task. To reduce the cost of manually extracting features, semi-supervised learning is also introduced into the machine learning method. The plum-shaped Yanpeng et al (incorporated Rich Background knowledge Classification and Recognition, 2009, BMC Bioinformatics) extracts useful information from the mass unlabeled data obtained, and then uses it as a feature to improve the effect of supervised learning, and the F value in BioCretive II is 89.05%.

When the conditional random field is used for named entity recognition, the artificial features need to be extracted according to the training corpus. Although the semi-supervised learning method reduces the cost of manually extracting features to a certain extent, specific professional field knowledge is often required when the artificial features are constructed, which is extremely difficult for non-professional field researchers, and meanwhile, new entity nouns cannot be well predicted, and the problem of semantic ambiguity cannot be well solved. In view of the recent progress of word vectors in the natural language field, it can effectively map each word appearing in the corpus into an n-dimensional space, and has a strong ability to represent semantics, i.e., words of the same semantics are positioned relatively close in space. Then, the deep semantic information is obtained by extracting different word vector characteristics from the unlabeled corpus, so that the process of manually extracting the characteristics is avoided. However, the algorithm of the conditional random field is more suitable for discrete feature representation, and how to combine the word vector of continuous real-valued representation with the CRF algorithm for named entity recognition is very challenging.

Disclosure of Invention

The invention provides a named entity recognition method based on a conditional random field represented by a word vector, which firstly solves the problems of high cost, low generalization capability and the like caused by manually extracting features, secondly solves the problem that the conditional random field is only effective to discrete feature representation, and finally improves the accuracy of the named entity recognition by the conditional random field due to the fact that semantic representation of the word vector is integrated.

The invention mainly comprises two parts: 1. and converting the linguistic data into word vectors through deep learning. 2. The conditional random field algorithm is modified so that it can be adapted to the input of a continuous type of real-valued vector.

The technical scheme of the invention is as follows:

a named entity recognition method based on a conditional random field represented by a word vector comprises the following steps:

first, a method of recognizing a named entity in a sentence in the case where there are already trained parameters (the meaning of the state transition weight matrix a having a size of (labelnum +1) × labelnum and the parameter matrix θ, labelnum, d, M having a size of labelnum × M +1) will be shown in the following steps) will be described (the method of training parameters will be described in step (five).

(I) corpus extraction and pretreatment

And converting each word in the corpus to be processed into a d-dimensional word vector (word embedding) by using a Skip-gram language model in a word2vec tool.

(II) marking plan

In the task of named entity recognition, some named entities are represented by one word and some named entities are represented by several words. In order to distinguish the difference in the components represented by a word, the word needs to be labeled, i.e., assigned different labels (tags). The present invention employs the IOBES mark plan to mark the feed.

IOBES mark plan (as shown in table one):

watch 1

Begin	Inside	End	Single	Other
					B	I	E	S	O

For named entities represented by several words: the word representing the beginning of the named entity is labeled with B (begin), the word representing the middle of the named entity is labeled with I (inside), and the word representing the end of the named entity is labeled with E (end).

For a named entity represented by one word: the word representing the named entity is labeled with S (Single). For non-named entities: words that represent non-named entities are labeled with O (other). The number of labels is 5, indicated by labelnum.

(III) calculation of state feature weights from word vectors

The invention is based on a linear chain piece random field model, so that the processing of the corpus is carried out by taking sentences as units. For any sentence (i.e., any sequence of words):

l: the length of the sentence. X ═ X₁,X₂,X₃,……,X_nX denotes a sentence (word sequence), X_iRepresenting the ith word in the sentence. Y is＝{Y₁,Y₂,Y₃,……,Y_nY denotes the corresponding tag sequence of the sentence, Y_iThe labels corresponding to the ith word in the sentence are represented, the values of the labels are I, O, B, E, S,indicating that the label corresponding to the ith word in the sentence is label j, namely Y_i＝label[j]。

1. Computation from word vectors to Feature Matrix

How to obtain good word feature representation is one of the important links for improving the recognition accuracy of the named entity, and because the label of each word is not only related to the word itself, but also closely connected with a plurality of words around the word, except for obtaining the word vector corresponding to each word, the word vector of each word and a plurality of words around the word need to be spliced by using a window method to construct the feature vector of the word.

A window method: the size of the fixed window is determined to be M. For each word X in sentence units_iBy X_i-(M-1)/2,……,X_i,……,X_i+(M-1)/2Sequentially splicing word vectors of M continuous words and then sequentially splicing every word X_iThe method is characterized in that the end of the sentence is added with 1 as a Feature vector of the word, however, the left side and the right side of some words at the beginning and the end of the sentence are not provided with enough (M-1)/2 words, in order to solve the problem of the boundary effect, a word vector of a none, namely a zero vector is used as a filling-up, the same effect is achieved as that of marking by 'start' and 'stop', and each word in the sentence is processed by a window method, so that a Feature Matrix corresponding to the sentence can be obtained, and the size of the Feature Matrix is (d × M +1) × L.

2. Calculation of state feature weights from feature matrices

For any word X, the IOBES markup plan is adopted_i，Y_iThere are five possibilities, and this step will introduce Y_iAt a value of IOBECorresponding state feature weights in different cases of SThe size of (2).

Multiplying the parameter matrix theta with the size of labelnum × (d × M +1) by the feature matrix FeatureMatrix point obtained in the previous step to obtain a matrix mu 'with the size of labelnum × L, and processing each value in mu' by a Hardtach function to finally obtain a state feature weight matrix mu with the size of labelnum × L, wherein the size of the ith column element represents the ith word X in the sentence_iLabel Y of_iIs composed ofThe magnitude of the state characteristic weight of the time, it usesAnd (4) showing.

(IV) evaluating the tag sequence Y to identify the named entity

The named entity is found by evaluating the tag sequence to find all words labeled S and word strings labeled B, I (zero or more I), E combinations. Estimating the label sequence corresponding to the sentence, i.e. finding the label sequence Y under the condition of knowing the sentence X^*Such that when Y ═ Y^*When the conditional probability P (Y | X) reaches the maximum.

First, a state transition weight matrix a having a size of (labelnum +1) × labelnum is introduced.

A: the first labelnum row of a represents one label condition, the last row represents no label condition, and each column of a represents one label condition. A. the_m,nI.e. the mth row and nth column elements of A, which represent X_i-1Corresponding label(the label represented by the m-th line of A) andX_icorresponding labelState transition weights of time. To reflect the word position in the sentence, the state transition weight is symbolizedAnd (4) showing.

1. Introduction potential function exp (∑)_jλ_jt_j(Y_i-1,Y_i,X,i)+∑_kμ_ks_k(Y_i,X,i))

The symbols in the potential function are defined and explained as follows:

j is when X_iJ is more than or equal to 1 and less than or equal to labelnum when the sentence is at the beginning; when X is present_iNot at the beginning of a period, j ≦ 1 ≦ labelnum × labelnum, j being an integer, each different j representing a particular state transition scenario from label p to label q.

k is 1. ltoreq. k.ltoreq.labelnum, k being an integer, each different k representing a particular tag state q.

t_j(Y_i-1,Y_iX, i): the state transition characteristic function at two adjacent mark positions,

s_k(Y_ix, i): the state feature function at sequence position i.

λ_jThe state transition characteristic weight function indicates that the label state transition condition is Y for a specific j_i-1＝lable[p],Y_i＝lable[q]Then, then

μ_kState feature weight function, which indicates for a particular k the tag state case Y_i＝lable[q]Then, then

∑_jλ_jt_j(Y_i-1,Y_iCalculation of X, i): given a sentence sequence X and a corresponding given tag sequence Y, the state transition feature weight λ (Y) of the word at position i and the word preceding it_i-1,Y_i,X_i) I.e. in the state transition weight matrix A

The calculation result of (2): given a sentence sequence X and a corresponding given tag sequence Y, the state feature weight μ (Y) of the word at position i_i,X_i)。

For each word, a potential function of word levelThe sum of the state transition characteristic weight and the state characteristic weight of the word is calculated, and the calculation result can be simply expressed as

2. Conditional probability P (Y | X)

Due to t_j(Y_i-1,Y_iX, i) and s_k(Y_iX, i) are both characteristic functions,let them all be f for convenience of notation_j(Y_i-1,Y_iX, i) while letting λ_jAnd mu_kAre all lambda_jTherefore ∑_jλ_jt_j(Y_i-1,Y_i,X,i)+∑_kμ_ks_k(Y_iX, i) can be represented as ∑_jλ_jf_j(Y_i-1,Y_i,X,i)。

The conditional random field algorithm is to represent the probability that a sentence sequence X corresponds to a particular tag sequence Y by conditional probability.

WhereinIs a potential function at the sentence level. The sentence-level potential function is the sum of the word-level potential functions of each word in the sentence, and as such, it can be simply expressed as

It is the sum of the potential functions at sentence level of all possible tag sequences Y for the sentence to be a normalization factor.

3. Evaluating tag sequence Y to identify named entities

For a particular sentence sequence, Z (X) is a constant, so find the sentence-level potential functionMaximum tag sequence Y^*And (4) finishing.

In calculating Y^*Then, the possible tag sequence combinations of Y increase exponentially with the number of words in the sentence sequence, and if an exhaustive method is adopted to find Y^*The time complexity is extremely high, and the calculation cannot be finished, so that the method is solved by using a dynamic optimization method such as Viterbi and the like.

(V) parameter training

For convenience of description, the parameter state transition weight matrix a and the parameter matrix θ are both referred to asP (Y | X) is expressed asIn the training set, each sentence sequence X has a correct tag sequence Y corresponding to it, and the correct tag sequence Y is called Y'.

The invention uses a random gradient descent method to adjust parametersSuch that, for each X in the training set, when Y ═ Y',the corresponding log-likelihood function value reaches a maximum.

1、Corresponding log-likelihood function:

when the random gradient descent is used for parameter adjustment, the parameter adjustment must be related toIs calculated, therefore despiteIs constant but still requires calculation. For convenience of notation, the logadd operation is defined: thus:

because the possible label sequence combinations of Y are exponentially increased along with the number of words in the sentence sequence, when the length of the sentence sequence reaches 10, the possible label sequence combinations of Y are nearly ten million, so the exhaustive Y calculationIt is impossible that the present invention employs a classical recursive method such that its computation speed is linearly related to the length n of the sentence sequence.

Let k, m denote any tag, and t denote the position of the sequence. The following formula is defined to facilitate the calculation:

thus, is composed ofThe result can be calculated by recursive calculation.

2. Parameter adjustment using stochastic gradient descent

The random gradient method adopted by the invention is an iterative algorithm. Since the positive gradient direction is the direction that increases the function value the fastest, at each step of the iteration, the parameters are updated in the positive gradient directionThe value of (a) is the fastest, so a random gradient method is adopted. By randomly selecting an example (X, Y), the parameters are updatedSuch that the maximization of the objective function is iterated until convergence. The iterative formula of the gradient update is:

where lambda is the learning rate of the selection,is the search direction. Thus, the derivative can be calculated by the differential chain rule to find the parameter at which the objective function is maximized

The invention constructs a set of conditional random field method based on word vector representation to complete the on-line system of biomedicine named entity recognition, and provides real-time query service for researchers. The biomedicine named entity recognition is one of important branches of biomedicine text mining, has high application value and provides a prerequisite for subsequent tasks such as relationship extraction information retrieval and the like. The invention improves the expression capability and generalization capability of the features on the basis of the traditional method, solves the problem that the conditional random field only expresses discrete features effectively, can help researchers in the biomedical field to automatically analyze texts, provides the function of searching known biomedical named entities and helps the researchers to research and analyze the biomedical named entities.

Detailed Description

The following further describes the specific embodiments of the present invention in combination with the technical solutions.

The system can perform high-quality gene name recognition on a given biomedical text, avoids the problems of high cost, low generalization capability and the like caused by extracting artificial features, improves the level of biomedical text recognition, and is simple and convenient for a user to operate. The system adopts a B/S (Browser/Server, Browser/Server mode, mainly realized by JSP, HTML, JS and other technologies) structural design and is divided into a view layer, a logic layer and a data layer.

1. User input text to be parsed

As shown in table 1, the text input supports two modes, namely, keyboard input and local file uploading, and the view layer receives the text to be retrieved input by the user, submits the text to the logic layer, and stores the text in the data layer. Assuming that a text to be analyzed by a user is "We findthat hTAFIII 32is the human homologue of Drosophila TAFII 40", the user can select 1, directly input the text or 2 through a page text box, store the text in a format of txt, doc and the like, and upload the text in a file form. The former is suitable for short text or test, and the latter is suitable for large text processing.

TABLE 2 System architecture

2. The system analyzes the text to be analyzed

(1) After preprocessing such as sentence breaking and word segmentation of a text to be analyzed, the logic layer decomposes the text into a sentence (including punctuations) with 12 tokens; a word vector table is trained through word2vec, text input by a user is compared with the word vector table, and a corresponding word vector is found for each word. As described earlier, it is converted into 12 vectors using a sliding window and the modified conditional random field based on the word vector representation is input in turn.

(2) By utilizing a named entity recognition method based on a conditional random field represented by a word vector, an optimal tag sequence 'O OO B O OOO B I O' is found for each sentence after gradual calculation, namely, the biomedical named entities 'hTAFIII 32' and 'Drosophila TAFII 40' are recognized. Thereby finding the named entity. The result is directly obtained by using the trained parameters without training during analysis.

3. Manual verification of results by a user

After the user submits the data, the system allows the user to correct the result if there is an obvious error by obtaining an optimal sequence of labels for the text and comparing the result with the correct result, and stores the corrected result in the database and re-debugs the parameters until the result is optimal for comparison with the correct result.

Claims

1. A named entity recognition method based on a conditional random field represented by a word vector is characterized by comprising the following steps:

(I) corpus extraction and pretreatment

Converting each word in the corpus to be processed into a d-dimensional word vector by using a Skip-gram language model in a word2vec tool;

(II) marking plan

Assigning different labels to words, and marking the material by adopting an IOBES marking plan;

IOBES mark plan:

watch 1

Begin Inside End Single Other B I E S O

For named entities represented by several words: marking a word which represents the beginning of the named entity by using a B, marking a word which represents the middle of the named entity by using an I, and marking a word which represents the end of the named entity by using an E;

for a named entity represented by one word: marking words representing the named entity by using S;

for non-named entities: marking words representing non-named entities with O; the number of labels is 5, and is represented by labelnum;

(III) calculation of state feature weights from word vectors

The named entity recognition method is based on a linear chain piece random field model, and the processing of the corpus is carried out by taking sentences as units; for any sentence, i.e. any sequence of words:

l is the length of the sentence; x ═ X₁,X₂,X₃,……,X_nX denotes a sentence, i.e. a sequence of words, X_iRepresenting the ith word in the sentence; y ═ Y₁,Y₂,Y₃,……,Y_nY denotes the corresponding tag sequence of the sentence, Y_iThe labels corresponding to the ith word in the sentence are represented, the values of the labels are I, O, B, E, S,indicating that the label corresponding to the ith word in the sentence is label j, namely Y_i＝label[j]；

1. Computation from word vectors to Feature Matrix

Splicing each word with word vectors of a plurality of words around the word by using a window method to construct a feature vector of the word;

a window method: determining the size of the fixed window as M, and taking a sentence as a unit and X for each word_iBy X_i-(M-1)/2,……,X_i,……,X_i+(M-1)/2Sequentially splicing word vectors of M continuous words and then sequentially splicing every word X_iAdding 1 to the end of the sentence as the feature vector of the word, however, if the left and right sides of some words at the beginning and the end of the sentence do not have enough adjacent (M-1)/2 words, using the word vector of none, namely zero vector as the filling, which has the same effect as the marking by 'start' and 'stop';

2. calculation of state Feature weights from Feature Matrix

For any word X, the IOBES markup plan is adopted_i，Y_iThere are five possibilities, and this step will introduce Y_iCorresponding state feature weights in different instances of values of IOBESThe size of (d);

obtaining the parameter matrix theta with the size of labelnum × (d × M +1) and the previous stepMultiplying the feature matrix FeatureMatrix point to obtain a matrix mu 'with the size of labelnum × L, processing each numerical value in the mu' by a Hardtach function to finally obtain a state feature weight matrix mu with the size of labelnum × L, wherein the sizes of the jth row and ith column elements in the mu represent the ith word X in the sentence_iLabel Y of_iIs composed ofThe magnitude of the state characteristic weight of the timeRepresents;

(IV) evaluating the tag sequence Y to identify the named entity

Evaluating the tag sequence to find all words marked as S and word strings marked as B, zero or more I, E combinations, and then finding the named entity; estimating the label sequence corresponding to the sentence, i.e. finding the label sequence Y under the condition of knowing the sentence X^*Such that when Y ═ Y^*Then, the conditional probability P (Y | X) reaches the maximum;

first, a state transition weight matrix A having a size of (labelnum +1) × labelnum is introduced

A: the first labelnum row of A respectively represents a label condition, the last row represents a no-label condition, each column of A respectively represents a label condition, A_m,nI.e. the mth row and nth column elements of A, which represent X_i-1Corresponding labelAnd X_iCorresponding labelState transition weight of time; to reflect the word position in the sentence, the state transition weight is symbolizedRepresents;

1. potential function exp (∑)_jλ_jt_j(Y_i-1,Y_i,X,i)+∑_kμ_ks_k(Y_i,X,i))

The symbols in the potential function are defined and explained as follows:

j is when X_iJ is more than or equal to 1 and less than or equal to labelnum when the sentence is at the beginning; when X is present_iJ is not less than 1 and less than labelnum × labelnum when not at the beginning of the period, wherein j is an integer, and each different j represents a specific state transition condition from the label p to the label q;

k is 1. ltoreq. k. ltoreq. labelnum, k being an integer, each different k representing a particular tag state q;

t_{j} (Y_{i - 1}, Y_{i}, X, i) = \{\begin{matrix} 1, & \begin{matrix} i f & Y_{i - 1} = l a b l e [p], Y_{i} = l a b l e [q]; \end{matrix} \\ 0, & o t h e r w i s e; \end{matrix}

s_k(Y_ix, i): the state feature function at sequence position i,

s_{k} (Y_{i}, X, i) = \{\begin{matrix} 1, & \begin{matrix} i f & Y_{i} = l a b l e [q]; \end{matrix} \\ 0, & o t h e r w i s e; \end{matrix}

∑_kμ_ks_k(Y_iCalculation of X, i): given a sentence sequence X and a corresponding given tag sequence Y, the state feature weight μ (Y) of the word at position i_i,X_i)；

For each word, the potential function exp at word level (∑)_jλ_jt_j(Y_i-1,Y_i,X,i)+∑_kμ_ks_k(Y_iX, i)) is the sum of the state transition feature weight and the state feature weight of the calculated word, and is simply expressed as

2. Conditional probability P (Y | X)

Due to t_j(Y_i-1,Y_iX, i) and s_k(Y_iX, i) are all characteristic functions, making them f_j(Y_i-1,Y_iX, i) while letting λ_jAnd mu_kAre all lambda_jThen ∑_jλ_jt_j(Y_i-1,Y_i,X,i)+∑_kμ_ks_k(Y_iX, i) is represented by ∑_jλ_jf_j(Y_i-1,Y_i,X,i)；

The conditional random field algorithm expresses the possibility that a sentence sequence X corresponds to a specific label sequence Y through conditional probability;

\begin{matrix} P (Y | X) = \frac{1}{Z (x)} \exp (Σ_{i = 1}^{n} (Σ_{j} λ_{j} t_{j} (Y_{i - 1}, Y_{i}, X, i) + Σ_{k} μ_{k} s_{k} (Y_{i}, X, i))) \\ = \frac{1}{Z (x)} \exp (Σ_{i = 1}^{n} Σ_{j} λ_{j} t_{j} (Y_{i - 1}, Y_{i}, X, i)) \end{matrix}

wherein,is a potential function at sentence level, which is the sum of the potential functions at word level of each word in the sentence, and is simply expressed as

The normalization factor is the sum of potential functions of sentence levels of all possible label sequences Y corresponding to sentences;

3. evaluating tag sequence Y to identify named entities

For a particular sentence sequence, Z (X) is a constant, so find the sentence-level potential functionMaximum tag sequence Y^*Then the method is finished;

\begin{matrix} Y^{*} = \underset{Y}{\arg \max} P (Y | X) \\ = \underset{Y}{\arg \max} \frac{1}{Z (x)} \exp (Σ_{i = 1}^{n} \underset{j}{Σ} λ_{j} f_{j} (Y_{i - 1}, Y_{i}, X, i)) \\ = \underset{Y}{\arg \max} Σ_{i = 1}^{n} Σ_{j} λ_{j} f_{j} (Y_{i - 1}, Y_{i}, X, i) \end{matrix}

(V) parameter training

The parameter state transition weight matrix A and the parameter matrix theta are both referred to asP (Y | X) is expressed asIn the training set, each sentence sequence X has a correct tag sequence Y corresponding to the sentence sequence X, and the correct tag sequence Y is called Y';

the named entity identification method adjusts parameters by using a random gradient descent methodSuch that, for each X in the training set, when Y ═ Y',the corresponding log-likelihood function value reaches the maximum;

1.corresponding log-likelihood function:

\begin{matrix} \log P (Y | X, \tilde{θ}) = \log (\frac{1}{Z (x)} \exp (Σ_{i = 1}^{n} \underset{j}{Σ} λ_{j} f_{j} (Y_{i - 1}, Y_{i}, X, i)) \\ = Σ_{i = 1}^{n} \underset{j}{Σ} λ_{j} f_{j} (Y_{i - 1}, Y_{i}, X, i) - \log (Z (X)) \\ = Σ_{i = 1}^{n} \underset{j}{Σ} λ_{j} f_{j} (Y_{i - 1}, Y_{i}, X, i) - \log (Σ_{Y} \exp (Σ_{i = 1}^{n} \underset{j}{Σ} λ_{j} f_{j} (Y_{i - 1}, Y_{i}, X, i))) \end{matrix}

when the parameter is adjusted by using the random gradient descent, the method involvesAlthough calculating, althoughIs constant but still requires computation, and for convenience of notation, defines the logadd operation:therefore, the temperature of the molten metal is controlled,

\log P (Y | X, \tilde{θ}) = Σ_{i = 1}^{n} \underset{j}{Σ} λ_{j} f_{j} (Y_{i - 1}, Y_{i}, X, i) - \underset{Y}{\log a d d} Σ_{i = 1}^{n} \underset{j}{Σ} λ_{j} f_{j} (Y_{i - 1}, Y_{i}, X, i))

let k, m denote any tag, t denote the position of the sequence, and define the following formula for ease of calculation:

\begin{matrix} δ_{t} (k) = \underset{(Y \cap Y_{t} = k)}{\log a d d} Σ_{i = 1}^{t} (\underset{j}{Σ} λ_{j} f_{j} (Y_{i - 1}, Y_{i}, X, i)) \\ = \underset{(Y \cap Y_{t} = k)}{\log a d d} Σ_{i = 1}^{t} (A_{Y_{i - 1}, Y_{i}} + μ (Y_{i}, X_{i})) \\ = \underset{m}{\log a d d} \underset{(Y \cap Y_{t} = k \cap Y_{t - 1} = m)}{\log a d d} Σ_{i = 1}^{t} (A_{Y_{i - 1}, Y_{i}} + μ (Y_{i}, X_{i})) \\ = \underset{m}{\log a d d} \underset{(Y \cap Y_{t} = k \cap Y_{t - 1} = m)}{\log a d d} (Σ_{i = 1}^{t - 1} (A_{Y_{i - 1}, Y_{i}} + μ (Y_{i}, X_{i})) + A_{m, k} + μ (k, X_{t})) \\ = μ (k, X_{t}) + \underset{m}{\log a d d} \underset{(Y \cap Y_{t} = k \cap Y_{t - 1} = m)}{\log a d d} (Σ_{i = 1}^{t - 1} (A_{Y_{i - 1}, Y_{i}} + μ (Y_{i}, X_{i})) + A_{m, k}) \\ = μ (k, X_{t}) + \underset{m}{\log a d d} (δ_{t - 1} (k) + A_{m, k}), &ForAll; k, m \end{matrix}

thus, is composed ofObtaining a result by recursive calculation;

2. parameter adjustment using stochastic gradient descent

The named entity identification method adopts a random gradient method, and is an iterative algorithm; by randomly selecting an example (X, Y), the parameters are updatedSuch that the target function is maximized and iterated continuously until convergence; the iterative formula of the gradient update is:

\tilde{θ} &LeftArrow; \tilde{θ} + λ \frac{\partial \log P (y | x, \tilde{θ})}{\partial \tilde{θ}}

where lambda is the learning rate of the selection,is the search direction; therefore, the derivative is calculated by the differential chain rule to find the parameter when the objective function is maximized