CN112966516A - Medical named entity identification method based on improved random average gradient descent - Google Patents


Info

Publication number
CN112966516A
CN112966516A
Authority
CN
China
Prior art keywords
gradient descent
parameter
rollback
random
weight
Prior art date
Legal status
Pending
Application number
CN202110435549.3A
Other languages
Chinese (zh)
Inventor
陈观林
程钊
杨武剑
翁文勇
李甜
Current Assignee
Hangzhou City University
Original Assignee
Hangzhou City University
Priority date
Filing date
Publication date
Application filed by Hangzhou City University filed Critical Hangzhou City University
Priority to CN202110435549.3A priority Critical patent/CN112966516A/en
Publication of CN112966516A publication Critical patent/CN112966516A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods


Abstract

The invention relates to a medical named entity identification method based on improved random average gradient descent, which comprises the following steps: receiving a medical unstructured text to be labeled, and performing data preprocessing to obtain labeled data; establishing an AWD-LSTM model according to the improved random average gradient descent, inputting the preprocessed data into the AWD-LSTM model for training, and obtaining a labeler; and carrying out named entity labeling on the medical unstructured text to be labeled by utilizing the trained labeler. The invention has the beneficial effects that: the invention indirectly influences the gradient value by changing the value of the iteration round number, and rolls back the key parameters of the random average gradient descent optimization algorithm according to a certain rule. This changes the shrinkage rate of the optimization algorithm, allowing it to jump out of local optima and reach a better value, without increasing the training time.

Description

Medical named entity identification method based on improved random average gradient descent
Technical Field
The invention relates to the field of natural language processing and the technical field of deep learning, and in particular to a medical named entity identification method based on improved random average gradient descent.
Background
Natural Language Processing (NLP) is a branch of artificial intelligence and linguistics, and one of the most difficult problems in artificial intelligence. Natural language processing refers to the processing by computer of information such as the form, sound, and meaning of natural language, that is, operations for inputting, outputting, recognizing, analyzing, understanding, and generating words, sentences, and texts. It has a significant impact on human-computer interaction. The basic tasks of natural language processing include speech recognition, information retrieval, question-answering systems, machine translation and the like; recurrent neural networks and naive Bayes are common models for natural language processing. With the progress of deep learning in many fields, natural language processing has made great breakthroughs.
Named Entity Recognition (NER) is a basic task in the field of NLP, and an important basic tool for most NLP tasks such as question-answering systems, machine translation, and syntactic analysis. Previous approaches were primarily dictionary-based and rule-based. The dictionary-based method performs fuzzy search or complete matching on character strings, but the quality and size of the dictionary are limited as new entity names continuously emerge. The rule-based method manually specifies rules and expands the rule set with features of entity names and common phrase collocations, but it consumes huge human resources and time; the rules are generally effective only in a specific field, the cost of manual migration is high, and rule portability is weak. For named entity recognition, machine learning methods are now mostly adopted: model training is continuously optimized, and the trained model shows better performance in test evaluation. Currently, the most widely applied models include Hidden Markov Models (HMMs), Support Vector Machines (SVMs), Maximum Entropy Markov Models (MEMMs), Conditional Random Fields (CRFs), and the like. The conditional random field model can effectively handle the influence of adjacent labels on the predicted sequence, so it is widely applied to entity recognition with good effect. At present, deep learning algorithms are generally adopted for sequence labeling. Compared with traditional algorithms, deep learning eliminates the step of manually extracting features and can effectively extract discriminative features.
After the foreign scholar Merity put forward the AWD-LSTM model, many language models based on it achieved good results on named entity recognition. These models are trained first and then re-trained, a step called "fine-tuning". However, the deep learning model targets non-convex functions and has many parameters, which makes the training process rather difficult. At present, a random average gradient descent optimization algorithm is adopted in model training: several samples are randomly selected from the whole sample set to form a batch, the average gradient of the batch is recorded, and the mean of these average gradients over all time steps is taken as the gradient estimate of the whole sample for training. However, this method has the defect that the model parameters may fail to converge, so the optimal solution is easily missed and model training is incomplete.
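As an illustration of the averaging idea just described (not the patent's improved algorithm), a minimal averaged-SGD loop might look like the following sketch; the toy quadratic objective, the noise model, and all names are our own assumptions:

```python
import numpy as np

def asgd(grad_fn, w0, lr=0.5, steps=200, t0=100):
    """Plain averaged SGD: run SGD and, from step t0 on, keep a running
    mean of the iterates as the returned estimate."""
    w = np.asarray(w0, dtype=float)
    avg, n_avg = np.zeros_like(w), 0
    for t in range(steps):
        w = w - lr * grad_fn(w)       # ordinary stochastic gradient step
        if t >= t0:                   # start averaging at time t0
            n_avg += 1
            avg += (w - avg) / n_avg  # incremental running mean
    return avg

# Noisy gradient of f(w) = 0.5 * ||w||^2 as a toy "stochastic" objective.
rng = np.random.default_rng(0)
grad = lambda w: w + 0.1 * rng.standard_normal(w.shape)
w_star = asgd(grad, [5.0, -3.0])
```

Averaging the late iterates damps the gradient noise, which is why the averaged estimate lands much closer to the optimum than any single noisy iterate.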
In recent years, in the biomedical field, literature resources have multiplied every year, and most of this information is stored in the form of unstructured text. Biomedical named entity recognition aims to convert unstructured text into structured text, and to recognize and classify specific entity names such as genes, proteins and diseases in biomedical text. At present, how to quickly and efficiently retrieve relevant information from such huge data is a great challenge.
In the gradient descent optimization method based on a hybrid strategy disclosed in patent No. 2020109668396, the optimizer is first set to the Adam optimization algorithm and the gradient descent process of Adam is computed; when a conversion criterion is satisfied, Adam is switched to the SGDM optimization algorithm, the learning rate of SGDM after the switch is determined according to a scaling rule, and the gradient descent process of SGDM is computed until a convergence condition is reached. Mixing the two optimization algorithms can achieve a certain effect, but the method does not perform well when training is restarted and continued.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a random average gradient descent method based on a parameter rollback mechanism.
The medical named entity identification method based on improved random average gradient descent comprises the following steps:
step 1, receiving a medical unstructured text to be labeled, and performing data preprocessing to obtain labeled data; for the received medical unstructured text to be labeled, firstly, simple named entity labeling is carried out on a part of the medical unstructured text by using a rule-based and dictionary-based mode, then, high-quality training data are obtained by using a manual labeling method and a data enhancement technology, and linguistic data are provided for later training of a named entity recognition model;
step 2, establishing an AWD-LSTM model according to the improved random average gradient descent, inputting the preprocessed data into the AWD-LSTM model for training, and obtaining a labeler;
step 2.1, initializing parameters of the medical named entity identification method based on improved random average gradient descent; the parameters are divided into hyper-parameters and ordinary parameters, and the hyper-parameters comprise: default learning rate lr, attenuation term λ, index α of the iterative learning rate update, time point t_0 at which gradient averaging starts, weight attenuation term, parameter rollback vector s, parameter rollback level bl, and default parameter rollback size bds; the ordinary parameters comprise: iteration round number step, iterative learning rate η, weight update parameter μ, and rollback number count b;
step 2.2, dynamically reducing the weights through a weight attenuation operation to prevent model overfitting, wherein the weight attenuation update function is:

$$\nabla f_t(w_t, x_t^j) \leftarrow \nabla f_t(w_t, x_t^j) + \text{weight\_decay} \cdot w_t$$

in the above formula, $w_t$ represents the random gradient descent weight vector at time t, $x_t^j$ represents a randomly selected sample at time t, and weight_decay represents the weight attenuation; using this L2 regularization term, a smaller weight vector $w_t$ is obtained and the complexity of the network is reduced, so that the problem of model overfitting is effectively avoided;
step 2.3, calculating a random gradient descent weight vector according to a random gradient descent update function, wherein the random gradient descent update function is as follows:
$$w_{t+1} = w_t - \eta \nabla f_t(w_t, x_t^j)$$

in the above formula, $w_t$ represents the random gradient descent weight vector at time t; $w_{t+1}$ represents the random gradient descent weight vector at time t+1; $\eta$ represents the iterative learning rate; $x_t^j$ represents a randomly selected sample; and $\nabla f_t(w_t, x_t^j)$ represents the gradient of the loss function at time t;
step 2.4, the random gradient descent weight vector obtained in the step 2.3 is converted into a random average gradient descent weight vector through a random average gradient descent updating function; the random mean gradient descent update function is:
$$\bar{w}_{t+1} = \bar{w}_t + \mu_t (w_t - \bar{w}_t)$$

in the above formula, $\bar{w}_t$ represents the random average gradient descent weight vector at time t, $\bar{w}_{t+1}$ represents the random average gradient descent weight vector at time t+1, $\mu_t$ represents the weight update parameter at time t, and $w_t$ represents the random gradient descent weight vector at time t;
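The updates in steps 2.2-2.4 can be sketched together as one inner iteration. This is a hedged reconstruction from the formulas above; the function name and the choice to average against the freshly updated weight (as PyTorch-style ASGD does) are our own assumptions:

```python
import numpy as np

def sag_inner_step(w, w_avg, grad, eta, mu, weight_decay=1.2e-6):
    """One inner iteration: L2 weight decay (step 2.2), stochastic
    gradient step (step 2.3), running average of iterates (step 2.4)."""
    grad = grad + weight_decay * w              # step 2.2: L2 penalty
    w_next = w - eta * grad                     # step 2.3: SGD update
    w_avg_next = w_avg + mu * (w_next - w_avg)  # step 2.4: averaging
    return w_next, w_avg_next

w = np.array([1.0, -2.0])
w_next, w_avg = sag_inner_step(w, w.copy(), grad=np.array([0.5, 0.5]),
                               eta=0.1, mu=1.0)
```

With μ = 1 (its stated initial value) the averaged vector simply tracks the current iterate; averaging only takes hold once μ shrinks in step 2.7.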
step 2.5, judging whether the value of the key step length parameter meets the condition requirement of rollback or not, and judging whether to perform parameter rollback operation or not by setting two judgments;
step 2.6, when the two judgment results of step 2.5 both meet the parameter rollback condition and the parameter rollback operation can be performed at the moment, performing the parameter rollback operation on the key step length parameter with the decreased random average gradient: updating the value of iteration round number step and the value of rollback number count b; the iteration round number step is updated as:
step=max(step//bl,bds*b)
in the above formula, step is the number of iteration rounds, bl is the parameter rollback level, bds is the default parameter rollback size, and b is the rollback number count; the rollback number count is updated to:
b=b*2
in the above formula, b is the rollback number count;
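The two update rules of step 2.6 transcribe directly into code, with the rollback level and default rollback size taking the stated defaults:

```python
def rollback(step, b, bl=10, bds=10000):
    """Parameter rollback of step 2.6: shrink the iteration round number
    but floor it at bds * b, then double the rollback number count so
    each later rollback perturbs the gradient less."""
    step = max(step // bl, bds * b)
    return step, b * 2

# e.g. a first rollback at iteration round 250000:
step, b = rollback(250000, b=1)
```

The `bds * b` floor is what keeps repeated rollbacks from shrinking `step` (and thus inflating the learning rate) without bound.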
step 2.7, updating the iterative learning rate, the weight updating parameter and the sample parameter, and increasing the count of the iterative round step counter by 1;
step 2.8, repeating the steps 2.1 to 2.7 until the AWD-LSTM model meets a preset training termination criterion;
and 3, carrying out named entity labeling on the medical unstructured text to be labeled by using the trained labeler.
Preferably, step 2.5 specifically comprises the steps of:
step 2.5.1, first round judgment: judging whether the key step length parameter is evenly divisible by the default value; if the iteration round number step is evenly divisible by the default value, carrying out the second round judgment in step 2.5.2; if the iteration round number step is not evenly divisible by the default value, the iteration round number step and the rollback number count b are not updated;
step 2.5.2, second round judgment: judging whether the system random number is smaller than the quotient of the parameter rollback vector s and the rollback number count b; if the system random number is smaller than the quotient of the parameter rollback vector s and the rollback number count b, executing step 2.6; otherwise, the iteration step and the rollback number count b are not updated.
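The two-round judgment can be sketched as a single predicate. The 1000-iteration interval follows the stated default, and a seeded generator stands in for the system random number; this is a sketch under those assumptions, not the patent's exact implementation:

```python
import random

def should_rollback(step, b, s=0.02, interval=1000, rng=random):
    """Two-round check of step 2.5: the first round fires only when the
    iteration round number is evenly divisible by the default interval;
    the second round fires with probability s / b."""
    if step == 0 or step % interval != 0:   # round 1: divisibility check
        return False
    return rng.random() < s / b             # round 2: random number vs s/b

rng = random.Random(7)  # seed plays the role of the random-number-seed hyperparameter
decisions = [should_rollback(t, b=1, rng=rng) for t in (999, 1000, 2000)]
```

Because b doubles after every rollback, the firing probability s/b halves each time, so rollbacks become progressively rarer as training proceeds.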
Preferably, the specific operations of updating the iterative learning rate, the weight update parameter and the sample parameter in step 2.7 are as follows:
the iterative learning rate η is updated as:
$$\eta_{t+1} = \frac{lr}{(1 + \lambda \cdot lr \cdot step)^{\alpha}}$$

in the above formula, $\eta_{t+1}$ represents the iterative learning rate at time t+1, lr represents the default learning rate, $\lambda$ represents the attenuation term, step is the iteration round number, and $\alpha$ represents the update index of $\eta$;
updating the weight update parameter to:
$$\mu_{t+1} = \frac{1}{\max(1,\; t+1-t_0)}$$

in the above formula, $\mu_{t+1}$ represents the weight update parameter at time t+1, t represents time t, and $t_0$ represents the time point at which the random average gradient descent starts;
the sample parameters are updated as:
$$x_{t+1}^j = (1 - \lambda\eta)\, x_t^j - \eta \nabla f_t(w_t, x_t^j)$$

in the above formula, $x_t^j$ represents a randomly selected sample at time t, $x_{t+1}^j$ represents a randomly selected sample at time t+1, $\lambda$ represents the attenuation term, $\eta$ represents the iterative learning rate, and $\nabla f_t(w_t, x_t^j)$ represents the gradient of the loss function at time t.
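Under the reading above (the schedule coincides with the one PyTorch's ASGD optimizer uses; treat the exact forms as an assumption), the step-2.7 updates can be sketched as:

```python
def update_schedule(lr, lam, alpha, step, t, t0):
    """Step 2.7 schedule: decayed iterative learning rate, and the
    averaging weight that starts shrinking once t passes t0."""
    eta = lr / (1.0 + lam * lr * step) ** alpha   # eta_{t+1}
    mu = 1.0 / max(1, t + 1 - t0)                 # mu_{t+1}
    return eta, mu

# With the stated defaults (lr=30, lam=1e-4, alpha=0.75, t0=0):
eta0, mu0 = update_schedule(30, 1e-4, 0.75, step=0, t=0, t0=0)
eta1, mu1 = update_schedule(30, 1e-4, 0.75, step=1000, t=1000, t0=0)
```

Rolling `step` back to a smaller value therefore raises η again, which is precisely how the rollback mechanism re-injects movement and lets the algorithm escape a local optimum.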
Preferably, at initialization in step 2.1: the default learning rate lr is set to 20-30, the attenuation term λ is set to 0, the index α of the iterative learning rate update is set to 0.75, the time point t_0 at which gradient averaging starts is set to 0, the weight attenuation term is set to 1.2e-6, the parameter rollback vector s is set to 0.02, the parameter rollback level bl is set to 10, and the default parameter rollback size bds is set to 10000.
Preferably, the initial value of the weight update parameter μ_t at time t in step 2.4 is 1; in step 2.7, the default value of the attenuation term λ is e-4, and the default value of t_0 is e6.
Preferably, a random number seed value is also set in step 2.1, and the value of the systematic random number in step 2.5.2 is determined by the random number seed value.
Preferably, the key step size parameter in step 2.5 and step 2.6 is the iteration round step.
Preferably, the default value by which the iteration round number step must be evenly divisible in step 2.5.1 is set to 1000.
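For reference, the stated defaults can be collected in one place. The dict itself is our own convenience, and we read the printed "e-4"/"e6" defaults as 1e-4 and 1e6 (they coincide with PyTorch's ASGD defaults), which is an assumption; note the description gives λ = 0 at initialization but e-4 as the step-2.7 default:

```python
# Hyperparameter defaults as stated in the description; names follow the text.
DEFAULTS = {
    "lr": 30,               # default learning rate (stated range: 20 to 30)
    "lambda_init": 0.0,     # attenuation term at initialization (step 2.1)
    "lambda_default": 1e-4, # attenuation term default in step 2.7 ("e-4")
    "alpha": 0.75,          # index of the iterative learning rate update
    "t0": 0,                # gradient-averaging start time at initialization
    "t0_default": 1e6,      # t0 default in step 2.7 ("e6")
    "weight_decay": 1.2e-6, # weight attenuation term (L2 penalty)
    "s": 0.02,              # parameter rollback vector
    "bl": 10,               # parameter rollback level
    "bds": 10000,           # default parameter rollback size
    "interval": 1000,       # divisor used in the first-round judgment
}
```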
The invention has the beneficial effects that:
the invention indirectly influences the gradient value by changing the value of the iteration round number step, and rolls back the key parameter (iteration round number step) of the random average gradient descent optimization algorithm according to a certain rule, thereby achieving the effect of changing the shrinkage rate of the random average gradient descent optimization algorithm, achieving the purpose of jumping out of the local optimum of the random average gradient descent optimization algorithm, obtaining a better value, and not increasing the training time.
Drawings
FIG. 1 is a flow chart of a medical named entity identification method based on improved stochastic mean gradient descent;
FIG. 2 is a flow chart of a random average gradient descent method based on a parameter rollback mechanism;
FIG. 3 is a graph of the experimental results of the AWD-LSTM model using different parameter rollback vectors in the Penn Treebank dataset;
FIG. 4 is a graph of the experimental results of the MoS-AWD-LSTM model using different parameter rollback vectors on the Penn Treebank dataset.
Detailed Description
The present invention will be further described with reference to the following examples. The following examples are set forth merely to aid in the understanding of the invention. It should be noted that, for a person skilled in the art, several modifications can be made to the invention without departing from the principle of the invention, and these modifications and modifications also fall within the protection scope of the claims of the present invention.
As an embodiment, starting from the AWD-LSTM model proposed by the foreign scholar Merity, the original random average gradient descent optimization algorithm is replaced with a random average gradient descent optimization algorithm based on a parameter rollback mechanism, and the fine-tuning step required in the later training period of the original model is eliminated. As shown in fig. 1, the method mainly comprises the following steps:
step 1, receiving a medical unstructured text to be labeled, and performing data preprocessing to obtain labeled data:
for the received medical unstructured text to be labeled, firstly, simple named entity labeling is carried out on a part of the medical unstructured text by using a rule-based and dictionary-based mode, then, a manual labeling method and a data enhancement technology are utilized to obtain enough high-quality training data, and linguistic data are provided for later training of a named entity recognition model.
Step 2, establishing an AWD-LSTM model according to the improved random average gradient descent, inputting the preprocessed data into the model for training, and obtaining a labeler;
compared with the existing random average gradient descent algorithm, the method carries out certain regular constraint on the step length parameter in the existing random average gradient descent algorithm, indirectly influences the shrinkage rate of the algorithm by constraining the step length parameter, finally achieves the capability of enabling the algorithm to cross over the local optimum in the later period, and can find a more optimal solution.
Referring to fig. 2, the specific steps of building the AWD-LSTM model by improving the random mean gradient descent include:
step 2.1, initializing the parameters of the random average gradient descent optimization algorithm based on the parameter rollback mechanism: the optimization algorithm provided by the invention has eight hyper-parameters which are respectively a default learning rate lr, an attenuation term lambda, an index alpha of eta update and a time point t of starting gradient average0Weight _ decay (penalized by L2), parameter rollback vector s, parameter rollback level bl and default parameter rollback size bds, and four common parameters, iteration round number step, iteration learning rate eta, weight updating parameter mu and rollback number count b. Through experiments, the default learning rate lr value is set to be 20 to 30, and the attenuation term lambda value is set to be e-4The exponent α value of η update is set to 0.75 and the mean time point t of the gradient is started0The value is set to 0 and the weight decay (L2 penalty) weight _ decay value is set to 1.2e-6When the value of the parameter rollback vector s is set to be 0.02, the value of the parameter rollback level bl is set to be 10, and the value of the default parameter rollback size bds is set to be 10000, good effects can be achieved on different models and data sets.
Step 2.2, weight attenuation operation is carried out, so that the weight can be dynamically reduced, overfitting of the model is prevented, and a weight attenuation updating function is as follows:
$$\nabla f_t(w_t, x_t^j) \leftarrow \nabla f_t(w_t, x_t^j) + \text{weight\_decay} \cdot w_t$$

wherein $w_t$ represents the random gradient descent weight vector at time t, $x_t^j$ represents a randomly selected sample at time t, and weight_decay represents the weight attenuation. Using this L2 regularization term, a smaller weight vector $w_t$ is obtained and the complexity of the network is reduced, so that the problem of model overfitting is effectively avoided.
Step 2.3, calculating a random gradient descent weight vector, wherein a random gradient descent updating function is as follows:
$$w_{t+1} = w_t - \eta \nabla f_t(w_t, x_t^j)$$

wherein $w_t$ represents the random gradient descent weight vector at time t, $\eta$ represents the iterative learning rate, $x_t^j$ represents a randomly selected sample, and $\nabla f_t(w_t, x_t^j)$ represents the gradient of the loss function at time t.
Step 2.4, the random gradient descent weight vector is replaced by a random average gradient descent weight vector, and the random average gradient descent updating function is as follows:
$$\bar{w}_{t+1} = \bar{w}_t + \mu_t (w_t - \bar{w}_t)$$

wherein $\bar{w}_t$ represents the random average gradient descent weight vector at time t, and $\mu_t$ represents the weight update parameter at time t, with a default initial value of 1.
Step 2.5, judging whether the value of the key step length parameter meets the condition requirement of rollback, and simultaneously judging whether the parameter rollback operation can be carried out:
in the random average gradient descent optimization algorithm, the key step size parameter influences the shrinkage rate of the random average gradient descent optimization algorithm. In the original random average gradient descent optimization algorithm, the step size parameter value is continuously increased to lead the algorithm to tend to be more and more stable, although the algorithm can obtain a more stable result in the later period, the algorithm is also trapped in local optimization due to the over-stability, and the algorithm cannot obtain a better result due to the over-small gradient value. Therefore, the improved method provided by the invention indirectly influences the gradient value by changing the value of the step length parameter, thereby achieving the effect of changing the shrinkage rate of the random average gradient descent optimization algorithm, then jumping out of local optimization and obtaining a better value.
In the proposed medical named entity recognition method with improved random average gradient descent, whether to perform parameter rollback is decided by two judgments. The first judgment checks whether the step size parameter in the improved random average gradient descent optimization algorithm is evenly divisible by the default value 1000, i.e., a default rollback interval is set, and the second judgment is carried out each time the interval is reached; the second judgment checks whether the system random number is smaller than the quotient of the parameter rollback vector s and the rollback number count b, wherein the system random number is determined by the random number seed value set at the initialization of model training.
Step 2.6, when the parameter rollback requirement is met, performing parameter rollback operation on the key step length parameter with the random average gradient decreasing:
when the two determinations in step 2.5 both meet the requirements, the value of the step parameter in the optimized random average gradient descent optimization algorithm is divided into the maximum value of the product of the parameter rollback level bl and the default parameter rollback size bds with the rollback number count b by the step parameter value, so that the gradient is ensured not to generate great fluctuation after multiple parameter rollback operations, and the phenomenon that the gradient is too large to jump out of the local optimum and the experimental result is poor is prevented. Meanwhile, the parameter rollback grade b is made to be twice of the value of the parameter rollback grade b, so that the gradient cannot greatly fluctuate under the influence of the parameter rollback operation along with the change of the number of training rounds.
The step size parameter value update function is as follows:
step=max(step//bl,bds*b)
the parameter rollback level update function is as follows:
b=b*2
step 2.7, updating the iterative learning rate eta, the weight updating parameter mu and the sample parameter, and increasing the step size parameter counter by 1:
the iterative learning rate update function is as follows:
$$\eta_{t+1} = \frac{lr}{(1 + \lambda \cdot lr \cdot step)^{\alpha}}$$

where lr represents the default learning rate, $\lambda$ represents the attenuation term (default e-4), and $\alpha$ represents the update index of $\eta$.
The weight update parameter update function is as follows:
$$\mu_{t+1} = \frac{1}{\max(1,\; t+1-t_0)}$$

where $t_0$ represents the time point at which the random average gradient descent weight vector starts to be computed, with a default value of e6.
The sample parameter update function is as follows:
$$x_{t+1}^j = (1 - \lambda\eta)\, x_t^j - \eta \nabla f_t(w_t, x_t^j)$$

where $x_t^j$ represents a randomly selected sample at time t, $\lambda$ represents the attenuation term (default e-4), $\eta$ represents the iterative learning rate, and $\nabla f_t(w_t, x_t^j)$ represents the gradient of the loss function at time t.
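Putting the pieces together, one full iteration of the improved optimizer described in steps 2.2-2.7 might be organized as below. This is a sketch assembled from the formulas above; the class name, state layout, and the PyTorch-style reading of the schedule are our assumptions, not the patent's code:

```python
import random
import numpy as np

class RollbackASGD:
    """Sketch of the improved random average gradient descent loop
    (steps 2.2-2.7) with the parameter rollback mechanism."""

    def __init__(self, w, lr=30.0, lam=1e-4, alpha=0.75, t0=0,
                 weight_decay=1.2e-6, s=0.02, bl=10, bds=10000,
                 interval=1000, seed=0):
        self.w = np.asarray(w, dtype=float)       # SGD iterate w_t
        self.w_avg = self.w.copy()                # averaged iterate
        self.lr, self.lam, self.alpha, self.t0 = lr, lam, alpha, t0
        self.weight_decay, self.s = weight_decay, s
        self.bl, self.bds, self.interval = bl, bds, interval
        self.rng = random.Random(seed)            # "system random number"
        self.step, self.b = 0, 1                  # round counter, rollback count
        self.eta, self.mu = lr, 1.0               # schedule state

    def update(self, grad):
        grad = grad + self.weight_decay * self.w                   # step 2.2
        self.w = self.w - self.eta * grad                          # step 2.3
        self.w_avg = self.w_avg + self.mu * (self.w - self.w_avg)  # step 2.4
        # steps 2.5/2.6: two-round rollback check on the round counter
        if (self.step > 0 and self.step % self.interval == 0
                and self.rng.random() < self.s / self.b):
            self.step = max(self.step // self.bl, self.bds * self.b)
            self.b *= 2
        # step 2.7: schedule updates, then advance the counter
        self.eta = self.lr / (1.0 + self.lam * self.lr * self.step) ** self.alpha
        self.mu = 1.0 / max(1, self.step + 1 - self.t0)
        self.step += 1

opt = RollbackASGD(w=[1.0, -2.0], lr=0.1)
opt.update(np.array([0.5, 0.5]))
```

In an actual training run, `update` would be called once per batch with the batch's averaged gradient, and `opt.w_avg` would serve as the final model weights.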
Step 2.8, repeating steps 2.1-2.7 until the model meets the preset training termination criterion.
Step 3: carrying out named entity labeling on the received medical unstructured text to be labeled by utilizing the trained labeler.
The experimental results are as follows:
to demonstrate the effectiveness of this example, experiments were performed on the Penn TreeBank dataset for both the AWD-LSTM model and the MoS-AWD-LSTM model. The Penn TreeBank dataset has long been a common dataset for language model experiments, with the maximum number of words in the vocabulary limited to 10000.
In the training process, all experiments strictly follow the regularization and optimization techniques introduced in AWD-LSTM, including a series of optimization tricks such as stacking three LSTM layers; because the PyTorch-0.2 version used by the two compared models is old, the experiments are reproduced in PyTorch-0.4. For fairness, only the original random average gradient descent optimization algorithm is replaced by the random average gradient descent method based on the parameter rollback mechanism, and the other parameters and architectures are kept unchanged.
Table 1 below shows the perplexity results of the AWD-LSTM model and the MoS-AWD-LSTM model on the Penn Treebank language modeling task; smaller perplexity indicates better language model performance, and the parameter column shows the number of model parameters. The results show that, compared with the AWD-LSTM model, the random average gradient descent method based on the parameter rollback mechanism improves the perplexity of the validation set and the test set by 1.23% and 1.03% (PyTorch-0.2), and 0.03% and 0.23% (PyTorch-0.4), respectively; compared with the MoS-AWD-LSTM model, the method of this embodiment improves them by 0.88% and 0.87% (PyTorch-0.2), and 2.35% and 2.3% (PyTorch-0.4).
TABLE 1 perplexity results table of AWD-LSTM and MoS-AWD-LSTM models in Penn Treebank dataset language modeling task
Meanwhile, this embodiment verifies the influence of different parameter rollback vectors s on the experimental results. Figures 3 and 4 show the validation-set perplexity on the Penn Treebank dataset when the two experimental models use different parameter rollback vectors s; for clarity, figures 3 and 4 only show the part reaching the lowest value. It can be seen that after a parameter rollback operation occurs, there is a certain probability that the result drops markedly below the previous progress. In addition, this embodiment verifies that setting the parameter rollback vector s to 0.02 performs best on the Penn Treebank dataset.

Claims (8)

1. A medical named entity identification method based on improved random average gradient descent is characterized by comprising the following steps:
step 1, receiving a medical unstructured text to be labeled, and performing data preprocessing to obtain labeled data; for the received medical unstructured text to be labeled, firstly performing simple named entity labeling on a part of the medical unstructured text in a rule-based and dictionary-based mode, and then obtaining high-quality training data;
step 2, establishing an AWD-LSTM model according to the improved random average gradient descent, inputting the preprocessed data into the AWD-LSTM model for training, and obtaining a labeler;
step 2.1, initializing parameters of the medical named entity identification method based on improved random average gradient descent; the parameters are divided into hyper-parameters and ordinary parameters, and the hyper-parameters comprise: default learning rate lr, attenuation term λ, index α of the iterative learning rate update, time point t_0 at which gradient averaging starts, weight attenuation term, parameter rollback vector s, parameter rollback level bl, and default parameter rollback size bds; the ordinary parameters comprise: iteration round number step, iterative learning rate η, weight update parameter μ, and rollback number count b;
step 2.2, reducing the weight magnitude through the weight attenuation operation, wherein the weight attenuation update function is:

$\nabla f(w_t;\, x_t^{(j)}) \leftarrow \nabla f(w_t;\, x_t^{(j)}) + \text{weight-decay} \cdot w_t$

in the above formula, $w_t$ represents the random gradient descent weight vector at time t, $x_t^{(j)}$ represents the randomly selected sample at time t, and weight-decay represents the weight attenuation term; applying this L2 regularization term yields a smaller weight vector $w_t$;
step 2.3, calculating the random gradient descent weight vector according to the random gradient descent update function, wherein the random gradient descent update function is:

$w_{t+1} = w_t - \eta\, \nabla f_t(w_t;\, x_j)$

in the above formula, $w_t$ represents the random gradient descent weight vector at time t; $w_{t+1}$ represents the random gradient descent weight vector at time t+1; η represents the iterative learning rate, $x_j$ represents a randomly selected sample, and $\nabla f_t$ represents the gradient of the loss function at time t;
step 2.4, converting the random gradient descent weight vector obtained in step 2.3 into a random average gradient descent weight vector through the random average gradient descent update function; the random average gradient descent update function is:

$\bar{w}_{t+1} = \bar{w}_t + \mu_t\,(w_t - \bar{w}_t)$

in the above formula, $\bar{w}_t$ represents the random average gradient descent weight vector at time t, $\bar{w}_{t+1}$ represents the random average gradient descent weight vector at time t+1, $\mu_t$ represents the weight update parameter at time t, and $w_t$ represents the random gradient descent weight vector at time t;
step 2.5, judging whether the value of the key step length parameter meets the rollback condition, using two rounds of judgment to decide whether to perform the parameter rollback operation;
step 2.6, when both judgments of step 2.5 satisfy the parameter rollback condition, performing the parameter rollback operation on the key step length parameter of the random average gradient descent: updating the value of the iteration round number step and the value of the rollback number count b; the iteration round number step is updated as:
step=max(step//bl,bds*b)
in the above formula, step is the number of iteration rounds, bl is the parameter rollback level, bds is the default parameter rollback size, and b is the rollback number count; the rollback number count is updated to:
b=b*2
in the above formula, b is the rollback number count;
step 2.7, updating the iterative learning rate, the weight update parameter and the sample parameter, and increasing the iteration round number step by 1;
step 2.8, repeating the steps 2.1 to 2.7 until the AWD-LSTM model meets a preset training termination criterion;
step 3, carrying out named entity labeling on the medical unstructured text to be labeled by using the trained labeler.
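The training loop of steps 2.2 to 2.7 can be sketched in a few lines of Python. This is a minimal illustration assuming standard averaged-SGD recurrences for the update functions of claim 1, not the patented implementation; the function and parameter names (`asgd_with_rollback`, `grad_fn`, `check_every`) are hypothetical, while lr, λ, α, t0, s, bl and bds follow the claimed hyper-parameters:

```python
import random

def asgd_with_rollback(grad_fn, w0, lr=25.0, lam=1e-4, alpha=0.75, t0=1e6,
                       s=0.02, bl=10, bds=10000, check_every=1000,
                       max_steps=50000):
    """Averaged SGD with probabilistic parameter rollback (illustrative)."""
    w = list(w0)        # random gradient descent weight vector w_t
    w_avg = list(w0)    # random average gradient descent weight vector
    step, b = 0, 1      # iteration round number and rollback number count
    while step < max_steps:
        eta = lr / (1.0 + lam * lr * step) ** alpha   # iterative learning rate
        mu = 1.0 / max(1, step - t0)                  # weight update parameter
        g = grad_fn(w, step)                          # gradient on a random sample
        # weight attenuation + gradient step (steps 2.2-2.3)
        w = [wi * (1.0 - lam * eta) - eta * gi for wi, gi in zip(w, g)]
        # averaging step (step 2.4)
        w_avg = [ai + mu * (wi - ai) for ai, wi in zip(w_avg, w)]
        step += 1                                     # step 2.7
        # two-round rollback judgment (steps 2.5-2.6)
        if step % check_every == 0 and random.random() < s / b:
            step = max(step // bl, bds * b)           # roll the counter back
            b *= 2
    return w_avg
```

Because the iterative learning rate depends on step, rolling the counter back raises η again, which is what lets training escape a premature averaging plateau; doubling b makes each subsequent rollback half as likely.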
2. The medical named entity recognition method based on improved stochastic mean gradient descent as claimed in claim 1, wherein step 2.5 comprises the following steps:
step 2.5.1, first round of judgment: judging whether the key step length parameter is divisible by the default value; if the iteration round number step is divisible by the default value, performing the second round of judgment in step 2.5.2; if the iteration round number step is not divisible by the default value, the iteration round number step and the rollback number count b are not updated;
step 2.5.2, second round of judgment: judging whether the system random number is smaller than the quotient of the parameter rollback vector s and the rollback number count b; if the system random number is smaller than the quotient of the parameter rollback vector s and the rollback number count b, executing step 2.6; otherwise, the iteration round number step and the rollback number count b are not updated.
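The two-round judgment of claim 2 reduces to a divisibility check followed by a probability threshold that shrinks as the rollback number count b grows. A minimal sketch (function name hypothetical; the divisor of 1000 follows claim 8):

```python
import random

def should_rollback(step, b, s=0.02, divisor=1000, rng=random):
    """Two-round parameter rollback judgment (illustrative sketch)."""
    # first round: the iteration round number must be divisible by the default value
    if step % divisor != 0:
        return False
    # second round: the system random number must be smaller than s / b,
    # so each doubling of b halves the chance of another rollback
    return rng.random() < s / b
```

Passing the random source as `rng` is a testing convenience, not part of the claim; the claimed method draws the system random number from a seed set in step 2.1 (claim 6).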
3. The medical named entity recognition method based on improved stochastic mean gradient descent as claimed in claim 1, wherein the specific operations of updating the iterative learning rate, the weight update parameters and the sample parameters in step 2.7 are as follows:
the iterative learning rate η is updated as:

$\eta_{t+1} = \dfrac{lr}{(1 + \lambda \cdot lr \cdot step)^{\alpha}}$

in the above formula, $\eta_{t+1}$ represents the iterative learning rate at time t+1, lr represents the default learning rate, λ represents the attenuation term, step is the iteration round number, and α represents the update index of η;

the weight update parameter is updated as:

$\mu_{t+1} = \dfrac{1}{\max(1,\; t - t_0)}$

in the above formula, $\mu_{t+1}$ represents the weight update parameter at time t+1, t represents time t, and $t_0$ represents the time at which the random average gradient descent starts;
the sample parameter is updated as:

$x_{t+1}^{(j)} = x_t^{(j)}\,(1 - \lambda\,\eta_t) - \eta_t\,\nabla f_t$

in the above formula, $x_t^{(j)}$ represents the randomly selected sample at time t, $x_{t+1}^{(j)}$ represents the randomly selected sample at time t+1, λ represents the attenuation term, η represents the iterative learning rate, and $\nabla f_t$ represents the gradient of the loss function at time t.
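The two closed-form updates of claim 3 are easy to check numerically. A small sketch under the defaults of claims 4 and 5 (α = 0.75, t0 = 1e6; the function names are hypothetical):

```python
def eta_update(lr, lam, step, alpha=0.75):
    # iterative learning rate of claim 3: eta = lr / (1 + lam * lr * step)^alpha
    return lr / (1.0 + lam * lr * step) ** alpha

def mu_update(t, t0=1_000_000):
    # weight update parameter of claim 3: mu = 1 / max(1, t - t0)
    return 1.0 / max(1, t - t0)
```

With the attenuation term λ set to 0 (claim 4) the learning rate stays at lr; with t0 = 1e6 the weight update parameter stays at 1 until averaging begins, consistent with claim 5.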
4. Medical named entity recognition method based on improved stochastic mean gradient descent as claimed in claim 1, characterized in that at initialization in step 2.1: the default learning rate lr is set to 20-30, the attenuation term λ is set to 0, the index α of the iterative learning rate update is set to 0.75, the time point t0 at which gradient averaging starts is set to 0, the value of the weight attenuation term is set to 1.2e-6, the value of the parameter rollback vector s is set to 0.02, the parameter rollback level bl is set to 10, and the default parameter rollback size bds is set to 10000.
5. The medical named entity recognition method based on improved stochastic mean gradient descent as claimed in claim 1, wherein: in step 2.4, the weight update parameter μt at time t is initially 1; in step 2.7, the attenuation term λ has a default value of 1e-4, and t0 has a default value of 1e6.
6. The medical named entity recognition method based on improved stochastic mean gradient descent as claimed in claim 1, wherein: a random number seed value is also set in step 2.1, and the value of the system random number in step 2.5.2 is determined by the random number seed value.
7. The medical named entity recognition method based on improved stochastic mean gradient descent as claimed in claim 1, wherein: the key step length parameter in steps 2.5 and 2.6 is the iteration round number step.
8. The medical named entity recognition method based on improved stochastic mean gradient descent of claim 2 or 7, wherein: the default value against which the iteration round number step is tested for divisibility in step 2.5.1 is set to 1000.
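With the defaults of claims 4 and 7 (bl = 10, bds = 10000), the rollback arithmetic of step 2.6 can be illustrated directly; a small sketch (function name hypothetical):

```python
def rollback_step(step, b, bl=10, bds=10000):
    """Parameter rollback of step 2.6: shrink step, double the rollback count."""
    step = max(step // bl, bds * b)   # '//' is integer (floor) division
    return step, b * 2
```

The bds * b floor keeps late rollbacks from discarding too much progress: each time the rollback number count b doubles, the minimum point to which training can be set back doubles as well.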
CN202110435549.3A 2021-04-22 2021-04-22 Medical named entity identification method based on improved random average gradient descent Pending CN112966516A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110435549.3A CN112966516A (en) 2021-04-22 2021-04-22 Medical named entity identification method based on improved random average gradient descent

Publications (1)

Publication Number Publication Date
CN112966516A true CN112966516A (en) 2021-06-15

Family

ID=76281014

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110435549.3A Pending CN112966516A (en) 2021-04-22 2021-04-22 Medical named entity identification method based on improved random average gradient descent

Country Status (1)

Country Link
CN (1) CN112966516A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114023378A (en) * 2022-01-05 2022-02-08 北京晶泰科技有限公司 Method for generating protein structure constraint distribution and protein design method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination