CN116796749A

CN116796749A - Medical named entity recognition robustness enhancement method and system

Info

Publication number: CN116796749A
Application number: CN202310797089.8A
Authority: CN
Inventors: 杨飞; 张志强; 何云飞; 孟丽; 孙宸远; 马剑; 高埂
Original assignee: Anhui Medical University
Current assignee: Anhui Medical University
Priority date: 2023-06-29
Filing date: 2023-06-29
Publication date: 2023-09-22

Abstract

The invention discloses a method and a system for enhancing the robustness of medical named entity recognition. The method comprises: obtaining a medical text to be recognized, and inputting the medical text into a pre-trained language model to obtain a word vector sample and an adversarial sample; calculating, using the Hilbert-Schmidt independence criterion (HSIC), first mutual information between the model input and an intermediate hidden layer and second mutual information between the model output and the intermediate hidden layer; using the first mutual information and the second mutual information as the mutual information terms of an information bottleneck optimization objective function, and capturing multi-level hidden-layer semantic information with that objective function as a constraint when training a neural network model on the word vector sample and the adversarial sample; and, in actual prediction, decoding the multi-level hidden-layer semantic information output by the neural network model and selecting the entity tag sequence with the maximum probability from the decoding results as the recognition result. The invention can limit the capacity of the model and reduce resource consumption while improving the model's ability to handle noise and interference.

Description

Medical named entity recognition robustness enhancement method and system

Technical Field

The invention relates to the technical field of natural language processing, in particular to a medical named entity recognition robustness enhancement method and system.

Background

Named entity recognition (NER) is fundamental to medical text mining: it is the process of finding medical entities, such as diseases, drugs and symptoms, in unstructured medical text. An NER model accurately identifies the boundaries of the medical entities contained in a given medical text and classifies the entities by type. This provides metadata support for building clinical knowledge bases, which in turn improves the efficiency and level of clinical research at medical institutions and is of great significance for improving the service quality of medical information systems.

Current research mainly obtains feature vectors of the medical text with a pre-trained language model, learns contextual features with a neural network model, and finally classifies labels with techniques such as a conditional random field. Owing to its strong feature extraction capability, the BERT language model (Bidirectional Encoder Representations from Transformers) is the mainstream choice for word vector embedding. Deep learning methods based on a bidirectional long short-term memory network (Bi-LSTM) and a conditional random field (CRF) capture bidirectional semantic dependencies well and have therefore become among the most widely used methods for medical named entity recognition in recent years. In addition, introducing multi-level semantic representations, such as a medical professional dictionary or Chinese word roots, has improved recognition precision for named entities in the medical domain to some extent.

However, studies have shown that neural network models tend to be locally unstable, and even small perturbations may mislead them. An input carrying such a malicious perturbation is called an adversarial example. It is created by deliberately adding noise in the form of small perturbations to the data; the perturbations are hard for humans to perceive, yet they can greatly increase the loss of a deep learning model. In the patent application with publication number CN115659976A, perturbations are injected by the FGM adversarial training method and the result is fed into a BiLSTM-CRF model to obtain a prediction, which effectively mitigates the influence of noise in the training samples. However, that scheme only strengthens the anti-interference capability of the word embedding layer and ignores the "cascade diffusion" of noise between the hidden layers of the neural network after the word embedding layer. The article "A neural network hybrid compression method based on information bottleneck theory" (Computer Application Research) proposes a neural network hybrid compression scheme based on information bottleneck theory, in which mutual information (MI) is used to limit the propagation of redundant information while retaining information related to the ground truth, thereby further reducing the memory required for model storage. That scheme, however, does not solve the problem of noise transfer between the hidden layers of the neural network, and MI is computationally expensive in practice.

Disclosure of Invention

The technical problem to be solved by the invention is how to suppress the cascade diffusion of useless noise information between the hidden layers of a model while also strengthening the perturbation resistance of the model's word embedding layer, thereby improving the model's handling of noise and its robustness.

The invention solves the technical problems by the following technical means:

A method for enhancing the robustness of medical named entity recognition is provided, the method comprising:

acquiring a medical text to be recognized, and inputting the medical text into a pre-trained language model to obtain a word vector sample and an adversarial sample, wherein the language model uses the PGD attack method to calculate the loss function gradient of the word vector sample and adds a perturbation in the direction of the gradient to generate the adversarial sample;

calculating, using the Hilbert-Schmidt independence criterion, first mutual information between the model input and an intermediate hidden layer and second mutual information between the model output and the intermediate hidden layer;

using the first mutual information and the second mutual information as the mutual information terms of the information bottleneck optimization objective function, and capturing multi-level hidden-layer semantic information with the information bottleneck optimization objective function as a constraint when training a neural network model on the word vector sample and the adversarial sample;

and, in actual prediction, decoding the multi-level hidden-layer semantic information output by the neural network model and selecting the entity tag sequence with the maximum probability from the decoding results as the recognition result.

Further, inputting the medical text into a pre-trained language model to obtain a word vector sample and an adversarial sample includes:

inputting the medical text into the pre-trained language model to obtain a word vector sample;

adding an adversarial perturbation at the word vector embedding layer of the language model, calculating the loss function gradient of the word vector sample using the PGD attack method, and adding the perturbation in the direction of the gradient to generate the adversarial sample, wherein the PGD attack formula is expressed as follows:

wherein: $x_t$ represents the sample generated by the $t$-th iteration, $x_{t+1}$ represents the sample generated by the $(t+1)$-th iteration, $S$ represents the allowable perturbation range, $\Pi_{x+S}$ represents the function that projects the perturbation back into that range, $\alpha$ represents the step size of each iteration, $\theta$ represents the parameters of the model, $J(\theta, x_t, y)$ represents the loss function, $y$ represents the correct label of the sample, and $\nabla_x J(\theta, x_t, y)$ represents the gradient of the loss function with respect to the input sample $x_t$.
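For reference, a PGD update consistent with the symbols defined above can be written as follows. This is the commonly used sign-gradient form and is an assumption: the text does not confirm whether the raw, normalized or signed gradient is used.

```latex
x_{t+1} = \Pi_{x+S}\left( x_t + \alpha \, \operatorname{sign}\left( \nabla_x J(\theta, x_t, y) \right) \right)
```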

Further, adding a perturbation in the direction of the loss function gradient of the word vector sample is formulated as:

wherein: $D$ represents the distribution of input samples, $x$ represents the input word vector sample, $y$ represents the label, $\theta$ represents the robustness parameter of the language model, $L(x, y; \theta)$ represents the loss of a single word vector sample, $L(x + \Delta x, y; \theta)$ represents the sample loss after adding the perturbation $\Delta x$ to $x$, $\Delta x$ represents the perturbation, $\Omega$ represents the perturbation space, $\lVert \Delta x \rVert \le \epsilon$, and $\epsilon$ is a constant.

Further, the Hilbert-Schmidt independence criterion is formulated as:

wherein: $\mathrm{HSIC}(X, Y)$ represents the empirical HSIC value between different hidden-layer information $X$ and $Y$, $N$ represents the batch size, $\mathrm{tr}(\cdot)$ computes the trace of the input tensor, $K_X$ and $K_Y$ represent kernel matrices, $I$ is an identity matrix, $\mathbf{1}$ is a column vector whose elements are all 1, and $T$ represents the transpose symbol.
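An empirical estimator that uses exactly these symbols is the standard biased HSIC estimator. It is given here as an assumption about the intended formula; the centering matrix $H$ and the $(N-1)^{-2}$ normalization are not confirmed by the text.

```latex
\mathrm{HSIC}(X, Y) = \frac{1}{(N-1)^2} \operatorname{tr}\left( K_X H K_Y H \right),
\qquad H = I - \frac{1}{N} \mathbf{1} \mathbf{1}^{T}
```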

Further, the information bottleneck optimization objective function is denoted $\psi_{IB}$, with $I$ computed using the HSIC above:

$\psi_{IB} = I(\alpha, \beta_1) - \theta I(\alpha, \beta_2)$

wherein: $\psi_{IB}$ represents the information bottleneck, $\theta$ represents the Lagrange multiplier, $I(\alpha, \beta_1)$ represents the first mutual information, and $I(\alpha, \beta_2)$ represents the second mutual information, the first mutual information and the second mutual information being empirical HSIC values calculated using the Hilbert-Schmidt independence criterion.

In addition, the invention also provides a medical named entity recognition robustness enhancement system, which comprises:

an acquisition module, configured to acquire a medical text to be recognized and input the medical text into a pre-trained language model to obtain a word vector sample and an adversarial sample, wherein the language model uses the PGD attack method to calculate the loss function gradient of the word vector sample and adds a perturbation in the direction of the gradient to generate the adversarial sample;

a mutual information calculation module, configured to calculate, using the Hilbert-Schmidt independence criterion, first mutual information between the model input and an intermediate hidden layer and second mutual information between the model output and the intermediate hidden layer;

a training module, configured to use the first mutual information and the second mutual information as the mutual information terms of the information bottleneck optimization objective function, and to capture multi-level hidden-layer semantic information with that objective function as a constraint when training a neural network model on the word vector sample and the adversarial sample;

and a prediction module, configured to decode, in actual prediction, the multi-level hidden-layer semantic information output by the neural network model and to select the entity tag sequence with the maximum probability from the decoding results as the recognition result.

Further, the acquisition module includes:

a word vector extraction unit, configured to input the medical text into the pre-trained language model to obtain a word vector sample;

an adversarial perturbation unit, configured to add an adversarial perturbation at the word vector embedding layer of the language model, calculate the loss function gradient of the word vector sample using the PGD attack method, and add the perturbation in the direction of the gradient to generate an adversarial sample, wherein the PGD attack formula is as follows:

wherein: $x_t$ represents the sample generated by the $t$-th iteration, $x_{t+1}$ represents the sample generated by the $(t+1)$-th iteration, $S$ represents the allowable perturbation range, $\Pi_{x+S}$ represents the function that projects the perturbation back into that range, $\alpha$ represents the step size of each iteration, $\theta$ represents the parameters of the model, $J(\theta, x_t, y)$ represents the loss function, $y$ represents the correct label of the sample, and $\nabla_x J(\theta, x_t, y)$ represents the gradient of the loss function with respect to the input sample $x_t$.

Further, the Hilbert-Schmidt independence criterion is formulated as:

Further, the information bottleneck optimization objective function is formulated as:

$\psi_{IB} = I(\alpha, \beta_1) - \theta I(\alpha, \beta_2)$

wherein: $\psi_{IB}$ represents the information bottleneck, $\theta$ represents the Lagrange multiplier, $I(\alpha, \beta_1)$ represents the first mutual information, and $I(\alpha, \beta_2)$ represents the second mutual information.

The invention has the advantages that:

(1) The invention uses adversarial training to generate adversarial samples for the model and uses them to train a more robust named entity recognition model, improving the model's ability to handle noise and interference. At the same time, the named entity recognition model is compressed using the information bottleneck: based on information bottleneck theory, the effective propagation of useful information between the layers of the neural network is guided and the cascade diffusion of useless noise information between the hidden layers of the model is suppressed, which improves the performance and robustness of the model. The information bottleneck also reduces resource consumption by limiting the capacity of the model and reducing unnecessary information transmission. The Hilbert-Schmidt independence criterion (HSIC) is used to measure the degree of independence between two variables, so that redundant information can be filtered out and more useful information retained in order to learn high-quality embeddings.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

FIG. 1 is a flow chart of a method for enhancing robustness of medical named entity recognition according to an embodiment of the present invention;

FIG. 2 is a schematic overall flow diagram of a method for enhancing robustness of medical named entity recognition according to an embodiment of the present invention;

FIG. 3 is a diagram of an algorithmic model of named entity recognition in an embodiment of the invention;

FIG. 4 is a schematic structural diagram of a medical named entity recognition robustness enhancement system according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described in the following in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

As shown in FIG. 1 and FIG. 3, a first embodiment of the present invention proposes a method for enhancing the robustness of medical named entity recognition, the method comprising the following steps:

S10, acquiring a medical text to be recognized, and inputting the medical text into a pre-trained language model to obtain a word vector sample and an adversarial sample, wherein the language model uses the PGD attack method to calculate the loss function gradient of the word vector sample and adds a perturbation in the direction of the gradient to generate the adversarial sample;

S20, calculating, using the Hilbert-Schmidt independence criterion, first mutual information between the model input and an intermediate hidden layer and second mutual information between the model output and the intermediate hidden layer;

S30, using the first mutual information and the second mutual information as the mutual information terms of the information bottleneck optimization objective function, and capturing multi-level hidden-layer semantic information with that objective function as a constraint when training a neural network model on the word vector sample and the adversarial sample;

S40, in actual prediction, decoding the multi-level hidden-layer semantic information output by the neural network model, and selecting the entity tag sequence with the maximum probability from the decoding results as the recognition result.

It should be noted that, because information is lost as it propagates through a multi-layer network, the model may be disturbed by bad information or by hidden layers during learning and may drift in the wrong direction, away from the supervised learning target. Information bottleneck theory is an information-theoretic method that aims to find the most important information between the input data and the output data; its core idea is to minimize the information loss between input and output while retaining the most important information. This embodiment compresses the named entity recognition model using the information bottleneck, which improves its robustness; the information bottleneck limits the capacity of the model, reduces unnecessary information transmission and reduces resource consumption. Adversarial training is also used to generate adversarial samples for the model, which are used to train a more robust named entity recognition model and improve the model's ability to handle noise and interference.

In an embodiment, the language model needs to be trained before it is used to obtain word vectors, so that it can produce word vectors for the medical text.

Specifically, as shown in FIG. 2, this embodiment uses Chinese electronic medical records as the knowledge source. Normalized text data and a standard data set are formed through data collection, entity label determination, entity labeling, data processing and so on; that is, a named entity recognition data set for the Chinese medical domain is constructed. The data set is then divided into a training set, a validation set and a test set at a ratio of 8:1:1 for training the language model.
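The 8:1:1 division described above can be sketched as follows. This is a minimal illustration rather than the embodiment's actual tooling: the function name `split_dataset`, the fixed shuffle seed and the use of integers as stand-ins for annotated records are all assumptions.

```python
import random

def split_dataset(records, ratios=(8, 1, 1), seed=0):
    """Shuffle records and split them into train/validation/test by the given ratios."""
    items = list(records)
    random.Random(seed).shuffle(items)      # fixed seed keeps the split reproducible
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]          # remainder goes to the test set
    return train, val, test

# Integers stand in for annotated medical-record sentences (hypothetical data).
train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

Splitting at the record level, after shuffling, avoids having all sentences of one type end up in a single partition.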

Furthermore, word encoding converts each word of the original natural language text into a fixed-length vector representation, which facilitates mathematical processing. However, existing pre-trained language models such as ELMo, GPT and BERT are trained on deep neural network models, and because of their linear nature neural networks are vulnerable to linear perturbations. An adversarial perturbation is therefore added at the word vector encoding layer to improve the robustness of the model against malicious adversarial samples.

It should be noted that, because this embodiment applies the adversarial perturbation at the word vector embedding layer, it places no requirement on the choice of pre-trained language model; a pre-trained language model such as ELMo, GPT or BERT may be used.

In an embodiment, in the step S10, the step of inputting the medical text into a pre-trained language model to obtain a word vector sample and an adversarial sample includes:

It should be noted that the PGD (Projected Gradient Descent) attack is a white-box attack against neural network models, based on computing gradients with respect to the sample. The attacker first selects an initial sample and calculates the gradient of the loss function for it. The attacker then adds a perturbation in the direction of the gradient to generate a new sample, and iterates this process until the generated sample is classified as the erroneous class the attacker wants or a preset maximum number of iterations is reached. To prevent the perturbation from becoming too large, the attacker projects it back into a given range at each iteration (for example, an Lp norm less than ε), ensuring that the generated sample does not lie too far from the original sample.
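The iterate-and-project loop described above can be sketched in a few lines. The example is only an illustration under stated assumptions: a toy logistic model stands in for the language model and downstream network, the perturbation range is taken to be an L-infinity ball of radius `epsilon`, the step uses the sign of the gradient, and all names (`loss_and_grad`, `pgd_attack`) and numeric values are hypothetical.

```python
import numpy as np

def loss_and_grad(w, x, y):
    """Binary cross-entropy of a logistic model p = sigmoid(w.x), and its gradient w.r.t. x."""
    p = 1.0 / (1.0 + np.exp(-x @ w))
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad_x = (p - y) * w
    return loss, grad_x

def pgd_attack(w, x, y, epsilon=0.1, alpha=0.02, steps=10):
    """Repeat: step along the gradient sign, then project back into the epsilon-ball around x."""
    x_adv = x.copy()
    for _ in range(steps):
        _, grad = loss_and_grad(w, x_adv, y)
        x_adv = x_adv + alpha * np.sign(grad)               # ascend the loss
        x_adv = x + np.clip(x_adv - x, -epsilon, epsilon)   # projection step
    return x_adv

w = np.array([1.0, -2.0, 0.5])    # toy model parameters (assumed)
x = np.array([0.3, -0.1, 0.8])    # stands in for one word vector
y = 1.0                           # correct label
x_adv = pgd_attack(w, x, y)
clean_loss, _ = loss_and_grad(w, x, y)
adv_loss, _ = loss_and_grad(w, x_adv, y)
print(clean_loss, adv_loss)       # the adversarial loss is the larger of the two
```

In a real setting the gradient would come from automatic differentiation through the network, and the stopping rule could also include the misclassification check mentioned in the text.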

In an embodiment, the adversarial perturbation is added at the word vector embedding layer, and from an optimization point of view adversarial training is defined as a saddle-point problem, namely the following Min-Max formula:

where $D$ represents the distribution of input samples, $x$ represents the input word vector sample, $y$ represents the label, $\theta$ is the language model parameter, $L(x, y; \theta)$ is the loss of a single sample, $\Delta x$ is the perturbation, and $\Omega$ is the perturbation space.

Specifically, the inner max adds the perturbation $\Delta x$ to $x$; the purpose of $\Delta x$ is to make $L(x + \Delta x, y; \theta)$ as large as possible, that is, to make the current model's predictions as wrong as possible. However, $\Delta x$ is constrained to lie within $\Omega$; the conventional constraint is $\lVert \Delta x \rVert \le \epsilon$, where $\epsilon$ is a constant. The outer min finds the most robust parameter $\theta$, such that the predicted distribution matches the distribution of the original data set. The sample loss $L$ thus contains the robustness parameter $\theta$ of the language model, and solving the inner maximization over $\Delta x$ together with the outer minimization over $\theta$ yields the optimal robustness parameter $\theta$ and perturbation $\Delta x$.
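The interplay of the inner max and the outer min can be illustrated with a deliberately small model. The sketch below assumes a linear logistic classifier, an L-infinity constraint on Δx, and synthetic two-cluster data; for a linear model the inner maximization is solved exactly by ε times the sign of the input gradient, which is why no inner loop appears. The names `adversarial_train` and `sigmoid` and all hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adversarial_train(X, y, epsilon=0.1, lr=0.5, epochs=200):
    """Outer min over theta of the loss evaluated on worst-case perturbed inputs."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Inner max: worst-case perturbation inside the L-inf ball,
        # epsilon * sign(dL/dx) with dL/dx = (p - y) * theta.
        p = sigmoid(X @ theta)
        delta = epsilon * np.sign((p - y)[:, None] * theta[None, :])
        X_adv = X + delta
        # Outer min: one gradient-descent step on the adversarial loss.
        p_adv = sigmoid(X_adv @ theta)
        grad_theta = X_adv.T @ (p_adv - y) / len(y)
        theta -= lr * grad_theta
    return theta

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, size=(100, 2)),
               rng.normal(-2.0, 0.5, size=(100, 2))])   # two synthetic clusters
y = np.concatenate([np.ones(100), np.zeros(100)])
theta = adversarial_train(X, y)
accuracy = np.mean((sigmoid(X @ theta) > 0.5) == (y == 1))
print(theta, accuracy)
```

For a deep network the inner step would be replaced by an iterative attack such as PGD, since the closed-form solution only holds for linear models.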

Further, the neural network model in the step S30 may be a model constructed by using a machine learning or deep learning algorithm, such as a conditional random field (Conditional Random Fields, CRF), a recurrent neural network (Recurrent Neural Networks, RNN), or a convolutional neural network (Convolutional Neural Networks, CNN), etc., to identify named entities in the text.

In this embodiment, after the perturbation is added, the propagation of redundant information between the word vectors and the underlying network is constrained based on information bottleneck theory, so as to guide the transmission of useful information between the layers of the neural network. Specifically, based on information bottleneck theory, the Hilbert-Schmidt independence criterion (HSIC) is used in place of mutual information as an auxiliary loss function for model learning, in order to capture the dependence between the layers of the neural network and compress the noise information within it, effectively constraining the purity of the aggregated information. Given samples $Z = \{(X_1, Y_1), \ldots, (X_N, Y_N)\}$, where $X$ and $Y$ represent the information of two hidden layers, the empirical HSIC value between the hidden-layer information $X = [X_1, \ldots, X_N]$ and $Y = [Y_1, \ldots, Y_N]$ can be expressed as:

wherein: $\mathrm{HSIC}(X, Y)$ represents the empirical HSIC value between different hidden-layer information $X$ and $Y$, $N$ represents the batch size, $\mathrm{tr}(\cdot)$ computes the trace of the input tensor, that is, the sum of the elements on the main diagonal of a square matrix, $K_X$ and $K_Y$ represent kernel matrices, $I$ is an identity matrix, $\mathbf{1}$ is a column vector whose elements are all 1, and $T$ represents the transpose symbol.

Each element of $K_X$ is a kernel value computed from the samples $X_i$ and $X_j$, wherein $\sigma$ is the bandwidth, which has a smoothing effect; $X_i$ and $X_j$ represent the $i$-th and $j$-th rows of the input sample matrix $X$, respectively; and $\lVert X_i - X_j \rVert^2$ represents the square of the L2 norm, that is, the squared Euclidean distance between the two samples. $K_Y$ is defined analogously to $K_X$ and is not described again here.

The calculation of the HSIC can be shown intuitively in matrix form. According to the formula for the empirical HSIC value, there are:

and

from which it follows that:

wherein $K_Y$ and $K_Y H$ are calculated in the same way as $K_X$ and $K_X H$, respectively.

HSIC is a criterion for measuring the degree of independence between two variables; it can filter out redundant information and retain more useful information, so that high-quality embeddings are learned. The larger the HSIC value, the weaker the independence (that is, the stronger the dependence), and vice versa. HSIC is introduced into the propagation between the hidden layers of the named entity recognition model to limit the propagation of useless information, thereby improving the performance and robustness of the model.
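The empirical HSIC computation discussed above can be sketched with NumPy. Two points are assumptions rather than facts taken from the text: the kernel is taken to be Gaussian, with entries exp(-||X_i - X_j||^2 / (2σ^2)), and the estimator uses the standard biased form tr(K_X H K_Y H) / (N - 1)^2 with centering matrix H = I - (1/N) 1 1^T. The function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    """Kernel matrix with entries exp(-||X_i - X_j||^2 / (2 sigma^2)); rows of X are samples."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)   # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC between two batches of N samples."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    K_X = gaussian_kernel(X, sigma)
    K_Y = gaussian_kernel(Y, sigma)
    return np.trace(K_X @ H @ K_Y @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                                # stands in for one hidden layer
dependent = hsic(X, X + 0.1 * rng.normal(size=(200, 4)))     # nearly a copy of X
independent = hsic(X, rng.normal(size=(200, 4)))             # unrelated batch
print(dependent, independent)
```

The dependent pair yields the larger value, matching the statement that a larger HSIC indicates weaker independence.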

In one embodiment, the information bottleneck optimization objective function is formulated as:

$\psi_{IB} = I(\alpha, \beta_1) - \theta I(\alpha, \beta_2)$

wherein: $\psi_{IB}$ represents the information bottleneck, $\theta$ represents the Lagrange multiplier, $I(\alpha, \beta_1)$ represents the first mutual information, and $I(\alpha, \beta_2)$ represents the second mutual information, the first mutual information and the second mutual information being empirical HSIC values calculated using the Hilbert-Schmidt independence criterion.
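As a rough illustration of how this objective could be evaluated, the sketch below substitutes empirical HSIC values for both mutual information terms. It assumes that α denotes the intermediate hidden layer, β₁ the model input and β₂ the model output, which is one reading of the definitions above; the Gaussian kernel, the biased HSIC estimator, the value of θ and the names `hsic` and `ib_objective` are likewise assumptions.

```python
import numpy as np

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC with Gaussian kernels (kernel form assumed)."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    def kernel(Z):
        sq = np.sum(Z ** 2, axis=1)
        d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return np.trace(kernel(X) @ H @ kernel(Y) @ H) / (n - 1) ** 2

def ib_objective(hidden, inputs, outputs, theta=2.0):
    """psi_IB = I(alpha, beta_1) - theta * I(alpha, beta_2), with HSIC standing in for I."""
    first = hsic(hidden, inputs)     # hidden layer vs. model input
    second = hsic(hidden, outputs)   # hidden layer vs. model output
    return first - theta * second

rng = np.random.default_rng(0)
inputs = rng.normal(size=(100, 8))
labels = np.eye(2)[(inputs[:, 0] > 0).astype(int)]   # one-hot stand-in for the model output
psi_copy = ib_objective(inputs, inputs, labels)      # hidden layer that merely copies the input
psi_task = ib_objective(labels, inputs, labels)      # hidden layer that keeps only label information
print(psi_task < psi_copy)
```

Used as a training penalty, a lower value rewards hidden representations that discard input detail while staying dependent on the output, which is the compression behaviour the text attributes to the information bottleneck.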

Finally, this embodiment uses a CRF model as the output decoding layer to decode the captured multi-level hidden-layer semantic information and obtain a globally optimal tag sequence. In actual CRF prediction, the candidate tag sequence with the highest probability under the trained parameters is selected as the final result.
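Selecting the highest-probability tag sequence from a linear-chain CRF is commonly done with Viterbi decoding. The sketch below assumes that approach; the text does not name the decoding algorithm. The tag set, the emission scores and the single forbidden transition (O to I-DIS) are hypothetical values chosen for illustration.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence for a linear-chain CRF.

    emissions:   (T, K) per-token tag scores from the upstream network.
    transitions: (K, K) score of moving from tag i to tag j.
    """
    T, K = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # candidate[i, j]: best score ending in tag i at t-1, then moving to tag j
        candidate = score[:, None] + transitions
        backptr[t] = np.argmax(candidate, axis=0)
        score = candidate[backptr[t], np.arange(K)] + emissions[t]
    best = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):          # follow back-pointers from the last token
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]

tags = ["O", "B-DIS", "I-DIS"]
emissions = np.array([[2.0, 1.0, 0.0],
                      [2.0, 1.5, 0.1],
                      [0.2, 0.3, 2.0],
                      [2.0, 0.1, 0.5]])
transitions = np.zeros((3, 3))
transitions[0, 2] = -1e4                   # an entity cannot continue directly after O
path = viterbi_decode(emissions, transitions)
print([tags[i] for i in path])  # ['O', 'B-DIS', 'I-DIS', 'O']
```

A per-token argmax would pick O, O, I-DIS, O here, which violates the transition constraint; the global search instead returns a valid sequence with the best total score.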

Furthermore, as shown in FIG. 4, a second embodiment of the present invention further proposes a medical named entity recognition robustness enhancement system, the system comprising:

an acquisition module 10, configured to acquire a medical text to be recognized and input the medical text into a pre-trained language model to obtain a word vector sample and an adversarial sample, wherein the language model uses the PGD attack method to calculate the loss function gradient of the word vector sample and adds a perturbation in the direction of the gradient to generate the adversarial sample;

a mutual information calculation module 20, configured to calculate, using the Hilbert-Schmidt independence criterion, first mutual information between the model input and an intermediate hidden layer and second mutual information between the model output and the intermediate hidden layer;

a training module 30, configured to use the first mutual information and the second mutual information as the mutual information terms of the information bottleneck optimization objective function, and to capture multi-level hidden-layer semantic information with that objective function as a constraint when training the neural network model on the word vector sample and the adversarial sample;

and a prediction module 40, configured to decode, in actual prediction, the multi-level hidden-layer semantic information output by the neural network model and to select the entity tag sequence with the maximum probability from the decoding results as the recognition result.

In one embodiment, the acquisition module 10 includes:

wherein: $x_t$ represents the sample generated by the $t$-th iteration, $x_{t+1}$ represents the sample generated by the $(t+1)$-th iteration, $S$ represents the allowable perturbation range, $\Pi_{x+S}$ represents the function that projects the perturbation back into that range, $\alpha$ represents the step size of each iteration, $\theta$ represents the parameters of the model, $J(\theta, x_t, y)$ represents the loss function, $y$ represents the correct label of the sample, and $\nabla_x J(\theta, x_t, y)$ represents the gradient of the loss function with respect to the input sample $x_t$.

In one embodiment, adding a perturbation in the direction of the loss function gradient of the word vector sample is formulated as:

wherein: $D$ represents the distribution of input samples, $x$ represents the input word vector sample, $y$ represents the label, $\theta$ represents the robustness parameter of the language model, $L(x, y; \theta)$ represents the loss of a single word vector sample, $L(x + \Delta x, y; \theta)$ represents the sample loss after adding the perturbation $\Delta x$ to $x$, $\Delta x$ represents the perturbation, $\Omega$ represents the perturbation space, $\lVert \Delta x \rVert \le \epsilon$, and $\epsilon$ is a constant.

In one embodiment, the Hilbert-Schmidt independence criterion is formulated as:

$\psi_{IB} = I(\alpha, \beta_1) - \theta I(\alpha, \beta_2)$

It should be noted that, other embodiments of the medical named entity recognition robustness enhancement system or the implementation method thereof according to the present invention may refer to the above-mentioned embodiments of the method, and are not repeated here.

In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.

While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the invention, and that changes, modifications, substitutions and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims

1. A method for robustness enhancement of medical named entity recognition, the method comprising:

using the first mutual information and the second mutual information as the mutual information terms of the information bottleneck optimization objective function, and capturing multi-level hidden-layer semantic information with that objective function as a constraint when training a neural network model on the word vector sample and the adversarial sample;

2. The method of claim 1, wherein the inputting the medical text into a pre-trained language model to obtain a word vector sample and an adversarial sample comprises:

3. The method of claim 2, wherein adding a perturbation in the direction of the loss function gradient of the word vector sample is formulated as:

wherein: $D$ represents the distribution of input samples, $x$ represents the input word vector sample, $y$ represents the label, $\theta$ represents the robustness parameter of the language model, $L(x, y; \theta)$ represents the loss of a single word vector sample, $L(x + \Delta x, y; \theta)$ represents the sample loss after adding the perturbation $\Delta x$ to $x$, $\Delta x$ represents the perturbation, $\Omega$ represents the perturbation space, and $\epsilon$ is a constant.

4. The medical named entity recognition robustness enhancement method of claim 1, wherein the Hilbert-Schmidt independence criterion is formulated as:

5. The medical named entity recognition robustness enhancement method of claim 1 or 4, wherein the information bottleneck optimization objective function is formulated as:

$\psi_{IB} = I(\alpha, \beta_1) - \theta I(\alpha, \beta_2)$

6. A medical named entity recognition robustness enhancement system, the system comprising:

7. The medical named entity recognition robustness enhancement system of claim 6, wherein the acquisition module comprises:

8. The medical named entity recognition robustness enhancement system of claim 7, wherein upon adding a perturbation in the direction of a loss function gradient of the word vector sample, the formula is:

9. The medical named entity recognition robustness enhancement system of claim 6, wherein the Hilbert-Schmidt independence criterion is formulated as:

10. The medical named entity recognition robustness enhancement system of claim 6 or 9, wherein the information bottleneck optimization objective function is formulated as:

$\psi_{IB} = I(\alpha, \beta_1) - \theta I(\alpha, \beta_2)$