CN116796749A - Medical named entity recognition robustness enhancement method and system - Google Patents

Medical named entity recognition robustness enhancement method and system

Info

Publication number
CN116796749A
CN116796749A CN202310797089.8A CN202310797089A CN116796749A CN 116796749 A CN116796749 A CN 116796749A CN 202310797089 A CN202310797089 A CN 202310797089A CN 116796749 A CN116796749 A CN 116796749A
Authority
CN
China
Prior art keywords
sample
word vector
disturbance
model
mutual information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310797089.8A
Other languages
Chinese (zh)
Inventor
杨飞
张志强
何云飞
孟丽
孙宸远
马剑
高埂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Medical University
Original Assignee
Anhui Medical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Medical University filed Critical Anhui Medical University
Priority to CN202310797089.8A priority Critical patent/CN116796749A/en
Publication of CN116796749A publication Critical patent/CN116796749A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0475 Generative networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/094 Adversarial learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The invention discloses a method and a system for enhancing the robustness of medical named entity recognition. The method comprises: obtaining a medical text to be recognized and inputting it into a pre-trained language model to obtain a word vector sample and an adversarial sample; calculating, using the Hilbert-Schmidt independence criterion (HSIC), first mutual information between the model input and an intermediate hidden layer and second mutual information between the model output and the intermediate hidden layer; taking the first and second mutual information as the mutual information terms of an information bottleneck optimization objective function, and using that objective function as a constraint to capture multi-level hidden-layer semantic information when training a neural network model with the word vector sample and the adversarial sample; and, in actual prediction, decoding the multi-level hidden-layer semantic information output by the neural network model and selecting the entity tag sequence with the maximum probability from the decoding results as the recognition result. The invention improves the model's ability to handle noise and interference while limiting model capacity and reducing resource consumption.

Description

Medical named entity recognition robustness enhancement method and system
Technical Field
The invention relates to the technical field of natural language processing, in particular to a medical named entity recognition robustness enhancement method and system.
Background
Named entity recognition (NER) is a foundational task in medical text mining: the process of finding medical entities, such as diseases, drugs and symptoms, in unstructured medical text. An NER model identifies the boundaries of the medical entities contained in a given medical text and classifies them by type. This provides metadata support for building clinical knowledge bases, which in turn improves the efficiency and quality of clinical research in medical institutions and is of great significance for improving the service quality of medical information systems.
Current research mainly obtains feature vectors of medical text with a pre-trained language model, learns context features with a neural network model, and finally classifies labels with techniques such as a conditional random field. Owing to its strong feature extraction capability, the BERT language model (Bidirectional Encoder Representations from Transformers) is the mainstream choice for word vector embedding. Deep learning methods based on a bidirectional long short-term memory network (Bi-LSTM) and a conditional random field (CRF) capture bidirectional semantic dependencies well, and have become among the most widely used methods for named entity recognition in the medical field in recent years. In addition, introducing multi-level semantic representations, for example by constructing medical professional dictionaries or using Chinese word roots, has improved the recognition accuracy for named entities in the medical field to some extent.
However, studies have shown that neural network models tend to be locally unstable, and even small perturbations can mislead them. An input carrying such a malicious perturbation is called an adversarial example: it is created by deliberately adding noise in the form of small perturbations to the data. The perturbations are hard for humans to perceive, yet they can greatly increase the loss of a deep learning model. In the patent application with publication number CN115659976A, perturbation factors are injected by the FGM adversarial training method and the result is input into a BiLSTM-CRF model to obtain a prediction, which effectively counters the influence of interference noise in training samples. However, that scheme only strengthens the anti-interference capability of the word embedding layer and ignores the "cascade diffusion" of noise between the hidden layers of the neural network after the word embedding layer. The literature has also proposed a neural network hybrid compression scheme based on information bottleneck theory ("A neural network hybrid compression method based on information bottleneck theory", Computer Application Research), which uses mutual information (MI) to limit the propagation of redundant information while preserving information related to the ground truth, thereby reducing the memory required to store the model. That scheme does not solve the problem of noise transfer between hidden layers either, and MI is expensive to compute in practice.
Disclosure of Invention
The technical problem to be solved by the invention is how to suppress the cascade diffusion of useless noise information between the hidden layers of a model while also strengthening the perturbation resistance of the model's word embedding layer, thereby improving the model's handling of noise and its robustness.
The invention solves this technical problem by the following technical means:
A method for enhancing the robustness of medical named entity recognition is provided, the method comprising:
acquiring a medical text to be recognized, and inputting the medical text into a pre-trained language model to obtain a word vector sample and an adversarial sample, wherein the language model calculates the loss function gradient of the word vector sample by the PGD attack method and adds a perturbation in the direction of the gradient to generate the adversarial sample;
calculating, using the Hilbert-Schmidt independence criterion (HSIC), first mutual information between the model input and an intermediate hidden layer and second mutual information between the model output and the intermediate hidden layer;
taking the first mutual information and the second mutual information as the mutual information terms of an information bottleneck optimization objective function, and using that objective function as a constraint to capture multi-level hidden-layer semantic information when training a neural network model with the word vector sample and the adversarial sample;
and, in actual prediction, decoding the multi-level hidden-layer semantic information output by the neural network model, and selecting the entity tag sequence with the maximum probability from the decoding results as the recognition result.
Further, inputting the medical text into the pre-trained language model to obtain a word vector sample and an adversarial sample comprises:
inputting the medical text into the pre-trained language model to obtain the word vector sample;
adding an adversarial perturbation at the word vector embedding layer of the language model, calculating the loss function gradient of the word vector sample by the PGD attack method, and adding the perturbation in the direction of the gradient to generate the adversarial sample, wherein the PGD attack formula is expressed as:
$$x^{t+1} = \Pi_{x+S}\left(x^{t} + \alpha \nabla_{x} J(\theta, x^{t}, y)\right)$$
wherein: $x^{t}$ represents the sample generated by the $t$-th iteration, $x^{t+1}$ represents the sample generated by the $(t+1)$-th iteration, $S$ represents the allowed perturbation range, $\Pi_{x+S}$ represents the function that projects the perturbation back into that range, $\alpha$ represents the step size of each iteration, $\theta$ represents the parameters of the model, $J(\theta, x^{t}, y)$ represents the loss function, $y$ represents the correct label of the sample, and $\nabla_{x} J(\theta, x^{t}, y)$ represents the gradient of the loss function with respect to the input sample $x^{t}$.
Further, when a perturbation is added in the direction of the loss function gradient of the word vector sample, the formula is expressed as:
$$\min_{\theta} \mathbb{E}_{(x,y)\sim D}\left[\max_{\Delta x \in \Omega} L(x+\Delta x, y; \theta)\right]$$
wherein: $D$ represents the distribution of input samples, $x$ represents the input word vector sample, $y$ represents the label, $\theta$ represents the robustness parameter of the language model, $L(x, y; \theta)$ represents the loss of a single word vector sample, $L(x+\Delta x, y; \theta)$ represents the sample loss after the perturbation $\Delta x$ is added to $x$, $\Delta x$ represents the perturbation, $\Omega$ represents the perturbation space, $\|\Delta x\| \le \epsilon$, and $\epsilon$ is a constant.
Further, the Hilbert-Schmidt independence criterion is formulated as:
$$\mathrm{HSIC}(X, Y) = \frac{1}{(N-1)^{2}} \operatorname{tr}\left(K_X H K_Y H\right), \qquad H = I_N - \frac{1}{N}\mathbf{1}\mathbf{1}^{T}$$
wherein: $\mathrm{HSIC}(X, Y)$ represents the empirical HSIC value between the information $X$ and $Y$ of different hidden layers, $N$ represents the batch size, $\operatorname{tr}(\cdot)$ computes the trace of its input, $K_X$ and $K_Y$ represent kernel matrices, $I_N$ is the identity matrix, $\mathbf{1}$ is a column vector whose elements are all 1, and $T$ denotes transposition.
Further, the information bottleneck optimization objective function is denoted $\psi_{IB}$, with the mutual information $I$ computed using the HSIC above:
$$\psi_{IB} = I(\alpha, \beta_1) - \theta\, I(\alpha, \beta_2)$$
wherein: $\psi_{IB}$ represents the information bottleneck, $\theta$ represents the Lagrange multiplier, $I(\alpha, \beta_1)$ represents the first mutual information, and $I(\alpha, \beta_2)$ represents the second mutual information, the first and second mutual information both being empirical HSIC values calculated using the Hilbert-Schmidt independence criterion.
In addition, the invention also provides a medical named entity recognition robustness enhancement system, which comprises:
an acquisition module, configured to acquire a medical text to be recognized and input the medical text into a pre-trained language model to obtain a word vector sample and an adversarial sample, wherein the language model calculates the loss function gradient of the word vector sample by the PGD attack method and adds a perturbation in the direction of the gradient to generate the adversarial sample;
a mutual information calculation module, configured to calculate, using the Hilbert-Schmidt independence criterion (HSIC), first mutual information between the model input and an intermediate hidden layer and second mutual information between the model output and the intermediate hidden layer;
a training module, configured to take the first mutual information and the second mutual information as the mutual information terms of an information bottleneck optimization objective function, and to use that objective function as a constraint to capture multi-level hidden-layer semantic information when training a neural network model with the word vector sample and the adversarial sample;
and a prediction module, configured to decode, in actual prediction, the multi-level hidden-layer semantic information output by the neural network model, and to select the entity tag sequence with the maximum probability from the decoding results as the recognition result.
Further, the acquisition module includes:
a word vector extraction unit, configured to input the medical text into the pre-trained language model to obtain the word vector sample;
an adversarial perturbation unit, configured to add an adversarial perturbation at the word vector embedding layer of the language model, calculate the loss function gradient of the word vector sample by the PGD attack method, and add the perturbation in the direction of the gradient to generate the adversarial sample, wherein the PGD attack formula is:
$$x^{t+1} = \Pi_{x+S}\left(x^{t} + \alpha \nabla_{x} J(\theta, x^{t}, y)\right)$$
wherein: $x^{t}$ represents the sample generated by the $t$-th iteration, $x^{t+1}$ represents the sample generated by the $(t+1)$-th iteration, $S$ represents the allowed perturbation range, $\Pi_{x+S}$ represents the function that projects the perturbation back into that range, $\alpha$ represents the step size of each iteration, $\theta$ represents the parameters of the model, $J(\theta, x^{t}, y)$ represents the loss function, $y$ represents the correct label of the sample, and $\nabla_{x} J(\theta, x^{t}, y)$ represents the gradient of the loss function with respect to the input sample $x^{t}$.
Further, when a perturbation is added in the direction of the loss function gradient of the word vector sample, the formula is expressed as:
$$\min_{\theta} \mathbb{E}_{(x,y)\sim D}\left[\max_{\Delta x \in \Omega} L(x+\Delta x, y; \theta)\right]$$
wherein: $D$ represents the distribution of input samples, $x$ represents the input word vector sample, $y$ represents the label, $\theta$ represents the robustness parameter of the language model, $L(x, y; \theta)$ represents the loss of a single word vector sample, $L(x+\Delta x, y; \theta)$ represents the sample loss after the perturbation $\Delta x$ is added to $x$, $\Delta x$ represents the perturbation, $\Omega$ represents the perturbation space, $\|\Delta x\| \le \epsilon$, and $\epsilon$ is a constant.
Further, the Hilbert-Schmidt independence criterion is formulated as:
$$\mathrm{HSIC}(X, Y) = \frac{1}{(N-1)^{2}} \operatorname{tr}\left(K_X H K_Y H\right), \qquad H = I_N - \frac{1}{N}\mathbf{1}\mathbf{1}^{T}$$
wherein: $\mathrm{HSIC}(X, Y)$ represents the empirical HSIC value between the information $X$ and $Y$ of different hidden layers, $N$ represents the batch size, $\operatorname{tr}(\cdot)$ computes the trace of its input, $K_X$ and $K_Y$ represent kernel matrices, $I_N$ is the identity matrix, $\mathbf{1}$ is a column vector whose elements are all 1, and $T$ denotes transposition.
Further, the information bottleneck optimization objective function is formulated as:
$$\psi_{IB} = I(\alpha, \beta_1) - \theta\, I(\alpha, \beta_2)$$
wherein: $\psi_{IB}$ represents the information bottleneck, $\theta$ represents the Lagrange multiplier, $I(\alpha, \beta_1)$ represents the first mutual information, and $I(\alpha, \beta_2)$ represents the second mutual information.
The invention has the advantages that:
(1) The invention uses adversarial training to generate adversarial samples for the model and uses them to train a more robust named entity recognition model, improving the model's handling of noise and interference. At the same time, the information bottleneck is used to compress the named entity recognition model: based on information bottleneck theory, useful information is guided to propagate effectively between the layers of the neural network, and the cascade diffusion of useless noise information between the hidden layers is suppressed, which improves the performance and robustness of the model. By limiting model capacity and reducing unnecessary information transmission, the information bottleneck also reduces resource consumption. The Hilbert-Schmidt Independence Criterion (HSIC) is used to measure the degree of independence between two variables, so that redundant information is filtered out and more useful information is retained to learn high-quality embeddings.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a flow chart of a method for enhancing robustness of medical named entity recognition according to an embodiment of the present invention;
FIG. 2 is a schematic overall flow diagram of a method for enhancing robustness of medical named entity recognition according to an embodiment of the present invention;
FIG. 3 is a diagram of an algorithmic model of named entity recognition in an embodiment of the invention;
FIG. 4 is a schematic structural diagram of a medical named entity recognition robustness enhancement system according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described in the following in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in FIG. 1 and FIG. 3, a first embodiment of the present invention proposes a method for enhancing the robustness of medical named entity recognition, the method comprising the following steps:
S10, acquiring a medical text to be recognized, and inputting the medical text into a pre-trained language model to obtain a word vector sample and an adversarial sample, wherein the language model calculates the loss function gradient of the word vector sample by the PGD attack method and adds a perturbation in the direction of the gradient to generate the adversarial sample;
S20, calculating, using the Hilbert-Schmidt independence criterion (HSIC), first mutual information between the model input and an intermediate hidden layer and second mutual information between the model output and the intermediate hidden layer;
S30, taking the first mutual information and the second mutual information as the mutual information terms of an information bottleneck optimization objective function, and using that objective function as a constraint to capture multi-level hidden-layer semantic information when training a neural network model with the word vector sample and the adversarial sample;
S40, in actual prediction, decoding the multi-level hidden-layer semantic information output by the neural network model, and selecting the entity tag sequence with the maximum probability from the decoding results as the recognition result.
It should be noted that information is lost as it propagates through a multi-layer network, so during learning the model may be disturbed by harmful information in the hidden layers and drift away from the supervised learning target. Information bottleneck theory is an information-theoretic method that seeks the most important information linking the input data and the output data; its core idea is to minimize the information loss between the input data and the output data while keeping only the most important information. This embodiment uses the information bottleneck to compress the named entity recognition model and thereby improve its robustness: the information bottleneck limits the capacity of the model, reduces unnecessary information transmission and reduces resource consumption. Adversarial training is also used to generate adversarial samples for the model, which are used to train a more robust named entity recognition model and improve its handling of noise and interference.
In an embodiment, before word vectors are obtained with the language model, the language model needs to be trained so that it can produce word vectors for medical text.
Specifically, as shown in FIG. 2, this embodiment uses Chinese electronic medical records as the knowledge source. Normalized text data and a standard data set are formed through data collection, entity label definition, entity annotation, data processing and the like; that is, a named entity recognition data set for the Chinese medical field is constructed. The data set is then divided into a training set, a validation set and a test set in a ratio of 8:1:1 for training the language model.
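The 8:1:1 division described above can be sketched as follows. This is a minimal illustration only: the function name `split_8_1_1`, the fixed shuffle seed and the use of integer record identifiers in place of annotated medical records are all assumptions, not details given in the patent.

```python
import random

def split_8_1_1(records, seed=0):
    """Shuffle records and split them into train/validation/test at 8:1:1.

    Hypothetical helper: the patent only states the ratio, not the procedure.
    """
    items = list(records)
    random.Random(seed).shuffle(items)      # deterministic shuffle for reproducibility
    n_train = len(items) * 8 // 10          # integer arithmetic avoids float rounding
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]          # the remainder goes to the test set
    return train, val, test

# Placeholder data: 1000 record identifiers standing in for annotated records.
train, val, test = split_8_1_1(range(1000))
print(len(train), len(val), len(test))
```

Any remainder left over when the data set size is not a multiple of ten falls into the test set in this sketch; a different policy may be preferable in practice.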
Furthermore, raw natural language text is converted by word encoding into fixed-length vector representations, which facilitates mathematical processing. However, existing pre-trained language models such as ELMo, GPT and BERT are trained as deep neural networks, and neural networks are vulnerable to linear perturbation attacks because of their linear characteristics. Adversarial perturbation is therefore added at the word vector encoding layer to improve the robustness of the model against malicious adversarial samples.
It should be noted that, because this embodiment applies the adversarial perturbation at the word vector embedding layer, it places no requirement on the choice of pre-trained language model; a pre-trained language model such as ELMo, GPT or BERT may be used.
In an embodiment, in step S10, inputting the medical text into the pre-trained language model to obtain a word vector sample and an adversarial sample comprises:
inputting the medical text into the pre-trained language model to obtain the word vector sample;
adding an adversarial perturbation at the word vector embedding layer of the language model, calculating the loss function gradient of the word vector sample by the PGD attack method, and adding the perturbation in the direction of the gradient to generate the adversarial sample, wherein the PGD attack formula is expressed as:
$$x^{t+1} = \Pi_{x+S}\left(x^{t} + \alpha \nabla_{x} J(\theta, x^{t}, y)\right)$$
wherein: $x^{t}$ represents the sample generated by the $t$-th iteration, $x^{t+1}$ represents the sample generated by the $(t+1)$-th iteration, $S$ represents the allowed perturbation range, $\Pi_{x+S}$ represents the function that projects the perturbation back into that range, $\alpha$ represents the step size of each iteration, $\theta$ represents the parameters of the model, $J(\theta, x^{t}, y)$ represents the loss function, $y$ represents the correct label of the sample, and $\nabla_{x} J(\theta, x^{t}, y)$ represents the gradient of the loss function with respect to the input sample $x^{t}$.
Note that the PGD (Projected Gradient Descent) attack is a white-box attack on neural network models, based on computing the gradient of the loss with respect to the sample. The attacker first selects an initial sample and computes the corresponding gradient of the loss function, then adds a perturbation in the direction of the gradient to generate a new sample. This process is iterated until the generated sample is misclassified as the attacker intends or a preset maximum number of iterations is reached. To keep the perturbation from growing too large, at each iteration the attacker projects it back into an allowed range (for example, an Lp norm less than ε), ensuring that the generated sample stays close to the original sample.
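The iteration just described can be sketched on a toy model. The example below is an illustration under stated assumptions, not the patent's implementation: a logistic-regression classifier stands in for the language model plus NER network, the vector `x` stands in for a word vector, the allowed range S is taken to be an L2 ball of radius `eps`, and each step follows the raw gradient (a widely used L-infinity variant of PGD steps along the sign of the gradient instead). All names and numeric values are hypothetical.

```python
import numpy as np

def loss_and_grad(w, b, x, y):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. the input x."""
    z = float(w @ x + b)
    p = 1.0 / (1.0 + np.exp(-z))
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad_x = (p - y) * w                     # d(loss)/dx for logistic regression
    return loss, grad_x

def project_l2(delta, eps):
    """Project a perturbation back into the L2 ball of radius eps."""
    norm = np.linalg.norm(delta)
    return delta if norm <= eps else delta * (eps / norm)

def pgd_attack(w, b, x, y, eps=0.5, alpha=0.1, steps=10):
    """Iterate: gradient-ascent step on the loss, then projection around x."""
    x_adv = x.copy()
    for _ in range(steps):
        _, g = loss_and_grad(w, b, x_adv, y)
        x_adv = x_adv + alpha * g            # move in the direction of the gradient
        x_adv = x + project_l2(x_adv - x, eps)
    return x_adv

# Hypothetical toy values standing in for model parameters and one word vector.
w = np.array([1.0, -2.0, 0.5])
b = 0.1
x = np.array([0.3, -0.2, 0.8])
y = 1
eps = 0.5

x_adv = pgd_attack(w, b, x, y, eps=eps)
clean_loss, _ = loss_and_grad(w, b, x, y)
adv_loss, _ = loss_and_grad(w, b, x_adv, y)
print(clean_loss, adv_loss, np.linalg.norm(x_adv - x))
```

In a real system the gradient would come from backpropagation through the network to the embedding layer rather than from a closed-form expression.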
In this embodiment, the adversarial perturbation is added at the word vector embedding layer, and adversarial training is defined, from an optimization viewpoint, as a saddle-point problem, namely the following min-max formula:
$$\min_{\theta} \mathbb{E}_{(x,y)\sim D}\left[\max_{\Delta x \in \Omega} L(x+\Delta x, y; \theta)\right]$$
where $D$ represents the distribution of input samples, $x$ represents the input word vector sample, $y$ represents the label, $\theta$ is the language model parameter, $L(x, y; \theta)$ is the loss of a single sample, $\Delta x$ is the perturbation, and $\Omega$ is the perturbation space.
Specifically, the inner max adds a perturbation $\Delta x$ to $x$, where the purpose of $\Delta x$ is to make $L(x+\Delta x, y; \theta)$ as large as possible, that is, to make the current model mispredict as often as possible. $\Delta x$ is nevertheless constrained to lie within $\Omega$; the conventional constraint is $\|\Delta x\| \le \epsilon$, where $\epsilon$ is a constant. The outer min finds the most robust parameter $\theta$, under which the predicted distribution matches the distribution of the original data set. The sample loss $L$ thus depends on the robustness parameter $\theta$ of the language model: $\Delta x$ is chosen to maximize $L(x+\Delta x, y; \theta)$, and $\theta$ is then chosen to minimize that worst-case loss, which yields the optimal pair $\theta$ and $\Delta x$.
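The alternation between the inner max and the outer min can be illustrated with a small sketch. It rests on loudly stated assumptions: the model is a linear logistic classifier rather than a language model, the perturbation space Ω is the L2 ball of radius `eps`, and for this linear case the inner maximization has an exact closed form, which is used here in place of iterative PGD. The data, names and hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def worst_case_delta(w, y, eps):
    """Inner max for a linear model under ||dx||_2 <= eps (closed form, assumption)."""
    norm = np.linalg.norm(w)
    if norm == 0.0:
        return np.zeros((len(y), len(w)))
    sign = np.where(y == 1, -1.0, 1.0)       # lower the logit for y=1, raise it for y=0
    return eps * sign[:, None] * (w / norm)[None, :]

def adversarial_train(X, y, eps=0.3, lr=0.5, epochs=200):
    """Outer min: gradient descent on the loss evaluated at the worst-case inputs."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        X_adv = X + worst_case_delta(w, y, eps)      # inner max
        p = sigmoid(X_adv @ w + b)
        w -= lr * (X_adv.T @ (p - y)) / len(y)       # outer min step on theta = (w, b)
        b -= lr * float(np.mean(p - y))
    return w, b

# Hypothetical toy data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (50, 2)), rng.normal(2.0, 0.5, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

eps = 0.3
w, b = adversarial_train(X, y, eps=eps)
X_attacked = X + worst_case_delta(w, y, eps)
robust_acc = float(np.mean((X_attacked @ w + b > 0).astype(float) == y))
print(robust_acc)
```

For deep models the inner max has no closed form, which is why an iterative attack such as PGD is used to approximate it.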
Further, the neural network model in the step S30 may be a model constructed by using a machine learning or deep learning algorithm, such as a conditional random field (Conditional Random Fields, CRF), a recurrent neural network (Recurrent Neural Networks, RNN), or a convolutional neural network (Convolutional Neural Networks, CNN), etc., to identify named entities in the text.
In this embodiment, after the perturbation is added, the propagation of redundant information between the word vectors and the underlying network is constrained based on information bottleneck theory, so as to guide the transmission of useful information between the layers of the neural network. Specifically, based on information bottleneck theory, the Hilbert-Schmidt independence criterion (HSIC) is adopted in place of mutual information as an auxiliary loss function for model learning, in order to capture the dependence between layers of the neural network and compress noise information in the network, which effectively preserves the purity of the aggregated information. Given samples $Z = \{(X_1, Y_1), \ldots, (X_N, Y_N)\}$, where $X$ and $Y$ represent the information of two hidden layers, the empirical HSIC value between the hidden-layer information $X = [X_1, \ldots, X_N]$ and $Y = [Y_1, \ldots, Y_N]$ can be expressed as:
$$\mathrm{HSIC}(X, Y) = \frac{1}{(N-1)^{2}} \operatorname{tr}\left(K_X H K_Y H\right), \qquad H = I_N - \frac{1}{N}\mathbf{1}\mathbf{1}^{T}$$
wherein: $\mathrm{HSIC}(X, Y)$ represents the empirical HSIC value between the information $X$ and $Y$ of different hidden layers, $N$ represents the batch size, $\operatorname{tr}(\cdot)$ computes the trace of its input, i.e. the sum of the elements on the main diagonal of a square matrix, $K_X$ and $K_Y$ represent kernel matrices, $I_N$ is the identity matrix, $\mathbf{1}$ is a column vector whose elements are all 1, and $T$ denotes transposition.
The elements of $K_X$ are $(K_X)_{ij} = \exp\left(-\frac{\|X_i - X_j\|_2^2}{2\sigma^2}\right)$, where $\sigma$ is the bandwidth, which acts as a smoothing parameter; $X_i$ and $X_j$ represent the $i$-th and $j$-th rows (samples) of the input sample matrix $X$; and $\|X_i - X_j\|_2^2$ represents the squared L2 norm, i.e. the squared Euclidean distance between the two samples. $K_Y$ is defined analogously to $K_X$ and is not described again here.
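The calculation of HSIC can be shown intuitively in matrix form. Writing $(K_X)_{ij}$ for the kernel value between samples $X_i$ and $X_j$, the empirical HSIC formula gives:
$$K_X = \begin{bmatrix} (K_X)_{11} & \cdots & (K_X)_{1N} \\ \vdots & \ddots & \vdots \\ (K_X)_{N1} & \cdots & (K_X)_{NN} \end{bmatrix}$$
and
$$K_X H = K_X - \frac{1}{N} K_X \mathbf{1}\mathbf{1}^{T}$$
from which it follows that:
$$\mathrm{HSIC}(X, Y) = \frac{1}{(N-1)^{2}} \operatorname{tr}\left((K_X H)(K_Y H)\right)$$
where $K_Y$ and $K_Y H$ are calculated in the same way as $K_X$ and $K_X H$.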
HSIC is a criterion for measuring the strength of independence between two variables; with it, redundant information can be filtered out and more useful information retained, so that high-quality embeddings are learned. The larger the HSIC value, the weaker the independence, and vice versa. HSIC is introduced into the propagation between the hidden layers of the named entity recognition model to limit the propagation of useless information, thereby improving the performance and robustness of the model.
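As a rough numerical illustration of the empirical HSIC value, the sketch below computes it with NumPy. It assumes a Gaussian kernel with bandwidth `sigma` and the common 1/(N-1)^2 normalization; the function names and the random toy data are hypothetical and are not taken from the patent.

```python
import numpy as np

def gaussian_kernel(A, sigma=1.0):
    """Kernel matrix with entries exp(-||A_i - A_j||^2 / (2 sigma^2))."""
    sq = np.sum(A ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (A @ A.T), 0.0)  # clip round-off
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(A, B, sigma=1.0):
    """Empirical HSIC: tr(K_A H K_B H) / (N - 1)^2 with H the centering matrix."""
    n = A.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K_a = gaussian_kernel(A, sigma)
    K_b = gaussian_kernel(B, sigma)
    return float(np.trace(K_a @ H @ K_b @ H)) / (n - 1) ** 2

# Toy batches standing in for two hidden-layer activations.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y_dependent = X + 0.1 * rng.normal(size=(200, 3))   # nearly a copy of X
Y_independent = rng.normal(size=(200, 3))           # unrelated to X

hsic_dep = hsic(X, Y_dependent)
hsic_ind = hsic(X, Y_independent)
print(hsic_dep, hsic_ind)
```

The dependent pair yields the larger value, matching the statement that a larger HSIC indicates weaker independence. The result depends on the bandwidth, so `sigma` would normally be tuned or set from the data scale.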
In one embodiment, the information bottleneck optimization objective function is formulated as:
$$\psi_{IB} = I(\alpha, \beta_1) - \theta\, I(\alpha, \beta_2)$$
wherein: $\psi_{IB}$ represents the information bottleneck, $\theta$ represents the Lagrange multiplier, $I(\alpha, \beta_1)$ represents the first mutual information, and $I(\alpha, \beta_2)$ represents the second mutual information, the first and second mutual information both being empirical HSIC values calculated using the Hilbert-Schmidt independence criterion.
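A possible reading of this objective is sketched below, with the two mutual information terms replaced by empirical HSIC values. The interpretation is an assumption: α is taken to be a hidden-layer representation, β1 the model input and β2 the model output (encoded one-hot), and ψ_IB is treated as a quantity to be minimized. The helper names, the Gaussian kernel and the toy data are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, sigma=1.0):
    sq = np.sum(A ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (A @ A.T), 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(A, B, sigma=1.0):
    n = A.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K_a, K_b = gaussian_kernel(A, sigma), gaussian_kernel(B, sigma)
    return float(np.trace(K_a @ H @ K_b @ H)) / (n - 1) ** 2

def ib_objective(hidden, inputs, outputs, theta=1.0, sigma=1.0):
    """psi_IB = HSIC(hidden, inputs) - theta * HSIC(hidden, outputs)."""
    first = hsic(hidden, inputs, sigma)      # stands in for I(alpha, beta_1)
    second = hsic(hidden, outputs, sigma)    # stands in for I(alpha, beta_2)
    return first - theta * second

# Toy setup: labels depend only on the first input feature.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(100, 5))
labels = (inputs[:, 0] > 0).astype(int)
outputs = np.eye(2)[labels]                               # one-hot targets

hidden_copy = inputs.copy()                               # keeps all input detail
hidden_label = outputs + 0.05 * rng.normal(size=outputs.shape)  # keeps label info only

psi_copy = ib_objective(hidden_copy, inputs, outputs)
psi_label = ib_objective(hidden_label, inputs, outputs)
print(psi_copy, psi_label)
```

Under these assumptions the representation that retains only label-relevant information scores lower than the one that copies the input, which is the compression behaviour the objective is meant to encourage. In training, such a term would typically be added to the task loss with a weighting coefficient.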
Finally, the embodiment adopts an output decoding layer CRF model to decode the captured multi-level hidden layer semantic information to obtain a globally optimal tag sequence; and in actual CRF prediction, selecting a candidate tag sequence with the highest probability according to the trained parameters as a final result.
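Selecting the highest-probability tag sequence from a linear-chain CRF is commonly done with Viterbi decoding, which the sketch below illustrates. The patent does not name the decoding algorithm, so this is an assumed choice; the tag set, the random emission and transition scores, and the omission of start/end transition scores are likewise simplifications for illustration.

```python
import numpy as np

def sequence_score(emissions, transitions, path):
    """Unnormalized score of one tag path (no start/end transitions in this sketch)."""
    score = emissions[0, path[0]]
    for t in range(1, len(path)):
        score += transitions[path[t - 1], path[t]] + emissions[t, path[t]]
    return float(score)

def viterbi_decode(emissions, transitions):
    """emissions: (T, K) per-token tag scores; transitions[i, j]: score of tag i -> tag j."""
    T, K = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j]: best score ending in tag i at t-1, then moving to tag j at t
        cand = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0)
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):            # follow back-pointers from the last token
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(np.max(score))

# Hypothetical tag set and random scores for a 4-token sentence.
tags = ["O", "B-DIS", "I-DIS"]
rng = np.random.default_rng(0)
emissions = rng.normal(size=(4, 3))
transitions = rng.normal(size=(3, 3))

best_path, best_score = viterbi_decode(emissions, transitions)
print([tags[i] for i in best_path], best_score)
```

Because the normalizing constant of the CRF is the same for every path, the path with the highest unnormalized score is also the one with the highest probability.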
Furthermore, as shown in FIG. 4, a second embodiment of the present invention proposes a medical named entity recognition robustness enhancement system, the system comprising:
the acquisition module 10, configured to acquire a medical text to be recognized and input the medical text into a pre-trained language model to obtain a word vector sample and an adversarial sample, wherein the language model calculates the loss function gradient of the word vector sample by the PGD attack method and adds a perturbation in the direction of the gradient to generate the adversarial sample;
the mutual information calculation module 20, configured to calculate, using the Hilbert-Schmidt independence criterion (HSIC), first mutual information between the model input and an intermediate hidden layer and second mutual information between the model output and the intermediate hidden layer;
the training module 30, configured to take the first mutual information and the second mutual information as the mutual information terms of an information bottleneck optimization objective function, and to use that objective function as a constraint to capture multi-level hidden-layer semantic information when training the neural network model with the word vector sample and the adversarial sample;
and the prediction module 40, configured to decode, in actual prediction, the multi-level hidden-layer semantic information output by the neural network model, and to select the entity tag sequence with the maximum probability from the decoding results as the recognition result.
In one embodiment, the acquisition module 10 includes:
the word vector extraction unit is used for inputting the medical text into the pre-trained language model to obtain a word vector sample;
the adversarial disturbance unit is used for adding an adversarial disturbance at the word vector embedding layer of the language model, calculating the loss function gradient of the word vector sample using the PGD attack method, and adding the disturbance in the direction of the gradient to generate the adversarial sample, wherein the PGD attack formula is as follows:
$$x_{t+1} = \Pi_{x+S}\left(x_t + \alpha\,\mathrm{sign}\left(\nabla_x J(\theta, x_t, y)\right)\right)$$
wherein: $x_t$ represents the sample generated by the $t$-th iteration, $x_{t+1}$ represents the sample generated by the $(t+1)$-th iteration, $S$ represents the allowable disturbance range, $\Pi_{x+S}$ represents the function that projects the disturbance back into that range, $\alpha$ represents the step size of each iteration, $\theta$ represents the parameters of the model, $J(\theta, x_t, y)$ represents the loss function, $y$ represents the correct label of the sample, and $\nabla_x J(\theta, x_t, y)$ represents the gradient of the loss function with respect to the input sample $x_t$.
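The PGD iteration can be illustrated on a toy model whose gradient is available in closed form. The sketch below is not the patent's NER model: it assumes a logistic-regression classifier acting directly on an embedding vector, an L-infinity ball of radius `epsilon` as the allowable range S, and a sign-gradient step; all names and numeric values are hypothetical.

```python
import numpy as np

def loss_and_grad(w: np.ndarray, x: np.ndarray, y: int):
    """Logistic loss J(theta, x, y) for y in {0, 1} and its gradient with respect to x."""
    p = 1.0 / (1.0 + np.exp(-float(w @ x)))
    loss = -(y * np.log(p) + (1 - y) * np.log(1.0 - p))
    grad_x = (p - y) * w
    return loss, grad_x

def pgd_attack(w, x, y, epsilon=0.1, alpha=0.02, steps=10):
    """Iterate x_{t+1} = Proj_{x+S}(x_t + alpha * sign(grad_x J)) for a fixed number of steps."""
    x_adv = x.copy()
    for _ in range(steps):
        _, g = loss_and_grad(w, x_adv, y)
        # Ascent step: move along the sign of the loss gradient.
        x_adv = x_adv + alpha * np.sign(g)
        # Projection: clip the accumulated disturbance back into [-epsilon, epsilon].
        x_adv = x + np.clip(x_adv - x, -epsilon, epsilon)
    return x_adv

w = np.array([1.5, -2.0, 0.5])   # toy model parameters (theta)
x = np.array([0.2, -0.1, 0.4])   # toy word-vector sample
y = 1                            # correct label

x_adv = pgd_attack(w, x, y)
clean_loss, _ = loss_and_grad(w, x, y)
adv_loss, _ = loss_and_grad(w, x_adv, y)
print(x_adv, clean_loss, adv_loss)
```

The adversarial sample stays within `epsilon` of the original vector in every coordinate while its loss is higher than that of the clean sample, which is the property adversarial training relies on.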
In one embodiment, when a disturbance is added in the direction of the loss function gradient of the word vector sample, the formula is:
$$\min_{\theta}\ \mathbb{E}_{(x,y)\sim D}\left[\max_{\Delta x \in \Omega} L(x + \Delta x, y; \theta)\right]$$
wherein: $D$ represents the distribution of input samples, $x$ represents the input word vector sample, $y$ represents the label, $\theta$ represents the robustness parameter of the language model, $L(x, y; \theta)$ represents the loss of a single word vector sample, $L(x + \Delta x, y; \theta)$ represents the sample loss after the disturbance $\Delta x$ is added to $x$, $\Delta x$ represents the disturbance, $\Omega$ represents the disturbance space, and $\lVert \Delta x \rVert \le \varepsilon$, where $\varepsilon$ is a constant.
In one embodiment, the Hilbert-Schmidt independence criterion is formulated as:
$$\mathrm{HSIC}(X, Y) = \frac{1}{(N-1)^2}\,\mathrm{tr}\left(K_X H K_Y H\right)$$
wherein: $\mathrm{HSIC}(X, Y)$ represents the empirical HSIC value between different hidden-layer information $X$ and $Y$, $N$ represents the batch size, $\mathrm{tr}(\cdot)$ calculates the trace of its input, $K_X$ and $K_Y$ represent kernel matrices, $H = I - \frac{1}{N}\mathbf{1}\mathbf{1}^T$, $I$ is the identity matrix, $\mathbf{1}$ is a column vector whose elements are all 1, and $T$ denotes transposition.
In one embodiment, the bottleneck theoretical optimization objective function is formulated as:
$$\psi_{IB} = I(\alpha, \beta_1) - \theta I(\alpha, \beta_2)$$
wherein: $\psi_{IB}$ represents the information bottleneck, $\theta$ represents the Lagrange multiplier, $I(\alpha, \beta_1)$ represents the first mutual information, and $I(\alpha, \beta_2)$ represents the second mutual information.
It should be noted that, other embodiments of the medical named entity recognition robustness enhancement system or the implementation method thereof according to the present invention may refer to the above-mentioned embodiments of the method, and are not repeated here.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (10)

1. A method for robustness enhancement of medical named entity recognition, the method comprising:
acquiring a medical text to be recognized, and inputting the medical text into a pre-trained language model to obtain a word vector sample and an adversarial sample, wherein the language model uses the PGD attack method to calculate the loss function gradient of the word vector sample and adds a disturbance in the direction of the gradient to generate the adversarial sample;
calculating first mutual information between the model input and the intermediate hidden layer and second mutual information between the model output and the intermediate hidden layer using the Hilbert-Schmidt independence criterion;
taking the first mutual information and the second mutual information as the mutual information of a bottleneck theoretical optimization objective function and, when training a neural network model with the word vector sample and the adversarial sample, using the bottleneck theoretical optimization objective function as a constraint condition to capture multi-level hidden-layer semantic information;
and in actual prediction, decoding the multi-level hidden layer semantic information output by the neural network model, and selecting an entity tag sequence with the maximum probability from decoding results as an identification result.
2. The method of claim 1, wherein the inputting the medical text into a pre-trained language model to obtain a word vector sample and an adversarial sample comprises:
inputting the medical text into the pre-trained language model to obtain a word vector sample;
adding an adversarial disturbance at the word vector embedding layer of the language model, calculating the loss function gradient of the word vector sample using the PGD attack method, and adding the disturbance in the direction of the gradient to generate the adversarial sample, wherein the PGD attack formula is expressed as follows:
$$x_{t+1} = \Pi_{x+S}\left(x_t + \alpha\,\mathrm{sign}\left(\nabla_x J(\theta, x_t, y)\right)\right)$$
wherein: $x_t$ represents the sample generated by the $t$-th iteration, $x_{t+1}$ represents the sample generated by the $(t+1)$-th iteration, $S$ represents the allowable disturbance range, $\Pi_{x+S}$ represents the function that projects the disturbance back into that range, $\alpha$ represents the step size of each iteration, $\theta$ represents the parameters of the model, $J(\theta, x_t, y)$ represents the loss function, $y$ represents the correct label of the sample, and $\nabla_x J(\theta, x_t, y)$ represents the gradient of the loss function with respect to the input sample $x_t$.
3. The method of claim 2, wherein when adding a disturbance in the direction of the loss function gradient of the word vector sample, the formula is:
$$\min_{\theta}\ \mathbb{E}_{(x,y)\sim D}\left[\max_{\Delta x \in \Omega} L(x + \Delta x, y; \theta)\right]$$
wherein: $D$ represents the distribution of input samples, $x$ represents the input word vector sample, $y$ represents the label, $\theta$ represents the robustness parameter of the language model, $L(x, y; \theta)$ represents the loss of a single word vector sample, $L(x + \Delta x, y; \theta)$ represents the sample loss after the disturbance $\Delta x$ is added to $x$, $\Delta x$ represents the disturbance, $\Omega$ represents the disturbance space, and $\varepsilon$ is a constant.
4. The medical named entity recognition robustness enhancement method of claim 1, wherein the Hilbert-Schmidt independence criterion is formulated as:
$$\mathrm{HSIC}(X, Y) = \frac{1}{(N-1)^2}\,\mathrm{tr}\left(K_X H K_Y H\right)$$
wherein: $\mathrm{HSIC}(X, Y)$ represents the empirical HSIC value between different hidden-layer information $X$ and $Y$, $N$ represents the batch size, $\mathrm{tr}(\cdot)$ calculates the trace of its input, $K_X$ and $K_Y$ represent kernel matrices, $H = I - \frac{1}{N}\mathbf{1}\mathbf{1}^T$, $I$ is the identity matrix, $\mathbf{1}$ is a column vector whose elements are all 1, and $T$ denotes transposition.
5. The method for robustness enhancement of medical named entity recognition of claim 1 or 4, wherein the bottleneck theoretical optimization objective function is formulated as:
$$\psi_{IB} = I(\alpha, \beta_1) - \theta I(\alpha, \beta_2)$$
wherein: $\psi_{IB}$ represents the information bottleneck, $\theta$ represents the Lagrange multiplier, $I(\alpha, \beta_1)$ represents the first mutual information, and $I(\alpha, \beta_2)$ represents the second mutual information.
6. A medical named entity recognition robustness enhancement system, the system comprising:
the acquisition module is used for acquiring a medical text to be recognized and inputting the medical text into a pre-trained language model to obtain a word vector sample and an adversarial sample, wherein the language model uses the PGD attack method to calculate the loss function gradient of the word vector sample and adds a disturbance in the direction of the gradient to generate the adversarial sample;
the mutual information calculation module is used for calculating first mutual information between the model input and the intermediate hidden layer and second mutual information between the model output and the intermediate hidden layer using the Hilbert-Schmidt independence criterion;
the training module is used for taking the first mutual information and the second mutual information as the mutual information of a bottleneck theoretical optimization objective function and, when training a neural network model with the word vector sample and the adversarial sample, using the bottleneck theoretical optimization objective function as a constraint condition to capture multi-level hidden-layer semantic information;
and the prediction module is used for decoding the multi-level hidden layer semantic information output by the neural network model during actual prediction, and selecting an entity tag sequence with the highest probability from decoding results as a recognition result.
7. The medical named entity recognition robustness enhancement system of claim 6, wherein the acquisition module comprises:
the word vector extraction unit is used for inputting the medical text into the pre-trained language model to obtain a word vector sample;
the adversarial disturbance unit is used for adding an adversarial disturbance at the word vector embedding layer of the language model, calculating the loss function gradient of the word vector sample using the PGD attack method, and adding the disturbance in the direction of the gradient to generate the adversarial sample, wherein the PGD attack formula is as follows:
$$x_{t+1} = \Pi_{x+S}\left(x_t + \alpha\,\mathrm{sign}\left(\nabla_x J(\theta, x_t, y)\right)\right)$$
wherein: $x_t$ represents the sample generated by the $t$-th iteration, $x_{t+1}$ represents the sample generated by the $(t+1)$-th iteration, $S$ represents the allowable disturbance range, $\Pi_{x+S}$ represents the function that projects the disturbance back into that range, $\alpha$ represents the step size of each iteration, $\theta$ represents the parameters of the model, $J(\theta, x_t, y)$ represents the loss function, $y$ represents the correct label of the sample, and $\nabla_x J(\theta, x_t, y)$ represents the gradient of the loss function with respect to the input sample $x_t$.
8. The medical named entity recognition robustness enhancement system of claim 7, wherein when a disturbance is added in the direction of the loss function gradient of the word vector sample, the formula is:
$$\min_{\theta}\ \mathbb{E}_{(x,y)\sim D}\left[\max_{\Delta x \in \Omega} L(x + \Delta x, y; \theta)\right]$$
wherein: $D$ represents the distribution of input samples, $x$ represents the input word vector sample, $y$ represents the label, $\theta$ represents the robustness parameter of the language model, $L(x, y; \theta)$ represents the loss of a single word vector sample, $L(x + \Delta x, y; \theta)$ represents the sample loss after the disturbance $\Delta x$ is added to $x$, $\Delta x$ represents the disturbance, $\Omega$ represents the disturbance space, and $\lVert \Delta x \rVert \le \varepsilon$, where $\varepsilon$ is a constant.
9. The medical named entity recognition robustness enhancement system of claim 6, wherein the Hilbert-Schmidt independence criterion is formulated as:
$$\mathrm{HSIC}(X, Y) = \frac{1}{(N-1)^2}\,\mathrm{tr}\left(K_X H K_Y H\right)$$
wherein: $\mathrm{HSIC}(X, Y)$ represents the empirical HSIC value between different hidden-layer information $X$ and $Y$, $N$ represents the batch size, $\mathrm{tr}(\cdot)$ calculates the trace of its input, $K_X$ and $K_Y$ represent kernel matrices, $H = I - \frac{1}{N}\mathbf{1}\mathbf{1}^T$, $I$ is the identity matrix, $\mathbf{1}$ is a column vector whose elements are all 1, and $T$ denotes transposition.
10. The medical named entity recognition robustness enhancement system of claim 6 or 9, wherein the bottleneck theoretical optimization objective function is formulated as:
$$\psi_{IB} = I(\alpha, \beta_1) - \theta I(\alpha, \beta_2)$$
wherein: $\psi_{IB}$ represents the information bottleneck, $\theta$ represents the Lagrange multiplier, $I(\alpha, \beta_1)$ represents the first mutual information, and $I(\alpha, \beta_2)$ represents the second mutual information.
CN202310797089.8A 2023-06-29 2023-06-29 Medical named entity recognition robustness enhancement method and system Pending CN116796749A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310797089.8A CN116796749A (en) 2023-06-29 2023-06-29 Medical named entity recognition robustness enhancement method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310797089.8A CN116796749A (en) 2023-06-29 2023-06-29 Medical named entity recognition robustness enhancement method and system

Publications (1)

Publication Number Publication Date
CN116796749A true CN116796749A (en) 2023-09-22

Family

ID=88045871

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310797089.8A Pending CN116796749A (en) 2023-06-29 2023-06-29 Medical named entity recognition robustness enhancement method and system

Country Status (1)

Country Link
CN (1) CN116796749A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117874175A (en) * 2024-03-12 2024-04-12 武汉纺织大学 Information retrieval robustness method and system based on information bottleneck
CN117874175B (en) * 2024-03-12 2024-06-04 武汉纺织大学 Information bottleneck-based information retrieval method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination