CN112487820B - Chinese medical named entity recognition method - Google Patents
- Publication number: CN112487820B (application CN202110157254.4A)
- Authority: CN (China)
- Prior art keywords: word, vector, medical, text, function
- Prior art date: 2021-02-05
- Legal status: Active (granted)
Classifications
- G06F40/295 — Handling natural language data; natural language analysis; named entity recognition
- G06F40/30 — Handling natural language data; semantic analysis
- G06N3/044 — Neural networks; recurrent networks, e.g. Hopfield networks
- G06N3/045 — Neural networks; combinations of networks
- G06N3/048 — Neural networks; activation functions
- G06N3/08 — Neural networks; learning methods
Abstract
The invention discloses a Chinese medical named entity recognition method. A feature vector for each word in a medical text is generated by an attention-based language preprocessing model; a medical entity recognition model based on a bidirectional gated recurrent network then generates the final label sequence, and the medical named entities are recognized according to that sequence. The language preprocessing model generates semantically enhanced word vectors in advance of entity recognition, and a multi-head attention layer added to the medical entity recognition model extracts the multiple semantics of words in the medical text.
Description
Technical Field
The invention relates to a medical named entity recognition method, and belongs to the technical field of named entity recognition in natural language processing.
Background
Natural language processing has been a popular research direction in recent years; its aim is to enable computers to understand human language and interact with it effectively. Named entity recognition is one of its most important technologies, aiming to recognize entities with specific meanings in sentences, including names of people, places, organizations, and proper nouns. Named entity recognition tasks can be divided into general-domain recognition and domain-specific recognition, e.g. in the financial, medical, and military domains.
Early named entity recognition in the medical field mainly used dictionary- and rule-based methods, recognizing entities through manually built medical dictionaries and hand-crafted recognition rules. Later, statistical machine learning methods were applied to medical named entity recognition, most commonly conditional random field models. In recent years, with the great increase in hardware computing power, deep neural network methods have been widely applied to medical named entity recognition, the most common being the combined model of a bidirectional long short-term memory (LSTM) network and a conditional random field.
Disclosure of Invention
The purpose of the invention is as follows: to address the strong domain specialization of named entities in medical texts, the mutual nesting of entities, and word ambiguity in the prior art, the invention provides a Chinese medical named entity recognition method. Because the medical domain lacks high-quality labeled data, and because LSTM models have many parameters and long training times, the invention uses a bidirectional gated recurrent network in place of the bidirectional LSTM to increase the speed of entity recognition.
The technical scheme adopted to achieve this purpose is as follows:
a Chinese medical named entity recognition method comprises a language preprocessing model based on an attention mechanism and a medical entity recognition model. In the language preprocessing model, an attention mechanism is introduced, so that the generated word vectors can learn long-distance dependency relationships among characters, semantic features of the word vectors are enhanced, for example, for texts containing Chinese medical information, such as electronic medical records, prescriptions, physical examination reports and the like, the texts are firstly segmented into characters, and then the word vectors of each character are generated through the language preprocessing model based on the attention mechanism. In the medical entity recognition model, a bidirectional gated cyclic network is used for replacing a bidirectional long-short term memory network to improve the model training speed, a multi-head attention layer is added to further extract multiple semantic information of words, the accuracy of medical named entity recognition is improved, finally, a conditional random field is used for generating a final label sequence, and the medical named entities in the text are recognized according to the label sequence. The Chinese medical named entity recognition method is mainly applied to medical information extraction and has important application value in multiple fields of Chinese medical robots, Chinese medical knowledge maps and the like. The traditional named entity recognition method is generally based on a two-way long-short term memory network and a conditional random field, the two-way long-short term memory network cannot process data in parallel, the training speed is low, and simultaneously, the method lacks of a good coping scheme for the problems of strong entity speciality, mutual nesting of entities and the like existing in a Chinese medical text, so that the invention improves the training speed by using the two-way gating cycle network to replace the two-way long-short term memory network, trains characters through a language preprocessing model based on an attention mechanism and generates character vectors, enhances the semantic representation of the characters, adds a multi-head attention layer behind the two-way gating cycle network layer of the medical entity recognition model, further excavates the local characteristics of the medical text and the multiple semantic information of the characters, and improves the accuracy and the recognition efficiency of the Chinese medical named entity recognition, the method specifically comprises the following steps:
Step 1: perform character-level segmentation on the medical text for training to obtain its segmentation characters, and perform character-level segmentation on the medical text for recognition likewise.

Step 2: label the segmentation characters of the training medical text to obtain the labeled training text, where the first character of a medical named entity is labeled 'B', the non-initial characters of a medical named entity are labeled 'I', and characters that are not part of an entity are labeled 'O'.
Step 3: train the attention-based language preprocessing model with the labeled training text obtained in step 2 to obtain the trained model. The attention-based language preprocessing model comprises a word embedding layer, a position vector embedding layer, and an attention layer connected in sequence.
Step 3.1: feed the labeled training medical text obtained in step 2, sentence by sentence, into the word embedding layer of the attention-based language preprocessing model. The word embedding layer generates the word vector of each word using a skip-gram model, which predicts the surrounding words from a center word. For a medical text of length $L$, the $t$-th word in the text sequence is expressed as $w_t$, and training maximizes the probability that each given center word generates all of its background words:

$$\prod_{t=1}^{L}\ \prod_{-m \le j \le m,\ j \ne 0} P(w_{t+j} \mid w_t) \tag{1}$$

where the outer product indicates that the probability is calculated starting from the first word in the text, the inner product means that for each center word the occurrence probability of every background word whose distance from it does not exceed the window size $m$ is calculated, and $P(w_{t+j} \mid w_t)$ is the occurrence probability of the background word $w_{t+j}$ when $w_t$ is the center word. Formula (1) is equivalent to minimizing the first loss function:

$$\mathcal{L}_1 = -\sum_{t=1}^{L}\ \sum_{-m \le j \le m,\ j \ne 0} \log P(w_{t+j} \mid w_t) \tag{2}$$

Suppose a center word $w_c$ has index $c$ in the text and a background word $w_o$ has index $o$ in the text. The conditional probability in the first loss function that a given center word generates a background word, normalized by the normalized exponential (softmax) function, is:

$$P(w_o \mid w_c) = \frac{\exp(\boldsymbol{u}_o^{\top} \boldsymbol{v}_c)}{\sum_{i \in V} \exp(\boldsymbol{u}_i^{\top} \boldsymbol{v}_c)} \tag{3}$$

where $\boldsymbol{v}_c$ denotes the vector of the center word with index $c$, $\boldsymbol{u}_o$ the vector of the background word with index $o$, $\boldsymbol{u}_o^{\top}$ the transpose of the background word vector, $\boldsymbol{u}_o^{\top}\boldsymbol{v}_c$ the dot product of the two vectors, the denominator performs the dot product over each character $i$ in the vocabulary $V$ of the text, and $\exp$ is the exponential function with the natural constant $e$ as base. Stochastic gradient descent is used to solve the gradient of the center word vector $\boldsymbol{v}_c$ in the above formula:

$$\frac{\partial \log P(w_o \mid w_c)}{\partial \boldsymbol{v}_c} = \boldsymbol{u}_o - \sum_{j \in V} P(w_j \mid w_c)\, \boldsymbol{u}_j \tag{4}$$

The attention-based language preprocessing model is trained iteratively with formula (4) until the first loss function value $\mathcal{L}_1$ is less than the first threshold $\varepsilon_1$. After training, every word in the medical text with index $i$ obtains its vector $\boldsymbol{v}_i$ as a center word.
Step 3.2: pass the word vectors generated by the word embedding layer to the position vector embedding layer, which uses position vectors to represent the positional relation of each character and superimposes the word vector and the position vector to obtain the new feature vector of the word. The position vector is calculated as in formulas (5) and (6):

$$PE_{(pos,\,2k)} = \sin\!\big(pos / 10000^{2k/d}\big) \tag{5}$$

$$PE_{(pos,\,2k+1)} = \cos\!\big(pos / 10000^{2k/d}\big) \tag{6}$$

where $PE$ is a two-dimensional matrix whose number of columns equals the dimension of the previously generated word vectors: its rows represent the words, its columns the value of each word's position vector in each dimension, and the total number of columns equals the total dimension of the word vector. $pos$ is the index of a word in the medical text, $d$ is the total dimension of the position vector, and $k$ denotes a specific dimension of the vector. $PE_{(pos,2k)}$, the value of the position vector of the word with index $pos$ in the even dimensions, is calculated with a sine function; $PE_{(pos,2k+1)}$, its value in the odd dimensions, is calculated with a cosine function. Finally, the position vector and the word vector are added to obtain the new feature vector of the word, as in formula (7):

$$\boldsymbol{x}_i = \boldsymbol{v}_i + \boldsymbol{p}_i \tag{7}$$

where $\boldsymbol{p}_i$ is the position vector of the word with index $i$, $\boldsymbol{v}_i$ the word vector of the word with index $i$ obtained as a center word, and $\boldsymbol{x}_i$ the new feature vector with position information embedded.
Step 3.3: use the attention mechanism to learn the long-distance dependency relationships between characters, so that each word vector contains information about all other characters in the sentence. The output of the attention layer is the finally generated word vector, which completes the training of the attention-based language preprocessing model.
Step 4: train the medical entity recognition model with the labeled training text obtained in step 2 to obtain the trained model. The medical entity recognition model comprises a bidirectional gated recurrent network layer, a multi-head attention layer, and a conditional random field layer connected in sequence.
Step 4.1: encode the word vectors bidirectionally with the bidirectional gated recurrent network layer, which comprises a forward gated recurrent layer and a backward gated recurrent layer; the forward layer learns the features of the following text and the backward layer those of the preceding text, so the generated vectors better capture contextual semantic information and learn the context. A gated recurrent layer consists of only an update gate and a reset gate, where the update gate determines how much past information is passed on to the future and the reset gate determines how much past information is forgotten. The gated recurrent layer is calculated as in formulas (10)-(13):

$$z_t = \sigma\big(W_z \cdot [h_{t-1}, x_t]\big) \tag{10}$$

$$r_t = \sigma\big(W_r \cdot [h_{t-1}, x_t]\big) \tag{11}$$

$$\tilde{h}_t = \tanh\big(W_h \cdot [r_t * h_{t-1}, x_t]\big) \tag{12}$$

$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t \tag{13}$$

where $z_t$ is the output state of the update gate at time $t$; $r_t$ the output state of the reset gate at time $t$; $\tilde{h}_t$ the candidate state; $h_t$ the output state of the network at time $t$; $x_t$ the input state at the current time; $h_{t-1}$ the hidden state output by the gated recurrent network node at the previous time; $\tanh$ the hyperbolic tangent function and $\sigma$ the excitation function; $W_z$ the trainable weight parameters of the update gate, $W_r$ those of the reset gate, and $W_h$ the weight parameters used to calculate the candidate state; $[\cdot,\cdot]$ denotes the concatenation of two vectors. The update gate $z_t$ controls how much of the historical state $h_{t-1}$ is kept in the current output state $h_t$; the reset gate $r_t$ determines the degree to which the candidate state $\tilde{h}_t$ depends on the hidden state $h_{t-1}$ output by the gated recurrent network node at the previous time.
Step 4.2: use the multi-head attention layer to further extract multiple semantics. Multi-head attention essentially performs two or more attention-head operations. For the output state $h_t$ of the bidirectional gated recurrent layer at time $t$, a single attention head is first computed by formula (16):

$$head_j = \text{Attention}\big(h_t W_j^{Q},\ h_t W_j^{K},\ h_t W_j^{V}\big) = \text{softmax}\!\left(\frac{(h_t W_j^{Q})(h_t W_j^{K})^{\top}}{\sqrt{d_k}}\right) h_t W_j^{V} \tag{16}$$

where $head_j$ is the result of the $j$-th attention-head calculation with $h$ attention heads in total; $W_j^{Q}$ are the weight parameters generating the query vector, $W_j^{K}$ those generating the key vector, and $W_j^{V}$ those generating the value vector; $\sqrt{d_k}$ is a smoothing term that adjusts the dimension; and softmax is the normalized exponential function. Finally, the $h$ calculation results are concatenated and linearly transformed to obtain the multi-head attention result for the output state $h_t$ of the bidirectional gated recurrent layer at each time $t$, as in formula (17):

$$M_t = \text{Concat}\big(head_1, \dots, head_h\big)\, W^{O} \tag{17}$$

where $M_t$ denotes the calculation result of the multi-head attention layer and $W^{O}$ is a weight parameter.
Step 4.3: obtain the optimal label sequence with the conditional random field layer. For an input sentence $x$ and a sentence label sequence $y = (y_1, \dots, y_n)$, the score is:

$$score(x, y) = \sum_{i=0}^{n} A_{y_i,\,y_{i+1}} + \sum_{i=1}^{n} P_{i,\,y_i} \tag{18}$$

where $score(x, y)$ is the scoring function for the input sentence $x$ generating the label sequence $y$; $n$ is the sequence length; $A$ is the transfer scoring matrix, $A_{y_i, y_{i+1}}$ being the score of transferring from label $y_i$ to label $y_{i+1}$; $y_0$ and $y_{n+1}$ are the start and end labels of the sentence; and $P_{i, y_i}$ is the probability that the $i$-th word is labeled $y_i$. Normalization yields the maximum probability of the label sequence $y$, as in formula (19):

$$P(y \mid x) = \frac{\exp\big(score(x, y)\big)}{\sum_{\tilde{y} \in Y_x} \exp\big(score(x, \tilde{y})\big)} \tag{19}$$

where $y$ denotes the actual label sequence and $Y_x$ the set of all possible label sequences. Maximum likelihood estimation is used to solve the minimized second loss function of the medical entity recognition model, as in formula (20):

$$\mathcal{L}_2 = -\log P(y \mid x) = -score(x, y) + \log \sum_{\tilde{y} \in Y_x} \exp\big(score(x, \tilde{y})\big) \tag{20}$$

where $\mathcal{L}_2$ denotes the second loss function value. The medical entity recognition model is trained iteratively until $\mathcal{L}_2$ is less than the second threshold $\varepsilon_2$; then the globally optimal sequence, which is the labeling result of the final medical-domain named entity recognition, is obtained with the Viterbi algorithm.
Finally, the medical named entities in the text are identified according to the tag sequence: a character labeled (B) is the first character of a medical named entity, a character labeled (I) is a non-initial part of a medical named entity, and a character labeled (O) does not belong to any medical named entity.
Step 5: at recognition time, feed the segmentation characters of the medical text for recognition into the trained attention-based language preprocessing model to generate word vectors, then feed the generated word vectors into the trained medical entity recognition model to recognize the medical named entities in the text.
Preferably: in step 3.3, the attention mechanism is calculated as in formula (8):

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{8}$$

where $\text{Attention}(Q, K, V)$ denotes the attention score, $Q$ the query vector, $K$ the key vector, $V$ the value vector, $\sqrt{d_k}$ the square root of the dimension of the key vector, and softmax the normalized exponential function.
Preferably: the normalized exponential (softmax) function is:

$$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}} \tag{9}$$

where $x$ denotes an array, $x_i$ the $i$-th element of the array, and $\text{softmax}(x)_i$ the ratio of the exponential of the $i$-th element to the sum of the exponentials of all elements.
Preferably: in step 4.1, the value range of the tanh function is (-1, 1), and its expression is given by formula (14):

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{14}$$

Preferably: in step 4.1, the value range of the excitation function $\sigma$ is (0, 1), and its expression is given by formula (15):

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{15}$$
Preferably: in step 4.3, the globally optimal sequence is obtained with the Viterbi algorithm, as in formula (21):

$$y^{*} = \underset{\tilde{y} \in Y_x}{\arg\max}\ score(x, \tilde{y}) \tag{21}$$
Compared with the prior art, the invention has the following beneficial effects:
The method preprocesses the text with an attention-based language preprocessing model and generates the corresponding word vectors; encodes the word vectors bidirectionally with a bidirectional gated recurrent network layer; further acquires the local features of the text and the multiple semantics of entities with a multi-head attention layer; and finally generates the final label sequence with a conditional random field layer, identifying the medical named entities in the text according to that sequence. This addresses the inaccuracy and low speed of Chinese medical named entity recognition. Semantic representation of words is enhanced by generating, through the attention-based language preprocessing model, a word vector for each word in the medical text that contains the word's positional features and the associations between characters. In the medical entity recognition model, a bidirectional gated recurrent network replaces the bidirectional LSTM, reducing training overhead to a certain extent and improving training efficiency, while the added multi-head attention layer further learns the local features of medical texts and the multiple semantics of characters, improving the accuracy of medical named entity recognition.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
FIG. 2 is a language pre-processing model framework based on an attention mechanism.
FIG. 3 is a medical entity recognition model framework.
FIG. 4 is a schematic diagram of a gated loop network.
Detailed Description
The present invention is further illustrated by the following description in conjunction with the accompanying drawings and the specific embodiments, it is to be understood that these examples are given solely for the purpose of illustration and are not intended as a definition of the limits of the invention, since various equivalent modifications will occur to those skilled in the art upon reading the present invention and fall within the limits of the appended claims.
A Chinese medical named entity recognition method proceeds as follows: first, a medical text is segmented and labeled and used to train the language preprocessing model; the medical text to be recognized is then fed into the trained language preprocessing model to generate semantically enhanced word vectors; the trained medical entity recognition model next generates a label sequence from the word vectors; and finally the medical named entities are recognized according to the label sequence. The method specifically includes the following steps:
Step 1: perform character-level segmentation on the medical text for training to obtain its segmentation characters; likewise segment the medical text for recognition.

Step 2: label the segmentation characters of the training medical text to obtain the labeled training text, where the first character of a medical named entity is labeled 'B', the non-initial characters of a medical named entity are labeled 'I', and characters that are not part of an entity are labeled 'O'. For example, for the medical text 'no obvious fracture is seen', whose final two characters form the entity 'fracture' (literally 'bone' + 'fold' in Chinese), the labeling sequence is 'no (O)', 'see (O)', 'obvious (O)', 'bone (B)', 'fold (I)'. The 'BIO' labels distinguish the medical named entities in preparation for the subsequent training of the medical entity recognition model.
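As an illustrative sketch (not part of the claimed method), the BIO scheme can be expressed as a small labeling helper. The function below and the entity span it marks are hypothetical, and the Chinese rendering 未见明显骨折 of the example sentence is an assumption:

```python
def bio_tag(chars, entity_spans):
    """Assign BIO tags to a character sequence: 'B' for the first character
    of a medical named entity, 'I' for its remaining characters, 'O' otherwise."""
    tags = ["O"] * len(chars)
    for start, end in entity_spans:          # end index is exclusive
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return list(zip(chars, tags))

# 'no obvious fracture is seen'; the last two characters form the entity 'fracture'
print(bio_tag(list("未见明显骨折"), [(4, 6)]))
# [('未','O'), ('见','O'), ('明','O'), ('显','O'), ('骨','B'), ('折','I')]
```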
Step 3: train the attention-based language preprocessing model with the labeled training text obtained in step 2 to obtain the trained model. As shown in FIG. 2, the attention-based language preprocessing model comprises a word embedding layer, a position vector embedding layer, and an attention layer connected in sequence. For the segmented text, the word embedding layer first generates word vectors with a skip-gram model; the position vector embedding layer then learns the position information of each character by adding a position vector; finally, the attention layer learns the relation between each character and all other characters, strengthening the semantic representation of the characters.
Step 3.1: feed the labeled training medical text obtained in step 2, sentence by sentence, into the word embedding layer of the attention-based language preprocessing model. The word embedding layer generates the word vector of each word using a skip-gram model, which predicts the surrounding words from a center word. For a medical text of length $L$, the $t$-th word in the text sequence is expressed as $w_t$, and training maximizes the probability that each given center word generates all of its background words:

$$\prod_{t=1}^{L}\ \prod_{-m \le j \le m,\ j \ne 0} P(w_{t+j} \mid w_t) \tag{1}$$

where the outer product indicates that the probability is calculated starting from the first word in the text; the inner product means that for each center word the occurrence probability of every background word at distance no greater than the window size $m$ is calculated, the distance between a generated background word and its center word never exceeding $m$; and $P(w_{t+j} \mid w_t)$ is the occurrence probability of the background word $w_{t+j}$ when $w_t$ is the center word. Formula (1) is equivalent to minimizing the first loss function:

$$\mathcal{L}_1 = -\sum_{t=1}^{L}\ \sum_{-m \le j \le m,\ j \ne 0} \log P(w_{t+j} \mid w_t) \tag{2}$$

Suppose a center word $w_c$ has index $c$ in the text and a background word $w_o$ has index $o$ in the text. The conditional probability in the first loss function that a given center word generates a background word is normalized by the normalized exponential (softmax) function as:

$$P(w_o \mid w_c) = \frac{\exp(\boldsymbol{u}_o^{\top} \boldsymbol{v}_c)}{\sum_{i \in V} \exp(\boldsymbol{u}_i^{\top} \boldsymbol{v}_c)} \tag{3}$$

where $\boldsymbol{v}_c$ denotes the vector of the center word with index $c$, $\boldsymbol{u}_o$ the vector of the background word with index $o$, $\boldsymbol{u}_o^{\top}$ the transpose of the background word vector, $\boldsymbol{u}_o^{\top}\boldsymbol{v}_c$ the dot product of the two vectors, the denominator performs the dot product over each character $i$ in the vocabulary $V$ of the text, and $\exp$ is the exponential function with the natural constant $e$ as base. Stochastic gradient descent is used to solve the gradient of the center word vector $\boldsymbol{v}_c$ in the above formula:

$$\frac{\partial \log P(w_o \mid w_c)}{\partial \boldsymbol{v}_c} = \boldsymbol{u}_o - \sum_{j \in V} P(w_j \mid w_c)\, \boldsymbol{u}_j \tag{4}$$

The attention-based language preprocessing model is trained iteratively with formula (4) until the first loss function value $\mathcal{L}_1$ is less than the first threshold $\varepsilon_1$, a preset constant. After training, every word in the medical text with index $i$ obtains its vector $\boldsymbol{v}_i$ as a center word, and $\boldsymbol{v}_i$ is used as the final output vector of the word embedding layer.
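A minimal NumPy sketch of one stochastic-gradient step on the skip-gram objective of formulas (2)-(4); the vocabulary size, vector dimension, and learning rate below are illustrative assumptions:

```python
import numpy as np

def skipgram_step(V, U, center, context, lr=0.05):
    """One SGD step of the skip-gram objective.
    V: (vocab, dim) center-word vectors; U: (vocab, dim) background-word vectors;
    center: index c of the center word; context: indices of its background words."""
    loss = 0.0
    for o in context:
        scores = U @ V[center]                  # dot products u_i . v_c
        p = np.exp(scores - scores.max())
        p /= p.sum()                            # softmax of formula (3)
        loss += -np.log(p[o])                   # first loss function, formula (2)
        V[center] += lr * (U[o] - p @ U)        # descend along the gradient of formula (4)
    return loss

rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(50, 16))        # 50-character vocabulary, dimension 16
U = rng.normal(scale=0.1, size=(50, 16))
print(skipgram_step(V, U, center=5, context=[3, 4, 6, 7]))
```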
Step 3.2: pass the word vectors generated by the word embedding layer to the position vector embedding layer, which uses position vectors to represent the positional relation of each character and superimposes the word vector and the position vector to obtain the new feature vector of the word. The position vector is calculated as in formulas (5) and (6):

$$PE_{(pos,\,2k)} = \sin\!\big(pos / 10000^{2k/d}\big) \tag{5}$$

$$PE_{(pos,\,2k+1)} = \cos\!\big(pos / 10000^{2k/d}\big) \tag{6}$$

where $PE$ is a two-dimensional matrix whose number of columns equals the dimension of the previously generated word vectors: its rows represent the words, its columns the value of each word's position vector in each dimension, and the total number of columns equals the total dimension of the word vector. $pos$ is the index of a word in the medical text, $d$ is the total dimension of the position vector, and $k$ denotes a specific dimension of the vector. $PE_{(pos,2k)}$, the value of the position vector of the word with index $pos$ in the even dimensions, is calculated with a sine function; $PE_{(pos,2k+1)}$, its value in the odd dimensions, is calculated with a cosine function. Finally, the position vector and the word vector are added to obtain the new feature vector of the word, as in formula (7):

$$\boldsymbol{x}_i = \boldsymbol{v}_i + \boldsymbol{p}_i \tag{7}$$

where $\boldsymbol{p}_i$ is the position vector of the word with index $i$, $\boldsymbol{v}_i$ the word vector of the word with index $i$ obtained as a center word, and $\boldsymbol{x}_i$ the new feature vector with position information embedded. The purpose of embedding position vectors into the word vectors is to prepare for the subsequent attention calculation: if attention were computed between one word of a medical text and two other words with identical content but different positions, the same attention result would be obtained without position vectors, although the degree of association of the word with the two others differs; position vectors must therefore be used to represent the positional relation of each character.
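The sinusoidal position matrix of formulas (5)-(7) can be sketched directly; an even vector dimension is assumed so the sine and cosine halves have equal width:

```python
import numpy as np

def position_matrix(max_len, dim):
    """PE matrix of formulas (5)-(6): rows are word positions, columns dimensions."""
    pe = np.zeros((max_len, dim))
    pos = np.arange(max_len)[:, None]
    angle = pos / np.power(10000.0, np.arange(0, dim, 2) / dim)
    pe[:, 0::2] = np.sin(angle)      # even dimensions: sine, formula (5)
    pe[:, 1::2] = np.cos(angle)      # odd dimensions: cosine, formula (6)
    return pe

word_vecs = np.random.default_rng(0).normal(size=(8, 64))   # 8 characters, dimension 64
features = word_vecs + position_matrix(8, 64)                # formula (7)
```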
Step 3.3: use the attention mechanism to learn the long-distance dependency relationships between characters, so that each word vector contains information about all other characters in the sentence. The word vectors produced by the word embedding layer are trained only within a limited window around each center word and cannot learn the dependency relationships of long-distance characters; adding an attention mechanism lets each word vector learn its dependency on all other characters in the sentence. The attention mechanism is calculated as in formula (8):

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{8}$$

where $\text{Attention}(Q, K, V)$ denotes the attention scoring function, $Q$ the query vector, $K$ the key vector, and $V$ the value vector; $Q$, $K$, and $V$ are obtained by multiplying the word vectors with the corresponding weight matrices. $\sqrt{d_k}$, the square root of the dimension of the key vector, prevents the product from becoming too large. softmax is the normalized exponential function, whose specific mathematical expression is formula (9):

$$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}} \tag{9}$$

where $x$ denotes an array, $x_i$ the $i$-th element of the array, and $\text{softmax}(x)_i$ the ratio of the exponential of the $i$-th element to the sum of the exponentials of all elements.

The output of the attention layer is the word vector finally generated by the language preprocessing model, which completes the training of the attention-based model. The softmax function scores and normalizes all characters in the text, each character receiving a positive score and the scores summing to 1; formula (8) is therefore essentially a weighted sum of the value vectors of all characters in the text, the softmax values being the weight coefficients of the corresponding value vectors. The finally generated word vector contains the position information of the word and its dependency on every other character in the sentence, enhancing the semantics of the word and improving the accuracy of the medical entity recognition model.
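Formulas (8) and (9) amount to the following sketch, where the weight matrices that produce Q, K, and V are illustrative random placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # numerically stable formula (9)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Scaled dot-product attention of formula (8) over one sentence.
    X: (seq_len, dim) feature vectors from the position embedding layer."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # dot products scaled by sqrt(d_k)
    return softmax(scores) @ V                        # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 32))                          # 6 characters, dimension 32
Wq, Wk, Wv = (rng.normal(size=(32, 32)) for _ in range(3))
out = attention(X, Wq, Wk, Wv)                        # (6, 32) attention-enhanced vectors
```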
Step 4: train the medical entity recognition model with the labeled training text obtained in step 2 to obtain the trained model. As shown in FIG. 3, the medical entity recognition model comprises a bidirectional gated recurrent network layer, a multi-head attention layer, and a conditional random field layer connected in sequence. The medical text first passes through the trained language preprocessing model to generate the corresponding word vectors. The bidirectional gated recurrent layer, composed of two gated recurrent networks, encodes the word vectors bidirectionally and fully learns the context. The multi-head attention layer performs multiple attention operations on the output of the bidirectional gated recurrent layer, further learning the local features of the medical text and the multiple semantics of words; finally, the conditional random field layer generates the final label sequence, and the medical named entities are identified according to that sequence.
Step 4.1: encode the word vectors bidirectionally with the bidirectional gated recurrent network layer to fully learn the context. Named entities in the medical domain have complex structures, and subsequences of an entity may themselves be entities, e.g. 'splenectomy' and 'spleen'; characters are also strongly related to those before and after them, so the relationships of a word's context must be fully considered when training with neural networks. Traditional named entity recognition models usually encode with a bidirectional LSTM, but the LSTM has many parameters and trains slowly. The bidirectional gated recurrent layer comprises a forward gated recurrent layer and a backward gated recurrent layer: the forward layer learns the features of the following text and the backward layer those of the preceding text, so the generated vectors better capture contextual semantic information and learn the context. The gated recurrent network is a variant of the LSTM consisting of only an update gate and a reset gate, where the update gate determines how much past information is passed on to the future and the reset gate determines how much past information is forgotten. Its specific structure is shown in FIG. 4, whose node operators denote the weighting of vectors and the element-wise product of a number with a matrix; the computation is given by formulas (10)-(13):

$$z_t = \sigma\big(W_z \cdot [h_{t-1}, x_t]\big) \tag{10}$$

$$r_t = \sigma\big(W_r \cdot [h_{t-1}, x_t]\big) \tag{11}$$

$$\tilde{h}_t = \tanh\big(W_h \cdot [r_t * h_{t-1}, x_t]\big) \tag{12}$$

$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t \tag{13}$$

where $z_t$ is the output state of the update gate at time $t$; $r_t$ the output state of the reset gate at time $t$; $\tilde{h}_t$ the candidate state; $h_t$ the output state of the network at time $t$; $x_t$ the input state at the current time; $h_{t-1}$ the hidden state output by the gated recurrent network node at the previous time; $\tanh$ the hyperbolic tangent function and $\sigma$ the excitation function; $W_z$ the trainable weight parameters of the update gate, $W_r$ those of the reset gate, and $W_h$ the weight parameters used to calculate the candidate state; $[\cdot,\cdot]$ denotes the concatenation of two vectors. The update gate $z_t$ controls how much of the historical state $h_{t-1}$ is kept in the current output state $h_t$; the reset gate $r_t$ determines the degree to which the candidate state $\tilde{h}_t$ depends on the hidden state $h_{t-1}$ output by the gated recurrent network node at the previous time.
The value range of the excitation function $\sigma$ is (0, 1), and its expression is given by formula (15):

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{15}$$
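One gated-recurrent step of formulas (10)-(13) can be sketched as below; note that the placement of $z_t$ versus $(1 - z_t)$ in formula (13) varies between formulations, so the convention here is an assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # excitation function of formula (15)

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """x_t: current input state; h_prev: hidden state of the previous time;
    Wz, Wr, Wh: trainable weights of the update gate, reset gate, candidate state."""
    xh = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ xh)                                      # update gate, formula (10)
    r = sigmoid(Wr @ xh)                                      # reset gate, formula (11)
    h_cand = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))  # candidate state, formula (12)
    return (1.0 - z) * h_prev + z * h_cand                    # output state, formula (13)
```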
Step 4.2: use the multi-head attention layer to further extract multiple semantics. Medical texts exhibit word ambiguity, so a multi-head attention layer is added after the bidirectional gated recurrent network to further learn the dependency relationships of entities and capture the multiple semantics of words. Multi-head attention essentially performs the attention operation multiple times. For the output state $h_t$ of the bidirectional gated recurrent layer at time $t$, a single attention head is first computed by formula (16):

$$head_j = \text{Attention}\big(h_t W_j^{Q},\ h_t W_j^{K},\ h_t W_j^{V}\big) = \text{softmax}\!\left(\frac{(h_t W_j^{Q})(h_t W_j^{K})^{\top}}{\sqrt{d_k}}\right) h_t W_j^{V} \tag{16}$$

where $head_j$ is the result of the $j$-th attention-head calculation; with $h$ attention heads in total, the calculation is performed $h$ times; $W_j^{Q}$ are the weight parameters generating the query vector, $W_j^{K}$ those generating the key vector, and $W_j^{V}$ those generating the value vector; $\sqrt{d_k}$ is a smoothing term that adjusts the dimension and prevents the vector product from becoming too large; and softmax is the normalized exponential function. Finally, the $h$ calculation results are concatenated and linearly transformed to obtain the multi-head attention result for the output state $h_t$ of the bidirectional gated recurrent layer at each time $t$, as in formula (17):

$$M_t = \text{Concat}\big(head_1, \dots, head_h\big)\, W^{O} \tag{17}$$

where $M_t$ denotes the calculation result of the multi-head attention layer with its $h$ attention heads and $W^{O}$ is a weight parameter. The multi-head attention layer expands the medical entity recognition model's capacity to attend to different positions, further extracting the multiple semantics of words in the medical text.
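A compact sketch of formulas (16)-(17); the head count and dimensions are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def one_head(H, Wq, Wk, Wv):
    """Single attention head of formula (16) on the BiGRU output states H."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head(H, heads, Wo):
    """Formula (17): concatenate the single-head results, then apply the final
    linear transformation Wo."""
    return np.concatenate([one_head(H, *w) for w in heads], axis=-1) @ Wo

rng = np.random.default_rng(0)
H = rng.normal(size=(10, 32))                                  # 10 characters, dimension 32
heads = [tuple(rng.normal(size=(32, 8)) for _ in range(3))     # 4 heads with d_k = 8
         for _ in range(4)]
Wo = rng.normal(size=(4 * 8, 32))
M = multi_head(H, heads, Wo)                                   # (10, 32) multi-head result
```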
Step 4.3: obtain the optimal label sequence with the conditional random field layer. In the medical named entity recognition model, the bidirectional gated recurrent layer obtains only word vectors containing further contextual information; even with the multi-head attention layer added, the dependency relationships between labels cannot be considered — for example, a label (I) must follow a label (B). The invention therefore adopts a conditional random field layer, which considers the adjacency relations between labels, to obtain the globally optimal label sequence. The conditional random field model is a classical discriminative probabilistic undirected graph model often applied to sequence labeling tasks. For an input sentence $x$ and a sentence label sequence $y = (y_1, \dots, y_n)$, the score is:

$$score(x, y) = \sum_{i=0}^{n} A_{y_i,\,y_{i+1}} + \sum_{i=1}^{n} P_{i,\,y_i} \tag{18}$$

where $score(x, y)$ is the scoring function for the input sentence $x$ generating the label sequence $y$; $n$ is the sequence length; $A$ is the transfer scoring matrix, $A_{y_i, y_{i+1}}$ being the score of transferring from label $y_i$ to label $y_{i+1}$; $y_0$ and $y_{n+1}$ are the start and end labels of the sentence, which are only temporarily added during computation; and $P_{i, y_i}$ is the probability that the $i$-th word is labeled $y_i$. Normalization yields the maximum probability of the label sequence $y$, as in formula (19):

$$P(y \mid x) = \frac{\exp\big(score(x, y)\big)}{\sum_{\tilde{y} \in Y_x} \exp\big(score(x, \tilde{y})\big)} \tag{19}$$

where $y$ denotes the actual label sequence and $Y_x$ the set of all possible label sequences. Maximum likelihood estimation is used to solve the minimized second loss function of the medical entity recognition model, as in formula (20):

$$\mathcal{L}_2 = -\log P(y \mid x) = -score(x, y) + \log \sum_{\tilde{y} \in Y_x} \exp\big(score(x, \tilde{y})\big) \tag{20}$$

where $\mathcal{L}_2$ denotes the second loss function value. The medical entity recognition model is trained iteratively until $\mathcal{L}_2$ is less than the second threshold $\varepsilon_2$, a preset constant; then the Viterbi algorithm is used to obtain the globally optimal sequence, which is the labeling result of the final medical-domain named entity recognition, as in formula (21):

$$y^{*} = \underset{\tilde{y} \in Y_x}{\arg\max}\ score(x, \tilde{y}) \tag{21}$$
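Viterbi decoding over the CRF scores of formula (18) can be sketched as follows; the emission and transition matrices below are illustrative stand-ins for the trained $P$ and $A$:

```python
import numpy as np

def viterbi(emissions, transitions):
    """Formula (21): the tag sequence maximizing the score of formula (18).
    emissions: (seq_len, n_tags) per-character tag scores P[i, y_i];
    transitions: (n_tags, n_tags) transfer scoring matrix A."""
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((seq_len, n_tags), dtype=int)
    for t in range(1, seq_len):
        total = score[:, None] + transitions + emissions[t]  # every tag-to-tag transition
        back[t] = total.argmax(axis=0)                       # best predecessor per tag
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):                      # backtrack the optimal path
        path.append(int(back[t, path[-1]]))
    return path[::-1]

tags = ["B", "I", "O"]
e = np.log([[.1, .1, .8], [.7, .2, .1], [.1, .8, .1], [.2, .2, .6]])  # 4 characters
A = np.log([[.1, .8, .1], [.3, .4, .3], [.5, .1, .4]])                # B->I is likely
print([tags[i] for i in viterbi(e, A)])   # ['O', 'B', 'I', 'O']
```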
Finally, the medical named entities in the text are identified according to the tag sequence: a character labeled (B) is the first character of a medical named entity, a character labeled (I) is a non-initial part of a medical named entity, and a character labeled (O) does not belong to any medical named entity. For example, for the input text 'persistent fever for four days', the two characters forming the word 'fever' are labeled (B) and (I) and all other characters are labeled (O), so the medical named entity 'fever' is identified according to the labels.
Step 5: at recognition time, feed the segmentation characters of the medical text for recognition into the trained attention-based language preprocessing model to generate word vectors, then feed the generated word vectors into the trained medical entity recognition model to recognize the medical named entities in the text.
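An end-to-end sketch of step 5. The two model callables are hypothetical interfaces standing in for the trained preprocessing and recognition models; only the tag-to-entity assembly is concrete:

```python
def recognize(text, preprocess_model, ner_model):
    """preprocess_model: characters -> semantically enhanced word vectors (step 3);
    ner_model: word vectors -> BIO tag sequence from the CRF layer (step 4)."""
    chars = list(text)                      # character-level segmentation, step 1
    tags = ner_model(preprocess_model(chars))
    entities, current = [], []
    for ch, tag in zip(chars, tags):
        if tag == "B":                      # first character of an entity
            if current:
                entities.append("".join(current))
            current = [ch]
        elif tag == "I" and current:        # non-initial character of an entity
            current.append(ch)
        else:                               # 'O' ends any open entity
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities                         # the recognized medical named entities
```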
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.
Claims (4)
1. A Chinese medical named entity recognition method is characterized by comprising the following steps:
step 1, performing character-level segmentation on a medical text for training to obtain segmentation characters of the medical text for training; performing character level segmentation on the medical text for identification to obtain medical text segmentation characters for identification;
step 2, labeling the segmentation characters of the medical text for training to obtain a labeled medical text for training, wherein the starting characters of the medical named entities are labeled as 'B', the non-starting characters of the medical named entities are labeled as 'I', and the characters which are not entities are labeled as 'O';
step 3, training the language preprocessing model based on the attention mechanism by using the labeled medical text for training obtained in the step 2 to obtain a trained language preprocessing model based on the attention mechanism; the language preprocessing model based on the attention mechanism comprises a word embedding layer, a position vector embedding layer and an attention mechanism layer which are sequentially connected;
step 3.1, sending the labeled medical text for training obtained in step 2, in sentence units, into the word embedding layer of the attention-based language preprocessing model; the word embedding layer generates the word vector of each word using a skip-gram model; the skip-gram model predicts the surrounding words using a center word: for a medical text of length $L$, the $t$-th word in the text sequence is expressed as $w_t$, and the probability that a given center word generates all of its background words is maximized:

$$\prod_{t=1}^{L}\ \prod_{-m \le j \le m,\ j \ne 0} P(w_{t+j} \mid w_t) \tag{1}$$

wherein the outer product indicates that the probability is calculated starting from the first word in the text, the inner product means that for each center word the occurrence probability of all background words whose distance from it does not exceed the window size $m$ is calculated, and $P(w_{t+j} \mid w_t)$ denotes the occurrence probability of the background word $w_{t+j}$ when $w_t$ is the center word; formula (1) is equivalent to minimizing the first loss function:

$$\mathcal{L}_1 = -\sum_{t=1}^{L}\ \sum_{-m \le j \le m,\ j \ne 0} \log P(w_{t+j} \mid w_t) \tag{2}$$

suppose a center word $w_c$ has index $c$ in the text and a background word $w_o$ has index $o$ in the text; the conditional probability of a given center word generating a background word in the first loss function, normalized by the normalized exponential function, is:

$$P(w_o \mid w_c) = \frac{\exp(\boldsymbol{u}_o^{\top} \boldsymbol{v}_c)}{\sum_{i \in V} \exp(\boldsymbol{u}_i^{\top} \boldsymbol{v}_c)} \tag{3}$$

wherein $\boldsymbol{v}_c$ denotes the vector of the center word with index $c$, $\boldsymbol{u}_o$ the vector of the background word with index $o$, $\boldsymbol{u}_o^{\top}$ the transpose of the background word vector, $\boldsymbol{u}_o^{\top}\boldsymbol{v}_c$ the dot product of the two vectors, and $\exp$ the exponential function with the natural constant $e$ as base; stochastic gradient descent is used to solve the gradient of the center word vector $\boldsymbol{v}_c$ in the above formula:

$$\frac{\partial \log P(w_o \mid w_c)}{\partial \boldsymbol{v}_c} = \boldsymbol{u}_o - \sum_{j \in V} P(w_j \mid w_c)\, \boldsymbol{u}_j \tag{4}$$

the attention-based language preprocessing model is trained iteratively using formula (4) until the first loss function value $\mathcal{L}_1$ is less than the first threshold $\varepsilon_1$; after training, each word in the medical text with index $i$ obtains its vector $\boldsymbol{v}_i$ as a center word;
step 3.2, sending the word vectors generated by the word embedding layer to the position vector embedding layer, which uses position vectors to represent the positional relation of each character and superimposes the word vector and the position vector to obtain the new feature vector of the word; the position vector is calculated as in formulas (5) and (6):

$$PE_{(pos,\,2k)} = \sin\!\big(pos / 10000^{2k/d}\big) \tag{5}$$

$$PE_{(pos,\,2k+1)} = \cos\!\big(pos / 10000^{2k/d}\big) \tag{6}$$

wherein $PE$ is a two-dimensional matrix whose number of columns equals the dimension of the previously generated word vectors, its columns representing the position vector of each word in each dimension and the total number of columns equaling the total dimension of the word vector; $pos$ is the index of a word in the medical text, $d$ the total dimension of the position vector, and $k$ a specific dimension of the vector; $PE_{(pos,2k)}$, the value of the position vector of the word with index $pos$ in the even dimensions, is calculated using a sine function; $PE_{(pos,2k+1)}$, its value in the odd dimensions, is calculated using a cosine function; finally, the position vector and the word vector are added to obtain the new feature vector of the word, as in formula (7):

$$\boldsymbol{x}_i = \boldsymbol{v}_i + \boldsymbol{p}_i \tag{7}$$

wherein $\boldsymbol{p}_i$ is the position vector of the word with index $i$, $\boldsymbol{v}_i$ the word vector of the word with index $i$ obtained as a center word, and $\boldsymbol{x}_i$ the new feature vector with position information embedded;
step 3.3, learning the long-distance dependency relationships between characters using the attention mechanism, so that each word vector contains information about all other characters in the sentence; the output of the attention layer is the finally generated word vector, completing the training of the attention-based language preprocessing model;

the attention mechanism is calculated as in formula (8):

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{8}$$

wherein $\text{Attention}(Q, K, V)$ denotes the attention score, $Q$ the query vector, $K$ the key vector, $V$ the value vector, $\sqrt{d_k}$ the square root of the dimension of the key vector, and softmax the normalized exponential function:

$$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}} \tag{9}$$

wherein $x$ denotes an array, $x_i$ the $i$-th element of the array, and $\text{softmax}(x)_i$ the ratio of the exponential of the $i$-th element to the sum of the exponentials of all elements;
step 4, training the medical entity recognition model using the labeled medical text for training obtained in step 2 to obtain the trained medical entity recognition model, wherein the medical entity recognition model comprises a bidirectional gated recurrent network layer, a multi-head attention layer, and a conditional random field layer which are sequentially connected;
step 4.1, carrying out bidirectional coding on the word vectors using the bidirectional gated recurrent network layer, which comprises a forward gated recurrent layer and a backward gated recurrent layer, wherein the forward layer learns the features of the following text and the backward layer the features of the preceding text, so that the generated vectors better capture contextual semantic information and learn the context; the gated recurrent layer consists of only an update gate and a reset gate, wherein the update gate determines how much past information is passed on to the future and the reset gate determines how much past information is forgotten; the gated recurrent layer is calculated as in formulas (10)-(13):

$$z_t = \sigma\big(W_z \cdot [h_{t-1}, x_t]\big) \tag{10}$$

$$r_t = \sigma\big(W_r \cdot [h_{t-1}, x_t]\big) \tag{11}$$

$$\tilde{h}_t = \tanh\big(W_h \cdot [r_t * h_{t-1}, x_t]\big) \tag{12}$$

$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t \tag{13}$$

wherein $z_t$ is the output state of the update gate at time $t$, $r_t$ the output state of the reset gate at time $t$, $\tilde{h}_t$ the candidate state, $h_t$ the output state of the network at time $t$, $x_t$ the input state at the current time, $h_{t-1}$ the hidden state output by the gated recurrent network node at the previous time, $\tanh$ the hyperbolic tangent function, $\sigma$ the excitation function, $W_z$ the trainable weight parameters of the update gate, $W_r$ the trainable weight parameters of the reset gate, $W_h$ the weight parameters used to calculate the candidate state, and $[\cdot,\cdot]$ the concatenation of two vectors; the update gate $z_t$ controls how much of the historical state $h_{t-1}$ is kept in the current output state $h_t$, and the reset gate $r_t$ determines the degree to which the candidate state $\tilde{h}_t$ depends on the hidden state $h_{t-1}$ output by the gated recurrent network node at the previous time;
step 4.2, using the multi-head attention layer to further extract multiple semantics: multi-head attention essentially performs two or more attention-head operations; for the output state $h_t$ of the bidirectional gated recurrent layer at time $t$, a single attention head is first calculated by formula (16):

$$head_j = \text{Attention}\big(h_t W_j^{Q},\ h_t W_j^{K},\ h_t W_j^{V}\big) = \text{softmax}\!\left(\frac{(h_t W_j^{Q})(h_t W_j^{K})^{\top}}{\sqrt{d_k}}\right) h_t W_j^{V} \tag{16}$$

wherein $head_j$ is the result of the $j$-th attention-head calculation with $h$ attention heads in total, $W_j^{Q}$ are the weight parameters generating the query vector, $W_j^{K}$ those generating the key vector, $W_j^{V}$ those generating the value vector, $\sqrt{d_k}$ is a smoothing term adjusting the dimension, and softmax is the normalized exponential function; finally, the $h$ calculation results are concatenated and linearly transformed to obtain the multi-head attention result for the output state $h_t$ of the bidirectional gated recurrent layer at each time $t$, as in formula (17):

$$M_t = \text{Concat}\big(head_1, \dots, head_h\big)\, W^{O} \tag{17}$$

wherein $M_t$ denotes the calculation result of the multi-head attention layer and $W^{O}$ is a weight parameter;
step 4.3, obtaining the optimal label sequence using the conditional random field layer: for an input sentence $x$ and a sentence label sequence $y = (y_1, \dots, y_n)$, the score is:

$$score(x, y) = \sum_{i=0}^{n} A_{y_i,\,y_{i+1}} + \sum_{i=1}^{n} P_{i,\,y_i} \tag{18}$$

wherein $score(x, y)$ is the scoring function for the input sentence $x$ generating the label sequence $y$, $n$ is the sequence length, $A$ is the transfer scoring matrix with $A_{y_i, y_{i+1}}$ representing the score of transferring from label $y_i$ to label $y_{i+1}$, $y_0$ and $y_{n+1}$ are the start and end labels of the sentence, and $P_{i, y_i}$ is the probability that the $i$-th word is labeled $y_i$; normalization yields the maximum probability of the label sequence $y$, as in formula (19):

$$P(y \mid x) = \frac{\exp\big(score(x, y)\big)}{\sum_{\tilde{y} \in Y_x} \exp\big(score(x, \tilde{y})\big)} \tag{19}$$

wherein $y$ denotes the actual label sequence and $Y_x$ the set of all possible label sequences;

solving the minimized second loss function of the medical entity recognition model using maximum likelihood estimation, as in formula (20):

$$\mathcal{L}_2 = -\log P(y \mid x) = -score(x, y) + \log \sum_{\tilde{y} \in Y_x} \exp\big(score(x, \tilde{y})\big) \tag{20}$$

wherein $\mathcal{L}_2$ denotes the second loss function value; the medical entity recognition model is trained iteratively until the second loss function value $\mathcal{L}_2$ is less than the second threshold $\varepsilon_2$; then the globally optimal sequence, which is the labeling result of the final medical-domain named entity recognition, is obtained using the Viterbi algorithm;
finally, identifying medical named entities in the text according to the label sequence; wherein if the character is marked as (B), it represents that it is the first character of the medical named entity, if the character is marked as (I), it represents that it is the non-beginning part of the medical named entity, if the character is marked as (O), it represents that it is not the medical named entity;
step 5, during recognition, importing the segmentation characters of the medical text for recognition into the trained attention-based language preprocessing model to generate word vectors, and importing the generated word vectors into the trained medical entity recognition model to recognize the medical named entities in the text.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110157254.4A CN112487820B (en) | 2021-02-05 | 2021-02-05 | Chinese medical named entity recognition method |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110157254.4A CN112487820B (en) | 2021-02-05 | 2021-02-05 | Chinese medical named entity recognition method |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN112487820A (en) | 2021-03-12 |
| CN112487820B (en) | 2021-05-25 |
Family

ID=74912336

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110157254.4A (Active) CN112487820B (en) | Chinese medical named entity recognition method | 2021-02-05 | 2021-02-05 |

Country Status (1)

| Country | Link |
|---|---|
| CN (1) | CN112487820B (en) |
Families Citing this family (9)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113033207B (en) * | 2021-04-07 | 2023-08-29 | 东北大学 | Biomedical nested type entity identification method based on layer-by-layer perception mechanism |
| CN113221533B (en) * | 2021-04-29 | 2024-07-05 | 支付宝(杭州)信息技术有限公司 | Label extraction method, device and equipment for experience sound |
| CN113241128B (en) * | 2021-04-29 | 2022-05-13 | 天津大学 | Molecular property prediction method based on molecular space position coding attention neural network model |
| CN113051897B (en) * | 2021-05-25 | 2021-09-10 | 中国电子科技集团公司第三十研究所 | GPT2 text automatic generation method based on Performer structure |
| CN113223656A (en) * | 2021-05-28 | 2021-08-06 | 西北工业大学 | Medicine combination prediction method based on deep learning |
| CN114239585B (en) * | 2021-12-17 | 2024-06-21 | 安徽理工大学 | Biomedical nested named entity recognition method |
| CN114692636B (en) * | 2022-03-09 | 2023-11-03 | 南京海泰医疗信息系统有限公司 | Nested named entity identification method based on relationship classification and sequence labeling |
| CN114332872B (en) * | 2022-03-14 | 2022-05-24 | 四川国路安数据技术有限公司 | Contract document fault-tolerant information extraction method based on graph attention network |
| CN115796407B (en) * | 2023-02-13 | 2023-05-23 | 中建科技集团有限公司 | Production line fault prediction method and related equipment |
Patent Citations (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112115721A (en) * | 2020-09-28 | 2020-12-22 | 青岛海信网络科技股份有限公司 | Named entity identification method and device |
Family Cites Families (4)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111368541B (en) * | 2018-12-06 | 2024-06-11 | 北京搜狗科技发展有限公司 | Named entity identification method and device |
| CN110781683B (en) * | 2019-11-04 | 2024-04-05 | 河海大学 | Entity relation joint extraction method |
| CN111626056B (en) * | 2020-04-11 | 2023-04-07 | 中国人民解放军战略支援部队信息工程大学 | Chinese named entity identification method and device based on RoBERTA-BiGRU-LAN model |
| CN111783466A (en) * | 2020-07-15 | 2020-10-16 | 电子科技大学 | Named entity identification method for Chinese medical records |

2021-02-05: application CN202110157254.4A filed in China; granted as patent CN112487820B (status: active)
Also Published As

| Publication number | Publication date |
|---|---|
| CN112487820A (en) | 2021-03-12 |
Legal Events

| Date | Code | Title |
|---|---|---|
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |
| | GR01 | Patent grant |