CN112487820B - Chinese medical named entity recognition method - Google Patents


Info

Publication number
CN112487820B
CN112487820B (application CN202110157254.4A)
Authority
CN
China
Prior art keywords
word
vector
medical
text
function
Prior art date
Legal status
Active
Application number
CN202110157254.4A
Other languages
Chinese (zh)
Other versions
CN112487820A (en)
Inventor
司逸晨
管有庆
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202110157254.4A
Publication of CN112487820A
Application granted
Publication of CN112487820B
Legal status: Active

Classifications

    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G06F40/30 Semantic analysis
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese medical named entity recognition method. A feature vector for each character of a medical text is generated by an attention-based language preprocessing model, a final label sequence is generated by a medical entity recognition model built on a bidirectional gated recurrent network, and the medical named entities are recognized from the label sequence. Before entity recognition, the attention-based language preprocessing model produces semantically enhanced character vectors, and a multi-head attention layer added to the medical entity recognition model extracts the multiple senses of characters in the medical text.

Description

Chinese medical named entity recognition method
Technical Field
The invention relates to a medical named entity recognition method, and belongs to the technical field of named entity recognition in natural language processing.
Background
Natural language processing is a popular research direction in recent years; it aims to enable computers to understand human language and interact with people effectively. Named entity recognition is one of its key technologies: it identifies entities with specific meanings in a sentence, such as names of people, places, organizations and other proper nouns. Named entity recognition tasks can be divided into general-domain recognition and domain-specific recognition, for example in the financial, medical and military domains.
Early named entity recognition in the medical domain mainly relied on dictionaries and rules: entities were recognized with manually built medical dictionaries and hand-crafted recognition rules. Later, statistical machine learning methods were applied to medical named entity recognition, most commonly conditional random field models. In recent years, with the large increase in hardware computing power, methods based on deep neural networks have been widely applied to medical named entity recognition, the most common being a combined model of a bidirectional long short-term memory network and a conditional random field.
Disclosure of Invention
The purpose of the invention is as follows: to address the highly specialized terminology of named entities in medical texts, the nesting of entities within one another, and the polysemy of characters in the prior art, the invention provides a Chinese medical named entity recognition method. Because the medical domain lacks high-quality labeled data and a long short-term memory network has many parameters and long training times, the invention replaces the bidirectional long short-term memory network with a bidirectional gated recurrent network to increase the speed of entity recognition.
The technical scheme is as follows: to achieve the above purpose, the invention adopts the following technical scheme.
A Chinese medical named entity recognition method comprises an attention-based language preprocessing model and a medical entity recognition model. In the language preprocessing model an attention mechanism is introduced so that the generated character vectors learn long-distance dependencies between characters and their semantic features are enhanced. For text containing Chinese medical information, such as electronic medical records, prescriptions and physical examination reports, the text is first segmented into characters, and a vector for each character is then generated by the attention-based language preprocessing model. In the medical entity recognition model, a bidirectional gated recurrent network replaces the bidirectional long short-term memory network to increase training speed, and a multi-head attention layer is added to further extract the multiple senses of characters and improve the accuracy of medical named entity recognition; finally, a conditional random field generates the final label sequence, and the medical named entities in the text are recognized from that label sequence. The method is mainly applied to medical information extraction and has important application value in fields such as Chinese medical robots and Chinese medical knowledge graphs. Traditional named entity recognition methods are generally based on a bidirectional long short-term memory network and a conditional random field; the bidirectional long short-term memory network cannot process data in parallel and trains slowly, and such methods lack a good way to cope with the strong domain specificity and mutual nesting of entities in Chinese medical text. The invention therefore replaces the bidirectional long short-term memory network with a bidirectional gated recurrent network to increase training speed, trains characters with an attention-based language preprocessing model to generate character vectors with enhanced semantic representations, and adds a multi-head attention layer after the bidirectional gated recurrent network layer of the medical entity recognition model to further mine the local features of the medical text and the multiple senses of characters, improving both the accuracy and the efficiency of Chinese medical named entity recognition. The method specifically comprises the following steps:
Step 1, performing character-level segmentation on the medical text for training to obtain the segmented characters of the training medical text, and performing character-level segmentation on the medical text for recognition to obtain the segmented characters of the medical text for recognition.
Step 2, labeling the segmented characters of the training medical text to obtain the labeled training medical text, wherein the first character of a medical named entity is labeled 'B', a non-initial character of a medical named entity is labeled 'I', and a character that does not belong to any entity is labeled 'O'.
Step 3, training the attention-based language preprocessing model with the labeled training medical text obtained in step 2 to obtain the trained attention-based language preprocessing model. The attention-based language preprocessing model comprises a word embedding layer, a position vector embedding layer and an attention mechanism layer connected in sequence.
Step 3.1, sending the labeled training medical text obtained in step 2, sentence by sentence, into the word embedding layer of the attention-based language preprocessing model. The word embedding layer generates a vector for each character with a skip-gram model, which uses a center character to predict the characters around it. For a medical text of length L, the character at index i of the text sequence is written w_i, and the probability that a given center character generates all of its background characters is maximized:

\prod_{i=1}^{L} \prod_{-m \le j \le m,\; j \ne 0} P(w_{i+j} \mid w_i)    (1)

where \prod_{i=1}^{L} indicates that the probability is computed starting from the first character of the text, \prod_{-m \le j \le m,\, j \ne 0} computes, for each center character, the probability of occurrence of every background character whose distance from it does not exceed m, m is the window size, and P(w_{i+j} \mid w_i) is the probability of the background character w_{i+j} given w_i as the center character with window size m. Maximizing equation (1) is equivalent to minimizing the first loss function:

-\sum_{i=1}^{L} \sum_{-m \le j \le m,\; j \ne 0} \log P(w_{i+j} \mid w_i)    (2)

where \log denotes the logarithmic loss function.

Suppose the center character w_i has index i in the text and the background character w_o has index o in the text. The conditional probability in the first loss function of the center character generating a background character, normalized by the normalized exponential function softmax, is:

P(w_o \mid w_i) = \frac{\exp(u_o^{\top} v_i)}{\sum_{k \in V} \exp(u_k^{\top} v_i)}    (3)

where v_i is the vector of the center character with index i, u_o is the vector of the background character with index o, u_o^{\top} is the transpose of the background word vector, u_o^{\top} v_i is the dot product of the two vectors, \sum_{k \in V} performs the dot product with each character k in the vocabulary V of the text, and \exp is the exponential function with the natural constant e as base. Stochastic gradient descent is used, so the gradient of the center word vector v_i in the above equation is solved:

\frac{\partial \log P(w_o \mid w_i)}{\partial v_i} = u_o - \sum_{j \in V} P(w_j \mid w_i)\, u_j    (4)

The attention-based language preprocessing model is trained iteratively with equation (4) until the first loss function value is smaller than a first threshold. After training, every character with index i in the medical text obtains its vector v_i as a center character.
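For illustration, a minimal NumPy sketch of the skip-gram objective of equations (1)-(4) is given below; the function names, window size and toy vocabulary are assumptions made for the example and are not specified by the method.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def skipgram_step(center_id, context_id, V, U, lr=0.05):
    """One SGD update for a (center character, background character) pair.
    V[i] is the vector of character i used as a center word and U[o] the
    vector of character o used as a background word (equation (3))."""
    scores = U @ V[center_id]            # u_k . v_i for every k in the vocabulary
    probs = softmax(scores)              # P(w_k | w_i), equation (3)
    loss = -np.log(probs[context_id])    # one term of the loss in equation (2)

    # Equation (4): d log P(w_o | w_i) / d v_i = u_o - sum_j P(w_j | w_i) u_j
    grad_v = -(U[context_id] - probs @ U)
    grad_U = np.outer(probs, V[center_id])
    grad_U[context_id] -= V[center_id]

    V[center_id] -= lr * grad_v
    U -= lr * grad_U
    return loss

# Toy usage: characters of one segmented sentence mapped to ids 0..5
rng = np.random.default_rng(0)
vocab_size, dim, m = 6, 8, 2             # m is the window size of equation (1)
V = rng.normal(scale=0.1, size=(vocab_size, dim))
U = rng.normal(scale=0.1, size=(vocab_size, dim))
sentence = [0, 1, 2, 3, 4, 5]
for _ in range(50):
    for i, c in enumerate(sentence):
        for j in range(max(0, i - m), min(len(sentence), i + m + 1)):
            if j != i:
                skipgram_step(c, sentence[j], V, U)
```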
Step 3.2, passing the character vectors generated by the word embedding layer to the position vector embedding layer. The position vector embedding layer uses position vectors to express the positional relation of each character, and the character vector and the position vector are superposed to obtain a new feature vector for the character. The position vector is computed as in equations (5) and (6):

PE_{(i,\,2k)} = \sin\!\left(i / 10000^{2k/d}\right)    (5)

PE_{(i,\,2k+1)} = \cos\!\left(i / 10000^{2k/d}\right)    (6)

where PE is a two-dimensional matrix whose number of columns equals the dimension of the previously generated character vectors: each row of PE corresponds to one character, each column is the value of that character's position vector in one dimension, and the total number of columns equals the total dimension of the character vector. d is the total dimension of the position vector, k indexes a specific dimension of the vector, PE_{(i, 2k)} is the value of the position vector of the character with index i in the even dimensions, computed with a sine function, and PE_{(i, 2k+1)} is the value of the position vector of the character with index i in the odd dimensions, computed with a cosine function. Finally, the position vector and the character vector are added to obtain the new feature vector of the character, as in equation (7):

x_i = v_i + PE_i    (7)

where PE_i is the position vector of the character with index i, v_i is the vector of the character with index i as a center word, and x_i is the new feature vector with the position information embedded.
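A short NumPy sketch of the sinusoidal position vectors of equations (5)-(7) follows; it assumes an even vector dimension, which is an assumption of the example rather than a requirement stated here.

```python
import numpy as np

def positional_encoding(seq_len, dim):
    """Position vectors from equations (5)-(6); dim is assumed to be even.
    PE[i, 2k] = sin(i / 10000**(2k/dim)), PE[i, 2k+1] = cos(i / 10000**(2k/dim))."""
    pe = np.zeros((seq_len, dim))
    pos = np.arange(seq_len)[:, None]          # character index i
    two_k = np.arange(0, dim, 2)[None, :]      # even dimension index 2k
    angle = pos / np.power(10000.0, two_k / dim)
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

# Equation (7): new feature vector = word vector + position vector
seq_len, dim = 6, 8
word_vectors = np.random.default_rng(0).normal(size=(seq_len, dim))
features = word_vectors + positional_encoding(seq_len, dim)
```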
Step 3.3, learning the long-distance dependencies between characters with the attention mechanism, so that each character vector contains information about all other characters in the sentence. The output of the attention mechanism layer is the finally generated character vector, which completes the training of the attention-based language preprocessing model.
Step 4, training the medical entity recognition model with the labeled training medical text obtained in step 2 to obtain the trained medical entity recognition model, wherein the medical entity recognition model comprises a bidirectional gated recurrent network layer, a multi-head attention layer and a conditional random field layer connected in sequence.
Step 4.1, encoding the character vectors in both directions with the bidirectional gated recurrent network layer. It comprises a forward gated recurrent network layer and a backward gated recurrent network layer; the forward layer learns features of the following text and the backward layer learns features of the preceding text, so the generated vectors better capture contextual semantic information and learn the context. The gated recurrent network layer consists only of an update gate and a reset gate: the update gate determines how much past information is passed on to the future, and the reset gate determines how much past information is forgotten. The gated recurrent network layer is computed as in equations (10)-(13):

z_t = \sigma(W_z \cdot [h_{t-1}, x_t])    (10)

r_t = \sigma(W_r \cdot [h_{t-1}, x_t])    (11)

\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t])    (12)

h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t    (13)

where z_t is the output state of the update gate at time t, r_t is the output state of the reset gate at time t, \tilde{h}_t is the candidate state, h_t is the output state of the network at time t, x_t is the input state at the current time, h_{t-1} is the hidden state output by the gated recurrent network node at the previous time, \tanh denotes the hyperbolic tangent function, \sigma denotes the excitation function, W_z is the weight parameter trained for the update gate, W_r is the weight parameter trained for the reset gate, W_h is the weight parameter used to compute the candidate state \tilde{h}_t, and [\,\cdot\,,\,\cdot\,] denotes the concatenation of two vectors. The update gate z_t controls how much of the historical state h_{t-1} is kept in the output state h_t of the network at the current time, and the reset gate r_t determines how much the candidate state \tilde{h}_t depends on the hidden state h_{t-1} output by the gated recurrent network node at the previous time.
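The following is a minimal NumPy sketch of one gated recurrent unit step as reconstructed in equations (10)-(13); bias terms and the exact gate convention are assumptions of the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One gated recurrent unit step following equations (10)-(13);
    [a, b] denotes concatenation and bias terms are omitted."""
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ concat)                                   # update gate, eq. (10)
    r_t = sigmoid(W_r @ concat)                                   # reset gate, eq. (11)
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))   # candidate state, eq. (12)
    return z_t * h_prev + (1.0 - z_t) * h_cand                    # output state, eq. (13)

# Toy usage over a sequence of character feature vectors
rng = np.random.default_rng(0)
in_dim, hid_dim = 8, 16
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(hid_dim, hid_dim + in_dim)) for _ in range(3))
h = np.zeros(hid_dim)
for x in rng.normal(size=(6, in_dim)):
    h = gru_step(x, h, W_z, W_r, W_h)
```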
Step 4.2, using a multi-head attention layer to further extract multiple senses: a multi-head attention layer essentially performs several attention operations on the output state h_t of the bidirectional gated recurrent network layer at time t. First, a single attention head is computed with equation (16):

\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{(h_t W_i^{Q})(h_t W_i^{K})^{\top}}{\sqrt{d_k}}\right)(h_t W_i^{V})    (16)

where head_i is the result computed by the i-th attention head, h is the number of attention heads, W_i^{Q} is the weight parameter that generates the query vector, W_i^{K} is the weight parameter that generates the key vector, W_i^{V} is the weight parameter that generates the value vector, \sqrt{d_k} is a smoothing term that adjusts for the dimension d_k, and softmax is the normalized exponential function. Finally, the h results are concatenated and linearly transformed to obtain, for each time t, the multi-head attention result of the output state h_t of the bidirectional gated recurrent network layer, as in equation (17):

\mathrm{MultiHead}(h_t) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}    (17)

where MultiHead(h_t) is the result computed by the multi-head attention layer and W^{O} is a weight parameter;
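A compact NumPy sketch of the multi-head attention of equations (16)-(17) follows; it operates on the whole matrix of output states at once, and the weight shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(H, Wq, Wk, Wv, Wo):
    """Equations (16)-(17): H holds the bidirectional GRU output states for a
    sentence, shape (seq_len, d_model); Wq/Wk/Wv have shape (heads, d_model, d_k)."""
    d_k = Wq.shape[-1]
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):
        Q, K, V = H @ Wq_i, H @ Wk_i, H @ Wv_i
        scores = Q @ K.T / np.sqrt(d_k)            # scaled dot product, eq. (16)
        heads.append(softmax(scores) @ V)          # one attention head
    return np.concatenate(heads, axis=-1) @ Wo     # concatenate and project, eq. (17)

# Toy usage: 6 characters, model width 16, 4 heads of width 4
rng = np.random.default_rng(0)
seq_len, d_model, n_heads, d_k = 6, 16, 4, 4
H = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(n_heads, d_model, d_k)) for _ in range(3))
Wo = rng.normal(scale=0.1, size=(n_heads * d_k, d_model))
out = multi_head_attention(H, Wq, Wk, Wv, Wo)      # shape (seq_len, d_model)
```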
Step 4.3, obtaining the optimal label sequence with the conditional random field layer: for an input sentence x = (x_1, x_2, \ldots, x_n) and a sentence tag sequence y = (y_1, y_2, \ldots, y_n), the score is:

\mathrm{score}(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}    (18)

where score(x, y) is the scoring function of the input sentence x generating the tag sequence y, n is the sequence length, A is the transition score matrix, A_{y_i, y_{i+1}} is the score of transitioning from tag y_i to tag y_{i+1}, y_0 and y_{n+1} are the start and end tags of the sentence, and P_{i, y_i} is the probability that the i-th character is tagged y_i. Normalizing gives the probability of the tag sequence y with the maximum probability, as in equation (19):

P(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{\tilde{y} \in Y_x} \exp(\mathrm{score}(x, \tilde{y}))}    (19)

where y is the real tag sequence and Y_x is the set of all possible tag sequences.

Maximum likelihood estimation is used, so the second loss function of the medical entity recognition model to be minimized is as in equation (20):

\mathrm{Loss} = -\log P(y \mid x)    (20)

where Loss denotes the second loss function value. The medical entity recognition model is trained iteratively until the second loss function value is smaller than a second threshold; then the globally optimal sequence is obtained with the Viterbi algorithm, and this globally optimal sequence is the final labeling result of medical-domain named entity recognition.
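For illustration, a NumPy sketch of the sequence scoring of equation (18) is given below; the tag-id mapping and the boundary-tag handling are assumptions of the example.

```python
import numpy as np

def crf_score(emissions, transitions, tags, start_tag, end_tag):
    """Score of one tag sequence, equation (18).
    emissions[i, t]   : P_{i,t}, score of labeling character i with tag t
    transitions[a, b] : A_{a,b}, score of moving from tag a to tag b
    start_tag/end_tag : boundary tags added temporarily during computation."""
    score = transitions[start_tag, tags[0]] + emissions[0, tags[0]]
    for i in range(1, len(tags)):
        score += transitions[tags[i - 1], tags[i]] + emissions[i, tags[i]]
    return score + transitions[tags[-1], end_tag]

# Toy usage with the BIO tag set plus boundary tags
tag_ids = {"B": 0, "I": 1, "O": 2, "<start>": 3, "<end>": 4}
rng = np.random.default_rng(0)
emissions = rng.normal(size=(6, 5))          # 6 characters, 5 tags
transitions = rng.normal(size=(5, 5))
gold = [2, 2, 2, 2, 0, 1]                    # O O O O B I
s = crf_score(emissions, transitions, gold, tag_ids["<start>"], tag_ids["<end>"])
```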
Finally, the medical named entities in the text are recognized from the tag sequence: a character labeled (B) is the first character of a medical named entity, a character labeled (I) is a non-initial part of a medical named entity, and a character labeled (O) is not part of a medical named entity.
Step 5, during recognition, feeding the segmented characters of the medical text for recognition into the trained attention-based language preprocessing model to generate character vectors, and feeding the generated character vectors into the trained medical entity recognition model to recognize the medical named entities in the text.
Preferably: in step 3.3, the attention mechanism is computed as in equation (8):

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V    (8)

where Attention(Q, K, V) denotes the attention score, Q denotes the query vector, K denotes the key vector, V denotes the value vector, \sqrt{d_k} denotes the square root of the dimension of the key vector, and softmax is the normalized exponential function.
Preferably: the normalized exponential function is the softmax function:

\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}    (9)

where x denotes an array, x_i denotes the i-th element of the array x, and the value of softmax(x)_i is the ratio of the exponential of that element to the sum of the exponentials of all the elements of the array.
Preferably: in step 4.1, the value range of the tanh function is (-1, 1), and its expression is given by equation (14):

\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}    (14)

where x represents the input to the function.
Preferably: in step 4.1, the excitation function \sigma is the sigmoid function, whose value range is (0, 1), with the expression given by equation (15):

\sigma(x) = \frac{1}{1 + e^{-x}}    (15)
preferably: in step 4.3, the global optimal sequence is obtained by using the viterbi algorithm, as shown in formula (21):
Figure 82909DEST_PATH_IMAGE103
(21)
wherein,
Figure 323398DEST_PATH_IMAGE104
the sequence of tags in the set that maximizes the score function.
Compared with the prior art, the invention has the following beneficial effects:
the method comprises the steps of preprocessing a text by using a language preprocessing model based on an attention mechanism and generating a corresponding word vector, bidirectionally encoding the word vector by using a bidirectional gating circulation network layer, further acquiring local features of the text and multiple semantics of an entity by using a multi-head attention layer, finally generating a final label sequence by using a conditional random field layer, and identifying a medical named entity in the text according to the label sequence, so that the problems of inaccurate identification and low identification speed of the Chinese medical named entity are solved. Semantic representation of words is enhanced by generating a word vector containing positional features of the words and associations between characters for each word in a medical text by an attention-based language pre-processing model. In the medical entity recognition model, a bidirectional gate control circulation network is used for replacing a bidirectional long-term and short-term memory network, the training overhead is reduced to a certain extent, the model training efficiency is improved, a multi-head attention layer is added, the local features of medical texts and the multiple semantics of characters are further learned, and the accuracy of medical named entity recognition is improved.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
FIG. 2 is a language pre-processing model framework based on an attention mechanism.
FIG. 3 is a medical entity recognition model framework.
FIG. 4 is a schematic diagram of a gated loop network.
Detailed Description
The present invention is further illustrated by the following description in conjunction with the accompanying drawings and the specific embodiments. It is to be understood that these examples are given solely for the purpose of illustration and are not intended to limit the scope of the invention, since various equivalent modifications will occur to those skilled in the art upon reading the present invention and will fall within the scope of the appended claims.
A Chinese medical named entity recognition method: first, a medical text is segmented and labeled and used to train the language preprocessing model; then the medical text to be recognized is fed into the trained language preprocessing model to generate semantically enhanced character vectors; the trained medical entity recognition model then generates a label sequence from the character vectors; finally, the medical named entities are recognized from the label sequence. The method specifically comprises the following steps:
Step 1, performing character-level segmentation on the medical text for training to obtain the segmented characters of the training medical text. For example, if the medical text is 'no obvious fracture', the characters 'no', 'see', 'clear', 'obvious', 'bone', 'broken' are obtained after segmentation. Likewise, character-level segmentation is performed on the medical text for recognition to obtain its segmented characters; if the input text is 'continuously heated for four days', the processed text is the characters 'continuously', 'hair', 'hot', 'four', 'day'.
Step 2, labeling the segmented characters of the training medical text to obtain the labeled training medical text, wherein the first character of a medical named entity is labeled 'B', a non-initial character of a medical named entity is labeled 'I', and a character that is not part of an entity is labeled 'O'. For the medical text 'no obvious fracture' above, the final labeling sequence is 'no (O)', 'see (O)', 'clear (O)', 'obvious (O)', 'bone (B)', 'broken (I)'; the 'BIO' labels distinguish the medical named entity and prepare for the subsequent training of the medical entity recognition model.
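For illustration, a small Python sketch of this BIO labeling step is given below; the Chinese rendering of the 'no obvious fracture' example is assumed from the character glosses above.

```python
def bio_labels(characters, entity_spans):
    """Turn character-level entity spans into B/I/O labels; entity_spans is a
    list of (start, end) index pairs with end exclusive."""
    labels = ["O"] * len(characters)
    for start, end in entity_spans:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return list(zip(characters, labels))

# The 'no obvious fracture' example: the last two characters form the entity
print(bio_labels(list("未见明显骨折"), [(4, 6)]))
# [('未', 'O'), ('见', 'O'), ('明', 'O'), ('显', 'O'), ('骨', 'B'), ('折', 'I')]
```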
Step 3, training the attention-based language preprocessing model with the labeled training medical text obtained in step 2 to obtain the trained attention-based language preprocessing model. As shown in fig. 2, the attention-based language preprocessing model includes a word embedding layer, a position vector embedding layer and an attention mechanism layer connected in sequence. For the segmented text, the word embedding layer first generates a character vector with a skip-gram model, the position vector embedding layer then learns the position information of each character by adding a position vector, and the attention mechanism layer finally learns the relation between each character and all the other characters, thereby strengthening the semantic representation of the character.
Step 3.1, sending the labeled training medical text obtained in step 2, sentence by sentence, into the word embedding layer of the attention-based language preprocessing model. The word embedding layer generates a vector for each character with a skip-gram model, which uses a center character to predict the characters around it. For a medical text of length L, the character at index i of the text sequence is written w_i, and the probability that a given center character generates all of its background characters is maximized:

\prod_{i=1}^{L} \prod_{-m \le j \le m,\; j \ne 0} P(w_{i+j} \mid w_i)    (1)

where \prod_{i=1}^{L} indicates that the probability is computed starting from the first character of the text, \prod_{-m \le j \le m,\, j \ne 0} means that for each center character the probability of occurrence of all background characters no more than m away from it is computed, m represents the window size, and the distance between a generated background character and the center character is not greater than m; P(w_{i+j} \mid w_i) is the probability of occurrence of the background character w_{i+j} given w_i as the center character with window size m. Maximizing equation (1) is equivalent to minimizing the first loss function:

-\sum_{i=1}^{L} \sum_{-m \le j \le m,\; j \ne 0} \log P(w_{i+j} \mid w_i)    (2)

where \log denotes the logarithmic loss function.

Suppose the center character w_i has index i in the text and the background character w_o has index o in the text. The conditional probability in the first loss function of the center character generating a background character is normalized by the normalized exponential function softmax as:

P(w_o \mid w_i) = \frac{\exp(u_o^{\top} v_i)}{\sum_{k \in V} \exp(u_k^{\top} v_i)}    (3)

where v_i is the vector of the center character with index i, u_o is the vector of the background character with index o, u_o^{\top} is the transpose of the background word vector, u_o^{\top} v_i is the dot product of the two vectors, \sum_{k \in V} performs the dot product with each character k in the vocabulary V of the text, and \exp is the exponential function with the natural constant e as base. Stochastic gradient descent is used, so the gradient of the center word vector v_i in the above equation is solved:

\frac{\partial \log P(w_o \mid w_i)}{\partial v_i} = u_o - \sum_{j \in V} P(w_j \mid w_i)\, u_j    (4)

The attention-based language preprocessing model is trained iteratively with equation (4) until the first loss function value is smaller than a first threshold, which is a preset constant. After training, every character with index i in the medical text obtains its vector v_i as a center character, and v_i is used as the final output vector of the word embedding layer.
Step 3.2, passing the character vectors generated by the word embedding layer to the position vector embedding layer. The position vector embedding layer uses position vectors to express the positional relation of each character, and the character vector and the position vector are superposed to obtain a new feature vector for the character. The position vector is computed as in equations (5) and (6):

PE_{(i,\,2k)} = \sin\!\left(i / 10000^{2k/d}\right)    (5)

PE_{(i,\,2k+1)} = \cos\!\left(i / 10000^{2k/d}\right)    (6)

where PE is a two-dimensional matrix whose number of columns equals the dimension of the previously generated character vectors: each row of PE corresponds to one character, each column is the value of that character's position vector in one dimension, and the total number of columns equals the total dimension of the character vector. i denotes the index of a character in the medical text, d is the total dimension of the position vector, k indexes a specific dimension of the vector, PE_{(i, 2k)} is the value of the position vector of the character with index i in the even dimensions, computed with a sine function, and PE_{(i, 2k+1)} is the value of the position vector of the character with index i in the odd dimensions, computed with a cosine function. Finally, the position vector and the character vector are added to obtain the new feature vector of the character, as in equation (7):

x_i = v_i + PE_i    (7)

where PE_i is the position vector of the character with index i, v_i is the vector of the character with index i as a center word, and x_i is the new feature vector with the position information embedded. The purpose of embedding the position vector in the character vector is to prepare for the subsequent attention calculation. If attention were computed between one character of the medical text and two other characters that have identical content but different positions, the result would be identical unless a position vector expressed the difference, even though the degree of association between that character and the two others is different; position vectors must therefore be used to express the positional relation of each character.
Step 3.3, learning the long-distance dependencies between characters with the attention mechanism, so that each character vector contains information about all other characters in the sentence. The character vectors generated by the word embedding layer are learned only from predicting the background characters near each center character, so they cannot capture dependencies between characters that are far apart; adding an attention mechanism lets a character vector learn its dependency on every other character in the sentence. The attention mechanism is computed as in equation (8):

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V    (8)

where Attention(Q, K, V) denotes the attention scoring function, Q denotes the query vector, K denotes the key vector, V denotes the value vector, and Q, K and V are obtained by multiplying the character vectors with the corresponding weight matrices. \sqrt{d_k} denotes the square root of the dimension of the key vector and is used to prevent the product from becoming too large; softmax is the normalized exponential function, whose mathematical expression is given by equation (9):

\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}    (9)

where x denotes an array, x_i denotes the i-th element of the array x, and the value of softmax(x)_i is the ratio of the exponential of that element to the sum of the exponentials of all the elements of the array.

The output of the attention mechanism layer is the character vector finally generated by the language preprocessing model. The softmax function scores all the characters in the text and normalizes the scores, so that the score of each character is positive and the scores sum to 1. Equation (8) is therefore essentially a weighted sum of the value vectors of the characters in the text, with the softmax scores as the weight coefficients of the corresponding value vectors. The output of the attention mechanism layer is the finally generated character vector, which completes the training of the attention-based language preprocessing model. The finally generated character vector contains the position information of the character and its dependencies on every other character in the sentence, which enhances the semantics of the character and improves the accuracy of the medical entity recognition model.
Step 4, training the medical entity recognition model with the labeled training medical text obtained in step 2 to obtain the trained medical entity recognition model. As shown in fig. 3, the medical entity recognition model comprises a bidirectional gated recurrent network layer, a multi-head attention layer and a conditional random field layer connected in sequence. The medical text first passes through the trained language preprocessing model to generate the corresponding character vectors. The bidirectional gated recurrent network layer consists of two gated recurrent network layers that encode the character vectors in both directions and fully learn the context. The multi-head attention layer performs several attention operations on the output of the bidirectional gated recurrent network layer to further learn the local features of the medical text and the multiple senses of characters; finally, the conditional random field layer generates the final label sequence, and the medical named entities are recognized from the label sequence.
Step 4.1, encoding the character vectors in both directions with the bidirectional gated recurrent network layer to fully learn the context. Named entities in the medical domain have complex structure: a subsequence of an entity may itself be an entity, such as 'splenectomy' and 'spleen', and adjacent characters are strongly related, so the relations of the character context must be fully considered when training with a neural network. Traditional named entity recognition models usually encode with a bidirectional long short-term memory network, but the long short-term memory network has many parameters and trains slowly. The bidirectional gated recurrent network layer comprises a forward gated recurrent network layer and a backward gated recurrent network layer; the forward layer learns features of the following text and the backward layer learns features of the preceding text, so the generated vectors better capture contextual semantic information and learn the context. The gated recurrent network is a variant of the long short-term memory network and consists only of an update gate and a reset gate: the update gate determines how much past information is passed on to the future, and the reset gate determines how much past information is forgotten. The specific structure of the gated recurrent network is shown in fig. 4, in which the operator nodes denote the weighting of vectors and the element-wise multiplication of a coefficient with a vector. The gated recurrent network is computed as in equations (10)-(13):

z_t = \sigma(W_z \cdot [h_{t-1}, x_t])    (10)

r_t = \sigma(W_r \cdot [h_{t-1}, x_t])    (11)

\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t])    (12)

h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t    (13)

where z_t is the output state of the update gate at time t, r_t is the output state of the reset gate at time t, \tilde{h}_t is the candidate state, h_t is the output state of the network at time t, x_t is the input state at the current time, h_{t-1} is the hidden state output by the gated recurrent network node at the previous time, \tanh denotes the hyperbolic tangent function, \sigma denotes the excitation function, W_z is the weight parameter trained for the update gate, W_r is the weight parameter trained for the reset gate, W_h is the weight parameter used to compute the candidate state \tilde{h}_t, and [\,\cdot\,,\,\cdot\,] denotes the concatenation of two vectors. The update gate z_t controls how much of the historical state h_{t-1} is kept in the output state h_t of the network at the current time, and the reset gate r_t determines how much the candidate state \tilde{h}_t depends on the hidden state h_{t-1} output by the gated recurrent network node at the previous time.

The value range of the tanh function is (-1, 1), and its expression is given by equation (14):

\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}    (14)

where x represents the input to the function.
The excitation function \sigma is the sigmoid function, whose value range is (0, 1), with the expression given by equation (15):

\sigma(x) = \frac{1}{1 + e^{-x}}    (15)
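A NumPy sketch of the bidirectional encoding follows: a forward and a backward gated recurrent pass whose hidden states are concatenated per character; weight shapes and initialization are assumptions of the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """Equations (10)-(13), as in the earlier single-cell sketch."""
    c = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ c)
    r = sigmoid(W_r @ c)
    h_cand = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))
    return z * h_prev + (1.0 - z) * h_cand

def bigru_encode(X, fwd_weights, bwd_weights, hid_dim):
    """Bidirectional encoding: a forward pass over the sentence and a backward
    pass over the reversed sentence, concatenated character by character."""
    def run(weights, seq):
        h, outs = np.zeros(hid_dim), []
        for x in seq:
            h = gru_step(x, h, *weights)
            outs.append(h)
        return outs
    fwd = run(fwd_weights, X)
    bwd = run(bwd_weights, X[::-1])[::-1]
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])

# Toy usage: 6 character feature vectors of width 8, hidden width 16 per direction
rng = np.random.default_rng(0)
in_dim, hid_dim = 8, 16
make_weights = lambda: tuple(rng.normal(scale=0.1, size=(hid_dim, hid_dim + in_dim)) for _ in range(3))
H = bigru_encode(rng.normal(size=(6, in_dim)), make_weights(), make_weights(), hid_dim)  # (6, 32)
```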
Step 4.2, using the multi-head attention layer to further extract multiple senses: polysemy occurs in medical text, so a multi-head attention layer is added after the bidirectional gated recurrent network to further learn the dependencies of entities and capture the multiple senses of characters. A multi-head attention layer essentially performs several attention operations on the output state h_t of the bidirectional gated recurrent network layer at time t. First, a single attention head is computed with equation (16):

\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{(h_t W_i^{Q})(h_t W_i^{K})^{\top}}{\sqrt{d_k}}\right)(h_t W_i^{V})    (16)

where head_i is the result computed by the i-th attention head, h is the number of attention heads, that is, the computation is performed h times in total, W_i^{Q} is the weight parameter that generates the query vector, W_i^{K} is the weight parameter that generates the key vector, W_i^{V} is the weight parameter that generates the value vector, \sqrt{d_k} is a smoothing term that adjusts for the dimension d_k and prevents the vector product from becoming too large, and softmax is the normalized exponential function. Finally, the h results are concatenated and linearly transformed to obtain, for each time t, the multi-head attention result of the output state h_t of the bidirectional gated recurrent network layer, as in equation (17):

\mathrm{MultiHead}(h_t) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}    (17)

where MultiHead(h_t) is the result computed by the multi-head attention layer, h is the number of attention heads, and W^{O} is a weight parameter. The multi-head attention layer expands the ability of the medical entity recognition model to attend to different positions, so that the multiple senses of characters in the medical text are further extracted.
Step 4.3, obtaining the optimal label sequence with the conditional random field layer: in the medical named entity recognition model, the bidirectional gated recurrent network layer only produces character vectors that contain further context information, and even with the multi-head attention layer added the dependencies between labels are not considered; for example, the label (I) can only follow the label (B). The invention therefore uses a conditional random field layer to take the adjacency relations between labels into account and obtain the globally optimal label sequence. The conditional random field model is a classical discriminative probabilistic undirected graph model that is often applied to sequence labeling tasks. For an input sentence x = (x_1, x_2, \ldots, x_n) and a sentence tag sequence y = (y_1, y_2, \ldots, y_n), the score is:

\mathrm{score}(x, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}    (18)

where score(x, y) is the scoring function of the input sentence x generating the tag sequence y, n is the sequence length, A is the transition score matrix, A_{y_i, y_{i+1}} is the score of transitioning from tag y_i to tag y_{i+1}, y_0 and y_{n+1} are the start and end tags of the sentence, which are only added temporarily during the computation, and P_{i, y_i} is the probability that the i-th character is tagged y_i. Normalizing gives the probability of the tag sequence y with the maximum probability, as in equation (19):

P(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{\tilde{y} \in Y_x} \exp(\mathrm{score}(x, \tilde{y}))}    (19)

where y is the real tag sequence and Y_x is the set of all possible tag sequences.

Maximum likelihood estimation is used, so the second loss function of the medical entity recognition model to be minimized is as in equation (20):

\mathrm{Loss} = -\log P(y \mid x)    (20)

where Loss denotes the second loss function value. The medical entity recognition model is trained iteratively until the second loss function value is smaller than a second threshold, which is a preset constant; then the globally optimal sequence, which is the final labeling result of medical-domain named entity recognition, is obtained with the Viterbi algorithm, as in equation (21):

y^{*} = \arg\max_{\tilde{y} \in Y_x} \mathrm{score}(x, \tilde{y})    (21)

where y^{*} is the tag sequence in the set that maximizes the scoring function.
Finally, the medical named entities in the text are recognized from the tag sequence: a character labeled (B) is the first character of a medical named entity, a character labeled (I) is a non-initial part of a medical named entity, and a character labeled (O) is not part of a medical named entity. For the input text 'continuously heated for four days' above, the characters that make up 'heated' (fever) are labeled (B) and (I) and all the other characters are labeled (O), so the medical named entity 'heated' is recognized from the label sequence.
Step 5, during recognition, feeding the segmented characters of the medical text for recognition into the trained attention-based language preprocessing model to generate character vectors, and feeding the generated character vectors into the trained medical entity recognition model to recognize the medical named entities in the text.
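A small Python sketch of this recognition step follows; the model callables are placeholders standing in for the trained preprocessing and entity recognition models, not fixed interfaces of the method.

```python
def recognize_entities(text, preprocess, bigru, attention, viterbi, id2tag):
    """End-to-end recognition as in step 5. The callables preprocess, bigru,
    attention and viterbi stand in for the trained components described above."""
    chars = list(text)                            # step 1: character-level segmentation
    vectors = preprocess(chars)                   # attention-based language preprocessing
    hidden = attention(bigru(vectors))            # bidirectional GRU + multi-head attention
    tags = [id2tag[t] for t in viterbi(hidden)]   # CRF layer / Viterbi tag sequence

    entities, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "B":                            # first character of an entity
            if current:
                entities.append(current)
            current = ch
        elif tag == "I" and current:              # continuation of the current entity
            current += ch
        else:                                     # 'O' or a stray 'I'
            if current:
                entities.append(current)
            current = ""
    if current:
        entities.append(current)
    return entities
```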
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (4)

1. A Chinese medical named entity recognition method is characterized by comprising the following steps:
step 1, performing character-level segmentation on a medical text for training to obtain segmentation characters of the medical text for training; performing character level segmentation on the medical text for identification to obtain medical text segmentation characters for identification;
step 2, labeling the segmentation characters of the medical text for training to obtain a labeled medical text for training, wherein the starting characters of the medical named entities are labeled as 'B', the non-starting characters of the medical named entities are labeled as 'I', and the characters which are not entities are labeled as 'O';
step 3, training the language preprocessing model based on the attention mechanism by using the labeled medical text for training obtained in the step 2 to obtain a trained language preprocessing model based on the attention mechanism; the language preprocessing model based on the attention mechanism comprises a word embedding layer, a position vector embedding layer and an attention mechanism layer which are sequentially connected;
step 3.1, sending the marked medical text for training obtained in step 2 into the word embedding layer of the attention-based language preprocessing model by taking a sentence as a unit; the word embedding layer generates a word vector of each word by using a skip-gram model; the skip-gram model predicts the surrounding words using a middle word, and for a medical text of length L the word with index i in the text sequence is expressed as w_i; the probability that a given random center word generates all its background words is maximized:

\prod_{i=1}^{L} \prod_{-m \le j \le m,\; j \ne 0} P(w_{i+j} \mid w_i)    (1)

wherein \prod_{i=1}^{L} indicates that the probability is calculated starting from the first word in the text, \prod_{-m \le j \le m,\, j \ne 0} means that for each central word the probability of occurrence of all background words whose distance from it does not exceed m is calculated, m indicates the window size, and P(w_{i+j} \mid w_i) is the probability of occurrence of the background word w_{i+j} with w_i as the central word and m as the window size; equation (1) is equivalent to minimizing the first loss function:

-\sum_{i=1}^{L} \sum_{-m \le j \le m,\; j \ne 0} \log P(w_{i+j} \mid w_i)    (2)

wherein \log represents the logarithmic loss function;

suppose the central word w_i has index i in the text and the background word w_o has index o in the text; the conditional probability of a given center word in the first loss function generating a background word, normalized by the normalized exponential function softmax, is:

P(w_o \mid w_i) = \frac{\exp(u_o^{\top} v_i)}{\sum_{k \in V} \exp(u_k^{\top} v_i)}    (3)

wherein v_i represents the vector of the center word with index i, u_o represents the vector of the background word with index o, u_o^{\top} represents the transpose of the background word vector, u_o^{\top} v_i represents the dot product of the two vectors, and \exp represents the exponential function with the natural constant e as base; stochastic gradient descent is used to solve the gradient of the center word vector v_i in the above equation:

\frac{\partial \log P(w_o \mid w_i)}{\partial v_i} = u_o - \sum_{j \in V} P(w_j \mid w_i)\, u_j    (4)

the attention-based language preprocessing model is iteratively trained using equation (4) until the first loss function value is less than a first threshold; after training, any word with index i in the medical text obtains its vector v_i as the center word;
Step 3.2, the word vectors generated by the word embedding layer are sent into the position vector embedding layer; the position vector embedding layer uses a position vector to represent the positional relation of each character, and the word vector and the position vector are added to obtain the new feature vector of the word. The position vector is computed as shown in formula (5) and formula (6):

$$PE_{(pos,\,2k)} = \sin\!\left(\frac{pos}{10000^{2k/d}}\right) \qquad (5)$$

$$PE_{(pos,\,2k+1)} = \cos\!\left(\frac{pos}{10000^{2k/d}}\right) \qquad (6)$$

where $PE$ is a two-dimensional matrix whose number of columns is the same as the dimension of the previously generated word vectors; each column of $PE$ represents the position vector of every word in one dimension, and the total number of columns equals the total dimension of the word vector; $d$ is the total dimension of the position vector, $k$ indexes a specific dimension of the vector, $PE_{(pos,2k)}$ represents the value of the position vector of the word with index $pos$ in the even dimensions, computed with the sine function, and $PE_{(pos,2k+1)}$ represents the value of the position vector of the word with index $pos$ in the odd dimensions, computed with the cosine function. Finally, the position vector and the word vector are added to obtain the new feature vector of the word, as shown in formula (7):

$$x_{pos} = v_{pos} + PE_{pos} \qquad (7)$$

where $PE_{pos}$ denotes the position vector of the word with index $pos$, $v_{pos}$ denotes the word vector of the word with index $pos$ used as the center word, and $x_{pos}$ denotes the new feature vector with the position information embedded;
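A rough sketch of the position embedding of formulas (5)-(7) follows; the sequence length, vector dimension and the function name position_encoding are illustrative assumptions.

```python
import numpy as np

def position_encoding(seq_len, d):
    """PE matrix of formulas (5)-(6): sine on even dimensions, cosine on odd ones."""
    pos = np.arange(seq_len)[:, None]                  # word index
    k = np.arange(d)[None, :]                          # dimension index
    angles = pos / np.power(10000.0, (2 * (k // 2)) / d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles[:, 0::2])              # even dimensions, formula (5)
    pe[:, 1::2] = np.cos(angles[:, 1::2])              # odd dimensions, formula (6)
    return pe

word_vectors = np.random.default_rng(1).normal(size=(8, 16))   # 8 characters, 16-dim vectors
features = word_vectors + position_encoding(8, 16)              # addition of formula (7)
print(features.shape)
```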
step 3.3, an attention mechanism is used to learn the long-distance dependencies between characters, so that each character vector contains information about all the other characters in the sentence; the output of the attention mechanism layer is the finally generated word vector, which completes the training of the attention-based language preprocessing model;

the attention mechanism is computed as shown in formula (8):

$$\operatorname{Attention}(Q,K,V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V \qquad (8)$$

where $\operatorname{Attention}(Q,K,V)$ denotes the attention score, $Q$ denotes the query vector, $K$ denotes the key vector, $V$ denotes the value vector, $\sqrt{d_k}$ denotes the square root of the dimension of the key vector, and $\operatorname{softmax}$ is the normalized exponential function;

the normalized exponential (softmax) function is:

$$\operatorname{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}} \qquad (9)$$

where $x$ denotes a data array, $x_i$ denotes the $i$-th element of the array $x$, and the value of $\operatorname{softmax}(x_i)$ is the ratio of the exponential of the $i$-th element of $x$ to the sum of the exponentials of all elements of the array;
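A compact sketch of the attention score of formula (8) and the softmax of formula (9), using random stand-ins for the query, key and value matrices; all names and shapes are assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)          # formula (9), row-wise

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # QK^T / sqrt(d_k)
    return softmax(scores) @ V                           # formula (8)

rng = np.random.default_rng(2)
Q = K = V = rng.normal(size=(8, 16))                     # self-attention over 8 characters
print(attention(Q, K, V).shape)
```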
step 4, train the medical entity recognition model with the labeled training medical text obtained in step 2 to obtain a trained medical entity recognition model; the medical entity recognition model comprises a bidirectional gated recurrent network layer, a multi-head attention layer and a conditional random field layer connected in sequence;
step 4.1, the word vectors are encoded bidirectionally with the bidirectional gated recurrent network layer, which comprises a forward gated recurrent network layer and a backward gated recurrent network layer; the forward layer learns the features of the following text and the backward layer learns the features of the preceding text, so that the generated vectors better capture contextual semantic information. The gated recurrent network layer consists only of an update gate and a reset gate, where the update gate determines how much past information is passed on to the future and the reset gate determines how much past information is forgotten; the gated recurrent network layer is computed as shown in formulas (10)-(13):

$$z_t = \sigma\!\left(W_z\cdot[h_{t-1},\,x_t]\right) \qquad (10)$$

$$r_t = \sigma\!\left(W_r\cdot[h_{t-1},\,x_t]\right) \qquad (11)$$

$$\tilde{h}_t = \tanh\!\left(W_{\tilde{h}}\cdot[r_t\odot h_{t-1},\,x_t]\right) \qquad (12)$$

$$h_t = (1-z_t)\odot h_{t-1} + z_t\odot \tilde{h}_t \qquad (13)$$

where $z_t$ is the output state of the update gate at time $t$, $r_t$ is the output state of the reset gate at time $t$, $\tilde{h}_t$ is the candidate state, $h_t$ denotes the output state of the network at time $t$, $x_t$ denotes the input state at the current time, $h_{t-1}$ denotes the hidden state output by the gated recurrent network node at the previous time, $\tanh$ denotes the hyperbolic tangent function, $\sigma$ denotes the excitation function, $W_z$ is the weight parameter for training the update gate $z_t$, $W_r$ is the weight parameter for training the reset gate $r_t$, $W_{\tilde{h}}$ is the weight parameter used when computing the candidate state $\tilde{h}_t$, and $[\,\cdot\,,\,\cdot\,]$ denotes the concatenation of two vectors. The update gate $z_t$ controls how much of the historical state $h_{t-1}$ is retained in the output state $h_t$ of the network at the current time, and the reset gate $r_t$ determines how much the candidate state $\tilde{h}_t$ depends on the hidden state $h_{t-1}$ output by the gated recurrent network node at the previous time;
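The gate equations (10)-(13) can be traced with a single-step sketch; the weight shapes, and the use of the sigmoid function for the gates and tanh for the candidate state, follow the standard gated-recurrent-unit formulation and are assumptions where the text leaves them implicit.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One gated recurrent unit update following formulas (10)-(13)."""
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ concat)                                   # update gate, formula (10)
    r_t = sigmoid(W_r @ concat)                                   # reset gate, formula (11)
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))   # candidate state, formula (12)
    return (1 - z_t) * h_prev + z_t * h_cand                      # new hidden state, formula (13)

rng = np.random.default_rng(3)
d_in, d_h = 16, 32
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(d_h, d_h + d_in)) for _ in range(3))
h = gru_step(rng.normal(size=d_in), np.zeros(d_h), W_z, W_r, W_h)
print(h.shape)
```

A bidirectional layer would run this update once left-to-right and once right-to-left over the sentence and combine the two hidden states of each character.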
step 4.2, a multi-head attention layer is used to further extract multiple semantics: the multi-head attention layer essentially performs two or more attention-head operations on the output state $h_t$ of the bidirectional gated recurrent network layer at time $t$. First, a single attention head is computed by formula (16):

$$head_i = \operatorname{softmax}\!\left(\frac{\left(hW_i^{Q}\right)\left(hW_i^{K}\right)^{\top}}{\sqrt{d_k}}\right)\left(hW_i^{V}\right) \qquad (16)$$

where $head_i$ denotes the result of the $i$-th attention-head calculation, $k$ indicates that there are $k$ attention heads, $W_i^{Q}$ is the weight parameter for generating the query vector, $W_i^{K}$ is the weight parameter for generating the key vector, $W_i^{V}$ is the weight parameter for generating the value vector, $\sqrt{d_k}$ is a smoothing term that adjusts for the dimension $d_k$, and $\operatorname{softmax}$ is the normalized exponential function. Finally, the $k$ calculation results are concatenated and linearly transformed to obtain, for each time $t$, the multi-head attention result for the output state $h_t$ of the bidirectional gated recurrent network layer at time $t$, as shown in formula (17):

$$M_t = \operatorname{Concat}\!\left(head_1, head_2, \ldots, head_k\right) W^{O} \qquad (17)$$

where $M_t$ denotes the calculation result of the multi-head attention layer and $W^{O}$ is a weight parameter;
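A sketch of the multi-head attention of formulas (16)-(17) is shown below; the head count, projection matrix shapes and the name multi_head_attention are illustrative assumptions rather than the patented implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(H, Wq, Wk, Wv, Wo):
    """Formulas (16)-(17): per-head scaled dot-product attention, concatenation, output projection."""
    heads = []
    for W_q, W_k, W_v in zip(Wq, Wk, Wv):
        Q, K, V = H @ W_q, H @ W_k, H @ W_v
        d_k = K.shape[-1]
        heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)   # one head, formula (16)
    return np.concatenate(heads, axis=-1) @ Wo               # concat + linear map, formula (17)

rng = np.random.default_rng(4)
seq_len, d_model, n_heads, d_head = 8, 64, 4, 16
H = rng.normal(size=(seq_len, d_model))                      # stand-in for the BiGRU outputs h_t
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(n_heads, d_model, d_head)) for _ in range(3))
Wo = rng.normal(scale=0.1, size=(n_heads * d_head, d_model))
print(multi_head_attention(H, Wq, Wk, Wv, Wo).shape)         # (8, 64)
```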
step 4.3, the optimal tag sequence is obtained with the conditional random field layer: for an input sentence $x=(x_1,x_2,\ldots,x_n)$, the score of a sentence tag sequence $y=(y_1,y_2,\ldots,y_n)$ is:

$$score(x,y) = \sum_{i=0}^{n} A_{y_i,\,y_{i+1}} + \sum_{i=1}^{n} P_{i,\,y_i} \qquad (18)$$

where $score(x,y)$ denotes the scoring function for the input sentence $x$ generating the tag sequence $y$, $n$ is the sequence length, $A$ is the transition score matrix, $A_{y_i,y_{i+1}}$ denotes the score of the transition from tag $y_i$ to tag $y_{i+1}$, $y_0$ and $y_{n+1}$ denote the start and end tags of the sentence, and $P_{i,y_i}$ denotes the probability that the $i$-th word is labeled $y_i$. Normalization yields the probability of the tag sequence $y$, as in equation (19):

$$P(y\mid x) = \frac{\exp\!\big(score(x,y)\big)}{\sum_{\tilde{y}\in Y_x}\exp\!\big(score(x,\tilde{y})\big)} \qquad (19)$$

where $y$ denotes the actual tag sequence and $Y_x$ denotes the set of all possible tag sequences;

the minimized second loss function of the medical entity recognition model is obtained with maximum likelihood estimation, as in equation (20):

$$\mathcal{L}_2 = -\log P(y\mid x) \qquad (20)$$

where $\mathcal{L}_2$ denotes the second loss function value; the medical entity recognition model is trained iteratively until the second loss function value $\mathcal{L}_2$ is less than a second threshold $\varepsilon_2$; a globally optimal sequence is then obtained with the Viterbi algorithm, and this globally optimal sequence is the final labeling result of the medical-domain named entity recognition;

finally, the medical named entities in the text are identified according to the tag sequence, where a character labeled (B) is the first character of a medical named entity, a character labeled (I) is a non-initial part of a medical named entity, and a character labeled (O) is not part of a medical named entity;
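As an illustration of the path score of formula (18), a minimal scoring sketch over a hypothetical B/I/O tag set is shown below; the start/end handling, matrix names and tag indices are assumptions for the example.

```python
import numpy as np

TAGS = ["B", "I", "O"]                        # plus assumed <start>/<end> positions

def sequence_score(emissions, transitions, tag_ids, start, end):
    """Formula (18): sum of transition scores A and emission scores P along one tag path."""
    score = transitions[start, tag_ids[0]]                    # A_{y_0, y_1} from the start tag
    for i in range(len(tag_ids)):
        score += emissions[i, tag_ids[i]]                     # P_{i, y_i}
        nxt = tag_ids[i + 1] if i + 1 < len(tag_ids) else end
        score += transitions[tag_ids[i], nxt]                 # A_{y_i, y_{i+1}}
    return score

rng = np.random.default_rng(5)
n_tags = len(TAGS) + 2                        # B, I, O, <start>, <end>
emissions = rng.normal(size=(6, n_tags))      # scores for a 6-character sentence
transitions = rng.normal(size=(n_tags, n_tags))
print(sequence_score(emissions, transitions, [0, 1, 2, 0, 1, 2], start=3, end=4))
```

Exponentiating and normalizing such scores over all candidate paths gives the probability of formula (19).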
step 5, at recognition time, the medical text to be recognized is segmented into characters and imported into the trained attention-based language preprocessing model to generate word vectors; the generated word vectors are then imported into the trained medical entity recognition model to recognize the medical named entities in the text.
2. The Chinese medical named entity recognition method according to claim 1, wherein: in step 4.1, the value range of the $\tanh$ function is (-1, 1), and its expression is shown in formula (14):

$$\tanh(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} \qquad (14)$$

where $x$ represents the input to the function.
3. The Chinese medical named entity recognition method according to claim 2, wherein: in step 4.1, the excitation function $\sigma$ is the sigmoid function with value range (0, 1), and its expression is shown in formula (15):

$$\sigma(x) = \frac{1}{1+e^{-x}} \qquad (15).$$
4. The Chinese medical named entity recognition method according to claim 3, wherein: in step 4.3, the globally optimal sequence is obtained with the Viterbi algorithm, as shown in formula (21):

$$y^{*} = \underset{\tilde{y}\in Y_x}{\arg\max}\ score(x,\tilde{y}) \qquad (21)$$

where $y^{*}$ denotes the tag sequence in the set that maximizes the scoring function.
CN202110157254.4A 2021-02-05 2021-02-05 Chinese medical named entity recognition method Active CN112487820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110157254.4A CN112487820B (en) 2021-02-05 2021-02-05 Chinese medical named entity recognition method

Publications (2)

Publication Number Publication Date
CN112487820A CN112487820A (en) 2021-03-12
CN112487820B true CN112487820B (en) 2021-05-25

Family

ID=74912336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110157254.4A Active CN112487820B (en) 2021-02-05 2021-02-05 Chinese medical named entity recognition method

Country Status (1)

Country Link
CN (1) CN112487820B (en)


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115721A (en) * 2020-09-28 2020-12-22 青岛海信网络科技股份有限公司 Named entity identification method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368541B (en) * 2018-12-06 2024-06-11 北京搜狗科技发展有限公司 Named entity identification method and device
CN110781683B (en) * 2019-11-04 2024-04-05 河海大学 Entity relation joint extraction method
CN111626056B (en) * 2020-04-11 2023-04-07 中国人民解放军战略支援部队信息工程大学 Chinese named entity identification method and device based on RoBERTA-BiGRU-LAN model
CN111783466A (en) * 2020-07-15 2020-10-16 电子科技大学 Named entity identification method for Chinese medical records




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant