CN113674866A - Medical text oriented pre-training method - Google Patents

Info

Publication number: CN113674866A (application CN202110690028.2A; granted as CN113674866B)
Authority: CN (China)
Prior art keywords: word, medical, vector, layer, training
Legal status: Granted; Active
Other languages: Chinese (zh)
Inventors: 朱强, 王卫东, 杨毅, 徐高军
Applicant and current assignee: Jiangsu Skyray Precision Medical Technology Co., Ltd.

Classifications

    • G16H 50/70 — ICT for medical diagnosis, simulation or data mining; mining of medical data, e.g. analysing previous cases of other patients
    • G06F 16/3335 — Syntactic pre-processing of queries, e.g. stopword elimination, stemming
    • G06F 16/3344 — Query execution using natural language analysis
    • G06F 16/35 — Clustering; classification of unstructured textual data
    • G06F 18/2135 — Feature extraction based on approximation criteria, e.g. principal component analysis
    • G06F 18/241 — Classification techniques relating to the classification model
    • G06F 18/2415 — Classification based on parametric or probabilistic models
    • G06F 40/289 — Phrasal analysis, e.g. finite state techniques or chunking
    • G06N 3/044 — Recurrent networks, e.g. Hopfield networks
    • G06N 3/047 — Probabilistic or stochastic networks
    • G06N 3/048 — Activation functions
    • G06N 3/08 — Learning methods
    • Y02A 90/10 — ICT supporting adaptation to climate change, e.g. for weather forecasting


Abstract

The invention discloses a medical-text-oriented pre-training method, comprising the following steps: acquire medical dictionaries of diseases, examinations and tests, symptoms, medicines, body parts, operations, and the like; acquire medical text content from encyclopedias and electronic medical records; load the medical dictionaries and segment the medical text with jieba to serve as the training corpus; acquire pictures of Chinese characters from the Han dictionary, constructing pictures for any characters not found there; extract glyph features with a VGG-16 convolutional network; reduce the dimensionality of the extracted glyph features with PCA (principal component analysis) to serve as character vectors; superpose each character vector with the character's position vector to obtain a new character vector; load an open-source Chinese word-vector corpus as the initial word vectors; train on the medical text content with an ELMo model to obtain the final ELMo pre-training model; and use the ELMo pre-training model to generate the ELMo vector of a particular word in a sentence. This pre-training method addresses the problem that general-corpus models are poorly suited to medical natural language processing tasks.

Description

Medical text oriented pre-training method
Technical Field
The invention relates to a medical text oriented pre-training method, and belongs to the technical field of natural language processing.
Background
Natural language processing is an important research direction in computer science and artificial intelligence, aiming to enable machines to understand and manipulate human natural language. In recent years, thanks to the development of deep learning, natural language processing has made important breakthroughs in tasks such as machine translation, reading comprehension, sentiment analysis, part-of-speech tagging, and text classification. Across these tasks, the development of pre-training techniques has played a critical role.
Deep-learning-based natural language processing tasks generally require a large amount of labeled data to achieve good results; when data are scarce, model performance is often unsatisfactory. The advent of pre-trained models changed this situation: the idea is to train a deep base model on a large dataset and then transfer it to downstream natural language processing tasks by fine-tuning or similar means. Current mainstream pre-trained models such as ELMo, GPT, and BERT are trained on general-domain corpora, yet medical text differs substantially from general text, so applying existing pre-trained models directly to medical text tasks rarely achieves the expected results. Therefore, building on current pre-training models, this invention generates a pre-trained model suited to medical text and improves the effectiveness of medical natural language processing.
Disclosure of Invention
The invention aims to overcome the defect that the traditional pre-training model is difficult to adapt to a medical natural language processing task, and provides a pre-training method for medical texts.
To achieve this aim, the invention adopts the following technical scheme:
a medical text-oriented pre-training method comprises corpus generation, word vector construction and model training. In the corpus acquisition, dictionaries such as diseases, examination, symptoms, medicines, body parts, operations and the like are acquired, medical data such as encyclopedia and electronic medical records are acquired, and then the medical data are participled by using jieba in combination with the dictionaries to serve as training corpuses. In the word vector construction, due to ideographical property of Chinese characters, the VGG-16 convolution network is used for extracting the character pattern characteristics of the Chinese characters, and then the PCA is used for reducing the dimension of the Chinese characters to be used as word vectors. The word vector portion selects the Chinese word vector that loads the open source. In model training, an ELMo model is used for training a medical text, and optimal model parameters are stored. Then, the pre-trained ELMo model is used to generate ELMo vectors for the particular word. The method specifically comprises the following steps:
Step 1: acquire entities such as diseases, examinations and tests, symptoms, medicines, body parts, and operations from medical encyclopedias, and construct a corresponding dictionary for each category;
Step 2: using the dictionaries built in step 1, crawl the corresponding popular-science medical data from encyclopedias, medical encyclopedias, and medical Q&A sites with Scrapy, then collect electronic medical record data from the Love Doctor platform;
Step 3: load the dictionaries from step 1 and segment the medical data with jieba;
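A minimal pure-Python sketch of the dictionary-driven segmentation in step 3. The patent uses jieba with the loaded medical dictionaries; here a forward maximum-matching segmenter stands in for it, and the dictionary entries below are illustrative, not taken from the patent.

```python
# Sketch of step 3: dictionary-driven word segmentation. In practice the
# patent loads the medical dictionaries into jieba (jieba.load_userdict)
# and calls jieba.lcut; forward maximum matching illustrates the idea.

def max_match_segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching against a (medical) dictionary."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at i; fall back to one char.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

medical_dict = {"胃炎", "患者", "上腹", "疼痛"}  # illustrative entries
print(max_match_segment("胃炎患者上腹疼痛", medical_dict))
# ['胃炎', '患者', '上腹', '疼痛']
```

With a real corpus, loading the dictionaries before segmentation keeps multi-character medical terms such as disease names from being split apart.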
Step 4: collect pictures of Chinese characters from the Han dictionary, recording the corresponding font and size, and name each picture after the character it shows. Tally all characters appearing in step 2; for the non-Chinese characters, convert English letters uniformly to lowercase and convert digits and punctuation uniformly to half-width, then construct pictures for them using the same font and size as the Chinese-character pictures; finally, preprocess all the character pictures;
Step 5: extract features from the character pictures preprocessed in step 4 using a VGG-16 convolutional network. The VGG-16 network consists of 13 convolutional layers and 3 fully connected layers; the character features are taken from the output of the 2nd fully connected layer;
Step 6: reduce the dimensionality of the character features extracted in step 5 with PCA (principal component analysis) and use the result as the character vectors of the character embedding layer;
Step 7: superpose the character vector from the character embedding layer and the position vector from the character-position embedding layer to obtain a new character vector fusing glyph information and position information. The position vector is computed as:

PE_{(p, 2i)} = \sin\left(p / 10000^{2i/d_e}\right)

PE_{(p, 2i+1)} = \cos\left(p / 10000^{2i/d_e}\right)

where p is the position of the character, d_e is the total dimension of the character embedding, and i indexes a specific embedding dimension; even dimensions use PE_{(p,2i)} and odd dimensions use PE_{(p,2i+1)};
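The step-7 position vector is the standard sinusoidal encoding given by the claim formulas. A small pure-Python sketch (the embedding dimension below is illustrative):

```python
import math

# Sketch of step 7: sinusoidal position vectors, PE(p,2i)=sin(...) for
# even dimensions and PE(p,2i+1)=cos(...) for odd dimensions, then
# elementwise superposition with the glyph-based character vector.

def position_vector(p, d_e):
    """Encode position p into a d_e-dimensional vector."""
    pe = []
    for dim in range(d_e):
        i = dim // 2                              # paired dimension index
        angle = p / (10000 ** (2 * i / d_e))
        pe.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
    return pe

def fuse(char_vec, p):
    """New character vector = character vector + position vector."""
    pv = position_vector(p, len(char_vec))
    return [c + v for c, v in zip(char_vec, pv)]
```

At position 0 the encoding is (0, 1, 0, 1, ...), so `fuse` shifts odd dimensions by exactly 1, which makes the superposition easy to check by hand.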
Step 8, loading the word vectors with open sources as word vectors of a word embedding layer;
Step 9: train on the medical text segmented in step 3 with an ELMo model, combining the character vectors from step 7 and the word vectors from step 8, to obtain the ELMo pre-training model;
Step 9.1: adopt 7 convolutional layers with (width, filter-count) settings [1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024]; run them over the character vectors of each word from the step-3 segmentation, connect each convolutional layer to a global max-pooling layer, and concatenate all the pooled vectors to obtain a word vector;
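The character CNN of step 9.1 can be sketched as follows. The filter widths and counts come from the patent; the random weights, the character-vector dimension, and the zero-padding for short words are illustrative stand-ins for trained parameters.

```python
import random

# Sketch of step 9.1: seven 1-D convolutions of widths 1..7 with
# 32/32/64/128/256/512/1024 filters over a word's character vectors;
# each filter is globally max-pooled and the results are concatenated,
# giving a 32+32+64+128+256+512+1024 = 2048-dimensional word vector.

FILTERS = [(1, 32), (2, 32), (3, 64), (4, 128), (5, 256), (6, 512), (7, 1024)]

def char_cnn(char_vecs, d, rng):
    # Zero-pad so even the width-7 filters see at least one window.
    chars = char_vecs + [[0.0] * d] * max(0, 7 - len(char_vecs))
    out = []
    for width, n_filters in FILTERS:
        for _ in range(n_filters):
            w = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(width)]
            # Convolve: dot product of the filter with each character window.
            acts = [
                sum(w[k][j] * chars[s + k][j]
                    for k in range(width) for j in range(d))
                for s in range(len(chars) - width + 1)
            ]
            out.append(max(acts))                 # global max pooling
    return out

rng = random.Random(0)
d = 8                                             # toy character-vector size
word = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(3)]  # 3 chars
vec = char_cnn(word, d, rng)
print(len(vec))  # 2048
```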
Step 9.2: pass the word vector from step 9.1 through a Highway layer, repeated twice; concatenate the resulting vector with the word vector from step 8; then apply a linear projection layer to obtain the final word vector x_k;
Step 9.3: feed the word vectors x_k into a bidirectional language model (biLM) comprising a forward language model and a backward language model. In the forward language model, the hidden representation of the k-th word at LSTM layer j is \overrightarrow{h}^{LM}_{k,j} (where j = 1, \dots, L). The forward language model models the word sequence by predicting the next word t_k from the previously observed words (t_1, \dots, t_{k-1}), with joint probability:

p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \dots, t_{k-1})
The backward language model is analogous: it models the word at the current step by observing the future words, with joint probability:

p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \dots, t_N)
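The forward and backward factorizations above can be illustrated with a toy conditional model; the uniform distribution standing in for the biLM's Softmax output is purely illustrative.

```python
import math

# Sketch of the step-9.3 factorizations: the sequence log-probability is
# the sum of per-step conditional log-probabilities, read left-to-right
# (forward) or right-to-left (backward). A uniform model over a toy
# vocabulary stands in for the trained biLM output here.

V = 100                                    # toy vocabulary size
def cond_prob(token, context):             # placeholder for the biLM output
    return 1.0 / V

def forward_log_prob(tokens):
    return sum(math.log(cond_prob(t, tokens[:k]))
               for k, t in enumerate(tokens))

def backward_log_prob(tokens):
    return sum(math.log(cond_prob(t, tokens[k + 1:]))
               for k, t in enumerate(tokens))

sent = ["胃炎", "患者", "上腹", "疼痛"]
print(round(forward_log_prob(sent), 4))
```

Under the uniform stand-in both directions give N·log(1/V), which makes the shared factorization structure easy to verify.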
Step 9.4: combine the forward and backward language models and jointly maximize the log-likelihood in both directions:

\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \dots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \dots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)

where \Theta_x denotes the word-embedding layer parameters, \Theta_s the Softmax layer parameters, \overrightarrow{\Theta}_{LSTM} the forward language model parameters, and \overleftarrow{\Theta}_{LSTM} the backward language model parameters;
Step 10: use the ELMo pre-training model to obtain the ELMo vector of a specific word in a sentence. For an L-layer bidirectional language model, there are 2L + 1 representations in total:

R_k = \{ x^{LM}_k, \overrightarrow{h}^{LM}_{k,j}, \overleftarrow{h}^{LM}_{k,j} \mid j = 1, \dots, L \} = \{ h^{LM}_{k,j} \mid j = 0, \dots, L \}

where, when j = 0, h^{LM}_{k,0} is the word vector, i.e. x^{LM}_k, and for j > 0, h^{LM}_{k,j} = [\overrightarrow{h}^{LM}_{k,j}; \overleftarrow{h}^{LM}_{k,j}] is the concatenated representation of the forward and backward language models. The final ELMo word vector is:

ELMo_k = \gamma \sum_{j=0}^{L} s_j \, h^{LM}_{k,j}

where s_j are the Softmax-normalized weights and \gamma is a scaling parameter.
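The step-10 combination can be sketched directly from the formula; the layer representations and weight logits below are illustrative values, not model outputs.

```python
import math

# Sketch of step 10: the ELMo vector of word k is the gamma-scaled,
# Softmax-weighted sum of its 2L+1 layer representations h_{k,0..L}.

def softmax(z):
    m = max(z)                             # shift by max for stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def elmo_vector(layer_reps, logits, gamma):
    s = softmax(logits)                    # normalized layer weights s_j
    dim = len(layer_reps[0])
    return [gamma * sum(s[j] * layer_reps[j][i]
                        for j in range(len(layer_reps)))
            for i in range(dim)]

# L = 2 biLM layers plus the token layer -> 3 representations per word.
h_layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(elmo_vector(h_layers, [0.0, 0.0, 0.0], gamma=1.0))  # roughly [2/3, 2/3]
```

With equal logits the Softmax weights are uniform, so the ELMo vector reduces to gamma times the average of the layer representations.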
Preferably, the preprocessing of the character pictures in step 4 reads each picture in RGB format, converts the color channels to BGR order, transposes the picture layout to CHW, and then subtracts the per-channel mean from each color channel:

B' = B - \mu_B, \quad G' = G - \mu_G, \quad R' = R - \mu_R

where B, G, R are the color channel values and \mu_B, \mu_G, \mu_R are the means of the B, G, R channels, respectively.
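The preprocessing above can be sketched in pure Python; the 2x2 pixel values and channel means are illustrative.

```python
# Sketch of the step-4 preprocessing: read an RGB image as H x W x C,
# swap channels to BGR, transpose to C x H x W, and subtract the
# per-channel means.

def preprocess(rgb_hwc, means_bgr):
    h, w = len(rgb_hwc), len(rgb_hwc[0])
    bgr_chw = [[[0.0] * w for _ in range(h)] for _ in range(3)]
    for y in range(h):
        for x in range(w):
            r, g, b = rgb_hwc[y][x]
            for c, val in enumerate((b, g, r)):          # RGB -> BGR
                bgr_chw[c][y][x] = val - means_bgr[c]    # mean subtraction
    return bgr_chw

img = [[(10, 20, 30), (40, 50, 60)],
       [(70, 80, 90), (100, 110, 120)]]
print(preprocess(img, means_bgr=(30.0, 20.0, 10.0))[0][0][0])  # 30 - 30 = 0.0
```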
Preferably, the Highway layer in step 9.2 is computed as:

t = \sigma(W_T x + b_T)

y = t \odot g(W_H x + b_H) + (1 - t) \odot x

where x is the concatenated vector from step 9.1, W_H x + b_H and W_T x + b_T are linear (affine) transforms, g is the activation function, and \sigma is the sigmoid function.
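A sketch of the Highway computation above, assuming tanh as the activation g and per-dimension (diagonal) weights to keep it short; a real layer uses full weight matrices.

```python
import math

# Sketch of the step-9.2 Highway layer: a sigmoid transform gate t mixes
# a nonlinear transform of x with the identity carry of x itself.

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def highway(x, w_h, b_h, w_t, b_t):
    y = []
    for i, xi in enumerate(x):
        t = sigmoid(w_t[i] * xi + b_t[i])        # transform gate
        h = math.tanh(w_h[i] * xi + b_h[i])      # candidate transform
        y.append(t * h + (1.0 - t) * xi)         # gated mix with carry
    return y

x = [0.5, -1.0]
# A strongly negative gate bias drives t toward 0, so the layer nearly
# passes x through unchanged (the "carry" behavior):
print(highway(x, [1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [-10.0, -10.0]))
```

The carry path is what lets the step-9.2 stack repeat the layer twice without degrading the character-level signal.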
Preferably, the linear projection layer in step 9.2 is computed as:

y = W x + b

where W is the weight matrix, x is the input variable, and b is the bias.
Preferably, the bidirectional language model (biLM) in step 9.3 is a bidirectional long short-term memory network (LSTM) formed by combining a forward LSTM and a backward LSTM. One LSTM step is computed as:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)

\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t

h_t = o_t \odot \tanh(c_t)

where t denotes the current time step and t-1 the previous one; f_t, i_t, o_t, h_t are the forget gate, input gate, output gate, and output, respectively; \sigma is the sigmoid function; \tilde{c}_t, c_{t-1}, c_t are the candidate state, the previous state, and the new state, respectively; and the W and b terms are training parameters, updated automatically during training.
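One LSTM step from the formulas above, as a pure-Python sketch with illustrative weights.

```python
import math

# Sketch of one LSTM step: forget, input and output gates over the
# concatenation [h_{t-1}, x_t], a tanh candidate state, the new cell
# state, and the new hidden output.

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dot(row, v):
    return sum(a * b for a, b in zip(row, v))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W maps gate name -> weight rows over [h_prev, x_t]; b likewise."""
    z = h_prev + x_t                              # concatenation [h, x]
    f = [sigmoid(dot(r, z) + b["f"][i]) for i, r in enumerate(W["f"])]
    i_g = [sigmoid(dot(r, z) + b["i"][i]) for i, r in enumerate(W["i"])]
    o = [sigmoid(dot(r, z) + b["o"][i]) for i, r in enumerate(W["o"])]
    c_tilde = [math.tanh(dot(r, z) + b["c"][i]) for i, r in enumerate(W["c"])]
    c_t = [f[j] * c_prev[j] + i_g[j] * c_tilde[j] for j in range(len(c_prev))]
    h_t = [o[j] * math.tanh(c_t[j]) for j in range(len(c_t))]
    return h_t, c_t

n, d = 2, 3                                       # hidden size 2, input size 3
W = {g: [[0.1] * (n + d) for _ in range(n)] for g in "fioc"}
b = {g: [0.0] * n for g in "fioc"}
h_t, c_t = lstm_step([1.0, 0.5, -0.5], [0.0, 0.0], [0.0, 0.0], W, b)
print(len(h_t), len(c_t))  # 2 2
```

The biLM of step 9.3 runs one such LSTM left-to-right and another right-to-left and concatenates their hidden states per layer.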
Preferably, the Softmax function in step 10 is:

\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}

where z_i is the output of the current output unit, j is the output index, and K is the total number of outputs; the result is the ratio of the exponential of the current element to the sum of the exponentials of all elements.
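The Softmax formula as a short sketch (the max-shift is a standard numerical-stability detail, not stated in the patent).

```python
import math

# Sketch of the step-10 Softmax: each output is the exponential of the
# current element divided by the sum of exponentials of all elements.

def softmax(z):
    m = max(z)                        # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(round(sum(probs), 6))  # 1.0
```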
The medical-text-oriented pre-training method has the following advantages:
The invention extracts glyph features of the characters in medical text with a VGG-16 convolutional network, exploiting the ideographic nature of Chinese characters; PCA dimension reduction shrinks the character embedding layer and improves the training efficiency of the ELMo model; fusing each character vector with its position vector encodes the order of characters within a word; training the ELMo model on the preprocessed medical text mines the semantic relations within words; and pre-training on medical text solves the problem that general corpora are poorly suited to medical natural language processing tasks.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Fig. 2 is an overall architecture diagram of a medical text-oriented pre-training method.
FIG. 3 is a VGG-16 convolutional network architecture diagram.
Detailed Description
To better illustrate the objects, technical solutions and advantages of the present invention, a pre-training method for medical texts according to the present invention is described in detail below with reference to the accompanying drawings and embodiments.
As shown in FIG. 1, the medical-text-oriented pre-training method first constructs a medical entity dictionary; second, collects medical text content according to the dictionary; third, preprocesses the medical text with the dictionary; fourth, obtains character vectors; fifth, trains the ELMo model on the preprocessed medical text using those vectors; and finally, generates ELMo word vectors with the pre-trained ELMo model. The specific steps are as follows:
Step 1: obtain entities such as diseases, examinations and tests, symptoms, medicines, body parts, and operations from medical encyclopedias and construct the corresponding dictionaries, e.g. obtaining the disease entity "gastritis";
Step 2: collect the popular-science medical data corresponding to gastritis from Baidu Baike, medical encyclopedias, and medical Q&A sites with Scrapy, then collect electronic medical record data from the Love Doctor platform; an example of collected data is "epigastric pain of a gastritis patient";
Step 3: load the dictionaries from step 1 and use jieba to segment "epigastric pain of a gastritis patient" into the words "gastritis / patient / epigastric / pain";
Step 4: acquire pictures of the Chinese characters from the Han dictionary, record the corresponding font and size, and name each picture after its character, e.g. the picture of the character "stomach" is named "stomach.png". Tally all characters from step 2; for non-Chinese characters, convert English letters uniformly to lowercase and digits and punctuation uniformly to half-width, then construct the corresponding pictures in the same font and size as the Chinese characters. For "epigastric pain of a gastritis patient", the resulting pictures are the character-picture part shown in FIG. 2. Then read each character picture in RGB format, convert the color channels to BGR order, transpose the layout to CHW, and subtract the per-channel means:

B' = B - \mu_B, \quad G' = G - \mu_G, \quad R' = R - \mu_R

where B, G, R are the color channel values and \mu_B, \mu_G, \mu_R are the means of the B, G, R channels, respectively;
Step 5: extract features from the character pictures processed in step 4 with a VGG-16 convolutional network, which consists of 13 convolutional layers and 3 fully connected layers; the character features are taken from the output of the 2nd fully connected layer, i.e. the retained portion of the VGG-16 network shown in FIG. 3;
Step 6: reduce the dimensionality of the character features extracted in step 5 with PCA and use the result as the character vectors of the character embedding layer; for example, if "stomach" has the 10-dimensional feature [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], PCA reduces it to the 6-dimensional [0, 1, 2, 3, 4, 5];
Step 7: superpose the character vector from the character embedding layer and the position vector from the character-position embedding layer to obtain a new character vector fusing glyph information and position information. The position vector is computed as:

PE_{(p, 2i)} = \sin\left(p / 10000^{2i/d_e}\right)

PE_{(p, 2i+1)} = \cos\left(p / 10000^{2i/d_e}\right)

where p is the position of the character, d_e is the total dimension of the character embedding, and i indexes a specific embedding dimension; even dimensions use PE_{(p,2i)} and odd dimensions use PE_{(p,2i+1)};
Step 8: load open-source word vectors as the word vectors of the word embedding layer; for example, the word vector of "gastritis" is [0.1, 0.2];
Step 9: train on the medical text segmented in step 3 with an ELMo model, combining the character vectors from step 7 and the word vectors from step 8, to obtain the ELMo pre-training model;
Step 9.1: adopt 7 convolutional layers with (width, filter-count) settings [1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024], as in the convolution-extracted word-vector part of FIG. 2; run them over the step-3 segmentation result, connect each convolutional layer to a global max-pooling layer, and concatenate all the pooled vectors to obtain a word vector;
Step 9.2: pass the word vector from step 9.1 through the Highway layer (see the Highway layer in FIG. 2), repeated twice; for example, the resulting vector for "gastritis" is [0.3, 0.4, 0.5], which is concatenated with the step-8 word vector of "gastritis" to give [0.1, 0.2, 0.3, 0.4, 0.5]; a linear projection layer then yields the final word vector x_k, e.g. "gastritis" becomes [0.15, 0.25, 0.35];
Step 9.3: feed the word vectors x_k into a bidirectional language model (biLM), such as the biLM portion of FIG. 2, comprising a forward language model and a backward language model. In the forward language model, the hidden representation of the k-th word at LSTM layer j is \overrightarrow{h}^{LM}_{k,j} (where j = 1, \dots, L). The forward language model models the word sequence by predicting the next word t_k from the previously observed words (t_1, \dots, t_{k-1}), with joint probability:

p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \dots, t_{k-1})
The backward language model is analogous: it models the word at the current step by observing the future words, with joint probability:

p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \dots, t_N)
Step 9.4: combine the forward and backward language models and jointly maximize the log-likelihood in both directions:

\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \dots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \dots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)

where \Theta_x denotes the word-embedding layer parameters, \Theta_s the Softmax layer parameters, \overrightarrow{\Theta}_{LSTM} the forward language model parameters, and \overleftarrow{\Theta}_{LSTM} the backward language model parameters;
Step 10: use the ELMo pre-training model to obtain the ELMo vector of a specific word in a sentence. For an L-layer bidirectional language model, there are 2L + 1 representations in total:

R_k = \{ x^{LM}_k, \overrightarrow{h}^{LM}_{k,j}, \overleftarrow{h}^{LM}_{k,j} \mid j = 1, \dots, L \} = \{ h^{LM}_{k,j} \mid j = 0, \dots, L \}

where, when j = 0, h^{LM}_{k,0} is the word vector, i.e. x^{LM}_k, and for j > 0, h^{LM}_{k,j} = [\overrightarrow{h}^{LM}_{k,j}; \overleftarrow{h}^{LM}_{k,j}] is the concatenated representation of the forward and backward language models. The final ELMo word vector is:

ELMo_k = \gamma \sum_{j=0}^{L} s_j \, h^{LM}_{k,j}

where s_j are the Softmax-normalized weights and \gamma is a scaling parameter.
If a 2-layer bidirectional language model is selected, the ELMo word vector of "gastritis" is the linear combination shown in the ELMo word vector portion of FIG. 2.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention and are not intended to be limiting. Although the present invention has been described in detail with reference to the embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments of the present invention or equivalents may be substituted for elements thereof without departing from the scope of the claims.

Claims (6)

1. A medical text-oriented pre-training method is characterized by comprising the following steps:
step 1, acquiring entities such as diseases, examination, symptoms, medicines, body parts, operations and the like from medical encyclopedia, and respectively constructing corresponding dictionaries;
step 2, according to the dictionaries established in step 1, using Scrapy to collect the corresponding popular-science medical data from encyclopedias, medical encyclopedias, and medical Q&A sites, and then collecting electronic medical record data from the Love Doctor platform;
step 3, loading the dictionary in the step 1, and performing word segmentation processing on the medical data by using jieba;
step 4, collecting Chinese character pictures in the Chinese dictionary, obtaining corresponding fonts and sizes, naming the Chinese characters in the pictures, counting all characters in the step 2, uniformly converting English letters into lowercase except the Chinese characters, uniformly converting numbers and punctuations into semi-corners, constructing corresponding pictures by adopting the fonts and the sizes corresponding to the Chinese characters, and preprocessing the character pictures;
step 5, extracting features from the character pictures preprocessed in step 4 with a VGG-16 convolutional network consisting of 13 convolutional layers and 3 fully-connected layers, taking the output of the 2nd fully-connected layer as the character feature;
step 6, reducing the dimensionality of the character features extracted in step 5 with principal component analysis (PCA), and using the result as the character vectors of the character embedding layer;
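The PCA reduction of step 6 can be sketched with a plain numpy SVD; the feature matrix below stands in for the VGG-16 fc2 outputs, and the sizes (64-d in, 8-d out) are illustrative, not the patent's:

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project feature rows onto their top principal components."""
    centered = features - features.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal axes,
    # sorted by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
char_features = rng.normal(size=(100, 64))   # 100 characters, 64-d features
vectors = pca_reduce(char_features, 8)       # 8-d character vectors
print(vectors.shape)
```

The first output column always carries at least as much variance as the last, which is the property PCA exploits when truncating the representation.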
step 7, superposing the character vector of the character embedding layer and the position vector of the position embedding layer to obtain a new character vector fusing character information and position information, wherein the position vector is calculated as:

PE(p, 2i) = sin(p / 10000^(2i/d_e))

PE(p, 2i+1) = cos(p / 10000^(2i/d_e))

where p is the position of the character, d_e is the total dimension of the embedding, and i indexes the embedding dimensions; PE(p, 2i) is used for even dimensions and PE(p, 2i+1) for odd dimensions;
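The step 7 position vector is the standard sinusoidal encoding; a minimal numpy sketch (the 50×16 size is illustrative):

```python
import numpy as np

def position_encoding(max_len, d_e):
    """Sinusoidal position encoding as in step 7:
    PE(p, 2i) = sin(p / 10000**(2i/d_e)), PE(p, 2i+1) = cos(same angle)."""
    pe = np.zeros((max_len, d_e))
    pos = np.arange(max_len)[:, None]
    even_i = np.arange(0, d_e, 2)[None, :]        # the 2i dimension indices
    angle = pos / np.power(10000.0, even_i / d_e)
    pe[:, 0::2] = np.sin(angle)                   # even dimensions
    pe[:, 1::2] = np.cos(angle)                   # odd dimensions
    return pe

pe = position_encoding(50, 16)
print(pe.shape)
```

At position 0 the even dimensions are all 0 and the odd dimensions all 1, a quick sanity check on any implementation of this formula.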
step 8, loading open-source pre-trained word vectors as the word vectors of the word embedding layer;
step 9, training on the medical text segmented in step 3 with an ELMo model, combining the character vectors from step 7 and the word vectors from step 8, to obtain an ELMo pre-training model;
step 9.1, adopting 7 convolution layers with (kernel width, filter count) configurations [1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024], convolving the character vectors of each word segmented in step 3, connecting each convolution layer to a global max-pooling layer, and concatenating all pooled vectors into one word vector;
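Step 9.1 is a multi-width character convolution with global max pooling. A numpy sketch with the claim's seven filter configurations, using random filters in place of trained ones (the 16-d character vectors are an assumed size):

```python
import numpy as np

def char_cnn(char_vectors, filter_specs, rng):
    """Multi-width character convolution + global max pooling (step 9.1).
    `filter_specs` pairs (kernel width, n_filters); the concatenated
    pooled maxima form the word vector, one slot per filter."""
    seq_len, d = char_vectors.shape
    pooled = []
    for width, n_filters in filter_specs:
        w = rng.normal(size=(n_filters, width * d))   # random stand-in filters
        # Slide the kernel over every window of `width` characters.
        windows = np.stack([char_vectors[i:i + width].ravel()
                            for i in range(seq_len - width + 1)])
        conv = windows @ w.T                          # (n_windows, n_filters)
        pooled.append(conv.max(axis=0))               # global max pool
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
specs = [[1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024]]
word = char_cnn(rng.normal(size=(7, 16)), specs, rng)  # a 7-character word
print(word.shape)  # 32+32+64+128+256+512+1024 = 2048 filters in total
```

The total filter count, 2048, fixes the word-vector width regardless of the word's character length, which is why the result can be fed uniformly into the HighWay layer of step 9.2.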
step 9.2, passing the word vector from step 9.1 through a HighWay layer twice, concatenating the result with the word vector from step 8, and then obtaining the final word vector x_k^{LM} through a linear projection layer (Linear Projection);
step 9.3, inputting the obtained word vector x_k^{LM} into a bidirectional language model (biLM) comprising a forward language model and a backward language model; in the forward language model, the k-th word is represented at hidden layer j of the L layers as →h_{k,j}^{LM} (where j = 1, 2, …, L); the forward language model predicts the word t_k at the next moment from the previously observed word sequence t_1, t_2, …, t_{k-1}, modeling the word sequence with the joint probability:

p(t_1, t_2, …, t_N) = ∏_{k=1}^{N} p(t_k | t_1, t_2, …, t_{k-1})

the backward language model is similar to the forward language model, modeling the word sequence at the current moment from the future word sequence, with the joint probability:

p(t_1, t_2, …, t_N) = ∏_{k=1}^{N} p(t_k | t_{k+1}, t_{k+2}, …, t_N)
step 9.4, combining the forward language model and the backward language model, and maximizing the log-likelihood in both directions:

Σ_{k=1}^{N} ( log p(t_k | t_1, …, t_{k-1}; Θ_x, →Θ_LSTM, Θ_s) + log p(t_k | t_{k+1}, …, t_N; Θ_x, ←Θ_LSTM, Θ_s) )

wherein Θ_x represents the word embedding layer, Θ_s represents the Softmax layer, →Θ_LSTM represents the parameters of the forward language model, and ←Θ_LSTM represents the parameters of the backward language model;
step 10, using the ELMo pre-training model to obtain the ELMo vector of a specific word in a sentence; for an L-layer bidirectional language model, each word has 2L + 1 representations in total:

R_k = { x_k^{LM}, →h_{k,j}^{LM}, ←h_{k,j}^{LM} | j = 1, …, L } = { h_{k,j}^{LM} | j = 0, 1, …, L }

wherein, when j = 0, h_{k,0}^{LM} is the word vector, i.e. h_{k,0}^{LM} = x_k^{LM}; for j > 0, h_{k,j}^{LM} is the concatenation of the forward and backward language model representations, i.e. h_{k,j}^{LM} = [→h_{k,j}^{LM}; ←h_{k,j}^{LM}]; the final ELMo word vector is represented as:

ELMo_k^{task} = γ^{task} · Σ_{j=0}^{L} s_j^{task} · h_{k,j}^{LM}

wherein s_j^{task} is the weight after Softmax normalization and γ^{task} is a scaling parameter.
2. The medical text-oriented pre-training method as claimed in claim 1, wherein the preprocessing of the character pictures in step 4 comprises reading each picture in RGB format, converting the color channels to BGR order, converting the picture layout to CHW, and then subtracting the mean from each color channel:

[X_B, X_G, X_R] = [X_B − X_B_MEAN, X_G − X_G_MEAN, X_R − X_R_MEAN]

wherein X_B, X_G, X_R are the B, G, R color channel values, respectively, and X_B_MEAN, X_G_MEAN, X_R_MEAN are the means of the B, G, R color channels, respectively.
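The claim 2 pipeline (RGB→BGR, HWC→CHW, per-channel mean subtraction) in numpy; the image and the mean values are illustrative:

```python
import numpy as np

def preprocess(rgb_image, bgr_means):
    """Claim 2 preprocessing: RGB -> BGR channel order, HWC -> CHW layout,
    then subtract per-channel means (the means here are illustrative)."""
    bgr = rgb_image[:, :, ::-1]            # swap R and B channels
    chw = bgr.transpose(2, 0, 1)           # channel-first layout
    means = np.asarray(bgr_means).reshape(3, 1, 1)
    return chw - means

img = np.ones((4, 4, 3)) * np.array([10.0, 20.0, 30.0])  # R=10, G=20, B=30
out = preprocess(img, bgr_means=[5.0, 5.0, 5.0])
print(out.shape, out[0, 0, 0])  # B channel comes first: 30 - 5 = 25
```

This BGR/CHW/mean-subtraction convention matches what VGG-style networks trained with Caffe-era preprocessing expect, which is presumably why step 4 adopts it before the VGG-16 of step 5.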
3. The medical text-oriented pre-training method according to claim 1, wherein the HighWay layer of step 9.2 is formulated as:

y=g*x+(1-g)*f(A(x))

g=σ(B(x))

where x denotes the vector concatenated in step 9.1, A and B denote linear functions, f denotes an activation function, and σ denotes the sigmoid function.
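A numpy sketch of the claim 3 HighWay layer; ReLU stands in for the unspecified activation f, and the weights are random stand-ins:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway(x, A, B, bA, bB):
    """Claim 3 HighWay layer: y = g*x + (1-g)*f(A(x)), g = sigmoid(B(x)).
    A and B are the linear maps of the claim; ReLU plays the role of f."""
    g = sigmoid(x @ B + bB)                    # carry/transform gate
    transformed = np.maximum(0.0, x @ A + bA)  # f(A(x)) with f = ReLU
    return g * x + (1.0 - g) * transformed

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
A, B = rng.normal(size=(d, d)), rng.normal(size=(d, d))
# A strongly positive gate bias drives g -> 1, so the layer passes x through.
y = highway(x, A, B, bA=np.zeros(d), bB=np.full(d, 50.0))
print(np.allclose(y, x, atol=1e-6))
```

The gate g interpolates between copying the input (g ≈ 1) and the transformed path (g ≈ 0), which is what lets step 9.2 stack the layer twice without degrading the character-derived signal.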
4. The medical text-oriented pre-training method according to claim 1, wherein the linear projection layer of step 9.2 is formulated as:

y=w*x+b

wherein w represents a weight, x represents the input vector, and b represents a bias.
5. The medical text-oriented pre-training method as claimed in claim 1, wherein the bidirectional language model (biLM) in step 9.3 is a bidirectional long short-term memory (LSTM) network formed by combining a forward LSTM and a backward LSTM, and the LSTM is formulated as:

f^(t)=σ(W^(f)x^(t)+U^(f)h^(t-1))

i^(t)=σ(W^(i)x^(t)+U^(i)h^(t-1))

o^(t)=σ(W^(o)x^(t)+U^(o)h^(t-1))

c̃^(t)=tanh(W^(c)x^(t)+U^(c)h^(t-1))

c^(t)=f^(t)∘c^(t-1)+i^(t)∘c̃^(t)

h^(t)=o^(t)∘tanh(c^(t))

where t represents the current time and t-1 the previous time; f^(t), i^(t), o^(t), h^(t) represent the forget gate, input gate, output gate and output, respectively; σ is the sigmoid function; c̃^(t), c^(t-1), c^(t) represent the candidate state, the state at the previous time, and the new state, respectively; W^(f), U^(f), W^(i), U^(i), W^(o), U^(o), W^(c), U^(c) are training parameters, updated automatically during training.
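One step of the claim 5 LSTM recurrence in numpy, with random stand-in parameters and illustrative sizes (4-d input, 3-d hidden state):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the claim 5 equations."""
    Wf, Uf, Wi, Ui, Wo, Uo, Wc, Uc = params
    f = sigmoid(Wf @ x_t + Uf @ h_prev)        # forget gate
    i = sigmoid(Wi @ x_t + Ui @ h_prev)        # input gate
    o = sigmoid(Wo @ x_t + Uo @ h_prev)        # output gate
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev)  # candidate state
    c = f * c_prev + i * c_tilde               # new cell state
    h = o * np.tanh(c)                         # output
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
# W matrices map the input (d_h x d_in), U matrices map the state (d_h x d_h).
params = [rng.normal(size=(d_h, d_in)) if k % 2 == 0
          else rng.normal(size=(d_h, d_h)) for k in range(8)]
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), params)
print(h.shape, c.shape)
```

Because h = o ∘ tanh(c) with both factors bounded by 1 in magnitude, every component of the output lies strictly inside (−1, 1).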
6. The medical text-oriented pre-training method according to claim 1, wherein the Softmax function in step 10 is:

Softmax(y_i) = e^{y_i} / Σ_{j=1}^{C} e^{y_j}

wherein y_i is the output of the current output unit, j is the output index and C the total number of outputs; Softmax(y_i) is the ratio of the exponential of the current element to the sum of the exponentials of all elements.
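The claim 6 Softmax in numpy, with the usual max-subtraction added for numerical stability (an implementation detail the claim does not specify):

```python
import numpy as np

def softmax(y):
    """Claim 6 Softmax: exp(y_i) / sum_j exp(y_j).  Subtracting the max
    before exponentiating avoids overflow without changing the result."""
    e = np.exp(y - np.max(y))
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)  # sums to 1, ordered like the inputs
```

The outputs form a probability distribution over the C output units, which is what makes both the layer-weight normalization of step 10 and the Softmax layer Θ_s of step 9.4 well defined.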
CN202110690028.2A 2021-06-23 2021-06-23 Pre-training method for medical text Active CN113674866B (en)


Publications (2)

Publication Number Publication Date
CN113674866A true CN113674866A (en) 2021-11-19
CN113674866B CN113674866B (en) 2024-06-14
