CN113674866A - Medical text oriented pre-training method - Google Patents
- Publication number: CN113674866A (application CN202110690028.2A)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption by Google Patents, not a legal conclusion)
Classifications
- G16H50/70 — ICT specially adapted for medical diagnosis or data mining, e.g. analysing previous cases of other patients
- G06F16/3335 — syntactic pre-processing, e.g. stopword elimination, stemming
- G06F16/3344 — query execution using natural language analysis
- G06F16/35 — clustering; classification of unstructured textual data
- G06F18/2135 — feature extraction based on approximation criteria, e.g. principal component analysis
- G06F18/241 — classification techniques relating to the classification model
- G06F18/2415 — classification based on parametric or probabilistic models
- G06F40/279 — recognition of textual entities
- G06F40/289 — phrasal analysis, e.g. finite state techniques or chunking
- G06N3/044 — recurrent networks
- G06N3/047 — probabilistic or stochastic networks
- G06N3/048 — activation functions
- G06N3/08 — learning methods
- Y02A90/10 — ICT supporting adaptation to climate change
Abstract
The invention discloses a medical-text-oriented pre-training method, which comprises the following specific steps: acquiring medical dictionaries of diseases, examinations, symptoms, medicines, body parts, operations, and the like; acquiring medical text content from encyclopedias and electronic medical records; loading the medical dictionaries and segmenting the medical text with jieba to form the training corpus; acquiring pictures of Chinese characters from a Chinese dictionary, and constructing pictures for characters that are not found there; extracting character (glyph) features with a VGG-16 convolutional network; reducing the dimensionality of the extracted character features with PCA (principal component analysis) to obtain character vectors; superposing each character vector with the character's position vector to form a new character vector; loading an open-source Chinese word-vector corpus as the initial word vectors; training on the medical text content with an ELMo model to obtain the final ELMo pre-training model; and using the ELMo pre-training model to generate the ELMo vector of a particular word in a sentence. The pre-training method addresses the problem that general corpora are ill-suited to medical natural language processing tasks.
Description
Technical Field
The invention relates to a pre-training method for medical text and belongs to the technical field of natural language processing.
Background
Natural language processing is an important research direction in computer science and artificial intelligence; it aims to enable machines to understand and manipulate human natural language. In recent years, thanks to the development of deep learning, natural language processing has made important breakthroughs in tasks such as machine translation, reading comprehension, sentiment analysis, part-of-speech tagging, and text classification. The development of pre-training techniques has played a critical role in these advances.
Deep-learning-based natural language processing tasks generally require a large amount of labeled data to achieve good results; when data are scarce, model performance is often unsatisfactory. Pre-trained models have changed this situation: a deep base model is trained on a large dataset and then migrated to downstream natural language processing tasks by fine-tuning or similar means. However, common pre-trained models such as ELMo, GPT, and BERT are trained on general corpora, and the content of medical text differs greatly from that of general text, so directly applying existing pre-trained models to medical text tasks rarely achieves the expected effect. The invention therefore builds, on the basis of current pre-training models, a pre-trained model suited to medical text, improving the effectiveness of medical natural language processing.
Disclosure of Invention
The invention aims to overcome the shortcoming that conventional pre-trained models adapt poorly to medical natural language processing tasks, and provides a pre-training method for medical text.
To achieve this purpose, the invention adopts the following technical scheme:
a medical text-oriented pre-training method comprises corpus generation, word vector construction and model training. In the corpus acquisition, dictionaries such as diseases, examination, symptoms, medicines, body parts, operations and the like are acquired, medical data such as encyclopedia and electronic medical records are acquired, and then the medical data are participled by using jieba in combination with the dictionaries to serve as training corpuses. In the word vector construction, due to ideographical property of Chinese characters, the VGG-16 convolution network is used for extracting the character pattern characteristics of the Chinese characters, and then the PCA is used for reducing the dimension of the Chinese characters to be used as word vectors. The word vector portion selects the Chinese word vector that loads the open source. In model training, an ELMo model is used for training a medical text, and optimal model parameters are stored. Then, the pre-trained ELMo model is used to generate ELMo vectors for the particular word. The method specifically comprises the following steps:
Step 1, acquiring entities such as diseases, examinations, symptoms, medicines, body parts, and operations from medical encyclopedias, and constructing a corresponding dictionary for each entity type;
step 2, according to the dictionaries established in step 1, using Scrapy to crawl the corresponding medical science popularization data from Baidu encyclopedia, medical encyclopedias, and medical Q&A sites, and then collecting electronic medical record data from the Aiyi medical platform;
step 3, loading the dictionaries of step 1 and segmenting the medical data with jieba;
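The patent segments the corpus with jieba after loading the medical dictionaries. As a dependency-free sketch of the same idea, the following forward-maximum-matching segmenter shows how loading a medical dictionary changes segmentation; the dictionary entries here are hypothetical, and jieba's own algorithm is more sophisticated.

```python
def fmm_segment(text, dictionary, max_len=6):
    """Forward maximum matching: at each position, greedily take the longest
    dictionary entry; fall back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Hypothetical medical dictionary entries (gastritis, epigastrium, pain).
medical_dict = {"胃炎", "上腹", "疼痛"}
print(fmm_segment("胃炎患者上腹疼痛", medical_dict))
# ['胃炎', '患', '者', '上腹', '疼痛']
```

Without the dictionary, every character would fall through to the single-character fallback; the loaded entries are what keep medical terms intact.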
Step 4, collecting Chinese character pictures from the Handian Chinese dictionary, recording the corresponding font and size, and naming each picture after the character it contains. All characters in the data of step 2 are counted; among the non-Chinese characters, English letters are uniformly converted to lower case and digits and punctuation are uniformly converted to half-width forms. Pictures for characters not found in the dictionary are constructed in the same font and size as the collected character pictures, and all character pictures are then preprocessed;
Step 5, extracting features from the character pictures preprocessed in step 4 with a VGG-16 convolutional network. The VGG-16 network consists of 13 convolutional layers and 3 fully-connected layers; the character features are taken from the output of the 2nd fully-connected layer;
step 6, reducing the dimensionality of the character features extracted in step 5 with PCA (principal component analysis) and using the result as the character vectors of the character embedding layer;
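A minimal sketch of the step-6 dimension reduction with scikit-learn. The input here is random data standing in for glyph features (4096 is the VGG-16 fc2 output width; the 64-dimension target is an illustrative choice, not fixed by the patent):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for VGG-16 fc2 glyph features: 500 characters x 4096 dims.
char_features = rng.normal(size=(500, 4096))

# Project onto the top principal components to get compact character vectors.
pca = PCA(n_components=64)
char_vectors = pca.fit_transform(char_features)
print(char_vectors.shape)  # (500, 64)
```

The smaller character vectors shrink the character embedding layer, which is what the patent credits for the improved ELMo training efficiency.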
Step 7, superposing each character vector in the character embedding layer with the character's position vector from the character-position embedding layer, obtaining a new character vector that fuses glyph information and position information, wherein the position vector is calculated as:

PE(p, 2i) = sin(p / 10000^(2i/de))
PE(p, 2i+1) = cos(p / 10000^(2i/de))

wherein p is the position of the character, de is the total dimension of the character embedding, and i indexes a specific embedding dimension; when the dimension index is even, PE(p, 2i) is selected, and when it is odd, PE(p, 2i+1) is selected;
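A numpy sketch of the step-7 position vector and superposition. The standard transformer sinusoidal form (sin for even dimensions, cos for odd) is assumed, matching the PE(p,2i) / PE(p,2i+1) selection rule:

```python
import numpy as np

def position_encoding(p, d_e):
    """Sinusoidal position vector for position p, embedding dimension d_e."""
    pe = np.zeros(d_e)
    for i in range(0, d_e, 2):
        angle = p / (10000 ** (i / d_e))
        pe[i] = np.sin(angle)        # even dimension: PE(p, 2i)
        if i + 1 < d_e:
            pe[i + 1] = np.cos(angle)  # odd dimension: PE(p, 2i+1)
    return pe

char_vec = np.ones(8)                       # hypothetical character vector
fused = char_vec + position_encoding(3, 8)  # step-7 superposition
print(fused.shape)  # (8,)
```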
Step 8, loading open-source word vectors as the word vectors of the word embedding layer;
step 9, training the ELMo model on the medical text segmented in step 3, combining the character vectors of step 7 with the word vectors of step 8, to obtain the ELMo pre-training model;
step 9.1, applying 7 convolution layers of different kernel widths and channel counts, [[1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024]], to the character vectors of each word in the segmentation result of step 3; each convolution layer is followed by a global max-pooling layer, and all pooled vectors are concatenated to obtain a word vector;
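A numpy sketch of this character-level CNN. Only the kernel widths and channel counts come from the patent; the filter weights are random and the 64-dimensional character vectors are an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
CONV_SPECS = [[1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024]]

def char_cnn_word_vector(char_vecs):
    """char_vecs: (num_chars, char_dim). Returns the concatenation of the
    global-max-pooled outputs of the 7 convolutions (32+...+1024 = 2048 dims)."""
    n, d = char_vecs.shape
    pooled = []
    for width, channels in CONV_SPECS:
        w = rng.normal(size=(width * d, channels)) * 0.01   # random filter bank
        # Zero-pad short words so every kernel width has at least one window.
        padded = np.vstack([char_vecs, np.zeros((max(0, width - n), d))])
        windows = [padded[i:i + width].reshape(-1)
                   for i in range(len(padded) - width + 1)]
        feats = np.stack(windows) @ w        # (positions, channels)
        pooled.append(feats.max(axis=0))     # global max pooling per channel
    return np.concatenate(pooled)

word_vec = char_cnn_word_vector(rng.normal(size=(3, 64)))  # a 3-character word
print(word_vec.shape)  # (2048,)
```

Because each convolution is max-pooled over character positions, every word maps to a fixed 2048-dimensional vector regardless of its length.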
step 9.2, passing the word vector of step 9.1 through a HighWay layer twice, concatenating the result with the word vector of step 8, and then applying a linear projection layer (Linear Projection) to obtain the final word vector;
Step 9.3, the final word vectors are input to a bidirectional language model (biLM) comprising a forward language model and a backward language model. In the forward language model, the k-th word passes through L layers, and its hidden representation at layer j is h(k,j) (where j = 1, 2, ..., L). The forward language model models the word sequence by predicting the word tk at the next moment from the previously observed words t1, t2, ..., t(k-1), with joint probability:

p(t1, t2, ..., tN) = ∏(k=1..N) p(tk | t1, t2, ..., t(k-1))

The backward language model is symmetric to the forward one: it models the word at the current moment from the future word sequence, with joint probability:

p(t1, t2, ..., tN) = ∏(k=1..N) p(tk | t(k+1), t(k+2), ..., tN)
Step 9.4, combining the forward and backward language models and jointly maximizing the log-likelihood in both directions:

∑(k=1..N) [ log p(tk | t1, ..., t(k-1); Θx, Θ→LSTM, Θs) + log p(tk | t(k+1), ..., tN; Θx, Θ←LSTM, Θs) ]

wherein Θx denotes the word embedding layer, Θs denotes the Softmax layer, Θ→LSTM denotes the parameters of the forward language model, and Θ←LSTM denotes the parameters of the backward language model;
Step 10, using the ELMo pre-training model to obtain the ELMo vector of a particular word in a sentence. For an L-layer bidirectional language model, each word k has 2L + 1 representations in total:

Rk = { x(k), h→(k,j), h←(k,j) | j = 1, ..., L } = { h(k,j) | j = 0, ..., L }

wherein, when j = 0, h(k,0) is the word vector, i.e. h(k,0) = x(k); for j > 0, h(k,j) is the concatenation of the forward and backward representations, h(k,j) = [h→(k,j); h←(k,j)]. The final ELMo word vector is a scaled linear combination of these layers:

ELMo(k) = γ ∑(j=0..L) s(j) h(k,j)

wherein s(j) are softmax-normalized layer weights and γ is a scalar scaling parameter.
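A numpy sketch of the step-10 layer combination, assuming ELMo's standard softmax-normalized layer weights s(j) and scalar γ; the layer representations and weight values below are illustrative:

```python
import numpy as np

def elmo_vector(layer_reps, s_logits, gamma):
    """Collapse the L+1 per-layer representations of one word into an ELMo vector.

    layer_reps: (L+1, dim) array of h(k,0)..h(k,L); s_logits: (L+1,) raw
    layer weights, softmax-normalized before the weighted sum."""
    s = np.exp(s_logits - s_logits.max())
    s = s / s.sum()                              # softmax weights s(j)
    return gamma * (s[:, None] * layer_reps).sum(axis=0)

# L = 2 layers plus the j = 0 word-vector layer, each 4-dimensional.
reps = np.stack([np.full(4, 1.0), np.full(4, 2.0), np.full(4, 3.0)])
vec = elmo_vector(reps, np.zeros(3), gamma=1.0)  # equal weights -> layer mean
print(vec)  # [2. 2. 2. 2.]
```

In a downstream task, s_logits and gamma would be trained jointly with the task model while the biLM stays frozen.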
Preferably, the preprocessing of a character picture in step 4 is: read the picture in RGB format, convert the color channels to BGR order, convert the picture layout to CHW, and then subtract the per-channel mean from each color channel:

[XB, XG, XR] = [XB, XG, XR] - [XB_MEAN, XG_MEAN, XR_MEAN]

wherein XB, XG, XR are the B, G, R color channel values, respectively, and XB_MEAN, XG_MEAN, XR_MEAN are the means of the B, G, R color channels.
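A numpy sketch of this preprocessing (the channel means below are illustrative, chosen so the toy image cancels to zero):

```python
import numpy as np

def preprocess(rgb_image, mean_bgr):
    """Step-4 picture preprocessing: RGB -> BGR channel order,
    HWC -> CHW layout, then per-channel mean subtraction."""
    bgr = rgb_image[:, :, ::-1].astype(np.float64)    # reverse channel axis
    chw = bgr.transpose(2, 0, 1)                      # HWC -> CHW
    return chw - np.asarray(mean_bgr)[:, None, None]  # broadcast mean per channel

img = np.dstack([np.full((2, 2), 10.0),   # R
                 np.full((2, 2), 20.0),   # G
                 np.full((2, 2), 30.0)])  # B
out = preprocess(img, mean_bgr=(30.0, 20.0, 10.0))    # hypothetical means
print(out[:, 0, 0])  # [0. 0. 0.]
```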
Preferably, the HighWay layer in step 9.2 computes:

y = g * x + (1 - g) * f(A(x))
g = σ(B(x))

wherein x is the vector concatenated in step 9.1, A and B are linear functions, f is an activation function, and σ is the sigmoid function.
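A numpy sketch of this HighWay layer, using the patent's gating form y = g*x + (1-g)*f(A(x)) with tanh as the activation; the affine weights are random stand-ins, not trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway(x, A_w, A_b, B_w, B_b, f=np.tanh):
    """HighWay layer: the gate g = sigmoid(B(x)) mixes the carried input x
    with the transformed branch f(A(x))."""
    g = sigmoid(x @ B_w + B_b)                 # transform/carry gate
    return g * x + (1.0 - g) * f(x @ A_w + A_b)

dim = 4
rng = np.random.default_rng(0)
x = rng.normal(size=dim)
y = highway(x, rng.normal(size=(dim, dim)), np.zeros(dim),
            rng.normal(size=(dim, dim)), np.zeros(dim))
print(y.shape)  # (4,)
```

With zero weights the gate is 0.5 everywhere, so the output is simply the average of the input and the (zero) transformed branch, which makes the gating behavior easy to check.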
Preferably, the linear projection layer in step 9.2 computes:

y = w * x + b

wherein w represents the weight, x the input variable, and b the bias.
Preferably, the bidirectional language model (biLM) in step 9.3 is a bidirectional long short-term memory network (LSTM) formed by combining a forward LSTM and a backward LSTM, where each LSTM cell computes:

f(t) = σ(Wf · [h(t-1), x(t)] + bf)
i(t) = σ(Wi · [h(t-1), x(t)] + bi)
o(t) = σ(Wo · [h(t-1), x(t)] + bo)
c~(t) = tanh(Wc · [h(t-1), x(t)] + bc)
c(t) = f(t) ⊙ c(t-1) + i(t) ⊙ c~(t)
h(t) = o(t) ⊙ tanh(c(t))

wherein t denotes the current moment and t-1 the previous moment; f(t), i(t), o(t), h(t) denote the forget gate, input gate, output gate, and output, respectively; σ is the sigmoid function; c~(t), c(t-1), c(t) denote the candidate state, the state at the previous moment, and the new state; and the W and b terms are training parameters, updated automatically during training.
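A numpy sketch of one LSTM cell step matching these formulas; the four weight matrices are stacked into a single W for brevity, and the sizes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step. W: (4*hidden, hidden+input) stacking
    [Wf; Wi; Wo; Wc]; b: (4*hidden,) stacking the biases the same way."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:hidden])               # forget gate
    i = sigmoid(z[hidden:2 * hidden])      # input gate
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate
    c_tilde = np.tanh(z[3 * hidden:])      # candidate state
    c = f * c_prev + i * c_tilde           # new cell state
    h = o * np.tanh(c)                     # output
    return h, c

rng = np.random.default_rng(0)
hid, inp = 3, 2
h, c = lstm_step(rng.normal(size=inp), np.zeros(hid), np.zeros(hid),
                 rng.normal(size=(4 * hid, hid + inp)) * 0.1, np.zeros(4 * hid))
print(h.shape, c.shape)  # (3,) (3,)
```

The backward LSTM runs the same cell over the reversed word sequence; the biLM concatenates the two directions' hidden states per layer.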
Preferably, the Softmax function in step 10 is:

softmax(z(i)) = e^(z(i)) / ∑(j=1..n) e^(z(j))

wherein z(i) is the output of the current output unit, n is the total number of output units, and softmax(z(i)) is the ratio of the exponential of the current element to the sum of the exponentials of all elements.
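A minimal numpy implementation of this function (shifted by the maximum for numerical stability, which does not change the result):

```python
import numpy as np

def softmax(z):
    """Exponential of each element divided by the sum of all exponentials."""
    e = np.exp(z - np.max(z))  # max-shift for numerical stability
    return e / e.sum()

probs = softmax(np.array([1.0, 2.0, 3.0]))
print(probs.sum())
```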
The pre-training method for medical text has the following advantages:
The invention uses a VGG-16 convolutional network to extract character features from medical text, exploiting the ideographic nature of Chinese characters by extracting glyph features from their pictures; PCA dimension reduction shrinks the character embedding layer and improves the training efficiency of the ELMo model; fusing each character vector with its position vector encodes the order of characters within a word; training the ELMo model on the preprocessed medical text mines the semantic relations within words; and pre-training on medical text solves the problem that general corpora are ill-suited to medical natural language processing tasks.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Fig. 2 is an overall architecture diagram of a medical text-oriented pre-training method.
FIG. 3 is a VGG-16 convolutional network architecture diagram.
Detailed Description
To better illustrate the objects, technical solutions and advantages of the present invention, a pre-training method for medical texts according to the present invention is described in detail below with reference to the accompanying drawings and embodiments.
A medical-text-oriented pre-training method proceeds as follows: first, a medical entity dictionary is constructed; second, medical text content is collected according to the dictionary; third, the medical text is preprocessed according to the dictionary; fourth, character and word vectors are acquired; fifth, the ELMo model is trained on the preprocessed medical text with those vectors; and finally, the pre-trained ELMo model generates ELMo word vectors, as shown in FIG. 1. The method specifically comprises the following steps:
Step 1, acquiring entities such as diseases, examinations, symptoms, medicines, body parts, and operations from medical encyclopedias and constructing the corresponding dictionaries;
step 2, collecting the medical science popularization data corresponding to, for example, "gastritis" from Baidu encyclopedia, medical encyclopedias, and medical Q&A sites with Scrapy, and then collecting electronic medical record data from the Aiyi medical platform, e.g. the sentence "the gastritis patient has upper abdominal pain";
step 3, loading the dictionaries of step 1 and segmenting the collected medical data with jieba;
Step 4, collecting Chinese character pictures from the Handian Chinese dictionary, recording the corresponding font and size, and naming each picture after its character, e.g. "胃.png" for "stomach". All characters in the data of step 2 are counted; non-Chinese characters are normalized (English letters converted uniformly to lower case, digits and punctuation to half-width), and pictures for characters not found in the dictionary are constructed in the same font and size. For the example sentence "the gastritis patient has upper abdominal pain", the corresponding pictures are the character-picture part shown in FIG. 2. Each character picture is then read in RGB format, its color channels are converted to BGR order, its layout is converted to CHW, and the per-channel means are subtracted:

[XB, XG, XR] = [XB, XG, XR] - [XB_MEAN, XG_MEAN, XR_MEAN]
Step 5, extracting features from the character pictures processed in step 4 with the VGG-16 convolutional network, which consists of 13 convolutional layers and 3 fully-connected layers; the character features are taken from the output of the 2nd fully-connected layer, i.e. the retained part of the VGG-16 network in FIG. 3;
step 6, reducing the dimensionality of the character features extracted in step 5 with PCA to obtain the character vectors of the character embedding layer; for example, if "胃" (stomach) were the 10-dimensional vector [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], PCA might reduce it to the 6-dimensional [0, 1, 2, 3, 4, 5];
Step 7, superposing each character vector in the character embedding layer with the character's position vector from the character-position embedding layer, obtaining a new character vector that fuses glyph information and position information, wherein the position vector is calculated as:

PE(p, 2i) = sin(p / 10000^(2i/de))
PE(p, 2i+1) = cos(p / 10000^(2i/de))

wherein p is the position of the character, de is the total dimension of the character embedding, and i indexes a specific embedding dimension; when the dimension index is even, PE(p, 2i) is selected, and when it is odd, PE(p, 2i+1) is selected;
Step 8, loading the open-source word vectors as the word vectors of the word embedding layer; for example, the word vector of "gastritis" might be [0.1, 0.2];
step 9, training the ELMo model on the medical text segmented in step 3, combining the character vectors of step 7 with the word vectors of step 8, to obtain the ELMo pre-training model;
step 9.1, applying the 7 convolution layers [[1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024]] (the convolution-extracted word-vector part of FIG. 2) to the character vectors of each word in the segmentation result of step 3; each convolution layer is followed by a global max-pooling layer, and all pooled vectors are concatenated into a word vector;
step 9.2, passing the word vector of step 9.1 through the HighWay layer twice (the HighWay layer of FIG. 2); for example, if the HighWay output for "gastritis" is [0.3, 0.4, 0.5], it is concatenated with the step-8 vector [0.1, 0.2] of "gastritis" to give [0.1, 0.2, 0.3, 0.4, 0.5], and the linear projection layer (Linear Projection) then produces the final word vector, e.g. [0.15, 0.25, 0.35] for "gastritis";
Step 9.3, the final word vectors are input to the bidirectional language model (biLM, the corresponding part of FIG. 2), which comprises a forward language model and a backward language model. In the forward language model, the k-th word passes through L layers, and its hidden representation at layer j is h(k,j) (where j = 1, 2, ..., L). The forward language model models the word sequence by predicting the word tk at the next moment from the previously observed words t1, t2, ..., t(k-1), with joint probability:

p(t1, t2, ..., tN) = ∏(k=1..N) p(tk | t1, t2, ..., t(k-1))

The backward language model is symmetric to the forward one: it models the word at the current moment from the future word sequence, with joint probability:

p(t1, t2, ..., tN) = ∏(k=1..N) p(tk | t(k+1), t(k+2), ..., tN)
Step 9.4, combining the forward and backward language models and jointly maximizing the log-likelihood in both directions:

∑(k=1..N) [ log p(tk | t1, ..., t(k-1); Θx, Θ→LSTM, Θs) + log p(tk | t(k+1), ..., tN; Θx, Θ←LSTM, Θs) ]

wherein Θx denotes the word embedding layer, Θs denotes the Softmax layer, Θ→LSTM denotes the parameters of the forward language model, and Θ←LSTM denotes the parameters of the backward language model;
Step 10, using the ELMo pre-training model to obtain the ELMo vector of a particular word in a sentence. For an L-layer bidirectional language model, each word k has 2L + 1 representations in total:

Rk = { x(k), h→(k,j), h←(k,j) | j = 1, ..., L } = { h(k,j) | j = 0, ..., L }

wherein, when j = 0, h(k,0) is the word vector, i.e. h(k,0) = x(k); for j > 0, h(k,j) is the concatenation of the forward and backward representations. The final ELMo word vector is represented as:

ELMo(k) = γ ∑(j=0..L) s(j) h(k,j)

wherein s(j) are softmax-normalized layer weights and γ is a scalar scaling parameter.
If a 2-layer bidirectional language model is selected, the ELMo word vector of "gastritis" is the linear combination shown in the ELMo word-vector part of FIG. 2.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention and are not intended to be limiting. Although the present invention has been described in detail with reference to the embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments of the present invention or equivalents may be substituted for elements thereof without departing from the scope of the claims.
Claims (6)
1. A medical text-oriented pre-training method is characterized by comprising the following steps:
step 1, acquiring entities such as diseases, examinations, symptoms, medicines, body parts, and operations from medical encyclopedias, and respectively constructing the corresponding dictionaries;
step 2, according to the dictionaries established in step 1, using Scrapy to crawl the corresponding medical science popularization data from Baidu encyclopedia, medical encyclopedias, and medical Q&A sites, and then collecting electronic medical record data from the Aiyi medical platform;
step 3, loading the dictionaries of step 1 and segmenting the medical data with jieba;
step 4, collecting Chinese character pictures from the Handian Chinese dictionary, recording the corresponding font and size, and naming each picture after its character; counting all characters in the data of step 2, uniformly converting English letters to lower case and digits and punctuation to half-width forms among the non-Chinese characters, constructing pictures for missing characters in the same font and size as the Chinese character pictures, and preprocessing the character pictures;
step 5, extracting features from the character pictures preprocessed in step 4 with a VGG-16 convolutional network, wherein the VGG-16 network consists of 13 convolutional layers and 3 fully-connected layers, and the character features are taken from the output of the 2nd fully-connected layer;
step 6, reducing the dimensionality of the character features extracted in step 5 with PCA (principal component analysis) and using the result as the character vectors of the character embedding layer;
step 7, superposing each character vector in the character embedding layer with the character's position vector from the character-position embedding layer to obtain a new character vector fusing glyph information and position information, wherein the position vector is calculated as:

PE(p, 2i) = sin(p / 10000^(2i/de))
PE(p, 2i+1) = cos(p / 10000^(2i/de))

where p is the position of the character, de is the total dimension of the character embedding, and i is the specific embedding dimension; when the dimension index is even, PE(p, 2i) is selected, and when it is odd, PE(p, 2i+1) is selected;
step 8, loading open-source word vectors as the word vectors of the word embedding layer;
step 9, training the ELMo model on the medical text segmented in step 3, combining the character vectors of step 7 with the word vectors of step 8, to obtain the ELMo pre-training model;
step 9.1, applying 7 convolution layers of different kernel widths and channel counts, [[1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024]], to the character vectors of each word in the segmentation result of step 3, following each convolution layer with a global max-pooling layer, and concatenating all pooled vectors to obtain a word vector;
step 9.2, passing the word vector of step 9.1 through a HighWay layer twice, concatenating the result with the word vector of step 8, and then applying a linear projection layer (Linear Projection) to obtain the final word vector;
Step 9.3, the obtained word vectorInputting the input into a bidirectional language model (biLM) comprising a forward language model in which the k-th word is represented as a hidden layer through an L layer(where j ═ 1, 2, …, L), the forward language model passes through the previously observed word sequence t1,t2,…,tk-1To predict the word t at the next momentkModeling is carried out on the word sequence, and the joint probability is as follows:
The backward language model is similar to the forward one, modeling the word at the current moment from the future word sequence, with joint probability:

p(t_1, t_2, …, t_N) = ∏_{k=1}^{N} p(t_k | t_(k+1), t_(k+2), …, t_N)
step 9.4, combining the forward and backward language models and maximizing the log-likelihood in both directions:

∑_{k=1}^{N} [ log p(t_k | t_1, …, t_(k-1); Θx, Θ→LSTM, Θs) + log p(t_k | t_(k+1), …, t_N; Θx, Θ←LSTM, Θs) ]

where Θx represents the word embedding layer, Θs represents the Softmax layer, Θ→LSTM represents the parameters of the forward language model, and Θ←LSTM represents the parameters of the backward language model;
step 10, using the ELMo pre-training model to obtain the ELMo vector of a specific word in a sentence; for an L-layer bidirectional language model, there are 2L + 1 representations in total:

R_k = { x_k, h→(k, j), h←(k, j) | j = 1, …, L } = { h(k, j) | j = 0, …, L }

where, when j = 0, h(k, 0) is the word vector, i.e. h(k, 0) = x_k; when j > 0, h(k, j) = [h→(k, j); h←(k, j)] is the concatenation of the forward and backward language-model representations. The final ELMo word vector is then:

ELMo_k = γ · ∑_{j=0}^{L} s_j · h(k, j)

where s_j are softmax-normalized layer weights and γ is a scaling factor.
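Collapsing the layer representations into one ELMo vector can be sketched as below; the softmax-normalized weights s_j and scale γ follow the standard ELMo formulation, and the sizes (L = 2 biLM layers plus the word-vector layer, 1024 dimensions) are illustrative:

```python
import numpy as np

def elmo_vector(layer_reps, s_logits, gamma=1.0):
    """Collapse the L+1 layer representations h(k, j), j = 0..L, of one word
    into a single ELMo vector with softmax-normalised weights s_j and scale gamma."""
    s = np.exp(s_logits - s_logits.max())
    s /= s.sum()                                   # softmax over layers
    return gamma * np.tensordot(s, layer_reps, axes=1)

# 3 layer representations (word vector + 2 biLM layers), 1024-dim each
reps = np.random.randn(3, 1024)
v = elmo_vector(reps, np.zeros(3))                 # equal weights here
print(v.shape)  # (1024,)
```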
2. The medical text-oriented pre-training method as claimed in claim 1, wherein the pre-processing of the character picture in step 4 is to read the picture of the character in RGB format, convert the color channel into BGR format, then convert the picture channel sequence into CHW, and then subtract the mean value from the color channel:
[X_B, X_G, X_R] = [X_B, X_G, X_R] - [X_B_MEAN, X_G_MEAN, X_R_MEAN]

where X_B, X_G, X_R are the B, G, R color channel values respectively, and X_B_MEAN, X_G_MEAN, X_R_MEAN are the means of the B, G, R color channels respectively.
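The channel conversion and mean subtraction of claim 2 can be sketched as follows; the mean values used here are illustrative placeholders, not values from the patent:

```python
import numpy as np

def preprocess_char_image(img_rgb_hwc, bgr_mean):
    """RGB -> BGR channel order, HWC -> CHW layout, then per-channel mean subtraction."""
    img = img_rgb_hwc[:, :, ::-1].astype(np.float64)   # RGB -> BGR
    img = img.transpose(2, 0, 1)                       # HWC -> CHW
    return img - np.asarray(bgr_mean)[:, None, None]   # subtract channel means

img = np.random.randint(0, 256, size=(32, 32, 3))
# hypothetical B, G, R means; the patent does not specify the values
out = preprocess_char_image(img, bgr_mean=[104.0, 117.0, 124.0])
print(out.shape)  # (3, 32, 32)
```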
3. A medical text oriented pre-training method according to claim 1, wherein the HighWay layer formula of step 9.2 is as follows:
y=g*x+(1-g)*f(A(x))
g=σ(B(x))
where x denotes the vector spliced in step 9.1, A and B denote linear functions, f denotes an activation function, and σ denotes the sigmoid function.
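The HighWay formula above can be sketched in NumPy as below, assuming tanh as the activation f and randomly initialised linear maps for A and B (the patent does not fix these choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway(x, W_a, b_a, W_b, b_b, f=np.tanh):
    """y = g * x + (1 - g) * f(A(x)), with transform gate g = sigmoid(B(x)).
    A(x) = W_a x + b_a and B(x) = W_b x + b_b are linear maps."""
    g = sigmoid(W_b @ x + b_b)
    return g * x + (1.0 - g) * f(W_a @ x + b_a)

d = 8
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
y = highway(x, rng.standard_normal((d, d)), np.zeros(d),
            rng.standard_normal((d, d)), np.zeros(d))
print(y.shape)  # (8,)
```

When the gate g saturates to 1, the layer simply carries x through unchanged, which is what lets stacked HighWay layers train stably.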
4. A medical text oriented pre-training method according to claim 1, wherein the linear projection layer formula of step 9.2 is as follows:
y=w*x+b
wherein w represents a weight, x represents a variable, and b represents a bias.
5. A medical text oriented pre-training method according to claim 1, wherein the bidirectional language model (biLM) in step 9.3 is a bidirectional long short-term memory network (LSTM) formed by combining a forward LSTM and a backward LSTM, and the LSTM formulas are:
f(t) = σ(W(f) x(t) + U(f) h(t-1))
i(t) = σ(W(i) x(t) + U(i) h(t-1))
o(t) = σ(W(o) x(t) + U(o) h(t-1))
c~(t) = tanh(W(c) x(t) + U(c) h(t-1))
c(t) = f(t) ⊙ c(t-1) + i(t) ⊙ c~(t)
h(t) = o(t) ⊙ tanh(c(t))

where t denotes the current time and t-1 the previous time; f(t), i(t), o(t), h(t) denote the forget gate, input gate, output gate and output respectively; σ is the sigmoid function; c~(t), c(t-1), c(t) denote the candidate state, the state at the previous time and the new state respectively; W(f), U(f), W(i), U(i), W(o), U(o), W(c), U(c) are training parameters, updated automatically during training.
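A single-step NumPy sketch of the gate equations above; the shapes and random parameters are illustrative only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM time step. P holds the weight matrices W_f, U_f, W_i, U_i,
    W_o, U_o, W_c, U_c (biases omitted, matching the formulas above)."""
    f_t = sigmoid(P["W_f"] @ x_t + P["U_f"] @ h_prev)      # forget gate
    i_t = sigmoid(P["W_i"] @ x_t + P["U_i"] @ h_prev)      # input gate
    o_t = sigmoid(P["W_o"] @ x_t + P["U_o"] @ h_prev)      # output gate
    c_tilde = np.tanh(P["W_c"] @ x_t + P["U_c"] @ h_prev)  # candidate state
    c_t = f_t * c_prev + i_t * c_tilde                     # new cell state
    h_t = o_t * np.tanh(c_t)                               # output
    return h_t, c_t

d_in, d_h = 4, 6
rng = np.random.default_rng(0)
P = {k: rng.standard_normal((d_h, d_in if k.startswith("W") else d_h))
     for k in ["W_f", "U_f", "W_i", "U_i", "W_o", "U_o", "W_c", "U_c"]}
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), P)
print(h.shape, c.shape)  # (6,) (6,)
```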
6. The medical text-oriented pre-training method according to claim 1, wherein the Softmax function in step 10 is:

Softmax(y_i) = e^(y_i) / ∑_{j=1}^{C} e^(y_j)

where y_i is the output of the current output unit, j is the output index, C is the total number of outputs, and Softmax(y_i) is the ratio of the exponential of the current element to the sum of the exponentials of all elements.
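A numerically stable NumPy sketch of this Softmax (shifting by the maximum before exponentiating avoids overflow and leaves the result unchanged):

```python
import numpy as np

def softmax(y):
    """Ratio of each element's exponential to the sum of all exponentials."""
    e = np.exp(y - y.max())   # shift by max for numerical stability
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
# p sums to 1 and preserves the ordering of the inputs
```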
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110690028.2A CN113674866B (en) | 2021-06-23 | 2021-06-23 | Pre-training method for medical text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113674866A true CN113674866A (en) | 2021-11-19 |
CN113674866B CN113674866B (en) | 2024-06-14 |
Family
ID=78538272
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110690028.2A Active CN113674866B (en) | 2021-06-23 | 2021-06-23 | Pre-training method for medical text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113674866B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114429129A (en) * | 2021-12-22 | 2022-05-03 | 南京信息工程大学 | Literature mining and material property prediction method |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20170081350A * | 2016-01-04 | 2017-07-12 | Electronics and Telecommunications Research Institute | Text Interpretation Apparatus and Method for Performing Text Recognition and Translation Per Frame Length Unit of Image
US20190197109A1 * | 2017-12-26 | 2019-06-27 | The Allen Institute For Artificial Intelligence | System and methods for performing nlp related tasks using contextualized word representations
CN110705293A * | 2019-08-23 | 2020-01-17 | Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences | Electronic medical record text named entity recognition method based on pre-training language model
CN111626383A * | 2020-05-29 | 2020-09-04 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Font identification method and device, electronic equipment and storage medium
CN111783767A * | 2020-07-27 | 2020-10-16 | Ping An Bank Co., Ltd. | Character recognition method and device, electronic equipment and storage medium
CN112989041A * | 2021-03-10 | 2021-06-18 | China Construction Bank Corporation | Text data processing method and device based on BERT
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||