CN113674866A - Medical text oriented pre-training method - Google Patents

Info

Publication number: CN113674866A (application CN202110690028.2A; granted as CN113674866B)
Authority: CN (China)
Prior art keywords: word, medical, vector, layer, training
Legal status: Granted; Active
Other languages: Chinese (zh)
Inventors: 朱强, 王卫东, 杨毅, 徐高军
Applicant and current assignee: Jiangsu Skyray Precision Medical Technology Co., Ltd.

Classifications

    • G16H 50/70 — ICT for medical diagnosis, simulation or data mining; mining of medical data, e.g. analysing previous cases of other patients
    • G06F 16/3335 — Syntactic pre-processing of queries, e.g. stopword elimination, stemming
    • G06F 16/3344 — Query execution using natural language analysis
    • G06F 16/35 — Clustering; classification of unstructured textual data
    • G06F 18/2135 — Feature extraction based on approximation criteria, e.g. principal component analysis
    • G06F 18/241 — Classification techniques relating to the classification model
    • G06F 18/2415 — Classification based on parametric or probabilistic models
    • G06F 40/289 — Phrasal analysis, e.g. finite state techniques or chunking
    • G06N 3/044 — Recurrent networks, e.g. Hopfield networks
    • G06N 3/047 — Probabilistic or stochastic networks
    • G06N 3/048 — Activation functions
    • G06N 3/08 — Learning methods
    • Y02A 90/10 — ICT supporting adaptation to climate change, e.g. for weather forecasting


Abstract

The invention discloses a medical-text-oriented pre-training method, comprising the following steps: acquire medical dictionaries of diseases, examinations and tests, symptoms, medicines, body parts, operations, and the like; acquire medical text content from encyclopedias and electronic medical records; load the medical dictionaries and segment the medical text with jieba to serve as the training corpus; acquire pictures of Chinese characters from the Han dictionary, constructing pictures for any characters not found there; extract glyph features with a VGG-16 convolutional network; reduce the dimensionality of the extracted glyph features with PCA (principal component analysis) to serve as character vectors; superpose each character vector with the character's position vector to obtain a new character vector; load an open-source Chinese word-vector corpus as the initial word vectors; train on the medical text content with an ELMo model to obtain the final ELMo pre-training model; and use the ELMo pre-training model to generate the ELMo vector of a particular word in a sentence. This pre-training method addresses the problem that general-corpus models are poorly suited to medical natural language processing tasks.

Description

Medical text oriented pre-training method
Technical Field
The invention relates to a medical text oriented pre-training method, and belongs to the technical field of natural language processing.
Background
Natural language processing is an important research direction in computer science and artificial intelligence, aiming to enable machines to understand and manipulate human natural language. In recent years, thanks to the development of deep learning, natural language processing has made important breakthroughs in tasks such as machine translation, reading comprehension, sentiment analysis, part-of-speech tagging, and text classification. Across these tasks, the development of pre-training techniques has played a critical role.
Deep-learning-based natural language processing tasks generally require a large amount of labeled data to achieve good results; when data are scarce, model performance is often unsatisfactory. The advent of pre-trained models changed this situation: the idea is to train a deep base model on a large dataset and then transfer it to downstream natural language processing tasks by fine-tuning or similar means. Current mainstream pre-trained models such as ELMo, GPT, and BERT are trained on general-domain corpora, yet medical text differs substantially from general text, so applying existing pre-trained models directly to medical text tasks rarely achieves the expected results. Therefore, building on current pre-training models, this invention generates a pre-trained model suited to medical text and improves the effectiveness of medical natural language processing.
Disclosure of Invention
The invention aims to overcome the defect that the traditional pre-training model is difficult to adapt to a medical natural language processing task, and provides a pre-training method for medical texts.
To achieve this aim, the invention adopts the following technical scheme:
a medical text-oriented pre-training method comprises corpus generation, word vector construction and model training. In the corpus acquisition, dictionaries such as diseases, examination, symptoms, medicines, body parts, operations and the like are acquired, medical data such as encyclopedia and electronic medical records are acquired, and then the medical data are participled by using jieba in combination with the dictionaries to serve as training corpuses. In the word vector construction, due to ideographical property of Chinese characters, the VGG-16 convolution network is used for extracting the character pattern characteristics of the Chinese characters, and then the PCA is used for reducing the dimension of the Chinese characters to be used as word vectors. The word vector portion selects the Chinese word vector that loads the open source. In model training, an ELMo model is used for training a medical text, and optimal model parameters are stored. Then, the pre-trained ELMo model is used to generate ELMo vectors for the particular word. The method specifically comprises the following steps:
Step 1: acquire entities such as diseases, examinations and tests, symptoms, medicines, body parts, and operations from medical encyclopedias, and construct a corresponding dictionary for each category;
Step 2: using the dictionaries built in step 1, crawl the corresponding popular-science medical data from encyclopedias, medical encyclopedias, and medical Q&A sites with Scrapy, then collect electronic medical record data from the Love Doctor platform;
Step 3: load the dictionaries from step 1 and segment the medical data with jieba;
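A minimal pure-Python sketch of the dictionary-driven segmentation in step 3. The patent uses jieba with the loaded medical dictionaries; here a forward maximum-matching segmenter stands in for it, and the dictionary entries below are illustrative, not taken from the patent.

```python
# Sketch of step 3: dictionary-driven word segmentation. In practice the
# patent loads the medical dictionaries into jieba (jieba.load_userdict)
# and calls jieba.lcut; forward maximum matching illustrates the idea.

def max_match_segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching against a (medical) dictionary."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at i; fall back to one char.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

medical_dict = {"胃炎", "患者", "上腹", "疼痛"}  # illustrative entries
print(max_match_segment("胃炎患者上腹疼痛", medical_dict))
# ['胃炎', '患者', '上腹', '疼痛']
```

With a real corpus, loading the dictionaries before segmentation keeps multi-character medical terms such as disease names from being split apart.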
Step 4: collect pictures of Chinese characters from the Han dictionary, recording the corresponding font and size, and name each picture after the character it shows. Tally all characters appearing in step 2; for the non-Chinese characters, convert English letters uniformly to lowercase and convert digits and punctuation uniformly to half-width, then construct pictures for them using the same font and size as the Chinese-character pictures; finally, preprocess all the character pictures;
Step 5: extract features from the character pictures preprocessed in step 4 using a VGG-16 convolutional network. The VGG-16 network consists of 13 convolutional layers and 3 fully connected layers; the character features are taken from the output of the 2nd fully connected layer;
Step 6: reduce the dimensionality of the character features extracted in step 5 with PCA (principal component analysis) and use the result as the character vectors of the character embedding layer;
Step 7: superpose the character vector from the character embedding layer and the position vector from the character-position embedding layer to obtain a new character vector fusing glyph information and position information. The position vector is computed as:

PE_{(p, 2i)} = \sin\left(p / 10000^{2i/d_e}\right)

PE_{(p, 2i+1)} = \cos\left(p / 10000^{2i/d_e}\right)

where p is the position of the character, d_e is the total dimension of the character embedding, and i indexes a specific embedding dimension; even dimensions use PE_{(p,2i)} and odd dimensions use PE_{(p,2i+1)};
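The step-7 position vector is the standard sinusoidal encoding given by the claim formulas. A small pure-Python sketch (the embedding dimension below is illustrative):

```python
import math

# Sketch of step 7: sinusoidal position vectors, PE(p,2i)=sin(...) for
# even dimensions and PE(p,2i+1)=cos(...) for odd dimensions, then
# elementwise superposition with the glyph-based character vector.

def position_vector(p, d_e):
    """Encode position p into a d_e-dimensional vector."""
    pe = []
    for dim in range(d_e):
        i = dim // 2                              # paired dimension index
        angle = p / (10000 ** (2 * i / d_e))
        pe.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
    return pe

def fuse(char_vec, p):
    """New character vector = character vector + position vector."""
    pv = position_vector(p, len(char_vec))
    return [c + v for c, v in zip(char_vec, pv)]
```

At position 0 the encoding is (0, 1, 0, 1, ...), so `fuse` shifts odd dimensions by exactly 1, which makes the superposition easy to check by hand.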
Step 8, loading the word vectors with open sources as word vectors of a word embedding layer;
Step 9: train on the medical text segmented in step 3 with an ELMo model, combining the character vectors from step 7 and the word vectors from step 8, to obtain the ELMo pre-training model;
Step 9.1: adopt 7 convolutional layers with (width, filter-count) settings [1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024]; run them over the character vectors of each word from the step-3 segmentation, connect each convolutional layer to a global max-pooling layer, and concatenate all the pooled vectors to obtain a word vector;
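The character CNN of step 9.1 can be sketched as follows. The filter widths and counts come from the patent; the random weights, the character-vector dimension, and the zero-padding for short words are illustrative stand-ins for trained parameters.

```python
import random

# Sketch of step 9.1: seven 1-D convolutions of widths 1..7 with
# 32/32/64/128/256/512/1024 filters over a word's character vectors;
# each filter is globally max-pooled and the results are concatenated,
# giving a 32+32+64+128+256+512+1024 = 2048-dimensional word vector.

FILTERS = [(1, 32), (2, 32), (3, 64), (4, 128), (5, 256), (6, 512), (7, 1024)]

def char_cnn(char_vecs, d, rng):
    # Zero-pad so even the width-7 filters see at least one window.
    chars = char_vecs + [[0.0] * d] * max(0, 7 - len(char_vecs))
    out = []
    for width, n_filters in FILTERS:
        for _ in range(n_filters):
            w = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(width)]
            # Convolve: dot product of the filter with each character window.
            acts = [
                sum(w[k][j] * chars[s + k][j]
                    for k in range(width) for j in range(d))
                for s in range(len(chars) - width + 1)
            ]
            out.append(max(acts))                 # global max pooling
    return out

rng = random.Random(0)
d = 8                                             # toy character-vector size
word = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(3)]  # 3 chars
vec = char_cnn(word, d, rng)
print(len(vec))  # 2048
```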
Step 9.2: pass the word vector from step 9.1 through a Highway layer, repeated twice; concatenate the resulting vector with the word vector from step 8; then apply a linear projection layer to obtain the final word vector x_k;
Step 9.3: feed the word vectors x_k into a bidirectional language model (biLM) comprising a forward language model and a backward language model. In the forward language model, the hidden representation of the k-th word at LSTM layer j is \overrightarrow{h}^{LM}_{k,j} (where j = 1, \dots, L). The forward language model models the word sequence by predicting the next word t_k from the previously observed words (t_1, \dots, t_{k-1}), with joint probability:

p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \dots, t_{k-1})
The backward language model is analogous: it models the word at the current step by observing the future words, with joint probability:

p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \dots, t_N)
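The forward and backward factorizations above can be illustrated with a toy conditional model; the uniform distribution standing in for the biLM's Softmax output is purely illustrative.

```python
import math

# Sketch of the step-9.3 factorizations: the sequence log-probability is
# the sum of per-step conditional log-probabilities, read left-to-right
# (forward) or right-to-left (backward). A uniform model over a toy
# vocabulary stands in for the trained biLM output here.

V = 100                                    # toy vocabulary size
def cond_prob(token, context):             # placeholder for the biLM output
    return 1.0 / V

def forward_log_prob(tokens):
    return sum(math.log(cond_prob(t, tokens[:k]))
               for k, t in enumerate(tokens))

def backward_log_prob(tokens):
    return sum(math.log(cond_prob(t, tokens[k + 1:]))
               for k, t in enumerate(tokens))

sent = ["胃炎", "患者", "上腹", "疼痛"]
print(round(forward_log_prob(sent), 4))
```

Under the uniform stand-in both directions give N·log(1/V), which makes the shared factorization structure easy to verify.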
Step 9.4: combine the forward and backward language models and jointly maximize the log-likelihood in both directions:

\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \dots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \dots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)

where \Theta_x denotes the word-embedding layer parameters, \Theta_s the Softmax layer parameters, \overrightarrow{\Theta}_{LSTM} the forward language model parameters, and \overleftarrow{\Theta}_{LSTM} the backward language model parameters;
Step 10: use the ELMo pre-training model to obtain the ELMo vector of a specific word in a sentence. For an L-layer bidirectional language model, there are 2L + 1 representations in total:

R_k = \{ x^{LM}_k, \overrightarrow{h}^{LM}_{k,j}, \overleftarrow{h}^{LM}_{k,j} \mid j = 1, \dots, L \} = \{ h^{LM}_{k,j} \mid j = 0, \dots, L \}

where, when j = 0, h^{LM}_{k,0} is the word vector, i.e. x^{LM}_k, and for j > 0, h^{LM}_{k,j} = [\overrightarrow{h}^{LM}_{k,j}; \overleftarrow{h}^{LM}_{k,j}] is the concatenated representation of the forward and backward language models. The final ELMo word vector is:

ELMo_k = \gamma \sum_{j=0}^{L} s_j \, h^{LM}_{k,j}

where s_j are the Softmax-normalized weights and \gamma is a scaling parameter.
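The step-10 combination can be sketched directly from the formula; the layer representations and weight logits below are illustrative values, not model outputs.

```python
import math

# Sketch of step 10: the ELMo vector of word k is the gamma-scaled,
# Softmax-weighted sum of its 2L+1 layer representations h_{k,0..L}.

def softmax(z):
    m = max(z)                             # shift by max for stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def elmo_vector(layer_reps, logits, gamma):
    s = softmax(logits)                    # normalized layer weights s_j
    dim = len(layer_reps[0])
    return [gamma * sum(s[j] * layer_reps[j][i]
                        for j in range(len(layer_reps)))
            for i in range(dim)]

# L = 2 biLM layers plus the token layer -> 3 representations per word.
h_layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(elmo_vector(h_layers, [0.0, 0.0, 0.0], gamma=1.0))  # roughly [2/3, 2/3]
```

With equal logits the Softmax weights are uniform, so the ELMo vector reduces to gamma times the average of the layer representations.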
Preferably, the preprocessing of the character pictures in step 4 reads each picture in RGB format, converts the color channels to BGR order, transposes the picture layout to CHW, and then subtracts the per-channel mean from each color channel:

B' = B - \mu_B, \quad G' = G - \mu_G, \quad R' = R - \mu_R

where B, G, R are the color channel values and \mu_B, \mu_G, \mu_R are the means of the B, G, R channels, respectively.
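The preprocessing above can be sketched in pure Python; the 2x2 pixel values and channel means are illustrative.

```python
# Sketch of the step-4 preprocessing: read an RGB image as H x W x C,
# swap channels to BGR, transpose to C x H x W, and subtract the
# per-channel means.

def preprocess(rgb_hwc, means_bgr):
    h, w = len(rgb_hwc), len(rgb_hwc[0])
    bgr_chw = [[[0.0] * w for _ in range(h)] for _ in range(3)]
    for y in range(h):
        for x in range(w):
            r, g, b = rgb_hwc[y][x]
            for c, val in enumerate((b, g, r)):          # RGB -> BGR
                bgr_chw[c][y][x] = val - means_bgr[c]    # mean subtraction
    return bgr_chw

img = [[(10, 20, 30), (40, 50, 60)],
       [(70, 80, 90), (100, 110, 120)]]
print(preprocess(img, means_bgr=(30.0, 20.0, 10.0))[0][0][0])  # 30 - 30 = 0.0
```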
Preferably, the Highway layer in step 9.2 is computed as:

t = \sigma(W_T x + b_T)

y = t \odot g(W_H x + b_H) + (1 - t) \odot x

where x is the concatenated vector from step 9.1, W_H x + b_H and W_T x + b_T are linear (affine) transforms, g is the activation function, and \sigma is the sigmoid function.
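A sketch of the Highway computation above, assuming tanh as the activation g and per-dimension (diagonal) weights to keep it short; a real layer uses full weight matrices.

```python
import math

# Sketch of the step-9.2 Highway layer: a sigmoid transform gate t mixes
# a nonlinear transform of x with the identity carry of x itself.

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def highway(x, w_h, b_h, w_t, b_t):
    y = []
    for i, xi in enumerate(x):
        t = sigmoid(w_t[i] * xi + b_t[i])        # transform gate
        h = math.tanh(w_h[i] * xi + b_h[i])      # candidate transform
        y.append(t * h + (1.0 - t) * xi)         # gated mix with carry
    return y

x = [0.5, -1.0]
# A strongly negative gate bias drives t toward 0, so the layer nearly
# passes x through unchanged (the "carry" behavior):
print(highway(x, [1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [-10.0, -10.0]))
```

The carry path is what lets the step-9.2 stack repeat the layer twice without degrading the character-level signal.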
Preferably, the linear projection layer in step 9.2 is computed as:

y = W x + b

where W is the weight matrix, x is the input variable, and b is the bias.
Preferably, the bidirectional language model (biLM) in step 9.3 is a bidirectional long short-term memory network (LSTM) formed by combining a forward LSTM and a backward LSTM. One LSTM step is computed as:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)

\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t

h_t = o_t \odot \tanh(c_t)

where t denotes the current time step and t-1 the previous one; f_t, i_t, o_t, h_t are the forget gate, input gate, output gate, and output, respectively; \sigma is the sigmoid function; \tilde{c}_t, c_{t-1}, c_t are the candidate state, the previous state, and the new state, respectively; and the W and b terms are training parameters, updated automatically during training.
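One LSTM step from the formulas above, as a pure-Python sketch with illustrative weights.

```python
import math

# Sketch of one LSTM step: forget, input and output gates over the
# concatenation [h_{t-1}, x_t], a tanh candidate state, the new cell
# state, and the new hidden output.

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dot(row, v):
    return sum(a * b for a, b in zip(row, v))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W maps gate name -> weight rows over [h_prev, x_t]; b likewise."""
    z = h_prev + x_t                              # concatenation [h, x]
    f = [sigmoid(dot(r, z) + b["f"][i]) for i, r in enumerate(W["f"])]
    i_g = [sigmoid(dot(r, z) + b["i"][i]) for i, r in enumerate(W["i"])]
    o = [sigmoid(dot(r, z) + b["o"][i]) for i, r in enumerate(W["o"])]
    c_tilde = [math.tanh(dot(r, z) + b["c"][i]) for i, r in enumerate(W["c"])]
    c_t = [f[j] * c_prev[j] + i_g[j] * c_tilde[j] for j in range(len(c_prev))]
    h_t = [o[j] * math.tanh(c_t[j]) for j in range(len(c_t))]
    return h_t, c_t

n, d = 2, 3                                       # hidden size 2, input size 3
W = {g: [[0.1] * (n + d) for _ in range(n)] for g in "fioc"}
b = {g: [0.0] * n for g in "fioc"}
h_t, c_t = lstm_step([1.0, 0.5, -0.5], [0.0, 0.0], [0.0, 0.0], W, b)
print(len(h_t), len(c_t))  # 2 2
```

The biLM of step 9.3 runs one such LSTM left-to-right and another right-to-left and concatenates their hidden states per layer.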
Preferably, the Softmax function in step 10 is:

\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}

where z_i is the output of the current output unit, j is the output index, and K is the total number of outputs; the result is the ratio of the exponential of the current element to the sum of the exponentials of all elements.
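The Softmax formula as a short sketch (the max-shift is a standard numerical-stability detail, not stated in the patent).

```python
import math

# Sketch of the step-10 Softmax: each output is the exponential of the
# current element divided by the sum of exponentials of all elements.

def softmax(z):
    m = max(z)                        # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(round(sum(probs), 6))  # 1.0
```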
The medical-text-oriented pre-training method has the following advantages:
The invention extracts glyph features of the characters in medical text with a VGG-16 convolutional network, exploiting the ideographic nature of Chinese characters; PCA dimension reduction shrinks the character embedding layer and improves the training efficiency of the ELMo model; fusing each character vector with its position vector encodes the order of characters within a word; training the ELMo model on the preprocessed medical text mines the semantic relations within words; and pre-training on medical text solves the problem that general corpora are poorly suited to medical natural language processing tasks.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Fig. 2 is an overall architecture diagram of a medical text-oriented pre-training method.
FIG. 3 is a VGG-16 convolutional network architecture diagram.
Detailed Description
To better illustrate the objects, technical solutions and advantages of the present invention, a pre-training method for medical texts according to the present invention is described in detail below with reference to the accompanying drawings and embodiments.
As shown in FIG. 1, the medical-text-oriented pre-training method first constructs a medical entity dictionary; second, collects medical text content according to the dictionary; third, preprocesses the medical text with the dictionary; fourth, obtains character vectors; fifth, trains the ELMo model on the preprocessed medical text using those vectors; and finally, generates ELMo word vectors with the pre-trained ELMo model. The specific steps are as follows:
Step 1: obtain entities such as diseases, examinations and tests, symptoms, medicines, body parts, and operations from medical encyclopedias and construct the corresponding dictionaries, e.g. obtaining the disease entity "gastritis";
Step 2: collect the popular-science medical data corresponding to gastritis from Baidu Baike, medical encyclopedias, and medical Q&A sites with Scrapy, then collect electronic medical record data from the Love Doctor platform; an example of collected data is "epigastric pain of a gastritis patient";
Step 3: load the dictionaries from step 1 and use jieba to segment "epigastric pain of a gastritis patient" into the words "gastritis / patient / epigastric / pain";
Step 4: acquire pictures of the Chinese characters from the Han dictionary, record the corresponding font and size, and name each picture after its character, e.g. the picture of the character "stomach" is named "stomach.png". Tally all characters from step 2; for non-Chinese characters, convert English letters uniformly to lowercase and digits and punctuation uniformly to half-width, then construct the corresponding pictures in the same font and size as the Chinese characters. For "epigastric pain of a gastritis patient", the resulting pictures are the character-picture part shown in FIG. 2. Then read each character picture in RGB format, convert the color channels to BGR order, transpose the layout to CHW, and subtract the per-channel means:

B' = B - \mu_B, \quad G' = G - \mu_G, \quad R' = R - \mu_R

where B, G, R are the color channel values and \mu_B, \mu_G, \mu_R are the means of the B, G, R channels, respectively;
Step 5: extract features from the character pictures processed in step 4 with a VGG-16 convolutional network, which consists of 13 convolutional layers and 3 fully connected layers; the character features are taken from the output of the 2nd fully connected layer, i.e. the retained portion of the VGG-16 network shown in FIG. 3;
Step 6: reduce the dimensionality of the character features extracted in step 5 with PCA and use the result as the character vectors of the character embedding layer; for example, if "stomach" has the 10-dimensional feature [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], PCA reduces it to the 6-dimensional [0, 1, 2, 3, 4, 5];
Step 7: superpose the character vector from the character embedding layer and the position vector from the character-position embedding layer to obtain a new character vector fusing glyph information and position information. The position vector is computed as:

PE_{(p, 2i)} = \sin\left(p / 10000^{2i/d_e}\right)

PE_{(p, 2i+1)} = \cos\left(p / 10000^{2i/d_e}\right)

where p is the position of the character, d_e is the total dimension of the character embedding, and i indexes a specific embedding dimension; even dimensions use PE_{(p,2i)} and odd dimensions use PE_{(p,2i+1)};
Step 8: load open-source word vectors as the word vectors of the word embedding layer; for example, the word vector of "gastritis" is [0.1, 0.2];
Step 9: train on the medical text segmented in step 3 with an ELMo model, combining the character vectors from step 7 and the word vectors from step 8, to obtain the ELMo pre-training model;
Step 9.1: adopt 7 convolutional layers with (width, filter-count) settings [1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024], as in the convolution-extracted word-vector part of FIG. 2; run them over the step-3 segmentation result, connect each convolutional layer to a global max-pooling layer, and concatenate all the pooled vectors to obtain a word vector;
Step 9.2: pass the word vector from step 9.1 through the Highway layer (see the Highway layer in FIG. 2), repeated twice; for example, the resulting vector for "gastritis" is [0.3, 0.4, 0.5], which is concatenated with the step-8 word vector of "gastritis" to give [0.1, 0.2, 0.3, 0.4, 0.5]; a linear projection layer then yields the final word vector x_k, e.g. "gastritis" becomes [0.15, 0.25, 0.35];
Step 9.3: feed the word vectors x_k into a bidirectional language model (biLM), such as the biLM portion of FIG. 2, comprising a forward language model and a backward language model. In the forward language model, the hidden representation of the k-th word at LSTM layer j is \overrightarrow{h}^{LM}_{k,j} (where j = 1, \dots, L). The forward language model models the word sequence by predicting the next word t_k from the previously observed words (t_1, \dots, t_{k-1}), with joint probability:

p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \dots, t_{k-1})
The backward language model is analogous: it models the word at the current step by observing the future words, with joint probability:

p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \dots, t_N)
Step 9.4: combine the forward and backward language models and jointly maximize the log-likelihood in both directions:

\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \dots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \dots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)

where \Theta_x denotes the word-embedding layer parameters, \Theta_s the Softmax layer parameters, \overrightarrow{\Theta}_{LSTM} the forward language model parameters, and \overleftarrow{\Theta}_{LSTM} the backward language model parameters;
Step 10: use the ELMo pre-training model to obtain the ELMo vector of a specific word in a sentence. For an L-layer bidirectional language model, there are 2L + 1 representations in total:

R_k = \{ x^{LM}_k, \overrightarrow{h}^{LM}_{k,j}, \overleftarrow{h}^{LM}_{k,j} \mid j = 1, \dots, L \} = \{ h^{LM}_{k,j} \mid j = 0, \dots, L \}

where, when j = 0, h^{LM}_{k,0} is the word vector, i.e. x^{LM}_k, and for j > 0, h^{LM}_{k,j} = [\overrightarrow{h}^{LM}_{k,j}; \overleftarrow{h}^{LM}_{k,j}] is the concatenated representation of the forward and backward language models. The final ELMo word vector is:

ELMo_k = \gamma \sum_{j=0}^{L} s_j \, h^{LM}_{k,j}

where s_j are the Softmax-normalized weights and \gamma is a scaling parameter.
If a 2-layer bidirectional language model is selected, the ELMo word vector of "gastritis" is the linear combination shown in the ELMo word vector portion of FIG. 2.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention and are not intended to be limiting. Although the present invention has been described in detail with reference to the embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments of the present invention or equivalents may be substituted for elements thereof without departing from the scope of the claims.

Claims (6)

1. A medical text-oriented pre-training method is characterized by comprising the following steps:
step 1, acquiring entities such as diseases, examination, symptoms, medicines, body parts, operations and the like from medical encyclopedia, and respectively constructing corresponding dictionaries;
step 2, according to the dictionaries established in step 1, using Scrapy to collect the corresponding popular-science medical data from encyclopedias, medical encyclopedias, and medical Q&A sites, and then collecting electronic medical record data from the Love Doctor platform;
step 3, loading the dictionary in the step 1, and performing word segmentation processing on the medical data by using jieba;
step 4, collecting Chinese character pictures in the Chinese dictionary, obtaining corresponding fonts and sizes, naming the Chinese characters in the pictures, counting all characters in the step 2, uniformly converting English letters into lowercase except the Chinese characters, uniformly converting numbers and punctuations into semi-corners, constructing corresponding pictures by adopting the fonts and the sizes corresponding to the Chinese characters, and preprocessing the character pictures;
step 5, extracting features from the character pictures preprocessed in step 4 with a VGG-16 convolutional network consisting of 13 convolutional layers and 3 fully-connected layers, taking the output of the 2nd fully-connected layer as the character feature;
step 6, reducing the dimensionality of the character features extracted in step 5 with principal component analysis (PCA), and using the result as the character vectors of the character embedding layer;
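The PCA reduction of step 6 can be sketched with a plain numpy SVD; the feature matrix below stands in for the VGG-16 fc2 outputs, and the sizes (64-d in, 8-d out) are illustrative, not the patent's:

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project feature rows onto their top principal components."""
    centered = features - features.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal axes,
    # sorted by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
char_features = rng.normal(size=(100, 64))   # 100 characters, 64-d features
vectors = pca_reduce(char_features, 8)       # 8-d character vectors
print(vectors.shape)
```

The first output column always carries at least as much variance as the last, which is the property PCA exploits when truncating the representation.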
step 7, superposing the character vector of the character embedding layer and the position vector of the position embedding layer to obtain a new character vector fusing character information and position information, wherein the position vector is calculated as:

PE(p, 2i) = sin(p / 10000^(2i/d_e))

PE(p, 2i+1) = cos(p / 10000^(2i/d_e))

where p is the position of the character, d_e is the total dimension of the embedding, and i indexes the embedding dimensions; PE(p, 2i) is used for even dimensions and PE(p, 2i+1) for odd dimensions;
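The step 7 position vector is the standard sinusoidal encoding; a minimal numpy sketch (the 50×16 size is illustrative):

```python
import numpy as np

def position_encoding(max_len, d_e):
    """Sinusoidal position encoding as in step 7:
    PE(p, 2i) = sin(p / 10000**(2i/d_e)), PE(p, 2i+1) = cos(same angle)."""
    pe = np.zeros((max_len, d_e))
    pos = np.arange(max_len)[:, None]
    even_i = np.arange(0, d_e, 2)[None, :]        # the 2i dimension indices
    angle = pos / np.power(10000.0, even_i / d_e)
    pe[:, 0::2] = np.sin(angle)                   # even dimensions
    pe[:, 1::2] = np.cos(angle)                   # odd dimensions
    return pe

pe = position_encoding(50, 16)
print(pe.shape)
```

At position 0 the even dimensions are all 0 and the odd dimensions all 1, a quick sanity check on any implementation of this formula.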
step 8, loading open-source pre-trained word vectors as the word vectors of the word embedding layer;
step 9, training on the medical text segmented in step 3 with an ELMo model, combining the character vectors from step 7 and the word vectors from step 8, to obtain an ELMo pre-training model;
step 9.1, adopting 7 convolution layers with (kernel width, filter count) configurations [1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024], convolving the character vectors of each word segmented in step 3, connecting each convolution layer to a global max-pooling layer, and concatenating all pooled vectors into one word vector;
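Step 9.1 is a multi-width character convolution with global max pooling. A numpy sketch with the claim's seven filter configurations, using random filters in place of trained ones (the 16-d character vectors are an assumed size):

```python
import numpy as np

def char_cnn(char_vectors, filter_specs, rng):
    """Multi-width character convolution + global max pooling (step 9.1).
    `filter_specs` pairs (kernel width, n_filters); the concatenated
    pooled maxima form the word vector, one slot per filter."""
    seq_len, d = char_vectors.shape
    pooled = []
    for width, n_filters in filter_specs:
        w = rng.normal(size=(n_filters, width * d))   # random stand-in filters
        # Slide the kernel over every window of `width` characters.
        windows = np.stack([char_vectors[i:i + width].ravel()
                            for i in range(seq_len - width + 1)])
        conv = windows @ w.T                          # (n_windows, n_filters)
        pooled.append(conv.max(axis=0))               # global max pool
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
specs = [[1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024]]
word = char_cnn(rng.normal(size=(7, 16)), specs, rng)  # a 7-character word
print(word.shape)  # 32+32+64+128+256+512+1024 = 2048 filters in total
```

The total filter count, 2048, fixes the word-vector width regardless of the word's character length, which is why the result can be fed uniformly into the HighWay layer of step 9.2.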
step 9.2, passing the word vector from step 9.1 through a HighWay layer twice, concatenating the result with the word vector from step 8, and then obtaining the final word vector x_k^{LM} through a linear projection layer (Linear Projection);
step 9.3, inputting the obtained word vector x_k^{LM} into a bidirectional language model (biLM) comprising a forward language model and a backward language model; in the forward language model, the k-th word is represented at hidden layer j of the L layers as →h_{k,j}^{LM} (where j = 1, 2, …, L); the forward language model predicts the word t_k at the next moment from the previously observed word sequence t_1, t_2, …, t_{k-1}, modeling the word sequence with the joint probability:

p(t_1, t_2, …, t_N) = ∏_{k=1}^{N} p(t_k | t_1, t_2, …, t_{k-1})

the backward language model is similar to the forward language model, modeling the word sequence at the current moment from the future word sequence, with the joint probability:

p(t_1, t_2, …, t_N) = ∏_{k=1}^{N} p(t_k | t_{k+1}, t_{k+2}, …, t_N)
step 9.4, combining the forward language model and the backward language model, and maximizing the log-likelihood in both directions:

Σ_{k=1}^{N} ( log p(t_k | t_1, …, t_{k-1}; Θ_x, →Θ_LSTM, Θ_s) + log p(t_k | t_{k+1}, …, t_N; Θ_x, ←Θ_LSTM, Θ_s) )

wherein Θ_x represents the word embedding layer, Θ_s represents the Softmax layer, →Θ_LSTM represents the parameters of the forward language model, and ←Θ_LSTM represents the parameters of the backward language model;
step 10, using the ELMo pre-training model to obtain the ELMo vector of a specific word in a sentence; for an L-layer bidirectional language model, each word has 2L + 1 representations in total:

R_k = { x_k^{LM}, →h_{k,j}^{LM}, ←h_{k,j}^{LM} | j = 1, …, L } = { h_{k,j}^{LM} | j = 0, 1, …, L }

wherein, when j = 0, h_{k,0}^{LM} is the word vector, i.e. h_{k,0}^{LM} = x_k^{LM}; for j > 0, h_{k,j}^{LM} is the concatenation of the forward and backward language model representations, i.e. h_{k,j}^{LM} = [→h_{k,j}^{LM}; ←h_{k,j}^{LM}]; the final ELMo word vector is represented as:

ELMo_k^{task} = γ^{task} · Σ_{j=0}^{L} s_j^{task} · h_{k,j}^{LM}

wherein s_j^{task} is the weight after Softmax normalization and γ^{task} is a scaling parameter.
2. The medical text-oriented pre-training method as claimed in claim 1, wherein the preprocessing of the character pictures in step 4 comprises reading each picture in RGB format, converting the color channels to BGR order, converting the picture layout to CHW, and then subtracting the mean from each color channel:

[X_B, X_G, X_R] = [X_B − X_B_MEAN, X_G − X_G_MEAN, X_R − X_R_MEAN]

wherein X_B, X_G, X_R are the B, G, R color channel values, respectively, and X_B_MEAN, X_G_MEAN, X_R_MEAN are the means of the B, G, R color channels, respectively.
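The claim 2 pipeline (RGB→BGR, HWC→CHW, per-channel mean subtraction) in numpy; the image and the mean values are illustrative:

```python
import numpy as np

def preprocess(rgb_image, bgr_means):
    """Claim 2 preprocessing: RGB -> BGR channel order, HWC -> CHW layout,
    then subtract per-channel means (the means here are illustrative)."""
    bgr = rgb_image[:, :, ::-1]            # swap R and B channels
    chw = bgr.transpose(2, 0, 1)           # channel-first layout
    means = np.asarray(bgr_means).reshape(3, 1, 1)
    return chw - means

img = np.ones((4, 4, 3)) * np.array([10.0, 20.0, 30.0])  # R=10, G=20, B=30
out = preprocess(img, bgr_means=[5.0, 5.0, 5.0])
print(out.shape, out[0, 0, 0])  # B channel comes first: 30 - 5 = 25
```

This BGR/CHW/mean-subtraction convention matches what VGG-style networks trained with Caffe-era preprocessing expect, which is presumably why step 4 adopts it before the VGG-16 of step 5.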
3. The medical text-oriented pre-training method according to claim 1, wherein the HighWay layer of step 9.2 is formulated as:

y=g*x+(1-g)*f(A(x))

g=σ(B(x))

where x denotes the vector concatenated in step 9.1, A and B denote linear functions, f denotes an activation function, and σ denotes the sigmoid function.
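A numpy sketch of the claim 3 HighWay layer; ReLU stands in for the unspecified activation f, and the weights are random stand-ins:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway(x, A, B, bA, bB):
    """Claim 3 HighWay layer: y = g*x + (1-g)*f(A(x)), g = sigmoid(B(x)).
    A and B are the linear maps of the claim; ReLU plays the role of f."""
    g = sigmoid(x @ B + bB)                    # carry/transform gate
    transformed = np.maximum(0.0, x @ A + bA)  # f(A(x)) with f = ReLU
    return g * x + (1.0 - g) * transformed

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
A, B = rng.normal(size=(d, d)), rng.normal(size=(d, d))
# A strongly positive gate bias drives g -> 1, so the layer passes x through.
y = highway(x, A, B, bA=np.zeros(d), bB=np.full(d, 50.0))
print(np.allclose(y, x, atol=1e-6))
```

The gate g interpolates between copying the input (g ≈ 1) and the transformed path (g ≈ 0), which is what lets step 9.2 stack the layer twice without degrading the character-derived signal.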
4. The medical text-oriented pre-training method according to claim 1, wherein the linear projection layer of step 9.2 is formulated as:

y=w*x+b

wherein w represents a weight, x represents the input vector, and b represents a bias.
5. The medical text-oriented pre-training method as claimed in claim 1, wherein the bidirectional language model (biLM) in step 9.3 is a bidirectional long short-term memory (LSTM) network formed by combining a forward LSTM and a backward LSTM, and the LSTM is formulated as:

f^(t)=σ(W^(f)x^(t)+U^(f)h^(t-1))

i^(t)=σ(W^(i)x^(t)+U^(i)h^(t-1))

o^(t)=σ(W^(o)x^(t)+U^(o)h^(t-1))

c̃^(t)=tanh(W^(c)x^(t)+U^(c)h^(t-1))

c^(t)=f^(t)∘c^(t-1)+i^(t)∘c̃^(t)

h^(t)=o^(t)∘tanh(c^(t))

where t represents the current time and t-1 the previous time; f^(t), i^(t), o^(t), h^(t) represent the forget gate, input gate, output gate and output, respectively; σ is the sigmoid function; c̃^(t), c^(t-1), c^(t) represent the candidate state, the state at the previous time, and the new state, respectively; W^(f), U^(f), W^(i), U^(i), W^(o), U^(o), W^(c), U^(c) are training parameters, updated automatically during training.
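One step of the claim 5 LSTM recurrence in numpy, with random stand-in parameters and illustrative sizes (4-d input, 3-d hidden state):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the claim 5 equations."""
    Wf, Uf, Wi, Ui, Wo, Uo, Wc, Uc = params
    f = sigmoid(Wf @ x_t + Uf @ h_prev)        # forget gate
    i = sigmoid(Wi @ x_t + Ui @ h_prev)        # input gate
    o = sigmoid(Wo @ x_t + Uo @ h_prev)        # output gate
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev)  # candidate state
    c = f * c_prev + i * c_tilde               # new cell state
    h = o * np.tanh(c)                         # output
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
# W matrices map the input (d_h x d_in), U matrices map the state (d_h x d_h).
params = [rng.normal(size=(d_h, d_in)) if k % 2 == 0
          else rng.normal(size=(d_h, d_h)) for k in range(8)]
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), params)
print(h.shape, c.shape)
```

Because h = o ∘ tanh(c) with both factors bounded by 1 in magnitude, every component of the output lies strictly inside (−1, 1).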
6. The medical text-oriented pre-training method according to claim 1, wherein the Softmax function in step 10 is:

Softmax(y_i) = e^{y_i} / Σ_{j=1}^{C} e^{y_j}

wherein y_i is the output of the current output unit, j is the output index and C the total number of outputs; Softmax(y_i) is the ratio of the exponential of the current element to the sum of the exponentials of all elements.
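The claim 6 Softmax in numpy, with the usual max-subtraction added for numerical stability (an implementation detail the claim does not specify):

```python
import numpy as np

def softmax(y):
    """Claim 6 Softmax: exp(y_i) / sum_j exp(y_j).  Subtracting the max
    before exponentiating avoids overflow without changing the result."""
    e = np.exp(y - np.max(y))
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)  # sums to 1, ordered like the inputs
```

The outputs form a probability distribution over the C output units, which is what makes both the layer-weight normalization of step 10 and the Softmax layer Θ_s of step 9.4 well defined.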
CN202110690028.2A 2021-06-23 2021-06-23 Pre-training method for medical text Active CN113674866B (en)


Publications (2)

Publication Number Publication Date
CN113674866A true CN113674866A (en) 2021-11-19
CN113674866B CN113674866B (en) 2024-06-14
