CN115759073A - Medical examination pre-training model construction method and identification method - Google Patents

Medical examination pre-training model construction method and identification method

Info

Publication number
CN115759073A
Authority
CN
China
Prior art keywords
training
model
medical examination
data
layer
Prior art date
Legal status
Pending
Application number
CN202211456934.7A
Other languages
Chinese (zh)
Inventor
姚佳
刘忠禹
殷晋
Current Assignee
West China Hospital of Sichuan University
Original Assignee
West China Hospital of Sichuan University
Priority date
Filing date
Publication date
Application filed by West China Hospital of Sichuan University filed Critical West China Hospital of Sichuan University
Priority to CN202211456934.7A
Publication of CN115759073A

Landscapes

  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a method for constructing a medical examination pre-training model, which comprises the following steps: constructing training original data from data of medical-examination-related reports; applying random masking at the entity level and the number level of the training original data to obtain training data; building a MedExBERT model with the Transformer Encoder as its basic structure; adding a loss calculation function comprising a cross entropy loss function and a numerical loss function; and training the MedExBERT model with the training data to obtain the medical examination pre-training model. Through the scheme, the method has the advantages of simple logic, accuracy and reliability, and has high practical and popularization value in the technical field of medical examination text recognition.

Description

Medical examination pre-training model construction method and identification method
Technical Field
The invention relates to the technical field of medical examination text recognition, in particular to a medical examination pre-training model construction method and a medical examination pre-training model recognition method.
Background
Reports and texts related to medical examination are an important component of medical texts, and automatic structured analysis of medical examination text is the basis for artificial-intelligence-assisted disease diagnosis and treatment. However, owing to the unique sub-language characteristics of medical examination text and the complexity and diversity of medical examination descriptions, the open-source Chinese pre-training models currently available for medical examination text structuring cannot accurately represent such text, and in particular cannot form a sufficiently accurate perception of examination-related entities and examination result indexes, resulting in low accuracy of text structuring and poor analysis performance.
Since the Google team released BERT, a pre-training model based on the Transformer Encoder architecture, the pre-training/fine-tuning strategy built on large models has become the standard paradigm in NLP. In the biomedical field, dozens of BERT-like pre-trained language models have been released, including BioBERT, PubMedBERT and SciBERT, pre-trained on large-scale corpora of biomedical research literature, and G-BERT and Med-BERT, pre-trained on large-scale clinical text corpora. However, these models are pre-trained on English corpora and cannot adapt to Chinese medical text processing tasks. For Chinese biomedical text, models such as MC-BERT and PCL-MedBERT exist, but they target biomedicine in general and adapt poorly to text processing in subdivided fields such as medical examination and medical diagnosis. In addition, the simple random word mask or random entity mask adopted by existing BERT pre-training models does not form accurate semantic perception and modeling of entities related to medical indexes (such as "25%", "7.4h", "2.5g/L"), which causes large semantic deviations for such entities in downstream tasks and leads to poor parsing performance when processing real-world medical text.
Therefore, there is an urgent need for a medical examination pre-training model construction method and recognition method that are simple in logic, accurate and reliable.
Disclosure of Invention
In view of the above problems, the present invention aims to provide a medical examination pre-training model construction method and an identification method, and the technical scheme adopted by the present invention is as follows:
In a first aspect, the present technology provides a medical examination pre-training model construction method, which comprises the following steps:
constructing training original data by using data of medical examination related reports;
carrying out random masking at the entity level and the number level of the training original data to obtain training data;
a MedExBERT model is built by adopting a Transformer Encoder as a basic structure;
adding a loss calculation function, wherein the loss calculation function comprises a cross entropy loss function and a numerical loss function;
and training the MedExBERT model by using the training data to obtain a medical examination pre-training model.
In a second aspect, the present technology provides a medical examination text recognition method, which performs recognition with a network model constructed by the above medical examination pre-training model construction method.
In a third aspect, the present technology provides a medical examination pre-training model building apparatus, including:
the training original data analysis module collects data of medical examination related reports to construct training original data;
the preprocessing module is connected with the training original data analysis module and is used for applying random masking at the entity level and the number level of the training original data to obtain training data;
the model building module is used for building a MedExBERT model by adopting a Transformer Encoder as a basic structure;
the loss function adding module is connected with the model building module and is used for adding a loss calculation function, and the loss calculation function comprises a cross entropy loss function and a numerical value loss function;
and the training module is connected with the preprocessing module, the model building module and the loss function adding module, and trains the MedExBERT model by using the training data to obtain a medical examination pre-training model.
In a fourth aspect, the present technology provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements a medical examination pre-training model construction method when executing the computer program.
In a fifth aspect, the present technology provides a computer-readable storage medium, which stores a computer program, wherein the computer program, when executed by a processor, implements the steps of a medical examination pre-training model construction method according to any one of claims 1 to 4.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a number masking method for digital entities in medical examination text, in which qualitative quantity descriptions such as "a few" and "many" are standardized into numeric form before masking and prediction; in addition, the invention provides a numerical loss for medical indexes that accurately captures differences between index values and deepens the pre-training model's understanding of deep medical knowledge;
(2) The quantifier and adjective class data are calculated with a numerical loss function, so that differences between index values are accurately captured and the large model's understanding of deep medical knowledge is increased;
In conclusion, the method has the advantages of simple logic, accuracy and reliability, and has high practical and popularization value in the technical field of medical examination text recognition.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention, and therefore should not be considered as limiting the scope of protection, and it is obvious for those skilled in the art that other related drawings can be obtained according to these drawings without inventive efforts.
FIG. 1 is a logic flow diagram of the present invention.
Fig. 2 is a diagram of the model architecture of the present invention.
FIG. 3 is a flow chart of numerical loss function calculation according to the present invention.
Detailed Description
To further clarify the objects, technical solutions and advantages of the present application, the present invention will be further described with reference to the accompanying drawings and examples, and embodiments of the present invention include, but are not limited to, the following examples. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Examples
As shown in fig. 1 to fig. 3, the present embodiment provides a medical examination pre-training model construction method, which includes the following steps:
the first step, constructing training raw data by using data of medical examination related reports, mainly relating to data cleaning, processing and integration of various types of medical examination related reports. Comprises extracting and cleaning text fields of examination reports such as physical examination report, ultrasonic examination report, puncture examination report, blood routine report and electrocardiogram report, and integrally forming raw data
In the second step, random masking is applied at the entity level and the number level of the training original data to obtain training data. At the entity level, an entity dictionary (covering examination item entities, examination description entities, examination diagnosis entities and the like) is constructed and used to match entities in the training data. At the number level, regular expressions are constructed to match number-and-unit combination entities in the training data (such as "25%", "7.4h", "2.5g/L" and "13-18"). In addition, qualitative quantity expressions such as "not seen", "a few" and "many" are first standardized into numeric form and then matched for masking; this standardization is performed with a number standardization dictionary.
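As an illustration of the number-level matching and standardization described above, a minimal Python sketch is given below. It is not the patent's implementation: the regular expression, the unit list and the standardization dictionary entries are assumptions chosen for the example.

```python
import re

# Hypothetical regular expression for number-and-unit combination entities
# such as "25%", "7.4h", "2.5g/L", "13-18" (the unit list is an assumption).
NUMBER_UNIT_RE = re.compile(
    r"\d+(?:\.\d+)?(?:-\d+(?:\.\d+)?)?\s*(?:%|h|g/L|mmol/L|mg|ml|cm|mm)?"
)

# Hypothetical number standardization dictionary for qualitative quantity
# descriptions ("not seen", "a few", "many"), mapped to placeholder values.
NUMBER_STANDARDIZATION = {"未见": "0", "少许": "1", "多数": "3"}

def match_number_entities(text: str):
    """Standardize qualitative expressions, then return every matched number-unit entity."""
    for phrase, value in NUMBER_STANDARDIZATION.items():
        text = text.replace(phrase, value)   # standardization before mask matching
    return [(m.start(), m.end(), m.group()) for m in NUMBER_UNIT_RE.finditer(text)]

print(match_number_entities("白细胞 7.4h 血红蛋白 2.5g/L 血小板 13-18 少许"))
```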
After the entities in the original training data are matched at the entity level and the number level, the candidate entities are masked with the following strategy: (1) 80% of the candidate entities are randomly selected and replaced by the mask word; (2) 10% of the candidate entities are randomly selected and replaced by an arbitrary entity of the same type (for example, the diagnosis "hypertension" is replaced by "chronic nephritis", and "7.4h" is replaced by "2.5g/L"); (3) the remaining 10% of the candidate entities are kept unchanged.
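The 80/10/10 masking strategy can likewise be sketched in a few lines of Python. This is only an illustration under assumed inputs (a list of matched candidate entities with a type label and a pool of same-type entities); the mask token and the data structures are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask word

def mask_candidates(text, candidates, same_type_pool):
    """candidates: list of (start, end, surface, entity_type); same_type_pool: type -> list of surfaces."""
    # Work from right to left so earlier character offsets stay valid after replacement.
    for start, end, surface, etype in sorted(candidates, key=lambda c: c[0], reverse=True):
        r = random.random()
        if r < 0.8:                                        # 80%: replace with the mask word
            replacement = MASK_TOKEN
        elif r < 0.9:                                      # 10%: replace with a same-type entity
            replacement = random.choice(same_type_pool[etype])
        else:                                              # 10%: keep the entity unchanged
            replacement = surface
        text = text[:start] + replacement + text[end:]
    return text

cands = [(0, 3, "高血压", "diagnosis"), (4, 8, "7.4h", "number")]
pool = {"diagnosis": ["慢性肾炎"], "number": ["2.5g/L"]}
print(mask_candidates("高血压 7.4h", cands, pool))
```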
Thirdly, a MedExBERT model is built by adopting a Transformer Encoder as a basic structure:
(1) Model architecture
MedExBERT adopts the Transformer Encoder as its basic structure, and each Encoder layer consists of two sub-layers: the first is a Multi-head Self-Attention (MSA) layer and the second is a Feed-Forward Network (FFN). A Layer Normalization layer follows each sub-layer and normalizes the residual connection of the sub-layer's input and output.
The overall model architecture can be expressed by formula (1), where Encoder_i denotes the i-th Transformer Encoder layer, and H_i and H_{i-1} denote the output tensors of the i-th and (i-1)-th Encoder layers, respectively.
H_i = Encoder_i(H_{i-1})  (1)
Each Encoder layer consists of the two sub-layers MSA and FFN; the output of each sub-layer is added to its input through a residual connection and fed into the LayerNorm layer, as shown in formula (2), where y denotes the output tensor, x the input tensor, LayerNorm the Layer Normalization layer, and SubLayer the sub-layer (i.e. MSA or FFN).
y=LayerNorm(x+SubLayer(x)) (2)
The FFN layer is shown in formula (3), where x denotes the input tensor of the FFN layer, W_1 and W_2 denote the weight parameters of the first and second hidden linear layers of the FFN layer, and b_1 and b_2 denote the corresponding bias parameters.
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2  (3)
The structure of the MSA layer is shown in FIG. 2 and can be expressed by formula (4). Q, K and V denote the query, key and value tensors of the Attention mechanism, and in the model all three are the input tensor; Concat denotes the tensor concatenation operation, head_i denotes the output of the i-th attention head, and W^O denotes the weight parameter of the final linear layer in the MSA.
MSA(Q, K, V) = Concat(head_1, ..., head_i, ..., head_h)W^O  (4)
head_i is given by formula (5), where W_i^Q, W_i^K and W_i^V denote the weight parameters of the linear layers applied to the Q, K and V tensors, respectively.
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)  (5)
The Attention layer is given by formula (6), where K^T denotes the transpose of the tensor K and d_k denotes the feature dimension of K.
Attention(Q, K, V) = softmax(QK^T / √d_k)V  (6)
The hyper-parameters of the MedExBERT model are shown in Table 1:
TABLE 1 Model hyper-parameters
Hyper-parameter    Meaning    Value
N          Number of Transformer Encoder layers    12
d_model    Feature dimension of the input and output tensors of the model sub-layers    768
h          Number of attention heads in the MSA layer    12
d_k        Feature dimension of the tensor K in the Attention layer    64
d_ff       Feature dimension of the hidden linear layer in the middle of the FFN layer    3072
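For illustration, a minimal PyTorch sketch of one MedExBERT Encoder layer with the Table 1 hyper-parameters is shown below. It is a re-expression of formulas (1)-(6) under assumptions, not the patent's code; in particular, torch.nn.MultiheadAttention is used as a stand-in for the MSA layer of formulas (4)-(6).

```python
import torch
import torch.nn as nn

D_MODEL, N_HEADS, D_FF = 768, 12, 3072   # Table 1: d_model, h, d_ff (so d_k = 768 / 12 = 64)

class EncoderLayer(nn.Module):
    """One Encoder layer: MSA and FFN sub-layers, each followed by residual + LayerNorm (formula (2))."""
    def __init__(self):
        super().__init__()
        self.msa = nn.MultiheadAttention(D_MODEL, N_HEADS, batch_first=True)   # formulas (4)-(6)
        self.ffn = nn.Sequential(nn.Linear(D_MODEL, D_FF), nn.ReLU(), nn.Linear(D_FF, D_MODEL))  # formula (3)
        self.norm1 = nn.LayerNorm(D_MODEL)
        self.norm2 = nn.LayerNorm(D_MODEL)

    def forward(self, x):
        attn_out, _ = self.msa(x, x, x)          # Q = K = V = input tensor
        x = self.norm1(x + attn_out)             # y = LayerNorm(x + SubLayer(x))
        x = self.norm2(x + self.ffn(x))
        return x

# Twelve stacked layers (N = 12), i.e. formula (1): H_i = Encoder_i(H_{i-1})
encoder = nn.Sequential(*[EncoderLayer() for _ in range(12)])
h = encoder(torch.randn(2, 128, D_MODEL))        # (batch, sequence length, d_model)
print(h.shape)
```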
Fourthly, adding a loss calculation function, wherein the loss calculation function comprises a cross entropy loss function and a numerical value loss function;
(1) For mask words of the medical noun/character description type, the loss between the real value and the predicted value is calculated with the cross entropy function, whose expression is as follows:
CELoss(y, y') = -Σ_{i=1..C} y_i · log(y'_i)
where y and y' respectively denote the real value and the predicted value of the current mask word, C denotes the size of the dictionary, and y_i and y'_i respectively denote the real probability and the predicted probability that the i-th word of the dictionary is the mask word;
(2) For measured numeric data, a combination of the cross entropy loss function and the numerical loss function is adopted, with the following expression:
Loss(y,y′)=λ×CELoss(y,y′)+(1-λ)×NumberLoss(n,n′)
NumberLoss(n,n′)=|n-n′|
where λ denotes the cross entropy loss weight, and n and n' denote the numeric values obtained by removing the units from y and y', respectively;
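A short sketch of this combined loss for a single masked numeric token follows. It assumes token-level logits over the dictionary, a hypothetical helper strip_unit that removes the unit from a token to recover its numeric value, and an arbitrary weight λ = 0.7; none of these details are specified by the patent.

```python
import torch
import torch.nn.functional as F

def strip_unit(token: str) -> float:
    """Hypothetical helper: drop the unit and return the numeric value, e.g. "2.5g/L" -> 2.5."""
    digits = "".join(ch for ch in token if ch.isdigit() or ch == ".")
    return float(digits) if digits else 0.0

def numeric_token_loss(logits, target_id, target_token, predicted_token, lam=0.7):
    """Loss(y, y') = lambda * CELoss(y, y') + (1 - lambda) * |n - n'| for one masked numeric token."""
    ce = F.cross_entropy(logits.unsqueeze(0), torch.tensor([target_id]))        # CELoss(y, y')
    number = abs(strip_unit(target_token) - strip_unit(predicted_token))        # NumberLoss(n, n')
    return lam * ce + (1.0 - lam) * number

logits = torch.randn(21128)   # assumed dictionary size (a BERT-style Chinese vocabulary)
print(numeric_token_loss(logits, target_id=100, target_token="2.5g/L", predicted_token="7.4h"))
```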
(3) For quantifier and adjective class data, the numerical loss function is used for the calculation, with the following expression:
NumberLoss = Σ_i δ(w_i · |v_i - n_i|)
n_i = MLP(h_i)
w_i = W · one_hot(x_i)
where x_i and h_i respectively denote the value at the i-th position to be predicted and the hidden-layer tensor produced for that position by the pre-training model; one_hot denotes the one-hot encoding algorithm, W denotes a learnable parameter, and w_i denotes the weight tensor obtained from the type of x_i (the actual quantifier following the numeric value); MLP denotes a multi-layer perceptron network; v_i denotes the normalized value of x_i; and δ denotes the sigmoid activation function.
The quantifier and adjective class data are calculated with the numerical loss function according to the following steps (a code sketch is given after this list):
step I01, feeding the quantifier and adjective class data to be predicted into a large model, which outputs a one-dimensional value; the large model is a BERT pre-training model or an MC-BERT pre-training model;
step I02, subtracting the one-dimensional value from the normalized quantifier and adjective class data to be predicted, and taking the absolute value;
step I03, multiplying the absolute value by the weight tensor w_i corresponding to the predicted quantifier and adjective class data, and passing the product through the sigmoid function;
step I04, repeating steps I01 to I03 until every quantifier and adjective class datum to be predicted has been processed;
and step I05, summing the outputs of the sigmoid function.
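The following is a minimal sketch of steps I01 to I05. It assumes that the hidden tensor h_i of each position is already available from the large model, that a small MLP head produces the one-dimensional value, that value types are indexed against a small type vocabulary for the one-hot weight lookup, and that the values to be predicted have been normalized beforehand; all of these are assumptions made for the example.

```python
import torch
import torch.nn as nn

class NumberLoss(nn.Module):
    """NumberLoss = sum_i sigmoid( w_i * | v_i - MLP(h_i) | ), following steps I01-I05."""
    def __init__(self, hidden_dim=768, num_types=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))
        self.W = nn.Parameter(torch.ones(num_types))   # learnable weight per value type (w_i = W * one_hot(x_i))

    def forward(self, hidden, values, type_ids):
        # hidden: (num_positions, hidden_dim); values: normalized v_i; type_ids: type index of each x_i
        n = self.mlp(hidden).squeeze(-1)               # I01: one-dimensional output of the prediction head
        diff = (values - n).abs()                      # I02: absolute difference with the normalized value
        weighted = self.W[type_ids] * diff             # I03: multiply by the type-dependent weight w_i
        return torch.sigmoid(weighted).sum()           # I03/I05: sigmoid, then sum over positions

loss_fn = NumberLoss()
loss = loss_fn(torch.randn(4, 768), torch.rand(4), torch.tensor([0, 1, 2, 0]))
print(loss.item())
```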
And fifthly, training the MedExBERT model by using the training data to obtain a medical examination pre-training model.
The above-mentioned embodiments are only preferred embodiments of the present invention and do not limit its scope of protection; all modifications made according to the principles of the present invention, and all non-inventive improvements based on the above embodiments, shall fall within the scope of protection of the present invention.

Claims (8)

1. A medical examination pre-training model construction method is characterized by comprising the following steps:
constructing training original data by using data of a medical examination related report;
carrying out random masking at the entity level and the number level of the training original data to obtain training data;
a MedExBERT model is built by adopting a Transformer Encoder as a basic structure;
adding a loss calculation function, wherein the loss calculation function comprises a cross entropy loss function and a numerical loss function;
and training the MedExBERT model by using the training data to obtain a medical examination pre-training model.
2. The medical examination pre-training model construction method according to claim 1, wherein adding the loss calculation function comprises the following steps:
S1, for mask words of the medical noun/character description type, calculating the loss between the real value and the predicted value with a cross entropy function, the expression being:
CELoss(y, y') = -Σ_{i=1..C} y_i · log(y'_i)
wherein y and y' respectively denote the real value and the predicted value of the current mask word, C denotes the size of the dictionary, and y_i and y'_i respectively denote the real probability and the predicted probability that the i-th word of the dictionary is the mask word;
s2, for the metering digital data, the combination of a cross entropy loss function and a numerical loss function is adopted, and the expression is as follows:
Loss(y,y′)=λ×CELoss(y,y′)+(1-λ)×NumberLoss(n,n′)
NumberLoss(n,n′)=|n-n′|
wherein λ denotes the cross entropy loss weight, and n and n' denote the numeric values obtained by removing the units from y and y', respectively;
s3, calculating the quantitative and morphological word class data by adopting a numerical loss function, wherein the expression is as follows:
Figure FDA0003953595610000012
n i =MLP(h i )
w i =W*one_hot(x i )
wherein x is i And h i Respectively representing the numerical value of the ith position to be predicted and the hidden layer tensor of the position after the position passes through a pre-training model; one _ hot denotes a one-hot encoding algorithm, W denotes a learnable parameter, W i Is expressed according to x i A weight tensor obtained by the type; MLP represents a multi-layer perceptron network; v. of i Denotes x i The normalized value, δ, represents the sigmoid activation function.
3. The medical examination pre-training model construction method according to claim 2, wherein calculating the quantifier and adjective class data with the numerical loss function comprises the following steps:
step I01, feeding the quantifier and adjective class data to be predicted into a large model, which outputs a one-dimensional value; the large model is a BERT pre-training model or an MC-BERT pre-training model;
step I02, subtracting the one-dimensional value from the normalized quantifier and adjective class data to be predicted, and taking the absolute value;
step I03, multiplying the absolute value by the weight tensor w_i corresponding to the predicted quantifier and adjective class data, and passing the product through the sigmoid function;
step I04, repeating steps I01 to I03 until every quantifier and adjective class datum to be predicted has been processed;
and step I05, summing the outputs of the sigmoid function.
4. The method as claimed in claim 1, wherein each Encoder layer of the MedExBERT model is composed of a multi-head self-attention layer and a feed-forward neural network connected in sequence, and a Layer Normalization layer is arranged after each sub-layer to perform layer normalization on the residual connection of the sub-layer's input and output.
5. A method for recognizing a medical examination text, characterized in that a network model constructed by the medical examination pre-training model construction method according to any one of claims 1 to 4 is used for recognition.
6. A medical examination pre-training model building device is characterized by comprising:
the training original data analysis module is used for collecting data of a medical examination related report to construct training original data;
the preprocessing module is connected with the training original data analysis module and is used for applying random masking at the entity level and the number level of the training original data to obtain training data;
the model building module is used for building a MedExBERT model by adopting a Transformer Encoder as a basic structure;
the loss function adding module is connected with the model building module and is used for adding a loss calculation function, and the loss calculation function comprises a cross entropy loss function and a numerical value loss function;
and the training module is connected with the preprocessing module, the model building module and the loss function adding module, and trains the MedExBERT model by using the training data to obtain a medical examination pre-training model.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements a medical examination pre-training model construction method as claimed in any one of claims 1 to 4 when executing the computer program.
8. A computer-readable storage medium, storing a computer program, wherein the computer program, when executed by a processor, implements the steps of a medical examination pre-training model construction method according to any one of claims 1 to 4.
CN202211456934.7A 2022-11-21 2022-11-21 Medical examination pre-training model construction method and identification method Pending CN115759073A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211456934.7A CN115759073A (en) 2022-11-21 2022-11-21 Medical examination pre-training model construction method and identification method


Publications (1)

Publication Number Publication Date
CN115759073A (en) 2023-03-07

Family

ID=85333690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211456934.7A Pending CN115759073A (en) 2022-11-21 2022-11-21 Medical examination pre-training model construction method and identification method

Country Status (1)

Country Link
CN (1) CN115759073A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117370525A (en) * 2023-10-20 2024-01-09 厦门狄耐克物联智慧科技有限公司 Intelligent diagnosis guiding method based on fine tuning large model



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination