CN115759073A - Medical examination pre-training model construction method and identification method - Google Patents

Medical examination pre-training model construction method and identification method

Info

Publication number
CN115759073A
Authority
CN
China
Prior art keywords
training
model
medical examination
data
layer
Prior art date
Legal status
Pending
Application number
CN202211456934.7A
Other languages
Chinese (zh)
Inventor
姚佳
刘忠禹
殷晋
Current Assignee
West China Hospital of Sichuan University
Original Assignee
West China Hospital of Sichuan University
Priority date
Filing date
Publication date
Application filed by West China Hospital of Sichuan University filed Critical West China Hospital of Sichuan University
Priority to CN202211456934.7A
Publication of CN115759073A

Landscapes

  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a method for constructing a medical examination pre-training model, which comprises the following steps: constructing training original data from data of medical-examination-related reports; applying random masking at the entity level and the number level of the training original data to obtain training data; building a MedExBERT model with the Transformer Encoder as its basic structure; adding a loss calculation function comprising a cross entropy loss function and a numerical loss function; and training the MedExBERT model with the training data to obtain the medical examination pre-training model. Through the scheme, the method has the advantages of simple logic, accuracy and reliability, and has high practical and popularization value in the technical field of medical examination text recognition.

Description

Medical examination pre-training model construction method and identification method
Technical Field
The invention relates to the technical field of medical examination text recognition, in particular to a medical examination pre-training model construction method and a medical examination pre-training model recognition method.
Background
Reports and texts related to medical examination are an important component of medical texts, and automatic structured analysis of medical examination text is the basis for artificial-intelligence-assisted disease diagnosis and treatment. However, owing to the unique sub-language characteristics of medical examination text and the complexity and diversity of medical examination descriptions, the open-source Chinese pre-training models currently available for medical examination text structuring cannot accurately represent such text, and in particular cannot form a sufficiently accurate perception of examination-related entities and examination result indexes, resulting in low accuracy of text structuring and poor analysis performance.
Since the Google team released BERT, a pre-training model based on the Transformer Encoder architecture, the pre-training/fine-tuning strategy built on large models has become the standard paradigm in NLP. In the biomedical field, dozens of BERT-like pre-trained language models have been released, including BioBERT, PubMedBERT and SciBERT, pre-trained on large-scale corpora of biomedical research literature, and G-BERT and Med-BERT, pre-trained on large-scale clinical text corpora. However, these models are pre-trained on English corpora and cannot adapt to Chinese medical text processing tasks. For Chinese biomedical text, models such as MC-BERT and PCL-MedBERT exist, but they target biomedicine in general and adapt poorly to text processing in subdivided fields such as medical examination and medical diagnosis. In addition, the simple random word mask or random entity mask adopted by existing BERT pre-training models does not form accurate semantic perception and modeling of entities related to medical indexes (such as "25%", "7.4h", "2.5g/L"), which causes large semantic deviations for such entities in downstream tasks and leads to poor parsing performance when processing real-world medical text.
Therefore, there is an urgent need for a medical examination pre-training model construction method and recognition method that are simple in logic, accurate and reliable.
Disclosure of Invention
In view of the above problems, the present invention aims to provide a medical examination pre-training model construction method and an identification method, and the technical scheme adopted by the present invention is as follows:
In a first aspect, the present technology provides a medical examination pre-training model construction method, which comprises the following steps:
constructing training original data by using data of medical examination related reports;
carrying out random masking at the entity level and the number level of the training original data to obtain training data;
a MedExBERT model is built by adopting a Transformer Encoder as a basic structure;
adding a loss calculation function, wherein the loss calculation function comprises a cross entropy loss function and a numerical loss function;
and training the MedExBERT model by using the training data to obtain a medical examination pre-training model.
In a second aspect, the present technology provides a medical examination text recognition method, which performs recognition with a network model constructed by the above medical examination pre-training model construction method.
In a third aspect, the present technology provides a medical examination pre-training model building apparatus, including:
the training original data analysis module collects data of medical examination related reports to construct training original data;
the preprocessing module is connected with the training original data analysis module and is used for applying random masking at the entity level and the number level of the training original data to obtain training data;
the model building module is used for building a MedExBERT model by adopting a Transformer Encoder as a basic structure;
the loss function adding module is connected with the model building module and is used for adding a loss calculation function, and the loss calculation function comprises a cross entropy loss function and a numerical value loss function;
and the training module is connected with the preprocessing module, the model building module and the loss function adding module, and trains the MedExBERT model by using the training data to obtain a medical examination pre-training model.
In a fourth aspect, the present technology provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements a medical examination pre-training model construction method when executing the computer program.
In a fifth aspect, the present technology provides a computer-readable storage medium, which stores a computer program, wherein the computer program, when executed by a processor, implements the steps of a medical examination pre-training model construction method according to any one of claims 1 to 4.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a number masking method for digital entities in medical examination text, in which qualitative quantity descriptions such as "a few" and "many" are standardized into numeric form before masking and prediction; in addition, the invention provides a numerical loss for medical indexes that accurately captures differences between index values and deepens the pre-training model's understanding of deep medical knowledge;
(2) The quantifier and adjective class data are calculated with a numerical loss function, so that differences between index values are accurately captured and the large model's understanding of deep medical knowledge is increased;
In conclusion, the method has the advantages of simple logic, accuracy and reliability, and has high practical and popularization value in the technical field of medical examination text recognition.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention, and therefore should not be considered as limiting the scope of protection, and it is obvious for those skilled in the art that other related drawings can be obtained according to these drawings without inventive efforts.
FIG. 1 is a logic flow diagram of the present invention.
Fig. 2 is a diagram of the model architecture of the present invention.
FIG. 3 is a flow chart of numerical loss function calculation according to the present invention.
Detailed Description
To further clarify the objects, technical solutions and advantages of the present application, the present invention will be further described with reference to the accompanying drawings and examples, and embodiments of the present invention include, but are not limited to, the following examples. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Examples
As shown in fig. 1 to fig. 3, the present embodiment provides a medical examination pre-training model construction method, which includes the following steps:
the first step, constructing training raw data by using data of medical examination related reports, mainly relating to data cleaning, processing and integration of various types of medical examination related reports. Comprises extracting and cleaning text fields of examination reports such as physical examination report, ultrasonic examination report, puncture examination report, blood routine report and electrocardiogram report, and integrally forming raw data
In the second step, random masking is applied at the entity level and the number level of the training original data to obtain training data. At the entity level, an entity dictionary (covering examination item entities, examination description entities, examination diagnosis entities and the like) is constructed and used to match entities in the training data. At the number level, regular expressions are constructed to match number-and-unit combination entities in the training data (such as "25%", "7.4h", "2.5g/L" and "13-18"). In addition, qualitative quantity expressions such as "not seen", "a few" and "many" are first standardized into numeric form and then matched for masking; this standardization is performed with a number standardization dictionary.
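As an illustration of the number-level matching and standardization described above, a minimal Python sketch is given below. It is not the patent's implementation: the regular expression, the unit list and the standardization dictionary entries are assumptions chosen for the example.

```python
import re

# Hypothetical regular expression for number-and-unit combination entities
# such as "25%", "7.4h", "2.5g/L", "13-18" (the unit list is an assumption).
NUMBER_UNIT_RE = re.compile(
    r"\d+(?:\.\d+)?(?:-\d+(?:\.\d+)?)?\s*(?:%|h|g/L|mmol/L|mg|ml|cm|mm)?"
)

# Hypothetical number standardization dictionary for qualitative quantity
# descriptions ("not seen", "a few", "many"), mapped to placeholder values.
NUMBER_STANDARDIZATION = {"未见": "0", "少许": "1", "多数": "3"}

def match_number_entities(text: str):
    """Standardize qualitative expressions, then return every matched number-unit entity."""
    for phrase, value in NUMBER_STANDARDIZATION.items():
        text = text.replace(phrase, value)   # standardization before mask matching
    return [(m.start(), m.end(), m.group()) for m in NUMBER_UNIT_RE.finditer(text)]

print(match_number_entities("白细胞 7.4h 血红蛋白 2.5g/L 血小板 13-18 少许"))
```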
After the entities in the original training data are matched at the entity level and the number level, the candidate entities are masked with the following strategy: (1) 80% of the candidate entities are randomly selected and replaced by the mask word; (2) 10% of the candidate entities are randomly selected and replaced by an arbitrary entity of the same type (for example, the diagnosis "hypertension" is replaced by "chronic nephritis", and "7.4h" is replaced by "2.5g/L"); (3) the remaining 10% of the candidate entities are kept unchanged.
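The 80/10/10 masking strategy can likewise be sketched in a few lines of Python. This is only an illustration under assumed inputs (a list of matched candidate entities with a type label and a pool of same-type entities); the mask token and the data structures are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask word

def mask_candidates(text, candidates, same_type_pool):
    """candidates: list of (start, end, surface, entity_type); same_type_pool: type -> list of surfaces."""
    # Work from right to left so earlier character offsets stay valid after replacement.
    for start, end, surface, etype in sorted(candidates, key=lambda c: c[0], reverse=True):
        r = random.random()
        if r < 0.8:                                        # 80%: replace with the mask word
            replacement = MASK_TOKEN
        elif r < 0.9:                                      # 10%: replace with a same-type entity
            replacement = random.choice(same_type_pool[etype])
        else:                                              # 10%: keep the entity unchanged
            replacement = surface
        text = text[:start] + replacement + text[end:]
    return text

cands = [(0, 3, "高血压", "diagnosis"), (4, 8, "7.4h", "number")]
pool = {"diagnosis": ["慢性肾炎"], "number": ["2.5g/L"]}
print(mask_candidates("高血压 7.4h", cands, pool))
```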
Thirdly, a MedExBERT model is built by adopting a Transformer Encoder as a basic structure:
(1) Model architecture
MedExBERT adopts the Transformer Encoder as its basic structure, and each Encoder layer consists of two sub-layers: the first is a Multi-head Self-Attention (MSA) layer and the second is a Feed-Forward Network (FFN). A Layer Normalization layer follows each sub-layer and normalizes the residual connection of the sub-layer's input and output.
The overall model architecture can be expressed by formula (1), where Encoder_i denotes the i-th Transformer Encoder layer, and H_i and H_{i-1} denote the output tensors of the i-th and (i-1)-th Encoder layers, respectively.
H_i = Encoder_i(H_{i-1})  (1)
Each Encoder layer consists of the two sub-layers MSA and FFN; the output of each sub-layer is added to its input through a residual connection and fed into the LayerNorm layer, as shown in formula (2), where y denotes the output tensor, x the input tensor, LayerNorm the Layer Normalization layer, and SubLayer the sub-layer (i.e. MSA or FFN).
y=LayerNorm(x+SubLayer(x)) (2)
The FFN layer is shown in formula (3), where x denotes the input tensor of the FFN layer, W_1 and W_2 denote the weight parameters of the first and second hidden linear layers of the FFN layer, and b_1 and b_2 denote the corresponding bias parameters.
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2  (3)
The structure of the MSA layer is shown in FIG. 2 and can be expressed by formula (4). Q, K and V denote the query, key and value tensors of the Attention mechanism, and in the model all three are the input tensor; Concat denotes the tensor concatenation operation, head_i denotes the output of the i-th attention head, and W^O denotes the weight parameter of the final linear layer in the MSA.
MSA(Q, K, V) = Concat(head_1, ..., head_i, ..., head_h)W^O  (4)
head_i is given by formula (5), where W_i^Q, W_i^K and W_i^V denote the weight parameters of the linear layers applied to the Q, K and V tensors, respectively.
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)  (5)
The Attention layer is given by formula (6), where K^T denotes the transpose of the tensor K and d_k denotes the feature dimension of K.
Attention(Q, K, V) = softmax(QK^T / √d_k)V  (6)
The hyper-parameters of the MedExBERT model are shown in Table 1:
TABLE 1 Model hyper-parameters
Hyper-parameter    Meaning    Value
N          Number of Transformer Encoder layers    12
d_model    Feature dimension of the input and output tensors of the model sub-layers    768
h          Number of attention heads in the MSA layer    12
d_k        Feature dimension of the tensor K in the Attention layer    64
d_ff       Feature dimension of the hidden linear layer in the middle of the FFN layer    3072
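For illustration, a minimal PyTorch sketch of one MedExBERT Encoder layer with the Table 1 hyper-parameters is shown below. It is a re-expression of formulas (1)-(6) under assumptions, not the patent's code; in particular, torch.nn.MultiheadAttention is used as a stand-in for the MSA layer of formulas (4)-(6).

```python
import torch
import torch.nn as nn

D_MODEL, N_HEADS, D_FF = 768, 12, 3072   # Table 1: d_model, h, d_ff (so d_k = 768 / 12 = 64)

class EncoderLayer(nn.Module):
    """One Encoder layer: MSA and FFN sub-layers, each followed by residual + LayerNorm (formula (2))."""
    def __init__(self):
        super().__init__()
        self.msa = nn.MultiheadAttention(D_MODEL, N_HEADS, batch_first=True)   # formulas (4)-(6)
        self.ffn = nn.Sequential(nn.Linear(D_MODEL, D_FF), nn.ReLU(), nn.Linear(D_FF, D_MODEL))  # formula (3)
        self.norm1 = nn.LayerNorm(D_MODEL)
        self.norm2 = nn.LayerNorm(D_MODEL)

    def forward(self, x):
        attn_out, _ = self.msa(x, x, x)          # Q = K = V = input tensor
        x = self.norm1(x + attn_out)             # y = LayerNorm(x + SubLayer(x))
        x = self.norm2(x + self.ffn(x))
        return x

# Twelve stacked layers (N = 12), i.e. formula (1): H_i = Encoder_i(H_{i-1})
encoder = nn.Sequential(*[EncoderLayer() for _ in range(12)])
h = encoder(torch.randn(2, 128, D_MODEL))        # (batch, sequence length, d_model)
print(h.shape)
```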
Fourthly, adding a loss calculation function, wherein the loss calculation function comprises a cross entropy loss function and a numerical value loss function;
(1) For mask words of the medical noun/character description type, the loss between the real value and the predicted value is calculated with the cross entropy function, whose expression is as follows:
CELoss(y, y') = -Σ_{i=1..C} y_i · log(y'_i)
where y and y' respectively denote the real value and the predicted value of the current mask word, C denotes the size of the dictionary, and y_i and y'_i respectively denote the real probability and the predicted probability that the i-th word of the dictionary is the mask word;
(2) For measured numeric data, a combination of the cross entropy loss function and the numerical loss function is adopted, with the following expression:
Loss(y,y′)=λ×CELoss(y,y′)+(1-λ)×NumberLoss(n,n′)
NumberLoss(n,n′)=|n-n′|
where λ denotes the cross entropy loss weight, and n and n' denote the numeric values obtained by removing the units from y and y', respectively;
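A short sketch of this combined loss for a single masked numeric token follows. It assumes token-level logits over the dictionary, a hypothetical helper strip_unit that removes the unit from a token to recover its numeric value, and an arbitrary weight λ = 0.7; none of these details are specified by the patent.

```python
import torch
import torch.nn.functional as F

def strip_unit(token: str) -> float:
    """Hypothetical helper: drop the unit and return the numeric value, e.g. "2.5g/L" -> 2.5."""
    digits = "".join(ch for ch in token if ch.isdigit() or ch == ".")
    return float(digits) if digits else 0.0

def numeric_token_loss(logits, target_id, target_token, predicted_token, lam=0.7):
    """Loss(y, y') = lambda * CELoss(y, y') + (1 - lambda) * |n - n'| for one masked numeric token."""
    ce = F.cross_entropy(logits.unsqueeze(0), torch.tensor([target_id]))        # CELoss(y, y')
    number = abs(strip_unit(target_token) - strip_unit(predicted_token))        # NumberLoss(n, n')
    return lam * ce + (1.0 - lam) * number

logits = torch.randn(21128)   # assumed dictionary size (a BERT-style Chinese vocabulary)
print(numeric_token_loss(logits, target_id=100, target_token="2.5g/L", predicted_token="7.4h"))
```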
(3) For quantifier and adjective class data, the numerical loss function is used for the calculation, with the following expression:
NumberLoss = Σ_i δ(w_i · |v_i - n_i|)
n_i = MLP(h_i)
w_i = W · one_hot(x_i)
where x_i and h_i respectively denote the value at the i-th position to be predicted and the hidden-layer tensor produced for that position by the pre-training model; one_hot denotes the one-hot encoding algorithm, W denotes a learnable parameter, and w_i denotes the weight tensor obtained from the type of x_i (the actual quantifier following the numeric value); MLP denotes a multi-layer perceptron network; v_i denotes the normalized value of x_i; and δ denotes the sigmoid activation function.
The quantifier and adjective class data are calculated with the numerical loss function according to the following steps (a code sketch is given after this list):
step I01, feeding the quantifier and adjective class data to be predicted into a large model, which outputs a one-dimensional value; the large model is a BERT pre-training model or an MC-BERT pre-training model;
step I02, subtracting the one-dimensional value from the normalized quantifier and adjective class data to be predicted, and taking the absolute value;
step I03, multiplying the absolute value by the weight tensor w_i corresponding to the predicted quantifier and adjective class data, and passing the product through the sigmoid function;
step I04, repeating steps I01 to I03 until every quantifier and adjective class datum to be predicted has been processed;
and step I05, summing the outputs of the sigmoid function.
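The following is a minimal sketch of steps I01 to I05. It assumes that the hidden tensor h_i of each position is already available from the large model, that a small MLP head produces the one-dimensional value, that value types are indexed against a small type vocabulary for the one-hot weight lookup, and that the values to be predicted have been normalized beforehand; all of these are assumptions made for the example.

```python
import torch
import torch.nn as nn

class NumberLoss(nn.Module):
    """NumberLoss = sum_i sigmoid( w_i * | v_i - MLP(h_i) | ), following steps I01-I05."""
    def __init__(self, hidden_dim=768, num_types=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))
        self.W = nn.Parameter(torch.ones(num_types))   # learnable weight per value type (w_i = W * one_hot(x_i))

    def forward(self, hidden, values, type_ids):
        # hidden: (num_positions, hidden_dim); values: normalized v_i; type_ids: type index of each x_i
        n = self.mlp(hidden).squeeze(-1)               # I01: one-dimensional output of the prediction head
        diff = (values - n).abs()                      # I02: absolute difference with the normalized value
        weighted = self.W[type_ids] * diff             # I03: multiply by the type-dependent weight w_i
        return torch.sigmoid(weighted).sum()           # I03/I05: sigmoid, then sum over positions

loss_fn = NumberLoss()
loss = loss_fn(torch.randn(4, 768), torch.rand(4), torch.tensor([0, 1, 2, 0]))
print(loss.item())
```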
And fifthly, training the MedExBERT model by using the training data to obtain a medical examination pre-training model.
The above-mentioned embodiments are only preferred embodiments of the present invention and do not limit its scope of protection; all modifications made according to the principles of the present invention, and all non-inventive improvements based on the above embodiments, shall fall within the scope of protection of the present invention.

Claims (8)

1. A medical examination pre-training model construction method is characterized by comprising the following steps:
constructing training original data by using data of a medical examination related report;
carrying out random masking at the entity level and the number level of the training original data to obtain training data;
a MedExBERT model is built by adopting a Transformer Encoder as a basic structure;
adding a loss calculation function, wherein the loss calculation function comprises a cross entropy loss function and a numerical loss function;
and training the MedExBERT model by using the training data to obtain a medical examination pre-training model.
2. The medical examination pre-training model construction method according to claim 1, wherein adding the loss calculation function comprises the following steps:
S1, for mask words of the medical noun/character description type, calculating the loss between the real value and the predicted value with a cross entropy function, the expression being:
CELoss(y, y') = -Σ_{i=1..C} y_i · log(y'_i)
wherein y and y' respectively denote the real value and the predicted value of the current mask word, C denotes the size of the dictionary, and y_i and y'_i respectively denote the real probability and the predicted probability that the i-th word of the dictionary is the mask word;
s2, for the metering digital data, the combination of a cross entropy loss function and a numerical loss function is adopted, and the expression is as follows:
Loss(y,y′)=λ×CELoss(y,y′)+(1-λ)×NumberLoss(n,n′)
NumberLoss(n,n′)=|n-n′|
wherein λ denotes the cross entropy loss weight, and n and n' denote the numeric values obtained by removing the units from y and y', respectively;
s3, calculating the quantitative and morphological word class data by adopting a numerical loss function, wherein the expression is as follows:
Figure FDA0003953595610000012
n i =MLP(h i )
w i =W*one_hot(x i )
wherein x is i And h i Respectively representing the numerical value of the ith position to be predicted and the hidden layer tensor of the position after the position passes through a pre-training model; one _ hot denotes a one-hot encoding algorithm, W denotes a learnable parameter, W i Is expressed according to x i A weight tensor obtained by the type; MLP represents a multi-layer perceptron network; v. of i Denotes x i The normalized value, δ, represents the sigmoid activation function.
3. The medical examination pre-training model construction method according to claim 2, wherein calculating the quantifier and adjective class data with the numerical loss function comprises the following steps:
step I01, feeding the quantifier and adjective class data to be predicted into a large model, which outputs a one-dimensional value; the large model is a BERT pre-training model or an MC-BERT pre-training model;
step I02, subtracting the one-dimensional value from the normalized quantifier and adjective class data to be predicted, and taking the absolute value;
step I03, multiplying the absolute value by the weight tensor w_i corresponding to the predicted quantifier and adjective class data, and passing the product through the sigmoid function;
step I04, repeating steps I01 to I03 until every quantifier and adjective class datum to be predicted has been processed;
and step I05, summing the outputs of the sigmoid function.
4. The method as claimed in claim 1, wherein each Encoder layer of the MedExBERT model is composed of a multi-head self-attention layer and a feed-forward neural network connected in sequence, and a Layer Normalization layer is arranged after each sub-layer to perform layer normalization on the residual connection of the sub-layer's input and output.
5. A method for recognizing a medical examination text, characterized in that a network model constructed by the medical examination pre-training model construction method according to any one of claims 1 to 4 is used for recognition.
6. A medical examination pre-training model building device is characterized by comprising:
the training original data analysis module is used for collecting data of a medical examination related report to construct training original data;
the preprocessing module is connected with the training original data analysis module and is used for applying random masking at the entity level and the number level of the training original data to obtain training data;
the model building module is used for building a MedExBERT model by adopting a Transformer Encoder as a basic structure;
the loss function adding module is connected with the model building module and is used for adding a loss calculation function, and the loss calculation function comprises a cross entropy loss function and a numerical value loss function;
and the training module is connected with the preprocessing module, the model building module and the loss function adding module, and trains the MedExBERT model by using the training data to obtain a medical examination pre-training model.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements a medical examination pre-training model construction method as claimed in any one of claims 1 to 4 when executing the computer program.
8. A computer-readable storage medium, storing a computer program, wherein the computer program, when executed by a processor, implements the steps of a medical examination pre-training model construction method according to any one of claims 1 to 4.
CN202211456934.7A 2022-11-21 2022-11-21 Medical examination pre-training model construction method and identification method Pending CN115759073A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211456934.7A CN115759073A (en) 2022-11-21 2022-11-21 Medical examination pre-training model construction method and identification method


Publications (1)

Publication Number Publication Date
CN115759073A (en) 2023-03-07

Family

ID=85333690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211456934.7A Pending CN115759073A (en) 2022-11-21 2022-11-21 Medical examination pre-training model construction method and identification method

Country Status (1)

Country Link
CN (1) CN115759073A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117370525A (en) * 2023-10-20 2024-01-09 厦门狄耐克物联智慧科技有限公司 Intelligent diagnosis guiding method based on fine tuning large model



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination