CN116797848A

CN116797848A - Disease positioning method and system based on medical image text alignment

Info

Publication number: CN116797848A
Application number: CN202310847723.4A
Authority: CN
Inventors: 白亮; 李佳敏; 杜航原
Original assignee: Shanxi University
Current assignee: Shanxi University
Priority date: 2023-07-12
Filing date: 2023-07-12
Publication date: 2023-09-22

Abstract

The invention discloses a disease positioning method and system based on medical image text alignment, and belongs to the technical field of artificial intelligence and digital medical treatment. According to the disease positioning method based on the text alignment of the medical image, a disease positioning network model is trained by utilizing the multi-granularity alignment relation between the medical image and the text report, so that the words describing the disease in the text are effectively corresponding to the focus area in the medical image, and the accuracy of disease positioning is improved. The disease positioning system based on medical image text alignment includes a medical picture text data storage unit; a medical picture text data preprocessing unit; a disease positioning network model training unit; a disease positioning output unit; the disease positioning system based on the text alignment of the medical pictures visualizes the disease positioning result, helps doctors to check focus positions more conveniently, and enables doctors to formulate more accurate treatment schemes and improve diagnosis efficiency.

Description

Disease positioning method and system based on medical image text alignment

Technical Field

The invention relates to the technical field of artificial intelligence and digital medical treatment, in particular to a disease positioning method and system based on medical image text alignment.

Background

The real world is composed of multiple modes such as sound, vision, smell and language, among which, the visual mode and the language mode are two most important modes, and in the medical field, a large amount of unmarked electronic multi-mode data including medical images and paired text reports are generated every day, thus providing data support for researchers to realize computer-aided diagnosis. However, the following challenges are presented in conducting medical data analysis: (1) Understanding of medical images relies heavily on manual labeling by experts, while most of the data in reality is unlabeled data. (2) The differences in lesions in medical images are very small and it is difficult to capture critical information during analysis. (3) The technical terms in the text report are difficult to understand, a certain field knowledge is needed, the writing styles of different institutions are different, and generalization across institutions is difficult. This makes it difficult for researchers to directly utilize medical images and text report data in a supervised manner. Therefore, researchers begin to learn the alignment relationship between the medical image and the text report in a self-supervising manner, thereby improving the utilization rate of the medical image and the text report, reducing the cost of manual labeling, and enabling researchers to build a more accurate and powerful model for medical image analysis.

There are three granularity alignment relationships between medical images and text, namely, instance level, disease level, region level. Specifically, instance-level alignment is to learn semantic associations between medical images and text from the same patient. The registration of the disease level learns semantic associations between medical images and text from different patients, but describing the same disease. The alignment at the region level aims to correspond each word in the text with each region of the image. The disease positioning network model is trained by utilizing the alignment relation between the medical image and the text, after the disease positioning network model is trained, the model can be applied to disease positioning, the words describing the disease in the text are corresponding to the focus area in the medical image, misdiagnosis and missed diagnosis of doctors are greatly reduced, doctors are assisted to make diagnosis, and diagnosis efficiency is improved. In addition, the disease positioning result can also provide diagnosis basis for doctors, which is beneficial for doctors to draw accurate treatment schemes.

Recently, the methods for performing text alignment of medical images in a self-supervising manner are: zhang et al in Contrastive learning ofmedical visual representations from paired images andtext propose a ConVIRT method, and simultaneously use paired information composed of medical image data and medical text data to perform two-way contrast learning, and learn an example-level alignment relationship between a medical image and text. Boecking et al, making the most oftext semantics to improve biomedical vision-language processing, disclose a BioViL method that proposes a text encoder CXR-BERT that is more suitable for radiological reporting, and randomly mask reconstructing the text. However, since these methods do not consider the fine-grained alignment between the medical image and the text report, it is impossible to correspond the word describing the disease to the lesion area, and there is a limitation in capturing a minute difference, resulting in poor effect of disease localization.

Huang et al, gloria, A multimodal global-local representation learning framework for label-efficient medical image recognition, propose a GLoRIA method, their purpose being to learn a local vector representation by comparing the attention weighted areas of medical images with corresponding textual descriptions. However, one limitation of this approach is that when a long sentence contains many words that are not related to the disease, or a medical image has very small pathological features, the alignment result may be dominated by irrelevant information, resulting in an inability to precisely locate the disease-related region. Chen et al, multi-modal masked autoencoders for medical vision-and-lan pre-trackingM is proposed herein ³ AE method, randomly masking a part of image blocks, and restoring with the rest of visible image blocks and text information; conversely, a portion of the words in the text are randomly masked and restored with the remaining visible words, as well as the image information. The patent with publication number of CN114972929A discloses a method and a device for pre-training a medical multi-modal model, which fully capture the associated information of a medical image and multi-granularity text by using a masking method and cross-modal image-text contrast learning, thereby improving the model learning accuracy and efficiency. Although these methods may achieve regional level alignment to some extent, because of the randomness of their masks, terms or image blocks that are irrelevant to the disease are selected, resulting in less accurate disease localization.

In summary, the disease localization can be achieved using the medical image text alignment method. However, existing methods typically utilize only a portion of the alignment, and do not integrate multiple granularity alignments between medical image text, which can provide complementary semantic information for disease localization, helping to map text to medical images. Furthermore, the area level alignment method is generally used to mask in a random manner, but such a method may mask information not related to a disease, such as: prepositions in the text and background in the medical image, etc., result in positional deviations in the text being located on the medical image. Therefore, how to align semantic information of "pneumonia" in text form and "pneumonia" in picture is a key problem for improving the positioning accuracy of medical image diseases.

Disclosure of Invention

Aiming at the problems, the invention aims to provide a disease positioning method and a disease positioning system based on medical image text alignment by utilizing the multi-granularity alignment relation between medical image texts. The method and the system can correspond the words describing the diseases in the text report with the focus areas in the medical images, so that the labeling cost of the medical images is reduced, and the accuracy of disease positioning is improved.

In order to achieve the above purpose, the present invention provides the following technical solutions:

the invention provides a disease positioning method and system based on medical image text alignment. Firstly, preprocessing a historical medical image text data set, then, amplifying the preprocessed historical medical image text data set, then, constructing a disease positioning network model, training the disease positioning network model by using the amplified historical medical image text data set, and finally, performing disease positioning on medical images and texts of newly-diagnosed patients by using the trained disease positioning network model. The main parameters related to the invention include: batch size, optimizer parameters, training wheel number of the model, and temperature parameters, wherein the batch size is used for controlling the sample number of each training input in the class level alignment process; the optimizer parameters are used for adjusting model weights and deviations, and comprise an optimizer type, an initial learning rate and a learning rate attenuation method; the training round number of the model is used for setting the ending condition of model training; the temperature parameter is used to adjust the similarity measure between samples during class level alignment.

A disease localization method based on medical image text alignment, comprising the steps of:

S1, preprocessing a historical medical image text data set, and forming a disease type soft label of a medical image and a disease type soft label of a text by utilizing an automatic labeler, wherein the method specifically comprises the following steps of:

s11, processing a historical medical image text data set. For improving the quality of text information, for a certain medical image I _i Corresponding text report, the method only extracts key information, such as case description and diagnosis result, cuts the text report into single sentences and removes sentences with less than three words to form a new text data seti represents the ith report in the medical image text data set, and t represents the number of sentences cut by the report. During training, data are loaded from the medical image I each time _i Corresponding text data set->Is selected randomly from a sentence +.>The initial sample pair that forms the model training process is denoted as { (I) ₁ ,T ₁ ),(I ₂ ,T ₂ ),…,(I _n ,T _n ) (I) where n is the total sample size _i ,T _i ) Representing the ith pair of medical images, text samples, I _i Representing the ith medical image sample, T _i Representing an ith text sample;

s12, utilizing the automatic labeler to carry out on the text data set obtained in S12Extracting medical entity and disease category soft labels, and aggregating the disease category soft labels of each sentence to form a disease category soft label of the corresponding medical image >Sentence T selected in S11 _i The corresponding disease category soft tag is marked +.>The set of medical entities is->E is sentence T _i The number of medical entities extracted.

S2, the medical image and text data samples in the medical image text data set in the step S1 are amplified to obtain medical image amplified samples and text amplified samples, and the method specifically comprises the following steps:

s21, a method for training model parameters by using small batch data is adopted. Specifically, the sample batch size is set as B, namely each round of training is divided into S=n/B steps for iteration, B pairs of medical images and text samples are input simultaneously in each step of training process, and a small batch of medical image samples I and a small batch of text samples T are obtained;

s22, carrying out medical image augmentation on small-and-medium-batch medical image samples I in the step S21, and sequentially carrying out random cutting, scaling, horizontal overturning, random color dithering, data format conversion and normalization on the medical image samples;

s23, performing text augmentation on small-batch text samples T in the step S21, sequentially performing synonym replacement, random word replacement and random word deletion on the text samples, and enabling the length to exceed N _len The sentences of the sub-words are truncated, the word segmentation is carried out on the text by using a word segmentation device in the CXR-BERT model, and the text data are converted into tensor data which can be input into a disease positioning network model.

S3, constructing each network module in a disease positioning network model, wherein the disease positioning network model comprises five network modules, namely an image coding network, a text coding network, an image projection head network, a text projection head network and a multi-mode coding network, the whole structure of the disease positioning network model is shown in figure 1, and the disease positioning network model specifically comprises the following steps:

s31, constructing an image coding network by using a Swin-transducer model. Specifically, the image coding network is composed of twelve layers of transformers, each layer of transformers is composed of a self-attention layer and a feedforward neural network layer in sequence;

s32, constructing a text coding network by using the CXR-BERT model. Specifically, the text encoding network is composed of twelve layers of transformer blocks, each layer of transformer blocks is composed of a self-attention layer and a feedforward neural network layer in sequence;

s33, constructing an image projection head network g _x (. Cndot.) the use of a catalyst. Specifically, the image projection head network is a single linear layer, which can be denoted as g _x (·)＝W _x (. Cndot.) wherein W _x Trainable parameters in the image projection head network;

s34, constructing a text projection head network g _w (. Cndot.) the use of a catalyst. In particular, the text projection head network is a single linear layer, which can be denoted as g _w (·)＝W _w (. Cndot.) wherein W _w Is a trainable parameter in a text projection head network;

S35, constructing a multi-mode coding network by using the CXR-BERT model. Specifically, the multi-mode coding network is a double-flow transducer structure, and each layer of transducer block sequentially comprises a self-attention layer, a cross-attention layer and a feedforward neural network layer, and six layers of transducer blocks are all arranged.

S4, training E the disease category soft label of the medical image obtained in the step S1, the disease category soft label of the text and the medical image augmentation sample in the step S2 and the text augmentation sample to the disease positioning network model in the step S3 through an Adam optimizer _p The method comprises the steps of determining disease positioning network model parameters by taking single-step training as an example by using B pairs of medical images and text augmentation samples for each step of training, and specifically comprises the following steps:

s41, performing vector representation extraction operation on the medical image augmentation sample and the text augmentation sample in the step S2. Specifically, given a medical image I _i The medical image is divided into N blocks, and a [ CLS ] is additionally added]the token is used for learning the whole information of the medical image to form an image token sequence X _i ＝{x _i(cls) ,x _i(1) ,x _i(2) ...,x _i(N) Inputting them into image coding network to obtain medical image vector representation D is the dimension of the medical image vector representation. Likewise, given a sentence of text T _j Firstly, a word segmentation device in a CXR-BERT model is used for dividing text information into M sub words, and then a [ CLS ] is additionally added]the token is used for learning the whole information of the text to form a text token sequence W _j ＝{w _j(cls) ,w _j(1) ,w _j(2) ...,w _j(M) Inputting them into a text encoder to obtain a text vector representation +.> D is the dimension of the text vector representation. Medical image to be obtainedVector representation and text vector representation input to a multi-modal encoding network to obtain a multi-modal medical image vector representation +.> Multimodal text vector representation-> The text-to-medical image attention score produced at the fifth layer of the multi-modality encoding network is noted attn;

s42, calculating semantic similarity by using the disease type soft label of the medical image obtained in the step S1 and the disease type soft label of the text, wherein the definition is shown in formulas (1) and (2):

wherein ,x-w is image-to-text, w-x is text-to-image, and B is sample batch size;

vector representation similarity is calculated using the medical image vector representation and the text vector representation obtained in step S41, and is defined as shown in equations (3), (4):

wherein ,b is the sample batch size, and τ is a temperature parameter;

S43, calculating cross entropy loss by using the semantic similarity and the vector representation similarity obtained in the step S42, wherein the definition is shown in formulas (5) and (6):

the loss of alignment at the disease level is defined as (7):

s44, according to semantic similarity S _ij For medical image I _i Sampling a text from a different instance is a difficult case, the greater the similarity, the greater the probability of being sampled. Likewise, the text T _j Sampling a medical image is a difficult example. For a batch of medical image text, a total of N can be obtained _itm The image text pairs (including B positive and 2B negative pairs) are used for example level alignment. Calculating an instance level alignment loss using the multimodal medical image vector representation and the multimodal text vector representation obtained in S41, as shown in (8):

wherein ,representing a result of predicting whether the ith pair of medical image texts comes from the same instance, < >>Representing whether the ith pair of medical image texts come from the real label of the same instance, and sigma represents a Sigmoid function;

s45, utilizing the medical entity set obtained in the step S12Comparing sentences T one by one _i Whether or not each subword of (a) is associated with +.>The medical entities in the list are the same, if the medical entities are the same, the word is masked and special characters are used]Instead of. Masking text- >And complete medical image I _i Inputting a model, and calculating the reduction loss of the text as shown in a formula (9):

wherein ,w_i(j) Representing sentence T _i The j-th masked subword of (M) _w The number of subwords representing the mask,representing the sub-words which are not masked, wherein θ is a disease positioning network model parameter;

s46, in order to make up for the semantic gap between the medical image and the text, the method uses a high-order vector representation instead of a low-order original pixel to promote alignment between the medical image and the text when masking and predicting the medical image. Specifically, using the attention score attn obtained in S41, N is selected from high to low _x Masking the image blocks to obtain vectors of the tokenThe representation is initialized to 0 and the vector representations of all image blocks are aggregated to obtain the image CLS]Vector representation of tokenRespectively for image block vector representation sum [ CLS ]]Vector representation of token the calculated vector represents the reduction loss and classification loss as shown in equations (10), (11):

wherein ,vector representation representing an unmasked image block, for example>A label representing a predicted disease category, f _m (. Cndot.) represents a linear classification layer, (. Cndot.)>A disease-type soft tag representing the medical image obtained in S13. Thereby, the alignment loss at the region level is obtained, and the definition is as shown in the formula (11):

Calculation of Single step Total training lossTotal training loss by training network parameters>Defined as formula (12):

s47, repeating the steps S41-S46 in an iterative mode, and training a disease positioning network model E _p And (3) updating the parameters of the disease positioning network model in a random gradient descending mode by using an Adam optimizer in turn, and determining the parameters of the disease positioning network model after training is completed.

S5, extracting vector representation of medical images and text data of a newly-diagnosed patient by using the disease positioning network model trained in the step S4, and performing disease positioning by using the vector representation, wherein the method specifically comprises the following steps of:

s51, sequentially performing zooming, data format conversion and normalization augmentation operation on medical images of new patients to be diagnosed, and performing word segmentation on texts by using a word segmentation device in a CXR-BERT model to obtain a text with a length exceeding N _len Cutting off the text of the sub word to obtain a medical image and a text augmentation sample;

s52, performing vector representation extraction operation of the step S41 on the medical image and the text augmentation sample in the step S51 by using the disease positioning network model trained in the step S47 to obtain a medical image and text augmentation sample vector representation;

s53, disease localization is performed by using the medical image and the text augmentation sample vector representation in the step S52. Specifically, text [ CLS ] is calculated using cosine similarity ]And (5) obtaining a disease positioning result by the similarity between the token and the image block token. To measure the effectiveness of disease localization, the present invention uses a contrast to noise ratio CNR index that measures the interior X and exterior of a bounding boxThe higher the CNR value, the more accurate the positioning result, the more accurate the CNR is, as defined in formula (13):

wherein ,μ_X and σ_X Mean the average and variance of similarity values in the X region, and />Is->Mean and variance of the similarity in (c).

S54, visualizing by using the similarity matrix obtained in the step S53, wherein Grad-CAM is used in the invention. Specifically, the higher the similarity of text to an image block, the higher the focus of the text on that image block, and the darker the color on the image.

The invention also provides a disease positioning system based on the medical image text alignment, which is used for realizing the disease positioning method based on the medical image text alignment, and comprises a computer processor, a memory and a graphic processor; a medical image text data storage unit; a medical image text data preprocessing unit; a disease positioning network model training unit; and a disease positioning output unit.

Further, the medical image text data storage unit loads medical image text data into a computer memory; the medical image text data preprocessing unit extracts medical image text samples with batch size B from the memory step by step, performs image and text augmentation preprocessing in step S2 to obtain medical image augmentation samples and medical text augmentation samples, and loads the medical image augmentation samples and the medical text augmentation samples into the graphic processor; the disease positioning network model training unit uses the medical image augmentation sample and the medical text augmentation sample to execute the steps S3-S4 in the graphic processor to train the disease positioning network model; using the disease positioning network model after training, the disease positioning output unit executes step S5 in the graphic processor to obtain a disease positioning result; specific data processing and computing work in all units is done by the computer processor.

Compared with the prior art, the invention has the beneficial effects that:

1. the disease positioning method based on the text alignment of the medical images, which is designed by the invention, utilizes the soft labels of disease categories to learn the alignment relation between the medical images and the text to learn the disease level. The method can enable the whole representation of the semantically similar images and texts to be closer, and improve the understanding capability of the model, thereby effectively improving the quality of the representation of the medical image text data vector and the utilization rate of the medical image text data;

2. the invention provides a novel difficult negative sampling strategy for a disease positioning method based on medical image text alignment. The strategy can help medical images and texts learn example-level alignment relations, better distinguish medical images and texts with the same disease category but different disease positions and severity degrees, and provide more accurate disease positioning results.

3. The disease positioning method based on medical image text alignment introduces a new multi-mode masking strategy, and only masks image blocks and subwords related to diseases in medical images and texts. The multi-mode mask strategy can be more concentrated on information related to diseases in the medical image and the text, the capability of the disease positioning network model for focusing on key information in the medical image and the text is enhanced, focus areas described by the text are accurately positioned, and the accuracy of disease positioning is improved.

Drawings

FIG. 1 is a diagram of a disease localization network model in a disease localization method based on medical image text alignment according to the present invention.

Fig. 2 is a block diagram of a disease positioning module in the disease positioning method based on text alignment of medical images according to the present invention.

FIG. 3 is a block diagram of a computer implemented system for a method of disease localization based on text alignment of medical images according to the present invention.

Fig. 4 is a flowchart of a disease localization method based on medical image text alignment according to the present invention.

Detailed Description

The disease positioning method based on the medical image text alignment is implemented by a computer program, and a system structure diagram realized by a computer is shown in fig. 3. The following describes the technical solution of the present invention in detail with reference to the model structure diagram shown in fig. 1 and 2 and the method flowchart in fig. 4, taking MIMIC-CXR data as an example, the data set is a commonly used large-scale medical data set, and comprises 377111 chest X-ray images and 201063 related radiology text reports. The implementation mainly comprises the following key contents:

s1, preprocessing a historical medical image text data set, and forming an image disease type soft label and a text disease type soft label by utilizing an automatic labeler:

S11, processing the medical image text data set. In the medical image text dataset, the medical image is an X-ray chest image of the patient and the text is a text report describing the X-ray chest image. For improving the quality of text information, for a certain medical image I _i Corresponding text information, the method extracts only key information such as case descriptions (documents), diagnosis results (expressions), cuts the text into single sentences and removes sentences with less than three words to form a new text data set i represents the ith report in the medical image text data set, and t represents the number of sentences cut by the report. During training, each time data is loaded, a sentence is randomly selected from a text data set corresponding to the medical image in the data set>The initial sample pair that forms the model training process is denoted as { (I) ₁ ,T ₁ ),(I ₂ ,T ₂ ),…,(I _n ,T _n ) (I) where n is the total sample size _i ,T _i ) Representing the ith pair of medical images, text sample pairs, I _i Representing the ith medical treatmentImage sample, T _i Representing an ith text sample;

s12, utilizing the automatic labeler to carry out on the text data set obtained in S12And extracting the medical entity and disease category soft labels. Specifically, the method uses the Chexpert labeler to extract 14 predefined medical entities, obtains a multi-hot vector for each sentence to represent disease category, and then aggregates disease category soft labels of each sentence to form disease category soft labels of corresponding medical images >Sentence T selected in S12 _i The corresponding disease category soft label is marked asThe set of medical entities is->E is sentence T _i The number of medical entities extracted. Thus, for a batch of image text +.>Is->Its dimension is [64,14 ]]。

S2, the medical images and text samples in the medical image text dataset in the step S1 are amplified to obtain image amplified samples and text amplified samples, and the method specifically comprises the following steps:

s21, the invention adopts a method for training model parameters by using small batch data. Specifically, the invention sets the batch size in the model training process to 64, namely each round of training is performed in 5892 steps of iteration, and 64 pairs of medical images and text samples are used in each step of training process;

s22, performing image augmentation on the small-batch medical image samples I in the step S21. Specifically, for each medical image sample, scaling the original image to 256×256 is performed in sequence, and the image is cut randomly to 224×224 standard images; performing horizontal mirror image overturning operation with the parameter of 0.5 on the standard image; random color dithering operation with a parameter of 0.2; random affine operation with a degree range of 10, a maximum absolute offset of 0.0625, and a scale factor interval of (0.8,1.1); finally, converting the processed image into tensor data with the dimension of [64,3,224,224] and carrying out normalization operation with the mean value of 0.5862 and the variance of 0.2795; where 64 is the batch size, 3 is the color channel of the image, [224,224] is the image size;

S23, performing text augmentation on small-batch text samples T in the step S21. Specifically, for a text sample, synonym replacement, random replacement of words and random deletion of words are performed on sentences by using a text library. Then the word segmentation device in CXR-BERT is used for segmenting the sentence into M sub words, and index numbers in the word list are used for replacing the sub words; cutting off sentences with the lengths of 77 subwords; setting the maximum subword length in the same sentence as M _max The length in the same batch is less than M _max The sentences of (2) are complemented into the same length by 0; finally, the processed text is converted into a text with the dimension of [64, M _max ]For use by a text encoder network;

s3, constructing each module in the disease positioning network model, wherein the overall structure of each module is shown in the figure 1, and the method specifically comprises the following steps:

s31, constructing an image encoder network by using a Swin-transducer model. The input to the image encoder network is dimension 64,3,224,224]Wherein 64 is the batch size, 3 is the color channel of the image, [224,224 ]]Is the image size. Its output is dimension [64,50,768 ]]Image tensor representation of (a)Wherein 768 represents the vector dimension of each image token in the output tensor, 50 is the number of image block tokens and the image [ CLS ] ]the sum of the number of the token;

s32, constructing text by using CXR-BERT modelAn encoder network. The input to the text encoder network is the dimension [64, M _max ]Where M is a small batch of text samples _max The output is dimension [64, (M) for the maximum length in the same batch of text _max +1),768]Text tensor representationWhere 768 represents the vector dimension of each text token in the output tensor;

s33, constructing an image projection head network g _x (. Cndot.) the use of a catalyst. Specifically, an image projection head network g _x (. Cndot.) is a single linear layer, which can be expressed as g _x (·)＝W _x (. Cndot.) wherein W _x For trainable parameters in an image projection head network, the input dimension is [64,768 ]]Output dimension is [64,512 ]]；

S34, constructing a text projection head network g _w (. Cndot.) the use of a catalyst. Specifically, a text projection head network g _w (. Cndot.) is a single linear layer, which can be expressed as g _w (·)＝W _w (. Cndot.) wherein W _w For trainable parameters in a text projection head network, the input dimension is [64,768 ]]Output dimension is [64,512 ]]；

S35, constructing a multi-mode coding network by using the CXR-BERT model. Specifically, word vectors are first multiplied by three matricesThree new vectors Q are obtained _w 、K _w 、V _w The image block vectors are co-processed to obtain Q _x 、K _x 、V _x . For medical images, the self-attention layer may be expressed as +.>While the cross-attention layer may be denoted +. >Whereas for text the self-attention layer can be expressed as +.> While the cross-attention layer may be denoted +.>The input and output dimensions of the multi-mode coding network at the image end are [64,50,768 ]]The input and output dimensions of the multi-mode coding network at the text end are 64, (M) _max +1),768]。

S4, training the disease positioning network model in the step S3 for 10 rounds by using the disease type soft label of the medical image and the disease type soft label of the text obtained in the step S1 and the medical image augmentation sample and the text augmentation sample in the step S2 through an Adam optimizer to determine the parameters of the disease positioning network model, wherein each step of training uses 64 pairs of medical images and the text augmentation sample, taking single step training as an example, and the method comprises the following specific steps:

s41, performing vector representation extraction operation on the medical image augmentation sample and the text augmentation sample in the step S2. Specifically, given a medical image I _i The image is first divided into 49 blocks, each block having a size of [32,32 ]]An additional [ CLS ] is added]token, input them into image coding network to obtain dimension [50,768 ]]Is a vector representation of (c). For a batch of images (64), a dimension of [64,50,768 ] can be obtained]Is a small batch image tensor representation of (c). Likewise, given a sentence of medical text information T _j Firstly, a word segmentation device is used for dividing text information into M sub words, and then a [ CLS ] is additionally added]token, input them into text encoder to obtain dimension [ (M+1) 768]Is a vector representation of (c). For a batch of text (64), a dimension of [64, (M) can be obtained _max +1),768]Is a small batch of text tensor representations. Inputting the obtained medical image representation and text representation into a multi-modal encoding network to obtain tensor dimension as [64,50,768 ]]Multi-modal medical image representation of (a)And [64, (M) _max +1),768]Multi-modal text table of (a)Show->The text-to-medical image attention score produced at the fifth layer of the cross attention layer through the multimodal encoding network is noted attn, with a dimension of [64,49 ]]；

S42, as shown in FIG. 1, for a batch of pairs of medical image text (64 pairs), the disease category soft tag of the image obtained in S1 is usedDisease category soft tag of text->The semantic similarity is calculated, and the definition is shown as formulas (1) and (2):

wherein ,x-w is image-to-text, w-x is text-to-image, and B is sample batch size; similarly, vector representation similarity is calculated using the batch of medical image representations and text representations obtained in S41, defined as shown in equations (3), (4):

wherein ,image projection network g _x (. Cndot.) input image [ CLS ] ]Vector representation dimension [64,768 ]]The output image vector represents the dimension of [64,512 ]]The method comprises the steps of carrying out a first treatment on the surface of the Text projection network g _w Text entered [ CLS ]]Vector representation dimension [64,768 ]]The output text vector represents the dimension of [64,512 ]]τ is a learnable temperature super parameter, initialized to 0.07;

s43, calculating cross entropy loss by using the semantic similarity and the vector representation similarity obtained in the S42, wherein the loss definition is shown in formulas (5) and (6):

the loss of alignment at the disease level is defined as (7):

s44, according to semantic similarity S _ij For each medical image I _i Sampling a text from a different instance is a difficult negative example, the greater the similarity, the higher the probability of sampling. Likewise, for each text T _j Sampling an image is a difficult example. A total of 192 image text pairs (including 64 positive and 128 negative pairs) may be obtained for a batch of image text for example level alignment. Calculating an instance-level alignment loss using the multimodal medical image representation and the multimodal text representation obtained in S41, as shown in (8):

wherein ,indicating whether the predicted ith pair of image texts comes from the same instance result +.>Representing whether the ith pair of image texts come from the real labels of the same instance, and sigma represents a Sigmoid function;

S45, utilizing the medical entity set obtained in the step S12Comparing sentences T one by one _i Whether or not each subword of (a) is associated with +.>The medical entities in the list are the same, if the medical entities are the same, the sub-words are masked and special words are used [ MASK ]]Instead of. Masking text->And complete image I _i Inputting a model, and calculating text reduction loss as shown in a formula (9):

wherein ,w_i(j) Representing sentence T _i The j-th masked subword of (M) _w The number of subwords representing the mask,representing the subwords which are not masked, θ being a model parameter;

s46, in order to make up for the semantic gap between the medical image and the text, the method uses a high-order vector representation instead of a low-order original pixel to promote alignment between the medical image and the text when masking and predicting the image. Specifically, using the attention score attn obtained in step S41, the vectors of these image blocks are represented as the first time by masking (i.e., 37) the image blocks of which 75% are selected from high to lowInitialized to 0 and vector representations of all image blocks are aggregated to obtain an image [ CLS ]]Vector representation of tokenRespectively for image block vector representation sum [ CLS ]]Vector representation of token the calculated vector represents the reduction loss and classification loss as shown in equations (10), (11): / >

wherein ,representing image I _i Vector representation of non-masked image blocks of a dimension [12,768]，Disease category label representing predicted image, f _m (. Cndot.) is a linear classification layer with input dimension of [64,768 ]]The output dimension is [64,14 ]]，/>A disease category soft tag representing the image obtained in S12;

thereby, the region-level alignment loss is obtained as shown in the formula (11):

s47, repeating the steps S41-S46 in an iterative mode, training the disease positioning network model 10 for a round, and updating the disease positioning network model parameters in a random gradient descent mode by using an Adam optimizer. Specifically, the weight decay rate of Adam optimizer is set to 0.02. In the first 1000 iterations, the learning rate is increased to 1e ^-4 Then follow cosine annealing, decay to 1e ^-5 . After training, determining parameters of the disease positioning network model.

S5, as shown in FIG. 2, extracting vector representations of medical images and texts of the newly-diagnosed patient by using the disease positioning network model obtained in the step S4, and performing disease positioning by using the vector representations;

s51, scaling a medical image of a newly-diagnosed patient to 224 multiplied by 224, converting the processed image into tensor data with the dimension of [64,3,224,224] and carrying out normalization operation with the mean value of 0.5862 and the variance of 0.2795, wherein 64 is the batch size, 3 is the color channel of the image, [224,224] is the image size, segmenting a text by using a word segmentation device in a CXR-BERT model, and cutting off the text with the length of more than 77 sub words to obtain a medical image and a text augmentation sample;

S52, carrying out vector representation extraction operation of the step S41 on the medical image and the text augmentation sample of the patient in the step S51 by using the disease positioning network model obtained in the step S47, so as to obtain vector representation of the image and the text augmentation sample;

s53, disease localization is performed by using the patient medical image and the text augmentation sample vector representation in the step S52. Specifically, text [ CLS ] is calculated using cosine similarity]token and image block token (divided image [ CLS ]]Outside token) similarity, the similarity matrix dimension is [1,49 ]]The dimension of the similarity matrix is converted to [7,7]Obtaining disease location result, measuring effectiveness of disease locationThe present invention uses a contrast-to-noise ratio CNR indicator that measures the interior X and exterior of a bounding boxThe difference of similarity values between the two, the higher the CNR value is, the more accurate the positioning result is, the invention compares with ConVIRT, bioViL, GLoRIA and MedCLIP four methods on an MS-CXR data set, the MS-CXR data set is randomly sampled from MIMIMI-CXR and the bounding box and descriptive text are marked by radiologists, the CNR is defined as formula (13):

wherein ,μ_X and σ_X Mean the average and variance of similarity values in the X region, and />Is->Average and variance of the similarity, and disease localization results are shown in table 1;

TABLE 1 disease localization results

As shown in fig. 3, a disease positioning system based on medical image text alignment is used for implementing the disease positioning method based on medical image text alignment, and comprises a computer processor, a memory and a graphics processor; a medical image text data storage unit; a medical image text data preprocessing unit; and the disease positioning network model training unit and the disease positioning output unit. The medical image text data storage unit loads medical image text data into a memory of a computer; the medical image text data preprocessing unit extracts medical image text samples with batch size from the memory step by step, and executes the step S2 to carry out image and text augmentation to obtain medical image augmentation samples and medical text augmentation samples, and the medical image text augmentation samples are loaded into the graphic processor; the disease positioning network model training unit uses the medical image augmentation sample and the medical text augmentation sample to execute the steps S3-S4 in the graphic processor, and determines the parameters of the disease positioning network model; using the trained disease positioning network model, the disease positioning output unit executes step S5 in the graphic processor to obtain a required disease positioning result; specific data processing and computing work in all units is done by the computer processor.

Finally, the foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the invention, and various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A disease positioning method based on medical image text alignment, comprising the steps of:

s11, processing a historical medical image text data set, wherein in order to improve the quality of text information, a certain medical image I is processed _i Corresponding text report, extracting only key information, segmenting the text report into single sentences and removing sentences of less than three words to form a new text data setWherein I represents the ith report in the text data set of the medical image, t represents the number of sentences segmented by the report, and each time the data is loaded from the medical image I during training _i Corresponding text data set- >Is selected randomly from a sentence +.>The initial sample pair that forms the model training process is denoted as { (I) ₁ ,T ₁ ),(I ₂ ,T ₂ ),…,(I _n ,T _n ) Wherein n is the total sample size, (I) _i ,T _i ) Representing the ith pair of medical images, text samples, I _i Representing the ith medical image sample, T _i Representing an ith text sample;

s12, utilizing the automatic labeler to carry out on the text data set obtained in S11Extracting a medical entity set and disease category soft labels, and aggregating the disease category soft labels of each sentence to form a disease category soft label of a corresponding medical image>The sentence T selected in the step S11 _i The corresponding disease category soft tag is marked +.>The set of medical entities is denoted epsilon _i ＝{e _i1 ,e _i2 ,…,e _iE E is sentence T _i The number of the medical entities extracted;

s21, setting the size of a sample batch as B by adopting a method for training model parameters by small batch data, namely, each round of training is divided into S=n/B steps for iteration, and B pairs of medical images and text samples are simultaneously input in each step of training process, so as to obtain a small batch of medical image samples I and small batch of text samples T;

s23, performing text augmentation on small-batch text samples T in the step S21, sequentially performing synonym replacement, random word replacement and random word deletion on the text samples, and enabling the length to exceed N _len Cutting off sentences of the sub words, and using a word segmentation device in the CXR-BERT model to segment the text, so as to convert the text data into tensor data which can be input into a disease positioning network model;

s3, constructing each network module in the disease positioning network model, which specifically comprises the following steps:

s31, constructing an image coding network by utilizing a Swin-transducer model; the image coding network consists of twelve layers of transformation blocks, and each layer of transformation block consists of a self-attention layer and a feedforward neural network layer in sequence;

s32, constructing a text coding network by using the CXR-BERT model; the text coding network consists of twelve layers of transformation blocks, and each layer of transformation block consists of a self-attention layer and a feedforward neural network layer in sequence;

S33, constructing an image projection head network g _x (. Cndot.); the image projection head network is a single-layer linear layer, denoted as g _x (·)＝W _x (. Cndot.) wherein W _x Trainable parameters in the image projection head network;

s34, constructing a text projection head network g _w (. Cndot.); the text projection head network is a single linear layer, denoted g _w (·)＝W _w (. Cndot.) wherein W _w Is a trainable parameter in a text projection head network;

s35, constructing a multi-mode coding network by using a CXR-BERT model; the multi-mode coding network is of a double-flow converter structure, each converter block sequentially comprises a self-attention layer, a cross-attention layer and a feedforward neural network layer, and six converters are used in total;

s4, training E on the disease positioning network model in the step S3 through an Adam optimizer by using the disease type soft tag of the medical image obtained in the step S1, the disease type soft tag of the text and the medical image augmentation sample and the text augmentation sample in the step S2 _p The method comprises the following specific steps of taking single-step training as an example, determining disease positioning network model parameters, wherein each step of training uses B pairs of medical images and text augmentation samples:

s41, performing vector representation extraction operation on the medical image augmentation sample and the text augmentation sample in the step S2; given a medical image I _i The medical image is divided into N blocks, and a [ CLS ] is additionally added]the token learns the whole information of the medical image and forms an image token sequence X _i ＝{x _i(cls) ,x _i(1) ,x _i(2) ...,x _i(N) Inputting them into image coding network to obtain medical image vector representationWherein D is the dimension of the medical image vector representation; given a sentence of text T _j Firstly, a word segmentation device in a CXR-BERT model is used for dividing text information into M sub words, and then a [ CLS ] is additionally added]the token is used for learning the whole information of the text to form a text token sequence W _j ＝{w _j(cls) ,w _j(1) ,w _j(2) ...,w _j(M) Input them into text encoder to obtain text vector representationWherein D is the dimension of the text vector representation, and the obtained medical image vector representation and the text vector representation are input into a multi-mode coding network to obtain the multi-mode medical image vector representationMultimodal text vector representationThe text-to-medical image attention score produced at the fifth layer of the multi-modality encoding network is noted attn;

wherein ,b is the sample batch size, and τ is a temperature parameter;

the loss of alignment at the disease level is defined as (7):

s44, according to semantic similarity S _ij For medical image I _i Sampling a text from different instances as a difficult negative example, the greater the similarity, the greater the probability of being sampled, and likewise, the text T _j Sampling a medical image as a difficult example, and obtaining a total N for a batch of medical image texts _itm The image text pairs, i.e., comprising B positive and 2B negative pairs, are used for example level alignment, and example level alignment loss is calculated using the multi-modal medical image vector representation and the multi-modal text vector representation obtained in step S41, as shown in (8):

S45, utilizing the medical entity set epsilon obtained in the step 12 _i Comparing sentences T one by one _i Whether or not each subword of (a) is identical to epsilon _i The medical entities in the list are the same, if the medical entities are the same, the word is masked and special characters are used]Instead, the masked textAnd complete medical image I _i Inputting a model, and calculating the reduction loss of the text as shown in a formula (9):

s46, selecting N from high to low by using the attention score attn obtained in the step S41 _x Masking the image blocks, initializing the vector representations of the token to 0, and aggregating the vector representations of all the image blocks to obtain the image [ CLS ]]Vector representation of tokenRespectively for image block vector representation sum [ CLS ]]Vector representation of token the calculated vector represents the reduction loss and classification loss as shown in equations (10), (11):

wherein ,vector representation representing an unmasked image block, for example>A label representing a predicted disease category, f _m (. Cndot.) represents a linear classification layer, (. Cndot.)>The disease type soft label representing the medical image obtained in step S12, the alignment loss at the area level is obtained, and the definition is as shown in formula (11):

s47, repeating the steps S41-S46 in an iterative mode, and training a disease positioning network model E _p The disease positioning network model parameters are updated in a random gradient descending mode by using an Adam optimizer in turn, and after training is completed, the disease positioning network model parameters are determined;

s53, performing disease localization by using the medical image and the text augmentation sample vector representation in the step S52; computing text [ CLS ] with cosine similarity ]Similarity between token and image block token, obtaining disease localization result, for measuring effectiveness of disease localization, using contrast-to-noise ratio CNR index which measures internal X and external of a bounding boxThe higher the CNR value, the more accurate the positioning result, the more accurate the CNR is, as defined in formula (13):

wherein ,μ_X and σ_X Mean the average and variance of similarity values in the X region, and />Is->Mean and variance of the medium similarity;

s54, visualizing by using the similarity matrix obtained in the step S53; grad-CAM is used, specifically, the higher the similarity of text to an image block, the higher the focus of text on that image block, the darker the color on the image.

2. A disease positioning system based on medical picture text alignment, characterized in that: the disease localization system is used for realizing the disease localization method based on the medical picture text alignment of claim 1, and comprises the following steps: the system comprises a computer processor, a memory, a graphic processor, a medical picture text data storage unit, a medical picture text data preprocessing unit, a disease positioning network model training unit and a disease positioning output unit.

3. A medical picture text alignment based disease location system as claimed in claim 2, wherein: the medical picture text data storage unit loads medical picture text data into a memory of a computer; the medical picture text data preprocessing unit extracts medical picture text samples with batch size B from the memory step by step, performs picture and text augmentation in step S2 to obtain medical picture augmentation samples and medical text augmentation samples, and loads the medical picture augmentation samples and the medical text augmentation samples into the graphic processor; the disease positioning network model training unit uses the medical picture augmentation sample and the medical text augmentation sample to execute the steps S3-S4 in the graphic processor to train the disease positioning network model; using the disease positioning network model after training, the disease positioning output unit executes step S5 in the graphic processor to obtain a disease positioning result; specific data processing and computing work in all units is done by the computer processor.