CN118072898A - Image report generation method and model training method - Google Patents

Image report generation method and model training method

Info

Publication number
CN118072898A
Authority
CN
China
Prior art keywords
text
image
encoder
data
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410199819.9A
Other languages
Chinese (zh)
Inventor
郝泳涛
尹恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN202410199819.9A
Publication of CN118072898A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to a training method for an image report generation model based on multi-modal recognition, which comprises the following steps: acquiring image data and text data matched with the image data; inputting the image data into the image encoder, inputting the text data into the text encoder and the cross-modal text encoder, and performing contrastive learning training and matching training; constructing mask image data and mask text data based on the image data and the text data, and performing multi-modal mask image modeling training and multi-modal mask language modeling training; shifting characters in the text data according to a preset rule to obtain processed text data; and inputting the processed text data into the cross-modal text decoder to perform autoregressive generation training. The invention can improve training efficiency, strengthen the image report generation system's fine-grained semantic understanding of the relevant data, and increase report accuracy.

Description

Image report generation method and model training method
Technical Field
The invention relates to the technical field of computer vision and natural language processing, in particular to an image report generating method and a model training method.
Background
Medical reports contain textual descriptions of medical images written by specialist doctors and play a vital role in medical diagnosis. However, writing medical reports is a highly labor-intensive process: even a professional doctor needs 5-10 minutes to write the report for a single medical image. Automatically generating medical reports with computer technology can therefore greatly reduce the manpower and material cost of this task. With the application and development of deep learning, especially various neural networks, AI-based automatic report generation has advanced significantly in recent years, but existing methods still have several problems and shortcomings:
1. Extraction of fine-grained features is difficult: medical images often contain complex biological details and subtle anomalies. These fine-grained features are critical to accurate diagnosis, but existing models often have difficulty adequately capturing and understanding such information;
2. Reports are not accurate enough: automatically generated reports struggle to provide accurate information for the unique situation of a particular patient. Furthermore, reports often lack the level of detail that is critical for clinical decision making;
3. Challenges in cross-modal understanding: images and texts belong to different data modalities, and the mapping and correspondence between them are complex. Existing models still struggle to effectively convert visual features into accurate and meaningful text descriptions;
4. Limitations and bias of datasets: high-quality medical image datasets are relatively scarce, and their acquisition is often accompanied by privacy and ethical issues. In addition, owing to limitations of the training data, a model may learn biases in the data, affecting its generalization ability and fairness;
5. Interpretability and trustworthiness: in the medical field, the interpretability and credibility of a model are of paramount importance. However, current deep learning models are often regarded as "black boxes" and can rarely provide an interpretable diagnostic basis.
Therefore, there is a need for a training method tailored to medical images and report text that can ameliorate the existing problems without overcomplicating the training model.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an image report generating method and a model training method that can improve training efficiency, improve the fine-grained semantic understanding of an image report generating system on the relevant data, and increase report accuracy.
The technical scheme adopted to solve the technical problem is as follows: an image report generating method based on multi-modal recognition, comprising the following steps:
S0, acquiring target image data;
S1, analyzing the target image data by using an image report generating model to obtain a target report text, wherein the image report generating model comprises the following components:
the image encoder is used for extracting the characteristics of the input image data to obtain image characteristic vectors;
the text encoder is used for extracting the characteristics of the input text data to obtain text characteristic vectors;
the cross-modal image encoder is used for performing feature extraction on the image data, fusing it with the text feature vector during extraction to obtain a fused image feature vector;
the cross-modal text encoder is used for extracting features of the text data and fusing the features with the image feature vectors during extraction to obtain fused text feature vectors;
a cross-modal text decoder to generate report text based on the image feature vector;
The image feature vector, the text feature vector, the fused image feature vector, and the fused text feature vector are used to train the image report generation model.
Further, the image encoder comprises a self-attention layer and a feed-forward propagation layer; the cross-modal image encoder is constructed by placing a second cross-attention layer between the self-attention layer and the feed-forward propagation layer of the image encoder, the second cross-attention layer is connected with the output of the text encoder, and the cross-modal image encoder shares network parameters with the image encoder.
Further, the text encoder comprises a self-attention layer and a feed-forward propagation layer, the cross-modal text encoder is constructed by placing a first cross-attention layer between the self-attention layer and the feed-forward propagation layer of the text encoder, the first cross-attention layer is connected with the output of the image encoder, and the cross-modal text encoder shares network parameters with the text encoder.
Further, the cross-modal text decoder is constructed by replacing the self-attention layer of the cross-modal text encoder with a causal self-attention layer, and the cross-modal text decoder shares network parameters other than the causal self-attention layer with the cross-modal text encoder.
The technical scheme adopted to solve the technical problem is as follows: a training method for an image report generating model based on multi-modal recognition, for training any one of the image report generating models described above, comprising the following steps:
acquiring image data and text data matched with the image data;
inputting the image data into the image encoder, inputting the text data into the text encoder and the cross-modal text encoder, and performing contrastive learning training and matching training;
constructing mask image data and mask text data based on the image data and the text data;
inputting the mask image data and the mask text data into the cross-modal image encoder and the cross-modal text encoder respectively, and performing multi-modal mask image modeling training and multi-modal mask language modeling training;
shifting characters in the text data according to a preset rule to obtain processed text data;
inputting the processed text data into the cross-modal text decoder to perform autoregressive generation training.
Further, the contrastive learning training includes:
analyzing the image data with the image encoder to obtain an image feature vector;
analyzing the text data with the text encoder to obtain a text feature vector;
training the image report generating model based on the vector similarity between the image feature vector and the text feature vector.
Further, the training of the image report generating model based on the vector similarity between the image feature vector and the text feature vector includes:
calculating the vector similarity between the image feature vector and the text feature vector;
and taking the vector similarity of the image feature vector and the text feature vector that are matched with each other as a positive learning sample, and the other vector similarities as negative learning samples, to perform contrastive learning training.
Further, the matching training includes:
analyzing the text data with the cross-modal text encoder to obtain a fused text feature vector;
and training the image report generating model based on the degree of vector matching between the image feature vector and the fused text feature vector.
Further, the multi-modal mask image modeling training includes:
performing image restoration with the cross-modal image encoder according to the mask image data and the text feature vector to obtain image restoration data;
training the image report generation model based on the image similarity between the image restoration data and the image data.
Further, the multi-modal mask language modeling training includes:
performing text restoration with the cross-modal text encoder according to the mask text data and the image feature vector to obtain text restoration data;
training the image report generating model based on the degree of text matching between the text restoration data and the text data.
Advantageous effects
Due to the adoption of the above technical scheme, compared with the prior art, the invention has the following advantages and positive effects: through network parameter sharing among the models, the invention reduces the number of parameters to be tuned, lowers the complexity of training the model, and improves the efficiency of multi-task collaborative training; through multi-task collaborative training, the model can better extract global features and perform multi-modal alignment; and through multi-modal mask modeling training, the fine-grained semantic understanding of the report generation model on medical images and report texts is enhanced, improving the accuracy of the report.
Drawings
FIG. 1 is a system diagram of a first embodiment of the invention as applied to medical report generation;
FIG. 2 is a flow chart of a second embodiment of the present invention;
FIG. 3 is a flow chart of a second embodiment of the invention applied to medical report generation;
FIG. 4 is a diagram of the multi-modal mask modeling training when the second embodiment of the invention is applied to medical report generation.
Detailed Description
The application will be further illustrated with reference to specific examples. It is to be understood that these examples are illustrative of the present application and are not intended to limit the scope of the present application. Furthermore, it should be understood that various changes and modifications can be made by one skilled in the art after reading the teachings of the present application, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.
The first embodiment of the invention relates to an image report generation method based on multi-modal recognition, which comprises the following steps:
S0, acquiring target image data;
S1, analyzing target image data by using an image report generation model to obtain target report text.
As shown in fig. 1, the image report generation model includes:
An image encoder, comprising a self-attention layer and a feed-forward propagation layer, used for extracting features of the input image data to obtain image feature vectors;
A text encoder, comprising a global self-attention layer and a feed-forward propagation layer, used for extracting features of the input text data to obtain text feature vectors;
A cross-modal image encoder, constructed by placing a second cross-attention layer between the self-attention layer and the feed-forward propagation layer of the image encoder, the second cross-attention layer being connected with the output of the text encoder; the cross-modal image encoder shares network parameters with the image encoder and is used for extracting features of the image data, fusing them with the text feature vectors during extraction to obtain fused image feature vectors;
A cross-modal text encoder, constructed by placing a first cross-attention layer between the global self-attention layer and the feed-forward propagation layer of the text encoder, the first cross-attention layer being connected with the output of the image encoder; the cross-modal text encoder shares network parameters with the text encoder and is used for extracting features of the text data, fusing them with the image feature vectors during extraction to obtain fused text feature vectors;
A cross-modal text decoder, constructed by replacing the global self-attention layer of the cross-modal text encoder with a causal self-attention layer; the cross-modal text decoder shares the network parameters other than the causal self-attention layer with the cross-modal text encoder and is used for generating report text based on the image feature vectors;
The image feature vectors, the text feature vectors, the fused image feature vectors, and the fused text feature vectors are used to train the image report generating model.
When a medical report is automatically generated with the present embodiment, the method comprises the following steps:
Step B1: put the model into its inference (use) state.
Step B2: acquire the medical image for which a report is to be output, and input the medical image into the model.
Step B3: extract the feature F_i of the medical image with the image encoder for subsequent use.
Step B4: generate the report with the cross-modal text decoder, which interacts with the medical image feature F_i through its cross-attention layer and outputs the final report text, as sketched below.
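The following Python sketch illustrates this B1-B4 inference flow. It is a minimal sketch, assuming PyTorch-style modules; the interfaces (image_encoder, text_decoder with a cross_states argument, the tokenizer, and the bos_id/eos_id token values) are illustrative assumptions rather than details from the patent.

```python
import torch

@torch.no_grad()
def generate_report(image, image_encoder, text_decoder, tokenizer,
                    bos_id, eos_id, max_len=128):
    """Greedy autoregressive report generation (steps B3-B4); assumes
    text_decoder cross-attends to the image features F_i."""
    f_i = image_encoder(image)                 # B3: extract image feature F_i
    tokens = [bos_id]                          # special start-of-report token
    for _ in range(max_len):
        inp = torch.tensor(tokens).unsqueeze(0)
        logits = text_decoder(inp, cross_states=f_i)  # causal self-attn + cross-attn
        next_id = int(logits[0, -1].argmax())         # greedy next-token choice
        if next_id == eos_id:                         # stop at end of report
            break
        tokens.append(next_id)
    return tokenizer.decode(tokens[1:])        # B4: the final report text
```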
A second embodiment of the present invention relates to a training method for an image report generating model based on multi-modal recognition, for training an image report generating system comprising an image encoder and a cross-modal text decoder; as shown in fig. 2, it comprises the following steps:
T0: acquire image data and text data matched with the image data;
T1: perform multi-task collaborative training using the image data and text data, comprising the following steps:
inputting the image data into the image encoder, inputting the text data into the text encoder and the cross-modal text encoder, and performing contrastive learning training and matching training;
constructing mask image data and mask text data based on the image data and the text data;
inputting the mask image data and the mask text data into the cross-modal image encoder and the cross-modal text encoder respectively, and performing multi-modal mask image modeling training and multi-modal mask language modeling training;
shifting characters in the text data according to a preset rule to obtain processed text data;
inputting the processed text data into the cross-modal text decoder for autoregressive generation training.
Wherein the contrastive learning training includes:
analyzing the image data with the image encoder to obtain image feature vectors;
analyzing the text data with the text encoder to obtain text feature vectors;
calculating the vector similarity between the image feature vectors and the text feature vectors;
and taking the vector similarity of image feature vectors and text feature vectors that are matched with each other as positive learning samples, and the remaining vector similarities as negative learning samples, to perform contrastive learning training.
The matching training comprises the following steps:
analyzing the text data with the cross-modal text encoder to obtain fused text feature vectors;
training the image report generation model based on the degree of vector matching between the image feature vectors and the fused text feature vectors.
The multi-modal mask image modeling training includes:
performing image restoration with the cross-modal image encoder according to the mask image data and the text feature vectors to obtain image restoration data;
training the image report generation model based on the image similarity between the image restoration data and the image data.
The multi-modal mask language modeling training includes:
performing text restoration with the cross-modal text encoder according to the mask text data and the image feature vectors to obtain text restoration data;
training the image report generating model based on the degree of text matching between the text restoration data and the text data.
The following further describes a specific application of the present embodiment in the context of automatic generation of medical reports.
Medical image and report datasets are used, such as the large radiology image and report dataset MIMIC-CXR. The dataset contains 377,110 chest radiographs from 63,478 patients and 227,827 corresponding medical text reports. The collected medical images and reports are cleaned so that every medical image has a corresponding medical report text description. The dataset may be divided into a training set, a validation set, and a test set at a ratio of 80%, 10%, and 10%; a minimal sketch of this split is given below.
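As a minimal sketch, the cleaning-and-splitting step might look as follows; the (image_path, report_text) pair format and the fixed random seed are assumptions for illustration, not specified by the patent.

```python
import random

def split_dataset(pairs, seed=42):
    """pairs: list of (image_path, report_text) tuples after data cleaning,
    i.e. every medical image already has a corresponding report description."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (pairs[:n_train],                 # 80% training set
            pairs[n_train:n_train + n_val],  # 10% validation set
            pairs[n_train + n_val:])         # 10% test set
```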
A multi-task collaborative training framework is constructed, as shown in FIG. 1, comprising 5 modules: an image encoder, a cross-modal image encoder, a text encoder, a cross-modal text encoder, and a cross-modal text decoder. The image encoder and the cross-modal image encoder share all parameters except the cross-attention layer, the text encoder and the cross-modal text encoder share all parameters except the cross-attention layer, and the cross-modal text decoder and the cross-modal text encoder share all parameters except the self-attention layer. The image encoder employs a 12-layer Transformer architecture. It breaks the image into a series of small blocks (patches) and then processes these image blocks like sequence data, allowing each block in the model to capture global information. At the same time, a special image block [CLS] is added before the image block sequence to represent the global information of the image. Using the Transformer's self-attention mechanism to weight different parts of the image enables the network to focus on the most important areas in the image and thus better understand the scene. The text encoder also employs a 12-layer Transformer architecture and is pre-trained in advance on biomedical text to enhance its understanding of medical terminology. The text encoder first performs word segmentation on the report text to obtain a sequence of text tokens; a special [CLS] token is then likewise added to represent global information. Based on the image encoder and the text encoder, the cross-modal image encoder and the cross-modal text encoder are obtained by inserting a cross-attention layer between the self-attention layer and the feed-forward propagation layer of the network; the cross-modal encoders allow interaction between the image and text modalities so as to align the two kinds of modal information. Based on the cross-modal text encoder, the cross-modal text decoder is obtained by replacing the bidirectional self-attention layer with a causal self-attention layer to suit the report text generation task. A sketch of one such cross-modal layer with shared parameters is given below.
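The sketch below illustrates the parameter-sharing scheme for one such cross-modal layer, assuming PyTorch. The dimensions (768 hidden units, 12 heads) and the omission of layer normalization are simplifying assumptions, not the patent's exact implementation; the point is that only the inserted cross-attention layer holds new parameters, while the self-attention and feed-forward blocks are the very same objects used by the unimodal encoder.

```python
import torch
import torch.nn as nn

class CrossModalLayer(nn.Module):
    """One cross-modal layer: a cross-attention layer inserted between the
    (shared) self-attention layer and the (shared) feed-forward layer."""
    def __init__(self, shared_self_attn, shared_ffn, d_model=768, n_heads=12):
        super().__init__()
        self.self_attn = shared_self_attn   # shared with the unimodal encoder
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)  # the only new layer
        self.ffn = shared_ffn               # shared with the unimodal encoder

    def forward(self, x, other):
        x = x + self.self_attn(x, x, x)[0]           # self-attention
        x = x + self.cross_attn(x, other, other)[0]  # attend to the other modality
        return x + self.ffn(x)                       # feed-forward propagation

# Shared blocks are created once and reused by both encoders:
d = 768
self_attn = nn.MultiheadAttention(d, 12, batch_first=True)
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
layer = CrossModalLayer(self_attn, ffn)
text = torch.randn(2, 32, d)   # text token features
img = torch.randn(2, 197, d)   # 196 patch features + [CLS]
fused = layer(text, img)       # fused text feature vector
```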
The model undergoes multi-task collaborative training, divided into 5 subtasks in three major categories, as shown in FIG. 3, carried out as follows:
Global alignment task training: this comprises two sub-tasks, image-report contrastive learning and image-report matching. For the image-report contrastive learning task, the image encoder and the text encoder are used to extract and encode features of the medical image and the report respectively, yielding a complete image feature vector F_i and a complete text feature vector F_t. The similarity of the two features F_i and F_t is computed via cosine similarity; original image-report pairs and randomly combined image-report pairs serve as positive and negative samples respectively, and the contrastive learning loss is calculated. Because an image and its corresponding report text (a positive pair) should be highly similar, while a random image-report combination (a negative pair) should be less similar, contrastive learning trains the model to assign high similarity to positive samples and low similarity to negative samples, enhancing the model's understanding of the data. The contrastive learning loss is calculated with the InfoNCE loss formula; by pulling together image-text positive pairs and pushing apart image-text negative pairs, it unifies the data features of the two different modalities into the same feature space, that is, it aligns the two modalities and eliminates the gap between their features, so that the model can ultimately understand and associate image and text data. For the image-report matching task, the report text is input into the cross-modal text encoder, which interacts with the complete image feature F_i through the cross-attention layer, and the output is linearly projected into a 2-dimensional tensor M. A classification loss is computed according to whether the input image and report match, and the cross-modal text encoder is tuned (via the gradient descent algorithm of the neural network) while optimizing this classification loss, so that it can judge whether any image-text pair matches, strengthening the model's understanding of images and report texts. A minimal sketch of both losses follows.
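A minimal sketch of the two global-alignment losses follows, assuming PyTorch with batch-aligned positives on the diagonal of the similarity matrix; the temperature value, feature dimension, and stand-in tensors are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(f_i, f_t, temperature=0.07):
    """Image-report contrastive loss (InfoNCE) over cosine similarities;
    matched pairs sit on the diagonal, all other pairs act as negatives."""
    f_i = F.normalize(f_i, dim=-1)
    f_t = F.normalize(f_t, dim=-1)
    sim = f_i @ f_t.t() / temperature      # (B, B) cosine similarity matrix
    labels = torch.arange(sim.size(0))     # positive pairs on the diagonal
    # symmetric loss: image-to-text and text-to-image directions
    return (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)) / 2

f_i, f_t = torch.randn(8, 768), torch.randn(8, 768)  # [CLS] features F_i, F_t
loss_itc = info_nce(f_i, f_t)

# Image-report matching: the fused feature is linearly projected to the
# 2-dimensional tensor M (match / no-match) and trained with cross-entropy.
itm_head = torch.nn.Linear(768, 2)
fused_cls = torch.randn(8, 768)                      # stand-in for encoder output
loss_itm = F.cross_entropy(itm_head(fused_cls), torch.randint(0, 2, (8,)))
```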
Multi-modal mask task training: this comprises two sub-tasks, multi-modal mask image modeling and multi-modal mask language modeling. As shown in FIG. 4, by occluding key detail information in an image (such as the lung and heart regions) or key details in a text (such as the word "heart" in "heart enlargement"), the model is made to restore the occluded information and thereby learn fine-grained semantic information. For the multi-modal mask image modeling task, the image is preprocessed to a resolution of 224×224 and segmented into blocks of size 16×16, yielding 14×14=196 image blocks. The blocks are randomly occluded with a probability of 60%. The resulting occluded image I_m is input into the cross-modal image encoder, which interacts with the complete text feature F_t through the cross-attention layer. The output is passed through a linear transformation to obtain the restored image I_r. The Manhattan distance between the restored image I_r and the original image I is calculated as the multi-modal mask image modeling loss, and the network parameters are adjusted by optimizing this loss. For the multi-modal mask language modeling task, the text tokens of the report are randomly masked with a probability of 30%. The occluded text T_m is input into the cross-modal text encoder, which interacts with the complete image feature F_i through the cross-attention layer. The output is passed through a linear transformation to obtain the restored text T_r. The cross-entropy classification loss between T_r and the original text T is calculated as the multi-modal mask language modeling loss, and the network parameters are adjusted by optimizing this loss. A minimal sketch of the two masking losses follows.
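A minimal sketch of the two masking losses under the stated ratios (60% of the 196 image patches, 30% of the text tokens) follows; the encoder forward passes are replaced by stand-in tensors, and the vocabulary size and [MASK] token id are assumptions.

```python
import torch
import torch.nn.functional as F

# Multi-modal mask image modeling: occlude patches, restore, Manhattan loss.
patches = torch.randn(1, 196, 16 * 16 * 3)   # 14x14 patches of a 224x224 RGB image
mask = torch.rand(1, 196) < 0.60             # 60% random patch occlusion
i_m = patches.masked_fill(mask.unsqueeze(-1), 0.0)  # occluded image I_m
i_r = i_m                                    # stand-in for cross-modal encoder + linear head
loss_mim = F.l1_loss(i_r[mask], patches[mask])      # Manhattan (L1) distance, masked patches

# Multi-modal mask language modeling: mask 30% of tokens, restore via cross-entropy.
vocab, mask_id = 30522, 103                  # assumed vocabulary size and [MASK] id
tokens = torch.randint(0, vocab, (1, 64))    # original report text T
tmask = torch.rand(1, 64) < 0.30             # 30% random token masking
t_m = tokens.masked_fill(tmask, mask_id)     # occluded text T_m
logits = torch.randn(1, 64, vocab)           # stand-in for encoder + linear head (T_r)
loss_mlm = F.cross_entropy(logits[tmask], tokens[tmask])
```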
Report text generation task training: this task completes the final medical report generation. A special character is added to the beginning of the text, and the right-shifted target report text T_s is input into the cross-modal text decoder, which interacts with the complete image feature F_i through the cross-attention layer and continually predicts the next character from the text generated so far until the text is complete; the resulting loss serves as the report text generation task loss. A minimal sketch of this shifted-input setup follows.
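A minimal sketch of this right-shift (teacher-forcing) setup follows; the start-token id and vocabulary size are illustrative assumptions, and the decoder forward pass is replaced by a stand-in tensor.

```python
import torch
import torch.nn.functional as F

bos_id, vocab = 101, 30522
target = torch.randint(0, vocab, (1, 63))    # target report tokens
# T_s: prepend the special start character and shift the target right by one,
# so that position t sees only earlier tokens (with causal self-attention).
t_s = torch.cat([torch.full((1, 1), bos_id), target[:, :-1]], dim=1)
logits = torch.randn(1, 63, vocab)           # stand-in for decoder output over T_s
loss_gen = F.cross_entropy(logits.reshape(-1, vocab), target.reshape(-1))
```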
Under the collaborative training of these three main task categories, the final training of the model is completed.

Claims (10)

1. An image report generating method based on multi-modal recognition, characterized by comprising the following steps:
S0, acquiring target image data;
S1, analyzing the target image data by using an image report generating model to obtain a target report text, wherein the image report generating model comprises the following components:
the image encoder is used for extracting the characteristics of the input image data to obtain image characteristic vectors;
the text encoder is used for extracting the characteristics of the input text data to obtain text characteristic vectors;
the cross-modal image encoder is used for carrying out feature extraction on the image data and fusing the image data with the text feature vector during extraction to obtain a fused image feature vector;
the cross-modal text encoder is used for extracting features of the text data and fusing the features with the image feature vectors during extraction to obtain fused text feature vectors;
a cross-modal text decoder to generate report text based on the image feature vector;
The image feature vector, the text feature vector, the fused image feature vector, and the fused text feature vector are used to train the image report generation model.
2. The image report generating method of claim 1, wherein the image encoder comprises a self-attention layer and a feed-forward propagation layer, wherein the cross-modal image encoder is constructed by interposing a second cross-attention layer between the self-attention layer and the feed-forward propagation layer of the image encoder, wherein the second cross-attention layer is connected to an output of the text encoder, and wherein the cross-modal image encoder shares network parameters with the image encoder.
3. The image report generating method of claim 1, wherein the text encoder comprises a self-attention layer and a feed-forward propagation layer, wherein the cross-modal text encoder is constructed by interposing a first cross-attention layer between the self-attention layer and the feed-forward propagation layer of the text encoder, wherein the first cross-attention layer is connected to an output of the image encoder, and wherein the cross-modal text encoder shares network parameters with the text encoder.
4. The image report generating method of claim 3, wherein the cross-modal text decoder is constructed by replacing a self-attention layer of the cross-modal text encoder with a causal self-attention layer, and wherein the cross-modal text decoder shares network parameters with the cross-modal text encoder other than the causal self-attention layer.
5. A training method for an image report generating model based on multi-modal recognition, for training the image report generating model according to any one of claims 1 to 4, comprising the following steps:
acquiring image data and text data matched with the image data;
inputting the image data into the image encoder, inputting the text data into the text encoder and the cross-modal text encoder, and performing contrastive learning training and matching training;
constructing mask image data and mask text data based on the image data and the text data;
inputting the mask image data and the mask text data into the cross-modal image encoder and the cross-modal text encoder respectively, and performing multi-modal mask image modeling training and multi-modal mask language modeling training;
shifting characters in the text data according to a preset rule to obtain processed text data;
inputting the processed text data into the cross-modal text decoder to perform autoregressive generation training.
6. The training method of claim 5, wherein the contrastive learning training comprises:
analyzing the image data with the image encoder to obtain an image feature vector;
analyzing the text data with the text encoder to obtain a text feature vector;
training the image report generating model based on the vector similarity between the image feature vector and the text feature vector.
7. The training method of claim 6, wherein training the image report generation model based on the vector similarity between the image feature vector and the text feature vector comprises:
calculating the vector similarity between the image feature vector and the text feature vector;
and taking the vector similarity of the image feature vector and the text feature vector that are matched with each other as a positive learning sample, and the other vector similarities as negative learning samples, to perform contrastive learning training.
8. The training method of claim 5, wherein the matching training comprises:
analyzing the text data with the cross-modal text encoder to obtain a fused text feature vector;
and training the image report generating model based on the degree of vector matching between the image feature vector and the fused text feature vector.
9. The training method of claim 5, wherein the multi-modal mask image modeling training comprises: performing image restoration with the cross-modal image encoder according to the mask image data and the text feature vector to obtain image restoration data;
training the image report generation model based on the image similarity between the image restoration data and the image data.
10. The training method of claim 5, wherein the multi-modal mask language modeling training comprises: performing text restoration with the cross-modal text encoder according to the mask text data and the image feature vector to obtain text restoration data;
training the image report generating model based on the degree of text matching between the text restoration data and the text data.
CN202410199819.9A (filed 2024-02-23, priority 2024-02-23) Image report generation method and model training method, Pending, published as CN118072898A

Priority Applications (1)

Application Number: CN202410199819.9A; Priority Date: 2024-02-23; Filing Date: 2024-02-23; Title: Image report generation method and model training method


Publications (1)

Publication Number: CN118072898A; Publication Date: 2024-05-24

Family

ID=91103320

Family Applications (1)

Application Number: CN202410199819.9A; Title: Image report generation method and model training method; Priority Date: 2024-02-23; Filing Date: 2024-02-23; Status: Pending

Country Status (1)

CN: CN118072898A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination