CN116525052A - Hierarchical image report generation method and device combined with sentence level contrast learning - Google Patents


Info

Publication number: CN116525052A
Application number: CN202310320888.6A
Authority: CN (China)
Prior art keywords: image, sentence, vector, report, topic
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 徐枫, 刘傲寒, 郭雨晨, 雍俊海
Assignee (current and original): Tsinghua University
Application filed by Tsinghua University; priority to CN202310320888.6A; publication of CN116525052A


Classifications

    • G — PHYSICS
    • G16 — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H — HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H15/00 — ICT specially adapted for medical reports, e.g. generation or transmission thereof
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/08 — Learning methods
    • G — PHYSICS
    • G16 — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H — HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00 — ICT specially adapted for the handling or processing of medical images


Abstract

The application provides a hierarchical image report generation method combining sentence-level contrast learning, which comprises the following steps: acquiring a plurality of medical image-image report pairs and constructing an image report generation model, wherein the image report generation model comprises a sentence-level image-text contrast learning module and a hierarchical report generation module; training the sentence-level image-text contrast learning module based on a first joint loss function, and outputting the feature vector of the medical image and the topic vector and content vector of each sentence in the image report; training the hierarchical report generation module based on a second joint loss function to obtain a trained image report generation model; and acquiring a medical image to be identified, inputting it into the trained image report generation model, and generating an image report of the medical image to be identified. The feature extraction capability of the image encoder is enhanced, and the accuracy of the generated image report is improved.

Description

Hierarchical image report generation method and device combined with sentence level contrast learning
Technical Field
The present disclosure relates to the field of computer vision, natural language processing, and computer-aided diagnosis, and more particularly, to a method and apparatus for generating a hierarchical image report in combination with sentence-level contrast learning.
Background
An image report is a text description of a medical image and contains detailed information about the physiological structures, abnormalities, diseases, and so on in the image; its main content can be abstracted as a passage composed of several sentences. Different sentences focus on different contents and topics and are independent of one another. For example, in an image report for a chest X-ray, some sentences focus on heart size, some on whether pneumonia is present, some on the bones, and so on. Writing image reports manually is time-consuming and labor-intensive, but with deep learning methods, a neural network model can be trained on large amounts of image and report data to generate reports automatically.
Image-text contrast learning is a weakly supervised learning method that can be used for model pre-training or combined with other tasks to achieve better results. In the context of image report generation, image-text contrast learning generally treats the entire report as a single text. However, since a report contains multiple sentences, each with its own topic and content, treating the whole report as one unit is not reasonable. How to accurately and automatically generate image reports for medical images is therefore a pressing problem.
Disclosure of Invention
The present application aims to solve, at least to some extent, one of the technical problems in the related art.
Therefore, the first objective of the present application is to provide a hierarchical image report generation method combining sentence-level contrast learning, which solves the technical problem that existing methods generate the whole image report as a single text, reducing accuracy.
A second objective of the present application is to provide a hierarchical image report generating device that combines sentence-level contrast learning.
A third object of the present application is to propose a non-transitory computer readable storage medium.
To achieve the above objective, an embodiment of the first aspect of the present application provides a hierarchical image report generation method combining sentence-level contrast learning, including: acquiring a plurality of medical image-image report pairs and constructing an image report generation model, wherein the image report generation model comprises a sentence-level image-text contrast learning module and a hierarchical report generation module; inputting the plurality of medical image-image report pairs into the sentence-level image-text contrast learning module for training based on a first joint loss function, and outputting the feature vector of the medical image and the topic vector and content vector of each sentence in the image report; inputting the feature vector of the medical image, the topic vector of each sentence in the image report, and the word sequence of each sentence in the image report into the hierarchical report generation module for training based on a second joint loss function, to obtain a trained image report generation model; and acquiring a medical image to be identified, inputting it into the trained image report generation model, and generating an image report of the medical image to be identified.
Optionally, in an embodiment of the present application, constructing the image report generating model includes:
constructing a sentence-level image-text contrast learning module according to the image encoder and the text encoder;
constructing a hierarchical report generating module according to the vector quantization module, the sentence decoder and the word decoder;
and obtaining an image report generation model according to the sentence-level image-text comparison learning module and the hierarchical report generation module.
Optionally, in one embodiment of the present application, inputting the plurality of medical image-image report pairs into the sentence-level image-text contrast learning module for training based on the first joint loss function, and outputting the feature vector of the medical image and the topic vector and content vector of each sentence in the image report, includes:
inputting the medical image into the convolutional neural network of the image encoder to obtain a feature map, and flattening the feature map into a sequence that is input into the Transformer encoder of the image encoder, to obtain the feature vector of the medical image;
inputting the image report into the text encoder, processing each sentence of the image report into a sequence, and inputting the sequence into the Transformer Encoder of the text encoder to generate the topic vector and content vector of each sentence in the image report;
And constructing a first joint loss function according to the feature vector of the medical image, the topic vector of each sentence in the image report and the content vector of each sentence in the image report, and training the image encoder and the text encoder according to the first joint loss function.
Optionally, in one embodiment of the present application, constructing the first joint loss function according to the feature vector of the medical image, the topic vector of each sentence in the image report, and the content vector of each sentence in the image report includes:
calculating topic-guided image features of each sentence according to the topic vector of each sentence in the image report and the feature vector of the medical image;
calculating the similarity of the topic-guided image features and the content vectors of each sentence, and constructing comparison loss according to the similarity;
constructing a first joint loss function by introducing topic loss and topic inconsistency loss according to the comparison loss;
the calculation formula of the similarity is expressed as:

$$\mathrm{sim}(S,I)=\cos\Big(C,\ \sum_{i=1}^{N} T_i\, q_i\Big)$$

wherein S is a sentence in the image report, I is a medical image, C is the content vector of sentence S in the image report, $T_i$ is the ith element of the topic vector of sentence S in the image report, $q_i$ is the ith feature vector of medical image I, and cos denotes the cosine of the angle between two vectors.

The contrast loss is expressed as:

$$L_{silc}=-\frac{1}{b}\sum_{i=1}^{b}\ \sum_{S\in R_i}\log\frac{\exp\big(\mathrm{sim}(S,I_i)/t\big)}{\sum_{k=1}^{b}\exp\big(\mathrm{sim}(S,I_k)/t\big)}$$

wherein $L_{silc}$ is the contrast loss, b is the training batch size, $R_i$ is the ith image report in the training batch, S is one sentence of the ith image report $R_i$, $I_i$ is the ith medical image in the training batch, $I_k$ is the kth medical image in the training batch, sim(x, y) denotes the similarity of sentence x and medical image y, and t is a trainable temperature parameter;

the topic loss is expressed as:

$$L_{topic}=\frac{1}{n}\sum_{i=1}^{n}H(G_i)-\lambda\, H\Big(\frac{1}{n}\sum_{i=1}^{n}G_i\Big),\qquad H(v)=-\sum_{i=1}^{N}v_i\log v_i$$

wherein $L_{topic}$ is the topic loss, $\lambda$ is a hyper-parameter, n is the number of sentences in the image report, H(·) is the entropy function of a vector, N is the length of vector v, $v_i$ is the ith element of v, and $G_i$ is the topic vector of the ith sentence in the image report;

the topic inconsistency loss is expressed as:

$$L_{diff}=\frac{1}{m(m-1)}\sum_{i=1}^{m}\sum_{k\ne i}\cos(T_i,T_k)$$

wherein $L_{diff}$ denotes the topic inconsistency loss, m denotes the number of sentences in the image report, cos denotes the cosine of the angle between two vectors, $T_i$ denotes the topic vector of the ith sentence in the image report, and $T_k$ denotes the topic vector of the kth sentence in the image report.
Optionally, in one embodiment of the present application, inputting the feature vector of the medical image, the topic vector of each sentence in the image report, and the word sequence of each sentence in the image report into the hierarchical report generation module for training based on the second joint loss function includes:
Inputting the topic vector of each sentence in the image report into a vector quantization module, and acquiring a target vector closest to the topic vector of each sentence through a discrete space of the vector quantization module;
constructing a loss function according to the topic vector of each sentence and the obtained target vector, and training a vector quantization module;
the loss function of the vector quantization module is expressed as:

$$L_{vq}=\big\|\mathrm{sg}(v)-e_v\big\|_2^2+\big\|\mathrm{sg}(v_e)-e\big\|_2^2$$

wherein $L_{vq}$ is the loss function of the vector quantization module, sg(·) is the gradient cut-off (stop-gradient) operation, v is an input vector, $e_v$ is the vector in the discrete space of the vector quantization module nearest to the input vector v, e is an unused vector in the discrete space of the vector quantization module, and $v_e$ is the vector among the input vectors nearest to e.
Optionally, in one embodiment of the present application, inputting the feature vector of the medical image, the topic vector of each sentence in the image report, and the word sequence of each sentence in the image report into the hierarchical report generation module for training based on the second joint loss function includes:
inputting the feature vector of the medical image into a sentence decoder, predicting the sentence topic vector of the medical image through the sentence decoder, and constructing a loss function to train the sentence decoder according to the topic vector of each sentence in the image report and the sentence topic vector obtained by prediction;
The loss function of the sentence decoder is expressed as:

$$L_{sg}=-\sum_{i=1}^{n}\log P\big(t_i \mid t_{1:i-1}, Q;\ \theta\big)$$

wherein $L_{sg}$ is the loss function of the sentence decoder, P is the probability function, $t_i$ is the topic vector of the ith sentence in the image report, $t_{1:i-1}$ is the topic vector sequence from the 1st to the (i-1)th sentence in the image report, Q is the feature vector of the medical image, and θ is the parameter of the sentence decoder.
Optionally, in one embodiment of the present application, inputting the feature vector of the medical image, the topic vector of each sentence in the image report, and the word sequence of each sentence in the image report into the hierarchical report generation module for training based on the second joint loss function includes:
inputting the feature vector of the medical image and the topic vector of each sentence in the image report into a word decoder, and predicting by the word decoder to obtain a word sequence of each sentence;
constructing a loss function according to the word sequence of each sentence in the image report and the word sequence of the sentence obtained by prediction, and training a word decoder;
the loss function of the word decoder is expressed as:

$$L_{wg}=-\sum_{i=1}^{m}\log P\big(w_i \mid w_{1:i-1}, T, Q;\ \theta\big)$$

wherein $L_{wg}$ is the loss function of the word decoder, P is the probability function, T is the topic vector of the sentence, $w_i$ is the ith word of the sentence, $w_{1:i-1}$ is the sequence of the 1st to (i-1)th words in the sentence, Q is the feature vector of the medical image, and θ is the parameter of the word decoder.
Optionally, in one embodiment of the present application, inputting the medical image to be identified into a trained image report generation model, generating an image report of the medical image to be identified includes:
inputting the medical image to be identified into an image encoder for encoding to obtain a feature vector of the medical image to be identified;
inputting the feature vectors of the medical images to be identified into a sentence decoder, predicting to obtain subscripts of topic vectors of a plurality of sentences, and acquiring the topic vectors of the plurality of sentences from a vector quantization module according to the subscripts of the topic vectors of the plurality of sentences;
inputting the topic vectors of the multiple sentences into a word decoder, sequentially generating word sequences of each sentence according to the feature vector of the medical image to be identified and the topic vector of each sentence to obtain the multiple sentences, and connecting the multiple sentences to obtain an image report of the medical image to be identified.
To achieve the above object, in a second aspect of the present application, a hierarchical image report generating device combined with sentence-level contrast learning is provided, including:
the system comprises an acquisition module, a generation module and a generation module, wherein the acquisition module is used for acquiring a plurality of medical image-image report pairs and constructing an image report generation model, and the image report generation model comprises a sentence-level image-text contrast learning module and a hierarchical report generation module;
The first training module is used for inputting the plurality of medical image-image report pairs into the sentence-level image-text contrast learning module for training based on a first joint loss function, and outputting the feature vector of the medical image and the topic vector and content vector of each sentence in the image report;
the second training module is used for inputting the feature vector of the medical image, the topic vector of each sentence in the image report and the word sequence of each sentence in the image report into the hierarchical report generation module for training based on the second joint loss function to obtain a trained image report generation model;
the generation module is used for acquiring the medical image to be identified, inputting the medical image to be identified into the trained image report generation model, and generating an image report of the medical image to be identified.
To achieve the above object, an embodiment of the third aspect of the present application proposes a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, performs the hierarchical image report generation method combining sentence-level contrast learning.
The hierarchical image report generation method and device combining sentence-level contrast learning and the non-transitory computer-readable storage medium of the present application solve the technical problem that existing methods generate the whole image report as a single text, which reduces accuracy. By using sentence-level image-text contrast learning as an auxiliary task for report generation, the image encoder is forced to learn corresponding image features for each topic, enhancing its feature extraction capability; meanwhile, the hierarchical report generation module first generates the topic of each sentence and then generates the actual sentence according to the topic and the image, improving the accuracy of the generated image report.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flowchart of a hierarchical image report generation method combining sentence level contrast learning according to an embodiment of the present disclosure;
FIG. 2 is a training flow diagram of an image report generation model of a hierarchical image report generation method combining sentence-level contrast learning according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an actual usage flow of an image report generating model of a hierarchical image report generating method combining sentence-level contrast learning according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a hierarchical image report generating device combining sentence-level contrast learning according to a second embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary and intended for the purpose of explaining the present application and are not to be construed as limiting the present application.
Deep learning is an artificial intelligence method that achieves specific functions by building a neural network model and training it on a large amount of data. Methods for automatically generating image reports based on deep learning are generally built on the encoder-decoder structure. The input image is first encoded by an encoder into feature vectors or a feature map, and then a decoder generates the text sequence of the report. The decoder generally works in one of two modes, direct generation or hierarchical generation: direct generation treats the whole image report as one sequence and generates each character or word in turn; hierarchical generation first generates the features of each sentence in the report in turn, and then generates the characters or words of each sentence. In terms of model architecture, the encoder is typically based on convolutional neural networks (CNNs) or Transformers, and the decoder on recurrent neural networks (RNNs) or Transformers. The decoders in hierarchical generation models are typically based on recurrent neural networks.
Image-text contrast learning is a weakly supervised learning method that can be used for model pre-training or combined with other tasks to achieve better results. In the context of image report generation, image-text contrast learning generally treats the entire report as a single text. However, since a report contains multiple sentences, each with its own topic and content, treating the whole report as one unit is not reasonable.
The application provides an automatic image report generation method based on deep learning, which regards an image report as a number of independent sentences and performs contrast learning against the image sentence by sentence. It mainly comprises two modules: sentence-level contrast learning and hierarchical image report generation, where the hierarchical decoding of the present application uses a Transformer as the decoder.
The following describes a hierarchical image report generation method and apparatus combining sentence-level contrast learning according to an embodiment of the present application with reference to the accompanying drawings.
Fig. 1 is a flowchart of a hierarchical image report generating method combining sentence-level contrast learning according to an embodiment of the present application.
As shown in fig. 1, the method for generating a hierarchical image report in combination with sentence-level contrast learning includes the following steps:
step 101, acquiring a plurality of medical image-image report pairs, and constructing an image report generation model, wherein the image report generation model comprises a sentence-level image-text contrast learning module and a hierarchical report generation module;
step 102, inputting the plurality of medical image-image report pairs into the sentence-level image-text contrast learning module for training based on a first joint loss function, and outputting the feature vector of the medical image and the topic vector and content vector of each sentence in the image report;
Step 103, inputting the feature vector of the medical image, the topic vector of each sentence in the image report and the word sequence of each sentence in the image report into a hierarchical report generation module for training based on the second joint loss function, so as to obtain a trained image report generation model;
step 104, obtaining the medical image to be identified, inputting the medical image to be identified into the trained image report generation model, and generating an image report of the medical image to be identified.
According to the hierarchical image report generation method combining sentence-level contrast learning of the embodiment of the application, a plurality of medical image-image report pairs are acquired and an image report generation model is constructed, wherein the image report generation model comprises a sentence-level image-text contrast learning module and a hierarchical report generation module; the plurality of medical image-image report pairs are input into the sentence-level image-text contrast learning module, which is trained based on a first joint loss function and outputs the feature vector of the medical image and the topic vector and content vector of each sentence in the image report; the feature vector of the medical image, the topic vector of each sentence in the image report, and the word sequence of each sentence in the image report are input into the hierarchical report generation module for training based on a second joint loss function, to obtain a trained image report generation model; and a medical image to be identified is acquired and input into the trained image report generation model to generate its image report. This solves the technical problem that existing methods generate the whole image report as a single text, which reduces accuracy. By using sentence-level image-text contrast learning as an auxiliary task for report generation, the image encoder is forced to learn corresponding image features for each topic, enhancing its feature extraction capability; meanwhile, the hierarchical report generation module first generates the topic of each sentence and then generates the actual sentence according to the topic and the image, improving the accuracy of the generated image report.
Further, in an embodiment of the present application, constructing an image report generating model includes:
constructing a sentence-level image-text contrast learning module according to the image encoder and the text encoder;
constructing a hierarchical report generating module according to the vector quantization module, the sentence decoder and the word decoder;
and obtaining an image report generation model according to the sentence-level image-text comparison learning module and the hierarchical report generation module.
The image report generating model in the embodiment of the application mainly comprises two modules: sentence-level image-text contrast learning and hierarchical report generation, wherein the sentence-level image-text contrast learning module mainly comprises two neural networks: an image encoder and a text encoder; the hierarchical report generating module is used for generating an image report, and the hierarchical report generating module and the sentence-level image-text comparison learning module share an image encoder and a text encoder during training.
Wherein, in the sentence-level image-text contrast learning module, for one image-report pair, the image encoder encodes the image into N D-dimensional feature vectors, each of which encodes the content of a particular topic in the image (e.g., the 1st vector encodes information about the heart in the image, the 2nd vector encodes information about the bones, etc.). The text encoder encodes each sentence in the report into an N-dimensional topic vector T and a D-dimensional content vector C. All elements of T are non-negative and sum to 1, making T a soft representation of the sentence's topic (e.g., if T = [0.9, 0.1, 0, 0, ...], then 90% of the sentence concerns the first topic and 10% the second). The image encoder, the text encoder, and the encoded vectors are obtained from the training data through contrast learning.
In the hierarchical report generation module, the vector quantization module maps the topic vectors T of the different sentences in the dataset to a relatively small discrete space E. The function of the sentence decoder is to predict (generate), from the feature vectors of an input image, the topic vectors of the sentences that should appear in its image report (these topics are selected from the discrete space E). The function of the word decoder is to predict (generate) the words of a sentence from the feature vectors of the input image and the topic vector of that sentence. In actual use, the topic vectors predicted by the sentence decoder are each expanded into a sentence by the word decoder, and the sentences are then concatenated to obtain the final report.
Further, in an embodiment of the present application, inputting the plurality of medical image-image report pairs into the sentence-level image-text contrast learning module for training based on the first joint loss function, and outputting the feature vector of the medical image and the topic vector and content vector of each sentence in the image report, includes:
inputting the medical image into the convolutional neural network of the image encoder to obtain a feature map, and flattening the feature map into a sequence that is input into the Transformer encoder of the image encoder, to obtain the feature vector of the medical image;
inputting the image report into the text encoder, processing each sentence of the image report into a sequence, and inputting the sequence into the Transformer Encoder of the text encoder to generate the topic vector and content vector of each sentence in the image report;
and constructing a first joint loss function according to the feature vector of the medical image, the topic vector of each sentence in the image report and the content vector of each sentence in the image report, and training the image encoder and the text encoder according to the first joint loss function.
In the embodiment of the application, the image encoder is composed of a convolutional neural network and a Transformer Encoder. An input medical image first passes through the convolutional neural network, which outputs a feature map (the exact size is determined by the network; for example, a ResNet18 outputs a 7×7×512 feature map). After the spatial dimensions are flattened (for ResNet18, into 49 512-dimensional feature vectors), the feature map is fed as a sequence into a Transformer Encoder, which outputs N D-dimensional feature vectors (for ResNet18, N=49, D=512).
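Below is a minimal PyTorch sketch of such an image encoder, assuming a ResNet18 backbone and PyTorch's built-in TransformerEncoder; the class name, layer counts, and input size are illustrative assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ImageEncoder(nn.Module):
    """CNN backbone + Transformer encoder, as described above (sketch)."""
    def __init__(self, d_model=512, nhead=8, num_layers=3):
        super().__init__()
        resnet = models.resnet18(weights=None)
        # Drop the average-pooling and FC head: output is a (B, 512, 7, 7) map.
        self.cnn = nn.Sequential(*list(resnet.children())[:-2])
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, images):                  # images: (B, 3, 224, 224)
        fmap = self.cnn(images)                 # (B, 512, 7, 7)
        seq = fmap.flatten(2).transpose(1, 2)   # (B, 49, 512): N=49, D=512
        return self.transformer(seq)            # N D-dimensional feature vectors
```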
In the embodiment of the application, the text encoder is composed of a Transformer Encoder and an output layer. A sentence first undergoes token embedding, position embedding, and similar operations, and is fed into the Transformer Encoder as a sequence. The output vector at the sentence-head position passes through a fully connected layer followed by a Softmax layer to obtain the N-dimensional topic vector T, and through another fully connected layer to obtain the D-dimensional content vector C.
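A matching sketch of the text encoder follows: a TransformerEncoder whose sentence-head output feeds a Softmax topic head (T) and a linear content head (C). The vocabulary size, maximum length, and layer counts are assumptions; padding masks are omitted for brevity.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Transformer encoder producing a topic vector T and content vector C."""
    def __init__(self, vocab_size=5000, d_model=512, n_topics=49,
                 max_len=64, nhead=8, num_layers=3):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.topic_head = nn.Linear(d_model, n_topics)    # + Softmax -> T
        self.content_head = nn.Linear(d_model, d_model)   # -> C

    def forward(self, tokens):        # tokens: (B, L), position 0 = sentence head
        pos = torch.arange(tokens.size(1), device=tokens.device)
        h = self.encoder(self.tok_emb(tokens) + self.pos_emb(pos))
        head = h[:, 0]                # output vector at the sentence-head position
        T = torch.softmax(self.topic_head(head), dim=-1)  # non-negative, sums to 1
        C = self.content_head(head)                       # D-dim content vector
        return T, C
```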
Further, in the embodiment of the present application, constructing a first joint loss function according to the feature vector of the medical image, the topic vector of each sentence in the image report, and the content vector of each sentence in the image report includes:
calculating topic-guided image features of each sentence according to the topic vector of each sentence in the image report and the feature vector of the medical image;
calculating the similarity of the topic-guided image features and the content vectors of each sentence, and constructing comparison loss according to the similarity;
constructing a first joint loss function by introducing topic loss and topic inconsistency loss according to the comparison loss;
the calculation formula of the similarity is expressed as:

$$\mathrm{sim}(S,I)=\cos\Big(C,\ \sum_{i=1}^{N} T_i\, q_i\Big)$$

wherein S is a sentence in the image report, I is a medical image, C is the content vector of sentence S in the image report, $T_i$ is the ith element of the topic vector of sentence S in the image report, $q_i$ is the ith feature vector of medical image I, and cos denotes the cosine of the angle between two vectors.

The contrast loss is expressed as:

$$L_{silc}=-\frac{1}{b}\sum_{i=1}^{b}\ \sum_{S\in R_i}\log\frac{\exp\big(\mathrm{sim}(S,I_i)/t\big)}{\sum_{k=1}^{b}\exp\big(\mathrm{sim}(S,I_k)/t\big)}$$

wherein $L_{silc}$ is the contrast loss, b is the training batch size, $R_i$ is the ith image report in the training batch, S is one sentence of the ith image report $R_i$, $I_i$ is the ith medical image in the training batch, $I_k$ is the kth medical image in the training batch, sim(x, y) denotes the similarity of sentence x and medical image y, and t is a trainable temperature parameter;

the topic loss is expressed as:

$$L_{topic}=\frac{1}{n}\sum_{i=1}^{n}H(G_i)-\lambda\, H\Big(\frac{1}{n}\sum_{i=1}^{n}G_i\Big),\qquad H(v)=-\sum_{i=1}^{N}v_i\log v_i$$

wherein $L_{topic}$ is the topic loss, $\lambda$ is a hyper-parameter, n is the number of sentences in the image report, H(·) is the entropy function of a vector, N is the length of vector v, $v_i$ is the ith element of v, and $G_i$ is the topic vector of the ith sentence in the image report;

the topic inconsistency loss is expressed as:

$$L_{diff}=\frac{1}{m(m-1)}\sum_{i=1}^{m}\sum_{k\ne i}\cos(T_i,T_k)$$

wherein $L_{diff}$ denotes the topic inconsistency loss, m denotes the number of sentences in the image report, cos denotes the cosine of the angle between two vectors, $T_i$ denotes the topic vector of the ith sentence in the image report, and $T_k$ denotes the topic vector of the kth sentence in the image report.
The loss function of sentence-level image-text contrast learning is mainly composed of three parts.
First is the contrast loss. For a sentence, using its N-dimensional topic vector T as weights, the N D-dimensional feature vectors of an image are weighted and summed to obtain a D-dimensional vector, called the topic-guided image feature (Topic-Guided Image Content, TGIC for short). This vector represents the content of the image corresponding to this topic. When the sentence and the image match (i.e., the two are a pair), the TGIC should be similar to the sentence's content vector C; conversely, when the sentence and image do not match (i.e., the two are not a pair), the TGIC should be dissimilar to C. During training, for a certain sentence of the ith sample in the batch, its topic T is used to extract TGICs from all images in the same batch, and the cosine similarity between each TGIC and the sentence content C is calculated. The calculation formula of the similarity is expressed as:

$$\mathrm{sim}(S,I)=\cos\Big(C,\ \sum_{i=1}^{N} T_i\, q_i\Big)$$

wherein S is a sentence in the image report, I is a medical image, C is the content vector of sentence S in the image report, $T_i$ is the ith element of the topic vector of sentence S in the image report, $q_i$ is the ith feature vector of medical image I, and cos denotes the cosine of the angle between two vectors.
The contrast loss is expressed as:

$$L_{silc}=-\frac{1}{b}\sum_{i=1}^{b}\ \sum_{S\in R_i}\log\frac{\exp\big(\mathrm{sim}(S,I_i)/t\big)}{\sum_{k=1}^{b}\exp\big(\mathrm{sim}(S,I_k)/t\big)}$$

wherein $L_{silc}$ is the contrast loss, b is the training batch size, $R_i$ is the ith image report in the training batch, S is one sentence of the ith image report $R_i$, $I_i$ is the ith medical image in the training batch, $I_k$ is the kth medical image in the training batch, sim(x, y) denotes the similarity of sentence x and medical image y, and t is a trainable temperature parameter;
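The following is a hedged sketch of how the topic-guided similarity and the batch-wise contrast loss above could be computed; the tensor shapes and the use of cross-entropy as the InfoNCE reduction are assumptions.

```python
import torch
import torch.nn.functional as F

def tgic_similarity(T, C, Q_i):
    """sim(S, I) = cos(C, sum_i T_i * q_i) for one sentence and one image.
    T: (N,) topic vector, C: (D,) content vector, Q_i: (N, D) image features."""
    tgic = T @ Q_i                             # topic-guided image feature, (D,)
    return F.cosine_similarity(C, tgic, dim=0)

def sentence_contrast_loss(T, C, Q, img_idx, temperature):
    """InfoNCE over the batch; each sentence's own image is the positive.
    T: (n_sent, N), C: (n_sent, D), Q: (b, N, D),
    img_idx: (n_sent,) index of each sentence's image within the batch."""
    tgic = torch.einsum('sn,bnd->sbd', T, Q)   # TGIC of every sentence vs. every image
    sims = F.cosine_similarity(C.unsqueeze(1), tgic, dim=-1)  # (n_sent, b)
    return F.cross_entropy(sims / temperature, img_idx)
```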
Secondly, the topic vector T is required to satisfy local sparsity and global balance. That is, for a given vector T, only a few elements should be large, since one sentence does not involve many topics; while globally, across many T's, the values at each position should be balanced, to ensure a fine-grained division of topics. Another penalty is therefore introduced. For the n sentences in a report, whose topic vectors form $G \in \mathbb{R}^{n \times N}$, the topic loss is added:

$$L_{topic}=\frac{1}{n}\sum_{i=1}^{n}H(G_i)-\lambda\, H\Big(\frac{1}{n}\sum_{i=1}^{n}G_i\Big),\qquad H(v)=-\sum_{i=1}^{N}v_i\log v_i$$

wherein $L_{topic}$ is the topic loss, $\lambda$ is a hyper-parameter, n is the number of sentences in the image report, H(·) is the entropy function of a vector, N is the length of vector v, $v_i$ is the ith element of v, and $G_i$ is the topic vector of the ith sentence in the image report;
meanwhile, a topic inconsistency loss is added, expressed as:

$$L_{diff}=\frac{1}{m(m-1)}\sum_{i=1}^{m}\sum_{k\ne i}\cos(T_i,T_k)$$

wherein $L_{diff}$ denotes the topic inconsistency loss, m denotes the number of sentences in the image report, cos denotes the cosine of the angle between two vectors, $T_i$ denotes the topic vector of the ith sentence in the image report, and $T_k$ denotes the topic vector of the kth sentence in the image report.
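A sketch of the two topic regularizers follows, assuming G holds the n topic vectors of one report as rows; the exact reductions are reconstructions from the text above, not the patent's verbatim formulas.

```python
import torch
import torch.nn.functional as F

def entropy(v, eps=1e-8):                  # H(v) = -sum_i v_i log v_i
    return -(v * torch.log(v + eps)).sum(dim=-1)

def topic_loss(G, lam):                    # G: (n, N) topic vectors of one report
    local = entropy(G).mean()              # local sparsity: few topics per sentence
    glob = entropy(G.mean(dim=0))          # global balance: all topics get used
    return local - lam * glob

def topic_inconsistency_loss(G):
    """Penalize cosine similarity between different sentences' topic vectors."""
    n = G.size(0)
    if n < 2:
        return G.new_zeros(())
    Gn = F.normalize(G, dim=-1)
    sim = Gn @ Gn.t()                      # (n, n) pairwise cosine matrix
    off_diag = sim - torch.eye(n, device=G.device)   # drop self-similarity
    return off_diag.sum() / (n * (n - 1))
```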
Further, in an embodiment of the present application, inputting the feature vector of the medical image, the topic vector of each sentence in the image report, and the word sequence of each sentence in the image report into the hierarchical report generation module for training based on the second joint loss function includes:
inputting the topic vector of each sentence in the image report into a vector quantization module, and acquiring a target vector closest to the topic vector of each sentence through a discrete space of the vector quantization module;
constructing a loss function according to the topic vector of each sentence and the obtained target vector, and training a vector quantization module;
the loss function of the vector quantization module is expressed as:

$$L_{vq}=\big\|\mathrm{sg}(v)-e_v\big\|_2^2+\big\|\mathrm{sg}(v_e)-e\big\|_2^2$$

wherein $L_{vq}$ is the loss function of the vector quantization module, sg(·) is the gradient cut-off (stop-gradient) operation, v is an input vector, $e_v$ is the vector in the discrete space of the vector quantization module nearest to the input vector v, e is an unused vector in the discrete space of the vector quantization module, and $v_e$ is the vector among the input vectors nearest to e.
The vector quantization module is a Vector Quantization (VQ) module. This module maps an input vector v to the discrete space E. Specifically, the VQ module contains K N-dimensional vectors, $E \in \mathbb{R}^{K \times N}$. For an input N-dimensional vector v, the VQ module finds the vector $e_v$ closest to it in E and outputs $e_v$ and the corresponding subscript. The loss function used in training the vector quantization module comprises two parts: one part pulls the vector in E closest to the input vector toward the input vector, and the other part pulls vectors in E that have not been used recently toward the input vector closest to them.
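A minimal sketch of this VQ step, assuming a codebook E of K N-dimensional vectors; the straight-through estimator and the omission of the unused-code reset term are simplifications of what is described above.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, K=256, dim=49):          # dim = N, the topic-vector length
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(K, dim))   # discrete space E

    def forward(self, v):                        # v: (B, dim) input topic vectors
        dists = torch.cdist(v, self.codebook)    # (B, K) distances to all codes
        idx = dists.argmin(dim=-1)               # subscript of nearest code e_v
        e_v = self.codebook[idx]                 # (B, dim)
        # First loss term: pull the nearest code toward the (detached) input.
        vq_loss = ((v.detach() - e_v) ** 2).mean()
        # Straight-through: forward pass uses e_v, gradients flow back to v.
        quantized = v + (e_v - v).detach()
        return quantized, idx, vq_loss
```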
Further, in an embodiment of the present application, inputting the feature vector of the medical image, the topic vector of each sentence in the image report, and the word sequence of each sentence in the image report into the hierarchical report generation module for training based on the second joint loss function includes:
inputting the feature vector of the medical image into a sentence decoder, predicting the sentence topic vector of the medical image through the sentence decoder, and constructing a loss function to train the sentence decoder according to the topic vector of each sentence in the image report and the sentence topic vector obtained by prediction;
The loss function of the sentence decoder is expressed as:

$$L_{sg}=-\sum_{i=1}^{n}\log P\big(t_i \mid t_{1:i-1}, Q;\ \theta\big)$$

wherein $L_{sg}$ is the loss function of the sentence decoder, P is the probability function, $t_i$ is the topic vector of the ith sentence in the image report, $t_{1:i-1}$ is the topic vector sequence from the 1st to the (i-1)th sentence in the image report, Q is the feature vector of the medical image, and θ is the parameter of the sentence decoder.
The sentence decoder is composed of one Transformer Decoder and predicts the topic of each sentence in the report from the image feature vectors. The N feature vectors obtained from the input image by the image encoder serve as the input to the sentence decoder's cross-attention, and the decoder autoregressively predicts (generates) sentence topic vectors. It does not predict the vectors directly; rather, it predicts a sequence of subscripts, from which the corresponding topic vectors can be fetched from the VQ module.
For one report, if the true topic vectors of its sentences (i.e., the topic vectors obtained by the text encoder) correspond in the VQ module to $t = t_1, t_2, \ldots, t_n$ respectively, the loss function of the sentence decoder is:

$$L_{sg}=-\sum_{i=1}^{n}\log P\big(t_i \mid t_{1:i-1}, Q;\ \theta\big)$$

wherein P is the probability function, Q is the feature vector sequence of the image corresponding to the report, and θ is the parameter of the sentence decoder.
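A hedged sketch of such a sentence decoder: a TransformerDecoder that cross-attends to the image features Q and is trained with cross-entropy over the K codebook subscripts; all names, shapes, and the [BOS] convention are assumptions.

```python
import torch
import torch.nn as nn

class SentenceDecoder(nn.Module):
    def __init__(self, K=256, d_model=512, nhead=8, num_layers=3):
        super().__init__()
        self.idx_emb = nn.Embedding(K + 1, d_model)       # extra slot = [BOS]
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, K)                  # logits over the codebook

    def forward(self, topic_idx, Q):
        # topic_idx: (B, n) ground-truth subscripts t_1..t_n; Q: (B, N, D)
        bos = torch.full_like(topic_idx[:, :1], self.idx_emb.num_embeddings - 1)
        inp = self.idx_emb(torch.cat([bos, topic_idx[:, :-1]], dim=1))
        mask = nn.Transformer.generate_square_subsequent_mask(inp.size(1)).to(inp.device)
        h = self.decoder(inp, Q, tgt_mask=mask)           # cross-attends to Q
        logits = self.out(h)                              # (B, n, K)
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), topic_idx.reshape(-1))
```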
Further, in an embodiment of the present application, inputting the feature vector of the medical image, the topic vector of each sentence in the image report, and the word sequence of each sentence in the image report into the hierarchical report generation module for training based on the second joint loss function includes:
Inputting the feature vector of the medical image and the topic vector of each sentence in the image report into a word decoder, and predicting by the word decoder to obtain a word sequence of each sentence;
constructing a loss function according to the word sequence of each sentence in the image report and the word sequence of the sentence obtained by prediction, and training a word decoder;
the loss function of the word decoder is expressed as:

$$L_{wg}=-\sum_{i=1}^{m}\log P\big(w_i \mid w_{1:i-1}, T, Q;\ \theta\big)$$

wherein $L_{wg}$ is the loss function of the word decoder, P is the probability function, T is the topic vector of the sentence, $w_i$ is the ith word of the sentence, $w_{1:i-1}$ is the sequence of the 1st to (i-1)th words in the sentence, Q is the feature vector of the medical image, and θ is the parameter of the word decoder.
The word decoder is composed of one Transformer Decoder and predicts the sequence of words contained in a sentence from the image feature vectors and the sentence's topic vector. The topic vector of the sentence is obtained from the text encoder during training and is predicted by the sentence decoder during actual use. As with the sentence decoder, the image feature vectors serve as the input to this decoder's cross-attention. The sentence topic vector is input at the first position of the sequence, on which basis the word decoder autoregressively generates the sequence of words in the sentence.
For a sentence whose word sequence is expressed as $w = w_1, w_2, \ldots, w_m$, the loss function of the word decoder is:

$$L_{wg}=-\sum_{i=1}^{m}\log P\big(w_i \mid w_{1:i-1}, T, Q;\ \theta\big)$$

wherein P is the probability function, T is the topic vector of the sentence, Q is the image feature vector sequence, and θ is the parameter of the word decoder.
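A corresponding sketch of the word decoder's training step, with the topic vector occupying position 0 of the decoder input and the image features as the cross-attention memory; the vocabulary size and the topic projection are assumptions.

```python
import torch
import torch.nn as nn

class WordDecoder(nn.Module):
    def __init__(self, vocab_size=5000, d_model=512, n_topics=49,
                 nhead=8, num_layers=3):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.topic_proj = nn.Linear(n_topics, d_model)    # lift T to model width
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, words, T, Q):
        # words: (B, m) target tokens; T: (B, n_topics); Q: (B, N, D)
        topic = self.topic_proj(T).unsqueeze(1)           # (B, 1, D) at position 0
        inp = torch.cat([topic, self.tok_emb(words[:, :-1])], dim=1)  # length m
        mask = nn.Transformer.generate_square_subsequent_mask(inp.size(1)).to(inp.device)
        h = self.decoder(inp, Q, tgt_mask=mask)
        logits = self.out(h)                              # (B, m, vocab)
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), words.reshape(-1))
```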
As shown in fig. 2, training the sentence-level image-text contrast learning module of the image report generation model includes acquiring a plurality of image-report pairs; inputting the image into the convolutional neural network and Transformer encoder of the image encoder to obtain the feature vectors of the medical image; inputting the image report into the text encoder to generate the topic vector and content vector of each sentence in the image report; and constructing the contrast loss $L_{silc}$, the topic loss $L_{topic}$, and the topic inconsistency loss $L_{diff}$ from the feature vectors of the medical image and the topic and content vectors of each sentence in the image report, with which the image encoder and the text encoder are trained.
Training the hierarchical report generation module of the image report generation model includes: inputting the topic vector of each sentence in the image report into the vector quantization module VQ, obtaining the target vector closest to each sentence's topic vector in the discrete space of the vector quantization module, and constructing the loss function $L_{vq}$ to train the vector quantization module; inputting the feature vector of the medical image into the sentence decoder, predicting the sentence topic vectors of the medical image through the sentence decoder, and constructing the loss function $L_{sg}$ to train the sentence decoder; inputting the feature vector of the medical image and the topic vector of each sentence in the image report into the word decoder, predicting the word sequence of each sentence through the word decoder, and constructing the loss function $L_{wg}$ from the word sequence of each sentence in the image report and the predicted word sequence to train the word decoder.
Further, in an embodiment of the present application, inputting a medical image to be identified into a trained image report generation model, generating an image report of the medical image to be identified includes:
inputting the medical image to be identified into an image encoder for encoding to obtain a feature vector of the medical image to be identified;
inputting the feature vectors of the medical images to be identified into a sentence decoder, predicting to obtain subscripts of topic vectors of a plurality of sentences, and acquiring the topic vectors of the plurality of sentences from a vector quantization module according to the subscripts of the topic vectors of the plurality of sentences;
inputting the topic vectors of the multiple sentences into a word decoder, sequentially generating word sequences of each sentence according to the feature vector of the medical image to be identified and the topic vector of each sentence to obtain the multiple sentences, and connecting the multiple sentences to obtain an image report of the medical image to be identified.
As shown in fig. 3, the medical image to be identified is input into the image encoder for encoding to obtain its feature vectors, which are input into the sentence decoder and the word decoder; the sentence decoder predicts, from the feature vectors of the medical image to be identified, the subscripts of the topic vectors of several sentences (e.g., $T_{47}, \cdots, T_{213}, T_{189}$ in fig. 3); the subscripts are input into the vector quantization module, and the topic vectors of the sentences are fetched from it accordingly; the word decoder sequentially generates the word sequence of each sentence from the feature vectors of the medical image to be identified and the topic vector of each sentence, obtaining multiple sentences, which are concatenated to obtain the image report of the medical image to be identified.
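Under the interfaces sketched above, the inference flow of fig. 3 could look as follows; the generate methods, tokenizer, and stopping rule are assumptions for illustration.

```python
import torch

def generate_report(image, image_encoder, sentence_decoder, vq, word_decoder,
                    tokenizer, max_sentences=10):
    with torch.no_grad():
        Q = image_encoder(image.unsqueeze(0))             # (1, N, D) features
        # Autoregressively predict codebook subscripts (hypothetical helper).
        topic_indices = sentence_decoder.generate(Q, max_len=max_sentences)
        sentences = []
        for idx in topic_indices:                         # e.g. 47, ..., 213, 189
            T = vq.codebook[idx].unsqueeze(0)             # look up topic vector
            token_ids = word_decoder.generate(T, Q)       # hypothetical greedy decode
            sentences.append(tokenizer.decode(token_ids))
    return ' '.join(sentences)
```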
According to the method, sentence-level image-text contrast learning is used as an auxiliary task for report generation, an image encoder is forced to learn corresponding image features for each topic, feature extraction capacity of the image encoder is enhanced, meanwhile, topics of each sentence are generated through a hierarchical report generation module, then actual sentences are generated according to the topics and the images, and the method is consistent with the properties of an image report and the habit of a doctor for writing the image report, so that a more accurate report can be generated.
Fig. 4 is a schematic structural diagram of a hierarchical image report generating device combining sentence-level contrast learning according to a second embodiment of the present application.
As shown in fig. 4, the hierarchical image report generating device combined with sentence level contrast learning includes:
an acquisition module 10 for acquiring a plurality of medical image-image report pairs and constructing an image report generation model, wherein the image report generation model comprises a sentence-level image-text contrast learning module and a hierarchical report generation module;
a first training module 20, configured to input the plurality of medical image-image report pairs into the sentence-level image-text contrast learning module for training based on a first joint loss function, and to output the feature vector of the medical image and the topic vector and content vector of each sentence in the image report;
the second training module 30 is configured to input the feature vector of the medical image, the topic vector of each sentence in the image report, and the word sequence of each sentence in the image report into the hierarchical report generation module for training based on the second joint loss function, so as to obtain a trained image report generation model;
the generating module 40 is configured to acquire a medical image to be identified, input the medical image to be identified into the trained image report generating model, and generate an image report of the medical image to be identified.
The hierarchical image report generating device combining sentence-level contrast learning according to the embodiment of the application includes: an acquisition module for acquiring a plurality of medical image-image report pairs and constructing an image report generation model, wherein the image report generation model comprises a sentence-level image-text contrast learning module and a hierarchical report generation module; a first training module for inputting the plurality of medical image-image report pairs into the sentence-level image-text contrast learning module for training based on a first joint loss function, and outputting the feature vector of the medical image and the topic vector and content vector of each sentence in the image report; a second training module for inputting the feature vector of the medical image, the topic vector of each sentence in the image report, and the word sequence of each sentence in the image report into the hierarchical report generation module for training based on a second joint loss function, to obtain a trained image report generation model; and a generation module for acquiring a medical image to be identified, inputting it into the trained image report generation model, and generating its image report. This solves the technical problem that existing methods generate the whole image report as a single text, which reduces accuracy. By using sentence-level image-text contrast learning as an auxiliary task for report generation, the image encoder is forced to learn corresponding image features for each topic, enhancing its feature extraction capability; meanwhile, the hierarchical report generation module first generates the topic of each sentence and then generates the actual sentence according to the topic and the image, improving the accuracy of the generated image report.
In order to implement the above embodiment, the application further provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the hierarchical image report generation method combining sentence-level contrast learning of the above embodiment.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" is at least two, such as two, three, etc., unless explicitly defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and additional implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like. Although embodiments of the present application have been shown and described above, it should be understood that the above embodiments are illustrative and not to be construed as limiting the application; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the application.

Claims (10)

1. A hierarchical image report generation method combining sentence-level contrastive learning, characterized by comprising the following steps:
acquiring a plurality of medical image-image report pairs, and constructing an image report generation model, wherein the image report generation model comprises a sentence-level image-text contrastive learning module and a hierarchical report generation module;
inputting the plurality of medical image-image report pairs into the sentence-level image-text contrastive learning module for training based on a first joint loss function, and outputting a feature vector of each medical image and a topic vector and a content vector of each sentence in each image report;
inputting the feature vector of the medical image, the topic vector of each sentence in the image report and the word sequence of each sentence in the image report into the hierarchical report generation module for training based on a second joint loss function to obtain a trained image report generation model;
and acquiring a medical image to be identified, inputting the medical image to be identified into the trained image report generation model, and generating an image report of the medical image to be identified.
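Purely for illustration, a minimal PyTorch sketch of the two-stage training flow described in claim 1; the attribute names contrastive_module and generator and the joint_loss interfaces are assumptions, not names taken from the patent:

    import torch

    def train_two_stage(model, loader, epochs=1):
        # Stage 1: train the sentence-level image-text contrastive learning
        # module with the first joint loss.
        opt1 = torch.optim.Adam(model.contrastive_module.parameters())
        for _ in range(epochs):
            for images, reports in loader:
                loss = model.contrastive_module.joint_loss(images, reports)
                opt1.zero_grad(); loss.backward(); opt1.step()
        # Stage 2: train the hierarchical report generation module with the
        # second joint loss, reusing the stage-1 outputs as supervision.
        opt2 = torch.optim.Adam(model.generator.parameters())
        for _ in range(epochs):
            for images, reports in loader:
                loss = model.generator.joint_loss(images, reports)
                opt2.zero_grad(); loss.backward(); opt2.step()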
2. The method of claim 1, wherein constructing the image report generation model comprises:
constructing the sentence-level image-text contrastive learning module from an image encoder and a text encoder;
constructing the hierarchical report generation module from a vector quantization module, a sentence decoder and a word decoder;
and obtaining the image report generation model from the sentence-level image-text contrastive learning module and the hierarchical report generation module.
3. The method of claim 2, wherein inputting the plurality of medical image-image report pairs into the sentence-level image-text contrastive learning module for training based on the first joint loss function, and outputting the feature vectors of the medical images and the topic vectors and content vectors of each sentence in the image reports, comprises:
inputting a medical image into the convolutional neural network of the image encoder to obtain a feature map, and flattening the feature map into a sequence that is input to the Transformer encoder of the image encoder to obtain the feature vectors of the medical image;
inputting an image report into the text encoder, processing each sentence of the image report into a sequence, and inputting the sequence to the Transformer encoder of the text encoder to generate the topic vector and the content vector of each sentence in the image report;
and constructing the first joint loss function from the feature vector of the medical image and the topic vector and the content vector of each sentence in the image report, and training the image encoder and the text encoder with the first joint loss function.
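As a hedged sketch of the image encoder of claim 3 (a CNN feature map, flattened into a sequence for a Transformer encoder); the backbone, dimensions and layer counts here are assumptions, not the patent's reference implementation:

    import torch
    import torch.nn as nn

    class ImageEncoder(nn.Module):
        def __init__(self, d_model=512):
            super().__init__()
            # Stand-in for a pretrained CNN backbone producing a feature map.
            self.cnn = nn.Sequential(
                nn.Conv2d(3, d_model, kernel_size=7, stride=4, padding=3),
                nn.ReLU(),
                nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            )
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, num_layers=3)

        def forward(self, image):                  # image: (B, 3, H, W)
            fmap = self.cnn(image)                 # feature map (B, d, h, w)
            seq = fmap.flatten(2).transpose(1, 2)  # flattened sequence (B, h*w, d)
            return self.transformer(seq)           # feature vectors q_i of the image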
4. The method of claim 3, wherein constructing a first joint loss function from the feature vector of the medical image, the topic vector of each sentence in the image report, and the content vector of each sentence in the image report comprises:
calculating topic-guided image features of each sentence according to the topic vector of each sentence in the image report and the feature vector of the medical image;
calculating the similarity between the topic-guided image feature and the content vector of each sentence, and constructing a contrastive loss from the similarity;
and constructing the first joint loss function from the contrastive loss by introducing a topic loss and a topic inconsistency loss.
The calculation formula of the similarity is expressed as:

$$\mathrm{Sim}(S, I) = \cos\Big(C,\ \sum_i \alpha_i q_i\Big), \qquad \alpha_i = \frac{\exp\big(\cos(T, q_i)\big)}{\sum_k \exp\big(\cos(T, q_k)\big)}$$

wherein S is a sentence in the image report, I is a medical image, C is the content vector of the sentence S, T is the topic vector of the sentence S, q_i is the ith feature vector of the medical image I, α_i is the topic-guided attention weight of q_i, and cos denotes the cosine of the angle between two vectors.
The contrastive loss is expressed as:

$$L_{silc} = -\frac{1}{b} \sum_{i=1}^{b} \sum_{S \in R_i} \log \frac{\exp\big(\mathrm{Sim}(S, I_i)/t\big)}{\sum_{k=1}^{b} \exp\big(\mathrm{Sim}(S, I_k)/t\big)}$$

wherein L_silc is the contrastive loss, b is the training batch size, R_i is the ith image report in the training batch, S is a sentence of the ith image report R_i, I_i is the ith medical image in the training batch, I_k is the kth medical image in the training batch, Sim(x, y) denotes the similarity between sentence x and medical image y, and t is a trainable temperature parameter.
the topic loss is expressed as:

$$L_{topic} = \frac{\lambda}{n} \sum_{i=1}^{n} H(G_i), \qquad H(v) = -\sum_{i=1}^{N} v_i \log v_i$$

wherein L_topic is the topic loss, λ is a hyper-parameter, n is the number of sentences in the image report, H(·) is the entropy function of a vector, N is the length of the vector v, v_i is the ith element of v, and G_i is the topic vector of the ith sentence in the image report.
the topic inconsistency loss is expressed as:

$$L_{diff} = \frac{1}{m(m-1)} \sum_{i=1}^{m} \sum_{\substack{k=1 \\ k \neq i}}^{m} \cos(T_i, T_k)$$

wherein L_diff denotes the topic inconsistency loss, m denotes the number of sentences in the image report, cos denotes the cosine of the angle between two vectors, T_i denotes the topic vector of the ith sentence in the image report, and T_k denotes the topic vector of the kth sentence in the image report.
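For concreteness, a hedged PyTorch sketch of the sentence-level contrastive loss of claim 4; realising the topic-guided image feature as cosine attention over image regions is an assumption based on the variable definitions above:

    import torch
    import torch.nn.functional as F

    def topic_guided_similarity(content, topic, img_feats):
        # content, topic: (d,) vectors of one sentence; img_feats: (n_regions, d).
        attn = F.softmax(F.cosine_similarity(topic.unsqueeze(0), img_feats, dim=-1),
                         dim=0)
        guided = (attn.unsqueeze(1) * img_feats).sum(dim=0)  # topic-guided feature
        return F.cosine_similarity(content, guided, dim=0)

    def sentence_level_contrastive_loss(contents, topics, owners, img_feats, t=0.07):
        # contents, topics: per-sentence vectors gathered over the whole batch;
        # owners[j]: index of the image paired with sentence j; img_feats: (b, n, d).
        b = img_feats.shape[0]
        losses = []
        for c, tp, own in zip(contents, topics, owners):
            sims = torch.stack([topic_guided_similarity(c, tp, img_feats[k])
                                for k in range(b)])
            # Softmax over exp(Sim/t) against the paired image == cross-entropy.
            losses.append(F.cross_entropy(sims.unsqueeze(0) / t,
                                          torch.tensor([own])))
        return torch.stack(losses).mean()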
5. The method of claim 2, wherein inputting the feature vector of the medical image, the topic vector of each sentence in the image report, and the word sequence of each sentence in the image report into the hierarchical report generation module for training based on the second joint loss function comprises:
inputting the topic vector of each sentence in the image report into the vector quantization module, and acquiring the target vector closest to the topic vector of each sentence from the discrete space of the vector quantization module;
and constructing a loss function from the topic vector of each sentence and the acquired target vector to train the vector quantization module;
the loss function of the vector quantization module is expressed as:

$$L_{vq} = \big\|\mathrm{sg}(v) - e_v\big\|_2^2 + \big\|v - \mathrm{sg}(e_v)\big\|_2^2 + \big\|\mathrm{sg}(v_e) - e\big\|_2^2$$

wherein L_vq is the loss function of the vector quantization module, sg(·) is the gradient cut-off (stop-gradient) operation, v is an input vector, e_v is the vector nearest to the input vector v in the discrete space of the vector quantization module, e is an unused vector in the discrete space of the vector quantization module, and v_e is the vector nearest to e among the input vectors.
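A minimal sketch of the vector quantization step of claim 5, using detach() for the gradient cut-off sg() and the usual straight-through gradient copy; the term pulling unused codebook vectors toward nearby inputs is omitted here for brevity, and the codebook size is an assumption:

    import torch
    import torch.nn as nn

    class VectorQuantizer(nn.Module):
        def __init__(self, num_codes=512, d=512):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, d)  # the discrete space

        def forward(self, v):                     # v: (B, d) sentence topic vectors
            dist = torch.cdist(v, self.codebook.weight)
            idx = dist.argmin(dim=1)              # subscript of the nearest vector
            e_v = self.codebook(idx)              # target vector e_v closest to v
            loss = ((v.detach() - e_v) ** 2).mean() \
                 + ((v - e_v.detach()) ** 2).mean()
            quantized = v + (e_v - v).detach()    # straight-through estimator
            return quantized, idx, loss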
6. The method of claim 2, wherein inputting the feature vector of the medical image, the topic vector of each sentence in the image report, and the word sequence of each sentence in the image report into the hierarchical report generation module for training based on the second joint loss function comprises:
inputting the feature vector of the medical image into the sentence decoder, predicting the sentence topic vectors of the medical image with the sentence decoder, and constructing a loss function from the topic vector of each sentence in the image report and the predicted sentence topic vectors to train the sentence decoder;
the loss function of the sentence decoder is expressed as:

$$L_{sg} = -\sum_{i=1}^{n} \log P\big(t_i \mid t_{1:i-1}, Q; \theta\big)$$

wherein L_sg is the loss function of the sentence decoder, P is the probability function, n is the number of sentences in the image report, t_i is the topic vector of the ith sentence in the image report, t_{1:i-1} is the sequence of topic vectors of the 1st to (i-1)th sentences in the image report, Q is the feature vector of the medical image, and θ is the parameters of the sentence decoder.
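A minimal sketch of an autoregressive sentence decoder consistent with claims 6 and 8 (it classifies over codebook subscripts rather than regressing vectors, which is an assumption); all layer choices are illustrative:

    import torch
    import torch.nn as nn

    class SentenceDecoder(nn.Module):
        def __init__(self, num_codes=512, d=512):
            super().__init__()
            self.embed = nn.Embedding(num_codes, d)
            layer = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=3)
            self.head = nn.Linear(d, num_codes)

        def forward(self, topic_idx, img_feats):
            # topic_idx: (B, n) ground-truth codebook subscripts t_1..t_n;
            # img_feats: (B, n_regions, d) feature vectors Q of the image.
            tgt = self.embed(topic_idx)
            mask = nn.Transformer.generate_square_subsequent_mask(tgt.shape[1])
            h = self.decoder(tgt, img_feats, tgt_mask=mask)
            # Cross-entropy over these logits realises L_sg (teacher forcing).
            return self.head(h)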
7. The method of claim 2, wherein inputting the feature vector of the medical image, the topic vector of each sentence in the image report, and the word sequence of each sentence in the image report into the hierarchical report generation module for training based on the second joint loss function comprises:
inputting the feature vector of the medical image and the topic vector of each sentence in the image report into the word decoder, and predicting the word sequence of each sentence with the word decoder;
and constructing a loss function from the word sequence of each sentence in the image report and the predicted word sequence to train the word decoder;
the loss function of the word decoder is expressed as:

$$L_{wg} = -\sum_{i} \log P\big(w_i \mid w_{1:i-1}, T, Q; \theta\big)$$

wherein L_wg is the loss function of the word decoder, P is the probability function, T is the topic vector of the sentence, w_i is the ith word of the sentence, w_{1:i-1} is the sequence of the 1st to (i-1)th words in the sentence, the sum runs over the words of the sentence, Q is the feature vector of the medical image, and θ is the parameters of the word decoder.
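Since L_wg is a standard conditional language-modelling objective, it reduces to a shifted cross-entropy; a sketch, assuming the decoder has already produced per-position logits conditioned on T and Q:

    import torch.nn.functional as F

    def word_decoder_loss(logits, words):
        # logits: (B, L, vocab), where position i-1 scores the prediction of
        # word w_i given w_{1:i-1}, the topic vector T and image features Q;
        # words: (B, L) ground-truth word ids.
        return F.cross_entropy(logits[:, :-1].reshape(-1, logits.shape[-1]),
                               words[:, 1:].reshape(-1))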
8. The method of claim 1, wherein inputting the medical image to be identified into the trained image report generation model to generate an image report of the medical image to be identified comprises:
inputting the medical image to be identified into the image encoder to obtain the feature vector of the medical image to be identified;
inputting the feature vector of the medical image to be identified into the sentence decoder, predicting the subscripts of the topic vectors of a plurality of sentences, and acquiring the topic vectors of the plurality of sentences from the vector quantization module according to the subscripts;
and inputting the topic vectors of the plurality of sentences into the word decoder, sequentially generating the word sequence of each sentence from the feature vector of the medical image to be identified and the topic vector of each sentence to obtain a plurality of sentences, and connecting the sentences to obtain the image report of the medical image to be identified.
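A hedged sketch of the inference path of claim 8; generate() on the decoders stands for assumed greedy-decoding helpers that are not part of the patent text:

    def generate_report(image, image_encoder, sentence_decoder, vq, word_decoder):
        q = image_encoder(image)                  # feature vectors of the image
        topic_idx = sentence_decoder.generate(q)  # predicted topic-vector subscripts
        topics = vq.codebook(topic_idx)           # look-up in the discrete space
        sentences = [word_decoder.generate(q, t)  # one sentence per topic vector
                     for t in topics.unbind(dim=1)]
        return " ".join(sentences)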
9. A hierarchical image report generation device combining sentence-level contrastive learning, comprising:
an acquisition module, which is used for acquiring a plurality of medical image-image report pairs and constructing an image report generation model, wherein the image report generation model comprises a sentence-level image-text contrastive learning module and a hierarchical report generation module;
a first training module, which is used for inputting the plurality of medical image-image report pairs into the sentence-level image-text contrastive learning module for training based on a first joint loss function, and outputting the feature vector of the medical image and the topic vector and the content vector of each sentence in the image report;
a second training module, which is used for inputting the feature vector of the medical image, the topic vector of each sentence in the image report and the word sequence of each sentence in the image report into the hierarchical report generation module for training based on a second joint loss function to obtain a trained image report generation model;
and a generation module, which is used for acquiring a medical image to be identified, inputting the medical image to be identified into the trained image report generation model, and generating an image report of the medical image to be identified.
10. A non-transitory computer readable storage medium, having stored thereon a computer program, which when executed by a processor, implements the method according to any of claims 1-8.
CN202310320888.6A 2023-03-29 2023-03-29 Hierarchical image report generation method and device combined with sentence level contrast learning Pending CN116525052A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310320888.6A CN116525052A (en) 2023-03-29 2023-03-29 Hierarchical image report generation method and device combined with sentence level contrast learning

Publications (1)

Publication Number Publication Date
CN116525052A true CN116525052A (en) 2023-08-01

Family

ID=87389392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310320888.6A Pending CN116525052A (en) 2023-03-29 2023-03-29 Hierarchical image report generation method and device combined with sentence level contrast learning

Country Status (1)

Country Link
CN (1) CN116525052A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117174240A (en) * 2023-10-26 2023-12-05 中国科学技术大学 Medical image report generation method based on large model field migration
CN117174240B (en) * 2023-10-26 2024-02-09 中国科学技术大学 Medical image report generation method based on large model field migration

Similar Documents

Publication Publication Date Title
WO2022199462A1 (en) Method for training medical image report generation model, and image report generation method
CN109919928B (en) Medical image detection method and device and storage medium
US11024066B2 (en) Presentation generating system for medical images, training method thereof and presentation generating method
CN109344833B (en) Medical image segmentation method, segmentation system and computer-readable storage medium
CN113822969B (en) Training neural radiation field model, face generation method, device and server
CN108984679B (en) Training method and device for dialogue generation model
KR102177568B1 (en) Method for semi supervised reinforcement learning using data with label and data without label together and apparatus using the same
CN110619313B (en) Remote sensing image discriminant description generation method
CN112634296A (en) RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism
CN114240955B (en) Semi-supervised cross-domain self-adaptive image segmentation method
CN111325714B (en) Method for processing region of interest, computer device and readable storage medium
CN113177572A (en) Method and computer readable medium for automatic learning from sensors
CN116525052A (en) Hierarchical image report generation method and device combined with sentence level contrast learning
CN115359314A (en) Model training method, image editing method, device, medium and electronic equipment
CN115797495A (en) Method for generating image by text sensed by sentence-character semantic space fusion
CN114782596A (en) Voice-driven human face animation generation method, device, equipment and storage medium
CN113591874B (en) Paragraph level image description generation method with long-time memory enhancement
CN111209946B (en) Three-dimensional image processing method, image processing model training method and medium
CN111986242B (en) Method and device for determining brain tissue partition, storage medium and electronic equipment
CN117350979A (en) Arbitrary focus segmentation and tracking system based on medical ultrasonic image
CN115190999A (en) Classifying data outside of a distribution using contrast loss
CN116344070B (en) Pathological feature and clinical information fusion method and system
CN116958693A (en) Image analysis method, apparatus, device, storage medium, and program product
Spinks et al. Generating Text from Images in a Smooth Representation Space.
CN117746045B (en) Method and system for segmenting medical image by fusion of transducer and convolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination