CN115293128A - Radiology report generation model training method and system based on multi-modal contrastive learning - Google Patents

Radiology report generation model training method and system based on multi-modal contrastive learning

Info

Publication number
CN115293128A
Authority
CN
China
Prior art keywords
sentence
image
encoder
learning
visual features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210931458.3A
Other languages
Chinese (zh)
Inventor
武星
李婧雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202210931458.3A priority Critical patent/CN115293128A/en
Publication of CN115293128A publication Critical patent/CN115293128A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/194 Calculation of difference between files
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/126 Character encoding
    • G06F 40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 15/00 ICT specially adapted for medical reports, e.g. generation or transmission thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Image Analysis (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a radiology report generation model training method and system based on multi-modal contrastive learning, comprising the following steps: using a self-supervised representation learning method, sentence representations are learned through a sentence-level training strategy based on contrastive learning, and image representations are obtained through bidirectional contrastive learning between paired images and texts; the learned image encoder and sentence encoder are then embedded into a radiology report generation model, which generates the Impression part of the report through an encoding-decoding process and recursively generates the Findings part of the report by fusing visual features and semantic features. The radiology report generation model training method and system based on multi-modal contrastive learning optimize the representations of images and texts, realize the radiology report generation task, assist radiologists in making diagnoses, and improve accuracy in the diagnostic process.

Description

Radiology report generation model training method and system based on multi-modal contrastive learning
Technical Field
The invention relates to a radiology report generation model training method and system based on multi-modal contrastive learning, and belongs to the fields of computer science and medicine.
Background
Medical images, such as radiological images, are widely used for the diagnosis of diseases. Reading and understanding medical images is typically performed by professional medical personnel, who analyze the images under examination, identify normal and abnormal areas in them, and use learned medical knowledge and accumulated working experience to compose radiology reports. However, the generated reports can contain errors due to gaps in the radiologist's knowledge, faulty reasoning, staff shortages, excessive workload, and other factors. Therefore, the automatic generation of radiology reports has become an attractive research direction for artificial intelligence and clinical medicine, aiming to reduce the workload of radiologists and minimize the occurrence of errors.
Imaging studies are typically accompanied by radiology reports that document the radiologist's observations in routine clinical care; such a report consists of the Impression and Findings sections and is the most direct transcription of the imaging study. The Impression part is a conclusive diagnosis and can be regarded as the report's conclusion or topic sentence, while the Findings part is a paragraph consisting of multiple structured sentences, each focusing on a specific medical observation of a particular region in the radiology image. These sentences are typically longer and more complex than the captions in standard image captioning datasets. Therefore, many existing image captioning models are not directly suitable for this task, which requires dedicated solutions.
In recent years, much work has explored the automatic generation of radiology reports, and many new ideas building on traditional generation methods have been proposed. However, existing generation methods generally suffer from several problems: (1) previous studies mostly use CNN encoders pre-trained on ImageNet, which are not well suited to medical images; (2) the sentence representations produced by the LSTM-based or BERT-based sentence encoders used in previous studies are of low quality; (3) the generated reports are not semantically coherent in the medical domain, and optimizing the clinical accuracy of the generated reports comes at the cost of degrading performance on other metrics to some extent.
Disclosure of Invention
The purpose of the invention is to realize the radiology report generation task, to improve the visual features extracted from medical images and the semantic features extracted from texts, and to enhance the consistency between images and texts.
In order to achieve the above object, one technical solution of the present invention is to provide a radiology report generation model training method based on multi-modal contrastive learning, characterized by comprising the following steps:
s100, sample data acquisition:
acquiring radiological images and text data, and transmitting the image data and the corresponding text data to a sample database, wherein each text data comprises a conclusive diagnostic statement Impression and a detailed description paragraph Findings;
s200, multi-modal contrast learning:
using a self-supervised representation learning method, an image encoder for extracting visual features from image data and a sentence encoder for extracting semantic features from text data are learned from the sample database, wherein:
the sentence encoder learns sentence representations through a sentence-level training strategy based on contrastive learning;
the image encoder learns image representations through a bidirectional contrastive learning objective between paired image data and text data;
s300, generating a radiology report:
recursively generating the diagnostic sentence Impression and the description paragraph Findings of the radiology report by fusing the visual features extracted by the image encoder and the semantic features extracted by the sentence encoder, specifically comprising the following steps:
s301, generating a single diagnosis statement Impression by an Impression generating module based on an encoder-decoder framework, and specifically comprising the following steps:
the image encoder extracts visual features from an input image, then sends the visual features into an Impression part sentence decoder, and generates a whole sentence word by word as a diagnosis sentence Impression;
s302, fusing the visual features of the image and the semantic features of the sentences by a Findings generation module, generating sentences cyclically, and finally generating a long paragraph containing a plurality of structured sentences, specifically comprising the following steps:
in the Findings generation module, in order to make each generated sentence focus on describing a different image region, the visual features output by the image encoder and the semantic features of the previous sentence output by the sentence encoder are fed, based on an attention framework, into a fully connected layer and then into a SoftMax layer to obtain weighted visual features; the weighted visual features are input into the Findings sentence decoder to generate a sentence; the generated sentence is then fed into the sentence encoder as the previous sentence, yielding new weighted visual features; this process is repeated until the Findings sentence decoder generates an empty sentence, indicating that generation of the description paragraph Findings is complete;
wherein the encoding of the previous sentence and the visual features of the image are combined through the weighted visual features to guide the generation of the next sentence, and the attention network used to compute the weighted visual representation is defined as:

V_w = Attention(v, s)

where V_w is the weighted visual representation to be obtained; Attention(·) is the attention function; v = {v_1, v_2, …, v_K}, with each v_i ∈ R^{d_v} an image feature learned by the image encoder, a d_v-dimensional representation corresponding to one region of the image; s ∈ R^{d_s} is the encoding of the previous sentence, and d_s is the dimension of the semantic features;

given the visual features v ∈ R^{d_v×K} and the encoding of the previous sentence, an attention distribution over the K image regions is generated by a single attention distributor, expressed as a single-layer neural network followed by a SoftMax function:

z = w_h^T tanh(W_v v + (W_s s) 1^T)
α = softmax(z)

where 1 ∈ R^K is a vector with all elements set to 1; W_v ∈ R^{k×d_v}, W_s ∈ R^{k×d_s} and w_h ∈ R^k are the parameters of the attention network; and α ∈ R^K contains the attention weights over the features in v;

the weighted visual representation V_w is then obtained from the attention distribution as:

V_w = Σ_{i=1}^{K} α_i v_i

where α_i is the i-th element of α;
s400, report analysis and evaluation:
the generated radiology report is evaluated using the evaluation index:
s500, outputting a result:
and combining the separately generated diagnostic sentence Impression and description paragraph Findings into a complete radiology report for output, while using multiple evaluation indexes to evaluate the output results.
Preferably, in step S100, the image data and the text data in the sample database are in one-to-one correspondence, and the sample database includes a training set, a test set and a validation set.
Preferably, in step S200, learning sentence representations through the sentence-level training strategy based on contrastive learning comprises the following steps:
for the same sentence in a sentence set (from the training set, test set or validation set), a series of sentence representations of different augmented versions is obtained by applying different data augmentation methods and used as positive examples, while the other sentences are used as negative examples;
when training the sentence encoder, the sentence encoder is built and semantic sentence embeddings are constructed by maximizing the agreement between the sentence representations of different augmented versions of the same sample while keeping the sentence vectors of different samples as far apart as possible.
Preferably, in step S200, a set of sentences X = {x_1, x_2, …, x_m} is taken, where x_i denotes the i-th sentence and m is the total number of sentences in the set; for each sentence x_i, two different data enhancement methods f(·) and f'(·) are applied to generate two different versions of its sentence embedding, e_i and e'_i:

e_i = f(x_i)
e'_i = f'(x_i)

where e_i, e'_i ∈ R^{L×D}, L is the length of the sentence embedding and D is its hidden dimension;

the sentence embeddings e_i and e'_i are then encoded to obtain the sentence representations h_i and h'_i;

then, for a mini-batch of N sentences, the training objective ℓ_i for sentence x_i is as follows:

ℓ_i = -log( exp(sim(h_i, h'_i)/τ) / Σ_{j=1}^{N} exp(sim(h_i, h'_j)/τ) )

where τ is a temperature hyperparameter and sim(·,·) is the cosine similarity, sim(h_i, h'_j) = h_i^T h'_j / (‖h_i‖·‖h'_j‖);

the final contrastive loss L_sent is the average of all N in-batch losses:

L_sent = (1/N) Σ_{i=1}^{N} ℓ_i
preferably, in step S200, learning the image representation by a bidirectional contrast learning target between the paired image and text, comprises the following steps:
by paired input (X) v ,X s ) Learning image encoder, wherein X v Representing an image or a group of images, X s Representative description of X v A sentence sequence of mid-imaging information; for each input image X v And each input sentence X d They are encoded by an image encoder f v () And a sentence encoder f s () Conversion into a fixed dimension vector h v And h s (ii) a Then, a representation h of the two modes v 、h s By projecting a function g v () And g s () Project from their encoder space to the same D-dimensional space for comparative learning;
for N input pairs (X) v ,X s ),Resulting in corresponding N representation pairs (v, s), wherein:
v=g v (f v (X v ))
s=g s (f s (X s ))
in the formula, v,
Figure BDA0003781724040000047
with (v) i ,s i ) Represent the ith pair of representations whose training objectives include two loss functions: loss of image-to-text contrast
Figure BDA0003781724040000048
And loss of text-to-image contrast
Figure BDA0003781724040000049
Figure BDA0003781724040000051
Figure BDA0003781724040000052
Ultimate loss of training
Figure BDA0003781724040000053
Is a weighted combination of two penalties:
Figure BDA0003781724040000054
wherein λ is a scalar weight;
by maximizing the correspondence between image-text representation pairs, an image encoder is learned that maps images to a fixed-dimension vector.
Preferably, in step S301, the Impression sentence decoder adopts an LSTM-based method, generating one word at each time step to produce the caption according to the context vector, the previous hidden state and the previously generated word;

the initial hidden state and cell state of the LSTM are set to zero, and the hidden state h_t at time t is modeled as:

h_t = LSTM(x_t, h_{t-1}, m_{t-1})

where x_t is the input vector and m_{t-1} is the memory cell vector at time t-1.

The visual feature vector is used as the initial input of the LSTM to predict the first word of the sentence; before being input to the LSTM, the visual feature vector output by the image encoder is converted by a fully connected layer into the same dimension as the word embeddings; the visual features are extracted from the last convolutional layer, and the LSTM then generates the whole sentence word by word.
Preferably, the step S400 includes:
s401, evaluating the generated report by using the BLEU and the variant thereof;
s402, evaluating the generated report by using METEOR;
s403, evaluating the generated report by using the ROUGE;
and S404, evaluating the generated report by using CIDEr.
Another technical solution of the present invention is to provide a radiology report generation model training system based on multi-modal contrastive learning, characterized by comprising:
the sample database is used for storing the acquired radiological images and text data, wherein each text data comprises a conclusive diagnosis sentence Impression and a detailed description paragraph Findings;
the multi-modal contrastive learning module, which adopts a self-supervised representation learning method and learns, based on the sample database, an image encoder for extracting visual features from image data and a sentence encoder for extracting semantic features from text data, wherein:
the sentence encoder learns sentence representations through a sentence-level training strategy based on contrastive learning;
the image encoder learns image representations through a bidirectional contrastive learning objective between paired image data and text data;
a radiology report generation module for recursively generating the diagnostic sentence Impression and the description paragraph Findings of a radiology report by fusing the visual features extracted by the image encoder and the semantic features extracted by the sentence encoder;
the radiology report generation module further comprises an Impression generation module and a Findings generation module, wherein:
the Impression generation module generates a single diagnostic statement Impression based on an encoder-decoder framework; its implementation comprises the following steps:
the image encoder extracts visual features from an input image, then sends the visual features into an Impression part sentence decoder, and generates a whole sentence word by word as a diagnosis sentence Impression;
the Findings generation module fuses the visual features of the image and the semantic features of the sentences, generates sentences cyclically, and finally generates a long paragraph containing a plurality of structured sentences; its implementation comprises the following steps:
in the Findings generation module, in order to make each generated sentence focus on describing a different image region, the visual features output by the image encoder and the semantic features of the previous sentence output by the sentence encoder are fed, based on an attention framework, into a fully connected layer and then into a SoftMax layer to obtain weighted visual features; the weighted visual features are input into the Findings sentence decoder to generate a sentence; the generated sentence is then fed into the sentence encoder as the previous sentence, yielding new weighted visual features; this process is repeated until the Findings sentence decoder generates an empty sentence, indicating that generation of the Findings part is complete;
wherein the encoding of the previous sentence and the visual features of the image are combined through the weighted visual features to guide the generation of the next sentence, and the attention network used to compute the weighted visual representation is defined as:

V_w = Attention(v, s)

where V_w is the weighted visual representation to be obtained; Attention(·) is the attention function; v = {v_1, v_2, …, v_K}, with each v_i ∈ R^{d_v} an image feature learned by the image encoder, a d_v-dimensional representation corresponding to one region of the image; s ∈ R^{d_s} is the encoding of the previous sentence, and d_s is the dimension of the semantic features;

given the visual features v ∈ R^{d_v×K} and the encoding of the previous sentence, an attention distribution over the K image regions is generated by a single attention distributor, expressed as a single-layer neural network followed by a SoftMax function:

z = w_h^T tanh(W_v v + (W_s s) 1^T)
α = softmax(z)

where 1 ∈ R^K is a vector with all elements set to 1; W_v ∈ R^{k×d_v}, W_s ∈ R^{k×d_s} and w_h ∈ R^k are the parameters of the attention network; and α ∈ R^K contains the attention weights over the features in v;

the weighted visual representation V_w is then obtained from the attention distribution as:

V_w = Σ_{i=1}^{K} α_i v_i

where α_i is the i-th element of α;
a report analysis evaluation module that evaluates the generated radiology report using the evaluation index;
and the result output module, which combines the separately generated diagnostic sentence Impression and description paragraph Findings into a complete radiology report for output, while using multiple evaluation indexes to evaluate the output results.
Preferably, in the radiology report generation module, the image encoder is modeled as a CNN, and the sentence encoder is modeled as a BERT language model after fine-tuning, with semantic representations of sentences generated by mean pooling.
The invention has a reasonable structural design: it performs self-supervised learning using the sample database, takes radiology images as input, and finally generates the corresponding radiology reports, assisting radiologists in making diagnoses and greatly improving accuracy in the diagnostic process.
In summary, compared with the prior art, the invention has at least the following advantages:
(1) The invention provides a recursive model based on multi-modal contrastive learning for generating radiology reports. The model combines the visual features of medical images with the semantic features of sentences and generates the Impression part and the Findings part of the radiology report through a recursive network;
(2) The invention provides a model pre-training method based on multi-modal contrastive learning to improve the expressiveness of visual features and textual features;
(3) The invention performs bidirectional contrastive learning with paired medical images and reports and pre-trains the image encoder, so that it can effectively extract visual representations and improve the consistency between image data and text data.
(4) The sentence encoder is built on a sentence-level training objective based on contrastive learning, so that it can construct semantically coherent sentence embeddings for text representation;
(5) The radiology report generation model training method and system based on multi-modal contrastive learning can effectively provide interpretable rationales; self-supervised learning is carried out using the sample database, radiology images are fed into the system, and the corresponding radiology report is finally generated, assisting radiologists in making decisions and greatly improving accuracy in the diagnostic process.
Drawings
FIG. 1 is an overall framework diagram of the present invention;
FIG. 2 shows sample data according to an embodiment of the present invention;
FIG. 3 is a block diagram of multi-modal contrastive learning in the present invention;
FIG. 4 is a schematic diagram of multi-modal feature fusion in the present invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
As shown in FIG. 1, an embodiment of the present invention provides a radiology report generation model training method based on multi-modal contrastive learning, in which a self-supervised representation learning method, i.e., multi-modal contrastive learning, is embedded into the radiology report generation model so that a corresponding radiology report is generated for each radiology image. The method includes the following steps:
s100, sample data acquisition:
the method comprises the steps of collecting a radiological image of the chest of a human body through a nuclear magnetic resonance apparatus, and transmitting image data and corresponding text data to a sample database. Specifically, referring to fig. 2, the image data and the text data in the sample database are in a one-to-one correspondence relationship, and include a training set, a test set, and a verification set. The text data is a radiology report containing a conclusive diagnostic statement Impression and a detailed description paragraph Findings.
S200, multi-modal contrast learning:
Using a self-supervised representation learning method, an image encoder and a sentence encoder are learned from the sample database, used respectively to extract visual features from images and semantic features from sentences; as shown in FIG. 3, this specifically comprises the following steps:
S201, sentence representations are learned through a sentence-level training strategy based on contrastive learning.
A set of sentences X = {x_1, x_2, …, x_m} is taken, where x_i denotes the i-th sentence and m is the total number of sentences in the set. For each sentence x_i, two different data enhancement methods f(·) and f'(·) are applied to generate two different versions of its sentence embedding, e_i and e'_i:

e_i = f(x_i)
e'_i = f'(x_i)

where e_i, e'_i ∈ R^{L×D}, L is the length of the sentence embedding and D is its hidden dimension.

The sentence embeddings e_i and e'_i are then encoded to obtain the sentence representations h_i and h'_i.
Therefore, for the same sentence, by applying different data augmentation methods such as cropping and deletion, the invention takes the series of resulting embeddings as "positive examples" and the other sentences in the same batch as "negative examples". Then, for a mini-batch of N sentences, the training objective ℓ_i for sentence x_i is as follows:

ℓ_i = -log( exp(sim(h_i, h'_i)/τ) / Σ_{j=1}^{N} exp(sim(h_i, h'_j)/τ) )

where τ is a temperature hyperparameter and sim(·,·) is the cosine similarity, sim(h_i, h'_j) = h_i^T h'_j / (‖h_i‖·‖h'_j‖). The numerator of the above formula contains the cosine similarity between h_i and its positive example h'_i, while the denominator is the sum of the cosine similarities between h_i and all h'_j, covering both the positive example and the negative examples.

The final contrastive loss L_sent is the average of all N in-batch losses:

L_sent = (1/N) Σ_{i=1}^{N} ℓ_i
by furthest improving the consistency between different enhanced versions of the same sample and simultaneously keeping sentence vectors of different samples as far as possible, the invention establishes a sentence encoder and constructs semantic sentence embedding.
S202, image representations are learned through a bidirectional contrastive learning objective between paired images and texts.
The image encoder is learned from paired inputs (X_v, X_s), where X_v denotes an image or a group of images and X_s denotes the sentence sequence describing the imaging information in X_v. Each input image X_v and each input sentence X_s is converted into a fixed-dimension vector, h_v and h_s, by the image encoder f_v(·) and the sentence encoder f_s(·), respectively. The two modality representations h_v and h_s are then projected from their encoder spaces into the same D-dimensional space by the projection functions g_v(·) and g_s(·) for contrastive learning.

Thus, for N input pairs (X_v, X_s), the corresponding N representation pairs (v, s) are obtained as:

v = g_v(f_v(X_v))
s = g_s(f_s(X_s))

where v, s ∈ R^D. Let (v_i, s_i) denote the i-th representation pair; training uses the same medical data set as the downstream task. The invention uses Info-NCE, a contrastive loss function for self-supervised learning, as the loss. The training objective of the i-th pair (v_i, s_i) includes two loss functions: an image-to-text contrastive loss ℓ_i^{(v→s)} and a text-to-image contrastive loss ℓ_i^{(s→v)}:

ℓ_i^{(v→s)} = -log( exp(sim(v_i, s_i)/τ) / Σ_{j=1}^{N} exp(sim(v_i, s_j)/τ) )

ℓ_i^{(s→v)} = -log( exp(sim(s_i, v_i)/τ) / Σ_{j=1}^{N} exp(sim(s_i, v_j)/τ) )

where sim(·,·) is the cosine similarity and τ is a temperature hyperparameter. The final training loss L_img is a weighted combination of the two losses:

L_img = (1/N) Σ_{i=1}^{N} ( λ ℓ_i^{(v→s)} + (1 - λ) ℓ_i^{(s→v)} )
where λ is a scalar weight.
By maximizing the correspondence between image-text representation pairs, the present invention learns an image encoder that maps images to a fixed-dimension vector.
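A minimal PyTorch sketch of this bidirectional Info-NCE objective follows; v and s are assumed to be the already-projected D-dimensional image and sentence representations, and τ and λ are placeholder values.

```python
# Minimal sketch of the bidirectional Info-NCE objective described above.
# v and s are assumed to be the already-projected D-dimensional image and
# sentence representations; tau and lam are placeholder values.
import torch
import torch.nn.functional as F


def bidirectional_contrastive_loss(v, s, tau=0.07, lam=0.5):
    """v, s: (N, D) projected representations of N image-text pairs."""
    v = F.normalize(v, dim=-1)
    s = F.normalize(s, dim=-1)
    logits = v @ s.t() / tau                         # logits[i, j] = sim(v_i, s_j) / tau
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2s = F.cross_entropy(logits, targets)      # image-to-text term
    loss_s2v = F.cross_entropy(logits.t(), targets)  # text-to-image term
    return lam * loss_v2s + (1.0 - lam) * loss_s2v   # weighted combination
```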
S203, the image encoder and the sentence encoder learned through the above steps are embedded into the radiology report generation model.
S300, generating a radiology report:
recursively generating an Impression part and a Findings part of the radiology report by fusing the visual features extracted by the image encoder and the semantic features extracted by the sentence encoder, comprising the steps of:
S301, a single conclusive diagnostic statement Impression is generated by the Impression generation module based on a simple encoder-decoder framework.
Specifically, the image encoder first extracts visual features from the input image, and then feeds them to the sentence decoder, generating the whole sentence word by word.
The purpose of the image encoder is to automatically extract visual features from the image and map the image into a context vector that serves as the visual input for all subsequent modules; this vector is obtained through the multi-modal contrastive pre-training. Specifically, the mapping is parameterized with a fully connected layer applied to visual features extracted from the last convolutional layer. The visual features are then fed into the sentence decoder to generate the Impression part.
The invention adopts an LSTM-based method: given the context vector, the previous hidden state and the previously generated word, one word is generated at each time step to produce the caption. The initial hidden state and cell state of the LSTM are set to zero, and the hidden state h_t at time t is modeled as:

h_t = LSTM(x_t, h_{t-1}, m_{t-1})

where x_t is the input vector and m_{t-1} is the memory cell vector at time t-1.

The visual feature vector is used as the initial input of the LSTM to predict the first word of the sentence. Before entering the LSTM, however, a fully connected layer is needed to convert the visual feature vector to the same dimension as the word embeddings. The entire sentence can then be generated word by word.
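The following sketch illustrates one possible form of such an Impression sentence decoder: the visual feature vector is projected to the word-embedding dimension and fed as the first LSTM input, after which words are generated step by step. The dimensions, the greedy decoding and the vocabulary handling are illustrative assumptions, not specifics of the embodiment.

```python
# Illustrative sketch of the Impression sentence decoder: the visual feature
# vector, projected to the word-embedding dimension, is the first LSTM input,
# after which words are generated one per time step. The dimensions, greedy
# decoding and vocabulary handling are assumptions made for this example.
import torch
import torch.nn as nn


class ImpressionDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, visual_dim=2048):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, embed_dim)   # match word-embedding size
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual_feat, max_len=30):
        batch = visual_feat.size(0)
        # initial hidden state and cell state are set to zero
        h = visual_feat.new_zeros(batch, self.lstm.hidden_size)
        c = visual_feat.new_zeros(batch, self.lstm.hidden_size)
        x = self.visual_proj(visual_feat)           # first input: projected visual vector
        words = []
        for _ in range(max_len):
            h, c = self.lstm(x, (h, c))
            next_word = self.out(h).argmax(dim=-1)  # greedy choice, for illustration only
            words.append(next_word)
            x = self.embed(next_word)               # previously generated word as next input
        return torch.stack(words, dim=1)            # (batch, max_len) token ids
```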
S302, the Findings generation module fuses the visual features of the image and the semantic features of the sentences, generates sentences cyclically, and finally produces a long paragraph containing multiple structured sentences.
Specifically, for multi-modal feature fusion, refer to FIG. 4. The upper half of FIG. 4 is the branch that produces visual features: the image encoder obtained through contrastive pre-training is modeled as a CNN that extracts visual representations from the input image. The lower half of FIG. 4 is the branch that produces semantic features: the sentence encoder obtained through contrastive pre-training is modeled as a BERT-like language model after fine-tuning, and the semantic representation of a sentence is generated by mean pooling.
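A hedged sketch of the two branches in FIG. 4 is given below. A ResNet-50 backbone and a Hugging Face BERT checkpoint are assumptions made for illustration; the embodiment only specifies "a CNN" and "a BERT-like language model" whose sentence representation is obtained by mean pooling.

```python
# Hedged sketch of the two branches in FIG. 4. A ResNet-50 backbone and a
# Hugging Face BERT checkpoint are assumptions made for illustration; the
# embodiment only specifies "a CNN" and "a BERT-like language model" whose
# sentence representation is obtained by mean pooling.
import torch.nn as nn
from torchvision import models
from transformers import AutoModel, AutoTokenizer


class ImageEncoder(nn.Module):
    """CNN branch: spatial features taken from the last convolutional stage."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=None)   # weights come from contrastive pre-training
        self.features = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, images):                     # images: (B, 3, H, W)
        feat = self.features(images)               # (B, 2048, h, w)
        return feat.flatten(2).transpose(1, 2)     # (B, K, 2048): K region features


class SentenceEncoder(nn.Module):
    """BERT-style branch: mean pooling over token states gives the sentence vector."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.bert = AutoModel.from_pretrained(model_name)

    def forward(self, sentences):
        tokens = self.tokenizer(sentences, padding=True, truncation=True,
                                return_tensors="pt")
        states = self.bert(**tokens).last_hidden_state         # (B, T, d_s)
        mask = tokens["attention_mask"].unsqueeze(-1)
        return (states * mask).sum(1) / mask.sum(1)             # masked mean pooling
```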
In order to focus the generated sentence on describing different image areas, based on the attention framework, the visual features of the image and the semantic features of the text are input into a fully connected layer and then fed into the SoftMax layer to obtain weighted visual features.
The attention network used to compute the weighted visual representation is defined as:
V_w = Attention(v, s)

where V_w is the weighted visual representation to be obtained; Attention(·) is the attention function; v = {v_1, v_2, …, v_K}, with each v_i ∈ R^{d_v} an image feature learned by the image encoder, a d_v-dimensional representation corresponding to one region of the image; s ∈ R^{d_s} is the encoding of the previous sentence, and d_s is the dimension of the semantic features.

Given the visual features v ∈ R^{d_v×K} and the encoding of the previous sentence, the invention generates an attention distribution over the K image regions through a single attention distributor, expressed as a single-layer neural network followed by a SoftMax function:

z = w_h^T tanh(W_v v + (W_s s) 1^T)
α = softmax(z)

where 1 ∈ R^K is a vector with all elements set to 1; W_v ∈ R^{k×d_v}, W_s ∈ R^{k×d_s} and w_h ∈ R^k are the parameters of the attention network; and α ∈ R^K contains the attention weights over the features in v.

The weighted visual representation V_w based on the attention distribution can then be obtained as:

V_w = Σ_{i=1}^{K} α_i v_i

where α_i is the i-th element of α.
The input to the sentence decoder is now a weighted visual representation, so that the decoder attends to specific regions of the image in order to generate sentences describing different image regions. The learned encoding of the previous sentence and the visual features of the image are combined to guide the generation of the next sentence. This process is repeated until an empty sentence is produced, indicating that generation of the Findings section has been completed. In this way, as different sentences are generated, the model can focus on different areas of the image according to the context of the previous sentence and ensure the coherence and consistency of the medical semantics of the generated report.
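The attention distributor and the recursive Findings loop described above might be sketched as follows. The hidden size k, the empty-string stopping test, and the interfaces of the sentence decoder (one decoded string per call, batch size one) and sentence encoder are simplifying assumptions made for this example.

```python
# Hedged sketch of the attention distributor and the recursive Findings loop.
# The hidden size k, the empty-string stopping test, and the interfaces of the
# sentence decoder (one decoded string per call) and sentence encoder are
# simplifying assumptions for this example.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualAttention(nn.Module):
    """Single-layer attention over K image regions, conditioned on the
    encoding of the previous sentence: V_w = Attention(v, s)."""

    def __init__(self, d_v, d_s, k_hidden=512):
        super().__init__()
        self.W_v = nn.Linear(d_v, k_hidden, bias=False)
        self.W_s = nn.Linear(d_s, k_hidden, bias=False)
        self.w_h = nn.Linear(k_hidden, 1, bias=False)

    def forward(self, v, s):
        # v: (B, K, d_v) region features, s: (B, d_s) previous-sentence encoding
        z = self.w_h(torch.tanh(self.W_v(v) + self.W_s(s).unsqueeze(1))).squeeze(-1)
        alpha = F.softmax(z, dim=-1)                 # attention weights over K regions
        v_w = (alpha.unsqueeze(-1) * v).sum(dim=1)   # weighted visual representation
        return v_w, alpha


def generate_findings(image_encoder, sentence_encoder, attention, sent_decoder,
                      image, max_sentences=10):
    """Recursive generation: stop when the decoder emits an empty sentence."""
    v = image_encoder(image)                         # (1, K, d_v)
    prev = sentence_encoder([""])                    # empty previous sentence to start
    findings = []
    for _ in range(max_sentences):
        v_w, _ = attention(v, prev)
        sentence = sent_decoder(v_w)                 # assumed to return one decoded string
        if sentence.strip() == "":                   # empty sentence terminates the paragraph
            break
        findings.append(sentence)
        prev = sentence_encoder([sentence])          # feed back as the previous sentence
    return findings
```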
S400, report analysis and evaluation:
The generated radiology report is evaluated using four common evaluation indexes; the larger the value of an evaluation index, the better the performance of the radiology report generation model. The evaluation specifically comprises the following steps:
s401, evaluating the generated report using the BLEU and its variants.
BLEU is a method for the automatic evaluation of machine translation whose core idea is precision; it has many variants depending on the "n-gram" used, the four common indicators being BLEU-1, BLEU-2, BLEU-3 and BLEU-4, where an n-gram is a sequence of n consecutive words.
And S402, evaluating the generated report by using METEOR.
METEOR is an automatic metric for machine translation evaluation that correlates better with human judgment; unlike BLEU, it takes into account both precision and recall over the whole corpus to obtain the final score.
And S403, evaluating the generated report by using the ROUGE.
ROUGE is designed to measure summary quality: it measures the "similarity" between an automatically generated summary and a reference summary and computes a corresponding score.
And S404, evaluating the generated report by using CIDEr.
CIDEr is a consensus-based image description evaluation metric that computes, as its score, the cosine similarity between a reference caption and the caption generated by the model.
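As an illustration of step S401, BLEU-1 through BLEU-4 can be computed with NLTK's corpus_bleu; METEOR, ROUGE and CIDEr would normally come from their own implementations (for example the pycocoevalcap package). The reference and hypothesis tokens below are made-up examples, not data from the embodiment.

```python
# Illustration of step S401: BLEU-1 to BLEU-4 computed with NLTK's corpus_bleu.
# METEOR, ROUGE and CIDEr would normally come from their own implementations
# (for example the pycocoevalcap package). The tokens below are made-up examples.
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu


def bleu_scores(references, hypotheses):
    """references: one list of reference token lists per sample;
    hypotheses: one generated token list per sample."""
    smooth = SmoothingFunction().method1
    weights = {
        "BLEU-1": (1.0, 0.0, 0.0, 0.0),
        "BLEU-2": (0.5, 0.5, 0.0, 0.0),
        "BLEU-3": (1 / 3, 1 / 3, 1 / 3, 0.0),
        "BLEU-4": (0.25, 0.25, 0.25, 0.25),
    }
    return {name: corpus_bleu(references, hypotheses, weights=w,
                              smoothing_function=smooth)
            for name, w in weights.items()}


# Example usage on a single (hypothetical) generated sentence:
refs = [[["no", "acute", "cardiopulmonary", "abnormality"]]]
hyps = [["no", "acute", "cardiopulmonary", "abnormality"]]
print(bleu_scores(refs, hyps))
```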
S500, outputting a result:
and combining the respectively generated Impression part and Findings part into a complete radiology report for outputting, and simultaneously realizing the evaluation of output results by using a plurality of evaluation indexes.
In this embodiment, the radiological image is acquired by a nuclear magnetic resonance apparatus. Its principle is that the human body is placed in a special magnetic field and a radio-frequency pulse is used to excite the hydrogen nuclei in the body, causing them to resonate and absorb energy; after the radio-frequency pulse stops, the hydrogen nuclei emit radio signals at a specific frequency and release the absorbed energy, which is recorded by a receiver outside the body and processed by a computer to obtain the image.
In this embodiment, the result output module is in signal connection with a display screen and a printer. Through this signal connection, screen display and document printing of the diagnostic report are achieved, facilitating analysis of the report by medical personnel.

Claims (9)

1. A radiology report generation model training method based on multi-modal contrastive learning, characterized by comprising the following steps:
s100, sample data acquisition:
acquiring radiological images and text data, and transmitting the image data and the corresponding text data to a sample database, wherein each text data comprises a conclusive diagnosis sentence Impression and a detailed description paragraph Findings;
s200, multimodal contrast learning:
using a self-supervised representation learning method, an image encoder for extracting visual features from image data and a sentence encoder for extracting semantic features from text data are learned from the sample database, wherein:
the sentence encoder learns sentence representations through a sentence-level training strategy based on contrastive learning;
the image encoder learns image representations through a bidirectional contrastive learning objective between paired image data and text data;
s300, generating a radiology report:
recursively generating the diagnostic sentence Impression and the description paragraph Findings of the radiology report by fusing the visual features extracted by the image encoder and the semantic features extracted by the sentence encoder, specifically comprising the following steps:
s301, generating a single diagnostic statement Impression by the Impression generation module based on the encoder-decoder framework, specifically comprising the following steps:
the image encoder extracts visual features from an input image, then sends the visual features into an Impression partial sentence decoder, and generates a whole sentence word by word as a diagnosis sentence Impression;
s302, fusing the visual features of the image and the semantic features of the sentences by a Findings generation module, generating sentences cyclically, and finally generating a long paragraph containing a plurality of structured sentences, specifically comprising the following steps:
in the Findings generation module, in order to make each generated sentence focus on describing a different image region, the visual features output by the image encoder and the semantic features of the previous sentence output by the sentence encoder are fed, based on an attention framework, into a fully connected layer and then into a SoftMax layer to obtain weighted visual features; the weighted visual features are input into the Findings sentence decoder to generate a sentence; the generated sentence is then fed into the sentence encoder as the previous sentence, yielding new weighted visual features; this process is repeated until the Findings sentence decoder generates an empty sentence, indicating that generation of the description paragraph Findings is complete;
wherein the encoding of the previous sentence and the visual features of the image are combined through the weighted visual features to guide the generation of the next sentence, and the attention network used to compute the weighted visual representation is defined as:

V_w = Attention(v, s)

where V_w is the weighted visual representation to be obtained; Attention(·) is the attention function; v = {v_1, v_2, …, v_K}, with each v_i ∈ R^{d_v} an image feature learned by the image encoder, a d_v-dimensional representation corresponding to one region of the image; s ∈ R^{d_s} is the encoding of the previous sentence, and d_s is the dimension of the semantic features;

given the visual features v ∈ R^{d_v×K} and the encoding of the previous sentence, an attention distribution over the K image regions is generated by a single attention distributor, expressed as a single-layer neural network followed by a SoftMax function:

z = w_h^T tanh(W_v v + (W_s s) 1^T)
α = softmax(z)

where 1 ∈ R^K is a vector with all elements set to 1; W_v ∈ R^{k×d_v}, W_s ∈ R^{k×d_s} and w_h ∈ R^k are the parameters of the attention network; and α ∈ R^K contains the attention weights over the features in v;

the weighted visual representation V_w is then obtained from the attention distribution as:

V_w = Σ_{i=1}^{K} α_i v_i

where α_i is the i-th element of α;
s400, report analysis and evaluation:
the generated radiology report is evaluated using evaluation indexes;
S500, outputting a result:
combining the separately generated diagnostic sentence Impression and description paragraph Findings into a complete radiology report for output, while using multiple evaluation indexes to evaluate the output results.
2. The radiology report generation model training method based on multi-modal contrastive learning of claim 1, wherein in step S100 the image data and the text data in the sample database are in one-to-one correspondence, and the sample database includes a training set, a test set and a validation set.
3. The method as claimed in claim 2, wherein in step S200 learning sentence representations through the sentence-level training strategy based on contrastive learning comprises the following steps:
for the same sentence in a sentence set (from the training set, test set or validation set), a series of sentence representations of different augmented versions is obtained by applying different data augmentation methods and used as positive examples, while the other sentences are used as negative examples;
when training the sentence encoder, the sentence encoder is built and semantic sentence embeddings are constructed by maximizing the agreement between the sentence representations of different augmented versions of the same sample while keeping the sentence vectors of different samples as far apart as possible.
4. The method of claim 3, wherein in step S200 a set of sentences X = {x_1, x_2, …, x_m} is taken, where x_i denotes the i-th sentence and m is the total number of sentences in the set; for each sentence x_i, two different data enhancement methods f(·) and f'(·) are applied to generate two different versions of its sentence embedding, e_i and e'_i:

e_i = f(x_i)
e'_i = f'(x_i)

where e_i, e'_i ∈ R^{L×D}, L is the length of the sentence embedding and D is its hidden dimension;

the sentence embeddings e_i and e'_i are then encoded to obtain the sentence representations h_i and h'_i;

then, for a mini-batch of N sentences, the training objective ℓ_i for sentence x_i is as follows:

ℓ_i = -log( exp(sim(h_i, h'_i)/τ) / Σ_{j=1}^{N} exp(sim(h_i, h'_j)/τ) )

where τ is a temperature hyperparameter and sim(·,·) is the cosine similarity, sim(h_i, h'_j) = h_i^T h'_j / (‖h_i‖·‖h'_j‖);

the final contrastive loss L_sent is the average of all N in-batch losses:

L_sent = (1/N) Σ_{i=1}^{N} ℓ_i.
5. The multi-modal contrastive learning radiology report generation model training method of claim 1, wherein learning image representations through the bidirectional contrastive learning objective between paired images and texts in step S200 comprises the following steps:

the image encoder is learned from paired inputs (X_v, X_s), where X_v denotes an image or a group of images and X_s denotes the sentence sequence describing the imaging information in X_v; each input image X_v and each input sentence X_s is converted into a fixed-dimension vector, h_v and h_s, by the image encoder f_v(·) and the sentence encoder f_s(·), respectively; the two modality representations h_v and h_s are then projected from their encoder spaces into the same D-dimensional space by the projection functions g_v(·) and g_s(·) for contrastive learning;

for N input pairs (X_v, X_s), the corresponding N representation pairs (v, s) are obtained as:

v = g_v(f_v(X_v))
s = g_s(f_s(X_s))

where v, s ∈ R^D; letting (v_i, s_i) denote the i-th representation pair, its training objective includes two loss functions, an image-to-text contrastive loss ℓ_i^{(v→s)} and a text-to-image contrastive loss ℓ_i^{(s→v)}:

ℓ_i^{(v→s)} = -log( exp(sim(v_i, s_i)/τ) / Σ_{j=1}^{N} exp(sim(v_i, s_j)/τ) )

ℓ_i^{(s→v)} = -log( exp(sim(s_i, v_i)/τ) / Σ_{j=1}^{N} exp(sim(s_i, v_j)/τ) )

where sim(·,·) is the cosine similarity and τ is a temperature hyperparameter; the final training loss L_img is a weighted combination of the two losses:

L_img = (1/N) Σ_{i=1}^{N} ( λ ℓ_i^{(v→s)} + (1 - λ) ℓ_i^{(s→v)} )
wherein λ is a scalar weight;
by maximizing the correspondence between image-text representation pairs, an image encoder is learned that maps images to a fixed-dimension vector.
6. The method as claimed in claim 1, wherein in step S301 the Impression sentence decoder adopts an LSTM-based method, generating one word at each time step to produce the caption according to the context vector, the previous hidden state and the previously generated word;

the initial hidden state and cell state of the LSTM are set to zero, and the hidden state h_t at time t is modeled as:

h_t = LSTM(x_t, h_{t-1}, m_{t-1})

where x_t is the input vector and m_{t-1} is the memory cell vector at time t-1;

the visual feature vector is used as the initial input of the LSTM to predict the first word of the sentence; before being input to the LSTM, the visual feature vector output by the image encoder is converted by a fully connected layer into the same dimension as the word embeddings; the visual features are extracted from the last convolutional layer, and the LSTM then generates the whole sentence word by word.
7. The method as claimed in claim 1, wherein the step S400 includes:
s401, evaluating the generated report by using the BLEU and the variant thereof;
s402, evaluating the generated report by using METEOR;
s403, evaluating the generated report by using the ROUGE;
and S404, evaluating the generated report by using CIDEr.
8. A radiology report generation model training system based on multi-modal contrastive learning, comprising:
the sample database is used for storing the acquired radiological images and text data, wherein each text data comprises a conclusive diagnosis sentence Impression and a detailed description paragraph Findings;
the multi-modal contrastive learning module, which adopts a self-supervised representation learning method and learns, based on the sample database, an image encoder for extracting visual features from image data and a sentence encoder for extracting semantic features from text data, wherein:
the sentence encoder learns sentence representations through a sentence-level training strategy based on contrastive learning;
the image encoder learns image representations through a bidirectional contrastive learning objective between paired image data and text data;
a radiology report generation module for recursively generating the diagnostic sentence Impression and the description paragraph Findings of a radiology report by fusing the visual features extracted by the image encoder and the semantic features extracted by the sentence encoder;
the radiology report generation module further comprises an Impression generation module and a Findings generation module, wherein:
the Impression generation module generates a single diagnostic statement Impression based on an encoder-decoder framework; its implementation comprises the following steps:
the image encoder extracts visual features from an input image, then sends the visual features into an Impression partial sentence decoder, and generates a whole sentence word by word as a diagnosis sentence Impression;
the Findings generation module fuses the visual features of the image and the semantic features of the sentences, generates sentences cyclically, and finally generates a long paragraph containing a plurality of structured sentences; its implementation comprises the following steps:
in the Findings generation module, in order to make each generated sentence focus on describing a different image region, the visual features output by the image encoder and the semantic features of the previous sentence output by the sentence encoder are fed, based on an attention framework, into a fully connected layer and then into a SoftMax layer to obtain weighted visual features; the weighted visual features are input into the Findings sentence decoder to generate a sentence; the generated sentence is then fed into the sentence encoder as the previous sentence, yielding new weighted visual features; this process is repeated until the Findings sentence decoder generates an empty sentence, indicating that generation of the Findings part is complete;
wherein the encoding of the previous sentence and the visual features of the image are combined through the weighted visual features to guide the generation of the next sentence, and the attention network used to compute the weighted visual representation is defined as:

V_w = Attention(v, s)

where V_w is the weighted visual representation to be obtained; Attention(·) is the attention function; v = {v_1, v_2, …, v_K}, with each v_i ∈ R^{d_v} an image feature learned by the image encoder, a d_v-dimensional representation corresponding to one region of the image; s ∈ R^{d_s} is the encoding of the previous sentence, and d_s is the dimension of the semantic features;

given the visual features v ∈ R^{d_v×K} and the encoding of the previous sentence, an attention distribution over the K image regions is generated by a single attention distributor, expressed as a single-layer neural network followed by a SoftMax function:

z = w_h^T tanh(W_v v + (W_s s) 1^T)
α = softmax(z)

where 1 ∈ R^K is a vector with all elements set to 1; W_v ∈ R^{k×d_v}, W_s ∈ R^{k×d_s} and w_h ∈ R^k are the parameters of the attention network; and α ∈ R^K contains the attention weights over the features in v;

the weighted visual representation V_w is then obtained from the attention distribution as:

V_w = Σ_{i=1}^{K} α_i v_i

where α_i is the i-th element of α;
a report analysis evaluation module that evaluates the generated radiology report using the evaluation index;
and the result output module, which combines the separately generated diagnostic sentence Impression and description paragraph Findings into a complete radiology report for output, while using multiple evaluation indexes to evaluate the output results.
9. The multi-modal contrastive learning-based radiology report generation model training system of claim 8, wherein in the radiology report generation module the image encoder is modeled as a CNN, and the sentence encoder is modeled as a BERT language model after fine-tuning, with semantic representations of sentences generated by mean pooling.
CN202210931458.3A 2022-08-04 2022-08-04 Model training method and system based on multi-modal contrast learning radiology report generation Pending CN115293128A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210931458.3A CN115293128A (en) 2022-08-04 2022-08-04 Model training method and system based on multi-modal contrast learning radiology report generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210931458.3A CN115293128A (en) 2022-08-04 2022-08-04 Model training method and system based on multi-modal contrast learning radiology report generation

Publications (1)

Publication Number Publication Date
CN115293128A true CN115293128A (en) 2022-11-04

Family

ID=83825591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210931458.3A Pending CN115293128A (en) 2022-08-04 2022-08-04 Model training method and system based on multi-modal contrast learning radiology report generation

Country Status (1)

Country Link
CN (1) CN115293128A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116151232A (en) * 2023-04-24 2023-05-23 北京龙智数科科技服务有限公司 Method and device for generating model by multi-stage training text title
CN116151232B (en) * 2023-04-24 2023-08-29 北京龙智数科科技服务有限公司 Method and device for generating model by multi-stage training text title
CN116843778A (en) * 2023-05-23 2023-10-03 北京邮电大学 Method and system for generating X-ray chest radiography image based on radiology report
CN116843778B (en) * 2023-05-23 2024-03-26 北京邮电大学 Method and system for generating X-ray chest radiography image based on radiology report
CN116797889A (en) * 2023-08-24 2023-09-22 青岛美迪康数字工程有限公司 Updating method and device of medical image recognition model and computer equipment
CN116797889B (en) * 2023-08-24 2023-12-08 青岛美迪康数字工程有限公司 Updating method and device of medical image recognition model and computer equipment
CN117174240A (en) * 2023-10-26 2023-12-05 中国科学技术大学 Medical image report generation method based on large model field migration
CN117174240B (en) * 2023-10-26 2024-02-09 中国科学技术大学 Medical image report generation method based on large model field migration

Similar Documents

Publication Publication Date Title
CN115293128A (en) Model training method and system based on multi-modal contrast learning radiology report generation
CN113241135A (en) Disease risk prediction method and system based on multi-mode fusion
CN110503635B (en) Hand bone X-ray film bone age assessment method based on heterogeneous data fusion network
CN112541066B (en) Text-structured-based medical and technical report detection method and related equipment
CN113555077B (en) Suspected infectious disease prediction method and device
CN112489740A (en) Medical record detection method, training method of related model, related equipment and device
CN117077786A (en) Knowledge graph-based data knowledge dual-drive intelligent medical dialogue system and method
CN117253576B (en) Outpatient electronic medical record generation method based on Chinese medical large model
Sirshar et al. Attention based automated radiology report generation using CNN and LSTM
CN111524570B (en) Ultrasonic follow-up patient screening method based on machine learning
CN115205880A (en) Medical image report generation method and device
CN113159134A (en) Intelligent diagnosis evaluation method based on mammary gland structural report
CN113555078A (en) Intelligent generation method and system for mode-driven gastroscopy report
CN114708976A (en) Method, device, equipment and storage medium for assisting diagnosis technology
CN112216379A (en) Disease diagnosis system based on intelligent joint learning
CN112749277A (en) Medical data processing method and device and storage medium
CN116797572A (en) Rheumatoid arthritis activity grading device based on multi-mode data
Hartsock et al. Vision-language models for medical report generation and visual question answering: A review
CN115295133A (en) Code checking method for surgical operation
Ihor et al. Exploring Multimodal Data Approach in Natural Language Processing Based on Speech Recognition Algorithms
CN110289065A (en) A kind of auxiliary generates the control method and device of medical electronic report
CN118098482A (en) Intelligent medical management system and method based on 5G technology
CN118072899A (en) Bone mineral density report generation platform based on diffusion model text generation technology
CN117954041A (en) Medical image report generation method, system and computer storage medium
CN114548081A (en) Method and system for automatically generating medical ultrasonic text diagnosis result

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination