WO2023029817A1 - Medical report generation method and apparatus, model training method and apparatus, and device - Google Patents

Medical report generation method and apparatus, model training method and apparatus, and device

Info

Publication number
WO2023029817A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
loss
encoder
image feature
feature extraction
Prior art date
Application number
PCT/CN2022/107921
Other languages
French (fr)
Chinese (zh)
Inventor
边成
Original Assignee
Beijing ByteDance Network Technology Co., Ltd. (北京字节跳动网络技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co., Ltd.
Publication of WO2023029817A1 publication Critical patent/WO2023029817A1/en

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H15/00 ICT specially adapted for medical reports, e.g. generation or transmission thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2132 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on discrimination criteria, e.g. discriminant analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • G06T9/002 Image coding using neural networks

Definitions

  • Embodiments of the present disclosure relate to a medical report generation method and apparatus, a model training method and apparatus, and a device.
  • A medical image is an image of internal tissue of the human body or of a part of the human body, and can help doctors understand the health status of a patient.
  • the medical image has a corresponding medical report, and the medical report contains the analysis result of the medical image.
  • a medical report may include the location of the patient's disease, the extent of the lesion, and the affected organs determined from the medical images.
  • Embodiments of the present disclosure provide a method for generating a medical report, a method for training a model, a device, and a device, which can automatically generate a medical report based on medical images.
  • embodiments of the present disclosure provide a method for training a medical report generation model, the method comprising:
  • the source image is input into the first encoder to obtain the first image feature, and the source image is input into the second encoder to obtain the second image feature; the source image corresponds to a medical text label;
  • according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss and the third adversarial loss, train the first encoder, the second encoder, the third encoder, the text generator and the discriminator, and repeatedly execute the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is reached.
  • an embodiment of the present disclosure provides a method for generating a medical report, the method comprising:
  • the encoder is a second encoder trained according to the training method of the medical report generation model described in any one of the above embodiments;
  • the text generator is a text generator trained according to the training method of the medical report generation model described in any one of the above embodiments.
  • embodiments of the present disclosure provide a training device for a medical report generation model, the device comprising:
  • the first input unit is configured to input a source image into a first encoder to obtain a first image feature, and input the source image to a second encoder to obtain a second image feature; the source image corresponds to a medical text label;
  • the second input unit is configured to input the target image into the third encoder to obtain a third image feature, and input the target image to the second encoder to obtain a fourth image feature;
  • a third input unit configured to input the second image feature into the text generator to obtain the first medical report text
  • a fourth input unit configured to input the fourth image feature into the text generator to obtain a second medical report text
  • the fifth input unit is used to input the first medical report text into the discriminator to obtain the first discriminant result
  • a sixth input unit configured to input the second medical report text into the discriminator to obtain a second discriminant result
  • a first calculation unit configured to calculate a source image-specific loss based on the first image feature and the second image feature, and calculate a target image-specific loss based on the third image feature and the fourth image feature;
  • a second calculation unit configured to calculate a cross-entropy loss according to the first medical report text and the medical text label corresponding to the source image
  • a third calculation unit configured to calculate a first adversarial loss according to the first discrimination result, and calculate a second adversarial loss and a third adversarial loss according to the second discrimination result;
  • an execution unit configured to, according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss and the third adversarial loss, train the first encoder, the second encoder, the third encoder, the text generator and the discriminator, and repeatedly perform the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is met.
  • an embodiment of the present disclosure provides a medical report generating device, the device comprising:
  • the input unit is used to input the medical image into the encoder to obtain the medical image features
  • a generating unit configured to input the medical image features into a text generator to obtain a medical report text
  • the encoder is a second encoder trained according to the training method of the medical report generation model described in any one of the above embodiments;
  • the text generator is a text generator trained according to the training method of the medical report generation model described in any one of the above embodiments.
  • an electronic device including:
  • one or more processors;
  • a storage device on which one or more programs are stored; when the one or more programs are executed by the one or more processors, the one or more processors implement the training method of the medical report generation model described in any one of the above embodiments, or implement the medical report generation method described in the above embodiments.
  • Embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, wherein, when the program is executed by a processor, the training method of the medical report generation model described in any one of the above embodiments, or the medical report generation method described in the above embodiments, is implemented.
  • Embodiments of the present disclosure provide a training method for a medical report generation model and a medical report generation method. Image features are extracted from the source image and the target image respectively, and the image features are used to obtain the corresponding first medical report text and second medical report text. The discriminator is then used to obtain the first discrimination result and the second discrimination result corresponding to the first medical report text and the second medical report text respectively. Finally, the image features are used to calculate the source image-specific loss and the target image-specific loss; the cross-entropy loss is calculated from the first medical report text and the medical text label corresponding to the source image; the first adversarial loss is calculated according to the first discrimination result, and the second adversarial loss and the third adversarial loss are calculated according to the second discrimination result. According to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss and the third adversarial loss, the first encoder, the second encoder, the third encoder, the text generator and the discriminator are trained, and the above training steps are repeated until the preset conditions are met.
  • Fig. 1 is a schematic framework diagram of an exemplary application scenario provided by at least one embodiment of the present disclosure
  • FIG. 2 is a flowchart of a training method for a medical report generation model provided by at least one embodiment of the present disclosure
  • Fig. 3 is a schematic diagram of a method for generating a medical report model provided by at least one embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of another method for generating a medical report model provided by at least one embodiment of the present disclosure
  • Fig. 5 is a flow chart of a method for generating a medical report provided by at least one embodiment of the present disclosure
  • Fig. 6 is a schematic structural diagram of a training device for a medical report generation model provided by at least one embodiment of the present disclosure
  • Fig. 7 is a schematic structural diagram of a medical report generating device provided by at least one embodiment of the present disclosure.
  • Fig. 8 is a schematic diagram of a basic structure of an electronic device provided by at least one embodiment of the present disclosure.
  • Refer to FIG. 1, which is a schematic framework diagram of an exemplary application scenario provided by at least one embodiment of the present disclosure.
  • The medical image 101 is input into the trained encoder 102 to obtain the medical image feature 103 corresponding to the medical image 101, and the medical image feature 103 is then input into the trained text generator 104 to obtain the medical report text 105 output by the text generator 104.
  • FIG. 1 is only an example in which the embodiments of the present disclosure can be implemented.
  • the scope of applicability of the disclosed embodiments is not limited in any way by this framework.
  • FIG. 2 is a flowchart of a method for training a medical report generation model provided by at least one embodiment of the present disclosure, the method includes steps S201-S210.
  • S201 Input a source image into a first encoder to obtain a first image feature, and input the source image into a second encoder to obtain a second image feature; the source image corresponds to a medical text label.
  • FIG. 3 is a schematic diagram of a method for generating a medical report model provided by at least one embodiment of the present disclosure.
  • the source images are medical images with corresponding medical text labels.
  • the medical text label refers to the medical report text corresponding to the medical image, for example, it may be the test report text.
  • the source image may be a chest radiograph image derived from MIMIC-CXR (a data set).
  • the first encoder is used to extract image features specific to the source image, that is, image features belonging to the source domain.
  • the source image is input into the first encoder to obtain the first image feature output by the first encoder.
  • the second encoder is an encoder shared by the source image and the target image, and is used to extract features that are similar for the source domain and the target domain in the hidden feature space, that is, features common to the source domain and the target domain.
  • the source image is input into the second encoder to obtain the second image features output by the second encoder.
  • the first encoder and the second encoder may be composed of four convolutional layers.
  • the second encoder may use Inception-v3 (a neural network), and the first encoder may use ResNet (Deep residual network, deep residual network).
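  • As an illustration of the four-convolutional-layer option mentioned above, the following is a minimal PyTorch sketch of such an encoder; the channel progression, kernel sizes and pooling head are illustrative assumptions, not values specified by the disclosure.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Four-convolutional-layer image encoder (sketch). Could play the role of
    the first or third encoder (domain-specific features) or the second
    (shared) encoder; all sizes here are assumptions."""
    def __init__(self, in_channels: int = 1, feature_dim: int = 256):
        super().__init__()
        chans = [32, 64, 128, feature_dim]  # assumed channel progression
        layers, prev = [], in_channels
        for c in chans:
            layers += [nn.Conv2d(prev, c, kernel_size=3, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
            prev = c
        self.conv = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)  # collapse spatial dimensions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(self.conv(x)).flatten(1)  # (batch, feature_dim)
```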
  • S202 Input the target image into a third encoder to obtain a third image feature, and input the target image into a second encoder to obtain a fourth image feature.
  • the target image is a medical image belonging to an image type other than the medical image type to which the source image belongs.
  • the target image may include medical images of one or more image types, including the image type for which medical report text needs to be generated.
  • the target image includes the medical image generated by the endoscope.
  • the target image may also include medical images of other image types such as CT (Computed Tomography, computerized tomography) images.
  • the target image may include a medical image without a label, or may include a medical image with a corresponding label.
  • the label of the target image can be a manually labeled medical report text, or it can be a descriptive text related to the target image in the literature, articles, etc. to which the target image belongs.
  • the target image is input into the second encoder to obtain a fourth image feature output by the second encoder.
  • the third encoder is used to extract image features specific to the target image, that is, image features belonging to the target domain.
  • the target image is input into the third encoder to obtain the third image feature output by the third encoder.
  • the third encoder may be composed of four convolutional layers.
  • the third encoder can use ResNet (Deep residual network, deep residual network).
  • S203 Input the second image feature into the text generator to obtain the first medical report text.
  • the text generator is used to generate corresponding medical report text according to the image features of the input medical image.
  • the text generator can be composed of a bidirectional two-layer LSTM (Long Short-Term Memory) artificial neural network.
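  • As a concrete reading of this component, the sketch below conditions a two-layer LSTM decoder on the image feature and decodes report tokens with teacher forcing; for simplicity it is unidirectional, and the vocabulary size, embedding size and conditioning scheme are assumptions rather than details given by the disclosure.

```python
import torch
import torch.nn as nn

class ReportGenerator(nn.Module):
    """Two-layer LSTM text generator conditioned on an image feature (sketch)."""
    def __init__(self, vocab_size: int, feature_dim: int = 256,
                 embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feature_dim, 2 * hidden_dim)  # one slice per LSTM layer
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feature: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # Initialise the LSTM state from the image feature, then decode with
        # teacher forcing over the given token prefix.
        h0 = torch.tanh(self.init_h(image_feature))
        h0 = h0.view(-1, 2, self.lstm.hidden_size).permute(1, 0, 2).contiguous()
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(out)  # (batch, seq_len, vocab_size) logits
```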
  • the second image feature is input into the text generator to obtain the first medical report text output by the text generator.
  • the fourth image feature is input into the above-mentioned text generator to obtain the second medical report text output by the text generator.
  • the discriminator is used to determine the domain to which the input medical report text belongs, that is, to determine whether the input medical report text belongs to the source domain or the target domain.
  • the discriminator can be composed of a CNN (Convolutional Neural Network) with two convolutional layers and a fully connected layer.
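  • The following sketch shows one way such a discriminator could look: token embeddings passed through two 1-D convolutions and a fully connected layer, producing the per-word-segment probability described below; the embedding and channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class TextDiscriminator(nn.Module):
    """CNN text discriminator (sketch): two convolutional layers plus a fully
    connected layer, emitting a probability per token that the token comes
    from a source-domain report."""
    def __init__(self, vocab_size: int, embed_dim: int = 128, channels: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, channels, kernel_size=3, padding=1), nn.ReLU(True),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1), nn.ReLU(True),
        )
        self.fc = nn.Linear(channels, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        e = self.embed(tokens).transpose(1, 2)        # (batch, embed_dim, seq_len)
        h = self.conv(e).transpose(1, 2)              # (batch, seq_len, channels)
        return torch.sigmoid(self.fc(h)).squeeze(-1)  # D(y): per-token prob in (0, 1)
```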
  • the first medical report text is input into the discriminator, and a first discrimination result of the discriminator on the first medical report text is obtained.
  • the second medical report text is input into the discriminator, and a second discrimination result of the discriminator on the second medical report text is obtained.
  • Adversarial training can be achieved by using the discriminator, so that the second encoder can narrow the difference between the first medical report text and the second medical report text, map features from different domains into the same domain, and achieve feature-level alignment.
  • S207 Calculate the source image-specific loss according to the first image feature and the second image feature, and calculate the target image-specific loss according to the third image feature and the fourth image feature.
  • the first image feature and the second image feature are obtained by feature extraction of the source image by different encoders. According to the first image feature and the second image feature, a source image-specific loss can be calculated. The source image-specific loss is used to measure the gap between the first image feature and the second image feature.
  • the source image-specific loss can be expressed by formula (1) and is denoted $L_{sdist}$ below.
  • the third image feature and the fourth image feature are obtained by different encoders performing feature extraction on the target image.
  • the target image-specific loss can be calculated.
  • the target image-specific loss is used to measure the gap between the third image feature and the fourth image feature.
  • the target image-specific loss can be expressed by formula (2) and is denoted $L_{tdist}$ below.
  • S208 Calculate a cross-entropy loss according to the first medical report text and the medical text label corresponding to the source image.
  • the source images have corresponding medical text labels.
  • a cross-entropy loss is calculated according to the first medical report text and the medical text label corresponding to the source image.
  • Cross-entropy loss is used to measure the gap between the first medical report text and medical text labels.
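  • As a hedged sketch of this step, token-level cross-entropy between the generated first medical report text and the medical text label can be computed as below; the padding convention is an assumption.

```python
import torch
import torch.nn.functional as F

def report_cross_entropy(logits: torch.Tensor, label_tokens: torch.Tensor,
                         pad_id: int = 0) -> torch.Tensor:
    """Cross-entropy between generator logits (batch, seq_len, vocab) and the
    tokenised medical text label (batch, seq_len); padded positions (pad_id,
    an assumption) are ignored."""
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           label_tokens.reshape(-1),
                           ignore_index=pad_id)
```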
  • S209 Calculate a first adversarial loss according to the first discrimination result, and calculate a second adversarial loss and a third adversarial loss according to the second discrimination result.
  • the first adversarial loss is calculated according to the first discrimination result output by the discriminator, and the second adversarial loss and the third adversarial loss are calculated according to the second discrimination result.
  • the first adversarial loss, the second adversarial loss and the third adversarial loss are used to measure whether the generated medical report texts are judged to belong to the corresponding domain.
  • At least one embodiment of the present disclosure provides a specific implementation of calculating the first adversarial loss according to the first discrimination result, and of calculating the second adversarial loss and the third adversarial loss according to the second discrimination result; please refer to the description below for details.
  • Based on the source image-specific loss, the first encoder and the second encoder can learn different image features of the source image. Based on the target image-specific loss, the second encoder and the third encoder can learn different image features of the target image.
  • the cross-entropy loss it is possible to train the text generator to generate more accurate first medical report text.
  • the domain-invariant features of the target domain and the source domain can be made as close as possible.
  • At least one embodiment of the present disclosure provides a specific implementation of training the first encoder, the second encoder, the third encoder, the text generator and the discriminator based on the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss and the third adversarial loss; please refer to the description below for details.
  • the preset condition is a condition for completing the training.
  • the preset condition may be, for example, the number of times of training, or may be a numerical condition satisfied by the loss function.
  • the second encoder and the text generator obtained through adversarial training based on domain-invariant features can generate corresponding medical report text for medical images belonging to the image type of the target image.
  • medical report texts can be generated for medical images of image types lacking labels, and the scope of medical image types for generating medical report texts can be expanded.
  • the discriminator can be used to map data from different domains into the same domain and achieve feature-level alignment, so that the encoder and text generator obtained after training can generate more accurate medical report text corresponding to medical images.
  • the discriminator is used to determine the probability that the input medical report text was generated from the source image.
  • the first medical report text is input into the discriminator, and the obtained first discrimination result output by the discriminator includes a first probability value that each word segment in the first medical report text was generated from the source image.
  • the second medical report text is input into the discriminator, and the obtained second discrimination result output by the discriminator includes a second probability value that each word segment in the second medical report text was generated from the source image.
  • the first probability value may be expressed as $D(y_s)$, where $y_s$ represents the first medical report text.
  • the second probability value may be expressed as $D(y_t)$, where $y_t$ represents the second medical report text.
  • the value range of the first probability value and the second probability value is from 0 to 1; the closer the value is to 1, the higher the probability that the text was generated from the source image, and the closer it is to 0, the lower that probability.
  • At least one embodiment of the present disclosure provides a method for calculating a first adversarial loss according to a first discrimination result, which specifically includes:
  • the logarithms of the first probability values are taken and summed to obtain the first summation result, and the negative value of the first summation result is taken to obtain the first adversarial loss.
  • the logarithms of the first probability values are taken and then summed to obtain the first summation result.
  • the first summation result can be expressed as $\sum \log[D(y_s)]$.
  • the first adversarial loss can be expressed by formula (3): $L_{adv1}(y_s) = -\sum \log[D(y_s)]$.
  • At least one embodiment of the present disclosure provides a method for calculating a second adversarial loss and a third adversarial loss according to a second discrimination result, including:
  • the logarithms of the second probability values are taken and summed to obtain a second summation result, and the negative value of the second summation result is taken to obtain a second adversarial loss.
  • the difference between 1 and the second probability value is calculated and summed to obtain the third summation result, and the negative value of the third summation result is taken to obtain the third adversarial loss.
  • the logarithms of the second probability values are taken and then summed to obtain the second summation result.
  • the second summation result can be expressed as $\sum \log[D(y_t)]$.
  • the second adversarial loss can be expressed by formula (4): $L_{adv2}(y_t) = -\sum \log[D(y_t)]$.
  • the difference between 1 and each value in the second probability value is calculated and summed to obtain the third summation result, and the negative value of the third summation result is taken to obtain the third adversarial loss.
  • the third adversarial loss can be expressed by formula (5): $L_{adv3}(y_t) = -\sum \left(1 - D(y_t)\right)$.
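  • A sketch of formulas (3)-(5) as described above; note that the third loss follows the literal description here (the difference from 1, without a logarithm), whereas a standard GAN formulation would use log(1 - D(y_t)).

```python
import torch

def adversarial_losses(d_src: torch.Tensor, d_tgt: torch.Tensor):
    """d_src = D(y_s) and d_tgt = D(y_t): per-word-segment probabilities in
    (0, 1) output by the discriminator."""
    eps = 1e-8                                   # numerical-stability assumption
    l_adv1 = -torch.sum(torch.log(d_src + eps))  # formula (3): -sum log D(y_s)
    l_adv2 = -torch.sum(torch.log(d_tgt + eps))  # formula (4): -sum log D(y_t)
    l_adv3 = -torch.sum(1.0 - d_tgt)             # formula (5) as stated in the text
    return l_adv1, l_adv2, l_adv3
```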
  • the model can be optimized by reconstructing the image.
  • At least one embodiment of the present disclosure provides a method for training a medical report generation model. In addition to the above steps S201-S210, the method further includes the following three steps.
  • FIG. 4 is a schematic diagram of another method for generating a medical report model provided by at least one embodiment of the present disclosure.
  • A1 Input the first image feature and the second image feature into the first decoder to obtain the reconstructed source image.
  • the first decoder is used to generate a reconstructed source image according to domain-invariant features and unique features of the input source image.
  • the first image feature and the second image feature are input into the first decoder to obtain the reconstructed source image.
  • the first decoder may be composed of four convolutional layers.
  • A2 Input the third image feature and the fourth image feature into the second decoder to obtain the reconstructed target image.
  • the second decoder is used to generate a reconstructed target image according to the domain-invariant features and unique features of the input target image.
  • the third image feature and the fourth image feature are input into the second decoder to obtain a reconstructed target image.
  • the second decoder may consist of four convolutional layers.
  • the encoder and the decoder adopt an autoencoder structure.
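  • A minimal sketch of such a decoder, taking the concatenation of the domain-specific and shared feature vectors and reconstructing an image; all layer sizes and the output resolution are assumptions.

```python
import torch
import torch.nn as nn

class ConvDecoder(nn.Module):
    """Decoder (sketch) that reconstructs an image from a domain-specific
    feature and the shared feature, as in steps A1/A2."""
    def __init__(self, feature_dim: int = 256, out_channels: int = 1):
        super().__init__()
        self.fc = nn.Linear(2 * feature_dim, 128 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(True),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(True),
            nn.ConvTranspose2d(16, out_channels, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, f_specific: torch.Tensor, f_shared: torch.Tensor) -> torch.Tensor:
        h = self.fc(torch.cat([f_specific, f_shared], dim=1)).view(-1, 128, 8, 8)
        return self.deconv(h)  # reconstructed image, here (batch, 1, 128, 128)
```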
  • A3 Calculate the perceptual loss of the source image based on the source image and the reconstructed source image, and calculate the perceptual loss of the target image based on the target image and the reconstructed target image.
  • the source image perceptual loss is used to measure the gap between the source image and the reconstructed source image.
  • according to the target image and the reconstructed target image, the target image perceptual loss is calculated.
  • the target image perceptual loss is used to measure the gap between the target image and the reconstructed target image.
  • At least one embodiment of the present disclosure provides a specific implementation of calculating the source image perceptual loss according to the source image and the reconstructed source image, and of calculating the target image perceptual loss according to the target image and the reconstructed target image; please refer to the description below.
  • At least one embodiment of the present disclosure provides a specific implementation of training the first encoder, the second encoder, the third encoder, the text generator and the discriminator, which specifically includes:
  • based on the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, the third adversarial loss, the source image perceptual loss and the target image perceptual loss, training the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder and the second decoder.
  • the model can also be optimized according to the source image perceptual loss and the target image perceptual loss, so as to reduce the gap between the source image and the reconstructed source image and between the target image and the reconstructed target image, and to improve the accuracy of the model's image feature extraction for the source image and the target image.
  • the above losses can be combined to obtain the total loss.
  • the total loss can be expressed by the following formula: $L = L_{ce} + L_{difference} + L_{rec} + \lambda_{adv1} L_{adv1}(y_s) + \lambda_{adv2} L_{adv2}(y_t) + \lambda_{adv3} L_{adv3}(y_t)$
  • $L_{difference}$ represents the sum of the source image-specific loss and the target image-specific loss, where $L_{sdist}$ represents the source image-specific loss and $L_{tdist}$ represents the target image-specific loss.
  • $L_{ce}$ represents the cross-entropy loss.
  • $L_{rec}$ represents the sum of the source image perceptual loss and the target image perceptual loss, where $L_{perc}(x_s, x_{srec}; w)$ represents the source image perceptual loss and $L_{perc}(x_t, x_{trec}; w)$ represents the target image perceptual loss.
  • $L_{adv1}(y_s)$ represents the first adversarial loss, and $\lambda_{adv1}$ is the weight corresponding to the first adversarial loss.
  • $L_{adv2}(y_t)$ represents the second adversarial loss, and $\lambda_{adv2}$ is the weight corresponding to the second adversarial loss.
  • $L_{adv3}(y_t)$ represents the third adversarial loss, and $\lambda_{adv3}$ is the weight corresponding to the third adversarial loss.
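  • A small sketch of how the terms combine into the total loss defined above; the default adversarial weights are placeholders, since their values are not fixed here.

```python
def total_loss(l_ce, l_sdist, l_tdist, l_adv1, l_adv2, l_adv3,
               l_perc_src, l_perc_tgt,
               w_adv1=1.0, w_adv2=1.0, w_adv3=1.0):
    """Combine the losses per the total-loss formula; weight defaults are
    placeholder assumptions."""
    l_difference = l_sdist + l_tdist  # sum of image-specific losses
    l_rec = l_perc_src + l_perc_tgt   # sum of perceptual losses
    return (l_ce + l_difference + l_rec
            + w_adv1 * l_adv1 + w_adv2 * l_adv2 + w_adv3 * l_adv3)
```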
  • with minimizing the total loss as the optimization goal, the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder and the second decoder are trained.
  • the encoder can be optimized on the premise that the target image does not have a label, so that the encoder can extract more accurate image features and improve the accuracy of the trained model.
  • At least one embodiment of the present disclosure provides a method for calculating the perceptual loss of the source image according to the source image and the reconstructed source image, including the following four steps:
  • B1 Input the source image into the third image feature extraction network, and obtain the seventh image feature output by each feature extraction layer of the third image feature extraction network.
  • the third image feature extraction network is used to extract image features of the image.
  • the source image is input into the third image feature extraction network to obtain seventh image features output by each feature extraction layer of the third image feature extraction network.
  • the third image feature extraction network may be VGG Net (a deep convolutional neural network).
  • VGG Net can be pre-trained. The source image is input into VGG Net to obtain the seventh image feature $\phi^{(l)}(x_s)$, where $x_s$ represents the source image and $l$ represents the $l$-th feature extraction layer in VGG Net; $l$ is a positive integer greater than or equal to 1 and less than or equal to $L$, where $L$ is the total number of feature extraction layers of VGG Net.
  • the third image feature extraction network is used to extract the image features of the reconstructed source image to obtain eighth image features output by each feature extraction layer in the third image feature extraction network.
  • the eighth image feature can be expressed as $\phi^{(l)}(x_{srec})$, where $x_{srec}$ represents the reconstructed source image.
  • Each feature extraction layer in the third feature extraction network has a corresponding weight. According to the weight of each feature extraction layer, the seventh image feature output by each feature extraction layer, and the eighth image feature output by each feature extraction layer, the source image loss corresponding to the feature extraction layer is calculated.
  • the difference between the seventh image feature and the eighth image feature can be calculated first, then the L1 norm of the obtained difference can be calculated, and finally the L1 norm of the difference is multiplied by the weight to obtain the source image loss corresponding to the feature extraction layer.
  • the source image perceptual loss $L_{perc}(x_s, x_{srec}; w)$ can be expressed by the following formula: $L_{perc}(x_s, x_{srec}; w) = \sum_{l=1}^{L} \frac{w^{(l)}}{N^{(l)}} \left\lVert \phi^{(l)}(x_s) - \phi^{(l)}(x_{srec}) \right\rVert_1$
  • $w^{(l)}$ represents the weight of the $l$-th feature extraction layer;
  • $N^{(l)}$ represents the number of elements in the $l$-th feature extraction layer;
  • $\lVert\cdot\rVert_1$ represents the L1 norm.
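  • A hedged sketch of this perceptual loss using pretrained VGG features; the choice of VGG16, the tapped layers and the 3-channel input are illustrative assumptions (in practice the network would be built once and frozen).

```python
import torch
import torchvision.models as models

def perceptual_loss(x: torch.Tensor, x_rec: torch.Tensor, layer_weights) -> torch.Tensor:
    """L1 perceptual loss between an image and its reconstruction, summed over
    a few assumed VGG16 feature layers and normalised by layer size."""
    vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
    taps = {3, 8, 15, 22}  # assumed feature-extraction layers
    loss, h_x, h_r, k = 0.0, x, x_rec, 0
    for i, layer in enumerate(vgg):
        h_x, h_r = layer(h_x), layer(h_r)
        if i in taps:
            n = h_x.numel() / h_x.size(0)  # N^(l): elements per sample
            loss = loss + layer_weights[k] / n * torch.sum(torch.abs(h_x - h_r))
            k += 1
    return loss
```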
  • At least one embodiment of the present disclosure provides a specific implementation manner of calculating the perceptual loss of the target image according to the target image and the reconstructed target image, which specifically includes the following four steps:
  • B5 Input the target image into the third image feature extraction network, and obtain the ninth image feature output by each feature extraction layer of the third image feature extraction network.
  • the third image feature extraction network is used to extract image features of the target image to obtain ninth image features output by each feature extraction layer in the third image feature extraction network.
  • the ninth image feature can be expressed as $\phi^{(l)}(x_t)$, where $x_t$ represents the target image.
  • B6 Input the reconstructed target image into the third image feature extraction network, and obtain the tenth image feature output by each feature extraction layer of the third image feature extraction network.
  • the third image feature extraction network is used to extract image features of the reconstructed target image to obtain tenth image features output by each feature extraction layer in the third image feature extraction network.
  • the tenth image feature can be expressed as $\phi^{(l)}(x_{trec})$, where $x_{trec}$ represents the reconstructed target image.
  • Each feature extraction layer in the third feature extraction network has a corresponding weight. According to the weight of each feature extraction layer, the ninth image feature output by each feature extraction layer, and the tenth image feature output by each feature extraction layer, the target image loss corresponding to the feature extraction layer is calculated.
  • the difference between the ninth image feature and the tenth image feature can be calculated first, then the L1 norm of the obtained difference can be calculated, and finally the L1 norm of the difference is multiplied by the weight to obtain the target image loss corresponding to the feature extraction layer.
  • the target image perceptual loss $L_{perc}(x_t, x_{trec}; w)$ can be expressed by the following formula: $L_{perc}(x_t, x_{trec}; w) = \sum_{l=1}^{L} \frac{w^{(l)}}{N^{(l)}} \left\lVert \phi^{(l)}(x_t) - \phi^{(l)}(x_{trec}) \right\rVert_1$
  • $w^{(l)}$ represents the weight of the $l$-th feature extraction layer;
  • $N^{(l)}$ represents the number of elements in the $l$-th feature extraction layer;
  • $\lVert\cdot\rVert_1$ represents the L1 norm.
  • Some of the target images may have corresponding medical text labels.
  • the model can be trained in a semi-supervised manner.
  • At least one embodiment of the present disclosure provides a training method for a medical report generation model in which, after the training in the above steps S201-S210 is completed, training can be performed again; that is, in addition to the above steps, the method includes the following five steps:
  • C1 Determine the first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image.
  • the first score is used to measure the difference between the source image and the reconstructed source image, and the difference between the target image and the reconstructed target image.
  • At least one embodiment of the present disclosure provides a specific implementation of determining the first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image; please refer to the description below.
  • C2 Determine the second score according to the source image-specific loss and the target image-specific loss.
  • the second score is related to the source image specific loss as well as the target image specific loss.
  • At least one embodiment of the present disclosure provides a specific implementation manner of determining the second score according to the source image specific loss and the target image specific loss, please refer to the following.
  • C3 If the target image corresponds to a medical text label, calculate the natural language evaluation index as the third score according to the second medical report text and the medical text label corresponding to the target image.
  • the natural language evaluation index can be calculated according to the medical text labels of the target images and the second medical report text.
  • the calculated natural language evaluation indicator is determined as a third score.
  • the natural language evaluation index can be an index such as CIDEr (Consensus-based Image Description Evaluation, consensus-based image description evaluation).
  • the third score can be represented by formula (11): $SCORE_{eval} = \mathrm{CIDEr}(y_t, y)$
  • $\mathrm{CIDEr}(y_t, y)$ represents the CIDEr score of $y_t$ and $y$, where $y_t$ is the second medical report text generated based on the target image and $y$ is the medical text label corresponding to the target image.
  • C4 The first score, the second score and the third score are weighted and summed to obtain the reward value.
  • the reward value REWARD can be expressed by formula (12): $REWARD = \mu_1 \cdot SCORE_{rec} + \mu_2 \cdot SCORE_{dist} + \mu_3 \cdot SCORE_{eval}$
  • $SCORE_{rec}$ represents the first score, $SCORE_{dist}$ represents the second score, and $SCORE_{eval}$ represents the third score.
  • $\mu_1$ is the weight corresponding to the first score, $\mu_2$ is the weight corresponding to the second score, and $\mu_3$ is the weight corresponding to the third score.
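  • The three scores and the reward can then be combined as sketched below, following formulas (11), (12), (15) and (16); the default weights are placeholders.

```python
def scores_and_reward(s1: float, s2: float, l_sdist: float, l_tdist: float,
                      cider: float, mu1: float = 1.0, mu2: float = 1.0,
                      mu3: float = 1.0) -> float:
    """Weighted reward from the reconstruction, specificity and language scores."""
    score_rec = -(s1 + s2)             # formula (15)
    score_dist = -(l_sdist + l_tdist)  # formula (16)
    score_eval = cider                 # formula (11): CIDEr(y_t, y)
    return mu1 * score_rec + mu2 * score_dist + mu3 * score_eval  # formula (12)
```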
  • the reward value can reflect the training situation of the model through three aspects: the difference of the reconstructed image, the specificity loss of the image, and the natural language evaluation index.
  • C5 Retrain the first encoder, second encoder, third encoder, text generator, discriminator, first decoder, and second decoder with the goal of maximizing the reward value.
  • maximizing the reward value is taken as the training goal, and the text generator can be updated through reinforcement learning.
  • using the natural language evaluation index as the third score can take the natural language evaluation index into account when training the model, so that the goals of model training and model application are consistent, and the accuracy of the model can be further improved.
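  • The disclosure does not spell out the update rule; one common choice consistent with "updated through reinforcement learning" is a REINFORCE-style policy-gradient step such as the sketch below, where the baseline and the sampled-sequence log-probabilities are standard assumptions rather than details from the disclosure.

```python
import torch

def reinforce_step(log_probs: torch.Tensor, reward: float, baseline: float,
                   optimizer: torch.optim.Optimizer) -> None:
    """One policy-gradient update of the text generator: scale the sampled
    sequence's log-likelihood by the (baselined) reward and ascend it."""
    loss = -(reward - baseline) * log_probs.sum()  # maximise expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```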
  • At least one embodiment of the present disclosure provides a method for determining the first score based on the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image, including the following seven steps:
  • D1 Input the source image into the third image feature extraction network, and obtain the eleventh image feature output by the third image feature extraction network.
  • the third image feature extraction network is used to extract image features of the image.
  • the source image is input into the third image feature extraction network to obtain an eleventh image feature output by the third image feature extraction network.
  • the third image feature extraction network may be VGG Net (a deep convolutional neural network).
  • VGG Net can be pre-trained. The source image is input into VGG Net to obtain the eleventh image feature $\phi^{(l)}(x_s)$, where $x_s$ represents the source image and $l$ represents the $l$-th activation layer in VGG Net; $l$ is a positive integer greater than or equal to 1 and less than or equal to $L$, where $L$ is the maximum number of activation layers of VGG Net.
  • D2 input the reconstructed source image into the third image feature extraction network, and obtain the twelfth image feature output by the third image feature extraction network.
  • the twelfth image feature can be expressed as $\phi^{(l)}(x_{srec})$, where $x_{srec}$ is the reconstructed source image.
  • D3 Obtain a first difference value according to the difference between the eleventh image feature and the twelfth image feature.
  • the first difference value is used to indicate the difference between the eleventh image feature and the twelfth image feature.
  • the difference between the twelfth image feature and the eleventh image feature may be calculated first, and the L1 norm of this difference is then calculated to obtain the first difference value.
  • the first difference value $S_1$ can be represented by the following formula: $S_1 = \left\lVert \phi^{(l)}(x_{srec}) - \phi^{(l)}(x_s) \right\rVert_1$
  • D4 Input the target image into the third image feature extraction network, and obtain the thirteenth image feature output by the third image feature extraction network.
  • the thirteenth image feature can be expressed as $\phi^{(l)}(x_t)$, where $x_t$ is the target image.
  • D5 Input the reconstructed target image into the third image feature extraction network, and obtain the fourteenth image feature output by the third image feature extraction network.
  • the fourteenth image feature can be expressed as $\phi^{(l)}(x_{trec})$, where $x_{trec}$ is the reconstructed target image.
  • D6 Obtain a second difference value according to the difference between the thirteenth image feature and the fourteenth image feature.
  • the second difference value is used to indicate the difference between the thirteenth image feature and the fourteenth image feature.
  • the difference between the thirteenth image feature and the fourteenth image feature may be calculated first, and the L1 norm of this difference is then calculated to obtain the second difference value.
  • the second difference value $S_2$ can be represented by the following formula: $S_2 = \left\lVert \phi^{(l)}(x_t) - \phi^{(l)}(x_{trec}) \right\rVert_1$
  • D7 Sum the first difference value and the second difference value to obtain a fourth summation result, and take the negative value of the fourth summation result to obtain the first score.
  • the first score $SCORE_{rec}$ can be expressed by formula (15): $SCORE_{rec} = -(S_1 + S_2)$
  • At least one embodiment of the present disclosure provides a specific implementation of determining the second score according to the source image-specific loss and the target image-specific loss, including:
  • the source image-specific loss and the target image-specific loss are summed to obtain a fifth summation result, and the negative value of the fifth summation result is taken to obtain a second score.
  • the second score $SCORE_{dist}$ can be expressed by formula (16): $SCORE_{dist} = -(L_{sdist} + L_{tdist}) = -L_{difference}$
  • $L_{sdist}$ is the source image-specific loss, $L_{tdist}$ is the target image-specific loss, and $L_{difference}$ is the fifth summation result.
  • the first encoder, the second encoder, and the third encoder may also be trained in advance.
  • At least one embodiment of the present disclosure provides a training method for a medical report generation model, which includes the following three steps in addition to the above steps.
  • E1 Input the training image into the first image feature extraction network to obtain the fifth image feature, and input the fifth image feature into the first classification network to obtain the first predicted classification result of the training image; according to the first predicted classification result of the training image and the classification label corresponding to the training image, train the first image feature extraction network and the first classification network.
  • Training images are the images used to train the encoder.
  • the training images are medical images with classification labels.
  • the classification label is the disease corresponding to the medical image.
  • the medical image used as the training image may be a chest radiograph, and the corresponding classification label may be, for example, disease names of diseases such as pneumonia, pulmonary nodules, and cardiac hypertrophy.
  • the training images can be images from the CheXpert-small dataset.
  • the first image feature extraction network is used to extract image features.
  • the training image is input into the first image feature extraction network to obtain the fifth image feature output by the first image feature extraction network.
  • the first image feature extraction network may adopt an Inception-v3 network structure.
  • the first classification network is used to determine the classification type of the image according to the input image features.
  • the obtained fifth image feature is then input into the first classification network to obtain the first predicted classification result of the training image.
  • the first predicted classification result may include the image type of the training image.
  • the classification label of the training image can be used to measure the accuracy of the first predicted classification result of the training image.
  • a first image feature extraction network and a first classification network are trained according to the classification label of the training image and the first predicted classification result.
  • E2 Input the training image into the second image feature extraction network to obtain the sixth image feature, and input the sixth image feature into the second classification network to obtain the second predicted classification result of the training image; according to the second predicted classification result of the training image and The classification label corresponding to the training image is used to train the second image feature extraction network and the second classification network; the network structure of the first image feature extraction network is different from that of the second image feature extraction network.
  • the second image feature extraction network is a network with a different structure from the first image feature extraction network.
  • the second image feature extraction network is used to extract image features.
  • the training image is input into the second image feature extraction network to obtain the sixth image feature output by the second image feature extraction network.
  • the second classification network is used to determine the classification type of the image according to the input image features.
  • the obtained sixth image feature is then input into the second classification network to obtain the second predicted classification result of the training image output by the second classification network.
  • the second predicted classification result may include the image type of the training image.
  • the classification label of the training image can be used to measure whether the second predicted classification result of the training image is accurate. Using the classification labels of the training images and the second predicted classification results of the training images, the second image feature extraction network and the second classification network are trained.
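  • A minimal sketch of this pre-training loop for a feature extraction network plus classification head on labelled training images; the optimizer, learning rate and epoch count are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pretrain_classifier(feature_net: nn.Module, classifier: nn.Module,
                        loader, epochs: int = 10, lr: float = 1e-4) -> None:
    """Train feature extractor + classifier on (image, classification label)
    pairs, e.g. chest films from CheXpert-small."""
    params = list(feature_net.parameters()) + list(classifier.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for images, labels in loader:
            logits = classifier(feature_net(images))  # predicted classification
            loss = F.cross_entropy(logits, labels)    # vs. classification label
            opt.zero_grad()
            loss.backward()
            opt.step()
```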
  • E3 Determine the model parameters of the trained first image feature extraction network as the initial model parameters of the first encoder and the third encoder, and determine the model parameters of the trained second image feature extraction network as the initial model parameters of the second encoder; the first image feature extraction network has the same network structure as the first encoder and the third encoder, and the second image feature extraction network has the same network structure as the second encoder.
  • the network structure of the first image feature extraction network is the same as that of the first encoder and the third encoder. After obtaining the trained first image feature extraction network, use the first image feature extraction network to determine the initial model parameters of the first encoder and the third encoder. Specifically, the model parameters of the first feature extraction network are determined as the initial model parameters of the first encoder and the initial model parameters of the third encoder.
  • the second image feature extraction network has the same network structure as the second encoder. After obtaining the trained second image feature extraction network, use the model parameters of the second image feature extraction network to determine the initial model parameters of the second encoder. Specifically, the model parameters of the second image feature extraction network are determined as the initial model parameters of the second encoder.
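  • In code, the parameter transfer can be as simple as the sketch below; the variable names are hypothetical, and matching architectures are assumed so that load_state_dict succeeds without remapping.

```python
# Initialise the encoders from the pre-trained feature extraction networks.
first_encoder.load_state_dict(first_feature_net.state_dict())    # E1 network -> first encoder
third_encoder.load_state_dict(first_feature_net.state_dict())    # E1 network -> third encoder
second_encoder.load_state_dict(second_feature_net.state_dict())  # E2 network -> second encoder
```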
  • pre-training makes the first encoder, the second encoder and the third encoder more accurate, improves the accuracy of the first encoder, the second encoder and the third encoder in extracting image features, and improves the efficiency of model training.
  • the initial model parameters of the first encoder, the second encoder and the third encoder may also be determined by random initialization.
  • At least one embodiment of the present disclosure also provides a training method for a medical report generation model, which includes the following steps in addition to the above steps.
  • the initial model parameters of the first encoder, the second encoder and the third encoder are randomly initialized.
  • the initial model parameters of the first encoder, the second encoder, and the third encoder are randomly initialized. Then, in the above manner, the first encoder, the second encoder and the third encoder are trained to determine model parameters.
  • an embodiment of the present disclosure provides a method for generating a medical report.
  • Fig. 5 is a flow chart of a method for generating a medical report provided by at least one embodiment of the present disclosure, the method includes S501-S502:
  • S501 Input medical images into an encoder to obtain medical image features.
  • the encoder is the second encoder trained by using the above training method of the medical report generation model.
  • the trained second encoder can more accurately extract medical image features of the medical image.
  • the medical image that needs to generate the corresponding medical report text is input into the encoder to obtain the medical image features corresponding to the medical image.
  • the image type of the medical image is consistent with the image type of the target image.
  • the target image includes an image generated by an endoscope, and correspondingly, the medical image may be an image generated by an endoscope.
  • S502 Input the medical image features into the text generator to obtain the medical report text.
  • the text generator is a text generator trained by the above-mentioned training method of the medical report generation model.
  • the trained text generator can generate more accurate medical report text based on the input medical image features.
  • the medical image features of the medical image output by the encoder are input into the text generator to obtain the medical report text output by the text generator.
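  • Putting S501 and S502 together, inference can be sketched as greedy autoregressive decoding; the token ids, maximum length and the generator interface are assumptions carried over from the earlier sketches.

```python
import torch

@torch.no_grad()
def generate_report(image, encoder, generator, bos_id=1, eos_id=2, max_len=100):
    """Medical image -> trained second encoder -> text generator -> report tokens."""
    feature = encoder(image.unsqueeze(0))  # S501: extract medical image features
    tokens = [bos_id]
    for _ in range(max_len):               # S502: decode one token at a time
        logits = generator(feature, torch.tensor([tokens]))
        next_id = int(logits[0, -1].argmax())
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # generated medical report token ids
```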
  • the encoder and the text generator trained by the above training method of the medical report generation model can be applied to medical images of the image type corresponding to the target image, and generate the corresponding medical report text.
  • At least one embodiment of the present disclosure also provides a training device for the medical report generation model.
  • the training device for the medical report generation model will be described below in conjunction with the accompanying drawings.
  • FIG. 6 is a schematic structural diagram of a training device for a medical report generation model provided by at least one embodiment of the present disclosure.
  • the training device of this medical report generation model includes:
  • the first input unit 601 is configured to input a source image into a first encoder to obtain a first image feature, and input the source image into a second encoder to obtain a second image feature; the source image corresponds to a medical text label;
  • the second input unit 602 is configured to input a target image into a third encoder to obtain a third image feature, and input the target image to the second encoder to obtain a fourth image feature;
  • the third input unit 603 is configured to input the second image feature into the text generator to obtain the first medical report text;
  • a fourth input unit 604 configured to input the fourth image feature into the text generator to obtain a second medical report text
  • the fifth input unit 605 is configured to input the first medical report text into the discriminator to obtain a first discriminant result
  • a sixth input unit 606, configured to input the second medical report text into the discriminator to obtain a second discriminant result
  • the first calculation unit 607 is configured to calculate the source image specificity loss according to the first image feature and the second image feature, and calculate the target image specificity loss according to the third image feature and the fourth image feature loss;
  • the second calculation unit 608 is configured to calculate a cross-entropy loss according to the first medical report text and the medical text label corresponding to the source image;
  • the third calculation unit 609 is configured to calculate a first adversarial loss according to the first discrimination result, and calculate a second adversarial loss and a third adversarial loss according to the second discrimination result;
  • Execution unit 610, configured to, according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss and the third adversarial loss, train the first encoder, the second encoder, the third encoder, the text generator and the discriminator, and repeat the step of inputting the source image into the first encoder and the subsequent steps until the preset condition is reached.
  • the device further includes:
  • a seventh input unit configured to input the first image feature and the second image feature into the first decoder to obtain a reconstructed source image
  • An eighth input unit configured to input the third image feature and the fourth image feature into a second decoder to obtain a reconstructed target image
  • a fourth calculation unit configured to calculate the perceptual loss of the source image according to the source image and the reconstructed source image, and calculate the perceptual loss of the target image according to the target image and the reconstructed target image;
  • the execution unit is specifically configured to, according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, the third adversarial loss, the source image perceptual loss and the target image perceptual loss, train the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder and the second decoder.
  • some of the target images correspond to medical text labels; the device further includes:
  • a first determination unit configured to determine a first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image;
  • a second determination unit configured to determine a second score according to the source image-specific loss and the target image-specific loss;
  • a fifth calculation unit configured to, if the target image corresponds to a medical text label, calculate a natural language evaluation index as a third score according to the second medical report text and the medical text label corresponding to the target image;
  • a summation unit configured to perform a weighted summation of the first score, the second score, and the third score to obtain a reward value;
  • a training unit configured to retrain the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder with the goal of maximizing the reward value.
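A minimal sketch of the reward computation follows. The weighted summation of the three scores is as described above, but the weight values themselves are hypothetical placeholders, since the disclosure does not specify them:

```python
def reward_value(first_score: float, second_score: float, third_score: float,
                 w1: float = 0.3, w2: float = 0.3, w3: float = 0.4) -> float:
    # Weighted summation of the three scores; the weight values here are
    # illustrative assumptions, not values given in the disclosure.
    return w1 * first_score + w2 * second_score + w3 * third_score
```

Retraining "with the goal of maximizing the reward value" could then, for example, minimize the negative reward-weighted log-likelihood of the generated report (a REINFORCE-style update), though the disclosure does not commit to any particular reinforcement-learning algorithm.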
  • the device further includes:
  • a seventh input unit configured to input a training image into a first image feature extraction network to obtain a fifth image feature, input the fifth image feature into a first classification network to obtain a first predicted classification result of the training image, and train the first image feature extraction network and the first classification network according to the first predicted classification result of the training image and the classification label corresponding to the training image;
  • an eighth input unit configured to input the training image into a second image feature extraction network to obtain a sixth image feature, input the sixth image feature into a second classification network to obtain a second predicted classification result of the training image, and train the second image feature extraction network and the second classification network according to the second predicted classification result of the training image and the classification label corresponding to the training image; the first image feature extraction network and the second image feature extraction network have different network structures;
  • a third determination unit configured to determine the model parameters of the trained first image feature extraction network as the initial model parameters of the first encoder and the third encoder, and determine the model parameters of the trained second image feature extraction network as the initial model parameters of the second encoder; the first image feature extraction network has the same network structure as the first encoder and the third encoder, and the second image feature extraction network has the same network structure as the second encoder.
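A minimal sketch of this classification-based pretraining and parameter hand-off is given below; the optimizer, learning rate, and the `load_state_dict` hand-off are illustrative assumptions (the disclosure only states that the pretrained parameters become the encoders' initial parameters):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def pretrain_feature_extractor(feature_net: nn.Module,
                               classifier: nn.Module,
                               loader: DataLoader,
                               epochs: int = 1) -> None:
    # Jointly train a feature extraction network and its classification head
    # on labelled training images with a cross-entropy objective.
    opt = torch.optim.Adam(
        list(feature_net.parameters()) + list(classifier.parameters()), lr=1e-4)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            logits = classifier(feature_net(images))
            loss = criterion(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Hand-off (hypothetical variable names): the pretrained parameters become the
# encoders' initial parameters, since the network structures match.
# first_encoder.load_state_dict(feature_net_1.state_dict())
# third_encoder.load_state_dict(feature_net_1.state_dict())
# second_encoder.load_state_dict(feature_net_2.state_dict())
```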
  • the device further includes:
  • An initialization unit configured to randomly initialize initial model parameters of the first encoder, the second encoder, and the third encoder.
  • the first discrimination result includes a first probability value indicating, for each word segment in the first medical report text, whether that word segment is generated from the source image;
  • the second discrimination result includes a second probability value indicating, for each word segment in the second medical report text, whether that word segment is generated from the source image;
  • the third calculation unit 609 is specifically configured to take the logarithms of the first probability values and then sum them to obtain a first summation result, and take the negative of the first summation result to obtain the first adversarial loss;
  • the third calculation unit 609 is specifically configured to take the logarithms of the second probability values and then sum them to obtain a second summation result, and take the negative of the second summation result to obtain the second adversarial loss;
  • the differences between 1 and the second probability values are calculated and then summed to obtain a third summation result, and the negative of the third summation result is taken to obtain the third adversarial loss.
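A minimal sketch of these three computations, taking the per-word-segment probabilities as tensors and following the description literally; the epsilon guard is an implementation detail added here for numerical stability, not part of the description:

```python
import torch

def adversarial_losses(first_probs: torch.Tensor, second_probs: torch.Tensor):
    eps = 1e-8  # numerical-stability guard (an added assumption)
    # First adversarial loss: negative sum of the logs of the first probabilities.
    first_loss = -torch.log(first_probs + eps).sum()
    # Second adversarial loss: negative sum of the logs of the second probabilities.
    second_loss = -torch.log(second_probs + eps).sum()
    # Third adversarial loss: negative sum of (1 - second probability), per the text.
    third_loss = -(1.0 - second_probs).sum()
    return first_loss, second_loss, third_loss
```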
  • the fourth calculation unit is specifically configured to input the source image into a third image feature extraction network to obtain the seventh image features output by the feature extraction layers of the third image feature extraction network;
  • the reconstructed source image is input into the third image feature extraction network to obtain the eighth image features output by the feature extraction layers; for each feature extraction layer, the source image loss corresponding to that layer is calculated according to the seventh image feature and the eighth image feature output by that layer and the weight corresponding to that layer;
  • the source image losses corresponding to the feature extraction layers are summed to obtain the source image perceptual loss;
  • the fourth calculation unit is specifically configured to input the target image into the third image feature extraction network to obtain the ninth image features output by the feature extraction layers, and to input the reconstructed target image into the third image feature extraction network to obtain the tenth image features output by the feature extraction layers; for each feature extraction layer, the target image loss corresponding to that layer is calculated according to the ninth image feature and the tenth image feature output by that layer and the weight corresponding to that layer;
  • the target image losses corresponding to the feature extraction layers are summed to obtain the target image perceptual loss.
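Here is a minimal sketch of this layer-weighted perceptual loss; the L1 distance between feature maps is an assumption, since the description specifies only that each layer's loss is computed from the two feature maps and the layer's weight:

```python
import torch

def perceptual_loss(real_feats, recon_feats, layer_weights):
    # real_feats / recon_feats: lists of per-layer feature maps from the
    # third image feature extraction network; layer_weights: one scalar
    # weight per feature extraction layer.
    total = torch.zeros(())
    for f_real, f_recon, w in zip(real_feats, recon_feats, layer_weights):
        total = total + w * (f_real - f_recon).abs().mean()
    return total
```

The same function serves for both the source image perceptual loss (source vs. reconstructed source features) and the target image perceptual loss (target vs. reconstructed target features).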
  • the first determination unit is specifically configured to input the source image into a third image feature extraction network, and obtain an eleventh image feature output by the third image feature extraction network;
  • the second determination unit is specifically configured to sum the source image-specific loss and the target image-specific loss to obtain a fifth summation result, and take the negative of the fifth summation result to obtain the second score.
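This negation is a one-liner; a sketch for completeness:

```python
def second_score(source_specific_loss: float, target_specific_loss: float) -> float:
    # Negative of the summed specificity losses: lower specificity losses
    # yield a higher second score.
    return -(source_specific_loss + target_specific_loss)
```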
  • At least one embodiment of the present disclosure further provides a device for generating a medical report.
  • the device for generating a medical report will be described below with reference to the accompanying drawings.
  • FIG. 7 is a schematic structural diagram of a medical report generating device provided by at least one embodiment of the present disclosure.
  • the medical report generation device includes:
  • an input unit 701 configured to input a medical image into an encoder to obtain medical image features;
  • a generation unit 702 configured to input the medical image features into a text generator to obtain a medical report text;
  • the encoder is a second encoder trained according to the method for training a medical report generation model described in any one of the above embodiments;
  • the text generator is a text generator trained according to the method for training a medical report generation model described in any one of the above embodiments.
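The inference path of this device is straightforward; a minimal sketch, assuming both modules are PyTorch modules trained as described above:

```python
import torch

@torch.no_grad()
def generate_report(medical_image: torch.Tensor,
                    second_encoder: torch.nn.Module,
                    text_generator: torch.nn.Module):
    # encoder -> medical image features -> text generator -> report text
    features = second_encoder(medical_image.unsqueeze(0))  # add batch dim
    return text_generator(features)
```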
  • At least one embodiment of the present disclosure further provides an electronic device, including: one or more processors; and a storage device on which one or more programs are stored, where, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method for training a medical report generation model according to any one of the above embodiments, or to implement the medical report generation method according to the above embodiments.
  • Referring to FIG. 8, it shows a schematic structural diagram of an electronic device 800 suitable for implementing the embodiments of the present disclosure.
  • the terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (portable Android devices, i.e., tablet computers), PMPs (portable media players), and vehicle-mounted terminals (such as vehicle-mounted navigation terminals), as well as fixed terminals such as digital TVs (television sets) and desktop computers.
  • the electronic device shown in FIG. 8 is only an example, and should not limit the functions and scope of use of the embodiments of the present disclosure.
  • the electronic device 800 may include a processing device 801 (such as a central processing unit or a graphics processing unit), which can perform various appropriate actions and processes according to a program stored in the read-only memory (ROM) 802 or a program loaded from the storage device 808 into the random access memory (RAM) 803. The RAM 803 also stores various programs and data necessary for the operation of the electronic device 800.
  • the processing device 801, ROM 802, and RAM 803 are connected to each other through a bus 804.
  • An input/output (I/O) interface 805 is also connected to the bus 804 .
  • the following devices can be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 807 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; storage devices 808 including, for example, a magnetic tape and a hard disk; and a communication device 809.
  • the communication means 809 may allow the electronic device 800 to communicate with other devices wirelessly or by wire to exchange data. While FIG. 8 shows electronic device 800 having various means, it is to be understood that implementing or having all of the means shown is not a requirement. More or fewer means may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer readable medium, where the computer program includes program code for executing the method shown in the flowchart.
  • the computer program may be downloaded and installed from a network via communication means 809, or from storage means 808, or from ROM 802.
  • when the computer program is executed by the processing device 801, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
  • the electronic device provided by the embodiments of the present disclosure belongs to the same inventive concept as the method for training a medical report generation model and the medical report generation method provided by the above embodiments; for technical details not described in detail in this embodiment, reference may be made to the above embodiments, and this embodiment has the same beneficial effects as the above embodiments.
  • At least one embodiment of the present disclosure provides a computer storage medium on which a computer program is stored, where, when the program is executed by a processor, the method for training a medical report generation model according to any one of the above embodiments, or the medical report generation method according to the above embodiments, is implemented.
  • the above-mentioned computer-readable medium in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
  • a computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted by any appropriate medium, including but not limited to wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • the client and the server can communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication (e.g., a communication network) in any form or medium. Examples of communication networks include local area networks ("LANs"), wide area networks ("WANs"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist independently without being incorporated into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the electronic device, the electronic device is made to execute the above-mentioned training method of the medical report generation model, or the medical report generation method.
  • computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments described in the present disclosure may be implemented by software or by hardware.
  • under certain circumstances, the name of a unit/module does not constitute a limitation on the unit itself; for example, a voice data collection module can also be described as a "data collection module".
  • For example, and without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer discs, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • Example 1 provides a method for training a medical report generation model, the method comprising:
  • the source image is input into the first encoder to obtain the first image feature, and the source image is input into the second encoder to obtain the second image feature; the source image corresponds to a medical text label;
  • according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss, training the first encoder, the second encoder, the third encoder, the text generator, and the discriminator, and repeating the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is reached.
  • Example 2 provides a method for training a medical report generation model, the method further comprising:
  • the training of the first encoder, the second encoder, the third encoder, the text generator, and the discriminator includes:
  • training the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, the third adversarial loss, the source image perceptual loss, and the target image perceptual loss.
  • Example 3 provides a method for training a medical report generation model, where some of the target images correspond to medical text labels; the method further includes:
  • the target image corresponds to a medical text label
  • Example 4 provides a method for training a medical report generation model, the method further comprising:
  • inputting the training image into the second image feature extraction network to obtain a sixth image feature;
  • inputting the sixth image feature into the second classification network to obtain a second predicted classification result of the training image;
  • training the second image feature extraction network and the second classification network according to the second predicted classification result of the training image and the classification label corresponding to the training image; the network structures of the first image feature extraction network and the second image feature extraction network are different;
  • the network structure of the first image feature extraction network is the same as that of the first encoder and the third encoder, and the network structure of the second image feature extraction network is the same as that of the second encoder.
  • Example 5 provides a method for training a medical report generation model, the method further comprising:
  • Example 6 provides a method for training a medical report generation model, where the first discrimination result includes a first probability value indicating whether each word segment in the first medical report text is generated from the source image, and the second discrimination result includes a second probability value indicating whether each word segment in the second medical report text is generated from the source image;
  • the calculating of the first adversarial loss according to the first discrimination result includes: taking the logarithms of the first probability values and then summing them to obtain a first summation result, and taking the negative of the first summation result to obtain the first adversarial loss;
  • the calculating of the second adversarial loss and the third adversarial loss according to the second discrimination result includes: taking the logarithms of the second probability values and then summing them to obtain a second summation result, and taking the negative of the second summation result to obtain the second adversarial loss;
  • the difference between 1 and the second probability value is calculated and then summed to obtain a third summation result, and the negative value of the third summation result is taken to obtain a third adversarial loss.
  • Example 7 provides a method for training a medical report generation model, where calculating the source image perceptual loss according to the source image and the reconstructed source image includes:
  • summing the source image losses corresponding to the feature extraction layers to obtain the source image perceptual loss;
  • the calculating of the target image perceptual loss according to the target image and the reconstructed target image includes:
  • calculating, according to the ninth image feature and the tenth image feature output by each feature extraction layer and the weight corresponding to that layer, the target image loss corresponding to that layer;
  • summing the target image losses corresponding to the feature extraction layers to obtain the target image perceptual loss.
  • Example 8 provides a method for training a medical report generation model, where determining the first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image includes:
  • Example 9 provides a method for training a medical report generation model, where determining the second score according to the source image-specific loss and the target image-specific loss includes:
  • Example 10 provides a medical report generation method, the method comprising:
  • the encoder is a second encoder trained according to the training method of the medical report generation model described in any one of the above embodiments;
  • the text generator is a text generator trained according to the training method of the medical report generation model described in any one of the above embodiments.
  • Example 11 provides a training device for a medical report generation model, the device comprising:
  • the first input unit is configured to input a source image into a first encoder to obtain a first image feature, and input the source image to a second encoder to obtain a second image feature; the source image corresponds to a medical text label;
  • the second input unit is configured to input the target image into the third encoder to obtain a third image feature, and input the target image to the second encoder to obtain a fourth image feature;
  • a third input unit configured to input the second image feature into the text generator to obtain a first medical report text;
  • a fourth input unit configured to input the fourth image feature into the text generator to obtain a second medical report text;
  • a fifth input unit configured to input the first medical report text into the discriminator to obtain a first discrimination result;
  • a sixth input unit configured to input the second medical report text into the discriminator to obtain a second discrimination result;
  • a first calculation unit configured to calculate a source image-specific loss based on the first image feature and the second image feature, and calculate a target image-specific loss based on the third image feature and the fourth image feature;
  • a second calculation unit configured to calculate a cross-entropy loss according to the first medical report text and the medical text label corresponding to the source image;
  • a third calculation unit configured to calculate a first adversarial loss according to the first discrimination result, and calculate a second adversarial loss and a third adversarial loss according to the second discrimination result;
  • an execution unit configured to train the first encoder, the second encoder, the third encoder, the text generator, and the discriminator according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss, and to repeat the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is reached.
  • Example 12 provides a training device for a medical report generation model, the device further comprising:
  • a seventh input unit configured to input the first image feature and the second image feature into the first decoder to obtain a reconstructed source image;
  • an eighth input unit configured to input the third image feature and the fourth image feature into a second decoder to obtain a reconstructed target image;
  • a fourth calculation unit configured to calculate the perceptual loss of the source image according to the source image and the reconstructed source image, and calculate the perceptual loss of the target image according to the target image and the reconstructed target image;
  • the execution unit is specifically configured to train the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, the third adversarial loss, the source image perceptual loss, and the target image perceptual loss.
  • Example 13 provides a training device for a medical report generation model, where some of the target images correspond to medical text labels; the device further includes:
  • a first determination unit configured to determine a first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image;
  • a second determination unit configured to determine a second score according to the source image-specific loss and the target image-specific loss;
  • a fifth calculation unit configured to, if the target image corresponds to a medical text label, calculate a natural language evaluation index as a third score according to the second medical report text and the medical text label corresponding to the target image;
  • a summation unit configured to perform a weighted summation of the first score, the second score, and the third score to obtain a reward value;
  • a training unit configured to retrain the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder with the goal of maximizing the reward value.
  • Example 14 provides a training device for a medical report generation model, the device further comprising:
  • a seventh input unit configured to input a training image into the first image feature extraction network to obtain a fifth image feature, input the fifth image feature into the first classification network to obtain a first predicted classification result of the training image, and train the first image feature extraction network and the first classification network according to the first predicted classification result of the training image and the classification label corresponding to the training image;
  • an eighth input unit configured to input the training image into the second image feature extraction network to obtain a sixth image feature, input the sixth image feature into the second classification network to obtain a second predicted classification result of the training image, and train the second image feature extraction network and the second classification network according to the second predicted classification result of the training image and the classification label corresponding to the training image; the first image feature extraction network and the second image feature extraction network have different network structures;
  • a third determination unit configured to determine the model parameters of the trained first image feature extraction network as the initial model parameters of the first encoder and the third encoder, and determine the model parameters of the trained second image feature extraction network as the initial model parameters of the second encoder; the first image feature extraction network has the same network structure as the first encoder and the third encoder, and the second image feature extraction network has the same network structure as the second encoder.
  • Example 15 provides a training device for a medical report generation model, the device further comprising:
  • An initialization unit configured to randomly initialize initial model parameters of the first encoder, the second encoder, and the third encoder.
  • Example 16 provides a training device for a medical report generation model, where the first discrimination result includes a first probability value indicating whether each word segment in the first medical report text is generated from the source image, and the second discrimination result includes a second probability value indicating whether each word segment in the second medical report text is generated from the source image;
  • the third calculation unit is specifically configured to sum the logarithms of the first probability values to obtain a first summation result, and take a negative value of the first summation result to obtain a first adversarial loss;
  • the third calculation unit is specifically configured to sum the second probability values after taking logarithms to obtain a second summation result, and take a negative value of the second summation result to obtain a second adversarial loss;
  • the difference between 1 and the second probability value is calculated and then summed to obtain a third summation result, and the negative value of the third summation result is taken to obtain a third adversarial loss.
  • Example 17 provides a training device for a medical report generation model, where the fourth calculation unit is specifically configured to input the source image into the third image feature extraction network to obtain the seventh image features output by the feature extraction layers of the third image feature extraction network;
  • the reconstructed source image is input into the third image feature extraction network to obtain the eighth image features output by the feature extraction layers; for each feature extraction layer, the source image loss corresponding to that layer is calculated according to the seventh image feature and the eighth image feature output by that layer and the weight corresponding to that layer;
  • the source image losses corresponding to the feature extraction layers are summed to obtain the source image perceptual loss;
  • the fourth calculation unit is specifically configured to input the target image into the third image feature extraction network to obtain the ninth image features output by the feature extraction layers, and to input the reconstructed target image into the third image feature extraction network to obtain the tenth image features; for each feature extraction layer, the target image loss corresponding to that layer is calculated according to the ninth image feature and the tenth image feature output by that layer and the weight corresponding to that layer;
  • the target image losses corresponding to the feature extraction layers are summed to obtain the target image perceptual loss.
  • Example 18 provides a training device for a medical report generation model, where the first determination unit is specifically configured to input the source image into a third image feature extraction network to obtain the eleventh image feature output by the third image feature extraction network;
  • Example 19 provides a training device for a medical report generation model, where
  • the second determination unit is specifically configured to sum the source image-specific loss and the target image-specific loss to obtain a fifth summation result, and take the negative of the fifth summation result to obtain the second score.
  • Example 20 provides a medical report generating device, the device comprising:
  • an input unit configured to input a medical image into the encoder to obtain medical image features;
  • a generation unit configured to input the medical image features into a text generator to obtain a medical report text;
  • the encoder is a second encoder trained according to the training method of the medical report generation model described in any one of the above embodiments;
  • the text generator is a text generator trained according to the training method of the medical report generation model described in any one of the above embodiments.
  • Example 21 provides an electronic device, including:
  • one or more processors;
  • a storage device on which one or more programs are stored, where, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method for training a medical report generation model according to any one of the above embodiments, or to implement the medical report generation method according to the above embodiments.
  • Example 22 provides a computer-readable medium on which a computer program is stored, where, when the program is executed by a processor, the method for training a medical report generation model according to any one of the above embodiments, or the medical report generation method according to the above embodiments, is implemented.
  • each embodiment in this specification is described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar between embodiments, reference may be made from one to another.
  • since the system or device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief; for relevant details, refer to the description of the method part.
  • "At least one (item)" means one or more, and "plurality" means two or more.
  • "And/or" describes an association relationship between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural.
  • the character "/" generally indicates that the objects before and after it are in an "or" relationship.
  • "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items.
  • "At least one item (piece) of a, b, or c" can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can each be single or multiple.
  • a software module may reside in random access memory (RAM), read-only memory (ROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Abstract

A medical report generation method and apparatus, a model training method and apparatus, and a device. The method comprises: extracting image features from a source image and a target image, respectively, and using the image features to obtain a corresponding first medical report text and second medical report text; then using a discriminator to obtain a first discrimination result and a second discrimination result corresponding to the first medical report text and the second medical report text, respectively; and finally calculating a source image-specific loss, a target image-specific loss, a cross-entropy loss, a first adversarial loss, a second adversarial loss, and a third adversarial loss, and training a medical report generation model using the calculated losses. The trained medical report generation model can apply knowledge learned from the domain of source images having many labels to the domains of other types of medical images, thereby automatically generating medical report texts for medical images having few or no labels.

Description

Medical Report Generation Method, Model Training Method, Apparatus and Device

This application claims priority to Chinese Patent Application No. 202111013687.9, filed on August 31, 2021, the entire disclosure of which is incorporated herein by reference as a part of this application.
Technical Field
Embodiments of the present disclosure relate to a medical report generation method, a model training method, an apparatus, and a device.
Background
A medical image is an image of the internal tissue of the human body, or of a part of the human body, and can help doctors understand a patient's state of health. A medical image has a corresponding medical report, which contains the results of analyzing that medical image. For example, a medical report may record the location of the patient's disease, the extent of the lesion, and the affected organs, as determined from the medical image.

At present, it is difficult to automatically generate a corresponding medical report for a medical image. How to automatically generate medical reports based on medical images is a problem that needs to be solved.
Summary
Embodiments of the present disclosure provide a medical report generation method, a model training method, an apparatus, and a device, which can automatically generate a medical report from a medical image.
In a first aspect, embodiments of the present disclosure provide a method for training a medical report generation model, the method including:

inputting a source image into a first encoder to obtain a first image feature, and inputting the source image into a second encoder to obtain a second image feature, the source image corresponding to a medical text label;

inputting a target image into a third encoder to obtain a third image feature, and inputting the target image into the second encoder to obtain a fourth image feature;

inputting the second image feature into a text generator to obtain a first medical report text;

inputting the fourth image feature into the text generator to obtain a second medical report text;

inputting the first medical report text into a discriminator to obtain a first discrimination result;

inputting the second medical report text into the discriminator to obtain a second discrimination result;

calculating a source image-specific loss according to the first image feature and the second image feature, and calculating a target image-specific loss according to the third image feature and the fourth image feature;

calculating a cross-entropy loss according to the first medical report text and the medical text label corresponding to the source image;

calculating a first adversarial loss according to the first discrimination result, and calculating a second adversarial loss and a third adversarial loss according to the second discrimination result; and

training the first encoder, the second encoder, the third encoder, the text generator, and the discriminator according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss, and repeating the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is reached.
In a second aspect, embodiments of the present disclosure provide a medical report generation method, the method including:

inputting a medical image into an encoder to obtain medical image features; and

inputting the medical image features into a text generator to obtain a medical report text;

where the encoder is a second encoder trained by the method for training a medical report generation model according to any one of the above embodiments, and the text generator is a text generator trained by the method for training a medical report generation model according to any one of the above embodiments.
In a third aspect, embodiments of the present disclosure provide a training apparatus for a medical report generation model, the apparatus including:

a first input unit configured to input a source image into a first encoder to obtain a first image feature, and input the source image into a second encoder to obtain a second image feature, the source image corresponding to a medical text label;

a second input unit configured to input a target image into a third encoder to obtain a third image feature, and input the target image into the second encoder to obtain a fourth image feature;

a third input unit configured to input the second image feature into a text generator to obtain a first medical report text;

a fourth input unit configured to input the fourth image feature into the text generator to obtain a second medical report text;

a fifth input unit configured to input the first medical report text into a discriminator to obtain a first discrimination result;

a sixth input unit configured to input the second medical report text into the discriminator to obtain a second discrimination result;

a first calculation unit configured to calculate a source image-specific loss according to the first image feature and the second image feature, and calculate a target image-specific loss according to the third image feature and the fourth image feature;

a second calculation unit configured to calculate a cross-entropy loss according to the first medical report text and the medical text label corresponding to the source image;

a third calculation unit configured to calculate a first adversarial loss according to the first discrimination result, and calculate a second adversarial loss and a third adversarial loss according to the second discrimination result; and

an execution unit configured to train the first encoder, the second encoder, the third encoder, the text generator, and the discriminator according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss, and to repeat the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is reached.
In a fourth aspect, embodiments of the present disclosure provide a medical report generation apparatus, the apparatus including:

an input unit configured to input a medical image into an encoder to obtain medical image features; and

a generation unit configured to input the medical image features into a text generator to obtain a medical report text;

where the encoder is a second encoder trained by the method for training a medical report generation model according to any one of the above embodiments, and the text generator is a text generator trained by the method for training a medical report generation model according to any one of the above embodiments.
In a fifth aspect, embodiments of the present disclosure provide an electronic device, including:

one or more processors; and

a storage device on which one or more programs are stored,

where, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method for training a medical report generation model according to any one of the above embodiments, or to implement the medical report generation method according to the above embodiments.

In a sixth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, where, when the program is executed by a processor, the method for training a medical report generation model according to any one of the above embodiments, or the medical report generation method according to the above embodiments, is implemented.
It can thus be seen that the embodiments of the present disclosure have the following beneficial effects:

Embodiments of the present disclosure provide a method for training a medical report generation model and a medical report generation method. Image features are extracted from a source image and a target image respectively, and the image features are used to obtain a corresponding first medical report text and second medical report text. A discriminator is then used to obtain a first discrimination result and a second discrimination result corresponding to the first medical report text and the second medical report text, respectively. Finally, the image features are used to calculate a source image-specific loss and a target image-specific loss; a cross-entropy loss is calculated from the first medical report text and the medical text label corresponding to the source image; a first adversarial loss is calculated from the first discrimination result; and a second adversarial loss and a third adversarial loss are calculated from the second discrimination result. The first encoder, the second encoder, the third encoder, the text generator, and the discriminator are trained according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss, and the above training steps are repeated until a preset condition is reached, yielding an encoder and a text generator for medical report generation. A medical image is input into the encoder to obtain medical image features, and the medical image features are then input into the text generator to obtain a medical report text.

In this way, an encoder and a text generator that generate medical report texts for medical images of the target image's type are trained using source images having many medical report text labels and target images having few or no medical report text labels. From the source images and target images, domain-invariant features can be learned, so that knowledge learned in the domain of richly labelled source images can be applied to the domains of other types of medical images, enabling automatic generation of medical report texts for medical images with few or no labels.
Brief Description of the Drawings

FIG. 1 is a schematic framework diagram of an exemplary application scenario provided by at least one embodiment of the present disclosure;

FIG. 2 is a flowchart of a method for training a medical report generation model provided by at least one embodiment of the present disclosure;

FIG. 3 is a schematic diagram of a medical report generation model provided by at least one embodiment of the present disclosure;

FIG. 4 is a schematic diagram of another medical report generation model provided by at least one embodiment of the present disclosure;

FIG. 5 is a flowchart of a medical report generation method provided by at least one embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of a training apparatus for a medical report generation model provided by at least one embodiment of the present disclosure;

FIG. 7 is a schematic structural diagram of a medical report generation apparatus provided by at least one embodiment of the present disclosure; and

FIG. 8 is a schematic diagram of the basic structure of an electronic device provided by at least one embodiment of the present disclosure.
Detailed Description

To make the above objects, features, and advantages of the present disclosure more apparent and easier to understand, the embodiments of the present application are described in further detail below with reference to the accompanying drawings and specific implementations.

To facilitate understanding of the technical solution provided by the present disclosure, the background relevant to the present disclosure is first described below.

A study of traditional medical report text generation methods shows that, at present, labelled medical images are used as training data, and a model for generating medical report texts is obtained by training on that data. However, generating labels for medical images is difficult, and medical images that have many labels are of a fairly narrow range of image types. At present, medical images with many labels are essentially chest radiographs, and it is difficult to obtain models that generate medical report texts for other types of medical images.

Based on this, embodiments of the present disclosure provide a method for training a medical report generation model and a medical report generation method. Image features are extracted from a source image and a target image respectively, and the image features are used to obtain a corresponding first medical report text and second medical report text; a discriminator is then used to obtain a first discrimination result and a second discrimination result corresponding to the first medical report text and the second medical report text, respectively; finally, a source image-specific loss and a target image-specific loss are calculated from the image features, a cross-entropy loss is calculated from the first medical report text and the medical text label corresponding to the source image, a first adversarial loss is calculated from the first discrimination result, and a second adversarial loss and a third adversarial loss are calculated from the second discrimination result. The first encoder, the second encoder, the third encoder, the text generator, and the discriminator are trained according to these six losses, and the above training steps are repeated until a preset condition is reached, yielding an encoder and a text generator for medical report generation. A medical image is input into the encoder to obtain medical image features, which are then input into the text generator to obtain a medical report text.

To facilitate understanding of the medical report generation method provided by the embodiments of the present disclosure, the following description refers to the scenario example shown in FIG. 1. Referring to FIG. 1, the figure is a schematic framework diagram of an exemplary application scenario provided by at least one embodiment of the present disclosure.

In practical applications, a medical image 101 is input into a trained encoder 102 to obtain medical image features 103 corresponding to the medical image 101, and the medical image features 103 are then input into a trained text generator 104 to obtain a medical report text 105 output by the text generator 104.

Those skilled in the art can understand that the framework diagram shown in FIG. 1 is only one example in which the embodiments of the present disclosure can be implemented. The scope of applicability of the embodiments of the present disclosure is not limited by any aspect of this framework.

Based on the above description, the method for training a medical report generation model provided by the present disclosure is described in detail below with reference to the accompanying drawings.
参见图2所示,该图为本公开至少一实施例提供的一种医学报告生成模型的训练方法的流程图,该方法包括步骤S201-S210。Referring to FIG. 2 , which is a flowchart of a method for training a medical report generation model provided by at least one embodiment of the present disclosure, the method includes steps S201-S210.
S201:将源图像输入第一编码器,得到第一图像特征,将源图像输入第二编码器,得到第二图像特征;源图像对应有医学文本标签。S201: Input a source image into a first encoder to obtain a first image feature, and input the source image into a second encoder to obtain a second image feature; the source image corresponds to a medical text label.
Referring to FIG. 3, which is a schematic diagram of a medical report generation model method provided by at least one embodiment of the present disclosure.
The source image is a medical image with a corresponding medical text label. The medical text label refers to the medical report text corresponding to the medical image, for example a test report text. In a possible implementation manner, the source image may be a chest radiograph image from MIMIC-CXR (a data set).
The first encoder is used to extract image features specific to the source image, that is, image features belonging to the source domain. The source image is input into the first encoder to obtain the first image feature output by the first encoder.
The second encoder is an encoder shared by the source image and the target image, and is used to extract features that are similar between the source domain and the target domain in the feature dimension of the hidden layer, that is, features common to the source domain and the target domain. The source image is input into the second encoder to obtain the second image feature output by the second encoder.
The first encoder and the second encoder may each be composed of four convolutional layers.
In a possible implementation manner, the second encoder may adopt Inception-v3 (a neural network), and the first encoder may adopt ResNet (Deep residual network).
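For illustration only, the encoder setup described above might be instantiated as in the following minimal sketch, assuming PyTorch and torchvision; the ResNet depth (resnet18), the 512-dimensional feature size, and all variable names are assumptions rather than the disclosed implementation.

```python
import torch
import torchvision.models as models

# Hypothetical instantiation: ResNet backbones for the two domain-private
# encoders and Inception-v3 for the shared encoder, as described above.
first_encoder = models.resnet18(num_classes=512)     # private to the source domain
third_encoder = models.resnet18(num_classes=512)     # private to the target domain
second_encoder = models.inception_v3(num_classes=512, aux_logits=False,
                                     init_weights=True)  # shared by both domains

x_s = torch.randn(2, 3, 299, 299)  # stand-in batch of source images
first_encoder.eval()
second_encoder.eval()
with torch.no_grad():
    h_s_private = first_encoder(x_s)   # first image feature, shape (2, 512)
    h_s_shared = second_encoder(x_s)   # second image feature, shape (2, 512)
```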
S202: Input the target image into a third encoder to obtain a third image feature, and input the target image into the second encoder to obtain a fourth image feature.
The target image is a medical image belonging to an image type other than the medical image type to which the source image belongs. The target image may include medical images of one or more image types, and the target image includes the image type of the medical images for which medical report text needs to be generated. For example, when it is necessary to generate medical report text for medical images produced by an endoscope, the target image includes medical images produced by an endoscope. In addition, the target image may also include medical images of other image types such as CT (Computed Tomography) images.
The target image may include medical images without labels, or may include medical images with corresponding labels. The label of a target image may be a manually annotated medical report text, or may be descriptive text related to the target image in the literature, articles, or other texts to which the target image belongs.
The target image is input into the second encoder to obtain the fourth image feature output by the second encoder.
The third encoder is used to extract image features specific to the target image, that is, image features belonging to the target domain. The target image is input into the third encoder to obtain the third image feature output by the third encoder.
The third encoder may be composed of four convolutional layers. The third encoder may adopt ResNet (Deep residual network).
S203: Input the second image feature into a text generator to obtain a first medical report text.
The text generator is used to generate the corresponding medical report text according to the image features of the input medical image. The text generator may be composed of a bidirectional two-layer LSTM (Long Short-Term Memory artificial neural network).
The second image feature is input into the text generator to obtain the first medical report text output by the text generator.
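As a concrete illustration of such a generator, the following is a minimal PyTorch sketch; seeding the LSTM state from the image feature, the hidden size, and the vocabulary size are assumptions, not the disclosed design.

```python
import torch
import torch.nn as nn

class TextGenerator(nn.Module):
    """Bidirectional two-layer LSTM that decodes report tokens from an image feature."""
    def __init__(self, feat_dim=512, vocab_size=5000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        # 2 layers x 2 directions = 4 initial hidden states (an assumed design choice).
        self.init_h = nn.Linear(feat_dim, 4 * hidden)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, image_feature, tokens):
        b = image_feature.size(0)
        h0 = self.init_h(image_feature).view(b, 4, -1).permute(1, 0, 2).contiguous()
        c0 = torch.zeros_like(h0)
        seq, _ = self.lstm(self.embed(tokens), (h0, c0))  # (b, T, 2*hidden)
        return self.out(seq)                              # per-token vocabulary logits

logits = TextGenerator()(torch.randn(2, 512), torch.randint(0, 5000, (2, 20)))
```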
S204: Input the fourth image feature into the text generator to obtain a second medical report text.
The fourth image feature is input into the above text generator to obtain the second medical report text output by the text generator.
S205: Input the first medical report text into a discriminator to obtain a first discrimination result.
The discriminator is used to determine the domain to which the input medical report text belongs, that is, to determine whether the input medical report text belongs to the source domain or the target domain. The discriminator may be composed of a CNN (Convolutional Neural Network) with two convolutional layers and one fully connected layer.
The first medical report text is input into the discriminator to obtain the first discrimination result of the discriminator for the first medical report text.
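A sketch of a discriminator with two convolutional layers and one fully connected layer, assuming PyTorch and that the report is represented as a sequence of token embeddings; the embedding dimension and channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class ReportDiscriminator(nn.Module):
    """Two Conv1d layers plus one linear layer over token embeddings."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(emb_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(64, 1)

    def forward(self, report_emb):                        # (b, T, emb_dim)
        h = self.conv(report_emb.transpose(1, 2))         # (b, 64, T)
        logits = self.fc(h.transpose(1, 2)).squeeze(-1)   # (b, T)
        return torch.sigmoid(logits)  # D(y): per-token probability of the source domain

d_ys = ReportDiscriminator()(torch.randn(2, 20, 256))
```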
S206: Input the second medical report text into the discriminator to obtain a second discrimination result.
The second medical report text is input into the discriminator to obtain the second discrimination result of the discriminator for the second medical report text.
The discriminator enables adversarial training, so that the second encoder can narrow the difference between the first medical report text and the second medical report text, mapping features from different domains into the same domain and achieving feature-level alignment.
S207: Calculate a source image specificity loss according to the first image feature and the second image feature, and calculate a target image specificity loss according to the third image feature and the fourth image feature.
The first image feature and the second image feature are obtained by different encoders performing feature extraction on the source image. According to the first image feature and the second image feature, the source image specificity loss can be calculated. The source image specificity loss is used to measure the gap between the first image feature and the second image feature.
The source image specificity loss can be expressed by the following formula:
L_sdist = ||(h_s^c)^T h_s^p||_F^2                                        (1)
where h_s^c is the second image feature, h_s^p is the first image feature, and ||·||_F is the Frobenius norm.
Similarly, the third image feature and the fourth image feature are obtained by different encoders performing feature extraction on the target image. According to the third image feature and the fourth image feature, the target image specificity loss can be calculated. The target image specificity loss is used to measure the gap between the third image feature and the fourth image feature.
The target image specificity loss can be expressed by the following formula:
L_tdist = ||(h_t^c)^T h_t^p||_F^2                                        (2)
where h_t^c is the fourth image feature and h_t^p is the third image feature.
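Under the reconstruction of formulas (1) and (2) given above (an orthogonality-style difference loss; the exact form is an assumption drawn from those formulas), both specificity losses can be computed in a few lines. A PyTorch sketch:

```python
import torch

def difference_loss(shared: torch.Tensor, private: torch.Tensor) -> torch.Tensor:
    """||shared^T @ private||_F^2: pushes shared and private features apart."""
    return (shared.t() @ private).pow(2).sum()

# Random stand-ins for the encoder outputs of S201/S202.
h_s_shared, h_s_private = torch.randn(2, 512), torch.randn(2, 512)
h_t_shared, h_t_private = torch.randn(2, 512), torch.randn(2, 512)
l_sdist = difference_loss(h_s_shared, h_s_private)  # source image specificity loss, formula (1)
l_tdist = difference_loss(h_t_shared, h_t_private)  # target image specificity loss, formula (2)
```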
S208: Calculate a cross-entropy loss according to the first medical report text and the medical text label corresponding to the source image.
The source image has a corresponding medical text label. The cross-entropy loss is calculated according to the first medical report text and the medical text label corresponding to the source image. The cross-entropy loss is used to measure the gap between the first medical report text and the medical text label.
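As an illustration, the token-level cross-entropy between the generator's logits and the label tokens might be computed as below (a sketch; the shapes and the padding index are assumptions):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 60, 5000)            # (batch, report length, vocabulary)
labels = torch.randint(1, 5000, (2, 60))     # token ids of the medical text label
l_ce = F.cross_entropy(logits.reshape(-1, 5000), labels.reshape(-1),
                       ignore_index=0)       # 0 is an assumed padding token id
```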
S209: Calculate a first adversarial loss according to the first discrimination result, and calculate a second adversarial loss and a third adversarial loss according to the second discrimination result.
The first adversarial loss is calculated according to the first discrimination result output by the discriminator, and the second adversarial loss and the third adversarial loss are calculated according to the second discrimination result. The first adversarial loss, the second adversarial loss and the third adversarial loss measure whether the generated reports are judged to belong to the corresponding domain.
In a possible implementation manner, at least one embodiment of the present disclosure provides a specific implementation of calculating the first adversarial loss according to the first discrimination result, and a specific implementation of calculating the second adversarial loss and the third adversarial loss according to the second discrimination result; for details, see below.
S210: Train the first encoder, the second encoder, the third encoder, the text generator and the discriminator according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss and the third adversarial loss, and repeat the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is reached.
Based on the obtained source image specificity loss, target image specificity loss, cross-entropy loss, first adversarial loss, second adversarial loss and third adversarial loss, the first encoder, the second encoder, the third encoder, the text generator and the discriminator are trained.
Based on the source image specificity loss, the first encoder and the second encoder can learn different image features about the source image. Based on the target image specificity loss, the second encoder and the third encoder can learn different image features about the target image. Using the cross-entropy loss, the text generator can be trained to generate a more accurate first medical report text. Using the first adversarial loss, the second adversarial loss and the third adversarial loss, the domain-invariant features of the target domain and the source domain can be made as close as possible.
In a possible implementation manner, at least one embodiment of the present disclosure provides a specific implementation of training the first encoder, the second encoder, the third encoder, the text generator and the discriminator according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss and the third adversarial loss; for details, see below.
After one round of training of the first encoder, the second encoder, the third encoder, the text generator and the discriminator is completed, the above steps S201-S210 are repeated until the preset condition is reached. The preset condition is the condition for completing training. The preset condition may be, for example, a number of training iterations, or a numerical condition to be satisfied by the loss function.
Based on the above contents of S201-S210, the second encoder and the text generator obtained through adversarial training based on domain-invariant features can generate corresponding medical report text for medical images of the image types to which the target image belongs. In this way, medical report text can be generated for medical images of image types lacking labels, expanding the range of medical image types for which medical report text can be generated. Moreover, the discriminator makes it possible to map data sources from different domains into the same domain and achieve feature-level alignment, so that the encoder and text generator obtained after training can generate more accurate medical report text corresponding to medical images.
In a possible implementation manner, the discriminator is used to determine the probability of the image to which a medical report text corresponds. The first medical report text is input into the discriminator, and the first discrimination result output by the discriminator includes a first probability value that each word segment in the first medical report text was generated from the source image. The second medical report text is input into the discriminator, and the second discrimination result output by the discriminator includes a second probability value that each word segment in the second medical report text was generated from the source image. The first probability value may be expressed as D(y_s), where y_s denotes the first medical report text. The second probability value may be expressed as D(y_t), where y_t denotes the second medical report text. The first probability value and the second probability value range from 0 to 1, where a value closer to 1 indicates a higher probability of being generated from the source image, and a value closer to 0 indicates a lower probability of being generated from the source image.
Correspondingly, at least one embodiment of the present disclosure provides a method for calculating the first adversarial loss according to the first discrimination result, which specifically includes:
taking the logarithms of the first probability values and summing them to obtain a first summation result, and taking the negative of the first summation result to obtain the first adversarial loss.
The logarithms of the first probability values are summed to obtain the first summation result. The first summation result may be expressed as Σ log[D(y_s)].
The negative of the first summation result is then calculated to obtain the first adversarial loss.
The first adversarial loss can be expressed by formula (3):
L_adv1(y_s) = -Σ log[D(y_s)]                                        (3)
At least one embodiment of the present disclosure provides a method for calculating the second adversarial loss and the third adversarial loss according to the second discrimination result, including:
taking the logarithms of the second probability values and summing them to obtain a second summation result, and taking the negative of the second summation result to obtain the second adversarial loss;
calculating the differences between 1 and the second probability values and summing them to obtain a third summation result, and taking the negative of the third summation result to obtain the third adversarial loss.
The logarithms of the second probability values are summed to obtain the second summation result. The second summation result may be expressed as Σ log[D(y_t)].
The negative of the second summation result is then calculated to obtain the second adversarial loss.
The second adversarial loss can be expressed by formula (4):
L_adv2(y_t) = -Σ log[D(y_t)]                                        (4)
The differences between 1 and each of the second probability values are calculated and summed to obtain the third summation result, and the negative of the third summation result is taken to obtain the third adversarial loss.
The third adversarial loss can be expressed by formula (5):
L_adv3(y_t) = -Σ [1 - D(y_t)]                                      (5)
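Following formulas (3), (4) and (5), the three adversarial losses can be computed directly from the discriminator's per-token probabilities; in the sketch below, the small epsilon for numerical stability is an addition, not part of the formulas:

```python
import torch

def adversarial_losses(d_ys, d_yt, eps=1e-8):
    """d_ys, d_yt: probabilities D(y_s) and D(y_t) for each token of the two reports."""
    l_adv1 = -torch.log(d_ys + eps).sum()   # formula (3)
    l_adv2 = -torch.log(d_yt + eps).sum()   # formula (4)
    l_adv3 = -(1.0 - d_yt).sum()            # formula (5): no logarithm, per the text above
    return l_adv1, l_adv2, l_adv3

l_adv1, l_adv2, l_adv3 = adversarial_losses(torch.rand(2, 20), torch.rand(2, 20))
```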
In a possible implementation manner, for the case where the target image lacks a label, the model can be optimized by means of image reconstruction. At least one embodiment of the present disclosure provides a method for training a medical report generation model which, in addition to the above steps S201-S210, further includes the following three steps.
Referring to FIG. 4, which is a schematic diagram of another medical report generation model method provided by at least one embodiment of the present disclosure.
A1: Input the first image feature and the second image feature into a first decoder to obtain a reconstructed source image.
The first decoder is used to generate the reconstructed source image according to the domain-invariant features and the specific features of the input source image. The first image feature and the second image feature are input into the first decoder to obtain the reconstructed source image.
The first decoder may be composed of four convolutional layers.
A2: Input the third image feature and the fourth image feature into a second decoder to obtain a reconstructed target image.
The second decoder is used to generate the reconstructed target image according to the domain-invariant features and the specific features of the input target image. The third image feature and the fourth image feature are input into the second decoder to obtain the reconstructed target image.
The second decoder may be composed of four convolutional layers. In the embodiments of the present disclosure, the encoders and decoders adopt an autoencoder structure.
A3: Calculate a source image perceptual loss according to the source image and the reconstructed source image, and calculate a target image perceptual loss according to the target image and the reconstructed target image.
The source image perceptual loss is calculated according to the source image and the reconstructed source image. The source image perceptual loss is used to measure the gap between the source image and the reconstructed source image.
The target image perceptual loss is calculated according to the target image and the reconstructed target image. The target image perceptual loss is used to measure the gap between the target image and the reconstructed target image.
In a possible implementation manner, at least one embodiment of the present disclosure provides a specific implementation of calculating the source image perceptual loss according to the source image and the reconstructed source image, and a specific implementation of calculating the target image perceptual loss according to the target image and the reconstructed target image; for details, see below.
Correspondingly, at least one embodiment of the present disclosure provides a specific implementation of training the first encoder, the second encoder, the third encoder, the text generator and the discriminator according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss and the third adversarial loss, which specifically includes:
training the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder and the second decoder according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, the third adversarial loss, the source image perceptual loss and the target image perceptual loss.
After the source image perceptual loss and the target image perceptual loss are obtained, the model can also be optimized according to the source image perceptual loss and the target image perceptual loss, reducing the gap between the source image and the reconstructed source image and between the target image and the reconstructed target image, and improving the accuracy with which the model extracts image features from the source image and the target image.
In a possible implementation manner, a total loss can be calculated according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, the third adversarial loss, the source image perceptual loss and the target image perceptual loss. The total loss can be expressed by the following formula:
L = L_difference + L_rec + L_ce + λ_adv1 L_adv1(y_s) + λ_adv2 L_adv2(y_t) + λ_adv3 L_adv3(y_t)    (6)
where L_difference denotes the sum of the source image specificity loss and the target image specificity loss, L_ce denotes the cross-entropy loss, and L_rec denotes the sum of the source image perceptual loss and the target image perceptual loss. L_adv1(y_s) denotes the first adversarial loss, and λ_adv1 is the weight corresponding to the first adversarial loss. L_adv2(y_t) denotes the second adversarial loss, and λ_adv2 is the weight corresponding to the second adversarial loss. L_adv3(y_t) denotes the third adversarial loss, and λ_adv3 is the weight corresponding to the third adversarial loss.
L_difference can be expressed by the following formula:
L_difference = L_sdist + L_tdist                                          (7)
where L_sdist denotes the source image specificity loss and L_tdist denotes the target image specificity loss.
L_rec can be expressed by the following formula:
L_rec = L_perc(x_s, x_srec; w) + L_perc(x_t, x_trec; w)                            (8)
where L_perc(x_s, x_srec; w) denotes the source image perceptual loss and L_perc(x_t, x_trec; w) denotes the target image perceptual loss.
After the total loss is obtained, the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder and the second decoder can be trained with a minimax optimization objective on the total loss, in which the discriminator is trained to maximize the adversarial terms while the other modules are trained to minimize the total loss.
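Assembled in code, one optimization step on the total loss of formula (6) could look like the following sketch; the loss terms stand in for the values computed in the earlier sketches, and the unit weights are assumptions:

```python
import torch

# Stand-ins for the loss terms computed above (formulas (1)-(5) and (9)-(10)).
l_sdist, l_tdist = torch.tensor(1.0, requires_grad=True), torch.tensor(1.0, requires_grad=True)
l_perc_src, l_perc_tgt = torch.tensor(1.0, requires_grad=True), torch.tensor(1.0, requires_grad=True)
l_ce = torch.tensor(1.0, requires_grad=True)
l_adv1, l_adv2, l_adv3 = (torch.tensor(1.0, requires_grad=True) for _ in range(3))
lambda_adv1 = lambda_adv2 = lambda_adv3 = 1.0   # assumed weights

l_difference = l_sdist + l_tdist                # formula (7)
l_rec = l_perc_src + l_perc_tgt                 # formula (8)
total = (l_difference + l_rec + l_ce
         + lambda_adv1 * l_adv1
         + lambda_adv2 * l_adv2
         + lambda_adv3 * l_adv3)                # formula (6)
total.backward()  # gradients then drive the optimizer steps of each module
```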
Based on the above, by means of image reconstruction, the encoders can be optimized even when the target image has no label, so that the encoders extract more accurate image features and the accuracy of the trained model is improved.
In a possible implementation manner, at least one embodiment of the present disclosure provides a method for calculating the source image perceptual loss according to the source image and the reconstructed source image, including the following four steps:
B1: Input the source image into a third image feature extraction network, and obtain the seventh image features output by each feature extraction layer of the third image feature extraction network.
The third image feature extraction network is used to extract the image features of an image. The source image is input into the third image feature extraction network to obtain the seventh image features output by each feature extraction layer of the third image feature extraction network.
The third image feature extraction network may be VGG Net (a deep convolutional neural network). VGG Net may be pre-trained. The source image is input into VGG Net to obtain the seventh image feature φ^(l)(x_s), where x_s denotes the source image and l denotes the l-th feature extraction layer in VGG Net. l is a positive integer greater than or equal to 1 and less than or equal to L, where L is the total number of feature extraction layers of VGG Net.
B2: Input the reconstructed source image into the third image feature extraction network, and obtain the eighth image features output by each feature extraction layer of the third image feature extraction network.
The third image feature extraction network is used to extract the image features of the reconstructed source image, obtaining the eighth image features output by each feature extraction layer of the third image feature extraction network.
Taking the above third image feature extraction network being VGG Net as an example, the eighth image feature can be expressed as φ^(l)(x_srec), where x_srec denotes the reconstructed source image.
B3: Calculate the source image loss corresponding to each feature extraction layer according to the seventh image feature and the eighth image feature output by that feature extraction layer and the weight corresponding to that feature extraction layer.
Each feature extraction layer in the third image feature extraction network has a corresponding weight. According to the weight of each feature extraction layer, the seventh image feature output by that layer and the eighth image feature output by that layer, the source image loss corresponding to that feature extraction layer is calculated.
In a possible implementation manner, the difference between the seventh image feature and the eighth image feature can be calculated first, then the L1 norm of the resulting difference is calculated, and finally the L1 norm of the difference is multiplied by the weight to obtain the source image loss corresponding to that feature extraction layer.
B4: Sum the source image losses corresponding to the feature extraction layers to obtain the source image perceptual loss.
The sum of the source image losses of the feature extraction layers is calculated to obtain the source image perceptual loss.
The source image perceptual loss L_perc(x_s, x_srec; w) can be expressed by the following formula:
L_perc(x_s, x_srec; w) = Σ_{l=1}^{L} (w^(l) / N^(l)) ||φ^(l)(x_s) - φ^(l)(x_srec)||_1    (9)
where w^(l) denotes the weight of the l-th feature extraction layer, N^(l) denotes the number of elements in the feature output by the l-th feature extraction layer, and ||·||_1 denotes the L1 norm.
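A sketch of the layer-weighted perceptual loss of formula (9), assuming PyTorch and a recent torchvision; the tapped VGG-16 layer indices and the uniform per-layer weights are assumptions:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class PerceptualLoss(nn.Module):
    """Sum over tapped VGG layers of (w_l / N_l) * ||phi_l(x) - phi_l(x_rec)||_1."""
    def __init__(self, taps=(3, 8, 15, 22), weights=(1.0, 1.0, 1.0, 1.0)):
        super().__init__()
        self.vgg = models.vgg16(weights=None).features.eval()  # pre-trained weights could be loaded here
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.taps, self.weights = set(taps), dict(zip(taps, weights))

    def forward(self, x, x_rec):
        loss, h, h_rec = 0.0, x, x_rec
        for i, layer in enumerate(self.vgg):
            h, h_rec = layer(h), layer(h_rec)
            if i in self.taps:
                # N_l taken as the number of elements in the layer's feature map.
                loss = loss + self.weights[i] * (h - h_rec).abs().sum() / h.numel()
        return loss

l_perc_src = PerceptualLoss()(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
```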
Similarly, in a possible implementation manner, at least one embodiment of the present disclosure provides a specific implementation of calculating the target image perceptual loss according to the target image and the reconstructed target image, which specifically includes the following four steps:
B5: Input the target image into the third image feature extraction network, and obtain the ninth image features output by each feature extraction layer of the third image feature extraction network.
The third image feature extraction network is used to extract the image features of the target image, obtaining the ninth image features output by each feature extraction layer of the third image feature extraction network.
Taking the above third image feature extraction network being VGG Net as an example, the ninth image feature can be expressed as φ^(l)(x_t), where x_t denotes the target image.
B6: Input the reconstructed target image into the third image feature extraction network, and obtain the tenth image features output by each feature extraction layer of the third image feature extraction network.
The third image feature extraction network is used to extract the image features of the reconstructed target image, obtaining the tenth image features output by each feature extraction layer of the third image feature extraction network.
Taking the above third image feature extraction network being VGG Net as an example, the tenth image feature can be expressed as φ^(l)(x_trec), where x_trec denotes the reconstructed target image.
B7: Calculate the target image loss corresponding to each feature extraction layer according to the ninth image feature and the tenth image feature output by that feature extraction layer and the weight corresponding to that feature extraction layer.
Each feature extraction layer in the third image feature extraction network has a corresponding weight. According to the weight of each feature extraction layer, the ninth image feature output by that layer and the tenth image feature output by that layer, the target image loss corresponding to that feature extraction layer is calculated.
In a possible implementation manner, the difference between the ninth image feature and the tenth image feature can be calculated first, then the L1 norm of the resulting difference is calculated, and finally the L1 norm of the difference is multiplied by the weight to obtain the target image loss corresponding to that feature extraction layer.
B8: Sum the target image losses corresponding to the feature extraction layers to obtain the target image perceptual loss.
The sum of the target image losses of the feature extraction layers is calculated to obtain the target image perceptual loss.
The target image perceptual loss L_perc(x_t, x_trec; w) can be expressed by the following formula:
L_perc(x_t, x_trec; w) = Σ_{l=1}^{L} (w^(l) / N^(l)) ||φ^(l)(x_t) - φ^(l)(x_trec)||_1    (10)
where w^(l) denotes the weight of the l-th feature extraction layer, N^(l) denotes the number of elements in the feature output by the l-th feature extraction layer, and ||·||_1 denotes the L1 norm.
Some of the target images may have corresponding medical text labels. For target images with medical text labels, the model can be trained in a semi-supervised manner.
Correspondingly, in a possible implementation manner, at least one embodiment of the present disclosure provides a training method for a medical report generation model in which, on the basis of the training completed in the above steps S201-S210, training can be performed again; that is, in addition to the above steps, the following five steps are included:
C1: Determine a first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image.
The first score is used to measure the difference between the source image and the reconstructed source image, and the difference between the target image and the reconstructed target image.
In a possible implementation manner, at least one embodiment of the present disclosure provides a specific implementation of determining the first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image; see below.
C2: Determine a second score according to the source image specificity loss and the target image specificity loss.
The second score is related to the source image specificity loss and the target image specificity loss.
In a possible implementation manner, at least one embodiment of the present disclosure provides a specific implementation of determining the second score according to the source image specificity loss and the target image specificity loss; see below.
C3: If the target image corresponds to a medical text label, calculate a natural language evaluation metric as a third score according to the second medical report text and the medical text label corresponding to the target image.
When some of the target images have corresponding medical text labels, the natural language evaluation metric can be calculated according to the medical text labels of the target images and the second medical report text. The calculated natural language evaluation metric is determined as the third score.
The natural language evaluation metric may be a metric such as CIDEr (Consensus-based Image Description Evaluation). The third score can be expressed by formula (11):
SCORE_eval = CIDEr(y_t, y)                                       (11)
where CIDEr(y_t, y) denotes the CIDEr of y_t and y, y_t is the second medical report text generated based on the target image, and y is the medical text label corresponding to the target image.
C4: Take the weighted sum of the first score, the second score and the third score to obtain a reward value.
The weighted sum of the first score, the second score and the third score is calculated to obtain the reward value. The reward value REWARD can be expressed by formula (12):
REWARD = λ_1 SCORE_rec + λ_2 SCORE_dist + λ_3 SCORE_eval                  (12)
where SCORE_rec denotes the first score, SCORE_dist denotes the second score, and SCORE_eval denotes the third score. λ_1 is the weight corresponding to the first score, λ_2 is the weight corresponding to the second score, and λ_3 is the weight corresponding to the third score.
The weights corresponding to the first score, the second score and the third score can be set as required. For example, when the target image does not have a corresponding medical text label, λ_1 = λ_2 = 0.5 and λ_3 = 0. When the target image has a corresponding medical text label, λ_1 = λ_2 = 0.3 and λ_3 = 0.4.
The reward value can reflect how well the model is training in three respects: the difference of the reconstructed images, the specificity loss of the images, and the natural language evaluation metric.
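The reward of formula (12) then reduces to a weighted sum; a minimal sketch, using the example weight settings given above:

```python
def reward(score_rec: float, score_dist: float, score_eval: float,
           has_label: bool) -> float:
    """Formula (12) with the example weights from the text above."""
    if has_label:
        l1, l2, l3 = 0.3, 0.3, 0.4   # target image has a medical text label
    else:
        l1, l2, l3 = 0.5, 0.5, 0.0   # no label, so the CIDEr term is disabled
    return l1 * score_rec + l2 * score_dist + l3 * score_eval
```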
C5: Retrain the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder and the second decoder with the goal of maximizing the reward value.
Taking maximization of the reward value as the training objective, the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder and the second decoder in the model are retrained.
In the embodiments of the present disclosure, taking maximization of the reward value as the training objective allows the text generator to be updated via reinforcement learning. Moreover, taking the natural language evaluation metric as the third score allows the natural language evaluation metric to be considered during model training, so that the objectives of model training and model application coincide, further improving the accuracy of the model.
Further, at least one embodiment of the present disclosure provides a specific implementation of determining the first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image, including the following seven steps:
D1: Input the source image into the third image feature extraction network, and obtain the eleventh image feature output by the third image feature extraction network.
The third image feature extraction network is used to extract the image features of an image. The source image is input into the third image feature extraction network to obtain the eleventh image feature output by the third image feature extraction network.
The third image feature extraction network may be VGG Net (a deep convolutional neural network). VGG Net may be pre-trained. The source image is input into VGG Net to obtain the eleventh image feature φ^(l)(x_s), where x_s denotes the source image and l denotes the l-th activated layer (the l-th layer after an activation function) in VGG Net. l is a positive integer greater than or equal to 1 and less than or equal to L, where L is the maximum number of activated layers of VGG Net.
D2: Input the reconstructed source image into the third image feature extraction network, and obtain the twelfth image feature output by the third image feature extraction network.
The third image feature extraction network is used to extract the image features of the reconstructed source image, obtaining the twelfth image feature output by the third image feature extraction network.
Still taking the above third image feature extraction network being VGG Net as an example, the twelfth image feature can be expressed as φ^(l)(x_srec), where x_srec denotes the reconstructed source image.
D3: Obtain a first difference value according to the difference between the eleventh image feature and the twelfth image feature.
The first difference value is used to indicate the difference between the eleventh image feature and the twelfth image feature.
In a possible implementation manner, the difference between the twelfth image feature and the eleventh image feature can be calculated first to obtain a first difference, and the L1 norm of the first difference is then calculated to obtain the first difference value.
The first difference value S_1 can be expressed by the following formula:
S_1 = Σ_{l=1}^{L} ||φ^(l)(x_srec) - φ^(l)(x_s)||_1                          (13)
D4: Input the target image into the third image feature extraction network, and obtain the thirteenth image feature output by the third image feature extraction network.
The third image feature extraction network is used to extract the image features of the target image, obtaining the thirteenth image feature.
Still taking the above third image feature extraction network being VGG Net as an example, the thirteenth image feature can be expressed as φ^(l)(x_t), where x_t denotes the target image.
D5: Input the reconstructed target image into the third image feature extraction network, and obtain the fourteenth image feature output by the third image feature extraction network.
The third image feature extraction network is used to extract the image features of the reconstructed target image, obtaining the fourteenth image feature.
Still taking the above third image feature extraction network being VGG Net as an example, the fourteenth image feature can be expressed as φ^(l)(x_trec), where x_trec denotes the reconstructed target image.
D6: Obtain a second difference value according to the difference between the thirteenth image feature and the fourteenth image feature.
The second difference value is used to indicate the difference between the thirteenth image feature and the fourteenth image feature.
In a possible implementation manner, the difference between the thirteenth image feature and the fourteenth image feature can be calculated first to obtain a second difference, and the L1 norm of the second difference is then calculated to obtain the second difference value.
The second difference value S_2 can be expressed by the following formula:
S_2 = Σ_{l=1}^{L} ||φ^(l)(x_trec) - φ^(l)(x_t)||_1                          (14)
D7: Sum the first difference value and the second difference value to obtain a fourth summation result, and take the negative of the fourth summation result to obtain the first score.
The sum of the first difference value and the second difference value is calculated, and the negative of the resulting sum is taken to obtain the first score.
The first score SCORE_rec can be expressed by formula (15):
SCORE_rec = -(S_1 + S_2)                                      (15)
Further, at least one embodiment of the present disclosure provides a specific implementation of determining the second score according to the source image specificity loss and the target image specificity loss, including:
summing the source image specificity loss and the target image specificity loss to obtain a fifth summation result, and taking the negative of the fifth summation result to obtain the second score.
The second score SCORE_dist can be expressed by formula (16):
SCORE_dist = -L_difference = -(L_sdist + L_tdist)                         (16)
where L_sdist is the source image specificity loss, L_tdist is the target image specificity loss, and L_difference is the fifth summation result.
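Combining steps D1-D7 with formula (16), the two scores might be computed as in the following sketch; representing the VGG activations as per-layer feature lists is an assumption:

```python
import torch

def score_rec(phi_s, phi_srec, phi_t, phi_trec):
    """SCORE_rec = -(S_1 + S_2), formulas (13)-(15); phi_* are lists of per-layer features."""
    s1 = sum((a - b).abs().sum() for a, b in zip(phi_srec, phi_s))   # formula (13)
    s2 = sum((a - b).abs().sum() for a, b in zip(phi_trec, phi_t))   # formula (14)
    return -(s1 + s2)                                                # formula (15)

def score_dist(l_sdist: torch.Tensor, l_tdist: torch.Tensor) -> torch.Tensor:
    """SCORE_dist = -(L_sdist + L_tdist), formula (16)."""
    return -(l_sdist + l_tdist)
```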
In a possible implementation manner, the first encoder, the second encoder and the third encoder can also be trained in advance.
At least one embodiment of the present disclosure provides a training method for a medical report generation model which, in addition to the above steps, further includes the following three steps.
E1: Input a training image into a first image feature extraction network to obtain a fifth image feature, and input the fifth image feature into a first classification network to obtain a first predicted classification result of the training image; train the first image feature extraction network and the first classification network according to the first predicted classification result of the training image and the classification label corresponding to the training image.
The training image is an image used for training the encoders. The training image is a medical image with a classification label. The classification label is the disease corresponding to the medical image. A medical image used as a training image may be a chest radiograph, and the corresponding classification label may be, for example, the disease name of a disease such as pneumonia, pulmonary nodules, or cardiac hypertrophy. The training images may be images from the CheXpert-small data set.
The first image feature extraction network is used to extract image features. The training image is input into the first image feature extraction network to obtain the fifth image feature output by the first image feature extraction network. The first image feature extraction network may adopt the Inception-v3 network structure.
The first classification network is used to determine the classification type of an image according to the input image features. The obtained fifth image feature is then input into the first classification network to obtain the first predicted classification result of the training image. The first predicted classification result may include the image type of the training image.
The classification label of the training image can be used to measure the accuracy of the first predicted classification result of the training image. The first image feature extraction network and the first classification network are trained according to the classification label of the training image and the first predicted classification result.
E2: Input the training image into a second image feature extraction network to obtain a sixth image feature, and input the sixth image feature into a second classification network to obtain a second predicted classification result of the training image; train the second image feature extraction network and the second classification network according to the second predicted classification result of the training image and the classification label corresponding to the training image. The first image feature extraction network and the second image feature extraction network have different network structures.
The second image feature extraction network is a network with a structure different from that of the first image feature extraction network. The second image feature extraction network is used to extract image features. The training image is input into the second image feature extraction network to obtain the sixth image feature output by the second image feature extraction network.
The second classification network is used to determine the classification type of an image according to the input image features. The obtained sixth image feature is then input into the second classification network to obtain the second predicted classification result of the training image output by the second classification network. The second predicted classification result may include the image type of the training image.
The classification label of the training image can be used to measure whether the second predicted classification result of the training image is accurate. The second image feature extraction network and the second classification network are trained using the classification label of the training image and the second predicted classification result of the training image.
E3: Determine the model parameters of the trained first image feature extraction network as the initial model parameters of the first encoder and the third encoder, and determine the model parameters of the trained second image feature extraction network as the initial model parameters of the second encoder. The first image feature extraction network has the same network structure as the first encoder and the third encoder, and the second image feature extraction network has the same network structure as the second encoder.
The first image feature extraction network has the same network structure as the first encoder and the third encoder. After the trained first image feature extraction network is obtained, the first image feature extraction network is used to determine the initial model parameters of the first encoder and the third encoder. Specifically, the model parameters of the first image feature extraction network are determined as the initial model parameters of the first encoder and the initial model parameters of the third encoder.
The second image feature extraction network has the same network structure as the second encoder. After the trained second image feature extraction network is obtained, the model parameters of the second image feature extraction network are used to determine the initial model parameters of the second encoder. Specifically, the model parameters of the second image feature extraction network are determined as the initial model parameters of the second encoder.
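A sketch of the parameter transfer in step E3, assuming PyTorch state_dict copying and that the architectures match as stated; the concrete backbones and the feature-network handles are hypothetical:

```python
import torchvision.models as models

# Hypothetical handles for the networks trained in E1/E2.
first_feature_net = models.resnet18(num_classes=512)
second_feature_net = models.inception_v3(num_classes=512, aux_logits=False,
                                         init_weights=True)

first_encoder = models.resnet18(num_classes=512)
third_encoder = models.resnet18(num_classes=512)
second_encoder = models.inception_v3(num_classes=512, aux_logits=False,
                                     init_weights=True)

# Step E3: identical architectures allow direct parameter copying.
first_encoder.load_state_dict(first_feature_net.state_dict())
third_encoder.load_state_dict(first_feature_net.state_dict())
second_encoder.load_state_dict(second_feature_net.state_dict())
```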
Based on the above, the first image feature extraction network and the second image feature extraction network are pre-trained using the training images, and the model parameters of the first image feature extraction network and the second image feature extraction network are then used to determine the initial model parameters of the first encoder, the second encoder and the third encoder. Pre-training in this way makes the first encoder, the second encoder and the third encoder more accurate, improves the accuracy with which they extract image features, and improves the efficiency of model training.
In another possible implementation manner, the initial model parameters of the first encoder, the second encoder and the third encoder can be determined by random initialization. At least one embodiment of the present disclosure further provides a training method for a medical report generation model which, in addition to the above steps, further includes the following step:
randomly initializing the initial model parameters of the first encoder, the second encoder and the third encoder.
Before training with the first encoder, the second encoder and the third encoder, the initial model parameters of the first encoder, the second encoder and the third encoder are randomly initialized. The first encoder, the second encoder and the third encoder are then trained in the manner described above to determine the model parameters.
Based on the training method of the medical report generation model provided by the above embodiments, an embodiment of the present disclosure provides a medical report generation method. Referring to FIG. 5, which is a flowchart of a medical report generation method provided by at least one embodiment of the present disclosure, the method includes S501-S502:
S501: Input a medical image into an encoder to obtain medical image features.
The encoder is the second encoder trained using the above training method of the medical report generation model. The trained second encoder can extract the medical image features of the medical image relatively accurately.
The medical image for which the corresponding medical report text needs to be generated is input into the encoder to obtain the medical image features corresponding to the medical image. It should be noted that the image type of the medical image is consistent with the image type of the target image. For example, if the target image includes images produced by an endoscope, the medical image may correspondingly be an image produced by an endoscope.
S502: Input the medical image features into a text generator to obtain a medical report text.
The text generator is the text generator trained using the above training method of the medical report generation model. The trained text generator can generate relatively accurate medical report text based on the input medical image features.
The medical image features of the medical image output by the encoder are input into the text generator to obtain the medical report text output by the text generator.
基于上述内容可知,在本公开的实施例中,利用上述医学报告生成模型的训练方法训练得到的编码器和文本生成器,能够适用于目标图像对应的图像类型的医学图像,生成医学图像对应的医学报告文本。Based on the above content, it can be seen that in the embodiments of the present disclosure, the encoder and text generator trained by the training method of the above-mentioned medical report generation model can be applied to the medical image of the image type corresponding to the target image, and generate the corresponding Medical report text.
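As an illustration only, the two inference steps can be sketched as follows (a minimal sketch assuming PyTorch-style modules; `encoder`, `text_generator`, and the tensor handling are placeholders, not names fixed by the disclosure):

```python
import torch

@torch.no_grad()
def generate_report(medical_image: torch.Tensor,
                    encoder: torch.nn.Module,
                    text_generator: torch.nn.Module):
    """S501/S502: encode a medical image, then generate report text."""
    features = encoder(medical_image.unsqueeze(0))  # S501: medical image features
    report_text = text_generator(features)          # S502: medical report text
    return report_text
```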
Based on the training method for a medical report generation model provided by the above method embodiments, at least one embodiment of the present disclosure further provides a training apparatus for a medical report generation model, which is described below with reference to the accompanying drawings.

Referring to FIG. 6, which is a schematic structural diagram of a training apparatus for a medical report generation model provided by at least one embodiment of the present disclosure. As shown in FIG. 6, the training apparatus includes:

a first input unit 601, configured to input a source image into a first encoder to obtain a first image feature, and input the source image into a second encoder to obtain a second image feature, where the source image corresponds to a medical text label;

a second input unit 602, configured to input a target image into a third encoder to obtain a third image feature, and input the target image into the second encoder to obtain a fourth image feature;

a third input unit 603, configured to input the second image feature into a text generator to obtain a first medical report text;

a fourth input unit 604, configured to input the fourth image feature into the text generator to obtain a second medical report text;

a fifth input unit 605, configured to input the first medical report text into a discriminator to obtain a first discrimination result;

a sixth input unit 606, configured to input the second medical report text into the discriminator to obtain a second discrimination result;

a first calculation unit 607, configured to calculate a source image specificity loss from the first image feature and the second image feature, and calculate a target image specificity loss from the third image feature and the fourth image feature;

a second calculation unit 608, configured to calculate a cross-entropy loss from the first medical report text and the medical text label corresponding to the source image;

a third calculation unit 609, configured to calculate a first adversarial loss from the first discrimination result, and calculate a second adversarial loss and a third adversarial loss from the second discrimination result;

an execution unit 610, configured to train the first encoder, the second encoder, the third encoder, the text generator, and the discriminator according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss, and to repeat the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is reached.
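The disclosure does not specify how the six losses are combined during training; a weighted sum with tunable coefficients $\lambda_1,\dots,\lambda_6$ is one common choice and is shown here purely as an illustration:

$$\mathcal{L} = \lambda_1\,\mathcal{L}_{\mathrm{spec}}^{\mathrm{src}} + \lambda_2\,\mathcal{L}_{\mathrm{spec}}^{\mathrm{tgt}} + \lambda_3\,\mathcal{L}_{\mathrm{CE}} + \lambda_4\,\mathcal{L}_{\mathrm{adv}}^{(1)} + \lambda_5\,\mathcal{L}_{\mathrm{adv}}^{(2)} + \lambda_6\,\mathcal{L}_{\mathrm{adv}}^{(3)}$$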
In a possible implementation, the apparatus further includes:

a seventh input unit, configured to input the first image feature and the second image feature into a first decoder to obtain a reconstructed source image;

an eighth input unit, configured to input the third image feature and the fourth image feature into a second decoder to obtain a reconstructed target image;

a fourth calculation unit, configured to calculate a source image perceptual loss from the source image and the reconstructed source image, and calculate a target image perceptual loss from the target image and the reconstructed target image;

where the execution unit is specifically configured to train the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, the third adversarial loss, the source image perceptual loss, and the target image perceptual loss.
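A minimal sketch of the reconstruction step follows; fusing the two features by channel concatenation is an assumption, since the disclosure only states that both features are input to the decoder:

```python
import torch

def reconstruct_image(decoder: torch.nn.Module,
                      feat_a: torch.Tensor,
                      feat_b: torch.Tensor) -> torch.Tensor:
    """Rebuild an input image from a pair of image features
    (e.g., the first and second features for the source image)."""
    fused = torch.cat([feat_a, feat_b], dim=1)  # assumed fusion scheme
    return decoder(fused)
```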
In a possible implementation, some of the target images correspond to medical text labels, and the apparatus further includes:

a first determination unit, configured to determine a first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image;

a second determination unit, configured to determine a second score according to the source image specificity loss and the target image specificity loss;

a fifth calculation unit, configured to, if the target image corresponds to a medical text label, calculate a natural language evaluation metric as a third score according to the second medical report text and the medical text label corresponding to the target image;

a summation unit, configured to compute a weighted sum of the first score, the second score, and the third score to obtain a reward value;

a training unit, configured to retrain the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder with the goal of maximizing the reward value.
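Written out, the reward is a weighted combination of the three scores, and retraining then seeks to maximize it; the weights $\alpha$, $\beta$, $\gamma$ are left unspecified by the disclosure:

$$R = \alpha S_1 + \beta S_2 + \gamma S_3$$

where $S_3$ is the natural language evaluation metric computed against the target image's medical text label when that label is available.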
In a possible implementation, the apparatus further includes:

a seventh input unit, configured to input a training image into a first image feature extraction network to obtain a fifth image feature, input the fifth image feature into a first classification network to obtain a first predicted classification result for the training image, and train the first image feature extraction network and the first classification network according to the first predicted classification result and the classification label corresponding to the training image;

an eighth input unit, configured to input the training image into a second image feature extraction network to obtain a sixth image feature, input the sixth image feature into a second classification network to obtain a second predicted classification result for the training image, and train the second image feature extraction network and the second classification network according to the second predicted classification result and the classification label corresponding to the training image, where the first image feature extraction network and the second image feature extraction network have different network structures;

a third determination unit, configured to determine the model parameters of the trained first image feature extraction network as the initial model parameters of the first encoder and the third encoder, and determine the model parameters of the trained second image feature extraction network as the initial model parameters of the second encoder, where the first image feature extraction network has the same network structure as the first encoder and the third encoder, and the second image feature extraction network has the same network structure as the second encoder.
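For illustration, one supervised pretraining step for either feature-extraction/classification pair might look like the sketch below; the use of a cross-entropy objective is an assumption, since the disclosure only states that the networks are trained on the predicted classification result and the class label:

```python
import torch
import torch.nn.functional as F

def pretrain_step(feat_net, cls_net, optimizer, images, labels):
    """One step of classification pretraining for a feature extraction
    network and its classification network."""
    features = feat_net(images)              # fifth / sixth image features
    logits = cls_net(features)               # predicted classification result
    loss = F.cross_entropy(logits, labels)   # assumed objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```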
In a possible implementation, the apparatus further includes:

an initialization unit, configured to randomly initialize the initial model parameters of the first encoder, the second encoder, and the third encoder.
In a possible implementation, the first discrimination result includes first probability values indicating, for each token in the first medical report text, whether the token was generated from the source image, and the second discrimination result includes second probability values indicating, for each token in the second medical report text, whether the token was generated from the source image;

the third calculation unit 609 is specifically configured to take the logarithm of the first probability values and sum them to obtain a first summation result, and take the negative of the first summation result to obtain the first adversarial loss;

the third calculation unit 609 is further configured to take the logarithm of the second probability values and sum them to obtain a second summation result, and take the negative of the second summation result to obtain the second adversarial loss;

and to compute the difference between 1 and each second probability value and sum the differences to obtain a third summation result, taking the negative of the third summation result to obtain the third adversarial loss.
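Writing $p_i^{(1)}$ and $p_i^{(2)}$ for the per-token probabilities in the first and second discrimination results, the three computations read literally as:

$$\mathcal{L}_{\mathrm{adv}}^{(1)} = -\sum_i \log p_i^{(1)}, \qquad \mathcal{L}_{\mathrm{adv}}^{(2)} = -\sum_i \log p_i^{(2)}, \qquad \mathcal{L}_{\mathrm{adv}}^{(3)} = -\sum_i \bigl(1 - p_i^{(2)}\bigr)$$

Note that the third loss, as stated, sums the plain differences $1 - p_i^{(2)}$ rather than their logarithms.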
In a possible implementation, the fourth calculation unit is specifically configured to input the source image into a third image feature extraction network and obtain the seventh image features output by each feature extraction layer of the third image feature extraction network;

input the reconstructed source image into the third image feature extraction network and obtain the eighth image features output by each feature extraction layer of the third image feature extraction network;

calculate, for each feature extraction layer, the source image loss corresponding to that layer from the seventh image feature and the eighth image feature output by the layer and the weight corresponding to the layer;

and sum the source image losses corresponding to the feature extraction layers to obtain the source image perceptual loss;

the fourth calculation unit is further configured to input the target image into the third image feature extraction network and obtain the ninth image features output by each feature extraction layer of the third image feature extraction network;

input the reconstructed target image into the third image feature extraction network and obtain the tenth image features output by each feature extraction layer of the third image feature extraction network;

calculate, for each feature extraction layer, the target image loss corresponding to that layer from the ninth image feature and the tenth image feature output by the layer and the weight corresponding to the layer;

and sum the target image losses corresponding to the feature extraction layers to obtain the target image perceptual loss.
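A minimal sketch of the per-layer computation follows; the squared-error distance is an assumption, as the disclosure specifies only a per-layer loss scaled by a layer weight and summed over layers:

```python
import torch

def perceptual_loss(feature_layers, layer_weights, image, reconstruction):
    """Weighted sum of per-layer feature differences between an image
    and its reconstruction (applies to both source and target)."""
    loss = torch.zeros(())
    x, y = image, reconstruction
    for layer, weight in zip(feature_layers, layer_weights):
        x, y = layer(x), layer(y)                        # features at this layer
        loss = loss + weight * torch.mean((x - y) ** 2)  # assumed distance
    return loss
```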
In a possible implementation, the first determination unit is specifically configured to input the source image into the third image feature extraction network and obtain the eleventh image feature output by the third image feature extraction network;

input the reconstructed source image into the third image feature extraction network and obtain the twelfth image feature output by the third image feature extraction network;

obtain a first difference value according to the difference between the eleventh image feature and the twelfth image feature;

input the target image into the third image feature extraction network and obtain the thirteenth image feature output by the third image feature extraction network;

input the reconstructed target image into the third image feature extraction network and obtain the fourteenth image feature output by the third image feature extraction network;

obtain a second difference value according to the difference between the thirteenth image feature and the fourteenth image feature;

and sum the first difference value and the second difference value to obtain a fourth summation result, taking the negative of the fourth summation result to obtain the first score.

In a possible implementation, the second determination unit is specifically configured to sum the source image specificity loss and the target image specificity loss to obtain a fifth summation result, and take the negative of the fifth summation result to obtain the second score.
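In formula form, with $\phi(\cdot)$ denoting the output of the third image feature extraction network and $d(\cdot,\cdot)$ a feature-difference measure whose exact form the disclosure does not fix:

$$S_1 = -\Bigl( d\bigl(\phi(x_{\mathrm{src}}), \phi(\hat{x}_{\mathrm{src}})\bigr) + d\bigl(\phi(x_{\mathrm{tgt}}), \phi(\hat{x}_{\mathrm{tgt}})\bigr) \Bigr), \qquad S_2 = -\bigl(\mathcal{L}_{\mathrm{spec}}^{\mathrm{src}} + \mathcal{L}_{\mathrm{spec}}^{\mathrm{tgt}}\bigr)$$

where $\hat{x}$ denotes the corresponding reconstructed image.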
Based on the medical report generation method provided by the above method embodiments, at least one embodiment of the present disclosure further provides a medical report generation apparatus, which is described below with reference to the accompanying drawings.

Referring to FIG. 7, which is a schematic structural diagram of a medical report generation apparatus provided by at least one embodiment of the present disclosure. As shown in FIG. 7, the medical report generation apparatus includes:

an input unit 701, configured to input a medical image into an encoder to obtain medical image features;

a generation unit 702, configured to input the medical image features into a text generator to obtain a medical report text;

where the encoder is the second encoder trained according to the training method for a medical report generation model described in any one of the above embodiments;

and the text generator is the text generator trained according to the training method for a medical report generation model described in any one of the above embodiments.
Based on the training method for a medical report generation model and the medical report generation method provided by the above method embodiments, at least one embodiment of the present disclosure further provides an electronic device, including: one or more processors; and a storage device storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the training method for a medical report generation model described in any of the above embodiments, or the medical report generation method described in the above embodiments.

Referring now to FIG. 8, which shows a schematic structural diagram of an electronic device 800 suitable for implementing the embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (Personal Digital Assistants), PADs (portable Android devices, i.e., tablet computers), PMPs (Portable Media Players), and vehicle-mounted terminals (e.g., vehicle-mounted navigation terminals), as well as fixed terminals such as digital TVs (televisions) and desktop computers. The electronic device shown in FIG. 8 is merely an example and should not impose any limitation on the functionality or scope of use of the embodiments of the present disclosure.

As shown in FIG. 8, the electronic device 800 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 801, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage device 808 into a random access memory (RAM) 803. The RAM 803 also stores various programs and data required for the operation of the electronic device 800. The processing device 801, the ROM 802, and the RAM 803 are connected to one another via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

In general, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 807 including, for example, a liquid crystal display (LCD), speaker, and vibrator; storage devices 808 including, for example, a magnetic tape and hard disk; and a communication device 809. The communication device 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 8 shows an electronic device 800 having various devices, it should be understood that it is not required to implement or provide all of the devices shown; more or fewer devices may alternatively be implemented or provided.

In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication device 809, installed from the storage device 808, or installed from the ROM 802. When the computer program is executed by the processing device 801, the above-described functions defined in the methods of the embodiments of the present disclosure are performed.

The electronic device provided by the embodiments of the present disclosure belongs to the same inventive concept as the training method for a medical report generation model and the medical report generation method provided by the above embodiments. For technical details not described in detail in this embodiment, reference may be made to the above embodiments, and this embodiment has the same beneficial effects as the above embodiments.
Based on the training method for a medical report generation model and the medical report generation method provided by the above method embodiments, at least one embodiment of the present disclosure provides a computer storage medium on which a computer program is stored, where the program, when executed by a processor, implements the training method for a medical report generation model described in any of the above embodiments, or the medical report generation method described in the above embodiments.

It should be noted that the computer-readable medium described above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; it can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wire, optical cable, RF (radio frequency), and the like, or any suitable combination of the foregoing.

In some implementations, the client and the server may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future-developed network.

The computer-readable medium described above may be included in the electronic device described above, or may exist separately without being assembled into the electronic device.

The computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to perform the above training method for a medical report generation model, or the medical report generation method.

Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The units involved in the embodiments described in the present disclosure may be implemented by software or by hardware. The name of a unit or module does not, in some cases, constitute a limitation on the unit itself; for example, a voice data collection module may also be described as a "data collection module".

The functions described herein above may be performed at least in part by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include: field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, Example 1 provides a training method for a medical report generation model, the method including:

inputting a source image into a first encoder to obtain a first image feature, and inputting the source image into a second encoder to obtain a second image feature, where the source image corresponds to a medical text label;

inputting a target image into a third encoder to obtain a third image feature, and inputting the target image into the second encoder to obtain a fourth image feature;

inputting the second image feature into a text generator to obtain a first medical report text;

inputting the fourth image feature into the text generator to obtain a second medical report text;

inputting the first medical report text into a discriminator to obtain a first discrimination result;

inputting the second medical report text into the discriminator to obtain a second discrimination result;

calculating a source image specificity loss from the first image feature and the second image feature, and calculating a target image specificity loss from the third image feature and the fourth image feature;

calculating a cross-entropy loss from the first medical report text and the medical text label corresponding to the source image;

calculating a first adversarial loss from the first discrimination result, and calculating a second adversarial loss and a third adversarial loss from the second discrimination result;

training the first encoder, the second encoder, the third encoder, the text generator, and the discriminator according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss, and repeating the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is reached.
According to one or more embodiments of the present disclosure, Example 2 provides the training method for a medical report generation model, the method further including:

inputting the first image feature and the second image feature into a first decoder to obtain a reconstructed source image;

inputting the third image feature and the fourth image feature into a second decoder to obtain a reconstructed target image;

calculating a source image perceptual loss from the source image and the reconstructed source image, and calculating a target image perceptual loss from the target image and the reconstructed target image;

where the training of the first encoder, the second encoder, the third encoder, the text generator, and the discriminator according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss includes:

training the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, the third adversarial loss, the source image perceptual loss, and the target image perceptual loss.
According to one or more embodiments of the present disclosure, Example 3 provides the training method for a medical report generation model, where some of the target images correspond to medical text labels; the method further includes:

determining a first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image;

determining a second score according to the source image specificity loss and the target image specificity loss;

if the target image corresponds to a medical text label, calculating a natural language evaluation metric as a third score according to the second medical report text and the medical text label corresponding to the target image;

computing a weighted sum of the first score, the second score, and the third score to obtain a reward value;

retraining the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder with the goal of maximizing the reward value.
According to one or more embodiments of the present disclosure, Example 4 provides the training method for a medical report generation model, the method further including:

inputting a training image into a first image feature extraction network to obtain a fifth image feature, inputting the fifth image feature into a first classification network to obtain a first predicted classification result for the training image, and training the first image feature extraction network and the first classification network according to the first predicted classification result and the classification label corresponding to the training image;

inputting the training image into a second image feature extraction network to obtain a sixth image feature, inputting the sixth image feature into a second classification network to obtain a second predicted classification result for the training image, and training the second image feature extraction network and the second classification network according to the second predicted classification result and the classification label corresponding to the training image, where the first image feature extraction network and the second image feature extraction network have different network structures;

determining the model parameters of the trained first image feature extraction network as the initial model parameters of the first encoder and the third encoder, and determining the model parameters of the trained second image feature extraction network as the initial model parameters of the second encoder, where the first image feature extraction network has the same network structure as the first encoder and the third encoder, and the second image feature extraction network has the same network structure as the second encoder.

According to one or more embodiments of the present disclosure, Example 5 provides the training method for a medical report generation model, the method further including:

randomly initializing the initial model parameters of the first encoder, the second encoder, and the third encoder.
According to one or more embodiments of the present disclosure, Example 6 provides the training method for a medical report generation model, where the first discrimination result includes first probability values indicating, for each token in the first medical report text, whether the token was generated from the source image, and the second discrimination result includes second probability values indicating, for each token in the second medical report text, whether the token was generated from the source image;

the calculating of the first adversarial loss from the first discrimination result includes:

taking the logarithm of the first probability values and summing them to obtain a first summation result, and taking the negative of the first summation result to obtain the first adversarial loss;

the calculating of the second adversarial loss and the third adversarial loss from the second discrimination result includes:

taking the logarithm of the second probability values and summing them to obtain a second summation result, and taking the negative of the second summation result to obtain the second adversarial loss;

computing the difference between 1 and each second probability value and summing the differences to obtain a third summation result, and taking the negative of the third summation result to obtain the third adversarial loss.
According to one or more embodiments of the present disclosure, Example 7 provides the training method for a medical report generation model, where the calculating of the source image perceptual loss from the source image and the reconstructed source image includes:

inputting the source image into a third image feature extraction network, and obtaining the seventh image features output by each feature extraction layer of the third image feature extraction network;

inputting the reconstructed source image into the third image feature extraction network, and obtaining the eighth image features output by each feature extraction layer of the third image feature extraction network;

calculating, for each feature extraction layer, the source image loss corresponding to that layer from the seventh image feature and the eighth image feature output by the layer and the weight corresponding to the layer;

summing the source image losses corresponding to the feature extraction layers to obtain the source image perceptual loss;

and the calculating of the target image perceptual loss from the target image and the reconstructed target image includes:

inputting the target image into the third image feature extraction network, and obtaining the ninth image features output by each feature extraction layer of the third image feature extraction network;

inputting the reconstructed target image into the third image feature extraction network, and obtaining the tenth image features output by each feature extraction layer of the third image feature extraction network;

calculating, for each feature extraction layer, the target image loss corresponding to that layer from the ninth image feature and the tenth image feature output by the layer and the weight corresponding to the layer;

summing the target image losses corresponding to the feature extraction layers to obtain the target image perceptual loss.
According to one or more embodiments of the present disclosure, Example 8 provides the training method for a medical report generation model, where the determining of the first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image includes:

inputting the source image into a third image feature extraction network, and obtaining the eleventh image feature output by the third image feature extraction network;

inputting the reconstructed source image into the third image feature extraction network, and obtaining the twelfth image feature output by the third image feature extraction network;

obtaining a first difference value according to the difference between the eleventh image feature and the twelfth image feature;

inputting the target image into the third image feature extraction network, and obtaining the thirteenth image feature output by the third image feature extraction network;

inputting the reconstructed target image into the third image feature extraction network, and obtaining the fourteenth image feature output by the third image feature extraction network;

obtaining a second difference value according to the difference between the thirteenth image feature and the fourteenth image feature;

summing the first difference value and the second difference value to obtain a fourth summation result, and taking the negative of the fourth summation result to obtain the first score.

According to one or more embodiments of the present disclosure, Example 9 provides the training method for a medical report generation model, where the determining of the second score according to the source image specificity loss and the target image specificity loss includes:

summing the source image specificity loss and the target image specificity loss to obtain a fifth summation result, and taking the negative of the fifth summation result to obtain the second score.
According to one or more embodiments of the present disclosure, Example 10 provides a medical report generation method, the method including:

inputting a medical image into an encoder to obtain medical image features;

inputting the medical image features into a text generator to obtain a medical report text;

where the encoder is the second encoder trained according to the training method for a medical report generation model described in any one of the above embodiments;

and the text generator is the text generator trained according to the training method for a medical report generation model described in any one of the above embodiments.
According to one or more embodiments of the present disclosure, Example 11 provides a training apparatus for a medical report generation model, the apparatus including:

a first input unit, configured to input a source image into a first encoder to obtain a first image feature, and input the source image into a second encoder to obtain a second image feature, where the source image corresponds to a medical text label;

a second input unit, configured to input a target image into a third encoder to obtain a third image feature, and input the target image into the second encoder to obtain a fourth image feature;

a third input unit, configured to input the second image feature into a text generator to obtain a first medical report text;

a fourth input unit, configured to input the fourth image feature into the text generator to obtain a second medical report text;

a fifth input unit, configured to input the first medical report text into a discriminator to obtain a first discrimination result;

a sixth input unit, configured to input the second medical report text into the discriminator to obtain a second discrimination result;

a first calculation unit, configured to calculate a source image specificity loss from the first image feature and the second image feature, and calculate a target image specificity loss from the third image feature and the fourth image feature;

a second calculation unit, configured to calculate a cross-entropy loss from the first medical report text and the medical text label corresponding to the source image;

a third calculation unit, configured to calculate a first adversarial loss from the first discrimination result, and calculate a second adversarial loss and a third adversarial loss from the second discrimination result;

an execution unit, configured to train the first encoder, the second encoder, the third encoder, the text generator, and the discriminator according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss, and to repeat the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is reached.
According to one or more embodiments of the present disclosure, Example 12 provides the training apparatus for a medical report generation model, the apparatus further including:

a seventh input unit, configured to input the first image feature and the second image feature into a first decoder to obtain a reconstructed source image;

an eighth input unit, configured to input the third image feature and the fourth image feature into a second decoder to obtain a reconstructed target image;

a fourth calculation unit, configured to calculate a source image perceptual loss from the source image and the reconstructed source image, and calculate a target image perceptual loss from the target image and the reconstructed target image;

where the execution unit is specifically configured to train the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, the third adversarial loss, the source image perceptual loss, and the target image perceptual loss.
According to one or more embodiments of the present disclosure, Example 13 provides the training apparatus for a medical report generation model, where some of the target images correspond to medical text labels; the apparatus further includes:

a first determination unit, configured to determine a first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image;

a second determination unit, configured to determine a second score according to the source image specificity loss and the target image specificity loss;

a fifth calculation unit, configured to, if the target image corresponds to a medical text label, calculate a natural language evaluation metric as a third score according to the second medical report text and the medical text label corresponding to the target image;

a summation unit, configured to compute a weighted sum of the first score, the second score, and the third score to obtain a reward value;

a training unit, configured to retrain the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder with the goal of maximizing the reward value.
根据本公开的一个或多个实施例,示例十四提供了一种医学报告生成模型的训练装置,所述装置还包括:According to one or more embodiments of the present disclosure, Example Fourteen provides a training device for a medical report generation model, the device further comprising:
第七输入单元,用于将训练图像输入第一图像特征提取网络,得到第五图像特征,将所述第五图像特征输入第一分类网络,得到所述训练图像的第一预测分类结果;根据所述训练图像的第一预测分类结果以及所述训练图像 对应的分类标签,训练所述第一图像特征提取网络以及所述第一分类网络;The seventh input unit is used to input the training image into the first image feature extraction network to obtain the fifth image feature, and input the fifth image feature into the first classification network to obtain the first predicted classification result of the training image; according to The first predicted classification result of the training image and the classification label corresponding to the training image, training the first image feature extraction network and the first classification network;
第八输入单元,用于将训练图像输入第二图像特征提取网络,得到第六图像特征,将所述第六图像特征输入第二分类网络,得到所述训练图像的第二预测分类结果;根据所述训练图像的第二预测分类结果以及所述训练图像对应的分类标签,训练所述第二图像特征提取网络以及所述第二分类网络;所述第一图像特征提取网络与所述第二图像特征提取网络的网络结构不同;The eighth input unit is used to input the training image into the second image feature extraction network to obtain the sixth image feature, and input the sixth image feature into the second classification network to obtain the second predicted classification result of the training image; according to The second predicted classification result of the training image and the classification label corresponding to the training image, train the second image feature extraction network and the second classification network; the first image feature extraction network and the second The network structure of the image feature extraction network is different;
第三确定单元,用于将训练完成的所述第一图像特征提取网络的模型参数确定为所述第一编码器以及所述第三编码器的初始模型参数,将训练完成的所述第二图像特征提取网络的模型参数确定为所述第二编码器的初始模型参数;所述第一图像特征提取网络与所述第一编码器以及所述第三编码器的网络结构相同,所述第二图像特征提取网络与所述第二编码器的网络结构相同。The third determining unit is configured to determine the model parameters of the trained first image feature extraction network as the initial model parameters of the first encoder and the third encoder, and determine the trained second The model parameters of the image feature extraction network are determined as the initial model parameters of the second encoder; the first image feature extraction network has the same network structure as the first encoder and the third encoder, and the second encoder The network structure of the second image feature extraction network is the same as that of the second encoder.
According to one or more embodiments of the present disclosure, Example 15 provides an apparatus for training a medical report generation model, the apparatus further comprising:
an initialization unit, configured to randomly initialize the initial model parameters of the first encoder, the second encoder, and the third encoder.
According to one or more embodiments of the present disclosure, Example 16 provides an apparatus for training a medical report generation model, wherein the first discrimination result includes a first probability value indicating, for each token in the first medical report text, whether the token was generated from the source image, and the second discrimination result includes a second probability value indicating, for each token in the second medical report text, whether the token was generated from the source image;
the third calculating unit is specifically configured to take the logarithm of each first probability value and sum the logarithms to obtain a first summation result, and take the negative of the first summation result to obtain the first adversarial loss;
the third calculating unit is specifically configured to take the logarithm of each second probability value and sum the logarithms to obtain a second summation result, and take the negative of the second summation result to obtain the second adversarial loss;
and to compute the difference between 1 and each second probability value and sum the differences to obtain a third summation result, and take the negative of the third summation result to obtain the third adversarial loss.
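These three losses can be written down directly; the following sketch mirrors the text as stated, including the third loss summing 1 minus the probability rather than its logarithm:

    import torch

    def adversarial_losses(p1: torch.Tensor, p2: torch.Tensor):
        # p1, p2: per-token probabilities output by the discriminator, in (0, 1).
        loss1 = -torch.sum(torch.log(p1))  # negative first summation result
        loss2 = -torch.sum(torch.log(p2))  # negative second summation result
        loss3 = -torch.sum(1.0 - p2)       # negative third summation result, as stated
        return loss1, loss2, loss3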
According to one or more embodiments of the present disclosure, Example 17 provides an apparatus for training a medical report generation model, wherein the fourth calculating unit is specifically configured to input the source image into a third image feature extraction network and obtain the seventh image features output by each feature extraction layer of the third image feature extraction network;
input the reconstructed source image into the third image feature extraction network and obtain the eighth image features output by each feature extraction layer of the third image feature extraction network;
for each feature extraction layer, calculate the source image loss corresponding to that layer according to the seventh image feature and the eighth image feature output by the layer and the weight corresponding to the layer;
and sum the source image losses corresponding to the feature extraction layers to obtain the source image perceptual loss;
the fourth calculating unit is specifically configured to input the target image into the third image feature extraction network and obtain the ninth image features output by each feature extraction layer of the third image feature extraction network;
input the reconstructed target image into the third image feature extraction network and obtain the tenth image features output by each feature extraction layer of the third image feature extraction network;
for each feature extraction layer, calculate the target image loss corresponding to that layer according to the ninth image feature and the tenth image feature output by the layer and the weight corresponding to the layer;
and sum the target image losses corresponding to the feature extraction layers to obtain the target image perceptual loss.
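A compact sketch of this layer-weighted perceptual loss, assuming the per-layer features have already been collected (for example with forward hooks) and using a mean squared error as the per-layer distance, which the text leaves open:

    import torch

    def perceptual_loss(feats_image, feats_recon, layer_weights):
        # feats_image, feats_recon: lists of per-layer features for an image and
        # its reconstruction; layer_weights: one scalar weight per layer.
        loss = torch.zeros(())
        for f, f_hat, w in zip(feats_image, feats_recon, layer_weights):
            loss = loss + w * torch.mean((f - f_hat) ** 2)  # per-layer image loss
        return loss  # summed over layers, this is the perceptual loss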
According to one or more embodiments of the present disclosure, Example 18 provides an apparatus for training a medical report generation model, wherein the first determining unit is specifically configured to input the source image into a third image feature extraction network and obtain the eleventh image feature output by the third image feature extraction network;
input the reconstructed source image into the third image feature extraction network and obtain the twelfth image feature output by the third image feature extraction network;
obtain a first difference value according to the difference between the eleventh image feature and the twelfth image feature;
input the target image into the third image feature extraction network and obtain the thirteenth image feature output by the third image feature extraction network;
input the reconstructed target image into the third image feature extraction network and obtain the fourteenth image feature output by the third image feature extraction network;
obtain a second difference value according to the difference between the thirteenth image feature and the fourteenth image feature;
and sum the first difference value and the second difference value to obtain a fourth summation result, and take the negative of the fourth summation result to obtain the first score.
According to one or more embodiments of the present disclosure, Example 19 provides an apparatus for training a medical report generation model, wherein the second determining unit is specifically configured to sum the source image-specific loss and the target image-specific loss to obtain a fifth summation result, and take the negative of the fifth summation result to obtain the second score.
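The two scores of Examples 18 and 19 reduce to negated sums; a sketch in the same style, assuming a mean squared error as the feature distance, which the text does not fix:

    import torch

    def first_score(f11, f12, f13, f14):
        d1 = torch.mean((f11 - f12) ** 2)  # first difference value
        d2 = torch.mean((f13 - f14) ** 2)  # second difference value
        return -(d1 + d2)                  # negative fourth summation result

    def second_score(src_specific_loss, tgt_specific_loss):
        return -(src_specific_loss + tgt_specific_loss)  # negative fifth summation result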
According to one or more embodiments of the present disclosure, Example 20 provides a medical report generation apparatus, the apparatus comprising:
an input unit, configured to input a medical image into an encoder to obtain medical image features;
a generating unit, configured to input the medical image features into a text generator to obtain a medical report text;
wherein the encoder is the second encoder trained by the method for training a medical report generation model according to any one of the above embodiments;
and the text generator is the text generator trained by the method for training a medical report generation model according to any one of the above embodiments.
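At inference time the apparatus reduces to two calls; a minimal sketch, with encoder and text_generator standing in for the trained second encoder and text generator (the names are placeholders):

    import torch

    @torch.no_grad()
    def generate_report(medical_image: torch.Tensor, encoder, text_generator) -> str:
        features = encoder(medical_image)  # medical image features
        return text_generator(features)    # medical report text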
According to one or more embodiments of the present disclosure, Example 21 provides an electronic device, comprising:
one or more processors; and
a storage device storing one or more programs,
which, when executed by the one or more processors, cause the one or more processors to implement the method for training a medical report generation model according to any one of the above embodiments, or to implement the medical report generation method according to the above embodiment.
According to one or more embodiments of the present disclosure, Example 22 provides a computer-readable medium storing a computer program which, when executed by a processor, implements the method for training a medical report generation model according to any one of the above embodiments, or implements the medical report generation method according to the above embodiment.
It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the systems and apparatuses disclosed in the embodiments correspond to the methods disclosed in the embodiments, their description is kept brief; for the relevant details, refer to the description of the methods.
It should be understood that in the present disclosure, "at least one (item)" means one or more, and "a plurality of" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A exists, only B exists, or both A and B exist, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following items" or similar expressions refer to any combination of those items, including any combination of single items or plural items. For example, at least one of a, b, or c may denote: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where each of a, b, and c may be single or multiple.
It should also be noted that, herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above description of the disclosed embodiments enables those skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present disclosure. Therefore, the present disclosure is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

  1. A method for training a medical report generation model, comprising:
    inputting a source image into a first encoder to obtain a first image feature, and inputting the source image into a second encoder to obtain a second image feature, wherein the source image corresponds to a medical text label;
    inputting a target image into a third encoder to obtain a third image feature, and inputting the target image into the second encoder to obtain a fourth image feature;
    inputting the second image feature into a text generator to obtain a first medical report text;
    inputting the fourth image feature into the text generator to obtain a second medical report text;
    inputting the first medical report text into a discriminator to obtain a first discrimination result;
    inputting the second medical report text into the discriminator to obtain a second discrimination result;
    calculating a source image-specific loss according to the first image feature and the second image feature, and calculating a target image-specific loss according to the third image feature and the fourth image feature;
    calculating a cross-entropy loss according to the first medical report text and the medical text label corresponding to the source image;
    calculating a first adversarial loss according to the first discrimination result, and calculating a second adversarial loss and a third adversarial loss according to the second discrimination result; and
    training the first encoder, the second encoder, the third encoder, the text generator, and the discriminator according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss, and repeating the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is met.
  2. The method according to claim 1, further comprising:
    inputting the first image feature and the second image feature into a first decoder to obtain a reconstructed source image;
    inputting the third image feature and the fourth image feature into a second decoder to obtain a reconstructed target image; and
    calculating a source image perceptual loss according to the source image and the reconstructed source image, and calculating a target image perceptual loss according to the target image and the reconstructed target image;
    wherein the training the first encoder, the second encoder, the third encoder, the text generator, and the discriminator according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss comprises:
    training the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, the third adversarial loss, the source image perceptual loss, and the target image perceptual loss.
  3. The method according to claim 2, wherein some of the target images correspond to medical text labels, the method further comprising:
    determining a first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image;
    determining a second score according to the source image-specific loss and the target image-specific loss;
    if the target image corresponds to a medical text label, calculating a natural language evaluation metric as a third score according to the second medical report text and the medical text label corresponding to the target image;
    computing a weighted sum of the first score, the second score, and the third score to obtain a reward value; and
    retraining the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder with the goal of maximizing the reward value.
  4. The method according to any one of claims 1-3, further comprising:
    inputting a training image into a first image feature extraction network to obtain a fifth image feature, inputting the fifth image feature into a first classification network to obtain a first predicted classification result of the training image, and training the first image feature extraction network and the first classification network according to the first predicted classification result of the training image and the classification label corresponding to the training image;
    inputting the training image into a second image feature extraction network to obtain a sixth image feature, inputting the sixth image feature into a second classification network to obtain a second predicted classification result of the training image, and training the second image feature extraction network and the second classification network according to the second predicted classification result of the training image and the classification label corresponding to the training image, wherein the first image feature extraction network and the second image feature extraction network have different network structures; and
    taking the model parameters of the trained first image feature extraction network as the initial model parameters of the first encoder and the third encoder, and taking the model parameters of the trained second image feature extraction network as the initial model parameters of the second encoder, wherein the first image feature extraction network has the same network structure as the first encoder and the third encoder, and the second image feature extraction network has the same network structure as the second encoder.
  5. The method according to any one of claims 1-3, further comprising:
    randomly initializing the initial model parameters of the first encoder, the second encoder, and the third encoder.
  6. The method according to any one of claims 1-5, wherein the first discrimination result includes a first probability value indicating, for each token in the first medical report text, whether the token was generated from the source image, and the second discrimination result includes a second probability value indicating, for each token in the second medical report text, whether the token was generated from the source image;
    the calculating a first adversarial loss according to the first discrimination result comprises:
    taking the logarithm of each first probability value and summing the logarithms to obtain a first summation result, and taking the negative of the first summation result to obtain the first adversarial loss; and
    the calculating a second adversarial loss and a third adversarial loss according to the second discrimination result comprises:
    taking the logarithm of each second probability value and summing the logarithms to obtain a second summation result, and taking the negative of the second summation result to obtain the second adversarial loss; and
    computing the difference between 1 and each second probability value and summing the differences to obtain a third summation result, and taking the negative of the third summation result to obtain the third adversarial loss.
  7. The method according to claim 2 or 3, wherein the calculating a source image perceptual loss according to the source image and the reconstructed source image comprises:
    inputting the source image into a third image feature extraction network, and obtaining the seventh image features output by each feature extraction layer of the third image feature extraction network;
    inputting the reconstructed source image into the third image feature extraction network, and obtaining the eighth image features output by each feature extraction layer of the third image feature extraction network;
    for each feature extraction layer, calculating the source image loss corresponding to that layer according to the seventh image feature and the eighth image feature output by the layer and the weight corresponding to the layer; and
    summing the source image losses corresponding to the feature extraction layers to obtain the source image perceptual loss; and
    the calculating a target image perceptual loss according to the target image and the reconstructed target image comprises:
    inputting the target image into the third image feature extraction network, and obtaining the ninth image features output by each feature extraction layer of the third image feature extraction network;
    inputting the reconstructed target image into the third image feature extraction network, and obtaining the tenth image features output by each feature extraction layer of the third image feature extraction network;
    for each feature extraction layer, calculating the target image loss corresponding to that layer according to the ninth image feature and the tenth image feature output by the layer and the weight corresponding to the layer; and
    summing the target image losses corresponding to the feature extraction layers to obtain the target image perceptual loss.
  8. The method according to claim 3, wherein the determining a first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image comprises:
    inputting the source image into a third image feature extraction network, and obtaining the eleventh image feature output by the third image feature extraction network;
    inputting the reconstructed source image into the third image feature extraction network, and obtaining the twelfth image feature output by the third image feature extraction network;
    obtaining a first difference value according to the difference between the eleventh image feature and the twelfth image feature;
    inputting the target image into the third image feature extraction network, and obtaining the thirteenth image feature output by the third image feature extraction network;
    inputting the reconstructed target image into the third image feature extraction network, and obtaining the fourteenth image feature output by the third image feature extraction network;
    obtaining a second difference value according to the difference between the thirteenth image feature and the fourteenth image feature; and
    summing the first difference value and the second difference value to obtain a fourth summation result, and taking the negative of the fourth summation result to obtain the first score.
  9. The method according to claim 3 or 8, wherein the determining a second score according to the source image-specific loss and the target image-specific loss comprises:
    summing the source image-specific loss and the target image-specific loss to obtain a fifth summation result, and taking the negative of the fifth summation result to obtain the second score.
  10. A medical report generation method, comprising:
    inputting a medical image into an encoder to obtain medical image features; and
    inputting the medical image features into a text generator to obtain a medical report text;
    wherein the encoder is the second encoder trained by the method for training a medical report generation model according to any one of claims 1-9; and
    the text generator is the text generator trained by the method for training a medical report generation model according to any one of claims 1-9.
  11. An apparatus for training a medical report generation model, comprising:
    a first input unit, configured to input a source image into a first encoder to obtain a first image feature, and input the source image into a second encoder to obtain a second image feature, wherein the source image corresponds to a medical text label;
    a second input unit, configured to input a target image into a third encoder to obtain a third image feature, and input the target image into the second encoder to obtain a fourth image feature;
    a third input unit, configured to input the second image feature into a text generator to obtain a first medical report text;
    a fourth input unit, configured to input the fourth image feature into the text generator to obtain a second medical report text;
    a fifth input unit, configured to input the first medical report text into a discriminator to obtain a first discrimination result;
    a sixth input unit, configured to input the second medical report text into the discriminator to obtain a second discrimination result;
    a first calculating unit, configured to calculate a source image-specific loss according to the first image feature and the second image feature, and calculate a target image-specific loss according to the third image feature and the fourth image feature;
    a second calculating unit, configured to calculate a cross-entropy loss according to the first medical report text and the medical text label corresponding to the source image;
    a third calculating unit, configured to calculate a first adversarial loss according to the first discrimination result, and calculate a second adversarial loss and a third adversarial loss according to the second discrimination result; and
    an execution unit, configured to train the first encoder, the second encoder, the third encoder, the text generator, and the discriminator according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss, and to repeat the inputting of the source image into the first encoder and the subsequent steps until a preset condition is met.
  12. A medical report generation apparatus, comprising:
    an input unit, configured to input a medical image into an encoder to obtain medical image features; and
    a generating unit, configured to input the medical image features into a text generator to obtain a medical report text;
    wherein the encoder is the second encoder trained by the method for training a medical report generation model according to any one of claims 1-9; and
    the text generator is the text generator trained by the method for training a medical report generation model according to any one of claims 1-9.
  13. An electronic device, comprising:
    one or more processors; and
    a storage device storing one or more programs,
    which, when executed by the one or more processors, cause the one or more processors to implement the method for training a medical report generation model according to any one of claims 1-9, or to implement the medical report generation method according to claim 10.
  14. A computer-readable medium storing a computer program which, when executed by a processor, implements the method for training a medical report generation model according to any one of claims 1-9, or implements the medical report generation method according to claim 10.
PCT/CN2022/107921 2021-08-31 2022-07-26 Medical report generation method and apparatus, model training method and apparatus, and device WO2023029817A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111013687.9 2021-08-31
CN202111013687.9A CN113539408B (en) 2021-08-31 2021-08-31 Medical report generation method, training device and training equipment of model

Publications (1)

Publication Number Publication Date
WO2023029817A1 true WO2023029817A1 (en) 2023-03-09

Family

ID=78092338

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/107921 WO2023029817A1 (en) 2021-08-31 2022-07-26 Medical report generation method and apparatus, model training method and apparatus, and device

Country Status (2)

Country Link
CN (1) CN113539408B (en)
WO (1) WO2023029817A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117174240A (en) * 2023-10-26 2023-12-05 中国科学技术大学 Medical image report generation method based on large model field migration
CN117198514A (en) * 2023-11-08 2023-12-08 中国医学科学院北京协和医院 Vulnerable plaque identification method and system based on CLIP model
CN117393100A (en) * 2023-12-11 2024-01-12 安徽大学 Diagnostic report generation method, model training method, system, equipment and medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539408B (en) * 2021-08-31 2022-02-25 北京字节跳动网络技术有限公司 Medical report generation method, training device and training equipment of model
CN116631566B (en) * 2023-05-23 2024-05-24 广州合昊医疗科技有限公司 Medical image report intelligent generation method based on big data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180225823A1 (en) * 2017-02-09 2018-08-09 Siemens Healthcare Gmbh Adversarial and Dual Inverse Deep Learning Networks for Medical Image Analysis
US20200334809A1 (en) * 2019-04-16 2020-10-22 Covera Health Computer-implemented machine learning for detection and statistical analysis of errors by healthcare providers
CN112992308A (en) * 2021-03-25 2021-06-18 腾讯科技(深圳)有限公司 Training method of medical image report generation model and image report generation method
CN113129309A (en) * 2021-03-04 2021-07-16 同济大学 Medical image semi-supervised segmentation system based on object context consistency constraint
CN113539408A (en) * 2021-08-31 2021-10-22 北京字节跳动网络技术有限公司 Medical report generation method, training device and training equipment of model

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107767928A (en) * 2017-09-15 2018-03-06 深圳市前海安测信息技术有限公司 Medical image report preparing system and method based on artificial intelligence
CN109147890B (en) * 2018-05-14 2020-04-24 平安科技(深圳)有限公司 Method and equipment for generating medical report
CN111063410B (en) * 2019-12-20 2024-01-09 京东方科技集团股份有限公司 Method and device for generating medical image text report
CN112529857B (en) * 2020-12-03 2022-08-23 重庆邮电大学 Ultrasonic image diagnosis report generation method based on target detection and strategy gradient
CN112614561A (en) * 2020-12-24 2021-04-06 北京工业大学 Brain CT medical report generation method based on hierarchical self-attention sequence coding

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180225823A1 (en) * 2017-02-09 2018-08-09 Siemens Healthcare Gmbh Adversarial and Dual Inverse Deep Learning Networks for Medical Image Analysis
US20200334809A1 (en) * 2019-04-16 2020-10-22 Covera Health Computer-implemented machine learning for detection and statistical analysis of errors by healthcare providers
CN113129309A (en) * 2021-03-04 2021-07-16 同济大学 Medical image semi-supervised segmentation system based on object context consistency constraint
CN112992308A (en) * 2021-03-25 2021-06-18 腾讯科技(深圳)有限公司 Training method of medical image report generation model and image report generation method
CN113539408A (en) * 2021-08-31 2021-10-22 北京字节跳动网络技术有限公司 Medical report generation method, training device and training equipment of model

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117174240A (en) * 2023-10-26 2023-12-05 中国科学技术大学 Medical image report generation method based on large model field migration
CN117174240B (en) * 2023-10-26 2024-02-09 中国科学技术大学 Medical image report generation method based on large model field migration
CN117198514A (en) * 2023-11-08 2023-12-08 中国医学科学院北京协和医院 Vulnerable plaque identification method and system based on CLIP model
CN117198514B (en) * 2023-11-08 2024-01-30 中国医学科学院北京协和医院 Vulnerable plaque identification method and system based on CLIP model
CN117393100A (en) * 2023-12-11 2024-01-12 安徽大学 Diagnostic report generation method, model training method, system, equipment and medium
CN117393100B (en) * 2023-12-11 2024-04-05 安徽大学 Diagnostic report generation method, model training method, system, equipment and medium

Also Published As

Publication number Publication date
CN113539408A (en) 2021-10-22
CN113539408B (en) 2022-02-25

Similar Documents

Publication Publication Date Title
WO2023029817A1 (en) Medical report generation method and apparatus, model training method and apparatus, and device
Lella et al. Automatic diagnosis of COVID-19 disease using deep convolutional neural network with multi-feature channel from respiratory sound data: cough, voice, and breath
Alkhodari et al. Detection of COVID-19 in smartphone-based breathing recordings: A pre-screening deep learning tool
US11288279B2 (en) Cognitive computer assisted attribute acquisition through iterative disclosure
TW202040585A (en) Method and apparatus for automated target and tissue segmentation using multi-modal imaging and ensemble machine learning models
JP7357614B2 (en) Machine-assisted dialogue system, medical condition interview device, and method thereof
US11663057B2 (en) Analytics framework for selection and execution of analytics in a distributed environment
US11334806B2 (en) Registration, composition, and execution of analytics in a distributed environment
WO2021098534A1 (en) Similarity determining method and device, network training method and device, search method and device, and electronic device and storage medium
US10847261B1 (en) Methods and systems for prioritizing comprehensive diagnoses
WO2023185516A1 (en) Method and apparatus for training image recognition model, and recognition method and apparatus, and medium and device
CN111523593B (en) Method and device for analyzing medical images
JP2023553586A (en) Non-invasive method for the detection of pulmonary hypertension
WO2022033462A9 (en) Method and apparatus for generating prediction information, and electronic device and medium
CN113220895B (en) Information processing method and device based on reinforcement learning and terminal equipment
CN117149998B (en) Intelligent diagnosis recommendation method and system based on multi-objective optimization
Zhang et al. A two-stage federated transfer learning framework in medical images classification on limited data: A COVID-19 case study
CN112397195B (en) Method, apparatus, electronic device and medium for generating physical examination model
WO2023011397A1 (en) Method for generating acoustic features, training speech models and speech recognition, and device
CN113343664B (en) Method and device for determining matching degree between image texts
CN115662510A (en) Method, device and equipment for determining causal parameters and storage medium
Vyas et al. Classification of COVID-19 Cases: The Customized Deep Convolutional Neural Network and Transfer Learning Approach.
Abhishek et al. The Auscultation Sound Classification Era of the Future
JP2021507392A (en) Learning and applying contextual similarities between entities
WO2023053189A1 (en) Information processing device, information processing method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22862958

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE