WO2023029817A1 - Medical report generation method and apparatus, model training method and apparatus, and device - Google Patents

Medical report generation method and apparatus, model training method and apparatus, and device

Info

Publication number
WO2023029817A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
loss
encoder
image feature
feature extraction
Prior art date
Application number
PCT/CN2022/107921
Other languages
French (fr)
Chinese (zh)
Inventor
边成
Original Assignee
Beijing ByteDance Network Technology Co., Ltd. (北京字节跳动网络技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co., Ltd.
Publication of WO2023029817A1 publication Critical patent/WO2023029817A1/en

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H15/00 ICT specially adapted for medical reports, e.g. generation or transmission thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2132 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on discrimination criteria, e.g. discriminant analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • G06T9/002 Image coding using neural networks

Definitions

  • Embodiments of the present disclosure relate to a medical report generation method and apparatus, a model training method and apparatus, and a device.
  • A medical image is an image of internal tissue of the human body or of a part of the human body, and can help doctors understand the health status of a patient.
  • the medical image has a corresponding medical report, and the medical report contains the analysis result of the medical image.
  • a medical report may include the location of the patient's disease, the extent of the lesion, and the affected organs determined from the medical images.
  • Embodiments of the present disclosure provide a method for generating a medical report, a method for training a model, a device, and a device, which can automatically generate a medical report based on medical images.
  • embodiments of the present disclosure provide a method for training a medical report generation model, the method comprising:
  • the source image is input into the first encoder to obtain the first image feature, and the source image is input into the second encoder to obtain the second image feature; the source image corresponds to a medical text label;
  • according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss and the third adversarial loss, train the first encoder, the second encoder, the third encoder, the text generator and the discriminator, and repeatedly execute the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is reached.
  • an embodiment of the present disclosure provides a method for generating a medical report, the method comprising:
  • the encoder is a second encoder trained according to the training method of the medical report generation model described in any one of the above embodiments;
  • the text generator is a text generator trained according to the training method of the medical report generation model described in any one of the above embodiments.
  • embodiments of the present disclosure provide a training device for a medical report generation model, the device comprising:
  • the first input unit is configured to input a source image into a first encoder to obtain a first image feature, and input the source image to a second encoder to obtain a second image feature; the source image corresponds to a medical text label;
  • the second input unit is configured to input the target image into the third encoder to obtain a third image feature, and input the target image to the second encoder to obtain a fourth image feature;
  • a third input unit configured to input the second image feature into the text generator to obtain the first medical report text
  • a fourth input unit configured to input the fourth image feature into the text generator to obtain a second medical report text
  • the fifth input unit is used to input the first medical report text into the discriminator to obtain the first discriminant result
  • a sixth input unit configured to input the second medical report text into the discriminator to obtain a second discriminant result
  • a first calculation unit configured to calculate a source image-specific loss based on the first image feature and the second image feature, and calculate a target image-specific loss based on the third image feature and the fourth image feature;
  • a second calculation unit configured to calculate a cross-entropy loss according to the first medical report text and the medical text label corresponding to the source image
  • a third calculation unit configured to calculate a first adversarial loss according to the first discrimination result, and calculate a second adversarial loss and a third adversarial loss according to the second discrimination result;
  • an execution unit configured to, according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss and the third adversarial loss, train the first encoder, the second encoder, the third encoder, the text generator and the discriminator, and repeatedly perform the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is met.
  • an embodiment of the present disclosure provides a medical report generating device, the device comprising:
  • the input unit is used to input the medical image into the encoder to obtain the medical image features
  • a generating unit configured to input the medical image features into a text generator to obtain a medical report text
  • the encoder is a second encoder trained according to the training method of the medical report generation model described in any one of the above embodiments;
  • the text generator is a text generator trained according to the training method of the medical report generation model described in any one of the above embodiments.
  • an electronic device including:
  • one or more processors;
  • a storage device on which one or more programs are stored; when the one or more programs are executed by the one or more processors, the one or more processors implement the training method of the medical report generation model described in any one of the above embodiments, or implement the medical report generation method described in the above embodiments.
  • Embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, wherein, when the program is executed by a processor, the training method of the medical report generation model described in any one of the above embodiments, or the medical report generation method described in the above embodiments, is implemented.
  • Embodiments of the present disclosure provide a training method for a medical report generation model and a medical report generation method. Image features are extracted from the source image and the target image respectively, and the image features are used to obtain the corresponding first medical report text and second medical report text. The discriminator is then used to obtain the first discrimination result and the second discrimination result corresponding to the first medical report text and the second medical report text respectively. Finally, the image features are used to calculate the source image-specific loss and the target image-specific loss; the cross-entropy loss is calculated from the first medical report text and the medical text label corresponding to the source image; the first adversarial loss is calculated according to the first discrimination result, and the second adversarial loss and the third adversarial loss are calculated according to the second discrimination result. According to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss and the third adversarial loss, the first encoder, the second encoder, the third encoder, the text generator and the discriminator are trained, and the above training steps are repeated until the preset conditions are met.
  • Fig. 1 is a schematic framework diagram of an exemplary application scenario provided by at least one embodiment of the present disclosure
  • FIG. 2 is a flowchart of a training method for a medical report generation model provided by at least one embodiment of the present disclosure
  • Fig. 3 is a schematic diagram of a method for generating a medical report model provided by at least one embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of another method for generating a medical report model provided by at least one embodiment of the present disclosure
  • Fig. 5 is a flow chart of a method for generating a medical report provided by at least one embodiment of the present disclosure
  • Fig. 6 is a schematic structural diagram of a training device for a medical report generation model provided by at least one embodiment of the present disclosure
  • Fig. 7 is a schematic structural diagram of a medical report generating device provided by at least one embodiment of the present disclosure.
  • Fig. 8 is a schematic diagram of a basic structure of an electronic device provided by at least one embodiment of the present disclosure.
  • Refer to FIG. 1, which is a schematic framework diagram of an exemplary application scenario provided by at least one embodiment of the present disclosure.
  • The medical image 101 is input into the trained encoder 102 to obtain the medical image feature 103 corresponding to the medical image 101, and the medical image feature 103 is then input into the trained text generator 104 to obtain the medical report text 105 output by the text generator 104.
  • FIG. 1 is only an example in which the embodiments of the present disclosure can be implemented.
  • the scope of applicability of the disclosed embodiments is not limited in any way by this framework.
  • FIG. 2 is a flowchart of a method for training a medical report generation model provided by at least one embodiment of the present disclosure, the method includes steps S201-S210.
  • S201 Input a source image into a first encoder to obtain a first image feature, and input the source image into a second encoder to obtain a second image feature; the source image corresponds to a medical text label.
  • FIG. 3 is a schematic diagram of a method for generating a medical report model provided by at least one embodiment of the present disclosure.
  • the source images are medical images with corresponding medical text labels.
  • the medical text label refers to the medical report text corresponding to the medical image, for example, it may be the test report text.
  • the source image may be a chest radiograph image derived from MIMIC-CXR (a data set).
  • the first encoder is used to extract image features specific to the source image, that is, image features belonging to the source domain.
  • the source image is input into the first encoder to obtain the first image feature output by the first encoder.
  • the second encoder is an encoder shared by the source image and the target image, and is used to extract features that are similar for the source domain and the target domain in the hidden feature space, that is, features common to the source domain and the target domain.
  • the source image is input into the second encoder to obtain the second image features output by the second encoder.
  • the first encoder and the second encoder may be composed of four convolutional layers.
  • the second encoder may use Inception-v3 (a neural network), and the first encoder may use ResNet (Deep residual network, deep residual network).
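  • As an illustration of the four-convolutional-layer option mentioned above, the following is a minimal PyTorch sketch of such an encoder; the channel progression, kernel sizes and pooling head are illustrative assumptions, not values specified by the disclosure.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Four-convolutional-layer image encoder (sketch). Could play the role of
    the first or third encoder (domain-specific features) or the second
    (shared) encoder; all sizes here are assumptions."""
    def __init__(self, in_channels: int = 1, feature_dim: int = 256):
        super().__init__()
        chans = [32, 64, 128, feature_dim]  # assumed channel progression
        layers, prev = [], in_channels
        for c in chans:
            layers += [nn.Conv2d(prev, c, kernel_size=3, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
            prev = c
        self.conv = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)  # collapse spatial dimensions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(self.conv(x)).flatten(1)  # (batch, feature_dim)
```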
  • S202 Input the target image into a third encoder to obtain a third image feature, and input the target image into a second encoder to obtain a fourth image feature.
  • the target image is a medical image belonging to an image type other than the medical image type to which the source image belongs.
  • the target image may include medical images of one or more image types, including the image type for which medical report text needs to be generated.
  • the target image includes the medical image generated by the endoscope.
  • the target image may also include medical images of other image types such as CT (Computed Tomography, computerized tomography) images.
  • the target image may include a medical image without a label, or may include a medical image with a corresponding label.
  • the label of the target image can be a manually labeled medical report text, or it can be a descriptive text related to the target image in the literature, articles, etc. to which the target image belongs.
  • the target image is input into the second encoder to obtain a fourth image feature output by the second encoder.
  • the third encoder is used to extract image features specific to the target image, that is, image features belonging to the target domain.
  • the target image is input into the third encoder to obtain the third image feature output by the third encoder.
  • the third encoder may be composed of four convolutional layers.
  • the third encoder can use ResNet (Deep residual network, deep residual network).
  • S203 Input the second image feature into the text generator to obtain the first medical report text.
  • the text generator is used to generate corresponding medical report text according to the image features of the input medical image.
  • the text generator can be composed of a bidirectional two-layer LSTM (Long Short-Term Memory) artificial neural network.
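  • As a concrete reading of this component, the sketch below conditions a two-layer LSTM decoder on the image feature and decodes report tokens with teacher forcing; for simplicity it is unidirectional, and the vocabulary size, embedding size and conditioning scheme are assumptions rather than details given by the disclosure.

```python
import torch
import torch.nn as nn

class ReportGenerator(nn.Module):
    """Two-layer LSTM text generator conditioned on an image feature (sketch)."""
    def __init__(self, vocab_size: int, feature_dim: int = 256,
                 embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feature_dim, 2 * hidden_dim)  # one slice per LSTM layer
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feature: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # Initialise the LSTM state from the image feature, then decode with
        # teacher forcing over the given token prefix.
        h0 = torch.tanh(self.init_h(image_feature))
        h0 = h0.view(-1, 2, self.lstm.hidden_size).permute(1, 0, 2).contiguous()
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(out)  # (batch, seq_len, vocab_size) logits
```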
  • the second image feature is input into the text generator to obtain the first medical report text output by the text generator.
  • the fourth image feature is input into the above-mentioned text generator to obtain the second medical report text output by the text generator.
  • the discriminator is used to determine the domain to which the input medical report text belongs, that is, to determine whether the input medical report text belongs to the source domain or the target domain.
  • the discriminator can be composed of a CNN (Convolutional Neural Network) with two convolutional layers and a fully connected layer.
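  • The following sketch shows one way such a discriminator could look: token embeddings passed through two 1-D convolutions and a fully connected layer, producing the per-word-segment probability described below; the embedding and channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class TextDiscriminator(nn.Module):
    """CNN text discriminator (sketch): two convolutional layers plus a fully
    connected layer, emitting a probability per token that the token comes
    from a source-domain report."""
    def __init__(self, vocab_size: int, embed_dim: int = 128, channels: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, channels, kernel_size=3, padding=1), nn.ReLU(True),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1), nn.ReLU(True),
        )
        self.fc = nn.Linear(channels, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        e = self.embed(tokens).transpose(1, 2)        # (batch, embed_dim, seq_len)
        h = self.conv(e).transpose(1, 2)              # (batch, seq_len, channels)
        return torch.sigmoid(self.fc(h)).squeeze(-1)  # D(y): per-token prob in (0, 1)
```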
  • the first medical report text is input into the discriminator, and a first discrimination result of the discriminator on the first medical report text is obtained.
  • the second medical report text is input into the discriminator, and a second discrimination result of the discriminator on the second medical report text is obtained.
  • Adversarial training can be achieved by using the discriminator, so that the second encoder can narrow the difference between the first medical report text and the second medical report text, map features from different domains into the same domain, and achieve feature-level alignment.
  • S207 Calculate the source image-specific loss according to the first image feature and the second image feature, and calculate the target image-specific loss according to the third image feature and the fourth image feature.
  • the first image feature and the second image feature are obtained by feature extraction of the source image by different encoders. According to the first image feature and the second image feature, a source image-specific loss can be calculated. The source image-specific loss is used to measure the gap between the first image feature and the second image feature.
  • the source image-specific loss can be expressed by formula (1) and is denoted $L_{sdist}$ below.
  • the third image feature and the fourth image feature are obtained by different encoders performing feature extraction on the target image.
  • the target image-specific loss can be calculated.
  • the target image-specific loss is used to measure the gap between the third image feature and the fourth image feature.
  • the target image-specific loss can be expressed by formula (2) and is denoted $L_{tdist}$ below.
  • S208 Calculate a cross-entropy loss according to the first medical report text and the medical text label corresponding to the source image.
  • the source images have corresponding medical text labels.
  • a cross-entropy loss is calculated according to the first medical report text and the medical text label corresponding to the source image.
  • Cross-entropy loss is used to measure the gap between the first medical report text and medical text labels.
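  • As a hedged sketch of this step, token-level cross-entropy between the generated first medical report text and the medical text label can be computed as below; the padding convention is an assumption.

```python
import torch
import torch.nn.functional as F

def report_cross_entropy(logits: torch.Tensor, label_tokens: torch.Tensor,
                         pad_id: int = 0) -> torch.Tensor:
    """Cross-entropy between generator logits (batch, seq_len, vocab) and the
    tokenised medical text label (batch, seq_len); padded positions (pad_id,
    an assumption) are ignored."""
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           label_tokens.reshape(-1),
                           ignore_index=pad_id)
```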
  • S209 Calculate a first adversarial loss according to the first discrimination result, and calculate a second adversarial loss and a third adversarial loss according to the second discrimination result.
  • the first adversarial loss is calculated according to the first discrimination result output by the discriminator, and the second adversarial loss and the third adversarial loss are calculated according to the second discrimination result.
  • the first adversarial loss, the second adversarial loss and the third adversarial loss are used to measure whether the generated medical report texts are judged to belong to the corresponding domain.
  • At least one embodiment of the present disclosure provides a specific implementation of calculating the first adversarial loss according to the first discrimination result, and of calculating the second adversarial loss and the third adversarial loss according to the second discrimination result; please refer to the description below for details.
  • Based on the source image-specific loss, the first encoder and the second encoder can learn different image features of the source image. Based on the target image-specific loss, the second encoder and the third encoder can learn different image features of the target image.
  • the cross-entropy loss it is possible to train the text generator to generate more accurate first medical report text.
  • the domain-invariant features of the target domain and the source domain can be made as close as possible.
  • At least one embodiment of the present disclosure provides a specific implementation of training the first encoder, the second encoder, the third encoder, the text generator and the discriminator based on the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss and the third adversarial loss; please refer to the description below for details.
  • the preset condition is a condition for completing the training.
  • the preset condition may be, for example, the number of times of training, or may be a numerical condition satisfied by the loss function.
  • the second encoder and the text generator obtained through adversarial training based on domain-invariant features can generate corresponding medical report text for medical images belonging to the image type of the target image.
  • medical report texts can be generated for medical images of image types lacking labels, and the scope of medical image types for generating medical report texts can be expanded.
  • the discriminator can be used to map data from different domains into the same domain and achieve feature-level alignment, so that the encoder and text generator obtained after training can generate more accurate medical report text corresponding to medical images.
  • the discriminator is used to determine the probability that the input medical report text was generated from the source image.
  • the first medical report text is input into the discriminator, and the obtained first discrimination result output by the discriminator includes a first probability value that each word segment in the first medical report text was generated from the source image.
  • the second medical report text is input into the discriminator, and the obtained second discrimination result output by the discriminator includes a second probability value that each word segment in the second medical report text was generated from the source image.
  • the first probability value may be expressed as $D(y_s)$, where $y_s$ represents the first medical report text.
  • the second probability value may be expressed as $D(y_t)$, where $y_t$ represents the second medical report text.
  • the value range of the first probability value and the second probability value is from 0 to 1; the closer the value is to 1, the higher the probability that the text was generated from the source image, and the closer it is to 0, the lower that probability.
  • At least one embodiment of the present disclosure provides a method for calculating a first adversarial loss according to a first discrimination result, which specifically includes:
  • the logarithms of the first probability values are taken and summed to obtain the first summation result, and the negative value of the first summation result is taken to obtain the first adversarial loss.
  • the logarithms of the first probability values are taken and then summed to obtain the first summation result.
  • the first summation result can be expressed as $\sum \log[D(y_s)]$.
  • the first adversarial loss can be expressed by formula (3): $L_{adv1}(y_s) = -\sum \log[D(y_s)]$.
  • At least one embodiment of the present disclosure provides a method for calculating a second adversarial loss and a third adversarial loss according to a second discrimination result, including:
  • the logarithms of the second probability values are taken and summed to obtain a second summation result, and the negative value of the second summation result is taken to obtain a second adversarial loss.
  • the difference between 1 and the second probability value is calculated and summed to obtain the third summation result, and the negative value of the third summation result is taken to obtain the third adversarial loss.
  • the logarithms of the second probability values are taken and then summed to obtain the second summation result.
  • the second summation result can be expressed as $\sum \log[D(y_t)]$.
  • the second adversarial loss can be expressed by formula (4): $L_{adv2}(y_t) = -\sum \log[D(y_t)]$.
  • the difference between 1 and each value in the second probability value is calculated and summed to obtain the third summation result, and the negative value of the third summation result is taken to obtain the third adversarial loss.
  • the third adversarial loss can be expressed by formula (5): $L_{adv3}(y_t) = -\sum \left(1 - D(y_t)\right)$.
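  • A sketch of formulas (3)-(5) as described above; note that the third loss follows the literal description here (the difference from 1, without a logarithm), whereas a standard GAN formulation would use log(1 - D(y_t)).

```python
import torch

def adversarial_losses(d_src: torch.Tensor, d_tgt: torch.Tensor):
    """d_src = D(y_s) and d_tgt = D(y_t): per-word-segment probabilities in
    (0, 1) output by the discriminator."""
    eps = 1e-8                                   # numerical-stability assumption
    l_adv1 = -torch.sum(torch.log(d_src + eps))  # formula (3): -sum log D(y_s)
    l_adv2 = -torch.sum(torch.log(d_tgt + eps))  # formula (4): -sum log D(y_t)
    l_adv3 = -torch.sum(1.0 - d_tgt)             # formula (5) as stated in the text
    return l_adv1, l_adv2, l_adv3
```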
  • the model can be optimized by reconstructing the image.
  • At least one embodiment of the present disclosure provides a method for training a medical report generation model. In addition to the above steps S201-S210, the method further includes the following three steps.
  • FIG. 4 is a schematic diagram of another method for generating a medical report model provided by at least one embodiment of the present disclosure.
  • A1 Input the first image feature and the second image feature into the first decoder to obtain the reconstructed source image.
  • the first decoder is used to generate a reconstructed source image according to domain-invariant features and unique features of the input source image.
  • the first image feature and the second image feature are input into the first decoder to obtain the reconstructed source image.
  • the first decoder may be composed of four convolutional layers.
  • A2 Input the third image feature and the fourth image feature into the second decoder to obtain the reconstructed target image.
  • the second decoder is used to generate a reconstructed target image according to the domain-invariant features and unique features of the input target image.
  • the third image feature and the fourth image feature are input into the second decoder to obtain a reconstructed target image.
  • the second decoder may consist of four convolutional layers.
  • the encoder and the decoder adopt an autoencoder structure.
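  • A minimal sketch of such a decoder, taking the concatenation of the domain-specific and shared feature vectors and reconstructing an image; all layer sizes and the output resolution are assumptions.

```python
import torch
import torch.nn as nn

class ConvDecoder(nn.Module):
    """Decoder (sketch) that reconstructs an image from a domain-specific
    feature and the shared feature, as in steps A1/A2."""
    def __init__(self, feature_dim: int = 256, out_channels: int = 1):
        super().__init__()
        self.fc = nn.Linear(2 * feature_dim, 128 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(True),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(True),
            nn.ConvTranspose2d(16, out_channels, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, f_specific: torch.Tensor, f_shared: torch.Tensor) -> torch.Tensor:
        h = self.fc(torch.cat([f_specific, f_shared], dim=1)).view(-1, 128, 8, 8)
        return self.deconv(h)  # reconstructed image, here (batch, 1, 128, 128)
```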
  • A3 Calculate the perceptual loss of the source image based on the source image and the reconstructed source image, and calculate the perceptual loss of the target image based on the target image and the reconstructed target image.
  • the source image perceptual loss is used to measure the gap between the source image and the reconstructed source image.
  • according to the target image and the reconstructed target image, the target image perceptual loss is calculated.
  • the target image perceptual loss is used to measure the gap between the target image and the reconstructed target image.
  • At least one embodiment of the present disclosure provides a specific implementation of calculating the source image perceptual loss according to the source image and the reconstructed source image, and of calculating the target image perceptual loss according to the target image and the reconstructed target image; please refer to the description below.
  • At least one embodiment of the present disclosure provides a specific implementation of training the first encoder, the second encoder, the third encoder, the text generator and the discriminator, which specifically includes:
  • based on the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, the third adversarial loss, the source image perceptual loss and the target image perceptual loss, training the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder and the second decoder.
  • the model can also be optimized according to the source image perceptual loss and the target image perceptual loss, so as to reduce the gap between the source image and the reconstructed source image and between the target image and the reconstructed target image, and to improve the accuracy of the model's image feature extraction for the source image and the target image.
  • the above losses can be combined to obtain the total loss.
  • the total loss can be expressed by the following formula: $L = L_{ce} + L_{difference} + L_{rec} + \lambda_{adv1} L_{adv1}(y_s) + \lambda_{adv2} L_{adv2}(y_t) + \lambda_{adv3} L_{adv3}(y_t)$
  • $L_{difference}$ represents the sum of the source image-specific loss and the target image-specific loss, where $L_{sdist}$ represents the source image-specific loss and $L_{tdist}$ represents the target image-specific loss.
  • $L_{ce}$ represents the cross-entropy loss.
  • $L_{rec}$ represents the sum of the source image perceptual loss and the target image perceptual loss, where $L_{perc}(x_s, x_{srec}; w)$ represents the source image perceptual loss and $L_{perc}(x_t, x_{trec}; w)$ represents the target image perceptual loss.
  • $L_{adv1}(y_s)$ represents the first adversarial loss, and $\lambda_{adv1}$ is the weight corresponding to the first adversarial loss.
  • $L_{adv2}(y_t)$ represents the second adversarial loss, and $\lambda_{adv2}$ is the weight corresponding to the second adversarial loss.
  • $L_{adv3}(y_t)$ represents the third adversarial loss, and $\lambda_{adv3}$ is the weight corresponding to the third adversarial loss.
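  • A small sketch of how the terms combine into the total loss defined above; the default adversarial weights are placeholders, since their values are not fixed here.

```python
def total_loss(l_ce, l_sdist, l_tdist, l_adv1, l_adv2, l_adv3,
               l_perc_src, l_perc_tgt,
               w_adv1=1.0, w_adv2=1.0, w_adv3=1.0):
    """Combine the losses per the total-loss formula; weight defaults are
    placeholder assumptions."""
    l_difference = l_sdist + l_tdist  # sum of image-specific losses
    l_rec = l_perc_src + l_perc_tgt   # sum of perceptual losses
    return (l_ce + l_difference + l_rec
            + w_adv1 * l_adv1 + w_adv2 * l_adv2 + w_adv3 * l_adv3)
```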
  • with minimizing the total loss as the optimization goal, the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder and the second decoder are trained.
  • the encoder can be optimized on the premise that the target image does not have a label, so that the encoder can extract more accurate image features and improve the accuracy of the trained model.
  • At least one embodiment of the present disclosure provides a method for calculating the perceptual loss of the source image according to the source image and the reconstructed source image, including the following four steps:
  • B1 Input the source image into the third image feature extraction network, and obtain the seventh image feature output by each feature extraction layer of the third image feature extraction network.
  • the third image feature extraction network is used to extract image features of the image.
  • the source image is input into the third image feature extraction network to obtain seventh image features output by each feature extraction layer of the third image feature extraction network.
  • the third image feature extraction network may be VGG Net (a deep convolutional neural network).
  • VGG Net can be pre-trained. The source image is input into VGG Net to obtain the seventh image feature $\phi^{(l)}(x_s)$, where $x_s$ represents the source image and $l$ represents the $l$-th feature extraction layer in VGG Net; $l$ is a positive integer greater than or equal to 1 and less than or equal to $L$, where $L$ is the total number of feature extraction layers of VGG Net.
  • the third image feature extraction network is used to extract the image features of the reconstructed source image to obtain eighth image features output by each feature extraction layer in the third image feature extraction network.
  • the eighth image feature can be expressed as $\phi^{(l)}(x_{srec})$, where $x_{srec}$ represents the reconstructed source image.
  • Each feature extraction layer in the third feature extraction network has a corresponding weight. According to the weight of each feature extraction layer, the seventh image feature output by each feature extraction layer, and the eighth image feature output by each feature extraction layer, the source image loss corresponding to the feature extraction layer is calculated.
  • the difference between the seventh image feature and the eighth image feature can be calculated first, then the L1 norm of the obtained difference can be calculated, and finally the L1 norm of the difference is multiplied by the weight to obtain the source image loss corresponding to the feature extraction layer.
  • the source image perceptual loss $L_{perc}(x_s, x_{srec}; w)$ can be expressed by the following formula: $L_{perc}(x_s, x_{srec}; w) = \sum_{l=1}^{L} \frac{w^{(l)}}{N^{(l)}} \left\lVert \phi^{(l)}(x_s) - \phi^{(l)}(x_{srec}) \right\rVert_1$
  • $w^{(l)}$ represents the weight of the $l$-th feature extraction layer;
  • $N^{(l)}$ represents the number of elements in the $l$-th feature extraction layer;
  • $\lVert\cdot\rVert_1$ represents the L1 norm.
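  • A hedged sketch of this perceptual loss using pretrained VGG features; the choice of VGG16, the tapped layers and the 3-channel input are illustrative assumptions (in practice the network would be built once and frozen).

```python
import torch
import torchvision.models as models

def perceptual_loss(x: torch.Tensor, x_rec: torch.Tensor, layer_weights) -> torch.Tensor:
    """L1 perceptual loss between an image and its reconstruction, summed over
    a few assumed VGG16 feature layers and normalised by layer size."""
    vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
    taps = {3, 8, 15, 22}  # assumed feature-extraction layers
    loss, h_x, h_r, k = 0.0, x, x_rec, 0
    for i, layer in enumerate(vgg):
        h_x, h_r = layer(h_x), layer(h_r)
        if i in taps:
            n = h_x.numel() / h_x.size(0)  # N^(l): elements per sample
            loss = loss + layer_weights[k] / n * torch.sum(torch.abs(h_x - h_r))
            k += 1
    return loss
```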
  • At least one embodiment of the present disclosure provides a specific implementation manner of calculating the perceptual loss of the target image according to the target image and the reconstructed target image, which specifically includes the following four steps:
  • B5 Input the target image into the third image feature extraction network, and obtain the ninth image feature output by each feature extraction layer of the third image feature extraction network.
  • the third image feature extraction network is used to extract image features of the target image to obtain ninth image features output by each feature extraction layer in the third image feature extraction network.
  • the ninth image feature can be expressed as $\phi^{(l)}(x_t)$, where $x_t$ represents the target image.
  • B6 Input the reconstructed target image into the third image feature extraction network, and obtain the tenth image feature output by each feature extraction layer of the third image feature extraction network.
  • the third image feature extraction network is used to extract image features of the reconstructed target image to obtain tenth image features output by each feature extraction layer in the third image feature extraction network.
  • the tenth image feature can be expressed as $\phi^{(l)}(x_{trec})$, where $x_{trec}$ represents the reconstructed target image.
  • Each feature extraction layer in the third feature extraction network has a corresponding weight. According to the weight of each feature extraction layer, the ninth image feature output by each feature extraction layer, and the tenth image feature output by each feature extraction layer, the target image loss corresponding to the feature extraction layer is calculated.
  • the difference between the ninth image feature and the tenth image feature can be calculated first, then the L1 norm of the obtained difference can be calculated, and finally the L1 norm of the difference is multiplied by the weight to obtain the target image loss corresponding to the feature extraction layer.
  • the target image perceptual loss $L_{perc}(x_t, x_{trec}; w)$ can be expressed by the following formula: $L_{perc}(x_t, x_{trec}; w) = \sum_{l=1}^{L} \frac{w^{(l)}}{N^{(l)}} \left\lVert \phi^{(l)}(x_t) - \phi^{(l)}(x_{trec}) \right\rVert_1$
  • $w^{(l)}$ represents the weight of the $l$-th feature extraction layer;
  • $N^{(l)}$ represents the number of elements in the $l$-th feature extraction layer;
  • $\lVert\cdot\rVert_1$ represents the L1 norm.
  • Some of the target images may have corresponding medical text labels.
  • the model can be trained in a semi-supervised manner.
  • At least one embodiment of the present disclosure provides a training method for a medical report generation model in which, after the training in the above steps S201-S210 is completed, training can be performed again; that is, in addition to the above steps, the method includes the following five steps:
  • C1 Determine the first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image.
  • the first score is used to measure the difference between the source image and the reconstructed source image, and the difference between the target image and the reconstructed target image.
  • At least one embodiment of the present disclosure provides a specific implementation of determining the first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image; please refer to the description below.
  • C2 Determine the second score according to the source image-specific loss and the target image-specific loss.
  • the second score is related to the source image specific loss as well as the target image specific loss.
  • At least one embodiment of the present disclosure provides a specific implementation manner of determining the second score according to the source image specific loss and the target image specific loss, please refer to the following.
  • C3 If the target image corresponds to a medical text label, calculate the natural language evaluation index as the third score according to the second medical report text and the medical text label corresponding to the target image.
  • the natural language evaluation index can be calculated according to the medical text labels of the target images and the second medical report text.
  • the calculated natural language evaluation indicator is determined as a third score.
  • the natural language evaluation index can be an index such as CIDEr (Consensus-based Image Description Evaluation, consensus-based image description evaluation).
  • the third score can be represented by formula (11): $SCORE_{eval} = \mathrm{CIDEr}(y_t, y)$
  • $\mathrm{CIDEr}(y_t, y)$ represents the CIDEr score of $y_t$ and $y$, where $y_t$ is the second medical report text generated based on the target image and $y$ is the medical text label corresponding to the target image.
  • C4 The first score, the second score and the third score are weighted and summed to obtain the reward value.
  • the reward value REWARD can be expressed by formula (12): $REWARD = \mu_1 \cdot SCORE_{rec} + \mu_2 \cdot SCORE_{dist} + \mu_3 \cdot SCORE_{eval}$
  • $SCORE_{rec}$ represents the first score, $SCORE_{dist}$ represents the second score, and $SCORE_{eval}$ represents the third score.
  • $\mu_1$ is the weight corresponding to the first score, $\mu_2$ is the weight corresponding to the second score, and $\mu_3$ is the weight corresponding to the third score.
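  • The three scores and the reward can then be combined as sketched below, following formulas (11), (12), (15) and (16); the default weights are placeholders.

```python
def scores_and_reward(s1: float, s2: float, l_sdist: float, l_tdist: float,
                      cider: float, mu1: float = 1.0, mu2: float = 1.0,
                      mu3: float = 1.0) -> float:
    """Weighted reward from the reconstruction, specificity and language scores."""
    score_rec = -(s1 + s2)             # formula (15)
    score_dist = -(l_sdist + l_tdist)  # formula (16)
    score_eval = cider                 # formula (11): CIDEr(y_t, y)
    return mu1 * score_rec + mu2 * score_dist + mu3 * score_eval  # formula (12)
```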
  • the reward value can reflect the training situation of the model through three aspects: the difference of the reconstructed image, the specificity loss of the image, and the natural language evaluation index.
  • C5 Retrain the first encoder, second encoder, third encoder, text generator, discriminator, first decoder, and second decoder with the goal of maximizing the reward value.
  • maximizing the reward value is taken as the training goal, and the text generator can be updated through reinforcement learning.
  • using the natural language evaluation index as the third score can take the natural language evaluation index into account when training the model, so that the goals of model training and model application are consistent, and the accuracy of the model can be further improved.
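  • The disclosure does not spell out the update rule; one common choice consistent with "updated through reinforcement learning" is a REINFORCE-style policy-gradient step such as the sketch below, where the baseline and the sampled-sequence log-probabilities are standard assumptions rather than details from the disclosure.

```python
import torch

def reinforce_step(log_probs: torch.Tensor, reward: float, baseline: float,
                   optimizer: torch.optim.Optimizer) -> None:
    """One policy-gradient update of the text generator: scale the sampled
    sequence's log-likelihood by the (baselined) reward and ascend it."""
    loss = -(reward - baseline) * log_probs.sum()  # maximise expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```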
  • At least one embodiment of the present disclosure provides a method for determining the first score based on the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image, including the following seven steps:
  • D1 Input the source image into the third image feature extraction network, and obtain the eleventh image feature output by the third image feature extraction network.
  • the third image feature extraction network is used to extract image features of the image.
  • the source image is input into the third image feature extraction network to obtain an eleventh image feature output by the third image feature extraction network.
  • the third image feature extraction network may be VGG Net (a deep convolutional neural network).
  • VGG Net can be pre-trained. The source image is input into VGG Net to obtain the eleventh image feature $\phi^{(l)}(x_s)$, where $x_s$ represents the source image and $l$ represents the $l$-th activation layer in VGG Net; $l$ is a positive integer greater than or equal to 1 and less than or equal to $L$, where $L$ is the maximum number of activation layers of VGG Net.
  • D2 input the reconstructed source image into the third image feature extraction network, and obtain the twelfth image feature output by the third image feature extraction network.
  • the twelfth image feature can be expressed as $\phi^{(l)}(x_{srec})$, where $x_{srec}$ is the reconstructed source image.
  • D3 Obtain a first difference value according to the difference between the eleventh image feature and the twelfth image feature.
  • the first difference value is used to indicate the difference between the eleventh image feature and the twelfth image feature.
  • the difference between the twelfth image feature and the eleventh image feature may be calculated first, and the L1 norm of this difference is then calculated to obtain the first difference value.
  • the first difference value $S_1$ can be represented by the following formula: $S_1 = \left\lVert \phi^{(l)}(x_{srec}) - \phi^{(l)}(x_s) \right\rVert_1$
  • D4 Input the target image into the third image feature extraction network, and obtain the thirteenth image feature output by the third image feature extraction network.
  • the thirteenth image feature can be expressed as $\phi^{(l)}(x_t)$, where $x_t$ is the target image.
  • D5 Input the reconstructed target image into the third image feature extraction network, and obtain the fourteenth image feature output by the third image feature extraction network.
  • the fourteenth image feature can be expressed as $\phi^{(l)}(x_{trec})$, where $x_{trec}$ is the reconstructed target image.
  • D6 Obtain a second difference value according to the difference between the thirteenth image feature and the fourteenth image feature.
  • the second difference value is used to indicate the difference between the thirteenth image feature and the fourteenth image feature.
  • the difference between the thirteenth image feature and the fourteenth image feature may be calculated first, and the L1 norm of this difference is then calculated to obtain the second difference value.
  • the second difference value $S_2$ can be represented by the following formula: $S_2 = \left\lVert \phi^{(l)}(x_t) - \phi^{(l)}(x_{trec}) \right\rVert_1$
  • D7 Sum the first difference value and the second difference value to obtain a fourth summation result, and take the negative value of the fourth summation result to obtain the first score.
  • the first score $SCORE_{rec}$ can be expressed by formula (15): $SCORE_{rec} = -(S_1 + S_2)$
  • At least one embodiment of the present disclosure provides a specific implementation of determining the second score according to the source image-specific loss and the target image-specific loss, including:
  • the source image-specific loss and the target image-specific loss are summed to obtain a fifth summation result, and the negative value of the fifth summation result is taken to obtain a second score.
  • the second score $SCORE_{dist}$ can be expressed by formula (16): $SCORE_{dist} = -(L_{sdist} + L_{tdist}) = -L_{difference}$
  • $L_{sdist}$ is the source image-specific loss, $L_{tdist}$ is the target image-specific loss, and $L_{difference}$ is the fifth summation result.
  • the first encoder, the second encoder, and the third encoder may also be trained in advance.
  • At least one embodiment of the present disclosure provides a training method for a medical report generation model, which includes the following three steps in addition to the above steps.
  • E1 Input the training image into the first image feature extraction network to obtain the fifth image feature, and input the fifth image feature into the first classification network to obtain the first predicted classification result of the training image; according to the first predicted classification result of the training image and the classification label corresponding to the training image, train the first image feature extraction network and the first classification network.
  • Training images are the images used to train the encoder.
  • the training images are medical images with classification labels.
  • the classification label is the disease corresponding to the medical image.
  • the medical image used as the training image may be a chest radiograph, and the corresponding classification label may be, for example, disease names of diseases such as pneumonia, pulmonary nodules, and cardiac hypertrophy.
  • the training images can be images from the CheXpert-small dataset.
  • the first image feature extraction network is used to extract image features.
  • the training image is input into the first image feature extraction network to obtain the fifth image feature output by the first image feature extraction network.
  • the first image feature extraction network may adopt an Inception-v3 network structure.
  • the first classification network is used to determine the classification type of the image according to the input image features.
  • the obtained fifth image feature is then input into the first classification network to obtain the first predicted classification result of the training image.
  • the first predicted classification result may include the image type of the training image.
  • the classification label of the training image can be used to measure the accuracy of the first predicted classification result of the training image.
  • a first image feature extraction network and a first classification network are trained according to the classification label of the training image and the first predicted classification result.
  • E2 Input the training image into the second image feature extraction network to obtain the sixth image feature, and input the sixth image feature into the second classification network to obtain the second predicted classification result of the training image; according to the second predicted classification result of the training image and The classification label corresponding to the training image is used to train the second image feature extraction network and the second classification network; the network structure of the first image feature extraction network is different from that of the second image feature extraction network.
  • the second image feature extraction network is a network with a different structure from the first image feature extraction network.
  • the second image feature extraction network is used to extract image features.
  • the training image is input into the second image feature extraction network to obtain the sixth image feature output by the second image feature extraction network.
  • the second classification network is used to determine the classification type of the image according to the input image features.
  • the obtained sixth image feature is then input into the second classification network to obtain the second predicted classification result of the training image output by the second classification network.
  • the second predicted classification result may include the image type of the training image.
  • the classification label of the training image can be used to measure whether the second predicted classification result of the training image is accurate. Using the classification labels of the training images and the second predicted classification results of the training images, the second image feature extraction network and the second classification network are trained.
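  • A minimal sketch of this pre-training loop for a feature extraction network plus classification head on labelled training images; the optimizer, learning rate and epoch count are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pretrain_classifier(feature_net: nn.Module, classifier: nn.Module,
                        loader, epochs: int = 10, lr: float = 1e-4) -> None:
    """Train feature extractor + classifier on (image, classification label)
    pairs, e.g. chest films from CheXpert-small."""
    params = list(feature_net.parameters()) + list(classifier.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for images, labels in loader:
            logits = classifier(feature_net(images))  # predicted classification
            loss = F.cross_entropy(logits, labels)    # vs. classification label
            opt.zero_grad()
            loss.backward()
            opt.step()
```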
  • E3 Determine the model parameters of the trained first image feature extraction network as the initial model parameters of the first encoder and the third encoder, and determine the model parameters of the trained second image feature extraction network as the initial model parameters of the second encoder; the first image feature extraction network has the same network structure as the first encoder and the third encoder, and the second image feature extraction network has the same network structure as the second encoder.
  • the network structure of the first image feature extraction network is the same as that of the first encoder and the third encoder. After obtaining the trained first image feature extraction network, use the first image feature extraction network to determine the initial model parameters of the first encoder and the third encoder. Specifically, the model parameters of the first feature extraction network are determined as the initial model parameters of the first encoder and the initial model parameters of the third encoder.
  • the second image feature extraction network has the same network structure as the second encoder. After obtaining the trained second image feature extraction network, use the model parameters of the second image feature extraction network to determine the initial model parameters of the second encoder. Specifically, the model parameters of the second image feature extraction network are determined as the initial model parameters of the second encoder.
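  • In code, the parameter transfer can be as simple as the sketch below; the variable names are hypothetical, and matching architectures are assumed so that load_state_dict succeeds without remapping.

```python
# Initialise the encoders from the pre-trained feature extraction networks.
first_encoder.load_state_dict(first_feature_net.state_dict())    # E1 network -> first encoder
third_encoder.load_state_dict(first_feature_net.state_dict())    # E1 network -> third encoder
second_encoder.load_state_dict(second_feature_net.state_dict())  # E2 network -> second encoder
```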
  • pre-training makes the first encoder, the second encoder and the third encoder more accurate, improves the accuracy of the first encoder, the second encoder and the third encoder in extracting image features, and improves the efficiency of model training.
  • the initial model parameters of the first encoder, the second encoder and the third encoder may also be determined by random initialization.
  • At least one embodiment of the present disclosure also provides a training method for a medical report generation model, which includes the following steps in addition to the above steps.
  • the initial model parameters of the first encoder, the second encoder and the third encoder are randomly initialized.
  • the initial model parameters of the first encoder, the second encoder, and the third encoder are randomly initialized. Then, in the above manner, the first encoder, the second encoder and the third encoder are trained to determine model parameters.
  • an embodiment of the present disclosure provides a method for generating a medical report.
  • Fig. 5 is a flow chart of a method for generating a medical report provided by at least one embodiment of the present disclosure, the method includes S501-S502:
  • S501 Input medical images into an encoder to obtain medical image features.
  • the encoder is the second encoder trained by using the above training method of the medical report generation model.
  • the trained second encoder can more accurately extract medical image features of the medical image.
  • the medical image that needs to generate the corresponding medical report text is input into the encoder to obtain the medical image features corresponding to the medical image.
  • the image type of the medical image is consistent with the image type of the target image.
  • the target image includes an image generated by an endoscope, and correspondingly, the medical image may be an image generated by an endoscope.
  • S502 Input the medical image features into the text generator to obtain the medical report text.
  • the text generator is a text generator trained by the above-mentioned training method of the medical report generation model.
  • the trained text generator can generate more accurate medical report text based on the input medical image features.
  • the medical image features of the medical image output by the encoder are input into the text generator to obtain the medical report text output by the text generator.
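  • Putting S501 and S502 together, inference can be sketched as greedy autoregressive decoding; the token ids, maximum length and the generator interface are assumptions carried over from the earlier sketches.

```python
import torch

@torch.no_grad()
def generate_report(image, encoder, generator, bos_id=1, eos_id=2, max_len=100):
    """Medical image -> trained second encoder -> text generator -> report tokens."""
    feature = encoder(image.unsqueeze(0))  # S501: extract medical image features
    tokens = [bos_id]
    for _ in range(max_len):               # S502: decode one token at a time
        logits = generator(feature, torch.tensor([tokens]))
        next_id = int(logits[0, -1].argmax())
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # generated medical report token ids
```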
  • the encoder and the text generator trained by the above training method of the medical report generation model can be applied to medical images of the image type corresponding to the target image, and generate the corresponding medical report text.
  • At least one embodiment of the present disclosure also provides a training device for the medical report generation model.
  • the training device for the medical report generation model will be described below in conjunction with the accompanying drawings.
  • FIG. 6 is a schematic structural diagram of a training device for a medical report generation model provided by at least one embodiment of the present disclosure.
  • the training device of this medical report generation model includes:
  • the first input unit 601 is configured to input a source image into a first encoder to obtain a first image feature, and input the source image into a second encoder to obtain a second image feature; the source image corresponds to a medical text label;
  • the second input unit 602 is configured to input a target image into a third encoder to obtain a third image feature, and input the target image to the second encoder to obtain a fourth image feature;
  • the third input unit 603 is configured to input the second image feature into the text generator to obtain the first medical report text;
  • a fourth input unit 604 configured to input the fourth image feature into the text generator to obtain a second medical report text
  • the fifth input unit 605 is configured to input the first medical report text into the discriminator to obtain a first discriminant result
  • a sixth input unit 606, configured to input the second medical report text into the discriminator to obtain a second discriminant result
  • the first calculation unit 607 is configured to calculate the source image specificity loss according to the first image feature and the second image feature, and calculate the target image specificity loss according to the third image feature and the fourth image feature loss;
  • the second calculation unit 608 is configured to calculate a cross-entropy loss according to the first medical report text and the medical text label corresponding to the source image;
  • the third calculation unit 609 is configured to calculate a first adversarial loss according to the first discrimination result, and calculate a second adversarial loss and a third adversarial loss according to the second discrimination result;
  • Execution unit 610, configured to, according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss and the third adversarial loss, train the first encoder, the second encoder, the third encoder, the text generator and the discriminator, and repeat the step of inputting the source image into the first encoder and the subsequent steps until the preset condition is reached.
  • the device further includes:
  • a seventh input unit configured to input the first image feature and the second image feature into the first decoder to obtain a reconstructed source image
  • An eighth input unit configured to input the third image feature and the fourth image feature into a second decoder to obtain a reconstructed target image
  • a fourth calculation unit configured to calculate the perceptual loss of the source image according to the source image and the reconstructed source image, and calculate the perceptual loss of the target image according to the target image and the reconstructed target image;
  • the execution unit is specifically configured to, according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, the third adversarial loss, the source image perceptual loss and the target image perceptual loss, train the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder and the second decoder.
  • some of the target images correspond to medical text labels; the device further includes:
  • a first determination unit configured to determine a first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image;
  • a second determination unit configured to determine a second score according to the source image-specific loss and the target image-specific loss;
  • a fifth calculation unit configured to, if the target image corresponds to a medical text label, calculate a natural language evaluation index as a third score according to the second medical report text and the medical text label corresponding to the target image;
  • a summation unit configured to perform a weighted summation of the first score, the second score, and the third score to obtain a reward value;
  • a training unit configured to retrain the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder with the goal of maximizing the reward value.
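A minimal sketch of the reward computation follows. The weighted summation of the three scores is as described above, but the weight values themselves are hypothetical placeholders, since the disclosure does not specify them:

```python
def reward_value(first_score: float, second_score: float, third_score: float,
                 w1: float = 0.3, w2: float = 0.3, w3: float = 0.4) -> float:
    # Weighted summation of the three scores; the weight values here are
    # illustrative assumptions, not values given in the disclosure.
    return w1 * first_score + w2 * second_score + w3 * third_score
```

Retraining "with the goal of maximizing the reward value" could then, for example, minimize the negative reward-weighted log-likelihood of the generated report (a REINFORCE-style update), though the disclosure does not commit to any particular reinforcement-learning algorithm.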
  • the device further includes:
  • a seventh input unit configured to input a training image into a first image feature extraction network to obtain a fifth image feature, input the fifth image feature into a first classification network to obtain a first predicted classification result of the training image, and train the first image feature extraction network and the first classification network according to the first predicted classification result of the training image and the classification label corresponding to the training image;
  • an eighth input unit configured to input the training image into a second image feature extraction network to obtain a sixth image feature, input the sixth image feature into a second classification network to obtain a second predicted classification result of the training image, and train the second image feature extraction network and the second classification network according to the second predicted classification result of the training image and the classification label corresponding to the training image; the first image feature extraction network and the second image feature extraction network have different network structures;
  • a third determination unit configured to determine the model parameters of the trained first image feature extraction network as the initial model parameters of the first encoder and the third encoder, and determine the model parameters of the trained second image feature extraction network as the initial model parameters of the second encoder; the first image feature extraction network has the same network structure as the first encoder and the third encoder, and the second image feature extraction network has the same network structure as the second encoder.
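A minimal sketch of this classification-based pretraining and parameter hand-off is given below; the optimizer, learning rate, and the `load_state_dict` hand-off are illustrative assumptions (the disclosure only states that the pretrained parameters become the encoders' initial parameters):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def pretrain_feature_extractor(feature_net: nn.Module,
                               classifier: nn.Module,
                               loader: DataLoader,
                               epochs: int = 1) -> None:
    # Jointly train a feature extraction network and its classification head
    # on labelled training images with a cross-entropy objective.
    opt = torch.optim.Adam(
        list(feature_net.parameters()) + list(classifier.parameters()), lr=1e-4)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            logits = classifier(feature_net(images))
            loss = criterion(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Hand-off (hypothetical variable names): the pretrained parameters become the
# encoders' initial parameters, since the network structures match.
# first_encoder.load_state_dict(feature_net_1.state_dict())
# third_encoder.load_state_dict(feature_net_1.state_dict())
# second_encoder.load_state_dict(feature_net_2.state_dict())
```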
  • the device further includes:
  • An initialization unit configured to randomly initialize initial model parameters of the first encoder, the second encoder, and the third encoder.
  • the first discrimination result includes a first probability value indicating, for each word segment in the first medical report text, whether that word segment is generated from the source image;
  • the second discrimination result includes a second probability value indicating, for each word segment in the second medical report text, whether that word segment is generated from the source image;
  • the third calculation unit 609 is specifically configured to take the logarithms of the first probability values and then sum them to obtain a first summation result, and take the negative of the first summation result to obtain the first adversarial loss;
  • the third calculation unit 609 is specifically configured to take the logarithms of the second probability values and then sum them to obtain a second summation result, and take the negative of the second summation result to obtain the second adversarial loss;
  • the differences between 1 and the second probability values are calculated and then summed to obtain a third summation result, and the negative of the third summation result is taken to obtain the third adversarial loss.
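A minimal sketch of these three computations, taking the per-word-segment probabilities as tensors and following the description literally; the epsilon guard is an implementation detail added here for numerical stability, not part of the description:

```python
import torch

def adversarial_losses(first_probs: torch.Tensor, second_probs: torch.Tensor):
    eps = 1e-8  # numerical-stability guard (an added assumption)
    # First adversarial loss: negative sum of the logs of the first probabilities.
    first_loss = -torch.log(first_probs + eps).sum()
    # Second adversarial loss: negative sum of the logs of the second probabilities.
    second_loss = -torch.log(second_probs + eps).sum()
    # Third adversarial loss: negative sum of (1 - second probability), per the text.
    third_loss = -(1.0 - second_probs).sum()
    return first_loss, second_loss, third_loss
```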
  • the fourth calculation unit is specifically configured to input the source image into a third image feature extraction network to obtain the seventh image features output by the feature extraction layers of the third image feature extraction network;
  • the reconstructed source image is input into the third image feature extraction network to obtain the eighth image features output by the feature extraction layers; for each feature extraction layer, the source image loss corresponding to that layer is calculated according to the seventh image feature and the eighth image feature output by that layer and the weight corresponding to that layer;
  • the source image losses corresponding to the feature extraction layers are summed to obtain the source image perceptual loss;
  • the fourth calculation unit is specifically configured to input the target image into the third image feature extraction network to obtain the ninth image features output by the feature extraction layers, and to input the reconstructed target image into the third image feature extraction network to obtain the tenth image features output by the feature extraction layers; for each feature extraction layer, the target image loss corresponding to that layer is calculated according to the ninth image feature and the tenth image feature output by that layer and the weight corresponding to that layer;
  • the target image losses corresponding to the feature extraction layers are summed to obtain the target image perceptual loss.
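Here is a minimal sketch of this layer-weighted perceptual loss; the L1 distance between feature maps is an assumption, since the description specifies only that each layer's loss is computed from the two feature maps and the layer's weight:

```python
import torch

def perceptual_loss(real_feats, recon_feats, layer_weights):
    # real_feats / recon_feats: lists of per-layer feature maps from the
    # third image feature extraction network; layer_weights: one scalar
    # weight per feature extraction layer.
    total = torch.zeros(())
    for f_real, f_recon, w in zip(real_feats, recon_feats, layer_weights):
        total = total + w * (f_real - f_recon).abs().mean()
    return total
```

The same function serves for both the source image perceptual loss (source vs. reconstructed source features) and the target image perceptual loss (target vs. reconstructed target features).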
  • the first determination unit is specifically configured to input the source image into a third image feature extraction network, and obtain an eleventh image feature output by the third image feature extraction network;
  • the second determination unit is specifically configured to sum the source image-specific loss and the target image-specific loss to obtain a fifth summation result, and take the negative of the fifth summation result to obtain the second score.
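This negation is a one-liner; a sketch for completeness:

```python
def second_score(source_specific_loss: float, target_specific_loss: float) -> float:
    # Negative of the summed specificity losses: lower specificity losses
    # yield a higher second score.
    return -(source_specific_loss + target_specific_loss)
```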
  • At least one embodiment of the present disclosure further provides a device for generating a medical report.
  • the device for generating a medical report will be described below with reference to the accompanying drawings.
  • FIG. 7 is a schematic structural diagram of a medical report generating device provided by at least one embodiment of the present disclosure.
  • the medical report generation device includes:
  • an input unit 701 configured to input a medical image into an encoder to obtain medical image features;
  • a generation unit 702 configured to input the medical image features into a text generator to obtain a medical report text;
  • the encoder is a second encoder trained according to the method for training a medical report generation model described in any one of the above embodiments;
  • the text generator is a text generator trained according to the method for training a medical report generation model described in any one of the above embodiments.
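The inference path of this device is straightforward; a minimal sketch, assuming both modules are PyTorch modules trained as described above:

```python
import torch

@torch.no_grad()
def generate_report(medical_image: torch.Tensor,
                    second_encoder: torch.nn.Module,
                    text_generator: torch.nn.Module):
    # encoder -> medical image features -> text generator -> report text
    features = second_encoder(medical_image.unsqueeze(0))  # add batch dim
    return text_generator(features)
```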
  • At least one embodiment of the present disclosure further provides an electronic device, including: one or more processors; and a storage device on which one or more programs are stored, where, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method for training a medical report generation model according to any one of the above embodiments, or to implement the medical report generation method according to the above embodiments.
  • Referring to FIG. 8, it shows a schematic structural diagram of an electronic device 800 suitable for implementing the embodiments of the present disclosure.
  • the terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (portable Android devices, i.e., tablet computers), PMPs (portable media players), and vehicle-mounted terminals (such as vehicle-mounted navigation terminals), as well as fixed terminals such as digital TVs (television sets) and desktop computers.
  • the electronic device shown in FIG. 8 is only an example, and should not limit the functions and scope of use of the embodiments of the present disclosure.
  • the electronic device 800 may include a processing device 801 (such as a central processing unit or a graphics processing unit), which can perform various appropriate actions and processes according to a program stored in the read-only memory (ROM) 802 or a program loaded from the storage device 808 into the random access memory (RAM) 803. The RAM 803 also stores various programs and data necessary for the operation of the electronic device 800.
  • the processing device 801, ROM 802, and RAM 803 are connected to each other through a bus 804.
  • An input/output (I/O) interface 805 is also connected to the bus 804 .
  • the following devices can be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 807 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; storage devices 808 including, for example, a magnetic tape and a hard disk; and a communication device 809.
  • the communication means 809 may allow the electronic device 800 to communicate with other devices wirelessly or by wire to exchange data. While FIG. 8 shows electronic device 800 having various means, it is to be understood that implementing or having all of the means shown is not a requirement. More or fewer means may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer readable medium, where the computer program includes program code for executing the method shown in the flowchart.
  • the computer program may be downloaded and installed from a network via communication means 809, or from storage means 808, or from ROM 802.
  • when the computer program is executed by the processing device 801, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
  • the electronic device provided by the embodiments of the present disclosure belongs to the same inventive concept as the method for training a medical report generation model and the medical report generation method provided by the above embodiments; for technical details not described in detail in this embodiment, reference may be made to the above embodiments, and this embodiment has the same beneficial effects as the above embodiments.
  • At least one embodiment of the present disclosure provides a computer storage medium on which a computer program is stored, where, when the program is executed by a processor, the method for training a medical report generation model according to any one of the above embodiments, or the medical report generation method according to the above embodiments, is implemented.
  • the above-mentioned computer-readable medium in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
  • a computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted by any appropriate medium, including but not limited to wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • the client and the server can communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication (e.g., a communication network) in any form or medium. Examples of communication networks include local area networks ("LANs"), wide area networks ("WANs"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist independently without being incorporated into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the electronic device, the electronic device is made to execute the above-mentioned training method of the medical report generation model, or the medical report generation method.
  • computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments described in the present disclosure may be implemented by software or by hardware.
  • under certain circumstances, the name of a unit/module does not constitute a limitation on the unit itself; for example, a voice data collection module can also be described as a "data collection module".
  • For example, and without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer discs, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • Example 1 provides a method for training a medical report generation model, the method comprising:
  • the source image is input into the first encoder to obtain the first image feature, and the source image is input into the second encoder to obtain the second image feature; the source image corresponds to a medical text label;
  • according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss, training the first encoder, the second encoder, the third encoder, the text generator, and the discriminator, and repeating the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is reached.
  • Example 2 provides a method for training a medical report generation model, the method further comprising:
  • the training of the first encoder, the second encoder, the third encoder, the text generator, and the discriminator includes:
  • training the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, the third adversarial loss, the source image perceptual loss, and the target image perceptual loss.
  • Example 3 provides a method for training a medical report generation model, where some of the target images correspond to medical text labels; the method further includes:
  • the target image corresponds to a medical text label
  • Example 4 provides a method for training a medical report generation model, the method further comprising:
  • inputting the training image into the second image feature extraction network to obtain a sixth image feature;
  • inputting the sixth image feature into the second classification network to obtain a second predicted classification result of the training image;
  • training the second image feature extraction network and the second classification network according to the second predicted classification result of the training image and the classification label corresponding to the training image; the network structures of the first image feature extraction network and the second image feature extraction network are different;
  • the network structure of the first image feature extraction network is the same as that of the first encoder and the third encoder, and the network structure of the second image feature extraction network is the same as that of the second encoder.
  • Example 5 provides a method for training a medical report generation model, the method further comprising:
  • Example 6 provides a method for training a medical report generation model, where the first discrimination result includes a first probability value indicating whether each word segment in the first medical report text is generated from the source image, and the second discrimination result includes a second probability value indicating whether each word segment in the second medical report text is generated from the source image;
  • the calculating of the first adversarial loss according to the first discrimination result includes: taking the logarithms of the first probability values and then summing them to obtain a first summation result, and taking the negative of the first summation result to obtain the first adversarial loss;
  • the calculating of the second adversarial loss and the third adversarial loss according to the second discrimination result includes: taking the logarithms of the second probability values and then summing them to obtain a second summation result, and taking the negative of the second summation result to obtain the second adversarial loss;
  • the difference between 1 and the second probability value is calculated and then summed to obtain a third summation result, and the negative value of the third summation result is taken to obtain a third adversarial loss.
  • Example 7 provides a method for training a medical report generation model, where calculating the source image perceptual loss according to the source image and the reconstructed source image includes:
  • summing the source image losses corresponding to the feature extraction layers to obtain the source image perceptual loss;
  • the calculating of the target image perceptual loss according to the target image and the reconstructed target image includes:
  • calculating, according to the ninth image feature and the tenth image feature output by each feature extraction layer and the weight corresponding to that layer, the target image loss corresponding to that layer;
  • summing the target image losses corresponding to the feature extraction layers to obtain the target image perceptual loss.
  • Example 8 provides a method for training a medical report generation model, where determining the first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image includes:
  • Example 9 provides a method for training a medical report generation model, where determining the second score according to the source image-specific loss and the target image-specific loss includes:
  • Example 10 provides a medical report generation method, the method comprising:
  • the encoder is a second encoder trained according to the training method of the medical report generation model described in any one of the above embodiments;
  • the text generator is a text generator trained according to the training method of the medical report generation model described in any one of the above embodiments.
  • Example 11 provides a training device for a medical report generation model, the device comprising:
  • the first input unit is configured to input a source image into a first encoder to obtain a first image feature, and input the source image to a second encoder to obtain a second image feature; the source image corresponds to a medical text label;
  • the second input unit is configured to input the target image into the third encoder to obtain a third image feature, and input the target image to the second encoder to obtain a fourth image feature;
  • a third input unit configured to input the second image feature into the text generator to obtain a first medical report text;
  • a fourth input unit configured to input the fourth image feature into the text generator to obtain a second medical report text;
  • a fifth input unit configured to input the first medical report text into the discriminator to obtain a first discrimination result;
  • a sixth input unit configured to input the second medical report text into the discriminator to obtain a second discrimination result;
  • a first calculation unit configured to calculate a source image-specific loss based on the first image feature and the second image feature, and calculate a target image-specific loss based on the third image feature and the fourth image feature;
  • a second calculation unit configured to calculate a cross-entropy loss according to the first medical report text and the medical text label corresponding to the source image;
  • a third calculation unit configured to calculate a first adversarial loss according to the first discrimination result, and calculate a second adversarial loss and a third adversarial loss according to the second discrimination result;
  • an execution unit configured to train the first encoder, the second encoder, the third encoder, the text generator, and the discriminator according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss, and to repeat the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is reached.
  • Example 12 provides a training device for a medical report generation model, the device further comprising:
  • a seventh input unit configured to input the first image feature and the second image feature into the first decoder to obtain a reconstructed source image;
  • an eighth input unit configured to input the third image feature and the fourth image feature into a second decoder to obtain a reconstructed target image;
  • a fourth calculation unit configured to calculate the perceptual loss of the source image according to the source image and the reconstructed source image, and calculate the perceptual loss of the target image according to the target image and the reconstructed target image;
  • the execution unit is specifically configured to train the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, the third adversarial loss, the source image perceptual loss, and the target image perceptual loss.
  • Example 13 provides a training device for a medical report generation model, where some of the target images correspond to medical text labels; the device further includes:
  • a first determination unit configured to determine a first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image;
  • a second determination unit configured to determine a second score according to the source image-specific loss and the target image-specific loss;
  • a fifth calculation unit configured to, if the target image corresponds to a medical text label, calculate a natural language evaluation index as a third score according to the second medical report text and the medical text label corresponding to the target image;
  • a summation unit configured to perform a weighted summation of the first score, the second score, and the third score to obtain a reward value;
  • a training unit configured to retrain the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder with the goal of maximizing the reward value.
  • Example 14 provides a training device for a medical report generation model, the device further comprising:
  • a seventh input unit configured to input a training image into the first image feature extraction network to obtain a fifth image feature, input the fifth image feature into the first classification network to obtain a first predicted classification result of the training image, and train the first image feature extraction network and the first classification network according to the first predicted classification result of the training image and the classification label corresponding to the training image;
  • an eighth input unit configured to input the training image into the second image feature extraction network to obtain a sixth image feature, input the sixth image feature into the second classification network to obtain a second predicted classification result of the training image, and train the second image feature extraction network and the second classification network according to the second predicted classification result of the training image and the classification label corresponding to the training image; the first image feature extraction network and the second image feature extraction network have different network structures;
  • a third determination unit configured to determine the model parameters of the trained first image feature extraction network as the initial model parameters of the first encoder and the third encoder, and determine the model parameters of the trained second image feature extraction network as the initial model parameters of the second encoder; the first image feature extraction network has the same network structure as the first encoder and the third encoder, and the second image feature extraction network has the same network structure as the second encoder.
  • Example 15 provides a training device for a medical report generation model, the device further comprising:
  • An initialization unit configured to randomly initialize initial model parameters of the first encoder, the second encoder, and the third encoder.
  • Example 16 provides a training device for a medical report generation model, where the first discrimination result includes a first probability value indicating whether each word segment in the first medical report text is generated from the source image, and the second discrimination result includes a second probability value indicating whether each word segment in the second medical report text is generated from the source image;
  • the third calculation unit is specifically configured to sum the logarithms of the first probability values to obtain a first summation result, and take a negative value of the first summation result to obtain a first adversarial loss;
  • the third calculation unit is specifically configured to sum the second probability values after taking logarithms to obtain a second summation result, and take a negative value of the second summation result to obtain a second adversarial loss;
  • the difference between 1 and the second probability value is calculated and then summed to obtain a third summation result, and the negative value of the third summation result is taken to obtain a third adversarial loss.
  • Example 17 provides a training device for a medical report generation model, where the fourth calculation unit is specifically configured to input the source image into the third image feature extraction network to obtain the seventh image features output by the feature extraction layers of the third image feature extraction network;
  • the reconstructed source image is input into the third image feature extraction network to obtain the eighth image features output by the feature extraction layers; for each feature extraction layer, the source image loss corresponding to that layer is calculated according to the seventh image feature and the eighth image feature output by that layer and the weight corresponding to that layer;
  • the source image losses corresponding to the feature extraction layers are summed to obtain the source image perceptual loss;
  • the fourth calculation unit is specifically configured to input the target image into the third image feature extraction network to obtain the ninth image features output by the feature extraction layers, and to input the reconstructed target image into the third image feature extraction network to obtain the tenth image features; for each feature extraction layer, the target image loss corresponding to that layer is calculated according to the ninth image feature and the tenth image feature output by that layer and the weight corresponding to that layer;
  • the target image losses corresponding to the feature extraction layers are summed to obtain the target image perceptual loss.
  • Example 18 provides a training device for a medical report generation model, where the first determination unit is specifically configured to input the source image into a third image feature extraction network to obtain the eleventh image feature output by the third image feature extraction network;
  • Example 19 provides a training device for a medical report generation model, where
  • the second determination unit is specifically configured to sum the source image-specific loss and the target image-specific loss to obtain a fifth summation result, and take the negative of the fifth summation result to obtain the second score.
  • Example 20 provides a medical report generating device, the device comprising:
  • an input unit configured to input a medical image into the encoder to obtain medical image features;
  • a generation unit configured to input the medical image features into a text generator to obtain a medical report text;
  • the encoder is a second encoder trained according to the training method of the medical report generation model described in any one of the above embodiments;
  • the text generator is a text generator trained according to the training method of the medical report generation model described in any one of the above embodiments.
  • Example 21 provides an electronic device, including:
  • one or more processors;
  • a storage device on which one or more programs are stored, where, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method for training a medical report generation model according to any one of the above embodiments, or to implement the medical report generation method according to the above embodiments.
  • Example 22 provides a computer-readable medium on which a computer program is stored, where, when the program is executed by a processor, the method for training a medical report generation model according to any one of the above embodiments, or the medical report generation method according to the above embodiments, is implemented.
  • each embodiment in this specification is described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar between embodiments, reference may be made from one to another.
  • since the system or device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief; for relevant details, refer to the description of the method part.
  • "At least one (item)" means one or more, and "plurality" means two or more.
  • "And/or" describes an association relationship between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural.
  • the character "/" generally indicates that the objects before and after it are in an "or" relationship.
  • "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items.
  • "At least one item (piece) of a, b, or c" can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can each be single or multiple.
  • a software module may reside in random access memory (RAM), read-only memory (ROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Abstract

A medical report generation method and apparatus, a model training method and apparatus, and a device. The method comprises: extracting image features from a source image and a target image, respectively, and using the image features to obtain a corresponding first medical report text and second medical report text; then using a discriminator to obtain a first discrimination result and a second discrimination result corresponding to the first medical report text and the second medical report text, respectively; and finally calculating a source image-specific loss, a target image-specific loss, a cross-entropy loss, a first adversarial loss, a second adversarial loss, and a third adversarial loss, and training a medical report generation model using the calculated losses. The trained medical report generation model can apply knowledge learned from the domain of source images having many labels to the domains of other types of medical images, thereby automatically generating medical report texts for medical images having few or no labels.

Description

Medical Report Generation Method, Model Training Method, Apparatus and Device

This application claims priority to Chinese Patent Application No. 202111013687.9, filed on August 31, 2021, the entire disclosure of which is incorporated herein by reference as a part of this application.
Technical Field
Embodiments of the present disclosure relate to a medical report generation method, a model training method, an apparatus, and a device.
Background
A medical image is an image of the internal tissue of the human body, or of a part of the human body, and can help doctors understand a patient's state of health. A medical image has a corresponding medical report, which contains the results of analyzing that medical image. For example, a medical report may record the location of the patient's disease, the extent of the lesion, and the affected organs, as determined from the medical image.

At present, it is difficult to automatically generate a corresponding medical report for a medical image. How to automatically generate medical reports based on medical images is a problem that needs to be solved.
Summary
Embodiments of the present disclosure provide a medical report generation method, a model training method, an apparatus, and a device, which can automatically generate a medical report from a medical image.
In a first aspect, embodiments of the present disclosure provide a method for training a medical report generation model, the method including:

inputting a source image into a first encoder to obtain a first image feature, and inputting the source image into a second encoder to obtain a second image feature, the source image corresponding to a medical text label;

inputting a target image into a third encoder to obtain a third image feature, and inputting the target image into the second encoder to obtain a fourth image feature;

inputting the second image feature into a text generator to obtain a first medical report text;

inputting the fourth image feature into the text generator to obtain a second medical report text;

inputting the first medical report text into a discriminator to obtain a first discrimination result;

inputting the second medical report text into the discriminator to obtain a second discrimination result;

calculating a source image-specific loss according to the first image feature and the second image feature, and calculating a target image-specific loss according to the third image feature and the fourth image feature;

calculating a cross-entropy loss according to the first medical report text and the medical text label corresponding to the source image;

calculating a first adversarial loss according to the first discrimination result, and calculating a second adversarial loss and a third adversarial loss according to the second discrimination result; and

training the first encoder, the second encoder, the third encoder, the text generator, and the discriminator according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss, and repeating the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is reached.
In a second aspect, embodiments of the present disclosure provide a medical report generation method, the method including:

inputting a medical image into an encoder to obtain medical image features; and

inputting the medical image features into a text generator to obtain a medical report text;

where the encoder is a second encoder trained by the method for training a medical report generation model according to any one of the above embodiments, and the text generator is a text generator trained by the method for training a medical report generation model according to any one of the above embodiments.
In a third aspect, embodiments of the present disclosure provide a training apparatus for a medical report generation model, the apparatus including:

a first input unit configured to input a source image into a first encoder to obtain a first image feature, and input the source image into a second encoder to obtain a second image feature, the source image corresponding to a medical text label;

a second input unit configured to input a target image into a third encoder to obtain a third image feature, and input the target image into the second encoder to obtain a fourth image feature;

a third input unit configured to input the second image feature into a text generator to obtain a first medical report text;

a fourth input unit configured to input the fourth image feature into the text generator to obtain a second medical report text;

a fifth input unit configured to input the first medical report text into a discriminator to obtain a first discrimination result;

a sixth input unit configured to input the second medical report text into the discriminator to obtain a second discrimination result;

a first calculation unit configured to calculate a source image-specific loss according to the first image feature and the second image feature, and calculate a target image-specific loss according to the third image feature and the fourth image feature;

a second calculation unit configured to calculate a cross-entropy loss according to the first medical report text and the medical text label corresponding to the source image;

a third calculation unit configured to calculate a first adversarial loss according to the first discrimination result, and calculate a second adversarial loss and a third adversarial loss according to the second discrimination result; and

an execution unit configured to train the first encoder, the second encoder, the third encoder, the text generator, and the discriminator according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss, and to repeat the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is reached.
In a fourth aspect, embodiments of the present disclosure provide a medical report generation apparatus, the apparatus including:

an input unit configured to input a medical image into an encoder to obtain medical image features; and

a generation unit configured to input the medical image features into a text generator to obtain a medical report text;

where the encoder is a second encoder trained by the method for training a medical report generation model according to any one of the above embodiments, and the text generator is a text generator trained by the method for training a medical report generation model according to any one of the above embodiments.
In a fifth aspect, embodiments of the present disclosure provide an electronic device, including:

one or more processors; and

a storage device on which one or more programs are stored,

where, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method for training a medical report generation model according to any one of the above embodiments, or to implement the medical report generation method according to the above embodiments.

In a sixth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, where, when the program is executed by a processor, the method for training a medical report generation model according to any one of the above embodiments, or the medical report generation method according to the above embodiments, is implemented.
It can thus be seen that the embodiments of the present disclosure have the following beneficial effects:

Embodiments of the present disclosure provide a method for training a medical report generation model and a medical report generation method. Image features are extracted from a source image and a target image respectively, and the image features are used to obtain a corresponding first medical report text and second medical report text. A discriminator is then used to obtain a first discrimination result and a second discrimination result corresponding to the first medical report text and the second medical report text, respectively. Finally, the image features are used to calculate a source image-specific loss and a target image-specific loss; a cross-entropy loss is calculated from the first medical report text and the medical text label corresponding to the source image; a first adversarial loss is calculated from the first discrimination result; and a second adversarial loss and a third adversarial loss are calculated from the second discrimination result. The first encoder, the second encoder, the third encoder, the text generator, and the discriminator are trained according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss, and the above training steps are repeated until a preset condition is reached, yielding an encoder and a text generator for medical report generation. A medical image is input into the encoder to obtain medical image features, and the medical image features are then input into the text generator to obtain a medical report text.

In this way, an encoder and a text generator that generate medical report texts for medical images of the target image's type are trained using source images having many medical report text labels and target images having few or no medical report text labels. From the source images and target images, domain-invariant features can be learned, so that knowledge learned in the domain of richly labelled source images can be applied to the domains of other types of medical images, enabling automatic generation of medical report texts for medical images with few or no labels.
Brief Description of the Drawings

FIG. 1 is a schematic framework diagram of an exemplary application scenario provided by at least one embodiment of the present disclosure;

FIG. 2 is a flowchart of a method for training a medical report generation model provided by at least one embodiment of the present disclosure;

FIG. 3 is a schematic diagram of a medical report generation model provided by at least one embodiment of the present disclosure;

FIG. 4 is a schematic diagram of another medical report generation model provided by at least one embodiment of the present disclosure;

FIG. 5 is a flowchart of a medical report generation method provided by at least one embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of a training apparatus for a medical report generation model provided by at least one embodiment of the present disclosure;

FIG. 7 is a schematic structural diagram of a medical report generation apparatus provided by at least one embodiment of the present disclosure; and

FIG. 8 is a schematic diagram of the basic structure of an electronic device provided by at least one embodiment of the present disclosure.
Detailed Description

To make the above objects, features, and advantages of the present disclosure more apparent and easier to understand, the embodiments of the present application are described in further detail below with reference to the accompanying drawings and specific implementations.

To facilitate understanding of the technical solution provided by the present disclosure, the background relevant to the present disclosure is first described below.

A study of traditional medical report text generation methods shows that, at present, labelled medical images are used as training data, and a model for generating medical report texts is obtained by training on that data. However, generating labels for medical images is difficult, and medical images that have many labels are of a fairly narrow range of image types. At present, medical images with many labels are essentially chest radiographs, and it is difficult to obtain models that generate medical report texts for other types of medical images.

Based on this, embodiments of the present disclosure provide a method for training a medical report generation model and a medical report generation method. Image features are extracted from a source image and a target image respectively, and the image features are used to obtain a corresponding first medical report text and second medical report text; a discriminator is then used to obtain a first discrimination result and a second discrimination result corresponding to the first medical report text and the second medical report text, respectively; finally, a source image-specific loss and a target image-specific loss are calculated from the image features, a cross-entropy loss is calculated from the first medical report text and the medical text label corresponding to the source image, a first adversarial loss is calculated from the first discrimination result, and a second adversarial loss and a third adversarial loss are calculated from the second discrimination result. The first encoder, the second encoder, the third encoder, the text generator, and the discriminator are trained according to these six losses, and the above training steps are repeated until a preset condition is reached, yielding an encoder and a text generator for medical report generation. A medical image is input into the encoder to obtain medical image features, which are then input into the text generator to obtain a medical report text.

To facilitate understanding of the medical report generation method provided by the embodiments of the present disclosure, the following description refers to the scenario example shown in FIG. 1. Referring to FIG. 1, the figure is a schematic framework diagram of an exemplary application scenario provided by at least one embodiment of the present disclosure.

In practical applications, a medical image 101 is input into a trained encoder 102 to obtain medical image features 103 corresponding to the medical image 101, and the medical image features 103 are then input into a trained text generator 104 to obtain a medical report text 105 output by the text generator 104.

Those skilled in the art can understand that the framework diagram shown in FIG. 1 is only one example in which the embodiments of the present disclosure can be implemented. The scope of applicability of the embodiments of the present disclosure is not limited by any aspect of this framework.

Based on the above description, the method for training a medical report generation model provided by the present disclosure is described in detail below with reference to the accompanying drawings.
参见图2所示,该图为本公开至少一实施例提供的一种医学报告生成模型的训练方法的流程图,该方法包括步骤S201-S210。Referring to FIG. 2 , which is a flowchart of a method for training a medical report generation model provided by at least one embodiment of the present disclosure, the method includes steps S201-S210.
S201:将源图像输入第一编码器,得到第一图像特征,将源图像输入第二编码器,得到第二图像特征;源图像对应有医学文本标签。S201: Input a source image into a first encoder to obtain a first image feature, and input the source image into a second encoder to obtain a second image feature; the source image corresponds to a medical text label.
Referring to FIG. 3, which is a schematic diagram of a medical report generation model method provided by at least one embodiment of the present disclosure.
The source image is a medical image with a corresponding medical text label. The medical text label refers to the medical report text corresponding to the medical image, for example a test report text. In a possible implementation manner, the source image may be a chest radiograph image from MIMIC-CXR (a data set).
The first encoder is used to extract image features specific to the source image, that is, image features belonging to the source domain. The source image is input into the first encoder to obtain the first image feature output by the first encoder.
The second encoder is an encoder shared by the source image and the target image, and is used to extract features that are similar between the source domain and the target domain in the feature dimension of the hidden layer, that is, features common to the source domain and the target domain. The source image is input into the second encoder to obtain the second image feature output by the second encoder.
The first encoder and the second encoder may each be composed of four convolutional layers.
In a possible implementation manner, the second encoder may adopt Inception-v3 (a neural network), and the first encoder may adopt ResNet (Deep residual network).
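For illustration only, the encoder setup described above might be instantiated as in the following minimal sketch, assuming PyTorch and torchvision; the ResNet depth (resnet18), the 512-dimensional feature size, and all variable names are assumptions rather than the disclosed implementation.

```python
import torch
import torchvision.models as models

# Hypothetical instantiation: ResNet backbones for the two domain-private
# encoders and Inception-v3 for the shared encoder, as described above.
first_encoder = models.resnet18(num_classes=512)     # private to the source domain
third_encoder = models.resnet18(num_classes=512)     # private to the target domain
second_encoder = models.inception_v3(num_classes=512, aux_logits=False,
                                     init_weights=True)  # shared by both domains

x_s = torch.randn(2, 3, 299, 299)  # stand-in batch of source images
first_encoder.eval()
second_encoder.eval()
with torch.no_grad():
    h_s_private = first_encoder(x_s)   # first image feature, shape (2, 512)
    h_s_shared = second_encoder(x_s)   # second image feature, shape (2, 512)
```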
S202: Input the target image into a third encoder to obtain a third image feature, and input the target image into the second encoder to obtain a fourth image feature.
The target image is a medical image belonging to an image type other than the medical image type to which the source image belongs. The target image may include medical images of one or more image types, and the target image includes the image type of the medical images for which medical report text needs to be generated. For example, when it is necessary to generate medical report text for medical images produced by an endoscope, the target image includes medical images produced by an endoscope. In addition, the target image may also include medical images of other image types such as CT (Computed Tomography) images.
The target image may include medical images without labels, or may include medical images with corresponding labels. The label of a target image may be a manually annotated medical report text, or may be descriptive text related to the target image in the literature, articles, or other texts to which the target image belongs.
The target image is input into the second encoder to obtain the fourth image feature output by the second encoder.
The third encoder is used to extract image features specific to the target image, that is, image features belonging to the target domain. The target image is input into the third encoder to obtain the third image feature output by the third encoder.
The third encoder may be composed of four convolutional layers. The third encoder may adopt ResNet (Deep residual network).
S203: Input the second image feature into a text generator to obtain a first medical report text.
The text generator is used to generate the corresponding medical report text according to the image features of the input medical image. The text generator may be composed of a bidirectional two-layer LSTM (Long Short-Term Memory artificial neural network).
The second image feature is input into the text generator to obtain the first medical report text output by the text generator.
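As a concrete illustration of such a generator, the following is a minimal PyTorch sketch; seeding the LSTM state from the image feature, the hidden size, and the vocabulary size are assumptions, not the disclosed design.

```python
import torch
import torch.nn as nn

class TextGenerator(nn.Module):
    """Bidirectional two-layer LSTM that decodes report tokens from an image feature."""
    def __init__(self, feat_dim=512, vocab_size=5000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        # 2 layers x 2 directions = 4 initial hidden states (an assumed design choice).
        self.init_h = nn.Linear(feat_dim, 4 * hidden)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, image_feature, tokens):
        b = image_feature.size(0)
        h0 = self.init_h(image_feature).view(b, 4, -1).permute(1, 0, 2).contiguous()
        c0 = torch.zeros_like(h0)
        seq, _ = self.lstm(self.embed(tokens), (h0, c0))  # (b, T, 2*hidden)
        return self.out(seq)                              # per-token vocabulary logits

logits = TextGenerator()(torch.randn(2, 512), torch.randint(0, 5000, (2, 20)))
```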
S204: Input the fourth image feature into the text generator to obtain a second medical report text.
The fourth image feature is input into the above text generator to obtain the second medical report text output by the text generator.
S205: Input the first medical report text into a discriminator to obtain a first discrimination result.
The discriminator is used to determine the domain to which the input medical report text belongs, that is, to determine whether the input medical report text belongs to the source domain or the target domain. The discriminator may be composed of a CNN (Convolutional Neural Network) with two convolutional layers and one fully connected layer.
The first medical report text is input into the discriminator to obtain the first discrimination result of the discriminator for the first medical report text.
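A sketch of a discriminator with two convolutional layers and one fully connected layer, assuming PyTorch and that the report is represented as a sequence of token embeddings; the embedding dimension and channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class ReportDiscriminator(nn.Module):
    """Two Conv1d layers plus one linear layer over token embeddings."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(emb_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(64, 1)

    def forward(self, report_emb):                        # (b, T, emb_dim)
        h = self.conv(report_emb.transpose(1, 2))         # (b, 64, T)
        logits = self.fc(h.transpose(1, 2)).squeeze(-1)   # (b, T)
        return torch.sigmoid(logits)  # D(y): per-token probability of the source domain

d_ys = ReportDiscriminator()(torch.randn(2, 20, 256))
```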
S206: Input the second medical report text into the discriminator to obtain a second discrimination result.
The second medical report text is input into the discriminator to obtain the second discrimination result of the discriminator for the second medical report text.
The discriminator enables adversarial training, so that the second encoder can narrow the difference between the first medical report text and the second medical report text, mapping features from different domains into the same domain and achieving feature-level alignment.
S207: Calculate a source image specificity loss according to the first image feature and the second image feature, and calculate a target image specificity loss according to the third image feature and the fourth image feature.
The first image feature and the second image feature are obtained by different encoders performing feature extraction on the source image. According to the first image feature and the second image feature, the source image specificity loss can be calculated. The source image specificity loss is used to measure the gap between the first image feature and the second image feature.
The source image specificity loss can be expressed by the following formula:
L_sdist = ||(h_s^c)^T h_s^p||_F^2                                        (1)
where h_s^c is the second image feature, h_s^p is the first image feature, and ||·||_F is the Frobenius norm.
Similarly, the third image feature and the fourth image feature are obtained by different encoders performing feature extraction on the target image. According to the third image feature and the fourth image feature, the target image specificity loss can be calculated. The target image specificity loss is used to measure the gap between the third image feature and the fourth image feature.
The target image specificity loss can be expressed by the following formula:
L_tdist = ||(h_t^c)^T h_t^p||_F^2                                        (2)
where h_t^c is the fourth image feature and h_t^p is the third image feature.
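Under the reconstruction of formulas (1) and (2) given above (an orthogonality-style difference loss; the exact form is an assumption drawn from those formulas), both specificity losses can be computed in a few lines. A PyTorch sketch:

```python
import torch

def difference_loss(shared: torch.Tensor, private: torch.Tensor) -> torch.Tensor:
    """||shared^T @ private||_F^2: pushes shared and private features apart."""
    return (shared.t() @ private).pow(2).sum()

# Random stand-ins for the encoder outputs of S201/S202.
h_s_shared, h_s_private = torch.randn(2, 512), torch.randn(2, 512)
h_t_shared, h_t_private = torch.randn(2, 512), torch.randn(2, 512)
l_sdist = difference_loss(h_s_shared, h_s_private)  # source image specificity loss, formula (1)
l_tdist = difference_loss(h_t_shared, h_t_private)  # target image specificity loss, formula (2)
```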
S208: Calculate a cross-entropy loss according to the first medical report text and the medical text label corresponding to the source image.
The source image has a corresponding medical text label. The cross-entropy loss is calculated according to the first medical report text and the medical text label corresponding to the source image. The cross-entropy loss is used to measure the gap between the first medical report text and the medical text label.
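As an illustration, the token-level cross-entropy between the generator's logits and the label tokens might be computed as below (a sketch; the shapes and the padding index are assumptions):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 60, 5000)            # (batch, report length, vocabulary)
labels = torch.randint(1, 5000, (2, 60))     # token ids of the medical text label
l_ce = F.cross_entropy(logits.reshape(-1, 5000), labels.reshape(-1),
                       ignore_index=0)       # 0 is an assumed padding token id
```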
S209: Calculate a first adversarial loss according to the first discrimination result, and calculate a second adversarial loss and a third adversarial loss according to the second discrimination result.
The first adversarial loss is calculated according to the first discrimination result output by the discriminator, and the second adversarial loss and the third adversarial loss are calculated according to the second discrimination result. The first adversarial loss, the second adversarial loss and the third adversarial loss measure whether the generated reports are judged to belong to the corresponding domain.
In a possible implementation manner, at least one embodiment of the present disclosure provides a specific implementation of calculating the first adversarial loss according to the first discrimination result, and a specific implementation of calculating the second adversarial loss and the third adversarial loss according to the second discrimination result; for details, see below.
S210: Train the first encoder, the second encoder, the third encoder, the text generator and the discriminator according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss and the third adversarial loss, and repeat the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is reached.
Based on the obtained source image specificity loss, target image specificity loss, cross-entropy loss, first adversarial loss, second adversarial loss and third adversarial loss, the first encoder, the second encoder, the third encoder, the text generator and the discriminator are trained.
Based on the source image specificity loss, the first encoder and the second encoder can learn different image features about the source image. Based on the target image specificity loss, the second encoder and the third encoder can learn different image features about the target image. Using the cross-entropy loss, the text generator can be trained to generate a more accurate first medical report text. Using the first adversarial loss, the second adversarial loss and the third adversarial loss, the domain-invariant features of the target domain and the source domain can be made as close as possible.
In a possible implementation manner, at least one embodiment of the present disclosure provides a specific implementation of training the first encoder, the second encoder, the third encoder, the text generator and the discriminator according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss and the third adversarial loss; for details, see below.
After one round of training of the first encoder, the second encoder, the third encoder, the text generator and the discriminator is completed, the above steps S201-S210 are repeated until the preset condition is reached. The preset condition is the condition for completing training. The preset condition may be, for example, a number of training iterations, or a numerical condition to be satisfied by the loss function.
Based on the above contents of S201-S210, the second encoder and the text generator obtained through adversarial training based on domain-invariant features can generate corresponding medical report text for medical images of the image types to which the target image belongs. In this way, medical report text can be generated for medical images of image types lacking labels, expanding the range of medical image types for which medical report text can be generated. Moreover, the discriminator makes it possible to map data sources from different domains into the same domain and achieve feature-level alignment, so that the encoder and text generator obtained after training can generate more accurate medical report text corresponding to medical images.
In a possible implementation manner, the discriminator is used to determine the probability of the image to which a medical report text corresponds. The first medical report text is input into the discriminator, and the first discrimination result output by the discriminator includes a first probability value that each word segment in the first medical report text was generated from the source image. The second medical report text is input into the discriminator, and the second discrimination result output by the discriminator includes a second probability value that each word segment in the second medical report text was generated from the source image. The first probability value may be expressed as D(y_s), where y_s denotes the first medical report text. The second probability value may be expressed as D(y_t), where y_t denotes the second medical report text. The first probability value and the second probability value range from 0 to 1, where a value closer to 1 indicates a higher probability of being generated from the source image, and a value closer to 0 indicates a lower probability of being generated from the source image.
Correspondingly, at least one embodiment of the present disclosure provides a method for calculating the first adversarial loss according to the first discrimination result, which specifically includes:
taking the logarithms of the first probability values and summing them to obtain a first summation result, and taking the negative of the first summation result to obtain the first adversarial loss.
The logarithms of the first probability values are summed to obtain the first summation result. The first summation result may be expressed as Σ log[D(y_s)].
The negative of the first summation result is then calculated to obtain the first adversarial loss.
The first adversarial loss can be expressed by formula (3):
L_adv1(y_s) = -Σ log[D(y_s)]                                        (3)
At least one embodiment of the present disclosure provides a method for calculating the second adversarial loss and the third adversarial loss according to the second discrimination result, including:
taking the logarithms of the second probability values and summing them to obtain a second summation result, and taking the negative of the second summation result to obtain the second adversarial loss;
calculating the differences between 1 and the second probability values and summing them to obtain a third summation result, and taking the negative of the third summation result to obtain the third adversarial loss.
The logarithms of the second probability values are summed to obtain the second summation result. The second summation result may be expressed as Σ log[D(y_t)].
The negative of the second summation result is then calculated to obtain the second adversarial loss.
The second adversarial loss can be expressed by formula (4):
L_adv2(y_t) = -Σ log[D(y_t)]                                        (4)
The differences between 1 and each of the second probability values are calculated and summed to obtain the third summation result, and the negative of the third summation result is taken to obtain the third adversarial loss.
The third adversarial loss can be expressed by formula (5):
L_adv3(y_t) = -Σ [1 - D(y_t)]                                      (5)
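Following formulas (3), (4) and (5), the three adversarial losses can be computed directly from the discriminator's per-token probabilities; in the sketch below, the small epsilon for numerical stability is an addition, not part of the formulas:

```python
import torch

def adversarial_losses(d_ys, d_yt, eps=1e-8):
    """d_ys, d_yt: probabilities D(y_s) and D(y_t) for each token of the two reports."""
    l_adv1 = -torch.log(d_ys + eps).sum()   # formula (3)
    l_adv2 = -torch.log(d_yt + eps).sum()   # formula (4)
    l_adv3 = -(1.0 - d_yt).sum()            # formula (5): no logarithm, per the text above
    return l_adv1, l_adv2, l_adv3

l_adv1, l_adv2, l_adv3 = adversarial_losses(torch.rand(2, 20), torch.rand(2, 20))
```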
In a possible implementation manner, for the case where the target image lacks a label, the model can be optimized by means of image reconstruction. At least one embodiment of the present disclosure provides a method for training a medical report generation model which, in addition to the above steps S201-S210, further includes the following three steps.
Referring to FIG. 4, which is a schematic diagram of another medical report generation model method provided by at least one embodiment of the present disclosure.
A1: Input the first image feature and the second image feature into a first decoder to obtain a reconstructed source image.
The first decoder is used to generate the reconstructed source image according to the domain-invariant features and the specific features of the input source image. The first image feature and the second image feature are input into the first decoder to obtain the reconstructed source image.
The first decoder may be composed of four convolutional layers.
A2: Input the third image feature and the fourth image feature into a second decoder to obtain a reconstructed target image.
The second decoder is used to generate the reconstructed target image according to the domain-invariant features and the specific features of the input target image. The third image feature and the fourth image feature are input into the second decoder to obtain the reconstructed target image.
The second decoder may be composed of four convolutional layers. In the embodiments of the present disclosure, the encoders and decoders adopt an autoencoder structure.
A3: Calculate a source image perceptual loss according to the source image and the reconstructed source image, and calculate a target image perceptual loss according to the target image and the reconstructed target image.
The source image perceptual loss is calculated according to the source image and the reconstructed source image. The source image perceptual loss is used to measure the gap between the source image and the reconstructed source image.
The target image perceptual loss is calculated according to the target image and the reconstructed target image. The target image perceptual loss is used to measure the gap between the target image and the reconstructed target image.
In a possible implementation manner, at least one embodiment of the present disclosure provides a specific implementation of calculating the source image perceptual loss according to the source image and the reconstructed source image, and a specific implementation of calculating the target image perceptual loss according to the target image and the reconstructed target image; for details, see below.
Correspondingly, at least one embodiment of the present disclosure provides a specific implementation of training the first encoder, the second encoder, the third encoder, the text generator and the discriminator according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss and the third adversarial loss, which specifically includes:
training the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder and the second decoder according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, the third adversarial loss, the source image perceptual loss and the target image perceptual loss.
After the source image perceptual loss and the target image perceptual loss are obtained, the model can also be optimized according to the source image perceptual loss and the target image perceptual loss, reducing the gap between the source image and the reconstructed source image and between the target image and the reconstructed target image, and improving the accuracy with which the model extracts image features from the source image and the target image.
In a possible implementation manner, a total loss can be calculated according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, the third adversarial loss, the source image perceptual loss and the target image perceptual loss. The total loss can be expressed by the following formula:
L = L_difference + L_rec + L_ce + λ_adv1 L_adv1(y_s) + λ_adv2 L_adv2(y_t) + λ_adv3 L_adv3(y_t)    (6)
where L_difference denotes the sum of the source image specificity loss and the target image specificity loss, L_ce denotes the cross-entropy loss, and L_rec denotes the sum of the source image perceptual loss and the target image perceptual loss. L_adv1(y_s) denotes the first adversarial loss, and λ_adv1 is the weight corresponding to the first adversarial loss. L_adv2(y_t) denotes the second adversarial loss, and λ_adv2 is the weight corresponding to the second adversarial loss. L_adv3(y_t) denotes the third adversarial loss, and λ_adv3 is the weight corresponding to the third adversarial loss.
L_difference can be expressed by the following formula:
L_difference = L_sdist + L_tdist                                          (7)
where L_sdist denotes the source image specificity loss and L_tdist denotes the target image specificity loss.
L_rec can be expressed by the following formula:
L_rec = L_perc(x_s, x_srec; w) + L_perc(x_t, x_trec; w)                            (8)
where L_perc(x_s, x_srec; w) denotes the source image perceptual loss and L_perc(x_t, x_trec; w) denotes the target image perceptual loss.
After the total loss is obtained, the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder and the second decoder can be trained with a minimax optimization objective on the total loss, in which the discriminator is trained to maximize the adversarial terms while the other modules are trained to minimize the total loss.
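Assembled in code, one optimization step on the total loss of formula (6) could look like the following sketch; the loss terms stand in for the values computed in the earlier sketches, and the unit weights are assumptions:

```python
import torch

# Stand-ins for the loss terms computed above (formulas (1)-(5) and (9)-(10)).
l_sdist, l_tdist = torch.tensor(1.0, requires_grad=True), torch.tensor(1.0, requires_grad=True)
l_perc_src, l_perc_tgt = torch.tensor(1.0, requires_grad=True), torch.tensor(1.0, requires_grad=True)
l_ce = torch.tensor(1.0, requires_grad=True)
l_adv1, l_adv2, l_adv3 = (torch.tensor(1.0, requires_grad=True) for _ in range(3))
lambda_adv1 = lambda_adv2 = lambda_adv3 = 1.0   # assumed weights

l_difference = l_sdist + l_tdist                # formula (7)
l_rec = l_perc_src + l_perc_tgt                 # formula (8)
total = (l_difference + l_rec + l_ce
         + lambda_adv1 * l_adv1
         + lambda_adv2 * l_adv2
         + lambda_adv3 * l_adv3)                # formula (6)
total.backward()  # gradients then drive the optimizer steps of each module
```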
Based on the above, by means of image reconstruction, the encoders can be optimized even when the target image has no label, so that the encoders extract more accurate image features and the accuracy of the trained model is improved.
In a possible implementation manner, at least one embodiment of the present disclosure provides a method for calculating the source image perceptual loss according to the source image and the reconstructed source image, including the following four steps:
B1: Input the source image into a third image feature extraction network, and obtain the seventh image features output by each feature extraction layer of the third image feature extraction network.
The third image feature extraction network is used to extract the image features of an image. The source image is input into the third image feature extraction network to obtain the seventh image features output by each feature extraction layer of the third image feature extraction network.
The third image feature extraction network may be VGG Net (a deep convolutional neural network). VGG Net may be pre-trained. The source image is input into VGG Net to obtain the seventh image feature φ^(l)(x_s), where x_s denotes the source image and l denotes the l-th feature extraction layer in VGG Net. l is a positive integer greater than or equal to 1 and less than or equal to L, where L is the total number of feature extraction layers of VGG Net.
B2: Input the reconstructed source image into the third image feature extraction network, and obtain the eighth image features output by each feature extraction layer of the third image feature extraction network.
The third image feature extraction network is used to extract the image features of the reconstructed source image, obtaining the eighth image features output by each feature extraction layer of the third image feature extraction network.
Taking the above third image feature extraction network being VGG Net as an example, the eighth image feature can be expressed as φ^(l)(x_srec), where x_srec denotes the reconstructed source image.
B3: Calculate the source image loss corresponding to each feature extraction layer according to the seventh image feature and the eighth image feature output by that feature extraction layer and the weight corresponding to that feature extraction layer.
Each feature extraction layer in the third image feature extraction network has a corresponding weight. According to the weight of each feature extraction layer, the seventh image feature output by that layer and the eighth image feature output by that layer, the source image loss corresponding to that feature extraction layer is calculated.
In a possible implementation manner, the difference between the seventh image feature and the eighth image feature can be calculated first, then the L1 norm of the resulting difference is calculated, and finally the L1 norm of the difference is multiplied by the weight to obtain the source image loss corresponding to that feature extraction layer.
B4: Sum the source image losses corresponding to the feature extraction layers to obtain the source image perceptual loss.
The sum of the source image losses of the feature extraction layers is calculated to obtain the source image perceptual loss.
The source image perceptual loss L_perc(x_s, x_srec; w) can be expressed by the following formula:
L_perc(x_s, x_srec; w) = Σ_{l=1}^{L} (w^(l) / N^(l)) ||φ^(l)(x_s) - φ^(l)(x_srec)||_1    (9)
where w^(l) denotes the weight of the l-th feature extraction layer, N^(l) denotes the number of elements in the feature output by the l-th feature extraction layer, and ||·||_1 denotes the L1 norm.
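A sketch of the layer-weighted perceptual loss of formula (9), assuming PyTorch and a recent torchvision; the tapped VGG-16 layer indices and the uniform per-layer weights are assumptions:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class PerceptualLoss(nn.Module):
    """Sum over tapped VGG layers of (w_l / N_l) * ||phi_l(x) - phi_l(x_rec)||_1."""
    def __init__(self, taps=(3, 8, 15, 22), weights=(1.0, 1.0, 1.0, 1.0)):
        super().__init__()
        self.vgg = models.vgg16(weights=None).features.eval()  # pre-trained weights could be loaded here
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.taps, self.weights = set(taps), dict(zip(taps, weights))

    def forward(self, x, x_rec):
        loss, h, h_rec = 0.0, x, x_rec
        for i, layer in enumerate(self.vgg):
            h, h_rec = layer(h), layer(h_rec)
            if i in self.taps:
                # N_l taken as the number of elements in the layer's feature map.
                loss = loss + self.weights[i] * (h - h_rec).abs().sum() / h.numel()
        return loss

l_perc_src = PerceptualLoss()(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
```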
Similarly, in a possible implementation manner, at least one embodiment of the present disclosure provides a specific implementation of calculating the target image perceptual loss according to the target image and the reconstructed target image, which specifically includes the following four steps:
B5: Input the target image into the third image feature extraction network, and obtain the ninth image features output by each feature extraction layer of the third image feature extraction network.
The third image feature extraction network is used to extract the image features of the target image, obtaining the ninth image features output by each feature extraction layer of the third image feature extraction network.
Taking the above third image feature extraction network being VGG Net as an example, the ninth image feature can be expressed as φ^(l)(x_t), where x_t denotes the target image.
B6: Input the reconstructed target image into the third image feature extraction network, and obtain the tenth image features output by each feature extraction layer of the third image feature extraction network.
The third image feature extraction network is used to extract the image features of the reconstructed target image, obtaining the tenth image features output by each feature extraction layer of the third image feature extraction network.
Taking the above third image feature extraction network being VGG Net as an example, the tenth image feature can be expressed as φ^(l)(x_trec), where x_trec denotes the reconstructed target image.
B7: Calculate the target image loss corresponding to each feature extraction layer according to the ninth image feature and the tenth image feature output by that feature extraction layer and the weight corresponding to that feature extraction layer.
Each feature extraction layer in the third image feature extraction network has a corresponding weight. According to the weight of each feature extraction layer, the ninth image feature output by that layer and the tenth image feature output by that layer, the target image loss corresponding to that feature extraction layer is calculated.
In a possible implementation manner, the difference between the ninth image feature and the tenth image feature can be calculated first, then the L1 norm of the resulting difference is calculated, and finally the L1 norm of the difference is multiplied by the weight to obtain the target image loss corresponding to that feature extraction layer.
B8: Sum the target image losses corresponding to the feature extraction layers to obtain the target image perceptual loss.
The sum of the target image losses of the feature extraction layers is calculated to obtain the target image perceptual loss.
The target image perceptual loss L_perc(x_t, x_trec; w) can be expressed by the following formula:
L_perc(x_t, x_trec; w) = Σ_{l=1}^{L} (w^(l) / N^(l)) ||φ^(l)(x_t) - φ^(l)(x_trec)||_1    (10)
where w^(l) denotes the weight of the l-th feature extraction layer, N^(l) denotes the number of elements in the feature output by the l-th feature extraction layer, and ||·||_1 denotes the L1 norm.
Some of the target images may have corresponding medical text labels. For target images with medical text labels, the model can be trained in a semi-supervised manner.
Correspondingly, in a possible implementation manner, at least one embodiment of the present disclosure provides a training method for a medical report generation model in which, on the basis of the training completed in the above steps S201-S210, training can be performed again; that is, in addition to the above steps, the following five steps are included:
C1: Determine a first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image.
The first score is used to measure the difference between the source image and the reconstructed source image, and the difference between the target image and the reconstructed target image.
In a possible implementation manner, at least one embodiment of the present disclosure provides a specific implementation of determining the first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image; see below.
C2: Determine a second score according to the source image specificity loss and the target image specificity loss.
The second score is related to the source image specificity loss and the target image specificity loss.
In a possible implementation manner, at least one embodiment of the present disclosure provides a specific implementation of determining the second score according to the source image specificity loss and the target image specificity loss; see below.
C3: If the target image corresponds to a medical text label, calculate a natural language evaluation metric as a third score according to the second medical report text and the medical text label corresponding to the target image.
When some of the target images have corresponding medical text labels, the natural language evaluation metric can be calculated according to the medical text labels of the target images and the second medical report text. The calculated natural language evaluation metric is determined as the third score.
The natural language evaluation metric may be a metric such as CIDEr (Consensus-based Image Description Evaluation). The third score can be expressed by formula (11):
SCORE_eval = CIDEr(y_t, y)                                       (11)
where CIDEr(y_t, y) denotes the CIDEr of y_t and y, y_t is the second medical report text generated based on the target image, and y is the medical text label corresponding to the target image.
C4: Take the weighted sum of the first score, the second score and the third score to obtain a reward value.
The weighted sum of the first score, the second score and the third score is calculated to obtain the reward value. The reward value REWARD can be expressed by formula (12):
REWARD = λ_1 SCORE_rec + λ_2 SCORE_dist + λ_3 SCORE_eval                  (12)
where SCORE_rec denotes the first score, SCORE_dist denotes the second score, and SCORE_eval denotes the third score. λ_1 is the weight corresponding to the first score, λ_2 is the weight corresponding to the second score, and λ_3 is the weight corresponding to the third score.
The weights corresponding to the first score, the second score and the third score can be set as required. For example, when the target image does not have a corresponding medical text label, λ_1 = λ_2 = 0.5 and λ_3 = 0. When the target image has a corresponding medical text label, λ_1 = λ_2 = 0.3 and λ_3 = 0.4.
The reward value can reflect how well the model is training in three respects: the difference of the reconstructed images, the specificity loss of the images, and the natural language evaluation metric.
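The reward of formula (12) then reduces to a weighted sum; a minimal sketch, using the example weight settings given above:

```python
def reward(score_rec: float, score_dist: float, score_eval: float,
           has_label: bool) -> float:
    """Formula (12) with the example weights from the text above."""
    if has_label:
        l1, l2, l3 = 0.3, 0.3, 0.4   # target image has a medical text label
    else:
        l1, l2, l3 = 0.5, 0.5, 0.0   # no label, so the CIDEr term is disabled
    return l1 * score_rec + l2 * score_dist + l3 * score_eval
```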
C5: Retrain the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder and the second decoder with the goal of maximizing the reward value.
Taking maximization of the reward value as the training objective, the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder and the second decoder in the model are retrained.
In the embodiments of the present disclosure, taking maximization of the reward value as the training objective allows the text generator to be updated via reinforcement learning. Moreover, taking the natural language evaluation metric as the third score allows the natural language evaluation metric to be considered during model training, so that the objectives of model training and model application coincide, further improving the accuracy of the model.
Further, at least one embodiment of the present disclosure provides a specific implementation of determining the first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image, including the following seven steps:
D1: Input the source image into the third image feature extraction network, and obtain the eleventh image feature output by the third image feature extraction network.
The third image feature extraction network is used to extract the image features of an image. The source image is input into the third image feature extraction network to obtain the eleventh image feature output by the third image feature extraction network.
The third image feature extraction network may be VGG Net (a deep convolutional neural network). VGG Net may be pre-trained. The source image is input into VGG Net to obtain the eleventh image feature φ^(l)(x_s), where x_s denotes the source image and l denotes the l-th activated layer (the l-th layer after an activation function) in VGG Net. l is a positive integer greater than or equal to 1 and less than or equal to L, where L is the maximum number of activated layers of VGG Net.
D2: Input the reconstructed source image into the third image feature extraction network, and obtain the twelfth image feature output by the third image feature extraction network.
The third image feature extraction network is used to extract the image features of the reconstructed source image, obtaining the twelfth image feature output by the third image feature extraction network.
Still taking the above third image feature extraction network being VGG Net as an example, the twelfth image feature can be expressed as φ^(l)(x_srec), where x_srec denotes the reconstructed source image.
D3: Obtain a first difference value according to the difference between the eleventh image feature and the twelfth image feature.
The first difference value is used to indicate the difference between the eleventh image feature and the twelfth image feature.
In a possible implementation manner, the difference between the twelfth image feature and the eleventh image feature can be calculated first to obtain a first difference, and the L1 norm of the first difference is then calculated to obtain the first difference value.
The first difference value S_1 can be expressed by the following formula:
S_1 = Σ_{l=1}^{L} ||φ^(l)(x_srec) - φ^(l)(x_s)||_1                          (13)
D4: Input the target image into the third image feature extraction network, and obtain the thirteenth image feature output by the third image feature extraction network.
The third image feature extraction network is used to extract the image features of the target image, obtaining the thirteenth image feature.
Still taking the above third image feature extraction network being VGG Net as an example, the thirteenth image feature can be expressed as φ^(l)(x_t), where x_t denotes the target image.
D5: Input the reconstructed target image into the third image feature extraction network, and obtain the fourteenth image feature output by the third image feature extraction network.
The third image feature extraction network is used to extract the image features of the reconstructed target image, obtaining the fourteenth image feature.
Still taking the above third image feature extraction network being VGG Net as an example, the fourteenth image feature can be expressed as φ^(l)(x_trec), where x_trec denotes the reconstructed target image.
D6: Obtain a second difference value according to the difference between the thirteenth image feature and the fourteenth image feature.
The second difference value is used to indicate the difference between the thirteenth image feature and the fourteenth image feature.
In a possible implementation manner, the difference between the thirteenth image feature and the fourteenth image feature can be calculated first to obtain a second difference, and the L1 norm of the second difference is then calculated to obtain the second difference value.
The second difference value S_2 can be expressed by the following formula:
S_2 = Σ_{l=1}^{L} ||φ^(l)(x_trec) - φ^(l)(x_t)||_1                          (14)
D7: Sum the first difference value and the second difference value to obtain a fourth summation result, and take the negative of the fourth summation result to obtain the first score.
The sum of the first difference value and the second difference value is calculated, and the negative of the resulting sum is taken to obtain the first score.
The first score SCORE_rec can be expressed by formula (15):
SCORE_rec = -(S_1 + S_2)                                      (15)
Further, at least one embodiment of the present disclosure provides a specific implementation of determining the second score according to the source image specificity loss and the target image specificity loss, including:
summing the source image specificity loss and the target image specificity loss to obtain a fifth summation result, and taking the negative of the fifth summation result to obtain the second score.
The second score SCORE_dist can be expressed by formula (16):
SCORE_dist = -L_difference = -(L_sdist + L_tdist)                         (16)
where L_sdist is the source image specificity loss, L_tdist is the target image specificity loss, and L_difference is the fifth summation result.
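Combining steps D1-D7 with formula (16), the two scores might be computed as in the following sketch; representing the VGG activations as per-layer feature lists is an assumption:

```python
import torch

def score_rec(phi_s, phi_srec, phi_t, phi_trec):
    """SCORE_rec = -(S_1 + S_2), formulas (13)-(15); phi_* are lists of per-layer features."""
    s1 = sum((a - b).abs().sum() for a, b in zip(phi_srec, phi_s))   # formula (13)
    s2 = sum((a - b).abs().sum() for a, b in zip(phi_trec, phi_t))   # formula (14)
    return -(s1 + s2)                                                # formula (15)

def score_dist(l_sdist: torch.Tensor, l_tdist: torch.Tensor) -> torch.Tensor:
    """SCORE_dist = -(L_sdist + L_tdist), formula (16)."""
    return -(l_sdist + l_tdist)
```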
In a possible implementation manner, the first encoder, the second encoder and the third encoder can also be trained in advance.
At least one embodiment of the present disclosure provides a training method for a medical report generation model which, in addition to the above steps, further includes the following three steps.
E1: Input a training image into a first image feature extraction network to obtain a fifth image feature, and input the fifth image feature into a first classification network to obtain a first predicted classification result of the training image; train the first image feature extraction network and the first classification network according to the first predicted classification result of the training image and the classification label corresponding to the training image.
The training image is an image used for training the encoders. The training image is a medical image with a classification label. The classification label is the disease corresponding to the medical image. A medical image used as a training image may be a chest radiograph, and the corresponding classification label may be, for example, the disease name of a disease such as pneumonia, pulmonary nodules, or cardiac hypertrophy. The training images may be images from the CheXpert-small data set.
The first image feature extraction network is used to extract image features. The training image is input into the first image feature extraction network to obtain the fifth image feature output by the first image feature extraction network. The first image feature extraction network may adopt the Inception-v3 network structure.
The first classification network is used to determine the classification type of an image according to the input image features. The obtained fifth image feature is then input into the first classification network to obtain the first predicted classification result of the training image. The first predicted classification result may include the image type of the training image.
The classification label of the training image can be used to measure the accuracy of the first predicted classification result of the training image. The first image feature extraction network and the first classification network are trained according to the classification label of the training image and the first predicted classification result.
E2: Input the training image into a second image feature extraction network to obtain a sixth image feature, and input the sixth image feature into a second classification network to obtain a second predicted classification result of the training image; train the second image feature extraction network and the second classification network according to the second predicted classification result of the training image and the classification label corresponding to the training image. The first image feature extraction network and the second image feature extraction network have different network structures.
The second image feature extraction network is a network with a structure different from that of the first image feature extraction network. The second image feature extraction network is used to extract image features. The training image is input into the second image feature extraction network to obtain the sixth image feature output by the second image feature extraction network.
The second classification network is used to determine the classification type of an image according to the input image features. The obtained sixth image feature is then input into the second classification network to obtain the second predicted classification result of the training image output by the second classification network. The second predicted classification result may include the image type of the training image.
The classification label of the training image can be used to measure whether the second predicted classification result of the training image is accurate. The second image feature extraction network and the second classification network are trained using the classification label of the training image and the second predicted classification result of the training image.
E3: Determine the model parameters of the trained first image feature extraction network as the initial model parameters of the first encoder and the third encoder, and determine the model parameters of the trained second image feature extraction network as the initial model parameters of the second encoder. The first image feature extraction network has the same network structure as the first encoder and the third encoder, and the second image feature extraction network has the same network structure as the second encoder.
The first image feature extraction network has the same network structure as the first encoder and the third encoder. After the trained first image feature extraction network is obtained, the first image feature extraction network is used to determine the initial model parameters of the first encoder and the third encoder. Specifically, the model parameters of the first image feature extraction network are determined as the initial model parameters of the first encoder and the initial model parameters of the third encoder.
The second image feature extraction network has the same network structure as the second encoder. After the trained second image feature extraction network is obtained, the model parameters of the second image feature extraction network are used to determine the initial model parameters of the second encoder. Specifically, the model parameters of the second image feature extraction network are determined as the initial model parameters of the second encoder.
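A sketch of the parameter transfer in step E3, assuming PyTorch state_dict copying and that the architectures match as stated; the concrete backbones and the feature-network handles are hypothetical:

```python
import torchvision.models as models

# Hypothetical handles for the networks trained in E1/E2.
first_feature_net = models.resnet18(num_classes=512)
second_feature_net = models.inception_v3(num_classes=512, aux_logits=False,
                                         init_weights=True)

first_encoder = models.resnet18(num_classes=512)
third_encoder = models.resnet18(num_classes=512)
second_encoder = models.inception_v3(num_classes=512, aux_logits=False,
                                     init_weights=True)

# Step E3: identical architectures allow direct parameter copying.
first_encoder.load_state_dict(first_feature_net.state_dict())
third_encoder.load_state_dict(first_feature_net.state_dict())
second_encoder.load_state_dict(second_feature_net.state_dict())
```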
Based on the above, the first image feature extraction network and the second image feature extraction network are pre-trained using the training images, and the model parameters of the first image feature extraction network and the second image feature extraction network are then used to determine the initial model parameters of the first encoder, the second encoder and the third encoder. Pre-training in this way makes the first encoder, the second encoder and the third encoder more accurate, improves the accuracy with which they extract image features, and improves the efficiency of model training.
In another possible implementation manner, the initial model parameters of the first encoder, the second encoder and the third encoder can be determined by random initialization. At least one embodiment of the present disclosure further provides a training method for a medical report generation model which, in addition to the above steps, further includes the following step:
randomly initializing the initial model parameters of the first encoder, the second encoder and the third encoder.
Before training with the first encoder, the second encoder and the third encoder, the initial model parameters of the first encoder, the second encoder and the third encoder are randomly initialized. The first encoder, the second encoder and the third encoder are then trained in the manner described above to determine the model parameters.
Based on the training method of the medical report generation model provided by the above embodiments, an embodiment of the present disclosure provides a medical report generation method. Referring to FIG. 5, which is a flowchart of a medical report generation method provided by at least one embodiment of the present disclosure, the method includes S501-S502:
S501: Input a medical image into an encoder to obtain medical image features.
The encoder is the second encoder trained using the above training method of the medical report generation model. The trained second encoder can extract the medical image features of the medical image relatively accurately.
The medical image for which the corresponding medical report text needs to be generated is input into the encoder to obtain the medical image features corresponding to the medical image. It should be noted that the image type of the medical image is consistent with the image type of the target image. For example, if the target image includes images produced by an endoscope, the medical image may correspondingly be an image produced by an endoscope.
S502: Input the medical image features into a text generator to obtain a medical report text.
The text generator is the text generator trained using the above training method of the medical report generation model. The trained text generator can generate relatively accurate medical report text based on the input medical image features.
The medical image features of the medical image output by the encoder are input into the text generator to obtain the medical report text output by the text generator.
基于上述内容可知,在本公开的实施例中,利用上述医学报告生成模型的训练方法训练得到的编码器和文本生成器,能够适用于目标图像对应的图像类型的医学图像,生成医学图像对应的医学报告文本。Based on the above content, it can be seen that in the embodiments of the present disclosure, the encoder and text generator trained by the training method of the above-mentioned medical report generation model can be applied to the medical image of the image type corresponding to the target image, and generate the corresponding Medical report text.
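As an illustration only, the two inference steps can be sketched as follows (a minimal sketch assuming PyTorch-style modules; `encoder`, `text_generator`, and the tensor handling are placeholders, not names fixed by the disclosure):

```python
import torch

@torch.no_grad()
def generate_report(medical_image: torch.Tensor,
                    encoder: torch.nn.Module,
                    text_generator: torch.nn.Module):
    """S501/S502: encode a medical image, then generate report text."""
    features = encoder(medical_image.unsqueeze(0))  # S501: medical image features
    report_text = text_generator(features)          # S502: medical report text
    return report_text
```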
Based on the training method for a medical report generation model provided by the above method embodiments, at least one embodiment of the present disclosure further provides a training apparatus for a medical report generation model, which is described below with reference to the accompanying drawings.

Referring to FIG. 6, which is a schematic structural diagram of a training apparatus for a medical report generation model provided by at least one embodiment of the present disclosure. As shown in FIG. 6, the training apparatus includes:

a first input unit 601, configured to input a source image into a first encoder to obtain a first image feature, and input the source image into a second encoder to obtain a second image feature, where the source image corresponds to a medical text label;

a second input unit 602, configured to input a target image into a third encoder to obtain a third image feature, and input the target image into the second encoder to obtain a fourth image feature;

a third input unit 603, configured to input the second image feature into a text generator to obtain a first medical report text;

a fourth input unit 604, configured to input the fourth image feature into the text generator to obtain a second medical report text;

a fifth input unit 605, configured to input the first medical report text into a discriminator to obtain a first discrimination result;

a sixth input unit 606, configured to input the second medical report text into the discriminator to obtain a second discrimination result;

a first calculation unit 607, configured to calculate a source image specificity loss from the first image feature and the second image feature, and calculate a target image specificity loss from the third image feature and the fourth image feature;

a second calculation unit 608, configured to calculate a cross-entropy loss from the first medical report text and the medical text label corresponding to the source image;

a third calculation unit 609, configured to calculate a first adversarial loss from the first discrimination result, and calculate a second adversarial loss and a third adversarial loss from the second discrimination result;

an execution unit 610, configured to train the first encoder, the second encoder, the third encoder, the text generator, and the discriminator according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss, and to repeat the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is reached.
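The disclosure does not specify how the six losses are combined during training; a weighted sum with tunable coefficients $\lambda_1,\dots,\lambda_6$ is one common choice and is shown here purely as an illustration:

$$\mathcal{L} = \lambda_1\,\mathcal{L}_{\mathrm{spec}}^{\mathrm{src}} + \lambda_2\,\mathcal{L}_{\mathrm{spec}}^{\mathrm{tgt}} + \lambda_3\,\mathcal{L}_{\mathrm{CE}} + \lambda_4\,\mathcal{L}_{\mathrm{adv}}^{(1)} + \lambda_5\,\mathcal{L}_{\mathrm{adv}}^{(2)} + \lambda_6\,\mathcal{L}_{\mathrm{adv}}^{(3)}$$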
In a possible implementation, the apparatus further includes:

a seventh input unit, configured to input the first image feature and the second image feature into a first decoder to obtain a reconstructed source image;

an eighth input unit, configured to input the third image feature and the fourth image feature into a second decoder to obtain a reconstructed target image;

a fourth calculation unit, configured to calculate a source image perceptual loss from the source image and the reconstructed source image, and calculate a target image perceptual loss from the target image and the reconstructed target image;

where the execution unit is specifically configured to train the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, the third adversarial loss, the source image perceptual loss, and the target image perceptual loss.
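A minimal sketch of the reconstruction step follows; fusing the two features by channel concatenation is an assumption, since the disclosure only states that both features are input to the decoder:

```python
import torch

def reconstruct_image(decoder: torch.nn.Module,
                      feat_a: torch.Tensor,
                      feat_b: torch.Tensor) -> torch.Tensor:
    """Rebuild an input image from a pair of image features
    (e.g., the first and second features for the source image)."""
    fused = torch.cat([feat_a, feat_b], dim=1)  # assumed fusion scheme
    return decoder(fused)
```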
In a possible implementation, some of the target images correspond to medical text labels, and the apparatus further includes:

a first determination unit, configured to determine a first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image;

a second determination unit, configured to determine a second score according to the source image specificity loss and the target image specificity loss;

a fifth calculation unit, configured to, if the target image corresponds to a medical text label, calculate a natural language evaluation metric as a third score according to the second medical report text and the medical text label corresponding to the target image;

a summation unit, configured to compute a weighted sum of the first score, the second score, and the third score to obtain a reward value;

a training unit, configured to retrain the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder with the goal of maximizing the reward value.
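Written out, the reward is a weighted combination of the three scores, and retraining then seeks to maximize it; the weights $\alpha$, $\beta$, $\gamma$ are left unspecified by the disclosure:

$$R = \alpha S_1 + \beta S_2 + \gamma S_3$$

where $S_3$ is the natural language evaluation metric computed against the target image's medical text label when that label is available.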
In a possible implementation, the apparatus further includes:

a seventh input unit, configured to input a training image into a first image feature extraction network to obtain a fifth image feature, input the fifth image feature into a first classification network to obtain a first predicted classification result for the training image, and train the first image feature extraction network and the first classification network according to the first predicted classification result and the classification label corresponding to the training image;

an eighth input unit, configured to input the training image into a second image feature extraction network to obtain a sixth image feature, input the sixth image feature into a second classification network to obtain a second predicted classification result for the training image, and train the second image feature extraction network and the second classification network according to the second predicted classification result and the classification label corresponding to the training image, where the first image feature extraction network and the second image feature extraction network have different network structures;

a third determination unit, configured to determine the model parameters of the trained first image feature extraction network as the initial model parameters of the first encoder and the third encoder, and determine the model parameters of the trained second image feature extraction network as the initial model parameters of the second encoder, where the first image feature extraction network has the same network structure as the first encoder and the third encoder, and the second image feature extraction network has the same network structure as the second encoder.
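For illustration, one supervised pretraining step for either feature-extraction/classification pair might look like the sketch below; the use of a cross-entropy objective is an assumption, since the disclosure only states that the networks are trained on the predicted classification result and the class label:

```python
import torch
import torch.nn.functional as F

def pretrain_step(feat_net, cls_net, optimizer, images, labels):
    """One step of classification pretraining for a feature extraction
    network and its classification network."""
    features = feat_net(images)              # fifth / sixth image features
    logits = cls_net(features)               # predicted classification result
    loss = F.cross_entropy(logits, labels)   # assumed objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```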
In a possible implementation, the apparatus further includes:

an initialization unit, configured to randomly initialize the initial model parameters of the first encoder, the second encoder, and the third encoder.
In a possible implementation, the first discrimination result includes first probability values indicating, for each token in the first medical report text, whether the token was generated from the source image, and the second discrimination result includes second probability values indicating, for each token in the second medical report text, whether the token was generated from the source image;

the third calculation unit 609 is specifically configured to take the logarithm of the first probability values and sum them to obtain a first summation result, and take the negative of the first summation result to obtain the first adversarial loss;

the third calculation unit 609 is further configured to take the logarithm of the second probability values and sum them to obtain a second summation result, and take the negative of the second summation result to obtain the second adversarial loss;

and to compute the difference between 1 and each second probability value and sum the differences to obtain a third summation result, taking the negative of the third summation result to obtain the third adversarial loss.
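Writing $p_i^{(1)}$ and $p_i^{(2)}$ for the per-token probabilities in the first and second discrimination results, the three computations read literally as:

$$\mathcal{L}_{\mathrm{adv}}^{(1)} = -\sum_i \log p_i^{(1)}, \qquad \mathcal{L}_{\mathrm{adv}}^{(2)} = -\sum_i \log p_i^{(2)}, \qquad \mathcal{L}_{\mathrm{adv}}^{(3)} = -\sum_i \bigl(1 - p_i^{(2)}\bigr)$$

Note that the third loss, as stated, sums the plain differences $1 - p_i^{(2)}$ rather than their logarithms.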
In a possible implementation, the fourth calculation unit is specifically configured to input the source image into a third image feature extraction network and obtain the seventh image features output by each feature extraction layer of the third image feature extraction network;

input the reconstructed source image into the third image feature extraction network and obtain the eighth image features output by each feature extraction layer of the third image feature extraction network;

calculate, for each feature extraction layer, the source image loss corresponding to that layer from the seventh image feature and the eighth image feature output by the layer and the weight corresponding to the layer;

and sum the source image losses corresponding to the feature extraction layers to obtain the source image perceptual loss;

the fourth calculation unit is further configured to input the target image into the third image feature extraction network and obtain the ninth image features output by each feature extraction layer of the third image feature extraction network;

input the reconstructed target image into the third image feature extraction network and obtain the tenth image features output by each feature extraction layer of the third image feature extraction network;

calculate, for each feature extraction layer, the target image loss corresponding to that layer from the ninth image feature and the tenth image feature output by the layer and the weight corresponding to the layer;

and sum the target image losses corresponding to the feature extraction layers to obtain the target image perceptual loss.
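A minimal sketch of the per-layer computation follows; the squared-error distance is an assumption, as the disclosure specifies only a per-layer loss scaled by a layer weight and summed over layers:

```python
import torch

def perceptual_loss(feature_layers, layer_weights, image, reconstruction):
    """Weighted sum of per-layer feature differences between an image
    and its reconstruction (applies to both source and target)."""
    loss = torch.zeros(())
    x, y = image, reconstruction
    for layer, weight in zip(feature_layers, layer_weights):
        x, y = layer(x), layer(y)                        # features at this layer
        loss = loss + weight * torch.mean((x - y) ** 2)  # assumed distance
    return loss
```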
In a possible implementation, the first determination unit is specifically configured to input the source image into the third image feature extraction network and obtain the eleventh image feature output by the third image feature extraction network;

input the reconstructed source image into the third image feature extraction network and obtain the twelfth image feature output by the third image feature extraction network;

obtain a first difference value according to the difference between the eleventh image feature and the twelfth image feature;

input the target image into the third image feature extraction network and obtain the thirteenth image feature output by the third image feature extraction network;

input the reconstructed target image into the third image feature extraction network and obtain the fourteenth image feature output by the third image feature extraction network;

obtain a second difference value according to the difference between the thirteenth image feature and the fourteenth image feature;

and sum the first difference value and the second difference value to obtain a fourth summation result, taking the negative of the fourth summation result to obtain the first score.

In a possible implementation, the second determination unit is specifically configured to sum the source image specificity loss and the target image specificity loss to obtain a fifth summation result, and take the negative of the fifth summation result to obtain the second score.
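In formula form, with $\phi(\cdot)$ denoting the output of the third image feature extraction network and $d(\cdot,\cdot)$ a feature-difference measure whose exact form the disclosure does not fix:

$$S_1 = -\Bigl( d\bigl(\phi(x_{\mathrm{src}}), \phi(\hat{x}_{\mathrm{src}})\bigr) + d\bigl(\phi(x_{\mathrm{tgt}}), \phi(\hat{x}_{\mathrm{tgt}})\bigr) \Bigr), \qquad S_2 = -\bigl(\mathcal{L}_{\mathrm{spec}}^{\mathrm{src}} + \mathcal{L}_{\mathrm{spec}}^{\mathrm{tgt}}\bigr)$$

where $\hat{x}$ denotes the corresponding reconstructed image.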
Based on the medical report generation method provided by the above method embodiments, at least one embodiment of the present disclosure further provides a medical report generation apparatus, which is described below with reference to the accompanying drawings.

Referring to FIG. 7, which is a schematic structural diagram of a medical report generation apparatus provided by at least one embodiment of the present disclosure. As shown in FIG. 7, the medical report generation apparatus includes:

an input unit 701, configured to input a medical image into an encoder to obtain medical image features;

a generation unit 702, configured to input the medical image features into a text generator to obtain a medical report text;

where the encoder is the second encoder trained according to the training method for a medical report generation model described in any one of the above embodiments;

and the text generator is the text generator trained according to the training method for a medical report generation model described in any one of the above embodiments.
Based on the training method for a medical report generation model and the medical report generation method provided by the above method embodiments, at least one embodiment of the present disclosure further provides an electronic device, including: one or more processors; and a storage device storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the training method for a medical report generation model described in any of the above embodiments, or the medical report generation method described in the above embodiments.

Referring now to FIG. 8, which shows a schematic structural diagram of an electronic device 800 suitable for implementing the embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (Personal Digital Assistants), PADs (portable Android devices, i.e., tablet computers), PMPs (Portable Media Players), and vehicle-mounted terminals (e.g., vehicle-mounted navigation terminals), as well as fixed terminals such as digital TVs (televisions) and desktop computers. The electronic device shown in FIG. 8 is merely an example and should not impose any limitation on the functionality or scope of use of the embodiments of the present disclosure.

As shown in FIG. 8, the electronic device 800 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 801, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage device 808 into a random access memory (RAM) 803. The RAM 803 also stores various programs and data required for the operation of the electronic device 800. The processing device 801, the ROM 802, and the RAM 803 are connected to one another via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

In general, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 807 including, for example, a liquid crystal display (LCD), speaker, and vibrator; storage devices 808 including, for example, a magnetic tape and hard disk; and a communication device 809. The communication device 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 8 shows an electronic device 800 having various devices, it should be understood that it is not required to implement or provide all of the devices shown; more or fewer devices may alternatively be implemented or provided.

In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication device 809, installed from the storage device 808, or installed from the ROM 802. When the computer program is executed by the processing device 801, the above-described functions defined in the methods of the embodiments of the present disclosure are performed.

The electronic device provided by the embodiments of the present disclosure belongs to the same inventive concept as the training method for a medical report generation model and the medical report generation method provided by the above embodiments. For technical details not described in detail in this embodiment, reference may be made to the above embodiments, and this embodiment has the same beneficial effects as the above embodiments.
Based on the training method for a medical report generation model and the medical report generation method provided by the above method embodiments, at least one embodiment of the present disclosure provides a computer storage medium on which a computer program is stored, where the program, when executed by a processor, implements the training method for a medical report generation model described in any of the above embodiments, or the medical report generation method described in the above embodiments.

It should be noted that the computer-readable medium described above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; it can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wire, optical cable, RF (radio frequency), and the like, or any suitable combination of the foregoing.

In some implementations, the client and the server may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future-developed network.

The computer-readable medium described above may be included in the electronic device described above, or may exist separately without being assembled into the electronic device.

The computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to perform the above training method for a medical report generation model, or the medical report generation method.

Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The units involved in the embodiments described in the present disclosure may be implemented by software or by hardware. The name of a unit or module does not, in some cases, constitute a limitation on the unit itself; for example, a voice data collection module may also be described as a "data collection module".

The functions described herein above may be performed at least in part by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include: field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, Example 1 provides a training method for a medical report generation model, the method including:

inputting a source image into a first encoder to obtain a first image feature, and inputting the source image into a second encoder to obtain a second image feature, where the source image corresponds to a medical text label;

inputting a target image into a third encoder to obtain a third image feature, and inputting the target image into the second encoder to obtain a fourth image feature;

inputting the second image feature into a text generator to obtain a first medical report text;

inputting the fourth image feature into the text generator to obtain a second medical report text;

inputting the first medical report text into a discriminator to obtain a first discrimination result;

inputting the second medical report text into the discriminator to obtain a second discrimination result;

calculating a source image specificity loss from the first image feature and the second image feature, and calculating a target image specificity loss from the third image feature and the fourth image feature;

calculating a cross-entropy loss from the first medical report text and the medical text label corresponding to the source image;

calculating a first adversarial loss from the first discrimination result, and calculating a second adversarial loss and a third adversarial loss from the second discrimination result;

training the first encoder, the second encoder, the third encoder, the text generator, and the discriminator according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss, and repeating the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is reached.
According to one or more embodiments of the present disclosure, Example 2 provides the training method for a medical report generation model, the method further including:

inputting the first image feature and the second image feature into a first decoder to obtain a reconstructed source image;

inputting the third image feature and the fourth image feature into a second decoder to obtain a reconstructed target image;

calculating a source image perceptual loss from the source image and the reconstructed source image, and calculating a target image perceptual loss from the target image and the reconstructed target image;

where the training of the first encoder, the second encoder, the third encoder, the text generator, and the discriminator according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss includes:

training the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, the third adversarial loss, the source image perceptual loss, and the target image perceptual loss.
According to one or more embodiments of the present disclosure, Example 3 provides the training method for a medical report generation model, where some of the target images correspond to medical text labels; the method further includes:

determining a first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image;

determining a second score according to the source image specificity loss and the target image specificity loss;

if the target image corresponds to a medical text label, calculating a natural language evaluation metric as a third score according to the second medical report text and the medical text label corresponding to the target image;

computing a weighted sum of the first score, the second score, and the third score to obtain a reward value;

retraining the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder with the goal of maximizing the reward value.
According to one or more embodiments of the present disclosure, Example 4 provides the training method for a medical report generation model, the method further including:

inputting a training image into a first image feature extraction network to obtain a fifth image feature, inputting the fifth image feature into a first classification network to obtain a first predicted classification result for the training image, and training the first image feature extraction network and the first classification network according to the first predicted classification result and the classification label corresponding to the training image;

inputting the training image into a second image feature extraction network to obtain a sixth image feature, inputting the sixth image feature into a second classification network to obtain a second predicted classification result for the training image, and training the second image feature extraction network and the second classification network according to the second predicted classification result and the classification label corresponding to the training image, where the first image feature extraction network and the second image feature extraction network have different network structures;

determining the model parameters of the trained first image feature extraction network as the initial model parameters of the first encoder and the third encoder, and determining the model parameters of the trained second image feature extraction network as the initial model parameters of the second encoder, where the first image feature extraction network has the same network structure as the first encoder and the third encoder, and the second image feature extraction network has the same network structure as the second encoder.

According to one or more embodiments of the present disclosure, Example 5 provides the training method for a medical report generation model, the method further including:

randomly initializing the initial model parameters of the first encoder, the second encoder, and the third encoder.
According to one or more embodiments of the present disclosure, Example 6 provides the training method for a medical report generation model, where the first discrimination result includes first probability values indicating, for each token in the first medical report text, whether the token was generated from the source image, and the second discrimination result includes second probability values indicating, for each token in the second medical report text, whether the token was generated from the source image;

the calculating of the first adversarial loss from the first discrimination result includes:

taking the logarithm of the first probability values and summing them to obtain a first summation result, and taking the negative of the first summation result to obtain the first adversarial loss;

the calculating of the second adversarial loss and the third adversarial loss from the second discrimination result includes:

taking the logarithm of the second probability values and summing them to obtain a second summation result, and taking the negative of the second summation result to obtain the second adversarial loss;

computing the difference between 1 and each second probability value and summing the differences to obtain a third summation result, and taking the negative of the third summation result to obtain the third adversarial loss.
According to one or more embodiments of the present disclosure, Example 7 provides the training method for a medical report generation model, where the calculating of the source image perceptual loss from the source image and the reconstructed source image includes:

inputting the source image into a third image feature extraction network, and obtaining the seventh image features output by each feature extraction layer of the third image feature extraction network;

inputting the reconstructed source image into the third image feature extraction network, and obtaining the eighth image features output by each feature extraction layer of the third image feature extraction network;

calculating, for each feature extraction layer, the source image loss corresponding to that layer from the seventh image feature and the eighth image feature output by the layer and the weight corresponding to the layer;

summing the source image losses corresponding to the feature extraction layers to obtain the source image perceptual loss;

and the calculating of the target image perceptual loss from the target image and the reconstructed target image includes:

inputting the target image into the third image feature extraction network, and obtaining the ninth image features output by each feature extraction layer of the third image feature extraction network;

inputting the reconstructed target image into the third image feature extraction network, and obtaining the tenth image features output by each feature extraction layer of the third image feature extraction network;

calculating, for each feature extraction layer, the target image loss corresponding to that layer from the ninth image feature and the tenth image feature output by the layer and the weight corresponding to the layer;

summing the target image losses corresponding to the feature extraction layers to obtain the target image perceptual loss.
According to one or more embodiments of the present disclosure, Example 8 provides the training method for a medical report generation model, where the determining of the first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image includes:

inputting the source image into a third image feature extraction network, and obtaining the eleventh image feature output by the third image feature extraction network;

inputting the reconstructed source image into the third image feature extraction network, and obtaining the twelfth image feature output by the third image feature extraction network;

obtaining a first difference value according to the difference between the eleventh image feature and the twelfth image feature;

inputting the target image into the third image feature extraction network, and obtaining the thirteenth image feature output by the third image feature extraction network;

inputting the reconstructed target image into the third image feature extraction network, and obtaining the fourteenth image feature output by the third image feature extraction network;

obtaining a second difference value according to the difference between the thirteenth image feature and the fourteenth image feature;

summing the first difference value and the second difference value to obtain a fourth summation result, and taking the negative of the fourth summation result to obtain the first score.

According to one or more embodiments of the present disclosure, Example 9 provides the training method for a medical report generation model, where the determining of the second score according to the source image specificity loss and the target image specificity loss includes:

summing the source image specificity loss and the target image specificity loss to obtain a fifth summation result, and taking the negative of the fifth summation result to obtain the second score.
According to one or more embodiments of the present disclosure, Example 10 provides a medical report generation method, the method including:

inputting a medical image into an encoder to obtain medical image features;

inputting the medical image features into a text generator to obtain a medical report text;

where the encoder is the second encoder trained according to the training method for a medical report generation model described in any one of the above embodiments;

and the text generator is the text generator trained according to the training method for a medical report generation model described in any one of the above embodiments.
According to one or more embodiments of the present disclosure, Example 11 provides a training apparatus for a medical report generation model, the apparatus including:

a first input unit, configured to input a source image into a first encoder to obtain a first image feature, and input the source image into a second encoder to obtain a second image feature, where the source image corresponds to a medical text label;

a second input unit, configured to input a target image into a third encoder to obtain a third image feature, and input the target image into the second encoder to obtain a fourth image feature;

a third input unit, configured to input the second image feature into a text generator to obtain a first medical report text;

a fourth input unit, configured to input the fourth image feature into the text generator to obtain a second medical report text;

a fifth input unit, configured to input the first medical report text into a discriminator to obtain a first discrimination result;

a sixth input unit, configured to input the second medical report text into the discriminator to obtain a second discrimination result;

a first calculation unit, configured to calculate a source image specificity loss from the first image feature and the second image feature, and calculate a target image specificity loss from the third image feature and the fourth image feature;

a second calculation unit, configured to calculate a cross-entropy loss from the first medical report text and the medical text label corresponding to the source image;

a third calculation unit, configured to calculate a first adversarial loss from the first discrimination result, and calculate a second adversarial loss and a third adversarial loss from the second discrimination result;

an execution unit, configured to train the first encoder, the second encoder, the third encoder, the text generator, and the discriminator according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss, and to repeat the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is reached.
According to one or more embodiments of the present disclosure, Example 12 provides the training apparatus for a medical report generation model, the apparatus further including:

a seventh input unit, configured to input the first image feature and the second image feature into a first decoder to obtain a reconstructed source image;

an eighth input unit, configured to input the third image feature and the fourth image feature into a second decoder to obtain a reconstructed target image;

a fourth calculation unit, configured to calculate a source image perceptual loss from the source image and the reconstructed source image, and calculate a target image perceptual loss from the target image and the reconstructed target image;

where the execution unit is specifically configured to train the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder according to the source image specificity loss, the target image specificity loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, the third adversarial loss, the source image perceptual loss, and the target image perceptual loss.
According to one or more embodiments of the present disclosure, Example 13 provides the training apparatus for a medical report generation model, where some of the target images correspond to medical text labels; the apparatus further includes:

a first determination unit, configured to determine a first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image;

a second determination unit, configured to determine a second score according to the source image specificity loss and the target image specificity loss;

a fifth calculation unit, configured to, if the target image corresponds to a medical text label, calculate a natural language evaluation metric as a third score according to the second medical report text and the medical text label corresponding to the target image;

a summation unit, configured to compute a weighted sum of the first score, the second score, and the third score to obtain a reward value;

a training unit, configured to retrain the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder with the goal of maximizing the reward value.
根据本公开的一个或多个实施例,示例十四提供了一种医学报告生成模型的训练装置,所述装置还包括:According to one or more embodiments of the present disclosure, Example Fourteen provides a training device for a medical report generation model, the device further comprising:
第七输入单元,用于将训练图像输入第一图像特征提取网络,得到第五图像特征,将所述第五图像特征输入第一分类网络,得到所述训练图像的第一预测分类结果;根据所述训练图像的第一预测分类结果以及所述训练图像 对应的分类标签,训练所述第一图像特征提取网络以及所述第一分类网络;The seventh input unit is used to input the training image into the first image feature extraction network to obtain the fifth image feature, and input the fifth image feature into the first classification network to obtain the first predicted classification result of the training image; according to The first predicted classification result of the training image and the classification label corresponding to the training image, training the first image feature extraction network and the first classification network;
第八输入单元,用于将训练图像输入第二图像特征提取网络,得到第六图像特征,将所述第六图像特征输入第二分类网络,得到所述训练图像的第二预测分类结果;根据所述训练图像的第二预测分类结果以及所述训练图像对应的分类标签,训练所述第二图像特征提取网络以及所述第二分类网络;所述第一图像特征提取网络与所述第二图像特征提取网络的网络结构不同;The eighth input unit is used to input the training image into the second image feature extraction network to obtain the sixth image feature, and input the sixth image feature into the second classification network to obtain the second predicted classification result of the training image; according to The second predicted classification result of the training image and the classification label corresponding to the training image, train the second image feature extraction network and the second classification network; the first image feature extraction network and the second The network structure of the image feature extraction network is different;
第三确定单元,用于将训练完成的所述第一图像特征提取网络的模型参数确定为所述第一编码器以及所述第三编码器的初始模型参数,将训练完成的所述第二图像特征提取网络的模型参数确定为所述第二编码器的初始模型参数;所述第一图像特征提取网络与所述第一编码器以及所述第三编码器的网络结构相同,所述第二图像特征提取网络与所述第二编码器的网络结构相同。The third determining unit is configured to determine the model parameters of the trained first image feature extraction network as the initial model parameters of the first encoder and the third encoder, and determine the trained second The model parameters of the image feature extraction network are determined as the initial model parameters of the second encoder; the first image feature extraction network has the same network structure as the first encoder and the third encoder, and the second encoder The network structure of the second image feature extraction network is the same as that of the second encoder.
According to one or more embodiments of the present disclosure, Example 15 provides an apparatus for training a medical report generation model, the apparatus further comprising:
an initialization unit, configured to randomly initialize the initial model parameters of the first encoder, the second encoder, and the third encoder.
According to one or more embodiments of the present disclosure, Example 16 provides an apparatus for training a medical report generation model, wherein the first discrimination result includes a first probability value indicating, for each token in the first medical report text, whether the token was generated from the source image, and the second discrimination result includes a second probability value indicating, for each token in the second medical report text, whether the token was generated from the source image;
the third calculating unit is specifically configured to take the logarithm of each first probability value and sum the logarithms to obtain a first summation result, and take the negative of the first summation result to obtain the first adversarial loss;
the third calculating unit is specifically configured to take the logarithm of each second probability value and sum the logarithms to obtain a second summation result, and take the negative of the second summation result to obtain the second adversarial loss;
and to compute the difference between 1 and each second probability value and sum the differences to obtain a third summation result, and take the negative of the third summation result to obtain the third adversarial loss.
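These three losses can be written down directly; the following sketch mirrors the text as stated, including the third loss summing 1 minus the probability rather than its logarithm:

    import torch

    def adversarial_losses(p1: torch.Tensor, p2: torch.Tensor):
        # p1, p2: per-token probabilities output by the discriminator, in (0, 1).
        loss1 = -torch.sum(torch.log(p1))  # negative first summation result
        loss2 = -torch.sum(torch.log(p2))  # negative second summation result
        loss3 = -torch.sum(1.0 - p2)       # negative third summation result, as stated
        return loss1, loss2, loss3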
According to one or more embodiments of the present disclosure, Example 17 provides an apparatus for training a medical report generation model, wherein the fourth calculating unit is specifically configured to input the source image into a third image feature extraction network and obtain the seventh image features output by each feature extraction layer of the third image feature extraction network;
input the reconstructed source image into the third image feature extraction network and obtain the eighth image features output by each feature extraction layer of the third image feature extraction network;
for each feature extraction layer, calculate the source image loss corresponding to that layer according to the seventh image feature and the eighth image feature output by the layer and the weight corresponding to the layer;
and sum the source image losses corresponding to the feature extraction layers to obtain the source image perceptual loss;
the fourth calculating unit is specifically configured to input the target image into the third image feature extraction network and obtain the ninth image features output by each feature extraction layer of the third image feature extraction network;
input the reconstructed target image into the third image feature extraction network and obtain the tenth image features output by each feature extraction layer of the third image feature extraction network;
for each feature extraction layer, calculate the target image loss corresponding to that layer according to the ninth image feature and the tenth image feature output by the layer and the weight corresponding to the layer;
and sum the target image losses corresponding to the feature extraction layers to obtain the target image perceptual loss.
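A compact sketch of this layer-weighted perceptual loss, assuming the per-layer features have already been collected (for example with forward hooks) and using a mean squared error as the per-layer distance, which the text leaves open:

    import torch

    def perceptual_loss(feats_image, feats_recon, layer_weights):
        # feats_image, feats_recon: lists of per-layer features for an image and
        # its reconstruction; layer_weights: one scalar weight per layer.
        loss = torch.zeros(())
        for f, f_hat, w in zip(feats_image, feats_recon, layer_weights):
            loss = loss + w * torch.mean((f - f_hat) ** 2)  # per-layer image loss
        return loss  # summed over layers, this is the perceptual loss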
According to one or more embodiments of the present disclosure, Example 18 provides an apparatus for training a medical report generation model, wherein the first determining unit is specifically configured to input the source image into a third image feature extraction network and obtain the eleventh image feature output by the third image feature extraction network;
input the reconstructed source image into the third image feature extraction network and obtain the twelfth image feature output by the third image feature extraction network;
obtain a first difference value according to the difference between the eleventh image feature and the twelfth image feature;
input the target image into the third image feature extraction network and obtain the thirteenth image feature output by the third image feature extraction network;
input the reconstructed target image into the third image feature extraction network and obtain the fourteenth image feature output by the third image feature extraction network;
obtain a second difference value according to the difference between the thirteenth image feature and the fourteenth image feature;
and sum the first difference value and the second difference value to obtain a fourth summation result, and take the negative of the fourth summation result to obtain the first score.
According to one or more embodiments of the present disclosure, Example 19 provides an apparatus for training a medical report generation model, wherein the second determining unit is specifically configured to sum the source image-specific loss and the target image-specific loss to obtain a fifth summation result, and take the negative of the fifth summation result to obtain the second score.
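The two scores of Examples 18 and 19 reduce to negated sums; a sketch in the same style, assuming a mean squared error as the feature distance, which the text does not fix:

    import torch

    def first_score(f11, f12, f13, f14):
        d1 = torch.mean((f11 - f12) ** 2)  # first difference value
        d2 = torch.mean((f13 - f14) ** 2)  # second difference value
        return -(d1 + d2)                  # negative fourth summation result

    def second_score(src_specific_loss, tgt_specific_loss):
        return -(src_specific_loss + tgt_specific_loss)  # negative fifth summation result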
According to one or more embodiments of the present disclosure, Example 20 provides a medical report generation apparatus, the apparatus comprising:
an input unit, configured to input a medical image into an encoder to obtain medical image features;
a generating unit, configured to input the medical image features into a text generator to obtain a medical report text;
wherein the encoder is the second encoder trained by the method for training a medical report generation model according to any one of the above embodiments;
and the text generator is the text generator trained by the method for training a medical report generation model according to any one of the above embodiments.
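At inference time the apparatus reduces to two calls; a minimal sketch, with encoder and text_generator standing in for the trained second encoder and text generator (the names are placeholders):

    import torch

    @torch.no_grad()
    def generate_report(medical_image: torch.Tensor, encoder, text_generator) -> str:
        features = encoder(medical_image)  # medical image features
        return text_generator(features)    # medical report text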
According to one or more embodiments of the present disclosure, Example 21 provides an electronic device, comprising:
one or more processors; and
a storage device storing one or more programs,
which, when executed by the one or more processors, cause the one or more processors to implement the method for training a medical report generation model according to any one of the above embodiments, or to implement the medical report generation method according to the above embodiment.
According to one or more embodiments of the present disclosure, Example 22 provides a computer-readable medium storing a computer program which, when executed by a processor, implements the method for training a medical report generation model according to any one of the above embodiments, or implements the medical report generation method according to the above embodiment.
It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the systems and apparatuses disclosed in the embodiments correspond to the methods disclosed in the embodiments, their description is kept brief; for the relevant details, refer to the description of the methods.
It should be understood that in the present disclosure, "at least one (item)" means one or more, and "a plurality of" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A exists, only B exists, or both A and B exist, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following items" or similar expressions refer to any combination of those items, including any combination of single items or plural items. For example, at least one of a, b, or c may denote: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where each of a, b, and c may be single or multiple.
It should also be noted that, herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above description of the disclosed embodiments enables those skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present disclosure. Therefore, the present disclosure is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

  1. A method for training a medical report generation model, comprising:
    inputting a source image into a first encoder to obtain a first image feature, and inputting the source image into a second encoder to obtain a second image feature, wherein the source image corresponds to a medical text label;
    inputting a target image into a third encoder to obtain a third image feature, and inputting the target image into the second encoder to obtain a fourth image feature;
    inputting the second image feature into a text generator to obtain a first medical report text;
    inputting the fourth image feature into the text generator to obtain a second medical report text;
    inputting the first medical report text into a discriminator to obtain a first discrimination result;
    inputting the second medical report text into the discriminator to obtain a second discrimination result;
    calculating a source image-specific loss according to the first image feature and the second image feature, and calculating a target image-specific loss according to the third image feature and the fourth image feature;
    calculating a cross-entropy loss according to the first medical report text and the medical text label corresponding to the source image;
    calculating a first adversarial loss according to the first discrimination result, and calculating a second adversarial loss and a third adversarial loss according to the second discrimination result; and
    training the first encoder, the second encoder, the third encoder, the text generator, and the discriminator according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss, and repeating the step of inputting the source image into the first encoder and the subsequent steps until a preset condition is met.
  2. The method according to claim 1, further comprising:
    inputting the first image feature and the second image feature into a first decoder to obtain a reconstructed source image;
    inputting the third image feature and the fourth image feature into a second decoder to obtain a reconstructed target image; and
    calculating a source image perceptual loss according to the source image and the reconstructed source image, and calculating a target image perceptual loss according to the target image and the reconstructed target image;
    wherein the training the first encoder, the second encoder, the third encoder, the text generator, and the discriminator according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss comprises:
    training the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, the third adversarial loss, the source image perceptual loss, and the target image perceptual loss.
  3. The method according to claim 2, wherein some of the target images correspond to medical text labels, the method further comprising:
    determining a first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image;
    determining a second score according to the source image-specific loss and the target image-specific loss;
    if the target image corresponds to a medical text label, calculating a natural language evaluation metric as a third score according to the second medical report text and the medical text label corresponding to the target image;
    computing a weighted sum of the first score, the second score, and the third score to obtain a reward value; and
    retraining the first encoder, the second encoder, the third encoder, the text generator, the discriminator, the first decoder, and the second decoder with the goal of maximizing the reward value.
  4. The method according to any one of claims 1-3, further comprising:
    inputting a training image into a first image feature extraction network to obtain a fifth image feature, inputting the fifth image feature into a first classification network to obtain a first predicted classification result of the training image, and training the first image feature extraction network and the first classification network according to the first predicted classification result of the training image and the classification label corresponding to the training image;
    inputting the training image into a second image feature extraction network to obtain a sixth image feature, inputting the sixth image feature into a second classification network to obtain a second predicted classification result of the training image, and training the second image feature extraction network and the second classification network according to the second predicted classification result of the training image and the classification label corresponding to the training image, wherein the first image feature extraction network and the second image feature extraction network have different network structures; and
    taking the model parameters of the trained first image feature extraction network as the initial model parameters of the first encoder and the third encoder, and taking the model parameters of the trained second image feature extraction network as the initial model parameters of the second encoder, wherein the first image feature extraction network has the same network structure as the first encoder and the third encoder, and the second image feature extraction network has the same network structure as the second encoder.
  5. The method according to any one of claims 1-3, further comprising:
    randomly initializing the initial model parameters of the first encoder, the second encoder, and the third encoder.
  6. The method according to any one of claims 1-5, wherein the first discrimination result includes a first probability value indicating, for each token in the first medical report text, whether the token was generated from the source image, and the second discrimination result includes a second probability value indicating, for each token in the second medical report text, whether the token was generated from the source image;
    the calculating a first adversarial loss according to the first discrimination result comprises:
    taking the logarithm of each first probability value and summing the logarithms to obtain a first summation result, and taking the negative of the first summation result to obtain the first adversarial loss; and
    the calculating a second adversarial loss and a third adversarial loss according to the second discrimination result comprises:
    taking the logarithm of each second probability value and summing the logarithms to obtain a second summation result, and taking the negative of the second summation result to obtain the second adversarial loss; and
    computing the difference between 1 and each second probability value and summing the differences to obtain a third summation result, and taking the negative of the third summation result to obtain the third adversarial loss.
  7. The method according to claim 2 or 3, wherein the calculating a source image perceptual loss according to the source image and the reconstructed source image comprises:
    inputting the source image into a third image feature extraction network, and obtaining the seventh image features output by each feature extraction layer of the third image feature extraction network;
    inputting the reconstructed source image into the third image feature extraction network, and obtaining the eighth image features output by each feature extraction layer of the third image feature extraction network;
    for each feature extraction layer, calculating the source image loss corresponding to that layer according to the seventh image feature and the eighth image feature output by the layer and the weight corresponding to the layer; and
    summing the source image losses corresponding to the feature extraction layers to obtain the source image perceptual loss; and
    the calculating a target image perceptual loss according to the target image and the reconstructed target image comprises:
    inputting the target image into the third image feature extraction network, and obtaining the ninth image features output by each feature extraction layer of the third image feature extraction network;
    inputting the reconstructed target image into the third image feature extraction network, and obtaining the tenth image features output by each feature extraction layer of the third image feature extraction network;
    for each feature extraction layer, calculating the target image loss corresponding to that layer according to the ninth image feature and the tenth image feature output by the layer and the weight corresponding to the layer; and
    summing the target image losses corresponding to the feature extraction layers to obtain the target image perceptual loss.
  8. The method according to claim 3, wherein the determining a first score according to the difference between the source image and the reconstructed source image and the difference between the target image and the reconstructed target image comprises:
    inputting the source image into a third image feature extraction network, and obtaining the eleventh image feature output by the third image feature extraction network;
    inputting the reconstructed source image into the third image feature extraction network, and obtaining the twelfth image feature output by the third image feature extraction network;
    obtaining a first difference value according to the difference between the eleventh image feature and the twelfth image feature;
    inputting the target image into the third image feature extraction network, and obtaining the thirteenth image feature output by the third image feature extraction network;
    inputting the reconstructed target image into the third image feature extraction network, and obtaining the fourteenth image feature output by the third image feature extraction network;
    obtaining a second difference value according to the difference between the thirteenth image feature and the fourteenth image feature; and
    summing the first difference value and the second difference value to obtain a fourth summation result, and taking the negative of the fourth summation result to obtain the first score.
  9. The method according to claim 3 or 8, wherein the determining a second score according to the source image-specific loss and the target image-specific loss comprises:
    summing the source image-specific loss and the target image-specific loss to obtain a fifth summation result, and taking the negative of the fifth summation result to obtain the second score.
  10. A medical report generation method, comprising:
    inputting a medical image into an encoder to obtain medical image features; and
    inputting the medical image features into a text generator to obtain a medical report text;
    wherein the encoder is the second encoder trained by the method for training a medical report generation model according to any one of claims 1-9; and
    the text generator is the text generator trained by the method for training a medical report generation model according to any one of claims 1-9.
  11. An apparatus for training a medical report generation model, comprising:
    a first input unit, configured to input a source image into a first encoder to obtain a first image feature, and input the source image into a second encoder to obtain a second image feature, wherein the source image corresponds to a medical text label;
    a second input unit, configured to input a target image into a third encoder to obtain a third image feature, and input the target image into the second encoder to obtain a fourth image feature;
    a third input unit, configured to input the second image feature into a text generator to obtain a first medical report text;
    a fourth input unit, configured to input the fourth image feature into the text generator to obtain a second medical report text;
    a fifth input unit, configured to input the first medical report text into a discriminator to obtain a first discrimination result;
    a sixth input unit, configured to input the second medical report text into the discriminator to obtain a second discrimination result;
    a first calculating unit, configured to calculate a source image-specific loss according to the first image feature and the second image feature, and calculate a target image-specific loss according to the third image feature and the fourth image feature;
    a second calculating unit, configured to calculate a cross-entropy loss according to the first medical report text and the medical text label corresponding to the source image;
    a third calculating unit, configured to calculate a first adversarial loss according to the first discrimination result, and calculate a second adversarial loss and a third adversarial loss according to the second discrimination result; and
    an execution unit, configured to train the first encoder, the second encoder, the third encoder, the text generator, and the discriminator according to the source image-specific loss, the target image-specific loss, the cross-entropy loss, the first adversarial loss, the second adversarial loss, and the third adversarial loss, and to repeat the inputting of the source image into the first encoder and the subsequent steps until a preset condition is met.
  12. A medical report generation apparatus, comprising:
    an input unit, configured to input a medical image into an encoder to obtain medical image features; and
    a generating unit, configured to input the medical image features into a text generator to obtain a medical report text;
    wherein the encoder is the second encoder trained by the method for training a medical report generation model according to any one of claims 1-9; and
    the text generator is the text generator trained by the method for training a medical report generation model according to any one of claims 1-9.
  13. An electronic device, comprising:
    one or more processors; and
    a storage device storing one or more programs,
    which, when executed by the one or more processors, cause the one or more processors to implement the method for training a medical report generation model according to any one of claims 1-9, or to implement the medical report generation method according to claim 10.
  14. A computer-readable medium storing a computer program which, when executed by a processor, implements the method for training a medical report generation model according to any one of claims 1-9, or implements the medical report generation method according to claim 10.
PCT/CN2022/107921 2021-08-31 2022-07-26 Medical report generation method and apparatus, model training method and apparatus, and device WO2023029817A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111013687.9 2021-08-31
CN202111013687.9A CN113539408B (en) 2021-08-31 2021-08-31 Medical report generation method, training device and training equipment of model

Publications (1)

Publication Number Publication Date
WO2023029817A1 true WO2023029817A1 (en) 2023-03-09

Family

ID=78092338

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/107921 WO2023029817A1 (en) 2021-08-31 2022-07-26 Medical report generation method and apparatus, model training method and apparatus, and device

Country Status (2)

Country Link
CN (1) CN113539408B (en)
WO (1) WO2023029817A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117174240A (en) * 2023-10-26 2023-12-05 中国科学技术大学 Medical image report generation method based on large model field migration
CN117198514A (en) * 2023-11-08 2023-12-08 中国医学科学院北京协和医院 Vulnerable plaque identification method and system based on CLIP model
CN117393100A (en) * 2023-12-11 2024-01-12 安徽大学 Diagnostic report generation method, model training method, system, equipment and medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539408B (en) * 2021-08-31 2022-02-25 北京字节跳动网络技术有限公司 Medical report generation method, training device and training equipment of model
CN116631566B (en) * 2023-05-23 2024-05-24 广州合昊医疗科技有限公司 Medical image report intelligent generation method based on big data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180225823A1 (en) * 2017-02-09 2018-08-09 Siemens Healthcare Gmbh Adversarial and Dual Inverse Deep Learning Networks for Medical Image Analysis
US20200334809A1 (en) * 2019-04-16 2020-10-22 Covera Health Computer-implemented machine learning for detection and statistical analysis of errors by healthcare providers
CN112992308A (en) * 2021-03-25 2021-06-18 腾讯科技(深圳)有限公司 Training method of medical image report generation model and image report generation method
CN113129309A (en) * 2021-03-04 2021-07-16 同济大学 Medical image semi-supervised segmentation system based on object context consistency constraint
CN113539408A (en) * 2021-08-31 2021-10-22 北京字节跳动网络技术有限公司 Medical report generation method, training device and training equipment of model

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107767928A (en) * 2017-09-15 2018-03-06 深圳市前海安测信息技术有限公司 Medical image report preparing system and method based on artificial intelligence
CN109147890B (en) * 2018-05-14 2020-04-24 平安科技(深圳)有限公司 Method and equipment for generating medical report
CN111063410B (en) * 2019-12-20 2024-01-09 京东方科技集团股份有限公司 Method and device for generating medical image text report
CN112529857B (en) * 2020-12-03 2022-08-23 重庆邮电大学 Ultrasonic image diagnosis report generation method based on target detection and strategy gradient
CN112614561A (en) * 2020-12-24 2021-04-06 北京工业大学 Brain CT medical report generation method based on hierarchical self-attention sequence coding

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180225823A1 (en) * 2017-02-09 2018-08-09 Siemens Healthcare Gmbh Adversarial and Dual Inverse Deep Learning Networks for Medical Image Analysis
US20200334809A1 (en) * 2019-04-16 2020-10-22 Covera Health Computer-implemented machine learning for detection and statistical analysis of errors by healthcare providers
CN113129309A (en) * 2021-03-04 2021-07-16 同济大学 Medical image semi-supervised segmentation system based on object context consistency constraint
CN112992308A (en) * 2021-03-25 2021-06-18 腾讯科技(深圳)有限公司 Training method of medical image report generation model and image report generation method
CN113539408A (en) * 2021-08-31 2021-10-22 北京字节跳动网络技术有限公司 Medical report generation method, training device and training equipment of model

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117174240A (en) * 2023-10-26 2023-12-05 中国科学技术大学 Medical image report generation method based on large model field migration
CN117174240B (en) * 2023-10-26 2024-02-09 中国科学技术大学 Medical image report generation method based on large model field migration
CN117198514A (en) * 2023-11-08 2023-12-08 中国医学科学院北京协和医院 Vulnerable plaque identification method and system based on CLIP model
CN117198514B (en) * 2023-11-08 2024-01-30 中国医学科学院北京协和医院 Vulnerable plaque identification method and system based on CLIP model
CN117393100A (en) * 2023-12-11 2024-01-12 安徽大学 Diagnostic report generation method, model training method, system, equipment and medium
CN117393100B (en) * 2023-12-11 2024-04-05 安徽大学 Diagnostic report generation method, model training method, system, equipment and medium

Also Published As

Publication number Publication date
CN113539408A (en) 2021-10-22
CN113539408B (en) 2022-02-25

Similar Documents

Publication Publication Date Title
WO2023029817A1 (en) Medical report generation method and apparatus, model training method and apparatus, and device
Lella et al. Automatic diagnosis of COVID-19 disease using deep convolutional neural network with multi-feature channel from respiratory sound data: cough, voice, and breath
Alkhodari et al. Detection of COVID-19 in smartphone-based breathing recordings: A pre-screening deep learning tool
US11288279B2 (en) Cognitive computer assisted attribute acquisition through iterative disclosure
TW202040585A (en) Method and apparatus for automated target and tissue segmentation using multi-modal imaging and ensemble machine learning models
JP7357614B2 (en) Machine-assisted dialogue system, medical condition interview device, and method thereof
US11663057B2 (en) Analytics framework for selection and execution of analytics in a distributed environment
US11334806B2 (en) Registration, composition, and execution of analytics in a distributed environment
WO2021098534A1 (en) Similarity determining method and device, network training method and device, search method and device, and electronic device and storage medium
US10847261B1 (en) Methods and systems for prioritizing comprehensive diagnoses
WO2023185516A1 (en) Method and apparatus for training image recognition model, and recognition method and apparatus, and medium and device
CN111523593B (en) Method and device for analyzing medical images
JP2023553586A (en) Non-invasive method for the detection of pulmonary hypertension
WO2022033462A9 (en) Method and apparatus for generating prediction information, and electronic device and medium
CN113220895B (en) Information processing method and device based on reinforcement learning and terminal equipment
CN117149998B (en) Intelligent diagnosis recommendation method and system based on multi-objective optimization
Zhang et al. A two-stage federated transfer learning framework in medical images classification on limited data: A COVID-19 case study
CN112397195B (en) Method, apparatus, electronic device and medium for generating physical examination model
WO2023011397A1 (en) Method for generating acoustic features, training speech models and speech recognition, and device
CN113343664B (en) Method and device for determining matching degree between image texts
CN115662510A (en) Method, device and equipment for determining causal parameters and storage medium
Vyas et al. Classification of COVID-19 Cases: The Customized Deep Convolutional Neural Network and Transfer Learning Approach.
Abhishek et al. The Auscultation Sound Classification Era of the Future
JP2021507392A (en) Learning and applying contextual similarities between entities
WO2023053189A1 (en) Information processing device, information processing method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22862958

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE