WO2022199462A1 - Training method for medical image report generation model, and image report generation method - Google Patents

Training method for medical image report generation model, and image report generation method

Info

Publication number
WO2022199462A1
Authority
WO
WIPO (PCT)
Prior art keywords
medical image
visual feature
network
loss function
function value
Prior art date
Application number
PCT/CN2022/081537
Other languages
English (en)
French (fr)
Inventor
边成
Original Assignee
Tencent Technology (Shenzhen) Co., Ltd. (腾讯科技(深圳)有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Publication of WO2022199462A1
Priority to US18/073,290 (published as US20230092027A1)

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 Computing arrangements based on biological models
            • G06N 3/02 Neural networks
              • G06N 3/04 Architecture, e.g. interconnection topology
                • G06N 3/045 Combinations of networks
              • G06N 3/08 Learning methods
          • G06N 20/00 Machine learning
        • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T 7/00 Image analysis
            • G06T 7/0002 Inspection of images, e.g. flaw detection
              • G06T 7/0012 Biomedical image inspection
          • G06T 2207/00 Indexing scheme for image analysis or image enhancement
            • G06T 2207/10 Image acquisition modality
              • G06T 2207/10072 Tomographic images
                • G06T 2207/10081 Computed x-ray tomography [CT]
                • G06T 2207/10088 Magnetic resonance imaging [MRI]
                • G06T 2207/10104 Positron emission tomography [PET]
              • G06T 2207/10116 X-ray image
            • G06T 2207/20 Special algorithmic details
              • G06T 2207/20081 Training; Learning
              • G06T 2207/20084 Artificial neural networks [ANN]
            • G06T 2207/30 Subject of image; Context of image processing
              • G06T 2207/30004 Biomedical image processing
      • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
        • G16H HEALTHCARE INFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
          • G16H 10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
            • G16H 10/60 ICT specially adapted for patient-specific data, e.g. for electronic patient records
          • G16H 15/00 ICT specially adapted for medical reports, e.g. generation or transmission thereof
          • G16H 30/00 ICT specially adapted for the handling or processing of medical images
            • G16H 30/20 ICT specially adapted for handling medical images, e.g. DICOM, HL7 or PACS
            • G16H 30/40 ICT specially adapted for processing medical images, e.g. editing
          • G16H 50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
            • G16H 50/20 ICT specially adapted for computer-aided diagnosis, e.g. based on medical expert systems
            • G16H 50/70 ICT specially adapted for mining of medical data, e.g. analysing previous cases of other patients

Definitions

  • The embodiments of the present application relate to the technical field of artificial intelligence, and in particular to a training method for a medical image report generation model and an image report generation method.
  • Medical images, also known as medical imaging, refer to images of internal tissue obtained in a non-invasive manner from the human body or a part of the human body.
  • At present, clinics still rely on manual methods to read medical images and write the corresponding image reports. This leads to low efficiency in image report generation, and for junior doctors the problem of inaccurate report writing is prone to occur.
  • The embodiments of the present application provide a training method for a medical image report generation model and an image report generation method, so as to realize the automatic generation of medical image reports with high accuracy.
  • The technical solution is as follows:
  • According to one aspect, a training method for a medical image report generation model is provided. The method is executed by a computer device, and the medical image report generation model includes a visual feature extraction network, an encoding network and a decoding network. The method includes:
  • acquiring a sample medical image;
  • performing visual feature extraction processing on the sample medical image through the visual feature extraction network to obtain a visual feature sequence of the sample medical image;
  • splicing a self-learning label on the basis of the visual feature sequence to obtain input information of the encoding network;
  • encoding the input information through the encoding network to obtain a visual encoding feature vector corresponding to the visual feature sequence and an output task result corresponding to the self-learning label;
  • decoding the visual encoding feature vector through the decoding network to obtain an output image report corresponding to the sample medical image;
  • calculating a total loss function value of the medical image report generation model based on the output image report and the output task result, and adjusting parameters of the medical image report generation model according to the total loss function value.
  • According to another aspect, an image report generation method based on a medical image report generation model is provided. The method is executed by a computer device, and the medical image report generation model includes a visual feature extraction network, an encoding network and a decoding network. The method includes:
  • performing feature extraction processing on a target medical image through the visual feature extraction network to obtain a visual feature sequence of the target medical image;
  • splicing a self-learning label on the basis of the visual feature sequence to obtain input information of the encoding network;
  • encoding the input information through the encoding network to obtain a visual encoding feature vector corresponding to the visual feature sequence;
  • decoding the visual encoding feature vector through the decoding network to obtain an output image report corresponding to the target medical image.
  • According to another aspect, a training device for a medical image report generation model is provided, where the medical image report generation model includes a visual feature extraction network, an encoding network and a decoding network, and the device includes:
  • a sample acquisition module for acquiring sample medical images
  • a feature extraction module configured to perform visual feature extraction processing on the sample medical image through the visual feature extraction network to obtain a visual feature sequence of the sample medical image
  • an information splicing module for splicing self-learning labels on the basis of the visual feature sequence to obtain the input information of the encoding network
  • an encoding processing module configured to perform encoding processing on the input information through the encoding network, to obtain a visual encoding feature vector corresponding to the visual feature sequence, and an output task result corresponding to the self-learning label;
  • a decoding processing module configured to perform decoding processing on the visual encoding feature vector through the decoding network to obtain an output image report corresponding to the sample medical image
  • a loss calculation module configured to calculate the total loss function value of the medical image report generation model based on the output image report and the output task result;
  • a model parameter adjustment module configured to adjust parameters of the medical image report generation model according to the total loss function value.
  • According to another aspect, an image report generation device based on a medical image report generation model is provided, where the medical image report generation model includes a visual feature extraction network, an encoding network and a decoding network, and the device includes:
  • a feature extraction module configured to perform feature extraction processing on the target medical image through the visual feature extraction network to obtain a visual feature sequence of the target medical image
  • an information splicing module for splicing self-learning labels on the basis of the visual feature sequence to obtain the input information of the encoding network
  • a coding processing module configured to perform coding processing on the input information through the coding network to obtain a visual coding feature vector corresponding to the visual feature sequence
  • a decoding processing module configured to perform decoding processing on the visual encoding feature vector through the decoding network to obtain an output image report corresponding to the target medical image.
  • According to another aspect, a computer device is provided, including a processor and a memory, where the memory stores at least one instruction, at least one program, a code set or an instruction set, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by the processor to implement the above-mentioned training method for a medical image report generation model, or the above-mentioned image report generation method based on the medical image report generation model.
  • According to another aspect, a computer-readable storage medium is provided, where at least one instruction, at least one program, a code set or an instruction set is stored in the storage medium, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by a processor to implement the above-mentioned training method for a medical image report generation model, or the above-mentioned image report generation method based on the medical image report generation model.
  • According to another aspect, a computer program product or computer program is provided, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device executes the above-mentioned training method for the medical image report generation model, or the above-mentioned image report generation method based on the medical image report generation model.
  • This application provides a technical solution for automatically generating a medical image report based on an AI (Artificial Intelligence) model.
  • During the model training process, in addition to letting the model complete the main task (i.e., generating an image report), the model also completes other tasks in parallel (such as predicting task results); task results refer to the output results of the related tasks of self-supervised training.
  • In this way, the intra-class differences can be further expanded, and the feature extraction ability of the network for the input image can be improved. This improves the robustness of the model network to different images and its ability to recognize images, thereby indirectly enhancing the image-text conversion performance of the model and making the model output more accurate and reliable medical image reports.
  • FIG. 1 is a schematic diagram of a solution implementation environment provided by an embodiment of the present application.
  • FIG. 2 is a flowchart of a training method for a medical image report generation model provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of a training method for a medical image report generation model provided by another embodiment of the present application.
  • FIG. 4 is a flowchart of a training method for a medical image report generation model provided by another embodiment of the present application.
  • FIG. 5 is an architecture diagram of a medical image report generation model provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a Transformer structure provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a multi-head attention mechanism provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of position coding provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a masked multi-head attention mechanism provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a calculation process of an attention vector provided by an embodiment of the present application.
  • FIG. 11 is a flowchart of a method for generating an image report provided by an embodiment of the present application.
  • FIG. 12 is a block diagram of a training device for a medical image report generation model provided by an embodiment of the present application.
  • FIG. 13 is a block diagram of an image report generating apparatus provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • FIG. 1 shows a schematic diagram of a solution implementation environment provided by an embodiment of the present application.
  • the solution implementation environment may include a model training device 10 and a model using device 20 .
  • the model training device 10 may be a computer device such as a computer, a server, etc., for training a medical image report generation model.
  • the medical image report generation model is a machine learning model for automatically generating corresponding image reports based on medical images, and the model training device 10 can use machine learning to train the medical image report generation model, In order to make it have better automatic generation performance of medical image report.
  • the trained medical image report generation model can be deployed in the model using device 20 for use.
  • the model-using device 20 may be a terminal device such as a mobile phone, a tablet computer, a PC (Personal Computer), a smart TV, a multimedia playback device, a medical device, etc., or a server.
  • the model-using device 20 can automatically generate a medical image report through the medical image report generating model.
  • the medical image report generation model provided by this application can automatically generate a text report in the form of natural language.
  • the automatically generated medical image report can assist the doctor in diagnosing the condition, reduce the workload of the doctor, and help improve the generation efficiency of the medical image report.
  • the medical image report generation model includes a visual feature extraction network, an encoding network, and a decoding network.
  • For the introduction and description of the visual feature extraction network and each of the other networks, please refer to the following embodiments.
  • The type of medical image is not limited; for example, it can be an X-ray image, a CT (computed tomography) image, a PET (positron emission tomography) image, an MRI (magnetic resonance imaging) image, a medical ultrasound image, a medical microscope image, etc.
  • the body parts targeted by the medical images are also not limited, including but not limited to abdomen, internal organs, bones, head, blood vessels, and the like.
  • the medical images may also be medical images for animals such as cats and dogs, and corresponding image reports may also be automatically generated by using the technical solutions of the present application.
  • FIG. 2 shows a flowchart of a training method for a medical image report generation model provided by an embodiment of the present application.
  • The execution subject of each step of the method may be the model training device 10 in the above-mentioned embodiment of FIG. 1, such as a computer or a server.
  • the method may include the following steps (210-260):
  • Step 210: Acquire a sample medical image.
  • Step 220: Perform visual feature extraction processing on the sample medical image through a visual feature extraction network to obtain a visual feature sequence of the sample medical image.
  • Step 230: On the basis of the visual feature sequence, splice the self-learning label to obtain the input information of the encoding network.
  • the visual feature sequence is not directly used as the input information of the coding network, but the self-learning labels are spliced on the basis of the visual feature sequence to obtain the input information of the coding network.
  • the self-learning label is used to learn the image feature information from the visual feature sequence through the processing of the coding network, so as to predict the task result of the sample medical image.
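  • As an illustrative sketch of this splicing step (in PyTorch, with hypothetical module and field names and an assumed 512-dimensional feature space; the patent does not prescribe these details), a learnable self-learning label can be concatenated onto the visual feature sequence, much like a ViT-style class token:

```python
import torch
import torch.nn as nn

class LabelSplicer(nn.Module):
    """Hypothetical sketch: splice a learnable self-learning label
    onto the visual feature sequence to form the encoder input."""
    def __init__(self, dim: int = 512):
        super().__init__()
        # Initialized as an all-zero vector, as described later in the text.
        self.self_learning_label = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, visual_seq: torch.Tensor) -> torch.Tensor:
        # visual_seq: (batch, num_units, dim)
        label = self.self_learning_label.expand(visual_seq.size(0), -1, -1)
        # Input information = [self-learning label ; visual feature sequence]
        return torch.cat([label, visual_seq], dim=1)

x = torch.randn(2, 25, 512)        # e.g. a 5 x 5 grid of feature units
print(LabelSplicer()(x).shape)     # torch.Size([2, 26, 512])
```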
  • Step 240: Encode the input information through an encoding network to obtain a visual encoding feature vector corresponding to the visual feature sequence and an output task result corresponding to the self-learning label.
  • the input information is encoded by the encoding network to obtain the encoded feature vector.
  • The output category result corresponding to the image category label refers to the category of the sample medical image predicted by the encoding network (such as the diagnosis results introduced above), and the output task result corresponding to the self-learning label refers to the task result of the sample medical image predicted by the encoding network (such as the rotation angle introduced above).
  • Step 250: Decode the visual encoding feature vector through a decoding network to obtain an output image report corresponding to the sample medical image.
  • Step 260: Calculate the total loss function value of the medical image report generation model based on the output image report and the output task result, and adjust the parameters of the medical image report generation model according to the total loss function value.
  • the present application provides a technical solution for automatically generating medical image reports based on AI models.
  • During the model training process, in addition to letting the model complete the main task (i.e., generating image reports), other tasks are also completed in parallel.
  • task results refer to the output results of related tasks of self-supervised training.
  • In this way, the intra-class differences can be further expanded, the feature extraction ability of the network for input images can be increased, and the robustness of the model network to different images can be improved. This improves the image recognition ability of the model network, thereby indirectly enhancing the image-text conversion performance of the model and making the model output more accurate and reliable medical image reports.
  • FIG. 3 shows a flowchart of a training method for a medical image report generation model provided by another embodiment of the present application.
  • The execution subject of each step of the method may be the model training device 10 in the above-mentioned embodiment of FIG. 1, such as a computer or a server.
  • the method may include the following steps (310-360):
  • Step 310: Acquire a sample medical image.
  • a sample medical image refers to a medical image used as a training sample. Sample medical images can be selected from some known datasets.
  • the sample medical image has a corresponding target image report, and the target image report refers to an image report that is manually generated and verified, and has a pathological description for the sample medical image.
  • the medical images and corresponding image reports in the above data set must meet the following requirements, so that they can be used as sample medical images and corresponding target image reports.
  • Medical images need to be standardized images, such as 2D or 3D X-ray images, CT images, PET images, magnetic resonance images, medical ultrasound images or medical microscope images, and the images must meet the acquisition-area and quality requirements.
  • the image report corresponding to the medical image needs to be a structured report, a text-based document written by a qualified radiologist, containing descriptive information and explanations about the patient's medical history, symptoms, and lesions contained in the medical image.
  • the image report corresponding to the medical image is a structured report including the following four parts: impression, finding, comparison and indication.
  • In the impression section, the radiologist makes a diagnosis based on the guidance of the findings section, the patient's clinical history, and the imaging study.
  • In the findings section, the radiological observations of the body parts examined during the imaging study are listed.
  • the comparison part and the indication part have little to do with the content of this application and will not be described in detail.
  • Step 320: Perform visual feature extraction processing on the sample medical image through a visual feature extraction network to obtain a visual feature sequence of the sample medical image.
  • a visual feature extraction network is a neural network for extracting visual features of medical images.
  • In some embodiments, the visual feature extraction network may be a CNN (convolutional neural network); CNNs have good performance in handling computer vision related tasks.
  • this step includes the following sub-steps:
  • The visual feature information can be a feature map output after the sample medical image passes through the visual feature extraction network. The feature map records the visual features of the sample medical image, including but not limited to the color features, texture features, shape features, spatial relationship features and other image features of the image.
  • the color feature is a global feature that describes the surface properties of the scene corresponding to the image or image area.
  • Texture feature is also a global feature, which also describes the surface properties of the scene corresponding to the image or image area.
  • There are two types of representation methods for shape features: one is contour features and the other is regional features.
  • The contour features of an image mainly target the outer boundary of an object, while the regional features of an image relate to the entire shape region.
  • Spatial relationship features refer to the mutual spatial positions or relative directional relationships among the multiple objects segmented from an image. These relationships can be divided into connection/adjacency relationships, overlap/occlusion relationships, and inclusion/containment relationships.
  • the feature map corresponding to the visual feature information is divided into a plurality of feature map sub-blocks by performing block processing, and each feature map sub-block corresponds to a visual feature unit.
  • For example, the feature map corresponding to the visual feature information is divided into a 5 × 5 grid of feature map sub-blocks, with each feature map sub-block having the same size.
  • the visual feature information represented in the form of a feature map can be converted into a vector form to represent it.
  • The feature vector (embedding) corresponding to each visual feature unit can be obtained, and the feature vectors corresponding to the visual feature units can then be arranged in order to obtain the visual feature sequence; the visual feature sequence is a vector sequence.
  • In some embodiments, the position vector corresponding to each visual feature unit can also be taken into account; the position vector is used to represent the relative or absolute position of the visual feature unit within the entire visual feature information (that is, the feature map).
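  • One possible reading of these sub-steps, sketched in PyTorch (the channel count, embedding size and 5 × 5 grid are illustrative assumptions), pools the feature map into sub-blocks, embeds each block, and adds a learnable position vector per block:

```python
import torch
import torch.nn as nn

class VisualSequenceBuilder(nn.Module):
    """Hypothetical sketch: feature map -> 5x5 sub-blocks -> visual
    feature sequence with per-block position vectors."""
    def __init__(self, channels: int = 2048, dim: int = 512, grid: int = 5):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(grid)     # grid x grid sub-blocks
        self.embed = nn.Linear(channels, dim)      # sub-block -> embedding
        self.pos = nn.Parameter(torch.zeros(1, grid * grid, dim))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (batch, channels, H, W) feature map from the CNN
        blocks = self.pool(feat)                   # (b, c, grid, grid)
        seq = blocks.flatten(2).transpose(1, 2)    # (b, grid*grid, c)
        return self.embed(seq) + self.pos          # visual feature sequence
```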
  • Step 330: On the basis of the visual feature sequence, splice the image category label and the self-learning label to obtain the input information of the encoding network.
  • the visual feature sequence is not directly used as the input information of the encoding network, but the image category label and the self-learning label are spliced on the basis of the visual feature sequence to obtain the input information of the encoding network.
  • the image category label is used to learn the image feature information from the visual feature sequence through the processing of the encoding network, so as to predict the category of the sample medical image.
  • the self-learning label is used to learn the image feature information from the visual feature sequence through the processing of the encoding network to predict the task result of the sample medical image.
  • the categories of the medical images can be classified from the diagnosis results of the medical images, for example, including different categories such as fractures, cardiac hypertrophy, pneumonia, and pulmonary edema.
  • the classification task here can also classify images from other perspectives, such as recognizing the types of organs contained in the images, recognizing the disease grading of lesions in the images, etc. Other classification tasks are not limited in this application.
  • In this way, the diagnosis results of medical images can be classified, thereby improving the robustness of the model network to images of different categories, further enhancing the image-text conversion performance of the model, and making the model output more accurate and reliable medical image reports.
  • the above task results refer to output results of related tasks of self-supervised training.
  • the present application introduces a self-supervised training method, which can further expand the intra-class differences and increase the feature extraction capability of the network for input images.
  • the related tasks of self-supervised training can be set according to requirements.
  • For example, the task can be to judge the rotation angle of the input image, such as judging by how many multiples of 90 degrees the input image has been rotated.
  • The input sample medical image can randomly be left unrotated, or rotated by 90 degrees, 180 degrees or 270 degrees.
  • The rotated image is input to the visual feature extraction network for subsequent processing, and the encoding network outputs the corresponding task result, that is, the prediction result for the rotation angle.
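  • A minimal sketch of this rotation pretext task, assuming PyTorch tensors in (C, H, W) layout (the function name is hypothetical):

```python
import random
import torch

def random_quarter_rotation(image: torch.Tensor):
    """Rotate an image by a random multiple of 90 degrees; the multiple
    k serves as the target task result for self-supervised training."""
    k = random.randint(0, 3)                 # 0/1/2/3 -> 0/90/180/270 deg
    return torch.rot90(image, k, dims=(1, 2)), k
```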
  • Step 340: Encode the input information through the encoding network to obtain the visual encoding feature vector corresponding to the visual feature sequence, the output category result corresponding to the image category label, and the output task result corresponding to the self-learning label.
  • the input information is encoded by the encoding network to obtain the encoded feature vector.
  • We extract three parts of information from the encoded feature vector: one part is the visual encoding feature vector corresponding to the visual feature sequence, another is the output category result corresponding to the image category label, and the third is the output task result corresponding to the self-learning label.
  • The output category result corresponding to the image category label refers to the category of the sample medical image predicted by the encoding network (such as the diagnosis results introduced above), and the output task result corresponding to the self-learning label refers to the task result of the sample medical image predicted by the encoding network (such as the rotation angle introduced above).
  • Step 350: Decode the visual encoding feature vector through a decoding network to obtain an output image report corresponding to the sample medical image.
  • the visual encoding feature vector is further sent to the decoding network for decoding processing, and the decoding network outputs the output image report corresponding to the sample medical image.
  • the image report output by the decoding network is an image report in text form, and the image report in text form is a sentence/paragraph report conforming to natural language specifications, rather than some simple keywords.
  • Step 360: Calculate the total loss function value of the medical image report generation model based on the output image report, the output category result and the output task result, and adjust the parameters of the medical image report generation model according to the total loss function value.
  • In this embodiment, in addition to completing the main task (i.e., generating an image report), the model also completes other tasks (including determining the image category and the task result) in parallel. Therefore, the loss function of the model includes not only the loss function corresponding to the main task but also the loss functions corresponding to the other tasks; finally, the various loss functions are added together to obtain the total loss function of the model.
  • the total loss function value of the model is calculated as follows:
  • the output image report refers to the image report output by the model, specifically the image report output by the decoding network.
  • the target image report has been introduced above, which refers to the image report manually written by a professional doctor. By comparing the difference between the output image report and the target image report, the performance of the model in report generation can be obtained.
  • the first loss function may be a cross-entropy loss for measuring the dissimilarity between the output image report and the target image report. Therefore, training the medical image report generation model based on the first loss function can improve the accuracy and reliability of the medical image report generated by the model.
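  • Assuming the decoding network produces per-token probability distributions over a report vocabulary (a common formulation; the vocabulary size and padding id below are illustrative, not specified by the patent), the first loss function could be computed roughly as follows:

```python
import torch
import torch.nn.functional as F

vocab = 10000                                  # assumed vocabulary size
logits = torch.randn(2, 40, vocab)             # (batch, report length, vocab)
target = torch.randint(1, vocab, (2, 40))      # target image report token ids
loss_report = F.cross_entropy(logits.reshape(-1, vocab),
                              target.reshape(-1),
                              ignore_index=0)  # assume id 0 pads short reports
```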
  • the output category result refers to the category result output by the model, specifically, the category result output by the encoding network.
  • the target category result refers to the accurate category result.
  • Extract the information of the specified field from the target image report corresponding to the sample medical image, and perform semantic recognition on the information of the specified field to obtain the target category result corresponding to the sample medical image.
  • For example, extract the information of the impression part from the target image report, perform semantic recognition on this part of the information to obtain the diagnosis result, and use the diagnosis result as the target category result.
  • the target category result can be obtained by performing semantic recognition on the information of the specified field without analyzing the full text of the target image report, thereby reducing the time for semantic recognition and saving the processing resources of computer equipment.
  • the second loss function may be a cross-entropy loss for measuring the difference between the output category result and the target category result.
  • the target category result can be automatically extracted from the target image report corresponding to the sample medical image, thereby eliminating the need to manually annotate the target category result and improving the training efficiency of the model.
  • training the medical image report generation model based on the second loss function can improve the accuracy of the model's category judgment, thereby further improving the accuracy and reliability of the medical image report generated by the model.
  • the output task result refers to the task result output by the model, specifically the task result output by the encoding network.
  • the target task result refers to the exact task result.
  • the target task result is used to indicate the actual rotation angle of the sample medical image
  • the output task result is used to indicate the predicted rotation angle of the sample medical image.
  • If the task is to determine by how many multiples of 90 degrees the input image is rotated, the task result can be represented by 0, 1, 2 or 3, corresponding to no rotation, 90-degree rotation, 180-degree rotation and 270-degree rotation.
  • Alternatively, the input sample medical image can randomly be left unrotated or rotated by any angle (such as 10 degrees, 36 degrees, 45 degrees, 110 degrees or 305 degrees), with the task being to determine by how many degrees the input image is rotated; the task result can then be represented by an angle value indicating the rotation angle corresponding to the sample medical image.
  • In this way, the rotation angle of the medical image is used as the task result, so that the model can recognize medical images at various angles, reducing the probability of inaccurate image recognition caused by image rotation and improving the robustness of the model in recognizing images from different angles.
  • the third loss function may be a cross-entropy loss for measuring the difference between the output task result and the target task result. Therefore, training the medical image report generation model based on the third loss function can improve the accuracy of the model's judgment on the task results, and further improve the accuracy and reliability of the medical image report generated by the model.
  • In some embodiments, the third loss function not only includes the cross entropy between the output task result and the target task result, but also includes the information entropy of the output task result itself.
  • The formula of the third loss function L_St can be as follows:
  • L_St = L_CE(Z_p, y_p) − Σ Z_p · log(Z_p)
  • where Z_p represents the output task result, y_p represents the target task result, L_CE(Z_p, y_p) represents the cross entropy between the output task result and the target task result, and −Σ Z_p · log(Z_p) represents the information entropy of the output task result itself.
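  • Reading Z_p as the predicted probability distribution over rotation classes, the reconstructed formula can be sketched as follows (an interpretation, not necessarily the patent's exact implementation):

```python
import torch
import torch.nn.functional as F

def self_supervised_loss(task_logits: torch.Tensor,
                         target: torch.Tensor) -> torch.Tensor:
    """L_St = L_CE(Z_p, y_p) - sum(Z_p * log(Z_p)), where the second
    term is the information entropy of the output task result."""
    ce = F.cross_entropy(task_logits, target)
    z_p = F.softmax(task_logits, dim=-1)
    entropy = -(z_p * z_p.clamp_min(1e-8).log()).sum(dim=-1).mean()
    return ce + entropy
```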
  • weighted summation is performed on the first loss function value, the second loss function value and the third loss function value to obtain a total loss function value.
  • The weight corresponding to each loss function can be reasonably set and adjusted according to the actual situation; for example, it can be set according to the importance of each task or adjusted according to the training effect of the model, so as to adjust the importance of each loss function and obtain a model that focuses on one or some aspects of performance. This application does not limit this.
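  • A sketch of this weighted summation (the weight values below are purely illustrative):

```python
import torch

def total_loss(losses: dict, weights: dict) -> torch.Tensor:
    """Weighted summation of the per-task loss function values."""
    return sum(weights[name] * value for name, value in losses.items())

# Illustrative weights; in practice they are set per task importance
# or adjusted according to the training effect of the model.
weights = {"report": 1.0, "category": 0.5, "task": 0.5}
losses = {"report": torch.tensor(2.3), "category": torch.tensor(0.7),
          "task": torch.tensor(1.1)}
print(total_loss(losses, weights))  # tensor(3.2000)
```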
  • the present application provides a technical solution for automatically generating medical image reports based on AI models.
  • During the model training process, in addition to letting the model complete the main task (i.e., generating image reports), other tasks (including determining the image category and the task result, etc.) are also completed in parallel, which helps to improve the image recognition ability of the model network, thereby indirectly enhancing the image-text conversion performance of the model and making the model output more accurate and reliable medical image reports.
  • In addition, the medical image report generation model is adjusted along multiple dimensions based on the first loss function, the second loss function and the third loss function, so that the trained model can meet indicators in multiple dimensions, improving the accuracy and reliability of the medical image reports generated by the model.
  • FIG. 4 shows a flowchart of a training method for a medical image report generation model provided by another embodiment of the present application.
  • The execution subject of each step of the method may be the model training device 10 in the above-mentioned embodiment of FIG. 1, such as a computer or a server.
  • the method may include the following steps (410-460):
  • Step 410: Acquire a sample medical image.
  • Step 420: Perform visual feature extraction processing on the sample medical image through a visual feature extraction network to obtain a visual feature sequence of the sample medical image.
  • Step 430: On the basis of the visual feature sequence, splice the image category label, the self-learning label and the model distillation label to obtain the input information of the encoding network.
  • In this embodiment, a model distillation label is further added.
  • the model distillation label is used to predict the class of the sample medical image by learning the image feature information from the visual feature sequence through the processing of the encoding network.
  • the categories here can also be classified from the diagnosis results of medical images, such as including fractures, cardiac hypertrophy, pneumonia, pulmonary edema and other different categories.
  • Step 440: Encode the input information through an encoding network to obtain the visual encoding feature vector corresponding to the visual feature sequence, the output category result corresponding to the image category label, the output task result corresponding to the self-learning label, and the student output diagnosis result corresponding to the model distillation label.
  • the input information is subjected to encoding processing by an encoding network to obtain an encoded feature vector.
  • We extract four parts of information from the encoded feature vector: the first part is the visual encoding feature vector corresponding to the visual feature sequence, the second part is the output category result corresponding to the image category label, the third part is the output task result corresponding to the self-learning label, and the fourth part is the student output diagnosis result corresponding to the model distillation label.
  • The output category result corresponding to the image category label refers to the category of the sample medical image predicted by the encoding network (such as the diagnosis results introduced above); the output task result corresponding to the self-learning label refers to the task result of the sample medical image predicted by the encoding network (such as the rotation angle introduced above); and the student output diagnosis result corresponding to the model distillation label refers to the diagnosis result of the sample medical image predicted by the encoding network.
  • Step 450: Decode the visual encoding feature vector through a decoding network to obtain an output image report corresponding to the sample medical image.
  • Step 460: Calculate the total loss function value of the medical image report generation model based on the output image report, the output category result, the output task result and the student output diagnosis result, and adjust the parameters of the medical image report generation model according to the total loss function value.
  • In this embodiment, in addition to completing the main task (i.e., generating an image report), the model also completes other tasks (including determining the image category, determining the task result, and determining the diagnosis result) in parallel. Therefore, the loss function of the model includes not only the loss function corresponding to the main task but also the loss functions corresponding to the other tasks; finally, the various loss functions are added together to obtain the total loss function of the model.
  • the total loss function value of the model is calculated as follows:
  • the student output diagnosis result refers to the diagnosis result output by the medical image report generation model, specifically, the diagnosis result output by the coding network.
  • the diagnostic result output by the teacher refers to the diagnostic result output by the pre-trained teacher model.
  • The sample medical image is input into the pre-trained teacher model, which is used to identify the symptom category (that is, the diagnosis result) in the sample medical image; the teacher output diagnosis result corresponding to the sample medical image is obtained through the teacher model.
  • When training the teacher model, the sample medical images can be used for training, with the target diagnosis result serving as the label information for model training.
  • The target diagnosis result can be the diagnosis result extracted from the impression part of the target image report corresponding to the sample medical image.
  • the pre-trained teacher model is used to perform model distillation on the medical image report generation model, so as to improve the accuracy of the model and simplify the network structure of the model, thereby saving the storage resources occupied by the model and the cost of using the model.
  • using the pre-trained teacher model to perform model distillation on the medical image report generation model can speed up the convergence speed of the medical image report generation model during the training process, thereby improving the training efficiency of the medical image report generation model.
  • the fourth loss function may be a cross-entropy loss used to measure the difference between the student's output diagnosis result and the teacher's output diagnosis result.
  • The formula of the fourth loss function L_global can be as follows:
  • L_global = (1 − λ) · L_CE(σ(Z_s), y) + λ · τ² · KL(σ(Z_s / τ), σ(Z_t / τ))
  • where Z_s and Z_t are the outputs of the student model (that is, the medical image report generation model) and the teacher model respectively, that is, Z_s is the student output diagnosis result and Z_t is the teacher output diagnosis result; y is the target diagnosis result; L_CE(σ(Z_s), y) represents the cross entropy between the student output diagnosis result and the target diagnosis result; KL represents the KL divergence (Kullback-Leibler divergence); σ represents the softmax function; and λ and τ are hyperparameters. Illustratively, λ is set to 0.5 and τ is set to 1.
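  • This fourth loss function can be sketched as follows (the reduction mode and temperature handling are assumptions consistent with the reconstructed formula above, not necessarily the patent's exact implementation):

```python
import torch
import torch.nn.functional as F

def distillation_loss(z_s, z_t, y, lam: float = 0.5, tau: float = 1.0):
    """L_global = (1 - lam) * L_CE(softmax(Z_s), y)
                + lam * tau^2 * KL(softmax(Z_s/tau) || softmax(Z_t/tau))."""
    ce = F.cross_entropy(z_s, y)            # student vs. target diagnosis
    kl = F.kl_div(F.log_softmax(z_s / tau, dim=-1),
                  F.softmax(z_t / tau, dim=-1),
                  reduction="batchmean")    # student vs. teacher outputs
    return (1 - lam) * ce + lam * tau ** 2 * kl
```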
  • the target diagnosis result can be automatically extracted from the target image report corresponding to the sample medical image, thereby eliminating the need to manually label the target diagnosis result, and improving the training efficiency of the model.
  • weighted summation is performed on the first loss function value, the second loss function value, the third loss function value and the fourth loss function value to obtain a total loss function value.
  • The model distillation label is further introduced to allow the model to complete the diagnosis task. It is found through experiments that, compared to simply using two image classification labels, introducing the model distillation label yields a better-performing final medical image report generation model, because the model distillation label can learn inductive hypotheses from the teacher model, thereby improving the performance of the medical image report generation model.
  • In some embodiments, the medical image report generation model may adopt a CNN + Transformer architecture, wherein the CNN serves as the visual feature extraction network, and the Transformer includes multiple cascaded encoders and multiple cascaded decoders; the multiple cascaded encoders serve as the encoding network, and the multiple cascaded decoders serve as the decoding network.
  • FIG. 5 exemplarily shows an architecture diagram of a medical image report generation model.
  • the model adopts the model architecture of CNN+Transformer, which includes a visual feature extraction network, an encoding network and a decoding network.
  • The visual feature extraction network adopts the CNN structure, and the encoding network and the decoding network adopt the Transformer structure. The encoding network includes N cascaded encoders, and the decoding network includes N cascaded decoders, where N is an integer greater than 1. Illustratively, the value of N is 6.
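  • A compact sketch of such a CNN + Transformer architecture in PyTorch (the ResNet-50 backbone, dimensions and vocabulary size are illustrative assumptions; the label splicing and auxiliary task heads described above are omitted for brevity):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ReportGenerator(nn.Module):
    """Sketch: CNN visual feature extraction network + Transformer
    with N cascaded encoders and N cascaded decoders (N = 6)."""
    def __init__(self, dim: int = 512, vocab: int = 10000, n: int = 6):
        super().__init__()
        cnn = resnet50()
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])  # feature map
        self.embed = nn.Conv2d(2048, dim, kernel_size=1)  # per-unit embedding
        self.transformer = nn.Transformer(d_model=dim, num_encoder_layers=n,
                                          num_decoder_layers=n,
                                          batch_first=True)
        self.word_embed = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)         # logits over report tokens

    def forward(self, image, report_tokens):
        feat = self.embed(self.backbone(image))   # (b, dim, h, w)
        src = feat.flatten(2).transpose(1, 2)     # visual feature sequence
        tgt = self.word_embed(report_tokens)
        return self.head(self.transformer(src, tgt))
```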
  • the medical image is processed by the feature extraction of the visual feature extraction network to obtain the visual feature information.
  • the visual feature information is divided into a plurality of visual feature units, and then the feature vector of each visual feature unit is obtained to obtain a visual feature sequence.
  • On the basis of the visual feature sequence, the image category label, self-learning label and model distillation label are spliced to obtain the input information of the encoding network.
  • the input information is coded by the coding network, and the visual coding feature vector corresponding to the visual feature sequence is obtained.
  • the visual encoding feature vector is decoded by the decoding network, and the image report corresponding to the medical image is obtained.
  • The Transformer structure is a sequence-to-sequence model; what is special about it is its extensive use of the self-attention mechanism.
  • The network model based on the Transformer structure uses the self-attention mechanism rather than the sequential structure of an RNN (recurrent neural network), so that the model can be trained in parallel and can capture global information.
  • FIG. 6 exemplarily shows a schematic diagram of a Transformer structure.
  • the input is a sequence of words to be translated
  • the output is a sequence of translated words.
  • In the encoder part, each word in the word sequence to be translated goes, in parallel, through embedding encoding, position encoding, the multi-head attention layer, residual connection and layer normalization, forward propagation, and another residual connection and layer normalization; the K and V vectors encoded for each word are then calculated and sent to the decoder.
  • In the decoder part, the translation result of the previous word (or a start tag) is input and goes through embedding encoding, position encoding, the masked multi-head attention layer, and residual connection and layer normalization in turn to obtain the decoded Q vector. After that, the K and V vectors of the current word and the decoded Q vector go through the multi-head attention layer, residual connection and layer normalization, forward propagation, residual connection and layer normalization, a fully connected layer, and a Softmax layer to obtain the translation result of the current word. Finally, the translation results of all words are spliced to obtain the translated word sequence.
  • The encoder part can be computed in parallel, encoding all encoder inputs at once, but the decoder part does not decode the whole sequence at once; instead it decodes word by word like an RNN, because the decoded output of the previous position is used as the Q vector of the attention mechanism.
  • The attention mechanism can discover the relationships between words and is already very common in CV (computer vision). Given a word, a word embedding can be obtained, and the Q (query), K (key) and V (value) vectors corresponding to the word can be obtained through three independent fully connected layers.
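  • For example (a sketch assuming a 512-dimensional embedding), the three independent fully connected layers can be:

```python
import torch
import torch.nn as nn

dim = 512
to_q, to_k, to_v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)

words = torch.randn(1, 10, dim)     # embeddings of a 10-word sequence
q, k, v = to_q(words), to_k(words), to_v(words)   # Q, K and V vectors
```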
  • the position of words/features is very important for sequence conversion (such as text translation or image-text conversion, etc.), so after obtaining image features and word embeddings, it is necessary to encode the position information of words.
  • The encoding method is shown in FIG. 8, where pos represents the position of the word in the current sentence, i indexes the dimensions of the word embedding, the value range of i is [0, d/2), and d is a set value such as 512. Therefore, the encoding of each position and dimension in the PE (positional encoding) is different.
  • The even dimensions are encoded by the sin formula, and the odd dimensions are encoded by the cos formula, as follows:
  • PE(pos, 2i) = sin(pos / 10000^(2i/d));
  • PE(pos, 2i+1) = cos(pos / 10000^(2i/d)).
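  • These formulas translate directly into code; a sketch with d = 512 as the set value mentioned above:

```python
import torch

def positional_encoding(length: int, d: int = 512) -> torch.Tensor:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)),
       PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pos = torch.arange(length, dtype=torch.float).unsqueeze(1)  # (length, 1)
    i2 = torch.arange(0, d, 2, dtype=torch.float)               # the 2i values
    angle = pos / torch.pow(torch.tensor(10000.0), i2 / d)
    pe = torch.zeros(length, d)
    pe[:, 0::2] = torch.sin(angle)    # even dimensions: sin
    pe[:, 1::2] = torch.cos(angle)    # odd dimensions: cos
    return pe
```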
  • The residual connection avoids the vanishing-gradient phenomenon caused by the deepening of modules in the Transformer and is used to prevent network degradation. Therefore, the attention output Z vector is first added to the original input X vector, and then layer normalization computes the mean and variance over the channel dimension of the current word vector and normalizes it before it is input to the feed-forward layer.
  • The attention vector can be calculated according to the following formula:
  • Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V
  • where d_k represents the dimension of the Q vector or the K vector.
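  • In code form, this is the standard scaled dot-product attention (a sketch):

```python
import math
import torch
import torch.nn.functional as F

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
    """Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return F.softmax(scores, dim=-1) @ v
```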
  • In the embodiments of the present application, the original medical image is converted into a visual feature sequence through the feature extraction processing of the visual feature extraction network, and then the image category label, self-learning label and model distillation label are spliced on the basis of the visual feature sequence to obtain the input information of the encoding network.
  • the input information is a vector sequence. Therefore, through the encoding and decoding processing of the Transformer network, the image report in the form of text can be output.
  • the above embodiment has introduced and explained the training method of the medical image report generation model.
  • the following will introduce and explain the image report generation method based on the medical image report generation model through the embodiment.
  • The content involved in the use process and the content involved in the training process correspond to each other; if detailed description is absent on one side, refer to the description on the other side.
  • FIG. 11 shows a flowchart of a method for generating an image report provided by an embodiment of the present application.
  • The execution subject of each step of the method may be the model using device 20 in the above-mentioned embodiment of FIG. 1, such as a terminal device (e.g., a mobile phone, a tablet computer, a PC or a medical device) or a server.
  • The method may include the following steps (1110-1140):
  • Step 1110: Perform feature extraction processing on the target medical image through a visual feature extraction network to obtain a visual feature sequence of the target medical image.
  • the target medical image may be any medical image.
  • an image report corresponding to the target medical image can be automatically generated through a medical image report generation model.
  • Step 1120: On the basis of the visual feature sequence, splice the image category label and the self-learning label to obtain the input information of the encoding network.
  • In some embodiments, the image category label, self-learning label and model distillation label are spliced to obtain the input information of the encoding network.
  • The image category label, self-learning label and model distillation label spliced here are exactly the same as those spliced during the model training process.
  • For example, if during model training the image category label, self-learning label and model distillation label are three all-zero vectors (that is, all elements in each vector are 0), then during model use these three labels are also three all-zero vectors.
  • Step 1130: Encode the input information through an encoding network to obtain a visual encoding feature vector corresponding to the visual feature sequence.
  • the input information is encoded by the encoding network to obtain the encoded feature vector.
  • In an embodiment where the input information includes the visual feature sequence, the image category label and the self-learning label, we extract three parts of information from the encoded feature vector: one part serves as the visual encoding feature vector corresponding to the visual feature sequence, one part serves as the output category result corresponding to the image category label, and one part serves as the output task result corresponding to the self-learning label.
  • The output category result corresponding to the image category label refers to the category of the target medical image predicted by the encoding network (such as the diagnosis results introduced above), and the output task result corresponding to the self-learning label refers to the task result of the target medical image predicted by the encoding network (such as the rotation angle introduced above).
  • When the input information includes the visual feature sequence, the image category label, the self-learning label and the model distillation label, four parts of information are extracted from the encoded feature vector:
  • the first part is the visual encoding feature vector corresponding to the visual feature sequence;
  • the second part is the output category result corresponding to the image category label;
  • the third part is the output task result corresponding to the self-learning label;
  • the fourth part is the student output diagnosis result corresponding to the model distillation label.
  • The output category result refers to the category of the target medical image predicted by the encoding network (such as the diagnosis result introduced above), the output task result refers to the task result of the target medical image predicted by the encoding network (such as the rotation angle introduced above), and the student output diagnosis result refers to the diagnosis result of the target medical image predicted by the encoding network. A sketch of this splitting follows.
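A minimal sketch of splitting the encoder output back into its four parts. The 25/28 token counts follow the 5×5-grid example used earlier and are assumptions, not values fixed by the patent.

```python
import torch

def split_encoder_output(encoded: torch.Tensor):
    # encoded: (B, 28, d) -- 25 visual tokens followed by 3 label tokens,
    # matching the splicing order used when the input was built.
    visual = encoded[:, :25, :]            # visual encoding feature vectors
    category = encoded[:, 25, :]           # -> output category result head
    task = encoded[:, 26, :]               # -> output task result head
    distill = encoded[:, 27, :]            # -> student output diagnosis head
    return visual, category, task, distill
```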
  • Step 1140 Decode the visual encoding feature vector through a decoding network to obtain an output image report corresponding to the target medical image.
  • the visual encoding feature vector is further sent to the decoding network for decoding processing, and the decoding network outputs the output image report corresponding to the target medical image.
  • the image report output by the decoding network is an image report in text form, and the image report in text form is a sentence/paragraph report conforming to natural language specifications, rather than some simple keywords.
  • At least one of a category result, a task result, and a diagnosis result corresponding to the target medical image output by the encoding network may be further acquired.
  • In summary, the present application provides a technical solution for automatically generating medical image reports based on an AI model. Because during model training, in addition to completing the main task (i.e., generating image reports), the model also completes other tasks in parallel (including determining the image category, the task result and the diagnosis result), the image recognition ability of the model network is improved.
  • Correspondingly, during model use, the image category label, self-learning label and model distillation label are likewise spliced onto the visual feature sequence to obtain the input information of the encoding network, so that the model outputs more accurate and reliable medical image reports.
  • FIG. 12 shows a block diagram of an apparatus for training a medical image report generation model provided by an embodiment of the present application.
  • the device has the function of implementing the above training method for the medical image report generation model; the function may be implemented by hardware, or by hardware executing corresponding software.
  • the apparatus may be computer equipment, or may be provided in computer equipment.
  • the apparatus 1200 may include: a sample acquisition module 1210 , a feature extraction module 1220 , an information splicing module 1230 , an encoding processing module 1240 , a decoding processing module 1250 , a loss calculation module 1260 and a model parameter adjustment module 1270 .
  • the sample acquisition module 1210 is used to acquire sample medical images.
  • the feature extraction module 1220 is configured to perform visual feature extraction processing on the sample medical image through the visual feature extraction network to obtain a visual feature sequence of the sample medical image.
  • the information splicing module 1230 is used for splicing self-learning labels on the basis of the visual feature sequence to obtain the input information of the encoding network.
  • the encoding processing module 1240 is configured to perform encoding processing on the input information through the encoding network to obtain a visual encoding feature vector corresponding to the visual feature sequence and an output task result corresponding to the self-learning label.
  • the decoding processing module 1250 is configured to perform decoding processing on the visual encoding feature vector through the decoding network to obtain an output image report corresponding to the sample medical image.
  • the loss calculation module 1260 is configured to calculate the total loss function value of the medical image report generation model based on the output image report and the output task result.
  • the model parameter adjustment module 1270 is configured to adjust the parameters of the medical image report generation model according to the total loss function value.
  • the loss calculation module 1260 is used to: calculate a first loss function value based on the output image report and the target image report corresponding to the sample medical image; calculate a third loss function value based on the output task result and the target task result corresponding to the sample medical image; and calculate the total loss function value based on the first loss function value and the third loss function value.
  • the feature extraction module 1220 is further configured to: rotate the sample medical image by a specified angle and then input it to the visual feature extraction network;
  • where the target task result is used to indicate the actual rotation angle of the sample medical image, and the output task result is used to indicate the predicted rotation angle of the sample medical image.
  • the input information further includes a model distillation label, and a student output diagnosis result is obtained from the model distillation label through the encoding network;
  • the loss calculation module 1260 is also used to: calculate a fourth loss function value based on the student output diagnosis result and the teacher output diagnosis result corresponding to the sample medical image; and calculate the total loss function value based on the first loss function value, the third loss function value and the fourth loss function value.
  • the sample acquisition module 1210 is further configured to: input the sample medical image into a pre-trained teacher model that identifies the symptom category in the sample medical image; and obtain, through the teacher model, the teacher output diagnosis result corresponding to the sample medical image.
  • the loss calculation module 1260 is further configured to:
  • the first loss function value, the third loss function value and the fourth loss function value are weighted and summed to obtain the total loss function value.
  • the input information further includes an image class label, and an output class result corresponding to the image class label is obtained through the encoding network;
  • the loss calculation module 1260 is further configured to: calculate a second loss function value based on the output category result and the target category result corresponding to the sample medical image;
  • the loss calculation module 1260 is further configured to: calculate the total loss function value based on the first loss function value, the second loss function value and the third loss function value.
  • the sample acquisition module 1210 is further configured to: extract information of a designated field from the target image report corresponding to the sample medical image; and perform semantic recognition on the information of the designated field to obtain the target category result corresponding to the sample medical image.
  • the feature extraction module 1220 is used to: perform visual feature extraction processing on the sample medical image through the visual feature extraction network to obtain visual feature information of the sample medical image; divide the visual feature information into a plurality of visual feature units; and acquire a feature vector of each of the visual feature units to obtain the visual feature sequence.
  • In summary, the present application provides a technical solution for automatically generating medical image reports based on an AI model.
  • During model training, in addition to completing the main task (i.e., generating image reports), the model also completes other tasks in parallel (such as predicting the task result), where the task result is the output of a related self-supervised training task.
  • Introducing this self-supervised training method further enlarges intra-class differences, strengthens the network's ability to extract features from input images, improves the model network's robustness to different images and its image recognition ability, thereby indirectly enhancing the model's image-to-text conversion performance and making the output medical image reports more accurate and reliable.
  • FIG. 13 shows a block diagram of an image report generating apparatus provided by an embodiment of the present application.
  • the device has the function of implementing the above image report generation method; the function may be implemented by hardware, or by hardware executing corresponding software.
  • the apparatus may be computer equipment, or may be provided in computer equipment.
  • the apparatus 1300 may include: a feature extraction module 1310 , an information splicing module 1320 , an encoding processing module 1330 and a decoding processing module 1340 .
  • the feature extraction module 1310 is configured to perform feature extraction processing on the target medical image through the visual feature extraction network to obtain a visual feature sequence of the target medical image.
  • the information splicing module 1320 is used for splicing self-learning labels on the basis of the visual feature sequence to obtain the input information of the encoding network.
  • the coding processing module 1330 is configured to perform coding processing on the input information through the coding network to obtain a visual coding feature vector corresponding to the visual feature sequence.
  • the decoding processing module 1340 is configured to perform decoding processing on the visual encoding feature vector through the decoding network to obtain an output image report corresponding to the target medical image.
  • the information splicing module 1320 is configured to splice the image category label and the self-learning label onto the visual feature sequence to obtain the input information of the encoding network.
  • the information splicing module 1320 is configured to: splice the image category label, the self-learning label and the model distillation label onto the visual feature sequence to obtain the input information of the encoding network.
  • the feature extraction module 1310 is used to: perform visual feature extraction processing on the target medical image through the visual feature extraction network to obtain visual feature information of the target medical image; divide the visual feature information into a plurality of visual feature units; and acquire a feature vector of each of the visual feature units to obtain the visual feature sequence.
  • In summary, the present application provides a technical solution for automatically generating medical image reports based on an AI model. Because during model training, in addition to completing the main task (i.e., generating image reports), the model also completes other tasks in parallel (including determining the image category, the task result and the diagnosis result), the image recognition ability of the model network is improved.
  • Correspondingly, during model use, the image category label, self-learning label and model distillation label are likewise spliced onto the visual feature sequence to obtain the input information of the encoding network, so that the model outputs more accurate and reliable medical image reports.
  • FIG. 14 shows a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • the computer device can be any electronic device with data computing, processing and storage capabilities, such as a mobile phone, a tablet computer, a PC (Personal Computer) or a server.
  • the computer device is used to implement the training method of the medical image report generation model or the image report generation method provided in the above embodiments. Specifically:
  • the computer device 1400 includes a central processing unit 1401 (such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit) or an FPGA (Field Programmable Gate Array)),
  • a system memory 1404 including a RAM (Random-Access Memory) 1402 and a ROM (Read-Only Memory) 1403, and a system bus 1405 connecting the system memory 1404 and the central processing unit 1401.
  • the computer device 1400 also includes a basic input/output system (I/O system) 1406 that facilitates the transfer of information between devices within the server, and a mass storage device 1407 for storing an operating system 1414, application programs 1414 and other program modules 1415.
  • the basic input/output system 1406 includes a display 1408 for displaying information and input devices 1409, such as a mouse or keyboard, for the user to input information.
  • the display 1408 and the input device 1409 are both connected to the central processing unit 1401 through the input and output controller 1410 connected to the system bus 1405 .
  • the basic input/output system 1406 may also include an input output controller 1410 for receiving and processing input from a number of other devices such as a keyboard, mouse, or electronic stylus.
  • input output controller 1410 also provides output to a display screen, printer, or other type of output device.
  • the mass storage device 1407 is connected to the central processing unit 1401 through a mass storage controller (not shown) connected to the system bus 1405 .
  • the mass storage device 1407 and its associated computer-readable media provide non-volatile storage for the computer device 1400 . That is, the mass storage device 1407 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM (Compact Disc Read-Only Memory) drive.
  • Computer-readable media can include computer storage media and communication media.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media include RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other solid-state storage technologies, CD-ROM, DVD (Digital Video Disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices.
  • the system memory 1404 and the mass storage device 1407 described above may be collectively referred to as memory.
  • the computer device 1400 may also operate through a remote computer connected to a network such as the Internet. That is, the computer device 1400 can be connected to the network 1412 through the network interface unit 1411 connected to the system bus 1405, or the network interface unit 1411 can be used to connect to other types of networks or remote computer systems (not shown).
  • the memory also stores at least one instruction, at least one program, a code set or an instruction set, configured to be executed by one or more processors to implement the above training method for a medical image report generation model or the above image report generation method.
  • a computer-readable storage medium stores at least one instruction, at least one program, a code set or an instruction set which, when executed by the processor of a computer device, implements the above training method for a medical image report generation model or the above image report generation method.
  • the computer-readable storage medium may include: ROM (Read-Only Memory), RAM (Random-Access Memory), SSD (Solid State Drive), an optical disc, or the like.
  • the random access memory may include ReRAM (Resistive Random-Access Memory) and DRAM (Dynamic Random-Access Memory).
  • a computer program product or computer program comprising computer instructions stored in a computer readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, the processor executes the computer instructions, so that the computer device executes the above-mentioned training method for a medical image report generation model or image report generation method.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Image Analysis (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Image Processing (AREA)

Abstract

This application discloses a training method for a medical image report generation model and an image report generation method, relating to the field of artificial intelligence. The method includes: performing visual feature extraction on a sample medical image through a visual feature extraction network to obtain a visual feature sequence; splicing a self-learning label onto the visual feature sequence to obtain input information of an encoding network (230); encoding the input information through the encoding network to obtain a visual encoding feature vector and an output task result; decoding the visual encoding feature vector through a decoding network to obtain an output image report; and computing the model loss based on the output image report and the output task result, and adjusting the model parameters accordingly.

Description

Training method for a medical image report generation model and image report generation method
This application claims priority to Chinese Patent Application No. 202110320701.3, entitled "Training method for medical image report generation model and image report generation method" and filed on March 25, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of this application relate to the technical field of artificial intelligence, and in particular to a training method for a medical image report generation model and an image report generation method.
Background
A medical image is an image of the internal tissue of the human body, or of a part of the human body, acquired in a non-invasive manner.
At present, medical images are still reviewed manually in clinical practice and the corresponding image reports are written by hand. This makes report generation inefficient and, for less experienced doctors, prone to inaccurate reports.
Summary
Embodiments of this application provide a training method for a medical image report generation model and an image report generation method, which enable the automated generation of highly accurate medical image reports. The technical solution is as follows:
According to one aspect of the embodiments of this application, a training method for a medical image report generation model is provided. The method is executed by a computer device, and the medical image report generation model includes a visual feature extraction network, an encoding network and a decoding network. The method includes:
acquiring a sample medical image;
performing visual feature extraction processing on the sample medical image through the visual feature extraction network to obtain a visual feature sequence of the sample medical image;
splicing a self-learning label onto the visual feature sequence to obtain input information of the encoding network;
encoding the input information through the encoding network to obtain a visual encoding feature vector corresponding to the visual feature sequence and an output task result corresponding to the self-learning label;
decoding the visual encoding feature vector through the decoding network to obtain an output image report corresponding to the sample medical image;
computing a total loss function value of the medical image report generation model based on the output image report and the output task result, and adjusting parameters of the medical image report generation model according to the total loss function value.
According to one aspect of the embodiments of this application, an image report generation method based on a medical image report generation model is provided. The method is executed by a computer device, and the medical image report generation model includes a visual feature extraction network, an encoding network and a decoding network. The method includes:
performing feature extraction processing on a target medical image through the visual feature extraction network to obtain a visual feature sequence of the target medical image;
splicing a self-learning label onto the visual feature sequence to obtain input information of the encoding network;
encoding the input information through the encoding network to obtain a visual encoding feature vector corresponding to the visual feature sequence;
decoding the visual encoding feature vector through the decoding network to obtain an output image report corresponding to the target medical image.
According to one aspect of the embodiments of this application, a training apparatus for a medical image report generation model is provided, where the medical image report generation model includes a visual feature extraction network, an encoding network and a decoding network. The apparatus includes:
a sample acquisition module, configured to acquire a sample medical image;
a feature extraction module, configured to perform visual feature extraction processing on the sample medical image through the visual feature extraction network to obtain a visual feature sequence of the sample medical image;
an information splicing module, configured to splice a self-learning label onto the visual feature sequence to obtain input information of the encoding network;
an encoding processing module, configured to encode the input information through the encoding network to obtain a visual encoding feature vector corresponding to the visual feature sequence and an output task result corresponding to the self-learning label;
a decoding processing module, configured to decode the visual encoding feature vector through the decoding network to obtain an output image report corresponding to the sample medical image;
a loss calculation module, configured to compute a total loss function value of the medical image report generation model based on the output image report and the output task result;
a model parameter adjustment module, configured to adjust parameters of the medical image report generation model according to the total loss function value.
According to one aspect of the embodiments of this application, an image report generation apparatus based on a medical image report generation model is provided, where the medical image report generation model includes a visual feature extraction network, an encoding network and a decoding network. The apparatus includes:
a feature extraction module, configured to perform feature extraction processing on a target medical image through the visual feature extraction network to obtain a visual feature sequence of the target medical image;
an information splicing module, configured to splice a self-learning label onto the visual feature sequence to obtain input information of the encoding network;
an encoding processing module, configured to encode the input information through the encoding network to obtain a visual encoding feature vector corresponding to the visual feature sequence;
a decoding processing module, configured to decode the visual encoding feature vector through the decoding network to obtain an output image report corresponding to the target medical image.
According to one aspect of the embodiments of this application, a computer device is provided, including a processor and a memory, where the memory stores at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by the processor to implement the above training method for a medical image report generation model or the above image report generation method based on a medical image report generation model.
According to one aspect of the embodiments of this application, a computer-readable storage medium is provided, storing at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by a processor to implement the above training method for a medical image report generation model or the above image report generation method based on a medical image report generation model.
According to one aspect of the embodiments of this application, a computer program product or computer program is provided, including computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to execute the above training method for a medical image report generation model or the above image report generation method based on a medical image report generation model.
The technical solutions provided by the embodiments of this application include at least the following beneficial effects:
This application provides a technical solution for automatically generating medical image reports based on an AI (Artificial Intelligence) model. During model training, in addition to completing the main task (i.e., generating image reports), the model also completes other tasks in parallel (such as the task result), where the task result is the output of a related self-supervised training task. Introducing this self-supervised training method further enlarges intra-class differences, strengthens the network's feature extraction ability for input images, improves the model network's robustness to different images and its image recognition ability, thereby indirectly enhancing the model's image-to-text conversion performance and making the output medical image reports more accurate and reliable.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a solution implementation environment provided by an embodiment of this application;
FIG. 2 is a flowchart of a training method for a medical image report generation model provided by an embodiment of this application;
FIG. 3 is a flowchart of a training method for a medical image report generation model provided by another embodiment of this application;
FIG. 4 is a flowchart of a training method for a medical image report generation model provided by another embodiment of this application;
FIG. 5 is an architecture diagram of a medical image report generation model provided by an embodiment of this application;
FIG. 6 is a schematic diagram of a Transformer structure provided by an embodiment of this application;
FIG. 7 is a schematic diagram of a multi-head attention mechanism provided by an embodiment of this application;
FIG. 8 is a schematic diagram of positional encoding provided by an embodiment of this application;
FIG. 9 is a schematic diagram of a masked multi-head attention mechanism provided by an embodiment of this application;
FIG. 10 is a schematic diagram of the computation of an attention vector provided by an embodiment of this application;
FIG. 11 is a flowchart of an image report generation method provided by an embodiment of this application;
FIG. 12 is a block diagram of a training apparatus for a medical image report generation model provided by an embodiment of this application;
FIG. 13 is a block diagram of an image report generation apparatus provided by an embodiment of this application;
FIG. 14 is a schematic structural diagram of a computer device provided by an embodiment of this application.
Detailed Description
The technical solutions provided by the embodiments of this application involve artificial intelligence technologies such as machine learning and computer vision, and are described through the following embodiments.
Please refer to FIG. 1, which shows a schematic diagram of a solution implementation environment provided by an embodiment of this application. The implementation environment may include a model training device 10 and a model using device 20.
The model training device 10 may be a computer device such as a computer or a server, and is used to train the medical image report generation model. In the embodiments of this application, the medical image report generation model is a machine learning model for automatically generating an image report from a medical image. The model training device 10 may train the model by machine learning so that it achieves good performance in automated medical image report generation.
The trained medical image report generation model may be deployed in the model using device 20 for use. The model using device 20 may be a terminal device such as a mobile phone, a tablet computer, a PC (Personal Computer), a smart TV, a multimedia playback device or a medical device, or it may be a server. When a medical image report needs to be generated, the model using device 20 can generate it automatically through the medical image report generation model.
The medical image report generation model provided by this application can automatically generate text reports in natural language form. The automatically generated medical image report can assist doctors in diagnosis, reduce their workload, and help improve the efficiency of medical image report generation.
In the embodiments of this application, the medical image report generation model includes a visual feature extraction network, an encoding network and a decoding network; these networks are described in the embodiments below.
The embodiments of this application do not limit the type of medical image: it may be an X-ray image, a CT (Computed Tomography) image, a PET (Positron Emission Computed Tomography) image, an MRI (Magnetic Resonance Imaging) image, a medical ultrasound image, a medical microscope image, and so on. Nor do the embodiments limit the body part concerned, including but not limited to the abdomen, internal organs, bones, head, blood vessels, and so on. Of course, in some other embodiments, the medical image may also be of an animal such as a cat or a dog, and the technical solution of this application can likewise generate the corresponding image report automatically.
The technical solution of this application is described below through several embodiments.
Please refer to FIG. 2, which shows a flowchart of a training method for a medical image report generation model provided by an embodiment of this application. Each step of the method may be executed by the model training device 10 in the embodiment of FIG. 1, such as a computer or server. The method may include the following steps (210-260):
Step 210: Acquire a sample medical image.
Step 220: Perform visual feature extraction processing on the sample medical image through the visual feature extraction network to obtain a visual feature sequence of the sample medical image.
Step 230: Splice a self-learning label onto the visual feature sequence to obtain the input information of the encoding network.
In the embodiments of this application, the visual feature sequence is not used directly as the input information of the encoding network; instead, a self-learning label is spliced onto the visual feature sequence to obtain the input information. Through the processing of the encoding network, the self-learning label learns image feature information from the visual feature sequence, on the basis of which the task result of the sample medical image is predicted.
Step 240: Encode the input information through the encoding network to obtain the visual encoding feature vector corresponding to the visual feature sequence and the output task result corresponding to the self-learning label.
The input information is encoded by the encoding network to obtain an encoded feature vector, from which two parts of information are extracted: one part serves as the visual encoding feature vector corresponding to the visual feature sequence, and the other as the output task result corresponding to the self-learning label. The output task result corresponding to the self-learning label refers to the task result of the sample medical image predicted by the encoding network (such as the rotation angle described in the embodiments below).
Step 250: Decode the visual encoding feature vector through the decoding network to obtain the output image report corresponding to the sample medical image.
Step 260: Compute the total loss function value of the medical image report generation model based on the output image report and the output task result, and adjust the parameters of the medical image report generation model according to the total loss function value.
The steps of the embodiment of FIG. 2 are further described in the embodiments of FIG. 3 and FIG. 4 below.
In summary, this application provides a technical solution for automatically generating medical image reports based on an AI model. During model training, in addition to completing the main task (i.e., generating image reports), the model also completes other tasks in parallel (such as the task result), where the task result is the output of a related self-supervised training task. Introducing this self-supervised training method further enlarges intra-class differences, strengthens the network's feature extraction ability for input images, improves the model network's robustness to different images and its image recognition ability, thereby indirectly enhancing the model's image-to-text conversion performance and making the output medical image reports more accurate and reliable.
Please refer to FIG. 3, which shows a flowchart of a training method for a medical image report generation model provided by another embodiment of this application. Each step of the method may be executed by the model training device 10 in the embodiment of FIG. 1, such as a computer or server. The method may include the following steps (310-360):
Step 310: Acquire a sample medical image.
A sample medical image is a medical image used as a training sample; it may be selected from known datasets. Each sample medical image has a corresponding target image report, i.e., an image report that has been manually produced and verified and contains a pathological description of the sample medical image.
Optionally, the medical images in the datasets and their corresponding image reports must meet the following requirements to be usable as sample medical images and target image reports. The medical image must be a standardized image, such as a 2D or 3D X-ray, CT, PET, MRI, medical ultrasound or medical microscope image, and must meet acquisition-region and quality requirements. The corresponding image report must be a structured report: a text-based document written by a qualified radiologist that contains the patient's history and symptoms, together with descriptive information about and interpretation of the lesions in the medical image. Optionally, the image report is a structured report comprising four parts: impression, findings, comparison and indication. In the impression part, the radiologist makes a diagnosis by combining the findings part, the patient's clinical history and the guidance of imaging studies. The findings part lists the radiological observations of the body parts examined. The comparison and indication parts are of little relevance to this application and are not described in detail.
Step 320: Perform visual feature extraction processing on the sample medical image through the visual feature extraction network to obtain a visual feature sequence of the sample medical image.
The visual feature extraction network is a neural network for extracting the visual features of medical images. Optionally, it may be a CNN (Convolutional Neural Network), which performs well on computer vision tasks.
In an exemplary embodiment, this step includes the following sub-steps:
1. Perform visual feature extraction processing on the sample medical image through the visual feature extraction network to obtain visual feature information of the sample medical image.
The visual feature information may be the feature map output by the visual feature extraction network for the sample medical image. The feature map records the visual features of the sample medical image, including but not limited to color features, texture features, shape features and spatial relationship features. A color feature is a global feature that describes the surface properties of the scene corresponding to the image or an image region. A texture feature is also a global feature that likewise describes such surface properties. Shape features have two kinds of representation: contour features, which concern the outer boundary of an object, and region features, which concern the entire shape region. Spatial relationship features refer to the mutual spatial positions or relative directional relationships among the objects segmented from an image; these relationships can be divided into connection/adjacency, overlap and containment/inclusion relationships, among others.
2. Divide the visual feature information into a plurality of visual feature units.
Optionally, the feature map corresponding to the visual feature information is partitioned into a number of feature map sub-blocks, each corresponding to one visual feature unit. For example, the feature map is divided into a 5×5 grid of feature map sub-blocks of equal size.
3. Acquire the feature vector of each visual feature unit to obtain the visual feature sequence.
Through this conversion, the visual feature information represented as a feature map is converted into vector form. For example, multiplying each visual feature unit by a matrix W yields the feature vector (embedding) corresponding to that unit; arranging the feature vectors of all visual feature units in order then yields the visual feature sequence, which is a sequence of vectors. Optionally, when generating the feature vector of a visual feature unit, the corresponding position vector may also be taken into account; the position vector characterizes the relative or absolute position of the visual feature unit within the whole visual feature information (i.e., the feature map).
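A minimal sketch of this conversion, under stated assumptions: the feature map is split into the 5×5 grid used in the example above, each sub-block is flattened, and a projection matrix W (hypothetical shape) maps it to an embedding; a per-unit position vector could be added to the result.

```python
import torch

def to_visual_sequence(feature_map: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    # feature_map: (C, H, Wf) CNN output; W: (C * ph * pw, d) projection matrix.
    C, H, Wf = feature_map.shape
    ph, pw = H // 5, Wf // 5                                  # 5x5 grid of units
    units = feature_map.unfold(1, ph, ph).unfold(2, pw, pw)   # (C, 5, 5, ph, pw)
    units = units.permute(1, 2, 0, 3, 4).reshape(25, -1)      # 25 flattened units
    return units @ W                                          # (25, d) visual feature sequence
```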
Step 330: Splice an image category label and a self-learning label onto the visual feature sequence to obtain the input information of the encoding network.
In the embodiments of this application, the visual feature sequence is not used directly as the input information of the encoding network; instead, an image category label and a self-learning label are spliced onto it. Through the processing of the encoding network, the image category label learns image feature information from the visual feature sequence and uses it to predict the category of the sample medical image, while the self-learning label likewise learns image feature information and uses it to predict the task result of the sample medical image.
In the embodiments of this application, medical images may be categorized by their diagnosis results, e.g., fracture, cardiac hypertrophy, pneumonia, pulmonary edema. Note that, besides classifying images by diagnosis result, the classification task may classify them from other angles, such as identifying the organ category in the image or the symptom grading of lesions; this application does not limit this. Through the image category label, images can be classified by diagnosis result, which improves the model network's robustness to images of different categories, further enhances the model's image-to-text conversion performance, and makes the output medical image reports more accurate and reliable.
In the embodiments of this application, the above task result refers to the output of a related self-supervised training task. During the training of the medical image report generation model, a self-supervised training method is introduced that further enlarges intra-class differences and strengthens the network's feature extraction ability for input images. The self-supervised task can be set as required; for example, it may be to judge the rotation angle of the input image, such as how many multiples of 90 degrees the input image has been rotated. An input sample medical image may be randomly left unrotated or rotated by 90, 180 or 270 degrees; the rotated image is fed into the visual feature extraction network for subsequent processing, and the encoding network outputs the corresponding task result, i.e., the prediction of the rotation angle (see the sketch after this paragraph).
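A minimal sketch of the self-supervised rotation transform described above: the image is rotated by a random multiple of 90 degrees, and the multiple serves as the target task result.

```python
import random
import torch

def rotate_for_self_supervision(image: torch.Tensor):
    # image: (C, H, W) tensor. Rotate by k * 90 degrees, k in {0, 1, 2, 3};
    # k is the target task result the self-learning label must predict.
    k = random.randint(0, 3)
    return torch.rot90(image, k, dims=(1, 2)), k
```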
Step 340: Encode the input information through the encoding network to obtain the visual encoding feature vector corresponding to the visual feature sequence, the output category result corresponding to the image category label, and the output task result corresponding to the self-learning label.
The input information is encoded by the encoding network to obtain an encoded feature vector, from which three parts of information are extracted: one part serves as the visual encoding feature vector corresponding to the visual feature sequence, another as the output category result corresponding to the image category label, and a third as the output task result corresponding to the self-learning label. The output category result corresponding to the image category label refers to the category of the sample medical image predicted by the encoding network (such as the diagnosis result introduced above), and the output task result corresponding to the self-learning label refers to the task result of the sample medical image predicted by the encoding network (such as the rotation angle introduced above).
Step 350: Decode the visual encoding feature vector through the decoding network to obtain the output image report corresponding to the sample medical image.
The visual encoding feature vector is further fed into the decoding network for decoding, and the decoding network outputs the output image report corresponding to the sample medical image. In the embodiments of this application, the image report output by the decoding network is in text form, and is a sentence/paragraph report conforming to natural language conventions rather than a few simple keywords.
Step 360: Compute the total loss function value of the medical image report generation model based on the output image report, the output category result and the output task result, and adjust the parameters of the medical image report generation model according to the total loss function value.
In the embodiments of this application, besides the main task (i.e., generating the image report), the model also completes other tasks in parallel (including determining the image category and the task result). Therefore, the model's loss function includes, in addition to the loss function of the main task, the loss functions of the other tasks; the individual loss functions are then summed to obtain the model's total loss function.
Optionally, the total loss function value of the model is computed as follows:
1. Compute a first loss function value based on the output image report and the target image report corresponding to the sample medical image.
The output image report is the image report output by the model, specifically by the decoding network. The target image report, introduced above, is the image report written manually by a professional doctor. Comparing the differences between the output image report and the target image report reveals the model's performance in report generation. Optionally, the first loss function may be a cross-entropy loss measuring the difference between the output image report and the target image report. Training the medical image report generation model on the first loss function therefore improves the accuracy and reliability of the medical image reports it generates.
2. Compute a second loss function value based on the output category result and the target category result corresponding to the sample medical image.
The output category result is the category result output by the model, specifically by the encoding network. The target category result is the accurate category result. Optionally, information of a designated field is extracted from the target image report corresponding to the sample medical image, and semantic recognition is performed on that information to obtain the target category result. For example, the impression part is extracted from the target image report and semantically recognized to obtain the diagnosis result, which is used as the target category result. In the embodiments of this application, the target category result is obtained by semantic recognition of the designated field alone, without analyzing the full text of the target image report, which reduces the time for semantic recognition and saves the processing resources of the computer device (a sketch of this extraction follows).
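A minimal illustrative sketch of deriving the target category from the designated field. The field name "impression" follows the structured-report layout above, but the keyword-to-category map and lookup logic are hypothetical; a real system would use a trained semantic recognition model instead.

```python
# Hypothetical keyword-to-category map for illustration only.
KEYWORD_TO_CATEGORY = {"fracture": 0, "cardiac hypertrophy": 1,
                       "pneumonia": 2, "pulmonary edema": 3}

def target_category_from_report(report: dict) -> int:
    impression = report["impression"].lower()   # only the designated field is analyzed
    for keyword, category in KEYWORD_TO_CATEGORY.items():
        if keyword in impression:
            return category
    return len(KEYWORD_TO_CATEGORY)             # fallback: "no finding" category
```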
Comparing the differences between the output category result and the target category result reveals the model's performance in category judgment. Optionally, the second loss function may be a cross-entropy loss measuring the difference between the output category result and the target category result.
In the embodiments of this application, the target category result can be extracted automatically from the target image report corresponding to the sample medical image, so no manual annotation of target category results is needed, which helps improve training efficiency. In addition, training the medical image report generation model on the second loss function improves the accuracy of the model's category judgment, further improving the accuracy and reliability of the medical image reports it generates.
3. Compute a third loss function value based on the output task result and the target task result corresponding to the sample medical image.
The output task result is the task result output by the model, specifically by the encoding network. The target task result is the accurate task result. Optionally, if the task is to judge the rotation angle of the input image, the sample medical image is rotated by a specified angle before being input to the visual feature extraction network. Correspondingly, the target task result indicates the actual rotation angle of the sample medical image, and the output task result indicates the predicted rotation angle. If the input sample medical image may be randomly left unrotated or rotated by 90, 180 or 270 degrees, and the task is to judge how many multiples of 90 degrees it has been rotated, the task result can be represented by 0, 1, 2 or 3, corresponding respectively to no rotation, 90 degrees, 180 degrees and 270 degrees. As another example, the input sample medical image may be randomly left unrotated or rotated by an arbitrary angle (e.g., 10, 36, 45, 110 or 305 degrees), and the task is to judge exactly how many degrees it has been rotated; the task result is then an angle value representing the rotation angle of the sample medical image. In the embodiments of this application, using the rotation angle of the medical image as the task result enables the model to recognize medical images at various angles, reducing the probability of inaccurate recognition caused by image rotation and improving the model's robustness when recognizing images at different angles.
Comparing the differences between the output task result and the target task result reveals the model's performance in judging the task result. Optionally, the third loss function may be a cross-entropy loss measuring the difference between the output task result and the target task result. Training the medical image report generation model on the third loss function therefore improves the accuracy of the model's task-result judgment, further improving the accuracy and reliability of the medical image reports it generates.
Optionally, to reduce the uncertainty introduced by the task, the third loss function includes, in addition to the cross-entropy between the output task result and the target task result, the information entropy of the output task result itself. The third loss function $L_{St}$ can be written as:

$$L_{St} = L_{CE}(Z_p, y_p) + \sum Z_p \log(Z_p)$$

where $Z_p$ denotes the output task result, $y_p$ the target task result, $L_{CE}(Z_p, y_p)$ the cross-entropy between the output task result and the target task result, and $\sum Z_p \log(Z_p)$ the information entropy of the output task result itself.
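A minimal sketch of this loss, assuming $Z_p$ enters as unnormalized logits that are softmax-normalized for the entropy term (the patent does not fix this detail):

```python
import torch.nn.functional as F

def self_supervised_loss(z_p, y_p):
    # L_St = CE(Z_p, y_p) + sum(Z_p * log Z_p)
    # z_p: (B, num_classes) rotation logits; y_p: (B,) true rotation indices.
    ce = F.cross_entropy(z_p, y_p)
    probs = F.softmax(z_p, dim=-1)
    entropy_term = (probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
    return ce + entropy_term
```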
4. Compute the total loss function value based on the first loss function value, the second loss function value and the third loss function value.
Optionally, the first, second and third loss function values are weighted and summed to obtain the total loss function value. The weight of each loss function can be set and adjusted reasonably according to the actual situation, for example set according to the importance of each task or adjusted according to the training effect, so as to tune the relative importance of the loss functions and obtain a model emphasizing one or more performance aspects; this application does not limit this.
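A minimal sketch of the weighted summation; the weights w1-w3 are hypothetical hyperparameters to be tuned per task importance, as described above.

```python
def total_loss(l_report, l_category, l_task, w1=1.0, w2=0.5, w3=0.5):
    # Weighted sum of the first, second and third loss function values.
    return w1 * l_report + w2 * l_category + w3 * l_task
```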
In summary, this application provides a technical solution for automatically generating medical image reports based on an AI model. During model training, in addition to completing the main task (i.e., generating image reports), the model also completes other tasks in parallel (including determining the image category and the task result), which helps improve the model network's image recognition ability, thereby indirectly enhancing the model's image-to-text conversion performance and making the output medical image reports more accurate and reliable.
In addition, adjusting the medical image report generation model from multiple dimensions based on the first, second and third loss functions allows the trained model to satisfy metrics along multiple dimensions and improves the accuracy and reliability of the medical image reports it generates.
Please refer to FIG. 4, which shows a flowchart of a training method for a medical image report generation model provided by another embodiment of this application. Each step of the method may be executed by the model training device 10 in the embodiment of FIG. 1, such as a computer or server. The method may include the following steps (410-460):
Step 410: Acquire a sample medical image.
Step 420: Perform visual feature extraction processing on the sample medical image through the visual feature extraction network to obtain a visual feature sequence of the sample medical image.
Step 430: Splice an image category label, a self-learning label and a model distillation label onto the visual feature sequence to obtain the input information of the encoding network.
In this embodiment, a model distillation label is additionally introduced. Through the processing of the encoding network, the model distillation label learns image feature information from the visual feature sequence and uses it to predict the category of the sample medical image. The category here may likewise be based on the diagnosis result of the medical image, e.g., fracture, cardiac hypertrophy, pneumonia, pulmonary edema.
Step 440: Encode the input information through the encoding network to obtain the visual encoding feature vector corresponding to the visual feature sequence, the output category result corresponding to the image category label, the output task result corresponding to the self-learning label, and the student output diagnosis result corresponding to the model distillation label.
In this embodiment, the input information is encoded by the encoding network to obtain an encoded feature vector, from which four parts of information are extracted: the first serves as the visual encoding feature vector corresponding to the visual feature sequence, the second as the output category result corresponding to the image category label, the third as the output task result corresponding to the self-learning label, and the fourth as the student output diagnosis result corresponding to the model distillation label. The output category result refers to the category of the sample medical image predicted by the encoding network (such as the diagnosis result introduced above), the output task result refers to the task result of the sample medical image predicted by the encoding network (such as the rotation angle introduced above), and the student output diagnosis result refers to the diagnosis result of the sample medical image predicted by the encoding network.
Step 450: Decode the visual encoding feature vector through the decoding network to obtain the output image report corresponding to the sample medical image.
Step 460: Compute the total loss function value of the medical image report generation model based on the output image report, the output category result, the output task result and the student output diagnosis result, and adjust the parameters of the medical image report generation model according to the total loss function value.
In this embodiment, besides the main task (i.e., generating the image report), the model also completes other tasks in parallel (including determining the image category, the task result and the diagnosis result). Therefore, the model's loss function includes, in addition to the loss function of the main task, the loss functions of the other tasks; the individual loss functions are summed to obtain the model's total loss function.
Optionally, the total loss function value of the model is computed as follows:
1. Compute a first loss function value based on the output image report and the target image report corresponding to the sample medical image.
2. Compute a second loss function value based on the output category result and the target category result corresponding to the sample medical image.
3. Compute a third loss function value based on the output task result and the target task result corresponding to the sample medical image.
4. Compute a fourth loss function value based on the student output diagnosis result and the teacher output diagnosis result corresponding to the sample medical image.
The student output diagnosis result is the diagnosis result output by the medical image report generation model, specifically by the encoding network. The teacher output diagnosis result is the diagnosis result output by a pre-trained teacher model. Optionally, the sample medical image is input to the pre-trained teacher model, which identifies the symptom category (i.e., the diagnosis result) in the sample medical image; the teacher output diagnosis result corresponding to the sample medical image is obtained through the teacher model. During training of the teacher model, sample medical images may be used as training data, with the target diagnosis result as label information; the target diagnosis result may be extracted from the impression part of the target image report corresponding to the sample medical image. The pre-trained teacher model is then used to distill the medical image report generation model. This not only improves model accuracy but also simplifies the model network structure, saving the storage resources occupied by the model and the processing resources needed when it is used, improving the model's running efficiency, and further improving the medical image report generation model's image recognition ability. In addition, distilling the medical image report generation model with a pre-trained teacher model accelerates convergence during training, improving the training efficiency of the medical image report generation model.
Comparing the differences between the student output diagnosis result and the teacher output diagnosis result reveals the model's performance in diagnosis recognition. Optionally, the fourth loss function may be a cross-entropy loss measuring the difference between the student output diagnosis result and the teacher output diagnosis result.
In one example, the fourth loss function $L_{global}$ can be written as:

$$L_{global} = (1-\lambda)\, L_{CE}(\psi(Z_s), y) + \lambda \tau^2\, \mathrm{KL}\big(\psi(Z_s/\tau),\, \psi(Z_t/\tau)\big)$$

where $Z_s$ and $Z_t$ are the outputs of the student model (i.e., the medical image report generation model) and the teacher model respectively, i.e., $Z_s$ is the student output diagnosis result and $Z_t$ the teacher output diagnosis result; $y$ is the target diagnosis result; $L_{CE}(\psi(Z_s), y)$ is the cross-entropy between the student output diagnosis result and the target diagnosis result; $\mathrm{KL}$ denotes the KL divergence (Kullback-Leibler Divergence); $\psi$ denotes the softmax function; and $\lambda$ and $\tau$ are hyperparameters. Exemplarily, $\lambda$ is set to 0.5 and $\tau$ to 1.
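A minimal sketch of this distillation loss, following the common knowledge-distillation implementation; the KL direction (teacher distribution as target) is one reasonable reading of the formula, which does not spell it out.

```python
import torch.nn.functional as F

def distillation_loss(z_s, z_t, y, lam=0.5, tau=1.0):
    # L_global = (1 - lambda) * CE(softmax(Z_s), y)
    #          + lambda * tau^2 * KL(softmax(Z_s/tau), softmax(Z_t/tau))
    ce = F.cross_entropy(z_s, y)
    kl = F.kl_div(F.log_softmax(z_s / tau, dim=-1),
                  F.softmax(z_t / tau, dim=-1),
                  reduction="batchmean")
    return (1.0 - lam) * ce + lam * (tau ** 2) * kl
```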
In the embodiments of this application, the target diagnosis result can be extracted automatically from the target image report corresponding to the sample medical image, so no manual annotation of target diagnosis results is needed, which helps improve the training efficiency of the model.
5. Compute the total loss function value based on the first, second, third and fourth loss function values.
Optionally, the first, second, third and fourth loss function values are weighted and summed to obtain the total loss function value.
It should be noted that this embodiment mainly describes what is new relative to the embodiment of FIG. 2; for parts not detailed here, refer to the description of the embodiment of FIG. 2 above.
In summary, this embodiment further introduces a model distillation label to make the model complete the diagnosis task. Experiments show that introducing a model distillation label yields a better-performing medical image report generation model than simply using two image classification labels, because the model distillation label can learn inductive biases from the teacher model, thereby improving the performance of the medical image report generation model.
In an exemplary embodiment, the medical image report generation model may adopt a CNN+Transformer architecture, where the CNN serves as the visual feature extraction network and the Transformer comprises multiple cascaded encoders, serving as the encoding network, and multiple cascaded decoders, serving as the decoding network.
Optionally, FIG. 5 exemplarily shows an architecture diagram of a medical image report generation model. The model adopts a CNN+Transformer architecture and includes a visual feature extraction network, an encoding network and a decoding network; the visual feature extraction network adopts a CNN structure, the encoding and decoding networks adopt a Transformer structure, the encoding network includes N cascaded encoders and the decoding network includes N cascaded decoders, N being an integer greater than 1. Exemplarily, N is 6. The medical image undergoes feature extraction by the visual feature extraction network to obtain visual feature information, which is divided into multiple visual feature units; the feature vector of each unit is acquired to obtain the visual feature sequence. The image category label, self-learning label and model distillation label are spliced onto the visual feature sequence to obtain the input information of the encoding network. The input information is encoded by the encoding network to obtain the visual encoding feature vector corresponding to the visual feature sequence, which is decoded by the decoding network to obtain the image report corresponding to the medical image (a sketch of this overall pipeline follows).
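A minimal end-to-end sketch of this pipeline, not the patent's implementation: the CNN stage, layer sizes, and use of PyTorch's `nn.Transformer` are illustrative assumptions, and the auxiliary label heads are omitted for brevity.

```python
import torch
import torch.nn as nn

class ReportGenerator(nn.Module):
    def __init__(self, d=512, n_layers=6, vocab_size=10000):
        super().__init__()
        # Toy CNN producing a 5x5 grid of visual feature units.
        self.cnn = nn.Sequential(nn.Conv2d(1, 64, 7, stride=2, padding=3),
                                 nn.ReLU(), nn.AdaptiveAvgPool2d(5))
        self.proj = nn.Linear(64, d)                   # unit -> embedding
        self.labels = nn.Parameter(torch.zeros(3, d))  # category/self-learning/distill
        self.transformer = nn.Transformer(d, nhead=8,
                                          num_encoder_layers=n_layers,
                                          num_decoder_layers=n_layers,
                                          batch_first=True)
        self.word_head = nn.Linear(d, vocab_size)      # report word logits

    def forward(self, image, report_tokens_emb):
        units = self.cnn(image).flatten(2).transpose(1, 2)  # (B, 25, 64)
        seq = self.proj(units)                              # visual feature sequence
        src = torch.cat([seq, self.labels.expand(seq.size(0), -1, -1)], dim=1)
        hidden = self.transformer(src, report_tokens_emb)   # encode + decode
        return self.word_head(hidden)
```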
The encoding network and decoding network built on the Transformer structure are described below. The Transformer is a sequence-to-sequence model whose distinguishing feature is its extensive use of the self-attention mechanism. A network model built on the Transformer structure uses self-attention rather than the sequential structure of an RNN (Recurrent Neural Network), so the model can be trained in parallel and has access to global information.
FIG. 6 exemplarily shows a schematic diagram of a Transformer structure: the encoder part is on the left and the decoder part on the right. For ease of understanding, the Transformer structure is first described using a text translation task as an example, where the input is a sequence of words to be translated and the output is the translated word sequence.
In the encoder part, each word of the input sequence passes, in parallel, through embedding, positional encoding, a multi-head attention layer, residual connection and layer normalization, feed-forward propagation, residual connection and layer normalization, and the computation of the K and V vectors of each word encoding, and is then sent to the decoder.
In the decoder part, the translation result of the previous word (or the start token) is input and passes through embedding, positional encoding, a masked multi-head attention layer, and residual connection and layer normalization to obtain the decoded Q vector; then the K and V vectors of the current word, together with this decoded Q vector, pass through a multi-head attention layer, residual connection and layer normalization, feed-forward propagation, residual connection and layer normalization, a fully connected layer and a Softmax layer to obtain the translation result of the current word. Finally, the translation results of all words are concatenated to obtain the translated word sequence.
Take translating "机器学习" into "machine learning" as an example. The encoder input is "机器学习" (the two words "机器" and "学习"); the first decoder input is <BOS> (the start token), and the output is "machine". At the next time step the decoder input is "machine" and the output is "learning". The process repeats until an end token (e.g., a period) is output, marking the end of translation. Note that the encoder part can be computed in parallel, encoding the entire encoder input at once, whereas the decoder does not decode the whole sequence at once but, like an RNN, decodes one element at a time, because the decoding output of the previous position is used as the Q vector of the attention mechanism.
The attention mechanism can discover relationships between words, and such relationships are already widely used in CV (computer vision). Given a word, a word embedding is obtained, and three independent fully connected layers produce the word's Q (query), K (key) and V (value) vectors. Taking the dot product of the current word's Q vector with the K vectors of the other words, normalizing, applying Softmax, and multiplying by the V vectors yields the current word's attention vector over the other words, realizing self-attention. Pure self-attention, however, always over-attends to the current word itself and weakens the information of the other words, which is of little use. To solve this problem, a multi-head attention mechanism is adopted.
In the multi-head attention mechanism, as shown in FIG. 7, the word embeddings processed into Q, K and V are split into h parts (here h is taken as 3). The three different sets of Q, K and V produce three different Z (hidden) features; evidently this Z has weakened the tendency to attend to itself. The parts are then concatenated and passed through a fully connected layer W to compute the final Z (hidden) vector, which can now be viewed as a new feature averaging different regions of attention. A further benefit of this multi-head computation is that it can be parallelized (see the sketch after this paragraph).
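A bare-bones sketch of the multi-head split/attend/concatenate/project steps, using h = 3 as in the example above (d must be divisible by h); the output projection `w_out` plays the role of the fully connected layer W.

```python
import torch

def multi_head_attention(q, k, v, w_out, h=3):
    B, T, d = q.shape
    def split(x):                                   # (B, T, d) -> (B, h, T, d/h)
        return x.view(B, T, h, d // h).transpose(1, 2)
    qs, ks, vs = split(q), split(k), split(v)
    att = torch.softmax(qs @ ks.transpose(-2, -1) / (d // h) ** 0.5, dim=-1)
    z = (att @ vs).transpose(1, 2).reshape(B, T, d) # concatenate the h heads
    return z @ w_out                                # final fully connected layer
```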
In addition, the position of a word/feature is very important for sequence conversion problems (such as text translation or image-to-text conversion), so after obtaining the image features and word embeddings, the positional information of each word must be encoded. The encoding scheme is shown in FIG. 8, where pos denotes the position of the word in the current sentence, i denotes the dimension of the word embedding, i ranges over [0, d/2), and d is a set value such as 512. Thus each word and each dimension in the PE (Positional Encoding) receives a different code: even-indexed dimensions are encoded with the sin formula and odd-indexed dimensions with the cos formula, as follows:

$$PE_{(pos,\,2i)} = \sin\big(pos / 10000^{2i/d}\big)$$

$$PE_{(pos,\,2i+1)} = \cos\big(pos / 10000^{2i/d}\big)$$
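A minimal sketch computing these two formulas for all positions at once:

```python
import torch

def positional_encoding(max_len, d=512):
    # Even dims use sin(pos / 10000^(2i/d)); odd dims use the cos counterpart.
    pos = torch.arange(max_len).unsqueeze(1).float()  # (max_len, 1)
    two_i = torch.arange(0, d, 2).float()             # the 2i values
    angle = pos / torch.pow(10000.0, two_i / d)
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe
```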
A residual connection prevents the vanishing-gradient phenomenon caused by stacking deeper modules in the Transformer and guards against network degradation. Therefore the Z vector is first added to the original input X vector, then layer normalization computes the variance and mean over the channel dimension of the current word vector and normalizes it, and the result is fed into the feed-forward layer.
Finally, the attention result is fed into two fully connected layers, one that raises the dimension and one that lowers it, followed again by the residual connection and layer normalization just described, yielding the encoder's final output.
In the decoder, the translation result cannot see subsequent outputs when it is input into the decoder, so a mask mechanism is enforced in the attention computation. Put simply, as shown in FIG. 9, after obtaining the attention weights (from the matrix product of the Q and K vectors), they are multiplied by an upper triangular matrix and the upper triangular region is invalidated, so that after softmax these invalidated regions become all zeros, preventing information leakage in the decoder (a sketch of the mask follows).
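A minimal sketch of the causal mask, implemented the usual way by setting the invalidated upper triangular region to negative infinity before softmax so those weights become zero:

```python
import torch

def causal_mask(T):
    # True marks upper-triangular positions a token must not attend to.
    return torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

def masked_softmax(scores, mask):
    # Masked entries become 0 after softmax, as described above.
    return torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
```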
This module is computed essentially the same way as the multi-head attention module; the only difference is that K and V come from the encoder. As shown in FIG. 10, the attention vector can be computed according to the following formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $d_k$ denotes the dimension of the Q or K vector.
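A direct sketch of this scaled dot-product attention formula:

```python
import torch

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return torch.softmax(scores, dim=-1) @ v
```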
When the Transformer structure is applied to the image-to-text conversion task in this application, the original medical image is converted into a visual feature sequence by the feature extraction processing of the visual feature extraction network, and the image category label, self-learning label and model distillation label are spliced onto that sequence to obtain the input information of the encoding network. The input information is then a sequence of vectors, so the encoding-decoding processing of the Transformer network can output an image report in text form.
The above embodiments describe the training method of the medical image report generation model; the following embodiments describe the image report generation method based on that model. The content involved in using the medical image report generation model corresponds to the content involved in training it, and the two are interchangeable; where one side lacks a detailed description, refer to the description on the other side.
Please refer to FIG. 11, which shows a flowchart of an image report generation method provided by an embodiment of this application. Each step of the method may be executed by the model using device 20 in the embodiment of FIG. 1, such as a terminal device (a mobile phone, tablet computer, PC or medical device) or a server. The method may include the following steps (1110-1140):
Step 1110: Perform feature extraction processing on the target medical image through the visual feature extraction network to obtain a visual feature sequence of the target medical image.
The target medical image may be any medical image; with the method provided by this embodiment, the image report corresponding to the target medical image can be generated automatically through the medical image report generation model.
Optionally, visual feature extraction processing is performed on the target medical image through the visual feature extraction network to obtain visual feature information of the target medical image; the visual feature information is divided into a plurality of visual feature units; and the feature vector of each visual feature unit is acquired to obtain the visual feature sequence.
Step 1120: Splice the image category label and the self-learning label onto the visual feature sequence to obtain the input information of the encoding network.
Optionally, the image category label, the self-learning label and the model distillation label are spliced onto the visual feature sequence to obtain the input information of the encoding network.
It should be noted that the image category label, self-learning label and model distillation label spliced here are exactly the same as those spliced during model training. For example, if during training these three labels are three all-zero vectors (i.e., every element of each vector is 0), then during model use they are likewise three all-zero vectors.
Step 1130: Encode the input information through the encoding network to obtain the visual encoding feature vector corresponding to the visual feature sequence.
The input information is encoded by the encoding network to obtain an encoded feature vector.
When the input information includes the visual feature sequence, the image category label and the self-learning label, three parts of information are extracted from the encoded feature vector: one part serves as the visual encoding feature vector corresponding to the visual feature sequence, another as the output category result corresponding to the image category label, and a third as the output task result corresponding to the self-learning label. The output category result refers to the category of the target medical image predicted by the encoding network (such as the diagnosis result introduced above), and the output task result refers to the task result of the target medical image predicted by the encoding network (such as the rotation angle introduced above).
When the input information includes the visual feature sequence, the image category label, the self-learning label and the model distillation label, four parts of information are extracted from the encoded feature vector: the first serves as the visual encoding feature vector corresponding to the visual feature sequence, the second as the output category result corresponding to the image category label, the third as the output task result corresponding to the self-learning label, and the fourth as the student output diagnosis result corresponding to the model distillation label. The output category result refers to the category of the target medical image predicted by the encoding network (such as the diagnosis result introduced above), the output task result refers to the task result of the target medical image predicted by the encoding network (such as the rotation angle introduced above), and the student output diagnosis result refers to the diagnosis result of the target medical image predicted by the encoding network.
Step 1140: Decode the visual encoding feature vector through the decoding network to obtain the output image report corresponding to the target medical image.
The visual encoding feature vector is further fed into the decoding network for decoding, and the decoding network outputs the output image report corresponding to the target medical image. In the embodiments of this application, the image report output by the decoding network is in text form, and is a sentence/paragraph report conforming to natural language conventions rather than a few simple keywords.
Optionally, if required, at least one of the category result, the task result and the diagnosis result corresponding to the target medical image output by the encoding network may also be acquired.
In summary, this application provides a technical solution for automatically generating medical image reports based on an AI model. Since during model training, besides the main task (i.e., generating image reports), the model also completes other tasks in parallel (including determining the image category, the task result and the diagnosis result), which helps improve the model network's image recognition ability, correspondingly, during model use, the image category label, self-learning label and model distillation label are likewise spliced onto the visual feature sequence to obtain the input information of the encoding network, so that the model outputs more accurate and reliable medical image reports.
Please refer to FIG. 12, which shows a block diagram of a training apparatus for a medical image report generation model provided by an embodiment of this application. The apparatus has the function of implementing the above training method for the medical image report generation model; the function may be implemented by hardware, or by hardware executing corresponding software. The apparatus may be a computer device, or may be provided in a computer device. The apparatus 1200 may include: a sample acquisition module 1210, a feature extraction module 1220, an information splicing module 1230, an encoding processing module 1240, a decoding processing module 1250, a loss calculation module 1260 and a model parameter adjustment module 1270.
The sample acquisition module 1210 is configured to acquire a sample medical image.
The feature extraction module 1220 is configured to perform visual feature extraction processing on the sample medical image through the visual feature extraction network to obtain a visual feature sequence of the sample medical image.
The information splicing module 1230 is configured to splice a self-learning label onto the visual feature sequence to obtain the input information of the encoding network.
The encoding processing module 1240 is configured to encode the input information through the encoding network to obtain a visual encoding feature vector corresponding to the visual feature sequence and an output task result corresponding to the self-learning label.
The decoding processing module 1250 is configured to decode the visual encoding feature vector through the decoding network to obtain an output image report corresponding to the sample medical image.
The loss calculation module 1260 is configured to compute a total loss function value of the medical image report generation model based on the output image report and the output task result.
The model parameter adjustment module 1270 is configured to adjust the parameters of the medical image report generation model according to the total loss function value.
In an exemplary embodiment, the loss calculation module 1260 is configured to:
compute a first loss function value based on the output image report and the target image report corresponding to the sample medical image;
compute a third loss function value based on the output task result and the target task result corresponding to the sample medical image;
compute the total loss function value based on the first loss function value and the third loss function value.
In an exemplary embodiment, the feature extraction module 1220 is further configured to:
rotate the sample medical image by a specified angle and then input it to the visual feature extraction network;
where the target task result indicates the actual rotation angle of the sample medical image, and the output task result indicates the predicted rotation angle of the sample medical image.
In an exemplary embodiment, the input information further includes a model distillation label, and a student output diagnosis result is obtained from the model distillation label through the encoding network;
the loss calculation module 1260 is further configured to:
compute a fourth loss function value based on the student output diagnosis result and the teacher output diagnosis result corresponding to the sample medical image;
compute the total loss function value based on the first loss function value, the third loss function value and the fourth loss function value.
In an exemplary embodiment, the sample acquisition module 1210 is further configured to:
input the sample medical image to a pre-trained teacher model, the teacher model being used to identify the symptom category in the sample medical image;
obtain, through the teacher model, the teacher output diagnosis result corresponding to the sample medical image.
In an exemplary embodiment, the loss calculation module 1260 is further configured to:
perform weighted summation on the first loss function value, the third loss function value and the fourth loss function value to obtain the total loss function value.
In an exemplary embodiment, the input information further includes an image category label, and an output category result corresponding to the image category label is obtained from the image category label through the encoding network;
the loss calculation module 1260 is further configured to: compute a second loss function value based on the output category result and the target category result corresponding to the sample medical image;
the loss calculation module 1260 is further configured to: compute the total loss function value based on the first loss function value, the second loss function value and the third loss function value.
In an exemplary embodiment, the sample acquisition module 1210 is further configured to:
extract information of a designated field from the target image report corresponding to the sample medical image;
perform semantic recognition on the information of the designated field to obtain the target category result corresponding to the sample medical image.
In an exemplary embodiment, the feature extraction module 1220 is configured to:
perform visual feature extraction processing on the sample medical image through the visual feature extraction network to obtain visual feature information of the sample medical image;
divide the visual feature information into a plurality of visual feature units;
acquire the feature vector of each visual feature unit to obtain the visual feature sequence.
In summary, this application provides a technical solution for automatically generating medical image reports based on an AI model. During model training, in addition to completing the main task (i.e., generating image reports), the model also completes other tasks in parallel (such as the task result), where the task result is the output of a related self-supervised training task. Introducing this self-supervised training method further enlarges intra-class differences, strengthens the network's feature extraction ability for input images, improves the model network's robustness to different images and its image recognition ability, thereby indirectly enhancing the model's image-to-text conversion performance and making the output medical image reports more accurate and reliable.
Please refer to FIG. 13, which shows a block diagram of an image report generation apparatus provided by an embodiment of this application. The apparatus has the function of implementing the above image report generation method; the function may be implemented by hardware, or by hardware executing corresponding software. The apparatus may be a computer device, or may be provided in a computer device. The apparatus 1300 may include: a feature extraction module 1310, an information splicing module 1320, an encoding processing module 1330 and a decoding processing module 1340.
The feature extraction module 1310 is configured to perform feature extraction processing on the target medical image through the visual feature extraction network to obtain a visual feature sequence of the target medical image.
The information splicing module 1320 is configured to splice a self-learning label onto the visual feature sequence to obtain the input information of the encoding network.
The encoding processing module 1330 is configured to encode the input information through the encoding network to obtain a visual encoding feature vector corresponding to the visual feature sequence.
The decoding processing module 1340 is configured to decode the visual encoding feature vector through the decoding network to obtain an output image report corresponding to the target medical image.
In an exemplary embodiment, the information splicing module 1320 is configured to splice an image category label and the self-learning label onto the visual feature sequence to obtain the input information of the encoding network.
In an exemplary embodiment, the information splicing module 1320 is configured to:
splice the image category label, the self-learning label and a model distillation label onto the visual feature sequence to obtain the input information of the encoding network.
In an exemplary embodiment, the feature extraction module 1310 is configured to:
perform visual feature extraction processing on the target medical image through the visual feature extraction network to obtain visual feature information of the target medical image;
divide the visual feature information into a plurality of visual feature units;
acquire the feature vector of each visual feature unit to obtain the visual feature sequence.
In summary, this application provides a technical solution for automatically generating medical image reports based on an AI model. Since during model training, besides the main task (i.e., generating image reports), the model also completes other tasks in parallel (including determining the image category, the task result and the diagnosis result), which helps improve the model network's image recognition ability, correspondingly, during model use, the image category label, self-learning label and model distillation label are likewise spliced onto the visual feature sequence to obtain the input information of the encoding network, so that the model outputs more accurate and reliable medical image reports.
Please refer to FIG. 14, which shows a schematic structural diagram of a computer device provided by an embodiment of this application. The computer device may be any electronic device with data computing, processing and storage capabilities, such as a mobile phone, a tablet computer, a PC (Personal Computer) or a server. The computer device is used to implement the training method for the medical image report generation model or the image report generation method provided in the above embodiments. Specifically:
The computer device 1400 includes a central processing unit 1401 (such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit) or an FPGA (Field Programmable Gate Array)), a system memory 1404 including a RAM (Random-Access Memory) 1402 and a ROM (Read-Only Memory) 1403, and a system bus 1405 connecting the system memory 1404 and the central processing unit 1401. The computer device 1400 also includes a basic input/output system (I/O system) 1406 that facilitates the transfer of information between devices within the server, and a mass storage device 1407 for storing an operating system 1414, application programs 1414 and other program modules 1415.
The basic input/output system 1406 includes a display 1408 for displaying information and input devices 1409, such as a mouse or keyboard, for the user to input information. The display 1408 and the input devices 1409 are both connected to the central processing unit 1401 through an input/output controller 1410 connected to the system bus 1405. The basic input/output system 1406 may also include the input/output controller 1410 for receiving and processing input from a number of other devices such as a keyboard, mouse or electronic stylus. Similarly, the input/output controller 1410 also provides output to a display screen, printer or other type of output device.
The mass storage device 1407 is connected to the central processing unit 1401 through a mass storage controller (not shown) connected to the system bus 1405. The mass storage device 1407 and its associated computer-readable media provide non-volatile storage for the computer device 1400. That is, the mass storage device 1407 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM (Compact Disc Read-Only Memory) drive.
Without loss of generality, the computer-readable media may include computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules or other data. Computer storage media include RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other solid-state storage technologies, CD-ROM, DVD (Digital Video Disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will know that computer storage media are not limited to the above. The system memory 1404 and mass storage device 1407 above may be collectively referred to as memory.
According to the embodiments of this application, the computer device 1400 may also operate through a remote computer connected to a network such as the Internet. That is, the computer device 1400 may be connected to the network 1412 through the network interface unit 1411 connected to the system bus 1405, or the network interface unit 1411 may be used to connect to other types of networks or remote computer systems (not shown).
The memory also stores at least one instruction, at least one program, a code set or an instruction set, configured to be executed by one or more processors to implement the above training method for a medical image report generation model or the above image report generation method.
In an exemplary embodiment, a computer-readable storage medium is also provided. The storage medium stores at least one instruction, at least one program, a code set or an instruction set which, when executed by the processor of a computer device, implements the above training method for a medical image report generation model or the above image report generation method.
Optionally, the computer-readable storage medium may include a ROM (Read-Only Memory), a RAM (Random-Access Memory), an SSD (Solid State Drive), an optical disc, or the like. The random access memory may include ReRAM (Resistive Random-Access Memory) and DRAM (Dynamic Random-Access Memory).
In an exemplary embodiment, a computer program product or computer program is also provided, including computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to execute the above training method for a medical image report generation model or the above image report generation method.
It should be understood that "plurality" herein means two or more. This application is not limited to the precise structure described above and shown in the accompanying drawings, and various modifications and changes may be made without departing from its scope. The scope of this application is limited only by the appended claims.

Claims (20)

  1. A training method for a medical image report generation model, the method being executed by a computer device, the medical image report generation model comprising a visual feature extraction network, an encoding network and a decoding network, the method comprising:
    acquiring a sample medical image;
    performing visual feature extraction processing on the sample medical image through the visual feature extraction network to obtain a visual feature sequence of the sample medical image;
    splicing a self-learning label onto the visual feature sequence to obtain input information of the encoding network;
    encoding the input information through the encoding network to obtain a visual encoding feature vector corresponding to the visual feature sequence and an output task result corresponding to the self-learning label;
    decoding the visual encoding feature vector through the decoding network to obtain an output image report corresponding to the sample medical image;
    computing a total loss function value of the medical image report generation model based on the output image report and the output task result, and adjusting parameters of the medical image report generation model according to the total loss function value.
  2. The method according to claim 1, wherein the computing a total loss function value of the medical image report generation model based on the output image report and the output task result comprises:
    computing a first loss function value based on the output image report and a target image report corresponding to the sample medical image;
    computing a third loss function value based on the output task result and a target task result corresponding to the sample medical image;
    computing the total loss function value based on the first loss function value and the third loss function value.
  3. The method according to claim 2, wherein the method further comprises:
    inputting the sample medical image to the visual feature extraction network after rotating it by a specified angle;
    wherein the target task result is used to indicate the actual rotation angle of the sample medical image, and the output task result is used to indicate the predicted rotation angle of the sample medical image.
  4. The method according to claim 2, wherein the input information further comprises a model distillation label, and a student output diagnosis result is obtained from the model distillation label through the encoding network;
    the method further comprising:
    computing a fourth loss function value based on the student output diagnosis result and a teacher output diagnosis result corresponding to the sample medical image;
    computing the total loss function value based on the first loss function value, the third loss function value and the fourth loss function value.
  5. The method according to claim 4, wherein the method further comprises:
    inputting the sample medical image to a pre-trained teacher model, the teacher model being used to identify a symptom category in the sample medical image;
    obtaining, through the teacher model, the teacher output diagnosis result corresponding to the sample medical image.
  6. The method according to claim 4, wherein the computing the total loss function value based on the first loss function value, the third loss function value and the fourth loss function value comprises:
    performing weighted summation on the first loss function value, the third loss function value and the fourth loss function value to obtain the total loss function value.
  7. The method according to claim 2, wherein the input information further comprises an image category label, and an output category result corresponding to the image category label is obtained from the image category label through the encoding network;
    the method further comprising:
    computing a second loss function value based on the output category result and a target category result corresponding to the sample medical image;
    computing the total loss function value based on the first loss function value, the second loss function value and the third loss function value.
  8. The method according to claim 7, wherein the method further comprises:
    extracting information of a designated field from the target image report corresponding to the sample medical image;
    performing semantic recognition on the information of the designated field to obtain the target category result.
  9. The method according to claim 1, wherein the performing visual feature extraction processing on the sample medical image through the visual feature extraction network to obtain a visual feature sequence of the sample medical image comprises:
    performing visual feature extraction processing on the sample medical image through the visual feature extraction network to obtain visual feature information of the sample medical image;
    dividing the visual feature information into a plurality of visual feature units;
    acquiring a feature vector of each of the visual feature units to obtain the visual feature sequence.
  10. An image report generation method based on a medical image report generation model, the method being executed by a computer device, the medical image report generation model comprising a visual feature extraction network, an encoding network and a decoding network, the method comprising:
    performing feature extraction processing on a target medical image through the visual feature extraction network to obtain a visual feature sequence of the target medical image;
    splicing a self-learning label onto the visual feature sequence to obtain input information of the encoding network;
    encoding the input information through the encoding network to obtain a visual encoding feature vector corresponding to the visual feature sequence;
    decoding the visual encoding feature vector through the decoding network to obtain an output image report corresponding to the target medical image.
  11. The method according to claim 10, wherein the splicing a self-learning label onto the visual feature sequence to obtain input information of the encoding network comprises:
    splicing an image category label and the self-learning label onto the visual feature sequence to obtain the input information of the encoding network.
  12. The method according to claim 11, wherein the splicing an image category label and the self-learning label onto the visual feature sequence to obtain the input information of the encoding network comprises:
    splicing the image category label, the self-learning label and a model distillation label onto the visual feature sequence to obtain the input information of the encoding network.
  13. The method according to claim 10, wherein the performing feature extraction processing on a target medical image through the visual feature extraction network to obtain a visual feature sequence of the target medical image comprises:
    performing visual feature extraction processing on the target medical image through the visual feature extraction network to obtain visual feature information of the target medical image;
    dividing the visual feature information into a plurality of visual feature units;
    acquiring a feature vector of each of the visual feature units to obtain the visual feature sequence.
  14. A training apparatus for a medical image report generation model, the medical image report generation model comprising a visual feature extraction network, an encoding network and a decoding network, the apparatus comprising:
    a sample acquisition module, configured to acquire a sample medical image;
    a feature extraction module, configured to perform visual feature extraction processing on the sample medical image through the visual feature extraction network to obtain a visual feature sequence of the sample medical image;
    an information splicing module, configured to splice a self-learning label onto the visual feature sequence to obtain input information of the encoding network;
    an encoding processing module, configured to encode the input information through the encoding network to obtain a visual encoding feature vector corresponding to the visual feature sequence and an output task result corresponding to the self-learning label;
    a decoding processing module, configured to decode the visual encoding feature vector through the decoding network to obtain an output image report corresponding to the sample medical image;
    a loss calculation module, configured to compute a total loss function value of the medical image report generation model based on the output image report and the output task result;
    a model parameter adjustment module, configured to adjust parameters of the medical image report generation model according to the total loss function value.
  15. The apparatus according to claim 14, wherein the loss calculation module is configured to:
    compute a first loss function value based on the output image report and a target image report corresponding to the sample medical image;
    compute a third loss function value based on the output task result and a target task result corresponding to the sample medical image;
    compute the total loss function value based on the first loss function value and the third loss function value.
  16. The apparatus according to claim 15, wherein the input information further comprises a model distillation label, and a student output diagnosis result is obtained from the model distillation label through the encoding network;
    the loss calculation module being configured to:
    compute a fourth loss function value based on the student output diagnosis result and a teacher output diagnosis result corresponding to the sample medical image;
    compute the total loss function value based on the first loss function value, the third loss function value and the fourth loss function value.
  17. An image report generation apparatus based on a medical image report generation model, the medical image report generation model comprising a visual feature extraction network, an encoding network and a decoding network, the apparatus comprising:
    a feature extraction module, configured to perform feature extraction processing on a target medical image through the visual feature extraction network to obtain a visual feature sequence of the target medical image;
    an information splicing module, configured to splice a self-learning label onto the visual feature sequence to obtain input information of the encoding network;
    an encoding processing module, configured to encode the input information through the encoding network to obtain a visual encoding feature vector corresponding to the visual feature sequence;
    a decoding processing module, configured to decode the visual encoding feature vector through the decoding network to obtain an output image report corresponding to the target medical image.
  18. A computer device, comprising a processor and a memory, the memory storing at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by the processor to implement the training method for a medical image report generation model according to any one of claims 1 to 9, or the image report generation method based on a medical image report generation model according to any one of claims 10 to 13.
  19. A computer-readable storage medium, storing at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by a processor to implement the training method for a medical image report generation model according to any one of claims 1 to 9, or the image report generation method based on a medical image report generation model according to any one of claims 10 to 13.
  20. A computer program product or computer program, comprising computer instructions, a processor executing the computer instructions to implement the training method for a medical image report generation model according to any one of claims 1 to 9, or the image report generation method based on a medical image report generation model according to any one of claims 10 to 13.
PCT/CN2022/081537 2021-03-25 2022-03-17 Training method for medical image report generation model and image report generation method WO2022199462A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/073,290 US20230092027A1 (en) 2021-03-25 2022-12-01 Method and apparatus for training medical image report generation model, and image report generation method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110320701.3A CN112992308B (zh) 2021-03-25 2021-03-25 医学图像报告生成模型的训练方法及图像报告生成方法
CN202110320701.3 2021-03-25

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/073,290 Continuation US20230092027A1 (en) 2021-03-25 2022-12-01 Method and apparatus for training medical image report generation model, and image report generation method and apparatus

Publications (1)

Publication Number Publication Date
WO2022199462A1 true WO2022199462A1 (zh) 2022-09-29

Family

ID=76333652

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/081537 WO2022199462A1 (zh) 2021-03-25 2022-03-17 医学图像报告生成模型的训练方法及图像报告生成方法

Country Status (3)

Country Link
US (1) US20230092027A1 (zh)
CN (1) CN112992308B (zh)
WO (1) WO2022199462A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115601485A (zh) * 2022-12-15 2023-01-13 阿里巴巴(中国)有限公司 Data processing method for task processing model, and virtual character animation generation method
CN115690793A (zh) * 2023-01-03 2023-02-03 北京百度网讯科技有限公司 Character recognition model, and recognition method, apparatus, device and medium thereof
CN116012650A (zh) * 2023-01-03 2023-04-25 北京百度网讯科技有限公司 Character recognition model training and recognition method, apparatus, device and medium
CN117035052A (zh) * 2023-10-10 2023-11-10 杭州海康威视数字技术股份有限公司 Data-free knowledge distillation method, apparatus and storage medium
CN117194605A (zh) * 2023-11-08 2023-12-08 中南大学 Hash encoding method, terminal and medium for missing multimodal medical data

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112992308B (zh) 2021-03-25 2023-05-16 腾讯科技(深圳)有限公司 Training method for medical image report generation model and image report generation method
CN113421643B (zh) * 2021-07-09 2023-06-27 浙江大学 AI model reliability determination method, apparatus, device and storage medium
CN113724359B (zh) * 2021-07-14 2024-09-24 鹏城实验室 Transformer-based CT report generation method
CN113850753B (zh) * 2021-08-17 2023-09-01 苏州鸿熙融合智能医疗科技有限公司 Medical image information computing method and apparatus, edge computing device and storage medium
CN113539408B (zh) * 2021-08-31 2022-02-25 北京字节跳动网络技术有限公司 Medical report generation method, model training method, apparatus and device
CN113869141A (zh) * 2021-09-07 2021-12-31 中国电信股份有限公司 Feature extraction method and apparatus, encoder, and communication system
CN113888475A (zh) * 2021-09-10 2022-01-04 上海商汤智能科技有限公司 Image detection method, training method for related models, and related apparatus and device
CN114446434A (zh) * 2021-11-11 2022-05-06 中国科学院深圳先进技术研究院 Report generation method, system and terminal device
CN114334068B (zh) * 2021-11-15 2022-11-01 深圳市龙岗中心医院(深圳市龙岗中心医院集团、深圳市第九人民医院、深圳市龙岗中心医院针灸研究所) Radiology report generation method, apparatus, terminal and storage medium
EP4433990A1 (en) * 2021-11-17 2024-09-25 Eyetelligence Pty Ltd Method and system for analysing medical images to generate a medical report
CN114207673A (zh) * 2021-12-20 2022-03-18 商汤国际私人有限公司 Sequence recognition method and apparatus, electronic device and storage medium
CN114332479A (zh) * 2021-12-23 2022-04-12 浪潮(北京)电子信息产业有限公司 Training method for object detection model and related apparatus
CN114743018B (zh) * 2022-04-21 2024-05-31 平安科技(深圳)有限公司 Image description generation method, apparatus, device and medium
WO2023212997A1 (zh) * 2022-05-05 2023-11-09 五邑大学 Knowledge-distillation-based neural network training method, device and storage medium
CN114944213A (zh) * 2022-06-08 2022-08-26 长江大学 Memory-driven Transformer-based medical endoscope image report generation method
CN114792315B (zh) * 2022-06-22 2022-10-11 浙江太美医疗科技股份有限公司 Medical image vision model training method and apparatus, electronic device and storage medium
CN115331769B (zh) * 2022-07-15 2023-05-09 北京大学 Medical image report generation method and apparatus based on multimodal fusion
CN115240844B (zh) * 2022-07-15 2023-04-18 北京医准智能科技有限公司 Training method and apparatus for auxiliary diagnosis model, electronic device and storage medium
CN115082430B (zh) * 2022-07-20 2022-12-06 中国科学院自动化研究所 Image analysis method and apparatus, and electronic device
TWI824861B (zh) * 2022-11-30 2023-12-01 國立陽明交通大學 Machine learning apparatus and training method thereof
CN116013475B (zh) * 2023-03-24 2023-06-27 福建自贸试验区厦门片区Manteia数据科技有限公司 Method and apparatus for delineating multimodal medical images, storage medium and electronic device
CN116631566B (zh) * 2023-05-23 2024-05-24 广州合昊医疗科技有限公司 Big-data-based intelligent medical image report generation method
CN116758341B (zh) * 2023-05-31 2024-03-19 北京长木谷医疗科技股份有限公司 GPT-based intelligent diagnosis method, apparatus and device for hip joint lesions
CN117352120B (zh) * 2023-06-05 2024-06-11 北京长木谷医疗科技股份有限公司 GPT-based intelligent self-generation method, apparatus and device for knee joint lesion diagnosis
CN117153343B (zh) * 2023-08-16 2024-04-05 丽水瑞联医疗科技有限公司 Placenta multi-scale analysis system
CN116884561B (zh) * 2023-09-08 2023-12-01 紫东信息科技(苏州)有限公司 Gastric diagnosis report generation system based on self-supervised joint learning
CN117174240B (zh) * 2023-10-26 2024-02-09 中国科学技术大学 Medical image report generation method based on large-model domain transfer
CN117393100B (zh) * 2023-12-11 2024-04-05 安徽大学 Diagnostic report generation method, model training method, system, device and medium
CN117636099B (zh) * 2024-01-23 2024-04-12 数据空间研究院 Paired training model for medical images and medical reports
CN118098481A (zh) * 2024-03-11 2024-05-28 世象医疗科技(大连)有限公司 Diagnostic report generation system and method based on medical image concepts
CN118520932B (zh) * 2024-07-25 2024-10-15 山东海量信息技术研究院 Vision-language model training method, device, medium and computer program product

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109545302A (zh) * 2018-10-22 2019-03-29 复旦大学 Semantics-based medical image report template generation method
CN109741806A (zh) * 2019-01-07 2019-05-10 北京推想科技有限公司 Auxiliary generation method and apparatus for medical image diagnosis reports
CN110111864A (zh) * 2019-04-15 2019-08-09 中山大学 Relation-model-based medical report generation model and generation method therefor
EP3753025A1 (en) * 2018-02-16 2020-12-23 Google LLC Automated extraction of structured labels from medical text using deep convolutional networks and use thereof to train a computer vision model
CN112164446A (zh) * 2020-10-13 2021-01-01 电子科技大学 Medical image report generation method based on multi-network fusion
CN112992308A (zh) * 2021-03-25 2021-06-18 腾讯科技(深圳)有限公司 Training method for medical image report generation model and image report generation method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147890B (zh) * 2018-05-14 2020-04-24 平安科技(深圳)有限公司 Medical report generation method and device
EP3754548A1 (en) * 2019-06-17 2020-12-23 Sap Se A method for recognizing an object in an image using features vectors of an encoding neural network
CN110379491B (zh) * 2019-06-18 2022-07-15 平安科技(深圳)有限公司 Method, apparatus, device and storage medium for identifying brain glioma
CN110826638B (zh) * 2019-11-12 2023-04-18 福州大学 Zero-shot image classification model based on repeated attention network, and method therefor
CN111462010A (zh) * 2020-03-31 2020-07-28 腾讯科技(深圳)有限公司 Training method for image processing model, image processing method, apparatus and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3753025A1 (en) * 2018-02-16 2020-12-23 Google LLC Automated extraction of structured labels from medical text using deep convolutional networks and use thereof to train a computer vision model
CN109545302A (zh) * 2018-10-22 2019-03-29 复旦大学 Semantics-based medical image report template generation method
CN109741806A (zh) * 2019-01-07 2019-05-10 北京推想科技有限公司 Auxiliary generation method and apparatus for medical image diagnosis reports
CN110111864A (zh) * 2019-04-15 2019-08-09 中山大学 Relation-model-based medical report generation model and generation method therefor
CN112164446A (zh) * 2020-10-13 2021-01-01 电子科技大学 Medical image report generation method based on multi-network fusion
CN112992308A (zh) * 2021-03-25 2021-06-18 腾讯科技(深圳)有限公司 Training method for medical image report generation model and image report generation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TONG ZHAO: "Research and Application of Automatic Generation of Medical Imaging Report Based on Multimodal Machine Learning", MASTER THESIS, TIANJIN POLYTECHNIC UNIVERSITY, CN, 15 March 2021 (2021-03-15), CN , XP055970737, ISSN: 1674-0246 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115601485A (zh) * 2022-12-15 2023-01-13 阿里巴巴(中国)有限公司 Data processing method for task processing model, and virtual character animation generation method
CN115601485B (zh) * 2022-12-15 2023-04-07 阿里巴巴(中国)有限公司 Data processing method for task processing model, and virtual character animation generation method
CN115690793A (zh) * 2023-01-03 2023-02-03 北京百度网讯科技有限公司 Character recognition model, and recognition method, apparatus, device and medium thereof
CN116012650A (zh) * 2023-01-03 2023-04-25 北京百度网讯科技有限公司 Character recognition model training and recognition method, apparatus, device and medium
CN116012650B (zh) * 2023-01-03 2024-04-23 北京百度网讯科技有限公司 Character recognition model training and recognition method, apparatus, device and medium
CN117035052A (zh) * 2023-10-10 2023-11-10 杭州海康威视数字技术股份有限公司 Data-free knowledge distillation method, apparatus and storage medium
CN117035052B (zh) * 2023-10-10 2024-01-26 杭州海康威视数字技术股份有限公司 Data-free knowledge distillation method, apparatus and storage medium
CN117194605A (zh) * 2023-11-08 2023-12-08 中南大学 Hash encoding method, terminal and medium for missing multimodal medical data
CN117194605B (zh) * 2023-11-08 2024-01-19 中南大学 Hash encoding method, terminal and medium for missing multimodal medical data

Also Published As

Publication number Publication date
CN112992308B (zh) 2023-05-16
US20230092027A1 (en) 2023-03-23
CN112992308A (zh) 2021-06-18

Similar Documents

Publication Publication Date Title
WO2022199462A1 (zh) 医学图像报告生成模型的训练方法及图像报告生成方法 (Training method for medical image report generation model and image report generation method)
US11024066B2 (en) Presentation generating system for medical images, training method thereof and presentation generating method
Yin et al. Automatic generation of medical imaging diagnostic report with hierarchical recurrent neural network
WO2021179205A1 (zh) Medical image segmentation method, medical image segmentation apparatus and terminal device
Xue et al. Multimodal recurrent model with attention for automated radiology report generation
CN112712879B (zh) Information extraction method, apparatus, device and storage medium for medical image reports
WO2022242131A1 (zh) Image segmentation method, apparatus, device and storage medium
Beddiar et al. Automatic captioning for medical imaging (MIC): a rapid review of literature
Li et al. Automated measurement network for accurate segmentation and parameter modification in fetal head ultrasound images
CN115830017B (zh) Tumor detection system, method, device and medium based on image-text multimodal fusion
Cai et al. Lesion-harvester: Iteratively mining unlabeled lesions and hard-negative examples at scale
Eslami et al. Automatic vocal tract landmark localization from midsagittal MRI data
CN113988274B (zh) Deep-learning-based intelligent text generation method
CN115205880A (zh) Medical image report generation method and apparatus
CN115601352A (zh) Medical image segmentation method based on multimodal self-supervision
Spinks et al. Justifying diagnosis decisions by deep neural networks
Lu et al. Multitask deep neural network for the fully automatic measurement of the angle of progression
EP4327333A1 (en) Methods and systems for automated follow-up reading of medical image data
Tang et al. Work like a doctor: Unifying scan localizer and dynamic generator for automated computed tomography report generation
Gaggion et al. CheXmask: a large-scale dataset of anatomical segmentation masks for multi-center chest x-ray images
Yang et al. Lung Nodule Segmentation and Uncertain Region Prediction with an Uncertainty-Aware Attention Mechanism
JP2024054748A (ja) Method for generating language feature extraction model, information processing apparatus, information processing method, and program
Diamantis et al. This Intestine Does Not Exist: Multiscale Residual Variational Autoencoder for Realistic Wireless Capsule Endoscopy Image Generation
Hassan et al. Analysis of multimodal representation learning across medical images and reports using multiple vision and language pre-trained models
CN118398154B (zh) Word index model and medical report generation method, system, device and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22774123

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12/04/2024)

122 Ep: pct application non-entry in european phase

Ref document number: 22774123

Country of ref document: EP

Kind code of ref document: A1