CN117636099B - Medical image and medical report pairing training model - Google Patents

Medical image and medical report pairing training model

Info

Publication number
CN117636099B
CN117636099B (application CN202410090308.3A)
Authority
CN
China
Prior art keywords
medical
medical image
image
report
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410090308.3A
Other languages
Chinese (zh)
Other versions
CN117636099A (en)
Inventor
马韵洁
宋国磊
王飞
王佐成
吴艳平
谢浩天
徐晓龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Data Space Research Institute
Original Assignee
Data Space Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Data Space Research Institute filed Critical Data Space Research Institute
Priority to CN202410090308.3A priority Critical patent/CN117636099B/en
Publication of CN117636099A publication Critical patent/CN117636099A/en
Application granted granted Critical
Publication of CN117636099B publication Critical patent/CN117636099B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a medical image and medical report pairing training model which is trained on a set of registered medical image and medical report pairs. The training steps of the model are: S1, image encoding; S2, text encoding; S3, attention-weighted image representation; S4, establishment of the training model function. The invention can automatically learn useful feature representations from medical images and medical reports and, through joint learning of the image and report data, can capture the complex relations between them, improving the characterization capability and information-extraction effect of the data. It introduces modern deep learning techniques into the medical field, thereby realizing integrated analysis of multi-modal medical data and providing more accurate and comprehensive support for clinical diagnosis and disease monitoring.

Description

Medical image and medical report pairing training model
Technical Field
The invention relates to the technical field of medical information processing, in particular to a medical image and medical report pairing training model.
Background
Image and text data accumulation in the medical field has increased dramatically, and these data sources cover a wide variety of information from X-ray and MRI scans to clinical reports and medical records. These data are both rich in anatomical and pathological features and bear the clinical experience and professional judgment of the physician. However, the efficient use of such data, particularly in combining medical images with text data, remains one of the important challenges facing the medical field.
Interpretation of medical images is critical to accurately conduct clinical diagnosis. However, due to the variety and complexity of medical image data, accurate identification of lesions, localization of abnormalities, and analysis of anatomical structures requires a great deal of experience and expertise from a physician. In addition, medical texts contain a large amount of important information about patient condition, treatment regimen, and doctor diagnosis. However, there are challenges in correlating and combining these two data types to extract more comprehensive information.
In recent years, the application of deep learning techniques in the fields of medical images and texts is increasing. However, current deep learning methods focus mainly on single modality data analysis, ignoring the rich correlation between medical images and text. To fully exploit multimodal data, in particular medical image and text pairing data, we need an innovative approach to achieve joint analysis and comprehensive interpretation.
Disclosure of Invention
In order to solve the problems, the invention provides a medical image and medical report pairing training model, which is realized by the following technical scheme.
A medical image and medical report pairing training model employing a set of registered medical images and medical reports {(a_i, b_i) | i = 1, …, K} to achieve training, wherein K represents the number of paired medical images and medical reports;
wherein a_i and b_i represent a medical image and a medical report, respectively, W and H represent the width and height of the medical image, respectively, and C represents the number of color channels of the medical image source file;
wherein a_i ∈ R^(W×H×C);
the training steps of the medical image and medical report pairing training model are as follows:
s1, image coding, namely partitioning a medical image, and coding sub-regions to obtain feature vectors of the sub-regions;
s2, text coding, namely extracting entity information in the medical report to code so as to acquire embedded representation of the entity information;
s3, attention weighted image representation, namely weighting the subareas of the medical image according to the importance of each subarea in the medical image relative to each medical report to obtain the final representation of the medical image;
s4, building a training model function.
Preferably, in the step S1, a target detection segmentation model is used to identify the key entity regions and the weak semantic feature regions in the medical image, and a ResNet-50 model is used as an encoder to encode the key entity regions and the weak semantic feature regions respectively, obtaining f = {f_1, …, f_M} and f′, wherein f represents the feature vectors of the key entity regions and f′ represents the feature vectors of the weak semantic feature regions;
wherein M represents the number of key entity regions on each medical image, and M is 5;
the global features extracted by the final adaptive average pooling layer of the ResNet-50 model are denoted as f_g.
Preferably, in the step S2, entity information is extracted from the medical report using the existing MetaMap model; the BioClinicalBERT model is then used as an encoder to encode the extracted entity information, obtaining embedded representations of the entity information and of the overall report, comprising:
mapping the representations into 128-dimensional feature vectors by a projection mapping.
Preferably, in said step S3, the final representation A_i of the medical image is:
A_i = λ1·F_i + λ2·f_g,
wherein F_i is the attention-weighted representation of the key entity regions of the i-th medical image, λ1 and λ2 are hyper-parameters, and λ1+λ2=1.
Preferably, the attention-weighted representation is
F_i = Σ_{j=1..M} α_ij · f_j^i,
wherein F_i is the attention weighting of the key entity regions on the i-th medical image based on the medical report; α_ij is the attention weight, representing the influence of the entity information of the i-th medical report on the j-th key entity region of the i-th medical image; and f_j^i is the feature vector of the j-th key entity region on the medical image paired with the i-th medical report.
Preferably, the attention weights are calculated as
α_ij = exp(s_ij/τ) / Σ_{j'=1..M} exp(s_ij'/τ),
wherein τ is a hyper-parameter and s_ij is the similarity between the embedded representation t_i of the entity information of the i-th report and the j-th sub-region in the i-th medical image, computed as the dot product
s_ij = t_i^T · f_j^i,
wherein t_i^T represents the transpose of the vector t_i.
Preferably, the training model is optimized using the following loss function:
wherein w(·) is a weight function, z is taken to be the feature vector of either a positive sample or a weak semantic negative sample, and τ is a hyper-parameter.
Preferably, in the step S4, each medical image and its corresponding medical report are used as a positive sample pair, and each medical image and the other medical reports are used as negative sample pairs; the noise-contrastive estimation loss function in contrastive learning is finally obtained, which is the function of the training model:
wherein a hyper-parameter is used to control the weight of the weak semantic negative samples, and neg represents the set of negative sample pairs formed by each image with the other examination reports.
The invention can automatically learn useful feature representations from medical images and medical reports and, through joint learning of the image and report data, can capture the complex relations between them, improving the characterization capability and information-extraction effect of the data. It introduces modern deep learning techniques into the medical field, thereby realizing integrated analysis of multi-modal medical data and providing more accurate and comprehensive support for clinical diagnosis and disease monitoring. The invention has the following beneficial effects:
extraction of weak semantic features: by extracting image regions with weak semantic information. The method is beneficial to better capturing local low-level features by the model and improving the performance of the model in processing scenes such as medical images.
Consider local features: the generated weak semantic negative sample contains local features and other texture features of the target, so that the local structure of the medical image can be more comprehensively described, and the model can more accurately understand the details of the image.
Enhancing semantic information: the medical report is embedded in combination with the medical image representation, the image representation is weighted with an attention mechanism, and important areas related to the text entities are further extracted. This helps the model to better capture semantic information in the medical image.
Enhanced image-report association: by computing the similarity between the medical image sub-regions and the medical report, an attention weighted image representation is generated. This helps to strengthen the association between the image and the report, improving the similarity calculation capability of the model between the medical report and the image.
Contrast learning framework: the model is trained by comparing the differences between positive and negative samples using a contrast learning framework. This allows the model to better distinguish between different samples in the feature space, improving the robustness and performance of the model.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the drawings that are needed in the description of the specific embodiments will be briefly described below, it being obvious that the drawings in the following description are only some examples of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1: the invention relates to a process for establishing a medical image and medical report pairing training model;
fig. 2: the medical image processing flow comprises a medical image processing flow;
fig. 3: the invention relates to a medical report processing flow.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in figures 1-3 of the drawings,
example 1
A medical image and medical report pairing training model employing a set of registered medical images and medical reports {(a_i, b_i) | i = 1, …, K} to achieve training, wherein K represents the number of paired medical images and medical reports;
wherein a_i and b_i represent a medical image and a medical report, respectively, W and H represent the width and height of the medical image, respectively, and C represents the number of color channels of the medical image source file;
wherein a_i ∈ R^(W×H×C);
the training steps of the medical image and medical report pairing training model are as follows:
s1, image coding, namely partitioning a medical image, and coding sub-regions to obtain feature vectors of the sub-regions;
s2, text coding, namely extracting entity information in the medical report to code so as to acquire embedded representation of the entity information;
s3, attention weighted image representation, namely weighting the subareas of the medical image according to the importance of each subarea in the medical image relative to each medical report to obtain the final representation of the medical image;
s4, building a training model function.
According to the invention, the medical report can be embedded into the medical image representation, the image representation is weighted, and the important regions related to the entity information in the medical report are extracted, so that the semantic information in the medical image is better captured. Weighting the sub-regions of the medical image in this way helps strengthen the association between the medical image and the medical report and improves the model's ability to compute the similarity between them.
Example 2
In step S1, a target detection segmentation model is used to identify the key entity regions and the weak semantic feature regions in the medical image, and a ResNet-50 model is used as an encoder to encode the key entity regions and the weak semantic feature regions respectively, obtaining f = {f_1, …, f_M} and f′, wherein f represents the feature vectors of the key entity regions and f′ represents the feature vectors of the weak semantic feature regions;
wherein M represents the number of key entity regions on each medical image, and M is 5; the fixed value of M can be set manually according to the task requirements, and is set to 5 in the invention, i.e. five key entity regions are selected on each medical image.
The global features extracted by the final adaptive average pooling layer of the ResNet-50 model are denoted as f_g.
In the application, the target detection segmentation model adopts an existing model such as Faster R-CNN, YOLO, or U-Net; these models can identify key entity regions in medical images and help locate anatomical structures and lesion areas of interest. The ResNet-50 model is then used as an encoder to encode the key entity regions, obtaining their feature vectors f.
In the medical image field, weak semantic feature regions refer to those areas in the image that may not be major lesions or structures, such as blood vessels, bones, organ boundaries, textures, and the color distribution of the image. These regions are identified with the target detection segmentation model and encoded by the ResNet-50 model to obtain the feature vectors f′ of the weak semantic feature regions.
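As an illustration of step S1, the NumPy sketch below mimics the region and global feature extraction. In practice the encoder would be a pretrained ResNet-50 and the boxes would come from the detection/segmentation model; the toy encoder (a per-channel spatial mean, which is also exactly what an adaptive average pool to 1×1 computes), the example boxes, and all shapes here are illustrative assumptions.

```python
import numpy as np

def adaptive_avg_pool(feature_map):
    """Global feature: mean over spatial dimensions, (C, H, W) -> (C,).
    This is what an adaptive average pooling layer to 1x1 computes."""
    return feature_map.mean(axis=(1, 2))

def encode_regions(image, boxes, encoder):
    """Encode each cropped region with the given encoder.
    image: (C, H, W); boxes: list of (x0, y0, x1, y1); returns (M, D)."""
    feats = []
    for (x0, y0, x1, y1) in boxes:
        region = image[:, y0:y1, x0:x1]   # crop the key entity region
        feats.append(encoder(region))
    return np.stack(feats)

# Toy stand-in for the ResNet-50 encoder (assumption, demo only).
toy_encoder = adaptive_avg_pool

rng = np.random.default_rng(0)
image = rng.standard_normal((3, 64, 64))       # toy image, C=3
boxes = [(0, 0, 16, 16), (16, 16, 48, 48)]     # hypothetical key entity regions
f = encode_regions(image, boxes, toy_encoder)  # region features, shape (2, 3)
f_g = adaptive_avg_pool(image)                 # global feature f_g, shape (3,)
```

With a real ResNet-50, `encoder` would return the pooled backbone features of each crop instead of a per-channel mean, but the data flow is the same.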
Example 3
In step S2, entity information is extracted from the medical report using the existing MetaMap model; the BioClinicalBERT model is then used as an encoder to encode the extracted entity information, obtaining embedded representations of the entity information and of the overall report, comprising:
mapping the representations into 128-dimensional feature vectors by a projection mapping.
This process helps better capture key semantic information in medical reports.
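The projection mapping of step S2 can be sketched as a linear map followed by L2 normalization, so that image and text embeddings live in the same 128-dimensional space. The 768-dimensional input (a BERT-style hidden size) and the random weights are assumptions; the patent only specifies the 128-dimensional output.

```python
import numpy as np

def project(x, W, b):
    """Projection head: linear map into a shared 128-d embedding space,
    followed by L2 normalization of each row."""
    z = x @ W + b
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_in, d_out = 768, 128                       # 768 = assumed BERT hidden size
W = rng.standard_normal((d_in, d_out)) * 0.02  # toy weights (learned in practice)
b = np.zeros(d_out)
entity_emb = rng.standard_normal((4, d_in))  # 4 toy entity embeddings
t = project(entity_emb, W, b)                # shape (4, 128), unit norm
```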
Further, in step S3, the final representation A_i of the medical image is:
A_i = λ1·F_i + λ2·f_g,
wherein F_i is the attention-weighted representation of the key entity regions of the i-th medical image, λ1 and λ2 are hyper-parameters, and λ1+λ2=1.
Further, the attention-weighted representation is
F_i = Σ_{j=1..M} α_ij · f_j^i,
wherein F_i is the attention weighting of the key entity regions on the i-th medical image based on the medical report; α_ij is the attention weight, representing the influence of the entity information of the i-th medical report on the j-th key entity region of the i-th medical image; and f_j^i is the feature vector of the j-th key entity region on the medical image paired with the i-th medical report.
Further, the attention weights are calculated as
α_ij = exp(s_ij/τ) / Σ_{j'=1..M} exp(s_ij'/τ),
wherein τ is a hyper-parameter and s_ij is the similarity between the embedded representation t_i of the entity information of the i-th report and the j-th sub-region in the i-th medical image, computed as the dot product
s_ij = t_i^T · f_j^i,
wherein t_i^T represents the transpose of the vector t_i.
The entity information is defined in conformity with the medical language system, i.e., it is the keyword information in the medical report.
In the present embodiment it should be noted that, unlike natural images, the regions of interest in medical images are often indicated only by subtle visual cues. Using global features alone may not adequately capture these regions of interest, so a different approach is adopted: a learned attention mechanism weights the key entity regions of different medical images according to their importance to the given entity information.
To generate an attention-weighted image representation based on the entity information, the similarity between all key entity regions and the entity information is first calculated using the dot-product similarity formula:
For each medical report, an attention-weighted image representation is then calculated based on its similarity to all the key entity regions in the paired medical image.
The attention weighting reflects the impact of the medical report on the different key entity regions.
The final representation of the image is finally obtained:
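The similarity, attention weighting, and final-representation steps above can be sketched as follows. The temperature `tau` and the values of `lam1`/`lam2` are placeholder assumptions (the patent only requires λ1+λ2=1), and a softmax over dot-product similarities is one standard reading of the attention formula.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def attention_weighted_image(t_i, f_regions, f_g, tau=0.1, lam1=0.7, lam2=0.3):
    """t_i: (D,) report/entity embedding; f_regions: (M, D) key entity
    region features; f_g: (D,) global feature.
    Returns A_i = lam1 * F_i + lam2 * f_g and the attention weights."""
    s = f_regions @ t_i          # dot-product similarities s_ij, shape (M,)
    alpha = softmax(s / tau)     # attention weights alpha_ij, sum to 1
    f_att = alpha @ f_regions    # attention-weighted representation F_i
    return lam1 * f_att + lam2 * f_g, alpha

rng = np.random.default_rng(0)
t_i = rng.standard_normal(128)             # toy report embedding
f_regions = rng.standard_normal((5, 128))  # M = 5 key entity regions
f_g = rng.standard_normal(128)             # global feature
A_i, alpha = attention_weighted_image(t_i, f_regions, f_g)
```

Regions most similar to the report embedding dominate `alpha` and hence `A_i`, which is how the text steers the image representation toward clinically relevant sub-regions.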
example 4
The training model is optimized using the following loss function:
wherein w(·) is a weight function that dynamically adjusts the weights of the positive samples and the weak semantic negative samples; it can be calculated based on the distance between the feature vectors. z is taken to be the feature vector of either a positive sample or a weak semantic negative sample, and τ is a hyper-parameter.
The purpose of this loss function is to learn a better feature representation of the medical image.
Because the weights can be dynamically adjusted according to the feature distance between samples, the model can learn weak semantic features in a more targeted way. This helps increase the model's sensitivity to weak semantic information and thus improves contrastive learning and feature learning. With this loss function, the back-propagation algorithm can be better utilized to optimize the model parameters.
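The patent does not give the exact form of the weight function w(·); the sketch below uses a hypothetical Gaussian kernel over the Euclidean distance between feature vectors, purely to illustrate the stated idea that the weight can be adjusted dynamically from feature distance.

```python
import numpy as np

def distance_weight(z1, z2, sigma=1.0):
    """Hypothetical weight function (assumption): a Gaussian kernel over
    the Euclidean distance, so closer feature vectors get larger weight.
    sigma is an illustrative bandwidth hyper-parameter."""
    d = np.linalg.norm(z1 - z2)
    return np.exp(-d**2 / (2 * sigma**2))

anchor = np.array([1.0, 0.0])
near = np.array([0.9, 0.1])    # feature vector close to the anchor
far = np.array([-1.0, 0.0])    # feature vector far from the anchor
w_near = distance_weight(anchor, near)
w_far = distance_weight(anchor, far)  # smaller than w_near
```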
Example 5
In step S4, each medical image and its corresponding medical report are used as a positive sample pair, and each medical image and the other medical reports are used as negative sample pairs; the noise-contrastive estimation loss function in contrastive learning is finally obtained, which is the function of the training model:
wherein a hyper-parameter is used to control the weight of the weak semantic negative samples, and neg represents the set of negative sample pairs formed by each image with the other examination reports.
This function serves to optimize the pairing training of the medical image and the medical report, so as to meet the requirements of subsequent automated medical image interpretation tasks.
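A minimal sketch of the noise-contrastive estimation objective described above, assuming in-batch negatives and cosine similarity with a temperature; the extra hyper-parameter that re-weights weak semantic negatives is omitted here for brevity.

```python
import numpy as np

def info_nce_loss(img_emb, rep_emb, tau=0.07):
    """NCE-style contrastive loss over a batch: row i of img_emb pairs
    with row i of rep_emb (positive pair); all other reports in the
    batch act as negatives. tau is a temperature hyper-parameter."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    rep = rep_emb / np.linalg.norm(rep_emb, axis=1, keepdims=True)
    logits = img @ rep.T / tau                   # (K, K) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # -log p(correct report)

rng = np.random.default_rng(0)
K, D = 8, 128                                    # toy batch of K pairs
rep = rng.standard_normal((K, D))
aligned = rep + 0.01 * rng.standard_normal((K, D))  # images near own reports
shuffled = rng.standard_normal((K, D))              # unrelated images
loss_aligned = info_nce_loss(aligned, rep)
loss_random = info_nce_loss(shuffled, rep)  # well-aligned pairs score lower loss
```

Minimizing this loss pulls each image embedding toward its paired report and pushes it away from the other reports in the batch, which is the behavior the patent's step S4 relies on.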
The preferred embodiments of the invention disclosed above are intended only to assist in the explanation of the invention. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. The invention is limited only by the claims and the full scope and equivalents thereof.

Claims (5)

1. A medical image and medical report pairing training model, characterized in that the model employs a set of registered medical images and medical reports {(a_i, b_i) | i = 1, …, K} to achieve training, wherein K represents the number of paired medical images and medical reports;
wherein a_i and b_i represent a medical image and a medical report, respectively, W and H represent the width and height of the medical image, respectively, and C represents the number of color channels of the medical image source file;
wherein a_i ∈ R^(W×H×C);
the training steps of the medical image and medical report pairing training model are as follows:
s1, image coding, namely partitioning a medical image, and coding sub-regions to obtain feature vectors of the sub-regions;
s2, text coding, namely extracting entity information in the medical report to code so as to acquire embedded representation of the entity information;
s3, attention weighted image representation, namely weighting the subareas of the medical image according to the importance of each subarea in the medical image relative to each medical report to obtain the final representation of the medical image;
final representation A of medical image i The method comprises the following steps:
λ1 and λ2 are hyper-parameters and λ1+λ2=1;
to generate an attention weighted image representation based on entity information, we first calculate the similarity of all key entity regions and entity information, using the dot product similarity calculation formula:
wherein,transposition of the representative vector;
wherein,is a super parameter; />Corresponding to->Embedding representation of information of individual reporting entity with the +.sup.th in the ith medical image>Similarity between sub-regions;
for each medical report, we calculate an attention weighted image representation based on its similarity to all key solid regions in the paired medical image
Wherein the method comprises the steps ofWeighting the critical entity area on the ith medical image based on the attention of the medical report; />Is the effect of entity information on the ith medical report on the jth critical entity area on the ith medical image,/->Represents an attention weight; />Is the feature vector of the jth critical entity region on the medical image paired with the ith medical report;
s4, building a training model function.
2. The medical image and medical report pairing training model according to claim 1, wherein in step S1, a target detection segmentation model is used to identify the key entity regions and the weak semantic feature regions in the medical image, and a ResNet-50 model is used as the encoder to encode the key entity regions and the weak semantic feature regions respectively, obtaining f = {f_1, …, f_M} and f′, wherein f represents the feature vectors of the key entity regions and f′ represents the feature vectors of the weak semantic feature regions;
wherein M represents the number of key entity regions on each medical image, and M is 5;
the global features extracted by the final adaptive average pooling layer of the ResNet-50 model are denoted as f_g.
3. The medical image and medical report pairing training model according to claim 2, wherein in step S2, entity information is extracted from the medical report using the existing MetaMap model; the BioClinicalBERT model is then used as the encoder to encode the extracted entity information, obtaining embedded representations of the entity information and of the overall report, comprising:
mapping the representations into 128-dimensional feature vectors by a projection mapping.
4. A medical image and medical report pairing training model according to claim 3, characterized in that the training model is optimized with the following loss function:
wherein w(·) is a weight function, z is taken to be the feature vector of either a positive sample or a weak semantic negative sample, and τ is a hyper-parameter.
5. The medical image and medical report pairing training model according to claim 4, wherein in step S4, each medical image and its corresponding medical report are used as a positive sample pair, and each medical image and the other medical reports are used as negative sample pairs; the noise-contrastive estimation loss function in contrastive learning is finally obtained, which is the function of the training model:
wherein a hyper-parameter is used to control the weight of the weak semantic negative samples, and neg represents the set of negative sample pairs formed by each image with the other examination reports.
CN202410090308.3A 2024-01-23 2024-01-23 Medical image and medical report pairing training model Active CN117636099B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410090308.3A CN117636099B (en) 2024-01-23 2024-01-23 Medical image and medical report pairing training model


Publications (2)

Publication Number Publication Date
CN117636099A CN117636099A (en) 2024-03-01
CN117636099B true CN117636099B (en) 2024-04-12

Family

ID=90021849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410090308.3A Active CN117636099B (en) 2024-01-23 2024-01-23 Medical image and medical report pairing training model

Country Status (1)

Country Link
CN (1) CN117636099B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112992308A (en) * 2021-03-25 2021-06-18 腾讯科技(深圳)有限公司 Training method of medical image report generation model and image report generation method
CN115861641A (en) * 2022-10-31 2023-03-28 浙江工业大学 Medical image report generation method based on fine-grained attention
CN115910264A (en) * 2022-11-10 2023-04-04 上海师范大学 Medical image classification method, device and system based on CT and medical report
WO2023204944A1 (en) * 2022-04-19 2023-10-26 Microsoft Technology Licensing, Llc Training of text and image models
CN117391092A (en) * 2023-12-12 2024-01-12 中南大学 Electronic medical record multi-mode medical semantic alignment method based on contrast learning
CN117392473A (en) * 2023-10-30 2024-01-12 齐鲁工业大学(山东省科学院) Interpretable medical image classification system based on multi-modal prototype network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11583239B2 (en) * 2017-03-24 2023-02-21 The United States Of America, As Represented By The Secretary, Department Of Health And Human Service Method and system of building hospital-scale chest X-ray database for entity extraction and weakly-supervised classification and localization of common thorax diseases
US20220122250A1 (en) * 2020-10-19 2022-04-21 Northwestern University Brain feature prediction using geometric deep learning on graph representations of medical image data


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MedCLIP: Contrastive Learning from Unpaired Medical Images and Text; Zifeng Wang et al.; arXiv:2210.10163; 2022-10-18; pp. 1-12 *

Also Published As

Publication number Publication date
CN117636099A (en) 2024-03-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant