CN117558394B - Cross-modal network-based chest X-ray image report generation method - Google Patents

Cross-modal network-based chest X-ray image report generation method

Info

Publication number
CN117558394B
CN117558394B (Application CN202311271188.9A)
Authority
CN
China
Prior art keywords
image
network
cross
matrix
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311271188.9A
Other languages
Chinese (zh)
Other versions
CN117558394A (en)
Inventor
董子龙
廉敬
石斌
刘冀钊
张家骏
张怀堃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lanzhou Jiaotong University
Original Assignee
Lanzhou Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lanzhou Jiaotong University filed Critical Lanzhou Jiaotong University
Priority to CN202311271188.9A priority Critical patent/CN117558394B/en
Publication of CN117558394A publication Critical patent/CN117558394A/en
Application granted granted Critical
Publication of CN117558394B publication Critical patent/CN117558394B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H15/00ICT specially adapted for medical reports, e.g. generation or transmission thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/096Transfer learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/32Normalisation of the pattern dimensions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Apparatus For Radiation Diagnosis (AREA)

Abstract

The invention discloses a chest X-ray image report generation method based on a cross-modal network, belonging to the technical field of image reporting. A cross-modal auxiliary network (CMLRAN) is provided that introduces an attention mechanism to process image and text information separately and, based on a Memory Storage Response Matrix (MSRM), combines the CLIP model proposed by OpenAI to strengthen the association between image and text information. During encoding, the network focuses on classifying fine-grained differences in X-ray images; during decoding, it focuses on generating medical terminology. The method better bridges the semantic gap and intelligently generates chest X-ray image reports.

Description

Cross-modal network-based chest X-ray image report generation method
Technical Field
The invention relates to the technical field of image reports, in particular to a chest X-ray image report generation method based on a cross-modal network.
Background
Chest X-ray is an advanced medical imaging technique that, through high-resolution, multi-angle imaging, accurately displays pulmonary lesions (such as pneumonia, tuberculosis, and lung cancer), mediastinal lesions (such as mediastinal tumors and mediastinal emphysema), pleural lesions (such as pleural effusion and pleurisy), and cardiovascular lesions. The chest X-ray diagnostic report is the professional interpretation and summary of the examination results; it generally comprises imaging findings, diagnostic opinions, and recommendations, and provides the basis for physicians to formulate diagnosis and treatment plans.
In recent years, the need for medical X-ray "imaging"-guided "treatment" has grown, and related research has received wide attention. Among these approaches, methods that generate long text with hierarchical long short-term memory (LSTM) networks show certain advantages. However, research in this area still faces many challenges: medical image features are complex, cross-modal features are difficult to extract, and medical reports contain a large number of specialized terms. At present, LSTM alone cannot achieve automatic generation of multi-organ imaging reports. For this reason, some scholars have proposed deep-learning-based methods for automatically generating medical image reports, which can be divided into image-processing methods and natural-language-processing methods according to the object being processed. Taking images as the entry point, Tanida proposed RGRG, a generation model guided by lesion regions, which first segments specific lesion regions and then composes the final report from them; Li proposed DCL, a knowledge-graph-assisted generation network with dynamic structure and nodes, which takes each image as a starting point to extract image-versus-text generation features and finally adds these features to each output node. Taking natural language processing as the entry point, Chen proposed VisualGPT, an image-captioning network that uses the linguistic knowledge of a large pre-trained language model (PLM) and can effectively learn substantial language knowledge from a small amount of multimodal data; Kaur proposed CADxReport, a CNN-RNN based network model that uses reinforcement learning together with visual and semantic attention mechanisms to automatically generate medical reports.
These deep-learning-based methods for automatic medical report generation have shortcomings. When reports are generated with image processing as the entry point, the model has difficulty fully comprehending the complex information in the images, and the generated reports lack flexibility of linguistic expression. When natural language processing is the entry point, the reports are generated from predefined templates, which likewise lack flexibility and make it difficult to adapt to different application scenarios. In view of this, the present invention proposes a cross-modal auxiliary network (CMLRAN), which introduces an attention mechanism to process image and text information separately and, based on a Memory Storage Response Matrix (MSRM), combines the CLIP model proposed by OpenAI to strengthen the association between image and text information. During encoding, the network focuses on classifying fine-grained differences in X-ray images; during decoding, it focuses on generating medical terminology. The method better bridges the semantic gap and intelligently generates chest X-ray image reports.
Disclosure of Invention
The present invention has been made to solve the above-mentioned problems, and an object of the present invention is to provide a method for generating a chest X-ray image report based on a cross-modal network.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows: a chest X-ray image report generation method based on a cross-modal network comprises creating a cross-modal auxiliary network (CMLRAN), introducing an attention mechanism to process image and text information respectively, and enhancing the association between image and text information by combining the CLIP model proposed by OpenAI on the basis of a Memory Storage Response Matrix (MSRM), the method comprising the following specific steps:
Step one: lesion-area feature extraction is implemented as follows:
① Performing contrast enhancement, image size conversion and image pixel block adjustment on an input image to obtain a preprocessed image;
② The preprocessed image is converted into an image feature matrix by a convolutional neural network (CNN); all data in the matrix are flattened into a single row to obtain a trainable chest CT image feature matrix C, which is then fed into ResnetII to extract feature information highly relevant to the chest organs, yielding an X-ray image feature matrix C'. The residual network learns the chest organ features of both the original image and the convolutionally extracted features, avoiding vanishing and exploding gradients during information propagation.
The formula for the feature matrix C' after the first pass of the ResnetII network is as follows:
where σ denotes the Sigmoid function, Avg denotes average pooling, Max denotes max pooling, c denotes the feature matrix obtained at each step from the chest X-ray image, f^(7×7) denotes a convolution with a 7×7 kernel, Δp denotes the direct (identity) mapping of the network, and μ denotes the loss function of the residual network;
Step two: cross-modal auxiliary localization is implemented as follows:
After step one is completed, matrix calculations are performed between the extracted X-ray image feature matrix C' and the introduced medical CLIP and MSRM to determine the lesion area with the highest probability and to enhance its contrast or sharpness, the related formulas being as follows:
where the leading symbols denote the image and text of the core region, C_img and C_txt denote the preprocessed image and text, matrix calculation yields the cosine similarity W between the text features computed from the image features and the image features computed from the text features, N denotes the total number of image-text pairs in a group, W' denotes the feature score obtained by normalizing W, and the calculation finally outputs the cross probability scores L_i→t and L_t→i of the report corresponding to the lesion area;
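For illustration, a minimal sketch of this CLIP-style cross-modal scoring is given below. The temperature value and the softmax normalization are assumptions made for the sketch; the exact normalization used to obtain W' is not reproduced here.

```python
import torch
import torch.nn.functional as F

def cross_modal_scores(img_feats: torch.Tensor, txt_feats: torch.Tensor,
                       temperature: float = 0.07):
    """Sketch: cosine similarities W between N image/text feature pairs,
    normalized into scores from which L_i->t and L_t->i are read."""
    img = F.normalize(img_feats, dim=-1)            # (N, d), from C_img
    txt = F.normalize(txt_feats, dim=-1)            # (N, d), from C_txt
    w = img @ txt.t() / temperature                 # (N, N) similarity matrix W
    l_i2t = F.softmax(w, dim=1)                     # image -> report scores
    l_t2i = F.softmax(w.t(), dim=1)                 # report -> image scores
    return l_i2t, l_t2i
```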
Step three: the implementation step of automatic generation of medical report:
after deriving the cross-modal feature, the transcoder's decoder can take into account the entire input sequence at the same time when generating each word, so it can capture the context information well,
Specifically, a GPT-2 network similar to that proposed by PubMed et al is adopted, in which the labels in the sequence are text-generated on condition of the previous labels, each generated word is used as the input of the next step, the process is repeated until a complete medical report is generated, a forgetting gate SFG is proposed based on a bidirectional LSTM, the SFG is combined with an attention mechanism, and the flow of cross-modal information is controlled by introducing the forgetting gate and an update gate, so that the context information and the cross-modal information of the X-ray image in the medical report are better captured, and in order to limit the language model to the regional visual features, the focus regional features and the associated disease keyword features are directly injected into the self-attention of the model by using pseudo self-attention, and the related formulas are as follows:
Wherein X represents the visual characteristics of the focus area, Y represents word embedding, W q,Wk,Wv represents inquiry, keys and values, U k and U v represent parameters of the keys and values of the initial hidden state obtained through LSTM, and the generation of text of the focus area can be realized through matrix operation.
Further, the newly built cross-modal auxiliary network (CMLRAN) focuses on classifying fine-grained differences in X-ray images during encoding and on generating medical terminology during decoding.
Further, the ResnetII network in step one adds a max-pooling layer and an average-pooling layer on top of the pre-trained Resnet-152 network; these two layers capture the maximum and average values of the features at different scales, and max pooling improves the stability of feature extraction and reduces the influence of errors such as geometric distortion and exposure in X-ray images on the model.
Further, a BPE tokenizer is adopted for processing the medical report, and a tree-structured knowledge graph is added during tokenization, which strengthens the tokenizer's weight distribution and makes the dataset trainable.
Further, in step two, to prevent oversaturation of the stored information and the resulting gradient explosion during network training, a selective forgetting gate (SFG) is added; the formulas for the SFG and the cross-modal memory storage are as follows:
where W_f denotes the forgetting-gate (SFG) weights, b_f denotes the bias, x(t) denotes the hidden information, h(t-1) denotes the hidden state at time t-1, and C(t-1) denotes the cross-modal memory storage feature at the previous time step.
Compared with the prior art, the invention has the following beneficial effects:
(1) Based on transfer learning, multi-channel feature extraction is added and divided into MaxPool and AvgPool branches, with an attention mechanism added on top to strengthen the extraction of global image features;
(2) A cross-modal auxiliary network is provided to bridge the semantic gap between chest X-ray images and the corresponding medical reports, strengthen the connection between the two modalities, and improve the matching precision between X-ray images and their corresponding medical reports.
Drawings
FIG. 1 is a schematic diagram of the implementation steps of a body network model framework according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a focus area feature extraction module according to an embodiment of the invention;
FIG. 3 is a cross-modal network layer model of an embodiment of the present invention;
FIG. 4 is a tree-like knowledge-graph of medical report according to an embodiment of the invention;
Fig. 5 is a graph of lesion area visualization using Grad-CAM according to an embodiment of the invention.
Detailed Description
The invention is further described below in connection with specific embodiments, so that the technical means, creative features, objectives, and effects of the invention are easy to understand.
As shown in FIG. 1, the method provided by the invention first uses a residual network and a visual attention model to extract image features, then combines CLIP and the MSRM to localize the lesion area, and finally realizes automatic generation of the medical report through the Transformer decoder and an LSTM-style gating mechanism. The proposed cross-modal auxiliary network model CMLRAN consists of three modules: a lesion-area feature extraction module, a lesion-area-based cross-modal auxiliary localization module, and an automatic medical report generation module.
(1) Lesion-area feature extraction is implemented as follows:
① Performing contrast enhancement, image size conversion and image pixel block adjustment on an input image to obtain a preprocessed image;
② The preprocessed image is converted into an image feature matrix by a convolutional neural network (CNN); all data in the matrix are flattened into a single row to obtain a trainable chest CT image feature matrix C, which is then fed into ResnetII to extract feature information highly relevant to the chest organs, yielding an X-ray image feature matrix C'. The residual network learns the chest organ features of both the original image and the convolutionally extracted features, avoiding vanishing and exploding gradients during information propagation. The architecture of the lesion-area feature extraction module is shown in FIG. 2.
In FIG. 2, Resnet-152 denotes a 152-layer residual network whose underlying modules consist of 12 convolutions of different dimensions (1×1 and 3×3 are the convolution kernels; 64, 128, 256, 512, 1024, and 2048 are the channel widths of the network layers).
To ensure that the encoder learns chest CT image features better, a dual-channel modular network architecture is added, containing a max-pooling (MaxPool) layer and an average-pooling (AvgPool) layer, together with a self-attention mechanism that enhances the image features extracted by MaxPool and AvgPool.
Training of the lesion-area feature extraction network uses the dual-channel feature extraction network ResnetII and an attention-based feature extraction network. During dual-channel feature extraction, a convolution is applied to the trainable chest CT image feature matrix C and its dimensionality is increased; the features are then passed through MaxPool and AvgPool into a dilated convolution layer of Resnet-152 (dilation rate 2, kernel size 7×7) to obtain the Resnet outputs. These outputs undergo dual-channel residual operations through MaxPool and AvgPool, and the result is summed with the image feature matrix obtained from the preprocessed image by the CNN, yielding the feature matrix C' after the first pass of the Resnet network. Dual-channel feature extraction enhances the model's multi-scale extraction of chest CT image details while reducing the negative effects of using dilated convolutions alone, such as loss of the original image's spatial hierarchy information and repeated extraction of unimportant information. The formula for the feature matrix C' after the first pass of the ResnetII network is as follows:
where σ denotes the Sigmoid function, Avg denotes average pooling, Max denotes max pooling, c denotes the feature matrix obtained at each step from the chest X-ray image, f^(7×7) denotes a convolution with a 7×7 kernel, Δp denotes the direct (identity) mapping of the network, and μ denotes the loss function of the residual network. ResnetII adds a max-pooling layer and an average-pooling layer on top of the pre-trained Resnet-152 network; these two layers capture the maximum and average values of the features at different scales. Max pooling improves the stability of feature extraction and reduces the influence of errors such as geometric distortion and exposure in X-ray images on the model, while the average-pooling layer converts the spatial information of the features into a more compact representation, improves the generalization ability of the model, and reduces the influence of background radiation, artifacts, and similar factors in the X-ray image; the features C' extracted through these two layers reflect the local features of the medical X-ray image. The original feature C and the processed feature C' are then fed into an attention network for secondary feature processing; this network divides the chest X-ray image into a series of learnable patches to achieve global processing of the whole image, and the features it extracts are combined with the features C' from the ResnetII network through residual connections to finally obtain the complete chest X-ray image feature C''.
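For illustration, a minimal PyTorch-style sketch of this dual-channel (MaxPool/AvgPool) stage is given below. The module layout, channel count, pooling parameters, and the way the two branches are recombined with a sigmoid spatial-attention mask and a residual sum are assumptions made for the sketch, not the exact ResnetII implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualChannelBlock(nn.Module):
    """Sketch of the dual-channel (MaxPool/AvgPool) stage with a dilated 7x7
    convolution, a sigmoid spatial-attention term, and a residual sum with
    the input feature matrix C; shapes are illustrative assumptions."""

    def __init__(self, channels: int = 512):
        super().__init__()
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=1, padding=1)
        self.avgpool = nn.AvgPool2d(kernel_size=2, stride=1, padding=1)
        # Dilated 7x7 convolution (dilation rate 2) applied to each branch.
        self.dilated = nn.Conv2d(channels, channels, kernel_size=7,
                                 dilation=2, padding=6)
        # Spatial attention over channel-wise avg/max maps, echoing the
        # sigma(f^(7x7)([Avg(c); Max(c)])) term in the formula.
        self.spatial_attn = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, c: torch.Tensor) -> torch.Tensor:
        branch_max = self.dilated(self.maxpool(c))
        branch_avg = self.dilated(self.avgpool(c))
        # Resize branches back to the input resolution (illustrative only).
        branch_max = F.interpolate(branch_max, size=c.shape[-2:])
        branch_avg = F.interpolate(branch_avg, size=c.shape[-2:])
        attn_in = torch.cat([c.mean(dim=1, keepdim=True),
                             c.max(dim=1, keepdim=True).values], dim=1)
        mask = torch.sigmoid(self.spatial_attn(attn_in))
        # Residual sum with the original feature matrix C (direct mapping).
        return mask * (branch_max + branch_avg) + c
```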
The application selects the BPE tokenizer, a data-driven method used with the self-attention-based autoregressive network, which splits text into a fixed-size vocabulary of subwords by repeatedly merging the most frequently occurring characters or character sequences.
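For illustration only, a toy sketch of the BPE merge loop is given below; it is a simplified stand-in for the idea, not the tokenizer actually used in the embodiment.

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Start from character-level symbols with an end-of-word marker.
    words = Counter(tuple(w) + ("</w>",) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

# Example with a few (hypothetical) report words:
print(learn_bpe_merges(["effusion", "lesion", "fusion"], num_merges=5))
```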
(2) Cross-modal auxiliary localization is implemented as follows:
After the first stage is completed, matrix calculations are performed between the extracted X-ray image features and the introduced medical CLIP and MSRM to determine the lesion area with the highest probability and to strengthen the relation between the lesion area and the corresponding medical report keywords. As shown in FIG. 3, the medical CLIP uses the pre-trained Resnet-152 as its image encoder and BPE as the encoder for the medical report. The application fine-tunes the original CLIP model with a contrastive loss on the IU X-RAY and MIMIC-CXR datasets, increasing the similarity of matching image-text pairs and the dissimilarity of non-matching pairs, and the generated result is resolved through a pre-built tree-structured medical report knowledge graph (shown in FIG. 4). The graph first makes a preliminary judgment using the lesion types and disease information that may appear in each part of the chest X-ray, and the alignment between the medical graph and the corresponding lesion area is then realized through the zero-shot classification mechanism of the medical CLIP. Finally, the chest X-ray lesion area corresponding to the disease information with the maximum similarity is selected based on image classification. The related formulas are as follows:
where the leading symbols denote the image and text of the core region, C_img and C_txt denote the preprocessed image and text, matrix calculation yields the cosine similarity W between the text features computed from the image features and the image features computed from the text features, N denotes the total number of image-text pairs in a group, W' denotes the feature score obtained by normalizing W, and the calculation finally outputs the cross probability scores L_i→t and L_t→i of the report corresponding to the lesion area. After the probability score L is obtained, it is fed into the MSRM for cross-modal memory storage, whose main operation is to store the information of the two modalities simultaneously, retain important information, and delete useless information. At the same time, to prevent oversaturation of the stored information and the resulting gradient explosion during network training, a selective forgetting gate (SFG) is added with reference to the gate-unit mechanism of the LSTM. The formulas for the SFG and the cross-modal memory storage are as follows:
where W_f denotes the forgetting-gate (SFG) weights, b_f denotes the bias, x(t) denotes the hidden information, h(t-1) denotes the hidden state at time t-1, and C(t-1) denotes the cross-modal memory storage feature at the previous time step.
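For illustration, a minimal PyTorch sketch of an LSTM-style selective forgetting gate over the cross-modal memory is given below. Because the exact SFG formula appears only in the original figures, the candidate update, dimensions, and module names here are assumptions consistent with the variable definitions above.

```python
import torch
import torch.nn as nn

class SelectiveForgettingGate(nn.Module):
    """Sketch: f(t) = sigmoid(W_f . [h(t-1), x(t)] + b_f) gates how much of
    the cross-modal memory C(t-1) is kept; the candidate term is assumed."""

    def __init__(self, hidden_dim: int = 512):
        super().__init__()
        self.forget = nn.Linear(2 * hidden_dim, hidden_dim)     # W_f, b_f
        self.candidate = nn.Linear(2 * hidden_dim, hidden_dim)  # assumed

    def forward(self, x_t, h_prev, c_prev):
        z = torch.cat([h_prev, x_t], dim=-1)   # [h(t-1), x(t)]
        f_t = torch.sigmoid(self.forget(z))    # forgetting gate SFG
        cand = torch.tanh(self.candidate(z))   # new cross-modal content
        # Keep important stored information, discard the rest, which limits
        # oversaturation of the cross-modal memory.
        return f_t * c_prev + (1.0 - f_t) * cand
```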
(3) Automatic generation of the medical report is implemented as follows:
After the cross-modal features are derived, the Transformer decoder can attend to the entire input sequence when generating each word, so it captures context information well. Specifically, it employs a GPT-2 network similar to that proposed by PubMed et al., in which each token in the sequence is generated conditioned on the preceding tokens; each generated word is used as input to the next step, and the process repeats until a complete medical report is generated. During decoding, however, the application adds the cross-modal memory feature C(t) extracted by the MSRM. Using only a conventional decoder could cause problems such as repeated memorization of erroneous information and slow network convergence, so the application proposes a forgetting gate (SFG) based on a bidirectional LSTM and combines the SFG with an attention mechanism: by introducing the forgetting gate and an update gate, the flow of cross-modal information is controlled, and the context information and the cross-modal information of the X-ray image are better captured in the medical report. To constrain the language model to regional visual features, the application uses pseudo self-attention to inject the lesion-area features and the associated disease-keyword features directly into the model's self-attention, the related formulas being as follows:
where X denotes the visual features of the lesion area, Y denotes the word embeddings, W_q denotes the query projection, W_k the key projection, and W_v the value projection, U_k and U_v denote the key and value parameters of the initial hidden state obtained through the LSTM, and text generation for the lesion area is realized through these matrix operations.
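For illustration, a minimal single-head sketch of pseudo self-attention is given below: the lesion-area features X, projected by U_k and U_v, are prepended to the keys and values of the decoder self-attention, while the queries come only from the word embeddings Y. Dimensions and the single-head form are assumptions made for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PseudoSelfAttention(nn.Module):
    """Sketch: visual features enter only through extra keys/values (U_k, U_v),
    so the text-shaped self-attention weights W_q, W_k, W_v are unchanged."""

    def __init__(self, d_model: int = 512, d_visual: int = 512):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)   # W_q
        self.w_k = nn.Linear(d_model, d_model)   # W_k
        self.w_v = nn.Linear(d_model, d_model)   # W_v
        self.u_k = nn.Linear(d_visual, d_model)  # U_k (visual keys)
        self.u_v = nn.Linear(d_visual, d_model)  # U_v (visual values)

    def forward(self, y: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # y: word embeddings (B, T, d_model); x: lesion features (B, R, d_visual)
        q = self.w_q(y)
        k = torch.cat([self.u_k(x), self.w_k(y)], dim=1)  # visual keys first
        v = torch.cat([self.u_v(x), self.w_v(y)], dim=1)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        return attn @ v
```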
The regions the model attends to during training can be visualized with Gradient-weighted Class Activation Mapping (Grad-CAM), which is used here to determine whether the model accurately associates the lesion area with the corresponding tree-structured knowledge graph. As shown in FIGS. 5(a)-(f), M1-M5 denote MRARGN and its variants, shown via Grad-CAM on the heart, pleura, bone, mediastinum, lung, and free regions of the MIMIC-CXR dataset in their optimal states. It can be observed that M5 accurately identifies the boundary and shape information of most lesion areas. Although M1 handles the skeletal region and M2 handles the pleural and mediastinal regions relatively reasonably, both show many misjudgments or repeated associations of erroneous information at other locations. In the lung-lesion case of FIG. 5(e), M1 points toward the upper right lobe, and M2 also perceives the free region as part of the lung lesion area. The lesion area extracted by M3 consists largely of irrelevant extractions. M4 performs better than M1-M3 in the completeness of the generated report, but when predicting a lesion area it misinterprets a pacemaker or external device as part of the lesion. M5 eliminates this type of error by introducing the SFG, and the irrelevant regions over-extracted by M4 can likewise be removed by the SFG.
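For illustration, a minimal hook-based Grad-CAM sketch in PyTorch is given below. The model, target layer, class index, and input shape are assumptions; the embodiment may equally use a packaged Grad-CAM implementation.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """Sketch: capture the target layer's activations and gradients via hooks,
    weight the activation maps by channel-averaged gradients, and upsample."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: acts.update(value=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(value=go[0]))

    logits = model(image)                 # image: (1, 3, 224, 224), assumed
    model.zero_grad()
    logits[0, class_idx].backward()
    h1.remove(); h2.remove()

    a, g = acts["value"], grads["value"]          # (1, C, H, W)
    weights = g.mean(dim=(2, 3), keepdim=True)    # global-average gradients
    cam = F.relu((weights * a).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```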
The embodiment of the invention is implemented in PyTorch and trained on a workstation with 64 GB of RAM and an NVIDIA GeForce RTX 4090 GPU. Pre-trained Resnet-152 is used as the shared feature extraction encoder for the image-processing ResnetII and CLIP; all images are scaled to 224×224 and encoded into 7×7×512 feature maps.
In the training, testing, and validation phases the application follows the splits suggested by the authors of the original datasets, i.e., a train : test : validation ratio of 7 : 2 : 1. The extracted visual features are fed to CLIP to generate more than 320 tags; for each tag a vector of 512 word embeddings is generated, and when the body of the sentence does not reach 512 words the remainder is padded with <text>, an omission token recognizable during network training. The highest-probability features are taken as the semantic features generated by the model, and the lesion area is then connected separately to the network training through the MSRM, building a more complete auxiliary network without affecting network convergence. The Transformer decoder serves as the decoding layer of the model; all hidden layers and word embedding dimensions are first set to 512, and these hidden layers directly call the hidden state extracted through the LSTM forgetting gate. Parameters are learned with the AdamW optimizer and a batch size of 4. The total training loss is defined as L = λ_MSRM·L_MSRM + λ_cmn·L_cmn + λ_cross·L_cross + λ_language·L_language, where L_cmn and L_cross are the cross-entropy losses of two binary classifiers processing the cross-modal information and L_language is the cross-entropy loss of the language model; the loss weights are set according to validation performance as λ_MSRM = 2.0, λ_cmn = 3.0, λ_cross = 3.0, λ_language = 1.0.
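A minimal sketch of this weighted total loss, assuming the four component losses have already been computed as scalar tensors:

```python
import torch

def total_loss(l_msrm, l_cmn, l_cross, l_language,
               lam_msrm=2.0, lam_cmn=3.0, lam_cross=3.0, lam_language=1.0):
    """Weighted sum of the four training losses with the weights given above."""
    return (lam_msrm * l_msrm + lam_cmn * l_cmn
            + lam_cross * l_cross + lam_language * l_language)

# Illustrative usage with dummy scalar losses:
loss = total_loss(torch.tensor(0.7), torch.tensor(0.4),
                  torch.tensor(0.5), torch.tensor(1.2))
print(loss)
```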
As shown in Table 1, the application compares MRARGN on the IU X-RAY and MIMIC-CXR datasets with the recent chest report generation models listed below: PPKED, which generates reports with the aid of combined prior and posterior knowledge; m2tr, which generates individual descriptive words in the context of the X-ray image and then converts them into coherent text using a Transformer architecture; R2Gen, which generates radiology reports via a memory-driven Transformer; CMN, which enhances the encoder-decoder framework with a self-attention mechanism to facilitate cross-modal interaction and generation; CvT2DistilGPT2, which performs highly structured report generation by combining knowledge distillation, a computer-vision warm start, ViT, and GPT-2, with a task-distillation module for structure-level description, a task-aware report generation module at the describable level, and an anomaly classification and tagging module; VisualGPT, which generates highly interpretable text with the help of a self-resurrecting encoder-decoder attention mechanism; and RGRG, a simple and efficient region-guided report generation model that detects anatomical regions and then describes individual salient regions to form the final report. The CvT2DistilGPT2 model reports experimental results only on the MIMIC-CXR dataset (not on IU X-RAY). The MRARGN model of the application performs better than most of these models in cross-modal feature processing and automatic medical report generation, and successfully generates highly refined descriptive text of lesion areas and case conditions.
Table 1: Comparison of evaluation-metric results of the network models
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution. This manner of description is adopted merely for clarity; the specification should be taken as a whole, and the technical solutions in the respective embodiments may be suitably combined to form other embodiments understandable to those skilled in the art.

Claims (2)

1. A chest X-ray image report generation method based on a cross-modal network, characterized by comprising: creating a cross-modal auxiliary network CMLRAN, introducing an attention mechanism to process image and text information respectively, and enhancing the association between image and text information by combining the CLIP model proposed by OpenAI on the basis of a memory storage response matrix MSRM, the method comprising the following specific steps:
Step one: lesion-area feature extraction is implemented as follows:
① Performing contrast enhancement, image size conversion and image pixel block adjustment on an input image to obtain a preprocessed image;
② The preprocessed image is converted into an image feature matrix by a convolutional neural network CNN; all data in the matrix are flattened into a single row to obtain a trainable chest CT image feature matrix C, which is fed into ResnetII to extract feature information highly relevant to the chest organs, yielding an X-ray image feature matrix C'; the extracted features and the features C' extracted by the ResnetII network are combined through residual connections, and finally the features C' and the features C are processed by an attention network to obtain the complete chest X-ray image feature C'',
the ResnetII network is based on the Resnet-152 network, a 152-layer residual network whose underlying modules consist of 12 convolutions of different dimensions, where 1×1 and 3×3 are the convolution kernels and 64, 128, 256, 512, 1024, and 2048 are the channel widths of the network layers,
the X-ray image feature matrix C' is obtained by passing the image feature matrix C through the ResnetII network; the ResnetII network contains two different pooling layers, namely a max-pooling layer and an average-pooling layer, and the results of the two pooling layers are combined with the image feature matrix C through residual connections to obtain the X-ray image feature matrix C';
Step two: cross-modal auxiliary localization is implemented as follows:
after step one is completed, the X-ray image feature matrix C' obtained by feature extraction is fed into the medical CLIP and the MSRM for matrix calculation to determine the lesion area with the highest probability, the related formulas being as follows:
where the leading symbols denote the image and text of the core region, C_img and C_txt denote the preprocessed image and text, matrix calculation yields the cosine similarity W between the text features computed from the image features and the image features computed from the text features, N denotes the total number of image-text pairs in a group, W' denotes the feature score obtained by normalizing W, and the calculation finally outputs the cross probability scores L_i→t and L_t→i of the report corresponding to the lesion area;
Step three: automatic generation of the medical report is implemented as follows:
after the cross-modal features of the X-ray image features and the corresponding medical diagnosis report are described through the MSRM calculation in step two, the Transformer decoder can attend to the entire input sequence when generating each word, so that context information is captured well,
specifically, a GPT-2 network is adopted, in which each token in the sequence is generated conditioned on the preceding tokens; each generated word is used as input to the next step, and these operations are repeated until a complete medical report is generated; a forgetting gate SFG is proposed based on a bidirectional LSTM, and the formulas for the SFG and the cross-modal memory storage are as follows:
where W_f denotes the forgetting-gate (SFG) weights, b_f denotes the bias, x(t) denotes the hidden information, h(t-1) denotes the hidden state at time t-1, and C(t-1) denotes the cross-modal memory storage feature at the previous time step;
the flow of cross-modal information is controlled by introducing the forgetting gate and an update gate, so that the context information and the cross-modal information of the X-ray image in the medical report are better captured; to constrain the language model to regional visual features, the lesion-area features and the associated disease-keyword features are injected directly into the self-attention of the model using pseudo self-attention, the related formulas being as follows:
where X denotes the visual features of the lesion area, Y denotes the word embeddings, W_q, W_k, and W_v denote the query, key, and value projections, U_k and U_v denote the key and value parameters of the initial hidden state obtained through the LSTM, and text generation for the lesion area is realized through these matrix operations.
2. The cross-modal-network-based chest X-ray image report generation method according to claim 1, wherein the medical CLIP pre-training in step two uses a BPE tokenizer as its text encoder, and a tree-structured knowledge graph is added during tokenization, which strengthens the tokenizer's weight distribution and makes the dataset trainable.
CN202311271188.9A 2023-09-28 2023-09-28 Cross-modal network-based chest X-ray image report generation method Active CN117558394B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311271188.9A CN117558394B (en) 2023-09-28 2023-09-28 Cross-modal network-based chest X-ray image report generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311271188.9A CN117558394B (en) 2023-09-28 2023-09-28 Cross-modal network-based chest X-ray image report generation method

Publications (2)

Publication Number Publication Date
CN117558394A CN117558394A (en) 2024-02-13
CN117558394B true CN117558394B (en) 2024-06-25

Family

ID=89815449

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311271188.9A Active CN117558394B (en) 2023-09-28 2023-09-28 Cross-modal network-based chest X-ray image report generation method

Country Status (1)

Country Link
CN (1) CN117558394B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115132313A (en) * 2021-12-07 2022-09-30 北京工商大学 Automatic generation method of medical image report based on attention mechanism
CN115171838A (en) * 2022-08-24 2022-10-11 中南大学 Training method of medical report generation model based on cross-modal fusion

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110136103B (en) * 2019-04-24 2024-05-28 平安科技(深圳)有限公司 Medical image interpretation method, device, computer equipment and storage medium
CN116503515A (en) * 2023-04-26 2023-07-28 北京理工大学 Brain lesion image generation method and system based on text and image multi-mode
CN116779091B (en) * 2023-06-15 2024-02-27 兰州交通大学 Automatic generation method of multi-mode network interconnection and fusion chest image diagnosis report

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115132313A (en) * 2021-12-07 2022-09-30 北京工商大学 Automatic generation method of medical image report based on attention mechanism
CN115171838A (en) * 2022-08-24 2022-10-11 中南大学 Training method of medical report generation model based on cross-modal fusion

Also Published As

Publication number Publication date
CN117558394A (en) 2024-02-13

Similar Documents

Publication Publication Date Title
CN110147457B (en) Image-text matching method, device, storage medium and equipment
He et al. Pathvqa: 30000+ questions for medical visual question answering
WO2022007685A1 (en) Method and device for text-based image generation
CN110390363A (en) A kind of Image Description Methods
CN110619313B (en) Remote sensing image discriminant description generation method
CN112614561A (en) Brain CT medical report generation method based on hierarchical self-attention sequence coding
CN113343705A (en) Text semantic based detail preservation image generation method and system
CN113837229B (en) Knowledge-driven text-to-image generation method
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
CN114220516A (en) Brain CT medical report generation method based on hierarchical recurrent neural network decoding
Elbedwehy et al. Efficient Image Captioning Based on Vision Transformer Models.
CN112801217B (en) Text similarity judgment method and device, electronic equipment and readable storage medium
CN105678349B (en) A kind of sub- generation method of the context-descriptive of visual vocabulary
Zhang et al. Multi-head self-attention gated-dilated convolutional neural network for word sense disambiguation
CN117611601A (en) Text-assisted semi-supervised 3D medical image segmentation method
Wang et al. Image captioning based on deep learning methods: A survey
US11494431B2 (en) Generating accurate and natural captions for figures
CN112562809A (en) Method and system for auxiliary diagnosis based on electronic medical record text
Gu et al. Automatic generation of pulmonary radiology reports with semantic tags
CN117333462A (en) Ultrasonic diagnosis intelligent interaction system based on liver attribute analysis
CN113011514A (en) Intracranial hemorrhage sub-type classification algorithm applied to CT image based on bilinear pooling
CN117558394B (en) Cross-modal network-based chest X-ray image report generation method
Perdana et al. Instance-based deep transfer learning on cross-domain image captioning
Fan et al. Long-term recurrent merge network model for image captioning
CN115588486A (en) Traditional Chinese medicine diagnosis generating device based on Transformer and application thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant