CN117558394A - Cross-modal network-based chest X-ray image report generation method
- Publication number: CN117558394A (application CN202311271188.9A)
- Authority: CN (China)
- Prior art keywords: image, network, cross, chest, modal
- Legal status: Granted (assumed; not a legal conclusion)
Classifications
- G16H15/00: ICT specially adapted for medical reports, e.g. generation or transmission thereof
- G06F16/367: Information retrieval; creation of semantic tools; ontology
- G06F18/22: Pattern recognition; matching criteria, e.g. proximity measures
- G06F40/279: Natural language analysis; recognition of textual entities
- G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
- G06N3/0442: Recurrent networks characterised by memory or gating, e.g. LSTM or GRU
- G06N3/045: Combinations of networks
- G06N3/0455: Auto-encoder networks; encoder-decoder networks
- G06N3/0464: Convolutional networks [CNN, ConvNet]
- G06N3/048: Activation functions
- G06N3/096: Transfer learning
- G06V10/20: Image preprocessing
- G06V10/25: Determination of region of interest [ROI] or volume of interest [VOI]
- G06V10/32: Normalisation of the pattern dimensions
- G06V10/454: Integrating filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/764: Image or video recognition using classification, e.g. of video objects
- Y02A90/10: ICT supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a chest X-ray image report generation method based on a cross-modal network, belonging to the technical field of image reports. A cross-modal auxiliary network (CMLRAN) is provided, in which an attention mechanism is introduced to process image and text information separately and, on the basis of a Memory Storage Response Matrix (MSRM), the association between image and text information is enhanced in combination with the CLIP model proposed by OpenAI. Encoding focuses on classifying fine-grained differences in X-ray images; decoding focuses on generating medical terminology. The method better bridges the semantic gap and related problems and intelligently generates chest X-ray image reports.
Description
Technical Field
The invention relates to the technical field of image reports, in particular to a chest X-ray image report generation method based on a cross-modal network.
Background
Chest X-ray is an advanced medical imaging technique that, through high-resolution, multi-angle imaging, accurately displays lung lesions (such as pneumonia, tuberculosis and lung cancer), mediastinal lesions (such as mediastinal tumors and mediastinal emphysema), pleural lesions (such as pleural effusion and pleurisy) and cardiovascular lesions. The chest X-ray diagnostic report is the professional interpretation and summary of the examination result; it generally comprises imaging findings, diagnostic opinions and suggestions, and provides the basis on which physicians formulate diagnosis and treatment plans.
In recent years, the need for treatment guided by medical X-ray imaging has increased, and related studies have received extensive attention. Among them, methods that generate long text with hierarchical long short-term memory (LSTM) networks show certain advantages. However, research in this area still faces many challenges: medical image features are complex, cross-modal features are difficult to extract, and medical reports contain a large number of specialized terms. At present, LSTM alone cannot achieve automatic generation of multi-organ imaging reports. For this reason, some scholars have proposed deep-learning-based methods for automatic generation of medical image reports, which can be classified by processing object into image-processing methods and natural-language-processing methods. Taking the image as the entry point, Tanida et al. propose RGRG, a generation model guided by lesion regions, which first detects specific lesion regions and then composes the final report from them; Li et al. propose DCL, a knowledge-graph-assisted generation network with dynamic structure and nodes, which takes each image as a starting point to extract image-text contrastive features and finally adds these features to each output node. Taking natural language processing as the entry point, Chen et al. propose VisualGPT, an image-captioning network that exploits the linguistic knowledge of a large pre-trained language model (PLM) and can effectively learn a large amount of language knowledge from a small amount of multi-modal data; Kaur et al. propose CADxReport, a CNN-RNN-based network that uses reinforcement learning together with visual and semantic attention mechanisms to generate medical reports automatically.
The deep-learning-based methods for automatic generation of medical reports still have shortcomings. With image processing as the entry point, the model has difficulty fully comprehending the complex information in the images, and the generated reports lack flexibility of language expression. With natural language processing as the entry point, report generation relies on predefined templates, which likewise lack flexibility and adapt poorly to different application scenarios. In view of this, the present invention proposes a cross-modal auxiliary network (CMLRAN), in which an attention mechanism processes image and text information separately and, on the basis of a Memory Storage Response Matrix (MSRM), the association between image and text information is enhanced in combination with the CLIP model proposed by OpenAI. Encoding focuses on classifying fine-grained differences in X-ray images; decoding focuses on generating medical terminology. The method better bridges the semantic gap and related problems and intelligently generates chest X-ray image reports.
Disclosure of Invention
The present invention has been made to solve the above-mentioned problems, and an object of the present invention is to provide a method for generating a chest X-ray image report based on a cross-modal network.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows: a chest X-ray image report generation method based on a cross-modal network, comprising: creating a cross-modal auxiliary network (CMLRAN); introducing an attention mechanism to process image and text information respectively; and, on the basis of a Memory Storage Response Matrix (MSRM), enhancing the information association of image and text in combination with the CLIP proposed by OpenAI. The specific steps are as follows:
Step one: lesion area feature extraction, implemented as follows:
(1) performing contrast enhancement, image size conversion and image pixel block adjustment on an input image to obtain a preprocessed image;
(2) the preprocessed image is converted into an image feature matrix through a convolutional neural network (CNN), and all data of the matrix are flattened into one row to obtain the chest CT trainable image feature matrix C, which is then fed into ResnetII to extract feature information highly correlated with chest organs, yielding the X-ray image feature matrix C'. The residual network can learn both the chest organ features of the original image and the chest organ features after convolutional extraction, thereby avoiding gradient vanishing and gradient explosion during information propagation.
the formula expression of the feature matrix C' after the first processing of the ResnetII network is as follows:
where σ denotes the Sigmoid function, Avg denotes average pooling, Max denotes maximum pooling, c denotes the feature matrix obtained at each step from the chest X-ray image, f^{7×7} denotes a convolution kernel of size 7×7, δ_p denotes the direct mapping of the network, and μ denotes the loss function of the residual network;
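The formula itself is reproduced only as an image in the source. A plausible LaTeX reconstruction from the variable legend above, assuming a CBAM-style spatial-attention gate added to the direct-mapping branch (the exact arrangement of terms is our assumption; μ enters training only as the residual network's loss), is:

```latex
C' = \delta_p(c) + \sigma\!\left(f^{7\times 7}\big([\operatorname{Avg}(c);\,\operatorname{Max}(c)]\big)\right)\otimes c
```

where [·;·] denotes channel-wise concatenation and ⊗ denotes element-wise multiplication.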
Step two: cross-modal auxiliary positioning, implemented as follows:
after step one is completed, matrix calculation is performed between the X-ray image features obtained by feature extraction and the introduced medical CLIP and MSRM, the lesion area with the highest probability is determined, and the lesion area is enhanced; the related formulas are as follows:
where the leading symbols denote the image and text of the core region, C_img and C_txt denote the preprocessed image and text, the cosine similarity W between image-based and text-based features is computed by matrix calculation, N denotes the total number of image-text pairs in a group, W' denotes the feature score obtained by normalizing W, and the cross probability scores L_{i→t} and L_{t→i} of the report corresponding to the finally output lesion area are computed;
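These formulas also appear only as images in the source. Assuming the standard CLIP-style symmetric contrastive (InfoNCE) objective that the legend describes, with an assumed temperature τ, a reconstruction is:

```latex
W_{jk} = \frac{C_{img}^{j}\cdot C_{txt}^{k}}{\lVert C_{img}^{j}\rVert\,\lVert C_{txt}^{k}\rVert},\qquad
W' = \operatorname{softmax}(W/\tau),\qquad
L_{i\to t} = -\frac{1}{N}\sum_{j=1}^{N}\log\frac{\exp(W_{jj}/\tau)}{\sum_{k=1}^{N}\exp(W_{jk}/\tau)}
```

with L_{t→i} defined symmetrically over the columns of W.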
Step three: the implementation step of automatic generation of medical report:
after the cross-modal features are derived, the Transformer's decoder can take the entire input sequence into account simultaneously when generating each word, so it captures context information well.
Further, the newly built cross-modal auxiliary network (CMLRAN) focuses on classifying fine-grained differences in X-ray images during encoding and on generating medical terminology during decoding.
Further, ResnetII in step one adds a maximum pooling layer and an average pooling layer on the basis of the Resnet-152 pre-trained network. These two network layers obtain the maximum and the average of the features at different scales; the maximum pooling helps improve the stability of feature extraction and reduces the influence on the model of errors such as geometric distortion and exposure in X-ray images.
Further, in step one, a new tokenizer, BPE, is adopted for processing the medical report, and a tree-shaped knowledge graph is added during tokenization, which enhances the weight distribution of the tokenizer and makes the dataset trainable.
Further, in step two, in order to prevent oversaturation of stored information, and thus gradient explosion during network model training, a Selective Forgetting Gate (SFG) is added. The formulas for the SFG and the cross-modal memory storage are as follows:
where W_f denotes the weight of the forget gate SFG, b_f denotes the bias, x^{(t)} denotes the hidden information, h^{(t-1)} denotes the time function at time t-1, and C^{(t-1)} denotes the cross-modal memory storage feature at the previous moment.
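The stored formulas are likewise images in the source. Assuming the SFG follows the standard LSTM forget-gate form that the legend implies (the candidate memory term is our assumption), a reconstruction is:

```latex
f^{(t)} = \sigma\!\left(W_f\,[h^{(t-1)};\,x^{(t)}] + b_f\right),\qquad
C^{(t)} = f^{(t)} \odot C^{(t-1)} + \big(1 - f^{(t)}\big) \odot \tilde{C}^{(t)}
```

where \tilde{C}^{(t)} is the candidate cross-modal memory at time t.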
Further, in step three, to restrict the language model to regional visual features, pseudo-self-attention is used to inject lesion-region features and associated disease-keyword features directly into the self-attention of the model; the related formula is as follows:
where X denotes the visual features of the lesion region, Y denotes the word embeddings, W_q, W_k and W_v denote the query, key and value projections, and U_k and U_v denote the key and value parameters of the initial hidden state obtained from the LSTM; text generation for the lesion area is realized by matrix operations.
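The formula is an image in the source. Pseudo-self-attention as introduced by Ziegler et al. (2019), which matches the projections named in the legend, reads:

```latex
\operatorname{PSA}(X, Y) = \operatorname{softmax}\!\left(\frac{(Y W_q)\,[X U_k;\, Y W_k]^{\top}}{\sqrt{d}}\right)[X U_v;\, Y W_v]
```

where d is the attention dimension and [·;·] stacks the visual and textual keys (or values) along the sequence axis.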
Compared with the prior art, the invention has the following beneficial effects:
(1) On the basis of transfer learning, multi-channel feature extraction is added and split into a MaxPool layer and an AvgPool layer, and an attention mechanism is added on this basis to enhance the extraction of global image features;
(2) A cross-modal auxiliary network is provided to bridge the semantic gap between chest X-ray images and the corresponding medical reports and to strengthen the connection between the two kinds of modal information, improving the matching precision between X-ray images and the corresponding medical reports.
Drawings
FIG. 1 is a schematic diagram of the implementation steps of the main network model framework according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a focus area feature extraction module according to an embodiment of the invention;
FIG. 3 is a cross-modal network layer model of an embodiment of the present invention;
FIG. 4 is a tree-like knowledge-graph of medical report according to an embodiment of the invention;
FIG. 5 is a lesion-area visualization using Grad-CAM according to an embodiment of the invention.
Detailed Description
The invention is further described below in connection with specific embodiments, in order to make the technical means, creative features, objectives and effects of the invention easy to understand.
As shown in FIG. 1, the method provided by the invention first uses a residual network and a visual attention model to extract image features, then combines CLIP and MSRM to locate the lesion area, and finally realizes automatic generation of the medical report through the Transformer's decoder and the LSTM gate-unit mechanism. The proposed cross-modal auxiliary network model CMLRAN consists of three modules: the lesion area feature extraction module, the lesion-area-based cross-modal auxiliary positioning module, and the medical report automatic generation module.
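A minimal PyTorch-style skeleton of how the three modules might compose; all class, argument and method names here are illustrative assumptions rather than the authors' code:

```python
import torch.nn as nn

class CMLRAN(nn.Module):
    """Sketch of the three-module pipeline described above (names are assumptions)."""
    def __init__(self, feature_extractor, cross_modal_locator, report_decoder):
        super().__init__()
        self.feature_extractor = feature_extractor  # ResnetII + visual attention
        self.locator = cross_modal_locator          # CLIP + MSRM lesion positioning
        self.decoder = report_decoder               # Transformer decoder + LSTM gate unit

    def forward(self, image, text_prompts):
        feats = self.feature_extractor(image)                     # image feature matrix
        region_feats, scores = self.locator(feats, text_prompts)  # lesion area + cross scores
        report_tokens = self.decoder(region_feats)                # autoregressive report
        return report_tokens, scores
```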
(1) Lesion area feature extraction, implemented as follows:
(1) performing contrast enhancement, image size conversion and image pixel block adjustment on an input image to obtain a preprocessed image;
(2) the preprocessed image is converted into an image feature matrix through a convolutional neural network (CNN), and all data of the matrix are flattened into one row to obtain the chest CT trainable image feature matrix C, which is then fed into ResnetII to extract feature information highly correlated with chest organs, yielding the X-ray image feature matrix C'. The residual network can learn both the chest organ features of the original image and the chest organ features after convolutional extraction, thereby avoiding gradient vanishing and gradient explosion during information propagation. The architecture of the lesion area feature extraction module is shown in FIG. 2.
In fig. 2, resnet-152 represents a 152-layer residual network, whose underlying modules consist of 12 different-dimensional convolutions (1 x 1 and 3 x 3 are convolution kernels, 64, 128, 256, 512, 1024, 2048 are the number of network layers),
To ensure that the encoder learns chest CT image features better, a dual-channel modular network architecture is added: a max-pooling (MaxPool) layer and an average-pooling (AvgPool) layer are inserted, and a self-attention mechanism is added to enhance the image features extracted by MaxPool and AvgPool.
The lesion area feature extraction network is trained using the dual-channel module feature extraction network ResnetII and an attention feature extraction network. During dual-channel feature extraction, a convolution operation is performed on the chest CT trainable image feature matrix C and its dimension is increased; the images are then fed through MaxPool and AvgPool into the dilated convolution layer of Resnet-152 to obtain the Resnet output (dilation rate 2, kernel size 7×7); the output then undergoes dual-channel residual operations through MaxPool and AvgPool and is summed with the original input information to obtain the feature matrix C' after the first pass of the Resnet network. Dual-channel feature extraction enhances the model's multi-scale extraction of chest CT image details while reducing the negative effects caused by using dilated convolution alone, such as loss of spatial hierarchy information of the original image and repeated extraction of unimportant information. The formula expression of the feature matrix C' after the first processing of the ResnetII network is as follows:
where σ denotes the Sigmoid function, Avg denotes average pooling, Max denotes maximum pooling, c denotes the feature matrix obtained at each step from the chest X-ray image, f^{7×7} denotes a convolution kernel of size 7×7, δ_p denotes the direct mapping of the network, and μ denotes the loss function of the residual network. ResnetII adds a maximum pooling layer and an average pooling layer on the basis of the Resnet-152 pre-trained network; these two network layers obtain the maximum and the average of the features at different scales. The maximum pooling helps improve the stability of feature extraction and reduces the influence on the model of errors such as geometric distortion and exposure in X-ray images. The average pooling layer converts the spatial information of the features into a more compact feature representation, improves the generalization ability of the model, and reduces the influence of background radiation, artifacts and the like in the X-ray image; the features C' extracted via these two network layers reflect the local features of the medical X-ray image. The original feature C and the processed feature C' are then fed into an attention-mechanism network for secondary feature processing; this network divides the chest X-ray image into a series of learnable patches to achieve global processing of the whole image. Finally, the attention-extracted features and the features C' extracted by the ResnetII network are combined through residual connections to obtain the complete chest X-ray image feature C''.
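A minimal PyTorch sketch of the dual-pooling attention gate described above, written as a CBAM-style spatial attention with a residual connection; the exact wiring is our reading of the text, not the authors' released code:

```python
import torch
import torch.nn as nn

class DualPoolAttention(nn.Module):
    """Spatial attention from channel-wise Avg/Max pooling, a 7x7 conv and a Sigmoid gate."""
    def __init__(self):
        super().__init__()
        # two input channels: the stacked average-pooled and max-pooled maps
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, c):                    # c: (B, C, H, W)
        avg = c.mean(dim=1, keepdim=True)    # Avg(c): (B, 1, H, W)
        mx, _ = c.max(dim=1, keepdim=True)   # Max(c): (B, 1, H, W)
        gate = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return c + gate * c                  # gated features plus the direct mapping
```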
The invention selects the BPE tokenizer to process the corresponding medical report. BPE is a data-driven method that splits text into a fixed number of subwords by repeatedly merging the most frequently occurring characters or character sequences.
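A short sketch of training such a BPE tokenizer on report text with the Hugging Face tokenizers library; the corpus file name, vocabulary size and special tokens are illustrative assumptions:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-pair encoding: start from characters, iteratively merge the most frequent pairs.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=8000,                  # assumed size
                              special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(["reports.txt"], trainer)                       # hypothetical corpus file

ids = tokenizer.encode("no acute cardiopulmonary abnormality").ids
```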
(2) Cross-modal auxiliary positioning, implemented as follows:
After step one is completed, matrix calculation is performed between the X-ray image features obtained by feature extraction and the introduced medical CLIP and MSRM; the lesion area with the highest probability is determined, and the lesion area and the corresponding medical report keywords are enhanced. As shown in FIG. 3, the medical CLIP first uses a pre-trained Resnet-152 as the image encoder and BPE as the encoder for the medical report. We fine-tune the original CLIP model with a contrastive loss on the IU X-RAY and MIMIC-CXR datasets, enhancing the similarity of matched image-text pairs and the dissimilarity of unmatched pairs, and the generated result is filtered through a pre-built tree-shaped medical-report knowledge graph (shown in FIG. 4). The graph makes a preliminary judgment using the lesion types and disease information that may appear in each part of a chest X-ray, after which the zero-shot classification mechanism of the medical CLIP aligns the medical graph with the corresponding lesion area. Finally, the chest X-ray lesion area corresponding to the disease information with the maximum similarity is selected based on image classification. The related formulas are as follows:
where the leading symbols denote the image and text of the core region, C_img and C_txt denote the preprocessed image and text, the cosine similarity W between image-based and text-based features is computed by matrix calculation, N denotes the total number of image-text pairs in a group, W' denotes the feature score obtained by normalizing W, and the cross probability scores L_{i→t} and L_{t→i} of the report corresponding to the finally output lesion area are computed. After the probability score L is obtained, it is fed into the MSRM for cross-modal memory storage: information from the two modalities is stored simultaneously, important information is retained, and useless information is deleted. Meanwhile, to prevent oversaturation of stored information, and thus gradient explosion during network model training, a Selective Forgetting Gate (SFG) is added with reference to the LSTM gate-unit mechanism. The formulas for the SFG and the cross-modal memory storage are as follows:
where W_f denotes the weight of the forget gate SFG, b_f denotes the bias, x^{(t)} denotes the hidden information, h^{(t-1)} denotes the time function at time t-1, and C^{(t-1)} denotes the cross-modal memory storage feature at the previous moment.
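A sketch of the zero-shot lesion scoring step under these assumptions; the temperature value and the embedding shapes are illustrative, and the fine-tuned medical CLIP weights are not reproduced here:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_lesion_scores(image_feat, text_feats, tau=0.07):
    """Cosine-similarity scores between one X-ray embedding and knowledge-graph prompts.

    image_feat: (D,) image embedding; text_feats: (K, D) prompt embeddings,
    one per candidate lesion/disease entry from the tree-shaped knowledge graph.
    """
    img = F.normalize(image_feat, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    w = txt @ img                        # cosine similarities W, shape (K,)
    return F.softmax(w / tau, dim=-1)    # normalized scores W'

# usage: probs = zero_shot_lesion_scores(img_emb, prompt_embs); best = probs.argmax()
```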
(3) Automatic generation of the medical report:
After the cross-modal features are derived, the Transformer's decoder can take the entire input sequence into account simultaneously when generating each word, so it captures context information well. Specifically, it employs a GPT-2-style network of the kind used on PubMed text, in which each token in the sequence is generated conditioned on the previous tokens; each generated word is taken as the input to the next step, and this process repeats until the complete medical report is generated. However, at decoding we add the cross-modal memory feature C^{(t)}; if only a conventional decoder were adopted, problems such as repeated memorization of erroneous information and slow network convergence would arise. The invention therefore proposes a forgetting gate SFG based on the bidirectional LSTM and combines the SFG with an attention mechanism; by introducing the forget gate and the update gate to control the flow of cross-modal information, the context information of the medical report and the cross-modal information of the X-ray image are better captured. To restrict the language model to regional visual features, we use pseudo-self-attention to inject lesion-region features and associated disease-keyword features directly into the self-attention of the model; the related formula is as follows:
where X denotes the visual features of the lesion region, Y denotes the word embeddings, W_q, W_k and W_v denote the query, key and value projections, and U_k and U_v denote the key and value parameters of the initial hidden state obtained from the LSTM; text generation for the lesion area is realized by matrix operations.
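A compact PyTorch sketch of pseudo-self-attention as we reconstruct it (following Ziegler et al., 2019); dimensions, naming and the single-head simplification are assumptions:

```python
import math
import torch
import torch.nn as nn

class PseudoSelfAttention(nn.Module):
    """Inject visual features X into the self-attention over text embeddings Y."""
    def __init__(self, d_model, d_visual):
        super().__init__()
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.uk = nn.Linear(d_visual, d_model)  # projects X into the key space
        self.uv = nn.Linear(d_visual, d_model)  # projects X into the value space

    def forward(self, x, y):                    # x: (B, Nx, Dv), y: (B, Ny, D)
        q = self.wq(y)
        k = torch.cat([self.uk(x), self.wk(y)], dim=1)  # [X Uk ; Y Wk]
        v = torch.cat([self.uv(x), self.wv(y)], dim=1)  # [X Uv ; Y Wv]
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(q.size(-1)), dim=-1)
        return attn @ v                          # text tokens attend over visual + text
```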
Gradient-weighted Class Activation Mapping (Grad-CAM) can display the regions the model focuses on during training, and is used here to determine whether the model accurately associates the lesion regions with the corresponding tree-shaped knowledge graph. As shown in FIGS. 5(a)-(f), M1-M5 denote MRARGN and its variants under optimal conditions for Grad-CAM display of the heart, pleura, bone, mediastinum, lung and free regions of the MIMIC-CXR dataset. We can observe that M5 accurately identifies the boundary and shape information of most lesion areas. Although M1's treatment of the bone region and M2's treatment of the pleural and mediastinal regions are relatively reasonable, they produce many erroneous judgments or repeated associations of erroneous information at other locations. In the lung lesion treatment of FIG. 5(e), M1 mistakenly points to the right upper lobe, and M2 perceives the free area as part of the lung lesion area. The lesion areas extracted by M3 contain a large number of irrelevant extractions. M4 performs better than M1-M3 in the integrity of the generated report, but when predicting a lesion area it erroneously interprets a pacemaker or external device as part of the lesion. M5 eliminates this type of error by introducing the SFG, which also deletes the irrelevant areas that M4 over-extracts.
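A minimal Grad-CAM sketch in PyTorch, of the kind one might use to reproduce such visualizations; the choice of target layer and classifier head are assumptions:

```python
import torch

def grad_cam(model, image, target_layer, class_idx):
    """Gradient-weighted class activation map for one image (batch size 1)."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    score = model(image)[0, class_idx]          # logit of the class of interest
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # global-average-pooled gradients
    cam = torch.relu((weights * acts["a"]).sum(dim=1))   # weighted sum over channels
    return cam / (cam.max() + 1e-8)                      # normalized heat map (1, H, W)
```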
The embodiment of the invention is implemented in PyTorch and trained on a workstation with 64 GB RAM and an NVIDIA GeForce RTX 4090 GPU, using a pre-trained Resnet-152 as the common feature extraction encoder for the image-processing ResnetII and CLIP; all images are scaled to 224×224 and yield encoder feature maps of size 7×7×512.
As shown in Table 1, during the training, testing and validation stages we used the splits proposed by the original dataset authors, i.e., the dataset was divided in the ratio training : testing : validation = 7 : 2 : 1. The extracted visual features are fed to CLIP to generate more than 320 tags, and for each tag a vector containing 512 word embeddings is generated; where the main part of a sentence does not reach 512 words, the remainder is padded with <text>, an omitted part identifiable during network training. The highest-probability feature is taken as the semantic feature generated by the model; the lesion area is then connected to network training separately through the MSRM, building a more complete auxiliary network without affecting network convergence. The Transformer's decoder serves as the decoding layer of the model: all hidden layers and word embedding dimensions are first set to 512, and these hidden layers directly call the hidden states extracted through the LSTM gate unit. Parameters are learned with an AdamW optimizer with a batch size of 4. The total loss is defined as L = λ_MSRM·L_MSRM + λ_cmn·L_cmn + λ_cross·L_cross + λ_language·L_language, where L_MSRM is the lesion-area image loss, L_cmn and L_cross are the binary cross-entropy losses of the two binary classifiers that handle cross-modal information, and L_language is the cross-entropy loss of the language model. According to performance on the validation set, the loss weights are set to λ_MSRM = 2.0, λ_cmn = 3.0, λ_cross = 3.0 and λ_language = 1.0.
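A sketch of combining these weighted losses in PyTorch during training; the concrete loss module for L_MSRM is a stand-in, since the text does not specify its exact form:

```python
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # for the two cross-modal binary classifiers
ce = nn.CrossEntropyLoss()     # for the language model

def total_loss(msrm_loss, cmn_logits, cmn_y, cross_logits, cross_y, lm_logits, lm_y,
               w=(2.0, 3.0, 3.0, 1.0)):
    """L = w0*L_MSRM + w1*L_cmn + w2*L_cross + w3*L_language, weights from the text."""
    return (w[0] * msrm_loss
            + w[1] * bce(cmn_logits, cmn_y)
            + w[2] * bce(cross_logits, cross_y)
            + w[3] * ce(lm_logits, lm_y))
```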
As shown in Table 1, we compared MRARGN on the IU X-RAY and MIMIC-CXR datasets with the most recent chest report generation models: PPKED, which combines prior and posterior knowledge to assist report generation; M2 Tr., which generates individual descriptive words in the context of X-ray images and then converts them into coherent text using a Transformer architecture; R2Gen, which generates radiology reports via a memory-driven Transformer; CMN, which enhances the encoder-decoder framework with a self-attention mechanism to facilitate cross-modal interaction and generation; CvT2DistilGPT2, realized with CvT and DistilGPT2, which performs highly structured report generation by combining knowledge distillation and computer-vision warm starting, with a task distillation module for structure-level description, a task-aware report generation module for the describable level, and an abnormality classification and tagging module; VisualGPT, which uses a self-resurrecting encoder-decoder attention mechanism to help the model generate highly interpretable text; and RGRG, a simple and efficient region-guided report generation model that detects anatomical regions and then describes individual salient regions to form the final report. The CvT2DistilGPT2 model reports experimental results only on the MIMIC-CXR dataset (no results are available on IU X-RAY). The MRARGN model of the invention outperforms most models in cross-modal feature processing and automatic generation of medical reports, successfully generating highly refined descriptions of lesion areas and case conditions.
Table 1: comparison graph of evaluation index results of network models
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. This manner of description is adopted merely for clarity; those skilled in the art should take the specification as a whole, and the technical solutions in the embodiments may be appropriately combined to form other implementations that will be apparent to those skilled in the art.
Claims (6)
1. A chest X-ray image report generation method based on a cross-modal network, characterized by comprising: creating a cross-modal auxiliary network (CMLRAN); introducing an attention mechanism to process image and text information respectively; and, on the basis of a Memory Storage Response Matrix (MSRM), enhancing the information association of image and text in combination with the CLIP proposed by OpenAI; the specific steps are as follows:
Step one: lesion area feature extraction, implemented as follows:
(1) performing contrast enhancement, image size conversion and image pixel block adjustment on an input image to obtain a preprocessed image;
(2) converting the preprocessed image into an image feature matrix through a convolutional neural network (CNN) and flattening all data of the matrix into one column to obtain the chest CT trainable image feature matrix C, which is then fed into ResnetII to extract feature information highly correlated with chest organs, yielding the X-ray image feature matrix C'; the residual network can learn both the chest organ features of the original image and the chest organ features after convolutional extraction, thereby avoiding gradient vanishing and gradient explosion during information propagation,
the formula expression of the feature matrix C' after the first processing of the ResnetII network is as follows:
where σ denotes the Sigmoid function, Avg denotes average pooling, Max denotes maximum pooling, c denotes the feature matrix obtained at each step from the chest X-ray image, f^{7×7} denotes a convolution kernel of size 7×7, δ_p denotes the direct mapping of the network, and μ denotes the loss function of the residual network;
Step two: cross-modal auxiliary positioning, implemented as follows:
after step one is completed, matrix calculation is performed between the X-ray image features obtained by feature extraction and the introduced medical CLIP and MSRM, the lesion area with the highest probability is determined, and the lesion area is enhanced; the related formulas are as follows:
where the leading symbols denote the image and text of the core region, C_img and C_txt denote the preprocessed image and text, the cosine similarity W between image-based and text-based features is computed by matrix calculation, N denotes the total number of image-text pairs in a group, W' denotes the feature score obtained by normalizing W, and the cross probability scores L_{i→t} and L_{t→i} of the report corresponding to the finally output lesion area are computed;
Step three: the implementation step of automatic generation of medical report:
after the cross-modal features are derived, the Transformer's decoder can take the entire input sequence into account simultaneously when generating each word, so it captures context information well.
2. The cross-modal network-based chest X-ray image report generation method according to claim 1, wherein the newly built cross-modal auxiliary network (CMLRAN) focuses on classifying fine-grained differences in X-ray images during encoding and on generating medical terminology during decoding.
3. The cross-modal network-based chest X-ray image report generation method according to claim 1, wherein ResnetII in step one adds a maximum pooling layer and an average pooling layer on the basis of the Resnet-152 pre-trained network; these two network layers obtain the maximum and the average of the features at different scales, and the maximum pooling helps improve the stability of feature extraction and reduce the influence on the model of errors such as geometric distortion and exposure in X-ray images.
4. The cross-modal network-based chest X-ray image report generation method according to claim 1, wherein in step one a new tokenizer, BPE, is adopted for processing the medical report, and a tree-shaped knowledge graph is added during tokenization, enhancing the weight distribution of the tokenizer and making the dataset trainable.
5. The cross-modal network-based chest X-ray image report generation method according to claim 1, wherein in step two, in order to prevent oversaturation of stored information, and thus gradient explosion during network model training, a Selective Forgetting Gate (SFG) is added; the formulas for the SFG and the cross-modal memory storage are as follows:
where W_f denotes the weight of the forget gate SFG, b_f denotes the bias, x^{(t)} denotes the hidden information, h^{(t-1)} denotes the time function at time t-1, and C^{(t-1)} denotes the cross-modal memory storage feature at the previous moment.
6. The method of claim 1, wherein in step three, to restrict the language model to regional visual features, pseudo-self-attention is used to inject lesion-region features and associated disease-keyword features directly into the self-attention of the model; the related formula is as follows:
where X denotes the visual features of the lesion region, Y denotes the word embeddings, W_q, W_k and W_v denote the query, key and value projections, and U_k and U_v denote the key and value parameters of the initial hidden state obtained from the LSTM; text generation for the lesion area is realized by matrix operations.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202311271188.9A CN117558394B (en) | 2023-09-28 | 2023-09-28 | Cross-modal network-based chest X-ray image report generation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202311271188.9A CN117558394B (en) | 2023-09-28 | 2023-09-28 | Cross-modal network-based chest X-ray image report generation method |
Publications (2)
Publication Number | Publication Date |
---|---
CN117558394A true CN117558394A (en) | 2024-02-13 |
CN117558394B CN117558394B (en) | 2024-06-25 |
Family
ID=89815449
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---
CN202311271188.9A Active CN117558394B (en) | 2023-09-28 | 2023-09-28 | Cross-modal network-based chest X-ray image report generation method |
Country Status (1)
Country | Link |
---|---
CN (1) | CN117558394B (en)
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020215557A1 (en) * | 2019-04-24 | 2020-10-29 | 平安科技(深圳)有限公司 | Medical image interpretation method and apparatus, computer device and storage medium |
CN115132313A (en) * | 2021-12-07 | 2022-09-30 | 北京工商大学 | Automatic generation method of medical image report based on attention mechanism |
CN115171838A (en) * | 2022-08-24 | 2022-10-11 | 中南大学 | Training method of medical report generation model based on cross-modal fusion |
CN116503515A (en) * | 2023-04-26 | 2023-07-28 | 北京理工大学 | Brain lesion image generation method and system based on text and image multi-mode |
CN116779091A (en) * | 2023-06-15 | 2023-09-19 | 兰州交通大学 | Automatic generation method of multi-mode network interconnection and fusion chest image diagnosis report |
Non-Patent Citations (2)
Title |
---
CHEN Z, SHEN Y, SONG Y, et al.: "Cross-modal Memory Networks for Radiology Report Generation", DOI: 10.48550/ARXIV.2204.13258, 31 December 2022, pages 1-11 *
ZHANG Jiacheng, OU Weihua, CHEN Yingjie, et al.: "Dual-tower cross-modal retrieval for chest X-ray images and diagnostic reports", Application Research of Computers, vol. 40, no. 8, 31 August 2023, pages 2543-2548 *
Also Published As
Publication number | Publication date |
---|---
CN117558394B (en) | 2024-06-25 |
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant