CN117558394A - Cross-modal network-based chest X-ray image report generation method
- Publication number: CN117558394A (application CN202311271188.9A)
- Authority: CN (China)
- Prior art keywords: image, network, cross, chest, modal
- Legal status: Granted (assumed; not a legal conclusion)
Classifications
- G16H15/00: ICT specially adapted for medical reports, e.g. generation or transmission thereof
- G06F16/367: Information retrieval; creation of semantic tools; ontology
- G06F18/22: Pattern recognition; matching criteria, e.g. proximity measures
- G06F40/279: Natural language analysis; recognition of textual entities
- G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
- G06N3/0442: Recurrent networks characterised by memory or gating, e.g. LSTM or GRU
- G06N3/045: Combinations of networks
- G06N3/0455: Auto-encoder networks; encoder-decoder networks
- G06N3/0464: Convolutional networks [CNN, ConvNet]
- G06N3/048: Activation functions
- G06N3/096: Transfer learning
- G06V10/20: Image preprocessing
- G06V10/25: Determination of region of interest [ROI] or volume of interest [VOI]
- G06V10/32: Normalisation of the pattern dimensions
- G06V10/454: Integrating filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/764: Image or video recognition using classification, e.g. of video objects
- Y02A90/10: ICT supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a chest X-ray image report generation method based on a cross-modal network, belonging to the technical field of image reports. A cross-modal auxiliary network (CMLRAN) is provided, in which an attention mechanism is introduced to process image and text information separately and, on the basis of a Memory Storage Response Matrix (MSRM), the association between image and text information is enhanced in combination with the CLIP model proposed by OpenAI. Encoding focuses on classifying fine-grained differences in X-ray images; decoding focuses on generating medical terminology. The method better bridges the semantic gap and related problems and intelligently generates chest X-ray image reports.
Description
Technical Field
The invention relates to the technical field of image reports, in particular to a chest X-ray image report generation method based on a cross-modal network.
Background
Chest X-ray is an advanced medical imaging technique that, through high-resolution, multi-angle imaging, accurately displays lung lesions (such as pneumonia, tuberculosis and lung cancer), mediastinal lesions (such as mediastinal tumors and mediastinal emphysema), pleural lesions (such as pleural effusion and pleurisy) and cardiovascular lesions. The chest X-ray diagnostic report is the professional interpretation and summary of the examination result; it generally comprises imaging findings, diagnostic opinions and suggestions, and provides the basis on which physicians formulate diagnosis and treatment plans.
In recent years, the need for treatment guided by medical X-ray imaging has increased, and related studies have received extensive attention. Among them, methods that generate long text with hierarchical long short-term memory (LSTM) networks show certain advantages. However, research in this area still faces many challenges: medical image features are complex, cross-modal features are difficult to extract, and medical reports contain a large number of specialized terms. At present, LSTM alone cannot achieve automatic generation of multi-organ imaging reports. For this reason, some scholars have proposed deep-learning-based methods for automatic generation of medical image reports, which can be classified by processing object into image-processing methods and natural-language-processing methods. Taking the image as the entry point, Tanida et al. propose RGRG, a generation model guided by lesion regions, which first detects specific lesion regions and then composes the final report from them; Li et al. propose DCL, a knowledge-graph-assisted generation network with dynamic structure and nodes, which takes each image as a starting point to extract image-text contrastive features and finally adds these features to each output node. Taking natural language processing as the entry point, Chen et al. propose VisualGPT, an image-captioning network that exploits the linguistic knowledge of a large pre-trained language model (PLM) and can effectively learn a large amount of language knowledge from a small amount of multi-modal data; Kaur et al. propose CADxReport, a CNN-RNN-based network that uses reinforcement learning together with visual and semantic attention mechanisms to generate medical reports automatically.
The deep-learning-based methods for automatic generation of medical reports still have shortcomings. With image processing as the entry point, the model has difficulty fully comprehending the complex information in the images, and the generated reports lack flexibility of language expression. With natural language processing as the entry point, report generation relies on predefined templates, which likewise lack flexibility and adapt poorly to different application scenarios. In view of this, the present invention proposes a cross-modal auxiliary network (CMLRAN), in which an attention mechanism processes image and text information separately and, on the basis of a Memory Storage Response Matrix (MSRM), the association between image and text information is enhanced in combination with the CLIP model proposed by OpenAI. Encoding focuses on classifying fine-grained differences in X-ray images; decoding focuses on generating medical terminology. The method better bridges the semantic gap and related problems and intelligently generates chest X-ray image reports.
Disclosure of Invention
The present invention has been made to solve the above-mentioned problems, and an object of the present invention is to provide a method for generating a chest X-ray image report based on a cross-modal network.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows: a chest X-ray image report generation method based on a cross-modal network, comprising: creating a cross-modal auxiliary network (CMLRAN); introducing an attention mechanism to process image and text information respectively; and, on the basis of a Memory Storage Response Matrix (MSRM), enhancing the information association of image and text in combination with the CLIP proposed by OpenAI. The specific steps are as follows:
Step one: lesion area feature extraction, implemented as follows:
(1) performing contrast enhancement, image size conversion and image pixel block adjustment on an input image to obtain a preprocessed image;
(2) the preprocessed image is converted into an image feature matrix through a convolutional neural network (CNN), and all data of the matrix are flattened into one row to obtain the chest CT trainable image feature matrix C, which is then fed into ResnetII to extract feature information highly correlated with chest organs, yielding the X-ray image feature matrix C'. The residual network can learn both the chest organ features of the original image and the chest organ features after convolutional extraction, thereby avoiding gradient vanishing and gradient explosion during information propagation.
the formula expression of the feature matrix C' after the first processing of the ResnetII network is as follows:
where σ denotes the Sigmoid function, Avg denotes average pooling, Max denotes maximum pooling, c denotes the feature matrix obtained at each step from the chest X-ray image, f^{7×7} denotes a convolution kernel of size 7×7, δ_p denotes the direct mapping of the network, and μ denotes the loss function of the residual network;
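The formula itself is reproduced only as an image in the source. A plausible LaTeX reconstruction from the variable legend above, assuming a CBAM-style spatial-attention gate added to the direct-mapping branch (the exact arrangement of terms is our assumption; μ enters training only as the residual network's loss), is:

```latex
C' = \delta_p(c) + \sigma\!\left(f^{7\times 7}\big([\operatorname{Avg}(c);\,\operatorname{Max}(c)]\big)\right)\otimes c
```

where [·;·] denotes channel-wise concatenation and ⊗ denotes element-wise multiplication.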
Step two: cross-modal auxiliary positioning, implemented as follows:
after step one is completed, matrix calculation is performed between the X-ray image features obtained by feature extraction and the introduced medical CLIP and MSRM, the lesion area with the highest probability is determined, and the lesion area is enhanced; the related formulas are as follows:
where the leading symbols denote the image and text of the core region, C_img and C_txt denote the preprocessed image and text, the cosine similarity W between image-based and text-based features is computed by matrix calculation, N denotes the total number of image-text pairs in a group, W' denotes the feature score obtained by normalizing W, and the cross probability scores L_{i→t} and L_{t→i} of the report corresponding to the finally output lesion area are computed;
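These formulas also appear only as images in the source. Assuming the standard CLIP-style symmetric contrastive (InfoNCE) objective that the legend describes, with an assumed temperature τ, a reconstruction is:

```latex
W_{jk} = \frac{C_{img}^{j}\cdot C_{txt}^{k}}{\lVert C_{img}^{j}\rVert\,\lVert C_{txt}^{k}\rVert},\qquad
W' = \operatorname{softmax}(W/\tau),\qquad
L_{i\to t} = -\frac{1}{N}\sum_{j=1}^{N}\log\frac{\exp(W_{jj}/\tau)}{\sum_{k=1}^{N}\exp(W_{jk}/\tau)}
```

with L_{t→i} defined symmetrically over the columns of W.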
Step three: the implementation step of automatic generation of medical report:
after the cross-modal features are derived, the Transformer's decoder can take the entire input sequence into account simultaneously when generating each word, so it captures context information well.
Further, the newly built cross-modal auxiliary network (CMLRAN) focuses on classifying fine-grained differences in X-ray images during encoding and on generating medical terminology during decoding.
Further, ResnetII in step one adds a maximum pooling layer and an average pooling layer on the basis of the Resnet-152 pre-trained network. These two network layers obtain the maximum and the average of the features at different scales; the maximum pooling helps improve the stability of feature extraction and reduces the influence on the model of errors such as geometric distortion and exposure in X-ray images.
Further, in step one, a new tokenizer, BPE, is adopted for processing the medical report, and a tree-shaped knowledge graph is added during tokenization, which enhances the weight distribution of the tokenizer and makes the dataset trainable.
Further, in step two, in order to prevent oversaturation of stored information, and thus gradient explosion during network model training, a Selective Forgetting Gate (SFG) is added. The formulas for the SFG and the cross-modal memory storage are as follows:
where W_f denotes the weight of the forget gate SFG, b_f denotes the bias, x^{(t)} denotes the hidden information, h^{(t-1)} denotes the time function at time t-1, and C^{(t-1)} denotes the cross-modal memory storage feature at the previous moment.
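The stored formulas are likewise images in the source. Assuming the SFG follows the standard LSTM forget-gate form that the legend implies (the candidate memory term is our assumption), a reconstruction is:

```latex
f^{(t)} = \sigma\!\left(W_f\,[h^{(t-1)};\,x^{(t)}] + b_f\right),\qquad
C^{(t)} = f^{(t)} \odot C^{(t-1)} + \big(1 - f^{(t)}\big) \odot \tilde{C}^{(t)}
```

where \tilde{C}^{(t)} is the candidate cross-modal memory at time t.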
Further, in step three, to restrict the language model to regional visual features, pseudo-self-attention is used to inject lesion-region features and associated disease-keyword features directly into the self-attention of the model; the related formula is as follows:
where X denotes the visual features of the lesion region, Y denotes the word embeddings, W_q, W_k and W_v denote the query, key and value projections, and U_k and U_v denote the key and value parameters of the initial hidden state obtained from the LSTM; text generation for the lesion area is realized by matrix operations.
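The formula is an image in the source. Pseudo-self-attention as introduced by Ziegler et al. (2019), which matches the projections named in the legend, reads:

```latex
\operatorname{PSA}(X, Y) = \operatorname{softmax}\!\left(\frac{(Y W_q)\,[X U_k;\, Y W_k]^{\top}}{\sqrt{d}}\right)[X U_v;\, Y W_v]
```

where d is the attention dimension and [·;·] stacks the visual and textual keys (or values) along the sequence axis.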
Compared with the prior art, the invention has the following beneficial effects:
(1) On the basis of transfer learning, multi-channel feature extraction is added and split into a MaxPool layer and an AvgPool layer, and an attention mechanism is added on this basis to enhance the extraction of global image features;
(2) A cross-modal auxiliary network is provided to bridge the semantic gap between chest X-ray images and the corresponding medical reports and to strengthen the connection between the two kinds of modal information, improving the matching precision between X-ray images and the corresponding medical reports.
Drawings
FIG. 1 is a schematic diagram of the implementation steps of the main network model framework according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a focus area feature extraction module according to an embodiment of the invention;
FIG. 3 is a cross-modal network layer model of an embodiment of the present invention;
FIG. 4 is a tree-like knowledge-graph of medical report according to an embodiment of the invention;
FIG. 5 is a lesion-area visualization using Grad-CAM according to an embodiment of the invention.
Detailed Description
The invention is further described below in connection with specific embodiments, in order to make the technical means, creative features, objectives and effects of the invention easy to understand.
As shown in FIG. 1, the method provided by the invention first uses a residual network and a visual attention model to extract image features, then combines CLIP and MSRM to locate the lesion area, and finally realizes automatic generation of the medical report through the Transformer's decoder and the LSTM gate-unit mechanism. The proposed cross-modal auxiliary network model CMLRAN consists of three modules: the lesion area feature extraction module, the lesion-area-based cross-modal auxiliary positioning module, and the medical report automatic generation module.
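A minimal PyTorch-style skeleton of how the three modules might compose; all class, argument and method names here are illustrative assumptions rather than the authors' code:

```python
import torch.nn as nn

class CMLRAN(nn.Module):
    """Sketch of the three-module pipeline described above (names are assumptions)."""
    def __init__(self, feature_extractor, cross_modal_locator, report_decoder):
        super().__init__()
        self.feature_extractor = feature_extractor  # ResnetII + visual attention
        self.locator = cross_modal_locator          # CLIP + MSRM lesion positioning
        self.decoder = report_decoder               # Transformer decoder + LSTM gate unit

    def forward(self, image, text_prompts):
        feats = self.feature_extractor(image)                     # image feature matrix
        region_feats, scores = self.locator(feats, text_prompts)  # lesion area + cross scores
        report_tokens = self.decoder(region_feats)                # autoregressive report
        return report_tokens, scores
```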
(1) Lesion area feature extraction, implemented as follows:
(1) performing contrast enhancement, image size conversion and image pixel block adjustment on an input image to obtain a preprocessed image;
(2) the preprocessed image is converted into an image feature matrix through a convolutional neural network (CNN), and all data of the matrix are flattened into one row to obtain the chest CT trainable image feature matrix C, which is then fed into ResnetII to extract feature information highly correlated with chest organs, yielding the X-ray image feature matrix C'. The residual network can learn both the chest organ features of the original image and the chest organ features after convolutional extraction, thereby avoiding gradient vanishing and gradient explosion during information propagation. The architecture of the lesion area feature extraction module is shown in FIG. 2.
In fig. 2, resnet-152 represents a 152-layer residual network, whose underlying modules consist of 12 different-dimensional convolutions (1 x 1 and 3 x 3 are convolution kernels, 64, 128, 256, 512, 1024, 2048 are the number of network layers),
To ensure that the encoder learns chest CT image features better, a dual-channel modular network architecture is added: a max-pooling (MaxPool) layer and an average-pooling (AvgPool) layer are inserted, and a self-attention mechanism is added to enhance the image features extracted by MaxPool and AvgPool.
The lesion area feature extraction network is trained using the dual-channel module feature extraction network ResnetII and an attention feature extraction network. During dual-channel feature extraction, a convolution operation is performed on the chest CT trainable image feature matrix C and its dimension is increased; the images are then fed through MaxPool and AvgPool into the dilated convolution layer of Resnet-152 to obtain the Resnet output (dilation rate 2, kernel size 7×7); the output then undergoes dual-channel residual operations through MaxPool and AvgPool and is summed with the original input information to obtain the feature matrix C' after the first pass of the Resnet network. Dual-channel feature extraction enhances the model's multi-scale extraction of chest CT image details while reducing the negative effects caused by using dilated convolution alone, such as loss of spatial hierarchy information of the original image and repeated extraction of unimportant information. The formula expression of the feature matrix C' after the first processing of the ResnetII network is as follows:
where σ denotes the Sigmoid function, Avg denotes average pooling, Max denotes maximum pooling, c denotes the feature matrix obtained at each step from the chest X-ray image, f^{7×7} denotes a convolution kernel of size 7×7, δ_p denotes the direct mapping of the network, and μ denotes the loss function of the residual network. ResnetII adds a maximum pooling layer and an average pooling layer on the basis of the Resnet-152 pre-trained network; these two network layers obtain the maximum and the average of the features at different scales. The maximum pooling helps improve the stability of feature extraction and reduces the influence on the model of errors such as geometric distortion and exposure in X-ray images. The average pooling layer converts the spatial information of the features into a more compact feature representation, improves the generalization ability of the model, and reduces the influence of background radiation, artifacts and the like in the X-ray image; the features C' extracted via these two network layers reflect the local features of the medical X-ray image. The original feature C and the processed feature C' are then fed into an attention-mechanism network for secondary feature processing; this network divides the chest X-ray image into a series of learnable patches to achieve global processing of the whole image. Finally, the attention-extracted features and the features C' extracted by the ResnetII network are combined through residual connections to obtain the complete chest X-ray image feature C''.
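A minimal PyTorch sketch of the dual-pooling attention gate described above, written as a CBAM-style spatial attention with a residual connection; the exact wiring is our reading of the text, not the authors' released code:

```python
import torch
import torch.nn as nn

class DualPoolAttention(nn.Module):
    """Spatial attention from channel-wise Avg/Max pooling, a 7x7 conv and a Sigmoid gate."""
    def __init__(self):
        super().__init__()
        # two input channels: the stacked average-pooled and max-pooled maps
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, c):                    # c: (B, C, H, W)
        avg = c.mean(dim=1, keepdim=True)    # Avg(c): (B, 1, H, W)
        mx, _ = c.max(dim=1, keepdim=True)   # Max(c): (B, 1, H, W)
        gate = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return c + gate * c                  # gated features plus the direct mapping
```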
The invention selects the BPE tokenizer to process the corresponding medical report. BPE is a data-driven method that splits text into a fixed number of subwords by repeatedly merging the most frequently occurring characters or character sequences.
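A short sketch of training such a BPE tokenizer on report text with the Hugging Face tokenizers library; the corpus file name, vocabulary size and special tokens are illustrative assumptions:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-pair encoding: start from characters, iteratively merge the most frequent pairs.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=8000,                  # assumed size
                              special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(["reports.txt"], trainer)                       # hypothetical corpus file

ids = tokenizer.encode("no acute cardiopulmonary abnormality").ids
```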
(2) Cross-modal auxiliary positioning, implemented as follows:
After step one is completed, matrix calculation is performed between the X-ray image features obtained by feature extraction and the introduced medical CLIP and MSRM; the lesion area with the highest probability is determined, and the lesion area and the corresponding medical report keywords are enhanced. As shown in FIG. 3, the medical CLIP first uses a pre-trained Resnet-152 as the image encoder and BPE as the encoder for the medical report. We fine-tune the original CLIP model with a contrastive loss on the IU X-RAY and MIMIC-CXR datasets, enhancing the similarity of matched image-text pairs and the dissimilarity of unmatched pairs, and the generated result is filtered through a pre-built tree-shaped medical-report knowledge graph (shown in FIG. 4). The graph makes a preliminary judgment using the lesion types and disease information that may appear in each part of a chest X-ray, after which the zero-shot classification mechanism of the medical CLIP aligns the medical graph with the corresponding lesion area. Finally, the chest X-ray lesion area corresponding to the disease information with the maximum similarity is selected based on image classification. The related formulas are as follows:
where the leading symbols denote the image and text of the core region, C_img and C_txt denote the preprocessed image and text, the cosine similarity W between image-based and text-based features is computed by matrix calculation, N denotes the total number of image-text pairs in a group, W' denotes the feature score obtained by normalizing W, and the cross probability scores L_{i→t} and L_{t→i} of the report corresponding to the finally output lesion area are computed. After the probability score L is obtained, it is fed into the MSRM for cross-modal memory storage: information from the two modalities is stored simultaneously, important information is retained, and useless information is deleted. Meanwhile, to prevent oversaturation of stored information, and thus gradient explosion during network model training, a Selective Forgetting Gate (SFG) is added with reference to the LSTM gate-unit mechanism. The formulas for the SFG and the cross-modal memory storage are as follows:
where W_f denotes the weight of the forget gate SFG, b_f denotes the bias, x^{(t)} denotes the hidden information, h^{(t-1)} denotes the time function at time t-1, and C^{(t-1)} denotes the cross-modal memory storage feature at the previous moment.
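A sketch of the zero-shot lesion scoring step under these assumptions; the temperature value and the embedding shapes are illustrative, and the fine-tuned medical CLIP weights are not reproduced here:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_lesion_scores(image_feat, text_feats, tau=0.07):
    """Cosine-similarity scores between one X-ray embedding and knowledge-graph prompts.

    image_feat: (D,) image embedding; text_feats: (K, D) prompt embeddings,
    one per candidate lesion/disease entry from the tree-shaped knowledge graph.
    """
    img = F.normalize(image_feat, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    w = txt @ img                        # cosine similarities W, shape (K,)
    return F.softmax(w / tau, dim=-1)    # normalized scores W'

# usage: probs = zero_shot_lesion_scores(img_emb, prompt_embs); best = probs.argmax()
```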
(3) Automatic generation of the medical report:
After the cross-modal features are derived, the Transformer's decoder can take the entire input sequence into account simultaneously when generating each word, so it captures context information well. Specifically, it employs a GPT-2-style network of the kind used on PubMed text, in which each token in the sequence is generated conditioned on the previous tokens; each generated word is taken as the input to the next step, and this process repeats until the complete medical report is generated. However, at decoding we add the cross-modal memory feature C^{(t)}; if only a conventional decoder were adopted, problems such as repeated memorization of erroneous information and slow network convergence would arise. The invention therefore proposes a forgetting gate SFG based on the bidirectional LSTM and combines the SFG with an attention mechanism; by introducing the forget gate and the update gate to control the flow of cross-modal information, the context information of the medical report and the cross-modal information of the X-ray image are better captured. To restrict the language model to regional visual features, we use pseudo-self-attention to inject lesion-region features and associated disease-keyword features directly into the self-attention of the model; the related formula is as follows:
where X denotes the visual features of the lesion region, Y denotes the word embeddings, W_q, W_k and W_v denote the query, key and value projections, and U_k and U_v denote the key and value parameters of the initial hidden state obtained from the LSTM; text generation for the lesion area is realized by matrix operations.
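A compact PyTorch sketch of pseudo-self-attention as we reconstruct it (following Ziegler et al., 2019); dimensions, naming and the single-head simplification are assumptions:

```python
import math
import torch
import torch.nn as nn

class PseudoSelfAttention(nn.Module):
    """Inject visual features X into the self-attention over text embeddings Y."""
    def __init__(self, d_model, d_visual):
        super().__init__()
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.uk = nn.Linear(d_visual, d_model)  # projects X into the key space
        self.uv = nn.Linear(d_visual, d_model)  # projects X into the value space

    def forward(self, x, y):                    # x: (B, Nx, Dv), y: (B, Ny, D)
        q = self.wq(y)
        k = torch.cat([self.uk(x), self.wk(y)], dim=1)  # [X Uk ; Y Wk]
        v = torch.cat([self.uv(x), self.wv(y)], dim=1)  # [X Uv ; Y Wv]
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(q.size(-1)), dim=-1)
        return attn @ v                          # text tokens attend over visual + text
```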
Gradient-weighted Class Activation Mapping (Grad-CAM) can display the regions the model focuses on during training, and is used here to determine whether the model accurately associates the lesion regions with the corresponding tree-shaped knowledge graph. As shown in FIGS. 5(a)-(f), M1-M5 denote MRARGN and its variants under optimal conditions for Grad-CAM display of the heart, pleura, bone, mediastinum, lung and free regions of the MIMIC-CXR dataset. We can observe that M5 accurately identifies the boundary and shape information of most lesion areas. Although M1's treatment of the bone region and M2's treatment of the pleural and mediastinal regions are relatively reasonable, they produce many erroneous judgments or repeated associations of erroneous information at other locations. In the lung lesion treatment of FIG. 5(e), M1 mistakenly points to the right upper lobe, and M2 perceives the free area as part of the lung lesion area. The lesion areas extracted by M3 contain a large number of irrelevant extractions. M4 performs better than M1-M3 in the integrity of the generated report, but when predicting a lesion area it erroneously interprets a pacemaker or external device as part of the lesion. M5 eliminates this type of error by introducing the SFG, which also deletes the irrelevant areas that M4 over-extracts.
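A minimal Grad-CAM sketch in PyTorch, of the kind one might use to reproduce such visualizations; the choice of target layer and classifier head are assumptions:

```python
import torch

def grad_cam(model, image, target_layer, class_idx):
    """Gradient-weighted class activation map for one image (batch size 1)."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    score = model(image)[0, class_idx]          # logit of the class of interest
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # global-average-pooled gradients
    cam = torch.relu((weights * acts["a"]).sum(dim=1))   # weighted sum over channels
    return cam / (cam.max() + 1e-8)                      # normalized heat map (1, H, W)
```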
The embodiment of the invention is implemented in PyTorch and trained on a workstation with 64 GB RAM and an NVIDIA GeForce RTX 4090 GPU, using a pre-trained Resnet-152 as the common feature extraction encoder for the image-processing ResnetII and CLIP; all images are scaled to 224×224 and yield encoder feature maps of size 7×7×512.
As shown in Table 1, during the training, testing and validation stages we used the splits proposed by the original dataset authors, i.e., the dataset was divided in the ratio training : testing : validation = 7 : 2 : 1. The extracted visual features are fed to CLIP to generate more than 320 tags, and for each tag a vector containing 512 word embeddings is generated; where the main part of a sentence does not reach 512 words, the remainder is padded with <text>, an omitted part identifiable during network training. The highest-probability feature is taken as the semantic feature generated by the model; the lesion area is then connected to network training separately through the MSRM, building a more complete auxiliary network without affecting network convergence. The Transformer's decoder serves as the decoding layer of the model: all hidden layers and word embedding dimensions are first set to 512, and these hidden layers directly call the hidden states extracted through the LSTM gate unit. Parameters are learned with an AdamW optimizer with a batch size of 4. The total loss is defined as L = λ_MSRM·L_MSRM + λ_cmn·L_cmn + λ_cross·L_cross + λ_language·L_language, where L_MSRM is the lesion-area image loss, L_cmn and L_cross are the binary cross-entropy losses of the two binary classifiers that handle cross-modal information, and L_language is the cross-entropy loss of the language model. According to performance on the validation set, the loss weights are set to λ_MSRM = 2.0, λ_cmn = 3.0, λ_cross = 3.0 and λ_language = 1.0.
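A sketch of combining these weighted losses in PyTorch during training; the concrete loss module for L_MSRM is a stand-in, since the text does not specify its exact form:

```python
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # for the two cross-modal binary classifiers
ce = nn.CrossEntropyLoss()     # for the language model

def total_loss(msrm_loss, cmn_logits, cmn_y, cross_logits, cross_y, lm_logits, lm_y,
               w=(2.0, 3.0, 3.0, 1.0)):
    """L = w0*L_MSRM + w1*L_cmn + w2*L_cross + w3*L_language, weights from the text."""
    return (w[0] * msrm_loss
            + w[1] * bce(cmn_logits, cmn_y)
            + w[2] * bce(cross_logits, cross_y)
            + w[3] * ce(lm_logits, lm_y))
```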
As shown in Table 1, we compared MRARGN on the IU X-RAY and MIMIC-CXR datasets with the most recent chest report generation models: PPKED, which combines prior and posterior knowledge to assist report generation; M2 Tr., which generates individual descriptive words in the context of X-ray images and then converts them into coherent text using a Transformer architecture; R2Gen, which generates radiology reports via a memory-driven Transformer; CMN, which enhances the encoder-decoder framework with a self-attention mechanism to facilitate cross-modal interaction and generation; CvT2DistilGPT2, realized with CvT and DistilGPT2, which performs highly structured report generation by combining knowledge distillation and computer-vision warm starting, with a task distillation module for structure-level description, a task-aware report generation module for the describable level, and an abnormality classification and tagging module; VisualGPT, which uses a self-resurrecting encoder-decoder attention mechanism to help the model generate highly interpretable text; and RGRG, a simple and efficient region-guided report generation model that detects anatomical regions and then describes individual salient regions to form the final report. The CvT2DistilGPT2 model reports experimental results only on the MIMIC-CXR dataset (no results are available on IU X-RAY). The MRARGN model of the invention outperforms most models in cross-modal feature processing and automatic generation of medical reports, successfully generating highly refined descriptions of lesion areas and case conditions.
Table 1: comparison graph of evaluation index results of network models
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. This manner of description is adopted merely for clarity; those skilled in the art should take the specification as a whole, and the technical solutions in the embodiments may be appropriately combined to form other implementations that will be apparent to those skilled in the art.
Claims (6)
1. A chest X-ray image report generation method based on a cross-modal network, characterized by comprising: creating a cross-modal auxiliary network (CMLRAN); introducing an attention mechanism to process image and text information respectively; and, on the basis of a Memory Storage Response Matrix (MSRM), enhancing the information association of image and text in combination with the CLIP proposed by OpenAI; the specific steps are as follows:
Step one: lesion area feature extraction, implemented as follows:
(1) performing contrast enhancement, image size conversion and image pixel block adjustment on an input image to obtain a preprocessed image;
(2) converting the preprocessed image into an image feature matrix through a convolutional neural network (CNN) and flattening all data of the matrix into one column to obtain the chest CT trainable image feature matrix C, which is then fed into ResnetII to extract feature information highly correlated with chest organs, yielding the X-ray image feature matrix C'; the residual network can learn both the chest organ features of the original image and the chest organ features after convolutional extraction, thereby avoiding gradient vanishing and gradient explosion during information propagation,
the formula expression of the feature matrix C' after the first processing of the ResnetII network is as follows:
where σ denotes the Sigmoid function, Avg denotes average pooling, Max denotes maximum pooling, c denotes the feature matrix obtained at each step from the chest X-ray image, f^{7×7} denotes a convolution kernel of size 7×7, δ_p denotes the direct mapping of the network, and μ denotes the loss function of the residual network;
Step two: cross-modal auxiliary positioning, implemented as follows:
after step one is completed, matrix calculation is performed between the X-ray image features obtained by feature extraction and the introduced medical CLIP and MSRM, the lesion area with the highest probability is determined, and the lesion area is enhanced; the related formulas are as follows:
where the leading symbols denote the image and text of the core region, C_img and C_txt denote the preprocessed image and text, the cosine similarity W between image-based and text-based features is computed by matrix calculation, N denotes the total number of image-text pairs in a group, W' denotes the feature score obtained by normalizing W, and the cross probability scores L_{i→t} and L_{t→i} of the report corresponding to the finally output lesion area are computed;
Step three: the implementation step of automatic generation of medical report:
after the cross-modal features are derived, the Transformer's decoder can take the entire input sequence into account simultaneously when generating each word, so it captures context information well.
2. The cross-modal network-based chest X-ray image report generation method according to claim 1, wherein the newly built cross-modal auxiliary network (CMLRAN) focuses on classifying fine-grained differences in X-ray images during encoding and on generating medical terminology during decoding.
3. The cross-modal network-based chest X-ray image report generation method according to claim 1, wherein ResnetII in step one adds a maximum pooling layer and an average pooling layer on the basis of the Resnet-152 pre-trained network; these two network layers obtain the maximum and the average of the features at different scales, and the maximum pooling helps improve the stability of feature extraction and reduce the influence on the model of errors such as geometric distortion and exposure in X-ray images.
4. The cross-modal network-based chest X-ray image report generation method according to claim 1, wherein in step one a new tokenizer, BPE, is adopted for processing the medical report, and a tree-shaped knowledge graph is added during tokenization, enhancing the weight distribution of the tokenizer and making the dataset trainable.
5. The cross-modal network-based chest X-ray image report generation method according to claim 1, wherein in step two, in order to prevent oversaturation of stored information, and thus gradient explosion during network model training, a Selective Forgetting Gate (SFG) is added; the formulas for the SFG and the cross-modal memory storage are as follows:
where W_f denotes the weight of the forget gate SFG, b_f denotes the bias, x^{(t)} denotes the hidden information, h^{(t-1)} denotes the time function at time t-1, and C^{(t-1)} denotes the cross-modal memory storage feature at the previous moment.
6. The method of claim 1, wherein in step three, to restrict the language model to regional visual features, pseudo-self-attention is used to inject lesion-region features and associated disease-keyword features directly into the self-attention of the model; the related formula is as follows:
where X denotes the visual features of the lesion region, Y denotes the word embeddings, W_q, W_k and W_v denote the query, key and value projections, and U_k and U_v denote the key and value parameters of the initial hidden state obtained from the LSTM; text generation for the lesion area is realized by matrix operations.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202311271188.9A CN117558394B (en) | 2023-09-28 | 2023-09-28 | Cross-modal network-based chest X-ray image report generation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202311271188.9A CN117558394B (en) | 2023-09-28 | 2023-09-28 | Cross-modal network-based chest X-ray image report generation method |
Publications (2)
Publication Number | Publication Date |
---|---
CN117558394A true CN117558394A (en) | 2024-02-13 |
CN117558394B CN117558394B (en) | 2024-06-25 |
Family
ID=89815449
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---
CN202311271188.9A Active CN117558394B (en) | 2023-09-28 | 2023-09-28 | Cross-modal network-based chest X-ray image report generation method |
Country Status (1)
Country | Link |
---|---
CN (1) | CN117558394B (en)
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020215557A1 (en) * | 2019-04-24 | 2020-10-29 | 平安科技(深圳)有限公司 | Medical image interpretation method and apparatus, computer device and storage medium |
CN115132313A (en) * | 2021-12-07 | 2022-09-30 | 北京工商大学 | Automatic generation method of medical image report based on attention mechanism |
CN115171838A (en) * | 2022-08-24 | 2022-10-11 | 中南大学 | Training method of medical report generation model based on cross-modal fusion |
CN116503515A (en) * | 2023-04-26 | 2023-07-28 | 北京理工大学 | Brain lesion image generation method and system based on text and image multi-mode |
CN116779091A (en) * | 2023-06-15 | 2023-09-19 | 兰州交通大学 | Automatic generation method of multi-mode network interconnection and fusion chest image diagnosis report |
Non-Patent Citations (2)
Title |
---
CHEN Z, SHEN Y, SONG Y, et al.: "Cross-modal Memory Networks for Radiology Report Generation", DOI: 10.48550/ARXIV.2204.13258, 31 December 2022, pages 1-11 *
ZHANG Jiacheng, OU Weihua, CHEN Yingjie, et al.: "Dual-tower cross-modal retrieval for chest X-ray images and diagnostic reports", Application Research of Computers, vol. 40, no. 8, 31 August 2023, pages 2543-2548 *
Also Published As
Publication number | Publication date |
---|---
CN117558394B (en) | 2024-06-25 |
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant