CN117558394B - Cross-modal network-based chest X-ray image report generation method - Google Patents
- Publication number: CN117558394B
- Application number: CN202311271188.9A
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G16H15/00 — ICT specially adapted for medical reports, e.g. generation or transmission thereof
- G06F16/367 — Creation of semantic tools; ontology
- G06F18/22 — Pattern recognition; matching criteria, e.g. proximity measures
- G06F40/279 — Natural language analysis; recognition of textual entities
- G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
- G06N3/0442 — Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N3/045 — Combinations of networks
- G06N3/0455 — Auto-encoder networks; encoder-decoder networks
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/048 — Activation functions
- G06N3/096 — Transfer learning
- G06V10/20 — Image preprocessing
- G06V10/25 — Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V10/32 — Normalisation of the pattern dimensions
- G06V10/454 — Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/764 — Recognition using pattern recognition or machine learning using classification, e.g. of video objects
- Y02A90/10 — Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a chest X-ray image report generation method based on a cross-modal network, in the technical field of image reporting. It proposes a cross-modal auxiliary network (CMLRAN) that introduces an attention mechanism to process image and text information separately and, on the basis of a Memory Storage Response Matrix (MSRM), strengthens the association between image and text by combining it with the CLIP model proposed by OpenAI. During encoding the network focuses on classifying fine-grained differences in X-ray images; during decoding it focuses on generating medical terminology. The method better bridges the semantic gap and generates chest X-ray image reports intelligently.
Description
Technical Field
The invention relates to the technical field of image reporting, and in particular to a chest X-ray image report generation method based on a cross-modal network.
Background
Chest X-ray is an advanced medical imaging technique that, through high-resolution, multi-angle imaging, accurately displays lung lesions (such as pneumonia, tuberculosis, and lung cancer), mediastinal lesions (such as mediastinal tumors and mediastinal emphysema), pleural lesions (such as pleural effusion and pleurisy), and cardiovascular lesions. The chest X-ray diagnostic report is the professional interpretation and summary of the examination result; it generally comprises imaging findings, diagnostic opinion, and recommendations, and provides the basis on which doctors formulate diagnosis and treatment plans.
In recent years, the need for medical X-ray "imaging"-guided "treatment" has grown, and related research has received wide attention. Among existing approaches, generating long text with hierarchical long short-term memory networks (LSTM) shows certain advantages. However, research in this area still faces many challenges: medical image features are complex, cross-modal features are difficult to extract, and medical reports contain a large number of specialized terms. At present, LSTM alone cannot automatically generate multi-organ imaging reports. For this reason, some scholars have proposed deep-learning-based methods for the automatic generation of medical image reports, which can be divided by processing object into image-processing methods and natural-language-processing methods. Taking the image as the entry point, Tanida proposed RGRG, a generation model guided by lesion regions, which first segments a specific lesion region and then composes the final report from it; Li proposed DCL, a knowledge-graph-assisted generation network with dynamic structure and nodes, which takes each image as a starting point to extract image-versus-text generation features and finally adds those features to each output node. Taking natural language processing as the entry point, Chen proposed VisualGPT, an image-captioning network that exploits the linguistic knowledge of a large pre-trained language model (PLM) and can learn effectively from a small amount of multimodal data; Kaur proposed CADxReport, a CNN-RNN-based network model that uses reinforcement learning together with visual and semantic attention mechanisms to generate medical reports automatically.
These deep-learning-based methods for automatic medical report generation have shortcomings. With image processing as the entry point, the model struggles to fully comprehend the complex information in the images, and the generated reports lack flexibility of language expression. With natural language processing as the entry point, report generation relies on predefined templates, which likewise lack flexibility and adapt poorly to different application scenarios. In view of this, the present invention proposes a cross-modal auxiliary network (CMLRAN), which introduces an attention mechanism to process image and text information separately and, on the basis of a Memory Storage Response Matrix (MSRM), strengthens the association between image and text by combining it with the CLIP model proposed by OpenAI. During encoding the network focuses on classifying fine-grained differences in X-ray images; during decoding it focuses on generating medical terminology. The method better bridges the semantic gap and generates chest X-ray image reports intelligently.
Disclosure of Invention
The present invention has been made to solve the above problems; its object is to provide a method for generating a chest X-ray image report based on a cross-modal network.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows: a chest X-ray image report generation method based on a cross-modal network, comprising creating a cross-modal auxiliary network (CMLRAN), introducing an attention mechanism to process image and text information separately, and, on the basis of a Memory Storage Response Matrix (MSRM), enhancing the association between image and text in combination with the CLIP model proposed by OpenAI. The specific steps are as follows:
Step one: lesion-region feature extraction is implemented as follows:
① Performing contrast enhancement, image size conversion and image pixel block adjustment on an input image to obtain a preprocessed image;
② The preprocessed image is converted into an image feature matrix by a convolutional neural network (CNN), and all matrix data are flattened into one row to obtain the trainable chest CT image feature matrix C. The matrix C is then fed into ResnetII to extract feature information highly relevant to the chest organs, yielding the X-ray image feature matrix C'. The residual network can learn the chest-organ features of both the original image and the convolution-extracted features, avoiding gradient vanishing and gradient explosion during information propagation.
The feature matrix C' after the first pass of the ResnetII network is expressed by the following formula:
where σ denotes the Sigmoid function, Avg average pooling, Max maximum pooling, c the feature matrix obtained at each step from the chest X-ray image, f^{7×7} a convolution with a 7×7 kernel, δ_p the direct mapping of the network, and μ the loss function of the residual network;
Step two: cross-modal auxiliary positioning is implemented as follows:
After step one is completed, the X-ray image feature matrix C' obtained by feature extraction enters a matrix calculation with the introduced medical CLIP and the MSRM; the lesion region with the highest probability is determined, and its contrast or sharpness is enhanced. The related formulas are as follows:
The image and text of the core region, together with the preprocessed image and text C_img and C_txt, enter the matrix calculation, which yields the cosine similarity W of the text features computed from the image features and of the image features computed from the text features; N denotes the total number of image-text pairs, W' the feature score obtained by normalizing W, and the calculation finally outputs the cross probability scores L_{i→t} and L_{t→i} of the report corresponding to the lesion region;
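The cross-modal scoring just described can be sketched as a CLIP-style contrastive computation. The function below is a minimal illustration only: the function name, the temperature value (CLIP's common initialization), and the plain-NumPy formulation are assumptions, not the patent's implementation.

```python
import numpy as np

def clip_scores(img_feats, txt_feats, temperature=0.07):
    """Sketch of CLIP-style matching: cosine similarity W between N
    image/text feature pairs, softmax-normalised scores W', and the two
    cross scores L_{i->t}, L_{t->i} (average negative log-likelihood of
    the matching pair in each direction). Temperature is an assumption."""
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    W = img @ txt.T                      # cosine similarity, shape (N, N)
    logits = W / temperature

    def softmax(z, axis):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    W_img2txt = softmax(logits, axis=1)  # normalised scores W' (image -> report)
    W_txt2img = softmax(logits, axis=0)  # normalised scores W' (report -> image)
    N = len(img)
    L_i2t = -np.log(np.diag(W_img2txt)).sum() / N
    L_t2i = -np.log(np.diag(W_txt2img)).sum() / N
    return W, L_i2t, L_t2i
```

With well-aligned pairs (matching image and report features pointing the same way), both cross scores approach zero, which is the property the fine-tuning in the description exploits.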
Step three: automatic generation of the medical report is implemented as follows:
After the cross-modal features are obtained, the Transformer decoder can attend to the entire input sequence when generating each word, so it captures context information well.
Specifically, a GPT-2 network similar to that proposed by PubMed et al. is adopted, in which each token in the sequence is generated conditioned on the preceding tokens; each generated word serves as the input of the next step, and the process repeats until a complete medical report has been generated. A selective forgetting gate (SFG) based on a bidirectional LSTM is proposed and combined with the attention mechanism; by introducing the forgetting gate and an update gate, the flow of cross-modal information is controlled, so that the context of the medical report and the cross-modal information of the X-ray image are better captured. To restrict the language model to regional visual features, pseudo self-attention is used to inject the lesion-region features and the associated disease-keyword features directly into the model's self-attention; the related formulas are as follows:
where X denotes the visual features of the lesion region, Y the word embeddings, W_q, W_k, W_v the query, key, and value projections, and U_k and U_v the key and value parameters of the initial hidden state obtained through the LSTM; the matrix operations realize text generation for the lesion region.
Further, during encoding the newly built cross-modal auxiliary network (CMLRAN) focuses on classifying fine-grained differences in X-ray images; during decoding it focuses on generating medical terminology.
Furthermore, the ResnetII network of step one adds a maximum-pooling layer and an average-pooling layer on top of the pre-trained Resnet-152 network. These two layers capture the maximum and average of the features at different scales; maximum pooling improves the stability of feature extraction and reduces the influence on the model of errors such as geometric distortion and exposure in X-ray images.
Further, a BPE word segmenter is adopted to process the medical report, and a tree-shaped knowledge graph is added during segmentation, strengthening the segmenter's weight distribution and making the dataset trainable.
Further, in step two, to prevent oversaturation of the stored information and the resulting gradient explosion during network model training, a selective forgetting gate (SFG) is added. The SFG and the cross-modal memory storage are given by the following formulas:
where W_f denotes the forgetting-gate (SFG) weights, b_f the bias, x(t) the hidden input, h(t-1) the hidden state at time t-1, and C(t-1) the cross-modal memory feature stored at the previous time step.
Compared with the prior art, the invention has the following beneficial effects:
(1) On the basis of transfer learning, multi-channel feature extraction is added, split into MaxPool and AvgPool layers, with an attention mechanism added on top to strengthen the extraction of global image features;
(2) A cross-modal auxiliary network is provided to bridge the semantic gap between chest X-ray images and the corresponding medical reports, strengthen the connection between the two modalities, and improve the matching precision between X-ray images and their reports.
Drawings
FIG. 1 is a schematic diagram of the implementation steps of the overall network model framework according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the lesion-region feature extraction module according to an embodiment of the invention;
FIG. 3 is the cross-modal network-layer model of an embodiment of the present invention;
FIG. 4 is the tree-shaped medical-report knowledge graph of an embodiment of the invention;
FIG. 5 is a Grad-CAM visualization of the lesion region according to an embodiment of the invention.
Detailed Description
The invention is further described below in connection with the detailed description, so that the technical means, creative features, objects, and effects of the invention are easy to understand.
As shown in FIG. 1, the proposed method first extracts image features with a residual network and a visual attention model, then locates the lesion region by combining CLIP and the MSRM, and finally generates the medical report automatically through the Transformer decoder and an LSTM gate-unit mechanism. The proposed cross-modal auxiliary network model CMLRAN consists of three modules: the lesion-region feature extraction module, the lesion-region-based cross-modal auxiliary positioning module, and the automatic medical-report generation module.
(1) Lesion-region feature extraction is implemented as follows:
① Performing contrast enhancement, image size conversion and image pixel block adjustment on an input image to obtain a preprocessed image;
② The preprocessed image is converted into an image feature matrix by a convolutional neural network (CNN), and all matrix data are flattened into one row to obtain the trainable chest CT image feature matrix C, which is then fed into ResnetII to extract feature information highly relevant to the chest organs, yielding the X-ray image feature matrix C'. The residual network can learn the chest-organ features of both the original image and the convolution-extracted features, avoiding gradient vanishing and gradient explosion during information propagation. The architecture of the lesion-region feature extraction module is shown in FIG. 2.
In FIG. 2, Resnet-152 denotes a 152-layer residual network whose basic modules consist of 12 convolutions of different dimensions (1×1 and 3×3 are the convolution kernel sizes; 64, 128, 256, 512, 1024, and 2048 are the channel counts).
To ensure that the encoder learns chest CT image features better, a two-channel modular network architecture is added: a max-pooling (MaxPool) layer and an average-pooling (AvgPool) layer are inserted, together with a self-attention mechanism, strengthening the image features extracted by MaxPool and AvgPool.
Training of the lesion-region feature extraction network uses the two-channel feature extraction network ResnetII and an attention feature extraction network. During two-channel feature extraction, a convolution is applied to the trainable chest CT image feature matrix C and its dimension is raised; the images are then fed through MaxPool and AvgPool into the dilated convolution layer of Resnet-152 (dilation rate 2, kernel size 7×7) to obtain the Resnet network outputs; these outputs undergo two-channel residual operations through MaxPool and AvgPool, and the result is summed with the image feature matrix produced from the preprocessed image by the convolutional neural network (CNN), yielding the feature matrix C' after the first pass of the Resnet network. Two-channel feature extraction strengthens the model's multi-scale extraction of chest CT image details while reducing the negative effects of using dilated convolution alone, such as loss of spatial-hierarchy information from the original image and repeated extraction of unimportant information. The feature matrix C' after the first pass of the ResnetII network is expressed by the following formula:
where σ denotes the Sigmoid function, Avg average pooling, Max maximum pooling, c the feature matrix obtained at each step from the chest X-ray image, f^{7×7} a convolution with a 7×7 kernel, δ_p the direct mapping of the network, and μ the loss function of the residual network. ResnetII adds a maximum-pooling layer and an average-pooling layer on top of the pre-trained Resnet-152 network; these two layers capture the maximum and average of the features at different scales. Maximum pooling improves the stability of feature extraction and reduces the influence on the model of errors such as geometric distortion and exposure in X-ray images. The average-pooling layer converts the spatial information of the features into a more compact representation, improves the generalization of the model, and reduces the influence of background radiation and artifacts in the X-ray image; the features C' extracted by these two layers reflect the local features of the medical X-ray image. The original feature C and the processed feature C' are then input into the attention network for secondary processing: the network divides the chest X-ray image into a series of learnable blocks, enabling global processing of the whole image, after which the extracted features and the features extracted by the ResnetII network are combined through residual connections to obtain the complete chest X-ray image feature.
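The dual-pooling step above (channel-wise Avg and Max, a 7×7 convolution, a Sigmoid, and a direct mapping) resembles CBAM-style spatial attention. The sketch below illustrates that reading under stated assumptions: the box filter standing in for the learned f_{7×7}, and the additive residual standing in for δ_p, are illustrative choices, not the patent's exact formula.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(c, kernel=7):
    """CBAM-like spatial attention over a feature map c of shape (C, H, W):
    channel-wise average and max pooling, a kernel x kernel 'convolution'
    (a simple box filter here), a Sigmoid gate, and a residual add."""
    avg = c.mean(axis=0)            # Avg: (H, W)
    mx = c.max(axis=0)              # Max: (H, W)
    stacked = np.stack([avg, mx])   # 2-channel map fed to the conv
    pad = kernel // 2
    padded = np.pad(stacked, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    H, W = avg.shape
    conv = np.empty((H, W))
    for i in range(H):
        for j in range(W):          # box filter standing in for learned f_{7x7}
            conv[i, j] = padded[:, i:i + kernel, j:j + kernel].mean()
    attn = sigmoid(conv)            # sigma(f([Avg(c); Max(c)])), values in (0, 1)
    return c * attn + c             # gated features plus direct mapping
```

The output keeps the input shape, so the block can be dropped between existing convolutional stages without changing downstream dimensions.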
The application selects the BPE word segmenter, a data-driven method that splits text into a fixed-size vocabulary of sub-words by repeatedly merging the most frequently occurring characters or character sequences.
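The merge procedure BPE relies on can be sketched as follows. This is a generic textbook BPE trainer, not the patent's segmenter; the tree-shaped knowledge-graph weighting mentioned elsewhere in the document is omitted.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Minimal byte-pair-encoding sketch: repeatedly merge the most frequent
    adjacent symbol pair, so frequent terms end up as single sub-word units.
    `words` maps a word to its corpus count; '</w>' marks the word end."""
    vocab = {tuple(w) + ("</w>",): n for w, n in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym, n in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += n
        if not pairs:
            break
        best = max(pairs, key=pairs.get)     # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for sym, n in vocab.items():
            out, i = [], 0
            while i < len(sym):
                if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1]); i += 2
                else:
                    out.append(sym[i]); i += 1
            merged[tuple(out)] = n
        vocab = merged
    return merges, vocab
```

Applied to a radiology corpus, frequent terms such as "effusion" would be merged into single tokens after enough rounds, which is the behaviour the description relies on for medical vocabulary.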
(2) Cross-modal auxiliary positioning is implemented as follows:
After step one is completed, the X-ray image features obtained by feature extraction enter a matrix calculation with the introduced medical CLIP and the MSRM; the lesion region with the highest probability is determined, and its relation to the corresponding medical-report keywords is strengthened. As shown in FIG. 3, the medical CLIP uses a pre-trained Resnet-152 as the image encoder and BPE as the encoder of the medical report. The application fine-tunes the original CLIP model with a contrastive loss on the IU X-RAY and MIMIC-CXR datasets, increasing the similarity of matching image-text pairs and the dissimilarity of non-matching pairs; the generated result is then checked against a pre-built tree-shaped medical-report knowledge graph (shown in FIG. 4). The graph makes a preliminary judgment from the lesion types and disease information that may appear in each part of the chest X-ray, after which the zero-shot classification mechanism of the medical CLIP aligns the medical graph with the corresponding lesion region. Finally, the chest X-ray lesion region corresponding to the disease information with the greatest similarity is selected by image classification. The related formulas are as follows:
The first two symbols in the formula represent the images and text of the core region, and C img and C txt represent the preprocessed images and text. Through matrix calculation, the cosine similarity W between text features computed from the image features and image features computed from the text features is obtained; N represents the total number of image-text pairs, W' represents the feature scores obtained by normalizing W, and the cross probability scores L i→t and L t→i of the report corresponding to the output lesion region are obtained from the calculation. After the probability score L is obtained, it is fed into the MSRM for cross-modal memory storage. The main operation stores the information of the two modalities simultaneously, retaining important information and discarding useless information. To prevent oversaturation of the stored information, and the gradient explosion it can cause during network-model training, a Selective Forgetting Gate (SFG) is added by analogy with the gate-unit mechanism of the LSTM. The SFG and cross-modal memory storage formulas are as follows:
Wherein W f represents the weights of the forgetting gate SFG, b f represents the bias, x (t) represents the input information at time t, h (t-1) represents the hidden state at time t-1, and C (t-1) represents the cross-modal memory storage feature at the previous time step.
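The computation defined by the two formula blocks above can be sketched end to end: cosine similarities between normalized image and text features give the cross probability scores L i→t and L t→i, and a selective forgetting gate then attenuates the stored cross-modal memory. This is a hedged NumPy illustration with random stand-in features and a single-layer gate; the temperature tau, the dimensions, and the way scores are fed to the gate are assumptions, not the patent's exact wiring:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def clip_scores(img_feats, txt_feats, tau=0.07):
    """Cosine-similarity matrix W and cross probability scores for N matched pairs."""
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    W = img @ txt.T / tau                  # (N, N) scaled cosine similarities
    return W, softmax(W, axis=1), softmax(W, axis=0)   # W, L_i->t, L_t->i

def sfg_step(x_t, h_prev, C_prev, W_f, b_f):
    """Selective Forgetting Gate, LSTM-style:
    f_t = sigma(W_f . [h_{t-1}; x_t] + b_f);  C_t = f_t * C_{t-1}."""
    f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
    return f_t, f_t * C_prev               # attenuated cross-modal memory

rng = np.random.default_rng(1)
N, d = 4, 16
W, L_i2t, L_t2i = clip_scores(rng.standard_normal((N, d)),
                              rng.standard_normal((N, d)))
# feed one row of probability scores into the memory as the new information
x_t = L_i2t[0]
h_prev = rng.standard_normal(N)
C_prev = rng.standard_normal(N)
W_f = rng.standard_normal((N, 2 * N)) * 0.1
f_t, C_t = sfg_step(x_t, h_prev, C_prev, W_f, np.zeros(N))
```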
(3) Automatic generation of the medical report is implemented as follows:
After deriving the cross-modal features, the Transformer's decoder can take the entire input sequence into account simultaneously when generating each word, so it captures context information well. Specifically, it employs a GPT-2 network similar to that proposed by PubMed et al., in which the tokens of the sequence are generated conditioned on the previous tokens; each generated word is taken as input to the next step, and the process is repeated until a complete medical report is produced. During decoding, however, the application adds the cross-modal memory feature C (t) extracted by the MSRM: using only a conventional decoder could cause problems such as repeated memorization of erroneous information and slow network convergence. The application therefore proposes a forgetting gate SFG based on the bidirectional LSTM, combines the SFG with an attention mechanism, and controls the flow of cross-modal information through forgetting and update gates, so as to better capture the context information in the medical report and the cross-modal information of the X-ray image. To restrict the language model to the regional visual features, the application uses pseudo self-attention to inject lesion-region features and associated disease-keyword features directly into the model's self-attention. The associated formulas are as follows:
Wherein X represents the visual features of the lesion region, Y represents the word embeddings, W q represents the query matrix, W k the key matrix, and W v the value matrix; U k and U v represent the key and value parameters of the initial hidden state obtained through the LSTM. Generation of the lesion-region text is realized through these matrix operations.
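A minimal sketch of the pseudo self-attention just described: the visual features X are projected with the parameters U k and U v and prepended to the text's own keys and values, so the word embeddings Y attend jointly over both modalities. Shapes and initialization are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def pseudo_self_attention(X, Y, Wq, Wk, Wv, Uk, Uv):
    """Pseudo self-attention: queries come from the word embeddings Y, while the
    keys/values concatenate projected visual features X with the text's own."""
    Q = Y @ Wq                                    # (Ty, d) text queries
    K = np.concatenate([X @ Uk, Y @ Wk], axis=0)  # (Tx+Ty, d) keys: visual then text
    V = np.concatenate([X @ Uv, Y @ Wv], axis=0)  # (Tx+Ty, d) values
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # attention over both modalities
    return A @ V                                  # (Ty, d) fused representation

rng = np.random.default_rng(3)
d = 16
X = rng.standard_normal((5, d))   # lesion-region visual features (stand-in)
Y = rng.standard_normal((7, d))   # report word embeddings (stand-in)
Wq, Wk, Wv, Uk, Uv = (rng.standard_normal((d, d)) * 0.1 for _ in range(5))
out = pseudo_self_attention(X, Y, Wq, Wk, Wv, Uk, Uv)
```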
The regions the model focuses on during training can be displayed with Gradient-weighted Class Activation Mapping (Grad-CAM), which is used to determine whether the model accurately associates the lesion region with the corresponding tree-structured knowledge graph. As shown in FIGS. 5 (a)-(f), M1-M5 represent MRARGN and its variants, with Grad-CAM displays of lesion areas of the heart, pleura, bone, mediastinum, lung, and free regions of the MIMIC-CXR dataset in the optimal state. It can be observed that M5 accurately identifies the boundary and shape information of most lesion regions. Although the treatment of the skeletal region by M1 and of the pleural and mediastinal regions by M2 is relatively reasonable, there are many misjudgments or repeated associations of erroneous information at other locations. In the lung-lesion processing of fig. 5 (e), M1 wrongly points to the right upper lobe, and M2 mistakes the free region for part of the lung lesion area. The lesion regions extracted by M3 are largely irrelevant. M4 performs better than M1-M3 in the completeness of the generated report, but when predicting a lesion area it erroneously interprets a pacemaker or external device as part of the lesion. M5 eliminates this type of error by introducing the SFG, which also deletes the irrelevant regions that M4 over-extracts.
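Grad-CAM itself reduces to a small computation: the class-score gradients are global-average-pooled into per-channel weights, which then reweight the last convolutional activations. A sketch with synthetic activations and gradients (no real network attached):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heat map from the last conv layer.

    activations : (K, H, W) feature maps A^k
    gradients   : (K, H, W) d(score)/dA^k for the target class
    """
    weights = gradients.mean(axis=(1, 2))                      # alpha_k: pooled grads
    cam = np.tensordot(weights, activations, axes=([0], [0]))  # sum_k alpha_k * A^k
    cam = np.maximum(cam, 0)                                   # ReLU keeps positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                                  # normalise to [0, 1] for display
    return cam

rng = np.random.default_rng(4)
cam = grad_cam(rng.standard_normal((32, 7, 7)),
               rng.standard_normal((32, 7, 7)))
```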
The embodiment of the invention is implemented in PyTorch and trained on a workstation with 64 GB RAM and an NVIDIA GeForce RTX 4090 GPU. Pre-trained Resnet-152 serves as the common feature-extraction encoder for image processing in ResnetII and CLIP; all images are scaled to 224×224, and the encoder produces feature maps of size 7×512.
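The preprocessing mentioned here and in step ① (contrast enhancement followed by scaling to 224×224) might be sketched as below; percentile contrast stretching and nearest-neighbour resizing are stand-ins for whatever exact operations the implementation uses:

```python
import numpy as np

def preprocess(img, size=224):
    """Contrast stretching plus nearest-neighbour resize to size x size.

    img : 2-D array of raw X-ray intensities.
    """
    lo, hi = np.percentile(img, (1, 99))                        # robust intensity range
    img = np.clip((img - lo) / max(hi - lo, 1e-8), 0.0, 1.0)    # contrast enhancement
    rows = np.arange(size) * img.shape[0] // size               # nearest-neighbour index maps
    cols = np.arange(size) * img.shape[1] // size
    return img[np.ix_(rows, cols)]                              # (size, size) in [0, 1]

x = preprocess(np.random.default_rng(5).random((512, 480)) * 4096)
```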
In accordance with Table 1, the application uses the split suggested by the authors of the original datasets during the training, testing and validation phases, i.e., train : test : validation = 7 : 2 : 1. The extracted visual features are fed to CLIP to generate more than 320 tags; for each tag, a vector containing 512 word embeddings is generated. Where the body of a sentence does not reach 512 words, the remainder is padded with <text>, an omitted-part token recognizable during network training. The highest-probability features are taken as the semantic features generated by the model, and the lesion region is then connected to the network training separately through the MSRM, building a more complete auxiliary network without affecting network convergence. The decoder of the Transformer serves as the decoding layer of the model: all hidden layers and word-embedding dimensions are first set to 512, and these hidden layers directly call the hidden state extracted through the LSTM forgetting gate. Parameter learning uses the AdamW optimizer with a batch size of 4. The total training loss is defined as L = λ MSRM·LMSRM+λcmn·Lcmn+λcross·Lcross+λlanguage·Llanguage, where L cmn and L cross are the cross-entropy losses of the two binary classifiers processing the cross-modal information and L language is the cross-entropy loss of the language model. The loss weights are set according to validation performance as: lambda MSRM=2.0,λcmn=3.0,λcross=3.0,λlanguage =1.0.
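The weighted objective at the end of this paragraph is a straightforward linear combination; a sketch with the stated weights and placeholder loss values:

```python
def total_loss(L_msrm, L_cmn, L_cross, L_language,
               lam_msrm=2.0, lam_cmn=3.0, lam_cross=3.0, lam_language=1.0):
    """Weighted training objective L = sum_i lambda_i * L_i, with the weight
    values reported in the text (2.0, 3.0, 3.0, 1.0) as defaults."""
    return (lam_msrm * L_msrm + lam_cmn * L_cmn
            + lam_cross * L_cross + lam_language * L_language)

# placeholder per-term losses, purely illustrative
loss = total_loss(0.5, 0.4, 0.3, 0.2)
```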
As shown in Table 1, the application compares MRARGN with the 8 most recent chest-report generation models on the IU X-RAY and MIMIC-CXR datasets: PPKED, which generates reports with the aid of combined prior and posterior knowledge; M2TR, which generates individual descriptive words in the context of the X-ray image and then converts them into coherent text using a transformer architecture; R2Gen, which generates radiology reports via a memory-driven Transformer; CMN, which enhances the encoder-decoder framework with a self-attention mechanism to facilitate cross-modal interaction and generation; CvT-212DistilGPT2, which performs highly structured report generation by combining knowledge distillation, computer-vision warm-starting, ViT and GPT-2, with a task-distillation module for structure-level description, a task-aware report-generation module for the describable level, and an anomaly-classification tagging module; VisualGPT, which generates highly interpretable text with the aid of a self-resurrecting encoder-decoder attention mechanism; and RGRG, a simple and efficient region-guided report-generation model that first detects anatomical regions and then describes each salient region individually to form the final report. The CvT-212DistilGPT2 model reports experimental results only on the MIMIC-CXR dataset (not on IU X-RAY). The MRARGN model of the application performs better than most models in cross-modal feature processing and automatic generation of medical reports, successfully generating highly refined descriptions of lesion regions and case conditions.
Table 1: comparison graph of evaluation index results of network models
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution. This manner of description is adopted merely for clarity; the specification should be taken as a whole, and the technical solutions in the respective embodiments may be suitably combined to form other embodiments that will be apparent to those skilled in the art.
Claims (2)
1. A chest X-ray image report generation method based on a cross-modal network, characterized by creating a cross-modal auxiliary network CMLRAN, introducing an attention mechanism to process image and text information respectively, and, on the basis of a memory storage response matrix MSRM, combining the CLIP model proposed by OpenAI to enhance the association between image and text information, comprising the following specific steps:
Step one: the focus area feature extraction implementation steps are as follows:
① Performing contrast enhancement, image size conversion and image pixel block adjustment on an input image to obtain a preprocessed image;
② The preprocessed image is converted into an image feature matrix through a convolutional neural network CNN, and all data of the matrix are flattened into one row to obtain a trainable chest X-ray image feature matrix C; C is then fed into ResnetII, which extracts feature information highly relevant to the chest organs to obtain an X-ray image feature matrix C'; the extracted features and the features C' extracted by the ResnetII network are combined through a residual connection, and finally the features C' and C undergo attention-network calculation to obtain the complete chest X-ray image feature C'',
the ResnetII network is based on the Resnet-152 network, where Resnet-152 denotes a 152-layer residual network whose underlying modules are composed of 12 convolutions of different dimensions, 1×1 and 3×3 being the convolution kernels and 64, 128, 256, 512, 1024 and 2048 being the channel numbers of the network layers,
the X-ray image feature matrix C' is obtained by passing the image feature matrix C through the ResnetII network; the ResnetII network is provided with two different pooling layers, namely a maximum pooling layer and an average pooling layer, and the calculation results of the two pooling layers are combined with the image feature matrix C through a residual connection to obtain the X-ray image feature matrix C';
step two: cross-modal auxiliary localization, implemented as follows:
after step one is completed, the X-ray image feature matrix C' obtained through feature extraction is introduced into the medical CLIP and the MSRM for matrix calculation, and the lesion region with the highest probability is determined, the relevant formula being as follows:
the first two symbols in the formula represent the images and text of the core region, and C img and C txt represent the preprocessed images and text; the cosine similarity W between text features computed from the image features and image features computed from the text features is obtained through matrix calculation; N represents the total number of image-text pairs in a group, W' represents the feature scores obtained by normalizing W, and the cross probability scores L i→t and L t→i of the report corresponding to the finally output lesion region are obtained through calculation;
Step three: the implementation step of automatic generation of medical report:
after the cross-modal features describing the X-ray image features and the corresponding medical diagnosis report are obtained through the MSRM calculation in step two, the decoder of the Transformer can take the entire input sequence into account when generating each word, so that it captures context information well,
specifically, a GPT-2 network is adopted, in which the tokens of the sequence are generated conditioned on the previous tokens; each generated word is used as the input of the next step, and these operations are repeated until a complete medical report is generated; a forgetting gate SFG is proposed based on the bidirectional LSTM, and the formulas of the SFG and the cross-modal memory storage are as follows:
wherein W f represents the weights of the forgetting gate SFG, b f represents the bias, x (t) represents the input information at time t, h (t-1) represents the hidden state at time t-1, and C (t-1) represents the cross-modal memory storage feature at the previous time step;
the flow of cross-modal information is controlled by introducing forgetting gates and update gates, so that the context information in the medical report and the cross-modal information of the X-ray image are better captured; to restrict the language model to the regional visual features, pseudo self-attention is used to inject the lesion-region features and the associated disease-keyword features directly into the model's self-attention, the relevant formulas being as follows:
wherein X represents the visual features of the lesion region, Y represents the word embeddings, W q, W k and W v represent the query, key and value matrices respectively, U k and U v represent the key and value parameters of the initial hidden state obtained through the LSTM, and generation of the lesion-region text is realized through matrix operation.
2. The chest X-ray image report generation method based on a cross-modal network according to claim 1, wherein in the medical CLIP pre-training of step two the encoder adopts the BPE tokenizer, and a tree-structured knowledge graph is added during tokenization, enhancing the weight distribution of the tokenizer and making the dataset trainable.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311271188.9A CN117558394B (en) | 2023-09-28 | 2023-09-28 | Cross-modal network-based chest X-ray image report generation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311271188.9A CN117558394B (en) | 2023-09-28 | 2023-09-28 | Cross-modal network-based chest X-ray image report generation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117558394A CN117558394A (en) | 2024-02-13 |
CN117558394B true CN117558394B (en) | 2024-06-25 |
Family
ID=89815449
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311271188.9A Active CN117558394B (en) | 2023-09-28 | 2023-09-28 | Cross-modal network-based chest X-ray image report generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117558394B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115132313A (en) * | 2021-12-07 | 2022-09-30 | 北京工商大学 | Automatic generation method of medical image report based on attention mechanism |
CN115171838A (en) * | 2022-08-24 | 2022-10-11 | 中南大学 | Training method of medical report generation model based on cross-modal fusion |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110136103B (en) * | 2019-04-24 | 2024-05-28 | 平安科技(深圳)有限公司 | Medical image interpretation method, device, computer equipment and storage medium |
CN116503515A (en) * | 2023-04-26 | 2023-07-28 | 北京理工大学 | Brain lesion image generation method and system based on text and image multi-mode |
CN116779091B (en) * | 2023-06-15 | 2024-02-27 | 兰州交通大学 | Automatic generation method of multi-mode network interconnection and fusion chest image diagnosis report |
-
2023
- 2023-09-28 CN CN202311271188.9A patent/CN117558394B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115132313A (en) * | 2021-12-07 | 2022-09-30 | 北京工商大学 | Automatic generation method of medical image report based on attention mechanism |
CN115171838A (en) * | 2022-08-24 | 2022-10-11 | 中南大学 | Training method of medical report generation model based on cross-modal fusion |
Also Published As
Publication number | Publication date |
---|---|
CN117558394A (en) | 2024-02-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110147457B (en) | Image-text matching method, device, storage medium and equipment | |
He et al. | Pathvqa: 30000+ questions for medical visual question answering | |
WO2022007685A1 (en) | Method and device for text-based image generation | |
CN110390363A (en) | A kind of Image Description Methods | |
CN110619313B (en) | Remote sensing image discriminant description generation method | |
CN112614561A (en) | Brain CT medical report generation method based on hierarchical self-attention sequence coding | |
CN113343705A (en) | Text semantic based detail preservation image generation method and system | |
CN113837229B (en) | Knowledge-driven text-to-image generation method | |
CN116610778A (en) | Bidirectional image-text matching method based on cross-modal global and local attention mechanism | |
CN114220516A (en) | Brain CT medical report generation method based on hierarchical recurrent neural network decoding | |
Elbedwehy et al. | Efficient Image Captioning Based on Vision Transformer Models. | |
CN112801217B (en) | Text similarity judgment method and device, electronic equipment and readable storage medium | |
CN105678349B (en) | A kind of sub- generation method of the context-descriptive of visual vocabulary | |
Zhang et al. | Multi-head self-attention gated-dilated convolutional neural network for word sense disambiguation | |
CN117611601A (en) | Text-assisted semi-supervised 3D medical image segmentation method | |
Wang et al. | Image captioning based on deep learning methods: A survey | |
US11494431B2 (en) | Generating accurate and natural captions for figures | |
CN112562809A (en) | Method and system for auxiliary diagnosis based on electronic medical record text | |
Gu et al. | Automatic generation of pulmonary radiology reports with semantic tags | |
CN117333462A (en) | Ultrasonic diagnosis intelligent interaction system based on liver attribute analysis | |
CN113011514A (en) | Intracranial hemorrhage sub-type classification algorithm applied to CT image based on bilinear pooling | |
CN117558394B (en) | Cross-modal network-based chest X-ray image report generation method | |
Perdana et al. | Instance-based deep transfer learning on cross-domain image captioning | |
Fan et al. | Long-term recurrent merge network model for image captioning | |
CN115588486A (en) | Traditional Chinese medicine diagnosis generating device based on Transformer and application thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |