CN117558394B - Cross-modal network-based chest X-ray image report generation method - Google Patents

Cross-modal network-based chest X-ray image report generation method

Info

Publication number
CN117558394B
CN117558394B (Application CN202311271188.9A)
Authority
CN
China
Prior art keywords
image
network
cross
matrix
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311271188.9A
Other languages
Chinese (zh)
Other versions
CN117558394A (en)
Inventor
董子龙
廉敬
石斌
刘冀钊
张家骏
张怀堃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lanzhou Jiaotong University
Original Assignee
Lanzhou Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lanzhou Jiaotong University filed Critical Lanzhou Jiaotong University
Priority to CN202311271188.9A priority Critical patent/CN117558394B/en
Publication of CN117558394A publication Critical patent/CN117558394A/en
Application granted granted Critical
Publication of CN117558394B publication Critical patent/CN117558394B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H15/00ICT specially adapted for medical reports, e.g. generation or transmission thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/096Transfer learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/32Normalisation of the pattern dimensions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Apparatus For Radiation Diagnosis (AREA)

Abstract

The invention discloses a chest X-ray image report generation method based on a cross-modal network, belonging to the technical field of image reporting. A cross-modal auxiliary network (CMLRAN) is provided that introduces an attention mechanism to process image and text information separately and, based on a Memory Storage Response Matrix (MSRM), combines the CLIP model proposed by OpenAI to strengthen the association between image and text information. During encoding, the network focuses on classifying fine-grained differences in X-ray images; during decoding, it focuses on generating medical terminology. The method better bridges the semantic gap and intelligently generates chest X-ray image reports.

Description

Cross-modal network-based chest X-ray image report generation method
Technical Field
The invention relates to the technical field of image reports, in particular to a chest X-ray image report generation method based on a cross-modal network.
Background
Chest X-ray is an advanced medical imaging technique that, through high-resolution, multi-angle imaging, accurately displays pulmonary lesions (such as pneumonia, tuberculosis, and lung cancer), mediastinal lesions (such as mediastinal tumors and mediastinal emphysema), pleural lesions (such as pleural effusion and pleurisy), and cardiovascular lesions. The chest X-ray diagnostic report is the professional interpretation and summary of the examination results; it generally comprises imaging findings, diagnostic opinions, and recommendations, and provides the basis for physicians to formulate diagnosis and treatment plans.
In recent years, the need for medical X-ray "imaging"-guided "treatment" has grown, and related research has received wide attention. Among these approaches, methods that generate long text with hierarchical long short-term memory (LSTM) networks show certain advantages. However, research in this area still faces many challenges: medical image features are complex, cross-modal features are difficult to extract, and medical reports contain a large number of specialized terms. At present, LSTM alone cannot achieve automatic generation of multi-organ imaging reports. For this reason, some scholars have proposed deep-learning-based methods for automatically generating medical image reports, which can be divided into image-processing methods and natural-language-processing methods according to the object being processed. Taking images as the entry point, Tanida proposed RGRG, a generation model guided by lesion regions, which first segments specific lesion regions and then composes the final report from them; Li proposed DCL, a knowledge-graph-assisted generation network with dynamic structure and nodes, which takes each image as a starting point to extract image-versus-text generation features and finally adds these features to each output node. Taking natural language processing as the entry point, Chen proposed VisualGPT, an image-captioning network that uses the linguistic knowledge of a large pre-trained language model (PLM) and can effectively learn substantial language knowledge from a small amount of multimodal data; Kaur proposed CADxReport, a CNN-RNN based network model that uses reinforcement learning together with visual and semantic attention mechanisms to automatically generate medical reports.
These deep-learning-based methods for automatic medical report generation have shortcomings. When reports are generated with image processing as the entry point, the model has difficulty fully comprehending the complex information in the images, and the generated reports lack flexibility of linguistic expression. When natural language processing is the entry point, the reports are generated from predefined templates, which likewise lack flexibility and make it difficult to adapt to different application scenarios. In view of this, the present invention proposes a cross-modal auxiliary network (CMLRAN), which introduces an attention mechanism to process image and text information separately and, based on a Memory Storage Response Matrix (MSRM), combines the CLIP model proposed by OpenAI to strengthen the association between image and text information. During encoding, the network focuses on classifying fine-grained differences in X-ray images; during decoding, it focuses on generating medical terminology. The method better bridges the semantic gap and intelligently generates chest X-ray image reports.
Disclosure of Invention
The present invention has been made to solve the above-mentioned problems, and an object of the present invention is to provide a method for generating a chest X-ray image report based on a cross-modal network.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows: a chest X-ray image report generation method based on a cross-modal network comprises creating a cross-modal auxiliary network (CMLRAN), introducing an attention mechanism to process image and text information respectively, and enhancing the association between image and text information by combining the CLIP model proposed by OpenAI on the basis of a Memory Storage Response Matrix (MSRM), the method comprising the following specific steps:
Step one: lesion-area feature extraction is implemented as follows:
① Performing contrast enhancement, image size conversion and image pixel block adjustment on an input image to obtain a preprocessed image;
② The preprocessed image is converted into an image feature matrix by a convolutional neural network (CNN); all data in the matrix are flattened into a single row to obtain a trainable chest CT image feature matrix C, which is then fed into ResnetII to extract feature information highly relevant to the chest organs, yielding an X-ray image feature matrix C'. The residual network learns the chest organ features of both the original image and the convolutionally extracted features, avoiding vanishing and exploding gradients during information propagation.
The formula for the feature matrix C' after the first pass of the ResnetII network is as follows:
where σ denotes the Sigmoid function, Avg denotes average pooling, Max denotes max pooling, c denotes the feature matrix obtained at each step from the chest X-ray image, f^(7×7) denotes a convolution with a 7×7 kernel, Δp denotes the direct (identity) mapping of the network, and μ denotes the loss function of the residual network;
Step two: cross-modal auxiliary localization is implemented as follows:
After step one is completed, matrix calculations are performed between the extracted X-ray image feature matrix C' and the introduced medical CLIP and MSRM to determine the lesion area with the highest probability and to enhance its contrast or sharpness, the related formulas being as follows:
where the leading symbols denote the image and text of the core region, C_img and C_txt denote the preprocessed image and text, matrix calculation yields the cosine similarity W between the text features computed from the image features and the image features computed from the text features, N denotes the total number of image-text pairs in a group, W' denotes the feature score obtained by normalizing W, and the calculation finally outputs the cross probability scores L_i→t and L_t→i of the report corresponding to the lesion area;
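For illustration, a minimal sketch of this CLIP-style cross-modal scoring is given below. The temperature value and the softmax normalization are assumptions made for the sketch; the exact normalization used to obtain W' is not reproduced here.

```python
import torch
import torch.nn.functional as F

def cross_modal_scores(img_feats: torch.Tensor, txt_feats: torch.Tensor,
                       temperature: float = 0.07):
    """Sketch: cosine similarities W between N image/text feature pairs,
    normalized into scores from which L_i->t and L_t->i are read."""
    img = F.normalize(img_feats, dim=-1)            # (N, d), from C_img
    txt = F.normalize(txt_feats, dim=-1)            # (N, d), from C_txt
    w = img @ txt.t() / temperature                 # (N, N) similarity matrix W
    l_i2t = F.softmax(w, dim=1)                     # image -> report scores
    l_t2i = F.softmax(w.t(), dim=1)                 # report -> image scores
    return l_i2t, l_t2i
```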
Step three: the implementation step of automatic generation of medical report:
after deriving the cross-modal feature, the transcoder's decoder can take into account the entire input sequence at the same time when generating each word, so it can capture the context information well,
Specifically, a GPT-2 network similar to that proposed by PubMed et al is adopted, in which the labels in the sequence are text-generated on condition of the previous labels, each generated word is used as the input of the next step, the process is repeated until a complete medical report is generated, a forgetting gate SFG is proposed based on a bidirectional LSTM, the SFG is combined with an attention mechanism, and the flow of cross-modal information is controlled by introducing the forgetting gate and an update gate, so that the context information and the cross-modal information of the X-ray image in the medical report are better captured, and in order to limit the language model to the regional visual features, the focus regional features and the associated disease keyword features are directly injected into the self-attention of the model by using pseudo self-attention, and the related formulas are as follows:
Wherein X represents the visual characteristics of the focus area, Y represents word embedding, W q,Wk,Wv represents inquiry, keys and values, U k and U v represent parameters of the keys and values of the initial hidden state obtained through LSTM, and the generation of text of the focus area can be realized through matrix operation.
Further, the newly built cross-modal auxiliary network (CMLRAN) focuses on classifying fine-grained differences in X-ray images during encoding and on generating medical terminology during decoding.
Further, the ResnetII network in step one adds a max-pooling layer and an average-pooling layer on top of the pre-trained Resnet-152 network; these two layers capture the maximum and average values of the features at different scales, and max pooling improves the stability of feature extraction and reduces the influence of errors such as geometric distortion and exposure in X-ray images on the model.
Further, a BPE tokenizer is adopted for processing the medical report, and a tree-structured knowledge graph is added during tokenization, which strengthens the tokenizer's weight distribution and makes the dataset trainable.
Further, in step two, to prevent oversaturation of the stored information and the resulting gradient explosion during network training, a selective forgetting gate (SFG) is added; the formulas for the SFG and the cross-modal memory storage are as follows:
where W_f denotes the forgetting-gate (SFG) weights, b_f denotes the bias, x(t) denotes the hidden information, h(t-1) denotes the hidden state at time t-1, and C(t-1) denotes the cross-modal memory storage feature at the previous time step.
Compared with the prior art, the invention has the following beneficial effects:
(1) Based on transfer learning, multi-channel feature extraction is added and divided into MaxPool and AvgPool branches, with an attention mechanism added on top to strengthen the extraction of global image features;
(2) A cross-modal auxiliary network is provided to bridge the semantic gap between chest X-ray images and the corresponding medical reports, strengthen the connection between the two modalities, and improve the matching precision between X-ray images and their corresponding medical reports.
Drawings
FIG. 1 is a schematic diagram of the implementation steps of a body network model framework according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a focus area feature extraction module according to an embodiment of the invention;
FIG. 3 is a cross-modal network layer model of an embodiment of the present invention;
FIG. 4 is a tree-like knowledge-graph of medical report according to an embodiment of the invention;
Fig. 5 is a graph of lesion area visualization using Grad-CAM according to an embodiment of the invention.
Detailed Description
The invention is further described below in connection with specific embodiments, so that the technical means, creative features, objectives, and effects of the invention are easy to understand.
As shown in FIG. 1, the method provided by the invention first uses a residual network and a visual attention model to extract image features, then combines CLIP and the MSRM to localize the lesion area, and finally realizes automatic generation of the medical report through the Transformer decoder and an LSTM-style gating mechanism. The proposed cross-modal auxiliary network model CMLRAN consists of three modules: a lesion-area feature extraction module, a lesion-area-based cross-modal auxiliary localization module, and an automatic medical report generation module.
(1) Lesion-area feature extraction is implemented as follows:
① Performing contrast enhancement, image size conversion and image pixel block adjustment on an input image to obtain a preprocessed image;
② The preprocessed image is converted into an image feature matrix by a convolutional neural network (CNN); all data in the matrix are flattened into a single row to obtain a trainable chest CT image feature matrix C, which is then fed into ResnetII to extract feature information highly relevant to the chest organs, yielding an X-ray image feature matrix C'. The residual network learns the chest organ features of both the original image and the convolutionally extracted features, avoiding vanishing and exploding gradients during information propagation. The architecture of the lesion-area feature extraction module is shown in FIG. 2.
In FIG. 2, Resnet-152 denotes a 152-layer residual network whose underlying modules consist of 12 convolutions of different dimensions (1×1 and 3×3 are the convolution kernels; 64, 128, 256, 512, 1024, and 2048 are the channel widths of the network layers).
To ensure that the encoder learns chest CT image features better, a dual-channel modular network architecture is added, containing a max-pooling (MaxPool) layer and an average-pooling (AvgPool) layer, together with a self-attention mechanism that enhances the image features extracted by MaxPool and AvgPool.
Training of the lesion-area feature extraction network uses the dual-channel feature extraction network ResnetII and an attention-based feature extraction network. During dual-channel feature extraction, a convolution is applied to the trainable chest CT image feature matrix C and its dimensionality is increased; the features are then passed through MaxPool and AvgPool into a dilated convolution layer of Resnet-152 (dilation rate 2, kernel size 7×7) to obtain the Resnet outputs. These outputs undergo dual-channel residual operations through MaxPool and AvgPool, and the result is summed with the image feature matrix obtained from the preprocessed image by the CNN, yielding the feature matrix C' after the first pass of the Resnet network. Dual-channel feature extraction enhances the model's multi-scale extraction of chest CT image details while reducing the negative effects of using dilated convolutions alone, such as loss of the original image's spatial hierarchy information and repeated extraction of unimportant information. The formula for the feature matrix C' after the first pass of the ResnetII network is as follows:
where σ denotes the Sigmoid function, Avg denotes average pooling, Max denotes max pooling, c denotes the feature matrix obtained at each step from the chest X-ray image, f^(7×7) denotes a convolution with a 7×7 kernel, Δp denotes the direct (identity) mapping of the network, and μ denotes the loss function of the residual network. ResnetII adds a max-pooling layer and an average-pooling layer on top of the pre-trained Resnet-152 network; these two layers capture the maximum and average values of the features at different scales. Max pooling improves the stability of feature extraction and reduces the influence of errors such as geometric distortion and exposure in X-ray images on the model, while the average-pooling layer converts the spatial information of the features into a more compact representation, improves the generalization ability of the model, and reduces the influence of background radiation, artifacts, and similar factors in the X-ray image; the features C' extracted through these two layers reflect the local features of the medical X-ray image. The original feature C and the processed feature C' are then fed into an attention network for secondary feature processing; this network divides the chest X-ray image into a series of learnable patches to achieve global processing of the whole image, and the features it extracts are combined with the features C' from the ResnetII network through residual connections to finally obtain the complete chest X-ray image feature C''.
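For illustration, a minimal PyTorch-style sketch of this dual-channel (MaxPool/AvgPool) stage is given below. The module layout, channel count, pooling parameters, and the way the two branches are recombined with a sigmoid spatial-attention mask and a residual sum are assumptions made for the sketch, not the exact ResnetII implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualChannelBlock(nn.Module):
    """Sketch of the dual-channel (MaxPool/AvgPool) stage with a dilated 7x7
    convolution, a sigmoid spatial-attention term, and a residual sum with
    the input feature matrix C; shapes are illustrative assumptions."""

    def __init__(self, channels: int = 512):
        super().__init__()
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=1, padding=1)
        self.avgpool = nn.AvgPool2d(kernel_size=2, stride=1, padding=1)
        # Dilated 7x7 convolution (dilation rate 2) applied to each branch.
        self.dilated = nn.Conv2d(channels, channels, kernel_size=7,
                                 dilation=2, padding=6)
        # Spatial attention over channel-wise avg/max maps, echoing the
        # sigma(f^(7x7)([Avg(c); Max(c)])) term in the formula.
        self.spatial_attn = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, c: torch.Tensor) -> torch.Tensor:
        branch_max = self.dilated(self.maxpool(c))
        branch_avg = self.dilated(self.avgpool(c))
        # Resize branches back to the input resolution (illustrative only).
        branch_max = F.interpolate(branch_max, size=c.shape[-2:])
        branch_avg = F.interpolate(branch_avg, size=c.shape[-2:])
        attn_in = torch.cat([c.mean(dim=1, keepdim=True),
                             c.max(dim=1, keepdim=True).values], dim=1)
        mask = torch.sigmoid(self.spatial_attn(attn_in))
        # Residual sum with the original feature matrix C (direct mapping).
        return mask * (branch_max + branch_avg) + c
```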
The application selects the BPE tokenizer, a data-driven method used with the self-attention-based autoregressive network, which splits text into a fixed-size vocabulary of subwords by repeatedly merging the most frequently occurring characters or character sequences.
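For illustration only, a toy sketch of the BPE merge loop is given below; it is a simplified stand-in for the idea, not the tokenizer actually used in the embodiment.

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Start from character-level symbols with an end-of-word marker.
    words = Counter(tuple(w) + ("</w>",) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

# Example with a few (hypothetical) report words:
print(learn_bpe_merges(["effusion", "lesion", "fusion"], num_merges=5))
```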
(2) Cross-modal auxiliary localization is implemented as follows:
After the first stage is completed, matrix calculations are performed between the extracted X-ray image features and the introduced medical CLIP and MSRM to determine the lesion area with the highest probability and to strengthen the relation between the lesion area and the corresponding medical report keywords. As shown in FIG. 3, the medical CLIP uses the pre-trained Resnet-152 as its image encoder and BPE as the encoder for the medical report. The application fine-tunes the original CLIP model with a contrastive loss on the IU X-RAY and MIMIC-CXR datasets, increasing the similarity of matching image-text pairs and the dissimilarity of non-matching pairs, and the generated result is resolved through a pre-built tree-structured medical report knowledge graph (shown in FIG. 4). The graph first makes a preliminary judgment using the lesion types and disease information that may appear in each part of the chest X-ray, and the alignment between the medical graph and the corresponding lesion area is then realized through the zero-shot classification mechanism of the medical CLIP. Finally, the chest X-ray lesion area corresponding to the disease information with the maximum similarity is selected based on image classification. The related formulas are as follows:
where the leading symbols denote the image and text of the core region, C_img and C_txt denote the preprocessed image and text, matrix calculation yields the cosine similarity W between the text features computed from the image features and the image features computed from the text features, N denotes the total number of image-text pairs in a group, W' denotes the feature score obtained by normalizing W, and the calculation finally outputs the cross probability scores L_i→t and L_t→i of the report corresponding to the lesion area. After the probability score L is obtained, it is fed into the MSRM for cross-modal memory storage, whose main operation is to store the information of the two modalities simultaneously, retain important information, and delete useless information. At the same time, to prevent oversaturation of the stored information and the resulting gradient explosion during network training, a selective forgetting gate (SFG) is added with reference to the gate-unit mechanism of the LSTM. The formulas for the SFG and the cross-modal memory storage are as follows:
where W_f denotes the forgetting-gate (SFG) weights, b_f denotes the bias, x(t) denotes the hidden information, h(t-1) denotes the hidden state at time t-1, and C(t-1) denotes the cross-modal memory storage feature at the previous time step.
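For illustration, a minimal PyTorch sketch of an LSTM-style selective forgetting gate over the cross-modal memory is given below. Because the exact SFG formula appears only in the original figures, the candidate update, dimensions, and module names here are assumptions consistent with the variable definitions above.

```python
import torch
import torch.nn as nn

class SelectiveForgettingGate(nn.Module):
    """Sketch: f(t) = sigmoid(W_f . [h(t-1), x(t)] + b_f) gates how much of
    the cross-modal memory C(t-1) is kept; the candidate term is assumed."""

    def __init__(self, hidden_dim: int = 512):
        super().__init__()
        self.forget = nn.Linear(2 * hidden_dim, hidden_dim)     # W_f, b_f
        self.candidate = nn.Linear(2 * hidden_dim, hidden_dim)  # assumed

    def forward(self, x_t, h_prev, c_prev):
        z = torch.cat([h_prev, x_t], dim=-1)   # [h(t-1), x(t)]
        f_t = torch.sigmoid(self.forget(z))    # forgetting gate SFG
        cand = torch.tanh(self.candidate(z))   # new cross-modal content
        # Keep important stored information, discard the rest, which limits
        # oversaturation of the cross-modal memory.
        return f_t * c_prev + (1.0 - f_t) * cand
```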
(3) Automatic generation of the medical report is implemented as follows:
After the cross-modal features are derived, the Transformer decoder can attend to the entire input sequence when generating each word, so it captures context information well. Specifically, it employs a GPT-2 network similar to that proposed by PubMed et al., in which each token in the sequence is generated conditioned on the preceding tokens; each generated word is used as input to the next step, and the process repeats until a complete medical report is generated. During decoding, however, the application adds the cross-modal memory feature C(t) extracted by the MSRM. Using only a conventional decoder could cause problems such as repeated memorization of erroneous information and slow network convergence, so the application proposes a forgetting gate (SFG) based on a bidirectional LSTM and combines the SFG with an attention mechanism: by introducing the forgetting gate and an update gate, the flow of cross-modal information is controlled, and the context information and the cross-modal information of the X-ray image are better captured in the medical report. To constrain the language model to regional visual features, the application uses pseudo self-attention to inject the lesion-area features and the associated disease-keyword features directly into the model's self-attention, the related formulas being as follows:
where X denotes the visual features of the lesion area, Y denotes the word embeddings, W_q denotes the query projection, W_k the key projection, and W_v the value projection, U_k and U_v denote the key and value parameters of the initial hidden state obtained through the LSTM, and text generation for the lesion area is realized through these matrix operations.
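For illustration, a minimal single-head sketch of pseudo self-attention is given below: the lesion-area features X, projected by U_k and U_v, are prepended to the keys and values of the decoder self-attention, while the queries come only from the word embeddings Y. Dimensions and the single-head form are assumptions made for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PseudoSelfAttention(nn.Module):
    """Sketch: visual features enter only through extra keys/values (U_k, U_v),
    so the text-shaped self-attention weights W_q, W_k, W_v are unchanged."""

    def __init__(self, d_model: int = 512, d_visual: int = 512):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)   # W_q
        self.w_k = nn.Linear(d_model, d_model)   # W_k
        self.w_v = nn.Linear(d_model, d_model)   # W_v
        self.u_k = nn.Linear(d_visual, d_model)  # U_k (visual keys)
        self.u_v = nn.Linear(d_visual, d_model)  # U_v (visual values)

    def forward(self, y: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # y: word embeddings (B, T, d_model); x: lesion features (B, R, d_visual)
        q = self.w_q(y)
        k = torch.cat([self.u_k(x), self.w_k(y)], dim=1)  # visual keys first
        v = torch.cat([self.u_v(x), self.w_v(y)], dim=1)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        return attn @ v
```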
The regions the model attends to during training can be visualized with Gradient-weighted Class Activation Mapping (Grad-CAM), which is used here to determine whether the model accurately associates the lesion area with the corresponding tree-structured knowledge graph. As shown in FIGS. 5(a)-(f), M1-M5 denote MRARGN and its variants, shown via Grad-CAM on the heart, pleura, bone, mediastinum, lung, and free regions of the MIMIC-CXR dataset in their optimal states. It can be observed that M5 accurately identifies the boundary and shape information of most lesion areas. Although M1 handles the skeletal region and M2 handles the pleural and mediastinal regions relatively reasonably, both show many misjudgments or repeated associations of erroneous information at other locations. In the lung-lesion case of FIG. 5(e), M1 points toward the upper right lobe, and M2 also perceives the free region as part of the lung lesion area. The lesion area extracted by M3 consists largely of irrelevant extractions. M4 performs better than M1-M3 in the completeness of the generated report, but when predicting a lesion area it misinterprets a pacemaker or external device as part of the lesion. M5 eliminates this type of error by introducing the SFG, and the irrelevant regions over-extracted by M4 can likewise be removed by the SFG.
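For illustration, a minimal hook-based Grad-CAM sketch in PyTorch is given below. The model, target layer, class index, and input shape are assumptions; the embodiment may equally use a packaged Grad-CAM implementation.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """Sketch: capture the target layer's activations and gradients via hooks,
    weight the activation maps by channel-averaged gradients, and upsample."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: acts.update(value=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(value=go[0]))

    logits = model(image)                 # image: (1, 3, 224, 224), assumed
    model.zero_grad()
    logits[0, class_idx].backward()
    h1.remove(); h2.remove()

    a, g = acts["value"], grads["value"]          # (1, C, H, W)
    weights = g.mean(dim=(2, 3), keepdim=True)    # global-average gradients
    cam = F.relu((weights * a).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```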
The embodiment of the invention is implemented in PyTorch and trained on a workstation with 64 GB of RAM and an NVIDIA GeForce RTX 4090 GPU. Pre-trained Resnet-152 is used as the shared feature extraction encoder for the image-processing ResnetII and CLIP; all images are scaled to 224×224 and encoded into 7×7×512 feature maps.
In the training, testing, and validation phases the application follows the splits suggested by the authors of the original datasets, i.e., a train : test : validation ratio of 7 : 2 : 1. The extracted visual features are fed to CLIP to generate more than 320 tags; for each tag a vector of 512 word embeddings is generated, and when the body of the sentence does not reach 512 words the remainder is padded with <text>, an omission token recognizable during network training. The highest-probability features are taken as the semantic features generated by the model, and the lesion area is then connected separately to the network training through the MSRM, building a more complete auxiliary network without affecting network convergence. The Transformer decoder serves as the decoding layer of the model; all hidden layers and word embedding dimensions are first set to 512, and these hidden layers directly call the hidden state extracted through the LSTM forgetting gate. Parameters are learned with the AdamW optimizer and a batch size of 4. The total training loss is defined as L = λ_MSRM·L_MSRM + λ_cmn·L_cmn + λ_cross·L_cross + λ_language·L_language, where L_cmn and L_cross are the cross-entropy losses of two binary classifiers processing the cross-modal information and L_language is the cross-entropy loss of the language model; the loss weights are set according to validation performance as λ_MSRM = 2.0, λ_cmn = 3.0, λ_cross = 3.0, λ_language = 1.0.
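A minimal sketch of this weighted total loss, assuming the four component losses have already been computed as scalar tensors:

```python
import torch

def total_loss(l_msrm, l_cmn, l_cross, l_language,
               lam_msrm=2.0, lam_cmn=3.0, lam_cross=3.0, lam_language=1.0):
    """Weighted sum of the four training losses with the weights given above."""
    return (lam_msrm * l_msrm + lam_cmn * l_cmn
            + lam_cross * l_cross + lam_language * l_language)

# Illustrative usage with dummy scalar losses:
loss = total_loss(torch.tensor(0.7), torch.tensor(0.4),
                  torch.tensor(0.5), torch.tensor(1.2))
print(loss)
```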
As shown in Table 1, the application compares MRARGN on the IU X-RAY and MIMIC-CXR datasets with the recent chest report generation models listed below: PPKED, which generates reports with the aid of combined prior and posterior knowledge; m2tr, which generates individual descriptive words in the context of the X-ray image and then converts them into coherent text using a Transformer architecture; R2Gen, which generates radiology reports via a memory-driven Transformer; CMN, which enhances the encoder-decoder framework with a self-attention mechanism to facilitate cross-modal interaction and generation; CvT2DistilGPT2, which performs highly structured report generation by combining knowledge distillation, a computer-vision warm start, ViT, and GPT-2, with a task-distillation module for structure-level description, a task-aware report generation module at the describable level, and an anomaly classification and tagging module; VisualGPT, which generates highly interpretable text with the help of a self-resurrecting encoder-decoder attention mechanism; and RGRG, a simple and efficient region-guided report generation model that detects anatomical regions and then describes individual salient regions to form the final report. The CvT2DistilGPT2 model reports experimental results only on the MIMIC-CXR dataset (not on IU X-RAY). The MRARGN model of the application performs better than most of these models in cross-modal feature processing and automatic medical report generation, and successfully generates highly refined descriptive text of lesion areas and case conditions.
Table 1: Comparison of evaluation-metric results of the network models
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution. This manner of description is adopted merely for clarity; the specification should be taken as a whole, and the technical solutions in the respective embodiments may be suitably combined to form other embodiments understandable to those skilled in the art.

Claims (2)

1. A chest X-ray image report generation method based on a cross-modal network, characterized by comprising: creating a cross-modal auxiliary network CMLRAN, introducing an attention mechanism to process image and text information respectively, and enhancing the association between image and text information by combining the CLIP model proposed by OpenAI on the basis of a memory storage response matrix MSRM, the method comprising the following specific steps:
Step one: lesion-area feature extraction is implemented as follows:
① Performing contrast enhancement, image size conversion and image pixel block adjustment on an input image to obtain a preprocessed image;
② The preprocessed image is converted into an image feature matrix by a convolutional neural network CNN; all data in the matrix are flattened into a single row to obtain a trainable chest CT image feature matrix C, which is fed into ResnetII to extract feature information highly relevant to the chest organs, yielding an X-ray image feature matrix C'; the extracted features and the features C' extracted by the ResnetII network are combined through residual connections, and finally the features C' and the features C are processed by an attention network to obtain the complete chest X-ray image feature C'',
the ResnetII network is based on the Resnet-152 network, a 152-layer residual network whose underlying modules consist of 12 convolutions of different dimensions, where 1×1 and 3×3 are the convolution kernels and 64, 128, 256, 512, 1024, and 2048 are the channel widths of the network layers,
the X-ray image feature matrix C' is obtained by passing the image feature matrix C through the ResnetII network; the ResnetII network contains two different pooling layers, namely a max-pooling layer and an average-pooling layer, and the results of the two pooling layers are combined with the image feature matrix C through residual connections to obtain the X-ray image feature matrix C';
Step two: cross-modal auxiliary localization is implemented as follows:
after step one is completed, the X-ray image feature matrix C' obtained by feature extraction is fed into the medical CLIP and the MSRM for matrix calculation to determine the lesion area with the highest probability, the related formulas being as follows:
where the leading symbols denote the image and text of the core region, C_img and C_txt denote the preprocessed image and text, matrix calculation yields the cosine similarity W between the text features computed from the image features and the image features computed from the text features, N denotes the total number of image-text pairs in a group, W' denotes the feature score obtained by normalizing W, and the calculation finally outputs the cross probability scores L_i→t and L_t→i of the report corresponding to the lesion area;
Step three: automatic generation of the medical report is implemented as follows:
after the cross-modal features of the X-ray image features and the corresponding medical diagnosis report are described through the MSRM calculation in step two, the Transformer decoder can attend to the entire input sequence when generating each word, so that context information is captured well,
specifically, a GPT-2 network is adopted, in which each token in the sequence is generated conditioned on the preceding tokens; each generated word is used as input to the next step, and these operations are repeated until a complete medical report is generated; a forgetting gate SFG is proposed based on a bidirectional LSTM, and the formulas for the SFG and the cross-modal memory storage are as follows:
where W_f denotes the forgetting-gate (SFG) weights, b_f denotes the bias, x(t) denotes the hidden information, h(t-1) denotes the hidden state at time t-1, and C(t-1) denotes the cross-modal memory storage feature at the previous time step;
the flow of cross-modal information is controlled by introducing the forgetting gate and an update gate, so that the context information and the cross-modal information of the X-ray image in the medical report are better captured; to constrain the language model to regional visual features, the lesion-area features and the associated disease-keyword features are injected directly into the self-attention of the model using pseudo self-attention, the related formulas being as follows:
where X denotes the visual features of the lesion area, Y denotes the word embeddings, W_q, W_k, and W_v denote the query, key, and value projections, U_k and U_v denote the key and value parameters of the initial hidden state obtained through the LSTM, and text generation for the lesion area is realized through these matrix operations.
2. The cross-modal-network-based chest X-ray image report generation method according to claim 1, wherein the medical CLIP pre-training in step two uses a BPE tokenizer as its text encoder, and a tree-structured knowledge graph is added during tokenization, which strengthens the tokenizer's weight distribution and makes the dataset trainable.
CN202311271188.9A 2023-09-28 2023-09-28 Cross-modal network-based chest X-ray image report generation method Active CN117558394B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311271188.9A CN117558394B (en) 2023-09-28 2023-09-28 Cross-modal network-based chest X-ray image report generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311271188.9A CN117558394B (en) 2023-09-28 2023-09-28 Cross-modal network-based chest X-ray image report generation method

Publications (2)

Publication Number Publication Date
CN117558394A CN117558394A (en) 2024-02-13
CN117558394B true CN117558394B (en) 2024-06-25

Family

ID=89815449

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311271188.9A Active CN117558394B (en) 2023-09-28 2023-09-28 Cross-modal network-based chest X-ray image report generation method

Country Status (1)

Country Link
CN (1) CN117558394B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115132313A (en) * 2021-12-07 2022-09-30 北京工商大学 Automatic generation method of medical image report based on attention mechanism
CN115171838A (en) * 2022-08-24 2022-10-11 中南大学 Training method of medical report generation model based on cross-modal fusion

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110136103B (en) * 2019-04-24 2024-05-28 平安科技(深圳)有限公司 Medical image interpretation method, device, computer equipment and storage medium
CN116503515A (en) * 2023-04-26 2023-07-28 北京理工大学 Brain lesion image generation method and system based on text and image multi-mode
CN116779091B (en) * 2023-06-15 2024-02-27 兰州交通大学 Automatic generation method of multi-mode network interconnection and fusion chest image diagnosis report

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115132313A (en) * 2021-12-07 2022-09-30 北京工商大学 Automatic generation method of medical image report based on attention mechanism
CN115171838A (en) * 2022-08-24 2022-10-11 中南大学 Training method of medical report generation model based on cross-modal fusion

Also Published As

Publication number Publication date
CN117558394A (en) 2024-02-13

Similar Documents

Publication Publication Date Title
CN110147457B (en) Image-text matching method, device, storage medium and equipment
He et al. Pathvqa: 30000+ questions for medical visual question answering
WO2022007685A1 (en) Method and device for text-based image generation
CN110390363A (en) A kind of Image Description Methods
CN110619313B (en) Remote sensing image discriminant description generation method
CN112614561A (en) Brain CT medical report generation method based on hierarchical self-attention sequence coding
CN113343705A (en) Text semantic based detail preservation image generation method and system
CN113837229B (en) Knowledge-driven text-to-image generation method
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
CN114220516A (en) Brain CT medical report generation method based on hierarchical recurrent neural network decoding
Elbedwehy et al. Efficient Image Captioning Based on Vision Transformer Models.
CN112801217B (en) Text similarity judgment method and device, electronic equipment and readable storage medium
CN105678349B (en) A kind of sub- generation method of the context-descriptive of visual vocabulary
Zhang et al. Multi-head self-attention gated-dilated convolutional neural network for word sense disambiguation
CN117611601A (en) Text-assisted semi-supervised 3D medical image segmentation method
Wang et al. Image captioning based on deep learning methods: A survey
US11494431B2 (en) Generating accurate and natural captions for figures
CN112562809A (en) Method and system for auxiliary diagnosis based on electronic medical record text
Gu et al. Automatic generation of pulmonary radiology reports with semantic tags
CN117333462A (en) Ultrasonic diagnosis intelligent interaction system based on liver attribute analysis
CN113011514A (en) Intracranial hemorrhage sub-type classification algorithm applied to CT image based on bilinear pooling
CN117558394B (en) Cross-modal network-based chest X-ray image report generation method
Perdana et al. Instance-based deep transfer learning on cross-domain image captioning
Fan et al. Long-term recurrent merge network model for image captioning
CN115588486A (en) Traditional Chinese medicine diagnosis generating device based on Transformer and application thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant