CN115331769B - Medical image report generation method and device based on multi-mode fusion - Google Patents
Medical image report generation method and device based on multi-modal fusion
- Publication number
- CN115331769B · Application CN202210836966.3A
- Authority
- CN
- China
- Prior art keywords
- medical
- graph
- representing
- attention
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H15/00—ICT specially adapted for medical reports, e.g. generation or transmission thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H30/00—ICT specially adapted for the handling or processing of medical images
Abstract
The invention provides a medical image report generation method and device based on multi-modal fusion, wherein the method comprises the following steps: constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph; inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector; inputting the medical image into an image encoder which does not comprise a linear layer to obtain a visual feature sequence; adopting a cooperative attention mechanism to perform multi-modal fusion on the graph embedding vector and the visual feature sequence to obtain an attention re-weighted image sequence; and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report. The invention can improve the accuracy and reliability of medical image report generation.
Description
Technical Field
The invention relates to the technical field at the intersection of medicine and artificial intelligence, in particular to a medical image report generation method and device based on multi-modal fusion.
Background
In recent years, medical image report generation has become an important direction of collaborative research between computer scientists and medical professionals. An accurate and efficient medical image report can greatly improve a doctor's grasp of a patient's condition, reduce doctors' workload, assist doctors in making correct diagnoses, and provide corresponding medical guidance and advice for patients.
At present, research on medical image report generation technology is still at an early stage. In existing schemes, the medical knowledge graph is generally used only for subtasks such as classification and is not integrated into the generation model itself, so the accuracy and reliability of medical image report generation are not high.
Disclosure of Invention
The invention provides a medical image report generation method and device based on multi-modal fusion, which are used for overcoming the defect in the prior art that medical knowledge graphs are used only for subtasks such as classification and are not fused into the model itself, thereby improving the accuracy and reliability of medical image report generation.
The invention provides a medical image report generation method based on multi-mode fusion, which comprises the following steps:
Constructing a medical priori knowledge graph, and acquiring an initial feature vector of each node in the medical priori knowledge graph;
inputting the medical priori knowledge graph and the initial feature vector of each node in the medical priori knowledge graph into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder which does not comprise a linear layer to obtain a visual feature sequence;
adopting a cooperative attention mechanism to perform multi-mode fusion on the graph embedding vector and the visual feature sequence to obtain an attention re-weighted image sequence;
and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
According to the medical image report generation method based on multi-modal fusion provided by the invention, the constructing of the medical prior knowledge graph comprises the following steps:
acquiring a plurality of unmarked medical image report texts;
extracting a plurality of medical entities from the plurality of unmarked medical image report texts by adopting a named entity recognition algorithm;
adopting a clustering algorithm to reduce the dimension of the medical entities;
and constructing a medical priori knowledge map by taking the medical entities after dimension reduction as nodes and the relationship among the medical entities after dimension reduction as edges.
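The construction steps above (entity extraction, dimension reduction by clustering, and graph assembly from nodes and edges) can be sketched as a small routine. The entity names, the merge map standing in for the clustering step, and the relations below are hypothetical examples, not the patent's actual data:

```python
def build_knowledge_graph(entities, merge_map, relations):
    """Merge near-duplicate entities (the clustering / dimension-reduction
    step), then build node and edge lists for the prior knowledge graph."""
    def reduce(e):
        # Map each raw entity to its cluster representative.
        return merge_map.get(e, e)

    nodes = sorted({reduce(e) for e in entities})
    index = {name: i for i, name in enumerate(nodes)}
    # Edges connect the reduced entities; self-loop edges are added later
    # by the graph encoder's adjacency matrix.
    edges = [(index[reduce(a)], index[reduce(b)]) for a, b in relations]
    return nodes, edges

# Hypothetical entities/relations, as if extracted by NER from report texts.
entities = ["erythema", "redness", "scalp", "plaque"]
merge_map = {"redness": "erythema"}          # cluster near-synonyms
relations = [("erythema", "scalp"), ("plaque", "scalp")]
nodes, edges = build_knowledge_graph(entities, merge_map, relations)
```

Note how the merge map collapses "redness" into "erythema", so the resulting graph has three nodes instead of four.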
According to the medical image report generating method based on multi-mode fusion provided by the invention, the obtaining of the initial feature vector of each node in the medical priori knowledge graph comprises the following steps:
and initializing each node of the medical priori knowledge map through a word embedding model to obtain an initial feature vector of the node.
According to the medical image report generating method based on multi-mode fusion provided by the invention, the initial feature vector of each node in the medical priori knowledge graph and the medical priori knowledge graph is input into a graph encoder to obtain a graph embedding vector, and the method comprises the following steps:
building a graph encoder:
wherein ,an adjacency matrix representing the medical prior knowledge-graph, said adjacency matrix being accompanied by edges pointing to its own nodes,>representing an initial feature vector of the medical priori knowledge map, wherein the initial feature vector of the medical priori knowledge map is obtained by splicing initial feature vectors of all nodes in the medical priori knowledge map, and the +>A picture convolution feature vector representing the k-th layer,/>Graph roll feature vector representing layer k+1, d= Σ i D i ,/> Elements of the ith row and jth column of the adjacency matrix representing the medical prior knowledge-graph, W (k) Representing a trainable weight matrix, GC () representing a convolution function, σ () representing an activation function, dropout () representing a random discard function, BN () representing a batch normalization function;
and inputting the medical priori knowledge graph and the initial feature vector of each node in the medical priori knowledge graph into the graph encoder to obtain a graph convolution feature vector of the last layer as a graph embedding vector.
According to the medical image report generating method based on multi-modal fusion provided by the invention, the inputting of the medical image into an image encoder which does not comprise a linear layer to obtain a visual feature sequence comprises the following steps:
inputting the medical image into an image encoder which does not comprise a linear layer to obtain a four-dimensional visual feature matrix;
deforming the four-dimensional visual feature matrix into a three-dimensional visual feature matrix;
and converting the three-dimensional visual feature matrix into a visual feature sequence.
According to the medical image report generating method based on multi-mode fusion provided by the invention, the adoption of a cooperative attention mechanism carries out multi-mode fusion on the graph embedding vector and the visual feature sequence to obtain an attention re-weighted image sequence, and the method comprises the following steps:
Calculating an affinity matrix between the graph embedding vector and the visual feature sequence;
learning, by the affinity matrix, an attention map between the graph embedding vector and the visual feature sequence;
calculating an attention weight vector based on the attention map;
based on the visual feature sequence and the attention weight vector, a sequence of attention re-weighted images is calculated.
According to the medical image report generating method based on multi-mode fusion provided by the invention, the calculating of the affinity matrix between the graph embedding vector and the visual characteristic sequence comprises the following steps:
the affinity matrix between the graph embedding vector and the visual feature sequence is calculated by the following expression:

$C = \tanh(G_E^{T} W_b I_E)$

wherein C represents the affinity matrix, $G_E$ represents the graph embedding vector, $I_E$ represents the visual feature sequence, and $W_b$ represents a trainable weight matrix.
According to the medical image report generating method based on multi-mode fusion provided by the invention, the learning of the attention map between the graph embedding vector and the visual feature sequence through the affinity matrix comprises the following steps:
learning an attention map between the graph embedding vector and the visual feature sequence by:
$F_i = \tanh(W_i I_E + (W_g G_E) C)$

wherein $F_i$ represents the attention map learned from the graph embedding vector and the visual feature sequence via the affinity matrix, $W_i$ and $W_g$ both represent trainable weight matrices, C represents the affinity matrix, $G_E$ represents the graph embedding vector, and $I_E$ represents the visual feature sequence.
According to the medical image report generating method based on multi-mode fusion provided by the invention, the attention weight vector is calculated based on the attention map, and the method comprises the following steps:
the attention weight vector is calculated by the following expression:

$a_i = \mathrm{softmax}(w_{f_i} F_i)$

wherein $a_i$ represents the attention weight vector, $w_{f_i}$ represents a trainable weight matrix, and $F_i$ represents the attention map.
According to the medical image report generating method based on multi-mode fusion provided by the invention, the image sequence subjected to attention re-weighting is calculated based on the visual feature sequence and the attention weight vector, and the method comprises the following steps:
the attention re-weighted image sequence is calculated by the following expression:

$\hat{x}_r = a_r \cdot x_r, \quad r = 1, 2, \ldots, R$

wherein $\hat{x}_r$ represents an element of the attention re-weighted image sequence, $x_{1,2,\ldots,R}$ represent the elements of the visual feature sequence, $a_r$ represents the element of the attention weight vector corresponding to that element of the visual feature sequence, and $R = H' \times W'$ represents the number of image blocks.
According to the medical image report generation method based on multi-modal fusion provided by the invention, the memory-driven Transformer model comprises an encoder and a decoder, wherein the decoder comprises a decoding module provided with a memory-driven normalization layer.
According to the medical image report generation method based on multi-modal fusion provided by the invention, the inputting of the attention re-weighted image sequence into the memory-driven Transformer model to generate the medical image report comprises the following steps:
inputting the attention re-weighted image sequence into the encoder;
initializing the relational memory by adopting the graph embedding vector;
calculating a memory matrix of the last round of output of the relational memory;
and inputting the memory matrix output by the last round of the relational memory and the output result of the encoder into the decoding module provided with the memory-driven normalization layer to obtain a medical image report.
According to the medical image report generating method based on multi-mode fusion provided by the invention, the initializing the relational memory by adopting the graph embedding vector comprises the following steps:
Initializing the relational memory by the following expression:
M 0 =MLP(G E ·W m )
wherein $M_0$ represents the initial memory matrix of the relational memory, $G_E$ represents the graph embedding vector, $W_m$ represents a weight matrix, and MLP() represents a multi-layer perceptron used to build the mapping between dimensions.
According to the medical image report generating method based on multi-mode fusion provided by the invention, the memory matrix for calculating the last round of output of the relational memory comprises the following steps:
calculating the memory matrix of the last round of output of the relational memory by the following expression:
$Z = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$

$\hat{M}_t = \mathrm{MLP}(Z)$

$M_t = g_f \odot M_{t-1} + g_o \odot \hat{M}_t$

wherein $Q = M_{t-1} \cdot W_Q$, $K = [M_{t-1}; y_{t-1}] \cdot W_k$, $V = [M_{t-1}; y_{t-1}] \cdot W_v$, $M_{t-1}$ represents the memory matrix output by the previous round of the relational memory, $y_{t-1}$ represents the word embedding vector of the word predicted in the previous round, $W_Q$, $W_k$ and $W_v$ are all trainable weight matrices, $d_k$ represents the scaling factor, obtained by dividing the dimension of K by the number of attention heads, MLP() represents the multi-layer perceptron, $g_f$ and $g_o$ are the forget gate and the output gate used to balance $M_{t-1}$ and $y_{t-1}$, $\hat{M}_t$ represents the multi-head attention output matrix mapped by the multi-layer perceptron, and $M_t$ represents the memory matrix output by the last round of the relational memory.
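A minimal NumPy sketch of one relational-memory update follows. The gate parameterization (`W_f`, `W_o`) is a hypothetical stand-in for the learned forget/output gates, and the MLP mapping of the attention output is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)
d, slots, heads = 8, 3, 2
d_k = d // heads                               # per-head scaling dimension

M_prev = rng.standard_normal((slots, d))       # memory matrix from the previous step
y_prev = rng.standard_normal((1, d))           # embedding of the previously predicted word
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Q = M_prev @ W_Q
KV_in = np.concatenate([M_prev, y_prev], axis=0)   # [M_{t-1}; y_{t-1}]
K, V = KV_in @ W_K, KV_in @ W_V
Z = softmax(Q @ K.T / np.sqrt(d_k)) @ V            # scaled dot-product attention

# Hypothetical gate parameters standing in for the learned gates.
W_f, W_o = rng.standard_normal((d, d)), rng.standard_normal((d, d))
g_f = sigmoid(M_prev @ W_f + y_prev @ W_f)         # forget gate
g_o = sigmoid(M_prev @ W_o + y_prev @ W_o)         # output gate
M_t = g_f * M_prev + g_o * np.tanh(Z)              # updated memory matrix
```

The gates blend the previous memory with the new attention readout, so information about earlier decoding steps persists across the report.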
According to the medical image report generating method based on multi-mode fusion provided by the invention, the memory matrix output by the last round of the relational memory and the output result of the encoder are input into the decoding module provided with the memory-driven normalization layer to obtain the medical image report, and the medical image report generating method comprises the following steps:
Calculating an output result of the decoding module provided with the memory-driven normalization layer by the following expression:
$\gamma_t = \gamma + \mathrm{MLP}(M_t)$

$\beta_t = \beta + \mathrm{MLP}(M_t)$

$\mathrm{MCLN}(r, M_t) = \gamma_t \odot \frac{r - \mu}{\nu} + \beta_t$

$\psi = T_E(\hat{x}_1, \ldots, \hat{x}_R)$

$\theta = T_D(\psi, N, \mathrm{RM}(M_{t-1}, y_{t-1}), \mathrm{MCLN}(r, M_t))$

wherein ψ represents the output of the encoder, N represents the number of decoder layers, γ represents a learnable parameter matrix for scaling used to improve generalization ability, $\gamma_t$ represents the result of adding γ and $M_t$ mapped by the multi-layer perceptron, β represents a learnable parameter matrix for shifting used to improve generalization ability, $\beta_t$ represents the result of adding β and $M_t$ mapped by the multi-layer perceptron, r represents the hidden state entering the normalization layer, μ represents the mean and ν the standard deviation used for the normalization, $T_E()$ represents the encoder, $T_D()$ represents the decoder, and RM() represents the relational memory.
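A sketch of the memory-driven conditional layer normalization under toy dimensions; the one-layer tanh MLP and the flattened memory vector are illustrative assumptions, not the patent's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
r = rng.standard_normal(d)        # hidden state entering the normalization layer
M_t = rng.standard_normal(d)      # flattened memory matrix (illustrative)
gamma = np.ones(d)                # learnable scale parameters
beta = np.zeros(d)                # learnable shift parameters

def mlp(x, W):                    # one-layer stand-in for the patent's MLP
    return np.tanh(x @ W)

W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
gamma_t = gamma + mlp(M_t, W1)    # memory-conditioned scale
beta_t = beta + mlp(M_t, W2)      # memory-conditioned shift
mu, nu = r.mean(), r.std()        # normalization statistics of the hidden state
out = gamma_t * (r - mu) / (nu + 1e-5) + beta_t
```

Compared with plain layer normalization, the scale and shift here are conditioned on the memory matrix, which is how the decoder injects the relational memory at every layer.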
The invention also provides a medical image report generating device based on multi-modal fusion, which comprises:
the map construction module is used for constructing a medical priori knowledge map and acquiring an initial feature vector of each node in the medical priori knowledge map;
the graph encoder module is used for inputting the medical priori knowledge graph and the initial feature vector of each node in the medical priori knowledge graph into the graph encoder to obtain a graph embedding vector;
the image encoder module is used for inputting the medical image into an image encoder which does not comprise a linear layer to obtain a visual characteristic sequence;
The multi-mode fusion module is used for adopting a cooperative attention mechanism to perform multi-mode fusion on the graph embedding vector and the visual feature sequence to obtain an attention re-weighted image sequence;
and the report generation module is used for inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the medical image report generating method based on the multi-modal fusion when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a medical image report generating method based on multimodal fusion as described in any one of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a medical image report generating method based on multimodal fusion as described in any one of the above.
The invention provides a medical image report generation method and device based on multi-modal fusion. First, a medical prior knowledge graph is constructed and an initial feature vector of each node in the graph is acquired; then, the graph and the initial feature vectors of its nodes are input into a graph encoder to obtain a graph embedding vector; the medical image is input into an image encoder which does not comprise a linear layer to obtain a visual feature sequence; a cooperative attention mechanism is adopted to perform multi-modal fusion on the graph embedding vector and the visual feature sequence to obtain an attention re-weighted image sequence, which fuses the medical prior knowledge graph and the medical image; finally, the attention re-weighted image sequence is input into a memory-driven Transformer model to generate a medical image report. Because the attention re-weighted image sequence fuses the medical prior knowledge graph and the medical image, the memory-driven Transformer model has better understanding capability and a more robust grasp of medical prior knowledge, so the accuracy and reliability of medical image report generation can be improved.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a medical image report generating method based on multi-modal fusion;
FIG. 2 is a schematic structural diagram of a medical image report generation model based on a medical prior knowledge graph and memory driving provided by the invention;
FIG. 3 is a schematic diagram of constructing a medical prior knowledge graph provided by the invention;
FIG. 4 is a schematic structural diagram of a medical image report generating device based on multi-modal fusion;
fig. 5 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The medical image report generation method based on the multi-modal fusion of the present invention is described below with reference to fig. 1 to 3.
Referring to fig. 1, fig. 1 is a flow chart of a medical image report generating method based on multi-mode fusion according to the present invention. As shown in fig. 1, the medical image report generating method based on multi-mode fusion provided by the invention may include the following steps:
Step 101, constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
Step 102, inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
Step 103, inputting the medical image into an image encoder which does not comprise a linear layer to obtain a visual feature sequence;
Step 104, adopting a cooperative attention mechanism to perform multi-modal fusion on the graph embedding vector and the visual feature sequence to obtain an attention re-weighted image sequence;
Step 105, inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
In step 101, a natural language processing method is adopted to construct a medical prior knowledge graph of appropriate scale, rather than manually constructing one by selecting only a few keywords. The medical prior knowledge graph of this embodiment has the following characteristics:
1) Comprehensive entity types
The medical prior knowledge graph contains comprehensive entity types and can describe disease symptoms from all perspectives, not just disease names. For example, for skin diseases, it may include entity information such as disease name, location, shape, and color.
2) Appropriate graph scale
The scale of the medical prior knowledge graph is appropriate: if it is too large, training and learning are difficult; if it is too small, not enough prior knowledge is retained.
3) Comprehensive entity relationships
The relationships between entities can be established automatically by a relation extraction method and supplemented manually, so that the medical prior knowledge graph embodies more prior knowledge.
In this step, in order to complete the subsequent steps, each node in the medical priori knowledge map needs to be initialized, and an initial feature vector of each node in the medical priori knowledge map is obtained. Optionally, each node in the medical prior knowledge-graph is initialized by a word embedding model.
In step 102, a graph encoder is used to extract a graph embedding vector of a medical prior knowledge graph. Inputting the medical priori knowledge graph and the initial feature vector of each node in the medical priori knowledge graph into a graph encoder, and obtaining a graph embedding vector through graph encoding processing.
In step 103, a derivative of the convolutional neural network is employed as the visual extractor, for example the residual network ResNet or the densely connected convolutional network DenseNet. The image encoder used in this embodiment does not include the final linear layer and instead outputs the result of the pooling layer.
In step 104, the image encoder may obtain visual features of the medical image, but may not obtain high-level semantic information well. In the embodiment, a collaborative attention mechanism is adopted, the graph embedded vector and the visual feature sequence are subjected to multi-mode fusion, a visual question-answering process is simulated, and finally an image sequence subjected to attention re-weighting is obtained.
In step 105, as shown in fig. 2, the memory-driven Transformer model optionally comprises an encoder and a decoder, wherein the decoder comprises a decoding module provided with a memory-driven normalization layer. The memory-driven normalization layer consists of three Memory-driven Conditional Layer Normalization (MCLN) layers, which enhance the decoding capability of the memory-driven Transformer model and increase its generalization.
The attention re-weighted image sequence is input into the memory-driven Transformer model to generate a medical image report.
In this embodiment, since the attention re-weighted image sequence fuses the medical prior knowledge graph and the medical image, the memory-driven Transformer model has better understanding capability and a more robust grasp of medical prior knowledge, and can improve the accuracy and reliability of medical image report generation.
Optionally, constructing the medical prior knowledge-graph in step 101 includes the following sub-steps:
step 1011, obtaining a plurality of untagged medical image report texts;
step 1012, extracting a plurality of medical entities from a plurality of unlabeled medical image report texts by adopting a named entity recognition algorithm;
step 1013, performing dimension reduction on a plurality of medical entities by adopting a clustering algorithm;
and 1014, constructing a medical priori knowledge map by taking the medical entities after dimension reduction as nodes and the relation among the medical entities after dimension reduction as edges.
In step 1011, several unlabeled medical image report texts are acquired, such as: a large number of unannotated reports, training data sets, available text information provided by the physician, and the like. Different types of reports may be selected as the underlying data for different tasks.
In step 1012, using a named entity recognition algorithm, a number of medical entities may be effectively extracted from a number of unlabeled medical image report texts, which are stored as key nodes of a medical prior knowledge graph.
In step 1013, the nodes identified by named entity recognition may contain a large amount of similar content; if all nodes were retained, many redundant structures would be generated and the medical prior knowledge graph would become too large. Therefore, a text processing method and a clustering algorithm are required to reduce the plurality of medical entities.
In step 1014, relationships between the medical entities are established using relationship extraction and entity dependencies are established with the assistance of manual design. And constructing a medical priori knowledge map by taking the medical entities after dimension reduction as nodes and the relationship among the medical entities after dimension reduction as edges.
In this embodiment, instead of manually constructing by selecting only a few keywords, a medical prior knowledge graph of appropriate size may be constructed.
Specifically, as shown in fig. 3, two data sets are used to construct the medical prior knowledge graph. Because the two data sets differ in language, the construction of the medical prior knowledge graph also differs somewhat between them. For the IU-Xray dataset, Stanza BioMedical is used as the backbone method for named entity recognition and relation extraction, and a medical prior knowledge graph with 284 key nodes is finally obtained through clustering. Each node obtains a 768-dimensional feature vector through BioBert, which serves as the initial feature of the medical prior knowledge graph. Similarly, for the NCRC-DS dataset, CMeKG is used to extract Chinese medical entities and entity triples. CMeKG is a tool library for Chinese medical knowledge graphs, providing open-source implementations of named entity recognition, relation extraction, medical word segmentation, etc. After clustering, a knowledge graph containing 191 key nodes is finally obtained. To obtain the initial features of each node, the node keywords are input into the Chinese medical Bert model provided by CMeKG, yielding 768-dimensional initial vectors.
Optionally, step 102 comprises the sub-steps of:
step 1021, constructing a graph encoder:
GC(X^(k), Ã) = D^(-1) Ã X^(k) W^(k)   (1)

X^(k+1) = X^(k) + Dropout(BN(σ(GC(X^(k), Ã))))   (2)

wherein Ã represents the adjacency matrix of the medical prior knowledge graph, with a self-loop edge added so that each node points to itself; X^(0) represents the initial feature vector of the medical prior knowledge graph, obtained by splicing the initial feature vectors of all nodes in the graph; X^(k) represents the graph convolution feature vector of the k-th layer and X^(k+1) that of the (k+1)-th layer; D is the degree matrix with D_ii = Σ_j Ã_ij, and D^(-1)Ã is used to normalize the aggregated node features; Ã_ij represents the element in row i and column j of the adjacency matrix; W^(k) represents a trainable weight matrix; GC() represents the graph convolution function, σ() the activation function, Dropout() the random discard function, and BN() the batch normalization function;
step 1022, inputting the medical prior knowledge graph and the initial feature vector of each node in the graph into the graph encoder to obtain the graph convolution feature vector of the final layer as the graph embedding vector.
In this embodiment, a random discard layer, a batch normalization layer, and a residual connection are added between every two graph convolution layers, which improves the expressive capability of the graph encoder; the graph embedding vector of the medical prior knowledge graph is then extracted by the graph encoder.
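A rough sketch of such a graph encoder forward pass is given below, under stated assumptions: ReLU stands in for σ(), a per-feature standardisation stands in for a trained batch-norm layer, dropout is omitted (as at inference time), and the weights are random rather than trained.

```python
import numpy as np

def gc_layer(X, A_tilde, W):
    """One graph convolution: D^-1 * A_tilde aggregates neighbour features."""
    D_inv = np.diag(1.0 / A_tilde.sum(axis=1))
    return np.maximum(D_inv @ A_tilde @ X @ W, 0.0)  # ReLU as sigma

def graph_encoder(X, A, n_layers=3, seed=0):
    rng = np.random.default_rng(seed)
    A_tilde = A + np.eye(len(A))          # add self-loop edges
    H = X
    for _ in range(n_layers):
        W = rng.normal(scale=0.1, size=(H.shape[1], H.shape[1]))
        Z = gc_layer(H, A_tilde, W)
        # batch-norm stand-in: per-feature standardisation
        Z = (Z - Z.mean(0)) / (Z.std(0) + 1e-5)
        H = H + Z                         # residual connection (dropout omitted here)
    return H                              # final-layer graph embedding

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3-node toy graph
X = np.random.default_rng(1).normal(size=(3, 8))              # 8-dim node features
G_E = graph_encoder(X, A)
print(G_E.shape)  # (3, 8)
```

The default of 3 graph convolution layers matches the configuration quoted later for the experiments.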
Optionally, step 103 comprises the sub-steps of:
step 1031, inputting the medical image imag, with dimensions [B, C, H, W], into an image encoder VE() that does not include a linear layer, to obtain a four-dimensional visual feature matrix VE(imag) with dimensions [B, F, H', W'];
step 1032, deforming the four-dimensional visual feature matrix into a three-dimensional visual feature matrix;
I_E = reshape(VE(imag))   (3)

wherein I_E is the three-dimensional visual feature matrix, with dimensions [B, F, H'×W'], and reshape() represents the deformation function;

step 1033, converting the three-dimensional visual feature matrix into the visual feature sequence x_1, x_2, …, x_{H'×W'}.
In this embodiment, the medical image is input to an image encoder that does not include a linear layer, and further deformed to obtain a visual feature sequence.
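The deformation of steps 1031 to 1033 amounts to two tensor reshapes, illustrated below; the concrete sizes B=2, F=512, H'=W'=7 are arbitrary stand-ins, and the backbone feature map is random rather than the output of a real image encoder.

```python
import numpy as np

# [B, C, H, W] medical image -> backbone feature map [B, F, H', W']
B, F, Hp, Wp = 2, 512, 7, 7
feat = np.random.default_rng(0).normal(size=(B, F, Hp, Wp))

# step 1032: flatten the spatial grid, [B, F, H', W'] -> [B, F, H'*W']
I_E = feat.reshape(B, F, Hp * Wp)

# step 1033: one feature vector per image patch, x_1 .. x_{H'*W'}
patches = I_E.transpose(0, 2, 1)  # [B, H'*W', F]
print(I_E.shape, patches.shape)   # (2, 512, 49) (2, 49, 512)
```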
Optionally, step 104 comprises the sub-steps of:
step 1041, calculating an affinity matrix between the graph embedding vector and the visual feature sequence;
step 1042, embedding an attention map between the vector and the visual feature sequence by the learning map through the affinity matrix;
step 1043, calculating an attention weight vector based on the attention map;
step 1044, calculating an attention re-weighted image sequence based on the visual feature sequence and the attention weight vector.
In step 1041, the affinity matrix between the graph embedding vector and the visual feature sequence is calculated by the following expression:

C = tanh(G_E^T W_b I_E)   (4)

wherein C represents the affinity matrix, G_E represents the graph embedding vector, I_E represents the visual feature sequence, and W_b represents a weight matrix.
In step 1042, the attention map between the graph embedding vector and the visual feature sequence is learned by the following expression:

F_i = tanh(W_i I_E + (W_g G_E) C)   (5)

wherein F_i represents the attention map learned from the graph embedding vector and the visual feature sequence through the affinity matrix, W_i and W_g both represent trainable weight matrices, C represents the affinity matrix, G_E represents the graph embedding vector, and I_E represents the visual feature sequence.
In step 1043, the attention weight vector is calculated by the following expression:

a_i = softmax(w_fi F_i)   (6)

wherein a_i represents the attention weight vector, w_fi represents a trainable weight matrix, and F_i represents the attention map result.
In step 1044, the attention re-weighted image sequence is calculated by the following expression:

x̂_r = a_r · x_r ,  r = 1, 2, …, R   (7)

wherein x̂_1, x̂_2, …, x̂_R represent the elements of the attention re-weighted image sequence, x_1, x_2, …, x_R represent the elements of the visual feature sequence, a_r represents the element of the attention weight vector corresponding to x_r, and R = H'×W' represents the number of image blocks.
In this embodiment, a collaborative attention mechanism is adopted to perform multi-modal fusion of the graph embedding vector output by the graph encoder and the visual feature sequence output by the image encoder, so that the resulting attention re-weighted image sequence fuses the visual features of the medical image with the high-level semantic information of the medical prior knowledge graph.
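The co-attention computation of steps 1041 to 1044 can be sketched numerically as below. The matrix orientations (features stored as columns) and the use of one scalar weight per image patch are assumptions made for illustration; all weights are random stand-ins for trained parameters.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, n_nodes, R = 16, 5, 9            # feature dim, graph nodes, image patches
G_E = rng.normal(size=(d, n_nodes)) # graph embedding (one column per node)
I_E = rng.normal(size=(d, R))       # visual feature sequence (one column per patch)
W_b = rng.normal(scale=0.1, size=(d, d))
W_i = rng.normal(scale=0.1, size=(d, d))
W_g = rng.normal(scale=0.1, size=(d, d))
w_f = rng.normal(scale=0.1, size=(d,))

C = np.tanh(G_E.T @ W_b @ I_E)                # affinity matrix, [n_nodes, R]
F_att = np.tanh(W_i @ I_E + (W_g @ G_E) @ C)  # attention map, [d, R]
a = softmax(w_f @ F_att)                      # one weight per patch, sums to 1
X_hat = I_E * a                               # re-weight each patch feature column

print(C.shape, a.shape, X_hat.shape)  # (5, 9) (9,) (16, 9)
```

Each column of X_hat is a visual feature re-weighted by how strongly it responds to the graph's semantic content.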
Optionally, step 105 comprises the sub-steps of:
step 1051, inputting the attention re-weighted image sequence into an encoder;
step 1052, initializing the relational memory by adopting the graph embedding vector;
step 1053, calculating the memory matrix of the last round of output of the relational memory;
step 1054, inputting the memory matrix output in the last round of the relational memory and the output result of the encoder into a decoding module provided with a memory-driven normalization layer, to obtain a medical image report.
In step 1051, the output result of the encoder is calculated by the following expression:

ψ = T_E(x̂_1, x̂_2, …, x̂_R, N)   (8)

wherein ψ represents the output result of the encoder, N represents the number of encoder layers, and T_E() represents the encoder.
In step 1052, the relational memory is used to store content shared across model training, enhancing the learning ability of the model. Specifically, a memory matrix comprising a plurality of rows is provided, each row being regarded as a slot for storing specific pattern information.
Initializing a relational memory by the following expression:
M_0 = MLP(G_E · W_m)   (9)

wherein M_0 represents the initial memory matrix of the relational memory, W_m represents a weight matrix, and MLP() represents a multi-layer perceptron used to build the mapping between dimensions.
In step 1053, the memory matrix of the last round of output of the relational memory is calculated by the following expressions:

M̃_t = MLP(softmax(Q K^T / √d_k) V) + M_{t-1}   (10)

G_t^f = y_{t-1} W^f + tanh(M_{t-1}) U^f ,  G_t^o = y_{t-1} W^o + tanh(M_{t-1}) U^o   (11)

M_t = σ(G_t^f) ⊙ M_{t-1} + σ(G_t^o) ⊙ tanh(M̃_t)   (12)

wherein Q = M_{t-1}·W_Q, K = [M_{t-1}; y_{t-1}]·W_k, V = [M_{t-1}; y_{t-1}]·W_v; M_{t-1} represents the memory matrix output in the previous round of the relational memory; y_{t-1} represents the word embedding vector predicted in the previous round; W_Q, W_k, W_v, W^f, U^f, W^o, U^o are all trainable weight matrices; d_k represents the scaling factor, obtained by dividing the dimension of K by the number of attention heads; MLP() represents the multi-layer perceptron; G_t^f and G_t^o are the forget gate and output gate used to balance M_{t-1} and y_{t-1}; σ() represents the sigmoid function and ⊙ element-wise multiplication; M̃_t represents the multi-head attention output matrix mapped by the multi-layer perceptron; and M_t represents the memory matrix of the last round of output of the relational memory.
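The relational-memory update can be sketched numerically as follows. This is illustrative only: single-head attention is used, the MLP over the attention output is omitted, and randomly drawn sigmoid gates stand in for the learned forget and output gates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax_rows(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
S, d = 3, 32                      # memory slots, model dim (3 slots, as configured later)
M_prev = rng.normal(size=(S, d))  # M_{t-1}: memory matrix from the previous round
y_prev = rng.normal(size=(1, d))  # y_{t-1}: previous predicted word embedding
W_Q, W_K, W_V = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))

Q = M_prev @ W_Q
K = np.vstack([M_prev, y_prev]) @ W_K     # keys over the concatenation [M_{t-1}; y_{t-1}]
V = np.vstack([M_prev, y_prev]) @ W_V
Z = softmax_rows(Q @ K.T / np.sqrt(d)) @ V

M_tilde = Z + M_prev                      # residual; MLP over Z omitted in this sketch
# gated update balancing old memory and new content (random stand-in gates)
G_f = sigmoid(rng.normal(size=(S, d)))
G_o = sigmoid(rng.normal(size=(S, d)))
M_t = G_f * M_prev + G_o * np.tanh(M_tilde)
print(M_t.shape)  # (3, 32)
```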
In step 1054, the output result of the decoding module provided with the memory-driven normalization layer is calculated by the following expressions:

γ_t = γ + MLP(M_t)   (13)

β_t = β + MLP(M_t)   (14)

MCLN(r, M_t) = γ_t ⊙ (r − μ)/ν + β_t   (15)

θ = T_D(ψ, N, RM(M_{t-1}, y_{t-1}), MCLN(r, M_t))   (16)

wherein ψ represents the output result of the encoder, N represents the number of decoder layers, γ represents a matrix of learnable scaling parameters used to improve generalization ability, γ_t represents the result of adding γ and M_t mapped by the multi-layer perceptron, β represents a matrix of learnable shift parameters used to improve generalization ability, β_t represents the result of adding β and M_t mapped by the multi-layer perceptron, r represents the input of the memory-driven normalization layer, μ represents the mean of r, ν represents the standard deviation of r, θ represents the output result of the decoding module, T_E() represents the encoder, T_D() represents the decoder, RM() represents the relational memory, and MCLN() represents the memory-driven normalization layer.
In this embodiment, the graph embedding vector is used to initialize the relational memory, instead of initializing the memory matrix to all zeros, thereby optimizing the memory-driven Transformer model. Moreover, the input to the decoder of the memory-driven Transformer model fuses the medical prior knowledge graph and the medical image, giving the model a better and more robust understanding of medical prior knowledge, which can improve the accuracy and reliability of medical image report generation.
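The memory-conditioned normalization of step 1054 can be sketched as below, assuming a single linear layer as the MLP, γ initialised to ones and β to zeros, and μ, ν taken as the mean and standard deviation of the normalization-layer input r; all weights are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d, S = 32, 3
r = rng.normal(size=(d,))          # input of the memory-driven normalization layer
M_t = rng.normal(size=(S, d))      # current memory matrix
gamma = np.ones(d)                 # learnable scale parameters
beta = np.zeros(d)                 # learnable shift parameters
W_mlp = rng.normal(scale=0.1, size=(S * d, d))  # single linear layer as the MLP

delta = M_t.reshape(-1) @ W_mlp    # MLP(M_t): condition on the flattened memory
gamma_t = gamma + delta            # eq. (13)
beta_t = beta + delta              # eq. (14)

mu, nu = r.mean(), r.std()
out = gamma_t * (r - mu) / (nu + 1e-5) + beta_t  # memory-conditioned layer norm
print(out.shape)  # (32,)
```

The effect is that the scale and shift of layer normalization are no longer fixed but driven by the current state of the relational memory.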
Specifically, taking the IU-Xray and NCRC-DS datasets as examples, a DenseNet-121 model pre-trained on CheXpert is selected as the backbone network of the image encoder for the IU-Xray dataset. Two chest radiographs sharing the same report are input into the model, and their features are spliced and delivered to the encoder of the memory-driven Transformer model. A medical prior knowledge graph containing 284 nodes is used as medical prior knowledge. For the NCRC-DS dataset, a ResNet-101 model pre-trained on ImageNet is chosen as the backbone network of the image encoder. Because of the small size of this dataset, only one skin disease picture and its corresponding description report are input at a time, and a medical prior knowledge graph containing 191 nodes is used as medical prior knowledge. By default, the number of graph convolution layers is 3, the number of slots of the relational memory is 3, and the word embedding dimension is set to 512. The model is trained with the Adam optimizer under a cross-entropy loss; during training, the BLEU-4 score on the test set is evaluated, and weight decay and early stopping are applied. In fields such as chest radiography and skin disease, the medical image report generation method based on multi-modal fusion of this embodiment is superior in accuracy and reliability to currently common medical image report generation models.
The medical image report generating device based on multi-mode fusion provided by the invention is described below, and the medical image report generating device based on multi-mode fusion described below and the medical image report generating method based on multi-mode fusion described above can be correspondingly referred to each other.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a medical image report generating device based on multi-mode fusion according to the present invention. As shown in fig. 4, the medical image report generating device based on multi-mode fusion provided by the invention may include:
the map construction module 10 is used for constructing a medical priori knowledge map and acquiring an initial feature vector of each node in the medical priori knowledge map;
a graph encoder module 20, configured to input the medical prior knowledge graph and an initial feature vector of each node in the medical prior knowledge graph into a graph encoder to obtain a graph embedding vector;
an image encoder module 30 for inputting the medical image into an image encoder that does not include a linear layer, resulting in a sequence of visual features;
the multi-mode fusion module 40 is configured to perform multi-mode fusion on the graph embedding vector and the visual feature sequence by adopting a cooperative attention mechanism, so as to obtain an image sequence after attention re-weighting;
The report generating module 50 is configured to input the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
Optionally, the map construction module 10 is specifically configured to:
acquiring a plurality of unmarked medical image report texts;
extracting a plurality of medical entities from the plurality of unmarked medical image report texts by adopting a named entity recognition algorithm;
adopting a clustering algorithm to reduce the dimension of the medical entities;
and constructing a medical priori knowledge map by taking the medical entities after dimension reduction as nodes and the relationship among the medical entities after dimension reduction as edges.
Optionally, the map construction module 10 is specifically configured to:
and initializing each node of the medical priori knowledge map through a word embedding model to obtain an initial feature vector of the node.
Optionally, the graph encoder module 20 is specifically configured to:
building a graph encoder:
GC(X^(k), Ã) = D^(-1) Ã X^(k) W^(k)

X^(k+1) = X^(k) + Dropout(BN(σ(GC(X^(k), Ã))))

wherein Ã represents the adjacency matrix of the medical prior knowledge graph, with a self-loop edge added so that each node points to itself; X^(0) represents the initial feature vector of the medical prior knowledge graph, obtained by splicing the initial feature vectors of all nodes in the graph; X^(k) represents the graph convolution feature vector of the k-th layer and X^(k+1) that of the (k+1)-th layer; D is the degree matrix with D_ii = Σ_j Ã_ij, and D^(-1)Ã is used to normalize the aggregated node features; Ã_ij represents the element in row i and column j of the adjacency matrix; W^(k) represents a trainable weight matrix; GC() represents the graph convolution function, σ() the activation function, Dropout() the random discard function, and BN() the batch normalization function;
and inputting the medical priori knowledge graph and the initial feature vector of each node in the medical priori knowledge graph into the graph encoder to obtain a graph convolution feature vector of the last layer as a graph embedding vector.
Optionally, the image encoder module 30 is specifically configured to:
inputting the medical image into an image encoder which does not comprise a linear layer to obtain a four-dimensional visual feature matrix;
deforming the four-dimensional visual feature matrix into a three-dimensional visual feature matrix;
and converting the three-dimensional visual characteristic matrix into a visual characteristic sequence.
Optionally, the multimodal fusion module 40 is specifically configured to:
calculating an affinity matrix between the graph embedding vector and the visual feature sequence;
learning, by the affinity matrix, an attention map between the graph embedding vector and the visual feature sequence;
Calculating an attention weight vector based on the attention map;
based on the visual feature sequence and the attention weight vector, a sequence of attention re-weighted images is calculated.
Optionally, the multimodal fusion module 40 is specifically configured to:
the affinity matrix between the graph embedding vector and the visual feature sequence is calculated by the following expression:

C = tanh(G_E^T W_b I_E)

wherein C represents the affinity matrix, G_E represents the graph embedding vector, I_E represents the visual feature sequence, and W_b represents a weight matrix.
Optionally, the multimodal fusion module 40 is specifically configured to:
learning an attention map between the graph embedding vector and the visual feature sequence by:
F_i = tanh(W_i I_E + (W_g G_E) C)

wherein F_i represents the output result of learning the graph embedding vector and the visual feature sequence by means of the affinity matrix, W_i and W_g both represent trainable weight matrices, C represents the affinity matrix, G_E represents the graph embedding vector, and I_E represents the visual feature sequence.
Optionally, the multimodal fusion module 40 is specifically configured to:
the attention weight vector is calculated by the following expression:

a_i = softmax(w_fi F_i)

wherein a_i represents the attention weight vector, w_fi represents a trainable weight matrix, and F_i represents the attention map result.
Optionally, the multimodal fusion module 40 is specifically configured to:
the attention re-weighted image sequence is calculated by the following expression:

x̂_r = a_r · x_r ,  r = 1, 2, …, R

wherein x̂_1, x̂_2, …, x̂_R represent the elements of the attention re-weighted image sequence, x_1, x_2, …, x_R represent the elements of the visual feature sequence, a_r represents the element of the attention weight vector corresponding to x_r, and R = H'×W' represents the number of image blocks.
Optionally, the memory-driven Transformer model comprises: an encoder and a decoder, the decoder comprising a decoding module provided with a memory-driven normalization layer.
Alternatively, the report generating module 50 is specifically configured to:
inputting the attention re-weighted image sequence into the encoder;
initializing the relational memory by adopting the graph embedding vector;
calculating a memory matrix of the last round of output of the relational memory;
and inputting the memory matrix output by the last round of the relational memory and the output result of the encoder into the decoding module provided with the memory-driven normalization layer to obtain a medical image report.
Alternatively, the report generating module 50 is specifically configured to:
initializing the relational memory by the following expression:
M_0 = MLP(G_E · W_m)

wherein M_0 represents the initial memory matrix of the relational memory, G_E represents the graph embedding vector, W_m represents a weight matrix, and MLP() represents a multi-layer perceptron used to build the mapping between dimensions.
Alternatively, the report generating module 50 is specifically configured to:
calculating the memory matrix of the last round of output of the relational memory by the following expressions:

M̃_t = MLP(softmax(Q K^T / √d_k) V) + M_{t-1}

G_t^f = y_{t-1} W^f + tanh(M_{t-1}) U^f ,  G_t^o = y_{t-1} W^o + tanh(M_{t-1}) U^o

M_t = σ(G_t^f) ⊙ M_{t-1} + σ(G_t^o) ⊙ tanh(M̃_t)

wherein Q = M_{t-1}·W_Q, K = [M_{t-1}; y_{t-1}]·W_k, V = [M_{t-1}; y_{t-1}]·W_v; M_{t-1} represents the memory matrix output in the previous round of the relational memory; y_{t-1} represents the word embedding vector predicted in the previous round by the relational memory; W_Q, W_k, W_v, W^f, U^f, W^o, U^o are all trainable weight matrices; d_k represents the scaling factor, obtained by dividing the dimension of K by the number of attention heads; MLP() represents the multi-layer perceptron; G_t^f and G_t^o are the forget gate and output gate used to balance M_{t-1} and y_{t-1}; σ() represents the sigmoid function and ⊙ element-wise multiplication; M̃_t represents the multi-head attention output matrix mapped by the multi-layer perceptron; and M_t represents the memory matrix of the last round of output of the relational memory.
Alternatively, the report generating module 50 is specifically configured to:
calculating an output result of the decoding module provided with the memory-driven normalization layer by the following expressions:

γ_t = γ + MLP(M_t)

β_t = β + MLP(M_t)

MCLN(r, M_t) = γ_t ⊙ (r − μ)/ν + β_t

θ = T_D(ψ, N, RM(M_{t-1}, y_{t-1}), MCLN(r, M_t))

wherein ψ represents the output result of the encoder, N represents the number of decoder layers, γ represents a matrix of learnable scaling parameters used to improve generalization ability, γ_t represents the result of adding γ and M_t mapped by the multi-layer perceptron, β represents a matrix of learnable shift parameters used to improve generalization ability, β_t represents the result of adding β and M_t mapped by the multi-layer perceptron, r represents the input of the memory-driven normalization layer, μ represents the mean of r, ν represents the standard deviation of r, θ represents the output result of the decoding module, T_E() represents the encoder, T_D() represents the decoder, RM() represents the relational memory, and MCLN() represents the memory-driven normalization layer.
Fig. 5 illustrates a physical schematic diagram of an electronic device, as shown in fig. 5, which may include: processor 810, communication interface (Communications Interface) 820, memory 830, and communication bus 840, wherein processor 810, communication interface 820, memory 830 accomplish communication with each other through communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform a multi-modal fusion based medical image report generation method comprising:
constructing a medical priori knowledge graph, and acquiring an initial feature vector of each node in the medical priori knowledge graph;
inputting the medical priori knowledge graph and the initial feature vector of each node in the medical priori knowledge graph into a graph encoder to obtain a graph embedding vector;
Inputting the medical image into an image encoder which does not comprise a linear layer to obtain a visual characteristic sequence;
adopting a cooperative attention mechanism to perform multi-mode fusion on the graph embedding vector and the visual feature sequence to obtain an attention re-weighted image sequence;
and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
Further, the logic instructions in the memory 830 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, where the computer program product includes a computer program, where the computer program can be stored on a non-transitory computer readable storage medium, and when the computer program is executed by a processor, the computer can perform a medical image report generating method based on multi-modal fusion provided by the above methods, and the method includes:
constructing a medical priori knowledge graph, and acquiring an initial feature vector of each node in the medical priori knowledge graph;
inputting the medical priori knowledge graph and the initial feature vector of each node in the medical priori knowledge graph into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder which does not comprise a linear layer to obtain a visual characteristic sequence;
adopting a cooperative attention mechanism to perform multi-mode fusion on the graph embedding vector and the visual feature sequence to obtain an attention re-weighted image sequence;
and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
In yet another aspect, the present invention further provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the method for generating a medical image report based on multi-modal fusion provided by the above methods, the method comprising:
Constructing a medical priori knowledge graph, and acquiring an initial feature vector of each node in the medical priori knowledge graph;
inputting the medical priori knowledge graph and the initial feature vector of each node in the medical priori knowledge graph into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder which does not comprise a linear layer to obtain a visual characteristic sequence;
adopting a cooperative attention mechanism to perform multi-mode fusion on the graph embedding vector and the visual feature sequence to obtain an attention re-weighted image sequence;
and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (18)
1. A medical image report generation method based on multi-modal fusion, comprising:
constructing a medical priori knowledge graph, and acquiring an initial feature vector of each node in the medical priori knowledge graph;
inputting the medical priori knowledge graph and the initial feature vector of each node in the medical priori knowledge graph into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder which does not comprise a linear layer to obtain a visual characteristic sequence;
adopting a cooperative attention mechanism to perform multi-mode fusion on the graph embedding vector and the visual feature sequence to obtain an attention re-weighted image sequence;
inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report;
the method for obtaining the image sequence with attention re-weighted by adopting a cooperative attention mechanism carries out multi-mode fusion on the graph embedded vector and the visual feature sequence, and comprises the following steps:
calculating an affinity matrix between the graph embedding vector and the visual feature sequence;
learning, by the affinity matrix, an attention map between the graph embedding vector and the visual feature sequence;
Calculating an attention weight vector based on the attention map;
based on the visual feature sequence and the attention weight vector, a sequence of attention re-weighted images is calculated.
2. The method for generating a medical image report based on multi-modal fusion according to claim 1, wherein the constructing a medical prior knowledge-graph includes:
acquiring a plurality of unmarked medical image report texts;
extracting a plurality of medical entities from the plurality of unmarked medical image report texts by adopting a named entity recognition algorithm;
adopting a clustering algorithm to reduce the dimension of the medical entities;
and constructing a medical priori knowledge map by taking the medical entities after dimension reduction as nodes and the relationship among the medical entities after dimension reduction as edges.
3. The method for generating a medical image report based on multi-modal fusion according to claim 1, wherein the obtaining the initial feature vector of each node in the medical prior knowledge-graph includes:
and initializing each node of the medical priori knowledge map through a word embedding model to obtain an initial feature vector of the node.
4. The method for generating a medical image report based on multi-modal fusion according to claim 1, wherein inputting the medical prior knowledge-graph and the initial feature vector of each node in the medical prior knowledge-graph into a graph encoder to obtain a graph embedding vector comprises:
building a graph encoder:
GC(X^(k), Ã) = D^(-1) Ã X^(k) W^(k)

X^(k+1) = X^(k) + Dropout(BN(σ(GC(X^(k), Ã))))

wherein Ã represents the adjacency matrix of the medical prior knowledge graph, with a self-loop edge added so that each node points to itself; X^(0) represents the initial feature vector of the medical prior knowledge graph, obtained by splicing the initial feature vectors of all nodes in the graph; X^(k) represents the graph convolution feature vector of the k-th layer and X^(k+1) that of the (k+1)-th layer; D is the degree matrix with D_ii = Σ_j Ã_ij, and D^(-1)Ã is used to normalize the aggregated node features; Ã_ij represents the element in row i and column j of the adjacency matrix; W^(k) represents a trainable weight matrix; GC() represents the graph convolution function, σ() the activation function, Dropout() the random discard function, and BN() the batch normalization function;
and inputting the medical priori knowledge graph and the initial feature vector of each node in the medical priori knowledge graph into the graph encoder to obtain a graph convolution feature vector of the last layer as a graph embedding vector.
5. The method for generating a medical image report based on multi-modal fusion according to claim 1, wherein the inputting the medical image into an image encoder not including a linear layer, to obtain a visual feature sequence, includes:
inputting the medical image into an image encoder which does not comprise a linear layer to obtain a four-dimensional visual feature matrix;
deforming the four-dimensional visual feature matrix into a three-dimensional visual feature matrix;
and converting the three-dimensional visual characteristic matrix into a visual characteristic sequence.
6. The multi-modality fusion-based medical image report generation method of claim 1, wherein the computing an affinity matrix between the graph embedding vector and the sequence of visual features comprises:
the affinity matrix between the graph embedding vector and the visual feature sequence is calculated by the following expression:

C = tanh(G_E^T W_b I_E)

wherein C represents the affinity matrix, G_E represents the graph embedding vector, I_E represents the visual feature sequence, and W_b represents a weight matrix.
7. The method of claim 1, wherein learning the attention map between the map embedding vector and the visual feature sequence by the affinity matrix comprises:
Learning an attention map between the graph embedding vector and the visual feature sequence by:
F_i = tanh(W_i I_E + (W_g G_E) C)

wherein F_i represents the output result of learning the graph embedding vector and the visual feature sequence by means of the affinity matrix, W_i and W_g both represent trainable weight matrices, C represents the affinity matrix, G_E represents the graph embedding vector, and I_E represents the visual feature sequence.
8. The multi-modality fusion-based medical image report generation method of claim 1, wherein the calculating an attention weight vector based on the attention map includes:
the attention weight vector is calculated by the following expression:

a_i = softmax(w_fi F_i)

wherein a_i represents the attention weight vector, w_fi represents a trainable weight matrix, and F_i represents the attention map result.
9. The multi-modality fusion-based medical image report generation method of claim 1, wherein the computing the attention re-weighted image sequence based on the visual feature sequence and the attention weight vector comprises:
the attention re-weighted image sequence is calculated by the following expression:

x̂_r = a_r · x_r, r = 1, 2, …, R

wherein x̂_r represents the elements in the attention re-weighted image sequence, x_r (r = 1, 2, …, R) represents the elements in the visual feature sequence, a_r represents the elements in the attention weight vector corresponding to the visual feature sequence, and R = H'×W' represents the number of image blocks.
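The element-wise re-weighting can be sketched as a broadcast multiply: each patch feature is scaled by its attention weight. A uniform weight vector is used here purely so the result is easy to check; in practice the weights come from the softmax of the previous step.

```python
import numpy as np

rng = np.random.default_rng(0)
R, d = 49, 64                      # patches and feature dim (assumed)
X = rng.standard_normal((R, d))    # visual feature sequence, one row per patch
a = np.full(R, 1.0 / R)            # attention weights (uniform stand-in, sums to 1)

# Attention re-weighting: scale every patch feature by its weight a_r.
X_hat = a[:, None] * X             # attention re-weighted image sequence

print(X_hat.shape)  # (49, 64)
```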
10. The multi-modality fusion-based medical image report generation method of claim 1, wherein the memory-driven Transformer model comprises: an encoder and a decoder, the decoder comprising: a decoding module provided with a memory-driven normalization layer.
11. The method of claim 10, wherein inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report comprises:
inputting the attention re-weighted image sequence into the encoder;
initializing the relational memory by adopting the graph embedding vector;
calculating a memory matrix of the last round of output of the relational memory;
and inputting the memory matrix output by the last round of the relational memory and the output result of the encoder into the decoding module provided with the memory-driven normalization layer to obtain a medical image report.
12. The method for generating a medical image report based on multi-modal fusion according to claim 11, wherein initializing the relational memory using the graph embedding vector includes:
initializing the relational memory by the following expression:
M_0 = MLP(G_E · W_m)

wherein M_0 represents the initial memory matrix of the relational memory, G_E represents the graph embedding vector, W_m represents a weight matrix used to create a mapping between dimensions, and MLP() represents a multi-layer perceptron.
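A sketch of initializing the relational memory from the graph embedding rather than from random slots. The two-layer ReLU perceptron and all dimensions are illustrative assumptions; the point is that prior-knowledge structure seeds the memory before decoding begins.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, d_m = 5, 64, 128                 # nodes, embedding dim, memory dim (assumed)

G_E = rng.standard_normal((N, d))      # graph embedding vectors
W_m = rng.standard_normal((d, d_m))    # weight matrix mapping between dimensions

W1 = rng.standard_normal((d_m, d_m))
W2 = rng.standard_normal((d_m, d_m))

def mlp(x):
    """A minimal two-layer perceptron with ReLU (architecture assumed)."""
    return np.maximum(x @ W1, 0.0) @ W2

# M_0 = MLP(G_E . W_m): the initial memory matrix of the relational memory.
M0 = mlp(G_E @ W_m)

print(M0.shape)  # (5, 128)
```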
13. The method for generating a medical image report based on multi-modal fusion according to claim 11, wherein the calculating the memory matrix of the last round of output of the relational memory includes:
calculating the memory matrix of the last round of output of the relational memory by the following expressions:

M̃_t = MLP(softmax(Q · K^T / √d_k) · V)

M_t = G_t^f ⊙ M_{t-1} + G_t^o ⊙ M̃_t

wherein Q = M_{t-1} · W_Q, K = [M_{t-1}; y_{t-1}] · W_K, V = [M_{t-1}; y_{t-1}] · W_V, M_{t-1} represents the memory matrix output in the previous round of the relational memory, y_{t-1} represents the word embedding vector predicted in the previous round by the relational memory, W_Q, W_K and W_V are all trainable weight matrices, d_k represents the scaling factor, obtained by dividing the dimension of K by the number of attention heads, MLP() represents the multi-layer perceptron, G_t^f and G_t^o represent the forget gate and output gate used to balance M_{t-1} and y_{t-1}, M̃_t represents the multi-head attention output matrix mapped by the multi-layer perceptron, and M_t represents the memory matrix of the last round of output of the relational memory.
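A single-head numpy sketch of the relational memory update under stated simplifications: the claim uses multi-head attention followed by an MLP, which is reduced here to one head with the MLP omitted, and the forget/output gates are collapsed to scalar sigmoids. Slot count and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, d_m = 4, 32                                # memory slots, memory dim (assumed)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

M_prev = rng.standard_normal((S, d_m))        # memory from the previous round
y_prev = rng.standard_normal((1, d_m))        # embedding of the last predicted word
W_Q, W_K, W_V = (rng.standard_normal((d_m, d_m)) for _ in range(3))

Q = M_prev @ W_Q
K = np.vstack([M_prev, y_prev]) @ W_K         # keys/values attend over memory AND last word
V = np.vstack([M_prev, y_prev]) @ W_V

# Scaled dot-product attention over the concatenated [memory; word] context.
M_tilde = softmax(Q @ K.T / np.sqrt(d_m)) @ V

# Gated update (form assumed): forget and output gates balance the old
# memory against the newly attended content.
G_f = sigmoid(M_prev.mean())                  # scalar stand-in gates
G_o = sigmoid(M_tilde.mean())
M_t = G_f * M_prev + G_o * np.tanh(M_tilde)

print(M_t.shape)  # (4, 32)
```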
14. The method for generating a medical image report based on multi-modal fusion according to claim 13, wherein inputting the memory matrix output by the last round of the relational memory and the output result of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report includes:
calculating the output result of the decoding module provided with the memory-driven normalization layer by the following expressions:

γ_t = γ + MLP(M_t)

β_t = β + MLP(M_t)

MCLN(r, M_t) = γ_t ⊙ (r − μ) / ν + β_t

θ = T_D(ψ, N, RM(M_{t-1}, y_{t-1}), MCLN(r, M_t))

wherein ψ represents the output of the encoder, N represents the number of decoder layers, γ represents a matrix of learnable scaling parameters used to improve generalization ability, γ_t represents the result of adding γ to M_t mapped by a multi-layer perceptron, β represents a matrix of learnable shifting parameters used to improve generalization ability, β_t represents the result of adding β to M_t mapped by the multi-layer perceptron, r represents the input of the normalization layer, μ represents the mean of r, ν represents the standard deviation of r, T_E() represents the encoder, T_D() represents the decoder, and RM() represents the relational memory.
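A sketch of the memory-conditioned layer normalization step in isolation: the memory state shifts the usual scale/shift parameters, so every normalization in the decoder is conditioned on M_t. The MLP architecture, the flattened memory vector, and all dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_m = 32
r = rng.standard_normal(d_m)       # decoder hidden state to normalize
M_t = rng.standard_normal(d_m)     # (flattened) memory state, dims assumed

gamma = np.ones(d_m)               # learnable scaling parameters
beta = np.zeros(d_m)               # learnable shifting parameters
W1 = rng.standard_normal((d_m, d_m)) * 0.01
W2 = rng.standard_normal((d_m, d_m)) * 0.01

def mlp(x):
    """Minimal two-layer ReLU perceptron (architecture assumed)."""
    return np.maximum(x @ W1, 0.0) @ W2

# Memory-driven normalization: memory perturbs scale and shift...
gamma_t = gamma + mlp(M_t)
beta_t = beta + mlp(M_t)

# ...then a standard layer-norm of r is applied with those parameters.
mu, nu = r.mean(), r.std()
out = gamma_t * (r - mu) / (nu + 1e-6) + beta_t

print(out.shape)  # (32,)
```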
15. A medical image report generating device based on multi-modal fusion, comprising:
the graph construction module is used for constructing a medical prior knowledge graph and acquiring an initial feature vector of each node in the medical prior knowledge graph;
the graph encoder module is used for inputting the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into a graph encoder to obtain a graph embedding vector;
the image encoder module is used for inputting the medical image into an image encoder which does not comprise a linear layer to obtain a visual feature sequence;
the multi-mode fusion module is used for adopting a cooperative attention mechanism to perform multi-mode fusion on the graph embedding vector and the visual feature sequence to obtain an attention re-weighted image sequence;
the report generation module is used for inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report;
the multi-mode fusion module is specifically configured to:
calculating an affinity matrix between the graph embedding vector and the visual feature sequence;
learning, by the affinity matrix, an attention map between the graph embedding vector and the visual feature sequence;
calculating an attention weight vector based on the attention map;
Based on the visual feature sequence and the attention weight vector, a sequence of attention re-weighted images is calculated.
16. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the medical image report generation method based on multi-modal fusion according to any one of claims 1 to 14.
17. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the medical image report generating method based on multi-modal fusion according to any one of claims 1 to 14.
18. A computer program product comprising a computer program which, when executed by a processor, implements a medical image report generating method based on multi-modal fusion as claimed in any one of claims 1 to 14.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210836966.3A CN115331769B (en) | 2022-07-15 | 2022-07-15 | Medical image report generation method and device based on multi-mode fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115331769A CN115331769A (en) | 2022-11-11 |
CN115331769B true CN115331769B (en) | 2023-05-09 |
Family
ID=83917479
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210836966.3A Active CN115331769B (en) | 2022-07-15 | 2022-07-15 | Medical image report generation method and device based on multi-mode fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115331769B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115937689B (en) * | 2022-12-30 | 2023-08-11 | 安徽农业大学 | Intelligent identification and monitoring technology for agricultural pests |
CN116028654B (en) * | 2023-03-30 | 2023-06-13 | 中电科大数据研究院有限公司 | Multi-mode fusion updating method for knowledge nodes |
CN117010494B (en) * | 2023-09-27 | 2024-01-05 | 之江实验室 | Medical data generation method and system based on causal expression learning |
CN117649917A (en) * | 2024-01-29 | 2024-03-05 | 北京大学 | Training method and device for test report generation model and test report generation method |
CN117993500A (en) * | 2024-04-07 | 2024-05-07 | 江西为易科技有限公司 | Medical teaching data management method and system based on artificial intelligence |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180060512A1 (en) * | 2016-08-29 | 2018-03-01 | Jeffrey Sorenson | System and method for medical imaging informatics peer review system |
CN112992308B (en) * | 2021-03-25 | 2023-05-16 | 腾讯科技(深圳)有限公司 | Training method of medical image report generation model and image report generation method |
CN113724359A (en) * | 2021-07-14 | 2021-11-30 | 鹏城实验室 | CT report generation method based on Transformer |
CN114724670A (en) * | 2022-06-02 | 2022-07-08 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Medical report generation method and device, storage medium and electronic equipment |
- 2022-07-15 CN CN202210836966.3A patent/CN115331769B/en active Active
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115331769B (en) | Medical image report generation method and device based on multi-mode fusion | |
CN109545302A (en) | A kind of semantic-based medical image report template generation method | |
CN110750959A (en) | Text information processing method, model training method and related device | |
CN111316281A (en) | Semantic classification of numerical data in natural language context based on machine learning | |
CN112561064B (en) | Knowledge base completion method based on OWKBC model | |
WO2022052530A1 (en) | Method and apparatus for training face correction model, electronic device, and storage medium | |
CN111881926A (en) | Image generation method, image generation model training method, image generation device, image generation equipment and image generation medium | |
CN115132313A (en) | Automatic generation method of medical image report based on attention mechanism | |
US11430123B2 (en) | Sampling latent variables to generate multiple segmentations of an image | |
CN112052889B (en) | Laryngoscope image recognition method based on double-gating recursion unit decoding | |
CN112530584A (en) | Medical diagnosis assisting method and system | |
CN113724359A (en) | CT report generation method based on Transformer | |
CN116129141A (en) | Medical data processing method, apparatus, device, medium and computer program product | |
CN116563537A (en) | Semi-supervised learning method and device based on model framework | |
CN116385937A (en) | Method and system for solving video question and answer based on multi-granularity cross-mode interaction framework | |
CN112560454B (en) | Bilingual image subtitle generating method, bilingual image subtitle generating system, storage medium and computer device | |
CN115190999A (en) | Classifying data outside of a distribution using contrast loss | |
CN116486465A (en) | Image recognition method and system for face structure analysis | |
CN116258928A (en) | Pre-training method based on self-supervision information of unlabeled medical image | |
CN116994695A (en) | Training method, device, equipment and storage medium of report generation model | |
CN115662565A (en) | Medical image report generation method and equipment integrating label information | |
CN115762721A (en) | Medical image quality control method and system based on computer vision technology | |
CN115239740A (en) | GT-UNet-based full-center segmentation algorithm | |
CN114139531A (en) | Medical entity prediction method and system based on deep learning | |
Souza et al. | Automatic recognition of continuous signing of brazilian sign language for medical interview |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||