CN115331769A - Medical image report generation method and device based on multi-modal fusion


Info

Publication number
CN115331769A
CN115331769A (application CN202210836966.3A)
Authority
CN
China
Prior art keywords: medical, representing, graph, memory, attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210836966.3A
Other languages
Chinese (zh)
Other versions
CN115331769B (en)
Inventor
黄雨
李航
徐德轩
金芝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Peking University First Hospital
Original Assignee
Peking University
Peking University First Hospital
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University and Peking University First Hospital
Priority to CN202210836966.3A
Publication of CN115331769A
Application granted
Publication of CN115331769B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G16 — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H — HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H15/00 — ICT specially adapted for medical reports, e.g. generation or transmission thereof
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 — Arrangements for image or video recognition or understanding
    • G06V10/70 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 — Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G — PHYSICS
    • G16 — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H — HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00 — ICT specially adapted for the handling or processing of medical images

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Public Health (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The invention provides a medical image report generation method and device based on multi-modal fusion. The method comprises the following steps: constructing a medical prior knowledge graph and acquiring an initial feature vector of each node in the medical prior knowledge graph; inputting the medical prior knowledge graph and the initial feature vectors of its nodes into a graph encoder to obtain a graph embedding vector; inputting the medical image into an image encoder without a linear layer to obtain a visual feature sequence; performing multi-modal fusion of the graph embedding vector and the visual feature sequence with a co-attention mechanism to obtain an attention re-weighted image sequence; and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report. The invention can improve the accuracy and reliability of medical image report generation.

Description

Medical image report generation method and device based on multi-modal fusion
Technical Field
The invention relates to the technical field at the intersection of medicine and artificial intelligence, and in particular to a medical image report generation method and device based on multi-modal fusion.
Background
In recent years, medical image reports have been a focus of research and collaboration between computer scientists and medical professionals. Accurate and efficient medical image reports can greatly improve doctors' grasp of patients' conditions, reduce doctors' workload, assist them in making correct diagnoses, and provide corresponding medical guidance and suggestions for patients.
Currently, research on medical image report generation technology is still at an early stage. In existing medical image report generation schemes, the medical knowledge graph is generally used only for subtasks such as classification and is not integrated into the generation model, so the accuracy and reliability of the generated medical image reports are not high.
Disclosure of Invention
The invention provides a method and a device for generating a medical image report based on multi-modal fusion, to overcome the defect in the prior art that the medical knowledge graph is used only for subtasks such as classification and is not fused into the generation model, which leaves the accuracy and reliability of medical image report generation low; the purpose of improving the accuracy and reliability of medical image report generation is thereby achieved.
The invention provides a medical image report generation method based on multi-modal fusion, which comprises the following steps:
constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
inputting the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into a graph encoder to obtain a graph embedding vector;
inputting a medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
performing multi-modal fusion of the graph embedding vector and the visual feature sequence by adopting a co-attention mechanism to obtain an attention re-weighted image sequence;
and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
According to the medical image report generation method based on multi-modal fusion provided by the invention, constructing the medical prior knowledge graph comprises:
acquiring a plurality of unlabeled medical image report texts;
extracting a plurality of medical entities from the unlabeled medical image report texts by adopting a named entity recognition algorithm;
reducing the dimensionality of the medical entities by adopting a clustering algorithm;
and constructing the medical prior knowledge graph by taking the dimension-reduced medical entities as nodes and the relations between the dimension-reduced medical entities as edges.
According to the medical image report generation method based on multi-modal fusion provided by the invention, acquiring the initial feature vector of each node in the medical prior knowledge graph comprises:
initializing each node of the medical prior knowledge graph through a word embedding model to obtain the initial feature vector of the node.
According to the medical image report generation method based on multi-modal fusion provided by the invention, inputting the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into a graph encoder to obtain a graph embedding vector comprises:
constructing the graph encoder:

GC(H^(k)) = σ(D^(-1/2) · Ã · D^(-1/2) · H^(k) · W^(k))

H^(k+1) = BN(Dropout(GC(H^(k)))) + H^(k)

wherein Ã represents the adjacency matrix of the medical prior knowledge graph, the adjacency matrix being accompanied by edges pointing from each node to itself; X represents the initial feature matrix of the medical prior knowledge graph, obtained by concatenating the initial feature vectors of all nodes (H^(0) = X); H^(k) represents the graph convolution feature vector of the k-th layer and H^(k+1) that of the (k+1)-th layer; D represents the degree matrix, with D_ii = Σ_j Ã_ij, so that D^(-1/2) · Ã · D^(-1/2) normalizes the aggregated node features; Ã_ij represents the element in row i and column j of the adjacency matrix; W^(k) represents a trainable weight matrix; GC(·) represents the graph convolution function, σ(·) the activation function, Dropout(·) the random drop function, and BN(·) the batch normalization function;
and inputting the medical prior knowledge graph and the initial feature vector of each node into the graph encoder, and taking the graph convolution feature vector of the last layer as the graph embedding vector.
According to the medical image report generation method based on multi-modal fusion provided by the invention, inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence comprises:
inputting the medical image into the image encoder to obtain a four-dimensional visual feature matrix;
transforming the four-dimensional visual feature matrix into a three-dimensional visual feature matrix;
and converting the three-dimensional visual feature matrix into the visual feature sequence.
According to the medical image report generation method based on multi-modal fusion provided by the invention, performing multi-modal fusion of the graph embedding vector and the visual feature sequence by adopting a co-attention mechanism to obtain an attention re-weighted image sequence comprises:
computing an affinity matrix between the graph embedding vector and the visual feature sequence;
learning, through the affinity matrix, an attention mapping between the graph embedding vector and the visual feature sequence;
calculating an attention weight vector based on the attention mapping;
and calculating the attention re-weighted image sequence based on the visual feature sequence and the attention weight vector.
According to the medical image report generation method based on multi-modal fusion provided by the invention, computing the affinity matrix between the graph embedding vector and the visual feature sequence comprises:
computing the affinity matrix by the following expression:

C = tanh(G_E^T · W_b · I_E)

wherein C represents the affinity matrix, G_E represents the graph embedding vector, I_E represents the visual feature sequence, and W_b represents a weight matrix.
According to the medical image report generation method based on multi-modal fusion provided by the invention, learning the attention mapping between the graph embedding vector and the visual feature sequence through the affinity matrix comprises:
learning the attention mapping by the following expression:

F_i = tanh(W_i · I_E + (W_g · G_E) · C)

wherein F_i represents the output of learning the attention mapping between the graph embedding vector and the visual feature sequence through the affinity matrix, W_i and W_g both represent trainable weight matrices, C represents the affinity matrix, G_E represents the graph embedding vector, and I_E represents the visual feature sequence.
According to the medical image report generation method based on multi-modal fusion provided by the invention, calculating the attention weight vector based on the attention mapping comprises:
calculating the attention weight vector by the following expression:

a_i = softmax(w_f^T · F_i)

wherein a_i represents the attention weight vector, w_f represents a trainable weight vector, and F_i represents the attention mapping result.
According to the medical image report generation method based on multi-modal fusion provided by the invention, calculating the attention re-weighted image sequence based on the visual feature sequence and the attention weight vector comprises:
calculating the attention re-weighted image sequence by the following expression:

x̂_r = a_r · x_r, r = 1, 2, …, R

wherein x̂_r represents an element of the attention re-weighted image sequence, x_r represents the corresponding element of the visual feature sequence, a_r represents the element of the attention weight vector corresponding to x_r, and R = H′ × W′ represents the number of image blocks.
According to the medical image report generation method based on multi-modal fusion provided by the invention, the memory-driven Transformer model comprises an encoder and a decoder, wherein the decoder comprises a relational memory and a decoding module provided with a memory-driven normalization layer.
According to the medical image report generation method based on multi-modal fusion provided by the invention, inputting the attention re-weighted image sequence into the memory-driven Transformer model to generate the medical image report comprises:
inputting the attention re-weighted image sequence into the encoder;
initializing the relational memory with the graph embedding vector;
calculating the memory matrix output by the relational memory in the current round;
and inputting the memory matrix output by the relational memory in the current round and the output result of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report.
According to the medical image report generation method based on multi-modal fusion provided by the invention, initializing the relational memory with the graph embedding vector comprises:
initializing the relational memory by the following expression:

M_0 = MLP(G_E · W_m)

wherein M_0 represents the initial memory matrix of the relational memory, G_E represents the graph embedding vector, W_m represents a weight matrix, and MLP(·) represents a multi-layer perceptron used to establish the mapping between dimensions.
According to the medical image report generation method based on multi-modal fusion provided by the invention, calculating the memory matrix output by the relational memory in the current round comprises:
calculating the memory matrix according to the following expressions:

Z = softmax(Q · K^T / √d_k) · V

Z̃ = MLP(Z) + Z

M_t = G^f ⊙ M_{t-1} + G^o ⊙ Z̃

wherein Q = M_{t-1} · W_Q, K = [M_{t-1}; y_{t-1}] · W_K, V = [M_{t-1}; y_{t-1}] · W_V; M_{t-1} represents the memory matrix output by the relational memory in the previous round; y_{t-1} represents the word embedding vector of the word predicted in the previous round; W_Q, W_K and W_V are all trainable weight matrices; d_k represents the scaling factor, obtained by dividing the dimension of K by the number of attention heads; MLP(·) represents a multi-layer perceptron; G^f and G^o represent the forget gate and the output gate used to balance M_{t-1} and y_{t-1}; Z̃ represents the multi-head attention output matrix after multi-layer perceptron mapping; and M_t represents the memory matrix output by the relational memory in the current round.
According to the medical image report generation method based on multi-modal fusion provided by the invention, inputting the memory matrix output by the relational memory in the current round and the output result of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report comprises:
calculating the output result of the decoding module by the following expressions:

γ_t = γ + MLP(M_t)

β_t = β + MLP(M_t)

MCLN(r, M_t) = γ_t ⊙ (r − μ) / ν + β_t

θ = T_D(ψ, N, RM(M_{t-1}, y_{t-1}), MCLN(r, M_t))

wherein ψ represents the output result of the encoder T_E(·); N represents the number of decoder layers; γ represents a learnable scaling parameter matrix used to improve generalization ability, and γ_t represents the result of adding γ to the multi-layer-perceptron mapping of M_t; β represents a learnable shifting parameter matrix used to improve generalization ability, and β_t represents the result of adding β to the multi-layer-perceptron mapping of M_t; r represents the input of the normalization layer, μ represents the mean of r and ν its standard deviation; T_D(·) represents the decoder; RM(·) represents the relational memory; and θ represents the output result of the decoding module.
The invention also provides a medical image report generation device based on multi-modal fusion, which comprises:
a graph construction module, used for constructing a medical prior knowledge graph and acquiring an initial feature vector of each node in the medical prior knowledge graph;
a graph encoder module, used for inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
an image encoder module, used for inputting a medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
a multi-modal fusion module, used for performing multi-modal fusion of the graph embedding vector and the visual feature sequence by adopting a co-attention mechanism to obtain an attention re-weighted image sequence;
and a report generation module, used for inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
The invention further provides an electronic device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the program to implement the method for generating the medical image report based on multi-modal fusion as described in any one of the above.
The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method for generating a medical image report based on multimodal fusion as described in any of the above.
The present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a method for generating a medical image report based on multimodal fusion as described in any of the above.
The invention provides a method and a device for generating a medical image report based on multi-modal fusion. A medical prior knowledge graph is constructed, and an initial feature vector of each node in the graph is acquired; the medical prior knowledge graph and the initial feature vectors of its nodes are then input into a graph encoder to obtain a graph embedding vector; the medical image is input into an image encoder without a linear layer to obtain a visual feature sequence; multi-modal fusion of the graph embedding vector and the visual feature sequence is performed with a co-attention mechanism to obtain an attention re-weighted image sequence, which fuses the medical prior knowledge graph and the medical image; finally, the attention re-weighted image sequence is input into a memory-driven Transformer model to generate a medical image report. Because the attention re-weighted image sequence fuses the medical prior knowledge graph and the medical image, the memory-driven Transformer model gains better comprehension capability and a more robust grasp of medical prior knowledge, so the accuracy and reliability of medical image report generation can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a multi-modal fusion-based medical image report generation method provided by the present invention;
FIG. 2 is a schematic structural diagram of the medical image report generation model based on the medical prior knowledge graph and memory driving provided by the present invention;
FIG. 3 is a schematic diagram of the construction of the medical prior knowledge graph provided by the present invention;
FIG. 4 is a schematic structural diagram of a medical image report generation device based on multi-modal fusion provided by the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
The method for generating a medical image report based on multi-modal fusion according to the present invention is described below with reference to fig. 1 to 3.
Referring to fig. 1, fig. 1 is a schematic flow chart of a medical image report generation method based on multi-modal fusion according to the present invention. As shown in fig. 1, the method for generating a medical image report based on multi-modal fusion provided by the present invention may include the following steps:
101, constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
102, inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
103, inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
104, adopting a co-attention mechanism to perform multi-modal fusion of the graph embedding vector and the visual feature sequence to obtain an attention re-weighted image sequence;
and 105, inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
In step 101, a medical prior knowledge graph of appropriate scale is constructed by natural language processing methods, rather than built manually from a handful of selected keywords. The medical prior knowledge graph of this embodiment has the following characteristics:
1) Comprehensive entity types
The medical prior knowledge graph contains entity types covering all aspects, so it can describe a disease from every angle rather than only recording the disease name. For example, for a skin disease it may include the name, position, shape, color and other entity information of the disease.
2) Appropriate graph scale
The scale of the medical prior knowledge graph should be appropriate: if the graph is too large, training and learning become difficult; if it is too small, not enough prior knowledge is retained.
3) Comprehensive entity relations
Relations between entities can be established automatically through a relation extraction method and can also be supplemented manually, so that the medical prior knowledge graph embodies more prior knowledge.
In this step, to support the subsequent steps, each node in the medical prior knowledge graph needs to be initialized to obtain its initial feature vector. Optionally, each node of the medical prior knowledge graph is initialized through a word embedding model, as sketched below.
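A minimal sketch of this node initialization, assuming the HuggingFace transformers library and the publicly released BioBERT checkpoint dmis-lab/biobert-base-cased-v1.1 (the embodiment names BioBert for the IU-Xray data set; the exact checkpoint and the mean pooling are illustrative assumptions, not taken from the patent):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
encoder = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

def init_node_features(node_keywords):
    """Map each graph node's keyword to a 768-dimensional initial feature vector."""
    features = []
    with torch.no_grad():
        for keyword in node_keywords:
            inputs = tokenizer(keyword, return_tensors="pt")
            outputs = encoder(**inputs)
            # Mean-pool the token embeddings into one vector per node.
            features.append(outputs.last_hidden_state.mean(dim=1).squeeze(0))
    return torch.stack(features)  # shape: [num_nodes, 768]
```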
In step 102, the graph encoder is used to extract the graph embedding vector of the medical prior knowledge graph. The medical prior knowledge graph and the initial feature vector of each node are input into the graph encoder, and the graph embedding vector is obtained through graph encoding.
In step 103, a derivative model of the convolutional neural network is used as the visual extractor of the image encoder, for example the residual convolutional network ResNet or the densely connected convolutional network DenseNet. The image encoder used in this embodiment does not include the final linear layer and outputs the result of the pooling layer.
In step 104, the image encoder can obtain the visual features of the medical image but cannot capture high-level semantic information well. This embodiment therefore adopts a co-attention mechanism to perform multi-modal fusion of the graph embedding vector and the visual feature sequence, simulating a visual question-answering process, and finally obtains the attention re-weighted image sequence.
In step 105, as shown in fig. 2, the memory-driven Transformer model optionally comprises an encoder and a decoder, wherein the decoder comprises a relational memory and a decoding module provided with a memory-driven normalization layer. The memory-driven normalization layer consists of three Memory-driven Conditional Layer Normalization (MCLN) layers, used to enhance the decoding capability of the memory-driven Transformer model and increase its generalization.
The attention re-weighted image sequence is input into the memory-driven Transformer model to generate the medical image report.
In this embodiment, since the attention re-weighted image sequence fuses the medical prior knowledge graph and the medical image, the memory-driven Transformer model has better comprehension capability and a more robust grasp of medical prior knowledge, which improves the accuracy and reliability of medical image report generation.
Optionally, constructing the medical prior knowledge graph in step 101 comprises the following sub-steps:
step 1011, acquiring a plurality of unlabeled medical image report texts;
step 1012, extracting a plurality of medical entities from the unlabeled medical image report texts by adopting a named entity recognition algorithm;
step 1013, reducing the dimensionality of the medical entities by adopting a clustering algorithm;
and step 1014, constructing the medical prior knowledge graph by taking the dimension-reduced medical entities as nodes and the relations between them as edges.
In step 1011, a number of unlabeled medical image report texts are acquired, such as large collections of unlabeled reports, training data sets, or effective text information provided by physicians. Different types of reports may be selected as the underlying data for different tasks.
In step 1012, a named entity recognition algorithm is used to extract medical entities from the unlabeled medical image report texts; these are stored as the key nodes of the medical prior knowledge graph.
In step 1013, the nodes identified by named entity recognition may contain large amounts of similar content; retaining all of it would produce many redundant structures and an oversized medical prior knowledge graph, so text processing methods and a clustering algorithm are used to reduce the dimensionality of the medical entities.
In step 1014, relations between medical entities are established using relation extraction, with entity dependencies supplemented by human design. The medical prior knowledge graph is then constructed with the dimension-reduced medical entities as nodes and the relations between them as edges.
In this embodiment, a medical prior knowledge graph of appropriate scale can be constructed, rather than one built manually from a handful of selected keywords.
Specifically, as shown in fig. 3, medical prior knowledge graphs are constructed from two data sets. Because the two data sets are in different languages, the construction process differs somewhat. For the IU-Xray data set, Stanza Biomedical is used as the backbone method for named entity recognition and relation extraction, and clustering finally yields a medical prior knowledge graph with 284 key nodes. Each node obtains a 768-dimensional feature vector through BioBert, used as the initial features of the medical prior knowledge graph. Similarly, for the NCRC-DS data set, Chinese medical entities and entity triplets are extracted using CMeKG, a Chinese medical knowledge graph tool library that provides open-source implementations of named entity recognition, relation extraction, medical word segmentation and other functions. After clustering, a knowledge graph containing 191 key nodes is obtained. To obtain the initial features of the nodes, the keywords of the nodes are input into the Chinese medical Bert model provided by CMeKG, yielding 768-dimensional initial vectors.
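A sketch of this construction pipeline after entity extraction, assuming the entity mentions, their embeddings and the extracted relation pairs are already available from the NER and relation-extraction step; the clustering method and the representative-name rule are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def build_prior_graph(entity_vecs, entity_names, relations, n_nodes=284):
    """Cluster raw entity mentions into n_nodes key nodes and build an adjacency matrix.

    entity_vecs: [num_entities, dim] embeddings of the extracted entity mentions
    relations:   iterable of (head_index, tail_index) pairs between raw entities
    """
    labels = AgglomerativeClustering(n_clusters=n_nodes).fit_predict(entity_vecs)
    adj = np.eye(n_nodes)  # self-loops, matching the adjacency matrix used by the graph encoder
    for head, tail in relations:
        hc, tc = labels[head], labels[tail]
        if hc != tc:
            adj[hc, tc] = adj[tc, hc] = 1.0
    # Pick one representative keyword per key node (here: its first assigned mention).
    node_names = [entity_names[int(np.argmax(labels == c))] for c in range(n_nodes)]
    return adj, node_names
```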
Optionally, step 102 comprises the sub-steps of:
step 1021, constructing the graph encoder:

GC(H^(k)) = σ(D^(-1/2) · Ã · D^(-1/2) · H^(k) · W^(k)) (1)

H^(k+1) = BN(Dropout(GC(H^(k)))) + H^(k) (2)

wherein Ã represents the adjacency matrix of the medical prior knowledge graph, the adjacency matrix being accompanied by edges pointing from each node to itself; X represents the initial feature matrix of the medical prior knowledge graph, obtained by concatenating the initial feature vectors of all nodes (H^(0) = X); H^(k) represents the graph convolution feature vector of the k-th layer and H^(k+1) that of the (k+1)-th layer; D represents the degree matrix, with D_ii = Σ_j Ã_ij, so that D^(-1/2) · Ã · D^(-1/2) normalizes the aggregated node features; Ã_ij represents the element in row i and column j of the adjacency matrix; W^(k) represents a trainable weight matrix; GC(·) represents the graph convolution function, σ(·) the activation function, Dropout(·) the random drop function, and BN(·) the batch normalization function;
and step 1022, inputting the medical prior knowledge graph and the initial feature vector of each node into the graph encoder, and taking the graph convolution feature vector of the last layer as the graph embedding vector.
In this embodiment, random dropout, a batch normalization layer and a residual connection are added between every two graph convolution layers, which improves the expressive capability of the graph encoder; the graph embedding vector of the medical prior knowledge graph can then be extracted through the graph encoder.
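A PyTorch sketch of such a graph encoder; the feature dimension 768 and the three layers follow the embodiment's defaults, while the ReLU activation and dropout rate are assumptions:

```python
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    def __init__(self, dim=768, num_layers=3, p_drop=0.1):
        super().__init__()
        self.weights = nn.ModuleList(nn.Linear(dim, dim, bias=False)
                                     for _ in range(num_layers))
        self.norms = nn.ModuleList(nn.BatchNorm1d(dim) for _ in range(num_layers))
        self.drop = nn.Dropout(p_drop)

    def forward(self, adj, x):
        # adj: [N, N] adjacency with self-loops; x: [N, dim] initial node features
        deg = adj.sum(dim=1)
        d_inv_sqrt = deg.pow(-0.5)
        # D^(-1/2) A D^(-1/2): normalize the aggregated node features
        a_hat = adj * d_inv_sqrt.unsqueeze(0) * d_inv_sqrt.unsqueeze(1)
        h = x
        for w, bn in zip(self.weights, self.norms):
            gc = torch.relu(a_hat @ w(h))   # GC(H^(k)), equation (1)
            h = h + bn(self.drop(gc))       # residual + BN(Dropout(.)), equation (2)
        return h  # last-layer graph convolution features = graph embedding vector
```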
Optionally, step 103 comprises the sub-steps of:
step 1031, inputting the medical image (image, with dimensions [B, C, H, W]) into the image encoder VE(·), which does not include a linear layer, to obtain a four-dimensional visual feature matrix VE(image) with dimensions [B, F, H′, W′];
step 1032, transforming the four-dimensional visual feature matrix into a three-dimensional visual feature matrix:

I_E = reshape(VE(image)) (3)

wherein I_E is the three-dimensional visual feature matrix with dimensions [B, F, H′ × W′], and reshape(·) represents the reshaping function;
step 1033, converting the three-dimensional visual feature matrix into the visual feature sequence x_1, x_2, …, x_{H′×W′}.
In this embodiment, the medical image is input into an image encoder that does not include a linear layer, and the output is then reshaped to obtain the visual feature sequence.
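A sketch of such a visual extractor, assuming a torchvision ResNet-101 backbone (the backbone the NCRC-DS embodiment later names); dropping the global average pooling together with the final linear (fc) layer is an assumption made here so that the spatial grid H′ × W′ required by equation (3) is preserved:

```python
import torch.nn as nn
from torchvision import models

class ImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
        # Keep everything up to the last conv stage; drop avgpool + fc (the linear layer).
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, images):
        feats = self.backbone(images)        # [B, F, H', W']  four-dimensional matrix
        b, f, h, w = feats.shape
        feats = feats.reshape(b, f, h * w)   # [B, F, H'*W']   three-dimensional matrix
        return feats.permute(0, 2, 1)        # sequence of R = H'*W' patch features
```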
Optionally, step 104 comprises the sub-steps of:
step 1041, calculating an affinity matrix between the graph embedding vector and the visual feature sequence;
step 1042, learning the attention mapping between the graph embedding vector and the visual feature sequence through the affinity matrix;
step 1043, calculating an attention weight vector based on the attention mapping;
and step 1044, calculating the attention re-weighted image sequence based on the visual feature sequence and the attention weight vector.
In step 1041, the affinity matrix between the graph embedding vector and the visual feature sequence is calculated by the following expression:

C = tanh(G_E^T · W_b · I_E) (4)

wherein C represents the affinity matrix, G_E represents the graph embedding vector, I_E represents the visual feature sequence, and W_b represents a weight matrix.
In step 1042, the attention mapping between the graph embedding vector and the visual feature sequence is learned by the following expression:

F_i = tanh(W_i · I_E + (W_g · G_E) · C) (5)

wherein F_i represents the output of learning the attention mapping through the affinity matrix, W_i and W_g both represent trainable weight matrices, C represents the affinity matrix, G_E represents the graph embedding vector, and I_E represents the visual feature sequence.
In step 1043, the attention weight vector is calculated by the following expression:

a_i = softmax(w_f^T · F_i) (6)

wherein a_i represents the attention weight vector, w_f represents a trainable weight vector, and F_i represents the attention mapping result.
In step 1044, the attention re-weighted image sequence is calculated by the following expression:

x̂_r = a_r · x_r, r = 1, 2, …, R (7)

wherein x̂_r represents an element of the attention re-weighted image sequence, x_r represents the corresponding element of the visual feature sequence, a_r represents the element of the attention weight vector corresponding to x_r, and R = H′ × W′ represents the number of image blocks.
In this embodiment, a co-attention mechanism is adopted to perform multi-modal fusion of the graph embedding vector output by the graph encoder and the visual feature sequence output by the image encoder, so that the resulting attention re-weighted image sequence fuses the visual features of the medical image with the high-level semantic information of the medical prior knowledge graph.
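A PyTorch sketch of equations (4)-(7), assuming the graph embedding G_E holds N node vectors of dimension d and the visual sequence I_E holds R patch vectors of the same dimension d (the shared dimension and the hidden size k are simplifying assumptions):

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    def __init__(self, d, k):
        super().__init__()
        self.W_b = nn.Parameter(torch.randn(d, d) * 0.02)  # affinity weights
        self.W_i = nn.Parameter(torch.randn(k, d) * 0.02)
        self.W_g = nn.Parameter(torch.randn(k, d) * 0.02)
        self.w_f = nn.Parameter(torch.randn(1, k) * 0.02)

    def forward(self, G_E, I_E):
        # G_E: [N, d] graph embedding; I_E: [d, R] visual feature sequence
        C = torch.tanh(G_E @ self.W_b @ I_E)                      # [N, R], eq. (4)
        F_map = torch.tanh(self.W_i @ I_E + (self.W_g @ G_E.T) @ C)  # [k, R], eq. (5)
        attn = torch.softmax(self.w_f @ F_map, dim=-1)            # [1, R], eq. (6)
        return I_E * attn                                         # re-weighted patches, eq. (7)
```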
Optionally, step 105 comprises the sub-steps of:
step 1051, inputting the attention re-weighted image sequence into the encoder;
step 1052, initializing the relational memory with the graph embedding vector;
step 1053, calculating the memory matrix output by the relational memory in the current round;
and step 1054, inputting the memory matrix output by the relational memory in the current round and the output result of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report.
In step 1051, the output result of the encoder is calculated by the following expression:

ψ = T_E(x̂_1, x̂_2, …, x̂_R) (8)

wherein ψ represents the output result of the encoder and T_E(·) represents the encoder.
In step 1052, the relational memory is used to store content shared across the model's training results, enhancing the learning ability of the model. Specifically, a memory matrix comprising a plurality of rows is maintained, each row being regarded as a slot for storing pattern-specific information.
The relational memory is initialized by the following expression:

M_0 = MLP(G_E · W_m) (9)

wherein M_0 represents the initial memory matrix of the relational memory, G_E represents the graph embedding vector, W_m represents a weight matrix, and MLP(·) represents a multi-layer perceptron used to establish the mapping between dimensions.
In step 1053, the memory matrix output by the relational memory in the current round is calculated by the following expressions:

Z = softmax(Q · K^T / √d_k) · V (10)

Z̃ = MLP(Z) + Z (11)

M_t = G^f ⊙ M_{t-1} + G^o ⊙ Z̃ (12)

wherein Q = M_{t-1} · W_Q, K = [M_{t-1}; y_{t-1}] · W_K, V = [M_{t-1}; y_{t-1}] · W_V; M_{t-1} represents the memory matrix output by the relational memory in the previous round; y_{t-1} represents the word embedding vector of the word predicted in the previous round; W_Q, W_K and W_V are all trainable weight matrices; d_k represents the scaling factor, obtained by dividing the dimension of K by the number of attention heads; MLP(·) represents a multi-layer perceptron; G^f and G^o represent the forget gate and the output gate used to balance M_{t-1} and y_{t-1}; Z̃ represents the multi-head attention output matrix after multi-layer perceptron mapping; and M_t represents the memory matrix output by the relational memory in the current round.
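A sketch of one relational-memory update following equations (10)-(12); the single-head attention, the slot count of 3 (the embodiment's default) and the way the forget and output gates are parameterized from y_{t-1} are simplifying assumptions:

```python
import torch
import torch.nn as nn

class RelationalMemory(nn.Module):
    def __init__(self, d_model=512, num_slots=3):
        super().__init__()
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))
        self.gate = nn.Linear(d_model, 2 * d_model)  # produces forget + output gates

    def forward(self, m_prev, y_prev):
        # m_prev: [S, d] memory matrix M_{t-1}; y_prev: [1, d] last predicted word embedding
        cat = torch.cat([m_prev, y_prev], dim=0)             # [M_{t-1}; y_{t-1}]
        q, k, v = self.W_Q(m_prev), self.W_K(cat), self.W_V(cat)
        d_k = q.size(-1)
        z = torch.softmax(q @ k.T / d_k ** 0.5, dim=-1) @ v  # eq. (10)
        z_tilde = self.mlp(z) + z                            # eq. (11): MLP-mapped output
        g_f, g_o = self.gate(y_prev.expand_as(m_prev)).chunk(2, dim=-1)
        # eq. (12): gated blend of the previous memory and the attention output
        m_t = torch.sigmoid(g_f) * m_prev + torch.sigmoid(g_o) * torch.tanh(z_tilde)
        return m_t
```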
In step 1054, the output result of the decoding module provided with the memory-driven normalization layer is calculated by the following expressions:

γ_t = γ + MLP(M_t) (13)

β_t = β + MLP(M_t) (14)

MCLN(r, M_t) = γ_t ⊙ (r − μ) / ν + β_t (15)

θ = T_D(ψ, N, RM(M_{t-1}, y_{t-1}), MCLN(r, M_t)) (16)

wherein ψ represents the output result of the encoder T_E(·); N represents the number of decoder layers; γ represents a learnable scaling parameter matrix used to improve generalization ability, and γ_t represents the result of adding γ to the multi-layer-perceptron mapping of M_t; β represents a learnable shifting parameter matrix used to improve generalization ability, and β_t represents the result of adding β to the multi-layer-perceptron mapping of M_t; r represents the input of the normalization layer, μ represents the mean of r and ν its standard deviation; T_D(·) represents the decoder; RM(·) represents the relational memory; and θ represents the output result of the decoding module.
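A sketch of one MCLN layer implementing equations (13)-(15); flattening the memory slots before the two MLP projections and the epsilon term for numerical stability are implementation assumptions:

```python
import torch
import torch.nn as nn

class MCLN(nn.Module):
    def __init__(self, d_model=512, num_slots=3, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))   # learnable scaling parameters
        self.beta = nn.Parameter(torch.zeros(d_model))   # learnable shifting parameters
        self.mlp_gamma = nn.Linear(num_slots * d_model, d_model)
        self.mlp_beta = nn.Linear(num_slots * d_model, d_model)
        self.eps = eps

    def forward(self, r, m_t):
        # r: [T, d] hidden states entering the normalization layer; m_t: [S, d] current memory
        mem = m_t.reshape(1, -1)                    # flatten the memory slots
        gamma_t = self.gamma + self.mlp_gamma(mem)  # eq. (13)
        beta_t = self.beta + self.mlp_beta(mem)     # eq. (14)
        mu = r.mean(dim=-1, keepdim=True)
        nu = r.std(dim=-1, keepdim=True)
        return gamma_t * (r - mu) / (nu + self.eps) + beta_t  # eq. (15)
```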
In this embodiment, the relational memory is initialized with the graph embedding vector instead of setting its initial memory matrix to all zeros, which helps optimize the memory-driven Transformer model. Moreover, the input of the decoder of the memory-driven Transformer model fuses the medical prior knowledge graph and the medical image, so the model has better comprehension capability and a more robust grasp of medical prior knowledge, which improves the accuracy and reliability of medical image report generation.
Specifically, taking the IU-Xray data set and the NCRC-DS data set as examples, the DenseNet-121 model pre-trained on CheXpert is selected as the backbone network of the image encoder for the IU-Xray data set. The two chest radiographs belonging to the same report are input into the model, and their features are concatenated for the encoder of the memory-driven Transformer model; the medical prior knowledge graph containing 284 nodes serves as the medical prior knowledge. For the NCRC-DS data set, the ResNet-101 model pre-trained on ImageNet is chosen as the backbone network of the image encoder. Because this data set is small, only one skin-disease picture and its corresponding description report are input at a time, and the medical prior knowledge graph containing 191 nodes serves as the medical prior knowledge. By default, the number of graph convolution layers is 3, the number of slots of the relational memory is 3, and the word embedding dimension is set to 512. The model is trained with an Adam optimizer under cross-entropy loss; the BLEU-4 score is evaluated on the test set during training, and weight decay and early stopping are applied. In fields such as chest radiographs and skin diseases, the medical image report generation method based on multi-modal fusion of this embodiment surpasses currently common medical image report generation models in accuracy and reliability.
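A sketch of this training configuration (Adam, cross-entropy loss, BLEU-4 tracking with weight decay and early stopping); the learning rate, patience, data-loader interface and the evaluate_bleu4 helper are placeholders, not taken from the patent:

```python
import torch

def train(model, train_loader, eval_loader, epochs=100, patience=10):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-5)
    criterion = torch.nn.CrossEntropyLoss(ignore_index=0)  # assume index 0 is padding
    best_bleu4, stale = 0.0, 0
    for epoch in range(epochs):
        model.train()
        for images, report_tokens in train_loader:
            logits = model(images, report_tokens[:, :-1])   # teacher forcing
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             report_tokens[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        bleu4 = evaluate_bleu4(model, eval_loader)  # hypothetical BLEU-4 helper
        if bleu4 > best_bleu4:
            best_bleu4, stale = bleu4, 0
            torch.save(model.state_dict(), "best.pt")
        else:
            stale += 1
            if stale >= patience:  # early stopping
                return
```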
The multi-modality fusion-based medical image report generation apparatus provided by the present invention is described below, and the multi-modality fusion-based medical image report generation apparatus described below and the multi-modality fusion-based medical image report generation method described above may be referred to in correspondence with each other.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a medical image report generation apparatus based on multi-modal fusion according to the present invention. As shown in fig. 4, the medical image report generating apparatus based on multi-modal fusion provided by the present invention may include:
the graph construction module 10, used for constructing a medical prior knowledge graph and acquiring an initial feature vector of each node in the medical prior knowledge graph;
the graph encoder module 20, used for inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
the image encoder module 30, used for inputting a medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
the multi-modal fusion module 40, used for performing multi-modal fusion of the graph embedding vector and the visual feature sequence by adopting a co-attention mechanism to obtain an attention re-weighted image sequence;
and the report generation module 50, used for inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
Optionally, the graph construction module 10 is specifically configured to:
acquiring a plurality of unlabeled medical image report texts;
extracting a plurality of medical entities from the unlabeled medical image report texts by adopting a named entity recognition algorithm;
reducing the dimensionality of the medical entities by adopting a clustering algorithm;
and constructing the medical prior knowledge graph by taking the dimension-reduced medical entities as nodes and the relations between them as edges.
Optionally, the graph construction module 10 is specifically configured to:
initializing each node of the medical prior knowledge graph through a word embedding model to obtain the initial feature vector of the node.
Optionally, the graph encoder module 20 is specifically configured to:
constructing the graph encoder:

GC(H^(k)) = σ(D^(-1/2) · Ã · D^(-1/2) · H^(k) · W^(k))

H^(k+1) = BN(Dropout(GC(H^(k)))) + H^(k)

wherein Ã represents the adjacency matrix of the medical prior knowledge graph, the adjacency matrix being accompanied by edges pointing from each node to itself; X represents the initial feature matrix of the medical prior knowledge graph, obtained by concatenating the initial feature vectors of all nodes (H^(0) = X); H^(k) represents the graph convolution feature vector of the k-th layer and H^(k+1) that of the (k+1)-th layer; D represents the degree matrix, with D_ii = Σ_j Ã_ij, so that D^(-1/2) · Ã · D^(-1/2) normalizes the aggregated node features; Ã_ij represents the element in row i and column j of the adjacency matrix; W^(k) represents a trainable weight matrix; GC(·) represents the graph convolution function, σ(·) the activation function, Dropout(·) the random drop function, and BN(·) the batch normalization function;
and inputting the medical prior knowledge graph and the initial feature vector of each node into the graph encoder, and taking the graph convolution feature vector of the last layer as the graph embedding vector.
Optionally, the image encoder module 30 is specifically configured to:
inputting the medical image into an image encoder without a linear layer to obtain a four-dimensional visual feature matrix;
transforming the four-dimensional visual feature matrix into a three-dimensional visual feature matrix;
and converting the three-dimensional visual feature matrix into a visual feature sequence.
Optionally, the multimodal fusion module 40 is specifically configured to:
calculating an affinity matrix between the graph embedding vector and the visual feature sequence;
learning, by the affinity matrix, an attention mapping between the graph embedding vector and the sequence of visual features;
calculating an attention weight vector based on the attention map;
calculating a sequence of attention re-weighted images based on the sequence of visual features and the attention weight vector.
Optionally, the multimodal fusion module 40 is specifically configured to:
computing the affinity matrix between the graph embedding vector and the visual feature sequence by the following expression:

C = tanh(G_E^T · W_b · I_E)

wherein C represents the affinity matrix, G_E represents the graph embedding vector, I_E represents the visual feature sequence, and W_b represents a weight matrix.
Optionally, the multimodal fusion module 40 is specifically configured to:
learning the attention mapping between the graph embedding vector and the visual feature sequence by the following expression:

F_i = tanh(W_i · I_E + (W_g · G_E) · C)

wherein F_i represents the output of learning the attention mapping through the affinity matrix, W_i and W_g both represent trainable weight matrices, C represents the affinity matrix, G_E represents the graph embedding vector, and I_E represents the visual feature sequence.
Optionally, the multimodal fusion module 40 is specifically configured to:
calculating the attention weight vector by the following expression:

a_i = softmax(w_f^T · F_i)

wherein a_i represents the attention weight vector, w_f represents a trainable weight vector, and F_i represents the attention mapping result.
Optionally, the multimodal fusion module 40 is specifically configured to:
calculating the attention re-weighted image sequence by the following expression:

x̂_r = a_r · x_r, r = 1, 2, …, R

wherein x̂_r represents an element of the attention re-weighted image sequence, x_r represents the corresponding element of the visual feature sequence, a_r represents the element of the attention weight vector corresponding to x_r, and R = H′ × W′ represents the number of image blocks.
Optionally, the memory-driven Transformer model comprises an encoder and a decoder, wherein the decoder comprises a relational memory and a decoding module provided with a memory-driven normalization layer.
Optionally, the report generating module 50 is specifically configured to:
inputting the attention re-weighted image sequence into the encoder;
initializing the relational memory with the graph embedding vector;
calculating the memory matrix output by the relational memory in the current round;
and inputting the memory matrix output by the relational memory in the current round and the output result of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report.
Optionally, the report generating module 50 is specifically configured to:
initializing the relational memory by the following expression:

M_0 = MLP(G_E · W_m)

wherein M_0 represents the initial memory matrix of the relational memory, G_E represents the graph embedding vector, W_m represents a weight matrix, and MLP(·) represents a multi-layer perceptron used to establish the mapping between dimensions.
Optionally, the report generating module 50 is specifically configured to:
calculating the memory matrix output by the relational memory in the current round according to the following expressions:

Z = softmax(Q · K^T / √d_k) · V

Z̃ = MLP(Z) + Z

M_t = G^f ⊙ M_{t-1} + G^o ⊙ Z̃

wherein Q = M_{t-1} · W_Q, K = [M_{t-1}; y_{t-1}] · W_K, V = [M_{t-1}; y_{t-1}] · W_V; M_{t-1} represents the memory matrix output by the relational memory in the previous round; y_{t-1} represents the word embedding vector of the word predicted in the previous round; W_Q, W_K and W_V are all trainable weight matrices; d_k represents the scaling factor, obtained by dividing the dimension of K by the number of attention heads; MLP(·) represents a multi-layer perceptron; G^f and G^o represent the forget gate and the output gate used to balance M_{t-1} and y_{t-1}; Z̃ represents the multi-head attention output matrix after multi-layer perceptron mapping; and M_t represents the memory matrix output by the relational memory in the current round.
Optionally, the report generating module 50 is specifically configured to:
calculating the output result of the decoding module provided with the memory-driven normalization layer by the following expressions:

γ_t = γ + MLP(M_t)

β_t = β + MLP(M_t)

MCLN(r, M_t) = γ_t ⊙ (r − μ) / ν + β_t

θ = T_D(ψ, N, RM(M_{t-1}, y_{t-1}), MCLN(r, M_t))

wherein ψ represents the output result of the encoder T_E(·); N represents the number of decoder layers; γ represents a learnable scaling parameter matrix used to improve generalization ability, and γ_t represents the result of adding γ to the multi-layer-perceptron mapping of M_t; β represents a learnable shifting parameter matrix used to improve generalization ability, and β_t represents the result of adding β to the multi-layer-perceptron mapping of M_t; r represents the input of the normalization layer, μ represents the mean of r and ν its standard deviation; T_D(·) represents the decoder; RM(·) represents the relational memory; and θ represents the output result of the decoding module.
Fig. 5 illustrates a schematic physical structure diagram of an electronic device. As shown in fig. 5, the electronic device may include: a processor 810, a communication interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication interface 820 and the memory 830 communicate with each other via the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform the medical image report generation method based on multi-modal fusion, the method comprising:
constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder without a linear layer to obtain a visual feature sequence;
performing multi-modal fusion of the graph embedding vector and the visual feature sequence by adopting a co-attention mechanism to obtain an attention re-weighted image sequence;
and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
In addition, the logic instructions in the memory 830 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention or a part thereof which substantially contributes to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention further provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium and that, when executed by a processor, enables a computer to execute the medical image report generation method based on multi-modal fusion provided by the above methods, the method comprising:
constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder without a linear layer to obtain a visual feature sequence;
performing multi-modal fusion of the graph embedding vector and the visual feature sequence by adopting a co-attention mechanism to obtain an attention re-weighted image sequence;
and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the medical image report generation method based on multi-modal fusion provided by the above methods, the method comprising:
constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder without a linear layer to obtain a visual feature sequence;
performing multi-modal fusion of the graph embedding vector and the visual feature sequence by adopting a co-attention mechanism to obtain an attention re-weighted image sequence;
and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
The above-described apparatus embodiments are merely illustrative; the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (19)

1. A medical image report generation method based on multi-modal fusion is characterized by comprising the following steps:
constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
inputting the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
performing multi-modal fusion on the graph embedding vector and the visual feature sequence by using a co-attention mechanism to obtain an attention-re-weighted image sequence;
and inputting the attention-re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
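For readers tracing the data flow of claim 1, the following is a minimal, shape-level Python sketch (PyTorch assumed). All tensors are random stand-ins, the weighting rule in step 4 is a placeholder rather than the claimed co-attention (that formulation is sketched after claim 10), and the sizes N, R and d are illustrative.

```python
import torch

torch.manual_seed(0)
N, R, d = 40, 49, 512                      # graph nodes, image patches, width
graph_embed = torch.randn(N, d)            # steps 1-2: graph embedding G_E
visual_seq = torch.randn(R, d)             # step 3: visual feature sequence I_E

# step 4: co-attention re-weights each patch (stand-in weighting rule;
# the claimed formulation is sketched after claim 10)
affinity = torch.tanh(graph_embed @ torch.randn(d, d) @ visual_seq.T)  # N x R
weights = torch.softmax(affinity.max(dim=0).values, dim=0)             # R
reweighted = visual_seq * weights.unsqueeze(-1)                        # R x d

# step 5: the re-weighted sequence would feed the memory-driven Transformer
print(reweighted.shape)                    # torch.Size([49, 512])
```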
2. The method for generating a medical image report based on multi-modal fusion according to claim 1, wherein the constructing a medical prior knowledge graph comprises:
acquiring a plurality of unlabeled medical image report texts;
extracting a plurality of medical entities from the plurality of unlabeled medical image report texts by using a named entity recognition algorithm;
performing dimension reduction on the plurality of medical entities by using a clustering algorithm;
and constructing the medical prior knowledge graph by taking the dimension-reduced medical entities as nodes and the relations between the dimension-reduced medical entities as edges.
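A toy sketch of the claim-2 pipeline follows. The keyword lexicon stands in for a real medical named entity recognition model, co-occurrence within one report stands in for the extracted entity relations, the clustering-based dimension reduction is elided, and extract_entities and build_prior_graph are hypothetical names.

```python
# Toy sketch: mine entities from unlabeled report text and build a
# co-occurrence graph as the medical prior knowledge graph.
from itertools import combinations

import networkx as nx

TOY_LEXICON = {"effusion", "pleural effusion", "cardiomegaly", "opacity"}

def extract_entities(report: str) -> set[str]:
    # stand-in NER: keyword spotting over a toy lexicon
    text = report.lower()
    return {term for term in TOY_LEXICON if term in text}

def build_prior_graph(reports: list[str]) -> nx.Graph:
    g = nx.Graph()
    for rep in reports:
        ents = extract_entities(rep)
        g.add_nodes_from(ents)
        # edge = the two entities co-occur in at least one report
        g.add_edges_from(combinations(sorted(ents), 2))
    return g

g = build_prior_graph(["Small pleural effusion with basal opacity.",
                       "Cardiomegaly; no effusion."])
print(g.number_of_nodes(), g.number_of_edges())
```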
3. The method according to claim 1, wherein the obtaining an initial feature vector of each node in the medical prior knowledge graph comprises:
and initializing each node of the medical prior knowledge graph through a word embedding model to obtain the initial feature vector of each node.
4. The method according to claim 1, wherein the inputting the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into a graph encoder to obtain a graph embedding vector comprises:
constructing a graph encoder:

GC(H^(k)) = D̃^(−1/2) · Ã · D̃^(−1/2) · H^(k) · W^(k)

H^(k+1) = Dropout(σ(BN(GC(H^(k)))))

where Ã represents the adjacency matrix of the medical prior knowledge graph with added edges pointing to self nodes, H^(0) represents the initial feature vector of the medical prior knowledge graph, obtained by concatenating the initial feature vectors of all nodes in the medical prior knowledge graph, H^(k) represents the graph convolution feature vector of the k-th layer, H^(k+1) represents the graph convolution feature vector of the (k+1)-th layer, D̃ is the degree matrix with D̃_ii = Σ_j Ã_ij, Ã_ij represents the element in row i and column j of the adjacency matrix of the medical prior knowledge graph, W^(k) represents a trainable weight matrix, GC(·) represents the graph convolution function, σ(·) represents an activation function, Dropout(·) represents a random drop function, and BN(·) represents a batch normalization function;
and inputting the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into the graph encoder, and taking the graph convolution feature vector of the last layer as the graph embedding vector.
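The graph encoder of claim 4 could be sketched as below in PyTorch. The GCNEncoder class name, the ReLU activation, and the ordering GC → BN → σ → Dropout are assumptions consistent with the reconstructed expressions above, not confirmed details of the source.

```python
import torch
import torch.nn as nn

class GCNEncoder(nn.Module):
    def __init__(self, dim: int, num_layers: int = 2, p_drop: float = 0.1):
        super().__init__()
        self.weights = nn.ModuleList(nn.Linear(dim, dim, bias=False)
                                     for _ in range(num_layers))   # W^(k)
        self.norms = nn.ModuleList(nn.BatchNorm1d(dim)
                                   for _ in range(num_layers))     # BN()
        self.drop = nn.Dropout(p_drop)                             # Dropout()

    def forward(self, adj: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        a = adj + torch.eye(adj.size(0))           # add self-loop edges: Ã
        d_inv_sqrt = a.sum(dim=1).pow(-0.5)        # D̃^(-1/2) from node degrees
        a_hat = d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]
        for w, bn in zip(self.weights, self.norms):
            # GC -> BN -> sigma (ReLU assumed) -> Dropout, per the claim-4 form
            h = self.drop(torch.relu(bn(w(a_hat @ h))))
        return h                                   # last layer = graph embedding G_E

enc = GCNEncoder(dim=64)
adj = (torch.rand(10, 10) > 0.7).float()
adj = ((adj + adj.T) > 0).float()                  # symmetrize the toy graph
print(enc(adj, torch.randn(10, 64)).shape)         # torch.Size([10, 64])
```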
5. The method according to claim 1, wherein the inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence comprises:
inputting the medical image into an image encoder that does not include a linear layer to obtain a four-dimensional visual feature matrix;
reshaping the four-dimensional visual feature matrix into a three-dimensional visual feature matrix;
and converting the three-dimensional visual feature matrix into the visual feature sequence.
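One plausible reading of claim 5, sketched with a torchvision (recent versions) ResNet-101 backbone; the backbone choice is an assumption, since the claim only requires that the encoder contain no linear layer.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

# drop avgpool + fc so no linear layer remains in the encoder
backbone = nn.Sequential(*list(resnet101(weights=None).children())[:-2])

img = torch.randn(1, 3, 224, 224)          # one medical image
feat4d = backbone(img)                     # (B, C, H', W') = (1, 2048, 7, 7)
b, c, hp, wp = feat4d.shape
feat3d = feat4d.view(b, c, hp * wp)        # (B, C, R), R = H' x W' image blocks
visual_seq = feat3d.permute(0, 2, 1)       # (B, R, C): one feature per block
print(visual_seq.shape)                    # torch.Size([1, 49, 2048])
```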
6. The method for generating a medical image report based on multi-modal fusion according to claim 1, wherein the performing multi-modal fusion on the graph embedding vector and the visual feature sequence by using a co-attention mechanism to obtain an attention-re-weighted image sequence comprises:
calculating an affinity matrix between the graph embedding vector and the visual feature sequence;
learning, through the affinity matrix, an attention mapping between the graph embedding vector and the visual feature sequence;
calculating an attention weight vector based on the attention mapping;
and calculating the attention-re-weighted image sequence based on the visual feature sequence and the attention weight vector.
7. The method for generating a medical image report based on multi-modal fusion according to claim 6, wherein the calculating an affinity matrix between the graph embedding vector and the visual feature sequence comprises:
calculating the affinity matrix between the graph embedding vector and the visual feature sequence by the following expression:

C = tanh(G_E·W_b·I_E^T)

where C represents the affinity matrix, G_E represents the graph embedding vector, I_E represents the visual feature sequence, and W_b represents a trainable weight matrix.
8. The method according to claim 6, wherein the learning, through the affinity matrix, an attention mapping between the graph embedding vector and the visual feature sequence comprises:
learning the attention mapping between the graph embedding vector and the visual feature sequence by the following expression:

F_i = tanh(W_i·I_E + (W_g·G_E)·C)

where F_i represents the attention mapping learned from the graph embedding vector and the visual feature sequence through the affinity matrix, W_i and W_g both represent trainable weight matrices, C represents the affinity matrix, G_E represents the graph embedding vector, and I_E represents the visual feature sequence.
9. The method according to claim 6, wherein the calculating an attention weight vector based on the attention mapping comprises:
calculating the attention weight vector by the following expression:

a_i = softmax(w_fi^T·F_i)

where a_i represents the attention weight vector, w_fi represents a trainable weight matrix, and F_i represents the attention mapping result.
10. The method according to claim 6, wherein the calculating the attention-re-weighted image sequence based on the visual feature sequence and the attention weight vector comprises:
calculating the attention-re-weighted image sequence by the following expression:

x̂_r = a_r·x_r, r = 1, 2, …, R

where x_{1,2,…,R} represents the elements of the visual feature sequence, x̂_{1,2,…,R} represents the elements of the attention-re-weighted image sequence, a_r represents the element of the attention weight vector corresponding to the r-th element of the visual feature sequence, and R = H′ × W′ represents the number of image blocks.
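Claims 6 through 10 together describe one co-attention pass, which could be sketched as a single module as below; the shared width d, the single-head form, and the CoAttention class name are assumptions.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.w_b = nn.Parameter(torch.randn(d, d) * d ** -0.5)  # W_b
        self.w_i = nn.Linear(d, d, bias=False)                  # W_i
        self.w_g = nn.Linear(d, d, bias=False)                  # W_g
        self.w_f = nn.Linear(d, 1, bias=False)                  # w_fi

    def forward(self, g_e: torch.Tensor, i_e: torch.Tensor) -> torch.Tensor:
        # claim 7: C = tanh(G_E W_b I_E^T), shape (N, R)
        c = torch.tanh(g_e @ self.w_b @ i_e.T)
        # claim 8: F = tanh(W_i I_E + (W_g G_E) C), shape (R, d)
        f = torch.tanh(self.w_i(i_e) + c.T @ self.w_g(g_e))
        # claim 9: a = softmax(w_fi^T F) over the R patches
        a = torch.softmax(self.w_f(f).squeeze(-1), dim=0)
        # claim 10: x̂_r = a_r * x_r for r = 1..R
        return i_e * a.unsqueeze(-1)

coatt = CoAttention(d=512)
out = coatt(torch.randn(40, 512), torch.randn(49, 512))
print(out.shape)  # torch.Size([49, 512])
```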
11. The method according to claim 1, wherein the memory-driven Transformer model comprises an encoder and a decoder, and the decoder comprises: a relational memory and a decoding module provided with a memory-driven normalization layer.
12. The method according to claim 11, wherein the inputting the attention-re-weighted image sequence into a memory-driven Transformer model to generate a medical image report comprises:
inputting the attention-re-weighted image sequence into the encoder;
initializing the relational memory with the graph embedding vector;
calculating the memory matrix output by the relational memory in the current round;
and inputting the memory matrix output by the relational memory in the current round and the output result of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report.
13. The method according to claim 12, wherein the initializing the relational memory with the graph embedding vector comprises:
initializing the relational memory by the following expression:

M_0 = MLP(G_E·W_m)

where M_0 represents the initial memory matrix of the relational memory, G_E represents the graph embedding vector, W_m represents a trainable weight matrix, and MLP(·) represents a multi-layer perceptron used to build the mapping between dimensions.
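A sketch of the claim-13 initialization. The claim fixes only the form M_0 = MLP(G_E·W_m); the mean-pooling from N node vectors to one vector and the replication across memory slots are our assumptions for making the dimensions work.

```python
import torch
import torch.nn as nn

d, num_slots = 512, 3
w_m = torch.randn(d, d) * d ** -0.5                 # W_m
mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

g_e = torch.randn(40, d)                            # graph embedding, N x d
pooled = (g_e @ w_m).mean(dim=0, keepdim=True)      # collapse nodes -> 1 x d (assumed)
m0 = mlp(pooled).expand(num_slots, d)               # replicate across S slots (assumed)
print(m0.shape)                                     # torch.Size([3, 512])
```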
14. The method according to claim 12, wherein the calculating the memory matrix output by the relational memory in the current round comprises:
calculating the memory matrix output by the relational memory in the current round according to the following expressions:

Z = softmax(Q·K^T / √d_k)·V

M̃_t = MLP(Z)

M_t = G_t^f ⊙ M_{t-1} + G_t^o ⊙ M̃_t

where Q = M_{t-1}·W_Q, K = [M_{t-1}; y_{t-1}]·W_K, V = [M_{t-1}; y_{t-1}]·W_V, M_{t-1} represents the memory matrix output by the relational memory in the previous round, y_{t-1} represents the word embedding vector of the previous round of prediction in the relational memory, W_Q, W_K and W_V are all trainable weight matrices, d_k represents the scaling factor, obtained by dividing the dimension of K by the number of attention heads, MLP(·) represents the multi-layer perceptron, G_t^f and G_t^o are the forget gate and the output gate used for balancing M_{t-1} and y_{t-1}, M̃_t represents the multi-head attention output matrix mapped by the multi-layer perceptron, and M_t represents the memory matrix output by the relational memory in the current round.
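The claim-14 update could be sketched as below. The single attention head (so d_k reduces to the model width) and the exact parameterization of the forget and output gates are assumptions; the claim fixes the Q/K/V construction, the MLP mapping, and the gated combination.

```python
import torch
import torch.nn as nn

class RelationalMemory(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.w_q = nn.Linear(d, d, bias=False)
        self.w_k = nn.Linear(d, d, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.gate = nn.Linear(2 * d, 2 * d)   # produces forget + output gates (assumed form)
        self.d = d

    def forward(self, m_prev: torch.Tensor, y_prev: torch.Tensor) -> torch.Tensor:
        kv = torch.cat([m_prev, y_prev], dim=0)           # [M_{t-1}; y_{t-1}]
        q, k, v = self.w_q(m_prev), self.w_k(kv), self.w_v(kv)
        att = torch.softmax(q @ k.T / self.d ** 0.5, dim=-1) @ v
        m_tilde = self.mlp(att)                           # MLP-mapped attention output
        gates = self.gate(torch.cat([m_prev, m_tilde], dim=-1))
        g_f, g_o = torch.sigmoid(gates).chunk(2, dim=-1)  # forget / output gates
        return g_f * m_prev + g_o * torch.tanh(m_tilde)   # M_t

rm = RelationalMemory(d=512)
m_t = rm(torch.randn(3, 512), torch.randn(1, 512))        # 3 memory slots, 1 word
print(m_t.shape)  # torch.Size([3, 512])
```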
15. The method according to claim 14, wherein the inputting the memory matrix output by the relational memory in the current round and the output result of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report comprises:
calculating the output result of the decoding module provided with the memory-driven normalization layer by the following expressions:

γ_t = γ + MLP(M_t)

β_t = β + MLP(M_t)

MCLN(r, M_t) = γ_t ⊙ (r − μ) / ν + β_t

θ = T_D(ψ, N, RM(M_{t-1}, y_{t-1}), MCLN(r, M_t))

where ψ represents the output of the encoder, N represents the number of decoder layers, γ represents a learnable scaling parameter matrix used for improving generalization ability, γ_t represents the result of adding γ and the multi-layer-perceptron mapping of M_t, β represents a learnable shift parameter matrix used for improving generalization ability, β_t represents the result of adding β and the multi-layer-perceptron mapping of M_t, r represents the decoder hidden state to be normalized, μ represents the mean of r, ν represents the standard deviation of r, θ represents the output result of the decoding module, T_E(·) represents the encoder, T_D(·) represents the decoder, RM(·) represents the relational memory, and MCLN(·) represents the memory-driven normalization layer.
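Finally, the memory-driven normalization layer of claim 15 could be sketched as follows; flattening M_t before the MLP maps, the eps term, and the MCLN class name are our assumptions.

```python
import torch
import torch.nn as nn

class MCLN(nn.Module):
    def __init__(self, d: int, mem_slots: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d))   # learnable scaling gamma
        self.beta = nn.Parameter(torch.zeros(d))   # learnable shift beta
        self.mlp_g = nn.Linear(mem_slots * d, d)   # MLP map of M_t for gamma_t
        self.mlp_b = nn.Linear(mem_slots * d, d)   # MLP map of M_t for beta_t
        self.eps = eps

    def forward(self, r: torch.Tensor, m_t: torch.Tensor) -> torch.Tensor:
        m_flat = m_t.reshape(-1)                    # flatten memory (assumed)
        gamma_t = self.gamma + self.mlp_g(m_flat)   # gamma_t = gamma + MLP(M_t)
        beta_t = self.beta + self.mlp_b(m_flat)     # beta_t  = beta  + MLP(M_t)
        mu = r.mean(dim=-1, keepdim=True)           # mean of r
        nu = r.std(dim=-1, keepdim=True)            # standard deviation of r
        return gamma_t * (r - mu) / (nu + self.eps) + beta_t

mcln = MCLN(d=512, mem_slots=3)
out = mcln(torch.randn(10, 512), torch.randn(3, 512))
print(out.shape)  # torch.Size([10, 512])
```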
16. A medical image report generation device based on multi-modal fusion, characterized by comprising:
a graph construction module, configured to construct a medical prior knowledge graph and acquire an initial feature vector of each node in the medical prior knowledge graph;
a graph encoder module, configured to input the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into a graph encoder to obtain a graph embedding vector;
an image encoder module, configured to input a medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
a multi-modal fusion module, configured to perform multi-modal fusion on the graph embedding vector and the visual feature sequence by using a co-attention mechanism to obtain an attention-re-weighted image sequence;
and a report generation module, configured to input the attention-re-weighted image sequence into a memory-driven Transformer model to generate the medical image report.
17. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for generating a medical image report based on multi-modal fusion according to any one of claims 1 to 15 when executing the program.
18. A non-transitory computer-readable storage medium, having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method for generating a medical image report based on multi-modal fusion according to any one of claims 1 to 15.
19. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method for generating a medical image report based on multi-modal fusion according to any one of claims 1 to 15.
CN202210836966.3A 2022-07-15 2022-07-15 Medical image report generation method and device based on multi-mode fusion Active CN115331769B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210836966.3A CN115331769B (en) 2022-07-15 2022-07-15 Medical image report generation method and device based on multi-mode fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210836966.3A CN115331769B (en) 2022-07-15 2022-07-15 Medical image report generation method and device based on multi-mode fusion

Publications (2)

Publication Number Publication Date
CN115331769A true CN115331769A (en) 2022-11-11
CN115331769B CN115331769B (en) 2023-05-09

Family

ID=83917479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210836966.3A Active CN115331769B (en) 2022-07-15 2022-07-15 Medical image report generation method and device based on multi-mode fusion

Country Status (1)

Country Link
CN (1) CN115331769B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180060512A1 (en) * 2016-08-29 2018-03-01 Jeffrey Sorenson System and method for medical imaging informatics peer review system
CN112992308A (en) * 2021-03-25 2021-06-18 腾讯科技(深圳)有限公司 Training method of medical image report generation model and image report generation method
CN113724359A (en) * 2021-07-14 2021-11-30 鹏城实验室 CT report generation method based on Transformer
CN114724670A (en) * 2022-06-02 2022-07-08 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Medical report generation method and device, storage medium and electronic equipment

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115937689A (en) * 2022-12-30 2023-04-07 安徽农业大学 Agricultural pest intelligent identification and monitoring technology
CN115937689B (en) * 2022-12-30 2023-08-11 安徽农业大学 Intelligent identification and monitoring technology for agricultural pests
CN116028654A (en) * 2023-03-30 2023-04-28 中电科大数据研究院有限公司 Multi-mode fusion updating method for knowledge nodes
CN116028654B (en) * 2023-03-30 2023-06-13 中电科大数据研究院有限公司 Multi-mode fusion updating method for knowledge nodes
CN117010494A (en) * 2023-09-27 2023-11-07 之江实验室 Medical data generation method and system based on causal expression learning
CN117010494B (en) * 2023-09-27 2024-01-05 之江实验室 Medical data generation method and system based on causal expression learning
CN117726920A (en) * 2023-12-20 2024-03-19 广州丽芳园林生态科技股份有限公司 Knowledge-graph-based plant disease and pest identification method, system, equipment and storage medium
CN117726920B (en) * 2023-12-20 2024-06-07 广州丽芳园林生态科技股份有限公司 Knowledge-graph-based plant disease and pest identification method, system, equipment and storage medium
CN117649917A (en) * 2024-01-29 2024-03-05 北京大学 Training method and device for test report generation model and test report generation method
CN118072899A (en) * 2024-02-27 2024-05-24 中国人民解放军总医院第二医学中心 Bone mineral density report generation platform based on diffusion model text generation technology
CN117993500A (en) * 2024-04-07 2024-05-07 江西为易科技有限公司 Medical teaching data management method and system based on artificial intelligence

Also Published As

Publication number Publication date
CN115331769B (en) 2023-05-09

Similar Documents

Publication Publication Date Title
CN115331769B (en) Medical image report generation method and device based on multi-mode fusion
US11507800B2 (en) Semantic class localization digital environment
Shin et al. Learning to read chest x-rays: Recurrent neural cascade model for automated image annotation
Lu et al. Neural baby talk
CN109471895B (en) Electronic medical record phenotype extraction and phenotype name normalization method and system
Wu et al. Neural scene de-rendering
CN109545302A (en) A kind of semantic-based medical image report template generation method
CN107909115B (en) Image Chinese subtitle generating method
Egger et al. Deep learning—a first meta-survey of selected reviews across scientific disciplines, their commonalities, challenges and research impact
CN112561064B (en) Knowledge base completion method based on OWKBC model
WO2021052875A1 (en) Systems and methods for incorporating multimodal data to improve attention mechanisms
CN111667483B (en) Training method of segmentation model of multi-modal image, image processing method and device
WO2022052530A1 (en) Method and apparatus for training face correction model, electronic device, and storage medium
CN112560454B (en) Bilingual image subtitle generating method, bilingual image subtitle generating system, storage medium and computer device
CN116129141B (en) Medical data processing method, apparatus, device, medium and computer program product
CN111667027A (en) Multi-modal image segmentation model training method, image processing method and device
CN115190999A (en) Classifying data outside of a distribution using contrast loss
CN116563537A (en) Semi-supervised learning method and device based on model framework
CN116385937A (en) Method and system for solving video question and answer based on multi-granularity cross-mode interaction framework
CN116258928A (en) Pre-training method based on self-supervision information of unlabeled medical image
CN115662565A (en) Medical image report generation method and equipment integrating label information
Robben et al. DeepVoxNet: voxel-wise prediction for 3D images
CN114139531A (en) Medical entity prediction method and system based on deep learning
Souza et al. Automatic recognition of continuous signing of brazilian sign language for medical interview
Rebai et al. Deep kernel-SVM network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant