CN115331769A - Medical image report generation method and device based on multi-modal fusion


Info

Publication number
CN115331769A
CN115331769A (application CN202210836966.3A)
Authority
CN
China
Prior art keywords: medical, representing, graph, memory, attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210836966.3A
Other languages
Chinese (zh)
Other versions
CN115331769B (en)
Inventor
黄雨
李航
徐德轩
金芝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Peking University First Hospital
Original Assignee
Peking University
Peking University First Hospital
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University and Peking University First Hospital
Priority to CN202210836966.3A
Publication of CN115331769A
Application granted
Publication of CN115331769B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G16 — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H — HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H15/00 — ICT specially adapted for medical reports, e.g. generation or transmission thereof
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 — Arrangements for image or video recognition or understanding
    • G06V10/70 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 — Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G — PHYSICS
    • G16 — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H — HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00 — ICT specially adapted for the handling or processing of medical images

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Public Health (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The invention provides a medical image report generation method and device based on multi-modal fusion. The method comprises the following steps: constructing a medical prior knowledge graph and acquiring an initial feature vector of each node in the medical prior knowledge graph; inputting the medical prior knowledge graph and the initial feature vectors of its nodes into a graph encoder to obtain a graph embedding vector; inputting the medical image into an image encoder without a linear layer to obtain a visual feature sequence; performing multi-modal fusion of the graph embedding vector and the visual feature sequence with a co-attention mechanism to obtain an attention re-weighted image sequence; and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report. The invention can improve the accuracy and reliability of medical image report generation.

Description

Medical image report generation method and device based on multi-modal fusion
Technical Field
The invention relates to the technical field at the intersection of medicine and artificial intelligence, and in particular to a medical image report generation method and device based on multi-modal fusion.
Background
In recent years, medical image reports have been a focus of research and collaboration between computer scientists and medical professionals. Accurate and efficient medical image reports can greatly improve doctors' grasp of patients' conditions, reduce doctors' workload, assist them in making correct diagnoses, and provide corresponding medical guidance and suggestions for patients.
Currently, research on medical image report generation technology is still at an early stage. In existing medical image report generation schemes, the medical knowledge graph is generally used only for subtasks such as classification and is not integrated into the generation model, so the accuracy and reliability of the generated medical image reports are not high.
Disclosure of Invention
The invention provides a method and a device for generating a medical image report based on multi-modal fusion, to overcome the defect in the prior art that the medical knowledge graph is used only for subtasks such as classification and is not fused into the generation model, which leaves the accuracy and reliability of medical image report generation low; the purpose of improving the accuracy and reliability of medical image report generation is thereby achieved.
The invention provides a medical image report generation method based on multi-modal fusion, which comprises the following steps:
constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
inputting the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into a graph encoder to obtain a graph embedding vector;
inputting a medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
performing multi-modal fusion of the graph embedding vector and the visual feature sequence by adopting a co-attention mechanism to obtain an attention re-weighted image sequence;
and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
According to the medical image report generation method based on multi-modal fusion provided by the invention, constructing the medical prior knowledge graph comprises:
acquiring a plurality of unlabeled medical image report texts;
extracting a plurality of medical entities from the unlabeled medical image report texts by adopting a named entity recognition algorithm;
reducing the dimensionality of the medical entities by adopting a clustering algorithm;
and constructing the medical prior knowledge graph by taking the dimension-reduced medical entities as nodes and the relations between the dimension-reduced medical entities as edges.
According to the medical image report generation method based on multi-modal fusion provided by the invention, acquiring the initial feature vector of each node in the medical prior knowledge graph comprises:
initializing each node of the medical prior knowledge graph through a word embedding model to obtain the initial feature vector of the node.
According to the medical image report generation method based on multi-modal fusion provided by the invention, inputting the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into a graph encoder to obtain a graph embedding vector comprises:
constructing the graph encoder:

GC(H^(k)) = σ(D^(-1/2) · Ã · D^(-1/2) · H^(k) · W^(k))

H^(k+1) = BN(Dropout(GC(H^(k)))) + H^(k)

wherein Ã represents the adjacency matrix of the medical prior knowledge graph, the adjacency matrix being accompanied by edges pointing from each node to itself; X represents the initial feature matrix of the medical prior knowledge graph, obtained by concatenating the initial feature vectors of all nodes (H^(0) = X); H^(k) represents the graph convolution feature vector of the k-th layer and H^(k+1) that of the (k+1)-th layer; D represents the degree matrix, with D_ii = Σ_j Ã_ij, so that D^(-1/2) · Ã · D^(-1/2) normalizes the aggregated node features; Ã_ij represents the element in row i and column j of the adjacency matrix; W^(k) represents a trainable weight matrix; GC(·) represents the graph convolution function, σ(·) the activation function, Dropout(·) the random drop function, and BN(·) the batch normalization function;
and inputting the medical prior knowledge graph and the initial feature vector of each node into the graph encoder, and taking the graph convolution feature vector of the last layer as the graph embedding vector.
According to the medical image report generation method based on multi-modal fusion provided by the invention, inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence comprises:
inputting the medical image into the image encoder to obtain a four-dimensional visual feature matrix;
transforming the four-dimensional visual feature matrix into a three-dimensional visual feature matrix;
and converting the three-dimensional visual feature matrix into the visual feature sequence.
According to the medical image report generation method based on multi-modal fusion provided by the invention, performing multi-modal fusion of the graph embedding vector and the visual feature sequence by adopting a co-attention mechanism to obtain an attention re-weighted image sequence comprises:
computing an affinity matrix between the graph embedding vector and the visual feature sequence;
learning, through the affinity matrix, an attention mapping between the graph embedding vector and the visual feature sequence;
calculating an attention weight vector based on the attention mapping;
and calculating the attention re-weighted image sequence based on the visual feature sequence and the attention weight vector.
According to the medical image report generation method based on multi-modal fusion provided by the invention, computing the affinity matrix between the graph embedding vector and the visual feature sequence comprises:
computing the affinity matrix by the following expression:

C = tanh(G_E^T · W_b · I_E)

wherein C represents the affinity matrix, G_E represents the graph embedding vector, I_E represents the visual feature sequence, and W_b represents a weight matrix.
According to the medical image report generation method based on multi-modal fusion provided by the invention, learning the attention mapping between the graph embedding vector and the visual feature sequence through the affinity matrix comprises:
learning the attention mapping by the following expression:

F_i = tanh(W_i · I_E + (W_g · G_E) · C)

wherein F_i represents the output of learning the attention mapping between the graph embedding vector and the visual feature sequence through the affinity matrix, W_i and W_g both represent trainable weight matrices, C represents the affinity matrix, G_E represents the graph embedding vector, and I_E represents the visual feature sequence.
According to the medical image report generation method based on multi-modal fusion provided by the invention, calculating the attention weight vector based on the attention mapping comprises:
calculating the attention weight vector by the following expression:

a_i = softmax(w_f^T · F_i)

wherein a_i represents the attention weight vector, w_f represents a trainable weight vector, and F_i represents the attention mapping result.
According to the medical image report generation method based on multi-modal fusion provided by the invention, calculating the attention re-weighted image sequence based on the visual feature sequence and the attention weight vector comprises:
calculating the attention re-weighted image sequence by the following expression:

x̂_r = a_r · x_r, r = 1, 2, …, R

wherein x̂_r represents an element of the attention re-weighted image sequence, x_r represents the corresponding element of the visual feature sequence, a_r represents the element of the attention weight vector corresponding to x_r, and R = H′ × W′ represents the number of image blocks.
According to the medical image report generation method based on multi-modal fusion provided by the invention, the memory-driven Transformer model comprises an encoder and a decoder, wherein the decoder comprises a relational memory and a decoding module provided with a memory-driven normalization layer.
According to the medical image report generation method based on multi-modal fusion provided by the invention, inputting the attention re-weighted image sequence into the memory-driven Transformer model to generate the medical image report comprises:
inputting the attention re-weighted image sequence into the encoder;
initializing the relational memory with the graph embedding vector;
calculating the memory matrix output by the relational memory in the current round;
and inputting the memory matrix output by the relational memory in the current round and the output result of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report.
According to the medical image report generation method based on multi-modal fusion provided by the invention, initializing the relational memory with the graph embedding vector comprises:
initializing the relational memory by the following expression:

M_0 = MLP(G_E · W_m)

wherein M_0 represents the initial memory matrix of the relational memory, G_E represents the graph embedding vector, W_m represents a weight matrix, and MLP(·) represents a multi-layer perceptron used to establish the mapping between dimensions.
According to the medical image report generation method based on multi-modal fusion provided by the invention, calculating the memory matrix output by the relational memory in the current round comprises:
calculating the memory matrix according to the following expressions:

Z = softmax(Q · K^T / √d_k) · V

Z̃ = MLP(Z) + Z

M_t = G^f ⊙ M_{t-1} + G^o ⊙ Z̃

wherein Q = M_{t-1} · W_Q, K = [M_{t-1}; y_{t-1}] · W_K, V = [M_{t-1}; y_{t-1}] · W_V; M_{t-1} represents the memory matrix output by the relational memory in the previous round; y_{t-1} represents the word embedding vector of the word predicted in the previous round; W_Q, W_K and W_V are all trainable weight matrices; d_k represents the scaling factor, obtained by dividing the dimension of K by the number of attention heads; MLP(·) represents a multi-layer perceptron; G^f and G^o represent the forget gate and the output gate used to balance M_{t-1} and y_{t-1}; Z̃ represents the multi-head attention output matrix after multi-layer perceptron mapping; and M_t represents the memory matrix output by the relational memory in the current round.
According to the medical image report generation method based on multi-modal fusion provided by the invention, inputting the memory matrix output by the relational memory in the current round and the output result of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report comprises:
calculating the output result of the decoding module by the following expressions:

γ_t = γ + MLP(M_t)

β_t = β + MLP(M_t)

MCLN(r, M_t) = γ_t ⊙ (r − μ) / ν + β_t

θ = T_D(ψ, N, RM(M_{t-1}, y_{t-1}), MCLN(r, M_t))

wherein ψ represents the output result of the encoder T_E(·); N represents the number of decoder layers; γ represents a learnable scaling parameter matrix used to improve generalization ability, and γ_t represents the result of adding γ to the multi-layer-perceptron mapping of M_t; β represents a learnable shifting parameter matrix used to improve generalization ability, and β_t represents the result of adding β to the multi-layer-perceptron mapping of M_t; r represents the input of the normalization layer, μ represents the mean of r and ν its standard deviation; T_D(·) represents the decoder; RM(·) represents the relational memory; and θ represents the output result of the decoding module.
The invention also provides a medical image report generation device based on multi-modal fusion, which comprises:
a graph construction module, used for constructing a medical prior knowledge graph and acquiring an initial feature vector of each node in the medical prior knowledge graph;
a graph encoder module, used for inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
an image encoder module, used for inputting a medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
a multi-modal fusion module, used for performing multi-modal fusion of the graph embedding vector and the visual feature sequence by adopting a co-attention mechanism to obtain an attention re-weighted image sequence;
and a report generation module, used for inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
The invention further provides an electronic device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the program to implement the method for generating the medical image report based on multi-modal fusion as described in any one of the above.
The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method for generating a medical image report based on multimodal fusion as described in any of the above.
The present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a method for generating a medical image report based on multimodal fusion as described in any of the above.
The invention provides a method and a device for generating a medical image report based on multi-modal fusion. A medical prior knowledge graph is constructed, and an initial feature vector of each node in the graph is acquired; the medical prior knowledge graph and the initial feature vectors of its nodes are then input into a graph encoder to obtain a graph embedding vector; the medical image is input into an image encoder without a linear layer to obtain a visual feature sequence; multi-modal fusion of the graph embedding vector and the visual feature sequence is performed with a co-attention mechanism to obtain an attention re-weighted image sequence, which fuses the medical prior knowledge graph and the medical image; finally, the attention re-weighted image sequence is input into a memory-driven Transformer model to generate a medical image report. Because the attention re-weighted image sequence fuses the medical prior knowledge graph and the medical image, the memory-driven Transformer model gains better comprehension capability and a more robust grasp of medical prior knowledge, so the accuracy and reliability of medical image report generation can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a multi-modal fusion-based medical image report generation method provided by the present invention;
FIG. 2 is a schematic structural diagram of the medical image report generation model based on the medical prior knowledge graph and memory driving provided by the present invention;
FIG. 3 is a schematic diagram of the construction of the medical prior knowledge graph provided by the present invention;
FIG. 4 is a schematic structural diagram of a medical image report generation device based on multi-modal fusion provided by the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
The method for generating a medical image report based on multi-modal fusion according to the present invention is described below with reference to fig. 1 to 3.
Referring to fig. 1, fig. 1 is a schematic flow chart of a medical image report generation method based on multi-modal fusion according to the present invention. As shown in fig. 1, the method for generating a medical image report based on multi-modal fusion provided by the present invention may include the following steps:
101, constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
102, inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
103, inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
104, adopting a co-attention mechanism to perform multi-modal fusion of the graph embedding vector and the visual feature sequence to obtain an attention re-weighted image sequence;
and 105, inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
In step 101, a medical prior knowledge graph of appropriate scale is constructed by natural language processing methods, rather than built manually from a handful of selected keywords. The medical prior knowledge graph of this embodiment has the following characteristics:
1) Comprehensive entity types
The medical prior knowledge graph contains entity types covering all aspects, so it can describe a disease from every angle rather than only recording the disease name. For example, for a skin disease it may include the name, position, shape, color and other entity information of the disease.
2) Appropriate graph scale
The scale of the medical prior knowledge graph should be appropriate: if the graph is too large, training and learning become difficult; if it is too small, not enough prior knowledge is retained.
3) Comprehensive entity relations
Relations between entities can be established automatically through a relation extraction method and can also be supplemented manually, so that the medical prior knowledge graph embodies more prior knowledge.
In this step, to support the subsequent steps, each node in the medical prior knowledge graph needs to be initialized to obtain its initial feature vector. Optionally, each node of the medical prior knowledge graph is initialized through a word embedding model, as sketched below.
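A minimal sketch of this node initialization, assuming the HuggingFace transformers library and the publicly released BioBERT checkpoint dmis-lab/biobert-base-cased-v1.1 (the embodiment names BioBert for the IU-Xray data set; the exact checkpoint and the mean pooling are illustrative assumptions, not taken from the patent):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
encoder = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

def init_node_features(node_keywords):
    """Map each graph node's keyword to a 768-dimensional initial feature vector."""
    features = []
    with torch.no_grad():
        for keyword in node_keywords:
            inputs = tokenizer(keyword, return_tensors="pt")
            outputs = encoder(**inputs)
            # Mean-pool the token embeddings into one vector per node.
            features.append(outputs.last_hidden_state.mean(dim=1).squeeze(0))
    return torch.stack(features)  # shape: [num_nodes, 768]
```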
In step 102, the graph encoder is used to extract the graph embedding vector of the medical prior knowledge graph. The medical prior knowledge graph and the initial feature vector of each node are input into the graph encoder, and the graph embedding vector is obtained through graph encoding.
In step 103, a derivative model of the convolutional neural network is used as the visual extractor of the image encoder, for example the residual convolutional network ResNet or the densely connected convolutional network DenseNet. The image encoder used in this embodiment does not include the final linear layer and outputs the result of the pooling layer.
In step 104, the image encoder can obtain the visual features of the medical image but cannot capture high-level semantic information well. This embodiment therefore adopts a co-attention mechanism to perform multi-modal fusion of the graph embedding vector and the visual feature sequence, simulating a visual question-answering process, and finally obtains the attention re-weighted image sequence.
In step 105, as shown in fig. 2, the memory-driven Transformer model optionally comprises an encoder and a decoder, wherein the decoder comprises a relational memory and a decoding module provided with a memory-driven normalization layer. The memory-driven normalization layer consists of three Memory-driven Conditional Layer Normalization (MCLN) layers, used to enhance the decoding capability of the memory-driven Transformer model and increase its generalization.
The attention re-weighted image sequence is input into the memory-driven Transformer model to generate the medical image report.
In this embodiment, since the attention re-weighted image sequence fuses the medical prior knowledge graph and the medical image, the memory-driven Transformer model has better comprehension capability and a more robust grasp of medical prior knowledge, which improves the accuracy and reliability of medical image report generation.
Optionally, constructing the medical prior knowledge graph in step 101 comprises the following sub-steps:
step 1011, acquiring a plurality of unlabeled medical image report texts;
step 1012, extracting a plurality of medical entities from the unlabeled medical image report texts by adopting a named entity recognition algorithm;
step 1013, reducing the dimensionality of the medical entities by adopting a clustering algorithm;
and step 1014, constructing the medical prior knowledge graph by taking the dimension-reduced medical entities as nodes and the relations between them as edges.
In step 1011, a number of unlabeled medical image report texts are acquired, such as large collections of unlabeled reports, training data sets, or effective text information provided by physicians. Different types of reports may be selected as the underlying data for different tasks.
In step 1012, a named entity recognition algorithm is used to extract medical entities from the unlabeled medical image report texts; these are stored as the key nodes of the medical prior knowledge graph.
In step 1013, the nodes identified by named entity recognition may contain large amounts of similar content; retaining all of it would produce many redundant structures and an oversized medical prior knowledge graph, so text processing methods and a clustering algorithm are used to reduce the dimensionality of the medical entities.
In step 1014, relations between medical entities are established using relation extraction, with entity dependencies supplemented by human design. The medical prior knowledge graph is then constructed with the dimension-reduced medical entities as nodes and the relations between them as edges.
In this embodiment, a medical prior knowledge graph of appropriate scale can be constructed, rather than one built manually from a handful of selected keywords.
Specifically, as shown in fig. 3, medical prior knowledge graphs are constructed from two data sets. Because the two data sets are in different languages, the construction process differs somewhat. For the IU-Xray data set, Stanza Biomedical is used as the backbone method for named entity recognition and relation extraction, and clustering finally yields a medical prior knowledge graph with 284 key nodes. Each node obtains a 768-dimensional feature vector through BioBert, used as the initial features of the medical prior knowledge graph. Similarly, for the NCRC-DS data set, Chinese medical entities and entity triplets are extracted using CMeKG, a Chinese medical knowledge graph tool library that provides open-source implementations of named entity recognition, relation extraction, medical word segmentation and other functions. After clustering, a knowledge graph containing 191 key nodes is obtained. To obtain the initial features of the nodes, the keywords of the nodes are input into the Chinese medical Bert model provided by CMeKG, yielding 768-dimensional initial vectors.
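A sketch of this construction pipeline after entity extraction, assuming the entity mentions, their embeddings and the extracted relation pairs are already available from the NER and relation-extraction step; the clustering method and the representative-name rule are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def build_prior_graph(entity_vecs, entity_names, relations, n_nodes=284):
    """Cluster raw entity mentions into n_nodes key nodes and build an adjacency matrix.

    entity_vecs: [num_entities, dim] embeddings of the extracted entity mentions
    relations:   iterable of (head_index, tail_index) pairs between raw entities
    """
    labels = AgglomerativeClustering(n_clusters=n_nodes).fit_predict(entity_vecs)
    adj = np.eye(n_nodes)  # self-loops, matching the adjacency matrix used by the graph encoder
    for head, tail in relations:
        hc, tc = labels[head], labels[tail]
        if hc != tc:
            adj[hc, tc] = adj[tc, hc] = 1.0
    # Pick one representative keyword per key node (here: its first assigned mention).
    node_names = [entity_names[int(np.argmax(labels == c))] for c in range(n_nodes)]
    return adj, node_names
```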
Optionally, step 102 comprises the sub-steps of:
step 1021, constructing the graph encoder:

GC(H^(k)) = σ(D^(-1/2) · Ã · D^(-1/2) · H^(k) · W^(k)) (1)

H^(k+1) = BN(Dropout(GC(H^(k)))) + H^(k) (2)

wherein Ã represents the adjacency matrix of the medical prior knowledge graph, the adjacency matrix being accompanied by edges pointing from each node to itself; X represents the initial feature matrix of the medical prior knowledge graph, obtained by concatenating the initial feature vectors of all nodes (H^(0) = X); H^(k) represents the graph convolution feature vector of the k-th layer and H^(k+1) that of the (k+1)-th layer; D represents the degree matrix, with D_ii = Σ_j Ã_ij, so that D^(-1/2) · Ã · D^(-1/2) normalizes the aggregated node features; Ã_ij represents the element in row i and column j of the adjacency matrix; W^(k) represents a trainable weight matrix; GC(·) represents the graph convolution function, σ(·) the activation function, Dropout(·) the random drop function, and BN(·) the batch normalization function;
and step 1022, inputting the medical prior knowledge graph and the initial feature vector of each node into the graph encoder, and taking the graph convolution feature vector of the last layer as the graph embedding vector.
In this embodiment, random dropout, a batch normalization layer and a residual connection are added between every two graph convolution layers, which improves the expressive capability of the graph encoder; the graph embedding vector of the medical prior knowledge graph can then be extracted through the graph encoder.
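A PyTorch sketch of such a graph encoder; the feature dimension 768 and the three layers follow the embodiment's defaults, while the ReLU activation and dropout rate are assumptions:

```python
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    def __init__(self, dim=768, num_layers=3, p_drop=0.1):
        super().__init__()
        self.weights = nn.ModuleList(nn.Linear(dim, dim, bias=False)
                                     for _ in range(num_layers))
        self.norms = nn.ModuleList(nn.BatchNorm1d(dim) for _ in range(num_layers))
        self.drop = nn.Dropout(p_drop)

    def forward(self, adj, x):
        # adj: [N, N] adjacency with self-loops; x: [N, dim] initial node features
        deg = adj.sum(dim=1)
        d_inv_sqrt = deg.pow(-0.5)
        # D^(-1/2) A D^(-1/2): normalize the aggregated node features
        a_hat = adj * d_inv_sqrt.unsqueeze(0) * d_inv_sqrt.unsqueeze(1)
        h = x
        for w, bn in zip(self.weights, self.norms):
            gc = torch.relu(a_hat @ w(h))   # GC(H^(k)), equation (1)
            h = h + bn(self.drop(gc))       # residual + BN(Dropout(.)), equation (2)
        return h  # last-layer graph convolution features = graph embedding vector
```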
Optionally, step 103 comprises the sub-steps of:
step 1031, inputting the medical image (image, with dimensions [B, C, H, W]) into the image encoder VE(·), which does not include a linear layer, to obtain a four-dimensional visual feature matrix VE(image) with dimensions [B, F, H′, W′];
step 1032, transforming the four-dimensional visual feature matrix into a three-dimensional visual feature matrix:

I_E = reshape(VE(image)) (3)

wherein I_E is the three-dimensional visual feature matrix with dimensions [B, F, H′ × W′], and reshape(·) represents the reshaping function;
step 1033, converting the three-dimensional visual feature matrix into the visual feature sequence x_1, x_2, …, x_{H′×W′}.
In this embodiment, the medical image is input into an image encoder that does not include a linear layer, and the output is then reshaped to obtain the visual feature sequence.
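A sketch of such a visual extractor, assuming a torchvision ResNet-101 backbone (the backbone the NCRC-DS embodiment later names); dropping the global average pooling together with the final linear (fc) layer is an assumption made here so that the spatial grid H′ × W′ required by equation (3) is preserved:

```python
import torch.nn as nn
from torchvision import models

class ImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
        # Keep everything up to the last conv stage; drop avgpool + fc (the linear layer).
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, images):
        feats = self.backbone(images)        # [B, F, H', W']  four-dimensional matrix
        b, f, h, w = feats.shape
        feats = feats.reshape(b, f, h * w)   # [B, F, H'*W']   three-dimensional matrix
        return feats.permute(0, 2, 1)        # sequence of R = H'*W' patch features
```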
Optionally, step 104 comprises the sub-steps of:
step 1041, calculating an affinity matrix between the graph embedding vector and the visual feature sequence;
step 1042, learning the attention mapping between the graph embedding vector and the visual feature sequence through the affinity matrix;
step 1043, calculating an attention weight vector based on the attention mapping;
and step 1044, calculating the attention re-weighted image sequence based on the visual feature sequence and the attention weight vector.
In step 1041, the affinity matrix between the graph embedding vector and the visual feature sequence is calculated by the following expression:

C = tanh(G_E^T · W_b · I_E) (4)

wherein C represents the affinity matrix, G_E represents the graph embedding vector, I_E represents the visual feature sequence, and W_b represents a weight matrix.
In step 1042, the attention mapping between the graph embedding vector and the visual feature sequence is learned by the following expression:

F_i = tanh(W_i · I_E + (W_g · G_E) · C) (5)

wherein F_i represents the output of learning the attention mapping through the affinity matrix, W_i and W_g both represent trainable weight matrices, C represents the affinity matrix, G_E represents the graph embedding vector, and I_E represents the visual feature sequence.
In step 1043, the attention weight vector is calculated by the following expression:

a_i = softmax(w_f^T · F_i) (6)

wherein a_i represents the attention weight vector, w_f represents a trainable weight vector, and F_i represents the attention mapping result.
In step 1044, the attention re-weighted image sequence is calculated by the following expression:

x̂_r = a_r · x_r, r = 1, 2, …, R (7)

wherein x̂_r represents an element of the attention re-weighted image sequence, x_r represents the corresponding element of the visual feature sequence, a_r represents the element of the attention weight vector corresponding to x_r, and R = H′ × W′ represents the number of image blocks.
In this embodiment, a co-attention mechanism is adopted to perform multi-modal fusion of the graph embedding vector output by the graph encoder and the visual feature sequence output by the image encoder, so that the resulting attention re-weighted image sequence fuses the visual features of the medical image with the high-level semantic information of the medical prior knowledge graph.
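A PyTorch sketch of equations (4)-(7), assuming the graph embedding G_E holds N node vectors of dimension d and the visual sequence I_E holds R patch vectors of the same dimension d (the shared dimension and the hidden size k are simplifying assumptions):

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    def __init__(self, d, k):
        super().__init__()
        self.W_b = nn.Parameter(torch.randn(d, d) * 0.02)  # affinity weights
        self.W_i = nn.Parameter(torch.randn(k, d) * 0.02)
        self.W_g = nn.Parameter(torch.randn(k, d) * 0.02)
        self.w_f = nn.Parameter(torch.randn(1, k) * 0.02)

    def forward(self, G_E, I_E):
        # G_E: [N, d] graph embedding; I_E: [d, R] visual feature sequence
        C = torch.tanh(G_E @ self.W_b @ I_E)                      # [N, R], eq. (4)
        F_map = torch.tanh(self.W_i @ I_E + (self.W_g @ G_E.T) @ C)  # [k, R], eq. (5)
        attn = torch.softmax(self.w_f @ F_map, dim=-1)            # [1, R], eq. (6)
        return I_E * attn                                         # re-weighted patches, eq. (7)
```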
Optionally, step 105 comprises the sub-steps of:
step 1051, inputting the attention re-weighted image sequence into the encoder;
step 1052, initializing the relational memory with the graph embedding vector;
step 1053, calculating the memory matrix output by the relational memory in the current round;
and step 1054, inputting the memory matrix output by the relational memory in the current round and the output result of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report.
In step 1051, the output result of the encoder is calculated by the following expression:

ψ = T_E(x̂_1, x̂_2, …, x̂_R) (8)

wherein ψ represents the output result of the encoder and T_E(·) represents the encoder.
In step 1052, the relational memory is used to store content shared across the model's training results, enhancing the learning ability of the model. Specifically, a memory matrix comprising a plurality of rows is maintained, each row being regarded as a slot for storing pattern-specific information.
The relational memory is initialized by the following expression:

M_0 = MLP(G_E · W_m) (9)

wherein M_0 represents the initial memory matrix of the relational memory, G_E represents the graph embedding vector, W_m represents a weight matrix, and MLP(·) represents a multi-layer perceptron used to establish the mapping between dimensions.
In step 1053, the memory matrix output by the relational memory in the current round is calculated by the following expressions:

Z = softmax(Q · K^T / √d_k) · V (10)

Z̃ = MLP(Z) + Z (11)

M_t = G^f ⊙ M_{t-1} + G^o ⊙ Z̃ (12)

wherein Q = M_{t-1} · W_Q, K = [M_{t-1}; y_{t-1}] · W_K, V = [M_{t-1}; y_{t-1}] · W_V; M_{t-1} represents the memory matrix output by the relational memory in the previous round; y_{t-1} represents the word embedding vector of the word predicted in the previous round; W_Q, W_K and W_V are all trainable weight matrices; d_k represents the scaling factor, obtained by dividing the dimension of K by the number of attention heads; MLP(·) represents a multi-layer perceptron; G^f and G^o represent the forget gate and the output gate used to balance M_{t-1} and y_{t-1}; Z̃ represents the multi-head attention output matrix after multi-layer perceptron mapping; and M_t represents the memory matrix output by the relational memory in the current round.
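A sketch of one relational-memory update following equations (10)-(12); the single-head attention, the slot count of 3 (the embodiment's default) and the way the forget and output gates are parameterized from y_{t-1} are simplifying assumptions:

```python
import torch
import torch.nn as nn

class RelationalMemory(nn.Module):
    def __init__(self, d_model=512, num_slots=3):
        super().__init__()
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))
        self.gate = nn.Linear(d_model, 2 * d_model)  # produces forget + output gates

    def forward(self, m_prev, y_prev):
        # m_prev: [S, d] memory matrix M_{t-1}; y_prev: [1, d] last predicted word embedding
        cat = torch.cat([m_prev, y_prev], dim=0)             # [M_{t-1}; y_{t-1}]
        q, k, v = self.W_Q(m_prev), self.W_K(cat), self.W_V(cat)
        d_k = q.size(-1)
        z = torch.softmax(q @ k.T / d_k ** 0.5, dim=-1) @ v  # eq. (10)
        z_tilde = self.mlp(z) + z                            # eq. (11): MLP-mapped output
        g_f, g_o = self.gate(y_prev.expand_as(m_prev)).chunk(2, dim=-1)
        # eq. (12): gated blend of the previous memory and the attention output
        m_t = torch.sigmoid(g_f) * m_prev + torch.sigmoid(g_o) * torch.tanh(z_tilde)
        return m_t
```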
In step 1054, the output result of the decoding module provided with the memory-driven normalization layer is calculated by the following expressions:

γ_t = γ + MLP(M_t) (13)

β_t = β + MLP(M_t) (14)

MCLN(r, M_t) = γ_t ⊙ (r − μ) / ν + β_t (15)

θ = T_D(ψ, N, RM(M_{t-1}, y_{t-1}), MCLN(r, M_t)) (16)

wherein ψ represents the output result of the encoder T_E(·); N represents the number of decoder layers; γ represents a learnable scaling parameter matrix used to improve generalization ability, and γ_t represents the result of adding γ to the multi-layer-perceptron mapping of M_t; β represents a learnable shifting parameter matrix used to improve generalization ability, and β_t represents the result of adding β to the multi-layer-perceptron mapping of M_t; r represents the input of the normalization layer, μ represents the mean of r and ν its standard deviation; T_D(·) represents the decoder; RM(·) represents the relational memory; and θ represents the output result of the decoding module.
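A sketch of one MCLN layer implementing equations (13)-(15); flattening the memory slots before the two MLP projections and the epsilon term for numerical stability are implementation assumptions:

```python
import torch
import torch.nn as nn

class MCLN(nn.Module):
    def __init__(self, d_model=512, num_slots=3, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))   # learnable scaling parameters
        self.beta = nn.Parameter(torch.zeros(d_model))   # learnable shifting parameters
        self.mlp_gamma = nn.Linear(num_slots * d_model, d_model)
        self.mlp_beta = nn.Linear(num_slots * d_model, d_model)
        self.eps = eps

    def forward(self, r, m_t):
        # r: [T, d] hidden states entering the normalization layer; m_t: [S, d] current memory
        mem = m_t.reshape(1, -1)                    # flatten the memory slots
        gamma_t = self.gamma + self.mlp_gamma(mem)  # eq. (13)
        beta_t = self.beta + self.mlp_beta(mem)     # eq. (14)
        mu = r.mean(dim=-1, keepdim=True)
        nu = r.std(dim=-1, keepdim=True)
        return gamma_t * (r - mu) / (nu + self.eps) + beta_t  # eq. (15)
```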
In this embodiment, the relational memory is initialized with the graph embedding vector instead of setting its initial memory matrix to all zeros, which helps optimize the memory-driven Transformer model. Moreover, the input of the decoder of the memory-driven Transformer model fuses the medical prior knowledge graph and the medical image, so the model has better comprehension capability and a more robust grasp of medical prior knowledge, which improves the accuracy and reliability of medical image report generation.
Specifically, taking the IU-Xray data set and the NCRC-DS data set as examples, the DenseNet-121 model pre-trained on CheXpert is selected as the backbone network of the image encoder for the IU-Xray data set. The two chest radiographs belonging to the same report are input into the model, and their features are concatenated for the encoder of the memory-driven Transformer model; the medical prior knowledge graph containing 284 nodes serves as the medical prior knowledge. For the NCRC-DS data set, the ResNet-101 model pre-trained on ImageNet is chosen as the backbone network of the image encoder. Because this data set is small, only one skin-disease picture and its corresponding description report are input at a time, and the medical prior knowledge graph containing 191 nodes serves as the medical prior knowledge. By default, the number of graph convolution layers is 3, the number of slots of the relational memory is 3, and the word embedding dimension is set to 512. The model is trained with an Adam optimizer under cross-entropy loss; the BLEU-4 score is evaluated on the test set during training, and weight decay and early stopping are applied. In fields such as chest radiographs and skin diseases, the medical image report generation method based on multi-modal fusion of this embodiment surpasses currently common medical image report generation models in accuracy and reliability.
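A sketch of this training configuration (Adam, cross-entropy loss, BLEU-4 tracking with weight decay and early stopping); the learning rate, patience, data-loader interface and the evaluate_bleu4 helper are placeholders, not taken from the patent:

```python
import torch

def train(model, train_loader, eval_loader, epochs=100, patience=10):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-5)
    criterion = torch.nn.CrossEntropyLoss(ignore_index=0)  # assume index 0 is padding
    best_bleu4, stale = 0.0, 0
    for epoch in range(epochs):
        model.train()
        for images, report_tokens in train_loader:
            logits = model(images, report_tokens[:, :-1])   # teacher forcing
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             report_tokens[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        bleu4 = evaluate_bleu4(model, eval_loader)  # hypothetical BLEU-4 helper
        if bleu4 > best_bleu4:
            best_bleu4, stale = bleu4, 0
            torch.save(model.state_dict(), "best.pt")
        else:
            stale += 1
            if stale >= patience:  # early stopping
                return
```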
The multi-modality fusion-based medical image report generation apparatus provided by the present invention is described below, and the multi-modality fusion-based medical image report generation apparatus described below and the multi-modality fusion-based medical image report generation method described above may be referred to in correspondence with each other.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a medical image report generation apparatus based on multi-modal fusion according to the present invention. As shown in fig. 4, the medical image report generating apparatus based on multi-modal fusion provided by the present invention may include:
the graph construction module 10, used for constructing a medical prior knowledge graph and acquiring an initial feature vector of each node in the medical prior knowledge graph;
the graph encoder module 20, used for inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
the image encoder module 30, used for inputting a medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
the multi-modal fusion module 40, used for performing multi-modal fusion of the graph embedding vector and the visual feature sequence by adopting a co-attention mechanism to obtain an attention re-weighted image sequence;
and the report generation module 50, used for inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
Optionally, the graph construction module 10 is specifically configured to:
acquiring a plurality of unlabeled medical image report texts;
extracting a plurality of medical entities from the unlabeled medical image report texts by adopting a named entity recognition algorithm;
reducing the dimensionality of the medical entities by adopting a clustering algorithm;
and constructing the medical prior knowledge graph by taking the dimension-reduced medical entities as nodes and the relations between them as edges.
Optionally, the graph construction module 10 is specifically configured to:
initializing each node of the medical prior knowledge graph through a word embedding model to obtain the initial feature vector of the node.
Optionally, the graph encoder module 20 is specifically configured to:
constructing the graph encoder:

GC(H^(k)) = σ(D^(-1/2) · Ã · D^(-1/2) · H^(k) · W^(k))

H^(k+1) = BN(Dropout(GC(H^(k)))) + H^(k)

wherein Ã represents the adjacency matrix of the medical prior knowledge graph, the adjacency matrix being accompanied by edges pointing from each node to itself; X represents the initial feature matrix of the medical prior knowledge graph, obtained by concatenating the initial feature vectors of all nodes (H^(0) = X); H^(k) represents the graph convolution feature vector of the k-th layer and H^(k+1) that of the (k+1)-th layer; D represents the degree matrix, with D_ii = Σ_j Ã_ij, so that D^(-1/2) · Ã · D^(-1/2) normalizes the aggregated node features; Ã_ij represents the element in row i and column j of the adjacency matrix; W^(k) represents a trainable weight matrix; GC(·) represents the graph convolution function, σ(·) the activation function, Dropout(·) the random drop function, and BN(·) the batch normalization function;
and inputting the medical prior knowledge graph and the initial feature vector of each node into the graph encoder, and taking the graph convolution feature vector of the last layer as the graph embedding vector.
Optionally, the image encoder module 30 is specifically configured to:
inputting the medical image into an image encoder without a linear layer to obtain a four-dimensional visual feature matrix;
transforming the four-dimensional visual feature matrix into a three-dimensional visual feature matrix;
and converting the three-dimensional visual feature matrix into a visual feature sequence.
Optionally, the multimodal fusion module 40 is specifically configured to:
calculating an affinity matrix between the graph embedding vector and the visual feature sequence;
learning, by the affinity matrix, an attention mapping between the graph embedding vector and the sequence of visual features;
calculating an attention weight vector based on the attention map;
calculating a sequence of attention re-weighted images based on the sequence of visual features and the attention weight vector.
Optionally, the multimodal fusion module 40 is specifically configured to:
computing the affinity matrix between the graph embedding vector and the visual feature sequence by the following expression:

C = tanh(G_E^T · W_b · I_E)

wherein C represents the affinity matrix, G_E represents the graph embedding vector, I_E represents the visual feature sequence, and W_b represents a weight matrix.
Optionally, the multimodal fusion module 40 is specifically configured to:
learning the attention mapping between the graph embedding vector and the visual feature sequence by the following expression:

F_i = tanh(W_i · I_E + (W_g · G_E) · C)

wherein F_i represents the output of learning the attention mapping through the affinity matrix, W_i and W_g both represent trainable weight matrices, C represents the affinity matrix, G_E represents the graph embedding vector, and I_E represents the visual feature sequence.
Optionally, the multimodal fusion module 40 is specifically configured to:
calculating the attention weight vector by the following expression:

a_i = softmax(w_f^T · F_i)

wherein a_i represents the attention weight vector, w_f represents a trainable weight vector, and F_i represents the attention mapping result.
Optionally, the multimodal fusion module 40 is specifically configured to:
calculating the attention re-weighted image sequence by the following expression:

x̂_r = a_r · x_r, r = 1, 2, …, R

wherein x̂_r represents an element of the attention re-weighted image sequence, x_r represents the corresponding element of the visual feature sequence, a_r represents the element of the attention weight vector corresponding to x_r, and R = H′ × W′ represents the number of image blocks.
Optionally, the memory-driven Transformer model comprises an encoder and a decoder, wherein the decoder comprises a relational memory and a decoding module provided with a memory-driven normalization layer.
Optionally, the report generating module 50 is specifically configured to:
inputting the attention re-weighted image sequence into the encoder;
initializing the relational memory with the graph embedding vector;
calculating the memory matrix output by the relational memory in the current round;
and inputting the memory matrix output by the relational memory in the current round and the output result of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report.
Optionally, the report generating module 50 is specifically configured to:
initializing the relational memory by the following expression:

M_0 = MLP(G_E · W_m)

wherein M_0 represents the initial memory matrix of the relational memory, G_E represents the graph embedding vector, W_m represents a weight matrix, and MLP(·) represents a multi-layer perceptron used to establish the mapping between dimensions.
Optionally, the report generating module 50 is specifically configured to:
calculating the memory matrix output by the relational memory in the current round according to the following expressions:

Z = softmax(Q · K^T / √d_k) · V

Z̃ = MLP(Z) + Z

M_t = G^f ⊙ M_{t-1} + G^o ⊙ Z̃

wherein Q = M_{t-1} · W_Q, K = [M_{t-1}; y_{t-1}] · W_K, V = [M_{t-1}; y_{t-1}] · W_V; M_{t-1} represents the memory matrix output by the relational memory in the previous round; y_{t-1} represents the word embedding vector of the word predicted in the previous round; W_Q, W_K and W_V are all trainable weight matrices; d_k represents the scaling factor, obtained by dividing the dimension of K by the number of attention heads; MLP(·) represents a multi-layer perceptron; G^f and G^o represent the forget gate and the output gate used to balance M_{t-1} and y_{t-1}; Z̃ represents the multi-head attention output matrix after multi-layer perceptron mapping; and M_t represents the memory matrix output by the relational memory in the current round.
Optionally, the report generating module 50 is specifically configured to:
calculating the output result of the decoding module provided with the memory-driven normalization layer by the following expressions:

γ_t = γ + MLP(M_t)

β_t = β + MLP(M_t)

MCLN(r, M_t) = γ_t ⊙ (r − μ) / ν + β_t

θ = T_D(ψ, N, RM(M_{t-1}, y_{t-1}), MCLN(r, M_t))

wherein ψ represents the output result of the encoder T_E(·); N represents the number of decoder layers; γ represents a learnable scaling parameter matrix used to improve generalization ability, and γ_t represents the result of adding γ to the multi-layer-perceptron mapping of M_t; β represents a learnable shifting parameter matrix used to improve generalization ability, and β_t represents the result of adding β to the multi-layer-perceptron mapping of M_t; r represents the input of the normalization layer, μ represents the mean of r and ν its standard deviation; T_D(·) represents the decoder; RM(·) represents the relational memory; and θ represents the output result of the decoding module.
Fig. 5 illustrates a schematic physical structure diagram of an electronic device. As shown in fig. 5, the electronic device may include: a processor 810, a communication interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication interface 820 and the memory 830 communicate with each other via the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform the medical image report generation method based on multi-modal fusion, the method comprising:
constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder without a linear layer to obtain a visual feature sequence;
performing multi-modal fusion of the graph embedding vector and the visual feature sequence by adopting a co-attention mechanism to obtain an attention re-weighted image sequence;
and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
In addition, the logic instructions in the memory 830 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention or a part thereof which substantially contributes to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention further provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium and that, when executed by a processor, enables a computer to execute the medical image report generation method based on multi-modal fusion provided by the above methods, the method comprising:
constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder without a linear layer to obtain a visual feature sequence;
performing multi-modal fusion of the graph embedding vector and the visual feature sequence by adopting a co-attention mechanism to obtain an attention re-weighted image sequence;
and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the medical image report generation method based on multi-modal fusion provided by the above methods, the method comprising:
constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder without a linear layer to obtain a visual feature sequence;
performing multi-modal fusion of the graph embedding vector and the visual feature sequence by adopting a co-attention mechanism to obtain an attention re-weighted image sequence;
and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
The above-described apparatus embodiments are merely illustrative; the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (19)

1. A medical image report generation method based on multi-modal fusion is characterized by comprising the following steps:
constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
inputting the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
performing multi-modal fusion on the graph embedding vector and the visual feature sequence by using a co-attention mechanism to obtain an attention-re-weighted image sequence;
and inputting the attention-re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
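For readers tracing the data flow of claim 1, the following is a minimal, shape-level Python sketch (PyTorch assumed). All tensors are random stand-ins, the weighting rule in step 4 is a placeholder rather than the claimed co-attention (that formulation is sketched after claim 10), and the sizes N, R and d are illustrative.

```python
import torch

torch.manual_seed(0)
N, R, d = 40, 49, 512                      # graph nodes, image patches, width
graph_embed = torch.randn(N, d)            # steps 1-2: graph embedding G_E
visual_seq = torch.randn(R, d)             # step 3: visual feature sequence I_E

# step 4: co-attention re-weights each patch (stand-in weighting rule;
# the claimed formulation is sketched after claim 10)
affinity = torch.tanh(graph_embed @ torch.randn(d, d) @ visual_seq.T)  # N x R
weights = torch.softmax(affinity.max(dim=0).values, dim=0)             # R
reweighted = visual_seq * weights.unsqueeze(-1)                        # R x d

# step 5: the re-weighted sequence would feed the memory-driven Transformer
print(reweighted.shape)                    # torch.Size([49, 512])
```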
2. The method for generating a medical image report based on multi-modal fusion according to claim 1, wherein the constructing a medical prior knowledge graph comprises:
acquiring a plurality of unlabeled medical image report texts;
extracting a plurality of medical entities from the plurality of unlabeled medical image report texts by using a named entity recognition algorithm;
performing dimension reduction on the plurality of medical entities by using a clustering algorithm;
and constructing the medical prior knowledge graph by taking the dimension-reduced medical entities as nodes and the relations between the dimension-reduced medical entities as edges.
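A toy sketch of the claim-2 pipeline follows. The keyword lexicon stands in for a real medical named entity recognition model, co-occurrence within one report stands in for the extracted entity relations, the clustering-based dimension reduction is elided, and extract_entities and build_prior_graph are hypothetical names.

```python
# Toy sketch: mine entities from unlabeled report text and build a
# co-occurrence graph as the medical prior knowledge graph.
from itertools import combinations

import networkx as nx

TOY_LEXICON = {"effusion", "pleural effusion", "cardiomegaly", "opacity"}

def extract_entities(report: str) -> set[str]:
    # stand-in NER: keyword spotting over a toy lexicon
    text = report.lower()
    return {term for term in TOY_LEXICON if term in text}

def build_prior_graph(reports: list[str]) -> nx.Graph:
    g = nx.Graph()
    for rep in reports:
        ents = extract_entities(rep)
        g.add_nodes_from(ents)
        # edge = the two entities co-occur in at least one report
        g.add_edges_from(combinations(sorted(ents), 2))
    return g

g = build_prior_graph(["Small pleural effusion with basal opacity.",
                       "Cardiomegaly; no effusion."])
print(g.number_of_nodes(), g.number_of_edges())
```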
3. The method according to claim 1, wherein the obtaining an initial feature vector of each node in the medical prior knowledge graph comprises:
and initializing each node of the medical prior knowledge graph through a word embedding model to obtain the initial feature vector of each node.
4. The method according to claim 1, wherein the inputting the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into a graph encoder to obtain a graph embedding vector comprises:
constructing a graph encoder:

GC(H^(k)) = D̃^(−1/2) · Ã · D̃^(−1/2) · H^(k) · W^(k)

H^(k+1) = Dropout(σ(BN(GC(H^(k)))))

where Ã represents the adjacency matrix of the medical prior knowledge graph with added edges pointing to self nodes, H^(0) represents the initial feature vector of the medical prior knowledge graph, obtained by concatenating the initial feature vectors of all nodes in the medical prior knowledge graph, H^(k) represents the graph convolution feature vector of the k-th layer, H^(k+1) represents the graph convolution feature vector of the (k+1)-th layer, D̃ is the degree matrix with D̃_ii = Σ_j Ã_ij, Ã_ij represents the element in row i and column j of the adjacency matrix of the medical prior knowledge graph, W^(k) represents a trainable weight matrix, GC(·) represents the graph convolution function, σ(·) represents an activation function, Dropout(·) represents a random drop function, and BN(·) represents a batch normalization function;
and inputting the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into the graph encoder, and taking the graph convolution feature vector of the last layer as the graph embedding vector.
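The graph encoder of claim 4 could be sketched as below in PyTorch. The GCNEncoder class name, the ReLU activation, and the ordering GC → BN → σ → Dropout are assumptions consistent with the reconstructed expressions above, not confirmed details of the source.

```python
import torch
import torch.nn as nn

class GCNEncoder(nn.Module):
    def __init__(self, dim: int, num_layers: int = 2, p_drop: float = 0.1):
        super().__init__()
        self.weights = nn.ModuleList(nn.Linear(dim, dim, bias=False)
                                     for _ in range(num_layers))   # W^(k)
        self.norms = nn.ModuleList(nn.BatchNorm1d(dim)
                                   for _ in range(num_layers))     # BN()
        self.drop = nn.Dropout(p_drop)                             # Dropout()

    def forward(self, adj: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        a = adj + torch.eye(adj.size(0))           # add self-loop edges: Ã
        d_inv_sqrt = a.sum(dim=1).pow(-0.5)        # D̃^(-1/2) from node degrees
        a_hat = d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]
        for w, bn in zip(self.weights, self.norms):
            # GC -> BN -> sigma (ReLU assumed) -> Dropout, per the claim-4 form
            h = self.drop(torch.relu(bn(w(a_hat @ h))))
        return h                                   # last layer = graph embedding G_E

enc = GCNEncoder(dim=64)
adj = (torch.rand(10, 10) > 0.7).float()
adj = ((adj + adj.T) > 0).float()                  # symmetrize the toy graph
print(enc(adj, torch.randn(10, 64)).shape)         # torch.Size([10, 64])
```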
5. The method according to claim 1, wherein the inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence comprises:
inputting the medical image into an image encoder that does not include a linear layer to obtain a four-dimensional visual feature matrix;
reshaping the four-dimensional visual feature matrix into a three-dimensional visual feature matrix;
and converting the three-dimensional visual feature matrix into the visual feature sequence.
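One plausible reading of claim 5, sketched with a torchvision (recent versions) ResNet-101 backbone; the backbone choice is an assumption, since the claim only requires that the encoder contain no linear layer.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

# drop avgpool + fc so no linear layer remains in the encoder
backbone = nn.Sequential(*list(resnet101(weights=None).children())[:-2])

img = torch.randn(1, 3, 224, 224)          # one medical image
feat4d = backbone(img)                     # (B, C, H', W') = (1, 2048, 7, 7)
b, c, hp, wp = feat4d.shape
feat3d = feat4d.view(b, c, hp * wp)        # (B, C, R), R = H' x W' image blocks
visual_seq = feat3d.permute(0, 2, 1)       # (B, R, C): one feature per block
print(visual_seq.shape)                    # torch.Size([1, 49, 2048])
```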
6. The method for generating a medical image report based on multi-modal fusion according to claim 1, wherein the performing multi-modal fusion on the graph embedding vector and the visual feature sequence by using a co-attention mechanism to obtain an attention-re-weighted image sequence comprises:
calculating an affinity matrix between the graph embedding vector and the visual feature sequence;
learning, through the affinity matrix, an attention mapping between the graph embedding vector and the visual feature sequence;
calculating an attention weight vector based on the attention mapping;
and calculating the attention-re-weighted image sequence based on the visual feature sequence and the attention weight vector.
7. The method for generating a medical image report based on multi-modal fusion according to claim 6, wherein the calculating an affinity matrix between the graph embedding vector and the visual feature sequence comprises:
calculating the affinity matrix between the graph embedding vector and the visual feature sequence by the following expression:

C = tanh(G_E·W_b·I_E^T)

where C represents the affinity matrix, G_E represents the graph embedding vector, I_E represents the visual feature sequence, and W_b represents a trainable weight matrix.
8. The method according to claim 6, wherein the learning, through the affinity matrix, an attention mapping between the graph embedding vector and the visual feature sequence comprises:
learning the attention mapping between the graph embedding vector and the visual feature sequence by the following expression:

F_i = tanh(W_i·I_E + (W_g·G_E)·C)

where F_i represents the attention mapping learned from the graph embedding vector and the visual feature sequence through the affinity matrix, W_i and W_g both represent trainable weight matrices, C represents the affinity matrix, G_E represents the graph embedding vector, and I_E represents the visual feature sequence.
9. The method according to claim 6, wherein the calculating an attention weight vector based on the attention mapping comprises:
calculating the attention weight vector by the following expression:

a_i = softmax(w_fi^T·F_i)

where a_i represents the attention weight vector, w_fi represents a trainable weight matrix, and F_i represents the attention mapping result.
10. The method according to claim 6, wherein the calculating the attention-re-weighted image sequence based on the visual feature sequence and the attention weight vector comprises:
calculating the attention-re-weighted image sequence by the following expression:

x̂_r = a_r·x_r, r = 1, 2, …, R

where x_{1,2,…,R} represents the elements of the visual feature sequence, x̂_{1,2,…,R} represents the elements of the attention-re-weighted image sequence, a_r represents the element of the attention weight vector corresponding to the r-th element of the visual feature sequence, and R = H′ × W′ represents the number of image blocks.
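Claims 6 through 10 together describe one co-attention pass, which could be sketched as a single module as below; the shared width d, the single-head form, and the CoAttention class name are assumptions.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.w_b = nn.Parameter(torch.randn(d, d) * d ** -0.5)  # W_b
        self.w_i = nn.Linear(d, d, bias=False)                  # W_i
        self.w_g = nn.Linear(d, d, bias=False)                  # W_g
        self.w_f = nn.Linear(d, 1, bias=False)                  # w_fi

    def forward(self, g_e: torch.Tensor, i_e: torch.Tensor) -> torch.Tensor:
        # claim 7: C = tanh(G_E W_b I_E^T), shape (N, R)
        c = torch.tanh(g_e @ self.w_b @ i_e.T)
        # claim 8: F = tanh(W_i I_E + (W_g G_E) C), shape (R, d)
        f = torch.tanh(self.w_i(i_e) + c.T @ self.w_g(g_e))
        # claim 9: a = softmax(w_fi^T F) over the R patches
        a = torch.softmax(self.w_f(f).squeeze(-1), dim=0)
        # claim 10: x̂_r = a_r * x_r for r = 1..R
        return i_e * a.unsqueeze(-1)

coatt = CoAttention(d=512)
out = coatt(torch.randn(40, 512), torch.randn(49, 512))
print(out.shape)  # torch.Size([49, 512])
```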
11. The method according to claim 1, wherein the memory-driven Transformer model comprises an encoder and a decoder, and the decoder comprises: a relational memory and a decoding module provided with a memory-driven normalization layer.
12. The method according to claim 11, wherein the inputting the attention-re-weighted image sequence into a memory-driven Transformer model to generate a medical image report comprises:
inputting the attention-re-weighted image sequence into the encoder;
initializing the relational memory with the graph embedding vector;
calculating the memory matrix output by the relational memory in the current round;
and inputting the memory matrix output by the relational memory in the current round and the output result of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report.
13. The method according to claim 12, wherein the initializing the relational memory with the graph embedding vector comprises:
initializing the relational memory by the following expression:

M_0 = MLP(G_E·W_m)

where M_0 represents the initial memory matrix of the relational memory, G_E represents the graph embedding vector, W_m represents a trainable weight matrix, and MLP(·) represents a multi-layer perceptron used to build the mapping between dimensions.
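A sketch of the claim-13 initialization. The claim fixes only the form M_0 = MLP(G_E·W_m); the mean-pooling from N node vectors to one vector and the replication across memory slots are our assumptions for making the dimensions work.

```python
import torch
import torch.nn as nn

d, num_slots = 512, 3
w_m = torch.randn(d, d) * d ** -0.5                 # W_m
mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

g_e = torch.randn(40, d)                            # graph embedding, N x d
pooled = (g_e @ w_m).mean(dim=0, keepdim=True)      # collapse nodes -> 1 x d (assumed)
m0 = mlp(pooled).expand(num_slots, d)               # replicate across S slots (assumed)
print(m0.shape)                                     # torch.Size([3, 512])
```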
14. The method according to claim 12, wherein the calculating the memory matrix output by the relational memory in the current round comprises:
calculating the memory matrix output by the relational memory in the current round according to the following expressions:

Z = softmax(Q·K^T / √d_k)·V

M̃_t = MLP(Z)

M_t = G_t^f ⊙ M_{t-1} + G_t^o ⊙ M̃_t

where Q = M_{t-1}·W_Q, K = [M_{t-1}; y_{t-1}]·W_K, V = [M_{t-1}; y_{t-1}]·W_V, M_{t-1} represents the memory matrix output by the relational memory in the previous round, y_{t-1} represents the word embedding vector of the previous round of prediction in the relational memory, W_Q, W_K and W_V are all trainable weight matrices, d_k represents the scaling factor, obtained by dividing the dimension of K by the number of attention heads, MLP(·) represents the multi-layer perceptron, G_t^f and G_t^o are the forget gate and the output gate used for balancing M_{t-1} and y_{t-1}, M̃_t represents the multi-head attention output matrix mapped by the multi-layer perceptron, and M_t represents the memory matrix output by the relational memory in the current round.
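The claim-14 update could be sketched as below. The single attention head (so d_k reduces to the model width) and the exact parameterization of the forget and output gates are assumptions; the claim fixes the Q/K/V construction, the MLP mapping, and the gated combination.

```python
import torch
import torch.nn as nn

class RelationalMemory(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.w_q = nn.Linear(d, d, bias=False)
        self.w_k = nn.Linear(d, d, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.gate = nn.Linear(2 * d, 2 * d)   # produces forget + output gates (assumed form)
        self.d = d

    def forward(self, m_prev: torch.Tensor, y_prev: torch.Tensor) -> torch.Tensor:
        kv = torch.cat([m_prev, y_prev], dim=0)           # [M_{t-1}; y_{t-1}]
        q, k, v = self.w_q(m_prev), self.w_k(kv), self.w_v(kv)
        att = torch.softmax(q @ k.T / self.d ** 0.5, dim=-1) @ v
        m_tilde = self.mlp(att)                           # MLP-mapped attention output
        gates = self.gate(torch.cat([m_prev, m_tilde], dim=-1))
        g_f, g_o = torch.sigmoid(gates).chunk(2, dim=-1)  # forget / output gates
        return g_f * m_prev + g_o * torch.tanh(m_tilde)   # M_t

rm = RelationalMemory(d=512)
m_t = rm(torch.randn(3, 512), torch.randn(1, 512))        # 3 memory slots, 1 word
print(m_t.shape)  # torch.Size([3, 512])
```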
15. The method according to claim 14, wherein the inputting the memory matrix output by the relational memory in the current round and the output result of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report comprises:
calculating the output result of the decoding module provided with the memory-driven normalization layer by the following expressions:

γ_t = γ + MLP(M_t)

β_t = β + MLP(M_t)

MCLN(r, M_t) = γ_t ⊙ (r − μ) / ν + β_t

θ = T_D(ψ, N, RM(M_{t-1}, y_{t-1}), MCLN(r, M_t))

where ψ represents the output of the encoder, N represents the number of decoder layers, γ represents a learnable scaling parameter matrix used for improving generalization ability, γ_t represents the result of adding γ and the multi-layer-perceptron mapping of M_t, β represents a learnable shift parameter matrix used for improving generalization ability, β_t represents the result of adding β and the multi-layer-perceptron mapping of M_t, r represents the decoder hidden state to be normalized, μ represents the mean of r, ν represents the standard deviation of r, θ represents the output result of the decoding module, T_E(·) represents the encoder, T_D(·) represents the decoder, RM(·) represents the relational memory, and MCLN(·) represents the memory-driven normalization layer.
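Finally, the memory-driven normalization layer of claim 15 could be sketched as follows; flattening M_t before the MLP maps, the eps term, and the MCLN class name are our assumptions.

```python
import torch
import torch.nn as nn

class MCLN(nn.Module):
    def __init__(self, d: int, mem_slots: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d))   # learnable scaling gamma
        self.beta = nn.Parameter(torch.zeros(d))   # learnable shift beta
        self.mlp_g = nn.Linear(mem_slots * d, d)   # MLP map of M_t for gamma_t
        self.mlp_b = nn.Linear(mem_slots * d, d)   # MLP map of M_t for beta_t
        self.eps = eps

    def forward(self, r: torch.Tensor, m_t: torch.Tensor) -> torch.Tensor:
        m_flat = m_t.reshape(-1)                    # flatten memory (assumed)
        gamma_t = self.gamma + self.mlp_g(m_flat)   # gamma_t = gamma + MLP(M_t)
        beta_t = self.beta + self.mlp_b(m_flat)     # beta_t  = beta  + MLP(M_t)
        mu = r.mean(dim=-1, keepdim=True)           # mean of r
        nu = r.std(dim=-1, keepdim=True)            # standard deviation of r
        return gamma_t * (r - mu) / (nu + self.eps) + beta_t

mcln = MCLN(d=512, mem_slots=3)
out = mcln(torch.randn(10, 512), torch.randn(3, 512))
print(out.shape)  # torch.Size([10, 512])
```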
16. A medical image report generation device based on multi-modal fusion, characterized by comprising:
a graph construction module, configured to construct a medical prior knowledge graph and acquire an initial feature vector of each node in the medical prior knowledge graph;
a graph encoder module, configured to input the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into a graph encoder to obtain a graph embedding vector;
an image encoder module, configured to input a medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
a multi-modal fusion module, configured to perform multi-modal fusion on the graph embedding vector and the visual feature sequence by using a co-attention mechanism to obtain an attention-re-weighted image sequence;
and a report generation module, configured to input the attention-re-weighted image sequence into a memory-driven Transformer model to generate the medical image report.
17. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for generating a medical image report based on multi-modal fusion according to any one of claims 1 to 15 when executing the program.
18. A non-transitory computer-readable storage medium, having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method for generating a medical image report based on multi-modal fusion according to any one of claims 1 to 15.
19. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method for generating a medical image report based on multi-modal fusion according to any one of claims 1 to 15.
CN202210836966.3A 2022-07-15 2022-07-15 Medical image report generation method and device based on multi-mode fusion Active CN115331769B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210836966.3A CN115331769B (en) 2022-07-15 2022-07-15 Medical image report generation method and device based on multi-mode fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210836966.3A CN115331769B (en) 2022-07-15 2022-07-15 Medical image report generation method and device based on multi-mode fusion

Publications (2)

Publication Number Publication Date
CN115331769A true CN115331769A (en) 2022-11-11
CN115331769B CN115331769B (en) 2023-05-09

Family

ID=83917479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210836966.3A Active CN115331769B (en) 2022-07-15 2022-07-15 Medical image report generation method and device based on multi-mode fusion

Country Status (1)

Country Link
CN (1) CN115331769B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180060512A1 (en) * 2016-08-29 2018-03-01 Jeffrey Sorenson System and method for medical imaging informatics peer review system
CN112992308A (en) * 2021-03-25 2021-06-18 腾讯科技(深圳)有限公司 Training method of medical image report generation model and image report generation method
CN113724359A (en) * 2021-07-14 2021-11-30 鹏城实验室 CT report generation method based on Transformer
CN114724670A (en) * 2022-06-02 2022-07-08 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Medical report generation method and device, storage medium and electronic equipment

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115937689A (en) * 2022-12-30 2023-04-07 安徽农业大学 Agricultural pest intelligent identification and monitoring technology
CN115937689B (en) * 2022-12-30 2023-08-11 安徽农业大学 Intelligent identification and monitoring technology for agricultural pests
CN116028654A (en) * 2023-03-30 2023-04-28 中电科大数据研究院有限公司 Multi-mode fusion updating method for knowledge nodes
CN116028654B (en) * 2023-03-30 2023-06-13 中电科大数据研究院有限公司 Multi-mode fusion updating method for knowledge nodes
CN117010494A (en) * 2023-09-27 2023-11-07 之江实验室 Medical data generation method and system based on causal expression learning
CN117010494B (en) * 2023-09-27 2024-01-05 之江实验室 Medical data generation method and system based on causal expression learning
CN117726920A (en) * 2023-12-20 2024-03-19 广州丽芳园林生态科技股份有限公司 Knowledge-graph-based plant disease and pest identification method, system, equipment and storage medium
CN117726920B (en) * 2023-12-20 2024-06-07 广州丽芳园林生态科技股份有限公司 Knowledge-graph-based plant disease and pest identification method, system, equipment and storage medium
CN117649917A (en) * 2024-01-29 2024-03-05 北京大学 Training method and device for test report generation model and test report generation method
CN118072899A (en) * 2024-02-27 2024-05-24 中国人民解放军总医院第二医学中心 Bone mineral density report generation platform based on diffusion model text generation technology
CN117993500A (en) * 2024-04-07 2024-05-07 江西为易科技有限公司 Medical teaching data management method and system based on artificial intelligence

Also Published As

Publication number Publication date
CN115331769B (en) 2023-05-09

Similar Documents

Publication Publication Date Title
CN115331769B (en) Medical image report generation method and device based on multi-mode fusion
US11507800B2 (en) Semantic class localization digital environment
Shin et al. Learning to read chest x-rays: Recurrent neural cascade model for automated image annotation
Lu et al. Neural baby talk
CN109471895B (en) Electronic medical record phenotype extraction and phenotype name normalization method and system
Wu et al. Neural scene de-rendering
CN109545302A (en) A kind of semantic-based medical image report template generation method
CN107909115B (en) Image Chinese subtitle generating method
Egger et al. Deep learning—a first meta-survey of selected reviews across scientific disciplines, their commonalities, challenges and research impact
CN112561064B (en) Knowledge base completion method based on OWKBC model
WO2021052875A1 (en) Systems and methods for incorporating multimodal data to improve attention mechanisms
CN111667483B (en) Training method of segmentation model of multi-modal image, image processing method and device
WO2022052530A1 (en) Method and apparatus for training face correction model, electronic device, and storage medium
CN112560454B (en) Bilingual image subtitle generating method, bilingual image subtitle generating system, storage medium and computer device
CN116129141B (en) Medical data processing method, apparatus, device, medium and computer program product
CN111667027A (en) Multi-modal image segmentation model training method, image processing method and device
CN115190999A (en) Classifying data outside of a distribution using contrast loss
CN116563537A (en) Semi-supervised learning method and device based on model framework
CN116385937A (en) Method and system for solving video question and answer based on multi-granularity cross-mode interaction framework
CN116258928A (en) Pre-training method based on self-supervision information of unlabeled medical image
CN115662565A (en) Medical image report generation method and equipment integrating label information
Robben et al. DeepVoxNet: voxel-wise prediction for 3D images
CN114139531A (en) Medical entity prediction method and system based on deep learning
Souza et al. Automatic recognition of continuous signing of brazilian sign language for medical interview
Rebai et al. Deep kernel-SVM network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant