CN115331769B - Medical image report generation method and device based on multi-modal fusion


Info

Publication number
CN115331769B
CN115331769B
Authority
CN
China
Prior art keywords
medical
graph
representing
attention
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210836966.3A
Other languages
Chinese (zh)
Other versions
CN115331769A (en)
Inventor
黄雨
李航
徐德轩
金芝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Peking University First Hospital
Original Assignee
Peking University
Peking University First Hospital
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University and Peking University First Hospital
Priority to CN202210836966.3A
Publication of CN115331769A
Application granted
Publication of CN115331769B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H15/00 - ICT specially adapted for medical reports, e.g. generation or transmission thereof
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA], independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00 - ICT specially adapted for the handling or processing of medical images

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Public Health (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The invention provides a medical image report generation method and device based on multi-modal fusion. The method comprises the following steps: constructing a medical prior knowledge graph and acquiring an initial feature vector for each node in the graph; inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector; inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence; performing multi-modal fusion of the graph embedding vector and the visual feature sequence with a co-attention mechanism to obtain an attention re-weighted image sequence; and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report. The invention can improve the accuracy and reliability of medical image report generation.

Description

Medical image report generation method and device based on multi-modal fusion
Technical Field
The invention relates to the technical field at the intersection of medicine and artificial intelligence, and in particular to a medical image report generation method and device based on multi-modal fusion.
Background
In recent years, medical image report generation has become an important direction of collaborative research between computer scientists and medical professionals. Accurate and efficient medical image reports can greatly improve doctors' grasp of a patient's condition, reduce their workload, assist them in making correct diagnoses, and provide patients with corresponding medical guidance and advice.
At present, research on medical image report generation is still at an early stage. In existing schemes, the medical knowledge graph is generally used only for subtasks such as classification and is not integrated into the generation model, so the accuracy and reliability of the generated reports are limited.
Disclosure of Invention
The invention provides a medical image report generation method and device based on multi-modal fusion, which address the defect of the prior art that medical knowledge graphs are used only for subtasks such as classification rather than being fused into the generation model, thereby improving the accuracy and reliability of medical image report generation.
The invention provides a medical image report generation method based on multi-modal fusion, which comprises the following steps:
constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
inputting the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
performing multi-modal fusion of the graph embedding vector and the visual feature sequence with a co-attention mechanism to obtain an attention re-weighted image sequence;
and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
According to the medical image report generation method based on multi-modal fusion provided by the invention, constructing the medical prior knowledge graph includes:
acquiring a plurality of unlabeled medical image report texts;
extracting a plurality of medical entities from the unlabeled medical image report texts with a named entity recognition algorithm;
reducing the dimension of the medical entities with a clustering algorithm;
and constructing the medical prior knowledge graph with the dimension-reduced medical entities as nodes and the relationships among them as edges.
According to the medical image report generation method based on multi-modal fusion provided by the invention, acquiring the initial feature vector of each node in the medical prior knowledge graph includes:
initializing each node of the medical prior knowledge graph through a word embedding model to obtain the node's initial feature vector.
According to the medical image report generation method based on multi-modal fusion provided by the invention, inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector includes:
constructing a graph encoder:
GC(X^(k)) = σ( D̃^(-1/2) · Ã · D̃^(-1/2) · X^(k) · W^(k) )
X^(k+1) = BN( Dropout( GC(X^(k)) ) ) + X^(k)
where Ã denotes the adjacency matrix of the medical prior knowledge graph, with an edge added from every node to itself; X^(0) denotes the initial feature matrix of the medical prior knowledge graph, obtained by stacking the initial feature vectors of all nodes; X^(k) denotes the graph convolution feature matrix of the k-th layer and X^(k+1) that of the (k+1)-th layer; D̃ is the degree matrix with D̃_ii = Σ_j Ã_ij, so that D̃^(-1/2) · Ã · D̃^(-1/2) normalizes the aggregated node features; Ã_ij denotes the element in the i-th row and j-th column of the adjacency matrix; W^(k) denotes a trainable weight matrix; GC(·) denotes the graph convolution function, σ(·) the activation function, Dropout(·) the random-discard function, and BN(·) the batch normalization function;
and inputting the medical prior knowledge graph and the initial feature vector of each node into the graph encoder, taking the graph convolution feature matrix of the last layer as the graph embedding vector.
According to the medical image report generation method based on multi-modal fusion provided by the invention, inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence includes:
inputting the medical image into an image encoder that does not include a linear layer to obtain a four-dimensional visual feature matrix;
reshaping the four-dimensional visual feature matrix into a three-dimensional visual feature matrix;
and converting the three-dimensional visual feature matrix into a visual feature sequence.
According to the medical image report generation method based on multi-modal fusion provided by the invention, performing multi-modal fusion of the graph embedding vector and the visual feature sequence with a co-attention mechanism to obtain an attention re-weighted image sequence includes:
calculating an affinity matrix between the graph embedding vector and the visual feature sequence;
learning an attention map between the graph embedding vector and the visual feature sequence through the affinity matrix;
calculating an attention weight vector based on the attention map;
and calculating the attention re-weighted image sequence based on the visual feature sequence and the attention weight vector.
According to the medical image report generation method based on multi-modal fusion provided by the invention, calculating the affinity matrix between the graph embedding vector and the visual feature sequence includes:
calculating the affinity matrix by the following expression:
C = tanh( G_E^T · W_b · I_E )
where C denotes the affinity matrix, G_E the graph embedding vector, I_E the visual feature sequence, and W_b a trainable weight matrix.
According to the medical image report generation method based on multi-modal fusion provided by the invention, learning the attention map between the graph embedding vector and the visual feature sequence through the affinity matrix includes:
learning the attention map by the following expression:
F_i = tanh( W_i · I_E + (W_g · G_E) · C )
where F_i denotes the attention map learned from the graph embedding vector and the visual feature sequence through the affinity matrix, W_i and W_g both denote trainable weight matrices, C denotes the affinity matrix, G_E the graph embedding vector, and I_E the visual feature sequence.
According to the medical image report generation method based on multi-modal fusion provided by the invention, calculating the attention weight vector based on the attention map includes:
calculating the attention weight vector by the following expression:
a_i = softmax( w_fi · F_i )
where a_i denotes the attention weight vector, w_fi denotes a trainable weight matrix, and F_i denotes the attention map.
According to the medical image report generation method based on multi-modal fusion provided by the invention, calculating the attention re-weighted image sequence based on the visual feature sequence and the attention weight vector includes:
calculating the attention re-weighted image sequence by the following expression:
x̂_r = â_r · x_r, r = 1, 2, …, R
where x̂_r denotes an element of the attention re-weighted image sequence, x_r denotes an element of the visual feature sequence, â_r denotes the element of the attention weight vector corresponding to x_r, and R = H′ × W′ denotes the number of image blocks.
According to the medical image report generation method based on multi-modal fusion provided by the invention, the memory-driven Transformer model includes an encoder and a decoder, the decoder including a decoding module provided with a memory-driven normalization layer.
According to the medical image report generation method based on multi-modal fusion provided by the invention, inputting the attention re-weighted image sequence into the memory-driven Transformer model to generate the medical image report includes:
inputting the attention re-weighted image sequence into the encoder;
initializing the relational memory with the graph embedding vector;
calculating the memory matrix of the last round of output of the relational memory;
and inputting the memory matrix output by the last round of the relational memory and the output of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report.
According to the medical image report generation method based on multi-modal fusion provided by the invention, initializing the relational memory with the graph embedding vector includes:
initializing the relational memory by the following expression:
M_0 = MLP( G_E · W_m )
where M_0 denotes the initial memory matrix of the relational memory, G_E denotes the graph embedding vector, W_m denotes a weight matrix, and MLP(·) denotes a multi-layer perceptron used to build the mapping between dimensions.
According to the medical image report generation method based on multi-modal fusion provided by the invention, calculating the memory matrix of the last round of output of the relational memory includes:
calculating the memory matrix of the last round of output of the relational memory by the following expressions:
Z = MLP( softmax( Q · K^T / √d_k ) · V )
G^f = σ( [M_(t-1); y_(t-1)] · W^f ), G^o = σ( [M_(t-1); y_(t-1)] · W^o )
M_t = G^f ⊙ M_(t-1) + G^o ⊙ tanh(Z)
where Q = M_(t-1) · W_Q, K = [M_(t-1); y_(t-1)] · W_K, and V = [M_(t-1); y_(t-1)] · W_V; M_(t-1) denotes the memory matrix output by the previous round of the relational memory; y_(t-1) denotes the word embedding vector predicted in the previous round; W_Q, W_K, W_V, W^f, and W^o are trainable weight matrices; d_k denotes the scaling factor, obtained by dividing the dimension of K by the number of attention heads; MLP(·) denotes the multi-layer perceptron; σ(·) denotes the sigmoid function; G^f and G^o denote the forget gate and the output gate used to balance M_(t-1) and y_(t-1); Z denotes the multi-head attention output matrix mapped by the multi-layer perceptron; and M_t denotes the memory matrix output by the last round of the relational memory.
According to the medical image report generation method based on multi-modal fusion provided by the invention, inputting the memory matrix output by the last round of the relational memory and the output of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report includes:
calculating the output of the decoding module provided with the memory-driven normalization layer by the following expressions:
γ_t = γ + MLP(M_t)
β_t = β + MLP(M_t)
MCLN(r, M_t) = γ_t ⊙ (r − μ) / ν + β_t
θ = T_D( ψ, N, RM(M_(t-1), y_(t-1)), MCLN(r, M_t) )
where ψ denotes the output of the encoder; N denotes the number of decoder layers; γ denotes a learnable scaling parameter matrix used to improve generalization ability, and γ_t denotes the sum of γ and M_t mapped by the multi-layer perceptron; β denotes a learnable shift parameter matrix used to improve generalization ability, and β_t denotes the sum of β and M_t mapped by the multi-layer perceptron; r denotes the input of the normalization layer, with μ and ν its mean and standard deviation; T_E(·) denotes the encoder, T_D(·) denotes the decoder, and RM(·) denotes the relational memory.
The invention also provides a medical image report generation device based on multi-modal fusion, which comprises:
the graph construction module, used to construct a medical prior knowledge graph and acquire an initial feature vector of each node in the medical prior knowledge graph;
the graph encoder module, used to input the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
the image encoder module, used to input the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
the multi-modal fusion module, used to perform multi-modal fusion of the graph embedding vector and the visual feature sequence with a co-attention mechanism to obtain an attention re-weighted image sequence;
and the report generation module, used to input the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
The invention also provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the medical image report generation method based on multi-modal fusion.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the medical image report generation method based on multi-modal fusion as described in any one of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the medical image report generation method based on multi-modal fusion as described in any one of the above.
The invention provides a medical image report generation method and device based on multi-modal fusion. First, a medical prior knowledge graph is constructed and an initial feature vector is acquired for each of its nodes. Then, the medical prior knowledge graph and the initial feature vectors are input into a graph encoder to obtain a graph embedding vector; the medical image is input into an image encoder that does not include a linear layer to obtain a visual feature sequence; and a co-attention mechanism performs multi-modal fusion of the graph embedding vector and the visual feature sequence to obtain an attention re-weighted image sequence, which fuses the medical prior knowledge graph and the medical image. Finally, the attention re-weighted image sequence is input into a memory-driven Transformer model to generate a medical image report. Because the attention re-weighted image sequence fuses the medical prior knowledge graph and the medical image, the memory-driven Transformer model understands medical prior knowledge better and more robustly, which improves the accuracy and reliability of medical image report generation.
Drawings
To illustrate the technical solutions of the invention or the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the invention; other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic flow chart of a medical image report generating method based on multi-modal fusion;
FIG. 2 is a schematic structural diagram of a medical image report generation model based on a medical prior knowledge graph and memory driving provided by the invention;
FIG. 3 is a schematic diagram of constructing a medical prior knowledge graph provided by the invention;
FIG. 4 is a schematic structural diagram of a medical image report generating device based on multi-modal fusion;
fig. 5 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The medical image report generation method based on the multi-modal fusion of the present invention is described below with reference to fig. 1 to 3.
Referring to fig. 1, fig. 1 is a schematic flow chart of the medical image report generation method based on multi-modal fusion provided by the invention. As shown in fig. 1, the method may include the following steps:
step 101, constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
step 102, inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
step 103, inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
step 104, performing multi-modal fusion of the graph embedding vector and the visual feature sequence with a co-attention mechanism to obtain an attention re-weighted image sequence;
step 105, inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
In step 101, a medical prior knowledge graph of appropriate scale is constructed with natural language processing methods, rather than built manually from a few selected keywords. The medical prior knowledge graph of this embodiment has the following characteristics:
1) Comprehensive entity types
The medical prior knowledge graph contains comprehensive entity types and can describe a disease from all aspects, not just by its name. For example, for skin diseases, it may include entity information such as disease name, location, shape, and color.
2) Appropriate graph scale
The scale of the medical prior knowledge graph is kept appropriate: if the graph is too large, it is difficult to train and learn; if it is too small, it cannot retain enough prior knowledge.
3) Comprehensive entity relationships
Relationships between entities can be established automatically by a relation extraction method and supplemented manually, so that the medical prior knowledge graph embodies more prior knowledge.
In this step, to support the subsequent steps, each node in the medical prior knowledge graph is initialized to obtain its initial feature vector. Optionally, each node is initialized through a word embedding model.
In step 102, a graph encoder is used to extract the graph embedding vector of the medical prior knowledge graph. The medical prior knowledge graph and the initial feature vector of each node are input into the graph encoder, and the graph embedding vector is obtained through graph encoding.
In step 103, a convolutional neural network derivative is employed as the visual extractor of the image encoder, for example the residual convolutional network ResNet or the densely connected convolutional network DenseNet. The image encoder used in this embodiment does not include the final linear layer and outputs the result of the pooling layer.
In step 104, the image encoder can obtain the visual features of the medical image but cannot capture high-level semantic information well. This embodiment adopts a co-attention mechanism to perform multi-modal fusion of the graph embedding vector and the visual feature sequence, simulating a visual question-answering process, and finally obtains the attention re-weighted image sequence.
In step 105, as shown in fig. 2, the memory-driven Transformer model optionally includes an encoder and a decoder, the decoder including a decoding module provided with a memory-driven normalization layer. The memory-driven normalization layer consists of three Memory-driven Conditional Layer Normalization (MCLN) layers, used to enhance the decoding capability and generalization of the memory-driven Transformer model.
The attention re-weighted image sequence is input into the memory-driven Transformer model to generate the medical image report.
In this embodiment, because the attention re-weighted image sequence fuses the medical prior knowledge graph and the medical image, the memory-driven Transformer model understands medical prior knowledge better and more robustly, which improves the accuracy and reliability of medical image report generation.
Optionally, constructing the medical prior knowledge graph in step 101 includes the following sub-steps:
step 1011, acquiring a plurality of unlabeled medical image report texts;
step 1012, extracting a plurality of medical entities from the unlabeled medical image report texts with a named entity recognition algorithm;
step 1013, reducing the dimension of the medical entities with a clustering algorithm;
step 1014, constructing the medical prior knowledge graph with the dimension-reduced medical entities as nodes and the relationships among them as edges.
In step 1011, several unlabeled medical image report texts are acquired, such as: a large number of unannotated reports, training data sets, available text information provided by the physician, and the like. Different types of reports may be selected as the underlying data for different tasks.
In step 1012, using a named entity recognition algorithm, a number of medical entities may be effectively extracted from a number of unlabeled medical image report texts, which are stored as key nodes of a medical prior knowledge graph.
In step 1013, the nodes identified by named entity recognition may contain a large amount of similar content; keeping all of them would produce many redundant structures and make the medical prior knowledge graph too large. Text processing methods and a clustering algorithm are therefore used to reduce the dimension of the medical entities.
In step 1014, relationships between the medical entities are established by relation extraction and supplemented by manual design. The medical prior knowledge graph is then constructed with the dimension-reduced medical entities as nodes and the relationships among them as edges.
In this way, a medical prior knowledge graph of appropriate scale can be constructed, rather than one built manually from a few selected keywords.
Specifically, as shown in fig. 3, two datasets are used to construct medical prior knowledge graphs. Because the two datasets differ in language, their construction also differs somewhat. For the IU-Xray dataset, the Stanza biomedical pipeline is used as the backbone for named entity recognition and relation extraction, and clustering finally yields a medical prior knowledge graph with 284 key nodes. Each node obtains a 768-dimensional feature vector through BioBERT, which serves as the initial feature of the medical prior knowledge graph. Similarly, for the NCRC-DS dataset, CMeKG is used to extract Chinese medical entities and entity triples. CMeKG is a tool library for Chinese medical knowledge graphs, providing open-source implementations of named entity recognition, relation extraction, medical word segmentation, and so on. After clustering, a knowledge graph containing 191 key nodes is obtained. To obtain the initial node features, the node keywords are input into the Chinese medical BERT model provided by CMeKG, yielding 768-dimensional initial vectors.
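As an illustration of this construction pipeline, the sketch below extracts entities with the Stanza biomedical pipeline, embeds them with BioBERT, and clusters them down to a fixed node set. The Stanza package choice, the BioBERT checkpoint name, the toy input, and the use of KMeans with one representative entity per cluster are assumptions for demonstration, not the exact tooling of the embodiment.

```python
import stanza
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.cluster import KMeans

# Biomedical NER over unlabeled report texts (package choice is illustrative).
nlp = stanza.Pipeline(lang="en", package="mimic", processors={"ner": "i2b2"})
reports = ["The cardiac silhouette is enlarged. No pleural effusion."]  # toy input
ent_list = sorted({ent.text.lower() for doc in map(nlp, reports)
                   for ent in doc.entities})

# 768-dimensional entity embeddings from BioBERT (checkpoint name is an assumption).
tok = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
bert = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1").eval()
with torch.no_grad():
    vecs = torch.cat([bert(**tok(e, return_tensors="pt")).last_hidden_state[:, 0]
                      for e in ent_list])                   # [num_entities, 768]

# Cluster similar entities; keep the entity closest to each centroid as a node.
k = min(284, len(ent_list))                                 # 284 nodes for IU-Xray
km = KMeans(n_clusters=k, n_init=10).fit(vecs.numpy())
centers = torch.tensor(km.cluster_centers_, dtype=torch.float32)
nodes = [ent_list[i] for i in torch.cdist(centers, vecs).argmin(dim=1).tolist()]
```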
Optionally, step 102 includes the following sub-steps:
step 1021, constructing a graph encoder:
GC(X^(k)) = σ( D̃^(-1/2) · Ã · D̃^(-1/2) · X^(k) · W^(k) ) (1)
X^(k+1) = BN( Dropout( GC(X^(k)) ) ) + X^(k) (2)
where Ã denotes the adjacency matrix of the medical prior knowledge graph, with an edge added from every node to itself; X^(0) denotes the initial feature matrix of the medical prior knowledge graph, obtained by stacking the initial feature vectors of all nodes; X^(k) denotes the graph convolution feature matrix of the k-th layer and X^(k+1) that of the (k+1)-th layer; D̃ is the degree matrix with D̃_ii = Σ_j Ã_ij, so that D̃^(-1/2) · Ã · D̃^(-1/2) normalizes the aggregated node features; Ã_ij denotes the element in the i-th row and j-th column of the adjacency matrix; W^(k) denotes a trainable weight matrix; GC(·) denotes the graph convolution function, σ(·) the activation function, Dropout(·) the random-discard function, and BN(·) the batch normalization function;
step 1022, inputting the medical prior knowledge graph and the initial feature vector of each node into the graph encoder, and taking the graph convolution feature matrix of the last layer as the graph embedding vector.
In this embodiment, a random-discard layer, a batch normalization layer, and a residual connection are added between every two graph convolution layers, which improves the expressive power of the graph encoder; the graph embedding vector of the medical prior knowledge graph is then extracted by the graph encoder.
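A minimal PyTorch sketch of such a graph encoder, following equations (1) and (2), might look as follows. The 768-dimensional input (BioBERT node features) and 512-dimensional hidden width come from the embodiment; the input projection layer and the ReLU activation are assumptions.

```python
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    """Graph encoder of Eqs. (1)-(2): stacked graph convolutions with
    dropout, batch normalization and a residual connection between layers."""
    def __init__(self, in_dim=768, hid_dim=512, num_layers=3, p_drop=0.1):
        super().__init__()
        self.proj = nn.Linear(in_dim, hid_dim)  # map BioBERT features to model width
        self.weights = nn.ModuleList([nn.Linear(hid_dim, hid_dim, bias=False)
                                      for _ in range(num_layers)])
        self.bns = nn.ModuleList([nn.BatchNorm1d(hid_dim) for _ in range(num_layers)])
        self.drop = nn.Dropout(p_drop)

    @staticmethod
    def normalize_adj(adj):
        # Ã = A + I (self-loop edges), then D̃^(-1/2) · Ã · D̃^(-1/2)
        a = adj + torch.eye(adj.size(0), device=adj.device)
        d_inv_sqrt = a.sum(dim=1).pow(-0.5)
        return d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)

    def forward(self, x, adj):              # x: [N, in_dim], adj: [N, N] 0/1 edges
        a_hat = self.normalize_adj(adj)
        x = self.proj(x)
        for w, bn in zip(self.weights, self.bns):
            gc = torch.relu(a_hat @ w(x))   # GC(X) = σ(D̃^-1/2 Ã D̃^-1/2 X W)
            x = bn(self.drop(gc)) + x       # X^(k+1) = BN(Dropout(GC(X^(k)))) + X^(k)
        return x                            # graph embedding G_E: [N, hid_dim]
```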
Optionally, step 103 includes the following sub-steps:
step 1031, inputting the medical image imag, of dimension [B, C, H, W], into an image encoder VE(·) that does not include a linear layer to obtain a four-dimensional visual feature matrix VE(imag) of dimension [B, F, H′, W′];
step 1032, reshaping the four-dimensional visual feature matrix into a three-dimensional visual feature matrix:
I_E = reshape(VE(imag)) (3)
where I_E is the three-dimensional visual feature matrix of dimension [B, F, H′ × W′] and reshape(·) denotes the reshaping function;
step 1033, converting the three-dimensional visual feature matrix into the visual feature sequence x_1, x_2, …, x_(H′×W′).
In this embodiment, the medical image is input into an image encoder that does not include a linear layer, and the output is reshaped to obtain the visual feature sequence.
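One plausible realization of this visual extractor is sketched below: a torchvision ResNet-101 with its average-pooling and linear layers stripped, whose last convolutional feature map is flattened into a patch sequence. The backbone and pretrained weights mirror the NCRC-DS embodiment; the flattening order is an assumption.

```python
import torch
import torchvision

class VisualExtractor(torch.nn.Module):
    """CNN backbone without the final linear layer; the last feature
    map [B, F, H', W'] is flattened into a sequence of H'*W' patches."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet101(weights="IMAGENET1K_V1")
        # keep everything up to (and excluding) avgpool and fc
        self.features = torch.nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, image):                  # [B, C, H, W]
        feat = self.features(image)            # [B, F, H', W']
        b, f, h, w = feat.shape
        feat = feat.reshape(b, f, h * w)       # [B, F, H'*W']
        return feat.permute(0, 2, 1)           # sequence: [B, H'*W', F]
```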
Optionally, step 104 includes the following sub-steps:
step 1041, calculating an affinity matrix between the graph embedding vector and the visual feature sequence;
step 1042, learning an attention map between the graph embedding vector and the visual feature sequence through the affinity matrix;
step 1043, calculating an attention weight vector based on the attention map;
step 1044, calculating the attention re-weighted image sequence based on the visual feature sequence and the attention weight vector.
In step 1041, the affinity matrix between the graph embedding vector and the visual feature sequence is calculated by the following expression:
C = tanh( G_E^T · W_b · I_E ) (4)
where C denotes the affinity matrix, G_E the graph embedding vector, I_E the visual feature sequence, and W_b a trainable weight matrix.
In step 1042, the attention map between the graph embedding vector and the visual feature sequence is learned by the following expression:
F_i = tanh( W_i · I_E + (W_g · G_E) · C ) (5)
where F_i denotes the attention map learned through the affinity matrix, W_i and W_g both denote trainable weight matrices, C denotes the affinity matrix, G_E the graph embedding vector, and I_E the visual feature sequence.
In step 1043, the attention weight vector is calculated by the following expression:
a_i = softmax( w_fi · F_i ) (6)
where a_i denotes the attention weight vector, w_fi denotes a trainable weight matrix, and F_i denotes the attention map.
In step 1044, the attention re-weighted image sequence is calculated by the following expression:
x̂_r = â_r · x_r, r = 1, 2, …, R (7)
where x̂_r denotes an element of the attention re-weighted image sequence, x_r denotes an element of the visual feature sequence, â_r denotes the element of the attention weight vector corresponding to x_r, and R = H′ × W′ denotes the number of image blocks.
In this embodiment, a co-attention mechanism performs multi-modal fusion of the graph embedding vector output by the graph encoder and the visual feature sequence output by the image encoder, so that the resulting attention re-weighted image sequence fuses the visual features of the medical image with the high-level semantic information of the medical prior knowledge graph.
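A compact sketch of this co-attention fusion, following equations (4) to (7), is given below; the batched shapes and the single shared feature width for both modalities are simplifying assumptions.

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """Co-attention fusion of the graph embedding G_E [B, N, d] and the
    visual feature sequence I_E [B, R, d], following Eqs. (4)-(7)."""
    def __init__(self, dim):
        super().__init__()
        self.w_b = nn.Linear(dim, dim, bias=False)   # W_b for the affinity matrix
        self.w_i = nn.Linear(dim, dim, bias=False)   # W_i
        self.w_g = nn.Linear(dim, dim, bias=False)   # W_g
        self.w_f = nn.Linear(dim, 1, bias=False)     # w_f

    def forward(self, g_e, i_e):
        # affinity: C = tanh(G_E^T W_b I_E)  -> [B, N, R]
        c = torch.tanh(self.w_b(g_e) @ i_e.transpose(1, 2))
        # attention map: F = tanh(W_i I_E + C^T (W_g G_E))  -> [B, R, d]
        f = torch.tanh(self.w_i(i_e) + c.transpose(1, 2) @ self.w_g(g_e))
        # attention weights: a = softmax(w_f F)  -> [B, R]
        a = torch.softmax(self.w_f(f).squeeze(-1), dim=-1)
        # re-weighted image sequence: x̂_r = â_r * x_r
        return i_e * a.unsqueeze(-1)
```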
Optionally, step 105 includes the following sub-steps:
step 1051, inputting the attention re-weighted image sequence into the encoder;
step 1052, initializing the relational memory with the graph embedding vector;
step 1053, calculating the memory matrix of the last round of output of the relational memory;
step 1054, inputting the memory matrix output by the last round of the relational memory and the output of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report.
In step 1051, the output of the encoder is calculated by the following expression:
ψ = T_E( x̂_1, x̂_2, …, x̂_R ) (8)
where ψ denotes the output of the encoder and T_E(·) denotes the encoder.
In step 1052, the relational memory is used to store content shared across the model's training results, enhancing the model's learning ability. Specifically, a memory matrix with several rows is provided, each row being regarded as a slot that stores specific pattern information.
The relational memory is initialized by the following expression:
M_0 = MLP( G_E · W_m ) (9)
where M_0 denotes the initial memory matrix of the relational memory, G_E denotes the graph embedding vector, W_m denotes a weight matrix, and MLP(·) denotes a multi-layer perceptron used to build the mapping between dimensions.
In step 1053, the memory matrix of the last round of output of the relational memory is calculated by the following expressions:
Z = MLP( softmax( Q · K^T / √d_k ) · V ) (10)
G^f = σ( [M_(t-1); y_(t-1)] · W^f ), G^o = σ( [M_(t-1); y_(t-1)] · W^o ) (11)
M_t = G^f ⊙ M_(t-1) + G^o ⊙ tanh(Z) (12)
where Q = M_(t-1) · W_Q, K = [M_(t-1); y_(t-1)] · W_K, and V = [M_(t-1); y_(t-1)] · W_V; M_(t-1) denotes the memory matrix output by the previous round of the relational memory; y_(t-1) denotes the word embedding vector predicted in the previous round; W_Q, W_K, W_V, W^f, and W^o are trainable weight matrices; d_k denotes the scaling factor, obtained by dividing the dimension of K by the number of attention heads; MLP(·) denotes the multi-layer perceptron; σ(·) denotes the sigmoid function; G^f and G^o denote the forget gate and the output gate used to balance M_(t-1) and y_(t-1); Z denotes the multi-head attention output matrix mapped by the multi-layer perceptron; and M_t denotes the memory matrix output by the last round of the relational memory.
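The following sketch illustrates the relational memory of equations (9) to (12): the memory is initialized from the graph embedding and updated at each decoding step by gated attention over the previous memory and the previous word embedding. Single-head attention and the exact linear form of the two gates are assumptions made for brevity; the 3 slots and width 512 follow the embodiment.

```python
import torch
import torch.nn as nn

class RelationalMemory(nn.Module):
    """Gated memory of Eqs. (9)-(12), initialized from the graph embedding."""
    def __init__(self, num_slots=3, dim=512):
        super().__init__()
        self.init_mlp = nn.Linear(dim, num_slots * dim)   # M_0 = MLP(G_E · W_m)
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.w_f = nn.Linear(2 * dim, dim)                # forget gate (assumed form)
        self.w_o = nn.Linear(2 * dim, dim)                # output gate (assumed form)
        self.num_slots, self.dim = num_slots, dim

    def init_memory(self, g_e):            # g_e: pooled graph embedding [B, d]
        return self.init_mlp(g_e).view(-1, self.num_slots, self.dim)

    def forward(self, m_prev, y_prev):     # M_{t-1}: [B, S, d], y_{t-1}: [B, 1, d]
        cat = torch.cat([m_prev, y_prev], dim=1)          # [M_{t-1}; y_{t-1}]
        q, k, v = self.w_q(m_prev), self.w_k(cat), self.w_v(cat)
        att = torch.softmax(q @ k.transpose(1, 2) / self.dim ** 0.5, dim=-1)
        z = self.mlp(att @ v)                             # Z = MLP(softmax(QK^T/√d_k)V)
        gate_in = torch.cat([m_prev, y_prev.expand_as(m_prev)], dim=-1)
        g_f = torch.sigmoid(self.w_f(gate_in))            # forget gate G^f
        g_o = torch.sigmoid(self.w_o(gate_in))            # output gate G^o
        return g_f * m_prev + g_o * torch.tanh(z)         # M_t
```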
In step 1054, the output of the decoding module provided with the memory-driven normalization layer is calculated by the following expressions:
γ_t = γ + MLP(M_t) (13)
β_t = β + MLP(M_t) (14)
MCLN(r, M_t) = γ_t ⊙ (r − μ) / ν + β_t (15)
θ = T_D( ψ, N, RM(M_(t-1), y_(t-1)), MCLN(r, M_t) ) (16)
where ψ denotes the output of the encoder; N denotes the number of decoder layers; γ denotes a learnable scaling parameter matrix used to improve generalization ability, and γ_t denotes the sum of γ and M_t mapped by the multi-layer perceptron; β denotes a learnable shift parameter matrix used to improve generalization ability, and β_t denotes the sum of β and M_t mapped by the multi-layer perceptron; r denotes the input of the normalization layer, with μ and ν its mean and standard deviation; T_E(·) denotes the encoder, T_D(·) denotes the decoder, and RM(·) denotes the relational memory.
In this embodiment, the relational memory is initialized with the graph embedding vector rather than with an all-zero initial memory matrix, which optimizes the memory-driven Transformer model. Moreover, the input of the decoder of the memory-driven Transformer model fuses the medical prior knowledge graph and the medical image, so the model understands medical prior knowledge better and more robustly, which improves the accuracy and reliability of medical image report generation.
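A sketch of the memory-driven conditional layer normalization of equations (13) to (15) is shown below; pooling the memory slots by their mean before conditioning is an assumption.

```python
import torch
import torch.nn as nn

class MCLN(nn.Module):
    """Memory-driven Conditional Layer Normalization, Eqs. (13)-(15):
    the learnable scale γ and shift β are perturbed by MLP projections
    of the current memory M_t before standard layer normalization."""
    def __init__(self, dim=512, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # γ
        self.beta = nn.Parameter(torch.zeros(dim))   # β
        self.mlp_g = nn.Linear(dim, dim)             # MLP for γ_t
        self.mlp_b = nn.Linear(dim, dim)             # MLP for β_t
        self.eps = eps

    def forward(self, r, m_t):                       # r: [B, T, d], m_t: [B, S, d]
        m = m_t.mean(dim=1, keepdim=True)            # condense memory slots (assumed)
        gamma_t = self.gamma + self.mlp_g(m)         # γ_t = γ + MLP(M_t)
        beta_t = self.beta + self.mlp_b(m)           # β_t = β + MLP(M_t)
        mu = r.mean(dim=-1, keepdim=True)            # μ
        nu = r.std(dim=-1, keepdim=True)             # ν
        return gamma_t * (r - mu) / (nu + self.eps) + beta_t
```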
Specifically, taking the IU-Xray and NCRC-DS datasets as examples: for IU-Xray, a DenseNet-121 model pre-trained on CheXpert is selected as the backbone of the image encoder. Two chest radiographs belonging to the same report are input into the model, and their features are spliced before being passed to the encoder of the memory-driven Transformer model; the medical prior knowledge graph with 284 nodes serves as the medical prior knowledge. For NCRC-DS, a ResNet-101 model pre-trained on ImageNet is chosen as the backbone of the image encoder. Because this dataset is small, only one skin-disease picture and its corresponding report are input at a time, and the medical prior knowledge graph with 191 nodes serves as the medical prior knowledge. By default, the number of graph convolution layers is 3, the number of relational memory slots is 3, and the word embedding dimension is set to 512. The model is trained with the Adam optimizer under a cross-entropy loss; during training, the BLEU-4 score is evaluated on the test set, and weight decay and early stopping are applied. In fields such as chest radiography and skin disease, the medical image report generation method based on multi-modal fusion of this embodiment outperforms current common medical image report generation models in accuracy and reliability.
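For orientation, a minimal training-loop sketch under the stated setup (Adam with weight decay, cross-entropy loss, BLEU-4 validation, early stopping) could look as follows; `model`, the data loaders, `compute_bleu4`, and all hyperparameter values are assumed placeholders.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-5)
criterion = torch.nn.CrossEntropyLoss(ignore_index=0)        # 0 = padding token

best_bleu4, patience, bad_epochs = 0.0, 10, 0
for epoch in range(100):
    model.train()
    for images, graph, report_ids in train_loader:
        logits = model(images, graph, report_ids[:, :-1])    # teacher forcing
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         report_ids[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    bleu4 = compute_bleu4(model, val_loader)                 # greedy/beam decoding
    if bleu4 > best_bleu4:
        best_bleu4, bad_epochs = bleu4, 0
        torch.save(model.state_dict(), "best.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                           # early stopping
            break
```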
The medical image report generation device based on multi-modal fusion provided by the invention is described below; it and the medical image report generation method described above may be referred to correspondingly.
Referring to fig. 4, fig. 4 is a schematic structural diagram of the medical image report generation device based on multi-modal fusion provided by the invention. As shown in fig. 4, the device may include:
the graph construction module 10, used to construct a medical prior knowledge graph and acquire an initial feature vector of each node in the medical prior knowledge graph;
the graph encoder module 20, used to input the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
the image encoder module 30, used to input the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
the multi-modal fusion module 40, used to perform multi-modal fusion of the graph embedding vector and the visual feature sequence with a co-attention mechanism to obtain an attention re-weighted image sequence;
the report generation module 50, used to input the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
Optionally, the graph construction module 10 is specifically configured to:
acquire a plurality of unlabeled medical image report texts;
extract a plurality of medical entities from the unlabeled medical image report texts with a named entity recognition algorithm;
reduce the dimension of the medical entities with a clustering algorithm;
and construct the medical prior knowledge graph with the dimension-reduced medical entities as nodes and the relationships among them as edges.
Optionally, the graph construction module 10 is specifically configured to:
initialize each node of the medical prior knowledge graph through a word embedding model to obtain the node's initial feature vector.
Optionally, the graph encoder module 20 is specifically configured to:
construct a graph encoder:
GC(X^(k)) = σ( D̃^(-1/2) · Ã · D̃^(-1/2) · X^(k) · W^(k) )
X^(k+1) = BN( Dropout( GC(X^(k)) ) ) + X^(k)
where Ã denotes the adjacency matrix of the medical prior knowledge graph, with an edge added from every node to itself; X^(0) denotes the initial feature matrix of the medical prior knowledge graph, obtained by stacking the initial feature vectors of all nodes; X^(k) denotes the graph convolution feature matrix of the k-th layer and X^(k+1) that of the (k+1)-th layer; D̃ is the degree matrix with D̃_ii = Σ_j Ã_ij, so that D̃^(-1/2) · Ã · D̃^(-1/2) normalizes the aggregated node features; Ã_ij denotes the element in the i-th row and j-th column of the adjacency matrix; W^(k) denotes a trainable weight matrix; GC(·) denotes the graph convolution function, σ(·) the activation function, Dropout(·) the random-discard function, and BN(·) the batch normalization function;
and input the medical prior knowledge graph and the initial feature vector of each node into the graph encoder, taking the graph convolution feature matrix of the last layer as the graph embedding vector.
Optionally, the image encoder module 30 is specifically configured to:
input the medical image into an image encoder that does not include a linear layer to obtain a four-dimensional visual feature matrix;
reshape the four-dimensional visual feature matrix into a three-dimensional visual feature matrix;
and convert the three-dimensional visual feature matrix into a visual feature sequence.
Optionally, the multi-modal fusion module 40 is specifically configured to:
calculate an affinity matrix between the graph embedding vector and the visual feature sequence;
learn an attention map between the graph embedding vector and the visual feature sequence through the affinity matrix;
calculate an attention weight vector based on the attention map;
and calculate the attention re-weighted image sequence based on the visual feature sequence and the attention weight vector.
Optionally, the multi-modal fusion module 40 is specifically configured to:
calculate the affinity matrix between the graph embedding vector and the visual feature sequence by the following expression:
C = tanh( G_E^T · W_b · I_E )
where C denotes the affinity matrix, G_E the graph embedding vector, I_E the visual feature sequence, and W_b a trainable weight matrix.
Optionally, the multi-modal fusion module 40 is specifically configured to:
learn the attention map between the graph embedding vector and the visual feature sequence by the following expression:
F_i = tanh( W_i · I_E + (W_g · G_E) · C )
where F_i denotes the attention map learned through the affinity matrix, W_i and W_g both denote trainable weight matrices, C denotes the affinity matrix, G_E the graph embedding vector, and I_E the visual feature sequence.
Optionally, the multi-modal fusion module 40 is specifically configured to:
calculate the attention weight vector by the following expression:
a_i = softmax( w_fi · F_i )
where a_i denotes the attention weight vector, w_fi denotes a trainable weight matrix, and F_i denotes the attention map.
Optionally, the multi-modal fusion module 40 is specifically configured to:
calculate the attention re-weighted image sequence by the following expression:
x̂_r = â_r · x_r, r = 1, 2, …, R
where x̂_r denotes an element of the attention re-weighted image sequence, x_r denotes an element of the visual feature sequence, â_r denotes the element of the attention weight vector corresponding to x_r, and R = H′ × W′ denotes the number of image blocks.
Optionally, the memory-driven Transformer model includes an encoder and a decoder, the decoder including a decoding module provided with a memory-driven normalization layer.
Optionally, the report generation module 50 is specifically configured to:
input the attention re-weighted image sequence into the encoder;
initialize the relational memory with the graph embedding vector;
calculate the memory matrix of the last round of output of the relational memory;
and input the memory matrix output by the last round of the relational memory and the output of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report.
Optionally, the report generation module 50 is specifically configured to:
initialize the relational memory by the following expression:
M_0 = MLP( G_E · W_m )
where M_0 denotes the initial memory matrix of the relational memory, G_E denotes the graph embedding vector, W_m denotes a weight matrix, and MLP(·) denotes a multi-layer perceptron used to build the mapping between dimensions.
Optionally, the report generation module 50 is specifically configured to:
calculate the memory matrix of the last round of output of the relational memory by the following expressions:
Z = MLP( softmax( Q · K^T / √d_k ) · V )
G^f = σ( [M_(t-1); y_(t-1)] · W^f ), G^o = σ( [M_(t-1); y_(t-1)] · W^o )
M_t = G^f ⊙ M_(t-1) + G^o ⊙ tanh(Z)
where Q = M_(t-1) · W_Q, K = [M_(t-1); y_(t-1)] · W_K, and V = [M_(t-1); y_(t-1)] · W_V; M_(t-1) denotes the memory matrix output by the previous round of the relational memory; y_(t-1) denotes the word embedding vector predicted in the previous round; W_Q, W_K, W_V, W^f, and W^o are trainable weight matrices; d_k denotes the scaling factor, obtained by dividing the dimension of K by the number of attention heads; MLP(·) denotes the multi-layer perceptron; σ(·) denotes the sigmoid function; G^f and G^o denote the forget gate and the output gate used to balance M_(t-1) and y_(t-1); Z denotes the multi-head attention output matrix mapped by the multi-layer perceptron; and M_t denotes the memory matrix output by the last round of the relational memory.
Optionally, the report generation module 50 is specifically configured to:
calculate the output of the decoding module provided with the memory-driven normalization layer by the following expressions:
γ_t = γ + MLP(M_t)
β_t = β + MLP(M_t)
MCLN(r, M_t) = γ_t ⊙ (r − μ) / ν + β_t
θ = T_D( ψ, N, RM(M_(t-1), y_(t-1)), MCLN(r, M_t) )
where ψ denotes the output of the encoder; N denotes the number of decoder layers; γ denotes a learnable scaling parameter matrix used to improve generalization ability, and γ_t denotes the sum of γ and M_t mapped by the multi-layer perceptron; β denotes a learnable shift parameter matrix used to improve generalization ability, and β_t denotes the sum of β and M_t mapped by the multi-layer perceptron; r denotes the input of the normalization layer, with μ and ν its mean and standard deviation; T_E(·) denotes the encoder, T_D(·) denotes the decoder, and RM(·) denotes the relational memory.
Fig. 5 illustrates a physical schematic diagram of an electronic device. As shown in fig. 5, the electronic device may include: a processor 810, a communication interface 820, a memory 830, and a communication bus 840, wherein the processor 810, the communication interface 820, and the memory 830 communicate with each other through the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform the medical image report generation method based on multi-modal fusion, which comprises:
constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
performing multi-modal fusion of the graph embedding vector and the visual feature sequence with a co-attention mechanism to obtain an attention re-weighted image sequence;
and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
Further, the logic instructions in the memory 830 may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the invention also provides a computer program product. The computer program product includes a computer program that can be stored on a non-transitory computer-readable storage medium; when the computer program is executed by a processor, the computer can perform the medical image report generation method based on multi-modal fusion provided by the methods above, which comprises:
constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
performing multi-modal fusion of the graph embedding vector and the visual feature sequence with a co-attention mechanism to obtain an attention re-weighted image sequence;
and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
In yet another aspect, the invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the medical image report generation method based on multi-modal fusion provided by the methods above, which comprises:
constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
performing multi-modal fusion of the graph embedding vector and the visual feature sequence with a co-attention mechanism to obtain an attention re-weighted image sequence;
and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or, of course, by hardware alone. Based on this understanding, the foregoing technical solution, or the part of it that contributes to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (18)

1. A medical image report generation method based on multi-modal fusion, comprising:
constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
inputting the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
adopting a cooperative attention mechanism to perform multi-modal fusion on the graph embedding vector and the visual feature sequence to obtain an attention re-weighted image sequence;
inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report;
wherein the adopting a cooperative attention mechanism to perform multi-modal fusion on the graph embedding vector and the visual feature sequence to obtain an attention re-weighted image sequence comprises:
calculating an affinity matrix between the graph embedding vector and the visual feature sequence;
learning, through the affinity matrix, an attention map between the graph embedding vector and the visual feature sequence;
calculating an attention weight vector based on the attention map;
and calculating the attention re-weighted image sequence based on the visual feature sequence and the attention weight vector.
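The following PyTorch-style sketch shows how the five steps of claim 1 might be chained together; the module names (graph_encoder, image_encoder, co_attention, mem_transformer, passed in as constructor arguments) are illustrative assumptions, not the patentee's code.

```python
# Hypothetical end-to-end sketch of the claim-1 pipeline (assumed module
# names and interfaces; not the patentee's implementation).
import torch

class ReportGenerator(torch.nn.Module):
    def __init__(self, graph_encoder, image_encoder, co_attention, mem_transformer):
        super().__init__()
        self.graph_encoder = graph_encoder      # knowledge graph -> graph embedding G_E
        self.image_encoder = image_encoder      # medical image -> visual feature sequence I_E
        self.co_attention = co_attention        # (G_E, I_E) -> attention re-weighted sequence
        self.mem_transformer = mem_transformer  # re-weighted sequence -> report tokens

    def forward(self, adjacency, node_feats, image):
        g_e = self.graph_encoder(adjacency, node_feats)  # graph embedding vector
        i_e = self.image_encoder(image)                  # visual feature sequence
        x_hat = self.co_attention(g_e, i_e)              # multi-modal fusion
        return self.mem_transformer(x_hat, g_e)          # generated report
```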
2. The method for generating a medical image report based on multi-modal fusion according to claim 1, wherein the constructing a medical prior knowledge graph comprises:
acquiring a plurality of unlabeled medical image report texts;
extracting a plurality of medical entities from the plurality of unlabeled medical image report texts by adopting a named entity recognition algorithm;
adopting a clustering algorithm to reduce the dimensionality of the medical entities;
and constructing the medical prior knowledge graph by taking the medical entities after dimensionality reduction as nodes and the relationships among the medical entities after dimensionality reduction as edges.
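One way claim 2's graph construction could be realized is sketched below; extract_entities is a hypothetical stand-in for a real medical NER model, k-means is only one possible clustering choice, and the co-occurrence edge rule is an assumption.

```python
# Sketch of claim 2: entities -> clustering -> knowledge graph.
import networkx as nx
import numpy as np
from sklearn.cluster import KMeans

def extract_entities(text):
    # Stub: replace with a real medical NER model in practice.
    return [w for w in text.split() if w.istitle()]

def build_knowledge_graph(reports, embed, n_clusters=20):
    # embed: callable mapping an entity string to a fixed-size vector.
    entities = sorted({e for text in reports for e in extract_entities(text)})
    vectors = np.stack([embed(e) for e in entities])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)

    graph = nx.Graph()
    graph.add_nodes_from(range(n_clusters))              # one node per entity cluster
    for text in reports:                                 # co-occurrence edges (assumed rule)
        found = {labels[entities.index(e)] for e in extract_entities(text)}
        graph.add_edges_from((a, b) for a in found for b in found if a < b)
    return graph
```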
3. The method for generating a medical image report based on multi-modal fusion according to claim 1, wherein the acquiring an initial feature vector of each node in the medical prior knowledge graph comprises:
initializing each node of the medical prior knowledge graph through a word embedding model to obtain the initial feature vector of that node.
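A minimal sketch of claim 3's initialization, assuming node names are mapped through a small illustrative vocabulary into a trainable embedding table (vocabulary and dimension are not fixed by the claim):

```python
# Sketch of claim 3: initialize node features from a word embedding table.
import torch

vocab = {"pneumonia": 0, "effusion": 1, "cardiomegaly": 2}  # illustrative vocabulary
embedding = torch.nn.Embedding(num_embeddings=len(vocab), embedding_dim=256)

node_names = ["pneumonia", "effusion", "cardiomegaly"]
ids = torch.tensor([vocab[n] for n in node_names])
node_feats = embedding(ids)   # initial feature vector per node, shape (3, 256)
```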
4. The method for generating a medical image report based on multi-modal fusion according to claim 1, wherein inputting the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into a graph encoder to obtain a graph embedding vector comprises:
building a graph encoder:
GC(H^(k)) = σ(D^(-1/2)·Ã·D^(-1/2)·H^(k)·W^(k))
H^(k+1) = BN(Dropout(GC(H^(k))))
wherein Ã represents the adjacency matrix of the medical prior knowledge graph, the adjacency matrix being augmented with an edge from each node to itself; X represents the initial feature matrix of the medical prior knowledge graph, obtained by concatenating the initial feature vectors of all nodes in the medical prior knowledge graph, and serves as the input of the first layer, H^(0) = X; H^(k) represents the graph convolution feature vector of the k-th layer; H^(k+1) represents the graph convolution feature vector of the (k+1)-th layer; D is the diagonal degree matrix with D_ii = Σ_j Ã_ij, where Ã_ij represents the element in the i-th row and j-th column of the adjacency matrix of the medical prior knowledge graph; W^(k) represents a trainable weight matrix; GC() represents the graph convolution function; σ() represents an activation function; Dropout() represents a random discard function; and BN() represents a batch normalization function;
and inputting the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into the graph encoder, and taking the graph convolution feature vector of the last layer as the graph embedding vector.
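Under the assumption that the two formulas above follow the standard symmetrically normalized graph convolution, one layer of such a graph encoder might look like this in PyTorch (layer sizes and the ReLU activation are illustrative):

```python
# Sketch of claim 4: a GCN-style layer with the normalization, dropout and
# batch normalization the claim describes.
import torch

class GraphEncoderLayer(torch.nn.Module):
    def __init__(self, in_dim, out_dim, p_drop=0.1):
        super().__init__()
        self.weight = torch.nn.Linear(in_dim, out_dim, bias=False)  # W^(k)
        self.bn = torch.nn.BatchNorm1d(out_dim)                     # BN()
        self.dropout = torch.nn.Dropout(p_drop)                     # Dropout()

    def forward(self, adj, h):
        a_tilde = adj + torch.eye(adj.size(0))        # add self-loop edges
        d = a_tilde.sum(dim=1)                        # degrees D_ii = sum_j A_ij
        d_inv_sqrt = torch.diag(d.pow(-0.5))
        norm_adj = d_inv_sqrt @ a_tilde @ d_inv_sqrt  # D^(-1/2) Ã D^(-1/2)
        gc = torch.relu(norm_adj @ self.weight(h))    # GC(H^(k)) with σ = ReLU
        return self.bn(self.dropout(gc))              # H^(k+1)
```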
5. The method for generating a medical image report based on multi-modal fusion according to claim 1, wherein the inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence comprises:
inputting the medical image into an image encoder that does not include a linear layer to obtain a four-dimensional visual feature matrix;
reshaping the four-dimensional visual feature matrix into a three-dimensional visual feature matrix;
and converting the three-dimensional visual feature matrix into a visual feature sequence.
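An illustrative realization of claim 5 using a ResNet backbone with its average-pooling and final linear layers removed; the choice of ResNet-101 and the 224-pixel input are assumptions, not fixed by the claim:

```python
# Sketch of claim 5: image -> 4D feature map -> 3D matrix -> feature sequence.
import torch
import torchvision

backbone = torchvision.models.resnet101(weights=None)
encoder = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc (linear layer)

image = torch.randn(1, 3, 224, 224)      # dummy medical image
feat4d = encoder(image)                  # (B, C, H', W') = (1, 2048, 7, 7)
b, c, h, w = feat4d.shape
feat3d = feat4d.view(b, c, h * w)        # three-dimensional matrix (B, C, R)
visual_seq = feat3d.permute(0, 2, 1)     # sequence of R = H' * W' patch features
```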
6. The multi-modal fusion-based medical image report generation method according to claim 1, wherein the calculating an affinity matrix between the graph embedding vector and the visual feature sequence comprises:
calculating the affinity matrix between the graph embedding vector and the visual feature sequence by the following expression:
C = tanh(G_E^T·W_b·I_E)
wherein C represents the affinity matrix, G_E represents the graph embedding vector, I_E represents the visual feature sequence, and W_b represents a trainable weight matrix.
7. The method of claim 1, wherein the learning, through the affinity matrix, an attention map between the graph embedding vector and the visual feature sequence comprises:
learning the attention map between the graph embedding vector and the visual feature sequence by the following expression:
F_i = tanh(W_i·I_E + (W_g·G_E)·C)
wherein F_i represents the output result of learning on the graph embedding vector and the visual feature sequence by means of the affinity matrix, W_i and W_g both represent trainable weight matrices, C represents the affinity matrix, G_E represents the graph embedding vector, and I_E represents the visual feature sequence.
8. The multi-modal fusion-based medical image report generation method according to claim 1, wherein the calculating an attention weight vector based on the attention map comprises:
calculating the attention weight vector by the following expression:
a_i = softmax(w_fi^T·F_i)
wherein a_i represents the attention weight vector, w_fi represents a trainable weight matrix, and F_i represents the attention map result.
9. The multi-modal fusion-based medical image report generation method according to claim 1, wherein the calculating the attention re-weighted image sequence based on the visual feature sequence and the attention weight vector comprises:
calculating the attention re-weighted image sequence by the following expression:
x̂_r = a_r·x_r, r = 1, 2, …, R
wherein x̂ = [x̂_1; x̂_2; …; x̂_R] represents the attention re-weighted image sequence, x_r represents an element of the visual feature sequence, x̂_r represents the corresponding element of the attention re-weighted image sequence, a_r represents the element of the attention weight vector corresponding to that element of the visual feature sequence, and R = H′ × W′ represents the number of image blocks.
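Claims 6 to 9 together describe one cooperative attention pass. A compact sketch, assuming the affinity and weight formulas as reconstructed above and single (unbatched) inputs:

```python
# Sketch of claims 6-9: affinity matrix -> attention map -> weights -> re-weighting.
import torch

class CoAttention(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w_b = torch.nn.Parameter(torch.randn(dim, dim) * 0.02)  # W_b
        self.w_i = torch.nn.Linear(dim, dim, bias=False)             # W_i
        self.w_g = torch.nn.Linear(dim, dim, bias=False)             # W_g
        self.w_f = torch.nn.Linear(dim, 1, bias=False)               # w_fi

    def forward(self, g_e, i_e):
        # g_e: (N, dim) graph embeddings; i_e: (R, dim) visual feature sequence
        c = torch.tanh(g_e @ self.w_b @ i_e.T)               # affinity matrix C: (N, R)
        f = torch.tanh(self.w_i(i_e) + c.T @ self.w_g(g_e))  # attention map F_i: (R, dim)
        a = torch.softmax(self.w_f(f).squeeze(-1), dim=0)    # attention weights a: (R,)
        return a.unsqueeze(-1) * i_e                         # x̂_r = a_r * x_r
```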
10. The multi-modal fusion-based medical image report generation method according to claim 1, wherein the memory-driven Transformer model comprises an encoder and a decoder, the decoder comprising a decoding module provided with a memory-driven normalization layer.
11. The method of claim 10, wherein the inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report comprises:
inputting the attention re-weighted image sequence into the encoder;
initializing the relational memory by adopting the graph embedding vector;
calculating a memory matrix of the last round of output of the relational memory;
and inputting the memory matrix output by the last round of the relational memory and the output result of the encoder into the decoding module provided with the memory-driven normalization layer to obtain a medical image report.
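A schematic of claim 11's generation loop; the interfaces (init_from_graph, step, the word embedding lookup, and the greedy token choice) are illustrative assumptions about how the pieces sketched around claims 12 to 14 could be wired together:

```python
# Sketch of claim 11: encode, initialize memory from the graph embedding,
# roll the relational memory forward, and decode one token at a time.
import torch

def generate_report(encoder, rel_memory, decoder, embed, x_hat, g_e, bos_id, max_len=60):
    enc_out = encoder(x_hat)                  # encode the attention re-weighted sequence
    memory = rel_memory.init_from_graph(g_e)  # M_0 initialized with the graph embedding
    tokens = [bos_id]
    for _ in range(max_len):
        y_prev = embed(torch.tensor([tokens[-1]]))  # word embedding of last prediction
        memory = rel_memory.step(memory, y_prev)    # M_t from M_{t-1} and y_{t-1}
        logits = decoder(enc_out, tokens, memory)   # MCLN inside the decoding module
        tokens.append(int(logits[-1].argmax()))     # greedy choice of the next token
    return tokens
```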
12. The method for generating a medical image report based on multi-modal fusion according to claim 11, wherein initializing the relational memory using the graph embedding vector includes:
initializing the relational memory by the following expression:
M_0 = MLP(G_E·W_m)
wherein M_0 represents the initial memory matrix of the relational memory, G_E represents the graph embedding vector, W_m represents a weight matrix, and MLP() represents a multi-layer perceptron used to create a mapping between dimensions.
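A minimal sketch of this initialization, assuming the node embeddings are mean-pooled into a single graph vector and that the memory slot count and width are free hyperparameters:

```python
# Sketch of claim 12: initialize the relational memory from the graph embedding.
import torch

class MemoryInit(torch.nn.Module):
    def __init__(self, graph_dim, mem_slots, mem_dim):
        super().__init__()
        self.w_m = torch.nn.Parameter(torch.randn(graph_dim, mem_slots * mem_dim) * 0.02)  # W_m
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(mem_slots * mem_dim, mem_slots * mem_dim), torch.nn.ReLU(),
            torch.nn.Linear(mem_slots * mem_dim, mem_slots * mem_dim))
        self.shape = (mem_slots, mem_dim)

    def forward(self, g_e):
        pooled = g_e.mean(dim=0)          # pool node embeddings to one graph vector
        m0 = self.mlp(pooled @ self.w_m)  # M_0 = MLP(G_E · W_m)
        return m0.view(*self.shape)       # memory matrix (slots, dim)
```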
13. The method for generating a medical image report based on multi-modal fusion according to claim 11, wherein the calculating the memory matrix of the last round of output of the relational memory includes:
calculating the memory matrix of the last round of output of the relational memory by the following expressions:
Z = MLP(softmax(Q·K^T / √d_k)·V)
G^f = σ(y_{t-1}·W^f + tanh(M_{t-1})·U^f), G^o = σ(y_{t-1}·W^o + tanh(M_{t-1})·U^o)
M_t = G^f ⊙ M_{t-1} + G^o ⊙ tanh(Z)
wherein Q = M_{t-1}·W_Q, K = [M_{t-1}; y_{t-1}]·W_K, V = [M_{t-1}; y_{t-1}]·W_V; M_{t-1} represents the memory matrix output by the previous round of the relational memory; y_{t-1} represents the word embedding vector predicted by the previous round of the relational memory; W_Q, W_K, W_V, W^f, U^f, W^o and U^o are all trainable weight matrices; d_k represents the scaling factor, obtained by dividing the dimension of K by the number of attention heads; MLP() represents the multi-layer perceptron; G^f and G^o are the forget gate and the output gate used to balance M_{t-1} and y_{t-1}; Z represents the multi-head attention output matrix mapped by the multi-layer perceptron; and M_t represents the memory matrix of the last round of output of the relational memory.
14. The method for generating a medical image report based on multi-modal fusion according to claim 13, wherein the inputting the memory matrix output by the last round of the relational memory and the output result of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report comprises:
calculating the output result of the decoding module provided with the memory-driven normalization layer by the following expressions:
γ_t = γ + MLP(M_t)
β_t = β + MLP(M_t)
MCLN(r, M_t) = γ_t ⊙ (r − μ) / ν + β_t
θ = T_D(ψ, N, RM(M_{t-1}, y_{t-1}), MCLN(r, M_t))
wherein ψ represents the output of the encoder; N represents the number of decoder layers; γ represents a matrix of learnable scaling parameters used to improve generalization ability, and γ_t represents the result of adding γ to M_t mapped by a multi-layer perceptron; β represents a matrix of learnable shift parameters used to improve generalization ability, and β_t represents the result of adding β to M_t mapped by a multi-layer perceptron; μ represents the mean value of γ, and ν represents the standard deviation of γ; r represents the input of the normalization layer; T_E() represents the encoder, T_D() represents the decoder, and RM() represents the relational memory.
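A sketch of the memory-driven conditional normalization; here μ and ν are computed as the statistics of the input r, which is an interpretive assumption about the claim's normalization step:

```python
# Sketch of claim 14: memory-driven conditional layer normalization (MCLN).
import torch

class MCLN(torch.nn.Module):
    def __init__(self, dim, mem_dim):
        super().__init__()
        self.gamma = torch.nn.Parameter(torch.ones(dim))   # learnable scaling γ
        self.beta = torch.nn.Parameter(torch.zeros(dim))   # learnable shift β
        self.mlp_g = torch.nn.Linear(mem_dim, dim)         # MLP(M_t) for γ_t
        self.mlp_b = torch.nn.Linear(mem_dim, dim)         # MLP(M_t) for β_t

    def forward(self, r, m_t):
        # r: (T, dim) decoder hidden states; m_t: (mem_dim,) flattened memory
        gamma_t = self.gamma + self.mlp_g(m_t)     # γ_t = γ + MLP(M_t)
        beta_t = self.beta + self.mlp_b(m_t)       # β_t = β + MLP(M_t)
        mu = r.mean(dim=-1, keepdim=True)          # normalization mean μ
        nu = r.std(dim=-1, keepdim=True) + 1e-6    # normalization std ν
        return gamma_t * (r - mu) / nu + beta_t    # MCLN(r, M_t)
```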
15. A medical image report generating device based on multi-modal fusion, comprising:
the graph construction module is used for constructing a medical prior knowledge graph and acquiring an initial feature vector of each node in the medical prior knowledge graph;
the graph encoder module is used for inputting the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into a graph encoder to obtain a graph embedding vector;
the image encoder module is used for inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
the multi-modal fusion module is used for adopting a cooperative attention mechanism to perform multi-modal fusion on the graph embedding vector and the visual feature sequence to obtain an attention re-weighted image sequence;
the report generation module is used for inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report;
wherein the multi-modal fusion module is specifically configured to:
calculate an affinity matrix between the graph embedding vector and the visual feature sequence;
learn, through the affinity matrix, an attention map between the graph embedding vector and the visual feature sequence;
calculate an attention weight vector based on the attention map;
and calculate the attention re-weighted image sequence based on the visual feature sequence and the attention weight vector.
16. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the medical image report generation method based on multi-modal fusion according to any one of claims 1 to 14.
17. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the medical image report generation method based on multi-modal fusion according to any one of claims 1 to 14.
18. A computer program product comprising a computer program which, when executed by a processor, implements the medical image report generation method based on multi-modal fusion according to any one of claims 1 to 14.
CN202210836966.3A 2022-07-15 2022-07-15 Medical image report generation method and device based on multi-mode fusion Active CN115331769B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210836966.3A CN115331769B (en) 2022-07-15 2022-07-15 Medical image report generation method and device based on multi-mode fusion


Publications (2)

Publication Number Publication Date
CN115331769A (en) 2022-11-11
CN115331769B (en) 2023-05-09

Family

ID=83917479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210836966.3A Active CN115331769B (en) 2022-07-15 2022-07-15 Medical image report generation method and device based on multi-mode fusion

Country Status (1)

Country Link
CN (1) CN115331769B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115937689B (en) * 2022-12-30 2023-08-11 安徽农业大学 Intelligent identification and monitoring technology for agricultural pests
CN116028654B (en) * 2023-03-30 2023-06-13 中电科大数据研究院有限公司 Multi-mode fusion updating method for knowledge nodes
CN117010494B (en) * 2023-09-27 2024-01-05 之江实验室 Medical data generation method and system based on causal expression learning
CN117649917A (en) * 2024-01-29 2024-03-05 北京大学 Training method and device for test report generation model and test report generation method
CN117993500A (en) * 2024-04-07 2024-05-07 江西为易科技有限公司 Medical teaching data management method and system based on artificial intelligence

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180060512A1 (en) * 2016-08-29 2018-03-01 Jeffrey Sorenson System and method for medical imaging informatics peer review system
CN112992308B (en) * 2021-03-25 2023-05-16 腾讯科技(深圳)有限公司 Training method of medical image report generation model and image report generation method
CN113724359A (en) * 2021-07-14 2021-11-30 鹏城实验室 CT report generation method based on Transformer
CN114724670A (en) * 2022-06-02 2022-07-08 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Medical report generation method and device, storage medium and electronic equipment


Similar Documents

Publication Publication Date Title
CN115331769B (en) Medical image report generation method and device based on multi-mode fusion
CN109545302A (en) A kind of semantic-based medical image report template generation method
CN110750959A (en) Text information processing method, model training method and related device
CN111316281A (en) Semantic classification of numerical data in natural language context based on machine learning
CN112561064B (en) Knowledge base completion method based on OWKBC model
WO2022052530A1 (en) Method and apparatus for training face correction model, electronic device, and storage medium
CN111881926A (en) Image generation method, image generation model training method, image generation device, image generation equipment and image generation medium
CN115132313A (en) Automatic generation method of medical image report based on attention mechanism
US11430123B2 (en) Sampling latent variables to generate multiple segmentations of an image
CN112052889B (en) Laryngoscope image recognition method based on double-gating recursion unit decoding
CN112530584A (en) Medical diagnosis assisting method and system
CN113724359A (en) CT report generation method based on Transformer
CN116129141A (en) Medical data processing method, apparatus, device, medium and computer program product
CN116563537A (en) Semi-supervised learning method and device based on model framework
CN116385937A (en) Method and system for solving video question and answer based on multi-granularity cross-mode interaction framework
CN112560454B (en) Bilingual image subtitle generating method, bilingual image subtitle generating system, storage medium and computer device
CN115190999A (en) Classifying data outside of a distribution using contrast loss
CN116486465A (en) Image recognition method and system for face structure analysis
CN116258928A (en) Pre-training method based on self-supervision information of unlabeled medical image
CN116994695A (en) Training method, device, equipment and storage medium of report generation model
CN115662565A (en) Medical image report generation method and equipment integrating label information
CN115762721A (en) Medical image quality control method and system based on computer vision technology
CN115239740A (en) GT-UNet-based full-center segmentation algorithm
CN114139531A (en) Medical entity prediction method and system based on deep learning
Souza et al. Automatic recognition of continuous signing of brazilian sign language for medical interview

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant