CN115331769B - Medical image report generation method and device based on multi-modal fusion


Info

Publication number
CN115331769B
CN115331769B
Authority
CN
China
Prior art keywords
medical
graph
representing
attention
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210836966.3A
Other languages
Chinese (zh)
Other versions
CN115331769A (en)
Inventor
黄雨
李航
徐德轩
金芝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Peking University First Hospital
Original Assignee
Peking University
Peking University First Hospital
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University and Peking University First Hospital
Priority to CN202210836966.3A
Publication of CN115331769A
Application granted
Publication of CN115331769B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H15/00 - ICT specially adapted for medical reports, e.g. generation or transmission thereof
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA], independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00 - ICT specially adapted for the handling or processing of medical images

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Public Health (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The invention provides a medical image report generation method and device based on multi-modal fusion. The method comprises the following steps: constructing a medical prior knowledge graph and acquiring an initial feature vector for each node in the graph; inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector; inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence; performing multi-modal fusion of the graph embedding vector and the visual feature sequence with a co-attention mechanism to obtain an attention re-weighted image sequence; and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report. The invention can improve the accuracy and reliability of medical image report generation.

Description

Medical image report generation method and device based on multi-modal fusion
Technical Field
The invention relates to the technical field at the intersection of medicine and artificial intelligence, and in particular to a medical image report generation method and device based on multi-modal fusion.
Background
In recent years, medical image report generation has become an important direction of collaborative research between computer scientists and medical professionals. Accurate and efficient medical image reports can greatly improve doctors' grasp of a patient's condition, reduce their workload, assist them in making correct diagnoses, and provide patients with corresponding medical guidance and advice.
At present, research on medical image report generation is still at an early stage. In existing schemes, the medical knowledge graph is generally used only for subtasks such as classification and is not integrated into the generation model, so the accuracy and reliability of the generated reports are limited.
Disclosure of Invention
The invention provides a medical image report generation method and device based on multi-modal fusion, which address the defect of the prior art that medical knowledge graphs are used only for subtasks such as classification rather than being fused into the generation model, thereby improving the accuracy and reliability of medical image report generation.
The invention provides a medical image report generation method based on multi-modal fusion, which comprises the following steps:
constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
inputting the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
performing multi-modal fusion of the graph embedding vector and the visual feature sequence with a co-attention mechanism to obtain an attention re-weighted image sequence;
and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
According to the medical image report generation method based on multi-modal fusion provided by the invention, constructing the medical prior knowledge graph includes:
acquiring a plurality of unlabeled medical image report texts;
extracting a plurality of medical entities from the unlabeled medical image report texts with a named entity recognition algorithm;
reducing the dimension of the medical entities with a clustering algorithm;
and constructing the medical prior knowledge graph with the dimension-reduced medical entities as nodes and the relationships among them as edges.
According to the medical image report generation method based on multi-modal fusion provided by the invention, acquiring the initial feature vector of each node in the medical prior knowledge graph includes:
initializing each node of the medical prior knowledge graph through a word embedding model to obtain the node's initial feature vector.
According to the medical image report generation method based on multi-modal fusion provided by the invention, inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector includes:
constructing a graph encoder:
GC(X^(k)) = σ( D̃^(-1/2) · Ã · D̃^(-1/2) · X^(k) · W^(k) )
X^(k+1) = BN( Dropout( GC(X^(k)) ) ) + X^(k)
where Ã denotes the adjacency matrix of the medical prior knowledge graph, with an edge added from every node to itself; X^(0) denotes the initial feature matrix of the medical prior knowledge graph, obtained by stacking the initial feature vectors of all nodes; X^(k) denotes the graph convolution feature matrix of the k-th layer and X^(k+1) that of the (k+1)-th layer; D̃ is the degree matrix with D̃_ii = Σ_j Ã_ij, so that D̃^(-1/2) · Ã · D̃^(-1/2) normalizes the aggregated node features; Ã_ij denotes the element in the i-th row and j-th column of the adjacency matrix; W^(k) denotes a trainable weight matrix; GC(·) denotes the graph convolution function, σ(·) the activation function, Dropout(·) the random-discard function, and BN(·) the batch normalization function;
and inputting the medical prior knowledge graph and the initial feature vector of each node into the graph encoder, taking the graph convolution feature matrix of the last layer as the graph embedding vector.
According to the medical image report generation method based on multi-modal fusion provided by the invention, inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence includes:
inputting the medical image into an image encoder that does not include a linear layer to obtain a four-dimensional visual feature matrix;
reshaping the four-dimensional visual feature matrix into a three-dimensional visual feature matrix;
and converting the three-dimensional visual feature matrix into a visual feature sequence.
According to the medical image report generation method based on multi-modal fusion provided by the invention, performing multi-modal fusion of the graph embedding vector and the visual feature sequence with a co-attention mechanism to obtain an attention re-weighted image sequence includes:
calculating an affinity matrix between the graph embedding vector and the visual feature sequence;
learning an attention map between the graph embedding vector and the visual feature sequence through the affinity matrix;
calculating an attention weight vector based on the attention map;
and calculating the attention re-weighted image sequence based on the visual feature sequence and the attention weight vector.
According to the medical image report generation method based on multi-modal fusion provided by the invention, calculating the affinity matrix between the graph embedding vector and the visual feature sequence includes:
calculating the affinity matrix by the following expression:
C = tanh( G_E^T · W_b · I_E )
where C denotes the affinity matrix, G_E the graph embedding vector, I_E the visual feature sequence, and W_b a trainable weight matrix.
According to the medical image report generation method based on multi-modal fusion provided by the invention, learning the attention map between the graph embedding vector and the visual feature sequence through the affinity matrix includes:
learning the attention map by the following expression:
F_i = tanh( W_i · I_E + (W_g · G_E) · C )
where F_i denotes the attention map learned from the graph embedding vector and the visual feature sequence through the affinity matrix, W_i and W_g both denote trainable weight matrices, C denotes the affinity matrix, G_E the graph embedding vector, and I_E the visual feature sequence.
According to the medical image report generation method based on multi-modal fusion provided by the invention, calculating the attention weight vector based on the attention map includes:
calculating the attention weight vector by the following expression:
a_i = softmax( w_fi · F_i )
where a_i denotes the attention weight vector, w_fi denotes a trainable weight matrix, and F_i denotes the attention map.
According to the medical image report generation method based on multi-modal fusion provided by the invention, calculating the attention re-weighted image sequence based on the visual feature sequence and the attention weight vector includes:
calculating the attention re-weighted image sequence by the following expression:
x̂_r = â_r · x_r, r = 1, 2, …, R
where x̂_r denotes an element of the attention re-weighted image sequence, x_r denotes an element of the visual feature sequence, â_r denotes the element of the attention weight vector corresponding to x_r, and R = H′ × W′ denotes the number of image blocks.
According to the medical image report generation method based on multi-modal fusion provided by the invention, the memory-driven Transformer model includes an encoder and a decoder, the decoder including a decoding module provided with a memory-driven normalization layer.
According to the medical image report generation method based on multi-modal fusion provided by the invention, inputting the attention re-weighted image sequence into the memory-driven Transformer model to generate the medical image report includes:
inputting the attention re-weighted image sequence into the encoder;
initializing the relational memory with the graph embedding vector;
calculating the memory matrix of the last round of output of the relational memory;
and inputting the memory matrix output by the last round of the relational memory and the output of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report.
According to the medical image report generation method based on multi-modal fusion provided by the invention, initializing the relational memory with the graph embedding vector includes:
initializing the relational memory by the following expression:
M_0 = MLP( G_E · W_m )
where M_0 denotes the initial memory matrix of the relational memory, G_E denotes the graph embedding vector, W_m denotes a weight matrix, and MLP(·) denotes a multi-layer perceptron used to build the mapping between dimensions.
According to the medical image report generation method based on multi-modal fusion provided by the invention, calculating the memory matrix of the last round of output of the relational memory includes:
calculating the memory matrix of the last round of output of the relational memory by the following expressions:
Z = MLP( softmax( Q · K^T / √d_k ) · V )
G^f = σ( [M_(t-1); y_(t-1)] · W^f ), G^o = σ( [M_(t-1); y_(t-1)] · W^o )
M_t = G^f ⊙ M_(t-1) + G^o ⊙ tanh(Z)
where Q = M_(t-1) · W_Q, K = [M_(t-1); y_(t-1)] · W_K, and V = [M_(t-1); y_(t-1)] · W_V; M_(t-1) denotes the memory matrix output by the previous round of the relational memory; y_(t-1) denotes the word embedding vector predicted in the previous round; W_Q, W_K, W_V, W^f, and W^o are trainable weight matrices; d_k denotes the scaling factor, obtained by dividing the dimension of K by the number of attention heads; MLP(·) denotes the multi-layer perceptron; σ(·) denotes the sigmoid function; G^f and G^o denote the forget gate and the output gate used to balance M_(t-1) and y_(t-1); Z denotes the multi-head attention output matrix mapped by the multi-layer perceptron; and M_t denotes the memory matrix output by the last round of the relational memory.
According to the medical image report generation method based on multi-modal fusion provided by the invention, inputting the memory matrix output by the last round of the relational memory and the output of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report includes:
calculating the output of the decoding module provided with the memory-driven normalization layer by the following expressions:
γ_t = γ + MLP(M_t)
β_t = β + MLP(M_t)
MCLN(r, M_t) = γ_t ⊙ (r − μ) / ν + β_t
θ = T_D( ψ, N, RM(M_(t-1), y_(t-1)), MCLN(r, M_t) )
where ψ denotes the output of the encoder; N denotes the number of decoder layers; γ denotes a learnable scaling parameter matrix used to improve generalization ability, and γ_t denotes the sum of γ and M_t mapped by the multi-layer perceptron; β denotes a learnable shift parameter matrix used to improve generalization ability, and β_t denotes the sum of β and M_t mapped by the multi-layer perceptron; r denotes the input of the normalization layer, with μ and ν its mean and standard deviation; T_E(·) denotes the encoder, T_D(·) denotes the decoder, and RM(·) denotes the relational memory.
The invention also provides a medical image report generation device based on multi-modal fusion, which comprises:
the graph construction module, used to construct a medical prior knowledge graph and acquire an initial feature vector of each node in the medical prior knowledge graph;
the graph encoder module, used to input the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
the image encoder module, used to input the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
the multi-modal fusion module, used to perform multi-modal fusion of the graph embedding vector and the visual feature sequence with a co-attention mechanism to obtain an attention re-weighted image sequence;
and the report generation module, used to input the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
The invention also provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the medical image report generation method based on multi-modal fusion.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the medical image report generation method based on multi-modal fusion as described in any one of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the medical image report generation method based on multi-modal fusion as described in any one of the above.
The invention provides a medical image report generation method and device based on multi-modal fusion. First, a medical prior knowledge graph is constructed and an initial feature vector is acquired for each of its nodes. Then, the medical prior knowledge graph and the initial feature vectors are input into a graph encoder to obtain a graph embedding vector; the medical image is input into an image encoder that does not include a linear layer to obtain a visual feature sequence; and a co-attention mechanism performs multi-modal fusion of the graph embedding vector and the visual feature sequence to obtain an attention re-weighted image sequence, which fuses the medical prior knowledge graph and the medical image. Finally, the attention re-weighted image sequence is input into a memory-driven Transformer model to generate a medical image report. Because the attention re-weighted image sequence fuses the medical prior knowledge graph and the medical image, the memory-driven Transformer model understands medical prior knowledge better and more robustly, which improves the accuracy and reliability of medical image report generation.
Drawings
To illustrate the technical solutions of the invention or the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the invention; other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic flow chart of a medical image report generating method based on multi-modal fusion;
FIG. 2 is a schematic structural diagram of a medical image report generation model based on a medical prior knowledge graph and memory driving provided by the invention;
FIG. 3 is a schematic diagram of constructing a medical prior knowledge graph provided by the invention;
FIG. 4 is a schematic structural diagram of a medical image report generating device based on multi-modal fusion;
fig. 5 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The medical image report generation method based on the multi-modal fusion of the present invention is described below with reference to fig. 1 to 3.
Referring to fig. 1, fig. 1 is a schematic flow chart of the medical image report generation method based on multi-modal fusion provided by the invention. As shown in fig. 1, the method may include the following steps:
step 101, constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
step 102, inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
step 103, inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
step 104, performing multi-modal fusion of the graph embedding vector and the visual feature sequence with a co-attention mechanism to obtain an attention re-weighted image sequence;
step 105, inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
In step 101, a medical prior knowledge graph of appropriate scale is constructed with natural language processing methods, rather than built manually from a few selected keywords. The medical prior knowledge graph of this embodiment has the following characteristics:
1) Comprehensive entity types
The medical prior knowledge graph contains comprehensive entity types and can describe a disease from all aspects, not just by its name. For example, for skin diseases, it may include entity information such as disease name, location, shape, and color.
2) Appropriate graph scale
The scale of the medical prior knowledge graph is kept appropriate: if the graph is too large, it is difficult to train and learn; if it is too small, it cannot retain enough prior knowledge.
3) Comprehensive entity relationships
Relationships between entities can be established automatically by a relation extraction method and supplemented manually, so that the medical prior knowledge graph embodies more prior knowledge.
In this step, to support the subsequent steps, each node in the medical prior knowledge graph is initialized to obtain its initial feature vector. Optionally, each node is initialized through a word embedding model.
In step 102, a graph encoder is used to extract the graph embedding vector of the medical prior knowledge graph. The medical prior knowledge graph and the initial feature vector of each node are input into the graph encoder, and the graph embedding vector is obtained through graph encoding.
In step 103, a convolutional neural network derivative is employed as the visual extractor of the image encoder, for example the residual convolutional network ResNet or the densely connected convolutional network DenseNet. The image encoder used in this embodiment does not include the final linear layer and outputs the result of the pooling layer.
In step 104, the image encoder can obtain the visual features of the medical image but cannot capture high-level semantic information well. This embodiment adopts a co-attention mechanism to perform multi-modal fusion of the graph embedding vector and the visual feature sequence, simulating a visual question-answering process, and finally obtains the attention re-weighted image sequence.
In step 105, as shown in fig. 2, the memory-driven Transformer model optionally includes an encoder and a decoder, the decoder including a decoding module provided with a memory-driven normalization layer. The memory-driven normalization layer consists of three Memory-driven Conditional Layer Normalization (MCLN) layers, used to enhance the decoding capability and generalization of the memory-driven Transformer model.
The attention re-weighted image sequence is input into the memory-driven Transformer model to generate the medical image report.
In this embodiment, because the attention re-weighted image sequence fuses the medical prior knowledge graph and the medical image, the memory-driven Transformer model understands medical prior knowledge better and more robustly, which improves the accuracy and reliability of medical image report generation.
Optionally, constructing the medical prior knowledge graph in step 101 includes the following sub-steps:
step 1011, acquiring a plurality of unlabeled medical image report texts;
step 1012, extracting a plurality of medical entities from the unlabeled medical image report texts with a named entity recognition algorithm;
step 1013, reducing the dimension of the medical entities with a clustering algorithm;
step 1014, constructing the medical prior knowledge graph with the dimension-reduced medical entities as nodes and the relationships among them as edges.
In step 1011, several unlabeled medical image report texts are acquired, such as: a large number of unannotated reports, training data sets, available text information provided by the physician, and the like. Different types of reports may be selected as the underlying data for different tasks.
In step 1012, using a named entity recognition algorithm, a number of medical entities may be effectively extracted from a number of unlabeled medical image report texts, which are stored as key nodes of a medical prior knowledge graph.
In step 1013, the nodes identified by named entity recognition may contain a large amount of similar content; keeping all of them would produce many redundant structures and make the medical prior knowledge graph too large. Text processing methods and a clustering algorithm are therefore used to reduce the dimension of the medical entities.
In step 1014, relationships between the medical entities are established by relation extraction and supplemented by manual design. The medical prior knowledge graph is then constructed with the dimension-reduced medical entities as nodes and the relationships among them as edges.
In this way, a medical prior knowledge graph of appropriate scale can be constructed, rather than one built manually from a few selected keywords.
Specifically, as shown in fig. 3, two datasets are used to construct medical prior knowledge graphs. Because the two datasets differ in language, their construction also differs somewhat. For the IU-Xray dataset, the Stanza biomedical pipeline is used as the backbone for named entity recognition and relation extraction, and clustering finally yields a medical prior knowledge graph with 284 key nodes. Each node obtains a 768-dimensional feature vector through BioBERT, which serves as the initial feature of the medical prior knowledge graph. Similarly, for the NCRC-DS dataset, CMeKG is used to extract Chinese medical entities and entity triples. CMeKG is a tool library for Chinese medical knowledge graphs, providing open-source implementations of named entity recognition, relation extraction, medical word segmentation, and so on. After clustering, a knowledge graph containing 191 key nodes is obtained. To obtain the initial node features, the node keywords are input into the Chinese medical BERT model provided by CMeKG, yielding 768-dimensional initial vectors.
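As an illustration of this construction pipeline, the sketch below extracts entities with the Stanza biomedical pipeline, embeds them with BioBERT, and clusters them down to a fixed node set. The Stanza package choice, the BioBERT checkpoint name, the toy input, and the use of KMeans with one representative entity per cluster are assumptions for demonstration, not the exact tooling of the embodiment.

```python
import stanza
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.cluster import KMeans

# Biomedical NER over unlabeled report texts (package choice is illustrative).
nlp = stanza.Pipeline(lang="en", package="mimic", processors={"ner": "i2b2"})
reports = ["The cardiac silhouette is enlarged. No pleural effusion."]  # toy input
ent_list = sorted({ent.text.lower() for doc in map(nlp, reports)
                   for ent in doc.entities})

# 768-dimensional entity embeddings from BioBERT (checkpoint name is an assumption).
tok = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
bert = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1").eval()
with torch.no_grad():
    vecs = torch.cat([bert(**tok(e, return_tensors="pt")).last_hidden_state[:, 0]
                      for e in ent_list])                   # [num_entities, 768]

# Cluster similar entities; keep the entity closest to each centroid as a node.
k = min(284, len(ent_list))                                 # 284 nodes for IU-Xray
km = KMeans(n_clusters=k, n_init=10).fit(vecs.numpy())
centers = torch.tensor(km.cluster_centers_, dtype=torch.float32)
nodes = [ent_list[i] for i in torch.cdist(centers, vecs).argmin(dim=1).tolist()]
```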
Optionally, step 102 includes the following sub-steps:
step 1021, constructing a graph encoder:
GC(X^(k)) = σ( D̃^(-1/2) · Ã · D̃^(-1/2) · X^(k) · W^(k) ) (1)
X^(k+1) = BN( Dropout( GC(X^(k)) ) ) + X^(k) (2)
where Ã denotes the adjacency matrix of the medical prior knowledge graph, with an edge added from every node to itself; X^(0) denotes the initial feature matrix of the medical prior knowledge graph, obtained by stacking the initial feature vectors of all nodes; X^(k) denotes the graph convolution feature matrix of the k-th layer and X^(k+1) that of the (k+1)-th layer; D̃ is the degree matrix with D̃_ii = Σ_j Ã_ij, so that D̃^(-1/2) · Ã · D̃^(-1/2) normalizes the aggregated node features; Ã_ij denotes the element in the i-th row and j-th column of the adjacency matrix; W^(k) denotes a trainable weight matrix; GC(·) denotes the graph convolution function, σ(·) the activation function, Dropout(·) the random-discard function, and BN(·) the batch normalization function;
step 1022, inputting the medical prior knowledge graph and the initial feature vector of each node into the graph encoder, and taking the graph convolution feature matrix of the last layer as the graph embedding vector.
In this embodiment, a random-discard layer, a batch normalization layer, and a residual connection are added between every two graph convolution layers, which improves the expressive power of the graph encoder; the graph embedding vector of the medical prior knowledge graph is then extracted by the graph encoder.
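A minimal PyTorch sketch of such a graph encoder, following equations (1) and (2), might look as follows. The 768-dimensional input (BioBERT node features) and 512-dimensional hidden width come from the embodiment; the input projection layer and the ReLU activation are assumptions.

```python
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    """Graph encoder of Eqs. (1)-(2): stacked graph convolutions with
    dropout, batch normalization and a residual connection between layers."""
    def __init__(self, in_dim=768, hid_dim=512, num_layers=3, p_drop=0.1):
        super().__init__()
        self.proj = nn.Linear(in_dim, hid_dim)  # map BioBERT features to model width
        self.weights = nn.ModuleList([nn.Linear(hid_dim, hid_dim, bias=False)
                                      for _ in range(num_layers)])
        self.bns = nn.ModuleList([nn.BatchNorm1d(hid_dim) for _ in range(num_layers)])
        self.drop = nn.Dropout(p_drop)

    @staticmethod
    def normalize_adj(adj):
        # Ã = A + I (self-loop edges), then D̃^(-1/2) · Ã · D̃^(-1/2)
        a = adj + torch.eye(adj.size(0), device=adj.device)
        d_inv_sqrt = a.sum(dim=1).pow(-0.5)
        return d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)

    def forward(self, x, adj):              # x: [N, in_dim], adj: [N, N] 0/1 edges
        a_hat = self.normalize_adj(adj)
        x = self.proj(x)
        for w, bn in zip(self.weights, self.bns):
            gc = torch.relu(a_hat @ w(x))   # GC(X) = σ(D̃^-1/2 Ã D̃^-1/2 X W)
            x = bn(self.drop(gc)) + x       # X^(k+1) = BN(Dropout(GC(X^(k)))) + X^(k)
        return x                            # graph embedding G_E: [N, hid_dim]
```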
Optionally, step 103 includes the following sub-steps:
step 1031, inputting the medical image imag, of dimension [B, C, H, W], into an image encoder VE(·) that does not include a linear layer to obtain a four-dimensional visual feature matrix VE(imag) of dimension [B, F, H′, W′];
step 1032, reshaping the four-dimensional visual feature matrix into a three-dimensional visual feature matrix:
I_E = reshape(VE(imag)) (3)
where I_E is the three-dimensional visual feature matrix of dimension [B, F, H′ × W′] and reshape(·) denotes the reshaping function;
step 1033, converting the three-dimensional visual feature matrix into the visual feature sequence x_1, x_2, …, x_(H′×W′).
In this embodiment, the medical image is input into an image encoder that does not include a linear layer, and the output is reshaped to obtain the visual feature sequence.
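One plausible realization of this visual extractor is sketched below: a torchvision ResNet-101 with its average-pooling and linear layers stripped, whose last convolutional feature map is flattened into a patch sequence. The backbone and pretrained weights mirror the NCRC-DS embodiment; the flattening order is an assumption.

```python
import torch
import torchvision

class VisualExtractor(torch.nn.Module):
    """CNN backbone without the final linear layer; the last feature
    map [B, F, H', W'] is flattened into a sequence of H'*W' patches."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet101(weights="IMAGENET1K_V1")
        # keep everything up to (and excluding) avgpool and fc
        self.features = torch.nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, image):                  # [B, C, H, W]
        feat = self.features(image)            # [B, F, H', W']
        b, f, h, w = feat.shape
        feat = feat.reshape(b, f, h * w)       # [B, F, H'*W']
        return feat.permute(0, 2, 1)           # sequence: [B, H'*W', F]
```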
Optionally, step 104 includes the following sub-steps:
step 1041, calculating an affinity matrix between the graph embedding vector and the visual feature sequence;
step 1042, learning an attention map between the graph embedding vector and the visual feature sequence through the affinity matrix;
step 1043, calculating an attention weight vector based on the attention map;
step 1044, calculating the attention re-weighted image sequence based on the visual feature sequence and the attention weight vector.
In step 1041, the affinity matrix between the graph embedding vector and the visual feature sequence is calculated by the following expression:
C = tanh( G_E^T · W_b · I_E ) (4)
where C denotes the affinity matrix, G_E the graph embedding vector, I_E the visual feature sequence, and W_b a trainable weight matrix.
In step 1042, the attention map between the graph embedding vector and the visual feature sequence is learned by the following expression:
F_i = tanh( W_i · I_E + (W_g · G_E) · C ) (5)
where F_i denotes the attention map learned through the affinity matrix, W_i and W_g both denote trainable weight matrices, C denotes the affinity matrix, G_E the graph embedding vector, and I_E the visual feature sequence.
In step 1043, the attention weight vector is calculated by the following expression:
a_i = softmax( w_fi · F_i ) (6)
where a_i denotes the attention weight vector, w_fi denotes a trainable weight matrix, and F_i denotes the attention map.
In step 1044, the attention re-weighted image sequence is calculated by the following expression:
x̂_r = â_r · x_r, r = 1, 2, …, R (7)
where x̂_r denotes an element of the attention re-weighted image sequence, x_r denotes an element of the visual feature sequence, â_r denotes the element of the attention weight vector corresponding to x_r, and R = H′ × W′ denotes the number of image blocks.
In this embodiment, a co-attention mechanism performs multi-modal fusion of the graph embedding vector output by the graph encoder and the visual feature sequence output by the image encoder, so that the resulting attention re-weighted image sequence fuses the visual features of the medical image with the high-level semantic information of the medical prior knowledge graph.
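A compact sketch of this co-attention fusion, following equations (4) to (7), is given below; the batched shapes and the single shared feature width for both modalities are simplifying assumptions.

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """Co-attention fusion of the graph embedding G_E [B, N, d] and the
    visual feature sequence I_E [B, R, d], following Eqs. (4)-(7)."""
    def __init__(self, dim):
        super().__init__()
        self.w_b = nn.Linear(dim, dim, bias=False)   # W_b for the affinity matrix
        self.w_i = nn.Linear(dim, dim, bias=False)   # W_i
        self.w_g = nn.Linear(dim, dim, bias=False)   # W_g
        self.w_f = nn.Linear(dim, 1, bias=False)     # w_f

    def forward(self, g_e, i_e):
        # affinity: C = tanh(G_E^T W_b I_E)  -> [B, N, R]
        c = torch.tanh(self.w_b(g_e) @ i_e.transpose(1, 2))
        # attention map: F = tanh(W_i I_E + C^T (W_g G_E))  -> [B, R, d]
        f = torch.tanh(self.w_i(i_e) + c.transpose(1, 2) @ self.w_g(g_e))
        # attention weights: a = softmax(w_f F)  -> [B, R]
        a = torch.softmax(self.w_f(f).squeeze(-1), dim=-1)
        # re-weighted image sequence: x̂_r = â_r * x_r
        return i_e * a.unsqueeze(-1)
```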
Optionally, step 105 includes the following sub-steps:
step 1051, inputting the attention re-weighted image sequence into the encoder;
step 1052, initializing the relational memory with the graph embedding vector;
step 1053, calculating the memory matrix of the last round of output of the relational memory;
step 1054, inputting the memory matrix output by the last round of the relational memory and the output of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report.
In step 1051, the output of the encoder is calculated by the following expression:
ψ = T_E( x̂_1, x̂_2, …, x̂_R ) (8)
where ψ denotes the output of the encoder and T_E(·) denotes the encoder.
In step 1052, the relational memory is used to store content shared across the model's training results, enhancing the model's learning ability. Specifically, a memory matrix with several rows is provided, each row being regarded as a slot that stores specific pattern information.
The relational memory is initialized by the following expression:
M_0 = MLP( G_E · W_m ) (9)
where M_0 denotes the initial memory matrix of the relational memory, G_E denotes the graph embedding vector, W_m denotes a weight matrix, and MLP(·) denotes a multi-layer perceptron used to build the mapping between dimensions.
In step 1053, the memory matrix of the last round of output of the relational memory is calculated by the following expressions:
Z = MLP( softmax( Q · K^T / √d_k ) · V ) (10)
G^f = σ( [M_(t-1); y_(t-1)] · W^f ), G^o = σ( [M_(t-1); y_(t-1)] · W^o ) (11)
M_t = G^f ⊙ M_(t-1) + G^o ⊙ tanh(Z) (12)
where Q = M_(t-1) · W_Q, K = [M_(t-1); y_(t-1)] · W_K, and V = [M_(t-1); y_(t-1)] · W_V; M_(t-1) denotes the memory matrix output by the previous round of the relational memory; y_(t-1) denotes the word embedding vector predicted in the previous round; W_Q, W_K, W_V, W^f, and W^o are trainable weight matrices; d_k denotes the scaling factor, obtained by dividing the dimension of K by the number of attention heads; MLP(·) denotes the multi-layer perceptron; σ(·) denotes the sigmoid function; G^f and G^o denote the forget gate and the output gate used to balance M_(t-1) and y_(t-1); Z denotes the multi-head attention output matrix mapped by the multi-layer perceptron; and M_t denotes the memory matrix output by the last round of the relational memory.
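The following sketch illustrates the relational memory of equations (9) to (12): the memory is initialized from the graph embedding and updated at each decoding step by gated attention over the previous memory and the previous word embedding. Single-head attention and the exact linear form of the two gates are assumptions made for brevity; the 3 slots and width 512 follow the embodiment.

```python
import torch
import torch.nn as nn

class RelationalMemory(nn.Module):
    """Gated memory of Eqs. (9)-(12), initialized from the graph embedding."""
    def __init__(self, num_slots=3, dim=512):
        super().__init__()
        self.init_mlp = nn.Linear(dim, num_slots * dim)   # M_0 = MLP(G_E · W_m)
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.w_f = nn.Linear(2 * dim, dim)                # forget gate (assumed form)
        self.w_o = nn.Linear(2 * dim, dim)                # output gate (assumed form)
        self.num_slots, self.dim = num_slots, dim

    def init_memory(self, g_e):            # g_e: pooled graph embedding [B, d]
        return self.init_mlp(g_e).view(-1, self.num_slots, self.dim)

    def forward(self, m_prev, y_prev):     # M_{t-1}: [B, S, d], y_{t-1}: [B, 1, d]
        cat = torch.cat([m_prev, y_prev], dim=1)          # [M_{t-1}; y_{t-1}]
        q, k, v = self.w_q(m_prev), self.w_k(cat), self.w_v(cat)
        att = torch.softmax(q @ k.transpose(1, 2) / self.dim ** 0.5, dim=-1)
        z = self.mlp(att @ v)                             # Z = MLP(softmax(QK^T/√d_k)V)
        gate_in = torch.cat([m_prev, y_prev.expand_as(m_prev)], dim=-1)
        g_f = torch.sigmoid(self.w_f(gate_in))            # forget gate G^f
        g_o = torch.sigmoid(self.w_o(gate_in))            # output gate G^o
        return g_f * m_prev + g_o * torch.tanh(z)         # M_t
```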
In step 1054, the output of the decoding module provided with the memory-driven normalization layer is calculated by the following expressions:
γ_t = γ + MLP(M_t) (13)
β_t = β + MLP(M_t) (14)
MCLN(r, M_t) = γ_t ⊙ (r − μ) / ν + β_t (15)
θ = T_D( ψ, N, RM(M_(t-1), y_(t-1)), MCLN(r, M_t) ) (16)
where ψ denotes the output of the encoder; N denotes the number of decoder layers; γ denotes a learnable scaling parameter matrix used to improve generalization ability, and γ_t denotes the sum of γ and M_t mapped by the multi-layer perceptron; β denotes a learnable shift parameter matrix used to improve generalization ability, and β_t denotes the sum of β and M_t mapped by the multi-layer perceptron; r denotes the input of the normalization layer, with μ and ν its mean and standard deviation; T_E(·) denotes the encoder, T_D(·) denotes the decoder, and RM(·) denotes the relational memory.
In this embodiment, the relational memory is initialized with the graph embedding vector rather than with an all-zero initial memory matrix, which optimizes the memory-driven Transformer model. Moreover, the input of the decoder of the memory-driven Transformer model fuses the medical prior knowledge graph and the medical image, so the model understands medical prior knowledge better and more robustly, which improves the accuracy and reliability of medical image report generation.
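A sketch of the memory-driven conditional layer normalization of equations (13) to (15) is shown below; pooling the memory slots by their mean before conditioning is an assumption.

```python
import torch
import torch.nn as nn

class MCLN(nn.Module):
    """Memory-driven Conditional Layer Normalization, Eqs. (13)-(15):
    the learnable scale γ and shift β are perturbed by MLP projections
    of the current memory M_t before standard layer normalization."""
    def __init__(self, dim=512, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # γ
        self.beta = nn.Parameter(torch.zeros(dim))   # β
        self.mlp_g = nn.Linear(dim, dim)             # MLP for γ_t
        self.mlp_b = nn.Linear(dim, dim)             # MLP for β_t
        self.eps = eps

    def forward(self, r, m_t):                       # r: [B, T, d], m_t: [B, S, d]
        m = m_t.mean(dim=1, keepdim=True)            # condense memory slots (assumed)
        gamma_t = self.gamma + self.mlp_g(m)         # γ_t = γ + MLP(M_t)
        beta_t = self.beta + self.mlp_b(m)           # β_t = β + MLP(M_t)
        mu = r.mean(dim=-1, keepdim=True)            # μ
        nu = r.std(dim=-1, keepdim=True)             # ν
        return gamma_t * (r - mu) / (nu + self.eps) + beta_t
```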
Specifically, taking the IU-Xray and NCRC-DS datasets as examples: for IU-Xray, a DenseNet-121 model pre-trained on CheXpert is selected as the backbone of the image encoder. Two chest radiographs belonging to the same report are input into the model, and their features are spliced before being passed to the encoder of the memory-driven Transformer model; the medical prior knowledge graph with 284 nodes serves as the medical prior knowledge. For NCRC-DS, a ResNet-101 model pre-trained on ImageNet is chosen as the backbone of the image encoder. Because this dataset is small, only one skin-disease picture and its corresponding report are input at a time, and the medical prior knowledge graph with 191 nodes serves as the medical prior knowledge. By default, the number of graph convolution layers is 3, the number of relational memory slots is 3, and the word embedding dimension is set to 512. The model is trained with the Adam optimizer under a cross-entropy loss; during training, the BLEU-4 score is evaluated on the test set, and weight decay and early stopping are applied. In fields such as chest radiography and skin disease, the medical image report generation method based on multi-modal fusion of this embodiment outperforms current common medical image report generation models in accuracy and reliability.
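For orientation, a minimal training-loop sketch under the stated setup (Adam with weight decay, cross-entropy loss, BLEU-4 validation, early stopping) could look as follows; `model`, the data loaders, `compute_bleu4`, and all hyperparameter values are assumed placeholders.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-5)
criterion = torch.nn.CrossEntropyLoss(ignore_index=0)        # 0 = padding token

best_bleu4, patience, bad_epochs = 0.0, 10, 0
for epoch in range(100):
    model.train()
    for images, graph, report_ids in train_loader:
        logits = model(images, graph, report_ids[:, :-1])    # teacher forcing
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         report_ids[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    bleu4 = compute_bleu4(model, val_loader)                 # greedy/beam decoding
    if bleu4 > best_bleu4:
        best_bleu4, bad_epochs = bleu4, 0
        torch.save(model.state_dict(), "best.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                           # early stopping
            break
```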
The medical image report generation device based on multi-modal fusion provided by the invention is described below; it and the medical image report generation method described above may be referred to correspondingly.
Referring to fig. 4, fig. 4 is a schematic structural diagram of the medical image report generation device based on multi-modal fusion provided by the invention. As shown in fig. 4, the device may include:
the graph construction module 10, used to construct a medical prior knowledge graph and acquire an initial feature vector of each node in the medical prior knowledge graph;
the graph encoder module 20, used to input the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
the image encoder module 30, used to input the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
the multi-modal fusion module 40, used to perform multi-modal fusion of the graph embedding vector and the visual feature sequence with a co-attention mechanism to obtain an attention re-weighted image sequence;
the report generation module 50, used to input the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
Optionally, the graph construction module 10 is specifically configured to:
acquire a plurality of unlabeled medical image report texts;
extract a plurality of medical entities from the unlabeled medical image report texts with a named entity recognition algorithm;
reduce the dimension of the medical entities with a clustering algorithm;
and construct the medical prior knowledge graph with the dimension-reduced medical entities as nodes and the relationships among them as edges.
Optionally, the graph construction module 10 is specifically configured to:
initialize each node of the medical prior knowledge graph through a word embedding model to obtain the node's initial feature vector.
Optionally, the graph encoder module 20 is specifically configured to:
construct a graph encoder:
GC(X^(k)) = σ( D̃^(-1/2) · Ã · D̃^(-1/2) · X^(k) · W^(k) )
X^(k+1) = BN( Dropout( GC(X^(k)) ) ) + X^(k)
where Ã denotes the adjacency matrix of the medical prior knowledge graph, with an edge added from every node to itself; X^(0) denotes the initial feature matrix of the medical prior knowledge graph, obtained by stacking the initial feature vectors of all nodes; X^(k) denotes the graph convolution feature matrix of the k-th layer and X^(k+1) that of the (k+1)-th layer; D̃ is the degree matrix with D̃_ii = Σ_j Ã_ij, so that D̃^(-1/2) · Ã · D̃^(-1/2) normalizes the aggregated node features; Ã_ij denotes the element in the i-th row and j-th column of the adjacency matrix; W^(k) denotes a trainable weight matrix; GC(·) denotes the graph convolution function, σ(·) the activation function, Dropout(·) the random-discard function, and BN(·) the batch normalization function;
and input the medical prior knowledge graph and the initial feature vector of each node into the graph encoder, taking the graph convolution feature matrix of the last layer as the graph embedding vector.
Optionally, the image encoder module 30 is specifically configured to:
input the medical image into an image encoder that does not include a linear layer to obtain a four-dimensional visual feature matrix;
reshape the four-dimensional visual feature matrix into a three-dimensional visual feature matrix;
and convert the three-dimensional visual feature matrix into a visual feature sequence.
Optionally, the multi-modal fusion module 40 is specifically configured to:
calculate an affinity matrix between the graph embedding vector and the visual feature sequence;
learn an attention map between the graph embedding vector and the visual feature sequence through the affinity matrix;
calculate an attention weight vector based on the attention map;
and calculate the attention re-weighted image sequence based on the visual feature sequence and the attention weight vector.
Optionally, the multi-modal fusion module 40 is specifically configured to:
calculate the affinity matrix between the graph embedding vector and the visual feature sequence by the following expression:
C = tanh( G_E^T · W_b · I_E )
where C denotes the affinity matrix, G_E the graph embedding vector, I_E the visual feature sequence, and W_b a trainable weight matrix.
Optionally, the multi-modal fusion module 40 is specifically configured to:
learn the attention map between the graph embedding vector and the visual feature sequence by the following expression:
F_i = tanh( W_i · I_E + (W_g · G_E) · C )
where F_i denotes the attention map learned through the affinity matrix, W_i and W_g both denote trainable weight matrices, C denotes the affinity matrix, G_E the graph embedding vector, and I_E the visual feature sequence.
Optionally, the multi-modal fusion module 40 is specifically configured to:
calculate the attention weight vector by the following expression:
a_i = softmax( w_fi · F_i )
where a_i denotes the attention weight vector, w_fi denotes a trainable weight matrix, and F_i denotes the attention map.
Optionally, the multi-modal fusion module 40 is specifically configured to:
calculate the attention re-weighted image sequence by the following expression:
x̂_r = â_r · x_r, r = 1, 2, …, R
where x̂_r denotes an element of the attention re-weighted image sequence, x_r denotes an element of the visual feature sequence, â_r denotes the element of the attention weight vector corresponding to x_r, and R = H′ × W′ denotes the number of image blocks.
Optionally, the memory-driven Transformer model includes an encoder and a decoder, the decoder including a decoding module provided with a memory-driven normalization layer.
Optionally, the report generation module 50 is specifically configured to:
input the attention re-weighted image sequence into the encoder;
initialize the relational memory with the graph embedding vector;
calculate the memory matrix of the last round of output of the relational memory;
and input the memory matrix output by the last round of the relational memory and the output of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report.
Optionally, the report generation module 50 is specifically configured to:
initialize the relational memory by the following expression:
M_0 = MLP( G_E · W_m )
where M_0 denotes the initial memory matrix of the relational memory, G_E denotes the graph embedding vector, W_m denotes a weight matrix, and MLP(·) denotes a multi-layer perceptron used to build the mapping between dimensions.
Optionally, the report generation module 50 is specifically configured to:
calculate the memory matrix of the last round of output of the relational memory by the following expressions:
Z = MLP( softmax( Q · K^T / √d_k ) · V )
G^f = σ( [M_(t-1); y_(t-1)] · W^f ), G^o = σ( [M_(t-1); y_(t-1)] · W^o )
M_t = G^f ⊙ M_(t-1) + G^o ⊙ tanh(Z)
where Q = M_(t-1) · W_Q, K = [M_(t-1); y_(t-1)] · W_K, and V = [M_(t-1); y_(t-1)] · W_V; M_(t-1) denotes the memory matrix output by the previous round of the relational memory; y_(t-1) denotes the word embedding vector predicted in the previous round; W_Q, W_K, W_V, W^f, and W^o are trainable weight matrices; d_k denotes the scaling factor, obtained by dividing the dimension of K by the number of attention heads; MLP(·) denotes the multi-layer perceptron; σ(·) denotes the sigmoid function; G^f and G^o denote the forget gate and the output gate used to balance M_(t-1) and y_(t-1); Z denotes the multi-head attention output matrix mapped by the multi-layer perceptron; and M_t denotes the memory matrix output by the last round of the relational memory.
Optionally, the report generation module 50 is specifically configured to:
calculate the output of the decoding module provided with the memory-driven normalization layer by the following expressions:
γ_t = γ + MLP(M_t)
β_t = β + MLP(M_t)
MCLN(r, M_t) = γ_t ⊙ (r − μ) / ν + β_t
θ = T_D( ψ, N, RM(M_(t-1), y_(t-1)), MCLN(r, M_t) )
where ψ denotes the output of the encoder; N denotes the number of decoder layers; γ denotes a learnable scaling parameter matrix used to improve generalization ability, and γ_t denotes the sum of γ and M_t mapped by the multi-layer perceptron; β denotes a learnable shift parameter matrix used to improve generalization ability, and β_t denotes the sum of β and M_t mapped by the multi-layer perceptron; r denotes the input of the normalization layer, with μ and ν its mean and standard deviation; T_E(·) denotes the encoder, T_D(·) denotes the decoder, and RM(·) denotes the relational memory.
Fig. 5 illustrates a physical schematic diagram of an electronic device. As shown in fig. 5, the electronic device may include: a processor 810, a communication interface 820, a memory 830, and a communication bus 840, wherein the processor 810, the communication interface 820, and the memory 830 communicate with each other through the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform the medical image report generation method based on multi-modal fusion, which comprises:
constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
performing multi-modal fusion of the graph embedding vector and the visual feature sequence with a co-attention mechanism to obtain an attention re-weighted image sequence;
and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
Further, the logic instructions in the memory 830 may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the invention also provides a computer program product. The computer program product includes a computer program that can be stored on a non-transitory computer-readable storage medium; when the computer program is executed by a processor, the computer can perform the medical image report generation method based on multi-modal fusion provided by the methods above, which comprises:
constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
performing multi-modal fusion of the graph embedding vector and the visual feature sequence with a co-attention mechanism to obtain an attention re-weighted image sequence;
and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
In yet another aspect, the invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the medical image report generation method based on multi-modal fusion provided by the methods above, which comprises:
constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
inputting the medical prior knowledge graph and the initial feature vector of each node into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
performing multi-modal fusion of the graph embedding vector and the visual feature sequence with a co-attention mechanism to obtain an attention re-weighted image sequence;
and inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or, of course, by hardware alone. Based on this understanding, the foregoing technical solution, or the part of it that contributes to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (18)

1. A medical image report generation method based on multi-modal fusion, comprising:
constructing a medical prior knowledge graph, and acquiring an initial feature vector of each node in the medical prior knowledge graph;
inputting the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into a graph encoder to obtain a graph embedding vector;
inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
adopting a cooperative attention mechanism to perform multi-modal fusion on the graph embedding vector and the visual feature sequence to obtain an attention re-weighted image sequence;
inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report;
wherein the adopting a cooperative attention mechanism to perform multi-modal fusion on the graph embedding vector and the visual feature sequence to obtain an attention re-weighted image sequence comprises:
calculating an affinity matrix between the graph embedding vector and the visual feature sequence;
learning, through the affinity matrix, an attention map between the graph embedding vector and the visual feature sequence;
calculating an attention weight vector based on the attention map;
and calculating the attention re-weighted image sequence based on the visual feature sequence and the attention weight vector.
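The following PyTorch-style sketch shows how the five steps of claim 1 might be chained together; the module names (graph_encoder, image_encoder, co_attention, mem_transformer, passed in as constructor arguments) are illustrative assumptions, not the patentee's code.

```python
# Hypothetical end-to-end sketch of the claim-1 pipeline (assumed module
# names and interfaces; not the patentee's implementation).
import torch

class ReportGenerator(torch.nn.Module):
    def __init__(self, graph_encoder, image_encoder, co_attention, mem_transformer):
        super().__init__()
        self.graph_encoder = graph_encoder      # knowledge graph -> graph embedding G_E
        self.image_encoder = image_encoder      # medical image -> visual feature sequence I_E
        self.co_attention = co_attention        # (G_E, I_E) -> attention re-weighted sequence
        self.mem_transformer = mem_transformer  # re-weighted sequence -> report tokens

    def forward(self, adjacency, node_feats, image):
        g_e = self.graph_encoder(adjacency, node_feats)  # graph embedding vector
        i_e = self.image_encoder(image)                  # visual feature sequence
        x_hat = self.co_attention(g_e, i_e)              # multi-modal fusion
        return self.mem_transformer(x_hat, g_e)          # generated report
```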
2. The method for generating a medical image report based on multi-modal fusion according to claim 1, wherein the constructing a medical prior knowledge graph comprises:
acquiring a plurality of unlabeled medical image report texts;
extracting a plurality of medical entities from the plurality of unlabeled medical image report texts by adopting a named entity recognition algorithm;
adopting a clustering algorithm to reduce the dimensionality of the medical entities;
and constructing the medical prior knowledge graph by taking the medical entities after dimensionality reduction as nodes and the relationships among the medical entities after dimensionality reduction as edges.
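One way claim 2's graph construction could be realized is sketched below; extract_entities is a hypothetical stand-in for a real medical NER model, k-means is only one possible clustering choice, and the co-occurrence edge rule is an assumption.

```python
# Sketch of claim 2: entities -> clustering -> knowledge graph.
import networkx as nx
import numpy as np
from sklearn.cluster import KMeans

def extract_entities(text):
    # Stub: replace with a real medical NER model in practice.
    return [w for w in text.split() if w.istitle()]

def build_knowledge_graph(reports, embed, n_clusters=20):
    # embed: callable mapping an entity string to a fixed-size vector.
    entities = sorted({e for text in reports for e in extract_entities(text)})
    vectors = np.stack([embed(e) for e in entities])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)

    graph = nx.Graph()
    graph.add_nodes_from(range(n_clusters))              # one node per entity cluster
    for text in reports:                                 # co-occurrence edges (assumed rule)
        found = {labels[entities.index(e)] for e in extract_entities(text)}
        graph.add_edges_from((a, b) for a in found for b in found if a < b)
    return graph
```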
3. The method for generating a medical image report based on multi-modal fusion according to claim 1, wherein the acquiring an initial feature vector of each node in the medical prior knowledge graph comprises:
initializing each node of the medical prior knowledge graph through a word embedding model to obtain the initial feature vector of that node.
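A minimal sketch of claim 3's initialization, assuming node names are mapped through a small illustrative vocabulary into a trainable embedding table (vocabulary and dimension are not fixed by the claim):

```python
# Sketch of claim 3: initialize node features from a word embedding table.
import torch

vocab = {"pneumonia": 0, "effusion": 1, "cardiomegaly": 2}  # illustrative vocabulary
embedding = torch.nn.Embedding(num_embeddings=len(vocab), embedding_dim=256)

node_names = ["pneumonia", "effusion", "cardiomegaly"]
ids = torch.tensor([vocab[n] for n in node_names])
node_feats = embedding(ids)   # initial feature vector per node, shape (3, 256)
```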
4. The method for generating a medical image report based on multi-modal fusion according to claim 1, wherein inputting the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into a graph encoder to obtain a graph embedding vector comprises:
building a graph encoder:
GC(H^(k)) = σ(D^(-1/2)·Ã·D^(-1/2)·H^(k)·W^(k))
H^(k+1) = BN(Dropout(GC(H^(k))))
wherein Ã represents the adjacency matrix of the medical prior knowledge graph, the adjacency matrix being augmented with an edge from each node to itself; X represents the initial feature matrix of the medical prior knowledge graph, obtained by concatenating the initial feature vectors of all nodes in the medical prior knowledge graph, and serves as the input of the first layer, H^(0) = X; H^(k) represents the graph convolution feature vector of the k-th layer; H^(k+1) represents the graph convolution feature vector of the (k+1)-th layer; D is the diagonal degree matrix with D_ii = Σ_j Ã_ij, where Ã_ij represents the element in the i-th row and j-th column of the adjacency matrix of the medical prior knowledge graph; W^(k) represents a trainable weight matrix; GC() represents the graph convolution function; σ() represents an activation function; Dropout() represents a random discard function; and BN() represents a batch normalization function;
and inputting the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into the graph encoder, and taking the graph convolution feature vector of the last layer as the graph embedding vector.
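Under the assumption that the two formulas above follow the standard symmetrically normalized graph convolution, one layer of such a graph encoder might look like this in PyTorch (layer sizes and the ReLU activation are illustrative):

```python
# Sketch of claim 4: a GCN-style layer with the normalization, dropout and
# batch normalization the claim describes.
import torch

class GraphEncoderLayer(torch.nn.Module):
    def __init__(self, in_dim, out_dim, p_drop=0.1):
        super().__init__()
        self.weight = torch.nn.Linear(in_dim, out_dim, bias=False)  # W^(k)
        self.bn = torch.nn.BatchNorm1d(out_dim)                     # BN()
        self.dropout = torch.nn.Dropout(p_drop)                     # Dropout()

    def forward(self, adj, h):
        a_tilde = adj + torch.eye(adj.size(0))        # add self-loop edges
        d = a_tilde.sum(dim=1)                        # degrees D_ii = sum_j A_ij
        d_inv_sqrt = torch.diag(d.pow(-0.5))
        norm_adj = d_inv_sqrt @ a_tilde @ d_inv_sqrt  # D^(-1/2) Ã D^(-1/2)
        gc = torch.relu(norm_adj @ self.weight(h))    # GC(H^(k)) with σ = ReLU
        return self.bn(self.dropout(gc))              # H^(k+1)
```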
5. The method for generating a medical image report based on multi-modal fusion according to claim 1, wherein the inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence comprises:
inputting the medical image into an image encoder that does not include a linear layer to obtain a four-dimensional visual feature matrix;
reshaping the four-dimensional visual feature matrix into a three-dimensional visual feature matrix;
and converting the three-dimensional visual feature matrix into a visual feature sequence.
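An illustrative realization of claim 5 using a ResNet backbone with its average-pooling and final linear layers removed; the choice of ResNet-101 and the 224-pixel input are assumptions, not fixed by the claim:

```python
# Sketch of claim 5: image -> 4D feature map -> 3D matrix -> feature sequence.
import torch
import torchvision

backbone = torchvision.models.resnet101(weights=None)
encoder = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc (linear layer)

image = torch.randn(1, 3, 224, 224)      # dummy medical image
feat4d = encoder(image)                  # (B, C, H', W') = (1, 2048, 7, 7)
b, c, h, w = feat4d.shape
feat3d = feat4d.view(b, c, h * w)        # three-dimensional matrix (B, C, R)
visual_seq = feat3d.permute(0, 2, 1)     # sequence of R = H' * W' patch features
```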
6. The multi-modal fusion-based medical image report generation method according to claim 1, wherein the calculating an affinity matrix between the graph embedding vector and the visual feature sequence comprises:
calculating the affinity matrix between the graph embedding vector and the visual feature sequence by the following expression:
C = tanh(G_E^T·W_b·I_E)
wherein C represents the affinity matrix, G_E represents the graph embedding vector, I_E represents the visual feature sequence, and W_b represents a trainable weight matrix.
7. The method of claim 1, wherein the learning, through the affinity matrix, an attention map between the graph embedding vector and the visual feature sequence comprises:
learning the attention map between the graph embedding vector and the visual feature sequence by the following expression:
F_i = tanh(W_i·I_E + (W_g·G_E)·C)
wherein F_i represents the output result of learning on the graph embedding vector and the visual feature sequence by means of the affinity matrix, W_i and W_g both represent trainable weight matrices, C represents the affinity matrix, G_E represents the graph embedding vector, and I_E represents the visual feature sequence.
8. The multi-modal fusion-based medical image report generation method according to claim 1, wherein the calculating an attention weight vector based on the attention map comprises:
calculating the attention weight vector by the following expression:
a_i = softmax(w_fi^T·F_i)
wherein a_i represents the attention weight vector, w_fi represents a trainable weight matrix, and F_i represents the attention map result.
9. The multi-modal fusion-based medical image report generation method according to claim 1, wherein the calculating the attention re-weighted image sequence based on the visual feature sequence and the attention weight vector comprises:
calculating the attention re-weighted image sequence by the following expression:
x̂_r = a_r·x_r, r = 1, 2, …, R
wherein x̂ = [x̂_1; x̂_2; …; x̂_R] represents the attention re-weighted image sequence, x_r represents an element of the visual feature sequence, x̂_r represents the corresponding element of the attention re-weighted image sequence, a_r represents the element of the attention weight vector corresponding to that element of the visual feature sequence, and R = H′ × W′ represents the number of image blocks.
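Claims 6 to 9 together describe one cooperative attention pass. A compact sketch, assuming the affinity and weight formulas as reconstructed above and single (unbatched) inputs:

```python
# Sketch of claims 6-9: affinity matrix -> attention map -> weights -> re-weighting.
import torch

class CoAttention(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w_b = torch.nn.Parameter(torch.randn(dim, dim) * 0.02)  # W_b
        self.w_i = torch.nn.Linear(dim, dim, bias=False)             # W_i
        self.w_g = torch.nn.Linear(dim, dim, bias=False)             # W_g
        self.w_f = torch.nn.Linear(dim, 1, bias=False)               # w_fi

    def forward(self, g_e, i_e):
        # g_e: (N, dim) graph embeddings; i_e: (R, dim) visual feature sequence
        c = torch.tanh(g_e @ self.w_b @ i_e.T)               # affinity matrix C: (N, R)
        f = torch.tanh(self.w_i(i_e) + c.T @ self.w_g(g_e))  # attention map F_i: (R, dim)
        a = torch.softmax(self.w_f(f).squeeze(-1), dim=0)    # attention weights a: (R,)
        return a.unsqueeze(-1) * i_e                         # x̂_r = a_r * x_r
```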
10. The multi-modal fusion-based medical image report generation method according to claim 1, wherein the memory-driven Transformer model comprises an encoder and a decoder, the decoder comprising a decoding module provided with a memory-driven normalization layer.
11. The method of claim 10, wherein the inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report comprises:
inputting the attention re-weighted image sequence into the encoder;
initializing the relational memory by adopting the graph embedding vector;
calculating a memory matrix of the last round of output of the relational memory;
and inputting the memory matrix output by the last round of the relational memory and the output result of the encoder into the decoding module provided with the memory-driven normalization layer to obtain a medical image report.
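A schematic of claim 11's generation loop; the interfaces (init_from_graph, step, the word embedding lookup, and the greedy token choice) are illustrative assumptions about how the pieces sketched around claims 12 to 14 could be wired together:

```python
# Sketch of claim 11: encode, initialize memory from the graph embedding,
# roll the relational memory forward, and decode one token at a time.
import torch

def generate_report(encoder, rel_memory, decoder, embed, x_hat, g_e, bos_id, max_len=60):
    enc_out = encoder(x_hat)                  # encode the attention re-weighted sequence
    memory = rel_memory.init_from_graph(g_e)  # M_0 initialized with the graph embedding
    tokens = [bos_id]
    for _ in range(max_len):
        y_prev = embed(torch.tensor([tokens[-1]]))  # word embedding of last prediction
        memory = rel_memory.step(memory, y_prev)    # M_t from M_{t-1} and y_{t-1}
        logits = decoder(enc_out, tokens, memory)   # MCLN inside the decoding module
        tokens.append(int(logits[-1].argmax()))     # greedy choice of the next token
    return tokens
```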
12. The method for generating a medical image report based on multi-modal fusion according to claim 11, wherein initializing the relational memory using the graph embedding vector includes:
initializing the relational memory by the following expression:
M_0 = MLP(G_E·W_m)
wherein M_0 represents the initial memory matrix of the relational memory, G_E represents the graph embedding vector, W_m represents a weight matrix, and MLP() represents a multi-layer perceptron used to create a mapping between dimensions.
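A minimal sketch of this initialization, assuming the node embeddings are mean-pooled into a single graph vector and that the memory slot count and width are free hyperparameters:

```python
# Sketch of claim 12: initialize the relational memory from the graph embedding.
import torch

class MemoryInit(torch.nn.Module):
    def __init__(self, graph_dim, mem_slots, mem_dim):
        super().__init__()
        self.w_m = torch.nn.Parameter(torch.randn(graph_dim, mem_slots * mem_dim) * 0.02)  # W_m
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(mem_slots * mem_dim, mem_slots * mem_dim), torch.nn.ReLU(),
            torch.nn.Linear(mem_slots * mem_dim, mem_slots * mem_dim))
        self.shape = (mem_slots, mem_dim)

    def forward(self, g_e):
        pooled = g_e.mean(dim=0)          # pool node embeddings to one graph vector
        m0 = self.mlp(pooled @ self.w_m)  # M_0 = MLP(G_E · W_m)
        return m0.view(*self.shape)       # memory matrix (slots, dim)
```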
13. The method for generating a medical image report based on multi-modal fusion according to claim 11, wherein the calculating the memory matrix of the last round of output of the relational memory includes:
calculating the memory matrix of the last round of output of the relational memory by the following expressions:
Z = MLP(softmax(Q·K^T / √d_k)·V)
G^f = σ(y_{t-1}·W^f + tanh(M_{t-1})·U^f), G^o = σ(y_{t-1}·W^o + tanh(M_{t-1})·U^o)
M_t = G^f ⊙ M_{t-1} + G^o ⊙ tanh(Z)
wherein Q = M_{t-1}·W_Q, K = [M_{t-1}; y_{t-1}]·W_K, V = [M_{t-1}; y_{t-1}]·W_V; M_{t-1} represents the memory matrix output by the previous round of the relational memory; y_{t-1} represents the word embedding vector predicted by the previous round of the relational memory; W_Q, W_K, W_V, W^f, U^f, W^o and U^o are all trainable weight matrices; d_k represents the scaling factor, obtained by dividing the dimension of K by the number of attention heads; MLP() represents the multi-layer perceptron; G^f and G^o are the forget gate and the output gate used to balance M_{t-1} and y_{t-1}; Z represents the multi-head attention output matrix mapped by the multi-layer perceptron; and M_t represents the memory matrix of the last round of output of the relational memory.
14. The method for generating a medical image report based on multi-modal fusion according to claim 13, wherein the inputting the memory matrix output by the last round of the relational memory and the output result of the encoder into the decoding module provided with the memory-driven normalization layer to obtain the medical image report comprises:
calculating the output result of the decoding module provided with the memory-driven normalization layer by the following expressions:
γ_t = γ + MLP(M_t)
β_t = β + MLP(M_t)
MCLN(r, M_t) = γ_t ⊙ (r − μ) / ν + β_t
θ = T_D(ψ, N, RM(M_{t-1}, y_{t-1}), MCLN(r, M_t))
wherein ψ represents the output of the encoder; N represents the number of decoder layers; γ represents a matrix of learnable scaling parameters used to improve generalization ability, and γ_t represents the result of adding γ to M_t mapped by a multi-layer perceptron; β represents a matrix of learnable shift parameters used to improve generalization ability, and β_t represents the result of adding β to M_t mapped by a multi-layer perceptron; μ represents the mean value of γ, and ν represents the standard deviation of γ; r represents the input of the normalization layer; T_E() represents the encoder, T_D() represents the decoder, and RM() represents the relational memory.
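A sketch of the memory-driven conditional normalization; here μ and ν are computed as the statistics of the input r, which is an interpretive assumption about the claim's normalization step:

```python
# Sketch of claim 14: memory-driven conditional layer normalization (MCLN).
import torch

class MCLN(torch.nn.Module):
    def __init__(self, dim, mem_dim):
        super().__init__()
        self.gamma = torch.nn.Parameter(torch.ones(dim))   # learnable scaling γ
        self.beta = torch.nn.Parameter(torch.zeros(dim))   # learnable shift β
        self.mlp_g = torch.nn.Linear(mem_dim, dim)         # MLP(M_t) for γ_t
        self.mlp_b = torch.nn.Linear(mem_dim, dim)         # MLP(M_t) for β_t

    def forward(self, r, m_t):
        # r: (T, dim) decoder hidden states; m_t: (mem_dim,) flattened memory
        gamma_t = self.gamma + self.mlp_g(m_t)     # γ_t = γ + MLP(M_t)
        beta_t = self.beta + self.mlp_b(m_t)       # β_t = β + MLP(M_t)
        mu = r.mean(dim=-1, keepdim=True)          # normalization mean μ
        nu = r.std(dim=-1, keepdim=True) + 1e-6    # normalization std ν
        return gamma_t * (r - mu) / nu + beta_t    # MCLN(r, M_t)
```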
15. A medical image report generating device based on multi-modal fusion, comprising:
the graph construction module is used for constructing a medical prior knowledge graph and acquiring an initial feature vector of each node in the medical prior knowledge graph;
the graph encoder module is used for inputting the medical prior knowledge graph and the initial feature vector of each node in the medical prior knowledge graph into a graph encoder to obtain a graph embedding vector;
the image encoder module is used for inputting the medical image into an image encoder that does not include a linear layer to obtain a visual feature sequence;
the multi-modal fusion module is used for adopting a cooperative attention mechanism to perform multi-modal fusion on the graph embedding vector and the visual feature sequence to obtain an attention re-weighted image sequence;
the report generation module is used for inputting the attention re-weighted image sequence into a memory-driven Transformer model to generate a medical image report;
wherein the multi-modal fusion module is specifically configured to:
calculate an affinity matrix between the graph embedding vector and the visual feature sequence;
learn, through the affinity matrix, an attention map between the graph embedding vector and the visual feature sequence;
calculate an attention weight vector based on the attention map;
and calculate the attention re-weighted image sequence based on the visual feature sequence and the attention weight vector.
16. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the medical image report generation method based on multi-modal fusion according to any one of claims 1 to 14.
17. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the medical image report generation method based on multi-modal fusion according to any one of claims 1 to 14.
18. A computer program product comprising a computer program which, when executed by a processor, implements the medical image report generation method based on multi-modal fusion according to any one of claims 1 to 14.
CN202210836966.3A 2022-07-15 2022-07-15 Medical image report generation method and device based on multi-mode fusion Active CN115331769B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210836966.3A CN115331769B (en) 2022-07-15 2022-07-15 Medical image report generation method and device based on multi-mode fusion


Publications (2)

Publication Number Publication Date
CN115331769A (en) 2022-11-11
CN115331769B (en) 2023-05-09

Family

ID=83917479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210836966.3A Active CN115331769B (en) 2022-07-15 2022-07-15 Medical image report generation method and device based on multi-mode fusion

Country Status (1)

Country Link
CN (1) CN115331769B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115937689B (en) * 2022-12-30 2023-08-11 安徽农业大学 Intelligent identification and monitoring technology for agricultural pests
CN116028654B (en) * 2023-03-30 2023-06-13 中电科大数据研究院有限公司 Multi-mode fusion updating method for knowledge nodes
CN117010494B (en) * 2023-09-27 2024-01-05 之江实验室 Medical data generation method and system based on causal expression learning
CN117649917A (en) * 2024-01-29 2024-03-05 北京大学 Training method and device for test report generation model and test report generation method
CN117993500A (en) * 2024-04-07 2024-05-07 江西为易科技有限公司 Medical teaching data management method and system based on artificial intelligence

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180060512A1 (en) * 2016-08-29 2018-03-01 Jeffrey Sorenson System and method for medical imaging informatics peer review system
CN112992308B (en) * 2021-03-25 2023-05-16 腾讯科技(深圳)有限公司 Training method of medical image report generation model and image report generation method
CN113724359A (en) * 2021-07-14 2021-11-30 鹏城实验室 CT report generation method based on Transformer
CN114724670A (en) * 2022-06-02 2022-07-08 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Medical report generation method and device, storage medium and electronic equipment


Similar Documents

Publication Publication Date Title
CN115331769B (en) Medical image report generation method and device based on multi-mode fusion
CN109545302A (en) A kind of semantic-based medical image report template generation method
CN110750959A (en) Text information processing method, model training method and related device
CN111316281A (en) Semantic classification of numerical data in natural language context based on machine learning
CN112561064B (en) Knowledge base completion method based on OWKBC model
WO2022052530A1 (en) Method and apparatus for training face correction model, electronic device, and storage medium
CN111881926A (en) Image generation method, image generation model training method, image generation device, image generation equipment and image generation medium
CN115132313A (en) Automatic generation method of medical image report based on attention mechanism
US11430123B2 (en) Sampling latent variables to generate multiple segmentations of an image
CN112052889B (en) Laryngoscope image recognition method based on double-gating recursion unit decoding
CN112530584A (en) Medical diagnosis assisting method and system
CN113724359A (en) CT report generation method based on Transformer
CN116129141A (en) Medical data processing method, apparatus, device, medium and computer program product
CN116563537A (en) Semi-supervised learning method and device based on model framework
CN116385937A (en) Method and system for solving video question and answer based on multi-granularity cross-mode interaction framework
CN112560454B (en) Bilingual image subtitle generating method, bilingual image subtitle generating system, storage medium and computer device
CN115190999A (en) Classifying data outside of a distribution using contrast loss
CN116486465A (en) Image recognition method and system for face structure analysis
CN116258928A (en) Pre-training method based on self-supervision information of unlabeled medical image
CN116994695A (en) Training method, device, equipment and storage medium of report generation model
CN115662565A (en) Medical image report generation method and equipment integrating label information
CN115762721A (en) Medical image quality control method and system based on computer vision technology
CN115239740A (en) GT-UNet-based full-center segmentation algorithm
CN114139531A (en) Medical entity prediction method and system based on deep learning
Souza et al. Automatic recognition of continuous signing of brazilian sign language for medical interview

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant