CN113779298A - Medical vision question-answering method based on composite loss - Google Patents

Medical vision question-answering method based on composite loss Download PDF

Info

Publication number
CN113779298A
CN113779298A
Authority
CN
China
Prior art keywords
question
image
medical
answer
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111085818.4A
Other languages
Chinese (zh)
Other versions
CN113779298B (en)
Inventor
Pan Haiwei (潘海为)
He Shuning (何舒宁)
Zhang Kejia (张可佳)
Chen Chunling (陈春伶)
Shi Kun (史坤)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN202111085818.4A priority Critical patent/CN113779298B/en
Publication of CN113779298A publication Critical patent/CN113779298A/en
Application granted granted Critical
Publication of CN113779298B publication Critical patent/CN113779298B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00ICT specially adapted for the handling or processing of medical images
    • G16H30/40ICT specially adapted for the handling or processing of medical images for processing medical images, e.g. editing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention belongs to the technical field at the intersection of medical imaging and artificial intelligence, and specifically relates to a medical visual question-answering method based on a composite loss. Addressing the problem that most medical visual question-answering methods concentrate on visual content and neglect the importance of text, the method extracts features of the image and of the question and then uses a multi-view attention mechanism to associate the question with both the image and its own words, and trains the whole model jointly with a classification loss and an image-question complementary loss. This compensates for the neglect of text-information mining in most existing medical visual question-answering methods, realizes multi-angle attention to the question, and improves the effectiveness of the medical visual question-answering method. The invention can effectively solve the medical visual question-answering task.

Description

Medical vision question-answering method based on composite loss
Technical Field
The invention belongs to the technical field at the intersection of medical imaging and artificial intelligence, and specifically relates to a medical visual question-answering method based on a composite loss.
Background
With the development of artificial intelligence, visual question answering has become one of the most popular current research topics. It is a challenging multi-modal task that draws on two major research areas, computer vision and natural language processing. The most common application of visual question answering is to help visually impaired people obtain more information about the virtual or real world, which can greatly improve their quality of life. With the continuous development of intelligent healthcare, visual question-answering tasks in the professional medical field are gradually becoming known to the public. Given a medical image and a corresponding text question, the correct answer should be predicted. Medical visual question answering highlights the specialized nature of the images and text, so the rich content of the medical image must be deeply understood and the complex semantics of the clinical question accurately explored. The task can assist doctors in diagnosing, answering and pre-judging diseases in advance, thereby greatly reducing the probability of misdiagnosis and missed diagnosis, improving accuracy, and shortening diagnosis and treatment time while improving efficiency. For patients, when a troubling question or symptom is encountered, a reference answer can be obtained immediately to judge and guard against the condition at the earliest opportunity.
However, current research on medical visual question-answering tasks is very limited. On the one hand, the concepts behind medical terminology are complex, and understanding clinical text is challenging. On the other hand, the imaging principles of medical images are complex and differ from those of natural images; most of the information in a medical image has potential value, and slight changes may indicate the location of a lesion. Although most deep-learning methods perform well in medical image analysis, current medical visual question-answer datasets lack large-scale labeled training data. If transfer learning is used to move a deep-learning model trained on a general visual question-answering dataset to the medical visual question-answering task and fine-tune it with a small number of medical images, the final result is poor because of the difference between natural and medical images. Moreover, modeling only the semantics of the text or only the visual content of the image cannot meet the requirements of a multi-modal task: the image and the question are correlated, and the relation between them is even more important.
Disclosure of Invention
The invention aims to provide a medical visual question-answering method based on a composite loss, addressing the problem that most medical visual question-answering methods concentrate on visual content and neglect the importance of text; text information can be effectively mined and multi-angle attention to the question realized, thereby improving the effectiveness of the medical visual question-answering method.
The purpose of the invention is realized by the following technical scheme: the method comprises the following steps:
Step 1: acquiring a medical visual question-answer data set, and extracting the medical image features v and the question features for the two modalities, vision and text;
Step 2: feeding the image features and question features obtained in step 1 into a multi-view attention mechanism, which comprises an image-to-question attention mechanism and a word-to-text attention mechanism; the image-to-question attention mechanism yields the attention weights of the image over the question and the visually guided text features Q_m, and the word-to-text attention mechanism yields the attention weight a_q of the words over the question;
Step 3: passing the visually guided text features Q_m and the image features v into a multi-modal fusion model, and outputting the fused multi-modal features M_cl and M_op:

M_cl = F_θ(Q_m, v)   (for closed question-answer pairs)
M_op = F_θ(Q_m, v)   (for open question-answer pairs)
where F denotes the multi-modal feature fusion, which adopts a bilinear attention network to learn a joint representation of the image and the question, and the subscript θ denotes the trainable parameters of the fusion; cl and op denote closed and open question-answer pairs, respectively;
Step 4: passing the multi-modal features M_cl and M_op of the closed and open question-answer pairs into a classification model consisting of a two-layer MLP (multi-layer perceptron) to obtain the probabilities of the candidate answers; taking the answer with the highest probability in the candidate answer set as the final predicted output y_cl or y_op; during model training, using a binary cross-entropy loss L_c together with an image-question complementary loss L_mq to form a composite-loss module that jointly optimizes the model:

Loss = L_c + γ·L_mq
L_c = BCE(ŷ_cl, y_cl) + BCE(ŷ_op, y_op)
L_mq = ||a_m − a_q||₂²

where BCE(·) denotes the binary cross-entropy loss function, ŷ denotes a predicted answer, y denotes the true answer, and γ is a hyper-parameter;
Step 5: acquiring the medical visual question to be answered, executing steps 1 to 3 to extract the fused multi-modal features M_cl and M_op, inputting them into the trained classification model, and taking the answer with the highest probability in the candidate answer set as the output.
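As an illustration only, the inference flow of steps 1 to 5 can be sketched as follows in PyTorch; the module names (question_encoder, image_encoder, word_attention, image_attention, fusion, classifier) are hypothetical stand-ins for the components described above, not the patented implementation.

```python
import torch

def predict(image, tokens, modules, candidate_answers):
    # Step 1: extract question features (D, Q) and image features v
    D, Q = modules["question_encoder"](tokens)
    v = modules["image_encoder"](image)
    # Step 2: multi-view attention (word-to-text and image-to-question)
    a_q = modules["word_attention"](D, Q)
    a_m, Q_m = modules["image_attention"](Q, v)
    # Step 3: multi-modal fusion of the visually guided question and the image
    M = modules["fusion"](Q_m, v)
    # Steps 4-5: score the candidate answers and return the most probable one
    probs = torch.sigmoid(modules["classifier"](M))
    return [candidate_answers[i] for i in probs.argmax(dim=-1).tolist()]
```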
The present invention may further comprise:
the method for extracting the medical image features in the step 1 specifically comprises the following steps: initializing a pre-training weight represented by an image by adopting model unknown element learning and a convolution noise reduction self-encoder together; the structure of model agnostic meta-learning consists of four 3 x 3 convolutional layers and one average pooling layer, each convolutional layer contains 64 filters and one nonlinear layer; the convolution noise reduction self-encoder is a combination of a series of convolution layers and a maximum pooling layer; self-coding medical image through model-agnostic meta-learning and convolution noise reductionThe device respectively obtains 64-dimensional vector features, and connects the 64-dimensional vector features in series to obtain the final medical image features, wherein the image features are expressed as
Figure BDA0003265662970000026
dk128 denotes the dimension of the image feature.
The method for extracting the question features in step 1 is specifically: each question is unified into a sentence of n words; if the question is longer than n words, the excess is deleted; if the question consists of fewer than n words, it is zero-padded until its length is n. First, each word in the question is represented by a 300-dimensional GloVe word embedding, giving D ∈ R^{n×d_h}, where d_h = 300 is the dimension of each word embedding; the word-embedding representation is then fed into a gated recurrent unit (GRU) network to encode the question embedding Q ∈ R^{n×d_s}, where d_s = 1024 is the dimension of each hidden state of the GRU network.
The word-to-text attention mechanism in step 2 specifically comprises the following steps:
Step 2.1.1: concatenating the word-embedding representation D and the question-embedding representation Q to obtain Q_c:

Q_c = [D || Q]

where || denotes concatenation along the feature dimension and Q_c ∈ R^{n×(d_h+d_s)};
Step 2.1.2: exploiting the context-independent character of the word embedding and the context-dependent character of the question embedding, and using a sigmoid activation function as a selection mechanism to control the output, thereby obtaining a question representation Q̃ with useless noise filtered out:

Q̃ = tanh(Q_c·W_1) ⊙ σ(Q_c·W_2)

where tanh(·) is the gated hyperbolic-tangent activation function, σ(·) is the sigmoid activation function, W_1, W_2 ∈ R^{(d_h+d_s)×d_s} are learned weights, and ⊙ is the Hadamard product;
Step 2.1.3: obtaining the semantic-level importance weight a_q ∈ R^{n×1} of the question:

a_q = softmax(Q̃·w_q)

where w_q ∈ R^{d_s×1} is a learned weight.
The image-to-question attention mechanism in step 2 is specifically as follows:
Step 2.2.1: accurately mining the degree of association between the image and the question with an attention weight:

a_m = softmax(Qᵀ·MLP(v))

where a_m ∈ R^{n×1} is the weight distribution that the image in the question-answer pair assigns to the n words of the question; each element of a_m corresponds to the degree of correlation between a word and the image, and the larger the element value, the higher the correlation; MLP(·) is a multi-layer perceptron used to align the dimensions of Q and v;
Step 2.2.2: applying the visually guided question-importance weight matrix a_m to the question embedding (text features) Q obtained in step 1 to obtain the visually guided text features Q_m:

Q_m = a_mᵀ ⊙ Q.
The invention has the beneficial effects that:
aiming at the problem that most medical visual question-answering methods concentrate on visual contents and neglect text importance, the problems are associated with images and words by adopting a multi-view attention mechanism after characteristics of the images and the problems are extracted, and the whole model is trained by adopting classification loss and image problem complementary loss together, so that the problem that most existing medical visual question-answering methods neglect excavation of text information importance is compensated, the problem of multi-angle concern on the problems is realized, and the effectiveness of the medical visual question-answering methods is improved. The invention can effectively solve the medical visual question-answering task.
Drawings
Fig. 1 is an overall framework diagram of the present invention.
FIG. 2 is a table comparing the accuracy of different medical visual question answering methods under the VQA-RAD test set in the experiment of the present invention.
Fig. 3 is an analysis chart of an ablation experiment of the method of the present invention.
FIG. 4 is a visual assessment diagram of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention aims to provide a medical visual question-answering method based on a composite loss, addressing the problem that most medical visual question-answering methods concentrate on visual content and neglect the importance of text; text information can be effectively mined and multi-angle attention to the question realized, thereby improving the effectiveness of the medical visual question-answering method.
Referring to fig. 1, the implementation steps of the invention are as follows:
the method comprises the following steps: for the two modalities of vision and text, features of medical images and problems are extracted by different methods.
The medical image feature-extraction method overcomes the limitation of scarce labeled data: the pre-training weights of the image representation are initialized jointly by model-agnostic meta-learning and a convolutional denoising auto-encoder. The model-agnostic meta-learning structure consists of four 3×3 convolutional layers and one average-pooling layer; each convolutional layer contains 64 filters and a non-linear layer. The convolutional denoising auto-encoder is a combination of a series of convolutional layers and max-pooling layers. The medical image passes through the model-agnostic meta-learning network and the convolutional denoising auto-encoder, each of which yields a 64-dimensional feature vector, and the two vectors are concatenated to obtain the final medical image features. The image features are represented as v ∈ R^{d_k}, where d_k = 128 is the dimension of the image features.
When extracting the question features, each question is unified into a sentence of n words. If the question is longer than n words, the excess is deleted; if it consists of fewer than n words, it is zero-padded until its length is n. First, each word in the question is represented by a 300-dimensional GloVe word embedding, giving D ∈ R^{n×d_h}, where d_h = 300 is the dimension of each word embedding. The word-embedding representation is then fed into a gated recurrent unit (GRU) network to encode the question embedding Q ∈ R^{n×d_s}, where d_s = 1024 is the dimension of each hidden state of the GRU network.
The above steps use different feature-extraction methods to obtain the image features and the question features for the medical-image and text-question modalities, respectively.
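As a sketch under assumptions (not the patented implementation), the two step-one encoders could be organized as below. The four 3×3 convolutions with 64 filters, the 64+64 = 128-dimensional concatenation, the 300-dimensional GloVe embedding and the 1024-dimensional GRU follow the description above, while the exact layer sizes of the convolutional denoising auto-encoder branch and the vocabulary handling are assumptions.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Two visual branches (MAML-style CNN / CDAE-style encoder), 64-d each, concatenated to 128-d."""
    def __init__(self):
        super().__init__()
        def conv_block(c_in, c_out):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU())
        # four 3x3 conv layers of 64 filters plus average pooling (MAML branch structure)
        self.maml = nn.Sequential(conv_block(1, 64), conv_block(64, 64),
                                  conv_block(64, 64), conv_block(64, 64),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # convolution + max-pooling stack standing in for the CDAE encoder (sizes assumed)
        self.cdae = nn.Sequential(conv_block(1, 32), nn.MaxPool2d(2),
                                  conv_block(32, 64), nn.MaxPool2d(2),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, img):                                         # img: (B, 1, H, W)
        return torch.cat([self.maml(img), self.cdae(img)], dim=1)   # (B, 128)

class QuestionEncoder(nn.Module):
    """GloVe word embeddings (300-d) fed into a GRU with 1024-d hidden states."""
    def __init__(self, vocab_size, d_h=300, d_s=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_h)   # weights would be loaded from GloVe
        self.gru = nn.GRU(d_h, d_s, batch_first=True)

    def forward(self, tokens):                   # tokens: (B, n) word indices
        D = self.embed(tokens)                   # (B, n, 300)  word embeddings
        Q, _ = self.gru(D)                       # (B, n, 1024) question embedding
        return D, Q
```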
Step two: the image features and the question features obtained in step one are fed into the multi-view attention mechanism, which comprises an image-to-question attention mechanism and a word-to-text attention mechanism. The attention weights of the image over the question and the text features under visual guidance are obtained in the image-to-question attention mechanism, and the attention weights of the words over the question are obtained in the word-to-text attention mechanism. The multi-view attention mechanism allows the question to be analyzed more thoroughly and prepares the ground for obtaining an accurate answer.
Word-to-text attention mechanism: the question representation Q obtained in step one ignores the different degrees of importance of different words. To emphasize the key words in the question, the method therefore uses a word-to-text attention mechanism. It exploits the advantages of both the word-embedding representation and the question-feature representation produced during question feature extraction in step one, and assigns a weight to every word in the question, a process consistent with the attention process of the human brain. The mechanism captures the importance of the question at the semantic level. First, the word-embedding representation D and the question-embedding representation Q are concatenated to obtain Q_c:

Q_c = [D || Q]    (1)

where || denotes concatenation along the feature dimension and Q_c ∈ R^{n×(d_h+d_s)}.
and then using the context-independent characteristics of word embedding and the context-dependent characteristics of question embedding, and using the sigmoid activation function as a selection mechanism to control output so as to obtain a question representation for filtering useless noise
Figure BDA0003265662970000053
Figure BDA0003265662970000054
In the formula: tan () and σ () are activation functions called gated hyperbolic tangent and sigmoid, respectively;
Figure BDA0003265662970000055
is the learning weight; as is the hadamard product.
Finally, the importance weight a_q ∈ R^{n×1} of the question is obtained at the semantic level:

a_q = softmax(Q̃·w_q)    (3)

where w_q ∈ R^{d_s×1} is a learned weight.
Image-to-question attention mechanism: this mechanism is introduced to establish the relationship between the visual and textual modalities and to observe the question from the visual perspective in order to mine effective information. The image assigns importance weights to the words of the question, so that the words of significant meaning are found under visual guidance. The degree of association between the image and the question is mined accurately with an attention weight:

a_m = softmax(Qᵀ·MLP(v))    (4)

where MLP(·) is a multi-layer perceptron used to align the dimensions of Q and v, and a_m ∈ R^{n×1} is the weight distribution that the image in the question-answer pair assigns to the n words of the question. Each element of a_m corresponds to the degree of correlation between a word and the image; the larger the element value, the higher the correlation.
After the visually guided question-importance weight matrix a_m is obtained, a_m is applied to the question embedding Q obtained in step one. The image-to-question attention mechanism thus yields the question embedding Q_m fused with the image features:

Q_m = a_mᵀ ⊙ Q    (5)

where ⊙ is the Hadamard product and Q_m are the text features learned under visual guidance.
At this point the question embedding contains not only single-modality features at the text-semantic level but also features contributed from the image level. Through the image-to-question attention mechanism, the features of the two modalities can accurately capture the fine-grained relation between vision and text: the mechanism assigns different importance weights to the text according to how relevant the image is to each word of the question.
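A minimal sketch of the image-to-question attention of equations (4)-(5) follows; it assumes a two-layer MLP projecting the 128-dimensional image feature into the d_s-dimensional question space and realizes the Hadamard product by broadcasting the n×1 weights over Q. It is illustrative only, not the patented implementation.

```python
import torch
import torch.nn as nn

class ImageToQuestionAttention(nn.Module):
    def __init__(self, d_k=128, d_s=1024):
        super().__init__()
        # MLP aligning the image feature v with the question dimension d_s
        self.mlp = nn.Sequential(nn.Linear(d_k, d_s), nn.ReLU(), nn.Linear(d_s, d_s))

    def forward(self, Q, v):
        # scores correspond to Q·MLP(v) with batch-first Q of shape (B, n, d_s)
        scores = torch.bmm(Q, self.mlp(v).unsqueeze(-1))       # (B, n, 1)
        a_m = torch.softmax(scores, dim=1)                     # image-to-word weights
        # Q_m = a_m ⊙ Q: re-weight the question embedding under visual guidance
        Q_m = a_m * Q                                          # (B, n, d_s)
        return a_m, Q_m
```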
Step three: the output of the multi-view attention mechanism is passed to the composite loss. To make the prediction of the correct answer more accurate, the composite loss is composed of a classification loss and an image-question complementary loss, which train the model together. The classification loss is used to accurately predict the answer distribution after multi-modal feature fusion, and the image-question complementary loss is used to improve the similarity between the textual and visual cross-modal features and to minimize the difference between the importance the words learn for the question and the importance the image learns for the question.
Classification loss: after the visually guided text features are obtained, the question-answer pairs are divided into open and closed types according to the answer type, and the accuracy of each type is compared separately. The question representations Q_m of the two types and the image features v are passed into a general multi-modal fusion model, and the fused multi-modal features are output:

M_cl = F_θ(Q_m, v)    (6)   (for closed question-answer pairs)
M_op = F_θ(Q_m, v)    (7)   (for open question-answer pairs)

In formulas (6) and (7): F is the multi-modal feature-fusion method, which adopts a bilinear attention network to learn a joint representation of the image and the question; θ are the trainable parameters of the fusion; cl and op denote closed and open question-answer pairs, respectively.
To predict the best answer, the method passes the multi-modal features M_cl and M_op of the open and closed question-answer pairs into a classifier consisting of a two-layer MLP, which yields the probabilities of the candidate answers. The answer with the highest probability in the candidate answer set is taken as the final predicted output y_cl or y_op. At this stage a binary cross-entropy loss L_c is used during training:

L_c = BCE(ŷ_cl, y_cl) + BCE(ŷ_op, y_op)    (8)

where BCE(·) denotes the binary cross-entropy loss function, ŷ is the predicted answer, y is the true answer, and cl and op denote closed and open question-answer pairs, respectively.
Image-question complementary loss: during model training, to improve the similarity between the visual-textual cross-modal features, the difference between the importance the words learn for the question and the importance learned for the question under visual guidance is minimized. The method uses the weight a_m obtained from the image-to-question attention mechanism and the attention weight a_q produced by the word-to-text attention mechanism to define the image-question complementary loss L_mq, which guides the learning of the question importance jointly:

L_mq = ||a_m − a_q||₂²    (9)
the composite loss module, which consists of the above classification loss and image problem complementary loss, is used for the joint optimization model:
Loss=Lc+γLmq (10)
in the formula: γ is a hyperparameter.
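A minimal sketch of the composite loss of equations (8)-(10) together with the two-layer MLP classifier is given below. The squared-L2 form of the image-question complementary loss and the hidden width of the classifier are assumptions made for illustration; the sketch is not the patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def composite_loss(logits_cl, y_cl, logits_op, y_op, a_m, a_q, gamma=1.6):
    # L_c: binary cross-entropy over the closed and open answer distributions
    l_c = F.binary_cross_entropy_with_logits(logits_cl, y_cl) + \
          F.binary_cross_entropy_with_logits(logits_op, y_op)
    # L_mq (assumed squared L2): pull the visually guided word importance a_m
    # toward the semantic word importance a_q; both have shape (B, n, 1)
    l_mq = ((a_m - a_q) ** 2).sum(dim=1).mean()
    return l_c + gamma * l_mq

class AnswerClassifier(nn.Module):
    """Two-layer MLP mapping a fused multi-modal feature to candidate-answer scores."""
    def __init__(self, d_fused, n_answers, d_hidden=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_fused, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, n_answers))

    def forward(self, m):
        return self.net(m)   # logits over the candidate answer set
```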
Compared with the prior art, the invention has the following beneficial effects: the core technical content of the invention is a composite-loss-based medical visual question-answering method that, after extracting the features of the image and of the question, uses a multi-view attention mechanism to associate the question with the image and with the words, and trains the whole model jointly with a classification loss and an image-question complementary loss. This compensates for the neglect of text-information mining in most existing medical visual question-answering methods and realizes attention to the question from multiple angles.
Experiments verify that the proposed composite-loss-based medical visual question-answering method can attend to the question from multiple angles and effectively mine text information. Its realization is of great significance for the application of current medical visual question answering.
Experimental platform: all experiments were run on a GTX 1080 Ti GPU server; the experiments were implemented in the Python programming language with the PyCharm IDE, and the deep-learning framework used was PyTorch.
(1) Experimental parameters
The question length n used when obtaining the question features in step one is 12, i.e. each question consists of 12 words. The word-embedding representations and the question features were obtained with the GloVe method and a gated recurrent unit network, respectively, where the hidden layer of the GRU network has 1024 dimensions. In the experiments, the batch size was set to 64, and an Adamax optimizer with a learning rate of 0.005 was used for training.
(2) Content of the experiment
Experiment 1: introduction of data sets.
The VQA-RAD dataset is the first manually constructed dataset of natural questions and reference answers about radiological images in the field of medical visual question answering. It contains 315 radiological images in total, evenly distributed over the head, chest and abdomen. The questions fall into 11 categories according to question type, including location, size, and so on. The question-answer pairs are divided into open and closed types according to the answer type: questions that are generally of a selective (multiple-choice) nature are called closed question-answer pairs, and the rest are open question-answer pairs. The dataset is divided into a training set and a test set containing 3064 and 451 question-answer pairs, respectively.
Experiment 2: the effect of different medical visual question-answering methods was tested in the test set of the medical visual question-answering data set VQA-RAD, and a comparison graph of accuracy is shown in fig. 2.
Experimental results: as shown in FIG. 2, the proposed method improves on the other existing methods on the VQA-RAD dataset. Our method is superior to the other methods in accuracy on open, closed, and overall question-answer pairs. Compared with Med-VQA, the best-performing method among those compared, the accuracy on the three kinds of question-answer pairs is improved by about 3 percentage points on average.
Analysis: the method does not model the individual modalities in isolation; it also effectively mines the relationship between the modalities. The experimental results show that using the relation between text and vision to build an attention mechanism between the question and the image gives a better understanding of the latent meaning of the question and the image and finds the keywords matching the image, so the predicted answers are more accurate and more stable, which proves the effectiveness of the method.
Experiment 3: an ablation study of each component of the proposed method is shown in fig. 3.
Experimental results: as shown in FIG. 3, the image-to-question attention component and the image-question complementary-loss component are evaluated on the VQA-RAD dataset. The experimental results show that the cooperation of the two components outperforms either component working alone, and either component alone already outperforms the baseline method.
Analysis: the attention mechanism can explore the close relationship between the image and the question in a question-answer pair. The image-question complementary loss further improves the similarity between vision and text and minimizes the difference between the word-level and image-level learning of the question. Working together, the two achieve the best effect.
Experiment 4: and (3) analyzing the optimal value of the hyper-parameter gamma in the composite loss.
Experimental results: the hyper-parameter γ in the composite loss was set to different values and evaluated on the open, closed, and overall question-answer pairs. The performance of the method varies as γ changes. The best results are obtained when γ = 1.6, where all three accuracies are particularly outstanding.
Analysis: compared with the accuracy when γ = 0, the accuracies on the open, closed, and overall question-answer pairs are all improved, which proves that the image-question complementary loss has a clear influence on the proposed method.
Experiment 5: the visual evaluation of the method of the invention is shown in fig. 4.
Experimental results: as shown in FIG. 4, the proposed method can, in general, accurately locate the visual information and the text keywords involved in the visual question-answering task.
Analysis: the proposed method predicts the correct answer for most images and questions. Under the combined action of the multi-view attention mechanism and the composite loss, the method can correctly locate the key regions in the image and the key words in the question, and finally predicts the correct answer according to the located image regions and words.
In conclusion, the composite-loss-based medical visual question-answering method can effectively solve the medical visual question-answering task. It not only extracts image and question features but also uses the multi-view attention mechanism to explore the latent influence of the words and the image on the question, so that the semantic relation between image and text is used to effectively mine the important information of the text; the model is trained with the composite loss to optimize the method, and the accuracy of the medical visual question-answering task is finally improved.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A medical vision question-answering method based on composite loss is characterized by comprising the following steps:
Step 1: acquiring a medical visual question-answer data set, and extracting the medical image features v and the question features for the two modalities, vision and text;
Step 2: feeding the image features and question features obtained in step 1 into a multi-view attention mechanism, which comprises an image-to-question attention mechanism and a word-to-text attention mechanism; the image-to-question attention mechanism yields the attention weights of the image over the question and the visually guided text features Q_m, and the word-to-text attention mechanism yields the attention weight a_q of the words over the question;
Step 3: passing the visually guided text features Q_m and the image features v into a multi-modal fusion model, and outputting the fused multi-modal features M_cl and M_op:

M_cl = F_θ(Q_m, v)   (for closed question-answer pairs)
M_op = F_θ(Q_m, v)   (for open question-answer pairs)

where F denotes the multi-modal feature fusion, which adopts a bilinear attention network to learn a joint representation of the image and the question, and the subscript θ denotes the trainable parameters of the fusion; cl and op denote closed and open question-answer pairs, respectively;
Step 4: passing the multi-modal features M_cl and M_op of the open and closed question-answer pairs into a classification model consisting of a two-layer MLP (multi-layer perceptron) to obtain the probabilities of the candidate answers; taking the answer with the highest probability in the candidate answer set as the final predicted output y_cl or y_op; during model training, using a binary cross-entropy loss L_c together with an image-question complementary loss L_mq to form a composite-loss module that jointly optimizes the model:

Loss = L_c + γ·L_mq
L_c = BCE(ŷ_cl, y_cl) + BCE(ŷ_op, y_op)
L_mq = ||a_m − a_q||₂²

where BCE(·) denotes the binary cross-entropy loss function, ŷ denotes a predicted answer, y denotes the true answer, and γ is a hyper-parameter;
Step 5: acquiring the medical visual question to be answered, executing steps 1 to 3 to extract the fused multi-modal features M_cl and M_op, inputting them into the trained classification model, and taking the answer with the highest probability in the candidate answer set as the output.
2. The composite-loss-based medical visual question-answering method according to claim 1, characterized in that the method for extracting the medical image features in step 1 is specifically: the pre-training weights of the image representation are initialized jointly by model-agnostic meta-learning and a convolutional denoising auto-encoder; the model-agnostic meta-learning structure consists of four 3×3 convolutional layers and one average-pooling layer, where each convolutional layer contains 64 filters and a non-linear layer; the convolutional denoising auto-encoder is a combination of a series of convolutional layers and max-pooling layers; the medical image passes through the model-agnostic meta-learning network and the convolutional denoising auto-encoder, each yielding a 64-dimensional feature vector, and the two vectors are concatenated to obtain the final medical image features v ∈ R^{d_k}, where d_k = 128 is the dimension of the image features.
3. The composite-loss-based medical visual question-answering method according to claim 1 or 2, characterized in that the method for extracting the question features in step 1 is specifically: each question is unified into a sentence of n words; if the question is longer than n words, the excess is deleted; if the question consists of fewer than n words, it is zero-padded until its length is n; first, each word in the question is represented by a 300-dimensional GloVe word embedding, giving D ∈ R^{n×d_h}, where d_h = 300 is the dimension of each word embedding; the word-embedding representation is then fed into a gated recurrent unit (GRU) network to encode the question embedding Q ∈ R^{n×d_s}, where d_s = 1024 is the dimension of each hidden state of the GRU network.
4. The composite-loss-based medical visual question-answering method according to claim 3, characterized in that the word-to-text attention mechanism in step 2 specifically comprises the following steps:
Step 2.1.1: concatenating the word-embedding representation D and the question-embedding representation Q to obtain Q_c:

Q_c = [D || Q]

where || denotes concatenation along the feature dimension and Q_c ∈ R^{n×(d_h+d_s)};
Step 2.1.2: exploiting the context-independent character of the word embedding and the context-dependent character of the question embedding, and using a sigmoid activation function as a selection mechanism to control the output, thereby obtaining a question representation Q̃ with useless noise filtered out:

Q̃ = tanh(Q_c·W_1) ⊙ σ(Q_c·W_2)

where tanh(·) is the gated hyperbolic-tangent activation function, σ(·) is the sigmoid activation function, W_1, W_2 ∈ R^{(d_h+d_s)×d_s} are learned weights, and ⊙ is the Hadamard product;
Step 2.1.3: obtaining the semantic-level importance weight a_q ∈ R^{n×1} of the question:

a_q = softmax(Q̃·w_q)

where w_q ∈ R^{d_s×1} is a learned weight.
5. The composite-loss-based medical visual question-answering method according to claim 3, characterized in that the image-to-question attention mechanism in step 2 is specifically as follows:
Step 2.2.1: accurately mining the degree of association between the image and the question with an attention weight:

a_m = softmax(Qᵀ·MLP(v))

where a_m ∈ R^{n×1} is the weight distribution that the image in the question-answer pair assigns to the n words of the question; each element of a_m corresponds to the degree of correlation between a word and the image, and the larger the element value, the higher the correlation; MLP(·) is a multi-layer perceptron used to align the dimensions of Q and v;
Step 2.2.2: applying the visually guided question-importance weight matrix a_m to the text features Q obtained in step 1 to obtain the visually guided text features Q_m:

Q_m = a_mᵀ ⊙ Q.
6. The composite-loss-based medical visual question-answering method according to claim 4, characterized in that the image-to-question attention mechanism in step 2 is specifically as follows:
Step 2.2.1: accurately mining the degree of association between the image and the question with an attention weight:

a_m = softmax(Qᵀ·MLP(v))

where a_m ∈ R^{n×1} is the weight distribution that the image in the question-answer pair assigns to the n words of the question; each element of a_m corresponds to the degree of correlation between a word and the image, and the larger the element value, the higher the correlation; MLP(·) is a multi-layer perceptron used to align the dimensions of Q and v;
Step 2.2.2: applying the visually guided question-importance weight matrix a_m to the text features Q obtained in step 1 to obtain the visually guided text features Q_m:

Q_m = a_mᵀ ⊙ Q.
CN202111085818.4A 2021-09-16 2021-09-16 Medical vision question-answering method based on composite loss Active CN113779298B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111085818.4A CN113779298B (en) 2021-09-16 2021-09-16 Medical vision question-answering method based on composite loss

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111085818.4A CN113779298B (en) 2021-09-16 2021-09-16 Medical vision question-answering method based on composite loss

Publications (2)

Publication Number Publication Date
CN113779298A true CN113779298A (en) 2021-12-10
CN113779298B CN113779298B (en) 2023-10-31

Family

ID=78844492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111085818.4A Active CN113779298B (en) 2021-09-16 2021-09-16 Medical vision question-answering method based on composite loss

Country Status (1)

Country Link
CN (1) CN113779298B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821245A (en) * 2022-05-30 2022-07-29 Dalian University Medical visual question-answering method based on global visual information intervention

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288609A (en) * 2019-05-30 2019-09-27 Nanjing Normal University Attention-mechanism-guided multi-modal whole-heart image segmentation method
CN110717431A (en) * 2019-09-27 2020-01-21 Huaqiao University Fine-grained visual question-answering method combined with a multi-view attention mechanism
US20210216862A1 (en) * 2020-01-15 2021-07-15 Beijing Jingdong Shangke Information Technology Co., Ltd. System and method for semantic analysis of multimedia data using attention-based fusion network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288609A (en) * 2019-05-30 2019-09-27 Nanjing Normal University Attention-mechanism-guided multi-modal whole-heart image segmentation method
CN110717431A (en) * 2019-09-27 2020-01-21 Huaqiao University Fine-grained visual question-answering method combined with a multi-view attention mechanism
US20210216862A1 (en) * 2020-01-15 2021-07-15 Beijing Jingdong Shangke Information Technology Co., Ltd. System and method for semantic analysis of multimedia data using attention-based fusion network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yan Ruyu; Liu Xueliang: "Visual question answering model combining a bottom-up attention mechanism and a memory network", Journal of Image and Graphics, no. 05 *
Han Kun; Pan Haiwei; Zhang Wei; Bian Xiaofei; Chen Chunling; He Shuning: "Alzheimer's disease classification method based on multi-modal medical images", Journal of Tsinghua University (Science and Technology), no. 08 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821245A (en) * 2022-05-30 2022-07-29 Dalian University Medical visual question-answering method based on global visual information intervention
CN114821245B (en) * 2022-05-30 2024-03-26 Dalian University Medical visual question-answering method based on global visual information intervention

Also Published As

Publication number Publication date
CN113779298B (en) 2023-10-31

Similar Documents

Publication Publication Date Title
Liu et al. Learning a recurrent residual fusion network for multimodal matching
Xia et al. Generative adversarial regularized mutual information policy gradient framework for automatic diagnosis
Cheng et al. Facial expression recognition method based on improved VGG convolutional neural network
CN107247881A (en) A kind of multi-modal intelligent analysis method and system
CN109949929A (en) A kind of assistant diagnosis system based on the extensive case history of deep learning
Wang et al. From lsat: The progress and challenges of complex reasoning
CN111428481A (en) Entity relation extraction method based on deep learning
CN114201592A (en) Visual question-answering method for medical image diagnosis
CN116611024A (en) Multi-mode trans mock detection method based on facts and emotion oppositivity
CN114220516A (en) Brain CT medical report generation method based on hierarchical recurrent neural network decoding
Li et al. Graph diffusion convolutional network for skeleton based semantic recognition of two-person actions
CN113779298B (en) Medical vision question-answering method based on composite loss
Pan et al. Muvam: A multi-view attention-based model for medical visual question answering
CN115862837A (en) Medical visual question-answering method based on type reasoning and semantic constraint
Mendoza et al. Application of data mining techniques in diagnosing various thyroid ailments: a review
CN114944002B (en) Text description-assisted gesture-aware facial expression recognition method
CN114821245B (en) Medical visual question-answering method based on global visual information intervention
CN115659991A (en) Brain CT medical report automatic generation method based on co-occurrence relationship layered attention
Melnyk et al. Generative Artificial Intelligence Terminology: A Primer for Clinicians and Medical Researchers
CN115588486A (en) Traditional Chinese medicine diagnosis generating device based on Transformer and application thereof
Zheng et al. Modular graph attention network for complex visual relational reasoning
CN116756361A (en) Medical visual question-answering method based on corresponding feature fusion
CN115017910A (en) Entity relation joint extraction method, network, equipment and computer readable storage medium based on Chinese electronic medical record
CN110473636A (en) Intelligent doctor's advice recommended method and system based on deep learning
CN117407541B (en) Knowledge graph question-answering method based on knowledge enhancement

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant