CN113779298A - Medical vision question-answering method based on composite loss - Google Patents

Medical vision question-answering method based on composite loss Download PDF

Info

Publication number
CN113779298A
CN113779298A
Authority
CN
China
Prior art keywords
question
image
medical
answer
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111085818.4A
Other languages
Chinese (zh)
Other versions
CN113779298B (en)
Inventor
Pan Haiwei (潘海为)
He Shuning (何舒宁)
Zhang Kejia (张可佳)
Chen Chunling (陈春伶)
Shi Kun (史坤)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN202111085818.4A priority Critical patent/CN113779298B/en
Publication of CN113779298A publication Critical patent/CN113779298A/en
Application granted granted Critical
Publication of CN113779298B publication Critical patent/CN113779298B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00ICT specially adapted for the handling or processing of medical images
    • G16H30/40ICT specially adapted for the handling or processing of medical images for processing medical images, e.g. editing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention belongs to the technical field at the intersection of medical imaging and artificial intelligence, and specifically relates to a medical visual question-answering method based on a composite loss. Addressing the problem that most medical visual question-answering methods concentrate on visual content and neglect the importance of text, the method extracts features of the image and of the question and then uses a multi-view attention mechanism to associate the question with both the image and its own words, and trains the whole model jointly with a classification loss and an image-question complementary loss. This compensates for the neglect of text-information mining in most existing medical visual question-answering methods, realizes multi-angle attention to the question, and improves the effectiveness of the medical visual question-answering method. The invention can effectively solve the medical visual question-answering task.

Description

Medical vision question-answering method based on composite loss
Technical Field
The invention belongs to the technical field at the intersection of medical imaging and artificial intelligence, and specifically relates to a medical visual question-answering method based on a composite loss.
Background
With the development of artificial intelligence, visual question answering has become one of the most popular current research topics. It is a challenging multi-modal task that draws on two major research areas, computer vision and natural language processing. The most common application of visual question answering is to help visually impaired people obtain more information about the virtual or real world, which can greatly improve their quality of life. With the continuous development of intelligent healthcare, visual question-answering tasks in the professional medical field are gradually becoming known to the public. Given a medical image and a corresponding text question, the correct answer should be predicted. Medical visual question answering highlights the specialized nature of the images and text, so the rich content of the medical image must be deeply understood and the complex semantics of the clinical question accurately explored. The task can assist doctors in diagnosing, answering and pre-judging diseases in advance, thereby greatly reducing the probability of misdiagnosis and missed diagnosis, improving accuracy, and shortening diagnosis and treatment time while improving efficiency. For patients, when a troubling question or symptom is encountered, a reference answer can be obtained immediately to judge and guard against the condition at the earliest opportunity.
However, current research on medical visual question-answering tasks is very limited. On the one hand, the concepts behind medical terminology are complex, and understanding clinical text is challenging. On the other hand, the imaging principles of medical images are complex and differ from those of natural images; most of the information in a medical image has potential value, and slight changes may indicate the location of a lesion. Although most deep-learning methods perform well in medical image analysis, current medical visual question-answer datasets lack large-scale labeled training data. If transfer learning is used to move a deep-learning model trained on a general visual question-answering dataset to the medical visual question-answering task and fine-tune it with a small number of medical images, the final result is poor because of the difference between natural and medical images. Moreover, modeling only the semantics of the text or only the visual content of the image cannot meet the requirements of a multi-modal task: the image and the question are correlated, and the relation between them is even more important.
Disclosure of Invention
The invention aims to provide a medical visual question-answering method based on a composite loss, addressing the problem that most medical visual question-answering methods concentrate on visual content and neglect the importance of text; text information can be effectively mined and multi-angle attention to the question realized, thereby improving the effectiveness of the medical visual question-answering method.
The purpose of the invention is realized by the following technical scheme: the method comprises the following steps:
Step 1: acquiring a medical visual question-answer data set, and extracting the medical image features v and the question features for the two modalities, vision and text;
Step 2: feeding the image features and question features obtained in step 1 into a multi-view attention mechanism, which comprises an image-to-question attention mechanism and a word-to-text attention mechanism; the image-to-question attention mechanism yields the attention weights of the image over the question and the visually guided text features Q_m, and the word-to-text attention mechanism yields the attention weight a_q of the words over the question;
Step 3: passing the visually guided text features Q_m and the image features v into a multi-modal fusion model, and outputting the fused multi-modal features M_cl and M_op:

M_cl = F_θ(Q_m, v)   (for closed question-answer pairs)
M_op = F_θ(Q_m, v)   (for open question-answer pairs)
where F denotes the multi-modal feature fusion, which adopts a bilinear attention network to learn a joint representation of the image and the question, and the subscript θ denotes the trainable parameters of the fusion; cl and op denote closed and open question-answer pairs, respectively;
Step 4: passing the multi-modal features M_cl and M_op of the closed and open question-answer pairs into a classification model consisting of a two-layer MLP (multi-layer perceptron) to obtain the probabilities of the candidate answers; taking the answer with the highest probability in the candidate answer set as the final predicted output y_cl or y_op; during model training, using a binary cross-entropy loss L_c together with an image-question complementary loss L_mq to form a composite-loss module that jointly optimizes the model:

Loss = L_c + γ·L_mq
L_c = BCE(ŷ_cl, y_cl) + BCE(ŷ_op, y_op)
L_mq = ||a_m − a_q||₂²

where BCE(·) denotes the binary cross-entropy loss function, ŷ denotes a predicted answer, y denotes the true answer, and γ is a hyper-parameter;
Step 5: acquiring the medical visual question to be answered, executing steps 1 to 3 to extract the fused multi-modal features M_cl and M_op, inputting them into the trained classification model, and taking the answer with the highest probability in the candidate answer set as the output.
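As an illustration only, the inference flow of steps 1 to 5 can be sketched as follows in PyTorch; the module names (question_encoder, image_encoder, word_attention, image_attention, fusion, classifier) are hypothetical stand-ins for the components described above, not the patented implementation.

```python
import torch

def predict(image, tokens, modules, candidate_answers):
    # Step 1: extract question features (D, Q) and image features v
    D, Q = modules["question_encoder"](tokens)
    v = modules["image_encoder"](image)
    # Step 2: multi-view attention (word-to-text and image-to-question)
    a_q = modules["word_attention"](D, Q)
    a_m, Q_m = modules["image_attention"](Q, v)
    # Step 3: multi-modal fusion of the visually guided question and the image
    M = modules["fusion"](Q_m, v)
    # Steps 4-5: score the candidate answers and return the most probable one
    probs = torch.sigmoid(modules["classifier"](M))
    return [candidate_answers[i] for i in probs.argmax(dim=-1).tolist()]
```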
The present invention may further comprise:
the method for extracting the medical image features in the step 1 specifically comprises the following steps: initializing a pre-training weight represented by an image by adopting model unknown element learning and a convolution noise reduction self-encoder together; the structure of model agnostic meta-learning consists of four 3 x 3 convolutional layers and one average pooling layer, each convolutional layer contains 64 filters and one nonlinear layer; the convolution noise reduction self-encoder is a combination of a series of convolution layers and a maximum pooling layer; self-coding medical image through model-agnostic meta-learning and convolution noise reductionThe device respectively obtains 64-dimensional vector features, and connects the 64-dimensional vector features in series to obtain the final medical image features, wherein the image features are expressed as
Figure BDA0003265662970000026
dk128 denotes the dimension of the image feature.
The method for extracting the question features in step 1 is specifically: each question is unified into a sentence of n words; if the question is longer than n words, the excess is deleted; if the question consists of fewer than n words, it is zero-padded until its length is n. First, each word in the question is represented by a 300-dimensional GloVe word embedding, giving D ∈ R^{n×d_h}, where d_h = 300 is the dimension of each word embedding; the word-embedding representation is then fed into a gated recurrent unit (GRU) network to encode the question embedding Q ∈ R^{n×d_s}, where d_s = 1024 is the dimension of each hidden state of the GRU network.
The word-to-text attention mechanism in step 2 specifically comprises the following steps:
Step 2.1.1: concatenating the word-embedding representation D and the question-embedding representation Q to obtain Q_c:

Q_c = [D || Q]

where || denotes concatenation along the feature dimension and Q_c ∈ R^{n×(d_h+d_s)};
Step 2.1.2: exploiting the context-independent character of the word embedding and the context-dependent character of the question embedding, and using a sigmoid activation function as a selection mechanism to control the output, thereby obtaining a question representation Q̃ with useless noise filtered out:

Q̃ = tanh(Q_c·W_1) ⊙ σ(Q_c·W_2)

where tanh(·) is the gated hyperbolic-tangent activation function, σ(·) is the sigmoid activation function, W_1, W_2 ∈ R^{(d_h+d_s)×d_s} are learned weights, and ⊙ is the Hadamard product;
Step 2.1.3: obtaining the semantic-level importance weight a_q ∈ R^{n×1} of the question:

a_q = softmax(Q̃·w_q)

where w_q ∈ R^{d_s×1} is a learned weight.
The image-to-question attention mechanism in step 2 is specifically as follows:
Step 2.2.1: accurately mining the degree of association between the image and the question with an attention weight:

a_m = softmax(Qᵀ·MLP(v))

where a_m ∈ R^{n×1} is the weight distribution that the image in the question-answer pair assigns to the n words of the question; each element of a_m corresponds to the degree of correlation between a word and the image, and the larger the element value, the higher the correlation; MLP(·) is a multi-layer perceptron used to align the dimensions of Q and v;
Step 2.2.2: applying the visually guided question-importance weight matrix a_m to the question embedding (text features) Q obtained in step 1 to obtain the visually guided text features Q_m:

Q_m = a_mᵀ ⊙ Q.
The invention has the beneficial effects that:
aiming at the problem that most medical visual question-answering methods concentrate on visual contents and neglect text importance, the problems are associated with images and words by adopting a multi-view attention mechanism after characteristics of the images and the problems are extracted, and the whole model is trained by adopting classification loss and image problem complementary loss together, so that the problem that most existing medical visual question-answering methods neglect excavation of text information importance is compensated, the problem of multi-angle concern on the problems is realized, and the effectiveness of the medical visual question-answering methods is improved. The invention can effectively solve the medical visual question-answering task.
Drawings
Fig. 1 is an overall framework diagram of the present invention.
FIG. 2 is a table comparing the accuracy of different medical visual question answering methods under the VQA-RAD test set in the experiment of the present invention.
Fig. 3 is an analysis chart of an ablation experiment of the method of the present invention.
FIG. 4 is a visual assessment diagram of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention aims to provide a medical visual question-answering method based on a composite loss, addressing the problem that most medical visual question-answering methods concentrate on visual content and neglect the importance of text; text information can be effectively mined and multi-angle attention to the question realized, thereby improving the effectiveness of the medical visual question-answering method.
Referring to fig. 1, the implementation steps of the invention are as follows:
the method comprises the following steps: for the two modalities of vision and text, features of medical images and problems are extracted by different methods.
The medical image feature-extraction method overcomes the limitation of scarce labeled data: the pre-training weights of the image representation are initialized jointly by model-agnostic meta-learning and a convolutional denoising auto-encoder. The model-agnostic meta-learning structure consists of four 3×3 convolutional layers and one average-pooling layer; each convolutional layer contains 64 filters and a non-linear layer. The convolutional denoising auto-encoder is a combination of a series of convolutional layers and max-pooling layers. The medical image passes through the model-agnostic meta-learning network and the convolutional denoising auto-encoder, each of which yields a 64-dimensional feature vector, and the two vectors are concatenated to obtain the final medical image features. The image features are represented as v ∈ R^{d_k}, where d_k = 128 is the dimension of the image features.
When extracting the question features, each question is unified into a sentence of n words. If the question is longer than n words, the excess is deleted; if it consists of fewer than n words, it is zero-padded until its length is n. First, each word in the question is represented by a 300-dimensional GloVe word embedding, giving D ∈ R^{n×d_h}, where d_h = 300 is the dimension of each word embedding. The word-embedding representation is then fed into a gated recurrent unit (GRU) network to encode the question embedding Q ∈ R^{n×d_s}, where d_s = 1024 is the dimension of each hidden state of the GRU network.
The above steps use different feature-extraction methods to obtain the image features and the question features for the medical-image and text-question modalities, respectively.
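As a sketch under assumptions (not the patented implementation), the two step-one encoders could be organized as below. The four 3×3 convolutions with 64 filters, the 64+64 = 128-dimensional concatenation, the 300-dimensional GloVe embedding and the 1024-dimensional GRU follow the description above, while the exact layer sizes of the convolutional denoising auto-encoder branch and the vocabulary handling are assumptions.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Two visual branches (MAML-style CNN / CDAE-style encoder), 64-d each, concatenated to 128-d."""
    def __init__(self):
        super().__init__()
        def conv_block(c_in, c_out):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU())
        # four 3x3 conv layers of 64 filters plus average pooling (MAML branch structure)
        self.maml = nn.Sequential(conv_block(1, 64), conv_block(64, 64),
                                  conv_block(64, 64), conv_block(64, 64),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # convolution + max-pooling stack standing in for the CDAE encoder (sizes assumed)
        self.cdae = nn.Sequential(conv_block(1, 32), nn.MaxPool2d(2),
                                  conv_block(32, 64), nn.MaxPool2d(2),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, img):                                         # img: (B, 1, H, W)
        return torch.cat([self.maml(img), self.cdae(img)], dim=1)   # (B, 128)

class QuestionEncoder(nn.Module):
    """GloVe word embeddings (300-d) fed into a GRU with 1024-d hidden states."""
    def __init__(self, vocab_size, d_h=300, d_s=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_h)   # weights would be loaded from GloVe
        self.gru = nn.GRU(d_h, d_s, batch_first=True)

    def forward(self, tokens):                   # tokens: (B, n) word indices
        D = self.embed(tokens)                   # (B, n, 300)  word embeddings
        Q, _ = self.gru(D)                       # (B, n, 1024) question embedding
        return D, Q
```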
Step two: the image features and the question features obtained in step one are fed into the multi-view attention mechanism, which comprises an image-to-question attention mechanism and a word-to-text attention mechanism. The attention weights of the image over the question and the text features under visual guidance are obtained in the image-to-question attention mechanism, and the attention weights of the words over the question are obtained in the word-to-text attention mechanism. The multi-view attention mechanism allows the question to be analyzed more thoroughly and prepares the ground for obtaining an accurate answer.
Word-to-text attention mechanism: the question representation Q obtained in step one ignores the different degrees of importance of different words. To emphasize the key words in the question, the method therefore uses a word-to-text attention mechanism. It exploits the advantages of both the word-embedding representation and the question-feature representation produced during question feature extraction in step one, and assigns a weight to every word in the question, a process consistent with the attention process of the human brain. The mechanism captures the importance of the question at the semantic level. First, the word-embedding representation D and the question-embedding representation Q are concatenated to obtain Q_c:

Q_c = [D || Q]    (1)

where || denotes concatenation along the feature dimension and Q_c ∈ R^{n×(d_h+d_s)}.
and then using the context-independent characteristics of word embedding and the context-dependent characteristics of question embedding, and using the sigmoid activation function as a selection mechanism to control output so as to obtain a question representation for filtering useless noise
Figure BDA0003265662970000053
Figure BDA0003265662970000054
In the formula: tan () and σ () are activation functions called gated hyperbolic tangent and sigmoid, respectively;
Figure BDA0003265662970000055
is the learning weight; as is the hadamard product.
Finally, the importance weight a_q ∈ R^{n×1} of the question is obtained at the semantic level:

a_q = softmax(Q̃·w_q)    (3)

where w_q ∈ R^{d_s×1} is a learned weight.
Image-to-question attention mechanism: this mechanism is introduced to establish the relationship between the visual and textual modalities and to observe the question from the visual perspective in order to mine effective information. The image assigns importance weights to the words of the question, so that the words of significant meaning are found under visual guidance. The degree of association between the image and the question is mined accurately with an attention weight:

a_m = softmax(Qᵀ·MLP(v))    (4)

where MLP(·) is a multi-layer perceptron used to align the dimensions of Q and v, and a_m ∈ R^{n×1} is the weight distribution that the image in the question-answer pair assigns to the n words of the question. Each element of a_m corresponds to the degree of correlation between a word and the image; the larger the element value, the higher the correlation.
After the visually guided question-importance weight matrix a_m is obtained, a_m is applied to the question embedding Q obtained in step one. The image-to-question attention mechanism thus yields the question embedding Q_m fused with the image features:

Q_m = a_mᵀ ⊙ Q    (5)

where ⊙ is the Hadamard product and Q_m are the text features learned under visual guidance.
At this point the question embedding contains not only single-modality features at the text-semantic level but also features contributed from the image level. Through the image-to-question attention mechanism, the features of the two modalities can accurately capture the fine-grained relation between vision and text: the mechanism assigns different importance weights to the text according to how relevant the image is to each word of the question.
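A minimal sketch of the image-to-question attention of equations (4)-(5) follows; it assumes a two-layer MLP projecting the 128-dimensional image feature into the d_s-dimensional question space and realizes the Hadamard product by broadcasting the n×1 weights over Q. It is illustrative only, not the patented implementation.

```python
import torch
import torch.nn as nn

class ImageToQuestionAttention(nn.Module):
    def __init__(self, d_k=128, d_s=1024):
        super().__init__()
        # MLP aligning the image feature v with the question dimension d_s
        self.mlp = nn.Sequential(nn.Linear(d_k, d_s), nn.ReLU(), nn.Linear(d_s, d_s))

    def forward(self, Q, v):
        # scores correspond to Q·MLP(v) with batch-first Q of shape (B, n, d_s)
        scores = torch.bmm(Q, self.mlp(v).unsqueeze(-1))       # (B, n, 1)
        a_m = torch.softmax(scores, dim=1)                     # image-to-word weights
        # Q_m = a_m ⊙ Q: re-weight the question embedding under visual guidance
        Q_m = a_m * Q                                          # (B, n, d_s)
        return a_m, Q_m
```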
Step three: the output of the multi-view attention mechanism is passed to the composite loss. To make the prediction of the correct answer more accurate, the composite loss is composed of a classification loss and an image-question complementary loss, which train the model together. The classification loss is used to accurately predict the answer distribution after multi-modal feature fusion, and the image-question complementary loss is used to improve the similarity between the textual and visual cross-modal features and to minimize the difference between the importance the words learn for the question and the importance the image learns for the question.
Classification loss: after the visually guided text features are obtained, the question-answer pairs are divided into open and closed types according to the answer type, and the accuracy of each type is compared separately. The question representations Q_m of the two types and the image features v are passed into a general multi-modal fusion model, and the fused multi-modal features are output:

M_cl = F_θ(Q_m, v)    (6)   (for closed question-answer pairs)
M_op = F_θ(Q_m, v)    (7)   (for open question-answer pairs)

In formulas (6) and (7): F is the multi-modal feature-fusion method, which adopts a bilinear attention network to learn a joint representation of the image and the question; θ are the trainable parameters of the fusion; cl and op denote closed and open question-answer pairs, respectively.
To predict the best answer, the method passes the multi-modal features M_cl and M_op of the open and closed question-answer pairs into a classifier consisting of a two-layer MLP, which yields the probabilities of the candidate answers. The answer with the highest probability in the candidate answer set is taken as the final predicted output y_cl or y_op. At this stage a binary cross-entropy loss L_c is used during training:

L_c = BCE(ŷ_cl, y_cl) + BCE(ŷ_op, y_op)    (8)

where BCE(·) denotes the binary cross-entropy loss function, ŷ is the predicted answer, y is the true answer, and cl and op denote closed and open question-answer pairs, respectively.
Image-question complementary loss: during model training, to improve the similarity between the visual-textual cross-modal features, the difference between the importance the words learn for the question and the importance learned for the question under visual guidance is minimized. The method uses the weight a_m obtained from the image-to-question attention mechanism and the attention weight a_q produced by the word-to-text attention mechanism to define the image-question complementary loss L_mq, which guides the learning of the question importance jointly:

L_mq = ||a_m − a_q||₂²    (9)
the composite loss module, which consists of the above classification loss and image problem complementary loss, is used for the joint optimization model:
Loss=Lc+γLmq (10)
in the formula: γ is a hyperparameter.
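A minimal sketch of the composite loss of equations (8)-(10) together with the two-layer MLP classifier is given below. The squared-L2 form of the image-question complementary loss and the hidden width of the classifier are assumptions made for illustration; the sketch is not the patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def composite_loss(logits_cl, y_cl, logits_op, y_op, a_m, a_q, gamma=1.6):
    # L_c: binary cross-entropy over the closed and open answer distributions
    l_c = F.binary_cross_entropy_with_logits(logits_cl, y_cl) + \
          F.binary_cross_entropy_with_logits(logits_op, y_op)
    # L_mq (assumed squared L2): pull the visually guided word importance a_m
    # toward the semantic word importance a_q; both have shape (B, n, 1)
    l_mq = ((a_m - a_q) ** 2).sum(dim=1).mean()
    return l_c + gamma * l_mq

class AnswerClassifier(nn.Module):
    """Two-layer MLP mapping a fused multi-modal feature to candidate-answer scores."""
    def __init__(self, d_fused, n_answers, d_hidden=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_fused, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, n_answers))

    def forward(self, m):
        return self.net(m)   # logits over the candidate answer set
```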
Compared with the prior art, the invention has the following beneficial effects: the core technical content of the invention is a composite-loss-based medical visual question-answering method that, after extracting the features of the image and of the question, uses a multi-view attention mechanism to associate the question with the image and with the words, and trains the whole model jointly with a classification loss and an image-question complementary loss. This compensates for the neglect of text-information mining in most existing medical visual question-answering methods and realizes attention to the question from multiple angles.
Experiments verify that the proposed composite-loss-based medical visual question-answering method can attend to the question from multiple angles and effectively mine text information. Its realization is of great significance for the application of current medical visual question answering.
Experimental platform: all experiments were run on a GTX 1080 Ti GPU server; the experiments were implemented in the Python programming language with the PyCharm IDE, and the deep-learning framework used was PyTorch.
(1) Experimental parameters
The question length n used when obtaining the question features in step one is 12, i.e. each question consists of 12 words. The word-embedding representations and the question features were obtained with the GloVe method and a gated recurrent unit network, respectively, where the hidden layer of the GRU network has 1024 dimensions. In the experiments, the batch size was set to 64, and an Adamax optimizer with a learning rate of 0.005 was used for training.
(2) Content of the experiment
Experiment 1: introduction of data sets.
The VQA-RAD dataset is the first manually constructed dataset of natural questions and reference answers about radiological images in the field of medical visual question answering. It contains 315 radiological images in total, evenly distributed over the head, chest and abdomen. The questions fall into 11 categories according to question type, including location, size, and so on. The question-answer pairs are divided into open and closed types according to the answer type: questions that are generally of a selective (multiple-choice) nature are called closed question-answer pairs, and the rest are open question-answer pairs. The dataset is divided into a training set and a test set containing 3064 and 451 question-answer pairs, respectively.
Experiment 2: the effect of different medical visual question-answering methods was tested in the test set of the medical visual question-answering data set VQA-RAD, and a comparison graph of accuracy is shown in fig. 2.
Experimental results: as shown in FIG. 2, the proposed method improves on the other existing methods on the VQA-RAD dataset. Our method is superior to the other methods in accuracy on open, closed, and overall question-answer pairs. Compared with Med-VQA, the best-performing method among those compared, the accuracy on the three kinds of question-answer pairs is improved by about 3 percentage points on average.
Analysis: the method does not model the individual modalities in isolation; it also effectively mines the relationship between the modalities. The experimental results show that using the relation between text and vision to build an attention mechanism between the question and the image gives a better understanding of the latent meaning of the question and the image and finds the keywords matching the image, so the predicted answers are more accurate and more stable, which proves the effectiveness of the method.
Experiment 3: an ablation study of each component of the proposed method is shown in fig. 3.
Experimental results: as shown in FIG. 3, the image-to-question attention component and the image-question complementary-loss component are evaluated on the VQA-RAD dataset. The experimental results show that the cooperation of the two components outperforms either component working alone, and either component alone already outperforms the baseline method.
Analysis: the attention mechanism can explore the close relationship between the image and the question in a question-answer pair. The image-question complementary loss further improves the similarity between vision and text and minimizes the difference between the word-level and image-level learning of the question. Working together, the two achieve the best effect.
Experiment 4: and (3) analyzing the optimal value of the hyper-parameter gamma in the composite loss.
Experimental results: the hyper-parameter γ in the composite loss was set to different values and evaluated on the open, closed, and overall question-answer pairs. The performance of the method varies as γ changes. The best results are obtained when γ = 1.6, where all three accuracies are particularly outstanding.
Analysis: compared with the accuracy when γ = 0, the accuracies on the open, closed, and overall question-answer pairs are all improved, which proves that the image-question complementary loss has a clear influence on the proposed method.
Experiment 5: the visual evaluation of the method of the invention is shown in fig. 4.
Experimental results: as shown in FIG. 4, the proposed method can, in general, accurately locate the visual information and the text keywords involved in the visual question-answering task.
Analysis: the proposed method predicts the correct answer for most images and questions. Under the combined action of the multi-view attention mechanism and the composite loss, the method can correctly locate the key regions in the image and the key words in the question, and finally predicts the correct answer according to the located image regions and words.
In conclusion, the composite-loss-based medical visual question-answering method can effectively solve the medical visual question-answering task. It not only extracts image and question features but also uses the multi-view attention mechanism to explore the latent influence of the words and the image on the question, so that the semantic relation between image and text is used to effectively mine the important information of the text; the model is trained with the composite loss to optimize the method, and the accuracy of the medical visual question-answering task is finally improved.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A medical vision question-answering method based on composite loss is characterized by comprising the following steps:
Step 1: acquiring a medical visual question-answer data set, and extracting the medical image features v and the question features for the two modalities, vision and text;
Step 2: feeding the image features and question features obtained in step 1 into a multi-view attention mechanism, which comprises an image-to-question attention mechanism and a word-to-text attention mechanism; the image-to-question attention mechanism yields the attention weights of the image over the question and the visually guided text features Q_m, and the word-to-text attention mechanism yields the attention weight a_q of the words over the question;
Step 3: passing the visually guided text features Q_m and the image features v into a multi-modal fusion model, and outputting the fused multi-modal features M_cl and M_op:

M_cl = F_θ(Q_m, v)   (for closed question-answer pairs)
M_op = F_θ(Q_m, v)   (for open question-answer pairs)

where F denotes the multi-modal feature fusion, which adopts a bilinear attention network to learn a joint representation of the image and the question, and the subscript θ denotes the trainable parameters of the fusion; cl and op denote closed and open question-answer pairs, respectively;
Step 4: passing the multi-modal features M_cl and M_op of the open and closed question-answer pairs into a classification model consisting of a two-layer MLP (multi-layer perceptron) to obtain the probabilities of the candidate answers; taking the answer with the highest probability in the candidate answer set as the final predicted output y_cl or y_op; during model training, using a binary cross-entropy loss L_c together with an image-question complementary loss L_mq to form a composite-loss module that jointly optimizes the model:

Loss = L_c + γ·L_mq
L_c = BCE(ŷ_cl, y_cl) + BCE(ŷ_op, y_op)
L_mq = ||a_m − a_q||₂²

where BCE(·) denotes the binary cross-entropy loss function, ŷ denotes a predicted answer, y denotes the true answer, and γ is a hyper-parameter;
Step 5: acquiring the medical visual question to be answered, executing steps 1 to 3 to extract the fused multi-modal features M_cl and M_op, inputting them into the trained classification model, and taking the answer with the highest probability in the candidate answer set as the output.
2. The composite-loss-based medical visual question-answering method according to claim 1, characterized in that the method for extracting the medical image features in step 1 is specifically: the pre-training weights of the image representation are initialized jointly by model-agnostic meta-learning and a convolutional denoising auto-encoder; the model-agnostic meta-learning structure consists of four 3×3 convolutional layers and one average-pooling layer, where each convolutional layer contains 64 filters and a non-linear layer; the convolutional denoising auto-encoder is a combination of a series of convolutional layers and max-pooling layers; the medical image passes through the model-agnostic meta-learning network and the convolutional denoising auto-encoder, each yielding a 64-dimensional feature vector, and the two vectors are concatenated to obtain the final medical image features v ∈ R^{d_k}, where d_k = 128 is the dimension of the image features.
3. The composite-loss-based medical visual question-answering method according to claim 1 or 2, characterized in that the method for extracting the question features in step 1 is specifically: each question is unified into a sentence of n words; if the question is longer than n words, the excess is deleted; if the question consists of fewer than n words, it is zero-padded until its length is n; first, each word in the question is represented by a 300-dimensional GloVe word embedding, giving D ∈ R^{n×d_h}, where d_h = 300 is the dimension of each word embedding; the word-embedding representation is then fed into a gated recurrent unit (GRU) network to encode the question embedding Q ∈ R^{n×d_s}, where d_s = 1024 is the dimension of each hidden state of the GRU network.
4. The composite-loss-based medical visual question-answering method according to claim 3, characterized in that the word-to-text attention mechanism in step 2 specifically comprises the following steps:
Step 2.1.1: concatenating the word-embedding representation D and the question-embedding representation Q to obtain Q_c:

Q_c = [D || Q]

where || denotes concatenation along the feature dimension and Q_c ∈ R^{n×(d_h+d_s)};
Step 2.1.2: exploiting the context-independent character of the word embedding and the context-dependent character of the question embedding, and using a sigmoid activation function as a selection mechanism to control the output, thereby obtaining a question representation Q̃ with useless noise filtered out:

Q̃ = tanh(Q_c·W_1) ⊙ σ(Q_c·W_2)

where tanh(·) is the gated hyperbolic-tangent activation function, σ(·) is the sigmoid activation function, W_1, W_2 ∈ R^{(d_h+d_s)×d_s} are learned weights, and ⊙ is the Hadamard product;
Step 2.1.3: obtaining the semantic-level importance weight a_q ∈ R^{n×1} of the question:

a_q = softmax(Q̃·w_q)

where w_q ∈ R^{d_s×1} is a learned weight.
5. The composite-loss-based medical visual question-answering method according to claim 3, characterized in that the image-to-question attention mechanism in step 2 is specifically as follows:
Step 2.2.1: accurately mining the degree of association between the image and the question with an attention weight:

a_m = softmax(Qᵀ·MLP(v))

where a_m ∈ R^{n×1} is the weight distribution that the image in the question-answer pair assigns to the n words of the question; each element of a_m corresponds to the degree of correlation between a word and the image, and the larger the element value, the higher the correlation; MLP(·) is a multi-layer perceptron used to align the dimensions of Q and v;
Step 2.2.2: applying the visually guided question-importance weight matrix a_m to the text features Q obtained in step 1 to obtain the visually guided text features Q_m:

Q_m = a_mᵀ ⊙ Q.
6. The composite-loss-based medical visual question-answering method according to claim 4, characterized in that the image-to-question attention mechanism in step 2 is specifically as follows:
Step 2.2.1: accurately mining the degree of association between the image and the question with an attention weight:

a_m = softmax(Qᵀ·MLP(v))

where a_m ∈ R^{n×1} is the weight distribution that the image in the question-answer pair assigns to the n words of the question; each element of a_m corresponds to the degree of correlation between a word and the image, and the larger the element value, the higher the correlation; MLP(·) is a multi-layer perceptron used to align the dimensions of Q and v;
Step 2.2.2: applying the visually guided question-importance weight matrix a_m to the text features Q obtained in step 1 to obtain the visually guided text features Q_m:

Q_m = a_mᵀ ⊙ Q.
CN202111085818.4A 2021-09-16 2021-09-16 Medical vision question-answering method based on composite loss Active CN113779298B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111085818.4A CN113779298B (en) 2021-09-16 2021-09-16 Medical vision question-answering method based on composite loss

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111085818.4A CN113779298B (en) 2021-09-16 2021-09-16 Medical vision question-answering method based on composite loss

Publications (2)

Publication Number Publication Date
CN113779298A true CN113779298A (en) 2021-12-10
CN113779298B CN113779298B (en) 2023-10-31

Family

ID=78844492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111085818.4A Active CN113779298B (en) 2021-09-16 2021-09-16 Medical vision question-answering method based on composite loss

Country Status (1)

Country Link
CN (1) CN113779298B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821245A (en) * 2022-05-30 2022-07-29 Dalian University Medical visual question-answering method based on global visual information intervention

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288609A (en) * 2019-05-30 2019-09-27 Nanjing Normal University Attention-mechanism-guided multi-modal whole-heart image segmentation method
CN110717431A (en) * 2019-09-27 2020-01-21 Huaqiao University Fine-grained visual question-answering method combined with a multi-view attention mechanism
US20210216862A1 (en) * 2020-01-15 2021-07-15 Beijing Jingdong Shangke Information Technology Co., Ltd. System and method for semantic analysis of multimedia data using attention-based fusion network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288609A (en) * 2019-05-30 2019-09-27 Nanjing Normal University Attention-mechanism-guided multi-modal whole-heart image segmentation method
CN110717431A (en) * 2019-09-27 2020-01-21 Huaqiao University Fine-grained visual question-answering method combined with a multi-view attention mechanism
US20210216862A1 (en) * 2020-01-15 2021-07-15 Beijing Jingdong Shangke Information Technology Co., Ltd. System and method for semantic analysis of multimedia data using attention-based fusion network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yan Ruyu; Liu Xueliang: "Visual question answering model combining a bottom-up attention mechanism and a memory network", Journal of Image and Graphics, no. 05 *
Han Kun; Pan Haiwei; Zhang Wei; Bian Xiaofei; Chen Chunling; He Shuning: "Alzheimer's disease classification method based on multi-modal medical images", Journal of Tsinghua University (Science and Technology), no. 08 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821245A (en) * 2022-05-30 2022-07-29 Dalian University Medical visual question-answering method based on global visual information intervention
CN114821245B (en) * 2022-05-30 2024-03-26 Dalian University Medical visual question-answering method based on global visual information intervention

Also Published As

Publication number Publication date
CN113779298B (en) 2023-10-31

Similar Documents

Publication Publication Date Title
Liu et al. Learning a recurrent residual fusion network for multimodal matching
Xia et al. Generative adversarial regularized mutual information policy gradient framework for automatic diagnosis
Cheng et al. Facial expression recognition method based on improved VGG convolutional neural network
CN107247881A (en) A kind of multi-modal intelligent analysis method and system
CN109949929A (en) A kind of assistant diagnosis system based on the extensive case history of deep learning
Wang et al. From lsat: The progress and challenges of complex reasoning
CN111428481A (en) Entity relation extraction method based on deep learning
CN114201592A (en) Visual question-answering method for medical image diagnosis
CN116611024A (en) Multi-mode trans mock detection method based on facts and emotion oppositivity
CN114220516A (en) Brain CT medical report generation method based on hierarchical recurrent neural network decoding
Li et al. Graph diffusion convolutional network for skeleton based semantic recognition of two-person actions
CN113779298B (en) Medical vision question-answering method based on composite loss
Pan et al. Muvam: A multi-view attention-based model for medical visual question answering
CN115862837A (en) Medical visual question-answering method based on type reasoning and semantic constraint
Mendoza et al. Application of data mining techniques in diagnosing various thyroid ailments: a review
CN114944002B (en) Text description-assisted gesture-aware facial expression recognition method
CN114821245B (en) Medical visual question-answering method based on global visual information intervention
CN115659991A (en) Brain CT medical report automatic generation method based on co-occurrence relationship layered attention
Melnyk et al. Generative Artificial Intelligence Terminology: A Primer for Clinicians and Medical Researchers
CN115588486A (en) Traditional Chinese medicine diagnosis generating device based on Transformer and application thereof
Zheng et al. Modular graph attention network for complex visual relational reasoning
CN116756361A (en) Medical visual question-answering method based on corresponding feature fusion
CN115017910A (en) Entity relation joint extraction method, network, equipment and computer readable storage medium based on Chinese electronic medical record
CN110473636A (en) Intelligent doctor's advice recommended method and system based on deep learning
CN117407541B (en) Knowledge graph question-answering method based on knowledge enhancement

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant