CN117235670A - Medical visual question answering method based on fine-grained cross-attention - Google Patents

Medical visual question answering method based on fine-grained cross-attention

Info

Publication number
CN117235670A
Authority
CN
China
Prior art keywords
layer
cross
attention
visual
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311490620.3A
Other languages
Chinese (zh)
Inventor
吴梓恒
陆振宇
舒昕垚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN202311490620.3A priority Critical patent/CN117235670A/en
Publication of CN117235670A publication Critical patent/CN117235670A/en
Pending legal-status Critical Current


Landscapes

  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The application relates to a medical visual question answering method based on fine-grained cross-attention. The method comprises the following steps: acquiring a radiological medical image and text data of the symptom question corresponding to the radiological medical image; performing local feature extraction on the radiological medical image with a fine-grained visual feature extraction module to obtain local image features; performing feature extraction on the text data with a text feature extraction module to obtain text features; inputting the multi-modal features consisting of the local image features and the text features into a cross-modal encoder module for multi-modal feature fusion to obtain fused features; inputting the fused features into an answer prediction module for answer prediction to obtain an answer prediction result; and answering the symptom question according to the answer prediction result, thereby improving the accuracy of medical visual question answering.

Description

Medical visual question answering method based on fine-grained cross-attention
Technical Field
The application relates to the technical field of image processing, and in particular to a medical visual question answering method based on fine-grained cross-attention.
Background
Medical visual question answering aims to accurately answer clinical questions posed about medical images. Despite its great potential in the healthcare industry and in medical services, this technology is still in its infancy and has not yet been put to practical use. The medical visual question answering task is very challenging because clinical questions are highly diverse and different types of questions require different visual reasoning skills. Medical visual question answering is a domain-specific visual question answering problem that requires interpreting medical visual concepts by jointly considering image and language information. Specifically, a medical visual question answering system takes a medical image and a clinical question about that image as input and outputs a correct answer in natural language. Medical visual question answering can help patients obtain timely feedback to their queries and make more informed decisions. It can relieve pressure on medical facilities and save precious medical resources for cases that urgently need them. It can also help doctors obtain a second opinion during diagnosis and reduce the high cost of training medical professionals.
In the related art, visual and linguistic reasoning requires understanding visual concepts and linguistic semantics and, most importantly, the alignment and relationships between the two modalities. Conventional visual question answering methods require a large amount of labelled data for training, and such large-scale data is often not available in the medical field. Moreover, images in the medical field differ fundamentally from general-domain images, so directly applying a general-purpose visual question answering model to the medical field is not feasible. Furthermore, medical image annotation is an expensive and time-consuming process. Applying visual question answering to the medical field has a significant impact on traditional medical research methods, and a well-designed medical visual question answering system is of great help for patient diagnosis. Due to the complexity and diversity of clinical questions and the difficulty of multi-modal reasoning, general-domain visual question answering models do not pay sufficient attention to the alignment between medical image features and text semantics, so their accuracy is low when applied to medical visual questions.
Disclosure of Invention
Based on the above, it is necessary to provide a medical visual question answering method based on fine-grained cross-attention that can answer medical visual questions accurately.
A medical visual question answering method based on fine-grained cross-attention, the method comprising:
acquiring a radiological medical image and text data of the symptom question corresponding to the radiological medical image;
performing local feature extraction on the radiological medical image with a fine-grained visual feature extraction module of a medical visual question-answering model to obtain local image features;
performing feature extraction on the text data with a text feature extraction module of the medical visual question-answering model to obtain text features;
inputting the multi-modal feature pair consisting of the local image features and the text features into a cross-modal encoder module of the medical visual question-answering model for multi-modal feature fusion to obtain fused features;
inputting the fused features into an answer prediction module of the medical visual question-answering model to perform answer prediction, so as to obtain an answer prediction result;
and answering the symptom question according to the answer prediction result.
In one embodiment, performing local feature extraction on the radiological medical image with the fine-grained visual feature extraction module of the medical visual question-answering model to obtain local image features includes:
inputting the radiological medical image into a feature extraction unit of the fine-grained visual feature extraction module for feature extraction to obtain preliminary image features;
inputting the preliminary image features into a full convolution unit of the fine-grained visual feature extraction module for processing to obtain processed image features;
and inputting the processed image features into a fine-grained visual feature extraction unit of the fine-grained visual feature extraction module for feature extraction to obtain local image features.
In one embodiment, the cross-modal encoder module of the medical visual question-answering model comprises N cross-modal encoding layers and a feature pooling layer which are sequentially connected;
the input of the first cross-modal encoding layer is the output of the fine-grained visual feature extraction module and the output of the text feature extraction module, the output of the first cross-modal encoding layer is the input of the second cross-modal encoding layer, and so on; the output of the last cross-modal encoding layer is the input of the feature pooling layer, and the output of the feature pooling layer is the input of the answer prediction module.
In one embodiment, the cross-modality encoding layer includes a first self-attention layer, a second self-attention layer, a first cross-attention layer, a second cross-attention layer, a first feed-forward sub-layer, and a second feed-forward sub-layer;
the output of the first self-attention layer is input into the first and second cross-attention layers, the output of the second self-attention layer is input into the first and second cross-attention layers, the output of the first cross-attention layer is input into the first and second feedforward sublayers, and the output of the second cross-attention layer is input into the first and second feedforward sublayers.
In one embodiment, the processing expression of the first self-attention layer is:
Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k) V_i
the processing expression of the second self-attention layer is:
Attention(Q_t, K_t, V_t) = softmax(Q_t K_t^T / √d_k) V_t
wherein Attention(Q_i, K_i, V_i) is the output of the first self-attention layer, Attention(Q_t, K_t, V_t) is the output of the second self-attention layer, Q_i, K_i and V_i are the input feature query, key and value of the first self-attention layer, Q_t, K_t and V_t are the input feature query, key and value of the second self-attention layer, d_k is the feature dimension, T denotes transposition, and softmax(·) is the softmax function.
In one embodiment, the processing expression of the first cross-attention layer is:
h_i^n = CrossAtt_{t→v}(h_i^{n-1}, v_1^{n-1}, …, v_m^{n-1})
the processing expression of the second cross-attention layer is:
v_j^n = CrossAtt_{v→t}(v_j^{n-1}, h_1^{n-1}, …, h_k^{n-1})
wherein h_i^n is the output of the first cross-attention layer at the i-th position, CrossAtt_{t→v}(·) is the cross-modal attention operation from text to vision that captures the relation between text and image, h_i^{n-1} is the hidden state at the i-th position of the (n-1)-th layer on the text side, v_1^{n-1}, …, v_m^{n-1} are the image representations from the first to the last position in the layer preceding the n-th layer, v_j^n is the output of the second cross-attention layer at the j-th position, CrossAtt_{v→t}(·) is the cross-modal attention operation from vision to text that captures the relation between image and text, v_j^{n-1} is the image representation at the j-th position of the (n-1)-th layer, and h_1^{n-1}, …, h_k^{n-1} are the text hidden states from the first to the last position in the layer preceding the n-th layer.
In one embodiment, the answer prediction module includes: a first fully connected layer, a Relu function, a weight normalization layer, and a second fully connected layer;
the first full-connection layer, the Relu function, the weight normalization layer and the second full-connection layer are sequentially connected.
In one embodiment, the loss function of the medical visual question-answering model is:
L = -(1/M) Σ_{i=1}^{M} Σ_c X_{i,c} log p(y_{i,c})
wherein L is the cross-entropy loss function, M is the number of samples, X_{i,c} is the one-hot real label of sample i for category c, y is the prediction result, and p(y_{i,c}) is the predicted probability (information quantity) that sample i belongs to category c.
According to the medical visual question answering method based on fine-grained cross-attention, a radiological medical image and text data of the symptom question corresponding to the radiological medical image are acquired. The fine-grained visual feature extraction module of the medical visual question-answering model performs local feature extraction on the radiological medical image, which yields more localized image features and suppresses useless noise. The text feature extraction module of the medical visual question-answering model performs feature extraction on the text data to obtain text features. The multi-modal feature pair consisting of the local image features and the text features is then input into the cross-modal encoder module of the medical visual question-answering model for multi-modal feature fusion; the mutual attention mechanism between regions and words can capture all potential local alignments and achieve selective context aggregation, producing the fused features. The fused features are input into the answer prediction module of the medical visual question-answering model to obtain an answer prediction result, and the symptom question is answered accurately according to the answer prediction result, thereby improving the accuracy of medical visual question answering.
Drawings
FIG. 1 is a flow chart of a medical visual question answering method based on fine-grained cross-attention in one embodiment;
FIG. 2 is a schematic diagram of a cross-modal encoding layer in one embodiment;
FIG. 3 is a schematic flow diagram illustrating a cross-modal encoding layer processing in one embodiment;
FIG. 4 is a schematic diagram of the structure of a medical visual question-answering model in one embodiment;
FIG. 5 is a schematic diagram of the cross-modality encoder module and answer prediction module of the medical visual question-answering model in one embodiment;
FIG. 6 is a schematic diagram of a multi-modal feature extraction process in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in fig. 1, a medical visual question answering method based on fine-grained cross-attention is provided. The method is described as applied to a terminal for illustration and comprises the following steps:
Step S220, acquiring a radiological medical image and text data of the symptom question corresponding to the radiological medical image.
Among these, the radiological medical image may be the most commonly used radiological (radiology) data, the content of the image being the head, chest, abdomen, etc. of the body.
The symptom questions may include: "abnormal condition", "attribute", "color", "number", "morphology", "organ type", "plane", and the like.
Step S240, performing local feature extraction on the radiological medical image with the fine-grained visual feature extraction module of the medical visual question-answering model to obtain local image features.
Local image features are obtained by extracting features from local areas of the radiological medical image and are mainly used to describe local details or key points.
The fine-grained visual feature extraction module of the medical visual question-answering model extracts fine-grained visual features from the radiological medical image, so that the obtained features are more localized and the information in the radiological medical image is easier to exploit, which improves the accuracy of the subsequent medical visual question answering.
It should be appreciated that the visual features extracted by the fine-grained visual feature extraction module may suppress unwanted noise in the key information of the fused representation. The fine-grained visual feature extraction module uses a full convolution layer instead of a linear layer to reduce the number of parameters, and a fine-grained visual feature extraction layer to focus on the extraction and identification of local features, extracting distinguishable and complementary features while preserving context information.
In one embodiment, performing local feature extraction on the radiological medical image with the fine-grained visual feature extraction module of the medical visual question-answering model to obtain local image features includes: inputting the radiological medical image into a feature extraction unit of the fine-grained visual feature extraction module for feature extraction to obtain preliminary image features; inputting the preliminary image features into a full convolution unit of the fine-grained visual feature extraction module for processing to obtain processed image features; and inputting the processed image features into a fine-grained visual feature extraction unit of the fine-grained visual feature extraction module for feature extraction to obtain local image features.
In one embodiment, the feature extraction unit may be a neural network constructed using a ResNet-152 algorithm.
ResNet-152 is adopted to extract regional features of the medical image. In previous research on image feature extraction, image features were extracted from the five convolution stages of ResNet-152, and a global average pooling layer was applied after each stage. However, this ignores fine-grained local information in the image. Instead of using a global average pooling layer, image features here are extracted directly from the fifth convolution stage of ResNet-152. The 3×224×224 input image is passed through the ResNet-152 model, and the local visual features of the image are then obtained after the full convolution unit and the fine-grained visual feature extraction unit.
The fine-grained visual feature extraction module of the application uses a full convolution unit instead of a linear layer to reduce the number of parameters and align the feature dimension from the original 2048×7×7 to 768×7×7. The fine-grained visual feature extraction unit then converts the 7×7 feature map into a sequence of length 49, so that the final image feature dimension is 49×768, achieving the desired feature-dimension alignment.
The fine-grained visual feature extraction unit is mainly based on an ROI pooling layer: according to the input image features, it maps regions of interest (ROIs) to the corresponding positions of the feature map, and the mapped regions are then divided into sections. The fine-grained visual feature extraction unit converts the feature map into a sequence of length 49, which yields the final local image features.
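As an illustrative, non-limiting sketch of the dimension changes described above (written with PyTorch and torchvision as an assumption, not the application's actual implementation), the following code extracts the 2048×7×7 feature map from the fifth convolution stage of ResNet-152, aligns it to 768 channels with a 1×1 full convolution, and flattens it into a sequence of 49 local image features:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet152

class FineGrainedVisualExtractor(nn.Module):
    """Sketch of the fine-grained visual feature extraction module:
    ResNet-152 up to the fifth convolution stage -> 1x1 full convolution
    (2048 -> 768) -> 7x7 map flattened into a sequence of 49 features."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        backbone = resnet152(weights=None)  # pretrained weights optional
        # keep everything up to (and including) the fifth convolution stage
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # full convolution unit: aligns 2048x7x7 to 768x7x7 without a linear layer
        self.full_conv = nn.Conv2d(2048, hidden_dim, kernel_size=1)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(images)             # (B, 2048, 7, 7)
        feat = self.full_conv(feat)              # (B, 768, 7, 7)
        # fine-grained unit: convert the 7x7 map into a sequence of length 49
        return feat.flatten(2).transpose(1, 2)   # (B, 49, 768)

if __name__ == "__main__":
    x = torch.randn(1, 3, 224, 224)              # a 3x224x224 radiological image
    print(FineGrainedVisualExtractor()(x).shape) # torch.Size([1, 49, 768])
```

Keeping the full 7×7 map instead of global average pooling is what preserves the 49 spatial positions that the later cross-attention can align with individual words.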
Step S260, performing feature extraction on the text data with the text feature extraction module of the medical visual question-answering model to obtain text features.
The text feature extraction module may use a BERT model, for example the BERT-base model.
It should be understood that the word sequence in the text data is directly input into the text feature extraction module that has been pre-trained, and the resulting output is the features of the words and sentences.
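A minimal sketch of this text feature extraction step, assuming the HuggingFace transformers library; the checkpoint name and the maximum question length are illustrative assumptions rather than requirements of the application:

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed generic BERT-base checkpoint; any 768-dimensional BERT works here.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

MAX_QUESTION_LEN = 20  # arbitrary illustration value
question = "Is there any abnormality in this image?"
encoded = tokenizer(question, padding="max_length", max_length=MAX_QUESTION_LEN,
                    truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = bert(input_ids=encoded["input_ids"],
                   attention_mask=encoded["attention_mask"])

# one 768-dimensional contextual embedding per token position
text_features = outputs.last_hidden_state  # shape (1, MAX_QUESTION_LEN, 768)
print(text_features.shape)
```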
Step S280, inputting the multi-modal feature pair consisting of the local image features and the text features into the cross-modal encoder module of the medical visual question-answering model for multi-modal feature fusion to obtain fused features.
The self-attention mechanism of the cross-modal encoder module attends to selected details and performs analysis on them; its core is to determine, based on the target, which parts deserve attention and then analyse those details further. The cross-attention mechanism of the cross-modal encoder module enables the model to better understand and model the correlation between data of different modalities, improving performance on multi-modal tasks, and the inter-modal relation between image regions and sentence words can be used to complement and enhance image-sentence matching.
The cross-modal encoder module takes the image regions and the sentence words as inputs. Each cross-modal layer in the cross-modal encoder module includes two self-attention sub-layers, a bi-directional cross-attention sub-layer (i.e., two unidirectional cross-attention sub-layers) and two feed-forward sub-layers. To achieve robust cross-modal matching, the cross-modal encoder module captures all potential local alignments through a mutual attention mechanism between regions and words.
In one embodiment, the cross-modal encoder module of the medical visual question-answering model includes N cross-modal encoding layers and one feature pooling layer connected in sequence; the input of the first cross-modal encoding layer is the output of the fine-grained visual feature extraction module and the output of the text feature extraction module of the medical visual question-answering model, the output of the first cross-modal encoding layer is the input of the second cross-modal encoding layer, and so on; the output of the last cross-modal encoding layer is the input of the feature pooling layer, and the output of the feature pooling layer is the input of the answer prediction module.
In one embodiment, the cross-modality encoding layer includes a first self-attention layer, a second self-attention layer, the first cross-attention layer, the second cross-attention layer, a first feed-forward sub-layer, and a second feed-forward sub-layer; the output of the first self-attention layer is input into a first cross-attention layer and a second cross-attention layer, the output of the second self-attention layer is input into the first cross-attention layer and the second cross-attention layer, the output of the first cross-attention layer is input into a first feed-forward sub-layer and a second feed-forward sub-layer, and the output of the second cross-attention layer is input into the first feed-forward sub-layer and the second feed-forward sub-layer.
Wherein the feed-forward sub-layer of the cross-modality encoding layer enhances the presentation capability of the captured visual features and problem description features. Considering that the attention mechanism may not fit well to the complex process, the model's ability is enhanced by adding feed-forward sublayers.
It should be understood that the inputs of the cross-modal encoder module are the local image features and the text features. Of the first and second self-attention layers of the first cross-modal encoding layer, one receives the local image features and the other receives the text features. After the cross processing of the first cross-modal encoding layer, the first feed-forward sub-layer of that layer outputs one feature in which the local image features and the text features have been fused, and the second feed-forward sub-layer outputs the other. The outputs of the first and second feed-forward sub-layers of one cross-modal encoding layer are passed in parallel to the first and second self-attention layers of the next cross-modal encoding layer without being crossed, and so on; finally, the first and second feed-forward sub-layers of the last cross-modal encoding layer output their features to the feature pooling layer.
As shown in fig. 2, the cross-modal encoding layers are stacked (i.e., the output of the n-th cross-modal encoding layer is used as the input of the (n+1)-th cross-modal encoding layer). In the n-th layer, a bi-directional cross-attention layer is first applied; it comprises two unidirectional cross-attention layers, one from language to vision and the other from vision to language.
The cross-modal encoding layer obtains a weight matrix (affinity matrix) with scaled dot-product attention. The present application uses a self-attention layer to align entities within each modality before the cross-attention layer; the output of the self-attention layer is the input of the cross-attention layer, which enhances the information fed into the (n+1)-th layer. At the n-th layer, a bi-directional cross-attention layer is employed, comprising two cross-attention layers: one from vision to language and the other from language to vision. As shown in fig. 3, the text feature query Q and key K are passed through the softmax function to produce a matrix of attention coefficients, which is multiplied by the value V of the visual features to give the final attention output. Likewise, the visual feature query Q and key K are passed through the softmax function to produce a matrix of attention coefficients, which is multiplied by the text feature value V to give the final attention output. The two attention outputs are fused to obtain the output result. The cross-attention fusion mechanism has global learning capability and good parallelism. It can also complement and enhance image-sentence matching through the inter-modal relation between image regions and sentence words, and it selectively aggregates context based on spatial attention maps. These characteristics indicate that mutual benefits can be realized. The feed-forward sub-layer of the cross-modal encoding layer enhances the representation capability of the captured visual features and question description features. Considering that the attention mechanism may not fit the complex process adequately, the capability of the medical visual question-answering model is enhanced by adding feed-forward sub-layers.
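The following simplified sketch illustrates the two-stream structure of one cross-modal encoding layer described above. Residual connections, layer normalisation and the exact wiring between sub-layers are omitted, and the class and parameter names are the editor's assumptions rather than the application's implementation:

```python
import torch
import torch.nn as nn

class CrossModalLayer(nn.Module):
    """Sketch of one cross-modal encoding layer: two self-attention layers,
    a bi-directional cross-attention layer (text->vision and vision->text)
    and two feed-forward sub-layers, one per modality stream."""
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.self_att_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_att_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn_img = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.ffn_txt = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, img, txt):
        # self-attention first aligns entities within each modality
        img, _ = self.self_att_img(img, img, img)
        txt, _ = self.self_att_txt(txt, txt, txt)
        # bi-directional cross-attention: queries from one modality,
        # keys/values from the other
        txt_out, _ = self.cross_txt2img(txt, img, img)  # text attends to vision
        img_out, _ = self.cross_img2txt(img, txt, txt)  # vision attends to text
        # feed-forward sub-layers enhance the representation of each stream
        return self.ffn_img(img_out), self.ffn_txt(txt_out)

img_feats, txt_feats = torch.randn(1, 49, 768), torch.randn(1, 20, 768)
fused_img, fused_txt = CrossModalLayer()(img_feats, txt_feats)
print(fused_img.shape, fused_txt.shape)  # (1, 49, 768) (1, 20, 768)
```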
In one embodiment, the processing expression of the first self-attention layer is:
Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k) V_i
In one embodiment, the processing expression of the second self-attention layer is:
Attention(Q_t, K_t, V_t) = softmax(Q_t K_t^T / √d_k) V_t
wherein Attention(Q_i, K_i, V_i) is the output of the first self-attention layer, Attention(Q_t, K_t, V_t) is the output of the second self-attention layer, Q_i, K_i and V_i are the input feature query, key and value of the first self-attention layer, Q_t, K_t and V_t are the input feature query, key and value of the second self-attention layer, d_k is the feature dimension, T denotes transposition, and softmax(·) is the softmax function.
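A small numerical illustration of the self-attention expression above, assuming PyTorch; the function below directly implements softmax(Q K^T / √d_k) V:

```python
import math
import torch

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (.., len_q, len_k)
    return torch.softmax(scores, dim=-1) @ V            # (.., len_q, d_v)

# self-attention over 49 local image features of dimension 768
x = torch.randn(1, 49, 768)
print(attention(x, x, x).shape)  # torch.Size([1, 49, 768])
```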
In one embodiment, the processing expression of the first cross-attention layer is:
h_i^n = CrossAtt_{t→v}(h_i^{n-1}, v_1^{n-1}, …, v_m^{n-1})
In one embodiment, the processing expression of the second cross-attention layer is:
v_j^n = CrossAtt_{v→t}(v_j^{n-1}, h_1^{n-1}, …, h_k^{n-1})
wherein h_i^n is the output of the first cross-attention layer at the i-th position, CrossAtt_{t→v}(·) is the cross-modal attention operation from text to vision that captures the relation between text and image, h_i^{n-1} is the hidden state at the i-th position of the (n-1)-th layer on the text side, v_1^{n-1}, …, v_m^{n-1} are the image representations from the first to the last position in the layer preceding the n-th layer, v_j^n is the output of the second cross-attention layer at the j-th position, CrossAtt_{v→t}(·) is the cross-modal attention operation from vision to text that captures the relation between image and text, v_j^{n-1} is the image representation at the j-th position of the (n-1)-th layer, and h_1^{n-1}, …, h_k^{n-1} are the text hidden states from the first to the last position in the layer preceding the n-th layer.
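For illustration, the same scaled dot-product attention can be applied across modalities by taking the queries from one modality and the keys and values from the other, which is the pattern the two cross-attention expressions above describe. The sketch below assumes PyTorch 2.x and its built-in scaled_dot_product_attention; the sequence lengths are arbitrary:

```python
import torch
import torch.nn.functional as F

txt = torch.randn(1, 20, 768)  # text hidden states h^(n-1), one per word position
img = torch.randn(1, 49, 768)  # image representations v^(n-1), one per region

# text-to-vision: text queries attend over image keys/values (CrossAtt_t->v)
txt_ctx = F.scaled_dot_product_attention(txt, img, img)  # (1, 20, 768)
# vision-to-text: image queries attend over text keys/values (CrossAtt_v->t)
img_ctx = F.scaled_dot_product_attention(img, txt, txt)  # (1, 49, 768)
print(txt_ctx.shape, img_ctx.shape)
```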
Step S300, inputting the fused features into the answer prediction module of the medical visual question-answering model for answer prediction to obtain an answer prediction result.
The answer prediction module may employ a VQA classifier.
In one embodiment, the answer prediction module comprises: a first fully connected layer, a Relu function, a weight normalization layer, and a second fully connected layer; the first full-connection layer, the Relu function, the weight normalization layer and the second full-connection layer are sequentially connected.
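A sketch of such an answer prediction head, assuming PyTorch; the hidden size, the number of candidate answers and the reading of "weight normalization layer" as weight normalisation applied to the second fully connected layer are illustrative assumptions:

```python
import torch
import torch.nn as nn

num_answers = 100  # illustrative; set to the size of the answer set
classifier = nn.Sequential(
    nn.Linear(768, 1536),                              # first fully connected layer
    nn.ReLU(),                                         # ReLU activation
    nn.utils.weight_norm(nn.Linear(1536, num_answers)) # weight-normalised second FC layer
)

fused = torch.randn(1, 768)       # pooled fused feature from the cross-modal encoder
logits = classifier(fused)        # one score per candidate answer
answer_id = logits.argmax(dim=-1) # index of the predicted answer
print(logits.shape, answer_id)
```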
Step S320, answering the symptom question according to the answer prediction result.
According to the medical visual question answering method based on fine-grained cross-attention, a radiological medical image and text data of the symptom question corresponding to the radiological medical image are acquired. The fine-grained visual feature extraction module of the medical visual question-answering model performs local feature extraction on the radiological medical image, so that the obtained fine-grained visual features are more localized and useless noise is suppressed, yielding the local image features. The text feature extraction module of the medical visual question-answering model performs feature extraction on the text data to obtain text features. The multi-modal feature pair consisting of the local image features and the text features is then input into the cross-modal encoder module of the medical visual question-answering model for multi-modal feature fusion; all potential local alignments can be captured through the mutual attention mechanism between regions and words, selective context aggregation is achieved, and the fused features are obtained. The fused features are further input into the answer prediction module of the medical visual question-answering model for answer prediction to obtain an answer prediction result, so that the symptom question is answered according to the answer prediction result and the accuracy of medical visual question answering is improved.
In one embodiment, a medical visual question answering method based on fine-grained cross-attention is provided. As shown in fig. 4 and fig. 5, image features are first extracted from a 3×224×224 radiological medical image through the ResNet-152 model, and the local visual features of the image (i.e., the local image features) are then obtained after the full convolution layer (i.e., the full convolution unit) and the fine-grained visual feature extraction layer (i.e., the fine-grained visual feature extraction unit). The local visual features and the text features are fused by the cross-modal encoder (i.e., the cross-modal encoder module), and a predicted answer is finally generated by the VQA classifier, for example: the opacity of the left signal region increases.
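Wiring the modules sketched earlier together, the end-to-end flow of fig. 4 and fig. 5 can be summarised as follows. This is a structural sketch only: the class names come from the illustrative sketches above, and the simple mean pooling stands in for the feature pooling layer, whose exact form is not specified here:

```python
import torch
import torch.nn as nn

class MedicalVQAModel(nn.Module):
    """End-to-end sketch assembling the illustrative modules above
    (not the application's reference implementation)."""
    def __init__(self, visual_extractor, text_encoder, cross_layers, classifier):
        super().__init__()
        self.visual_extractor = visual_extractor  # image -> (B, 49, 768)
        self.text_encoder = text_encoder          # token ids -> (B, L, 768)
        self.cross_layers = cross_layers          # N stacked cross-modal layers
        self.classifier = classifier              # pooled feature -> answer logits

    def forward(self, image, input_ids, attention_mask):
        img = self.visual_extractor(image)
        txt = self.text_encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state
        for layer in self.cross_layers:
            img, txt = layer(img, txt)
        # simple stand-in for the feature pooling layer
        pooled = torch.cat([img, txt], dim=1).mean(dim=1)
        return self.classifier(pooled)            # answer prediction
```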
In one embodiment, the medical visual question-answering model is trained on the VQA-RAD dataset. VQA-RAD is currently the most commonly used radiology dataset, containing 315 images and 3515 question-answer pairs, and each image corresponds to at least one question-answer pair. The questions cover 11 categories, including "abnormal condition", "attribute", "color", "number", "morphology", "organ type", "other", "plane", and so on. 58% of the questions are closed questions and the remainder are open questions. The image content covers the head, chest, abdomen and other parts of the body. The training set and the test set need to be partitioned manually. This dataset is a high-quality dataset of manually labelled question-answer pairs.
Regarding the text feature extraction module in the medical visual question-answering model, as shown by the dotted part in the middle of fig. 6, [PAD] filling is used to align the sequence lengths. The text feature fill length is 49 and the local image feature fill length is 21. To keep the lengths of all sentences in the training set consistent, a "maximum length" is specified: sentences shorter than this maximum length are padded, and sentences exceeding it are clipped. As shown in fig. 6, after using [PAD] for padding, an indicator is also required to mark that [PAD] is only padding and does not carry any meaning. Here a list called the attention mask is used, whose length equals the maximum length; for each sentence, if the word at a position is [PAD], the element at that position of the attention mask takes the value 0, and otherwise it takes the value 1. In the BERT model used by the text feature extraction module, the token numbers (token IDs) are obtained by converting the input text into tokens of the corresponding vocabulary. The input text of the BERT model is typically composed of words or sub-words (WordPieces), each of which is mapped to a unique token ID. The specific steps are: splitting into words and sub-words, mapping to token IDs, adding special tokens, padding and clipping. The token IDs and the attention_mask are fed into the pre-trained BERT model to obtain a word embedding representation for each word. After the BERT model processing, the embedded representation of each word is obtained (this embedded representation contains the context of the whole sentence). Assuming the BERT-base model is used, the dimension of each word embedding is 768.
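A short illustration of the [PAD] filling and attention mask described above, assuming the HuggingFace tokenizer API; the checkpoint name and the maximum length of 12 are arbitrary illustration values:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint

max_length = 12  # illustrative "maximum length"; shorter questions are padded, longer ones clipped
encoded = tokenizer("What organ is shown in this image?",
                    padding="max_length", max_length=max_length,
                    truncation=True)

print(encoded["input_ids"])       # token IDs; trailing 0s are the [PAD] token
print(encoded["attention_mask"])  # 1 for real tokens, 0 where [PAD] was inserted
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```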
In the training of the medical visual question-answering model, a cross-entropy loss function is adopted to calculate the loss between the answer produced by the answer prediction module from the fused features and the standard answer of the question description. The answer prediction module of the medical visual question-answering model obtains the score of each category in its last layer; the scores are passed through the Relu function to obtain probability outputs; and the cross-entropy loss function is calculated from the category probability outputs predicted by the medical visual question-answering model and the one-hot form of the real category.
In one embodiment, the loss function of the medical visual question-answering model is:
L = -(1/M) Σ_{i=1}^{M} Σ_c X_{i,c} log p(y_{i,c})
wherein L is the cross-entropy loss function, M is the number of samples, X_{i,c} is the one-hot real label of sample i for category c, y is the prediction result, and p(y_{i,c}) is the predicted probability (information quantity) that sample i belongs to category c.
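A toy illustration of this cross-entropy loss, assuming PyTorch; it compares the manual computation over one-hot labels with the built-in cross_entropy, which operates directly on logits (all numbers are arbitrary):

```python
import torch
import torch.nn.functional as F

M, num_answers = 4, 6
logits = torch.randn(M, num_answers)   # answer scores from the prediction module
labels = torch.tensor([2, 0, 5, 1])    # real answer categories X

probs = F.softmax(logits, dim=-1)               # predicted probability p for each category
one_hot = F.one_hot(labels, num_answers).float()
loss_manual = -(one_hot * probs.log()).sum(dim=-1).mean()  # -(1/M) sum_i sum_c X_ic log p_ic

# PyTorch's built-in cross_entropy gives the same value from raw logits
loss_builtin = F.cross_entropy(logits, labels)
print(loss_manual.item(), loss_builtin.item())
```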
When the medical visual question-answering model is trained, the cross-entropy loss function is used to calculate the loss between the answer produced by the answer prediction module from the fused features and the standard answer of the question description. The gradients of the medical visual question-answering model are updated according to this multi-modal cross-entropy loss, the loss is computed and the parameters are adjusted until the best-performing parameters of the medical visual question-answering model are obtained, which improves the accuracy of medical visual question answering.
The cross-entropy loss function acts as a guiding function for the weight adjustment of the medical visual question-answering model; through this function the method knows how to improve the weight coefficients. The loss between the answer produced by the answer prediction module from the fused features and the standard answer of the question description is calculated, the gradients of the medical visual question-answering model are updated according to the multi-modal cross-entropy loss, and the parameters are adjusted to obtain the parameters of the best-performing model. Finally, the visual question-answering model is trained according to the computed loss values and its parameters are optimised. The total number of training epochs is set to 120, and the parameters of the medical visual question-answering model with the highest accuracy on the validation set are taken as the parameters of the final model. Experimental results are shown in Table 1, which presents the results on the VQA-RAD validation set.
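The training protocol described above (cross-entropy loss, gradient updates, 120 epochs, keeping the parameters with the highest validation accuracy) can be sketched as follows; the data loader format, optimiser and learning rate are assumptions, and model, train_loader, val_loader and evaluate are placeholders:

```python
import torch

def train(model, train_loader, val_loader, evaluate, epochs: int = 120):
    """Minimal training-loop sketch; not the application's actual training code."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate assumed
    criterion = torch.nn.CrossEntropyLoss()
    best_acc, best_state = -1.0, None
    for epoch in range(epochs):
        model.train()
        for image, input_ids, attention_mask, answer in train_loader:
            logits = model(image, input_ids, attention_mask)
            loss = criterion(logits, answer)   # loss against the standard answer
            optimizer.zero_grad()
            loss.backward()                    # gradient update of the model
            optimizer.step()
        acc = evaluate(model, val_loader)      # accuracy on the validation set
        if acc > best_acc:                     # keep the best-performing parameters
            best_acc = acc
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)
    return model, best_acc
```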
Table 1. Results of the medical visual question answering method with fine-grained cross-attention on the VQA-RAD validation set
As can be seen from Table 1, the medical visual question-answering model with fine-grained cross-attention achieves a large improvement in overall accuracy, particularly on closed questions; it captures the contextual correlation between medical images and language features well and improves the performance and robustness of the medical visual question-answering model.
It should be understood that, although the steps in the flowchart of fig. 1 are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (8)

1. A medical visual question answering method based on fine-grained cross-attention, the method comprising:
acquiring a radiological medical image and text data of the symptom question corresponding to the radiological medical image;
performing local feature extraction on the radiological medical image with a fine-grained visual feature extraction module of a medical visual question-answering model to obtain local image features;
performing feature extraction on the text data with a text feature extraction module of the medical visual question-answering model to obtain text features;
inputting the multi-modal feature pair consisting of the local image features and the text features into a cross-modal encoder module of the medical visual question-answering model for multi-modal feature fusion to obtain fused features;
inputting the fused features into an answer prediction module of the medical visual question-answering model to perform answer prediction, so as to obtain an answer prediction result;
and answering the symptom question according to the answer prediction result.
2. The medical visual question answering method based on fine-grained cross-attention according to claim 1, wherein performing local feature extraction on the radiological medical image with the fine-grained visual feature extraction module of the medical visual question-answering model to obtain local image features comprises:
inputting the radiological medical image into a feature extraction unit of the fine-grained visual feature extraction module for feature extraction to obtain preliminary image features;
inputting the preliminary image features into a full convolution unit of the fine-grained visual feature extraction module for processing to obtain processed image features;
and inputting the processed image features into a fine-grained visual feature extraction unit of the fine-grained visual feature extraction module for feature extraction to obtain local image features.
3. The medical visual question answering method based on fine-grained cross-attention according to claim 1, wherein the cross-modal encoder module of the medical visual question-answering model comprises N cross-modal encoding layers and one feature pooling layer connected in sequence;
the input of the first cross-modal encoding layer is the output of the fine-grained visual feature extraction module and the output of the text feature extraction module, the output of the first cross-modal encoding layer is the input of the second cross-modal encoding layer, and so on; the output of the last cross-modal encoding layer is the input of the feature pooling layer, and the output of the feature pooling layer is the input of the answer prediction module.
4. The medical visual question answering method based on fine-grained cross-attention according to claim 3, wherein the cross-modal encoding layer comprises a first self-attention layer, a second self-attention layer, a first cross-attention layer, a second cross-attention layer, a first feed-forward sub-layer, and a second feed-forward sub-layer;
the output of the first self-attention layer is input into the first and second cross-attention layers, the output of the second self-attention layer is input into the first and second cross-attention layers, the output of the first cross-attention layer is input into the first and second feedforward sublayers, and the output of the second cross-attention layer is input into the first and second feedforward sublayers.
5. The medical visual question answering method based on fine-grained cross-attention according to claim 4, wherein the processing expression of the first self-attention layer is:
Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k) V_i
the processing expression of the second self-attention layer is:
Attention(Q_t, K_t, V_t) = softmax(Q_t K_t^T / √d_k) V_t
wherein Attention(Q_i, K_i, V_i) is the output of the first self-attention layer, Attention(Q_t, K_t, V_t) is the output of the second self-attention layer, Q_i, K_i and V_i are the input feature query, key and value of the first self-attention layer, Q_t, K_t and V_t are the input feature query, key and value of the second self-attention layer, d_k is the feature dimension, T denotes transposition, and softmax(·) is the softmax function.
6. The medical visual question answering method based on fine-grained cross-attention according to claim 4, wherein the processing expression of the first cross-attention layer is:
h_i^n = CrossAtt_{t→v}(h_i^{n-1}, v_1^{n-1}, …, v_m^{n-1})
the processing expression of the second cross-attention layer is:
v_j^n = CrossAtt_{v→t}(v_j^{n-1}, h_1^{n-1}, …, h_k^{n-1})
wherein h_i^n is the output of the first cross-attention layer at the i-th position, CrossAtt_{t→v}(·) is the cross-modal attention operation from text to vision that captures the relation between text and image, h_i^{n-1} is the hidden state at the i-th position of the (n-1)-th layer on the text side, v_1^{n-1}, …, v_m^{n-1} are the image representations from the first to the last position in the layer preceding the n-th layer, v_j^n is the output of the second cross-attention layer at the j-th position, CrossAtt_{v→t}(·) is the cross-modal attention operation from vision to text that captures the relation between image and text, v_j^{n-1} is the image representation at the j-th position of the (n-1)-th layer, and h_1^{n-1}, …, h_k^{n-1} are the text hidden states from the first to the last position in the layer preceding the n-th layer.
7. The medical visual question answering method based on fine-grained cross-attention according to claim 1, wherein the answer prediction module comprises: a first fully connected layer, a Relu function, a weight normalization layer, and a second fully connected layer;
the first full-connection layer, the Relu function, the weight normalization layer and the second full-connection layer are sequentially connected.
8. The medical visual question answering method based on fine-grained cross-attention according to claim 7, wherein the loss function of the medical visual question-answering model is:
L = -(1/M) Σ_{i=1}^{M} Σ_c X_{i,c} log p(y_{i,c})
wherein L is the cross-entropy loss function, M is the number of samples, X_{i,c} is the one-hot real label of sample i for category c, y is the prediction result, and p(y_{i,c}) is the predicted probability (information quantity) that sample i belongs to category c.
CN202311490620.3A 2023-11-10 2023-11-10 Medical visual question answering method based on fine-grained cross-attention Pending CN117235670A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311490620.3A CN117235670A (en) 2023-11-10 2023-11-10 Medical visual question answering method based on fine-grained cross-attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311490620.3A CN117235670A (en) 2023-11-10 Medical visual question answering method based on fine-grained cross-attention

Publications (1)

Publication Number Publication Date
CN117235670A true CN117235670A (en) 2023-12-15

Family

ID=89098489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311490620.3A Pending CN117235670A (en) 2023-11-10 Medical visual question answering method based on fine-grained cross-attention

Country Status (1)

Country Link
CN (1) CN117235670A (en)


Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230082605A1 (en) * 2020-08-12 2023-03-16 Tencent Technology (Shenzhen) Company Limited Visual dialog method and apparatus, method and apparatus for training visual dialog model, electronic device, and computer-readable storage medium
CN112488055A (en) * 2020-12-18 2021-03-12 贵州大学 Video question-answering method based on progressive graph attention network
CN114201592A (en) * 2021-12-02 2022-03-18 重庆邮电大学 Visual question-answering method for medical image diagnosis
CN114398961A (en) * 2021-12-28 2022-04-26 西南交通大学 Visual question-answering method based on multi-mode depth feature fusion and model thereof
KR20230128812A (en) * 2022-02-28 2023-09-05 전남대학교산학협력단 Cross-modal learning-based emotion inference system and method
CN116756361A (en) * 2022-03-03 2023-09-15 四川大学 Medical visual question-answering method based on corresponding feature fusion
CN114913403A (en) * 2022-07-18 2022-08-16 南京信息工程大学 Visual question-answering method based on metric learning
CN115455162A (en) * 2022-09-14 2022-12-09 东南大学 Answer sentence selection method and device based on hierarchical capsule and multi-view information fusion
CN115994212A (en) * 2023-03-15 2023-04-21 阿里巴巴达摩院(杭州)科技有限公司 Visual question-answering processing method, visual question-answering model training method and device
CN116484042A (en) * 2023-05-16 2023-07-25 厦门医学院 Visual question-answering method combining autocorrelation and interactive guided attention mechanism
CN116258946A (en) * 2023-05-16 2023-06-13 苏州大学 Precondition-based multi-granularity cross-modal reasoning method and device
CN116662591A (en) * 2023-06-02 2023-08-29 北京理工大学 Robust visual question-answering model training method based on contrast learning
CN116932722A (en) * 2023-07-26 2023-10-24 海南大学 Cross-modal data fusion-based medical visual question-answering method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZIHENG WU et al.: "FGCVQA: Fine-Grained Cross-Attention for Medical VQA", 2023 IEEE International Conference on Image Processing, pages 975-979 *
包翠竹 (BAO Cuizhu) et al.: "视频问答技术研究进展" [Research Progress on Video Question Answering], 《计算机研究与发展》 (Journal of Computer Research and Development), pages 1-33 *

Similar Documents

Publication Publication Date Title
WO2021179205A1 (en) Medical image segmentation method, medical image segmentation apparatus and terminal device
JP7195365B2 (en) A Method for Training Convolutional Neural Networks for Image Recognition Using Image Conditional Mask Language Modeling
CN109726696B (en) Image description generation system and method based on attention-pushing mechanism
EP4266195A1 (en) Training of text and image models
CN109471895A (en) The extraction of electronic health record phenotype, phenotype name authority method and system
CN112541501A (en) Scene character recognition method based on visual language modeling network
CN114549850B (en) Multi-mode image aesthetic quality evaluation method for solving modal missing problem
CN111275118B (en) Chest film multi-label classification method based on self-correction type label generation network
CN111949824B (en) Visual question-answering method and system based on semantic alignment and storage medium
CN114564959B (en) Chinese clinical phenotype fine granularity named entity identification method and system
CN111984772A (en) Medical image question-answering method and system based on deep learning
CN112687388A (en) Interpretable intelligent medical auxiliary diagnosis system based on text retrieval
CN110276396B (en) Image description generation method based on object saliency and cross-modal fusion features
CN111444367A (en) Image title generation method based on global and local attention mechanism
CN116797848A (en) Disease positioning method and system based on medical image text alignment
CN114999637A (en) Pathological image diagnosis method and system based on multi-angle coding and embedded mutual learning
CN115631183A (en) Method, system, device, processor and storage medium for realizing classification and identification of X-ray image based on double-channel decoder
CN115546553A (en) Zero sample classification method based on dynamic feature extraction and attribute correction
CN117391092B (en) Electronic medical record multi-mode medical semantic alignment method based on contrast learning
CN115862837A (en) Medical visual question-answering method based on type reasoning and semantic constraint
CN117079264A (en) Scene text image recognition method, system, equipment and storage medium
CN117235670A (en) Medical visual question answering method based on fine-grained cross-attention
CN116881422A (en) Knowledge visual question-answering method and system generated by triple asymmetry and principle
CN116779177A (en) Endocrine disease classification method based on unbiased mixed tag learning
CN117688168A (en) Method and related device for generating abstract

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20231215