CN117235670A - Medical visual question answering method based on fine-grained cross-attention - Google Patents

Medical visual question answering method based on fine-grained cross-attention

Info

Publication number
CN117235670A
Authority
CN
China
Prior art keywords
layer
cross
attention
visual
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311490620.3A
Other languages
Chinese (zh)
Inventor
吴梓恒
陆振宇
舒昕垚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN202311490620.3A priority Critical patent/CN117235670A/en
Publication of CN117235670A publication Critical patent/CN117235670A/en
Pending legal-status Critical Current


Landscapes

  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The application relates to a medical visual question answering method based on fine-grained cross-attention. The method comprises the following steps: acquiring a radiological medical image and text data of the symptom question corresponding to the radiological medical image; performing local feature extraction on the radiological medical image with a fine-grained visual feature extraction module to obtain local image features; performing feature extraction on the text data with a text feature extraction module to obtain text features; inputting the multi-modal features consisting of the local image features and the text features into a cross-modal encoder module for multi-modal feature fusion to obtain fused features; inputting the fused features into an answer prediction module for answer prediction to obtain an answer prediction result; and answering the symptom question according to the answer prediction result, thereby improving the accuracy of medical visual question answering.

Description

Medical visual question answering method based on fine-grained cross-attention
Technical Field
The application relates to the technical field of image processing, and in particular to a medical visual question answering method based on fine-grained cross-attention.
Background
Medical visual question answering aims to accurately answer clinical questions posed about medical images. Despite its great potential in the healthcare industry and in medical services, this technology is still in its infancy and has not yet been put to practical use. The medical visual question answering task is very challenging because clinical questions are highly diverse and different types of questions require different visual reasoning skills. Medical visual question answering is a domain-specific visual question answering problem that requires interpreting medical visual concepts by jointly considering image and language information. Specifically, a medical visual question answering system takes a medical image and a clinical question about that image as input and outputs a correct answer in natural language. Medical visual question answering can help patients obtain timely feedback to their queries and make more informed decisions. It can relieve pressure on medical facilities and save precious medical resources for cases that urgently need them. It can also help doctors obtain a second opinion during diagnosis and reduce the high cost of training medical professionals.
In the related art, visual and linguistic reasoning requires understanding visual concepts and linguistic semantics and, most importantly, the alignment and relationships between the two modalities. Conventional visual question answering methods require a large amount of labelled data for training, and such large-scale data is often not available in the medical field. Moreover, images in the medical field differ fundamentally from general-domain images, so directly applying a general-purpose visual question answering model to the medical field is not feasible. Furthermore, medical image annotation is an expensive and time-consuming process. Applying visual question answering to the medical field has a significant impact on traditional medical research methods, and a well-designed medical visual question answering system is of great help for patient diagnosis. Due to the complexity and diversity of clinical questions and the difficulty of multi-modal reasoning, general-domain visual question answering models do not pay sufficient attention to the alignment between medical image features and text semantics, so their accuracy is low when applied to medical visual questions.
Disclosure of Invention
Based on the above, it is necessary to provide a medical visual question answering method based on fine-grained cross-attention that can answer medical visual questions accurately.
A medical visual question answering method based on fine-grained cross-attention, the method comprising:
acquiring a radiological medical image and text data of the symptom question corresponding to the radiological medical image;
performing local feature extraction on the radiological medical image with a fine-grained visual feature extraction module of a medical visual question-answering model to obtain local image features;
performing feature extraction on the text data with a text feature extraction module of the medical visual question-answering model to obtain text features;
inputting the multi-modal feature pair consisting of the local image features and the text features into a cross-modal encoder module of the medical visual question-answering model for multi-modal feature fusion to obtain fused features;
inputting the fused features into an answer prediction module of the medical visual question-answering model to perform answer prediction, so as to obtain an answer prediction result;
and answering the symptom question according to the answer prediction result.
In one embodiment, performing local feature extraction on the radiological medical image with the fine-grained visual feature extraction module of the medical visual question-answering model to obtain local image features includes:
inputting the radiological medical image into a feature extraction unit of the fine-grained visual feature extraction module for feature extraction to obtain preliminary image features;
inputting the preliminary image features into a full convolution unit of the fine-grained visual feature extraction module for processing to obtain processed image features;
and inputting the processed image features into a fine-grained visual feature extraction unit of the fine-grained visual feature extraction module for feature extraction to obtain local image features.
In one embodiment, the cross-modal encoder module of the medical visual question-answering model comprises N cross-modal encoding layers and a feature pooling layer which are sequentially connected;
the input of the first cross-modal encoding layer is the output of the fine-grained visual feature extraction module and the output of the text feature extraction module, the output of the first cross-modal encoding layer is the input of the second cross-modal encoding layer, and so on; the output of the last cross-modal encoding layer is the input of the feature pooling layer, and the output of the feature pooling layer is the input of the answer prediction module.
In one embodiment, the cross-modality encoding layer includes a first self-attention layer, a second self-attention layer, a first cross-attention layer, a second cross-attention layer, a first feed-forward sub-layer, and a second feed-forward sub-layer;
the output of the first self-attention layer is input into the first and second cross-attention layers, the output of the second self-attention layer is input into the first and second cross-attention layers, the output of the first cross-attention layer is input into the first and second feedforward sublayers, and the output of the second cross-attention layer is input into the first and second feedforward sublayers.
In one embodiment, the processing expression of the first self-attention layer is:
Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k) V_i
the processing expression of the second self-attention layer is:
Attention(Q_t, K_t, V_t) = softmax(Q_t K_t^T / √d_k) V_t
wherein Attention(Q_i, K_i, V_i) is the output of the first self-attention layer, Attention(Q_t, K_t, V_t) is the output of the second self-attention layer, Q_i, K_i and V_i are the input feature query, key and value of the first self-attention layer, Q_t, K_t and V_t are the input feature query, key and value of the second self-attention layer, d_k is the feature dimension, T denotes transposition, and softmax(·) is the softmax function.
In one embodiment, the processing expression of the first cross-attention layer is:
h_i^n = CrossAtt_{t→v}(h_i^{n-1}, v_1^{n-1}, …, v_m^{n-1})
the processing expression of the second cross-attention layer is:
v_j^n = CrossAtt_{v→t}(v_j^{n-1}, h_1^{n-1}, …, h_k^{n-1})
wherein h_i^n is the output of the first cross-attention layer at the i-th position, CrossAtt_{t→v}(·) is the cross-modal attention operation from text to vision that captures the relation between text and image, h_i^{n-1} is the hidden state at the i-th position of the (n-1)-th layer on the text side, v_1^{n-1}, …, v_m^{n-1} are the image representations from the first to the last position in the layer preceding the n-th layer, v_j^n is the output of the second cross-attention layer at the j-th position, CrossAtt_{v→t}(·) is the cross-modal attention operation from vision to text that captures the relation between image and text, v_j^{n-1} is the image representation at the j-th position of the (n-1)-th layer, and h_1^{n-1}, …, h_k^{n-1} are the text hidden states from the first to the last position in the layer preceding the n-th layer.
In one embodiment, the answer prediction module includes: a first fully connected layer, a Relu function, a weight normalization layer, and a second fully connected layer;
the first full-connection layer, the Relu function, the weight normalization layer and the second full-connection layer are sequentially connected.
In one embodiment, the loss function of the medical visual question-answering model is:
L = -(1/M) Σ_{i=1}^{M} Σ_c X_{i,c} log p(y_{i,c})
wherein L is the cross-entropy loss function, M is the number of samples, X_{i,c} is the one-hot real label of sample i for category c, y is the prediction result, and p(y_{i,c}) is the predicted probability (information quantity) that sample i belongs to category c.
According to the medical visual question answering method based on fine-grained cross-attention, a radiological medical image and text data of the symptom question corresponding to the radiological medical image are acquired. The fine-grained visual feature extraction module of the medical visual question-answering model performs local feature extraction on the radiological medical image, which yields more localized image features and suppresses useless noise. The text feature extraction module of the medical visual question-answering model performs feature extraction on the text data to obtain text features. The multi-modal feature pair consisting of the local image features and the text features is then input into the cross-modal encoder module of the medical visual question-answering model for multi-modal feature fusion; the mutual attention mechanism between regions and words can capture all potential local alignments and achieve selective context aggregation, producing the fused features. The fused features are input into the answer prediction module of the medical visual question-answering model to obtain an answer prediction result, and the symptom question is answered accurately according to the answer prediction result, thereby improving the accuracy of medical visual question answering.
Drawings
FIG. 1 is a flow chart of a medical visual question answering method based on fine-grained cross-attention in one embodiment;
FIG. 2 is a schematic diagram of a cross-modal encoding layer in one embodiment;
FIG. 3 is a schematic flow diagram illustrating a cross-modal encoding layer processing in one embodiment;
FIG. 4 is a schematic diagram of the structure of a medical visual question-answering model in one embodiment;
FIG. 5 is a schematic diagram of the cross-modality encoder module and answer prediction module of the medical visual question-answering model in one embodiment;
FIG. 6 is a schematic diagram of a multi-modal feature extraction process in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in fig. 1, a medical visual question answering method based on fine-grained cross-attention is provided. The method is described as applied to a terminal for illustration and comprises the following steps:
Step S220, acquiring a radiological medical image and text data of the symptom question corresponding to the radiological medical image.
Among these, the radiological medical image may be the most commonly used radiological (radiology) data, the content of the image being the head, chest, abdomen, etc. of the body.
The symptom questions may include: "abnormal condition", "attribute", "color", "number", "morphology", "organ type", "plane", and the like.
Step S240, performing local feature extraction on the radiological medical image with the fine-grained visual feature extraction module of the medical visual question-answering model to obtain local image features.
Local image features are obtained by extracting features from local areas of the radiological medical image and are mainly used to describe local details or key points.
The fine-grained visual feature extraction module of the medical visual question-answering model extracts fine-grained visual features from the radiological medical image, so that the obtained features are more localized and the information in the radiological medical image is easier to exploit, which improves the accuracy of the subsequent medical visual question answering.
It should be appreciated that the visual features extracted by the fine-grained visual feature extraction module may suppress unwanted noise in the key information of the fused representation. The fine-grained visual feature extraction module uses a full convolution layer instead of a linear layer to reduce the number of parameters, and a fine-grained visual feature extraction layer to focus on the extraction and identification of local features, extracting distinguishable and complementary features while preserving context information.
In one embodiment, performing local feature extraction on the radiological medical image with the fine-grained visual feature extraction module of the medical visual question-answering model to obtain local image features includes: inputting the radiological medical image into a feature extraction unit of the fine-grained visual feature extraction module for feature extraction to obtain preliminary image features; inputting the preliminary image features into a full convolution unit of the fine-grained visual feature extraction module for processing to obtain processed image features; and inputting the processed image features into a fine-grained visual feature extraction unit of the fine-grained visual feature extraction module for feature extraction to obtain local image features.
In one embodiment, the feature extraction unit may be a neural network constructed using a ResNet-152 algorithm.
ResNet-152 is adopted to extract regional features of the medical image. In previous research on image feature extraction, image features were extracted from the five convolution stages of ResNet-152, and a global average pooling layer was applied after each stage. However, this ignores fine-grained local information in the image. Instead of using a global average pooling layer, image features here are extracted directly from the fifth convolution stage of ResNet-152. The 3×224×224 input image is passed through the ResNet-152 model, and the local visual features of the image are then obtained after the full convolution unit and the fine-grained visual feature extraction unit.
The fine-grained visual feature extraction module of the application uses a full convolution unit instead of a linear layer to reduce the number of parameters and align the feature dimension from the original 2048×7×7 to 768×7×7. The fine-grained visual feature extraction unit then converts the 7×7 feature map into a sequence of length 49, so that the final image feature dimension is 49×768, achieving the desired feature-dimension alignment.
The fine-grained visual feature extraction unit is mainly based on an ROI pooling layer: according to the input image features, it maps regions of interest (ROIs) to the corresponding positions of the feature map, and the mapped regions are then divided into sections. The fine-grained visual feature extraction unit converts the feature map into a sequence of length 49, which yields the final local image features.
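As an illustrative, non-limiting sketch of the dimension changes described above (written with PyTorch and torchvision as an assumption, not the application's actual implementation), the following code extracts the 2048×7×7 feature map from the fifth convolution stage of ResNet-152, aligns it to 768 channels with a 1×1 full convolution, and flattens it into a sequence of 49 local image features:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet152

class FineGrainedVisualExtractor(nn.Module):
    """Sketch of the fine-grained visual feature extraction module:
    ResNet-152 up to the fifth convolution stage -> 1x1 full convolution
    (2048 -> 768) -> 7x7 map flattened into a sequence of 49 features."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        backbone = resnet152(weights=None)  # pretrained weights optional
        # keep everything up to (and including) the fifth convolution stage
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # full convolution unit: aligns 2048x7x7 to 768x7x7 without a linear layer
        self.full_conv = nn.Conv2d(2048, hidden_dim, kernel_size=1)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(images)             # (B, 2048, 7, 7)
        feat = self.full_conv(feat)              # (B, 768, 7, 7)
        # fine-grained unit: convert the 7x7 map into a sequence of length 49
        return feat.flatten(2).transpose(1, 2)   # (B, 49, 768)

if __name__ == "__main__":
    x = torch.randn(1, 3, 224, 224)              # a 3x224x224 radiological image
    print(FineGrainedVisualExtractor()(x).shape) # torch.Size([1, 49, 768])
```

Keeping the full 7×7 map instead of global average pooling is what preserves the 49 spatial positions that the later cross-attention can align with individual words.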
Step S260, performing feature extraction on the text data with the text feature extraction module of the medical visual question-answering model to obtain text features.
The text feature extraction module may use a BERT model, for example the BERT-base model.
It should be understood that the word sequence in the text data is directly input into the text feature extraction module that has been pre-trained, and the resulting output is the features of the words and sentences.
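A minimal sketch of this text feature extraction step, assuming the HuggingFace transformers library; the checkpoint name and the maximum question length are illustrative assumptions rather than requirements of the application:

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed generic BERT-base checkpoint; any 768-dimensional BERT works here.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

MAX_QUESTION_LEN = 20  # arbitrary illustration value
question = "Is there any abnormality in this image?"
encoded = tokenizer(question, padding="max_length", max_length=MAX_QUESTION_LEN,
                    truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = bert(input_ids=encoded["input_ids"],
                   attention_mask=encoded["attention_mask"])

# one 768-dimensional contextual embedding per token position
text_features = outputs.last_hidden_state  # shape (1, MAX_QUESTION_LEN, 768)
print(text_features.shape)
```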
Step S280, inputting the multi-modal feature pair consisting of the local image features and the text features into the cross-modal encoder module of the medical visual question-answering model for multi-modal feature fusion to obtain fused features.
The self-attention mechanism of the cross-modal encoder module attends to selected details and performs analysis on them; its core is to determine, based on the target, which parts deserve attention and then analyse those details further. The cross-attention mechanism of the cross-modal encoder module enables the model to better understand and model the correlation between data of different modalities, improving performance on multi-modal tasks, and the inter-modal relation between image regions and sentence words can be used to complement and enhance image-sentence matching.
The cross-modal encoder module takes the image regions and the sentence words as inputs. Each cross-modal layer in the cross-modal encoder module includes two self-attention sub-layers, a bi-directional cross-attention sub-layer (i.e., two unidirectional cross-attention sub-layers) and two feed-forward sub-layers. To achieve robust cross-modal matching, the cross-modal encoder module captures all potential local alignments through a mutual attention mechanism between regions and words.
In one embodiment, the cross-modal encoder module of the medical visual question-answering model includes N cross-modal encoding layers and one feature pooling layer connected in sequence; the input of the first cross-modal encoding layer is the output of the fine-grained visual feature extraction module and the output of the text feature extraction module of the medical visual question-answering model, the output of the first cross-modal encoding layer is the input of the second cross-modal encoding layer, and so on; the output of the last cross-modal encoding layer is the input of the feature pooling layer, and the output of the feature pooling layer is the input of the answer prediction module.
In one embodiment, the cross-modality encoding layer includes a first self-attention layer, a second self-attention layer, the first cross-attention layer, the second cross-attention layer, a first feed-forward sub-layer, and a second feed-forward sub-layer; the output of the first self-attention layer is input into a first cross-attention layer and a second cross-attention layer, the output of the second self-attention layer is input into the first cross-attention layer and the second cross-attention layer, the output of the first cross-attention layer is input into a first feed-forward sub-layer and a second feed-forward sub-layer, and the output of the second cross-attention layer is input into the first feed-forward sub-layer and the second feed-forward sub-layer.
Wherein the feed-forward sub-layer of the cross-modality encoding layer enhances the presentation capability of the captured visual features and problem description features. Considering that the attention mechanism may not fit well to the complex process, the model's ability is enhanced by adding feed-forward sublayers.
It should be understood that the inputs of the cross-modal encoder module are the local image features and the text features. Of the first and second self-attention layers of the first cross-modal encoding layer, one receives the local image features and the other receives the text features. After the cross processing of the first cross-modal encoding layer, the first feed-forward sub-layer of that layer outputs one feature in which the local image features and the text features have been fused, and the second feed-forward sub-layer outputs the other. The outputs of the first and second feed-forward sub-layers of one cross-modal encoding layer are passed in parallel to the first and second self-attention layers of the next cross-modal encoding layer without being crossed, and so on; finally, the first and second feed-forward sub-layers of the last cross-modal encoding layer output their features to the feature pooling layer.
As shown in fig. 2, the cross-modal encoding layers are stacked (i.e., the output of the n-th cross-modal encoding layer is used as the input of the (n+1)-th cross-modal encoding layer). In the n-th layer, a bi-directional cross-attention layer is first applied; it comprises two unidirectional cross-attention layers, one from language to vision and the other from vision to language.
The cross-modal encoding layer obtains a weight matrix (affinity matrix) with scaled dot-product attention. The present application uses a self-attention layer to align entities within each modality before the cross-attention layer; the output of the self-attention layer is the input of the cross-attention layer, which enhances the information fed into the (n+1)-th layer. At the n-th layer, a bi-directional cross-attention layer is employed, comprising two cross-attention layers: one from vision to language and the other from language to vision. As shown in fig. 3, the text feature query Q and key K are passed through the softmax function to produce a matrix of attention coefficients, which is multiplied by the value V of the visual features to give the final attention output. Likewise, the visual feature query Q and key K are passed through the softmax function to produce a matrix of attention coefficients, which is multiplied by the text feature value V to give the final attention output. The two attention outputs are fused to obtain the output result. The cross-attention fusion mechanism has global learning capability and good parallelism. It can also complement and enhance image-sentence matching through the inter-modal relation between image regions and sentence words, and it selectively aggregates context based on spatial attention maps. These characteristics indicate that mutual benefits can be realized. The feed-forward sub-layer of the cross-modal encoding layer enhances the representation capability of the captured visual features and question description features. Considering that the attention mechanism may not fit the complex process adequately, the capability of the medical visual question-answering model is enhanced by adding feed-forward sub-layers.
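The following simplified sketch illustrates the two-stream structure of one cross-modal encoding layer described above. Residual connections, layer normalisation and the exact wiring between sub-layers are omitted, and the class and parameter names are the editor's assumptions rather than the application's implementation:

```python
import torch
import torch.nn as nn

class CrossModalLayer(nn.Module):
    """Sketch of one cross-modal encoding layer: two self-attention layers,
    a bi-directional cross-attention layer (text->vision and vision->text)
    and two feed-forward sub-layers, one per modality stream."""
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.self_att_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_att_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn_img = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.ffn_txt = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, img, txt):
        # self-attention first aligns entities within each modality
        img, _ = self.self_att_img(img, img, img)
        txt, _ = self.self_att_txt(txt, txt, txt)
        # bi-directional cross-attention: queries from one modality,
        # keys/values from the other
        txt_out, _ = self.cross_txt2img(txt, img, img)  # text attends to vision
        img_out, _ = self.cross_img2txt(img, txt, txt)  # vision attends to text
        # feed-forward sub-layers enhance the representation of each stream
        return self.ffn_img(img_out), self.ffn_txt(txt_out)

img_feats, txt_feats = torch.randn(1, 49, 768), torch.randn(1, 20, 768)
fused_img, fused_txt = CrossModalLayer()(img_feats, txt_feats)
print(fused_img.shape, fused_txt.shape)  # (1, 49, 768) (1, 20, 768)
```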
In one embodiment, the processing expression of the first self-attention layer is:
Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k) V_i
In one embodiment, the processing expression of the second self-attention layer is:
Attention(Q_t, K_t, V_t) = softmax(Q_t K_t^T / √d_k) V_t
wherein Attention(Q_i, K_i, V_i) is the output of the first self-attention layer, Attention(Q_t, K_t, V_t) is the output of the second self-attention layer, Q_i, K_i and V_i are the input feature query, key and value of the first self-attention layer, Q_t, K_t and V_t are the input feature query, key and value of the second self-attention layer, d_k is the feature dimension, T denotes transposition, and softmax(·) is the softmax function.
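A small numerical illustration of the self-attention expression above, assuming PyTorch; the function below directly implements softmax(Q K^T / √d_k) V:

```python
import math
import torch

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (.., len_q, len_k)
    return torch.softmax(scores, dim=-1) @ V            # (.., len_q, d_v)

# self-attention over 49 local image features of dimension 768
x = torch.randn(1, 49, 768)
print(attention(x, x, x).shape)  # torch.Size([1, 49, 768])
```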
In one embodiment, the processing expression of the first cross-attention layer is:
h_i^n = CrossAtt_{t→v}(h_i^{n-1}, v_1^{n-1}, …, v_m^{n-1})
In one embodiment, the processing expression of the second cross-attention layer is:
v_j^n = CrossAtt_{v→t}(v_j^{n-1}, h_1^{n-1}, …, h_k^{n-1})
wherein h_i^n is the output of the first cross-attention layer at the i-th position, CrossAtt_{t→v}(·) is the cross-modal attention operation from text to vision that captures the relation between text and image, h_i^{n-1} is the hidden state at the i-th position of the (n-1)-th layer on the text side, v_1^{n-1}, …, v_m^{n-1} are the image representations from the first to the last position in the layer preceding the n-th layer, v_j^n is the output of the second cross-attention layer at the j-th position, CrossAtt_{v→t}(·) is the cross-modal attention operation from vision to text that captures the relation between image and text, v_j^{n-1} is the image representation at the j-th position of the (n-1)-th layer, and h_1^{n-1}, …, h_k^{n-1} are the text hidden states from the first to the last position in the layer preceding the n-th layer.
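For illustration, the same scaled dot-product attention can be applied across modalities by taking the queries from one modality and the keys and values from the other, which is the pattern the two cross-attention expressions above describe. The sketch below assumes PyTorch 2.x and its built-in scaled_dot_product_attention; the sequence lengths are arbitrary:

```python
import torch
import torch.nn.functional as F

txt = torch.randn(1, 20, 768)  # text hidden states h^(n-1), one per word position
img = torch.randn(1, 49, 768)  # image representations v^(n-1), one per region

# text-to-vision: text queries attend over image keys/values (CrossAtt_t->v)
txt_ctx = F.scaled_dot_product_attention(txt, img, img)  # (1, 20, 768)
# vision-to-text: image queries attend over text keys/values (CrossAtt_v->t)
img_ctx = F.scaled_dot_product_attention(img, txt, txt)  # (1, 49, 768)
print(txt_ctx.shape, img_ctx.shape)
```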
Step S300, inputting the fused features into the answer prediction module of the medical visual question-answering model for answer prediction to obtain an answer prediction result.
The answer prediction module may employ a VQA classifier.
In one embodiment, the answer prediction module comprises: a first fully connected layer, a Relu function, a weight normalization layer, and a second fully connected layer; the first full-connection layer, the Relu function, the weight normalization layer and the second full-connection layer are sequentially connected.
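A sketch of such an answer prediction head, assuming PyTorch; the hidden size, the number of candidate answers and the reading of "weight normalization layer" as weight normalisation applied to the second fully connected layer are illustrative assumptions:

```python
import torch
import torch.nn as nn

num_answers = 100  # illustrative; set to the size of the answer set
classifier = nn.Sequential(
    nn.Linear(768, 1536),                              # first fully connected layer
    nn.ReLU(),                                         # ReLU activation
    nn.utils.weight_norm(nn.Linear(1536, num_answers)) # weight-normalised second FC layer
)

fused = torch.randn(1, 768)       # pooled fused feature from the cross-modal encoder
logits = classifier(fused)        # one score per candidate answer
answer_id = logits.argmax(dim=-1) # index of the predicted answer
print(logits.shape, answer_id)
```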
Step S320, answering the symptom question according to the answer prediction result.
According to the medical visual question answering method based on fine-grained cross-attention, a radiological medical image and text data of the symptom question corresponding to the radiological medical image are acquired. The fine-grained visual feature extraction module of the medical visual question-answering model performs local feature extraction on the radiological medical image, so that the obtained fine-grained visual features are more localized and useless noise is suppressed, yielding the local image features. The text feature extraction module of the medical visual question-answering model performs feature extraction on the text data to obtain text features. The multi-modal feature pair consisting of the local image features and the text features is then input into the cross-modal encoder module of the medical visual question-answering model for multi-modal feature fusion; all potential local alignments can be captured through the mutual attention mechanism between regions and words, selective context aggregation is achieved, and the fused features are obtained. The fused features are further input into the answer prediction module of the medical visual question-answering model for answer prediction to obtain an answer prediction result, so that the symptom question is answered according to the answer prediction result and the accuracy of medical visual question answering is improved.
In one embodiment, a medical visual question answering method based on fine-grained cross-attention is provided. As shown in fig. 4 and fig. 5, image features are first extracted from a 3×224×224 radiological medical image through the ResNet-152 model, and the local visual features of the image (i.e., the local image features) are then obtained after the full convolution layer (i.e., the full convolution unit) and the fine-grained visual feature extraction layer (i.e., the fine-grained visual feature extraction unit). The local visual features and the text features are fused by the cross-modal encoder (i.e., the cross-modal encoder module), and a predicted answer is finally generated by the VQA classifier, for example: the opacity of the left signal region increases.
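Wiring the modules sketched earlier together, the end-to-end flow of fig. 4 and fig. 5 can be summarised as follows. This is a structural sketch only: the class names come from the illustrative sketches above, and the simple mean pooling stands in for the feature pooling layer, whose exact form is not specified here:

```python
import torch
import torch.nn as nn

class MedicalVQAModel(nn.Module):
    """End-to-end sketch assembling the illustrative modules above
    (not the application's reference implementation)."""
    def __init__(self, visual_extractor, text_encoder, cross_layers, classifier):
        super().__init__()
        self.visual_extractor = visual_extractor  # image -> (B, 49, 768)
        self.text_encoder = text_encoder          # token ids -> (B, L, 768)
        self.cross_layers = cross_layers          # N stacked cross-modal layers
        self.classifier = classifier              # pooled feature -> answer logits

    def forward(self, image, input_ids, attention_mask):
        img = self.visual_extractor(image)
        txt = self.text_encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state
        for layer in self.cross_layers:
            img, txt = layer(img, txt)
        # simple stand-in for the feature pooling layer
        pooled = torch.cat([img, txt], dim=1).mean(dim=1)
        return self.classifier(pooled)            # answer prediction
```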
In one embodiment, the medical visual question-answering model is trained on the VQA-RAD dataset. VQA-RAD is currently the most commonly used radiology dataset, containing 315 images and 3515 question-answer pairs, and each image corresponds to at least one question-answer pair. The questions cover 11 categories, including "abnormal condition", "attribute", "color", "number", "morphology", "organ type", "other", "plane", and so on. 58% of the questions are closed questions and the remainder are open questions. The image content covers the head, chest, abdomen and other parts of the body. The training set and the test set need to be partitioned manually. This dataset is a high-quality dataset of manually labelled question-answer pairs.
Regarding the text feature extraction module in the medical visual question-answering model, as shown by the dotted part in the middle of fig. 6, [PAD] filling is used to align the sequence lengths. The text feature fill length is 49 and the local image feature fill length is 21. To keep the lengths of all sentences in the training set consistent, a "maximum length" is specified: sentences shorter than this maximum length are padded, and sentences exceeding it are clipped. As shown in fig. 6, after using [PAD] for padding, an indicator is also required to mark that [PAD] is only padding and does not carry any meaning. Here a list called the attention mask is used, whose length equals the maximum length; for each sentence, if the word at a position is [PAD], the element at that position of the attention mask takes the value 0, and otherwise it takes the value 1. In the BERT model used by the text feature extraction module, the token numbers (token IDs) are obtained by converting the input text into tokens of the corresponding vocabulary. The input text of the BERT model is typically composed of words or sub-words (WordPieces), each of which is mapped to a unique token ID. The specific steps are: splitting into words and sub-words, mapping to token IDs, adding special tokens, padding and clipping. The token IDs and the attention_mask are fed into the pre-trained BERT model to obtain a word embedding representation for each word. After the BERT model processing, the embedded representation of each word is obtained (this embedded representation contains the context of the whole sentence). Assuming the BERT-base model is used, the dimension of each word embedding is 768.
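A short illustration of the [PAD] filling and attention mask described above, assuming the HuggingFace tokenizer API; the checkpoint name and the maximum length of 12 are arbitrary illustration values:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint

max_length = 12  # illustrative "maximum length"; shorter questions are padded, longer ones clipped
encoded = tokenizer("What organ is shown in this image?",
                    padding="max_length", max_length=max_length,
                    truncation=True)

print(encoded["input_ids"])       # token IDs; trailing 0s are the [PAD] token
print(encoded["attention_mask"])  # 1 for real tokens, 0 where [PAD] was inserted
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```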
In the training of the medical visual question-answering model, a cross-entropy loss function is adopted to calculate the loss between the answer produced by the answer prediction module from the fused features and the standard answer of the question description. The answer prediction module of the medical visual question-answering model obtains the score of each category in its last layer; the scores are passed through the Relu function to obtain probability outputs; and the cross-entropy loss function is calculated from the category probability outputs predicted by the medical visual question-answering model and the one-hot form of the real category.
In one embodiment, the loss function of the medical visual question-answering model is:
L = -(1/M) Σ_{i=1}^{M} Σ_c X_{i,c} log p(y_{i,c})
wherein L is the cross-entropy loss function, M is the number of samples, X_{i,c} is the one-hot real label of sample i for category c, y is the prediction result, and p(y_{i,c}) is the predicted probability (information quantity) that sample i belongs to category c.
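A toy illustration of this cross-entropy loss, assuming PyTorch; it compares the manual computation over one-hot labels with the built-in cross_entropy, which operates directly on logits (all numbers are arbitrary):

```python
import torch
import torch.nn.functional as F

M, num_answers = 4, 6
logits = torch.randn(M, num_answers)   # answer scores from the prediction module
labels = torch.tensor([2, 0, 5, 1])    # real answer categories X

probs = F.softmax(logits, dim=-1)               # predicted probability p for each category
one_hot = F.one_hot(labels, num_answers).float()
loss_manual = -(one_hot * probs.log()).sum(dim=-1).mean()  # -(1/M) sum_i sum_c X_ic log p_ic

# PyTorch's built-in cross_entropy gives the same value from raw logits
loss_builtin = F.cross_entropy(logits, labels)
print(loss_manual.item(), loss_builtin.item())
```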
When the medical visual question-answering model is trained, the cross-entropy loss function is used to calculate the loss between the answer produced by the answer prediction module from the fused features and the standard answer of the question description. The gradients of the medical visual question-answering model are updated according to this multi-modal cross-entropy loss, the loss is computed and the parameters are adjusted until the best-performing parameters of the medical visual question-answering model are obtained, which improves the accuracy of medical visual question answering.
The cross-entropy loss function acts as a guiding function for the weight adjustment of the medical visual question-answering model; through this function the method knows how to improve the weight coefficients. The loss between the answer produced by the answer prediction module from the fused features and the standard answer of the question description is calculated, the gradients of the medical visual question-answering model are updated according to the multi-modal cross-entropy loss, and the parameters are adjusted to obtain the parameters of the best-performing model. Finally, the visual question-answering model is trained according to the computed loss values and its parameters are optimised. The total number of training epochs is set to 120, and the parameters of the medical visual question-answering model with the highest accuracy on the validation set are taken as the parameters of the final model. Experimental results are shown in Table 1, which presents the results on the VQA-RAD validation set.
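The training protocol described above (cross-entropy loss, gradient updates, 120 epochs, keeping the parameters with the highest validation accuracy) can be sketched as follows; the data loader format, optimiser and learning rate are assumptions, and model, train_loader, val_loader and evaluate are placeholders:

```python
import torch

def train(model, train_loader, val_loader, evaluate, epochs: int = 120):
    """Minimal training-loop sketch; not the application's actual training code."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate assumed
    criterion = torch.nn.CrossEntropyLoss()
    best_acc, best_state = -1.0, None
    for epoch in range(epochs):
        model.train()
        for image, input_ids, attention_mask, answer in train_loader:
            logits = model(image, input_ids, attention_mask)
            loss = criterion(logits, answer)   # loss against the standard answer
            optimizer.zero_grad()
            loss.backward()                    # gradient update of the model
            optimizer.step()
        acc = evaluate(model, val_loader)      # accuracy on the validation set
        if acc > best_acc:                     # keep the best-performing parameters
            best_acc = acc
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)
    return model, best_acc
```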
Table 1. Results of the medical visual question answering method with fine-grained cross-attention on the VQA-RAD validation set
As can be seen from Table 1, the medical visual question-answering model with fine-grained cross-attention achieves a large improvement in overall accuracy, particularly on closed questions; it captures the contextual correlation between medical images and language features well and improves the performance and robustness of the medical visual question-answering model.
It should be understood that, although the steps in the flowchart of fig. 1 are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (8)

1. A medical visual question answering method based on fine-grained cross-attention, the method comprising:
acquiring a radiological medical image and text data of the symptom question corresponding to the radiological medical image;
performing local feature extraction on the radiological medical image with a fine-grained visual feature extraction module of a medical visual question-answering model to obtain local image features;
performing feature extraction on the text data with a text feature extraction module of the medical visual question-answering model to obtain text features;
inputting the multi-modal feature pair consisting of the local image features and the text features into a cross-modal encoder module of the medical visual question-answering model for multi-modal feature fusion to obtain fused features;
inputting the fused features into an answer prediction module of the medical visual question-answering model to perform answer prediction, so as to obtain an answer prediction result;
and answering the symptom question according to the answer prediction result.
2. The medical visual question answering method based on fine-grained cross-attention according to claim 1, wherein performing local feature extraction on the radiological medical image with the fine-grained visual feature extraction module of the medical visual question-answering model to obtain local image features comprises:
inputting the radiological medical image into a feature extraction unit of the fine-grained visual feature extraction module for feature extraction to obtain preliminary image features;
inputting the preliminary image features into a full convolution unit of the fine-grained visual feature extraction module for processing to obtain processed image features;
and inputting the processed image features into a fine-grained visual feature extraction unit of the fine-grained visual feature extraction module for feature extraction to obtain local image features.
3. The medical visual question answering method based on fine-grained cross-attention according to claim 1, wherein the cross-modal encoder module of the medical visual question-answering model comprises N cross-modal encoding layers and one feature pooling layer connected in sequence;
the input of the first cross-modal encoding layer is the output of the fine-grained visual feature extraction module and the output of the text feature extraction module, the output of the first cross-modal encoding layer is the input of the second cross-modal encoding layer, and so on; the output of the last cross-modal encoding layer is the input of the feature pooling layer, and the output of the feature pooling layer is the input of the answer prediction module.
4. The medical visual question answering method based on fine-grained cross-attention according to claim 3, wherein the cross-modal encoding layer comprises a first self-attention layer, a second self-attention layer, a first cross-attention layer, a second cross-attention layer, a first feed-forward sub-layer, and a second feed-forward sub-layer;
the output of the first self-attention layer is input into the first and second cross-attention layers, the output of the second self-attention layer is input into the first and second cross-attention layers, the output of the first cross-attention layer is input into the first and second feedforward sublayers, and the output of the second cross-attention layer is input into the first and second feedforward sublayers.
5. The medical visual question answering method based on fine-grained cross-attention according to claim 4, wherein the processing expression of the first self-attention layer is:
Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k) V_i
the processing expression of the second self-attention layer is:
Attention(Q_t, K_t, V_t) = softmax(Q_t K_t^T / √d_k) V_t
wherein Attention(Q_i, K_i, V_i) is the output of the first self-attention layer, Attention(Q_t, K_t, V_t) is the output of the second self-attention layer, Q_i, K_i and V_i are the input feature query, key and value of the first self-attention layer, Q_t, K_t and V_t are the input feature query, key and value of the second self-attention layer, d_k is the feature dimension, T denotes transposition, and softmax(·) is the softmax function.
6. The medical visual question answering method based on fine-grained cross-attention according to claim 4, wherein the processing expression of the first cross-attention layer is:
h_i^n = CrossAtt_{t→v}(h_i^{n-1}, v_1^{n-1}, …, v_m^{n-1})
the processing expression of the second cross-attention layer is:
v_j^n = CrossAtt_{v→t}(v_j^{n-1}, h_1^{n-1}, …, h_k^{n-1})
wherein h_i^n is the output of the first cross-attention layer at the i-th position, CrossAtt_{t→v}(·) is the cross-modal attention operation from text to vision that captures the relation between text and image, h_i^{n-1} is the hidden state at the i-th position of the (n-1)-th layer on the text side, v_1^{n-1}, …, v_m^{n-1} are the image representations from the first to the last position in the layer preceding the n-th layer, v_j^n is the output of the second cross-attention layer at the j-th position, CrossAtt_{v→t}(·) is the cross-modal attention operation from vision to text that captures the relation between image and text, v_j^{n-1} is the image representation at the j-th position of the (n-1)-th layer, and h_1^{n-1}, …, h_k^{n-1} are the text hidden states from the first to the last position in the layer preceding the n-th layer.
7. The medical visual question answering method based on fine-grained cross-attention according to claim 1, wherein the answer prediction module comprises: a first fully connected layer, a Relu function, a weight normalization layer, and a second fully connected layer;
the first full-connection layer, the Relu function, the weight normalization layer and the second full-connection layer are sequentially connected.
8. The medical visual question answering method based on fine-grained cross-attention according to claim 7, wherein the loss function of the medical visual question-answering model is:
L = -(1/M) Σ_{i=1}^{M} Σ_c X_{i,c} log p(y_{i,c})
wherein L is the cross-entropy loss function, M is the number of samples, X_{i,c} is the one-hot real label of sample i for category c, y is the prediction result, and p(y_{i,c}) is the predicted probability (information quantity) that sample i belongs to category c.
CN202311490620.3A 2023-11-10 2023-11-10 Medical visual question answering method based on fine-grained cross-attention Pending CN117235670A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311490620.3A CN117235670A (en) 2023-11-10 2023-11-10 Medical visual question answering method based on fine-grained cross-attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311490620.3A CN117235670A (en) 2023-11-10 Medical visual question answering method based on fine-grained cross-attention

Publications (1)

Publication Number Publication Date
CN117235670A true CN117235670A (en) 2023-12-15

Family

ID=89098489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311490620.3A Pending CN117235670A (en) 2023-11-10 Medical visual question answering method based on fine-grained cross-attention

Country Status (1)

Country Link
CN (1) CN117235670A (en)


Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230082605A1 (en) * 2020-08-12 2023-03-16 Tencent Technology (Shenzhen) Company Limited Visual dialog method and apparatus, method and apparatus for training visual dialog model, electronic device, and computer-readable storage medium
CN112488055A (en) * 2020-12-18 2021-03-12 贵州大学 Video question-answering method based on progressive graph attention network
CN114201592A (en) * 2021-12-02 2022-03-18 重庆邮电大学 Visual question-answering method for medical image diagnosis
CN114398961A (en) * 2021-12-28 2022-04-26 西南交通大学 Visual question-answering method based on multi-mode depth feature fusion and model thereof
KR20230128812A (en) * 2022-02-28 2023-09-05 전남대학교산학협력단 Cross-modal learning-based emotion inference system and method
CN116756361A (en) * 2022-03-03 2023-09-15 四川大学 Medical visual question-answering method based on corresponding feature fusion
CN114913403A (en) * 2022-07-18 2022-08-16 南京信息工程大学 Visual question-answering method based on metric learning
CN115455162A (en) * 2022-09-14 2022-12-09 东南大学 Answer sentence selection method and device based on hierarchical capsule and multi-view information fusion
CN115994212A (en) * 2023-03-15 2023-04-21 阿里巴巴达摩院(杭州)科技有限公司 Visual question-answering processing method, visual question-answering model training method and device
CN116484042A (en) * 2023-05-16 2023-07-25 厦门医学院 Visual question-answering method combining autocorrelation and interactive guided attention mechanism
CN116258946A (en) * 2023-05-16 2023-06-13 苏州大学 Precondition-based multi-granularity cross-modal reasoning method and device
CN116662591A (en) * 2023-06-02 2023-08-29 北京理工大学 Robust visual question-answering model training method based on contrast learning
CN116932722A (en) * 2023-07-26 2023-10-24 海南大学 Cross-modal data fusion-based medical visual question-answering method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZIHENG WU et al.: "FGCVQA: Fine-Grained Cross-Attention for Medical VQA", 2023 IEEE International Conference on Image Processing, pages 975-979 *
包翠竹 (BAO Cuizhu) et al.: "视频问答技术研究进展" [Research Progress on Video Question Answering], 《计算机研究与发展》 (Journal of Computer Research and Development), pages 1-33 *

Similar Documents

Publication Publication Date Title
WO2021179205A1 (en) Medical image segmentation method, medical image segmentation apparatus and terminal device
JP7195365B2 (en) A Method for Training Convolutional Neural Networks for Image Recognition Using Image Conditional Mask Language Modeling
CN109726696B (en) Image description generation system and method based on attention-pushing mechanism
EP4266195A1 (en) Training of text and image models
CN109471895A (en) The extraction of electronic health record phenotype, phenotype name authority method and system
CN112541501A (en) Scene character recognition method based on visual language modeling network
CN114549850B (en) Multi-mode image aesthetic quality evaluation method for solving modal missing problem
CN111275118B (en) Chest film multi-label classification method based on self-correction type label generation network
CN111949824B (en) Visual question-answering method and system based on semantic alignment and storage medium
CN114564959B (en) Chinese clinical phenotype fine granularity named entity identification method and system
CN111984772A (en) Medical image question-answering method and system based on deep learning
CN112687388A (en) Interpretable intelligent medical auxiliary diagnosis system based on text retrieval
CN110276396B (en) Image description generation method based on object saliency and cross-modal fusion features
CN111444367A (en) Image title generation method based on global and local attention mechanism
CN116797848A (en) Disease positioning method and system based on medical image text alignment
CN114999637A (en) Pathological image diagnosis method and system based on multi-angle coding and embedded mutual learning
CN115631183A (en) Method, system, device, processor and storage medium for realizing classification and identification of X-ray image based on double-channel decoder
CN115546553A (en) Zero sample classification method based on dynamic feature extraction and attribute correction
CN117391092B (en) Electronic medical record multi-mode medical semantic alignment method based on contrast learning
CN115862837A (en) Medical visual question-answering method based on type reasoning and semantic constraint
CN117079264A (en) Scene text image recognition method, system, equipment and storage medium
CN117235670A (en) Medical visual question answering method based on fine-grained cross-attention
CN116881422A (en) Knowledge visual question-answering method and system generated by triple asymmetry and principle
CN116779177A (en) Endocrine disease classification method based on unbiased mixed tag learning
CN117688168A (en) Method and related device for generating abstract

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20231215