CN115512191A - Question and answer combined image natural language description method - Google Patents

Question and answer combined image natural language description method

Info

Publication number
CN115512191A
CN115512191A (Application CN202211150406.9A)
Authority
CN
China
Prior art keywords
image
question
target
model
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211150406.9A
Other languages
Chinese (zh)
Inventor
卫志华
刘官明
张恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN202211150406.9A
Publication of CN115512191A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/778 Active pattern-learning, e.g. online learning of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A question-and-answer combined image natural language description method comprises three steps. Step one: an image segmentation model extracts features of the image targets and the image background, obtains pixel-level partitions into different classes, and produces a segmentation feature map of targets and background. Step two: a question generation module builds an implicit scene-type representation to produce a relation feature map containing attended-target information, and generates several semantically related guiding questions at multiple granularities. Step three: a joint question-answering module introduces a contrastive-learning loss function and performs joint multimodal embedding of the relation feature map and the guiding questions; after training, the model can generate long-text answers related to the questions as a refined semantic description of the image content.

Description

An Image Natural Language Description Method with Joint Question Answering

Technical Field

The invention belongs to the fields of computer vision and natural language processing.

Background Art

Image description generation is a multimodal task spanning text and images, whose goal is to produce a corresponding natural language description from a picture. The task is very easy for humans but highly challenging for computers. With the popularity of deep learning, more and more work attempts to solve machine image description with neural networks.

However, because natural language descriptions are diverse and lack a single standard form, training on a specific dataset yields only descriptions that match that dataset's distribution, which limits the scope of application. Moreover, most image description methods passively generate monotonous sentences without considering the connection between the scene in the image and the targets of interest; they therefore struggle to describe different, multi-scale targets at the same time and tend to overlook latent relationships. Such mechanical image descriptions deviate considerably from human semantic cognition of images and often fail to support mutual understanding when interacting with real people.

Image description generation is generally divided into two sub-modules: image feature extraction and text generation. A common approach to image feature extraction is to use an image recognition neural network to extract targets, but this loses image information and biases the extracted features. Meanwhile, short-sentence generation based on such features typically attends only to specific targets and cannot produce multi-granularity descriptions from the rich semantic information in an image, which easily causes a large loss of cross-modal information. As a result, a large gap remains between machine image description and humans' natural cognition of images.

Summary of the Invention

To address the deficiencies of the background art, the present invention provides an image natural language description method with joint question answering. Image content descriptions are generated on the basis of a visual question answering model. By designing an image segmentation module, a question generation module, and a joint question answering module, regions of different scales in a relation feature map yield multiple semantically related questions; with the question-answer correspondence, guiding questions with multi-granularity features produce a refined description of the image scene. The method can capture the likelihood of implicit facts in the image and generate natural language descriptions that better match human semantic cognition of images.

The present invention adopts the following technical scheme:

An image natural language description method with joint question answering, characterized in that a refined description of the image content is generated on the basis of a visual question answering model. First, the image segmentation module obtains a segmentation feature map that partitions image targets and background into classes; second, the question generation module builds an implicit scene-type representation and produces multi-granularity guiding questions centered on the attended targets; finally, the joint question answering module introduces a contrastive-learning loss function and performs joint multimodal embedding of the relation feature map and the guiding questions. Building on visual question answering, the method generates a natural language description of the image content from the question-answer correspondence.

An image natural language description method with joint question answering comprises three steps:

Step one: an image segmentation model first extracts features of the image targets and the image background, obtains pixel-level partitions into different classes, and produces a segmentation feature map of targets and background;

Step two: the question generation module builds an implicit scene-type representation to produce a relation feature map containing attended-target information, and generates several semantically related guiding questions at multiple granularities;

Step three: the joint question answering module introduces a contrastive-learning loss function and performs joint multimodal embedding of the relation feature map and the guiding questions; after training, the model can generate long-text answers related to the questions as a refined semantic description of the image content.

For step one, the present invention provides a preferred scheme for extracting image features.

For step two, the present invention discloses a question generation model based on the LSTM model, characterized in that it processes the segmentation feature map: by building an implicit scene-type representation, it first produces a relation feature map containing attended-target information, and then, centered on the attended targets, establishes multi-scale connections among the attended targets of the image and between the attended targets and the background; the generated multi-granularity guiding questions serve as one link in the subsequent joint question answering.

For step three, the present invention discloses a joint question answering model based on the BUTD (bottom-up and top-down attention) model, characterized in that a contrastive-learning loss function is introduced to jointly embed the relation feature map and the guiding questions, improving the model's cross-modal learning ability, strengthening its understanding of the semantic connection between the image and the question answers, and generating a refined description of the image content.

Specifically:

Step 1: Image Segmentation

1.1 Use a publicly available image semantic segmentation dataset, in which the images carry pixel-level class annotations.

1.2 Train on the image segmentation dataset with deep learning methods and construct an image segmentation neural network model. The task of image segmentation is dense prediction over the image: by labeling different targets with specific colors, every pixel is assigned the class of the target or enclosed region it belongs to.
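The dense-prediction objective in 1.2 can be made concrete with a small sketch. The following is a minimal illustration only (PyTorch is assumed here; the patent does not prescribe a framework), showing that a segmentation network is trained with a cross-entropy loss evaluated at every pixel against the pixel-level class annotations:

```python
# Minimal sketch: dense prediction = a per-pixel classifier, trained with
# cross-entropy over the pixel-level labels. Shapes are illustrative.
import torch
import torch.nn as nn

num_classes = 30                                               # e.g. a Cityscapes-style label set
logits = torch.randn(4, num_classes, 256, 512, requires_grad=True)  # (batch, classes, H, W) from a segmentation net
labels = torch.randint(0, num_classes, (4, 256, 512))          # pixel-level class annotations

criterion = nn.CrossEntropyLoss(ignore_index=255)              # ignoring unlabeled pixels is a common convention
loss = criterion(logits, labels)
loss.backward()
print(loss.item())
```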

1.3 Save the model weights of the trained image segmentation neural network. The network processes the original image, distinguishes the different targets and the background, and finally outputs segmentation feature maps between targets and between targets and background.

Step 2: Question Generation

2.1 Process a publicly available visual question generation dataset and classify the question categories it contains. Different question categories view the targets and their relationships from multiple perspectives; the several question categories of the same image attend not only to different targets but also to image regions of the same target at different scales. Meanwhile, merge the answers and questions of the dataset to produce complete natural language descriptions.
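The merging of answers and questions into one complete sentence is not specified further in this section; the sketch below illustrates one possible template-based heuristic (the rules and examples are assumptions, not taken from the patent):

```python
# Illustrative sketch only: fold a question and its answer into a single
# declarative sentence, as step 2.1 describes, using simple templates.
def merge_question_answer(question: str, answer: str) -> str:
    q = question.strip().rstrip("?")
    a = answer.strip()
    if q.lower().startswith("is there"):
        return f"There is {a}." if a.lower() not in ("no", "none") else "There is none."
    if q.lower().startswith("what color is"):
        subject = q[len("what color is"):].strip()
        return f"{subject.capitalize()} is {a}."
    return f"{q.capitalize()}: {a}."          # fallback template

print(merge_question_answer("What color is the traffic light?", "red"))
# -> "The traffic light is red."
```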

2.2 Train on the processed visual question generation dataset with deep learning methods and construct a question generation neural network model. The question generation model builds an implicit scene-type representation and first produces a relation feature map containing attended-target information; then, centered on the attended targets, it learns the correlation between question categories and image regions of different granularities and generates, at multiple scales, different questions related to the attended-target context.

Step 3: Joint Question Answering

3.1 Integrate the image segmentation module, the question generation module, and the joint question answering module. With the guiding questions as context, learn with a top-down attention mechanism, introduce a contrastive-learning loss function, and perform joint multimodal embedding of the guiding questions and the relation feature map. With the trained network, output candidate answers and their confidences and generate a natural language description of the image content.
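As a rough illustration of the top-down attention step in 3.1, the following sketch (PyTorch assumed; dimensions and layer sizes are illustrative) lets a guiding-question embedding score the regions of the visual feature map before the joint embedding:

```python
# Sketch of question-guided (top-down) attention: the question scores each
# image region, and the regions are pooled by those scores.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    def __init__(self, v_dim=256, q_dim=512, hid=512):
        super().__init__()
        self.proj_v = nn.Linear(v_dim, hid)
        self.proj_q = nn.Linear(q_dim, hid)
        self.score = nn.Linear(hid, 1)

    def forward(self, v, q):
        # v: (batch, regions, v_dim) region features from the segmentation map
        # q: (batch, q_dim) guiding-question embedding
        joint = torch.tanh(self.proj_v(v) + self.proj_q(q).unsqueeze(1))
        alpha = F.softmax(self.score(joint), dim=1)     # attention weights over regions
        return (alpha * v).sum(dim=1)                   # attended visual feature

att = TopDownAttention()
v_hat = att(torch.randn(2, 36, 256), torch.randn(2, 512))
print(v_hat.shape)   # torch.Size([2, 256])
```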

Beneficial effects of the present invention:

1. Image segmentation (using an existing segmentation model from prior work) covers not only the explicit pixel-level information of targets and background but also the hidden subordinate, parallel, and logical relationships among objects, providing comprehensive and fine-grained image feature information.

2. Compared with generating image descriptions directly, the multi-granularity feature questions generated from the image in step two can guide more detailed and specific multi-scale answers, and can also capture the likelihood of implicit facts in the image, providing the ability to describe simple events.

3. The scheme provides a new image description method that takes multi-granularity feature questions as the guiding core of image content description generation and uses a visual question answering model to build the question-answer correspondence (this application is the first to propose generating image descriptions with a visual question answering model). Introducing a contrastive-learning loss function improves the model's cross-modal learning ability and yields natural language descriptions that better match human semantic cognition of images, which helps improve the efficiency of human-computer interaction.

Brief Description of the Drawings

Figure 1: image segmentation model diagram

Figure 2: the question generation model designed by the present invention

Figure 3: joint question answering model diagram

Figure 4: example image description with joint question answering

Detailed Description

Embodiment

An example of image description with joint question answering is shown in Figure 4.

A segmentation-based image content description method built on visual question answering comprises the following steps:

Step 1: Image Segmentation

1.1 In this embodiment, the Cityscapes dataset is used. It focuses on visual understanding of complex urban street scenes and contains stereo video sequences recorded in street scenes from 50 different cities, with 20,000 weakly annotated frames and 5,000 frames with high-quality pixel-level annotations, providing annotations for 30 classes in total, including pedestrians, vehicles, roads, buildings, traffic lights, and traffic signs.
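For reference, pixel-level Cityscapes annotations can be loaded, for example, with torchvision's built-in dataset class; the local path below is an assumption and the data must already be downloaded:

```python
# Sketch: one common way to load Cityscapes-style pixel-level masks with torchvision.
from torchvision import datasets, transforms

train_set = datasets.Cityscapes(
    root="./cityscapes",            # assumed local path to the downloaded dataset
    split="train",
    mode="fine",                    # the 5000 finely annotated frames
    target_type="semantic",         # pixel-level class masks
    transform=transforms.ToTensor(),
    target_transform=transforms.PILToTensor(),
)
image, mask = train_set[0]
print(image.shape, mask.shape)      # e.g. (3, 1024, 2048) and (1, 1024, 2048)
```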

1.2 In this embodiment, the image segmentation dataset is trained with deep learning methods to construct a DeepLab image segmentation network with an encoder-decoder structure. The DeepLab series is a fully convolutional semantic segmentation model based on dilated convolution; the proposed atrous convolution recovers, without increasing the number of parameters, as much as possible of the feature-map resolution that is otherwise progressively reduced by convolution operations. The DeepLab v3+ model adopted in this embodiment is a segmentation network with an encoder-decoder structure and offers high computation speed and prediction accuracy. The encoder module extracts high-level semantic information from the image; the decoder module recovers low-level spatial information to obtain clear and complete class segmentation boundaries and the relationships between targets and between targets and background. The network model is shown in Figure 1.

In the encoder stage, an Xception network incorporating depthwise separable convolutions is first adopted as the backbone to extract initial features from the input image. The Xception network uses residual connections and depthwise separable convolutions to downsample the feature maps. The subsequent Atrous Spatial Pyramid Pooling (ASPP) module further extracts multi-scale feature information: a 1×1 convolution, global average pooling, and atrous convolutions with dilation rates of 6, 12, and 18 are combined in a parallel structure. The multi-scale feature maps are then compressed to 256 channels by a 1×1 convolution, and the feature-map resolution is reduced to 1/16 of the original image, forming the feature-map output of the encoder.

In the decoder stage, the encoder output feature map is upsampled by a factor of 4 with bilinear interpolation and, after channel adjustment, concatenated with the shallow feature map of the corresponding level of the backbone. Two 3×3 convolution layers then refine the feature map, and a final 4× bilinear upsampling yields a segmentation prediction of the same size as the original image, with rich detail and global information.
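The encoder and decoder described above can be sketched compactly as follows. This is a simplified illustration rather than the exact network of the embodiment: the Xception backbone is omitted, and only the ASPP module (dilation rates 6, 12, 18, projection to 256 channels) and the 4×-upsample / concatenate / two-3×3-convolution / 4×-upsample decoder path are shown (PyTorch assumed):

```python
# Compact DeepLab v3+-style sketch: ASPP over a 1/16-resolution feature map,
# then a decoder that fuses a 1/4-resolution low-level map and upsamples twice.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)   # compress channels to 256

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=x.shape[-2:], mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

class Decoder(nn.Module):
    def __init__(self, low_ch, num_classes, out_ch=256):
        super().__init__()
        self.reduce_low = nn.Conv2d(low_ch, 48, 1)        # channel adjustment of the shallow features
        self.refine = nn.Sequential(
            nn.Conv2d(out_ch + 48, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.classify = nn.Conv2d(out_ch, num_classes, 1)

    def forward(self, x, low, out_size):
        x = F.interpolate(x, size=low.shape[-2:], mode="bilinear", align_corners=False)   # first 4x upsample
        x = self.refine(torch.cat([x, self.reduce_low(low)], dim=1))                      # two 3x3 convolutions
        return F.interpolate(self.classify(x), size=out_size, mode="bilinear", align_corners=False)  # final 4x upsample

# toy check with random "backbone" outputs at 1/16 and 1/4 of a 512x1024 input
aspp, dec = ASPP(in_ch=2048), Decoder(low_ch=256, num_classes=30)
high = torch.randn(1, 2048, 32, 64)
low = torch.randn(1, 256, 128, 256)
print(dec(aspp(high), low, out_size=(512, 1024)).shape)   # torch.Size([1, 30, 512, 1024])
```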

1.3 Save the model weights of the trained DeepLab v3+ image segmentation neural network. The network processes the original image, distinguishes the different targets and the background, and finally outputs segmentation feature maps between targets and between targets and background, which exhaustively capture the pixel-level features of the image.
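The save/restore/inference flow of 1.3 might look like the following sketch. To keep it self-contained it uses torchvision's DeepLab v3 (not v3+) with random inputs; file names and sizes are placeholders:

```python
# Sketch of persisting trained weights, restoring them, and turning logits
# into a per-pixel class map used downstream as the segmentation feature map.
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

model = deeplabv3_resnet50(num_classes=30)                     # stand-in architecture for the sketch
torch.save(model.state_dict(), "segmentation.pth")             # after training

model.load_state_dict(torch.load("segmentation.pth", map_location="cpu"))
model.eval()
with torch.no_grad():
    out = model(torch.randn(1, 3, 512, 1024))["out"]           # (1, 30, 512, 1024) logits
    seg_map = out.argmax(dim=1)                                 # per-pixel class ids
print(seg_map.shape)                                            # torch.Size([1, 512, 1024])
```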

Step 2: Question Generation

2.1 In this embodiment, publicly available visual question generation datasets (including the SQuAD dataset, among others) are processed and the question categories in the data are classified. Different question categories view the targets and their relationships from multiple perspectives; the several question categories of the same image attend not only to different targets but also to image regions of the same target at different scales. Meanwhile, the answers and questions of the dataset are merged to produce complete natural language descriptions.

The question categories include Object, Attribute, Relationship, Counting, Behavior, and so on.

2.2 In this embodiment, the processed visual question generation dataset is trained with deep learning methods to construct a question generation neural network model.

To generate visually grounded questions directly from the feature maps, an MLP (multilayer perceptron) with residual connections is built. It takes the segmentation feature map containing class information as input and outputs a relation feature map containing information about the main attended targets, which guides the subsequent question generation.

A typical MLP has three layers: an input layer, a hidden layer, and an output layer, with full connections between layers. The hidden layer learns, through training, the weights of the image's scene-type representation. Each layer is computed as follows:

H = XW_h + b_h

O = HW_o + b_o

Y = σ(O)

Here X is the feature input, Y is the feature output, W_h and W_o are the weights of the hidden layer and the output layer, b_h and b_o are the biases of the hidden layer and the output layer, and σ is the activation function, for which the sigmoid function is used here.
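A minimal sketch of this MLP, including the residual connection to the segmentation features described below, could look as follows (PyTorch assumed; feature dimensions and the exact way the sigmoid output is combined with the input are illustrative assumptions):

```python
# Sketch: the sigmoid output acts as per-region relation/attention weights u,
# and a residual path keeps the original segmentation features.
import torch
import torch.nn as nn

class RelationMLP(nn.Module):
    def __init__(self, dim=256, hidden=512):
        super().__init__()
        self.hidden_layer = nn.Linear(dim, hidden)    # H = X W_h + b_h
        self.output_layer = nn.Linear(hidden, dim)    # O = H W_o + b_o

    def forward(self, x):
        u = torch.sigmoid(self.output_layer(self.hidden_layer(x)))   # Y = sigmoid(O)
        return x + u * x, u      # residual combination -> relation feature map, plus weights u

seg_features = torch.randn(2, 36, 256)        # (batch, regions, channels) from the segmentation map
relation_map, u = RelationMLP()(seg_features)
print(relation_map.shape, u.shape)            # both torch.Size([2, 36, 256])
```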

The segmentation feature map generates the relation feature map through residual connections. The channels representing relation weights indicate the saliency of the attended targets, and each enclosed target region of the segmentation feature map has a corresponding attention coefficient, with u denoting the relation feature layer. The question generation model is then built on an LSTM (long short-term memory recurrent neural network): it establishes, at multiple granularities and levels, the association between each attended target in the relation feature map and the classes of the surrounding enclosed regions, and generates related guiding questions at different scales according to the predicted highest-priority question category.

The LSTM learns the relationship between attended targets and questions, so that during training the model can lock onto the regions relevant to the attended target according to the question. The LSTM involves the propagation of a short-term memory h and a long-term memory C through input, output, and forget gates, with the following formulas (prior art):

z_t = tanh(W_z x_t + R_z y_{t-1} + b_z)

i_t = σ(W_i x_t + R_i y_{t-1} + p_i ⊙ c_{t-1} + b_i)

f_t = σ(W_f x_t + R_f y_{t-1} + p_f ⊙ c_{t-1} + b_f)

c_t = z_t ⊙ i_t + c_{t-1} ⊙ f_t

o_t = σ(W_o x_t + R_o y_{t-1} + p_o ⊙ c_t + b_o)

y_t = tanh(c_t) ⊙ o_t

Here x_t is the input at the current time step, y_{t-1} is the short-term memory of the previous time step, c_{t-1} is the long-term memory of the previous time step, W_f, W_i, W_z, and W_o are the corresponding input weights, R_f, R_i, R_z, and R_o are the corresponding recurrent weights, p_i denotes the peephole weight matrix used to update the granularity, b_f, b_i, b_z, and b_o are the corresponding biases, and σ is the sigmoid activation function.
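Because PyTorch's built-in LSTM has no peephole connections, a cell implementing the above prior-art formulas has to be written out by hand; the following is a sketch with illustrative dimensions:

```python
# Sketch of one peephole LSTM step using the variable names above.
import torch
import torch.nn as nn

class PeepholeLSTMCell(nn.Module):
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, 4 * hid_dim)                # W_z, W_i, W_f, W_o stacked
        self.R = nn.Linear(hid_dim, 4 * hid_dim, bias=False)   # R_z, R_i, R_f, R_o stacked
        self.p_i = nn.Parameter(torch.zeros(hid_dim))          # peephole weights
        self.p_f = nn.Parameter(torch.zeros(hid_dim))
        self.p_o = nn.Parameter(torch.zeros(hid_dim))

    def forward(self, x_t, y_prev, c_prev):
        z, i, f, o = (self.W(x_t) + self.R(y_prev)).chunk(4, dim=-1)
        z_t = torch.tanh(z)                                    # block input
        i_t = torch.sigmoid(i + self.p_i * c_prev)             # input gate
        f_t = torch.sigmoid(f + self.p_f * c_prev)             # forget gate
        c_t = z_t * i_t + c_prev * f_t                         # long-term memory
        o_t = torch.sigmoid(o + self.p_o * c_t)                # output gate
        y_t = torch.tanh(c_t) * o_t                            # short-term memory / output
        return y_t, c_t

cell = PeepholeLSTMCell(in_dim=300, hid_dim=512)
y, c = cell(torch.randn(2, 300), torch.zeros(2, 512), torch.zeros(2, 512))
print(y.shape, c.shape)   # torch.Size([2, 512]) torch.Size([2, 512])
```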

The loss function of the question generation neural network model is as follows:

[The loss formula is given as an image in the original document.]

Here q̂_i is the question vector generated by the model, q_i is the ground-truth question vector in the dataset, and u is the predicted relation weight value, which is positive.

The network model for question generation is shown in Figure 2. The relation feature map provides the overall context of the image. By building an implicit scene-type representation, the attended targets efficiently concentrate question generation on a few targets, better constraining the attended targets of the image scene and providing a more abstract representation of the target focus. Centered on the attended targets, several questions of different scales are generated at different granularity levels according to the predicted highest-priority question category. These questions cover the important information of the core regions of the image, making the guiding questions produced by the question generation neural network model more comprehensive.

Step 3: Joint Question Answering

3.1 In this embodiment, the image segmentation module of step one, the question generation module of step two, and the joint question answering module of step three are integrated. With the guiding questions as context, learning uses a top-down attention mechanism, and the guiding questions and the relation feature map are jointly embedded in a multimodal representation.

First, the question feature vector q produced by question generation and the segmentation feature map v obtained from image segmentation are connected as the context, guiding the model to train the weights W_q and W_v for v and q:

f_q(q) = W_q q

f_v(v) = W_v v

Multimodal embedding of f_q(q) and f_v(v) is computed with the Hadamard product:

h = f_q(q) ⊙ f_v(v)

p(y) = σ(h)

Here f_q denotes the output of the question stream, f_v denotes the output of the visual stream, and p(y) is the final output; a question feature vector q and a segmentation feature map v that match more closely receive a higher score. W_q and W_v denote weight matrices, and σ denotes the linear activation function. The loss function of the joint embedding is a contrastive loss:

[The contrastive loss formula is given as an image in the original document.]

Here y^T denotes the output for the correctly matched question feature vector q and segmentation feature map v.
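A minimal sketch of this joint embedding (PyTorch assumed) projects q and v with W_q and W_v, fuses them with a Hadamard product, and applies a sigmoid. The final linear layer over candidate answers is an added assumption in the spirit of BUTD-style answer prediction, and the exact contrastive loss of the patent, given only as a figure, is not reproduced; any standard contrastive objective over matched/mismatched (q, v) pairs could be plugged in:

```python
# Sketch: Hadamard-product fusion of question and visual features, with a
# sigmoid producing a confidence per candidate answer.
import torch
import torch.nn as nn

class JointEmbedding(nn.Module):
    def __init__(self, q_dim=512, v_dim=256, joint_dim=512, num_answers=1000):
        super().__init__()
        self.W_q = nn.Linear(q_dim, joint_dim, bias=False)    # f_q(q) = W_q q
        self.W_v = nn.Linear(v_dim, joint_dim, bias=False)    # f_v(v) = W_v v
        self.classifier = nn.Linear(joint_dim, num_answers)   # assumed: maps h to candidate answers

    def forward(self, q, v):
        h = self.W_q(q) * self.W_v(v)                 # Hadamard-product multimodal embedding
        return torch.sigmoid(self.classifier(h))      # confidence per candidate answer

model = JointEmbedding()
p = model(torch.randn(4, 512), torch.randn(4, 256))
print(p.shape, p.argmax(dim=1))   # (4, 1000) confidences and the top answer index per sample
```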

The network model of the joint question answering is shown in Figure 3. Integrating the image segmentation module, the question generation module, and the joint question answering module, the trained network outputs the answer prediction with the highest confidence. Built on visual question answering, the method produces a refined description of the image content.

Claims (8)

1. An image natural language description method with joint question answering, characterized by comprising three steps:

Step one: an image segmentation model first extracts features of the image targets and the image background, obtains pixel-level partitions into different classes, and produces a segmentation feature map of targets and background;

Step two: a question generation module builds an implicit scene-type representation to produce a relation feature map containing attended-target information, and generates several semantically related guiding questions at multiple granularities;

Step three: a joint question answering module introduces a contrastive-learning loss function and performs joint multimodal embedding of the relation feature map and the guiding questions; through training, the model generates long-text answers related to the questions as a refined semantic description of the image content.

2. The description method according to claim 1, characterized in that: for step two, a question generation model based on the LSTM model processes the segmentation feature map; by building an implicit scene-type representation, it first produces a relation feature map containing attended-target information, and then, centered on the attended targets, establishes multi-scale connections among the attended targets of the image and between the attended targets and the background; the generated multi-granularity guiding questions serve as one link in the subsequent joint question answering.

3. The description method according to claim 1, characterized in that: for step three, a joint question answering model based on the BUTD model introduces a contrastive-learning loss function and jointly embeds the relation feature map and the guiding questions, improving the model's cross-modal learning ability, strengthening its understanding of the semantic connection between the image and the question answers, and generating a refined description of the image content.

4. The description method according to claim 1, characterized in that step one, image segmentation, comprises:

1.1 using a publicly available image semantic segmentation dataset in which the images carry pixel-level class annotations;

1.2 training on the image segmentation dataset with deep learning methods and constructing an image segmentation neural network model, where the task of image segmentation is dense prediction over the image: by labeling different targets with specific colors, every pixel is assigned the class of the target or enclosed region it belongs to;

1.3 saving the model weights of the trained image segmentation neural network, which processes the original image, distinguishes the different targets and the background, and finally outputs segmentation feature maps between targets and between targets and background.

5. The description method according to claim 1 or 2, characterized in that step two, question generation, comprises:

2.1 processing a publicly available visual question generation dataset and classifying the question categories it contains, where different question categories view the targets and their relationships from multiple perspectives, and the several question categories of the same image attend not only to different targets but also to image regions of the same target at different scales; meanwhile, merging the answers and questions of the dataset to produce complete natural language descriptions;

2.2 training on the processed visual question generation dataset with deep learning methods and constructing a question generation neural network model, where the question generation model builds an implicit scene-type representation and first produces a relation feature map containing attended-target information, and then, centered on the attended targets, learns the correlation between question categories and image regions of different granularities and generates, at multiple scales, different questions related to the attended-target context.

6. The description method according to claim 1 or 3, characterized in that step three, joint question answering, comprises:

3.1 integrating the image segmentation module, the question generation module, and the joint question answering module, taking the guiding questions as context, learning with a top-down attention mechanism, introducing a contrastive-learning loss function, and performing joint multimodal embedding of the guiding questions and the relation feature map; with the trained network, outputting candidate answers and their confidences and generating a natural language description of the image content.

7. The description method according to claim 5, characterized in that the loss function of the question generation neural network model is as follows:

[The loss formula is given as an image in the original document.]

Here q̂_i is the question vector generated by the model, q_i is the ground-truth question vector in the dataset, and u is the predicted relation weight value, which is positive.

8. The description method according to claim 5, characterized in that the loss function of the joint embedding is a contrastive loss:

[The contrastive loss formula is given as an image in the original document.]

Here y^T denotes the output for the correctly matched question feature vector q and segmentation feature map v.
CN202211150406.9A 2022-09-21 2022-09-21 Question and answer combined image natural language description method Pending CN115512191A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211150406.9A CN115512191A (en) 2022-09-21 2022-09-21 Question and answer combined image natural language description method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211150406.9A CN115512191A (en) 2022-09-21 2022-09-21 Question and answer combined image natural language description method

Publications (1)

Publication Number Publication Date
CN115512191A true CN115512191A (en) 2022-12-23

Family

ID=84503366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211150406.9A Pending CN115512191A (en) 2022-09-21 2022-09-21 Question and answer combined image natural language description method

Country Status (1)

Country Link
CN (1) CN115512191A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116663530A (en) * 2023-08-01 2023-08-29 北京高德云信科技有限公司 Data generation method, device, electronic equipment and storage medium
CN116663530B (en) * 2023-08-01 2023-10-20 北京高德云信科技有限公司 Data generation method, device, electronic equipment and storage medium
CN117312512A (en) * 2023-09-25 2023-12-29 星环信息科技(上海)股份有限公司 Question and answer method and device based on large model, electronic equipment and storage medium
CN118736361A (en) * 2024-06-07 2024-10-01 江苏建筑职业技术学院 Image description generation method based on panoptic segmentation and multi-visual features

Similar Documents

Publication Publication Date Title
CN113158875B (en) Image-text sentiment analysis method and system based on multi-modal interaction fusion network
CN112559698B (en) Method and system for improving video question answering accuracy based on multimodal fusion model
Areeb et al. Helping hearing-impaired in emergency situations: A deep learning-based approach
Zhu et al. Multimedia intelligence: When multimedia meets artificial intelligence
JP6351689B2 (en) Attention based configurable convolutional neural network (ABC-CNN) system and method for visual question answering
CN115512191A (en) Question and answer combined image natural language description method
CN110377710A (en) A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion
CN114220154B (en) A method for micro-expression feature extraction and recognition based on deep learning
CN110717431A (en) A fine-grained visual question answering method combined with multi-view attention mechanism
CN111598183A (en) Multi-feature fusion image description method
Jing et al. Recognizing american sign language manual signs from rgb-d videos
Zhang et al. Teaching chinese sign language with a smartphone
CN110490189A (en) A kind of detection method of the conspicuousness object based on two-way news link convolutional network
Xue et al. Lcsnet: End-to-end lipreading with channel-aware feature selection
CN113780350B (en) ViLBERT and BiLSTM-based image description method
Liu et al. MEN: Mutual enhancement networks for sign language recognition and education
Zhang et al. C2st: Cross-modal contextualized sequence transduction for continuous sign language recognition
CN114359675A (en) A saliency map generation method for hyperspectral images based on semi-supervised neural network
CN117934803A (en) A visual positioning method based on multimodal feature alignment
CN116665147A (en) Prediction method for intention of traffic participant
CN112529054B (en) Multi-dimensional convolution neural network learner modeling method for multi-source heterogeneous data
Li et al. Place perception from the fusion of different image representation
CN118865379A (en) Open vocabulary object detection method and device with multimodal scene adaptive prompting
He et al. An optimal 3D convolutional neural network based lipreading method
CN115731596A (en) Spontaneous expression recognition method based on progressive label distribution and depth network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination