CN114780768A - Visual question-answering task processing method and system, electronic equipment and storage medium - Google Patents

Visual question-answering task processing method and system, electronic equipment and storage medium

Info

Publication number
CN114780768A
CN114780768A
Authority
CN
China
Prior art keywords: text, image, visual, question, feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210465781.6A
Other languages
Chinese (zh)
Inventor
李仁刚
李晓川
郭振华
赵雅倩
范宝余
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Mass Institute Of Information Technology
Original Assignee
Shandong Mass Institute Of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Mass Institute Of Information Technology filed Critical Shandong Mass Institute Of Information Technology
Priority to CN202210465781.6A priority Critical patent/CN114780768A/en
Publication of CN114780768A publication Critical patent/CN114780768A/en
Priority to PCT/CN2022/134138 priority patent/WO2023207059A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/532Query formulation, e.g. graphical querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/9032Query formulation
    • G06F16/90332Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Abstract

The application discloses a visual question-answering task processing method and system, an electronic device and a storage medium, belonging to the technical field of artificial intelligence and used for realizing efficient visual question-answering task processing. The visual question-answering task processing method comprises the following steps: performing feature fusion on image detection features, question text features and a classification field to obtain comprehensive features, and inputting the comprehensive features into a comprehensive feature encoder to obtain an image feature segment, a text feature segment and image-text weights; initializing a heterogeneous graph with the image feature segment, the text feature segment and the image-text weights, and constructing a feature corrector according to the attention relationships among the image-text features contained in the heterogeneous graph; and correcting the image feature segment and the text feature segment with the feature corrector, determining a prediction vector according to the position of the classification field in the correction result, and determining the answer corresponding to the visual question-answering task according to the prediction vector. The method and the device can improve the processing precision and efficiency of the visual question-answering task.

Description

Visual question-answering task processing method and system, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method and system for processing a visual question-answering task, an electronic device, and a storage medium.
Background
Visual Question Answering (VQA) is an important branch of research in the multi-modal domain; it aims to predict the answer to an input question from visual information. For example, given an input picture and an input question text, a VQA model should understand what the question is asking and reply according to the picture information.
In the related art, an LXMERT model is usually used to process the visual question-answering task; this method splices image features and text features together to fuse information from different modalities. However, some information in the text has no corresponding content in the image, and some information in the image has no corresponding content in the text. Directly splicing image features and text features therefore incurs a large amount of useless computation, and the processing accuracy and efficiency of the visual question-answering task are low.
Therefore, how to improve the processing accuracy and efficiency of the visual question-answering task is a technical problem that needs to be solved by those skilled in the art at present.
Disclosure of Invention
The application aims to provide a visual question-answering task processing method and system, an electronic device and a storage medium that can improve the processing precision and efficiency of the visual question-answering task.
In order to solve the above technical problems, the present application provides a method for processing a visual question-answering task, including:
receiving a visual question-answering task, and determining a target image and a question text according to the visual question-answering task;
extracting image detection features from the target image, and extracting question text features from the question text;
performing feature fusion on the image detection features, the question text features and a classification field to obtain comprehensive features, and inputting the comprehensive features into a comprehensive feature encoder to obtain an image feature segment, a text feature segment and image-text weights; wherein the image-text weights comprise an image-to-text attention weight and a text-to-image attention weight;
generating a heterogeneous graph corresponding to the visual question-answering task, initializing the heterogeneous graph by using the image characteristic segment, the text characteristic segment and the image-text weight, and constructing a characteristic corrector according to the attention relationship among image-text characteristics contained in the heterogeneous graph; wherein the heterogeneous graph comprises visual nodes and text nodes;
and correcting the image characteristic segment and the text characteristic segment by using the characteristic corrector, determining a prediction vector according to the position of the classification field in a correction result, and determining an answer corresponding to the visual question-answering task according to the prediction vector.
Optionally, initializing the heterogeneous graph by using the image feature segment, the text feature segment and the image-text weights includes:
sequentially storing each image feature in the image feature segment into the corresponding visual node so as to initialize the visual node of the heterogeneous graph;
sequentially storing each text feature in the text feature segment into the corresponding text node so as to initialize the text node of the heterogeneous graph;
and multiplying the matrix corresponding to the image-text weights by a prior filter matrix to obtain an edge initialization matrix, so as to initialize the edges of the heterogeneous graph.
Optionally, before multiplying the matrix corresponding to the image-text weights by the prior filter matrix, the method further includes:
constructing an N×M zero matrix; wherein N is the number of the visual nodes, M is the number of the text nodes, the rows of the zero matrix correspond to image features, and the columns correspond to text features;
locating the maximum of the image-to-text attention weight along the text direction to obtain, for each image feature, the text feature with the greatest degree of association, and setting the coordinates of the image feature on the target image as the virtual coordinates of that text feature;
and matching the virtual coordinates of the text features with the coordinates of all image features on the target image, and setting the elements of the zero matrix corresponding to the matched coordinates to 1, so as to obtain the prior filter matrix.
Optionally, the feature corrector includes a plurality of cascaded correction interaction layers, each correction interaction layer includes a first graph neural updating unit and a second graph neural updating unit, the first graph neural updating unit is configured to implement image-to-text feature aggregation, and the second graph neural updating unit is configured to implement text-to-image feature aggregation.
Optionally, the process of implementing image-to-text feature aggregation by the first graph neural updating unit includes:
constructing an attention matrix, and calculating a first mapping weight of a visual node to the text node according to the attention matrix;
normalizing the first mapping weight;
and updating the image features corresponding to the visual nodes according to the image-to-text attention weight between the visual nodes and the text nodes and the first mapping weight, so as to implement image-to-text feature aggregation.
Optionally, the process of implementing text-to-image feature aggregation by the second graph neural updating unit includes:
constructing an attention matrix, and calculating a second mapping weight of the text node to the visual node according to the attention matrix;
normalizing the second mapping weight;
and updating the text features corresponding to the text nodes according to the text-to-image attention weight between the text nodes and the visual nodes and the second mapping weight, so as to implement text-to-image feature aggregation.
Optionally, determining a prediction vector according to the position of the classification field in the correction result, and outputting an answer corresponding to the visual question-answering task according to the prediction vector, including:
intercepting the content corresponding to the position of the classification field in the correction result to obtain the prediction vector;
and classifying the prediction vectors, and determining answers corresponding to the visual question-answering tasks in an answer space according to classification results.
The present application further provides a system for processing a visual question-answering task, comprising:
the task receiving module is used for receiving the visual question-answering task and determining a target image and a question text according to the visual question-answering task;
the characteristic extraction module is used for extracting image detection characteristics from the target image and extracting question text characteristics from the question text;
the coding module is used for performing feature fusion on the image detection features, the question text features and the classification field to obtain comprehensive features, and inputting the comprehensive features into a comprehensive feature encoder to obtain an image feature segment, a text feature segment and image-text weights; wherein the image-text weights comprise an image-to-text attention weight and a text-to-image attention weight;
the corrector construction module is used for generating a heterogeneous graph corresponding to the visual question-answering task, initializing the heterogeneous graph by using the image characteristic segment, the text characteristic segment and the image-text weight, and constructing a characteristic corrector according to the attention relationship among image-text characteristics contained in the heterogeneous graph; wherein the heterogeneous graph comprises visual nodes and text nodes;
and the answer determining module is used for correcting the image characteristic segment and the text characteristic segment by using the characteristic corrector, determining a prediction vector according to the position of the classification field in a correction result, and determining an answer corresponding to the visual question-answering task according to the prediction vector.
The application also provides a storage medium on which a computer program is stored, the computer program, when executed, implementing the steps of the above visual question-answering task processing method.
The application also provides an electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the above visual question-answering task processing method when calling the computer program in the memory.
The application provides a visual question-answering task processing method, comprising: receiving a visual question-answering task, and determining a target image and a question text according to the visual question-answering task; extracting image detection features from the target image, and extracting question text features from the question text; performing feature fusion on the image detection features, the question text features and a classification field to obtain comprehensive features, and inputting the comprehensive features into a comprehensive feature encoder to obtain an image feature segment, a text feature segment and image-text weights, wherein the image-text weights comprise an image-to-text attention weight and a text-to-image attention weight; generating a heterogeneous graph corresponding to the visual question-answering task, initializing the heterogeneous graph with the image feature segment, the text feature segment and the image-text weights, and constructing a feature corrector according to the attention relationships among the image-text features contained in the heterogeneous graph, wherein the heterogeneous graph comprises visual nodes and text nodes; and correcting the image feature segment and the text feature segment with the feature corrector, determining a prediction vector according to the position of the classification field in the correction result, and determining an answer corresponding to the visual question-answering task according to the prediction vector.
According to the method and the device, after the visual question-answering task is received, the corresponding image detection features and question text features are extracted, and feature fusion is performed on the image detection features, the question text features and the classification field to obtain the comprehensive features. The comprehensive features are encoded by the comprehensive feature encoder to obtain the image feature segment, the text feature segment and the image-text weights. After the heterogeneous graph corresponding to the visual question-answering task is generated, the heterogeneous graph is initialized with the image feature segment, the text feature segment and the image-text weights, and the feature corrector is constructed using the attention relationships among the image-text features contained in the heterogeneous graph. Because the image and the text belong to different modalities, the nodes storing image features and the nodes storing text features in the heterogeneous graph are heterogeneous with respect to each other, and the feature corrector constructed from the heterogeneous graph can retain the effective information in the image feature segment and the text feature segment, thereby improving the processing precision and efficiency of the visual question-answering task. The application also provides a visual question-answering task processing system, a storage medium and an electronic device, which have the above beneficial effects and are not repeated here.
Drawings
In order to more clearly illustrate the embodiments of the present application, the drawings needed for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of a method for processing a visual question-answering task according to an embodiment of the present application;
FIG. 2 is a flowchart of a visual question-answering task processing scheme based on a priori heterogeneous interaction according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an integrated feature encoder according to an embodiment of the present application;
fig. 4 is an initialization diagram of a graph-text heterogeneous graph structure according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a feature corrector provided in an embodiment of the present application;
fig. 6 is a schematic structural diagram of a visual question-answering task processing system according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a flowchart of a visual question-answering task processing method according to an embodiment of the present application.
The specific steps may include:
s101: receiving a visual question-answering task, and determining a target image and a question text according to the visual question-answering task;
the embodiment can be applied to an electronic device with visual question-answering capability, and the visual question-answering task may be a task issued by a user or a task transmitted by another device. The target image and the question text may be determined according to the visual question-answering task; the question text is a question about some information in the target image. For example, if the target image is an image containing several bicycles, the question text may be "how many bicycles are in the image?".
S102: extracting image detection features from the target image, and extracting question text features from the question text;
in this step, a convolutional neural network may be used to extract image detection features in the target image, and a text encoder (such as BERT or RoBERTa) may be used to perform feature extraction on the question text to obtain question text features.
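As a sketch of this step (the model choices and dimensions below are illustrative assumptions, not prescribed by the patent — any detection network and any BERT-style text encoder fit the description):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")

def extract_question_features(question: str) -> torch.Tensor:
    """Encode the question text into an [M, d] feature matrix."""
    tokens = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        out = text_encoder(**tokens)
    return out.last_hidden_state.squeeze(0)          # [M, 768]

def extract_image_features(image: torch.Tensor):
    """Stand-in for the detection network: returns [N, d] region features
    and [N, 4] box coordinates. A real implementation would use a
    Faster R-CNN-style detector; random values here keep the sketch
    self-contained."""
    N, d = 36, 768
    return torch.randn(N, d), torch.rand(N, 4)
```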
S103: carrying out feature fusion on the image detection features, the question text features and the classification fields to obtain comprehensive features, and inputting the comprehensive features into a comprehensive feature encoder to obtain an image feature segment, a text feature segment and image-text weights;
in the step, image feature coding can be performed on the image detection features, text feature coding can be performed on the question text features, the image feature coding result and the text feature coding result are spliced and fused to obtain image-text fusion features, and the image-text fusion features and the classification fields are fused to obtain comprehensive features. In this embodiment, a fixed vector may be initialized to represent a classification field (CLS), and the classification field is spliced to the fused feature to obtain the composite feature.
As another feasible implementation manner, the embodiment may perform image feature coding on the image detection features, fuse the question text features and the classification fields to obtain text comprehensive features, perform text feature coding on the text comprehensive features, and fuse the image feature coding results and the text feature coding results to obtain comprehensive features.
After the comprehensive features are obtained, they can be input into the comprehensive feature encoder, whose output includes an image feature segment, a text feature segment and image-text weights. The image-text weights comprise an image-to-text attention weight and a text-to-image attention weight, where the image-to-text attention weight is the attention weight of the image features to the text features, and the text-to-image attention weight is the attention weight of the text features to the image features.
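As a concrete illustration, the following is a minimal sketch of the fusion and of splitting the encoder output back into segments. The concatenation order, the dimension d, and the helper names are assumptions for illustration only:

```python
import torch

d = 768                                              # assumed feature dimension
cls_vector = torch.nn.Parameter(torch.randn(1, d))   # learnable [CLS] field

def fuse_features(img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
    """Splice [CLS], the question text features and the image detection
    features into one comprehensive feature sequence."""
    return torch.cat([cls_vector, txt_feats, img_feats], dim=0)

def split_output(encoded: torch.Tensor, m_text: int):
    """Split the comprehensive feature encoder's output sequence back into
    a text feature segment (with [CLS] at position 0) and an image feature
    segment; the encoder returns the attention weights separately."""
    txt_segment = encoded[:1 + m_text]
    img_segment = encoded[1 + m_text:]
    return img_segment, txt_segment
```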
S104: generating a heterogeneous graph corresponding to the visual question-answering task, initializing the heterogeneous graph by using the image characteristic segment, the text characteristic segment and the image-text weight, and constructing a characteristic corrector according to the attention relationship among image-text characteristics contained in the heterogeneous graph; wherein the heterogeneous graph comprises visual nodes and text nodes;
in this embodiment, a heterogeneous graph (i.e., an image-text heterogeneous graph structure) including a visual space and a text space may be generated, where a plurality of visual nodes exist in the visual space and a plurality of text nodes exist in the text space. The process of initializing the Heterogeneous Graph (Heterogeneous Graph) includes node initialization and edge initialization, and the node initialization includes visual node initialization and text node initialization.
The visual node initialization process comprises: sequentially storing each image feature in the image feature segment into the corresponding visual node, so as to initialize the visual nodes of the heterogeneous graph. The text node initialization process comprises: sequentially storing each text feature in the text feature segment into the corresponding text node, so as to initialize the text nodes of the heterogeneous graph. The edge initialization process comprises: multiplying the matrix corresponding to the image-text weights by the prior filter matrix to obtain an edge initialization matrix, so as to initialize the edges of the heterogeneous graph.
Specifically, in the edge initialization process, the image-to-text attention weight and the text-to-image attention weight may each be multiplied by the prior filter matrix to obtain the edge initialization matrix. The prior filter matrix is generated as follows: construct an N×M zero matrix, where N is the number of visual nodes, M is the number of text nodes, the rows of the zero matrix correspond to image features, and the columns correspond to text features; locate the maximum of the image-to-text attention weight along the text direction to obtain, for each image feature, the text feature with the greatest degree of association, and set the coordinates of the image feature on the target image as the virtual coordinates of that text feature; match the virtual coordinates of the text features with the coordinates of all image features on the target image, and set the elements of the zero matrix at the matched coordinates to 1 to obtain the prior filter matrix.
The initialized heterogeneous graph stores the attention relationships between the image-text features (image features and text features), and the feature corrector can be constructed according to these attention relationships.
S105: and correcting the image characteristic segment and the text characteristic segment by using the characteristic corrector, determining a prediction vector according to the position of the classification field in a correction result, and determining an answer corresponding to the visual question-answering task according to the prediction vector.
After the characteristic corrector is obtained, bilateral correction can be performed on the image characteristic segment and the text characteristic segment, a prediction vector is determined according to the position of a classification field in a correction result, and an answer corresponding to the visual question-answering task is determined by combining the prediction vector. Specifically, in this embodiment, the content corresponding to the position of the classification field in the correction result may be intercepted to obtain the prediction vector; and classifying the prediction vectors, and determining answers corresponding to the visual question-answering tasks in an answer space according to classification results.
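A minimal sketch of this answer-determination step, assuming the [CLS] feature sits at position 0 of the corrected text segment and that `classifier` (e.g. a linear layer over the answer space) and `answers` are given externally:

```python
import torch

def predict_answer(corrected_txt_segment: torch.Tensor,
                   classifier: torch.nn.Module,
                   answers: list[str]) -> str:
    """Intercept the feature at the [CLS] position as the prediction
    vector, classify it, and look up the best-scoring answer."""
    pred_vec = corrected_txt_segment[0]      # [CLS] assumed at position 0
    logits = classifier(pred_vec)            # scores over the answer space
    return answers[int(torch.argmax(logits))]
```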
In this embodiment, after the visual question-answering task is received, the corresponding image detection features and question text features are extracted, and feature fusion is performed on the image detection features, the question text features and the classification field to obtain the comprehensive features. The comprehensive features are then encoded by the comprehensive feature encoder to obtain the image feature segment, the text feature segment and the image-text weights. After the heterogeneous graph corresponding to the visual question-answering task is generated, the heterogeneous graph is initialized with the image feature segment, the text feature segment and the image-text weights, and the feature corrector is constructed using the attention relationships among the image-text features contained in the heterogeneous graph. Because the image and the text belong to different modalities, the nodes storing image features and the nodes storing text features in the heterogeneous graph are heterogeneous with respect to each other, and the feature corrector constructed from the heterogeneous graph can retain the effective information in the image feature segment and the text feature segment, thereby improving the processing precision and efficiency of the visual question-answering task.
As a further introduction to the embodiment corresponding to fig. 1, the feature corrector may include a plurality of cascaded correction interaction layers, each correction interaction layer including a first graph neural updating unit and a second graph neural updating unit; the first graph neural updating unit is configured to implement image-to-text feature aggregation, and the second graph neural updating unit is configured to implement text-to-image feature aggregation. Image-to-text feature aggregation refers to aggregating image features to text nodes, and text-to-image feature aggregation refers to aggregating text features to visual nodes.
Specifically, the process of implementing image-to-text feature aggregation by the first graph neural updating unit includes: constructing an attention matrix, and calculating a first mapping weight of a visual node to the text node according to the attention matrix; normalizing the first mapping weight; and updating the image features corresponding to the visual nodes according to the image-to-text attention weight between the visual nodes and the text nodes and the first mapping weight, so as to implement image-to-text feature aggregation.
Specifically, the process of implementing text-to-image feature aggregation by the second graph neural updating unit includes: constructing an attention matrix, and calculating a second mapping weight of the text node to the visual node according to the attention matrix; normalizing the second mapping weight; and updating the text features corresponding to the text nodes according to the text-to-image attention weight between the text nodes and the visual nodes and the second mapping weight, so as to implement text-to-image feature aggregation.
The flow described in the above embodiment is explained by an embodiment in practical use as follows.
The visual question-answering task is an image-text understanding task, and multi-modal research has become one of the hottest research directions in the field of artificial intelligence. Since multi-modality usually involves various different features such as vision, speech and text, which are closer to the scenes of daily life and therefore have better application prospects, research in this field has become one of the main research directions of artificial intelligence. Current multi-modal research mainly focuses on the content-understanding level (artificial intelligence theory research can be roughly divided into content understanding and content generation), and the VQA task is a basic task of content understanding: whether artificial intelligence can understand the content of images and text is reflected in the precision achieved on the VQA task. Because the visual question-answering task VQA is relatively classical, many methods are available to solve it; the most classical and most precise are a series of methods based on the Transformer structure, such as the VLBERT and LXMERT models.
The conventional processing of the visual question-answering task includes: extracting the features of the image with a convolutional neural network, and extracting the features of the input question with an existing text encoder (such as BERT or RoBERTa). The extracted features are then encoded separately, and the two sets of features are spliced and fused to obtain image-text fusion features. A fixed vector is initialized to represent the classification field [CLS], and this classification field [CLS] is spliced onto the fused features to obtain the comprehensive features. The comprehensive features are further encoded, and the feature at the position corresponding to [CLS] is truncated out to serve as the prediction vector. Finally, the prediction vector is classified, and the answer with the highest score in the answer space is output.
In the conventional processing scheme of the visual question-answering task, the visual features and the text features are spliced together to fuse information from different modalities. However, there is a problem in this process: not all features can necessarily be fused; that is, some information in the text cannot find corresponding content in the image, and vice versa. Therefore, roughly splicing the two together and training on an ultra-large data volume is not an efficient scheme.
Aiming at these defects in the conventional technology, the application provides a visual question-answering task processing scheme based on prior heterogeneous interaction, which designs a feature corrector suited to visual question answering and ensures that the corrected features contain more "valuable information". A graph structure is composed of nodes and edges, and it can adjust the strength of the relation between connected nodes according to the magnitude of the edges, which is consistent with the physical meaning of whether correlation exists between different image-text features. Because the image and the text belong to different modalities, the graph nodes storing image features and the graph nodes storing text features are heterogeneous with respect to each other.
Referring to fig. 2, fig. 2 is a flowchart of a visual question-answering task processing scheme based on prior heterogeneous interaction according to an embodiment of the present application. The specific process is as follows: the target image is passed through detection-network image feature extraction to obtain the image detection features, and the question text is passed through word-frequency-dictionary text feature extraction and spliced with the initialization vector of the classification field [CLS] to obtain the comprehensive text features. Image feature coding is performed on the image detection features, text feature coding is performed on the comprehensive text features, and feature fusion is performed on the image feature coding result and the text feature coding result to obtain the comprehensive features. Comprehensive feature coding is performed on the comprehensive features to obtain the image feature segment, the text feature segment and the image-text weights. The image feature segment and the text feature segment are corrected by the feature corrector to obtain the corrected image feature segment and corrected text feature segment, and the prediction vector is determined according to the position of the classification field in the corrected text feature segment to predict in the answer space.
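To make the data flow concrete, the following end-to-end sketch strings the steps of fig. 2 together, reusing the illustrative helpers sketched in the surrounding sections (`extract_image_features`, `fuse_features`, `init_nodes`, `prior_filter_matrix`, and so on are all assumptions for illustration, not the patent's actual interfaces; some are defined later in the detailed embodiment below):

```python
def process_vqa_task(image, question, encoder, corrector, classifier, answers):
    # Feature extraction (S102)
    img_feats, boxes = extract_image_features(image)
    txt_feats = extract_question_features(question)
    # Fusion with [CLS] and comprehensive feature encoding (S103)
    comprehensive = fuse_features(img_feats, txt_feats)
    encoded, W_fg, W_gf = encoder(comprehensive)     # also emits attention weights
    img_seg, txt_seg = split_output(encoded, txt_feats.size(0))
    # Heterogeneous graph initialization and corrector construction (S104)
    graph = init_nodes(img_seg, txt_seg)
    P = prior_filter_matrix(W_fg, boxes)
    graph = init_edges(graph, W_fg, W_gf, P)
    # Bilateral correction and answer determination (S105)
    corrected_img, corrected_txt = corrector(graph)
    return predict_answer(corrected_txt, classifier, answers)
```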
In the above process, the comprehensive feature coding is implemented by a comprehensive feature encoder, which comprises a plurality of cascaded cross-attention layers. Each cross-attention layer comprises a self-attention sublayer and a cross-attention sublayer, composed respectively of a self-/cross-attention mechanism, random erasure, layer normalization and addition. That is, the self-attention sublayer implements the self-attention mechanism, random erasure, layer normalization and addition, and the cross-attention sublayer implements the cross-attention mechanism, random erasure, layer normalization and addition. Random erasure refers to randomly erasing a portion of the feature values at a certain ratio to prevent overfitting; layer normalization is used to normalize between layers; the addition adds the model output to the original features. The self-attention mechanism formulas are as follows:
q = f·W_q, k = f·W_k, v = f·W_v
SelfAtt(f) = softmax(q·k^T / √d)·v
the cross attention mechanism formula is as follows:
CrossAtt(f, g) = softmax((f·W_q)·(g·W_k)^T / √d)·(g·W_v)
in the above formulas, f and g respectively represent two feature matrices of size [N, d], where N denotes the number of features and d denotes the feature dimension; W_q, W_k and W_v are matrices of size [d, d]; the superscript T denotes the matrix transpose; size(f) denotes the dimensions of f, and size(g) denotes the dimensions of g. The cross-attention mechanism mainly operates the representation of the feature f on g, and in this form the attention of g to f is realized.
This embodiment adds an output interface to the existing comprehensive feature encoder, which outputs the text-to-image attention weight and the image-to-text attention weight of the final cross-attention mechanism.
The text-to-image attention weight W_gf is:
W_gf = softmax(q_g·k_f^T / √d)
The image-to-text attention weight W_fg is:
W_fg = softmax(q_f·k_g^T / √d)
in the above formulas, q represents the query and k represents the key; the query and the key are the inputs of the attention mechanism, with the subscripts indicating the feature set (image features f or text features g) from which each is computed.
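A minimal single-head PyTorch sketch of one cross-attention sublayer consistent with the formulas above — f attends to g, and the normalized attention map is returned so that the final layer can expose the attention weights through the added output interface (the class name, head count and dropout rate are assumptions):

```python
import math
import torch
import torch.nn as nn

class CrossAttentionSublayer(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.Wq = nn.Linear(d, d, bias=False)
        self.Wk = nn.Linear(d, d, bias=False)
        self.Wv = nn.Linear(d, d, bias=False)
        self.drop = nn.Dropout(0.1)   # the "random erasure" in the text
        self.norm = nn.LayerNorm(d)   # layer normalization

    def forward(self, f: torch.Tensor, g: torch.Tensor):
        # softmax((f·Wq)(g·Wk)^T / sqrt(d)) · (g·Wv)
        q, k, v = self.Wq(f), self.Wk(g), self.Wv(g)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
        out = self.norm(f + self.drop(attn @ v))   # addition + layer normalization
        return out, attn   # attn plays the role of the f-to-g attention weight
```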
Referring to fig. 3, fig. 3 is a schematic diagram of a comprehensive feature encoder according to an embodiment of the present application. The comprehensive features input into the comprehensive feature encoder comprise an image feature segment and a text feature segment; after the self-attention sublayer processing (self-attention, random erasure, layer normalization and addition) and the cross-attention sublayer processing (cross-attention, random erasure, layer normalization and addition), the image feature segment, the text feature segment, the image-to-text attention weight and the text-to-image attention weight are obtained.
Please refer to fig. 4, which is a schematic diagram of the initialization of the image-text heterogeneous graph structure according to an embodiment of the present application. The graph structure is one of the basic structures in computer science and is composed of nodes and edges; the structure and initialization of the heterogeneous graph designed in this scheme are explained below. In fig. 4, one set of nodes represents the visual nodes and the other set represents the text nodes.
the embodiment adopts the idea of heterogeneous graph structure aggregation to solve the problem of the low effectiveness of image features and text features in the field of visual question answering. As shown in fig. 4, the heterogeneous graph includes nodes of two properties and edges between different nodes. For a node, the two properties refer to the source feature space: the visual space or the text space. For the edges, an edge exists only between nodes in different spaces; there is no edge between two nodes in the same space. All edges are directed, i.e., there are two edges (one in each direction) between two connected nodes.
After the heterogeneous graph has been constructed, node initialization and edge initialization may be performed. In the node initialization process, the image feature segment of size [N, d] is sequentially stored into the N visual nodes, and the text feature segment of size [M, d] is sequentially stored into the M text nodes. N is the number of features in the image feature segment, M is the number of features in the text feature segment, and d is the feature dimension.
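A minimal data-structure sketch of the heterogeneous graph and its node initialization (the field and function names are illustrative):

```python
from dataclasses import dataclass
import torch

@dataclass
class HeteroGraph:
    visual_nodes: torch.Tensor   # [N, d] — one image feature per visual node
    text_nodes: torch.Tensor     # [M, d] — one text feature per text node
    edges_i2t: torch.Tensor      # [N, M] image-to-text edge values
    edges_t2i: torch.Tensor      # [M, N] text-to-image edge values

def init_nodes(img_segment: torch.Tensor, txt_segment: torch.Tensor) -> HeteroGraph:
    """Node initialization: store the [N, d] image feature segment and the
    [M, d] text feature segment directly into the nodes; edges start empty."""
    N, M = img_segment.size(0), txt_segment.size(0)
    return HeteroGraph(img_segment.clone(), txt_segment.clone(),
                       torch.zeros(N, M), torch.zeros(M, N))
```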
For the edge initialization process, an edge initialization matrix is first calculated; as shown in the figure, the edge initialization matrix is obtained by multiplying the image-text weight matrix by the prior filter matrix. The image-text weight matrix is the corresponding output of the previous module. The prior filter matrix represents the prior correlation between image and text: it is a binary matrix of size [N, M] composed of 0s and 1s, and the element in the ith row and jth column indicates whether the ith visual node and the jth text node may possibly be correlated. The prior filter matrix is generated as follows: first, construct a zero matrix of size [N, M]; locate the maximum of the image-to-text attention weight along the text direction, finding the text feature most likely associated with each image feature; take the coordinates of the image feature on the original input picture as the virtual coordinates of that text feature (if a certain text feature is simultaneously the maximum attention value of several image features, combine the coordinates of those image features into a list as the virtual coordinates of the text feature); for each text feature, compare its virtual coordinate (or coordinate list) with the coordinates of all image features, and set the corresponding position on the zero matrix to 1 wherever they coincide in space, thus obtaining the prior filter matrix. After multiplying the image-text weight matrix by the prior filter matrix, the edges of the heterogeneous graph structure can be initialized.
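The following sketch implements the edge initialization just described. Two points are assumptions where the text leaves room: the "multiplication" of the image-text weight matrix with the prior filter is taken as elementwise filtering, and coordinate matching is taken as exact box equality:

```python
import torch

def prior_filter_matrix(W_fg: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
    """Build the [N, M] binary prior filter: locate, for each image feature,
    the text feature it attends to most (maximum along the text direction of
    W_fg), hand that text feature the image feature's box coordinates as a
    virtual coordinate, then set 1 wherever virtual and real coordinates
    coincide."""
    N, M = W_fg.shape
    P = torch.zeros(N, M)
    best_txt = W_fg.argmax(dim=1)                # [N] most-associated text index
    for i in range(N):
        j = int(best_txt[i])
        hits = (boxes == boxes[i]).all(dim=1)    # image features sharing the coordinate
        P[hits, j] = 1.0
    return P

def init_edges(graph, W_fg: torch.Tensor, W_gf: torch.Tensor, P: torch.Tensor):
    """Edge initialization: filter both attention-weight matrices by the
    prior (transposed for the text-to-image direction)."""
    graph.edges_i2t = W_fg * P
    graph.edges_t2i = W_gf * P.t()
    return graph
```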
Referring to fig. 5, fig. 5 is a schematic structural diagram of a feature corrector provided in an embodiment of the present application; the feature corrector can be obtained from the image-text heterogeneous graph. The feature corrector comprises a plurality of cascaded correction interaction layers, each of which contains two graph neural updating units (the triangles in the figure): one graph neural updating unit performs image-to-text feature aggregation, and the other performs text-to-image feature aggregation.
Taking the direction from the visual nodes to the text nodes as an example, the operation of the graph neural updating unit is as follows:
Step 1: construct four attention matrices W_c, W_v, W_b and W_n, each of size [d, d]; for an input vector q, Wq denotes the result of the matrix operation, representing the mapping of the vector q.
Step 2: calculate the mapping weight z_ti of visual node I to text node T:
z_ti = ReLU(W_c·[W_v·f_t ; W_b·f_i])
where f_t represents the feature vector stored in text node T, f_i represents the feature vector stored in visual node I, and the calculated z_ti represents the mapping weight of visual node I to text node T.
Step 3: normalize the mapping weight z_ti using:
α_ti = exp(z_ti) / Σ_i exp(z_ti)
where α_ti represents the normalized mapping weight and exp(·) represents the exponential operator.
Step 4: update the node features in combination with the edge matrix W_ti between visual node I and text node T (the image-to-text attention weight):
f_t ← σ · Σ_{i=1..N_t} w_ti · α_ti · f_i + f_t
where w_ti represents the corresponding edge value in the edge matrix W_ti, f_t represents the original feature of the node, σ is a hyper-parameter, and N_t is the number of nodes in the summation.
Step 5: re-weight all text nodes whose features have been updated. Specifically: construct a matrix of size [d, d] and multiply the updated features by it to complete the mapping.
For node updating in the reverse direction, the operation is similar to steps 1–5; only the visual node I and the text node T need to be exchanged. Finally, the feature segment at the position of [CLS] in the text features can be extracted as the prediction vector and classified in the answer space to obtain the final output.
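A PyTorch sketch of steps 1–5 for the visual-to-text direction. Where the text is ambiguous, assumptions are marked: W_c is shaped to produce a scalar score per node pair (the text lists all four matrices as [d, d]), W_n is not used in this sketch, and the step-5 matrix is named W_o:

```python
import torch
import torch.nn as nn

class GraphUpdateUnit(nn.Module):
    """Visual-to-text graph neural updating unit (steps 1-5)."""
    def __init__(self, d: int, sigma: float = 1.0):
        super().__init__()
        # Step 1: attention matrices (W_c shaped [2d, 1] here — an assumption)
        self.Wc = nn.Linear(2 * d, 1, bias=False)
        self.Wv = nn.Linear(d, d, bias=False)
        self.Wb = nn.Linear(d, d, bias=False)
        self.Wo = nn.Linear(d, d, bias=False)   # step-5 re-weighting matrix
        self.sigma = sigma                      # hyper-parameter from step 4

    def forward(self, f_t: torch.Tensor, f_i: torch.Tensor,
                w_ti: torch.Tensor) -> torch.Tensor:
        # f_t: [M, d] text node features, f_i: [N, d] visual node features,
        # w_ti: [M, N] edge values (filtered image-to-text attention weights)
        M, N = f_t.size(0), f_i.size(0)
        # Step 2: z_ti = ReLU(Wc [Wv f_t ; Wb f_i]) for every (t, i) pair
        a = self.Wv(f_t).unsqueeze(1).expand(M, N, -1)
        b = self.Wb(f_i).unsqueeze(0).expand(M, N, -1)
        z = torch.relu(self.Wc(torch.cat([a, b], dim=-1))).squeeze(-1)  # [M, N]
        # Step 3: normalize over the visual neighbors
        alpha = torch.softmax(z, dim=1)
        # Step 4: aggregate visual features through the edge values, residual add
        f_t_new = self.sigma * (w_ti * alpha) @ f_i + f_t
        # Step 5: re-weight the updated text node features
        return self.Wo(f_t_new)
```

The reverse (text-to-visual) unit is obtained by swapping the roles of f_t and f_i and using the text-to-image edge values, exactly as the step description above states.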
This embodiment adopts a graph neural network structure for the visual question-answering task and designs a new visual question-answering system, providing interfaces and logic suited to the system through a reasonable heterogeneous graph structure and initialization method. In this embodiment, after the comprehensive feature encoding module, a feature corrector is constructed using the attention relationships between the image-text features; the bilateral features are corrected by the feature corrector, and the feature at the position corresponding to the classification field [CLS] in the corrected text feature segment is extracted for subsequent answer prediction.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a visual question-answering task processing system according to an embodiment of the present application, where the system may include:
the task receiving module 601 is configured to receive a visual question-answering task, and determine a target image and a question text according to the visual question-answering task;
a feature extraction module 602, configured to extract image detection features from the target image, and extract question text features from the question text;
the encoding module 603 is configured to perform feature fusion on the image detection features, the question text features and the classification field to obtain comprehensive features, and input the comprehensive features into a comprehensive feature encoder to obtain an image feature segment, a text feature segment and image-text weights; wherein the image-text weights comprise an image-to-text attention weight and a text-to-image attention weight;
a corrector constructing module 604, configured to generate a heterogeneous graph corresponding to the visual question-answering task, initialize the heterogeneous graph by using the image feature segment, the text feature segment and the image-text weight, and construct a feature corrector according to an attention relationship among image-text features included in the heterogeneous graph; wherein the heterogeneous graph comprises visual nodes and text nodes;
and an answer determining module 605, configured to modify the image feature segment and the text feature segment by using the feature modifier, determine a prediction vector according to a position of the classification field in a modification result, and determine an answer corresponding to the visual question-answering task according to the prediction vector.
In this embodiment, after the visual question-answering task is received, the corresponding image detection features and question text features are extracted, and feature fusion is performed on the image detection features, the question text features and the classification field to obtain the comprehensive features. The comprehensive features are then encoded by the comprehensive feature encoder to obtain the image feature segment, the text feature segment and the image-text weights. After the heterogeneous graph corresponding to the visual question-answering task is generated, the heterogeneous graph is initialized with the image feature segment, the text feature segment and the image-text weights, and the feature corrector is constructed using the attention relationships among the image-text features contained in the heterogeneous graph. Because the image and the text belong to different modalities, the nodes storing image features and the nodes storing text features in the heterogeneous graph are heterogeneous with respect to each other, and the feature corrector constructed from the heterogeneous graph can retain the effective information in the image feature segment and the text feature segment, thereby improving the processing precision and efficiency of the visual question-answering task.
Further, the process of initializing the heterogeneous graph by the corrector construction module 604 using the image feature segment, the text feature segment and the image-text weights includes: sequentially storing each image feature in the image feature segment into the corresponding visual node, so as to initialize the visual nodes of the heterogeneous graph; sequentially storing each text feature in the text feature segment into the corresponding text node, so as to initialize the text nodes of the heterogeneous graph; and multiplying the matrix corresponding to the image-text weights by the prior filter matrix to obtain an edge initialization matrix, so as to initialize the edges of the heterogeneous graph.
Further, the method also comprises the following steps:
the prior filtering matrix construction module is used for constructing an N multiplied by M zero matrix before multiplying the matrix corresponding to the image-text weight by the prior filtering matrix; wherein N is the number of the visual nodes, M is the number of the text nodes, the abscissa of the zero matrix represents image characteristics, and the ordinate of the zero matrix represents image characteristics; the image-text attention weighting device is also used for carrying out maximum positioning on the image-text attention weighting in the text direction to obtain text features with the maximum association degree with each image feature, and setting the coordinates of the image features on the target image as virtual coordinates of the text features with the maximum association degree; and the virtual coordinate of the text feature is matched with the coordinates of all image features on the target image, and the element of the matched coordinate corresponding to the zero matrix is set to be 1, so that the prior filter matrix is obtained.
Further, the feature corrector includes a plurality of cascaded correction interaction layers, each of which includes a first graph neural updating unit and a second graph neural updating unit; the first graph neural updating unit is used for implementing image-to-text feature aggregation, and the second graph neural updating unit is used for implementing text-to-image feature aggregation.
Further, the process of implementing image-to-text feature aggregation by the first graph neural updating unit includes: constructing an attention matrix, and calculating a first mapping weight of a visual node to the text node according to the attention matrix; normalizing the first mapping weight; and updating the image features corresponding to the visual nodes according to the image-to-text attention weight between the visual nodes and the text nodes and the first mapping weight, so as to implement image-to-text feature aggregation.
Further, the process of implementing text-to-image feature aggregation by the second graph neural updating unit includes: constructing an attention matrix, and calculating a second mapping weight of the text node to the visual node according to the attention matrix; normalizing the second mapping weight; and updating the text features corresponding to the text nodes according to the text-to-image attention weight between the text nodes and the visual nodes and the second mapping weight, so as to implement text-to-image feature aggregation.
Further, the answer determining module 605 is configured to intercept content corresponding to the position of the classification field in the correction result to obtain the prediction vector; and the system is also used for classifying the prediction vectors and determining answers corresponding to the visual question-answering tasks in an answer space according to classification results.
Since the embodiment of the system part and the embodiment of the method part correspond to each other, please refer to the description of the embodiment of the method part for the embodiment of the system part, and details are not repeated here.
The present application also provides a storage medium on which a computer program is stored, which when executed, can implement the steps provided by the above embodiments. The storage medium may include: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The present application further provides an electronic device, which may include a memory and a processor, where the memory stores a computer program, and when the processor calls the computer program in the memory, the steps provided in the foregoing embodiments may be implemented. Of course, the electronic device may also include various network interfaces, power supplies, and the like.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A visual question-answering task processing method is characterized by comprising the following steps:
receiving a visual question-answering task, and determining a target image and a question text according to the visual question-answering task;
extracting image detection features from the target image, and extracting question text features from the question text;
performing feature fusion on the image detection features, the question text features and a classification field to obtain comprehensive features, and inputting the comprehensive features into a comprehensive feature encoder to obtain an image feature segment, a text feature segment and image-text weights; wherein the image-text weights comprise an image-to-text attention weight and a text-to-image attention weight;
generating a heterogeneous graph corresponding to the visual question-answering task, initializing the heterogeneous graph by using the image characteristic segment, the text characteristic segment and the image-text weight, and constructing a characteristic corrector according to the attention relationship among image-text characteristics contained in the heterogeneous graph; wherein the heterogeneous graph comprises visual nodes and text nodes;
and correcting the image characteristic segment and the text characteristic segment by using the characteristic corrector, determining a prediction vector according to the position of the classification field in a correction result, and determining an answer corresponding to the visual question-answering task according to the prediction vector.
2. The visual question-answering task processing method according to claim 1, wherein initializing the heterogeneous graph with the image feature segment, the text feature segment and the teletext weight comprises:
sequentially storing each image feature in the image feature segment into the corresponding visual node so as to initialize the visual node of the heterogeneous graph;
sequentially storing each text feature in the text feature segment into the corresponding text node so as to initialize the text node of the heterogeneous graph;
and multiplying the matrix corresponding to the image-text weights by a prior filter matrix to obtain an edge initialization matrix, so as to initialize the edges of the heterogeneous graph.
3. The visual question-answering task processing method according to claim 2, wherein before multiplying the matrix corresponding to the image-text weights by the prior filter matrix, the method further comprises:
constructing an N×M zero matrix; wherein N is the number of the visual nodes, M is the number of the text nodes, the abscissa of the zero matrix represents the image features, and the ordinate of the zero matrix represents the text features;
performing maximum positioning on the image-to-text attention weights along the text direction to obtain, for each image feature, the text feature with the greatest degree of association, and setting the coordinates of that image feature on the target image as the virtual coordinates of the associated text feature;
and matching the virtual coordinates of the text features against the coordinates of all image features on the target image, and setting the elements of the zero matrix corresponding to the matched coordinates to 1, so as to obtain the prior filter matrix.
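For illustration only: a sketch of the prior-filter construction of claim 3, assuming each image feature carries box coordinates on the target image and that "maximum positioning" is an argmax along the text direction; build_prior_filter and its arguments are hypothetical placeholders.

import torch

def build_prior_filter(attn_i2t, img_coords):
    # attn_i2t: (N, M) image-to-text attention weights
    # img_coords: (N, 4) coordinates of each image feature on the target image
    N, M = attn_i2t.shape
    prior = torch.zeros(N, M)
    # Maximum positioning along the text direction: for each image feature,
    # the index of the most strongly associated text feature.
    best_txt = attn_i2t.argmax(dim=1)                # (N,)
    # Each selected text feature inherits the image feature's coordinates
    # as its virtual coordinates; unassigned rows stay NaN and never match.
    virtual = torch.full((M, 4), float("nan"))
    virtual[best_txt] = img_coords
    # Match virtual coordinates against all image feature coordinates and
    # set the matched (image, text) entries of the zero matrix to 1.
    for j in range(M):
        for i in range(N):
            if torch.equal(virtual[j], img_coords[i]):
                prior[i, j] = 1.0
    return prior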
4. The visual question-answering task processing method according to claim 1, wherein the feature corrector comprises a plurality of cascaded correction interaction layers, each of the correction interaction layers comprises a first graph neural updating unit and a second graph neural updating unit, the first graph neural updating unit is used for realizing image-text feature aggregation, and the second graph neural updating unit is used for realizing text-image feature aggregation.
5. The visual question-answering task processing method according to claim 4, wherein the process of implementing the image-text feature aggregation by the first graph neural updating unit comprises:
constructing an attention matrix, and calculating first mapping weights of the visual nodes to the text nodes according to the attention matrix;
normalizing the first mapping weights;
and updating the image features corresponding to the visual nodes according to the image-to-text attention weights between the visual nodes and the text nodes and the first mapping weights, so as to realize image-text feature aggregation.
6. The visual question-answering task processing method according to claim 4, wherein the process of implementing the text-image feature aggregation by the second graph neural updating unit comprises:
constructing an attention matrix, and calculating second mapping weights of the text nodes to the visual nodes according to the attention matrix;
normalizing the second mapping weights;
and updating the text features corresponding to the text nodes according to the text-to-image attention weights between the text nodes and the visual nodes and the second mapping weights, so as to realize text-image feature aggregation.
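For illustration only: one hedged sketch covering both updating units of claims 5 and 6, since they mirror each other with the roles of the visual and text nodes swapped. The class name, the residual update, and the way the normalized mapping weights are combined with the edge attention weights are assumptions, not taken from the claims.

import torch
import torch.nn as nn

class GraphUpdateUnit(nn.Module):
    # One graph neural updating unit: builds an attention matrix, derives
    # mapping weights from target nodes to source nodes, normalizes them,
    # and updates the target nodes with aggregated source features.
    # (target=visual, source=text) sketches the first unit (image-text
    # aggregation); swapping the roles sketches the second unit.
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)

    def forward(self, target_nodes, source_nodes, edge_attn):
        # target_nodes: (T, D), source_nodes: (S, D)
        # edge_attn: (T, S) directional attention weights stored on the edges
        attn = self.q(target_nodes) @ self.k(source_nodes).T   # attention matrix (T, S)
        mapping = torch.softmax(attn, dim=-1)                  # normalized mapping weights
        weights = mapping * edge_attn                          # combine with edge attention
        return target_nodes + weights @ source_nodes           # residual node update

Cascading pairs of such units, one per direction, would form the correction interaction layers of claim 4.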
7. The visual question-answering task processing method according to claim 1, wherein determining the prediction vector according to the position of the classification field in the correction result, and determining the answer corresponding to the visual question-answering task according to the prediction vector, comprises:
extracting the content at the position of the classification field in the correction result to obtain the prediction vector;
and classifying the prediction vector, and determining the answer corresponding to the visual question-answering task in an answer space according to the classification result.
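For illustration only: a sketch of the answer head of claim 7, assuming the classification field sits at position 0 of the correction result (as in BERT-style encoders) and that classification is a linear head over a fixed answer space; all names and the toy answer space are placeholders.

import torch
import torch.nn as nn

answer_space = ["yes", "no", "red", "two"]        # toy answer space
classifier = nn.Linear(768, len(answer_space))    # hypothetical linear head

def predict_answer(corrected_seq):
    # corrected_seq: (L, 768) correction result; the classification field
    # is assumed to occupy position 0 of the sequence.
    prediction_vector = corrected_seq[0]          # intercept the field's content
    logits = classifier(prediction_vector)        # classify the prediction vector
    return answer_space[int(logits.argmax())]     # pick the answer from the space

# usage: answer = predict_answer(torch.randn(30, 768))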
8. A visual question-answering task processing system, comprising:
the task receiving module is used for receiving a visual question-answering task and determining a target image and a question text according to the visual question-answering task;
the feature extraction module is used for extracting image detection features from the target image, and extracting question text features from the question text;
the encoding module is used for performing feature fusion on the image detection features, the question text features and a classification field to obtain comprehensive features, and inputting the comprehensive features into a comprehensive feature encoder to obtain an image feature segment, a text feature segment and image-text weights; wherein the image-text weights comprise an image-to-text attention weight and a text-to-image attention weight;
the corrector construction module is used for generating a heterogeneous graph corresponding to the visual question-answering task, initializing the heterogeneous graph by using the image feature segment, the text feature segment and the image-text weights, and constructing a feature corrector according to the attention relationships among the image-text features contained in the heterogeneous graph; wherein the heterogeneous graph comprises visual nodes and text nodes;
and the answer determining module is used for correcting the image feature segment and the text feature segment by using the feature corrector, determining a prediction vector according to the position of the classification field in the correction result, and determining an answer corresponding to the visual question-answering task according to the prediction vector.
9. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the visual question-answering task processing method according to any one of claims 1 to 7 when calling the computer program in the memory.
10. A storage medium having stored therein computer-executable instructions which, when loaded and executed by a processor, carry out the steps of the visual question-answering task processing method according to any one of claims 1 to 7.
CN202210465781.6A 2022-04-29 2022-04-29 Visual question-answering task processing method and system, electronic equipment and storage medium Pending CN114780768A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210465781.6A CN114780768A (en) 2022-04-29 2022-04-29 Visual question-answering task processing method and system, electronic equipment and storage medium
PCT/CN2022/134138 WO2023207059A1 (en) 2022-04-29 2022-11-24 Visual question answering task processing method and system, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210465781.6A CN114780768A (en) 2022-04-29 2022-04-29 Visual question-answering task processing method and system, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114780768A true CN114780768A (en) 2022-07-22

Family

ID=82435927

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210465781.6A Pending CN114780768A (en) 2022-04-29 2022-04-29 Visual question-answering task processing method and system, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114780768A (en)
WO (1) WO2023207059A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3920048A1 (en) * 2020-06-02 2021-12-08 Siemens Aktiengesellschaft Method and system for automated visual question answering
CN111782838B (en) * 2020-06-30 2024-04-05 北京百度网讯科技有限公司 Image question-answering method, device, computer equipment and medium
CN113094484A (en) * 2021-04-07 2021-07-09 西北工业大学 Text visual question-answering implementation method based on heterogeneous graph neural network
CN113360621A (en) * 2021-06-22 2021-09-07 辽宁工程技术大学 Scene text visual question-answering method based on modal inference graph neural network
CN114780768A (en) * 2022-04-29 2022-07-22 山东海量信息技术研究院 Visual question-answering task processing method and system, electronic equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023207059A1 (en) * 2022-04-29 2023-11-02 山东海量信息技术研究院 Visual question answering task processing method and system, electronic device, and storage medium
CN115310611A (en) * 2022-10-12 2022-11-08 苏州浪潮智能科技有限公司 Figure intention reasoning method and related device
WO2024077891A1 (en) * 2022-10-12 2024-04-18 苏州元脑智能科技有限公司 Character intention reasoning method and related apparatus
CN115905591A (en) * 2023-02-22 2023-04-04 浪潮电子信息产业股份有限公司 Visual question answering method, system, equipment and readable storage medium

Also Published As

Publication number Publication date
WO2023207059A1 (en) 2023-11-02

Similar Documents

Publication Publication Date Title
CN113240580B (en) Lightweight image super-resolution reconstruction method based on multi-dimensional knowledge distillation
CN109492666B (en) Image recognition model training method and device and storage medium
CN114780768A (en) Visual question-answering task processing method and system, electronic equipment and storage medium
CN111079532A (en) Video content description method based on text self-encoder
CN112084331A (en) Text processing method, text processing device, model training method, model training device, computer equipment and storage medium
CN107832794A (en) A kind of convolutional neural networks generation method, the recognition methods of car system and computing device
CN111178039B (en) Model training method and device, and text processing method and device
CN114549913B (en) Semantic segmentation method and device, computer equipment and storage medium
CN115310611B (en) Figure intention reasoning method and related device
CN114511576B (en) Image segmentation method and system of scale self-adaptive feature enhanced deep neural network
CN116580257A (en) Feature fusion model training and sample retrieval method and device and computer equipment
Liu et al. Painting completion with generative translation models
CN115147598A (en) Target detection segmentation method and device, intelligent terminal and storage medium
CN112347932A (en) Point cloud-multi-view fused three-dimensional model identification method
CN113987196A (en) Knowledge graph embedding compression method based on knowledge graph distillation
CN112784831B (en) Character recognition method for enhancing attention mechanism by fusing multilayer features
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN113096133A (en) Method for constructing semantic segmentation network based on attention mechanism
CN115601283B (en) Image enhancement method and device, computer equipment and computer readable storage medium
CN116957006A (en) Training method, device, equipment, medium and program product of prediction model
KR102393761B1 (en) Method and system of learning artificial neural network model for image processing
Shah et al. Overview of image inpainting techniques: A survey
Coletti et al. Troubleshooting deep-learner training data problems using an evolutionary algorithm on Summit
CN115905591B (en) Visual question-answering method, system, equipment and readable storage medium
CN110866866A (en) Image color-matching processing method and device, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination