WO2023207059A1 - Visual question answering task processing method and system, electronic device, and storage medium - Google Patents

Visual question answering task processing method and system, electronic device, and storage medium

Info

Publication number
WO2023207059A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
feature
image
features
visual
Prior art date
Application number
PCT/CN2022/134138
Other languages
English (en)
French (fr)
Inventor
李仁刚
李晓川
郭振华
赵雅倩
范宝余
Original Assignee
山东海量信息技术研究院
Priority date
Filing date
Publication date
Application filed by 山东海量信息技术研究院
Publication of WO2023207059A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50: Information retrieval of still image data
    • G06F16/53: Querying
    • G06F16/532: Query formulation, e.g. graphical querying
    • G06F16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583: Retrieval using metadata automatically derived from the content
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/903: Querying
    • G06F16/9032: Query formulation
    • G06F16/90332: Natural language query formulation or dialogue systems
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

Provided are a visual question answering task processing method and system, an electronic device, and a storage medium, belonging to the technical field of artificial intelligence and intended to enable efficient processing of visual question answering tasks. The visual question answering task processing method comprises: fusing image detection features, question text features, and a classification field to obtain a comprehensive feature, and feeding the comprehensive feature into a comprehensive feature encoder to obtain an image feature segment, a text feature segment, and image-text weights (S103); initializing a heterogeneous graph with the image feature segment, the text feature segment, and the image-text weights, and constructing a feature corrector from the attention relationships between the image and text features contained in the heterogeneous graph; and correcting the image feature segment and the text feature segment with the feature corrector, determining a prediction vector from the position of the classification field in the correction result, and determining the answer to the visual question answering task from the prediction vector (S105). This application can improve the accuracy and efficiency of visual question answering task processing.

Description

Visual question answering task processing method and system, electronic device, and storage medium

Cross-Reference to Related Applications

This application claims priority to Chinese patent application No. 202210465781.6, filed with the China National Intellectual Property Administration on April 29, 2022 and entitled "Visual question answering task processing method and system, electronic device, and storage medium", the entire contents of which are incorporated herein by reference.

Technical Field

This application relates to the field of artificial intelligence, and in particular to a visual question answering task processing method, system, electronic device, and storage medium.

Background

Visual Question Answering (VQA) is an important branch of multimodal research that aims to predict the answer to an input question from visual information. For example, an image and a question can be fed to a VQA model so that the model understands what the question asks and answers it from the image content.

In the related art, the LXMERT model is typically used to process visual question answering tasks. This approach concatenates image features and text features to fuse information from different modalities. However, some information in the text has no counterpart in the image, and some information in the image has no counterpart in the text; directly concatenating image features and text features therefore incurs a large amount of useless computation, and the accuracy and efficiency of visual question answering task processing are low.

How to improve the accuracy and efficiency of visual question answering task processing is therefore a technical problem that those skilled in the art currently need to solve.

Summary

The purpose of this application is to provide a visual question answering task processing method, a visual question answering task processing system, an electronic device, and a storage medium that can improve the accuracy and efficiency of visual question answering task processing.

To solve the above technical problem, this application provides a visual question answering task processing method, comprising:

receiving a visual question answering task, and determining a target image and a question text from the visual question answering task;

extracting image detection features from the target image, and extracting question text features from the question text;

fusing the image detection features, the question text features, and a classification field to obtain a comprehensive feature, and feeding the comprehensive feature into a comprehensive feature encoder to obtain an image feature segment, a text feature segment, and image-text weights, where the image-text weights include an image-to-text attention weight and a text-to-image attention weight;

generating a heterogeneous graph corresponding to the visual question answering task, initializing the heterogeneous graph with the image feature segment, the text feature segment, and the image-text weights, and constructing a feature corrector from the attention relationships between the image and text features contained in the heterogeneous graph, where the heterogeneous graph includes visual nodes and text nodes;

correcting the image feature segment and the text feature segment with the feature corrector, determining a prediction vector from the position of the classification field in the correction result, and determining the answer to the visual question answering task from the prediction vector.
In some embodiments, initializing the heterogeneous graph with the image feature segment, the text feature segment, and the image-text weights includes:

storing each image feature in the image feature segment into its corresponding visual node in order, so as to initialize the visual nodes of the heterogeneous graph;

storing each text feature in the text feature segment into its corresponding text node in order, so as to initialize the text nodes of the heterogeneous graph;

multiplying the matrix corresponding to the image-text weights by a prior filter matrix to obtain an edge initialization matrix, so as to initialize the edges of the heterogeneous graph.

In some embodiments, before the matrix corresponding to the image-text weights is multiplied by the prior filter matrix, the method further includes:

constructing an N×M zero matrix, where N is the number of visual nodes, M is the number of text nodes, the abscissa of the zero matrix represents the image features, and the ordinate of the zero matrix represents the text features;

locating the maximum of the image-to-text attention weight along the text direction to obtain the text feature most strongly associated with each image feature, and setting the coordinates of that image feature on the target image as the virtual coordinates of the most strongly associated text feature;

matching the virtual coordinates of the text features against the coordinates of all image features on the target image, and setting the elements of the zero matrix corresponding to matched coordinates to 1 to obtain the prior filter matrix.

In some embodiments, the feature corrector includes multiple cascaded correction interaction layers, each correction interaction layer including a first graph neural update unit and a second graph neural update unit; the first graph neural update unit is used to implement image-to-text feature aggregation, and the second graph neural update unit is used to implement text-to-image feature aggregation.

In some embodiments, the process by which the first graph neural update unit implements image-to-text feature aggregation includes:

constructing attention matrices, and computing first mapping weights of the visual nodes onto the text nodes from the attention matrices;

normalizing the first mapping weights;

updating the image features corresponding to the visual nodes according to the image-to-text attention weight between the visual nodes and the text nodes and the first mapping weights, so as to implement image-to-text feature aggregation.

In some embodiments, the process by which the second graph neural update unit implements text-to-image feature aggregation includes:

constructing attention matrices, and computing second mapping weights of the text nodes onto the visual nodes from the attention matrices;

normalizing the second mapping weights;

updating the text features corresponding to the text nodes according to the text-to-image attention weight between the visual nodes and the text nodes and the second mapping weights, so as to implement text-to-image feature aggregation.

In some embodiments, determining the prediction vector from the position of the classification field in the correction result and outputting the answer to the visual question answering task from the prediction vector includes:

intercepting the content corresponding to the position of the classification field in the correction result to obtain the prediction vector;

classifying the prediction vector, and determining the answer to the visual question answering task in the answer space from the classification result.
This application further provides a visual question answering task processing system, comprising:

a task receiving module, configured to receive a visual question answering task and determine a target image and a question text from the visual question answering task;

a feature extraction module, configured to extract image detection features from the target image and question text features from the question text;

an encoding module, configured to fuse the image detection features, the question text features, and a classification field to obtain a comprehensive feature, and feed the comprehensive feature into a comprehensive feature encoder to obtain an image feature segment, a text feature segment, and image-text weights, where the image-text weights include an image-to-text attention weight and a text-to-image attention weight;

a corrector construction module, configured to generate a heterogeneous graph corresponding to the visual question answering task, initialize the heterogeneous graph with the image feature segment, the text feature segment, and the image-text weights, and construct a feature corrector from the attention relationships between the image and text features contained in the heterogeneous graph, where the heterogeneous graph includes visual nodes and text nodes;

an answer determination module, configured to correct the image feature segment and the text feature segment with the feature corrector, determine a prediction vector from the position of the classification field in the correction result, and determine the answer to the visual question answering task from the prediction vector.

This application further provides a non-volatile readable storage medium on which a computer program is stored; when the computer program is executed, the steps of the above visual question answering task processing method are implemented.

This application further provides an electronic device comprising a memory and a processor; a computer program is stored in the memory, and when the processor invokes the computer program in the memory, the steps of the above visual question answering task processing method are implemented.

This application provides a visual question answering task processing method, comprising: receiving a visual question answering task, and determining a target image and a question text from the visual question answering task; extracting image detection features from the target image, and extracting question text features from the question text; fusing the image detection features, the question text features, and a classification field to obtain a comprehensive feature, and feeding the comprehensive feature into a comprehensive feature encoder to obtain an image feature segment, a text feature segment, and image-text weights, where the image-text weights include an image-to-text attention weight and a text-to-image attention weight; generating a heterogeneous graph corresponding to the visual question answering task, initializing the heterogeneous graph with the image feature segment, the text feature segment, and the image-text weights, and constructing a feature corrector from the attention relationships between the image and text features contained in the heterogeneous graph, where the heterogeneous graph includes visual nodes and text nodes; and correcting the image feature segment and the text feature segment with the feature corrector, determining a prediction vector from the position of the classification field in the correction result, and determining the answer to the visual question answering task from the prediction vector.

After receiving the visual question answering task, this application extracts the corresponding image detection features and question text features, and fuses the image detection features, the question text features, and the classification field to obtain a comprehensive feature. The comprehensive feature encoder transforms this comprehensive feature into an image feature segment, a text feature segment, and image-text weights. After the heterogeneous graph corresponding to the visual question answering task has been generated, it is initialized with the image feature segment, the text feature segment, and the image-text weights, and the feature corrector is constructed from the attention relationships between the image and text features contained in the heterogeneous graph. Because images and text belong to different modalities, the nodes storing image features and the nodes storing text features in the heterogeneous graph are heterogeneous with respect to each other, and a feature corrector built on the heterogeneous graph can retain the effective information in the image feature segment and the text feature segment, improving the accuracy and efficiency of visual question answering task processing. This application also provides a visual question answering task processing system, a non-volatile readable storage medium, and an electronic device having the above beneficial effects, which are not described again here.
Brief Description of the Drawings

To describe the embodiments of this application more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of this application; those of ordinary skill in the art can derive other drawings from them without creative effort.

Figure 1 is a flowchart of a visual question answering task processing method provided by an embodiment of this application;

Figure 2 is a flowchart of a visual question answering task processing scheme based on prior heterogeneous interaction provided by an embodiment of this application;

Figure 3 is a schematic diagram of the principle of a comprehensive feature encoder provided by an embodiment of this application;

Figure 4 is a schematic diagram of the initialization of an image-text heterogeneous graph structure provided by an embodiment of this application;

Figure 5 is a schematic structural diagram of a feature corrector provided by an embodiment of this application;

Figure 6 is a schematic structural diagram of a visual question answering task processing system provided by an embodiment of this application.

Detailed Description

To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings. The described embodiments are obviously only some, rather than all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art from the embodiments of this application without creative effort fall within the scope of protection of this application.
Refer to Figure 1, a flowchart of a visual question answering task processing method provided by an embodiment of this application. The specific steps may include:

S101: receive a visual question answering task, and determine a target image and a question text from the visual question answering task.

This embodiment may be applied to an electronic device with visual question answering capability. The visual question answering task may be a task issued by a user or a task transmitted by another device. The target image and the question text are determined from the visual question answering task; the question text is a question about some information in the target image. For example, if the target image is an image containing several bicycles, the question text may be "How many bicycles are there in the picture?".

S102: extract image detection features from the target image, and extract question text features from the question text.

In this step, a convolutional neural network may be used to extract the image detection features from the target image, and a text encoder (such as BERT or RoBERTa) may be used to extract the question text features from the question text.

S103: fuse the image detection features, the question text features, and a classification field to obtain a comprehensive feature, and feed the comprehensive feature into the comprehensive feature encoder to obtain an image feature segment, a text feature segment, and image-text weights.

In this step, image feature encoding may be applied to the image detection features and text feature encoding to the question text features; the image feature encoding result and the text feature encoding result are then concatenated and fused into an image-text fusion feature, and the image-text fusion feature is fused with the classification field to obtain the comprehensive feature. This embodiment may initialize a fixed vector to represent a classification field (CLS) and append it to the fusion feature to obtain the comprehensive feature. As another feasible implementation, this embodiment may apply image feature encoding to the image detection features, fuse the question text features with the classification field into a comprehensive text feature, apply text feature encoding to the comprehensive text feature, and fuse the image feature encoding result with the text feature encoding result to obtain the comprehensive feature.

After the comprehensive feature has been obtained, it may be fed into the comprehensive feature encoder, whose outputs include the image feature segment, the text feature segment, and the image-text weights. The image-text weights include an image-to-text attention weight and a text-to-image attention weight: the image-to-text attention weight is the attention weight of the image features to the text features, and the text-to-image attention weight is the attention weight of the text features to the image features.
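As an illustration of this fusion step, the following is a minimal PyTorch sketch of the concatenation-based fusion with an appended CLS field; the dimensions, encoder stubs, and variable names are assumptions for illustration rather than the exact implementation of this application:

import torch
import torch.nn as nn

d = 768                      # assumed feature dimension
N, M = 36, 20                # assumed numbers of image / text features

img_enc = nn.Linear(2048, d)                 # stand-in for image feature encoding
txt_enc = nn.Linear(d, d)                    # stand-in for text feature encoding
cls_token = nn.Parameter(torch.zeros(1, d))  # fixed vector for the CLS field

img_feats = torch.randn(N, 2048)             # image detection features
txt_feats = torch.randn(M, d)                # question text features

# Concatenate and fuse the two encoding results, then append the CLS field
fusion = torch.cat([img_enc(img_feats), txt_enc(txt_feats)], dim=0)
comprehensive = torch.cat([fusion, cls_token], dim=0)   # shape [N + M + 1, d]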
S104: generate a heterogeneous graph corresponding to the visual question answering task, initialize the heterogeneous graph with the image feature segment, the text feature segment, and the image-text weights, and construct a feature corrector from the attention relationships between the image and text features contained in the heterogeneous graph; the heterogeneous graph includes visual nodes and text nodes.

This embodiment may generate a heterogeneous graph comprising a visual space and a text space (that is, an image-text heterogeneous graph structure); there are multiple visual nodes in the visual space and multiple text nodes in the text space. Initializing the heterogeneous graph involves node initialization and edge initialization, and node initialization comprises visual node initialization and text node initialization.

Visual node initialization proceeds by storing each image feature in the image feature segment into its corresponding visual node in order, so as to initialize the visual nodes of the heterogeneous graph. Text node initialization proceeds by storing each text feature in the text feature segment into its corresponding text node in order, so as to initialize the text nodes of the heterogeneous graph. Edge initialization proceeds by multiplying the matrix corresponding to the image-text weights by the prior filter matrix to obtain the edge initialization matrix, so as to initialize the edges of the heterogeneous graph.

Specifically, during edge initialization the image-to-text attention weight and the text-to-image attention weight may each be multiplied by the prior filter matrix to obtain the edge initialization matrices. The prior filter matrix is generated as follows: construct an N×M zero matrix, where N is the number of visual nodes, M is the number of text nodes, the abscissa of the zero matrix represents the image features, and the ordinate represents the text features; locate the maximum of the image-to-text attention weight along the text direction to obtain the text feature most strongly associated with each image feature, and set the coordinates of that image feature on the target image as the virtual coordinates of the most strongly associated text feature; match the virtual coordinates of the text features against the coordinates of all image features on the target image, and set the elements of the zero matrix corresponding to matched coordinates to 1 to obtain the prior filter matrix. The initialized heterogeneous graph stores the attention relationships between the image-text features (image features and text features), from which the feature corrector can be constructed.

S105: correct the image feature segment and the text feature segment with the feature corrector, determine the prediction vector from the position of the classification field in the correction result, and determine the answer to the visual question answering task from the prediction vector.

After the feature corrector has been obtained, bilateral correction may be applied to the image feature segment and the text feature segment; the prediction vector is then determined from the position of the classification field in the correction result, and the answer to the visual question answering task is determined from the prediction vector. Specifically, this embodiment may intercept the content corresponding to the position of the classification field in the correction result to obtain the prediction vector, classify the prediction vector, and determine the answer to the visual question answering task in the answer space from the classification result.
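A minimal sketch of this prediction step, assuming the CLS field sits at the last position of the corrected result and that the answer space is a fixed list of candidate answers (both assumptions for illustration):

import torch
import torch.nn as nn

d = 768
answer_space = ["one", "two", "three", "four"]      # assumed answer space

classifier = nn.Linear(d, len(answer_space))        # prediction-vector classifier

corrected_text_seg = torch.randn(21, d)             # corrected text feature segment
pred_vector = corrected_text_seg[-1]                # intercept the CLS position

scores = classifier(pred_vector)                    # classify the prediction vector
answer = answer_space[scores.argmax().item()]       # highest-scoring answer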
After receiving the visual question answering task, this embodiment extracts the corresponding image detection features and question text features, and fuses the image detection features, the question text features, and the classification field to obtain the comprehensive feature. The comprehensive feature encoder transforms this comprehensive feature into the image feature segment, the text feature segment, and the image-text weights. After the heterogeneous graph corresponding to the visual question answering task has been generated, it is initialized with the image feature segment, the text feature segment, and the image-text weights, and the feature corrector is constructed from the attention relationships between the image and text features contained in the heterogeneous graph. Because images and text belong to different modalities, the nodes storing image features and the nodes storing text features in the heterogeneous graph are heterogeneous with respect to each other, and a feature corrector built on the heterogeneous graph can retain the effective information in the image feature segment and the text feature segment, improving the accuracy and efficiency of visual question answering task processing.

As a further elaboration of the embodiment corresponding to Figure 1, the feature corrector may include multiple cascaded correction interaction layers, each comprising a first graph neural update unit and a second graph neural update unit; the first graph neural update unit implements image-to-text feature aggregation, and the second implements text-to-image feature aggregation. Image-to-text feature aggregation means aggregating image features into the text nodes; text-to-image feature aggregation means aggregating text features into the visual nodes.

Specifically, the first graph neural update unit implements image-to-text feature aggregation by: constructing attention matrices and computing from them the first mapping weights of the visual nodes onto the text nodes; normalizing the first mapping weights; and updating the image features corresponding to the visual nodes according to the image-to-text attention weight between the visual nodes and the text nodes and the first mapping weights, so as to implement image-to-text feature aggregation.

Specifically, the second graph neural update unit implements text-to-image feature aggregation by: constructing attention matrices and computing from them the second mapping weights of the text nodes onto the visual nodes; normalizing the second mapping weights; and updating the text features corresponding to the text nodes according to the text-to-image attention weight between the visual nodes and the text nodes and the second mapping weights, so as to implement text-to-image feature aggregation.

The flow described in the above embodiments is illustrated below with an embodiment in practical application.

The visual question answering task is an image-text understanding task, and multimodal research has become one of the most popular research directions in artificial intelligence. Because multimodality usually involves several different kinds of features, such as vision, speech, and text, which are closer to everyday scenes, it has better prospects for deployment, and research in this field has become one of the mainstream directions of artificial intelligence. At the current stage, multimodal research focuses mainly on content understanding (theoretical research in artificial intelligence can be roughly divided into content understanding and content generation), and the VQA task is a basic task of content understanding: whether artificial intelligence can understand the content of images and text is reflected in the accuracy of VQA tasks. Since the VQA task is relatively classic, many methods already exist to solve it; the most classic and most accurate are a series of methods based on the transformer structure, such as the VLBERT and LXMERT models.

A conventional visual question answering pipeline proceeds as follows: features are extracted from the image with a convolutional neural network, and from the input question with an off-the-shelf text encoder (such as BERT or RoBERTa); the extracted features are then encoded separately, after which the two kinds of features are concatenated and fused into an image-text fusion feature. A fixed vector is initialized to represent a classification field [CLS] and appended to the fusion feature to obtain a comprehensive feature, which is further encoded; the position corresponding to the [CLS] feature is then intercepted again to represent the prediction vector. Finally, the prediction vector is classified, and the highest-scoring answer in the answer space is output.

This conventional scheme fuses information from different modalities by concatenating visual features and text features, but there is a problem in this process: not all features need to be fused; that is, some information in the text has no counterpart in the image, and vice versa. Crudely concatenating the two and training on an extremely large amount of data is therefore not a simple solution.

Aimed at these defects of the conventional technique, this application provides a visual question answering task processing scheme based on prior heterogeneous interaction. The scheme designs a feature corrector suited to visual question answering and ensures that the corrected features contain more "valuable information". A graph structure consists of nodes and edges and can adjust the strength of the relationship between connected nodes according to the magnitude of the edges, which is physically consistent with whether correlations exist between different features across image and text. Because images and text belong to different modalities, the graph nodes storing image features and the graph nodes storing text features are heterogeneous with respect to each other.
Refer to Figure 2, a flowchart of a visual question answering task processing scheme based on prior heterogeneous interaction provided by an embodiment of this application. The specific process is as follows: the target image passes through detection-network image feature extraction to obtain the image detection features, and the question text passes through word-frequency-dictionary text feature extraction and is concatenated with the initialization vector of the classification field CLS to obtain the comprehensive text feature. Image feature encoding is applied to the image detection features, text feature encoding is applied to the comprehensive text feature, and the two encoding results are fused to obtain the comprehensive feature. Comprehensive feature encoding is then applied to the comprehensive feature to obtain the image feature segment, the text feature segment, and the image-text weights. The feature corrector corrects the image feature segment and the text feature segment into a corrected image feature segment and a corrected text feature segment, and the prediction vector is determined from the position of the classification field in the corrected text feature segment for prediction in the answer space.

In the above process, comprehensive feature encoding is implemented by the comprehensive feature encoder, which comprises several cascaded cross-attention layers. Each cross-attention layer includes a self-attention sub-layer and a cross-attention sub-layer, composed respectively of a self-/cross-attention mechanism, random erasure, layer normalization, and addition; that is, the self-attention sub-layer implements the self-attention mechanism, random erasure, layer normalization, and addition, and the cross-attention sub-layer implements the cross-attention mechanism, random erasure, layer normalization, and addition. Random erasure randomly erases part of the feature values at a certain ratio to prevent overfitting; layer normalization normalizes between layers; addition adds the model output to the original features. The self-attention mechanism is given by the standard scaled form:

$$\mathrm{SelfAtt}(f)=\mathrm{softmax}\!\left(\frac{(fW_q)(fW_k)^{T}}{\sqrt{\mathrm{size}(f)}}\right)fW_v$$

The cross-attention mechanism is given by:

$$\mathrm{CrossAtt}(f,g)=\mathrm{softmax}\!\left(\frac{(gW_q)(fW_k)^{T}}{\sqrt{\mathrm{size}(g)}}\right)fW_v$$

In the above formulas, f and g denote two different input features of size [N, d], N denotes the number of features, d denotes the feature dimension, W_q, W_k, and W_v denote matrices of size [d, d] used to map the input features into a specified space, T denotes transposition, size(f) denotes the dimension of f, and size(g) denotes the dimension of g. The cross-attention mechanism is mainly used to compute the representation of feature f on g, and in this way realizes g's attention to f.

This embodiment adds an output interface to the existing comprehensive feature encoder, which outputs the text-to-image attention weight and the image-to-text attention weight of the last cross-attention operation. In a form consistent with the cross-attention definition above, the text-to-image attention weight W_gf is:

$$W_{gf}=\mathrm{softmax}\!\left(\frac{q_g\,k_f^{T}}{\sqrt{d}}\right)$$

and the image-to-text attention weight W_fg is:

$$W_{fg}=\mathrm{softmax}\!\left(\frac{q_f\,k_g^{T}}{\sqrt{d}}\right)$$

where q denotes the query, k denotes the key, and the query and key are the inputs of the self-attention mechanism.
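As an illustration of these sub-layers, the following PyTorch sketch computes cross-attention together with random erasure (dropout), addition, and layer normalization, and exposes the attention-weight matrices in the way the added output interface does; the class names, dimensions, and single-head simplification are assumptions for illustration, not the exact implementation of this application:

import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    # Cross-attention of g over f; also returns the attention-weight matrix.
    def __init__(self, d):
        super().__init__()
        self.Wq = nn.Linear(d, d, bias=False)   # [d, d] query mapping
        self.Wk = nn.Linear(d, d, bias=False)   # [d, d] key mapping
        self.Wv = nn.Linear(d, d, bias=False)   # [d, d] value mapping
        self.d = d

    def forward(self, f, g):
        q, k, v = self.Wq(g), self.Wk(f), self.Wv(f)
        w = torch.softmax(q @ k.T / self.d ** 0.5, dim=-1)  # g's attention to f
        return w @ v, w

class CrossAttnSubLayer(nn.Module):
    # Attention, then random erasure (dropout), addition, and layer normalization.
    def __init__(self, d, p=0.1):
        super().__init__()
        self.attn = CrossAttention(d)
        self.drop = nn.Dropout(p)               # random erasure
        self.norm = nn.LayerNorm(d)

    def forward(self, f, g):
        out, w = self.attn(f, g)
        return self.norm(g + self.drop(out)), w  # addition + layer normalization

d = 768
layer = CrossAttnSubLayer(d)
img_seg, txt_seg = torch.randn(36, d), torch.randn(21, d)

_, W_fg = layer(txt_seg, img_seg)   # image-to-text attention weight, size [36, 21]
_, W_gf = layer(img_seg, txt_seg)   # text-to-image attention weight, size [21, 36]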
Refer to Figure 3, a schematic diagram of the principle of a comprehensive feature encoder provided by an embodiment of this application. The comprehensive feature fed into the comprehensive feature encoder includes the image feature segment and the text feature segment; after processing by the self-attention sub-layer (self-attention, random erasure, layer normalization, and addition) and by the cross-attention sub-layer (cross-attention, random erasure, layer normalization, and addition), the image feature segment, the text feature segment, the image-to-text attention weight, and the text-to-image attention weight are obtained.

Refer to Figure 4, a schematic diagram of the initialization of an image-text heterogeneous graph structure provided by an embodiment of this application. The graph structure is one of the basic structures in computer science and consists of nodes and edges; the structure and initialization of the heterogeneous graph designed in this scheme are described below. In Figure 4, one kind of node marker denotes the visual nodes and the other kind denotes the text nodes.

Aimed at the low effectiveness of image features and text features in the visual question answering field, this embodiment proposes a solution based on the idea of heterogeneous graph structure aggregation. As shown in Figure 4, the heterogeneous graph includes nodes of two natures and edges between different nodes. For nodes, the two natures refer to the source feature space: the visual space or the text space. For edges, edges exist only between nodes of different spaces, and there is no edge between two nodes of the same space. All edges are directed; that is, there are two edges between two connected nodes.

After the heterogeneous graph has been constructed, node initialization and edge initialization can be performed. During node initialization, an image feature segment of size [N, d] is stored into N visual nodes in order, and a text feature segment of size [M, d] is stored into M text nodes in order, where N is the number of features in the image feature segment, M is the number of features in the text feature segment, and d is the feature dimension.

For edge initialization, the edge initialization matrix is computed first; as shown in the figure, it is obtained by multiplying the image-text weight matrix by the prior filter matrix. The image-text weight matrix is the corresponding output of the previous module, and the prior filter matrix represents the prior correlation between image and text: it is a binary matrix of size [N, M] composed of 0s and 1s, in which the number in row i, column j indicates whether a correlation may exist between the i-th graph node and the j-th text node. The prior filter matrix is generated as follows: first construct a zero matrix of size [N, M]; locate the maximum of the image-to-text attention weight along the text direction to find the text feature each image feature is most likely associated with; take the coordinates of that image feature on the original input picture as the virtual coordinates of that text feature (if a text feature is the attention maximum of several image features at once, merge the coordinates of those image features into a list as its virtual coordinates); for each text feature, compare the virtual coordinates (or coordinate list) with the coordinates of all image features, and set the positions of the zero matrix corresponding to spatial overlaps to 1, thereby obtaining the prior filter matrix. Multiplying the image-text weight matrix by the prior filter matrix then initializes the edges of the heterogeneous graph structure.
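A minimal NumPy sketch of this edge initialization, under the assumptions that each image feature carries a bounding box on the input picture, that spatial overlap is tested by box intersection, and that the final multiplication is element-wise; all three are illustrative assumptions:

import numpy as np

N, M = 36, 21
W_fg = np.random.rand(N, M)            # image-to-text attention weight matrix
xy = np.random.rand(N, 2)              # box corners of each image feature
boxes = np.concatenate([xy, xy + np.random.rand(N, 2)], axis=1)  # (x1,y1,x2,y2)

def overlaps(a, b):
    # True if two boxes intersect (assumed spatial-overlap test).
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

# Virtual coordinates: each image feature contributes its box to the text
# feature at which its attention weight is maximal.
virtual = [[] for _ in range(M)]
for i in range(N):
    virtual[int(W_fg[i].argmax())].append(boxes[i])

# Prior filter matrix: 1 wherever a text feature's virtual box overlaps an
# image feature's box on the target image.
P = np.zeros((N, M))
for j in range(M):
    for vb in virtual[j]:
        for i in range(N):
            if overlaps(vb, boxes[i]):
                P[i, j] = 1

edge_init = W_fg * P                   # element-wise product initializes the edges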
Refer to Figure 5, a schematic structural diagram of a feature corrector provided by an embodiment of this application; the feature corrector can be obtained from the image-text heterogeneous graph. The feature corrector comprises several cascaded correction interaction layers, and each correction interaction layer includes two graph neural update units (the triangles in the figure): one performs image-to-text feature aggregation, and the other performs text-to-image feature aggregation.

Taking the visual-node-to-text-node direction as an example, a graph neural update unit operates as follows (a runnable sketch of these steps is given after step 5 below):
Step 1: construct four attention matrices W_c, W_v, W_b, and W_n, each of size [d, d];

where the input vector of an attention matrix is denoted q, and Wq denotes the result of the matrix operation, representing the mapping applied to the vector q.

Step 2: compute the mapping weight z_ti of visual node I onto text node T:

$$z_{ti}=\mathrm{ReLU}\left(W_c\left[W_v f_t;\,W_b f_i\right]\right)$$

where f_t denotes the feature vector stored in text node T, f_i denotes the feature vector stored in visual node I, and the computed z_ti denotes the mapping weight of visual node I onto text node T.

Step 3: normalize the mapping weight z_ti. The normalization formula is:

$$\alpha_{ti}=\frac{\exp\left(z_{ti}\right)}{\sum_{k}\exp\left(z_{tk}\right)}$$

where α_ti denotes the normalized mapping weight and exp(·) denotes the exponential operator.
Step 4: incorporate the edge matrix W_ti between visual node I and text node T (the image-to-text attention weight) to update the node features; in a form consistent with the surrounding definitions:

$$f_t^{\prime}=f_t+\sigma\sum_{i=1}^{N_t}\alpha_{ti}\,w_{ti}\,f_i$$

where w_ti denotes the corresponding edge value in the edge matrix W_ti, f_t is the node's original feature, σ is a hyperparameter, and N_t is the number of nodes.

Step 5: reweight all text nodes after the feature update. Specifically, construct a further matrix of size [d, d] and multiply it onto the obtained features as a mapping.

The reverse node update is similar to steps 1 to 5; it suffices to swap visual node I and text node T. Finally, the feature segment corresponding to [CLS] at its position in the text features can be extracted as the prediction vector and classified in the answer space to obtain the final output.
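The following PyTorch sketch walks through steps 1 to 5 for the image-to-text direction, with the step-4 update written in the form given above; W_c is interpreted as a scoring map producing one scalar weight per node pair, and W_n is taken as the step-5 reweighting matrix, both of which are illustrative assumptions:

import torch
import torch.nn as nn

class GraphNeuralUpdate(nn.Module):
    # One image-to-text graph neural update unit (steps 1 to 5).
    def __init__(self, d, sigma=0.1):
        super().__init__()
        # Step 1: attention matrices Wc, Wv, Wb, Wn
        self.Wv = nn.Linear(d, d, bias=False)
        self.Wb = nn.Linear(d, d, bias=False)
        self.Wc = nn.Linear(2 * d, 1, bias=False)  # scores the pair [Wv ft; Wb fi]
        self.Wn = nn.Linear(d, d, bias=False)      # step-5 reweighting (assumed role)
        self.sigma = sigma                         # step-4 hyperparameter

    def forward(self, f_t, f_i, W_ti):
        # f_t: [M, d] text node features; f_i: [N, d] visual node features
        # W_ti: [M, N] edge matrix (image-to-text attention weight)
        M, N = f_t.size(0), f_i.size(0)
        # Step 2: z_ti = ReLU(Wc [Wv f_t ; Wb f_i]) for every (t, i) pair
        t = self.Wv(f_t).unsqueeze(1).expand(M, N, -1)
        i = self.Wb(f_i).unsqueeze(0).expand(M, N, -1)
        z = torch.relu(self.Wc(torch.cat([t, i], dim=-1))).squeeze(-1)  # [M, N]
        # Step 3: normalize the mapping weights over the visual nodes
        alpha = torch.softmax(z, dim=1)
        # Step 4: update the text node features with edge values and weights
        f_t_new = f_t + self.sigma * (alpha * W_ti) @ f_i
        # Step 5: reweight the updated text nodes
        return self.Wn(f_t_new)

d = 768
unit = GraphNeuralUpdate(d)
f_t, f_i, W_ti = torch.randn(21, d), torch.randn(36, d), torch.rand(21, 36)
f_t_updated = unit(f_t, f_i, W_ti)   # [21, d] corrected text node features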
For the visual question answering task, this embodiment designs a new visual question answering system based on a graph neural network structure, and provides interfaces and logic suited to the system through a reasonable heterogeneous graph structure and initialization method. After the comprehensive feature encoding module, this embodiment constructs the feature corrector from the attention relationships between image and text features, uses it to correct the bilateral features, and extracts the position of the corrected text feature segment corresponding to the classification field CLS for subsequent answer prediction.

Refer to Figure 6, a schematic structural diagram of a visual question answering task processing system provided by an embodiment of this application. The system may include:

a task receiving module 601, configured to receive a visual question answering task and determine a target image and a question text from the visual question answering task;

a feature extraction module 602, configured to extract image detection features from the target image and question text features from the question text;

an encoding module 603, configured to fuse the image detection features, the question text features, and a classification field to obtain a comprehensive feature, and feed the comprehensive feature into a comprehensive feature encoder to obtain an image feature segment, a text feature segment, and image-text weights, where the image-text weights include an image-to-text attention weight and a text-to-image attention weight;

a corrector construction module 604, configured to generate a heterogeneous graph corresponding to the visual question answering task, initialize the heterogeneous graph with the image feature segment, the text feature segment, and the image-text weights, and construct a feature corrector from the attention relationships between the image and text features contained in the heterogeneous graph, where the heterogeneous graph includes visual nodes and text nodes;

an answer determination module 605, configured to correct the image feature segment and the text feature segment with the feature corrector, determine a prediction vector from the position of the classification field in the correction result, and determine the answer to the visual question answering task from the prediction vector.

After receiving the visual question answering task, this embodiment extracts the corresponding image detection features and question text features, and fuses the image detection features, the question text features, and the classification field to obtain a comprehensive feature. The comprehensive feature encoder transforms this comprehensive feature into an image feature segment, a text feature segment, and image-text weights. After the heterogeneous graph corresponding to the visual question answering task has been generated, it is initialized with the image feature segment, the text feature segment, and the image-text weights, and the feature corrector is constructed from the attention relationships between the image and text features contained in the heterogeneous graph. Because images and text belong to different modalities, the nodes storing image features and the nodes storing text features in the heterogeneous graph are heterogeneous with respect to each other, and a feature corrector built on the heterogeneous graph can retain the effective information in the image feature segment and the text feature segment, improving the accuracy and efficiency of visual question answering task processing.

Further, the process by which the corrector construction module 604 initializes the heterogeneous graph with the image feature segment, the text feature segment, and the image-text weights includes: storing each image feature in the image feature segment into its corresponding visual node in order, so as to initialize the visual nodes of the heterogeneous graph; storing each text feature in the text feature segment into its corresponding text node in order, so as to initialize the text nodes of the heterogeneous graph; and multiplying the matrix corresponding to the image-text weights by the prior filter matrix to obtain the edge initialization matrix, so as to initialize the edges of the heterogeneous graph.

Further, the system also includes:

a prior filter matrix construction module, configured to construct an N×M zero matrix before the matrix corresponding to the image-text weights is multiplied by the prior filter matrix, where N is the number of visual nodes, M is the number of text nodes, the abscissa of the zero matrix represents the image features, and the ordinate of the zero matrix represents the text features; further configured to locate the maximum of the image-to-text attention weight along the text direction to obtain the text feature most strongly associated with each image feature, and set the coordinates of that image feature on the target image as the virtual coordinates of the most strongly associated text feature; and further configured to match the virtual coordinates of the text features against the coordinates of all image features on the target image, and set the elements of the zero matrix corresponding to matched coordinates to 1 to obtain the prior filter matrix.

Further, the feature corrector includes multiple cascaded correction interaction layers, each correction interaction layer including a first graph neural update unit and a second graph neural update unit; the first graph neural update unit is used to implement image-to-text feature aggregation, and the second graph neural update unit is used to implement text-to-image feature aggregation.

Further, the process by which the first graph neural update unit implements image-to-text feature aggregation includes: constructing attention matrices and computing from them the first mapping weights of the visual nodes onto the text nodes; normalizing the first mapping weights; and updating the image features corresponding to the visual nodes according to the image-to-text attention weight between the visual nodes and the text nodes and the first mapping weights, so as to implement image-to-text feature aggregation.

Further, the process by which the second graph neural update unit implements text-to-image feature aggregation includes: constructing attention matrices and computing from them the second mapping weights of the text nodes onto the visual nodes; normalizing the second mapping weights; and updating the text features corresponding to the text nodes according to the text-to-image attention weight between the visual nodes and the text nodes and the second mapping weights, so as to implement text-to-image feature aggregation.

Further, the answer determination module 605 is configured to intercept the content corresponding to the position of the classification field in the correction result to obtain the prediction vector, and further to classify the prediction vector and determine the answer to the visual question answering task in the answer space from the classification result.

Since the embodiments of the system part correspond to those of the method part, the description of the method part applies to the embodiments of the system part and is not repeated here.

This application further provides a non-volatile readable storage medium on which a computer program is stored; when the computer program is executed, the steps provided in the above embodiments can be implemented. The non-volatile readable storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

This application further provides an electronic device, which may include a memory and a processor; a computer program is stored in the memory, and when the processor invokes the computer program in the memory, the steps provided in the above embodiments can be implemented. The electronic device may of course also include various network interfaces, a power supply, and other components.

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. As the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant parts, refer to the description of the method. It should be pointed out that those of ordinary skill in the art can make several improvements and modifications to this application without departing from its principle, and these improvements and modifications also fall within the scope of protection of the claims of this application.

It should also be noted that in this specification, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes that element.

Claims (20)

  1. A visual question answering task processing method, characterized by comprising:
    receiving a visual question answering task, and determining a target image and a question text from the visual question answering task;
    extracting image detection features from the target image, and extracting question text features from the question text;
    fusing the image detection features, the question text features, and a classification field to obtain a comprehensive feature, and feeding the comprehensive feature into a comprehensive feature encoder to obtain an image feature segment, a text feature segment, and image-text weights, wherein the image-text weights include an image-to-text attention weight and a text-to-image attention weight;
    generating a heterogeneous graph corresponding to the visual question answering task, initializing the heterogeneous graph with the image feature segment, the text feature segment, and the image-text weights, and constructing a feature corrector from the attention relationships between the image-text features contained in the heterogeneous graph, wherein the heterogeneous graph includes visual nodes and text nodes;
    correcting the image feature segment and the text feature segment with the feature corrector, determining a prediction vector from the position of the classification field in the correction result, and determining the answer to the visual question answering task from the prediction vector.
  2. The visual question answering task processing method according to claim 1, characterized in that extracting question text features from the question text comprises:
    performing a feature extraction operation on the question text with a text encoder to obtain the question text features corresponding to the feature extraction operation.
  3. The visual question answering task processing method according to claim 1, characterized in that fusing the image detection features, the question text features, and the classification field to obtain the comprehensive feature comprises:
    performing image feature encoding on the image detection features to obtain an image feature encoding result corresponding to the image feature encoding;
    performing text feature encoding on the question text features to obtain a text feature encoding result corresponding to the text feature encoding;
    fusing the image feature encoding result, the text feature encoding result, and the classification field to obtain the comprehensive feature.
  4. The visual question answering task processing method according to claim 3, characterized in that fusing the image feature encoding result, the text feature encoding result, and the classification field to obtain the comprehensive feature comprises:
    concatenating and fusing the image feature encoding result and the text feature encoding result into an image-text fusion feature, and fusing the image-text fusion feature with the classification field to obtain the comprehensive feature.
  5. The visual question answering task processing method according to claim 3, characterized in that fusing the image detection features, the question text features, and the classification field to obtain the comprehensive feature comprises:
    fusing the question text features with the classification field into a comprehensive text feature, and performing text feature encoding on the comprehensive text feature to obtain a text feature encoding result;
    fusing the image feature encoding result with the text feature encoding result to obtain the comprehensive feature.
  6. The visual question answering task processing method according to claim 1, characterized in that the comprehensive feature encoder comprises several cascaded cross-attention layers, each cross-attention layer including a self-attention sub-layer and a cross-attention sub-layer, and feeding the comprehensive feature into the comprehensive feature encoder to obtain the image feature segment, the text feature segment, and the image-text weights comprises:
    feeding the comprehensive feature into the comprehensive feature encoder, and processing the comprehensive feature with the self-attention sub-layer and the cross-attention sub-layer to obtain the image feature segment, the text feature segment, and the image-text weights.
  7. The visual question answering task processing method according to claim 1, characterized in that the heterogeneous graph includes a visual-space heterogeneous graph and a text-space heterogeneous graph, and generating the heterogeneous graph corresponding to the visual question answering task comprises:
    generating a visual-space heterogeneous graph corresponding to a plurality of the visual nodes, or a text-space heterogeneous graph corresponding to a plurality of the text nodes.
  8. The visual question answering task processing method according to claim 1, characterized in that initializing the heterogeneous graph with the image feature segment, the text feature segment, and the image-text weights comprises:
    storing each image feature in the image feature segment into the corresponding visual node in order, so as to initialize the visual nodes of the heterogeneous graph;
    storing each text feature in the text feature segment into the corresponding text node in order, so as to initialize the text nodes of the heterogeneous graph;
    multiplying the matrix corresponding to the image-text weights by a prior filter matrix to obtain an edge initialization matrix, so as to initialize the edges of the heterogeneous graph.
  9. The visual question answering task processing method according to claim 8, characterized in that storing each image feature in the image feature segment into the corresponding visual node in order so as to initialize the visual nodes of the heterogeneous graph comprises:
    after the heterogeneous graph has been constructed, storing the image features of the image feature segment of size [N, d] into the corresponding visual nodes in order, so as to initialize the visual nodes of the heterogeneous graph, where N is the number of features in the image feature segment and d is the feature dimension;
    and storing each text feature in the text feature segment into the corresponding text node in order so as to initialize the text nodes of the heterogeneous graph comprises:
    storing each text feature of the text feature segment of size [M, d] into the corresponding text node in order, so as to initialize the text nodes of the heterogeneous graph, where M is the number of features in the text feature segment and d is the feature dimension.
  10. The visual question answering task processing method according to claim 2, characterized by further comprising, before the matrix corresponding to the image-text weights is multiplied by the prior filter matrix:
    constructing an N×M zero matrix, where N is the number of the visual nodes, M is the number of the text nodes, the abscissa of the zero matrix represents image features, and the ordinate of the zero matrix represents text features;
    locating the maximum of the image-to-text attention weight along the text direction to obtain the text feature most strongly associated with each image feature, and setting the coordinates of that image feature on the target image as the virtual coordinates of the most strongly associated text feature;
    matching the virtual coordinates of the text features against the coordinates of all image features on the target image, and setting the elements of the zero matrix corresponding to matched coordinates to 1 to obtain the prior filter matrix.
  11. The visual question answering task processing method according to claim 10, characterized in that matching the virtual coordinates of the text features against the coordinates of all image features on the target image and setting the elements of the zero matrix corresponding to matched coordinates to 1 to obtain the prior filter matrix comprises:
    comparing the virtual coordinates of the text features with the coordinates of all image features on the target image, and taking coordinates that overlap as matched coordinates;
    setting the elements of the zero matrix corresponding to the matched coordinates to 1 to obtain the prior filter matrix.
  12. The visual question answering task processing method according to claim 1, characterized in that the feature corrector includes multiple cascaded correction interaction layers, each correction interaction layer including a first graph neural update unit and a second graph neural update unit, the first graph neural update unit being used to implement image-to-text feature aggregation, and the second graph neural update unit being used to implement text-to-image feature aggregation.
  13. The visual question answering task processing method according to claim 12, characterized in that the process by which the first graph neural update unit implements image-to-text feature aggregation comprises:
    constructing attention matrices, and computing first mapping weights of the visual nodes onto the text nodes from the attention matrices;
    normalizing the first mapping weights;
    updating the image features corresponding to the visual nodes according to the image-to-text attention weight between the visual nodes and the text nodes and the first mapping weights, so as to implement image-to-text feature aggregation.
  14. The visual question answering task processing method according to claim 12, characterized in that the process by which the second graph neural update unit implements text-to-image feature aggregation comprises:
    constructing attention matrices, and computing second mapping weights of the text nodes onto the visual nodes from the attention matrices;
    normalizing the second mapping weights;
    updating the text features corresponding to the text nodes according to the text-to-image attention weight between the visual nodes and the text nodes and the second mapping weights, so as to implement text-to-image feature aggregation.
  15. The visual question answering task processing method according to claim 12, characterized in that correcting the image feature segment and the text feature segment with the feature corrector comprises:
    performing bilateral correction on the image feature segment and the text feature segment with the correction interaction layers of the feature corrector, to obtain a corrected image feature segment corresponding to the image feature segment and a corrected text feature segment corresponding to the text feature segment.
  16. The visual question answering task processing method according to claim 15, characterized in that intercepting the content corresponding to the position of the classification field in the correction result to obtain the prediction vector comprises:
    intercepting the content corresponding to the position of the classification field in the corrected text feature segment to obtain the prediction vector.
  17. The visual question answering task processing method according to claim 1, characterized in that determining the prediction vector from the position of the classification field in the correction result and outputting the answer to the visual question answering task from the prediction vector comprises:
    intercepting the content corresponding to the position of the classification field in the correction result to obtain the prediction vector;
    classifying the prediction vector, and determining the answer to the visual question answering task in the answer space from the classification result.
  18. A visual question answering task processing system, characterized by comprising:
    a task receiving module, configured to receive a visual question answering task and determine a target image and a question text from the visual question answering task;
    a feature extraction module, configured to extract image detection features from the target image and question text features from the question text;
    an encoding module, configured to fuse the image detection features, the question text features, and a classification field to obtain a comprehensive feature, and feed the comprehensive feature into a comprehensive feature encoder to obtain an image feature segment, a text feature segment, and image-text weights, wherein the image-text weights include an image-to-text attention weight and a text-to-image attention weight;
    a corrector construction module, configured to generate a heterogeneous graph corresponding to the visual question answering task, initialize the heterogeneous graph with the image feature segment, the text feature segment, and the image-text weights, and construct a feature corrector from the attention relationships between the image-text features contained in the heterogeneous graph, wherein the heterogeneous graph includes visual nodes and text nodes;
    an answer determination module, configured to correct the image feature segment and the text feature segment with the feature corrector, determine a prediction vector from the position of the classification field in the correction result, and determine the answer to the visual question answering task from the prediction vector.
  19. An electronic device, characterized by comprising a memory and a processor, the memory storing a computer program, wherein when the processor invokes the computer program in the memory, the steps of the visual question answering task processing method according to any one of claims 1 to 17 are implemented.
  20. A non-volatile readable storage medium, characterized in that the non-volatile readable storage medium stores computer-executable instructions which, when loaded and executed by a processor, implement the steps of the visual question answering task processing method according to any one of claims 1 to 17.
PCT/CN2022/134138 2022-04-29 2022-11-24 Visual question answering task processing method and system, electronic device, and storage medium WO2023207059A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210465781.6 2022-04-29
CN202210465781.6A CN114780768A (zh) Visual question answering task processing method and system, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023207059A1 (zh)

Family

ID=82435927

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/134138 WO2023207059A1 (zh) Visual question answering task processing method and system, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN114780768A (zh)
WO (1) WO2023207059A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114780768A (zh) * 2022-04-29 2022-07-22 山东海量信息技术研究院 Visual question answering task processing method and system, electronic device, and storage medium
CN115310611B (zh) * 2022-10-12 2023-03-24 苏州浪潮智能科技有限公司 Character intention reasoning method and related apparatus
CN115905591B (zh) * 2023-02-22 2023-05-30 浪潮电子信息产业股份有限公司 Visual question answering method, system, device, and readable storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3920048A1 (en) * 2020-06-02 2021-12-08 Siemens Aktiengesellschaft Method and system for automated visual question answering
US20210406592A1 (en) * 2020-06-30 2021-12-30 Beijing Baidu Netcom Science Technology Co., Ltd. Method and apparatus for visual question answering, computer device and medium
CN113094484A (zh) * 2021-04-07 2021-07-09 西北工业大学 Text visual question answering implementation method based on heterogeneous graph neural network
CN113360621A (zh) * 2021-06-22 2021-09-07 辽宁工程技术大学 Scene text visual question answering method based on modal-reasoning graph neural network
CN114780768A (zh) * 2022-04-29 2022-07-22 山东海量信息技术研究院 Visual question answering task processing method and system, electronic device, and storage medium

Also Published As

Publication number Publication date
CN114780768A (zh) 2022-07-22

Similar Documents

Publication Publication Date Title
WO2023207059A1 (zh) Visual question answering task processing method and system, electronic device, and storage medium
CN108701250B (zh) Data fixed-point conversion method and apparatus
US20220383078A1 (en) Data processing method and related device
US20230229898A1 (en) Data processing method and related device
US20170150235A1 (en) Jointly Modeling Embedding and Translation to Bridge Video and Language
CN108629414B (zh) Deep hash learning method and apparatus
US11574239B2 (en) Outlier quantization for training and inference
EP4152212A1 (en) Data processing method and device
WO2024001100A1 (zh) Text processing method and apparatus, device, and non-volatile readable storage medium
CN109766557A (zh) Sentiment analysis method and apparatus, storage medium, and terminal device
US20240185602A1 (en) Cross-Modal Processing For Vision And Language
US11704559B2 (en) Learning to search user experience designs based on structural similarity
WO2021169453A1 (zh) Method and apparatus for text processing
Gao et al. A hierarchical recurrent approach to predict scene graphs from a visual‐attention‐oriented perspective
CN113626610A (zh) Knowledge graph embedding method and apparatus, computer device, and storage medium
CN112906865A (zh) Neural network architecture search method and apparatus, electronic device, and storage medium
CN113239176A (zh) Semantic matching model training method, apparatus, device, and storage medium
CN115018656A (zh) Risk identification method, method for training a risk identification model, apparatus, and device
CN113486659B (zh) Text matching method and apparatus, computer device, and storage medium
CN113590578A (zh) Cross-language knowledge unit transfer method and apparatus, storage medium, and terminal
CN112486947A (zh) Knowledge base construction method and apparatus, electronic device, and readable storage medium
CN117033609A (zh) Text visual question answering method and apparatus, computer device, and storage medium
WO2023173550A1 (zh) Cross-domain data recommendation method and apparatus, computer device, and medium
CN112347242B (zh) Abstract generation method, apparatus, device, and medium
KR20230055021A Pyramid layered attention model for nested and overlapped named entity recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22939874

Country of ref document: EP

Kind code of ref document: A1