CN114925703A - Visual question-answering method and system with multi-granularity text representation and image-text fusion - Google Patents

Visual question-answering method and system with multi-granularity text representation and image-text fusion

Info

Publication number
CN114925703A
Authority
CN
China
Prior art keywords
text
features
fusion
picture
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210667045.9A
Other languages
Chinese (zh)
Inventor
王新刚
刘小玉
李晓敏
成洪路
刘广政
周金岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology filed Critical Qilu University of Technology
Priority to CN202210667045.9A priority Critical patent/CN114925703A/en
Publication of CN114925703A publication Critical patent/CN114925703A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • G06F40/35Discourse or dialogue representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a visual question-answering method and system with multi-granularity text representation and image-text fusion, comprising: obtaining a picture and the question text corresponding to the picture, and extracting picture features; extracting sentence information at different levels from the question text through hierarchical dilated convolution to form text features; and, after vector concatenation of the obtained picture features and text features, fusing the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector, converting the adaptive fusion feature vector to the candidate-answer dimension through a linear transformation, and obtaining the answer prediction through a prediction function. Multi-level text information description is performed in the text feature representation to retain the multi-level features of the text, the high-order feature vectors of the different modalities are fused in an image-text adaptive fusion manner, the topic and meaning of the question are expressed accurately from multiple levels, and after fusion the attention weights over the image and text can be calculated dynamically so as to better predict the answer.

Description

Multi-granularity text representation and image-text fusion visual question-answering method and system
Technical Field
The invention relates to the technical field of visual question answering, in particular to a visual question answering method and system with multi-granularity text representation and image-text fusion.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
In Visual Question Answering (VQA), a picture and a text question related to that picture are input to a computer, and the computer obtains the answer to the question by understanding and reasoning over the information of the two modalities, namely the picture and the text.
In existing related technologies and methods, the computer extracts text features and picture features and then fuses them to obtain the answer to the question. On the one hand, most text-feature representations are extracted at a coarse granularity, so the extracted information is limited, text information is lost, and the deep semantics of the text cannot be fully exploited; on the other hand, most image-text multi-modal fusion schemes perform feature-level fusion based on vector concatenation, which can neither account for the differences between modalities nor automatically adjust the weight of each modality's features during fusion.
Disclosure of Invention
In order to solve the technical problems existing in the background art, the invention provides a visual question-answering method and system with multi-granularity text representation and image-text fusion. Hierarchical dilated convolution is introduced into the text feature representation, and describing the text information at multiple levels in this way better retains the multi-level features of the text. An image-text adaptive fusion scheme, in which a Transformer replaces the traditional concatenation, fuses the high-order feature vectors of the different modalities; an auto-fusion mechanism dynamically calculates the attention weights of the image and text feature vectors, which represent the importance of the information from each modality, so that image and text elements are focused on correctly, answers are predicted better, and the accuracy of visual question answering is improved.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a visual question-answering method with multi-granularity text representation and image-text fusion, which comprises the following steps:
obtaining a picture and the question text corresponding to the picture, and extracting picture features;
extracting sentence information at different levels from the question text through hierarchical dilated convolution to form text features;
and after vector concatenation of the obtained picture features and text features, fusing the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector, converting the adaptive fusion feature vector to the candidate-answer dimension through a linear transformation, and obtaining the answer prediction through a prediction function.
Extracting sentence information at different levels from the question text through hierarchical dilated convolution to form text features specifically comprises:
acquiring multi-granularity semantic features of the question text with a hierarchical dilated convolutional network;
setting gradually increasing dilation rates r = 1, 2, 3, ..., n and stacking the dilated convolutions in layers, so that the length of the convolved text segment grows exponentially and covers the semantic features of different n-grams;
saving the output of each stacked layer l as a feature map of the text at a particular granularity level:
d_l = LayerNorm(DConv_{r_l}(d_{l-1})), d_l ∈ R^{N×fs}, l = 1, ..., L
where, given a sentence sequence d = [x_1, x_2, ..., x_N], the sentence sequence d is converted into the matrix d_0 = [X_1, X_2, ..., X_N]; fs denotes the number of filters per layer; with L layers, the multi-granularity question text is defined as [d_0, d_1, ..., d_L]. The hierarchical dilated convolutional network gradually acquires lexical and semantic features starting from the word and phrase level with small dilation rates.
After vector concatenation of the obtained picture features and text features, fusing the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector specifically comprises:
passing the obtained picture features and text features into a co-attention network, and obtaining updated text and picture features by learning the relationships among the modalities;
concatenating the updated text and picture features, fusing the concatenated vector with the high-order vectors of the different modalities to obtain the adaptive fusion feature vector, and reconstructing the original concatenation vector from the automatically fused latent vector;
minimizing the Euclidean metric between the original vector and the reconstructed vector, which ensures that the learned self-fused vector retains the signal contained in the input concatenated latent vectors.
The co-attention network comprises at least one group of self-attention units and guided-attention units connected together; the self-attention unit comprises a multi-head attention layer and a point-wise feed-forward layer connected together, and is used for learning the relationships among all samples within the same modality; the guided-attention unit has the same structure as the self-attention unit, and one modality is used to guide another modality so as to represent the feature relationships between different modalities.
A second aspect of the present invention provides a system for implementing the above method, comprising:
a picture feature extraction module configured to: obtain a picture and the question text corresponding to the picture, and extract picture features; and, according to the question text corresponding to the picture, extract sentence information at different levels from the question text through hierarchical dilated convolution to form text features;
a fusion prediction module configured to: after vector concatenation of the obtained picture features and text features, fuse the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector, convert the adaptive fusion feature vector to the candidate-answer dimension through a linear transformation, and obtain the answer prediction through a prediction function.
A third aspect of the invention provides a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, performs the steps in the multi-granularity text representation and image-text fusion visual question-answering method as described above.
A fourth aspect of the invention provides a computer apparatus.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the multi-granularity text representation and image-text fusion visual question-answering method as described above.
Compared with the prior art, the above one or more technical schemes have the following beneficial effects:
1. Multi-level text information description is performed in the text feature representation, so the multi-level features of the text are better retained; an image-text adaptive fusion scheme replaces the traditional concatenation so that the high-order feature vectors of the different modalities are fused; the topic and meaning of the question can be expressed accurately from the levels of words, phrases and sentences; and the fused high-order feature vector allows the attention weights over the image and text to be calculated dynamically, so that image and text elements are focused on correctly and the answer is predicted better.
2. An auto-fusion mechanism dynamically calculates the attention weights of the image and text feature vectors, which represent the importance of the information from each modality, so that image and text elements are focused on correctly, the answer is predicted better, and the accuracy of visual question answering is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention without limiting it.
FIG. 1 is a schematic overall flow chart of a visual question-answering method provided by one or more embodiments of the present invention;
FIG. 2 is a schematic diagram of the hierarchical dilated convolutional network structure for multi-granularity text information representation provided by one or more embodiments of the present invention;
FIGS. 3(a) - (c) are schematic diagrams of a Self-Attention network structure, a Guided-Attention network structure, and a Collaborative Attention module network structure according to one or more embodiments of the present invention;
fig. 4 is a schematic diagram of the image-text adaptive fusion network structure according to one or more embodiments of the present invention;
fig. 5 is a graph illustrating the results of an ablation experiment on the MGMCAN model using the VQA 2.0 dataset, according to one or more embodiments of the present invention;
fig. 6 is a schematic diagram of an overall architecture of a Transformer according to one or more embodiments of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As described in the background art, in the related technologies and methods the computer extracts text features and picture features and then fuses them to obtain the answer to the question. On the one hand, most text-feature representations are extracted at a coarse granularity, so the extracted information is limited, text information is lost, and the deep semantics of the text cannot be fully exploited; on the other hand, most image-text multi-modal fusion schemes perform feature-level fusion based on vector concatenation, which can neither account for the differences between modalities nor automatically adjust the weight of each modality's features during fusion.
Therefore, the following embodiments provide a visual question-answering method and system with multi-granularity text representation and image-text fusion. Hierarchical dilated convolution is introduced into the text feature representation, and describing the text information at multiple levels in this way better retains the multi-level features of the text. An image-text adaptive fusion scheme, in which a Transformer replaces the traditional concatenation, fuses the high-order feature vectors of the different modalities; an auto-fusion mechanism dynamically calculates the attention weights of the image and text feature vectors, which represent the importance of the information from each modality, so that image and text elements are focused on correctly, answers are predicted better, and the accuracy of visual question answering is improved.
Embodiment one:
As shown in figs. 1-5, a visual question-answering method with multi-granularity text representation and image-text fusion includes the following steps:
obtaining a picture and the question text corresponding to the picture, and extracting picture features;
extracting sentence information at different levels from the question text through hierarchical dilated convolution to form text features;
and after vector concatenation of the obtained picture features and text features, fusing the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector, converting the adaptive fusion feature vector to the candidate-answer dimension through a linear transformation, and obtaining the answer prediction through a prediction function.
Specifically, the method comprises the following steps:
1. Visual and text feature representation
As shown in fig. 1, the visual feature representation: target features of the image are detected with Faster R-CNN (with ResNet-101 as the main backbone network) pre-trained on Visual Genome. A confidence threshold is applied to the detected targets, giving a dynamic number of target objects N ∈ [10, 100], where the target features come from the feature map after RoI pooling. For the i-th object, the convolutional features of its detection region are average-pooled to obtain the object feature v_i.
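For illustration, the following minimal sketch (PyTorch) of this region-feature step uses torchvision's off-the-shelf Faster R-CNN as a stand-in for the ResNet-101 detector trained on Visual Genome; the confidence threshold, FPN level and pooling size are assumptions rather than values fixed by the invention.

```python
# Sketch: detect objects, keep confident ones (capped to [10, 100]), and
# average-pool each region's convolutional features into one vector per object.
import torch
import torchvision
from torchvision.ops import roi_align

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def extract_region_features(image, conf_thresh=0.3, min_boxes=10, max_boxes=100):
    # image: float tensor [3, H, W] in [0, 1]; normalization/resizing omitted in this sketch
    with torch.no_grad():
        det = detector([image])[0]                          # dict with boxes, labels, scores
        keep = det["scores"] >= conf_thresh
        boxes = det["boxes"][keep][:max_boxes]
        if boxes.shape[0] < min_boxes:                      # fall back to the top-scoring boxes
            boxes = det["boxes"][:min_boxes]
        feats = detector.backbone(image.unsqueeze(0))["0"]  # FPN level "0" feature map (stride 4)
        pooled = roi_align(feats, [boxes], output_size=7, spatial_scale=1 / 4)
    return pooled.mean(dim=(2, 3))                          # [N, C] object features v_i

# region_feats = extract_region_features(image)
```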
As shown in figs. 1-2, the text feature representation: a hierarchical dilated convolutional network (HECN) is used to learn representations of the question features from multiple semantic units. A sequence of dilation rates [r_1, ..., r_i, ..., r_n] is designed for the hierarchically stacked dilated convolutions so as to fully cover the receptive field: small dilation rates extract word and phrase information, and large dilation rates extract sentence information. In this way wider contextual information is obtained and the utilization of the information is improved without enlarging the convolution kernel.
a: given a sentence sequence d = [x_1, x_2, ..., x_N];
b: the hierarchical dilation encoder converts the sentence sequence d into the matrix d_0 = [X_1, X_2, ..., X_N];
c: the multi-granularity semantic features of the question text are captured with hierarchical dilated convolution, which obtains a wider receptive field by skipping r input elements at a time.
First, setting r = 1 (standard convolution) in the first layer ensures full coverage of the input text information and avoids partial loss of information.
Second, the length of the convolved text segments grows exponentially by stacking dilated convolutions with larger dilation rates (r = 2, r = 3), so that a few layers (set to 3 here) and a modest number of parameters cover the semantic features of different n-grams.
In addition, to prevent vanishing or exploding gradients, layer normalization is applied at the end of each convolutional layer. Since irrelevant information may be introduced into the semantic units, the multi-level dilation rates are chosen based on validation performance. The output of each stacked layer l is saved as a feature map of the text at a particular granularity level:
d_l = LayerNorm(DConv_{r_l}(d_{l-1})), d_l ∈ R^{N×fs}, l = 1, ..., L
where fs denotes the number of filters per layer. With L layers, the multi-granularity question text is defined as [d_0, d_1, ..., d_L]. In this way, the hierarchical dilated convolutional network (HECN) acquires lexical and semantic features step by step from the word and phrase level with small dilation rates, and captures long-range dependencies from the sentence level with large dilation rates. At the same time, the computation path is greatly shortened, and the negative impact of information loss caused by down-sampling methods such as max pooling is reduced.
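A minimal PyTorch sketch of the hierarchical dilated convolution described above: stacked 1-D convolutions with dilation rates 1, 2 and 3, each followed by layer normalization, whose per-layer outputs are kept as the granularity levels d_0, ..., d_L. The embedding size, filter count fs and kernel size are assumptions.

```python
import torch.nn as nn

class HierarchicalDilatedEncoder(nn.Module):
    def __init__(self, emb_dim=300, fs=512, kernel_size=3, dilations=(1, 2, 3)):
        super().__init__()
        convs, in_ch = [], emb_dim
        for r in dilations:
            pad = r * (kernel_size - 1) // 2              # keeps the sequence length N
            convs.append(nn.Sequential(
                nn.Conv1d(in_ch, fs, kernel_size, padding=pad, dilation=r),
                nn.ReLU(),
            ))
            in_ch = fs
        self.convs = nn.ModuleList(convs)
        self.norm = nn.LayerNorm(fs)

    def forward(self, d0):                                # d0: [B, N, emb_dim] word embeddings
        outputs, x = [d0], d0                             # d_0
        for conv in self.convs:
            x = conv(x.transpose(1, 2)).transpose(1, 2)   # Conv1d expects [B, C, N]
            x = self.norm(x)                              # layer norm after each conv layer
            outputs.append(x)                             # d_l: one granularity level
        return outputs                                    # [d_0, d_1, ..., d_L]

# tokens = embedding(question_ids)                 # [B, N, 300]
# d_levels = HierarchicalDilatedEncoder()(tokens)
```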
2. Co-attention module
As shown in fig. 3, the extracted text and visual vectors are passed into a co-attention network, and the relationships among the modalities are learned through the co-attention module to obtain updated text and visual features.
The basic component of the co-attention network, the Modular Co-Attention (MCA) layer, is composed of two basic attention units: the self-attention unit (SA) shown in fig. 3(a) and the guided-attention unit (GA) shown in fig. 3(b). The SA unit consists of a multi-head attention layer and a point-wise feed-forward layer, and is used to learn the relationships among all samples within the same modality. The GA unit has a structure similar to the SA unit, and one modality is used to guide another modality so as to represent the feature relationships between different modalities.
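A sketch of the two attention units under stated assumptions (hidden size, head count and feed-forward size are illustrative): the SA unit attends within one modality, while the GA unit lets one modality guide the other.

```python
import torch.nn as nn

class SAUnit(nn.Module):
    def __init__(self, dim=512, heads=8, ffn=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn), nn.ReLU(), nn.Linear(ffn, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])         # relations within one modality
        return self.norm2(x + self.ffn(x))                # point-wise feed-forward sub-layer

class GAUnit(SAUnit):
    def forward(self, x, y):                              # y (e.g. text) guides x (e.g. image)
        x = self.norm1(x + self.attn(x, y, y)[0])         # cross-modal (guided) attention
        return self.norm2(x + self.ffn(x))

# text  = SAUnit()(text_feats)                # updated question features
# image = GAUnit()(image_feats, text)         # question-guided image features
```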
3. Image-text adaptive fusion
As shown in fig. 4, the vectors of the image and text features are first concatenated, and the concatenated vector is then fused with the high-order vectors of the different modalities through a Transformer (a deep-learning model based on the self-attention mechanism) layer to obtain the adaptive fusion feature vector. The original concatenation vector is then reconstructed from the automatically fused latent vector. Finally, the Euclidean metric between the original vector and the reconstructed vector is minimized. This process ensures that the learned self-fused vector retains the signal contained in the input concatenated latent vector, and the correlation between the self-fused and concatenated latent vectors is also increased.
The image-text adaptive fusion network needs to preserve as much information from each modality as possible. Given n multi-modal vectors of dimension d, z_1, ..., z_n ∈ R^d:
First, they are concatenated to obtain the concatenation vector z = [z_1; z_2; ...; z_n] ∈ R^{nd}.
Second, a transformation τ is applied to z to reduce its dimensionality to t, giving the fused latent vector z_t = τ(z) ∈ R^t. Then z_t is used to reconstruct the original concatenation vector, giving ẑ ∈ R^{nd}. Finally the loss J_tr is calculated, i.e., the loss between z and ẑ.
The auto-fusion network employs a mean-squared-error (MSE) loss function, consistent with the motivation of compressing the multi-modal features so as to filter out less useful signals. For auto-fusion, the intermediate variable z_t is taken as the fused multi-modal representation. The MSE loss of the auto-fusion network is expressed as follows:
J_tr = MSE(z, ẑ) = ‖z − ẑ‖²
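A minimal sketch of this fusion step under stated assumptions (the feature size d, latent size t and the single Transformer encoder layer are illustrative): the text and image vectors are fused by self-attention, compressed to z_t, and the concatenation is reconstructed from z_t with the MSE loss J_tr.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, d=512, n_modalities=2, t=512):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.compress = nn.Linear(n_modalities * d, t)        # tau: R^{nd} -> R^t
        self.reconstruct = nn.Linear(t, n_modalities * d)     # z_t -> reconstructed z
        self.mse = nn.MSELoss()

    def forward(self, text_vec, img_vec):                     # [B, d] each
        tokens = torch.stack([text_vec, img_vec], dim=1)      # [B, 2, d]
        fused = self.encoder(tokens)                          # Transformer fusion of modalities
        z = fused.flatten(1)                                  # concatenation vector z in R^{nd}
        z_t = self.compress(z)                                # adaptive fusion feature vector
        z_hat = self.reconstruct(z_t)                         # reconstructed concatenation
        j_tr = self.mse(z_hat, z)                             # reconstruction loss J_tr
        return z_t, j_tr

# z_t, j_tr = AdaptiveFusion()(text_vec, img_vec)
# total_loss = bce_loss + j_tr        # J_tr is added to the answer-classification loss
```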
The overall architecture of the Transformer in this embodiment is shown in fig. 6 and can be divided into four modules (a minimal assembly is sketched after this list): an input module, an encoder module, a decoder module and an output module.
An input module: a source-text embedding layer with its positional encoder, and a target-text embedding layer with its positional encoder.
An encoder module: formed by stacking N encoder layers; each encoder layer consists of two sub-layer connection structures; the first sub-layer connection structure comprises a multi-head self-attention sub-layer, a normalization layer and a residual connection; the second sub-layer connection structure comprises a feed-forward fully connected sub-layer, a normalization layer and a residual connection.
A decoder module: formed by stacking N decoder layers; each decoder layer consists of three sub-layer connection structures; the first sub-layer connection structure comprises a multi-head self-attention sub-layer, a normalization layer and a residual connection; the second sub-layer connection structure comprises a multi-head attention sub-layer, a normalization layer and a residual connection; the third sub-layer connection structure comprises a feed-forward fully connected sub-layer, a normalization layer and a residual connection.
An output module: a linear layer followed by softmax normalization.
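The following sketch assembles the four modules from standard PyTorch building blocks; the depth, width, head count and vocabulary sizes are assumptions.

```python
import torch.nn as nn

d_model, n_layers, vocab = 512, 6, 30000

transformer = nn.ModuleDict({
    # input module: embeddings plus (learned) positional encodings
    "src_embed": nn.Embedding(vocab, d_model),
    "tgt_embed": nn.Embedding(vocab, d_model),
    "pos_embed": nn.Embedding(512, d_model),
    # encoder module: N stacked layers of self-attention + feed-forward sub-layers
    "encoder": nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), n_layers),
    # decoder module: N stacked layers of self-attention, cross-attention, feed-forward
    "decoder": nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), n_layers),
    # output module: linear projection followed by softmax over the target vocabulary
    "generator": nn.Sequential(nn.Linear(d_model, vocab), nn.LogSoftmax(dim=-1)),
})
```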
4. Visual question-answer prediction
After a linear transformation of the fused features, the dimensionality is converted to the candidate-answer dimension, and the answer prediction is obtained through a Sigmoid function. The classification problem is trained with binary cross-entropy (BCE) as the loss function, as in the formula:
L_BCE = −(1/N) Σ_{i=1}^{N} [ a_i·log(a_i') + (1 − a_i)·log(1 − a_i') ]
where N is the number of classes in the multi-class setting, a_i' is the predicted value for the i-th class, and a_i is the label value of the i-th class.
The Sigmoid function is used for the binary decision problem; it is an S-shaped function that maps a variable to a value between 0 and 1. The formula of the Sigmoid function is as follows:
σ(x) = 1 / (1 + e^(−x))
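A minimal sketch of the prediction head under stated assumptions (the fused feature size and the 3129-answer candidate set, common for VQA 2.0, are illustrative): a linear projection to the candidate-answer dimension, a sigmoid for the answer scores, and BCE for training.

```python
import torch
import torch.nn as nn

num_answers = 3129
classifier = nn.Linear(512, num_answers)          # fused feature -> answer logits
bce = nn.BCEWithLogitsLoss()                      # sigmoid + BCE in one numerically stable op

def predict_and_loss(fused_vec, answer_targets):
    logits = classifier(fused_vec)                # [B, num_answers]
    probs = torch.sigmoid(logits)                 # predicted scores a_i'
    loss = bce(logits, answer_targets)            # targets a_i are soft labels in [0, 1]
    return probs, loss
```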
The method can effectively extract the multi-level features of the text, attends to the important parts of the text information through hierarchical dilated convolution, and can accurately express the topic and meaning of the question from the levels of words, phrases and sentences. A Transformer layer is used, innovatively, to fuse the high-order vectors of the different modalities during feature fusion; this strategy allows the attention weights over the image and text to be calculated dynamically and image and text elements to be focused on correctly, so as to better predict the answer.
The evaluation metrics of the improved model are clearly improved, as follows:
Dataset. The VQA 2.0 dataset is the reference dataset commonly used for the VQA task; it is composed of natural images from MSCOCO with corresponding manually annotated questions and answers. Each picture corresponds to 3 questions, and each question corresponds to 10 answers. The dataset is partitioned as follows: the training set contains 80K images and 444K question-answer pairs; the validation set contains 40K images and 214K question-answer pairs; the test set contains 80K images and 448K questions. The test set includes 2 test subsets, test-dev and test-standard, for evaluating model performance online. The question types in the dataset are yes/no questions, object-counting questions, and other questions.
Performance evaluation.
Table 1 compares the accuracy of the method of this embodiment with current mainstream visual question-answering models on the VQA 2.0 test set.
TABLE 1. Comparison of accuracy rates (the table is reproduced as an image in the original publication)
The model of this embodiment was evaluated on the VQA 2.0 dataset and compared with other state-of-the-art methods. Table 1 shows the results of the online evaluation experiments on Test-dev and Test-std.
As can be seen from Table 1, first, compared with the classical attention-based methods BUTD and MFH, the method of this embodiment improves the Overall accuracy on the VQA 2.0 Test-dev split by 5.53%. The reason is that those models consider neither multi-granularity modeling nor the co-attention relationship between text and pictures during text modeling.
Second, unlike conventional methods, this embodiment proposes the HECN framework for multi-granularity text feature extraction, which extracts text information at multiple granularities through a three-layer hierarchy to ensure the completeness of the information. Compared with the recently proposed DFAF and MCAN, the method of this embodiment achieves better performance, and an image-text adaptive fusion method, rather than a simple concatenation fusion strategy, is used for the multi-modal information fusion. This strategy allows the attention weights over the image and text to be calculated dynamically and image and text elements to be focused on correctly, so as to better predict the answer.
Finally, the overall accuracy of this embodiment is significantly improved compared with the BAN method, although BAN performs better on the counting class (i.e., the Number type) owing to its dedicated object-counting module.
Ablation experiments.
To analyze the contribution and effect of each part of the model, extensive ablation experiments were performed on the proposed model on the VQA 2.0 validation set, evaluating the effect of each module and demonstrating the effectiveness of each part. The ablation models are as follows:
BASE (LSTM + Co-Att + sum): question encoding uses a long short-term memory network (LSTM), the two modalities are modeled with co-attention, and the multi-modal fusion uses simple additive concatenation.
BASE + MGT (MGT + Co-Att + sum): question encoding uses multi-granularity text information extraction based on hierarchical dilated convolution.
BASE + ITAF (LSTM + Co-Att + ITAF): the feature fusion uses the image-text adaptive fusion scheme.
BASE + MGT + ITAF (MGT + Co-Att + ITAF): question encoding uses the multi-granularity text feature extraction method, and the feature fusion uses the image-text adaptive fusion scheme.
The results are shown in table 2:
Table 2. Comparison of results of ablation experiments
Model Overall Yes/No Number Other
BASE 81.22 95.69 67.26 73.9
BASE+MGT 82.17 96.25 70.38 74.65
BASE+ITAF 82.96 96.45 70.42 75.02
BASE+MGT+ITAF(OURS) 83.23 96.49 70.41 75.54
First, adding the MGT module improves the performance of visual question answering, which shows that the multi-granularity text feature extraction method can extract more useful information from the question text, so that the question vector is composed of multi-level, multi-perspective information. Second, the ITAF module also clearly boosts the performance of the model, which verifies that the image-text adaptive fusion technique can dynamically weight each modality during fusion and can effectively improve the information interaction.
Embodiment two:
This embodiment provides a system for implementing the above method, which includes:
a picture feature extraction module configured to: obtain a picture and the question text corresponding to the picture, and extract picture features;
a text feature extraction module configured to: according to the question text corresponding to the picture, extract sentence information at different levels from the question text through hierarchical dilated convolution to form text features;
a fusion prediction module configured to: after vector concatenation of the obtained picture features and text features, fuse the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector, convert the adaptive fusion feature vector to the candidate-answer dimension through a linear transformation, and obtain the answer prediction through a prediction function.
The method executed by this system can effectively extract the multi-level features of the text, attends to the important parts of the text information through hierarchical dilated convolution, and can accurately express the topic and meaning of the question from the levels of words, phrases and sentences. A Transformer layer is used, innovatively, to fuse the high-order vectors of the different modalities during feature fusion, so that the attention weights over the image and text can be calculated dynamically and image and text elements can be focused on correctly, so as to better predict the answer.
Embodiment three:
This embodiment provides a computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the steps in the multi-granularity text representation and image-text fusion visual question-answering method described in embodiment one.
The multi-granularity text representation and image-text fusion visual question-answering method executed by the computer program of this embodiment can effectively extract the multi-level features of the text, attends to the important parts of the text information through hierarchical dilated convolution, and can accurately express the topic and meaning of the question from the levels of words, phrases and sentences. A Transformer layer is used, innovatively, to fuse the high-order vectors of the different modalities during feature fusion; this strategy allows the attention weights over the image and text to be calculated dynamically and image and text elements to be focused on correctly, so as to better predict the answer.
Embodiment four:
This embodiment provides a computer device, which includes a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the multi-granularity text representation and image-text fusion visual question-answering method described in embodiment one.
In the multi-granularity text representation and image-text fusion visual question-answering method executed by the processor, the multi-level features of the text can be effectively extracted, the important parts of the text information are attended to through hierarchical dilated convolution, and the topic and meaning of the question can be accurately expressed from the levels of words, phrases and sentences. A Transformer layer is used, innovatively, to fuse the high-order vectors of the different modalities during feature fusion; this strategy allows the attention weights over the image and text to be calculated dynamically and image and text elements to be focused on correctly, so as to better predict the answer.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory or a random access memory.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A visual question-answering method with multi-granularity text representation and image-text fusion, characterized in that the method comprises the following steps:
obtaining a picture and the question text corresponding to the picture, and extracting picture features;
extracting sentence information at different levels from the question text through hierarchical dilated convolution to form text features;
and after vector concatenation of the obtained picture features and text features, fusing the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector, converting the adaptive fusion feature vector to the candidate-answer dimension through a linear transformation, and obtaining the answer prediction through a prediction function.
2. The multi-granularity text representation and image-text fusion visual question-answering method of claim 1, wherein extracting sentence information at different levels from the question text through hierarchical dilated convolution to form text features specifically comprises:
acquiring multi-granularity semantic features of the question text with a hierarchical dilated convolutional network;
setting gradually increasing dilation rates r = 1, 2, 3, ..., n and stacking the dilated convolutions in layers, so that the length of the convolved text segment grows exponentially and covers the semantic features of different n-grams;
saving the output of each stacked layer l as a feature map of the text at a particular granularity level:
d_l = LayerNorm(DConv_{r_l}(d_{l-1})), d_l ∈ R^{N×fs}, l = 1, ..., L
where, given a sentence sequence d = [x_1, x_2, ..., x_N], the sentence sequence d is converted into the matrix d_0 = [X_1, X_2, ..., X_N]; fs denotes the number of filters per layer; with L layers, the multi-granularity question text is defined as [d_0, d_1, ..., d_L]; the hierarchical dilated convolutional network gradually acquires lexical and semantic features starting from the word and phrase level with small dilation rates.
3. The multi-granularity text representation and image-text fusion visual question-answering method of claim 1, wherein, after vector concatenation of the obtained picture features and text features, fusing the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector comprises:
passing the obtained picture features and text features into a co-attention network, and obtaining updated text and picture features by learning the relationships among the modalities.
4. The multi-granularity text representation and image-text fusion visual question-answering method of claim 3, wherein, after vector concatenation of the obtained picture features and text features, fusing the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector further comprises:
concatenating the updated text and picture features, fusing the concatenated vector with the high-order vectors of the different modalities to obtain the adaptive fusion feature vector, and reconstructing the original concatenation vector from the automatically fused latent vector.
5. The multi-granularity text representation and image-text fusion visual question-answering method of claim 4, wherein, after vector concatenation of the obtained picture features and text features, fusing the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector further comprises:
minimizing the Euclidean metric between the original vector and the reconstructed vector, which ensures that the learned self-fused vector retains the signal contained in the input concatenated latent vectors.
6. The multi-granularity text representation and image-text fusion visual question-answering method of claim 3, wherein the co-attention network includes at least one group of self-attention units and guided-attention units connected together.
7. The multi-granularity text representation and image-text fusion visual question-answering method of claim 6, wherein the self-attention unit comprises a multi-head attention layer and a point-wise feed-forward layer connected together, and is used for learning the relationships among all samples within the same modality; the guided-attention unit has the same structure as the self-attention unit, and one modality is used to guide another modality so as to represent the feature relationships between different modalities.
8. A multi-granularity text representation and image-text fusion visual question-answering system, characterized in that it comprises:
a feature extraction module configured to: obtain a picture and the question text corresponding to the picture, and extract picture features; and, according to the question text corresponding to the picture, extract sentence information at different levels from the question text through hierarchical dilated convolution to form text features;
a fusion prediction module configured to: after vector concatenation of the obtained picture features and text features, fuse the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector, convert the adaptive fusion feature vector to the candidate-answer dimension through a linear transformation, and obtain the answer prediction through a prediction function.
9. A computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, performs the steps in the multi-granularity text representation and image-text fusion visual question-answering method according to any one of claims 1-7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps in the multi-granularity text representation and image-text fusion visual question-answering method according to any one of claims 1-7.
CN202210667045.9A 2022-06-14 2022-06-14 Visual question-answering method and system with multi-granularity text representation and image-text fusion Pending CN114925703A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210667045.9A CN114925703A (en) 2022-06-14 2022-06-14 Visual question-answering method and system with multi-granularity text representation and image-text fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210667045.9A CN114925703A (en) 2022-06-14 2022-06-14 Visual question-answering method and system with multi-granularity text representation and image-text fusion

Publications (1)

Publication Number Publication Date
CN114925703A true CN114925703A (en) 2022-08-19

Family

ID=82814363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210667045.9A Pending CN114925703A (en) 2022-06-14 2022-06-14 Visual question-answering method and system with multi-granularity text representation and image-text fusion

Country Status (1)

Country Link
CN (1) CN114925703A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116052171A (en) * 2023-03-31 2023-05-02 国网数字科技控股有限公司 Electronic evidence correlation calibration method, device, equipment and storage medium
WO2024046038A1 (en) * 2022-08-29 2024-03-07 京东方科技集团股份有限公司 Video question-answer method, device and system, and storage medium


Similar Documents

Publication Publication Date Title
CN110929515B (en) Reading understanding method and system based on cooperative attention and adaptive adjustment
CN110321419B (en) Question-answer matching method integrating depth representation and interaction model
CN108549658B (en) Deep learning video question-answering method and system based on attention mechanism on syntax analysis tree
CN112685597B (en) Weak supervision video clip retrieval method and system based on erasure mechanism
CN110110337B (en) Translation model training method, medium, device and computing equipment
CN111597830A (en) Multi-modal machine learning-based translation method, device, equipment and storage medium
CN114925703A (en) Visual question-answering method and system with multi-granularity text representation and image-text fusion
CN111985239A (en) Entity identification method and device, electronic equipment and storage medium
CN110633359B (en) Sentence equivalence judgment method and device
CN113065358B (en) Text-to-semantic matching method based on multi-granularity alignment for bank consultation service
CN111813913B (en) Two-stage problem generating system with problem as guide
CN116304984A (en) Multi-modal intention recognition method and system based on contrast learning
CN112200664A (en) Repayment prediction method based on ERNIE model and DCNN model
KR20200116760A (en) Methods and apparatuses for embedding word considering contextual and morphosyntactic information
CN110633473B (en) Implicit discourse relation identification method and system based on conditional random field
CN115861995A (en) Visual question-answering method and device, electronic equipment and storage medium
CN114880307A (en) Structured modeling method for knowledge in open education field
CN116681061A (en) English grammar correction technology based on multitask learning and attention mechanism
CN113704466B (en) Text multi-label classification method and device based on iterative network and electronic equipment
CN115357712A (en) Aspect level emotion analysis method and device, electronic equipment and storage medium
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
KR102455472B1 (en) Apparatus and method for evaluating self-introduction based on natural language processing
CN114626529A (en) Natural language reasoning fine-tuning method, system, device and storage medium
CN113743095A (en) Chinese problem generation unified pre-training method based on word lattice and relative position embedding
CN114996424B (en) Weak supervision cross-domain question-answer pair generation method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination