CN114925703A - Visual question-answering method and system with multi-granularity text representation and image-text fusion - Google Patents

Visual question-answering method and system with multi-granularity text representation and image-text fusion

Info

Publication number
CN114925703A
Authority
CN
China
Prior art keywords
text
features
fusion
picture
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210667045.9A
Other languages
Chinese (zh)
Inventor
王新刚
刘小玉
李晓敏
成洪路
刘广政
周金岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology filed Critical Qilu University of Technology
Priority to CN202210667045.9A priority Critical patent/CN114925703A/en
Publication of CN114925703A publication Critical patent/CN114925703A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • G06F40/35Discourse or dialogue representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a visual question-answering method and system with multi-granularity text representation and image-text fusion, comprising: obtaining a picture and the question text corresponding to the picture, and extracting picture features; extracting sentence information at different levels from the question text through hierarchical dilated convolution to form text features; and, after vector concatenation of the obtained picture features and text features, fusing the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector, converting the adaptive fusion feature vector to the candidate-answer dimension through a linear transformation, and obtaining the answer prediction through a prediction function. Multi-level text information description is performed in the text feature representation to retain the multi-level features of the text, the high-order feature vectors of the different modalities are fused in an image-text adaptive fusion manner, the topic and meaning of the question are expressed accurately from multiple levels, and after fusion the attention weights over the image and text can be calculated dynamically so as to better predict the answer.

Description

Multi-granularity text representation and image-text fusion visual question-answering method and system
Technical Field
The invention relates to the technical field of visual question answering, in particular to a visual question answering method and system with multi-granularity text representation and image-text fusion.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
In Visual Question Answering (VQA), a picture and a text question related to that picture are input to a computer, and the computer obtains the answer to the question by understanding and reasoning over the information of the two modalities, namely the picture and the text.
In existing related technologies and methods, the computer extracts text features and picture features and then fuses them to obtain the answer to the question. On the one hand, most text-feature representations are extracted at a coarse granularity, so the extracted information is limited, text information is lost, and the deep semantics of the text cannot be fully exploited; on the other hand, most image-text multi-modal fusion schemes perform feature-level fusion based on vector concatenation, which can neither account for the differences between modalities nor automatically adjust the weight of each modality's features during fusion.
Disclosure of Invention
In order to solve the technical problems existing in the background art, the invention provides a visual question-answering method and system with multi-granularity text representation and image-text fusion. Hierarchical dilated convolution is introduced into the text feature representation, and describing the text information at multiple levels in this way better retains the multi-level features of the text. An image-text adaptive fusion scheme, in which a Transformer replaces the traditional concatenation, fuses the high-order feature vectors of the different modalities; an auto-fusion mechanism dynamically calculates the attention weights of the image and text feature vectors, which represent the importance of the information from each modality, so that image and text elements are focused on correctly, answers are predicted better, and the accuracy of visual question answering is improved.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a visual question-answering method with multi-granularity text representation and image-text fusion, which comprises the following steps:
obtaining a picture and the question text corresponding to the picture, and extracting picture features;
extracting sentence information at different levels from the question text through hierarchical dilated convolution to form text features;
and after vector concatenation of the obtained picture features and text features, fusing the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector, converting the adaptive fusion feature vector to the candidate-answer dimension through a linear transformation, and obtaining the answer prediction through a prediction function.
Extracting sentence information at different levels from the question text through hierarchical dilated convolution to form text features specifically comprises:
acquiring multi-granularity semantic features of the question text with a hierarchical dilated convolutional network;
setting gradually increasing dilation rates r = 1, 2, 3, ..., n and stacking the dilated convolutions in layers, so that the length of the convolved text segment grows exponentially and covers the semantic features of different n-grams;
saving the output of each stacked layer l as a feature map of the text at a particular granularity level:
d_l = LayerNorm(DConv_{r_l}(d_{l-1})), d_l ∈ R^{N×fs}, l = 1, ..., L
where, given a sentence sequence d = [x_1, x_2, ..., x_N], the sentence sequence d is converted into the matrix d_0 = [X_1, X_2, ..., X_N]; fs denotes the number of filters per layer; with L layers, the multi-granularity question text is defined as [d_0, d_1, ..., d_L]. The hierarchical dilated convolutional network gradually acquires lexical and semantic features starting from the word and phrase level with small dilation rates.
After vector concatenation of the obtained picture features and text features, fusing the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector specifically comprises:
passing the obtained picture features and text features into a co-attention network, and obtaining updated text and picture features by learning the relationships among the modalities;
concatenating the updated text and picture features, fusing the concatenated vector with the high-order vectors of the different modalities to obtain the adaptive fusion feature vector, and reconstructing the original concatenation vector from the automatically fused latent vector;
minimizing the Euclidean metric between the original vector and the reconstructed vector, which ensures that the learned self-fused vector retains the signal contained in the input concatenated latent vectors.
The co-attention network comprises at least one group of self-attention units and guided-attention units connected together; the self-attention unit comprises a multi-head attention layer and a point-wise feed-forward layer connected together, and is used for learning the relationships among all samples within the same modality; the guided-attention unit has the same structure as the self-attention unit, and one modality is used to guide another modality so as to represent the feature relationships between different modalities.
A second aspect of the present invention provides a system for implementing the above method, comprising:
a picture feature extraction module configured to: obtain a picture and the question text corresponding to the picture, and extract picture features; and, according to the question text corresponding to the picture, extract sentence information at different levels from the question text through hierarchical dilated convolution to form text features;
a fusion prediction module configured to: after vector concatenation of the obtained picture features and text features, fuse the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector, convert the adaptive fusion feature vector to the candidate-answer dimension through a linear transformation, and obtain the answer prediction through a prediction function.
A third aspect of the invention provides a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, performs the steps in the multi-granularity text representation and image-text fusion visual question-answering method as described above.
A fourth aspect of the invention provides a computer apparatus.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the multi-granularity text representation and image-text fusion visual question-answering method as described above.
Compared with the prior art, the above one or more technical schemes have the following beneficial effects:
1. Multi-level text information description is performed in the text feature representation, so the multi-level features of the text are better retained; an image-text adaptive fusion scheme replaces the traditional concatenation so that the high-order feature vectors of the different modalities are fused; the topic and meaning of the question can be expressed accurately from the levels of words, phrases and sentences; and the fused high-order feature vector allows the attention weights over the image and text to be calculated dynamically, so that image and text elements are focused on correctly and the answer is predicted better.
2. An auto-fusion mechanism dynamically calculates the attention weights of the image and text feature vectors, which represent the importance of the information from each modality, so that image and text elements are focused on correctly, the answer is predicted better, and the accuracy of visual question answering is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention without limiting it.
FIG. 1 is a schematic overall flow chart of a visual question-answering method provided by one or more embodiments of the present invention;
FIG. 2 is a schematic diagram of the hierarchical dilated convolutional network structure for multi-granularity text information representation provided by one or more embodiments of the present invention;
FIGS. 3(a) - (c) are schematic diagrams of a Self-Attention network structure, a Guided-Attention network structure, and a Collaborative Attention module network structure according to one or more embodiments of the present invention;
fig. 4 is a schematic diagram of the image-text adaptive fusion network structure according to one or more embodiments of the present invention;
fig. 5 is a graph illustrating the results of an ablation experiment on the MGMCAN model using the VQA 2.0 dataset, according to one or more embodiments of the present invention;
fig. 6 is a schematic diagram of an overall architecture of a Transformer according to one or more embodiments of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As described in the background art, in the related technologies and methods the computer extracts text features and picture features and then fuses them to obtain the answer to the question. On the one hand, most text-feature representations are extracted at a coarse granularity, so the extracted information is limited, text information is lost, and the deep semantics of the text cannot be fully exploited; on the other hand, most image-text multi-modal fusion schemes perform feature-level fusion based on vector concatenation, which can neither account for the differences between modalities nor automatically adjust the weight of each modality's features during fusion.
Therefore, the following embodiments provide a visual question-answering method and system with multi-granularity text representation and image-text fusion. Hierarchical dilated convolution is introduced into the text feature representation, and describing the text information at multiple levels in this way better retains the multi-level features of the text. An image-text adaptive fusion scheme, in which a Transformer replaces the traditional concatenation, fuses the high-order feature vectors of the different modalities; an auto-fusion mechanism dynamically calculates the attention weights of the image and text feature vectors, which represent the importance of the information from each modality, so that image and text elements are focused on correctly, answers are predicted better, and the accuracy of visual question answering is improved.
Embodiment one:
As shown in figs. 1-5, a visual question-answering method with multi-granularity text representation and image-text fusion includes the following steps:
obtaining a picture and the question text corresponding to the picture, and extracting picture features;
extracting sentence information at different levels from the question text through hierarchical dilated convolution to form text features;
and after vector concatenation of the obtained picture features and text features, fusing the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector, converting the adaptive fusion feature vector to the candidate-answer dimension through a linear transformation, and obtaining the answer prediction through a prediction function.
Specifically, the method comprises the following steps:
1. Visual and text feature representation
As shown in fig. 1, the visual feature representation: target features of the image are detected with Faster R-CNN (with ResNet-101 as the main backbone network) pre-trained on Visual Genome. A confidence threshold is applied to the detected targets, giving a dynamic number of target objects N ∈ [10, 100], where the target features come from the feature map after RoI pooling. For the i-th object, the convolutional features of its detection region are average-pooled to obtain the object feature v_i.
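For illustration, the following minimal sketch (PyTorch) of this region-feature step uses torchvision's off-the-shelf Faster R-CNN as a stand-in for the ResNet-101 detector trained on Visual Genome; the confidence threshold, FPN level and pooling size are assumptions rather than values fixed by the invention.

```python
# Sketch: detect objects, keep confident ones (capped to [10, 100]), and
# average-pool each region's convolutional features into one vector per object.
import torch
import torchvision
from torchvision.ops import roi_align

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def extract_region_features(image, conf_thresh=0.3, min_boxes=10, max_boxes=100):
    # image: float tensor [3, H, W] in [0, 1]; normalization/resizing omitted in this sketch
    with torch.no_grad():
        det = detector([image])[0]                          # dict with boxes, labels, scores
        keep = det["scores"] >= conf_thresh
        boxes = det["boxes"][keep][:max_boxes]
        if boxes.shape[0] < min_boxes:                      # fall back to the top-scoring boxes
            boxes = det["boxes"][:min_boxes]
        feats = detector.backbone(image.unsqueeze(0))["0"]  # FPN level "0" feature map (stride 4)
        pooled = roi_align(feats, [boxes], output_size=7, spatial_scale=1 / 4)
    return pooled.mean(dim=(2, 3))                          # [N, C] object features v_i

# region_feats = extract_region_features(image)
```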
As shown in figs. 1-2, the text feature representation: a hierarchical dilated convolutional network (HECN) is used to learn representations of the question features from multiple semantic units. A sequence of dilation rates [r_1, ..., r_i, ..., r_n] is designed for the hierarchically stacked dilated convolutions so as to fully cover the receptive field: small dilation rates extract word and phrase information, and large dilation rates extract sentence information. In this way wider contextual information is obtained and the utilization of the information is improved without enlarging the convolution kernel.
a: given a sentence sequence d = [x_1, x_2, ..., x_N];
b: the hierarchical dilation encoder converts the sentence sequence d into the matrix d_0 = [X_1, X_2, ..., X_N];
c: the multi-granularity semantic features of the question text are captured with hierarchical dilated convolution, which obtains a wider receptive field by skipping r input elements at a time.
First, setting r = 1 (standard convolution) in the first layer ensures full coverage of the input text information and avoids partial loss of information.
Second, the length of the convolved text segments grows exponentially by stacking dilated convolutions with larger dilation rates (r = 2, r = 3), so that a few layers (set to 3 here) and a modest number of parameters cover the semantic features of different n-grams.
In addition, to prevent vanishing or exploding gradients, layer normalization is applied at the end of each convolutional layer. Since irrelevant information may be introduced into the semantic units, the multi-level dilation rates are chosen based on validation performance. The output of each stacked layer l is saved as a feature map of the text at a particular granularity level:
d_l = LayerNorm(DConv_{r_l}(d_{l-1})), d_l ∈ R^{N×fs}, l = 1, ..., L
where fs denotes the number of filters per layer. With L layers, the multi-granularity question text is defined as [d_0, d_1, ..., d_L]. In this way, the hierarchical dilated convolutional network (HECN) acquires lexical and semantic features step by step from the word and phrase level with small dilation rates, and captures long-range dependencies from the sentence level with large dilation rates. At the same time, the computation path is greatly shortened, and the negative impact of information loss caused by down-sampling methods such as max pooling is reduced.
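A minimal PyTorch sketch of the hierarchical dilated convolution described above: stacked 1-D convolutions with dilation rates 1, 2 and 3, each followed by layer normalization, whose per-layer outputs are kept as the granularity levels d_0, ..., d_L. The embedding size, filter count fs and kernel size are assumptions.

```python
import torch.nn as nn

class HierarchicalDilatedEncoder(nn.Module):
    def __init__(self, emb_dim=300, fs=512, kernel_size=3, dilations=(1, 2, 3)):
        super().__init__()
        convs, in_ch = [], emb_dim
        for r in dilations:
            pad = r * (kernel_size - 1) // 2              # keeps the sequence length N
            convs.append(nn.Sequential(
                nn.Conv1d(in_ch, fs, kernel_size, padding=pad, dilation=r),
                nn.ReLU(),
            ))
            in_ch = fs
        self.convs = nn.ModuleList(convs)
        self.norm = nn.LayerNorm(fs)

    def forward(self, d0):                                # d0: [B, N, emb_dim] word embeddings
        outputs, x = [d0], d0                             # d_0
        for conv in self.convs:
            x = conv(x.transpose(1, 2)).transpose(1, 2)   # Conv1d expects [B, C, N]
            x = self.norm(x)                              # layer norm after each conv layer
            outputs.append(x)                             # d_l: one granularity level
        return outputs                                    # [d_0, d_1, ..., d_L]

# tokens = embedding(question_ids)                 # [B, N, 300]
# d_levels = HierarchicalDilatedEncoder()(tokens)
```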
2. Co-attention module
As shown in fig. 3, the extracted text and visual vectors are passed into a co-attention network, and the relationships among the modalities are learned through the co-attention module to obtain updated text and visual features.
The basic component of the co-attention network, the Modular Co-Attention (MCA) layer, is composed of two basic attention units: the self-attention unit (SA) shown in fig. 3(a) and the guided-attention unit (GA) shown in fig. 3(b). The SA unit consists of a multi-head attention layer and a point-wise feed-forward layer, and is used to learn the relationships among all samples within the same modality. The GA unit has a structure similar to the SA unit, and one modality is used to guide another modality so as to represent the feature relationships between different modalities.
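A sketch of the two attention units under stated assumptions (hidden size, head count and feed-forward size are illustrative): the SA unit attends within one modality, while the GA unit lets one modality guide the other.

```python
import torch.nn as nn

class SAUnit(nn.Module):
    def __init__(self, dim=512, heads=8, ffn=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn), nn.ReLU(), nn.Linear(ffn, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])         # relations within one modality
        return self.norm2(x + self.ffn(x))                # point-wise feed-forward sub-layer

class GAUnit(SAUnit):
    def forward(self, x, y):                              # y (e.g. text) guides x (e.g. image)
        x = self.norm1(x + self.attn(x, y, y)[0])         # cross-modal (guided) attention
        return self.norm2(x + self.ffn(x))

# text  = SAUnit()(text_feats)                # updated question features
# image = GAUnit()(image_feats, text)         # question-guided image features
```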
3. Image-text adaptive fusion
As shown in fig. 4, the vectors of the image and text features are first concatenated, and the concatenated vector is then fused with the high-order vectors of the different modalities through a Transformer (a deep-learning model based on the self-attention mechanism) layer to obtain the adaptive fusion feature vector. The original concatenation vector is then reconstructed from the automatically fused latent vector. Finally, the Euclidean metric between the original vector and the reconstructed vector is minimized. This process ensures that the learned self-fused vector retains the signal contained in the input concatenated latent vector, and the correlation between the self-fused and concatenated latent vectors is also increased.
The image-text adaptive fusion network needs to preserve as much information from each modality as possible. Given n multi-modal vectors of dimension d, z_1, ..., z_n ∈ R^d:
First, they are concatenated to obtain the concatenation vector z = [z_1; z_2; ...; z_n] ∈ R^{nd}.
Second, a transformation τ is applied to z to reduce its dimensionality to t, giving the fused latent vector z_t = τ(z) ∈ R^t. Then z_t is used to reconstruct the original concatenation vector, giving ẑ ∈ R^{nd}. Finally the loss J_tr is calculated, i.e., the loss between z and ẑ.
The auto-fusion network employs a mean-squared-error (MSE) loss function, consistent with the motivation of compressing the multi-modal features so as to filter out less useful signals. For auto-fusion, the intermediate variable z_t is taken as the fused multi-modal representation. The MSE loss of the auto-fusion network is expressed as follows:
J_tr = MSE(z, ẑ) = ‖z − ẑ‖²
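A minimal sketch of this fusion step under stated assumptions (the feature size d, latent size t and the single Transformer encoder layer are illustrative): the text and image vectors are fused by self-attention, compressed to z_t, and the concatenation is reconstructed from z_t with the MSE loss J_tr.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, d=512, n_modalities=2, t=512):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.compress = nn.Linear(n_modalities * d, t)        # tau: R^{nd} -> R^t
        self.reconstruct = nn.Linear(t, n_modalities * d)     # z_t -> reconstructed z
        self.mse = nn.MSELoss()

    def forward(self, text_vec, img_vec):                     # [B, d] each
        tokens = torch.stack([text_vec, img_vec], dim=1)      # [B, 2, d]
        fused = self.encoder(tokens)                          # Transformer fusion of modalities
        z = fused.flatten(1)                                  # concatenation vector z in R^{nd}
        z_t = self.compress(z)                                # adaptive fusion feature vector
        z_hat = self.reconstruct(z_t)                         # reconstructed concatenation
        j_tr = self.mse(z_hat, z)                             # reconstruction loss J_tr
        return z_t, j_tr

# z_t, j_tr = AdaptiveFusion()(text_vec, img_vec)
# total_loss = bce_loss + j_tr        # J_tr is added to the answer-classification loss
```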
The overall architecture of the Transformer in this embodiment is shown in fig. 6 and can be divided into four modules (a minimal assembly is sketched after this list): an input module, an encoder module, a decoder module and an output module.
An input module: a source-text embedding layer with its positional encoder, and a target-text embedding layer with its positional encoder.
An encoder module: formed by stacking N encoder layers; each encoder layer consists of two sub-layer connection structures; the first sub-layer connection structure comprises a multi-head self-attention sub-layer, a normalization layer and a residual connection; the second sub-layer connection structure comprises a feed-forward fully connected sub-layer, a normalization layer and a residual connection.
A decoder module: formed by stacking N decoder layers; each decoder layer consists of three sub-layer connection structures; the first sub-layer connection structure comprises a multi-head self-attention sub-layer, a normalization layer and a residual connection; the second sub-layer connection structure comprises a multi-head attention sub-layer, a normalization layer and a residual connection; the third sub-layer connection structure comprises a feed-forward fully connected sub-layer, a normalization layer and a residual connection.
An output module: a linear layer followed by softmax normalization.
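The following sketch assembles the four modules from standard PyTorch building blocks; the depth, width, head count and vocabulary sizes are assumptions.

```python
import torch.nn as nn

d_model, n_layers, vocab = 512, 6, 30000

transformer = nn.ModuleDict({
    # input module: embeddings plus (learned) positional encodings
    "src_embed": nn.Embedding(vocab, d_model),
    "tgt_embed": nn.Embedding(vocab, d_model),
    "pos_embed": nn.Embedding(512, d_model),
    # encoder module: N stacked layers of self-attention + feed-forward sub-layers
    "encoder": nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), n_layers),
    # decoder module: N stacked layers of self-attention, cross-attention, feed-forward
    "decoder": nn.TransformerDecoder(
        nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), n_layers),
    # output module: linear projection followed by softmax over the target vocabulary
    "generator": nn.Sequential(nn.Linear(d_model, vocab), nn.LogSoftmax(dim=-1)),
})
```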
4. Visual question-answer prediction
After a linear transformation of the fused features, the dimensionality is converted to the candidate-answer dimension, and the answer prediction is obtained through a Sigmoid function. The classification problem is trained with binary cross-entropy (BCE) as the loss function, as in the formula:
L_BCE = −(1/N) Σ_{i=1}^{N} [ a_i·log(a_i') + (1 − a_i)·log(1 − a_i') ]
where N is the number of classes in the multi-class setting, a_i' is the predicted value for the i-th class, and a_i is the label value of the i-th class.
The Sigmoid function is used for the binary decision problem; it is an S-shaped function that maps a variable to a value between 0 and 1. The formula of the Sigmoid function is as follows:
σ(x) = 1 / (1 + e^(−x))
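A minimal sketch of the prediction head under stated assumptions (the fused feature size and the 3129-answer candidate set, common for VQA 2.0, are illustrative): a linear projection to the candidate-answer dimension, a sigmoid for the answer scores, and BCE for training.

```python
import torch
import torch.nn as nn

num_answers = 3129
classifier = nn.Linear(512, num_answers)          # fused feature -> answer logits
bce = nn.BCEWithLogitsLoss()                      # sigmoid + BCE in one numerically stable op

def predict_and_loss(fused_vec, answer_targets):
    logits = classifier(fused_vec)                # [B, num_answers]
    probs = torch.sigmoid(logits)                 # predicted scores a_i'
    loss = bce(logits, answer_targets)            # targets a_i are soft labels in [0, 1]
    return probs, loss
```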
The method can effectively extract the multi-level features of the text, attends to the important parts of the text information through hierarchical dilated convolution, and can accurately express the topic and meaning of the question from the levels of words, phrases and sentences. A Transformer layer is used, innovatively, to fuse the high-order vectors of the different modalities during feature fusion; this strategy allows the attention weights over the image and text to be calculated dynamically and image and text elements to be focused on correctly, so as to better predict the answer.
The evaluation metrics of the improved model are clearly improved, as follows:
Dataset. The VQA 2.0 dataset is the reference dataset commonly used for the VQA task; it is composed of natural images from MSCOCO with corresponding manually annotated questions and answers. Each picture corresponds to 3 questions, and each question corresponds to 10 answers. The dataset is partitioned as follows: the training set contains 80K images and 444K question-answer pairs; the validation set contains 40K images and 214K question-answer pairs; the test set contains 80K images and 448K questions. The test set includes 2 test subsets, test-dev and test-standard, for evaluating model performance online. The question types in the dataset are yes/no questions, object-counting questions, and other questions.
Performance evaluation.
Table 1 compares the accuracy of the method of this embodiment with current mainstream visual question-answering models on the VQA 2.0 test set.
TABLE 1. Comparison of accuracy rates (the table is reproduced as an image in the original publication)
The model of this embodiment was evaluated on the VQA 2.0 dataset and compared with other state-of-the-art methods. Table 1 shows the results of the online evaluation experiments on Test-dev and Test-std.
As can be seen from Table 1, first, compared with the classical attention-based methods BUTD and MFH, the method of this embodiment improves the Overall accuracy on the VQA 2.0 Test-dev split by 5.53%. The reason is that those models consider neither multi-granularity modeling nor the co-attention relationship between text and pictures during text modeling.
Second, unlike conventional methods, this embodiment proposes the HECN framework for multi-granularity text feature extraction, which extracts text information at multiple granularities through a three-layer hierarchy to ensure the completeness of the information. Compared with the recently proposed DFAF and MCAN, the method of this embodiment achieves better performance, and an image-text adaptive fusion method, rather than a simple concatenation fusion strategy, is used for the multi-modal information fusion. This strategy allows the attention weights over the image and text to be calculated dynamically and image and text elements to be focused on correctly, so as to better predict the answer.
Finally, the overall accuracy of this embodiment is significantly improved compared with the BAN method, although BAN performs better on the counting class (i.e., the Number type) owing to its dedicated object-counting module.
Ablation experiments.
To analyze the contribution and effect of each part of the model, extensive ablation experiments were performed on the proposed model on the VQA 2.0 validation set, evaluating the effect of each module and demonstrating the effectiveness of each part. The ablation models are as follows:
BASE (LSTM + Co-Att + sum): question encoding uses a long short-term memory network (LSTM), the two modalities are modeled with co-attention, and the multi-modal fusion uses simple additive concatenation.
BASE + MGT (MGT + Co-Att + sum): question encoding uses multi-granularity text information extraction based on hierarchical dilated convolution.
BASE + ITAF (LSTM + Co-Att + ITAF): the feature fusion uses the image-text adaptive fusion scheme.
BASE + MGT + ITAF (MGT + Co-Att + ITAF): question encoding uses the multi-granularity text feature extraction method, and the feature fusion uses the image-text adaptive fusion scheme.
The results are shown in table 2:
Table 2. Comparison of results of ablation experiments
Model Overall Yes/No Number Other
BASE 81.22 95.69 67.26 73.9
BASE+MGT 82.17 96.25 70.38 74.65
BASE+ITAF 82.96 96.45 70.42 75.02
BASE+MGT+ITAF(OURS) 83.23 96.49 70.41 75.54
First, adding the MGT module improves the performance of visual question answering, which shows that the multi-granularity text feature extraction method can extract more useful information from the question text, so that the question vector is composed of multi-level, multi-perspective information. Second, the ITAF module also clearly boosts the performance of the model, which verifies that the image-text adaptive fusion technique can dynamically weight each modality during fusion and can effectively improve the information interaction.
Embodiment two:
This embodiment provides a system for implementing the above method, which includes:
a picture feature extraction module configured to: obtain a picture and the question text corresponding to the picture, and extract picture features;
a text feature extraction module configured to: according to the question text corresponding to the picture, extract sentence information at different levels from the question text through hierarchical dilated convolution to form text features;
a fusion prediction module configured to: after vector concatenation of the obtained picture features and text features, fuse the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector, convert the adaptive fusion feature vector to the candidate-answer dimension through a linear transformation, and obtain the answer prediction through a prediction function.
The method executed by this system can effectively extract the multi-level features of the text, attends to the important parts of the text information through hierarchical dilated convolution, and can accurately express the topic and meaning of the question from the levels of words, phrases and sentences. A Transformer layer is used, innovatively, to fuse the high-order vectors of the different modalities during feature fusion, so that the attention weights over the image and text can be calculated dynamically and image and text elements can be focused on correctly, so as to better predict the answer.
Embodiment three:
This embodiment provides a computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the steps in the multi-granularity text representation and image-text fusion visual question-answering method described in embodiment one.
The multi-granularity text representation and image-text fusion visual question-answering method executed by the computer program of this embodiment can effectively extract the multi-level features of the text, attends to the important parts of the text information through hierarchical dilated convolution, and can accurately express the topic and meaning of the question from the levels of words, phrases and sentences. A Transformer layer is used, innovatively, to fuse the high-order vectors of the different modalities during feature fusion; this strategy allows the attention weights over the image and text to be calculated dynamically and image and text elements to be focused on correctly, so as to better predict the answer.
Embodiment four:
This embodiment provides a computer device, which includes a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the multi-granularity text representation and image-text fusion visual question-answering method described in embodiment one.
In the multi-granularity text representation and image-text fusion visual question-answering method executed by the processor, the multi-level features of the text can be effectively extracted, the important parts of the text information are attended to through hierarchical dilated convolution, and the topic and meaning of the question can be accurately expressed from the levels of words, phrases and sentences. A Transformer layer is used, innovatively, to fuse the high-order vectors of the different modalities during feature fusion; this strategy allows the attention weights over the image and text to be calculated dynamically and image and text elements to be focused on correctly, so as to better predict the answer.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory or a random access memory.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A visual question-answering method with multi-granularity text representation and image-text fusion, characterized in that the method comprises the following steps:
obtaining a picture and the question text corresponding to the picture, and extracting picture features;
extracting sentence information at different levels from the question text through hierarchical dilated convolution to form text features;
and after vector concatenation of the obtained picture features and text features, fusing the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector, converting the adaptive fusion feature vector to the candidate-answer dimension through a linear transformation, and obtaining the answer prediction through a prediction function.
2. The multi-granularity text representation and image-text fusion visual question-answering method of claim 1, wherein extracting sentence information at different levels from the question text through hierarchical dilated convolution to form text features specifically comprises:
acquiring multi-granularity semantic features of the question text with a hierarchical dilated convolutional network;
setting gradually increasing dilation rates r = 1, 2, 3, ..., n and stacking the dilated convolutions in layers, so that the length of the convolved text segment grows exponentially and covers the semantic features of different n-grams;
saving the output of each stacked layer l as a feature map of the text at a particular granularity level:
d_l = LayerNorm(DConv_{r_l}(d_{l-1})), d_l ∈ R^{N×fs}, l = 1, ..., L
where, given a sentence sequence d = [x_1, x_2, ..., x_N], the sentence sequence d is converted into the matrix d_0 = [X_1, X_2, ..., X_N]; fs denotes the number of filters per layer; with L layers, the multi-granularity question text is defined as [d_0, d_1, ..., d_L]; the hierarchical dilated convolutional network gradually acquires lexical and semantic features starting from the word and phrase level with small dilation rates.
3. The multi-granularity text representation and image-text fusion visual question-answering method of claim 1, wherein, after vector concatenation of the obtained picture features and text features, fusing the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector comprises:
passing the obtained picture features and text features into a co-attention network, and obtaining updated text and picture features by learning the relationships among the modalities.
4. The multi-granularity text representation and image-text fusion visual question-answering method of claim 3, wherein, after vector concatenation of the obtained picture features and text features, fusing the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector further comprises:
concatenating the updated text and picture features, fusing the concatenated vector with the high-order vectors of the different modalities to obtain the adaptive fusion feature vector, and reconstructing the original concatenation vector from the automatically fused latent vector.
5. The multi-granularity text representation and image-text fusion visual question-answering method of claim 4, wherein, after vector concatenation of the obtained picture features and text features, fusing the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector further comprises:
minimizing the Euclidean metric between the original vector and the reconstructed vector, which ensures that the learned self-fused vector retains the signal contained in the input concatenated latent vectors.
6. The multi-granularity text representation and image-text fusion visual question-answering method of claim 3, wherein the co-attention network includes at least one group of self-attention units and guided-attention units connected together.
7. The multi-granularity text representation and image-text fusion visual question-answering method of claim 6, wherein the self-attention unit comprises a multi-head attention layer and a point-wise feed-forward layer connected together, and is used for learning the relationships among all samples within the same modality; the guided-attention unit has the same structure as the self-attention unit, and one modality is used to guide another modality so as to represent the feature relationships between different modalities.
8. A multi-granularity text representation and image-text fusion visual question-answering system, characterized in that it comprises:
a feature extraction module configured to: obtain a picture and the question text corresponding to the picture, and extract picture features; and, according to the question text corresponding to the picture, extract sentence information at different levels from the question text through hierarchical dilated convolution to form text features;
a fusion prediction module configured to: after vector concatenation of the obtained picture features and text features, fuse the high-order features of the different modalities through a Transformer layer to obtain an adaptive fusion feature vector, convert the adaptive fusion feature vector to the candidate-answer dimension through a linear transformation, and obtain the answer prediction through a prediction function.
9. A computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, performs the steps in the multi-granularity text representation and image-text fusion visual question-answering method according to any one of claims 1-7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps in the multi-granularity text representation and image-text fusion visual question-answering method according to any one of claims 1-7.
CN202210667045.9A 2022-06-14 2022-06-14 Visual question-answering method and system with multi-granularity text representation and image-text fusion Pending CN114925703A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210667045.9A CN114925703A (en) 2022-06-14 2022-06-14 Visual question-answering method and system with multi-granularity text representation and image-text fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210667045.9A CN114925703A (en) 2022-06-14 2022-06-14 Visual question-answering method and system with multi-granularity text representation and image-text fusion

Publications (1)

Publication Number Publication Date
CN114925703A true CN114925703A (en) 2022-08-19

Family

ID=82814363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210667045.9A Pending CN114925703A (en) 2022-06-14 2022-06-14 Visual question-answering method and system with multi-granularity text representation and image-text fusion

Country Status (1)

Country Link
CN (1) CN114925703A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116052171A (en) * 2023-03-31 2023-05-02 国网数字科技控股有限公司 Electronic evidence correlation calibration method, device, equipment and storage medium
WO2024046038A1 (en) * 2022-08-29 2024-03-07 京东方科技集团股份有限公司 Video question-answer method, device and system, and storage medium


Similar Documents

Publication Publication Date Title
CN110929515B (en) Reading understanding method and system based on cooperative attention and adaptive adjustment
CN110321419B (en) Question-answer matching method integrating depth representation and interaction model
CN108549658B (en) Deep learning video question-answering method and system based on attention mechanism on syntax analysis tree
CN112685597B (en) Weak supervision video clip retrieval method and system based on erasure mechanism
CN110110337B (en) Translation model training method, medium, device and computing equipment
CN111597830A (en) Multi-modal machine learning-based translation method, device, equipment and storage medium
CN114925703A (en) Visual question-answering method and system with multi-granularity text representation and image-text fusion
CN111985239A (en) Entity identification method and device, electronic equipment and storage medium
CN110633359B (en) Sentence equivalence judgment method and device
CN113065358B (en) Text-to-semantic matching method based on multi-granularity alignment for bank consultation service
CN111813913B (en) Two-stage problem generating system with problem as guide
CN116304984A (en) Multi-modal intention recognition method and system based on contrast learning
CN112200664A (en) Repayment prediction method based on ERNIE model and DCNN model
KR20200116760A (en) Methods and apparatuses for embedding word considering contextual and morphosyntactic information
CN110633473B (en) Implicit discourse relation identification method and system based on conditional random field
CN115861995A (en) Visual question-answering method and device, electronic equipment and storage medium
CN114880307A (en) Structured modeling method for knowledge in open education field
CN116681061A (en) English grammar correction technology based on multitask learning and attention mechanism
CN113704466B (en) Text multi-label classification method and device based on iterative network and electronic equipment
CN115357712A (en) Aspect level emotion analysis method and device, electronic equipment and storage medium
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
KR102455472B1 (en) Apparatus and method for evaluating self-introduction based on natural language processing
CN114626529A (en) Natural language reasoning fine-tuning method, system, device and storage medium
CN113743095A (en) Chinese problem generation unified pre-training method based on word lattice and relative position embedding
CN114996424B (en) Weak supervision cross-domain question-answer pair generation method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination