CN114647752A - Lightweight visual question-answering method based on bidirectional separable deep self-attention network

Info

Publication number: CN114647752A
Application number: CN202210369535.0A
Authority: CN (China)
Prior art keywords: attention, self, network, model, layer
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: 余宙 (Yu Zhou), 金子添 (Jin Zitian), 俞俊 (Yu Jun)
Current and original assignee: Hangzhou Dianzi University (the listed assignee may be inaccurate)
Application filed by Hangzhou Dianzi University
Priority to CN202210369535.0A

Classifications

    • G06F 16/532: Query formulation, e.g. graphical querying (information retrieval of still image data)
    • G06F 16/535: Filtering based on additional data, e.g. user or group profiles (information retrieval of still image data)
    • G06F 16/90332: Natural language query formulation or dialogue systems
    • G06F 16/9035: Filtering based on additional data, e.g. user or group profiles
    • G06F 40/30: Semantic analysis (handling natural language data)
    • G06N 3/045: Combinations of networks (neural network architectures)
    • G06N 3/08: Learning methods (neural networks)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lightweight visual question-answering method based on a bidirectional separable deep self-attention network, and provides the bidirectional separable deep self-attention network itself. The bidirectional separable deep self-attention model can dynamically select a suitable sub-model to predict the answer according to the current computing resources, achieving a balance between accuracy and latency, so that the accuracy of the predicted answer is guaranteed while the user enjoys a good experience.

Description

Lightweight visual question-answering method based on bidirectional separable deep self-attention network
Technical Field
The invention belongs to the field of visual question answering, and in particular relates to a lightweight Visual Question Answering method based on a bidirectional separable deep self-attention network.
Background
The visual question-answering task is a popular research problem in multi-modal learning. It lies at the intersection of computer vision and natural language processing: given an image and a free-form question related to that image, a visual question-answering model must output the corresponding predicted answer. Unlike single-modality tasks, a multi-modal task such as visual question answering requires not only a correct understanding of the information in each modality, but also an understanding of the information that links the modalities, which is generally more complex and difficult. The task has a wide range of real-life applications: it can help visually impaired users access online image content more conveniently; it can promote better human-computer interaction systems and enhance user experience; and it can improve a machine's comprehension of images, strengthening image retrieval.
The deep self-attention network was originally proposed for the machine translation task in natural language processing. Its core architecture is a stack of self-attention layers, each of which builds complex, dense interactions among the input features. After achieving state-of-the-art results on machine translation, the deep self-attention network quickly attracted the attention of researchers in artificial intelligence and was applied across its sub-fields, including visual question answering. Because the deep self-attention network learns the interaction between visual and textual features particularly well, this architecture has become the mainstream network structure in the visual question-answering field. However, alongside the performance gains, its computational complexity places new demands on computing resources and storage space, which creates a serious problem: deploying such models for mobile devices depends on GPU cloud servers, making it difficult to directly exploit the limited computing resources of the mobile terminal, wasting resources and energy. At present there is still no lightweight model in the visual question-answering field, which challenges model deployment and prevents users from enjoying the convenience brought by artificial intelligence applications.
To meet the new challenge of deploying deep learning models, a number of model compression methods have emerged. In single-modality fields such as computer vision and natural language processing, compression methods based on weight sharing, knowledge distillation, pruning, and quantization have gradually appeared; they compress a model by a certain ratio to balance computation against accuracy. However, these methods usually compress at a fixed ratio and can only produce a lightweight model of one fixed size. Mobile devices today are diverse, computing performance differs greatly between devices, and even the same device provides different computing resources under different loads and battery levels. If a separate lightweight model is designed for each device or load condition, the training overhead grows in proportion to the number of models, and a single device must store several models to cope with its scenarios, so the storage overhead is also large.
Recently, slimmable and separable neural networks have offered a new idea: use a single model to cope with all scenarios. When computing resources are sufficient, most of the model is used for forward propagation and prediction to obtain higher accuracy; when computing resources are limited, only a small fraction of the model's parameters is used for prediction, sacrificing a little accuracy for inference speed. If this idea can be exploited, and an efficient and reasonable slicing and training strategy can be designed for the deep self-attention network, the mainstream model structure in visual question answering, it would be a new contribution to bringing visual question-answering models into practice.
In view of the above, how to design an efficient, separable deep self-attention network and apply it to visual question answering is a subject worthy of intensive research. This patent starts from several key points of the task, addresses the difficulties of existing methods, and forms a complete and efficient lightweight visual question-answering method.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a lightweight visual question-answering method based on a bidirectional separable deep self-attention network. The invention makes two main contributions:
1. By analyzing the internal structure of the deep self-attention network, efficient and reasonable width and depth slicing strategies are designed, and the two are combined into a bidirectional strategy that is separable in both width and depth. For the sub-models produced by the bidirectional strategy, the invention proposes a 'deep-and-narrow' filtering principle and further selects a set of excellent, highly efficient substructures.
2. The bidirectional slicing strategy and the filtering principle are combined with an existing visual question-answering model based on deep self-attention, and an efficient self-distillation training strategy is proposed so that every sub-model is fully trained, finally yielding the bidirectional separable deep self-attention visual question-answering model.
The invention provides a lightweight Visual Question Answering method based on a bidirectional separable deep self-attention network. At its core, the method analyzes the internal structure of the deep self-attention network to derive efficient and reasonable width and depth slicing strategies, and combines these two single-dimension strategies into a bidirectional strategy that is separable in both width and depth. For the substructures produced by the bidirectional strategy, the invention proposes a 'deep-and-narrow' filtering principle that further selects a set of excellent, highly efficient substructures: the principle improves the performance of each substructure, and at deployment time no extra screening is needed, so the sub-models can be put to use directly without post-processing, which keeps the principle simple and easy to apply. In addition, an efficient self-distillation training strategy is provided so that every sub-model is fully trained. The method can be combined with any existing visual question-answering model based on the deep self-attention network: after training, it forms a bidirectional separable deep self-attention network in which every sub-model is capable of the visual question-answering task. When the model is deployed on edge devices with limited resources and large performance fluctuations, the bidirectional separable deep self-attention model dynamically selects a suitable sub-model to predict answers according to the current computing resources, striking a balance between accuracy and latency, so that prediction accuracy is guaranteed while the user keeps a good experience.
The lightweight visual question-answering method based on the bidirectional separable deep self-attention network comprises the following steps:
Step (1): dividing the data set;
Step (2): constructing the visual features of the image;
For a given image, an existing trained object detection network detects the number m of candidate boxes and their positions in the image. The image region of each candidate box is then fed into the same detection network, and the features just before the network's classification layer are extracted as the features of that box. The per-box features are concatenated to form the visual features of the given image. Finally, so that the image feature dimension matches the deep self-attention network, the image features are further processed by a learnable linear transformation and mapped into a D-dimensional space;
Step (3): constructing the semantic features of the question;
For a given question, a trained word-vector model extracts a semantic feature for each word, and the extracted word features are concatenated to form the question's semantic features. Finally, so that the question feature dimension matches the deep self-attention network, the question features are further processed by a learnable linear transformation and mapped into a D-dimensional space;
Step (4): constructing the deep self-attention network;
The deep self-attention network is a stack of self-attention layers, each of which consists of two parts: a multi-head attention module and a feed-forward layer. This network is used to construct both a teacher network for guiding training and the final bidirectional separable deep self-attention network. So that input features can match the dimension of each sub-model in the bidirectional separable network, the network accepts features of dimension D as input and maps them to dimension d through a linear projection.
Step (5): designing the width slicing strategy;
Each self-attention layer in the deep self-attention network consists of several parameter matrices. To accommodate input features of different dimensions, each parameter matrix must be sliced so as to match inputs of different dimensions and output features of the appropriate dimension. For an input feature of dimension d, the width slicing strategy keeps the output dimension equal to d, preserving the original structural proportions of the self-attention layer. Notably, sub-models with input features of different dimensions share the parameter matrices of the self-attention layer: the smaller d is, the smaller the shared portion of the parameters; when d equals the original input dimension D, the parameter matrix is not sliced;
and (6): designing a depth segmentation strategy;
the depth self-attention network is formed by stacking a plurality of self-attention layers, the number of the layers is recorded as L, when the number of the layers L of the sub-model is less than L, the layer L in the depth self-attention network is selected according to a depth segmentation strategy and belongs to the sub-model. A simple and effective depth segmentation strategy is provided, and sub-models can pick out more important self-attention layers as far as possible under different layer number settings, so that the final precision of different sub-models is improved;
and (7): combining two segmentation strategies and designing a filtering principle;
through the design of the steps (5) and (6), each sub-model has a width d and a depth l. Under the same parameter quantity and calculation, the deep and narrow submodels are more efficient and reasonable in structure than the shallow and wide submodels, a deep and narrow filtering principle is provided, a plurality of submodels with a large number of layers and a low width are selected before model training, and the submodels with a small number of layers and a high width are directly discarded. Through the filtering principle, a candidate set of the screened sub-model structure is obtained
Figure BDA0003587525230000061
Step (8): designing the self-distillation training algorithm and training the model;
For the sub-model structure candidate set $\Omega$ obtained in step (7), a self-distillation training strategy is proposed so that every sub-model is fully trained. First, a teacher network is trained using the deep self-attention network of step (4), and the bidirectional separable deep self-attention network is constructed. When training the sub-models of the bidirectional separable network, the image and the question are fed into the teacher network to obtain its prediction vector, called the soft label. At each training step, a sub-model sampling strategy draws sub-models from the candidate set $\Omega$, and the soft label serves as the supervision label for training the sampled sub-models;
Step (9): model deployment and application;
Further, the partitioning of the data set in step (1) is as follows:
The VQA-v2 data set is adopted and divided into 3 subsets: a training set, a validation set, and a test set. The training set is used to train the model, the validation set to verify model convergence locally, and the test set for the final evaluation of model performance.
Further, the construction of the visual features of the image in step (2) is as follows:
For a given image, the number m of candidate boxes and their positions are inferred with an existing trained Faster R-CNN object detection network, and the image region of each candidate box is fed into the Faster R-CNN network to extract its visual feature. The i-th candidate box yields the visual feature $x_i \in \mathbb{R}^{2048}$, and the visual feature of the whole image, $X_{image} \in \mathbb{R}^{m \times 2048}$, is the concatenation of the per-box features:

$X_{image} = [x_1, x_2, \ldots, x_i, \ldots, x_m]$  (formula 1)

Subsequently, a learnable linear transformation $\mathrm{Linear} : \mathbb{R}^{2048} \to \mathbb{R}^{D}$ further processes the image feature $X_{image}$ and maps it into a D-dimensional space, giving the final image visual feature $X_{input} \in \mathbb{R}^{m \times D}$:

$X_{input} = \mathrm{Linear}(X_{image})$  (formula 2)
Further, the construction of the semantic features of the question in step (3) is as follows:
For a given question containing n words, each word is fed into a pre-trained GloVe word-vector model to extract its semantic feature. The j-th word yields the semantic feature $y_j \in \mathbb{R}^{300}$, and the semantic feature of the whole question, $Y_{question} \in \mathbb{R}^{n \times 300}$, is the concatenation of the per-word features:

$Y_{question} = [y_1, y_2, \ldots, y_j, \ldots, y_n]$  (formula 3)

Subsequently, a learnable linear transformation $\mathrm{Linear} : \mathbb{R}^{300} \to \mathbb{R}^{D}$ further processes the question semantic feature $Y_{question}$ and maps it into a D-dimensional space, giving the final question semantic feature $Y_{input} \in \mathbb{R}^{n \times D}$:

$Y_{input} = \mathrm{Linear}(Y_{question})$  (formula 4)
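As an illustrative sketch of formulas 1-4, the two learnable projections can be written in PyTorch as follows. The dimensions (2048-d Faster R-CNN box features, 300-d GloVe vectors, D = 512, m = 36, n = 14) are taken from the embodiment described later; the random tensors merely stand in for real extracted features:

```python
import torch
import torch.nn as nn

D = 512                            # common hidden dimension (embodiment value)
img_proj = nn.Linear(2048, D)      # learnable Linear of formula 2
txt_proj = nn.Linear(300, D)       # learnable Linear of formula 4

X_image = torch.randn(36, 2048)    # m = 36 candidate boxes (placeholder features)
Y_question = torch.randn(14, 300)  # n = 14 words (placeholder GloVe vectors)

X_input = img_proj(X_image)        # (36, 512), formula 2
Y_input = txt_proj(Y_question)     # (14, 512), formula 4
```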
Further, the construction of the deep self-attention network in step (4) is as follows:
The deep self-attention network is a stack of self-attention layers, each of which consists of two parts: a multi-head attention module and a feed-forward layer. This network is used to construct both a teacher network for guiding training and the final bidirectional separable deep self-attention network; the two adopt deep self-attention networks of identical structure. So that input features can match the dimension of each sub-model in the bidirectional separable network, the deep self-attention network accepts the image visual features $X_{input} \in \mathbb{R}^{m \times D}$ and the question semantic features $Y_{input} \in \mathbb{R}^{n \times D}$ as input, and maps the input features to dimension d through a linear projection.
4-1. The multi-head attention module;
For given query features $Q \in \mathbb{R}^{m \times D}$, key features $K \in \mathbb{R}^{n \times D}$, and value features $V \in \mathbb{R}^{n \times D}$, the multi-head attention module uses H parallel attention functions to compute the feature $F_{mha} \in \mathbb{R}^{m \times D}$:

$F_{mha} = \mathrm{MHA}(Q, K, V) = [\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_H] W_0$  (formula 5)

$\mathrm{head}_h = \mathrm{ATT}(Q W_h^Q, K W_h^K, V W_h^V)$  (formula 6)

where $W_h^Q, W_h^K, W_h^V \in \mathbb{R}^{D \times D_H}$ are the projection matrices of the h-th attention head, and $D_H$, the dimension of each attention head, is computed as $D_H = D / H$. In addition, $W_0 \in \mathbb{R}^{D \times D}$ further processes the output features of the multi-head attention function. The attention function ATT is computed as:

$\mathrm{ATT}(q, k, v) = \mathrm{softmax}\!\left(\frac{q k^{T}}{\sqrt{D_H}}\right) v$  (formula 7)
4-2. The feed-forward layer;
The feed-forward layer consists of a two-layer perceptron and applies a non-linear transformation to the output features of the multi-head attention module. For a given feature $F \in \mathbb{R}^{m \times D}$, the output feature $F_{ffn} \in \mathbb{R}^{m \times D}$ is computed as:

$F_{ffn} = \mathrm{FFN}(F) = \sigma(F W_1) W_2$  (formula 8)

where $W_1 \in \mathbb{R}^{D \times D_{ff}}$ and $W_2 \in \mathbb{R}^{D_{ff} \times D}$ are linear projection matrices and $\sigma$ is the non-linear activation.
4-3. The self-attention layer;
Each self-attention layer consists of the multi-head attention module and feed-forward layer described above. For a given input $F_{input}$, the output feature $F_{output}$ is computed as:

$\tilde{F} = \mathrm{LN}(F_{input} + \mathrm{MHA}(F_{input}, F_{input}, F_{input}))$  (formula 9)

$F_{output} = \mathrm{LN}(\tilde{F} + \mathrm{FFN}(\tilde{F}))$  (formula 10)

where LN denotes layer normalization.
4-4. Stacking self-attention layers;
The deep self-attention network is a stack of self-attention layers. Since a self-attention layer does not change the feature dimension, multiple self-attention layers can be connected in series to form the deep self-attention network Model:

$\mathrm{Model} = [\mathrm{Layer}^{(1)}, \mathrm{Layer}^{(2)}, \ldots, \mathrm{Layer}^{(L)}]$  (formula 11)

where L is the number of self-attention layers.
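A minimal PyTorch sketch of one self-attention layer and the stacked network of formulas 5-11 follows. The ReLU activation and the feed-forward width of 4D are common Transformer defaults assumed here for illustration, not values fixed by the patent:

```python
import math
import torch
import torch.nn as nn

class SelfAttentionLayer(nn.Module):
    """One self-attention layer: multi-head attention plus a two-layer
    feed-forward block, each wrapped in a residual connection and layer
    normalization (formulas 5-10)."""

    def __init__(self, dim, num_heads, ffn_dim):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads          # D_H = D / H
        self.w_q = nn.Linear(dim, dim)            # packs all W_h^Q
        self.w_k = nn.Linear(dim, dim)            # packs all W_h^K
        self.w_v = nn.Linear(dim, dim)            # packs all W_h^V
        self.w_0 = nn.Linear(dim, dim)            # output projection W_0
        self.ffn = nn.Sequential(                 # FFN of formula 8
            nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, x):                         # x: (n, dim)
        n, d = x.shape
        split = lambda t: t.view(n, self.num_heads, self.head_dim).transpose(0, 1)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        # ATT(q, k, v) = softmax(q k^T / sqrt(D_H)) v   (formula 7)
        att = (q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)).softmax(-1) @ v
        f_mha = self.w_0(att.transpose(0, 1).reshape(n, d))  # concat heads, apply W_0
        x = self.ln1(x + f_mha)                   # formula 9
        return self.ln2(x + self.ffn(x))          # formula 10

# Model = [Layer^(1), ..., Layer^(L)]  (formula 11), with L = 12, D = 512, H = 8
layers = nn.ModuleList(SelfAttentionLayer(512, 8, 2048) for _ in range(12))
```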
Further, the width slicing strategy in step (5) is as follows:
For the parameter matrices $W^Q, W^K, W^V \in \mathbb{R}^{D \times D}$ in multi-head attention and an input feature of dimension d, the size $D_H$ of each attention head is kept unchanged, while the input-matching dimension and the number of attention heads of the corresponding parameter matrices are changed. The sliced parameter matrices are thus $\hat{W}^Q, \hat{W}^K, \hat{W}^V \in \mathbb{R}^{d \times \hat{H} D_H}$, where $\hat{H} = d / D_H$ denotes the number of attention heads after slicing. The other parameter matrices of the self-attention layer, $W_0$, $W_1$, and $W_2$, follow the same strategy, so the sliced matrices are $\hat{W}_0 \in \mathbb{R}^{d \times d}$, $\hat{W}_1 \in \mathbb{R}^{d \times d_{ff}}$, and $\hat{W}_2 \in \mathbb{R}^{d_{ff} \times d}$, with $d_{ff}$ scaled in the same proportion as d.
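The parameter sharing of the width slicing strategy can be sketched as below: a sub-model of width d runs the shared matrices through their top-left (d, d) slices, so no separate weights are stored per width. Which corner of the matrix is shared is an implementation assumption; the patent only requires that sub-models share the matrices and that $D_H$ stays fixed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, D_H = 512, 64                 # full width and fixed per-head size (H = 8)
w_q = nn.Linear(D, D)            # shared parameter matrix W^Q

def forward_sliced(linear, x, d):
    """Run a shared linear layer at width d using the top-left (d, d)
    slice of its weight, so smaller widths reuse a subset of the same
    parameters; with d == D the full matrix is used unsliced."""
    return F.linear(x, linear.weight[:d, :d], linear.bias[:d])

d = 256                          # sliced width: H_hat = d // D_H = 4 heads
x = torch.randn(36, d)           # input feature of dimension d
q = forward_sliced(w_q, x, d)    # (36, 256)
```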
Further, the depth slicing strategy in step (6) is as follows:
For a deep self-attention network with L layers, denote the layer indices by [1, 2, ..., L]. The invention holds that self-attention layers closer to the input and the output are more important, which means the middle layers are relatively less important; when the number of layers l of a sub-model satisfies l < L, layers are discarded starting from the middle. Concretely, the layer indices are first sorted by importance from high to low; for a sub-model with l layers, the first l entries of the sorted index list are taken and then re-sorted to restore the original layer order, yielding the layer indices of the final l-layer sub-model. This is the depth slicing strategy.
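A sketch of the depth slicing strategy follows. The exact importance ordering among equally distant layers is an assumption made for illustration; the patent only states that layers nearer the input and output matter more, so middle layers are dropped first:

```python
def depth_slice(L, l):
    """Pick the layer indices an l-layer sub-model keeps out of L layers:
    rank layers by their distance to the nearer end (closer = more
    important), keep the top l, then restore the original layer order."""
    by_importance = sorted(range(1, L + 1), key=lambda i: min(i - 1, L - i))
    return sorted(by_importance[:l])

print(depth_slice(12, 4))   # -> [1, 2, 11, 12]: the middle layers are dropped
```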
Further, the combination of the two slicing strategies and the design of the filtering principle in step (7) are as follows:
Given a width-ratio candidate set $\Omega_{width}$ and a depth-ratio candidate set $\Omega_{depth}$, combining the candidate sets of the two dimensions yields a preliminary sub-model structure candidate set $\tilde{\Omega} = \Omega_{width} \times \Omega_{depth}$, in which each sub-model structure is $a(d, l)$ with width d and depth l. To express the 'deep-and-narrow' filtering principle conveniently, a two-dimensional index matrix $I \in \{0, 1\}^{|\Omega_{width}| \times |\Omega_{depth}|}$ is defined to further process the preliminary candidate set $\tilde{\Omega}$: I(d, l) = 1 means the sub-model a(d, l) is selected, and I(d, l) = 0 means it is discarded. The index matrix I is initialized to all ones, and its lower-triangular part (the shallow-and-wide combinations) is set to zero. The finally selected sub-model set $\Omega$ is defined as:

$\Omega = \{\, a(d, l) \mid I(d, l) = 1,\ a(d, l) \in \tilde{\Omega} \,\}$  (formula 12)
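The index matrix of formula 12 can be sketched as follows, using the candidate widths and depths of the embodiment. Reading 'lower triangular' as strictly below the diagonal is an assumption; under it, 10 of the 16 combinations survive:

```python
import numpy as np

widths = [128, 256, 384, 512]      # width candidates, ascending
depths = [2, 4, 8, 12]             # depth candidates, ascending

# I starts as all ones; zeroing the lower triangle discards the
# shallow-and-wide corner (large width, small depth), per formula 12.
I = np.triu(np.ones((len(widths), len(depths)), dtype=int))

omega = [(d, l) for i, d in enumerate(widths)
                for j, l in enumerate(depths) if I[i, j]]
print(omega)   # e.g. (512, 2) is dropped, while (128, 12) is kept
```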
Further, the self-distillation training algorithm in step (8) is as follows:
Define the teacher network constructed from the deep self-attention network as $M_{teacher}$ and the bidirectional separable deep self-attention network as $M_{DST}$. The teacher network $M_{teacher}$ is trained to obtain its parameter weights $\theta$, which are used to initialize the weights $\theta_{DST}$ of the bidirectional separable network $M_{DST}$. The sub-model sampling strategy is as follows: keep the number of sub-models sampled per iteration at k, and set the initial sub-model structure candidate set to $\Omega' = \{a_s, a_l\}$, where $a_s$ denotes the smallest sub-model in $\Omega$ and $a_l$ the largest; then randomly sample k - 2 further sub-models from $\Omega$ and add them to $\Omega'$ as the final sub-model candidate set of this iteration. The input feature x of each iteration is fed into the teacher network $M_{teacher}$ to obtain the soft label $y = M_{teacher}(x)$, whose gradient is frozen. Each sub-model $a \in \Omega'$ is then traversed: the input feature x is fed into the current sub-model to obtain the prediction vector $\hat{y}_a$, and the loss $\mathrm{KD}(\hat{y}_a, y)$ between the sub-model prediction $\hat{y}_a$ and the teacher's soft label y is computed, where KD denotes the distillation loss function. The losses of the different sub-models sampled from $\Omega$ accumulate gradients, and once every sub-model in the candidate set $\Omega'$ generated in this iteration has been traversed, the model weights $\theta_{DST}$ are updated in a single step.
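One training iteration can be sketched as below. The callables `teacher(x)` and `student(x, sub)` (running sub-model `sub` of $M_{DST}$), the KL-divergence form of the KD loss, and plain SGD are illustrative assumptions; the patent specifies only the sampling, the frozen soft labels, the gradient accumulation, and the single weight update:

```python
import random
import torch
import torch.nn.functional as F

def self_distill_step(teacher, student, x, omega, k=4, lr=1e-4):
    """One self-distillation iteration: sample k sub-models (always
    including the smallest and largest), distill each against the frozen
    teacher soft label, accumulate gradients, then update weights once."""
    with torch.no_grad():                       # soft label with frozen gradient
        soft = teacher(x).softmax(dim=-1)

    a_s, a_l = omega[0], omega[-1]              # smallest and largest sub-models
    sampled = [a_s, a_l] + random.sample(omega[1:-1], k - 2)

    student.zero_grad()
    for sub in sampled:                         # losses accumulate gradients
        pred = student(x, sub)
        F.kl_div(pred.log_softmax(dim=-1), soft,
                 reduction="batchmean").backward()
    with torch.no_grad():                       # single unified update
        for p in student.parameters():
            if p.grad is not None:
                p -= lr * p.grad
```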
Further, the model deployment and application in step (9) are as follows:
If the computing resources of the current device are sufficient, the largest sub-model $a_l$ is applied, and forward propagation yields $\hat{y}_{a_l}$, which has the best representation capability among the sub-models. When the device's computing resources are insufficient, the smallest sub-model $a_s$ is applied, and forward propagation yields $\hat{y}_{a_s}$; because $a_s$ requires the least computation of all sub-models, forward propagation is greatly accelerated, which improves the user experience, while $\hat{y}_{a_s}$ still retains good representation capability.
The bidirectional separable deep self-attention network can thus dynamically select sub-models of different sizes from $\Omega$ according to the computing-resource state of the current device, realizing a dynamic balance between accuracy and latency: sub-model accuracy is maintained while the user experience is assured.
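Deployment-time selection can be sketched as follows; the `resource_level` signal and the cost ordering by d * l are hypothetical, since the patent only requires that ample resources select the largest sub-model and scarce resources the smallest:

```python
def pick_submodel(omega, resource_level):
    """Map a device-load signal in [0, 1] to a sub-model structure (d, l):
    0 picks the cheapest structure, 1 the most accurate."""
    ranked = sorted(omega, key=lambda a: a[0] * a[1])   # rough compute cost
    return ranked[round(resource_level * (len(ranked) - 1))]

omega = [(128, 2), (128, 12), (256, 12), (512, 12)]
print(pick_submodel(omega, 1.0))   # ample resources  -> (512, 12), best accuracy
print(pick_submodel(omega, 0.0))   # scarce resources -> (128, 2), fastest
```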
The invention has the following beneficial effects:
the invention provides a bidirectional divisible depth self-attention network, which is based on a designed bidirectional strategy that the width and the depth can be divided, adopts a deep and narrow filtering principle to further select reasonable submodels, and is matched with a proposed self-distillation algorithm, so that each submodel in the network has the application capability of a visual question-answering task. The bidirectional and separable depth self-attention model can dynamically select a proper sub-model to predict an answer according to the current computing resources, balance between precision and time delay is achieved, and a user has good experience while the accuracy of the predicted answer is ensured.
Drawings
FIG. 1 is a schematic diagram of a width-depth slicing strategy according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of model filtering principles according to an embodiment of the present invention.
Detailed Description
The detailed parameters of the invention are described further below.
The lightweight visual question-answering method based on the bidirectional separable deep self-attention network comprises the following steps:
step (1): dividing the data set;
the data set adopts VQA-v2 data set, and is further divided into 3 subsets aiming at VQA-v2 data set: training set, validation set and test set. The training set is used for training the model, the verification set is used for locally verifying the convergence condition of the model, and the test set is used for final model performance evaluation.
Step (2): constructing the visual features of the image;
For a given image, an existing trained object detection network detects the number m of candidate boxes and their positions in the image. The image region of each candidate box is then fed into the same detection network, and the features just before the network's classification layer are extracted as the features of that box. The per-box features are concatenated to form the visual features of the given image. Finally, so that the image feature dimension matches the deep self-attention network, the image features are further processed by a learnable linear transformation and mapped into a D-dimensional space. The specific method is as follows:
For a given image, the number m of candidate boxes and their positions are inferred with an existing trained Faster R-CNN object detection network, and the image region of each candidate box is fed into the Faster R-CNN network to extract its visual feature. The i-th candidate box yields the visual feature $x_i \in \mathbb{R}^{2048}$, and the visual feature of the whole image, $X_{image} \in \mathbb{R}^{m \times 2048}$, is the concatenation of the per-box features:

$X_{image} = [x_1, x_2, \ldots, x_i, \ldots, x_m]$  (formula 1)

Subsequently, a learnable linear transformation $\mathrm{Linear} : \mathbb{R}^{2048} \to \mathbb{R}^{D}$ further processes the image feature $X_{image}$ and maps it into a D-dimensional space, giving the final image visual feature $X_{input} \in \mathbb{R}^{m \times D}$:

$X_{input} = \mathrm{Linear}(X_{image})$  (formula 2)
Step (3): constructing the semantic features of the question;
For a given question, a trained word-vector model extracts a semantic feature for each word, and the extracted word features are concatenated to form the question's semantic features. Finally, so that the question feature dimension matches the deep self-attention network, the question features are further processed by a learnable linear transformation and mapped into a D-dimensional space. The specific method is as follows:
For a given question containing n words, each word is fed into a pre-trained GloVe word-vector model to extract its semantic feature. The j-th word yields the semantic feature $y_j \in \mathbb{R}^{300}$, and the semantic feature of the whole question, $Y_{question} \in \mathbb{R}^{n \times 300}$, is the concatenation of the per-word features:

$Y_{question} = [y_1, y_2, \ldots, y_j, \ldots, y_n]$  (formula 3)

Subsequently, a learnable linear transformation $\mathrm{Linear} : \mathbb{R}^{300} \to \mathbb{R}^{D}$ further processes the question semantic feature $Y_{question}$ and maps it into a D-dimensional space, giving the final question semantic feature $Y_{input} \in \mathbb{R}^{n \times D}$:

$Y_{input} = \mathrm{Linear}(Y_{question})$  (formula 4)
Step (4): constructing the deep self-attention network;
The deep self-attention network is a stack of self-attention layers, each of which consists of two parts: a multi-head attention module and a feed-forward layer. This network is used to construct both a teacher network for guiding training and the final bidirectional separable deep self-attention network; the two adopt deep self-attention networks of identical structure. So that input features can match the dimension of each sub-model in the bidirectional separable network, the deep self-attention network accepts the image visual features $X_{input} \in \mathbb{R}^{m \times D}$ and the question semantic features $Y_{input} \in \mathbb{R}^{n \times D}$ as input, and maps the input features to dimension d through a linear projection. The deep self-attention network can fully learn the interaction information between the two modalities and finally produces a visual-semantic fused feature with rich meaning.
4-1. The multi-head attention module;
For given query features $Q \in \mathbb{R}^{m \times D}$, key features $K \in \mathbb{R}^{n \times D}$, and value features $V \in \mathbb{R}^{n \times D}$, the multi-head attention module uses H parallel attention functions to compute the feature $F_{mha} \in \mathbb{R}^{m \times D}$:

$F_{mha} = \mathrm{MHA}(Q, K, V) = [\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_H] W_0$  (formula 5)

$\mathrm{head}_h = \mathrm{ATT}(Q W_h^Q, K W_h^K, V W_h^V)$  (formula 6)

where $W_h^Q, W_h^K, W_h^V \in \mathbb{R}^{D \times D_H}$ are the projection matrices of the h-th attention head, and $D_H$, the dimension of each attention head, is computed as $D_H = D / H$. In addition, $W_0 \in \mathbb{R}^{D \times D}$ further processes the output features of the multi-head attention function. The attention function ATT is computed as:

$\mathrm{ATT}(q, k, v) = \mathrm{softmax}\!\left(\frac{q k^{T}}{\sqrt{D_H}}\right) v$  (formula 7)
4-2. The feed-forward layer;
The feed-forward layer consists of a two-layer perceptron and applies a non-linear transformation to the output features of the multi-head attention module. For a given feature $F \in \mathbb{R}^{m \times D}$, the output feature $F_{ffn} \in \mathbb{R}^{m \times D}$ is computed as:

$F_{ffn} = \mathrm{FFN}(F) = \sigma(F W_1) W_2$  (formula 8)

where $W_1 \in \mathbb{R}^{D \times D_{ff}}$ and $W_2 \in \mathbb{R}^{D_{ff} \times D}$ are linear projection matrices and $\sigma$ is the non-linear activation.
4-3. The self-attention layer;
Each self-attention layer consists of the multi-head attention module and feed-forward layer described above. For a given input $F_{input}$, the output feature $F_{output}$ is computed as:

$\tilde{F} = \mathrm{LN}(F_{input} + \mathrm{MHA}(F_{input}, F_{input}, F_{input}))$  (formula 9)

$F_{output} = \mathrm{LN}(\tilde{F} + \mathrm{FFN}(\tilde{F}))$  (formula 10)

where LN denotes layer normalization.
4-4. Stacking self-attention layers;
The deep self-attention network is a stack of self-attention layers. Since a self-attention layer does not change the feature dimension, multiple self-attention layers can be connected in series to form the deep self-attention network Model:

$\mathrm{Model} = [\mathrm{Layer}^{(1)}, \mathrm{Layer}^{(2)}, \ldots, \mathrm{Layer}^{(L)}]$  (formula 11)

where L is the number of self-attention layers.
Step (5): designing the width slicing strategy;
Each self-attention layer in the deep self-attention network consists of several parameter matrices. To accommodate input features of different dimensions, each parameter matrix must be sliced so as to match inputs of different dimensions and output features of the appropriate dimension. For an input feature of dimension d, the width slicing strategy keeps the output dimension equal to d, preserving the original structural proportions of the self-attention layer. Notably, sub-models with input features of different dimensions share the parameter matrices of the self-attention layer: the smaller d is, the smaller the shared portion of the parameters; when d equals the original input dimension D, the parameter matrix is not sliced.
For the parameter matrices $W^Q, W^K, W^V \in \mathbb{R}^{D \times D}$ in multi-head attention and an input feature of dimension d, the size $D_H$ of each attention head is kept unchanged, while the input-matching dimension and the number of attention heads of the corresponding parameter matrices are changed. The sliced parameter matrices are thus $\hat{W}^Q, \hat{W}^K, \hat{W}^V \in \mathbb{R}^{d \times \hat{H} D_H}$, where $\hat{H} = d / D_H$ denotes the number of attention heads after slicing. The other parameter matrices of the self-attention layer, $W_0$, $W_1$, and $W_2$, follow the same strategy, so the sliced matrices are $\hat{W}_0 \in \mathbb{R}^{d \times d}$, $\hat{W}_1 \in \mathbb{R}^{d \times d_{ff}}$, and $\hat{W}_2 \in \mathbb{R}^{d_{ff} \times d}$, with $d_{ff}$ scaled in the same proportion as d.
Step (6): designing the depth slicing strategy;
The deep self-attention network is a stack of self-attention layers, whose count is denoted L. When the number of layers l of a sub-model satisfies l < L, the depth slicing strategy selects which l of the L layers belong to the sub-model. The invention provides a simple and effective depth slicing strategy so that, under any layer-count setting, a sub-model picks the most important self-attention layers as far as possible, improving the final accuracy of the different sub-models.
For a deep self-attention network with L layers, denote the layer indices by [1, 2, ..., L]. The invention holds that self-attention layers closer to the input and the output are more important, which means the middle layers are relatively less important; when l < L, layers are discarded starting from the middle. Concretely, the layer indices are first sorted by importance from high to low; for a sub-model with l layers, the first l entries of the sorted index list are taken and then re-sorted to restore the original layer order, yielding the layer indices of the final l-layer sub-model. This is the depth slicing strategy.
Step (7): combining the two slicing strategies and designing the filtering principle;
Through the designs of steps (5) and (6), each sub-model has a width d and a depth l. Under the same parameter count and computation, deep-and-narrow sub-models are more efficient and structurally more reasonable than shallow-and-wide ones. The filtering principle yields the screened sub-model structure candidate set $\Omega$; it reduces the training cost of the model and improves the accuracy of the trained sub-models.
Given a width-ratio candidate set $\Omega_{width}$ and a depth-ratio candidate set $\Omega_{depth}$, combining the candidate sets of the two dimensions yields a preliminary sub-model structure candidate set $\tilde{\Omega} = \Omega_{width} \times \Omega_{depth}$, in which each sub-model structure is $a(d, l)$ with width d and depth l. To express the 'deep-and-narrow' filtering principle conveniently, a two-dimensional index matrix $I \in \{0, 1\}^{|\Omega_{width}| \times |\Omega_{depth}|}$ is defined to further process the preliminary candidate set $\tilde{\Omega}$: I(d, l) = 1 means the sub-model a(d, l) is selected, and I(d, l) = 0 means it is discarded. The index matrix I is initialized to all ones, and its lower-triangular part (the shallow-and-wide combinations) is set to zero. The finally selected sub-model set $\Omega$ is defined as:

$\Omega = \{\, a(d, l) \mid I(d, l) = 1,\ a(d, l) \in \tilde{\Omega} \,\}$  (formula 12)
and (8): designing a self-distillation training algorithm and training a model;
aiming at the sub-model structure candidate set obtained in the step (7)
Figure BDA00035875252300001811
A self-distillation training strategy is provided, so that each sub-model can be fully trained. Firstly, a teacher network is trained by utilizing the deep self-attention network in the step (4), a bidirectional partitionable deep self-attention network is constructed, when a submodel in the bidirectional partitionable deep self-attention network is trained, images and problems are input into the teacher network to obtain a prediction vector, namely a soft label, and a candidate set is sampled during training through a submodel sampling strategy
Figure BDA00035875252300001812
The soft label is used as a supervision label of the sampled submodel for training;
defining a teacher network constructed by a deep self-attention network as MteacherThe bidirectional separable deep self-attention network is MDSTBy training teacher network MteacherObtains its parameter weight theta and uses this weight to initialize the bidirectional separable deep self-attention network MDSTWeight of thetaDST. Through a sub-model sampling strategy, a candidate set is sampled during training
Figure BDA0003587525230000191
The sub-model sampling strategy is as follows: keeping the k submodels sampled at each iteration, and setting the initial submodel structure candidate set as omega ═ asAl } wherein asTo represent
Figure BDA0003587525230000192
A minimum submodel oflTo represent
Figure BDA0003587525230000193
Then randomly sampling
Figure BDA0003587525230000194
And adding the k-2 submodels into a submodel structure candidate set omega to serve as a final submodel candidate set of the iteration. Inputting the input characteristic of each iteration as x into the teacher network MteacherGet the soft label y ═ Mteacher(x) And freezes its gradient y. Then traversing each submodel a E omega in the submodel structure candidate set omega, and inputting the input characteristic x into the current submodel to obtain a prediction vector
Figure BDA0003587525230000195
Predicting the result using this submodel
Figure BDA0003587525230000196
Soft label y calculation loss with teacher network output
Figure BDA0003587525230000197
KD represents a loss function, gradient accumulation loss is carried out on different submodels sampled from omega, and when all submodels in a submodel structure candidate set omega generated by each iteration traverse, the model weight theta is updated uniformlyDST
The model deployment and application in step (9) are as follows:
If the computing resources of the current device are sufficient, the largest sub-model $a_l$ is applied, and forward propagation yields $\hat{y}_{a_l}$, which has the best representation capability among the sub-models. When the device's computing resources are insufficient, the smallest sub-model $a_s$ is applied, and forward propagation yields $\hat{y}_{a_s}$; because $a_s$ requires the least computation of all sub-models, forward propagation is greatly accelerated, which improves the user experience, while $\hat{y}_{a_s}$ still retains good representation capability.
In summary, the bidirectional separable deep self-attention network provided by the invention can dynamically select sub-models of different sizes from $\Omega$ according to the computing-resource state of the current device, realizing a dynamic balance between accuracy and latency: sub-model accuracy is maintained while the user experience is assured.
As shown in FIG. 1 and FIG. 2, the present invention provides a lightweight visual question-answering method based on a bidirectional separable deep self-attention network.
The partitioning of the data set in step (1) is as follows:
The final training set contains 115K images and 1.1M questions, the validation set contains 5K images and 26K questions, and the test set contains 80K images and 448K questions.
The construction of the image visual features in step (2) is as follows:
An image typically yields 36 candidate boxes, and the visual feature extracted from each candidate box has dimension 2048. The final mapping dimension D is adjusted to the deep self-attention network; taking D = 512 as an example, this step produces the image visual features $X_{input} \in \mathbb{R}^{36 \times 512}$.
The construction of the question semantic features in step (3) is as follows:
A question is typically given a fixed length of 14 words, and the semantic feature extracted for each word with the pre-trained word-vector model has dimension 300. The final mapping dimension D is adjusted to the deep self-attention network; taking D = 512 as an example, this step produces the question semantic features $Y_{input} \in \mathbb{R}^{14 \times 512}$.
Step (4) is instantiated as follows:
With D = 512 and H = 8, the input feature $F_{input} \in \mathbb{R}^{m \times 512}$ is fed into the multi-head attention module MHA to obtain the output $F_{mha} \in \mathbb{R}^{m \times 512}$, and $F_{mha}$ is then fed into the feed-forward layer FFN to obtain the final output $F_{output} \in \mathbb{R}^{m \times 512}$.
The width slicing strategy in step (5) is instantiated as follows:
The invention defines the shareable width-ratio candidate set as $\Omega_{width} = \{1/4, 2/4, 3/4, 4/4\}$, so the input feature dimension of a sub-model under the different width-slicing ratios is $d \in \{D/4, 2D/4, 3D/4, D\}$. When D = 512, the candidate width dimensions are $d \in \{128, 256, 384, 512\}$; that is, the sub-model width has 4 choices: 128, 256, 384, and 512.
The depth slicing strategy in step (6) is instantiated as follows:
The invention defines the shareable depth-ratio candidate set as $\Omega_{depth} = \{1/6, 1/3, 2/3, 1\}$, so the number of layers of a sub-model under the different depth-slicing ratios is $l \in \{L/6, L/3, 2L/3, L\}$. When L = 12, $l \in \{2, 4, 8, 12\}$; that is, the sub-model depth has 4 choices: 2, 4, 8, and 12 layers.
The combination of the two slicing strategies and the filtering principle in step (7) is instantiated as follows:
According to the width-ratio candidate set $\Omega_{width}$ and the depth-ratio candidate set $\Omega_{depth}$ defined in steps (5) and (6), combining the candidate sets of the two dimensions yields the preliminary sub-model structure candidate set $\tilde{\Omega}$ of $4 \times 4 = 16$ structures $a(d, l)$ with $d \in \{128, 256, 384, 512\}$ and $l \in \{2, 4, 8, 12\}$. Applying the filtering principle discards the shallow-and-wide combinations (for example, d = 512 with l = 2) and yields the final sub-model structure candidate set $\Omega \subset \tilde{\Omega}$.
The self-distillation training algorithm in step (8) is instantiated as follows:
The invention sets k = 4, meaning that each iteration samples the 1 largest sub-model, the 1 smallest sub-model, and 2 further randomly sampled sub-models. The 4 sub-models sampled per iteration accumulate their gradients together.

Claims (10)

1. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network is characterized by comprising the following steps of:
step (1): dividing the data set;
step (2): constructing the visual features of the image;
for a given image, detecting the number m of candidate boxes and their positions in the image with an existing trained object detection network; for each candidate box, feeding the image region of the candidate box into the object detection network, and extracting the features before the network classification layer as the features of that candidate box; then concatenating the features extracted for each candidate box to form the visual features of the given image; finally, so that the image feature dimension matches the deep self-attention network, further processing the image features with a learnable linear transformation and mapping them into a D-dimensional space;
step (3): constructing the semantic features of the question;
for a given question, extracting a semantic feature from each word of the question with a trained word-vector model, and then concatenating the extracted word features to form the question semantic features; finally, so that the question feature dimension matches the deep self-attention network, further processing the question features with a learnable linear transformation and mapping them into a D-dimensional space;
step (4): constructing the deep self-attention network;
the deep self-attention network is a stack of self-attention layers, each of which consists of two parts: a multi-head attention module and a feed-forward layer; the deep self-attention network is used to construct a teacher network for guiding training and the final bidirectional separable deep self-attention network; so that input features match the dimension of each sub-model in the bidirectional separable network, the network accepts features of dimension D as input and maps them to dimension d through a linear projection;
step (5): designing the width slicing strategy;
each self-attention layer in the deep self-attention network consists of several parameter matrices; to accommodate input features of different dimensions, each parameter matrix must be sliced so as to match inputs of different dimensions and output features of the appropriate dimension; for an input feature of dimension d, the width slicing strategy keeps the output dimension equal to d, preserving the original structural proportions of the self-attention layer; notably, sub-models with input features of different dimensions share the parameter matrices of the self-attention layer, and the smaller d is, the smaller the shared portion of the parameters; when d equals the original input dimension D, the parameter matrix is not sliced;
step (6): designing the depth slicing strategy;
the deep self-attention network is a stack of self-attention layers, whose count is denoted L; when the number of layers l of a sub-model satisfies l < L, the depth slicing strategy selects which l of the L layers belong to the sub-model;
step (7): combining the two slicing strategies and designing the filtering principle;
through the designs of steps (5) and (6), each sub-model has a width d and a depth l; under the same parameter count and computation, deep-and-narrow sub-models are more efficient and structurally more reasonable than shallow-and-wide ones; a 'deep-and-narrow' filtering principle is proposed: before model training, sub-models with many layers and low width are selected, and sub-models with few layers and high width are discarded outright; the filtering principle yields the screened sub-model structure candidate set $\Omega$;
And (8): designing a self-distillation training algorithm and training a model;
aiming at the sub-model structure candidate set obtained in the step (7)
Figure FDA0003587525220000031
Providing a self-distillation training strategy to fully train each sub-model; firstly, a teacher network is trained by utilizing the deep self-attention network in the step (4), a bidirectional partitionable deep self-attention network is constructed, when a submodel in the bidirectional partitionable deep self-attention network is trained, images and problems are input into the teacher network to obtain a prediction vector, namely a soft label, and a candidate set is sampled during training through a submodel sampling strategy
Figure FDA0003587525220000032
The soft label is used as a supervision label of the sampled submodel for training;
step (9): model deployment and application.
2. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 1, wherein the partitioning of the data set in step (1) is as follows:
the VQA-v2 data set is adopted and divided into 3 subsets: a training set, a validation set, and a test set; the training set is used to train the model, the validation set to verify model convergence locally, and the test set for the final evaluation of model performance.
3. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 2, wherein the construction of image visual features in step (2) is specifically as follows:
for a given image, the number m of candidate boxes and their positions are inferred using an existing trained Faster R-CNN object detection network, and the image region corresponding to each candidate box is input into the Faster R-CNN object detection network to extract its visual feature; for the i-th candidate box, the corresponding visual feature is $x_i \in \mathbb{R}^{d_x}$, and the visual feature of the whole image, $X_{image} \in \mathbb{R}^{m \times d_x}$, is formed by concatenating the visual features of all candidate boxes:

$X_{image} = [x_1, x_2, \ldots, x_i, \ldots, x_m]$  (Equation 1)

subsequently, a learnable linear transformation $\mathrm{Linear}: \mathbb{R}^{d_x} \rightarrow \mathbb{R}^{D}$ is used to further process the image feature $X_{image}$, mapping it into the D-dimensional space to obtain the final image visual feature $X_{input} \in \mathbb{R}^{m \times D}$:

$X_{input} = \mathrm{Linear}(X_{image})$  (Equation 2).
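For illustration only, a minimal PyTorch sketch of Equations 1 and 2; the values of m, d_x and D are assumptions not fixed by the claim, and the random tensors stand in for real Faster R-CNN region features:

```python
import torch
import torch.nn as nn

# Assumed dims: m candidate boxes, d_x-dimensional region features, target width D.
m, d_x, D = 36, 2048, 512

x_boxes = [torch.randn(d_x) for _ in range(m)]   # x_1 .. x_m, one per candidate box
X_image = torch.stack(x_boxes)                   # Equation 1: shape (m, d_x)

linear = nn.Linear(d_x, D)                       # learnable linear transformation
X_input = linear(X_image)                        # Equation 2: shape (m, D)
```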
4. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 3, wherein the construction of question semantic features in step (3) is specifically as follows:
for a given question containing n words, each word is input into a pre-trained GloVe word vector model to extract its semantic feature; for the j-th word, the corresponding semantic feature is $y_j \in \mathbb{R}^{d_y}$, and the semantic feature of the whole question, $Y_{question} \in \mathbb{R}^{n \times d_y}$, is formed by concatenating the semantic features of all words:

$Y_{question} = [y_1, y_2, \ldots, y_j, \ldots, y_n]$  (Equation 3)

subsequently, a learnable linear transformation $\mathrm{Linear}: \mathbb{R}^{d_y} \rightarrow \mathbb{R}^{D}$ is used to further process the question semantic feature $Y_{question}$, mapping it into the D-dimensional space to obtain the final question semantic feature $Y_{input} \in \mathbb{R}^{n \times D}$:

$Y_{input} = \mathrm{Linear}(Y_{question})$  (Equation 4).
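A matching sketch of Equations 3 and 4; the 300-dimensional random vectors stand in for pre-trained GloVe embeddings, and n, d_y, D are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Assumed dims: n words, d_y-dimensional GloVe vectors, target width D.
n, d_y, D = 14, 300, 512

y_words = [torch.randn(d_y) for _ in range(n)]   # y_1 .. y_n, one per word
Y_question = torch.stack(y_words)                # Equation 3: shape (n, d_y)
Y_input = nn.Linear(d_y, D)(Y_question)          # Equation 4: shape (n, D)
```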
5. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 4, wherein the construction of the deep self-attention network in step (4) is specifically as follows:
the deep self-attention network is formed by stacking a plurality of self-attention layers, each of which is divided into two parts: a multi-head attention module and a feedforward layer; the deep self-attention network is used to construct both the teacher network for guiding training and the final bidirectional separable deep self-attention network, and the two adopt deep self-attention networks of identical structure; so that the input features can match the dimension of each submodel in the bidirectional separable deep self-attention network, the deep self-attention network accepts the image visual features $X_{input} \in \mathbb{R}^{m \times D}$ and the question semantic features $Y_{input} \in \mathbb{R}^{n \times D}$ of dimension D as input and maps them to dimension d through a linear projection transformation; the deep self-attention network fully learns the interaction information between the two modalities and finally generates a semantically rich visual-semantic fusion feature;
4-1. multi-head attention module;
for a given query feature $Q \in \mathbb{R}^{m \times D}$, key feature $K \in \mathbb{R}^{m \times D}$ and value feature $V \in \mathbb{R}^{m \times D}$, the multi-head attention module computes the feature $F_{mha} \in \mathbb{R}^{m \times D}$ using H parallel attention functions:

$F_{mha} = \mathrm{MHA}(Q, K, V) = [head_1, head_2, \ldots, head_H] W_0$  (Equation 5)

$head_h = \mathrm{ATT}(Q W_h^Q, K W_h^K, V W_h^V)$  (Equation 6)

where $W_h^Q, W_h^K, W_h^V \in \mathbb{R}^{D \times D_H}$ are the mapping matrices of the h-th attention head, and $D_H$ is the dimension of each attention head, which can be computed as $D_H = D / H$; in addition, $W_0 \in \mathbb{R}^{D \times D}$ is used to further process the output features of the multi-head attention function; the attention function ATT is computed as:

$\mathrm{ATT}(\hat{Q}, \hat{K}, \hat{V}) = \mathrm{softmax}\!\left(\hat{Q} \hat{K}^{\top} / \sqrt{D_H}\right) \hat{V}$  (Equation 7)
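As a sketch of Equations 5 to 7, the following PyTorch module runs H parallel scaled dot-product attention heads; packing the per-head matrices $W_h^Q, W_h^K, W_h^V$ into single D-by-D projections is an implementation convenience assumed here, not part of the claim:

```python
import math
import torch
import torch.nn as nn

class MHA(nn.Module):
    """Sketch of Equations 5-7: H parallel attention heads of size D_H = D / H."""
    def __init__(self, D: int, H: int):
        super().__init__()
        self.H, self.D_H = H, D // H
        self.W_Q = nn.Linear(D, D)   # packs W_h^Q for all heads
        self.W_K = nn.Linear(D, D)
        self.W_V = nn.Linear(D, D)
        self.W_0 = nn.Linear(D, D)   # output mapping W_0

    def forward(self, Q, K, V):
        def split(x):                # (n, D) -> (H, n, D_H)
            return x.view(x.size(0), self.H, self.D_H).transpose(0, 1)
        q, k, v = split(self.W_Q(Q)), split(self.W_K(K)), split(self.W_V(V))
        att = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.D_H), dim=-1)
        heads = (att @ v).transpose(0, 1).reshape(Q.size(0), -1)  # [head_1,...,head_H]
        return self.W_0(heads)       # Equation 5

Q = K = V = torch.randn(36, 512)
out = MHA(D=512, H=8)(Q, K, V)       # shape (36, 512)
```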
4-2. feedforward layer;
the feedforward layer consists of a two-layer perceptron and applies a nonlinear transformation to the output features of the multi-head attention module; for a given feature $F \in \mathbb{R}^{m \times D}$, the output feature $F_{ffn} \in \mathbb{R}^{m \times D}$ is computed as:

$F_{ffn} = \mathrm{FFN}(F) = \mathrm{ReLU}(F W_1) W_2$  (Equation 8)

where $W_1 \in \mathbb{R}^{D \times 4D}$ and $W_2 \in \mathbb{R}^{4D \times D}$ are linear transformation projection matrices;
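A minimal sketch of Equation 8; the 4x hidden expansion is an assumption, since the claim fixes only a two-layer perceptron with a nonlinearity:

```python
import torch
import torch.nn as nn

def make_ffn(D: int) -> nn.Module:
    return nn.Sequential(
        nn.Linear(D, 4 * D),   # W_1
        nn.ReLU(),             # nonlinear transformation
        nn.Linear(4 * D, D),   # W_2
    )

F_out = make_ffn(512)(torch.randn(36, 512))   # feature dimension preserved
```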
4-3. self-attention layer;
each self-attention layer consists of a multi-head attention module and a feedforward layer; for a given input $F_{input}$, the output feature $F_{output}$ is computed as:

$\hat{F} = \mathrm{LN}(F_{input} + \mathrm{MHA}(F_{input}, F_{input}, F_{input}))$  (Equation 9)

$F_{output} = \mathrm{LN}(\hat{F} + \mathrm{FFN}(\hat{F}))$  (Equation 10)

where LN denotes layer normalization;
4-4. stacking self-attention layers;
the deep self-attention network is formed by stacking a plurality of self-attention layers; since a self-attention layer does not change the feature dimension, multiple self-attention layers can be connected in series to form the deep self-attention network Model:

$Model = [Layer^{(1)}, Layer^{(2)}, \ldots, Layer^{(L)}]$  (Equation 11)

where L is the number of self-attention layers.
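A runnable sketch of Equations 9 to 11, assuming PyTorch's built-in nn.MultiheadAttention as the multi-head attention module and the same 4x feedforward expansion as above; D = 512, H = 8 and L = 6 are illustrative values only:

```python
import torch
import torch.nn as nn

class SelfAttentionLayer(nn.Module):
    """Sketch of Equations 9-10: residual MHA and FFN, each followed by LayerNorm."""
    def __init__(self, D: int, H: int):
        super().__init__()
        self.mha = nn.MultiheadAttention(D, H, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(D, 4 * D), nn.ReLU(), nn.Linear(4 * D, D))
        self.ln1, self.ln2 = nn.LayerNorm(D), nn.LayerNorm(D)

    def forward(self, F):
        F = self.ln1(F + self.mha(F, F, F, need_weights=False)[0])  # Equation 9
        return self.ln2(F + self.ffn(F))                            # Equation 10

# Equation 11: connect L self-attention layers in series.
model = nn.Sequential(*[SelfAttentionLayer(D=512, H=8) for _ in range(6)])
x = torch.randn(2, 36, 512)   # (batch, m, D)
y = model(x)                  # feature dimension is preserved: (2, 36, 512)
```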
6. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 5, wherein the width splitting strategy of step (5) is specifically as follows:
for the parameter matrices $W_h^Q, W_h^K, W_h^V \in \mathbb{R}^{D \times D_H}$ in multi-head attention and an input feature of dimension d, the size $D_H$ of each attention head is kept unchanged, while the input-matching dimension and the number of attention heads of the corresponding parameter matrices are changed, so that the finally sliced parameter matrices are $\hat{W}_h^Q, \hat{W}_h^K, \hat{W}_h^V \in \mathbb{R}^{d \times D_H}$, where $\hat{H} = d / D_H$ denotes the number of attention heads after slicing; the parameter matrices $W_0, W_1, W_2$ elsewhere in the self-attention layer adopt the same strategy, so that the finally sliced parameter matrices are $\hat{W}_0 \in \mathbb{R}^{d \times d}$, $\hat{W}_1 \in \mathbb{R}^{d \times 4d}$ and $\hat{W}_2 \in \mathbb{R}^{4d \times d}$.
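The slicing itself might look as follows in PyTorch; taking the leading d-dimensional block of each matrix is an assumption, as the claim fixes only the sliced shapes and the parameter sharing:

```python
import torch
import torch.nn as nn

D, D_H = 512, 64                       # full width, fixed per-head size

def slice_width(weight: torch.Tensor, d: int) -> torch.Tensor:
    """Slice a (D, D) attention parameter matrix down to input dimension d,
    keeping D_H fixed so that H_hat = d / D_H heads survive."""
    H_hat = d // D_H
    return weight[: H_hat * D_H, :d]   # a view: parameters stay shared

W_Q = nn.Linear(D, D, bias=False).weight   # full matrix, H = D / D_H heads
W_Q_hat = slice_width(W_Q, d=256)          # shape (256, 256): 4 heads of size 64
```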
7. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 6, wherein the depth splitting strategy of step (6) is specifically as follows:
for a deep self-attention network with L layers, the index of each layer is recorded as $[1, 2, \ldots, L]$; the invention considers that the closer a self-attention layer is to the input or the output, the more important it is, which means the middle layers are relatively less important; when the number of layers l of a submodel satisfies l < L, layers are therefore discarded starting from the middle; specifically, the layers are first sorted by importance from large to small to obtain the layer indices $[s_1, s_2, \ldots, s_L]$; for a submodel with l layers, the first l items $[s_1, s_2, \ldots, s_l]$ of the sorted layer indices are taken and then re-sorted to restore the original layer index order, yielding the final l-layer submodel of the depth splitting strategy.
8. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 7, wherein the combination of the two splitting strategies and the design of the filtering principle in step (7) are specifically as follows:
for a given width ratio candidate set $S_{width}$ and depth ratio candidate set $S_{depth}$, combining the candidate sets of the two dimensions yields a preliminary submodel structure candidate set $\mathcal{A}$, each submodel structure being $a(d, l)$ with $d \in S_{width}$ and $l \in S_{depth}$; to conveniently express the 'deep and narrow' filtering principle, a two-dimensional index matrix $I \in \{0, 1\}^{|S_{width}| \times |S_{depth}|}$ is defined to further process the preliminary submodel candidate set $\mathcal{A}$: $I(d, l) = 1$ indicates that the submodel $a(d, l)$ is selected, and $I(d, l) = 0$ indicates that the submodel $a(d, l)$ is discarded; the index matrix I is initialized to all 1 values, after which its lower triangular part is set to 0; finally, the selected submodel set $\tilde{\mathcal{A}}$ is specifically defined as:

$\tilde{\mathcal{A}} = \{\, a(d, l) \mid I(d, l) = 1 \,\}$  (Equation 12)
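A sketch of the index-matrix construction and Equation 12; the concrete ratio candidate sets below are illustrative assumptions:

```python
widths = [0.25, 0.5, 0.75, 1.0]   # width ratios d/D, small to large
depths = [0.25, 0.5, 0.75, 1.0]   # depth ratios l/L, small to large

# Index matrix I: all ones, then zero the lower-triangular part, i.e. the
# shallow-and-wide shapes whose width index exceeds their depth index.
I = [[0 if di > li else 1 for li in range(len(depths))]
     for di in range(len(widths))]

candidates = [(d, l)                         # the screened candidate set
              for di, d in enumerate(widths)
              for li, l in enumerate(depths)
              if I[di][li] == 1]
```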
9. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 8, wherein the self-distillation training algorithm of step (8) is specifically as follows:
the teacher network constructed from the deep self-attention network is defined as $M_{teacher}$, and the bidirectional separable deep self-attention network as $M_{DST}$; the parameter weights $\theta$ are obtained by training the teacher network $M_{teacher}$, and these weights are used to initialize the weights $\theta_{DST}$ of the bidirectional separable deep self-attention network $M_{DST}$; during training, submodels are sampled from the candidate set $\tilde{\mathcal{A}}$ through a submodel sampling strategy, which is as follows: k submodels are sampled at each iteration, with the initial submodel structure candidate set $\Omega = \{a_s, a_l\}$, where $a_s$ denotes the smallest submodel in $\tilde{\mathcal{A}}$ and $a_l$ denotes the largest submodel in $\tilde{\mathcal{A}}$; then k - 2 submodels are randomly sampled from $\tilde{\mathcal{A}}$ and added to the submodel structure candidate set $\Omega$ as the final submodel candidate set of this iteration; at each iteration, the input feature x is fed into the teacher network $M_{teacher}$ to obtain the soft label $y = M_{teacher}(x)$, whose gradient is frozen via y.detach(); then each submodel $a \in \Omega$ in the submodel structure candidate set $\Omega$ is traversed, the input feature x is fed into the current submodel to obtain the prediction vector $\hat{y}_a$, and the loss $\mathcal{L} = \mathrm{KD}(\hat{y}_a, y)$ is computed between this submodel's prediction $\hat{y}_a$ and the soft label y output by the teacher network, where KD denotes the loss function; gradients are accumulated across the different submodels sampled from $\Omega$, and once all submodels in the submodel structure candidate set $\Omega$ generated in this iteration have been traversed, the model weights $\theta_{DST}$ are updated in a single unified step.
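One iteration of this algorithm might be sketched as follows; student_dst(x, cfg) is a hypothetical callable that runs the submodel sliced to configuration cfg, and KL divergence is an assumed choice for the KD loss, which the claim leaves unspecified:

```python
import random
import torch
import torch.nn.functional as F

def train_step(x, teacher, student_dst, candidates, optimizer, k=4):
    with torch.no_grad():
        y_soft = teacher(x)                       # soft label, gradient frozen
    a_s, a_l = candidates[0], candidates[-1]      # smallest and largest submodels
    omega = [a_s, a_l] + random.sample(candidates[1:-1], k - 2)
    optimizer.zero_grad()
    for cfg in omega:                             # accumulate gradients per submodel
        y_pred = student_dst(x, cfg)
        loss = F.kl_div(F.log_softmax(y_pred, dim=-1),
                        F.softmax(y_soft, dim=-1), reduction="batchmean")
        loss.backward()                           # KD loss against the soft label
    optimizer.step()                              # one unified update of theta_DST
```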
10. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 9, wherein the model deployment and application in step (9) are as follows:
if the computing resources of the current device are sufficient, the largest submodel $a_l$ is applied, and the prediction $\hat{y}_{a_l}$ is obtained through forward propagation; in this case $\hat{y}_{a_l}$ has the best representation capability among all submodels; when the computing resources of the device are insufficient, the smallest submodel $a_s$ is adopted and $\hat{y}_{a_s}$ is obtained through forward propagation; because $a_s$ requires the least computation of all submodels, the forward propagation speed is greatly increased to improve the user experience, while $\hat{y}_{a_s}$ still retains good representation capability;
the bidirectional separable deep self-attention network can thus dynamically select submodels of different sizes from $\tilde{\mathcal{A}}$ according to the computing resource state of the current device, achieving a dynamic balance between accuracy and latency and maintaining submodel accuracy while ensuring the user experience.
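A sketch of this deployment rule; device_has_headroom() is a hypothetical stand-in for however the runtime probes its computing resources:

```python
def device_has_headroom() -> bool:
    """Hypothetical probe; a real deployment would query its own runtime."""
    return True

def pick_submodel(candidates):
    # Sufficient resources -> largest submodel a_l (best accuracy);
    # constrained resources -> smallest submodel a_s (fastest forward pass).
    a_s, a_l = candidates[0], candidates[-1]
    return a_l if device_has_headroom() else a_s

# e.g. cfg = pick_submodel(candidates); y = student_dst(x, cfg)
```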
CN202210369535.0A 2022-04-08 2022-04-08 Lightweight visual question-answering method based on bidirectional separable deep self-attention network Pending CN114647752A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210369535.0A CN114647752A (en) 2022-04-08 2022-04-08 Lightweight visual question-answering method based on bidirectional separable deep self-attention network

Publications (1)

Publication Number Publication Date
CN114647752A true CN114647752A (en) 2022-06-21

Family

ID=81997107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210369535.0A Pending CN114647752A (en) 2022-04-08 2022-04-08 Lightweight visual question-answering method based on bidirectional separable deep self-attention network

Country Status (1)

Country Link
CN (1) CN114647752A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863407A (en) * 2022-07-06 2022-08-05 宏龙科技(杭州)有限公司 Multi-task cold start target detection method based on visual language depth fusion
CN117216225A (en) * 2023-10-19 2023-12-12 四川大学 Three-mode knowledge distillation-based 3D visual question-answering method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination