CN114647752A - Lightweight visual question-answering method based on bidirectional separable deep self-attention network

Info

Publication number: CN114647752A
Application number: CN202210369535.0A
Authority: CN (China)
Prior art keywords: attention, self, network, model, layer
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: 余宙 (Yu Zhou), 金子添 (Jin Zitian), 俞俊 (Yu Jun)
Current and original assignee: Hangzhou Dianzi University (the listed assignee may be inaccurate)
Application filed by Hangzhou Dianzi University
Priority to CN202210369535.0A

Classifications

    • G06F 16/532: Query formulation, e.g. graphical querying (information retrieval of still image data)
    • G06F 16/535: Filtering based on additional data, e.g. user or group profiles (information retrieval of still image data)
    • G06F 16/90332: Natural language query formulation or dialogue systems
    • G06F 16/9035: Filtering based on additional data, e.g. user or group profiles
    • G06F 40/30: Semantic analysis (handling natural language data)
    • G06N 3/045: Combinations of networks (neural network architectures)
    • G06N 3/08: Learning methods (neural networks)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lightweight visual question-answering method based on a bidirectional separable deep self-attention network, and provides the bidirectional separable deep self-attention network itself. The bidirectional separable deep self-attention model can dynamically select a suitable sub-model to predict the answer according to the current computing resources, achieving a balance between accuracy and latency, so that the accuracy of the predicted answer is guaranteed while the user enjoys a good experience.

Description

Lightweight visual question-answering method based on bidirectional separable deep self-attention network
Technical Field
The invention belongs to the field of visual question answering, and in particular relates to a lightweight Visual Question Answering method based on a bidirectional separable deep self-attention network.
Background
The visual question-answering task is a popular research problem in multi-modal learning. It lies at the intersection of computer vision and natural language processing: given an image and a free-form question related to that image, a visual question-answering model must output the corresponding predicted answer. Unlike single-modality tasks, a multi-modal task such as visual question answering requires not only a correct understanding of the information in each modality, but also an understanding of the information that links the modalities, which is generally more complex and difficult. The task has a wide range of real-life applications: it can help visually impaired users access online image content more conveniently; it can promote better human-computer interaction systems and enhance user experience; and it can improve a machine's comprehension of images, strengthening image retrieval.
The deep self-attention network was originally proposed for the machine translation task in natural language processing. Its core architecture is a stack of self-attention layers, each of which builds complex, dense interactions among the input features. After achieving state-of-the-art results on machine translation, the deep self-attention network quickly attracted the attention of researchers in artificial intelligence and was applied across its sub-fields, including visual question answering. Because the deep self-attention network learns the interaction between visual and textual features particularly well, this architecture has become the mainstream network structure in the visual question-answering field. However, alongside the performance gains, its computational complexity places new demands on computing resources and storage space, which creates a serious problem: deploying such models for mobile devices depends on GPU cloud servers, making it difficult to directly exploit the limited computing resources of the mobile terminal, wasting resources and energy. At present there is still no lightweight model in the visual question-answering field, which challenges model deployment and prevents users from enjoying the convenience brought by artificial intelligence applications.
To meet the new challenge of deploying deep learning models, a number of model compression methods have emerged. In single-modality fields such as computer vision and natural language processing, compression methods based on weight sharing, knowledge distillation, pruning, and quantization have gradually appeared; they compress a model by a certain ratio to balance computation against accuracy. However, these methods usually compress at a fixed ratio and can only produce a lightweight model of one fixed size. Mobile devices today are diverse, computing performance differs greatly between devices, and even the same device provides different computing resources under different loads and battery levels. If a separate lightweight model is designed for each device or load condition, the training overhead grows in proportion to the number of models, and a single device must store several models to cope with its scenarios, so the storage overhead is also large.
Recently, slimmable and separable neural networks have offered a new idea: use a single model to cope with all scenarios. When computing resources are sufficient, most of the model is used for forward propagation and prediction to obtain higher accuracy; when computing resources are limited, only a small fraction of the model's parameters is used for prediction, sacrificing a little accuracy for inference speed. If this idea can be exploited, and an efficient and reasonable slicing and training strategy can be designed for the deep self-attention network, the mainstream model structure in visual question answering, it would be a new contribution to bringing visual question-answering models into practice.
In view of the above, how to design an efficient, separable deep self-attention network and apply it to visual question answering is a subject worthy of intensive research. This patent starts from several key points of the task, addresses the difficulties of existing methods, and forms a complete and efficient lightweight visual question-answering method.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a lightweight visual question-answering method based on a bidirectional separable deep self-attention network. The invention makes two main contributions:
1. By analyzing the internal structure of the deep self-attention network, efficient and reasonable width and depth slicing strategies are designed, and the two are combined into a bidirectional strategy that is separable in both width and depth. For the sub-models produced by the bidirectional strategy, the invention proposes a 'deep-and-narrow' filtering principle and further selects a set of excellent, highly efficient substructures.
2. The bidirectional slicing strategy and the filtering principle are combined with an existing visual question-answering model based on deep self-attention, and an efficient self-distillation training strategy is proposed so that every sub-model is fully trained, finally yielding the bidirectional separable deep self-attention visual question-answering model.
The invention provides a lightweight Visual Question Answering method based on a bidirectional separable deep self-attention network. At its core, the method analyzes the internal structure of the deep self-attention network to derive efficient and reasonable width and depth slicing strategies, and combines these two single-dimension strategies into a bidirectional strategy that is separable in both width and depth. For the substructures produced by the bidirectional strategy, the invention proposes a 'deep-and-narrow' filtering principle that further selects a set of excellent, highly efficient substructures: the principle improves the performance of each substructure, and at deployment time no extra screening is needed, so the sub-models can be put to use directly without post-processing, which keeps the principle simple and easy to apply. In addition, an efficient self-distillation training strategy is provided so that every sub-model is fully trained. The method can be combined with any existing visual question-answering model based on the deep self-attention network: after training, it forms a bidirectional separable deep self-attention network in which every sub-model is capable of the visual question-answering task. When the model is deployed on edge devices with limited resources and large performance fluctuations, the bidirectional separable deep self-attention model dynamically selects a suitable sub-model to predict answers according to the current computing resources, striking a balance between accuracy and latency, so that prediction accuracy is guaranteed while the user keeps a good experience.
The lightweight visual question-answering method based on the bidirectional separable deep self-attention network comprises the following steps:
Step (1): dividing the data set;
Step (2): constructing the visual features of the image;
For a given image, an existing trained object detection network detects the number m of candidate boxes and their positions in the image. The image region of each candidate box is then fed into the same detection network, and the features just before the network's classification layer are extracted as the features of that box. The per-box features are concatenated to form the visual features of the given image. Finally, so that the image feature dimension matches the deep self-attention network, the image features are further processed by a learnable linear transformation and mapped into a D-dimensional space;
Step (3): constructing the semantic features of the question;
For a given question, a trained word-vector model extracts a semantic feature for each word, and the extracted word features are concatenated to form the question's semantic features. Finally, so that the question feature dimension matches the deep self-attention network, the question features are further processed by a learnable linear transformation and mapped into a D-dimensional space;
Step (4): constructing the deep self-attention network;
The deep self-attention network is a stack of self-attention layers, each of which consists of two parts: a multi-head attention module and a feed-forward layer. This network is used to construct both a teacher network for guiding training and the final bidirectional separable deep self-attention network. So that input features can match the dimension of each sub-model in the bidirectional separable network, the network accepts features of dimension D as input and maps them to dimension d through a linear projection.
Step (5): designing the width slicing strategy;
Each self-attention layer in the deep self-attention network consists of several parameter matrices. To accommodate input features of different dimensions, each parameter matrix must be sliced so as to match inputs of different dimensions and output features of the appropriate dimension. For an input feature of dimension d, the width slicing strategy keeps the output dimension equal to d, preserving the original structural proportions of the self-attention layer. Notably, sub-models with input features of different dimensions share the parameter matrices of the self-attention layer: the smaller d is, the smaller the shared portion of the parameters; when d equals the original input dimension D, the parameter matrix is not sliced;
and (6): designing a depth segmentation strategy;
the depth self-attention network is formed by stacking a plurality of self-attention layers, the number of the layers is recorded as L, when the number of the layers L of the sub-model is less than L, the layer L in the depth self-attention network is selected according to a depth segmentation strategy and belongs to the sub-model. A simple and effective depth segmentation strategy is provided, and sub-models can pick out more important self-attention layers as far as possible under different layer number settings, so that the final precision of different sub-models is improved;
and (7): combining two segmentation strategies and designing a filtering principle;
through the design of the steps (5) and (6), each sub-model has a width d and a depth l. Under the same parameter quantity and calculation, the deep and narrow submodels are more efficient and reasonable in structure than the shallow and wide submodels, a deep and narrow filtering principle is provided, a plurality of submodels with a large number of layers and a low width are selected before model training, and the submodels with a small number of layers and a high width are directly discarded. Through the filtering principle, a candidate set of the screened sub-model structure is obtained
Figure BDA0003587525230000061
Step (8): designing the self-distillation training algorithm and training the model;
For the sub-model structure candidate set $\Omega$ obtained in step (7), a self-distillation training strategy is proposed so that every sub-model is fully trained. First, a teacher network is trained using the deep self-attention network of step (4), and the bidirectional separable deep self-attention network is constructed. When training the sub-models of the bidirectional separable network, the image and the question are fed into the teacher network to obtain its prediction vector, called the soft label. At each training step, a sub-model sampling strategy draws sub-models from the candidate set $\Omega$, and the soft label serves as the supervision label for training the sampled sub-models;
Step (9): model deployment and application;
Further, the partitioning of the data set in step (1) is as follows:
The VQA-v2 data set is adopted and divided into 3 subsets: a training set, a validation set, and a test set. The training set is used to train the model, the validation set to verify model convergence locally, and the test set for the final evaluation of model performance.
Further, the construction of the visual features of the image in step (2) is as follows:
For a given image, the number m of candidate boxes and their positions are inferred with an existing trained Faster R-CNN object detection network, and the image region of each candidate box is fed into the Faster R-CNN network to extract its visual feature. The i-th candidate box yields the visual feature $x_i \in \mathbb{R}^{2048}$, and the visual feature of the whole image, $X_{image} \in \mathbb{R}^{m \times 2048}$, is the concatenation of the per-box features:

$X_{image} = [x_1, x_2, \ldots, x_i, \ldots, x_m]$  (formula 1)

Subsequently, a learnable linear transformation $\mathrm{Linear} : \mathbb{R}^{2048} \to \mathbb{R}^{D}$ further processes the image feature $X_{image}$ and maps it into a D-dimensional space, giving the final image visual feature $X_{input} \in \mathbb{R}^{m \times D}$:

$X_{input} = \mathrm{Linear}(X_{image})$  (formula 2)
Further, the construction of the semantic features of the question in step (3) is as follows:
For a given question containing n words, each word is fed into a pre-trained GloVe word-vector model to extract its semantic feature. The j-th word yields the semantic feature $y_j \in \mathbb{R}^{300}$, and the semantic feature of the whole question, $Y_{question} \in \mathbb{R}^{n \times 300}$, is the concatenation of the per-word features:

$Y_{question} = [y_1, y_2, \ldots, y_j, \ldots, y_n]$  (formula 3)

Subsequently, a learnable linear transformation $\mathrm{Linear} : \mathbb{R}^{300} \to \mathbb{R}^{D}$ further processes the question semantic feature $Y_{question}$ and maps it into a D-dimensional space, giving the final question semantic feature $Y_{input} \in \mathbb{R}^{n \times D}$:

$Y_{input} = \mathrm{Linear}(Y_{question})$  (formula 4)
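As an illustrative sketch of formulas 1-4, the two learnable projections can be written in PyTorch as follows. The dimensions (2048-d Faster R-CNN box features, 300-d GloVe vectors, D = 512, m = 36, n = 14) are taken from the embodiment described later; the random tensors merely stand in for real extracted features:

```python
import torch
import torch.nn as nn

D = 512                            # common hidden dimension (embodiment value)
img_proj = nn.Linear(2048, D)      # learnable Linear of formula 2
txt_proj = nn.Linear(300, D)       # learnable Linear of formula 4

X_image = torch.randn(36, 2048)    # m = 36 candidate boxes (placeholder features)
Y_question = torch.randn(14, 300)  # n = 14 words (placeholder GloVe vectors)

X_input = img_proj(X_image)        # (36, 512), formula 2
Y_input = txt_proj(Y_question)     # (14, 512), formula 4
```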
Further, the construction of the deep self-attention network in step (4) is as follows:
The deep self-attention network is a stack of self-attention layers, each of which consists of two parts: a multi-head attention module and a feed-forward layer. This network is used to construct both a teacher network for guiding training and the final bidirectional separable deep self-attention network; the two adopt deep self-attention networks of identical structure. So that input features can match the dimension of each sub-model in the bidirectional separable network, the deep self-attention network accepts the image visual features $X_{input} \in \mathbb{R}^{m \times D}$ and the question semantic features $Y_{input} \in \mathbb{R}^{n \times D}$ as input, and maps the input features to dimension d through a linear projection.
4-1. The multi-head attention module;
For given query features $Q \in \mathbb{R}^{m \times D}$, key features $K \in \mathbb{R}^{n \times D}$, and value features $V \in \mathbb{R}^{n \times D}$, the multi-head attention module uses H parallel attention functions to compute the feature $F_{mha} \in \mathbb{R}^{m \times D}$:

$F_{mha} = \mathrm{MHA}(Q, K, V) = [\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_H] W_0$  (formula 5)

$\mathrm{head}_h = \mathrm{ATT}(Q W_h^Q, K W_h^K, V W_h^V)$  (formula 6)

where $W_h^Q, W_h^K, W_h^V \in \mathbb{R}^{D \times D_H}$ are the projection matrices of the h-th attention head, and $D_H$, the dimension of each attention head, is computed as $D_H = D / H$. In addition, $W_0 \in \mathbb{R}^{D \times D}$ further processes the output features of the multi-head attention function. The attention function ATT is computed as:

$\mathrm{ATT}(q, k, v) = \mathrm{softmax}\!\left(\frac{q k^{T}}{\sqrt{D_H}}\right) v$  (formula 7)
4-2. The feed-forward layer;
The feed-forward layer consists of a two-layer perceptron and applies a non-linear transformation to the output features of the multi-head attention module. For a given feature $F \in \mathbb{R}^{m \times D}$, the output feature $F_{ffn} \in \mathbb{R}^{m \times D}$ is computed as:

$F_{ffn} = \mathrm{FFN}(F) = \sigma(F W_1) W_2$  (formula 8)

where $W_1 \in \mathbb{R}^{D \times D_{ff}}$ and $W_2 \in \mathbb{R}^{D_{ff} \times D}$ are linear projection matrices and $\sigma$ is the non-linear activation.
4-3. The self-attention layer;
Each self-attention layer consists of the multi-head attention module and feed-forward layer described above. For a given input $F_{input}$, the output feature $F_{output}$ is computed as:

$\tilde{F} = \mathrm{LN}(F_{input} + \mathrm{MHA}(F_{input}, F_{input}, F_{input}))$  (formula 9)

$F_{output} = \mathrm{LN}(\tilde{F} + \mathrm{FFN}(\tilde{F}))$  (formula 10)

where LN denotes layer normalization.
4-4. Stacking self-attention layers;
The deep self-attention network is a stack of self-attention layers. Since a self-attention layer does not change the feature dimension, multiple self-attention layers can be connected in series to form the deep self-attention network Model:

$\mathrm{Model} = [\mathrm{Layer}^{(1)}, \mathrm{Layer}^{(2)}, \ldots, \mathrm{Layer}^{(L)}]$  (formula 11)

where L is the number of self-attention layers.
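A minimal PyTorch sketch of one self-attention layer and the stacked network of formulas 5-11 follows. The ReLU activation and the feed-forward width of 4D are common Transformer defaults assumed here for illustration, not values fixed by the patent:

```python
import math
import torch
import torch.nn as nn

class SelfAttentionLayer(nn.Module):
    """One self-attention layer: multi-head attention plus a two-layer
    feed-forward block, each wrapped in a residual connection and layer
    normalization (formulas 5-10)."""

    def __init__(self, dim, num_heads, ffn_dim):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads          # D_H = D / H
        self.w_q = nn.Linear(dim, dim)            # packs all W_h^Q
        self.w_k = nn.Linear(dim, dim)            # packs all W_h^K
        self.w_v = nn.Linear(dim, dim)            # packs all W_h^V
        self.w_0 = nn.Linear(dim, dim)            # output projection W_0
        self.ffn = nn.Sequential(                 # FFN of formula 8
            nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, x):                         # x: (n, dim)
        n, d = x.shape
        split = lambda t: t.view(n, self.num_heads, self.head_dim).transpose(0, 1)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        # ATT(q, k, v) = softmax(q k^T / sqrt(D_H)) v   (formula 7)
        att = (q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)).softmax(-1) @ v
        f_mha = self.w_0(att.transpose(0, 1).reshape(n, d))  # concat heads, apply W_0
        x = self.ln1(x + f_mha)                   # formula 9
        return self.ln2(x + self.ffn(x))          # formula 10

# Model = [Layer^(1), ..., Layer^(L)]  (formula 11), with L = 12, D = 512, H = 8
layers = nn.ModuleList(SelfAttentionLayer(512, 8, 2048) for _ in range(12))
```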
Further, the width slicing strategy in step (5) is as follows:
For the parameter matrices $W^Q, W^K, W^V \in \mathbb{R}^{D \times D}$ in multi-head attention and an input feature of dimension d, the size $D_H$ of each attention head is kept unchanged, while the input-matching dimension and the number of attention heads of the corresponding parameter matrices are changed. The sliced parameter matrices are thus $\hat{W}^Q, \hat{W}^K, \hat{W}^V \in \mathbb{R}^{d \times \hat{H} D_H}$, where $\hat{H} = d / D_H$ denotes the number of attention heads after slicing. The other parameter matrices of the self-attention layer, $W_0$, $W_1$, and $W_2$, follow the same strategy, so the sliced matrices are $\hat{W}_0 \in \mathbb{R}^{d \times d}$, $\hat{W}_1 \in \mathbb{R}^{d \times d_{ff}}$, and $\hat{W}_2 \in \mathbb{R}^{d_{ff} \times d}$, with $d_{ff}$ scaled in the same proportion as d.
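The parameter sharing of the width slicing strategy can be sketched as below: a sub-model of width d runs the shared matrices through their top-left (d, d) slices, so no separate weights are stored per width. Which corner of the matrix is shared is an implementation assumption; the patent only requires that sub-models share the matrices and that $D_H$ stays fixed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, D_H = 512, 64                 # full width and fixed per-head size (H = 8)
w_q = nn.Linear(D, D)            # shared parameter matrix W^Q

def forward_sliced(linear, x, d):
    """Run a shared linear layer at width d using the top-left (d, d)
    slice of its weight, so smaller widths reuse a subset of the same
    parameters; with d == D the full matrix is used unsliced."""
    return F.linear(x, linear.weight[:d, :d], linear.bias[:d])

d = 256                          # sliced width: H_hat = d // D_H = 4 heads
x = torch.randn(36, d)           # input feature of dimension d
q = forward_sliced(w_q, x, d)    # (36, 256)
```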
Further, the depth slicing strategy in step (6) is as follows:
For a deep self-attention network with L layers, denote the layer indices by [1, 2, ..., L]. The invention holds that self-attention layers closer to the input and the output are more important, which means the middle layers are relatively less important; when the number of layers l of a sub-model satisfies l < L, layers are discarded starting from the middle. Concretely, the layer indices are first sorted by importance from high to low; for a sub-model with l layers, the first l entries of the sorted index list are taken and then re-sorted to restore the original layer order, yielding the layer indices of the final l-layer sub-model. This is the depth slicing strategy.
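A sketch of the depth slicing strategy follows. The exact importance ordering among equally distant layers is an assumption made for illustration; the patent only states that layers nearer the input and output matter more, so middle layers are dropped first:

```python
def depth_slice(L, l):
    """Pick the layer indices an l-layer sub-model keeps out of L layers:
    rank layers by their distance to the nearer end (closer = more
    important), keep the top l, then restore the original layer order."""
    by_importance = sorted(range(1, L + 1), key=lambda i: min(i - 1, L - i))
    return sorted(by_importance[:l])

print(depth_slice(12, 4))   # -> [1, 2, 11, 12]: the middle layers are dropped
```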
Further, the combination of the two slicing strategies and the design of the filtering principle in step (7) are as follows:
Given a width-ratio candidate set $\Omega_{width}$ and a depth-ratio candidate set $\Omega_{depth}$, combining the candidate sets of the two dimensions yields a preliminary sub-model structure candidate set $\tilde{\Omega} = \Omega_{width} \times \Omega_{depth}$, in which each sub-model structure is $a(d, l)$ with width d and depth l. To express the 'deep-and-narrow' filtering principle conveniently, a two-dimensional index matrix $I \in \{0, 1\}^{|\Omega_{width}| \times |\Omega_{depth}|}$ is defined to further process the preliminary candidate set $\tilde{\Omega}$: I(d, l) = 1 means the sub-model a(d, l) is selected, and I(d, l) = 0 means it is discarded. The index matrix I is initialized to all ones, and its lower-triangular part (the shallow-and-wide combinations) is set to zero. The finally selected sub-model set $\Omega$ is defined as:

$\Omega = \{\, a(d, l) \mid I(d, l) = 1,\ a(d, l) \in \tilde{\Omega} \,\}$  (formula 12)
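The index matrix of formula 12 can be sketched as follows, using the candidate widths and depths of the embodiment. Reading 'lower triangular' as strictly below the diagonal is an assumption; under it, 10 of the 16 combinations survive:

```python
import numpy as np

widths = [128, 256, 384, 512]      # width candidates, ascending
depths = [2, 4, 8, 12]             # depth candidates, ascending

# I starts as all ones; zeroing the lower triangle discards the
# shallow-and-wide corner (large width, small depth), per formula 12.
I = np.triu(np.ones((len(widths), len(depths)), dtype=int))

omega = [(d, l) for i, d in enumerate(widths)
                for j, l in enumerate(depths) if I[i, j]]
print(omega)   # e.g. (512, 2) is dropped, while (128, 12) is kept
```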
Further, the self-distillation training algorithm in step (8) is as follows:
Define the teacher network constructed from the deep self-attention network as $M_{teacher}$ and the bidirectional separable deep self-attention network as $M_{DST}$. The teacher network $M_{teacher}$ is trained to obtain its parameter weights $\theta$, which are used to initialize the weights $\theta_{DST}$ of the bidirectional separable network $M_{DST}$. The sub-model sampling strategy is as follows: keep the number of sub-models sampled per iteration at k, and set the initial sub-model structure candidate set to $\Omega' = \{a_s, a_l\}$, where $a_s$ denotes the smallest sub-model in $\Omega$ and $a_l$ the largest; then randomly sample k - 2 further sub-models from $\Omega$ and add them to $\Omega'$ as the final sub-model candidate set of this iteration. The input feature x of each iteration is fed into the teacher network $M_{teacher}$ to obtain the soft label $y = M_{teacher}(x)$, whose gradient is frozen. Each sub-model $a \in \Omega'$ is then traversed: the input feature x is fed into the current sub-model to obtain the prediction vector $\hat{y}_a$, and the loss $\mathrm{KD}(\hat{y}_a, y)$ between the sub-model prediction $\hat{y}_a$ and the teacher's soft label y is computed, where KD denotes the distillation loss function. The losses of the different sub-models sampled from $\Omega$ accumulate gradients, and once every sub-model in the candidate set $\Omega'$ generated in this iteration has been traversed, the model weights $\theta_{DST}$ are updated in a single step.
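One training iteration can be sketched as below. The callables `teacher(x)` and `student(x, sub)` (running sub-model `sub` of $M_{DST}$), the KL-divergence form of the KD loss, and plain SGD are illustrative assumptions; the patent specifies only the sampling, the frozen soft labels, the gradient accumulation, and the single weight update:

```python
import random
import torch
import torch.nn.functional as F

def self_distill_step(teacher, student, x, omega, k=4, lr=1e-4):
    """One self-distillation iteration: sample k sub-models (always
    including the smallest and largest), distill each against the frozen
    teacher soft label, accumulate gradients, then update weights once."""
    with torch.no_grad():                       # soft label with frozen gradient
        soft = teacher(x).softmax(dim=-1)

    a_s, a_l = omega[0], omega[-1]              # smallest and largest sub-models
    sampled = [a_s, a_l] + random.sample(omega[1:-1], k - 2)

    student.zero_grad()
    for sub in sampled:                         # losses accumulate gradients
        pred = student(x, sub)
        F.kl_div(pred.log_softmax(dim=-1), soft,
                 reduction="batchmean").backward()
    with torch.no_grad():                       # single unified update
        for p in student.parameters():
            if p.grad is not None:
                p -= lr * p.grad
```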
Further, the model deployment and application in step (9) are as follows:
If the computing resources of the current device are sufficient, the largest sub-model $a_l$ is applied, and forward propagation yields $\hat{y}_{a_l}$, which has the best representation capability among the sub-models. When the device's computing resources are insufficient, the smallest sub-model $a_s$ is applied, and forward propagation yields $\hat{y}_{a_s}$; because $a_s$ requires the least computation of all sub-models, forward propagation is greatly accelerated, which improves the user experience, while $\hat{y}_{a_s}$ still retains good representation capability.
The bidirectional separable deep self-attention network can thus dynamically select sub-models of different sizes from $\Omega$ according to the computing-resource state of the current device, realizing a dynamic balance between accuracy and latency: sub-model accuracy is maintained while the user experience is assured.
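Deployment-time selection can be sketched as follows; the `resource_level` signal and the cost ordering by d * l are hypothetical, since the patent only requires that ample resources select the largest sub-model and scarce resources the smallest:

```python
def pick_submodel(omega, resource_level):
    """Map a device-load signal in [0, 1] to a sub-model structure (d, l):
    0 picks the cheapest structure, 1 the most accurate."""
    ranked = sorted(omega, key=lambda a: a[0] * a[1])   # rough compute cost
    return ranked[round(resource_level * (len(ranked) - 1))]

omega = [(128, 2), (128, 12), (256, 12), (512, 12)]
print(pick_submodel(omega, 1.0))   # ample resources  -> (512, 12), best accuracy
print(pick_submodel(omega, 0.0))   # scarce resources -> (128, 2), fastest
```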
The invention has the following beneficial effects:
the invention provides a bidirectional divisible depth self-attention network, which is based on a designed bidirectional strategy that the width and the depth can be divided, adopts a deep and narrow filtering principle to further select reasonable submodels, and is matched with a proposed self-distillation algorithm, so that each submodel in the network has the application capability of a visual question-answering task. The bidirectional and separable depth self-attention model can dynamically select a proper sub-model to predict an answer according to the current computing resources, balance between precision and time delay is achieved, and a user has good experience while the accuracy of the predicted answer is ensured.
Drawings
FIG. 1 is a schematic diagram of a width-depth slicing strategy according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of model filtering principles according to an embodiment of the present invention.
Detailed Description
The detailed parameters of the invention are described further below.
The lightweight visual question-answering method based on the bidirectional separable deep self-attention network comprises the following steps:
step (1): dividing the data set;
the data set adopts VQA-v2 data set, and is further divided into 3 subsets aiming at VQA-v2 data set: training set, validation set and test set. The training set is used for training the model, the verification set is used for locally verifying the convergence condition of the model, and the test set is used for final model performance evaluation.
Step (2): constructing the visual features of the image;
For a given image, an existing trained object detection network detects the number m of candidate boxes and their positions in the image. The image region of each candidate box is then fed into the same detection network, and the features just before the network's classification layer are extracted as the features of that box. The per-box features are concatenated to form the visual features of the given image. Finally, so that the image feature dimension matches the deep self-attention network, the image features are further processed by a learnable linear transformation and mapped into a D-dimensional space. The specific method is as follows:
For a given image, the number m of candidate boxes and their positions are inferred with an existing trained Faster R-CNN object detection network, and the image region of each candidate box is fed into the Faster R-CNN network to extract its visual feature. The i-th candidate box yields the visual feature $x_i \in \mathbb{R}^{2048}$, and the visual feature of the whole image, $X_{image} \in \mathbb{R}^{m \times 2048}$, is the concatenation of the per-box features:

$X_{image} = [x_1, x_2, \ldots, x_i, \ldots, x_m]$  (formula 1)

Subsequently, a learnable linear transformation $\mathrm{Linear} : \mathbb{R}^{2048} \to \mathbb{R}^{D}$ further processes the image feature $X_{image}$ and maps it into a D-dimensional space, giving the final image visual feature $X_{input} \in \mathbb{R}^{m \times D}$:

$X_{input} = \mathrm{Linear}(X_{image})$  (formula 2)
Step (3): constructing the semantic features of the question;
For a given question, a trained word-vector model extracts a semantic feature for each word, and the extracted word features are concatenated to form the question's semantic features. Finally, so that the question feature dimension matches the deep self-attention network, the question features are further processed by a learnable linear transformation and mapped into a D-dimensional space. The specific method is as follows:
For a given question containing n words, each word is fed into a pre-trained GloVe word-vector model to extract its semantic feature. The j-th word yields the semantic feature $y_j \in \mathbb{R}^{300}$, and the semantic feature of the whole question, $Y_{question} \in \mathbb{R}^{n \times 300}$, is the concatenation of the per-word features:

$Y_{question} = [y_1, y_2, \ldots, y_j, \ldots, y_n]$  (formula 3)

Subsequently, a learnable linear transformation $\mathrm{Linear} : \mathbb{R}^{300} \to \mathbb{R}^{D}$ further processes the question semantic feature $Y_{question}$ and maps it into a D-dimensional space, giving the final question semantic feature $Y_{input} \in \mathbb{R}^{n \times D}$:

$Y_{input} = \mathrm{Linear}(Y_{question})$  (formula 4)
Step (4): constructing the deep self-attention network;
The deep self-attention network is a stack of self-attention layers, each of which consists of two parts: a multi-head attention module and a feed-forward layer. This network is used to construct both a teacher network for guiding training and the final bidirectional separable deep self-attention network; the two adopt deep self-attention networks of identical structure. So that input features can match the dimension of each sub-model in the bidirectional separable network, the deep self-attention network accepts the image visual features $X_{input} \in \mathbb{R}^{m \times D}$ and the question semantic features $Y_{input} \in \mathbb{R}^{n \times D}$ as input, and maps the input features to dimension d through a linear projection. The deep self-attention network can fully learn the interaction information between the two modalities and finally produces a visual-semantic fused feature with rich meaning.
4-1. The multi-head attention module;
For given query features $Q \in \mathbb{R}^{m \times D}$, key features $K \in \mathbb{R}^{n \times D}$, and value features $V \in \mathbb{R}^{n \times D}$, the multi-head attention module uses H parallel attention functions to compute the feature $F_{mha} \in \mathbb{R}^{m \times D}$:

$F_{mha} = \mathrm{MHA}(Q, K, V) = [\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_H] W_0$  (formula 5)

$\mathrm{head}_h = \mathrm{ATT}(Q W_h^Q, K W_h^K, V W_h^V)$  (formula 6)

where $W_h^Q, W_h^K, W_h^V \in \mathbb{R}^{D \times D_H}$ are the projection matrices of the h-th attention head, and $D_H$, the dimension of each attention head, is computed as $D_H = D / H$. In addition, $W_0 \in \mathbb{R}^{D \times D}$ further processes the output features of the multi-head attention function. The attention function ATT is computed as:

$\mathrm{ATT}(q, k, v) = \mathrm{softmax}\!\left(\frac{q k^{T}}{\sqrt{D_H}}\right) v$  (formula 7)
4-2. The feed-forward layer;
The feed-forward layer consists of a two-layer perceptron and applies a non-linear transformation to the output features of the multi-head attention module. For a given feature $F \in \mathbb{R}^{m \times D}$, the output feature $F_{ffn} \in \mathbb{R}^{m \times D}$ is computed as:

$F_{ffn} = \mathrm{FFN}(F) = \sigma(F W_1) W_2$  (formula 8)

where $W_1 \in \mathbb{R}^{D \times D_{ff}}$ and $W_2 \in \mathbb{R}^{D_{ff} \times D}$ are linear projection matrices and $\sigma$ is the non-linear activation.
4-3. The self-attention layer;
Each self-attention layer consists of the multi-head attention module and feed-forward layer described above. For a given input $F_{input}$, the output feature $F_{output}$ is computed as:

$\tilde{F} = \mathrm{LN}(F_{input} + \mathrm{MHA}(F_{input}, F_{input}, F_{input}))$  (formula 9)

$F_{output} = \mathrm{LN}(\tilde{F} + \mathrm{FFN}(\tilde{F}))$  (formula 10)

where LN denotes layer normalization.
4-4. Stacking self-attention layers;
The deep self-attention network is a stack of self-attention layers. Since a self-attention layer does not change the feature dimension, multiple self-attention layers can be connected in series to form the deep self-attention network Model:

$\mathrm{Model} = [\mathrm{Layer}^{(1)}, \mathrm{Layer}^{(2)}, \ldots, \mathrm{Layer}^{(L)}]$  (formula 11)

where L is the number of self-attention layers.
Step (5): designing the width slicing strategy;
Each self-attention layer in the deep self-attention network consists of several parameter matrices. To accommodate input features of different dimensions, each parameter matrix must be sliced so as to match inputs of different dimensions and output features of the appropriate dimension. For an input feature of dimension d, the width slicing strategy keeps the output dimension equal to d, preserving the original structural proportions of the self-attention layer. Notably, sub-models with input features of different dimensions share the parameter matrices of the self-attention layer: the smaller d is, the smaller the shared portion of the parameters; when d equals the original input dimension D, the parameter matrix is not sliced.
For the parameter matrices $W^Q, W^K, W^V \in \mathbb{R}^{D \times D}$ in multi-head attention and an input feature of dimension d, the size $D_H$ of each attention head is kept unchanged, while the input-matching dimension and the number of attention heads of the corresponding parameter matrices are changed. The sliced parameter matrices are thus $\hat{W}^Q, \hat{W}^K, \hat{W}^V \in \mathbb{R}^{d \times \hat{H} D_H}$, where $\hat{H} = d / D_H$ denotes the number of attention heads after slicing. The other parameter matrices of the self-attention layer, $W_0$, $W_1$, and $W_2$, follow the same strategy, so the sliced matrices are $\hat{W}_0 \in \mathbb{R}^{d \times d}$, $\hat{W}_1 \in \mathbb{R}^{d \times d_{ff}}$, and $\hat{W}_2 \in \mathbb{R}^{d_{ff} \times d}$, with $d_{ff}$ scaled in the same proportion as d.
Step (6): designing the depth slicing strategy;
The deep self-attention network is a stack of self-attention layers, whose count is denoted L. When the number of layers l of a sub-model satisfies l < L, the depth slicing strategy selects which l of the L layers belong to the sub-model. The invention provides a simple and effective depth slicing strategy so that, under any layer-count setting, a sub-model picks the most important self-attention layers as far as possible, improving the final accuracy of the different sub-models.
For a deep self-attention network with L layers, denote the layer indices by [1, 2, ..., L]. The invention holds that self-attention layers closer to the input and the output are more important, which means the middle layers are relatively less important; when l < L, layers are discarded starting from the middle. Concretely, the layer indices are first sorted by importance from high to low; for a sub-model with l layers, the first l entries of the sorted index list are taken and then re-sorted to restore the original layer order, yielding the layer indices of the final l-layer sub-model. This is the depth slicing strategy.
Step (7): combining the two slicing strategies and designing the filtering principle;
Through the designs of steps (5) and (6), each sub-model has a width d and a depth l. Under the same parameter count and computation, deep-and-narrow sub-models are more efficient and structurally more reasonable than shallow-and-wide ones. The filtering principle yields the screened sub-model structure candidate set $\Omega$; it reduces the training cost of the model and improves the accuracy of the trained sub-models.
Given a width-ratio candidate set $\Omega_{width}$ and a depth-ratio candidate set $\Omega_{depth}$, combining the candidate sets of the two dimensions yields a preliminary sub-model structure candidate set $\tilde{\Omega} = \Omega_{width} \times \Omega_{depth}$, in which each sub-model structure is $a(d, l)$ with width d and depth l. To express the 'deep-and-narrow' filtering principle conveniently, a two-dimensional index matrix $I \in \{0, 1\}^{|\Omega_{width}| \times |\Omega_{depth}|}$ is defined to further process the preliminary candidate set $\tilde{\Omega}$: I(d, l) = 1 means the sub-model a(d, l) is selected, and I(d, l) = 0 means it is discarded. The index matrix I is initialized to all ones, and its lower-triangular part (the shallow-and-wide combinations) is set to zero. The finally selected sub-model set $\Omega$ is defined as:

$\Omega = \{\, a(d, l) \mid I(d, l) = 1,\ a(d, l) \in \tilde{\Omega} \,\}$  (formula 12)
and (8): designing a self-distillation training algorithm and training a model;
aiming at the sub-model structure candidate set obtained in the step (7)
Figure BDA00035875252300001811
A self-distillation training strategy is provided, so that each sub-model can be fully trained. Firstly, a teacher network is trained by utilizing the deep self-attention network in the step (4), a bidirectional partitionable deep self-attention network is constructed, when a submodel in the bidirectional partitionable deep self-attention network is trained, images and problems are input into the teacher network to obtain a prediction vector, namely a soft label, and a candidate set is sampled during training through a submodel sampling strategy
Figure BDA00035875252300001812
The soft label is used as a supervision label of the sampled submodel for training;
defining a teacher network constructed by a deep self-attention network as MteacherThe bidirectional separable deep self-attention network is MDSTBy training teacher network MteacherObtains its parameter weight theta and uses this weight to initialize the bidirectional separable deep self-attention network MDSTWeight of thetaDST. Through a sub-model sampling strategy, a candidate set is sampled during training
Figure BDA0003587525230000191
The sub-model sampling strategy is as follows: keeping the k submodels sampled at each iteration, and setting the initial submodel structure candidate set as omega ═ asAl } wherein asTo represent
Figure BDA0003587525230000192
A minimum submodel oflTo represent
Figure BDA0003587525230000193
Then randomly sampling
Figure BDA0003587525230000194
And adding the k-2 submodels into a submodel structure candidate set omega to serve as a final submodel candidate set of the iteration. Inputting the input characteristic of each iteration as x into the teacher network MteacherGet the soft label y ═ Mteacher(x) And freezes its gradient y. Then traversing each submodel a E omega in the submodel structure candidate set omega, and inputting the input characteristic x into the current submodel to obtain a prediction vector
Figure BDA0003587525230000195
Predicting the result using this submodel
Figure BDA0003587525230000196
Soft label y calculation loss with teacher network output
Figure BDA0003587525230000197
KD represents a loss function, gradient accumulation loss is carried out on different submodels sampled from omega, and when all submodels in a submodel structure candidate set omega generated by each iteration traverse, the model weight theta is updated uniformlyDST
The model deployment and application in step (9) are as follows:
If the computing resources of the current device are sufficient, the largest sub-model $a_l$ is applied, and forward propagation yields $\hat{y}_{a_l}$, which has the best representation capability among the sub-models. When the device's computing resources are insufficient, the smallest sub-model $a_s$ is applied, and forward propagation yields $\hat{y}_{a_s}$; because $a_s$ requires the least computation of all sub-models, forward propagation is greatly accelerated, which improves the user experience, while $\hat{y}_{a_s}$ still retains good representation capability.
In summary, the bidirectional separable deep self-attention network provided by the invention can dynamically select sub-models of different sizes from $\Omega$ according to the computing-resource state of the current device, realizing a dynamic balance between accuracy and latency: sub-model accuracy is maintained while the user experience is assured.
As shown in FIG. 1 and FIG. 2, the present invention provides a lightweight visual question-answering method based on a bidirectional separable deep self-attention network.
The partitioning of the data set in step (1) is as follows:
The final training set contains 115K images and 1.1M questions, the validation set contains 5K images and 26K questions, and the test set contains 80K images and 448K questions.
The construction of the image visual features in step (2) is as follows:
An image typically yields 36 candidate boxes, and the visual feature extracted from each candidate box has dimension 2048. The final mapping dimension D is adjusted to the deep self-attention network; taking D = 512 as an example, this step produces the image visual features $X_{input} \in \mathbb{R}^{36 \times 512}$.
The construction of the question semantic features in step (3) is as follows:
A question is typically given a fixed length of 14 words, and the semantic feature extracted for each word with the pre-trained word-vector model has dimension 300. The final mapping dimension D is adjusted to the deep self-attention network; taking D = 512 as an example, this step produces the question semantic features $Y_{input} \in \mathbb{R}^{14 \times 512}$.
Step (4) is instantiated as follows:
With D = 512 and H = 8, the input feature $F_{input} \in \mathbb{R}^{m \times 512}$ is fed into the multi-head attention module MHA to obtain the output $F_{mha} \in \mathbb{R}^{m \times 512}$, and $F_{mha}$ is then fed into the feed-forward layer FFN to obtain the final output $F_{output} \in \mathbb{R}^{m \times 512}$.
The width slicing strategy in step (5) is instantiated as follows:
The invention defines the shareable width-ratio candidate set as $\Omega_{width} = \{1/4, 2/4, 3/4, 4/4\}$, so the input feature dimension of a sub-model under the different width-slicing ratios is $d \in \{D/4, 2D/4, 3D/4, D\}$. When D = 512, the candidate width dimensions are $d \in \{128, 256, 384, 512\}$; that is, the sub-model width has 4 choices: 128, 256, 384, and 512.
The depth slicing strategy in step (6) is instantiated as follows:
The invention defines the shareable depth-ratio candidate set as $\Omega_{depth} = \{1/6, 1/3, 2/3, 1\}$, so the number of layers of a sub-model under the different depth-slicing ratios is $l \in \{L/6, L/3, 2L/3, L\}$. When L = 12, $l \in \{2, 4, 8, 12\}$; that is, the sub-model depth has 4 choices: 2, 4, 8, and 12 layers.
The combination of the two slicing strategies and the filtering principle in step (7) is instantiated as follows:
According to the width-ratio candidate set $\Omega_{width}$ and the depth-ratio candidate set $\Omega_{depth}$ defined in steps (5) and (6), combining the candidate sets of the two dimensions yields the preliminary sub-model structure candidate set $\tilde{\Omega}$ of $4 \times 4 = 16$ structures $a(d, l)$ with $d \in \{128, 256, 384, 512\}$ and $l \in \{2, 4, 8, 12\}$. Applying the filtering principle discards the shallow-and-wide combinations (for example, d = 512 with l = 2) and yields the final sub-model structure candidate set $\Omega \subset \tilde{\Omega}$.
The self-distillation training algorithm in step (8) is instantiated as follows:
The invention sets k = 4, meaning that each iteration samples the 1 largest sub-model, the 1 smallest sub-model, and 2 further randomly sampled sub-models. The 4 sub-models sampled per iteration accumulate their gradients together.

Claims (10)

1. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network is characterized by comprising the following steps of:
step (1): dividing the data set;
step (2): constructing the visual features of the image;
for a given image, detecting the number m of candidate boxes and their positions in the image with an existing trained object detection network; for each candidate box, feeding the image region of the candidate box into the object detection network, and extracting the features before the network classification layer as the features of that candidate box; then concatenating the features extracted for each candidate box to form the visual features of the given image; finally, so that the image feature dimension matches the deep self-attention network, further processing the image features with a learnable linear transformation and mapping them into a D-dimensional space;
step (3): constructing the semantic features of the question;
for a given question, extracting a semantic feature from each word of the question with a trained word-vector model, and then concatenating the extracted word features to form the question semantic features; finally, so that the question feature dimension matches the deep self-attention network, further processing the question features with a learnable linear transformation and mapping them into a D-dimensional space;
step (4): constructing the deep self-attention network;
the deep self-attention network is a stack of self-attention layers, each of which consists of two parts: a multi-head attention module and a feed-forward layer; the deep self-attention network is used to construct a teacher network for guiding training and the final bidirectional separable deep self-attention network; so that input features match the dimension of each sub-model in the bidirectional separable network, the network accepts features of dimension D as input and maps them to dimension d through a linear projection;
step (5): designing the width slicing strategy;
each self-attention layer in the deep self-attention network consists of several parameter matrices; to accommodate input features of different dimensions, each parameter matrix must be sliced so as to match inputs of different dimensions and output features of the appropriate dimension; for an input feature of dimension d, the width slicing strategy keeps the output dimension equal to d, preserving the original structural proportions of the self-attention layer; notably, sub-models with input features of different dimensions share the parameter matrices of the self-attention layer, and the smaller d is, the smaller the shared portion of the parameters; when d equals the original input dimension D, the parameter matrix is not sliced;
step (6): designing the depth slicing strategy;
the deep self-attention network is a stack of self-attention layers, whose count is denoted L; when the number of layers l of a sub-model satisfies l < L, the depth slicing strategy selects which l of the L layers belong to the sub-model;
step (7): combining the two slicing strategies and designing the filtering principle;
through the designs of steps (5) and (6), each sub-model has a width d and a depth l; under the same parameter count and computation, deep-and-narrow sub-models are more efficient and structurally more reasonable than shallow-and-wide ones; a 'deep-and-narrow' filtering principle is proposed: before model training, sub-models with many layers and low width are selected, and sub-models with few layers and high width are discarded outright; the filtering principle yields the screened sub-model structure candidate set $\Omega$;
And (8): designing a self-distillation training algorithm and training a model;
aiming at the sub-model structure candidate set obtained in the step (7)
Figure FDA0003587525220000031
Providing a self-distillation training strategy to fully train each sub-model; firstly, a teacher network is trained by utilizing the deep self-attention network in the step (4), a bidirectional partitionable deep self-attention network is constructed, when a submodel in the bidirectional partitionable deep self-attention network is trained, images and problems are input into the teacher network to obtain a prediction vector, namely a soft label, and a candidate set is sampled during training through a submodel sampling strategy
Figure FDA0003587525220000032
The soft label is used as a supervision label of the sampled submodel for training;
step (9): model deployment and application.
2. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 1, wherein the partitioning of the data set in step (1) is as follows:
the VQA-v2 data set is adopted and divided into 3 subsets: a training set, a validation set, and a test set; the training set is used to train the model, the validation set to verify model convergence locally, and the test set for the final evaluation of model performance.
3. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 2, wherein the construction of image visual features in step (2) is specifically as follows:
for a given image, the number m of candidate boxes and their positions are inferred using an existing trained Faster R-CNN object detection network, and the image region corresponding to each candidate box is input into the Faster R-CNN object detection network to extract its visual feature; for the i-th candidate box, the corresponding visual feature is $x_i \in \mathbb{R}^{d_x}$, and the visual feature of the whole image, $X_{image} \in \mathbb{R}^{m \times d_x}$, is formed by concatenating the visual features of all candidate boxes:

$X_{image} = [x_1, x_2, \ldots, x_i, \ldots, x_m]$  (Equation 1)

subsequently, a learnable linear transformation $\mathrm{Linear}: \mathbb{R}^{d_x} \rightarrow \mathbb{R}^{D}$ is used to further process the image feature $X_{image}$, mapping it into the D-dimensional space to obtain the final image visual feature $X_{input} \in \mathbb{R}^{m \times D}$:

$X_{input} = \mathrm{Linear}(X_{image})$  (Equation 2).
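For illustration only, a minimal PyTorch sketch of Equations 1 and 2; the values of m, d_x and D are assumptions not fixed by the claim, and the random tensors stand in for real Faster R-CNN region features:

```python
import torch
import torch.nn as nn

# Assumed dims: m candidate boxes, d_x-dimensional region features, target width D.
m, d_x, D = 36, 2048, 512

x_boxes = [torch.randn(d_x) for _ in range(m)]   # x_1 .. x_m, one per candidate box
X_image = torch.stack(x_boxes)                   # Equation 1: shape (m, d_x)

linear = nn.Linear(d_x, D)                       # learnable linear transformation
X_input = linear(X_image)                        # Equation 2: shape (m, D)
```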
4. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 3, wherein the construction of question semantic features in step (3) is specifically as follows:
for a given question containing n words, each word is input into a pre-trained GloVe word vector model to extract its semantic feature; for the j-th word, the corresponding semantic feature is $y_j \in \mathbb{R}^{d_y}$, and the semantic feature of the whole question, $Y_{question} \in \mathbb{R}^{n \times d_y}$, is formed by concatenating the semantic features of all words:

$Y_{question} = [y_1, y_2, \ldots, y_j, \ldots, y_n]$  (Equation 3)

subsequently, a learnable linear transformation $\mathrm{Linear}: \mathbb{R}^{d_y} \rightarrow \mathbb{R}^{D}$ is used to further process the question semantic feature $Y_{question}$, mapping it into the D-dimensional space to obtain the final question semantic feature $Y_{input} \in \mathbb{R}^{n \times D}$:

$Y_{input} = \mathrm{Linear}(Y_{question})$  (Equation 4).
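A matching sketch of Equations 3 and 4; the 300-dimensional random vectors stand in for pre-trained GloVe embeddings, and n, d_y, D are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Assumed dims: n words, d_y-dimensional GloVe vectors, target width D.
n, d_y, D = 14, 300, 512

y_words = [torch.randn(d_y) for _ in range(n)]   # y_1 .. y_n, one per word
Y_question = torch.stack(y_words)                # Equation 3: shape (n, d_y)
Y_input = nn.Linear(d_y, D)(Y_question)          # Equation 4: shape (n, D)
```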
5. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 4, wherein the construction of the deep self-attention network in step (4) is specifically as follows:
the deep self-attention network is formed by stacking a plurality of self-attention layers, each of which is divided into two parts: a multi-head attention module and a feedforward layer; the deep self-attention network is used to construct both the teacher network for guiding training and the final bidirectional separable deep self-attention network, and the two adopt deep self-attention networks of identical structure; so that the input features can match the dimension of each submodel in the bidirectional separable deep self-attention network, the deep self-attention network accepts the image visual features $X_{input} \in \mathbb{R}^{m \times D}$ and the question semantic features $Y_{input} \in \mathbb{R}^{n \times D}$ of dimension D as input and maps them to dimension d through a linear projection transformation; the deep self-attention network fully learns the interaction information between the two modalities and finally generates a semantically rich visual-semantic fusion feature;
4-1. multi-head attention module;
for a given query feature $Q \in \mathbb{R}^{m \times D}$, key feature $K \in \mathbb{R}^{m \times D}$ and value feature $V \in \mathbb{R}^{m \times D}$, the multi-head attention module computes the feature $F_{mha} \in \mathbb{R}^{m \times D}$ using H parallel attention functions:

$F_{mha} = \mathrm{MHA}(Q, K, V) = [head_1, head_2, \ldots, head_H] W_0$  (Equation 5)

$head_h = \mathrm{ATT}(Q W_h^Q, K W_h^K, V W_h^V)$  (Equation 6)

where $W_h^Q, W_h^K, W_h^V \in \mathbb{R}^{D \times D_H}$ are the mapping matrices of the h-th attention head, and $D_H$ is the dimension of each attention head, which can be computed as $D_H = D / H$; in addition, $W_0 \in \mathbb{R}^{D \times D}$ is used to further process the output features of the multi-head attention function; the attention function ATT is computed as:

$\mathrm{ATT}(\hat{Q}, \hat{K}, \hat{V}) = \mathrm{softmax}\!\left(\hat{Q} \hat{K}^{\top} / \sqrt{D_H}\right) \hat{V}$  (Equation 7)
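As a sketch of Equations 5 to 7, the following PyTorch module runs H parallel scaled dot-product attention heads; packing the per-head matrices $W_h^Q, W_h^K, W_h^V$ into single D-by-D projections is an implementation convenience assumed here, not part of the claim:

```python
import math
import torch
import torch.nn as nn

class MHA(nn.Module):
    """Sketch of Equations 5-7: H parallel attention heads of size D_H = D / H."""
    def __init__(self, D: int, H: int):
        super().__init__()
        self.H, self.D_H = H, D // H
        self.W_Q = nn.Linear(D, D)   # packs W_h^Q for all heads
        self.W_K = nn.Linear(D, D)
        self.W_V = nn.Linear(D, D)
        self.W_0 = nn.Linear(D, D)   # output mapping W_0

    def forward(self, Q, K, V):
        def split(x):                # (n, D) -> (H, n, D_H)
            return x.view(x.size(0), self.H, self.D_H).transpose(0, 1)
        q, k, v = split(self.W_Q(Q)), split(self.W_K(K)), split(self.W_V(V))
        att = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.D_H), dim=-1)
        heads = (att @ v).transpose(0, 1).reshape(Q.size(0), -1)  # [head_1,...,head_H]
        return self.W_0(heads)       # Equation 5

Q = K = V = torch.randn(36, 512)
out = MHA(D=512, H=8)(Q, K, V)       # shape (36, 512)
```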
4-2. feedforward layer;
the feedforward layer consists of a two-layer perceptron and applies a nonlinear transformation to the output features of the multi-head attention module; for a given feature $F \in \mathbb{R}^{m \times D}$, the output feature $F_{ffn} \in \mathbb{R}^{m \times D}$ is computed as:

$F_{ffn} = \mathrm{FFN}(F) = \mathrm{ReLU}(F W_1) W_2$  (Equation 8)

where $W_1 \in \mathbb{R}^{D \times 4D}$ and $W_2 \in \mathbb{R}^{4D \times D}$ are linear transformation projection matrices;
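A minimal sketch of Equation 8; the 4x hidden expansion is an assumption, since the claim fixes only a two-layer perceptron with a nonlinearity:

```python
import torch
import torch.nn as nn

def make_ffn(D: int) -> nn.Module:
    return nn.Sequential(
        nn.Linear(D, 4 * D),   # W_1
        nn.ReLU(),             # nonlinear transformation
        nn.Linear(4 * D, D),   # W_2
    )

F_out = make_ffn(512)(torch.randn(36, 512))   # feature dimension preserved
```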
4-3. self-attention layer;
each self-attention layer consists of a multi-head attention module and a feedforward layer; for a given input $F_{input}$, the output feature $F_{output}$ is computed as:

$\hat{F} = \mathrm{LN}(F_{input} + \mathrm{MHA}(F_{input}, F_{input}, F_{input}))$  (Equation 9)

$F_{output} = \mathrm{LN}(\hat{F} + \mathrm{FFN}(\hat{F}))$  (Equation 10)

where LN denotes layer normalization;
4-4. stacking self-attention layers;
the deep self-attention network is formed by stacking a plurality of self-attention layers; since a self-attention layer does not change the feature dimension, multiple self-attention layers can be connected in series to form the deep self-attention network Model:

$Model = [Layer^{(1)}, Layer^{(2)}, \ldots, Layer^{(L)}]$  (Equation 11)

where L is the number of self-attention layers.
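A runnable sketch of Equations 9 to 11, assuming PyTorch's built-in nn.MultiheadAttention as the multi-head attention module and the same 4x feedforward expansion as above; D = 512, H = 8 and L = 6 are illustrative values only:

```python
import torch
import torch.nn as nn

class SelfAttentionLayer(nn.Module):
    """Sketch of Equations 9-10: residual MHA and FFN, each followed by LayerNorm."""
    def __init__(self, D: int, H: int):
        super().__init__()
        self.mha = nn.MultiheadAttention(D, H, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(D, 4 * D), nn.ReLU(), nn.Linear(4 * D, D))
        self.ln1, self.ln2 = nn.LayerNorm(D), nn.LayerNorm(D)

    def forward(self, F):
        F = self.ln1(F + self.mha(F, F, F, need_weights=False)[0])  # Equation 9
        return self.ln2(F + self.ffn(F))                            # Equation 10

# Equation 11: connect L self-attention layers in series.
model = nn.Sequential(*[SelfAttentionLayer(D=512, H=8) for _ in range(6)])
x = torch.randn(2, 36, 512)   # (batch, m, D)
y = model(x)                  # feature dimension is preserved: (2, 36, 512)
```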
6. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 5, wherein the width splitting strategy of step (5) is specifically as follows:
for the parameter matrices $W_h^Q, W_h^K, W_h^V \in \mathbb{R}^{D \times D_H}$ in multi-head attention and an input feature of dimension d, the size $D_H$ of each attention head is kept unchanged, while the input-matching dimension and the number of attention heads of the corresponding parameter matrices are changed, so that the finally sliced parameter matrices are $\hat{W}_h^Q, \hat{W}_h^K, \hat{W}_h^V \in \mathbb{R}^{d \times D_H}$, where $\hat{H} = d / D_H$ denotes the number of attention heads after slicing; the parameter matrices $W_0, W_1, W_2$ elsewhere in the self-attention layer adopt the same strategy, so that the finally sliced parameter matrices are $\hat{W}_0 \in \mathbb{R}^{d \times d}$, $\hat{W}_1 \in \mathbb{R}^{d \times 4d}$ and $\hat{W}_2 \in \mathbb{R}^{4d \times d}$.
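The slicing itself might look as follows in PyTorch; taking the leading d-dimensional block of each matrix is an assumption, as the claim fixes only the sliced shapes and the parameter sharing:

```python
import torch
import torch.nn as nn

D, D_H = 512, 64                       # full width, fixed per-head size

def slice_width(weight: torch.Tensor, d: int) -> torch.Tensor:
    """Slice a (D, D) attention parameter matrix down to input dimension d,
    keeping D_H fixed so that H_hat = d / D_H heads survive."""
    H_hat = d // D_H
    return weight[: H_hat * D_H, :d]   # a view: parameters stay shared

W_Q = nn.Linear(D, D, bias=False).weight   # full matrix, H = D / D_H heads
W_Q_hat = slice_width(W_Q, d=256)          # shape (256, 256): 4 heads of size 64
```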
7. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 6, wherein the depth splitting strategy of step (6) is specifically as follows:
for a deep self-attention network with L layers, the index of each layer is recorded as $[1, 2, \ldots, L]$; the invention considers that the closer a self-attention layer is to the input or the output, the more important it is, which means the middle layers are relatively less important; when the number of layers l of a submodel satisfies l < L, layers are therefore discarded starting from the middle; specifically, the layers are first sorted by importance from large to small to obtain the layer indices $[s_1, s_2, \ldots, s_L]$; for a submodel with l layers, the first l items $[s_1, s_2, \ldots, s_l]$ of the sorted layer indices are taken and then re-sorted to restore the original layer index order, yielding the final l-layer submodel of the depth splitting strategy.
8. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 7, wherein the combination of the two splitting strategies and the design of the filtering principle in step (7) are specifically as follows:
for a given width ratio candidate set $S_{width}$ and depth ratio candidate set $S_{depth}$, combining the candidate sets of the two dimensions yields a preliminary submodel structure candidate set $\mathcal{A}$, each submodel structure being $a(d, l)$ with $d \in S_{width}$ and $l \in S_{depth}$; to conveniently express the 'deep and narrow' filtering principle, a two-dimensional index matrix $I \in \{0, 1\}^{|S_{width}| \times |S_{depth}|}$ is defined to further process the preliminary submodel candidate set $\mathcal{A}$: $I(d, l) = 1$ indicates that the submodel $a(d, l)$ is selected, and $I(d, l) = 0$ indicates that the submodel $a(d, l)$ is discarded; the index matrix I is initialized to all 1 values, after which its lower triangular part is set to 0; finally, the selected submodel set $\tilde{\mathcal{A}}$ is specifically defined as:

$\tilde{\mathcal{A}} = \{\, a(d, l) \mid I(d, l) = 1 \,\}$  (Equation 12)
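A sketch of the index-matrix construction and Equation 12; the concrete ratio candidate sets below are illustrative assumptions:

```python
widths = [0.25, 0.5, 0.75, 1.0]   # width ratios d/D, small to large
depths = [0.25, 0.5, 0.75, 1.0]   # depth ratios l/L, small to large

# Index matrix I: all ones, then zero the lower-triangular part, i.e. the
# shallow-and-wide shapes whose width index exceeds their depth index.
I = [[0 if di > li else 1 for li in range(len(depths))]
     for di in range(len(widths))]

candidates = [(d, l)                         # the screened candidate set
              for di, d in enumerate(widths)
              for li, l in enumerate(depths)
              if I[di][li] == 1]
```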
9. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 8, wherein the self-distillation training algorithm of step (8) is specifically as follows:
the teacher network constructed from the deep self-attention network is defined as $M_{teacher}$, and the bidirectional separable deep self-attention network as $M_{DST}$; the parameter weights $\theta$ are obtained by training the teacher network $M_{teacher}$, and these weights are used to initialize the weights $\theta_{DST}$ of the bidirectional separable deep self-attention network $M_{DST}$; during training, submodels are sampled from the candidate set $\tilde{\mathcal{A}}$ through a submodel sampling strategy, which is as follows: k submodels are sampled at each iteration, with the initial submodel structure candidate set $\Omega = \{a_s, a_l\}$, where $a_s$ denotes the smallest submodel in $\tilde{\mathcal{A}}$ and $a_l$ denotes the largest submodel in $\tilde{\mathcal{A}}$; then k - 2 submodels are randomly sampled from $\tilde{\mathcal{A}}$ and added to the submodel structure candidate set $\Omega$ as the final submodel candidate set of this iteration; at each iteration, the input feature x is fed into the teacher network $M_{teacher}$ to obtain the soft label $y = M_{teacher}(x)$, whose gradient is frozen via y.detach(); then each submodel $a \in \Omega$ in the submodel structure candidate set $\Omega$ is traversed, the input feature x is fed into the current submodel to obtain the prediction vector $\hat{y}_a$, and the loss $\mathcal{L} = \mathrm{KD}(\hat{y}_a, y)$ is computed between this submodel's prediction $\hat{y}_a$ and the soft label y output by the teacher network, where KD denotes the loss function; gradients are accumulated across the different submodels sampled from $\Omega$, and once all submodels in the submodel structure candidate set $\Omega$ generated in this iteration have been traversed, the model weights $\theta_{DST}$ are updated in a single unified step.
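One iteration of this algorithm might be sketched as follows; student_dst(x, cfg) is a hypothetical callable that runs the submodel sliced to configuration cfg, and KL divergence is an assumed choice for the KD loss, which the claim leaves unspecified:

```python
import random
import torch
import torch.nn.functional as F

def train_step(x, teacher, student_dst, candidates, optimizer, k=4):
    with torch.no_grad():
        y_soft = teacher(x)                       # soft label, gradient frozen
    a_s, a_l = candidates[0], candidates[-1]      # smallest and largest submodels
    omega = [a_s, a_l] + random.sample(candidates[1:-1], k - 2)
    optimizer.zero_grad()
    for cfg in omega:                             # accumulate gradients per submodel
        y_pred = student_dst(x, cfg)
        loss = F.kl_div(F.log_softmax(y_pred, dim=-1),
                        F.softmax(y_soft, dim=-1), reduction="batchmean")
        loss.backward()                           # KD loss against the soft label
    optimizer.step()                              # one unified update of theta_DST
```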
10. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 9, wherein the model deployment and application in step (9) are as follows:
if the computing resources of the current device are sufficient, the largest submodel $a_l$ is applied, and the prediction $\hat{y}_{a_l}$ is obtained through forward propagation; in this case $\hat{y}_{a_l}$ has the best representation capability among all submodels; when the computing resources of the device are insufficient, the smallest submodel $a_s$ is adopted and $\hat{y}_{a_s}$ is obtained through forward propagation; because $a_s$ requires the least computation of all submodels, the forward propagation speed is greatly increased to improve the user experience, while $\hat{y}_{a_s}$ still retains good representation capability;
the bidirectional separable deep self-attention network can thus dynamically select submodels of different sizes from $\tilde{\mathcal{A}}$ according to the computing resource state of the current device, achieving a dynamic balance between accuracy and latency and maintaining submodel accuracy while ensuring the user experience.
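A sketch of this deployment rule; device_has_headroom() is a hypothetical stand-in for however the runtime probes its computing resources:

```python
def device_has_headroom() -> bool:
    """Hypothetical probe; a real deployment would query its own runtime."""
    return True

def pick_submodel(candidates):
    # Sufficient resources -> largest submodel a_l (best accuracy);
    # constrained resources -> smallest submodel a_s (fastest forward pass).
    a_s, a_l = candidates[0], candidates[-1]
    return a_l if device_has_headroom() else a_s

# e.g. cfg = pick_submodel(candidates); y = student_dst(x, cfg)
```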
CN202210369535.0A 2022-04-08 2022-04-08 Lightweight visual question-answering method based on bidirectional separable deep self-attention network Pending CN114647752A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210369535.0A CN114647752A (en) 2022-04-08 2022-04-08 Lightweight visual question-answering method based on bidirectional separable deep self-attention network

Publications (1)

Publication Number Publication Date
CN114647752A true CN114647752A (en) 2022-06-21

Family

ID=81997107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210369535.0A Pending CN114647752A (en) 2022-04-08 2022-04-08 Lightweight visual question-answering method based on bidirectional separable deep self-attention network

Country Status (1)

Country Link
CN (1) CN114647752A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863407A (en) * 2022-07-06 2022-08-05 宏龙科技(杭州)有限公司 Multi-task cold start target detection method based on visual language depth fusion
CN117216225A (en) * 2023-10-19 2023-12-12 四川大学 Three-mode knowledge distillation-based 3D visual question-answering method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination