CN114647752A - Lightweight visual question-answering method based on bidirectional separable deep self-attention network - Google Patents
Info
- Publication number
- Publication number: CN114647752A (application number CN202210369535.0A)
- Authority
- CN
- China
- Prior art keywords
- attention
- self
- network
- model
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/532—Query formulation, e.g. graphical querying
- G06F16/535—Filtering based on additional data, e.g. user or group profiles
- G06F16/90332—Natural language query formulation or dialogue systems
- G06F16/9035—Filtering based on additional data, e.g. user or group profiles
- G06F40/30—Semantic analysis
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
Abstract
The invention discloses a lightweight visual question-answering method based on a bidirectional separable deep self-attention network, and provides the bidirectional separable deep self-attention network itself. The bidirectional separable deep self-attention model can dynamically select a suitable submodel to predict the answer according to the currently available computing resources, achieving a balance between accuracy and latency, so that the user enjoys a good experience while the accuracy of the predicted answer is guaranteed.
Description
Technical Field
The invention belongs to the field of Visual Question Answering (VQA), and particularly relates to a lightweight visual question-answering method based on a bidirectional separable deep self-attention network.
Background
The visual question-answering task is a popular research problem in multi-modal learning and spans computer vision and natural language processing: given an image and a free-form natural-language question related to that image, a visual question-answering model must output the corresponding predicted answer. Unlike single-modality tasks, the visual question-answering task requires not only a correct understanding of the information in each modality but also an understanding of the information shared between modalities, which is generally more complex and difficult. The task has a wide range of real-life applications: it can make web image information more accessible to visually impaired users; it can promote better human-computer interaction systems and enhance the user experience; and it can improve a machine's comprehension of images, strengthening image retrieval.
The deep self-attention network (i.e., the Transformer) was originally proposed for the machine translation task in the natural language processing field; its core architecture is a stack of self-attention layers, each of which builds complex and dense interactions among the input features. After achieving the best results on machine translation, the deep self-attention network quickly attracted the attention of researchers in artificial intelligence and has been applied to its various sub-fields, including visual question answering. Because the deep self-attention network learns the interaction between visual and textual features well, this architecture has become the mainstream network structure in the visual question-answering field. However, while the deep self-attention network improves performance, its computational complexity places new demands on computing resources and storage space, which causes a serious problem: deploying such models for mobile devices depends on GPU cloud servers, so the limited computing resources of the mobile terminal are difficult to use directly, wasting resources and harming energy efficiency. At present there is no lightweight model in the visual question-answering field, which challenges model deployment and prevents users from enjoying the convenience of artificial-intelligence applications.
To meet the new challenge that deep learning models are difficult to deploy, a number of model compression methods have emerged. In single-modality fields such as computer vision and natural language processing, compressed models based on weight sharing, knowledge distillation, pruning and quantization have gradually appeared, compressing a model by a certain ratio to balance computation against accuracy. These methods, however, usually compress to one fixed ratio and can only yield a lightweight model of one fixed size. Today mobile devices are diverse and differ widely in computing performance, and even the same device provides different computing resources under different loads and battery levels. If a separate lightweight model were designed for each device or each load condition, the training overhead would grow in proportion to the number of models, and one device would have to store several models to cope with the various scenarios, incurring a large storage overhead as well.
Recently, slimmable and separable neural networks have offered a new idea: use a single model to handle all scenarios. When computing resources are sufficient, most of the model is used for forward propagation and prediction to obtain higher accuracy; when computing resources are limited, only a small fraction of the model's parameters is used, sacrificing a little accuracy for inference speed. If this idea can be exploited, and an efficient and reasonable segmentation and training strategy can be designed for the deep self-attention network, the mainstream model structure in visual question answering, it would be a new contribution to putting visual question-answering models into practice.
In view of the above, how to design an efficient separable deep self-attention network and apply it to visual question answering is a subject worthy of intensive research. This patent cuts into the task from several key points, addresses the difficulties of existing methods, and forms a complete and efficient lightweight visual question-answering method.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a lightweight visual question-answering method based on a bidirectional separable deep self-attention network. The invention mainly comprises two points:
1. By analyzing the internal structure of the deep self-attention network, an efficient and reasonable width segmentation strategy and depth segmentation strategy are designed, and the two are combined into a bidirectional strategy separable in both width and depth. For the submodels produced by the bidirectional strategy, the invention further proposes a 'deep-and-narrow' filtering principle to select a set of excellent, efficient substructures.
2. The bidirectional segmentation strategy and the filtering principle are combined with an existing visual question-answering model based on deep self-attention, and an efficient self-distillation training strategy is provided so that each submodel can be fully trained, finally yielding the bidirectional separable deep self-attention visual question-answering model.
The invention provides a lightweight Visual Question Answering method based on a bidirectional separable deep self-attention network. The core of the method is an efficient and reasonable width and depth segmentation strategy obtained by analyzing the internal structure of the deep self-attention network; the two single-dimension segmentation strategies are combined into a bidirectional strategy separable in width and depth. For the substructures produced by the bidirectional strategy, the invention proposes a 'deep-and-narrow' filtering principle that selects a set of excellent, efficient substructures; this principle improves the performance of each substructure and, at deployment time, requires no extra screening or post-processing, making it simple and easy to use. In addition, an efficient self-distillation training strategy is provided so that every submodel can be fully trained. The method can be combined with any existing visual question-answering model based on the deep self-attention network and trained into a bidirectional separable deep self-attention network in which every submodel is capable of performing the visual question-answering task. When the model is deployed on edge devices with limited resources and large performance fluctuations, the bidirectional separable deep self-attention model can dynamically select a suitable submodel to predict the answer according to the current computing resources, achieving a balance between accuracy and latency and guaranteeing the accuracy of the predicted answer while the user enjoys a good experience.
The lightweight visual question-answering method based on the bidirectional separable deep self-attention network comprises the following steps:
step (1): dividing the data set;
step (2): constructing visual characteristics of the image;
for a given image, the number m of candidate boxes and their positions in the image are detected using an existing trained object detection network; for each candidate box, the image region corresponding to that box is input into the object detection network, and the features just before the network's classification layer are extracted as the features of the candidate box. The extracted features of all candidate boxes are then concatenated to form the visual features of the given image. Finally, so that the dimension of the image features matches the deep self-attention network, the image features are further processed with a learnable linear transformation and mapped to a D-dimensional space;
and (3): constructing semantic features of the problem;
for a given question, the semantic feature of each word in the question is extracted using a trained word vector model, and the extracted word semantic features are concatenated to form the question's semantic features. Finally, so that the question semantic feature dimension matches the deep self-attention network, a learnable linear transformation further processes the question features and maps them to a D-dimensional space;
and (4): constructing a depth self-attention network;
the deep self-attention network is formed by stacking a plurality of self-attention layers, each of which is divided into two parts: a multi-head attention module and a feed-forward layer. The deep self-attention network is used to construct both a teacher network that guides training and the final bidirectional separable deep self-attention network. So that the input features can match the dimension of each submodel in the bidirectional separable deep self-attention network, the network accepts features of dimension D as input and maps them to dimension d through a linear projection transformation.
And (5): designing a width segmentation strategy;
each self-attention layer in the deep self-attention network consists of several parameter matrices; to adapt to input features of different dimensions, each parameter matrix must be segmented so as to match inputs of different dimensions and output features of the appropriate dimension. For an input feature of dimension d, in order to preserve the original structural proportions of the self-attention layer, the width segmentation strategy keeps the output feature dimension at d as well. Note that submodels with input features of different dimensions share the parameter matrices of the self-attention layer: the smaller d is, the fewer shared parameters are used, and when d equals the original input dimension D, the parameter matrices are not segmented at all;
and (6): designing a depth segmentation strategy;
the deep self-attention network is formed by stacking self-attention layers, the number of which is denoted L. When the number of layers l of a submodel is less than L, l layers of the deep self-attention network are selected for the submodel according to the depth segmentation strategy. A simple and effective depth segmentation strategy is proposed so that, under different layer-number settings, each submodel picks out the more important self-attention layers as far as possible, improving the final accuracy of the different submodels;
and (7): combining two segmentation strategies and designing a filtering principle;
through the designs of steps (5) and (6), each submodel has a width d and a depth l. Under the same parameter count and computation, deep-and-narrow submodels have a more efficient and reasonable structure than shallow-and-wide ones, so a 'deep-and-narrow' filtering principle is proposed: before model training, submodels with many layers and low width are selected, and submodels with few layers and high width are discarded directly. Through this filtering principle, a screened submodel structure candidate set is obtained;
And (8): designing a self-distillation training algorithm and training a model;
for the submodel structure candidate set obtained in step (7), a self-distillation training strategy is provided so that each submodel can be fully trained. First, a teacher network is trained using the deep self-attention network of step (4), and the bidirectional separable deep self-attention network is constructed. When training the submodels of the bidirectional separable network, the images and questions are input into the teacher network to obtain its prediction vectors, called soft labels; during training, submodels are sampled from the candidate set through a submodel sampling strategy, and the soft labels serve as the supervision labels of the sampled submodels;
and (9): model deployment and application;
further, the partitioning of the data set in step (1) is specifically as follows:
the data set adopts VQA-v2 data set, and is further divided into 3 subsets aiming at VQA-v2 data set: training set, validation set and test set. The training set is used for training the model, the verification set is used for locally verifying the convergence condition of the model, and the test set is used for final model performance evaluation.
Further, constructing the visual features of the image in step (2) is specifically as follows:
For a given image, the number m of candidate boxes and their positions are inferred using the existing trained Faster R-CNN object detection network, and the image region corresponding to each candidate box is input into the Faster R-CNN network to extract its visual features. For the i-th candidate box, the corresponding visual feature is x_i ∈ R^{D_x}, and the visual feature X_image ∈ R^{m×D_x} of the whole image is formed by concatenating the visual features of all candidate boxes:
X_image = [x_1, x_2, ..., x_i, ..., x_m] (formula 1)
Subsequently, a learnable linear transformation Linear(·) further processes the image feature X_image, mapping it to the D-dimensional space to obtain the final image visual feature X_input ∈ R^{m×D}:
X_input = Linear(X_image) (formula 2)
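As an illustration, the construction of formulas 1 and 2 can be sketched in numpy; the feature sizes and the random stand-in for the Faster R-CNN box features are assumptions made for the example only, not values fixed by the invention:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: m candidate boxes, detector feature dim D_x, target dim D
m, D_x, D = 5, 2048, 512

# Stand-in for the per-box detector features; formula 1 concatenates the
# m box features x_1..x_m into X_image of shape (m, D_x).
X_image = rng.standard_normal((m, D_x))

# Learnable linear transformation of formula 2, mapping into the D-dim space.
W = rng.standard_normal((D_x, D)) * 0.01
b = np.zeros(D)
X_input = X_image @ W + b  # final image visual features, shape (m, D)
```

The question semantic features of formulas 3 and 4 follow the same concatenate-then-project pattern, with word vectors in place of box features.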
Further, constructing the semantic features of the question in step (3) is specifically as follows:
For a given question containing n words, each word is input into a pre-trained GloVe word vector model to extract its semantic feature. For the j-th word, the corresponding semantic feature is y_j ∈ R^{D_y}, and the semantic feature Y_question ∈ R^{n×D_y} of the whole question is formed by concatenating the semantic features of all words:
Y_question = [y_1, y_2, ..., y_j, ..., y_n] (formula 3)
Subsequently, a learnable linear transformation Linear(·) further processes the question semantic feature Y_question, mapping it to the D-dimensional space to obtain the final question semantic feature Y_input ∈ R^{n×D}:
Y_input = Linear(Y_question) (formula 4)
Further, constructing the deep self-attention network in step (4) is specifically as follows:
The deep self-attention network is formed by stacking a plurality of self-attention layers, each of which is divided into two parts: a multi-head attention module and a feed-forward layer. The deep self-attention network is used to construct both a teacher network that guides training and the final bidirectional separable deep self-attention network; the two adopt deep self-attention networks of the same structure. So that the input features can match the dimension of each submodel in the bidirectional separable network, the deep self-attention network accepts the image visual features X_input of dimension D and the question semantic features Y_input of dimension D as input and maps them to dimension d through a linear projection transformation.
4-1. Multi-head attention module;
For given query features Q ∈ R^{n×D}, key features K ∈ R^{n×D} and value features V ∈ R^{n×D}, the multi-head attention module computes the output feature F_mha ∈ R^{n×D} using H parallel attention functions:
F_mha = MHA(Q, K, V) = [head_1, head_2, ..., head_H] W_0 (formula 5)
head_h = ATT(Q W_h^Q, K W_h^K, V W_h^V) (formula 6)
where W_h^Q, W_h^K, W_h^V ∈ R^{D×D_H} are the mapping matrices of the h-th attention head, D_H is the dimension of each attention head and can be calculated as D_H = D/H, and W_0 ∈ R^{D×D} further processes the output features of the multi-head attention function. The attention calculation ATT is scaled dot-product attention:
ATT(Q, K, V) = softmax(Q K^T / √D_H) V (formula 7)
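A minimal numpy sketch of the multi-head attention of formula 5, assuming standard scaled dot-product attention for the ATT function; all sizes and random weights are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mha(Q, K, V, Wq, Wk, Wv, W0):
    """H parallel scaled dot-product attention heads, concatenated and mapped by W0."""
    H, D_H = Wq.shape[0], Wq.shape[2]  # head count H and per-head dimension D_H
    heads = []
    for h in range(H):
        q, k, v = Q @ Wq[h], K @ Wk[h], V @ Wv[h]  # h-th head projections
        att = softmax(q @ k.T / np.sqrt(D_H))      # scaled dot-product attention
        heads.append(att @ v)                      # head_h
    return np.concatenate(heads, axis=-1) @ W0     # [head_1, ..., head_H] W0

rng = np.random.default_rng(0)
n, D, H = 4, 8, 2
Wq, Wk, Wv = (rng.standard_normal((H, D, D // H)) * 0.1 for _ in range(3))
W0 = rng.standard_normal((D, D)) * 0.1
X = rng.standard_normal((n, D))
F_mha = mha(X, X, X, Wq, Wk, Wv, W0)  # self-attention: Q = K = V = X
```

Note that the output keeps the input shape (n, D), which is what allows self-attention layers to be stacked.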
4-2. Feed-forward layer;
The feed-forward layer consists of a two-layer perceptron that applies a non-linear transformation to the output features of the multi-head attention module. For a given feature F ∈ R^{n×D}, the output feature F_ffn ∈ R^{n×D} is:
F_ffn = FFN(F) = ReLU(F W_1 + b_1) W_2 + b_2 (formula 8)
4-3. Self-attention layer;
Each self-attention layer consists of the multi-head attention module and the feed-forward layer described above. For a given input F_input, the output feature F_output is computed with residual connections:
F_mid = LN(F_input + MHA(F_input, F_input, F_input)) (formula 9)
F_output = LN(F_mid + FFN(F_mid)) (formula 10)
where LN denotes layer normalization.
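The residual-plus-layer-normalization structure of the self-attention layer can be sketched as follows; the placeholder attention module and the ReLU two-layer perceptron are assumptions made for illustration:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each feature vector to zero mean and unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(F, W1, b1, W2, b2):
    # two-layer perceptron with ReLU (activation choice assumed)
    return np.maximum(0.0, F @ W1 + b1) @ W2 + b2

def self_attention_layer(F_input, attention, feed_forward):
    # residual connection + layer normalization around each sub-module
    F_mid = layer_norm(F_input + attention(F_input))
    return layer_norm(F_mid + feed_forward(F_mid))

rng = np.random.default_rng(0)
n, D, D_ff = 4, 8, 16
W1, b1 = rng.standard_normal((D, D_ff)) * 0.1, np.zeros(D_ff)
W2, b2 = rng.standard_normal((D_ff, D)) * 0.1, np.zeros(D)

F_in = rng.standard_normal((n, D))
F_out = self_attention_layer(
    F_in,
    attention=lambda F: F * 0.5,  # placeholder standing in for the MHA module
    feed_forward=lambda F: ffn(F, W1, b1, W2, b2),
)
```

Because the layer preserves the feature dimension, several such layers can be chained directly, which is exactly the stacking of section 4-4.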
4-4. Stacking self-attention layers;
The deep self-attention network is formed by stacking a plurality of self-attention layers. Because a self-attention layer does not change the feature dimension, a plurality of self-attention layers can be connected in series to form the deep self-attention network Model:
Model = [Layer^(1), Layer^(2), ..., Layer^(L)] (formula 11)
where L is the number of self-attention layers.
Further, the width segmentation strategy in step (5) is specifically as follows:
For the parameter matrices W_Q, W_K, W_V ∈ R^{D×D} in multi-head attention and an input feature of dimension d, the size D_H of each attention head is kept invariant, while the input-matching dimension of the corresponding parameter matrices and the number of attention heads H are varied. The segmented parameter matrices are thus W_Q, W_K, W_V ∈ R^{d×d}, where Ĥ = d/D_H denotes the number of attention heads after segmentation. The parameter matrices W_0, W_1, W_2 in the rest of the self-attention layer adopt the same strategy, so that the finally segmented parameter matrices match the d-dimensional features.
Further, the depth segmentation strategy in step (6) is specifically as follows:
For a deep self-attention network with L layers, let the index of each layer be [1, 2, ..., L]. The invention holds that the self-attention layers closer to the input and to the output are more important, which means the middle layers are relatively less important; when the number of layers l of the submodel satisfies l < L, layers are discarded from the middle first. Specifically, the layer indexes are first sorted by importance from large to small; for a submodel with l layers, the first l items of the sorted index sequence are taken and then re-sorted to restore the original index order, yielding the final l-layer submodel. This completes the depth segmentation strategy.
Further, combining the two segmentation strategies and designing the filtering principle in step (7) is specifically as follows:
For a given width-scale candidate set Φ_d and depth-scale candidate set Φ_l, combining the candidate sets of the two dimensions yields a preliminary submodel structure candidate set Φ, in which each submodel structure is A(d, l) with d ∈ Φ_d and l ∈ Φ_l. To express the 'deep-and-narrow' filtering principle concisely, a two-dimensional index matrix I ∈ {0, 1}^{|Φ_d|×|Φ_l|} is defined to further process the preliminary candidate set: I(d, l) = 1 indicates that submodel A(d, l) is selected, and I(d, l) = 0 indicates that it is discarded. The index matrix I is initialized to all 1 values and its lower-triangular portion is converted to 0 values. The finally selected submodel candidate set is defined as:
Φ' = {A(d, l) | I(d, l) = 1} (formula 12)
further, the self-distillation training algorithm in step (8) specifically comprises the following steps:
defining a teacher network constructed by a deep self-attention network as MteacherThe bidirectional divisible depth self-attention network is MDSTBy trainingTeacher training network MteacherObtains its parameter weight theta and uses this weight to initialize the bidirectional separable deep self-attention network MDSTWeight of thetaDST. Through a sub-model sampling strategy, a candidate set is sampled during trainingAnd a sub-model, wherein the sub-model sampling strategy is as follows: keeping the k submodels sampled at each iteration, and setting the initial submodel structure candidate set as omega ═ as,alIn which a issRepresentA minimum submodel oflTo representThen randomly sampling the largest sub-modelK-2 submodels in the set are added into a submodel structure candidate set omega to serve as a final submodel candidate set of the iteration. Inputting the input characteristic of each iteration as x into the teacher network MteacherGet the soft label y ═ Mteacher(x) And freezes its gradient y. Then traversing each sub-model a E omega in the sub-model structure candidate set omega, and inputting the input characteristic x into the current sub-model to obtain a prediction vectorPredicting the result using this submodelSoft label y calculation loss with teacher network outputKD represents a loss function, gradient accumulation loss is carried out on different sub-models sampled from omega, and when the sub-models are iterated each time, sub-model structures are generatedAfter all the submodels in the candidate set omega are traversed, uniformly updating the model weight thetaDST。
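The per-iteration submodel sampling strategy (smallest plus largest plus k-2 random submodels) can be sketched as:

```python
import random

def sample_iteration(candidates, a_s, a_l, k, rng):
    """Always keep the smallest (a_s) and largest (a_l) submodels and add
    k-2 submodels drawn at random from the remaining candidates."""
    rest = [a for a in candidates if a not in (a_s, a_l)]
    return [a_s, a_l] + rng.sample(rest, k - 2)

# (width, depth) structures; the concrete values are illustrative
cands = [(128, 2), (128, 4), (128, 6), (256, 4), (256, 6), (512, 6)]
omega = sample_iteration(cands, a_s=(128, 2), a_l=(512, 6), k=4,
                         rng=random.Random(0))
```

Keeping the two extremes in every iteration ensures the full model and the smallest submodel are trained at every step, while the random middle samples cover the rest of the candidate set over time.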
Further, the model deployment and application in step (9) are specifically as follows:
If the computing resources of the current device are sufficient, the largest submodel a_l is applied and its prediction is obtained by forward propagation; at this time it has the best characterization capability among the submodels. When the computing resources of the device are insufficient, the smallest submodel a_s is adopted and its prediction is obtained by forward propagation; because a_s requires the least computation of all submodels, the forward propagation speed is greatly increased to improve the user experience, while its prediction still retains good characterization capability.
The bidirectional separable deep self-attention network can thus dynamically select submodels of different sizes for application according to the computing-resource state of the current device, achieving a dynamic balance between accuracy and latency and maintaining submodel accuracy while ensuring the user experience.
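Deployment-time selection can be sketched as follows; the width-times-depth cost proxy and the budget units are illustrative assumptions, not part of the invention:

```python
def select_submodel(candidates, budget):
    """Pick the largest submodel whose cost fits the current resource budget,
    falling back to the smallest submodel when nothing fits."""
    cost = lambda a: a[0] * a[1]  # (width d, depth l) -> rough compute proxy
    feasible = [a for a in candidates if cost(a) <= budget]
    return max(feasible, key=cost) if feasible else min(candidates, key=cost)

cands = [(128, 2), (128, 6), (256, 6), (512, 6)]
```

On a constrained device `select_submodel(cands, 2000)` yields a mid-sized structure, while an ample budget selects the full model.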
The invention has the following beneficial effects:
The invention provides a bidirectional separable deep self-attention network. Based on the designed bidirectional strategy separable in width and depth, the 'deep-and-narrow' filtering principle is adopted to further select reasonable submodels, and together with the proposed self-distillation algorithm, every submodel in the network acquires the capability to perform the visual question-answering task. The bidirectional separable deep self-attention model can dynamically select a suitable submodel to predict the answer according to the current computing resources, achieving a balance between accuracy and latency and guaranteeing the accuracy of the predicted answer while the user enjoys a good experience.
Drawings
FIG. 1 is a schematic diagram of a width-depth slicing strategy according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of model filtering principles according to an embodiment of the present invention.
Detailed Description
The specific parameters of the present invention are described in further detail below.
The lightweight visual question-answering method based on the bidirectional separable deep self-attention network comprises the following steps:
step (1): dividing the data set;
the data set adopts VQA-v2 data set, and is further divided into 3 subsets aiming at VQA-v2 data set: training set, validation set and test set. The training set is used for training the model, the verification set is used for locally verifying the convergence condition of the model, and the test set is used for final model performance evaluation.
Step (2): constructing visual characteristics of the image;
for a given image, the number m of candidate boxes and their positions in the image are detected using an existing trained object detection network; for each candidate box, the image region corresponding to that box is input into the object detection network, and the features just before the network's classification layer are extracted as the features of the candidate box. The extracted features of all candidate boxes are then concatenated to form the visual features of the given image. Finally, so that the dimension of the image features matches the deep self-attention network, the image features are further processed with a learnable linear transformation and mapped to a D-dimensional space. The specific method is as follows:
For a given image, the number m of candidate boxes and their positions are inferred using the existing trained Faster R-CNN object detection network, and the image region corresponding to each candidate box is input into the Faster R-CNN network to extract its visual features. For the i-th candidate box, the corresponding visual feature is x_i ∈ R^{D_x}, and the visual feature X_image ∈ R^{m×D_x} of the whole image is formed by concatenating the visual features of all candidate boxes:
X_image = [x_1, x_2, ..., x_i, ..., x_m] (formula 1)
Subsequently, a learnable linear transformation Linear(·) further processes the image feature X_image, mapping it to the D-dimensional space to obtain the final image visual feature X_input ∈ R^{m×D}:
X_input = Linear(X_image) (formula 2)
And (3): constructing semantic features of the problem;
for a given problem, semantic features are extracted from each word in the problem by using a trained word vector model, and then the extracted semantic features of the words are spliced to form the semantic features of the problem. In order to enable the problem semantic feature dimension to be matched with the deep self-attention network, finally, a learnable linear transformation is used for further processing the problem feature and mapping the problem feature to a D-dimensional space; the specific method comprises the following steps:
for a given question, which contains n words, each word is input into a pre-trained GloVe word vector model to extract its semantic features. For the jth word, the corresponding semantic feature isAnd the wholeSemantic features to which questions correspondThe word is formed by splicing semantic features corresponding to each word, and the concrete expression formula is as follows:
Y_question = [y_1, y_2, ..., y_j, ..., y_n]  (Formula 3)
Subsequently, a learnable linear transformation Linear: R^{300} → R^D is used to further process the question semantic feature Y_question, mapping it to a D-dimensional space to obtain the final question semantic feature Y_input ∈ R^{n×D}. The specific formula is:
Y_input = Linear(Y_question)  (Formula 4)
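Step (3) mirrors the image branch. In this sketch the GloVe lookups are mocked as random 300-dim vectors (n = 14 words as in the embodiment) and the learned projection by a random matrix; only the shapes are meaningful.

```python
import numpy as np

# Mocked GloVe embeddings (300-dim) projected to the shared D-dim space.
rng = np.random.default_rng(1)
n, glove_dim, D = 14, 300, 512

word_vectors = [rng.standard_normal(glove_dim) for _ in range(n)]  # y_1 ... y_n
Y_question = np.stack(word_vectors)             # (n, 300) -- Formula 3
W = rng.standard_normal((glove_dim, D)) * 0.01  # Linear: R^300 -> R^D
Y_input = Y_question @ W                        # (n, D)   -- Formula 4
```

After this step, image and question features live in the same D-dimensional space and can be fed to the same self-attention layers.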
Step (4): constructing the deep self-attention network;
The deep self-attention network is formed by stacking several self-attention layers, each of which is divided into two parts: a multi-head attention module and a feed-forward layer. The deep self-attention network is used to construct both a teacher network for guiding training and the final bidirectional separable deep self-attention network; the two adopt deep self-attention networks of identical structure. So that the input features match the dimension of each sub-model in the bidirectional separable deep self-attention network, the network accepts the image visual feature X_input ∈ R^{m×D} and the question semantic feature Y_input ∈ R^{n×D} as input and maps them to d dimensions through a linear projection. The deep self-attention network can fully learn the interaction between the two modalities and finally generates a semantically rich visual-semantic fusion feature.
4-1. Multi-head attention module;
For a given query feature Q ∈ R^{n×D}, key feature K ∈ R^{n×D} and value feature V ∈ R^{n×D}, the multi-head attention module computes the feature F_mha ∈ R^{n×D} using H parallel attention functions. The specific formulas are:
F_mha = MHA(Q, K, V) = [head_1, head_2, ..., head_H] W_0  (Formula 5)
head_h = ATT(Q W_h^Q, K W_h^K, V W_h^V)  (Formula 6)

wherein W_h^Q, W_h^K, W_h^V ∈ R^{D×D_H} denote the mapping matrices of the h-th attention head, and D_H, the dimension of each attention head, is computed as D_H = D/H; in addition, W_0 ∈ R^{D×D} is used for further processing the output features of the multi-head attention function. For the attention calculation ATT (scaled dot-product attention), the specific formula is as follows:

ATT(Q_h, K_h, V_h) = softmax(Q_h K_h^T / sqrt(D_H)) V_h  (Formula 7)
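The multi-head attention module described in 4-1 can be sketched as follows. This is a generic NumPy rendering of Formulas 5-7 under standard transformer conventions, with tiny toy dimensions and random weights; the per-head mapping matrices are stored as one (H, D, D_H) array per projection.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo, H):
    """H parallel scaled dot-product attention heads (Formulas 6-7),
    concatenated and mixed by Wo (Formula 5)."""
    D_H = Q.shape[-1] // H
    heads = []
    for h in range(H):
        q, k, v = Q @ Wq[h], K @ Wk[h], V @ Wv[h]     # (n, D_H) each
        att = softmax(q @ k.T / np.sqrt(D_H))         # (n, n) attention map
        heads.append(att @ v)                         # head_h
    return np.concatenate(heads, axis=-1) @ Wo        # F_mha, (n, D)

# Toy check: n = 4 tokens, D = 16, H = 2 heads.
rng = np.random.default_rng(2)
n, D, H = 4, 16, 2
X = rng.standard_normal((n, D))
Wq = rng.standard_normal((H, D, D // H)) * 0.1
Wk = rng.standard_normal((H, D, D // H)) * 0.1
Wv = rng.standard_normal((H, D, D // H)) * 0.1
Wo = rng.standard_normal((D, D)) * 0.1
F_mha = multi_head_attention(X, X, X, Wq, Wk, Wv, Wo, H)  # shape (4, 16)
```

With self-attention, Q, K and V are all the same feature matrix, as used in the self-attention layer of section 4-3.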
4-2. Feed-forward layer;
The feed-forward layer consists of a two-layer perceptron and applies a non-linear transformation to the output features of the multi-head attention module. For a given feature F ∈ R^{n×D}, the output feature F_ffn ∈ R^{n×D} is computed by the following formula:

F_ffn = FFN(F) = ReLU(F W_1 + b_1) W_2 + b_2  (Formula 8)

where W_1 and W_2 are the weight matrices, and b_1 and b_2 the biases, of the two perceptron layers.
4-3. Self-attention layer;
Each self-attention layer consists of the multi-head attention module and feed-forward layer described above. For a given input F_input, the output feature F_output is computed by the following formulas:

F' = LN(F_input + MHA(F_input, F_input, F_input))  (Formula 9)

F_output = LN(F' + FFN(F'))  (Formula 10)

where LN denotes layer normalization.
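One self-attention layer (sections 4-2 and 4-3) can be composed as below. This NumPy sketch assumes the standard pre-activation-free transformer layout (residual connection then layer norm); for brevity a single attention head stands in for the full multi-head module, and all weights are random.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(Q, K, V):
    # single-head stand-in for MHA
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    e = np.exp(s - s.max(-1, keepdims=True))
    return (e / e.sum(-1, keepdims=True)) @ V

def ffn(x, W1, b1, W2, b2):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2    # two-layer perceptron (Formula 8)

def self_attention_layer(F, W1, b1, W2, b2):
    a = layer_norm(F + attention(F, F, F))           # Formula 9
    return layer_norm(a + ffn(a, W1, b1, W2, b2))    # Formula 10

rng = np.random.default_rng(3)
n, D, D_ff = 4, 16, 64
F_in = rng.standard_normal((n, D))
W1, b1 = rng.standard_normal((D, D_ff)) * 0.1, np.zeros(D_ff)
W2, b2 = rng.standard_normal((D_ff, D)) * 0.1, np.zeros(D)
F_out = self_attention_layer(F_in, W1, b1, W2, b2)   # same shape as the input
```

Because input and output shapes agree, such layers stack freely, which is exactly what section 4-4 exploits.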
4-4. Stacking self-attention layers;
The deep self-attention network is formed by stacking several self-attention layers; since the feature dimension does not change after passing through a self-attention layer, multiple self-attention layers can be connected in series to form the deep self-attention network Model. The specific formula is:
Model = [Layer^(1), Layer^(2), ..., Layer^(L)]  (Formula 11)
Where L is the number of self-attention layers.
Step (5): designing the width segmentation strategy;
Each self-attention layer in the deep self-attention network consists of several parameter matrices, and to adapt to input features of different dimensions, each parameter matrix must be segmented so as to match inputs of different dimensions and produce output features of the appropriate dimension. For an input feature of dimension d, in order to preserve the original structural proportions of the self-attention layer, the width segmentation strategy keeps the output feature dimension at d. Notably, sub-models with input features of different dimensions share the parameter matrices in the self-attention layer: the smaller d is, the fewer parameters are shared; when d equals the original input dimension D, the parameter matrix is not segmented;
For the parameter matrices W^Q, W^K, W^V in multi-head attention and an input feature of dimension d, the size D_H of each attention head is kept unchanged, while the input-matching dimension and the number of attention heads of the corresponding parameter matrices are changed. The finally segmented parameter matrices thus satisfy W^Q, W^K, W^V ∈ R^{d×d}, where Ĥ = d/D_H denotes the number of attention heads after segmentation. The other parameter matrices W_0, W_1, W_2 in the self-attention layer adopt the same strategy, so that the finally segmented parameter matrices likewise match dimension d.
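The width segmentation strategy can be illustrated as follows. This sketch assumes one plausible slicing convention (taking the leading d rows and columns of each shared matrix); what the patent fixes is that the per-head size D_H stays constant, so the head count becomes Ĥ = d / D_H, and that sub-matrices are shared rather than copied.

```python
import numpy as np

D, H = 512, 8
D_H = D // H                                  # 64, unchanged across widths
W_full = np.random.default_rng(4).standard_normal((D, D))

def slice_width(W, d):
    """Width-d sub-model view of a shared parameter matrix (assumed convention:
    leading d x d block)."""
    return W[:d, :d]

d = 256
W_sub = slice_width(W_full, d)                # (256, 256)
H_hat = d // D_H                              # 4 attention heads after slicing
assert np.shares_memory(W_sub, W_full)        # parameters shared, not copied
```

Since the slice is a view into the full matrix, training any sub-model updates the same underlying parameters, which is what makes the widths "shareable".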
Step (6): designing the depth segmentation strategy;
The deep self-attention network is formed by stacking several self-attention layers, the number of which is denoted L. When the number of layers l of a sub-model is less than L, l layers of the deep self-attention network must be selected according to the depth segmentation strategy to form the sub-model. The invention provides a simple and effective depth segmentation strategy under which sub-models with different numbers of layers pick out the more important self-attention layers as far as possible, thereby improving the final accuracy of the different sub-models;
For a deep self-attention network with L layers, the index of each layer is recorded as [1, 2, ..., L]. The invention considers that the self-attention layers closer to the input and output are more important, which means the middle layers are relatively less important; when the number of layers l of a sub-model satisfies l < L, layers are discarded starting from the middle. Specifically, the layer indices are first sorted by importance from large to small; for a sub-model with l layers, the first l entries of the sorted index list are taken and then re-sorted to restore the original layer-index order, yielding the final l-layer sub-model of the depth segmentation strategy.
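The depth segmentation rule above can be sketched compactly. The patent only states that layers nearer the input/output ends are more important; scoring each layer by its distance to the nearer end is one simple assumed realization of that ordering.

```python
def depth_slice(L, l):
    """Keep the l layers closest to the ends, discarding the middle first,
    then restore the original layer order."""
    importance_order = sorted(range(1, L + 1), key=lambda i: min(i - 1, L - i))
    kept = importance_order[:l]        # the l most important layer indices
    return sorted(kept)                # back to original index order

print(depth_slice(12, 4))   # [1, 2, 11, 12] -- the middle is dropped first
print(depth_slice(12, 12))  # all 12 layers when l == L
```

Re-sorting at the end matters: the sub-model must execute the retained layers in their original stacking order, not in importance order.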
Step (7): combining the two segmentation strategies and designing the filtering principle;
Through the designs of steps (5) and (6), each sub-model has a width d and a depth l. Under the same parameter count and computation, deep-and-narrow sub-models are more efficient and structurally more reasonable than shallow-and-wide ones. Through the filtering principle, a screened candidate set of sub-model structures is obtained. The filtering principle reduces the cost of model training and improves the accuracy of the trained sub-models;
For a given width-ratio candidate set Ω_d and depth-ratio candidate set Ω_l, combining the candidate sets of the two dimensions yields a preliminary candidate set of sub-model structures, in which each sub-model structure is a(d, l) with d ∈ Ω_d and l ∈ Ω_l. To express the "deep-and-narrow" filtering principle conveniently, a two-dimensional index matrix I ∈ {0, 1}^{|Ω_d|×|Ω_l|} is defined to further process the preliminary candidate set: I(d, l) = 1 indicates that sub-model a(d, l) is selected, and I(d, l) = 0 indicates that it is discarded. The index matrix I is initialized to all 1s, after which its lower-triangular portion is converted to 0s. Finally, the set of selected sub-models is defined as Ω = {a(d, l) | I(d, l) = 1}.
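The index-matrix filtering can be made concrete as below. Assumed conventions in this sketch: rows index widths ascending, columns index depths ascending, and the main diagonal survives (only the strict lower triangle is zeroed); under those assumptions the zeroed entries are exactly the shallow-and-wide combinations.

```python
import numpy as np

widths = [128, 256, 384, 512]   # from Omega_d with D = 512
depths = [2, 4, 8, 12]          # from Omega_l with L = 12

# All-ones index matrix with the strict lower triangle set to 0.
I = np.triu(np.ones((len(widths), len(depths)), dtype=int))

candidates = [(d, l)
              for i, d in enumerate(widths)
              for j, l in enumerate(depths)
              if I[i, j] == 1]
print(len(candidates))          # 10 of the 16 combinations survive
```

For example, the wide-and-shallow structure (512, 2) is filtered out, while the narrow-and-deep structure (128, 12) and the full model (512, 12) are kept.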
Step (8): designing the self-distillation training algorithm and training the model;
For the candidate set of sub-model structures obtained in step (7), a self-distillation training strategy is proposed so that each sub-model can be fully trained. First, a teacher network is trained using the deep self-attention network of step (4), and the bidirectional separable deep self-attention network is constructed. When training a sub-model of the bidirectional separable deep self-attention network, the image and question are input into the teacher network to obtain a prediction vector, i.e., a soft label; a sub-model sampling strategy samples a candidate set during training, and the soft label serves as the supervision label of each sampled sub-model;
The teacher network constructed from the deep self-attention network is defined as M_teacher, and the bidirectional separable deep self-attention network as M_DST. By training the teacher network M_teacher, its parameter weights θ are obtained, and these weights are used to initialize the weights θ_DST of M_DST. A sub-model sampling strategy samples a candidate set during training as follows: k sub-models are sampled at each iteration, and the initial sub-model candidate set of the iteration is set to Ω' = {a_s, a_l}, where a_s denotes the smallest sub-model in the structure candidate set and a_l the largest; then k-2 further sub-models are randomly sampled from the structure candidate set and added to Ω', forming the final sub-model candidate set of the iteration. At each iteration, the input feature x is fed into the teacher network M_teacher to obtain the soft label y = M_teacher(x), whose gradient is frozen via y.detach(). Each sub-model a ∈ Ω' is then traversed: the input feature x is fed into the current sub-model to obtain the prediction vector ŷ_a, and the loss loss_a = KD(ŷ_a, y) between this prediction and the teacher's soft label is computed, where KD denotes the distillation loss function. The losses of the different sampled sub-models are gradient-accumulated, and once all sub-models in the candidate set Ω' generated in the iteration have been traversed, the model weights θ_DST are updated in a single step.
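One iteration of the sampling-and-distillation loop can be sketched as follows. Everything here is a stand-in: models are plain callables, the KD loss is replaced by a squared error (the patent only names it "KD"), and the weight update is omitted; only the sampling pattern (smallest + largest + k-2 random) and the loss accumulation follow the description.

```python
import random

def kd_loss(y_hat, y):
    # squared-error stand-in for the unspecified distillation loss KD
    return (y_hat - y) ** 2

def self_distill_step(teacher, submodels, x, k=4, rng=None):
    """One iteration: a_s and a_l always trained, plus k-2 random sub-models,
    all supervised by the frozen teacher prediction (soft label)."""
    rng = rng or random.Random(0)
    pool = [submodels[0], submodels[-1]]             # a_s and a_l
    pool += rng.sample(submodels[1:-1], k - 2)       # k-2 random sub-models
    y = teacher(x)                                   # soft label (detached)
    return sum(kd_loss(a(x), y) for a in pool)       # accumulated loss

teacher = lambda x: 2.0 * x
submodels = [lambda x: 0.5 * x, lambda x: 1.0 * x,
             lambda x: 1.5 * x, lambda x: 2.0 * x]   # ordered small -> large
loss = self_distill_step(teacher, submodels, x=1.0)  # 2.25 + 0 + 1.0 + 0.25
```

In a real implementation each sub-model's loss would call `backward()` before the single optimizer step, so gradients from all sampled sub-models accumulate into the shared weights.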
Step (9): model deployment and application, specifically as follows:
If the computing resources of the current device are sufficient, the largest sub-model a_l is applied and ŷ_l is obtained by forward propagation; at this time ŷ_l has the best representation capability among the sub-models. When the computing resources of the device are insufficient, the smallest sub-model a_s is applied and ŷ_s is obtained by forward propagation; since a_s requires the least computation of all sub-models, the forward-propagation speed is greatly increased, which improves the user experience, while ŷ_s still retains good representation capability.
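The deployment rule reduces to a selection over the surviving (width, depth) structures. This tiny sketch encodes only the two cases stated above; a finer-grained policy could interpolate among the intermediate sub-models, which the description leaves open.

```python
def pick_submodel(candidates, resources_sufficient):
    """Largest (width, depth) structure when resources allow, smallest otherwise."""
    return max(candidates) if resources_sufficient else min(candidates)

subs = [(128, 2), (128, 12), (512, 12)]
print(pick_submodel(subs, True))    # (512, 12) -- best representation capability
print(pick_submodel(subs, False))   # (128, 2)  -- fastest forward propagation
```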
In summary, the bidirectional separable deep self-attention network provided by the invention can dynamically select sub-models of different sizes for application according to the computing-resource status of the current device, achieving a dynamic balance between accuracy and latency and maintaining sub-model accuracy while ensuring the user experience.
As shown in fig. 1 and 2, the present invention provides a lightweight visual question-answering method based on a bidirectional separable deep self-attention network.
The partitioning of the data set in the step (1) is specifically as follows:
The final training set contains 115K images and 1.1M questions, the validation set contains 5K images and 26K questions, and the test set contains 80K images and 448K questions.
The visual features of the image constructed in step (2) are specifically as follows:
For an image, 36 candidate boxes are usually used, and the visual feature extracted for each candidate box has dimension 2048. The finally mapped space dimension D is adjusted according to the deep self-attention network; taking D = 512 as an example, the image feature obtained in this step is X_input ∈ R^{36×512}.
The semantic features of the question constructed in step (3) are specifically as follows:
For a question, a fixed word length of 14 is usually set, and the semantic feature extracted for each word with the pre-trained word-vector model has dimension 300. The finally mapped space dimension D is adjusted according to the deep self-attention network; taking D = 512 as an example, the question semantic feature obtained in this step is Y_input ∈ R^{14×512}.
The step (4) is as follows:
Setting D = 512 and H = 8, the input feature F_input ∈ R^{n×512} is first fed into the multi-head attention module MHA to obtain the output F_mha ∈ R^{n×512}; the feature F_mha is then fed into the feed-forward layer FFN to obtain the final output F_output ∈ R^{n×512}.
The width segmentation strategy in the step (5) is as follows:
The invention defines the shareable width-ratio candidate set as Ω_d = {1/4, 1/2, 3/4, 1}. When D = 512, the input feature dimensions of the sub-models under the different width-segmentation ratios are d ∈ {128, 256, 384, 512}, meaning there are 4 choices for the width dimension of a sub-model: 128, 256, 384 and 512.
The depth segmentation strategy in the step (6) is as follows:
The invention defines the shareable depth-ratio candidate set as Ω_l = {1/6, 1/3, 2/3, 1}. When L = 12, the numbers of layers of the sub-models under the different depth-segmentation ratios are l ∈ {2, 4, 8, 12}, meaning there are 4 choices for the number of layers of a sub-model: 2, 4, 8 and 12.
Combining the two segmentation strategies and the design filtering principle in the step (7), the method specifically comprises the following steps:
According to the width-ratio candidate set Ω_d and the depth-ratio candidate set Ω_l defined in steps (5) and (6), combining the candidate sets of the two dimensions yields the preliminary candidate set of sub-model structures; the final candidate set of sub-model structures Ω is then obtained through the filtering principle, where each retained structure a(d, l) satisfies I(d, l) = 1.
The self-distillation training algorithm in the step (8) is specifically as follows:
The invention sets k = 4, meaning that at each iteration the 1 largest sub-model, the 1 smallest sub-model and 2 further randomly sampled sub-models are sampled; the 4 sub-models sampled in each iteration are gradient-accumulated together.
Claims (10)
1. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network is characterized by comprising the following steps of:
step (1): dividing the data set;
step (2): constructing visual characteristics of the image;
for a given image, detecting the number m of candidate boxes and their positions in the image by using an existing trained target detection network; for each candidate box, inputting the image region corresponding to the candidate box into the target detection network, and extracting the features before the network classification layer as the features of that candidate box; then concatenating the features extracted from each candidate box to form the visual features of the given image; finally, so that the dimension of the image features matches the deep self-attention network, further processing the image features with a learnable linear transformation and mapping them to a D-dimensional space;
step (3): constructing semantic features of the question;
for a given question, extracting semantic features for each word in the question by using a trained word-vector model, and then concatenating the extracted word semantic features to form the question semantic features; finally, so that the question semantic-feature dimension matches the deep self-attention network, further processing the question features with a learnable linear transformation and mapping them to a D-dimensional space;
step (4): constructing the deep self-attention network;
the deep self-attention network is formed by stacking several self-attention layers, each divided into two parts: a multi-head attention module and a feed-forward layer; the deep self-attention network is used to construct a teacher network for guiding training and the final bidirectional separable deep self-attention network; so that the input features match the dimension of each sub-model in the bidirectional separable deep self-attention network, the network accepts features of dimension D as input and maps them to dimension d through a linear projection;
step (5): designing the width segmentation strategy;
each self-attention layer in the deep self-attention network consists of several parameter matrices, and to adapt to input features of different dimensions, each parameter matrix must be segmented so as to match inputs of different dimensions and produce output features of the appropriate dimension; for an input feature of dimension d, in order to preserve the original structural proportions of the self-attention layer, the width segmentation strategy keeps the output feature dimension at d; notably, sub-models with input features of different dimensions share the parameter matrices in the self-attention layer, and the smaller d is, the fewer parameters are shared; when d equals the original input dimension D, the parameter matrix is not segmented;
step (6): designing the depth segmentation strategy;
the deep self-attention network is formed by stacking several self-attention layers, the number of which is L; when the number of layers l of a sub-model is less than L, l layers of the deep self-attention network are selected according to the depth segmentation strategy to form the sub-model;
step (7): combining the two segmentation strategies and designing the filtering principle;
through the designs of steps (5) and (6), each sub-model has a width d and a depth l; under the same parameter count and computation, deep-and-narrow sub-models are more efficient and structurally more reasonable than shallow-and-wide ones, so a "deep-and-narrow" filtering principle is proposed: before model training, sub-models with more layers and smaller width are preferentially selected, and sub-models with fewer layers and larger width are directly discarded; through the filtering principle, a screened candidate set of sub-model structures is obtained;
step (8): designing the self-distillation training algorithm and training the model;
for the candidate set of sub-model structures obtained in step (7), proposing a self-distillation training strategy so that each sub-model can be fully trained; first, training a teacher network using the deep self-attention network of step (4) and constructing the bidirectional separable deep self-attention network; when training a sub-model of the bidirectional separable deep self-attention network, inputting the image and question into the teacher network to obtain a prediction vector, i.e., a soft label; sampling a candidate set during training through a sub-model sampling strategy, with the soft label serving as the supervision label of each sampled sub-model;
step (9): model deployment and application.
2. The lightweight visual question-answering method based on the bidirectional partitionable deep self-attention network according to claim 1, wherein the partitioning of the data set in the step (1) is specifically as follows:
the data set adopts the VQA-v2 data set, which is further divided into 3 subsets: a training set, a validation set and a test set; the training set is used to train the model, the validation set is used to locally verify model convergence, and the test set is used for the final evaluation of model performance.
3. The lightweight visual question-answering method based on the bidirectional separable depth self-attention network according to claim 2, wherein the visual features of the constructed image in the step (2) are as follows:
for a given image, inferring the number m of candidate boxes and their positions in the image by using an existing trained Faster R-CNN target detection network, and inputting the image region corresponding to each candidate box into the Faster R-CNN network to extract its visual features; for the i-th candidate box, the corresponding visual feature is x_i ∈ R^{2048}, and the visual feature of the whole image, X_image ∈ R^{m×2048}, is formed by concatenating the visual features of all candidate boxes, the specific expression being:
X_image = [x_1, x_2, ..., x_i, ..., x_m]  (Formula 1)
subsequently, a learnable linear transformation Linear: R^{2048} → R^D is used to further process the image feature X_image, mapping it to a D-dimensional space to obtain the final image visual feature X_input ∈ R^{m×D}, the specific formula being:
X_input = Linear(X_image)  (Formula 2).
4. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 3, wherein the semantic features of the construction question in the step (3) are as follows:
for a given question containing n words, inputting each word into a pre-trained GloVe word-vector model to extract its semantic features; for the j-th word, the corresponding semantic feature is y_j ∈ R^{300}, and the semantic feature of the whole question, Y_question ∈ R^{n×300}, is formed by concatenating the semantic features of all words, the specific expression being:
Y_question = [y_1, y_2, ..., y_j, ..., y_n]  (Formula 3)
subsequently, a learnable linear transformation Linear: R^{300} → R^D is used to further process the question semantic feature Y_question, mapping it to a D-dimensional space to obtain the final question semantic feature Y_input ∈ R^{n×D}, the specific formula being:
Y_input = Linear(Y_question)  (Formula 4).
5. The light-weight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 4, wherein the deep self-attention network is constructed in the step (4), and specifically, the method comprises the following steps:
the deep self-attention network is formed by stacking several self-attention layers, each divided into two parts: a multi-head attention module and a feed-forward layer; the deep self-attention network is used to construct a teacher network for guiding training and the final bidirectional separable deep self-attention network, the two adopting deep self-attention networks of identical structure; so that the input features match the dimension of each sub-model in the bidirectional separable deep self-attention network, the network accepts the image visual feature X_input ∈ R^{m×D} and the question semantic feature Y_input ∈ R^{n×D} as input and maps them to d dimensions through a linear projection; the deep self-attention network can fully learn the interaction between the two modalities and finally generates a semantically rich visual-semantic fusion feature;
4-1, a multi-head attention module;
for a given query feature Q ∈ R^{n×D}, key feature K ∈ R^{n×D} and value feature V ∈ R^{n×D}, the multi-head attention module computes the feature F_mha ∈ R^{n×D} using H parallel attention functions; the specific formulas are:
F_mha = MHA(Q, K, V) = [head_1, head_2, ..., head_H] W_0  (Formula 5)
head_h = ATT(Q W_h^Q, K W_h^K, V W_h^V)  (Formula 6)

wherein W_h^Q, W_h^K, W_h^V ∈ R^{D×D_H} denote the mapping matrices of the h-th attention head, and D_H, the dimension of each attention head, is computed as D_H = D/H; in addition, W_0 ∈ R^{D×D} is used for further processing the output features of the multi-head attention function; for the attention calculation ATT (scaled dot-product attention), the specific formula is as follows:

ATT(Q_h, K_h, V_h) = softmax(Q_h K_h^T / sqrt(D_H)) V_h  (Formula 7)
4-2. feedforward layer;
the feed-forward layer consists of a two-layer perceptron and applies a non-linear transformation to the output features of the multi-head attention module; for a given feature F ∈ R^{n×D}, the output feature F_ffn ∈ R^{n×D} is computed by the following formula:

F_ffn = FFN(F) = ReLU(F W_1 + b_1) W_2 + b_2  (Formula 8)

where W_1 and W_2 are the weight matrices, and b_1 and b_2 the biases, of the two perceptron layers;
4-3, self-attention layer;
each self-attention layer consists of the multi-head attention module and feed-forward layer described above; for a given input F_input, the output feature F_output is computed by the following formulas:

F' = LN(F_input + MHA(F_input, F_input, F_input))  (Formula 9)

F_output = LN(F' + FFN(F'))  (Formula 10)

wherein LN represents layer normalization;
4-4. stacking self-attentive layers;
the deep self-attention network is formed by stacking several self-attention layers; since the feature dimension is not changed by a self-attention layer, multiple self-attention layers can be connected in series to form the deep self-attention network Model, the specific formula being:
Model = [Layer^(1), Layer^(2), ..., Layer^(L)]  (Formula 11)
Where L is the number of self-attention layers.
6. The light-weight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 5, wherein the width splitting strategy in the step (5) is as follows:
for the parameter matrices W^Q, W^K, W^V in multi-head attention and an input feature of dimension d, the size D_H of each attention head is kept unchanged, while the input-matching dimension and the number of attention heads of the corresponding parameter matrices are changed; the finally segmented parameter matrices thus satisfy W^Q, W^K, W^V ∈ R^{d×d}, where Ĥ = d/D_H denotes the number of attention heads after segmentation; the other parameter matrices W_0, W_1, W_2 in the self-attention layer adopt the same strategy, so that the finally segmented parameter matrices likewise match dimension d.
7. The light-weight visual question-answering method based on the bidirectional partitionable deep self-attention network according to claim 6, wherein the depth partitioning strategy in step (6) is as follows:
for a deep self-attention network with L layers, the index of each layer is recorded as [1, 2, ..., L]; the invention considers that the self-attention layers closer to the input and output are more important, which means the middle layers are relatively less important; when the number of layers l of a sub-model satisfies l < L, layers are discarded starting from the middle; specifically, the layer indices are first sorted by importance from large to small; for a sub-model with l layers, the first l entries of the sorted index list are taken and then re-sorted to restore the original layer-index order, yielding the final l-layer sub-model of the depth segmentation strategy.
8. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 7, wherein the filtering principle is designed by combining two separation strategies in the step (7), and specifically comprises the following steps:
for a given width-ratio candidate set Ω_d and depth-ratio candidate set Ω_l, combining the candidate sets of the two dimensions yields a preliminary candidate set of sub-model structures, in which each sub-model structure is a(d, l) with d ∈ Ω_d and l ∈ Ω_l; to express the "deep-and-narrow" filtering principle conveniently, a two-dimensional index matrix I ∈ {0, 1}^{|Ω_d|×|Ω_l|} is defined to further process the preliminary candidate set: I(d, l) = 1 indicates that sub-model a(d, l) is selected, and I(d, l) = 0 indicates that it is discarded; the index matrix I is initialized to all 1s, after which its lower-triangular portion is converted to 0s; finally, the set of selected sub-models is defined as Ω = {a(d, l) | I(d, l) = 1}.
9. the light-weight visual question-answering method based on the bidirectional separable deep self-attention network of claim 8, wherein the self-distillation training algorithm of the step (8) is as follows:
the teacher network constructed from the deep self-attention network is defined as M_teacher, and the bidirectional separable deep self-attention network as M_DST; by training the teacher network M_teacher, its parameter weights θ are obtained, and these weights are used to initialize the weights θ_DST of M_DST; a sub-model sampling strategy samples a candidate set during training as follows: k sub-models are sampled at each iteration, and the initial sub-model candidate set of the iteration is set to Ω' = {a_s, a_l}, where a_s denotes the smallest sub-model in the structure candidate set and a_l the largest; then k-2 further sub-models are randomly sampled from the structure candidate set and added to Ω', forming the final sub-model candidate set of the iteration; at each iteration, the input feature x is fed into the teacher network M_teacher to obtain the soft label y = M_teacher(x), whose gradient is frozen via y.detach(); each sub-model a ∈ Ω' is then traversed: the input feature x is fed into the current sub-model to obtain the prediction vector ŷ_a, and the loss loss_a = KD(ŷ_a, y) between this prediction and the teacher's soft label is computed, where KD denotes the distillation loss function; the losses of the different sampled sub-models are gradient-accumulated, and once all sub-models in the candidate set Ω' generated in the iteration have been traversed, the model weights θ_DST are updated in a single step.
10. The lightweight visual question-answering method based on the bidirectional separable deep self-attention network according to claim 9, wherein the model deployment and application in step (9) are as follows:
if the computing resources of the current device are sufficient, the largest sub-model a_l is applied and ŷ_l is obtained by forward propagation; at this time ŷ_l has the best representation capability among the sub-models; when the computing resources of the device are insufficient, the smallest sub-model a_s is applied and ŷ_s is obtained by forward propagation; since a_s requires the least computation of all sub-models, the forward-propagation speed is greatly increased, which improves the user experience, while ŷ_s still retains good representation capability;
the bidirectional separable deep self-attention network can dynamically select sub-models of different sizes for application according to the computing-resource status of the current device, achieving a dynamic balance between accuracy and latency and maintaining sub-model accuracy while ensuring the user experience.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210369535.0A CN114647752A (en) | 2022-04-08 | 2022-04-08 | Lightweight visual question-answering method based on bidirectional separable deep self-attention network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114647752A true CN114647752A (en) | 2022-06-21 |
Family
ID=81997107
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210369535.0A Pending CN114647752A (en) | 2022-04-08 | 2022-04-08 | Lightweight visual question-answering method based on bidirectional separable deep self-attention network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114647752A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114863407A (en) * | 2022-07-06 | 2022-08-05 | 宏龙科技(杭州)有限公司 | Multi-task cold start target detection method based on visual language depth fusion |
CN117216225A (en) * | 2023-10-19 | 2023-12-12 | 四川大学 | Three-mode knowledge distillation-based 3D visual question-answering method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111930992B (en) | Neural network training method and device and electronic equipment | |
CN110175628A (en) | A kind of compression algorithm based on automatic search with the neural networks pruning of knowledge distillation | |
CN112990296B (en) | Image-text matching model compression and acceleration method and system based on orthogonal similarity distillation | |
CN111444340A (en) | Text classification and recommendation method, device, equipment and storage medium | |
CN111325155A (en) | Video action recognition method based on residual 3D CNN and multi-modal feature fusion strategy | |
CN110852168A (en) | Pedestrian re-identification model construction method and device based on neural architecture search | |
CN111160350B (en) | Portrait segmentation method, model training method, device, medium and electronic equipment | |
CN114647752A (en) | Lightweight visual question-answering method based on bidirectional separable deep self-attention network | |
CN110516095A (en) | Weakly supervised deep hashing social image retrieval method and system based on semantic transfer | |
CN108549658A (en) | Deep learning video question-answering method and system based on an attention mechanism over syntactic analysis trees | |
CN112487949B (en) | Learner behavior recognition method based on multi-mode data fusion | |
CN113297370B (en) | End-to-end multi-modal question-answering method and system based on multi-interaction attention | |
CN111008693A (en) | Network model construction method, system and medium based on data compression | |
CN114037945A (en) | Cross-modal retrieval method based on multi-granularity feature interaction | |
KR20200010672A (en) | Smart merchandise searching method and system using deep learning | |
CN109284668A (en) | Pedestrian re-identification method based on distance-regularized projection and dictionary learning | |
Li et al. | Hierarchical knowledge squeezed adversarial network compression | |
Ay et al. | A study of knowledge distillation in fully convolutional network for time series classification | |
CN113420651B (en) | Light weight method, system and target detection method for deep convolutional neural network | |
CN114265937A (en) | Intelligent classification analysis method and system of scientific and technological information, storage medium and server | |
Zheng et al. | Action recognition based on the modified two-stream CNN | |
CN117494051A (en) | Classification processing method, model training method and related device | |
CN117115911A (en) | Hypergraph learning action recognition system based on attention mechanism | |
CN117011883A (en) | Pedestrian re-identification method based on pyramid convolution and Transformer dual branches | |
CN116341621A (en) | Low-cost self-learning neural network design method for weld defect ultrasonic detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||