CN116644316A - Multi-mode multi-task learning oriented lightweight adaptive network learning method - Google Patents

Multi-mode multi-task learning oriented lightweight adaptive network learning method

Info

Publication number
CN116644316A
Authority
CN
China
Prior art keywords
model
attention
training
task
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310629849.4A
Other languages
Chinese (zh)
Inventor
邵镇炜
金子添
余宙
俞俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202310629849.4A priority Critical patent/CN116644316A/en
Publication of CN116644316A publication Critical patent/CN116644316A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lightweight adaptive network learning method for multi-modal multi-task learning, which comprises the following steps: 1. constructing a downstream task data set; 2. constructing a deep self-attention network model; 3. clipping the pre-trained weights; 4. constructing a task adapter; 5. adapting the pre-trained model; and 6. designing progressive guided distillation training and training the model. The method clips part of the weights of the pre-trained model and adapts the remaining weights with an efficient task adapter. The invention further proposes a progressive guided distillation training algorithm to better bridge the gap between the pre-training task and the downstream task and to guarantee the performance of the model on the downstream task. The invention can be combined with any existing pre-trained model based on the deep self-attention network, and training yields an adapter model that is advantageous in downstream task performance, total storage cost at deployment, computation cost at inference, and model configuration flexibility.

Description

Multi-mode multi-task learning oriented lightweight adaptive network learning method
Technical Field
The invention belongs to the field of lightweight multi-modal learning, and in particular relates to a lightweight adaptive network learning method for multi-modal multi-task learning.
Background
In recent years, many research fields of artificial intelligence have benefited greatly from the advent of the deep self-attention network architecture and the self-supervised pre-training paradigm. Taking the multi-modal field as an example, researchers use deep self-attention network models with large numbers of parameters and adopt this training paradigm: the model is first pre-trained on a large-scale corpus of image-text pairs to learn general multi-modal knowledge, and its parameters are then fine-tuned separately for different multi-modal tasks such as visual question-answering, visual target positioning, image description, image-text retrieval, natural language visual reasoning and visual implication reasoning. Under this "pre-training-fine-tuning" paradigm, the entire set of model parameters changes when fine-tuning for each downstream task, which means that multiple large-scale models with different weights must be deployed for different downstream tasks when the model is put into production, consuming considerable storage space.
For this purpose, adapter tuning methods have been developed, which insert lightweight learnable parameters into the model without changing the structure or parameter weights of the pre-trained model itself. Specifically, a small number of learnable parameters called adapters are inserted into the pre-trained model, and only these parameters are trained during downstream-task fine-tuning while the original parameters of the pre-trained model are kept unchanged, so that the pre-trained model is fine-tuned on downstream tasks with much higher parameter efficiency.
Existing adapter fine-tuning methods are parameter-efficient and reduce storage cost when the model is deployed for multiple tasks. However, because the original parameters of the pre-trained model are kept unchanged and a small number of parameters are added to adapt to multiple downstream tasks, the adapted model becomes larger than the original model, which increases memory cost during model training and inference cost in application. If the representation capability of different parts of the pre-trained model can be analyzed in depth, the parts that are general to downstream tasks retained, the parts useless for downstream tasks clipped away, and a lightweight adapter then introduced, the method can be both parameter-efficient and computation-efficient at inference time. Therefore, designing an adapter fine-tuning method that is efficient in multiple respects for pre-trained models has practical value for the field of pre-trained model deployment and academic value for research in other fields.
In summary, how to design an efficient adapter fine-tuning method and combine it with existing pre-trained models is a topic worthy of intensive study. This patent starts from several key points of the task, discusses and resolves the difficulties and key issues of existing methods, and forms a complete and efficient lightweight adapter fine-tuning method.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a lightweight adaptive network learning method for multi-modal multi-task learning, which can be combined with any existing pre-trained model based on a deep self-attention network and, through training, yields an adapter model that is advantageous in downstream task performance, total storage cost at deployment, computation cost at inference, and model configuration flexibility.
The invention mainly comprises two points:
1. By analyzing the representation capability of weights at different positions of the pre-trained model, the invention clips part of the pre-trained weights and designs an efficient adapter to adapt the remaining weights, proposing a prune-then-fill pre-trained model adapter framework so that the adapted model is lighter than the original model.
2. To better bridge the gap between the pre-training task and the downstream task, the invention proposes a progressive guided distillation training algorithm for adapter fine-tuning, so that the adapter model is trained stably step by step and achieves better downstream task performance.
For the scenario of deploying a pre-trained model for multiple tasks, the invention clips part of the weights of the pre-trained model, reducing the computation cost at inference by reducing the model size. Meanwhile, to reduce the storage cost of the multi-modal model at deployment, the invention uses an efficient adapter structure and reduces the trainable parameters through adapter fine-tuning, further reducing the storage cost at deployment. Finally, to better bridge the gap between the pre-training task and the downstream task, a progressive guided distillation training algorithm is proposed, which guarantees the performance of the model on the downstream task. The invention can be combined with any existing pre-trained model based on the deep self-attention network, and training yields an adapter model that is advantageous in downstream task performance, total storage cost at deployment, computation cost at inference, and model configuration flexibility.
A light-weight adaptive network learning method for multi-mode multi-task learning comprises the following steps:
step (1): constructing a downstream task data set;
the invention takes the multi-mode research field as an entry point, and selects 4 multi-mode downstream tasks of visual question-answering, natural language visual reasoning, visual implication reasoning and visual target positioning to construct training, verifying and testing data sets. Extracting regional image features from image data in a dataset by using an existing trained Faster R-CNN target detection network; for text data in a dataset, word embedding vectors are used to extract its semantic features. And then splicing the extracted image and text features to obtain final input features.
Step (2): constructing a deep self-attention network model;
the deep self-attention network is formed by stacking a plurality of layers with the same structure, and each layer consists of a multi-head attention module and a feedforward layer. And constructing a deep self-attention network model for deeply understanding and processing the input features to obtain multi-modal features with richer meanings.
Step (3): pre-training weight clipping;
in order to reduce the size of the pre-training model and improve the model reasoning speed, the invention cuts out part of pre-training weights. According to the deep self-attention network model structure, weight clipping is divided into two types of splitting multi-head attention modules and splitting feedforward layers.
Step (4): constructing a task adapter;
the task adapter refers to a small amount of learnable parameters inserted into each layer of the deep self-attention network model, the model only trains the parameters when the downstream task is fine-tuned, the original pre-training weight of the model is kept unchanged, and finally the model still has the learning capability of the downstream task under the condition of keeping pre-training general knowledge. A lightweight task adapter includes two matrices of learnable parameters and a nonlinear activation function in the middle.
Step (5): adapting a pre-training model;
combining the pre-training model after the segmentation in the step (3) with the task adapter in the step (4) to obtain an adapted pre-training model. For different downstream tasks, they share pre-training weights after segmentation and independently share different task adapters. Also, according to the deep self-attention network model structure, the adaptive pre-training model is two types of adaptive multi-head attention modules and adaptive feedforward layers.
Step (6): designing progressive guided distillation training and training models;
In order to train the adapter model obtained in step (5) stably, the invention proposes a progressive guided distillation training algorithm: a fully fine-tuned model under the traditional "pre-training-fine-tuning" paradigm is used as the teacher model, and the student model is the adapter model obtained in step (5). In each training iteration, the teacher model and the student model are trained together, and the knowledge learned by the teacher model is distilled to the student adapter model in a step-by-step guided manner. To better align the parameter distribution of each layer of the student model, in addition to distilling the overall output features of the model, the training algorithm also synchronously distills the output features of each layer of the teacher model.
Further, the construction of the downstream task data set in the step (1) is specifically as follows:
Among the multi-modal downstream tasks, the visual question-answering task adopts the VQA-v2 data set, the natural language visual reasoning task adopts the NLVR2 data set, the visual implication reasoning task adopts the SNLI-VE data set, and the visual target positioning task adopts the Ref-COCO, Ref-COCO+ and Ref-COCOg data sets. Each data set is further divided into 3 subsets: a training set, a validation set and a test set. The training set is used for training the model, the validation set is used for locally verifying the convergence of the model, and the test set is used for the final model performance evaluation.
For the images in the multi-modal dataset, the regional features of the images $X_{od} \in \mathbb{R}^{m \times D_{od}}$ are extracted using a Faster R-CNN target detection model pre-trained on the Visual Genome dataset, where m is the number of region candidate frames of the image and $D_{od}$ is the regional feature dimension. Subsequently, a learnable linear transformation $\mathrm{Linear}: \mathbb{R}^{D_{od}} \rightarrow \mathbb{R}^{D}$ further processes the image features extracted by the target detection model, mapping their feature dimension into a D-dimensional space to obtain the final image region features $X_{image} \in \mathbb{R}^{m \times D}$. The specific formula is as follows:

$X_{image} = \mathrm{Linear}(X_{od})$ (1)

For the text in the multi-modal dataset, semantic features $X_{text} \in \mathbb{R}^{n \times D}$ are extracted using word embedding vectors, where n is the number of words in the text and D is the semantic feature dimension, which is the same as the final image region feature dimension.

Then, the extracted image and text features are concatenated to obtain the final input features $X_{input} \in \mathbb{R}^{num \times D}$. The specific formula is as follows:

$X_{input} = [X_{image}, X_{text}]$ (2)
where num=m+n is the total number of image and text features.
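As an illustrative sketch only (not part of the original disclosure), the input-feature construction of step (1) can be expressed in PyTorch; the concrete dimensions, vocabulary size and variable names below are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Assumed dimensions: m region boxes, n words, D_od detector feature size, D shared size.
m, n, D_od, D = 36, 14, 2048, 768

# X_od: region features assumed to be pre-extracted by a Faster R-CNN detector.
X_od = torch.randn(m, D_od)
linear = nn.Linear(D_od, D)                 # learnable linear transformation, Eq. (1)
X_image = linear(X_od)                      # (m, D)

# X_text: word-embedding features of the paired text.
vocab_size = 30000                          # assumed vocabulary size
embedding = nn.Embedding(vocab_size, D)
token_ids = torch.randint(0, vocab_size, (n,))
X_text = embedding(token_ids)               # (n, D)

# Concatenate image and text features along the token axis, Eq. (2).
X_input = torch.cat([X_image, X_text], dim=0)   # (m + n, D) = (num, D)
```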
Further, the deep self-attention network model is constructed in the step (2), and the method specifically comprises the following steps:
The deep self-attention network is formed by stacking a plurality of layers with the same structure, and each Layer is composed of a multi-head attention module MHA and a feedforward layer FFN. A deep self-attention network model is built to deeply understand and process the input features $X_{input} \in \mathbb{R}^{num \times D}$ and obtain multi-modal features with richer meaning $X_{output} \in \mathbb{R}^{num \times D}$.
2-1. Multi-head attention Module MHA;
For a given query feature $Q \in \mathbb{R}^{num \times D}$, key feature $K \in \mathbb{R}^{num \times D}$ and value feature $V \in \mathbb{R}^{num \times D}$, where D is the feature dimension, the multi-head attention module contains H parallel attention heads and computes the feature $F_{mha} \in \mathbb{R}^{num \times D}$. The specific formulas are as follows:

$F_{mha} = \mathrm{MHA}(Q, K, V) = [\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_H]\, W_O$ (3)

$\mathrm{head}_h = \mathrm{ATT}(Q W_Q^h, K W_K^h, V W_V^h)$ (4)

where $W_Q^h, W_K^h, W_V^h \in \mathbb{R}^{D \times D_H}$ represent the projection matrices of the h-th attention head, and $D_H$ represents the feature dimension of each attention head, computed as $D_H = D / H$. Meanwhile, the matrix $W_O \in \mathbb{R}^{D \times D}$ further maps the features obtained by the multi-head attention computation. ATT denotes the attention computation, which performs a scaled dot-product operation on the projected query feature Q and key feature K to obtain an attention matrix, and then performs a weighted summation with the projected value feature V. The specific formula is as follows:

$\mathrm{ATT}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{D_H}}\right) V$ (5)
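A minimal sketch of the multi-head attention computation of Eqs. (3)-(5), assuming a PyTorch implementation; batching, masking and bias terms are omitted, and the class and variable names are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal MHA following Eqs. (3)-(5): H parallel heads of size D_H = D / H."""
    def __init__(self, D: int = 768, H: int = 12):
        super().__init__()
        self.H, self.D_H = H, D // H
        # Per-head projections W_Q^h, W_K^h, W_V^h packed into single matrices.
        self.W_Q = nn.Linear(D, D, bias=False)
        self.W_K = nn.Linear(D, D, bias=False)
        self.W_V = nn.Linear(D, D, bias=False)
        self.W_O = nn.Linear(D, D, bias=False)   # output projection W_O

    def attention(self, q, k, v):
        # Scaled dot-product attention, Eq. (5).
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.D_H)
        return scores.softmax(dim=-1) @ v

    def forward(self, Q, K, V):
        def split(x):  # (num, D) -> (H, num, D_H)
            return x.view(x.size(0), self.H, self.D_H).transpose(0, 1)
        heads = self.attention(split(self.W_Q(Q)), split(self.W_K(K)), split(self.W_V(V)))
        # Concatenate the heads, Eq. (3), and apply the output projection W_O.
        F_mha = heads.transpose(0, 1).reshape(Q.size(0), -1)
        return self.W_O(F_mha)
```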
2-2, a feedforward layer FFN;
The feedforward layer comprises two fully connected layers and an activation function. It takes the output feature of the MHA module $F_{mha} \in \mathbb{R}^{num \times D}$ as the input feature, projects it into a high-dimensional space and then maps it back to the original dimension to obtain the output feature $F_{ffn} \in \mathbb{R}^{num \times D}$. The specific formula is as follows:

$F_{ffn} = \mathrm{FFN}(F_{mha}) = \mathrm{Nonlinear}(F_{mha} W_1)\, W_2$

where $W_1 \in \mathbb{R}^{D \times 4D}$ and $W_2 \in \mathbb{R}^{4D \times D}$ are linear transformation projection matrices and Nonlinear is a nonlinear activation function.
2-3, self-attention Layer;
Each self-attention Layer contains the multi-head attention module MHA and the feedforward layer FFN described above. For a given input feature $X_{in} \in \mathbb{R}^{num \times D}$, the Layer applies the MHA module and the FFN in sequence, each followed by a residual connection and layer normalization LN, to obtain the output feature $X_{out} \in \mathbb{R}^{num \times D}$, where LN represents layer normalization.
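Continuing the sketch above (and reusing the MultiHeadAttention class from it), one self-attention Layer and the stacked model of Eqs. (9)-(10) could look as follows; the 4D hidden size, GELU activation and post-LN placement are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Feedforward layer: project to a higher dimension, then back to D."""
    def __init__(self, D: int = 768, hidden: int = 4 * 768):
        super().__init__()
        self.W_1 = nn.Linear(D, hidden)       # W_1: D -> 4D (assumed expansion ratio)
        self.W_2 = nn.Linear(hidden, D)       # W_2: 4D -> D
        self.nonlinear = nn.GELU()            # assumed activation

    def forward(self, F_mha):
        return self.W_2(self.nonlinear(self.W_1(F_mha)))

class SelfAttentionLayer(nn.Module):
    """One self-attention Layer: MHA and FFN, each with residual connection + LayerNorm."""
    def __init__(self, D: int = 768, H: int = 12):
        super().__init__()
        self.mha = MultiHeadAttention(D, H)   # from the MHA sketch above
        self.ffn = FeedForward(D)
        self.ln1 = nn.LayerNorm(D)
        self.ln2 = nn.LayerNorm(D)

    def forward(self, X):
        F = self.ln1(X + self.mha(X, X, X))   # residual + LN around self-attention
        return self.ln2(F + self.ffn(F))      # residual + LN around the FFN

# The deep model of Eqs. (9)-(10) stacks L such layers, e.g. L = 12.
model = nn.Sequential(*[SelfAttentionLayer() for _ in range(12)])
```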
2-4, deep self-attention network Model;
Considering that the feature dimension D does not change through each self-attention Layer, multiple self-attention layers can be stacked to form a deep self-attention network model, denoted Model, which deeply understands and processes the input features $X_{input} \in \mathbb{R}^{num \times D}$ to obtain multi-modal features with richer meaning $X_{output} \in \mathbb{R}^{num \times D}$. The specific formulas are as follows:

$\mathrm{Model} = [\mathrm{Layer}^{(1)}, \mathrm{Layer}^{(2)}, \ldots, \mathrm{Layer}^{(L)}]$ (9)

$X_{output} = \mathrm{Model}(X_{input})$ (10)

where L is the number of self-attention layers. Subsequently, the model is initialized with the weights $W_{pretrain}$ pre-trained on a large-scale corpus of image-text pairs. The specific formula is as follows:

$\mathrm{Model} \leftarrow W_{pretrain}$ (11)
further, the pre-training weight clipping in the step (3) is specifically as follows:
in order to reduce the size of the pre-training model and improve the model reasoning speed, the invention cuts out part of pre-training weights. According to the deep self-attention network model structure in the step (2), pre-training weight clipping is divided into two types of a splitting multi-head attention module MHA and a splitting feedforward layer FFN.
3-1, segmenting a multi-head attention module MHA;
The split multi-head attention module aims to cut the number of attention heads H without changing the input and output feature dimension D, thereby reducing the number of module parameters. Specifically, for a given input feature $X \in \mathbb{R}^{num \times D}$, the split multi-head attention module is computed as follows:

$F_{p\_mha} = \mathrm{MHA}(Q, K, V) = [\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_{H-t}]\, W_O$ (12)

where H is the original number of attention heads and t is the number of attention heads cut off. $W_Q^h, W_K^h, W_V^h \in \mathbb{R}^{D \times D_H}$ are the parameter matrices of the h-th attention head and $D_H = D / H$ is the dimension of each attention head. ATT is the attention computation, as shown in equation (5). To match the attention feature dimension $(H-t) D_H$ after splitting, $W_O \in \mathbb{R}^{(H-t) D_H \times D}$ is correspondingly cut as well. $F_{p\_mha} \in \mathbb{R}^{num \times D}$ is the output feature of the split attention module; its dimension is consistent with the input feature, which shows that the MHA module after attention-head splitting does not change the feature dimension.
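A hedged sketch of how the head pruning of Eq. (12) could be realized on packed projection matrices; the slicing convention (dropping the last t heads) and the tensor layout are assumptions, since the patent does not specify which heads are removed.

```python
import torch

def prune_attention_heads(W_Q, W_K, W_V, W_O, H: int, t: int):
    """Remove the last t of H attention heads from packed projection weights.

    W_Q, W_K, W_V: (D, D) matrices whose columns are grouped per head (D_H each).
    W_O:           (D, D) matrix whose rows are grouped per head.
    Returns weights for an MHA module with H - t heads, as in Eq. (12).
    """
    D = W_Q.size(0)
    D_H = D // H
    keep = (H - t) * D_H
    # Keep only the columns (resp. rows for W_O) of the first H - t heads.
    return (W_Q[:, :keep].clone(), W_K[:, :keep].clone(),
            W_V[:, :keep].clone(), W_O[:keep, :].clone())

# Example: D = 768, H = 12 heads, prune t = 4 heads.
D, H, t = 768, 12, 4
W = [torch.randn(D, D) for _ in range(4)]
pruned = prune_attention_heads(*W, H=H, t=t)
print([tuple(p.shape) for p in pruned])  # [(768, 512), (768, 512), (768, 512), (512, 768)]
```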
3-2, cutting the feedforward layer FFN;
segmentationThe feedforward layer aims at containing the parameter matrix W without changing the input-output characteristic dimension D 1 and W2 And (5) cutting. Specifically, for a given input featureThe calculation method of the cut feedforward layer FFN is as follows:
wherein ,and s is a set segmentation dimension, and Nonlinear is an activation function for the parameter matrix of the FFN module of the feed-forward layer after segmentation. />For output features, its dimension and input featuresAnd keeping the same, namely the feedforward layer FFN after being cut, and ensuring that the characteristic dimension is not changed.
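A corresponding sketch for cutting s hidden dimensions from the FFN matrices; which s dimensions are dropped is an assumption of this sketch.

```python
import torch

def prune_ffn(W_1, W_2, s: int):
    """Cut s hidden dimensions from the feedforward matrices.

    W_1: (D, 4D) up-projection, W_2: (4D, D) down-projection.
    The last s hidden columns of W_1 and rows of W_2 are removed.
    """
    keep = W_1.size(1) - s
    return W_1[:, :keep].clone(), W_2[:keep, :].clone()

# Example: D = 768, hidden = 4 * 768 = 3072, cut s = 1024 hidden units.
D = 768
W_1, W_2 = torch.randn(D, 4 * D), torch.randn(4 * D, D)
W_1_p, W_2_p = prune_ffn(W_1, W_2, s=1024)
print(W_1_p.shape, W_2_p.shape)  # (768, 2048), (2048, 768)
```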
Further, the task adapter is constructed in the step (4), specifically as follows:
The task adapter refers to a small number of learnable parameters inserted into each layer of the deep self-attention network model. When fine-tuning on a downstream task, the model trains only these parameters while keeping the original pre-trained weights unchanged, so that the model retains the general pre-training knowledge and still acquires the learning capability for the downstream task. A lightweight task adapter comprises two learnable parameter matrices $W_{down}$ and $W_{up}$ with a nonlinear activation function in between. For the input feature $F_{adp\_in}$ and output feature $F_{adp\_out}$, each task adapter is computed as follows:

$F_{adp\_mid} = \mathrm{Nonlinear}(F_{adp\_in} W_{down})$ (15)

$F_{adp\_out} = F_{adp\_in} + F_{adp\_mid} W_{up}$ (16)

where $W_{down} \in \mathbb{R}^{D \times r}$, $W_{up} \in \mathbb{R}^{r \times D}$, Nonlinear is a nonlinear activation function, D is the input-output feature dimension, and r is the adapter size parameter: the larger r is, the more learnable parameters there are and, in general, the stronger the learning capability of the adapter. The feature dimension D is not changed after the input features are processed by the task adapter, which makes it convenient to insert the adapter into the deep self-attention network model. To stabilize model training, each task adapter contains a residual connection, which connects the input feature $F_{adp\_in}$ to the output feature $F_{adp\_out}$ through a bypass.
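A minimal sketch of the task adapter of Eqs. (15)-(16), assuming PyTorch; the choice of ReLU as the nonlinearity is an assumption, since the patent leaves the activation function unspecified.

```python
import torch
import torch.nn as nn

class TaskAdapter(nn.Module):
    """Lightweight task adapter: down-project to r dims, activate, up-project to D,
    with a residual (bypass) connection from input to output."""
    def __init__(self, D: int = 768, r: int = 64):
        super().__init__()
        self.W_down = nn.Linear(D, r, bias=False)   # W_down: D -> r
        self.W_up = nn.Linear(r, D, bias=False)     # W_up:   r -> D
        self.nonlinear = nn.ReLU()                  # assumed activation

    def forward(self, F_adp_in):
        F_adp_mid = self.nonlinear(self.W_down(F_adp_in))   # Eq. (15)
        return F_adp_in + self.W_up(F_adp_mid)              # Eq. (16), residual connection

# 2 * D * r learnable parameters per adapter, e.g. 2 * 768 * 64 = 98,304.
adapter = TaskAdapter()
out = adapter(torch.randn(50, 768))   # feature dimension D is unchanged
```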
Further, the adapting pre-training model in the step (5) is specifically as follows:
Combining the pre-trained model clipped in step (3) with the task adapter in step (4) yields the adapted pre-trained model. Different downstream tasks share the clipped pre-training weights, while each task independently owns its own task adapter. Likewise, according to the deep self-attention network model structure, adapting the pre-trained model falls into two types: adapting the multi-head attention module and adapting the feedforward layer.
5-1, adapting a multi-head attention module MHA;
The adapted multi-head attention module MHA aims to insert a small number of trainable task adapters as new attention heads into the split pre-trained MHA. Specifically, for a given downstream-task input feature $X \in \mathbb{R}^{num \times D}$, the adapted multi-head attention module is computed as follows:

$F_{adp\_mha} = [\mathrm{head}_1, \ldots, \mathrm{head}_{H-t}, \mathrm{head}^{adp}_1, \ldots, \mathrm{head}^{adp}_{ah}]\, [W_O; W_{adpO}]$

$\mathrm{head}^{adp}_h = \mathrm{ATT}(X W_{adpQ}^h, X W_{adpK}^h, X W_{adpV}^h)$

where $W_{adpQ}^h, W_{adpK}^h, W_{adpV}^h \in \mathbb{R}^{D \times D_H}$ represent the parameter matrices of the h-th adapted attention head, D represents the input feature dimension, and $D_H$ is the dimension of each attention head, consistent with the original attention heads. ATT is the attention computation, and ah is the settable number of adapted attention heads. The attention-head output features of the task adapter are concatenated with the original attention-head outputs to obtain the adapted attention features. To match the adapted attention feature dimension, $W_{adpO} \in \mathbb{R}^{ah \cdot D_H \times D}$ is additionally introduced and spliced with the split $W_O$ to jointly process the adapted attention features, and the final output feature is $F_{adp\_mha} \in \mathbb{R}^{num \times D}$, whose dimension remains consistent with the input feature. In the adapted multi-head attention module MHA, only $W_{adpQ}$, $W_{adpK}$, $W_{adpV}$ and $W_{adpO}$ are trainable; the remaining parameters are kept unchanged during model training.
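A hedged sketch of the adapted multi-head attention, assuming PyTorch. Splicing the two groups of heads before a joint output projection is implemented here as the mathematically equivalent sum of two projected branches; the class structure and parameter layout are assumptions.

```python
import torch
import torch.nn as nn

class AdaptedMHA(nn.Module):
    """Adapted MHA: H - t frozen pre-trained heads plus ah trainable adapter heads."""
    def __init__(self, D: int = 768, H: int = 12, t: int = 4, ah: int = 2):
        super().__init__()
        self.D_H, self.kept, self.ah = D // H, H - t, ah
        # Pruned pre-trained projections (frozen).
        self.W_Q = nn.Linear(D, self.kept * self.D_H, bias=False)
        self.W_K = nn.Linear(D, self.kept * self.D_H, bias=False)
        self.W_V = nn.Linear(D, self.kept * self.D_H, bias=False)
        self.W_O = nn.Linear(self.kept * self.D_H, D, bias=False)
        # Adapter-head projections (trainable).
        self.W_adpQ = nn.Linear(D, ah * self.D_H, bias=False)
        self.W_adpK = nn.Linear(D, ah * self.D_H, bias=False)
        self.W_adpV = nn.Linear(D, ah * self.D_H, bias=False)
        self.W_adpO = nn.Linear(ah * self.D_H, D, bias=False)
        # Freeze everything except the adapter parameters.
        for mod in (self.W_Q, self.W_K, self.W_V, self.W_O):
            mod.weight.requires_grad_(False)

    def _heads(self, q, k, v, n_heads):
        num = q.size(0)
        q, k, v = (x.view(num, n_heads, self.D_H).transpose(0, 1) for x in (q, k, v))
        att = (q @ k.transpose(-2, -1) / self.D_H ** 0.5).softmax(-1) @ v
        return att.transpose(0, 1).reshape(num, -1)

    def forward(self, X):
        F_pre = self._heads(self.W_Q(X), self.W_K(X), self.W_V(X), self.kept)
        F_adp = self._heads(self.W_adpQ(X), self.W_adpK(X), self.W_adpV(X), self.ah)
        # [pre heads, adapter heads] [W_O; W_adpO]  ==  pre @ W_O + adp @ W_adpO.
        return self.W_O(F_pre) + self.W_adpO(F_adp)
```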
5-2, adapting a feedforward layer FFN;
The adapted feedforward layer FFN aims to fill and adapt the $W_1$ and $W_2$ matrices in the split FFN module so that it regains the learning capability for downstream tasks. Specifically, for a given input feature $X \in \mathbb{R}^{num \times D}$, the adapted feedforward layer FFN is computed as follows:

$F_{adp\_ffn} = \mathrm{Nonlinear}\big(X\, [W_1', W_{adp1}]\big)\, [W_2'; W_{adp2}]$

where $W_{adp1} \in \mathbb{R}^{D \times af}$ and $W_{adp2} \in \mathbb{R}^{af \times D}$, and af is the settable adaptation size of the feedforward layer. These matrices are spliced with the split pre-training matrices $W_1$ and $W_2$, so that the adapted feedforward layer FFN both retains the general pre-training knowledge and has the learning capability for downstream tasks. $F_{adp\_ffn} \in \mathbb{R}^{num \times D}$ is the final output feature and its dimension is kept the same as the input feature. Likewise, only $W_{adp1}$ and $W_{adp2}$ are trainable; the remaining parameters are kept unchanged during model training.
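A hedged sketch of the adapted feedforward layer, assuming PyTorch; the concrete sizes, the GELU activation and the equivalence used (concatenating adapter columns equals summing two branches) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdaptedFFN(nn.Module):
    """Adapted FFN: af trainable hidden units concatenated to the pruned, frozen W_1 / W_2."""
    def __init__(self, D: int = 768, hidden_pruned: int = 2048, af: int = 96):
        super().__init__()
        self.W_1 = nn.Linear(D, hidden_pruned, bias=False)     # pruned, frozen
        self.W_2 = nn.Linear(hidden_pruned, D, bias=False)     # pruned, frozen
        self.W_adp1 = nn.Linear(D, af, bias=False)             # adapter columns, trainable
        self.W_adp2 = nn.Linear(af, D, bias=False)             # adapter rows, trainable
        self.nonlinear = nn.GELU()                             # assumed activation
        for mod in (self.W_1, self.W_2):
            mod.weight.requires_grad_(False)

    def forward(self, X):
        # Concatenating [W_1', W_adp1] along the hidden axis, then applying [W_2'; W_adp2],
        # is equivalent to summing the pre-trained and adapter branches.
        hidden = self.nonlinear(torch.cat([self.W_1(X), self.W_adp1(X)], dim=-1))
        h_pre, h_adp = hidden.split([self.W_1.out_features, self.W_adp1.out_features], dim=-1)
        return self.W_2(h_pre) + self.W_adp2(h_adp)
```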
Further, the design progressive guided distillation training and model training described in the step (6) is specifically as follows:
The invention proposes a progressive guided distillation training algorithm, which adopts a fully fine-tuned model under the traditional "pre-training-fine-tuning" paradigm as the teacher model, and the student model is the adapter model obtained in step (5). In each training iteration, the teacher model and the student model are trained together, and the knowledge learned by the teacher model is distilled to the student adapter model in a step-by-step guided manner. To better align the parameter distribution of each layer of the student model, in addition to distilling the overall output features of the model, the training algorithm also synchronously distills the output features of each layer of the teacher model.
The teacher model is denoted $M_{tea}$, and its architecture is the original deep self-attention network model without splitting and adaptation; the student model is denoted $M_{stu}$, and its architecture is the split and adapted adapter model obtained in step (5). In each training iteration, a batch of input data X and targets Y is randomly sampled and fed into the teacher model $M_{tea}(X)$ to obtain the predicted labels $Y_{tea}$ and the per-layer outputs $\mathrm{Layer}_{tea}$; the teacher model loss $\mathcal{L}_{tea} = \mathcal{L}(Y_{tea}, Y)$ is computed and the teacher gradient is updated. The teacher model features are then frozen via $Y_{tea}.\mathrm{detach}()$ and $\mathrm{Layer}_{tea}.\mathrm{detach}()$. In the current training iteration, the input data X is synchronously fed into the student model $M_{stu}(X)$ to obtain the predicted labels $Y_{stu}$ and the per-layer outputs $\mathrm{Layer}_{stu}$, and the losses are computed:

$\mathcal{L}_{out} = \mathrm{MSE}(Y_{stu}, Y_{tea})$

$\mathcal{L}_{layer} = \sum_{l=1}^{L} \mathrm{MSE}\big(\mathrm{Layer}_{stu}^{(l)}, \mathrm{Layer}_{tea}^{(l)}\big)$

$\mathcal{L}_{stu} = \lambda_1 \mathcal{L}_{out} + \lambda_2 \mathcal{L}_{layer}$

where $\mathcal{L}_{out}$ is the output loss, $\mathcal{L}_{layer}$ is the layer loss, and $\mathcal{L}_{stu}$ is the final loss of the student model. $\lambda_1$ adjusts the proportion of the output loss and $\lambda_2$ adjusts the proportion of the layer loss; their values can be set. After the total loss is obtained, the gradient is computed and the student model is updated.
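A sketch of one iteration of the progressive guided distillation training, assuming PyTorch and assuming the models return both their prediction and a list of per-layer outputs; the function signature, the MSE targets and the loss combination follow the description above but are an illustrative reconstruction, not the patent's reference implementation.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, task_loss_fn, opt_tea, opt_stu,
                      X, Y, lam1=0.5, lam2=0.5):
    """One iteration: train the teacher on Y, then distill its detached features
    (overall output and every layer output) into the student adapter model."""
    # 1. Teacher forward/backward on the ground-truth target Y.
    Y_tea, layers_tea = teacher(X)
    loss_tea = task_loss_fn(Y_tea, Y)
    opt_tea.zero_grad(); loss_tea.backward(); opt_tea.step()

    # 2. Freeze teacher features so no gradient flows back into the teacher.
    Y_tea = Y_tea.detach()
    layers_tea = [l.detach() for l in layers_tea]

    # 3. Student forward; distill the overall output and each layer output (MSE).
    Y_stu, layers_stu = student(X)
    loss_out = F.mse_loss(Y_stu, Y_tea)
    loss_layer = sum(F.mse_loss(s, t) for s, t in zip(layers_stu, layers_tea))
    loss_stu = lam1 * loss_out + lam2 * loss_layer

    opt_stu.zero_grad(); loss_stu.backward(); opt_stu.step()
    return loss_tea.item(), loss_stu.item()
```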
The invention has the following beneficial effects:
according to the invention, through analyzing the characterization capability of different position weights of the pre-training model, the invention cuts part of the pre-training weights, designs an efficient adapter for adapting the pre-training weights, and provides a pre-training model adapter framework with pruning and filling firstly, so that the adapted model is lighter than the original model.
For the scenario of deploying a pre-trained model for multiple tasks, the method first clips part of the weights of the pre-trained model, reducing the computation cost by reducing the model size. Meanwhile, the invention uses an efficient task adapter structure and reduces the trainable parameters through adapter fine-tuning, further reducing the storage cost of the model at deployment. Finally, the invention proposes a progressive guided distillation training algorithm to better bridge the gap between the pre-training task and the downstream task and to guarantee the performance of the model on the downstream task. The invention can be combined with any existing pre-trained model based on the deep self-attention network, and training yields an adapter model that is advantageous in downstream task performance, total storage cost at deployment, computation cost at inference, and model configuration flexibility.
Drawings
FIG. 1 is a diagram illustrating pre-training weight clipping in an embodiment of the present invention.
Fig. 2 is a schematic diagram of a task adapter according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of an adaptive pre-training model according to an embodiment of the present invention.
Detailed Description
The detailed parameters of the present invention are further described below with reference to the drawings.
As shown in fig. 1,2 and 3, the invention provides a light-weight adaptive network learning method for multi-mode and multi-task learning.
The construction of the downstream task data set in the step (1) is specifically as follows:
the end-use datasets included VQA-v2, NLVR2, SNLI-VE, ref-COCO, ref-COCO+ and Ref-COCOg, encompassing 4 multi-modal downstream tasks of visual question-answering, natural language visual reasoning, visual implication reasoning and visual target localization. All data sets are divided into 3 subsets: training set, validation set and test set. The training set is used for training the model, the verification set is used for locally verifying the convergence condition of the model, and the test set is used for final model performance evaluation.
Further, for the images in the multi-modal dataset, the regional features of the images $X_{od} \in \mathbb{R}^{m \times D_{od}}$ are extracted using a Faster R-CNN object detection model pre-trained on the Visual Genome dataset, where m is the number of region candidate frames of the image and $D_{od}$ is the regional feature dimension. Subsequently, a learnable linear transformation $\mathrm{Linear}: \mathbb{R}^{D_{od}} \rightarrow \mathbb{R}^{D}$ further processes the image features extracted by the target detection model, mapping their feature dimension into a D-dimensional space to obtain the final image region features $X_{image} \in \mathbb{R}^{m \times D}$. The specific formula is as follows:

$X_{image} = \mathrm{Linear}(X_{od})$ (1)

For the text in the multi-modal dataset, semantic features $X_{text} \in \mathbb{R}^{n \times D}$ are extracted using word embedding vectors, where n is the number of words in the text and D is the semantic feature dimension, the same as the final image region feature dimension.

Then, the extracted image and text features are concatenated to obtain the final input features $X_{input} \in \mathbb{R}^{num \times D}$. The specific formula is as follows:

$X_{input} = [X_{image}, X_{text}]$ (2)
where num=m+n is the total number of image and text features.
Specifically, in the present embodiment, for the image region features, the maximum number of candidate frames is set to m = 36 and the region feature dimension is $D_{od} = 2048$; for the text semantic features, the maximum number of words is set to n = 14; and the image region feature dimension and the semantic feature dimension are D = 768. The resulting input features are $X_{input} \in \mathbb{R}^{50 \times 768}$.
Step (2): constructing a deep self-attention network model. The deep self-attention network is formed by stacking a plurality of layers with the same structure, and each layer consists of a multi-head attention module and a feedforward layer. The deep self-attention network model is constructed to deeply understand and process the input features and obtain multi-modal features with richer meaning.
Specifically, the deep self-attention network is formed by stacking a plurality of layers with the same structure, and each Layer is composed of a multi-head attention module MHA and a feedforward layer FFN. A deep self-attention network model is built to deeply understand and process the input features $X_{input} \in \mathbb{R}^{num \times D}$ and obtain multi-modal features with richer meaning $X_{output} \in \mathbb{R}^{num \times D}$.
2-1. Multi-head attention Module MHA;
For a given query feature $Q \in \mathbb{R}^{num \times D}$, key feature $K \in \mathbb{R}^{num \times D}$ and value feature $V \in \mathbb{R}^{num \times D}$, where D is the feature dimension, the multi-head attention module contains H parallel attention heads and computes the feature $F_{mha} \in \mathbb{R}^{num \times D}$. The specific formulas are as follows:

$F_{mha} = \mathrm{MHA}(Q, K, V) = [\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_H]\, W_O$ (3)

$\mathrm{head}_h = \mathrm{ATT}(Q W_Q^h, K W_K^h, V W_V^h)$ (4)

where $W_Q^h, W_K^h, W_V^h \in \mathbb{R}^{D \times D_H}$ represent the projection matrices of the h-th attention head, and $D_H$ represents the feature dimension of each attention head, computed as $D_H = D / H$. Meanwhile, the matrix $W_O \in \mathbb{R}^{D \times D}$ further maps the features obtained by the multi-head attention computation. ATT denotes the attention computation, which performs a scaled dot-product operation on the projected query feature Q and key feature K to obtain an attention matrix, and then performs a weighted summation with the projected value feature V. The specific formula is as follows:

$\mathrm{ATT}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{D_H}}\right) V$ (5)
2-2, a feedforward layer FFN;
The feedforward layer comprises two fully connected layers and an activation function. It takes the output feature of the MHA module $F_{mha} \in \mathbb{R}^{num \times D}$ as the input feature, projects it into a high-dimensional space and then maps it back to the original dimension to obtain the output feature $F_{ffn} \in \mathbb{R}^{num \times D}$. The specific formula is as follows:

$F_{ffn} = \mathrm{FFN}(F_{mha}) = \mathrm{Nonlinear}(F_{mha} W_1)\, W_2$

where $W_1 \in \mathbb{R}^{D \times 4D}$ and $W_2 \in \mathbb{R}^{4D \times D}$ are linear transformation projection matrices and Nonlinear is a nonlinear activation function.
2-3, self-attention Layer;
Each self-attention Layer contains the multi-head attention module MHA and the feedforward layer FFN described above. For a given input feature $X_{in} \in \mathbb{R}^{num \times D}$, the Layer applies the MHA module and the FFN in sequence, each followed by a residual connection and layer normalization LN, to obtain the output feature $X_{out} \in \mathbb{R}^{num \times D}$, where LN represents layer normalization.
2-4, deep self-attention network Model;
Considering that the feature dimension D does not change through each self-attention Layer, multiple self-attention layers can be stacked to form a deep self-attention network model, denoted Model, which deeply understands and processes the input features $X_{input} \in \mathbb{R}^{num \times D}$ to obtain multi-modal features with richer meaning $X_{output} \in \mathbb{R}^{num \times D}$. The specific formulas are as follows:

$\mathrm{Model} = [\mathrm{Layer}^{(1)}, \mathrm{Layer}^{(2)}, \ldots, \mathrm{Layer}^{(L)}]$ (9)

$X_{output} = \mathrm{Model}(X_{input})$ (10)

where L is the number of self-attention layers. Subsequently, the model is initialized with the weights $W_{pretrain}$ pre-trained on a large-scale corpus of image-text pairs. The specific formula is as follows:

$\mathrm{Model} \leftarrow W_{pretrain}$ (11)
In the present embodiment, by setting D = 768 and H = 12, the feature dimension of each attention head is $D_H = 64$. The input features $X_{input} \in \mathbb{R}^{50 \times 768}$ pass through the MHA to obtain the feature $F_{mha} \in \mathbb{R}^{50 \times 768}$ and through the FFN to obtain the feature $F_{ffn} \in \mathbb{R}^{50 \times 768}$; finally, the input features pass through the deep self-attention network model to obtain the output features $X_{output} \in \mathbb{R}^{50 \times 768}$.
Step (3): pre-training weight clipping. In order to reduce the size of the pre-trained model and improve the model inference speed, the invention clips part of the pre-training weights. According to the deep self-attention network model structure, weight clipping is divided into two types: splitting the multi-head attention module and splitting the feedforward layer.
Specifically, in order to reduce the size of the pre-training model and improve the model reasoning speed, the invention cuts out part of pre-training weights. According to the deep self-attention network model structure in the step (2), pre-training weight clipping is divided into two types of a splitting multi-head attention module MHA and a splitting feedforward layer FFN.
3-1, segmenting a multi-head attention module MHA;
The split multi-head attention module aims to cut the number of attention heads H without changing the input and output feature dimension D, thereby reducing the number of module parameters. Specifically, for a given input feature $X \in \mathbb{R}^{num \times D}$, the split multi-head attention module is computed as follows:

$F_{p\_mha} = \mathrm{MHA}(Q, K, V) = [\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_{H-t}]\, W_O$ (12)

where H is the original number of attention heads and t is the number of attention heads cut off. $W_Q^h, W_K^h, W_V^h \in \mathbb{R}^{D \times D_H}$ are the parameter matrices of the h-th attention head and $D_H = D / H$ is the dimension of each attention head. ATT is the attention computation, as shown in equation (5). To match the attention feature dimension $(H-t) D_H$ after splitting, $W_O \in \mathbb{R}^{(H-t) D_H \times D}$ is correspondingly cut as well. $F_{p\_mha} \in \mathbb{R}^{num \times D}$ is the output feature of the split attention module; its dimension is consistent with the input feature, which shows that the MHA module after attention-head splitting does not change the feature dimension.
3-2, cutting the feedforward layer FFN;
The split feedforward layer aims to cut the parameter matrices $W_1$ and $W_2$ it contains without changing the input and output feature dimension D. Specifically, for a given input feature $X \in \mathbb{R}^{num \times D}$, the split feedforward layer FFN is computed as follows:

$F_{p\_ffn} = \mathrm{Nonlinear}(X W_1')\, W_2'$

where $W_1' \in \mathbb{R}^{D \times (4D - s)}$ and $W_2' \in \mathbb{R}^{(4D - s) \times D}$ are the parameter matrices of the split feedforward layer FFN module, s is the set splitting dimension, and Nonlinear is the activation function. $F_{p\_ffn} \in \mathbb{R}^{num \times D}$ is the output feature; its dimension is kept the same as the input feature, i.e., the split feedforward layer FFN still does not change the feature dimension.
Further, in this embodiment, the split multi-head attention module MHA needs to change the $W_Q$, $W_K$, $W_V$ and $W_O$ parameter matrices simultaneously, which reduces the parameter count of a single MHA module by $4 \times t \times D \times D_H$, where t is the set number of attention heads to cut, i.e., $t \in \{0, 1, \ldots, H\}$, and H = 12.

The split feedforward layer FFN needs to change the $W_1$ and $W_2$ parameter matrices simultaneously, which reduces the parameter count of a single FFN module by $2 \times s \times D$, where $s \in \{0, 1, \ldots, 4D\}$ is the cut dimension and D = 768.
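For illustration, the parameter savings above can be checked with a short calculation under the embodiment's assumed setting (D = 768, H = 12, $D_H$ = 64); the concrete choices of t and s below are examples only.

```python
# Parameter reduction per module under the assumed base setting D = 768, D_H = 64.
D, D_H = 768, 64

def mha_saving(t):            # 4 * t * D * D_H parameters removed per MHA module
    return 4 * t * D * D_H

def ffn_saving(s):            # 2 * s * D parameters removed per FFN module
    return 2 * s * D

print(mha_saving(4))          # cutting t = 4 heads:            786,432 parameters
print(ffn_saving(1024))       # cutting s = 1024 hidden dims: 1,572,864 parameters
```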
Step (4): constructing a task adapter. The task adapter refers to a small number of learnable parameters inserted into each layer of the deep self-attention network model; when fine-tuning on a downstream task, the model trains only these parameters while keeping its original pre-training weights unchanged, so that the model still acquires the learning capability for the downstream task while retaining general pre-training knowledge. A lightweight task adapter includes two learnable parameter matrices and a nonlinear activation function in the middle.

Specifically, a lightweight task adapter comprises two learnable parameter matrices $W_{down}$ and $W_{up}$ with a nonlinear activation function in between. For the input feature $F_{adp\_in}$ and output feature $F_{adp\_out}$, each task adapter is computed as follows:

$F_{adp\_mid} = \mathrm{Nonlinear}(F_{adp\_in} W_{down})$ (15)

$F_{adp\_out} = F_{adp\_in} + F_{adp\_mid} W_{up}$ (16)

where $W_{down} \in \mathbb{R}^{D \times r}$, $W_{up} \in \mathbb{R}^{r \times D}$, Nonlinear is a nonlinear activation function, D is the input-output feature dimension, and r is the adapter size parameter: the larger r is, the more learnable parameters there are and, in general, the stronger the learning capability of the adapter. The feature dimension D is not changed after the input features are processed by the task adapter, which makes it convenient to insert the adapter into the deep self-attention network model. To stabilize model training, each task adapter contains a residual connection, which connects the input feature $F_{adp\_in}$ to the output feature $F_{adp\_out}$ through a bypass.
In this embodiment, the input feature $F_{adp\_in}$ and the output feature $F_{adp\_out}$ denote only the input and output of each adapter module, which facilitates describing the computation of each adapter. They are equivalent to the input features and output features of the Layer module described above.
Further, in this embodiment, the number of learnable parameters of each task adapter is $2 \times D \times r$, where r is the task adapter size; r = 64 may be set.
Step (5): adapting the pre-trained model. The pre-trained model clipped in step (3) is combined with the task adapter in step (4) to obtain the adapted pre-trained model. Different downstream tasks share the clipped pre-training weights, while each task independently owns its own task adapter. Likewise, according to the deep self-attention network model structure, adapting the pre-trained model falls into two types: adapting the multi-head attention module and adapting the feedforward layer.

Specifically, combining the pre-trained model clipped in step (3) with the task adapter in step (4) yields the adapted pre-trained model. For different downstream tasks, the clipped pre-training weights are shared and each task has its own independent task adapter.
5-1, adapting a multi-head attention module MHA;
The adapted multi-head attention module MHA aims to insert a small number of trainable task adapters as new attention heads into the split pre-trained MHA. Specifically, for a given downstream-task input feature $X \in \mathbb{R}^{num \times D}$, the adapted multi-head attention module is computed as follows:

$F_{adp\_mha} = [\mathrm{head}_1, \ldots, \mathrm{head}_{H-t}, \mathrm{head}^{adp}_1, \ldots, \mathrm{head}^{adp}_{ah}]\, [W_O; W_{adpO}]$

$\mathrm{head}^{adp}_h = \mathrm{ATT}(X W_{adpQ}^h, X W_{adpK}^h, X W_{adpV}^h)$

where $W_{adpQ}^h, W_{adpK}^h, W_{adpV}^h \in \mathbb{R}^{D \times D_H}$ represent the parameter matrices of the h-th adapted attention head, D represents the input feature dimension, and $D_H$ is the dimension of each attention head, consistent with the original attention heads. ATT is the attention computation, and ah is the settable number of adapted attention heads. The attention-head output features of the task adapter are concatenated with the original attention-head outputs to obtain the adapted attention features. To match the adapted attention feature dimension, $W_{adpO} \in \mathbb{R}^{ah \cdot D_H \times D}$ is additionally introduced and spliced with the split $W_O$ to jointly process the adapted attention features, and the final output feature is $F_{adp\_mha} \in \mathbb{R}^{num \times D}$, whose dimension remains consistent with the input feature. In the adapted multi-head attention module MHA, only $W_{adpQ}$, $W_{adpK}$, $W_{adpV}$ and $W_{adpO}$ are trainable; the remaining parameters are kept unchanged during model training.
5-2, adapting a feedforward layer FFN;
The adapted feedforward layer FFN aims to fill and adapt the $W_1$ and $W_2$ matrices in the split FFN module so that it regains the learning capability for downstream tasks. Specifically, for a given input feature $X \in \mathbb{R}^{num \times D}$, the adapted feedforward layer FFN is computed as follows:

$F_{adp\_ffn} = \mathrm{Nonlinear}\big(X\, [W_1', W_{adp1}]\big)\, [W_2'; W_{adp2}]$

where $W_{adp1} \in \mathbb{R}^{D \times af}$ and $W_{adp2} \in \mathbb{R}^{af \times D}$, and af is the settable adaptation size of the feedforward layer. These matrices are spliced with the split pre-training matrices $W_1$ and $W_2$, so that the adapted feedforward layer FFN both retains the general pre-training knowledge and has the learning capability for downstream tasks. $F_{adp\_ffn} \in \mathbb{R}^{num \times D}$ is the final output feature and its dimension is kept the same as the input feature. Likewise, only $W_{adp1}$ and $W_{adp2}$ are trainable; the remaining parameters are kept unchanged during model training.
Further, in this embodiment, a single adapted MHA module has $4 \times ah \times D_H \times D$ learnable parameters, where ah is the settable number of adapted attention heads; and a single adapted FFN module has $2 \times af \times D$ learnable parameters, where af is the settable adaptation size of the feedforward layer. Both ah and af are integers greater than 0.
Step (6): designing progressive guided distillation training and training the model.
Specifically, the invention proposes a progressive guided distillation training algorithm, which adopts a fully fine-tuned model under the traditional "pre-training-fine-tuning" paradigm as the teacher model, and the student model is the adapter model obtained in step (5). In each training iteration, the teacher model and the student model are trained together, and the knowledge learned by the teacher model is distilled to the student adapter model in a step-by-step guided manner. To better align the parameter distribution of each layer of the student model, in addition to distilling the overall output features of the model, the training algorithm also synchronously distills the output features of each layer of the teacher model.
The teacher model is denoted $M_{tea}$, and its architecture is the original deep self-attention network model without splitting and adaptation; the student model is denoted $M_{stu}$, and its architecture is the split and adapted adapter model obtained in step (5). In each training iteration, a batch of input data X and targets Y is randomly sampled and fed into the teacher model $M_{tea}(X)$ to obtain the predicted labels $Y_{tea}$ and the per-layer outputs $\mathrm{Layer}_{tea}$; the teacher model loss $\mathcal{L}_{tea} = \mathcal{L}(Y_{tea}, Y)$ is computed and the teacher gradient is updated. The teacher model features are then frozen via $Y_{tea}.\mathrm{detach}()$ and $\mathrm{Layer}_{tea}.\mathrm{detach}()$. In the current training iteration, the input data X is synchronously fed into the student model $M_{stu}(X)$ to obtain the predicted labels $Y_{stu}$ and the per-layer outputs $\mathrm{Layer}_{stu}$, and the losses are computed:

$\mathcal{L}_{out} = \mathrm{MSE}(Y_{stu}, Y_{tea})$

$\mathcal{L}_{layer} = \sum_{l=1}^{L} \mathrm{MSE}\big(\mathrm{Layer}_{stu}^{(l)}, \mathrm{Layer}_{tea}^{(l)}\big)$

$\mathcal{L}_{stu} = \lambda_1 \mathcal{L}_{out} + \lambda_2 \mathcal{L}_{layer}$

where $\mathcal{L}_{out}$ is the output loss, $\mathcal{L}_{layer}$ is the layer loss, and $\mathcal{L}_{stu}$ is the final loss of the student model. $\lambda_1$ adjusts the proportion of the output loss and $\lambda_2$ adjusts the proportion of the layer loss; their values can be set. After the total loss is obtained, the gradient is computed and the student model is updated.
Further, in the present embodiment, $\lambda_1$ and $\lambda_2$ in the progressive guided distillation training algorithm adjust the ratio between the different losses of the student model and may be set to $\lambda_1 = \lambda_2 = 0.5$. MSE is the mean square error loss function.

Claims (10)

1. A light-weight adaptive network learning method for multi-mode multi-task learning is characterized by comprising the following steps:
step (1), constructing a downstream task data set and dividing it into a training set, a verification set and a test set, wherein the downstream tasks comprise visual question-answering, natural language visual reasoning, visual implication reasoning and visual target positioning;
step (2), extracting regional image features from image data in the data set by using an existing trained Faster R-CNN target detection network, extracting semantic features of text data in the data set by using word embedding vectors, and then concatenating the extracted image and text features to obtain final input features;
step (3), constructing a deep self-attention network model, wherein the deep self-attention network is formed by stacking a plurality of layers with the same structure, each layer consists of a multi-head attention module and a feedforward layer, and input features are understood and processed deeply through the deep self-attention network model to obtain multi-mode features with richer meanings;
step (4), pre-training weight clipping of the deep self-attention network model, wherein the weight clipping is divided into clipping the multi-head attention module and clipping the feedforward layer;
step (5), constructing a task adapter
the task adapter is a set of learnable parameters inserted into each layer of the deep self-attention network model, and one lightweight task adapter comprises two learnable parameter matrices $W_{down}$ and $W_{up}$ with a nonlinear activation function in between;
step (6), adapting the pre-training model, and combining the segmented pre-training model with the task adapter to obtain an adapter model;
step (7): design progressive guided distillation training and training model
a fully fine-tuned model under the traditional "pre-training-fine-tuning" paradigm is adopted as the teacher model and the adapter model is adopted as the student model; in each training iteration the teacher model and the student model are trained together, the knowledge learned by the teacher model is distilled to the student adapter model in a step-by-step guided manner, and in addition to distilling the overall output features of the adapter model, the training algorithm synchronously distills the output features of each layer of the teacher model.
2. The multi-mode multi-task learning-oriented lightweight adaptive network learning method according to claim 1, wherein the visual question-answering task adopts the VQA-v2 data set, the natural language visual reasoning task adopts the NLVR2 data set, the visual implication reasoning task adopts the SNLI-VE data set, and the visual target positioning task adopts the Ref-COCO, Ref-COCO+ and Ref-COCOg data sets.
3. The method for lightweight adaptive network learning for multi-modal multi-task learning as recited in claim 1, wherein in said step (2), for the images in said multi-modal dataset, the regional features of the images $X_{od} \in \mathbb{R}^{m \times D_{od}}$ are extracted using a Faster R-CNN target detection model pre-trained on the Visual Genome dataset, where m is the number of region candidate frames of the image and $D_{od}$ is the regional feature dimension; subsequently, a learnable linear transformation $\mathrm{Linear}: \mathbb{R}^{D_{od}} \rightarrow \mathbb{R}^{D}$ further processes the regional features of the image $X_{od}$ extracted by the target detection model, mapping their feature dimension into a D-dimensional space to obtain the final image region features $X_{image} \in \mathbb{R}^{m \times D}$, with the specific formula as follows:

$X_{image} = \mathrm{Linear}(X_{od})$ (1)

for the text in the multi-modal dataset, semantic features $X_{text} \in \mathbb{R}^{n \times D}$ are extracted using word embedding vectors, where n is the number of words of the text and D is the dimension of the semantic features, the same as the dimension of the final image region features;

then, the extracted image and text features are concatenated to obtain the final input features $X_{input} \in \mathbb{R}^{num \times D}$, with the specific formula as follows:

$X_{input} = [X_{image}, X_{text}]$ (2)
where num=m+n is the total number of image and text features.
4. The method for light-weight adaptive network learning for multi-mode and multi-task learning according to claim 3, wherein the method for constructing the deep self-attention network model is as follows:
the deep self-attention network model is formed by stacking a plurality of self-attention layers with the same structure, each Layer consisting of a multi-head attention module MHA and a feedforward layer FFN, and the input features $X_{input} \in \mathbb{R}^{num \times D}$ are deeply understood and processed to obtain multi-modal features with richer meaning $X_{output} \in \mathbb{R}^{num \times D}$;

for the multi-head attention module MHA, for a given query feature $Q \in \mathbb{R}^{num \times D}$, key feature $K \in \mathbb{R}^{num \times D}$ and value feature $V \in \mathbb{R}^{num \times D}$, where D is the feature dimension, the multi-head attention module comprises H parallel attention heads and computes the feature $F_{mha} \in \mathbb{R}^{num \times D}$, with the specific formulas as follows:

$F_{mha} = \mathrm{MHA}(Q, K, V) = [\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_H]\, W_O$ (3)

$\mathrm{head}_h = \mathrm{ATT}(Q W_Q^h, K W_K^h, V W_V^h)$ (4)

where $W_Q^h, W_K^h, W_V^h \in \mathbb{R}^{D \times D_H}$ represent the projection matrices of the h-th attention head, $D_H$ represents the feature dimension of each attention head, computed as $D_H = D / H$, and the matrix $W_O \in \mathbb{R}^{D \times D}$ further maps the features obtained by the multi-head attention computation; ATT represents the attention computation, which performs a scaled dot-product operation on the projected query feature Q and key feature K to obtain an attention matrix and performs a weighted summation with the projected value feature V, with the specific formula as follows:

$\mathrm{ATT}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{D_H}}\right) V$ (5)

the feedforward layer FFN comprises two fully connected layers and an activation function; it takes the output feature of the MHA module $F_{mha} \in \mathbb{R}^{num \times D}$ as the input feature, projects it into a high-dimensional space and maps it back to the original dimension to obtain the output feature $F_{ffn} \in \mathbb{R}^{num \times D}$, with the specific formula as follows:

$F_{ffn} = \mathrm{FFN}(F_{mha}) = \mathrm{Nonlinear}(F_{mha} W_1)\, W_2$

where $W_1 \in \mathbb{R}^{D \times 4D}$ and $W_2 \in \mathbb{R}^{4D \times D}$ are linear transformation projection matrices, and Nonlinear is a nonlinear activation function.
5. The method for lightweight adaptive network learning for multi-modal multi-task learning as claimed in claim 4, wherein each self-attention Layer comprises the multi-head attention module MHA and the feedforward layer FFN described above; for a given input feature $X_{in} \in \mathbb{R}^{num \times D}$, the Layer applies the MHA module and the FFN in sequence, each followed by a residual connection and layer normalization LN, to obtain the output feature $X_{out} \in \mathbb{R}^{num \times D}$, where LN represents layer normalization.
6. A method of lightweight adaptive network learning for multi-modal multi-task learning as claimed in any one of claims 1-5, wherein said deep self-attention network model, denoted Model, deeply understands and processes the input features $X_{input} \in \mathbb{R}^{num \times D}$ to obtain multi-modal features with richer meaning $X_{output} \in \mathbb{R}^{num \times D}$, with the specific formulas as follows:

$\mathrm{Model} = [\mathrm{Layer}^{(1)}, \mathrm{Layer}^{(2)}, \ldots, \mathrm{Layer}^{(L)}]$ (9)

$X_{output} = \mathrm{Model}(X_{input})$ (10)

where L is the number of self-attention layers; the model is then initialized with the weights $W_{pretrain}$ pre-trained on a large-scale corpus of image-text pairs, as follows:

$\mathrm{Model} \leftarrow W_{pretrain}$ (11).
7. The method for lightweight adaptive network learning for multi-modal multi-task learning according to claim 5, wherein in the step (4), the pre-training weights are clipped as follows:

splitting the multi-head attention module MHA:

for a given input feature $X \in \mathbb{R}^{num \times D}$, the split multi-head attention module is computed as follows:

$F_{p\_mha} = \mathrm{MHA}(Q, K, V) = [\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_{H-t}]\, W_O$ (12)

where H is the original number of attention heads, t is the number of attention heads cut off, $W_Q^h, W_K^h, W_V^h \in \mathbb{R}^{D \times D_H}$ represent the parameter matrices of the h-th attention head, $D_H = D / H$ is the dimension of each attention head, ATT is the attention computation, $W_O \in \mathbb{R}^{(H-t) D_H \times D}$ is correspondingly cut to match the attention feature dimension $(H-t) D_H$ after splitting, and $F_{p\_mha} \in \mathbb{R}^{num \times D}$ is the output feature of the finally split attention module;

splitting the feedforward layer FFN:

for a given input feature $X \in \mathbb{R}^{num \times D}$, the split feedforward layer FFN is computed as follows:

$F_{p\_ffn} = \mathrm{Nonlinear}(X W_1')\, W_2'$

where $W_1' \in \mathbb{R}^{D \times (4D - s)}$ and $W_2' \in \mathbb{R}^{(4D - s) \times D}$ are the parameter matrices of the split feedforward layer FFN module, s is the set splitting dimension, Nonlinear is the activation function, and $F_{p\_ffn} \in \mathbb{R}^{num \times D}$ is the output feature.
8. The method for lightweight adaptive network learning for multi-modal multi-task learning according to claim 7, wherein in the step (5), a task adapter is constructed, specifically as follows:
the input and output features of each task adapter are respectively denoted as the input feature $F_{adp\_in}$ and the output feature $F_{adp\_out}$, and each task adapter is computed as follows:

$F_{adp\_mid} = \mathrm{Nonlinear}(F_{adp\_in} W_{down})$ (15)

$F_{adp\_out} = F_{adp\_in} + F_{adp\_mid} W_{up}$ (16)

where $W_{down} \in \mathbb{R}^{D \times r}$, $W_{up} \in \mathbb{R}^{r \times D}$, Nonlinear is a nonlinear activation function, D is the input-output feature dimension, and r is the adapter size parameter; each task adapter comprises a residual connection, which connects the input feature $F_{adp\_in}$ to the output feature $F_{adp\_out}$ through a bypass.
9. The method for lightweight adaptive network learning for multi-modal multi-task learning according to claim 8, wherein in the step (6), the pre-training model is adapted as follows:
adapting said multi-headed attention module MHA:
inputting features for a given downstream taskThe adapted multi-head attention module calculates the following modes:
wherein ,represents the h adaptive attention head parameter matrix, D represents the input feature dimension, D H Keeping the dimension of each attention head consistent with the original attention head; ATT is the attention calculation mode, ah is the number of settable adaptive notes and force heads, +.>The attention head output characteristics of the task adapter are spliced with the original attention head output to obtain the adapted attention characteristics; in order to match the adapted attention feature dimension +.>Is additionally introduced and is associated with the split +.>Splicing, jointly processing the adapted attention features and finally outputting the features +.>The dimension of the multi-head attention module MHA is consistent with the input characteristics, and the adapted multi-head attention module MHA only has W adpQ 、W adpK 、W adpV and />The training is carried out, and the rest parameters are kept unchanged in the model training;
adapting the feed-forward layer FFN:

for a given input feature X ∈ R^{m×D}, the adapted feed-forward layer FFN is computed as:

F_adp_ffn = Nonlinear(X [W_1, W_adp1]) [W_2; W_adp2]

where W_adp1 ∈ R^{D×af} and W_adp2 ∈ R^{af×D}, af is the settable adaptive size of the feed-forward layer; these matrices are concatenated with the split pre-training matrices W_1 and W_2, so that the adapted feed-forward layer FFN both retains the general pre-training knowledge and gains learning capability for the downstream tasks; F_adp_ffn ∈ R^{m×D} is the final output feature, whose dimension is kept consistent with the input; likewise, only W_adp1 and W_adp2 are trainable, and the remaining parameters remain unchanged during model training.
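And a corresponding sketch of the adapted feed-forward layer: the trainable matrices W_adp1 and W_adp2 are concatenated with the frozen split matrices W_1 and W_2 before the standard FFN computation (GELU stands in for Nonlinear; names are assumptions):

```python
import torch
import torch.nn.functional as F

def adapted_ffn(x, w1, w2, w_adp1, w_adp2):
    """w1: (D, D_ff - s) frozen, w_adp1: (D, af) trainable;
    w2: (D_ff - s, D) frozen, w_adp2: (af, D) trainable."""
    w1_full = torch.cat([w1, w_adp1], dim=1)   # widen the hidden layer by af
    w2_full = torch.cat([w2, w_adp2], dim=0)
    return F.gelu(x @ w1_full) @ w2_full       # F_adp_ffn, shape (m, D)

D, hidden, af = 768, 2048, 256
out = adapted_ffn(torch.randn(16, D),
                  torch.randn(D, hidden), torch.randn(hidden, D),
                  torch.randn(D, af), torch.randn(af, D))
print(out.shape)  # torch.Size([16, 768])
```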
10. The method for lightweight adaptive network learning for multi-modal multi-task learning according to claim 9, wherein in step (7) the progressively guided distillation training is designed and the model is trained as follows:

the teacher model is denoted M_tea, and its architecture is the original deep self-attention network model without splitting or adaptation; the student model is denoted M_stu, and its architecture is the split and adapted adapter model obtained in step (6); in each training iteration, a batch of input data X and targets Y is randomly sampled and fed into the teacher model M_tea(X) to obtain the predicted labels Y_tea and the per-layer outputs Layer_tea, the teacher model loss L_tea is calculated and the gradient ∇L_tea is used to update the teacher; the teacher model feature gradients are then frozen via Y_tea.detach() and Layer_tea.detach(); within the current training iteration, the input data X is synchronously fed into the student model M_stu(X) to obtain the predicted labels Y_stu and the per-layer outputs Layer_stu, and the loss is calculated:

L_stu = λ_1 · L_output + λ_2 · L_layer

where L_output is the output loss between the student prediction Y_stu and the frozen teacher prediction Y_tea, L_layer is the layer loss between the student layer outputs Layer_stu and the frozen teacher layer outputs Layer_tea, L_stu is the final loss of the student model, λ_1 adjusts the proportion of the output loss and λ_2 adjusts the proportion of the layer loss; the gradient ∇L_stu is calculated and the student model is updated.
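Finally, an illustrative sketch of one iteration of the progressively guided distillation: the teacher is updated on the ground-truth targets, its predictions and layer outputs are detached, and the student is trained with the λ1/λ2-weighted output and layer losses. It assumes both models return (logits, list of layer outputs); the KL and MSE loss choices are assumptions:

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, opt_tea, opt_stu, x, y, lam1=1.0, lam2=1.0):
    # Teacher pass: predictions + per-layer outputs, updated on ground truth Y.
    y_tea, layers_tea = teacher(x)
    loss_tea = F.cross_entropy(y_tea, y)
    opt_tea.zero_grad(); loss_tea.backward(); opt_tea.step()

    # Freeze teacher features: Y_tea.detach(), Layer_tea.detach().
    y_tea = y_tea.detach()
    layers_tea = [l.detach() for l in layers_tea]

    # Student pass on the same batch X within the same iteration.
    y_stu, layers_stu = student(x)
    loss_output = F.kl_div(F.log_softmax(y_stu, dim=-1),
                           F.softmax(y_tea, dim=-1), reduction="batchmean")
    loss_layer = sum(F.mse_loss(s, t) for s, t in zip(layers_stu, layers_tea))
    loss_stu = lam1 * loss_output + lam2 * loss_layer  # final student loss
    opt_stu.zero_grad(); loss_stu.backward(); opt_stu.step()
    return loss_tea.item(), loss_stu.item()
```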
CN202310629849.4A 2023-05-31 2023-05-31 Multi-mode multi-task learning oriented lightweight adaptive network learning method Pending CN116644316A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310629849.4A CN116644316A (en) 2023-05-31 2023-05-31 Multi-mode multi-task learning oriented lightweight adaptive network learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310629849.4A CN116644316A (en) 2023-05-31 2023-05-31 Multi-mode multi-task learning oriented lightweight adaptive network learning method

Publications (1)

Publication Number Publication Date
CN116644316A true CN116644316A (en) 2023-08-25

Family

ID=87622651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310629849.4A Pending CN116644316A (en) 2023-05-31 2023-05-31 Multi-mode multi-task learning oriented lightweight adaptive network learning method

Country Status (1)

Country Link
CN (1) CN116644316A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117194985A (en) * 2023-09-18 2023-12-08 镁佳(北京)科技有限公司 Multi-mode multi-task training system and multi-mode multi-task training method
CN117194985B (en) * 2023-09-18 2024-05-10 镁佳(北京)科技有限公司 Multi-mode multi-task training system and multi-mode multi-task training method
CN117273068A (en) * 2023-09-28 2023-12-22 东南大学 Model initialization method based on linearly expandable learning genes
CN117273068B (en) * 2023-09-28 2024-04-16 东南大学 Model initialization method based on linearly expandable learning genes
CN117521759A (en) * 2024-01-04 2024-02-06 支付宝(杭州)信息技术有限公司 Training method and device for large model
CN117521759B (en) * 2024-01-04 2024-04-05 支付宝(杭州)信息技术有限公司 Training method and device for large model
CN117555230A (en) * 2024-01-11 2024-02-13 深圳市东莱尔智能科技有限公司 IO module multi-adapter control method and device and multi-channel IO module
CN117555230B (en) * 2024-01-11 2024-03-19 深圳市东莱尔智能科技有限公司 IO module multi-adapter control method and device and multi-channel IO module
CN117574961A (en) * 2024-01-15 2024-02-20 成都信息工程大学 Parameter efficient method and device for injecting adapter into pre-training model
CN117574961B (en) * 2024-01-15 2024-03-22 成都信息工程大学 Parameter efficient method and device for injecting adapter into pre-training model
CN117574982A (en) * 2024-01-16 2024-02-20 之江实验室 Pre-training model fine tuning method and device based on linear transformation
CN117574982B (en) * 2024-01-16 2024-04-26 之江实验室 Pre-training model fine tuning method and device based on linear transformation

Similar Documents

Publication Publication Date Title
CN116644316A (en) Multi-mode multi-task learning oriented lightweight adaptive network learning method
CN112328767B (en) Question-answer matching method based on BERT model and comparative aggregation framework
US20190050734A1 (en) Compression method of deep neural networks
WO2022126797A1 (en) Automatic compression method and platform for multilevel knowledge distillation-based pre-trained language model
CN110188358A (en) The training method and device of Natural Language Processing Models
CN111709493B (en) Object classification method, training device, object classification equipment and storage medium
JP7283835B2 (en) Automatic Compression Method and Platform for Pre-trained Language Models Based on Multilevel Knowledge Distillation
CN112990296A (en) Image-text matching model compression and acceleration method and system based on orthogonal similarity distillation
CN110489567A (en) A kind of node information acquisition method and its device based on across a network Feature Mapping
CN111667016B (en) Incremental information classification method based on prototype
CN111368545A (en) Named entity identification method and device based on multi-task learning
CN112488209A (en) Incremental image classification method based on semi-supervised learning
CN113282721A (en) Visual question-answering method based on network structure search
CN112232086A (en) Semantic recognition method and device, computer equipment and storage medium
CN116821291A (en) Question-answering method and system based on knowledge graph embedding and language model alternate learning
CN115775000A (en) Method and device for realizing automatic question answering
CN110309515A (en) Entity recognition method and device
CN116543289B (en) Image description method based on encoder-decoder and Bi-LSTM attention model
CN116822593A (en) Large-scale pre-training language model compression method based on hardware perception
CN116151335A (en) Pulse neural network light weight method and system suitable for embedded equipment
CN114647752A (en) Lightweight visual question-answering method based on bidirectional separable deep self-attention network
CN114880527A (en) Multi-modal knowledge graph representation method based on multi-prediction task
CN112905599B (en) Distributed deep hash retrieval method based on end-to-end
CN112132059B (en) Pedestrian re-identification method and system based on depth conditional random field
CN115578593A (en) Domain adaptation method using residual attention module

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination