CN116644316A - Multi-mode multi-task learning oriented lightweight adaptive network learning method - Google Patents

Multi-mode multi-task learning oriented lightweight adaptive network learning method

Info

Publication number
CN116644316A
Authority
CN
China
Prior art keywords
model
attention
training
task
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310629849.4A
Other languages
Chinese (zh)
Inventor
邵镇炜
金子添
余宙
俞俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202310629849.4A priority Critical patent/CN116644316A/en
Publication of CN116644316A publication Critical patent/CN116644316A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lightweight adaptive network learning method for multi-modal multi-task learning, which comprises the following steps: 1. constructing a downstream task data set; 2. constructing a deep self-attention network model; 3. clipping the pre-trained weights; 4. constructing a task adapter; 5. adapting the pre-trained model; and 6. designing progressive guided distillation training and training the model. The method clips part of the weights of the pre-trained model and adapts the remaining weights with an efficient task adapter. The invention further proposes a progressive guided distillation training algorithm to better bridge the gap between the pre-training task and the downstream task and to guarantee the performance of the model on the downstream task. The invention can be combined with any existing pre-trained model based on the deep self-attention network, and training yields an adapter model that is advantageous in downstream task performance, total storage cost at deployment, computation cost at inference, and model configuration flexibility.

Description

Multi-mode multi-task learning oriented lightweight adaptive network learning method
Technical Field
The invention belongs to the field of lightweight multi-modal learning, and in particular relates to a lightweight adaptive network learning method for multi-modal multi-task learning.
Background
In recent years, many research fields of artificial intelligence have benefited greatly from the advent of the deep self-attention network architecture and the self-supervised pre-training paradigm. Taking the multi-modal field as an example, researchers use deep self-attention network models with large numbers of parameters and adopt this training paradigm: the model is first pre-trained on a large-scale corpus of image-text pairs to learn general multi-modal knowledge, and its parameters are then fine-tuned separately for different multi-modal tasks such as visual question-answering, visual target positioning, image description, image-text retrieval, natural language visual reasoning and visual implication reasoning. Under this "pre-training-fine-tuning" paradigm, the entire set of model parameters changes when fine-tuning for each downstream task, which means that multiple large-scale models with different weights must be deployed for different downstream tasks when the model is put into production, consuming considerable storage space.
For this purpose, adapter tuning methods have been developed, which insert lightweight learnable parameters into the model without changing the structure or parameter weights of the pre-trained model itself. Specifically, a small number of learnable parameters called adapters are inserted into the pre-trained model, and only these parameters are trained during downstream-task fine-tuning while the original parameters of the pre-trained model are kept unchanged, so that the pre-trained model is fine-tuned on downstream tasks with much higher parameter efficiency.
Existing adapter fine-tuning methods are parameter-efficient and reduce storage cost when the model is deployed for multiple tasks. However, because the original parameters of the pre-trained model are kept unchanged and a small number of parameters are added to adapt to multiple downstream tasks, the adapted model becomes larger than the original model, which increases memory cost during model training and inference cost in application. If the representation capability of different parts of the pre-trained model can be analyzed in depth, the parts that are general to downstream tasks retained, the parts useless for downstream tasks clipped away, and a lightweight adapter then introduced, the method can be both parameter-efficient and computation-efficient at inference time. Therefore, designing an adapter fine-tuning method that is efficient in multiple respects for pre-trained models has practical value for the field of pre-trained model deployment and academic value for research in other fields.
In summary, how to design an efficient adapter fine-tuning method and combine it with existing pre-trained models is a topic worthy of intensive study. This patent starts from several key points of the task, discusses and resolves the difficulties and key issues of existing methods, and forms a complete and efficient lightweight adapter fine-tuning method.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a lightweight adaptive network learning method for multi-modal multi-task learning, which can be combined with any existing pre-trained model based on a deep self-attention network and, through training, yields an adapter model that is advantageous in downstream task performance, total storage cost at deployment, computation cost at inference, and model configuration flexibility.
The invention mainly comprises two points:
1. By analyzing the representation capability of weights at different positions of the pre-trained model, the invention clips part of the pre-trained weights and designs an efficient adapter to adapt the remaining weights, proposing a prune-then-fill pre-trained model adapter framework so that the adapted model is lighter than the original model.
2. To better bridge the gap between the pre-training task and the downstream task, the invention proposes a progressive guided distillation training algorithm for adapter fine-tuning, so that the adapter model is trained stably step by step and achieves better downstream task performance.
For the scenario of deploying a pre-trained model for multiple tasks, the invention clips part of the weights of the pre-trained model, reducing the computation cost at inference by reducing the model size. Meanwhile, to reduce the storage cost of the multi-modal model at deployment, the invention uses an efficient adapter structure and reduces the trainable parameters through adapter fine-tuning, further reducing the storage cost at deployment. Finally, to better bridge the gap between the pre-training task and the downstream task, a progressive guided distillation training algorithm is proposed, which guarantees the performance of the model on the downstream task. The invention can be combined with any existing pre-trained model based on the deep self-attention network, and training yields an adapter model that is advantageous in downstream task performance, total storage cost at deployment, computation cost at inference, and model configuration flexibility.
A light-weight adaptive network learning method for multi-mode multi-task learning comprises the following steps:
step (1): constructing a downstream task data set;
the invention takes the multi-mode research field as an entry point, and selects 4 multi-mode downstream tasks of visual question-answering, natural language visual reasoning, visual implication reasoning and visual target positioning to construct training, verifying and testing data sets. Extracting regional image features from image data in a dataset by using an existing trained Faster R-CNN target detection network; for text data in a dataset, word embedding vectors are used to extract its semantic features. And then splicing the extracted image and text features to obtain final input features.
Step (2): constructing a deep self-attention network model;
the deep self-attention network is formed by stacking a plurality of layers with the same structure, and each layer consists of a multi-head attention module and a feedforward layer. And constructing a deep self-attention network model for deeply understanding and processing the input features to obtain multi-modal features with richer meanings.
Step (3): pre-training weight clipping;
in order to reduce the size of the pre-training model and improve the model reasoning speed, the invention cuts out part of pre-training weights. According to the deep self-attention network model structure, weight clipping is divided into two types of splitting multi-head attention modules and splitting feedforward layers.
Step (4): constructing a task adapter;
the task adapter refers to a small amount of learnable parameters inserted into each layer of the deep self-attention network model, the model only trains the parameters when the downstream task is fine-tuned, the original pre-training weight of the model is kept unchanged, and finally the model still has the learning capability of the downstream task under the condition of keeping pre-training general knowledge. A lightweight task adapter includes two matrices of learnable parameters and a nonlinear activation function in the middle.
Step (5): adapting a pre-training model;
combining the pre-training model after the segmentation in the step (3) with the task adapter in the step (4) to obtain an adapted pre-training model. For different downstream tasks, they share pre-training weights after segmentation and independently share different task adapters. Also, according to the deep self-attention network model structure, the adaptive pre-training model is two types of adaptive multi-head attention modules and adaptive feedforward layers.
Step (6): designing progressive guided distillation training and training models;
In order to train the adapter model obtained in step (5) stably, the invention proposes a progressive guided distillation training algorithm: a fully fine-tuned model under the traditional "pre-training-fine-tuning" paradigm is used as the teacher model, and the student model is the adapter model obtained in step (5). In each training iteration, the teacher model and the student model are trained together, and the knowledge learned by the teacher model is distilled to the student adapter model in a step-by-step guided manner. To better align the parameter distribution of each layer of the student model, in addition to distilling the overall output features of the model, the training algorithm also synchronously distills the output features of each layer of the teacher model.
Further, the construction of the downstream task data set in the step (1) is specifically as follows:
Among the multi-modal downstream tasks, the visual question-answering task adopts the VQA-v2 data set, the natural language visual reasoning task adopts the NLVR2 data set, the visual implication reasoning task adopts the SNLI-VE data set, and the visual target positioning task adopts the Ref-COCO, Ref-COCO+ and Ref-COCOg data sets. Each data set is further divided into 3 subsets: a training set, a validation set and a test set. The training set is used for training the model, the validation set is used for locally verifying the convergence of the model, and the test set is used for the final model performance evaluation.
For the images in the multi-modal dataset, the regional features of the images $X_{od} \in \mathbb{R}^{m \times D_{od}}$ are extracted using a Faster R-CNN target detection model pre-trained on the Visual Genome dataset, where m is the number of region candidate frames of the image and $D_{od}$ is the regional feature dimension. Subsequently, a learnable linear transformation $\mathrm{Linear}: \mathbb{R}^{D_{od}} \rightarrow \mathbb{R}^{D}$ further processes the image features extracted by the target detection model, mapping their feature dimension into a D-dimensional space to obtain the final image region features $X_{image} \in \mathbb{R}^{m \times D}$. The specific formula is as follows:

$X_{image} = \mathrm{Linear}(X_{od})$ (1)

For the text in the multi-modal dataset, semantic features $X_{text} \in \mathbb{R}^{n \times D}$ are extracted using word embedding vectors, where n is the number of words in the text and D is the semantic feature dimension, which is the same as the final image region feature dimension.

Then, the extracted image and text features are concatenated to obtain the final input features $X_{input} \in \mathbb{R}^{num \times D}$. The specific formula is as follows:

$X_{input} = [X_{image}, X_{text}]$ (2)
where num=m+n is the total number of image and text features.
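As an illustrative sketch only (not part of the original disclosure), the input-feature construction of step (1) can be expressed in PyTorch; the concrete dimensions, vocabulary size and variable names below are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Assumed dimensions: m region boxes, n words, D_od detector feature size, D shared size.
m, n, D_od, D = 36, 14, 2048, 768

# X_od: region features assumed to be pre-extracted by a Faster R-CNN detector.
X_od = torch.randn(m, D_od)
linear = nn.Linear(D_od, D)                 # learnable linear transformation, Eq. (1)
X_image = linear(X_od)                      # (m, D)

# X_text: word-embedding features of the paired text.
vocab_size = 30000                          # assumed vocabulary size
embedding = nn.Embedding(vocab_size, D)
token_ids = torch.randint(0, vocab_size, (n,))
X_text = embedding(token_ids)               # (n, D)

# Concatenate image and text features along the token axis, Eq. (2).
X_input = torch.cat([X_image, X_text], dim=0)   # (m + n, D) = (num, D)
```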
Further, the deep self-attention network model is constructed in the step (2), and the method specifically comprises the following steps:
The deep self-attention network is formed by stacking a plurality of layers with the same structure, and each Layer is composed of a multi-head attention module MHA and a feedforward layer FFN. A deep self-attention network model is built to deeply understand and process the input features $X_{input} \in \mathbb{R}^{num \times D}$ and obtain multi-modal features with richer meaning $X_{output} \in \mathbb{R}^{num \times D}$.
2-1. Multi-head attention Module MHA;
For a given query feature $Q \in \mathbb{R}^{num \times D}$, key feature $K \in \mathbb{R}^{num \times D}$ and value feature $V \in \mathbb{R}^{num \times D}$, where D is the feature dimension, the multi-head attention module contains H parallel attention heads and computes the feature $F_{mha} \in \mathbb{R}^{num \times D}$. The specific formulas are as follows:

$F_{mha} = \mathrm{MHA}(Q, K, V) = [\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_H]\, W_O$ (3)

$\mathrm{head}_h = \mathrm{ATT}(Q W_Q^h, K W_K^h, V W_V^h)$ (4)

where $W_Q^h, W_K^h, W_V^h \in \mathbb{R}^{D \times D_H}$ represent the projection matrices of the h-th attention head, and $D_H$ represents the feature dimension of each attention head, computed as $D_H = D / H$. Meanwhile, the matrix $W_O \in \mathbb{R}^{D \times D}$ further maps the features obtained by the multi-head attention computation. ATT denotes the attention computation, which performs a scaled dot-product operation on the projected query feature Q and key feature K to obtain an attention matrix, and then performs a weighted summation with the projected value feature V. The specific formula is as follows:

$\mathrm{ATT}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{D_H}}\right) V$ (5)
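A minimal sketch of the multi-head attention computation of Eqs. (3)-(5), assuming a PyTorch implementation; batching, masking and bias terms are omitted, and the class and variable names are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal MHA following Eqs. (3)-(5): H parallel heads of size D_H = D / H."""
    def __init__(self, D: int = 768, H: int = 12):
        super().__init__()
        self.H, self.D_H = H, D // H
        # Per-head projections W_Q^h, W_K^h, W_V^h packed into single matrices.
        self.W_Q = nn.Linear(D, D, bias=False)
        self.W_K = nn.Linear(D, D, bias=False)
        self.W_V = nn.Linear(D, D, bias=False)
        self.W_O = nn.Linear(D, D, bias=False)   # output projection W_O

    def attention(self, q, k, v):
        # Scaled dot-product attention, Eq. (5).
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.D_H)
        return scores.softmax(dim=-1) @ v

    def forward(self, Q, K, V):
        def split(x):  # (num, D) -> (H, num, D_H)
            return x.view(x.size(0), self.H, self.D_H).transpose(0, 1)
        heads = self.attention(split(self.W_Q(Q)), split(self.W_K(K)), split(self.W_V(V)))
        # Concatenate the heads, Eq. (3), and apply the output projection W_O.
        F_mha = heads.transpose(0, 1).reshape(Q.size(0), -1)
        return self.W_O(F_mha)
```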
2-2, a feedforward layer FFN;
The feedforward layer comprises two fully connected layers and an activation function. It takes the output feature of the MHA module $F_{mha} \in \mathbb{R}^{num \times D}$ as the input feature, projects it into a high-dimensional space and then maps it back to the original dimension to obtain the output feature $F_{ffn} \in \mathbb{R}^{num \times D}$. The specific formula is as follows:

$F_{ffn} = \mathrm{FFN}(F_{mha}) = \mathrm{Nonlinear}(F_{mha} W_1)\, W_2$

where $W_1 \in \mathbb{R}^{D \times 4D}$ and $W_2 \in \mathbb{R}^{4D \times D}$ are linear transformation projection matrices and Nonlinear is a nonlinear activation function.
2-3, self-attention Layer;
Each self-attention Layer contains the multi-head attention module MHA and the feedforward layer FFN described above. For a given input feature $X_{in} \in \mathbb{R}^{num \times D}$, the Layer applies the MHA module and the FFN in sequence, each followed by a residual connection and layer normalization LN, to obtain the output feature $X_{out} \in \mathbb{R}^{num \times D}$, where LN represents layer normalization.
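Continuing the sketch above (and reusing the MultiHeadAttention class from it), one self-attention Layer and the stacked model of Eqs. (9)-(10) could look as follows; the 4D hidden size, GELU activation and post-LN placement are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Feedforward layer: project to a higher dimension, then back to D."""
    def __init__(self, D: int = 768, hidden: int = 4 * 768):
        super().__init__()
        self.W_1 = nn.Linear(D, hidden)       # W_1: D -> 4D (assumed expansion ratio)
        self.W_2 = nn.Linear(hidden, D)       # W_2: 4D -> D
        self.nonlinear = nn.GELU()            # assumed activation

    def forward(self, F_mha):
        return self.W_2(self.nonlinear(self.W_1(F_mha)))

class SelfAttentionLayer(nn.Module):
    """One self-attention Layer: MHA and FFN, each with residual connection + LayerNorm."""
    def __init__(self, D: int = 768, H: int = 12):
        super().__init__()
        self.mha = MultiHeadAttention(D, H)   # from the MHA sketch above
        self.ffn = FeedForward(D)
        self.ln1 = nn.LayerNorm(D)
        self.ln2 = nn.LayerNorm(D)

    def forward(self, X):
        F = self.ln1(X + self.mha(X, X, X))   # residual + LN around self-attention
        return self.ln2(F + self.ffn(F))      # residual + LN around the FFN

# The deep model of Eqs. (9)-(10) stacks L such layers, e.g. L = 12.
model = nn.Sequential(*[SelfAttentionLayer() for _ in range(12)])
```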
2-4, deep self-attention network Model;
Considering that the feature dimension D does not change through each self-attention Layer, multiple self-attention layers can be stacked to form a deep self-attention network model, denoted Model, which deeply understands and processes the input features $X_{input} \in \mathbb{R}^{num \times D}$ to obtain multi-modal features with richer meaning $X_{output} \in \mathbb{R}^{num \times D}$. The specific formulas are as follows:

$\mathrm{Model} = [\mathrm{Layer}^{(1)}, \mathrm{Layer}^{(2)}, \ldots, \mathrm{Layer}^{(L)}]$ (9)

$X_{output} = \mathrm{Model}(X_{input})$ (10)

where L is the number of self-attention layers. Subsequently, the model is initialized with the weights $W_{pretrain}$ pre-trained on a large-scale corpus of image-text pairs. The specific formula is as follows:

$\mathrm{Model} \leftarrow W_{pretrain}$ (11)
further, the pre-training weight clipping in the step (3) is specifically as follows:
in order to reduce the size of the pre-training model and improve the model reasoning speed, the invention cuts out part of pre-training weights. According to the deep self-attention network model structure in the step (2), pre-training weight clipping is divided into two types of a splitting multi-head attention module MHA and a splitting feedforward layer FFN.
3-1, segmenting a multi-head attention module MHA;
The split multi-head attention module aims to cut the number of attention heads H without changing the input and output feature dimension D, thereby reducing the number of module parameters. Specifically, for a given input feature $X \in \mathbb{R}^{num \times D}$, the split multi-head attention module is computed as follows:

$F_{p\_mha} = \mathrm{MHA}(Q, K, V) = [\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_{H-t}]\, W_O$ (12)

where H is the original number of attention heads and t is the number of attention heads cut off. $W_Q^h, W_K^h, W_V^h \in \mathbb{R}^{D \times D_H}$ are the parameter matrices of the h-th attention head and $D_H = D / H$ is the dimension of each attention head. ATT is the attention computation, as shown in equation (5). To match the attention feature dimension $(H-t) D_H$ after splitting, $W_O \in \mathbb{R}^{(H-t) D_H \times D}$ is correspondingly cut as well. $F_{p\_mha} \in \mathbb{R}^{num \times D}$ is the output feature of the split attention module; its dimension is consistent with the input feature, which shows that the MHA module after attention-head splitting does not change the feature dimension.
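A hedged sketch of how the head pruning of Eq. (12) could be realized on packed projection matrices; the slicing convention (dropping the last t heads) and the tensor layout are assumptions, since the patent does not specify which heads are removed.

```python
import torch

def prune_attention_heads(W_Q, W_K, W_V, W_O, H: int, t: int):
    """Remove the last t of H attention heads from packed projection weights.

    W_Q, W_K, W_V: (D, D) matrices whose columns are grouped per head (D_H each).
    W_O:           (D, D) matrix whose rows are grouped per head.
    Returns weights for an MHA module with H - t heads, as in Eq. (12).
    """
    D = W_Q.size(0)
    D_H = D // H
    keep = (H - t) * D_H
    # Keep only the columns (resp. rows for W_O) of the first H - t heads.
    return (W_Q[:, :keep].clone(), W_K[:, :keep].clone(),
            W_V[:, :keep].clone(), W_O[:keep, :].clone())

# Example: D = 768, H = 12 heads, prune t = 4 heads.
D, H, t = 768, 12, 4
W = [torch.randn(D, D) for _ in range(4)]
pruned = prune_attention_heads(*W, H=H, t=t)
print([tuple(p.shape) for p in pruned])  # [(768, 512), (768, 512), (768, 512), (512, 768)]
```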
3-2, cutting the feedforward layer FFN;
segmentationThe feedforward layer aims at containing the parameter matrix W without changing the input-output characteristic dimension D 1 and W2 And (5) cutting. Specifically, for a given input featureThe calculation method of the cut feedforward layer FFN is as follows:
wherein ,and s is a set segmentation dimension, and Nonlinear is an activation function for the parameter matrix of the FFN module of the feed-forward layer after segmentation. />For output features, its dimension and input featuresAnd keeping the same, namely the feedforward layer FFN after being cut, and ensuring that the characteristic dimension is not changed.
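A corresponding sketch for cutting s hidden dimensions from the FFN matrices; which s dimensions are dropped is an assumption of this sketch.

```python
import torch

def prune_ffn(W_1, W_2, s: int):
    """Cut s hidden dimensions from the feedforward matrices.

    W_1: (D, 4D) up-projection, W_2: (4D, D) down-projection.
    The last s hidden columns of W_1 and rows of W_2 are removed.
    """
    keep = W_1.size(1) - s
    return W_1[:, :keep].clone(), W_2[:keep, :].clone()

# Example: D = 768, hidden = 4 * 768 = 3072, cut s = 1024 hidden units.
D = 768
W_1, W_2 = torch.randn(D, 4 * D), torch.randn(4 * D, D)
W_1_p, W_2_p = prune_ffn(W_1, W_2, s=1024)
print(W_1_p.shape, W_2_p.shape)  # (768, 2048), (2048, 768)
```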
Further, the task adapter is constructed in the step (4), specifically as follows:
The task adapter refers to a small number of learnable parameters inserted into each layer of the deep self-attention network model. When fine-tuning on a downstream task, the model trains only these parameters while keeping the original pre-trained weights unchanged, so that the model retains the general pre-training knowledge and still acquires the learning capability for the downstream task. A lightweight task adapter comprises two learnable parameter matrices $W_{down}$ and $W_{up}$ with a nonlinear activation function in between. For the input feature $F_{adp\_in}$ and output feature $F_{adp\_out}$, each task adapter is computed as follows:

$F_{adp\_mid} = \mathrm{Nonlinear}(F_{adp\_in} W_{down})$ (15)

$F_{adp\_out} = F_{adp\_in} + F_{adp\_mid} W_{up}$ (16)

where $W_{down} \in \mathbb{R}^{D \times r}$, $W_{up} \in \mathbb{R}^{r \times D}$, Nonlinear is a nonlinear activation function, D is the input-output feature dimension, and r is the adapter size parameter: the larger r is, the more learnable parameters there are and, in general, the stronger the learning capability of the adapter. The feature dimension D is not changed after the input features are processed by the task adapter, which makes it convenient to insert the adapter into the deep self-attention network model. To stabilize model training, each task adapter contains a residual connection, which connects the input feature $F_{adp\_in}$ to the output feature $F_{adp\_out}$ through a bypass.
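A minimal sketch of the task adapter of Eqs. (15)-(16), assuming PyTorch; the choice of ReLU as the nonlinearity is an assumption, since the patent leaves the activation function unspecified.

```python
import torch
import torch.nn as nn

class TaskAdapter(nn.Module):
    """Lightweight task adapter: down-project to r dims, activate, up-project to D,
    with a residual (bypass) connection from input to output."""
    def __init__(self, D: int = 768, r: int = 64):
        super().__init__()
        self.W_down = nn.Linear(D, r, bias=False)   # W_down: D -> r
        self.W_up = nn.Linear(r, D, bias=False)     # W_up:   r -> D
        self.nonlinear = nn.ReLU()                  # assumed activation

    def forward(self, F_adp_in):
        F_adp_mid = self.nonlinear(self.W_down(F_adp_in))   # Eq. (15)
        return F_adp_in + self.W_up(F_adp_mid)              # Eq. (16), residual connection

# 2 * D * r learnable parameters per adapter, e.g. 2 * 768 * 64 = 98,304.
adapter = TaskAdapter()
out = adapter(torch.randn(50, 768))   # feature dimension D is unchanged
```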
Further, the adapting pre-training model in the step (5) is specifically as follows:
Combining the pre-trained model clipped in step (3) with the task adapter in step (4) yields the adapted pre-trained model. Different downstream tasks share the clipped pre-training weights, while each task independently owns its own task adapter. Likewise, according to the deep self-attention network model structure, adapting the pre-trained model falls into two types: adapting the multi-head attention module and adapting the feedforward layer.
5-1, adapting a multi-head attention module MHA;
The adapted multi-head attention module MHA aims to insert a small number of trainable task adapters as new attention heads into the split pre-trained MHA. Specifically, for a given downstream-task input feature $X \in \mathbb{R}^{num \times D}$, the adapted multi-head attention module is computed as follows:

$F_{adp\_mha} = [\mathrm{head}_1, \ldots, \mathrm{head}_{H-t}, \mathrm{head}^{adp}_1, \ldots, \mathrm{head}^{adp}_{ah}]\, [W_O; W_{adpO}]$

$\mathrm{head}^{adp}_h = \mathrm{ATT}(X W_{adpQ}^h, X W_{adpK}^h, X W_{adpV}^h)$

where $W_{adpQ}^h, W_{adpK}^h, W_{adpV}^h \in \mathbb{R}^{D \times D_H}$ represent the parameter matrices of the h-th adapted attention head, D represents the input feature dimension, and $D_H$ is the dimension of each attention head, consistent with the original attention heads. ATT is the attention computation, and ah is the settable number of adapted attention heads. The attention-head output features of the task adapter are concatenated with the original attention-head outputs to obtain the adapted attention features. To match the adapted attention feature dimension, $W_{adpO} \in \mathbb{R}^{ah \cdot D_H \times D}$ is additionally introduced and spliced with the split $W_O$ to jointly process the adapted attention features, and the final output feature is $F_{adp\_mha} \in \mathbb{R}^{num \times D}$, whose dimension remains consistent with the input feature. In the adapted multi-head attention module MHA, only $W_{adpQ}$, $W_{adpK}$, $W_{adpV}$ and $W_{adpO}$ are trainable; the remaining parameters are kept unchanged during model training.
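A hedged sketch of the adapted multi-head attention, assuming PyTorch. Splicing the two groups of heads before a joint output projection is implemented here as the mathematically equivalent sum of two projected branches; the class structure and parameter layout are assumptions.

```python
import torch
import torch.nn as nn

class AdaptedMHA(nn.Module):
    """Adapted MHA: H - t frozen pre-trained heads plus ah trainable adapter heads."""
    def __init__(self, D: int = 768, H: int = 12, t: int = 4, ah: int = 2):
        super().__init__()
        self.D_H, self.kept, self.ah = D // H, H - t, ah
        # Pruned pre-trained projections (frozen).
        self.W_Q = nn.Linear(D, self.kept * self.D_H, bias=False)
        self.W_K = nn.Linear(D, self.kept * self.D_H, bias=False)
        self.W_V = nn.Linear(D, self.kept * self.D_H, bias=False)
        self.W_O = nn.Linear(self.kept * self.D_H, D, bias=False)
        # Adapter-head projections (trainable).
        self.W_adpQ = nn.Linear(D, ah * self.D_H, bias=False)
        self.W_adpK = nn.Linear(D, ah * self.D_H, bias=False)
        self.W_adpV = nn.Linear(D, ah * self.D_H, bias=False)
        self.W_adpO = nn.Linear(ah * self.D_H, D, bias=False)
        # Freeze everything except the adapter parameters.
        for mod in (self.W_Q, self.W_K, self.W_V, self.W_O):
            mod.weight.requires_grad_(False)

    def _heads(self, q, k, v, n_heads):
        num = q.size(0)
        q, k, v = (x.view(num, n_heads, self.D_H).transpose(0, 1) for x in (q, k, v))
        att = (q @ k.transpose(-2, -1) / self.D_H ** 0.5).softmax(-1) @ v
        return att.transpose(0, 1).reshape(num, -1)

    def forward(self, X):
        F_pre = self._heads(self.W_Q(X), self.W_K(X), self.W_V(X), self.kept)
        F_adp = self._heads(self.W_adpQ(X), self.W_adpK(X), self.W_adpV(X), self.ah)
        # [pre heads, adapter heads] [W_O; W_adpO]  ==  pre @ W_O + adp @ W_adpO.
        return self.W_O(F_pre) + self.W_adpO(F_adp)
```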
5-2, adapting a feedforward layer FFN;
The adapted feedforward layer FFN aims to fill and adapt the $W_1$ and $W_2$ matrices in the split FFN module so that it regains the learning capability for downstream tasks. Specifically, for a given input feature $X \in \mathbb{R}^{num \times D}$, the adapted feedforward layer FFN is computed as follows:

$F_{adp\_ffn} = \mathrm{Nonlinear}\big(X\, [W_1', W_{adp1}]\big)\, [W_2'; W_{adp2}]$

where $W_{adp1} \in \mathbb{R}^{D \times af}$ and $W_{adp2} \in \mathbb{R}^{af \times D}$, and af is the settable adaptation size of the feedforward layer. These matrices are spliced with the split pre-training matrices $W_1$ and $W_2$, so that the adapted feedforward layer FFN both retains the general pre-training knowledge and has the learning capability for downstream tasks. $F_{adp\_ffn} \in \mathbb{R}^{num \times D}$ is the final output feature and its dimension is kept the same as the input feature. Likewise, only $W_{adp1}$ and $W_{adp2}$ are trainable; the remaining parameters are kept unchanged during model training.
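A hedged sketch of the adapted feedforward layer, assuming PyTorch; the concrete sizes, the GELU activation and the equivalence used (concatenating adapter columns equals summing two branches) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdaptedFFN(nn.Module):
    """Adapted FFN: af trainable hidden units concatenated to the pruned, frozen W_1 / W_2."""
    def __init__(self, D: int = 768, hidden_pruned: int = 2048, af: int = 96):
        super().__init__()
        self.W_1 = nn.Linear(D, hidden_pruned, bias=False)     # pruned, frozen
        self.W_2 = nn.Linear(hidden_pruned, D, bias=False)     # pruned, frozen
        self.W_adp1 = nn.Linear(D, af, bias=False)             # adapter columns, trainable
        self.W_adp2 = nn.Linear(af, D, bias=False)             # adapter rows, trainable
        self.nonlinear = nn.GELU()                             # assumed activation
        for mod in (self.W_1, self.W_2):
            mod.weight.requires_grad_(False)

    def forward(self, X):
        # Concatenating [W_1', W_adp1] along the hidden axis, then applying [W_2'; W_adp2],
        # is equivalent to summing the pre-trained and adapter branches.
        hidden = self.nonlinear(torch.cat([self.W_1(X), self.W_adp1(X)], dim=-1))
        h_pre, h_adp = hidden.split([self.W_1.out_features, self.W_adp1.out_features], dim=-1)
        return self.W_2(h_pre) + self.W_adp2(h_adp)
```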
Further, the design progressive guided distillation training and model training described in the step (6) is specifically as follows:
The invention proposes a progressive guided distillation training algorithm, which adopts a fully fine-tuned model under the traditional "pre-training-fine-tuning" paradigm as the teacher model, and the student model is the adapter model obtained in step (5). In each training iteration, the teacher model and the student model are trained together, and the knowledge learned by the teacher model is distilled to the student adapter model in a step-by-step guided manner. To better align the parameter distribution of each layer of the student model, in addition to distilling the overall output features of the model, the training algorithm also synchronously distills the output features of each layer of the teacher model.
The teacher model is denoted $M_{tea}$, and its architecture is the original deep self-attention network model without splitting and adaptation; the student model is denoted $M_{stu}$, and its architecture is the split and adapted adapter model obtained in step (5). In each training iteration, a batch of input data X and targets Y is randomly sampled and fed into the teacher model $M_{tea}(X)$ to obtain the predicted labels $Y_{tea}$ and the per-layer outputs $\mathrm{Layer}_{tea}$; the teacher model loss $\mathcal{L}_{tea} = \mathcal{L}(Y_{tea}, Y)$ is computed and the teacher gradient is updated. The teacher model features are then frozen via $Y_{tea}.\mathrm{detach}()$ and $\mathrm{Layer}_{tea}.\mathrm{detach}()$. In the current training iteration, the input data X is synchronously fed into the student model $M_{stu}(X)$ to obtain the predicted labels $Y_{stu}$ and the per-layer outputs $\mathrm{Layer}_{stu}$, and the losses are computed:

$\mathcal{L}_{out} = \mathrm{MSE}(Y_{stu}, Y_{tea})$

$\mathcal{L}_{layer} = \sum_{l=1}^{L} \mathrm{MSE}\big(\mathrm{Layer}_{stu}^{(l)}, \mathrm{Layer}_{tea}^{(l)}\big)$

$\mathcal{L}_{stu} = \lambda_1 \mathcal{L}_{out} + \lambda_2 \mathcal{L}_{layer}$

where $\mathcal{L}_{out}$ is the output loss, $\mathcal{L}_{layer}$ is the layer loss, and $\mathcal{L}_{stu}$ is the final loss of the student model. $\lambda_1$ adjusts the proportion of the output loss and $\lambda_2$ adjusts the proportion of the layer loss; their values can be set. After the total loss is obtained, the gradient is computed and the student model is updated.
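A sketch of one iteration of the progressive guided distillation training, assuming PyTorch and assuming the models return both their prediction and a list of per-layer outputs; the function signature, the MSE targets and the loss combination follow the description above but are an illustrative reconstruction, not the patent's reference implementation.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, task_loss_fn, opt_tea, opt_stu,
                      X, Y, lam1=0.5, lam2=0.5):
    """One iteration: train the teacher on Y, then distill its detached features
    (overall output and every layer output) into the student adapter model."""
    # 1. Teacher forward/backward on the ground-truth target Y.
    Y_tea, layers_tea = teacher(X)
    loss_tea = task_loss_fn(Y_tea, Y)
    opt_tea.zero_grad(); loss_tea.backward(); opt_tea.step()

    # 2. Freeze teacher features so no gradient flows back into the teacher.
    Y_tea = Y_tea.detach()
    layers_tea = [l.detach() for l in layers_tea]

    # 3. Student forward; distill the overall output and each layer output (MSE).
    Y_stu, layers_stu = student(X)
    loss_out = F.mse_loss(Y_stu, Y_tea)
    loss_layer = sum(F.mse_loss(s, t) for s, t in zip(layers_stu, layers_tea))
    loss_stu = lam1 * loss_out + lam2 * loss_layer

    opt_stu.zero_grad(); loss_stu.backward(); opt_stu.step()
    return loss_tea.item(), loss_stu.item()
```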
The invention has the following beneficial effects:
according to the invention, through analyzing the characterization capability of different position weights of the pre-training model, the invention cuts part of the pre-training weights, designs an efficient adapter for adapting the pre-training weights, and provides a pre-training model adapter framework with pruning and filling firstly, so that the adapted model is lighter than the original model.
For the scenario of deploying a pre-trained model for multiple tasks, the method first clips part of the weights of the pre-trained model, reducing the computation cost by reducing the model size. Meanwhile, the invention uses an efficient task adapter structure and reduces the trainable parameters through adapter fine-tuning, further reducing the storage cost of the model at deployment. Finally, the invention proposes a progressive guided distillation training algorithm to better bridge the gap between the pre-training task and the downstream task and to guarantee the performance of the model on the downstream task. The invention can be combined with any existing pre-trained model based on the deep self-attention network, and training yields an adapter model that is advantageous in downstream task performance, total storage cost at deployment, computation cost at inference, and model configuration flexibility.
Drawings
FIG. 1 is a diagram illustrating pre-training weight clipping in an embodiment of the present invention.
Fig. 2 is a schematic diagram of a task adapter according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of an adaptive pre-training model according to an embodiment of the present invention.
Detailed Description
The detailed parameters of the present invention are further described below with reference to the drawings.
As shown in fig. 1,2 and 3, the invention provides a light-weight adaptive network learning method for multi-mode and multi-task learning.
The construction of the downstream task data set in the step (1) is specifically as follows:
the end-use datasets included VQA-v2, NLVR2, SNLI-VE, ref-COCO, ref-COCO+ and Ref-COCOg, encompassing 4 multi-modal downstream tasks of visual question-answering, natural language visual reasoning, visual implication reasoning and visual target localization. All data sets are divided into 3 subsets: training set, validation set and test set. The training set is used for training the model, the verification set is used for locally verifying the convergence condition of the model, and the test set is used for final model performance evaluation.
Further, for the images in the multi-modal dataset, the regional features of the images $X_{od} \in \mathbb{R}^{m \times D_{od}}$ are extracted using a Faster R-CNN object detection model pre-trained on the Visual Genome dataset, where m is the number of region candidate frames of the image and $D_{od}$ is the regional feature dimension. Subsequently, a learnable linear transformation $\mathrm{Linear}: \mathbb{R}^{D_{od}} \rightarrow \mathbb{R}^{D}$ further processes the image features extracted by the target detection model, mapping their feature dimension into a D-dimensional space to obtain the final image region features $X_{image} \in \mathbb{R}^{m \times D}$. The specific formula is as follows:

$X_{image} = \mathrm{Linear}(X_{od})$ (1)

For the text in the multi-modal dataset, semantic features $X_{text} \in \mathbb{R}^{n \times D}$ are extracted using word embedding vectors, where n is the number of words in the text and D is the semantic feature dimension, the same as the final image region feature dimension.

Then, the extracted image and text features are concatenated to obtain the final input features $X_{input} \in \mathbb{R}^{num \times D}$. The specific formula is as follows:

$X_{input} = [X_{image}, X_{text}]$ (2)
where num=m+n is the total number of image and text features.
Specifically, in the present embodiment, for the image region features, the maximum number of candidate frames is set to m = 36 and the region feature dimension is $D_{od} = 2048$; for the text semantic features, the maximum number of words is set to n = 14; and the image region feature dimension and the semantic feature dimension are D = 768. The resulting input features are $X_{input} \in \mathbb{R}^{50 \times 768}$.
Step (2): constructing a deep self-attention network model. The deep self-attention network is formed by stacking a plurality of layers with the same structure, and each layer consists of a multi-head attention module and a feedforward layer. The deep self-attention network model is constructed to deeply understand and process the input features and obtain multi-modal features with richer meaning.
Specifically, the deep self-attention network is formed by stacking a plurality of layers with the same structure, and each Layer is composed of a multi-head attention module MHA and a feedforward layer FFN. A deep self-attention network model is built to deeply understand and process the input features $X_{input} \in \mathbb{R}^{num \times D}$ and obtain multi-modal features with richer meaning $X_{output} \in \mathbb{R}^{num \times D}$.
2-1. Multi-head attention Module MHA;
For a given query feature $Q \in \mathbb{R}^{num \times D}$, key feature $K \in \mathbb{R}^{num \times D}$ and value feature $V \in \mathbb{R}^{num \times D}$, where D is the feature dimension, the multi-head attention module contains H parallel attention heads and computes the feature $F_{mha} \in \mathbb{R}^{num \times D}$. The specific formulas are as follows:

$F_{mha} = \mathrm{MHA}(Q, K, V) = [\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_H]\, W_O$ (3)

$\mathrm{head}_h = \mathrm{ATT}(Q W_Q^h, K W_K^h, V W_V^h)$ (4)

where $W_Q^h, W_K^h, W_V^h \in \mathbb{R}^{D \times D_H}$ represent the projection matrices of the h-th attention head, and $D_H$ represents the feature dimension of each attention head, computed as $D_H = D / H$. Meanwhile, the matrix $W_O \in \mathbb{R}^{D \times D}$ further maps the features obtained by the multi-head attention computation. ATT denotes the attention computation, which performs a scaled dot-product operation on the projected query feature Q and key feature K to obtain an attention matrix, and then performs a weighted summation with the projected value feature V. The specific formula is as follows:

$\mathrm{ATT}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{D_H}}\right) V$ (5)
2-2, a feedforward layer FFN;
The feedforward layer comprises two fully connected layers and an activation function. It takes the output feature of the MHA module $F_{mha} \in \mathbb{R}^{num \times D}$ as the input feature, projects it into a high-dimensional space and then maps it back to the original dimension to obtain the output feature $F_{ffn} \in \mathbb{R}^{num \times D}$. The specific formula is as follows:

$F_{ffn} = \mathrm{FFN}(F_{mha}) = \mathrm{Nonlinear}(F_{mha} W_1)\, W_2$

where $W_1 \in \mathbb{R}^{D \times 4D}$ and $W_2 \in \mathbb{R}^{4D \times D}$ are linear transformation projection matrices and Nonlinear is a nonlinear activation function.
2-3, self-attention Layer;
Each self-attention Layer contains the multi-head attention module MHA and the feedforward layer FFN described above. For a given input feature $X_{in} \in \mathbb{R}^{num \times D}$, the Layer applies the MHA module and the FFN in sequence, each followed by a residual connection and layer normalization LN, to obtain the output feature $X_{out} \in \mathbb{R}^{num \times D}$, where LN represents layer normalization.
2-4, deep self-attention network Model;
Considering that the feature dimension D does not change through each self-attention Layer, multiple self-attention layers can be stacked to form a deep self-attention network model, denoted Model, which deeply understands and processes the input features $X_{input} \in \mathbb{R}^{num \times D}$ to obtain multi-modal features with richer meaning $X_{output} \in \mathbb{R}^{num \times D}$. The specific formulas are as follows:

$\mathrm{Model} = [\mathrm{Layer}^{(1)}, \mathrm{Layer}^{(2)}, \ldots, \mathrm{Layer}^{(L)}]$ (9)

$X_{output} = \mathrm{Model}(X_{input})$ (10)

where L is the number of self-attention layers. Subsequently, the model is initialized with the weights $W_{pretrain}$ pre-trained on a large-scale corpus of image-text pairs. The specific formula is as follows:

$\mathrm{Model} \leftarrow W_{pretrain}$ (11)
In the present embodiment, by setting D = 768 and H = 12, the feature dimension of each attention head is $D_H = 64$. The input features $X_{input} \in \mathbb{R}^{50 \times 768}$ pass through the MHA to obtain the feature $F_{mha} \in \mathbb{R}^{50 \times 768}$ and through the FFN to obtain the feature $F_{ffn} \in \mathbb{R}^{50 \times 768}$; finally, the input features pass through the deep self-attention network model to obtain the output features $X_{output} \in \mathbb{R}^{50 \times 768}$.
Step (3): pre-training weight clipping. In order to reduce the size of the pre-trained model and improve the model inference speed, the invention clips part of the pre-training weights. According to the deep self-attention network model structure, weight clipping is divided into two types: splitting the multi-head attention module and splitting the feedforward layer.
Specifically, in order to reduce the size of the pre-training model and improve the model reasoning speed, the invention cuts out part of pre-training weights. According to the deep self-attention network model structure in the step (2), pre-training weight clipping is divided into two types of a splitting multi-head attention module MHA and a splitting feedforward layer FFN.
3-1, segmenting a multi-head attention module MHA;
The split multi-head attention module aims to cut the number of attention heads H without changing the input and output feature dimension D, thereby reducing the number of module parameters. Specifically, for a given input feature $X \in \mathbb{R}^{num \times D}$, the split multi-head attention module is computed as follows:

$F_{p\_mha} = \mathrm{MHA}(Q, K, V) = [\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_{H-t}]\, W_O$ (12)

where H is the original number of attention heads and t is the number of attention heads cut off. $W_Q^h, W_K^h, W_V^h \in \mathbb{R}^{D \times D_H}$ are the parameter matrices of the h-th attention head and $D_H = D / H$ is the dimension of each attention head. ATT is the attention computation, as shown in equation (5). To match the attention feature dimension $(H-t) D_H$ after splitting, $W_O \in \mathbb{R}^{(H-t) D_H \times D}$ is correspondingly cut as well. $F_{p\_mha} \in \mathbb{R}^{num \times D}$ is the output feature of the split attention module; its dimension is consistent with the input feature, which shows that the MHA module after attention-head splitting does not change the feature dimension.
3-2, cutting the feedforward layer FFN;
The split feedforward layer aims to cut the parameter matrices $W_1$ and $W_2$ it contains without changing the input and output feature dimension D. Specifically, for a given input feature $X \in \mathbb{R}^{num \times D}$, the split feedforward layer FFN is computed as follows:

$F_{p\_ffn} = \mathrm{Nonlinear}(X W_1')\, W_2'$

where $W_1' \in \mathbb{R}^{D \times (4D - s)}$ and $W_2' \in \mathbb{R}^{(4D - s) \times D}$ are the parameter matrices of the split feedforward layer FFN module, s is the set splitting dimension, and Nonlinear is the activation function. $F_{p\_ffn} \in \mathbb{R}^{num \times D}$ is the output feature; its dimension is kept the same as the input feature, i.e., the split feedforward layer FFN still does not change the feature dimension.
Further, in this embodiment, the split multi-head attention module MHA needs to change the $W_Q$, $W_K$, $W_V$ and $W_O$ parameter matrices simultaneously, which reduces the parameter count of a single MHA module by $4 \times t \times D \times D_H$, where t is the set number of attention heads to cut, i.e., $t \in \{0, 1, \ldots, H\}$, and H = 12.

The split feedforward layer FFN needs to change the $W_1$ and $W_2$ parameter matrices simultaneously, which reduces the parameter count of a single FFN module by $2 \times s \times D$, where $s \in \{0, 1, \ldots, 4D\}$ is the cut dimension and D = 768.
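For illustration, the parameter savings above can be checked with a short calculation under the embodiment's assumed setting (D = 768, H = 12, $D_H$ = 64); the concrete choices of t and s below are examples only.

```python
# Parameter reduction per module under the assumed base setting D = 768, D_H = 64.
D, D_H = 768, 64

def mha_saving(t):            # 4 * t * D * D_H parameters removed per MHA module
    return 4 * t * D * D_H

def ffn_saving(s):            # 2 * s * D parameters removed per FFN module
    return 2 * s * D

print(mha_saving(4))          # cutting t = 4 heads:            786,432 parameters
print(ffn_saving(1024))       # cutting s = 1024 hidden dims: 1,572,864 parameters
```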
Step (4): constructing a task adapter. The task adapter refers to a small number of learnable parameters inserted into each layer of the deep self-attention network model; when fine-tuning on a downstream task, the model trains only these parameters while keeping its original pre-training weights unchanged, so that the model still acquires the learning capability for the downstream task while retaining general pre-training knowledge. A lightweight task adapter includes two learnable parameter matrices and a nonlinear activation function in the middle.

Specifically, a lightweight task adapter comprises two learnable parameter matrices $W_{down}$ and $W_{up}$ with a nonlinear activation function in between. For the input feature $F_{adp\_in}$ and output feature $F_{adp\_out}$, each task adapter is computed as follows:

$F_{adp\_mid} = \mathrm{Nonlinear}(F_{adp\_in} W_{down})$ (15)

$F_{adp\_out} = F_{adp\_in} + F_{adp\_mid} W_{up}$ (16)

where $W_{down} \in \mathbb{R}^{D \times r}$, $W_{up} \in \mathbb{R}^{r \times D}$, Nonlinear is a nonlinear activation function, D is the input-output feature dimension, and r is the adapter size parameter: the larger r is, the more learnable parameters there are and, in general, the stronger the learning capability of the adapter. The feature dimension D is not changed after the input features are processed by the task adapter, which makes it convenient to insert the adapter into the deep self-attention network model. To stabilize model training, each task adapter contains a residual connection, which connects the input feature $F_{adp\_in}$ to the output feature $F_{adp\_out}$ through a bypass.
In this embodiment, the input feature $F_{adp\_in}$ and the output feature $F_{adp\_out}$ denote only the input and output of each adapter module, which facilitates describing the computation of each adapter. They are equivalent to the input features and output features of the Layer module described above.
Further, in this embodiment, the number of learnable parameters of each task adapter is $2 \times D \times r$, where r is the task adapter size; r = 64 may be set.
Step (5): adapting the pre-trained model. The pre-trained model clipped in step (3) is combined with the task adapter in step (4) to obtain the adapted pre-trained model. Different downstream tasks share the clipped pre-training weights, while each task independently owns its own task adapter. Likewise, according to the deep self-attention network model structure, adapting the pre-trained model falls into two types: adapting the multi-head attention module and adapting the feedforward layer.

Specifically, combining the pre-trained model clipped in step (3) with the task adapter in step (4) yields the adapted pre-trained model. For different downstream tasks, the clipped pre-training weights are shared and each task has its own independent task adapter.
5-1, adapting a multi-head attention module MHA;
The adapted multi-head attention module MHA aims to insert a small number of trainable task adapters as new attention heads into the split pre-trained MHA. Specifically, for a given downstream-task input feature $X \in \mathbb{R}^{num \times D}$, the adapted multi-head attention module is computed as follows:

$F_{adp\_mha} = [\mathrm{head}_1, \ldots, \mathrm{head}_{H-t}, \mathrm{head}^{adp}_1, \ldots, \mathrm{head}^{adp}_{ah}]\, [W_O; W_{adpO}]$

$\mathrm{head}^{adp}_h = \mathrm{ATT}(X W_{adpQ}^h, X W_{adpK}^h, X W_{adpV}^h)$

where $W_{adpQ}^h, W_{adpK}^h, W_{adpV}^h \in \mathbb{R}^{D \times D_H}$ represent the parameter matrices of the h-th adapted attention head, D represents the input feature dimension, and $D_H$ is the dimension of each attention head, consistent with the original attention heads. ATT is the attention computation, and ah is the settable number of adapted attention heads. The attention-head output features of the task adapter are concatenated with the original attention-head outputs to obtain the adapted attention features. To match the adapted attention feature dimension, $W_{adpO} \in \mathbb{R}^{ah \cdot D_H \times D}$ is additionally introduced and spliced with the split $W_O$ to jointly process the adapted attention features, and the final output feature is $F_{adp\_mha} \in \mathbb{R}^{num \times D}$, whose dimension remains consistent with the input feature. In the adapted multi-head attention module MHA, only $W_{adpQ}$, $W_{adpK}$, $W_{adpV}$ and $W_{adpO}$ are trainable; the remaining parameters are kept unchanged during model training.
5-2, adapting a feedforward layer FFN;
The adapted feedforward layer FFN aims to fill and adapt the $W_1$ and $W_2$ matrices in the split FFN module so that it regains the learning capability for downstream tasks. Specifically, for a given input feature $X \in \mathbb{R}^{num \times D}$, the adapted feedforward layer FFN is computed as follows:

$F_{adp\_ffn} = \mathrm{Nonlinear}\big(X\, [W_1', W_{adp1}]\big)\, [W_2'; W_{adp2}]$

where $W_{adp1} \in \mathbb{R}^{D \times af}$ and $W_{adp2} \in \mathbb{R}^{af \times D}$, and af is the settable adaptation size of the feedforward layer. These matrices are spliced with the split pre-training matrices $W_1$ and $W_2$, so that the adapted feedforward layer FFN both retains the general pre-training knowledge and has the learning capability for downstream tasks. $F_{adp\_ffn} \in \mathbb{R}^{num \times D}$ is the final output feature and its dimension is kept the same as the input feature. Likewise, only $W_{adp1}$ and $W_{adp2}$ are trainable; the remaining parameters are kept unchanged during model training.
Further, in this embodiment, a single adapted MHA module has $4 \times ah \times D_H \times D$ learnable parameters, where ah is the settable number of adapted attention heads; and a single adapted FFN module has $2 \times af \times D$ learnable parameters, where af is the settable adaptation size of the feedforward layer. Both ah and af are integers greater than 0.
Step (6): designing progressive guided distillation training and training the model.
Specifically, the invention proposes a progressive guided distillation training algorithm, which adopts a fully fine-tuned model under the traditional "pre-training-fine-tuning" paradigm as the teacher model, and the student model is the adapter model obtained in step (5). In each training iteration, the teacher model and the student model are trained together, and the knowledge learned by the teacher model is distilled to the student adapter model in a step-by-step guided manner. To better align the parameter distribution of each layer of the student model, in addition to distilling the overall output features of the model, the training algorithm also synchronously distills the output features of each layer of the teacher model.
The teacher model is denoted $M_{tea}$, and its architecture is the original deep self-attention network model without splitting and adaptation; the student model is denoted $M_{stu}$, and its architecture is the split and adapted adapter model obtained in step (5). In each training iteration, a batch of input data X and targets Y is randomly sampled and fed into the teacher model $M_{tea}(X)$ to obtain the predicted labels $Y_{tea}$ and the per-layer outputs $\mathrm{Layer}_{tea}$; the teacher model loss $\mathcal{L}_{tea} = \mathcal{L}(Y_{tea}, Y)$ is computed and the teacher gradient is updated. The teacher model features are then frozen via $Y_{tea}.\mathrm{detach}()$ and $\mathrm{Layer}_{tea}.\mathrm{detach}()$. In the current training iteration, the input data X is synchronously fed into the student model $M_{stu}(X)$ to obtain the predicted labels $Y_{stu}$ and the per-layer outputs $\mathrm{Layer}_{stu}$, and the losses are computed:

$\mathcal{L}_{out} = \mathrm{MSE}(Y_{stu}, Y_{tea})$

$\mathcal{L}_{layer} = \sum_{l=1}^{L} \mathrm{MSE}\big(\mathrm{Layer}_{stu}^{(l)}, \mathrm{Layer}_{tea}^{(l)}\big)$

$\mathcal{L}_{stu} = \lambda_1 \mathcal{L}_{out} + \lambda_2 \mathcal{L}_{layer}$

where $\mathcal{L}_{out}$ is the output loss, $\mathcal{L}_{layer}$ is the layer loss, and $\mathcal{L}_{stu}$ is the final loss of the student model. $\lambda_1$ adjusts the proportion of the output loss and $\lambda_2$ adjusts the proportion of the layer loss; their values can be set. After the total loss is obtained, the gradient is computed and the student model is updated.
Further, in the present embodiment, $\lambda_1$ and $\lambda_2$ in the progressive guided distillation training algorithm adjust the ratio between the different losses of the student model and may be set to $\lambda_1 = \lambda_2 = 0.5$. MSE is the mean square error loss function.

Claims (10)

1. A light-weight adaptive network learning method for multi-mode multi-task learning is characterized by comprising the following steps:
step (1), constructing a downstream task data set and dividing it into a training set, a verification set and a test set, wherein the downstream tasks comprise visual question-answering, natural language visual reasoning, visual implication reasoning and visual target positioning;
step (2), extracting regional image features from image data in the data set by using an existing trained Faster R-CNN target detection network, extracting semantic features of text data in the data set by using word embedding vectors, and then concatenating the extracted image and text features to obtain final input features;
step (3), constructing a deep self-attention network model, wherein the deep self-attention network is formed by stacking a plurality of layers with the same structure, each layer consists of a multi-head attention module and a feedforward layer, and input features are understood and processed deeply through the deep self-attention network model to obtain multi-mode features with richer meanings;
step (4), pre-training weight clipping of the deep self-attention network model, wherein the weight clipping is divided into clipping the multi-head attention module and clipping the feedforward layer;
step (5), constructing a task adapter
the task adapter is a set of learnable parameters inserted into each layer of the deep self-attention network model, and one lightweight task adapter comprises two learnable parameter matrices $W_{down}$ and $W_{up}$ with a nonlinear activation function in between;
step (6), adapting the pre-training model, and combining the segmented pre-training model with the task adapter to obtain an adapter model;
step (7): design progressive guided distillation training and training model
a fully fine-tuned model under the traditional "pre-training-fine-tuning" paradigm is adopted as the teacher model and the adapter model is adopted as the student model; in each training iteration the teacher model and the student model are trained together, the knowledge learned by the teacher model is distilled to the student adapter model in a step-by-step guided manner, and in addition to distilling the overall output features of the adapter model, the training algorithm synchronously distills the output features of each layer of the teacher model.
2. The multi-mode multi-task learning-oriented lightweight adaptive network learning method according to claim 1, wherein the visual question-answering task adopts the VQA-v2 data set, the natural language visual reasoning task adopts the NLVR2 data set, the visual implication reasoning task adopts the SNLI-VE data set, and the visual target positioning task adopts the Ref-COCO, Ref-COCO+ and Ref-COCOg data sets.
3. The method for lightweight adaptive network learning for multi-modal multi-task learning as recited in claim 1, wherein in said step (2), for the images in said multi-modal dataset, the regional features of the images $X_{od} \in \mathbb{R}^{m \times D_{od}}$ are extracted using a Faster R-CNN target detection model pre-trained on the Visual Genome dataset, where m is the number of region candidate frames of the image and $D_{od}$ is the regional feature dimension; subsequently, a learnable linear transformation $\mathrm{Linear}: \mathbb{R}^{D_{od}} \rightarrow \mathbb{R}^{D}$ further processes the regional features of the image $X_{od}$ extracted by the target detection model, mapping their feature dimension into a D-dimensional space to obtain the final image region features $X_{image} \in \mathbb{R}^{m \times D}$, with the specific formula as follows:

$X_{image} = \mathrm{Linear}(X_{od})$ (1)

for the text in the multi-modal dataset, semantic features $X_{text} \in \mathbb{R}^{n \times D}$ are extracted using word embedding vectors, where n is the number of words of the text and D is the dimension of the semantic features, the same as the dimension of the final image region features;

then, the extracted image and text features are concatenated to obtain the final input features $X_{input} \in \mathbb{R}^{num \times D}$, with the specific formula as follows:

$X_{input} = [X_{image}, X_{text}]$ (2)
where num=m+n is the total number of image and text features.
4. The method for light-weight adaptive network learning for multi-mode and multi-task learning according to claim 3, wherein the method for constructing the deep self-attention network model is as follows:
the deep self-attention network model is formed by stacking a plurality of self-attention layers with the same structure, each Layer consisting of a multi-head attention module MHA and a feedforward layer FFN, and the input features $X_{input} \in \mathbb{R}^{num \times D}$ are deeply understood and processed to obtain multi-modal features with richer meaning $X_{output} \in \mathbb{R}^{num \times D}$;

for the multi-head attention module MHA, for a given query feature $Q \in \mathbb{R}^{num \times D}$, key feature $K \in \mathbb{R}^{num \times D}$ and value feature $V \in \mathbb{R}^{num \times D}$, where D is the feature dimension, the multi-head attention module comprises H parallel attention heads and computes the feature $F_{mha} \in \mathbb{R}^{num \times D}$, with the specific formulas as follows:

$F_{mha} = \mathrm{MHA}(Q, K, V) = [\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_H]\, W_O$ (3)

$\mathrm{head}_h = \mathrm{ATT}(Q W_Q^h, K W_K^h, V W_V^h)$ (4)

where $W_Q^h, W_K^h, W_V^h \in \mathbb{R}^{D \times D_H}$ represent the projection matrices of the h-th attention head, $D_H$ represents the feature dimension of each attention head, computed as $D_H = D / H$, and the matrix $W_O \in \mathbb{R}^{D \times D}$ further maps the features obtained by the multi-head attention computation; ATT represents the attention computation, which performs a scaled dot-product operation on the projected query feature Q and key feature K to obtain an attention matrix and performs a weighted summation with the projected value feature V, with the specific formula as follows:

$\mathrm{ATT}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{D_H}}\right) V$ (5)

the feedforward layer FFN comprises two fully connected layers and an activation function; it takes the output feature of the MHA module $F_{mha} \in \mathbb{R}^{num \times D}$ as the input feature, projects it into a high-dimensional space and maps it back to the original dimension to obtain the output feature $F_{ffn} \in \mathbb{R}^{num \times D}$, with the specific formula as follows:

$F_{ffn} = \mathrm{FFN}(F_{mha}) = \mathrm{Nonlinear}(F_{mha} W_1)\, W_2$

where $W_1 \in \mathbb{R}^{D \times 4D}$ and $W_2 \in \mathbb{R}^{4D \times D}$ are linear transformation projection matrices, and Nonlinear is a nonlinear activation function.
5. The method for lightweight adaptive network learning for multi-modal multi-task learning as claimed in claim 4, wherein each self-attention Layer comprises the multi-head attention module MHA and the feedforward layer FFN described above; for a given input feature $X_{in} \in \mathbb{R}^{num \times D}$, the Layer applies the MHA module and the FFN in sequence, each followed by a residual connection and layer normalization LN, to obtain the output feature $X_{out} \in \mathbb{R}^{num \times D}$, where LN represents layer normalization.
6. A method of lightweight adaptive network learning for multi-modal multi-task learning as claimed in any one of claims 1-5, wherein said deep self-attention network model, denoted Model, deeply understands and processes the input features $X_{input} \in \mathbb{R}^{num \times D}$ to obtain multi-modal features with richer meaning $X_{output} \in \mathbb{R}^{num \times D}$, with the specific formulas as follows:

$\mathrm{Model} = [\mathrm{Layer}^{(1)}, \mathrm{Layer}^{(2)}, \ldots, \mathrm{Layer}^{(L)}]$ (9)

$X_{output} = \mathrm{Model}(X_{input})$ (10)

where L is the number of self-attention layers; the model is then initialized with the weights $W_{pretrain}$ pre-trained on a large-scale corpus of image-text pairs, as follows:

$\mathrm{Model} \leftarrow W_{pretrain}$ (11).
7. The method for lightweight adaptive network learning for multi-modal multi-task learning according to claim 5, wherein in the step (4), the pre-training weights are clipped as follows:

splitting the multi-head attention module MHA:

for a given input feature $X \in \mathbb{R}^{num \times D}$, the split multi-head attention module is computed as follows:

$F_{p\_mha} = \mathrm{MHA}(Q, K, V) = [\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_{H-t}]\, W_O$ (12)

where H is the original number of attention heads, t is the number of attention heads cut off, $W_Q^h, W_K^h, W_V^h \in \mathbb{R}^{D \times D_H}$ represent the parameter matrices of the h-th attention head, $D_H = D / H$ is the dimension of each attention head, ATT is the attention computation, $W_O \in \mathbb{R}^{(H-t) D_H \times D}$ is correspondingly cut to match the attention feature dimension $(H-t) D_H$ after splitting, and $F_{p\_mha} \in \mathbb{R}^{num \times D}$ is the output feature of the finally split attention module;

splitting the feedforward layer FFN:

for a given input feature $X \in \mathbb{R}^{num \times D}$, the split feedforward layer FFN is computed as follows:

$F_{p\_ffn} = \mathrm{Nonlinear}(X W_1')\, W_2'$

where $W_1' \in \mathbb{R}^{D \times (4D - s)}$ and $W_2' \in \mathbb{R}^{(4D - s) \times D}$ are the parameter matrices of the split feedforward layer FFN module, s is the set splitting dimension, Nonlinear is the activation function, and $F_{p\_ffn} \in \mathbb{R}^{num \times D}$ is the output feature.
8. The method for lightweight adaptive network learning for multi-modal multi-task learning according to claim 7, wherein in the step (5), a task adapter is constructed, specifically as follows:
the input and output features of each task adapter are respectively denoted as the input feature $F_{adp\_in}$ and the output feature $F_{adp\_out}$, and each task adapter is computed as follows:

$F_{adp\_mid} = \mathrm{Nonlinear}(F_{adp\_in} W_{down})$ (15)

$F_{adp\_out} = F_{adp\_in} + F_{adp\_mid} W_{up}$ (16)

where $W_{down} \in \mathbb{R}^{D \times r}$, $W_{up} \in \mathbb{R}^{r \times D}$, Nonlinear is a nonlinear activation function, D is the input-output feature dimension, and r is the adapter size parameter; each task adapter comprises a residual connection, which connects the input feature $F_{adp\_in}$ to the output feature $F_{adp\_out}$ through a bypass.
9. The method for lightweight adaptive network learning for multi-modal multi-task learning according to claim 8, wherein in the step (6), the pre-training model is adapted as follows:
adapting said multi-headed attention module MHA:
inputting features for a given downstream taskThe adapted multi-head attention module calculates the following modes:
wherein ,represents the h adaptive attention head parameter matrix, D represents the input feature dimension, D H Keeping the dimension of each attention head consistent with the original attention head; ATT is the attention calculation mode, ah is the number of settable adaptive notes and force heads, +.>The attention head output characteristics of the task adapter are spliced with the original attention head output to obtain the adapted attention characteristics; in order to match the adapted attention feature dimension +.>Is additionally introduced and is associated with the split +.>Splicing, jointly processing the adapted attention features and finally outputting the features +.>The dimension of the multi-head attention module MHA is consistent with the input characteristics, and the adapted multi-head attention module MHA only has W adpQ 、W adpK 、W adpV and />The training is carried out, and the rest parameters are kept unchanged in the model training;
adapting the feed-forward layer FFN:

for a given input feature X ∈ R^{m×D}, the adapted feed-forward layer FFN is computed as:

F_adp_ffn = Nonlinear(X [W_1, W_adp1]) [W_2; W_adp2]

where W_adp1 ∈ R^{D×af} and W_adp2 ∈ R^{af×D}, af is the settable adaptive size of the feed-forward layer; these matrices are concatenated with the split pre-training matrices W_1 and W_2, so that the adapted feed-forward layer FFN both retains the general pre-training knowledge and gains learning capability for the downstream tasks; F_adp_ffn ∈ R^{m×D} is the final output feature, whose dimension is kept consistent with the input; likewise, only W_adp1 and W_adp2 are trainable, and the remaining parameters remain unchanged during model training.
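And a corresponding sketch of the adapted feed-forward layer: the trainable matrices W_adp1 and W_adp2 are concatenated with the frozen split matrices W_1 and W_2 before the standard FFN computation (GELU stands in for Nonlinear; names are assumptions):

```python
import torch
import torch.nn.functional as F

def adapted_ffn(x, w1, w2, w_adp1, w_adp2):
    """w1: (D, D_ff - s) frozen, w_adp1: (D, af) trainable;
    w2: (D_ff - s, D) frozen, w_adp2: (af, D) trainable."""
    w1_full = torch.cat([w1, w_adp1], dim=1)   # widen the hidden layer by af
    w2_full = torch.cat([w2, w_adp2], dim=0)
    return F.gelu(x @ w1_full) @ w2_full       # F_adp_ffn, shape (m, D)

D, hidden, af = 768, 2048, 256
out = adapted_ffn(torch.randn(16, D),
                  torch.randn(D, hidden), torch.randn(hidden, D),
                  torch.randn(D, af), torch.randn(af, D))
print(out.shape)  # torch.Size([16, 768])
```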
10. The method for lightweight adaptive network learning for multi-modal multi-task learning according to claim 9, wherein in step (7) the progressively guided distillation training is designed and the model is trained as follows:

the teacher model is denoted M_tea, and its architecture is the original deep self-attention network model without splitting or adaptation; the student model is denoted M_stu, and its architecture is the split and adapted adapter model obtained in step (6); in each training iteration, a batch of input data X and targets Y is randomly sampled and fed into the teacher model M_tea(X) to obtain the predicted labels Y_tea and the per-layer outputs Layer_tea, the teacher model loss L_tea is calculated and the gradient ∇L_tea is used to update the teacher; the teacher model feature gradients are then frozen via Y_tea.detach() and Layer_tea.detach(); within the current training iteration, the input data X is synchronously fed into the student model M_stu(X) to obtain the predicted labels Y_stu and the per-layer outputs Layer_stu, and the loss is calculated:

L_stu = λ_1 · L_output + λ_2 · L_layer

where L_output is the output loss between the student prediction Y_stu and the frozen teacher prediction Y_tea, L_layer is the layer loss between the student layer outputs Layer_stu and the frozen teacher layer outputs Layer_tea, L_stu is the final loss of the student model, λ_1 adjusts the proportion of the output loss and λ_2 adjusts the proportion of the layer loss; the gradient ∇L_stu is calculated and the student model is updated.
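Finally, an illustrative sketch of one iteration of the progressively guided distillation: the teacher is updated on the ground-truth targets, its predictions and layer outputs are detached, and the student is trained with the λ1/λ2-weighted output and layer losses. It assumes both models return (logits, list of layer outputs); the KL and MSE loss choices are assumptions:

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, opt_tea, opt_stu, x, y, lam1=1.0, lam2=1.0):
    # Teacher pass: predictions + per-layer outputs, updated on ground truth Y.
    y_tea, layers_tea = teacher(x)
    loss_tea = F.cross_entropy(y_tea, y)
    opt_tea.zero_grad(); loss_tea.backward(); opt_tea.step()

    # Freeze teacher features: Y_tea.detach(), Layer_tea.detach().
    y_tea = y_tea.detach()
    layers_tea = [l.detach() for l in layers_tea]

    # Student pass on the same batch X within the same iteration.
    y_stu, layers_stu = student(x)
    loss_output = F.kl_div(F.log_softmax(y_stu, dim=-1),
                           F.softmax(y_tea, dim=-1), reduction="batchmean")
    loss_layer = sum(F.mse_loss(s, t) for s, t in zip(layers_stu, layers_tea))
    loss_stu = lam1 * loss_output + lam2 * loss_layer  # final student loss
    opt_stu.zero_grad(); loss_stu.backward(); opt_stu.step()
    return loss_tea.item(), loss_stu.item()
```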
CN202310629849.4A 2023-05-31 2023-05-31 Multi-mode multi-task learning oriented lightweight adaptive network learning method Pending CN116644316A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310629849.4A CN116644316A (en) 2023-05-31 2023-05-31 Multi-mode multi-task learning oriented lightweight adaptive network learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310629849.4A CN116644316A (en) 2023-05-31 2023-05-31 Multi-mode multi-task learning oriented lightweight adaptive network learning method

Publications (1)

Publication Number Publication Date
CN116644316A true CN116644316A (en) 2023-08-25

Family

ID=87622651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310629849.4A Pending CN116644316A (en) 2023-05-31 2023-05-31 Multi-mode multi-task learning oriented lightweight adaptive network learning method

Country Status (1)

Country Link
CN (1) CN116644316A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117194985A (en) * 2023-09-18 2023-12-08 镁佳(北京)科技有限公司 Multi-mode multi-task training system and multi-mode multi-task training method
CN117194985B (en) * 2023-09-18 2024-05-10 镁佳(北京)科技有限公司 Multi-mode multi-task training system and multi-mode multi-task training method
CN117273068A (en) * 2023-09-28 2023-12-22 东南大学 Model initialization method based on linearly expandable learning genes
CN117273068B (en) * 2023-09-28 2024-04-16 东南大学 Model initialization method based on linearly expandable learning genes
CN117521759A (en) * 2024-01-04 2024-02-06 支付宝(杭州)信息技术有限公司 Training method and device for large model
CN117521759B (en) * 2024-01-04 2024-04-05 支付宝(杭州)信息技术有限公司 Training method and device for large model
CN117555230A (en) * 2024-01-11 2024-02-13 深圳市东莱尔智能科技有限公司 IO module multi-adapter control method and device and multi-channel IO module
CN117555230B (en) * 2024-01-11 2024-03-19 深圳市东莱尔智能科技有限公司 IO module multi-adapter control method and device and multi-channel IO module
CN117574961A (en) * 2024-01-15 2024-02-20 成都信息工程大学 Parameter efficient method and device for injecting adapter into pre-training model
CN117574961B (en) * 2024-01-15 2024-03-22 成都信息工程大学 Parameter efficient method and device for injecting adapter into pre-training model
CN117574982A (en) * 2024-01-16 2024-02-20 之江实验室 Pre-training model fine tuning method and device based on linear transformation
CN117574982B (en) * 2024-01-16 2024-04-26 之江实验室 Pre-training model fine tuning method and device based on linear transformation

Similar Documents

Publication Publication Date Title
CN116644316A (en) Multi-mode multi-task learning oriented lightweight adaptive network learning method
CN112328767B (en) Question-answer matching method based on BERT model and comparative aggregation framework
US20190050734A1 (en) Compression method of deep neural networks
WO2022126797A1 (en) Automatic compression method and platform for multilevel knowledge distillation-based pre-trained language model
CN110188358A (en) The training method and device of Natural Language Processing Models
CN111709493B (en) Object classification method, training device, object classification equipment and storage medium
JP7283835B2 (en) Automatic Compression Method and Platform for Pre-trained Language Models Based on Multilevel Knowledge Distillation
CN112990296A (en) Image-text matching model compression and acceleration method and system based on orthogonal similarity distillation
CN110489567A (en) A kind of node information acquisition method and its device based on across a network Feature Mapping
CN111667016B (en) Incremental information classification method based on prototype
CN111368545A (en) Named entity identification method and device based on multi-task learning
CN112488209A (en) Incremental image classification method based on semi-supervised learning
CN113282721A (en) Visual question-answering method based on network structure search
CN112232086A (en) Semantic recognition method and device, computer equipment and storage medium
CN116821291A (en) Question-answering method and system based on knowledge graph embedding and language model alternate learning
CN115775000A (en) Method and device for realizing automatic question answering
CN110309515A (en) Entity recognition method and device
CN116543289B (en) Image description method based on encoder-decoder and Bi-LSTM attention model
CN116822593A (en) Large-scale pre-training language model compression method based on hardware perception
CN116151335A (en) Pulse neural network light weight method and system suitable for embedded equipment
CN114647752A (en) Lightweight visual question-answering method based on bidirectional separable deep self-attention network
CN114880527A (en) Multi-modal knowledge graph representation method based on multi-prediction task
CN112905599B (en) Distributed deep hash retrieval method based on end-to-end
CN112132059B (en) Pedestrian re-identification method and system based on depth conditional random field
CN115578593A (en) Domain adaptation method using residual attention module

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination