WO2022126797A1 - Automatic compression method and platform for pre-trained language models based on multi-level knowledge distillation - Google Patents
Automatic compression method and platform for pre-trained language models based on multi-level knowledge distillation
- Publication number
- WO2022126797A1 (PCT/CN2020/142577)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- distillation
- network
- model
- vector
- knowledge
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Definitions
- the invention belongs to the field of language model compression, and in particular relates to a pre-trained language model automatic compression method and platform based on multi-level knowledge distillation.
- the present invention generates a general compression architecture for multi-task oriented pre-trained language models based on multi-level knowledge distillation.
- the purpose of the present invention is to provide a pre-trained language model automatic compression method and platform based on multi-level knowledge distillation in view of the deficiencies of the prior art.
- the present invention first constructs a multi-level knowledge distillation, and distills the knowledge structure of the large model at different levels. Furthermore, meta-learning is introduced to generate a general compression architecture for multiple pre-trained language models. Specifically, a meta-network of a structure generator is designed, a knowledge distillation encoding vector is constructed based on a multi-level knowledge distillation method, and the structure generator is used to generate a distillation structure model corresponding to the current input encoding vector. At the same time, a method of Bernoulli distribution sampling is proposed to train the structure generator.
- in each iteration, the self-attention units to be migrated from each encoder are sampled using a Bernoulli distribution to form the corresponding encoding vector.
- by changing the encoding vector fed into the structure generator and the mini-batch of training data, and jointly training the structure generator and the corresponding distillation structure, a structure generator that can generate weights for different distillation structures is learned.
- an evolutionary algorithm is used to search for the optimal compression structure, thereby obtaining the optimal general compression structure of the task-independent pre-trained language model.
- An automatic compression method for a pre-trained language model based on multi-level knowledge distillation comprising the following steps:
- Step 1 Build multi-level knowledge distillation, and distill the knowledge structure of the large model at three different levels: self-attention unit, hidden layer state, and embedding layer;
- Step 2 Train a knowledge distillation network for meta-learning to generate a general compression architecture for multiple pre-trained language models
- Step 3 Search for an optimal compression structure based on an evolutionary algorithm.
- a meta-network of a structure generator is designed in step 2, a knowledge distillation coding vector is constructed based on the multi-level knowledge distillation in step 1, and a distillation structure model corresponding to the currently input coding vector is generated by using the structure generator;
- at the same time, a Bernoulli distribution sampling method is used to train the structure generator.
- in each iteration, the self-attention units migrated by each encoder are sampled via the Bernoulli distribution to form the corresponding encoding vector; by changing the encoding vector fed into the structure generator and the mini-batches of training data, the structure generator and the corresponding distillation structure are jointly trained to obtain a structure generator that generates weights for different distillation structures.
- in step 3, on the basis of the trained meta-learning network, an optimal compression architecture is searched through an evolutionary algorithm to obtain an optimal general compression architecture of the task-independent pre-trained language model.
- in step 1, self-attention distribution knowledge, hidden state knowledge and embedding layer knowledge are encoded into a distillation network, and knowledge distillation is used to compress the large model into the small model.
- the first step includes self-attention knowledge distillation, hidden layer state knowledge distillation and embedding layer knowledge distillation.
- the meta-network of the structure generator described in step 2 is composed of two fully connected layers; it takes a self-attention knowledge distillation encoding vector as input and outputs the weight matrix of the structure generator;
- the training process of the structure generator is as follows:
- Step 1 Construct knowledge distillation encoding vector, including layer sampling vector, multi-head pruning vector, hidden layer dimension reduction vector and embedding layer dimension reduction vector;
- Step 2 Build a distillation network architecture based on the structure generator: use the structure generator to build a distillation structure model corresponding to the currently input encoding vector, and adjust the shape of the weight matrix output by the structure generator so that it is consistent with the number of self-attention units in the input and output of the distillation structure corresponding to the self-attention encoding vector;
- Step 3 Joint training of the structure generator and distillation structure model:
- the structure generator is trained by the method of Bernoulli distribution sampling; by changing the self-attention encoding vector fed into the structure generator and a small batch of training data, the structure generator and the corresponding distillation structure are jointly trained, so that a structure generator capable of generating weights for different distillation structures is learned.
- step 3 the network coding vector is input into the trained structure generator to generate the weight of the corresponding distillation network, and the distillation network is evaluated on the verification set to obtain the accuracy of the corresponding distillation network; the details are as follows:
- Step 1 Define the knowledge distillation encoding vector as the gene G of the distillation structure model, and randomly select a series of genes that satisfy the constraint C as the initial population;
- Step 2 Evaluate the inference accuracy of the distillation structure model corresponding to each gene G_i in the existing population on the validation set, and select the top k genes with the highest accuracy;
- Step 3 Use the top k genes with the highest accuracy selected in step 2 to perform gene recombination and gene mutation to generate new genes, and add the new genes to the existing population;
- Step 4 Repeat steps 2 and 3 for N rounds of iterations, select the top k genes with the highest accuracy in the existing population and generate new genes, until the gene that satisfies the constraint C and has the highest accuracy is obtained.
- a pre-trained language model automatic compression platform based on multi-level knowledge distillation including the following components:
- Data loading component: used to obtain the BERT model containing specific natural language processing downstream tasks to be compressed and the training samples of the multi-task-oriented pre-trained language model uploaded by the logged-in user;
- the training samples are labeled text samples that satisfy the supervised learning task;
- Automatic compression component: used to automatically compress multi-task-oriented pre-trained language models, including a knowledge distillation vector encoding module, a distillation network generation module, a joint training module for the structure generator and the distillation network, a distillation network search module, and a task-specific fine-tuning module;
- the knowledge distillation vector encoding module includes the layer sampling vector of the Transformer, the multi-head pruning vector of the self-attention units, the hidden layer dimensionality reduction vector, and the embedding layer dimensionality reduction vector.
- during forward propagation, the distillation network encoding vector is input into the structure generator to generate the distillation network of the corresponding structure and the weight matrix of the structure generator;
- the distillation network generation module constructs a distillation network corresponding to the currently input encoding vector based on the structure generator, and adjusts the shape of the weight matrix output by the structure generator so that it is consistent with the number of self-attention units in the input and output of the distillation structure corresponding to the self-attention encoding vector;
- the joint training module for the structure generator and the distillation network trains the structure generator end-to-end: the multi-level knowledge distillation encoding vector and a mini-batch of training data are input into the distillation network, and the weights of the distillation structure and the weight matrix of the structure generator are updated;
- the distillation network search module uses an evolutionary algorithm to search for a distillation network with the highest accuracy that satisfies a specific constraint condition
- the task-specific fine-tuning module constructs a downstream task network on the pre-trained model distillation network generated by the automatic compression component, fine-tunes the downstream task scenario using the feature layer and output layer of the distillation network, and outputs the final fine-tuned student model, that is, the compressed pre-trained language model containing the downstream task required by the logged-in user; the compressed model is output to a designated container for the logged-in user to download, and comparison information on the model size before and after compression is presented on the platform's compressed-model output page;
- Inference component: the logged-in user obtains the pre-trained compressed model from the platform and uses the compressed model output by the automatic compression component to perform inference on new data of the natural language processing downstream task uploaded by the logged-in user on the dataset of the actual scene; comparison information on the inference speed before and after compression is presented on the platform's compressed-model inference page.
- the logged-in user can directly download the trained pre-trained language model and, according to the user's need for a specific natural language processing downstream task, build a downstream task network on the basis of the compressed pre-trained model architecture generated by the platform and fine-tune it, then deploy it on terminal devices, or perform inference on NLP downstream tasks directly on the platform.
- the beneficial effects of the present invention are as follows: the pre-trained language model automatic compression method and platform based on multi-level knowledge distillation of the present invention first study meta-learning-based knowledge distillation to generate a general compression architecture for multiple pre-trained language models; secondly, on the basis of the trained meta-learning network, an evolutionary algorithm is used to search for the optimal compression structure, thereby obtaining the optimal general compression architecture of the task-independent pre-trained language model.
- the pre-trained language model automatic compression platform based on multi-level knowledge distillation of the present invention compresses and generates a general architecture of multi-task-oriented pre-trained language models, makes full use of the compressed model architecture to improve the compression efficiency for downstream tasks, and enables large-scale natural language processing models to be deployed on end-side devices with small memory and limited resources, which promotes the adoption of general-purpose deep language models in industry.
- Figure 1 is the overall architecture diagram of the pre-trained language model automatic compression platform based on multi-level knowledge distillation
- Fig. 2 is a schematic diagram of multi-level knowledge distillation
- Fig. 3 is the training flow chart of structure generator
- FIG. 4 is a schematic diagram of the dimensionality reduction structure of the hidden layer of the encoder module and the input embedding layer;
- Figure 5 is an architecture diagram of building a distillation network based on a structure generator
- Figure 6 is a flow chart of the joint training process of the structure generator and the distillation network
- Figure 7 is a schematic diagram of a distillation network search architecture based on an evolutionary algorithm.
- the present invention includes knowledge distillation based on meta-learning and automatic search of the distillation network based on an evolutionary algorithm; it automatically compresses large-scale multi-task-oriented pre-trained language models to generate task-independent general architectures that satisfy different hard constraints (such as the number of floating-point operations).
- the specific scheme is as follows: the whole process of the pre-trained language model automatic compression method based on multi-level knowledge distillation of the present invention is divided into three stages.
- in the first stage, the knowledge structure of the large model is distilled at three different levels: the self-attention unit, the hidden layer state, and the embedding layer; the second stage is to train the knowledge distillation network of meta-learning to generate a general compression architecture for multiple pre-trained language models.
- a meta-network of structure generator is designed, a knowledge distillation encoding vector is constructed based on the multi-level knowledge distillation method proposed in the first stage, and a distillation structure model corresponding to the currently input encoding vector is generated by using the structure generator.
- a method of Bernoulli distribution sampling is proposed to train the structure generator.
- the self-attention units migrated by each encoder are sampled using Bernoulli distribution to form the corresponding encoding vector.
- by changing the encoding vector fed into the structure generator and the mini-batch of training data and jointly training the structure generator and the corresponding distillation structure, a structure generator that can generate weights for different distillation structures can be learned; the third stage searches for the optimal compression structure based on an evolutionary algorithm.
- the evolutionary algorithm searches for the optimal compression structure, thereby obtaining the optimal general compression structure of the task-independent pre-trained language model.
- the present invention encodes self-attention distribution knowledge, hidden state knowledge and embedding layer knowledge into a distillation network, as shown in Figure 2.
- Knowledge distillation is used to compress the large model to the small model, and the self-attention knowledge of the large model is transferred to the small model to the greatest extent.
- Self-attention knowledge distillation: Transformer layer distillation includes self-attention-based knowledge distillation and hidden-layer-state-based knowledge distillation, as shown in Figure 2.
- Self-attention-based distillation is able to focus on rich linguistic knowledge. This vast amount of linguistic knowledge includes the semantics and associated information necessary for natural language understanding. Therefore, self-attention-based knowledge distillation is proposed to encourage the transfer of rich knowledge from the teacher model to the student model.
- the attention function is calculated from three matrices of queries, keys and values, which are denoted as matrix Q, matrix K and matrix V, respectively.
- the self-attention function is defined as follows:
- A represents the self-attention matrix, which is calculated by the dot product operation of matrix Q and matrix K.
- the output of the final self-attention function Attention(Q, K, V) is a weighted sum of the matrix V, where the weights are computed by performing the softmax operation on each column of the self-attention matrix A.
- The self-attention matrix A can attend to a large amount of linguistic knowledge, so self-attention distillation plays an important role in knowledge distillation. Multi-head attention is obtained by concatenating the attention heads from different feature subspaces as follows:
- MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W
- h is the number of attention heads
- head i represents the ith attention head, which is calculated by the Attention() function of different feature subspaces
- Concat represents concatenation
- W is a linear transformation matrix.
- the student model learns to mimic the multi-head attention knowledge of the teacher network, where the loss function is defined as follows:
- L_attn = (1/h) Σ_{i=1}^{h} MSE(A_i^S, A_i^T)
- A_i ∈ R^{l×l} represents the self-attention matrix corresponding to the i-th attention head of the teacher or student model
- R denotes the set of real numbers
- l is the size of the current layer's input
- L is the length of the input text
- MSE() is the mean squared error loss function. It is worth noting that the attention matrix A_i is used here rather than the softmax output softmax(A_i).
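The following is a minimal, illustrative sketch of the self-attention distillation loss described above, written in PyTorch. The scaled dot-product form of A, the helper names, and the averaging performed by the MSE reduction follow the description but are assumptions for illustration, not the exact implementation of this application.

```python
import torch
import torch.nn.functional as F

def attention_matrices(q, k, num_heads):
    """Split Q and K into heads and return the pre-softmax attention
    matrices A_i for every head (scaled dot-product form assumed).
    q, k: (batch, seq_len, hidden) -> (batch, num_heads, seq_len, seq_len)."""
    b, l, d = q.shape
    d_head = d // num_heads
    q = q.view(b, l, num_heads, d_head).transpose(1, 2)
    k = k.view(b, l, num_heads, d_head).transpose(1, 2)
    return q @ k.transpose(-2, -1) / d_head ** 0.5

def attention_distill_loss(student_attn, teacher_attn):
    """MSE between student and teacher attention matrices A_i
    (the matrices themselves, not softmax(A_i)), averaged over heads."""
    return F.mse_loss(student_attn, teacher_attn)

# Toy usage: random tensors stand in for the student/teacher Q, K projections.
q_s, k_s = torch.randn(2, 16, 384), torch.randn(2, 16, 384)   # student (d' = 384)
q_t, k_t = torch.randn(2, 16, 768), torch.randn(2, 16, 768)   # teacher (d = 768)
loss = attention_distill_loss(attention_matrices(q_s, k_s, 12),
                              attention_matrices(q_t, k_t, 12))
```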
- Hidden layer state knowledge distillation: In addition to self-attention-based knowledge distillation, this application also performs knowledge distillation on the hidden layer state, that is, the knowledge output by the Transformer layer is transferred.
- the loss function of the hidden layer state knowledge distillation is as follows:
- L_hidn = MSE(H^S W_h, H^T)
- the matrices H^S ∈ R^{L×d'} and H^T ∈ R^{L×d} refer to the hidden states of the student and teacher networks, respectively
- the scalar values d and d' represent the sizes of the hidden layers of the teacher and student models, respectively
- the matrix W_h ∈ R^{d'×d} is a learnable linear transformation matrix that transforms the hidden layer state of the student network into the same feature space as the hidden layer state of the teacher network.
- Embedding layer knowledge distillation: the loss takes the analogous form L_embd = MSE(E^S W_e, E^T), where the matrices E^S and E^T refer to the embedding layers of the student and teacher networks, respectively.
- the shape of the embedding layer matrix is the same as that of the hidden layer state matrix.
- the matrix W_e is a linear transformation matrix.
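As a companion to the formulas above, here is a small PyTorch sketch of the hidden-state and embedding-layer distillation losses; the module and variable names (W_h, W_e, d_student, d_teacher) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiddenEmbeddingDistill(nn.Module):
    """L_hidn = MSE(H_S W_h, H_T) and L_embd = MSE(E_S W_e, E_T):
    the student states are mapped into the teacher's feature space
    by learnable linear maps before the MSE is computed."""
    def __init__(self, d_student, d_teacher):
        super().__init__()
        self.W_h = nn.Linear(d_student, d_teacher, bias=False)  # hidden-state projection
        self.W_e = nn.Linear(d_student, d_teacher, bias=False)  # embedding projection

    def forward(self, H_S, H_T, E_S, E_T):
        return F.mse_loss(self.W_h(H_S), H_T) + F.mse_loss(self.W_e(E_S), E_T)

# Toy usage: L = 16 tokens, student dimension d' = 384, teacher dimension d = 768.
distill = HiddenEmbeddingDistill(384, 768)
loss = distill(torch.randn(2, 16, 384), torch.randn(2, 16, 768),
               torch.randn(2, 16, 384), torch.randn(2, 16, 768))
```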
- Design a structure generator, which is a meta-network consisting of two fully connected layers.
- Figure 3 shows the training process of the structure generator.
- Input a self-attention knowledge distillation encoding vector and output the weight matrix of the structure generator.
- the training process of the structure generator is as follows:
- Step 1 Construct knowledge distillation encoding vector.
- the distillation network encoding vector is input into the structure generator, and the weight matrix of the structure generator is output.
- the distillation network encoding vector consists of the Transformer layer sampling vector, the multi-head attention pruning vector, the hidden layer dimension reduction vector and the embedding layer dimension reduction vector.
- the specific scheme of the distillation network encoding vector is as follows:
- a. Layer sampling vector: In the layer sampling stage, the Transformer layers of BERT are first sampled via a Bernoulli distribution to generate a layer sampling vector. Specifically, assuming that the i-th Transformer module is currently being migrated, X_i is an independent Bernoulli random variable, and the probability of X_i being 1 (or 0) is p (or 1 - p).
- the random variable X is used to sequentially perform Bernoulli sampling on the 12 Transformer units of BERT to generate a vector consisting of 12 elements of 0 or 1.
- when the probability value of the random variable X_i being 1 is greater than or equal to 0.5, the corresponding element of the layer sampling vector is 1, which means that the current Transformer module performs transfer learning;
- when the probability value of the random variable X_i being 1 is less than 0.5, the corresponding element of the layer sampling vector is 0, which means that the current Transformer module does not perform transfer learning.
- BERT base has a total of 12 Transformer modules.
- layer sampling is performed on all Transformer layers of BERT, and a constraint is imposed so that the number of elements equal to 1 in the final layer sampling vector is not less than 6; otherwise, layer sampling is performed again.
- the Transformer modules transferred to the student model are initialized with the weights of the teacher-model Transformers whose corresponding elements in the layer sampling vector are 1.
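A minimal sketch of the layer sampling step described above, assuming 12 Transformer layers (BERT-base), a Bernoulli probability p, and resampling until at least 6 layers are selected; the function name and the resampling loop are illustrative assumptions.

```python
import torch

def sample_layer_vector(num_layers=12, p=0.5, min_selected=6):
    """Bernoulli-sample a 0/1 layer sampling vector; resample until at
    least `min_selected` Transformer modules are marked for migration."""
    while True:
        vec = torch.bernoulli(torch.full((num_layers,), p)).int().tolist()
        if sum(vec) >= min_selected:
            return vec

# Example output, e.g. [1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
print(sample_layer_vector())
```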
- Multi-head pruning vector: Each Transformer module consists of multi-head attention units. Inspired by channel pruning, this application proposes multi-head pruning based on the multi-head attention units. Each time a distillation network structure is generated, a multi-head pruning encoding vector is generated, representing the number of attention heads used for self-attention knowledge transfer in all Transformer layers currently being transferred. The formula is defined as follows:
- Head_i = head_max × head_scale
- Head_i represents the number of attention heads contained in each Transformer layer when the i-th distillation network structure is generated.
- head_scale represents a decay factor for the number of self-attention heads contained in each Transformer module.
- SRS_Sample denotes simple random sampling; since each Transformer module of BERT-base has 12 self-attention units, head_max is 12.
- the random number M_i is obtained by simple random sampling from the list [0, 1, 2, ..., 30]; from it, the decay factor head_scale of the current distillation structure is obtained and multiplied by the standard head_max to give the number of multi-head attention units of the current distillation network structure.
- Hidden layer dimensionality reduction vector: The knowledge distillation of the hidden layer state performs knowledge distillation on the final output of each Transformer layer, that is, it reduces the dimension of the BERT hidden layer.
- the specific process is as follows.
- the dimensions of the hidden layers of all Transformer layers of the distillation network generated each time are the same.
- the hidden layer dimension hidn_i when generating the i-th distillation network is defined as follows:
- hidn_i = hidn_base × hidn_scale
- hidn_base is a hyperparameter; since the hidden layer dimension of BERT-base is 768, hidn_base is initialized to a common divisor of 768, here 128. hidn_scale represents the dimensionality reduction factor of the hidden layer dimension, and is an element obtained by simple random sampling from the list [1, 2, 3, 4, 5, 6].
- Figure 4 shows the dimensionality reduction structure of the hidden layer of the encoder module and the embedding layer of the input. As can be seen from the figure, since both the hidden layer part and the embedding layer part have a residual connection, the output dimension of the embedding layer is equal to the output dimension of the hidden layer.
- the dimension of each embedding layer should be consistent with the dimension of the current hidden layer.
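The following sketch assembles the remaining parts of the distillation encoding vector described above. head_max = 12 and hidn_base = 128 come from the text; the exact mapping from the sampled M_i in [0, 30] to head_scale is not spelled out above, so the division by 30 used here is an assumption for illustration.

```python
import random

HEAD_MAX = 12     # self-attention heads per Transformer layer in BERT-base
HIDN_BASE = 128   # initialized hidn_base, a common divisor of 768

def sample_encoding_parts():
    # Multi-head pruning: M_i is drawn by simple random sampling from [0..30];
    # deriving head_scale as M_i / 30 is an illustrative assumption.
    m_i = random.randint(0, 30)
    head_scale = max(m_i, 1) / 30.0
    num_heads = max(1, round(HEAD_MAX * head_scale))

    # Hidden-layer dimension: hidn_i = hidn_base * hidn_scale, with hidn_scale
    # simple-randomly sampled from [1..6]; the embedding dimension follows the
    # hidden dimension because of the residual connections.
    hidn_scale = random.choice([1, 2, 3, 4, 5, 6])
    hidn_dim = HIDN_BASE * hidn_scale

    return {"num_heads": num_heads, "hidden_dim": hidn_dim, "embedding_dim": hidn_dim}

print(sample_encoding_parts())
```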
- Step 2 Build the distillation network architecture based on the structure generator.
- the structure generator is used to build the distillation structure model corresponding to the current input encoding vector, and the shape of the weight matrix output by the structure generator needs to be adjusted so that it is consistent with the number of self-attention units in the input and output of the distillation structure corresponding to the self-attention encoding vector.
- Figure 5 shows the network architecture of the distillation structure constructed by the structure generator.
- Step 3 Jointly train the structure generator and the distillation structure model. Feed the self-attention knowledge distillation encoding vector and a mini-batch of training data into the distillation structure model. It is worth noting that in the process of backpropagation, the weights of the distillation structure and the weight matrix of the structure generator are updated together. The weights of the structure generator can be calculated using the chain rule, thus, the structure generator can be trained end-to-end.
- a method of Bernoulli distribution sampling is proposed to train the structure generator.
- in each iteration, the self-attention units migrated by the Transformer of each layer are sampled via the Bernoulli distribution to form the corresponding encoding vector.
- by changing the encoding vector fed into the structure generator and the mini-batch of training data, a structure generator that can generate weights for different distillation structures can be learned.
- Figure 6 shows the training process of the structure generator.
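To make the chain-rule argument above concrete, here is a self-contained toy loop in PyTorch: a two-layer structure generator maps a sampled encoding vector to the weight matrix of a (here: single linear layer) distillation network, and back-propagating the distillation loss updates the generator end-to-end. In the method described above the distillation structure has its own weights as well; this toy collapses everything into the generated matrix, and all sizes and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructureGenerator(nn.Module):
    """Toy meta-network: two fully connected layers mapping a distillation
    encoding vector to a weight matrix for the distillation network."""
    def __init__(self, code_dim=16, hidden=64, rows=32, cols=32):
        super().__init__()
        self.shape = (rows, cols)
        self.fc1 = nn.Linear(code_dim, hidden)
        self.fc2 = nn.Linear(hidden, rows * cols)

    def forward(self, code):
        return self.fc2(torch.relu(self.fc1(code))).view(self.shape)

generator = StructureGenerator()
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-3)

for step in range(100):
    code = torch.bernoulli(torch.full((16,), 0.5))   # sampled encoding vector
    x = torch.randn(8, 32)                           # mini-batch of training data
    teacher_out = torch.randn(8, 32)                 # stand-in for teacher outputs
    W = generator(code)                              # generated distillation-net weights
    student_out = F.linear(x, W)                     # distillation network forward pass
    loss = F.mse_loss(student_out, teacher_out)
    optimizer.zero_grad()
    loss.backward()                                  # gradients reach the generator via W
    optimizer.step()
```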
- each distillation network is encoded and generated by a network encoding vector comprising three distillation modules: embedding layer distillation, hidden layer distillation and self-attention knowledge distillation, so the distillation network encoding vector is defined as the gene of the distillation network.
- a series of distillation network encoding vectors are first selected as the genes of the distillation network, and the accuracy of the corresponding distillation network is obtained by evaluating on the validation set. Then, the top k genes with the highest precision are selected, and gene recombination and mutation are used to generate new genes.
- Gene mutation refers to mutation by randomly changing the value of some elements in the gene.
- Step 1 Each distillation structure model is generated from the knowledge distillation encoding vector based on Transformer layer sampling, so the knowledge distillation encoding vector is defined as the gene G of the distillation structure model, and a series of genes satisfying the constraint C are randomly selected as the initial population.
- Step 2 Evaluate the inference accuracy of the distillation structure model corresponding to each gene G_i in the existing population on the validation set, and select the top k genes with the highest accuracy.
- Step 3 Use the top k genes with the highest accuracy selected in step 2 to perform gene recombination and gene mutation to generate new genes, and add the new genes to the existing population.
- Gene mutation refers to mutation by randomly changing the value of some elements in the gene;
- Gene recombination refers to randomly recombining the genes of two parents to produce offspring; the constraint C is easily enforced by eliminating unqualified genes.
- Step 4 Repeat steps 2 and 3 for N rounds of iterations, select the top k genes with the highest accuracy in the existing population and generate new genes, until the gene that satisfies the constraint C and has the highest accuracy is obtained.
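A compact sketch of steps 1-4 above. It assumes a user-supplied evaluate() function (validation accuracy of the distillation network decoded from a gene), a satisfies_constraint() predicate for constraint C, and binary genes for brevity (the actual encoding also carries head counts and dimension factors); the population size, k, N and the mutation rate are illustrative.

```python
import random

def evolutionary_search(sample_gene, evaluate, satisfies_constraint,
                        population_size=50, top_k=10, rounds=20, mutate_prob=0.1):
    # Step 1: initial population of genes (knowledge distillation encoding vectors)
    # that satisfy the constraint C.
    population = []
    while len(population) < population_size:
        gene = sample_gene()
        if satisfies_constraint(gene):
            population.append(gene)

    for _ in range(rounds):                              # Step 4: N rounds of iteration
        # Step 2: rank genes by inference accuracy on the validation set.
        parents = sorted(population, key=evaluate, reverse=True)[:top_k]

        # Step 3: recombination and mutation; unqualified genes are discarded,
        # which enforces the constraint C.
        children = []
        while len(children) < population_size - top_k:
            a, b = random.sample(parents, 2)
            child = [x if random.random() < 0.5 else y for x, y in zip(a, b)]
            child = [1 - x if random.random() < mutate_prob else x for x in child]
            if satisfies_constraint(child):
                children.append(child)
        population = parents + children

    return max(population, key=evaluate)
```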
- the pre-trained language model automatic compression platform based on multi-level knowledge distillation of the present invention includes the following components:
- Data loading component: used to obtain the BERT model containing specific natural language processing downstream tasks to be compressed and the training samples of the multi-task-oriented pre-trained language model uploaded by the logged-in user;
- the training samples are labeled text samples that satisfy the supervised learning task.
- Automatic compression component: used to automatically compress multi-task-oriented pre-trained language models, including a knowledge distillation vector encoding module, a distillation network generation module, a joint training module for the structure generator and the distillation network, a distillation network search module, and a task-specific fine-tuning module.
- the knowledge distillation vector encoding module includes the layer sampling vector of the Transformer, the multi-head pruning vector of the self-attention units, the hidden layer dimension reduction vector, and the embedding layer dimension reduction vector.
- during forward propagation, the distillation network encoding vector is input into the structure generator, and the distillation network of the corresponding structure and the weight matrix of the structure generator are generated.
- the distillation network generation module builds, based on the structure generator, the distillation network corresponding to the current input encoding vector, and adjusts the shape of the weight matrix output by the structure generator so that it is consistent with the number of self-attention units in the input and output of the distillation structure corresponding to the self-attention encoding vector.
- the joint training module for the structure generator and the distillation network trains the structure generator end-to-end: specifically, the multi-level knowledge distillation encoding vector and a mini-batch of training data are fed into the distillation network, and the weights of the distillation structure and the weight matrix of the structure generator are updated.
- the distillation network search module uses an evolutionary algorithm to search for the distillation network with the highest accuracy that satisfies the specific constraints.
- each distillation network is encoded and generated by the network coding vector including the three distillation modules of embedding layer knowledge distillation, hidden layer state knowledge distillation and self-attention knowledge distillation.
- the distillation network encoding vector is defined as the genes of the distillation network.
- a series of distillation network encoding vectors are first selected as the genes of the distillation network, and the accuracy of the corresponding distillation network is obtained by evaluating on the validation set. Then, the top k genes with the highest precision are selected, and gene recombination and mutation are used to generate new genes. By further repeating the process of selecting the top k optimal genes and the process of generating new genes, the genes that satisfy the constraints and have the highest accuracy are obtained.
- the task-specific fine-tuning module builds a downstream task network on the pre-trained model distillation network generated by the automatic compression component, fine-tunes the downstream task scenario using the feature layer and output layer of the distillation network, and outputs the final fine-tuned student model, that is, the compressed pre-trained language model containing the downstream task required by the logged-in user.
- the compressed model is output to a specified container, which can be downloaded by the logged-in user, and the comparison information of the size of the model before and after compression is presented on the page of outputting the compressed model of the platform.
- Inference component: the logged-in user obtains the pre-trained compressed model from the platform, and uses the compressed model output by the automatic compression component to perform inference on new data of the natural language processing downstream task uploaded by the logged-in user on the dataset of the actual scene; comparison information on the inference speed before and after compression is presented on the platform's compressed-model inference page.
- the logged-in user can directly download the trained pre-trained language model provided by the platform of the present invention and, according to the user's need for a specific natural language processing downstream task, build a downstream task network on the basis of the compressed pre-trained model architecture generated by the platform and fine-tune it, then deploy it on terminal devices; inference on natural language processing downstream tasks can also be performed directly on the platform.
- the BERT pre-training model generated by the automatic compression component is loaded by the platform, and a model of the text classification task is constructed on the generated pre-training model;
- the compressed model is output to the designated container, which can be downloaded by the logged-in user, and the comparison information of the model size before and after compression is presented on the output compressed model page of the platform.
- the size of the model before compression is 110M, and the size after compression is 53M, a reduction of 51.8%, as shown in Table 1 below.
- the compressed model output by the platform is used to perform inference on the SST-2 test set data uploaded by the logged-in user; the platform's compressed-model inference page shows that the inference speed after compression is 1.95 times faster than before compression, and the inference accuracy improves from 91.5% before compression to 92.3%.
Abstract
Description
Text classification task (SST-2) (67K samples) | Before compression | After compression | Comparison |
Model size | 110M | 53M | 51.8% reduction |
Inference accuracy | 91.5% | 92.3% | +0.8% |
Claims (10)
- An automatic compression method for pre-trained language models based on multi-level knowledge distillation, characterized by comprising the following steps: Step 1: construct multi-level knowledge distillation, and distill the knowledge structure of the large model at three different levels: the self-attention unit, the hidden layer state, and the embedding layer; Step 2: train a meta-learning knowledge distillation network to generate a general compression architecture for multiple pre-trained language models; Step 3: search for the optimal compression structure based on an evolutionary algorithm.
- The automatic compression method for pre-trained language models based on multi-level knowledge distillation according to claim 1, characterized in that: in Step 2, a meta-network of a structure generator is designed, a knowledge distillation encoding vector is constructed based on the multi-level knowledge distillation of Step 1, and the structure generator is used to generate the distillation structure model corresponding to the currently input encoding vector; meanwhile, a Bernoulli distribution sampling method is used to train the structure generator: in each iteration, the self-attention units migrated by each encoder are sampled via the Bernoulli distribution to form the corresponding encoding vector; by changing the encoding vector fed into the structure generator and the mini-batch of training data, the structure generator and the corresponding distillation structure are jointly trained to obtain a structure generator that generates weights for different distillation structures.
- The automatic compression method for pre-trained language models based on multi-level knowledge distillation according to claim 2, characterized in that: in Step 3, on the basis of the trained meta-learning network, the optimal compression architecture is searched through an evolutionary algorithm to obtain the optimal general compression architecture of the task-independent pre-trained language model.
- The automatic compression method for pre-trained language models based on multi-level knowledge distillation according to claim 1, characterized in that: in Step 1, self-attention distribution knowledge, hidden state knowledge and embedding layer knowledge are encoded into one distillation network, and knowledge distillation is used to compress the large model into the small model.
- The automatic compression method for pre-trained language models based on multi-level knowledge distillation according to claim 4, characterized in that: Step 1 includes self-attention knowledge distillation, hidden layer state knowledge distillation and embedding layer knowledge distillation.
- The automatic compression method for pre-trained language models based on multi-level knowledge distillation according to claim 2, characterized in that: the meta-network of the structure generator in Step 2 consists of two fully connected layers, takes a self-attention knowledge distillation encoding vector as input, and outputs the weight matrix of the structure generator; the training process of the structure generator is as follows: Step 2.1: construct the knowledge distillation encoding vector, including the layer sampling vector, the multi-head pruning vector, the hidden layer dimension reduction vector and the embedding layer dimension reduction vector; Step 2.2: build the distillation network architecture based on the structure generator, use the structure generator to build the distillation structure model corresponding to the currently input encoding vector, and adjust the shape of the weight matrix output by the structure generator so that it is consistent with the number of self-attention units in the input and output of the distillation structure corresponding to the self-attention encoding vector; Step 2.3: jointly train the structure generator and the distillation structure model: train the structure generator by the Bernoulli distribution sampling method, and jointly train the structure generator and the corresponding distillation structure by changing the self-attention encoding vector fed into the structure generator and a mini-batch of training data, so as to learn a structure generator capable of generating weights for different distillation structures.
- The automatic compression method for pre-trained language models based on multi-level knowledge distillation according to claim 6, characterized in that: in Step 3, the network encoding vector is input into the trained structure generator to generate the weights of the corresponding distillation network, and the distillation network is evaluated on the validation set to obtain its accuracy; specifically: under the given constraint, a series of distillation network encoding vectors are first selected as the genes of the distillation networks, and the accuracy of the corresponding distillation networks is obtained by evaluation on the validation set; then, the top k genes with the highest accuracy are selected, and gene recombination and mutation are used to generate new genes; by further repeating the process of selecting the top k optimal genes and the process of generating new genes, the gene that satisfies the constraint and has the highest accuracy is obtained iteratively.
- The automatic compression method for pre-trained language models based on multi-level knowledge distillation according to claim 7, characterized in that the specific steps of the evolutionary algorithm are as follows: Step 3.1: define the knowledge distillation encoding vector as the gene G of the distillation structure model, and randomly select a series of genes satisfying the constraint C as the initial population; Step 3.2: evaluate the inference accuracy of the distillation structure model corresponding to each gene G_i in the existing population on the validation set, and select the top k genes with the highest accuracy; Step 3.3: use the top k genes with the highest accuracy selected in Step 3.2 to perform gene recombination and gene mutation to generate new genes, and add the new genes to the existing population; Step 3.4: repeat Steps 3.2 and 3.3 for N iterations, selecting the top k genes with the highest accuracy in the existing population and generating new genes, until a gene that satisfies the constraint C and has the highest accuracy is obtained.
- An automatic compression platform for pre-trained language models based on multi-level knowledge distillation, characterized by comprising the following components: a data loading component: used to obtain the BERT model containing specific natural language processing downstream tasks to be compressed and the training samples of the multi-task-oriented pre-trained language model uploaded by the logged-in user, the training samples being labeled text samples that satisfy the supervised learning task; an automatic compression component: used to automatically compress the multi-task-oriented pre-trained language model, including a knowledge distillation vector encoding module, a distillation network generation module, a joint training module for the structure generator and the distillation network, a distillation network search module, and a task-specific fine-tuning module; the knowledge distillation vector encoding module includes the layer sampling vector of the Transformer, the multi-head pruning vector of the self-attention units, the hidden layer dimension reduction vector and the embedding layer dimension reduction vector; during forward propagation, the distillation network encoding vector is input into the structure generator to generate the distillation network of the corresponding structure and the weight matrix of the structure generator; the distillation network generation module builds, based on the structure generator, the distillation network corresponding to the currently input encoding vector, and adjusts the shape of the weight matrix output by the structure generator so that it is consistent with the number of self-attention units in the input and output of the distillation structure corresponding to the self-attention encoding vector; the joint training module for the structure generator and the distillation network trains the structure generator end-to-end, inputs the multi-level knowledge distillation encoding vector and a mini-batch of training data into the distillation network, and updates the weights of the distillation structure and the weight matrix of the structure generator; the distillation network search module uses an evolutionary algorithm to search for the distillation network with the highest accuracy that satisfies specific constraints; the task-specific fine-tuning module builds a downstream task network on the pre-trained model distillation network generated by the automatic compression component, fine-tunes the downstream task scenario using the feature layer and output layer of the distillation network, and outputs the final fine-tuned student model, that is, the compressed pre-trained language model containing the downstream task required by the logged-in user; the compressed model is output to a designated container for the logged-in user to download, and comparison information on the model size before and after compression is presented on the platform's compressed-model output page; an inference component: the logged-in user obtains the pre-trained compressed model from the platform and uses the compressed model output by the automatic compression component to perform inference, on datasets of actual scenarios, on new data of the natural language processing downstream task uploaded by the logged-in user; comparison information on the inference speed before and after compression is presented on the platform's compressed-model inference page.
- The automatic compression platform for pre-trained language models based on multi-level knowledge distillation according to claim 9, characterized in that: the logged-in user directly downloads the trained pre-trained language model, builds a downstream task network on the basis of the compressed pre-trained model architecture generated by the platform according to the user's need for a specific natural language processing downstream task, fine-tunes it, and finally deploys it on terminal devices, or performs inference on the natural language processing downstream task directly on the platform.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB2214215.2A GB2610319A (en) | 2020-12-17 | 2020-12-31 | Automatic compression method and platform for multilevel knowledge distillation-based pre-trained language model |
JP2022566730A JP7283835B2 (ja) | 2020-12-17 | 2020-12-31 | マルチレベル知識蒸留に基づく事前訓練言語モデルの自動圧縮方法およびプラットフォーム |
US17/555,535 US11501171B2 (en) | 2020-12-17 | 2021-12-20 | Method and platform for pre-trained language model automatic compression based on multilevel knowledge distillation |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011498328.2A CN112241455B (zh) | 2020-12-17 | 2020-12-17 | 基于多层级知识蒸馏预训练语言模型自动压缩方法及平台 |
CN202011498328.2 | 2020-12-17 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/555,535 Continuation US11501171B2 (en) | 2020-12-17 | 2021-12-20 | Method and platform for pre-trained language model automatic compression based on multilevel knowledge distillation |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022126797A1 true WO2022126797A1 (zh) | 2022-06-23 |
Family
ID=74175234
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/142577 WO2022126797A1 (zh) | 2020-12-17 | 2020-12-31 | 基于多层级知识蒸馏预训练语言模型自动压缩方法及平台 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112241455B (zh) |
WO (1) | WO2022126797A1 (zh) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP7381814B2 (ja) * | 2020-12-15 | 2023-11-16 | 之江実験室 | マルチタスク向けの予めトレーニング言語モデルの自動圧縮方法及びプラットフォーム |
CN113099175B (zh) * | 2021-03-29 | 2022-11-04 | 苏州华云视创智能科技有限公司 | 一种基于5g的多模型手持云端检测传输系统及检测方法 |
CN114037653B (zh) * | 2021-09-23 | 2024-08-06 | 上海仪电人工智能创新院有限公司 | 基于二阶段知识蒸馏的工业机器视觉缺陷检测方法和系统 |
CN113849641B (zh) * | 2021-09-26 | 2023-10-24 | 中山大学 | 一种跨领域层次关系的知识蒸馏方法和系统 |
CN113986958B (zh) * | 2021-11-10 | 2024-02-09 | 北京有竹居网络技术有限公司 | 文本信息的转换方法、装置、可读介质和电子设备 |
CN114819148B (zh) * | 2022-05-17 | 2024-07-02 | 西安电子科技大学 | 基于不确定性估计知识蒸馏的语言模型压缩方法 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111062489A (zh) * | 2019-12-11 | 2020-04-24 | 北京知道智慧信息技术有限公司 | 一种基于知识蒸馏的多语言模型压缩方法、装置 |
CN111506702A (zh) * | 2020-03-25 | 2020-08-07 | 北京万里红科技股份有限公司 | 基于知识蒸馏的语言模型训练方法、文本分类方法及装置 |
US20200302295A1 (en) * | 2019-03-22 | 2020-09-24 | Royal Bank Of Canada | System and method for knowledge distillation between neural networks |
CN111767711A (zh) * | 2020-09-02 | 2020-10-13 | 之江实验室 | 基于知识蒸馏的预训练语言模型的压缩方法及平台 |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112016674B (zh) * | 2020-07-29 | 2024-06-18 | 魔门塔(苏州)科技有限公司 | 一种基于知识蒸馏的卷积神经网络的量化方法 |
-
2020
- 2020-12-17 CN CN202011498328.2A patent/CN112241455B/zh active Active
- 2020-12-31 WO PCT/CN2020/142577 patent/WO2022126797A1/zh active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200302295A1 (en) * | 2019-03-22 | 2020-09-24 | Royal Bank Of Canada | System and method for knowledge distillation between neural networks |
CN111062489A (zh) * | 2019-12-11 | 2020-04-24 | 北京知道智慧信息技术有限公司 | 一种基于知识蒸馏的多语言模型压缩方法、装置 |
CN111506702A (zh) * | 2020-03-25 | 2020-08-07 | 北京万里红科技股份有限公司 | 基于知识蒸馏的语言模型训练方法、文本分类方法及装置 |
CN111767711A (zh) * | 2020-09-02 | 2020-10-13 | 之江实验室 | 基于知识蒸馏的预训练语言模型的压缩方法及平台 |
Non-Patent Citations (2)
Title |
---|
NATURAL LANGUAGE PROCESSING GROUP: "Smaller, Faster and Better Models! Microsoft General-Purpose Compression Method for Pre-Trained Language Models, MiniLM, Helps You Get Twice as Much Done with Half as Much Work", MICROSOFT RESEARCH ASIA - NEWS- FEATURES, 12 May 2020 (2020-05-12), CN, XP009538033, Retrieved from the Internet <URL:http://www.msra.cn/zh-cn/news/features/miniIm> * |
WENHUI WANG; FURU WEI; LI DONG; HANGBO BAO; NAN YANG; MING ZHOU: "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 25 February 2020 (2020-02-25), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081607651 * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115774851A (zh) * | 2023-02-10 | 2023-03-10 | 四川大学 | 基于分级知识蒸馏的曲轴内部缺陷检测方法及其检测系统 |
CN115979973A (zh) * | 2023-03-20 | 2023-04-18 | 湖南大学 | 一种基于双通道压缩注意力网络的高光谱中药材鉴别方法 |
CN116821699A (zh) * | 2023-08-31 | 2023-09-29 | 山东海量信息技术研究院 | 一种感知模型训练方法、装置及电子设备和存储介质 |
CN116821699B (zh) * | 2023-08-31 | 2024-01-19 | 山东海量信息技术研究院 | 一种感知模型训练方法、装置及电子设备和存储介质 |
CN117574961A (zh) * | 2024-01-15 | 2024-02-20 | 成都信息工程大学 | 一种将适配器注入预训练模型的参数高效化方法和装置 |
CN117574961B (zh) * | 2024-01-15 | 2024-03-22 | 成都信息工程大学 | 一种将适配器注入预训练模型的参数高效化方法和装置 |
CN117725844A (zh) * | 2024-02-08 | 2024-03-19 | 厦门蝉羽网络科技有限公司 | 基于学习权重向量的大模型微调方法、装置、设备及介质 |
CN117725844B (zh) * | 2024-02-08 | 2024-04-16 | 厦门蝉羽网络科技有限公司 | 基于学习权重向量的大模型微调方法、装置、设备及介质 |
CN118504643A (zh) * | 2024-07-19 | 2024-08-16 | 阿里云计算有限公司 | 神经网络模型的压缩方法、设备、存储介质及程序产品 |
CN118503687A (zh) * | 2024-07-19 | 2024-08-16 | 国网山东省电力公司信息通信公司 | 基于预训练语言模型的电力数据质量特征提取方法及系统 |
CN118520904A (zh) * | 2024-07-25 | 2024-08-20 | 山东浪潮科学研究院有限公司 | 基于大语言模型的识别训练方法、识别方法 |
CN118607511A (zh) * | 2024-08-08 | 2024-09-06 | 之江实验室 | 基于蒸馏提升bert的财经新闻情感分析方法和装置 |
Also Published As
Publication number | Publication date |
---|---|
CN112241455B (zh) | 2021-05-04 |
CN112241455A (zh) | 2021-01-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022126797A1 (zh) | 基于多层级知识蒸馏预训练语言模型自动压缩方法及平台 | |
US11501171B2 (en) | Method and platform for pre-trained language model automatic compression based on multilevel knowledge distillation | |
WO2022141754A1 (zh) | 一种卷积神经网络通用压缩架构的自动剪枝方法及平台 | |
WO2022126683A1 (zh) | 面向多任务的预训练语言模型自动压缩方法及平台 | |
US11941522B2 (en) | Address information feature extraction method based on deep neural network model | |
CN111291836B (zh) | 一种生成学生网络模型的方法 | |
CN107844469B (zh) | 基于词向量查询模型的文本简化方法 | |
CN109885756B (zh) | 基于cnn和rnn的序列化推荐方法 | |
CN111611377A (zh) | 基于知识蒸馏的多层神经网络语言模型训练方法与装置 | |
US20220188658A1 (en) | Method for automatically compressing multitask-oriented pre-trained language model and platform thereof | |
CN112000772B (zh) | 面向智能问答基于语义特征立方体的句子对语义匹配方法 | |
CN111274375A (zh) | 一种基于双向gru网络的多轮对话方法及系统 | |
KR102592585B1 (ko) | 번역 모델 구축 방법 및 장치 | |
CN112347756A (zh) | 一种基于序列化证据抽取的推理阅读理解方法及系统 | |
CN111353313A (zh) | 基于进化神经网络架构搜索的情感分析模型构建方法 | |
CN115424663B (zh) | 一种基于attention的双向表示模型的RNA修饰位点预测方法 | |
CN116822593A (zh) | 一种基于硬件感知的大规模预训练语言模型压缩方法 | |
CN117807235B (zh) | 一种基于模型内部特征蒸馏的文本分类方法 | |
CN111882042A (zh) | 用于液体状态机的神经网络架构自动搜索方法、系统及介质 | |
CN116543289B (zh) | 一种基于编码器-解码器及Bi-LSTM注意力模型的图像描述方法 | |
CN117539977A (zh) | 一种语言模型的训练方法及装置 | |
CN114860939B (zh) | 文本分类模型的训练方法、装置、设备和计算机存储介质 | |
CN115455162A (zh) | 层次胶囊与多视图信息融合的答案句子选择方法与装置 | |
KR20230132186A (ko) | 딥러닝 기반 분자 설계 방법, 이를 수행하는 장치 및 컴퓨터 프로그램 | |
CN113849641A (zh) | 一种跨领域层次关系的知识蒸馏方法和系统 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20965809 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 202214215 Country of ref document: GB Kind code of ref document: A Free format text: PCT FILING DATE = 20201231 |
|
ENP | Entry into the national phase |
Ref document number: 2022566730 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20965809 Country of ref document: EP Kind code of ref document: A1 |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20965809 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 061223) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20965809 Country of ref document: EP Kind code of ref document: A1 |